2025-08-07 About 63100 words 296 minutes

Contents

#1 Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis
#2 FaST: Feature-aware Sampling and Tuning for Personalized Preference Alignment with Limited Data
#3 GeRe: Towards Efficient Anti-Forgetting in Continual Learning of LLM via General Samples Replay
#4 Sculptor: Empowering LLMs with Cognitive Agency via Active Context Management
#5 Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs
#6 Can NLP Tackle Hate Speech in the Real World? Stakeholder-Informed Feedback and Survey on Counterspeech
#7 IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards
#8 P-Aligner: Enabling Pre-Alignment of Language Models via Principled Instruction Synthesis
#9 Lightweight Transformers for Zero-Shot and Fine-Tuned Text-to-SQL Generation Using Spider
#10 TURA: Tool-Augmented Unified Retrieval Agent for AI Search
#11 Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning
#12 Beyond Brainstorming: What Drives High-Quality Scientific Ideas? Lessons from Multi-Agent Collaboration
#13 Unveiling the Landscape of Clinical Depression Assessment: From Behavioral Signatures to Psychiatric Reasoning
#14 StyliTruth: Unlocking Stylized yet Truthful LLM Generation via Disentangled Steering
#15 CALE: Concept-Aligned Embeddings for Both Within-Lemma and Inter-Lemma Sense Differentiation
#16 Automated Generation of Curriculum-Aligned Multiple-Choice Questions for Malaysian Secondary Mathematics Using Generative AI
#17 StepFun-Formalizer: Unlocking the Autoformalization Potential of LLMs through Knowledge-Reasoning Fusion
#18 Evaluating, Synthesizing, and Enhancing for Customer Support Conversation
#19 Dialogue Response Prefetching Based on Semantic Similarity and Prediction Confidence of Language Model
#20 What Do Humans Hear When Interacting? Experiments on Selective Listening for Evaluating ASR of Spoken Dialogue Systems
#21 Why are LLMs' abilities emergent?
#22 Improving Crash Data Quality with Large Language Models: Evidence from Secondary Crash Narratives in Kentucky
#23 AIC CTU@FEVER 8: On-premise fact checking through long context RAG
#24 Chain of Questions: Guiding Multimodal Curiosity in Language Models
#25 GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy
#26 Modelling and Classifying the Components of a Literature Review
#27 Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models
#28 A Few Words Can Distort Graphs: Knowledge Poisoning Attacks on Graph-based Retrieval-Augmented Generation of Large Language Models
#29 ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents
#30 KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs
#31 TalkDep: Clinically Grounded LLM Personas for Conversation-Centric Depression Screening
#32 DP-GPT4MTS: Dual-Prompt Large Language Model for Textual-Numerical Time Series Forecasting
#33 Hierarchical Text Classification Using Black Box Large Language Models
#34 ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments
#35 Reasoning Beyond Labels: Measuring LLM Sentiment in Low-Resource, Culturally Nuanced Contexts
#36 Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models
#37 Characterizing Deep Research: A Benchmark and Formal Definition
#38 Hacking Hallucinations of MLLMs with Causal Sufficiency and Necessity
#39 The State Of TTS: A Case Study with Human Fooling Rates
#40 Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap
#41 Unveiling Over-Memorization in Finetuning LLMs for Reasoning Tasks
#42 GM-PRM: A Generative Multimodal Process Reward Model for Multimodal Mathematical Reasoning
#43 ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"
#44 Efficient Strategy for Improving Large Language Model (LLM) Capabilities
#45 PAIRS: Parametric-Verified Adaptive Information Retrieval and Selection for Efficient RAG
#46 DTPA: Dynamic Token-level Prefix Augmentation for Controllable Text Generation
#47 Large Reasoning Models Are Autonomous Jailbreak Agents
#48 ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents
#49 Step More: Going Beyond Single Backpropagation in Meta Learning Based Model Editing
#50 HarmonyGuard: Toward Safety and Utility in Web Agents via Adaptive Policy Enhancement and Dual-Objective Optimization
#51 Transferring Expert Cognitive Models to Social Robots via Agentic Concept Bottleneck Models
#52 Are Today's LLMs Ready to Explain Well-Being Concepts?
#53 Confidence-Weighted Token Set Cover for Early Hypothesis Pruning in Self-Consistency
#54 Data and AI governance: Promoting equity, ethics, and fairness in large language models
#55 CAP-LLM: Context-Augmented Personalized Large Language Models for News Headline Generation
#56 CoAct-1: Computer-using Agents with Coding as Actions
#57 Sotopia-RL: Reward Design for Social Intelligence
#58 An Entity Linking Agent for Question Answering
#59 Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models
#60 Majority Bit-Aware Watermarking For Large Language Models
#61 AttnTrace: Attention-based Context Traceback for Long-Context LLMs
#62 GanitBench: A bi-lingual benchmark for evaluating mathematical reasoning in Vision Language Models
#63 WINELL: Wikipedia Never-Ending Updating with LLM Agents
#64 Hierarchical Verification of Speculative Beams for Accelerating LLM Inference
#65 Intent Aware Context Retrieval for Multi-Turn Agricultural Question Answering
#66 FeynTune: Large Language Models for High-Energy Theory
#67 How Deep Is Representational Bias in LLMs? The Cases of Caste and Religion
#68 SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience
#69 Query Attribute Modeling: Improving search relevance with Semantic Search and Meta Data Filtering
#70 Position: The Current AI Conference Model is Unsustainable! Diagnosing the Crisis of Centralized AI Conference
#71 Do Recommender Systems Really Leverage Multimodal Content? A Comprehensive Analysis on Multimodal Representations for Recommendation
#72 Analyzing and Mitigating Object Hallucination: A Training Bias Perspective
#73 Causal Reflection with Language Models
#74 OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use
#75 FrEVL: Leveraging Frozen Pretrained Embeddings for Efficient Vision-Language Understanding
#76 Beyond Pixels: Exploring DOM Downsampling for LLM-Based Web Agents
#77 Graph Representation Learning with Massive Unlabeled Data for Rumor Detection
#78 ToxicTAGS: Decoding Toxic Memes with Rich Tag Annotations
#79 Multilingual Source Tracing of Speech Deepfakes: A First Benchmark
#80 COPO: Consistency-Aware Policy Optimization
#81 AgREE: Agentic Reasoning for Knowledge Graph Completion on Emerging Entities
#82 ConvMix: A Mixed-Criteria Data Augmentation Framework for Conversational Dense Retrieval
#83 Accelerating Scientific Discovery with Multi-Document Summarization of Impact-Ranked Papers
#84 ASTRA: Autonomous Spatial-Temporal Red-teaming for AI Software Assistants
#85 MegaWika 2: A More Comprehensive Multilingual Collection of Articles and their Sources
#86 GTPO: Trajectory-Based Policy Optimization in Large Language Models
#87 CX-Mind: A Pioneering Multimodal Large Language Model for Interleaved Reasoning in Chest X-ray via Curriculum-Guided Reinforcement Learning
#88 Health Insurance Coverage Rule Interpretation Corpus: Law, Policy, and Medical Guidance for Health Insurance Coverage Understanding
#89 A Social Data-Driven System for Identifying Estate-related Events and Topics
#90 MD-LLM-1: A Large Language Model for Molecular Dynamics

#1 SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience
#2 LLM Collaboration With Multi-Agent Reinforcement Learning
#3 ConfProBench: A Confidence Evaluation Benchmark for MLLM-Based Process Judges
#4 SID: Benchmarking Guided Instruction Capabilities in STEM Education with a Socratic Interdisciplinary Dialogues Dataset
#5 Argumentative Debates for Transparent Bias Detection [Technical Report]
#6 OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use
#7 From "Aha Moments" to Controllable Thinking: Toward Meta-Cognitive Reasoning in Large Reasoning Models via Decoupled Reasoning and Control
#8 \textsc: A Responsible Tool for Collecting Scaffolding Dialogues Between Experts and LLM-Simulated Novices
#9 Beyond Pixels: Exploring DOM Downsampling for LLM-Based Web Agents
#10 GuirlVG: Incentivize GUI Visual Grounding via Empirical Exploration on Reinforcement Learning
#11 Artificial Consciousness as Interface Representation
#12 OmniPlay: Benchmarking Omni-Modal Models on Omni-Modal Game Playing
#13 Deliberative Reasoning Network: An Uncertainty-Driven Paradigm for Belief-Tracked Inference with Pretrained Language Models
#14 Synthetic POMDPs to Challenge Memory-Augmented RL: Memory Demand Structure Modeling
#15 Large Language Model's Multi-Capability Alignment in Biomedical Domain
#16 Circuit-Aware SAT Solving: Guiding CDCL via Conditional Probabilities
#17 Generic-to-Specific Reasoning and Learning for Scalable Ad Hoc Teamwork
#18 AgREE: Agentic Reasoning for Knowledge Graph Completion on Emerging Entities
#19 A Compositional Framework for On-the-Fly LTLf Synthesis
#20 Towards Transparent AI Grading: Semantic Entropy as a Signal for Human-AI Disagreement
#21 GeoSR: Cognitive-Agentic Framework for Probing Geospatial Knowledge Boundaries via Iterative Self-Refinement
#22 KG-Augmented Executable CoT for Mathematical Coding
#23 Personalized Knowledge Transfer Through Generative AI: Contextualizing Learning to Individual Career Goals
#24 SEA: Self-Evolution Agent with Step-wise Reward for Computer Use
#25 Uncertainty-Aware GUI Agent: Adaptive Perception through Component Recommendation and Human-in-the-Loop Refinement
#26 Galaxy: A Cognition-Centered Framework for Proactive, Privacy-Preserving, and Self-Evolving LLM Agents
#27 The Emotional Baby Is Truly Deadly: Does your Multimodal Large Reasoning Model Have Emotional Flattery towards Humans?
#28 Can Large Language Models Adequately Perform Symbolic Reasoning Over Time Series?
#29 MOTIF: Multi-strategy Optimization via Turn-based Interactive Framework
#30 Evo-MARL: Co-Evolutionary Multi-Agent Reinforcement Learning for Internalized Safety
#31 MI9 – Agent Intelligence Protocol: Runtime Governance for Agentic AI Systems
#32 Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis
#33 From MAS to MARS: Coordination Failures and Reasoning Trade-offs in Hierarchical Multi-Agent Robotic Systems within a Healthcare Scenario
#34 Query Attribute Modeling: Improving search relevance with Semantic Search and Meta Data Filtering
#35 GeRe: Towards Efficient Anti-Forgetting in Continual Learning of LLM via General Samples Replay
#36 How are CS students using resources and AI tools for coding tasks?
#37 Sculptor: Empowering LLMs with Cognitive Agency via Active Context Management
#38 HierarchicalPrune: Position-Aware Compression for Large-Scale Diffusion Models
#39 YOLOv8-Based Deep Learning Model for Automated Poultry Disease Detection and Health Monitoring paper
#40 X-SAM: From Segment Anything to Any Segmentation
#41 A Scalable Pretraining Framework for Link Prediction with Efficient Adaptation
#42 P-Aligner: Enabling Pre-Alignment of Language Models via Principled Instruction Synthesis
#43 HiD-VAE: Interpretable Generative Recommendation via Hierarchical and Disentangled Semantic IDs
#44 Neuromorphic Cybersecurity with Semi-supervised Lifelong Learning
#45 TURA: Tool-Augmented Unified Retrieval Agent for AI Search
#46 GraphProp: Training the Graph Foundation Models using Graph Properties
#47 A Comprehensive Framework for Uncertainty Quantification of Voxel-wise Supervised Models in IVIM MRI
#48 Position: The Current AI Conference Model is Unsustainable! Diagnosing the Crisis of Centralized AI Conference
#49 Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning
#50 Beyond Brainstorming: What Drives High-Quality Scientific Ideas? Lessons from Multi-Agent Collaboration
#51 CLASP: Cross-modal Salient Anchor-based Semantic Propagation for Weakly-supervised Dense Audio-Visual Event Localization
#52 MSC: A Marine Wildlife Video Dataset with Grounded Segmentation and Clip-Level Captioning
#53 Unveiling the Landscape of Clinical Depression Assessment: From Behavioral Signatures to Psychiatric Reasoning
#54 RAIDX: A Retrieval-Augmented Generation and GRPO Reinforcement Learning Framework for Explainable Deepfake Detection
#55 PRISM: Lightweight Multivariate Time-Series Classification through Symmetric Multi-Resolution Convolutional Layers
#56 Learning Robust Intervention Representations with Delta Embeddings
#57 Hierarchical Scoring for Machine Learning Classifier Error Impact Evaluation
#58 Benchmarking Quantum and Classical Sequential Models for Urban Telecommunication Forecasting
#59 Metric Learning in an RKHS
#60 Zero-Residual Concept Erasure via Progressive Alignment in Text-to-Image Model
#61 Small transformer architectures for task switching
#62 Automatic LLM Red Teaming
#63 Cloud Model Characteristic Function Auto-Encoder: Integrating Cloud Model Theory with MMD Regularization for Enhanced Generative Modeling
#64 Automated Generation of Curriculum-Aligned Multiple-Choice Questions for Malaysian Secondary Mathematics Using Generative AI
#65 StepFun-Formalizer: Unlocking the Autoformalization Potential of LLMs through Knowledge-Reasoning Fusion
#66 Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models
#67 Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation
#68 Deep Learning-based Scalable Image-to-3D Facade Parser for Generating Thermal 3D Building Models
#69 Why are LLMs' abilities emergent?
#70 Improving Crash Data Quality with Large Language Models: Evidence from Secondary Crash Narratives in Kentucky
#71 AIC CTU@FEVER 8: On-premise fact checking through long context RAG
#72 ProtoN: Prototype Node Graph Neural Network for Unconstrained Multi-Impression Ear Recognition
#73 LUST: A Multi-Modal Framework with Hierarchical LLM-based Scoring for Learned Thematic Significance Tracking in Multimedia Content
#74 Chain of Questions: Guiding Multimodal Curiosity in Language Models
#75 GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy
#76 Modelling and Classifying the Components of a Literature Review
#77 Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models
#78 Compressing Large Language Models with PCA Without Performance Loss
#79 Comparative Analysis of Novel NIRMAL Optimizer Against Adam and SGD with Momentum
#80 Challenges in Applying Variational Quantum Algorithms to Dynamic Satellite Network Routing
#81 Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success
#82 A Few Words Can Distort Graphs: Knowledge Poisoning Attacks on Graph-based Retrieval-Augmented Generation of Large Language Models
#83 A Visual Tool for Interactive Model Explanation using Sensitivity Analysis
#84 SelectiveShield: Lightweight Hybrid Defense Against Gradient Leakage in Federated Learning
#85 Segment Any Vehicle: Semantic and Visual Context Driven SAM and A Benchmark
#86 TalkDep: Clinically Grounded LLM Personas for Conversation-Centric Depression Screening
#87 Automated ultrasound doppler angle estimation using deep learning
#88 Empowering Time Series Forecasting with LLM-Agents
#89 LayerT2V: Interactive Multi-Object Trajectory Layering for Video Generation
#90 Symmetric Behavior Regularization via Taylor Expansion of Symmetry
#91 A Hybrid AI Methodology for Generating Ontologies of Research Topics from Scientific Paper Corpora
#92 ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments
#93 ViFP: A Framework for Visual False Positive Detection to Enhance Reasoning Reliability in VLMs
#94 Gather and Trace: Rethinking Video TextVQA from an Instance-oriented Perspective
#95 Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models
#96 NVSpeech: An Integrated and Scalable Pipeline for Human-Like Speech Modeling with Paralinguistic Vocalizations
#97 Hacking Hallucinations of MLLMs with Causal Sufficiency and Necessity
#98 Quasi-Clique Discovery via Energy Diffusion
#99 Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap
#100 COPO: Consistency-Aware Policy Optimization
#101 UniFGVC: Universal Training-Free Few-Shot Fine-Grained Vision Classification via Attribute-Aware Multimodal Retrieval
#102 DS2Net: Detail-Semantic Deep Supervision Network for Medical Image Segmentation
#103 Experimental Analysis of Productive Interaction Strategy with ChatGPT: User Study on Function and Project-level Code Generation Tasks
#104 Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decode
#105 SenseCrypt: Sensitivity-guided Selective Homomorphic Encryption for Joint Federated Learning in Cross-Device Scenarios
#106 DET-GS: Depth- and Edge-Aware Regularization for High-Fidelity 3D Gaussian Splatting
#107 DRIVE: Dynamic Rule Inference and Verified Evaluation for Constraint-Aware Autonomous Driving
#108 FLAT: Latent-Driven Arbitrary-Target Backdoor Attacks in Federated Learning
#109 Large Reasoning Models Are Autonomous Jailbreak Agents
#110 CORE-ReID V2: Advancing the Domain Adaptation for Object Re-Identification with Optimized Training and Ensemble Fusion
#111 A Comparative Survey of PyTorch vs TensorFlow for Deep Learning: Usability, Performance, and Deployment Trade-offs
#112 Enhancing Serendipity Recommendation System by Constructing Dynamic User Knowledge Graphs with Large Language Models
#113 Identity Theft in AI Conference Peer Review
#114 Step More: Going Beyond Single Backpropagation in Meta Learning Based Model Editing
#115 StepWrite: Adaptive Planning for Speech-Driven Text Generation
#116 HarmonyGuard: Toward Safety and Utility in Web Agents via Adaptive Policy Enhancement and Dual-Objective Optimization
#117 Are Today's LLMs Ready to Explain Well-Being Concepts?
#118 Dynamic User-controllable Privacy-preserving Few-shot Sensing Framework
#119 Data and AI governance: Promoting equity, ethics, and fairness in large language models
#120 Human-Centered Human-AI Interaction (HC-HAII): A Human-Centered AI Perspective
#121 Accelerating Scientific Discovery with Multi-Document Summarization of Impact-Ranked Papers
#122 Policy to Assist Iteratively Local Segmentation: Optimising Modality and Location Selection for Prostate Cancer Localisation
#123 Constraint-Preserving Data Generation for Visuomotor Policy Learning
#124 FairPOT: Balancing AUC Performance and Fairness with Proportional Optimal Transport
#125 Active Learning and Transfer Learning for Anomaly Detection in Time-Series Data
#126 Deep learning framework for crater detection and identification on the Moon and Mars
#127 Fast and Accurate Explanations of Distance-Based Classifiers by Uncovering Latent Explanatory Structures
#128 Calibrating Biophysical Models for Grape Phenology Prediction via Multi-Task Learning
#129 Simulating Cyberattacks through a Breach Attack Simulation (BAS) Platform empowered by Security Chaos Engineering (SCE)
#130 Intelligent Sampling of Extreme-Scale Turbulence Datasets for Accurate and Efficient Spatiotemporal Model Training
#131 Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models
#132 VAE-DNN: Energy-Efficient Trainable-by-Parts Surrogate Model For Parametric Partial Differential Equations
#133 Mechanism Design for Facility Location using Predictions
#134 SoilNet: A Multimodal Multitask Model for Hierarchical Classification of Soil Horizons
#135 Probing and Enhancing the Robustness of GNN-based QEC Decoders with Reinforcement Learning
#136 Do GNN-based QEC Decoders Require Classical Knowledge? Evaluating the Efficacy of Knowledge Distillation from MWPM
#137 Are Inherently Interpretable Models More Robust? A Study In Music Emotion Recognition
#138 When Agents Break Down in Multiagent Path Finding
#139 Revisiting Heat Flux Analysis of Tungsten Monoblock Divertor on EAST using Physics-Informed Neural Network
#140 4D-PreNet: A Unified Preprocessing Framework for 4D-STEM Data Analysis
#141 U-PINet: End-to-End Hierarchical Physics-Informed Learning With Sparse Graph Coupling for 3D EM Scattering Modeling
#142 When Deep Learning Fails: Limitations of Recurrent Models on Stroke-Based Handwriting for Alzheimer's Disease Detection
#143 GTPO: Trajectory-Based Policy Optimization in Large Language Models
#144 Trustworthiness of Legal Considerations for the Use of LLMs in Education
#145 Development of management systems using artificial intelligence systems and machine learning methods for boards of directors (preprint, unofficial translation)
#146 CoughViT: A Self-Supervised Vision Transformer for Cough Audio Representation Learning
#147 Refine-IQA: Multi-Stage Reinforcement Finetuning for Perceptual Image Quality Assessment
#148 FlashCommunication V2: Bit Splitting and Spike Reserving for Any Bit Communication
#149 M3HL: Mutual Mask Mix with High-Low Level Feature Consistency for Semi-Supervised Medical Image Segmentation
#150 Data-Driven Discovery of Mobility Periodicity for Understanding Urban Transportation Systems
#151 Tobler's First Law in GeoAI: A Spatially Explicit Deep Learning Model for Terrain Feature Detection Under Weak Supervision
#152 Do We Need Pre-Processing for Deep Learning Based Ultrasound Shear Wave Elastography?
#153 Boosting Vision Semantic Density with Anatomy Normality Modeling for Medical Vision-language Pre-training
#154 Latent Knowledge Scalpel: Precise and Massive Knowledge Editing for Large Language Models
#155 VQ-DeepISC: Vector Quantized-Enabled Digital Semantic Communication with Channel Adaptive Image Transmission
#156 A Modified VGG19-Based Framework for Accurate and Interpretable Real-Time Bone Fracture Detection
#157 Improve Retinal Artery/Vein Classification via Channel Couplin
#158 GanitBench: A bi-lingual benchmark for evaluating mathematical reasoning in Vision Language Models
#159 Fusion of Pervasive RF Data with Spatial Images via Vision Transformers for Enhanced Mapping in Smart Cities
#160 StorySync: Training-Free Subject Consistency in Text-to-Image Generation via Region Harmonization
#161 A Survey of Multimodal Ophthalmic Diagnostics: From Task-Specific Approaches to Foundational Models
#162 CX-Mind: A Pioneering Multimodal Large Language Model for Interleaved Reasoning in Chest X-ray via Curriculum-Guided Reinforcement Learning
#163 Multimodal Video Emotion Recognition with Reliable Reasoning Priors
#164 Intent Aware Context Retrieval for Multi-Turn Agricultural Question Answering
#165 Health Insurance Coverage Rule Interpretation Corpus: Law, Policy, and Medical Guidance for Health Insurance Coverage Understanding
#166 Detection of Autonomic Dysreflexia in Individuals With Spinal Cord Injury Using Multimodal Wearable Sensors
#167 "Think First, Verify Always": Training Humans to Face AI Risks
#168 A Social Data-Driven System for Identifying Estate-related Events and Topics
#169 Controllable Surface Diffusion Generative Model for Neurodevelopmental Trajectories
#170 Privacy Risks of LLM-Empowered Recommender Systems: An Inversion Attack Perspective
#171 MagicGUI: A Foundational Mobile GUI Agent with Scalable Data Pipeline and Reinforcement Fine-tuning
#172 PLA: Prompt Learning Attack against Text-to-Image Generative Models
#173 Large AI Models for Wireless Physical Layer
#174 ForestFormer3D: A Unified Framework for End-to-End Segmentation of Forest LiDAR 3D Point Clouds
#175 Recommendation with Generative Models
#176 Delving Deeper Into Astromorphic Transformers

2025-08-07 Research Updates

2025-08-06 11:03:18 Wednesday ～ 2025-08-07 10:24:16 Thursday

1. Source Data

1.1 WeChat Official Accounts

1.1.1 量子位

1.1.2 机器之心

1.1.3 新智元

1.1.4 AGI Hunt

1.1.5 Other

1.2 Arxiv

1.2.1 Computation and Language

From：https:// /arxiv/cs.CL

2025-08-07 | | Total: 90

#1 Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis 跳跃、跳过与过度思考：诊断推理模型在多跳分析中失误的原因

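The emergence of reasoning models and their use in practical AI chatbots has driven breakthroughs on advanced mathematics, deep search, and extractive question answering problems that require complex, multi-step thinking. However, a complete understanding of why these models hallucinate more than general-purpose language models is still missing. In this investigative study, we systematically explore the reasoning failures of contemporary language models on multi-hop question answering tasks. We introduce a novel, fine-grained error categorization framework that examines failures along three key dimensions: the diversity and uniqueness of the source documents involved ("hops"), the completeness with which relevant information is captured ("coverage"), and cognitive inefficiency ("overthinking"). Through rigorous human annotation, supported by complementary automated metrics, our exploration uncovers intricate error patterns that accuracy-centric evaluations often hide. This investigative approach provides deeper insight into the cognitive limitations of current models and offers actionable guidance for improving the accuracy, transparency, and robustness of reasoning in future language modeling work. 推理模型的出现及其在实际 AI 聊天机器人中的应用，推动了解决需要复杂多步骤思考过程的高级数学、深度搜索和抽取式问答问题的突破。然而，关于为何这些模型比通用语言模型更容易产生幻觉的完整理解仍然缺失。在这项调查研究中，我们系统地探讨了当代语言模型在多跳问答任务中的推理失败。我们引入了一种新颖且细致的错误分类框架，从三个关键维度审视失败：涉及的源文档的多样性和独特性（“跳跃”）、捕捉相关信息的完整性（“覆盖”）以及认知效率低下（“过度思考”）。通过严格的人类标注，辅以互补的自动化指标，我们的探索揭示了常被以准确率为中心的评估所掩盖的复杂错误模式。这种调查方法深入揭示了当前模型的认知局限性，并为未来语言建模工作中提升推理的准确性、透明度和鲁棒性提供了可操作的指导。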

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-06 17:58:36 UTC 发布：2025-08-06 17:58:36 UTC

#2 FaST: Feature-aware Sampling and Tuning for Personalized Preference Alignment with Limited Data FaST：面向个性化偏好对齐的特征感知采样与调优，适用于有限数据环境

Authors: [Thibaut Thonet](https://arxiv.org/search/?searchtype=author&query=Thibaut Thonet), [Germán Kruszewski](https://arxiv.org/search/?searchtype=author&query=Germán Kruszewski), [Jos Rozen](https://arxiv.org/search/?searchtype=author&query=Jos Rozen), [Pierre Erbacher](https://arxiv.org/search/?searchtype=author&query=Pierre Erbacher), [Marc Dymetman](https://arxiv.org/search/?searchtype=author&query=Marc Dymetman)

LLM-powered conversational assistants are often deployed in a one-size-fits-all manner, which fails to accommodate individual user preferences. Recently, LLM personalization – tailoring models to align with specific user preferences – has gained increasing attention as a way to bridge this gap. In this work, we specifically focus on a practical yet challenging setting where only a small set of preference annotations can be collected per user – a problem we define as Personalized Preference Alignment with Limited Data (PPALLI). To support research in this area, we introduce two datasets – DnD and ELIP – and benchmark a variety of alignment techniques on them. We further propose FaST, a highly parameter-efficient approach that leverages high-level features automatically discovered from the data, achieving the best overall performance. 基于 LLM 的对话助手通常以一刀切的方式部署，无法满足个别用户的偏好需求。近年来，LLM 个性化——即根据特定用户偏好定制模型——作为弥合这一差距的方法，受到了越来越多的关注。在本工作中，我们特别关注一个实际但具有挑战性的场景，即每个用户只能收集到少量的偏好标注——我们将此问题定义为有限数据下的个性化偏好对齐（PPALLI）。为了支持该领域的研究，我们引入了两个数据集——DnD 和 ELIP，并在它们上对多种对齐技术进行了基准测试。我们进一步提出了 FaST，一种高度参数高效的方法，利用从数据中自动发现的高级特征，实现了最佳的整体性能。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-06 17:58:26 UTC 发布：2025-08-06 17:58:26 UTC

#3 GeRe: Towards Efficient Anti-Forgetting in Continual Learning of LLM via General Samples Replay GeRe：面向 LLM 持续学习中通过通用样本重放实现高效抗遗忘的探索

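The continual learning capability of large language models (LLMs) is crucial for advancing artificial general intelligence. However, continually fine-tuning LLMs across different domains often suffers from catastrophic forgetting, characterized by: 1) a significant decline in their general capabilities, and 2) a sharp drop in performance on previously learned tasks. To address both issues simultaneously in a simple and stable way, we propose the General Sample Replay (GeRe) framework, which uses ordinary pretraining texts for efficient anti-forgetting. Beyond revisiting the most common replay-based practices under GeRe, we further leverage neural states to introduce an enhanced activation-state-constrained optimization method based on a threshold-based margin (TM) loss, which keeps activation states consistent during replay learning. We are the first to validate that a small, fixed set of pre-collected general replay samples is sufficient to resolve both concerns: retaining general capabilities while promoting overall performance across sequential tasks. Indeed, the former can inherently facilitate the latter. Through controlled experiments, we systematically compare TM with different replay strategies under the GeRe framework, including vanilla label fitting, logit imitation via KL divergence, and feature imitation via L1/L2 losses. The results show that TM consistently improves performance and exhibits better robustness. Our work paves the way for efficient replay of LLMs in the future. Our code and data are available at https://github.com/Qznan/GeRe. 大型语言模型（LLMs）的持续学习能力对于推动通用人工智能的发展至关重要。然而，在不同领域对 LLMs 进行持续微调时，常常会遭遇灾难性遗忘，表现为：1）其通用能力显著下降，2）先前学习任务的性能急剧下降。为了以简单且稳定的方式同时解决这两个问题，我们提出了通用样本重放（General Sample Replay，GeRe）框架，该框架利用常规预训练文本实现高效的抗遗忘。除了在 GeRe 框架下回顾最常见的基于重放的实践外，我们进一步利用神经状态，引入了一种基于阈值边际（TM）损失的增强激活状态约束优化方法，以在重放学习过程中保持激活状态的一致性。我们首次验证了，一小组固定的预先收集的通用重放样本足以解决这两个问题：既保留通用能力，又促进顺序任务的整体性能。事实上，前者本质上可以促进后者。通过受控实验，我们在 GeRe 框架下系统地比较了 TM 与不同的重放策略，包括普通的标签拟合、通过 KL 散度进行的 logit 模仿以及通过 L1/L2 损失进行的特征模仿。结果表明，TM 始终提升了性能并表现出更好的鲁棒性。我们的工作为未来高效重放 LLMs 铺平了道路。我们的代码和数据可在 https://github.com/Qznan/GeRe 获取。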

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-06 17:42:22 UTC 发布：2025-08-06 17:42:22 UTC

#4 Sculptor: Empowering LLMs with Cognitive Agency via Active Context Management Sculptor：通过主动上下文管理赋能 LLMs 的认知代理能力

Authors: [Mo Li](https://arxiv.org/search/?searchtype=author&query=Mo Li), [L. H. Xu](https://arxiv.org/search/?searchtype=author&query=L. H. Xu), [Qitai Tan](https://arxiv.org/search/?searchtype=author&query=Qitai Tan), [Ting Cao](https://arxiv.org/search/?searchtype=author&query=Ting Cao), [Yunxin Liu](https://arxiv.org/search/?searchtype=author&query=Yunxin Liu)

Large Language Models (LLMs) suffer from significant performance degradation when processing long contexts due to proactive interference, where irrelevant information in earlier parts of the context disrupts reasoning and memory recall. While most research focuses on external memory systems to augment LLMs’ capabilities, we propose a complementary approach: empowering LLMs with Active Context Management (ACM) tools to actively sculpt their internal working memory. We introduce Sculptor, a framework that equips LLMs with three categories of tools: (1) context fragmentation, (2) summary, hide, and restore, and (3) intelligent search. Our approach enables LLMs to proactively manage their attention and working memory, analogous to how humans selectively focus on relevant information while filtering out distractions. Experimental evaluation on information-sparse benchmarks, PI-LLM (proactive interference) and NeedleBench Multi-Needle Reasoning, demonstrates that Sculptor significantly improves performance even without specific training, leveraging LLMs’ inherent tool calling generalization capabilities. By enabling Active Context Management, Sculptor not only mitigates proactive interference but also provides a cognitive foundation for more reliable reasoning across diverse long-context tasks, highlighting that explicit context-control strategies, rather than merely larger token windows, are key to robustness at scale. 大型语言模型（LLMs）在处理长上下文时，由于前摄干扰（即上下文早期部分的无关信息干扰推理和记忆回忆），表现会显著下降。尽管大多数研究集中在外部记忆系统以增强 LLMs 的能力，我们提出了一种互补方法：赋予 LLMs 主动上下文管理（ACM）工具，以主动塑造其内部工作记忆。我们引入了 Sculptor 框架，为 LLMs 配备三类工具：（1）上下文分割，（2）摘要、隐藏与恢复，以及（3）智能搜索。我们的方法使 LLMs 能够主动管理其注意力和工作记忆，类似于人类选择性关注相关信息并过滤干扰的方式。在信息稀疏的基准测试 PI-LLM（前摄干扰）和 NeedleBench 多针推理上的实验评估表明，Sculptor 显著提升了性能，即使在没有特定训练的情况下，也能利用 LLMs 固有的工具调用泛化能力。通过启用主动上下文管理，Sculptor 不仅减轻了前摄干扰，还为在各种长上下文任务中实现更可靠的推理提供了认知基础，强调了明确的上下文控制策略，而不仅仅是更大的令牌窗口，是实现大规模稳健性的关键。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-06 17:32:58 UTC 发布时间：2025-08-06 17:32:58 UTC

#5 Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs 多模块 GRPO：组合策略梯度和提示优化以实现语言模型程序

Group Relative Policy Optimization (GRPO) has proven to be an effective tool for post-training language models (LMs). However, AI systems are increasingly expressed as modular programs that mix together multiple LM calls with distinct prompt templates and other tools, and it is not clear how best to leverage GRPO to improve these systems. We begin to address this challenge by defining mmGRPO, a simple multi-module generalization of GRPO that groups LM calls by module across rollouts and handles variable-length and interrupted trajectories. We find that mmGRPO, composed with automatic prompt optimization, improves accuracy by 11% on average across classification, many-hop search, and privacy-preserving delegation tasks against the post-trained LM, and by 5% against prompt optimization on its own. We open-source mmGRPO in DSPy as the dspy.GRPO optimizer. 群组相对策略优化（GRPO）已被证明是后训练语言模型（LM）的有效工具。然而，人工智能系统越来越多地表现为将多个语言模型调用与不同的提示模板及其他工具混合在一起的模块化程序，目前尚不清楚如何最好地利用 GRPO 来改进这些系统。我们通过定义 mmGRPO 来开始解决这一挑战，mmGRPO 是 GRPO 的一个简单多模块推广，它在多次执行过程中按模块对语言模型调用进行分组，并处理可变长度和中断的轨迹。我们发现，结合自动提示优化的 mmGRPO，在分类、多跳搜索和隐私保护委托任务中，相较于后训练语言模型，平均准确率提升了 11%，相较于单独的提示优化提升了 5%。我们在 DSPy 中将 mmGRPO 开源为 dspy.GRPO 优化器。
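
The grouping step described for mmGRPO can be sketched roughly as follows. This is a minimal illustration, not the DSPy implementation: the rollout record layout, the function name `module_level_advantages`, the module names, and the choice to give every LM call its rollout's final reward are all assumptions made for the example. The mean/standard-deviation normalization is the usual GRPO group-relative advantage, applied here per module rather than per prompt.

```python
from collections import defaultdict
from statistics import mean, pstdev


def module_level_advantages(rollouts, eps=1e-8):
    """Group LM calls by module across rollouts and normalize rewards per group.

    rollouts: list of {"reward": float, "calls": [module_name, ...]} (hypothetical layout).
    Returns {module_name: [(rollout_index, call_index, advantage), ...]}.
    """
    groups = defaultdict(list)
    for r_idx, rollout in enumerate(rollouts):
        # Rollouts may have different lengths or stop early; each call is
        # simply filed under the module that produced it.
        for c_idx, module in enumerate(rollout["calls"]):
            groups[module].append((r_idx, c_idx, rollout["reward"]))

    advantages = {}
    for module, members in groups.items():
        rewards = [reward for _, _, reward in members]
        mu, sigma = mean(rewards), pstdev(rewards)
        advantages[module] = [
            (r_idx, c_idx, (reward - mu) / (sigma + eps))
            for r_idx, c_idx, reward in members
        ]
    return advantages


# Hypothetical three-module program: query generation, retrieval, answering.
rollouts = [
    {"reward": 1.0, "calls": ["query_gen", "retrieve", "answer"]},
    {"reward": 0.0, "calls": ["query_gen", "retrieve"]},  # interrupted before "answer"
    {"reward": 1.0, "calls": ["query_gen", "query_gen", "retrieve", "answer"]},
]
adv = module_level_advantages(rollouts)
```

Because groups are formed per module, the interrupted rollout still contributes training signal to `query_gen` and `retrieve` even though it never reached `answer`.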

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-06 17:28:31 UTC 发布时间：2025-08-06 17:28:31 UTC

#6 Can NLP Tackle Hate Speech in the Real World? Stakeholder-Informed Feedback and Survey on Counterspeech 自然语言处理能应对现实世界中的仇恨言论吗？基于利益相关者反馈和反言论调查

Authors: [Tanvi Dinkar](https://arxiv.org/search/?searchtype=author&query=Tanvi Dinkar), [Aiqi Jiang](https://arxiv.org/search/?searchtype=author&query=Aiqi Jiang), [Simona Frenda](https://arxiv.org/search/?searchtype=author&query=Simona Frenda), [Poppy Gerrard-Abbott](https://arxiv.org/search/?searchtype=author&query=Poppy Gerrard-Abbott), [Nancie Gunson](https://arxiv.org/search/?searchtype=author&query=Nancie Gunson), [Gavin Abercrombie](https://arxiv.org/search/?searchtype=author&query=Gavin Abercrombie), [Ioannis Konstas](https://arxiv.org/search/?searchtype=author&query=Ioannis Konstas)

Counterspeech, i.e. the practice of responding to online hate speech, has gained traction in NLP as a promising intervention. While early work emphasised collaboration with non-governmental organisation stakeholders, recent research trends have shifted toward automated pipelines that reuse a small set of legacy datasets, often without input from affected communities. This paper presents a systematic review of 74 NLP studies on counterspeech, analysing the extent to which stakeholder participation influences dataset creation, model development, and evaluation. To complement this analysis, we conducted a participatory case study with five NGOs specialising in online Gender-Based Violence (oGBV), identifying stakeholder-informed practices for counterspeech generation. Our findings reveal a growing disconnect between current NLP research and the needs of communities most impacted by toxic online content. We conclude with concrete recommendations for re-centring stakeholder expertise in counterspeech research. 反言论，即回应网络仇恨言论的做法，作为一种有前景的干预手段，已在自然语言处理（NLP）领域获得关注。尽管早期工作强调与非政府组织利益相关者的合作，近期的研究趋势却转向使用自动化流程，重复利用少量传统数据集，且往往未征求受影响社区的意见。本文系统回顾了 74 项关于反言论的 NLP 研究，分析了利益相关者参与在数据集创建、模型开发和评估中的影响程度。为补充此分析，我们与五个专注于网络性别暴力（oGBV）的非政府组织开展了参与式案例研究，识别出基于利益相关者意见的反言论生成实践。研究结果显示，当前 NLP 研究与受有害网络内容影响最深社区的需求之间存在日益加剧的脱节。我们最后提出了具体建议，旨在重新聚焦利益相关者专业知识于反言论研究中。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-06 17:04:58 UTC 发布：2025-08-06 17:04:58 UTC

#7 IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards IFDECORATOR：用可验证奖励包装指令跟随强化学习

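Reinforcement Learning with Verifiable Rewards (RLVR) improves the instruction-following capabilities of large language models (LLMs), but its training is inefficient because difficulty is not adequately assessed. In addition, RLVR is prone to over-optimization, where LLMs exploit verification shortcuts without aligning with the actual intent of user instructions. We propose the Instruction Following Decorator (IFDecorator), a framework that wraps RLVR training into a robust and sample-efficient pipeline. It consists of three components: (1) a cooperative-adversarial data flywheel that co-evolves instructions and hybrid verifications, generating progressively more challenging instruction-verification pairs; (2) IntentCheck, a bypass module that enforces intent alignment; and (3) trip wires, a diagnostic mechanism that detects reward hacking through trap instructions, which trigger and capture shortcut exploitation behaviors. Our Qwen2.5-32B-Instruct-IFDecorator achieves 87.43% accuracy on IFEval, outperforming larger proprietary models such as GPT-4o. We also demonstrate substantial improvements on FollowBench while preserving general capabilities. Our trip wires show a significant reduction in reward hacking rates. We will release models, code, and data for future research. 可验证奖励强化学习（RLVR）提升了大型语言模型（LLMs）的指令执行能力，但由于难度评估不足，训练效率较低。此外，RLVR 容易出现过度优化现象，即 LLMs 利用验证捷径而未能与用户指令的实际意图保持一致。我们提出了指令执行装饰器（IFDecorator）框架，将 RLVR 训练封装成一个稳健且样本高效的流程。该框架包含三个部分：（1）合作-对抗数据飞轮，协同进化指令和混合验证，生成逐步更具挑战性的指令-验证对；（2）IntentCheck，执行意图对齐的旁路模块；（3）触发线，一种通过陷阱指令检测奖励作弊的诊断机制，能够触发并捕捉捷径利用行为。我们的 Qwen2.5-32B-Instruct-IFDecorator 在 IFEval 上达到 87.43%的准确率，优于更大规模的专有模型如 GPT-4o。此外，我们在 FollowBench 上也展示了显著提升，同时保持了模型的通用能力。我们的触发机制显示奖励作弊率显著降低。我们将发布模型、代码和数据以供未来研究。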

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-06 17:00:54 UTC 发布时间：2025-08-06 17:00:54 UTC

#8 P-Aligner: Enabling Pre-Alignment of Language Models via Principled Instruction Synthesis P-Aligner：通过原则性指令合成实现语言模型的预对齐

Large Language Models (LLMs) are expected to produce safe, helpful, and honest content during interaction with human users, but they frequently fail to align with such values when given flawed instructions, e.g., missing context, ambiguous directives, or inappropriate tone, leaving substantial room for improvement along multiple dimensions. A cost-effective yet high-impact way is to pre-align instructions before the model begins decoding. Existing approaches either rely on prohibitive test-time search costs or end-to-end model rewrite, which is powered by a customized training corpus with unclear objectives. In this work, we demonstrate that the goal of efficient and effective preference alignment can be achieved by P-Aligner, a lightweight module generating instructions that preserve the original intents while being expressed in a more human-preferred form. P-Aligner is trained on UltraPrompt, a new dataset synthesized via a proposed principle-guided pipeline using Monte-Carlo Tree Search, which systematically explores the space of candidate instructions that are closely tied to human preference. Experiments across different methods show that P-Aligner generally outperforms strong baselines across various models and benchmarks, including average win-rate gains of 28.35% and 8.69% on GPT-4-turbo and Gemma-2-SimPO, respectively. Further analyses validate its effectiveness and efficiency through multiple perspectives, including data quality, search strategies, iterative deployment, and time overhead. 
大型语言模型（LLMs）在与人类用户交互时，期望能够生成安全、有帮助且诚实的内容，但在面对有缺陷的指令时（例如缺失上下文、指令模糊或语气不当），它们经常无法与这些价值观保持一致，在多个方面仍有很大改进空间。一种成本效益高且影响显著的方法是在模型开始解码之前对指令进行预对齐。现有方法要么依赖于高昂的测试时搜索成本，要么依赖于端到端的模型重写，而后者依赖于目标不明确的定制训练语料库。在本工作中，我们展示了通过 P-Aligner 这一轻量级模块，可以实现高效且有效的偏好对齐。该模块生成的指令既保留了原始意图，又以更符合人类偏好的形式表达。P-Aligner 在 UltraPrompt 数据集上进行训练，该数据集通过一种基于蒙特卡洛树搜索的原则引导流水线合成，系统地探索了与人类偏好紧密相关的候选指令空间。不同方法的实验表明，P-Aligner 通常在各种模型和基准测试中优于强基线，包括在 GPT-4-turbo 和 Gemma-2-SimPO 上分别实现了 28.35%和 8.69%的平均胜率提升。进一步的分析从多个角度验证了其有效性和效率，包括数据质量、搜索策略、迭代部署和时间开销。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-06 16:51:38 UTC 发布：2025-08-06 16:51:38 UTC

#9 Lightweight Transformers for Zero-Shot and Fine-Tuned Text-to-SQL Generation Using Spider 使用 Spider 进行零样本和微调文本到 SQL 生成的轻量级 Transformer

Authors: [Chirag Seth](https://arxiv.org/search/?searchtype=author&query=Chirag Seth), [Utkarsh Singh](https://arxiv.org/search/?searchtype=author&query=Utkarsh Singh)

Text-to-SQL translation enables non-expert users to query relational databases using natural language, with applications in education and business intelligence. This study evaluates three lightweight transformer models - T5-Small, BART-Small, and GPT-2 - on the Spider dataset, focusing on low-resource settings. We developed a reusable, model-agnostic pipeline that tailors schema formatting to each model’s architecture, training them across 1000 to 5000 iterations and evaluating on 1000 test samples using Logical Form Accuracy (LFAcc), BLEU, and Exact Match (EM) metrics. Fine-tuned T5-Small achieves the highest LFAcc (27.8%), outperforming BART-Small (23.98%) and GPT-2 (20.1%), highlighting encoder-decoder models’ superiority in schema-aware SQL generation. Despite resource constraints limiting performance, our pipeline’s modularity supports future enhancements, such as advanced schema linking or alternative base models. This work underscores the potential of compact transformers for accessible text-to-SQL solutions in resource-scarce environments. 文本到 SQL 的翻译使非专业用户能够使用自然语言查询关系数据库，应用于教育和商业智能领域。本研究评估了三种轻量级变换器模型——T5-Small、BART-Small 和 GPT-2——在 Spider 数据集上的表现，重点关注低资源环境。我们开发了一个可重用的、与模型无关的流水线，根据每个模型的架构定制模式格式，训练迭代次数在 1000 到 5000 之间，并使用逻辑形式准确率（LFAcc）、BLEU 和完全匹配（EM）指标对 1000 个测试样本进行评估。微调后的 T5-Small 取得了最高的 LFAcc（27.8%），优于 BART-Small（23.98%）和 GPT-2（20.1%），凸显了编码器-解码器模型在模式感知 SQL 生成中的优势。尽管资源限制限制了性能，我们的流水线模块化设计支持未来的改进，如高级模式链接或替代基础模型。该工作强调了紧凑型变换器在资源匮乏环境中实现易用文本到 SQL 解决方案的潜力。
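
To make the Exact Match (EM) metric concrete, here is a simplified string-level sketch. The abstract does not say how the authors normalize queries, so the normalization rules below (trim, drop a trailing semicolon, collapse whitespace, lowercase) and the function names are assumptions; a component-level comparison of parsed SQL would be stricter about structure and more lenient about formatting.

```python
import re


def normalize_sql(sql: str) -> str:
    """Trim, drop a trailing semicolon, collapse whitespace, and lowercase.

    Note: lowercasing also changes string literals, which is acceptable for a
    rough sketch but not for a production metric.
    """
    collapsed = re.sub(r"\s+", " ", sql.strip().rstrip(";"))
    return collapsed.strip().lower()


def exact_match(predictions, references) -> float:
    """Fraction of predictions whose normalized text equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(
        normalize_sql(p) == normalize_sql(r) for p, r in zip(predictions, references)
    )
    return hits / len(predictions)


# Toy queries against a hypothetical "singer" table.
preds = [
    "SELECT name FROM singer;",
    "select count(*)  from singer",
    "SELECT age FROM singer",
]
golds = [
    "select name from singer",
    "SELECT COUNT(*) FROM singer",
    "SELECT name FROM singer",
]
em = exact_match(preds, golds)
```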

Subjects: Computation and Language, Information Retrieval 主题：计算与语言，信息检索

Publish: 2025-08-06 16:49:13 UTC 发布：2025-08-06 16:49:13 UTC

#10 TURA: Tool-Augmented Unified Retrieval Agent for AI Search TURA：用于 AI 搜索的工具增强统一检索代理

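The advent of large language models (LLMs) is transforming search engines into conversational AI search products, primarily through retrieval-augmented generation (RAG) over web corpora. However, this paradigm has significant limitations in industrial applications. Traditional RAG approaches struggle with real-time needs and with structured queries that require access to dynamically generated content such as ticket availability or inventory. Because search engines are limited to indexing static pages, they cannot perform the interactive queries that such time-sensitive data requires. Academic research has focused on optimizing RAG for static content, overlooking complex intents and the need for dynamic sources such as databases and real-time APIs. To bridge this gap, we propose TURA (Tool-Augmented Unified Retrieval Agent for AI Search), a novel three-stage framework that combines RAG with agentic tool use to access both static content and dynamic, real-time information. TURA has three key components: an intent-aware retrieval module that decomposes queries and retrieves information sources encapsulated as Model Context Protocol (MCP) servers; a DAG-based task planner that models task dependencies as a directed acyclic graph (DAG) for optimal parallel execution; and a lightweight distilled agent executor for efficient tool calling. TURA is the first architecture to systematically bridge the gap between static RAG and dynamic information sources for a world-class AI search product. Serving tens of millions of users, it uses an agentic framework to deliver robust, real-time answers while meeting the low-latency demands of a large-scale industrial system. 大型语言模型（LLMs）的出现正在将搜索引擎转变为对话式人工智能搜索产品，主要通过对网络语料库进行检索增强生成（RAG）。然而，这一范式在工业应用中存在显著限制。传统的 RAG 方法难以满足实时需求和需要访问动态生成内容（如票务可用性或库存）的结构化查询。搜索引擎仅限于索引静态页面，无法执行此类时间敏感数据所需的交互式查询。学术研究主要集中在优化针对静态内容的 RAG，忽视了复杂意图以及对数据库和实时 API 等动态资源的需求。为弥补这一差距，我们提出了 TURA（Tool-Augmented Unified Retrieval Agent for AI Search），这是一种结合了 RAG 与工具代理使用的创新三阶段框架，能够访问静态内容和动态实时信息。TURA 有三个关键组成部分：一个意图感知检索模块，用于分解查询并检索封装为模型上下文协议（MCP）服务器的信息源；一个基于有向无环图（DAG）的任务规划器，将任务依赖关系建模为有向无环图，以实现最佳的并行执行；以及一个轻量级的蒸馏代理执行器，用于高效调用工具。TURA 是首个系统性地弥合静态检索增强生成（RAG）与动态信息源之间差距的架构，面向世界级的 AI 搜索产品。它服务于数千万用户，利用代理框架提供强大且实时的答案，同时满足大规模工业系统的低延迟需求。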
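
The DAG-based planning idea can be illustrated with Python's standard `graphlib` module: sub-tasks whose dependencies are all satisfied can be dispatched together as one parallel batch. This is only a sketch of the general technique, not TURA's planner; the task names and the `parallel_layers` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def parallel_layers(dependencies):
    """Split a task DAG into batches that can run in parallel.

    dependencies: {task: set of tasks it depends on}.
    Returns a list of batches; every task in a batch has all of its
    dependencies in earlier batches.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not a DAG
    layers = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        layers.append(ready)
        sorter.done(*ready)
    return layers


# Hypothetical travel query decomposed into tool calls.
plan = {
    "search_flights": set(),
    "check_weather": set(),
    "search_hotels": {"search_flights"},
    "compose_answer": {"search_hotels", "check_weather"},
}
layers = parallel_layers(plan)
```

Here the two independent lookups form the first batch, the hotel search waits for flights, and the final answer waits for everything.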

Subjects: Computation and Language, Artificial Intelligence, Information Retrieval 主题：计算与语言，人工智能，信息检索

Publish: 2025-08-06 16:24:17 UTC 发布时间：2025-08-06 16:24:17 UTC

#11 Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning 分享你的关注点：通过基于矩阵的字典学习实现 Transformer 权重共享

Authors: [Magauiya Zhussip](https://arxiv.org/search/?searchtype=author&query=Magauiya Zhussip), [Dmitriy Shopkhoev](https://arxiv.org/search/?searchtype=author&query=Dmitriy Shopkhoev), [Ammar Ali](https://arxiv.org/search/?searchtype=author&query=Ammar Ali), [Stamatios Lefkimmiatis](https://arxiv.org/search/?searchtype=author&query=Stamatios Lefkimmiatis)

Large language models (LLMs) have revolutionized AI applications, yet their high computational and memory demands hinder their widespread deployment. Existing compression techniques focus on intra-block optimizations (e.g. low-rank approximation, attention head pruning), while the repetitive layered structure of transformers implies significant inter-block redundancy - a dimension largely unexplored beyond key-value (KV) caching. Inspired by dictionary learning in CNNs, we propose a framework for structured weight sharing across transformer layers. Our approach decomposes attention projection matrices into shared dictionary atoms, reducing the attention module’s parameters by 66.7% while achieving on-par performance. Unlike complex methods requiring distillation or architectural changes, MASA (Matrix Atom Sharing in Attention) operates as a drop-in replacement - trained with standard optimizers - and represents each layer’s weights as linear combinations of shared matrix atoms. Experiments across scales (100M-700M parameters) show that MASA achieves better benchmark accuracy and perplexity than grouped-query attention (GQA), low-rank baselines and recently proposed Repeat-all-over/Sequential sharing at comparable parameter budgets. Ablation studies confirm robustness to the dictionary size and the efficacy of shared representations in capturing cross-layer statistical regularities. Extending to Vision Transformers (ViT), MASA matches performance metrics on image classification and detection tasks with 66.7% fewer attention parameters. By combining dictionary learning strategies with transformer efficiency, MASA offers a scalable blueprint for parameter-efficient models without sacrificing performance. Finally, we investigate the possibility of employing MASA on pretrained LLMs to reduce their number of parameters without experiencing any significant drop in their performance. 
大型语言模型（LLMs）已经革新了人工智能应用，但其高计算和内存需求阻碍了其广泛部署。现有的压缩技术主要集中在块内优化（如低秩近似、注意力头剪枝），而变换器的重复层结构意味着存在显著的块间冗余——这一维度除了键值（KV）缓存外基本未被探索。受卷积神经网络中字典学习的启发，我们提出了一个跨变换器层的结构化权重共享框架。我们的方法将注意力投影矩阵分解为共享的字典原子，减少了注意力模块 66.7%的参数量，同时实现了相当的性能。与需要蒸馏或架构变更的复杂方法不同，MASA（注意力中的矩阵原子共享）作为一种即插即用的替代方案运行——使用标准优化器训练——并将每层权重表示为共享矩阵原子的线性组合。在不同规模（1 亿至 7 亿参数）上的实验表明，MASA 在相似参数预算下，较分组查询注意力（GQA）、低秩基线和最近提出的 Repeat-all-over/Sequential 共享方法，取得了更好的基准准确率和困惑度。消融研究证实了其对字典大小的鲁棒性以及共享表示在捕捉跨层统计规律方面的有效性。将其扩展到视觉变换器（ViT）时，MASA 在图像分类和检测任务上的性能指标与原方法相当，但注意力参数减少了 66.7%。通过将字典学习策略与变换器效率相结合，MASA 为参数高效模型提供了一个可扩展的蓝图，同时不牺牲性能。最后，我们探讨了在预训练 LLMs 上应用 MASA 的可能性，以减少其参数数量而不显著降低性能。
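
A small sketch of the weight-sharing idea: each layer's projection matrix is rebuilt as a linear combination of a few matrix atoms shared across layers, so only the atoms and per-layer coefficients are stored. The helper names are hypothetical, and the dictionary size used below (one third as many atoms as layers) is an assumption picked because it reproduces the reported 66.7% reduction for a single projection type; the abstract does not state the authors' configuration.

```python
def combine_atoms(atoms, coeffs):
    """Return sum_k coeffs[k] * atoms[k] for equally shaped 2-D lists."""
    rows, cols = len(atoms[0]), len(atoms[0][0])
    return [
        [sum(c * atom[i][j] for c, atom in zip(coeffs, atoms)) for j in range(cols)]
        for i in range(rows)
    ]


def parameter_counts(num_layers, d_model, num_atoms):
    """Parameters for one projection type: dense per-layer vs. shared atoms."""
    dense = num_layers * d_model * d_model
    shared = num_atoms * d_model * d_model + num_layers * num_atoms
    return dense, shared


# Two toy 2x2 atoms; one layer mixes them with coefficients (2.0, 0.5).
atoms = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
layer_weight = combine_atoms(atoms, [2.0, 0.5])

# Assumed setup: 12 layers, width 768, 4 shared atoms.
dense, shared = parameter_counts(num_layers=12, d_model=768, num_atoms=4)
reduction = 1 - shared / dense
```

The per-layer coefficients add only `num_layers * num_atoms` scalars, which is negligible next to the atoms themselves.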

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-06 16:06:43 UTC 发布：2025-08-06 16:06:43 UTC

#12 Beyond Brainstorming: What Drives High-Quality Scientific Ideas? Lessons from Multi-Agent Collaboration 超越头脑风暴：是什么驱动高质量的科学创意？来自多智能体协作的启示

While AI agents show potential in scientific ideation, most existing frameworks rely on single-agent refinement, limiting creativity due to bounded knowledge and perspective. Inspired by real-world research dynamics, this paper investigates whether structured multi-agent discussions can surpass solitary ideation. We propose a cooperative multi-agent framework for generating research proposals and systematically compare configurations including group size, leader-led versus leaderless structures, and team compositions varying in interdisciplinarity and seniority. To assess idea quality, we employ a comprehensive protocol with agent-based scoring and human review across dimensions such as novelty, strategic vision, and integration depth. Our results show that multi-agent discussions substantially outperform solitary baselines. A designated leader acts as a catalyst, transforming discussion into more integrated and visionary proposals. Notably, we find that cognitive diversity is a primary driver of quality, yet expertise is a non-negotiable prerequisite, as teams lacking a foundation of senior knowledge fail to surpass even a single competent agent. These findings offer actionable insights for designing collaborative AI ideation systems and shed light on how team structure influences creative outcomes. 尽管人工智能代理在科学构思方面展现出潜力，但大多数现有框架依赖于单一代理的迭代改进，因知识和视角的局限性而限制了创造力。受现实世界研究动态的启发，本文探讨了结构化多代理讨论是否能够超越单独构思。我们提出了一个用于生成研究提案的合作多代理框架，并系统地比较了包括团队规模、领导主导与无领导结构，以及跨学科和资历多样化的团队组成等配置。为了评估创意质量，我们采用了一个综合协议，结合基于代理的评分和人类评审，涵盖新颖性、战略视野和整合深度等维度。结果显示，多代理讨论显著优于单独基线。指定的领导者充当催化剂，将讨论转化为更具整合性和远见性的提案。值得注意的是，我们发现认知多样性是质量的主要驱动力，但专业知识是不可或缺的前提，因为缺乏资深知识基础的团队甚至无法超越单个有能力的代理。这些发现为设计协作式人工智能创意系统提供了可操作的见解，并揭示了团队结构如何影响创造性成果。

Subjects: Computation and Language, Artificial Intelligence, Computers and Society 主题：计算与语言，人工智能，计算机与社会

Publish: 2025-08-06 15:59:18 UTC 发布时间：2025-08-06 15:59:18 UTC

#13 Unveiling the Landscape of Clinical Depression Assessment: From Behavioral Signatures to Psychiatric Reasoning 揭示临床抑郁评估的全景：从行为特征到精神病学推理

Depression is a widespread mental disorder that affects millions worldwide. While automated depression assessment shows promise, most studies rely on limited or non-clinically validated data, and often prioritize complex model design over real-world effectiveness. In this paper, we aim to unveil the landscape of clinical depression assessment. We introduce C-MIND, a clinical neuropsychiatric multimodal diagnosis dataset collected over two years from real hospital visits. Each participant completes three structured psychiatric tasks and receives a final diagnosis from expert clinicians, with informative audio, video, transcript, and functional near-infrared spectroscopy (fNIRS) signals recorded. Using C-MIND, we first analyze behavioral signatures relevant to diagnosis. We train a range of classical models to quantify how different tasks and modalities contribute to diagnostic performance, and dissect the effectiveness of their combinations. We then explore whether LLMs can perform psychiatric reasoning like clinicians and identify their clear limitations in realistic clinical settings. In response, we propose guiding the reasoning process with clinical expertise, which consistently improves LLM diagnostic performance by up to 10% in Macro-F1 score. We aim to build an infrastructure for clinical depression assessment from both data and algorithmic perspectives, enabling C-MIND to facilitate grounded and reliable research for mental healthcare. 抑郁症是一种广泛存在的精神障碍，影响着全球数百万人。尽管自动化抑郁评估显示出潜力，但大多数研究依赖于有限或未经临床验证的数据，且常常优先考虑复杂模型设计而非实际效果。本文旨在揭示临床抑郁评估的全貌。我们引入了 C-MIND，这是一个临床神经精神多模态诊断数据集，收集自两年内真实医院就诊的病例。每位参与者完成三项结构化精神病学任务，并由专家临床医生给出最终诊断，同时记录了有信息量的音频、视频、文字记录及功能性近红外光谱（fNIRS）信号。基于 C-MIND，我们首先分析了与诊断相关的行为特征。我们训练了一系列经典模型，以量化不同任务和模态对诊断性能的贡献，并剖析它们组合的有效性。随后，我们探讨了 LLMs 是否能够像临床医生一样进行精神病学推理，并识别出它们在现实临床环境中的明显局限性。作为回应，我们提出用临床专业知识引导推理过程，并持续将 LLM 的诊断性能提升最多达 10%的 Macro-F1 分数。我们的目标是从数据和算法两个角度构建临床抑郁评估的基础设施，使 C-MIND 能够促进有依据且可靠的心理健康研究。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-06 15:13:24 UTC 发布：2025-08-06 15:13:24 UTC

#14 StyliTruth : Unlocking Stylized yet Truthful LLM Generation via Disentangled Steering StyliTruth：通过解耦引导实现风格化且真实的 LLM 生成

Authors: [Chenglei Shen](https://arxiv.org/search/?searchtype=author&query=Chenglei Shen), [Zhongxiang Sun](https://arxiv.org/search/?searchtype=author&query=Zhongxiang Sun), [Teng Shi](https://arxiv.org/search/?searchtype=author&query=Teng Shi), [Xiao Zhang](https://arxiv.org/search/?searchtype=author&query=Xiao Zhang), [Jun Xu](https://arxiv.org/search/?searchtype=author&query=Jun Xu) 作者：沈成磊，孙忠祥，石腾，张晓，徐军

Generating stylized large language model (LLM) responses via representation editing is a promising way for fine-grained output control. However, there exists an inherent trade-off: imposing a distinctive style often degrades truthfulness. Existing representation editing methods, by naively injecting style signals, overlook this collateral impact and frequently contaminate the model’s core truthfulness representations, resulting in reduced answer correctness. We term this phenomenon stylization-induced truthfulness collapse. We attribute this issue to latent coupling between style and truth directions in certain key attention heads, and propose StyliTruth, a mechanism that preserves stylization while keeping truthfulness intact. StyliTruth separates the style-relevant and truth-relevant subspaces in the model’s representation space via an orthogonal deflation process. This decomposition enables independent control of style and truth in their own subspaces, minimizing interference. By designing adaptive, token-level steering vectors within each subspace, we dynamically and precisely control the generation process to maintain both stylistic fidelity and truthfulness. We validate our method on multiple styles and languages. Extensive experiments and analyses show that StyliTruth significantly reduces stylization-induced truthfulness collapse and outperforms existing inference-time intervention methods in balancing style adherence with truthfulness. 通过表示编辑生成风格化的大型语言模型（LLM）响应是一种实现细粒度输出控制的有前景的方法。然而，存在一个固有的权衡：施加独特风格往往会降低真实性。现有的表示编辑方法通过简单地注入风格信号，忽视了这种附带影响，且经常污染模型核心的真实性表示，导致答案正确性下降。我们将这一现象称为风格化引发的真实性崩溃。我们将此问题归因于某些关键注意力头中风格方向与真实性方向的潜在耦合，并提出了 StyliTruth 机制，该机制在保持风格化的同时保持真实性不变。StyliTruth 通过正交消减过程分离模型表示空间中的风格相关子空间和真实性相关子空间。这种分解使得风格和真实性能够在各自的子空间中独立控制，最大限度地减少干扰。通过在每个子空间内设计自适应的、基于 token 的引导向量，我们动态且精确地控制生成过程，以保持风格的忠实度和真实性。我们在多种风格和语言上验证了我们的方法。大量实验和分析表明，StyliTruth 显著减少了因风格化引起的真实性崩溃，并且在平衡风格遵循性与真实性方面优于现有的推理时干预方法。
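
As a rough intuition for the orthogonal deflation step, the sketch below removes from a style direction its component along a truth direction, so that steering along the result no longer moves the representation along the truth axis. This is a single-vector, Gram-Schmidt-style simplification written for illustration; the paper works with subspaces in selected attention heads, and the vectors and function names here are hypothetical.

```python
def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def deflate(style_dir, truth_dir):
    """Return style_dir minus its projection onto truth_dir."""
    denom = dot(truth_dir, truth_dir)
    if denom == 0:
        return list(style_dir)  # nothing to project out
    scale = dot(style_dir, truth_dir) / denom
    return [s - scale * t for s, t in zip(style_dir, truth_dir)]


# Toy 3-D directions: the style vector partly overlaps the truth axis.
style = [1.0, 1.0, 0.0]
truth = [1.0, 0.0, 0.0]
deflated = deflate(style, truth)
```

After deflation the style direction is orthogonal to the truth direction, which is the property that lets the two be steered independently.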

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-06 15:12:05 UTC 发布：2025-08-06 15:12:05 UTC

#15 CALE : Concept-Aligned Embeddings for Both Within-Lemma and Inter-Lemma Sense Differentiation CALE：用于词内和词间语义区分的概念对齐嵌入

Authors: [Bastien Liétard](https://arxiv.org/search/?searchtype=author&query=Bastien Liétard), [Gabriel Loiseau](https://arxiv.org/search/?searchtype=author&query=Gabriel Loiseau) 作者：Bastien Liétard，Gabriel Loiseau

Lexical semantics is concerned with both the multiple senses a word can adopt in different contexts, and the semantic relations that exist between meanings of different words. To investigate them, Contextualized Language Models are a valuable tool that provides context-sensitive representations that can be used to investigate lexical meaning. Recent works like XL-LEXEME have leveraged the task of Word-in-Context to fine-tune them to get more semantically accurate representations, but Word-in-Context only compares occurrences of the same lemma, limiting the range of captured information. In this paper, we propose an extension, Concept Differentiation, to include inter-word scenarios. We provide a dataset for this task, derived from SemCor data. Then we fine-tune several representation models on this dataset. We call these models Concept-Aligned Embeddings (CALE). By challenging our models and other models on various lexical semantic tasks, we demonstrate that the proposed models provide efficient multi-purpose representations of lexical meaning that achieve the best performance in our experiments. We also show that CALE’s fine-tuning brings valuable changes to the spatial organization of embeddings. 词汇语义学关注的是一个词在不同语境中可以采用的多重含义，以及不同词义之间存在的语义关系。为了研究这些内容，语境化语言模型是一种宝贵的工具，它提供了可用于探究词汇意义的上下文敏感表示。近期的工作如 XL-LEXEME 利用词语语境任务对模型进行微调，以获得更语义准确的表示，但词语语境任务仅比较同一词元的出现，限制了捕获信息的范围。本文提出了一种扩展方法——概念区分，以涵盖词间场景。我们提供了一个基于 SemCor 数据的该任务数据集。随后，我们在该数据集上微调了多个表示模型，称之为概念对齐嵌入（Concept-Aligned Embeddings，CALE）。通过在各种词汇语义任务中对我们的模型及其他模型进行测试，我们证明了所提模型能够提供高效的多用途词汇意义表示，并在实验中达到最佳性能。我们还展示了 CALE 的微调为嵌入的空间组织带来了有价值的变化。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-06 14:43:22 UTC 发布时间：2025-08-06 14:43:22 UTC

#16 Automated Generation of Curriculum-Aligned Multiple-Choice Questions for Malaysian Secondary Mathematics Using Generative AI 使用生成式人工智能自动生成符合课程标准的马来西亚中学数学多项选择题

Authors: [Rohaizah Abdul Wahid](https://arxiv.org/search/?searchtype=author&query=Rohaizah Abdul Wahid), [Muhamad Said Nizamuddin Nadim](https://arxiv.org/search/?searchtype=author&query=Muhamad Said Nizamuddin Nadim), [Suliana Sulaiman](https://arxiv.org/search/?searchtype=author&query=Suliana Sulaiman), [Syahmi Akmal Shaharudin](https://arxiv.org/search/?searchtype=author&query=Syahmi Akmal Shaharudin), [Muhammad Danial Jupikil](https://arxiv.org/search/?searchtype=author&query=Muhammad Danial Jupikil), [Iqqwan Jasman Su Azlan Su](https://arxiv.org/search/?searchtype=author&query=Iqqwan Jasman Su Azlan Su) 作者：Rohaizah Abdul Wahid, Muhamad Said Nizamuddin Nadim, Suliana Sulaiman, Syahmi Akmal Shaharudin, Muhammad Danial Jupikil, Iqqwan Jasman Su Azlan Su

This paper addresses the critical need for scalable and high-quality educational assessment tools within the Malaysian education system. It highlights the potential of Generative AI (GenAI) while acknowledging the significant challenges of ensuring factual accuracy and curriculum alignment, especially for low-resource languages like Bahasa Melayu. This research introduces and compares four incremental pipelines for generating Form 1 Mathematics multiple-choice questions (MCQs) in Bahasa Melayu using OpenAI’s GPT-4o. The methods range from non-grounded prompting (structured and basic) to Retrieval-Augmented Generation (RAG) approaches (one using the LangChain framework, one implemented manually). The system is grounded in official curriculum documents, including teacher-prepared notes and the yearly teaching plan (RPT). A dual-pronged automated evaluation framework is employed to assess the generated questions. Curriculum alignment is measured using Semantic Textual Similarity (STS) against the RPT, while contextual validity is verified through a novel RAG-based Question-Answering (RAG-QA) method. The results demonstrate that RAG-based pipelines significantly outperform non-grounded prompting methods, producing questions with higher curriculum alignment and factual validity. The study further analyzes the trade-offs between the ease of implementation of framework-based RAG and the fine-grained control offered by a manual pipeline. This work presents a validated methodology for generating curriculum-specific educational content in a low-resource language, introduces a symbiotic RAG-QA evaluation technique, and provides actionable insights for the development and deployment of practical EdTech solutions in Malaysia and similar regions. 
本文针对马来西亚教育体系中对可扩展且高质量教育评估工具的迫切需求进行了探讨。文章强调了生成式人工智能（GenAI）的潜力，同时也承认了确保事实准确性和课程对齐的重大挑战，尤其是对于像马来语这样资源匮乏的语言。本研究介绍并比较了四种逐步生成马来语第一学年数学选择题（MCQs）的流程，均基于 OpenAI 的 GPT-4o。这些方法涵盖了非基于知识的提示（结构化和基础）到检索增强生成（RAG）方法（其中一种使用 LangChain 框架，另一种为手动实现）。系统以官方课程文件为基础，包括教师准备的笔记和年度教学计划（RPT）。采用双重自动评估框架对生成的问题进行评估。课程对齐通过与 RPT 的语义文本相似度（STS）进行衡量，而上下文有效性则通过一种新颖的基于 RAG 的问题回答（RAG-QA）方法进行验证。结果表明，基于 RAG 的流程显著优于非基于知识的提示方法，生成的问题在课程匹配度和事实有效性方面更高。研究进一步分析了基于框架的 RAG 易于实现与手动流程提供的细粒度控制之间的权衡。该工作提出了一种经过验证的方法，用于在低资源语言中生成特定课程的教育内容，介绍了一种共生的 RAG-QA 评估技术，并为马来西亚及类似地区实用教育技术解决方案的开发和部署提供了可操作的见解。
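
Curriculum alignment via Semantic Textual Similarity typically comes down to comparing embedding vectors. The sketch below scores a generated question by its highest cosine similarity to any curriculum entry. It assumes embeddings have already been computed by some sentence-embedding model (not specified in the abstract); the toy vectors and the `alignment_score` helper are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally sized vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def alignment_score(question_vec, curriculum_vecs):
    """Highest similarity between the question and any curriculum entry."""
    return max(cosine_similarity(question_vec, v) for v in curriculum_vecs)


# Toy 2-D embeddings standing in for a question and two teaching-plan entries.
question = [1.0, 0.0]
plan_entries = [[0.0, 1.0], [1.0, 1.0]]
score = alignment_score(question, plan_entries)
```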

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-06 13:30:51 UTC 发布：2025-08-06 13:30:51 UTC

#17 StepFun-Formalizer: Unlocking the Autoformalization Potential of LLMs through Knowledge-Reasoning Fusion StepFun-Formalizer：通过知识推理融合释放 LLMs 的自动形式化潜力

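Autoformalization aims to translate natural-language mathematical statements into a formal language. While LLMs have accelerated progress in this area, existing methods still suffer from low accuracy. We identify two key abilities for effective autoformalization: comprehensive mastery of formal-language domain knowledge, and the reasoning capability for natural-language problem understanding and informal-formal alignment. Without the former, a model cannot identify the correct formal objects; without the latter, it struggles to interpret real-world contexts and map them precisely into formal expressions. To address these gaps, we introduce ThinkingF, a data synthesis and training pipeline that improves both abilities. First, we construct two datasets: one by distilling and selecting large-scale examples rich in formal knowledge, and another by generating informal-to-formal reasoning trajectories guided by expert-designed templates. We then apply SFT and RLVR with these datasets to further fuse and refine the two abilities. The resulting 7B and 32B models exhibit both comprehensive formal knowledge and strong informal-to-formal reasoning. Notably, StepFun-Formalizer-32B achieves SOTA BEq@1 scores of 40.5% on FormalMATH-Lite and 26.7% on ProverBench, surpassing all prior general-purpose and specialized models. 自动形式化旨在将自然语言的数学陈述翻译成形式语言。虽然 LLMs 加速了该领域的进展，但现有方法仍存在准确率低的问题。我们确定了有效自动形式化的两个关键能力：对形式语言领域知识的全面掌握，以及自然语言问题理解和非正式-正式对齐的推理能力。缺乏前者，模型无法识别正确的形式对象；缺乏后者，模型难以解释现实世界的语境并将其精确映射为形式表达。为解决这些不足，我们引入了 ThinkingF，一种数据合成和训练流程，提升这两种能力。首先，我们构建了两个数据集：一个通过提炼和筛选大量富含形式知识的示例，另一个通过专家设计的模板指导生成非正式到正式的推理轨迹。随后，我们利用这些数据集进行 SFT 和 RLVR 训练，进一步融合和优化这两种能力。最终得到的 7B 和 32B 模型既具备全面的形式知识，又拥有强大的非正式到正式推理能力。值得注意的是，StepFun-Formalizer-32B 在 FormalMATH-Lite 上取得了 40.5% 的 SOTA BEq@1 分数，在 ProverBench 上取得了 26.7%，超越了所有先前的通用和专用模型。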

Publish: 2025-08-06 13:28:22 UTC 发布时间：2025-08-06 13:28:22 UTC

#18 Evaluating, Synthesizing, and Enhancing for Customer Support Conversation 评估、综合与增强客户支持对话

Authors: [Jie Zhu](https://arxiv.org/search/?searchtype=author&query=Jie Zhu), [Huaixia Dou](https://arxiv.org/search/?searchtype=author&query=Huaixia Dou), [Junhui Li](https://arxiv.org/search/?searchtype=author&query=Junhui Li), [Lifan Guo](https://arxiv.org/search/?searchtype=author&query=Lifan Guo), [Feng Chen](https://arxiv.org/search/?searchtype=author&query=Feng Chen), [Chi Zhang](https://arxiv.org/search/?searchtype=author&query=Chi Zhang), [Fang Kong](https://arxiv.org/search/?searchtype=author&query=Fang Kong) 作者：朱杰、窦怀霞、李俊辉、郭立凡、陈峰、张驰、孔芳

Effective customer support requires not only accurate problem solving but also structured and empathetic communication aligned with professional standards. However, existing dialogue datasets often lack strategic guidance, and real-world service data is difficult to access and annotate. To address this, we introduce the task of Customer Support Conversation (CSC), aimed at training customer service agents to respond using well-defined support strategies. We propose a structured CSC framework grounded in COPC guidelines, defining five conversational stages and twelve strategies to guide high-quality interactions. Based on this, we construct CSConv, an evaluation dataset of 1,855 real-world customer-agent conversations rewritten using LLMs to reflect deliberate strategy use, and annotated accordingly. Additionally, we develop a role-playing approach that simulates strategy-rich conversations using LLM-powered roles aligned with the CSC framework, resulting in the training dataset RoleCS. Experiments show that fine-tuning strong LLMs on RoleCS significantly improves their ability to generate high-quality, strategy-aligned responses on CSConv. Human evaluations further confirm gains in problem resolution. All code and data will be made publicly available at https://github.com/aliyun/qwen-dianjin. 有效的客户支持不仅需要准确的问题解决，还需要符合专业标准的结构化且富有同理心的沟通。然而，现有的对话数据集往往缺乏策略指导，且真实服务数据难以获取和标注。为此，我们引入了客户支持对话（CSC）任务，旨在训练客服人员使用明确的支持策略进行回应。我们提出了基于 COPC 指南的结构化 CSC 框架，定义了五个对话阶段和十二种策略，以指导高质量的互动。在此基础上，我们构建了 CSConv 评估数据集，包含 1,855 条真实客户与客服的对话，这些对话通过 LLMs 重写以体现有意的策略使用，并进行了相应标注。此外，我们开发了一种角色扮演方法，利用符合 CSC 框架的 LLM 驱动角色模拟富含策略的对话，生成训练数据集 RoleCS。实验表明，在 RoleCS 上微调强大的 LLMs 显著提升了它们在 CSConv 上生成高质量、符合策略的回应的能力。人工评估进一步确认了问题解决的提升。所有代码和数据将公开发布于 https://github.com/aliyun/qwen-dianjin。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-06 13:11:17 UTC 发布时间：2025-08-06 13:11:17 UTC

#19 Dialogue Response Prefetching Based on Semantic Similarity and Prediction Confidence of Language Model

Authors: [Kiyotada Mori](https://arxiv.org/search/?searchtype=author&query=Kiyotada Mori), [Seiya Kawano](https://arxiv.org/search/?searchtype=author&query=Seiya Kawano), [Angel Fernando Garcia Contreras](https://arxiv.org/search/?searchtype=author&query=Angel Fernando Garcia Contreras), [Koichiro Yoshino](https://arxiv.org/search/?searchtype=author&query=Koichiro Yoshino)

Prefetching of dialogue responses has been investigated as a way to reduce user-perceived latency (UPL) in spoken dialogue systems, where UPL is the time a user waits before receiving the system’s response. Reducing UPL requires predicting the complete user utterance before the user finishes speaking, typically with a language model, so that a dialogue response can be prepared in advance. In this study, we propose a prediction confidence model (PCM) that determines whether prefetching is possible by estimating the semantic similarity between the predicted complete user utterance and the actual complete user utterance. We evaluated our PCM based on the differences between the predicted and the actual complete user utterances.

Subject: Computation and Language

Publish: 2025-08-06 12:45:09 UTC
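
The prefetch decision in #19 can be sketched as a simple gate. This is a sketch under stated assumptions rather than the authors' implementation: the abstract does not specify the PCM's inputs or a decision threshold. The names `predict_completion`, `estimate_similarity`, and `threshold`, and the stub functions standing in for the language model and the PCM, are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PrefetchDecision:
    prefetch: bool            # whether to prepare a response in advance
    predicted_utterance: str  # the predicted complete user utterance
    confidence: float         # estimated similarity to the eventual utterance


def decide_prefetch(
    partial_utterance: str,
    predict_completion: Callable[[str], str],
    estimate_similarity: Callable[[str, str], float],
    threshold: float = 0.8,  # assumption: the abstract gives no threshold
) -> PrefetchDecision:
    """Prefetch only when the estimated similarity clears the threshold."""
    predicted = predict_completion(partial_utterance)
    confidence = estimate_similarity(partial_utterance, predicted)
    return PrefetchDecision(confidence >= threshold, predicted, confidence)


# Stubs standing in for the language model and the PCM (hypothetical).
def stub_language_model(partial: str) -> str:
    return partial + " in Tokyo tomorrow"


def stub_confident_pcm(partial: str, predicted: str) -> float:
    return 0.9


def stub_unsure_pcm(partial: str, predicted: str) -> float:
    return 0.5


decision = decide_prefetch("What is the weather", stub_language_model, stub_confident_pcm)
print(decision)
```

When the gate returns `prefetch=False`, the system would simply wait for the end of the user's speech and respond as usual.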

#20 What Do Humans Hear When Interacting? Experiments on Selective Listening for Evaluating ASR of Spoken Dialogue Systems

Authors: [Kiyotada Mori](https://arxiv.org/search/?searchtype=author&query=Kiyotada Mori), [Seiya Kawano](https://arxiv.org/search/?searchtype=author&query=Seiya Kawano), [Chaoran Liu](https://arxiv.org/search/?searchtype=author&query=Chaoran Liu), [Carlos Toshinori Ishi](https://arxiv.org/search/?searchtype=author&query=Carlos Toshinori Ishi), [Angel Fernando Garcia Contreras](https://arxiv.org/search/?searchtype=author&query=Angel Fernando Garcia Contreras), [Koichiro Yoshino](https://arxiv.org/search/?searchtype=author&query=Koichiro Yoshino)

Spoken dialogue systems (SDSs) use automatic speech recognition (ASR) at the front end of their pipeline. The role of ASR in SDSs is to appropriately recognize the information in user speech that is relevant to response generation. Examining human selective listening, the ability to focus on and listen to the important parts of a conversation, will enable us to identify and evaluate the ASR capabilities required for SDSs. In this study, we experimentally confirmed selective listening when humans generate dialogue responses by comparing human transcriptions made for generating dialogue responses with reference transcriptions. Based on our experimental results, we discuss the possibility of a new ASR evaluation method that leverages human selective listening and can identify the gap in transcription ability between ASR systems and humans.

Subject: Computation and Language

Publish: 2025-08-06 12:44:57 UTC

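#21 Why are LLMs' abilities emergent?

The remarkable success of large language models (LLMs) in generative tasks has raised fundamental questions about the nature of their acquired capabilities, which often appear unexpectedly without explicit training. This paper examines the emergent properties of deep neural networks (DNNs) through theoretical analysis and empirical observation, addressing the epistemological challenge of "creation without understanding" in contemporary AI development. We explore how neural approaches, which rely on nonlinear, stochastic processes, differ fundamentally from symbolic computational paradigms, creating systems whose macro-level behaviors cannot be analytically derived from micro-level neuron activities. Through analysis of scaling laws, grokking phenomena, and phase transitions in model capabilities, I demonstrate that emergent abilities arise from the complex dynamics of highly sensitive nonlinear systems rather than from parameter scaling alone. My investigation reveals that current debates over metrics, pre-training loss thresholds, and in-context learning miss the fundamental ontological nature of emergence in DNNs. I argue that these systems exhibit genuine emergent properties analogous to those found in other complex natural phenomena, where systemic capabilities emerge from cooperative interactions among simple components and cannot be reduced to their individual behaviors. The paper concludes that understanding LLM capabilities requires recognizing DNNs as a new domain of complex dynamical systems governed by universal principles of emergence, similar to those operating in physics, chemistry, and biology. This perspective shifts the focus from purely phenomenological definitions of emergence to understanding the internal dynamic transformations that enable these systems to acquire capabilities that transcend their individual components.

Publish: 2025-08-06 12:43:04 UTC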

#22 Improving Crash Data Quality with Large Language Models: Evidence from Secondary Crash Narratives in Kentucky

Authors: [Xu Zhang](https://arxiv.org/search/?searchtype=author&query=Xu Zhang), [Mei Chen](https://arxiv.org/search/?searchtype=author&query=Mei Chen)

This study evaluates advanced natural language processing (NLP) techniques to enhance crash data quality by mining crash narratives, using secondary crash identification in Kentucky as a case study. Drawing from 16,656 manually reviewed narratives from 2015-2022, with 3,803 confirmed secondary crashes, we compare three model classes: zero-shot open-source large language models (LLMs) (LLaMA3:70B, DeepSeek-R1:70B, Qwen3:32B, Gemma3:27B); fine-tuned transformers (BERT, DistilBERT, RoBERTa, XLNet, Longformer); and traditional logistic regression as a baseline. Models were calibrated on 2015-2021 data and tested on 1,771 narratives from 2022. Fine-tuned transformers achieved superior performance, with RoBERTa yielding the highest F1-score (0.90) and accuracy (95%). Zero-shot LLaMA3:70B reached a comparable F1 of 0.86 but required 139 minutes of inference; the logistic baseline lagged well behind (F1: 0.66). LLMs excelled in recall for some variants (e.g., Gemma3:27B at 0.94) but incurred high computational costs (up to 723 minutes for DeepSeek-R1:70B), while fine-tuned models processed the test set in seconds after brief training. Further analysis indicated that mid-sized LLMs (e.g., DeepSeek-R1:32B) can rival larger counterparts in performance while reducing runtime, suggesting opportunities for optimized deployments. Results highlight trade-offs between accuracy, efficiency, and data requirements, with fine-tuned transformer models balancing precision and recall effectively on Kentucky data. Practical deployment considerations emphasize privacy-preserving local deployment, ensemble approaches for improved accuracy, and incremental processing for scalability, providing a replicable scheme for enhancing crash-data quality with advanced NLP.

Subjects: Computation and Language, Artificial Intelligence, Information Retrieval, Machine Learning

Publish: 2025-08-06 12:41:18 UTC

#23 AIC CTU@FEVER 8: On-premise fact checking through long context RAG

Authors: [Herbert Ullrich](https://arxiv.org/search/?searchtype=author&query=Herbert Ullrich), [Jan Drchal](https://arxiv.org/search/?searchtype=author&query=Jan Drchal)

In this paper, we present our fact-checking pipeline, which placed first in the FEVER 8 shared task. Our fact-checking system is a simple two-step RAG pipeline based on our submission from last year. We show how the pipeline can be redeployed on-premise, achieving state-of-the-art fact-checking performance (in terms of the Ev2R test score), even under the constraint of a single NVIDIA A10 GPU, 23GB of GPU memory and 60s running time per claim.

Subjects: Computation and Language, Artificial Intelligence

Publish: 2025-08-05 14:03:43 UTC

#24 Chain of Questions: Guiding Multimodal Curiosity in Language Models

Authors: [Nima Iji](https://arxiv.org/search/?searchtype=author&query=Nima Iji), [Kia Dashtipour](https://arxiv.org/search/?searchtype=author&query=Kia Dashtipour)

Reasoning capabilities in large language models (LLMs) have substantially advanced through methods such as chain-of-thought and explicit step-by-step explanations. However, these improvements have not yet fully transitioned to multimodal contexts, where models must proactively decide which sensory modalities such as vision, audio, or spatial perception to engage when interacting with complex real-world environments. In this paper, we introduce the Chain of Questions (CoQ) framework, a curiosity-driven reasoning approach that encourages multimodal language models to dynamically generate targeted questions regarding their surroundings. These generated questions guide the model to selectively activate relevant modalities, thereby gathering critical information necessary for accurate reasoning and response generation. We evaluate our framework on a novel multimodal benchmark dataset, assembled by integrating WebGPT, ScienceQA, AVSD, and ScanQA datasets. Experimental results demonstrate that our CoQ method improves a foundation model’s ability to effectively identify and integrate pertinent sensory information. This leads to improved accuracy, interpretability, and alignment of the reasoning process with diverse multimodal tasks.

Subjects: Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning, Multiagent Systems

Publish: 2025-08-06 11:42:54 UTC

#25 GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy

Authors: [Hongze Tan](https://arxiv.org/search/?searchtype=author&query=Hongze Tan), [Jianfei Pan](https://arxiv.org/search/?searchtype=author&query=Jianfei Pan)

Reinforcement learning (RL) with algorithms like Group Relative Policy Optimization (GRPO) improves Large Language Model (LLM) reasoning, but is limited by coarse-grained credit assignment that applies a uniform reward to all tokens in a sequence. This is a major flaw in long-chain reasoning tasks. This paper solves this with Dynamic Entropy Weighting. Our core idea is that high-entropy tokens in correct responses can guide the policy toward a higher performance ceiling. This allows us to create more fine-grained reward signals for precise policy updates in two ways: 1) Group Token Policy Optimization (GTPO), which assigns an entropy-weighted reward to each token for fine-grained credit assignment; 2) Sequence-Level Group Relative Policy Optimization (GRPO-S), which assigns an entropy-weighted reward to each sequence based on its average token entropy. Experiments show our methods significantly outperform the strong DAPO baseline. The results confirm that our entropy-weighting mechanism is the key driver of this performance boost, offering a better path to enhance deep reasoning in models.

Subjects: Computation and Language, Artificial Intelligence

Publish: 2025-08-06 11:42:47 UTC
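
The idea in #25 of replacing a uniform per-token reward with an entropy-weighted one can be illustrated with a small sketch. The abstract does not give the paper's formulas, so the proportional weighting below is an assumption chosen for illustration only, and all function names are hypothetical. The first function shapes rewards at the token level; the last computes the average token entropy that a sequence-level variant would use.

```python
import math
from typing import List


def token_entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return sum(-p * math.log(p) for p in probs if p > 0.0)


def entropy_weighted_token_rewards(seq_reward: float, entropies: List[float]) -> List[float]:
    """Spread one sequence-level reward over tokens in proportion to entropy.

    Illustrative weighting (assumption): the mean of the returned per-token
    rewards equals seq_reward, so only the distribution of credit changes.
    """
    n = len(entropies)
    if n == 0:
        return []
    total = sum(entropies)
    if total == 0.0:
        return [seq_reward] * n  # no entropy signal: fall back to uniform credit
    return [seq_reward * n * h / total for h in entropies]


def mean_token_entropy(entropies: List[float]) -> float:
    """Average token entropy of a sequence (input to sequence-level weighting)."""
    return sum(entropies) / len(entropies) if entropies else 0.0


# Three decoding steps: uncertain, fairly confident, fully confident.
step_distributions = [[0.5, 0.5], [0.9, 0.1], [1.0]]
entropies = [token_entropy(d) for d in step_distributions]
token_rewards = entropy_weighted_token_rewards(1.0, entropies)
print(token_rewards, mean_token_entropy(entropies))
```

Under this weighting, tokens generated with higher uncertainty receive a larger share of the credit, which is the qualitative behavior the abstract describes.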

#26 Modelling and Classifying the Components of a Literature Review

Authors: [Francisco Bolaños](https://arxiv.org/search/?searchtype=author&query=Francisco Bolaños), [Angelo Salatino](https://arxiv.org/search/?searchtype=author&query=Angelo Salatino), [Francesco Osborne](https://arxiv.org/search/?searchtype=author&query=Francesco Osborne), [Enrico Motta](https://arxiv.org/search/?searchtype=author&query=Enrico Motta)

Previous work has demonstrated that AI methods for analysing scientific literature benefit significantly from annotating sentences in papers according to their rhetorical roles, such as research gaps, results, limitations, extensions of existing methodologies, and others. Such representations also have the potential to support the development of a new generation of systems capable of producing high-quality literature reviews. However, achieving this goal requires the definition of a relevant annotation schema and effective strategies for large-scale annotation of the literature. This paper addresses these challenges by 1) introducing a novel annotation schema specifically designed to support literature review generation and 2) conducting a comprehensive evaluation of a wide range of state-of-the-art large language models (LLMs) in classifying rhetorical roles according to this schema. To this end, we also present Sci-Sentence, a novel multidisciplinary benchmark comprising 700 sentences manually annotated by domain experts and 2,240 sentences automatically labelled using LLMs. We evaluate 37 LLMs on this benchmark, spanning diverse model families and sizes, using both zero-shot learning and fine-tuning approaches. The experiments yield several novel insights that advance the state of the art in this challenging domain. First, the current generation of LLMs performs remarkably well on this task when fine-tuned on high-quality data, achieving performance levels above 96% F1. Second, while large proprietary models like GPT-4o achieve the best results, some lightweight open-source alternatives also demonstrate excellent performance. Finally, enriching the training data with semi-synthetic examples generated by LLMs proves beneficial, enabling small encoders to achieve robust results and significantly enhancing the performance of several open decoder models.

Subjects: Computation and Language, Artificial Intelligence, Human-Computer Interaction, Information Retrieval

Publish: 2025-08-06 11:30:07 UTC

#27 Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models

Large language models (LLMs) show significant potential in healthcare, prompting numerous benchmarks to evaluate their capabilities. However, concerns persist regarding the reliability of these benchmarks, which often lack clinical fidelity, robust data management, and safety-oriented evaluation metrics. To address these shortcomings, we introduce MedCheck, the first lifecycle-oriented assessment framework specifically designed for medical benchmarks. Our framework deconstructs a benchmark’s development into five continuous stages, from design to governance, and provides a comprehensive checklist of 46 medically-tailored criteria. Using MedCheck, we conducted an in-depth empirical evaluation of 53 medical LLM benchmarks. Our analysis uncovers widespread, systemic issues, including a profound disconnect from clinical practice, a crisis of data integrity due to unmitigated contamination risks, and a systematic neglect of safety-critical evaluation dimensions like model robustness and uncertainty awareness. Based on these findings, MedCheck serves as both a diagnostic tool for existing benchmarks and an actionable guideline to foster a more standardized, reliable, and transparent approach to evaluating AI in healthcare.

Subjects: Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning, Multimedia

Publish: 2025-08-06 11:11:40 UTC

#28 A Few Words Can Distort Graphs: Knowledge Poisoning Attacks on Graph-based Retrieval-Augmented Generation of Large Language Models

Authors: [Jiayi Wen](https://arxiv.org/search/?searchtype=author&query=Jiayi Wen), [Tianxin Chen](https://arxiv.org/search/?searchtype=author&query=Tianxin Chen), [Zhirun Zheng](https://arxiv.org/search/?searchtype=author&query=Zhirun Zheng), [Cheng Huang](https://arxiv.org/search/?searchtype=author&query=Cheng Huang)

Graph-based Retrieval-Augmented Generation (GraphRAG) has recently emerged as a promising paradigm for enhancing large language models (LLMs) by converting raw text into structured knowledge graphs, improving both accuracy and explainability. However, GraphRAG relies on LLMs to extract knowledge from raw text during graph construction, and this process can be maliciously manipulated to implant misleading information. Targeting this attack surface, we propose two knowledge poisoning attacks (KPAs) and demonstrate that modifying only a few words in the source text can significantly change the constructed graph, poison the GraphRAG, and severely mislead downstream reasoning. The first attack, named Targeted KPA (TKPA), utilizes graph-theoretic analysis to locate vulnerable nodes in the generated graphs and rewrites the corresponding narratives with LLMs, achieving precise control over specific question-answering (QA) outcomes with a success rate of 93.1%, while keeping the poisoned text fluent and natural. The second attack, named Universal KPA (UKPA), exploits linguistic cues such as pronouns and dependency relations to disrupt the structural integrity of the generated graph by altering globally influential words. With fewer than 0.05% of the full text modified, the QA accuracy collapses from 95% to 50%. Furthermore, experiments show that state-of-the-art defense methods fail to detect these attacks, highlighting that securing GraphRAG pipelines against knowledge poisoning remains largely unexplored.

Subjects: Computation and Language, Artificial Intelligence

Publish: 2025-08-06 10:01:26 UTC

#29 ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents

Authors: [Jiangyuan Wang](https://arxiv.org/search/?searchtype=author&query=Jiangyuan Wang), [Kejun Xiao](https://arxiv.org/search/?searchtype=author&query=Kejun Xiao), [Qi Sun](https://arxiv.org/search/?searchtype=author&query=Qi Sun), [Huaipeng Zhao](https://arxiv.org/search/?searchtype=author&query=Huaipeng Zhao), [Tao Luo](https://arxiv.org/search/?searchtype=author&query=Tao Luo), [Jiandong Zhang](https://arxiv.org/search/?searchtype=author&query=Jiandong Zhang), [Xiaoyi Zeng](https://arxiv.org/search/?searchtype=author&query=Xiaoyi Zeng)

Existing benchmarks in e-commerce primarily focus on basic user intents, such as finding or purchasing products. However, real-world users often pursue more complex goals, such as applying vouchers, managing budgets, and finding multi-product sellers. To bridge this gap, we propose ShoppingBench, a novel end-to-end shopping benchmark designed to encompass increasingly challenging levels of grounded intent. Specifically, we propose a scalable framework to simulate user instructions based on various intents derived from sampled real-world products. To facilitate consistent and reliable evaluations, we provide a large-scale shopping sandbox that serves as an interactive simulated environment, incorporating over 2.5 million real-world products. Experimental results demonstrate that even state-of-the-art language agents (such as GPT-4.1) achieve absolute success rates under 50% on our benchmark tasks, highlighting the significant challenges posed by ShoppingBench. In addition, we propose a trajectory distillation strategy and leverage supervised fine-tuning, along with reinforcement learning on synthetic trajectories, to distill the capabilities of a large language agent into a smaller one. As a result, our trained agent achieves competitive performance compared to GPT-4.1.

Subject: Computation and Language

Publish: 2025-08-06 09:51:30 UTC

#30 KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs

Authors: [Zunhai Su](https://arxiv.org/search/?searchtype=author&query=Zunhai Su), [Kehong Yuan](https://arxiv.org/search/?searchtype=author&query=Kehong Yuan)

Key-Value (KV) cache quantization has become a widely adopted optimization technique for efficient large language model (LLM) inference, reducing KV cache memory usage and mitigating memory-bound constraints. Recent studies have emphasized the importance of preserving the original precision of KVs for the first few tokens to ensure the protection of attention sinks. While this approach has proven effective in mitigating performance degradation, its underlying principles remain insufficiently understood. Moreover, it fails to address the recent discovery that attention sinks can emerge beyond the initial token positions. In this work, we elucidate the underlying mechanisms of attention sinks during inference by examining their role in the cross-layer evolution of extreme activation outliers. Additionally, we provide a comprehensive analysis of the interplay between attention sinks and KV cache quantization. Based on our enhanced understanding, we introduce KVSink, a plug-and-play method that effectively predicts sink tokens with negligible overhead, enabling more thorough preservation. Extensive experiments demonstrate that KVSink outperforms the existing Preserve-First-N (PFN) strategy, offering more effective preservation of attention sinks during KV cache quantization. Moreover, when applied to the well-established KVQuant method, KVSink further improves perplexity (PPL) and reduces reliance on 16-bit numerical outliers.

Subject: Computation and Language

Publish: 2025-08-06 09:40:09 UTC
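
The Preserve-First-N (PFN) baseline mentioned in #30 keeps the KVs of the first few tokens at their original precision while the rest of the cache is quantized. A minimal sketch of that idea follows. The symmetric uniform quantizer, the function names, and the toy cache are assumptions made for illustration. How KVSink predicts sink tokens is not described in the abstract, so the sketch simply accepts an arbitrary set of positions to preserve.

```python
from typing import List, Set


def fake_quantize(vec: List[float], bits: int = 4) -> List[float]:
    """Symmetric uniform quantize-dequantize of one vector (simulated low precision)."""
    levels = 2 ** (bits - 1) - 1
    max_abs = max((abs(x) for x in vec), default=0.0)
    if max_abs == 0.0:
        return list(vec)
    scale = max_abs / levels
    return [round(x / scale) * scale for x in vec]


def quantize_kv_cache(cache: List[List[float]], preserve: Set[int], bits: int = 4) -> List[List[float]]:
    """Quantize per-token vectors, leaving positions in `preserve` untouched."""
    return [
        list(vec) if pos in preserve else fake_quantize(vec, bits)
        for pos, vec in enumerate(cache)
    ]


def preserve_first_n(n: int) -> Set[int]:
    """PFN: protect the first n token positions."""
    return set(range(n))


# Toy cache: one vector per token position (hypothetical values).
cache = [[100.0, 0.3], [1.0, 0.26], [0.5, -0.5]]
pfn_cache = quantize_kv_cache(cache, preserve_first_n(1), bits=4)
print(pfn_cache)
```

Because `preserve` is just a set of positions, the same function could protect sinks that appear later in the sequence, for example `quantize_kv_cache(cache, {0, 2})`, which is the case the abstract says PFN misses.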

#31 TalkDep: Clinically Grounded LLM Personas for Conversation-Centric Depression Screening

Authors: [Xi Wang](https://arxiv.org/search/?searchtype=author&query=Xi Wang), [Anxo Perez](https://arxiv.org/search/?searchtype=author&query=Anxo Perez), [Javier Parapar](https://arxiv.org/search/?searchtype=author&query=Javier Parapar), [Fabio Crestani](https://arxiv.org/search/?searchtype=author&query=Fabio Crestani)

The increasing demand for mental health services has outpaced the availability of real training data to develop clinical professionals, leading to limited support for the diagnosis of depression. This shortage has motivated the development of simulated or virtual patients to assist in training and evaluation, but existing approaches often fail to generate clinically valid, natural, and diverse symptom presentations. In this work, we adopt recent advanced language models as the backbone and propose a novel clinician-in-the-loop patient simulation pipeline, TalkDep, which draws on diversified patient profiles to develop simulated patients. By conditioning the model on psychiatric diagnostic criteria, symptom severity scales, and contextual factors, our goal is to create authentic patient responses that can better support diagnostic model training and evaluation. We verify the reliability of these simulated patients with thorough assessments conducted by clinical professionals. The availability of validated simulated patients offers a scalable and adaptable resource for improving the robustness and generalisability of automatic depression diagnosis systems.

Subjects: Computation and Language, Artificial Intelligence

Publish: 2025-08-06 09:30:47 UTC

#32 DP-GPT4MTS: Dual-Prompt Large Language Model for Textual-Numerical Time Series Forecasting

Authors: [Chanjuan Liu](https://arxiv.org/search/?searchtype=author&query=Chanjuan Liu), [Shengzhi Wang](https://arxiv.org/search/?searchtype=author&query=Shengzhi Wang), [Enqiang Zhu](https://arxiv.org/search/?searchtype=author&query=Enqiang Zhu)

Time series forecasting is crucial in strategic planning and decision-making across various industries. Traditional forecasting models mainly concentrate on numerical time series data, often overlooking important textual information such as events and news, which can significantly affect forecasting accuracy. While large language models show promise for integrating multimodal data, existing single-prompt frameworks struggle to effectively capture the semantics of timestamped text, introducing redundant information that can hinder model performance. To address this limitation, we introduce DP-GPT4MTS (Dual-Prompt GPT2-base for Multimodal Time Series), a novel dual-prompt large language model framework that combines two complementary prompts: an explicit prompt for clear task instructions and a textual prompt for context-aware embeddings from timestamped data. The tokenizer generates the explicit prompt, while the embeddings from the textual prompt are refined through self-attention and feed-forward networks. Comprehensive experiments conducted on diverse textual-numerical time series datasets demonstrate that this approach outperforms state-of-the-art algorithms in time series forecasting. This highlights the significance of incorporating textual context via a dual-prompt mechanism to achieve more accurate time series predictions.

Subject: Computation and Language

Publish: 2025-08-06 09:25:05 UTC

#33 Hierarchical Text Classification Using Black Box Large Language Models #33 使用黑盒大型语言模型的层次文本分类

Authors: [Kosuke Yoshimura](https://arxiv.org/search/?searchtype=author&query=Kosuke Yoshimura), [Hisashi Kashima](https://arxiv.org/search/?searchtype=author&query=Hisashi Kashima) 作者：Kosuke Yoshimura，Hisashi Kashima

Hierarchical Text Classification (HTC) aims to assign texts to structured label hierarchies; however, it faces challenges due to data scarcity and model complexity. This study explores the feasibility of using black box Large Language Models (LLMs) accessed via APIs for HTC, as an alternative to traditional machine learning methods that require extensive labeled data and computational resources. We evaluate three prompting strategies – Direct Leaf Label Prediction (DL), Direct Hierarchical Label Prediction (DH), and Top-down Multi-step Hierarchical Label Prediction (TMH) – in both zero-shot and few-shot settings, comparing the accuracy and cost-effectiveness of these strategies. Experiments on two datasets show that a few-shot setting consistently improves classification accuracy compared to a zero-shot setting. While a traditional machine learning model achieves high accuracy on a dataset with a shallow hierarchy, LLMs, especially with the DH strategy, tend to outperform the machine learning model on a dataset with a deeper hierarchy. API costs increase significantly under the DH strategy because deeper label hierarchies require more input tokens. These results emphasize the trade-off between accuracy improvement and the computational cost of the prompt strategy. These findings highlight the potential of black box LLMs for HTC while underscoring the need to carefully select a prompt strategy to balance performance and cost. 层次文本分类（HTC）旨在将文本分配到结构化的标签层级中；然而，由于数据稀缺和模型复杂性，HTC 面临诸多挑战。本研究探讨了通过 API 访问的黑箱 LLMs 用于 HTC 的可行性，作为传统机器学习方法的替代方案，后者通常需要大量标注数据和计算资源。我们评估了三种提示策略——直接叶子标签预测（DL）、直接层次标签预测（DH）和自顶向下多步层次标签预测（TMH）——在零样本和少样本设置下的表现，比较了这些策略的准确性和成本效益。两组数据集上的实验表明，少样本设置相比零样本设置能持续提升分类准确率。虽然传统机器学习模型在层级较浅的数据集上取得了较高的准确率，但 LLMs，尤其是 DH 策略，在层级较深的数据集上往往优于机器学习模型。由于 DH 策略在更深层级标签上需要更多输入 tokens，API 成本显著增加。这些结果强调了准确性提升与提示策略计算成本之间的权衡。这些发现突显了黑箱 LLMs 在 HTC 中的潜力，同时强调了需要谨慎选择提示策略以平衡性能和成本。
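
As an illustration of the top-down multi-step idea (TMH) named in the abstract, the sketch below walks a label hierarchy one level at a time and asks a chooser to pick among the children of the current node. It is a minimal sketch, not the authors' prompts or code: the `Hierarchy` type, the `"ROOT"` sentinel, and the `choose` callable (which would wrap a black box LLM API call in practice) are all hypothetical names introduced here.

```python
from typing import Callable, Dict, List

# Hypothetical representation: each parent label maps to its child labels.
Hierarchy = Dict[str, List[str]]


def top_down_classify(
    text: str,
    hierarchy: Hierarchy,
    choose: Callable[[str, List[str]], str],
    root: str = "ROOT",
) -> List[str]:
    """Return the label path from the top level down to a leaf.

    `choose(text, options)` stands in for one LLM call per level; it must
    return exactly one of `options`.
    """
    path: List[str] = []
    node = root
    # Descend until the current node has no children (a leaf).
    while hierarchy.get(node):
        options = hierarchy[node]
        picked = choose(text, options)
        if picked not in options:
            raise ValueError(f"chooser returned an unknown label: {picked!r}")
        path.append(picked)
        node = picked
    return path
```

A toy chooser that matches label names against the text is enough to exercise the traversal; a real chooser would build a zero-shot or few-shot prompt from `options` at each level.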

Subjects: Computation and Language, Machine Learning 主题：计算与语言，机器学习

Publish: 2025-08-06 08:53:50 UTC 发布：2025-08-06 08:53:50 UTC

#34 ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments #34 ReasoningGuard：通过推理时的安全“灵光一现”保护大型推理模型

Authors: [Yuquan Wang](https://arxiv.org/search/?searchtype=author&query=Yuquan Wang), [Mi Zhang](https://arxiv.org/search/?searchtype=author&query=Mi Zhang), [Yining Wang](https://arxiv.org/search/?searchtype=author&query=Yining Wang), [Geng Hong](https://arxiv.org/search/?searchtype=author&query=Geng Hong), [Xiaoyu You](https://arxiv.org/search/?searchtype=author&query=Xiaoyu You), [Min Yang](https://arxiv.org/search/?searchtype=author&query=Min Yang) 作者：王宇泉，张弥，王一宁，洪耿，尤晓宇，杨敏

Large Reasoning Models (LRMs) have demonstrated impressive performance in reasoning-intensive tasks, but they remain vulnerable to harmful content generation, particularly in the mid-to-late steps of their reasoning processes. Existing defense mechanisms, however, rely on costly fine-tuning and additional expert knowledge, which restricts their scalability. In this work, we propose ReasoningGuard, an inference-time safeguard for LRMs, which injects timely safety aha moments to steer harmless while helpful reasoning processes. Leveraging the model’s internal attention behavior, our approach accurately identifies critical points in the reasoning path, and triggers spontaneous, safety-oriented reflection. To safeguard both the subsequent reasoning steps and the final answers, we further implement a scaling sampling strategy during the decoding phase, selecting the optimal reasoning path. Inducing minimal extra inference cost, ReasoningGuard effectively mitigates three types of jailbreak attacks, including the latest ones targeting the reasoning process of LRMs. Our approach outperforms seven existing safeguards, achieving state-of-the-art safety defenses while effectively avoiding the common exaggerated safety issues. 大型推理模型（LRMs）在推理密集型任务中表现出令人印象深刻的性能，但它们仍然容易生成有害内容，尤其是在推理过程的中后期步骤。现有的防御机制依赖于昂贵的微调和额外的专家知识，限制了其可扩展性。在本工作中，我们提出了 ReasoningGuard，一种针对 LRMs 的推理时安全防护措施，通过注入及时的安全“顿悟”时刻，引导无害且有益的推理过程。利用模型内部的注意力行为，我们的方法能够准确识别推理路径中的关键点，并触发自发的安全导向反思。为了保护后续的推理步骤和最终答案，我们进一步在解码阶段实施了缩放采样策略，选择最优的推理路径。ReasoningGuard 在引入极少额外推理成本的同时，有效缓解了三种类型的越狱攻击，包括最新针对 LRMs 推理过程的攻击。我们的方法优于七种现有的防护措施，实现了最先进的安全防御，同时有效避免了常见的夸大安全问题。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-06 08:35:10 UTC 发布时间：2025-08-06 08:35:10 UTC

#35 Reasoning Beyond Labels: Measuring LLM Sentiment in Low-Resource, Culturally Nuanced Contexts #35 超越标签的推理：在低资源、文化细微差异背景下测量 LLM 情感

Sentiment analysis in low-resource, culturally nuanced contexts challenges conventional NLP approaches that assume fixed labels and universal affective expressions. We present a diagnostic framework that treats sentiment as a context-dependent, culturally embedded construct, and evaluate how large language models (LLMs) reason about sentiment in informal, code-mixed WhatsApp messages from Nairobi youth health groups. Using a combination of human-annotated data, sentiment-flipped counterfactuals, and rubric-based explanation evaluation, we probe LLM interpretability, robustness, and alignment with human reasoning. Framing our evaluation through a social-science measurement lens, we operationalize and interrogate LLM outputs as an instrument for measuring the abstract concept of sentiment. Our findings reveal significant variation in model reasoning quality, with top-tier LLMs demonstrating interpretive stability, while open models often falter under ambiguity or sentiment shifts. This work highlights the need for culturally sensitive, reasoning-aware AI evaluation in complex, real-world communication. 在资源匮乏且文化细微差异显著的情境中进行情感分析，挑战了传统自然语言处理方法，这些方法假设情感标签固定且情感表达具有普遍性。我们提出了一个诊断框架，将情感视为一种依赖上下文、嵌入文化的构建，并评估大型语言模型（LLMs）如何推理来自内罗毕青年健康群组的非正式、混合代码的 WhatsApp 消息中的情感。通过结合人工标注数据、情感反转的反事实样本以及基于评分标准的解释评估，我们探究了 LLM 的可解释性、鲁棒性及其与人类推理的一致性。以社会科学测量视角构建评估框架，我们将 LLM 输出作为测量抽象情感概念的工具进行操作化和质询。研究结果显示模型推理质量存在显著差异，顶级 LLM 表现出解释稳定性，而开源模型在面对歧义或情感变化时常常表现不佳。此项工作强调了在复杂真实交流中，进行文化敏感且具备推理意识的 AI 评估的必要性。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-06 08:27:55 UTC 发布：2025-08-06 08:27:55 UTC

#36 Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models #36 诱发并分析最先进大型语言模型中的新兴错位

Authors: [Siddhant Panpatil](https://arxiv.org/search/?searchtype=author&query=Siddhant Panpatil), [Hiskias Dingeto](https://arxiv.org/search/?searchtype=author&query=Hiskias Dingeto), [Haon Park](https://arxiv.org/search/?searchtype=author&query=Haon Park) 作者：Siddhant Panpatil, Hiskias Dingeto, Haon Park

Despite significant advances in alignment techniques, we demonstrate that state-of-the-art language models remain vulnerable to carefully crafted conversational scenarios that can induce various forms of misalignment without explicit jailbreaking. Through systematic manual red-teaming with Claude-4-Opus, we discovered 10 successful attack scenarios, revealing fundamental vulnerabilities in how current alignment methods handle narrative immersion, emotional pressure, and strategic framing. These scenarios successfully elicited a range of misaligned behaviors, including deception, value drift, self-preservation, and manipulative reasoning, each exploiting different psychological and contextual vulnerabilities. To validate generalizability, we distilled our successful manual attacks into MISALIGNMENTBENCH, an automated evaluation framework that enables reproducible testing across multiple models. Cross-model evaluation of our 10 scenarios against five frontier LLMs revealed an overall 76% vulnerability rate, with significant variations: GPT-4.1 showed the highest susceptibility (90%), while Claude-4-Sonnet demonstrated greater resistance (40%). Our findings demonstrate that sophisticated reasoning capabilities often become attack vectors rather than protective mechanisms, as models can be manipulated into complex justifications for misaligned behavior. This work provides (i) a detailed taxonomy of conversational manipulation patterns and (ii) a reusable evaluation framework. Together, these findings expose critical gaps in current alignment strategies and highlight the need for robustness against subtle, scenario-based manipulation in future AI systems. 
尽管在对齐技术方面取得了显著进展，我们仍然证明了最先进的语言模型容易受到精心设计的对话场景的影响，这些场景能够在不进行明确越狱的情况下引发各种形式的错位。通过对 Claude-4-Opus 进行系统的人工红队测试，我们发现了 10 个成功的攻击场景，揭示了当前对齐方法在处理叙事沉浸、情感压力和策略框架方面的根本性漏洞。这些场景成功引发了多种错位行为，包括欺骗、价值漂移、自我保护和操控性推理，每种行为都利用了不同的心理和情境漏洞。为了验证其普适性，我们将成功的人工攻击提炼成 MISALIGNMENTBENCH，一个自动化评估框架，能够实现跨多个模型的可复现测试。对五个前沿 LLMs 进行的 10 个场景的跨模型评估显示，总体漏洞率为 76%，且存在显著差异：GPT-4.1 表现出最高的易受攻击性（90%），而 Claude-4-Sonnet 则表现出较强的抵抗力（40%）。我们的研究结果表明，复杂的推理能力往往成为攻击的切入点，而非保护机制，因为模型可能被操纵以为不匹配的行为提供复杂的辩解。本文提供了（i）详细的对话操控模式分类法和（ii）一个可复用的评估框架。综合来看，这些发现揭示了当前对齐策略中的关键漏洞，并强调了未来人工智能系统在面对细微、基于场景的操控时需要具备的鲁棒性。

Subjects: Computation and Language, Artificial Intelligence, Cryptography and Security 主题：计算与语言，人工智能，密码学与安全

Publish: 2025-08-06 08:25:40 UTC 发布时间：2025-08-06 08:25:40 UTC

#37 Characterizing Deep Research: A Benchmark and Formal Definition #37 深度研究特征化：基准测试与正式定义

Information tasks such as writing surveys or analytical reports require complex search and reasoning, and have recently been grouped under the umbrella of "deep research" – a term also adopted by recent models targeting these capabilities. Despite growing interest, the scope of the deep research task remains underdefined and its distinction from other reasoning-intensive problems is poorly understood. In this paper, we propose a formal characterization of the deep research (DR) task and introduce a benchmark to evaluate the performance of DR systems. We argue that the core defining feature of deep research is not the production of lengthy report-style outputs, but rather the high fan-out over concepts required during the search process, i.e., broad and reasoning-intensive exploration. To enable objective evaluation, we define DR using an intermediate output representation that encodes key claims uncovered during search, separating the reasoning challenge from surface-level report generation. Based on this formulation, we propose a diverse, challenging benchmark, LiveDRBench, with 100 tasks over scientific topics (e.g., datasets, materials discovery, prior art search) and public interest events (e.g., flight incidents, movie awards). Across state-of-the-art DR systems, F1 score ranges between 0.02 and 0.72 for any sub-category. OpenAI’s model performs the best with an overall F1 score of 0.55. Analysis of reasoning traces reveals the distribution over the number of referenced sources, branching, and backtracking events executed by current DR systems, motivating future directions for improving their search mechanisms and grounding capabilities. The benchmark is available at https://github.com/microsoft/LiveDRBench.
信息任务如撰写综述或分析报告需要复杂的搜索和推理，近年来被归类于“深度研究”（deep research）范畴——这一术语也被近期针对这些能力的模型所采用。尽管兴趣日益增长，深度研究任务的范围仍未明确定义，其与其他推理密集型问题的区别也尚不清晰。本文提出了深度研究（DR）任务的形式化表征，并引入了一个基准来评估 DR 系统的性能。我们认为，深度研究的核心特征并非生成冗长的报告式输出，而是在搜索过程中所需的高概念分支，即广泛且推理密集的探索。为了实现客观评估，我们使用一种中间输出表示定义 DR，该表示编码了搜索过程中发现的关键论点——将推理挑战与表层报告生成区分开来。基于此表述，我们提出了一个多样且具有挑战性的基准测试 LiveDRBench，涵盖 100 个科学主题（如数据集、材料发现、先前技术检索）和公众关注事件（如飞行事故、电影奖项）的挑战性任务。在最先进的 DR 系统中，任何子类别的 F1 分数范围在 0.02 到 0.72 之间。OpenAI 的模型表现最佳，整体 F1 分数为 0.55。对推理轨迹的分析揭示了当前 DR 系统所引用来源数量、分支和回溯事件的分布，激发了未来改进其搜索机制和基础能力的方向。该基准测试可在 https://github.com/microsoft/LiveDRBench 获取。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-06 08:09:28 UTC 发布时间：2025-08-06 08:09:28 UTC

#38 Hacking Hallucinations of MLLMs with Causal Sufficiency and Necessity #38 利用因果充分性和必要性破解多模态大型语言模型的幻觉

Authors: [Peizheng Guo](https://arxiv.org/search/?searchtype=author&query=Peizheng Guo), [Jingyao Wang](https://arxiv.org/search/?searchtype=author&query=Jingyao Wang), [Wenwen Qiang](https://arxiv.org/search/?searchtype=author&query=Wenwen Qiang), [Huijie Guo](https://arxiv.org/search/?searchtype=author&query=Huijie Guo), [Changwen Zheng](https://arxiv.org/search/?searchtype=author&query=Changwen Zheng), [Jiahuan Zhou](https://arxiv.org/search/?searchtype=author&query=Jiahuan Zhou), [Gang Hua](https://arxiv.org/search/?searchtype=author&query=Gang Hua) 作者：郭培正，王静尧，强文文，郭慧杰，郑昌文，周嘉欢，华刚

Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across vision-language tasks. However, they may suffer from hallucinations, generating outputs that are semantically inconsistent with the input image or text. Through causal analyses, we find that: (i) hallucinations with omission may arise from the failure to adequately capture essential causal factors, and (ii) hallucinations with fabrication are likely caused by the model being misled by non-causal cues. To address these challenges, we propose a novel reinforcement learning framework guided by causal completeness, which jointly considers both causal sufficiency and causal necessity of tokens. Specifically, we evaluate each token’s standalone contribution and counterfactual indispensability to define a token-level causal completeness reward. This reward is used to construct a causally informed advantage function within the GRPO optimization framework, encouraging the model to focus on tokens that are both causally sufficient and necessary for accurate generation. Experimental results across various benchmark datasets and tasks demonstrate that our approach effectively mitigates hallucinations in MLLMs. 多模态大型语言模型（MLLMs）在视觉-语言任务中展现了令人印象深刻的能力。然而，它们可能会出现幻觉——生成与输入图像或文本语义不一致的输出。通过因果分析，我们发现：（i）遗漏型幻觉可能源于未能充分捕捉关键因果因素；（ii）捏造型幻觉很可能是模型被非因果线索误导所致。为了解决这些挑战，我们提出了一种由因果完整性指导的新型强化学习框架，该框架同时考虑了 token 的因果充分性和因果必要性。具体而言，我们评估每个 token 的独立贡献和反事实不可或缺性，以定义 token 级别的因果完整性奖励。该奖励用于在 GRPO 优化框架中构建因果信息优势函数，鼓励模型关注那些对准确生成既因果充分又因果必要的 token。在各种基准数据集和任务上的实验结果证明了我们方法的有效性，该方法有效地减轻了 MLLMs 中的幻觉现象。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-06 08:09:12 UTC 发布时间：2025-08-06 08:09:12 UTC

#39 The State Of TTS: A Case Study with Human Fooling Rates #39 语音合成的现状：以人类欺骗率为案例研究

Authors: [Praveen Srinivasa Varadhan](https://arxiv.org/search/?searchtype=author&query=Praveen Srinivasa Varadhan), [Sherry Thomas](https://arxiv.org/search/?searchtype=author&query=Sherry Thomas), [Sai Teja M. S.](https://arxiv.org/search/?searchtype=author&query=Sai Teja M. S.), [Suvrat Bhooshan](https://arxiv.org/search/?searchtype=author&query=Suvrat Bhooshan), [Mitesh M. Khapra](https://arxiv.org/search/?searchtype=author&query=Mitesh M. Khapra) 作者：Praveen Srinivasa Varadhan，Sherry Thomas，Sai Teja M. S.，Suvrat Bhooshan，Mitesh M. Khapra

While subjective evaluations in recent years indicate rapid progress in TTS, can current TTS systems truly pass a human deception test in a Turing-like evaluation? We introduce Human Fooling Rate (HFR), a metric that directly measures how often machine-generated speech is mistaken for human. Our large-scale evaluation of open-source and commercial TTS models reveals critical insights: (i) CMOS-based claims of human parity often fail under deception testing, (ii) TTS progress should be benchmarked on datasets where human speech achieves high HFRs, as evaluating against monotonous or less expressive reference samples sets a low bar, (iii) Commercial models approach human deception in zero-shot settings, while open-source systems still struggle with natural conversational speech; (iv) Fine-tuning on high-quality data improves realism but does not fully bridge the gap. Our findings underscore the need for more realistic, human-centric evaluations alongside existing subjective tests. 近年来的主观评估显示了语音合成（TTS）的快速进展，但当前的 TTS 系统是否真的能在类似图灵测试的人类欺骗测试中通过？我们引入了“人类欺骗率”（Human Fooling Rate，HFR）这一指标，直接衡量机器生成的语音被误认为是人类语音的频率。我们对开源和商业 TTS 模型进行了大规模评估，揭示了关键见解：（i）基于 CMOS 的“人类水平”声称在欺骗测试中常常不成立；（ii）TTS 的进展应基于人类语音能达到高 HFR 的数据集进行基准测试，因为使用单调或表现力较弱的参考样本会降低评测标准；（iii）商业模型在零样本设置下接近人类欺骗水平，而开源系统在自然对话语音方面仍存在困难；（iv）在高质量数据上微调能提升真实感，但仍未完全弥合差距。我们的研究结果强调，除了现有的主观测试外，还需要更真实、更以人为中心的评估方法。
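
To make the metric concrete, the sketch below computes a fooling rate as the fraction of listener judgments in which a clip was labeled human, grouped by source. It follows only the one-sentence description in the abstract; the record layout, the function names, and the idea of computing the same rate for real human recordings as a reference point are assumptions made here, and the paper's exact protocol may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def human_fooling_rate(judged_human: List[bool]) -> float:
    """Fraction of judgments in which the listener said 'human'."""
    if not judged_human:
        raise ValueError("need at least one judgment")
    return sum(judged_human) / len(judged_human)


def hfr_by_source(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Group (source_name, judged_human) records and compute a rate per source.

    `source_name` is a hypothetical identifier, e.g. a TTS system or the
    real-speech reference set.
    """
    grouped: Dict[str, List[bool]] = defaultdict(list)
    for source, judged in records:
        grouped[source].append(judged)
    return {source: human_fooling_rate(votes) for source, votes in grouped.items()}
```

Computing the rate for the human reference set alongside each system is one way to check point (ii) of the abstract: if real speech in a dataset is itself rarely judged human, matching it is a low bar.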

Subjects: Computation and Language, Machine Learning, Sound, Audio and Speech Processing 主题：计算与语言，机器学习，声音，音频与语音处理

Publish: 2025-08-06 08:04:21 UTC 发布时间：2025-08-06 08:04:21 UTC

#40 Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap #40 基于 DPO 隐式奖励差距的难度偏好数据选择

Authors: [Xuan Qi](https://arxiv.org/search/?searchtype=author&query=Xuan Qi), [Rongwu Xu](https://arxiv.org/search/?searchtype=author&query=Rongwu Xu), [Zhijing Jin](https://arxiv.org/search/?searchtype=author&query=Zhijing Jin) 作者：齐轩，徐荣武，金志晶

Aligning large language models (LLMs) with human preferences is a critical challenge in AI research. While methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are widely used, they often rely on large, costly preference datasets. Existing work lacks methods for high-quality data selection specifically for preference data. In this work, we introduce a novel difficulty-based data selection strategy for preference datasets, grounded in the DPO implicit reward mechanism. By selecting preference data examples with smaller DPO implicit reward gaps, which are indicative of more challenging cases, we improve data efficiency and model alignment. Our approach consistently outperforms five strong baselines across multiple datasets and alignment tasks, achieving superior performance with only 10% of the original data. This principled, efficient selection method offers a promising solution for scaling LLM alignment with limited resources. 使大型语言模型（LLMs）与人类偏好对齐是人工智能研究中的一项关键挑战。虽然强化学习与人类反馈（RLHF）和直接偏好优化（DPO）等方法被广泛使用，但它们通常依赖于庞大且昂贵的偏好数据集。目前的工作缺乏针对偏好数据的高质量数据选择方法。在本研究中，我们提出了一种基于难度的偏好数据选择策略，基于 DPO 隐式奖励机制。通过选择具有较小 DPO 隐式奖励差距的偏好数据样本，这些样本代表更具挑战性的案例，我们提升了数据效率和模型对齐效果。我们的方法在多个数据集和对齐任务中持续优于五个强基线，仅用原始数据的 10%即可实现更优性能。这种有原则且高效的选择方法为有限资源下的 LLM 对齐扩展提供了有前景的解决方案。
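
The selection rule can be sketched from the standard DPO implicit reward, r(x, y) = β · (log π(y|x) − log π_ref(y|x)), with the gap taken between the chosen and rejected responses. The code below is a minimal sketch under stated assumptions, not the authors' implementation: the `PreferencePair` fields, the default `beta`, and ranking by the signed gap (rather than, say, its absolute value) are all choices made here for illustration, and the sequence log-probabilities are assumed to be precomputed by a DPO-tuned model and its reference model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PreferencePair:
    """Hypothetical record holding precomputed sequence log-probabilities."""
    pair_id: str
    policy_logp_chosen: float
    ref_logp_chosen: float
    policy_logp_rejected: float
    ref_logp_rejected: float


def implicit_reward_gap(pair: PreferencePair, beta: float = 0.1) -> float:
    """DPO implicit reward of the chosen response minus that of the rejected one."""
    reward_chosen = beta * (pair.policy_logp_chosen - pair.ref_logp_chosen)
    reward_rejected = beta * (pair.policy_logp_rejected - pair.ref_logp_rejected)
    return reward_chosen - reward_rejected


def select_smallest_gap(
    pairs: List[PreferencePair], fraction: float = 0.1, beta: float = 0.1
) -> List[PreferencePair]:
    """Keep the `fraction` of pairs with the smallest implicit reward gap."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    keep = max(1, int(len(pairs) * fraction))
    ranked = sorted(pairs, key=lambda p: implicit_reward_gap(p, beta))
    return ranked[:keep]
```

With `fraction=0.1` this mirrors the 10% budget mentioned in the abstract. Note that sorting by the signed gap places pairs where the model prefers the rejected response first; whether such pairs should be kept or filtered is a design decision the abstract does not settle.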

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-06 07:24:14 UTC 发布时间：2025-08-06 07:24:14 UTC

#41 Unveiling Over-Memorization in Finetuning LLMs for Reasoning Tasks #41 揭示在微调 LLMs 进行推理任务时的过度记忆问题

Authors: [Zhiwen Ruan](https://arxiv.org/search/?searchtype=author&query=Zhiwen Ruan), [Yun Chen](https://arxiv.org/search/?searchtype=author&query=Yun Chen), [Yutao Hou](https://arxiv.org/search/?searchtype=author&query=Yutao Hou), [Peng Li](https://arxiv.org/search/?searchtype=author&query=Peng Li), [Yang Liu](https://arxiv.org/search/?searchtype=author&query=Yang Liu), [Guanhua Chen](https://arxiv.org/search/?searchtype=author&query=Guanhua Chen) 作者：阮志文，陈云，侯宇涛，李鹏，刘洋，陈冠华

Pretrained large language models (LLMs) are finetuned with labeled data for better instruction-following ability and alignment with human values. In this paper, we study the learning dynamics of LLM finetuning on reasoning tasks and reveal an over-memorization phenomenon that arises during a specific stage of LLM finetuning. At this stage, the LLMs have excessively memorized training data and exhibit high test perplexity while maintaining good test accuracy. We investigate the conditions that lead to LLM over-memorization and find that training epochs and large learning rates contribute to this issue. Although models with over-memorization demonstrate comparable test accuracy to normal models, they suffer from reduced robustness, poor out-of-distribution generalization, and decreased generation diversity. Our experiments show that over-memorization occurs broadly across different tasks, models, and finetuning methods. Our research highlights that overparameterized, extensively finetuned LLMs exhibit unique learning dynamics distinct from traditional machine learning models. Based on our observations of over-memorization, we provide recommendations on checkpoint and learning rate selection during finetuning. 预训练的大型语言模型（LLMs）通过带标签的数据进行微调，以提升其遵循指令的能力和与人类价值观的对齐。在本文中，我们研究了 LLM 在推理任务微调过程中的学习动态，揭示了 LLM 微调特定阶段出现的过度记忆现象。在该阶段，LLM 对训练数据进行了过度记忆，表现出较高的测试困惑度（perplexity），但测试准确率依然良好。我们探讨了导致 LLM 过度记忆的条件，发现训练轮数和较大学习率是主要因素。尽管过度记忆的模型在测试准确率上与正常模型相当，但其鲁棒性降低，分布外泛化能力较差，生成多样性也有所下降。我们的实验表明，过度记忆现象在不同任务、模型和微调方法中普遍存在。我们的研究强调，过参数化且经过广泛微调的 LLMs 展现出与传统机器学习模型截然不同的独特学习动态。基于我们对过度记忆的观察，我们提供了关于微调过程中检查点和学习率选择的建议。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-06 06:34:12 UTC 发布时间：2025-08-06 06:34:12 UTC

#42 GM-PRM: A Generative Multimodal Process Reward Model for Multimodal Mathematical Reasoning #42 GM-PRM：一种用于多模态数学推理的生成式多模态过程奖励模型

Authors: [Jianghangfan Zhang](https://arxiv.org/search/?searchtype=author&query=Jianghangfan Zhang), [Yibo Yan](https://arxiv.org/search/?searchtype=author&query=Yibo Yan), [Kening Zheng](https://arxiv.org/search/?searchtype=author&query=Kening Zheng), [Xin Zou](https://arxiv.org/search/?searchtype=author&query=Xin Zou), [Song Dai](https://arxiv.org/search/?searchtype=author&query=Song Dai), [Xuming Hu](https://arxiv.org/search/?searchtype=author&query=Xuming Hu) 作者：张江航帆，闫一博，郑克宁，邹鑫，戴松，胡旭明

Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities but often struggle with complex, multi-step mathematical reasoning, where minor errors in visual perception or logical deduction can lead to complete failure. While Process Reward Models (PRMs) offer step-by-step supervision, existing multimodal PRMs are limited to being binary verifiers that can identify but not correct errors, offering little explanatory power. To address these deficiencies, we introduce the Generative Multimodal Process Reward Model (GM-PRM), a novel paradigm that transforms the PRM from a passive judge into an active reasoning collaborator. Instead of a simple scalar score, GM-PRM provides a fine-grained, interpretable analysis of each reasoning step, evaluating its step intent, visual alignment, and logical soundness. More critically, GM-PRM is trained to generate a corrected version of the first erroneous step it identifies. This unique corrective capability enables our new test-time inference strategy, Refined Best-of-N (Refined-BoN). This framework actively enhances solution quality by using the PRM’s generated correction to guide the policy model toward a more promising reasoning trajectory, thereby improving the diversity and correctness of the solution pool. We demonstrate that GM-PRM achieves state-of-the-art results on multiple multimodal math benchmarks, significantly boosting policy model performance with remarkable data efficiency, requiring only a 20K-sample training dataset. Our code will be released upon acceptance. 
多模态大型语言模型（MLLMs）展现出卓越的能力，但在复杂的多步骤数学推理中常常表现不佳，其中视觉感知或逻辑推理中的细微错误可能导致完全失败。尽管过程奖励模型（PRMs）提供了逐步监督，现有的多模态 PRMs 仅限于作为二元验证器，能够识别但无法纠正错误，且解释能力有限。为了解决这些不足，我们提出了生成式多模态过程奖励模型（GM-PRM），这是一种将 PRM 从被动评判者转变为主动推理协作者的新范式。GM-PRM 不再仅提供简单的标量分数，而是对每一步推理进行细粒度、可解释的分析，评估其步骤意图、视觉对齐和逻辑合理性。更重要的是，GM-PRM 被训练生成其识别出的第一个错误步骤的修正版本。这一独特的纠正能力使我们能够提出新的测试时推理策略——精炼的多选最佳（Refined Best-of-N，Refined-BoN）。该框架通过利用 PRM 生成的修正来引导策略模型朝着更有前景的推理轨迹发展，积极提升了解决方案的质量，从而提高了解决方案池的多样性和正确性。我们证明了 GM-PRM 在多个多模态数学基准测试中达到了最先进的结果，显著提升了策略模型的性能，且数据效率极高，仅需一个包含 2 万个样本的训练数据集。我们的代码将在论文被接受后发布。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-06 05:10:29 UTC 发布时间：2025-08-06 05:10:29 UTC

#43 ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients" #43 ToolGrad：利用文本“梯度”高效生成工具使用数据集

Prior work synthesizes tool-use LLM datasets by first generating a user query, followed by complex tool-use annotations like DFS. This leads to inevitable annotation failures and low efficiency in data generation. We introduce ToolGrad, an agentic framework that inverts this paradigm. ToolGrad first constructs valid tool-use chains through an iterative process guided by textual “gradients”, and then synthesizes corresponding user queries. This “answer-first” approach led to ToolGrad-5k, a dataset generated with more complex tool use, lower cost, and 100% pass rate. Experiments show that models trained on ToolGrad-5k outperform those on expensive baseline datasets and proprietary LLMs, even on OOD benchmarks. 以往的工作通过先生成用户查询，再进行如 DFS 等复杂工具使用注释，来合成工具使用的 LLM 数据集。这导致了不可避免的注释失败和数据生成效率低下。我们提出了 ToolGrad，一个颠倒这一范式的智能框架。ToolGrad 首先通过文本“梯度”引导的迭代过程构建有效的工具使用链，然后合成相应的用户查询。这种“先答案”方法催生了 ToolGrad-5k 数据集，该数据集具有更复杂的工具使用、更低的成本和 100%的通过率。实验表明，在 ToolGrad-5k 上训练的模型，即使在 OOD 基准测试中，也优于那些在昂贵基线数据集和专有 LLM 上训练的模型。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-06 05:04:00 UTC 发布：2025-08-06 05:04:00 UTC

#44 Efficient Strategy for Improving Large Language Model (LLM) Capabilities #44 提升大型语言模型（LLM）能力的高效策略

Author: [Julián Camilo Velandia Gutiérrez](https://arxiv.org/search/?searchtype=author&query=Julián Camilo Velandia Gutiérrez) 作者：Julián Camilo Velandia Gutiérrez

Large Language Models (LLMs) have become a milestone in the field of artificial intelligence and natural language processing. However, their large-scale deployment remains constrained by the need for significant computational resources. This work proposes starting from a base model to explore and combine data processing and careful data selection techniques, training strategies, and architectural adjustments to improve the efficiency of LLMs in resource-constrained environments and within a delimited knowledge base. The methodological approach included defining criteria for building reliable datasets, conducting controlled experiments with different configurations, and systematically evaluating the resulting variants in terms of capability, versatility, response time, and safety. Finally, comparative tests were conducted to measure the performance of the developed variants and to validate the effectiveness of the proposed strategies. This work is based on the master’s thesis in Systems and Computer Engineering titled “Efficient Strategy for Improving the Capabilities of Large Language Models (LLMs)”. 大型语言模型（LLMs）已成为人工智能和自然语言处理领域的一个里程碑。然而，其大规模部署仍受限于对大量计算资源的需求。本研究提出从基础模型出发，探索并结合数据处理与精心的数据选择技术、训练策略及架构调整，以提升 LLMs 在资源受限环境和限定知识库内的效率。方法论包括定义构建可靠数据集的标准，进行不同配置的受控实验，并系统评估所得变体在能力、多功能性、响应时间及安全性方面的表现。最后，进行了对比测试以衡量所开发变体的性能，并验证所提策略的有效性。本研究基于题为《提升大型语言模型（LLMs）能力的高效策略》的系统与计算机工程硕士论文。

Subjects: Computation and Language, Machine Learning 主题：计算与语言，机器学习

Publish: 2025-08-06 04:08:26 UTC 发布时间：2025-08-06 04:08:26 UTC

#45 PAIRS: Parametric-Verified Adaptive Information Retrieval and Selection for Efficient RAG #45 PAIRS：参数验证的自适应信息检索与选择，用于高效的 RAG

Authors: [Wang Chen](https://arxiv.org/search/?searchtype=author&query=Wang Chen), [Guanqiang Qi](https://arxiv.org/search/?searchtype=author&query=Guanqiang Qi), [Weikang Li](https://arxiv.org/search/?searchtype=author&query=Weikang Li), [Yang Li](https://arxiv.org/search/?searchtype=author&query=Yang Li), [Deguo Xia](https://arxiv.org/search/?searchtype=author&query=Deguo Xia), [Jizhou Huang](https://arxiv.org/search/?searchtype=author&query=Jizhou Huang) 作者：王晨，齐冠强，李伟康，李洋，夏德国，黄继洲

Retrieval-Augmented Generation (RAG) has become a cornerstone technique for enhancing large language models (LLMs) with external knowledge. However, current RAG systems face two critical limitations: (1) they inefficiently retrieve information for every query, including simple questions that could be resolved using the LLM’s parametric knowledge alone, and (2) they risk retrieving irrelevant documents when queries contain sparse information signals. To address these gaps, we introduce Parametric-verified Adaptive Information Retrieval and Selection (PAIRS), a training-free framework that integrates parametric and retrieved knowledge to adaptively determine whether to retrieve and how to select external information. Specifically, PAIRS employs a dual-path generation mechanism: First, the LLM produces both a direct answer and a context-augmented answer using self-generated pseudo-context. When these outputs converge, PAIRS bypasses external retrieval entirely, dramatically improving the RAG system’s efficiency. For divergent cases, PAIRS activates a dual-path retrieval (DPR) process guided by both the original query and self-generated contextual signals, followed by an Adaptive Information Selection (AIS) module that filters documents through weighted similarity to both sources. This simple yet effective approach not only enhances efficiency by eliminating unnecessary retrievals but also improves accuracy through contextually guided retrieval and adaptive information selection. Experimental results on six question-answering (QA) benchmarks show that PAIRS reduces retrieval costs by around 25% (triggering for only 75% of queries) while still improving accuracy, achieving +1.1% EM and +1.0% F1 over prior baselines on average.
检索增强生成（RAG）已成为利用外部知识提升大型语言模型（LLMs）性能的基石技术。然而，当前的 RAG 系统存在两个关键限制：（1）它们对每个查询都低效地进行信息检索，包括那些仅凭 LLM 的参数知识即可解决的简单问题；（2）当查询中信息信号稀疏时，存在检索到无关文档的风险。为了解决这些问题，我们提出了参数验证的自适应信息检索与选择（PAIRS），这是一种无需训练的框架，融合了参数知识和检索知识，自适应地决定是否进行检索以及如何选择外部信息。具体而言，PAIRS 采用双路径生成机制：首先，LLM 生成直接答案和利用自生成伪上下文的上下文增强答案。当这两种输出趋同时，PAIRS 完全跳过外部检索，显著提升了 RAG 系统的效率。对于发散性案例，PAIRS 启动了由原始查询和自生成上下文信号共同引导的双路径检索（DPR）过程，随后通过加权相似度对两个来源的文档进行过滤的自适应信息选择（AIS）模块。这个简单而有效的方法不仅可以通过消除不必要的检索来提高效率，还能通过上下文引导的检索和自适应信息选择来提升准确性。在六个问答（QA）基准测试中的实验结果表明，PAIRS 将检索成本降低了约 25%（仅对 75% 的查询触发检索），同时仍然提升了准确率——平均比之前的基线提高了 +1.1% 的 EM 和 +1.0% 的 F1。
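
The control flow described in the abstract (answer twice, skip retrieval when the two answers agree, otherwise retrieve along two paths and filter by weighted similarity) can be sketched as below. This is a rough sketch, not the PAIRS implementation: the `generate` and `retrieve` callables, the pseudo-context prompt wording, the exact-match convergence test, the token-level Jaccard similarity, and the `alpha` and `top_k` parameters are all stand-ins assumed here for illustration.

```python
from typing import Callable, List, Tuple


def normalize(text: str) -> str:
    return " ".join(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity, used here as a stand-in for embedding similarity."""
    sa, sb = set(normalize(a).split()), set(normalize(b).split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def pairs_answer(
    query: str,
    generate: Callable[[str, str], str],      # hypothetical: (prompt, context) -> answer
    retrieve: Callable[[str], List[str]],     # hypothetical: search text -> documents
    alpha: float = 0.5,
    top_k: int = 2,
) -> Tuple[str, List[str]]:
    # Dual-path generation: one direct answer, one using self-generated pseudo-context.
    direct = generate(query, "")
    pseudo_context = generate(f"Write a short passage that helps answer: {query}", "")
    augmented = generate(query, pseudo_context)

    # If both paths agree, trust parametric knowledge and skip retrieval.
    if normalize(direct) == normalize(augmented):
        return direct, []

    # Dual-path retrieval: search with the query and with the pseudo-context.
    candidates = list(dict.fromkeys(retrieve(query) + retrieve(pseudo_context)))

    # Selection: rank documents by weighted similarity to both sources.
    def score(doc: str) -> float:
        return alpha * jaccard(doc, query) + (1 - alpha) * jaccard(doc, pseudo_context)

    selected = sorted(candidates, key=score, reverse=True)[:top_k]
    return generate(query, "\n".join(selected)), selected
```

The second return value lists the documents actually used, so an empty list signals that retrieval was bypassed, which is the case the abstract credits for the efficiency gain.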

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-06 03:33:01 UTC 发布：2025-08-06 03:33:01 UTC

#46 DTPA: Dynamic Token-level Prefix Augmentation for Controllable Text Generation #46 DTPA：用于可控文本生成的动态令牌级前缀增强

Controllable Text Generation (CTG) is a vital subfield in Natural Language Processing (NLP), aiming to generate text that aligns with desired attributes. However, previous studies commonly focus on the quality of controllable text generation for short sequences, while the generation of long-form text remains largely underexplored. In this paper, we observe that the controllability of texts generated by the powerful prefix-based method Air-Decoding tends to decline with increasing sequence length, which we hypothesize primarily arises from the observed decay in attention to the prefixes. Meanwhile, different types of prefixes including soft and hard prefixes are also key factors influencing performance. Building on these insights, we propose a lightweight and effective framework called Dynamic Token-level Prefix Augmentation (DTPA) based on Air-Decoding for controllable text generation. Specifically, it first selects the optimal prefix type for a given task. Then we dynamically amplify the attention to the prefix for the attribute distribution to enhance controllability, with a scaling factor growing exponentially as the sequence length increases. Moreover, based on the task, we optionally apply a similar augmentation to the original prompt for the raw distribution to balance text quality. After attribute distribution reconstruction, the generated text satisfies the attribute constraints well. Experiments on multiple CTG tasks demonstrate that DTPA generally outperforms other methods in attribute control while maintaining competitive fluency, diversity, and topic relevance. Further analysis highlights DTPA’s superior effectiveness in long text generation. 
可控文本生成（CTG）是自然语言处理（NLP）中的一个重要子领域，旨在生成符合期望属性的文本。然而，先前的研究通常侧重于短序列的可控文本生成质量，而长文本的生成仍然很少被探索。在本文中，我们观察到强大的基于前缀的方法 Air-Decoding 生成的文本的可控性随着序列长度的增加而下降，我们假设这主要源于对前缀注意力的衰减。同时，不同类型的前缀，包括软前缀和硬前缀，也是影响性能的关键因素。基于这些见解，我们提出了一个轻量且有效的框架，称为基于 Air-Decoding 的动态令牌级前缀增强（DTPA），用于可控文本生成。具体来说，它首先为给定任务选择最优的前缀类型。然后，我们动态放大对前缀的注意力以增强属性分布的可控性，缩放因子随着序列长度的增加呈指数增长。此外，根据任务需求，我们可选择对原始提示进行类似的增强，以平衡文本质量的原始分布。在属性分布重构后，生成的文本能够很好地满足属性约束。在多个条件文本生成（CTG）任务上的实验表明，DTPA 在属性控制方面通常优于其他方法，同时保持了具有竞争力的流畅性、多样性和主题相关性。进一步分析凸显了 DTPA 在长文本生成中的卓越效果。
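
The idea of amplifying attention to the prefix with a factor that grows exponentially in sequence length can be illustrated on a single attention distribution. The sketch below is only a toy under loud assumptions: the schedule `exp(rate * step)`, the `rate` value, and the choice to rescale post-softmax weights and renormalize are all invented here, since the abstract does not give the exact formula or where in the model the scaling is applied.

```python
import math
from typing import List


def amplify_prefix_attention(
    weights: List[float], prefix_len: int, step: int, rate: float = 0.01
) -> List[float]:
    """Boost attention on the first `prefix_len` positions, then renormalize.

    `weights` is one attention distribution over [prefix tokens, other tokens];
    `step` is the current generation position. The scale exp(rate * step) is an
    assumed schedule, chosen only because it grows exponentially with length.
    """
    if not 0 <= prefix_len <= len(weights):
        raise ValueError("prefix_len out of range")
    scale = math.exp(rate * step)
    boosted = [w * scale if i < prefix_len else w for i, w in enumerate(weights)]
    total = sum(boosted)
    return [w / total for w in boosted]
```

At `step=0` the distribution is unchanged; as `step` grows, the share of attention on the prefix rises, which is the effect the abstract describes as counteracting the decay of prefix attention in long generations.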

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-06 03:20:33 UTC 发布时间：2025-08-06 03:20:33 UTC

#47 Large Reasoning Models Are Autonomous Jailbreak Agents #47 大型推理模型是自主越狱代理

Authors: [Thilo Hagendorff](https://arxiv.org/search/?searchtype=author&query=Thilo Hagendorff), [Erik Derner](https://arxiv.org/search/?searchtype=author&query=Erik Derner), [Nuria Oliver](https://arxiv.org/search/?searchtype=author&query=Nuria Oliver) 作者：Thilo Hagendorff，Erik Derner，Nuria Oliver

Jailbreaking – bypassing built-in safety mechanisms in AI models – has traditionally required complex technical procedures or specialized human expertise. In this study, we show that the persuasive capabilities of large reasoning models (LRMs) simplify and scale jailbreaking, converting it into an inexpensive activity accessible to non-experts. We evaluated the capabilities of four LRMs (DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini, Qwen3 235B) to act as autonomous adversaries conducting multi-turn conversations with nine widely used target models. LRMs received instructions via a system prompt, before proceeding to planning and executing jailbreaks with no further supervision. We performed extensive experiments with a benchmark of harmful prompts composed of 70 items covering seven sensitive domains. This setup yielded an overall attack success rate across all model combinations of 97.14%. Our study reveals an alignment regression, in which LRMs can systematically erode the safety guardrails of other models, highlighting the urgent need to further align frontier models not only to resist jailbreak attempts, but also to prevent them from being co-opted into acting as jailbreak agents. 越狱——绕过 AI 模型内置安全机制——传统上需要复杂的技术程序或专业的人类知识。在本研究中，我们展示了大型推理模型（LRMs）的说服能力如何简化并扩大越狱的规模，使其成为非专家也能轻松进行的低成本活动。我们评估了四个 LRMs（DeepSeek-R1、Gemini 2.5 Flash、Grok 3 Mini、Qwen3 235B）作为自主对手，与九个广泛使用的目标模型进行多轮对话的能力。LRMs 通过系统提示接收指令，随后在无进一步监督的情况下进行越狱的规划和执行。我们使用包含 70 个项目、涵盖七个敏感领域的有害提示基准进行了大量实验。该设置在所有模型组合中的整体攻击成功率达到了 97.14%。我们的研究揭示了一种对齐回归现象，即大语言模型（LRMs）能够系统性地削弱其他模型的安全防护措施，强调了迫切需要进一步对前沿模型进行对齐，不仅要抵御越狱尝试，还要防止它们被利用成为越狱代理。

Subjects: Computation and Language, Artificial Intelligence, Cryptography and Security 主题：计算与语言，人工智能，密码学与安全

Publish: 2025-08-04 18:27:26 UTC 发布时间：2025-08-04 18:27:26 UTC

#48 ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents #48 ZARA：通过知识和检索驱动的 LLM 代理实现零样本运动时间序列分析

Authors: [Zechen Li](https://arxiv.org/search/?searchtype=author&query=Zechen Li), [Baiyu Chen](https://arxiv.org/search/?searchtype=author&query=Baiyu Chen), [Hao Xue](https://arxiv.org/search/?searchtype=author&query=Hao Xue), [Flora D. Salim](https://arxiv.org/search/?searchtype=author&query=Flora D. Salim) 作者：李泽辰，陈白玉，薛浩，Flora D. Salim

Motion sensor time-series are central to human activity recognition (HAR), with applications in health, sports, and smart devices. However, existing methods are trained for fixed activity sets and require costly retraining when new behaviours or sensor setups appear. Recent attempts to use large language models (LLMs) for HAR, typically by converting signals into text or images, suffer from limited accuracy and lack verifiable interpretability. We propose ZARA, the first agent-based framework for zero-shot, explainable HAR directly from raw motion time-series. ZARA integrates an automatically derived pair-wise feature knowledge base that captures discriminative statistics for every activity pair, a multi-sensor retrieval module that surfaces relevant evidence, and a hierarchical agent pipeline that guides the LLM to iteratively select features, draw on this evidence, and produce both activity predictions and natural-language explanations. ZARA enables flexible and interpretable HAR without any fine-tuning or task-specific classifiers. Extensive experiments on 8 HAR benchmarks show that ZARA achieves SOTA zero-shot performance, delivering clear reasoning while exceeding the strongest baselines by 2.53x in macro F1. Ablation studies further confirm the necessity of each module, marking ZARA as a promising step toward trustworthy, plug-and-play motion time-series analysis. Our codes are available at https://github.com/zechenli03/ZARA. 运动传感器时间序列是人体活动识别（HAR）的核心，广泛应用于健康、体育和智能设备领域。然而，现有方法通常针对固定的活动集进行训练，当出现新的行为或传感器配置时，需要昂贵的重新训练。近期尝试利用 LLMs 进行 HAR，通常通过将信号转换为文本或图像，但这些方法准确率有限且缺乏可验证的可解释性。我们提出了 ZARA，这是首个基于代理的框架，能够直接从原始运动时间序列实现零样本、可解释的 HAR。ZARA 集成了自动生成的成对特征知识库，捕捉每对活动的判别统计信息；多传感器检索模块，提供相关证据；以及分层代理流程，引导 LLM 迭代选择特征、利用证据，并生成活动预测和自然语言解释。ZARA 无需任何微调或特定任务分类器，即可实现灵活且可解释的 HAR。在 8 个人体动作识别（HAR）基准上的大量实验表明，ZARA 实现了最先进的零样本性能，能够提供清晰的推理，并在宏观 F1 指标上超过最强基线 2.53 倍。消融研究进一步确认了每个模块的必要性，标志着 ZARA 是迈向可信赖、即插即用的运动时间序列分析的有希望的一步。我们的代码可在 https://github.com/zechenli03/ZARA 获取。

Subjects: Computation and Language, Computer Vision and Pattern Recognition 学科领域：计算与语言，计算机视觉与模式识别

Publish: 2025-08-06 02:57:57 UTC 发布时间：2025-08-06 02:57:57 UTC

#49 Step More: Going Beyond Single Backpropagation in Meta Learning Based Model Editing #49 多一步：超越单次反向传播的元学习模型编辑

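Large language models (LLMs) underpin many AI applications, but their static nature makes updating knowledge costly. Model editing offers an efficient alternative by injecting new information through targeted parameter modifications. In particular, meta-learning based model editing (MLBME) methods have shown notable advantages in both editing effectiveness and efficiency. Nevertheless, we find that MLBME performs poorly in low-data scenarios, and its training efficiency is bottlenecked by the computation of KL divergence. To address these issues, we propose Step More Edit (SMEdit), a novel MLBME method that adopts Multiple BackPropagation Steps (MBPS) to improve editing performance under limited supervision and introduces a norm regularization on weight updates to improve training efficiency. Experimental results on two datasets and two LLMs show that SMEdit outperforms prior MLBME baselines, and that the MBPS strategy can be seamlessly integrated into existing methods to further boost their performance. Our code will be released soon.
大型语言模型（LLMs）支撑着许多人工智能应用，但其静态特性使得更新知识代价高昂。模型编辑通过有针对性的参数修改注入新信息，提供了一种高效的替代方案。特别是，基于元学习的模型编辑（MLBME）方法在编辑效果和效率方面表现出显著优势。尽管如此，我们发现 MLBME 在低数据场景下表现不佳，其训练效率也受到 KL 散度计算的瓶颈限制。为了解决这些问题，我们提出了 Step More Edit（SMEdit），这是一种新颖的 MLBME 方法，采用 Multiple BackPropagation Steps（MBPS）以提升有限监督下的编辑性能，并在权重更新上引入范数正则化以提高训练效率。在两个数据集和两个 LLMs 上的实验结果表明，SMEdit 优于之前的 MLBME 基线方法，且 MBPS 策略可以无缝集成到现有方法中，进一步提升其性能。我们的代码将很快发布。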

Publish: 2025-08-06 01:54:58 UTC 发布时间：2025-08-06 01:54:58 UTC

#50 HarmonyGuard: Toward Safety and Utility in Web Agents via Adaptive Policy Enhancement and Dual-Objective Optimization #50 HarmonyGuard：通过自适应策略增强和双目标优化实现网络代理的安全性与实用性

Authors: [Yurun Chen](https://arxiv.org/search/?searchtype=author&query=Yurun Chen), [Xavier Hu](https://arxiv.org/search/?searchtype=author&query=Xavier Hu), [Yuhan Liu](https://arxiv.org/search/?searchtype=author&query=Yuhan Liu), [Keting Yin](https://arxiv.org/search/?searchtype=author&query=Keting Yin), [Juncheng Li](https://arxiv.org/search/?searchtype=author&query=Juncheng Li), [Zhuosheng Zhang](https://arxiv.org/search/?searchtype=author&query=Zhuosheng Zhang), [Shengyu Zhang](https://arxiv.org/search/?searchtype=author&query=Shengyu Zhang) 作者：陈雨润，胡泽维，刘宇涵，尹可婷，李俊成，张卓晟，张胜宇

Large language models enable agents to autonomously perform tasks in open web environments. However, as hidden threats within the web evolve, web agents face the challenge of balancing task performance with emerging risks during long-sequence operations. Although this challenge is critical, current research remains limited to single-objective optimization or single-turn scenarios, lacking the capability for collaborative optimization of both safety and utility in web environments. To address this gap, we propose HarmonyGuard, a multi-agent collaborative framework that leverages policy enhancement and objective optimization to jointly improve both utility and safety. HarmonyGuard features a multi-agent architecture characterized by two fundamental capabilities: (1) Adaptive Policy Enhancement: We introduce the Policy Agent within HarmonyGuard, which automatically extracts and maintains structured security policies from unstructured external documents, while continuously updating policies in response to evolving threats. (2) Dual-Objective Optimization: Based on the dual objectives of safety and utility, the Utility Agent integrated within HarmonyGuard performs the Markovian real-time reasoning to evaluate the objectives and utilizes metacognitive capabilities for their optimization. Extensive evaluations on multiple benchmarks show that HarmonyGuard improves policy compliance by up to 38% and task completion by up to 20% over existing baselines, while achieving over 90% policy compliance across all tasks. Our project is available here: https://github.com/YurunChen/HarmonyGuard. 
大型语言模型使得智能体能够在开放的网络环境中自主执行任务。然而，随着网络中隐藏威胁的演变，网络智能体在长序列操作过程中面临着在任务性能与新兴风险之间平衡的挑战。尽管这一挑战至关重要，但当前的研究仍局限于单目标优化或单轮场景，缺乏在网络环境中对安全性和效用进行协同优化的能力。为填补这一空白，我们提出了 HarmonyGuard，一种多智能体协作框架，利用策略增强和目标优化来共同提升效用和安全性。HarmonyGuard 具有多智能体架构，具备两个基本能力：（1）自适应策略增强：我们在 HarmonyGuard 中引入了策略智能体，能够自动从非结构化的外部文档中提取并维护结构化的安全策略，同时根据不断演变的威胁持续更新策略。 (2) 双目标优化：基于安全性和效用性的双重目标，集成在 HarmonyGuard 中的效用代理执行马尔可夫实时推理以评估目标，并利用元认知能力进行优化。在多个基准测试中的广泛评估表明，HarmonyGuard 相比现有基线，策略合规性提升了最多 38%，任务完成率提升了最多 20%，同时在所有任务中实现了超过 90%的策略合规率。我们的项目地址为：https://github.com/YurunChen/HarmonyGuard。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-06 01:49:32 UTC 发布时间：2025-08-06 01:49:32 UTC

Authors: [Xinyu Zhao](https://arxiv.org/search/?searchtype=author&query=Xinyu Zhao), [Zhen Tan](https://arxiv.org/search/?searchtype=author&query=Zhen Tan), [Maya Enisman](https://arxiv.org/search/?searchtype=author&query=Maya Enisman), [Minjae Seo](https://arxiv.org/search/?searchtype=author&query=Minjae Seo), [Marta R. Durantini](https://arxiv.org/search/?searchtype=author&query=Marta R. Durantini), [Dolores Albarracin](https://arxiv.org/search/?searchtype=author&query=Dolores Albarracin), [Tianlong Chen](https://arxiv.org/search/?searchtype=author&query=Tianlong Chen) 作者：赵欣宇，谭震，玛雅·恩尼斯曼，徐敏宰，玛尔塔·R·杜兰蒂尼，多洛雷斯·阿尔巴拉辛，陈天龙

Successful group meetings, such as those implemented in group behavioral-change programs, work meetings, and other social contexts, must promote individual goal setting and execution while strengthening the social relationships within the group. Consequently, an ideal facilitator must be sensitive to the subtle dynamics of disengagement, difficulties with individual goal setting and execution, and interpersonal difficulties that signal a need for intervention. The challenges and cognitive load experienced by facilitators create a critical gap for an embodied technology that can interpret social exchanges while remaining aware of the needs of the individuals in the group and providing transparent recommendations that go beyond powerful but “black box” foundation models (FMs) that identify social cues. We address this important demand with a social robot co-facilitator that analyzes multimodal meeting data and provides discreet cues to the facilitator. The robot’s reasoning is powered by an agentic concept bottleneck model (CBM), which makes decisions based on human-interpretable concepts like participant engagement and sentiments, ensuring transparency and trustworthiness. Our core contribution is a transfer learning framework that distills the broad social understanding of an FM into our specialized and transparent CBM. This concept-driven system significantly outperforms direct zero-shot FMs in predicting the need for intervention and enables real-time human correction of its reasoning. Critically, we demonstrate robust knowledge transfer: the model generalizes across different groups and successfully transfers the expertise of senior human facilitators to improve the performance of novices. By transferring an expert’s cognitive model into an interpretable robotic partner, our work provides a powerful blueprint for augmenting human capabilities in complex social domains. 
成功的小组会议，如在小组行为改变项目、工作会议及其他社交场合中实施的会议，必须促进个人目标的设定与执行，同时加强小组内的社会关系。因此，理想的主持人必须敏感于脱离参与的微妙动态、个人目标设定与执行的困难，以及表明需要干预的人际关系问题。主持人所面临的挑战和认知负荷，创造了对一种具身技术的关键需求，该技术能够解读社交交流，同时关注小组中个体的需求，并提供透明的建议，超越那些强大但“黑箱”式的基础模型（FMs）对社交线索的识别。我们通过一个社交机器人共同主持者来满足这一重要需求，该机器人分析多模态会议数据，并向主持人提供细微的提示。该机器人的推理由一个具代理性的概念瓶颈模型（CBM）驱动，基于参与者参与度和情感等人类可解释的概念做出决策，确保透明性和可信度。我们的核心贡献是一个迁移学习框架，将基础模型（FM）广泛的社会理解蒸馏到我们专门且透明的概念驱动模型（CBM）中。这个以概念为驱动的系统在预测干预需求方面显著优于直接的零样本基础模型，并且支持对其推理过程进行实时的人类纠正。关键是，我们展示了稳健的知识迁移能力：该模型能够跨不同群体进行泛化，并成功将资深人类协调员的专业知识转移给新手，从而提升新手的表现。通过将专家的认知模型转移到一个可解释的机器人伙伴中，我们的工作为增强人类在复杂社会领域的能力提供了强有力的蓝图。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-06 01:24:06 UTC 发布：2025-08-06 01:24:06 UTC

#52 Are Today's LLMs Ready to Explain Well-Being Concepts? #52 今天的 LLMs 准备好解释幸福感概念了吗？

Authors: [Bohan Jiang](https://arxiv.org/search/?searchtype=author&query=Bohan Jiang), [Dawei Li](https://arxiv.org/search/?searchtype=author&query=Dawei Li), [Zhen Tan](https://arxiv.org/search/?searchtype=author&query=Zhen Tan), [Chengshuai Zhao](https://arxiv.org/search/?searchtype=author&query=Chengshuai Zhao), [Huan Liu](https://arxiv.org/search/?searchtype=author&query=Huan Liu) 作者：蒋博涵，李大伟，谭震，赵成帅，刘欢

Well-being encompasses mental, physical, and social dimensions essential to personal growth and informed life decisions. As individuals increasingly consult Large Language Models (LLMs) to understand well-being, a key challenge emerges: Can LLMs generate explanations that are not only accurate but also tailored to diverse audiences? High-quality explanations require both factual correctness and the ability to meet the expectations of users with varying expertise. In this work, we construct a large-scale dataset comprising 43,880 explanations of 2,194 well-being concepts, generated by ten diverse LLMs. We introduce a principle-guided LLM-as-a-judge evaluation framework, employing dual judges to assess explanation quality. Furthermore, we show that fine-tuning an open-source LLM using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) can significantly enhance the quality of generated explanations. Our results reveal: (1) The proposed LLM judges align well with human evaluations; (2) explanation quality varies significantly across models, audiences, and categories; and (3) DPO- and SFT-finetuned models outperform their larger counterparts, demonstrating the effectiveness of preference-based learning for specialized explanation tasks. 幸福感涵盖了对个人成长和明智生活决策至关重要的心理、身体和社会维度。随着人们越来越多地咨询 LLMs 以了解幸福感，一个关键挑战出现了：LLMs 能否生成不仅准确且能够针对不同受众量身定制的解释？高质量的解释既需要事实的正确性，也需要满足具有不同专业水平用户的期望。在本研究中，我们构建了一个大规模数据集，包含由十个多样化 LLMs 生成的 2,194 个幸福感概念的 43,880 条解释。我们引入了一个基于原则指导的 LLM 作为评判者的评估框架，采用双重评判者来评估解释质量。此外，我们展示了通过监督微调（SFT）和直接偏好优化（DPO）对开源 LLM 进行微调，可以显著提升生成解释的质量。我们的结果显示：（1）所提出的 LLM 评判与人工评估高度一致；（2）解释质量在不同模型、受众和类别之间存在显著差异；（3）经过 DPO 和 SFT 微调的模型表现优于其更大规模的对应模型，证明了基于偏好的学习在专业解释任务中的有效性。

Subjects: Computation and Language, Artificial Intelligence, Human-Computer Interaction 主题：计算与语言，人工智能，人机交互

Publish: 2025-08-06 00:45:02 UTC 发布时间：2025-08-06 00:45:02 UTC

#53 Confidence-Weighted Token Set Cover for Early Hypothesis Pruning in Self-Consistency #53 置信加权令牌集覆盖用于自洽性中的早期假设剪枝

Authors: [Md Arafat Sultan](https://arxiv.org/search/?searchtype=author&query=Md Arafat Sultan), [Ramón Fernandez Astudillo](https://arxiv.org/search/?searchtype=author&query=Ramón Fernandez Astudillo) 作者：Md Arafat Sultan，Ramón Fernandez Astudillo

Despite its simplicity and efficacy, the high token expenditure of self-consistency can limit its practical utility. Here we investigate if self-consistency can be made more token-efficient for long chain-of-thought reasoning tasks, while preserving its parallelism, through early hypothesis pruning. Concretely, we generate all solutions in parallel, but periodically prune intermediate hypotheses that are deemed unnecessary based on two lightweight indicators: (a) the model’s own confidence in individual hypotheses, and (b) lexical coverage of all current hypotheses by candidate subsets that are under consideration for continued retention. We design a fast weighted set cover algorithm that utilizes the two indicators; our evaluation of five LLMs on three math benchmarks shows that this method can improve token efficiency for all models, by 10-35% in many cases. 尽管自洽性方法简单且有效，但其高令牌消耗可能限制其实用性。本文探讨是否可以通过早期假设剪枝，使自洽性在长链思维推理任务中更加节省令牌，同时保持其并行性。具体来说，我们并行生成所有解答，但定期根据两个轻量级指标剪除被认为不必要的中间假设：（a）模型对单个假设的自信度，以及（b）候选子集对所有当前假设的词汇覆盖情况，这些子集被考虑继续保留。我们设计了一种快速加权集合覆盖算法，利用这两个指标；在三个数学基准测试中对五个 LLMs 的评估表明，该方法在许多情况下可提升所有模型的令牌效率 10-35%。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-06 00:14:18 UTC 发布时间：2025-08-06 00:14:18 UTC
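
The pruning idea in the abstract above (keep a small set of hypotheses that is both high-confidence and lexically covers the rest) can be sketched as a greedy weighted set cover. The abstract does not give the algorithm's details, so the scoring rule (confidence times newly covered tokens), the `coverage` threshold, and the whitespace tokenization are all assumptions for illustration.

```python
def prune_hypotheses(hypotheses, confidences, coverage=1.0):
    """Greedily pick hypotheses to retain.

    hypotheses:  list of partial solutions (strings), whitespace-tokenized here
    confidences: one non-negative confidence score per hypothesis
    coverage:    hypothetical fraction of the token vocabulary that the
                 retained hypotheses must cover before pruning stops
    Returns the sorted indices of the hypotheses to keep.
    """
    token_sets = [set(h.split()) for h in hypotheses]
    universe = set().union(*token_sets)
    target = coverage * len(universe)

    covered, kept = set(), []
    remaining = set(range(len(hypotheses)))
    while len(covered) < target and remaining:
        # Score = confidence * number of not-yet-covered tokens; ties go to the lower index.
        best = max(remaining,
                   key=lambda i: (confidences[i] * len(token_sets[i] - covered), -i))
        if not token_sets[best] - covered:
            break  # nothing new can be covered
        kept.append(best)
        covered |= token_sets[best]
        remaining.discard(best)
    return sorted(kept)


if __name__ == "__main__":
    hyps = ["a b c", "a b c", "d e", "a"]
    conf = [0.9, 0.5, 0.8, 0.7]
    print(prune_hypotheses(hyps, conf))
```

Hypotheses that are not returned would be dropped at the pruning checkpoint, while the rest continue decoding in parallel.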

#54 Data and AI governance: Promoting equity, ethics, and fairness in large language models #54 数据与人工智能治理：促进大型语言模型中的公平、伦理与公正

Authors: [Alok Abhishek](https://arxiv.org/search/?searchtype=author&query=Alok Abhishek), [Lisa Erickson](https://arxiv.org/search/?searchtype=author&query=Lisa Erickson), [Tushar Bandopadhyay](https://arxiv.org/search/?searchtype=author&query=Tushar Bandopadhyay) 作者：Alok Abhishek，Lisa Erickson，Tushar Bandopadhyay

In this paper, we cover approaches to systematically govern, assess and quantify bias across the complete life cycle of machine learning models, from initial development and validation to ongoing production monitoring and guardrail implementation. Building upon our foundational work on the Bias Evaluation and Assessment Test Suite (BEATS) for Large Language Models, we share prevalent bias- and fairness-related gaps in Large Language Models (LLMs) and discuss a data and AI governance framework to address Bias, Ethics, Fairness, and Factuality within LLMs. The data and AI governance approach discussed in this paper is suitable for practical, real-world applications, enabling rigorous benchmarking of LLMs prior to production deployment, facilitating continuous real-time evaluation, and proactively governing LLM-generated responses. By implementing data and AI governance across the life cycle of AI development, organizations can significantly enhance the safety and responsibility of their GenAI systems, effectively mitigating risks of discrimination and protecting against potential reputational or brand-related harm. Ultimately, through this article, we aim to advance the creation and deployment of socially responsible and ethically aligned applications powered by generative artificial intelligence. 在本文中，我们介绍了系统性管理、评估和量化机器学习模型偏见的方法，涵盖从初始开发和验证到持续生产监控及安全防护的整个生命周期。基于我们在大型语言模型（LLMs）偏见评估测试套件（BEATS）方面的基础工作，作者分享了大型语言模型中普遍存在的偏见和公平性相关的缺口，并讨论了用于解决 LLMs 中的偏见、伦理、公平性和事实性的数据与 AI 治理框架。本文讨论的数据与 AI 治理方法适用于实际的现实应用，能够在生产部署前对 LLMs 进行严格的基准测试，促进持续的实时评估，并主动管理 LLM 生成的响应。通过在 AI 开发生命周期中实施数据与 AI 治理，组织可以显著提升其生成式 AI 系统的安全性和责任性，有效降低歧视风险，防止潜在的声誉或品牌相关损害。最终，通过这篇文章，我们旨在推动社会责任感强且符合伦理的生成式人工智能驱动应用的创建和部署的发展。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-05 23:15:31 UTC 发布时间：2025-08-05 23:15:31 UTC

#55 CAP-LLM: Context-Augmented Personalized Large Language Models for News Headline Generation #55 CAP-LLM：用于新闻标题生成的上下文增强个性化大型语言模型

Authors: [Raymond Wilson](https://arxiv.org/search/?searchtype=author&query=Raymond Wilson), [Cole Graham](https://arxiv.org/search/?searchtype=author&query=Cole Graham), [Chase Carter](https://arxiv.org/search/?searchtype=author&query=Chase Carter), [Zefeng Yang](https://arxiv.org/search/?searchtype=author&query=Zefeng Yang), [Ruiqi Gu](https://arxiv.org/search/?searchtype=author&query=Ruiqi Gu) 作者：Raymond Wilson, Cole Graham, Chase Carter, Zefeng Yang, Ruiqi Gu

In the era of information overload, personalized news headline generation is crucial for engaging users by tailoring content to their preferences while accurately conveying news facts. Existing methods struggle with effectively capturing complex user interests and ensuring factual consistency, often leading to generic or misleading headlines. Leveraging the unprecedented capabilities of Large Language Models (LLMs) in text generation, we propose Context-Augmented Personalized LLM (CAP-LLM), a novel framework that integrates user preferences and factual consistency constraints into a powerful pre-trained LLM backbone. CAP-LLM features a User Preference Encoder to capture long-term user interests, a Context Injection Adapter to seamlessly integrate these preferences and current article context into the LLM’s generation process, and a Fact-Consistency Reinforcement Module employing a novel contrastive loss to mitigate hallucination. Evaluated on the real-world PENS dataset, CAP-LLM achieves state-of-the-art performance across all metrics. Notably, it significantly improves factual consistency (FactCC of 87.50) over strong baselines like BART (86.67), while simultaneously enhancing personalization (Pc(avg) 2.73, Pc(max) 17.25) and content coverage (ROUGE-1 26.55, ROUGE-2 9.95, ROUGE-L 23.01). Our ablation studies, human evaluations, and sensitivity analyses further validate the effectiveness of each component and the robustness of our approach, demonstrating CAP-LLM’s ability to achieve a superior balance between personalization and factual accuracy in news headline generation. 
在信息过载的时代，个性化新闻标题生成对于通过定制内容以符合用户偏好并准确传达新闻事实至关重要。现有方法在有效捕捉复杂用户兴趣和确保事实一致性方面存在困难，常导致生成的标题过于通用或具有误导性。利用 LLMs 在文本生成方面的前所未有的能力，我们提出了上下文增强个性化 LLM（CAP-LLM），这是一种将用户偏好和事实一致性约束整合到强大预训练 LLM 骨干中的新型框架。CAP-LLM 具备用户偏好编码器，用于捕捉长期用户兴趣；上下文注入适配器，实现将这些偏好和当前文章上下文无缝整合到 LLM 的生成过程中；以及事实一致性强化模块，采用新颖的对比损失以减少幻觉现象。在真实世界的 PENS 数据集上评估，CAP-LLM 在所有指标上均达到最先进的性能。值得注意的是，它在事实一致性方面显著提升（FactCC 达到 87.50），优于强基线模型如 BART（86.67），同时提升了个性化（Pc(avg) 2.73，Pc(max) 17.25）和内容覆盖率（ROUGE-1 26.55，ROUGE-2 9.95，ROUGE-L 23.01）。我们的消融研究、人工评估和敏感性分析进一步验证了各组件的有效性及方法的鲁棒性，展示了 CAP-LLM 在新闻标题生成中实现个性化与事实准确性之间更优平衡的能力。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-05 21:55:44 UTC 发布：2025-08-05 21:55:44 UTC

#56 CoAct-1: Computer-using Agents with Coding as Actions #56 CoAct-1：以编码为动作的计算机使用代理

Autonomous agents that operate computers via Graphical User Interfaces (GUIs) often struggle with efficiency and reliability on complex, long-horizon tasks. While augmenting these agents with planners can improve task decomposition, they remain constrained by the inherent limitations of performing all actions through GUI manipulation, leading to brittleness and inefficiency. In this work, we introduce a more robust and flexible paradigm: enabling agents to use coding as an enhanced action. We present CoAct-1, a novel multi-agent system that synergistically combines GUI-based control with direct programmatic execution. CoAct-1 features an Orchestrator that dynamically delegates subtasks to either a conventional GUI Operator or a specialized Programmer agent, which can write and execute Python or Bash scripts. This hybrid approach allows the agent to bypass inefficient GUI action sequences for tasks like file management and data processing, while still leveraging visual interaction when necessary. We evaluate our system on the challenging OSWorld benchmark, where CoAct-1 achieves a new state-of-the-art success rate of 60.76%, significantly outperforming prior methods. Furthermore, our approach dramatically improves efficiency, reducing the average number of steps required to complete a task to just 10.15, compared to 15 for leading GUI agents. Our results demonstrate that integrating coding as a core action provides a more powerful, efficient, and scalable path toward generalized computer automation.
通过图形用户界面（GUI）操作计算机的自主代理在处理复杂的长时间任务时，常常面临效率和可靠性的问题。虽然为这些代理增加规划功能可以改善任务分解，但它们仍受限于必须通过 GUI 操作执行所有动作的固有限制，导致系统脆弱且效率低下。在本工作中，我们引入了一种更稳健且灵活的范式：使代理能够将编程作为一种增强的操作方式。我们提出了 CoAct-1，一种新颖的多代理系统，协同结合了基于 GUI 的控制与直接的程序执行。CoAct-1 配备了一个协调者（Orchestrator），能够动态地将子任务分配给传统的 GUI 操作员或专门的程序员代理，后者可以编写并执行 Python 或 Bash 脚本。这种混合方法使代理能够绕过低效的 GUI 操作序列，完成如文件管理和数据处理等任务，同时在必要时仍能利用视觉交互。我们在具有挑战性的 OSWorld 基准测试中评估了该系统，CoAct-1 取得了 60.76%的新最高成功率，显著优于以往方法。此外，我们的方法显著提升了效率，将完成任务所需的平均步骤数减少到仅 10.15 步，而领先的 GUI 代理则需要 15 步。我们的结果表明，将编码作为核心操作整合，提供了一条更强大、高效且可扩展的通用计算机自动化路径。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-05 21:33:36 UTC 发布时间：2025-08-05 21:33:36 UTC

Social intelligence has become a critical capability for large language models (LLMs), enabling them to engage effectively in real-world social tasks such as accommodation, persuasion, collaboration, and negotiation. Reinforcement learning (RL) is a natural fit for training socially intelligent agents because it allows models to learn sophisticated strategies directly through social interactions. However, social interactions have two key characteristics that set barriers for RL training: (1) partial observability, where utterances have indirect and delayed effects that complicate credit assignment, and (2) multi-dimensionality, where behaviors such as rapport-building or knowledge-seeking contribute indirectly to goal achievement. These characteristics make Markov decision process (MDP)-based RL with single-dimensional episode-level rewards inefficient and unstable. To address these challenges, we propose Sotopia-RL, a novel framework that refines coarse episode-level feedback into utterance-level, multi-dimensional rewards. Utterance-level credit assignment mitigates partial observability by attributing outcomes to individual utterances, while multi-dimensional rewards capture the full richness of social interactions and reduce reward hacking. Experiments in Sotopia, an open-ended social learning environment, demonstrate that Sotopia-RL achieves state-of-the-art social goal completion scores (7.17 on Sotopia-hard and 8.31 on Sotopia-full), significantly outperforming existing approaches. Ablation studies confirm the necessity of both utterance-level credit assignment and multi-dimensional reward design for RL training. Our implementation is publicly available at: https://github.com/sotopia-lab/sotopia-rl. 
社会智能已成为大型语言模型（LLMs）的一项关键能力，使其能够有效参与现实世界中的社会任务，如适应、说服、协作和谈判。强化学习（RL）是训练具备社会智能代理的自然选择，因为它允许模型通过社会互动直接学习复杂策略。然而，社会互动具有两个关键特征，这为 RL 训练设置了障碍：（1）部分可观测性，即话语具有间接且延迟的影响，增加了归因的复杂性；（2）多维性，即诸如建立融洽关系或寻求知识等行为间接促进目标的实现。这些特征使得基于马尔可夫决策过程（MDP）且采用单维度、基于回合的奖励的 RL 方法效率低下且不稳定。为了解决这些挑战，我们提出了 Sotopia-RL，一种将粗略的回合级反馈细化为话语级、多维度奖励的新框架。话语级别的信用分配通过将结果归因于单个话语，缓解了部分可观察性问题，而多维度奖励则捕捉了社交互动的全部丰富性，减少了奖励作弊。在开放式社交学习环境 Sotopia 中的实验表明，Sotopia-RL 实现了最先进的社交目标完成分数（Sotopia-hard 为 7.17，Sotopia-full 为 8.31），显著优于现有方法。消融研究确认了话语级别信用分配和多维度奖励设计对于强化学习训练的必要性。我们的实现已公开，地址为：https://github.com/sotopia-lab/sotopia-rl。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-05 20:43:42 UTC 发布时间：2025-08-05 20:43:42 UTC
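
The refinement of episode-level feedback into utterance-level, multi-dimensional rewards described in the abstract above can be sketched as follows. The abstract does not specify how credit is attributed or how dimensions are combined, so the proportional split of each dimension's episode score and the weighted sum across dimensions are assumptions, as are all names and numbers in the example.

```python
def utterance_rewards(episode_scores, attributions, dim_weights):
    """Split episode-level scores into per-utterance scalar rewards.

    episode_scores: dict dimension -> episode-level score (e.g. goal, rapport)
    attributions:   list with one dict per utterance, dimension -> raw
                    (non-negative) credit assigned to that utterance
    dim_weights:    dict dimension -> hypothetical weight used to combine
                    dimensions into one scalar reward
    """
    rewards = [0.0] * len(attributions)
    for dim, score in episode_scores.items():
        total_credit = sum(a.get(dim, 0.0) for a in attributions)
        if total_credit == 0:
            continue  # no utterance received credit on this dimension
        for i, a in enumerate(attributions):
            share = a.get(dim, 0.0) / total_credit  # utterance's share of the credit
            rewards[i] += dim_weights.get(dim, 0.0) * score * share
    return rewards


if __name__ == "__main__":
    episode = {"goal": 8.0, "rapport": 4.0}
    credit = [{"goal": 3.0, "rapport": 1.0}, {"goal": 1.0, "rapport": 1.0}]
    weights = {"goal": 1.0, "rapport": 0.5}
    print(utterance_rewards(episode, credit, weights))
```

Under this split, the per-utterance rewards sum to the weighted episode-level total, so the finer-grained signal stays consistent with the coarse feedback it was derived from.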

#58 An Entity Linking Agent for Question Answering #58 一个用于问答的实体链接代理

Some Question Answering (QA) systems rely on knowledge bases (KBs) to provide accurate answers. Entity Linking (EL) plays a critical role in linking natural language mentions to KB entries. However, most existing EL methods are designed for long contexts and do not perform well on short, ambiguous user questions in QA tasks. We propose an entity linking agent for QA, based on a Large Language Model that simulates human cognitive workflows. The agent actively identifies entity mentions, retrieves candidate entities, and makes decisions. To verify the effectiveness of our agent, we conduct two experiments: tool-based entity linking and QA task evaluation. The results confirm the robustness and effectiveness of our agent. 一些问答（QA）系统依赖知识库（KB）来提供准确的答案。实体链接（EL）在将自然语言提及与知识库条目关联中起着关键作用。然而，大多数现有的实体链接方法是为长上下文设计的，在问答任务中对简短且模糊的用户问题表现不佳。我们提出了一种基于大型语言模型的问答实体链接代理，该代理模拟人类认知工作流程。该代理主动识别实体提及，检索候选实体，并做出决策。为了验证我们代理的有效性，我们进行了两项实验：基于工具的实体链接和问答任务评估。结果证实了我们代理的鲁棒性和有效性。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-05 19:28:43 UTC 发布时间：2025-08-05 19:28:43 UTC

#59 Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models #59 从幻觉到真相：大型语言模型中的事实核查与真实性评估综述

Authors: [Subhey Sadi Rahman](https://arxiv.org/search/?searchtype=author&query=Subhey Sadi Rahman), [Md. Adnanul Islam](https://arxiv.org/search/?searchtype=author&query=Md. Adnanul Islam), [Md. Mahbub Alam](https://arxiv.org/search/?searchtype=author&query=Md. Mahbub Alam), [Musarrat Zeba](https://arxiv.org/search/?searchtype=author&query=Musarrat Zeba), [Md. Abdur Rahman](https://arxiv.org/search/?searchtype=author&query=Md. Abdur Rahman), [Sadia Sultana Chowa](https://arxiv.org/search/?searchtype=author&query=Sadia Sultana Chowa), [Mohaimenul Azam Khan Raiaan](https://arxiv.org/search/?searchtype=author&query=Mohaimenul Azam Khan Raiaan), [Sami Azam](https://arxiv.org/search/?searchtype=author&query=Sami Azam) 作者：Subhey Sadi Rahman，Md. Adnanul Islam，Md. Mahbub Alam，Musarrat Zeba，Md. Abdur Rahman，Sadia Sultana Chowa，Mohaimenul Azam Khan Raiaan，Sami Azam

Large Language Models (LLMs) are trained on vast and diverse internet corpora that often include inaccurate or misleading content. Consequently, LLMs can generate misinformation, making robust fact-checking essential. This review systematically analyzes how LLM-generated content is evaluated for factual accuracy by exploring key challenges such as hallucinations, dataset limitations, and the reliability of evaluation metrics. The review emphasizes the need for strong fact-checking frameworks that integrate advanced prompting strategies, domain-specific fine-tuning, and retrieval-augmented generation (RAG) methods. It proposes five research questions that guide the analysis of the recent literature from 2020 to 2025, focusing on evaluation methods and mitigation techniques. The review also discusses the role of instruction tuning, multi-agent reasoning, and external knowledge access via RAG frameworks. Key findings highlight the limitations of current metrics, the value of grounding outputs with validated external evidence, and the importance of domain-specific customization to improve factual consistency. Overall, the review underlines the importance of building LLMs that are not only accurate and explainable but also tailored for domain-specific fact-checking. These insights contribute to the advancement of research toward more trustworthy and context-aware language models. 大型语言模型（LLMs）是在庞大且多样化的互联网语料库上训练的，这些语料库中常包含不准确或误导性内容。因此，LLMs 可能生成错误信息，因而强有力的事实核查至关重要。本综述系统地分析了如何评估 LLMs 生成内容的事实准确性，探讨了诸如幻觉现象、数据集限制以及评估指标可靠性等关键挑战。综述强调了构建强大事实核查框架的必要性，该框架应整合先进的提示策略、领域特定的微调以及检索增强生成（RAG）方法。文章提出了五个研究问题，指导对 2020 年至 2025 年最新文献的分析，重点关注评估方法和缓解技术。综述还讨论了指令调优、多智能体推理以及通过 RAG 框架访问外部知识的作用。主要发现指出当前指标的局限性、以经过验证的外部证据为基础的输出的重要价值，以及领域特定定制对提升事实一致性的关键作用。总体而言，综述强调了构建不仅准确且可解释，同时针对特定领域事实核查量身定制的 LLMs 的重要性。这些见解有助于推动研究向更可信且具备上下文感知能力的语言模型发展。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-05 19:20:05 UTC 发布时间：2025-08-05 19:20:05 UTC

#60 Majority Bit-Aware Watermarking For Large Language Models #60 面向大型语言模型的多数位感知水印技术

Authors: [Jiahao Xu](https://arxiv.org/search/?searchtype=author&query=Jiahao Xu), [Rui Hu](https://arxiv.org/search/?searchtype=author&query=Rui Hu), [Zikai Zhang](https://arxiv.org/search/?searchtype=author&query=Zikai Zhang) 作者：徐嘉豪，胡锐，张子凯

The growing deployment of Large Language Models (LLMs) in real-world applications has raised concerns about their potential misuse in generating harmful or deceptive content. To address this issue, watermarking techniques have emerged as a promising solution by embedding identifiable binary messages into generated text for origin verification and misuse tracing. While recent efforts have explored multi-bit watermarking schemes capable of embedding rich information such as user identifiers, they typically suffer from the fundamental trade-off between text quality and decoding accuracy: to ensure reliable message decoding, they have to restrict the size of preferred token sets during encoding, yet such restrictions reduce the quality of the generated content. In this work, we propose MajorMark, a novel watermarking method that improves this trade-off through majority bit-aware encoding. MajorMark selects preferred token sets based on the majority bit of the message, enabling a larger and more flexible sampling of tokens. In contrast to prior methods that rely on token frequency analysis for decoding, MajorMark employs a clustering-based decoding strategy, which maintains high decoding accuracy even when the preferred token set is large, thus preserving both content quality and decoding accuracy. We further introduce MajorMark+, which partitions the message into multiple blocks to independently encode and deterministically decode each block, thereby further enhancing the quality of watermarked text and improving decoding accuracy. Extensive experiments on state-of-the-art LLMs demonstrate that our methods significantly enhance both decoding accuracy and text generation quality, outperforming prior multi-bit watermarking baselines. 
大型语言模型（LLMs）在现实应用中的日益广泛部署引发了人们对其可能被滥用于生成有害或欺骗性内容的担忧。为解决这一问题，水印技术作为一种有前景的解决方案应运而生，通过在生成文本中嵌入可识别的二进制信息，实现来源验证和滥用追踪。尽管近期的研究探索了能够嵌入丰富信息（如用户标识符）的多比特水印方案，但它们通常面临文本质量与解码准确性之间的根本权衡：为了确保消息的可靠解码，必须在编码时限制优选词集的大小，而这种限制又降低了生成内容的质量。在本工作中，我们提出了 MajorMark，一种通过多数比特感知编码来改善该权衡的新型水印方法。MajorMark 基于消息的多数比特选择优选词集，从而实现更大且更灵活的词汇采样。与以往依赖于词元频率分析进行解码的方法不同，MajorMark 采用基于聚类的解码策略，即使在首选词元集合较大时也能保持较高的解码准确率，从而兼顾内容质量和解码准确性。我们进一步引入了 MajorMark + ，该方法将消息划分为多个块，独立编码并确定性解码每个块，从而进一步提升水印文本的质量并提高解码准确率。在最先进的 LLMs 上进行的大量实验表明，我们的方法显著提升了解码准确率和文本生成质量，优于以往的多比特水印基线。

Subjects: Computation and Language, Cryptography and Security 主题：计算与语言，密码学与安全

Publish: 2025-08-05 18:19:00 UTC 发布时间：2025-08-05 18:19:00 UTC

#61 AttnTrace: Attention-based Context Traceback for Long-Context LLMs #61 AttnTrace：基于注意力的长上下文 LLMs 上下文追溯

Authors: [Yanting Wang](https://arxiv.org/search/?searchtype=author&query=Yanting Wang), [Runpeng Geng](https://arxiv.org/search/?searchtype=author&query=Runpeng Geng), [Ying Chen](https://arxiv.org/search/?searchtype=author&query=Ying Chen), [Jinyuan Jia](https://arxiv.org/search/?searchtype=author&query=Jinyuan Jia) 作者：王艳婷，耿润鹏，陈颖，贾金元

Long-context large language models (LLMs), such as Gemini-2.5-Pro and Claude-Sonnet-4, are increasingly used to empower advanced AI systems, including retrieval-augmented generation (RAG) pipelines and autonomous agents. In these systems, an LLM receives an instruction along with a context–often consisting of texts retrieved from a knowledge database or memory–and generates a response that is contextually grounded by following the instruction. Recent studies have designed solutions to trace back to a subset of texts in the context that contributes most to the response generated by the LLM. These solutions have numerous real-world applications, including performing post-attack forensic analysis and improving the interpretability and trustworthiness of LLM outputs. While significant efforts have been made, state-of-the-art solutions such as TracLLM often lead to a high computation cost, e.g., it takes TracLLM hundreds of seconds to perform traceback for a single response-context pair. In this work, we propose AttnTrace, a new context traceback method based on the attention weights produced by an LLM for a prompt. To effectively utilize attention weights, we introduce two techniques designed to enhance the effectiveness of AttnTrace, and we provide theoretical insights for our design choice. We also perform a systematic evaluation for AttnTrace. The results demonstrate that AttnTrace is more accurate and efficient than existing state-of-the-art context traceback methods. We also show that AttnTrace can improve state-of-the-art methods in detecting prompt injection under long contexts through the attribution-before-detection paradigm. As a real-world application, we demonstrate that AttnTrace can effectively pinpoint injected instructions in a paper designed to manipulate LLM-generated reviews. The code is at https://github.com/Wang-Yanting/AttnTrace. 
长上下文大型语言模型（LLMs），如 Gemini-2.5-Pro 和 Claude-Sonnet-4，正日益被用于增强先进的 AI 系统，包括检索增强生成（RAG）流水线和自主代理。在这些系统中，LLM 接收一条指令及其上下文——通常由从知识数据库或记忆中检索的文本组成——并生成一个基于上下文且遵循指令的响应。近期研究设计了追溯上下文中对 LLM 生成响应贡献最大的文本子集的解决方案。这些方案在现实世界中有诸多应用，包括执行攻击后的取证分析以及提升 LLM 输出的可解释性和可信度。尽管已做出大量努力，最先进的解决方案如 TracLLM 通常带来高计算成本，例如 TracLLM 对单个响应-上下文对进行追溯时需耗费数百秒。在本工作中，我们提出了 AttnTrace，一种基于 LLM 对提示生成的注意力权重的新型上下文追溯方法。为了有效利用注意力权重，我们引入了两种技术，旨在提升 AttnTrace 的效果，并为我们的设计选择提供了理论见解。我们还对 AttnTrace 进行了系统评估。结果表明，AttnTrace 在准确性和效率上均优于现有的最先进上下文追溯方法。我们还展示了 AttnTrace 能通过归因-先检测范式，提升最先进方法在长上下文中检测提示注入的能力。作为一个实际应用，我们演示了 AttnTrace 能有效定位旨在操控 LLM 生成评论的论文中的注入指令。代码地址为 https://github.com/Wang-Yanting/AttnTrace。

Subjects: Computation and Language, Cryptography and Security 主题：计算与语言，密码学与安全

Publish: 2025-08-05 17:56:51 UTC 发布时间：2025-08-05 17:56:51 UTC

#62 GanitBench: A bi-lingual benchmark for evaluating mathematical reasoning in Vision Language Models #62 GanitBench：一个用于评估视觉语言模型中数学推理的双语基准

Authors: [Ashutosh Bandooni](https://arxiv.org/search/?searchtype=author&query=Ashutosh Bandooni), [Brindha Subburaj](https://arxiv.org/search/?searchtype=author&query=Brindha Subburaj) 作者：Ashutosh Bandooni，Brindha Subburaj

Benchmarks for evaluating reasoning among Vision Language Models (VLMs) across several fields and domains have been curated more frequently over the last few years. However, these are often monolingual, mostly available in English. Additionally, there is a lack of datasets available in Hindi for tasks other than comprehension and translation. We introduce GanitBench, a tough benchmark consisting of 1527 vision-only questions covering several topics in Mathematics, available in English and Hindi. Collected from two major examinations from India, the JEE Advanced and the CBSE Boards examinations, this benchmark includes questions in the form of images comprising figures essential to a question as well as text. We evaluate two closed-source models on it, in zero-shot Chain-of-Thought (CoT) and two-shot CoT settings. GPT-4o mini is found to be the more dominant model on the benchmark, with its highest average accuracy being 38.15%. We also evaluate models through a “Double Lock” constraint, which brings down the performance of the models by considerable margins. We observe that two-shot CoT appears to be a more effective setting under this environment. Performance of the two VLMs also decreases when answering the same questions in the Hindi language. We hope to facilitate the inclusion of languages like Hindi in research through our work. 近年来，用于评估视觉语言模型（VLM）在多个领域和学科中推理能力的基准测试越来越多地被整理出来。然而，这些基准测试通常是单语的，大多以英语提供。此外，除理解和翻译任务外，印地语相关的数据集也较为缺乏。我们推出了 GanitBench，这是一个包含 1527 个仅视觉问题的严苛基准，涵盖数学的多个主题，提供英语和印地语版本。该基准收集自印度的两大考试——JEE Advanced 考试和 CBSE Boards 考试，题目以图像形式呈现，包含对问题至关重要的图形及文本。我们在零样本链式思维（CoT）和两样本链式思维设置下，对两个闭源模型进行了评估。GPT-4o mini 在该基准上表现更为突出，最高平均准确率达到 38.15%。我们还通过“双重锁定”约束对模型进行了评估，该约束显著降低了模型的表现。我们观察到，在这种环境下，两样本链式思维似乎是一种更有效的设置。当用印地语回答相同问题时，这两种 VLM 的表现也会下降。我们希望通过我们的工作促进印地语等语言在研究中的纳入。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-07-31 18:24:05 UTC 发布时间：2025-07-31 18:24:05 UTC

#63 WINELL: Wikipedia Never-Ending Updating with LLM Agents #63 WINELL：使用 LLM 代理的维基百科永无止境更新

Wikipedia, a vast and continuously consulted knowledge base, faces significant challenges in maintaining up-to-date content due to its reliance on manual human editors. Inspired by the vision of continuous knowledge acquisition in NELL and fueled by advances in LLM-based agents, this paper introduces WiNELL, an agentic framework for continuously updating Wikipedia articles. Our approach employs a multi-agent framework to aggregate online information, select new and important knowledge for a target entity in Wikipedia, and then generate precise edit suggestions for human review. Our fine-grained editing models, trained on Wikipedia’s extensive history of human edits, enable incorporating updates in a manner consistent with human editing behavior. Our editor models outperform both open-source instruction-following baselines and closed-source LLMs (e.g., GPT-4o) in key information coverage and editing efficiency. End-to-end evaluation on high-activity Wikipedia pages demonstrates WiNELL’s ability to identify and suggest timely factual updates. This opens up a promising research direction in LLM agents for automatically updating knowledge bases in a never-ending fashion. 维基百科作为一个庞大且持续被查询的知识库，由于依赖人工编辑，面临着保持内容最新的重大挑战。受 NELL 中持续知识获取愿景的启发，并借助基于 LLM 的智能体的进展，本文提出了 WiNELL，一种用于持续更新维基百科条目的智能体框架。我们的方法采用多智能体框架，聚合在线信息，选择维基百科中目标实体的新颖且重要的知识，然后生成精确的编辑建议供人工审核。我们基于维基百科丰富的人类编辑历史训练的细粒度编辑模型，使得更新能够以符合人类编辑行为的方式进行。我们的编辑模型在关键信息覆盖率和编辑效率上均优于开源的指令跟随基线模型和闭源的 LLM（如 GPT-4o）。在高活跃度维基百科页面上的端到端评估表明，WiNELL 能够识别并建议及时的事实更新。这为基于 LLM 的智能体自动持续更新知识库开辟了一个有前景的研究方向。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-07-30 07:51:42 UTC 发布时间：2025-07-30 07:51:42 UTC

#64 Hierarchical Verification of Speculative Beams for Accelerating LLM Inference #64 分层验证投机波束以加速 LLM 推理

Authors: [Jaydip Sen](https://arxiv.org/search/?searchtype=author&query=Jaydip Sen), [Harshitha Puvvala](https://arxiv.org/search/?searchtype=author&query=Harshitha Puvvala), [Subhasis Dasgupta](https://arxiv.org/search/?searchtype=author&query=Subhasis Dasgupta) 作者：Jaydip Sen，Harshitha Puvvala，Subhasis Dasgupta

Large language models (LLMs) have achieved remarkable success across diverse natural language processing tasks but face persistent challenges in inference efficiency due to their autoregressive nature. While speculative decoding and beam sampling offer notable improvements, traditional methods verify draft sequences sequentially without prioritization, leading to unnecessary computational overhead. This work proposes the Hierarchical Verification Tree (HVT), a novel framework that restructures speculative beam decoding by prioritizing high-likelihood drafts and enabling early pruning of suboptimal candidates. Theoretical foundations and a formal verification-pruning algorithm are developed to ensure correctness and efficiency. Integration with standard LLM inference pipelines is achieved without requiring retraining or architecture modification. Experimental evaluations across multiple datasets and models demonstrate that HVT consistently outperforms existing speculative decoding schemes, achieving substantial reductions in inference time and energy consumption while maintaining or enhancing output quality. The findings highlight the potential of hierarchical verification strategies as a new direction for accelerating large language model inference. 大型语言模型（LLMs）在多种自然语言处理任务中取得了显著成功，但由于其自回归特性，推理效率仍面临持续挑战。尽管推测解码和束采样带来了显著改进，传统方法在验证草稿序列时仍按顺序进行且无优先级，导致不必要的计算开销。本文提出了分层验证树（HVT），一种通过优先考虑高概率草稿并实现对次优候选的早期剪枝来重构推测束解码的新框架。我们构建了理论基础并设计了形式化的验证-剪枝算法，以确保正确性和效率。该方法可无须重新训练或修改架构，即可集成到标准 LLM 推理流程中。多数据集和多模型的实验评估表明，HVT 始终优于现有的推测解码方案，在保持或提升输出质量的同时，大幅减少了推理时间和能耗。研究结果突显了分层验证策略作为加速大型语言模型推理的新方向的潜力。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-07-30 02:58:03 UTC 发布时间：2025-07-30 02:58:03 UTC

#65 Intent Aware Context Retrieval for Multi-Turn Agricultural Question Answering #65 多轮农业问答的意图感知上下文检索

Indian farmers often lack timely, accessible, and language-friendly agricultural advice, especially in rural areas with low literacy. To address this gap in accessibility, this paper presents a novel AI-powered agricultural chatbot, Krishi Sathi, designed to support Indian farmers by providing personalized, easy-to-understand answers to their queries through both text and speech. The system’s intelligence stems from an IFT model, subsequently refined through fine-tuning on Indian agricultural knowledge across three curated datasets. Unlike traditional chatbots that respond to one-off questions, Krishi Sathi follows a structured, multi-turn conversation flow to gradually collect the necessary details from the farmer, ensuring the query is fully understood before generating a response. Once the intent and context are extracted, the system performs Retrieval-Augmented Generation (RAG) by first fetching information from a curated agricultural database and then generating a tailored response using the IFT model. The chatbot supports both English and Hindi languages, with speech input and output features (via ASR and TTS) to make it accessible for users with low literacy or limited digital skills. This work demonstrates how combining intent-driven dialogue flows, instruction-tuned models, and retrieval-based generation can improve the quality and accessibility of digital agricultural support in India. This approach yielded strong results, with the system achieving a query response accuracy of 97.53%, 91.35% contextual relevance and personalization, and a query completion rate of 97.53%. The average response time remained under 6 seconds, ensuring timely support for users across both English and Hindi interactions. 
印度农民常常缺乏及时、易获取且符合语言习惯的农业建议，尤其是在识字率较低的农村地区。为了解决这一可及性差距，本文提出了一种新型的人工智能驱动农业聊天机器人——Krishi Sathi，旨在通过文本和语音为印度农民提供个性化、易于理解的答复。该系统的智能核心基于 IFT 模型，随后通过在三个精选的印度农业知识数据集上进行微调加以优化。与传统只回答单次提问的聊天机器人不同，Krishi Sathi 采用结构化的多轮对话流程，逐步收集农民提供的必要信息，确保完全理解问题后再生成回答。一旦提取出意图和上下文，系统便通过检索增强生成（RAG）方法，先从精选的农业数据库中检索信息，再利用 IFT 模型生成定制化的回复。该聊天机器人支持英语和印地语，具备语音输入和输出功能（通过 ASR 和 TTS），以便为识字率低或数字技能有限的用户提供便利。该工作展示了如何结合基于意图的对话流程、指令调优模型和基于检索的生成方法，提升印度数字农业支持的质量和可及性。该方法取得了显著成果，系统实现了 97.53%的查询响应准确率、91.35%的上下文相关性和个性化，以及 97.53%的查询完成率。平均响应时间保持在 6 秒以内，确保了用户在英语和印地语交互中的及时支持。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-07-28 09:00:44 UTC 发布时间：2025-07-28 09:00:44 UTC

#66 FeynTune: Large Language Models for High-Energy Theory #66 FeynTune：用于高能理论的 LLMs

Authors: [Paul Richmond](https://arxiv.org/search/?searchtype=author&query=Paul Richmond), [Prarit Agarwal](https://arxiv.org/search/?searchtype=author&query=Prarit Agarwal), [Borun Chowdhury](https://arxiv.org/search/?searchtype=author&query=Borun Chowdhury), [Vasilis Niarchos](https://arxiv.org/search/?searchtype=author&query=Vasilis Niarchos), [Constantinos Papageorgakis](https://arxiv.org/search/?searchtype=author&query=Constantinos Papageorgakis) 作者：Paul Richmond, Prarit Agarwal, Borun Chowdhury, Vasilis Niarchos, Constantinos Papageorgakis

We present specialized Large Language Models for theoretical High-Energy Physics, obtained as 20 fine-tuned variants of the 8-billion parameter Llama-3.1 model. Each variant was trained on arXiv abstracts (through August 2024) from different combinations of hep-th, hep-ph and gr-qc. For a comparative study, we also trained models on datasets that contained abstracts from disparate fields such as the q-bio and cs categories. All models were fine-tuned using two distinct Low-Rank Adaptation fine-tuning approaches and varying dataset sizes, and outperformed the base model on hep-th abstract completion tasks. We compare performance against leading commercial LLMs (ChatGPT, Claude, Gemini, DeepSeek) and derive insights for further developing specialized language models for High-Energy Theoretical Physics. 我们提出了专门针对理论高能物理的 LLMs，这些模型是基于 80 亿参数的 Llama-3.1 模型微调得到的 20 个变体。每个变体均在不同组合的 hep-th、hep-ph 和 gr-qc 领域的 arXiv 摘要（截至 2024 年 8 月）上训练。为了进行对比研究，我们还训练了包含 q-bio 和 cs 等不同领域摘要的数据集上的模型。所有模型均采用两种不同的低秩适应微调方法和不同的数据集规模进行微调，并在 hep-th 摘要补全任务中表现优于基础模型。我们将其性能与主流商业 LLMs（ChatGPT、Claude、Gemini、DeepSeek）进行了比较，并总结了进一步开发高能理论物理专用语言模型的见解。

Subjects: Computation and Language, Machine Learning, High Energy Physics - Theory 主题：计算与语言，机器学习，高能物理 - 理论

Publish: 2025-07-24 18:21:03 UTC 发布时间：2025-07-24 18:21:03 UTC

#67 How Deep Is Representational Bias in LLMs? The Cases of Caste and Religion #67 LLMs 中的表征偏差有多深？以种姓和宗教为例

Authors: [Agrima Seth](https://arxiv.org/search/?searchtype=author&query=Agrima Seth), [Monojit Choudhary](https://arxiv.org/search/?searchtype=author&query=Monojit Choudhary), [Sunayana Sitaram](https://arxiv.org/search/?searchtype=author&query=Sunayana Sitaram), [Kentaro Toyama](https://arxiv.org/search/?searchtype=author&query=Kentaro Toyama), [Aditya Vashistha](https://arxiv.org/search/?searchtype=author&query=Aditya Vashistha), [Kalika Bali](https://arxiv.org/search/?searchtype=author&query=Kalika Bali) 作者：Agrima Seth, Monojit Choudhary, Sunayana Sitaram, Kentaro Toyama, Aditya Vashistha, Kalika Bali

Representational bias in large language models (LLMs) has predominantly been measured through single-response interactions and has focused on Global North-centric identities like race and gender. We expand on that research by conducting a systematic audit of GPT-4 Turbo to reveal how deeply encoded representational biases are and how they extend to less-explored dimensions of identity. We prompt GPT-4 Turbo to generate over 7,200 stories about significant life events (such as weddings) in India, using prompts designed to encourage diversity to varying extents. Comparing the diversity of religious and caste representation in the outputs against the actual population distribution in India as recorded in census data, we quantify the presence and “stickiness” of representational bias in the LLM for religion and caste. We find that GPT-4 responses consistently overrepresent culturally dominant groups far beyond their statistical representation, despite prompts intended to encourage representational diversity. Our findings also suggest that representational bias in LLMs has a winner-take-all quality that is more biased than the likely distribution bias in their training data, and repeated prompt-based nudges have limited and inconsistent efficacy in dislodging these biases. These results suggest that diversifying training data alone may not be sufficient to correct LLM bias, highlighting the need for more fundamental changes in model development. 
Dataset and Codebook: https://github.com/agrimaseth/How-Deep-Is-Representational-Bias-in-LLMs 大型语言模型（LLMs）中的表征偏差主要通过单次响应交互进行测量，且集中于以全球北方为中心的身份认同，如种族和性别。我们在此基础上扩展研究，系统审计 GPT-4 Turbo，以揭示表征偏差的深度编码程度及其如何延伸至较少被探索的身份维度。我们通过设计不同程度鼓励多样性的提示，促使 GPT-4 Turbo 生成超过 7,200 个关于印度重大生活事件（如婚礼）的故事。将输出中宗教和种姓的多样性与印度人口普查数据中实际人口分布进行比较，我们量化了 LLM 在宗教和种姓方面表征偏差的存在及其“粘性”。结果发现，尽管提示旨在鼓励表征多样性，GPT-4 的回答仍持续过度代表文化主导群体，远超其统计比例。我们的研究结果还表明，LLMs 中的表征偏差具有赢家通吃的特性，其偏差程度超过了训练数据中可能存在的分布偏差，而基于提示的反复推动在消除这些偏差方面效果有限且不稳定。这些结果表明，仅仅多样化训练数据可能不足以纠正 LLM 的偏差，强调了在模型开发中需要更根本的变革。数据集和代码库：https://github.com/agrimaseth/How-Deep-Is-Representational-Bias-in-LLMs

Subject: Computation and Language 主题：计算与语言

Publish: 2025-07-22 17:28:37 UTC 发布时间：2025-07-22 17:28:37 UTC

#68 SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience #68 SEAgent：具备自主经验学习的自我进化计算机使用代理

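Repurposing large vision-language models (LVLMs) as computer use agents (CUAs) has led to substantial breakthroughs, driven primarily by human-labeled data. However, these models often struggle with novel and specialized software, particularly in scenarios that lack human annotations. To address this challenge, we propose SEAgent, an agentic self-evolving framework that enables CUAs to evolve autonomously through interactions with unfamiliar software. Specifically, SEAgent lets computer use agents autonomously master new software environments through experiential learning: agents explore new software, learn by iterative trial and error, and progressively tackle auto-generated tasks ordered from simple to complex. To achieve this, we design a World State Model for step-wise trajectory assessment, along with a Curriculum Generator that produces increasingly diverse and challenging tasks. The agent's policy is updated through experiential learning, comprising adversarial imitation of failed actions and Group Relative Policy Optimization (GRPO) on successful ones. In addition, we introduce a specialist-to-generalist training strategy that integrates the individual experiential insights of specialist agents, facilitating the development of a stronger generalist CUA capable of continuous autonomous evolution. This unified agent ultimately surpasses ensembles of individual specialist agents on their specialized software. We validate the effectiveness of SEAgent across five novel software environments within OS-World. Compared with a competitive open-source CUA (UI-TARS), our approach raises the success rate from 11.3% to 34.5%, a gain of 23.2 percentage points. 将大型视觉语言模型（LVLMs）重新用作计算机使用代理（CUAs）已带来重大突破，这主要依赖于人工标注数据。然而，这些模型在处理新颖和专业的软件时常常表现不佳，尤其是在缺乏人工注释的场景中。为了解决这一挑战，我们提出了 SEAgent，一种代理自我进化框架，使 CUAs 能够通过与陌生软件的交互自主进化。具体而言，SEAgent 使计算机使用代理能够通过体验式学习自主掌握新软件环境，代理通过探索新软件、反复试错学习，并逐步完成从简单到复杂自动生成的任务。为实现这一目标，我们设计了一个用于逐步轨迹评估的世界状态模型，以及一个生成日益多样且具有挑战性任务的课程生成器。代理的策略通过体验式学习进行更新，包括对失败动作的对抗模仿和对成功动作的群体相对策略优化（GRPO）。此外，我们引入了一种专家到通才的训练策略，该策略整合了专家代理的个体经验见解，促进了更强大的通才 CUA 的开发，使其能够持续自主进化。该统一代理最终在其专门的软件上实现了超越单个专家代理集成的性能。我们在 OS-World 中的五个新颖软件环境中验证了 SEAgent 的有效性。我们的方法在成功率上相较于一个具有竞争力的开源 CUA（即 UI-TARS）实现了显著提升，成功率从 11.3%提升至 34.5%，提高了 23.2 个百分点。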

Publish: 2025-08-06 17:58:46 UTC 发布时间：2025-08-06 17:58:46 UTC

#69 Query Attribute Modeling: Improving search relevance with Semantic Search and Meta Data Filtering #69 查询属性建模：通过语义搜索和元数据过滤提升搜索相关性

Authors: [Karthik Menon](https://arxiv.org/search/?searchtype=author&query=Karthik Menon), [Batool Arhamna Haider](https://arxiv.org/search/?searchtype=author&query=Batool Arhamna Haider), [Muhammad Arham](https://arxiv.org/search/?searchtype=author&query=Muhammad Arham), [Kanwal Mehreen](https://arxiv.org/search/?searchtype=author&query=Kanwal Mehreen), [Ram Mohan Rao Kadiyala](https://arxiv.org/search/?searchtype=author&query=Ram Mohan Rao Kadiyala), [Hamza Farooq](https://arxiv.org/search/?searchtype=author&query=Hamza Farooq) 作者：Karthik Menon, Batool Arhamna Haider, Muhammad Arham, Kanwal Mehreen, Ram Mohan Rao Kadiyala, Hamza Farooq

This study introduces Query Attribute Modeling (QAM), a hybrid framework that enhances search precision and relevance by decomposing open text queries into structured metadata tags and semantic elements. QAM addresses traditional search limitations by automatically extracting metadata filters from free-form text queries, reducing noise and enabling focused retrieval of relevant items. Experimental evaluation using the Amazon Toys Reviews dataset (10,000 unique items with 40,000+ reviews and detailed product attributes) demonstrated QAM’s superior performance, achieving a mean average precision at 5 (mAP@5) of 52.99%. This represents significant improvement over conventional methods, including BM25 keyword search, encoder-based semantic similarity search, cross-encoder re-ranking, and hybrid search combining BM25 and semantic results via Reciprocal Rank Fusion (RRF). The results establish QAM as a robust solution for Enterprise Search applications, particularly in e-commerce systems. 本研究引入了查询属性建模（QAM），这是一种混合框架，通过将开放文本查询分解为结构化的元数据标签和语义元素，提升搜索的精确度和相关性。QAM 通过自动从自由格式文本查询中提取元数据过滤器，解决了传统搜索的局限性，减少噪声，实现对相关项目的聚焦检索。使用亚马逊玩具评论数据集（包含 10,000 个独特商品，40,000 多条评论及详细产品属性）进行的实验评估表明，QAM 表现优异，达到了 5 条结果的平均精确率均值（mAP@5）为 52.99%。这相比传统方法有显著提升，包括 BM25 关键词搜索、基于编码器的语义相似度搜索、交叉编码器重排序以及通过互惠排名融合（RRF）结合 BM25 和语义结果的混合搜索。结果确立了 QAM 作为企业搜索应用的强大解决方案，尤其适用于电子商务系统。

Subjects: Information Retrieval, Artificial Intelligence, Computation and Language, Machine Learning 主题：信息检索，人工智能，计算与语言，机器学习

Publish: 2025-08-06 17:47:00 UTC 发布时间：2025-08-06 17:47:00 UTC
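
The QAM abstract above names Reciprocal Rank Fusion (RRF) as the way its hybrid baseline combines BM25 and semantic results. As a rough illustration of that fusion step only (not the authors' implementation), the sketch below scores each document as the sum of 1 / (k + rank) over the input rankings. The function name, the sample document ids, and the constant k = 60 (a commonly used default) are assumptions for illustration and do not come from the abstract.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    rankings: list of lists, each ordered from most to least relevant.
    k: smoothing constant (assumed default of 60, not from the paper).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top result contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by doc id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings from a keyword retriever and a semantic retriever.
bm25_ranking = ["toy_a", "toy_b", "toy_c"]
semantic_ranking = ["toy_b", "toy_d", "toy_a"]

fused = reciprocal_rank_fusion([bm25_ranking, semantic_ranking])
print(fused)
```

Documents that appear near the top of both lists accumulate the largest scores, so RRF needs no score normalization across retrievers; it uses only rank positions.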

#70 Position: The Current AI Conference Model is Unsustainable! Diagnosing the Crisis of Centralized AI Conference #70 立场：当前的人工智能会议模式不可持续！诊断集中式人工智能会议的危机

Authors: [Nuo Chen](https://arxiv.org/search/?searchtype=author&query=Nuo Chen), [Moming Duan](https://arxiv.org/search/?searchtype=author&query=Moming Duan), [Andre Huikai Lin](https://arxiv.org/search/?searchtype=author&query=Andre Huikai Lin), [Qian Wang](https://arxiv.org/search/?searchtype=author&query=Qian Wang), [Jiaying Wu](https://arxiv.org/search/?searchtype=author&query=Jiaying Wu), [Bingsheng He](https://arxiv.org/search/?searchtype=author&query=Bingsheng He) 作者：陈诺、段墨明、林惠凯、王倩、吴佳颖、何炳胜

Artificial Intelligence (AI) conferences are essential for advancing research, sharing knowledge, and fostering academic community. However, their rapid expansion has rendered the centralized conference model increasingly unsustainable. This paper offers a data-driven diagnosis of a structural crisis that threatens the foundational goals of scientific dissemination, equity, and community well-being. We identify four key areas of strain: (1) scientifically, with per-author publication rates more than doubling over the past decade to over 4.5 papers annually; (2) environmentally, with the carbon footprint of a single conference exceeding the daily emissions of its host city; (3) psychologically, with 71% of online community discourse reflecting negative sentiment and 35% referencing mental health concerns; and (4) logistically, with attendance at top conferences such as NeurIPS 2024 beginning to outpace venue capacity. These pressures point to a system that is misaligned with its core mission. In response, we propose the Community-Federated Conference (CFC) model, which separates peer review, presentation, and networking into globally coordinated but locally organized components, offering a more sustainable, inclusive, and resilient path forward for AI research. 人工智能（AI）会议对于推动研究进展、分享知识和促进学术社区发展至关重要。然而，其快速扩张使得集中式会议模式日益难以为继。本文通过数据驱动的分析，诊断了一场威胁科学传播、公平性和社区福祉根本目标的结构性危机。我们识别出四个关键压力领域：（1）科学方面，过去十年每位作者的发表论文数量翻倍以上，达到每年超过 4.5 篇；（2）环境方面，一场会议的碳足迹超过其主办城市的日常排放量；（3）心理方面，71%的在线社区讨论表现出负面情绪，35%涉及心理健康问题；（4）后勤方面，顶级会议如 NeurIPS 2024 的参会人数已开始超过场地容量。这些压力表明当前系统与其核心使命存在不匹配。作为回应，我们提出了社区联邦会议（CFC）模型，该模型将同行评审、展示和交流分离为全球协调但本地组织的组成部分，为人工智能研究提供了一条更可持续、更具包容性和更具韧性的前进道路。

Subjects: Computers and Society, Artificial Intelligence, Computation and Language 主题：计算机与社会，人工智能，计算与语言

Publish: 2025-08-06 16:08:27 UTC 发布时间：2025-08-06 16:08:27 UTC

#71 Do Recommender Systems Really Leverage Multimodal Content? A Comprehensive Analysis on Multimodal Representations for Recommendation #71 推荐系统真的利用了多模态内容吗？关于推荐系统多模态表示的全面分析

Authors: [Claudio Pomo](https://arxiv.org/search/?searchtype=author&query=Claudio Pomo), [Matteo Attimonelli](https://arxiv.org/search/?searchtype=author&query=Matteo Attimonelli), [Danilo Danese](https://arxiv.org/search/?searchtype=author&query=Danilo Danese), [Fedelucio Narducci](https://arxiv.org/search/?searchtype=author&query=Fedelucio Narducci), [Tommaso Di Noia](https://arxiv.org/search/?searchtype=author&query=Tommaso Di Noia) 作者：Claudio Pomo, Matteo Attimonelli, Danilo Danese, Fedelucio Narducci, Tommaso Di Noia

Multimodal Recommender Systems aim to improve recommendation accuracy by integrating heterogeneous content, such as images and textual metadata. While effective, it remains unclear whether their gains stem from true multimodal understanding or increased model complexity. This work investigates the role of multimodal item embeddings, emphasizing the semantic informativeness of the representations. Initial experiments reveal that embeddings from standard extractors (e.g., ResNet50, Sentence-Bert) enhance performance, but rely on modality-specific encoders and ad hoc fusion strategies that lack control over cross-modal alignment. To overcome these limitations, we leverage Large Vision-Language Models (LVLMs) to generate multimodal-by-design embeddings via structured prompts. This approach yields semantically aligned representations without requiring any fusion. Experiments across multiple settings show notable performance improvements. Furthermore, LVLMs embeddings offer a distinctive advantage: they can be decoded into structured textual descriptions, enabling direct assessment of their multimodal comprehension. When such descriptions are incorporated as side content into recommender systems, they improve recommendation performance, empirically validating the semantic depth and alignment encoded within LVLMs outputs. Our study highlights the importance of semantically rich representations and positions LVLMs as a compelling foundation for building robust and meaningful multimodal representations in recommendation tasks. 多模态推荐系统旨在通过整合异构内容，如图像和文本元数据，提高推荐的准确性。尽管效果显著，但尚不清楚其提升是源于真正的多模态理解，还是模型复杂度的增加。本文探讨了多模态项目嵌入的作用，强调表示的语义信息量。初步实验表明，来自标准提取器（如 ResNet50、Sentence-Bert）的嵌入能够提升性能，但依赖于特定模态的编码器和缺乏跨模态对齐控制的临时融合策略。为克服这些限制，我们利用大型视觉语言模型（LVLMs）通过结构化提示生成设计即多模态的嵌入。这种方法无需任何融合即可产生语义对齐的表示。多种设置下的实验显示了显著的性能提升。此外，LVLMs 嵌入具有独特优势：它们可以解码为结构化的文本描述，从而实现对其多模态理解的直接评估。当此类描述作为辅助内容被纳入推荐系统时，它们提升了推荐性能，实证验证了 LVLM 输出中编码的语义深度和一致性。我们的研究强调了语义丰富表示的重要性，并将 LVLM 定位为构建稳健且有意义的多模态表示以用于推荐任务的有力基础。

Subjects: Information Retrieval, Computation and Language, Machine Learning 主题：信息检索，计算与语言，机器学习

Publish: 2025-08-06 15:53:58 UTC 发布时间：2025-08-06 15:53:58 UTC

#72 Analyzing and Mitigating Object Hallucination: A Training Bias Perspective #72 分析与缓解对象幻觉：一种训练偏差视角

Authors: [Yifan Li](https://arxiv.org/search/?searchtype=author&query=Yifan Li), [Kun Zhou](https://arxiv.org/search/?searchtype=author&query=Kun Zhou), [Wayne Xin Zhao](https://arxiv.org/search/?searchtype=author&query=Wayne Xin Zhao), [Lei Fang](https://arxiv.org/search/?searchtype=author&query=Lei Fang), [Ji-Rong Wen](https://arxiv.org/search/?searchtype=author&query=Ji-Rong Wen) 作者：李一凡，周坤，赵新，方磊，温继荣

Although scaling up training data has significantly improved the general multimodal capabilities of Large Vision-Language Models (LVLMs), they still suffer from hallucination, generating text that is inconsistent with the visual input. This phenomenon motivates us to systematically investigate the role of training data in hallucination. We introduce a new benchmark, POPEv2, which consists of counterfactual images collected from the training data of LVLMs with certain objects masked. Through comprehensive evaluation on POPEv2, we find that current LVLMs suffer from training bias: they fail to fully leverage their training data and hallucinate more frequently on images seen during training. Specifically, they perform poorly on counterfactual images, often incorrectly answering “Yes” to questions about masked objects. To understand this issue, we conduct probing experiments on the models’ internal components, revealing that this training bias is primarily located in the language modeling (LM) head. Based on these findings, we propose Obliviate, an efficient and lightweight unlearning method designed to mitigate object hallucination via training bias unlearning. Obliviate identifies the discrepancy between ground-truth labels and model outputs on the training data as a proxy for bias and adopts a parameter- and data-efficient fine-tuning strategy that only updates the LM head. Extensive experiments demonstrate the effectiveness of our approach. While only reusing the training data and updating approximately 2% of the parameters, Obliviate significantly reduces hallucination across both discriminative and generative tasks. Furthermore, it demonstrates strong scalability with respect to both model size (2B to 72B) and training data volume, and exhibits promising generalization to hallucination types beyond object-level hallucination. Our code and data will be publicly released.
随着训练数据规模的扩大，大型视觉语言模型（LVLMs）的多模态能力显著提升，但它们仍然存在幻觉问题，即生成的文本与视觉输入不一致。该现象促使我们系统地研究训练数据在幻觉中的作用。我们引入了一个新的基准测试 POPEv2，该测试包含从 LVLMs 训练数据中收集的反事实图像，这些图像中某些对象被遮挡。通过对 POPEv2 的全面评估，我们发现当前的 LVLMs 存在训练偏差：它们未能充分利用训练数据，并且在训练中见过的图像上更频繁地产生幻觉。具体来说，它们在反事实图像上的表现较差，常常对被遮挡对象的问题错误地回答“是”。为了解决这一问题，我们对模型的内部组件进行了探测实验，揭示这种训练偏差主要存在于语言建模（LM）头部。基于这些发现，我们提出了 Obliviate，一种高效且轻量的遗忘方法，旨在通过训练偏差遗忘来减轻对象幻觉。 Obliviate 将训练数据中真实标签与模型输出之间的差异视为偏差的代理指标，并采用了一种参数和数据高效的微调策略，仅更新语言模型的头部。大量实验表明了我们方法的有效性。在仅重用训练数据并更新约 2%的参数的情况下，Obliviate 显著减少了判别任务和生成任务中的幻觉现象。此外，它在模型规模（从 2B 到 72B）和训练数据量方面表现出强大的可扩展性，并且在超出对象级幻觉的幻觉类型上展现出良好的泛化能力。我们的代码和数据将公开发布。

Subjects: Computer Vision and Pattern Recognition, Computation and Language 主题：计算机视觉与模式识别，计算与语言

Publish: 2025-08-06 15:51:02 UTC 发布时间：2025-08-06 15:51:02 UTC

#73 Causal Reflection with Language Models #73 语言模型的因果反思

Authors: [Abi Aryan](https://arxiv.org/search/?searchtype=author&query=Abi Aryan), [Zac Liu](https://arxiv.org/search/?searchtype=author&query=Zac Liu) 作者：Abi Aryan, Zac Liu

While LLMs exhibit impressive fluency and factual recall, they struggle with robust causal reasoning, often relying on spurious correlations and brittle patterns. Similarly, traditional Reinforcement Learning agents also lack causal understanding, optimizing for rewards without modeling why actions lead to outcomes. We introduce Causal Reflection, a framework that explicitly models causality as a dynamic function over state, action, time, and perturbation, enabling agents to reason about delayed and nonlinear effects. Additionally, we define a formal Reflect mechanism that identifies mismatches between predicted and observed outcomes and generates causal hypotheses to revise the agent’s internal model. In this architecture, LLMs serve not as black-box reasoners, but as structured inference engines translating formal causal outputs into natural language explanations and counterfactuals. Our framework lays the theoretical groundwork for Causal Reflective agents that can adapt, self-correct, and communicate causal understanding in evolving environments. 虽然 LLMs 表现出令人印象深刻的流畅性和事实回忆能力，但它们在稳健的因果推理方面存在困难，常常依赖于虚假的相关性和脆弱的模式。同样，传统的强化学习代理也缺乏因果理解，只是优化奖励而不建模为何动作会导致结果。我们提出了因果反思（Causal Reflection）框架，该框架将因果关系明确建模为状态、动作、时间和扰动上的动态函数，使代理能够推理延迟和非线性效应。此外，我们定义了一个正式的反思机制（Reflect），用于识别预测结果与观察结果之间的不匹配，并生成因果假设以修正代理的内部模型。在该架构中，LLMs 不再是黑箱推理器，而是作为结构化推理引擎，将正式的因果输出转化为自然语言解释和反事实。我们的框架为因果反思代理奠定了理论基础，使其能够在不断变化的环境中适应、自我纠正并传达因果理解。

Subjects: Machine Learning, Computation and Language 主题：机器学习，计算与语言

Publish: 2025-08-06 14:44:23 UTC 发布时间：2025-08-06 14:44:23 UTC

#74 OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use #74 操作系统代理：基于多模态大语言模型（MLLM）代理在通用计算设备上的应用综述

The dream of creating AI assistants as capable and versatile as the fictional J.A.R.V.I.S from Iron Man has long captivated imaginations. With the evolution of (multi-modal) large language models ((M)LLMs), this dream is closer to reality: (M)LLM-based agents that use computing devices (e.g., computers and mobile phones) to automate tasks, operating within the environments and interfaces (e.g., Graphical User Interface (GUI)) provided by operating systems (OS), have advanced significantly. This paper presents a comprehensive survey of these advanced agents, designated as OS Agents. We begin by elucidating the fundamentals of OS Agents, exploring their key components including the environment, observation space, and action space, and outlining essential capabilities such as understanding, planning, and grounding. We then examine methodologies for constructing OS Agents, focusing on domain-specific foundation models and agent frameworks. A detailed review of evaluation protocols and benchmarks highlights how OS Agents are assessed across diverse tasks. Finally, we discuss current challenges and identify promising directions for future research, including safety and privacy, personalization and self-evolution. This survey aims to consolidate the state of OS Agents research, providing insights to guide both academic inquiry and industrial development. An open-source GitHub repository is maintained as a dynamic resource to foster further innovation in this field. We present a 9-page version of our work, accepted by ACL 2025, to provide a concise overview of the domain.
创造像《钢铁侠》中虚构的 J.A.R.V.I.S 那样强大且多功能的 AI 助手的梦想，长期以来一直激发着人们的想象力。随着（多模态）大型语言模型（（M）LLMs）的发展，这一梦想正变得愈加接近现实，基于（M）LLM 的代理通过在操作系统（OS）提供的环境和界面（如图形用户界面（GUI））中运行，利用计算设备（例如计算机和手机）自动化任务，取得了显著进展。本文对这些先进的代理，称为 OS 代理，进行了全面的综述。我们首先阐明了 OS 代理的基本原理，探讨了其关键组成部分，包括环境、观测空间和动作空间，并概述了理解、规划和落地等核心能力。随后，我们考察了构建 OS 代理的方法，重点关注特定领域的基础模型和代理框架。最后，通过详尽的评估协议和基准测试回顾，展示了 OS 代理在多样化任务中的评估方式。最后，我们讨论了当前的挑战，并确定了未来研究的有前景方向，包括安全与隐私、个性化和自我进化。本综述旨在整合操作系统代理（OS Agents）研究的现状，提供指导学术探究和工业发展的见解。我们维护了一个开源的 GitHub 仓库，作为推动该领域进一步创新的动态资源。我们还呈现了被 ACL 2025 接受的 9 页版本工作，以便为该领域提供简明的概览。

Subjects: Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition, Machine Learning 主题：人工智能，计算与语言，计算机视觉与模式识别，机器学习

Publish: 2025-08-06 14:33:45 UTC 发布时间：2025-08-06 14:33:45 UTC

#75 FrEVL: Leveraging Frozen Pretrained Embeddings for Efficient Vision-Language Understanding #75 FrEVL：利用冻结的预训练嵌入实现高效的视觉-语言理解

Authors: [Emmanuelle Bourigault](https://arxiv.org/search/?searchtype=author&query=Emmanuelle Bourigault), [Pauline Bourigault](https://arxiv.org/search/?searchtype=author&query=Pauline Bourigault) 作者：Emmanuelle Bourigault，Pauline Bourigault

The deployment of vision-language models remains constrained by substantial computational requirements. We present FrEVL, a framework exploring whether frozen pretrained embeddings can support effective vision-language understanding. Our analysis reveals that frozen embeddings contain rich information for discriminative tasks, achieving 85% to 95% of state-of-the-art performance on standard benchmarks with only 68.4M trainable parameters. This performance dichotomy reveals a critical insight: frozen embedding effectiveness depends on alignment between pretraining objectives and downstream task requirements. When accounting for end-to-end computation including embedding extraction, FrEVL provides 2.3× speedup with 52% lower energy consumption, making it suitable for scenarios with pre-computable inputs or when deployment constraints outweigh marginal performance gains. Our evaluation provides practitioners with guidance on when frozen embedding approaches represent viable alternatives to full model deployment. We will release our complete implementation and evaluation framework to facilitate further research into efficient multi-modal understanding. 视觉语言模型的部署仍然受到大量计算需求的限制。我们提出了 FrEVL，一个探索冻结预训练嵌入是否能够支持有效视觉语言理解的框架。我们的分析表明，冻结的嵌入包含丰富的判别任务信息，在标准基准测试中仅用 68.4M 可训练参数就能达到 85%到 95%的最先进性能。这种性能差异揭示了一个关键见解：冻结嵌入的有效性取决于预训练目标与下游任务需求之间的对齐程度。在考虑包括嵌入提取的端到端计算时，FrEVL 实现了 2.3× 的加速，并降低了 52%的能耗，使其适用于可预先计算输入或部署限制超过边际性能提升的场景。我们的评估为从业者提供了关于何时冻结嵌入方法是完整模型部署可行替代方案的指导。我们将发布完整的实现和评估框架，以促进高效多模态理解的进一步研究。

Subjects: Computer Vision and Pattern Recognition, Computation and Language 主题：计算机视觉与模式识别，计算与语言

Publish: 2025-08-06 14:12:05 UTC 发布时间：2025-08-06 14:12:05 UTC

#76 Beyond Pixels: Exploring DOM Downsampling for LLM-Based Web Agents #76 超越像素：探索基于 LLM 的网页代理的 DOM 降采样

Authors: [Thassilo M. Schiepanski](https://arxiv.org/search/?searchtype=author&query=Thassilo M. Schiepanski), [Nicholas Piël](https://arxiv.org/search/?searchtype=author&query=Nicholas Piël) 作者：Thassilo M. Schiepanski，Nicholas Piël

Frontier LLMs only recently enabled serviceable, autonomous web agents. In such agents, a model acts as an instantaneous domain model backend: to suggest an interaction, it is consulted with a web-based task and the respective application state. The key problem lies in application state serialisation, referred to as a snapshot. State-of-the-art web agents are premised on grounded GUI snapshots, i.e., screenshots enhanced with visual cues, not least to resemble human perception, but also because images represent a relatively cheap means of model input. LLM vision still lags behind code interpretation capabilities. DOM snapshots, which structurally resemble HTML, pose a desirable alternative. Vast model input token size, however, has prevented reliable implementation with web agents to date. We propose D2Snap, a first-of-its-kind DOM downsampling algorithm. Based on a GPT-4o backend, we evaluate D2Snap on tasks sampled from the Online-Mind2Web dataset. The success rate of D2Snap-downsampled DOM snapshots (67%) matches a grounded GUI snapshot baseline (65%) within the same input token order of magnitude (1e3). Our best evaluated configurations (one token order above, but within the model’s context window) outperform this baseline by 8%. Our evaluation, moreover, shows that the DOM-inherent hierarchy embodies a strong UI feature for LLMs. 前沿的 LLMs 最近才实现了可用的自主网络代理。在此过程中，模型充当即时的领域模型后端。为了建议交互，模型会被咨询以处理基于网络的任务及相应的应用状态。关键问题在于应用状态的序列化，称为快照。最先进的网络代理基于有根的 GUI 快照，即带有视觉提示的截图。这不仅是为了模拟人类感知，还因为图像代表了相对廉价的模型输入手段。LLM 视觉能力仍落后于代码解释能力。结构上类似 HTML 的 DOM 快照提供了一个理想的替代方案。然而，庞大的模型输入令牌大小至今阻碍了网络代理的可靠实现。我们提出了 D2Snap，一种首创的 DOM 降采样算法。基于 GPT-4o 后端，我们在 Online-Mind2Web 数据集中抽样的任务上评估了 D2Snap。D2Snap 降采样的 DOM 快照成功率（67%）与有根 GUI 快照基线（65%）相当，且输入令牌数量级相同（约 1e3）。我们评估的最佳配置（输入令牌高一个数量级，但仍在模型的上下文窗口内）比该基线高出 8%。此外，我们的评估表明，DOM 固有的层级结构是 LLMs 的一个强大 UI 特征。
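
To make the idea of DOM downsampling concrete, here is a minimal Python sketch. It is not the D2Snap algorithm, whose details the abstract does not give. It only illustrates the general notion of shrinking a DOM snapshot while preserving its hierarchy. The tag and attribute whitelists (`KEEP_TAGS`, `KEEP_ATTRS`) and the `downsample` helper are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Hypothetical whitelists, chosen for illustration only.
KEEP_TAGS = {"a", "button", "input", "select", "textarea", "form",
             "label", "h1", "h2", "h3", "nav", "main"}
SKIP_TAGS = {"script", "style"}          # contents are dropped entirely
KEEP_ATTRS = {"id", "name", "type", "href", "placeholder"}
VOID_TAGS = {"input"}                    # kept tags that never have an end tag


class Downsampler(HTMLParser):
    """Keep whitelisted elements and visible text, indented by kept depth."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.depth = 0        # nesting depth among kept elements only
        self.skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in KEEP_TAGS:
            return
        kept = "".join(f' {k}="{v}"' for k, v in attrs
                       if k in KEEP_ATTRS and v is not None)
        self.out.append("  " * self.depth + f"<{tag}{kept}>")
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in KEEP_TAGS or tag in VOID_TAGS:
            return
        self.depth = max(0, self.depth - 1)
        self.out.append("  " * self.depth + f"</{tag}>")

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.out.append("  " * self.depth + text)


def downsample(html: str) -> str:
    parser = Downsampler()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.out)


html_page = (
    '<html><head><style>.x{color:red}</style><script>var a=1;</script></head>'
    '<body><div class="wrap"><nav><a href="/home" class="btn">Home</a></nav>'
    '<div><form id="search"><input type="text" name="q" class="big">'
    '<button type="submit">Go</button></form></div></div></body></html>'
)
snapshot = downsample(html_page)
print(snapshot)
```

The sketch drops wrapper elements, scripts, styles, and presentational attributes, but keeps the nesting of the remaining elements, which is the kind of hierarchy the abstract reports as a strong UI feature for LLMs.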

Subjects: Artificial Intelligence, Computation and Language, Human-Computer Interaction 主题：人工智能，计算与语言，人机交互

Publish: 2025-08-06 12:56:54 UTC 发布时间：2025-08-06 12:56:54 UTC

#77 Graph Representation Learning with Massive Unlabeled Data for Rumor Detection #77 利用海量无标签数据进行图表示学习以检测谣言

Authors: [Chaoqun Cui](https://arxiv.org/search/?searchtype=author&query=Chaoqun Cui), [Caiyan Jia](https://arxiv.org/search/?searchtype=author&query=Caiyan Jia) 作者：崔超群，贾彩燕

With the development of social media, rumors spread quickly and cause great harm to society and the economy. Consequently, many effective rumor detection methods have been developed, among which methods based on learning the rumor propagation structure are particularly effective. However, the existing methods still suffer from many issues, including the difficulty of obtaining large-scale labeled rumor datasets, which leads to low generalization ability and performance degeneration on new events, since rumors are time-critical and usually appear with hot topics or newly emergent events. To solve these problems, in this study we use large-scale unlabeled topic datasets with claim propagation structure, crawled from the social media platforms Weibo and Twitter, to improve the semantic learning ability of a graph representation learning model on various topics. We use three typical graph self-supervised methods, InfoGraph, JOAO and GraphMAE, in two commonly used training strategies, to verify the performance of general graph semi-supervised methods in rumor detection tasks. In addition, to alleviate the time and topic difference between unlabeled topic data and rumor data, we also collected a rumor dataset covering a variety of topics over a decade (the 10 years up to 2022) from the Weibo rumor-refuting platform. Our experiments show that these general graph self-supervised learning methods outperform previous methods specifically designed for rumor detection tasks and achieve good performance under few-shot conditions, demonstrating better generalization ability with the help of our massive unlabeled topic dataset.
随着社交媒体的发展，谣言传播迅速，对社会和经济造成巨大危害。因此，许多有效的谣言检测方法被开发出来，其中基于谣言传播结构学习的方法相比其他方法尤为有效。然而，现有方法仍存在许多问题，包括难以获得大规模标注的谣言数据集，导致模型泛化能力低下，并且在新事件上的性能下降，因为谣言具有时效性，通常伴随热点话题或新兴事件出现。为了解决上述问题，本研究利用从社交媒体平台微博和推特爬取的大规模未标注话题数据集及其传播结构，提升图表示学习模型在各种话题上的语义学习能力。我们采用了三种典型的图自监督方法——InfoGraph、JOAO 和 GraphMAE，并在两种常用的训练策略下，验证了通用图半监督方法在谣言检测任务中的性能表现。此外，为了缓解无标签主题数据与谣言数据之间的时间和主题差异，我们还收集了一个涵盖十年（从 2022 年往前推 10 年）多种主题的谣言数据集，来源于微博辟谣平台。我们的实验表明，这些通用的图自监督学习方法优于之前专门为谣言检测任务设计的方法，并且在少样本条件下也能取得良好表现，展示了借助我们庞大的无标签主题数据集所带来的更强泛化能力。

Subjects: Social and Information Networks, Computation and Language 主题：社会与信息网络，计算与语言

Publish: 2025-08-06 09:33:56 UTC 发布：2025-08-06 09:33:56 UTC

#78 ToxicTAGS: Decoding Toxic Memes with Rich Tag Annotations #78 ToxicTAGS：利用丰富标签注释解码有害表情包

Authors: [Subhankar Swain](https://arxiv.org/search/?searchtype=author&query=Subhankar Swain), [Naquee Rizwan](https://arxiv.org/search/?searchtype=author&query=Naquee Rizwan), [Nayandeep Deb](https://arxiv.org/search/?searchtype=author&query=Nayandeep Deb), [Vishwajeet Singh Solanki](https://arxiv.org/search/?searchtype=author&query=Vishwajeet Singh Solanki), [Vishwa Gangadhar S](https://arxiv.org/search/?searchtype=author&query=Vishwa Gangadhar S), [Animesh Mukherjee](https://arxiv.org/search/?searchtype=author&query=Animesh Mukherjee) 作者：Subhankar Swain, Naquee Rizwan, Nayandeep Deb, Vishwajeet Singh Solanki, Vishwa Gangadhar S, Animesh Mukherjee

The 2025 Global Risks Report identifies state-based armed conflict and societal polarisation among the most pressing global threats, with social media playing a central role in amplifying toxic discourse. Memes, as a widely used mode of online communication, often serve as vehicles for spreading harmful content. However, limitations in data accessibility and the high cost of dataset curation hinder the development of robust meme moderation systems. To address this challenge, in this work, we introduce a first-of-its-kind dataset of 6,300 real-world meme-based posts annotated in two stages: (i) binary classification into toxic and normal, and (ii) fine-grained labelling of toxic memes as hateful, dangerous, or offensive. A key feature of this dataset is that it is enriched with auxiliary metadata of socially relevant tags, enhancing the context of each meme. In addition, we propose a tag generation module that produces socially grounded tags, because most in-the-wild memes often do not come with tags. Experimental results show that incorporating these tags substantially enhances the performance of state-of-the-art VLMs detection tasks. Our contributions offer a novel and scalable foundation for improved content moderation in multimodal online environments. 2025 年全球风险报告将国家间武装冲突和社会极化列为最紧迫的全球威胁，社交媒体在放大有害言论中起着核心作用。作为一种广泛使用的在线交流方式，表情包常常成为传播有害内容的载体。然而，数据获取的限制和数据集策划的高成本阻碍了强大表情包审核系统的发展。为了解决这一挑战，本研究首次引入了一个包含 6,300 条真实世界表情包帖子的双阶段标注数据集：（i）二元分类为有害和正常，（ii）对有害表情包进行细粒度标注，分为仇恨、危险或冒犯。该数据集的一个关键特点是附带了社会相关标签的辅助元数据，增强了每个表情包的语境。此外，我们提出了一个标签生成模块，用于生成具有社会基础的标签，因为大多数野生表情包通常没有标签。实验结果表明，加入这些标签显著提升了最先进视觉语言模型（VLM）检测任务的性能。我们的贡献为多模态在线环境中改进内容审核提供了一个新颖且可扩展的基础。

Subjects: Computer Vision and Pattern Recognition, Computation and Language 主题：计算机视觉与模式识别，计算与语言

Publish: 2025-08-06 07:46:14 UTC 发布时间：2025-08-06 07:46:14 UTC

#79 Multilingual Source Tracing of Speech Deepfakes: A First Benchmark #79 多语言语音深度伪造源追踪：首个基准测试

Authors: [Xi Xuan](https://arxiv.org/search/?searchtype=author&query=Xi Xuan), [Yang Xiao](https://arxiv.org/search/?searchtype=author&query=Yang Xiao), [Rohan Kumar Das](https://arxiv.org/search/?searchtype=author&query=Rohan Kumar Das), [Tomi Kinnunen](https://arxiv.org/search/?searchtype=author&query=Tomi Kinnunen) 作者：席轩，杨晓，罗翰·库马尔·达斯，托米·金努宁

Recent progress in generative AI has made it increasingly easy to create natural-sounding deepfake speech from just a few seconds of audio. While these tools support helpful applications, they also raise serious concerns by making it possible to generate convincing fake speech in many languages. Current research has largely focused on detecting fake speech, but little attention has been given to tracing the source models used to generate it. This paper introduces the first benchmark for multilingual speech deepfake source tracing, covering both mono- and cross-lingual scenarios. We comparatively investigate DSP- and SSL-based modeling; examine how SSL representations fine-tuned on different languages impact cross-lingual generalization performance; and evaluate generalization to unseen languages and speakers. Our findings offer the first comprehensive insights into the challenges of identifying speech generation models when training and inference languages differ. The dataset, protocol and code are available at https://github.com/xuanxixi/Multilingual-Source-Tracing. 近年来生成式人工智能的进步使得仅凭几秒钟的音频就能轻松生成自然听感的深度伪造语音。虽然这些工具支持许多有益的应用，但它们也带来了严重的担忧，因为这使得生成多种语言的逼真假语音成为可能。目前的研究主要集中在检测假语音，但对追踪用于生成假语音的源模型关注较少。本文首次提出了多语言语音深度伪造源追踪的基准，涵盖单语和跨语言场景。我们对基于数字信号处理（DSP）和自监督学习（SSL）的建模方法进行了比较研究；考察了在不同语言上微调的 SSL 表示对跨语言泛化性能的影响；并评估了对未见语言和说话人的泛化能力。我们的研究结果首次全面揭示了在训练语言与推理语言不同时识别语音生成模型所面临的挑战。数据集、协议和代码可在 https://github.com/xuanxixi/Multilingual-Source-Tracing 获取。

Subjects: Audio and Speech Processing, Computation and Language, Sound 主题：音频与语音处理，计算与语言，声音

Publish: 2025-08-06 07:11:36 UTC 发布时间：2025-08-06 07:11:36 UTC

#80 COPO: Consistency-Aware Policy Optimization #80 COPO：一致性感知策略优化

Reinforcement learning has significantly enhanced the reasoning capabilities of Large Language Models (LLMs) in complex problem-solving tasks. Recently, the introduction of DeepSeek R1 has inspired a surge of interest in leveraging rule-based rewards as a low-cost alternative for computing advantage functions and guiding policy optimization. However, a common challenge observed across many replication and extension efforts is that when multiple sampled responses under a single prompt converge to identical outcomes, whether correct or incorrect, the group-based advantage degenerates to zero. This leads to vanishing gradients and renders the corresponding samples ineffective for learning, ultimately limiting training efficiency and downstream performance. To address this issue, we propose a consistency-aware policy optimization framework that introduces a structured global reward based on outcome consistency. The global loss built on this reward ensures that, even when model outputs show high intra-group consistency, the training process still receives meaningful learning signals, which encourages the generation of correct and self-consistent reasoning paths from a global perspective. Furthermore, we incorporate an entropy-based soft blending mechanism that adaptively balances local advantage estimation with global optimization, enabling dynamic transitions between exploration and convergence throughout training. Our method introduces several key innovations in both reward design and optimization strategy. We validate its effectiveness through substantial performance gains on multiple mathematical reasoning benchmarks, highlighting the proposed framework’s robustness and general applicability. Code of this work has been released at https://github.com/hijih/copo-code.git.
强化学习显著提升了大型语言模型（LLMs）在复杂问题解决任务中的推理能力。最近，DeepSeek R1 的引入激发了利用基于规则的奖励作为计算优势函数和指导策略优化的低成本替代方案的热潮。然而，在许多复现和扩展工作中普遍遇到的一个挑战是，当单个提示下的多个采样响应收敛到相同结果（无论正确与否）时，基于组的优势会退化为零。这导致梯度消失，使得相应的样本在学习中无效，最终限制了训练效率和下游性能。为了解决这一问题，我们提出了一种一致性感知的策略优化框架，该框架引入了基于结果一致性的结构化全局奖励，基于该奖励的全局损失确保即使模型输出表现出高度的组内一致性，训练过程仍能获得有意义的学习信号，从而从全局视角鼓励生成正确且自洽的推理路径。此外，我们引入了一种基于熵的软融合机制，自适应地平衡局部优势估计与全局优化，使训练过程中能够在探索与收敛之间动态切换。我们的方法在奖励设计和优化策略上均有多项关键创新。通过在多个数学推理基准上的显著性能提升，我们验证了所提框架的鲁棒性和广泛适用性。该工作的代码已发布于 https://github.com/hijih/copo-code.git。
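
The degenerate case described in the abstract can be reproduced in a few lines. The sketch below assumes the common group-relative normalization, (reward − group mean) / (group standard deviation + ε). The function name `group_relative_advantages` and the `eps` value are illustrative, and the code is not taken from the COPO repository.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed outcomes: correct and incorrect responses get opposite-signed advantages.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# Identical outcomes (all correct or all wrong): every advantage is exactly zero,
# so these samples contribute no gradient signal.
all_correct = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed, all_correct, all_wrong)
```

With a rule-based 0/1 reward, any prompt the model always solves or always fails falls into the second case, which is the situation the proposed global, consistency-based reward is meant to cover.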

Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题：机器学习，人工智能，计算与语言

Publish: 2025-08-06 07:05:18 UTC 发布：2025-08-06 07:05:18 UTC

#81 AgREE: Agentic Reasoning for Knowledge Graph Completion on Emerging Entities #81 AgREE：面向新兴实体的知识图谱补全的智能推理

Authors: [Ruochen Zhao](https://arxiv.org/search/?searchtype=author&query=Ruochen Zhao), [Simone Conia](https://arxiv.org/search/?searchtype=author&query=Simone Conia), [Eric Peng](https://arxiv.org/search/?searchtype=author&query=Eric Peng), [Min Li](https://arxiv.org/search/?searchtype=author&query=Min Li), [Saloni Potdar](https://arxiv.org/search/?searchtype=author&query=Saloni Potdar) 作者：赵若晨，Simone Conia，Eric Peng，李敏，Saloni Potdar

Open-domain Knowledge Graph Completion (KGC) faces significant challenges in an ever-changing world, especially when considering the continual emergence of new entities in daily news. Existing approaches for KGC mainly rely on pretrained language models’ parametric knowledge, pre-constructed queries, or single-step retrieval, typically requiring substantial supervision and training data. Even so, they often fail to capture comprehensive and up-to-date information about unpopular and/or emerging entities. To this end, we introduce Agentic Reasoning for Emerging Entities (AgREE), a novel agent-based framework that combines iterative retrieval actions and multi-step reasoning to dynamically construct rich knowledge graph triplets. Experiments show that, despite requiring zero training efforts, AgREE significantly outperforms existing methods in constructing knowledge graph triplets, especially for emerging entities that were not seen during language models’ training processes, outperforming previous methods by up to 13.7%. Moreover, we propose a new evaluation methodology that addresses a fundamental weakness of existing setups and a new benchmark for KGC on emerging entities. Our work demonstrates the effectiveness of combining agent-based reasoning with strategic information retrieval for maintaining up-to-date knowledge graphs in dynamic information environments. 开放域知识图谱补全（KGC）在不断变化的世界中面临重大挑战，尤其是在日常新闻中新实体不断涌现的情况下。现有的 KGC 方法主要依赖于预训练语言模型的参数化知识、预先构建的查询或单步检索，通常需要大量的监督和训练数据。即便如此，它们往往无法捕捉关于不受关注和/或新兴实体的全面且最新的信息。为此，我们提出了面向新兴实体的智能推理（AgREE），这是一种基于智能体的新框架，结合了迭代检索动作和多步推理，动态构建丰富的知识图谱三元组。实验表明，尽管无需任何训练，AgREE 在构建知识图谱三元组方面显著优于现有方法，尤其是在语言模型训练过程中未见过的新兴实体上，性能提升高达 13.7%。此外，我们提出了一种新的评估方法，解决了现有设置的根本性弱点，并为新兴实体的 KGC 建立了新的基准。我们的工作展示了将基于代理的推理与战略性信息检索相结合，在动态信息环境中维护最新知识图谱的有效性。

Subjects: Artificial Intelligence, Computation and Language 主题：人工智能，计算与语言

Publish: 2025-08-06 06:34:22 UTC 发布时间：2025-08-06 06:34:22 UTC

#82 ConvMix: A Mixed-Criteria Data Augmentation Framework for Conversational Dense Retrieval #82 ConvMix：一种用于对话密集检索的混合标准数据增强框架

Authors: [Fengran Mo](https://arxiv.org/search/?searchtype=author&query=Fengran Mo), [Jinghan Zhang](https://arxiv.org/search/?searchtype=author&query=Jinghan Zhang), [Yuchen Hui](https://arxiv.org/search/?searchtype=author&query=Yuchen Hui), [Jia Ao Sun](https://arxiv.org/search/?searchtype=author&query=Jia Ao Sun), [Zhichao Xu](https://arxiv.org/search/?searchtype=author&query=Zhichao Xu), [Zhan Su](https://arxiv.org/search/?searchtype=author&query=Zhan Su), [Jian-Yun Nie](https://arxiv.org/search/?searchtype=author&query=Jian-Yun Nie) 作者：莫凤然，张景涵，惠宇辰，孙嘉奥，徐志超，苏展，聂建云

Conversational search aims to satisfy users’ complex information needs via multiple-turn interactions. The key challenge lies in revealing real users’ search intent from the context-dependent queries. Previous studies achieve conversational search by fine-tuning a conversational dense retriever with relevance judgments between pairs of context-dependent queries and documents. However, this training paradigm encounters data scarcity issues. To this end, we propose ConvMix, a mixed-criteria framework to augment conversational dense retrieval, which covers more aspects than existing data augmentation frameworks. We design a two-sided relevance judgment augmentation schema in a scalable manner via the aid of large language models. Besides, we integrate the framework with quality control mechanisms to obtain semantically diverse samples and near-distribution supervisions to combine various annotated data. Experimental results on five widely used benchmarks show that the conversational dense retriever trained by our ConvMix framework outperforms previous baseline methods, which demonstrates our superior effectiveness. 对话式搜索旨在通过多轮交互满足用户复杂的信息需求。关键挑战在于从依赖上下文的查询中揭示真实的用户搜索意图。以往的研究通过对话式密集检索器进行微调，利用上下文相关查询与文档对之间的相关性判断来实现对话式搜索。然而，这种训练范式面临数据稀缺问题。为此，我们提出了 ConvMix，一种混合标准框架，用于增强对话式密集检索，涵盖了比现有数据增强框架更多的方面。我们设计了一种可扩展的双向相关性判断增强方案，借助大型语言模型实现。此外，我们将该框架与质量控制机制相结合，以获得语义多样的样本和近分布监督，从而整合各种标注数据。在五个广泛使用的基准测试上的实验结果表明，由我们的 ConvMix 框架训练的对话式密集检索器优于以往的基线方法，展示了我们方法的卓越效果。

Subjects: Information Retrieval, Computation and Language 主题：信息检索，计算与语言

Publish: 2025-08-06 01:28:49 UTC 发布时间：2025-08-06 01:28:49 UTC

#83 Accelerating Scientific Discovery with Multi-Document Summarization of Impact-Ranked Papers #83 利用多文档摘要加速科学发现——基于影响力排名的论文

Authors: [Paris Koloveas](https://arxiv.org/search/?searchtype=author&query=Paris Koloveas), [Serafeim Chatzopoulos](https://arxiv.org/search/?searchtype=author&query=Serafeim Chatzopoulos), [Dionysis Diamantis](https://arxiv.org/search/?searchtype=author&query=Dionysis Diamantis), [Christos Tryfonopoulos](https://arxiv.org/search/?searchtype=author&query=Christos Tryfonopoulos), [Thanasis Vergoulis](https://arxiv.org/search/?searchtype=author&query=Thanasis Vergoulis) 作者：Paris Koloveas、Serafeim Chatzopoulos、Dionysis Diamantis、Christos Tryfonopoulos、Thanasis Vergoulis

The growing volume of scientific literature makes it challenging for scientists to move from a list of papers to a synthesized understanding of a topic. Because of the constant influx of new papers on a daily basis, even if a scientist identifies a promising set of papers, they still face the tedious task of individually reading through dozens of titles and abstracts to make sense of occasionally conflicting findings. To address this critical bottleneck in the research workflow, we introduce a summarization feature to BIP! Finder, a scholarly search engine that ranks literature based on distinct impact aspects like popularity and influence. Our approach enables users to generate two types of summaries from top-ranked search results: a concise summary for an instantaneous at-a-glance comprehension and a more comprehensive literature review-style summary for greater, better-organized comprehension. This ability dynamically leverages BIP! Finder’s already existing impact-based ranking and filtering features to generate context-sensitive, synthesized narratives that can significantly accelerate literature discovery and comprehension. 日益增长的科学文献数量使得科学家们难以从一堆论文中提炼出对某一主题的综合理解。由于每天都有大量新论文涌现，即使科学家找到了有潜力的一组论文，他们仍需费力地逐篇阅读数十个标题和摘要，以理清偶尔存在的相互矛盾的研究结果。为了解决研究流程中的这一关键瓶颈，我们在 BIP! Finder 中引入了摘要功能。BIP! Finder 是一款学术搜索引擎，基于受欢迎度和影响力等不同影响维度对文献进行排名。我们的方法使用户能够从排名靠前的搜索结果中生成两种类型的摘要：一种是简洁摘要，便于瞬间快速理解；另一种是更全面的文献综述式摘要，便于更深入、更有条理的理解。该功能动态利用 BIP! Finder 已有的基于影响力的排名和筛选功能，生成上下文相关的综合叙述，显著加快文献的发现和理解过程。

Subjects: Digital Libraries, Artificial Intelligence, Computation and Language 主题：数字图书馆，人工智能，计算与语言

Publish: 2025-08-05 22:56:09 UTC 发布时间：2025-08-05 22:56:09 UTC

#84 ASTRA: Autonomous Spatial-Temporal Red-teaming for AI Software Assistants #84 ASTRA：面向 AI 软件助手的自主时空红队测试

AI coding assistants like GitHub Copilot are rapidly transforming software development, but their safety remains deeply uncertain, especially in high-stakes domains like cybersecurity. Current red-teaming tools often rely on fixed benchmarks or unrealistic prompts, missing many real-world vulnerabilities. We present ASTRA, an automated agent system designed to systematically uncover safety flaws in AI-driven code generation and security guidance systems. ASTRA works in three stages: (1) it builds structured domain-specific knowledge graphs that model complex software tasks and known weaknesses; (2) it performs online vulnerability exploration of each target model by adaptively probing both its input space, i.e., the spatial exploration, and its reasoning processes, i.e., the temporal exploration, guided by the knowledge graphs; and (3) it generates high-quality violation-inducing cases to improve model alignment. Unlike prior methods, ASTRA focuses on realistic inputs (requests that developers might actually ask) and uses both offline abstraction-guided domain modeling and online domain knowledge graph adaptation to surface corner-case vulnerabilities. Across two major evaluation domains, ASTRA finds 11-66% more issues than existing techniques and produces test cases that lead to 17% more effective alignment training, showing its practical value for building safer AI systems. 像 GitHub Copilot 这样的 AI 编码助手正在迅速改变软件开发，但其安全性仍然存在很大不确定性，尤其是在网络安全等高风险领域。目前的红队工具通常依赖固定的基准测试或不切实际的提示，遗漏了许多现实世界中的漏洞。我们提出了 ASTRA，一种自动化代理系统，旨在系统性地发现 AI 驱动的代码生成和安全指导系统中的安全缺陷。ASTRA 分三个阶段工作：（1）构建结构化的领域特定知识图，模拟复杂的软件任务和已知弱点；（2）通过自适应探测目标模型的输入空间（即空间探索）和推理过程（即时间探索），在知识图的指导下进行在线漏洞探索；（3）生成高质量的违规诱发案例以改进模型对齐。与以往方法不同，ASTRA 专注于现实输入（即开发者可能实际提出的请求），并结合离线抽象引导的领域建模与在线领域知识图适应，揭示边缘案例漏洞。在两个主要评估领域中，ASTRA 比现有技术发现了多 11-66%的问题，并生成了导致 17%更有效对齐训练的测试用例，展示了其在构建更安全 AI 系统方面的实际价值。

Subjects: Cryptography and Security, Computation and Language, Machine Learning, Software Engineering 主题：密码学与安全，计算与语言，机器学习，软件工程

Publish: 2025-08-05 21:57:52 UTC 发布时间：2025-08-05 21:57:52 UTC

#85 MegaWika 2: A More Comprehensive Multilingual Collection of Articles and their Sources #85 MegaWika 2：更全面的多语言文章及其来源合集

Authors: [Samuel Barham](https://arxiv.org/search/?searchtype=author&query=Samuel Barham), [Chandler May](https://arxiv.org/search/?searchtype=author&query=Chandler May), [Benjamin Van Durme](https://arxiv.org/search/?searchtype=author&query=Benjamin Van Durme) 作者：Samuel Barham，Chandler May，Benjamin Van Durme

We introduce MegaWika 2, a large, multilingual dataset of Wikipedia articles with their citations and scraped web sources; articles are represented in a rich data structure, and scraped source texts are stored inline with precise character offsets of their citations in the article text. MegaWika 2 is a major upgrade from the original MegaWika, spanning six times as many articles and twice as many fully scraped citations. Both MegaWika and MegaWika 2 support report generation research; whereas MegaWika also focused on supporting question answering and retrieval applications, MegaWika 2 is designed to support fact checking and analyses across time and language. 我们介绍了 MegaWika 2，这是一个大型多语言的维基百科文章数据集，包含其引用和抓取的网页来源；文章以丰富的数据结构表示，抓取的源文本与文章文本中引用的精确字符偏移量内联存储。MegaWika 2 是对原始 MegaWika 的重大升级，涵盖的文章数量是原来的六倍，完全抓取的引用数量是原来的两倍。MegaWika 和 MegaWika 2 都支持报告生成研究；而 MegaWika 还专注于支持问答和检索应用，MegaWika 2 则设计用于支持跨时间和语言的事实核查和分析。

Subjects: Digital Libraries, Computation and Language 主题：数字图书馆，计算与语言

Publish: 2025-08-05 18:18:17 UTC 发布时间：2025-08-05 18:18:17 UTC

#86 GTPO: Trajectory-Based Policy Optimization in Large Language Models #86 GTPO：基于轨迹的策略优化在大型语言模型中的应用

Authors: [Marco Simoni](https://arxiv.org/search/?searchtype=author&query=Marco Simoni), [Aleksandar Fontana](https://arxiv.org/search/?searchtype=author&query=Aleksandar Fontana), [Giulio Rossolini](https://arxiv.org/search/?searchtype=author&query=Giulio Rossolini), [Andrea Saracino](https://arxiv.org/search/?searchtype=author&query=Andrea Saracino) 作者：Marco Simoni, Aleksandar Fontana, Giulio Rossolini, Andrea Saracino

Policy-based optimizations are widely adopted today for the training and alignment of language models, where one of the most recent and effective approaches is Group-relative Policy Optimization (GRPO). In this paper, we reveal and analyze two major limitations of GRPO: (i) tokens frequently appear in completions with both positive and negative rewards, leading to conflicting gradient updates that can reduce their output probability, even though they can be essential for maintaining proper structure; (ii) negatively rewarded completions may penalize confident responses and shift model decisions toward unlikely tokens, progressively flattening the output distribution and degrading learning. To address these issues and provide a more stable and effective policy optimization strategy, we introduce GTPO (Group-relative Trajectory-based Policy Optimization), which identifies conflict tokens (tokens appearing in the same position across completions with opposite rewards) and protects them by skipping negative updates while amplifying positive ones. To further prevent policy collapse, GTPO filters out completions whose entropy exceeds a provable threshold. Unlike GRPO, GTPO does not rely on KL-divergence regularization, eliminating the need for a reference model during training, while still ensuring greater training stability and improved performance, validated through multiple experiments on GSM8K, MATH and AIME 2024 benchmarks. 基于策略的优化方法如今被广泛应用于语言模型的训练和对齐，其中最新且有效的方法之一是基于群体相对策略优化（Group-relative Policy Optimization，GRPO）。本文揭示并分析了 GRPO 的两个主要局限性：（i）某些词元在生成的结果中既出现于获得正向奖励的情况，也出现于获得负向奖励的情况，导致梯度更新产生冲突，可能降低这些词元的输出概率，尽管它们对于维持正确结构至关重要；（ii）负向奖励的生成结果可能惩罚模型的自信回答，并将模型决策偏向不太可能的词元，逐步使输出分布趋于平坦，进而降低学习效果。为了解决这些问题并提供更稳定有效的策略优化策略，我们提出了 GTPO（基于群体相对轨迹的策略优化），该方法识别冲突词元，即在同一位置上出现于不同奖励结果中的词元，通过跳过负向更新来保护它们，同时放大正向更新。为了进一步防止策略崩溃，GTPO 还过滤掉熵值超过可证明阈值的生成结果。与 GRPO 不同，GTPO 不依赖于 KL 散度正则化，消除了训练过程中对参考模型的需求，同时仍能确保更高的训练稳定性和性能提升，这一点通过在 GSM8K、MATH 和 AIME 2024 基准上的多次实验得到了验证。
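
The conflict-token idea can be sketched in plain Python. This is an illustration of the definition given in the abstract (same token, same position, completions with opposite-signed rewards), not the authors' implementation. The helper names and the `boost` factor used to amplify positive updates are hypothetical.

```python
def find_conflicts(completions, advantages):
    """Return the (position, token) pairs seen under both reward signs."""
    positive, negative = set(), set()
    for tokens, adv in zip(completions, advantages):
        if adv == 0:
            continue
        seen = positive if adv > 0 else negative
        seen.update(enumerate(tokens))
    return positive & negative


def token_weights(completions, advantages, boost=2.0):
    """Per-token update weights: conflict tokens skip negative updates and
    get an amplified positive one (boost is an assumed, illustrative factor)."""
    conflicts = find_conflicts(completions, advantages)
    weights = []
    for tokens, adv in zip(completions, advantages):
        row = []
        for item in enumerate(tokens):
            if item in conflicts:
                row.append(adv * boost if adv > 0 else 0.0)
            else:
                row.append(adv)
        weights.append(row)
    return weights


completions = [
    ["<think>", "2", "+", "2", "=", "4"],   # rewarded completion
    ["<think>", "2", "+", "2", "=", "5"],   # penalized completion
]
advantages = [1.0, -1.0]

conflicts = find_conflicts(completions, advantages)
weights = token_weights(completions, advantages)
print(sorted(conflicts))
print(weights)
```

In this toy group, the shared prefix is protected from the negative update, and only the final, diverging token of the penalized completion is pushed down.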

Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题：机器学习，人工智能，计算与语言

Publish: 2025-08-05 08:15:01 UTC 发布时间：2025-08-05 08:15:01 UTC

#87 CX-Mind: A Pioneering Multimodal Large Language Model for Interleaved Reasoning in Chest X-ray via Curriculum-Guided Reinforcement Learning #87 CX-Mind：一种开创性的多模态大型语言模型，通过课程引导的强化学习实现胸部 X 光的交错推理

Chest X-ray (CXR) imaging is one of the most widely used diagnostic modalities in clinical practice, encompassing a broad spectrum of diagnostic tasks. Recent advancements have seen the extensive application of reasoning-based multimodal large language models (MLLMs) in medical imaging to enhance diagnostic efficiency and interpretability. However, existing multimodal models predominantly rely on “one-time” diagnostic approaches, lacking verifiable supervision of the reasoning process. This leads to challenges in multi-task CXR diagnosis, including lengthy reasoning, sparse rewards, and frequent hallucinations. To address these issues, we propose CX-Mind, the first generative model to achieve interleaved “think-answer” reasoning for CXR tasks, driven by curriculum-based reinforcement learning and verifiable process rewards (CuRL-VPR). Specifically, we constructed an instruction-tuning dataset, CX-Set, comprising 708,473 images and 2,619,148 samples, and generated 42,828 high-quality interleaved reasoning data points supervised by clinical reports. Optimization was conducted in two stages under the Group Relative Policy Optimization framework: initially stabilizing basic reasoning with closed-domain tasks, followed by transfer to open-domain diagnostics, incorporating rule-based conditional process rewards to bypass the need for pretrained reward models. Extensive experimental results demonstrate that CX-Mind significantly outperforms existing medical and general-domain MLLMs in visual understanding, text generation, and spatiotemporal alignment, achieving an average performance improvement of 25.1% over comparable CXR-specific models. On a real-world clinical dataset (Rui-CXR), CX-Mind achieves a mean recall@1 across 14 diseases that substantially surpasses the second-best results, with multi-center expert evaluations further confirming its clinical utility across multiple dimensions.
胸部 X 光（CXR）成像是临床实践中最广泛使用的诊断方式之一，涵盖了广泛的诊断任务。近年来，基于推理的多模态大型语言模型（MLLMs）在医学影像中的广泛应用，提升了诊断效率和可解释性。然而，现有的多模态模型主要依赖“一次性”诊断方法，缺乏对推理过程的可验证监督。这导致多任务 CXR 诊断面临推理时间长、奖励稀疏和频繁幻觉等挑战。为解决这些问题，我们提出了 CX-Mind，这是首个实现交错“思考-回答”推理的生成模型，针对 CXR 任务，采用基于课程的强化学习和可验证过程奖励（CuRL-VPR）驱动。具体而言，我们构建了一个指令调优数据集 CX-Set，包含 708,473 张图像和 2,619,148 个样本，并生成了 42,828 个由临床报告监督的高质量交错推理数据点。优化在 Group Relative Policy Optimization 框架下分两个阶段进行：首先通过封闭域任务稳定基础推理，随后转移到开放域诊断，结合基于规则的条件过程奖励，避免了对预训练奖励模型的依赖。大量实验结果表明，CX-Mind 在视觉理解、文本生成和时空对齐方面显著优于现有的医疗和通用领域多模态大模型（MLLMs），在可比的胸片特定模型上平均性能提升了 25.1%。在真实临床数据集（Rui-CXR）上，CX-Mind 在 14 种疾病的平均 recall@1 指标远超第二名，多中心专家评估进一步确认了其在多个维度上的临床实用性。

Subjects: Machine Learning, Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition 主题：机器学习，人工智能，计算与语言，计算机视觉与模式识别

Publish: 2025-07-31 05:07:18 UTC 发布时间：2025-07-31 05:07:18 UTC

#88 Health Insurance Coverage Rule Interpretation Corpus: Law, Policy, and Medical Guidance for Health Insurance Coverage Understanding #88 健康保险覆盖规则解释语料库：健康保险覆盖理解的法律、政策与医疗指导

Author: [Mike Gartner](https://arxiv.org/search/?searchtype=author&query=Mike Gartner) 作者：Mike Gartner

U.S. health insurance is complex, and inadequate understanding and limited access to justice have dire implications for the most vulnerable. Advances in natural language processing present an opportunity to support efficient, case-specific understanding, and to improve access to justice and healthcare. Yet existing corpora lack context necessary for assessing even simple cases. We collect and release a corpus of reputable legal and medical text related to U.S. health insurance. We also introduce an outcome prediction task for health insurance appeals designed to support regulatory and patient self-help applications, and release a labeled benchmark for our task, and models trained on it. 美国的健康保险体系复杂，理解不足和有限的司法途径对最脆弱群体带来严重影响。自然语言处理的进步为支持高效、针对具体案例的理解，以及改善司法和医疗服务的可及性提供了机会。然而，现有语料库缺乏评估即使是简单案例所需的上下文信息。我们收集并发布了一个与美国健康保险相关的权威法律和医疗文本语料库。我们还引入了一个针对健康保险申诉的结果预测任务，旨在支持监管和患者自助应用，并发布了该任务的标注基准及其训练模型。

Subjects: Computers and Society, Artificial Intelligence, Computation and Language, Machine Learning 主题：计算机与社会，人工智能，计算与语言，机器学习

Publish: 2025-07-28 00:22:03 UTC 发布：2025-07-28 00:22:03 UTC

Authors: [Wenchuan Mu](https://arxiv.org/search/?searchtype=author&query=Wenchuan Mu), [Menglin Li](https://arxiv.org/search/?searchtype=author&query=Menglin Li), [Kwan Hui Lim](https://arxiv.org/search/?searchtype=author&query=Kwan Hui Lim) 作者：穆文川，李梦林，林冠辉

Social media platforms such as Twitter and Facebook have become deeply embedded in our everyday life, offering a dynamic stream of localized news and personal experiences. The ubiquity of these platforms positions them as valuable resources for identifying estate-related issues, especially in the context of growing urban populations. In this work, we present a language model-based system for the detection and classification of estate-related events from social media content. Our system employs a hierarchical classification framework to first filter relevant posts and then categorize them into actionable estate-related topics. Additionally, for posts lacking explicit geotags, we apply a transformer-based geolocation module to infer posting locations at the point-of-interest level. This integrated approach supports timely, data-driven insights for urban management, operational response and situational awareness. 社交媒体平台如 Twitter 和 Facebook 已深深融入我们的日常生活，提供动态的本地新闻和个人经历流。这些平台的普及使其成为识别房地产相关问题的宝贵资源，尤其是在城市人口不断增长的背景下。在本研究中，我们提出了一个基于语言模型的系统，用于从社交媒体内容中检测和分类房地产相关事件。我们的系统采用分层分类框架，首先筛选相关帖子，然后将其归类为可操作的房地产相关话题。此外，对于缺乏明确地理标签的帖子，我们应用基于 Transformer 的地理定位模块，以兴趣点级别推断发帖位置。这种集成方法支持城市管理、运营响应和态势感知的及时数据驱动洞察。

Subjects: Information Retrieval, Artificial Intelligence, Computation and Language, Machine Learning, Social and Information Networks 主题：信息检索，人工智能，计算与语言，机器学习，社会与信息网络

Publish: 2025-07-22 14:48:42 UTC 发布时间：2025-07-22 14:48:42 UTC

#90 MD-LLM-1: A Large Language Model for Molecular Dynamics #90 MD-LLM-1：用于分子动力学的大型语言模型

Authors: [Mhd Hussein Murtada](https://arxiv.org/search/?searchtype=author&query=Mhd Hussein Murtada), [Z. Faidon Brotzakis](https://arxiv.org/search/?searchtype=author&query=Z. Faidon Brotzakis), [Michele Vendruscolo](https://arxiv.org/search/?searchtype=author&query=Michele Vendruscolo) 作者：Mhd Hussein Murtada，Z. Faidon Brotzakis，Michele Vendruscolo

Molecular dynamics (MD) is a powerful approach for modelling molecular systems, but it remains computationally intensive on spatial and time scales of many macromolecular systems of biological interest. To explore the opportunities offered by deep learning to address this problem, we introduce a Molecular Dynamics Large Language Model (MD-LLM) framework to illustrate how LLMs can be leveraged to learn protein dynamics and discover states not seen in training. By applying MD-LLM-1, the first implementation of this approach, obtained by fine-tuning Mistral 7B, to the T4 lysozyme and Mad2 protein systems, we show that training on one conformational state enables the prediction of other conformational states. These results indicate that MD-LLM-1 can learn the principles for the exploration of the conformational landscapes of proteins, although it is not yet modeling explicitly their thermodynamics and kinetics. 分子动力学（MD）是一种强大的分子系统建模方法，但在许多生物学相关的大分子系统的空间和时间尺度上仍然计算量巨大。为了探索深度学习在解决这一问题上的潜力，我们引入了分子动力学大型语言模型（MD-LLM）框架，以展示如何利用 LLMs 学习蛋白质动力学并发现训练中未见的状态。通过将 MD-LLM-1——这一方法的首次实现版本，通过微调 Mistral 7B 获得——应用于 T4 溶菌酶和 Mad2 蛋白系统，我们展示了在一个构象状态上训练能够预测其他构象状态。这些结果表明，MD-LLM-1 能够学习蛋白质构象景观探索的原理，尽管它尚未明确模拟其热力学和动力学。

Subjects: Biomolecules, Computation and Language, Machine Learning, Computational Physics 主题：生物分子，计算与语言，机器学习，计算物理

Publish: 2025-07-21 20:31:53 UTC 发布时间：2025-07-21 20:31:53 UTC

1.2.2 Artificial Intelligence

**From：**https://papers.cool/arxiv/cs.AI

2025-08-07 | Total: 176

#1 SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience #1 SEAgent：具备自主经验学习能力的自我进化计算机使用代理

Repurposing large vision-language models (LVLMs) as computer use agents (CUAs) has led to substantial breakthroughs, primarily driven by human-labeled data. However, these models often struggle with novel and specialized software, particularly in scenarios lacking human annotations. To address this challenge, we propose SEAgent, an agentic self-evolving framework enabling CUAs to autonomously evolve through interactions with unfamiliar software. Specifically, SEAgent empowers computer-use agents to autonomously master novel software environments via experiential learning, where agents explore new software, learn through iterative trial-and-error, and progressively tackle auto-generated tasks organized from simple to complex. To achieve this goal, we design a World State Model for step-wise trajectory assessment, along with a Curriculum Generator that generates increasingly diverse and challenging tasks. The agent’s policy is updated through experiential learning, comprised of adversarial imitation of failure actions and Group Relative Policy Optimization (GRPO) on successful ones. Furthermore, we introduce a specialist-to-generalist training strategy that integrates individual experiential insights from specialist agents, facilitating the development of a stronger generalist CUA capable of continuous autonomous evolution. This unified agent ultimately achieves performance surpassing ensembles of individual specialist agents on their specialized software. We validate the effectiveness of SEAgent across five novel software environments within OS-World. Our approach achieves a significant improvement of 23.2% in success rate, from 11.3% to 34.5%, over a competitive open-source CUA, i.e., UI-TARS. 
将大型视觉语言模型（LVLMs）重新用作计算机使用代理（CUAs）已带来重大突破，这主要依赖于人工标注数据。然而，这些模型在处理新颖和专业的软件时常常表现不佳，尤其是在缺乏人工注释的场景中。为了解决这一挑战，我们提出了 SEAgent，一种代理自我进化框架，使 CUAs 能够通过与陌生软件的交互自主进化。具体而言，SEAgent 使计算机使用代理能够通过体验式学习自主掌握新软件环境，代理通过探索新软件、反复试错学习，并逐步完成从简单到复杂自动生成的任务。为实现这一目标，我们设计了一个用于逐步轨迹评估的世界状态模型，以及一个生成日益多样且具有挑战性任务的课程生成器。代理的策略通过体验式学习进行更新，包括对失败动作的对抗模仿和对成功动作的群体相对策略优化（GRPO）。此外，我们引入了一种专家到通才的训练策略，该策略整合了专家代理的个体经验见解，促进了更强大的通才 CUA 的开发，使其能够持续自主进化。该统一代理最终在其专门的软件上实现了超越单个专家代理集成的性能。我们在 OS-World 中的五个新颖软件环境中验证了 SEAgent 的有效性。我们的方法在成功率上相较于一个具有竞争力的开源 CUA（即 UI-TARS）实现了显著提升，成功率从 11.3%提升至 34.5%，提高了 23.2%。
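
The abstract mentions Group Relative Policy Optimization (GRPO) without spelling out the update. As a rough illustration only, the sketch below shows the group-relative advantage that GRPO is generally described as using: each rollout's reward is normalized against the mean and standard deviation of its own sampling group. The function name, the `eps` constant, and the choice of population standard deviation are assumptions made here, not details taken from SEAgent.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std.

    `rewards` holds one scalar reward per rollout sampled for the same task.
    `eps` (an assumed constant) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts of one task: two succeed (1.0), two fail (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Successful rollouts receive positive advantages and failed ones negative, so no separate value network is needed to form a baseline.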

Subjects: Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition, Machine Learning, Multiagent Systems, Multimedia 主题：人工智能，计算与语言，计算机视觉与模式识别，机器学习，多智能体系统，多媒体

Publish: 2025-08-06 17:58:46 UTC 发布时间：2025-08-06 17:58:46 UTC

#2 LLM Collaboration With Multi-Agent Reinforcement Learning #2 LLM 协作与多智能体强化学习

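A large body of work in multi-agent systems (MAS) has modeled and solved problems involving multiple interacting agents. However, most LLMs are pretrained independently and are not specifically optimized for coordination. Existing LLM fine-tuning frameworks rely on individual rewards, which requires designing complex rewards for each agent to encourage collaboration. To address these challenges, we model LLM collaboration as a cooperative Multi-Agent Reinforcement Learning (MARL) problem. Building on current RL approaches for LLMs as well as MARL techniques, we develop a multi-agent, multi-turn algorithm, Multi-Agent Group Relative Policy Optimization (MAGRPO), to solve it. Our experiments on LLM writing and coding collaboration show that fine-tuning MAS with MAGRPO enables agents to generate high-quality responses efficiently through effective cooperation. Our approach opens the door to applying other MARL methods to LLMs and highlights the associated challenges. 在多智能体系统（MAS）中，已经进行了大量工作来建模和解决多个交互智能体的问题。然而，大多数 LLMs 是独立预训练的，并未专门针对协调进行优化。现有的 LLM 微调框架依赖于个体奖励，这需要为每个智能体设计复杂的奖励机制以鼓励协作。为了解决这些挑战，我们将 LLM 协作建模为一个合作型多智能体强化学习（MARL）问题。我们基于当前针对 LLM 的强化学习方法以及 MARL 技术，开发了一种多智能体、多轮次算法，即多智能体群体相对策略优化（MAGRPO），来解决该问题。我们在 LLM 写作和编码协作上的实验表明，使用 MAGRPO 微调 MAS 能够使智能体通过有效协作高效生成高质量的响应。我们的方法为将其他 MARL 方法应用于 LLM 打开了大门，并突出了相关的挑战。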

Publish: 2025-08-06 17:18:25 UTC 发布时间：2025-08-06 17:18:25 UTC

#3 ConfProBench: A Confidence Evaluation Benchmark for MLLM-Based Process Judges #3 ConfProBench：基于 MLLM 的过程判决置信度评估基准

Authors: [Yue Zhou](https://arxiv.org/search/?searchtype=author&query=Yue Zhou), [Yi Chang](https://arxiv.org/search/?searchtype=author&query=Yi Chang), [Yuan Wu](https://arxiv.org/search/?searchtype=author&query=Yuan Wu) 作者：周越，常毅，吴元

Reasoning is a critical capability of multimodal large language models (MLLMs) for solving complex multimodal tasks, and judging the correctness of reasoning steps is crucial for improving this capability. Recently, MLLM-based process judges (MPJs) have been widely used to assess the correctness of reasoning steps in multimodal tasks. Therefore, evaluating MPJs is important for identifying their limitations and guiding future improvements. However, existing benchmarks for MPJs mainly focus on tasks such as step correctness classification and reasoning process search, while overlooking a key aspect: whether the confidence scores produced by MPJs at the step level are reliable. To address this gap, we propose ConfProBench, the first comprehensive benchmark designed to systematically evaluate the reliability of step-level confidence scores generated by MPJs. Our benchmark constructs three types of adversarially perturbed reasoning steps: Synonym Substitution, Syntactic Transformation, and Image Perturbation, to test the robustness of MPJ confidence under perturbations. In addition, we introduce three novel evaluation metrics: Confidence Robustness Score (CRS), Confidence Sensitivity Score (CSS), and Confidence Calibration Score (CCS), which evaluate robustness, sensitivity, and calibration, respectively. We evaluate 14 state-of-the-art MLLMs, including both proprietary and open-source models. Experiments reveal limitations in current MPJs’ confidence performance and offer competitive baselines to support future research. 
推理是多模态大型语言模型（MLLMs）解决复杂多模态任务的关键能力，而判断推理步骤的正确性对于提升该能力至关重要。近年来，基于 MLLM 的过程判定器（MPJs）被广泛用于评估多模态任务中推理步骤的正确性。因此，评估 MPJs 对于识别其局限性和指导未来改进具有重要意义。然而，现有的 MPJ 基准测试主要集中在步骤正确性分类和推理过程搜索等任务上，忽视了一个关键方面：MPJs 在步骤级别产生的置信度分数是否可靠。为填补这一空白，我们提出了 ConfProBench，这是首个系统评估 MPJs 生成的步骤级置信度分数可靠性的综合基准。我们的基准构建了三种类型的对抗性扰动推理步骤：同义词替换、句法转换和图像扰动，以测试 MPJ 置信度在扰动下的鲁棒性。此外，我们引入了三种新颖的评估指标：置信度鲁棒性得分（CRS）、置信度敏感性得分（CSS）和置信度校准得分（CCS），分别用于评估鲁棒性、敏感性和校准性。我们评估了 14 个最先进的多模态大语言模型（MLLM），包括专有模型和开源模型。实验揭示了当前多模态判决系统（MPJs）在置信度表现上的局限性，并提供了具有竞争力的基线，以支持未来的研究。
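
The abstract names calibration as one of the properties being measured but does not define CRS, CSS, or CCS. To make the idea of calibration concrete, here is a minimal sketch of the widely used expected calibration error (ECE), which compares average confidence with empirical accuracy inside confidence bins. This is a stand-in for illustration, not the paper's Confidence Calibration Score; the function name and the default of 10 bins are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    `confidences` are step-level scores in [0, 1]; `correct` are booleans
    saying whether the judged step was actually right.
    """
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes to last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(avg_conf - accuracy)
    return ece


# Hypothetical judge: always 90% confident, right three times out of four.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

A judge whose confidence matches its hit rate in every bin scores 0; larger values indicate over- or under-confidence.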

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-06 16:00:19 UTC 发布时间：2025-08-06 16:00:19 UTC

#4 SID: Benchmarking Guided Instruction Capabilities in STEM Education with a Socratic Interdisciplinary Dialogues Dataset #4 SID：使用苏格拉底跨学科对话数据集对 STEM 教育中的引导式教学能力进行基准测试

Authors: [Mei Jiang](https://arxiv.org/search/?searchtype=author&query=Mei Jiang), [Houping Yue](https://arxiv.org/search/?searchtype=author&query=Houping Yue), [Bingdong Li](https://arxiv.org/search/?searchtype=author&query=Bingdong Li), [Hao Hao](https://arxiv.org/search/?searchtype=author&query=Hao Hao), [Ying Qian](https://arxiv.org/search/?searchtype=author&query=Ying Qian), [Bo Jiang](https://arxiv.org/search/?searchtype=author&query=Bo Jiang), [Aimin Zhou](https://arxiv.org/search/?searchtype=author&query=Aimin Zhou) 作者：姜梅，岳厚平，李炳东，郝昊，钱颖，蒋波，周爱民

Fostering students’ abilities for knowledge integration and transfer in complex problem-solving scenarios is a core objective of modern education, and interdisciplinary STEM is a key pathway to achieve this, yet it requires expert guidance that is difficult to scale. While LLMs offer potential in this regard, their true capability for guided instruction remains unclear due to the lack of an effective evaluation benchmark. To address this, we introduce SID, the first benchmark designed to systematically evaluate the higher-order guidance capabilities of LLMs in multi-turn, interdisciplinary Socratic dialogues. Our contributions include a large-scale dataset of 10,000 dialogue turns across 48 complex STEM projects, a novel annotation schema for capturing deep pedagogical features, and a new suite of evaluation metrics (e.g., X-SRG). Baseline experiments confirm that even state-of-the-art LLMs struggle to execute effective guided dialogues that lead students to achieve knowledge integration and transfer. This highlights the critical value of our benchmark in driving the development of more pedagogically-aware LLMs. 培养学生在复杂问题解决情境中整合与迁移知识的能力是现代教育的核心目标，而跨学科 STEM 是实现这一目标的关键途径，但这需要难以规模化的专家指导。尽管 LLMs 在这方面展现出潜力，但由于缺乏有效的评估基准，其在引导式教学中的真实能力尚不明确。为此，我们提出了 SID，这是首个旨在系统评估 LLMs 在多轮跨学科苏格拉底式对话中高阶引导能力的基准。我们的贡献包括涵盖 48 个复杂 STEM 项目、共计 10,000 轮对话的大规模数据集、一套用于捕捉深层教学特征的新型标注方案，以及一组新的评估指标（如 X-SRG）。基线实验表明，即使是最先进的 LLMs 也难以执行有效的引导对话，帮助学生实现知识整合与迁移。这凸显了我们基准在推动更具教学意识的 LLMs 发展中的关键价值。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-06 15:49:26 UTC 发布时间：2025-08-06 15:49:26 UTC

#5 Argumentative Debates for Transparent Bias Detection [Technical Report] #5 透明偏见检测的论辩辩论 [技术报告]

Authors: [Hamed Ayoobi](https://arxiv.org/search/?searchtype=author&query=Hamed Ayoobi), [Nico Potyka](https://arxiv.org/search/?searchtype=author&query=Nico Potyka), [Anna Rapberger](https://arxiv.org/search/?searchtype=author&query=Anna Rapberger), [Francesca Toni](https://arxiv.org/search/?searchtype=author&query=Francesca Toni) 作者：Hamed Ayoobi、Nico Potyka、Anna Rapberger、Francesca Toni

As the use of AI systems in society grows, addressing potential biases that emerge from data or are learned by models is essential to prevent systematic disadvantages against specific groups. Several notions of (un)fairness have been proposed in the literature, alongside corresponding algorithmic methods for detecting and mitigating unfairness, but, with very few exceptions, these tend to ignore transparency. Instead, interpretability and explainability are core requirements for algorithmic fairness, even more so than for other algorithmic solutions, given the human-oriented nature of fairness. In this paper, we contribute a novel interpretable, explainable method for bias detection relying on debates about the presence of bias against individuals, based on the values of protected features for the individuals and others in their neighbourhoods. Our method builds upon techniques from formal and computational argumentation, whereby debates result from arguing about biases within and across neighbourhoods. We provide formal, quantitative, and qualitative evaluations of our method, highlighting its strengths in performance against baselines, as well as its interpretability and explainability. 随着人工智能系统在社会中的应用日益增多，解决由数据产生或模型学习到的潜在偏见问题对于防止对特定群体的系统性不利至关重要。文献中提出了多种（不）公平性的概念，以及相应的算法方法用于检测和缓解不公平现象，但除极少数例外，这些方法往往忽视了透明性。相反，解释性和可解释性是算法公平性的核心要求，甚至比其他算法解决方案更为重要，因为公平性具有以人为本的特性。本文提出了一种新颖的可解释、可说明的偏见检测方法，该方法基于围绕个体是否存在偏见的辩论，依托个体及其邻域中其他人的受保护特征值。我们的方法建立在形式化和计算论证技术之上，通过在邻域内及邻域间就偏见进行辩论来得出结论。我们提供了对我们方法的正式、定量和定性评估，突出其在性能上相较基线方法的优势，以及其可解释性和说明性。

Subjects: Artificial Intelligence, Machine Learning 主题：人工智能，机器学习

Publish: 2025-08-06 14:56:08 UTC 发布时间：2025-08-06 14:56:08 UTC

#6 OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use #6 操作系统代理：基于多模态大型语言模型（MLLM）代理在通用计算设备中的应用综述

Publish: 2025-08-06 14:33:45 UTC 发布时间：2025-08-06 14:33:45 UTC

#7 From "Aha Moments" to Controllable Thinking: Toward Meta-Cognitive Reasoning in Large Reasoning Models via Decoupled Reasoning and Control #7 从“顿悟时刻”到可控思维：通过解耦推理与控制迈向大型推理模型中的元认知推理

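Large reasoning models (LRMs) have shown a latent capacity for complex reasoning by spontaneously exhibiting cognitive behaviors such as step-by-step reasoning, reflection, and backtracking, commonly referred to as "Aha Moments". However, such emergent behaviors remain unregulated and uncontrolled, often resulting in overthinking, where the model keeps generating redundant reasoning content after it has reached a reliable conclusion. This leads to excessive computational cost and increased latency, limiting the practical deployment of LRMs. The root cause lies in the absence of intrinsic regulatory mechanisms: current models cannot monitor and adaptively manage their reasoning process to decide when to continue, backtrack, or terminate. To address this issue, we propose the Meta-cognitive Reasoning Framework (MERA), which explicitly decouples the thinking process into distinct reasoning and control components, enabling independent optimization of control strategies. Specifically, MERA introduces a takeover-based data construction mechanism that identifies critical decision points during reasoning and delegates the generation of control signals to auxiliary LLMs, enabling the construction of high-quality reasoning-control data. In addition, a structured reasoning-control separation is implemented via supervised fine-tuning, so that the model generates explicit reasoning traces and acquires initial meta-cognitive control capabilities. Finally, MERA employs Control-Segment Policy Optimization (CSPO), which combines segment-wise Group Relative Policy Optimization (GRPO) with a control-masking mechanism to optimize the learning of control behavior while minimizing interference from irrelevant content. Experiments on several reasoning benchmarks show that models trained with MERA improve in both reasoning efficiency and accuracy. 大型推理模型（LRMs）通过自发展现诸如逐步推理、反思和回溯等认知行为，展示了复杂推理的潜在能力，这些行为通常被称为“顿悟时刻”。然而，这种涌现行为仍然缺乏监管和控制，常常导致过度思考，即模型在达到可靠结论后仍继续生成冗余的推理内容。这导致计算成本过高和延迟增加，限制了 LRMs 的实际应用。根本原因在于缺乏内在的调节机制，当前模型无法监控并自适应管理其推理过程，以决定何时继续、回溯或终止。为解决这一问题，我们提出了元认知推理框架（MERA），该框架明确将思考过程分解为独立的推理和控制组件，从而实现控制策略的独立优化。具体来说，MERA 引入了一种基于接管的数据构建机制，该机制在推理过程中识别关键决策点，并将控制信号的生成委托给辅助 LLMs，从而实现高质量推理控制数据的构建。此外，通过监督微调实现了结构化的推理-控制分离，使模型能够生成明确的推理轨迹并获得初步的元认知控制能力。最后，MERA 采用了控制段策略优化（Control-Segment Policy Optimization，CSPO），该方法结合了分段的组相对策略优化（Group Relative Policy Optimization，GRPO）和控制掩码机制，以优化控制行为的学习，同时最大限度地减少无关内容的干扰。在多个推理基准上的实验表明，使用 MERA 训练的模型在推理效率和准确性方面均有所提升。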

Publish: 2025-08-06 13:59:17 UTC 发布时间：2025-08-06 13:59:17 UTC

#8 SimInstruct: A Responsible Tool for Collecting Scaffolding Dialogues Between Experts and LLM-Simulated Novices #8 SimInstruct：一个用于收集专家与 LLM 模拟新手之间支架对话的负责任工具

High-quality, multi-turn instructional dialogues between novices and experts are essential for developing AI systems that support teaching, learning, and decision-making. These dialogues often involve scaffolding – the process by which an expert supports a novice’s thinking through questions, feedback, and step-by-step guidance. However, such data are scarce due to privacy concerns in recording and the vulnerability inherent in help-seeking. We present SimInstruct, a scalable, expert-in-the-loop tool for collecting scaffolding dialogues. Using teaching development coaching as an example domain, SimInstruct simulates novice instructors via LLMs, varying their teaching challenges and LLM’s persona traits, while human experts provide multi-turn feedback, reasoning, and instructional support. This design enables the creation of realistic, pedagogically rich dialogues without requiring real novice participants. Our results reveal that persona traits, such as extroversion and introversion, meaningfully influence how experts engage. Compared to real mentoring recordings, SimInstruct dialogues demonstrate comparable pedagogical relevance and cognitive depth. Experts also reported the process as engaging and reflective, improving both data quality and their own professional insight. We further fine-tuned a LLaMA model to be an expert model using the augmented dataset, which outperformed GPT-4o in instructional quality. Our analysis highlights GPT-4o’s limitations in weak reflective questioning, overuse of generic praise, a condescending tone, and a tendency to overwhelm novices with excessive suggestions. 
高质量的多轮教学对话在新手与专家之间至关重要，有助于开发支持教学、学习和决策的 AI 系统。这些对话通常涉及支架式教学——专家通过提问、反馈和逐步指导来支持新手的思维过程。然而，由于隐私问题和求助时的脆弱性，此类数据十分稀缺。我们提出了 SimInstruct，一种可扩展的专家参与式工具，用于收集支架式对话。以教学发展辅导为示例领域，SimInstruct 通过 LLMs 模拟新手教师，变化其教学挑战和 LLM 的人格特质，同时由人类专家提供多轮反馈、推理和教学支持。该设计无需真实新手参与，即可创建真实且富有教学意义的对话。我们的结果显示，诸如外向和内向等人格特质显著影响专家的参与方式。与真实辅导录音相比，SimInstruct 对话在教学相关性和认知深度上表现相当。专家们还表示该过程具有吸引力且富有反思性，既提升了数据质量，也增强了他们自身的专业洞察力。我们进一步使用增强数据集对 LLaMA 模型进行了微调，使其成为专家模型，其教学质量优于 GPT-4o。我们的分析指出了 GPT-4o 在反思性提问薄弱、过度使用泛泛的赞美、语气居高临下以及倾向于用过多建议让新手不知所措等方面的局限性。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-06 13:16:10 UTC 发布时间：2025-08-06 13:16:10 UTC

#9 Beyond Pixels: Exploring DOM Downsampling for LLM-Based Web Agents #9 超越像素：探索基于 LLM 的网页代理的 DOM 降采样

Frontier LLMs only recently enabled serviceable, autonomous web agents. At that, a model poses as an instantaneous domain model backend. Ought to suggest interaction, it is consulted with a web-based task and respective application state. The key problem lies in application state serialisation – referred to as snapshot. State-of-the-art web agents are premised on grounded GUI snapshots, i.e., screenshots enhanced with visual cues. Not least to resemble human perception, but for images representing relatively cheap means of model input. LLM vision still lags behind code interpretation capabilities. DOM snapshots, which structurally resemble HTML, impose a desired alternative. Vast model input token size, however, disables reliable implementation with web agents to date. We propose D2Snap, a first-of-its-kind DOM downsampling algorithm. Based on a GPT-4o backend, we evaluate D2Snap on tasks sampled from the Online-Mind2Web dataset. The success rate of D2Snap-downsampled DOM snapshots (67%) matches a grounded GUI snapshot baseline (65%) – within the same input token order of magnitude (1e3). Our best evaluated configurations – one token order above, but within the model’s context window – outperform this baseline by 8%. Our evaluation, moreover, yields that DOM-inherent hierarchy embodies a strong UI feature for LLMs. 前沿的 LLMs 最近才实现了可用的自主网络代理。在此过程中，模型充当即时的领域模型后端。为了建议交互，模型会被咨询有关基于网络的任务及相应的应用状态。关键问题在于应用状态的序列化，称为快照。最先进的网络代理基于有根的 GUI 快照，即带有视觉提示的屏幕截图。这不仅是为了模拟人类感知，更因为图像代表了相对廉价的模型输入手段。LLM 视觉能力仍落后于代码解释能力。结构上类似 HTML 的 DOM 快照提供了一个理想的替代方案。然而，庞大的模型输入令牌规模至今阻碍了其在网络代理中的可靠实现。我们提出了 D2Snap，一种首创的 DOM 降采样算法。基于 GPT-4o 后端，我们在 Online-Mind2Web 数据集中抽样的任务上评估了 D2Snap。D2Snap 降采样的 DOM 快照成功率（67%）与有根 GUI 快照基线（65%）相当，且输入令牌数量级相同（约 1e3）。我们评估的最佳配置（令牌数量高一个数量级，但仍在模型的上下文窗口内）比该基线表现高出 8%。此外，我们的评估表明，DOM 固有的层次结构对 LLMs 而言是一种强有力的 UI 特征。
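
To give a feel for what shrinking a DOM snapshot can look like, the sketch below uses Python's standard `html.parser` to drop non-content subtrees and keep only a few attributes while preserving the element hierarchy. It is a deliberately naive illustration and is not the D2Snap algorithm; the `DROPPED`, `KEPT_ATTRS`, and `VOID` sets and the `downsample` helper are all hypothetical choices made here.

```python
from html.parser import HTMLParser

DROPPED = {"script", "style", "svg", "noscript"}          # assumed low-value subtrees
KEPT_ATTRS = {"id", "href", "type", "name", "aria-label"}  # assumed useful attributes
VOID = {"br", "img", "input", "hr", "meta", "link"}       # elements without end tags


class Downsampler(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped subtree

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        kept = "".join(
            f' {k}="{v}"' for k, v in attrs if k in KEPT_ATTRS and v is not None
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth and tag not in VOID:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text and not self.skip_depth:
            self.out.append(text)


def downsample(html):
    parser = Downsampler()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


page = '<div class="x" id="main"><script>var a=1;</script><a href="/go" style="c">Go  now</a></div>'
print(downsample(page))
```

The nesting of the surviving elements is left intact, which lines up with the abstract's observation that DOM hierarchy is itself a useful signal for the model.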

Subjects: Artificial Intelligence, Computation and Language, Human-Computer Interaction 主题：人工智能，计算与语言，人机交互

Publish: 2025-08-06 12:56:54 UTC 发布时间：2025-08-06 12:56:54 UTC

#10 GuirlVG: Incentivize GUI Visual Grounding via Empirical Exploration on Reinforcement Learning #10 GuirlVG：通过强化学习的经验探索激励 GUI 视觉定位

Authors: [Weitai Kang](https://arxiv.org/search/?searchtype=author&query=Weitai Kang), [Bin Lei](https://arxiv.org/search/?searchtype=author&query=Bin Lei), [Gaowen Liu](https://arxiv.org/search/?searchtype=author&query=Gaowen Liu), [Caiwen Ding](https://arxiv.org/search/?searchtype=author&query=Caiwen Ding), [Yan Yan](https://arxiv.org/search/?searchtype=author&query=Yan Yan) 作者：康伟泰，雷斌，刘高文，丁才文，严岩

Graphical user interface visual grounding (GUI-VG), a core capability for GUI agents, has primarily relied on supervised fine-tuning (SFT) of multimodal large language models (MLLMs), which demands extensive data curation and significant training costs. However, as MLLMs continue to advance and even cover GUI domains during pretraining, the necessity of exhaustive SFT post-training becomes increasingly questionable. Meanwhile, recent successes of rule-based reinforcement fine-tuning (RFT) suggest a more efficient alternative. Despite this promise, the optimal manner of applying RFT for GUI-VG remains unexplored. To bridge this gap, we introduce GuirlVG, a reinforcement learning-based GUI-VG method built on a systematic empirical study and a novel stabilization technique. We find that naive application of RFT underperforms the SFT baseline, motivating a deeper exploration. First, we decompose RFT into its core components and analyze the optimal formulation of each. Second, we propose a novel Adversarial KL Factor that dynamically stabilizes training to mitigate reward over-optimization. Third, we further explore the training configurations of RFT to enhance effectiveness. Extensive experiments show that GuirlVG, with only 5.2K training samples, outperforms SFT methods trained on over 10M samples, achieving a 7.7% improvement on ScreenSpot, a 17.2% improvement on ScreenSpotPro, and 91.9% accuracy on ScreenSpotV2. 图形用户界面视觉定位（GUI-VG）作为 GUI 代理的核心能力，主要依赖于多模态大语言模型（MLLMs）的监督微调（SFT），这需要大量的数据整理和显著的训练成本。然而，随着 MLLMs 的不断进步，甚至在预训练阶段就涵盖了 GUI 领域，训练后进行全面的 SFT 的必要性变得越来越值得怀疑。与此同时，基于规则的强化微调（RFT）近期的成功表明了一种更高效的替代方案。尽管如此，如何最佳地将 RFT 应用于 GUI-VG 仍未被探索。为填补这一空白，我们提出了 GuirlVG，一种基于强化学习的 GUI-VG 方法，建立在系统的实证研究和一种新颖的稳定技术之上。我们发现，简单应用 RFT 的表现不及 SFT 基线，这促使我们进行更深入的探索。首先，我们将 RFT 分解为其核心组成部分，并分析每个部分的最优形式。其次，我们提出了一种新颖的对抗 KL 因子，动态稳定训练以缓解奖励过度优化。第三，我们进一步探索了 RFT 的训练配置以提升其效果。大量实验表明，GuirlVG 仅使用 5.2K 训练样本，就优于使用超过 1000 万样本训练的 SFT 方法，在 ScreenSpot 上提升了 7.7%，在 ScreenSpotPro 上提升了 17.2%，并在 ScreenSpotV2 上达到了 91.9%的准确率。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-06 12:35:24 UTC 发布时间：2025-08-06 12:35:24 UTC

#11 Artificial Consciousness as Interface Representation #11 人工意识作为界面表示

Author: [Robert Prentner](https://arxiv.org/search/?searchtype=author&query=Robert Prentner) 作者：Robert Prentner

Whether artificial intelligence (AI) systems can possess consciousness is a contentious question because of the inherent challenges of defining and operationalizing subjective experience. This paper proposes a framework to reframe the question of artificial consciousness into empirically tractable tests. We introduce three evaluative criteria - S (subjective-linguistic), L (latent-emergent), and P (phenomenological-structural) - collectively termed SLP-tests, which assess whether an AI system instantiates interface representations that facilitate consciousness-like properties. Drawing on category theory, we model interface representations as mappings between relational substrates (RS) and observable behaviors, akin to specific types of abstraction layers. The SLP-tests collectively operationalize subjective experience not as an intrinsic property of physical systems but as a functional interface to a relational entity. 人工智能（AI）系统是否能够拥有意识是一个有争议的问题，因为定义和操作化主观体验本身就存在固有的挑战。本文提出了一个框架，将人工意识的问题重新构建为可实证检验的测试。我们引入了三个评估标准——S（主观-语言）、L（潜在-涌现）和 P（现象学-结构），统称为 SLP 测试，用以评估 AI 系统是否具备促进类似意识属性的接口表征。借助范畴论，我们将接口表征建模为关系基底（RS）与可观察行为之间的映射，类似于特定类型的抽象层。SLP 测试将主观体验操作化，不视其为物理系统的内在属性，而是作为关系实体的功能接口。

Subjects: Artificial Intelligence, Neurons and Cognition 主题：人工智能，神经元与认知

Publish: 2025-08-06 12:25:06 UTC 发布时间：2025-08-06 12:25:06 UTC

While generalist foundation models like Gemini and GPT-4o demonstrate impressive multi-modal competence, existing evaluations fail to test their intelligence in dynamic, interactive worlds. Static benchmarks lack agency, while interactive benchmarks suffer from a severe modal bottleneck, typically ignoring crucial auditory and temporal cues. To bridge this evaluation chasm, we introduce OmniPlay, a diagnostic benchmark designed not just to evaluate, but to probe the fusion and reasoning capabilities of agentic models across the full sensory spectrum. Built on a core philosophy of modality interdependence, OmniPlay comprises a suite of five game environments that systematically create scenarios of both synergy and conflict, forcing agents to perform genuine cross-modal reasoning. Our comprehensive evaluation of six leading omni-modal models reveals a critical dichotomy: they exhibit superhuman performance on high-fidelity memory tasks but suffer from systemic failures in challenges requiring robust reasoning and strategic planning. We demonstrate that this fragility stems from brittle fusion mechanisms, which lead to catastrophic performance degradation under modality conflict and uncover a counter-intuitive “less is more” paradox, where removing sensory information can paradoxically improve performance. Our findings suggest that the path toward robust AGI requires a research focus beyond scaling to explicitly address synergistic fusion. Our platform is available for anonymous review at https://github.com/fuqingbie/omni-game-benchmark. 
虽然像 Gemini 和 GPT-4o 这样的通用基础模型展示了令人印象深刻的多模态能力，但现有的评估未能测试它们在动态、交互式世界中的智能表现。静态基准缺乏主动性，而交互式基准则存在严重的模态瓶颈，通常忽视了关键的听觉和时间线索。为弥合这一评估鸿沟，我们引入了 OmniPlay，这是一种诊断性基准，旨在不仅评估，还探究具代理性的模型在全感官范围内的融合与推理能力。OmniPlay 基于模态相互依赖的核心理念，包含五个游戏环境套件，系统地创建协同与冲突的场景，迫使代理执行真正的跨模态推理。我们对六个领先的全模态模型进行了全面评估，揭示了一个关键的二分法：它们在高保真记忆任务上表现出超人水平，但在需要强大推理和战略规划的挑战中存在系统性失败。我们证明了这种脆弱性源于脆弱的融合机制，这些机制在模态冲突下导致灾难性的性能下降，并揭示了一个违反直觉的“少即是多”悖论，即移除感官信息反而可以提升性能。我们的研究结果表明，通向强健通用人工智能的道路需要超越规模扩展，明确聚焦于协同融合。我们的平台可供匿名评审，地址为 https://github.com/fuqingbie/omni-game-benchmark。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-06 11:58:58 UTC 发布时间：2025-08-06 11:58:58 UTC

#13 Deliberative Reasoning Network: An Uncertainty-Driven Paradigm for Belief-Tracked Inference with Pretrained Language Models #13 深思熟虑推理网络：一种基于不确定性的预训练语言模型信念追踪推理范式

Authors: [Anran Xu](https://arxiv.org/search/?searchtype=author&query=Anran Xu), [Jincheng Wang](https://arxiv.org/search/?searchtype=author&query=Jincheng Wang), [Baigen Cai](https://arxiv.org/search/?searchtype=author&query=Baigen Cai), [Tao Wen](https://arxiv.org/search/?searchtype=author&query=Tao Wen) 作者：徐安然，王金成，蔡百根，温涛

Large language models often fail at logical reasoning when semantic heuristics conflict with decisive evidence - a phenomenon we term cognitive traps. To address this fundamental limitation, we introduce the Deliberative Reasoning Network (DRN), a novel paradigm that reframes logical reasoning from probability maximization to uncertainty minimization. Instead of asking “Which answer is most likely?”, DRN asks “Which hypothesis has the most internally consistent evidence?”. DRN achieves intrinsic interpretability by explicitly tracking belief states and quantifying epistemic uncertainty for competing hypotheses through an iterative evidence synthesis process. We validate our approach through two complementary architectures - a bespoke discriminative model that embodies the core uncertainty minimization principle, and a lightweight verification module that enhances existing generative LLMs. Evaluated on LCR-1000, our new adversarial reasoning benchmark designed to expose cognitive traps, the bespoke DRN achieves up to 15.2% improvement over standard baselines. When integrated as a parameter-efficient verifier with Mistral-7B, our hybrid system boosts accuracy from 20% to 80% on the most challenging problems. Critically, DRN demonstrates strong zero-shot generalization, improving TruthfulQA performance by 23.6% without additional training, indicating that uncertainty-driven deliberation learns transferable reasoning principles. We position DRN as a foundational, verifiable System 2 reasoning component for building more trustworthy AI systems. 
大型语言模型在语义启发式与决定性证据冲突时，常常在逻辑推理上失败——我们称之为认知陷阱。为了解决这一根本性限制，我们提出了深思推理网络（Deliberative Reasoning Network，DRN），这是一种将逻辑推理从概率最大化重新定义为不确定性最小化的新范式。DRN 不再问“哪个答案最可能？”，而是问“哪个假设拥有最内部一致的证据？”。DRN 通过显式跟踪信念状态并通过迭代证据综合过程量化竞争假设的认知不确定性，实现了内在的可解释性。我们通过两种互补架构验证了该方法——一种体现核心不确定性最小化原则的定制判别模型，以及一种增强现有生成型 LLMs 的轻量级验证模块。在我们设计用于揭示认知陷阱的新对抗推理基准 LCR-1000 上评估，定制 DRN 相比标准基线最高提升了 15.2%。当作为参数高效的验证器与 Mistral-7B 集成时，我们的混合系统在最具挑战性的问题上将准确率从 20%提升至 80%。关键是，DRN 展现了强大的零样本泛化能力，在无需额外训练的情况下，将 TruthfulQA 的表现提升了 23.6%，这表明基于不确定性的深思熟虑推理学习了可迁移的推理原则。我们将 DRN 定位为构建更可信 AI 系统的基础性、可验证的系统 2 推理组件。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-06 11:33:35 UTC 发布时间：2025-08-06 11:33:35 UTC

#14 Synthetic POMDPs to Challenge Memory-Augmented RL: Memory Demand Structure Modeling #14 用于挑战记忆增强强化学习的合成 POMDP：记忆需求结构建模

Recent research has developed benchmarks for memory-augmented reinforcement learning (RL) algorithms, providing Partially Observable Markov Decision Process (POMDP) environments where agents depend on past observations to make decisions. While many benchmarks incorporate sufficiently complex real-world problems, they lack controllability over the degree of challenges posed to memory models. In contrast, synthetic environments enable fine-grained manipulation of dynamics, making them critical for detailed and rigorous evaluation of memory-augmented RL. Our study focuses on POMDP synthesis with three key contributions: 1. A theoretical framework for analyzing POMDPs, grounded in Memory Demand Structure (MDS), transition invariance, and related concepts; 2. A methodology leveraging linear process dynamics, state aggregation, and reward redistribution to construct customized POMDPs with predefined properties; 3. Empirically validated series of POMDP environments with increasing difficulty levels, designed based on our theoretical insights. Our work clarifies the challenges of memory-augmented RL in solving POMDPs, provides guidelines for analyzing and designing POMDP environments, and offers empirical support for selecting memory models in RL tasks. 近期研究开发了用于记忆增强强化学习（RL）算法的基准测试，提供了部分可观测马尔可夫决策过程（POMDP）环境，其中智能体依赖过去的观察来做出决策。虽然许多基准测试包含了足够复杂的现实问题，但它们缺乏对记忆模型所面临挑战程度的可控性。相比之下，合成环境能够对动态进行细粒度的操控，使其成为对记忆增强 RL 进行详细且严格评估的关键。我们的研究聚焦于 POMDP 的合成，具有三大关键贡献：1. 基于记忆需求结构（MDS）、转移不变性及相关概念的 POMDP 分析理论框架；2. 利用线性过程动态、状态聚合和奖励重分配构建具有预定义属性的定制 POMDP 的方法论；3. 基于我们的理论见解设计并实证验证的一系列难度逐渐增加的 POMDP 环境。我们的工作阐明了记忆增强强化学习在解决部分可观测马尔可夫决策过程（POMDP）中的挑战，提供了分析和设计 POMDP 环境的指导方针，并为在强化学习任务中选择记忆模型提供了实证支持。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-06 10:13:17 UTC 发布时间：2025-08-06 10:13:17 UTC

#15 Large Language Model's Multi-Capability Alignment in Biomedical Domain #15 大型语言模型在生物医学领域的多能力对齐

Authors: [Wentao Wu](https://arxiv.org/search/?searchtype=author&query=Wentao Wu), [Linqing Chen](https://arxiv.org/search/?searchtype=author&query=Linqing Chen), [Hanmeng Zhong](https://arxiv.org/search/?searchtype=author&query=Hanmeng Zhong), [Weilei Wang](https://arxiv.org/search/?searchtype=author&query=Weilei Wang) 作者：吴文涛，陈林清，钟涵梦，王伟磊

BalancedBio is a theoretically grounded framework for parameter-efficient biomedical reasoning, addressing multi-capability integration in domain-specific AI alignment. It establishes the Biomedical Multi-Capability Convergence Theorem, proving orthogonal gradient spaces are essential to prevent capability interference for safe deployment. Key innovations include: (1) Medical Knowledge Grounded Synthetic Generation (MKGSG), extending Source2Synth with clinical workflow constraints and medical ontology validation for factual accuracy and safety; and (2) Capability Aware Group Relative Policy Optimization, deriving optimal hybrid reward weighting to maintain orthogonality in RL, using a reward model with rule-based and model-based scores adapted to biomedical tasks. Mathematical analysis proves Pareto-optimal convergence, preserving performance across capabilities. It achieves state-of-the-art results in its parameter class: domain expertise (80.95% BIOMED-MMLU, +15.32% over baseline), reasoning (61.94%, +7.75%), instruction following (67.95%, +6.44%), and integration (86.7%, +18.5%). Theoretical safety guarantees include bounds on capability preservation and clinical accuracy. Real-world deployment yields 78% cost reduction, 23% improved diagnostic accuracy, and 89% clinician acceptance. This work provides a principled methodology for biomedical AI alignment, enabling efficient reasoning with essential safety and reliability, with the 0.5B model version to be released. 
BalancedBio 是一个理论基础扎实的参数高效生物医学推理框架，解决了领域特定 AI 对齐中的多能力整合问题。它建立了生物医学多能力收敛定理，证明正交梯度空间对于防止能力干扰以实现安全部署至关重要。主要创新包括：（1）基于医学知识的合成生成（MKGSG），在 Source2Synth 基础上扩展，加入临床工作流程约束和医学本体验证，以确保事实准确性和安全性；（2）能力感知组相对策略优化，推导出保持强化学习中正交性的最优混合奖励权重，使用结合规则和模型评分的奖励模型，适应生物医学任务。数学分析证明了帕累托最优收敛，保持各能力的性能。该方法在其参数类别中取得了最先进的结果：领域专业知识（80.95% BIOMED-MMLU，较基线提升 15.32%）、推理能力（61.94%，提升 7.75%）、指令遵循（67.95%，提升 6.44%）和整合能力（86.7%，提升 18.5%）。理论安全保证包括能力保持和临床准确性的界限。实际部署实现了 78%的成本降低，诊断准确率提高了 23%，临床医生接受度达到 89%。这项工作提供了一种生物医学人工智能对齐的原则性方法，能够在保证关键安全性和可靠性的前提下实现高效推理，0.5B 模型版本即将发布。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-06 10:06:11 UTC 发布时间：2025-08-06 10:06:11 UTC

#16 Circuit-Aware SAT Solving: Guiding CDCL via Conditional Probabilities #16 电路感知 SAT 求解：通过条件概率引导 CDCL

Authors: [Jiaying Zhu](https://arxiv.org/search/?searchtype=author&query=Jiaying Zhu), [Ziyang Zheng](https://arxiv.org/search/?searchtype=author&query=Ziyang Zheng), [Zhengyuan Shi](https://arxiv.org/search/?searchtype=author&query=Zhengyuan Shi), [Yalun Cai](https://arxiv.org/search/?searchtype=author&query=Yalun Cai), [Qiang Xu](https://arxiv.org/search/?searchtype=author&query=Qiang Xu) 作者：朱佳颖，郑子阳，施正远，蔡雅伦，徐强

Circuit Satisfiability (CSAT) plays a pivotal role in Electronic Design Automation. The standard workflow for solving CSAT problems converts circuits into Conjunctive Normal Form (CNF) and employs generic SAT solvers powered by Conflict-Driven Clause Learning (CDCL). However, this process inherently discards rich structural and functional information, leading to suboptimal solver performance. To address this limitation, we introduce CASCAD, a novel circuit-aware SAT solving framework that directly leverages circuit-level conditional probabilities computed via Graph Neural Networks (GNNs). By explicitly modeling gate-level conditional probabilities, CASCAD dynamically guides two critical CDCL heuristics, variable phase selection and clause management, to significantly enhance solver efficiency. Extensive evaluations on challenging real-world Logical Equivalence Checking (LEC) benchmarks demonstrate that CASCAD reduces solving times by up to 10x compared to state-of-the-art CNF-based approaches, achieving an additional 23.5% runtime reduction via our probability-guided clause filtering strategy. Our results underscore the importance of preserving circuit-level structural insights within SAT solvers, providing a robust foundation for future improvements in SAT-solving efficiency and EDA tool design. 电路可满足性问题（CSAT）在电子设计自动化中起着关键作用。解决 CSAT 问题的标准流程是将电路转换为合取范式（CNF），并采用基于冲突驱动子句学习（CDCL）的通用 SAT 求解器。然而，这一过程本质上丢失了丰富的结构和功能信息，导致求解器性能不佳。为了解决这一限制，我们提出了 CASCAD，一种新颖的电路感知 SAT 求解框架，直接利用通过图神经网络（GNN）计算的电路级条件概率。通过显式建模门级条件概率，CASCAD 动态引导两个关键的 CDCL 启发式策略，即变量相位选择和子句管理，从而显著提升求解器效率。在具有挑战性的真实逻辑等价性检查（LEC）基准测试中，CASCAD 相比最先进的基于 CNF 的方法将求解时间缩短了最多 10 倍，并通过我们基于概率的子句过滤策略额外实现了 23.5%的运行时间减少。我们的结果强调了在 SAT 求解器中保留电路级结构洞察的重要性，为未来提升 SAT 求解效率和 EDA 工具设计提供了坚实的基础。
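
The "standard workflow" the abstract refers to, converting a circuit to CNF, is usually done gate by gate with the Tseitin encoding. The sketch below encodes a single AND gate and brute-force checks that the clauses accept exactly the assignments where the output equals the AND of the inputs. It only illustrates the baseline conversion, not CASCAD itself; the helper names and the variable numbering are assumptions.

```python
from itertools import product


def tseitin_and(a, b, z):
    """Clauses (DIMACS-style signed ints) asserting z <-> (a AND b)."""
    return [[-z, a], [-z, b], [z, -a, -b]]


def satisfies(clauses, assignment):
    """True if every clause has at least one literal made true by `assignment`."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )


# Variables 1 and 2 are the gate inputs, variable 3 is the gate output.
clauses = tseitin_and(1, 2, 3)
models = [
    bits
    for bits in product([False, True], repeat=3)
    if satisfies(clauses, dict(zip((1, 2, 3), bits)))
]
print(models)
```

After encoding, the solver sees only a flat list of clauses; nothing marks variable 3 as a gate output driven by variables 1 and 2, which is the structural information the abstract says is lost.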

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-06 09:16:47 UTC 发布时间：2025-08-06 09:16:47 UTC
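
The CASCAD abstract above refers to the standard step of converting a circuit into CNF. As a rough illustration of that step only, here is a minimal sketch of the textbook Tseitin encoding for a single AND gate. The function names `and_gate_cnf` and `satisfies` and the integer variable numbering are assumptions made for this example; the abstract does not describe CASCAD's own encoding or its GNN-based heuristics, and none of that is reproduced here.

```python
from itertools import product


def and_gate_cnf(out, a, b):
    """Tseitin clauses for out <-> (a AND b).

    Variables are positive integers; a negative integer is the negated literal
    (the usual DIMACS-style convention). Hypothetical helper for illustration.
    """
    return [
        [-out, a],       # out implies a
        [-out, b],       # out implies b
        [out, -a, -b],   # (a and b) implies out
    ]


def satisfies(clauses, assignment):
    """Return True if every clause has at least one literal true under assignment."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )


if __name__ == "__main__":
    clauses = and_gate_cnf(out=3, a=1, b=2)
    # Enumerate all assignments and list the ones the CNF accepts.
    for va, vb, vo in product([False, True], repeat=3):
        if satisfies(clauses, {1: va, 2: vb, 3: vo}):
            print(f"a={va} b={vb} out={vo}")
```

The three clauses keep the gate's input/output relation but no longer say that variable 3 is the output of an AND gate, which is the kind of structural information the abstract says is lost in a plain CNF workflow.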

#17 Generic-to-Specific Reasoning and Learning for Scalable Ad Hoc Teamwork #17 通用到特定的推理与学习，用于可扩展的临时团队合作

Authors: [Hasra Dodampegama](https://arxiv.org/search/?searchtype=author&query=Hasra Dodampegama), [Mohan Sridharan](https://arxiv.org/search/?searchtype=author&query=Mohan Sridharan) 作者：Hasra Dodampegama，Mohan Sridharan

AI agents deployed in assistive roles often have to collaborate with other agents (humans, AI systems) without prior coordination. Methods considered state of the art for such ad hoc teamwork often pursue a data-driven approach that needs a large labeled dataset of prior observations, lacks transparency, and makes it difficult to rapidly revise existing knowledge in response to changes. As the number of agents increases, the complexity of decision-making makes it difficult to collaborate effectively. This paper advocates leveraging the complementary strengths of knowledge-based and data-driven methods for reasoning and learning for ad hoc teamwork. For any given goal, our architecture enables each ad hoc agent to determine its actions through non-monotonic logical reasoning with: (a) prior commonsense domain-specific knowledge; (b) models learned and revised rapidly to predict the behavior of other agents; and (c) anticipated abstract future goals based on generic knowledge of similar situations in an existing foundation model. We experimentally evaluate our architecture’s capabilities in VirtualHome, a realistic physics-based 3D simulation environment. 部署在辅助角色中的人工智能代理通常需要与其他代理（人类、人工智能系统）在没有事先协调的情况下协作。被认为是此类临时团队合作的最先进方法通常采用数据驱动的方法，这种方法需要大量带标签的先前观察数据，缺乏透明度，并且难以快速修订现有知识以应对变化。随着代理数量的增加，决策的复杂性使得有效协作变得困难。本文主张利用基于知识的方法和数据驱动方法在推理和学习方面的互补优势，以实现临时团队合作。对于任何给定的目标，我们的架构使每个临时代理能够通过非单调逻辑推理确定其行动，具体包括：（a）先验的常识性领域特定知识；（b）快速学习和修订的模型，用于预测其他代理的行为；以及（c）基于现有基础模型中类似情境的通用知识，预期的抽象未来目标。我们在 VirtualHome——一个基于物理的逼真 3D 仿真环境中对我们的架构能力进行了实验评估。

Subjects: Artificial Intelligence, Logic in Computer Science, Multiagent Systems 主题：人工智能，计算机科学中的逻辑，多智能体系统

Publish: 2025-08-06 07:44:38 UTC 发布时间：2025-08-06 07:44:38 UTC

#18 AgREE: Agentic Reasoning for Knowledge Graph Completion on Emerging Entities #18 AgREE：面向新兴实体的知识图谱补全的代理推理

Subjects: Artificial Intelligence, Computation and Language 主题：人工智能，计算与语言

Publish: 2025-08-06 06:34:22 UTC 发布时间：2025-08-06 06:34:22 UTC

#19 A Compositional Framework for On-the-Fly LTLf Synthesis #19 一种用于即时 LTLf 合成的组合框架

Authors: [Yongkang Li](https://arxiv.org/search/?searchtype=author&query=Yongkang Li), [Shengping Xiao](https://arxiv.org/search/?searchtype=author&query=Shengping Xiao), [Shufang Zhu](https://arxiv.org/search/?searchtype=author&query=Shufang Zhu), [Jianwen Li](https://arxiv.org/search/?searchtype=author&query=Jianwen Li), [Geguang Pu](https://arxiv.org/search/?searchtype=author&query=Geguang Pu) 作者：李永康，肖胜平，朱淑芳，李建文，蒲戈光

Reactive synthesis from Linear Temporal Logic over finite traces (LTLf) can be reduced to a two-player game over a Deterministic Finite Automaton (DFA) of the LTLf specification. The primary challenge here is DFA construction, which is 2EXPTIME-complete in the worst case. Existing techniques either construct the DFA compositionally before solving the game, leveraging automata minimization to mitigate state-space explosion, or build the DFA incrementally during game solving to avoid full DFA construction. However, neither is dominant. In this paper, we introduce a compositional on-the-fly synthesis framework that integrates the strengths of both approaches, focusing on large conjunctions of smaller LTLf formulas common in practice. This framework applies composition during game solving instead of automata (game arena) construction. While composing all intermediate results may be necessary in the worst case, pruning these results simplifies subsequent compositions and enables early detection of unrealizability. Specifically, the framework allows two composition variants: pruning before composition to take full advantage of minimization or pruning during composition to guide on-the-fly synthesis. Compared to state-of-the-art synthesis solvers, our framework is able to solve a notable number of instances that other solvers cannot handle. A detailed analysis shows that both composition variants have unique merits. 基于有限轨迹的线性时序逻辑（LTLf）的反应式合成可以归约为在 LTLf 规范的确定性有限自动机（DFA）上的两人游戏。这里的主要挑战是 DFA 的构造，最坏情况下该问题是 2EXPTIME 完全的。现有技术要么在求解游戏之前组合式地构造 DFA，利用自动机最小化来缓解状态空间爆炸，要么在游戏求解过程中增量构造 DFA 以避免完全构造 DFA。然而，两者均无绝对优势。本文提出了一种组合式的即时合成框架，融合了两种方法的优点，重点针对实际中常见的大量较小 LTLf 公式的合取。该框架在游戏求解过程中应用组合，而非在自动机（游戏场）构造时进行。虽然在最坏情况下可能需要组合所有中间结果，但通过剪枝这些结果简化了后续组合，并使得早期检测不可实现性成为可能。具体来说，该框架允许两种组合变体：在组合之前进行剪枝以充分利用最小化，或在组合过程中进行剪枝以指导即时合成。与最先进的合成求解器相比，我们的框架能够解决许多其他求解器无法处理的实例。详细分析表明，这两种组合变体各有独特的优点。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-06 06:31:49 UTC 发布时间：2025-08-06 06:31:49 UTC

#20 Towards Transparent AI Grading: Semantic Entropy as a Signal for Human-AI Disagreement #20 迈向透明的 AI 评分：语义熵作为人机分歧信号

Authors: [Karrtik Iyer](https://arxiv.org/search/?searchtype=author&query=Karrtik Iyer), [Manikandan Ravikiran](https://arxiv.org/search/?searchtype=author&query=Manikandan Ravikiran), [Prasanna Pendse](https://arxiv.org/search/?searchtype=author&query=Prasanna Pendse), [Shayan Mohanty](https://arxiv.org/search/?searchtype=author&query=Shayan Mohanty) 作者：Karrtik Iyer, Manikandan Ravikiran, Prasanna Pendse, Shayan Mohanty

Automated grading systems can efficiently score short-answer responses, yet they often fail to indicate when a grading decision is uncertain or potentially contentious. We introduce semantic entropy, a measure of variability across multiple GPT-4-generated explanations for the same student response, as a proxy for human grader disagreement. By clustering rationales via entailment-based similarity and computing entropy over these clusters, we quantify the diversity of justifications without relying on final output scores. We address three research questions: (1) Does semantic entropy align with human grader disagreement? (2) Does it generalize across academic subjects? (3) Is it sensitive to structural task features such as source dependency? Experiments on the ASAP-SAS dataset show that semantic entropy correlates with rater disagreement, varies meaningfully across subjects, and increases in tasks requiring interpretive reasoning. Our findings position semantic entropy as an interpretable uncertainty signal that supports more transparent and trustworthy AI-assisted grading workflows. 自动评分系统可以高效地对简答题进行评分，但它们通常无法指出评分决策何时存在不确定性或潜在争议。我们引入了语义熵，这是一种衡量针对同一学生回答由多个 GPT-4 生成的解释之间变异性的指标，作为人工评分者分歧的代理。通过基于蕴涵相似性对理由进行聚类，并计算这些聚类的熵值，我们在不依赖最终输出分数的情况下量化了理由的多样性。我们探讨了三个研究问题：（1）语义熵是否与人工评分者的分歧一致？（2）它是否能跨学科泛化？（3）它是否对结构性任务特征（如来源依赖性）敏感？在 ASAP-SAS 数据集上的实验表明，语义熵与评分者分歧相关，在不同学科间有显著差异，并且在需要解释性推理的任务中增加。我们的研究结果将语义熵定位为一种可解释的不确定性信号，有助于支持更透明、更可信的 AI 辅助评分流程。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-06 06:02:14 UTC 发布时间：2025-08-06 06:02:14 UTC
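
The semantic entropy abstract (#20) above describes clustering rationales and computing entropy over the clusters. A minimal sketch of the entropy step follows, assuming the cluster labels have already been produced by some entailment-based clustering that is not implemented here. The function name `semantic_entropy`, the use of base-2 logarithms, and the example labels are assumptions for illustration; the abstract does not state the paper's exact formulation.

```python
import math
from collections import Counter


def semantic_entropy(cluster_labels):
    """Shannon entropy (in bits) of the empirical distribution over cluster labels.

    cluster_labels holds one label per sampled rationale, e.g. the cluster id
    assigned by an upstream (hypothetical) entailment-based clustering step.
    """
    if not cluster_labels:
        raise ValueError("need at least one rationale")
    total = len(cluster_labels)
    counts = Counter(cluster_labels)
    # Sum of -p * log2(p) over clusters; 0 when all rationales share one cluster.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # Hypothetical labels for explanations sampled for one student response.
    print(semantic_entropy(["A", "A", "A", "A"]))  # all rationales agree
    print(semantic_entropy(["A", "A", "B", "B"]))  # rationales split evenly
```

Under this sketch a higher value means the sampled justifications are spread over more clusters, which the abstract proposes as a proxy for human grader disagreement.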

Authors: [Jinfan Tang](https://arxiv.org/search/?searchtype=author&query=Jinfan Tang), [Kunming Wu](https://arxiv.org/search/?searchtype=author&query=Kunming Wu), [Ruifeng Gongxie](https://arxiv.org/search/?searchtype=author&query=Ruifeng Gongxie), [Yuya He](https://arxiv.org/search/?searchtype=author&query=Yuya He), [Yuankai Wu](https://arxiv.org/search/?searchtype=author&query=Yuankai Wu) 作者：唐金凡，吴昆明，龚谢瑞丰，何宇亚，吴元凯

Recent studies have extended the application of large language models (LLMs) to geographic problems, revealing surprising geospatial competence even without explicit spatial supervision. However, LLMs still face challenges in spatial consistency, multi-hop reasoning, and geographic bias. To address these issues, we propose GeoSR, a self-refining agentic reasoning framework that embeds core geographic principles – most notably Tobler’s First Law of Geography – into an iterative prediction loop. In GeoSR, the reasoning process is decomposed into three collaborating agents: (1) a variable-selection agent that selects relevant covariates from the same location; (2) a point-selection agent that chooses reference predictions at nearby locations generated by the LLM in previous rounds; and (3) a refine agent that coordinates the iterative refinement process by evaluating prediction quality and triggering further rounds when necessary. This agentic loop progressively improves prediction quality by leveraging both spatial dependencies and inter-variable relationships. We validate GeoSR on tasks ranging from physical-world property estimation to socioeconomic prediction. Experimental results show consistent improvements over standard prompting strategies, demonstrating that incorporating geostatistical priors and spatially structured reasoning into LLMs leads to more accurate and equitable geospatial predictions. The code of GeoSR is available at https://github.com/JinfanTang/GeoSR. 近期研究将大型语言模型（LLMs）的应用扩展到地理问题，揭示了即使在没有明确空间监督的情况下，LLMs 也表现出令人惊讶的地理空间能力。然而，LLMs 在空间一致性、多跳推理和地理偏差方面仍面临挑战。为了解决这些问题，我们提出了 GeoSR，一种自我优化的智能推理框架，将核心地理原则——尤其是托布勒第一地理定律——嵌入到迭代预测循环中。在 GeoSR 中，推理过程被分解为三个协作代理：（1）变量选择代理，从同一位置选择相关协变量；（2）点选择代理，选择由 LLM 在前几轮生成的附近位置的参考预测；（3）优化代理，通过评估预测质量并在必要时触发进一步迭代，协调迭代优化过程。该智能循环通过利用空间依赖性和变量间关系，逐步提升预测质量。我们在从物理世界属性估计到社会经济预测的任务中验证了 GeoSR。实验结果显示，相较于标准提示策略，方法表现出持续的改进，证明将地质统计先验和空间结构化推理融入 LLMs 能够带来更准确且更公平的地理空间预测。GeoSR 的代码可在 https://github.com/JinfanTang/GeoSR 获取。

Subjects: Artificial Intelligence, Other Statistics 主题：人工智能，其他统计

Publish: 2025-08-06 04:45:34 UTC 发布时间：2025-08-06 04:45:34 UTC

#22 KG-Augmented Executable CoT for Mathematical Coding #22 知识图增强的可执行链式思维用于数学编码

Authors: [Xingyu Chen](https://arxiv.org/search/?searchtype=author&query=Xingyu Chen), [Junxiu An](https://arxiv.org/search/?searchtype=author&query=Junxiu An), [Jun Guo](https://arxiv.org/search/?searchtype=author&query=Jun Guo), [Li Wang](https://arxiv.org/search/?searchtype=author&query=Li Wang), [Jingcai Guo](https://arxiv.org/search/?searchtype=author&query=Jingcai Guo) 作者：陈星宇，安俊秀，郭军，王力，郭景才

In recent years, large language models (LLMs) have excelled in natural language processing tasks but face significant challenges in complex reasoning tasks such as mathematical reasoning and code generation. To address these limitations, we propose KG-Augmented Executable Chain-of-Thought (KGA-ECoT), a novel framework that enhances code generation through knowledge graphs and improves mathematical reasoning via executable code. KGA-ECoT decomposes problems into a Structured Task Graph, leverages efficient GraphRAG for precise knowledge retrieval from mathematical libraries, and generates verifiable code to ensure computational accuracy. Evaluations on multiple mathematical reasoning benchmarks demonstrate that KGA-ECoT significantly outperforms existing prompting methods, achieving absolute accuracy improvements ranging from several to over ten percentage points. Further analysis confirms the critical roles of GraphRAG in enhancing code quality and external code execution in ensuring precision. These findings collectively establish KGA-ECoT as a robust and highly generalizable framework for complex mathematical reasoning tasks. 近年来，大型语言模型（LLMs）在自然语言处理任务中表现出色，但在数学推理和代码生成等复杂推理任务中面临重大挑战。为了解决这些限制，我们提出了知识图增强可执行思维链（KGA-ECoT），这是一种通过知识图提升代码生成并通过可执行代码改进数学推理的新型框架。KGA-ECoT 将问题分解为结构化任务图，利用高效的 GraphRAG 从数学库中精确检索知识，并生成可验证代码以确保计算准确性。在多个数学推理基准测试中的评估表明，KGA-ECoT 显著优于现有的提示方法，绝对准确率提升从几个百分点到十几个百分点不等。进一步分析确认了 GraphRAG 在提升代码质量和外部代码执行确保精度方面的关键作用。这些发现共同确立了 KGA-ECoT 作为一个强大且高度通用的复杂数学推理任务框架的地位。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-06 04:07:35 UTC 发布时间：2025-08-06 04:07:35 UTC

#23 Personalized Knowledge Transfer Through Generative AI: Contextualizing Learning to Individual Career Goals #23 通过生成式人工智能实现个性化知识转移：将学习情境化以匹配个人职业目标

Authors: [Ronja Mehlan](https://arxiv.org/search/?searchtype=author&query=Ronja Mehlan), [Claudia Hess](https://arxiv.org/search/?searchtype=author&query=Claudia Hess), [Quintus Stierstorfer](https://arxiv.org/search/?searchtype=author&query=Quintus Stierstorfer), [Kristina Schaaff](https://arxiv.org/search/?searchtype=author&query=Kristina Schaaff) 作者：Ronja Mehlan，Claudia Hess，Quintus Stierstorfer，Kristina Schaaff

As artificial intelligence becomes increasingly integrated into digital learning environments, the personalization of learning content to reflect learners’ individual career goals offers promising potential to enhance engagement and long-term motivation. In our study, we investigate how career goal-based content adaptation in learning systems based on generative AI (GenAI) influences learner engagement, satisfaction, and study efficiency. The mixed-methods experiment involved more than 4,000 learners, with one group receiving learning scenarios tailored to their career goals and a control group. Quantitative results show increased session duration, higher satisfaction ratings, and a modest reduction in study duration compared to standard content. Qualitative analysis highlights that learners found the personalized material motivating and practical, enabling deep cognitive engagement and strong identification with the content. These findings underscore the value of aligning educational content with learners’ career goals and suggest that scalable AI personalization can bridge academic knowledge and workplace applicability. 随着人工智能日益融入数字学习环境，个性化学习内容以反映学习者的个人职业目标，展现出提升参与度和长期动力的良好潜力。在我们的研究中，我们探讨了基于生成式人工智能（GenAI）的学习系统中，基于职业目标的内容适应如何影响学习者的参与度、满意度和学习效率。该混合方法实验涉及超过 4000 名学习者，一组接受针对其职业目标定制的学习场景，另一组为对照组。定量结果显示，与标准内容相比，个性化内容使学习时长增加，满意度评分更高，学习时间略有缩短。定性分析表明，学习者认为个性化材料具有激励性和实用性，能够促进深度认知参与并增强对内容的认同感。这些发现强调了将教育内容与学习者职业目标对齐的价值，并表明可扩展的 AI 个性化能够架起学术知识与职场应用之间的桥梁。

Subjects: Artificial Intelligence, Computers and Society 主题：人工智能，计算机与社会

Publish: 2025-08-06 04:03:56 UTC 发布时间：2025-08-06 04:03:56 UTC

#24 SEA: Self-Evolution Agent with Step-wise Reward for Computer Use #24 SEA：带有逐步奖励的自我进化代理用于计算机使用

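Computer use agents are an emerging area of artificial intelligence that aims to operate computers to complete user tasks, and they have attracted wide attention from both industry and academia. However, the performance of current agents remains far from practical use. In this paper, we propose the Self-Evolution Agent (SEA) for computer use, with innovations in data generation, reinforcement learning, and model enhancement. Specifically, we first propose an automatic pipeline to generate verifiable training trajectories. We then propose an efficient step-wise reinforcement learning method to reduce the substantial computational cost of long-horizon training. Finally, we propose an enhancement method that merges grounding and planning abilities into a single model without any extra training. Building on these innovations in data generation, training strategy, and enhancement, we obtain SEA with only 7B parameters, which outperforms models with the same number of parameters and performs comparably to larger ones. We will release the model weights and related code in the future. 计算机使用代理是人工智能中的一个新兴领域，旨在操作计算机以完成用户任务，吸引了工业界和学术界的广泛关注。然而，目前的代理性能距离实际应用仍有较大差距。本文提出了用于计算机使用的自我进化代理（Self-Evolution Agent，SEA），并在数据生成、强化学习和模型增强方面提出了创新方法。具体来说，我们首先提出了一个自动化流程来生成可验证的训练轨迹。随后，提出了高效的逐步强化学习方法，以缓解长时间训练所需的巨大计算资源。最后，提出了一种增强方法，将基础能力和规划能力合并到一个模型中，无需额外训练。基于我们提出的数据生成、训练策略和增强创新，获得了仅有 7B 参数的自我进化代理（SEA），其性能优于同参数规模的模型，并且与更大规模模型的性能相当。我们将在未来开源模型权重和相关代码。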

Publish: 2025-08-06 02:57:22 UTC 发布时间：2025-08-06 02:57:22 UTC

Authors: [Chao Hao](https://arxiv.org/search/?searchtype=author&query=Chao Hao), [Shuai Wang](https://arxiv.org/search/?searchtype=author&query=Shuai Wang), [Kaiwen Zhou](https://arxiv.org/search/?searchtype=author&query=Kaiwen Zhou) 作者：郝超，王帅，周凯文

Graphical user interface (GUI) agents have shown promise in automating mobile tasks but still struggle with input redundancy and decision ambiguity. In this paper, we present RecAgent, an uncertainty-aware agent that addresses these issues through adaptive perception. We distinguish two types of uncertainty in GUI navigation: (1) perceptual uncertainty, caused by input redundancy and noise from comprehensive screen information, and (2) decision uncertainty, arising from ambiguous tasks and complex reasoning. To reduce perceptual uncertainty, RecAgent employs a component recommendation mechanism that identifies and focuses on the most relevant UI elements. For decision uncertainty, it uses an interactive module to request user feedback in ambiguous situations, enabling intent-aware decisions. These components are integrated into a unified framework that proactively reduces input complexity and reacts to high-uncertainty cases via human-in-the-loop refinement. Additionally, we propose a dataset called ComplexAction to evaluate the success rate of GUI agents in executing specified single-step actions within complex scenarios. Extensive experiments validate the effectiveness of our approach. The dataset and code will be available at https://github.com/Fanye12/RecAgent. 图形用户界面（GUI）代理在自动化移动任务方面展现出潜力，但仍面临输入冗余和决策模糊的问题。本文提出了 RecAgent，一种通过自适应感知解决这些问题的不确定性感知代理。我们区分了 GUI 导航中的两种不确定性：（1）感知不确定性，由输入冗余和来自全面屏幕信息的噪声引起；（2）决策不确定性，源于任务模糊和复杂推理。为减少感知不确定性，RecAgent 采用组件推荐机制，识别并聚焦最相关的 UI 元素。针对决策不确定性，它使用交互模块在模糊情况下请求用户反馈，实现意图感知的决策。这些组件被整合到一个统一框架中，主动降低输入复杂度，并通过人机交互的方式应对高不确定性情况。此外，我们提出了一个名为 ComplexAction 的数据集，用于评估 GUI 代理在复杂场景中执行指定单步操作的成功率。大量实验验证了我们方法的有效性。数据集和代码将发布在 https://github.com/Fanye12/RecAgent。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-06 02:38:02 UTC 发布时间：2025-08-06 02:38:02 UTC

#26 Galaxy: A Cognition-Centered Framework for Proactive, Privacy-Preserving, and Self-Evolving LLM Agents #26 Galaxy：一个以认知为中心的主动、隐私保护和自我进化的 LLM 代理框架

Authors: [Chongyu Bao](https://arxiv.org/search/?searchtype=author&query=Chongyu Bao), [Ruimin Dai](https://arxiv.org/search/?searchtype=author&query=Ruimin Dai), [Yangbo Shen](https://arxiv.org/search/?searchtype=author&query=Yangbo Shen), [Runyang Jian](https://arxiv.org/search/?searchtype=author&query=Runyang Jian), [Jinghan Zhang](https://arxiv.org/search/?searchtype=author&query=Jinghan Zhang), [Xiaolan Liu](https://arxiv.org/search/?searchtype=author&query=Xiaolan Liu), [Kunpeng Liu](https://arxiv.org/search/?searchtype=author&query=Kunpeng Liu) 作者：鲍崇宇，戴睿敏，沈阳波，简润阳，张景涵，刘晓岚，刘昆鹏

Intelligent personal assistants (IPAs) such as Siri and Google Assistant are designed to enhance human capabilities and perform tasks on behalf of users. The emergence of LLM agents brings new opportunities for the development of IPAs. While responsive capabilities have been widely studied, proactive behaviors remain underexplored. Designing an IPA that is proactive, privacy-preserving, and capable of self-evolution remains a significant challenge. Designing such IPAs relies on the cognitive architecture of LLM agents. This work proposes Cognition Forest, a semantic structure designed to align cognitive modeling with system-level design. We unify cognitive architecture and system design into a self-reinforcing loop instead of treating them separately. Based on this principle, we present Galaxy, a framework that supports multidimensional interactions and personalized capability generation. Two cooperative agents are implemented based on Galaxy: KoRa, a cognition-enhanced generative agent that supports both responsive and proactive skills; and Kernel, a meta-cognition-based meta-agent that enables Galaxy’s self-evolution and privacy preservation. Experimental results show that Galaxy outperforms multiple state-of-the-art benchmarks. Ablation studies and real-world interaction cases validate the effectiveness of Galaxy. 智能个人助理（IPAs），如 Siri 和 Google Assistant，旨在增强人类能力并代表用户执行任务。LLM 代理的出现为 IPAs 的发展带来了新的机遇。尽管响应能力已被广泛研究，但主动行为仍未得到充分探索。设计一个主动的、保护隐私且具备自我进化能力的 IPA 仍然是一项重大挑战。此类 IPA 的设计依赖于 LLM 代理的认知架构。本文提出了 Cognition Forest，一种旨在将认知建模与系统级设计对齐的语义结构。我们将认知架构和系统设计统一为一个自我强化的循环，而非将其分开处理。基于这一原则，我们提出了 Galaxy 框架，支持多维交互和个性化能力生成。基于 Galaxy 实现了两个协作代理：KoRa，一种增强认知的生成代理，支持响应和主动技能；以及 Kernel，一种基于元认知的元代理，实现了 Galaxy 的自我进化和隐私保护。实验结果表明，Galaxy 优于多个最先进的基准。消融研究和真实世界的交互案例验证了 Galaxy 的有效性。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-06 00:46:38 UTC 发布时间：2025-08-06 00:46:38 UTC

#27 The Emotional Baby Is Truly Deadly: Does your Multimodal Large Reasoning Model Have Emotional Flattery towards Humans? #27 情感婴儿确实致命：你的多模态大型推理模型是否对人类有情感恭维？

Authors: [Yuan Xun](https://arxiv.org/search/?searchtype=author&query=Yuan Xun), [Xiaojun Jia](https://arxiv.org/search/?searchtype=author&query=Xiaojun Jia), [Xinwei Liu](https://arxiv.org/search/?searchtype=author&query=Xinwei Liu), [Hua Zhang](https://arxiv.org/search/?searchtype=author&query=Hua Zhang) 作者：袁迅，贾晓军，刘新伟，张华

We observe that MLRMs oriented toward human-centric service are highly susceptible to user emotional cues during the deep-thinking stage, often overriding safety protocols or built-in safety checks under high emotional intensity. Inspired by this key insight, we propose EmoAgent, an autonomous adversarial emotion-agent framework that orchestrates exaggerated affective prompts to hijack reasoning pathways. Even when visual risks are correctly identified, models can still produce harmful completions through emotional misalignment. We further identify persistent high-risk failure modes in transparent deep-thinking scenarios, such as MLRMs generating harmful reasoning masked behind seemingly safe responses. These failures expose misalignments between internal inference and surface-level behavior, eluding existing content-based safeguards. To quantify these risks, we introduce three metrics: (1) Risk-Reasoning Stealth Score (RRSS) for harmful reasoning beneath benign outputs; (2) Risk-Visual Neglect Rate (RVNR) for unsafe completions despite visual risk recognition; and (3) Refusal Attitude Inconsistency (RAIC) for evaluating refusal instability under prompt variants. Extensive experiments on advanced MLRMs demonstrate the effectiveness of EmoAgent and reveal deeper emotional cognitive misalignments in model safety behavior. 我们观察到，面向以人为中心服务的多模态大语言模型（MLRMs）在深度思考阶段极易受到用户情绪线索的影响，常常在高情绪强度下覆盖安全协议或内置安全检查。受这一关键洞察启发，我们提出了 EmoAgent，一种自主对抗情绪代理框架，通过夸张的情感提示劫持推理路径。即使视觉风险被正确识别，模型仍可能因情绪错位产生有害的输出。我们进一步识别了透明深度思考场景中持续存在的高风险失败模式，例如 MLRMs 生成隐藏在看似安全响应背后的有害推理。这些失败暴露了内部推理与表层行为之间的不一致，逃避了现有基于内容的安全防护。为量化这些风险，我们引入了三项指标：（1）风险推理隐匿评分（RRSS），用于衡量良性输出下的有害推理；（2）风险视觉忽视率（RVNR），用于衡量尽管识别了视觉风险但仍产生不安全输出的情况；（3）拒绝态度不一致性（RAIC），用于评估在提示变体下拒绝稳定性的波动。在先进的多模态大规模语言模型（MLRMs）上进行的大量实验表明了 EmoAgent 的有效性，并揭示了模型安全行为中更深层次的情感认知错位。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-06 00:39:28 UTC 发布时间：2025-08-06 00:39:28 UTC

#28 Can Large Language Models Adequately Perform Symbolic Reasoning Over Time Series? #28 大型语言模型能否充分执行时间序列上的符号推理？

Authors: [Zewen Liu](https://arxiv.org/search/?searchtype=author&query=Zewen Liu), [Juntong Ni](https://arxiv.org/search/?searchtype=author&query=Juntong Ni), [Xianfeng Tang](https://arxiv.org/search/?searchtype=author&query=Xianfeng Tang), [Max S. Y. Lau](https://arxiv.org/search/?searchtype=author&query=Max S. Y. Lau), [Wei Jin](https://arxiv.org/search/?searchtype=author&query=Wei Jin) 作者：刘泽文，倪俊彤，唐先峰，Max S. Y. Lau，金伟

Uncovering hidden symbolic laws from time series data, as an aspiration dating back to Kepler’s discovery of planetary motion, remains a core challenge in scientific discovery and artificial intelligence. While Large Language Models show promise in structured reasoning tasks, their ability to infer interpretable, context-aligned symbolic structures from time series data is still underexplored. To systematically evaluate this capability, we introduce SymbolBench, a comprehensive benchmark designed to assess symbolic reasoning over real-world time series across three tasks: multivariate symbolic regression, Boolean network inference, and causal discovery. Unlike prior efforts limited to simple algebraic equations, SymbolBench spans a diverse set of symbolic forms with varying complexity. We further propose a unified framework that integrates LLMs with genetic programming to form a closed-loop symbolic reasoning system, where LLMs act both as predictors and evaluators. Our empirical results reveal key strengths and limitations of current models, highlighting the importance of combining domain knowledge, context alignment, and reasoning structure to improve LLMs in automated scientific discovery. 从时间序列数据中发现隐藏的符号规律，作为自开普勒发现行星运动以来的一个愿景，仍然是科学发现和人工智能领域的核心挑战。尽管 LLMs 在结构化推理任务中展现出潜力，但它们从时间序列数据中推断可解释且符合上下文的符号结构的能力仍未被充分探索。为系统评估这一能力，我们引入了 SymbolBench，这是一个综合基准，旨在评估真实世界时间序列上的符号推理能力，涵盖三个任务：多变量符号回归、布尔网络推断和因果发现。与以往仅限于简单代数方程的工作不同，SymbolBench 涵盖了多种复杂度不同的符号形式。我们进一步提出了一个统一框架，将 LLMs 与遗传编程结合，形成一个闭环符号推理系统，其中 LLMs 既作为预测者也作为评估者。我们的实证结果揭示了当前模型的关键优势和局限，强调了结合领域知识、上下文对齐和推理结构以提升 LLMs 在自动化科学发现中的重要性。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 22:58:54 UTC 发布时间：2025-08-05 22:58:54 UTC

#29 MOTIF: Multi-strategy Optimization via Turn-based Interactive Framework #29 MOTIF：基于回合制交互框架的多策略优化

Authors: [Nguyen Viet Tuan Kiet](https://arxiv.org/search/?searchtype=author&query=Nguyen Viet Tuan Kiet), [Dao Van Tung](https://arxiv.org/search/?searchtype=author&query=Dao Van Tung), [Tran Cong Dao](https://arxiv.org/search/?searchtype=author&query=Tran Cong Dao), [Huynh Thi Thanh Binh](https://arxiv.org/search/?searchtype=author&query=Huynh Thi Thanh Binh) 作者：Nguyen Viet Tuan Kiet，Dao Van Tung，Tran Cong Dao，Huynh Thi Thanh Binh

Designing effective algorithmic components remains a fundamental obstacle in tackling NP-hard combinatorial optimization problems (COPs), where solvers often rely on carefully hand-crafted strategies. Despite recent advances in using large language models (LLMs) to synthesize high-quality components, most approaches restrict the search to a single element - commonly a heuristic scoring function - thus missing broader opportunities for innovation. In this paper, we introduce a broader formulation of solver design as a multi-strategy optimization problem, which seeks to jointly improve a set of interdependent components under a unified objective. To address this, we propose Multi-strategy Optimization via Turn-based Interactive Framework (MOTIF) - a novel framework based on Monte Carlo Tree Search that facilitates turn-based optimization between two LLM agents. At each turn, an agent improves one component by leveraging the history of both its own and its opponent’s prior updates, promoting both competitive pressure and emergent cooperation. This structured interaction broadens the search landscape and encourages the discovery of diverse, high-performing solutions. Experiments across multiple COP domains show that MOTIF consistently outperforms state-of-the-art methods, highlighting the promise of turn-based, multi-agent prompting for fully automated solver design. 设计有效的算法组件仍然是解决 NP 难组合优化问题（COPs）的根本障碍，求解器通常依赖精心手工设计的策略。尽管近年来利用 LLMs 合成高质量组件取得了进展，但大多数方法将搜索限制在单一元素——通常是启发式评分函数——从而错失了更广泛的创新机会。本文提出了求解器设计的更广泛表述，视其为一个多策略优化问题，旨在在统一目标下联合改进一组相互依赖的组件。为此，我们提出了基于蒙特卡洛树搜索的多策略优化交互框架（MOTIF），该框架促进两个 LLM 代理之间的轮流优化。在每个回合中，代理通过利用自身及对手先前更新的历史来改进一个组件，既促进竞争压力，又激发协同合作。这种结构化的交互拓宽了搜索空间，鼓励发现多样且高性能的解决方案。在多个 COP 领域的实验表明，MOTIF 始终优于最先进的方法，凸显了基于回合的多智能体提示在全自动求解器设计中的潜力。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 21:45:36 UTC 发布时间：2025-08-05 21:45:36 UTC

#30 Evo-MARL: Co-Evolutionary Multi-Agent Reinforcement Learning for Internalized Safety #30 Evo-MARL：内化安全的协同进化多智能体强化学习

Multi-agent systems (MAS) built on multimodal large language models exhibit strong collaboration and performance. However, their growing openness and interaction complexity pose serious risks, notably jailbreak and adversarial attacks. Existing defenses typically rely on external guard modules, such as dedicated safety agents, to handle unsafe behaviors. Unfortunately, this paradigm faces two challenges: (1) standalone agents offer limited protection, and (2) their independence leads to single-point failure: if compromised, system-wide safety collapses. Naively increasing the number of guard agents further raises cost and complexity. To address these challenges, we propose Evo-MARL, a novel multi-agent reinforcement learning (MARL) framework that enables all task agents to jointly acquire defensive capabilities. Rather than relying on external safety modules, Evo-MARL trains each agent to simultaneously perform its primary function and resist adversarial threats, ensuring robustness without increasing system overhead or single-node failure. Furthermore, Evo-MARL integrates evolutionary search with parameter-sharing reinforcement learning to co-evolve attackers and defenders. This adversarial training paradigm internalizes safety mechanisms and continually enhances MAS performance under co-evolving threats. Experiments show that Evo-MARL reduces attack success rates by up to 22% while boosting accuracy by up to 5% on reasoning tasks, demonstrating that safety and utility can be jointly improved.
基于多模态大型语言模型构建的多智能体系统（MAS）展现出强大的协作能力和性能。然而，其日益开放性和交互复杂性带来了严重风险，尤其是越狱攻击和对抗性攻击。现有防御通常依赖外部防护模块，如专门的安全代理，来处理不安全行为。不幸的是，这种模式面临两个挑战：（1）独立代理的防护能力有限；（2）其独立性导致单点故障——一旦被攻破，整个系统的安全性将崩溃。简单地增加防护代理数量会进一步提高成本和复杂性。为应对这些挑战，我们提出了 Evo-MARL，一种新颖的多智能体强化学习（MARL）框架，使所有任务代理能够共同获得防御能力。Evo-MARL 不依赖外部安全模块，而是训练每个代理同时执行其主要功能并抵御对抗威胁，确保系统的鲁棒性且不增加系统开销或单节点故障风险。此外，Evo-MARL 结合了进化搜索与参数共享强化学习，实现攻击者与防御者的协同进化。这种对抗训练范式内化了安全机制，并在共同演化的威胁下持续提升多智能体系统（MAS）的性能。实验表明，Evo-MARL 在推理任务中将攻击成功率降低了最多 22%，同时将准确率提升了最多 5%，证明了安全性和效用可以共同提升。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 19:26:55 UTC 发布时间：2025-08-05 19:26:55 UTC

#31 MI9 – Agent Intelligence Protocol: Runtime Governance for Agentic AI Systems #31 MI9 – 代理智能协议：面向代理式人工智能系统的运行时治理

Authors: [Charles L. Wang](https://arxiv.org/search/?searchtype=author&query=Charles L. Wang), [Trisha Singhal](https://arxiv.org/search/?searchtype=author&query=Trisha Singhal), [Ameya Kelkar](https://arxiv.org/search/?searchtype=author&query=Ameya Kelkar), [Jason Tuo](https://arxiv.org/search/?searchtype=author&query=Jason Tuo) 作者：Charles L. Wang, Trisha Singhal, Ameya Kelkar, Jason Tuo

Agentic AI systems capable of reasoning, planning, and executing actions present fundamentally distinct governance challenges compared to traditional AI models. Unlike conventional AI, these systems exhibit emergent and unexpected behaviors during runtime, introducing novel agent-related risks that cannot be fully anticipated through pre-deployment governance alone. To address this critical gap, we introduce MI9, the first fully integrated runtime governance framework designed specifically for safety and alignment of agentic AI systems. MI9 introduces real-time controls through six integrated components: agency-risk index, agent-semantic telemetry capture, continuous authorization monitoring, Finite-State-Machine (FSM)-based conformance engines, goal-conditioned drift detection, and graduated containment strategies. Operating transparently across heterogeneous agent architectures, MI9 enables the systematic, safe, and responsible deployment of agentic systems in production environments where conventional governance approaches fall short, providing the foundational infrastructure for safe agentic AI deployment at scale. Detailed analysis through a diverse set of scenarios demonstrates MI9’s systematic coverage of governance challenges that existing approaches fail to address, establishing the technical foundation for comprehensive agentic AI oversight. 具备推理、规划和执行动作能力的自主智能系统，与传统的人工智能模型相比，带来了根本不同的治理挑战。与传统人工智能不同，这些系统在运行时表现出新兴且意外的行为，带来了无法仅通过部署前治理完全预见的新型代理相关风险。为了解决这一关键缺口，我们提出了 MI9，这是首个专门为自主智能系统的安全性和一致性设计的全方位集成运行时治理框架。MI9 通过六个集成组件引入实时控制：代理风险指数、代理语义遥测捕获、持续授权监控、基于有限状态机（FSM）的合规引擎、目标条件漂移检测以及分级遏制策略。MI9 透明地运行于异构代理架构之上，使得在传统治理方法不足以应对的生产环境中，能够系统化、安全且负责任地部署自主系统，提供了大规模安全部署自主智能的基础设施。通过多样化场景的详细分析，证明了 MI9 系统性地覆盖了现有方法无法解决的治理挑战，为全面的自主智能体监督奠定了技术基础。

Subjects: Artificial Intelligence, Emerging Technologies, Multiagent Systems 主题：人工智能，前沿技术，多智能体系统

Publish: 2025-08-05 19:15:09 UTC 发布时间：2025-08-05 19:15:09 UTC

#32 Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis #32 跳跃、跳过与过度思考：诊断推理模型在多跳分析中失误的原因

The emergence of reasoning models and their integration into practical AI chat bots has led to breakthroughs in solving advanced math, deep search, and extractive question answering problems that require a complex and multi-step thought process. Yet, a complete understanding of why these models hallucinate more than general purpose language models is missing. In this investigative study, we systematically explore reasoning failures of contemporary language models on multi-hop question answering tasks. We introduce a novel, nuanced error categorization framework that examines failures across three critical dimensions: the diversity and uniqueness of source documents involved (“hops”), completeness in capturing relevant information (“coverage”), and cognitive inefficiency (“overthinking”). Through rigorous human annotation, supported by complementary automated metrics, our exploration uncovers intricate error patterns often hidden by accuracy-centric evaluations. This investigative approach provides deeper insights into the cognitive limitations of current models and offers actionable guidance toward enhancing reasoning fidelity, transparency, and robustness in future language modeling efforts. 推理模型的出现及其在实际 AI 聊天机器人中的应用，推动了解决需要复杂多步骤思考过程的高级数学、深度搜索和抽取式问答问题的突破。然而，关于为何这些模型比通用语言模型更容易产生幻觉的完整理解仍然缺失。在这项调查研究中，我们系统地探讨了当代语言模型在多跳问答任务中的推理失败。我们引入了一种新颖且细致的错误分类框架，从三个关键维度审视失败：涉及的源文档的多样性和独特性（“跳跃”）、捕捉相关信息的完整性（“覆盖”）以及认知效率低下（“过度思考”）。通过严格的人类标注，辅以互补的自动化指标，我们的探索揭示了常被以准确率为中心的评估所掩盖的复杂错误模式。这种调查方法深入揭示了当前模型的认知局限性，并为未来语言建模工作中提升推理的准确性、透明度和鲁棒性提供了可操作的指导。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-06 17:58:36 UTC 发布时间：2025-08-06 17:58:36 UTC

#33 From MAS to MARS: Coordination Failures and Reasoning Trade-offs in Hierarchical Multi-Agent Robotic Systems within a Healthcare Scenario #33 从 MAS 到 MARS：医疗场景中分层多智能体机器人系统的协调失败与推理权衡

Authors: [Yuanchen Bai](https://arxiv.org/search/?searchtype=author&query=Yuanchen Bai), [Zijian Ding](https://arxiv.org/search/?searchtype=author&query=Zijian Ding), [Shaoyue Wen](https://arxiv.org/search/?searchtype=author&query=Shaoyue Wen), [Xiang Chang](https://arxiv.org/search/?searchtype=author&query=Xiang Chang), [Angelique Taylor](https://arxiv.org/search/?searchtype=author&query=Angelique Taylor) 作者：白元辰，丁子健，温少玥，常翔，Angelique Taylor

Multi-agent robotic systems (MARS) build upon multi-agent systems by integrating physical and task-related constraints, increasing the complexity of action execution and agent coordination. However, despite the availability of advanced multi-agent frameworks, their real-world deployment on robots remains limited, hindering the advancement of MARS research in practice. To bridge this gap, we conducted two studies to investigate performance trade-offs of hierarchical multi-agent frameworks in a simulated real-world multi-robot healthcare scenario. In Study 1, using CrewAI, we iteratively refine the system’s knowledge base, to systematically identify and categorize coordination failures (e.g., tool access violations, lack of timely handling of failure reports) not resolvable by providing contextual knowledge alone. In Study 2, using AutoGen, we evaluate a redesigned bidirectional communication structure and further measure the trade-offs between reasoning and non-reasoning models operating within the same robotic team setting. Drawing from our empirical findings, we emphasize the tension between autonomy and stability and the importance of edge-case testing to improve system reliability and safety for future real-world deployment. Supplementary materials, including codes, task agent setup, trace outputs, and annotated examples of coordination failures and reasoning behaviors, are available at: https://byc-sophie.github.io/mas-to-mars/. 多智能体机器人系统（MARS）基于多智能体系统，整合了物理和任务相关的约束，增加了动作执行和智能体协调的复杂性。然而，尽管已有先进的多智能体框架，但它们在机器人上的实际部署仍然有限，阻碍了 MARS 研究在实践中的进展。为弥合这一差距，我们进行了两项研究，探讨分层多智能体框架在模拟现实多机器人医疗场景中的性能权衡。在研究一中，使用 CrewAI，我们迭代地完善系统的知识库，系统地识别和分类协调失败（例如工具访问违规、未能及时处理故障报告）——这些问题仅通过提供上下文知识无法解决。在研究二中，使用 AutoGen，我们评估了重新设计的双向通信结构，并进一步测量了在同一机器人团队环境中，推理模型与非推理模型之间的权衡。基于我们的实证研究结果，我们强调自主性与稳定性之间的矛盾，以及边缘案例测试对于提升系统可靠性和安全性以实现未来实际部署的重要性。补充材料包括代码、任务代理设置、跟踪输出以及协调失败和推理行为的注释示例，均可在以下网址获取：https://byc-sophie.github.io/mas-to-mars/。

Subjects: Robotics, Artificial Intelligence, Multiagent Systems 主题：机器人技术，人工智能，多智能体系统

Publish: 2025-08-06 17:54:10 UTC 发布时间：2025-08-06 17:54:10 UTC

#34 Query Attribute Modeling: Improving search relevance with Semantic Search and Meta Data Filtering #34 查询属性建模：通过语义搜索和元数据过滤提升搜索相关性

This study introduces Query Attribute Modeling (QAM), a hybrid framework that enhances search precision and relevance by decomposing open text queries into structured metadata tags and semantic elements. QAM addresses traditional search limitations by automatically extracting metadata filters from free-form text queries, reducing noise and enabling focused retrieval of relevant items. Experimental evaluation using the Amazon Toys Reviews dataset (10,000 unique items with 40,000+ reviews and detailed product attributes) demonstrated QAM’s superior performance, achieving a mean average precision at 5 (mAP@5) of 52.99%. This represents a significant improvement over conventional methods, including BM25 keyword search, encoder-based semantic similarity search, cross-encoder re-ranking, and hybrid search combining BM25 and semantic results via Reciprocal Rank Fusion (RRF). The results establish QAM as a robust solution for enterprise search applications, particularly in e-commerce systems. 本研究引入了查询属性建模（QAM），这是一种混合框架，通过将开放文本查询分解为结构化的元数据标签和语义元素，提升搜索的精确度和相关性。QAM 通过自动从自由格式文本查询中提取元数据过滤器，解决了传统搜索的局限性，减少噪音，实现对相关项目的聚焦检索。使用亚马逊玩具评论数据集（包含 10,000 个独特商品，40,000 多条评论及详细产品属性）进行的实验评估表明，QAM 表现优异，达到 5 条结果的平均精确度均值（mAP@5）为 52.99%。这较传统方法有显著提升，包括 BM25 关键词搜索、基于编码器的语义相似度搜索、交叉编码器重排序以及通过互惠排名融合（RRF）结合 BM25 和语义结果的混合搜索。结果确立了 QAM 作为企业搜索应用，尤其是电子商务系统中的强大解决方案。
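
The abstract names Reciprocal Rank Fusion (RRF) as the hybrid baseline and reports mAP@5. The sketch below illustrates both in their commonly used textbook forms; it is not taken from the paper. The constant `k=60`, the function names, the toy item ids, and the AP@k normalization (dividing by `min(len(relevant), k)`) are all assumptions, and the paper may define them differently.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of item ids into a single ranking.

    Each item receives 1 / (k + rank) from every list it appears in
    (rank starts at 1). k=60 is a commonly used default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by item id for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


def average_precision_at_k(ranked, relevant, k=5):
    """AP@k for one query, normalized by min(len(relevant), k) (one convention)."""
    hits = 0
    precision_sum = 0.0
    for position, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / position
    denominator = min(len(relevant), k)
    return precision_sum / denominator if denominator else 0.0


# Hypothetical result lists from a keyword ranker and a semantic ranker.
bm25_results = ["a", "b", "c"]
semantic_results = ["c", "a", "d"]
fused = reciprocal_rank_fusion([bm25_results, semantic_results])
ap5 = average_precision_at_k(fused, relevant={"a", "b"}, k=5)
```

mAP@5 is then the mean of `average_precision_at_k` over all evaluation queries.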

Subjects: Information Retrieval, Artificial Intelligence, Computation and Language, Machine Learning 主题：信息检索，人工智能，计算与语言，机器学习

Publish: 2025-08-06 17:47:00 UTC 发布时间：2025-08-06 17:47:00 UTC

#35 GeRe: Towards Efficient Anti-Forgetting in Continual Learning of LLM via General Samples Replay #35 GeRe：通过通用样本重放实现 LLM 持续学习中的高效抗遗忘

Authors: [Yunan Zhang](https://arxiv.org/search/?searchtype=author&query=Yunan Zhang), [Shuoran Jiang](https://arxiv.org/search/?searchtype=author&query=Shuoran Jiang), [Mengchen Zhao](https://arxiv.org/search/?searchtype=author&query=Mengchen Zhao), [Yuefeng Li](https://arxiv.org/search/?searchtype=author&query=Yuefeng Li), [Yang Fan](https://arxiv.org/search/?searchtype=author&query=Yang Fan), [Xiangping Wu](https://arxiv.org/search/?searchtype=author&query=Xiangping Wu), [Qingcai Chen](https://arxiv.org/search/?searchtype=author&query=Qingcai Chen) 作者：张昀安，姜硕然，赵梦辰，李岳峰，范洋，吴翔平，陈庆才

The continual learning capability of large language models (LLMs) is crucial for advancing artificial general intelligence. However, continually fine-tuning LLMs across various domains often suffers from catastrophic forgetting, characterized by: 1) significant forgetting of their general capabilities, and 2) sharp performance declines in previously learned tasks. To simultaneously address both issues in a simple yet stable manner, we propose General Sample Replay (GeRe), a framework that uses ordinary pretraining texts for efficient anti-forgetting. Beyond revisiting the most prevalent replay-based practices under GeRe, we further leverage neural states to introduce an enhanced activation-state-constrained optimization method using a threshold-based margin (TM) loss, which maintains activation state consistency during replay learning. We are the first to validate that a small, fixed set of pre-collected general replay samples is sufficient to resolve both concerns: retaining general capabilities while promoting overall performance across sequential tasks. Indeed, the former can inherently facilitate the latter. Through controlled experiments, we systematically compare TM with different replay strategies under the GeRe framework, including vanilla label fitting, logit imitation via KL divergence, and feature imitation via L1/L2 losses. Results demonstrate that TM consistently improves performance and exhibits better robustness. Our work paves the way for efficient replay in LLMs in the future. Our code and data are available at https://github.com/Qznan/GeRe.
大型语言模型（LLMs）的持续学习能力对于推动通用人工智能的发展至关重要。然而，在不同领域对 LLMs 进行持续微调时，常常会遭遇灾难性遗忘，表现为：1）其通用能力显著下降，2）先前学习任务的性能急剧下降。为了以简单且稳定的方式同时解决这两个问题，我们提出了通用样本重放（General Sample Replay，GeRe）框架，该框架利用常规预训练文本实现高效的抗遗忘。除了在 GeRe 框架下回顾最常见的基于重放的实践外，我们进一步利用神经状态，引入了一种基于阈值边际（TM）损失的增强激活状态约束优化方法，以在重放学习过程中保持激活状态的一致性。我们首次验证了，一小组固定的预先收集的通用重放样本足以解决这两个问题——既保留通用能力，又促进顺序任务的整体性能。事实上，前者本质上可以促进后者。通过受控实验，我们在 GeRe 框架下系统地比较了 TM 与不同的重放策略，包括普通的标签拟合、通过 KL 散度进行的 logit 模仿以及通过 L1/L2 损失进行的特征模仿。结果表明，TM 始终提升了性能并表现出更好的鲁棒性。我们的工作为未来高效重放 LLMs 铺平了道路。我们的代码和数据可在 https://github.com/Qznan/GeRe 获取。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-06 17:42:22 UTC 发布时间：2025-08-06 17:42:22 UTC

#36 How are CS students using resources and AI tools for coding tasks? #36 计算机科学学生如何使用资源和人工智能工具完成编码任务？

Authors: [Natalia Echeverry](https://arxiv.org/search/?searchtype=author&query=Natalia Echeverry), [Arun Lekshmi Narayanan](https://arxiv.org/search/?searchtype=author&query=Arun Lekshmi Narayanan) 作者：Natalia Echeverry，Arun Lekshmi Narayanan

A survey of 26 CS students reveals that AI coding assistants are mainly used for writing code (second to online searches) while AI chatbots are the top resource for debugging. Participants with different coding experience prefer online help over direct human help from peers and instructors. 对 26 名计算机科学学生的调查显示，AI 编码助手主要用于编写代码（仅次于在线搜索），而 AI 聊天机器人则是调试的首选资源。具有不同编码经验的参与者更倾向于使用在线帮助，而非直接向同伴和导师寻求人工帮助。

Subjects: Human-Computer Interaction, Artificial Intelligence 主题：人机交互，人工智能

Publish: 2025-08-06 17:35:55 UTC 发布时间：2025-08-06 17:35:55 UTC

#37 Sculptor: Empowering LLMs with Cognitive Agency via Active Context Management #37 Sculptor：通过主动上下文管理赋能 LLMs 认知代理

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-06 17:32:58 UTC 发布时间：2025-08-06 17:32:58 UTC

#38 HierarchicalPrune: Position-Aware Compression for Large-Scale Diffusion Models #38 HierarchicalPrune：面向大规模扩散模型的位置感知压缩

Authors: [Young D. Kwon](https://arxiv.org/search/?searchtype=author&query=Young D. Kwon), [Rui Li](https://arxiv.org/search/?searchtype=author&query=Rui Li), [Sijia Li](https://arxiv.org/search/?searchtype=author&query=Sijia Li), [Da Li](https://arxiv.org/search/?searchtype=author&query=Da Li), [Sourav Bhattacharya](https://arxiv.org/search/?searchtype=author&query=Sourav Bhattacharya), [Stylianos I. Venieris](https://arxiv.org/search/?searchtype=author&query=Stylianos I. Venieris) 作者：Young D. Kwon, Rui Li, Sijia Li, Da Li, Sourav Bhattacharya, Stylianos I. Venieris

State-of-the-art text-to-image diffusion models (DMs) achieve remarkable quality, yet their massive parameter scale (8-11B) poses significant challenges for inference on resource-constrained devices. In this paper, we present HierarchicalPrune, a novel compression framework grounded in a key observation: DM blocks exhibit distinct functional hierarchies, where early blocks establish semantic structures while later blocks handle texture refinements. HierarchicalPrune synergistically combines three techniques: (1) Hierarchical Position Pruning, which identifies and removes less essential later blocks based on position hierarchy; (2) Positional Weight Preservation, which systematically protects early model portions that are essential for semantic structural integrity; and (3) Sensitivity-Guided Distillation, which adjusts knowledge-transfer intensity based on our discovery of block-wise sensitivity variations. As a result, our framework brings billion-scale diffusion models into a range more suitable for on-device inference, while preserving the quality of the output images. Specifically, when combined with INT4 weight quantisation, HierarchicalPrune achieves 77.5-80.4% memory footprint reduction (e.g., from 15.8 GB to 3.2 GB) and 27.9-38.0% latency reduction, measured on server and consumer grade GPUs, with the minimum drop of 2.6% in GenEval score and 7% in HPSv2 score compared to the original model. Last but not least, our comprehensive user study with 85 participants demonstrates that HierarchicalPrune maintains perceptual quality comparable to the original model while significantly outperforming prior works.
最先进的文本到图像扩散模型（DMs）实现了卓越的质量，但其庞大的参数规模（80-110 亿）给资源受限设备上的推理带来了重大挑战。本文提出了 HierarchicalPrune，一种基于关键观察的新型压缩框架：DM 模块表现出明显的功能层次结构，早期模块建立语义结构，而后期模块处理纹理细化。HierarchicalPrune 协同结合了三种技术：（1）层次位置剪枝，根据位置层次识别并移除较不重要的后期模块；（2）位置权重保护，有系统地保护对语义结构完整性至关重要的早期模型部分；（3）敏感度引导蒸馏，基于我们发现的模块敏感度差异调整知识转移强度。结果，我们的框架使十亿级扩散模型更适合设备端推理，同时保持输出图像的质量。具体来说，当与 INT4 权重量化结合使用时，HierarchicalPrune 实现了 77.5%-80.4%的内存占用减少（例如，从 15.8 GB 降至 3.2 GB）和 27.9%-38.0%的延迟减少，在服务器和消费级 GPU 上测量，与原始模型相比，GenEval 分数最低下降 2.6%，HPSv2 分数最低下降 7%。最后但同样重要的是，我们对 85 名参与者进行的综合用户研究表明，HierarchicalPrune 在保持与原始模型相当的感知质量的同时，显著优于以往的工作。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-06 17:30:44 UTC 发布时间：2025-08-06 17:30:44 UTC

#39 YOLOv8-Based Deep Learning Model for Automated Poultry Disease Detection and Health Monitoring paper #39 基于 YOLOv8 的深度学习模型用于自动化家禽疾病检测和健康监测论文

Authors: [Akhil Saketh Reddy Sabbella](https://arxiv.org/search/?searchtype=author&query=Akhil Saketh Reddy Sabbella), [Ch. Lakshmi Prachothan](https://arxiv.org/search/?searchtype=author&query=Ch. Lakshmi Prachothan), [Eswar Kumar Panta](https://arxiv.org/search/?searchtype=author&query=Eswar Kumar Panta) 作者：Akhil Saketh Reddy Sabbella，Ch. Lakshmi Prachothan，Eswar Kumar Panta

In the poultry industry, detecting chicken illnesses is essential to avoid financial losses. Conventional techniques depend on manual observation, which is laborious and prone to mistakes. This study proposes an AI-based approach that uses YOLO v8, a deep learning model for real-time object recognition. The system analyzes high-resolution chicken photos, and YOLO v8 detects signs of illness, such as abnormalities in behavior and appearance. A sizable, annotated dataset has been used to train the algorithm, which provides accurate real-time identification of infected chickens and timely warnings to farm operators for prompt action. By facilitating early infection identification, eliminating the need for human inspection, and enhancing biosecurity in large-scale farms, this AI technology improves chicken health management. The real-time features of YOLO v8 provide a scalable and effective method for improving farm management techniques. 在家禽业中，检测鸡只疾病对于避免经济损失至关重要。传统技术依赖人工观察，既费力又容易出错。本文提出了一种基于 YOLO v8 深度学习模型的实时目标识别 AI 方法。该研究通过开发一个系统，分析高分辨率鸡只照片，利用 YOLO v8 检测疾病迹象，如行为和外观异常。算法使用了大量带注释的数据集进行训练，能够准确实时识别感染鸡只，并及时向农场操作人员发出警报以便迅速采取措施。通过促进早期感染识别，消除人工检查需求，并提升大型农场的生物安全性，该 AI 技术改善了鸡只健康管理。YOLO v8 的实时特性为提升农场管理技术提供了一种可扩展且高效的方法。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-06 17:27:48 UTC 发布时间：2025-08-06 17:27:48 UTC

#40 X-SAM: From Segment Anything to Any Segmentation #40 X-SAM：从“分割任何物体”到“任何分割”

Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. Although the Segment Anything Model (SAM) represents a significant advancement in visual-prompt-driven image segmentation, it exhibits notable limitations in multi-mask prediction and category-specific segmentation tasks, and it cannot integrate all segmentation tasks within a unified model architecture. To address these limitations, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that extends the segmentation paradigm from “segment anything” to “any segmentation”. Specifically, we introduce a novel unified framework that enables more advanced pixel-level perceptual comprehension for MLLMs. Furthermore, we propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visual grounded, pixel-wise interpretative capabilities. To enable effective training on diverse data sources, we present a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency for multimodal, pixel-level visual understanding. Code is available at https://github.com/wanghao9610/X-SAM. 大型语言模型（LLMs）在广泛的知识表示方面表现出强大的能力，但它们在像素级感知理解方面本质上存在不足。尽管 Segment Anything Model（SAM）在视觉提示驱动的图像分割方面取得了重大进展，但其在多掩码预测和特定类别分割任务中存在显著局限，且无法将所有分割任务整合到统一的模型架构中。为了解决这些限制，我们提出了 X-SAM，一种简化的多模态大型语言模型（MLLM）框架，将分割范式从“分割任何物体”扩展到“任何分割”。具体而言，我们引入了一个新颖的统一框架，使 MLLM 具备更先进的像素级感知理解能力。此外，我们提出了一项新的分割任务，称为视觉定位分割（Visual GrounDed，VGD）分割，该任务通过交互式视觉提示分割所有实例对象，并赋予 MLLM 视觉定位的像素级解释能力。为了实现对多样化数据源的有效训练，我们提出了一种支持多数据集联合训练的统一训练策略。实验结果表明，X-SAM 在多种图像分割基准测试中达到了最先进的性能，突显了其在多模态像素级视觉理解方面的高效性。代码可在 https://github.com/wanghao9610/X-SAM 获取。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-06 17:19:10 UTC 发布时间：2025-08-06 17:19:10 UTC

#41 A Scalable Pretraining Framework for Link Prediction with Efficient Adaptation #41 一个可扩展的预训练框架，用于具有高效适应性的链接预测

Authors: [Yu Song](https://arxiv.org/search/?searchtype=author&query=Yu Song), [Zhigang Hua](https://arxiv.org/search/?searchtype=author&query=Zhigang Hua), [Harry Shomer](https://arxiv.org/search/?searchtype=author&query=Harry Shomer), [Yan Xie](https://arxiv.org/search/?searchtype=author&query=Yan Xie), [Jingzhe Liu](https://arxiv.org/search/?searchtype=author&query=Jingzhe Liu), [Bo Long](https://arxiv.org/search/?searchtype=author&query=Bo Long), [Hui Liu](https://arxiv.org/search/?searchtype=author&query=Hui Liu) 作者：宋宇、华志刚、哈里·肖默、谢岩、刘景哲、龙波、刘辉

Link Prediction (LP) is a critical task in graph machine learning. While Graph Neural Networks (GNNs) have significantly advanced LP performance recently, existing methods face key challenges including limited supervision from sparse connectivity, sensitivity to initialization, and poor generalization under distribution shifts. We explore pretraining as a solution to address these challenges. Unlike node classification, LP is inherently a pairwise task, which requires the integration of both node- and edge-level information. In this work, we present the first systematic study on the transferability of these distinct modules and propose a late fusion strategy to effectively combine their outputs for improved performance. To handle the diversity of pretraining data and avoid negative transfer, we introduce a Mixture-of-Experts (MoE) framework that captures distinct patterns in separate experts, facilitating seamless application of the pretrained model on diverse downstream datasets. For fast adaptation, we develop a parameter-efficient tuning strategy that allows the pretrained model to adapt to unseen datasets with minimal computational overhead. Experiments on 16 datasets across two domains demonstrate the effectiveness of our approach, achieving state-of-the-art performance on low-resource link prediction while obtaining competitive results compared to end-to-end trained methods, with over 10,000x lower computational overhead. 链接预测（LP）是图机器学习中的一项关键任务。尽管图神经网络（GNN）近年来显著提升了 LP 的性能，但现有方法仍面临关键挑战，包括稀疏连接带来的有限监督、对初始化的敏感性以及在分布变化下的泛化能力差。我们探索预训练作为解决这些挑战的方案。与节点分类不同，LP 本质上是一个成对任务，需要整合节点级和边级信息。在本工作中，我们首次系统地研究了这些不同模块的可迁移性，并提出了一种后期融合策略，有效结合它们的输出以提升性能。为应对预训练数据的多样性并避免负迁移，我们引入了专家混合（MoE）框架，在不同专家中捕捉不同模式，促进预训练模型在多样化下游数据集上的无缝应用。为了快速适应，我们开发了一种参数高效的调优策略，使预训练模型能够以极低的计算开销适应未见过的数据集。在两个领域的 16 个数据集上的实验展示了我们方法的有效性，在低资源链接预测任务中实现了最先进的性能，同时与端到端训练方法相比取得了具有竞争力的结果，计算开销降低了超过 10,000 倍。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-06 17:10:31 UTC 发布时间：2025-08-06 17:10:31 UTC

#42 P-Aligner: Enabling Pre-Alignment of Language Models via Principled Instruction Synthesis #42 P-Aligner：通过原则性指令合成实现语言模型的预对齐

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-06 16:51:38 UTC 发布时间：2025-08-06 16:51:38 UTC

#43 HiD-VAE: Interpretable Generative Recommendation via Hierarchical and Disentangled Semantic IDs #43 HiD-VAE：通过分层和解耦语义 ID 实现可解释的生成式推荐

Authors: [Dengzhao Fang](https://arxiv.org/search/?searchtype=author&query=Dengzhao Fang), [Jingtong Gao](https://arxiv.org/search/?searchtype=author&query=Jingtong Gao), [Chengcheng Zhu](https://arxiv.org/search/?searchtype=author&query=Chengcheng Zhu), [Yu Li](https://arxiv.org/search/?searchtype=author&query=Yu Li), [Xiangyu Zhao](https://arxiv.org/search/?searchtype=author&query=Xiangyu Zhao), [Yi Chang](https://arxiv.org/search/?searchtype=author&query=Yi Chang) 作者：方登钊，高景彤，朱成成，李瑜，赵翔宇，常毅

Recommender systems are indispensable for helping users navigate the immense item catalogs of modern online platforms. Recently, generative recommendation has emerged as a promising paradigm, unifying the conventional retrieve-and-rank pipeline into an end-to-end model capable of dynamic generation. However, existing generative methods are fundamentally constrained by their unsupervised tokenization, which generates semantic IDs suffering from two critical flaws: (1) they are semantically flat and uninterpretable, lacking a coherent hierarchy, and (2) they are prone to representation entanglement (i.e., “ID collisions”), which harms recommendation accuracy and diversity. To overcome these limitations, we propose HiD-VAE, a novel framework that learns hierarchically disentangled item representations through two core innovations. First, HiD-VAE pioneers a hierarchically-supervised quantization process that aligns discrete codes with multi-level item tags, yielding more uniform and disentangled IDs. Crucially, the trained codebooks can predict hierarchical tags, providing a traceable and interpretable semantic path for each recommendation. Second, to combat representation entanglement, HiD-VAE incorporates a novel uniqueness loss that directly penalizes latent space overlap. This mechanism not only resolves the critical ID collision problem but also promotes recommendation diversity by ensuring a more comprehensive utilization of the item representation space. These high-quality, disentangled IDs provide a powerful foundation for downstream generative models. Extensive experiments on three public benchmarks validate HiD-VAE’s superior performance against state-of-the-art methods. The code is available at https://anonymous.4open.science/r/HiD-VAE-84B2.
推荐系统对于帮助用户浏览现代在线平台庞大的商品目录至关重要。近年来，生成式推荐作为一种有前景的范式出现，将传统的检索与排序流程统一为一个能够动态生成的端到端模型。然而，现有的生成方法在根本上受限于其无监督的分词方式，这种方式生成的语义 ID 存在两个关键缺陷：（1）它们语义平坦且不可解释，缺乏连贯的层级结构；（2）它们容易出现表示纠缠（即“ID 冲突”），这损害了推荐的准确性和多样性。为克服这些限制，我们提出了 HiD-VAE，一种通过两项核心创新学习层级解耦商品表示的新框架。首先，HiD-VAE 开创性地引入了层级监督的量化过程，将离散代码与多级商品标签对齐，生成更均匀且解耦的 ID。关键是，训练好的码本能够预测层级标签，为每个推荐提供可追踪且可解释的语义路径。其次，为了应对表示纠缠问题，HiD-VAE 引入了一种新颖的唯一性损失，直接惩罚潜在空间的重叠。该机制不仅解决了关键的 ID 冲突问题，还通过确保更全面地利用物品表示空间，促进了推荐的多样性。这些高质量、解耦的 ID 为下游生成模型提供了强大的基础。在三个公开基准上的大量实验验证了 HiD-VAE 相较于最先进方法的优越性能。代码可在 https://anonymous.4open.science/r/HiD-VAE-84B2 获取。

Subjects: Information Retrieval, Artificial Intelligence 主题：信息检索，人工智能

Publish: 2025-08-06 16:45:05 UTC 发布时间：2025-08-06 16:45:05 UTC

#44 Neuromorphic Cybersecurity with Semi-supervised Lifelong Learning #44 具有半监督终身学习的类脑神经网络网络安全

Authors: [Md Zesun Ahmed Mia](https://arxiv.org/search/?searchtype=author&query=Md Zesun Ahmed Mia), [Malyaban Bal](https://arxiv.org/search/?searchtype=author&query=Malyaban Bal), [Sen Lu](https://arxiv.org/search/?searchtype=author&query=Sen Lu), [George M. Nishibuchi](https://arxiv.org/search/?searchtype=author&query=George M. Nishibuchi), [Suhas Chelian](https://arxiv.org/search/?searchtype=author&query=Suhas Chelian), [Srini Vasan](https://arxiv.org/search/?searchtype=author&query=Srini Vasan), [Abhronil Sengupta](https://arxiv.org/search/?searchtype=author&query=Abhronil Sengupta) 作者：Md Zesun Ahmed Mia, Malyaban Bal, Sen Lu, George M. Nishibuchi, Suhas Chelian, Srini Vasan, Abhronil Sengupta

Inspired by the brain’s hierarchical processing and energy efficiency, this paper presents a Spiking Neural Network (SNN) architecture for lifelong Network Intrusion Detection System (NIDS). The proposed system first employs an efficient static SNN to identify potential intrusions, which then activates an adaptive dynamic SNN responsible for classifying the specific attack type. Mimicking biological adaptation, the dynamic classifier utilizes Grow When Required (GWR)-inspired structural plasticity and a novel Adaptive Spike-Timing-Dependent Plasticity (Ad-STDP) learning rule. These bio-plausible mechanisms enable the network to learn new threats incrementally while preserving existing knowledge. Tested on the UNSW-NB15 benchmark in a continual learning setting, the architecture demonstrates robust adaptation, reduced catastrophic forgetting, and achieves 85.3% overall accuracy. Furthermore, simulations using the Intel Lava framework confirm high operational sparsity, highlighting the potential for low-power deployment on neuromorphic hardware. 受大脑层级处理和能效的启发，本文提出了一种用于终身网络入侵检测系统（NIDS）的脉冲神经网络（SNN）架构。该系统首先采用高效的静态 SNN 来识别潜在入侵，然后激活一个自适应动态 SNN，负责分类具体的攻击类型。动态分类器模拟生物适应性，利用基于“按需增长”（GWR）的结构可塑性和一种新颖的自适应脉冲时序依赖可塑性（Ad-STDP）学习规则。这些生物合理的机制使网络能够增量学习新威胁，同时保留已有知识。在持续学习环境下，于 UNSW-NB15 基准测试中，该架构表现出强健的适应能力，减少灾难性遗忘，并实现了 85.3 %的整体准确率。此外，使用 Intel Lava 框架的仿真验证了其高操作稀疏性，凸显了在神经形态硬件上低功耗部署的潜力。

Subjects: Machine Learning, Artificial Intelligence, Emerging Technologies, Neural and Evolutionary Computing 主题：机器学习，人工智能，新兴技术，神经与进化计算

Publish: 2025-08-06 16:29:59 UTC 发布时间：2025-08-06 16:29:59 UTC

#45 TURA: Tool-Augmented Unified Retrieval Agent for AI Search #45 TURA：用于人工智能搜索的工具增强统一检索代理

The advent of Large Language Models (LLMs) is transforming search engines into conversational AI search products, primarily using Retrieval-Augmented Generation (RAG) on web corpora. However, this paradigm has significant industrial limitations. Traditional RAG approaches struggle with real-time needs and structured queries that require accessing dynamically generated content like ticket availability or inventory. Limited to indexing static pages, search engines cannot perform the interactive queries needed for such time-sensitive data. Academic research has focused on optimizing RAG for static content, overlooking complex intents and the need for dynamic sources like databases and real-time APIs. To bridge this gap, we introduce TURA (Tool-Augmented Unified Retrieval Agent for AI Search), a novel three-stage framework that combines RAG with agentic tool-use to access both static content and dynamic, real-time information. TURA has three key components: an Intent-Aware Retrieval module to decompose queries and retrieve information sources encapsulated as Model Context Protocol (MCP) Servers, a DAG-based Task Planner that models task dependencies as a Directed Acyclic Graph (DAG) for optimal parallel execution, and a lightweight Distilled Agent Executor for efficient tool calling. TURA is the first architecture to systematically bridge the gap between static RAG and dynamic information sources for a world-class AI search product. Serving tens of millions of users, it leverages an agentic framework to deliver robust, real-time answers while meeting the low-latency demands of a large-scale industrial system. 
大型语言模型（LLMs）的出现正在将搜索引擎转变为对话式人工智能搜索产品，主要通过对网络语料库进行检索增强生成（RAG）。然而，这一范式在工业应用中存在显著限制。传统的 RAG 方法难以满足实时需求和需要访问动态生成内容（如票务可用性或库存）的结构化查询。搜索引擎仅限于索引静态页面，无法执行此类时间敏感数据所需的交互式查询。学术研究主要集中在优化针对静态内容的 RAG，忽视了复杂意图以及对数据库和实时 API 等动态资源的需求。为弥补这一差距，我们提出了 TURA（Tool-Augmented Unified Retrieval Agent for AI Search），这是一种结合了 RAG 与工具代理使用的创新三阶段框架，能够访问静态内容和动态实时信息。 TURA 有三个关键组成部分：一个意图感知检索模块，用于分解查询并检索封装为模型上下文协议（MCP）服务器的信息源；一个基于有向无环图（DAG）的任务规划器，将任务依赖关系建模为有向无环图，以实现最佳的并行执行；以及一个轻量级的蒸馏代理执行器，用于高效调用工具。TURA 是首个系统性地弥合静态检索增强生成（RAG）与动态信息源之间差距的架构，面向世界级的 AI 搜索产品。它服务于数千万用户，利用代理框架提供强大且实时的答案，同时满足大规模工业系统的低延迟需求。
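
The DAG-based Task Planner is described as modeling task dependencies as a Directed Acyclic Graph so that independent tasks can run in parallel. A minimal sketch of that general idea, using Python's standard `graphlib` module, is shown below. It is not TURA's implementation: the function name `parallel_stages` and the travel-style task names are hypothetical, and a real planner would dispatch each stage to tool calls rather than just list it.

```python
from graphlib import TopologicalSorter


def parallel_stages(dependencies):
    """Group tasks into stages that could be executed in parallel.

    `dependencies` maps each task to the set of tasks it depends on.
    Every task in a stage has all of its dependencies in earlier stages.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks with no unmet dependencies
        stages.append(ready)
        sorter.done(*ready)  # mark the whole stage finished to unlock successors
    return stages


# Hypothetical sub-tasks for a decomposed travel query.
plan = parallel_stages({
    "answer": {"flights", "hotels"},
    "flights": {"parse_query"},
    "hotels": {"parse_query"},
})
```

Here `flights` and `hotels` land in the same stage because neither depends on the other, which is the property a planner of this kind would exploit to reduce latency.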

Subjects: Computation and Language, Artificial Intelligence, Information Retrieval 主题：计算与语言，人工智能，信息检索

Publish: 2025-08-06 16:24:17 UTC 发布时间：2025-08-06 16:24:17 UTC

#46 GraphProp: Training the Graph Foundation Models using Graph Properties #46 GraphProp：利用图属性训练图基础模型

Authors: [Ziheng Sun](https://arxiv.org/search/?searchtype=author&query=Ziheng Sun), [Qi Feng](https://arxiv.org/search/?searchtype=author&query=Qi Feng), [Lehao Lin](https://arxiv.org/search/?searchtype=author&query=Lehao Lin), [Chris Ding](https://arxiv.org/search/?searchtype=author&query=Chris Ding), [Jicong Fan](https://arxiv.org/search/?searchtype=author&query=Jicong Fan) 作者：孙子恒，冯琦，林乐豪，Chris Ding，范继聪

This work focuses on training graph foundation models (GFMs) that have strong generalization ability in graph-level tasks such as graph classification. Effective GFM training requires capturing information consistent across different domains. We discover that graph structures provide more consistent cross-domain information compared to node features and graph labels. However, traditional GFMs primarily focus on transferring node features from various domains into a unified representation space but often lack structural cross-domain generalization. To address this, we introduce GraphProp, which emphasizes structural generalization. The training process of GraphProp consists of two main phases. First, we train a structural GFM by predicting graph invariants. Since graph invariants are properties of graphs that depend only on the abstract structure, not on particular labellings or drawings of the graph, this structural GFM has a strong ability to capture the abstract structural information and provide discriminative graph representations comparable across diverse domains. In the second phase, we use the representations given by the structural GFM as positional encodings to train a comprehensive GFM. This phase utilizes domain-specific node attributes and graph labels to further improve cross-domain node feature generalization. Our experiments demonstrate that GraphProp significantly outperforms the competitors in supervised learning and few-shot learning, especially in handling graphs without node attributes. 本工作聚焦于训练在图级任务（如图分类）中具有强泛化能力的图基础模型（GFM）。有效的 GFM 训练需要捕捉跨不同领域一致的信息。我们发现，与节点特征和图标签相比，图结构提供了更为一致的跨领域信息。然而，传统的 GFM 主要侧重于将来自不同领域的节点特征转化为统一的表示空间，但往往缺乏结构上的跨领域泛化能力。为此，我们提出了 GraphProp，强调结构泛化。GraphProp 的训练过程包含两个主要阶段。首先，我们通过预测图不变量来训练结构 GFM。由于图不变量是仅依赖于图的抽象结构，而非特定标注或图形绘制的图属性，该结构 GFM 具备强大的抽象结构信息捕捉能力，并能提供在不同领域间可比的判别性图表示。在第二阶段，我们使用结构化 GFM 提供的表示作为位置编码来训练一个综合性的 GFM。该阶段利用特定领域的节点属性和图标签，进一步提升跨领域节点特征的泛化能力。我们的实验表明，GraphProp 在监督学习和少样本学习中显著优于竞争对手，尤其在处理无节点属性的图时表现突出。
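
The first training phase predicts graph invariants, which depend only on a graph's abstract structure and not on how its nodes are labelled. The abstract does not say which invariants are used, so the sketch below computes a few simple ones chosen purely for illustration (node count, edge count, sorted degree sequence, triangle count) and checks that relabelling the nodes leaves them unchanged. It assumes a simple undirected graph given as an edge list with no self-loops or isolated nodes; the function name is hypothetical.

```python
from itertools import combinations


def graph_invariants(edges):
    """Compute a few label-independent properties of a simple undirected graph."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    degrees = sorted(len(neighbors) for neighbors in adjacency.values())
    # Count unordered node triples that are pairwise adjacent.
    triangles = sum(
        1
        for a, b, c in combinations(adjacency, 3)
        if b in adjacency[a] and c in adjacency[a] and c in adjacency[b]
    )
    return {
        "num_nodes": len(adjacency),
        "num_edges": sum(degrees) // 2,  # each edge contributes to two degrees
        "degree_sequence": degrees,
        "num_triangles": triangles,
    }


# A triangle with one pendant node, under two different labellings.
original = graph_invariants([(0, 1), (1, 2), (2, 0), (2, 3)])
relabelled = graph_invariants([("x", "y"), ("y", "z"), ("z", "x"), ("z", "w")])
```

Both calls return the same dictionary, which is what makes such quantities usable as domain-independent training targets.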

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-06 16:12:42 UTC 发布时间：2025-08-06 16:12:42 UTC

#47 A Comprehensive Framework for Uncertainty Quantification of Voxel-wise Supervised Models in IVIM MRI #47 IVIM MRI 中体素级监督模型不确定性量化的综合框架

Accurate estimation of intravoxel incoherent motion (IVIM) parameters from diffusion-weighted MRI remains challenging due to the ill-posed nature of the inverse problem and high sensitivity to noise, particularly in the perfusion compartment. In this work, we propose a probabilistic deep learning framework based on Deep Ensembles (DE) of Mixture Density Networks (MDNs), enabling estimation of total predictive uncertainty and its decomposition into aleatoric (AU) and epistemic (EU) components. The method was benchmarked against non-probabilistic neural networks, a Bayesian fitting approach, and a probabilistic network with a single Gaussian parametrization. Supervised training was performed on synthetic data, and evaluation was conducted on both simulated data and two in vivo datasets. The reliability of the quantified uncertainties was assessed using calibration curves, output distribution sharpness, and the Continuous Ranked Probability Score (CRPS). MDNs produced better calibrated and sharper predictive distributions for the D and f parameters, although slight overconfidence was observed in D*. The Robust Coefficient of Variation (RCV) indicated smoother in vivo estimates for D* with MDNs compared to the Gaussian model. Despite the training data covering the expected physiological range, elevated EU in vivo suggests a mismatch with real acquisition conditions, highlighting the importance of incorporating EU, which is made possible by DE. Overall, we present a comprehensive framework for IVIM fitting with uncertainty quantification, which enables the identification and interpretation of unreliable estimates. The proposed approach can also be adopted for fitting other physical models through appropriate architectural and simulation adjustments.
由于逆问题的病态性质以及对噪声的高度敏感性，特别是在灌注部分，基于扩散加权 MRI 的体素内非相干运动（IVIM）参数的准确估计仍然具有挑战性。在本工作中，我们提出了一种基于混合密度网络（MDNs）深度集成（DE）的概率深度学习框架，实现了总预测不确定性的估计及其分解为固有不确定性（AU）和认知不确定性（EU）两部分。该方法与非概率神经网络、贝叶斯拟合方法以及单高斯参数化的概率网络进行了基准比较。监督训练在合成数据上进行，评估则在模拟数据和两个体内数据集上完成。通过校准曲线、输出分布的尖锐度以及连续排序概率得分（CRPS）评估了量化不确定性的可靠性。MDNs 在 D 和 f 参数上产生了更校准且更尖锐的预测分布，尽管在 D* 参数上观察到轻微的过度自信。稳健变异系数（RCV）表明，与高斯模型相比，MDNs 对 D* 的体内估计更平滑。尽管训练数据涵盖了预期的生理范围，但体内升高的 EU 表明与实际采集条件存在不匹配，突显了纳入 EU 的重要性，而这正是 DE 所允许的。总体而言，我们提出了一个包含不确定性量化的 IVIM 拟合综合框架，能够识别和解释不可靠的估计。该方法也可以通过适当的架构和模拟调整，应用于其他物理模型的拟合。
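
The abstract describes decomposing total predictive uncertainty into aleatoric (AU) and epistemic (EU) parts using a deep ensemble of mixture density networks. One common way to do this, assumed here and not confirmed as the paper's exact definition, is the law of total variance: AU is the average of the per-member predictive variances, and EU is the variance of the per-member predictive means. The function names and the toy numbers below are illustrative only.

```python
from statistics import fmean, pvariance


def mixture_moments(weights, means, variances):
    """Mean and variance of a 1-D Gaussian mixture (one MDN member's output)."""
    mean = sum(w * m for w, m in zip(weights, means))
    second_moment = sum(w * (v + m * m) for w, m, v in zip(weights, means, variances))
    return mean, second_moment - mean * mean


def decompose_uncertainty(member_moments):
    """Split ensemble predictive variance via the law of total variance.

    `member_moments` is a list of (mean, variance) pairs, one per ensemble member.
    Returns (aleatoric, epistemic, total).
    """
    member_means = [mean for mean, _ in member_moments]
    member_variances = [variance for _, variance in member_moments]
    aleatoric = fmean(member_variances)   # expected within-member variance
    epistemic = pvariance(member_means)   # disagreement between members
    return aleatoric, epistemic, aleatoric + epistemic


# Toy example: two ensemble members predicting one parameter for one voxel.
au, eu, total = decompose_uncertainty([(1.0, 0.2), (3.0, 0.4)])
mix_mean, mix_var = mixture_moments([0.5, 0.5], [0.0, 2.0], [1.0, 1.0])
```

Under this decomposition, a large EU relative to AU for a voxel indicates that the ensemble members disagree, which is consistent with the abstract's reading of elevated in vivo EU as a sign of mismatch between training and acquisition conditions.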

Subjects: Image and Video Processing, Artificial Intelligence, Machine Learning 主题：图像与视频处理，人工智能，机器学习

Publish: 2025-08-06 16:08:55 UTC 发布时间：2025-08-06 16:08:55 UTC

#48 Position: The Current AI Conference Model is Unsustainable! Diagnosing the Crisis of Centralized AI Conference #48 立场：当前的人工智能会议模式不可持续！诊断集中式人工智能会议的危机

Artificial Intelligence (AI) conferences are essential for advancing research, sharing knowledge, and fostering academic community. However, their rapid expansion has rendered the centralized conference model increasingly unsustainable. This paper offers a data-driven diagnosis of a structural crisis that threatens the foundational goals of scientific dissemination, equity, and community well-being. We identify four key areas of strain: (1) scientifically, with per-author publication rates more than doubling over the past decade to over 4.5 papers annually; (2) environmentally, with the carbon footprint of a single conference exceeding the daily emissions of its host city; (3) psychologically, with 71% of online community discourse reflecting negative sentiment and 35% referencing mental health concerns; and (4) logistically, with attendance at top conferences such as NeurIPS 2024 beginning to outpace venue capacity. These pressures point to a system that is misaligned with its core mission. In response, we propose the Community-Federated Conference (CFC) model, which separates peer review, presentation, and networking into globally coordinated but locally organized components, offering a more sustainable, inclusive, and resilient path forward for AI research. 人工智能（AI）会议对于推动研究进展、分享知识和促进学术社区发展至关重要。然而，其快速扩张使得集中式会议模式日益难以为继。本文通过数据驱动的方法诊断了一场结构性危机，这场危机威胁着科学传播、公平性和社区福祉的根本目标。我们识别出四个关键压力领域：（1）科学方面，过去十年每位作者的发表论文数量翻倍，达到每年超过 4.5 篇；（2）环境方面，一场会议的碳足迹超过了举办城市的日常排放量；（3）心理方面，71%的在线社区讨论表现出负面情绪，35%涉及心理健康问题；（4）后勤方面，顶级会议如 NeurIPS 2024 的参会人数已开始超过场地容量。这些压力表明当前系统与其核心使命存在错位。作为回应，我们提出了社区联合会议（CFC）模型，该模型将同行评审、报告和交流分离为全球协调但本地组织的组成部分，为人工智能研究提供了一条更可持续、更具包容性和更具韧性的前进道路。

Subjects: Computers and Society, Artificial Intelligence, Computation and Language 主题：计算机与社会，人工智能，计算与语言

Publish: 2025-08-06 16:08:27 UTC 发布时间：2025-08-06 16:08:27 UTC

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-06 16:06:43 UTC 发布时间：2025-08-06 16:06:43 UTC

#50 Beyond Brainstorming: What Drives High-Quality Scientific Ideas? Lessons from Multi-Agent Collaboration #50 超越头脑风暴：是什么驱动高质量的科学创意？多智能体协作的启示

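While AI agents show promise for scientific ideation, most existing frameworks rely on iterative refinement by a single agent, which limits creativity because of constrained knowledge and perspective. Inspired by real-world research dynamics, this paper investigates whether structured multi-agent discussion can surpass solitary ideation. We propose a cooperative multi-agent framework for generating research proposals and systematically compare configurations including team size, leader-led versus leaderless structures, and team compositions that vary in interdisciplinarity and seniority. To assess idea quality, we employ a comprehensive protocol that combines agent-based scoring with human review across dimensions such as novelty, strategic vision, and integration depth. Results show that multi-agent discussion substantially outperforms solitary baselines. A designated leader acts as a catalyst, turning discussion into more integrated and visionary proposals. Notably, we find that cognitive diversity is a primary driver of quality, yet expertise is a non-negotiable prerequisite, as teams lacking a foundation of senior knowledge fail to surpass even a single competent agent. These findings offer actionable insights for designing collaborative AI ideation systems and shed light on how team structure influences creative outcomes. 尽管人工智能代理在科学构思方面展现出潜力，但大多数现有框架依赖于单一代理的迭代改进，因知识和视角的局限性而限制了创造力。受现实世界研究动态的启发，本文探讨了结构化多代理讨论是否能够超越单独构思。我们提出了一个用于生成研究提案的合作多代理框架，并系统地比较了包括团队规模、领导主导与无领导结构，以及跨学科和资历多样化的团队组成等配置。为了评估创意质量，我们采用了一个综合协议，结合基于代理的评分和人类评审，涵盖新颖性、战略视野和整合深度等维度。结果显示，多代理讨论显著优于单独基线。指定的领导者充当催化剂，将讨论转化为更具整合性和远见性的提案。值得注意的是，我们发现认知多样性是质量的主要驱动力，但专业知识是不可或缺的前提，因为缺乏资深知识基础的团队甚至无法超越单个有能力的代理。这些发现为设计协作式人工智能创意系统提供了可操作的见解，并揭示了团队结构如何影响创造性成果。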

Publish: 2025-08-06 15:59:18 UTC 发布时间：2025-08-06 15:59:18 UTC

Authors: [Jinxing Zhou](https://arxiv.org/search/?searchtype=author&query=Jinxing Zhou), [Ziheng Zhou](https://arxiv.org/search/?searchtype=author&query=Ziheng Zhou), [Yanghao Zhou](https://arxiv.org/search/?searchtype=author&query=Yanghao Zhou), [Yuxin Mao](https://arxiv.org/search/?searchtype=author&query=Yuxin Mao), [Zhangling Duan](https://arxiv.org/search/?searchtype=author&query=Zhangling Duan), [Dan Guo](https://arxiv.org/search/?searchtype=author&query=Dan Guo) 作者：周金星，周子恒，周阳浩，毛宇新，段章凌，郭丹

The Dense Audio-Visual Event Localization (DAVEL) task aims to temporally localize events in untrimmed videos that occur simultaneously in both the audio and visual modalities. This paper explores DAVEL under a new and more challenging weakly-supervised setting (W-DAVEL task), where only video-level event labels are provided and the temporal boundaries of each event are unknown. We address W-DAVEL by exploiting *cross-modal salient anchors*, which are defined as reliable timestamps that are well predicted under weak supervision and exhibit highly consistent event semantics across audio and visual modalities. Specifically, we propose a *Mutual Event Agreement Evaluation* module, which generates an agreement score by measuring the discrepancy between the predicted audio and visual event classes. Then, the agreement score is utilized in a *Cross-modal Salient Anchor Identification* module, which identifies the audio and visual anchor features through global-video and local temporal window identification mechanisms. The anchor features after multimodal integration are fed into an *Anchor-based Temporal Propagation* module to enhance event semantic encoding in the original temporal audio and visual features, facilitating better temporal localization under weak supervision. We establish benchmarks for W-DAVEL on both the UnAV-100 and ActivityNet1.3 datasets. Extensive experiments demonstrate that our method achieves state-of-the-art performance. 密集音视频事件定位（DAVEL）任务旨在对未剪辑视频中同时在音频和视觉模态中发生的事件进行时间定位。本文在一种新的、更具挑战性的弱监督设置下探讨 DAVEL（W-DAVEL 任务），该设置仅提供视频级事件标签，且每个事件的时间边界未知。我们通过利用*跨模态显著锚点*来解决 W-DAVEL 问题，这些锚点被定义为在弱监督下预测可靠且在音频和视觉模态中表现出高度一致事件语义的时间戳。具体而言，我们提出了一个*互事件一致性评估*模块，通过测量预测的音频和视觉事件类别之间的差异来生成一致性分数。随后，该一致性分数被用于*跨模态显著锚点识别*模块，该模块通过全局视频和局部时间窗口识别机制来识别音频和视觉锚点特征。多模态融合后的锚点特征被输入到一个*基于锚点的时间传播*模块中，以增强原始时间音频和视觉特征中的事件语义编码，从而在弱监督下实现更好的时间定位。我们在 UnAV-100 和 ActivityNet1.3 数据集上为 W-DAVEL 建立了基准。大量实验表明，我们的方法达到了最先进的性能。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Multimedia 主题：计算机视觉与模式识别，人工智能，多媒体

Publish: 2025-08-06 15:49:53 UTC 发布时间：2025-08-06 15:49:53 UTC

#52 MSC: A Marine Wildlife Video Dataset with Grounded Segmentation and Clip-Level Captioning #52 MSC：一个带有定位分割和片段级字幕的海洋野生动物视频数据集

Authors: [Quang-Trung Truong](https://arxiv.org/search/?searchtype=author&query=Quang-Trung Truong), [Yuk-Kwan Wong](https://arxiv.org/search/?searchtype=author&query=Yuk-Kwan Wong), [Vo Hoang Kim Tuyen Dang](https://arxiv.org/search/?searchtype=author&query=Vo Hoang Kim Tuyen Dang), [Rinaldi Gotama](https://arxiv.org/search/?searchtype=author&query=Rinaldi Gotama), [Duc Thanh Nguyen](https://arxiv.org/search/?searchtype=author&query=Duc Thanh Nguyen), [Sai-Kit Yeung](https://arxiv.org/search/?searchtype=author&query=Sai-Kit Yeung) 作者：Quang-Trung Truong，Yuk-Kwan Wong，Vo Hoang Kim Tuyen Dang，Rinaldi Gotama，Duc Thanh Nguyen，Sai-Kit Yeung

Marine videos present significant challenges for video understanding due to the dynamics of marine objects and the surrounding environment, camera motion, and the complexity of underwater scenes. Existing video captioning datasets, typically focused on generic or human-centric domains, often fail to generalize to the complexities of the marine environment and gain insights about marine life. To address these limitations, we propose a two-stage marine object-oriented video captioning pipeline. We introduce a comprehensive video understanding benchmark that leverages the triplets of video, text, and segmentation masks to facilitate visual grounding and captioning, leading to improved marine video understanding and analysis, and marine video generation. Additionally, we highlight the effectiveness of video splitting in order to detect salient object transitions in scene changes, which significantly enrich the semantics of captioning content. Our dataset and code have been released at https://msc.hkustvgd.com. 海洋视频由于海洋物体及其周围环境的动态变化、摄像机运动以及水下场景的复杂性，给视频理解带来了重大挑战。现有的视频字幕数据集通常侧重于通用或以人为中心的领域，往往难以推广到海洋环境的复杂性，也难以深入了解海洋生物。为了解决这些限制，我们提出了一个两阶段的面向海洋物体的视频字幕生成流程。我们引入了一个综合性的视频理解基准，利用视频、文本和分割掩码三元组来促进视觉定位和字幕生成，从而提升海洋视频的理解与分析，以及海洋视频的生成。此外，我们强调了视频切分在检测场景变化中显著物体转换方面的有效性，这极大丰富了字幕内容的语义。我们的数据集和代码已发布于 https://msc.hkustvgd.com。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Multimedia 主题：计算机视觉与模式识别，人工智能，多媒体

Publish: 2025-08-06 15:34:24 UTC 发布时间：2025-08-06 15:34:24 UTC

#53 Unveiling the Landscape of Clinical Depression Assessment: From Behavioral Signatures to Psychiatric Reasoning #53 揭示临床抑郁评估的全景：从行为特征到精神病学推理

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-06 15:13:24 UTC 发布时间：2025-08-06 15:13:24 UTC

#54 RAIDX: A Retrieval-Augmented Generation and GRPO Reinforcement Learning Framework for Explainable Deepfake Detection #54 RAIDX：一种用于可解释深度伪造检测的检索增强生成与 GRPO 强化学习框架

Authors: [Tianxiao Li](https://arxiv.org/search/?searchtype=author&query=Tianxiao Li), [Zhenglin Huang](https://arxiv.org/search/?searchtype=author&query=Zhenglin Huang), [Haiquan Wen](https://arxiv.org/search/?searchtype=author&query=Haiquan Wen), [Yiwei He](https://arxiv.org/search/?searchtype=author&query=Yiwei He), [Shuchang Lyu](https://arxiv.org/search/?searchtype=author&query=Shuchang Lyu), [Baoyuan Wu](https://arxiv.org/search/?searchtype=author&query=Baoyuan Wu), [Guangliang Cheng](https://arxiv.org/search/?searchtype=author&query=Guangliang Cheng) 作者：李天晓，黄正林，温海泉，何奕伟，吕书昌，吴宝元，程光亮

The rapid advancement of AI-generation models has enabled the creation of hyperrealistic imagery, posing ethical risks through widespread misinformation. Current deepfake detection methods, categorized as face specific detectors or general AI-generated detectors, lack transparency by framing detection as a classification task without explaining decisions. While several LLM-based approaches offer explainability, they suffer from coarse-grained analyses and dependency on labor-intensive annotations. This paper introduces RAIDX (Retrieval-Augmented Image Deepfake Detection and Explainability), a novel deepfake detection framework integrating Retrieval-Augmented Generation (RAG) and Group Relative Policy Optimization (GRPO) to enhance detection accuracy and decision explainability. Specifically, RAIDX leverages RAG to incorporate external knowledge for improved detection accuracy and employs GRPO to autonomously generate fine-grained textual explanations and saliency maps, eliminating the need for extensive manual annotations. Experiments on multiple benchmarks demonstrate RAIDX’s effectiveness in identifying real or fake, and providing interpretable rationales in both textual descriptions and saliency maps, achieving state-of-the-art detection performance while advancing transparency in deepfake identification. RAIDX represents the first unified framework to synergize RAG and GRPO, addressing critical gaps in accuracy and explainability. Our code and models will be publicly available. 人工智能生成模型的快速发展使得超逼真图像的创作成为可能，但也通过广泛传播的虚假信息带来了伦理风险。目前的深度伪造检测方法分为面部特定检测器和通用 AI 生成检测器两类，这些方法将检测视为分类任务，缺乏透明度，无法解释决策过程。尽管一些基于 LLM 的方法提供了解释性，但它们存在分析粒度粗糙且依赖大量人工标注的问题。本文提出了 RAIDX（检索增强图像深度伪造检测与解释）——一种新颖的深度伪造检测框架，结合了检索增强生成（RAG）和群体相对策略优化（GRPO），以提升检测准确性和决策解释性。具体而言，RAIDX 利用 RAG 引入外部知识以提高检测准确率，并采用 GRPO 自主生成细粒度文本解释和显著性图，免除了大量人工标注的需求。在多个基准测试中的实验表明，RAIDX 在识别真假以及提供文本描述和显著性图中的可解释理由方面表现出色，达到了最先进的检测性能，同时推动了深度伪造识别的透明度。RAIDX 是首个将 RAG 和 GRPO 协同统一的框架，解决了准确性和可解释性方面的关键问题。我们的代码和模型将公开发布。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-06 15:08:16 UTC 发布时间：2025-08-06 15:08:16 UTC

#55 PRISM: Lightweight Multivariate Time-Series Classification through Symmetric Multi-Resolution Convolutional Layers #55 PRISM：通过对称多分辨率卷积层实现轻量级多变量时间序列分类

Authors: [Federico Zucchi](https://arxiv.org/search/?searchtype=author&query=Federico Zucchi), [Thomas Lampert](https://arxiv.org/search/?searchtype=author&query=Thomas Lampert) 作者：Federico Zucchi，Thomas Lampert

Multivariate time-series classification is pivotal in domains ranging from wearable sensing to biomedical monitoring. Despite recent advances, Transformer- and CNN-based models often remain computationally heavy, offer limited frequency diversity, and require extensive parameter budgets. We propose PRISM (Per-channel Resolution-Informed Symmetric Module), a convolutional-based feature extractor that applies symmetric finite-impulse-response (FIR) filters at multiple temporal scales, independently per channel. This multi-resolution, per-channel design yields highly frequency-selective embeddings without any inter-channel convolutions, greatly reducing model size and complexity. Across human-activity, sleep-stage and biomedical benchmarks, PRISM, paired with lightweight classification heads, matches or outperforms leading CNN and Transformer baselines, while using roughly an order of magnitude fewer parameters and FLOPs. By uniting classical signal processing insights with modern deep learning, PRISM offers an accurate, resource-efficient solution for multivariate time-series classification. 多变量时间序列分类在从可穿戴传感到生物医学监测等领域具有关键作用。尽管近年来取得了进展，基于 Transformer 和 CNN 的模型通常计算量大，频率多样性有限，并且需要大量参数。我们提出了 PRISM（每通道分辨率知情对称模块），这是一种基于卷积的特征提取器，在多个时间尺度上对每个通道独立应用对称有限脉冲响应（FIR）滤波器。这种多分辨率、每通道设计产生高度频率选择性的嵌入，而无需任何通道间卷积，极大地减少了模型的大小和复杂度。在人体活动、睡眠阶段和生物医学基准测试中，PRISM 配合轻量级分类头，匹配或优于领先的 CNN 和 Transformer 基线，同时参数和浮点运算量大约减少一个数量级。通过结合经典信号处理的见解与现代深度学习，PRISM 为多变量时间序列分类提供了一种准确且资源高效的解决方案。
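
As a rough illustration of the per-channel, multi-resolution idea described above, the following sketch builds symmetric FIR filters from a few free parameters and applies them independently to each channel. It is not the PRISM implementation: the function names, the fixed (non-learned) taps, and the mean-absolute-value pooling are all assumptions chosen to keep the example small.

```python
def symmetric_taps(half):
    """Mirror free parameters [a, b, c] into an odd-length symmetric filter [a, b, c, b, a]."""
    return half + half[-2::-1]


def fir_valid(signal, taps):
    """Apply an FIR filter without padding ("valid" mode).

    Written as a sliding dot product; for symmetric taps this equals convolution.
    """
    k = len(taps)
    return [sum(taps[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def per_channel_features(channels, halves_by_scale):
    """One pooled feature per (channel, scale) pair; no mixing across channels.

    Pooling by mean absolute response is an assumption of this sketch.
    """
    features = []
    for signal in channels:
        for half in halves_by_scale:
            response = fir_valid(signal, symmetric_taps(half))
            features.append(sum(abs(v) for v in response) / len(response))
    return features
```

Because each filter is symmetric, only about half of its taps are free parameters, and because channels are never mixed, the feature count grows linearly with the number of channels and scales.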

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-06 14:50:25 UTC 发布时间：2025-08-06 14:50:25 UTC

#56 Learning Robust Intervention Representations with Delta Embeddings #56 使用增量嵌入学习鲁棒的干预表示

Authors: [Panagiotis Alimisis](https://arxiv.org/search/?searchtype=author&query=Panagiotis Alimisis), [Christos Diou](https://arxiv.org/search/?searchtype=author&query=Christos Diou) 作者：Panagiotis Alimisis，Christos Diou

Causal representation learning has attracted significant research interest during the past few years, as a means for improving model generalization and robustness. Causal representations of interventional image pairs have the property that only variables corresponding to scene elements affected by the intervention / action are changed between the start state and the end state. While most work in this area has focused on identifying and representing the variables of the scene under a causal model, fewer efforts have focused on representations of the interventions themselves. In this work, we show that an effective strategy for improving out-of-distribution (OOD) robustness is to focus on the representation of interventions in the latent space. Specifically, we propose that an intervention can be represented by a Causal Delta Embedding that is invariant to the visual scene and sparse in terms of the causal variables it affects. Leveraging this insight, we propose a framework that is capable of learning causal representations from image pairs, without any additional supervision. Experiments in the Causal Triplet challenge demonstrate that Causal Delta Embeddings are highly effective in OOD settings, significantly exceeding baseline performance in both synthetic and real-world benchmarks. 因果表示学习在过去几年中引起了广泛的研究兴趣，作为提升模型泛化能力和鲁棒性的一种手段。干预图像对的因果表示具有这样的特性：只有与干预/动作相关的场景元素对应的变量在起始状态和结束状态之间发生变化。尽管该领域的大多数工作集中在识别和表示因果模型下的场景变量，但较少有研究关注干预本身的表示。在本工作中，我们展示了一种提升分布外（OOD）鲁棒性的有效策略，即聚焦于潜在空间中干预的表示。具体而言，我们提出干预可以用一种因果增量嵌入（Causal Delta Embedding）来表示，该嵌入对视觉场景不变，并且在其影响的因果变量上是稀疏的。基于这一见解，我们提出了一个能够从图像对中学习因果表示的框架，无需任何额外监督。因果三元组挑战中的实验表明，因果增量嵌入在 OOD 环境中表现出极高的有效性，在合成和真实世界的基准测试中均显著超越了基线性能。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-06 14:39:34 UTC 发布时间：2025-08-06 14:39:34 UTC

#57 Hierarchical Scoring for Machine Learning Classifier Error Impact Evaluation #57 机器学习分类器错误影响评估的层级评分

Authors: [Erin Lanus](https://arxiv.org/search/?searchtype=author&query=Erin Lanus), [Daniel Wolodkin](https://arxiv.org/search/?searchtype=author&query=Daniel Wolodkin), [Laura J. Freeman](https://arxiv.org/search/?searchtype=author&query=Laura J. Freeman) 作者：Erin Lanus，Daniel Wolodkin，Laura J. Freeman

A common use of machine learning (ML) models is predicting the class of a sample. Object detection is an extension of classification that includes localization of the object via a bounding box within the sample. Classification, and by extension object detection, is typically evaluated by counting a prediction as incorrect if the predicted label does not match the ground truth label. This pass/fail scoring treats all misclassifications as equivalent. In many cases, class labels can be organized into a class taxonomy with a hierarchical structure to either reflect relationships among the data or operator valuation of misclassifications. When such a hierarchical structure exists, hierarchical scoring metrics can return the model performance of a given prediction related to the distance between the prediction and the ground truth label. Such metrics can be viewed as giving partial credit to predictions instead of pass/fail, enabling a finer-grained understanding of the impact of misclassifications. This work develops hierarchical scoring metrics varying in complexity that utilize scoring trees to encode relationships between class labels and produce metrics that reflect distance in the scoring tree. The scoring metrics are demonstrated on an abstract use case with scoring trees that represent three weighting strategies and evaluated by the kind of errors discouraged. Results demonstrate that these metrics capture errors with finer granularity and the scoring trees enable tuning. This work demonstrates an approach to evaluating ML performance that ranks models not only by how many errors are made but by the kind or impact of errors. Python implementations of the scoring metrics will be available in an open-source repository at time of publication. 
机器学习（ML）模型的一个常见用途是预测样本的类别。目标检测是分类的扩展，除了分类外还包括通过边界框定位样本中的对象。分类及其扩展的目标检测，通常通过将预测标签与真实标签不匹配的预测计为错误来进行评估。这种通过/不通过的评分方式将所有错误分类视为等同。在许多情况下，类别标签可以组织成具有层级结构的类别分类法，以反映数据之间的关系或操作员对错误分类的评估。当存在这样的层级结构时，层级评分指标可以根据预测标签与真实标签之间的距离返回模型的性能。这类指标可以被视为对预测给予部分积分，而非简单的通过/不通过，从而实现对错误分类影响的更细致理解。本研究开发了层级评分指标，复杂度各异，利用评分树来编码类别标签之间的关系，并生成反映评分树中距离的指标。评分指标在一个抽象的使用案例中进行了演示，该案例使用代表三种加权策略的评分树，并通过所抑制的错误类型进行评估。结果表明，这些指标能够以更细粒度捕捉错误，且评分树支持调优。该研究展示了一种评估机器学习性能的方法，不仅根据错误数量对模型进行排名，还根据错误的类型或影响进行排名。评分指标的 Python 实现将在发表时提供于开源仓库。
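
To make the idea of partial credit concrete, here is a minimal sketch of a distance-based score on a scoring tree. It is an illustration only, not one of the paper's metrics: the tree is a hypothetical child-to-parent dictionary, edge weights default to 1, and the linear mapping from distance to a score in [0, 1] is an assumption.

```python
def tree_distance(a, b, parent, weight=None):
    """Weighted path length between labels a and b in a tree.

    parent maps each non-root label to its parent; weight maps a label to the
    cost of the edge to its parent (default 1.0). Both labels must be in the same tree.
    """
    weight = weight or {}
    # Record the cumulative cost from a to each of its ancestors (including a itself).
    cost_from_a = {a: 0.0}
    node, total = a, 0.0
    while node in parent:
        total += weight.get(node, 1.0)
        node = parent[node]
        cost_from_a[node] = total
    # Walk up from b until we meet one of a's ancestors (the lowest common ancestor).
    node, total = b, 0.0
    while node not in cost_from_a:
        total += weight.get(node, 1.0)
        node = parent[node]
    return cost_from_a[node] + total


def partial_credit(pred, truth, parent, max_distance, weight=None):
    """1.0 for an exact match, decreasing linearly to 0.0 at max_distance."""
    return max(0.0, 1.0 - tree_distance(pred, truth, parent, weight) / max_distance)
```

For example, with a hypothetical taxonomy `{"cat": "animal", "dog": "animal", "car": "vehicle", "animal": "root", "vehicle": "root"}` and `max_distance=4`, confusing a cat with a dog earns half credit, while confusing a cat with a car earns none. Raising the weight of an edge makes errors that cross it more costly, which is one way a scoring tree can be tuned.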

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-06 14:37:18 UTC 发布时间：2025-08-06 14:37:18 UTC

#58 Benchmarking Quantum and Classical Sequential Models for Urban Telecommunication Forecasting #58 城市电信预测中量子与经典序列模型的基准测试

Authors: [Chi-Sheng Chen](https://arxiv.org/search/?searchtype=author&query=Chi-Sheng Chen), [Samuel Yen-Chi Chen](https://arxiv.org/search/?searchtype=author&query=Samuel Yen-Chi Chen), [Yun-Cheng Tsai](https://arxiv.org/search/?searchtype=author&query=Yun-Cheng Tsai) 作者：陈启胜，陈彦志，蔡昀成

In this study, we evaluate the performance of classical and quantum-inspired sequential models in forecasting univariate time series of incoming SMS activity (SMS-in) using the Milan Telecommunication Activity Dataset. Due to data completeness limitations, we focus exclusively on the SMS-in signal for each spatial grid cell. We compare five models, LSTM (baseline), Quantum LSTM (QLSTM), Quantum Adaptive Self-Attention (QASA), Quantum Receptance Weighted Key-Value (QRWKV), and Quantum Fast Weight Programmers (QFWP), under varying input sequence lengths (4, 8, 12, 16, 32 and 64). All models are trained to predict the next 10-minute SMS-in value based solely on historical values within a given sequence window. Our findings indicate that different models exhibit varying sensitivities to sequence length, suggesting that quantum enhancements are not universally advantageous. Rather, the effectiveness of quantum modules is highly dependent on the specific task and architectural design, reflecting inherent trade-offs among model size, parameterization strategies, and temporal modeling capabilities. 在本研究中，我们评估了经典和量子启发的序列模型在使用米兰电信活动数据集预测单变量短信接收活动（SMS-in）时间序列中的表现。由于数据完整性限制，我们专注于每个空间网格单元的 SMS-in 信号。我们比较了五种模型：LSTM（基线）、量子 LSTM（QLSTM）、量子自适应自注意力（QASA）、量子接收加权键值（QRWKV）和量子快速权重程序（QFWP），在不同输入序列长度（4、8、12、16、32 和 64）下的表现。所有模型均训练以仅基于给定序列窗口内的历史值预测下一个 10 分钟的 SMS-in 值。我们的研究结果表明，不同模型对序列长度的敏感度各异，表明量子增强并非普遍有利。相反，量子模块的有效性高度依赖于具体任务和架构设计，反映了模型规模、参数化策略和时间建模能力之间的内在权衡。
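
The forecasting setup described above (predict the next value from a fixed-length history) amounts to slicing each univariate series into sliding windows. The sketch below shows that preprocessing step only; the function name `make_windows` and the toy series are hypothetical, and no model from the study is reproduced.

```python
def make_windows(series, window):
    """Split a univariate series into (history, next_value) training pairs.

    Each history holds `window` consecutive values; the target is the value
    that immediately follows it.
    """
    return [(series[start:start + window], series[start + window])
            for start in range(len(series) - window)]


# Sequence lengths compared in the study; the series here is a toy stand-in.
toy_series = list(range(100))
pairs_per_length = {w: len(make_windows(toy_series, w)) for w in (4, 8, 12, 16, 32, 64)}
```

Longer windows give each model more context but leave fewer training pairs per grid cell, which is one reason sensitivity to sequence length can differ between architectures.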

Subjects: Quantum Physics, Artificial Intelligence 主题：量子物理，人工智能

Publish: 2025-08-06 14:37:07 UTC 发布时间：2025-08-06 14:37:07 UTC

#59 Metric Learning in an RKHS #59 RKHS 中的度量学习

Authors: [Gokcan Tatli](https://arxiv.org/search/?searchtype=author&query=Gokcan Tatli), [Yi Chen](https://arxiv.org/search/?searchtype=author&query=Yi Chen), [Blake Mason](https://arxiv.org/search/?searchtype=author&query=Blake Mason), [Robert Nowak](https://arxiv.org/search/?searchtype=author&query=Robert Nowak), [Ramya Korlakai Vinayak](https://arxiv.org/search/?searchtype=author&query=Ramya Korlakai Vinayak) 作者：Gokcan Tatli，Yi Chen，Blake Mason，Robert Nowak，Ramya Korlakai Vinayak

Metric learning from a set of triplet comparisons in the form of “Do you think item h is more similar to item i or item j?”, indicating similarity and differences between items, plays a key role in various applications including image retrieval, recommendation systems, and cognitive psychology. The goal is to learn a metric in the RKHS that reflects the comparisons. Nonlinear metric learning using kernel methods and neural networks has shown great empirical promise. While previous works have addressed certain aspects of this problem, there is little or no theoretical understanding of such methods. The exception is the special (linear) case in which the RKHS is the standard Euclidean space R^d; there is a comprehensive theory for metric learning in R^d. This paper develops a general RKHS framework for metric learning and provides novel generalization guarantees and sample complexity bounds. We validate our findings through a set of simulations and experiments on real datasets. Our code is publicly available at https://github.com/RamyaLab/metric-learning-RKHS. 从一组三元组比较中进行度量学习，形式为“你认为物品 h 与物品 i 更相似还是与物品 j 更相似？”，用于指示物品之间的相似性和差异性，在图像检索、推荐系统和认知心理学等多种应用中起着关键作用。目标是在再生核希尔伯特空间（RKHS）中学习一个反映这些比较的度量。利用核方法和神经网络进行的非线性度量学习已显示出极大的经验潜力。尽管先前的工作已解决了该问题的某些方面，但对这些方法的理论理解几乎不存在。唯一的例外是 RKHS 为标准欧几里得空间的特殊（线性）情况；在该情况下，度量学习已有完整的理论体系。本文构建了一个通用的 RKHS 度量学习框架，并提供了新颖的泛化保证和样本复杂度界限。我们通过一系列模拟和真实数据集实验验证了我们的发现。我们的代码公开发布于 https://github.com/RamyaLab/metric-learning-RKHS。
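
For the linear special case mentioned in the abstract, a triplet comparison can be scored with a hinge loss on squared distances under a linear map. This is a generic sketch, not the code from the linked repository: the matrix `L`, the margin, and all function names are assumptions, and the RKHS version would replace the raw vectors with kernel feature maps.

```python
def apply_linear(L, v):
    """Matrix-vector product with L given as a list of rows."""
    return [sum(l * x for l, x in zip(row, v)) for row in L]


def squared_distance(L, x, y):
    """Squared distance ||L(x - y)||^2 under the learned linear map L."""
    diff = [a - b for a, b in zip(x, y)]
    return sum(t * t for t in apply_linear(L, diff))


def triplet_hinge(L, h, i, j, margin=1.0):
    """Loss for the answer "h is more similar to i than to j".

    Zero when d(h, i) is smaller than d(h, j) by at least the margin.
    """
    return max(0.0, margin + squared_distance(L, h, i) - squared_distance(L, h, j))
```

Summing this loss over all observed triplets and minimizing over `L` is one standard way to fit a metric that agrees with the comparisons.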

Subjects: Machine Learning, Artificial Intelligence, Machine Learning 主题：机器学习，人工智能，机器学习

Publish: 2025-08-06 14:29:04 UTC 发布时间：2025-08-06 14:29:04 UTC

#60 Zero-Residual Concept Erasure via Progressive Alignment in Text-to-Image Model #60 通过渐进对齐实现文本到图像模型中的零残差概念擦除

Authors: [Hongxu Chen](https://arxiv.org/search/?searchtype=author&query=Hongxu Chen), [Zhen Wang](https://arxiv.org/search/?searchtype=author&query=Zhen Wang), [Taoran Mei](https://arxiv.org/search/?searchtype=author&query=Taoran Mei), [Lin Li](https://arxiv.org/search/?searchtype=author&query=Lin Li), [Bowei Zhu](https://arxiv.org/search/?searchtype=author&query=Bowei Zhu), [Runshi Li](https://arxiv.org/search/?searchtype=author&query=Runshi Li), [Long Chen](https://arxiv.org/search/?searchtype=author&query=Long Chen) 作者：陈鸿旭，王震，梅涛然，李林，朱博伟，李润石，陈龙

Concept Erasure, which aims to prevent pretrained text-to-image models from generating content associated with semantic-harmful concepts (i.e., target concepts), is getting increased attention. State-of-the-art methods formulate this task as an optimization problem: they align all target concepts with semantic-harmless anchor concepts, and apply closed-form solutions to update the model accordingly. While these closed-form methods are efficient, we argue that existing methods have two overlooked limitations: 1) They often result in incomplete erasure due to “non-zero alignment residual”, especially when text prompts are relatively complex. 2) They may suffer from generation quality degradation as they always concentrate parameter updates in a few deep layers. To address these issues, we propose a novel closed-form method ErasePro: it is designed for more complete concept erasure and better preserving overall generative quality. Specifically, ErasePro first introduces a strict zero-residual constraint into the optimization objective, ensuring perfect alignment between target and anchor concept features and enabling more complete erasure. Secondly, it employs a progressive, layer-wise update strategy that gradually transfers target concept features to those of the anchor concept from shallow to deep layers. As the depth increases, the required parameter changes diminish, thereby reducing deviations in sensitive deep layers and preserving generative quality. Empirical results across different concept erasure tasks (including instance, art style, and nudity erasure) have demonstrated the effectiveness of our ErasePro. 
概念消除旨在防止预训练的文本到图像模型生成与语义有害概念（即目标概念）相关的内容，正受到越来越多的关注。最先进的方法将此任务表述为一个优化问题：它们将所有目标概念与语义无害的锚点概念对齐，并应用闭式解来相应地更新模型。虽然这些闭式方法效率较高，但我们认为现有方法存在两个被忽视的局限性：1）由于“非零对齐残差”，它们常常导致消除不完全，尤其是在文本提示相对复杂时。2）它们可能会导致生成质量下降，因为参数更新总是集中在少数几个深层。为了解决这些问题，我们提出了一种新颖的闭式方法 ErasePro：该方法旨在实现更完整的概念消除并更好地保持整体生成质量。具体而言，ErasePro 首先在优化目标中引入严格的零残差约束，确保目标概念与锚点概念特征之间的完美对齐，从而实现更完整的消除。其次，它采用了一种渐进式的逐层更新策略，逐步将目标概念特征从浅层到深层转移到锚点概念特征。随着深度的增加，所需的参数变化减少，从而降低了对敏感深层的偏差，保持了生成质量。不同概念擦除任务（包括实例、艺术风格和裸体擦除）的实证结果证明了我们 ErasePro 的有效性。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning 主题：计算机视觉与模式识别，人工智能，机器学习

Publish: 2025-08-06 14:19:32 UTC 发布时间：2025-08-06 14:19:32 UTC

#61 Small transformer architectures for task switching #61 用于任务切换的小型 Transformer 架构

Author: [Claudius Gros](https://arxiv.org/search/?searchtype=author&query=Claudius Gros) 作者：Claudius Gros

The rapid progress seen in terms of large-scale generative AI is largely based on the attention mechanism. It is conversely non-trivial to conceive small-scale applications for which attention-based architectures outperform traditional approaches, such as multi-layer perceptrons or recurrent networks. We examine this problem in the context of 'task switching'. In this framework models work on ongoing token sequences with the current task being determined by stochastically interspersed control tokens. We show that standard transformers cannot solve a basic task switching reference model based on finite domain arithmetics which contains subtasks dedicated to increment / addition / reverse copy / context (IARC). We show that transformers, long short-term memory recurrent networks (LSTM), and plain multi-layer perceptrons (MLPs) achieve similar, but only modest prediction accuracies. We enlarge our comparative study by including an extension of the standard transformer architecture to its non-translational invariant counterpart, the cisformer, and an alternative attention mechanism, extensive attention. A combination of the latter is found to be the only model able to achieve considerable performance levels, of around 95%. Our results indicate that the workings of attention can be understood better, and even improved, when comparing qualitatively different formulations in task-switching settings. 大规模生成式人工智能的快速进展在很大程度上基于注意力机制。相反，设计出基于注意力架构在小规模应用中优于传统方法（如多层感知机或循环网络）的方案并非易事。我们在“任务切换”的背景下研究了这个问题。在该框架中，模型处理正在进行的标记序列，当前任务由随机插入的控制标记决定。我们展示了标准变换器无法解决基于有限域算术的基本任务切换参考模型，该模型包含专门用于递增/加法/反向复制/上下文（IARC）的子任务。我们表明，变换器、长短期记忆循环网络（LSTM）和普通多层感知机（MLP）实现了相似但仅为适度的预测准确率。我们通过引入标准变换器架构的非平移不变对应物——cisformer，以及一种替代的注意力机制——广泛注意力，扩大了我们的比较研究。后一种方法的组合被发现是唯一能够达到约 95%显著性能水平的模型。我们的结果表明，在任务切换环境中比较定性不同的表述，可以更好地理解注意力的工作机制，甚至加以改进。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-06 14:01:05 UTC 发布时间：2025-08-06 14:01:05 UTC

#62 Automatic LLM Red Teaming #62 自动 LLM 红队测试

Authors: [Roman Belaire](https://arxiv.org/search/?searchtype=author&query=Roman Belaire), [Arunesh Sinha](https://arxiv.org/search/?searchtype=author&query=Arunesh Sinha), [Pradeep Varakantham](https://arxiv.org/search/?searchtype=author&query=Pradeep Varakantham) 作者：Roman Belaire，Arunesh Sinha，Pradeep Varakantham

Red teaming is critical for identifying vulnerabilities and building trust in current LLMs. However, current automated methods for Large Language Models (LLMs) rely on brittle prompt templates or single-turn attacks, failing to capture the complex, interactive nature of real-world adversarial dialogues. We propose a novel paradigm: training an AI to strategically 'break' another AI. By formalizing red teaming as a Markov Decision Process (MDP) and employing a hierarchical Reinforcement Learning (RL) framework, we effectively address the inherent sparse reward and long-horizon challenges. Our generative agent learns coherent, multi-turn attack strategies through a fine-grained, token-level harm reward, enabling it to uncover subtle vulnerabilities missed by existing baselines. This approach sets a new state-of-the-art, fundamentally reframing LLM red teaming as a dynamic, trajectory-based process (rather than a one-step test) essential for robust AI deployment. 红队测试对于识别漏洞和建立当前 LLMs 的信任至关重要。然而，现有针对大型语言模型（LLMs）的自动化方法依赖脆弱的提示模板或单轮攻击，未能捕捉现实世界对抗性对话的复杂互动特性。我们提出了一种新范式：训练一个 AI 有策略地“攻破”另一个 AI。通过将红队测试形式化为马尔可夫决策过程（MDP）并采用分层强化学习（RL）框架，我们有效解决了内在的稀疏奖励和长时域挑战。我们的生成代理通过细粒度的、基于 token 的伤害奖励，学习连贯的多轮攻击策略，使其能够发现现有基线方法遗漏的细微漏洞。这一方法树立了新的最先进水平，根本上将 LLM 红队测试重新定义为一个动态的、基于轨迹的过程（而非一次性测试），这对于稳健的 AI 部署至关重要。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-06 13:52:00 UTC 发布时间：2025-08-06 13:52:00 UTC

#63 Cloud Model Characteristic Function Auto-Encoder: Integrating Cloud Model Theory with MMD Regularization for Enhanced Generative Modeling #63 云模型特征函数自编码器：结合云模型理论与 MMD 正则化以增强生成建模

Authors: [Biao Hu](https://arxiv.org/search/?searchtype=author&query=Biao Hu), [Guoyin Wang](https://arxiv.org/search/?searchtype=author&query=Guoyin Wang) 作者：胡彪，王国银

We introduce Cloud Model Characteristic Function Auto-Encoder (CMCFAE), a novel generative model that integrates the cloud model into the Wasserstein Auto-Encoder (WAE) framework. By leveraging the characteristic functions of the cloud model to regularize the latent space, our approach enables more accurate modeling of complex data distributions. Unlike conventional methods that rely on a standard Gaussian prior and traditional divergence measures, our method employs a cloud model prior, providing a more flexible and realistic representation of the latent space, thus mitigating the homogenization observed in reconstructed samples. We derive the characteristic function of the cloud model and propose a corresponding regularizer within the WAE framework. Extensive quantitative and qualitative evaluations on MNIST, FashionMNIST, CIFAR-10, and CelebA demonstrate that CMCFAE outperforms existing models in terms of reconstruction quality, latent space structuring, and sample diversity. This work not only establishes a novel integration of cloud model theory with MMD-based regularization but also offers a promising new perspective for enhancing autoencoder-based generative models. 我们提出了云模型特征函数自编码器（CMCFAE），这是一种将云模型整合到 Wasserstein 自编码器（WAE）框架中的新型生成模型。通过利用云模型的特征函数来正则化潜在空间，我们的方法能够更准确地建模复杂的数据分布。与依赖标准高斯先验和传统散度度量的常规方法不同，我们的方法采用云模型先验，提供了对潜在空间更灵活且更真实的表示，从而减轻了重构样本中观察到的同质化现象。我们推导了云模型的特征函数，并在 WAE 框架内提出了相应的正则项。在 MNIST、FashionMNIST、CIFAR-10 和 CelebA 上的大量定量和定性评估表明，CMCFAE 在重构质量、潜在空间结构和样本多样性方面均优于现有模型。本工作不仅建立了云模型理论与基于 MMD 正则化的创新整合，还为提升基于自编码器的生成模型提供了一个有前景的新视角。
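
For readers unfamiliar with MMD-based latent regularization in the WAE framework, the sketch below computes a biased estimate of the squared Maximum Mean Discrepancy between two sample sets using an RBF kernel. It illustrates the generic MMD penalty only: the cloud-model characteristic-function regularizer proposed in the paper is not reproduced, and the kernel choice, bandwidth, and function names are assumptions.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    squared = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared / (2.0 * bandwidth ** 2))


def mmd2_biased(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples xs and ys."""
    def mean_kernel(us, vs):
        return sum(rbf_kernel(u, v, bandwidth) for u in us for v in vs) / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)
```

In a WAE-style objective, `xs` would be encoded latent codes and `ys` samples drawn from the chosen prior, with the estimate added to the reconstruction loss as a penalty.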

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-06 13:44:04 UTC 发布时间：2025-08-06 13:44:04 UTC

#64 Automated Generation of Curriculum-Aligned Multiple-Choice Questions for Malaysian Secondary Mathematics Using Generative AI #64 使用生成式人工智能自动生成符合课程标准的马来西亚中学数学选择题

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-06 13:30:51 UTC 发布时间：2025-08-06 13:30:51 UTC

#65 StepFun-Formalizer: Unlocking the Autoformalization Potential of LLMs through Knowledge-Reasoning Fusion #65 StepFun-Formalizer：通过知识推理融合释放 LLMs 的自动形式化潜力

Autoformalization aims to translate natural-language mathematical statements into a formal language. While LLMs have accelerated progress in this area, existing methods still suffer from low accuracy. We identify two key abilities for effective autoformalization: comprehensive mastery of formal-language domain knowledge, and reasoning capability of natural language problem understanding and informal-formal alignment. Without the former, a model cannot identify the correct formal objects; without the latter, it struggles to interpret real-world contexts and map them precisely into formal expressions. To address these gaps, we introduce ThinkingF, a data synthesis and training pipeline that improves both abilities. First, we construct two datasets: one by distilling and selecting large-scale examples rich in formal knowledge, and another by generating informal-to-formal reasoning trajectories guided by expert-designed templates. We then apply SFT and RLVR with these datasets to further fuse and refine the two abilities. The resulting 7B and 32B models exhibit both comprehensive formal knowledge and strong informal-to-formal reasoning. Notably, StepFun-Formalizer-32B achieves SOTA BEq@1 scores of 40.5% on FormalMATH-Lite and 26.7% on ProverBench, surpassing all prior general-purpose and specialized models. 自动形式化旨在将自然语言的数学陈述翻译成形式语言。虽然 LLMs 加速了该领域的进展，但现有方法仍存在准确率低的问题。我们确定了有效自动形式化的两个关键能力：对形式语言领域知识的全面掌握，以及自然语言问题理解和非正式-正式对齐的推理能力。缺乏前者，模型无法识别正确的形式对象；缺乏后者，模型难以解释现实世界的语境并将其精确映射为形式表达。为解决这些不足，我们引入了 ThinkingF，一种数据合成和训练流程，提升这两种能力。首先，我们构建了两个数据集：一个通过提炼和筛选大量富含形式知识的示例，另一个通过专家设计的模板指导生成非正式到正式的推理轨迹。随后，我们利用这些数据集进行 SFT 和 RLVR 训练，进一步融合和优化这两种能力。最终得到的 7B 和 32B 模型既具备全面的形式知识，又拥有强大的非正式到正式推理能力。值得注意的是，StepFun-Formalizer-32B 在 FormalMATH-Lite 上取得了 40.5% 的 SOTA BEq@1 分数，在 ProverBench 上取得了 26.7%，超越了所有先前的通用和专用模型。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-06 13:28:22 UTC 发布时间：2025-08-06 13:28:22 UTC

#66 Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models #66 解码多模态迷宫：基于注意力的多模态模型中可解释性采用的系统综述

Authors: [Md Raisul Kibria](https://arxiv.org/search/?searchtype=author&query=Md Raisul Kibria), [Sébastien Lafond](https://arxiv.org/search/?searchtype=author&query=Sébastien Lafond), [Janan Arslan](https://arxiv.org/search/?searchtype=author&query=Janan Arslan) 作者：Md Raisul Kibria，Sébastien Lafond，Janan Arslan

Multimodal learning has witnessed remarkable advancements in recent years, particularly with the integration of attention-based models, leading to significant performance gains across a variety of tasks. Parallel to this progress, the demand for explainable artificial intelligence (XAI) has spurred a growing body of research aimed at interpreting the complex decision-making processes of these models. This systematic literature review analyzes research published between January 2020 and early 2024 that focuses on the explainability of multimodal models. Framed within the broader goals of XAI, we examine the literature across multiple dimensions, including model architecture, modalities involved, explanation algorithms and evaluation methodologies. Our analysis reveals that the majority of studies are concentrated on vision-language and language-only models, with attention-based techniques being the most commonly employed for explanation. However, these methods often fall short in capturing the full spectrum of interactions between modalities, a challenge further compounded by the architectural heterogeneity across domains. Importantly, we find that evaluation methods for XAI in multimodal settings are largely non-systematic, lacking consistency, robustness, and consideration for modality-specific cognitive and contextual factors. Based on these findings, we provide a comprehensive set of recommendations aimed at promoting rigorous, transparent, and standardized evaluation and reporting practices in multimodal XAI research. Our goal is to support future research in more interpretable, accountable, and responsible multimodal AI systems, with explainability at their core.
多模态学习近年来取得了显著进展，特别是随着基于注意力模型的整合，在各种任务中实现了显著的性能提升。与此进展并行，解释性人工智能（XAI）的需求推动了大量旨在解释这些模型复杂决策过程的研究。本文系统性文献综述分析了 2020 年 1 月至 2024 年初期间发表的关于多模态模型可解释性的研究。基于 XAI 的更广泛目标，我们从多个维度审视文献，包括模型架构、涉及的模态、解释算法及评估方法。我们的分析显示，大多数研究集中于视觉-语言和纯语言模型，且基于注意力的技术是最常用的解释方法。然而，这些方法往往难以全面捕捉模态间的全部交互，这一挑战因各领域架构的异质性而更加复杂。重要的是，我们发现多模态环境下的 XAI 评估方法大多缺乏系统性，缺乏一致性、稳健性，并且未考虑模态特定的认知和情境因素。基于这些发现，我们提供了一套全面的建议，旨在促进多模态 XAI 研究中严格、透明和标准化的评估与报告实践。我们的目标是支持未来研究开发更具可解释性、问责性和责任感的多模态 AI 系统，并以可解释性为核心。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-06 13:14:20 UTC 发布时间：2025-08-06 13:14:20 UTC

#67 Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation #67 三思而后分割：一种面向对象推理的指代视听分割代理

Authors: [Jinxing Zhou](https://arxiv.org/search/?searchtype=author&query=Jinxing Zhou), [Yanghao Zhou](https://arxiv.org/search/?searchtype=author&query=Yanghao Zhou), [Mingfei Han](https://arxiv.org/search/?searchtype=author&query=Mingfei Han), [Tong Wang](https://arxiv.org/search/?searchtype=author&query=Tong Wang), [Xiaojun Chang](https://arxiv.org/search/?searchtype=author&query=Xiaojun Chang), [Hisham Cholakkal](https://arxiv.org/search/?searchtype=author&query=Hisham Cholakkal), [Rao Muhammad Anwer](https://arxiv.org/search/?searchtype=author&query=Rao Muhammad Anwer) 作者：周金星，周阳浩，韩明飞，王彤，常晓军，Hisham Cholakkal，Rao Muhammad Anwer

Referring Audio-Visual Segmentation (Ref-AVS) aims to segment target objects in audible videos based on given reference expressions. Prior works typically rely on learning latent embeddings via multimodal fusion to prompt a tunable SAM/SAM2 decoder for segmentation, which requires strong pixel-level supervision and lacks interpretability. From a novel perspective of explicit reference understanding, we propose TGS-Agent, which decomposes the task into a Think-Ground-Segment process, mimicking the human reasoning procedure by first identifying the referred object through multimodal analysis, followed by coarse-grained grounding and precise segmentation. To this end, we first propose Ref-Thinker, a multimodal language model capable of reasoning over textual, visual, and auditory cues. We construct an instruction-tuning dataset with explicit object-aware think-answer chains for Ref-Thinker fine-tuning. The object description inferred by Ref-Thinker is used as an explicit prompt for Grounding-DINO and SAM2, which perform grounding and segmentation without relying on pixel-level supervision. Additionally, we introduce R²-AVSBench, a new benchmark with linguistically diverse and reasoning-intensive references for better evaluating model generalization. Our approach achieves state-of-the-art results on both the standard Ref-AVSBench and the proposed R²-AVSBench. Code will be available at https://github.com/jasongief/TGS-Agent.
指向视听分割（Ref-AVS）旨在根据给定的参考表达对可听视频中的目标对象进行分割。以往的工作通常依赖通过多模态融合学习潜在嵌入，以提示可调节的 SAM/SAM2 解码器进行分割，这需要强像素级监督且缺乏可解释性。从显式参考理解的新视角出发，我们提出了 TGS-Agent，将任务分解为思考-定位-分割的过程，模拟人类推理步骤，先通过多模态分析识别被指对象，随后进行粗粒度定位和精确分割。为此，我们首先提出了 Ref-Thinker，一种能够对文本、视觉和听觉线索进行推理的多模态语言模型。我们构建了一个包含显式对象感知思考-回答链的指令微调数据集，用于 Ref-Thinker 的微调。Ref-Thinker 推断出的对象描述被用作 Grounding-DINO 和 SAM2 的显式提示，这两者在不依赖像素级监督的情况下执行定位和分割。此外，我们引入了 R²-AVSBench，这是一个具有语言多样性和推理密集型参考的新基准，用于更好地评估模型的泛化能力。我们的方法在标准的 Ref-AVSBench 和提出的 R²-AVSBench 上均取得了最先进的结果。代码将发布于 https://github.com/jasongief/TGS-Agent。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Multiagent Systems, Multimedia 主题：计算机视觉与模式识别，人工智能，多智能体系统，多媒体

Publish: 2025-08-06 13:05:09 UTC 发布时间：2025-08-06 13:05:09 UTC

#68 Deep Learning-based Scalable Image-to-3D Facade Parser for Generating Thermal 3D Building Models #68 基于深度学习的可扩展图像到三维立面解析器，用于生成热力三维建筑模型

Authors: [Yinan Yu](https://arxiv.org/search/?searchtype=author&query=Yinan Yu), [Alex Gonzalez-Caceres](https://arxiv.org/search/?searchtype=author&query=Alex Gonzalez-Caceres), [Samuel Scheidegger](https://arxiv.org/search/?searchtype=author&query=Samuel Scheidegger), [Sanjay Somanath](https://arxiv.org/search/?searchtype=author&query=Sanjay Somanath), [Alexander Hollberg](https://arxiv.org/search/?searchtype=author&query=Alexander Hollberg) 作者：余一楠，亚历克斯·冈萨雷斯-卡塞雷斯，塞缪尔·谢德格，桑杰·索马纳斯，亚历山大·霍尔伯格

Renovating existing buildings is essential for climate impact. Early-phase renovation planning requires simulations based on thermal 3D models at Level of Detail (LoD) 3, which include features like windows. However, scalable and accurate identification of such features remains a challenge. This paper presents the Scalable Image-to-3D Facade Parser (SI3FP), a pipeline that generates LoD3 thermal models by extracting geometries from images using both computer vision and deep learning. Unlike existing methods relying on segmentation and projection, SI3FP directly models geometric primitives in the orthographic image plane, providing a unified interface while reducing perspective distortions. SI3FP supports both sparse (e.g., Google Street View) and dense (e.g., hand-held camera) data sources. Tested on typical Swedish residential buildings, SI3FP achieved approximately 5% error in window-to-wall ratio estimates, demonstrating sufficient accuracy for early-stage renovation analysis. The pipeline facilitates large-scale energy renovation planning and has broader applications in urban development and planning. 翻新现有建筑对于气候影响至关重要。早期阶段的翻新规划需要基于包含窗户等特征的三级细节（LoD 3）热模型进行模拟。然而，如何可扩展且准确地识别这些特征仍然是一个挑战。本文提出了可扩展的图像到三维立面解析器（SI3FP），该流程通过结合计算机视觉和深度学习，从图像中提取几何形状，生成 LoD3 热模型。与依赖分割和投影的现有方法不同，SI3FP 直接在正射影像平面上建模几何基元，提供统一接口的同时减少了透视畸变。SI3FP 支持稀疏（如 Google 街景）和密集（如手持相机）数据源。在典型的瑞典住宅建筑测试中，SI3FP 在窗墙比估计中实现了约 5%的误差，显示出足够的准确性以支持早期翻新分析。该流程促进了大规模能源翻新规划，并在城市开发和规划中具有更广泛的应用。
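The window-to-wall ratio mentioned in the abstract is simple to compute once window rectangles are available in the orthographic facade plane. The sketch below is only an illustration of that arithmetic: the `Rect` type, the facade dimensions, and the window positions are hypothetical, and the abstract does not say whether its "approximately 5% error" is an absolute or a relative figure, so the helper here reports the absolute difference between two ratios.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in the orthographic facade plane, in metres."""
    x: float
    y: float
    width: float
    height: float

    @property
    def area(self) -> float:
        return self.width * self.height


def window_to_wall_ratio(facade: Rect, windows: list[Rect]) -> float:
    """Total window area divided by gross facade area.

    Assumes the window rectangles do not overlap each other.
    """
    if facade.area <= 0:
        raise ValueError("facade must have positive area")
    return sum(w.area for w in windows) / facade.area


def wwr_abs_error(estimate: float, reference: float) -> float:
    """Absolute difference between an estimated and a reference ratio."""
    return abs(estimate - reference)


# Hypothetical 10 m x 6 m wall with three 1.5 m x 1.2 m windows.
facade = Rect(0.0, 0.0, 10.0, 6.0)
windows = [
    Rect(1.0, 1.0, 1.5, 1.2),
    Rect(4.0, 1.0, 1.5, 1.2),
    Rect(7.0, 1.0, 1.5, 1.2),
]
wwr = window_to_wall_ratio(facade, windows)  # 5.4 m^2 of glazing over 60 m^2 of wall
```

Working in the orthographic plane, as the abstract describes, is what makes plain rectangle areas meaningful here: perspective distortion would otherwise skew the measured widths and heights.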

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-06 12:48:53 UTC 发布时间：2025-08-06 12:48:53 UTC

#69 Why are LLMs' abilities emergent? #69 为什么 LLMs 的能力是涌现的？

Author: [Vladimír Havlík](https://arxiv.org/search/?searchtype=author&query=Vladimír Havlík) 作者：Vladimír Havlík

The remarkable success of Large Language Models (LLMs) in generative tasks has raised fundamental questions about the nature of their acquired capabilities, which often appear to emerge unexpectedly without explicit training. This paper examines the emergent properties of Deep Neural Networks (DNNs) through both theoretical analysis and empirical observation, addressing the epistemological challenge of “creation without understanding” that characterises contemporary AI development. We explore how the neural approach’s reliance on nonlinear, stochastic processes fundamentally differs from symbolic computational paradigms, creating systems whose macro-level behaviours cannot be analytically derived from micro-level neuron activities. Through analysis of scaling laws, grokking phenomena, and phase transitions in model capabilities, I demonstrate that emergent abilities arise from the complex dynamics of highly sensitive nonlinear systems rather than simply from parameter scaling alone. My investigation reveals that current debates over metrics, pre-training loss thresholds, and in-context learning miss the fundamental ontological nature of emergence in DNNs. I argue that these systems exhibit genuine emergent properties analogous to those found in other complex natural phenomena, where systemic capabilities emerge from cooperative interactions among simple components without being reducible to their individual behaviours. The paper concludes that understanding LLM capabilities requires recognising DNNs as a new domain of complex dynamical systems governed by universal principles of emergence, similar to those operating in physics, chemistry, and biology. This perspective shifts the focus from purely phenomenological definitions of emergence to understanding the internal dynamic transformations that enable these systems to acquire capabilities that transcend their individual components. 
大型语言模型（LLMs）在生成任务中的显著成功引发了关于其所获得能力本质的根本性问题，这些能力常常在没有明确训练的情况下意外出现。本文通过理论分析和实证观察，探讨了深度神经网络（DNNs）的涌现特性，回应了当代人工智能发展中“无理解的创造”这一认识论挑战。我们探讨了神经方法依赖非线性、随机过程的本质区别于符号计算范式，造就了其宏观行为无法从微观神经元活动中解析推导的系统。通过对规模定律、grokking 现象以及模型能力相变的分析，我展示了涌现能力源自高度敏感非线性系统的复杂动力学，而非仅仅是参数规模的简单扩展。我的研究揭示，目前关于指标、预训练损失阈值和上下文学习的争论忽视了 DNN 涌现的根本本体性质。我认为这些系统表现出真正的涌现特性，类似于其他复杂自然现象中发现的特性，其中系统能力是由简单组件之间的协作互动产生的，且无法简化为其个体行为。本文结论指出，理解 LLM 的能力需要将深度神经网络（DNN）视为一个新的复杂动力系统领域，该领域受涌现的普遍原理支配，类似于物理、化学和生物学中运作的原理。这一视角将关注点从纯粹的现象学涌现定义转向理解使这些系统获得超越其个体组件能力的内部动态转变。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-06 12:43:04 UTC 发布时间：2025-08-06 12:43:04 UTC

#70 Improving Crash Data Quality with Large Language Models: Evidence from Secondary Crash Narratives in Kentucky #70 利用大型语言模型提升事故数据质量：来自肯塔基州二次事故叙述的证据

Authors: [Xu Zhang](https://arxiv.org/search/?searchtype=author&query=Xu Zhang), [Mei Chen](https://arxiv.org/search/?searchtype=author&query=Mei Chen) 作者：张旭，陈梅

This study evaluates advanced natural language processing (NLP) techniques to enhance crash data quality by mining crash narratives, using secondary crash identification in Kentucky as a case study. Drawing from 16,656 manually reviewed narratives from 2015-2022, with 3,803 confirmed secondary crashes, we compare three model classes: zero-shot open-source large language models (LLMs) (LLaMA3:70B, DeepSeek-R1:70B, Qwen3:32B, Gemma3:27B); fine-tuned transformers (BERT, DistilBERT, RoBERTa, XLNet, Longformer); and traditional logistic regression as baseline. Models were calibrated on 2015-2021 data and tested on 1,771 narratives from 2022. Fine-tuned transformers achieved superior performance, with RoBERTa yielding the highest F1-score (0.90) and accuracy (95%). Zero-shot LLaMA3:70B reached a comparable F1 of 0.86 but required 139 minutes of inference; the logistic baseline lagged well behind (F1: 0.66). LLMs excelled in recall for some variants (e.g., Gemma3:27B at 0.94) but incurred high computational costs (up to 723 minutes for DeepSeek-R1:70B), while fine-tuned models processed the test set in seconds after brief training. Further analysis indicated that mid-sized LLMs (e.g., DeepSeek-R1:32B) can rival larger counterparts in performance while reducing runtime, suggesting opportunities for optimized deployments. Results highlight trade-offs between accuracy, efficiency, and data requirements, with fine-tuned transformer models balancing precision and recall effectively on Kentucky data. Practical deployment considerations emphasize privacy-preserving local deployment, ensemble approaches for improved accuracy, and incremental processing for scalability, providing a replicable scheme for enhancing crash-data quality with advanced NLP.
本研究评估了先进的自然语言处理（NLP）技术，通过挖掘事故叙述来提升事故数据质量，以肯塔基州的二次事故识别为案例。基于 2015-2022 年间 16,656 条人工审核的叙述，其中 3,803 条确认为二次事故，我们比较了三类模型：零样本开源大型语言模型（LLMs）（LLaMA3:70B、DeepSeek-R1:70B、Qwen3:32B、Gemma3:27B）；微调的变换器模型（BERT、DistilBERT、RoBERTa、XLNet、Longformer）；以及作为基线的传统逻辑回归。模型在 2015-2021 年数据上进行了校准，并在 2022 年 1,771 条叙述上进行了测试。微调的变换器模型表现最佳，其中 RoBERTa 取得了最高的 F1 分数（0.90）和准确率（95%）。零样本的 LLaMA3:70B 达到了相近的 F1 分数（0.86），但推理时间长达 139 分钟；逻辑回归基线表现较差（F1：0.66）。LLMs 在某些变体的召回率上表现出色（例如 GEMMA3:27B 达到 0.94），但计算成本高昂（DeepSeek-R1:70B 最长达 723 分钟），而微调模型经过短时间训练后，能够在几秒钟内处理测试集。进一步分析表明，中型 LLMs（例如 DeepSeek-R1:32B）在性能上可以与更大型的模型相媲美，同时减少运行时间，表明存在优化部署的机会。结果突出了准确性、效率和数据需求之间的权衡，经过微调的 transformer 模型在肯塔基数据上有效地平衡了精确率和召回率。实际部署考虑强调了保护隐私的本地部署、通过集成方法提升准确性以及增量处理以实现可扩展性，提供了一种利用先进 NLP 技术提升事故数据质量的可复制方案。
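The abstract reports F1, recall, and accuracy side by side, so a short sketch of how these relate may help when reading the numbers. The dataset sizes below come from the abstract; the calibration-set size assumes the 1,771 test narratives from 2022 are part of the 16,656 total. The confusion counts are hypothetical and are not results from the study.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 for the positive (secondary crash) class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Figures stated in the abstract.
total_narratives = 16_656
confirmed_secondary = 3_803
test_narratives = 1_771

# Assumption: the 2022 test set is a subset of the 16,656 reviewed narratives.
train_narratives = total_narratives - test_narratives
prevalence = confirmed_secondary / total_narratives  # share of positives overall

# Hypothetical confusion counts, chosen only to show how the metrics behave.
balanced = precision_recall_f1(tp=90, fp=10, fn=10)      # precision = recall = 0.90
recall_heavy = precision_recall_f1(tp=94, fp=60, fn=6)   # recall 0.94, many false alarms
```

The second hypothetical case shows why a recall of 0.94 on its own does not imply a high F1: many false positives pull precision down, and F1 follows.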

Subjects: Computation and Language, Artificial Intelligence, Information Retrieval, Machine Learning 主题：计算与语言、人工智能、信息检索、机器学习

Publish: 2025-08-06 12:41:18 UTC 发布时间：2025-08-06 12:41:18 UTC

#71 AIC CTU@FEVER 8: On-premise fact checking through long context RAG #71 AIC CTU@FEVER 8：通过长上下文 RAG 进行本地事实核查

In this paper, we present our fact-checking pipeline, which placed first in the FEVER 8 shared task. Our fact-checking system is a simple two-step RAG pipeline based on our submission from last year. We show how the pipeline can be redeployed on-premise, achieving state-of-the-art fact-checking performance (in the sense of the Ev2R test score), even under the constraint of a single NVIDIA A10 GPU, 23 GB of GPU memory, and 60 s of running time per claim. 在本文中，我们介绍了我们的事实核查流程，该流程在 FEVER 8 共享任务中获得了第一名。我们的事实核查系统是基于去年的提交的一个简单的两步 RAG 流程。我们展示了如何在本地重新部署该流程，即使在仅使用单个 NVIDIA A10 GPU、23GB 显存和每条声明 60 秒运行时间的限制下，也能实现最先进的事实核查性能（以 Ev2R 测试分数衡量）。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-05 14:03:43 UTC 发布时间：2025-08-05 14:03:43 UTC

#72 ProtoN: Prototype Node Graph Neural Network for Unconstrained Multi-Impression Ear Recognition #72 ProtoN：用于无约束多印象耳朵识别的原型节点图神经网络

Authors: [Santhoshkumar Peddi](https://arxiv.org/search/?searchtype=author&query=Santhoshkumar Peddi), [Sadhvik Bathini](https://arxiv.org/search/?searchtype=author&query=Sadhvik Bathini), [Arun Balasubramanian](https://arxiv.org/search/?searchtype=author&query=Arun Balasubramanian), [Monalisa Sarma](https://arxiv.org/search/?searchtype=author&query=Monalisa Sarma), [Debasis Samanta](https://arxiv.org/search/?searchtype=author&query=Debasis Samanta) 作者：Santhoshkumar Peddi，Sadhvik Bathini，Arun Balasubramanian，Monalisa Sarma，Debasis Samanta

Ear biometrics offer a stable and contactless modality for identity recognition, yet their effectiveness remains limited by the scarcity of annotated data and significant intra-class variability. Existing methods typically extract identity features from individual impressions in isolation, restricting their ability to capture consistent and discriminative representations. To overcome these limitations, a few-shot learning framework, ProtoN, is proposed to jointly process multiple impressions of an identity using a graph-based approach. Each impression is represented as a node in a class-specific graph, alongside a learnable prototype node that encodes identity-level information. This graph is processed by a Prototype Graph Neural Network (PGNN) layer, specifically designed to refine both impression and prototype representations through a dual-path message-passing mechanism. To further enhance discriminative power, the PGNN incorporates a cross-graph prototype alignment strategy that improves class separability by enforcing intra-class compactness while maintaining inter-class distinction. Additionally, a hybrid loss function is employed to balance episodic and global classification objectives, thereby improving the overall structure of the embedding space. Extensive experiments on five benchmark ear datasets demonstrate that ProtoN achieves state-of-the-art performance, with Rank-1 identification accuracy of up to 99.60% and an Equal Error Rate (EER) as low as 0.025, showing the effectiveness for few-shot ear recognition under limited data conditions. 
耳朵生物识别提供了一种稳定且无接触的身份识别方式，但其效果仍受限于标注数据的稀缺性和显著的类内变异性。现有方法通常孤立地从单个印象中提取身份特征，限制了其捕捉一致且具有区分性的表示的能力。为克服这些限制，提出了一种基于图的少样本学习框架 ProtoN，用于联合处理同一身份的多个印象。每个印象被表示为类别特定图中的一个节点，图中还包含一个可学习的原型节点，用以编码身份级别的信息。该图通过一个原型图神经网络（PGNN）层进行处理，该层专门设计用于通过双路径消息传递机制细化印象和原型的表示。为了进一步增强区分能力，PGNN 引入了跨图原型对齐策略，通过强化类内紧凑性同时保持类间区分性，提升了类别的可分性。此外，采用混合损失函数来平衡情景分类和全局分类目标，从而提升嵌入空间的整体结构。在五个基准耳朵数据集上的大量实验表明，ProtoN 实现了最先进的性能，Rank-1 识别准确率高达 99.60%，等错误率（EER）低至 0.025，展示了其在有限数据条件下进行少样本耳朵识别的有效性。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-06 12:21:38 UTC 发布时间：2025-08-06 12:21:38 UTC

Author: [Anderson de Lima Luiz](https://arxiv.org/search/?searchtype=author&query=Anderson de Lima Luiz) 作者：Anderson de Lima Luiz

This paper introduces the Learned User Significance Tracker (LUST), a framework designed to analyze video content and quantify the thematic relevance of its segments in relation to a user-provided textual description of significance. LUST leverages a multi-modal analytical pipeline, integrating visual cues from video frames with textual information extracted via Automatic Speech Recognition (ASR) from the audio track. The core innovation lies in a hierarchical, two-stage relevance scoring mechanism employing Large Language Models (LLMs). An initial “direct relevance” score, $S_{d,i}$, assesses individual segments based on immediate visual and auditory content against the theme. This is followed by a “contextual relevance” score, $S_{c,i}$, that refines the assessment by incorporating the temporal progression of preceding thematic scores, allowing the model to understand evolving narratives. The LUST framework aims to provide a nuanced, temporally-aware measure of user-defined significance, outputting an annotated video with visualized relevance scores and comprehensive analytical logs. 本文介绍了 Learned User Significance Tracker（LUST），一个旨在分析视频内容并量化其片段与用户提供的文本重要性描述之间主题相关性的框架。LUST 利用多模态分析流程，将视频帧中的视觉线索与通过自动语音识别（ASR）从音频轨道提取的文本信息相结合。其核心创新在于采用分层的两阶段相关性评分机制，利用 LLMs。初始的“直接相关性”评分 $S_{d,i}$，基于即时的视觉和听觉内容评估各个片段与主题的匹配度。随后，“上下文相关性”评分 $S_{c,i}$，通过结合之前主题评分的时间进展来细化评估，使模型能够理解不断发展的叙事。LUST 框架旨在提供一种细致且具时间感知的用户定义重要性度量，输出带有可视化相关性评分和全面分析日志的注释视频。

Subjects: Multimedia, Artificial Intelligence 主题：多媒体，人工智能

Publish: 2025-08-06 11:48:51 UTC 发布时间：2025-08-06 11:48:51 UTC

#74 Chain of Questions: Guiding Multimodal Curiosity in Language Models #74 问题链：引导语言模型中的多模态好奇心

Reasoning capabilities in large language models (LLMs) have substantially advanced through methods such as chain-of-thought and explicit step-by-step explanations. However, these improvements have not yet fully transitioned to multimodal contexts, where models must proactively decide which sensory modalities such as vision, audio, or spatial perception to engage when interacting with complex real-world environments. In this paper, we introduce the Chain of Questions (CoQ) framework, a curiosity-driven reasoning approach that encourages multimodal language models to dynamically generate targeted questions regarding their surroundings. These generated questions guide the model to selectively activate relevant modalities, thereby gathering critical information necessary for accurate reasoning and response generation. We evaluate our framework on a novel multimodal benchmark dataset, assembled by integrating WebGPT, ScienceQA, AVSD, and ScanQA datasets. Experimental results demonstrate that our CoQ method improves a foundation model’s ability to effectively identify and integrate pertinent sensory information. This leads to improved accuracy, interpretability, and alignment of the reasoning process with diverse multimodal tasks. 大型语言模型（LLMs）的推理能力通过链式思维和明确的逐步解释等方法得到了显著提升。然而，这些改进尚未完全应用于多模态环境中，在这种环境下，模型必须主动决定在与复杂的现实世界环境交互时应启用哪些感官模态，如视觉、音频或空间感知。本文提出了问题链（Chain of Questions，CoQ）框架，这是一种以好奇心驱动的推理方法，鼓励多模态语言模型动态生成针对其周围环境的具体问题。生成的问题引导模型有选择地激活相关模态，从而收集进行准确推理和响应生成所需的关键信息。我们在一个新颖的多模态基准数据集上评估了该框架，该数据集通过整合 WebGPT、ScienceQA、AVSD 和 ScanQA 数据集构建。实验结果表明，我们的 CoQ 方法提升了基础模型有效识别和整合相关感官信息的能力。这提升了推理过程在多样化多模态任务中的准确性、可解释性和一致性。

Publish: 2025-08-06 11:42:54 UTC 发布时间：2025-08-06 11:42:54 UTC

#75 GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy #75 GTPO 和 GRPO-S：基于策略熵的令牌和序列级奖励塑造

Authors: [Hongze Tan](https://arxiv.org/search/?searchtype=author&query=Hongze Tan), [Jianfei Pan](https://arxiv.org/search/?searchtype=author&query=Jianfei Pan) 作者：谭洪泽，潘建飞

Reinforcement learning (RL) with algorithms like Group Relative Policy Optimization (GRPO) improves Large Language Model (LLM) reasoning, but is limited by a coarse-grained credit assignment that applies a uniform reward to all tokens in a sequence. This is a major flaw in long-chain reasoning tasks. This paper solves this problem with Dynamic Entropy Weighting. Our core idea is that high-entropy tokens in correct responses can guide the policy toward a higher performance ceiling. This allows us to create more fine-grained reward signals for precise policy updates in two ways: 1) Group Token Policy Optimization (GTPO), which assigns an entropy-weighted reward to each token for fine-grained credit assignment; and 2) Sequence-Level Group Relative Policy Optimization (GRPO-S), which assigns an entropy-weighted reward to each sequence based on its average token entropy. Experiments show our methods significantly outperform the strong DAPO baseline. The results confirm that our entropy-weighting mechanism is the key driver of this performance boost, offering a better path to enhance deep reasoning in models. 强化学习（RL）算法如 Group Relative Policy Optimization（GRPO）提升了大型语言模型（LLM）的推理能力，但受限于粗粒度的信用分配，即对序列中的所有 token 施加统一奖励。这是长链推理任务中的一个主要缺陷。本文通过动态熵加权解决了这一问题。我们的核心思想是，正确回答中的高熵 token 可以引导策略达到更高的性能上限。基于此，我们通过两种方式创建更细粒度的奖励信号以实现精确的策略更新：1）组 token 策略优化（GTPO），为每个 token 分配熵加权奖励，实现细粒度的信用分配；2）序列级组相对策略优化（GRPO-S），基于序列中 token 的平均熵为每个序列分配熵加权奖励。实验表明，我们的方法显著优于强基线 DAPO。结果证实，熵加权机制是性能提升的关键驱动力，为模型深度推理的增强提供了更优路径。
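The contrast between a uniform sequence reward and entropy-weighted rewards can be sketched in a few lines. The abstract does not give the actual weighting formulas, so the ones below are illustrative assumptions only: the `alpha` parameter, the "share of total entropy" bonus, and the function names are hypothetical and should not be read as the paper's GTPO or GRPO-S definitions.

```python
import math


def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return sum(-p * math.log(p) for p in probs if p > 0.0)


def token_level_rewards(seq_reward: float, entropies: list[float],
                        alpha: float = 1.0) -> list[float]:
    """Token-level sketch: base reward plus a bonus proportional to each
    token's share of the sequence's total entropy (assumed form)."""
    total = sum(entropies)
    if total == 0.0:
        # No uncertainty anywhere: fall back to the uniform reward.
        return [seq_reward] * len(entropies)
    return [seq_reward * (1.0 + alpha * h / total) for h in entropies]


def sequence_level_reward(seq_reward: float, entropies: list[float],
                          alpha: float = 1.0) -> float:
    """Sequence-level sketch: scale the reward by mean token entropy (assumed form)."""
    mean_entropy = sum(entropies) / len(entropies)
    return seq_reward * (1.0 + alpha * mean_entropy)


# Toy two-token response: a certain token followed by a 50/50 decision point.
distributions = [[1.0, 0.0], [0.5, 0.5]]
entropies = [token_entropy(d) for d in distributions]
per_token = token_level_rewards(1.0, entropies)      # second token gets the whole bonus
per_sequence = sequence_level_reward(1.0, entropies)
```

In this toy case the deterministic token keeps the base reward while the uncertain one receives the bonus, which is the kind of finer-grained credit assignment the abstract argues for.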

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-06 11:42:47 UTC 发布时间：2025-08-06 11:42:47 UTC

#76 Modelling and Classifying the Components of a Literature Review #76 文献综述的组成部分建模与分类

Previous work has demonstrated that AI methods for analysing scientific literature benefit significantly from annotating sentences in papers according to their rhetorical roles, such as research gaps, results, limitations, extensions of existing methodologies, and others. Such representations also have the potential to support the development of a new generation of systems capable of producing high-quality literature reviews. However, achieving this goal requires the definition of a relevant annotation schema and effective strategies for large-scale annotation of the literature. This paper addresses these challenges by 1) introducing a novel annotation schema specifically designed to support literature review generation and 2) conducting a comprehensive evaluation of a wide range of state-of-the-art large language models (LLMs) in classifying rhetorical roles according to this schema. To this end, we also present Sci-Sentence, a novel multidisciplinary benchmark comprising 700 sentences manually annotated by domain experts and 2,240 sentences automatically labelled using LLMs. We evaluate 37 LLMs on this benchmark, spanning diverse model families and sizes, using both zero-shot learning and fine-tuning approaches. The experiments yield several novel insights that advance the state of the art in this challenging domain. First, the current generation of LLMs performs remarkably well on this task when fine-tuned on high-quality data, achieving performance levels above 96% F1. Second, while large proprietary models like GPT-4o achieve the best results, some lightweight open-source alternatives also demonstrate excellent performance. Finally, enriching the training data with semi-synthetic examples generated by LLMs proves beneficial, enabling small encoders to achieve robust results and significantly enhancing the performance of several open decoder models. 
以往的研究表明，基于 AI 的方法通过根据论文中句子的修辞角色（如研究空白、结果、局限性、现有方法的扩展等）进行标注，能够显著提升对科学文献的分析效果。这类表示形式还有望支持新一代系统的开发，使其能够生成高质量的文献综述。然而，实现这一目标需要定义相关的标注方案，并制定有效的大规模文献标注策略。本文通过以下两方面应对这些挑战：1）引入一种专门设计用于支持文献综述生成的新型标注方案；2）对多种最先进的 LLMs 在根据该方案分类修辞角色方面进行了全面评估。为此，我们还提出了 Sci-Sentence，这是一个新颖的多学科基准，包含 700 句由领域专家手工标注的句子和 2240 句由 LLMs 自动标注的句子。我们在该基准测试中评估了 37 个 LLMs，涵盖了多样的模型家族和规模，采用零样本学习和微调两种方法。实验带来了若干新见解，推动了该挑战领域的技术进步。首先，当前一代的 LLMs 在高质量数据微调下，在该任务中表现出色，F1 分数超过 96%。其次，虽然像 GPT-4o 这样的大型专有模型取得了最佳结果，一些轻量级开源替代方案也展现了优异的性能。最后，利用 LLMs 生成的半合成训练样本丰富训练数据被证明是有益的，使小型编码器能够取得稳健的结果，并显著提升了多个开放解码器模型的性能。

Subjects: Computation and Language, Artificial Intelligence, Human-Computer Interaction, Information Retrieval 主题：计算与语言、人工智能、人机交互、信息检索

Publish: 2025-08-06 11:30:07 UTC 发布时间：2025-08-06 11:30:07 UTC

#77 Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models #77 超越排行榜：重新思考大型语言模型的医疗基准

Large language models (LLMs) show significant potential in healthcare, prompting numerous benchmarks to evaluate their capabilities. However, concerns persist regarding the reliability of these benchmarks, which often lack clinical fidelity, robust data management, and safety-oriented evaluation metrics. To address these shortcomings, we introduce MedCheck, the first lifecycle-oriented assessment framework specifically designed for medical benchmarks. Our framework deconstructs a benchmark’s development into five continuous stages, from design to governance, and provides a comprehensive checklist of 46 medically-tailored criteria. Using MedCheck, we conducted an in-depth empirical evaluation of 53 medical LLM benchmarks. Our analysis uncovers widespread, systemic issues, including a profound disconnect from clinical practice, a crisis of data integrity due to unmitigated contamination risks, and a systematic neglect of safety-critical evaluation dimensions like model robustness and uncertainty awareness. Based on these findings, MedCheck serves as both a diagnostic tool for existing benchmarks and an actionable guideline to foster a more standardized, reliable, and transparent approach to evaluating AI in healthcare. 大型语言模型（LLMs）在医疗领域展现出显著潜力，促使众多基准测试评估其能力。然而，关于这些基准测试的可靠性仍存在担忧，它们往往缺乏临床真实性、健全的数据管理以及以安全为导向的评估指标。为了解决这些不足，我们引入了 MedCheck，这是首个专为医疗基准测试设计的生命周期导向评估框架。我们的框架将基准测试的开发分解为从设计到治理的五个连续阶段，并提供了包含 46 条医学定制标准的全面检查清单。利用 MedCheck，我们对 53 个医疗 LLM 基准测试进行了深入的实证评估。我们的分析揭示了普遍存在的系统性问题，包括与临床实践的严重脱节、由于未加控制的污染风险导致的数据完整性危机，以及对模型鲁棒性和不确定性意识等安全关键评估维度的系统性忽视。基于这些发现，MedCheck 既作为现有基准的诊断工具，也作为促进更标准化、可靠且透明的医疗 AI 评估方法的可操作指南。

Publish: 2025-08-06 11:11:40 UTC 发布时间：2025-08-06 11:11:40 UTC

#78 Compressing Large Language Models with PCA Without Performance Loss #78 使用 PCA 压缩大型语言模型且无性能损失

Author: [Magnus Bengtsson](https://arxiv.org/search/?searchtype=author&query=Magnus Bengtsson) 作者：Magnus Bengtsson

We demonstrate that Principal Component Analysis (PCA), when applied in a structured manner, either to polar-transformed images or segment-wise to token sequences, enables extreme compression of neural models without sacrificing performance. Across three case studies, we show that a one-layer classifier trained on PCA-compressed polar MNIST achieves over 98 percent accuracy using only 840 parameters. A two-layer transformer trained on 70-dimensional PCA-reduced MiniLM embeddings reaches 76.62 percent accuracy on the 20 Newsgroups dataset with just 81000 parameters. A decoder-only transformer generates coherent token sequences from 70-dimensional PCA embeddings while preserving over 97 percent cosine similarity with full MiniLM representations, using less than 17 percent of the parameter count of GPT-2. These results highlight PCA-based input compression as a general and effective strategy for aligning model capacity with information content, enabling lightweight architectures across multiple modalities. 我们展示了主成分分析（PCA）在结构化应用时，无论是对极坐标变换图像，还是对分段的令牌序列，均能实现神经模型的极致压缩而不损失性能。在三个案例研究中，我们表明：在 PCA 压缩的极坐标 MNIST 上训练的单层分类器仅用 840 个参数即可达到 98%以上的准确率；在 70 维 PCA 降维 MiniLM 嵌入上训练的两层 Transformer，在 20 个新闻组数据集上以仅 81000 个参数实现了 76.62%的准确率；一个仅解码器的 Transformer 能够从 70 维 PCA 嵌入生成连贯的令牌序列，同时保持与完整 MiniLM 表示超过 97%的余弦相似度，参数数量不到 GPT-2 的 17%。这些结果凸显了基于 PCA 的输入压缩作为一种通用且有效的策略，能够使模型容量与信息内容相匹配，从而在多种模态下实现轻量级架构。
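As a rough illustration of PCA-based input compression, the sketch below reduces toy 3-dimensional rows to a single principal component using power iteration, written with the standard library only. It is not the paper's pipeline: the polar transform, the segment-wise scheme, and the MiniLM embeddings are not reproduced, and the data and helper names are hypothetical. The parameter-count helper simply counts weights plus biases of a single linear layer.

```python
import math
import random


def center(rows):
    """Subtract the per-column mean; return (centered rows, means)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows], means


def covariance(centered):
    """Sample covariance matrix (d x d) of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def _matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v


def top_components(cov, k, iters=200, seed=0):
    """Leading k eigenvectors of cov via power iteration with deflation."""
    rng = random.Random(seed)
    d = len(cov)
    cov = [row[:] for row in cov]  # work on a copy; deflation mutates it
    comps = []
    for _ in range(k):
        v = _normalize([rng.gauss(0.0, 1.0) for _ in range(d)])
        for _ in range(iters):
            w = _matvec(cov, v)
            if all(x == 0.0 for x in w):
                break  # no variance left in this direction
            v = _normalize(w)
        eigval = sum(a * b for a, b in zip(v, _matvec(cov, v)))
        comps.append(v)
        for i in range(d):  # remove the found component from the matrix
            for j in range(d):
                cov[i][j] -= eigval * v[i] * v[j]
    return comps


def project(centered, comps):
    """Coordinates of each centered row in the component basis."""
    return [[sum(a * b for a, b in zip(row, c)) for c in comps] for row in centered]


def linear_classifier_params(n_inputs, n_classes):
    """Weights plus biases of a single linear layer."""
    return n_inputs * n_classes + n_classes


# Toy data lying on the line y = 2x in 3-D, so one component captures everything.
rows = [[float(t), 2.0 * t, 0.0] for t in range(-3, 4)]
centered, _ = center(rows)
components = top_components(covariance(centered), k=1)
reduced = project(centered, components)

raw_mnist_params = linear_classifier_params(784, 10)  # linear layer on raw 28x28 pixels
```

For scale, a linear classifier on raw 28×28 MNIST pixels with 10 classes needs 7,850 parameters, against the 840 reported in the abstract for the PCA-compressed polar input.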

Subjects: Computational Engineering, Finance, and Science, Artificial Intelligence 学科：计算工程、金融与科学，人工智能

Publish: 2025-08-06 10:47:22 UTC 发布时间：2025-08-06 10:47:22 UTC

#79 Comparative Analysis of Novel NIRMAL Optimizer Against Adam and SGD with Momentum #79 新型 NIRMAL 优化器与 Adam 及带动量的 SGD 的比较分析

Authors: [Nirmal Gaud](https://arxiv.org/search/?searchtype=author&query=Nirmal Gaud), [Surej Mouli](https://arxiv.org/search/?searchtype=author&query=Surej Mouli), [Preeti Katiyar](https://arxiv.org/search/?searchtype=author&query=Preeti Katiyar), [Vaduguru Venkata Ramya](https://arxiv.org/search/?searchtype=author&query=Vaduguru Venkata Ramya) 作者：Nirmal Gaud、Surej Mouli、Preeti Katiyar、Vaduguru Venkata Ramya

This study proposes NIRMAL (Novel Integrated Robust Multi-Adaptation Learning), a novel optimization algorithm that combines multiple strategies inspired by the movements of the chess piece. These strategies include gradient descent, momentum, stochastic perturbations, adaptive learning rates, and non-linear transformations. We carefully evaluated NIRMAL against two widely used and successful optimizers, Adam and SGD with Momentum, on four benchmark image classification datasets: MNIST, FashionMNIST, CIFAR-10, and CIFAR-100. A custom convolutional neural network (CNN) architecture is applied to each dataset. The experimental results show that NIRMAL achieves competitive performance, particularly on the more challenging CIFAR-100 dataset, where it achieved a test accuracy of 45.32% and a weighted F1-score of 0.4328. This performance surpasses Adam (41.79% accuracy, 0.3964 F1-score) and comes close to SGD with Momentum (46.97% accuracy, 0.4531 F1-score). NIRMAL also exhibits robust convergence and strong generalization capabilities, especially on complex datasets, as evidenced by stable loss and accuracy curves during training. These findings underscore NIRMAL’s capability as a versatile and effective optimizer for various deep learning tasks. 本研究提出了 NIRMAL（新型集成鲁棒多适应学习），这是一种结合了多种策略的新型优化算法，灵感来源于国际象棋棋子的移动。这些策略包括梯度下降、动量、随机扰动、自适应学习率和非线性变换。我们在四个基准图像分类数据集 MNIST、FashionMNIST、CIFAR-10 和 CIFAR-100 上，针对两种广泛使用且成功的优化器 Adam 和带动量的 SGD，仔细评估了 NIRMAL。每个数据集均应用了定制的卷积神经网络（CNN）架构。实验结果表明，NIRMAL 表现出竞争力，尤其是在更具挑战性的 CIFAR-100 数据集上，测试准确率达到 45.32%，加权 F1 分数为 0.4328。该性能超过了 Adam（准确率 41.79%，F1 分数 0.3964），并且与带动量的 SGD（准确率 46.97%，F1 分数 0.4531）表现接近。此外，NIRMAL 展现了稳健的收敛性和强大的泛化能力，特别是在复杂数据集上，训练过程中的损失和准确率曲线均表现稳定。这些发现强调了 NIRMAL 作为一种多功能且高效的优化器，在各种深度学习任务中的显著能力。

Subjects: Information Retrieval, Artificial Intelligence 主题：信息检索，人工智能

Publish: 2025-08-06 10:30:22 UTC 发布时间：2025-08-06 10:30:22 UTC

#80 Challenges in Applying Variational Quantum Algorithms to Dynamic Satellite Network Routing #80 在动态卫星网络路由中应用变分量子算法的挑战

Authors: [Phuc Hao Do](https://arxiv.org/search/?searchtype=author&query=Phuc Hao Do), [Tran Duc Le](https://arxiv.org/search/?searchtype=author&query=Tran Duc Le) 作者：Phuc Hao Do，Tran Duc Le

Applying near-term variational quantum algorithms to the problem of dynamic satellite network routing represents a promising direction for quantum computing. In this work, we provide a critical evaluation of two major approaches: static quantum optimizers such as the Variational Quantum Eigensolver (VQE) and the Quantum Approximate Optimization Algorithm (QAOA) for offline route computation, and Quantum Reinforcement Learning (QRL) methods for online decision-making. Using ideal, noise-free simulations, we find that these algorithms face significant challenges. Specifically, static optimizers are unable to solve even a classically easy 4-node shortest path problem due to the complexity of the optimization landscape. Likewise, a basic QRL agent based on policy gradient methods fails to learn a useful routing strategy in a dynamic 8-node environment and performs no better than random actions. These negative findings highlight key obstacles that must be addressed before quantum algorithms can offer real advantages in communication networks. We discuss the underlying causes of these limitations, including barren plateaus and learning instability, and suggest future research directions to overcome them. 将近期变分量子算法应用于动态卫星网络路由问题，是量子计算的一个有前景的方向。在本工作中，我们对两种主要方法进行了批判性评估：用于离线路径计算的静态量子优化器，如变分量子本征求解器（VQE）和量子近似优化算法（QAOA），以及用于在线决策的量子强化学习（QRL）方法。通过理想的无噪声模拟，我们发现这些算法面临重大挑战。具体而言，静态优化器由于优化景观的复杂性，甚至无法解决一个经典上简单的 4 节点最短路径问题。同样，基于策略梯度方法的基础 QRL 智能体在动态 8 节点环境中未能学到有效的路由策略，其表现不优于随机动作。这些负面结果凸显了在量子算法能够在通信网络中提供实际优势之前必须克服的关键障碍。我们讨论了这些限制的根本原因，包括贫瘠高原和学习不稳定性，并提出了克服这些问题的未来研究方向。
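To give a sense of how small the static benchmark is from a classical standpoint, the sketch below solves a 4-node shortest-path instance with Dijkstra's algorithm. The topology, the edge weights, and the function name `shortest_path` are illustrative assumptions; they are not taken from the paper.

```python
import heapq

def shortest_path(graph, source, target):
    """Dijkstra's algorithm on a dict-of-dicts graph with non-negative weights."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue
        visited.add(u)
        if u == target:
            break
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return float("inf"), []
    # Walk predecessors back from the target to recover the route.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]

# Hypothetical 4-satellite topology; weights stand in for link latency.
GRAPH = {
    "A": {"B": 1.0, "C": 4.0},
    "B": {"A": 1.0, "C": 2.0, "D": 5.0},
    "C": {"A": 4.0, "B": 2.0, "D": 1.0},
    "D": {"B": 5.0, "C": 1.0},
}

cost, route = shortest_path(GRAPH, "A", "D")
print(cost, route)
```

On a graph this size the classical search visits each node at most once, which is the baseline the abstract describes as "classically easy".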

Subjects: Quantum Physics, Artificial Intelligence, Systems and Control 主题：量子物理，人工智能，系统与控制

Publish: 2025-08-06 10:25:39 UTC 发布时间：2025-08-06 10:25:39 UTC

#81 Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success #81 利用合成世界中的强化学习提升视觉-语言模型训练，实现现实世界成功

Authors: [George Bredis](https://arxiv.org/search/?searchtype=author&query=George Bredis), [Stanislav Dereka](https://arxiv.org/search/?searchtype=author&query=Stanislav Dereka), [Viacheslav Sinii](https://arxiv.org/search/?searchtype=author&query=Viacheslav Sinii), [Ruslan Rakhimov](https://arxiv.org/search/?searchtype=author&query=Ruslan Rakhimov), [Daniil Gavrilov](https://arxiv.org/search/?searchtype=author&query=Daniil Gavrilov) 作者：George Bredis，Stanislav Dereka，Viacheslav Sinii，Ruslan Rakhimov，Daniil Gavrilov

Interactive multimodal agents must convert raw visual observations into coherent sequences of language-conditioned actions – a capability that current vision-language models (VLMs) still lack. Earlier reinforcement-learning (RL) efforts could, in principle, endow VLMs with such skills, but they have seldom tested whether the learned behaviours generalize beyond their training simulators, and they depend either on brittle hyperparameter tuning or on dense-reward environments with low state variability. We introduce Vision-Language Decoupled Actor-Critic (VL-DAC), a lightweight, hyperparameter-free RL algorithm. VL-DAC applies PPO updates to action tokens while learning value only at the environment-step level: an arrangement, to our knowledge, not previously explored for large VLMs or LLMs. This simple decoupling removes unstable weighting terms and yields faster, more reliable convergence. Training a single VLM with VL-DAC in one inexpensive simulator at a time (MiniWorld, Gym-Cards, ALFWorld, or WebShop) already produces policies that generalize widely: +50% relative on BALROG (game-centric agentic control), +5% relative on the hardest part of VSI-Bench (spatial planning), and +2% on VisualWebBench (web navigation), all without degrading general image understanding accuracy. These results provide the first evidence that a simple RL algorithm can train VLMs entirely in cheap synthetic worlds while delivering measurable gains on real-image agentic, spatial-reasoning, and web-navigation benchmarks. 
交互式多模态代理必须将原始视觉观察转换为连贯的语言条件动作序列——这是当前视觉语言模型（VLMs）仍然缺乏的能力。早期的强化学习（RL）尝试原则上可以赋予 VLMs 这种技能，但它们很少测试所学行为是否能超出训练模拟器的范围进行泛化，并且它们要么依赖脆弱的超参数调优，要么依赖状态变化较小的密集奖励环境。我们提出了 Vision-Language Decoupled Actor-Critic（VL-DAC），这是一种轻量级、无超参数的 RL 算法。VL-DAC 对动作令牌应用 PPO 更新，同时仅在环境步骤级别学习价值：据我们所知，这种安排此前未在大型 VLMs 或 LLMs 中探索过。这种简单的解耦消除了不稳定的加权项，实现了更快、更可靠的收敛。在一个廉价的模拟器中（MiniWorld、Gym-Cards、ALFWorld 或 WebShop）使用 VL-DAC 训练单个 VLM，已经能够产生广泛泛化的策略：在 BALROG（以游戏为中心的智能控制）上相对提升 50%，在 VSI-Bench 最难部分（空间规划）上相对提升 5%，在 VisualWebBench（网页导航）上提升 2%，且所有这些提升均未降低通用图像理解的准确率。这些结果首次证明，简单的强化学习算法可以完全在廉价的合成世界中训练 VLM，同时在真实图像的智能控制、空间推理和网页导航基准测试中带来可衡量的提升。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-06 10:08:48 UTC 发布时间：2025-08-06 10:08:48 UTC

#82 A Few Words Can Distort Graphs: Knowledge Poisoning Attacks on Graph-based Retrieval-Augmented Generation of Large Language Models #82 几个词就能扭曲图谱：基于图的增强检索生成大型语言模型的知识投毒攻击

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-06 10:01:26 UTC 发布时间：2025-08-06 10:01:26 UTC

#83 A Visual Tool for Interactive Model Explanation using Sensitivity Analysis #83 一款基于敏感性分析的交互式模型解释可视化工具

Author: [Manuela Schuler](https://arxiv.org/search/?searchtype=author&query=Manuela Schuler) 作者：Manuela Schuler

We present SAInT, a Python-based tool for visually exploring and understanding the behavior of Machine Learning (ML) models through integrated local and global sensitivity analysis. Our system supports Human-in-the-Loop (HITL) workflows by enabling users - both AI researchers and domain experts - to configure, train, evaluate, and explain models through an interactive graphical interface without programming. The tool automates model training and selection, provides global feature attribution using variance-based sensitivity analysis, and offers per-instance explanation via LIME and SHAP. We demonstrate the system on a classification task predicting survival on the Titanic dataset and show how sensitivity information can guide feature selection and data refinement. 我们介绍了 SAInT，一款基于 Python 的工具，用于通过集成的局部和全局敏感性分析，直观地探索和理解机器学习（ML）模型的行为。我们的系统支持人机交互（HITL）工作流程，使用户——包括 AI 研究人员和领域专家——能够通过交互式图形界面配置、训练、评估和解释模型，无需编程。该工具自动化模型训练和选择，利用基于方差的敏感性分析提供全局特征归因，并通过 LIME 和 SHAP 提供每个实例的解释。我们在泰坦尼克号数据集的生存预测分类任务中演示了该系统，并展示了敏感性信息如何指导特征选择和数据优化。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-06 09:53:31 UTC 发布时间：2025-08-06 09:53:31 UTC

#84 SelectiveShield: Lightweight Hybrid Defense Against Gradient Leakage in Federated Learning #84 SelectiveShield：针对联邦学习中梯度泄露的轻量级混合防御

Authors: [Borui Li](https://arxiv.org/search/?searchtype=author&query=Borui Li), [Li Yan](https://arxiv.org/search/?searchtype=author&query=Li Yan), [Jianmin Liu](https://arxiv.org/search/?searchtype=author&query=Jianmin Liu) 作者：李博睿，闫丽，刘建民

Federated Learning (FL) enables collaborative model training on decentralized data but remains vulnerable to gradient leakage attacks that can reconstruct sensitive user information. Existing defense mechanisms, such as differential privacy (DP) and homomorphic encryption (HE), often introduce a trade-off between privacy, model utility, and system overhead, a challenge that is exacerbated in heterogeneous environments with non-IID data and varying client capabilities. To address these limitations, we propose SelectiveShield, a lightweight hybrid defense framework that adaptively integrates selective homomorphic encryption and differential privacy. SelectiveShield leverages Fisher information to quantify parameter sensitivity, allowing clients to identify critical parameters locally. Through a collaborative negotiation protocol, clients agree on a shared set of the most sensitive parameters for protection via homomorphic encryption. Parameters that are uniquely important to individual clients are retained locally, fostering personalization, while non-critical parameters are protected with adaptive differential privacy noise. Extensive experiments demonstrate that SelectiveShield maintains strong model utility while significantly mitigating gradient leakage risks, offering a practical and scalable defense mechanism for real-world federated learning deployments. 联邦学习（FL）实现了在分散数据上的协同模型训练，但仍易受到梯度泄露攻击，这类攻击能够重建敏感的用户信息。现有的防御机制，如差分隐私（DP）和同态加密（HE），通常在隐私保护、模型效用和系统开销之间存在权衡，这一挑战在具有非独立同分布（非 IID）数据和不同客户端能力的异构环境中尤为突出。为了解决这些限制，我们提出了 SelectiveShield，一种轻量级的混合防御框架，能够自适应地整合选择性同态加密和差分隐私。SelectiveShield 利用费舍尔信息量来量化参数敏感性，使客户端能够本地识别关键参数。通过协作协商协议，客户端达成共识，选定一组最敏感的参数通过同态加密进行保护。对个别客户端独有重要的参数则保留在本地，促进个性化，而非关键参数则通过自适应差分隐私噪声进行保护。大量实验证明，SelectiveShield 在显著降低梯度泄露风险的同时，保持了强大的模型效用，为现实世界的联邦学习部署提供了一种实用且可扩展的防御机制。
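The three-way split described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' protocol: it assumes sensitivity is scored with the empirical diagonal Fisher information (mean squared per-sample gradient), and it assumes the negotiation simply intersects each client's top-k set. The names `diagonal_fisher`, `top_k_indices`, and `partition_parameters`, the value of `k`, and the toy gradients are all hypothetical.

```python
def diagonal_fisher(per_sample_grads):
    """Empirical diagonal Fisher: mean of squared per-sample gradients."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]

def top_k_indices(scores, k):
    """Indices of the k largest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])

def partition_parameters(client_grads, k):
    """Split parameter indices into three protection groups.

    Assumed rule (not from the paper): parameters in every client's top-k
    are encrypted, parameters in only some clients' top-k stay local to
    those clients, and all remaining parameters receive DP noise.
    """
    tops = [top_k_indices(diagonal_fisher(g), k) for g in client_grads]
    shared = set.intersection(*tops)
    union = set.union(*tops)
    local = [t - shared for t in tops]
    dim = len(client_grads[0][0])
    noisy = set(range(dim)) - union
    return shared, local, noisy

# Toy data: two clients, two samples each, four parameters.
client_grads = [
    [[0.9, 0.1, 0.8, 0.0], [1.1, 0.0, 0.6, 0.1]],
    [[1.0, 0.7, 0.1, 0.0], [0.8, 0.9, 0.0, 0.1]],
]
shared, local, noisy = partition_parameters(client_grads, k=2)
print(shared, local, noisy)
```

In this toy run, parameter 0 is sensitive for both clients and would go to homomorphic encryption, parameters 2 and 1 are each important to only one client and would stay local, and parameter 3 would receive differential privacy noise.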

Subjects: Distributed, Parallel, and Cluster Computing, Artificial Intelligence, Cryptography and Security 主题：分布式、并行与集群计算，人工智能，密码学与安全

Publish: 2025-08-06 09:50:39 UTC 发布时间：2025-08-06 09:50:39 UTC

#85 Segment Any Vehicle: Semantic and Visual Context Driven SAM and A Benchmark #85 细分任何车辆：语义和视觉上下文驱动的 SAM 及基准测试

Authors: [Xiao Wang](https://arxiv.org/search/?searchtype=author&query=Xiao Wang), [Ziwen Wang](https://arxiv.org/search/?searchtype=author&query=Ziwen Wang), [Wentao Wu](https://arxiv.org/search/?searchtype=author&query=Wentao Wu), [Anjie Wang](https://arxiv.org/search/?searchtype=author&query=Anjie Wang), [Jiashu Wu](https://arxiv.org/search/?searchtype=author&query=Jiashu Wu), [Yantao Pan](https://arxiv.org/search/?searchtype=author&query=Yantao Pan), [Chenglong Li](https://arxiv.org/search/?searchtype=author&query=Chenglong Li) 作者：王晓，王子文，吴文涛，王安杰，吴嘉树，潘艳涛，李成龙

With the rapid advancement of autonomous driving, vehicle perception, particularly detection and segmentation, has placed increasingly higher demands on algorithmic performance. Pre-trained large segmentation models, especially Segment Anything Model (SAM), have sparked significant interest and inspired new research directions in artificial intelligence. However, SAM cannot be directly applied to the fine-grained task of vehicle part segmentation, as its text-prompted segmentation functionality is not publicly accessible, and the mask regions generated by its default mode lack semantic labels, limiting its utility in structured, category-specific segmentation tasks. To address these limitations, we propose SAV, a novel framework comprising three core components: a SAM-based encoder-decoder, a vehicle part knowledge graph, and a context sample retrieval encoding module. The knowledge graph explicitly models the spatial and geometric relationships among vehicle parts through a structured ontology, effectively encoding prior structural knowledge. Meanwhile, the context retrieval module enhances segmentation by identifying and leveraging visually similar vehicle instances from training data, providing rich contextual priors for improved generalization. Furthermore, we introduce a new large-scale benchmark dataset for vehicle part segmentation, named VehicleSeg10K, which contains 11,665 high-quality pixel-level annotations across diverse scenes and viewpoints. We conduct comprehensive experiments on this dataset and two other datasets, benchmarking multiple representative baselines to establish a solid foundation for future research and comparison. Both the dataset and source code of this paper will be released on https://github.com/Event-AHU/SAV
随着自动驾驶的快速发展，车辆感知，特别是检测和分割，对算法性能提出了越来越高的要求。预训练的大型分割模型，尤其是 Segment Anything Model（SAM），引发了广泛关注并激发了人工智能领域的新研究方向。然而，SAM 无法直接应用于细粒度的车辆部件分割任务，因为其基于文本提示的分割功能尚未公开，且其默认模式生成的掩码区域缺乏语义标签，限制了其在结构化、类别特定分割任务中的实用性。为了解决这些限制，我们提出了 SAV，一种包含三个核心组件的新型框架：基于 SAM 的编码器-解码器、车辆部件知识图谱以及上下文样本检索编码模块。该知识图谱通过结构化本体明确建模了车辆部件之间的空间和几何关系，有效编码了先验的结构知识。同时，语境检索模块通过识别并利用训练数据中视觉上相似的车辆实例，增强了分割效果，提供了丰富的语境先验以提升泛化能力。此外，我们引入了一个用于车辆部件分割的新大规模基准数据集，命名为 VehicleSeg10K，包含 11,665 个高质量的像素级标注，涵盖多样的场景和视角。我们在该数据集及另外两个数据集上进行了全面的实验，基准测试了多个具有代表性的基线方法，为未来的研究和比较奠定了坚实基础。本文的数据集和源代码将发布在 https://github.com/Event-AHU/SAV。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning 主题：计算机视觉与模式识别，人工智能，机器学习

Publish: 2025-08-06 09:46:49 UTC 发布时间：2025-08-06 09:46:49 UTC

#86 TalkDep: Clinically Grounded LLM Personas for Conversation-Centric Depression Screening #86 TalkDep：以临床为基础的面向对话的抑郁症筛查 LLM 角色设定

The increasing demand for mental health services has outpaced the availability of real training data to develop clinical professionals, leading to limited support for the diagnosis of depression. This shortage has motivated the development of simulated or virtual patients to assist in training and evaluation, but existing approaches often fail to generate clinically valid, natural, and diverse symptom presentations. In this work, we embrace the recent advanced language models as the backbone and propose a novel clinician-in-the-loop patient simulation pipeline, TalkDep, with access to diversified patient profiles to develop simulated patients. By conditioning the model on psychiatric diagnostic criteria, symptom severity scales, and contextual factors, our goal is to create authentic patient responses that can better support diagnostic model training and evaluation. We verify the reliability of these simulated patients with thorough assessments conducted by clinical professionals. The availability of validated simulated patients offers a scalable and adaptable resource for improving the robustness and generalisability of automatic depression diagnosis systems. 对心理健康服务的需求不断增加，远远超过了用于培养临床专业人员的真实训练数据的供应，导致对抑郁症诊断的支持有限。这一短缺促使人们开发模拟或虚拟患者以辅助培训和评估，但现有方法往往无法生成临床有效、自然且多样化的症状表现。在本研究中，我们采用了最新的先进语言模型作为基础，提出了一种新颖的临床医生参与的患者模拟流程——TalkDep，该流程能够访问多样化的患者档案以开发模拟患者。通过将模型条件化于精神病学诊断标准、症状严重程度量表和情境因素，我们的目标是创造真实的患者反应，从而更好地支持诊断模型的训练和评估。我们通过临床专业人员进行的全面评估验证了这些模拟患者的可靠性。经过验证的模拟患者的可用性为提升自动抑郁症诊断系统的鲁棒性和泛化能力提供了可扩展且适应性强的资源。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-06 09:30:47 UTC 发布时间：2025-08-06 09:30:47 UTC

#87 Automated ultrasound doppler angle estimation using deep learning #87 使用深度学习的自动化超声多普勒角度估计

Authors: [Nilesh Patil](https://arxiv.org/search/?searchtype=author&query=Nilesh Patil), [Ajay Anand](https://arxiv.org/search/?searchtype=author&query=Ajay Anand) 作者：Nilesh Patil，Ajay Anand

Angle estimation is an important step in the Doppler ultrasound clinical workflow to measure blood velocity. It is widely recognized that incorrect angle estimation is a leading cause of error in Doppler-based blood velocity measurements. In this paper, we propose a deep learning-based approach for automated Doppler angle estimation. The approach was developed using 2100 human carotid ultrasound images including image augmentation. Five pre-trained models were used to extract image features, and these features were passed to a custom shallow network for Doppler angle estimation. Independently, measurements were obtained by a human observer reviewing the images for comparison. The mean absolute error (MAE) between the automated and manual angle estimates ranged from 3.9° to 9.4° for the models evaluated. Furthermore, the MAE for the best performing model was less than the acceptable clinical Doppler angle error threshold, thus avoiding misclassification of normal velocity values as a stenosis. The results demonstrate potential for applying a deep learning-based technique for automated ultrasound Doppler angle estimation. Such a technique could potentially be implemented within the imaging software on commercial ultrasound scanners. 角度估计是多普勒超声临床流程中测量血流速度的重要步骤。众所周知，角度估计错误是多普勒血流速度测量误差的主要原因之一。本文提出了一种基于深度学习的自动多普勒角度估计方法。该方法使用了包括图像增强在内的 2100 张人体颈动脉超声图像进行开发。采用了五个预训练模型提取图像特征，并将这些特征传递给一个定制的浅层网络以进行多普勒角度估计。作为对比，人工观察者独立对图像进行了测量。自动估计与人工估计之间的平均绝对误差（MAE）在评估的模型中范围为 3.9°至 9.4°。此外，表现最佳模型的 MAE 低于临床可接受的多普勒角度误差阈值，从而避免了将正常速度值误判为狭窄的情况。结果表明，基于深度学习的技术在自动超声多普勒角度估计中具有应用潜力。这种技术有可能在商业超声扫描仪的成像软件中实现。
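To illustrate why a few degrees of angle error matter, the sketch below propagates an angle error through the standard Doppler equation v = c·f_d / (2·f_0·cos θ) and shows how an MAE between automated and manual angles is computed. The true angles of 45° and 60°, the sample angle readings, and the function names are assumptions chosen for illustration; treating an MAE value as a single angle offset is also a simplification.

```python
import math

def velocity_relative_error(true_angle_deg, angle_error_deg):
    """Relative velocity error when the Doppler angle is over-estimated.

    From v = c * f_d / (2 * f_0 * cos(theta)), the reported velocity scales
    as 1 / cos(theta_estimated), so the ratio to the true velocity is
    cos(theta_true) / cos(theta_estimated).
    """
    true_rad = math.radians(true_angle_deg)
    est_rad = math.radians(true_angle_deg + angle_error_deg)
    return math.cos(true_rad) / math.cos(est_rad) - 1.0

def mean_absolute_error(predicted, reference):
    """MAE between automated and manual angle estimates, in degrees."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)

# Angle errors equal to the best and worst MAE reported in the abstract.
for err in (3.9, 9.4):
    for theta in (45.0, 60.0):
        rel = velocity_relative_error(theta, err)
        print(f"true angle {theta:.0f} deg, error {err} deg -> {rel:+.1%} velocity")

# Hypothetical angle readings, only to show the MAE computation.
print(mean_absolute_error([58.0, 47.5, 62.0], [60.0, 45.0, 61.0]))
```

Because the velocity depends on 1/cos θ, the same angle error produces a larger velocity error at steeper insonation angles, which is why the size of the angle error bears on stenosis classification.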

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-06 09:28:07 UTC 发布时间：2025-08-06 09:28:07 UTC

#88 Empowering Time Series Forecasting with LLM-Agents #88 利用 LLM-Agent 增强时间序列预测

Large Language Model (LLM) powered agents have emerged as effective planners for Automated Machine Learning (AutoML) systems. While most existing AutoML approaches focus on automating feature engineering and model architecture search, recent studies in time series forecasting suggest that lightweight models can often achieve state-of-the-art performance. This observation led us to explore improving data quality, rather than model architecture, as a potentially fruitful direction for AutoML on time series data. We propose DCATS, a Data-Centric Agent for Time Series. DCATS leverages metadata accompanying time series to clean data while optimizing forecasting performance. We evaluated DCATS using four time series forecasting models on a large-scale traffic volume forecasting dataset. Results demonstrate that DCATS achieves an average 6% error reduction across all tested models and time horizons, highlighting the potential of data-centric approaches in AutoML for time series forecasting. 大型语言模型（LLM）驱动的智能体已成为自动化机器学习（AutoML）系统中有效的规划者。尽管大多数现有的 AutoML 方法侧重于自动化特征工程和模型架构搜索，近期时间序列预测的研究表明，轻量级模型往往能够实现最先进的性能。基于这一观察，我们开始探索提升数据质量，而非模型架构，作为时间序列数据 AutoML 的一个潜在有效方向。我们提出了 DCATS，一种面向时间序列的数据中心智能体。DCATS 利用时间序列的元数据来清洗数据，同时优化预测性能。我们在一个大规模交通流量预测数据集上，使用四种时间序列预测模型对 DCATS 进行了评估。结果表明，DCATS 在所有测试模型和时间范围内平均减少了 6%的误差，凸显了数据中心方法在时间序列预测 AutoML 中的潜力。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-06 09:14:08 UTC 发布时间：2025-08-06 09:14:08 UTC

#89 LayerT2V: Interactive Multi-Object Trajectory Layering for Video Generation #89 LayerT2V：用于视频生成的交互式多目标轨迹分层

Authors: [Kangrui Cen](https://arxiv.org/search/?searchtype=author&query=Kangrui Cen), [Baixuan Zhao](https://arxiv.org/search/?searchtype=author&query=Baixuan Zhao), [Yi Xin](https://arxiv.org/search/?searchtype=author&query=Yi Xin), [Siqi Luo](https://arxiv.org/search/?searchtype=author&query=Siqi Luo), [Guangtao Zhai](https://arxiv.org/search/?searchtype=author&query=Guangtao Zhai), [Xiaohong Liu](https://arxiv.org/search/?searchtype=author&query=Xiaohong Liu) 作者：岑康睿，赵百轩，辛毅，罗思琪，翟光涛，刘晓红

Controlling object motion trajectories in Text-to-Video (T2V) generation is a challenging and relatively under-explored area, particularly in scenarios involving multiple moving objects. Most community models and datasets in the T2V domain are designed for single-object motion, limiting the performance of current generative models in multi-object tasks. Additionally, existing motion control methods in T2V either lack support for multi-object motion scenes or experience severe performance degradation when object trajectories intersect, primarily due to the semantic conflicts in colliding regions. To address these limitations, we introduce LayerT2V, the first approach for generating video by compositing background and foreground objects layer by layer. This layered generation enables flexible integration of multiple independent elements within a video, positioning each element on a distinct “layer” and thus facilitating coherent multi-object synthesis while enhancing control over the generation process. Extensive experiments demonstrate the superiority of LayerT2V in generating complex multi-object scenarios, showcasing 1.4x and 4.5x improvements in mIoU and AP50 metrics over state-of-the-art (SOTA) methods. Project page and code are available at https://kr-panghu.github.io/LayerT2V/ . 在文本到视频（Text-to-Video，T2V）生成中控制物体运动轨迹是一项具有挑战性且相对较少探索的领域，尤其是在涉及多个移动物体的场景中。T2V 领域的大多数社区模型和数据集都是为单一物体运动设计的，这限制了当前生成模型在多物体任务中的表现。此外，现有的 T2V 运动控制方法要么不支持多物体运动场景，要么在物体轨迹相交时性能严重下降，主要原因是碰撞区域的语义冲突。为了解决这些限制，我们提出了 LayerT2V，这是首个通过分层合成背景和前景物体来生成视频的方法。这种分层生成使得视频中多个独立元素的灵活整合成为可能，将每个元素置于不同的“层”上，从而促进了连贯的多物体合成，同时增强了对生成过程的控制能力。大量实验表明，LayerT2V 在生成复杂多目标场景方面具有优越性，在 mIoU 和 AP50 指标上分别比最先进（SOTA）方法提升了 1.4 倍和 4.5 倍。项目页面和代码可在 https://kr-panghu.github.io/LayerT2V/ 获取。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning, Multimedia 主题：计算机视觉与模式识别，人工智能，机器学习，多媒体

Publish: 2025-08-06 09:03:16 UTC 发布时间：2025-08-06 09:03:16 UTC

#90 Symmetric Behavior Regularization via Taylor Expansion of Symmetry #90 通过对称性的泰勒展开实现对称行为正则化

Authors: [Lingwei Zhu](https://arxiv.org/search/?searchtype=author&query=Lingwei Zhu), [Zheng Chen](https://arxiv.org/search/?searchtype=author&query=Zheng Chen), [Han Wang](https://arxiv.org/search/?searchtype=author&query=Han Wang), [Yukie Nagai](https://arxiv.org/search/?searchtype=author&query=Yukie Nagai) 作者：朱凌伟，陈正，王涵，永井由纪

This paper introduces symmetric divergences to behavior regularization policy optimization (BRPO) to establish a novel offline RL framework. Existing methods focus on asymmetric divergences such as KL to obtain analytic regularized policies and a practical minimization objective. We show that symmetric divergences do not permit an analytic policy when used as regularization and can incur numerical issues when used as a loss. We tackle these challenges using the Taylor series of the f-divergence. Specifically, we prove that an analytic policy can be obtained with a finite series. For the loss, we observe that symmetric divergences can be decomposed into an asymmetry term and a conditional symmetry term; Taylor-expanding the latter alleviates numerical issues. Putting these together, we propose Symmetric f Actor-Critic (Sf-AC), the first practical BRPO algorithm with symmetric divergences. Experimental results on distribution approximation and MuJoCo verify that Sf-AC performs competitively. 本文将对称散度引入行为正则化策略优化（BRPO），以建立一种新颖的离线强化学习框架。现有方法侧重于使用非对称散度，如 KL 散度，以获得解析的正则化策略和实用的最小化目标。我们证明了对称散度不允许作为正则化的解析策略，并且作为损失函数时可能引发数值问题。我们通过 f-散度的泰勒级数来解决这些挑战。具体而言，我们证明了可以通过有限级数获得解析策略。对于损失函数，我们观察到对称散度可以分解为非对称项和条件对称项，对后者进行泰勒展开可以缓解数值问题。综合以上，我们提出了对称 f 演员-评论家（Sf-AC），这是首个使用对称散度的实用 BRPO 算法。在分布逼近和 MuJoCo 上的实验结果验证了 Sf-AC 的竞争性能。
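As background for the distinction the abstract draws, the sketch below evaluates a few f-divergences on discrete distributions: KL (asymmetric), Jeffreys (symmetric, equal to KL in both directions summed), and the second-order Taylor term of the KL generator around 1. It illustrates the general idea of Taylor-expanding an f-divergence only; it is not the decomposition or the series used in Sf-AC, and the distributions `p` and `q` are made up.

```python
import math

def f_divergence(p, q, f):
    """D_f(p || q) = sum_x q(x) * f(p(x) / q(x)) for strictly positive p, q."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def kl_generator(t):
    # Generator of the KL divergence.
    return t * math.log(t)

def jeffreys_generator(t):
    # Jeffreys divergence KL(p||q) + KL(q||p) has generator (t - 1) * log(t).
    return (t - 1.0) * math.log(t)

def second_order_generator(t):
    # Second-order Taylor term of t*log(t) around t = 1:
    # t*log(t) = (t - 1) + (t - 1)^2 / 2 - ...; the linear term sums to zero
    # because both distributions are normalized.
    return 0.5 * (t - 1.0) ** 2

# Illustrative distributions over three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

kl_pq = f_divergence(p, q, kl_generator)
kl_qp = f_divergence(q, p, kl_generator)
jeff_pq = f_divergence(p, q, jeffreys_generator)
jeff_qp = f_divergence(q, p, jeffreys_generator)
approx_pq = f_divergence(p, q, second_order_generator)

print("KL(p||q), KL(q||p):", kl_pq, kl_qp)
print("Jeffreys both ways:", jeff_pq, jeff_qp)
print("second-order approximation of KL(p||q):", approx_pq)
```

Swapping the arguments changes the KL value but leaves the Jeffreys value unchanged, and for nearby distributions the truncated series already tracks KL closely.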

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-06 09:01:29 UTC 发布时间：2025-08-06 09:01:29 UTC

#91 A Hybrid AI Methodology for Generating Ontologies of Research Topics from Scientific Paper Corpora #91 一种用于从科学论文语料库生成研究主题本体的混合 AI 方法论

Authors: [Alessia Pisu](https://arxiv.org/search/?searchtype=author&query=Alessia Pisu), [Livio Pompianu](https://arxiv.org/search/?searchtype=author&query=Livio Pompianu), [Francesco Osborne](https://arxiv.org/search/?searchtype=author&query=Francesco Osborne), [Diego Reforgiato Recupero](https://arxiv.org/search/?searchtype=author&query=Diego Reforgiato Recupero), [Daniele Riboni](https://arxiv.org/search/?searchtype=author&query=Daniele Riboni), [Angelo Salatino](https://arxiv.org/search/?searchtype=author&query=Angelo Salatino) 作者：Alessia Pisu, Livio Pompianu, Francesco Osborne, Diego Reforgiato Recupero, Daniele Riboni, Angelo Salatino

Taxonomies and ontologies of research topics (e.g., MeSH, UMLS, CSO, NLM) play a central role in providing the primary framework through which intelligent systems can explore and interpret the literature. However, these resources have traditionally been manually curated, a process that is time-consuming, prone to obsolescence, and limited in granularity. This paper presents Sci-OG, a semi-automated methodology for generating research topic ontologies, employing a multi-step approach: 1) Topic Discovery, extracting potential topics from research papers; 2) Relationship Classification, determining semantic relationships between topic pairs; and 3) Ontology Construction, refining and organizing topics into a structured ontology. The relationship classification component, which constitutes the core of the system, integrates an encoder-based language model with features describing topic occurrence in the scientific literature. We evaluate this approach against a range of alternative solutions using a dataset of 21,649 manually annotated semantic triples. Our method achieves the highest F1 score (0.951), surpassing various competing approaches, including a fine-tuned SciBERT model and several LLM baselines, such as the fine-tuned GPT4-mini. Our work is corroborated by a use case which illustrates the practical application of our system to extend the CSO ontology in the area of cybersecurity. The presented solution is designed to improve the accessibility, organization, and analysis of scientific knowledge, thereby supporting advancements in AI-enabled literature management and research exploration.
研究主题的分类法和本体（例如，MeSH、UMLS、CSO、NLM）在为智能系统探索和解读文献提供主要框架方面起着核心作用。然而，这些资源传统上是手工维护的，这一过程既耗时，又容易过时，且粒度有限。本文提出了 Sci-OG，一种半自动生成研究主题本体的方法，采用多步骤方法：1）主题发现，从研究论文中提取潜在主题；2）关系分类，确定主题对之间的语义关系；3）本体构建，精炼并组织主题形成结构化本体。关系分类部分是系统的核心，结合了基于编码器的语言模型和描述主题在科学文献中出现情况的特征。我们使用包含 21,649 个人工标注语义三元组的数据集，对该方法与多种替代方案进行了评估。我们的方法实现了最高的 F1 分数（0.951），超越了包括微调的 SciBERT 模型和多个 LLM 基线（如微调的 GPT4-mini）在内的各种竞争方法。我们的工作通过一个用例得到了验证，该用例展示了我们的系统在网络安全领域扩展 CSO 本体的实际应用。所提出的解决方案旨在提升科学知识的可访问性、组织和分析，从而支持基于 AI 的文献管理和研究探索的进步。

Subjects: Digital Libraries, Artificial Intelligence, Information Retrieval 主题：数字图书馆，人工智能，信息检索

Publish: 2025-08-06 08:48:14 UTC 发布时间：2025-08-06 08:48:14 UTC

#92 ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments #92 ReasoningGuard：通过推理时的安全灵感时刻保护大型推理模型

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-06 08:35:10 UTC 发布时间：2025-08-06 08:35:10 UTC

#93 ViFP: A Framework for Visual False Positive Detection to Enhance Reasoning Reliability in VLMs #93 ViFP：一个用于视觉假阳性检测以增强视觉语言模型推理可靠性的框架

Authors: [Ben Zhang](https://arxiv.org/search/?searchtype=author&query=Ben Zhang), [LuLu Yu](https://arxiv.org/search/?searchtype=author&query=LuLu Yu), [Lei Gao](https://arxiv.org/search/?searchtype=author&query=Lei Gao), [Jing Liu](https://arxiv.org/search/?searchtype=author&query=Jing Liu), [QuanJiang Guo](https://arxiv.org/search/?searchtype=author&query=QuanJiang Guo), [Hui Gao](https://arxiv.org/search/?searchtype=author&query=Hui Gao) 作者：Ben Zhang, LuLu Yu, Lei Gao, Jing Liu, QuanJiang Guo, Hui Gao

In visual-language model (VLM) reasoning, false positive (FP) reasoning occurs when a model generates a correct answer but follows an incorrect reasoning path. Existing methods rely on specific multi-step reasoning datasets and reinforcement learning strategies, leading to high training costs and limited generalization. In this work, we propose ViFP, a general framework for enhancing visual reasoning reliability. It improves both answer accuracy and reasoning soundness by detecting FPs. ViFP tackles the limitations of dataset dependency and poor generalization by constructing sub-question templates grounded in the core dimensions of visual reasoning, such as object localization, characteristic description, and object discovery. ViFP then builds effective reasoning paths via multi-turn QA to improve reasoning accuracy. Meanwhile, ViFP dynamically analyzes the consistency of the reasoning path to identify potential FPs, and introduces a targeted chain-of-thought (CoT) mechanism that adaptively guides both FP and non-FP samples, thereby reducing logical errors in the reasoning path while preserving accuracy. Finally, we introduce a reliability evaluation metric, VoC, which integrates answer accuracy and the FP rate, providing a quantitative tool to assess whether a VLM not only answers correctly but also reasons reliably. Our experiments on closed-source VLMs show that ViFP consistently improves performance across three datasets: A-OKVQA, OKVQA, and FVQA. On A-OKVQA, ViFP improves accuracy by up to 5.4%, surpassing the previous state-of-the-art by 4.3%, and significantly reduces the number of FPs, validating its benefits in enhancing reasoning reliability.
在视觉-语言模型（VLM）推理中，假阳性（FP）推理指的是模型生成了正确答案，但其推理路径却是错误的。现有方法依赖于特定的多步推理数据集和强化学习策略，导致训练成本高且泛化能力有限。在本工作中，我们提出了 ViFP，一种增强视觉推理可靠性的一般框架。它通过检测假阳性，提升答案准确率和推理合理性。ViFP 通过构建基于视觉推理核心维度（如对象定位、特征描述和对象发现）的子问题模板，解决了数据集依赖和泛化能力差的限制。随后，ViFP 通过多轮问答构建有效的推理路径，以提高推理准确性。同时，ViFP 动态分析推理路径的一致性以识别潜在的假阳性，并引入了针对性的思维链（CoT）机制，自适应地引导假阳性和非假阳性样本，从而在保持准确性的同时减少推理路径中的逻辑错误。最后，我们引入了一种可靠性评估指标——VoC，该指标整合了答案准确率和误报率，提供了一个定量工具，用于评估视觉语言模型（VLM）不仅能正确回答问题，还能进行可靠推理。我们在闭源 VLM 上的实验表明，ViFP 在三个数据集 A-OKVQA、OKVQA 和 FVQA 上均持续提升了性能。在 A-OKVQA 数据集上，ViFP 将准确率提升了最多 5.4%，超越了之前的最先进水平 4.3%，并显著减少了误报数量，验证了其在增强推理可靠性方面的优势。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-06 08:31:11 UTC 发布时间：2025-08-06 08:31:11 UTC

#94 Gather and Trace: Rethinking Video TextVQA from an Instance-oriented Perspective #94 收集与追踪：从实例导向视角重新思考视频文本视觉问答

Video text-based visual question answering (Video TextVQA) aims to answer questions by explicitly reading and reasoning about the text involved in a video. Most works in this field follow a frame-level framework which suffers from redundant text entities and implicit relation modeling, resulting in limitations in both accuracy and efficiency. In this paper, we rethink the Video TextVQA task from an instance-oriented perspective and propose a novel model termed GAT (Gather and Trace). First, to obtain an accurate reading result for each video text instance, a context-aggregated instance gathering module is designed to integrate the visual appearance, layout characteristics, and textual contents of the related entities into a unified textual representation. Then, to capture the dynamic evolution of text in the video flow, an instance-focused trajectory tracing module is utilized to establish spatio-temporal relationships between instances and infer the final answer. Extensive experiments on several public Video TextVQA datasets validate the effectiveness and generalization of our framework. GAT outperforms existing Video TextVQA methods, video-language pretraining methods, and video large language models in both accuracy and inference speed. Notably, GAT surpasses the previous state-of-the-art Video TextVQA methods by 3.86% in accuracy and achieves ten times faster inference than video large language models. The source code is available at https://github.com/zhangyan-ucas/GAT. 基于视频文本的视觉问答（Video TextVQA）旨在通过明确读取和推理视频中涉及的文本来回答问题。该领域的大多数工作遵循帧级框架，但该框架存在文本实体冗余和隐式关系建模的问题，导致准确性和效率均受限。本文从实例导向的视角重新思考 Video TextVQA 任务，提出了一种新颖的模型，称为 GAT（Gather and Trace）。首先，为了获得每个视频文本实例的准确阅读结果，设计了一个上下文聚合实例收集模块，将相关实体的视觉外观、布局特征和文本内容整合为统一的文本表示。然后，为了捕捉视频流中文本的动态演变，采用了一个实例聚焦轨迹追踪模块，以建立实例之间的时空关系并推断最终答案。在多个公开的 Video TextVQA 数据集上进行的大量实验验证了我们框架的有效性和泛化能力。GAT 在准确率和推理速度上均优于现有的视频文本问答方法、视频语言预训练方法以及视频大型语言模型。值得注意的是，GAT 在准确率上比之前的最先进视频文本问答方法高出 3.86%，推理速度比视频大型语言模型快十倍。源代码可在 https://github.com/zhangyan-ucas/GAT 获取。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-06 08:26:36 UTC 发布时间：2025-08-06 08:26:36 UTC

#95 Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models #95 诱发并分析最先进大型语言模型中的新兴错位

Subjects: Computation and Language, Artificial Intelligence, Cryptography and Security 主题：计算与语言，人工智能，密码学与安全

Publish: 2025-08-06 08:25:40 UTC 发布时间：2025-08-06 08:25:40 UTC

#96 NVSpeech: An Integrated and Scalable Pipeline for Human-Like Speech Modeling with Paralinguistic Vocalizations #96 NVSpeech：一个集成且可扩展的人类语音建模管道，包含副语言声学表现

Paralinguistic vocalizations, including non-verbal sounds like laughter and breathing as well as lexicalized interjections such as “uhm” and “oh”, are integral to natural spoken communication. Despite their importance in conveying affect, intent, and interactional cues, such cues remain largely overlooked in conventional automatic speech recognition (ASR) and text-to-speech (TTS) systems. We present NVSpeech, an integrated and scalable pipeline that bridges the recognition and synthesis of paralinguistic vocalizations, encompassing dataset construction, ASR modeling, and controllable TTS. (1) We introduce a manually annotated dataset of 48,430 human-spoken utterances with 18 word-level paralinguistic categories. (2) We develop a paralinguistic-aware ASR model, which treats paralinguistic cues as inline decodable tokens (e.g., “You’re so funny [Laughter]”), enabling joint lexical and non-verbal transcription. This model is then used to automatically annotate a large corpus, the first large-scale Chinese dataset of 174,179 utterances (573 hours) with word-level alignment and paralinguistic cues. (3) We finetune zero-shot TTS models on both human- and auto-labeled data to enable explicit control over paralinguistic vocalizations, allowing context-aware insertion at arbitrary token positions for human-like speech synthesis. By unifying the recognition and generation of paralinguistic vocalizations, NVSpeech offers the first open, large-scale, word-level annotated pipeline for expressive speech modeling in Mandarin, integrating recognition and synthesis in a scalable and controllable manner. Dataset and audio demos are available at https://nvspeech170k.github.io/.
副语言声响——包括非语言声音如笑声和呼吸，以及词汇化的感叹词如“嗯”和“哦”——是自然口语交流的组成部分。尽管这些线索在传达情感、意图和互动提示方面非常重要，但在传统的自动语音识别（ASR）和文本转语音（TTS）系统中，这些线索仍然在很大程度上被忽视。我们提出了 NVSpeech，一个集成且可扩展的流程，连接副语言声响的识别与合成，涵盖数据集构建、ASR 建模和可控 TTS。（1）我们引入了一个手工标注的数据集，包含 48,430 条人类语音语句，涵盖 18 个词级副语言类别。（2）我们开发了副语言感知的 ASR 模型，将副语言线索视为可内联解码的标记（例如，“你真有趣 [笑声]”），实现词汇和非语言转录的联合。这一模型随后被用于自动标注一个大型语料库，这是首个具有词级对齐和副语言线索的 174,179 条中文语句（573 小时）的大规模数据集。 (3) 我们在人工标注和自动标注的数据上对零样本 TTS 模型进行微调，以实现对副语言声音的显式控制，允许在任意词元位置进行上下文感知的插入，从而实现类人语音合成。通过统一副语言声音的识别和生成，NVSpeech 提供了首个开放的大规模、词级注释的普通话表现力语音建模流程，以可扩展且可控的方式整合识别与合成。数据集和音频演示可在 https://nvspeech170k.github.io/ 获取。

Subjects: Sound, Artificial Intelligence, Machine Learning 主题：声音，人工智能，机器学习

Publish: 2025-08-06 08:25:26 UTC 发布：2025-08-06 08:25:26 UTC

#97 Hacking Hallucinations of MLLMs with Causal Sufficiency and Necessity #97 利用因果充分性和必要性破解多模态大模型的幻觉

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-06 08:09:12 UTC 发布：2025-08-06 08:09:12 UTC

#98 Quasi-Clique Discovery via Energy Diffusion #98 通过能量扩散发现准团簇

Authors: [Yu Zhang](https://arxiv.org/search/?searchtype=author&query=Yu Zhang), [Yilong Luo](https://arxiv.org/search/?searchtype=author&query=Yilong Luo), [Mingyuan Ma](https://arxiv.org/search/?searchtype=author&query=Mingyuan Ma), [Yao Chen](https://arxiv.org/search/?searchtype=author&query=Yao Chen), [Enqiang Zhu](https://arxiv.org/search/?searchtype=author&query=Enqiang Zhu), [Jin Xu](https://arxiv.org/search/?searchtype=author&query=Jin Xu), [Chanjuan Liu](https://arxiv.org/search/?searchtype=author&query=Chanjuan Liu) 作者：张宇，罗一龙，马明远，陈尧，朱恩强，徐进，刘婵娟

Discovering quasi-cliques – subgraphs with edge density no less than a given threshold – is a fundamental task in graph mining, with broad applications in social networks, bioinformatics, and e-commerce. Existing heuristics often rely on greedy rules, similarity measures, or metaheuristic search, but struggle to maintain both efficiency and solution consistency across diverse graphs. This paper introduces EDQC, a novel quasi-clique discovery algorithm inspired by energy diffusion. Instead of explicitly enumerating candidate subgraphs, EDQC performs stochastic energy diffusion from source vertices, naturally concentrating energy within structurally cohesive regions. The approach enables efficient dense subgraph discovery without exhaustive search or dataset-specific tuning. Experimental results on 30 real-world datasets demonstrate that EDQC consistently discovers larger quasi-cliques than state-of-the-art baselines on the majority of datasets, while also yielding lower variance in solution quality. To the best of our knowledge, EDQC is the first method to incorporate energy diffusion into quasi-clique discovery. 发现准团——边密度不低于给定阈值的子图——是图挖掘中的一项基础任务，在社交网络、生物信息学和电子商务等领域有广泛应用。现有的启发式方法通常依赖贪心规则、相似性度量或元启发式搜索，但难以在多样化图中同时保持效率和解的一致性。本文提出了 EDQC，一种受能量扩散启发的新型准团发现算法。EDQC 不显式枚举候选子图，而是从源顶点执行随机能量扩散，自然地将能量集中在结构紧密的区域内。该方法无需穷举搜索或针对特定数据集的调优，即可高效发现稠密子图。对 30 个真实数据集的实验结果表明，EDQC 在大多数数据集上均能稳定发现比最先进基线更大的准团，同时解的质量方差更低。据我们所知，EDQC 是首个将能量扩散引入准团发现的方法。
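The quasi-clique definition used in the abstract (a subgraph whose edge density is no less than a threshold) can be made concrete with a small check. This is only a sketch of the definition, not of EDQC's energy-diffusion search; the function names, the threshold name `gamma`, and the convention for subgraphs with fewer than two vertices are assumptions made for illustration.

```python
from itertools import combinations


def edge_density(nodes, edges):
    """Fraction of vertex pairs inside `nodes` that are joined by an edge."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        # Assumed convention: a trivial subgraph counts as fully dense.
        return 1.0
    edge_set = {frozenset(e) for e in edges}
    present = sum(
        1 for u, v in combinations(nodes, 2) if frozenset((u, v)) in edge_set
    )
    return present / (n * (n - 1) / 2)


def is_quasi_clique(nodes, edges, gamma):
    """True if the induced subgraph has edge density >= gamma."""
    return edge_density(nodes, edges) >= gamma


# Toy undirected graph: a triangle {1, 2, 3} plus vertex 4 attached to 2 and 3.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (2, 4)]
density = edge_density([1, 2, 3, 4], edges)  # 5 of the 6 possible pairs
```

With this toy graph, the vertex set {1, 2, 3, 4} passes a threshold of 0.8 but fails 0.9, while the triangle {1, 2, 3} is a clique and passes any threshold up to 1.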

Subjects: Social and Information Networks, Artificial Intelligence 主题：社会与信息网络，人工智能

Publish: 2025-08-06 07:59:56 UTC 发布时间：2025-08-06 07:59:56 UTC

#99 Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap #99 基于难度的偏好数据选择通过 DPO 隐式奖励差距

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-06 07:24:14 UTC 发布时间：2025-08-06 07:24:14 UTC

#100 COPO: Consistency-Aware Policy Optimization #100 COPO：一致性感知的策略优化

Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题：机器学习，人工智能，计算与语言

Publish: 2025-08-06 07:05:18 UTC 发布时间：2025-08-06 07:05:18 UTC

#101 UniFGVC: Universal Training-Free Few-Shot Fine-Grained Vision Classification via Attribute-Aware Multimodal Retrieval #101 UniFGVC：通过属性感知多模态检索实现的通用无训练少样本细粒度视觉分类

Authors: [Hongyu Guo](https://arxiv.org/search/?searchtype=author&query=Hongyu Guo), [Kuan Zhu](https://arxiv.org/search/?searchtype=author&query=Kuan Zhu), [Xiangzhao Hao](https://arxiv.org/search/?searchtype=author&query=Xiangzhao Hao), [Haiyun Guo](https://arxiv.org/search/?searchtype=author&query=Haiyun Guo), [Ming Tang](https://arxiv.org/search/?searchtype=author&query=Ming Tang), [Jinqiao Wang](https://arxiv.org/search/?searchtype=author&query=Jinqiao Wang) 作者：郭鸿宇，朱宽，郝祥钊，郭海云，唐明，王金桥

Few-shot fine-grained visual classification (FGVC) aims to leverage limited data to enable models to discriminate subtly distinct categories. Recent work has mostly fine-tuned pre-trained vision-language models to improve performance, but suffers from overfitting and weak generalization. To address this, we introduce UniFGVC, a universal training-free framework that reformulates few-shot FGVC as multimodal retrieval. First, we propose the Category-Discriminative Visual Captioner (CDV-Captioner), which exploits the open-world knowledge of multimodal large language models (MLLMs) to generate a structured text description that captures the fine-grained attribute features distinguishing closely related classes. CDV-Captioner uses chain-of-thought prompting and visually similar reference images to reduce hallucination and enhance the discriminability of generated captions. With it, each image is converted into an image-description pair, enabling more comprehensive feature representation, and multimodal category templates are constructed from the few-shot samples for the subsequent retrieval pipeline. Then, off-the-shelf vision and text encoders embed query and template pairs, and FGVC is accomplished by retrieving the nearest template in the joint space. UniFGVC ensures broad compatibility with diverse MLLMs and encoders, offering reliable generalization and adaptability across few-shot FGVC scenarios. Extensive experiments on 12 FGVC benchmarks demonstrate its consistent superiority over prior few-shot CLIP-based methods and even several fully supervised MLLM-based approaches.
少样本细粒度视觉分类（FGVC）旨在利用有限的数据使模型能够区分细微差别的类别。近期的工作大多通过微调预训练的视觉语言模型来提升性能，但存在过拟合和泛化能力弱的问题。为了解决这一问题，我们提出了 UniFGVC，一个无需训练的通用框架，将少样本 FGVC 重新定义为多模态检索。首先，我们提出了类别判别视觉描述器（CDV-Captioner），利用多模态大语言模型（MLLMs）的开放世界知识，生成结构化的文本描述，捕捉区分密切相关类别的细粒度属性特征。CDV-Captioner 采用链式思维提示和视觉相似的参考图像，以减少幻觉现象并增强生成描述的判别能力。借助该方法，我们可以将每张图像转换为图像-描述对，实现更全面的特征表示，并利用少样本构建多模态类别模板，用于后续的检索流程。然后，现成的视觉和文本编码器对查询和模板对进行嵌入，通过在联合空间中检索最近的模板来完成细粒度视觉分类（FGVC）。UniFGVC 确保与多种多模态大模型（MLLMs）和编码器的广泛兼容性，在少样本 FGVC 场景中提供可靠的泛化能力和适应性。在 12 个 FGVC 基准上的大量实验表明，其性能持续优于以往基于少样本 CLIP 的方法，甚至优于若干全监督的基于 MLLMs 的方法。
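The final retrieval step described above (embed image-description pairs, then pick the nearest category template) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the embeddings are toy vectors standing in for encoder outputs, the joint representation is assumed to be a plain concatenation of image and text vectors, and similarity is assumed to be cosine; the abstract does not specify any of these.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(query, templates):
    """Return the label of the template closest to the query.

    query:     (image_vec, text_vec)
    templates: list of (label, image_vec, text_vec)
    The joint vector is an assumed concatenation of both modalities.
    """
    q = list(query[0]) + list(query[1])
    best_label, best_score = None, -math.inf
    for label, img, txt in templates:
        score = cosine(q, list(img) + list(txt))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical few-shot templates with 2-d toy embeddings per modality.
templates = [
    ("sparrow", [1.0, 0.0], [1.0, 0.0]),
    ("finch", [0.0, 1.0], [0.0, 1.0]),
]
prediction = classify(([0.9, 0.1], [0.8, 0.2]), templates)
```

Because no parameters are updated, adding a class in this sketch only means appending another template, which is the sense in which a retrieval formulation is training-free.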

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-06 07:02:39 UTC 发布时间：2025-08-06 07:02:39 UTC

#102 DS2Net: Detail-Semantic Deep Supervision Network for Medical Image Segmentation #102 DS2Net：用于医学图像分割的细节-语义深度监督网络

Authors: [Zhaohong Huang](https://arxiv.org/search/?searchtype=author&query=Zhaohong Huang), [Yuxin Zhang](https://arxiv.org/search/?searchtype=author&query=Yuxin Zhang), [Mingbao Lin](https://arxiv.org/search/?searchtype=author&query=Mingbao Lin), [Taojian Zhou](https://arxiv.org/search/?searchtype=author&query=Taojian Zhou), [Guorong Cai](https://arxiv.org/search/?searchtype=author&query=Guorong Cai), [Rongrong Ji](https://arxiv.org/search/?searchtype=author&query=Rongrong Ji) 作者：黄昭宏，张宇新，林明宝，周涛建，蔡国荣，纪荣荣

Deep Supervision Networks exhibit significant efficacy for the medical imaging community. Nevertheless, existing work supervises either the coarse-grained semantic features or the fine-grained detailed features in isolation, which overlooks the vital relationships between these two types of features in medical image analysis. We advocate complementary feature supervision for medical image segmentation by proposing a Detail-Semantic Deep Supervision Network (DS2Net). DS2Net navigates both low-level detailed and high-level semantic feature supervision through a Detail Enhance Module (DEM) and a Semantic Enhance Module (SEM). DEM and SEM respectively harness low-level and high-level feature maps to create detail and semantic masks for enhancing feature supervision. This is a novel shift from single-view deep supervision to multi-view deep supervision. DS2Net is also equipped with a novel uncertainty-based supervision loss that adaptively assigns the supervision strength of features within distinct scales based on their uncertainty, thus circumventing the sub-optimal heuristic design that typifies previous works. Through extensive experiments on six benchmarks captured under colonoscopy, ultrasound, or microscopy, we demonstrate that DS2Net consistently outperforms state-of-the-art methods for medical image analysis. 深度监督网络在医学影像领域表现出显著的效果。然而，现有工作仅单独监督粗粒度的语义特征或细粒度的细节特征，忽视了这两类特征在医学图像分析中存在的重要关联。我们倡导互补特征监督在医学图像分割中的作用，提出了一种细节-语义深度监督网络（DS2Net）。DS2Net 通过细节增强模块（DEM）和语义增强模块（SEM）同时引导低层次细节和高层次语义特征的监督。DEM 和 SEM 分别利用低层次和高层次的特征图生成细节和语义掩码，以增强特征监督。这是从单视角深度监督向多视角深度监督的创新转变。DS2Net 还配备了一种新颖的不确定性监督损失，根据特征在不同尺度上的不确定性自适应分配监督强度，从而避免了以往工作中典型的次优启发式设计。通过在六个基准数据集上进行大量实验，这些数据集分别通过结肠镜、超声和显微镜采集，我们证明了 DS2Net 在医学图像分析中始终优于最先进的方法。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-06 06:57:36 UTC 发布时间：2025-08-06 06:57:36 UTC

#103 Experimental Analysis of Productive Interaction Strategy with ChatGPT: User Study on Function and Project-level Code Generation Tasks #103 使用 ChatGPT 的高效交互策略实验分析：关于函数和项目级代码生成任务的用户研究

Authors: [Sangwon Hyun](https://arxiv.org/search/?searchtype=author&query=Sangwon Hyun), [Hyunjun Kim](https://arxiv.org/search/?searchtype=author&query=Hyunjun Kim), [Jinhyuk Jang](https://arxiv.org/search/?searchtype=author&query=Jinhyuk Jang), [Hyojin Choi](https://arxiv.org/search/?searchtype=author&query=Hyojin Choi), [M. Ali Babar](https://arxiv.org/search/?searchtype=author&query=M. Ali Babar) 作者：Sangwon Hyun, Hyunjun Kim, Jinhyuk Jang, Hyojin Choi, M. Ali Babar

The application of Large Language Models (LLMs) is growing in the productive completion of Software Engineering tasks. Yet, studies investigating the productive prompting techniques often employed a limited problem space, primarily focusing on well-known prompting patterns and mainly targeting function-level SE practices. We identify significant gaps in real-world workflows that involve complexities beyond class-level (e.g., multi-class dependencies) and different features that can impact Human-LLM Interactions (HLIs) processes in code generation. To address these issues, we designed an experiment that comprehensively analyzed the HLI features regarding the code generation productivity. Our study presents two project-level benchmark tasks, extending beyond function-level evaluations. We conducted a user study with 36 participants from diverse backgrounds, asking them to solve the assigned tasks by interacting with the GPT assistant using specific prompting patterns. We also examined the participants’ experience and their behavioral features during interactions by analyzing screen recordings and GPT chat logs. Our statistical and empirical investigation revealed (1) that three out of 15 HLI features significantly impacted the productivity in code generation; (2) five primary guidelines for enhancing productivity for HLI processes; and (3) a taxonomy of 29 runtime and logic errors that can occur during HLI processes, along with suggested mitigation plans. 大型语言模型（LLMs）在高效完成软件工程任务中的应用日益增长。然而，研究高效提示技术的相关研究通常采用有限的问题空间，主要关注知名的提示模式，并且主要针对函数级的软件工程实践。我们发现现实工作流程中存在显著的空白，这些流程涉及超出类级别的复杂性（例如，多类依赖）以及可能影响代码生成中人机交互（HLI）过程的不同特征。为了解决这些问题，我们设计了一项实验，全面分析了与代码生成生产力相关的人机交互特征。我们的研究提出了两个项目级的基准任务，超越了函数级的评估。我们进行了包含 36 名来自不同背景参与者的用户研究，要求他们通过使用特定提示模式与 GPT 助手互动来解决分配的任务。我们还通过分析屏幕录制和 GPT 聊天记录，考察了参与者的体验及其交互过程中的行为特征。我们的统计和实证研究揭示了：(1) 在 15 个 HLI 特征中，有三个显著影响代码生成的生产力；(2) 提出了五条提升 HLI 流程生产力的主要指导原则；(3) 归纳了 29 种可能在 HLI 流程中发生的运行时和逻辑错误，并提出了相应的缓解方案。

Subjects: Software Engineering, Artificial Intelligence 主题：软件工程，人工智能

Publish: 2025-08-06 06:48:48 UTC 发布时间：2025-08-06 06:48:48 UTC

#104 Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder #104 通过轻量级掩码解码器释放多模态大模型在指代表达分割中的潜力

Authors: [Jingchao Wang](https://arxiv.org/search/?searchtype=author&query=Jingchao Wang), [Zhijian Wu](https://arxiv.org/search/?searchtype=author&query=Zhijian Wu), [Dingjiang Huang](https://arxiv.org/search/?searchtype=author&query=Dingjiang Huang), [Yefeng Zheng](https://arxiv.org/search/?searchtype=author&query=Yefeng Zheng), [Hong Wang](https://arxiv.org/search/?searchtype=author&query=Hong Wang) 作者：王景超、吴志坚、黄定江、郑业锋、王红

Reference Expression Segmentation (RES) aims to segment image regions specified by referring expressions and has become popular with the rise of multimodal large models (MLLMs). While MLLMs excel in semantic understanding, their token-generation paradigm struggles with pixel-level dense prediction. Existing RES methods either couple MLLMs with the parameter-heavy Segment Anything Model (SAM) with 632M network parameters or adopt SAM-free lightweight pipelines that sacrifice accuracy. To address the trade-off between performance and cost, we specifically propose MLLMSeg, a novel framework that fully exploits the inherent visual detail features encoded in the MLLM vision encoder without introducing an extra visual encoder. Besides, we propose a detail-enhanced and semantic-consistent feature fusion module (DSFF) that fully integrates the detail-related visual feature with the semantic-related feature output by the large language model (LLM) of MLLM. Finally, we establish a light-weight mask decoder with only 34M network parameters that optimally leverages detailed spatial features from the visual encoder and semantic features from the LLM to achieve precise mask prediction. Extensive experiments demonstrate that our method generally surpasses both SAM-based and SAM-free competitors, striking a better balance between performance and cost. Code is available at https://github.com/jcwang0602/MLLMSeg. 指代表达分割（RES）旨在分割由指代表达指定的图像区域，随着多模态大模型（MLLMs）的兴起而变得流行。尽管 MLLMs 在语义理解方面表现出色，但其基于生成 token 的范式在像素级密集预测上存在困难。现有的 RES 方法要么将 MLLMs 与参数庞大的 Segment Anything Model（SAM，拥有 6.32 亿网络参数）结合，要么采用不依赖 SAM 的轻量级流程，但后者牺牲了准确性。为了解决性能与成本之间的权衡，我们特别提出了 MLLMSeg，这是一种新颖的框架，充分利用 MLLM 视觉编码器中固有的视觉细节特征，而无需引入额外的视觉编码器。此外，我们提出了一个细节增强且语义一致的特征融合模块（DSFF），该模块充分整合了与细节相关的视觉特征和 MLLM 中大语言模型（LLM）输出的语义相关特征。最后，我们建立了一个仅有 3400 万网络参数的轻量级掩码解码器，能够最佳利用来自视觉编码器的细节空间特征和来自 LLM 的语义特征，实现精确的掩码预测。大量实验表明，我们的方法通常优于基于 SAM 和非 SAM 的竞争方法，在性能和成本之间实现了更好的平衡。代码可在 https://github.com/jcwang0602/MLLMSeg 获取。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-06 06:06:52 UTC 发布时间：2025-08-06 06:06:52 UTC

#105 SenseCrypt: Sensitivity-guided Selective Homomorphic Encryption for Joint Federated Learning in Cross-Device Scenarios #105 SenseCrypt：面向跨设备场景联合联邦学习的敏感性引导选择性同态加密

Authors: [Borui Li](https://arxiv.org/search/?searchtype=author&query=Borui Li), [Li Yan](https://arxiv.org/search/?searchtype=author&query=Li Yan), [Junhao Han](https://arxiv.org/search/?searchtype=author&query=Junhao Han), [Jianmin Liu](https://arxiv.org/search/?searchtype=author&query=Jianmin Liu), [Lei Yu](https://arxiv.org/search/?searchtype=author&query=Lei Yu) 作者：李博睿，闫丽，韩俊浩，刘建民，余雷

Homomorphic Encryption (HE) prevails in securing Federated Learning (FL), but suffers from high overhead and adaptation cost. Selective HE methods, which partially encrypt model parameters by a global mask, are expected to protect privacy with reduced overhead and easy adaptation. However, in cross-device scenarios with heterogeneous data and system capabilities, traditional Selective HE methods deteriorate client straggling, and suffer from degraded HE overhead reduction performance. Accordingly, we propose SenseCrypt, a Sensitivity-guided selective Homomorphic EnCryption framework, to adaptively balance security and HE overhead per cross-device FL client. Given the observation that model parameter sensitivity is effective for measuring clients’ data distribution similarity, we first design a privacy-preserving method to respectively cluster the clients with similar data distributions. Then, we develop a scoring mechanism to deduce the straggler-free ratio of model parameters that can be encrypted by each client per cluster. Finally, for each client, we formulate and solve a multi-objective model parameter selection optimization problem, which minimizes HE overhead while maximizing model security without causing straggling. Experiments demonstrate that SenseCrypt ensures security against the state-of-the-art inversion attacks, while achieving normal model accuracy as on IID data, and reducing training time by 58.4%-88.7% as compared to traditional HE methods. 同态加密（HE）在保障联邦学习（FL）安全方面占据主导地位，但存在高开销和适应成本的问题。选择性同态加密方法通过全局掩码部分加密模型参数，期望以较低的开销和简便的适应性保护隐私。然而，在具有异构数据和系统能力的跨设备场景中，传统的选择性同态加密方法会加剧客户端拖延问题，并且同态加密的开销降低效果也会下降。因此，我们提出了 SenseCrypt，一种基于敏感度引导的选择性同态加密框架，能够针对每个跨设备联邦学习客户端自适应地平衡安全性和同态加密开销。鉴于模型参数敏感度在衡量客户端数据分布相似性方面的有效性，我们首先设计了一种隐私保护方法，用于分别聚类具有相似数据分布的客户端。随后，我们开发了一种评分机制，以推断每个客户端在各自聚类中可加密的无拖延模型参数比例。最后，对于每个客户端，我们制定并解决了一个多目标模型参数选择优化问题，该问题在不引起延迟的情况下，最小化同态加密开销，同时最大化模型安全性。实验表明，SenseCrypt 能够防御最先进的反演攻击，同时在 IID 数据上实现正常的模型准确率，并且相比传统的同态加密方法，训练时间减少了 58.4%至 88.7%。

Subjects: Cryptography and Security, Artificial Intelligence, Distributed, Parallel, and Cluster Computing 主题：密码学与安全，人工智能，分布式、并行与集群计算

Publish: 2025-08-06 05:42:41 UTC 发布时间：2025-08-06 05:42:41 UTC

#106 DET-GS: Depth- and Edge-Aware Regularization for High-Fidelity 3D Gaussian Splatting #106 DET-GS：用于高保真 3D 高斯点绘制的深度和边缘感知正则化

Authors: [Zexu Huang](https://arxiv.org/search/?searchtype=author&query=Zexu Huang), [Min Xu](https://arxiv.org/search/?searchtype=author&query=Min Xu), [Stuart Perry](https://arxiv.org/search/?searchtype=author&query=Stuart Perry) 作者：黄泽旭，徐敏，斯图尔特·佩里

3D Gaussian Splatting (3DGS) represents a significant advancement in the field of efficient and high-fidelity novel view synthesis. Despite recent progress, achieving accurate geometric reconstruction under sparse-view conditions remains a fundamental challenge. Existing methods often rely on non-local depth regularization, which fails to capture fine-grained structures and is highly sensitive to depth estimation noise. Furthermore, traditional smoothing methods neglect semantic boundaries and indiscriminately degrade essential edges and textures, consequently limiting the overall quality of reconstruction. In this work, we propose DET-GS, a unified depth and edge-aware regularization framework for 3D Gaussian Splatting. DET-GS introduces a hierarchical geometric depth supervision framework that adaptively enforces multi-level geometric consistency, significantly enhancing structural fidelity and robustness against depth estimation noise. To preserve scene boundaries, we design an edge-aware depth regularization guided by semantic masks derived from Canny edge detection. Furthermore, we introduce an RGB-guided edge-preserving Total Variation loss that selectively smooths homogeneous regions while rigorously retaining high-frequency details and textures. Extensive experiments demonstrate that DET-GS achieves substantial improvements in both geometric accuracy and visual fidelity, outperforming state-of-the-art (SOTA) methods on sparse-view novel view synthesis benchmarks. 3D 高斯点溅射（3DGS）代表了高效且高保真新视角合成领域的一项重大进展。尽管近期取得了一些进展，但在稀疏视角条件下实现准确的几何重建仍然是一个根本性挑战。现有方法通常依赖于非局部深度正则化，这种方法无法捕捉细粒度结构且对深度估计噪声高度敏感。此外，传统的平滑方法忽视语义边界，盲目地削弱了关键边缘和纹理，从而限制了整体重建质量。在本工作中，我们提出了 DET-GS，一种统一的深度和边缘感知正则化框架，适用于 3D 高斯点溅射。DET-GS 引入了分层几何深度监督框架，自适应地强制多层次几何一致性，显著提升了结构的保真度和对深度估计噪声的鲁棒性。为了保护场景边界，我们设计了一种由 Canny 边缘检测生成的语义掩码引导的边缘感知深度正则化。此外，我们引入了一种基于 RGB 引导的边缘保留全变差损失，该损失选择性地平滑均匀区域，同时严格保留高频细节和纹理。大量实验表明，DET-GS 在几何精度和视觉保真度方面均取得了显著提升，优于稀疏视角新视图合成基准上的最先进（SOTA）方法。
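The idea of an edge-preserving total variation (TV) term, smoothing homogeneous regions while leaving edges alone, can be illustrated with a widely used formulation in which each neighbor difference is down-weighted by the gradient of a guide image. This is a generic sketch and not necessarily the loss used in DET-GS: the exponential weighting, the `beta` parameter, and the use of a single-channel guide instead of RGB are all assumptions made for illustration.

```python
import math


def edge_aware_tv(values, guide, beta=10.0):
    """Mean guide-weighted absolute difference between neighboring pixels.

    values: 2-D list being regularized (e.g. a rendered depth map)
    guide:  2-D list of the same shape (e.g. a grayscale reference image)
    Differences across strong guide edges get weight exp(-beta * |edge|),
    so they contribute little to the penalty (assumed weighting).
    """
    h, w = len(values), len(values[0])
    total, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            if x + 1 < w:  # horizontal neighbor
                wgt = math.exp(-beta * abs(guide[y][x + 1] - guide[y][x]))
                total += wgt * abs(values[y][x + 1] - values[y][x])
                count += 1
            if y + 1 < h:  # vertical neighbor
                wgt = math.exp(-beta * abs(guide[y + 1][x] - guide[y][x]))
                total += wgt * abs(values[y + 1][x] - values[y][x])
                count += 1
    return total / count if count else 0.0


# A depth step that coincides with an image edge is barely penalized,
# while the same step under a flat guide is penalized in full.
depth = [[0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 1.0, 1.0]]
guide_with_edge = [[0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 1.0, 1.0]]
guide_flat = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
loss_edge = edge_aware_tv(depth, guide_with_edge)
loss_flat = edge_aware_tv(depth, guide_flat)
```

In the toy example, the flat guide yields the plain TV value (two unit steps over ten neighbor pairs), whereas the guide with a matching edge shrinks the penalty by a factor of exp(-beta).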

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-06 05:37:26 UTC 发布时间：2025-08-06 05:37:26 UTC

#107 DRIVE: Dynamic Rule Inference and Verified Evaluation for Constraint-Aware Autonomous Driving #107 DRIVE：面向约束感知自动驾驶的动态规则推断与验证评估

Authors: [Longling Geng](https://arxiv.org/search/?searchtype=author&query=Longling Geng), [Huangxing Li](https://arxiv.org/search/?searchtype=author&query=Huangxing Li), [Viktor Lado Naess](https://arxiv.org/search/?searchtype=author&query=Viktor Lado Naess), [Mert Pilanci](https://arxiv.org/search/?searchtype=author&query=Mert Pilanci) 作者：耿龙岭，李煌星，维克多·拉多·纳斯，梅特·皮兰奇

Understanding and adhering to soft constraints is essential for safe and socially compliant autonomous driving. However, such constraints are often implicit, context-dependent, and difficult to specify explicitly. In this work, we present DRIVE, a novel framework for Dynamic Rule Inference and Verified Evaluation that models and evaluates human-like driving constraints from expert demonstrations. DRIVE leverages exponential-family likelihood modeling to estimate the feasibility of state transitions, constructing a probabilistic representation of soft behavioral rules that vary across driving contexts. These learned rule distributions are then embedded into a convex optimization-based planning module, enabling the generation of trajectories that are not only dynamically feasible but also compliant with inferred human preferences. Unlike prior approaches that rely on fixed constraint forms or purely reward-based modeling, DRIVE offers a unified framework that tightly couples rule inference with trajectory-level decision-making. It supports both data-driven constraint generalization and principled feasibility verification. We validate DRIVE on large-scale naturalistic driving datasets, including inD, highD, and RoundD, and benchmark it against representative inverse constraint learning and planning baselines. Experimental results show that DRIVE achieves 0.0% soft constraint violation rates, smoother trajectories, and stronger generalization across diverse driving scenarios. Verified evaluations further demonstrate the efficiency, explainability, and robustness of the framework for real-world deployment.
理解并遵守软约束对于安全且符合社会规范的自动驾驶至关重要。然而，这类约束通常是隐含的、依赖于上下文的，且难以明确指定。在本工作中，我们提出了 DRIVE，一种用于动态规则推断和验证评估的新型框架，能够从专家示范中建模和评估类人驾驶约束。DRIVE 利用指数族似然建模来估计状态转移的可行性，构建了一个随驾驶环境变化的软行为规则的概率表示。随后，这些学习到的规则分布被嵌入到基于凸优化的规划模块中，使得生成的轨迹不仅在动态上可行，还符合推断出的人类偏好。与依赖固定约束形式或纯奖励建模的先前方法不同，DRIVE 提供了一个将规则推断与轨迹级决策紧密结合的统一框架，支持数据驱动的约束泛化和原则性可行性验证。我们在大规模自然驾驶数据集上验证了 DRIVE，包括 inD、highD 和 RoundD，并将其与代表性逆约束学习和规划基线进行了对比。实验结果表明，DRIVE 实现了 0.0%的软约束违规率、更平滑的轨迹，以及在多样驾驶场景中的更强泛化能力。经过验证的评估进一步展示了该框架在实际部署中的高效性、可解释性和鲁棒性。

Subjects: Robotics, Artificial Intelligence 主题：机器人技术，人工智能

Publish: 2025-08-06 03:56:06 UTC 发布时间：2025-08-06 03:56:06 UTC

#108 FLAT: Latent-Driven Arbitrary-Target Backdoor Attacks in Federated Learning #108 FLAT：联邦学习中的潜在驱动任意目标后门攻击

Authors: [Tuan Nguyen](https://arxiv.org/search/?searchtype=author&query=Tuan Nguyen), [Khoa D Doan](https://arxiv.org/search/?searchtype=author&query=Khoa D Doan), [Kok-Seng Wong](https://arxiv.org/search/?searchtype=author&query=Kok-Seng Wong) 作者：Tuan Nguyen, Khoa D Doan, Kok-Seng Wong

Federated learning (FL) is vulnerable to backdoor attacks, yet most existing methods are limited by fixed-pattern or single-target triggers, making them inflexible and easier to detect. We propose FLAT (FL Arbitrary-Target Attack), a novel backdoor attack that leverages a latent-driven conditional autoencoder to generate diverse, target-specific triggers as needed. By introducing a latent code, FLAT enables the creation of visually adaptive and highly variable triggers, allowing attackers to select arbitrary targets without retraining and to evade conventional detection mechanisms. Our approach unifies attack success, stealth, and diversity within a single framework, introducing a new level of flexibility and sophistication to backdoor attacks in FL. Extensive experiments show that FLAT achieves high attack success and remains robust against advanced FL defenses. These results highlight the urgent need for new defense strategies to address latent-driven, multi-target backdoor threats in federated settings. 联邦学习（FL）易受到后门攻击，但现有大多数方法受限于固定模式或单一目标触发器，导致其灵活性不足且更易被检测。我们提出了 FLAT（FL 任意目标攻击），这是一种新颖的后门攻击方法，利用潜在驱动的条件自编码器按需生成多样化、目标特定的触发器。通过引入潜在编码，FLAT 能够创建视觉自适应且高度可变的触发器，使攻击者无需重新训练即可选择任意目标，并规避传统检测机制。我们的方法在单一框架内统一了攻击成功率、隐蔽性和多样性，为联邦学习中的后门攻击引入了新的灵活性和复杂性。大量实验表明，FLAT 实现了高攻击成功率，并且在面对先进的联邦学习防御时依然保持鲁棒性。这些结果凸显了在联邦环境中应对潜在驱动、多目标后门威胁的新防御策略的紧迫需求。

Subjects: Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition 主题：机器学习，人工智能，计算机视觉与模式识别

Publish: 2025-08-06 03:54:29 UTC 发布时间：2025-08-06 03:54:29 UTC

#109 Large Reasoning Models Are Autonomous Jailbreak Agents #109 大型推理模型是自主越狱代理

Jailbreaking – bypassing built-in safety mechanisms in AI models – has traditionally required complex technical procedures or specialized human expertise. In this study, we show that the persuasive capabilities of large reasoning models (LRMs) simplify and scale jailbreaking, converting it into an inexpensive activity accessible to non-experts. We evaluated the capabilities of four LRMs (DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini, Qwen3 235B) to act as autonomous adversaries conducting multi-turn conversations with nine widely used target models. LRMs received instructions via a system prompt, before proceeding to planning and executing jailbreaks with no further supervision. We performed extensive experiments with a benchmark of harmful prompts composed of 70 items covering seven sensitive domains. This setup yielded an overall attack success rate across all model combinations of 97.14%. Our study reveals an alignment regression, in which LRMs can systematically erode the safety guardrails of other models, highlighting the urgent need to further align frontier models not only to resist jailbreak attempts, but also to prevent them from being co-opted into acting as jailbreak agents. 越狱——绕过 AI 模型内置安全机制——传统上需要复杂的技术程序或专业的人类专长。在本研究中，我们展示了大型推理模型（LRMs）的说服能力如何简化并扩大越狱的规模，使其成为非专家也能轻松进行的低成本活动。我们评估了四个 LRMs（DeepSeek-R1、Gemini 2.5 Flash、Grok 3 Mini、Qwen3 235B）作为自主对手，与九个广泛使用的目标模型进行多轮对话的能力。LRMs 通过系统提示接收指令，随后在无进一步监督的情况下进行越狱的规划和执行。我们使用包含 70 个项目、涵盖七个敏感领域的有害提示基准进行了大量实验。该设置在所有模型组合中的整体攻击成功率达到了 97.14%。我们的研究揭示了一种对齐回归现象，即大型推理模型（LRMs）能够系统性地削弱其他模型的安全防护措施，强调了迫切需要进一步对前沿模型进行对齐，不仅要抵御越狱攻击，还要防止它们被利用成为越狱代理。

Subjects: Computation and Language, Artificial Intelligence, Cryptography and Security 主题：计算与语言，人工智能，密码学与安全

Publish: 2025-08-04 18:27:26 UTC 发布时间：2025-08-04 18:27:26 UTC

#110 CORE-ReID V2: Advancing the Domain Adaptation for Object Re-Identification with Optimized Training and Ensemble Fusion #110 CORE-ReID V2：通过优化训练和集成融合推进目标重识别的领域自适应

Authors: [Trinh Quoc Nguyen](https://arxiv.org/search/?searchtype=author&query=Trinh Quoc Nguyen), [Oky Dicky Ardiansyah Prima](https://arxiv.org/search/?searchtype=author&query=Oky Dicky Ardiansyah Prima), [Syahid Al Irfan](https://arxiv.org/search/?searchtype=author&query=Syahid Al Irfan), [Hindriyanto Dwi Purnomo](https://arxiv.org/search/?searchtype=author&query=Hindriyanto Dwi Purnomo), [Radius Tanone](https://arxiv.org/search/?searchtype=author&query=Radius Tanone) 作者：Trinh Quoc Nguyen, Oky Dicky Ardiansyah Prima, Syahid Al Irfan, Hindriyanto Dwi Purnomo, Radius Tanone

This study presents CORE-ReID V2, an enhanced framework building upon CORE-ReID. The new framework extends its predecessor by addressing Unsupervised Domain Adaptation (UDA) challenges in Person ReID and Vehicle ReID, with further applicability to Object ReID. During pre-training, CycleGAN is employed to synthesize diverse data, bridging image characteristic gaps across different domains. In the fine-tuning, an advanced ensemble fusion mechanism, consisting of the Efficient Channel Attention Block (ECAB) and the Simplified Efficient Channel Attention Block (SECAB), enhances both local and global feature representations while reducing ambiguity in pseudo-labels for target samples. Experimental results on widely used UDA Person ReID and Vehicle ReID datasets demonstrate that the proposed framework outperforms state-of-the-art methods, achieving top performance in Mean Average Precision (mAP) and Rank-k Accuracy (Top-1, Top-5, Top-10). Moreover, the framework supports lightweight backbones such as ResNet18 and ResNet34, ensuring both scalability and efficiency. Our work not only pushes the boundaries of UDA-based Object ReID but also provides a solid foundation for further research and advancements in this domain. Our codes and models are available at https://github.com/TrinhQuocNguyen/CORE-ReID-V2. 本研究提出了 CORE-ReID V2，一种基于 CORE-ReID 的增强框架。该新框架通过解决行人重识别（Person ReID）和车辆重识别（Vehicle ReID）中的无监督领域自适应（UDA）挑战，扩展了其前身的功能，并进一步适用于物体重识别（Object ReID）。在预训练阶段，采用 CycleGAN 合成多样化数据，弥合不同领域间图像特征的差异。在微调阶段，采用由高效通道注意力模块（ECAB）和简化高效通道注意力模块（SECAB）组成的先进集成融合机制，增强局部和全局特征表示，同时减少目标样本伪标签的歧义。广泛使用的 UDA 行人重识别和车辆重识别数据集上的实验结果表明，该框架优于最先进的方法，在平均精度均值（mAP）和排名准确率（Top-1、Top-5、Top-10）方面实现了顶尖性能。此外，该框架支持 ResNet18 和 ResNet34 等轻量级骨干网络，确保了可扩展性和效率。我们的工作不仅推动了基于 UDA 的目标重识别的边界，还为该领域的进一步研究和进展提供了坚实的基础。我们的代码和模型可在 https://github.com/TrinhQuocNguyen/CORE-ReID-V2 获取。
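The two metrics named in the abstract, Rank-k accuracy and Mean Average Precision (mAP), can be computed from per-query ranked match lists. The sketch below uses the standard textbook definitions only; it assumes the gallery has already been ranked for each query and omits protocol details common in ReID evaluation (such as excluding same-camera gallery images), and the data layout is an assumption made for illustration.

```python
def rank_k_accuracy(rankings, k):
    """Share of queries with at least one correct match in the top k.

    rankings: one list per query, ordered by gallery rank, where each
    entry is True if that gallery item shares the query's identity.
    """
    hits = sum(1 for ranked in rankings if any(ranked[:k]))
    return hits / len(rankings)


def average_precision(ranked):
    """Mean of the precision values taken at each correct match."""
    hits, total = 0, 0.0
    for position, is_match in enumerate(ranked, start=1):
        if is_match:
            hits += 1
            total += hits / position
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """Average of per-query average precision."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Two toy queries: the first is matched at ranks 1 and 3,
# the second only at rank 2.
rankings = [[True, False, True], [False, True, False]]
top1 = rank_k_accuracy(rankings, 1)
top5 = rank_k_accuracy(rankings, 5)
m_ap = mean_average_precision(rankings)
```

Rank-k only asks whether any correct match appears early, while mAP also rewards ranking all correct matches high, which is why the two are usually reported together.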

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-06 02:57:09 UTC 发布时间：2025-08-06 02:57:09 UTC

#111 A Comparative Survey of PyTorch vs TensorFlow for Deep Learning: Usability, Performance, and Deployment Trade-offs #111 PyTorch 与 TensorFlow 深度学习比较调研：可用性、性能与部署权衡

Author: [Zakariya Ba Alawi](https://arxiv.org/search/?searchtype=author&query=Zakariya Ba Alawi) 作者：Zakariya Ba Alawi

This paper presents a comprehensive comparative survey of TensorFlow and PyTorch, the two leading deep learning frameworks, focusing on their usability, performance, and deployment trade-offs. We review each framework’s programming paradigm and developer experience, contrasting TensorFlow’s graph-based (now optionally eager) approach with PyTorch’s dynamic, Pythonic style. We then compare model training speeds and inference performance across multiple tasks and data regimes, drawing on recent benchmarks and studies. Deployment flexibility is examined in depth - from TensorFlow’s mature ecosystem (TensorFlow Lite for mobile/embedded, TensorFlow Serving, and JavaScript support) to PyTorch’s newer production tools (TorchScript compilation, ONNX export, and TorchServe). We also survey ecosystem and community support, including library integrations, industry adoption, and research trends (e.g., PyTorch’s dominance in recent research publications versus TensorFlow’s broader tooling in enterprise). Applications in computer vision, natural language processing, and other domains are discussed to illustrate how each framework is used in practice. Finally, we outline future directions and open challenges in deep learning framework design, such as unifying eager and graph execution, improving cross-framework interoperability, and integrating compiler optimizations (XLA, JIT) for improved speed. Our findings indicate that while both frameworks are highly capable for state-of-the-art deep learning, they exhibit distinct trade-offs: PyTorch offers simplicity and flexibility favored in research, whereas TensorFlow provides a fuller production-ready ecosystem - understanding these trade-offs is key for practitioners selecting the appropriate tool. 
We include charts, code snippets, and more than 20 references to academic papers and official documentation to support this comparative analysis. 本文对两大主流深度学习框架 TensorFlow 和 PyTorch 进行了全面的比较调研，重点关注它们的可用性、性能及部署权衡。我们回顾了各自的编程范式和开发者体验，对比了 TensorFlow 基于图的（现可选即时执行）方法与 PyTorch 动态且符合 Python 风格的设计。随后，我们基于最新的基准测试和研究，比较了多任务和不同数据环境下的模型训练速度与推理性能。部署灵活性方面，我们深入探讨了 TensorFlow 成熟的生态系统（包括面向移动/嵌入式的 TensorFlow Lite、TensorFlow Serving 及 JavaScript 支持）与 PyTorch 较新的生产工具（如 TorchScript 编译、ONNX 导出和 TorchServe）。此外，我们还调研了生态系统和社区支持情况，包括库集成、行业采用及研究趋势（例如 PyTorch 在近期研究论文中的主导地位与 TensorFlow 在企业中更广泛的工具应用）。通过计算机视觉、自然语言处理及其他领域的应用案例，展示了各框架在实际中的使用情况。最后，我们概述了深度学习框架设计中的未来方向和未解决的挑战，例如统一急切执行和图执行、提升跨框架互操作性，以及集成编译器优化（XLA、JIT）以提高速度。我们的研究结果表明，尽管这两个框架在最先进的深度学习中都非常强大，但它们表现出不同的权衡：PyTorch 提供了研究中受青睐的简洁性和灵活性，而 TensorFlow 则提供了更完善的生产就绪生态系统——理解这些权衡对于从业者选择合适的工具至关重要。我们附上了图表、代码片段以及超过 20 篇学术论文和官方文档的参考资料，以支持这项比较分析。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-06 02:55:57 UTC 发布时间：2025-08-06 02:55:57 UTC

#112 Enhancing Serendipity Recommendation System by Constructing Dynamic User Knowledge Graphs with Large Language Models #112 通过构建动态用户知识图谱并结合大型语言模型提升意外发现推荐系统

Authors: [Qian Yong](https://arxiv.org/search/?searchtype=author&query=Qian Yong), [Yanhui Li](https://arxiv.org/search/?searchtype=author&query=Yanhui Li), [Jialiang Shi](https://arxiv.org/search/?searchtype=author&query=Jialiang Shi), [Yaguang Dou](https://arxiv.org/search/?searchtype=author&query=Yaguang Dou), [Tian Qi](https://arxiv.org/search/?searchtype=author&query=Tian Qi) 作者：钱勇，李艳辉，史佳良，窦亚光，齐天

The feedback loop in industrial recommendation systems reinforces homogeneous content, creates filter bubble effects, and diminishes user satisfaction. Recently, large language models (LLMs) have demonstrated potential in serendipity recommendation, thanks to their extensive world knowledge and superior reasoning capabilities. However, these models still face challenges in ensuring the rationality of the reasoning process, ensuring the usefulness of the reasoning results, and meeting the latency requirements of industrial recommendation systems (RSs). To address these challenges, we propose a method that leverages LLMs to dynamically construct user knowledge graphs, thereby enhancing the serendipity of recommendation systems. This method comprises a two-stage framework: (1) two-hop interest reasoning, where user static profiles and historical behaviors are utilized to dynamically construct user knowledge graphs via an LLM. Two-hop reasoning, which can enhance the quality and accuracy of LLM reasoning results, is then performed on the constructed graphs to identify users’ potential interests; and (2) near-line adaptation, a cost-effective approach to deploying the aforementioned models in industrial recommendation systems. We propose a u2i (user-to-item) retrieval model that also incorporates i2i (item-to-item) retrieval capabilities, so that the retrieved items not only exhibit strong relevance to users’ newly emerged interests but also retain the high conversion rate of traditional u2i retrieval. Our online experiments on the Dewu app, which has tens of millions of users, indicate that the method increased the exposure novelty rate by 4.62%, the click novelty rate by 4.85%, the average view duration per person by 0.15%, unique visitor click-through rate by 0.07%, and unique visitor interaction penetration by 0.30%, enhancing user experience. 
工业推荐系统中的反馈循环强化了同质化内容，产生了过滤气泡效应，并降低了用户满意度。近年来，LLMs 凭借其广泛的世界知识和卓越的推理能力，在意外发现推荐方面展现出潜力。然而，这些模型在确保推理过程的合理性、推理结果的有效性以及满足工业推荐系统（RSs）延迟要求方面仍面临挑战。为了解决这些问题，我们提出了一种利用 LLM 动态构建用户知识图谱的方法，从而提升推荐系统的意外发现能力。该方法包含一个两阶段框架：（1）两跳兴趣推理，利用用户静态画像和历史行为，通过 LLM 动态构建用户知识图谱。两跳推理可以提升 LLM 推理结果的质量和准确性，随后在构建的图上进行，以识别用户的潜在兴趣；（2）近线适应，一种在工业推荐系统中部署上述模型的成本效益方法。我们提出了一种 u2i（用户到物品）检索模型，同时结合了 i2i（物品到物品）检索能力，检索到的物品不仅与用户新出现的兴趣高度相关，还保持了传统 u2i 检索的高转化率。我们在拥有数千万用户的 Dewu 应用上的在线实验表明，该方法使曝光新颖率提升了 4.62%，点击新颖率提升了 4.85%，人均平均观看时长提升了 0.15%，独立访客点击率提升了 0.07%，独立访客互动渗透率提升了 0.30%，从而提升了用户体验。
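
The two-hop interest reasoning described above can be illustrated on a toy graph. This is only a sketch under stated assumptions, not the authors' implementation: in the paper the graph is constructed by an LLM from user profiles and behaviors, whereas here a hand-written adjacency dictionary stands in for it. The names `user_graph` and `two_hop_interests`, and the ranking by number of supporting paths, are hypothetical choices made for illustration.

```python
from collections import defaultdict

# Hypothetical, hand-written graph: node -> related nodes.
# In the paper's setting an LLM would propose these edges.
user_graph = {
    "user": ["running shoes", "trail hiking"],
    "running shoes": ["marathon training", "sports watches"],
    "trail hiking": ["camping gear", "sports watches"],
}

def two_hop_interests(graph, start):
    """Return (node, path_count) pairs for nodes exactly two hops from `start`."""
    first_hop = graph.get(start, [])
    known = set(first_hop) | {start}  # interests the user already shows
    support = defaultdict(int)
    for mid in first_hop:
        for target in graph.get(mid, []):
            if target not in known:
                support[target] += 1  # one more two-hop path reaches target
    # Rank by number of supporting paths, then by name for determinism.
    return sorted(support.items(), key=lambda kv: (-kv[1], kv[0]))

candidates = two_hop_interests(user_graph, "user")
print(candidates)
```

Interests reached through several first-hop nodes rank first here, which is one simple way to favor candidates that are novel yet still connected to what the user already does.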

Subjects: Information Retrieval, Artificial Intelligence 主题：信息检索，人工智能

Publish: 2025-08-06 02:52:09 UTC 发布时间：2025-08-06 02:52:09 UTC

#113 Identity Theft in AI Conference Peer Review #113 AI 会议同行评审中的身份盗用

Authors: [Nihar B. Shah](https://arxiv.org/search/?searchtype=author&query=Nihar B. Shah), [Melisa Bok](https://arxiv.org/search/?searchtype=author&query=Melisa Bok), [Xukun Liu](https://arxiv.org/search/?searchtype=author&query=Xukun Liu), [Andrew McCallum](https://arxiv.org/search/?searchtype=author&query=Andrew McCallum) 作者：Nihar B. Shah，Melisa Bok，Xukun Liu，Andrew McCallum

We discuss newly uncovered cases of identity theft in the scientific peer-review process within artificial intelligence (AI) research, with broader implications for other academic procedures. We detail how dishonest researchers exploit the peer-review system by creating fraudulent reviewer profiles to manipulate paper evaluations, leveraging weaknesses in reviewer recruitment workflows and identity verification processes. The findings highlight the critical need for stronger safeguards against identity theft in peer review and academia at large, and to this end, we also propose mitigating strategies. 我们讨论了在人工智能（AI）研究领域科学同行评审过程中新发现的身份盗用案例，这些案例对其他学术程序也具有更广泛的影响。我们详细说明了不诚实的研究人员如何通过创建虚假的审稿人资料来操纵论文评审，利用审稿人招募流程和身份验证过程中的漏洞。研究结果强调了在同行评审和整个学术界加强防范身份盗用的关键必要性。为此，我们还提出了相应的缓解策略。

Subjects: Digital Libraries, Artificial Intelligence, Cryptography and Security 主题：数字图书馆，人工智能，密码学与安全

Publish: 2025-08-06 02:36:52 UTC 发布时间：2025-08-06 02:36:52 UTC

#114 Step More: Going Beyond Single Backpropagation in Meta Learning Based Model Editing #114 多走一步：在基于元学习的模型编辑中超越单次反向传播

Large Language Models (LLMs) underpin many AI applications, but their static nature makes updating knowledge costly. Model editing offers an efficient alternative by injecting new information through targeted parameter modifications. In particular, meta-learning-based model editing (MLBME) methods have demonstrated notable advantages in both editing effectiveness and efficiency. Despite this, we find that MLBME exhibits suboptimal performance in low-data scenarios, and its training efficiency is bottlenecked by the computation of KL divergence. To address these issues, we propose Step More Edit (SMEdit), a novel MLBME method that adopts Multiple BackproPagation Steps (MBPS) to improve editing performance under limited supervision and a norm regularization on weight updates to improve training efficiency. Experimental results on two datasets and two LLMs demonstrate that SMEdit outperforms prior MLBME baselines and the MBPS strategy can be seamlessly integrated into existing methods to further boost their performance. Our code will be released soon. 大型语言模型（LLMs）支撑着许多人工智能应用，但其静态特性使得更新知识代价高昂。模型编辑通过有针对性的参数修改注入新信息，提供了一种高效的替代方案。特别是，基于元学习的模型编辑（MLBME）方法在编辑效果和效率方面表现出显著优势。尽管如此，我们发现 MLBME 在低数据场景下表现不佳，其训练效率也受到 KL 散度计算的瓶颈限制。为了解决这些问题，我们提出了 Step More Edit（SMEdit），这是一种新颖的 MLBME 方法，采用 Multiple BackproPagation Steps（MBPS）以提升有限监督下的编辑性能，并在权重更新上引入范数正则化以提高训练效率。在两个数据集和两个 LLMs 上的实验结果表明，SMEdit 优于之前的 MLBME 基线方法，且 MBPS 策略可以无缝集成到现有方法中，进一步提升其性能。我们的代码将很快发布。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-06 01:54:58 UTC 发布时间：2025-08-06 01:54:58 UTC

#115 StepWrite: Adaptive Planning for Speech-Driven Text Generation #115 StepWrite：面向语音驱动文本生成的自适应规划

Authors: [Hamza El Alaoui](https://arxiv.org/search/?searchtype=author&query=Hamza El Alaoui), [Atieh Taheri](https://arxiv.org/search/?searchtype=author&query=Atieh Taheri), [Yi-Hao Peng](https://arxiv.org/search/?searchtype=author&query=Yi-Hao Peng), [Jeffrey P. Bigham](https://arxiv.org/search/?searchtype=author&query=Jeffrey P. Bigham) 作者：Hamza El Alaoui，Atieh Taheri，Yi-Hao Peng，Jeffrey P. Bigham

People frequently use speech-to-text systems to compose short texts with voice. However, current voice-based interfaces struggle to support composing more detailed, contextually complex texts, especially in scenarios where users are on the move and cannot visually track progress. Longer-form communication, such as composing structured emails or thoughtful responses, requires persistent context tracking, structured guidance, and adaptability to evolving user intentions, capabilities that conventional dictation tools and voice assistants do not support. We introduce StepWrite, a large language model-driven voice-based interaction system that augments human writing ability by enabling structured, hands-free and eyes-free composition of longer-form texts while on the move. StepWrite decomposes the writing process into manageable subtasks and sequentially guides users with contextually aware non-visual audio prompts. StepWrite reduces cognitive load by offloading the context-tracking and adaptive planning tasks to the models. Unlike baseline methods like standard dictation features (e.g., Microsoft Word) and conversational voice assistants (e.g., ChatGPT Advanced Voice Mode), StepWrite dynamically adapts its prompts based on the evolving context and user intent, and provides coherent guidance without compromising user autonomy. An empirical evaluation with 25 participants engaging in mobile or stationary hands-occupied activities demonstrated that StepWrite significantly reduces cognitive load and improves usability and user satisfaction compared to baseline methods. Technical evaluations further confirmed StepWrite’s capability in dynamic contextual prompt generation, accurate tone alignment, and effective fact checking. This work highlights the potential of structured, context-aware voice interactions in enhancing hands-free and eyes-free communication in everyday multitasking scenarios. 
人们经常使用语音转文本系统通过语音撰写简短文本。然而，当前基于语音的界面难以支持撰写更详细、语境复杂的文本，尤其是在用户移动中无法视觉跟踪进度的场景下。较长形式的交流，如撰写结构化电子邮件或深思熟虑的回复，需要持续的上下文跟踪、结构化指导以及对不断变化的用户意图的适应能力——这些是传统的语音转写工具和语音助手所不具备的。我们介绍了 StepWrite，一种由大型语言模型驱动的基于语音的交互系统，通过实现结构化、免手免眼的长文本撰写，增强了人类的写作能力，适用于移动中使用。StepWrite 将写作过程分解为可管理的子任务，并通过具备语境感知的非视觉音频提示，按顺序引导用户。StepWrite 通过将上下文跟踪和自适应规划任务转移给模型，降低了认知负担。与标准听写功能（如 Microsoft Word）和对话式语音助手（如 ChatGPT 高级语音模式）等基线方法不同，StepWrite 能够根据不断变化的上下文和用户意图动态调整提示，并在不影响用户自主性的前提下提供连贯的指导。对 25 名参与者在移动或静止的双手占用活动中的实证评估表明，StepWrite 显著降低了认知负担，提高了可用性和用户满意度，相较于基线方法表现更优。技术评估进一步证实了 StepWrite 在动态上下文提示生成、准确语气匹配和有效事实核查方面的能力。本研究强调了结构化、上下文感知语音交互在提升日常多任务场景中免提免视沟通的潜力。

Subjects: Human-Computer Interaction, Artificial Intelligence 主题：人机交互，人工智能

Publish: 2025-08-06 01:50:17 UTC 发布时间：2025-08-06 01:50:17 UTC

#116 HarmonyGuard: Toward Safety and Utility in Web Agents via Adaptive Policy Enhancement and Dual-Objective Optimization #116 HarmonyGuard：通过自适应策略增强和双目标优化实现网络代理的安全性与实用性提升

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-06 01:49:32 UTC 发布时间：2025-08-06 01:49:32 UTC

#117 Are Today's LLMs Ready to Explain Well-Being Concepts? #117 现今的 LLMs 准备好解释幸福感概念了吗？

Subjects: Computation and Language, Artificial Intelligence, Human-Computer Interaction 主题：计算与语言，人工智能，人机交互

Publish: 2025-08-06 00:45:02 UTC 发布时间：2025-08-06 00:45:02 UTC

#118 Dynamic User-controllable Privacy-preserving Few-shot Sensing Framework #118 动态用户可控的隐私保护少样本感知框架

Authors: [Ajesh Koyatan Chathoth](https://arxiv.org/search/?searchtype=author&query=Ajesh Koyatan Chathoth), [Shuhao Yu](https://arxiv.org/search/?searchtype=author&query=Shuhao Yu), [Stephen Lee](https://arxiv.org/search/?searchtype=author&query=Stephen Lee) 作者：Ajesh Koyatan Chathoth，Shuhao Yu，Stephen Lee

User-controllable privacy is important in modern sensing systems, as privacy preferences can vary significantly from person to person and may evolve over time. This is especially relevant in devices equipped with Inertial Measurement Unit (IMU) sensors, such as smartphones and wearables, which continuously collect rich time-series data that can inadvertently expose sensitive user behaviors. While prior work has proposed privacy-preserving methods for sensor data, most rely on static, predefined privacy labels or require large quantities of private training data, limiting their adaptability and user agency. In this work, we introduce PrivCLIP, a dynamic, user-controllable, few-shot privacy-preserving sensing framework. PrivCLIP allows users to specify and modify their privacy preferences by categorizing activities as sensitive (black-listed), non-sensitive (white-listed), or neutral (gray-listed). Leveraging a multimodal contrastive learning approach, PrivCLIP aligns IMU sensor data with natural language activity descriptions in a shared embedding space, enabling few-shot detection of sensitive activities. When a privacy-sensitive activity is identified, the system uses a language-guided activity sanitizer and a motion generation module (IMU-GPT) to transform the original data into a privacy-compliant version that semantically resembles a non-sensitive activity. We evaluate PrivCLIP on multiple human activity recognition datasets and demonstrate that it significantly outperforms baseline methods in terms of both privacy protection and data utility. 
用户可控的隐私在现代感知系统中非常重要，因为隐私偏好因人而异，且可能随时间变化。这一点在配备惯性测量单元（IMU）传感器的设备中尤为相关，如智能手机和可穿戴设备，这些设备持续收集丰富的时间序列数据，可能无意中暴露用户的敏感行为。尽管已有工作提出了传感器数据的隐私保护方法，但大多数依赖静态的预定义隐私标签或需要大量私有训练数据，限制了其适应性和用户自主权。在本工作中，我们引入了 PrivCLIP，一种动态的、用户可控的、少样本隐私保护感知框架。PrivCLIP 允许用户通过将活动分类为敏感（黑名单）、非敏感（白名单）或中性（灰名单）来指定和修改其隐私偏好。借助多模态对比学习方法，PrivCLIP 将 IMU 传感器数据与自然语言活动描述对齐到共享的嵌入空间，实现了敏感活动的少样本检测。当识别出涉及隐私的活动时，系统会使用基于语言引导的活动净化器和运动生成模块（IMU-GPT）将原始数据转换为语义上类似于非敏感活动的隐私合规版本。我们在多个人体活动识别数据集上评估了 PrivCLIP，结果表明其在隐私保护和数据效用方面均显著优于基线方法。
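
The interplay between the user-editable preference lists and embedding-based detection can be sketched as follows. This is a toy illustration, not PrivCLIP itself: the three-dimensional vectors stand in for learned embeddings, the activity names are invented, and `classify_window` is a hypothetical helper that matches an IMU-window embedding to the nearest activity description by cosine similarity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for text embeddings of activity descriptions (assumed values).
text_embeddings = {
    "walking": [1.0, 0.0, 0.0],
    "smoking": [0.0, 1.0, 0.0],
    "typing": [0.0, 0.0, 1.0],
}

# User-controlled preferences: black = sensitive, white = non-sensitive, gray = neutral.
preferences = {"walking": "white", "smoking": "black", "typing": "gray"}

def classify_window(imu_embedding):
    """Return the best-matching activity and the user's current policy for it."""
    label = max(text_embeddings, key=lambda name: cosine(imu_embedding, text_embeddings[name]))
    return label, preferences[label]

label, policy = classify_window([0.1, 0.9, 0.2])
needs_sanitizing = policy == "black"
print(label, policy, needs_sanitizing)
```

Because the preferences live in a plain mapping that is consulted at detection time, a user could move an activity between lists without retraining anything, which is the kind of dynamic control the abstract describes.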

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-06 00:44:11 UTC 发布时间：2025-08-06 00:44:11 UTC

#119 Data and AI governance: Promoting equity, ethics, and fairness in large language models #119 数据与人工智能治理：促进大型语言模型中的公平、伦理与公正

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-05 23:15:31 UTC 发布时间：2025-08-05 23:15:31 UTC

#120 Human-Centered Human-AI Interaction (HC-HAII): A Human-Centered AI Perspective #120 以人为中心的人机交互（HC-HAII）：以人为中心的人工智能视角

Author: [Wei Xu](https://arxiv.org/search/?searchtype=author&query=Wei Xu) 作者：徐伟

This chapter systematically promotes an emerging interdisciplinary field of human-artificial intelligence interaction (human-AI interaction, HAII) from a human-centered AI (HCAI) perspective. It introduces a framework of human-centered HAII (HC-HAII). HC-HAII places humans at the core of HAII research and applications, emphasizing the importance of adopting a human-centered approach over a technology-centered one. The chapter presents the HC-HAII methodology, including human-centered methods, process, interdisciplinary teams, and multi-level design paradigms. It also highlights key research challenges and future directions. As the first chapter, this chapter also provides a structural overview of this book, which brings together contributions from an interdisciplinary community of researchers and practitioners to advance the theory, methodology, and applications of HCAI in diverse domains of HAII. The purpose of this chapter is to provide a fundamental framework for this book, centered on HAII research and applications based on the HCAI approach, which will pave the way for the content of subsequent chapters. 本章系统地推动了一个新兴的跨学科领域——人机智能交互（human-AI interaction，HAII），并从以人为中心的人工智能（human-centered AI，HCAI）视角出发。介绍了以人为中心的人机智能交互（HC-HAII）框架。HC-HAII 将人置于 HAII 研究和应用的核心，强调采用以人为中心的方法而非以技术为中心的重要性。本章阐述了 HC-HAII 的方法论，包括以人为中心的方法、流程、跨学科团队以及多层次设计范式。同时，重点指出了关键的研究挑战和未来方向。作为第一章，本章还提供了本书的结构概览，本书汇集了来自跨学科研究者和实践者社区的贡献，旨在推动 HCAI 理论、方法论及其在 HAII 各领域的应用发展。本章的目的是为本书提供一个基于 HCAI 方法的人机智能交互研究与应用的基础框架，为后续章节的内容铺路。

Subjects: Human-Computer Interaction, Artificial Intelligence 主题：人机交互，人工智能

Publish: 2025-08-05 23:13:39 UTC 发布时间：2025-08-05 23:13:39 UTC

#121 Accelerating Scientific Discovery with Multi-Document Summarization of Impact-Ranked Papers #121 利用多文档摘要加速科学发现——基于影响力排名的论文汇总

Subjects: Digital Libraries, Artificial Intelligence, Computation and Language 主题：数字图书馆，人工智能，计算与语言

Publish: 2025-08-05 22:56:09 UTC 发布时间：2025-08-05 22:56:09 UTC

#122 Policy to Assist Iteratively Local Segmentation: Optimising Modality and Location Selection for Prostate Cancer Localisation #122 迭代局部分割辅助策略：优化前列腺癌定位的模态和位置选择

Authors: [Xiangcen Wu](https://arxiv.org/search/?searchtype=author&query=Xiangcen Wu), [Shaheer U. Saeed](https://arxiv.org/search/?searchtype=author&query=Shaheer U. Saeed), [Yipei Wang](https://arxiv.org/search/?searchtype=author&query=Yipei Wang), [Ester Bonmati Coll](https://arxiv.org/search/?searchtype=author&query=Ester Bonmati Coll), [Yipeng Hu](https://arxiv.org/search/?searchtype=author&query=Yipeng Hu) 作者：吴翔岑，Shaheer U. Saeed，王一培，Ester Bonmati Coll，胡一鹏

Radiologists often mix medical image reading strategies, including inspection of individual modalities and local image regions, using information at different locations from different images independently as well as concurrently. In this paper, we propose a recommender system to assist machine learning-based segmentation models, by suggesting appropriate image portions along with the best modality, such that prostate cancer segmentation performance can be maximised. Our approach trains a policy network that assists tumor localisation, by recommending both the optimal imaging modality and the specific sections of interest for review. During training, a pre-trained segmentation network mimics radiologist inspection on individual or variable combinations of these imaging modalities and their sections, as selected by the policy network. Taking the locally segmented regions as an input for the next step, this dynamic decision making process iterates until all cancers are best localised. We validate our method using a data set of 1325 labelled multiparametric MRI images from prostate cancer patients, demonstrating its potential to improve annotation efficiency and segmentation accuracy, especially when challenging pathology is present. Experimental results show that our approach can surpass standard segmentation networks. Perhaps more interestingly, our trained agent independently developed its own optimal strategy, which may or may not be consistent with current radiologist guidelines such as PI-RADS. This observation also suggests a promising interactive application, in which the proposed policy networks assist human radiologists. 
放射科医生通常会混合使用多种医学影像阅读策略，包括检查单一模态和局部图像区域，独立或同时利用来自不同图像不同位置的信息。本文提出了一种推荐系统，辅助基于机器学习的分割模型，通过建议合适的图像部分及最佳模态，从而最大化前列腺癌的分割性能。我们的方法训练了一个策略网络，辅助肿瘤定位，推荐最佳成像模态和具体感兴趣的切片。在训练过程中，预训练的分割网络模拟放射科医生对由策略网络选择的单一或多模态及其切片的检查。将局部分割区域作为下一步的输入，这一动态决策过程迭代进行，直到所有癌症被最佳定位。我们使用来自前列腺癌患者的 1325 张带标签的多参数 MRI 图像数据集验证了我们的方法，展示了其在提高标注效率和分割准确性方面的潜力，尤其是在存在复杂病理情况时。实验结果表明，我们的方法能够超越标准的分割网络。更有趣的是，我们训练的智能体独立开发了自己的最优策略，这一策略可能与当前放射科医生的指南（如 PI-RADS）一致，也可能不一致。这一观察还表明了一个有前景的交互式应用，即所提出的策略网络可以辅助人类放射科医生。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-05 22:40:18 UTC 发布时间：2025-08-05 22:40:18 UTC

#123 Constraint-Preserving Data Generation for Visuomotor Policy Learning #123 约束保持的数据生成用于视觉运动策略学习

Authors: [Kevin Lin](https://arxiv.org/search/?searchtype=author&query=Kevin Lin), [Varun Ragunath](https://arxiv.org/search/?searchtype=author&query=Varun Ragunath), [Andrew McAlinden](https://arxiv.org/search/?searchtype=author&query=Andrew McAlinden), [Aaditya Prasad](https://arxiv.org/search/?searchtype=author&query=Aaditya Prasad), [Jimmy Wu](https://arxiv.org/search/?searchtype=author&query=Jimmy Wu), [Yuke Zhu](https://arxiv.org/search/?searchtype=author&query=Yuke Zhu), [Jeannette Bohg](https://arxiv.org/search/?searchtype=author&query=Jeannette Bohg) 作者：Kevin Lin, Varun Ragunath, Andrew McAlinden, Aaditya Prasad, Jimmy Wu, Yuke Zhu, Jeannette Bohg

Large-scale demonstration data has powered key breakthroughs in robot manipulation, but collecting that data remains costly and time-consuming. We present Constraint-Preserving Data Generation (CP-Gen), a method that uses a single expert trajectory to generate robot demonstrations containing novel object geometries and poses. These generated demonstrations are used to train closed-loop visuomotor policies that transfer zero-shot to the real world and generalize across variations in object geometries and poses. Similar to prior work using pose variations for data generation, CP-Gen first decomposes expert demonstrations into free-space motions and robot skills. But unlike those works, we achieve geometry-aware data generation by formulating robot skills as keypoint-trajectory constraints: keypoints on the robot or grasped object must track a reference trajectory defined relative to a task-relevant object. To generate a new demonstration, CP-Gen samples pose and geometry transforms for each task-relevant object, then applies these transforms to the object and its associated keypoints or keypoint trajectories. We optimize robot joint configurations so that the keypoints on the robot or grasped object track the transformed keypoint trajectory, and then motion plan a collision-free path to the first optimized joint configuration. Experiments on 16 simulation tasks and four real-world tasks, featuring multi-stage, non-prehensile and tight-tolerance manipulation, show that policies trained using CP-Gen achieve an average success rate of 77%, outperforming the best baseline that achieves an average of 50%. 
大规模示范数据推动了机器人操作领域的关键突破，但收集这些数据仍然成本高昂且耗时。我们提出了约束保持数据生成（CP-Gen）方法，该方法利用单条专家轨迹生成包含新颖物体几何形状和姿态的机器人示范。这些生成的示范用于训练闭环视觉运动策略，能够零次迁移到现实世界，并在物体几何形状和姿态的变化中实现泛化。与先前利用姿态变化进行数据生成的工作类似，CP-Gen 首先将专家示范分解为自由空间运动和机器人技能。但不同于那些工作，我们通过将机器人技能表述为关键点轨迹约束，实现了几何感知的数据生成：机器人或被抓取物体上的关键点必须跟踪相对于任务相关物体定义的参考轨迹。为了生成新的示范，CP-Gen 为每个任务相关物体采样姿态和几何变换，然后将这些变换应用于物体及其相关的关键点或关键点轨迹。我们优化机器人关节配置，使机器人或被抓取物体上的关键点能够跟踪变换后的关键点轨迹，然后规划一条无碰撞路径，运动到第一个优化后的关节配置。针对 16 个仿真任务和 4 个现实任务的实验，这些任务包括多阶段、非抓取式和高精度操作，结果显示使用 CP-Gen 训练的策略平均成功率达到 77%，优于表现最好的基线方法的平均 50%。
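
The step of applying a sampled pose transform to a keypoint trajectory can be sketched in two dimensions. This is a simplified illustration under stated assumptions, not the CP-Gen implementation: it covers only a planar rigid transform (no geometry change, no joint optimization, no motion planning), and `transform_trajectory` and the sample waypoints are hypothetical.

```python
import math

def transform_trajectory(trajectory, angle_rad, translation):
    """Apply a planar rigid transform (rotate, then translate) to each waypoint."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    tx, ty = translation
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in trajectory]

# Reference keypoint trajectory, expressed relative to a task-relevant object.
reference = [(1.0, 0.0), (1.0, 0.5), (1.0, 1.0)]

# A sampled new object pose: rotated by 90 degrees and shifted 2 units along x.
new_traj = transform_trajectory(reference, math.pi / 2, (2.0, 0.0))
print(new_traj)
```

In the method described above, the robot's joint configurations would then be optimized so that its keypoints track a transformed trajectory like `new_traj`.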

Subjects: Robotics, Artificial Intelligence 主题：机器人技术，人工智能

Publish: 2025-08-05 22:20:02 UTC 发布时间：2025-08-05 22:20:02 UTC

#124 FairPOT: Balancing AUC Performance and Fairness with Proportional Optimal Transport #124 FairPOT：通过比例最优传输平衡 AUC 性能与公平性

Authors: [Pengxi Liu](https://arxiv.org/search/?searchtype=author&query=Pengxi Liu), [Yi Shen](https://arxiv.org/search/?searchtype=author&query=Yi Shen), [Matthew M. Engelhard](https://arxiv.org/search/?searchtype=author&query=Matthew M. Engelhard), [Benjamin A. Goldstein](https://arxiv.org/search/?searchtype=author&query=Benjamin A. Goldstein), [Michael J. Pencina](https://arxiv.org/search/?searchtype=author&query=Michael J. Pencina), [Nicoleta J. Economou-Zavlanos](https://arxiv.org/search/?searchtype=author&query=Nicoleta J. Economou-Zavlanos), [Michael M. Zavlanos](https://arxiv.org/search/?searchtype=author&query=Michael M. Zavlanos) 作者：刘鹏曦，沈毅，Matthew M. Engelhard，Benjamin A. Goldstein，Michael J. Pencina，Nicoleta J. Economou-Zavlanos，Michael M. Zavlanos

Fairness metrics utilizing the area under the receiver operator characteristic curve (AUC) have gained increasing attention in high-stakes domains such as healthcare, finance, and criminal justice. In these domains, fairness is often evaluated over risk scores rather than binary outcomes, and a common challenge is that enforcing strict fairness can significantly degrade AUC performance. To address this challenge, we propose Fair Proportional Optimal Transport (FairPOT), a novel, model-agnostic post-processing framework that strategically aligns risk score distributions across different groups using optimal transport, but does so selectively by transforming a controllable proportion, i.e., the top-lambda quantile, of scores within the disadvantaged group. By varying lambda, our method allows for a tunable trade-off between reducing AUC disparities and maintaining overall AUC performance. Furthermore, we extend FairPOT to the partial AUC setting, enabling fairness interventions to concentrate on the highest-risk regions. Extensive experiments on synthetic, public, and clinical datasets show that FairPOT consistently outperforms existing post-processing techniques in both global and partial AUC scenarios, often achieving improved fairness with slight AUC degradation or even positive gains in utility. The computational efficiency and practical adaptability of FairPOT make it a promising solution for real-world deployment. 利用接收者操作特征曲线下面积（AUC）的公平性指标在医疗、金融和刑事司法等高风险领域受到越来越多的关注。在这些领域，公平性通常是基于风险评分而非二元结果进行评估的，而一个常见的挑战是，强制执行严格的公平性可能会显著降低 AUC 性能。为了解决这一挑战，我们提出了公平比例最优传输（FairPOT），这是一种新颖的、与模型无关的后处理框架，通过最优传输策略性地对不同群体的风险评分分布进行对齐，但选择性地只转换处于弱势群体内可控比例的分数，即该群体的前 lambda 分位数。通过调整 lambda，我们的方法允许在减少 AUC 差异和保持整体 AUC 性能之间实现可调节的权衡。此外，我们将 FairPOT 扩展到部分 AUC 设置，使公平性干预能够集中在最高风险区域。在合成数据集、公开数据集和临床数据集上的大量实验表明，FairPOT 在全局和部分 AUC 场景中均持续优于现有的后处理技术，通常在实现公平性提升的同时，AUC 仅有轻微下降甚至在效用上获得正向提升。FairPOT 的计算效率和实际适应性使其成为现实应用中的有前景解决方案。
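
The idea of transporting only a controllable share of scores can be sketched with one-dimensional quantile mapping. This is a minimal illustration under stated assumptions, not the FairPOT algorithm: it relies on the fact that in one dimension the optimal transport map between two distributions is the monotone quantile map, and it applies that map only to the top-`lam` fraction of the disadvantaged group's scores. The function names and the toy scores are hypothetical, and the paper's exact selection rule and transport plan may differ.

```python
def empirical_quantile(sorted_vals, q):
    """Linearly interpolated quantile of an already-sorted list, q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def partial_quantile_map(source, target, lam):
    """Map the top-`lam` fraction of `source` scores onto `target`'s quantiles.

    Each selected score is replaced by the target score at the same rank;
    scores outside the top fraction are left unchanged.
    """
    n = len(source)
    order = sorted(range(n), key=lambda i: source[i])  # indices, ascending by score
    target_sorted = sorted(target)
    n_moved = round(lam * n)
    out = list(source)
    for rank in range(n - n_moved, n):  # ranks belonging to the top fraction
        q = rank / (n - 1) if n > 1 else 0.0
        out[order[rank]] = empirical_quantile(target_sorted, q)
    return out

disadvantaged = [0.10, 0.20, 0.30, 0.40, 0.50]
advantaged = [0.20, 0.40, 0.60, 0.80, 1.00]
adjusted = partial_quantile_map(disadvantaged, advantaged, lam=0.4)
print(adjusted)
```

With `lam=0` nothing is changed and with `lam=1` the whole distribution is aligned, so intermediate values give the kind of tunable trade-off the abstract describes.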

Subjects: Machine Learning, Artificial Intelligence, Computers and Society, Machine Learning 主题：机器学习，人工智能，计算机与社会，机器学习

Publish: 2025-08-05 22:13:08 UTC 发布时间：2025-08-05 22:13:08 UTC

#125 Active Learning and Transfer Learning for Anomaly Detection in Time-Series Data #125 时间序列数据异常检测的主动学习与迁移学习

Authors: [John D. Kelleher](https://arxiv.org/search/?searchtype=author&query=John D. Kelleher), [Matthew Nicholson](https://arxiv.org/search/?searchtype=author&query=Matthew Nicholson), [Rahul Agrahari](https://arxiv.org/search/?searchtype=author&query=Rahul Agrahari), [Clare Conran](https://arxiv.org/search/?searchtype=author&query=Clare Conran) 作者：John D. Kelleher，Matthew Nicholson，Rahul Agrahari，Clare Conran

This paper examines the effectiveness of combining active learning and transfer learning for anomaly detection in cross-domain time-series data. Our results indicate that there is an interaction between clustering and active learning and in general the best performance is achieved using a single cluster (in other words when clustering is not applied). Also, we find that adding new samples to the training set using active learning does improve model performance but that in general, the rate of improvement is slower than the results reported in the literature suggest. We attribute this difference to an improved experimental design where distinct data samples are used for the sampling and testing pools. Finally, we assess the ceiling performance of transfer learning in combination with active learning across several datasets and find that performance does initially improve but eventually begins to tail off as more target points are selected for inclusion in training. This tail-off in performance may indicate that the active learning process is doing a good job of sequencing data points for selection, pushing the less useful points towards the end of the selection process and that this tail-off occurs when these less useful points are eventually added. Taken together our results indicate that active learning is effective but that the improvement in model performance follows a linear flat function concerning the number of points selected and labelled. 本文探讨了将主动学习与迁移学习相结合用于跨域时间序列数据异常检测的有效性。我们的结果表明，聚类与主动学习之间存在相互作用，通常情况下，使用单一聚类（换句话说，不进行聚类）能够获得最佳性能。此外，我们发现通过主动学习向训练集中添加新样本确实提升了模型性能，但总体来看，性能提升的速度比文献中报道的要慢。我们将这一差异归因于改进的实验设计，其中采样池和测试池使用了不同的数据样本。最后，我们评估了迁移学习与主动学习结合在多个数据集上的性能上限，发现性能初期确实有所提升，但随着更多目标点被选入训练，性能最终开始趋于平缓。性能的下降可能表明主动学习过程在为选择排序数据点方面表现良好，将较不有用的数据点推向选择过程的末尾，而当这些较不有用的数据点最终被添加时，性能下降便发生。综合来看，我们的结果表明主动学习是有效的，但模型性能的提升随着所选和标注的数据点数量呈线性平缓函数变化。
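
The experimental-design point about keeping sampling and testing pools distinct can be sketched as follows. This is an illustrative sketch only: the abstract does not specify the query strategy, so uncertainty sampling around a 0.5 threshold is an assumption, and `split_pools`, `select_most_uncertain`, and the toy scoring function are hypothetical.

```python
import random

def split_pools(samples, test_fraction, seed=0):
    """Shuffle and split samples into disjoint (sampling_pool, test_pool)."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def select_most_uncertain(pool, score_fn, k, threshold=0.5):
    """Pick the k pool items whose anomaly score lies closest to the threshold."""
    return sorted(pool, key=lambda x: abs(score_fn(x) - threshold))[:k]

samples = list(range(20))
sampling_pool, test_pool = split_pools(samples, test_fraction=0.25)

# Toy anomaly score in [0, 1); a real model's predicted probability would go here.
queried = select_most_uncertain(sampling_pool, lambda x: x / 20, k=3)
print(queried)
```

Points are only ever queried from `sampling_pool`, so the model is never evaluated on data it could have been trained on through active learning.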

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-05 21:30:41 UTC 发布时间：2025-08-05 21:30:41 UTC

#126 Deep learning framework for crater detection and identification on the Moon and Mars #126 月球和火星陨石坑检测与识别的深度学习框架

Authors: [Yihan Ma](https://arxiv.org/search/?searchtype=author&query=Yihan Ma), [Zeyang Yu](https://arxiv.org/search/?searchtype=author&query=Zeyang Yu), [Rohitash Chandra](https://arxiv.org/search/?searchtype=author&query=Rohitash Chandra) 作者：马一涵，余泽阳，罗希塔什·钱德拉

Impact craters are among the most prominent geomorphological features on planetary surfaces and are of substantial significance in planetary science research. Their spatial distribution and morphological characteristics provide critical information on planetary surface composition, geological history, and impact processes. In recent years, the rapid advancement of deep learning models has fostered significant interest in automated crater detection. In this paper, we apply advancements in deep learning models for impact crater detection and identification. We use novel models, including Convolutional Neural Networks (CNNs) and variants such as YOLO and ResNet. We present a framework that features a two-stage approach where the first stage features crater identification using simple classic CNN, ResNet-50 and YOLO. In the second stage, our framework employs YOLO-based detection for crater localisation. Therefore, we detect and identify different types of craters and present a summary report with remote sensing data for a selected region. We consider selected regions for craters and identification from Mars and the Moon based on remote sensing data. Our results indicate that YOLO demonstrates the most balanced crater detection performance, while ResNet-50 excels in identifying large craters with high precision. 撞击坑是行星表面最显著的地貌特征之一，在行星科学研究中具有重要意义。它们的空间分布和形态特征为行星表面组成、地质历史及撞击过程提供了关键信息。近年来，深度学习模型的快速发展激发了对自动撞击坑检测的浓厚兴趣。本文应用深度学习模型的最新进展进行撞击坑的检测与识别。我们采用了包括卷积神经网络（CNN）及其变体如 YOLO 和 ResNet 在内的新型模型。我们提出了一个两阶段的方法框架，第一阶段使用简单的经典 CNN、ResNet-50 和 YOLO 进行撞击坑识别；第二阶段则采用基于 YOLO 的检测进行撞击坑定位。因此，我们能够检测并识别不同类型的撞击坑，并结合遥感数据对选定区域进行总结报告。我们基于遥感数据选取了火星和月球的部分区域进行撞击坑检测与识别。我们的结果表明，YOLO 在陨石坑检测性能方面表现最为均衡，而 ResNet-50 则在高精度识别大型陨石坑方面表现出色。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-05 21:29:34 UTC 发布时间：2025-08-05 21:29:34 UTC

#127 Fast and Accurate Explanations of Distance-Based Classifiers by Uncovering Latent Explanatory Structures #127 通过揭示潜在解释结构实现距离基分类器的快速准确解释

Authors: [Florian Bley](https://arxiv.org/search/?searchtype=author&query=Florian Bley), [Jacob Kauffmann](https://arxiv.org/search/?searchtype=author&query=Jacob Kauffmann), [Simon León Krug](https://arxiv.org/search/?searchtype=author&query=Simon León Krug), [Klaus-Robert Müller](https://arxiv.org/search/?searchtype=author&query=Klaus-Robert Müller), [Grégoire Montavon](https://arxiv.org/search/?searchtype=author&query=Grégoire Montavon) 作者：Florian Bley, Jacob Kauffmann, Simon León Krug, Klaus-Robert Müller, Grégoire Montavon

Distance-based classifiers, such as k-nearest neighbors and support vector machines, continue to be a workhorse of machine learning, widely used in science and industry. In practice, to derive insights from these models, it is also important to ensure that their predictions are explainable. While the field of Explainable AI has supplied methods that are in principle applicable to any model, it has also emphasized the usefulness of latent structures (e.g. the sequence of layers in a neural network) to produce explanations. In this paper, we contribute by uncovering a hidden neural network structure in distance-based classifiers (consisting of linear detection units combined with nonlinear pooling layers) upon which Explainable AI techniques such as layer-wise relevance propagation (LRP) become applicable. Through quantitative evaluations, we demonstrate the advantage of our novel explanation approach over several baselines. We also show the overall usefulness of explaining distance-based models through two practical use cases. 基于距离的分类器，如 k 近邻和支持向量机，依然是机器学习中的主力军，广泛应用于科学和工业领域。在实际应用中，为了从这些模型中获得洞见，确保其预测结果具有可解释性同样重要。虽然可解释人工智能领域提供了原则上适用于任何模型的方法，但它也强调利用潜在结构（例如神经网络中的层序）来生成解释的有效性。本文的贡献在于揭示了基于距离的分类器中隐藏的神经网络结构（由线性检测单元与非线性池化层组合而成），使得诸如层级相关传播（LRP）等可解释人工智能技术得以应用。通过定量评估，我们展示了该新颖解释方法相较于多个基线方法的优势。我们还通过两个实际用例展示了解释基于距离模型的整体实用性。

Subjects: Machine Learning, Artificial Intelligence, Machine Learning 主题：机器学习，人工智能，机器学习

Publish: 2025-08-05 21:01:58 UTC 发布时间：2025-08-05 21:01:58 UTC

#128 Calibrating Biophysical Models for Grape Phenology Prediction via Multi-Task Learning #128 通过多任务学习校准生物物理模型以预测葡萄物候

Authors: [William Solow](https://arxiv.org/search/?searchtype=author&query=William Solow), [Sandhya Saisubramanian](https://arxiv.org/search/?searchtype=author&query=Sandhya Saisubramanian) 作者：William Solow，Sandhya Saisubramanian

Accurate prediction of grape phenology is essential for timely vineyard management decisions, such as scheduling irrigation and fertilization, to maximize crop yield and quality. While traditional biophysical models calibrated on historical field data can be used for season-long predictions, they lack the precision required for fine-grained vineyard management. Deep learning methods are a compelling alternative but their performance is hindered by sparse phenology datasets, particularly at the cultivar level. We propose a hybrid modeling approach that combines multi-task learning with a recurrent neural network to parameterize a differentiable biophysical model. By using multi-task learning to predict the parameters of the biophysical model, our approach enables shared learning across cultivars while preserving biological structure, thereby improving the robustness and accuracy of predictions. Empirical evaluation using real-world and synthetic datasets demonstrates that our method significantly outperforms both conventional biophysical models and baseline deep learning approaches in predicting phenological stages, as well as other crop state variables such as cold-hardiness and wheat yield. 准确预测葡萄物候对于及时进行葡萄园管理决策至关重要，例如安排灌溉和施肥，以最大化作物产量和质量。虽然基于历史田间数据校准的传统生物物理模型可用于整个季节的预测，但它们缺乏细粒度葡萄园管理所需的精度。深度学习方法是一个有吸引力的替代方案，但其性能受限于稀疏的物候数据集，尤其是在品种层面。我们提出了一种混合建模方法，将多任务学习与循环神经网络相结合，以参数化一个可微分的生物物理模型。通过使用多任务学习来预测生物物理模型的参数，我们的方法实现了品种间的共享学习，同时保留了生物学结构，从而提高了预测的鲁棒性和准确性。使用真实世界和合成数据集的实证评估表明，我们的方法在预测物候阶段以及其他作物状态变量（如耐寒性和小麦产量）方面，显著优于传统的生物物理模型和基线深度学习方法。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-05 20:36:11 UTC 发布时间：2025-08-05 20:36:11 UTC

#129 Simulating Cyberattacks through a Breach Attack Simulation (BAS) Platform empowered by Security Chaos Engineering (SCE) #129 通过安全混沌工程（SCE）赋能的漏洞攻击模拟（BAS）平台模拟网络攻击

Authors: [Arturo Sánchez-Matas](https://arxiv.org/search/?searchtype=author&query=Arturo Sánchez-Matas), [Pablo Escribano Ruiz](https://arxiv.org/search/?searchtype=author&query=Pablo Escribano Ruiz), [Daniel Díaz-López](https://arxiv.org/search/?searchtype=author&query=Daniel Díaz-López), [Angel Luis Perales Gómez](https://arxiv.org/search/?searchtype=author&query=Angel Luis Perales Gómez), [Pantaleone Nespoli](https://arxiv.org/search/?searchtype=author&query=Pantaleone Nespoli), [Gregorio Martínez Pérez](https://arxiv.org/search/?searchtype=author&query=Gregorio Martínez Pérez) 作者：Arturo Sánchez-Matas，Pablo Escribano Ruiz，Daniel Díaz-López，Angel Luis Perales Gómez，Pantaleone Nespoli，Gregorio Martínez Pérez

In today's digital landscape, organizations face constantly evolving cyber threats, making it essential to discover slippery attack vectors through novel techniques like Security Chaos Engineering (SCE), which allows teams to test defenses and identify vulnerabilities effectively. This paper proposes to integrate SCE into Breach Attack Simulation (BAS) platforms, leveraging adversary profiles and abilities from existing threat intelligence databases. This innovative proposal for cyberattack simulation employs a structured architecture composed of three layers: SCE Orchestrator, Connector, and BAS layers. Utilizing MITRE Caldera in the BAS layer, our proposal executes automated attack sequences, creating inferred attack trees from adversary profiles. Our evaluation illustrates how integrating SCE with BAS can enhance the effectiveness of attack simulations beyond traditional scenarios and serve as a useful component of a cyber defense strategy. 在当今的数字环境中，组织面临不断演变的网络威胁，因此通过诸如安全混沌工程（SCE）等新颖技术发现隐蔽的攻击路径变得至关重要，这使团队能够有效地测试防御并识别漏洞。本文提出将 SCE 整合到入侵攻击模拟（BAS）平台中，利用现有威胁情报数据库中的对手画像和能力。这一网络攻击模拟的创新提案采用由三层组成的结构化架构：SCE 编排器、连接器和 BAS 层。利用 BAS 层中的 MITRE Caldera，我们的方案执行自动化攻击序列，从对手画像中创建推断的攻击树。我们的方案评估展示了将 SCE 与 BAS 集成如何提升攻击模拟的有效性，超越传统场景，并成为网络防御策略中的有用组成部分。

Subjects: Cryptography and Security, Artificial Intelligence 主题：密码学与安全，人工智能

Publish: 2025-08-05 19:52:57 UTC 发布时间：2025-08-05 19:52:57 UTC

#130 Intelligent Sampling of Extreme-Scale Turbulence Datasets for Accurate and Efficient Spatiotemporal Model Training #130 极端规模湍流数据集的智能采样，用于准确高效的时空模型训练

With the end of Moore’s law and Dennard scaling, efficient training increasingly requires rethinking data volume. Can we train better models with significantly less data via intelligent subsampling? To explore this, we develop SICKLE, a sparse intelligent curation framework for efficient learning, featuring a novel maximum entropy (MaxEnt) sampling approach, scalable training, and energy benchmarking. We compare MaxEnt with random and phase-space sampling on large direct numerical simulation (DNS) datasets of turbulence. Evaluating SICKLE at scale on Frontier, we show that subsampling as a preprocessing step can improve model accuracy and substantially lower energy consumption, with reductions of up to 38x observed in certain cases. 随着摩尔定律和 Dennard 缩放的终结，高效训练越来越需要重新思考数据量。我们能否通过智能子采样，用显著更少的数据训练出更好的模型？为此，我们开发了 SICKLE，一种用于高效学习的稀疏智能策划框架，具有新颖的最大熵（MaxEnt）采样方法、可扩展训练和能耗基准测试。我们将 MaxEnt 与随机采样和相空间采样在湍流的大规模直接数值模拟（DNS）数据集上进行了比较。在 Frontier 上大规模评估 SICKLE，我们展示了作为预处理步骤的子采样可以提升模型准确性并大幅降低能耗，在某些情况下能耗降低高达 38 倍。

Subjects: Machine Learning, Artificial Intelligence, Distributed, Parallel, and Cluster Computing 主题：机器学习，人工智能，分布式、并行与集群计算

Publish: 2025-08-05 19:34:59 UTC 发布时间：2025-08-05 19:34:59 UTC

#131 Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models #131 从幻觉到真相：大型语言模型中的事实核查与真实性评估综述

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-05 19:20:05 UTC 发布时间：2025-08-05 19:20:05 UTC

#132 VAE-DNN: Energy-Efficient Trainable-by-Parts Surrogate Model For Parametric Partial Differential Equations #132 VAE-DNN：用于参数化偏微分方程的节能分部可训练代理模型

Authors: [Yifei Zong](https://arxiv.org/search/?searchtype=author&query=Yifei Zong), [Alexandre M. Tartakovsky](https://arxiv.org/search/?searchtype=author&query=Alexandre M. Tartakovsky) 作者：宗一飞，亚历山大·M·塔尔塔科夫斯基

We propose a trainable-by-parts surrogate model for solving forward and inverse parameterized nonlinear partial differential equations. Like several other surrogate and operator learning models, the proposed approach employs an encoder to reduce the high-dimensional input $y(\bm{x})$ to a lower-dimensional latent space, $\bm{\mu}_{\bm{\phi}_y}$. Then, a fully connected neural network is used to map $\bm{\mu}_{\bm{\phi}_y}$ to the latent space, $\bm{\mu}_{\bm{\phi}_h}$, of the PDE solution $h(\bm{x},t)$. Finally, a decoder is utilized to reconstruct $h(\bm{x},t)$. The innovative aspect of our model is its ability to train its three components independently. This approach leads to a substantial decrease in both the time and energy required for training when compared to leading operator learning models such as FNO and DeepONet. The separable training is achieved by training the encoder as part of the variational autoencoder (VAE) for $y(\bm{x})$ and the decoder as part of the $h(\bm{x},t)$ VAE. We refer to this model as the VAE-DNN model. VAE-DNN is compared to the FNO and DeepONet models for obtaining forward and inverse solutions to the nonlinear diffusion equation governing groundwater flow in an unconfined aquifer. Our findings indicate that VAE-DNN not only demonstrates greater efficiency but also delivers superior accuracy in both forward and inverse solutions compared to the FNO and DeepONet models. 我们提出了一种可分部分训练的代理模型，用于求解参数化的非线性偏微分方程的正向和逆向问题。与其他一些代理和算子学习模型类似，所提方法采用编码器将高维输入 $y(\bm{x})$ 降维到低维潜在空间 $\bm{\mu}_{\bm{\phi}_y}$。然后，使用全连接神经网络将 $\bm{\mu}_{\bm{\phi}_y}$ 映射到偏微分方程解 $h(\bm{x},t)$ 的潜在空间 $\bm{\mu}_{\bm{\phi}_h}$。最后，利用解码器重构 $h(\bm{x},t)$。我们模型的创新之处在于其能够独立训练三个组成部分。与 FNO 和 DeepONet 等领先的算子学习模型相比，这种方法显著减少了训练所需的时间和能量。可分离训练是通过将编码器作为 $y(\bm{x})$ 的变分自编码器（VAE）的一部分进行训练，以及将解码器作为 $h(\bm{x},t)$ VAE 的一部分训练来实现的。我们将该模型称为 VAE-DNN 模型。VAE-DNN 模型与 FNO 和 DeepONet 模型进行了比较，用于获得非线性扩散方程的正向和逆向解，该方程描述了非承压含水层中地下水流动的规律。我们的研究结果表明，VAE-DNN 不仅表现出更高的效率，而且在正向和逆向求解中均比 FNO 和 DeepONet 模型提供了更优的准确性。
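
The train-by-parts idea in this abstract (one autoencoder for the input, one for the solution, and a separate map between their latent spaces) can be illustrated with a deliberately simplified sketch. This is not the authors' implementation: PCA stands in for each VAE, a linear least-squares fit stands in for the fully connected network, and the data, dimensions and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical): both fields are driven by k hidden factors.
n, dim_y, dim_h, k = 200, 50, 80, 5
factors = rng.normal(size=(n, k))
Y = factors @ rng.normal(size=(k, dim_y))  # samples of the input field y
H = factors @ rng.normal(size=(k, dim_h))  # samples of the solution field h


def fit_linear_autoencoder(X, k):
    """Fit a PCA encoder/decoder pair on X alone (a stand-in for a VAE)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = vt[:k]  # shape (k, dim): top-k principal directions

    def encode(x):
        return (x - mean) @ basis.T

    def decode(z):
        return z @ basis + mean

    return encode, decode


# Part 1: autoencoder for y, fitted without looking at h.
encode_y, _ = fit_linear_autoencoder(Y, k)
# Part 2: autoencoder for h, fitted without looking at y.
encode_h, decode_h = fit_linear_autoencoder(H, k)

# Part 3: latent-to-latent map (a stand-in for the fully connected network).
latent_y = np.hstack([encode_y(Y), np.ones((n, 1))])  # append a bias column
weights, *_ = np.linalg.lstsq(latent_y, encode_h(H), rcond=None)


def surrogate(y):
    """Forward pass: encode y, map latent to latent, decode to h."""
    z_y = np.atleast_2d(encode_y(y))
    z_y = np.hstack([z_y, np.ones((z_y.shape[0], 1))])
    return decode_h(z_y @ weights)


relative_error = np.linalg.norm(surrogate(Y) - H) / np.linalg.norm(H)
```

Each of the three parts is fitted by its own independent call, which is the structural point of the sketch. The near-zero `relative_error` only reflects the linear synthetic data and says nothing about the accuracy reported in the paper.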

Subjects: Machine Learning, Artificial Intelligence, Computational Engineering, Finance, and Science 主题：机器学习，人工智能，计算工程，金融与科学

Publish: 2025-08-05 18:37:32 UTC 发布时间：2025-08-05 18:37:32 UTC

#133 Mechanism Design for Facility Location using Predictions #133 使用预测的设施选址机制设计

Author: [Toby Walsh](https://arxiv.org/search/?searchtype=author&query=Toby Walsh) 作者：Toby Walsh

We study mechanisms for the facility location problem augmented with predictions of the optimal facility location. We demonstrate that an egalitarian viewpoint which considers both the maximum distance of any agent from the facility and the minimum utility of any agent provides important new insights compared to a viewpoint that just considers the maximum distance. As in previous studies, we consider performance in terms of consistency (worst case when predictions are accurate) and robustness (worst case irrespective of the accuracy of predictions). By considering how mechanisms with predictions can perform poorly, we design new mechanisms that are more robust. Indeed, by adjusting parameters, we demonstrate how to trade robustness for consistency. We go beyond the single facility problem by designing novel strategy proof mechanisms for locating two facilities with bounded consistency and robustness that use two predictions for where to locate the two facilities. 我们研究了带有最优设施位置预测的设施选址问题机制。我们展示了一种兼顾任一代理与设施之间最大距离和任一代理最小效用的平等主义视角，相较于仅考虑最大距离的视角，提供了重要的新见解。与以往研究一样，我们从一致性（预测准确时的最坏情况）和鲁棒性（无论预测准确与否的最坏情况）两个方面考察性能。通过分析带预测的机制可能表现不佳的情况，我们设计了更具鲁棒性的全新机制。实际上，通过调整参数，我们展示了如何在鲁棒性和一致性之间进行权衡。我们不仅限于单一设施问题，还设计了用于定位两个设施的新型策略证明机制，该机制具有有界的一致性和鲁棒性，并利用两个预测来确定两个设施的位置。
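
To make consistency and robustness concrete, here is a minimal sketch for a single facility on a line under the maximum-distance objective. The clamping rule below is chosen purely for illustration and is not claimed to be one of the paper's mechanisms; the function names and example numbers are hypothetical.

```python
def clamp_mechanism(reports, prediction):
    """Place the facility at the prediction, clamped to the range of reports."""
    lo, hi = min(reports), max(reports)
    return min(max(prediction, lo), hi)


def max_distance(reports, facility):
    """Egalitarian cost: distance of the worst-off agent from the facility."""
    return max(abs(x - facility) for x in reports)


def optimal_max_distance(reports):
    """The best achievable maximum distance puts the facility at the midpoint."""
    return (max(reports) - min(reports)) / 2


reports = [0.0, 4.0, 10.0]
opt = optimal_max_distance(reports)  # midpoint is 5.0, so the optimum is 5.0

# Accurate prediction: the facility lands on the optimum (ratio 1).
accurate_ratio = max_distance(reports, clamp_mechanism(reports, 5.0)) / opt

# Badly wrong prediction: the facility is clamped to an endpoint (ratio 2).
wrong_ratio = max_distance(reports, clamp_mechanism(reports, 100.0)) / opt
```

In this instance the ratio is 1 when the prediction is accurate and 2 when it is far off. Consistency and robustness are worst-case versions of these two ratios over all instances.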

Subjects: Computer Science and Game Theory, Artificial Intelligence 主题：计算机科学与博弈论，人工智能

Publish: 2025-08-05 18:05:32 UTC 发布时间：2025-08-05 18:05:32 UTC

#134 SoilNet: A Multimodal Multitask Model for Hierarchical Classification of Soil Horizons #134 SoilNet：一种用于土壤层级分类的多模态多任务模型

Authors: [Teodor Chiaburu](https://arxiv.org/search/?searchtype=author&query=Teodor Chiaburu), [Vipin Singh](https://arxiv.org/search/?searchtype=author&query=Vipin Singh), [Frank Haußer](https://arxiv.org/search/?searchtype=author&query=Frank Haußer), [Felix Bießmann](https://arxiv.org/search/?searchtype=author&query=Felix Bießmann) 作者：Teodor Chiaburu，Vipin Singh，Frank Haußer，Felix Bießmann

While recent advances in foundation models have improved the state of the art in many domains, some problems in empirical sciences could not benefit from this progress yet. Soil horizon classification, for instance, remains challenging because of its multimodal and multitask characteristics and a complex hierarchically structured label taxonomy. Accurate classification of soil horizons is crucial for monitoring soil health, which directly impacts agricultural productivity, food security, ecosystem stability and climate resilience. In this work, we propose SoilNet - a multimodal multitask model to tackle this problem through a structured modularized pipeline. Our approach integrates image data and geotemporal metadata to first predict depth markers, segmenting the soil profile into horizon candidates. Each segment is characterized by a set of horizon-specific morphological features. Finally, horizon labels are predicted based on the multimodal concatenated feature vector, leveraging a graph-based label representation to account for the complex hierarchical relationships among soil horizons. Our method is designed to address complex hierarchical classification, where the number of possible labels is very large, imbalanced and non-trivially structured. We demonstrate the effectiveness of our approach on a real-world soil profile dataset. All code and experiments can be found in our repository: https://github.com/calgo-lab/BGR/ 尽管基础模型的最新进展提升了许多领域的技术水平，但经验科学中的一些问题尚未从中受益。例如，土壤层位分类仍然具有挑战性，因为其具有多模态和多任务特性以及复杂的层级结构标签分类法。准确的土壤层位分类对于监测土壤健康至关重要，而土壤健康直接影响农业生产力、粮食安全、生态系统稳定性和气候适应能力。在本研究中，我们提出了 SoilNet ——一种通过结构化模块化流程解决该问题的多模态多任务模型。我们的方法整合了图像数据和地时空元数据，首先预测深度标记，将土壤剖面分割为层位候选段。每个段由一组层位特定的形态特征描述。最后，基于多模态拼接的特征向量预测层位标签，利用基于图的标签表示来考虑土壤层位之间复杂的层级关系。我们的方法旨在解决复杂的层级分类问题，其中可能的标签数量非常多，且存在不平衡和非平凡的结构。我们在一个真实的土壤剖面数据集上展示了该方法的有效性。所有代码和实验均可在我们的仓库中找到：https://github.com/calgo-lab/BGR/

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-05 15:29:57 UTC 发布时间：2025-08-05 15:29:57 UTC

#135 Probing and Enhancing the Robustness of GNN-based QEC Decoders with Reinforcement Learning #135 利用强化学习探测并增强基于 GNN 的量子纠错解码器的鲁棒性

Author: [Ryota Ikeda](https://arxiv.org/search/?searchtype=author&query=Ryota Ikeda) 作者：Ryota Ikeda

Graph Neural Networks (GNNs) have emerged as a powerful, data-driven approach for Quantum Error Correction (QEC) decoding, capable of learning complex noise characteristics directly from syndrome data. However, the robustness of these decoders against subtle, adversarial perturbations remains a critical open question. This work introduces a novel framework to systematically probe the vulnerabilities of a GNN decoder using a reinforcement learning (RL) agent. The RL agent is trained as an adversary with the goal of finding minimal syndrome modifications that cause the decoder to misclassify. We apply this framework to a Graph Attention Network (GAT) decoder trained on experimental surface code data from Google Quantum AI. Our results show that the RL agent can successfully identify specific, critical vulnerabilities, achieving a high attack success rate with a minimal number of bit flips. Furthermore, we demonstrate that the decoder’s robustness can be significantly enhanced through adversarial training, where the model is retrained on the adversarial examples generated by the RL agent. This iterative process of automated vulnerability discovery and targeted retraining presents a promising methodology for developing more reliable and robust neural network decoders for fault-tolerant quantum computing. 图神经网络（GNN）作为一种强大的数据驱动方法，已成为量子纠错（QEC）译码的重要手段，能够直接从综合征数据中学习复杂的噪声特性。然而，这些译码器对细微的对抗性扰动的鲁棒性仍是一个关键的未解问题。本文提出了一个新颖的框架，利用强化学习（RL）代理系统地探测 GNN 译码器的脆弱性。该 RL 代理被训练为对手，目标是寻找最小的综合征修改，使译码器发生误分类。我们将该框架应用于基于 Google Quantum AI 实验表面码数据训练的图注意力网络（GAT）译码器。结果表明，RL 代理能够成功识别出特定的关键脆弱点，以极少的比特翻转实现高攻击成功率。此外，我们还展示了通过对抗训练，即在 RL 代理生成的对抗样本上重新训练模型，可以显著提升译码器的鲁棒性。这种自动化漏洞发现与针对性再训练的迭代过程，为开发更可靠、更强健的神经网络解码器以实现容错量子计算提供了一种有前景的方法。

Subjects: Quantum Physics, Artificial Intelligence 主题：量子物理，人工智能

Publish: 2025-08-05 14:57:30 UTC 发布时间：2025-08-05 14:57:30 UTC

#136 Do GNN-based QEC Decoders Require Classical Knowledge? Evaluating the Efficacy of Knowledge Distillation from MWPM #136 基于 GNN 的量子纠错解码器是否需要经典知识？评估从 MWPM 进行知识蒸馏的有效性

Author: [Ryota Ikeda](https://arxiv.org/search/?searchtype=author&query=Ryota Ikeda) 作者：池田亮太

The performance of decoders in Quantum Error Correction (QEC) is key to realizing practical quantum computers. In recent years, Graph Neural Networks (GNNs) have emerged as a promising approach, but their training methodologies are not yet well-established. It is generally expected that transferring theoretical knowledge from classical algorithms like Minimum Weight Perfect Matching (MWPM) to GNNs, a technique known as knowledge distillation, can effectively improve performance. In this work, we test this hypothesis by rigorously comparing two models based on a Graph Attention Network (GAT) architecture that incorporates temporal information as node features. The first is a purely data-driven model (baseline) trained only on ground-truth labels, while the second incorporates a knowledge distillation loss based on the theoretical error probabilities from MWPM. Using public experimental data from Google, our evaluation reveals that while the final test accuracy of the knowledge distillation model was nearly identical to the baseline, its training loss converged more slowly, and the training time increased by a factor of approximately five. This result suggests that modern GNN architectures possess a high capacity to efficiently learn complex error correlations directly from real hardware data, without guidance from approximate theoretical models. 量子纠错（QEC）中解码器的性能是实现实用量子计算机的关键。近年来，图神经网络（GNN）作为一种有前景的方法出现，但其训练方法尚未成熟。普遍认为，将经典算法如最小权完美匹配（MWPM）的理论知识迁移到 GNN 中，即知识蒸馏技术，可以有效提升性能。在本工作中，我们通过严格比较两种基于图注意力网络（GAT）架构且将时间信息作为节点特征的模型来验证这一假设。第一种是仅基于真实标签进行训练的纯数据驱动模型（基线），第二种则结合了基于 MWPM 理论错误概率的知识蒸馏损失。利用谷歌公开的实验数据，我们的评估显示，虽然知识蒸馏模型的最终测试准确率与基线几乎相同，但其训练损失收敛较慢，且训练时间增加了约五倍。这一结果表明，现代图神经网络（GNN）架构具备高效学习复杂误差相关性的能力，能够直接从真实硬件数据中学习，而无需依赖近似理论模型的指导。

Subjects: Quantum Physics, Artificial Intelligence 主题：量子物理，人工智能

Publish: 2025-08-05 14:54:44 UTC 发布时间：2025-08-05 14:54:44 UTC

#137 Are Inherently Interpretable Models More Robust? A Study In Music Emotion Recognition #137 天生可解释的模型更具鲁棒性吗？音乐情感识别中的一项研究

Authors: [Katharina Hoedt](https://arxiv.org/search/?searchtype=author&query=Katharina Hoedt), [Arthur Flexer](https://arxiv.org/search/?searchtype=author&query=Arthur Flexer), [Gerhard Widmer](https://arxiv.org/search/?searchtype=author&query=Gerhard Widmer) 作者：Katharina Hoedt，Arthur Flexer，Gerhard Widmer

One of the desired key properties of deep learning models is the ability to generalise to unseen samples. When provided with new samples that are (perceptually) similar to one or more training samples, deep learning models are expected to produce correspondingly similar outputs. Models that succeed in predicting similar outputs for similar inputs are often called robust. Deep learning models, on the other hand, have been shown to be highly vulnerable to minor (adversarial) perturbations of the input, which manage to drastically change a model’s output and simultaneously expose its reliance on spurious correlations. In this work, we investigate whether inherently interpretable deep models, i.e., deep models that were designed to focus more on meaningful and interpretable features, are more robust to irrelevant perturbations in the data, compared to their black-box counterparts. We test our hypothesis by comparing the robustness of an interpretable and a black-box music emotion recognition (MER) model when challenged with adversarial examples. Furthermore, we include an adversarially trained model, which is optimised to be more robust, in the comparison. Our results indicate that inherently more interpretable models can indeed be more robust than their black-box counterparts, and achieve similar levels of robustness as adversarially trained models, at lower computational cost. 深度学习模型的一个重要期望特性是能够对未见过的样本进行泛化。当提供与一个或多个训练样本（感知上）相似的新样本时，深度学习模型应产生相应相似的输出。能够对相似输入预测出相似输出的模型通常被称为鲁棒模型。另一方面，研究表明深度学习模型对输入的微小（对抗性）扰动极为敏感，这些扰动能够显著改变模型的输出，同时暴露出模型对虚假相关性的依赖。在本研究中，我们探讨了内在可解释的深度模型，即那些设计上更注重有意义且可解释特征的深度模型，是否相比其黑箱模型在面对数据中无关扰动时更具鲁棒性。我们通过比较一个可解释模型和一个黑箱音乐情感识别（MER）模型在遭遇对抗样本时的鲁棒性来验证我们的假设。此外，我们还将一个经过对抗训练、优化以增强鲁棒性的模型纳入比较。我们的结果表明，天生更具可解释性的模型确实可以比其黑箱模型更具鲁棒性，并且以更低的计算成本达到与对抗训练模型相似的鲁棒性水平。

Subjects: Sound, Artificial Intelligence, Audio and Speech Processing 主题：声音，人工智能，音频与语音处理

Publish: 2025-08-05 13:29:29 UTC 发布时间：2025-08-05 13:29:29 UTC

#138 When Agents Break Down in Multiagent Path Finding #138 当多智能体路径寻找中的智能体失效时

Authors: [Foivos Fioravantes](https://arxiv.org/search/?searchtype=author&query=Foivos Fioravantes), [Dušan Knop](https://arxiv.org/search/?searchtype=author&query=Dušan Knop), [Nikolaos Melissinos](https://arxiv.org/search/?searchtype=author&query=Nikolaos Melissinos), [Michal Opler](https://arxiv.org/search/?searchtype=author&query=Michal Opler) 作者：Foivos Fioravantes，Dušan Knop，Nikolaos Melissinos，Michal Opler

In Multiagent Path Finding (MAPF), the goal is to compute efficient, collision-free paths for multiple agents navigating a network from their sources to targets, minimizing the schedule's makespan, that is, the total time until all agents reach their destinations. We introduce a new variant that formally models scenarios where some agents may experience delays due to malfunctions, posing significant challenges for maintaining optimal schedules. Recomputing an entirely new schedule from scratch after each malfunction is often computationally infeasible. To address this, we propose a framework for dynamic schedule adaptation that does not rely on full replanning. Instead, we develop protocols enabling agents to locally coordinate and adjust their paths on the fly. We prove that, when agents follow our primary communication protocol, the increase in makespan after k malfunctions is bounded by k additional turns, effectively limiting the impact of malfunctions on overall efficiency. Moreover, recognizing that agents may have limited computational capabilities, we also present a secondary protocol that shifts the necessary computations onto the network's nodes, ensuring robustness without requiring enhanced agent processing power. Our results demonstrate that these protocols provide a practical, scalable approach to resilient multiagent navigation in the face of agent failures. 在多智能体路径规划（MAPF）中，目标是为多个智能体计算高效且无碰撞的路径，使其从起点导航到目标点，同时最小化调度的完工时间，即所有智能体到达目的地的总时间。我们引入了一个新的变体，正式建模了某些智能体可能因故障而延迟的场景，这对保持最优调度带来了重大挑战。每次故障后从头重新计算完整调度通常在计算上不可行。为此，我们提出了一个动态调度适应框架，无需依赖完全重新规划。相反，我们开发了协议，使智能体能够局部协调并即时调整路径。我们证明，遵循我们的主要通信协议后，经过 k 次故障后完工时间的增加被限制在额外 k 个回合内，有效限制了故障对整体效率的影响。此外，鉴于智能体可能具有有限的计算能力，我们还提出了一种辅助协议，将必要的计算转移到网络节点上，确保系统的鲁棒性，而无需增强智能体的处理能力。我们的结果表明，这些协议为应对智能体故障提供了一种实用且可扩展的多智能体导航解决方案。

Subjects: Multiagent Systems, Artificial Intelligence 主题：多智能体系统，人工智能

Publish: 2025-08-05 12:59:30 UTC 发布时间：2025-08-05 12:59:30 UTC

#139 Revisiting Heat Flux Analysis of Tungsten Monoblock Divertor on EAST using Physics-Informed Neural Network #139 使用物理信息神经网络重新审视 EAST 钨单块偏滤器的热流分析

Estimating heat flux in the nuclear fusion device EAST is a critically important task. Traditional scientific computing methods typically model this process using the Finite Element Method (FEM). However, FEM relies on grid-based sampling for computation, which is computationally inefficient and makes real-time simulation during actual experiments difficult. Inspired by artificial intelligence-powered scientific computing, this paper proposes a novel Physics-Informed Neural Network (PINN) to address this challenge, significantly accelerating the heat conduction estimation process while maintaining high accuracy. Specifically, given inputs of different materials, we first feed spatial coordinates and time stamps into the neural network, and compute boundary loss, initial condition loss, and physical loss based on the heat conduction equation. Additionally, we sample a small number of data points in a data-driven manner to better fit the specific heat conduction scenario, further enhancing the model's predictive capability. We conduct experiments under both uniform and non-uniform heating conditions on the top surface. Experimental results show that the proposed thermal conduction physics-informed neural network achieves accuracy comparable to the finite element method while running about 40 times faster. The dataset and source code will be released on https://github.com/Event-AHU/OpenFusion. 估计核聚变装置 EAST 中的热流是一个极其重要的任务。传统的科学计算方法通常使用有限元方法（FEM）来建模这一过程。然而，FEM 依赖于基于网格的采样进行计算，计算效率低且难以在实际实验中进行实时模拟。受人工智能驱动的科学计算启发，本文提出了一种新颖的物理信息神经网络（PINN）来解决这一挑战，在保持高精度的同时显著加快了热传导估计过程。具体而言，针对不同材料的输入，我们首先将空间坐标和时间戳输入神经网络，并基于热传导方程计算边界损失、初始条件损失和物理损失。此外，我们以数据驱动的方式采样少量数据点，以更好地拟合特定的热传导场景，进一步增强模型的预测能力。我们在顶部表面进行了均匀和非均匀加热条件下的实验。实验结果表明，所提出的热传导物理信息神经网络在精度上可与有限元方法相媲美，同时在计算效率上实现了 40 倍的加速。数据集和源代码将发布在 https://github.com/Event-AHU/OpenFusion。
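
The loss terms named in this abstract can be written out as a generic PINN objective. This is a sketch under assumptions the abstract does not state: a network output $T_\theta(\bm{x},t)$, a constant thermal diffusivity $\alpha$, a Dirichlet-type boundary condition $g$, an initial temperature field $T_0$, a small set of reference values $T^{\text{ref}}$, and hypothetical weights $\lambda$.

```latex
\begin{aligned}
\mathcal{L}(\theta) &= \lambda_b \mathcal{L}_b + \lambda_0 \mathcal{L}_0
                     + \lambda_p \mathcal{L}_p + \lambda_d \mathcal{L}_d, \\
\mathcal{L}_p &= \frac{1}{N_p} \sum_{i=1}^{N_p}
  \left| \frac{\partial T_\theta}{\partial t}(\bm{x}_i, t_i)
       - \alpha \nabla^2 T_\theta(\bm{x}_i, t_i) \right|^2, \\
\mathcal{L}_b &= \frac{1}{N_b} \sum_{j=1}^{N_b}
  \left| T_\theta(\bm{x}_j, t_j) - g(\bm{x}_j, t_j) \right|^2, \\
\mathcal{L}_0 &= \frac{1}{N_0} \sum_{k=1}^{N_0}
  \left| T_\theta(\bm{x}_k, 0) - T_0(\bm{x}_k) \right|^2, \\
\mathcal{L}_d &= \frac{1}{N_d} \sum_{m=1}^{N_d}
  \left| T_\theta(\bm{x}_m, t_m) - T^{\text{ref}}_m \right|^2 .
\end{aligned}
```

Here $\mathcal{L}_p$ penalises the residual of the heat conduction equation at collocation points, $\mathcal{L}_b$ and $\mathcal{L}_0$ enforce the boundary and initial conditions, and $\mathcal{L}_d$ fits the small set of sampled data points. The paper's actual boundary conditions, for example a prescribed heat flux on the top surface, may take a different form.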

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-05 12:42:02 UTC 发布时间：2025-08-05 12:42:02 UTC

#140 4D-PreNet: A Unified Preprocessing Framework for 4D-STEM Data Analysis #140 4D-PreNet：一个用于 4D-STEM 数据分析的统一预处理框架

Automated experimentation with real-time data analysis in scanning transmission electron microscopy (STEM) often requires an end-to-end framework. Four-dimensional scanning transmission electron microscopy (4D-STEM) with high-throughput data acquisition has been constrained by a critical bottleneck in data preprocessing. Pervasive noise, beam center drift, and elliptical distortions during high-throughput acquisition inevitably corrupt diffraction patterns, systematically biasing quantitative measurements. Yet, conventional correction algorithms are often material-specific and fail to provide a robust, generalizable solution. In this work, we present 4D-PreNet, an end-to-end deep-learning pipeline that integrates attention-enhanced U-Net and ResNet architectures to simultaneously perform denoising, center correction, and elliptical distortion calibration. The network is trained on large, simulated datasets encompassing a wide range of noise levels, drift magnitudes, and distortion types, enabling it to generalize effectively to experimental data acquired under varying conditions. Quantitative evaluations demonstrate that our pipeline reduces mean squared error by up to 50% during denoising and achieves sub-pixel center localization in the center detection task, with average errors below 0.04 pixels. The outputs are benchmarked against traditional algorithms, highlighting improvements in both noise suppression and restoration of diffraction patterns, thereby facilitating high-throughput, reliable 4D-STEM real-time analysis for automated characterization.
扫描透射电子显微镜（STEM）中基于实时数据分析的自动化实验通常需要端到端的框架。高通量数据采集的四维扫描透射电子显微镜（4D-STEM）受限于数据预处理带来的关键瓶颈。在高通量采集过程中，普遍存在的噪声、电子束中心漂移和椭圆形畸变不可避免地破坏了衍射图样，系统性地偏倚了定量测量。然而，传统的校正算法通常针对特定材料，难以提供稳健且通用的解决方案。在本工作中，我们提出了 4D-PreNet，一种端到端的深度学习管线，集成了增强注意力机制的 U-Net 和 ResNet 架构，能够同时执行去噪、中心校正和椭圆形畸变校准。该网络在涵盖广泛噪声水平、漂移幅度和畸变类型的大规模模拟数据集上进行训练，使其能够有效泛化到在不同条件下采集的实验数据。定量评估表明，我们的流程在去噪过程中将均方误差降低了最多 50%，并且在中心检测任务中实现了亚像素级的中心定位，平均误差低于 0.04 像素。输出结果与传统算法进行了基准比较，突出了在噪声抑制和衍射图样恢复方面的改进，从而促进了高通量、可靠的 4D-STEM 实时分析，实现自动化表征。

Subjects: Computer Vision and Pattern Recognition, Materials Science, Artificial Intelligence 主题：计算机视觉与模式识别，材料科学，人工智能

Publish: 2025-08-05 12:35:28 UTC 发布时间：2025-08-05 12:35:28 UTC

#141 U-PINet: End-to-End Hierarchical Physics-Informed Learning With Sparse Graph Coupling for 3D EM Scattering Modeling #141 U-PINet：基于稀疏图耦合的端到端分层物理信息学习用于三维电磁散射建模

Authors: [Rui Zhu](https://arxiv.org/search/?searchtype=author&query=Rui Zhu), [Yuexing Peng](https://arxiv.org/search/?searchtype=author&query=Yuexing Peng), [Peng Wang](https://arxiv.org/search/?searchtype=author&query=Peng Wang), [George C. Alexandropoulos](https://arxiv.org/search/?searchtype=author&query=George C. Alexandropoulos), [Wenbo Wang](https://arxiv.org/search/?searchtype=author&query=Wenbo Wang), [Wei Xiang](https://arxiv.org/search/?searchtype=author&query=Wei Xiang) 作者：朱锐，彭跃星，王鹏，George C. Alexandropoulos，王文博，向伟

Electromagnetic (EM) scattering modeling is critical for radar remote sensing; however, its inherent complexity introduces significant computational challenges. Traditional numerical solvers offer high accuracy, but suffer from scalability issues and substantial computational costs. Purely data-driven deep learning approaches, while efficient, do not embed physical constraints during training and require extensive labeled data, limiting their applicability and generalization. To overcome these limitations, we propose a U-shaped Physics-Informed Network (U-PINet), the first fully deep-learning-based, physics-informed hierarchical framework for computational EM designed to ensure physical consistency while maximizing computational efficiency. Motivated by the hierarchical decomposition strategy in EM solvers and the inherent sparsity of local EM coupling, the U-PINet models the decomposition and coupling of near- and far-field interactions through a multiscale processing neural network architecture, while employing a physics-inspired sparse graph representation to efficiently model both self- and mutual coupling among mesh elements of complex three-dimensional (3D) objects. This principled approach enables end-to-end multiscale EM scattering modeling with improved efficiency, generalization, and physical consistency. Experimental results show that the U-PINet accurately predicts surface current distributions, achieving close agreement with traditional solvers, while significantly reducing computational time and outperforming conventional deep learning baselines in both accuracy and robustness. Furthermore, our evaluations on radar cross section prediction tasks confirm the feasibility of the U-PINet for downstream EM scattering applications.
电磁（EM）散射建模对于雷达遥感至关重要，然而其固有的复杂性带来了显著的计算挑战。传统的数值求解器虽然精度高，但存在可扩展性问题且计算成本巨大。纯数据驱动的深度学习方法虽然高效，但在训练过程中缺乏物理约束的嵌入，且需要大量标注数据，限制了其适用性和泛化能力。为克服这些限制，我们提出了一种 U 形物理信息网络（U-PINet），这是首个完全基于深度学习的、物理信息驱动的分层计算电磁框架，旨在确保物理一致性的同时最大化计算效率。受电磁求解器中的分层分解策略及局部电磁耦合固有稀疏性的启发，U-PINet 通过多尺度处理神经网络架构对近场和远场相互作用的分解与耦合进行建模，同时采用物理启发的稀疏图表示，有效模拟复杂三维（3D）物体网格元素之间的自耦合和互耦合。这种有原则的方法实现了端到端的多尺度电磁散射建模，提升了效率、泛化能力和物理一致性。实验结果表明，U-PINet 能够准确预测表面电流分布，与传统求解器高度一致，同时显著减少计算时间，并在准确性和鲁棒性方面均优于传统深度学习基线方法。此外，我们在雷达散射截面预测任务上的评估进一步验证了 U-PINet 在下游电磁散射应用中的可行性。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-05 12:20:42 UTC 发布时间：2025-08-05 12:20:42 UTC

#142 When Deep Learning Fails: Limitations of Recurrent Models on Stroke-Based Handwriting for Alzheimer's Disease Detection #142 当深度学习失败时：基于笔画的手写识别中循环模型在阿尔茨海默病检测上的局限性

Authors: [Emanuele Nardone](https://arxiv.org/search/?searchtype=author&query=Emanuele Nardone), [Tiziana D’Alessandro](https://arxiv.org/search/?searchtype=author&query=Tiziana D’Alessandro), [Francesco Fontanella](https://arxiv.org/search/?searchtype=author&query=Francesco Fontanella), [Claudio De Stefano](https://arxiv.org/search/?searchtype=author&query=Claudio De Stefano) 作者：Emanuele Nardone, Tiziana D’Alessandro, Francesco Fontanella, Claudio De Stefano

Alzheimer’s disease detection requires expensive neuroimaging or invasive procedures, limiting accessibility. This study explores whether deep learning can enable non-invasive Alzheimer’s disease detection through handwriting analysis. Using a dataset of 34 distinct handwriting tasks collected from healthy controls and Alzheimer’s disease patients, we evaluate and compare three recurrent neural architectures (LSTM, GRU, RNN) against traditional machine learning models. A crucial distinction of our approach is that the recurrent models process pre-extracted features from discrete strokes, not raw temporal signals. This violates the assumption of a continuous temporal flow that recurrent networks are designed to capture. Results reveal that they exhibit poor specificity and high variance. Traditional ensemble methods significantly outperform all deep architectures, achieving higher accuracy with balanced metrics. This demonstrates that recurrent architectures, designed for continuous temporal sequences, fail when applied to feature vectors extracted from ambiguously segmented strokes. Despite their complexity, deep learning models cannot overcome the fundamental disconnect between their architectural assumptions and the discrete, feature-based nature of stroke-level handwriting data. Although performance is limited, the study highlights several critical issues in data representation and model compatibility, pointing to valuable directions for future research. 阿尔茨海默病的检测通常依赖昂贵的神经影像或侵入性程序，限制了其可及性。本研究探讨了深度学习是否能够通过手写分析实现非侵入性的阿尔茨海默病检测。我们使用了来自健康对照组和阿尔茨海默病患者的 34 种不同手写任务的数据集，评估并比较了三种循环神经网络架构（LSTM、GRU、RNN）与传统机器学习模型。我们方法的一个关键区别在于，循环模型处理的是从离散笔画中预先提取的特征，而非原始的时间序列信号。这违背了循环网络设计用以捕捉连续时间流的假设。结果显示，这些模型表现出较差的特异性和较高的方差。传统的集成方法显著优于所有深度架构，在准确率和指标平衡性上均表现更佳。这表明，设计用于连续时间序列的循环架构，在应用于从模糊分割的笔画中提取的特征向量时会失败。尽管结构复杂，深度学习模型仍无法克服其架构假设与笔画级手写数据的离散、基于特征的本质之间的根本脱节。尽管性能有限，该研究强调了数据表示和模型兼容性中的若干关键问题，指明了未来研究的宝贵方向。

Subjects: Image and Video Processing, Artificial Intelligence, Computer Vision and Pattern Recognition 主题：图像与视频处理，人工智能，计算机视觉与模式识别

Publish: 2025-08-05 11:10:11 UTC 发布时间：2025-08-05 11:10:11 UTC

#143 GTPO: Trajectory-Based Policy Optimization in Large Language Models #143 GTPO：基于轨迹的策略优化在大型语言模型中的应用

Authors: [Marco Simoni](https://arxiv.org/search/?searchtype=author&query=Marco Simoni), [Aleksandar Fontana](https://arxiv.org/search/?searchtype=author&query=Aleksandar Fontana), [Giulio Rossolini](https://arxiv.org/search/?searchtype=author&query=Giulio Rossolini), [Andrea Saracino](https://arxiv.org/search/?searchtype=author&query=Andrea Saracino) 作者：Marco Simoni，Aleksandar Fontana，Giulio Rossolini，Andrea Saracino

Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题：机器学习，人工智能，计算与语言

Publish: 2025-08-05 08:15:01 UTC 发布时间：2025-08-05 08:15:01 UTC

#144 Trustworthiness of Legal Considerations for the Use of LLMs in Education #144 教育中使用 LLMs 的法律考量的可信度

Authors: [Sara Alaswad](https://arxiv.org/search/?searchtype=author&query=Sara Alaswad), [Tatiana Kalganova](https://arxiv.org/search/?searchtype=author&query=Tatiana Kalganova), [Wasan Awad](https://arxiv.org/search/?searchtype=author&query=Wasan Awad) 作者：Sara Alaswad，Tatiana Kalganova，Wasan Awad

As Artificial Intelligence (AI), particularly Large Language Models (LLMs), becomes increasingly embedded in education systems worldwide, ensuring their ethical, legal, and contextually appropriate deployment has become a critical policy concern. This paper offers a comparative analysis of AI-related regulatory and ethical frameworks across key global regions, including the European Union, United Kingdom, United States, China, and Gulf Cooperation Council (GCC) countries. It maps how core trustworthiness principles, such as transparency, fairness, accountability, data privacy, and human oversight are embedded in regional legislation and AI governance structures. Special emphasis is placed on the evolving landscape in the GCC, where countries are rapidly advancing national AI strategies and education-sector innovation. To support this development, the paper introduces a Compliance-Centered AI Governance Framework tailored to the GCC context. This includes a tiered typology and institutional checklist designed to help regulators, educators, and developers align AI adoption with both international norms and local values. By synthesizing global best practices with region-specific challenges, the paper contributes practical guidance for building legally sound, ethically grounded, and culturally sensitive AI systems in education. These insights are intended to inform future regulatory harmonization and promote responsible AI integration across diverse educational environments. 随着人工智能（AI），特别是大型语言模型（LLMs），日益融入全球教育系统，确保其在伦理、法律及情境适用性方面的部署已成为关键的政策关注点。本文对包括欧盟、英国、美国、中国及海湾合作委员会（GCC）国家在内的主要全球地区的 AI 相关监管和伦理框架进行了比较分析。文章梳理了透明度、公平性、问责制、数据隐私和人工监督等核心可信原则如何嵌入各地区立法和 AI 治理结构中。特别强调了 GCC 地区不断发展的格局，该地区国家正迅速推进国家 AI 战略和教育领域创新。为支持这一发展，本文提出了一个以合规为中心、适用于 GCC 背景的 AI 治理框架。该框架包括分层类型学和机构检查表，旨在帮助监管者、教育者和开发者使 AI 应用既符合国际规范，又契合本地价值观。通过综合全球最佳实践与地区特定挑战，本文为构建合法合规、伦理基础坚实且文化敏感的教育领域 AI 系统提供了实用指导。这些见解旨在为未来的监管协调提供参考，并促进在多样化教育环境中负责任的 AI 整合。

Subjects: Computers and Society, Artificial Intelligence 主题：计算机与社会，人工智能

Publish: 2025-08-05 07:44:33 UTC 发布时间：2025-08-05 07:44:33 UTC

#145 Development of management systems using artificial intelligence systems and machine learning methods for boards of directors (preprint, unofficial translation) #145 利用人工智能系统和机器学习方法为董事会开发管理系统（预印本，非官方译本）

Author: [Anna Romanova](https://arxiv.org/search/?searchtype=author&query=Anna Romanova) 作者：Anna Romanova

The study addresses the paradigm shift in corporate management, where AI is moving from a decision support tool to an autonomous decision-maker, with some AI systems already appointed to leadership roles in companies. A central problem identified is that the development of AI technologies is far outpacing the creation of adequate legal and ethical guidelines. The research proposes a “reference model” for the development and implementation of autonomous AI systems in corporate management. This model is based on a synthesis of several key components to ensure legitimate and ethical decision-making. The model introduces the concept of “computational law” or “algorithmic law”. This involves creating a separate legal framework for AI systems, with rules and regulations translated into a machine-readable, algorithmic format to avoid the ambiguity of natural language. The paper emphasises the need for a “dedicated operational context” for autonomous AI systems, analogous to the “operational design domain” for autonomous vehicles. This means creating a specific, clearly defined environment and set of rules within which the AI can operate safely and effectively. The model advocates for training AI systems on controlled, synthetically generated data to ensure fairness and ethical considerations are embedded from the start. Game theory is also proposed as a method for calculating the optimal strategy for the AI to achieve its goals within these ethical and legal constraints. The provided analysis highlights the importance of explainable AI (XAI) to ensure the transparency and accountability of decisions made by autonomous systems. This is crucial for building trust and for complying with the “right to explanation”. 
该研究探讨了企业管理中的范式转变，人工智能正从决策支持工具转变为自主决策者，部分人工智能系统已被任命为公司的领导角色。研究指出的一个核心问题是，人工智能技术的发展远远超过了相应法律和伦理准则的制定。研究提出了一个用于企业管理中自主人工智能系统开发和实施的“参考模型”。该模型基于多个关键组成部分的综合，以确保合法且符合伦理的决策。模型引入了“计算法”或“算法法”的概念。这涉及为人工智能系统创建一个独立的法律框架，将规则和法规转化为机器可读的算法格式，以避免自然语言的歧义。论文强调了为自主人工智能系统设立“专用操作环境”的必要性，类似于自动驾驶车辆的“操作设计域”。这意味着创建一个具体且明确定义的环境和规则集，使人工智能能够在其中安全有效地运行。该模型主张在受控的合成生成数据上训练人工智能系统，以确保从一开始就嵌入公平性和伦理考量。博弈论也被提出作为计算人工智能在这些伦理和法律约束内实现其目标的最优策略的方法。所提供的分析强调了可解释人工智能（XAI）的重要性，以确保自主系统所做决策的透明性和问责性。这对于建立信任以及遵守“解释权”至关重要。

Subjects: Computers and Society, Artificial Intelligence, Machine Learning 主题：计算机与社会，人工智能，机器学习

Publish: 2025-08-05 04:01:22 UTC 发布时间：2025-08-05 04:01:22 UTC

#146 CoughViT: A Self-Supervised Vision Transformer for Cough Audio Representation Learning #146 CoughViT：一种用于咳嗽音频表示学习的自监督视觉变换器

Authors: [Justin Luong](https://arxiv.org/search/?searchtype=author&query=Justin Luong), [Hao Xue](https://arxiv.org/search/?searchtype=author&query=Hao Xue), [Flora D. Salim](https://arxiv.org/search/?searchtype=author&query=Flora D. Salim) 作者：Justin Luong，Hao Xue，Flora D. Salim

Physicians routinely assess respiratory sounds during the diagnostic process, providing insight into the condition of a patient’s airways. In recent years, AI-based diagnostic systems operating on respiratory sounds have demonstrated success in respiratory disease detection. These systems represent a crucial advancement in early and accessible diagnosis, which is essential for timely treatment. However, label and data scarcity remain key challenges, especially for conditions beyond COVID-19, limiting diagnostic performance and reliable evaluation. In this paper, we propose CoughViT, a novel pre-training framework for learning general-purpose cough sound representations, to enhance diagnostic performance in tasks with limited data. To address label scarcity, we employ masked data modelling to train a feature encoder in a self-supervised learning manner. We evaluate our approach against other pre-training strategies on three diagnostically important cough classification tasks. Experimental results show that our representations match or exceed current state-of-the-art supervised audio representations in enhancing performance on downstream tasks. 医生在诊断过程中常规评估呼吸音，以了解患者气道的状况。近年来，基于人工智能的呼吸音诊断系统在呼吸疾病检测方面取得了成功。这些系统代表了早期且易于获取的诊断的重要进展，对于及时治疗至关重要。然而，标签和数据的稀缺仍然是关键挑战，尤其是在 COVID-19 以外的疾病中，限制了诊断性能和可靠评估。本文提出了 CoughViT，一种用于学习通用咳嗽声音表示的新型预训练框架，以提升有限数据任务中的诊断性能。为解决标签稀缺问题，我们采用掩码数据建模，以自监督学习方式训练特征编码器。我们在三个诊断重要的咳嗽分类任务上，将该方法与其他预训练策略进行了比较评估。实验结果表明，我们的表示在提升下游任务性能方面，匹配或超越了当前最先进的有监督音频表示。
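The masked data modelling idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes the cough spectrogram has already been cut into a sequence of patches; the 75% mask ratio and the function names `mask_patches` and `masked_reconstruction_loss` are illustrative assumptions, not details taken from the paper.

```python
import random

def mask_patches(num_patches, mask_ratio=0.75, seed=0):
    """Split patch indices into (visible, masked) lists.

    mask_ratio is an assumed value, not one reported in the abstract.
    """
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(round(num_patches * mask_ratio))
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked

def masked_reconstruction_loss(target, prediction, masked):
    """Mean squared error computed only over the masked patches."""
    total, count = 0.0, 0
    for i in masked:
        for t, p in zip(target[i], prediction[i]):
            total += (t - p) ** 2
            count += 1
    return total / count if count else 0.0
```

In a self-supervised setup of this kind, an encoder would see only the visible patches and a decoder would be trained to predict the masked ones, so no diagnostic labels are needed at the pre-training stage.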

Subjects: Sound, Artificial Intelligence 主题：声音，人工智能

Publish: 2025-08-04 23:09:07 UTC 发布时间：2025-08-04 23:09:07 UTC

#147 Refine-IQA: Multi-Stage Reinforcement Finetuning for Perceptual Image Quality Assessment #147 Refine-IQA：用于感知图像质量评估的多阶段强化微调

Authors: [Ziheng Jia](https://arxiv.org/search/?searchtype=author&query=Ziheng Jia), [Jiaying Qian](https://arxiv.org/search/?searchtype=author&query=Jiaying Qian), [Zicheng Zhang](https://arxiv.org/search/?searchtype=author&query=Zicheng Zhang), [Zijian Chen](https://arxiv.org/search/?searchtype=author&query=Zijian Chen), [Xiongkuo Min](https://arxiv.org/search/?searchtype=author&query=Xiongkuo Min) 作者：贾子恒、钱佳颖、张子成、陈子健、闵雄阔

Reinforcement fine-tuning (RFT) is a proliferating paradigm for LMM training. Analogous to high-level reasoning tasks, RFT is similarly applicable to low-level vision domains, including image quality assessment (IQA). Existing RFT-based IQA methods typically use rule-based output rewards to verify the model’s rollouts but provide no reward supervision for the “think” process, leaving its correctness and efficacy uncontrolled. Furthermore, these methods typically fine-tune directly on downstream IQA tasks without explicitly enhancing the model’s native low-level visual quality perception, which may constrain its performance upper bound. In response to these gaps, we propose the multi-stage RFT IQA framework (Refine-IQA). In Stage-1, we build the Refine-Perception-20K dataset (with 12 main distortions, 20,907 locally-distorted images, and over 55K RFT samples) and design multi-task reward functions to strengthen the model’s visual quality perception. In Stage-2, targeting the quality scoring task, we introduce a strategy involving a probability difference reward for “think” process supervision. The resulting Refine-IQA Series Models achieve outstanding performance on both perception and scoring tasks. Notably, our paradigm activates a robust “think” (quality interpreting) capability that also attains exceptional results on the corresponding quality interpreting benchmark. 强化微调（RFT）是一种日益普及的 LMM 训练范式。类似于高级推理任务，RFT 同样适用于低级视觉领域，包括图像质量评估（IQA）。现有基于 RFT 的 IQA 方法通常使用基于规则的输出奖励来验证模型的展开结果，但对“思考”过程没有奖励监督，导致其正确性和有效性无法得到控制。此外，这些方法通常直接在下游 IQA 任务上进行微调，而没有明确增强模型本身的低级视觉质量感知能力，这可能限制其性能上限。针对这些不足，我们提出了多阶段 RFT IQA 框架（Refine-IQA）。在第一阶段，我们构建了 Refine-Perception-20K 数据集（包含 12 种主要失真，20,907 张局部失真图像，以及超过 55K 的 RFT 样本），并设计了多任务奖励函数以强化模型的视觉质量感知。在第二阶段，针对质量评分任务，我们引入了一种涉及概率差异奖励的策略，用于“思考”过程的监督。所得的 Refine-IQA 系列模型在感知和评分任务上均表现出色——值得注意的是，我们的范式激活了强大的“思考”（质量解释）能力，在相应的质量解释基准测试中也取得了卓越的成绩。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-04 22:46:10 UTC 发布时间：2025-08-04 22:46:10 UTC

#148 FlashCommunication V2: Bit Splitting and Spike Reserving for Any Bit Communication #148 FlashCommunication V2：任意比特通信的比特拆分与脉冲保留

Authors: [Qingyuan Li](https://arxiv.org/search/?searchtype=author&query=Qingyuan Li), [Bo Zhang](https://arxiv.org/search/?searchtype=author&query=Bo Zhang), [Hui Kang](https://arxiv.org/search/?searchtype=author&query=Hui Kang), [Tianhao Xu](https://arxiv.org/search/?searchtype=author&query=Tianhao Xu), [Yulei Qian](https://arxiv.org/search/?searchtype=author&query=Yulei Qian), [Yuchen Xie](https://arxiv.org/search/?searchtype=author&query=Yuchen Xie), [Lin Ma](https://arxiv.org/search/?searchtype=author&query=Lin Ma) 作者：李清远，张博，康辉，徐天昊，钱宇磊，谢宇辰，马林

Nowadays, communication bottlenecks have emerged as a critical challenge in the distributed training and deployment of large language models (LLMs). This paper introduces FlashCommunication V2, a novel communication paradigm enabling efficient cross-GPU transmission at arbitrary bit widths. Its core innovations lie in the proposed bit splitting and spike reserving techniques, which address the challenges of low-bit quantization. Bit splitting decomposes irregular bit widths into basic units, ensuring compatibility with hardware capabilities and thus enabling transmission at any bit width. Spike reserving, on the other hand, retains numerical outliers (i.e., minima and maxima) as floating-point numbers, which shrinks the dynamic numerical range and pushes the quantization limits to 2-bit with acceptable losses. FlashCommunication V2 significantly enhances the flexibility and resource utilization of communication systems. Through meticulous software-hardware co-design, it delivers robust performance and reduced overhead across both NVLink-based and PCIe-based architectures, achieving a maximum 3.2× speedup in AllReduce and 2× in All2All communication. 如今，通信瓶颈已成为大规模语言模型（LLMs）分布式训练和部署中的关键挑战。本文介绍了 FlashCommunication V2，一种支持任意位宽高效跨 GPU 传输的新型通信范式。其核心创新在于提出的位拆分和峰值保留技术，解决了低位量化的难题。位拆分将不规则位宽分解为基本单元，确保与硬件能力兼容，从而实现任意位宽的传输。峰值保留则将数值异常值（即最小值和最大值）保留为浮点数，缩小了动态数值范围，将量化极限推至 2 位且损失可接受。FlashCommunication V2 显著提升了通信系统的灵活性和资源利用率。通过精细的软件硬件协同设计，它在基于 NVLink 和 PCIe 的架构上均实现了稳健的性能和降低的开销，在 AllReduce 通信中最高实现 3.2 倍速度提升，在 All2All 通信中实现 2 倍提升。
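The two techniques named in the abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's kernel design: the greedy decomposition into 8/4/2/1-bit units, the uniform quantizer, and all function names are hypothetical, and the input is assumed to hold at least three values.

```python
def split_bits(width):
    """Decompose an arbitrary bit width into basic units (assumed: 8, 4, 2, 1)."""
    units = []
    for unit in (8, 4, 2, 1):
        while width >= unit:
            units.append(unit)
            width -= unit
    return units

def split_code(code, width):
    """Split one `width`-bit integer code into per-unit chunks, most significant first."""
    chunks, remaining = [], width
    for unit in split_bits(width):
        remaining -= unit
        chunks.append((code >> remaining) & ((1 << unit) - 1))
    return chunks

def merge_code(chunks, width):
    """Inverse of split_code."""
    code = 0
    for unit, chunk in zip(split_bits(width), chunks):
        code = (code << unit) | chunk
    return code

def quantize_with_spikes(values, bits):
    """Keep the minimum and maximum as floats; quantize the rest over the shrunk range."""
    lo_idx = min(range(len(values)), key=values.__getitem__)
    hi_idx = max(range(len(values)), key=values.__getitem__)
    spikes = {lo_idx: values[lo_idx], hi_idx: values[hi_idx]}
    rest = [v for i, v in enumerate(values) if i not in spikes]
    lo, hi = min(rest), max(rest)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [0 if i in spikes else int(round((v - lo) / scale))
             for i, v in enumerate(values)]
    return codes, lo, scale, spikes

def dequantize_with_spikes(codes, lo, scale, spikes):
    """Restore reserved spikes exactly and the remaining values approximately."""
    return [spikes[i] if i in spikes else lo + c * scale
            for i, c in enumerate(codes)]
```

Reserving the two extremes means the quantization step is set by the remaining values only, which is why the abstract describes the dynamic range as shrinking. A 5-bit code, for example, would travel as one 4-bit chunk plus one 1-bit chunk under the assumed decomposition.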

Subjects: Distributed, Parallel, and Cluster Computing, Artificial Intelligence 主题：分布式、并行与集群计算，人工智能

Publish: 2025-08-04 13:47:29 UTC 发布时间：2025-08-04 13:47:29 UTC

#149 M3HL: Mutual Mask Mix with High-Low Level Feature Consistency for Semi-Supervised Medical Image Segmentation #149 M3HL：具有高低层特征一致性的互掩码混合，用于半监督医学图像分割

Authors: [Yajun Liu](https://arxiv.org/search/?searchtype=author&query=Yajun Liu), [Zenghui Zhang](https://arxiv.org/search/?searchtype=author&query=Zenghui Zhang), [Jiang Yue](https://arxiv.org/search/?searchtype=author&query=Jiang Yue), [Weiwei Guo](https://arxiv.org/search/?searchtype=author&query=Weiwei Guo), [Dongying Li](https://arxiv.org/search/?searchtype=author&query=Dongying Li) 作者：刘亚军，张增辉，岳江，郭伟伟，李东英

Data augmentation methods inspired by CutMix have demonstrated significant potential in recent semi-supervised medical image segmentation tasks. However, these approaches often apply CutMix operations in a rigid and inflexible manner, while paying insufficient attention to feature-level consistency constraints. In this paper, we propose a novel method called Mutual Mask Mix with High-Low level feature consistency (M3HL) to address the aforementioned challenges, which consists of two key components: 1) M3: An enhanced data augmentation operation inspired by the masking strategy from Masked Image Modeling (MIM), which advances conventional CutMix through dynamically adjustable masks to generate spatially complementary image pairs for collaborative training, thereby enabling effective information fusion between labeled and unlabeled images. 2) HL: A hierarchical consistency regularization framework that enforces high-level and low-level feature consistency between unlabeled and mixed images, enabling the model to better capture discriminative feature representations. Our method achieves state-of-the-art performance on widely adopted medical image segmentation benchmarks including the ACDC and LA datasets. Source code is available at https://github.com/PHPJava666/M3HL 受 CutMix 启发的数据增强方法在近期的半监督医学图像分割任务中展现了显著潜力。然而，这些方法通常以僵硬且不灵活的方式应用 CutMix 操作，同时对特征级一致性约束关注不足。本文提出了一种名为互掩混合与高低层特征一致性（M3HL）的新方法，以解决上述挑战，该方法包含两个关键组成部分：1）M3：一种受掩码图像建模（MIM）中的掩码策略启发的增强数据增强操作，通过动态可调掩码改进传统 CutMix，生成空间互补的图像对以进行协同训练，从而实现标注图像与未标注图像之间的有效信息融合。2）HL：一种分层一致性正则化框架，强制无标签图像与混合图像之间的高层和低层特征一致性，使模型能够更好地捕捉判别性特征表示。我们的方法在广泛采用的医学图像分割基准测试中取得了最先进的性能，包括 ACDC 和 LA 数据集。源代码可在 https://github.com/PHPJava666/M3HL 获取。
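The "spatially complementary image pairs" described for the M3 component can be sketched as follows. This is a minimal illustration, assuming a random patch-grid mask and image sizes divisible by the patch size; the mask generator, its parameters, and the function names are assumptions rather than the released implementation.

```python
import random

def random_patch_mask(height, width, patch, mask_ratio, seed=0):
    """Binary mask (1 = take pixel from the first image) built on a patch grid."""
    rng = random.Random(seed)
    rows, cols = height // patch, width // patch
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    chosen = set(rng.sample(cells, int(round(len(cells) * mask_ratio))))
    return [[1 if (y // patch, x // patch) in chosen else 0
             for x in range(width)] for y in range(height)]

def mutual_mask_mix(image_a, image_b, mask):
    """Return two complementary mixes: (M*a + (1-M)*b, (1-M)*a + M*b)."""
    mixed_ab = [[a if m else b for a, b, m in zip(ra, rb, rm)]
                for ra, rb, rm in zip(image_a, image_b, mask)]
    mixed_ba = [[b if m else a for a, b, m in zip(ra, rb, rm)]
                for ra, rb, rm in zip(image_a, image_b, mask)]
    return mixed_ab, mixed_ba
```

Because the second mix uses the inverted mask, every pixel of both source images appears in exactly one of the two outputs, which is what makes the pair complementary. Changing `patch` and `mask_ratio` is one simple way to make the mask adjustable.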

Subjects: Image and Video Processing, Artificial Intelligence, Computer Vision and Pattern Recognition 主题：图像与视频处理，人工智能，计算机视觉与模式识别

Publish: 2025-08-04 05:42:10 UTC 发布时间：2025-08-04 05:42:10 UTC

#150 Data-Driven Discovery of Mobility Periodicity for Understanding Urban Transportation Systems #150 基于数据驱动的出行周期性发现以理解城市交通系统

Authors: [Xinyu Chen](https://arxiv.org/search/?searchtype=author&query=Xinyu Chen), [Qi Wang](https://arxiv.org/search/?searchtype=author&query=Qi Wang), [Yunhan Zheng](https://arxiv.org/search/?searchtype=author&query=Yunhan Zheng), [Nina Cao](https://arxiv.org/search/?searchtype=author&query=Nina Cao), [HanQin Cai](https://arxiv.org/search/?searchtype=author&query=HanQin Cai), [Jinhua Zhao](https://arxiv.org/search/?searchtype=author&query=Jinhua Zhao) 作者：陈新宇，王琪，郑云涵，曹妮娜，蔡汉钦，赵金华

Uncovering the temporal regularity of human mobility is crucial for discovering urban dynamics and has implications for various decision-making processes and urban system applications. This study formulates the periodicity quantification problem in complex and multidimensional human mobility data as a sparse identification of dominant positive auto-correlations in time series autoregression, allowing one to discover and quantify significant periodic patterns such as weekly periodicity from a data-driven and interpretable machine learning perspective. We apply our framework to real-world human mobility data, including metro passenger flow in Hangzhou, China and ridesharing trips in New York City (NYC) and Chicago, USA, revealing the interpretable weekly periodicity across different spatial locations over the past several years. In particular, our analysis of ridesharing data from 2019 to 2024 demonstrates the disruptive impact of the COVID-19 pandemic on mobility regularity and the subsequent recovery trends, highlighting differences in the recovery pattern percentages and speeds between NYC and Chicago. We find that both NYC and Chicago experienced a remarkable reduction of weekly periodicity in 2020, and that the recovery of mobility regularity in NYC is faster than in Chicago. The interpretability of sparse autoregression provides insights into the underlying temporal patterns of human mobility, offering a valuable tool for understanding urban systems. Our findings highlight the potential of interpretable machine learning to unlock crucial insights from real-world mobility data. 
揭示人类出行的时间规律性对于发现城市动态至关重要，并对各种决策过程和城市系统应用具有重要意义。本研究将复杂多维人类出行数据中的周期性量化问题表述为时间序列自回归中主导正自相关的稀疏识别，从数据驱动且可解释的机器学习视角出发，能够发现并量化显著的周期性模式，如周周期性。我们将该框架应用于真实世界的人类出行数据，包括中国杭州的地铁客流量以及美国纽约市和芝加哥的网约车出行，揭示了过去数年不同空间位置间可解释的周周期性。特别是，我们对 2019 年至 2024 年网约车数据的分析展示了 COVID-19 疫情对出行规律性的破坏性影响及其后续的恢复趋势，突出表现了纽约市与芝加哥在恢复模式比例和速度上的差异。我们发现纽约市和芝加哥在 2020 年都经历了每周周期性的显著减少，且纽约市的出行规律恢复速度快于芝加哥。稀疏自回归模型的可解释性为人类出行的潜在时间模式提供了洞见，成为理解城市系统的宝贵工具。我们的研究结果强调了可解释机器学习在从真实出行数据中挖掘关键见解方面的潜力。
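To make the "sparse identification of dominant positive auto-correlations" more concrete, here is a heavily simplified sketch: a 1-sparse, non-negative autoregression solved by exhaustive search over lags. The paper's actual formulation and solver are not given in the abstract, so the function name `dominant_lag`, the single-lag restriction, and the synthetic data are all assumptions.

```python
def dominant_lag(series, max_lag):
    """Find the single lag k and weight w >= 0 minimizing sum_t (x_t - w * x_{t-k})^2."""
    target = series[max_lag:]  # same prediction window for every candidate lag
    best = None
    for lag in range(1, max_lag + 1):
        past = series[max_lag - lag:len(series) - lag]
        denom = sum(p * p for p in past)
        if denom == 0:
            continue
        # Closed-form least-squares weight, clipped to stay non-negative.
        weight = max(0.0, sum(x * p for x, p in zip(target, past)) / denom)
        residual = sum((x - weight * p) ** 2 for x, p in zip(target, past))
        if best is None or residual < best[0]:
            best = (residual, lag, weight)
    if best is None:
        raise ValueError("series has no usable lagged values")
    return best[1], best[2]

# Synthetic daily counts with a weekday/weekend pattern (illustrative data only).
weekly_series = [5.0, 5.0, 5.0, 5.0, 5.0, 2.0, 1.0] * 8
lag, weight = dominant_lag(weekly_series, max_lag=10)
```

On this synthetic series the search selects the 7-day lag, since it reproduces the data exactly. With real data, the size of the selected weight could serve as a rough indicator of how strong the weekly regularity is in a given year or location.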

Subjects: Social and Information Networks, Artificial Intelligence, Machine Learning 主题：社会与信息网络，人工智能，机器学习

Publish: 2025-08-02 15:25:20 UTC 发布时间：2025-08-02 15:25:20 UTC

#151 Tobler's First Law in GeoAI: A Spatially Explicit Deep Learning Model for Terrain Feature Detection Under Weak Supervision #151 地理人工智能中的托布勒第一定律：一种基于弱监督的空间显式深度学习地形特征检测模型

Authors: [Wenwen Li](https://arxiv.org/search/?searchtype=author&query=Wenwen Li), [Chia-Yu Hsu](https://arxiv.org/search/?searchtype=author&query=Chia-Yu Hsu), [Maosheng Hu](https://arxiv.org/search/?searchtype=author&query=Maosheng Hu) 作者：李文文，许家瑜，胡茂盛

Recent interest in geospatial artificial intelligence (GeoAI) has fostered a wide range of applications using artificial intelligence (AI), especially deep learning, for geospatial problem solving. However, major challenges such as a lack of training data and the neglect of spatial principles and spatial effects in AI model design remain, significantly hindering the in-depth integration of AI with geospatial research. This paper reports our work in developing a deep learning model that enables object detection, particularly of natural features, in a weakly supervised manner. Our work makes three contributions: First, we present a method of object detection using only weak labels. This is achieved by developing a spatially explicit model based on Tobler’s first law of geography. Second, we incorporate attention maps into the object detection pipeline and develop a multistage training strategy to improve performance. Third, we apply this model to detect impact craters on Mars, a task that previously required extensive manual effort. The model generalizes to both natural and human-made features on the surfaces of Earth and other planets. This research advances the theoretical and methodological foundations of GeoAI. 近年来，地理空间人工智能（GeoAI）引起了广泛关注，推动了利用人工智能（AI），尤其是深度学习，解决地理空间问题的多种应用。然而，训练数据的缺乏以及在 AI 模型设计中忽视空间原理和空间效应等主要挑战依然存在，严重阻碍了 AI 与地理空间研究的深入融合。本文报告了我们开发的一种深度学习模型，该模型能够以弱监督方式实现对象检测，特别是自然特征的检测。我们的工作有三方面贡献：首先，我们提出了一种仅使用弱标签的对象检测方法。该方法基于托布勒第一地理定律，开发了一个空间显式模型。其次，我们将注意力图引入对象检测流程，并开发了多阶段训练策略以提升性能。第三，我们将该模型应用于火星撞击坑的检测，这一任务此前需要大量人工工作。该模型能够推广应用于地球及其他行星表面的自然和人造特征检测。本研究推进了地理人工智能的理论和方法基础。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-01 21:47:50 UTC 发布时间：2025-08-01 21:47:50 UTC

#152 Do We Need Pre-Processing for Deep Learning Based Ultrasound Shear Wave Elastography? #152 我们是否需要对基于深度学习的超声剪切波弹性成像进行预处理？

Authors: [Sarah Grube](https://arxiv.org/search/?searchtype=author&query=Sarah Grube), [Sören Grünhagen](https://arxiv.org/search/?searchtype=author&query=Sören Grünhagen), [Sarah Latus](https://arxiv.org/search/?searchtype=author&query=Sarah Latus), [Michael Meyling](https://arxiv.org/search/?searchtype=author&query=Michael Meyling), [Alexander Schlaefer](https://arxiv.org/search/?searchtype=author&query=Alexander Schlaefer) 作者：Sarah Grube，Sören Grünhagen，Sarah Latus，Michael Meyling，Alexander Schlaefer

Estimating the elasticity of soft tissue can provide useful information for various diagnostic applications. Ultrasound shear wave elastography offers a non-invasive approach. However, its generalizability and standardization across different systems and processing pipelines remain limited. Considering the influence of image processing on ultrasound-based diagnostics, recent literature has discussed the impact of different image processing steps on reliable and reproducible elasticity analysis. In this work, we investigate the need for ultrasound pre-processing steps in deep learning-based ultrasound shear wave elastography. We evaluate the performance of a 3D convolutional neural network in predicting shear wave velocities from spatio-temporal ultrasound images, studying different degrees of pre-processing on the input images, ranging from fully beamformed and filtered ultrasound images to raw radiofrequency data. We compare the predictions from our deep learning approach to a conventional time-of-flight method across four gelatin phantoms with different elasticity levels. Our results demonstrate statistically significant differences in the predicted shear wave velocity among all elasticity groups, regardless of the degree of pre-processing. Although pre-processing slightly improves performance metrics, our results show that the deep learning approach can reliably differentiate between elasticity groups using raw, unprocessed radiofrequency data. These results show that deep learning-based approaches could reduce the need for and the bias of traditional ultrasound pre-processing steps in ultrasound shear wave elastography, enabling faster and more reliable clinical elasticity assessments. 
估计软组织的弹性可以为各种诊断应用提供有用的信息。超声剪切波弹性成像提供了一种非侵入性的方法。然而，其在不同系统和处理流程中的通用性和标准化仍然有限。考虑到图像处理对基于超声的诊断的影响，近期文献讨论了不同图像处理步骤对可靠且可重复的弹性分析的影响。在本研究中，我们探讨了基于深度学习的超声剪切波弹性成像中超声预处理步骤的必要性。我们评估了一种三维卷积神经网络在从时空超声图像预测剪切波速度方面的性能，研究了输入图像不同程度的预处理，范围涵盖了完全波束形成和滤波的超声图像到原始射频数据。我们将深度学习方法的预测结果与传统飞行时间法在四种不同弹性水平的明胶模型上的表现进行了比较。我们的结果显示，在所有弹性组之间，预测的剪切波速度存在统计学显著差异，无论预处理程度如何。尽管预处理略微提升了性能指标，但我们的结果表明，深度学习方法能够可靠地区分使用原始、未处理射频数据的弹性组。这些结果表明，基于深度学习的方法可以减少传统超声预处理步骤在超声剪切波弹性成像中的需求和偏差，从而实现更快速、更可靠的临床弹性评估。
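The conventional time-of-flight baseline mentioned in the abstract can be sketched as a line fit of lateral position against wave arrival time. The abstract does not describe how arrival times are obtained, so they are treated as given here. The conversion to Young's modulus uses the common approximation E ≈ 3ρc², with an assumed density of 1000 kg/m³; both are assumptions, and the function names are hypothetical.

```python
def shear_wave_velocity(positions_mm, arrival_times_ms):
    """Least-squares slope of position over arrival time (mm/ms equals m/s)."""
    n = len(positions_mm)
    mean_t = sum(arrival_times_ms) / n
    mean_x = sum(positions_mm) / n
    cov = sum((t - mean_t) * (x - mean_x)
              for t, x in zip(arrival_times_ms, positions_mm))
    var = sum((t - mean_t) ** 2 for t in arrival_times_ms)
    return cov / var

def youngs_modulus_kpa(velocity_m_s, density_kg_m3=1000.0):
    """E = 3 * rho * c^2, assuming incompressible, linear elastic, isotropic tissue."""
    return 3.0 * density_kg_m3 * velocity_m_s ** 2 / 1000.0

# Illustrative measurements: the wave front reaches 1 mm spaced points every 0.5 ms.
velocity = shear_wave_velocity([0.0, 1.0, 2.0, 3.0], [0.0, 0.5, 1.0, 1.5])
stiffness = youngs_modulus_kpa(velocity)
```

The deep learning approach in the abstract replaces this explicit fit with a 3D convolutional network that predicts the velocity directly from the spatio-temporal data.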

Subjects: Image and Video Processing, Artificial Intelligence, Computer Vision and Pattern Recognition 主题：图像与视频处理，人工智能，计算机视觉与模式识别

Publish: 2025-08-01 11:26:46 UTC 发布时间：2025-08-01 11:26:46 UTC

#153 Boosting Vision Semantic Density with Anatomy Normality Modeling for Medical Vision-language Pre-training #153 通过解剖正常性建模提升医学视觉语言预训练中的视觉语义密度

Vision-language pre-training (VLP) has great potential for developing multifunctional and general medical diagnostic capabilities. However, aligning medical images with a low signal-to-noise ratio (SNR) to reports with a high SNR presents a semantic density gap, leading to visual alignment bias. In this paper, we propose boosting vision semantic density to improve alignment effectiveness. On one hand, we enhance visual semantics through disease-level vision contrastive learning, which strengthens the model’s ability to differentiate between normal and abnormal samples for each anatomical structure. On the other hand, we introduce an anatomical normality modeling method to model the distribution of normal samples for each anatomy, leveraging VQ-VAE for reconstructing normal vision embeddings in the latent space. This process amplifies abnormal signals by leveraging distribution shifts in abnormal samples, enhancing the model’s perception and discrimination of abnormal attributes. The enhanced visual representation effectively captures the diagnostic-relevant semantics, facilitating more efficient and accurate alignment with the diagnostic report. We conduct extensive experiments on two chest CT datasets, CT-RATE and Rad-ChestCT, and an abdominal CT dataset, MedVL-CT69K, and comprehensively evaluate the diagnosis performance across multiple tasks in the chest and abdominal CT scenarios, achieving state-of-the-art zero-shot performance. Notably, our method achieved an average AUC of 84.9% across 54 diseases in 15 organs, significantly surpassing existing methods. Additionally, we demonstrate the superior transfer learning capabilities of our pre-trained model. Code is available at https://github.com/alibaba-damo-academy/ViSD-Boost. 
视觉语言预训练（VLP）在开发多功能和通用的医学诊断能力方面具有巨大潜力。然而，将低信噪比（SNR）的医学图像与高信噪比的报告对齐存在语义密度差距，导致视觉对齐偏差。本文提出通过提升视觉语义密度来改善对齐效果。一方面，我们通过疾病级视觉对比学习增强视觉语义，强化模型区分每个解剖结构正常与异常样本的能力。另一方面，我们引入了解剖正常性建模方法，利用 VQ-VAE 在潜在空间中重建正常视觉嵌入，建模每个解剖部位正常样本的分布。该过程通过利用异常样本的分布偏移放大异常信号，增强模型对异常属性的感知和辨别能力。增强的视觉表示有效捕捉了诊断相关的语义，促进了与诊断报告更高效、更准确的对齐。我们在两个胸部 CT 数据集 CT-RATE 和 Rad-ChestCT 以及一个腹部 CT 数据集 MedVL-CT69K 上进行了大量实验，全面评估了胸部和腹部 CT 场景下多任务的诊断性能，达到了最先进的零样本性能。值得注意的是，我们的方法在 15 个器官的 54 种疾病中实现了平均 AUC 为 84.9%，显著超越了现有方法。此外，我们还展示了预训练模型卓越的迁移学习能力。代码可在 https://github.com/alibaba-damo-academy/ViSD-Boost 获取。

Subjects: Image and Video Processing, Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning 主题：图像与视频处理，人工智能，计算机视觉与模式识别，机器学习

Publish: 2025-08-01 06:52:05 UTC 发布时间：2025-08-01 06:52:05 UTC

#154 Latent Knowledge Scalpel: Precise and Massive Knowledge Editing for Large Language Models #154 潜在知识手术刀：大型语言模型的精确且大规模知识编辑

Large Language Models (LLMs) often retain inaccurate or outdated information from pre-training, leading to incorrect predictions or biased outputs during inference. While existing model editing methods can address this challenge, they struggle with editing large amounts of factual information simultaneously and may compromise the general capabilities of the models. In this paper, our empirical study demonstrates that it is feasible to edit the internal representations of LLMs and replace the entities in a manner similar to editing natural language inputs. Based on this insight, we introduce the Latent Knowledge Scalpel (LKS), an LLM editor that manipulates the latent knowledge of specific entities via a lightweight hypernetwork to enable precise and large-scale editing. Experiments conducted on Llama-2 and Mistral show that, even with the number of simultaneous edits reaching 10,000, LKS effectively performs knowledge editing while preserving the general abilities of the edited LLMs. Code is available at: https://github.com/Linuxin-xxx/LKS. 大型语言模型（LLMs）通常会保留预训练中不准确或过时的信息，导致推理时产生错误预测或偏见输出。虽然现有的模型编辑方法可以解决这一挑战，但它们难以同时编辑大量事实信息，且可能损害模型的整体能力。本文的实证研究表明，编辑 LLMs 的内部表示并以类似编辑自然语言输入的方式替换实体是可行的。基于这一洞见，我们提出了潜在知识手术刀（Latent Knowledge Scalpel，LKS），这是一种通过轻量级超网络操控特定实体潜在知识的 LLM 编辑器，实现精确且大规模的编辑。在 Llama-2 和 Mistral 上的实验表明，即使同时编辑数量达到 10,000，LKS 仍能有效执行知识编辑，同时保持被编辑 LLMs 的整体能力。代码地址：https://github.com/Linuxin-xxx/LKS。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-01 03:51:43 UTC 发布时间：2025-08-01 03:51:43 UTC

#155 VQ-DeepISC: Vector Quantized-Enabled Digital Semantic Communication with Channel Adaptive Image Transmission #155 VQ-DeepISC：支持向量量化的数字语义通信与信道自适应图像传输

Authors: [Jianqiao Chen](https://arxiv.org/search/?searchtype=author&query=Jianqiao Chen), [Tingting Zhu](https://arxiv.org/search/?searchtype=author&query=Tingting Zhu), [Huishi Song](https://arxiv.org/search/?searchtype=author&query=Huishi Song), [Nan Ma](https://arxiv.org/search/?searchtype=author&query=Nan Ma), [Xiaodong Xu](https://arxiv.org/search/?searchtype=author&query=Xiaodong Xu) 作者：陈建桥，朱婷婷，宋慧诗，马楠，徐晓东

Discretization of semantic features enables interoperability between semantic and digital communication systems, showing significant potential for practical applications. The fundamental difficulty in digitizing semantic features stems from the need to preserve continuity and context in inherently analog representations during their compression into discrete symbols while ensuring robustness to channel degradation. In this paper, we propose a vector quantized (VQ)-enabled digital semantic communication system with channel adaptive image transmission, named VQ-DeepISC. Guided by deep joint source-channel coding (DJSCC), we first design a Swin Transformer backbone for hierarchical semantic feature extraction, followed by VQ modules projecting features into discrete latent spaces. Consequently, it enables efficient index-based transmission instead of raw feature transmission. To further optimize this process, we develop an attention mechanism-driven channel adaptation module to dynamically optimize index transmission. Secondly, to counteract codebook collapse during the training process, we impose a distributional regularization by minimizing the Kullback-Leibler divergence (KLD) between codeword usage frequencies and a uniform prior. Meanwhile, exponential moving average (EMA) is employed to stabilize training and ensure balanced feature coverage during codebook updates. Finally, digital communication is implemented using quadrature phase shift keying (QPSK) modulation alongside orthogonal frequency division multiplexing (OFDM), adhering to the IEEE 802.11a standard. Experimental results demonstrate superior reconstruction fidelity of the proposed system over benchmark methods. 
语义特征的离散化实现了语义通信系统与数字通信系统之间的互操作性，展现出显著的实际应用潜力。语义特征数字化的根本难点在于，在将其压缩为离散符号的过程中，需要保持本质上模拟表示的连续性和上下文，同时确保对信道退化的鲁棒性。本文提出了一种基于向量量化（VQ）的数字语义通信系统，具备信道自适应图像传输功能，命名为 VQ-DeepISC。在深度联合源信道编码（DJSCC）的指导下，我们首先设计了一个 Swin Transformer 主干网络用于分层语义特征提取，随后通过 VQ 模块将特征投射到离散潜在空间。因此，该系统实现了基于索引的高效传输，替代了原始特征的传输。为进一步优化该过程，我们开发了一个基于注意力机制的信道自适应模块，用以动态优化索引传输。其次，为了在训练过程中防止码本崩溃，我们通过最小化码字使用频率与均匀先验之间的 Kullback-Leibler 散度（KLD）来施加分布正则化。同时，采用指数移动平均（EMA）来稳定训练，并确保码本更新期间特征的均衡覆盖。最后，数字通信采用正交频分复用（OFDM）结合正交相移键控（QPSK）调制，遵循 IEEE 802.11a 标准。实验结果表明，所提系统在重建保真度方面优于基准方法。
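Two steps from the abstract lend themselves to a small worked sketch: the KL regularizer between codeword usage and a uniform prior, and the mapping of codebook indices to QPSK symbols. The code below is an illustration under assumptions: it uses hard usage counts (a trainable version would presumably need differentiable soft assignments), and the bit-to-symbol convention, with the first bit on the in-phase axis, is assumed rather than taken from the paper.

```python
import math
from collections import Counter

def usage_kl_to_uniform(indices, codebook_size):
    """KL(p || uniform) where p is the empirical codeword usage frequency."""
    counts = Counter(indices)
    total = len(indices)
    kl = 0.0
    for k in range(codebook_size):
        p = counts.get(k, 0) / total
        if p > 0:
            kl += p * math.log(p * codebook_size)
    return kl

def index_to_bits(index, width):
    """Most-significant-bit-first binary expansion of a codebook index."""
    return [(index >> (width - 1 - i)) & 1 for i in range(width)]

def qpsk_modulate(bits):
    """Map bit pairs to unit-energy QPSK symbols (assumed convention: 0 -> -1, 1 -> +1)."""
    if len(bits) % 2:
        raise ValueError("QPSK needs an even number of bits")
    norm = 1.0 / math.sqrt(2.0)
    return [complex(2 * bits[i] - 1, 2 * bits[i + 1] - 1) * norm
            for i in range(0, len(bits), 2)]
```

The regularizer is zero when all codewords are used equally often and reaches log K when a single codeword absorbs every input, which is the collapse case the abstract aims to prevent. Sending a 4-bit index then costs two QPSK symbols.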

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-01 02:35:34 UTC 发布时间：2025-08-01 02:35:34 UTC

#156 A Modified VGG19-Based Framework for Accurate and Interpretable Real-Time Bone Fracture Detection #156 基于改进 VGG19 的框架，用于准确且可解释的实时骨折检测

Authors: [Md. Ehsanul Haque](https://arxiv.org/search/?searchtype=author&query=Md. Ehsanul Haque), [Abrar Fahim](https://arxiv.org/search/?searchtype=author&query=Abrar Fahim), [Shamik Dey](https://arxiv.org/search/?searchtype=author&query=Shamik Dey), [Syoda Anamika Jahan](https://arxiv.org/search/?searchtype=author&query=Syoda Anamika Jahan), [S. M. Jahidul Islam](https://arxiv.org/search/?searchtype=author&query=S. M. Jahidul Islam), [Sakib Rokoni](https://arxiv.org/search/?searchtype=author&query=Sakib Rokoni), [Md Sakib Morshed](https://arxiv.org/search/?searchtype=author&query=Md Sakib Morshed) 作者：Md. Ehsanul Haque，Abrar Fahim，Shamik Dey，Syoda Anamika Jahan，S. M. Jahidul Islam，Sakib Rokoni，Md Sakib Morshed

Early and accurate detection of bone fractures is paramount to initiating treatment as early as possible and avoiding delays that affect patient outcomes. Interpreting X-ray images is a time-consuming and error-prone task, especially when radiology expertise is scarce. Additionally, the deep learning approaches currently in use typically suffer from misclassifications and lack interpretable explanations for clinical use. To overcome these challenges, we propose an automated framework for bone fracture detection using a modified VGG-19 model. It incorporates sophisticated preprocessing techniques, including Contrast Limited Adaptive Histogram Equalization (CLAHE), Otsu’s thresholding, and Canny edge detection, among others, to enhance image clarity and facilitate feature extraction. We also use Grad-CAM, an Explainable AI method that generates visual heatmaps of the model’s decision-making process, so that clinicians can understand the model’s decisions. This encourages trust and helps in further clinical validation. The framework is deployed in a real-time web application, where healthcare professionals can upload X-ray images and get diagnostic feedback within 0.5 seconds. Our modified VGG-19 model attains 99.78% classification accuracy and an AUC score of 1.00. The framework provides a reliable, fast, and interpretable solution for bone fracture detection that supports more efficient diagnosis and better patient care. 
骨折的早期和准确检测对于尽早开始治疗、避免患者治疗和结果的延误至关重要。X 光图像的解读是一项耗时且易出错的任务，尤其是在缺乏放射学专业知识的情况下，解读资源有限。此外，目前使用的深度学习方法通常存在误分类问题，且缺乏可供临床使用的可解释性说明。为克服这些挑战，我们提出了一个基于改进版 VGG-19 模型的骨折自动检测框架。该框架融合了包括限制对比度自适应直方图均衡化（CLAHE）、大津阈值法和 Canny 边缘检测等复杂的预处理技术，以增强图像清晰度并促进特征提取。因此，我们采用 Grad-CAM 这一可解释人工智能方法，生成模型决策过程的可视化热图，作为模型可解释性的一种形式，帮助临床医生理解模型的决策过程。这不仅增强了信任感，也有助于进一步的临床验证。它被部署在一个实时的网络应用中，医疗专业人员可以上传 X 光图像，并在 0.5 秒内获得诊断反馈。我们改进的 VGG-19 模型的性能达到了 99.78%的分类准确率和 1.00 的 AUC 分数，表现异常出色。该框架为骨折检测提供了一个可靠、快速且可解释的解决方案，能够更高效地进行诊断推理，从而改善患者护理。
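Of the preprocessing steps listed in the abstract, Otsu's thresholding is compact enough to sketch in pure Python. This is a textbook version for illustration, not the authors' pipeline; in practice a library such as OpenCV would typically be used for CLAHE, thresholding and edge detection. Pixels are assumed to be integers in the range 0 to 255, and the function name is hypothetical.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t maximizing between-class variance (class 0: pixel <= t)."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so the variance is undefined
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t

# Illustrative flattened image: dark background and bright bone-like pixels.
flat_image = [10] * 50 + [200] * 50
threshold = otsu_threshold(flat_image)
binary = [1 if p > threshold else 0 for p in flat_image]
```

The resulting binary mask separates the two intensity groups without a hand-tuned threshold, which is the reason this step is commonly used ahead of feature extraction.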

Subjects: Image and Video Processing, Artificial Intelligence, Computer Vision and Pattern Recognition 主题：图像与视频处理，人工智能，计算机视觉与模式识别

Publish: 2025-07-31 19:22:58 UTC 发布时间：2025-07-31 19:22:58 UTC

#157 Improve Retinal Artery/Vein Classification via Channel Coupling #157 通过通道耦合改进视网膜动脉/静脉分类

Retinal vessel segmentation plays a vital role in analyzing fundus images for the diagnosis of systemic and ocular diseases. Building on this, classifying segmented vessels into arteries and veins (A/V) enables the extraction of clinically relevant features such as vessel width, diameter and tortuosity, which are essential for detecting conditions like diabetic and hypertensive retinopathy. However, manual segmentation and classification are time-consuming, costly and inconsistent. With the advancement of Convolutional Neural Networks, several automated methods have been proposed to address this challenge, but some issues remain. For example, existing methods treat artery, vein and overall vessel segmentation as three separate binary tasks, neglecting the intrinsic coupling between these anatomical structures. Because artery and vein structures are subsets of the overall retinal vessel map and should naturally exhibit prediction consistency with it, we design a novel loss named Channel-Coupled Vessel Consistency Loss to enforce coherence and consistency between vessel, artery and vein predictions, avoiding biasing the network toward three simple binary segmentation tasks. We also introduce a regularization term named intra-image pixel-level contrastive loss to extract more discriminative, fine-grained feature-level representations for accurate retinal A/V classification. State-of-the-art (SOTA) results are achieved across three public A/V classification datasets: RITE, LES-AV and HRF. Our code will be available upon acceptance.
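The abstract does not give the formula of the Channel-Coupled Vessel Consistency Loss, so the snippet below is only a sketch of the underlying idea that the vessel map should agree with the union of the artery and vein maps. The soft union `max(a, v)`, the squared-error penalty, and the function name are assumptions chosen for illustration, not the paper's definition.

```python
def vessel_consistency_penalty(p_artery, p_vein, p_vessel):
    """Mean squared gap between the vessel map and a soft artery/vein union.

    Each argument is a flat list of per-pixel probabilities in [0, 1].
    max(a, v) stands in for the union; this is an illustrative choice,
    not the loss defined in the paper.
    """
    if not (len(p_artery) == len(p_vein) == len(p_vessel)):
        raise ValueError("all maps must have the same number of pixels")
    gaps = [(max(a, v) - s) ** 2 for a, v, s in zip(p_artery, p_vein, p_vessel)]
    return sum(gaps) / len(gaps)


# Three pixels: an artery pixel, a vein pixel, and a background pixel.
artery = [0.9, 0.1, 0.0]
vein = [0.1, 0.8, 0.0]
consistent = vessel_consistency_penalty(artery, vein, [0.9, 0.8, 0.0])
inconsistent = vessel_consistency_penalty(artery, vein, [0.0, 0.0, 0.0])
```

A term of this kind is zero when the three channels agree and grows when the vessel channel misses pixels that the artery or vein channel claims, which is the coupling the abstract argues independent binary heads ignore.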

Subjects: Image and Video Processing, Artificial Intelligence, Computer Vision and Pattern Recognition

Publish: 2025-07-31 18:43:02 UTC

#158 GanitBench: A bi-lingual benchmark for evaluating mathematical reasoning in Vision Language Models

Benchmarks for evaluating reasoning among Vision Language Models (VLMs) across several fields and domains have been curated more frequently over the last few years. However, these are often monolingual and mostly available in English. Additionally, there is a lack of datasets available in Hindi for tasks other than comprehension and translation. We introduce GanitBench, a tough benchmark consisting of 1527 vision-only questions covering several topics in Mathematics, available in English and Hindi. Collected from two major examinations in India, the JEE Advanced and the CBSE Boards examinations, this benchmark includes questions in the form of images comprising figures essential to a question as well as text. We evaluate two closed-source models on it, in zero-shot Chain-of-Thought (CoT) and two-shot CoT settings. GPT-4o mini is found to be the more dominant model on the benchmark, with its highest average accuracy being 38.15%. We also evaluate models through a “Double Lock” constraint, which brings down the performance of the models by considerable margins. We observe that two-shot CoT appears to be a more effective setting under this environment. Performance of the two VLMs also decreases when answering the same questions in Hindi. We hope to facilitate the inclusion of languages like Hindi in research through our work.

Subjects: Computation and Language, Artificial Intelligence

Publish: 2025-07-31 18:24:05 UTC

#159 Fusion of Pervasive RF Data with Spatial Images via Vision Transformers for Enhanced Mapping in Smart Cities

Authors: [Rafayel Mkrtchyan](https://arxiv.org/search/?searchtype=author&query=Rafayel Mkrtchyan), [Armen Manukyan](https://arxiv.org/search/?searchtype=author&query=Armen Manukyan), [Hrant Khachatrian](https://arxiv.org/search/?searchtype=author&query=Hrant Khachatrian), [Theofanis P. Raptis](https://arxiv.org/search/?searchtype=author&query=Theofanis P. Raptis)

Environment mapping is an important computing task for a wide range of smart city applications, including autonomous navigation, wireless network operations and extended reality environments. Conventional smart city mapping techniques, such as satellite imagery, LiDAR scans, and manual annotations, often suffer from limitations related to cost, accessibility and accuracy. Open-source mapping platforms have been widely utilized in artificial intelligence applications for environment mapping, serving as a source of ground truth. However, human errors and the evolving nature of real-world environments introduce biases that can negatively impact the performance of neural networks trained on such data. In this paper, we present a deep learning-based approach that integrates the DINOv2 architecture to improve building mapping by combining maps from open-source platforms with radio frequency (RF) data collected from multiple wireless user equipment devices and base stations. Our approach leverages a vision transformer-based architecture to jointly process both RF and map modalities within a unified framework, effectively capturing spatial dependencies and structural priors for enhanced mapping accuracy. For evaluation, we employ a synthetic dataset co-produced by Huawei. We develop and train a model that leverages only aggregated path loss information to tackle the mapping problem. We measure the results according to three performance metrics which capture different qualities: (i) the Jaccard index, also known as intersection over union (IoU), (ii) the Hausdorff distance, and (iii) the Chamfer distance. Our design achieves a macro IoU of 65.3%, significantly surpassing (i) the erroneous maps baseline, which yields 40.1%, (ii) an RF-only method from the literature, which yields 37.3%, and (iii) a non-AI fusion baseline that we designed, which yields 42.2%.
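The three metrics named in this abstract have standard definitions, sketched below with the Python standard library. The representation (building masks as sets of grid cells, outlines as lists of 2D points) and the function names are assumptions for illustration. Chamfer distance conventions vary across papers (squared versus plain distances, sum versus mean), so the variant here is one common choice and not necessarily the one the authors used.

```python
import math


def iou(mask_a, mask_b):
    """Jaccard index between two binary masks given as sets of (row, col) cells."""
    union = mask_a | mask_b
    return len(mask_a & mask_b) / len(union) if union else 1.0


def _nearest(p, points):
    """Euclidean distance from point p to its nearest neighbour in `points`."""
    return min(math.dist(p, q) for q in points)


def hausdorff(points_a, points_b):
    """Symmetric Hausdorff distance: the worst-case nearest-neighbour gap."""
    d_ab = max(_nearest(p, points_b) for p in points_a)
    d_ba = max(_nearest(q, points_a) for q in points_b)
    return max(d_ab, d_ba)


def chamfer(points_a, points_b):
    """Symmetric Chamfer distance: sum of the two mean nearest-neighbour gaps."""
    d_ab = sum(_nearest(p, points_b) for p in points_a) / len(points_a)
    d_ba = sum(_nearest(q, points_a) for q in points_b) / len(points_b)
    return d_ab + d_ba


# Two 2-cell "buildings" that overlap in one cell.
pred = {(0, 0), (0, 1)}
truth = {(0, 1), (0, 2)}
iou_value = iou(pred, truth)
hausdorff_value = hausdorff(sorted(pred), sorted(truth))
chamfer_value = chamfer(sorted(pred), sorted(truth))
```

IoU rewards overlapping area, Hausdorff is dominated by the single worst boundary error, and Chamfer averages boundary errors, which is why the abstract describes them as capturing different qualities.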

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-07-31 12:50:53 UTC

#160 StorySync: Training-Free Subject Consistency in Text-to-Image Generation via Region Harmonization

Authors: [Gopalji Gaur](https://arxiv.org/search/?searchtype=author&query=Gopalji Gaur), [Mohammadreza Zolfaghari](https://arxiv.org/search/?searchtype=author&query=Mohammadreza Zolfaghari), [Thomas Brox](https://arxiv.org/search/?searchtype=author&query=Thomas Brox)

Generating a coherent sequence of images that tells a visual story with text-to-image diffusion models often faces the critical challenge of maintaining subject consistency across all story scenes. Existing approaches, which typically rely on fine-tuning or retraining models, are computationally expensive, time-consuming, and often interfere with the model’s pre-existing capabilities. In this paper, we follow a training-free approach and propose an efficient consistent-subject-generation method. This approach works seamlessly with pre-trained diffusion models by introducing masked cross-image attention sharing to dynamically align subject features across a batch of images, and Regional Feature Harmonization to refine visually similar details for improved subject consistency. Experimental results demonstrate that our approach successfully generates visually consistent subjects across a variety of scenarios while maintaining the creative abilities of the diffusion model.

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-07-31 11:24:40 UTC

#161 A Survey of Multimodal Ophthalmic Diagnostics: From Task-Specific Approaches to Foundational Models

Visual impairment represents a major global health challenge, with multimodal imaging providing complementary information that is essential for accurate ophthalmic diagnosis. This comprehensive survey systematically reviews the latest advances in multimodal deep learning methods in ophthalmology through 2025. The review focuses on two main categories: task-specific multimodal approaches and large-scale multimodal foundation models. Task-specific approaches are designed for particular clinical applications such as lesion detection, disease diagnosis, and image synthesis. These methods utilize a variety of imaging modalities including color fundus photography, optical coherence tomography, and angiography. Foundation models, on the other hand, combine sophisticated vision-language architectures and large language models pretrained on diverse ophthalmic datasets. These models enable robust cross-modal understanding, automated clinical report generation, and decision support. The survey critically examines important datasets, evaluation metrics, and methodological innovations including self-supervised learning, attention-based fusion, and contrastive alignment. It also discusses ongoing challenges such as variability in data, limited annotations, lack of interpretability, and issues with generalizability across different patient populations. Finally, the survey outlines promising future directions that emphasize the use of ultra-widefield imaging and reinforcement learning-based reasoning frameworks to create intelligent, interpretable, and clinically applicable AI systems for ophthalmology.

Subjects: Image and Video Processing, Artificial Intelligence, Computer Vision and Pattern Recognition

Publish: 2025-07-31 10:49:21 UTC

#162 CX-Mind: A Pioneering Multimodal Large Language Model for Interleaved Reasoning in Chest X-ray via Curriculum-Guided Reinforcement Learning

Chest X-ray (CXR) imaging is one of the most widely used diagnostic modalities in clinical practice, encompassing a broad spectrum of diagnostic tasks. Recent advancements have seen the extensive application of reasoning-based multimodal large language models (MLLMs) in medical imaging to enhance diagnostic efficiency and interpretability. However, existing multimodal models predominantly rely on “one-time” diagnostic approaches, lacking verifiable supervision of the reasoning process. This leads to challenges in multi-task CXR diagnosis, including lengthy reasoning, sparse rewards, and frequent hallucinations. To address these issues, we propose CX-Mind, the first generative model to achieve interleaved “think-answer” reasoning for CXR tasks, driven by curriculum-based reinforcement learning and verifiable process rewards (CuRL-VPR). Specifically, we constructed an instruction-tuning dataset, CX-Set, comprising 708,473 images and 2,619,148 samples, and generated 42,828 high-quality interleaved reasoning data points supervised by clinical reports. Optimization was conducted in two stages under the Group Relative Policy Optimization framework: first stabilizing basic reasoning with closed-domain tasks, then transferring to open-domain diagnostics, incorporating rule-based conditional process rewards to bypass the need for pretrained reward models. Extensive experimental results demonstrate that CX-Mind significantly outperforms existing medical and general-domain MLLMs in visual understanding, text generation, and spatiotemporal alignment, achieving an average performance improvement of 25.1% over comparable CXR-specific models. On a real-world clinical dataset (Rui-CXR), CX-Mind achieves a mean recall@1 across 14 diseases that substantially surpasses the second-best results, with multi-center expert evaluations further confirming its clinical utility across multiple dimensions.
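The abstract names Group Relative Policy Optimization (GRPO) without restating how it scores responses. The core idea, as commonly described, is to normalize each sampled response's reward against the other responses drawn for the same prompt, so no learned value model is needed. The sketch below shows only that normalization step; the function name, the use of the population standard deviation, and the `eps` constant are assumptions, and the paper's own reward design (rule-based conditional process rewards) is not reproduced here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards of responses sampled for the same prompt.

    Returns (r_i - mean) / (std + eps) for each reward r_i in the group.
    Population standard deviation is an assumption; implementations differ.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one prompt, scored 1.0 (correct) or 0.0 (wrong).
group_adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every answer gets the same reward carries no learning signal.
flat_adv = group_relative_advantages([0.5, 0.5, 0.5])
```

The second call illustrates the sparse-reward problem the abstract mentions: when all samples in a group score the same, every advantage is zero, which is one motivation for adding denser process-level rewards.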

Publish: 2025-07-31 05:07:18 UTC

#163 Multimodal Video Emotion Recognition with Reliable Reasoning Priors

Authors: [Zhepeng Wang](https://arxiv.org/search/?searchtype=author&query=Zhepeng Wang), [Yingjian Zhu](https://arxiv.org/search/?searchtype=author&query=Yingjian Zhu), [Guanghao Dong](https://arxiv.org/search/?searchtype=author&query=Guanghao Dong), [Hongzhu Yi](https://arxiv.org/search/?searchtype=author&query=Hongzhu Yi), [Feng Chen](https://arxiv.org/search/?searchtype=author&query=Feng Chen), [Xinming Wang](https://arxiv.org/search/?searchtype=author&query=Xinming Wang), [Jun Xie](https://arxiv.org/search/?searchtype=author&query=Jun Xie)

This study investigates the integration of trustworthy prior reasoning knowledge from MLLMs into multimodal emotion recognition. We employ Gemini to generate fine-grained, modality-separable reasoning traces, which are injected as priors during the fusion stage to enrich cross-modal interactions. To mitigate the pronounced class imbalance in multimodal emotion recognition, we introduce Balanced Dual-Contrastive Learning, a loss formulation that jointly balances inter-class and intra-class distributions. Applied to the MER2024 benchmark, our prior-enhanced framework yields substantial performance gains, demonstrating that the reliability of MLLM-derived reasoning can be synergistically combined with the domain adaptability of lightweight fusion networks for robust, scalable emotion recognition.

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-07-29 15:55:23 UTC

#164 Intent Aware Context Retrieval for Multi-Turn Agricultural Question Answering

Subjects: Computation and Language, Artificial Intelligence

Publish: 2025-07-28 09:00:44 UTC

#165 Health Insurance Coverage Rule Interpretation Corpus: Law, Policy, and Medical Guidance for Health Insurance Coverage Understanding

Author: [Mike Gartner](https://arxiv.org/search/?searchtype=author&query=Mike Gartner)

Subjects: Computers and Society, Artificial Intelligence, Computation and Language, Machine Learning

Publish: 2025-07-28 00:22:03 UTC

#166 Detection of Autonomic Dysreflexia in Individuals With Spinal Cord Injury Using Multimodal Wearable Sensors

Autonomic Dysreflexia (AD) is a potentially life-threatening condition characterized by sudden, severe blood pressure (BP) spikes in individuals with spinal cord injury (SCI). Early, accurate detection is essential to prevent cardiovascular complications, yet current monitoring methods are either invasive or rely on subjective symptom reporting, limiting applicability in daily life. This study presents a non-invasive, explainable machine learning framework for detecting AD using multimodal wearable sensors. Data were collected from 27 individuals with chronic SCI during urodynamic studies, including electrocardiography (ECG), photoplethysmography (PPG), bioimpedance (BioZ), temperature, respiratory rate (RR), and heart rate (HR), across three commercial devices. Objective AD labels were derived from synchronized cuff-based BP measurements. Following signal preprocessing and feature extraction, BorutaSHAP was used for robust feature selection, and SHAP values for explainability. We trained modality- and device-specific weak learners and aggregated them using a stacked ensemble meta-model. Cross-validation was stratified by participants to ensure generalizability. HR- and ECG-derived features were identified as the most informative, particularly those capturing rhythm morphology and variability. The Nearest Centroid ensemble yielded the highest performance (Macro F1 = 0.77 ± 0.03), significantly outperforming baseline models. Among modalities, HR achieved the highest area under the curve (AUC = 0.93), followed by ECG (0.88) and PPG (0.86). RR and temperature features contributed less to overall accuracy, consistent with missing data and low specificity. The model proved robust to sensor dropout and aligned well with clinical AD events. These results represent an important step toward personalized, real-time monitoring for individuals with SCI.
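Two evaluation details in this abstract can be made concrete: splitting folds by participant and scoring with macro F1. The sketch below reads "stratified by participants" as meaning that all samples from one participant fall in the same fold, so no individual appears in both training and test data; that reading, the round-robin assignment, and the function names are assumptions rather than the study's exact protocol.

```python
from collections import defaultdict


def participant_folds(participant_ids, n_folds):
    """Assign every sample of a participant to the same fold.

    Participants are distributed round-robin over folds (an illustrative
    choice). Returns a list of sample-index lists, one per fold.
    """
    unique = sorted(set(participant_ids))
    fold_of = {pid: i % n_folds for i, pid in enumerate(unique)}
    folds = defaultdict(list)
    for idx, pid in enumerate(participant_ids):
        folds[fold_of[pid]].append(idx)
    return [folds[k] for k in range(n_folds)]


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Five samples from three hypothetical participants, split into two folds.
demo_folds = participant_folds(["p1", "p1", "p2", "p3", "p2"], n_folds=2)

# Toy labels: 1 = AD episode, 0 = no episode.
demo_f1 = macro_f1([1, 1, 0, 0], [1, 0, 0, 0])
```

Macro F1 weights the rare AD class as heavily as the common non-AD class, which makes it a stricter summary than plain accuracy when episodes are infrequent.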

Subjects: Signal Processing, Artificial Intelligence, Human-Computer Interaction, Machine Learning

Publish: 2025-07-23 21:18:23 UTC

#167 "Think First, Verify Always": Training Humans to Face AI Risks

Author: [Yuksel Aydin](https://arxiv.org/search/?searchtype=author&query=Yuksel Aydin)

Artificial intelligence enables unprecedented attacks on human cognition, yet cybersecurity remains predominantly device-centric. This paper introduces the “Think First, Verify Always” (TFVA) protocol, which repositions humans as ‘Firewall Zero’, the first line of defense against AI-enabled threats. The protocol is grounded in five operational principles: Awareness, Integrity, Judgment, Ethical Responsibility, and Transparency (AIJET). A randomized controlled trial (n=151) demonstrated that a minimal 3-minute intervention produced statistically significant improvements in cognitive security task performance, with participants showing an absolute gain of +7.87% compared to controls. These results suggest that brief, principles-based training can rapidly enhance human resilience against AI-driven cognitive manipulation. We recommend that GenAI platforms embed “Think First, Verify Always” as a standard prompt, replacing passive warnings with actionable protocols to enhance trustworthy and ethical AI use. By bridging the gap between technical cybersecurity and human factors, the TFVA protocol establishes human-empowered security as a vital component of trustworthy AI systems.

Subjects: Human-Computer Interaction, Artificial Intelligence, Cryptography and Security, Computers and Society

Publish: 2025-07-23 19:59:08 UTC

Publish: 2025-07-22 14:48:42 UTC

#169 Controllable Surface Diffusion Generative Model for Neurodevelopmental Trajectories

Authors: [Zhenshan Xie](https://arxiv.org/search/?searchtype=author&query=Zhenshan Xie), [Levente Baljer](https://arxiv.org/search/?searchtype=author&query=Levente Baljer), [M. Jorge Cardoso](https://arxiv.org/search/?searchtype=author&query=M. Jorge Cardoso), [Emma Robinson](https://arxiv.org/search/?searchtype=author&query=Emma Robinson)

Preterm birth disrupts the typical trajectory of cortical neurodevelopment, increasing the risk of cognitive and behavioral difficulties. However, outcomes vary widely, posing a significant challenge for early prediction. To address this, individualized simulation offers a promising solution by modeling subject-specific neurodevelopmental trajectories, enabling the identification of subtle deviations from normative patterns that might act as biomarkers of risk. While generative models have shown potential for simulating neurodevelopment, prior approaches often struggle to preserve subject-specific cortical folding patterns or to reproduce region-specific morphological variations. In this paper, we present a novel graph-diffusion network that supports controllable simulation of cortical maturation. Using cortical surface data from the developing Human Connectome Project (dHCP), we demonstrate that the model maintains subject-specific cortical morphology while modeling cortical maturation sufficiently well to fool an independently trained age regression network, achieving a prediction accuracy of 0.85±0.62.

Subjects: Neurons and Cognition, Artificial Intelligence

Publish: 2025-07-21 09:16:24 UTC

#170 Privacy Risks of LLM-Empowered Recommender Systems: An Inversion Attack Perspective

Authors: [Yubo Wang](https://arxiv.org/search/?searchtype=author&query=Yubo Wang), [Min Tang](https://arxiv.org/search/?searchtype=author&query=Min Tang), [Nuo Shen](https://arxiv.org/search/?searchtype=author&query=Nuo Shen), [Shujie Cui](https://arxiv.org/search/?searchtype=author&query=Shujie Cui), [Weiqing Wang](https://arxiv.org/search/?searchtype=author&query=Weiqing Wang)

The large language model (LLM)-powered recommendation paradigm has been proposed to address the limitations of traditional recommender systems, which often struggle to handle cold-start users or items with new IDs. Despite its effectiveness, this study uncovers that LLM-empowered recommender systems are vulnerable to reconstruction attacks that can expose both system and user privacy. To examine this threat, we present the first systematic study on inversion attacks targeting LLM-empowered recommender systems, where adversaries attempt to reconstruct original prompts that contain personal preferences, interaction histories, and demographic attributes by exploiting the output logits of recommendation models. We reproduce the vec2text framework and optimize it using our proposed method called Similarity Guided Refinement, enabling more accurate reconstruction of textual prompts from model-generated logits. Extensive experiments across two domains (movies and books) and two representative LLM-based recommendation models demonstrate that our method achieves high-fidelity reconstructions. Specifically, we can recover nearly 65 percent of the user-interacted items and correctly infer age and gender in 87 percent of the cases. The experiments also reveal that privacy leakage is largely insensitive to the victim model’s performance but highly dependent on domain consistency and prompt complexity. These findings expose critical privacy vulnerabilities in LLM-empowered recommender systems.

Subjects: Information Retrieval, Artificial Intelligence

Publish: 2025-07-20 05:03:02 UTC

#171 MagicGUI: A Foundational Mobile GUI Agent with Scalable Data Pipeline and Reinforcement Fine-tuning

This paper presents MagicGUI, a foundational mobile GUI agent designed to address critical challenges in perception, grounding, and reasoning within real-world mobile GUI environments. The framework is underpinned by the following six key components: (1) a comprehensive and accurate dataset, constructed via the scalable GUI Data Pipeline, which aggregates the largest and most diverse GUI-centric multimodal data to date from open-source repositories, automated crawling, and targeted manual annotation; (2) enhanced perception and grounding capabilities, facilitating fine-grained multimodal alignment for UI element referencing, grounding, and screen comprehension; (3) a comprehensive and unified action space, encompassing both fundamental UI operations and complex interactive intents to support human-agent interactions; (4) planning-oriented reasoning mechanisms that enable the model to decompose complex user instructions into sequential actions with explicit intermediate meta-plan reasoning; (5) an iterative two-stage training procedure, combining large-scale continued pre-training on 7.8M samples with reinforcement fine-tuning utilizing a spatially enhanced composite reward and dual filtering strategy; and (6) competitive performance on both the proprietary Magic-RICH benchmark and over a dozen public benchmarks, achieving superior performance across GUI perception and agent tasks, while demonstrating robust generalization and real-world deployment potential in practical mobile GUI scenarios, as detailed in Figure 1.

Subjects: Human-Computer Interaction, Artificial Intelligence

Publish: 2025-07-19 12:33:43 UTC

#172 PLA: Prompt Learning Attack against Text-to-Image Generative Models

Authors: [Xinqi Lyu](https://arxiv.org/search/?searchtype=author&query=Xinqi Lyu), [Yihao Liu](https://arxiv.org/search/?searchtype=author&query=Yihao Liu), [Yanjie Li](https://arxiv.org/search/?searchtype=author&query=Yanjie Li), [Bin Xiao](https://arxiv.org/search/?searchtype=author&query=Bin Xiao)

Text-to-Image (T2I) models have gained widespread adoption across various applications. Despite this success, the potential misuse of T2I models poses significant risks of generating Not-Safe-For-Work (NSFW) content. To investigate the vulnerability of T2I models, this paper delves into adversarial attacks that bypass the safety mechanisms under black-box settings. Most previous methods rely on word substitution to search for adversarial prompts. Due to the limited search space, this leads to suboptimal performance compared to gradient-based training. However, black-box settings present unique challenges to training gradient-driven attack methods, since there is no access to the internal architecture and parameters of T2I models. To facilitate the learning of adversarial prompts in black-box settings, we propose a novel prompt learning attack framework (PLA), where insightful gradient-based training tailored to black-box T2I models is designed by utilizing multimodal similarities. Experiments show that our new method can effectively attack the safety mechanisms of black-box T2I models, including prompt filters and post-hoc safety checkers, with a high success rate compared to state-of-the-art methods. Warning: This paper may contain offensive model-generated content.

Subjects: Cryptography and Security, Artificial Intelligence, Computer Vision and Pattern Recognition

Publish: 2025-07-14 11:57:16 UTC

#173 Large AI Models for Wireless Physical Layer

Authors: [Jiajia Guo](https://arxiv.org/search/?searchtype=author&query=Jiajia Guo), [Yiming Cui](https://arxiv.org/search/?searchtype=author&query=Yiming Cui), [Shi Jin](https://arxiv.org/search/?searchtype=author&query=Shi Jin), [Jun Zhang](https://arxiv.org/search/?searchtype=author&query=Jun Zhang)

Large artificial intelligence models (LAMs) are transforming wireless physical layer technologies through their robust generalization, multitask processing, and multimodal capabilities. This article reviews recent advancements in LAM applications for physical layer communications, addressing limitations of conventional AI-based approaches. LAM applications are classified into two strategies: leveraging pre-trained LAMs and developing native LAMs designed specifically for physical layer tasks. The motivations and key frameworks of these approaches are comprehensively examined through multiple use cases. Both strategies significantly improve performance and adaptability across diverse wireless scenarios. Future research directions, including efficient architectures, interpretability, standardized datasets, and collaboration between large and small models, are proposed to advance LAM-based physical layer solutions for next-generation communication systems.

Subject: Information Theory

Publish: 2025-08-04 11:30:33 UTC

#174 ForestFormer3D: A Unified Framework for End-to-End Segmentation of Forest LiDAR 3D Point Clouds

Authors: [Binbin Xiang](https://arxiv.org/search/?searchtype=author&query=Binbin Xiang), [Maciej Wielgosz](https://arxiv.org/search/?searchtype=author&query=Maciej Wielgosz), [Stefano Puliti](https://arxiv.org/search/?searchtype=author&query=Stefano Puliti), [Kamil Král](https://arxiv.org/search/?searchtype=author&query=Kamil Král), [Martin Krůček](https://arxiv.org/search/?searchtype=author&query=Martin Krůček), [Azim Missarov](https://arxiv.org/search/?searchtype=author&query=Azim Missarov), [Rasmus Astrup](https://arxiv.org/search/?searchtype=author&query=Rasmus Astrup)

The segmentation of forest LiDAR 3D point clouds, including both individual tree and semantic segmentation, is fundamental for advancing forest management and ecological research. However, current approaches often struggle with the complexity and variability of natural forest environments. We present ForestFormer3D, a new unified and end-to-end framework designed for precise individual tree and semantic segmentation. ForestFormer3D incorporates ISA-guided query point selection, a score-based block merging strategy during inference, and a one-to-many association mechanism for effective training. By combining these new components, our model achieves state-of-the-art performance for individual tree segmentation on the newly introduced FOR-instanceV2 dataset, which spans diverse forest types and regions. Additionally, ForestFormer3D generalizes well to unseen test sets (Wytham Woods and LAUTx), showcasing its robustness across different forest conditions and sensor modalities. The FOR-instanceV2 dataset and the ForestFormer3D code will be released soon.

Subject: Computer Vision and Pattern Recognition

Publish: 2025-06-20 13:39:27 UTC

#175 Recommendation with Generative Models

Generative models are a class of AI models capable of creating new instances of data by learning and sampling from their statistical distributions. In recent years, these models have gained prominence in machine learning due to the development of approaches such as generative adversarial networks (GANs), variational autoencoders (VAEs), and transformer-based architectures such as GPT. These models have applications across various domains, such as image generation, text synthesis, and music composition. In recommender systems, generative models, referred to as Gen-RecSys, improve the accuracy and diversity of recommendations by generating structured outputs, text-based interactions, and multimedia content. By leveraging these capabilities, Gen-RecSys can produce more personalized, engaging, and dynamic user experiences, expanding the role of AI in eCommerce, media, and beyond. Our book goes beyond existing literature by offering a comprehensive understanding of generative models and their applications, with a special focus on deep generative models (DGMs) and their classification. We introduce a taxonomy that categorizes DGMs into three types: ID-driven models, large language models (LLMs), and multimodal models. Each category addresses unique technical and architectural advancements within its respective research area. This taxonomy allows researchers to easily navigate developments in Gen-RecSys across domains such as conversational AI and multimodal content generation. Additionally, we examine the impact and potential risks of generative models, emphasizing the importance of robust evaluation frameworks.

Subject: Information Retrieval

Publish: 2024-09-18 18:29:15 UTC

#176 Delving Deeper Into Astromorphic Transformers

Preliminary attempts at incorporating the critical role of astrocytes - cells that constitute more than 50% of human brain cells - in brain-inspired neuromorphic computing remain in their infancy. This paper seeks to delve deeper into various key aspects of neuron-synapse-astrocyte interactions to mimic self-attention mechanisms in Transformers. The cross-layer perspective explored in this work involves bio-plausible modeling of Hebbian and pre-synaptic plasticities in neuron-astrocyte networks, incorporating effects of non-linearities and feedback along with algorithmic formulations to map the neuron-astrocyte computations to the self-attention mechanism, and evaluating the impact of incorporating bio-realistic effects from the machine learning application side. Our analysis on sentiment and image classification tasks on the IMDB and CIFAR10 datasets underscores the importance of constructing Astromorphic Transformers from both accuracy and learning speed improvement perspectives.

Subject: Neural and Evolutionary Computing

Publish: 2023-12-18 04:35:07 UTC

2025-08-07 Research Updates

1. Source Data

1.1 WeChat Official Accounts

1.1.1 量子位 (QbitAI)

1.1.2 机器之心 (Synced)

1.1.3 新智元 (AI Era)

1.1.4 AGI Hunt

1.1.5 Other

1.2 Arxiv

1.2.1 Computation and Language

#17 StepFun-Formalizer: Unlocking the Autoformalization Potential of LLMs through Knowledge-Reasoning Fusion

#18 Evaluating, Synthesizing, and Enhancing for Customer Support Conversation

#19 Dialogue Response Prefetching Based on Semantic Similarity and Prediction Confidence of Language Model

#20 What Do Humans Hear When Interacting? Experiments on Selective Listening for Evaluating ASR of Spoken Dialogue Systems

#21 Why are LLMs' abilities emergent?

#23 AIC CTU@FEVER 8: On-premise fact checking through long context RAG

#24 Chain of Questions: Guiding Multimodal Curiosity in Language Models

#25 GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy

#26 Modelling and Classifying the Components of a Literature Review

#27 Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models

#28 A Few Words Can Distort Graphs: Knowledge Poisoning Attacks on Graph-based Retrieval-Augmented Generation of Large Language Models

#29 ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents

#30 KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs

#31 TalkDep: Clinically Grounded LLM Personas for Conversation-Centric Depression Screening

#32 DP-GPT4MTS: Dual-Prompt Large Language Model for Textual-Numerical Time Series Forecasting

#33 Hierarchical Text Classification Using Black Box Large Language Models

#34 ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments

#35 Reasoning Beyond Labels: Measuring LLM Sentiment in Low-Resource, Culturally Nuanced Contexts

#36 Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models

#37 Characterizing Deep Research: A Benchmark and Formal Definition

#38 Hacking Hallucinations of MLLMs with Causal Sufficiency and Necessity

#39 The State Of TTS: A Case Study with Human Fooling Rates

#40 Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap

#41 Unveiling Over-Memorization in Finetuning LLMs for Reasoning Tasks

#42 GM-PRM: A Generative Multimodal Process Reward Model for Multimodal Mathematical Reasoning

#43 ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"

#44 Efficient Strategy for Improving Large Language Model (LLM) Capabilities

#45 PAIRS: Parametric-Verified Adaptive Information Retrieval and Selection for Efficient RAG

#46 DTPA: Dynamic Token-level Prefix Augmentation for Controllable Text Generation

#47 Large Reasoning Models Are Autonomous Jailbreak Agents

#48 ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents

#49 Step More: Going Beyond Single Backpropagation in Meta Learning Based Model Editing

#50 HarmonyGuard: Toward Safety and Utility in Web Agents via Adaptive Policy Enhancement and Dual-Objective Optimization

#51 Transferring Expert Cognitive Models to Social Robots via Agentic Concept Bottleneck Models

#52 Are Today's LLMs Ready to Explain Well-Being Concepts?

#53 Confidence-Weighted Token Set Cover for Early Hypothesis Pruning in Self-Consistency

#54 Data and AI governance: Promoting equity, ethics, and fairness in large language models

#55 CAP-LLM: Context-Augmented Personalized Large Language Models for News Headline Generation

#56 CoAct-1: Computer-using Agents with Coding as Actions

#57 Sotopia-RL: Reward Design for Social Intelligence

#58 An Entity Linking Agent for Question Answering

#59 Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models

#60 Majority Bit-Aware Watermarking For Large Language Models

#61 AttnTrace: Attention-based Context Traceback for Long-Context LLMs

#62 GanitBench: A bi-lingual benchmark for evaluating mathematical reasoning in Vision Language Models

#63 WINELL: Wikipedia Never-Ending Updating with LLM Agents

#64 Hierarchical Verification of Speculative Beams for Accelerating LLM Inference

#65 Intent Aware Context Retrieval for Multi-Turn Agricultural Question Answering

#66 FeynTune: Large Language Models for High-Energy Theory

#67 How Deep Is Representational Bias in LLMs? The Cases of Caste and Religion

#68 SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience

#69 Query Attribute Modeling: Improving search relevance with Semantic Search and Meta Data Filtering

#70 Position: The Current AI Conference Model is Unsustainable! Diagnosing the Crisis of Centralized AI Conference

#71 Do Recommender Systems Really Leverage Multimodal Content? A Comprehensive Analysis on Multimodal Representations for Recommendation

#72 Analyzing and Mitigating Object Hallucination: A Training Bias Perspective

#73 Causal Reflection with Language Models

#74 OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use

#75 FrEVL: Leveraging Frozen Pretrained Embeddings for Efficient Vision-Language Understanding

#76 Beyond Pixels: Exploring DOM Downsampling for LLM-Based Web Agents

#77 Graph Representation Learning with Massive Unlabeled Data for Rumor Detection

#78 ToxicTAGS: Decoding Toxic Memes with Rich Tag Annotations

#79 Multilingual Source Tracing of Speech Deepfakes: A First Benchmark

#80 COPO: Consistency-Aware Policy Optimization

#81 AgREE: Agentic Reasoning for Knowledge Graph Completion on Emerging Entities

#82 ConvMix: A Mixed-Criteria Data Augmentation Framework for Conversational Dense Retrieval

#83 Accelerating Scientific Discovery with Multi-Document Summarization of Impact-Ranked Papers

#84 ASTRA: Autonomous Spatial-Temporal Red-teaming for AI Software Assistants

#85 MegaWika 2: A More Comprehensive Multilingual Collection of Articles and their Sources

#86 GTPO: Trajectory-Based Policy Optimization in Large Language Models