2025-08-08 2025-08-08 About 57400 words 270 minutes

Contents

#1 H-Net++: Hierarchical Dynamic Chunking for Tokenizer-Free Language Modelling in Morphologically-Rich Languages #1 H-Net++：用于形态丰富语言的无分词器语言建模的分层动态分块
#2 How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations #2 LLMs 如何劝服？线性探针可揭示多轮对话中的劝服动态
#3 Learning to Reason for Factuality #3 学习推理以提升真实性
#4 OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks #4 OmniEAR：在具身任务中基准测试代理推理
#5 Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models #5 Cooper：在用于大型语言模型的强化学习中共同优化策略模型与奖励模型 [PDF 12 ] [Copy] [Kimi 8 ] [REL]
#6 MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy #6 MathSmith：通过使用强化策略构造合成问题迈向极难数学推理
#7 Do Political Opinions Transfer Between Western Languages? An Analysis of Unaligned and Aligned Multilingual LLMs #7 政治观点会在西方语言之间转移吗？对未对齐与已对齐多语种 LLM 的分析
#8 Conformal Sets in Multiple-Choice Question Answering under Black-Box Settings with Provable Coverage Guarantees #8 在黑盒设置下具有可证明覆盖保证的多项选择题回答中的保形集合
#9 CoCoLex: Confidence-guided Copy-based Decoding for Grounded Legal Text Generation #9 CoCoLex：用于有根法律文本生成的基于复制的置信度引导解码
#10 The World According to LLMs: How Geographic Origin Influences LLMs' Entity Deduction Capabilities #10 根据 LLMs 的世界观：地理来源如何影响 LLMs 的实体推断能力
#11 LAG: Logic-Augmented Generation from a Cartesian Perspective #11 LAG：从笛卡尔视角的逻辑增强生成
#12 Rethinking Creativity Evaluation: A Critical Analysis of Existing Creativity Evaluations #12 重新思考创造力评估：对现有创造力评估的批判性分析
#13 TASE: Token Awareness and Structured Evaluation for Multilingual Language Models #13 TASE：面向多语言语言模型的标记感知与结构化评估
#14 LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models #14 LLMEval-3：一项关于大型语言模型稳健性与公平性评估的大规模纵向研究
#15 MyCulture: Exploring Malaysia's Diverse Culture under Low-Resource Language Constraints #15 MyCulture：在低资源语言约束下探索马来西亚多元文化
#16 The TUB Sign Language Corpus Collection #16 TUB 手语语料库集合
#17 Can Language Models Critique Themselves? Investigating Self-Feedback for Retrieval Augmented Generation at BioASQ 2025 #17 语言模型能自我批评吗？在 BioASQ 2025 上调查用于检索增强生成的自我反馈
#18 Evaluation of a Sign Language Avatar on Comprehensibility, User Experience & Acceptability #18 手语化身在可理解性、用户体验与可接受性方面的评估
#19 Efficient Reasoning for Large Reasoning Language Models via Certainty-Guided Reflection Suppression 确定性引导的反思抑制方法
#20 SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens #20 SONAR-LLM：以句子嵌入进行思考、以令牌表达的自回归 Transformer
#21 Decision-Making with Deliberation: Meta-reviewing as a Document-grounded Dialogue #21 带有深思熟虑的决策制定：作为基于文档对话的元评审
#22 ASCoT: An Adaptive Self-Correction Chain-of-Thought Method for Late-Stage Fragility in LLMs #22 ASCoT：一种用于 LLMs 后期脆弱性的自适应自我修正思路链方法
#23 CodeBoost: Boosting Code LLMs by Squeezing Knowledge from Code Snippets with RL #23 CodeBoost: 通过从代码片段中挤出知识并使用强化学习提升代码 LLMs [PDF 5 ] [Copy] [Kimi 1 ] [REL]
#24 Pruning Large Language Models by Identifying and Preserving Functional Networks #24 通过识别并保留功能网络对大型语言模型进行剪枝 [PDF 1 ] [Copy] [Kimi 1 ] [REL]
#25 Resource-Limited Joint Multimodal Sentiment Reasoning and Classification via Chain-of-Thought Enhancement and Distillation #25 资源受限的联合多模态情感推理与分类：通过链式思维增强与蒸馏
#26 ATLANTIS at SemEval-2025 Task 3: Detecting Hallucinated Text Spans in Question Answering #26 ATLANTIS 在 SemEval-2025 任务 3：检测问答中的虚构文本片段 [PDF ] [Copy] [Kimi ] [REL]
#27 Towards Assessing Medical Ethics from Knowledge to Practice #27 从知识到实践的医学伦理评估探索
#28 Attention Basin: Why Contextual Position Matters in Large Language Models #28 注意力盆地：为什么上下文位置在大型语言模型中很重要
#29 BEE-RAG: Balanced Entropy Engineering for Retrieval-Augmented Generation #29 BEE-RAG：检索增强生成的平衡熵工程 [PDF 1 ] [Copy] [Kimi ] [REL]
#30 Multimodal Fact Checking with Unified Visual, Textual, and Contextual Representations #30 使用统一的视觉、文本与上下文表示的多模态事实核查
#31 Align, Don't Divide: Revisiting the LoRA Architecture in Multi-Task Learning #31 对齐而非划分：在多任务学习中重审 LoRA 架构
#32 Evaluation of LLMs in AMR Parsing #32 在 AMR 解析中对 LLMs 的评估
#33 Dialogues Aspect-based Sentiment Quadruple Extraction via Structural Entropy Minimization Partitioning #33 对话面向方面的情感四元组抽取通过结构熵最小化分割
#34 A Multi-Stage Large Language Model Framework for Extracting Suicide-Related Social Determinants of Health #34 一个用于提取与自杀相关的社会健康决定因素的多阶段大型语言模型框架
#35 Towards Robust Evaluation of Visual Activity Recognition: Resolving Verb Ambiguity with Sense Clustering #35 朝着视觉活动识别的鲁棒评估：通过意义聚类解决动词歧义
#36 I Think, Therefore I Am Under-Qualified? A Benchmark for Evaluating Linguistic Shibboleth Detection in LLM Hiring Evaluations #36 我思故我资质不足？用于评估 LLM 招聘评估中语言禁忌检测的基准测试
#37 RCR-Router: Efficient Role-Aware Context Routing for Multi-Agent LLM Systems with Structured Memory #37 RCR-Router：针对具有结构化记忆的多智能体 LLM 系统的高效角色感知上下文路由
#38 Persistent Instability in LLM's Personality Measurements: Effects of Scale, Reasoning, and Conversation History #38 LLM 人格测量的持续不稳定性：规模、推理与对话历史的影响
#39 Pitch Accent Detection improves Pretrained Automatic Speech Recognition #39 音高重音检测提升预训练自动语音识别
#40 Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization #40 偶校验感知的字节对编码：在分词中提升跨语言公平性
#41 Enhancing Dialogue Annotation with Speaker Characteristics Leveraging a Frozen LLM #41 使用冻结的 LLM 通过说话人特征增强对话标注
#42 Test-Time Reinforcement Learning for GUI Grounding via Region Consistency #42 基于区域一致性的 GUI 定位测试时强化学习
#43 Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision #43 Uni-cot：迈向跨文本与视觉的统一链式思维推理
#44 Iterative Learning of Computable Phenotypes for Treatment Resistant Hypertension using Large Language Models #44 使用大型语言模型对难治性高血压可计算表型进行迭代学习以指导治疗
#45 Fairy±i: the First 2-bit Complex LLM with All Parameters in #45Fairy ±i ：首个所有参数均为的 2 位复数 LLM
#46 SPGISpeech 2.0: Transcribed multi-speaker financial audio for speaker-tagged transcription #46 SPGISpeech 2.0：带说话人标注的多说话人金融音频转录 [PDF ] [Copy] [Kimi ] [REL]
#47 Mixed-Initiative Dialog for Human-Robot Collaborative Manipulation #47 混合主动对话用于人机协作操作
#48 MELLA: Bridging Linguistic Capability and Cultural Groundedness for Low-Resource Language MLLMs #48 MELLA：为低资源语言的大型多语言模型架起语言能力与文化扎根性的桥梁
#49 Can Large Language Models Generate Effective Datasets for Emotion Recognition in Conversations? #49 大型语言模型能为对话中的情感识别生成有效的数据集吗？ [PDF ] [Copy] [Kimi ] [REL]
#50 Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance? #50 Bench-2-CoP：我们能信任用于欧盟人工智能合规性的基准测试吗？
#51 A Novel Architecture for Symbolic Reasoning with Decision Trees and LLM Agents #51 一种结合决策树与 LLM 代理的符号推理新架构
#52 Understanding and Mitigating Errors of LLM-Generated RTL Code #52 理解并缓解 LLM 生成的 RTL 代码错误
#53 FAITH: A Framework for Assessing Intrinsic Tabular Hallucinations in finance #53 FAITH：用于评估金融领域表格内在幻觉的框架
#54 QA-Dragon: Query-Aware Dynamic RAG System for Knowledge-Intensive Visual Question Answering #54 QA-Dragon：面向查询感知的动态 RAG 系统，用于知识密集型视觉问答
#55 Posterior-GRPO: Rewarding Reasoning Processes in Code Generation #55 Posterior-GRPO：在代码生成中奖励推理过程
#56 Aligning LLMs on a Budget: Inference-Time Alignment with Heuristic Reward Models #56 在预算内对 LLMs 进行对齐：使用启发式奖励模型进行推理时对齐
#57 Speech LLMs in Low-Resource Scenarios: Data Volume Requirements and the Impact of Pretraining on High-Resource Languages #57 低资源场景下的语音 LLMs：数据量需求及在高资源语言上预训练的影响
#58 Navigating Through Paper Flood: Advancing LLM-based Paper Evaluation through Domain-Aware Retrieval and Latent Reasoning #58 穿越论文洪流：通过面向领域的检索与潜在推理推进基于 LLM 的论文评估
#59 Exploring Superior Function Calls via Reinforcement Learning #59 通过强化学习探索更优的函数调用
#60 JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering #60 JPS：通过协作式视觉扰动和文本引导越狱多模态大语言模型
#61 Cognitive Duality for Adaptive Web Agents #61 适应性网络代理的认知二元性
#62 A Study of the Framework and Real-World Applications of Language Embedding for 3D Scene Understanding #62 《用于三维场景理解的语言嵌入框架及其真实世界应用研究》
#63 Making Prompts First-Class Citizens for Adaptive LLM Pipelines #63 让提示成为自适应 LLM 流水线的一等公民
#64 Can Large Language Models Integrate Spatial Data? Empirical Insights into Reasoning Strengths and Computational Weaknesses #64 大型语言模型能整合空间数据吗？关于推理强项与计算弱点的实证见解
#65 R-Zero: Self-Evolving Reasoning LLM from Zero Data #65 R-Zero：从零数据自我进化推理 LLM
#66 REINA: Regularized Entropy Information-Based Loss for Efficient Simultaneous Speech Translation #66 REINA：基于正则化熵信息的高效同时语音翻译损失函数
#67 ConfAgents: A Conformal-Guided Multi-Agent Framework for Cost-Efficient Medical Diagnosis #67 ConfAgents：一种用于成本高效医疗诊断的受共形引导的多智能体框架
#68 Advancing Hate Speech Detection with Transformers: Insights from the MetaHate #68 使用变换器推进仇恨言论检测：来自 MetaHate 的见解
#69 Fine-Tuning Small Language Models (SLMs) for Autonomous Web-based Geographical Information Systems (AWebGIS) #69 针对自主基于网络的地理信息系统（AWebGIS）微调小型语言模型（SLMs）
#70 Federal Reserve Communication and the COVID-19 Pandemic #70 美联储沟通与 COVID-19 大流行 [PDF ] [Copy] [Kimi ] [REL]
#71 Prescriptive Agents based on Rag for Automated Maintenance (PARAM) #71 基于 RAG 的用于自动化维护的规范性代理（PARAM）

#1 Simulating Human-Like Learning Dynamics with LLM-Empowered Agents #1 使用由 LLM 赋能的代理模拟类人学习动态
#2 The Missing Reward: Active Inference in the Era of Experience #2 缺失的奖励：体验时代的主动推理
#3 MV-Debate: Multi-view Agent Debate with Dynamic Reflection Gating for Multimodal Harmful Content Detection in Social Media #3 MV-Debate：用于社交媒体多模态有害内容检测的多视角智能体辩论与动态反思门控
#4 Streamlining Admission with LOR Insights: AI-Based Leadership Assessment in Online Master's Program #4 通过推荐信洞察简化录取：基于人工智能的在在线硕士项目中的领导力评估
#5 Auto-Eval Judge: Towards a General Agentic Framework for Task Completion Evaluation #5 自动评估评判者：迈向用于任务完成评估的通用智能体框架
#6 GRAIL:Learning to Interact with Large Knowledge Graphs for Retrieval Augmented Reasoning
#7 InfiAlign: A Scalable and Sample-Efficient Framework for Aligning LLMs to Enhance Reasoning Capabilities #7 InfiAlign：一种可扩展且样本高效的框架，用于对齐 LLMs 以增强推理能力
#8 Can Large Language Models Generate Effective Datasets for Emotion Recognition in Conversations? #8 大型语言模型能为对话情感识别生成有效数据集吗？
#9 Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance? #9 Bench-2-CoP：我们能信任用于欧盟人工智能合规性的基准测试吗？
#10 Whose Truth? Pluralistic Geo-Alignment for (Agentic) AI #10 谁的真相？（具代理性的）人工智能的多元地理对齐
#11 Large Language Models Transform Organic Synthesis From Reaction Prediction to Automation #11 大型语言模型将有机合成从反应预测转向自动化 [PDF 1 ] [Copy] [Kimi 1 ] [REL]
#12 DeepPHY: Benchmarking Agentic VLMs on Physical Reasoning #12 DeepPHY：在物理推理上对具代理性的视觉语言模型进行基准测试
#13 An Explainable Machine Learning Framework for Railway Predictive Maintenance using Data Streams from the Metro Operator of Portugal #13 一个用于铁路预测性维护的可解释机器学习框架，基于葡萄牙地铁运营商的数据流
#14 StructVRM: Aligning Multimodal Reasoning with Structured and Verifiable Reward Models #14 StructVRM：用结构化且可验证的奖励模型对齐多模态推理
#15 Minimal Model Reasoning in Description Logics: Don't Try This at Home! #15 描述逻辑中的最小模型推理：别在家尝试！
#16 NomicLaw: Emergent Trust and Strategic Argumentation in LLMs During Collaborative Law-Making #16 NomicLaw：在协作立法过程中，LLMs 的信任涌现与策略性论证
#17 The Term 'Agent' Has Been Diluted Beyond Utility and Requires Redefinition #17 “智能体”一词已被稀释至无实用价值，需重新定义
#18 A Novel Architecture for Symbolic Reasoning with Decision Trees and LLM Agents #18 一种用于符号推理的新型架构，结合决策树与 LLM 代理
#19 An Explainable Natural Language Framework for Identifying and Notifying Target Audiences In Enterprise Communication #19 一个可解释的自然语言框架，用于在企业沟通中识别并通知目标受众
#20 QA-Dragon: Query-Aware Dynamic RAG System for Knowledge-Intensive Visual Question Answering #20 QA-Dragon：面向查询感知的动态 RAG 系统，用于知识密集型视觉问答 [PDF 2 ] [Copy] [Kimi ] [REL]
#21 Graph-based Event Log Repair #21 基于图的事件日志修复
#22 Beyond Automation: Socratic AI, Epistemic Agency, and the Implications of the Emergence of Orchestrated Multi-Agent Learning Architectures #22 超越自动化：苏格拉底式人工智能、认知主体性以及编排式多智能体学习架构出现的影响
#23 EasySize: Elastic Analog Circuit Sizing via LLM-Guided Heuristic Search #23 EasySize：通过 LLM 引导启发式搜索实现弹性模拟电路尺寸调整
#24 MedMKEB: A Comprehensive Knowledge Editing Benchmark for Medical Multimodal Large Language Models #24 MedMKEB：用于医学多模态大语言模型的综合知识编辑基准
#25 Cognitive Duality for Adaptive Web Agents #25 认知二元性用于自适应网络代理 [PDF ] [Copy] [Kimi ] [REL]
#26 Can Large Language Models Integrate Spatial Data? Empirical Insights into Reasoning Strengths and Computational Weaknesses #26 大型语言模型能整合空间数据吗？关于推理优势与计算弱点的实证见解
#27 The Docking Game: Loop Self-Play for Fast, Dynamic, and Accurate Prediction of Flexible Protein–Ligand Binding #27 对接游戏：用于快速、动态且准确预测柔性蛋白—配体结合的循环自博弈
#28 ConfAgents: A Conformal-Guided Multi-Agent Framework for Cost-Efficient Medical Diagnosis #28 ConfAgents：一种用于成本高效医疗诊断的保形引导多智能体框架
#29 Large Language Models Reasoning Abilities Under Non-Ideal Conditions After RL-Fine-Tuning #29 在非理想条件下经过强化学习微调后大型语言模型的推理能力
#30 Fine-Tuning Small Language Models (SLMs) for Autonomous Web-based Geographical Information Systems (AWebGIS) #30 针对自主基于网络的地理信息系统（AWebGIS）微调小型语言模型（SLMs）
#31 Who is a Better Player: LLM against LLM #31 谁是更好的玩家：LLM 对 LLM
#32 GeoFlow: Agentic Workflow Automation for Geospatial Tasks #32 GeoFlow：用于地理空间任务的自治式工作流自动化
#33 Prescriptive Agents based on Rag for Automated Maintenance (PARAM) #33 基于 Rag 的用于自动化维护的规范性代理（PARAM）
#34 Towards Generalizable Safety in Crowd Navigation via Conformal Uncertainty Handling #34 通过保形不确定性处理迈向可泛化的人群导航安全
#35 KuaiLive: A Real-time Interactive Dataset for Live Streaming Recommendation
#36 H-Net++: Hierarchical Dynamic Chunking for Tokenizer-Free Language Modelling in Morphologically-Rich Languages #36 H-Net++：用于形态丰富语言的无分词器语言建模的分层动态分块方法
#37 How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations #37 LLMs 如何说服？线性探针可以揭示多轮对话中的说服动态 [PDF ] [Copy] [Kimi ] [REL]
#38 TrajEvo: Trajectory Prediction Heuristics Design via LLM-driven Evolution #38 TrajEvo：通过 LLM 驱动的进化设计轨迹预测启发式
#39 Test-Time Reinforcement Learning for GUI Grounding via Region Consistency #39 通过区域一致性的 GUI 定位的测试时强化学习
#40 OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks #40 OmniEAR：在具身任务中对智能体推理进行基准测试
#41 Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models #41 Cooper：在用于大型语言模型的强化学习中共同优化策略与奖励模型
#42 Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle #42 Shuffle-R1：通过以数据为中心的动态洗牌实现多模态大型语言模型高效强化学习框架
#43 Iterative Learning of Computable Phenotypes for Treatment Resistant Hypertension using Large Language Models #43 使用大型语言模型对难治性高血压的可计算表型进行迭代学习
#44 Adapting Vision-Language Models Without Labels: A Comprehensive Survey #44 在无需标签的情况下适配视觉-语言模型：一篇综合性综述
#45 Conformal Sets in Multiple-Choice Question Answering under Black-Box Settings with Provable Coverage Guarantees #45 在黑盒设置下具有可证明覆盖保证的多项选择题回答中的保形集 [PDF ] [Copy] [Kimi 3 ] [REL]
#46 Tractable Sharpness-Aware Learning of Probabilistic Circuits #46 可处理的面向锐化的概率电路学习
#47 The World According to LLMs: How Geographic Origin Influences LLMs' Entity Deduction Capabilities #47 根据 LLMs 的世界：地理起源如何影响 LLMs 的实体推断能力
#48 LAG: Logic-Augmented Generation from a Cartesian Perspective #48 LAG：从笛卡尔视角的逻辑增强生成
#49 MoMA: A Mixture-of-Multimodal-Agents Architecture for Enhancing Clinical Prediction Modelling #49 MoMA：一种用于增强临床预测建模的多模态智能体混合架构
#50 Embedding Alignment in Code Generation for Audio #50 在音频代码生成中的嵌入对齐
#51 Task complexity shapes internal representations and robustness in neural networks #51 任务复杂性塑造神经网络的内部表征和鲁棒性 [PDF ] [Copy] [Kimi ] [REL]
#52 EnergyPatchTST: Multi-scale Time Series Transformers with Uncertainty Estimation for Energy Forecasting #52 EnergyPatchTST：用于能源预测并带有不确定性估计的多尺度时间序列变换器
#53 Tail-Risk-Safe Monte Carlo Tree Search under PAC-Level Guarantees #53 在 PAC 水平保证下的尾部风险安全蒙特卡洛树搜索
#54 Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions #54 用加权班扎夫相互作用解释视觉-语言编码器中的相似性 [PDF 2 ] [Copy] [Kimi 1 ] [REL]
#55 MyCulture: Exploring Malaysia's Diverse Culture under Low-Resource Language Constraints #55 MyCulture：在低资源语言限制下探索马来西亚多元文化
#56 LLM-based Multi-Agent Copilot for Quantum Sensor #56 基于 LLM 的量子传感器多智能体助手
#57 UNCAGE: Contrastive Attention Guidance for Masked Generative Transformers in Text-to-Image Generation #57 UNCAGE：用于文本到图像生成中掩码生成式变换器的对比注意力引导
#58 Real-Time Iteration Scheme for Diffusion Policy #58 实时迭代方案用于扩散策略
#59 Echo: Decoupling Inference and Training for Large-Scale RL Alignment on Heterogeneous Swarms #59 Echo：在异构蜂群上为大规模 RL 对齐解耦推理与训练
#60 Optimal Corpus Aware Training for Neural Machine Translation #60 面向语料库的最优训练用于神经机器翻译
#61 Building Effective Safety Guardrails in AI Education Tools #61 在人工智能教育工具中构建有效的安全护栏
#62 PriorRG: Prior-Guided Contrastive Pre-training and Coarse-to-Fine Decoding for Chest X-ray Report Generation #62 PriorRG：用于胸片报告生成的先验引导对比预训练与粗到细解码 [PDF 2 ] [复制] [Kimi ] [关联]
#63 Multi-Modal Multi-Behavior Sequential Recommendation with Conditional Diffusion-Based Feature Denoising #63 多模态多行为序列推荐与基于条件扩散的特征去噪
#64 Information-Theoretic Graph Fusion with Vision-Language-Action Model for Policy Reasoning and Dual Robotic Control #64 基于信息论的图融合与视觉-语言-动作模型用于策略推理与双机器人控制 [PDF 2 ] [Copy] [Kimi ] [REL]
#65 Efficient Reasoning for Large Reasoning Language Models via Certainty-Guided Reflection Suppression #65 通过确定性引导的反思抑制实现对大型推理语言模型的高效推理
#66 mKG-RAG: Multimodal Knowledge Graph-Enhanced RAG for Visual Question Answering #66 mKG-RAG：用于视觉问答的多模态知识图增强 RAG [PDF ] [Copy] [Kimi ] [REL]
#67 ASkDAgger: Active Skill-level Data Aggregation for Interactive Imitation Learning #67 ASkDAgger：用于交互式模仿学习的主动技能级数据聚合
#68 Estimating Musical Surprisal from Audio in Autoregressive Diffusion Model Noise Spaces #68 从自回归扩散模型噪声空间中估计音频的音乐意外性
#69 VS-LLM: Visual-Semantic Depression Assessment based on LLM for Drawing Projection Test #69 VS-LLM：基于 LLM 的绘画投射测验视觉语义抑郁评估
#70 Towards Embodied Agentic AI: Review and Classification of LLM- and VLM-Driven Robot Autonomy and Interaction #70 面向具身代理智能的进展：基于 LLM 和 VLM 驱动的机器人自主性与交互的综述与分类
#71 FlowState: Sampling Rate Invariant Time Series Forecasting #71 FlowState：采样率不变的时间序列预测
#72 SGDFuse: SAM-Guided Diffusion for High-Fidelity Infrared and Visible Image Fusion #72 SGDFuse：基于 SAM 引导扩散的高保真红外与可见光图像融合
#73 Robust Tracking with Particle Filtering for Fluorescent Cardiac Imaging #73 使用粒子滤波的稳健追踪用于荧光心脏成像
#74 Marine Chlorophyll Prediction and Driver Analysis based on LSTM-RF Hybrid Models #74 基于 LSTM-RF 混合模型的海洋叶绿素预测与驱动因素分析
#75 CF3: Compact and Fast 3D Feature Fields
#76 A Study of Gender Classification Techniques Based on Iris Images: A Deep Survey and Analysis #76 基于虹膜图像的性别分类技术研究：深度综述与分析
#77 RegionMed-CLIP: A Region-Aware Multimodal Contrastive Learning Pre-trained Model for Medical Image Understanding
#78 Coarse-to-Fine Joint Registration of MR and Ultrasound Images via Imaging Style Transfer #78 通过影像风格迁移实现的 MR 与超声图像粗到细联合配准
#79 Pruning Large Language Models by Identifying and Preserving Functional Networks #79 通过识别并保留功能网络来剪枝大型语言模型
#80 Driver Assistant: Persuading Drivers to Adjust Secondary Tasks Using Large Language Models #80 驾驶员助手：使用大型语言模型劝导驾驶员调整次要任务
#81 Navigating the Trade-off: A Synthesis of Defensive Strategies for Zero-Shot Adversarial Robustness in Vision-Language Models #81 在权衡中航行：面向视觉-语言模型零样本对抗鲁棒性的防御策略综述
#82 Resource-Limited Joint Multimodal Sentiment Reasoning and Classification via Chain-of-Thought Enhancement and Distillation #82 在资源受限情况下通过链式思维增强与蒸馏进行联合多模态情感推理与分类
#83 FDC-Net: Rethinking the association between EEG artifact removal and multi-dimensional affective computing #83 FDC-Net：重新思考 EEG 伪迹去除与多维情感计算之间的关联 [PDF ] [Copy] [Kimi ] [REL]
#84 ADSEL: Adaptive dual self-expression learning for EEG feature selection via incomplete multi-dimensional emotional tagging
#85 CWEFS: Brain volume conduction effects inspired channel-wise EEG feature selection for multi-dimensional emotion recognition #85 CWEFS：受脑体积导传效应启发的逐通道脑电特征选择用于多维情感识别
#86 ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking #86 ReasoningTrack：用于长期视觉-语言跟踪的链式思维推理
#87 Advanced Hybrid Transformer LSTM Technique with Attention and TS Mixer for Drilling Rate of Penetration Prediction #87 具有注意力机制和 TS 混合器的高级混合 Transformer LSTM 技术用于钻速（ROP）预测
#88 SpectroStream: A Versatile Neural Codec for General Audio #88 SpectroStream：用于通用音频的多功能神经编解码器
#89 FAITH: A Framework for Assessing Intrinsic Tabular Hallucinations in finance #89 FAITH：用于评估金融领域内在表格幻觉的框架
#90 EvoGraph: Hybrid Directed Graph Evolution toward Software 3.0 #90 EvoGraph：面向软件 3.0 的混合有向图演化
#91 Balancing Accuracy and Novelty with Sub-Item Popularity #91 在子项目流行度下平衡准确性与新颖性
#92 Incident Response Planning Using a Lightweight Large Language Model with Reduced Hallucination #92 使用减少幻觉的轻量级大型语言模型进行事件响应规划
#93 Refining Gaussian Splatting: A Volumetric Densification Approach #93 精炼高斯点描：一种体积致密化方法
#94 Posterior-GRPO: Rewarding Reasoning Processes in Code Generation #94 Posterior-GRPO：在代码生成中奖励推理过程
#95 Aligning LLMs on a Budget: Inference-Time Alignment with Heuristic Reward Models #95 在预算内对齐 LLMs：使用启发式奖励模型的推理时对齐 [PDF 4 ] [Copy] [Kimi 1 ] [REL]
#96 Domain-driven Metrics for Reinforcement Learning: A Case Study on Epidemic Control using Agent-based Simulation #96 基于领域的强化学习评估指标：使用基于主体的模拟进行流行病控制的案例研究
#97 FCBV-Net: Category-Level Robotic Garment Smoothing via Feature-Conditioned Bimanual Value Prediction #97 FCBV-Net：通过特征条件化的双手价值预测实现的类别级机器人服装平整
#98 Tool Graph Retriever: Exploring Dependency Graph-based Tool Retrieval for Large Language Models
#99 Speech LLMs in Low-Resource Scenarios: Data Volume Requirements and the Impact of Pretraining on High-Resource Languages #99 在低资源场景中的语音 LLMs：数据量需求及在高资源语言上预训练的影响
#100 Chemist Eye: A Visual Language Model-Powered System for Safety Monitoring and Robot Decision-Making in Self-Driving Laboratories #100 Chemist Eye：一种由视觉语言模型驱动的自驱动实验室安全监控与机器人决策系统
#101 FedGIN: Federated Learning with Dynamic Global Intensity Non-linear Augmentation for Organ Segmentation using Multi-modal Images #101 FedGIN：用于多模态影像器官分割的联邦学习与动态全局强度非线性增强 [PDF ] [Copy] [Kimi ] [REL]
#102 Towards Assessing Medical Ethics from Knowledge to Practice #102 从知识到实践：评估医学伦理的进展
#103 Attention Basin: Why Contextual Position Matters in Large Language Models #103 Attention Basin：为什么上下文位置在大型语言模型中很重要
#104 Latent Expression Generation for Referring Image Segmentation and Grounding #104 用于指代表达图像分割与定位的潜在表达生成
#105 Exploring Superior Function Calls via Reinforcement Learning #105 通过强化学习探索更优的函数调用
#106 Fairness in Dysarthric Speech Synthesis: Understanding Intrinsic Bias in Dysarthric Speech Cloning using F5-TTS #106 在构音障碍语音合成中的公平性：使用 F5-TTS 理解构音障碍语音克隆的内在偏差
#107 Integrated Influence: Data Attribution with Baseline #107 Integrated Influence：基线下的数据归因
#108 JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering #108 JPS：通过协同视觉扰动和文本引导对多模态大语言模型进行越狱
#109 Align, Don't Divide: Revisiting the LoRA Architecture in Multi-Task Learning
#110 Align-for-Fusion: Harmonizing Triple Preferences via Dual-oriented Diffusion for Cross-domain Sequential Recommendation
#111 Automatic Image Colorization with Convolutional Neural Networks and Generative Adversarial Networks #111 使用卷积神经网络与生成对抗网络的自动图像着色
#112 Learning from Oblivion: Predicting Knowledge Overflowed Weights via Retrodiction of Forgetting #112 从遗忘中学习：通过对遗忘的反向预测来预测溢出知识的权重
#113 Human-AI Schema Discovery and Application for Creative Problem Solving #113 人机协同图式发现与应用以支持创造性问题解决
#114 Evaluation of LLMs in AMR Parsing
#115 Dialogues Aspect-based Sentiment Quadruple Extraction via Structural Entropy Minimization Partitioning #115 对话基于方面的情感四元组抽取通过结构熵最小化划分
#116 Skin-SOAP: A Weakly Supervised Framework for Generating Structured SOAP Notes #116 Skin-SOAP：一种用于生成结构化 SOAP 记录的弱监督框架
#117 SPaRFT: Self-Paced Reinforcement Fine-Tuning for Large Language Models #117 SPaRFT：面向大型语言模型的自定节奏强化微调
#118 Making Prompts First-Class Citizens for Adaptive LLM Pipelines #118 让提示成为自适应 LLM 管道的一等公民
#119 Towards Hallucination-Free Music: A Reinforcement Learning Preference Optimization Framework for Reliable Song Generation #119 朝向无幻觉音乐：用于可靠歌曲生成的强化学习偏好优化框架
#120 R-Zero: Self-Evolving Reasoning LLM from Zero Data #120 R-Zero：从零数据自我进化推理 LLM
#121 A Multi-Stage Large Language Model Framework for Extracting Suicide-Related Social Determinants of Health #121 一种用于提取与自杀相关的社会健康决定因素的多阶段 LLM 框架
#122 AgenticData: An Agentic Data Analytics System for Heterogeneous Data #122 AgenticData：一种用于异构数据的主动式数据分析系统
#123 Situated Epistemic Infrastructures: A Diagnostic Framework for Post-Coherence Knowledge #123 情景化的认知基础设施：后连贯性知识的诊断框架
#124 Hierarchical Deep Deterministic Policy Gradient for Autonomous Maze Navigation of Mobile Robots
#125 UGOD: Uncertainty-Guided Differentiable Opacity and Soft Dropout for Enhanced Sparse-View 3DGS #125 UGOD：基于不确定性的可微不透明度与软丢弃以增强稀疏视角 3DGS
#126 MENDR: Manifold Explainable Neural Data Representations
#127 AdvDINO: Domain-Adversarial Self-Supervised Representation Learning for Spatial Proteomics #127 AdvDINO：用于空间蛋白质组学的域对抗自监督表征学习
#128 Tesserae: Scalable Placement Policies for Deep Learning Workloads #128 Tesserae：用于深度学习工作负载的可扩展部署策略
#129 Towards Robust Evaluation of Visual Activity Recognition: Resolving Verb Ambiguity with Sense Clustering #129 面向视觉活动识别的稳健评估：通过意义聚类解决动词歧义
#130 TRKT: Weakly Supervised Dynamic Scene Graph Generation with Temporal-enhanced Relation-aware Knowledge Transferring #130 TRKT：弱监督动态场景图生成与时序增强关系感知知识迁移 [PDF ] [Copy] [Kimi ] [REL]
#131 INTENTION: Inferring Tendencies of Humanoid Robot Motion Through Interactive Intuition and Grounded VLM #131 意向：通过交互直觉与基于视觉语言模型的实证方法推断类人机器人运动倾向
#132 Extending Foundational Monocular Depth Estimators to Fisheye Cameras with Calibration Tokens #132 将基础单目深度估计器扩展到带有校准 token 的鱼眼相机 [PDF ] [副本] [Kimi ] [REL]
#133 Taxonomy of Faults in Attention-Based Neural Networks #133 基于注意力的神经网络故障分类法
#134 RCR-Router: Efficient Role-Aware Context Routing for Multi-Agent LLM Systems with Structured Memory #134 RCR-Router：具有结构化记忆的多智能体 LLM 系统中高效的角色感知上下文路由
#135 Revealing Temporal Label Noise in Multimodal Hateful Video Classification #135 揭示多模态仇恨视频分类中的时间标签噪声
#136 Adversarial Attacks and Defenses on Graph-aware Large Language Models (LLMs) #136 面向图意识大型语言模型（LLMs）的对抗攻击与防御
#137 Leveraging Deep Learning for Physical Model Bias of Global Air Quality Estimates #137 利用深度学习纠正全球空气质量估算的物理模型偏差
#138 Uncertainty Quantification for Surface Ozone Emulators using Deep Learning #138 使用深度学习对地表臭氧模拟器进行不确定性量化
#139 Sequence Aware SAC Control for Engine Fuel Consumption Optimization in Electrified Powertrain #139 序列感知 SAC 控制用于电气化动力总成的发动机燃油消耗优化
#140 Provable Post-Training Quantization: Theoretical Analysis of OPTQ and Qronos #140 可证明的后训练量化：OPTQ 与 Qronos 的理论分析
#141 Multi-Stage Knowledge-Distilled VGAE and GAT for Robust Controller-Area-Network Intrusion Detection #141 多阶段知识蒸馏的 VGAE 与 GAT 用于鲁棒的控制器局域网络入侵检测
#142 Persistent Instability in LLM's Personality Measurements: Effects of Scale, Reasoning, and Conversation History #142 LLM 人格测量的持续不稳定性：规模、推理和对话历史的影响 [PDF 1 ] [Copy] [Kimi ] [REL]
#143 Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off #143 Voost：一种用于双向虚拟试穿与脱衣的统一且可扩展的扩散变换器
#144 Automated File-Level Logging Generation for Machine Learning Applications using LLMs: A Case Study using GPT-4o Mini #144 使用 LLM 为机器学习应用自动生成文件级日志：以 GPT-4o Mini 为例的案例研究
#145 CoMAD: A Multiple-Teacher Self-Supervised Distillation Framework #145 CoMAD：一种多教师自监督蒸馏框架
#146 Optimality Principles and Neural Ordinary Differential Equations-based Process Modeling for Distributed Control #146 最优性原理与基于神经常微分方程的分布式控制过程建模
#147 Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization #147 奇偶感知的字节对编码：在分词中提升跨语言公平性
#148 Enhancing Dialogue Annotation with Speaker Characteristics Leveraging a Frozen LLM #148 利用冻结的 LLM 增强带有说话者特征的对话标注
#149 Evaluating the Impact of LLM-guided Reflection on Learning Outcomes with Interactive AI-Generated Educational Podcasts #149 评估 LLM 指导的反思对交互式 AI 生成教育播客学习结果的影响
#150 Uncertainty-aware Predict-Then-Optimize Framework for Equitable Post-Disaster Power Restoration #150 不确定性感知的“先预测后优化”框架用于公平的灾后电力恢复
#151 ERDES: A Benchmark Video Dataset for Retinal Detachment and Macular Status Classification in Ocular Ultrasound #151 ERDES：一个用于眼部超声视网膜脱离和黄斑状态分类的基准视频数据集
#152 Cross-Domain Image Synthesis: Generating H&E from Multiplex Biomarker Imaging #152 跨域图像合成：从多重生物标志物成像生成 H&E
#153 Agency, Affordances, and Enculturation of Augmentation Technologies #153 代理性、可供性与增强技术的文化习得
#154 Wearable Music2Emotion : Assessing Emotions Induced by AI-Generated Music through Portable EEG-fNIRS Fusion #154 可穿戴音乐到情绪：通过便携式 EEG-fNIRS 融合评估由 AI 生成音乐引发的情绪
#155 Toward Low-Latency End-to-End Voice Agents for Telecommunications Using Streaming ASR, Quantized LLMs, and Real-Time TTS #155 面向电信的低延迟端到端语音代理：使用流式 ASR、量化 LLMs 和实时 TTS
#156 AI Should Be More Human, Not More Complex #156 AI 应该更有人性，而不是更复杂
#157 Hybrid Reward-Driven Reinforcement Learning for Efficient Quantum Circuit Synthesis #157 混合奖励驱动的强化学习用于高效量子电路综合
#158 How Robust are LLM-Generated Library Imports? An Empirical Study using Stack Overflow #158 LLM 生成的库导入有多稳健？使用 Stack Overflow 的实证研究
#159 Evaluating the Use of LLMs for Documentation to Code Traceability #159 评估将 LLMs 用于从文档到代码可追溯性的效果
#160 Reinforcement Learning Generation of 4-Qubits Entangled States #160 使用强化学习生成 4 量子比特纠缠态

2025-08-08科研追新

2025-08-07 10:24:16 Thursday ～ 2025-08-08 16:58:43 Friday

1. 源数据

1.1 公众号

1.1.1 量子位

1.1.2 机器之心

1.1.3 新智元

1.1.4 AGI Hunt

1.1.5 其他

1.2 Arxiv

1.2.1 Computation and Language

From：https:// /arxiv/cs.CL

From：https://arxiv.org/list/cs.CL/recent

2025-08-08 | | 总计：71

#1 H-Net++: Hierarchical Dynamic Chunking for Tokenizer-Free Language Modelling in Morphologically-Rich Languages #1 H-Net++：用于形态丰富语言的无分词器语言建模的分层动态分块

Authors: [Mehrdad Zakershahrak](https://arxiv.org/search/?searchtype=author&query=Mehrdad Zakershahrak), [Samira Ghodratnama](https://arxiv.org/search/?searchtype=author&query=Samira Ghodratnama) 作者：Mehrdad Zakershahrak、Samira Ghodratnama

Byte-level language models eliminate fragile tokenizers but face computational challenges in morphologically-rich languages (MRLs), where words span many bytes. We propose H-NET++, a hierarchical dynamic-chunking model that learns linguistically-informed segmentation through end-to-end training. Key innovations include: (1) a lightweight Transformer context-mixer (1.9M parameters) for cross-chunk attention, (2) a two-level latent hyper-prior for document-level consistency, (3) specialized handling of orthographic artifacts (e.g. Persian ZWNJ), and (4) curriculum-based training with staged sequence lengths. On a 1.4B-token Persian corpus, H-NET++ achieves state-of-the-art results: 0.159 BPB reduction versus BPE-based GPT-2-fa (12% better compression), 5.4pp gain on ParsGLUE, 53% improved robustness to ZWNJ corruption, and 73.8% F1 on gold morphological boundaries. Our learned chunks align with Persian morphology without explicit supervision, demonstrating that hierarchical dynamic chunking provides an effective tokenizer-free solution for MRLs while maintaining computational efficiency. 字节级语言模型消除了脆弱的分词器，但在形态丰富语言（MRLs）中面临计算挑战，因为单词跨越许多字节。我们提出了 H-NET++，一种层次化动态分块模型，通过端到端训练学习有语言学信息的分割。关键创新包括： (1) 一个用于跨分块注意力的轻量级 Transformer 上下文混合器（1.9M 参数），(2) 一个用于文档级一致性的两级潜在超先验，(3) 对正字法伪影（如波斯语 ZWNJ）的专门处理，以及 (4) 基于课程学习并分阶段增长序列长度的训练。在一个包含 14 亿标记的波斯语语料上，H-NET++ 达到最先进的结果：相比基于 BPE 的 GPT-2-fa 在每字节位（BPB）上减少 0.159（压缩率提高 12%），在 ParsGLUE 上提升 5.4 个百分点，对 ZWNJ 损坏的鲁棒性提高 53%，在金标形态边界上取得 73.8% 的 F1。我们学到的分块在没有显式监督的情况下与波斯语形态对齐，证明了层次化动态分块为形态丰富语言提供了一种在保持计算效率的同时无需分词器的有效解决方案。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-07 17:59:01 UTC 发布：2025-08-07 17:59:01 UTC

#2 How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations #2 LLMs 如何劝服？线性探针可揭示多轮对话中的劝服动态

Authors: [Brandon Jaipersaud](https://arxiv.org/search/?searchtype=author&query=Brandon Jaipersaud), [David Krueger](https://arxiv.org/search/?searchtype=author&query=David Krueger), [Ekdeep Singh Lubana](https://arxiv.org/search/?searchtype=author&query=Ekdeep Singh Lubana) 作者：Brandon Jaipersaud、David Krueger、Ekdeep Singh Lubana

Large Language Models (LLMs) have started to demonstrate the ability to persuade humans, yet our understanding of how this dynamic transpires is limited. Recent work has used linear probes, lightweight tools for analyzing model representations, to study various LLM skills such as the ability to model user sentiment and political perspective. Motivated by this, we apply probes to study persuasion dynamics in natural, multi-turn conversations. We leverage insights from cognitive science to train probes on distinct aspects of persuasion: persuasion success, persuadee personality, and persuasion strategy. Despite their simplicity, we show that they capture various aspects of persuasion at both the sample and dataset levels. For instance, probes can identify the point in a conversation where the persuadee was persuaded or where persuasive success generally occurs across the entire dataset. We also show that in addition to being faster than expensive prompting-based approaches, probes can do just as well and even outperform prompting in some settings, such as when uncovering persuasion strategy. This suggests probes as a plausible avenue for studying other complex behaviours such as deception and manipulation, especially in multi-turn settings and large-scale dataset analysis where prompting-based methods would be computationally inefficient. 大型语言模型（LLMs）已开始展现说服人类的能力，然而我们对这一动态如何发生的理解仍然有限。近期研究采用线性探针——一种用于分析模型表征的轻量工具——来研究各种 LLM 技能，例如建模用户情绪和政治立场的能力。受此启发，我们将探针应用于研究自然多回合对话中的说服动态。我们借鉴认知科学的见解，训练探针以识别说服的不同方面：说服是否成功、被说服者的个性以及说服策略。尽管这些探针很简单，我们证明了它们能在样本和数据集层面捕捉说服的各种方面。例如，探针可以识别对话中被说服者被说服的具体时点，或识别整个数据集中说服成功通常发生的阶段。我们还表明，除了比昂贵的基于提示的方法更快之外，探针在某些情形下（例如揭示说服策略）表现得同样出色，甚至优于提示方法。这表明探针是一种可行的方法，用于研究其他复杂行为，例如欺骗和操控，尤其是在多轮设置和大规模数据集分析中，在这些情况下基于提示的方法在计算上会效率低下。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言、人工智能、机器学习

Publish: 2025-08-07 17:58:41 UTC 发布：2025-08-07 17:58:41 UTC

#3 Learning to Reason for Factuality #3 学习推理以提升真实性

Reasoning Large Language Models (R-LLMs) have significantly advanced complex reasoning tasks but often struggle with factuality, generating substantially more hallucinations than their non-reasoning counterparts on long-form factuality benchmarks. However, extending online Reinforcement Learning (RL), a key component in recent R-LLM advancements, to the long-form factuality setting poses several unique challenges due to the lack of reliable verification methods. Previous work has utilized automatic factuality evaluation frameworks such as FActScore to curate preference data in the offline RL setting, yet we find that directly leveraging such methods as the reward in online RL leads to reward hacking in multiple ways, such as producing less detailed or relevant responses. We propose a novel reward function that simultaneously considers the factual precision, response detail level, and answer relevance, and applies online RL to learn high quality factual reasoning. Evaluated on six long-form factuality benchmarks, our factual reasoning model achieves an average reduction of 23.1 percentage points in hallucination rate, a 23% increase in answer detail level, and no degradation in the overall response helpfulness. 推理大型语言模型（R-LLMs）在复杂推理任务上取得了显著进展，但在事实性方面往往表现不佳，在长篇事实性基准测试中比非推理模型产生更多幻觉。然而，将在线强化学习（RL）——最近 R-LLM 进展的关键组成部分——扩展到长篇事实性场景面临若干独特挑战，原因在于缺乏可靠的验证方法。先前工作在离线 RL 设置中利用诸如 FActScore 的自动事实性评估框架来整理偏好数据，但我们发现直接在在线 RL 中将此类方法作为奖励会以多种方式导致奖励作弊，例如生成更少细节或相关性较低的回答。我们提出了一种新颖的奖励函数，该函数同时考虑事实精确性、回答细节程度和答案相关性，并应用在线 RL 以学习高质量的事实性推理。在六个长篇事实性基准上评估，我们的事实推理模型在幻觉率上平均降低了 23.1 个百分点，答案细节程度提高了 23%，且整体响应的有用性没有下降。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-07 17:57:09 UTC 发表：2025-08-07 17:57:09 UTC

#4 OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks #4 OmniEAR：在具身任务中基准测试代理推理

Large language models excel at abstract reasoning but their capacity for embodied agent reasoning remains largely unexplored. We present OmniEAR, a comprehensive framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. Unlike existing benchmarks that provide predefined tool sets or explicit collaboration directives, OmniEAR requires agents to dynamically acquire capabilities and autonomously determine coordination strategies based on task demands. Through text-based environment representation, we model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints: while achieving 85-96% success with explicit instructions, performance drops to 56-85% for tool reasoning and 63-85% for implicit collaboration, with compound tasks showing over 50% failure rates. Surprisingly, complete environmental information degrades coordination performance, indicating models cannot filter task-relevant constraints. Fine-tuning improves single-agent tasks dramatically (0.6% to 76.3%) but yields minimal multi-agent gains (1.5% to 5.5%), exposing fundamental architectural limitations. These findings demonstrate that embodied reasoning poses fundamentally different challenges than current models can address, establishing OmniEAR as a rigorous benchmark for evaluating and advancing embodied AI systems. Our code and data are included in the supplementary materials and will be open-sourced upon acceptance. 大型语言模型擅长抽象推理，但其在具身代理推理方面的能力仍 largely unexplored（广泛未被探索）。我们提出了 OmniEAR，一个用于评估语言模型如何在具身任务中就物理交互、工具使用和多代理协作进行推理的综合框架。与提供预定义工具集或明确协作指令的现有基准不同，OmniEAR 要求代理根据任务需求动态获取能力并自主确定协作策略。通过基于文本的环境表示，我们对跨越家居和工业领域的 1,500 个场景中的连续物理属性和复杂空间关系建模。我们的系统评估表明，当模型必须从约束中推理时性能严重下降：在有明确指示的情况下成功率为 85%–96%，但在工具推理时性能下降到 56%–85%，在隐式协作时为 63%–85%，复合任务的失败率超过 50%。令人意外的是，完整的环境信息会降低协作性能，表明模型无法筛选出与任务相关的约束。微调在单智能体任务上显著提升（从 0.6%到 76.3%），但在多智能体场景中几乎没有收益（从 1.5%到 5.5%），揭示了根本性的架构局限。这些发现表明，具身推理提出了与现有模型无法应对的根本不同的挑战，确立了 OmniEAR 作为评估和推进具身人工智能系统的严格基准。我们的代码和数据包含在补充材料中，论文被接受后将开源。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-07 17:54:15 UTC 发表：2025-08-07 17:54:15 UTC

#5 Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models #5 Cooper：在用于大型语言模型的强化学习中共同优化策略模型与奖励模型 [PDF 12 ] [Copy] [Kimi 8 ] [REL]

Large language models (LLMs) have demonstrated remarkable performance in reasoning tasks, where reinforcement learning (RL) serves as a key algorithm for enhancing their reasoning capabilities. Currently, there are two mainstream reward paradigms: model-based rewards and rule-based rewards. However, both approaches suffer from limitations: rule-based rewards lack robustness, while model-based rewards are vulnerable to reward hacking. To address these issues, we propose Cooper(Co-optimizing Policy Model and Reward Model), a RL framework that jointly optimizes both the policy model and the reward model. Cooper leverages the high precision of rule-based rewards when identifying correct responses, and dynamically constructs and selects positive-negative sample pairs for continued training the reward model. This design enhances robustness and mitigates the risk of reward hacking. To further support Cooper, we introduce a hybrid annotation strategy that efficiently and accurately generates training data for the reward model. We also propose a reference-based reward modeling paradigm, where the reward model takes a reference answer as input. Based on this design, we train a reward model named VerifyRM, which achieves higher accuracy on VerifyBench compared to other models of the same size. We conduct reinforcement learning using both VerifyRM and Cooper. Our experiments show that Cooper not only alleviates reward hacking but also improves end-to-end RL performance, for instance, achieving a 0.54% gain in average accuracy on Qwen2.5-1.5B-Instruct. Our findings demonstrate that dynamically updating reward model is an effective way to combat reward hacking, providing a reference for better integrating reward models into RL. 大型语言模型（LLMs）在推理任务中表现出卓越的能力，强化学习（RL）作为提升其推理能力的关键算法。目前存在两种主流的奖励范式：基于模型的奖励和基于规则的奖励。然而，两种方法都存在局限性：基于规则的奖励缺乏鲁棒性，而基于模型的奖励则易受奖励劫持的影响。为了解决这些问题，我们提出了 Cooper（Co-optimizing Policy Model and Reward Model），一个同时优化策略模型与奖励模型的 RL 框架。Cooper 在识别正确回答时利用基于规则奖励的高精度，并动态构建和选择正负样本对以持续训练奖励模型。该设计增强了鲁棒性并降低了奖励劫持的风险。为进一步支持 Cooper，我们引入了一种混合注释策略，高效且准确地为奖励模型生成训练数据。我们还提出了一种基于参考答案的奖励建模范式，其中奖励模型将参考答案作为输入。基于该设计，我们训练了一个名为 VerifyRM 的奖励模型，在 VerifyBench 上比同等规模的其他模型具有更高的准确性。我们使用 VerifyRM 和 Cooper 进行了强化学习。我们的实验表明，Cooper 不仅缓解了奖励劫持问题，还提升了端到端的强化学习性能，例如在 Qwen2.5-1.5B-Instruct 上平均准确率提升了 0.54%。我们的发现证明，动态更新奖励模型是应对奖励劫持的有效方法，为更好地将奖励模型整合到强化学习中提供了参考。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-07 17:53:56 UTC 发布：2025-08-07 17:53:56 UTC

#6 MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy #6 MathSmith：通过使用强化策略构造合成问题迈向极难数学推理

Authors: [Shaoxiong Zhan](https://arxiv.org/search/?searchtype=author&query=Shaoxiong Zhan), [Yanlin Lai](https://arxiv.org/search/?searchtype=author&query=Yanlin Lai), [Ziyu Lu](https://arxiv.org/search/?searchtype=author&query=Ziyu Lu), [Dahua Lin](https://arxiv.org/search/?searchtype=author&query=Dahua Lin), [Ziqing Yang](https://arxiv.org/search/?searchtype=author&query=Ziqing Yang), [Fei Tang](https://arxiv.org/search/?searchtype=author&query=Fei Tang) 作者：詹少雄、赖彦霖、卢子瑜、林大华、杨子清、唐飞

Large language models have achieved substantial progress in mathematical reasoning, yet their advancement is limited by the scarcity of high-quality, high-difficulty training data. Existing synthesis methods largely rely on transforming human-written templates, limiting both diversity and scalability. We propose MathSmith, a novel framework for synthesizing challenging mathematical problems to enhance LLM reasoning. Rather than modifying existing problems, MathSmith constructs new ones from scratch by randomly sampling concept-explanation pairs from PlanetMath, ensuring data independence and avoiding contamination. To increase difficulty, we design nine predefined strategies as soft constraints during rationales. We further adopts reinforcement learning to jointly optimize structural validity, reasoning complexity, and answer consistency. The length of the reasoning trace generated under autoregressive prompting is used to reflect cognitive complexity, encouraging the creation of more demanding problems aligned with long-chain-of-thought reasoning. Experiments across five benchmarks, categorized as easy & medium (GSM8K, MATH-500) and hard (AIME2024, AIME2025, OlympiadBench), show that MathSmith consistently outperforms existing baselines under both short and long CoT settings. Additionally, a weakness-focused variant generation module enables targeted improvement on specific concepts. Overall, MathSmith exhibits strong scalability, generalization, and transferability, highlighting the promise of high-difficulty synthetic data in advancing LLM reasoning capabilities. 大型语言模型在数学推理方面取得了显著进展，但其提升受到高质量、高难度训练数据稀缺的限制。现有的合成方法大多依赖于改写人工编写的模板，因而在多样性和可扩展性上受到限制。我们提出了 MathSmith，一种旨在合成具有挑战性的数学问题以增强 LLM 推理能力的新框架。MathSmith 不通过修改现有问题来生成数据，而是从 PlanetMath 中随机抽取概念-解释对并从零构建新问题，确保数据独立性并避免污染。为了增加难度，我们在推理过程中设计了九种预定义策略作为软约束。我们进一步采用强化学习共同优化结构有效性、推理复杂性和答案一致性。通过自回归提示生成的推理轨迹长度被用来反映认知复杂性，从而鼓励生成更具挑战性的、与长链思维推理相一致的问题。在五个基准上的实验（按难度分为简单与中等：GSM8K、MATH-500；困难：AIME2024、AIME2025、OlympiadBench）表明，MathSmith 在短与长链式推理（CoT）设置下均持续优于现有基线。此外，一个以弱点为导向的变体生成模块使得对特定概念的定向改进成为可能。总体而言，MathSmith 展现出强大的可扩展性、泛化能力和可迁移性，突显了高难度合成数据在提升 LLM 推理能力方面的潜力。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-07 17:32:14 UTC 发表：2025-08-07 17:32:14 UTC

#7 Do Political Opinions Transfer Between Western Languages? An Analysis of Unaligned and Aligned Multilingual LLMs #7 政治观点会在西方语言之间转移吗？对未对齐与已对齐多语种 LLM 的分析

Authors: [Franziska Weeber](https://arxiv.org/search/?searchtype=author&query=Franziska Weeber), [Tanise Ceron](https://arxiv.org/search/?searchtype=author&query=Tanise Ceron), [Sebastian Padó](https://arxiv.org/search/?searchtype=author&query=Sebastian Padó) 作者：Franziska Weeber、Tanise Ceron、Sebastian Padó

Public opinion surveys show cross-cultural differences in political opinions between socio-cultural contexts. However, there is no clear evidence whether these differences translate to cross-lingual differences in multilingual large language models (MLLMs). We analyze whether opinions transfer between languages or whether there are separate opinions for each language in MLLMs of various sizes across five Western languages. We evaluate MLLMs’ opinions by prompting them to report their (dis)agreement with political statements from voting advice applications. To better understand the interaction between languages in the models, we evaluate them both before and after aligning them with more left or right views using direct preference optimization and English alignment data only. Our findings reveal that unaligned models show only very few significant cross-lingual differences in the political opinions they reflect. The political alignment shifts opinions almost uniformly across all five languages. We conclude that in Western language contexts, political opinions transfer between languages, demonstrating the challenges in achieving explicit socio-linguistic, cultural, and political alignment of MLLMs. 民意调查显示，不同社会文化背景下的政治观点存在跨文化差异。然而，目前尚无明确证据表明这些差异是否会在多语言大型语言模型（MLLMs）中转化为跨语言差异。我们分析了观点是否会在语言之间传递，或在不同语言中在 MLLMs 中形成独立的观点，研究对象为五种西方语言、不同规模的 MLLMs。我们通过提示模型报告其对投票建议应用中的政治声明的（不）赞同来评估 MLLMs 的观点。为更好地理解模型中语言间的相互作用，我们在仅使用英语对齐数据的情况下，分别在对模型进行以更偏左或偏右观点的直接偏好优化对齐前后对模型进行评估。我们的研究发现，未对齐的模型在其反映的政治观点上仅表现出极少的显著跨语言差异。政治对齐几乎在所有五种语言中以统一的方式改变了观点。我们得出结论：在西方语言环境中，政治观点会在语言之间传播，这表明要在多语言大型语言模型（MLLMs）中实现明确的社会语言学、文化和政治一致性存在挑战。

Subjects: Computation and Language, Computers and Society 主题：计算与语言，计算机与社会

Publish: 2025-08-07 16:33:45 UTC 发布：2025-08-07 16:33:45 UTC

#8 Conformal Sets in Multiple-Choice Question Answering under Black-Box Settings with Provable Coverage Guarantees #8 在黑盒设置下具有可证明覆盖保证的多项选择题回答中的保形集合

Authors: [Guang Yang](https://arxiv.org/search/?searchtype=author&query=Guang Yang), [Xinyang Liu](https://arxiv.org/search/?searchtype=author&query=Xinyang Liu) 作者：杨广，刘新阳

Large Language Models (LLMs) have shown remarkable progress in multiple-choice question answering (MCQA), but their inherent unreliability, such as hallucination and overconfidence, limits their application in high-risk domains. To address this, we propose a frequency-based uncertainty quantification method under black-box settings, leveraging conformal prediction (CP) to ensure provable coverage guarantees. Our approach involves multiple independent samplings of the model’s output distribution for each input, with the most frequent sample serving as a reference to calculate predictive entropy (PE). Experimental evaluations across six LLMs and four datasets (MedMCQA, MedQA, MMLU, MMLU-Pro) demonstrate that frequency-based PE outperforms logit-based PE in distinguishing between correct and incorrect predictions, as measured by AUROC. Furthermore, the method effectively controls the empirical miscoverage rate under user-specified risk levels, validating that sampling frequency can serve as a viable substitute for logit-based probabilities in black-box scenarios. This work provides a distribution-free model-agnostic framework for reliable uncertainty quantification in MCQA with guaranteed coverage, enhancing the trustworthiness of LLMs in practical applications. 大型语言模型（LLMs）在多项选择题问答（MCQA）方面取得了显著进展，但其固有的不可靠性，如幻觉和过度自信，限制了它们在高风险领域的应用。为了解决这一问题，我们在黑箱设置下提出了一种基于频率的不确定性量化方法，利用校准预测（conformal prediction，CP）以确保可证明的覆盖保证。我们的方法针对每个输入对模型的输出分布进行多次独立抽样，以出现频率最高的样本作为参考来计算预测熵（PE）。在六种 LLMs 和四个数据集（MedMCQA、MedQA、MMLU、MMLU-Pro）上的实验评估表明，基于频率的 PE 在区分正确与错误预测方面优于基于 logit 的 PE（以 AUROC 衡量）。此外，该方法在用户指定的风险水平下有效控制了经验失覆盖率，验证了在黑箱场景中抽样频率可以作为基于 logit 的概率的可行替代。这项工作提供了一个无分布假设、模型无关的框架，用于在带有覆盖保证的多项选择问答（MCQA）中进行可靠的不确定性量化，从而增强了 LLMs 在实际应用中的可信度。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-07 16:22:49 UTC 发布：2025-08-07 16:22:49 UTC

#9 CoCoLex: Confidence-guided Copy-based Decoding for Grounded Legal Text Generation #9 CoCoLex：用于有根法律文本生成的基于复制的置信度引导解码

Due to their ability to process long and complex contexts, LLMs can offer key benefits to the Legal domain, but their adoption has been hindered by their tendency to generate unfaithful, ungrounded, or hallucinatory outputs. While Retrieval-Augmented Generation offers a promising solution by grounding generations in external knowledge, it offers no guarantee that the provided context will be effectively integrated. To address this, context-aware decoding strategies have been proposed to amplify the influence of relevant context, but they usually do not explicitly enforce faithfulness to the context. In this work, we introduce Confidence-guided Copy-based Decoding for Legal Text Generation (CoCoLex)-a decoding strategy that dynamically interpolates the model produced vocabulary distribution with a distribution derived based on copying from the context. CoCoLex encourages direct copying based on the model’s confidence, ensuring greater fidelity to the source. Experimental results on five legal benchmarks demonstrate that CoCoLex outperforms existing context-aware decoding methods, particularly in long-form generation tasks. 由于能够处理冗长且复杂的上下文，LLMs 能为法律领域带来关键益处，但其采用受到其倾向生成不忠实、无依据或幻觉式输出的阻碍。尽管检索增强生成通过将生成内容基于外部知识提供了有希望的解决方案，但它并不能保证所提供的上下文会被有效整合。为了解决这一点，已提出了上下文感知的解码策略以放大相关上下文的影响，但它们通常并不明确强制生成内容对上下文保持忠实。在本工作中，我们引入了用于法律文本生成的基于置信度引导的复制解码（CoCoLex）——一种在模型产生的词汇分布与基于从上下文复制而得到的分布之间动态插值的解码策略。CoCoLex 根据模型的置信度鼓励直接复制，从而确保对源文本具有更高的忠实度。在五个法律基准上的实验结果表明，CoCoLex 优于现有的上下文感知解码方法，尤其在长文本生成任务中表现突出。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-07 16:06:58 UTC 发布：2025-08-07 16:06:58 UTC

#10 The World According to LLMs: How Geographic Origin Influences LLMs' Entity Deduction Capabilities #10 根据 LLMs 的世界观：地理来源如何影响 LLMs 的实体推断能力

Authors: [Harsh Nishant Lalai](https://arxiv.org/search/?searchtype=author&query=Harsh Nishant Lalai), [Raj Sanjay Shah](https://arxiv.org/search/?searchtype=author&query=Raj Sanjay Shah), [Jiaxin Pei](https://arxiv.org/search/?searchtype=author&query=Jiaxin Pei), [Sashank Varma](https://arxiv.org/search/?searchtype=author&query=Sashank Varma), [Yi-Chia Wang](https://arxiv.org/search/?searchtype=author&query=Yi-Chia Wang), [Ali Emami](https://arxiv.org/search/?searchtype=author&query=Ali Emami) 作者：Harsh Nishant Lalai、Raj Sanjay Shah、Jiaxin Pei、Sashank Varma、Yi-Chia Wang、Ali Emami

Large Language Models (LLMs) have been extensively tuned to mitigate explicit biases, yet they often exhibit subtle implicit biases rooted in their pre-training data. Rather than directly probing LLMs with human-crafted questions that may trigger guardrails, we propose studying how models behave when they proactively ask questions themselves. The 20 Questions game, a multi-turn deduction task, serves as an ideal testbed for this purpose. We systematically evaluate geographic performance disparities in entity deduction using a new dataset, Geo20Q+, consisting of both notable people and culturally significant objects (e.g., foods, landmarks, animals) from diverse regions. We test popular LLMs across two gameplay configurations (canonical 20-question and unlimited turns) and in seven languages (English, Hindi, Mandarin, Japanese, French, Spanish, and Turkish). Our results reveal geographic disparities: LLMs are substantially more successful at deducing entities from the Global North than the Global South, and the Global West than the Global East. While Wikipedia pageviews and pre-training corpus frequency correlate mildly with performance, they fail to fully explain these disparities. Notably, the language in which the game is played has minimal impact on performance gaps. These findings demonstrate the value of creative, free-form evaluation frameworks for uncovering subtle biases in LLMs that remain hidden in standard prompting setups. By analyzing how models initiate and pursue reasoning goals over multiple turns, we find geographic and cultural disparities embedded in their reasoning processes. We release the dataset (Geo20Q+) and code at https://sites.google.com/view/llmbias20q/home. 大型语言模型（LLMs）已经经过广泛调优以减轻明显偏见，但它们仍常表现出根植于预训练数据的微妙隐性偏见。与其用可能触发防护机制的人为问题直接探查 LLMs，我们提出研究模型在主动向外提出问题时的行为。二十个问题游戏（20 Questions），作为一个多轮演绎任务，是此目的的理想试验场。我们使用一份新的数据集 Geo20Q+ 系统性地评估实体推断中的地域性能差异，该数据集包含来自不同地区的名人和具有文化意义的物品（例如食物、地标、动物）。我们在两种游戏配置（标准 20 问和不限轮次）以及七种语言（英语、印地语、普通话、日语、法语、西班牙语和土耳其语）上测试流行的 LLMs。我们的结果揭示了地域差异：LLMs 在推断来自全球北方的实体方面明显比全球南方更成功，在全球西方方面也明显优于全球东方。尽管维基百科页面浏览量和预训练语料中的频次与性能有一定相关性，但它们无法完全解释这些差异。值得注意的是，游戏所使用的语言对性能差距影响甚微。这些发现表明，富有创意的、自由形式的评估框架对于揭示在标准提示设置中仍然隐藏的 LLMs 中的微妙偏见具有重要价值。通过分析模型如何在多回合中启动并追求推理目标，我们发现了嵌入于其推理过程中的地理和文化差异。我们在 https://sites.google.com/view/llmbias20q/home 发布了数据集（Geo20Q+）和代码。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言、人工智能

Publish: 2025-08-07 15:53:30 UTC 发布：2025-08-07 15:53:30 UTC

#11 LAG: Logic-Augmented Generation from a Cartesian Perspective #11 LAG：从笛卡尔视角的逻辑增强生成

Authors: [Yilin Xiao](https://arxiv.org/search/?searchtype=author&query=Yilin Xiao), [Chuang Zhou](https://arxiv.org/search/?searchtype=author&query=Chuang Zhou), [Qinggang Zhang](https://arxiv.org/search/?searchtype=author&query=Qinggang Zhang), [Su Dong](https://arxiv.org/search/?searchtype=author&query=Su Dong), [Shengyuan Chen](https://arxiv.org/search/?searchtype=author&query=Shengyuan Chen), [Xiao Huang](https://arxiv.org/search/?searchtype=author&query=Xiao Huang) 作者：肖一林、周闖、张庆刚、董肃、陈胜元、黄晓

Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet exhibit critical limitations in knowledge-intensive tasks, often generating hallucinations when faced with questions requiring specialized expertise. While retrieval-augmented generation (RAG) mitigates this by integrating external knowledge, it struggles with complex reasoning scenarios due to its reliance on direct semantic retrieval and lack of structured logical organization. Inspired by Cartesian principles from \textit{Discours de la méthode}, this paper introduces Logic-Augmented Generation (LAG), a novel paradigm that reframes knowledge augmentation through systematic question decomposition and dependency-aware reasoning. Specifically, LAG first decomposes complex questions into atomic sub-questions ordered by logical dependencies. It then resolves these sequentially, using prior answers to guide context retrieval for subsequent sub-questions, ensuring stepwise grounding in logical chain. To prevent error propagation, LAG incorporates a logical termination mechanism that halts inference upon encountering unanswerable sub-questions and reduces wasted computation on excessive reasoning. Finally, it synthesizes all sub-resolutions to generate verified responses. Experiments on four benchmark datasets demonstrate that LAG significantly enhances reasoning robustness, reduces hallucination, and aligns LLM problem-solving with human cognition, offering a principled alternative to existing RAG systems. 大型语言模型（LLMs）在广泛任务上展现出卓越能力，但在知识密集型任务中仍存在关键局限，面对需要专业知识的问题时常产生幻觉。尽管检索增强生成（RAG）通过整合外部知识缓解了这一问题，但由于依赖直接的语义检索且缺乏结构化逻辑组织，在复杂推理场景中表现不佳。受《方法谈》中的笛卡尔原则启发，本文提出了逻辑增强生成（LAG），一种通过系统性问题分解和依赖意识推理来重构知识增强的新范式。具体而言，LAG 首先将复杂问题分解为按逻辑依赖顺序排列的原子子问题，然后按序解决这些子问题，利用先前答案指导后续子问题的上下文检索，确保在逻辑链上逐步着地。为防止错误传播，LLM LAG(逻辑导向生成)融入了一种逻辑终止机制：在遇到无法回答的子问题时停止推理，从而减少在过度推理上的计算浪费。最后，它将所有子解整合以生成经过验证的回答。对四个基准数据集的实验表明，LAG 显著增强了推理的稳健性，减少了幻觉现象，并使 LLM 的问题解决方式更符合人类认知，为现有的 RAG 系统提供了一个有原则的替代方案。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-07 15:42:00 UTC 发布时间：2025-08-07 15:42:00 UTC

#12 Rethinking Creativity Evaluation: A Critical Analysis of Existing Creativity Evaluations #12 重新思考创造力评估：对现有创造力评估的批判性分析

Authors: [Li-Chun Lu](https://arxiv.org/search/?searchtype=author&query=Li-Chun Lu), [Miri Liu](https://arxiv.org/search/?searchtype=author&query=Miri Liu), [Pin-Chun Lu](https://arxiv.org/search/?searchtype=author&query=Pin-Chun Lu), [Yufei Tian](https://arxiv.org/search/?searchtype=author&query=Yufei Tian), [Shao-Hua Sun](https://arxiv.org/search/?searchtype=author&query=Shao-Hua Sun), [Nanyun Peng](https://arxiv.org/search/?searchtype=author&query=Nanyun Peng) 作者：Li-Chun Lu、Miri Liu、Pin-Chun Lu、Yufei Tian、Shao-Hua Sun、Nanyun Peng

We systematically examine, analyze, and compare representative creativity measures–creativity index, perplexity, syntactic templates, and LLM-as-a-Judge–across diverse creative domains, including creative writing, unconventional problem-solving, and research ideation. Our analyses reveal that these metrics exhibit limited consistency, capturing different dimensions of creativity. We highlight key limitations, including the creativity index’s focus on lexical diversity, perplexity’s sensitivity to model confidence, and syntactic templates’ inability to capture conceptual creativity. Additionally, LLM-as-a-Judge shows instability and bias. Our findings underscore the need for more robust, generalizable evaluation frameworks that better align with human judgments of creativity. 我们系统性地检视、分析并比较了代表性的创造力度量方法——创造力指数、困惑度、句法模板以及将 LLM 作为评审——在多样的创造性领域中的表现，涵盖创意写作、非常规问题解决和研究构思。我们的分析表明，这些度量方法的一致性有限，分别捕捉了创造力的不同维度。我们指出了若干关键局限性，包括创造力指数侧重词汇多样性、困惑度对模型置信度的敏感性，以及句法模板无法捕捉概念性创造力。此外，将 LLM 作为评审表现出不稳定性与偏见。我们的研究结果突显出需要更为稳健且可泛化的评估框架，以更好地与人类对创造力的判断相一致。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-07 15:11:48 UTC 发布：2025-08-07 15:11:48 UTC

#13 TASE: Token Awareness and Structured Evaluation for Multilingual Language Models #13 TASE：面向多语言语言模型的标记感知与结构化评估

Authors: [Chenzhuo Zhao](https://arxiv.org/search/?searchtype=author&query=Chenzhuo Zhao), [Xinda Wang](https://arxiv.org/search/?searchtype=author&query=Xinda Wang), [Yue Huang](https://arxiv.org/search/?searchtype=author&query=Yue Huang), [Junting Lu](https://arxiv.org/search/?searchtype=author&query=Junting Lu), [Ziqian Liu](https://arxiv.org/search/?searchtype=author&query=Ziqian Liu) 作者：赵辰卓、王鑫达、黄岳、卢君廷、刘子谦

While large language models (LLMs) have demonstrated remarkable performance on high-level semantic tasks, they often struggle with fine-grained, token-level understanding and structural reasoning–capabilities that are essential for applications requiring precision and control. We introduce TASE, a comprehensive benchmark designed to evaluate LLMs’ ability to perceive and reason about token-level information across languages. TASE covers 10 tasks under two core categories: token awareness and structural understanding, spanning Chinese, English, and Korean, with a 35,927-instance evaluation set and a scalable synthetic data generation pipeline for training. Tasks include character counting, token alignment, syntactic structure parsing, and length constraint satisfaction. We evaluate over 30 leading commercial and open-source LLMs, including O3, Claude 4, Gemini 2.5 Pro, and DeepSeek-R1, and train a custom Qwen2.5-14B model using the GRPO training method. Results show that human performance significantly outpaces current LLMs, revealing persistent weaknesses in token-level reasoning. TASE sheds light on these limitations and provides a new diagnostic lens for future improvements in low-level language understanding and cross-lingual generalization. Our code and dataset are publicly available at https://github.com/cyzcz/Tase . 虽然大型语言模型（LLMs）在高层语义任务上表现出色，但它们在细粒度、基于标记的理解和结构化推理方面常常力不从心——这些能力对于需要精确性和可控性的应用至关重要。我们提出了 TASE，一个用于评估 LLMs 在多语言环境下感知和推理标记级信息能力的综合基准。TASE 涵盖两大核心类别下的 10 项任务：标记感知与结构理解，横跨中文、英文和韩文，包含 35,927 条评估实例，并提供可扩展的合成数据生成管道用于训练。任务包括字符计数、标记对齐、句法结构解析和长度约束满足等。我们评估了 30 多个领先的商业与开源 LLMs，包括 O3、Claude 4、Gemini 2.5 Pro 和 DeepSeek-R1，并使用 GRPO 训练方法训练了定制的 Qwen2.5-14B 模型。结果表明，人类表现明显优于当前的 LLMs，揭示了在标记级推理方面仍然存在的薄弱环节。 TASE 阐明了这些局限性，并为未来在低层次语言理解和跨语言泛化方面的改进提供了新的诊断视角。我们的代码和数据集已公开发布于 https://github.com/cyzcz/Tase 。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-07 15:11:17 UTC 发布：2025-08-07 15:11:17 UTC

#14 LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models #14 LLMEval-3：一项关于大型语言模型稳健性与公平性评估的大规模纵向研究

Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting, critical issues that obscure true model capabilities. To address this, we introduce LLMEval-3, a framework for dynamic evaluation of LLMs. LLMEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run. Its automated pipeline ensures integrity via contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process achieving 90% agreement with human experts, complemented by a relative ranking system for fair comparison. An 20-month longitudinal study of nearly 50 leading models reveals a performance ceiling on knowledge memorization and exposes data contamination vulnerabilities undetectable by static benchmarks. The framework demonstrates exceptional robustness in ranking stability and consistency, providing strong empirical validation for the dynamic evaluation paradigm. LLMEval-3 offers a robust and credible methodology for assessing the true capabilities of LLMs beyond leaderboard scores, promoting the development of more trustworthy evaluation standards. 现有对大型语言模型（LLMs）在静态基准上的评估存在数据污染和榜单过拟合的风险，这些关键问题掩盖了模型的真实能力。为了解决这一问题，我们引入了 LLMEval-3，一种用于对 LLMs 进行动态评估的框架。LLMEval-3 建立在一个包含 22 万道研究生水平题目的专有题库之上，从中为每次评估动态抽取此前未见的测试集。其自动化管道通过抗污染的数据策划、新颖的防作弊架构以及经过校准的“以 LLM 为裁判”的评判流程来确保完整性，该流程与人类专家的意见达成了 90% 的一致性，并辅以相对排名系统以实现公平比较。一项为期 20 个月、涵盖近 50 个领先模型的纵向研究显示了知识记忆方面的性能天花板，并揭示了静态基准无法检测到的数据污染漏洞。该框架在排名稳定性和一致性方面表现出卓越的鲁棒性，为动态评估范式提供了有力的实证验证。 LLMEval-3 提供了一种稳健且可信的方法，用于评估 LLMs 在排行榜分数之外的真实能力，促进更值得信赖的评估标准的发展。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-07 14:46:30 UTC 发布：2025-08-07 14:46:30 UTC

#15 MyCulture: Exploring Malaysia's Diverse Culture under Low-Resource Language Constraints #15 MyCulture：在低资源语言约束下探索马来西亚多元文化

Authors: [Zhong Ken Hew](https://arxiv.org/search/?searchtype=author&query=Zhong Ken Hew), [Jia Xin Low](https://arxiv.org/search/?searchtype=author&query=Jia Xin Low), [Sze Jue Yang](https://arxiv.org/search/?searchtype=author&query=Sze Jue Yang), [Chee Seng chan](https://arxiv.org/search/?searchtype=author&query=Chee Seng chan) 作者：Zhong Ken Hew、Jia Xin Low、Sze Jue Yang、Chee Seng chan

Large Language Models (LLMs) often exhibit cultural biases due to training data dominated by high-resource languages like English and Chinese. This poses challenges for accurately representing and evaluating diverse cultural contexts, particularly in low-resource language settings. To address this, we introduce MyCulture, a benchmark designed to comprehensively evaluate LLMs on Malaysian culture across six pillars: arts, attire, customs, entertainment, food, and religion presented in Bahasa Melayu. Unlike conventional benchmarks, MyCulture employs a novel open-ended multiple-choice question format without predefined options, thereby reducing guessing and mitigating format bias. We provide a theoretical justification for the effectiveness of this open-ended structure in improving both fairness and discriminative power. Furthermore, we analyze structural bias by comparing model performance on structured versus free-form outputs, and assess language bias through multilingual prompt variations. Our evaluation across a range of regional and international LLMs reveals significant disparities in cultural comprehension, highlighting the urgent need for culturally grounded and linguistically inclusive benchmarks in the development and assessment of LLMs. 大型语言模型（LLMs）常因训练数据以英语和中文等高资源语言为主而表现出文化偏见。这为准确呈现和评估多样文化语境带来挑战，尤其是在低资源语言环境中。为此，我们引入了 MyCulture，这是一个旨在全面评估 LLMs 在马来西亚文化方面表现的基准，涵盖六大领域：艺术、服饰、风俗、娱乐、食品和宗教，全部以马来语呈现。与传统基准不同，MyCulture 采用了一种新颖的无预设选项的开放式多项选择题格式，从而减少猜测并减轻格式偏差。我们为这种开放式结构在提升公平性和判别力方面的有效性提供了理论依据。此外，我们通过比较模型在结构化输出与自由形式输出上的表现来分析结构偏见，并通过多语言提示变体评估语言偏见。我们对一系列区域性和国际性的 LLMs 所做的评估揭示了在文化理解方面存在显著差异，突显出在 LLMs 的开发和评估中迫切需要以文化为基础并兼顾语言包容性的基准。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言、人工智能

Publish: 2025-08-07 14:17:43 UTC 发布：2025-08-07 14:17:43 UTC

#16 The TUB Sign Language Corpus Collection #16 TUB 手语语料库集合

We present a collection of parallel corpora of 12 sign languages in video format, together with subtitles in the dominant spoken languages of the corresponding countries. The entire collection includes more than 1,300 hours in 4,381 video files, accompanied by 1,3~~M subtitles containing 14~~M tokens. Most notably, it includes the first consistent parallel corpora for 8 Latin American sign languages, whereas the size of the German Sign Language corpora is ten times the size of the previously available corpora. The collection was created by collecting and processing videos of multiple sign languages from various online sources, mainly broadcast material of news shows, governmental bodies and educational channels. The preparation involved several stages, including data collection, informing the content creators and seeking usage approvals, scraping, and cropping. The paper provides statistics on the collection and an overview of the methods used to collect the data. 我们提供了一个包含 12 种手语的视频平行语料库集合，同时附有相应国家主要口语的字幕。整个集合包括超过 1,300 小时的 4,381 个视频文件，伴随 1.3 百万条字幕，包含 1,400 万词元。最值得注意的是，它包括了首批一致的 8 种拉丁美洲手语的平行语料库，而德国语手语语料库的规模是此前可用语料库的十倍。该集合是通过从各种在线来源收集和处理多种手语视频创建的，主要来源为新闻节目、政府机构和教育频道的广播材料。准备工作涉及若干阶段，包括数据收集、告知内容创作者并寻求使用许可、抓取和裁剪。论文提供了该集合的统计数据以及用于收集数据的方法概述。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-07 13:16:55 UTC 发布：2025-08-07 13:16:55 UTC

#17 Can Language Models Critique Themselves? Investigating Self-Feedback for Retrieval Augmented Generation at BioASQ 2025 #17 语言模型能自我批评吗？在 BioASQ 2025 上调查用于检索增强生成的自我反馈

Authors: [Samy Ateia](https://arxiv.org/search/?searchtype=author&query=Samy Ateia), [Udo Kruschwitz](https://arxiv.org/search/?searchtype=author&query=Udo Kruschwitz) 作者：Samy Ateia，Udo Kruschwitz

Agentic Retrieval Augmented Generation (RAG) and ‘deep research’ systems aim to enable autonomous search processes where Large Language Models (LLMs) iteratively refine outputs. However, applying these systems to domain-specific professional search, such as biomedical research, presents challenges, as automated systems may reduce user involvement and misalign with expert information needs. Professional search tasks often demand high levels of user expertise and transparency. The BioASQ CLEF 2025 challenge, using expert-formulated questions, can serve as a platform to study these issues. We explored the performance of current reasoning and nonreasoning LLMs like Gemini-Flash 2.0, o3-mini, o4-mini and DeepSeek-R1. A key aspect of our methodology was a self-feedback mechanism where LLMs generated, evaluated, and then refined their outputs for query expansion and for multiple answer types (yes/no, factoid, list, ideal). We investigated whether this iterative self-correction improves performance and if reasoning models are more capable of generating useful feedback. Preliminary results indicate varied performance for the self-feedback strategy across models and tasks. This work offers insights into LLM self-correction and informs future work on comparing the effectiveness of LLM-generated feedback with direct human expert input in these search systems. Agentic 检索增强生成（RAG）和“深度研究”系统旨在实现自主搜索流程，其中大型语言模型（LLMs）迭代地改进输出。然而，将这些系统应用于领域特定的专业检索（如生物医学研究）存在挑战，因为自动化系统可能减少用户参与并与专家的信息需求不一致。专业检索任务通常要求高水平的用户专业知识和透明性。使用专家提问的 BioASQ CLEF 2025 挑战可以作为研究这些问题的平台。我们探讨了诸如 Gemini-Flash 2.0、o3-mini、o4-mini 和 DeepSeek-R1 等当前具有推理能力和非推理能力的 LLMs 的表现。我们方法论的一个关键方面是自我反馈机制，在该机制中 LLMs 生成、评估然后改进其用于查询扩展和多种答案类型（是/否、事实型、列表型、理想答案）的输出。我们研究了这种迭代自我纠正是否能提升性能，以及具有推理能力的模型是否更能生成有用的反馈。初步结果表明，自我反馈策略在不同模型和任务上的表现各异。这项工作提供了对 LLM 自我纠正的洞见，并为未来将 LLM 生成的反馈与这些检索系统中直接人类专家输入的效果进行比较的研究提供了参考。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-07 13:13:19 UTC 发表时间：2025-08-07 13:13:19 UTC

#18 Evaluation of a Sign Language Avatar on Comprehensibility, User Experience & Acceptability #18 手语化身在可理解性、用户体验与可接受性方面的评估

Authors: [Fenya Wasserroth](https://arxiv.org/search/?searchtype=author&query=Fenya Wasserroth), [Eleftherios Avramidis](https://arxiv.org/search/?searchtype=author&query=Eleftherios Avramidis), [Vera Czehmann](https://arxiv.org/search/?searchtype=author&query=Vera Czehmann), [Tanja Kojic](https://arxiv.org/search/?searchtype=author&query=Tanja Kojic), [Fabrizio Nunnari](https://arxiv.org/search/?searchtype=author&query=Fabrizio Nunnari), [Sebastian Möller](https://arxiv.org/search/?searchtype=author&query=Sebastian Möller) 作者：Fenya Wasserroth、Eleftherios Avramidis、Vera Czehmann、Tanja Kojic、Fabrizio Nunnari、Sebastian Möller

This paper presents an investigation into the impact of adding adjustment features to an existing sign language (SL) avatar on a Microsoft Hololens 2 device. Through a detailed analysis of interactions of expert German Sign Language (DGS) users with both adjustable and non-adjustable avatars in a specific use case, this study identifies the key factors influencing the comprehensibility, the user experience (UX), and the acceptability of such a system. Despite user preference for adjustable settings, no significant improvements in UX or comprehensibility were observed, which remained at low levels, amid missing SL elements (mouthings and facial expressions) and implementation issues (indistinct hand shapes, lack of feedback and menu positioning). Hedonic quality was rated higher than pragmatic quality, indicating that users found the system more emotionally or aesthetically pleasing than functionally useful. Stress levels were higher for the adjustable avatar, reflecting lower performance, greater effort and more frustration. Additionally, concerns were raised about whether the Hololens adjustment gestures are intuitive and easy to familiarise oneself with. While acceptability of the concept of adjustability was generally positive, it was strongly dependent on usability and animation quality. This study highlights that personalisation alone is insufficient, and that SL avatars must be comprehensible by default. Key recommendations include enhancing mouthing and facial animation, improving interaction interfaces, and applying participatory design. 本文研究了在 Microsoft Hololens 2 设备上为现有手语（SL）虚拟形象添加可调节功能的影响。通过对德国语言手语（DGS）专家用户在特定使用场景中与可调节和不可调节虚拟形象交互的详细分析，本研究识别出影响该系统可理解性、用户体验（UX）和可接受性的关键因素。尽管用户偏好可调节设置，但在用户体验或可理解性方面并未观察到显著改善，且在缺失手语元素（口型和面部表情）及实现问题（手形不清、缺乏反馈和菜单位置）情况下，可理解性仍处于较低水平。享乐质量的评分高于实用质量，表明用户认为该系统在情感或美学上比在功能上更令人愉悦。可调节虚拟形象的压力水平更高，反映出较差的表现、更大的努力和更多的挫败感。此外，研究还提出了关于 Hololens 调节手势是否直观且易于上手的担忧。尽管对“可调节性”这一概念的可接受性总体上是积极的，但它在很大程度上取决于可用性和动画质量。本研究强调，仅靠个性化并不足够，且 SL 头像必须默认就易于理解。主要建议包括增强口型和面部动画、改进交互界面以及采用参与式设计。

Subjects: Computation and Language, Human-Computer Interaction

Publish: 2025-08-07 13:06:42 UTC

#19 Efficient Reasoning for Large Reasoning Language Models via Certainty-Guided Reflection Suppression 确定性引导的反思抑制方法

Authors: [Jiameng Huang](https://arxiv.org/search/?searchtype=author&query=Jiameng Huang), [Baijiong Lin](https://arxiv.org/search/?searchtype=author&query=Baijiong Lin), [Guhao Feng](https://arxiv.org/search/?searchtype=author&query=Guhao Feng), [Jierun Chen](https://arxiv.org/search/?searchtype=author&query=Jierun Chen), [Di He](https://arxiv.org/search/?searchtype=author&query=Di He), [Lu Hou](https://arxiv.org/search/?searchtype=author&query=Lu Hou) 作者：Jiameng Huang、Baijiong Lin、Guhao Feng、Jierun Chen、Di He、Lu Hou

Recent Large Reasoning Language Models (LRLMs) employ long chain-of-thought reasoning with complex reflection behaviors, typically signaled by specific trigger words (e.g., “Wait” and “Alternatively”) to enhance performance. However, these reflection behaviors can lead to the overthinking problem where the generation of redundant reasoning steps that unnecessarily increase token usage, raise inference costs, and reduce practical utility. In this paper, we propose Certainty-Guided Reflection Suppression (CGRS), a novel method that mitigates overthinking in LRLMs while maintaining reasoning accuracy. CGRS operates by dynamically suppressing the model’s generation of reflection triggers when it exhibits high confidence in its current response, thereby preventing redundant reflection cycles without compromising output quality. Our approach is model-agnostic, requires no retraining or architectural modifications, and can be integrated seamlessly with existing autoregressive generation pipelines. Extensive experiments across four reasoning benchmarks (i.e., AIME24, AMC23, MATH500, and GPQA-D) demonstrate CGRS’s effectiveness: it reduces token usage by an average of 18.5% to 41.9% while preserving accuracy. It also achieves the optimal balance between length reduction and performance compared to state-of-the-art baselines. These results hold consistently across model architectures (e.g., DeepSeek-R1-Distill series, QwQ-32B, and Qwen3 family) and scales (4B to 32B parameters), highlighting CGRS’s practical value for efficient reasoning. 近期的大型推理语言模型（LRLMs）在长链式思维推理中采用了带有复杂反思行为的长推理链，通常通过特定的触发词（如“等等”和“或者”）来提高性能。然而，这些反思行为可能导致过度思考问题——生成多余的推理步骤，导致不必要的令牌使用增加、推理成本上升并降低实际效用。本文提出了一种确定性引导的反思抑制方法（Certainty-Guided Reflection Suppression，CGRS），这是一种在保持推理准确性的同时缓解 LRLMs 过度思考的新方法。CGRS 通过在模型对当前响应具有高置信度时动态抑制其生成反思触发词，从而防止冗余的反思循环而不损害输出质量。我们的方法与模型无关，无需重新训练或修改架构，并且可以无缝集成到现有的自回归生成流水线中。在四个推理基准（即 AIME24、AMC23、MATH500 和 GPQA-D）上的大量实验表明，CGRS 非常有效：在保持准确性的同时，平均减少了 18.5% 到 41.9% 的令牌使用量。与最先进的基线方法相比，它在长度缩减与性能之间也实现了最佳平衡。这些结果在不同模型架构（如 DeepSeek-R1-Distill 系列、QwQ-32B 和 Qwen3 家族）和不同规模（从 4B 到 32B 参数）上均保持一致，突显了 CGRS 在高效推理方面的实际价值。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：Computation and Language，Artificial Intelligence，Machine Learning

Publish: 2025-08-07 12:38:22 UTC 发布：2025-08-07 12:38:22 UTC

#20 SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens #20 SONAR-LLM：以句子嵌入进行思考、以令牌表达的自回归 Transformer

Authors: [Nikita Dragunov](https://arxiv.org/search/?searchtype=author&query=Nikita Dragunov), [Temurbek Rahmatullaev](https://arxiv.org/search/?searchtype=author&query=Temurbek Rahmatullaev), [Elizaveta Goncharova](https://arxiv.org/search/?searchtype=author&query=Elizaveta Goncharova), [Andrey Kuznetsov](https://arxiv.org/search/?searchtype=author&query=Andrey Kuznetsov), [Anton Razzhigaev](https://arxiv.org/search/?searchtype=author&query=Anton Razzhigaev) 作者：Nikita Dragunov、Temurbek Rahmatullaev、Elizaveta Goncharova、Andrey Kuznetsov、Anton Razzhigaev

The recently proposed Large Concept Model (LCM) generates text by predicting a sequence of sentence-level embeddings and training with either mean-squared error or diffusion objectives. We present SONAR-LLM, a decoder-only transformer that “thinks” in the same continuous SONAR embedding space, yet is supervised through token-level cross-entropy propagated via the frozen SONAR decoder. This hybrid objective retains the semantic abstraction of LCM while eliminating its diffusion sampler and restoring a likelihood-based training signal. Across model sizes from 39M to 1.3B parameters, SONAR-LLM attains competitive generation quality. We report scaling trends, ablations, benchmark results, and release the complete training code and all pretrained checkpoints to foster reproducibility and future research. 最近提出的 Large Concept Model (LCM) 通过预测一系列句子级嵌入并使用均方误差或扩散目标进行训练来生成文本。我们提出了 SONAR-LLM，一种仅解码器的 transformer，它在相同的连续 SONAR 嵌入空间中“思考”，但通过冻结的 SONAR 解码器将基于令牌的交叉熵监督传递下去。这种混合目标保留了 LCM 的语义抽象，同时消除了其扩散采样器并恢复了基于似然的训练信号。在从 39M 到 1.3B 参数的模型规模上，SONAR-LLM 实现了具有竞争力的生成质量。我们报告了扩展趋势、消融实验、基准结果，并发布了完整的训练代码和所有预训练检查点，以促进可重复性和未来研究。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-07 12:03:44 UTC 发布：2025-08-07 12:03:44 UTC

#21 Decision-Making with Deliberation: Meta-reviewing as a Document-grounded Dialogue #21 带有深思熟虑的决策制定：作为基于文档对话的元评审

Authors: [Sukannya Purkayastha](https://arxiv.org/search/?searchtype=author&query=Sukannya Purkayastha), [Nils Dycke](https://arxiv.org/search/?searchtype=author&query=Nils Dycke), [Anne Lauscher](https://arxiv.org/search/?searchtype=author&query=Anne Lauscher), [Iryna Gurevych](https://arxiv.org/search/?searchtype=author&query=Iryna Gurevych) 作者：Sukannya Purkayastha、Nils Dycke、Anne Lauscher、Iryna Gurevych

Meta-reviewing is a pivotal stage in the peer-review process, serving as the final step in determining whether a paper is recommended for acceptance. Prior research on meta-reviewing has treated this as a summarization problem over review reports. However, complementary to this perspective, meta-reviewing is a decision-making process that requires weighing reviewer arguments and placing them within a broader context. Prior research has demonstrated that decision-makers can be effectively assisted in such scenarios via dialogue agents. In line with this framing, we explore the practical challenges for realizing dialog agents that can effectively assist meta-reviewers. Concretely, we first address the issue of data scarcity for training dialogue agents by generating synthetic data using Large Language Models (LLMs) based on a self-refinement strategy to improve the relevance of these dialogues to expert domains. Our experiments demonstrate that this method produces higher-quality synthetic data and can serve as a valuable resource towards training meta-reviewing assistants. Subsequently, we utilize this data to train dialogue agents tailored for meta-reviewing and find that these agents outperform \emph{off-the-shelf} LLM-based assistants for this task. Finally, we apply our agents in real-world meta-reviewing scenarios and confirm their effectiveness in enhancing the efficiency of meta-reviewing.\footnote{Code and Data: https://github.com/UKPLab/arxiv2025-meta-review-as-dialog 元评审（meta-reviewing）是同行评审过程中的关键环节，作为决定论文是否推荐接收的最后一步。以往关于元评审的研究将其视为对评审报告的摘要问题。然而，补充这一视角的是，元评审实际上是一个决策过程，需要权衡评审人观点并将其置于更广泛的语境中。以往研究表明，在此类场景中，决策者可以通过对话代理得到有效帮助。基于这一框架，我们探讨了实现能够有效协助元评审人的对话代理所面临的实际挑战。具体而言，我们首先通过使用大型语言模型（LLMs）并基于自我精炼策略生成合成数据来解决用于训练对话代理的数据匮乏问题，以提高这些对话与专家领域的相关性。我们的实验表明，该方法能生成更高质量的合成数据，并可作为训练元评审助手的有价值资源。随后，我们利用这些数据训练了针对元评审的对话代理，并发现这些代理在该任务上优于现成的基于 LLM 的助手。最后，我们在真实的元评审场景中应用了我们的代理，确认它们在提高元评审效率方面的有效性。\footnote{代码和数据： https://github.com/UKPLab/arxiv2025-meta-review-as-dialog

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-07 11:27:43 UTC 发布：2025-08-07 11:27:43 UTC

#22 ASCoT: An Adaptive Self-Correction Chain-of-Thought Method for Late-Stage Fragility in LLMs #22 ASCoT：一种用于 LLMs 后期脆弱性的自适应自我修正思路链方法

Authors: [Dongxu Zhang](https://arxiv.org/search/?searchtype=author&query=Dongxu Zhang), [Ning Yang](https://arxiv.org/search/?searchtype=author&query=Ning Yang), [Jihua Zhu](https://arxiv.org/search/?searchtype=author&query=Jihua Zhu), [Jinnan Yang](https://arxiv.org/search/?searchtype=author&query=Jinnan Yang), [Miao Xin](https://arxiv.org/search/?searchtype=author&query=Miao Xin), [Baoliang Tian](https://arxiv.org/search/?searchtype=author&query=Baoliang Tian) 作者：张东旭、杨宁、朱继华、杨津南、辛淼、田保良

Chain-of-Thought (CoT) prompting has significantly advanced the reasoning capabilities of Large Language Models (LLMs), yet the reliability of these reasoning chains remains a critical challenge. A widely held “cascading failure” hypothesis suggests that errors are most detrimental when they occur early in the reasoning process. This paper challenges that assumption through systematic error-injection experiments, revealing a counter-intuitive phenomenon we term “Late-Stage Fragility”: errors introduced in the later stages of a CoT chain are significantly more likely to corrupt the final answer than identical errors made at the beginning. To address this specific vulnerability, we introduce the Adaptive Self-Correction Chain-of-Thought (ASCoT) method. ASCoT employs a modular pipeline in which an Adaptive Verification Manager (AVM) operates first, followed by the Multi-Perspective Self-Correction Engine (MSCE). The AVM leverages a Positional Impact Score function I(k) that assigns different weights based on the position within the reasoning chains, addressing the Late-Stage Fragility issue by identifying and prioritizing high-risk, late-stage steps. Once these critical steps are identified, the MSCE applies robust, dual-path correction specifically to the failure parts. Extensive experiments on benchmarks such as GSM8K and MATH demonstrate that ASCoT achieves outstanding accuracy, outperforming strong baselines, including standard CoT. Our work underscores the importance of diagnosing specific failure modes in LLM reasoning and advocates for a shift from uniform verification strategies to adaptive, vulnerability-aware correction mechanisms. 连锁思维（CoT）提示显著提升了大型语言模型（LLMs）的推理能力，但这些推理链的可靠性仍然是一个关键挑战。一种广泛流行的“级联失效”假说认为，错误在推理过程早期发生时最具破坏性。本文通过系统的错误注入实验挑战了这一假设，揭示了一个反直觉现象，我们称之为“后期脆弱性（Late-Stage Fragility）”：在 CoT 链的后期引入的错误比在开头发生的相同错误更有可能破坏最终答案。为了解决这一特定脆弱性，我们引入了自适应自我纠正链思维（ASCoT）方法。ASCoT 采用模块化流程，其中自适应验证管理器（AVM）先行，随后是多视角自我纠正引擎（MSCE）。AVM 利用位置影响得分函数 I(k)，根据推理链中的位置分配不同权重，通过识别和优先处理高风险的后期步骤来应对后期脆弱性问题。一旦识别出这些关键步骤，MSCE 就会针对失败部分实施稳健的双路径纠错。在 GSM8K 和 MATH 等基准上的大量实验表明，ASCoT 达到了卓越的准确率，优于包括标准 CoT 在内的强基线。我们的工作强调了诊断 LLM 推理中特定失败模式的重要性，并倡导从统一的验证策略转向自适应的、以脆弱性为导向的纠错机制。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-07 11:26:40 UTC 发布时间：2025-08-07 11:26:40 世界协调时间(UTC)

#23 CodeBoost: Boosting Code LLMs by Squeezing Knowledge from Code Snippets with RL #23 CodeBoost: 通过从代码片段中挤出知识并使用强化学习提升代码 LLMs [PDF 5 ] [Copy] [Kimi 1 ] [REL]

Code large language models (LLMs) have become indispensable tools for building efficient and automated coding pipelines. Existing models are typically post-trained using reinforcement learning (RL) from general-purpose LLMs using “human instruction-final answer” pairs, where the instructions are usually from manual annotations. However, collecting high-quality coding instructions is both labor-intensive and difficult to scale. On the other hand, code snippets are abundantly available from various sources. This imbalance presents a major bottleneck in instruction-based post-training. We propose CodeBoost, a post-training framework that enhances code LLMs purely from code snippets, without relying on human-annotated instructions. CodeBoost introduces the following key components: (1) maximum-clique curation, which selects a representative and diverse training corpus from code; (2) bi-directional prediction, which enables the model to learn from both forward and backward prediction objectives; (3) error-aware prediction, which incorporates learning signals from both correct and incorrect outputs; (4) heterogeneous augmentation, which diversifies the training distribution to enrich code semantics; and (5) heterogeneous rewarding, which guides model learning through multiple reward types including format correctness and execution feedback from both successes and failures. Extensive experiments across several code LLMs and benchmarks verify that CodeBoost consistently improves performance, demonstrating its effectiveness as a scalable and effective training pipeline. 代码大型语言模型（LLMs）已成为构建高效自动化编码流水线的不可或缺工具。现有模型通常通过对通用 LLMs 进行“人类指令—最终答案”对的强化学习（RL）后训练，其中指令通常来自人工标注。然而，收集高质量的编码指令既费力又难以规模化。另一方面，代码片段在各种来源中大量存在。这种不平衡成为基于指令的后训练的一大瓶颈。我们提出了 CodeBoost，一种仅从代码片段中增强代码 LLMs 的后训练框架，无需依赖人工标注的指令。 CodeBoost 引入了以下关键组成部分：(1) 最大团策划，从代码中选择具有代表性且多样的训练语料库；(2) 双向预测，使模型能够从前向和后向预测目标中学习；(3) 错误感知预测，将来自正确和错误输出的学习信号纳入；(4) 异构增强，多样化训练分布以丰富代码语义；以及 (5) 异构奖励，通过多种奖励类型引导模型学习，包括格式正确性和来自成功与失败的执行反馈。在多个代码 LLMs 和基准上的大量实验证明，CodeBoost 始终如一地提高了性能，显示出其作为一种可扩展且有效的训练流程的有效性。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-07 10:31:24 UTC 发布：2025-08-07 10:31:24 UTC

#24 Pruning Large Language Models by Identifying and Preserving Functional Networks #24 通过识别并保留功能网络对大型语言模型进行剪枝 [PDF 1 ] [Copy] [Kimi 1 ] [REL]

Structured pruning is one of the representative techniques for compressing large language models (LLMs) to reduce GPU memory consumption and accelerate inference speed. It offers significant practical value in improving the efficiency of LLMs in real-world applications. Current structured pruning methods typically rely on assessment of the importance of the structure units and pruning the units with less importance. Most of them overlooks the interaction and collaboration among artificial neurons that are crucial for the functionalities of LLMs, leading to a disruption in the macro functional architecture of LLMs and consequently a pruning performance degradation. Inspired by the inherent similarities between artificial neural networks and functional neural networks in the human brain, we alleviate this challenge and propose to prune LLMs by identifying and preserving functional networks within LLMs in this study. To achieve this, we treat an LLM as a digital brain and decompose the LLM into functional networks, analogous to identifying functional brain networks in neuroimaging data. Afterwards, an LLM is pruned by preserving the key neurons within these functional networks. Experimental results demonstrate that the proposed method can successfully identify and locate functional networks and key neurons in LLMs, enabling efficient model pruning. Our code is available at https://github.com/WhatAboutMyStar/LLM_ACTIVATION. 结构化剪枝是压缩大型语言模型（LLMs）以降低 GPU 内存消耗并加速推理速度的代表性技术之一。它在提高 LLMs 在现实应用中效率方面具有重要的实际价值。当前的结构化剪枝方法通常依赖于评估结构单元的重要性，并剪除重要性较低的单元。它们大多忽视了人工神经元之间对 LLMs 功能至关重要的相互作用和协作，从而导致 LLMs 宏观功能架构的破坏，并因此出现剪枝性能下降。受人工神经网络与人脑功能性神经网络之间固有相似性的启发，我们在本研究中缓解了这一挑战，提出通过识别并保留 LLMs 内部的功能网络来对 LLMs 进行剪枝。为此，我们将 LLM 视为一个数字大脑，并将 LLM 分解为功能网络，类似于在神经影像数据中识别功能性脑网络。随后，通过保留这些功能网络内的关键神经元来对 LLM 进行剪枝。实验结果表明，所提出的方法能够成功识别并定位 LLMs 中的功能网络和关键神经元，从而实现高效的模型剪枝。我们的代码可在 https://github.com/WhatAboutMyStar/LLM_ACTIVATION 获取。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-07 10:27:01 UTC 发布：2025-08-07 10:27:01 协调世界时

#25 Resource-Limited Joint Multimodal Sentiment Reasoning and Classification via Chain-of-Thought Enhancement and Distillation #25 资源受限的联合多模态情感推理与分类：通过链式思维增强与蒸馏

Authors: [Haonan Shangguan](https://arxiv.org/search/?searchtype=author&query=Haonan Shangguan), [Xiaocui Yang](https://arxiv.org/search/?searchtype=author&query=Xiaocui Yang), [Shi Feng](https://arxiv.org/search/?searchtype=author&query=Shi Feng), [Daling Wang](https://arxiv.org/search/?searchtype=author&query=Daling Wang), [Yifei Zhang](https://arxiv.org/search/?searchtype=author&query=Yifei Zhang), [Ge Yu](https://arxiv.org/search/?searchtype=author&query=Ge Yu) 作者：商浩南，杨晓翠，冯石，大岭王，张一飞，余戈

The surge in rich multimodal content on social media platforms has greatly advanced Multimodal Sentiment Analysis (MSA), with Large Language Models (LLMs) further accelerating progress in this field. Current approaches primarily leverage the knowledge and reasoning capabilities of parameter-heavy (Multimodal) LLMs for sentiment classification, overlooking autonomous multimodal sentiment reasoning generation in resource-constrained environments. Therefore, we focus on the Resource-Limited Joint Multimodal Sentiment Reasoning and Classification task, JMSRC, which simultaneously performs multimodal sentiment reasoning chain generation and sentiment classification only with a lightweight model. We propose a Multimodal Chain-of-Thought Reasoning Distillation model, MulCoT-RD, designed for JMSRC that employs a “Teacher-Assistant-Student” distillation paradigm to address deployment constraints in resource-limited environments. We first leverage a high-performance Multimodal Large Language Model (MLLM) to generate the initial reasoning dataset and train a medium-sized assistant model with a multi-task learning mechanism. A lightweight student model is jointly trained to perform efficient multimodal sentiment reasoning generation and classification. Extensive experiments on four datasets demonstrate that MulCoT-RD with only 3B parameters achieves strong performance on JMSRC, while exhibiting robust generalization and enhanced interpretability. 社交媒体平台上丰富的多模态内容激增极大地推动了多模态情感分析（MSA）的发展，LLMs 进一步加速了该领域的进展。当前的方法主要利用参数庞大的（多模态）LLMs 的知识和推理能力来进行情感分类，忽视了在资源受限的环境中自主生成多模态情感推理的能力。因此，我们聚焦于资源受限的联合多模态情感推理与分类任务（JMSRC），该任务仅使用轻量级模型同时执行多模态情感推理链生成和情感分类。我们提出了一种用于 JMSRC 的多模态思维链推理蒸馏模型 MulCoT-RD，采用“教师—助教—学生”蒸馏范式，以应对资源受限环境中的部署限制。我们首先利用高性能的多模态大模型（MLLM）生成初始推理数据集，并通过多任务学习机制训练一个中等规模的助教模型。一个轻量级学生模型被联合训练以执行高效的多模态情感推理生成与分类。在四个数据集上的大量实验证明，仅有 3B 参数的 MulCoT-RD 在 JMSRC 上取得了强劲表现，同时展现出稳健的泛化能力和增强的可解释性。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-07 10:23:14 UTC 发布：2025-08-07 10:23:14 UTC

#26 ATLANTIS at SemEval-2025 Task 3: Detecting Hallucinated Text Spans in Question Answering #26 ATLANTIS 在 SemEval-2025 任务 3：检测问答中的虚构文本片段 [PDF ] [Copy] [Kimi ] [REL]

Authors: [Catherine Kobus](https://arxiv.org/search/?searchtype=author&query=Catherine Kobus), [François Lancelot](https://arxiv.org/search/?searchtype=author&query=François Lancelot), [Marion-Cécile Martin](https://arxiv.org/search/?searchtype=author&query=Marion-Cécile Martin), [Nawal Ould Amer](https://arxiv.org/search/?searchtype=author&query=Nawal Ould Amer) 作者：Catherine Kobus、François Lancelot、Marion-Cécile Martin、Nawal Ould Amer

This paper presents the contributions of the ATLANTIS team to SemEval-2025 Task 3, focusing on detecting hallucinated text spans in question answering systems. Large Language Models (LLMs) have significantly advanced Natural Language Generation (NLG) but remain susceptible to hallucinations, generating incorrect or misleading content. To address this, we explored methods both with and without external context, utilizing few-shot prompting with a LLM, token-level classification or LLM fine-tuned on synthetic data. Notably, our approaches achieved top rankings in Spanish and competitive placements in English and German. This work highlights the importance of integrating relevant context to mitigate hallucinations and demonstrate the potential of fine-tuned models and prompt engineering. 本文介绍了 ATLANTIS 团队在 SemEval-2025 任务 3 中的成果，重点是检测问答系统中的幻觉文本片段。大型语言模型（LLMs）在自然语言生成（NLG）方面取得了显著进展，但仍容易出现幻觉，生成不正确或误导性的内容。为了解决这一问题，我们探索了有无外部上下文的方法，采用了对 LLM 的少量示例提示、基于标记的分类或在合成数据上微调的 LLM 等手段。值得注意的是，我们的方法在西班牙语中取得了名列前茅的成绩，并在英语和德语中获得了有竞争力的名次。该研究强调了整合相关上下文以减轻幻觉问题的重要性，并展示了微调模型和提示工程的潜力。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-07 09:15:15 UTC 发布：2025-08-07 09:15:15 协调世界时（UTC）

#27 Towards Assessing Medical Ethics from Knowledge to Practice #27 从知识到实践的医学伦理评估探索

The integration of large language models into healthcare necessitates a rigorous evaluation of their ethical reasoning, an area current benchmarks often overlook. We introduce PrinciplismQA, a comprehensive benchmark with 3,648 questions designed to systematically assess LLMs’ alignment with core medical ethics. Grounded in Principlism, our benchmark features a high-quality dataset. This includes multiple-choice questions curated from authoritative textbooks and open-ended questions sourced from authoritative medical ethics case study literature, all validated by medical experts. Our experiments reveal a significant gap between models’ ethical knowledge and their practical application, especially in dynamically applying ethical principles to real-world scenarios. Most LLMs struggle with dilemmas concerning Beneficence, often over-emphasizing other principles. Frontier closed-source models, driven by strong general capabilities, currently lead the benchmark. Notably, medical domain fine-tuning can enhance models’ overall ethical competence, but further progress requires better alignment with medical ethical knowledge. PrinciplismQA offers a scalable framework to diagnose these specific ethical weaknesses, paving the way for more balanced and responsible medical AI. 将大型语言模型整合到医疗领域需要对其伦理推理进行严格评估，而当前的基准测试常常忽视这一点。我们提出了 PrinciplismQA，一个包含 3,648 个问题的综合基准，用于系统评估 LLMs 在核心医学伦理方面的一致性。基于原则主义（Principlism），我们的基准包含高质量数据集。该数据集包括从权威教科书整理的多项选择题和从权威医学伦理案例研究文献中提取的开放式问题，所有题目均由医学专家验证。我们的实验显示，模型在伦理知识与其实践应用之间存在显著差距，尤其是在将伦理原则动态应用于现实情境时。大多数 LLMs 在涉及行善（Beneficence）的困境上表现不佳，往往过度强调其他原则。具有强大通用能力的前沿闭源模型目前在该基准中领先。值得注意的是，医学领域的微调可以提升模型的整体伦理能力，但要取得更大进展需要更好地使模型与医学伦理知识保持一致。 PrinciplismQA 提供了一个可扩展的框架来诊断这些具体的伦理薄弱环节，为更均衡和负责任的医疗人工智能铺平了道路。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-07 08:10:14 UTC 发布时间：2025-08-07 08:10:14 UTC

#28 Attention Basin: Why Contextual Position Matters in Large Language Models #28 注意力盆地：为什么上下文位置在大型语言模型中很重要

The performance of Large Language Models (LLMs) is significantly sensitive to the contextual position of information in the input. To investigate the mechanism behind this positional bias, our extensive experiments reveal a consistent phenomenon we term the attention basin: when presented with a sequence of structured items (e.g., retrieved documents or few-shot examples), models systematically assign higher attention to the items at the beginning and end of the sequence, while neglecting those in the middle. Crucially, our analysis further reveals that allocating higher attention to critical information is key to enhancing model performance. Based on these insights, we introduce Attention-Driven Reranking (AttnRank), a two-stage framework that (i) estimates a model’s intrinsic positional attention preferences using a small calibration set, and (ii) reorders retrieved documents or few-shot examples to align the most salient content with these high-attention positions. AttnRank is a model-agnostic, training-free, and plug-and-play method with minimal computational overhead. Experiments on multi-hop QA and few-shot in-context learning tasks demonstrate that AttnRank achieves substantial improvements across 10 large language models of varying architectures and scales, without modifying model parameters or training procedures. 大型语言模型(LLMs)的性能对输入中信息的位置信息极为敏感。为了探究这种位置偏差背后的机制，我们的大量实验证明了一个一致的现象——我们称之为注意力盆地：当呈现一系列结构化条目（例如检索到的文档或少样本示例）时，模型会系统性地对序列开头和结尾的条目分配更高的注意力，而忽略中间的条目。关键是，我们的分析进一步表明，将更高的注意力分配给关键信息是提升模型性能的关键。基于这些洞见，我们提出了注意力驱动重排序（AttnRank），这是一个两阶段框架，(i) 使用一小部分校准集估计模型的固有位置注意力偏好，(ii) 重新排序检索到的文档或少样本示例，使最显著的内容与这些高注意力位置对齐。AttnRank 是一种与模型无关、无需训练且即插即用的方法，计算开销极小。在多跳问答和少样本上下文学习任务上的实验表明，AttnRank 在不同架构和规模的 10 个大语言模型上均取得了显著提升，且无需修改模型参数或训练流程。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-07 08:08:08 UTC 发布：2025-08-07 08:08:08 UTC

#29 BEE-RAG: Balanced Entropy Engineering for Retrieval-Augmented Generation #29 BEE-RAG：检索增强生成的平衡熵工程 [PDF 1 ] [Copy] [Kimi ] [REL]

Authors: [Yuhao Wang](https://arxiv.org/search/?searchtype=author&query=Yuhao Wang), [Ruiyang Ren](https://arxiv.org/search/?searchtype=author&query=Ruiyang Ren), [Yucheng Wang](https://arxiv.org/search/?searchtype=author&query=Yucheng Wang), [Jing Liu](https://arxiv.org/search/?searchtype=author&query=Jing Liu), [Wayne Xin Zhao](https://arxiv.org/search/?searchtype=author&query=Wayne Xin Zhao), [Hua Wu](https://arxiv.org/search/?searchtype=author&query=Hua Wu), [Haifeng Wang](https://arxiv.org/search/?searchtype=author&query=Haifeng Wang) 作者：王宇豪、任睿阳、王郁成、刘婧、赵欣欣（Wayne Xin Zhao）、武华、王海峰

With the rapid advancement of large language models (LLMs), retrieval-augmented generation (RAG) has emerged as a critical approach to supplement the inherent knowledge limitations of LLMs. However, due to the typically large volume of retrieved information, RAG tends to operate with long context lengths. From the perspective of entropy engineering, we identify unconstrained entropy growth and attention dilution due to long retrieval context as significant factors affecting RAG performance. In this paper, we propose the balanced entropy-engineered RAG (BEE-RAG) framework, which improves the adaptability of RAG systems to varying context lengths through the principle of entropy invariance. By leveraging balanced context entropy to reformulate attention dynamics, BEE-RAG separates attention sensitivity from context length, ensuring a stable entropy level. Building upon this, we introduce a zero-shot inference strategy for multi-importance estimation and a parameter-efficient adaptive fine-tuning mechanism to obtain the optimal balancing factor for different settings. Extensive experiments across multiple RAG tasks demonstrate the effectiveness of BEE-RAG. 随着大型语言模型 (LLMs) 的迅速发展，检索增强生成（RAG）已成为弥补 LLMs 固有知识局限的关键方法。然而，由于检索到的信息通常量大，RAG 往往在长上下文长度下运行。从熵工程的角度来看，我们识别出由于长检索上下文导致的无约束熵增长和注意力稀释，是影响 RAG 性能的重要因素。在本文中，我们提出了平衡熵工程 RAG（BEE-RAG）框架，通过熵不变性原理提高 RAG 系统对不同上下文长度的适应性。通过利用平衡的上下文熵来重构注意力动态，BEE-RAG 将注意力敏感性与上下文长度分离，确保稳定的熵水平。在此基础上，我们引入了一种用于多重重要性估计的零样本推理策略以及一种参数高效的自适应微调机制，以在不同设置下获得最优平衡因子。在多个 RAG 任务上的大量实验证明了 BEE-RAG 的有效性。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-07 07:37:25 UTC 发布：2025-08-07 07:37:25 UTC

#30 Multimodal Fact Checking with Unified Visual, Textual, and Contextual Representations #30 使用统一的视觉、文本与上下文表示的多模态事实核查

Authors: [Aditya Kishore](https://arxiv.org/search/?searchtype=author&query=Aditya Kishore), [Gaurav Kumar](https://arxiv.org/search/?searchtype=author&query=Gaurav Kumar), [Jasabanta Patro](https://arxiv.org/search/?searchtype=author&query=Jasabanta Patro) 作者：Aditya Kishore、Gaurav Kumar、Jasabanta Patro

The growing rate of multimodal misinformation, where claims are supported by both text and images, poses significant challenges to fact-checking systems that rely primarily on textual evidence. In this work, we have proposed a unified framework for fine-grained multimodal fact verification called “MultiCheck”, designed to reason over structured textual and visual signals. Our architecture combines dedicated encoders for text and images with a fusion module that captures cross-modal relationships using element-wise interactions. A classification head then predicts the veracity of a claim, supported by a contrastive learning objective that encourages semantic alignment between claim-evidence pairs in a shared latent space. We evaluate our approach on the Factify 2 dataset, achieving a weighted F1 score of 0.84, substantially outperforming the baseline. These results highlight the effectiveness of explicit multimodal reasoning and demonstrate the potential of our approach for scalable and interpretable fact-checking in complex, real-world scenarios. 多模态错误信息的增长速度很快，其中论断同时由文本和图像支持，这对主要依赖文本证据的事实核查系统构成了重大挑战。在本工作中，我们提出了一个用于细粒度多模态事实核查的统一框架，称为“MultiCheck”，旨在对结构化的文本和视觉信号进行推理。我们的架构将专用的文本和图像编码器与一个通过逐元素交互捕捉跨模态关系的融合模块结合起来。然后分类头预测论断的真实性，并辅以对比学习目标，促使论断—证据对在共享潜在空间中实现语义对齐。我们在 Factify 2 数据集上评估了该方法，取得了 0.84 的加权 F1 分数，显著优于基线。这些结果突出了显式多模态推理的有效性，并展示了该方法在复杂真实场景中实现可扩展且可解释事实核查的潜力。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-07 07:36:53 UTC

#31 Align, Don't Divide: Revisiting the LoRA Architecture in Multi-Task Learning #31 对齐而非划分：在多任务学习中重审 LoRA 架构

Authors: [Jinda Liu](https://arxiv.org/search/?searchtype=author&query=Jinda Liu), [Bo Cheng](https://arxiv.org/search/?searchtype=author&query=Bo Cheng), [Yi Chang](https://arxiv.org/search/?searchtype=author&query=Yi Chang), [Yuan Wu](https://arxiv.org/search/?searchtype=author&query=Yuan Wu) 作者：刘金达、程博、常毅、吴远

Parameter-Efficient Fine-Tuning (PEFT) is essential for adapting Large Language Models (LLMs). In practice, LLMs are often required to handle a diverse set of tasks from multiple domains, a scenario naturally addressed by multi-task learning (MTL). Within this MTL context, a prevailing trend involves LoRA variants with multiple adapters or heads, which advocate for structural diversity to capture task-specific knowledge. Our findings present a direct challenge to this paradigm. We first show that a simplified multi-head architecture with high inter-head similarity substantially outperforms complex multi-adapter and multi-head systems. This leads us to question the multi-component paradigm itself, and we further demonstrate that a standard single-adapter LoRA, with a sufficiently increased rank, also achieves highly competitive performance. These results lead us to a new hypothesis: effective MTL generalization hinges on learning robust shared representations, not isolating task-specific features. To validate this, we propose Align-LoRA, which incorporates an explicit loss to align task representations within the shared adapter space. Experiments confirm that Align-LoRA significantly surpasses all baselines, establishing a simpler yet more effective paradigm for adapting LLMs to multiple tasks. The code is available at https://github.com/jinda-liu/Align-LoRA. 参数高效微调（PEFT）对于适配大型语言模型（LLMs）至关重要。在实际应用中，LLMs 常常需要处理来自多个领域的多样化任务，这一场景自然由多任务学习（MTL）来应对。在该 MTL 背景下，流行的趋势是采用带有多个适配器或多头的 LoRA 变体，主张通过结构多样性来捕捉任务特定知识。我们的发现直接挑战了这一范式。我们首先展示了一个简化的多头架构——具有高度头间相似性——在性能上远超复杂的多适配器和多头系统。这使我们质疑多组件范式本身，进一步证明了只要秩足够增大，标准的单适配器 LoRA 也能达到高度有竞争力的性能。这些结果引出了一个新假设：有效的 MTL 泛化依赖于学习健壮的共享表示，而非孤立的任务特定特征。为验证这一点，我们提出了 Align-LoRA，该方法在共享适配器空间中引入显式损失以对齐任务表示。实验结果确认，Align-LoRA 显著超越所有基线方法，确立了一种更简单但更有效的将 LLMs 适配到多任务的范式。代码可在 https://github.com/jinda-liu/Align-LoRA 获取。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-07 07:02:55 UTC 发布：2025-08-07 07:02:55 UTC

#32 Evaluation of LLMs in AMR Parsing #32 在 AMR 解析中对 LLMs 的评估

Author: [Shu Han Ho](https://arxiv.org/search/?searchtype=author&query=Shu Han Ho) 作者：Shu Han Ho

Meaning Representation (AMR) is a semantic formalism that encodes sentence meaning as rooted, directed, acyclic graphs, where nodes represent concepts and edges denote semantic relations. Finetuning decoder only Large Language Models (LLMs) represent a promising novel straightfoward direction for AMR parsing. This paper presents a comprehensive evaluation of finetuning four distinct LLM architectures, Phi 3.5, Gemma 2, LLaMA 3.2, and DeepSeek R1 LLaMA Distilled using the LDC2020T02 Gold AMR3.0 test set. Our results have shown that straightfoward finetuning of decoder only LLMs can achieve comparable performance to complex State of the Art (SOTA) AMR parsers. Notably, LLaMA 3.2 demonstrates competitive performance against SOTA AMR parsers given a straightforward finetuning approach. We achieved SMATCH F1: 0.804 on the full LDC2020T02 test split, on par with APT + Silver (IBM) at 0.804 and approaching Graphene Smatch (MBSE) at 0.854. Across our analysis, we also observed a consistent pattern where LLaMA 3.2 leads in semantic performance while Phi 3.5 excels in structural validity. 含义表示（AMR）是一种语义形式化方法，将句子意义编码为有根、定向、无环的图，其中节点表示概念，边表示语义关系。仅微调解码器型大型语言模型（LLMs）是 AMR 解析一个有前景且直接的新方向。本文对微调四种不同 LLM 架构——Phi 3.5、Gemma 2、LLaMA 3.2 和 DeepSeek R1 LLaMA Distilled——在 LDC2020T02 Gold AMR3.0 测试集上的表现进行了全面评估。我们的结果表明，直接微调仅解码器型 LLMs 可以达到与复杂的最先进（SOTA）AMR 解析器相当的性能。值得注意的是，LLaMA 3.2 在采用直接微调方法时表现出与 SOTA AMR 解析器相竞争的性能。我们在完整的 LDC2020T02 测试集上取得了 SMATCH F1：0.804，与 APT + Silver（IBM）的 0.804 相当，且接近 Graphene Smatch（MBSE）的 0.854。在我们的分析中，我们还观察到一个一致的模式：LLaMA 3.2 在语义性能方面领先，而 Phi 3.5 在结构有效性方面表现出色。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-07 04:43:47 UTC 发布：2025-08-07 04:43:47 协调世界时 (UTC)

#33 Dialogues Aspect-based Sentiment Quadruple Extraction via Structural Entropy Minimization Partitioning #33 对话面向方面的情感四元组抽取通过结构熵最小化分割

Dialogues Aspect-based Sentiment Quadruple Extraction (DiaASQ) aims to extract all target-aspect-opinion-sentiment quadruples from a given multi-round, multi-participant dialogue. Existing methods typically learn word relations across entire dialogues, assuming a uniform distribution of sentiment elements. However, we find that dialogues often contain multiple semantically independent sub-dialogues without clear dependencies between them. Therefore, learning word relationships across the entire dialogue inevitably introduces additional noise into the extraction process. To address this, our method focuses on partitioning dialogues into semantically independent sub-dialogues. Achieving completeness while minimizing these sub-dialogues presents a significant challenge. Simply partitioning based on reply relationships is ineffective. Instead, we propose utilizing a structural entropy minimization algorithm to partition the dialogues. This approach aims to preserve relevant utterances while distinguishing irrelevant ones as much as possible. Furthermore, we introduce a two-step framework for quadruple extraction: first extracting individual sentiment elements at the utterance level, then matching quadruples at the sub-dialogue level. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in DiaASQ with much lower computational costs. 对话基于方面的情感四元组抽取（DiaASQ）旨在从给定的多轮、多参与者对话中抽取所有目标-方面-观点-情感四元组。现有方法通常在整个对话中学习词之间的关系，假定情感元素呈均匀分布。然而，我们发现对话中往往包含多个语义上彼此独立的子对话，子对话之间没有明确的依赖关系。因此，在整个对话范围内学习词关系不可避免地会为抽取过程引入额外噪声。为了解决这一问题，我们的方法侧重于将对话划分为语义独立的子对话。在保证完整性的同时尽量减少这些子对话的数量是一项重大挑战。仅仅基于回复关系进行划分是无效的。相反，我们提出利用结构熵最小化算法对对话进行划分。该方法旨在尽可能保留相关的语句，同时尽量区分出不相关的语句。此外，我们提出了一个用于四元组抽取的两步框架：首先在话语层面提取单个情感要素，然后在子对话层面进行四元组匹配。大量实验表明，我们的方法在 DiaASQ 上以更低的计算成本实现了最先进的性能。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-07 04:22:17 UTC 发布：2025-08-07 04:22:17 协调世界时（UTC）

Background: Understanding social determinants of health (SDoH) factors contributing to suicide incidents is crucial for early intervention and prevention. However, data-driven approaches to this goal face challenges such as long-tailed factor distributions, analyzing pivotal stressors preceding suicide incidents, and limited model explainability. Methods: We present a multi-stage large language model framework to enhance SDoH factor extraction from unstructured text. Our approach was compared to other state-of-the-art language models (i.e., pre-trained BioBERT and GPT-3.5-turbo) and reasoning models (i.e., DeepSeek-R1). We also evaluated how the model’s explanations help people annotate SDoH factors more quickly and accurately. The analysis included both automated comparisons and a pilot user study. Results: We show that our proposed framework demonstrated performance boosts in the overarching task of extracting SDoH factors and in the finer-grained tasks of retrieving relevant context. Additionally, we show that fine-tuning a smaller, task-specific model achieves comparable or better performance with reduced inference costs. The multi-stage design not only enhances extraction but also provides intermediate explanations, improving model explainability. Conclusions: Our approach improves both the accuracy and transparency of extracting suicide-related SDoH from unstructured texts. These advancements have the potential to support early identification of individuals at risk and inform more effective prevention strategies. 背景：理解促成自杀事件的社会健康决定因素（SDoH）对于早期干预和预防至关重要。然而，以数据为驱动的方法在实现这一目标时面临诸多挑战，例如因素分布长尾化、解析发生自杀事件前的关键压力源，以及模型可解释性有限等问题。方法：我们提出了一个多阶段大型语言模型框架，以增强从非结构化文本中提取 SDoH 因素的能力。我们的方法与其他最先进的语言模型（如预训练的 BioBERT 和 GPT-3.5-turbo）及推理模型（如 DeepSeek-R1）进行了比较。我们还评估了模型的解释性如何帮助人工更快更准确地标注 SDoH 因素。分析包括自动化比较和一项试点用户研究。结果：我们展示了所提出的框架在提取 SDoH 因素的总体任务以及检索相关上下文的更细粒度任务中均表现出性能提升。此外，我们还表明，对较小的、针对特定任务的模型进行微调可以在降低推理成本的同时实现相当或更优的性能。多阶段设计不仅增强了抽取能力，还提供了中间解释，提升了模型的可解释性。结论：我们的方法提高了从非结构化文本中抽取与自杀相关的社会决定健康因素（SDoH）的准确性和透明度。这些进展有助于及早识别有风险的个体并为更有效的预防策略提供依据。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-07 03:36:38 UTC 发布日期：2025-08-07 03:36:38 UTC

#35 Towards Robust Evaluation of Visual Activity Recognition: Resolving Verb Ambiguity with Sense Clustering #35 朝着视觉活动识别的鲁棒评估：通过意义聚类解决动词歧义

Authors: [Louie Hong Yao](https://arxiv.org/search/?searchtype=author&query=Louie Hong Yao), [Nicholas Jarvis](https://arxiv.org/search/?searchtype=author&query=Nicholas Jarvis), [Tianyu Jiang](https://arxiv.org/search/?searchtype=author&query=Tianyu Jiang) 作者：Louie Hong Yao、Nicholas Jarvis、Tianyu Jiang

Evaluating visual activity recognition systems is challenging due to inherent ambiguities in verb semantics and image interpretation. When describing actions in images, synonymous verbs can refer to the same event (e.g., brushing vs. grooming), while different perspectives can lead to equally valid but distinct verb choices (e.g., piloting vs. operating). Standard exact-match evaluation, which relies on a single gold answer, fails to capture these ambiguities, resulting in an incomplete assessment of model performance. To address this, we propose a vision-language clustering framework that constructs verb sense clusters, providing a more robust evaluation. Our analysis of the imSitu dataset shows that each image maps to an average of 2.8 sense clusters, with each cluster representing a distinct perspective of the image. We evaluate multiple activity recognition models and compare our cluster-based evaluation with standard evaluation methods. Additionally, our human alignment analysis suggests that the cluster-based evaluation better aligns with human judgements, offering a more nuanced assessment of model performance. 评估视觉动作识别系统具有挑战性，因为动词语义和图像解读本身就存在模糊性。在描述图像中的动作时，同义动词可能指同一事件（例如 brushing 与 grooming），而不同的视角可能导致同样合理但不同的动词选择（例如 piloting 与 operating）。依赖单一标准答案的严格匹配评估无法捕捉这些模糊性，导致对模型表现的评估不完整。为了解决这一问题，我们提出了一种视觉-语言聚类框架，用于构建动词含义簇，从而提供更稳健的评估。我们对 imSitu 数据集的分析显示，每张图像平均映射到 2.8 个含义簇，每个簇代表图像的一个独特视角。我们评估了多种动作识别模型，并将基于簇的评估与标准评估方法进行了比较。此外，我们的人类一致性分析表明，基于簇的评估更符合人类判断，能够对模型性能提供更细致的评估。

Subjects: Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition 主题：计算与语言、人工智能、计算机视觉与模式识别

Publish: 2025-08-07 00:22:15 UTC 发布时间：2025-08-07 00:22:15 UTC

#36 I Think, Therefore I Am Under-Qualified? A Benchmark for Evaluating Linguistic Shibboleth Detection in LLM Hiring Evaluations #36 我思故我资质不足？用于评估 LLM 招聘评估中语言禁忌检测的基准测试

Authors: [Julia Kharchenko](https://arxiv.org/search/?searchtype=author&query=Julia Kharchenko), [Tanya Roosta](https://arxiv.org/search/?searchtype=author&query=Tanya Roosta), [Aman Chadha](https://arxiv.org/search/?searchtype=author&query=Aman Chadha), [Chirag Shah](https://arxiv.org/search/?searchtype=author&query=Chirag Shah) 作者：Julia Kharchenko、Tanya Roosta、Aman Chadha、Chirag Shah

This paper introduces a comprehensive benchmark for evaluating how Large Language Models (LLMs) respond to linguistic shibboleths: subtle linguistic markers that can inadvertently reveal demographic attributes such as gender, social class, or regional background. Through carefully constructed interview simulations using 100 validated question-response pairs, we demonstrate how LLMs systematically penalize certain linguistic patterns, particularly hedging language, despite equivalent content quality. Our benchmark generates controlled linguistic variations that isolate specific phenomena while maintaining semantic equivalence, which enables the precise measurement of demographic bias in automated evaluation systems. We validate our approach along multiple linguistic dimensions, showing that hedged responses receive 25.6% lower ratings on average, and demonstrate the benchmark’s effectiveness in identifying model-specific biases. This work establishes a foundational framework for detecting and measuring linguistic discrimination in AI systems, with broad applications to fairness in automated decision-making contexts. 本文提出了一个用于评估大型语言模型（LLMs）如何对语言信号作出反应的综合基准：这些语言信号是一些微妙的语言标记，可能无意中泄露性别、社会阶层或地区背景等人口属性。通过使用 100 组经过验证的问答对构建的精心设计的面试模拟，我们展示了 LLMs 如何系统性地对某些语言模式施加惩罚，尤其是含有回避性措辞的语言，尽管其内容质量等同。我们的基准生成受控的语言变体，在保持语义等价的同时隔离特定现象，从而能够精确测量自动评估系统中的人口偏见。我们在多个语言维度上验证了该方法，显示含回避性措辞的回答平均评分降低了 25.6%，并展示了该基准在识别模型特定偏见方面的有效性。该工作为在人工智能系统中检测和衡量语言歧视建立了基础框架，并可广泛应用于自动化决策环境中的公平性问题。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-06 23:51:03 UTC 发布：2025-08-06 23:51:03 UTC

#37 RCR-Router: Efficient Role-Aware Context Routing for Multi-Agent LLM Systems with Structured Memory #37 RCR-Router：针对具有结构化记忆的多智能体 LLM 系统的高效角色感知上下文路由

Multi-agent large language model (LLM) systems have shown strong potential in complex reasoning and collaborative decision-making tasks. However, most existing coordination schemes rely on static or full-context routing strategies, which lead to excessive token consumption, redundant memory exposure, and limited adaptability across interaction rounds. We introduce RCR-Router, a modular and role-aware context routing framework designed to enable efficient, adaptive collaboration in multi-agent LLMs. To our knowledge, this is the first routing approach that dynamically selects semantically relevant memory subsets for each agent based on its role and task stage, while adhering to a strict token budget. A lightweight scoring policy guides memory selection, and agent outputs are iteratively integrated into a shared memory store to facilitate progressive context refinement. To better evaluate model behavior, we further propose an Answer Quality Score metric that captures LLM-generated explanations beyond standard QA accuracy. Experiments on three multi-hop QA benchmarks – HotPotQA, MuSiQue, and 2WikiMultihop – demonstrate that RCR-Router reduces token usage (up to 30%) while improving or maintaining answer quality. These results highlight the importance of structured memory routing and output-aware evaluation in advancing scalable multi-agent LLM systems. 多智能体大型语言模型（LLM）系统在复杂推理和协作决策任务中展现出强大潜力。然而，大多数现有的协调方案依赖静态或全上下文路由策略，导致过度的令牌消耗、冗余的记忆暴露以及在多轮交互中的有限适应性。我们提出了 RCR-Router，一个模块化且角色感知的上下文路由框架，旨在使多智能体 LLM 中的协作变得高效且自适应。据我们所知，这是第一个基于智能体的角色和任务阶段动态为每个智能体选择语义相关记忆子集同时遵守严格令牌预算的路由方法。轻量级评分策略引导记忆选择，智能体输出被迭代地整合进共享记忆存储以促进逐步的上下文精化。为了更好地评估模型行为，我们进一步提出了一个答案质量得分（Answer Quality Score）指标，用以捕捉超出标准问答准确率的 LLM 生成解释质量。在三个多跳问答基准上——HotPotQA、MuSiQue 和 2WikiMultihop——的实验表明，RCR-Router 在减少令牌使用量（最多达 30%）的同时提升或保持了答案质量。这些结果突显了结构化记忆路由和对输出敏感的评估在推进可扩展多智能体 LLM 系统方面的重要性。

Subjects: Computation and Language, Artificial Intelligence, Multiagent Systems 科目：计算与语言，人工智能，多智能体系统

Publish: 2025-08-06 21:59:34 UTC 发表：2025-08-06 21:59:34 UTC

#38 Persistent Instability in LLM's Personality Measurements: Effects of Scale, Reasoning, and Conversation History #38 LLM 人格测量的持续不稳定性：规模、推理与对话历史的影响

Large language models require consistent behavioral patterns for safe deployment, yet their personality-like traits remain poorly understood. We present PERSIST (PERsonality Stability in Synthetic Text), a comprehensive evaluation framework testing 25+ open-source models (1B-671B parameters) across 500,000+ responses. Using traditional (BFI-44, SD3) and novel LLM-adapted personality instruments, we systematically vary question order, paraphrasing, personas, and reasoning modes. Our findings challenge fundamental deployment assumptions: (1) Even 400B+ models exhibit substantial response variability (SD > 0.4); (2) Minor prompt reordering alone shifts personality measurements by up to 20%; (3) Interventions expected to stabilize behavior, such as chain-of-thought reasoning, detailed personas instruction, inclusion of conversation history, can paradoxically increase variability; (4) LLM-adapted instruments show equal instability to human-centric versions, confirming architectural rather than translational limitations. This persistent instability across scales and mitigation strategies suggests current LLMs lack the foundations for genuine behavioral consistency. For safety-critical applications requiring predictable behavior, these findings indicate that personality-based alignment strategies may be fundamentally inadequate. 大型语言模型在安全部署上需要一致的行为模式，但它们类人格的特征仍然知之甚少。我们提出了 PERSIST（PERsonality Stability in Synthetic Text），这是一个全面的评估框架，测试了 25+ 个开源模型（1B–671B 参数），共计超过 500,000 条回应。使用传统（BFI-44、SD3）和为 LLM 调整的新型人格测量工具，我们系统性地改变问题顺序、改写、设定人物角色和推理模式。我们的发现挑战了部署的基本假设：(1) 即使是 400B+ 的模型也表现出显著的回应变异性（标准差 > 0.4）；(2) 仅仅是轻微的提示重排就能将人格测量偏移多达 20%；(3) 本应稳定行为的干预措施，例如链式思维推理、详细的人物角色指令、包含对话历史，反而可能增加变异性；(4) 为 LLM 调整的测量工具与以人为中心的版本一样不稳定，证实了这是架构层面而非翻译层面的问题。这种在不同规模和缓解策略下持续存在的不稳定性表明，当前的 LLM 缺乏真正行为一致性的基础。对于需要可预测行为的安全关键应用，这些发现表明基于个性化的对齐策略可能从根本上是不够的。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-06 19:11:33 UTC 发布：2025-08-06 19:11:33 UTC

#39 Pitch Accent Detection improves Pretrained Automatic Speech Recognition #39 音高重音检测提升预训练自动语音识别

Authors: [David Sasu](https://arxiv.org/search/?searchtype=author&query=David Sasu), [Natalie Schluter](https://arxiv.org/search/?searchtype=author&query=Natalie Schluter) 作者：David Sasu, Natalie Schluter

We show the performance of Automatic Speech Recognition (ASR) systems that use semi-supervised speech representations can be boosted by a complimentary pitch accent detection module, by introducing a joint ASR and pitch accent detection model. The pitch accent detection component of our model achieves a significant improvement on the state-of-the-art for the task, closing the gap in F1-score by 41%. Additionally, the ASR performance in joint training decreases WER by 28.3% on LibriSpeech, under limited resource fine-tuning. With these results, we show the importance of extending pretrained speech models to retain or re-learn important prosodic cues such as pitch accent. 我们展示了使用半监督语音表示的自动语音识别（ASR）系统，通过引入联合 ASR 与音高重音检测模型，辅以互补的音高重音检测模块，其性能可以得到提升。我们模型中的音高重音检测组件在该任务上取得了显著的进步，使 F1 分数缩小了 41%的差距。此外，在有限资源微调下，联合训练中的 ASR 性能在 LibriSpeech 上将词错误率（WER）降低了 28.3%。通过这些结果，我们展示了扩展预训练语音模型以保留或重新学习诸如音高重音等重要韵律线索的重要性。

Subjects: Computation and Language, Sound, Audio and Speech Processing 主题：计算与语言、声音、音频与语音处理

Publish: 2025-08-06 18:52:05 UTC 发布时间：2025-08-06 18:52:05 UTC

#40 Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization #40 偶校验感知的字节对编码：在分词中提升跨语言公平性

Authors: [Negar Foroutan](https://arxiv.org/search/?searchtype=author&query=Negar Foroutan), [Clara Meister](https://arxiv.org/search/?searchtype=author&query=Clara Meister), [Debjit Paul](https://arxiv.org/search/?searchtype=author&query=Debjit Paul), [Joel Niklaus](https://arxiv.org/search/?searchtype=author&query=Joel Niklaus), [Sina Ahmadi](https://arxiv.org/search/?searchtype=author&query=Sina Ahmadi), [Antoine Bosselut](https://arxiv.org/search/?searchtype=author&query=Antoine Bosselut), [Rico Sennrich](https://arxiv.org/search/?searchtype=author&query=Rico Sennrich) 作者：Negar Foroutan、Clara Meister、Debjit Paul、Joel Niklaus、Sina Ahmadi、Antoine Bosselut、Rico Sennrich

Tokenization is the first – and often least scrutinized – step of most NLP pipelines. Standard algorithms for learning tokenizers rely on frequency-based objectives, which favor languages dominant in the training data and consequently leave lower-resource languages with tokenizations that are disproportionately longer, morphologically implausible, or even riddled with <UNK> placeholders. This phenomenon ultimately amplifies computational and financial inequalities between users from different language backgrounds. To remedy this, we introduce Parity-aware Byte Pair Encoding (BPE), a variant of the widely-used BPE algorithm. At every merge step, Parity-aware BPE maximizes the compression gain of the currently worst-compressed language, trading a small amount of global compression for cross-lingual parity. We find empirically that Parity-aware BPE leads to more equitable token counts across languages, with negligible impact on global compression rate and no substantial effect on language-model performance in downstream tasks. 分词（Tokenization）是大多数自然语言处理流程中的第一步——也是经常最少被审视的一步。用于学习分词器的标准算法依赖基于频率的目标函数，这使得训练数据中占主导地位的语言获得优待，从而导致资源较少的语言得到的分词结果在长度上不成比例地更长、形态上不合理，甚至充斥着<UNK>占位符。该现象最终加剧了不同语言背景用户之间在计算和经济上的不平等。为了解决这个问题，我们提出了“兼顾公平性的字节对编码”（Parity-aware Byte Pair Encoding，BPE），这是广泛使用的 BPE 算法的一种变体。在每一次合并步骤中，兼顾公平性的 BPE 最大化当前压缩最差语言的压缩增益，以在牺牲少量全局压缩率的前提下换取跨语言的公平性。我们的实证结果表明，兼顾公平性的 BPE 能够使各语言之间的标记数量更为均衡，对全局压缩率的影响可以忽略不计，并且对下游任务中的语言模型性能没有实质性影响。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-06 18:14:43 UTC 发布：2025-08-06 18:14:43 UTC

#41 Enhancing Dialogue Annotation with Speaker Characteristics Leveraging a Frozen LLM #41 使用冻结的 LLM 通过说话人特征增强对话标注

Authors: [Thomas Thebaud](https://arxiv.org/search/?searchtype=author&query=Thomas Thebaud), [Yen-Ju Lu](https://arxiv.org/search/?searchtype=author&query=Yen-Ju Lu), [Matthew Wiesner](https://arxiv.org/search/?searchtype=author&query=Matthew Wiesner), [Peter Viechnicki](https://arxiv.org/search/?searchtype=author&query=Peter Viechnicki), [Najim Dehak](https://arxiv.org/search/?searchtype=author&query=Najim Dehak) 作者：Thomas Thebaud、Yen-Ju Lu、Matthew Wiesner、Peter Viechnicki、Najim Dehak

In dialogue transcription pipelines, Large Language Models (LLMs) are frequently employed in post-processing to improve grammar, punctuation, and readability. We explore a complementary post-processing step: enriching transcribed dialogues by adding metadata tags for speaker characteristics such as age, gender, and emotion. Some of the tags are global to the entire dialogue, while some are time-variant. Our approach couples frozen audio foundation models, such as Whisper or WavLM, with a frozen LLAMA language model to infer these speaker attributes, without requiring task-specific fine-tuning of either model. Using lightweight, efficient connectors to bridge audio and language representations, we achieve competitive performance on speaker profiling tasks while preserving modularity and speed. Additionally, we demonstrate that a frozen LLAMA model can compare x-vectors directly, achieving an Equal Error Rate of 8.8% in some scenarios. 在对话转录流程中，大型语言模型（LLMs）常被用于后处理，以改善语法、标点和可读性。我们探讨一种互补的后处理步骤：通过为说话人添加元数据标签（如年龄、性别和情绪）来丰富转录的对话。其中一些标签适用于整个对话的全局属性，而另一些则随时间变化。我们的方法将冻结的音频基础模型（如 Whisper 或 WavLM）与冻结的 LLAMA 语言模型相结合，以推断这些说话人属性，而无需对任一模型进行任务特定的微调。通过使用轻量且高效的连接器来桥接音频与语言表示，我们在保持模块化与速度的同时，在说话人画像任务上取得了有竞争力的表现。此外，我们还证明了冻结的 LLAMA 模型可以直接比较 x-vectors，在某些场景下实现了 8.8% 的等错误率（Equal Error Rate）。

Subjects: Computation and Language, Artificial Intelligence, Sound, Audio and Speech Processing 主题：计算与语言、人工智能、声音、音频与语音处理

Publish: 2025-08-06 18:14:04 UTC 发布时间：2025-08-06 18:14:04 UTC

#42 Test-Time Reinforcement Learning for GUI Grounding via Region Consistency #42 基于区域一致性的 GUI 定位测试时强化学习

Graphical User Interface (GUI) grounding, the task of mapping natural language instructions to precise screen coordinates, is fundamental to autonomous GUI agents. While existing methods achieve strong performance through extensive supervised training or reinforcement learning with labeled rewards, they remain constrained by the cost and availability of pixel-level annotations. We observe that when models generate multiple predictions for the same GUI element, the spatial overlap patterns reveal implicit confidence signals that can guide more accurate localization. Leveraging this insight, we propose GUI-RC (Region Consistency), a test-time scaling method that constructs spatial voting grids from multiple sampled predictions to identify consensus regions where models show highest agreement. Without any training, GUI-RC improves accuracy by 2-3% across various architectures on ScreenSpot benchmarks. We further introduce GUI-RCPO (Region Consistency Policy Optimization), which transforms these consistency patterns into rewards for test-time reinforcement learning. By computing how well each prediction aligns with the collective consensus, GUI-RCPO enables models to iteratively refine their outputs on unlabeled data during inference. Extensive experiments demonstrate the generality of our approach: GUI-RC boosts Qwen2.5-VL-3B-Instruct from 80.11% to 83.57% on ScreenSpot-v2, while GUI-RCPO further improves it to 85.14% through self-supervised optimization. Our approach reveals the untapped potential of test-time scaling and test-time reinforcement learning for GUI grounding, offering a promising path toward more robust and data-efficient GUI agents. 图形用户界面（GUI）定位，将自然语言指令映射到精确屏幕坐标的任务，对于自主 GUI 代理至关重要。尽管现有方法通过大量监督训练或带有标注奖励的强化学习取得了强劲的性能，但它们仍受限于像素级注释的成本和可获得性。我们观察到，当模型为同一 GUI 元素生成多个预测时，空间重叠模式会揭示可用于更准确定位的隐含置信信号。利用这一洞见，我们提出了 GUI-RC（区域一致性），这是一种测试时扩展方法，通过从多次采样预测构建空间投票网格以识别模型显示最高一致性的共识区域。无需任何训练，GUI-RC 在 ScreenSpot 基准上对各种架构的准确率提高了 2–3%。我们进一步引入了 GUI-RCPO（区域一致性策略优化），将这些一致性模式转化为测试时强化学习的奖励。通过计算每个预测与集体共识的契合度，GUI-RCPO 使模型在推理时能在无标注数据上迭代地优化其输出。大量实验展示了我们方法的通用性：在 ScreenSpot-v2 上，GUI-RC 将 Qwen2.5-VL-3B-Instruct 的表现从 80.11% 提升到 83.57%，而通过自监督优化，GUI-RCPO 又将其进一步提高到 85.14%。我们的方法揭示了测试时扩展和测试时强化学习在 GUI 定位方面尚未被充分挖掘的潜力，为构建更鲁棒、更高数据效率的 GUI 代理提供了有前景的路径。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language 主题：计算机视觉与模式识别、人工智能、计算与语言

Publish: 2025-08-07 17:54:27 UTC 发布时间：2025-08-07 17:54:27 UTC

#43 Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision #43 Uni-cot：迈向跨文本与视觉的统一链式思维推理

Chain-of-Thought (CoT) reasoning has been widely adopted to enhance Large Language Models (LLMs) by decomposing complex tasks into simpler, sequential subtasks. However, extending CoT to vision-language reasoning tasks remains challenging, as it often requires interpreting transitions of visual states to support reasoning. Existing methods often struggle with this due to limited capacity of modeling visual state transitions or incoherent visual trajectories caused by fragmented architectures. To overcome these limitations, we propose Uni-CoT, a Unified Chain-of-Thought framework that enables coherent and grounded multimodal reasoning within a single unified model. The key idea is to leverage a model capable of both image understanding and generation to reason over visual content and model evolving visual states. However, empowering a unified model to achieve that is non-trivial, given the high computational cost and the burden of training. To address this, Uni-CoT introduces a novel two-level reasoning paradigm: A Macro-Level CoT for high-level task planning and A Micro-Level CoT for subtask execution. This design significantly reduces the computational overhead. Furthermore, we introduce a structured training paradigm that combines interleaved image-text supervision for macro-level CoT with multi-task objectives for micro-level CoT. Together, these innovations allow Uni-CoT to perform scalable and coherent multi-modal reasoning. Furthermore, thanks to our design, all experiments can be efficiently completed using only 8 A100 GPUs with 80GB VRAM each. Experimental results on reasoning-driven image generation benchmark (WISE) and editing benchmarks (RISE and KRIS) indicates that Uni-CoT demonstrates SOTA performance and strong generalization, establishing Uni-CoT as a promising solution for multi-modal reasoning. Project Page and Code: https://sais-fuxi.github.io/projects/uni-cot/ 链式思维（Chain-of-Thought，CoT）推理已被广泛采用于通过将复杂任务分解为更简单的顺序子任务来增强大型语言模型（LLMs）。然而，将 CoT 扩展到视觉-语言推理任务仍然具有挑战性，因为这通常需要解释视觉状态的转变以支持推理。现有方法常因建模视觉状态转变能力有限或由于架构碎片化导致的视觉轨迹不连贯而难以应对。为克服这些限制，我们提出了 Uni-CoT，一种统一的 Chain-of-Thought 框架，使单一统一模型能够在连贯且有依据的多模态推理中运作。其关键思想是利用既能理解图像又能生成图像的模型来对视觉内容进行推理并建模不断演变的视觉状态。然而，使一个统一模型具备此能力并非易事，因其计算成本高且训练负担大。为解决此问题，Uni-CoT 引入了一种新颖的两级推理范式：用于高层任务规划的宏观 CoT（Macro-Level CoT）和用于子任务执行的微观 CoT（Micro-Level CoT）。这一设计显著降低了计算开销。此外，我们提出了一种结构化训练范式，将用于宏观级链式思维（CoT）的交错图文监督与用于微观级链式思维的多任务目标相结合。结合这些创新，Uni-CoT 能够执行可扩展且连贯的多模态推理。此外，得益于我们的设计，所有实验都可以仅使用 8 块每块 80GB 显存的 A100 GPU 高效完成。在以推理为驱动的图像生成基准（WISE）和编辑基准（RISE 和 KRIS）上的实验结果表明，Uni-CoT 展示了最先进的性能和强大的泛化能力，确立了 Uni-CoT 作为一个有前景的多模态推理解决方案。项目页面和代码： https://sais-fuxi.github.io/projects/uni-cot/

Subjects: Computer Vision and Pattern Recognition, Computation and Language 主题：计算机视觉与模式识别，计算与语言

Publish: 2025-08-07 17:45:17 UTC 发布：2025-08-07 17:45:17 UTC

#44 Iterative Learning of Computable Phenotypes for Treatment Resistant Hypertension using Large Language Models #44 使用大型语言模型对难治性高血压可计算表型进行迭代学习以指导治疗

Authors: [Guilherme Seidyo Imai Aldeia](https://arxiv.org/search/?searchtype=author&query=Guilherme Seidyo Imai Aldeia), [Daniel S. Herman](https://arxiv.org/search/?searchtype=author&query=Daniel S. Herman), [William G. La Cava](https://arxiv.org/search/?searchtype=author&query=William G. La Cava) 作者：Guilherme Seidyo Imai Aldeia、Daniel S. Herman、William G. La Cava

Large language models (LLMs) have demonstrated remarkable capabilities for medical question answering and programming, but their potential for generating interpretable computable phenotypes (CPs) is under-explored. In this work, we investigate whether LLMs can generate accurate and concise CPs for six clinical phenotypes of varying complexity, which could be leveraged to enable scalable clinical decision support to improve care for patients with hypertension. In addition to evaluating zero-short performance, we propose and test a synthesize, execute, debug, instruct strategy that uses LLMs to generate and iteratively refine CPs using data-driven feedback. Our results show that LLMs, coupled with iterative learning, can generate interpretable and reasonably accurate programs that approach the performance of state-of-the-art ML methods while requiring significantly fewer training examples. 大型语言模型（LLMs）在医学问答和编程方面表现出显著能力，但它们在生成可解释的可计算表型（CPs）方面的潜力尚未得到充分探索。在本研究中，我们考察了 LLMs 是否能够为六种复杂程度不同的临床表型生成准确且简洁的 CPs，这些 CPs 可用于提供可扩展的临床决策支持，以改善高血压患者的护理。除了评估零次学习（zero-shot）性能外，我们提出并测试了一种“合成、执行、调试、指导”（synthesize, execute, debug, instruct）策略，使用 LLMs 在数据驱动反馈的基础上生成并迭代完善 CPs。我们的结果表明，结合迭代学习的 LLMs 能够生成可解释且相当准确的程序，其性能接近最先进的机器学习方法，同时所需的训练示例显著更少。

Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题：机器学习、人工智能、计算与语言

Publish: 2025-08-07 17:15:17 UTC 发布：2025-08-07 17:15:17 UTC

#45 Fairy±i: the First 2-bit Complex LLM with All Parameters in #45Fairy ±i ：首个所有参数均为的 2 位复数 LLM

Quantization-Aware Training (QAT) integrates quantization into the training loop, enabling LLMs to learn robust low-bit representations, and is widely recognized as one of the most promising research directions. All current QAT research focuses on minimizing quantization error on full-precision models, where the full-precision accuracy acts as an upper bound (accuracy ceiling). No existing method has even attempted to surpass this ceiling. To break this ceiling, we propose a new paradigm: raising the ceiling (full-precision model), and then still quantizing it efficiently into 2 bits. We propose Fairy±i, the first 2-bit quantization framework for complex-valued LLMs. Specifically, our method leverages the representational advantages of the complex domain to boost full-precision accuracy. We map weights to the fourth roots of unity {±1,±i}, forming a perfectly symmetric and information-theoretically optimal 2-bit representation. Importantly, each quantized weight has either a zero real or imaginary part, enabling multiplication-free inference using only additions and element swaps. Experimental results show that Fairy±i outperforms the ceiling of existing 2-bit quantization approaches in terms of both PPL and downstream tasks, while maintaining strict storage and compute efficiency. This work opens a new direction for building highly accurate and practical LLMs under extremely low-bit constraints. 量化感知训练（QAT）将量化融入训练循环，使 LLMs 能够学习鲁棒的低比特表示，并被广泛认为是最有前途的研究方向之一。所有现有的 QAT 研究都集中在最小化全精度模型的量化误差，全精度准确率作为上限（准确率天花板）。目前没有任何方法尝试超越该天花板。为打破该天花板，我们提出了一种新范式：提升天花板（全精度模型），然后仍然将其高效量化为 2 比特。我们提出了 Fairy ±i ，首个用于复值 LLMs 的 2 比特量化框架。具体来说，我们的方法利用复数域的表示优势来提升全精度准确率。我们将权重映射到四次单位根 {±1,±i} ，形成一个完美对称且从信息论角度最优的 2 比特表示。重要的是，每个量化权重的实部或虚部为零，从而可在推理时仅通过加法和元素交换实现无乘法运算。实验结果表明，Fairy ±i 在困惑度（PPL）和下游任务方面均优于现有 2 位量化方法的上限，同时保持严格的存储和计算效率。这项工作为在极低位数限制下构建高精度且实用的 LLMs 开辟了新的方向。

Subjects: Machine Learning, Computation and Language 主题：机器学习，计算与语言

Publish: 2025-08-07 17:02:23 UTC 发布：2025-08-07 17:02:23 协调世界时（UTC）

#46 SPGISpeech 2.0: Transcribed multi-speaker financial audio for speaker-tagged transcription #46 SPGISpeech 2.0：带说话人标注的多说话人金融音频转录 [PDF ] [Copy] [Kimi ] [REL]

We introduce SPGISpeech 2.0, a dataset suitable for speaker-tagged transcription in the financial domain. SPGISpeech 2.0 improves the diversity of applicable modeling tasks while maintaining the core characteristic of the original SPGISpeech dataset: audio snippets and their corresponding fully formatted text transcriptions, usable for end-to-end automatic speech recognition (ASR). SPGISpeech 2.0 consists of 3,780 additional hours of professionally transcribed earnings calls. Furthermore, the dataset contains call and speaker information for each audio snippet facilitating multi-talker ASR. We validate the utility of SPGISpeech 2.0 through improvements in speaker-tagged ASR performance of popular speech recognition models after fine-tuning on SPGISpeech 2.0. Released free for non-commercial use, we expect SPGISpeech 2.0 to foster advancements in speech recognition technologies and inspire a wide range of research applications. 我们介绍了 SPGISpeech 2.0，这是一个适用于金融领域带说话人标注转录的数据集。SPGISpeech 2.0 在保持原始 SPGISpeech 数据集核心特征的同时提升了适用建模任务的多样性：音频片段及其对应的完全格式化文本转录，可用于端到端自动语音识别（ASR）。SPGISpeech 2.0 包含额外 3,780 小时由专业人员转录的财报电话会议音频。此外，该数据集为每个音频片段提供了通话和说话人信息，便于多说话人 ASR 的研究。我们通过在 SPGISpeech 2.0 上微调流行语音识别模型并观察说话人标注 ASR 性能的提升来验证其有效性。该数据集以非商业用途免费发布，我们期望 SPGISpeech 2.0 能推动语音识别技术的发展并激发广泛的研究应用。

Subjects: Sound, Computation and Language, Audio and Speech Processing 主题：声音、计算与语言、音频与语音处理

Publish: 2025-08-07 16:35:29 UTC 发布：2025-08-07 16:35:29 UTC

#47 Mixed-Initiative Dialog for Human-Robot Collaborative Manipulation #47 混合主动对话用于人机协作操作

Authors: [Albert Yu](https://arxiv.org/search/?searchtype=author&query=Albert Yu), [Chengshu Li](https://arxiv.org/search/?searchtype=author&query=Chengshu Li), [Luca Macesanu](https://arxiv.org/search/?searchtype=author&query=Luca Macesanu), [Arnav Balaji](https://arxiv.org/search/?searchtype=author&query=Arnav Balaji), [Ruchira Ray](https://arxiv.org/search/?searchtype=author&query=Ruchira Ray), [Raymond Mooney](https://arxiv.org/search/?searchtype=author&query=Raymond Mooney), [Roberto Martín-Martín](https://arxiv.org/search/?searchtype=author&query=Roberto Martín-Martín) 作者：Albert Yu, Chengshu Li, Luca Macesanu, Arnav Balaji, Ruchira Ray, Raymond Mooney, Roberto Martín-Martín

Effective robotic systems for long-horizon human-robot collaboration must adapt to a wide range of human partners, whose physical behavior, willingness to assist, and understanding of the robot’s capabilities may change over time. This demands a tightly coupled communication loop that grants both agents the flexibility to propose, accept, or decline requests as they coordinate toward completing the task effectively. We apply a Mixed-Initiative dialog paradigm to Collaborative human-roBot teaming and propose MICoBot, a system that handles the common scenario where both agents, using natural language, take initiative in formulating, accepting, or rejecting proposals on who can best complete different steps of a task. To handle diverse, task-directed dialog, and find successful collaborative strategies that minimize human effort, MICoBot makes decisions at three levels: (1) a meta-planner considers human dialog to formulate and code a high-level collaboration strategy, (2) a planner optimally allocates the remaining steps to either agent based on the robot’s capabilities (measured by a simulation-pretrained affordance model) and the human’s estimated availability to help, and (3) an action executor decides the low-level actions to perform or words to say to the human. Our extensive evaluations in simulation and real-world – on a physical robot with 18 unique human participants over 27 hours – demonstrate the ability of our method to effectively collaborate with diverse human users, yielding significantly improved task success and user experience than a pure LLM baseline and other agent allocation models. See additional videos and materials at https://robin-lab.cs.utexas.edu/MicoBot/. 面向长时程人机协作的高效机器人系统必须适应各种人类合作伙伴，这些人类的身体行为、愿意协助的程度以及对机器人能力的理解可能会随着时间而变化。这就需要一个紧密耦合的通信回路，使双方在协调完成任务时都有灵活性来提出、接受或拒绝请求。我们将混合主动对话范式应用于协作人机团队，并提出了 MICoBot，一个处理常见情形的系统：在该情形中，双方使用自然语言，在制定、接受或拒绝关于谁最适合完成任务不同步骤的提议时都可主动发起。为处理多样的、以任务为导向的对话，并找到能将人类付出最小化的成功协作策略，MICoBot 在三个层面上做出决策：(1) 元规划器考虑人类对话以制定并编码高层次的协作策略，(2) 规划器基于机器人的能力（由模拟预训练的可供性模型衡量）和对人类可协助性估计，最优地将剩余步骤分配给任一代理，(3) 动作执行器决定要执行的低层次动作或对人类要说的话。我们在模拟和现实世界中进行了大量评估——在一台实体机器人上与 18 位不同的人类参与者进行了累计 27 小时的实验——结果表明我们的方法能够有效地与多样的人类用户协作，在任务成功率和用户体验方面显著优于纯 LLM 基线和其他代理分配模型。更多视频和资料见 https://robin-lab.cs.utexas.edu/MicoBot/。

Subjects: Robotics, Computation and Language, Human-Computer Interaction, Machine Learning, Multiagent Systems 主题：机器人学、计算与语言、人机交互、机器学习、多智能体系统

Publish: 2025-08-07 16:09:12 UTC 发布：2025-08-07 16:09:12 UTC

#48 MELLA: Bridging Linguistic Capability and Cultural Groundedness for Low-Resource Language MLLMs #48 MELLA：为低资源语言的大型多语言模型架起语言能力与文化扎根性的桥梁

Authors: [Yufei Gao](https://arxiv.org/search/?searchtype=author&query=Yufei Gao), [Jiaying Fei](https://arxiv.org/search/?searchtype=author&query=Jiaying Fei), [Nuo Chen](https://arxiv.org/search/?searchtype=author&query=Nuo Chen), [Ruirui Chen](https://arxiv.org/search/?searchtype=author&query=Ruirui Chen), [Guohang Yan](https://arxiv.org/search/?searchtype=author&query=Guohang Yan), [Yunshi Lan](https://arxiv.org/search/?searchtype=author&query=Yunshi Lan), [Botian Shi](https://arxiv.org/search/?searchtype=author&query=Botian Shi) 作者：高宇飞、费佳颖、陈诺、陈蕊蕊、颜国航、兰云诗、石博天

Multimodal Large Language Models (MLLMs) have shown remarkable performance in high-resource languages. However, their effectiveness diminishes significantly in the contexts of low-resource languages. Current multilingual enhancement methods are often limited to text modality or rely solely on machine translation. While such approaches help models acquire basic linguistic capabilities and produce “thin descriptions”, they neglect the importance of multimodal informativeness and cultural groundedness, both of which are crucial for serving low-resource language users effectively. To bridge this gap, in this study, we identify two significant objectives for a truly effective MLLM in low-resource language settings, namely 1) linguistic capability and 2) cultural groundedness, placing special emphasis on cultural awareness. To achieve these dual objectives, we propose a dual-source strategy that guides the collection of data tailored to each goal, sourcing native web alt-text for culture and MLLM-generated captions for linguistics. As a concrete implementation, we introduce MELLA, a multimodal, multilingual dataset. Experiment results show that after fine-tuning on MELLA, there is a general performance improvement for the eight languages on various MLLM backbones, with models producing “thick descriptions”. We verify that the performance gains are from both cultural knowledge enhancement and linguistic capability enhancement. Our dataset can be found at https://opendatalab.com/applyMultilingualCorpus. 多模态大型语言模型（MLLMs）在高资源语言中表现出色。然而，它们在低资源语言环境中的有效性显著下降。当前的多语种增强方法通常仅限于文本模态或完全依赖机器翻译。尽管这些方法有助于模型获得基本的语言能力并生成“肤浅的描述”，但它们忽视了多模态信息量和文化根植性的重要性，而这两者对于有效服务低资源语言用户至关重要。为弥合这一差距，本研究确定了在低资源语言环境中真正有效的 MLLM 应实现的两个重要目标，分别是 1）语言能力和 2）文化根植性，并特别强调文化意识。为实现这两个目标，我们提出了一种双源策略，指导为各目标量身定制的数据收集，针对文化使用本地网页替代文本，针对语言学使用 MLLM 生成的图注。作为具体实现，我们引入了 MELLA，一个多模态、多语种的数据集。实验结果表明，在 MELLA 上进行微调后，八种语言在各种多语言大模型骨干网络上总体性能都有所提升，模型生成了“丰富的描述”。我们验证了性能提升既来自于文化知识的增强，也来自于语言能力的提升。我们的数据集可在 https://opendatalab.com/applyMultilingualCorpus 获取。

Subjects: Computer Vision and Pattern Recognition, Computation and Language 主题：计算机视觉与模式识别，计算与语言

Publish: 2025-08-07 15:36:24 UTC 发布：2025-08-07 15:36:24 UTC

#49 Can Large Language Models Generate Effective Datasets for Emotion Recognition in Conversations? #49 大型语言模型能为对话中的情感识别生成有效的数据集吗？ [PDF ] [Copy] [Kimi ] [REL]

Authors: [Burak Can Kaplan](https://arxiv.org/search/?searchtype=author&query=Burak Can Kaplan), [Hugo Cesar De Castro Carneiro](https://arxiv.org/search/?searchtype=author&query=Hugo Cesar De Castro Carneiro), [Stefan Wermter](https://arxiv.org/search/?searchtype=author&query=Stefan Wermter) 作者：Burak Can Kaplan、Hugo Cesar De Castro Carneiro、Stefan Wermter

Emotion recognition in conversations (ERC) focuses on identifying emotion shifts within interactions, representing a significant step toward advancing machine intelligence. However, ERC data remains scarce, and existing datasets face numerous challenges due to their highly biased sources and the inherent subjectivity of soft labels. Even though Large Language Models (LLMs) have demonstrated their quality in many affective tasks, they are typically expensive to train, and their application to ERC tasks–particularly in data generation–remains limited. To address these challenges, we employ a small, resource-efficient, and general-purpose LLM to synthesize ERC datasets with diverse properties, supplementing the three most widely used ERC benchmarks. We generate six novel datasets, with two tailored to enhance each benchmark. We evaluate the utility of these datasets to (1) supplement existing datasets for ERC classification, and (2) analyze the effects of label imbalance in ERC. Our experimental results indicate that ERC classifier models trained on the generated datasets exhibit strong robustness and consistently achieve statistically significant performance improvements on existing ERC benchmarks. 对话情感识别（ERC）致力于识别交互中的情感变化，是推动机器智能发展的重要一步。然而，ERC 数据仍然稀缺，现有数据集由于来源高度偏颇和软标签的固有主观性而面临诸多挑战。尽管大型语言模型（LLMs）在许多情感任务中已展现出高质量，但它们通常训练成本高昂，且在 ERC 任务——尤其是数据生成方面的应用仍然有限。为应对这些挑战，我们采用了一个小型、资源高效且通用的 LLM 来合成具有多样特性的 ERC 数据集，以补充三个最常用的 ERC 基准。我们生成了六个新数据集，其中每个基准针对性增强了两个数据集。我们评估了这些数据集的实用性，具体包括（1）作为 ERC 分类的现有数据集的补充，以及（2）分析 ERC 中标签不平衡的影响。我们的实验结果表明，在生成的数据集上训练的情感角色识别（ERC）分类器模型表现出强大的鲁棒性，并在现有的 ERC 基准测试上持续获得具有统计显著性的性能提升。

Subjects: Artificial Intelligence, Computation and Language 主题: 人工智能, 计算与语言

Publish: 2025-08-07 15:13:55 UTC 发表: 2025-08-07 15:13:55 UTC

#50 Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance? #50 Bench-2-CoP：我们能信任用于欧盟人工智能合规性的基准测试吗？

Authors: [Matteo Prandi](https://arxiv.org/search/?searchtype=author&query=Matteo Prandi), [Vincenzo Suriani](https://arxiv.org/search/?searchtype=author&query=Vincenzo Suriani), [Federico Pierucci](https://arxiv.org/search/?searchtype=author&query=Federico Pierucci), [Marcello Galisai](https://arxiv.org/search/?searchtype=author&query=Marcello Galisai), [Daniele Nardi](https://arxiv.org/search/?searchtype=author&query=Daniele Nardi), [Piercosma Bisconti](https://arxiv.org/search/?searchtype=author&query=Piercosma Bisconti) 作者：Matteo Prandi、Vincenzo Suriani、Federico Pierucci、Marcello Galisai、Daniele Nardi、Piercosma Bisconti

The rapid advancement of General Purpose AI (GPAI) models necessitates robust evaluation frameworks, especially with emerging regulations like the EU AI Act and its associated Code of Practice (CoP). Current AI evaluation practices depend heavily on established benchmarks, but these tools were not designed to measure the systemic risks that are the focus of the new regulatory landscape. This research addresses the urgent need to quantify this “benchmark-regulation gap.” We introduce Bench-2-CoP, a novel, systematic framework that uses validated LLM-as-judge analysis to map the coverage of 194,955 questions from widely-used benchmarks against the EU AI Act’s taxonomy of model capabilities and propensities. Our findings reveal a profound misalignment: the evaluation ecosystem is overwhelmingly focused on a narrow set of behavioral propensities, such as “Tendency to hallucinate” (53.7% of the corpus) and “Discriminatory bias” (28.9%), while critical functional capabilities are dangerously neglected. Crucially, capabilities central to loss-of-control scenarios, including evading human oversight, self-replication, and autonomous AI development, receive zero coverage in the entire benchmark corpus. This translates to a near-total evaluation gap for systemic risks like “Loss of Control” (0.4% coverage) and “Cyber Offence” (0.8% coverage). This study provides the first comprehensive, quantitative analysis of this gap, offering critical insights for policymakers to refine the CoP and for developers to build the next generation of evaluation tools, ultimately fostering safer and more compliant AI. 通用人工智能（GPAI）模型的快速发展需要强有力的评估框架，尤其在欧盟人工智能法案及其附带的行为准则（CoP）等新兴监管背景下。当前的 AI 评估实践在很大程度上依赖既有基准测试，但这些工具并非为衡量新监管环境所关注的系统性风险而设计。本研究旨在量化这一“基准与监管差距”的紧迫需求。我们提出了 Bench-2-CoP，一种新颖的系统性框架，利用经过验证的 LLM-as-judge 分析，将广泛使用的基准测试中 194,955 个问题映射到欧盟人工智能法案关于模型能力与倾向性的分类法上。我们的研究结果揭示了严重的不匹配：评估生态系统压倒性地集中于一小部分行为倾向，例如“倾向于幻觉”（占语料的 53.7%）和“歧视性偏见”（28.9%），而关键的功能性能力却被危险地忽视。关键的是，与失控场景密切相关的能力，包括规避人工监督、自我复制和自主的 AI 开发，在整个基准语料库中完全没有覆盖。这导致对“失控”（覆盖率为 0.4%）和“网络进攻”（覆盖率为 0.8%）等系统性风险的评估几乎完全空白。本研究提供了对此差距的首个全面定量分析，为政策制定者改进 CoP 以及为开发者构建下一代评估工具提供了关键见解，最终促进更安全、更合规的人工智能。

Subjects: Artificial Intelligence, Computation and Language 主题：人工智能，计算与语言

Publish: 2025-08-07 15:03:39 UTC 发布：2025-08-07 15:03:39 UTC

#51 A Novel Architecture for Symbolic Reasoning with Decision Trees and LLM Agents #51 一种结合决策树与 LLM 代理的符号推理新架构

Author: [Andrew Kiruluta](https://arxiv.org/search/?searchtype=author&query=Andrew Kiruluta) 作者：Andrew Kiruluta

We propose a hybrid architecture that integrates decision tree-based symbolic reasoning with the generative capabilities of large language models (LLMs) within a coordinated multi-agent framework. Unlike prior approaches that loosely couple symbolic and neural modules, our design embeds decision trees and random forests as callable oracles within a unified reasoning system. Tree-based modules enable interpretable rule inference and causal logic, while LLM agents handle abductive reasoning, generalization, and interactive planning. A central orchestrator maintains belief state consistency and mediates communication across agents and external tools, enabling reasoning over both structured and unstructured inputs. The system achieves strong performance on reasoning benchmarks. On \textit{ProofWriter}, it improves entailment consistency by +7.2% through logic-grounded tree validation. On GSM8k, it achieves +5.3% accuracy gains in multistep mathematical problems via symbolic augmentation. On \textit{ARC}, it boosts abstraction accuracy by +6.0% through integration of symbolic oracles. Applications in clinical decision support and scientific discovery show how the system encodes domain rules symbolically while leveraging LLMs for contextual inference and hypothesis generation. This architecture offers a robust, interpretable, and extensible solution for general-purpose neuro-symbolic reasoning. 我们提出了一种混合架构，在一个协调的多智能体框架内将基于决策树的符号推理与大型语言模型（LLMs）的生成能力整合在一起。与以往松散耦合符号与神经模块的方法不同，我们的设计将决策树和随机森林作为可调用的神谕嵌入到统一的推理系统中。基于树的模块使可解释的规则推断和因果逻辑成为可能，而 LLM 代理则处理溯因推理、泛化和交互式规划。一个中央协调器维护信念状态的一致性并调节代理与外部工具之间的通信，使其能够对结构化和非结构化输入进行推理。该系统在推理基准测试上取得了优异表现。在 ProofWriter 上，通过基于逻辑的树验证将蕴含一致性提高了 +7.2%。在 GSM8k 上，通过符号增强在多步数学问题上取得了 +5.3% 的准确率提升。在 ARC 上，通过集成符号神谕将抽象化准确率提升了 +6.0%。在临床决策支持和科学发现中的应用展示了该系统如何以符号方式编码领域规则，同时利用 LLMs 进行上下文推断和假设生成。这种架构为通用的神经符号推理提供了一个稳健、可解释且可扩展的解决方案。

Subjects: Artificial Intelligence, Computation and Language 主题：人工智能、计算与语言

Publish: 2025-08-07 12:11:53 UTC 发布时间：2025-08-07 12:11:53 UTC

#52 Understanding and Mitigating Errors of LLM-Generated RTL Code #52 理解并缓解 LLM 生成的 RTL 代码错误

Authors: [Jiazheng Zhang](https://arxiv.org/search/?searchtype=author&query=Jiazheng Zhang), [Cheng Liu](https://arxiv.org/search/?searchtype=author&query=Cheng Liu), [Huawei Li](https://arxiv.org/search/?searchtype=author&query=Huawei Li) 作者：张嘉政、刘成、李华为

Despite the promising potential of large language model (LLM) based register-transfer-level (RTL) code generation, the overall success rate remains unsatisfactory. Errors arise from various factors, with limited understanding of specific failure causes hindering improvement. To address this, we conduct a comprehensive error analysis and manual categorization. Our findings reveal that most errors stem not from LLM reasoning limitations, but from insufficient RTL programming knowledge, poor understanding of circuit concepts, ambiguous design descriptions, or misinterpretation of complex multimodal inputs. Leveraging in-context learning, we propose targeted error correction techniques. Specifically, we construct a domain-specific knowledge base and employ retrieval-augmented generation (RAG) to supply necessary RTL knowledge. To mitigate ambiguity errors, we introduce design description rules and implement a rule-checking mechanism. For multimodal misinterpretation, we integrate external tools to convert inputs into LLM-compatible meta-formats. For remaining errors, we adopt an iterative debugging loop (simulation-error localization-correction). Integrating these techniques into an LLM-based framework significantly improves performance. We incorporate these error correction techniques into a foundational LLM-based RTL code generation framework, resulting in significantly improved performance. Experimental results show that our enhanced framework achieves 91.0% accuracy on the VerilogEval benchmark, surpassing the baseline code generation approach by 32.7%, demonstrating the effectiveness of our methods. 尽管基于大型语言模型（LLM）的寄存器传输级（RTL）代码生成前景可观，但总体成功率仍不令人满意。错误由多种因素引起，对具体失败原因的理解有限，阻碍了改进。为此，我们进行了全面的错误分析和人工分类。我们的发现表明，大多数错误并非源于 LLM 的推理限制，而是由于对 RTL 编程知识掌握不足、对电路概念理解不够、设计描述模糊或对复杂多模态输入的误解。利用上下文学习，我们提出了有针对性的错误纠正技术。具体而言，我们构建了领域特定的知识库并采用检索增强生成（RAG）来提供必要的 RTL 知识。为缓解歧义错误，我们引入了设计描述规则并实施了规则检查机制。针对多模态误解，我们集成了外部工具将输入转换为 LLM 兼容的元格式。对于剩余错误，我们采用了迭代调试循环（仿真——错误定位——修正）。将这些技术整合到基于 LLM 的框架中可显著提升性能。我们将这些错误纠正技术融入一个基于 LLM 的基础 RTL 代码生成框架，从而显著提高了性能。实验结果表明，我们增强后的框架在 VerilogEval 基准上的准确率达到 91.0%，比基线代码生成方法高出 32.7%，证明了我们方法的有效性。

Subjects: Hardware Architecture, Computation and Language, Machine Learning 主题：硬件架构，计算与语言，机器学习

Publish: 2025-08-07 11:02:32 UTC 发布：2025-08-07 11:02:32 UTC

#53 FAITH: A Framework for Assessing Intrinsic Tabular Hallucinations in finance #53 FAITH：用于评估金融领域表格内在幻觉的框架

Authors: [Mengao Zhang](https://arxiv.org/search/?searchtype=author&query=Mengao Zhang), [Jiayu Fu](https://arxiv.org/search/?searchtype=author&query=Jiayu Fu), [Tanya Warrier](https://arxiv.org/search/?searchtype=author&query=Tanya Warrier), [Yuwen Wang](https://arxiv.org/search/?searchtype=author&query=Yuwen Wang), [Tianhui Tan](https://arxiv.org/search/?searchtype=author&query=Tianhui Tan), [Ke-wei Huang](https://arxiv.org/search/?searchtype=author&query=Ke-wei Huang) 作者：Mengao Zhang、Jiayu Fu、Tanya Warrier、Yuwen Wang、Tianhui Tan、Ke-wei Huang

Hallucination remains a critical challenge for deploying Large Language Models (LLMs) in finance. Accurate extraction and precise calculation from tabular data are essential for reliable financial analysis, since even minor numerical errors can undermine decision-making and regulatory compliance. Financial applications have unique requirements, often relying on context-dependent, numerical, and proprietary tabular data that existing hallucination benchmarks rarely capture. In this study, we develop a rigorous and scalable framework for evaluating intrinsic hallucinations in financial LLMs, conceptualized as a context-aware masked span prediction task over real-world financial documents. Our main contributions are: (1) a novel, automated dataset creation paradigm using a masking strategy; (2) a new hallucination evaluation dataset derived from S&P 500 annual reports; and (3) a comprehensive evaluation of intrinsic hallucination patterns in state-of-the-art LLMs on financial tabular data. Our work provides a robust methodology for in-house LLM evaluation and serves as a critical step toward building more trustworthy and reliable financial Generative AI systems. 幻觉仍然是将大型语言模型（LLMs）部署到金融领域的一个关键挑战。从表格数据中准确提取和精确计算对可靠的财务分析至关重要，因为即使是微小的数值错误也会削弱决策制定和合规性。金融应用有其独特的需求，通常依赖于情境相关的、数值型的和专有的表格数据，而现有的幻觉基准很少涵盖这些内容。在本研究中，我们提出了一个严格且可扩展的框架，用于评估金融 LLMs 的内在幻觉，将其概念化为对真实世界财务文件进行的情境感知掩码跨度预测任务。我们的主要贡献包括： (1) 一种使用掩码策略的新型自动化数据集创建范式； (2) 一个源自标准普尔 500 年报的新幻觉评估数据集；以及 (3) 对最先进 LLM 在金融表格数据上内在幻觉模式的全面评估。我们的工作为内部 LLM 评估提供了稳健的方法论，并且是构建更可信、更可靠的金融生成式人工智能系统的关键一步。

Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题：机器学习、人工智能、计算与语言

Publish: 2025-08-07 09:37:14 UTC 发布：2025-08-07 09:37:14 UTC

#54 QA-Dragon: Query-Aware Dynamic RAG System for Knowledge-Intensive Visual Question Answering #54 QA-Dragon：面向查询感知的动态 RAG 系统，用于知识密集型视觉问答

Authors: [Zhuohang Jiang](https://arxiv.org/search/?searchtype=author&query=Zhuohang Jiang), [Pangjing Wu](https://arxiv.org/search/?searchtype=author&query=Pangjing Wu), [Xu Yuan](https://arxiv.org/search/?searchtype=author&query=Xu Yuan), [Wenqi Fan](https://arxiv.org/search/?searchtype=author&query=Wenqi Fan), [Qing Li](https://arxiv.org/search/?searchtype=author&query=Qing Li) 作者：蒋卓航、吴磅靖、袁旭、范文琦、李青

Retrieval-Augmented Generation (RAG) has been introduced to mitigate hallucinations in Multimodal Large Language Models (MLLMs) by incorporating external knowledge into the generation process, and it has become a widely adopted approach for knowledge-intensive Visual Question Answering (VQA). However, existing RAG methods typically retrieve from either text or images in isolation, limiting their ability to address complex queries that require multi-hop reasoning or up-to-date factual knowledge. To address this limitation, we propose QA-Dragon, a Query-Aware Dynamic RAG System for Knowledge-Intensive VQA. Specifically, QA-Dragon introduces a domain router to identify the query’s subject domain for domain-specific reasoning, along with a search router that dynamically selects optimal retrieval strategies. By orchestrating both text and image search agents in a hybrid setup, our system supports multimodal, multi-turn, and multi-hop reasoning, enabling it to tackle complex VQA tasks effectively. We evaluate our QA-Dragon on the Meta CRAG-MM Challenge at KDD Cup 2025, where it significantly enhances the reasoning performance of base models under challenging scenarios. Our framework achieves substantial improvements in both answer accuracy and knowledge overlap scores, outperforming baselines by 5.06% on the single-source task, 6.35% on the multi-source task, and 5.03% on the multi-turn task. 检索增强生成（RAG）被引入以通过在生成过程中整合外部知识来缓解多模态大语言模型（MLLMs）中的幻觉现象，并已成为知识密集型视觉问答（VQA）中广泛采用的方法。然而，现有的 RAG 方法通常单独从文本或图像中检索，这限制了它们处理需要多跳推理或最新事实知识的复杂查询的能力。为了解决这一限制，我们提出了 QA-Dragon，一种面向知识密集型 VQA 的查询感知动态 RAG 系统。具体而言，QA-Dragon 引入了域路由器以识别查询的主题领域以进行领域特定推理，同时配备了搜索路由器以动态选择最优检索策略。通过在混合设置中编排文本和图像检索代理，我们的系统支持多模态、多轮和多跳推理，使其能够有效应对复杂的 VQA 任务。我们在 2025 年 KDD Cup 的 Meta CRAG-MM 挑战赛上评估了 QA-Dragon，在具有挑战性的场景下，它显著提升了基础模型的推理性能。我们的框架在答案准确率和知识重合度评分上均取得了显著提升，在单源任务上比基线高出 5.06%，在多源任务上高出 6.35%，在多轮任务上高出 5.03%。

Subjects: Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition 主题：人工智能、计算与语言、计算机视觉与模式识别

Publish: 2025-08-07 09:32:49 UTC 发布：2025-08-07 09:32:49 UTC

#55 Posterior-GRPO: Rewarding Reasoning Processes in Code Generation #55 Posterior-GRPO：在代码生成中奖励推理过程

Authors: [Lishui Fan](https://arxiv.org/search/?searchtype=author&query=Lishui Fan), [Yu Zhang](https://arxiv.org/search/?searchtype=author&query=Yu Zhang), [Mouxiang Chen](https://arxiv.org/search/?searchtype=author&query=Mouxiang Chen), [Zhongxin Liu](https://arxiv.org/search/?searchtype=author&query=Zhongxin Liu) 作者：范黎水、张宇、陈谋翔、刘中昕

Reinforcement learning (RL) has significantly advanced code generation for large language models (LLMs). However, current paradigms rely on outcome-based rewards from test cases, neglecting the quality of the intermediate reasoning process. While supervising the reasoning process directly is a promising direction, it is highly susceptible to reward hacking, where the policy model learns to exploit the reasoning reward signal without improving final outcomes. To address this, we introduce a unified framework that can effectively incorporate the quality of the reasoning process during RL. First, to enable reasoning evaluation, we develop LCB-RB, a benchmark comprising preference pairs of superior and inferior reasoning processes. Second, to accurately score reasoning quality, we introduce an Optimized-Degraded based (OD-based) method for reward model training. This method generates high-quality preference pairs by systematically optimizing and degrading initial reasoning paths along curated dimensions of reasoning quality, such as factual accuracy, logical rigor, and coherence. A 7B parameter reward model with this method achieves state-of-the-art (SOTA) performance on LCB-RB and generalizes well to other benchmarks. Finally, we introduce Posterior-GRPO (P-GRPO), a novel RL method that conditions process-based rewards on task success. By selectively applying rewards to the reasoning processes of only successful outcomes, P-GRPO effectively mitigates reward hacking and aligns the model’s internal reasoning with final code correctness. A 7B parameter model with P-GRPO achieves superior performance across diverse code generation tasks, outperforming outcome-only baselines by 4.5%, achieving comparable performance to GPT-4-Turbo. We further demonstrate the generalizability of our approach by extending it to mathematical tasks. Our models, dataset, and code are publicly available. 强化学习（RL）在大型语言模型（LLMs）的代码生成方面取得了显著进展。然而，当前范式依赖于来自测试用例的基于结果的奖励，忽视了中间推理过程的质量。虽然直接监督推理过程是一个有前景的方向，但它非常容易受到奖励操纵的影响，即策略模型学会利用推理奖励信号而不改善最终结果。为了解决这一问题，我们引入了一个统一框架，能够在强化学习过程中有效地纳入推理过程的质量。首先，为了实现推理评估，我们开发了 LCB-RB，这是一个由优劣推理过程偏好对组成的基准。其次，为了准确评分推理质量，我们提出了一种基于优化-退化（OD-based）的方法用于奖励模型训练。该方法通过沿着经过挑选的推理质量维度（如事实准确性、逻辑严密性和连贯性）系统地优化和退化初始推理路径，从而生成高质量的偏好对。使用该方法训练的一个 7B 参数的奖励模型在 LCB-RB 上达到了最先进（SOTA）的表现，并且能很好地泛化到其他基准。最后，我们引入了 Posterior-GRPO（P-GRPO），这是一种新颖的强化学习方法，它将基于过程的奖励以任务成功为条件进行施加。通过有选择地仅将奖励应用于成功结果的推理过程，P-GRPO 有效地减轻了奖励被滥用的问题，并使模型的内部推理与最终代码的正确性保持一致。采用 P-GRPO 的一个 7B 参数模型在多样的代码生成任务上取得了优异表现，比仅基于结果的基线高出 4.5%，并达到了与 GPT-4-Turbo 相当的性能。我们还通过将该方法扩展到数学任务，展示了其可推广性。我们的模型、数据集和代码均已公开可用。

Subjects: Software Engineering, Artificial Intelligence, Computation and Language, Machine Learning 主题：软件工程、人工智能、计算与语言、机器学习

Publish: 2025-08-07 09:04:10 UTC 发布时间：2025-08-07 09:04:10 UTC

#56 Aligning LLMs on a Budget: Inference-Time Alignment with Heuristic Reward Models #56 在预算内对 LLMs 进行对齐：使用启发式奖励模型进行推理时对齐

Authors: [Mason Nakamura](https://arxiv.org/search/?searchtype=author&query=Mason Nakamura), [Saaduddin Mahmud](https://arxiv.org/search/?searchtype=author&query=Saaduddin Mahmud), [Kyle H. Wray](https://arxiv.org/search/?searchtype=author&query=Kyle H. Wray), [Hamed Zamani](https://arxiv.org/search/?searchtype=author&query=Hamed Zamani), [Shlomo Zilberstein](https://arxiv.org/search/?searchtype=author&query=Shlomo Zilberstein) 作者：Mason Nakamura、Saaduddin Mahmud、Kyle H. Wray、Hamed Zamani、Shlomo Zilberstein

Aligning LLMs with user preferences is crucial for real-world use but often requires costly fine-tuning or expensive inference, forcing trade-offs between alignment quality and computational cost. Existing inference-time methods typically ignore this balance, focusing solely on the optimized policy’s performance. We propose HIA (Heuristic-Guided Inference-time Alignment), a tuning-free, black-box-compatible approach that uses a lightweight prompt optimizer, heuristic reward models, and two-stage filtering to reduce inference calls while preserving alignment quality. On real-world prompt datasets, HelpSteer and ComPRed, HIA outperforms best-of-N sampling, beam search, and greedy search baselines in multi-objective, goal-conditioned tasks under the same inference budget. We also find that HIA is effective under low-inference budgets with as little as one or two response queries, offering a practical solution for scalable, personalized LLM deployment. 将 LLMs 与用户偏好对齐对真实世界的应用至关重要，但通常需要昂贵的微调或推理成本，从而在对齐质量和计算成本之间产生权衡。现有的推理时方法通常忽视这一平衡，只关注被优化策略的性能。我们提出了 HIA（启发式引导的推理时对齐），这是一种无需调优且兼容黑箱的方案，利用轻量级提示优化器、启发式奖励模型和两阶段过滤来在保持对齐质量的同时减少推理调用。在真实世界的提示数据集 HelpSteer 和 ComPRed 上，HIA 在相同的推理预算下，在多目标、目标条件化任务中优于 best-of-N 采样、束搜索和贪婪搜索基线。我们还发现，HIA 在低推理预算下也有效，仅需一到两个响应查询，为可扩展的个性化 LLM 部署提供了实用的解决方案。

Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题：机器学习、人工智能、计算与语言

Publish: 2025-08-07 08:54:27 UTC 发布时间：2025-08-07 08:54:27 协调世界时（UTC）

#57 Speech LLMs in Low-Resource Scenarios: Data Volume Requirements and the Impact of Pretraining on High-Resource Languages #57 低资源场景下的语音 LLMs：数据量需求及在高资源语言上预训练的影响

Authors: [Seraphina Fong](https://arxiv.org/search/?searchtype=author&query=Seraphina Fong), [Marco Matassoni](https://arxiv.org/search/?searchtype=author&query=Marco Matassoni), [Alessio Brutti](https://arxiv.org/search/?searchtype=author&query=Alessio Brutti) 作者：Seraphina Fong、Marco Matassoni、Alessio Brutti

Large language models (LLMs) have demonstrated potential in handling spoken inputs for high-resource languages, reaching state-of-the-art performance in various tasks. However, their applicability is still less explored in low-resource settings. This work investigates the use of Speech LLMs for low-resource Automatic Speech Recognition using the SLAM-ASR framework, where a trainable lightweight projector connects a speech encoder and a LLM. Firstly, we assess training data volume requirements to match Whisper-only performance, re-emphasizing the challenges of limited data. Secondly, we show that leveraging mono- or multilingual projectors pretrained on high-resource languages reduces the impact of data scarcity, especially with small training sets. Using multilingual LLMs (EuroLLM, Salamandra) with whisper-large-v3-turbo, we evaluate performance on several public benchmarks, providing insights for future research on optimizing Speech LLMs for low-resource languages and multilinguality. 大型语言模型（LLMs）在处理高资源语言的语音输入方面展示出潜力，在多项任务中达到了最先进的性能。然而，它们在低资源环境中的适用性仍未得到充分探索。本文研究在低资源自动语音识别中使用语音 LLMs，采用 SLAM-ASR 框架，其中一个可训练的轻量投影器连接语音编码器和 LLM。首先，我们评估达到仅使用 Whisper 相同性能所需的训练数据量，重新强调了数据匮乏带来的挑战。其次，我们展示了在高资源语言上预训练的单语或多语投影器可以减轻数据稀缺的影响，尤其是在小规模训练集情况下。使用多语 LLMs（EuroLLM、Salamandra）与 whisper-large-v3-turbo，我们在多个公开基准上评估了性能，为未来优化低资源语言和多语性语音 LLMs 的研究提供了见解。

Subjects: Audio and Speech Processing, Artificial Intelligence, Computation and Language 主题：音频与语音处理，人工智能，计算与语言

Publish: 2025-08-07 08:33:42 UTC 发布时间：2025-08-07 08:33:42 UTC

#58 Navigating Through Paper Flood: Advancing LLM-based Paper Evaluation through Domain-Aware Retrieval and Latent Reasoning #58 穿越论文洪流：通过面向领域的检索与潜在推理推进基于 LLM 的论文评估

Authors: [Wuqiang Zheng](https://arxiv.org/search/?searchtype=author&query=Wuqiang Zheng), [Yiyan Xu](https://arxiv.org/search/?searchtype=author&query=Yiyan Xu), [Xinyu Lin](https://arxiv.org/search/?searchtype=author&query=Xinyu Lin), [Chongming Gao](https://arxiv.org/search/?searchtype=author&query=Chongming Gao), [Wenjie Wang](https://arxiv.org/search/?searchtype=author&query=Wenjie Wang), [Fuli Feng](https://arxiv.org/search/?searchtype=author&query=Fuli Feng) 作者：郑武强、徐逸言、林欣宇、高崇明、王文杰、冯富立

With the rapid and continuous increase in academic publications, identifying high-quality research has become an increasingly pressing challenge. While recent methods leveraging Large Language Models (LLMs) for automated paper evaluation have shown great promise, they are often constrained by outdated domain knowledge and limited reasoning capabilities. In this work, we present PaperEval, a novel LLM-based framework for automated paper evaluation that addresses these limitations through two key components: 1) a domain-aware paper retrieval module that retrieves relevant concurrent work to support contextualized assessments of novelty and contributions, and 2) a latent reasoning mechanism that enables deep understanding of complex motivations and methodologies, along with comprehensive comparison against concurrently related work, to support more accurate and reliable evaluation. To guide the reasoning process, we introduce a progressive ranking optimization strategy that encourages the LLM to iteratively refine its predictions with an emphasis on relative comparison. Experiments on two datasets demonstrate that PaperEval consistently outperforms existing methods in both academic impact and paper quality evaluation. In addition, we deploy PaperEval in a real-world paper recommendation system for filtering high-quality papers, which has gained strong engagement on social media – amassing over 8,000 subscribers and attracting over 10,000 views for many filtered high-quality papers – demonstrating the practical effectiveness of PaperEval. 随着学术出版物的快速且持续增长，识别高质量研究已成为一个日益紧迫的挑战。尽管最近利用 LLMs 进行自动论文评估的方法显示出很大潜力，但它们常受限于过时的领域知识和有限的推理能力。在这项工作中，我们提出了 PaperEval，一种基于 LLM 的自动论文评估新框架，通过两个关键组件来解决这些局限：1）一个领域感知的论文检索模块，用于检索相关的同期工作，以支持对新颖性和贡献的情境化评估；2）一种潜在推理机制，使得能够深入理解复杂的动机与方法学，并与同期相关工作进行全面比较，从而支持更准确可靠的评估。为了引导推理过程，我们引入了一种渐进式排序优化策略，鼓励 LLM 通过强调相对比较来迭代地完善其预测。在两个数据集上的实验表明，PaperEval 在学术影响力和论文质量评估方面均持续优于现有方法。此外，我们将 PaperEval 部署在一个用于筛选高质量论文的真实论文推荐系统中，该系统在社交媒体上获得了强烈的参与——吸引了超过 8,000 名订阅者，并使许多被筛选出的高质量论文获得了超过 10,000 次浏览——展示了 PaperEval 的实际有效性。

Subjects: Information Retrieval, Computation and Language 主题：信息检索，计算与语言

Publish: 2025-08-07 08:08:13 UTC 发布：2025-08-07 08:08:13 UTC

#59 Exploring Superior Function Calls via Reinforcement Learning #59 通过强化学习探索更优的函数调用

Authors: [Bingguang Hao](https://arxiv.org/search/?searchtype=author&query=Bingguang Hao), [Maolin Wang](https://arxiv.org/search/?searchtype=author&query=Maolin Wang), [Zengzhuang Xu](https://arxiv.org/search/?searchtype=author&query=Zengzhuang Xu), [Yicheng Chen](https://arxiv.org/search/?searchtype=author&query=Yicheng Chen), [Cunyin Peng](https://arxiv.org/search/?searchtype=author&query=Cunyin Peng), [Jinjie GU](https://arxiv.org/search/?searchtype=author&query=Jinjie GU), [Chenyi Zhuang](https://arxiv.org/search/?searchtype=author&query=Chenyi Zhuang) 作者：郝炳光、王茂林、徐增庄、陈亦成、彭存寅、顾金杰、庄晨忆

Function calling capabilities are crucial for deploying Large Language Models in real-world applications, yet current training approaches fail to develop robust reasoning strategies. Supervised fine-tuning produces models that rely on superficial pattern matching, while standard reinforcement learning methods struggle with the complex action space of structured function calls. We present a novel reinforcement learning framework designed to enhance group relative policy optimization through strategic entropy based exploration specifically tailored for function calling tasks. Our approach addresses three critical challenges in function calling: insufficient exploration during policy learning, lack of structured reasoning in chain-of-thought generation, and inadequate verification of parameter extraction. Our two-stage data preparation pipeline ensures high-quality training samples through iterative LLM evaluation and abstract syntax tree validation. Extensive experiments on the Berkeley Function Calling Leaderboard demonstrate that this framework achieves state-of-the-art performance among open-source models with 86.02% overall accuracy, outperforming standard GRPO by up to 6% on complex multi-function scenarios. Notably, our method shows particularly strong improvements on code-pretrained models, suggesting that structured language generation capabilities provide an advantageous starting point for reinforcement learning in function calling tasks. We will release all the code, models and dataset to benefit the community. 函数调用能力对于在现实世界中部署大型语言模型（LLM）至关重要，然而当前的训练方法未能培养出健壮的推理策略。监督微调会生成依赖表面模式匹配的模型，而标准的强化学习方法在面对结构化函数调用的复杂动作空间时表现不佳。我们提出了一个新颖的强化学习框架，旨在通过专为函数调用任务定制的基于策略熵的策略性探索来增强组相对策略优化。我们的方法解决了函数调用中的三大关键挑战：策略学习过程中的探索不足、在链式思维生成中缺乏结构化推理，以及参数提取验证不足。我们的两阶段数据准备管道通过迭代的 LLM 评估与抽象语法树验证，确保了高质量的训练样本。在伯克利函数调用排行榜上的大量实验表明，该框架在开源模型中实现了最先进的性能，总体准确率为 86.02%，在复杂的多函数场景中相比标准 GRPO 最高提升了 6%。值得注意的是，我们的方法在经过代码预训练的模型上表现出特别显著的提升，表明结构化语言生成能力为函数调用任务中的强化学习提供了有利的起点。我们将发布所有代码、模型和数据集以造福社区。

Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题：机器学习、人工智能、计算与语言

Publish: 2025-08-07 07:51:38 UTC 发布：2025-08-07 07:51:38 UTC

#60 JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering #60 JPS：通过协作式视觉扰动和文本引导越狱多模态大语言模型

Jailbreak attacks against multimodal large language Models (MLLMs) are a significant research focus. Current research predominantly focuses on maximizing attack success rate (ASR), often overlooking whether the generated responses actually fulfill the attacker’s malicious intent. This oversight frequently leads to low-quality outputs that bypass safety filters but lack substantial harmful content. To address this gap, we propose JPS, \underline{J}ailbreak MLLMs with collaborative visual \underline{P}erturbation and textual \underline{S}teering, which achieves jailbreaks via corporation of visual image and textually steering prompt. Specifically, JPS utilizes target-guided adversarial image perturbations for effective safety bypass, complemented by “steering prompt” optimized via a multi-agent system to specifically guide LLM responses fulfilling the attackers’ intent. These visual and textual components undergo iterative co-optimization for enhanced performance. To evaluate the quality of attack outcomes, we propose the Malicious Intent Fulfillment Rate (MIFR) metric, assessed using a Reasoning-LLM-based evaluator. Our experiments show JPS sets a new state-of-the-art in both ASR and MIFR across various MLLMs and benchmarks, with analyses confirming its efficacy. Codes are available at \href{https://github.com/thu-coai/JPS}{https://github.com/thu-coai/JPS}. \color{warningcolor}{Warning: This paper contains potentially sensitive contents.} 针对多模态大型语言模型（MLLMs）的越狱攻击是一个重要的研究方向。当前研究主要侧重于最大化攻击成功率（ASR），常常忽视生成的响应是否真正满足攻击者的恶意意图。这一疏漏往往导致输出质量低下，虽然能绕过安全过滤器，但缺乏实质性的有害内容。为了解决这一问题，我们提出了 JPS，即通过协同的视觉扰动和文本引导来越狱 MLLMs（Jailbreak MLLMs with collaborative visual Perturbation and textual Steering），它通过图像的协作与文本引导提示来实现越狱。具体而言，JPS 利用目标导向的对抗性图像扰动以有效绕过安全机制，并辅以通过多智能体系统优化的“引导提示（steering prompt）”，专门引导 LLM 生成满足攻击者意图的响应。这些视觉和文本组件通过迭代共同优化以提升性能。为评估攻击结果的质量，我们提出了恶意意图实现率（Malicious Intent Fulfillment Rate，MIFR）指标，并使用基于推理的 LLM 评估器进行评估。我们的实验表明，JPS 在多种多模态大模型（MLLMs）和基准测试中，在自动语音识别（ASR）和多模态信息检索（MIFR）方面均创下了新的最先进水平，分析结果也证实了其有效性。代码可在 \href{https://github.com/thu-coai/JPS}{https://github.com/thu-coai/JPS} 获取。 \color{warningcolor}{警告：本文可能包含敏感内容。}

Subjects: Multimedia, Artificial Intelligence, Computation and Language, Cryptography and Security 主题：多媒体、人工智能、计算与语言、密码学与安全

Publish: 2025-08-07 07:14:01 UTC 发布：2025-08-07 07:14:01 UTC

#61 Cognitive Duality for Adaptive Web Agents #61 适应性网络代理的认知二元性

Authors: [Jiarun Liu](https://arxiv.org/search/?searchtype=author&query=Jiarun Liu), [Chunhong Zhang](https://arxiv.org/search/?searchtype=author&query=Chunhong Zhang), [Zheng Hu](https://arxiv.org/search/?searchtype=author&query=Zheng Hu) 作者：刘佳润，张春宏，胡铮

Web navigation represents a critical and challenging domain for evaluating artificial general intelligence (AGI), demanding complex decision-making within high-entropy, dynamic environments with combinatorially explosive action spaces. Current approaches to building autonomous web agents either focus on offline imitation learning or online exploration, but rarely integrate both paradigms effectively. Inspired by the dual-process theory of human cognition, we derive a principled decomposition into fast System 1 and slow System 2 cognitive processes. This decomposition provides a unifying perspective on existing web agent methodologies, bridging the gap between offline learning of intuitive reactive behaviors and online acquisition of deliberative planning capabilities. We implement this framework in CogniWeb, a modular agent architecture that adaptively toggles between fast intuitive processing and deliberate reasoning based on task complexity. Our evaluation on WebArena demonstrates that CogniWeb achieves competitive performance (43.96% success rate) while maintaining significantly higher efficiency (75% reduction in token usage). 网页导航是评估通用人工智能（AGI）的一个关键且富有挑战性的领域，它要求在高熵、动态环境中进行复杂决策，并面对组合爆炸的动作空间。当前构建自主网页代理的方法要么侧重于离线模仿学习，要么侧重于在线探索，但很少能有效地将两种范式整合。在受人类认知的双系统理论启发下，我们推导出一种将认知过程原则性地分解为快速的系统 1 与缓慢的系统 2 的方法。这一分解为现有的网页代理方法提供了统一视角，弥合了对直觉反应行为的离线学习与对深思熟虑规划能力的在线获取之间的鸿沟。我们在 CogniWeb 中实现了这一框架，该架构为模块化代理，能够根据任务复杂性在快速直觉处理与深思熟虑推理之间自适应切换。我们在 WebArena 上的评估表明，CogniWeb 在保持显著更高效率（代币使用减少 75%）的同时，取得了具有竞争力的性能（成功率 43.96%）。

Subjects: Artificial Intelligence, Computation and Language, Multiagent Systems 主题：人工智能，计算与语言，多智能体系统

Publish: 2025-08-07 07:05:22 UTC 发表时间：2025-08-07 07:05:22 UTC

#62 A Study of the Framework and Real-World Applications of Language Embedding for 3D Scene Understanding #62 《用于三维场景理解的语言嵌入框架及其真实世界应用研究》

Authors: [Mahmoud Chick Zaouali](https://arxiv.org/search/?searchtype=author&query=Mahmoud Chick Zaouali), [Todd Charter](https://arxiv.org/search/?searchtype=author&query=Todd Charter), [Yehor Karpichev](https://arxiv.org/search/?searchtype=author&query=Yehor Karpichev), [Brandon Haworth](https://arxiv.org/search/?searchtype=author&query=Brandon Haworth), [Homayoun Najjjaran](https://arxiv.org/search/?searchtype=author&query=Homayoun Najjjaran) 作者：Mahmoud Chick Zaouali、Todd Charter、Yehor Karpichev、Brandon Haworth、Homayoun Najjjaran

Gaussian Splatting has rapidly emerged as a transformative technique for real-time 3D scene representation, offering a highly efficient and expressive alternative to Neural Radiance Fields (NeRF). Its ability to render complex scenes with high fidelity has enabled progress across domains such as scene reconstruction, robotics, and interactive content creation. More recently, the integration of Large Language Models (LLMs) and language embeddings into Gaussian Splatting pipelines has opened new possibilities for text-conditioned generation, editing, and semantic scene understanding. Despite these advances, a comprehensive overview of this emerging intersection has been lacking. This survey presents a structured review of current research efforts that combine language guidance with 3D Gaussian Splatting, detailing theoretical foundations, integration strategies, and real-world use cases. We highlight key limitations such as computational bottlenecks, generalizability, and the scarcity of semantically annotated 3D Gaussian data and outline open challenges and future directions for advancing language-guided 3D scene understanding using Gaussian Splatting. 高斯撒点（Gaussian Splatting）迅速成为一种可用于实时三维场景表示的变革性技术，提供了比神经辐射场（NeRF）更高效且表达力更强的替代方案。它以高保真度渲染复杂场景的能力推动了场景重建、机器人技术和交互式内容创作等领域的进展。最近，将 LLMs 和语言嵌入集成到高斯撒点管线中，为基于文本的生成、编辑和语义场景理解开辟了新可能性。尽管取得了这些进展，但对这一新兴交叉领域的全面综述仍然缺乏。本文综述了将语言引导与三维高斯撒点相结合的当前研究工作，结构化地介绍了理论基础、集成策略和现实世界的应用案例。我们指出了关键的局限性，如计算瓶颈、泛化能力以及语义标注的三维高斯数据稀缺，并概述了利用高斯撒点推进语言引导三维场景理解的开放挑战和未来方向。

Subjects: Graphics, Computation and Language, Computer Vision and Pattern Recognition 主题：图形学、计算与语言、计算机视觉与模式识别

Publish: 2025-08-07 06:33:08 UTC 发布：2025-08-07 06:33:08 UTC

#63 Making Prompts First-Class Citizens for Adaptive LLM Pipelines #63 让提示成为自适应 LLM 流水线的一等公民

Authors: [Ugur Cetintemel](https://arxiv.org/search/?searchtype=author&query=Ugur Cetintemel), [Shu Chen](https://arxiv.org/search/?searchtype=author&query=Shu Chen), [Alexander W. Lee](https://arxiv.org/search/?searchtype=author&query=Alexander W. Lee), [Deepti Raghavan](https://arxiv.org/search/?searchtype=author&query=Deepti Raghavan) 作者：Ugur Cetintemel、Shu Chen、Alexander W. Lee、Deepti Raghavan

Modern LLM pipelines increasingly resemble data-centric systems: they retrieve external context, compose intermediate outputs, validate results, and adapt based on runtime feedback. Yet, the central element guiding this process – the prompt – remains a brittle, opaque string, disconnected from the surrounding dataflow. This disconnect limits reuse, optimization, and runtime control. In this paper, we describe our vision and an initial design for SPEAR, a language and runtime that fills this prompt management gap by making prompts structured, adaptive, and first-class components of the execution model. SPEAR enables (1) runtime prompt refinement – modifying prompts dynamically in response to execution-time signals such as confidence, latency, or missing context; and (2) structured prompt management – organizing prompt fragments into versioned views with support for introspection and logging. SPEAR defines a prompt algebra that governs how prompts are constructed and adapted within a pipeline. It supports multiple refinement modes (manual, assisted, and automatic), giving developers a balance between control and automation. By treating prompt logic as structured data, SPEAR enables optimizations such as operator fusion, prefix caching, and view reuse. Preliminary experiments quantify the behavior of different refinement modes compared to static prompts and agentic retries, as well as the impact of prompt-level optimizations such as operator fusion. 现代的 LLM 管道越来越像以数据为中心的系统：它们检索外部上下文，组合中间输出，验证结果，并根据运行时反馈进行调整。然而，引导这一过程的核心要素——提示（prompt）——仍然是一个脆弱、不透明的字符串，与周围的数据流脱节。这种脱节限制了重复使用、优化和运行时控制。本文描述了我们对 SPEAR 的愿景与初步设计，SPEAR 是一种语言与运行时，通过将提示结构化、可自适应并作为执行模型的一等组件，来填补提示管理的空白。SPEAR 支持（1）运行时提示精炼——根据执行时的信心水平、延迟或缺失上下文等信号动态修改提示；以及（2）结构化提示管理——将提示片段组织为带版本的视图，并支持自省与日志记录。SPEAR 定义了一种提示代数，用以规范提示在管道中如何构建与调整。它支持多种精炼模式（手动、辅助和自动），在控制与自动化之间为开发者提供平衡。通过将提示逻辑视为结构化数据，SPEAR 使得操作符融合、前缀缓存和视图重用等优化成为可能。初步实验量化了不同细化模式相较于静态提示和具有代理式重试的行为差异，以及诸如操作符融合等提示级优化的影响。

Subjects: Databases, Artificial Intelligence, Computation and Language 主题：数据库、人工智能、计算与语言

Publish: 2025-08-07 03:49:56 UTC 发布：2025-08-07 03:49:56 UTC

#64 Can Large Language Models Integrate Spatial Data? Empirical Insights into Reasoning Strengths and Computational Weaknesses #64 大型语言模型能整合空间数据吗？关于推理强项与计算弱点的实证见解

Authors: [Bin Han](https://arxiv.org/search/?searchtype=author&query=Bin Han), [Robert Wolfe](https://arxiv.org/search/?searchtype=author&query=Robert Wolfe), [Anat Caspi](https://arxiv.org/search/?searchtype=author&query=Anat Caspi), [Bill Howe](https://arxiv.org/search/?searchtype=author&query=Bill Howe) 作者：韩斌、Robert Wolfe、Anat Caspi、Bill Howe

We explore the application of large language models (LLMs) to empower domain experts in integrating large, heterogeneous, and noisy urban spatial datasets. Traditional rule-based integration methods are unable to cover all edge cases, requiring manual verification and repair. Machine learning approaches require collecting and labeling of large numbers of task-specific samples. In this study, we investigate the potential of LLMs for spatial data integration. Our analysis first considers how LLMs reason about environmental spatial relationships mediated by human experience, such as between roads and sidewalks. We show that while LLMs exhibit spatial reasoning capabilities, they struggle to connect the macro-scale environment with the relevant computational geometry tasks, often producing logically incoherent responses. But when provided relevant features, thereby reducing dependence on spatial reasoning, LLMs are able to generate high-performing results. We then adapt a review-and-refine method, which proves remarkably effective in correcting erroneous initial responses while preserving accurate responses. We discuss practical implications of employing LLMs for spatial data integration in real-world contexts and outline future research directions, including post-training, multi-modal integration methods, and support for diverse data formats. Our findings position LLMs as a promising and flexible alternative to traditional rule-based heuristics, advancing the capabilities of adaptive spatial data integration. 我们探讨了将大型语言模型（LLMs）应用于赋能领域专家，以整合大型、异构且嘈杂的城市空间数据。传统的基于规则的整合方法无法涵盖所有边缘情况，需要人工核验和修复。机器学习方法则需要收集并标注大量特定任务的样本。在本研究中，我们考察了 LLMs 在空间数据整合方面的潜力。我们的分析首先考虑了 LLMs 如何基于人类经验推理环境空间关系，例如道路与人行道之间的关系。我们表明，尽管 LLMs 表现出空间推理能力，但它们在将宏观环境与相关的计算几何任务连接起来方面存在困难，常常产生逻辑上不连贯的回答。但当提供了相关特征，从而减少对空间推理的依赖时，LLMs 能够生成高性能的结果。随后我们改进并采用了一种“审查并修正”的方法，该方法在纠正错误的初始回答同时保留准确回答方面表现出非凡的效果。我们讨论了在现实环境中利用 LLMs 进行空间数据整合的实际影响，并概述了未来的研究方向，包括训练后优化、多模态整合方法以及对多样化数据格式的支持。我们的研究结果将 LLMs 定位为一种有前景且灵活的替代传统基于规则启发式方法的方案，提升了自适应空间数据整合的能力。

Subjects: Artificial Intelligence, Computation and Language 主题：人工智能，计算与语言

Publish: 2025-08-07 03:44:20 UTC 发布时间：2025-08-07 03:44:20 UTC

#65 R-Zero: Self-Evolving Reasoning LLM from Zero Data #65 R-Zero：从零数据自我进化推理 LLM

Self-evolving Large Language Models (LLMs) offer a scalable path toward super-intelligence by autonomously generating, refining, and learning from their own experiences. However, existing methods for training such models still rely heavily on vast human-curated tasks and labels, typically via fine-tuning or reinforcement learning, which poses a fundamental bottleneck to advancing AI systems toward capabilities beyond human intelligence. To overcome this limitation, we introduce R-Zero, a fully autonomous framework that generates its own training data from scratch. Starting from a single base LLM, R-Zero initializes two independent models with distinct roles, a Challenger and a Solver. These models are optimized separately and co-evolve through interaction: the Challenger is rewarded for proposing tasks near the edge of the Solver capability, and the Solver is rewarded for solving increasingly challenging tasks posed by the Challenger. This process yields a targeted, self-improving curriculum without any pre-existing tasks and labels. Empirically, R-Zero substantially improves reasoning capability across different backbone LLMs, e.g., boosting the Qwen3-4B-Base by +6.49 on math-reasoning benchmarks and +7.54 on general-domain reasoning benchmarks. 自我进化的大型语言模型（LLMs）通过自主生成、改进并从自身经验中学习，提供了一条通往超智能的可扩展路径。然而，现有训练此类模型的方法仍然在很大程度上依赖大量人工策划的任务和标签，通常通过微调或强化学习来实现，这对将人工智能系统推进到超越人类智能的能力构成了根本性瓶颈。为克服这一限制，我们提出了 R-Zero，一个完全自主的框架，能够从零开始生成自己的训练数据。R-Zero 从单一基础 LLM 出发，初始化了两个具有不同角色的独立模型：挑战者（Challenger）和解答者（Solver）。这两个模型分别进行优化并通过交互共同进化：挑战者在提出接近解答者能力边界的任务时会获得奖励，而解答者在解决挑战者提出的越来越具挑战性的任务时会获得奖励。该过程在没有任何预先存在的任务和标签的情况下生成了一个有针对性、自我改进的课程。在实证上，R-Zero 在不同的骨干 LLMs 上显著提升推理能力，例如在数学推理基准上将 Qwen3-4B-Base 提升了 +6.49，在通用领域推理基准上提升了 +7.54。

Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题：机器学习、人工智能、计算与语言

Publish: 2025-08-07 03:38:16 UTC 发布：2025-08-07 03:38:16 UTC

#66 REINA: Regularized Entropy Information-Based Loss for Efficient Simultaneous Speech Translation #66 REINA：基于正则化熵信息的高效同时语音翻译损失函数

Authors: [Nameer Hirschkind](https://arxiv.org/search/?searchtype=author&query=Nameer Hirschkind), [Joseph Liu](https://arxiv.org/search/?searchtype=author&query=Joseph Liu), [Mahesh Kumar Nandwana](https://arxiv.org/search/?searchtype=author&query=Mahesh Kumar Nandwana), [Xiao Yu](https://arxiv.org/search/?searchtype=author&query=Xiao Yu) 作者：Nameer Hirschkind、Joseph Liu、Mahesh Kumar Nandwana、Xiao Yu

Simultaneous Speech Translation (SimulST) systems stream in audio while simultaneously emitting translated text or speech. Such systems face the significant challenge of balancing translation quality and latency. We introduce a strategy to optimize this tradeoff: wait for more input only if you gain information by doing so. Based on this strategy, we present Regularized Entropy INformation Adaptation (REINA), a novel loss to train an adaptive policy using an existing non-streaming translation model. We derive REINA from information theory principles and show that REINA helps push the reported Pareto frontier of the latency/quality tradeoff over prior works. Utilizing REINA, we train a SimulST model on French, Spanish and German, both from and into English. Training on only open source or synthetically generated data, we achieve state-of-the-art (SOTA) streaming results for models of comparable size. We also introduce a metric for streaming efficiency, quantitatively showing REINA improves the latency/quality trade-off by as much as 21% compared to prior approaches, normalized against non-streaming baseline BLEU scores. 同时语音翻译（SimulST）系统在流式输入音频的同时输出翻译文本或语音。这类系统面临着平衡翻译质量与延迟的重大挑战。我们提出了一种优化此权衡的策略：只有在等待能带来信息增益时才等待更多输入。基于该策略，我们提出了正则化熵信息适应（REINA），这是一种用于利用现有非流式翻译模型训练自适应决策策略的新型损失函数。我们从信息论原理推导出 REINA，并展示了 REINA 有助于将延迟/质量权衡的已报告帕累托前沿推动到优于先前工作的水平。利用 REINA，我们在法语、西班牙语和德语上训练了一个双向与英语之间的 SimulST 模型。仅使用开源或合成生成的数据进行训练，我们在可比较规模的模型上实现了流式结果的最新（SOTA）水平。我们还引入了一个用于流式效率的度量，定量显示与先前方法相比，REINA 在延迟/质量权衡上最多可提升 21%，该数值已相对于非流式基线 BLEU 分数进行了归一化。

Subjects: Machine Learning, Computation and Language, Audio and Speech Processing 主题：机器学习、计算与语言、音频与语音处理

Publish: 2025-08-07 00:25:58 UTC 发布日期：2025-08-07 00:25:58 UTC

#67 ConfAgents: A Conformal-Guided Multi-Agent Framework for Cost-Efficient Medical Diagnosis #67 ConfAgents：一种用于成本高效医疗诊断的受共形引导的多智能体框架

Authors: [Huiya Zhao](https://arxiv.org/search/?searchtype=author&query=Huiya Zhao), [Yinghao Zhu](https://arxiv.org/search/?searchtype=author&query=Yinghao Zhu), [Zixiang Wang](https://arxiv.org/search/?searchtype=author&query=Zixiang Wang), [Yasha Wang](https://arxiv.org/search/?searchtype=author&query=Yasha Wang), [Junyi Gao](https://arxiv.org/search/?searchtype=author&query=Junyi Gao), [Liantao Ma](https://arxiv.org/search/?searchtype=author&query=Liantao Ma) 作者：赵慧雅、朱英豪、王子翔、王雅莎、高俊义、马连涛

The efficacy of AI agents in healthcare research is hindered by their reliance on static, predefined strategies. This creates a critical limitation: agents can become better tool-users but cannot learn to become better strategic planners, a crucial skill for complex domains like healthcare. We introduce HealthFlow, a self-evolving AI agent that overcomes this limitation through a novel meta-level evolution mechanism. HealthFlow autonomously refines its own high-level problem-solving policies by distilling procedural successes and failures into a durable, strategic knowledge base. To anchor our research and facilitate reproducible evaluation, we introduce EHRFlowBench, a new benchmark featuring complex, realistic health data analysis tasks derived from peer-reviewed clinical research. Our comprehensive experiments demonstrate that HealthFlow’s self-evolving approach significantly outperforms state-of-the-art agent frameworks. This work marks a necessary shift from building better tool-users to designing smarter, self-evolving task-managers, paving the way for more autonomous and effective AI for scientific discovery. 人工智能代理在医疗研究中的效能受到其依赖静态、预定义策略的制约。这带来了一个关键局限：代理可以成为更好的工具使用者，但无法学会成为更好的战略规划者，而战略规划对于像医疗这样复杂的领域至关重要。我们提出了 HealthFlow，一种通过新颖的元层进化机制克服该局限的自我进化人工智能代理。HealthFlow 通过将过程中的成功与失败提炼为持久的战略知识库，自动改进自身的高层问题解决策略。为支撑我们的研究并便于可复现评估，我们引入了 EHRFlowBench，这是一套新的基准，包含源自经同行评审临床研究的复杂、真实的健康数据分析任务。我们的全面实验证明，HealthFlow 的自我进化方法显著优于最先进的代理框架。本工作标志着从构建更好的工具使用者向设计更聪明的、自我进化的任务管理者的必要转变，为更自主、更高效的科学发现型人工智能铺平了道路。

Subjects: Artificial Intelligence, Computation and Language, Multiagent Systems 主题：人工智能、计算与语言、多智能体系统

Publish: 2025-08-06 22:39:38 UTC 发布：2025-08-06 22:39:38 UTC

#68 Advancing Hate Speech Detection with Transformers: Insights from the MetaHate #68 使用变换器推进仇恨言论检测：来自 MetaHate 的见解

Authors: [Santosh Chapagain](https://arxiv.org/search/?searchtype=author&query=Santosh Chapagain), [Shah Muhammad Hamdi](https://arxiv.org/search/?searchtype=author&query=Shah Muhammad Hamdi), [Soukaina Filali Boubrahimi](https://arxiv.org/search/?searchtype=author&query=Soukaina Filali Boubrahimi) 作者：Santosh Chapagain、Shah Muhammad Hamdi、Soukaina Filali Boubrahimi

Hate speech is a widespread and harmful form of online discourse, encompassing slurs and defamatory posts that can have serious social, psychological, and sometimes physical impacts on targeted individuals and communities. As social media platforms such as X (formerly Twitter), Facebook, Instagram, Reddit, and others continue to facilitate widespread communication, they also become breeding grounds for hate speech, which has increasingly been linked to real-world hate crimes. Addressing this issue requires the development of robust automated methods to detect hate speech in diverse social media environments. Deep learning approaches, such as vanilla recurrent neural networks (RNNs), long short-term memory (LSTM), and convolutional neural networks (CNNs), have achieved good results, but are often limited by issues such as long-term dependencies and inefficient parallelization. This study represents the comprehensive exploration of transformer-based models for hate speech detection using the MetaHate dataset–a meta-collection of 36 datasets with 1.2 million social media samples. We evaluate multiple state-of-the-art transformer models, including BERT, RoBERTa, GPT-2, and ELECTRA, with fine-tuned ELECTRA achieving the highest performance (F1 score: 0.8980). We also analyze classification errors, revealing challenges with sarcasm, coded language, and label noise. 仇恨言论是一种广泛且有害的在线话语形式，包含侮辱性词汇和诽谤性帖子，可能对被针对的个人和群体造成严重的社会、心理，有时甚至是身体上的影响。随着 X（前身为 Twitter）、Facebook、Instagram、Reddit 等社交媒体平台不断促进广泛交流，它们也成为仇恨言论的温床，而仇恨言论与现实世界的仇恨犯罪之间的联系日益紧密。解决这一问题需要在多样化的社交媒体环境中开发健壮的自动检测方法。诸如普通循环神经网络（RNN）、长短期记忆网络（LSTM）和卷积神经网络（CNN）等深度学习方法已经取得了良好效果，但常常受到长期依赖问题和并行化效率低等问题的限制。本研究使用 MetaHate 数据集——一个包含 36 个数据集、120 万条社交媒体样本的元集合，对基于 Transformer 的模型在仇恨言论检测中的表现进行了全面探索。我们评估了多种最先进的 Transformer 模型，包括 BERT、RoBERTa、GPT-2 和 ELECTRA，其中微调后的 ELECTRA 达到最高性能（F1 分数：0.8980）。我们还分析了分类错误，揭示了讽刺、编码语言和标签噪声等方面的挑战。

Subjects: Machine Learning, Computation and Language 主题：机器学习，计算与语言

Publish: 2025-08-06 22:36:17 UTC 发布：2025-08-06 22:36:17 UTC

#69 Fine-Tuning Small Language Models (SLMs) for Autonomous Web-based Geographical Information Systems (AWebGIS) #69 针对自主基于网络的地理信息系统（AWebGIS）微调小型语言模型（SLMs）

Autonomous web-based geographical information systems (AWebGIS) aim to perform geospatial operations from natural language input, providing intuitive, intelligent, and hands-free interaction. However, most current solutions rely on cloud-based large language models (LLMs), which require continuous internet access and raise users’ privacy and scalability issues due to centralized server processing. This study compares three approaches to enabling AWebGIS: (1) a fully-automated online method using cloud-based LLMs (e.g., Cohere); (2) a semi-automated offline method using classical machine learning classifiers such as support vector machine and random forest; and (3) a fully autonomous offline (client-side) method based on a fine-tuned small language model (SLM), specifically T5-small model, executed in the client’s web browser. The third approach, which leverages SLMs, achieved the highest accuracy among all methods, with an exact matching accuracy of 0.93, Levenshtein similarity of 0.99, and recall-oriented understudy for gisting evaluation ROUGE-1 and ROUGE-L scores of 0.98. Crucially, this client-side computation strategy reduces the load on backend servers by offloading processing to the user’s device, eliminating the need for server-based inference. These results highlight the feasibility of browser-executable models for AWebGIS solutions. 自治的基于网络的地理信息系统（AWebGIS）旨在从自然语言输入执行地理空间操作，提供直观、智能和免提的交互。然而，大多数现有解决方案依赖于基于云的 LLMs，这些模型需要持续的互联网访问，并由于集中式服务器处理而引发用户隐私和可扩展性问题。本研究比较了实现 AWebGIS 的三种方法： (1) 使用基于云的 LLMs（例如 Cohere）的全自动在线方法； (2) 使用经典机器学习分类器（如支持向量机和随机森林）的半自动离线方法；以及 (3) 基于微调的小型语言模型（SLM），具体为在客户端网页浏览器中运行的 T5-small 模型的全自主离线（客户端）方法。第三种方法利用了 SLMs，在所有方法中取得了最高的准确率，精确匹配准确率为 0.93，Levenshtein 相似度为 0.99，面向召回的摘要评估 ROUGE-1 和 ROUGE-L 得分为 0.98。关键在于，这种客户端计算策略通过将处理转移到用户设备上来减少后端服务器的负担，省去了基于服务器的推理需求。这些结果突显了可在浏览器中执行的模型用于 AWebGIS 解决方案的可行性。

Subjects: Artificial Intelligence, Computation and Language, Machine Learning 主题：Artificial Intelligence , Computation and Language , Machine Learning

Publish: 2025-08-06 19:50:29 UTC 发布：2025-08-06 19:50:29 UTC

#70 Federal Reserve Communication and the COVID-19 Pandemic #70 美联储沟通与 COVID-19 大流行 [PDF ] [Copy] [Kimi ] [REL]

Authors: [Jonathan Benchimol](https://arxiv.org/search/?searchtype=author&query=Jonathan Benchimol), [Sophia Kazinnik](https://arxiv.org/search/?searchtype=author&query=Sophia Kazinnik), [Yossi Saadon](https://arxiv.org/search/?searchtype=author&query=Yossi Saadon) 作者：Jonathan Benchimol, Sophia Kazinnik, Yossi Saadon

In this study, we examine the Federal Reserve’s communication strategies during the COVID-19 pandemic, comparing them with communication during previous periods of economic stress. Using specialized dictionaries tailored to COVID-19, unconventional monetary policy (UMP), and financial stability, combined with sentiment analysis and topic modeling techniques, we identify a distinct focus in Fed communication during the pandemic on financial stability, market volatility, social welfare, and UMP, characterized by notable contextual uncertainty. Through comparative analysis, we juxtapose the Fed’s communication during the COVID-19 crisis with its responses during the dot-com and global financial crises, examining content, sentiment, and timing dimensions. Our findings reveal that Fed communication and policy actions were more reactive to the COVID-19 crisis than to previous crises. Additionally, declining sentiment related to financial stability in interest rate announcements and minutes anticipated subsequent accommodative monetary policy decisions. We further document that communicating about UMP has become the “new normal” for the Fed’s Federal Open Market Committee meeting minutes and Chairman’s speeches since the Global Financial Crisis, reflecting an institutional adaptation in communication strategy following periods of economic distress. These findings contribute to our understanding of how central bank communication evolves during crises and how communication strategies adapt to exceptional economic circumstances. 在本研究中，我们考察了联邦储备在新冠疫情期间的沟通策略，并将其与以往经济压力时期的沟通进行了比较。我们使用为新冠疫情、非常规货币政策（UMP）和金融稳定量身定制的专业词典，结合情感分析和主题建模技术，识别出疫情期间美联储沟通中对金融稳定、市场波动、社会福利和非常规货币政策的独特关注，这些沟通以显著的情境不确定性为特征。通过比较分析，我们将美联储在新冠危机期间的沟通与其在互联网泡沫和全球金融危机期间的应对进行了并置，考察了内容、情感和时机维度。我们的研究结果显示，美联储的沟通和政策行动在应对新冠危机时比以往危机更具反应性。此外，与利率公告和会议纪要中与金融稳定相关的情感下降预示了随后宽松货币政策决定的到来。我们进一步记录到，自全球金融危机以来，关于非常规货币政策（UMP）的沟通已成为美联储联邦公开市场委员会会议纪要和主席演讲中的“新常态”，这反映出在经济动荡时期之后，沟通策略在制度层面上的适应。这些发现有助于我们理解央行在危机期间的沟通如何演变以及沟通策略如何适应特殊的经济情形。

Subjects: General Economics, Computation and Language, Information Theory, Applications, Machine Learning 主题：一般经济学、计算与语言、信息论、应用、机器学习

Publish: 2025-08-06 19:17:24 UTC 发布：2025-08-06 19:17:24 协调世界时（UTC）

#71 Prescriptive Agents based on Rag for Automated Maintenance (PARAM) #71 基于 RAG 的用于自动化维护的规范性代理（PARAM）

Authors: [Chitranshu Harbola](https://arxiv.org/search/?searchtype=author&query=Chitranshu Harbola), [Anupam Purwar](https://arxiv.org/search/?searchtype=author&query=Anupam Purwar) 作者：Chitranshu Harbola，Anupam Purwar

Industrial machinery maintenance requires timely intervention to prevent catastrophic failures and optimize operational efficiency. This paper presents an integrated Large Language Model (LLM)-based intelligent system for prescriptive maintenance that extends beyond traditional anomaly detection to provide actionable maintenance recommendations. Building upon our prior LAMP framework for numerical data analysis, we develop a comprehensive solution that combines bearing vibration frequency analysis with multi agentic generation for intelligent maintenance planning. Our approach serializes bearing vibration data (BPFO, BPFI, BSF, FTF frequencies) into natural language for LLM processing, enabling few-shot anomaly detection with high accuracy. The system classifies fault types (inner race, outer race, ball/roller, cage faults) and assesses severity levels. A multi-agentic component processes maintenance manuals using vector embeddings and semantic search, while also conducting web searches to retrieve comprehensive procedural knowledge and access up-to-date maintenance practices for more accurate and in-depth recommendations. The Gemini model then generates structured maintenance recommendations includes immediate actions, inspection checklists, corrective measures, parts requirements, and timeline specifications. Experimental validation in bearing vibration datasets demonstrates effective anomaly detection and contextually relevant maintenance guidance. The system successfully bridges the gap between condition monitoring and actionable maintenance planning, providing industrial practitioners with intelligent decision support. This work advances the application of LLMs in industrial maintenance, offering a scalable framework for prescriptive maintenance across machinery components and industrial sectors. 工业机械维护需要及时干预以防止灾难性故障并优化运行效率。本文提出了一种基于 LLM 的综合智能预防性维护系统，超越了传统的异常检测，提供可执行的维护建议。在我们先前用于数值数据分析的 LAMP 框架基础上，开发了一个综合解决方案，将轴承振动频率分析与多代理生成结合，用于智能维护规划。我们的方法将轴承振动数据（BPFO、BPFI、BSF、FTF 频率）序列化为自然语言以供 LLM 处理，从而实现高精度的少样本异常检测。该系统对故障类型（内圈、外圈、滚珠/滚柱、保持架故障）进行分类并评估严重程度。多代理组件使用向量嵌入和语义搜索处理维护手册，同时进行网络搜索以检索全面的操作流程知识并获取最新的维护实践，从而提供更准确、更深入的建议。 Gemini 模型随后生成结构化的维护建议，包括即时措施、检查清单、纠正措施、零件需求和时间表说明。在轴承振动数据集上的实验验证表明其能够有效检测异常并提供具有上下文相关性的维护指导。该系统成功弥合了状态监测与可执行维护规划之间的鸿沟，为工业从业者提供智能决策支持。本工作推进了 LLMs 在工业维护中的应用，提供了一个可扩展的处方式维护框架，适用于各类机械部件和工业领域。

Subjects: Artificial Intelligence, Computation and Language, Machine Learning, Multiagent Systems, Signal Processing 主题：人工智能、计算与语言、机器学习、多智能体系统、信号处理

Publish: 2025-07-28 14:22:19 UTC 发表：2025-07-28 14:22:19 UTC

1.2.2 Artificial Intelligence

From：https://papers.cool/arxiv/cs.AI

From：https://arxiv.org/list/cs.AI/recenthttps://arxiv.org/list/cs.CL/recent 2025-08-08 | | 总计：160

#1 Simulating Human-Like Learning Dynamics with LLM-Empowered Agents #1 使用由 LLM 赋能的代理模拟类人学习动态

Authors: [Yu Yuan](https://arxiv.org/search/?searchtype=author&query=Yu Yuan), [Lili Zhao](https://arxiv.org/search/?searchtype=author&query=Lili Zhao), [Wei Chen](https://arxiv.org/search/?searchtype=author&query=Wei Chen), [Guangting Zheng](https://arxiv.org/search/?searchtype=author&query=Guangting Zheng), [Kai Zhang](https://arxiv.org/search/?searchtype=author&query=Kai Zhang), [Mengdi Zhang](https://arxiv.org/search/?searchtype=author&query=Mengdi Zhang), [Qi Liu](https://arxiv.org/search/?searchtype=author&query=Qi Liu) 作者：Yu Yuan, Lili Zhao, Wei Chen, Guangting Zheng, Kai Zhang, Mengdi Zhang, Qi Liu

Capturing human learning behavior based on deep learning methods has become a major research focus in both psychology and intelligent systems. Recent approaches rely on controlled experiments or rule-based models to explore cognitive processes. However, they struggle to capture learning dynamics, track progress over time, or provide explainability. To address these challenges, we introduce LearnerAgent, a novel multi-agent framework based on Large Language Models (LLMs) to simulate a realistic teaching environment. To explore human-like learning dynamics, we construct learners with psychologically grounded profiles-such as Deep, Surface, and Lazy-as well as a persona-free General Learner to inspect the base LLM’s default behavior. Through weekly knowledge acquisition, monthly strategic choices, periodic tests, and peer interaction, we can track the dynamic learning progress of individual learners over a full-year journey. Our findings are fourfold: 1) Longitudinal analysis reveals that only Deep Learner achieves sustained cognitive growth. Our specially designed “trap questions” effectively diagnose Surface Learner’s shallow knowledge. 2) The behavioral and cognitive patterns of distinct learners align closely with their psychological profiles. 3) Learners’ self-concept scores evolve realistically, with the General Learner developing surprisingly high self-efficacy despite its cognitive limitations. 4) Critically, the default profile of base LLM is a “diligent but brittle Surface Learner”-an agent that mimics the behaviors of a good student but lacks true, generalizable understanding. Extensive simulation experiments demonstrate that LearnerAgent aligns well with real scenarios, yielding more insightful findings about LLMs’ behavior. 基于深度学习方法捕捉人类学习行为已成为心理学和智能系统领域的主要研究焦点。近期的方法依赖受控实验或基于规则的模型来探索认知过程，然而它们难以捕捉学习动态、追踪随时间的进展或提供可解释性。为了解决这些挑战，我们提出了 LearnerAgent，一种基于 LLMs 的全新多智能体框架，用于模拟真实的教学环境。为了探索类人学习动态，我们构建了具有心理学基础档案的学习者——例如深度学习者（Deep）、表层学习者（Surface）和懒惰学习者（Lazy）——以及一个无人格设定的通用学习者（General Learner）以检查基础 LLM 的默认行为。通过每周的知识获取、每月的策略选择、定期的测验以及同伴互动，我们能够追踪个体学习者在为期一年的全过程中的动态学习进展。我们的发现有四点：1）纵向分析表明只有深度学习者实现了持续的认知增长。我们精心设计的“陷阱题”能够有效诊断表层学习者的浅层知识。 2) 不同学习者的行为和认知模式与其心理特征高度一致。3) 学习者的自我概念分数以现实的方式演变，其中通用学习者（General Learner）尽管在认知上存在局限，却意外地发展出较高的自我效能感。4) 关键是，基础 LLM 的默认画像是“勤奋但脆弱的表层学习者（Surface Learner）”——一种模仿好学生行为但缺乏真正可迁移理解的代理。大量模拟实验表明，LearnerAgent 与真实场景高度一致，能够对 LLM 的行为提供更有洞见的发现。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-07 17:57:46 UTC 发布：2025-08-07 17:57:46 UTC

#2 The Missing Reward: Active Inference in the Era of Experience #2 缺失的奖励：体验时代的主动推理

Author: [Bo Wen](https://arxiv.org/search/?searchtype=author&query=Bo Wen) 作者：Bo Wen

This paper argues that Active Inference (AIF) provides a crucial foundation for developing autonomous AI agents capable of learning from experience without continuous human reward engineering. As AI systems begin to exhaust high-quality training data and rely on increasingly large human workforces for reward design, the current paradigm faces significant scalability challenges that could impede progress toward genuinely autonomous intelligence. The proposal for an ``Era of Experience,’’ where agents learn from self-generated data, is a promising step forward. However, this vision still depends on extensive human engineering of reward functions, effectively shifting the bottleneck from data curation to reward curation. This highlights what we identify as the \textbf{grounded-agency gap}: the inability of contemporary AI systems to autonomously formulate, adapt, and pursue objectives in response to changing circumstances. We propose that AIF can bridge this gap by replacing external reward signals with an intrinsic drive to minimize free energy, allowing agents to naturally balance exploration and exploitation through a unified Bayesian objective. By integrating Large Language Models as generative world models with AIF’s principled decision-making framework, we can create agents that learn efficiently from experience while remaining aligned with human values. This synthesis offers a compelling path toward AI systems that can develop autonomously while adhering to both computational and physical constraints. 本文认为主动推断（AIF）为构建能够从经验中学习、无需持续人工奖励工程的自主人工智能代理提供了关键基础。随着人工智能系统开始耗尽高质量训练数据并依赖日益庞大的人工队伍来设计奖励，当前范式面临显著的可扩展性挑战，这可能阻碍迈向真正自主智能的进程。提出的“经验时代”——代理从自生数据中学习——是一个有前景的进展。然而，这一愿景仍依赖于大量的人为奖励函数工程，实际上将瓶颈从数据策划转移到了奖励策划上。这凸显了我们所指出的“有根代理差距”（grounded-agency gap）：当代人工智能系统无法自主地在变化的环境中制定、调整并追求目标。我们提出，主动推断可以通过用最小化自由能的内在驱动取代外部奖励信号来弥合这一差距，使代理能够通过统一的贝叶斯目标自然而然地在探索与利用之间取得平衡。通过将大型语言模型作为生成式世界模型，与 AIF 的有原则决策框架相结合，我们可以创建能够从经验中高效学习且保持与人类价值观一致的代理。这种综合为朝着能够自主发展的 AI 系统提供了一个引人注目的路径，同时遵守计算和物理约束。

Subjects: Artificial Intelligence, Adaptation and Self-Organizing Systems, Biological Physics, Computational Physics, History and Philosophy of Physics 主题：人工智能、适应与自组织系统、生物物理学、计算物理学、物理学史与哲学

Publish: 2025-08-07 17:57:12 UTC 发表时间：2025-08-07 17:57:12 UTC

Authors: [Rui Lu](https://arxiv.org/search/?searchtype=author&query=Rui Lu), [Jinhe Bi](https://arxiv.org/search/?searchtype=author&query=Jinhe Bi), [Yunpu Ma](https://arxiv.org/search/?searchtype=author&query=Yunpu Ma), [Feng Xiao](https://arxiv.org/search/?searchtype=author&query=Feng Xiao), [Yuntao Du](https://arxiv.org/search/?searchtype=author&query=Yuntao Du), [Yijun Tian](https://arxiv.org/search/?searchtype=author&query=Yijun Tian) 作者：Rui Lu、Jinhe Bi、Yunpu Ma、Feng Xiao、Yuntao Du、Yijun Tian

Social media has evolved into a complex multimodal environment where text, images, and other signals interact to shape nuanced meanings, often concealing harmful intent. Identifying such intent, whether sarcasm, hate speech, or misinformation, remains challenging due to cross-modal contradictions, rapid cultural shifts, and subtle pragmatic cues. To address these challenges, we propose MV-Debate, a multi-view agent debate framework with dynamic reflection gating for unified multimodal harmful content detection. MV-Debate assembles four complementary debate agents, a surface analyst, a deep reasoner, a modality contrast, and a social contextualist, to analyze content from diverse interpretive perspectives. Through iterative debate and reflection, the agents refine responses under a reflection-gain criterion, ensuring both accuracy and efficiency. Experiments on three benchmark datasets demonstrate that MV-Debate significantly outperforms strong single-model and existing multi-agent debate baselines. This work highlights the promise of multi-agent debate in advancing reliable social intent detection in safety-critical online contexts. 社交媒体已演变为一个复杂的多模态环境，文本、图像及其他信号相互作用以构建细腻的含义，常常掩盖有害意图。识别此类意图（无论是讽刺、仇恨言论还是错误信息）仍然具有挑战性，原因在于跨模态矛盾、快速变化的文化背景以及微妙的语用线索。为应对这些挑战，我们提出了 MV-Debate，一种具有动态反思门控的多视角代理辩论框架，用于统一的多模态有害内容检测。MV-Debate 组建了四个互补的辩论代理：表层分析者、深度推理者、模态对比者和社会情境主义者，从多样的解释视角分析内容。通过迭代辩论与反思，代理们在反思增益准则下细化响应，以确保准确性与效率。对三个基准数据集的实验表明，MV-Debate 显著优于强大的单模型和现有的多代理辩论基线。本工作凸显了多代理辩论在提升安全关键在线情境中可靠社会意图检测方面的潜力。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-07 16:38:25 UTC 发布：2025-08-07 16:38:25 UTC

#4 Streamlining Admission with LOR Insights: AI-Based Leadership Assessment in Online Master's Program #4 通过推荐信洞察简化录取：基于人工智能的在在线硕士项目中的领导力评估

Authors: [Meryem Yilmaz Soylu](https://arxiv.org/search/?searchtype=author&query=Meryem Yilmaz Soylu), [Adrian Gallard](https://arxiv.org/search/?searchtype=author&query=Adrian Gallard), [Jeonghyun Lee](https://arxiv.org/search/?searchtype=author&query=Jeonghyun Lee), [Gayane Grigoryan](https://arxiv.org/search/?searchtype=author&query=Gayane Grigoryan), [Rushil Desai](https://arxiv.org/search/?searchtype=author&query=Rushil Desai), [Stephen Harmon](https://arxiv.org/search/?searchtype=author&query=Stephen Harmon) 作者：Meryem Yilmaz Soylu、Adrian Gallard、Jeonghyun Lee、Gayane Grigoryan、Rushil Desai、Stephen Harmon

Letters of recommendation (LORs) provide valuable insights into candidates’ capabilities and experiences beyond standardized test scores. However, reviewing these text-heavy materials is time-consuming and labor-intensive. To address this challenge and support the admission committee in providing feedback for students’ professional growth, our study introduces LORI: LOR Insights, a novel AI-based detection tool for assessing leadership skills in LORs submitted by online master’s program applicants. By employing natural language processing and leveraging large language models using RoBERTa and LLAMA, we seek to identify leadership attributes such as teamwork, communication, and innovation. Our latest RoBERTa model achieves a weighted F1 score of 91.6%, a precision of 92.4%, and a recall of 91.6%, showing a strong level of consistency in our test data. With the growing importance of leadership skills in the STEM sector, integrating LORI into the graduate admissions process is crucial for accurately assessing applicants’ leadership capabilities. This approach not only streamlines the admissions process but also automates and ensures a more comprehensive evaluation of candidates’ capabilities. 推荐信（LORs）在标准化考试成绩之外，提供了关于候选人能力和经历的宝贵见解。然而，审阅这些以文本为主的材料既耗时又费力。为了解决这一挑战并支持招生委员会为学生的职业成长提供反馈，我们的研究引入了 LORI：推荐信洞察（LOR Insights），这是一种基于人工智能的新颖检测工具，用于评估在线硕士项目申请人提交的推荐信中的领导力技能。通过采用自然语言处理并利用基于 RoBERTa 和 LLAMA 的大型语言模型，我们旨在识别诸如团队合作、沟通和创新等领导力属性。我们最新的 RoBERTa 模型在测试数据上取得了加权 F1 分数 91.6%、精确度 92.4% 和召回率 91.6%，显示出较高的一致性。随着领导力技能在 STEM 领域日益重要，将 LORI 纳入研究生招生流程对于准确评估申请人的领导能力至关重要。这种方法不仅简化了录取流程，还自动化并确保了对候选人能力的更全面评估。

Subjects: Artificial Intelligence, Machine Learning 主题：人工智能、机器学习

Publish: 2025-08-07 15:46:59 UTC 发布：2025-08-07 15:46:59 UTC

#5 Auto-Eval Judge: Towards a General Agentic Framework for Task Completion Evaluation #5 自动评估评判者：迈向用于任务完成评估的通用智能体框架

The increasing adoption of foundation models as agents across diverse domains necessitates a robust evaluation framework. Current methods, such as LLM-as-a-Judge, focus only on final outputs, overlooking the step-by-step reasoning that drives agentic decision-making. Meanwhile, existing Agent-as-a-Judge systems, where one agent evaluates another’s task completion, are typically designed for narrow, domain-specific settings. To address this gap, we propose a generalizable, modular framework for evaluating agent task completion independent of the task domain. The framework emulates human-like evaluation by decomposing tasks into sub-tasks and validating each step using available information, such as the agent’s output and reasoning. Each module contributes to a specific aspect of the evaluation process, and their outputs are aggregated to produce a final verdict on task completion. We validate our framework by evaluating the Magentic-One Actor Agent on two benchmarks, GAIA and BigCodeBench. Our Judge Agent predicts task success with closer agreement to human evaluations, achieving 4.76% and 10.52% higher alignment accuracy, respectively, compared to the GPT-4o based LLM-as-a-Judge baseline. This demonstrates the potential of our proposed general-purpose evaluation framework. 随着基础模型作为代理在各个领域越来越被广泛采用，迫切需要一个强有力的评估框架。现有方法，如 LLM-as-a-Judge，仅关注最终输出，忽视了驱动代理决策的逐步推理过程。同时，现有的 Agent-as-a-Judge 系统，即一个代理评估另一个代理的任务完成情况，通常设计用于狭窄的、特定领域的场景。为填补这一空白，我们提出了一个可泛化的模块化框架，用于在不依赖任务领域的情况下评估代理的任务完成情况。该框架通过将任务分解为子任务并使用可用信息（例如代理的输出和推理）验证每一步，来模拟类人评估。每个模块对评估过程的某一特定方面做出贡献，其输出被聚合以对任务完成情况给出最终裁定。我们通过在两个基准 GAIA 和 BigCodeBench 上评估 Magentic-One Actor Agent 来验证我们的框架。我们的评判代理在预测任务成功方面与人工评估的契合度更高，分别比以 GPT-4o 为基础的“作为评判者的 LLM”基线提升了 4.76% 和 10.52% 的对齐准确率。这证明了我们所提出的通用评估框架的潜力。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-07 15:39:48 UTC 发布：2025-08-07 15:39:48 UTC

#6 GRAIL:Learning to Interact with Large Knowledge Graphs for Retrieval Augmented Reasoning

Large Language Models (LLMs) integrated with Retrieval-Augmented Generation (RAG) techniques have exhibited remarkable performance across a wide range of domains. However, existing RAG approaches primarily operate on unstructured data and demonstrate limited capability in handling structured knowledge such as knowledge graphs. Meanwhile, current graph retrieval methods fundamentally struggle to capture holistic graph structures while simultaneously facing precision control challenges that manifest as either critical information gaps or excessive redundant connections, collectively undermining reasoning performance. To address this challenge, we propose GRAIL: Graph-Retrieval Augmented Interactive Learning, a framework designed to interact with large-scale graphs for retrieval-augmented reasoning. Specifically, GRAIL integrates LLM-guided random exploration with path filtering to establish a data synthesis pipeline, where a fine-grained reasoning trajectory is automatically generated for each task. Based on the synthesized data, we then employ a two-stage training process to learn a policy that dynamically decides the optimal actions at each reasoning step. The overall objective of precision-conciseness balance in graph retrieval is decoupled into fine-grained process-supervised rewards to enhance data efficiency and training stability. In practical deployment, GRAIL adopts an interactive retrieval paradigm, enabling the model to autonomously explore graph paths while dynamically balancing retrieval breadth and precision. Extensive experiments have shown that GRAIL achieves an average accuracy improvement of 21.01% and F1 improvement of 22.43% on three knowledge graph question-answering datasets. Our source code and datasets is available at https://github.com/Changgeww/GRAIL. 将大型语言模型（LLMs）与检索增强生成（RAG）技术相结合，在众多领域展现出卓越性能。然而，现有的 RAG 方法主要针对非结构化数据，在处理知识图谱等结构化知识方面能力有限。与此同时，当前的图检索方法在捕捉整体图结构方面存在根本性困难，并且在精确性控制上面临挑战，这表现为要么存在关键信息缺失，要么出现过多冗余连接，二者共同削弱了推理性能。为了解决这一问题，我们提出了 GRAIL：Graph-Retrieval Augmented Interactive Learning（图检索增强交互式学习），该框架旨在与大规模图谱交互以实现检索增强推理。具体而言，GRAIL 将 LLM 指导的随机探索与路径过滤相结合，建立了一个数据合成管道，在该管道中为每个任务自动生成细粒度的推理轨迹。基于合成数据，我们随后采用两阶段训练过程来学习一种策略，以在每个推理步骤动态决定最优动作。在图检索中，精确性与简洁性平衡的总体目标被解耦为细粒度的过程监督奖励，以提高数据效率和训练稳定性。在实际部署中，GRAIL 采用交互式检索范式，使模型能够自主探索图路径，同时动态平衡检索的广度与精确性。大量实验表明，GRAIL 在三个知识图谱问答数据集上的平均准确率提升了 21.01%，F1 值提升了 22.43%。我们的源代码和数据集可从 https://github.com/Changgeww/GRAIL 获取。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-07 15:34:41 UTC 发布：2025-08-07 15:34:41 UTC

#7 InfiAlign: A Scalable and Sample-Efficient Framework for Aligning LLMs to Enhance Reasoning Capabilities #7 InfiAlign：一种可扩展且样本高效的框架，用于对齐 LLMs 以增强推理能力

Authors: [Shuo Cai](https://arxiv.org/search/?searchtype=author&query=Shuo Cai), [Su Lu](https://arxiv.org/search/?searchtype=author&query=Su Lu), [Qi Zhou](https://arxiv.org/search/?searchtype=author&query=Qi Zhou), [Kejing Yang](https://arxiv.org/search/?searchtype=author&query=Kejing Yang), [Zhijie Sang](https://arxiv.org/search/?searchtype=author&query=Zhijie Sang), [Congkai Xie](https://arxiv.org/search/?searchtype=author&query=Congkai Xie), [Hongxia Yang](https://arxiv.org/search/?searchtype=author&query=Hongxia Yang) 作者：蔡硕、卢肃、周琦、杨克静、桑志杰、谢丛凯、杨红霞

Large language models (LLMs) have exhibited impressive reasoning abilities on a wide range of complex tasks. However, enhancing these capabilities through post-training remains resource intensive, particularly in terms of data and computational cost. Although recent efforts have sought to improve sample efficiency through selective data curation, existing methods often rely on heuristic or task-specific strategies that hinder scalability. In this work, we introduce InfiAlign, a scalable and sample-efficient post-training framework that integrates supervised fine-tuning (SFT) with Direct Preference Optimization (DPO) to align LLMs for enhanced reasoning. At the core of InfiAlign is a robust data selection pipeline that automatically curates high-quality alignment data from open-source reasoning datasets using multidimensional quality metrics. This pipeline enables significant performance gains while drastically reducing data requirements and remains extensible to new data sources. When applied to the Qwen2.5-Math-7B-Base model, our SFT model achieves performance on par with DeepSeek-R1-Distill-Qwen-7B, while using only approximately 12% of the training data, and demonstrates strong generalization across diverse reasoning tasks. Additional improvements are obtained through the application of DPO, with particularly notable gains in mathematical reasoning tasks. The model achieves an average improvement of 3.89% on AIME 24/25 benchmarks. Our results highlight the effectiveness of combining principled data selection with full-stage post-training, offering a practical solution for aligning large reasoning models in a scalable and data-efficient manner. The model checkpoints are available at https://huggingface.co/InfiX-ai/InfiAlign-Qwen-7B-SFT. 大型语言模型（LLMs）在各种复杂任务上展现出令人印象深刻的推理能力。然而，通过后训练来增强这些能力仍然资源密集，尤其在数据和计算成本方面。尽管近期的努力试图通过选择性数据策划来提高样本效率，但现有方法通常依赖启发式或任务特定的策略，阻碍了可扩展性。在这项工作中，我们提出了 InfiAlign，一种可扩展且样本高效的后训练框架，将监督微调（SFT）与直接偏好优化（DPO）相结合，以对齐 LLMs 以增强推理能力。InfiAlign 的核心是一个稳健的数据选择流程，该流程使用多维质量指标从开源推理数据集中自动策划高质量的对齐数据。该流程在显著减少数据需求的同时实现了显著的性能提升，并且可扩展到新的数据来源。当应用于 Qwen2.5-Math-7B-Base 模型时，我们的 SFT 模型在性能上可与 DeepSeek-R1-Distill-Qwen-7B 相媲美，而所使用的训练数据仅约为其 12%，并在多样化的推理任务中展现出强大的泛化能力。通过应用 DPO 可获得额外改进，尤其在数学推理任务中表现出显著提升。该模型在 AIME 24/25 基准上平均提升了 3.89%。我们的结果突显了将有原则的数据选择与全阶段后训练相结合的有效性，为以可扩展且数据高效的方式对大型推理模型进行对齐提供了实用解决方案。模型检查点可在 https://huggingface.co/InfiX-ai/InfiAlign-Qwen-7B-SFT 获取。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-07 15:34:06 UTC 发布：2025-08-07 15:34:06 UTC

#8 Can Large Language Models Generate Effective Datasets for Emotion Recognition in Conversations? #8 大型语言模型能为对话情感识别生成有效数据集吗？

Subjects: Artificial Intelligence, Computation and Language 主题：人工智能，计算与语言

Publish: 2025-08-07 15:13:55 UTC 发布：2025-08-07 15:13:55 UTC

#9 Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance? #9 Bench-2-CoP：我们能信任用于欧盟人工智能合规性的基准测试吗？

Subjects: Artificial Intelligence, Computation and Language 主题：人工智能，计算与语言

Publish: 2025-08-07 15:03:39 UTC 发布：2025-08-07 15:03:39 UTC

#10 Whose Truth? Pluralistic Geo-Alignment for (Agentic) AI #10 谁的真相？（具代理性的）人工智能的多元地理对齐

AI (super) alignment describes the challenge of ensuring (future) AI systems behave in accordance with societal norms and goals. While a quickly evolving literature is addressing biases and inequalities, the geographic variability of alignment remains underexplored. Simply put, what is considered appropriate, truthful, or legal can differ widely across regions due to cultural norms, political realities, and legislation. Alignment measures applied to AI/ML workflows can sometimes produce outcomes that diverge from statistical realities, such as text-to-image models depicting balanced gender ratios in company leadership despite existing imbalances. Crucially, some model outputs are globally acceptable, while others, e.g., questions about Kashmir, depend on knowing the user’s location and their context. This geographic sensitivity is not new. For instance, Google Maps renders Kashmir’s borders differently based on user location. What is new is the unprecedented scale and automation with which AI now mediates knowledge, expresses opinions, and represents geographic reality to millions of users worldwide, often with little transparency about how context is managed. As we approach Agentic AI, the need for spatio-temporally aware alignment, rather than one-size-fits-all approaches, is increasingly urgent. This paper reviews key geographic research problems, suggests topics for future work, and outlines methods for assessing alignment sensitivity. AI（超级）对齐描述了确保（未来）人工智能系统的行为符合社会规范和目标的挑战。尽管不断发展的文献正在处理偏见和不平等问题，但对齐的地域差异仍然未被充分探讨。简单来说，由于文化规范、政治现实和法律的差异，什么被视为合适、真实或合法在不同地区可能大相径庭。应用于 AI/ML 工作流的对齐措施有时会产生与统计现实不一致的结果，例如文本生成图像模型在描绘公司领导层时显示出性别比例均衡，尽管实际存在不平衡。关键在于，有些模型输出在全球范围内都是可接受的，而另一些，例如关于克什米尔的问题，则取决于是否知道用户的位置及其背景。这种地域敏感性并非新鲜事。例如，谷歌地图会根据用户位置不同而以不同方式展示克什米尔的边界。新的在于，人工智能现在以前所未有的规模和自动化方式向全球数以百万计的用户中介知识、表达观点并呈现地理现实，而且通常对于如何管理上下文缺乏透明性。随着我们接近具备主体性（Agentic）的人工智能，对具有时空感知的对齐方法的需求日益紧迫，而不是一刀切的通用方法。本文回顾了关键的地理研究问题，提出了未来工作的研究课题，并概述了评估对齐敏感性的方法。

Subjects: Artificial Intelligence, Computers and Society 主题：人工智能，计算机与社会

Publish: 2025-08-07 14:21:33 UTC 发布时间：2025-08-07 14:21:33 UTC

#11 Large Language Models Transform Organic Synthesis From Reaction Prediction to Automation #11 大型语言模型将有机合成从反应预测转向自动化 [PDF 1 ] [Copy] [Kimi 1 ] [REL]

Authors: [Kartar Kumar Lohana Tharwani](https://arxiv.org/search/?searchtype=author&query=Kartar Kumar Lohana Tharwani), [Rajesh Kumar](https://arxiv.org/search/?searchtype=author&query=Rajesh Kumar), Sumita, [Numan Ahmed](https://arxiv.org/search/?searchtype=author&query=Numan Ahmed), [Yong Tang](https://arxiv.org/search/?searchtype=author&query=Yong Tang) 作者：Kartar Kumar Lohana Tharwani、Rajesh Kumar、Sumita、Numan Ahmed、Yong Tang

Large language models (LLMs) are beginning to reshape how chemists plan and run reactions in organic synthesis. Trained on millions of reported transformations, these text-based models can propose synthetic routes, forecast reaction outcomes and even instruct robots that execute experiments without human supervision. Here we survey the milestones that turned LLMs from speculative tools into practical lab partners. We show how coupling LLMs with graph neural networks, quantum calculations and real-time spectroscopy shrinks discovery cycles and supports greener, data-driven chemistry. We discuss limitations, including biased datasets, opaque reasoning and the need for safety gates that prevent unintentional hazards. Finally, we outline community initiatives open benchmarks, federated learning and explainable interfaces that aim to democratize access while keeping humans firmly in control. These advances chart a path towards rapid, reliable and inclusive molecular innovation powered by artificial intelligence and automation. 大型语言模型（LLMs）正在开始重塑化学家在有机合成中计划和实施反应的方式。通过在数百万条已报道的转化反应上训练，这些基于文本的模型可以提出合成路线、预测反应结果，甚至指导能够在无人监督下执行实验的机器人。在此，我们回顾了将 LLMs 从推测性工具转变为实用实验室伙伴的里程碑。我们展示了将 LLMs 与图神经网络、量子计算和实时光谱学相结合如何缩短发现周期并支持更绿色、以数据驱动的化学。我们讨论了局限性，包括存在偏见的数据集、推理不透明以及需要防止无意危险的安全闸门。最后，我们概述了旨在普及访问同时保持人类严格控制的社区倡议、开放基准、联邦学习和可解释界面。这些进展为由人工智能和自动化驱动的快速、可靠且包容的分子创新绘制了路径。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-07 14:17:23 UTC 发布时间：2025-08-07 14:17:23 UTC

#12 DeepPHY: Benchmarking Agentic VLMs on Physical Reasoning #12 DeepPHY：在物理推理上对具代理性的视觉语言模型进行基准测试

Although Vision Language Models (VLMs) exhibit strong perceptual abilities and impressive visual reasoning, they struggle with attention to detail and precise action planning in complex, dynamic environments, leading to subpar performance. Real-world tasks typically require complex interactions, advanced spatial reasoning, long-term planning, and continuous strategy refinement, usually necessitating understanding the physics rules of the target scenario. However, evaluating these capabilities in real-world scenarios is often prohibitively expensive. To bridge this gap, we introduce DeepPHY, a novel benchmark framework designed to systematically evaluate VLMs’ understanding and reasoning about fundamental physical principles through a series of challenging simulated environments. DeepPHY integrates multiple physical reasoning environments of varying difficulty levels and incorporates fine-grained evaluation metrics. Our evaluation finds that even state-of-the-art VLMs struggle to translate descriptive physical knowledge into precise, predictive control. 虽然视觉语言模型（VLMs）在感知能力和视觉推理方面表现出色，但在复杂、动态环境中对细节的关注和精确的动作规划方面仍然存在困难，导致性能不佳。现实任务通常需要复杂的交互、高级的空间推理、长期规划和持续的策略调整，通常还需理解目标场景的物理规则。然而，在真实场景中评估这些能力往往代价高昂。为弥补这一差距，我们引入了 DeepPHY，一种新颖的基准框架，旨在通过一系列具有挑战性的模拟环境系统地评估 VLMs 对基本物理原理的理解和推理能力。DeepPHY 整合了多个不同难度等级的物理推理环境，并引入了细粒度的评估指标。我们的评估发现，即使是最先进的 VLMs 也难以将描述性的物理知识转化为精确的预测性控制。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-07 13:58:19 UTC 发布时间：2025-08-07 13:58:19 UTC

#13 An Explainable Machine Learning Framework for Railway Predictive Maintenance using Data Streams from the Metro Operator of Portugal #13 一个用于铁路预测性维护的可解释机器学习框架，基于葡萄牙地铁运营商的数据流

Authors: [Silvia García-Méndez](https://arxiv.org/search/?searchtype=author&query=Silvia García-Méndez), [Francisco de Arriba-Pérez](https://arxiv.org/search/?searchtype=author&query=Francisco de Arriba-Pérez), [Fátima Leal](https://arxiv.org/search/?searchtype=author&query=Fátima Leal), [Bruno Veloso](https://arxiv.org/search/?searchtype=author&query=Bruno Veloso), [Benedita Malheiro](https://arxiv.org/search/?searchtype=author&query=Benedita Malheiro), [Juan Carlos Burguillo-Rial](https://arxiv.org/search/?searchtype=author&query=Juan Carlos Burguillo-Rial) 作者：Silvia García-Méndez、Francisco de Arriba-Pérez、Fátima Leal、Bruno Veloso、Benedita Malheiro、Juan Carlos Burguillo-Rial

This work contributes to a real-time data-driven predictive maintenance solution for Intelligent Transportation Systems. The proposed method implements a processing pipeline comprised of sample pre-processing, incremental classification with Machine Learning models, and outcome explanation. This novel online processing pipeline has two main highlights: (i) a dedicated sample pre-processing module, which builds statistical and frequency-related features on the fly, and (ii) an explainability module. This work is the first to perform online fault prediction with natural language and visual explainability. The experiments were performed with the MetroPT data set from the metro operator of Porto, Portugal. The results are above 98 % for F-measure and 99 % for accuracy. In the context of railway predictive maintenance, achieving these high values is crucial due to the practical and operational implications of accurate failure prediction. In the specific case of a high F-measure, this ensures that the system maintains an optimal balance between detecting the highest possible number of real faults and minimizing false alarms, which is crucial for maximizing service availability. Furthermore, the accuracy obtained enables reliability, directly impacting cost reduction and increased safety. The analysis demonstrates that the pipeline maintains high performance even in the presence of class imbalance and noise, and its explanations effectively reflect the decision-making process. These findings validate the methodological soundness of the approach and confirm its practical applicability for supporting proactive maintenance decisions in real-world railway operations. Therefore, by identifying the early signs of failure, this pipeline enables decision-makers to understand the underlying problems and act accordingly swiftly. 这项工作为智能交通系统贡献了一个实时数据驱动的预测性维护解决方案。所提出的方法实现了一个处理管道，包含样本预处理、使用机器学习模型的增量分类以及结果解释。这个新颖的在线处理管道有两个主要亮点：（i）一个专用的样本预处理模块，能够即时构建统计和与频率相关的特征；（ii）一个可解释性模块。该工作是首个实现具有自然语言和可视化可解释性的在线故障预测的工作。实验使用了来自葡萄牙波尔图地铁运营商的 MetroPT 数据集。结果在 F 值上超过 98%，在准确率上超过 99%。在铁路预测性维护的背景下，取得这些高数值至关重要，因为准确故障预测在实际和运营上具有重要影响。就高 F 值的具体情况而言，这可确保系统在尽可能检测到真实故障数量与最小化误报之间保持最佳平衡，这对于最大化服务可用性至关重要。此外，所获得的高准确性带来可靠性，直接影响成本降低和安全性提升。分析表明，即使在类别不平衡和噪声存在的情况下，该流程仍能保持高性能，其解释有效地反映了决策过程。这些发现验证了该方法的科学性，并确认了其在支持实际铁路运营中主动维护决策方面的适用性。因此，通过识别故障的早期征兆，该流程使决策者能够理解潜在问题并迅速采取相应行动。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-07 13:38:49 UTC 发表：2025-08-07 13:38:49 协调世界时（UTC）

#14 StructVRM: Aligning Multimodal Reasoning with Structured and Verifiable Reward Models #14 StructVRM：用结构化且可验证的奖励模型对齐多模态推理

Existing Vision-Language Models often struggle with complex, multi-question reasoning tasks where partial correctness is crucial for effective learning. Traditional reward mechanisms, which provide a single binary score for an entire response, are too coarse to guide models through intricate problems with multiple sub-parts. To address this, we introduce StructVRM, a method that aligns multimodal reasoning with Structured and Verifiable Reward Models. At its core is a model-based verifier trained to provide fine-grained, sub-question-level feedback, assessing semantic and mathematical equivalence rather than relying on rigid string matching. This allows for nuanced, partial credit scoring in previously intractable problem formats. Extensive experiments demonstrate the effectiveness of StructVRM. Our trained model, Seed-StructVRM, achieves state-of-the-art performance on six out of twelve public multimodal benchmarks and our newly curated, high-difficulty STEM-Bench. The success of StructVRM validates that training with structured, verifiable rewards is a highly effective approach for advancing the capabilities of multimodal models in complex, real-world reasoning domains. 现有的视觉-语言模型在复杂的、多问题的推理任务上常常表现不佳，而在这些任务中部分正确性对于有效学习至关重要。传统的奖励机制为整段回答提供单一的二元分数，过于粗糙，无法在包含多个子问题的复杂问题中对模型进行有效引导。为了解决这一问题，我们提出了 StructVRM，一种将多模态推理与结构化且可验证的奖励模型对齐的方法。其核心是一个基于模型的验证器，经过训练以提供细粒度的子问题级反馈，评估语义和数学等价性，而不是依赖僵化的字符串匹配。这使得在以往难以处理的题型中实现更为细致的部分得分成为可能。大量实验表明了 StructVRM 的有效性。我们训练的模型 Seed-StructVRM 在十二个公开多模态基准测试中取得了六项的最新性能，并在我们新整理的高难度 STEM-Bench 上表现优异。StructVRM 的成功验证了使用结构化且可验证奖励进行训练，对于提升多模态模型在复杂、真实世界推理领域的能力是一种非常有效的途径。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-07 13:31:21 UTC 发布：2025-08-07 13:31:21 UTC

#15 Minimal Model Reasoning in Description Logics: Don't Try This at Home! #15 描述逻辑中的最小模型推理：别在家尝试！

Authors: [Federica Di Stefano](https://arxiv.org/search/?searchtype=author&query=Federica Di Stefano), [Quentin Manière](https://arxiv.org/search/?searchtype=author&query=Quentin Manière), [Magdalena Ortiz](https://arxiv.org/search/?searchtype=author&query=Magdalena Ortiz), [Mantas Šimkus](https://arxiv.org/search/?searchtype=author&query=Mantas Šimkus) 作者：Federica Di Stefano，Quentin Manière，Magdalena Ortiz，Mantas Šimkus

Reasoning with minimal models has always been at the core of many knowledge representation techniques, but we still have only a limited understanding of this problem in Description Logics (DLs). Minimization of some selected predicates, letting the remaining predicates vary or be fixed, as proposed in circumscription, has been explored and exhibits high complexity. The case of `pure’ minimal models, where the extension of all predicates must be minimal, has remained largely uncharted. We address this problem in popular DLs and obtain surprisingly negative results: concept satisfiability in minimal models is undecidable already for EL. This undecidability also extends to a very restricted fragment of tuple-generating dependencies. To regain decidability, we impose acyclicity conditions on the TBox that bring the worst-case complexity below double exponential time and allow us to establish a connection with the recently studied pointwise circumscription; we also derive results in data complexity. We conclude with a brief excursion to the DL-Lite family, where a positive result was known for DL-Litecore, but our investigation establishes ExpSpace-hardness already for its extension DL-Litehorn. 在许多知识表示技术中，基于极小模型的推理一直是核心，但我们对在描述逻辑（DL）中处理该问题的理解仍然有限。正如环叙法（circumscription）中提出的，对某些选定谓词进行最小化，同时允许其余谓词变化或被固定，这一做法已被研究并表现出高复杂性。而“纯”极小模型的情况，即要求所有谓词的扩展都必须是极小的，基本上尚未被探索。我们在流行的描述逻辑中研究了这一问题，得到令人惊讶的负面结果：在极小模型中概念可满足性在 EL 已经是不可判定的。此类不可判定性还扩展到一个非常受限的元组生成依赖（tuple-generating dependencies）片段。为恢复可判定性，我们对 TBox 施加了无环性条件，使最坏情况复杂度降至双指数时间以下，并使我们能够与近期研究的逐点环叙法（pointwise circumscription）建立联系；我们还推导了数据复杂度方面的结果。我们在结尾简要考察了 DL-Lite 家族，在该家族中针对 DL-Lite core 已知有正面结果，但我们的研究表明，其扩展 DL-Lite horn 已经是 ExpSpace-难的。

Subjects: Artificial Intelligence, Computational Complexity, Logic in Computer Science 主题：人工智能，计算复杂性，计算机科学中的逻辑

Publish: 2025-08-07 12:56:15 UTC 发表：2025-08-07 12:56:15 UTC

#16 NomicLaw: Emergent Trust and Strategic Argumentation in LLMs During Collaborative Law-Making #16 NomicLaw：在协作立法过程中，LLMs 的信任涌现与策略性论证

Authors: [Asutosh Hota](https://arxiv.org/search/?searchtype=author&query=Asutosh Hota), [Jussi P. P. Jokinen](https://arxiv.org/search/?searchtype=author&query=Jussi P. P. Jokinen) 作者：Asutosh Hota，Jussi P. P. Jokinen

Recent advancements in large language models (LLMs) have extended their capabilities from basic text processing to complex reasoning tasks, including legal interpretation, argumentation, and strategic interaction. However, empirical understanding of LLM behavior in open-ended, multi-agent settings especially those involving deliberation over legal and ethical dilemmas remains limited. We introduce NomicLaw, a structured multi-agent simulation where LLMs engage in collaborative law-making, responding to complex legal vignettes by proposing rules, justifying them, and voting on peer proposals. We quantitatively measure trust and reciprocity via voting patterns and qualitatively assess how agents use strategic language to justify proposals and influence outcomes. Experiments involving homogeneous and heterogeneous LLM groups demonstrate how agents spontaneously form alliances, betray trust, and adapt their rhetoric to shape collective decisions. Our results highlight the latent social reasoning and persuasive capabilities of ten open-source LLMs and provide insights into the design of future AI systems capable of autonomous negotiation, coordination and drafting legislation in legal settings. 近年来大型语言模型（LLMs）的进展已将其能力从基础文本处理扩展到复杂推理任务，包括法律解释、论证和战略互动。然而，对于 LLMs 在开放式、多智能体环境中——尤其是那些涉及对法律与伦理困境进行审议的场景——的实证理解仍然有限。我们引入了 NomicLaw，这是一种结构化的多智能体模拟，LLMs 在其中参与协作立法，针对复杂的法律情景提出规则、为其辩护并对同伴的提案进行投票。我们通过投票模式对信任与互惠进行定量衡量，并定性评估智能体如何使用策略性语言为提案辩护以影响结果。涉及同质与异质 LLM 群体的实验展示了智能体如何自发形成联盟、背弃信任并调整修辞以塑造集体决策。我们的结果突显了十款开源 LLMs 的潜在社会推理与说服能力，并为设计未来能够在法律环境中自主进行谈判、协调与起草立法的人工智能系统提供了见解。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-07 12:49:44 UTC 发布：2025-08-07 12:49:44 UTC

#17 The Term 'Agent' Has Been Diluted Beyond Utility and Requires Redefinition #17 “智能体”一词已被稀释至无实用价值，需重新定义

Author: [Brinnae Bent](https://arxiv.org/search/?searchtype=author&query=Brinnae Bent) 作者：Brinnae Bent

The term ‘agent’ in artificial intelligence has long carried multiple interpretations across different subfields. Recent developments in AI capabilities, particularly in large language model systems, have amplified this ambiguity, creating significant challenges in research communication, system evaluation and reproducibility, and policy development. This paper argues that the term ‘agent’ requires redefinition. Drawing from historical analysis and contemporary usage patterns, we propose a framework that defines clear minimum requirements for a system to be considered an agent while characterizing systems along a multidimensional spectrum of environmental interaction, learning and adaptation, autonomy, goal complexity, and temporal coherence. This approach provides precise vocabulary for system description while preserving the term’s historically multifaceted nature. After examining potential counterarguments and implementation challenges, we provide specific recommendations for moving forward as a field, including suggestions for terminology standardization and framework adoption. The proposed approach offers practical tools for improving research clarity and reproducibility while supporting more effective policy development. 在人工智能领域，“agent”（智能体）一词长期以来在不同子领域中被赋予多重含义。近期 AI 能力的发展，尤其是在大型语言模型系统中的进展，放大了这种歧义，给研究交流、系统评估与可重复性以及政策制定带来了重大挑战。本文认为“agent”一词需要重新定义。基于历史分析和当代使用模式，我们提出了一个框架，定义了将系统视为智能体的明确最低要求，同时沿着环境交互、学习与适应、自主性、目标复杂性和时间一致性等多维光谱来刻画系统。该方法为系统描述提供了精确的术语，同时保留了该术语在历史上的多面性。审视了潜在的反驳意见和实施挑战后，我们给出了作为一个领域前进的具体建议，包括术语标准化和框架采纳的建议。该方法为提高研究的清晰度和可重复性提供了实用工具，同时有助于更有效的政策制定。

Subjects: Artificial Intelligence, Computers and Society 学科：人工智能，计算机与社会

Publish: 2025-08-07 12:40:25 UTC 发表：2025-08-07 12:40:25 UTC

#18 A Novel Architecture for Symbolic Reasoning with Decision Trees and LLM Agents #18 一种用于符号推理的新型架构，结合决策树与 LLM 代理

Author: [Andrew Kiruluta](https://arxiv.org/search/?searchtype=author&query=Andrew Kiruluta) 作者：Andrew Kiruluta

Subjects: Artificial Intelligence, Computation and Language 主题：人工智能，计算与语言

Publish: 2025-08-07 12:11:53 UTC 发布时间：2025-08-07 12:11:53 UTC

#19 An Explainable Natural Language Framework for Identifying and Notifying Target Audiences In Enterprise Communication #19 一个可解释的自然语言框架，用于在企业沟通中识别并通知目标受众

Authors: [Vítor N. Lourenço](https://arxiv.org/search/?searchtype=author&query=Vítor N. Lourenço), [Mohnish Dubey](https://arxiv.org/search/?searchtype=author&query=Mohnish Dubey), [Yunfei Bai](https://arxiv.org/search/?searchtype=author&query=Yunfei Bai), [Audrey Depeige](https://arxiv.org/search/?searchtype=author&query=Audrey Depeige), [Vivek Jain](https://arxiv.org/search/?searchtype=author&query=Vivek Jain) 作者：Vítor N. Lourenço、Mohnish Dubey、Yunfei Bai、Audrey Depeige、Vivek Jain

In large-scale maintenance organizations, identifying subject matter experts and managing communications across complex entities relationships poses significant challenges – including information overload and longer response times – that traditional communication approaches fail to address effectively. We propose a novel framework that combines RDF graph databases with LLMs to process natural language queries for precise audience targeting, while providing transparent reasoning through a planning-orchestration architecture. Our solution enables communication owners to formulate intuitive queries combining concepts such as equipment, manufacturers, maintenance engineers, and facilities, delivering explainable results that maintain trust in the system while improving communication efficiency across the organization. 在大型维修组织中，识别主题专家并管理跨复杂实体关系的沟通带来重大挑战——包括信息过载和更长的响应时间——传统的沟通方式无法有效解决这些问题。我们提出了一种新颖框架，将 RDF 图数据库与 LLMs 结合，用于处理自然语言查询以实现精确的受众定位，同时通过规划-编排架构提供透明的推理。我们的解决方案使沟通负责人能够构建直观查询，组合设备、制造商、维修工程师和设施等概念，提供可解释的结果，在提高组织内部沟通效率的同时保持对系统的信任。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-07 11:02:40 UTC 发布：2025-08-07 11:02:40 UTC

#20 QA-Dragon: Query-Aware Dynamic RAG System for Knowledge-Intensive Visual Question Answering #20 QA-Dragon：面向查询感知的动态 RAG 系统，用于知识密集型视觉问答 [PDF 2 ] [Copy] [Kimi ] [REL]

Subjects: Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition 主题：人工智能、计算与语言、计算机视觉与模式识别

Publish: 2025-08-07 09:32:49 UTC 发布：2025-08-07 09:32:49 UTC

#21 Graph-based Event Log Repair #21 基于图的事件日志修复

Authors: [Sebastiano Dissegna](https://arxiv.org/search/?searchtype=author&query=Sebastiano Dissegna), [Chiara Di Francescomarino](https://arxiv.org/search/?searchtype=author&query=Chiara Di Francescomarino), [Massimiliano Ronzani](https://arxiv.org/search/?searchtype=author&query=Massimiliano Ronzani) 作者：Sebastiano Dissegna、Chiara Di Francescomarino、Massimiliano Ronzani

The quality of event logs in Process Mining is crucial when applying any form of analysis to them. In real-world event logs, the acquisition of data can be non-trivial (e.g., due to the execution of manual activities and related manual recording or to issues in collecting, for each event, all its attributes), and often may end up with events recorded with some missing information. Standard approaches to the problem of trace (or log) reconstruction either require the availability of a process model that is used to fill missing values by leveraging different reasoning techniques or employ a Machine Learning/Deep Learning model to restore the missing values by learning from similar cases. In recent years, a new type of Deep Learning model that is capable of handling input data encoded as graphs has emerged, namely Graph Neural Networks. Graph Neural Network models, and even more so Heterogeneous Graph Neural Networks, offer the advantage of working with a more natural representation of complex multi-modal sequences like the execution traces in Process Mining, allowing for more expressive and semantically rich encodings. In this work, we focus on the development of a Heterogeneous Graph Neural Network model that, given a trace containing some incomplete events, will return the full set of attributes missing from those events. We evaluate our work against a state-of-the-art approach leveraging autoencoders on two synthetic logs and four real event logs, on different types of missing values. Different from state-of-the-art model-free approaches, which mainly focus on repairing a subset of event attributes, the proposed approach shows very good performance in reconstructing all different event attributes. 在流程挖掘中，事件日志的质量在对其应用任何形式的分析时都至关重要。在真实世界的事件日志中，数据获取可能并非易事（例如，由于手工活动的执行及相关的手工记录，或由于在为每个事件收集所有属性时出现的问题），并且经常会导致事件记录中出现某些缺失信息。解决迹（或日志）重建问题的标准方法要么需要可用的流程模型，通过利用不同的推理技术来填充缺失值，要么采用机器学习/深度学习模型通过从相似案例中学习来恢复缺失值。近年来，出现了一种能够处理以图形式编码的输入数据的新型深度学习模型，即图神经网络。图神经网络模型，尤其是异构图神经网络，具有使用更自然的方式表示复杂多模态序列（如流程挖掘中的执行轨迹）的优势，从而允许更具表现力和语义丰富的编码。在本研究中，我们专注于开发一种异构图神经网络模型，该模型在给定包含某些不完整事件的轨迹时，能够返回这些事件中缺失的完整属性集合。我们在两份合成日志和四份真实事件日志上、针对不同类型的缺失值，将我们的工作与利用自编码器的最先进方法进行了评估。有别于主要侧重修复事件属性子集的无模型最先进方法，所提出的方法在重建所有不同事件属性方面表现非常出色。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-07 08:26:16 UTC 发表：2025-08-07 08:26:16 UTC

#22 Beyond Automation: Socratic AI, Epistemic Agency, and the Implications of the Emergence of Orchestrated Multi-Agent Learning Architectures #22 超越自动化：苏格拉底式人工智能、认知主体性以及编排式多智能体学习架构出现的影响

Authors: [Peer-Benedikt Degen](https://arxiv.org/search/?searchtype=author&query=Peer-Benedikt Degen), [Igor Asanov](https://arxiv.org/search/?searchtype=author&query=Igor Asanov) 作者：Peer-Benedikt Degen, Igor Asanov

Generative AI is no longer a peripheral tool in higher education. It is rapidly evolving into a general-purpose infrastructure that reshapes how knowledge is generated, mediated, and validated. This paper presents findings from a controlled experiment evaluating a Socratic AI Tutor, a large language model designed to scaffold student research question development through structured dialogue grounded in constructivist theory. Conducted with 65 pre-service teacher students in Germany, the study compares interaction with the Socratic Tutor to engagement with an uninstructed AI chatbot. Students using the Socratic Tutor reported significantly greater support for critical, independent, and reflective thinking, suggesting that dialogic AI can stimulate metacognitive engagement and challenging recent narratives of de-skilling due to generative AI usage. These findings serve as a proof of concept for a broader pedagogical shift: the use of multi-agent systems (MAS) composed of specialised AI agents. To conceptualise this, we introduce the notion of orchestrated MAS, modular, pedagogically aligned agent constellations, curated by educators, that support diverse learning trajectories through differentiated roles and coordinated interaction. To anchor this shift, we propose an adapted offer-and-use model, in which students appropriate instructional offers from these agents. Beyond technical feasibility, we examine system-level implications for higher education institutions and students, including funding necessities, changes to faculty roles, curriculars, competencies and assessment practices. We conclude with a comparative cost-effectiveness analysis highlighting the scalability of such systems. In sum, this study contributes both empirical evidence and a conceptual roadmap for hybrid learning ecosystems that embed human-AI co-agency and pedagogical alignment. 生成式人工智能不再是高等教育中的边缘工具。它正在迅速演变为一种通用基础设施，重塑知识的生成、媒介化和验证方式。本文呈现了一项受控实验的发现，该实验评估了一种苏格拉底式 AI 导师——一种通过基于建构主义理论的结构化对话来支撑学生研究问题发展的大型语言模型。该研究在德国对 65 名师范生进行，比较了与苏格拉底式导师的互动与与一个未被指示的 AI 聊天机器人的互动。使用苏格拉底式导师的学生报告称在促进批判性、独立和反思性思维方面获得了显著更大的支持，这表明对话式 AI 可以激发元认知参与，并挑战了近期关于生成式 AI 使用会导致技能退化的叙事。这些发现作为一个概念验证，支持更广泛的教学变革：使用由专门化 AI 代理组成的多智能体系统（MAS）。为概念化这一点，我们引入了“编排式多智能体系统”（orchestrated MAS）这一概念，即由教育者策划的模块化、教学对齐的智能体群，通过差异化角色和协调互动支持多样化的学习轨迹。为落实这一转变，我们提出了一种改编的“提供与使用”模型，在该模型中，学生从这些智能体中采用教学供给。除了技术可行性之外，我们还考察了此类系统对高等教育机构和学生在系统层面的影响，包括资金需求、教师角色的变化、课程设置、能力要求和评估实践的变动。最后，我们以一项比较性成本效益分析作结，凸显此类系统的可扩展性。总之，本研究为嵌入人机协同代理与教学对齐的混合学习生态系统提供了实证证据和概念性路线图。

Subjects: Artificial Intelligence, Multiagent Systems 主题：人工智能，多智能体系统

Publish: 2025-08-07 07:49:03 UTC 发布时间：2025-08-07 07:49:03 UTC

#23 EasySize: Elastic Analog Circuit Sizing via LLM-Guided Heuristic Search #23 EasySize：通过 LLM 引导启发式搜索实现弹性模拟电路尺寸调整

Authors: [Xinyue Wu](https://arxiv.org/search/?searchtype=author&query=Xinyue Wu), [Fan Hu](https://arxiv.org/search/?searchtype=author&query=Fan Hu), [Shaik Jani Babu](https://arxiv.org/search/?searchtype=author&query=Shaik Jani Babu), [Yi Zhao](https://arxiv.org/search/?searchtype=author&query=Yi Zhao), [Xinfei Guo](https://arxiv.org/search/?searchtype=author&query=Xinfei Guo) 作者：吴欣悦、胡凡、Shaik Jani Babu、赵奕、郭新飞

Analog circuit design is a time-consuming, experience-driven task in chip development. Despite advances in AI, developing universal, fast, and stable gate sizing methods for analog circuits remains a significant challenge. Recent approaches combine Large Language Models (LLMs) with heuristic search techniques to enhance generalizability, but they often depend on large model sizes and lack portability across different technology nodes. To overcome these limitations, we propose EasySize, the first lightweight gate sizing framework based on a finetuned Qwen3-8B model, designed for universal applicability across process nodes, design specifications, and circuit topologies. EasySize exploits the varying Ease of Attainability (EOA) of performance metrics to dynamically construct task-specific loss functions, enabling efficient heuristic search through global Differential Evolution (DE) and local Particle Swarm Optimization (PSO) within a feedback-enhanced flow. Although finetuned solely on 350nm node data, EasySize achieves strong performance on 5 operational amplifier (Op-Amp) netlists across 180nm, 45nm, and 22nm technology nodes without additional targeted training, and outperforms AutoCkt, a widely-used Reinforcement Learning based sizing framework, on 86.67% of tasks with more than 96.67% of simulation resources reduction. We argue that EasySize can significantly reduce the reliance on human expertise and computational resources in gate sizing, thereby accelerating and simplifying the analog circuit design process. EasySize will be open-sourced at a later date. 模拟电路设计是芯片开发中一项耗时且依赖经验的任务。尽管人工智能取得了进展，但为模拟电路开发通用、快速且稳定的门尺寸调整方法仍然是一个重大挑战。近来的方法将 LLMs 与启发式搜索技术相结合以增强泛化能力，但它们通常依赖于较大的模型规模并且在不同工艺节点之间缺乏可移植性。为克服这些限制，我们提出了 EasySize，这是首个基于微调后 Qwen3-8B 模型的轻量级门尺寸调整框架，旨在跨工艺节点、设计规范和电路拓扑实现通用适用性。EasySize 利用性能指标可达难易（Ease of Attainability, EOA）的差异，动态构建任务特定的损失函数，通过在反馈增强流程中结合全局差分进化（DE）和局部粒子群优化（PSO）来实现高效的启发式搜索。尽管仅在 350nm 节点数据上进行微调，EasySize 在 5 个运算放大器（Op-Amp）网表上于 180nm、45nm 和 22nm 工艺节点均表现优异，且无需额外的针对性训练。在 86.67%的任务上，它优于常用的基于强化学习的尺寸调整框架 AutoCkt，并将仿真资源减少了超过 96.67%。我们认为，EasySize 能够显著降低门级尺寸调整对人工专业知识和计算资源的依赖，从而加速并简化模拟电路设计过程。EasySize 将在稍后开源。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-07 07:47:07 UTC 发表于：2025-08-07 07:47:07 UTC

#24 MedMKEB: A Comprehensive Knowledge Editing Benchmark for Medical Multimodal Large Language Models #24 MedMKEB：用于医学多模态大语言模型的综合知识编辑基准

Authors: [Dexuan Xu](https://arxiv.org/search/?searchtype=author&query=Dexuan Xu), [Jieyi Wang](https://arxiv.org/search/?searchtype=author&query=Jieyi Wang), [Zhongyan Chai](https://arxiv.org/search/?searchtype=author&query=Zhongyan Chai), [Yongzhi Cao](https://arxiv.org/search/?searchtype=author&query=Yongzhi Cao), [Hanpin Wang](https://arxiv.org/search/?searchtype=author&query=Hanpin Wang), [Huamin Zhang](https://arxiv.org/search/?searchtype=author&query=Huamin Zhang), [Yu Huang](https://arxiv.org/search/?searchtype=author&query=Yu Huang) 作者：徐德轩、王杰怡、柴中堰、曹永志、王汉品、张华敏、黄宇

Recent advances in multimodal large language models (MLLMs) have significantly improved medical AI, enabling it to unify the understanding of visual and textual information. However, as medical knowledge continues to evolve, it is critical to allow these models to efficiently update outdated or incorrect information without retraining from scratch. Although textual knowledge editing has been widely studied, there is still a lack of systematic benchmarks for multimodal medical knowledge editing involving image and text modalities. To fill this gap, we present MedMKEB, the first comprehensive benchmark designed to evaluate the reliability, generality, locality, portability, and robustness of knowledge editing in medical multimodal large language models. MedMKEB is built on a high-quality medical visual question-answering dataset and enriched with carefully constructed editing tasks, including counterfactual correction, semantic generalization, knowledge transfer, and adversarial robustness. We incorporate human expert validation to ensure the accuracy and reliability of the benchmark. Extensive single editing and sequential editing experiments on state-of-the-art general and medical MLLMs demonstrate the limitations of existing knowledge-based editing approaches in medicine, highlighting the need to develop specialized editing strategies. MedMKEB will serve as a standard benchmark to promote the development of trustworthy and efficient medical knowledge editing algorithms. 多模态大型语言模型（MLLMs）的最新进展显著提升了医疗人工智能，使其能够将视觉与文本信息的理解统一起来。然而，随着医学知识的不断发展，关键在于让这些模型能够高效更新过时或错误的信息，而无需从头重新训练。尽管文本知识编辑已被广泛研究，但在涉及图像和文本模态的多模态医学知识编辑方面，仍缺乏系统性的基准测试。为填补这一空白，我们提出了 MedMKEB，这是首个全面的基准，旨在评估医学多模态大型语言模型在知识编辑方面的可靠性、通用性、局部性、可移植性和鲁棒性。MedMKEB 构建于高质量的医学视觉问答数据集之上，并通过精心构造的编辑任务进行丰富，包括反事实修正、语义泛化、知识迁移和对抗鲁棒性。我们引入了人工专家验证，以确保该基准的准确性和可靠性。在最先进的一般性和医学多模态大模型上进行的大量单次编辑和序列编辑实验展示了现有基于知识的医学编辑方法的局限性，凸显了开发专门编辑策略的必要性。MedMKEB 将作为一个标准基准，促进可靠且高效的医学知识编辑算法的发展。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-07 07:09:26 UTC 发布：2025-08-07 07:09:26 UTC

#25 Cognitive Duality for Adaptive Web Agents #25 认知二元性用于自适应网络代理 [PDF ] [Copy] [Kimi ] [REL]

Web navigation represents a critical and challenging domain for evaluating artificial general intelligence (AGI), demanding complex decision-making within high-entropy, dynamic environments with combinatorially explosive action spaces. Current approaches to building autonomous web agents either focus on offline imitation learning or online exploration, but rarely integrate both paradigms effectively. Inspired by the dual-process theory of human cognition, we derive a principled decomposition into fast System 1 and slow System 2 cognitive processes. This decomposition provides a unifying perspective on existing web agent methodologies, bridging the gap between offline learning of intuitive reactive behaviors and online acquisition of deliberative planning capabilities. We implement this framework in CogniWeb, a modular agent architecture that adaptively toggles between fast intuitive processing and deliberate reasoning based on task complexity. Our evaluation on WebArena demonstrates that CogniWeb achieves competitive performance (43.96% success rate) while maintaining significantly higher efficiency (75% reduction in token usage). 网页导航代表了评估通用人工智能（AGI）的一个关键且具有挑战性的领域，要求在高熵、动态环境中进行复杂决策，并面对组合爆炸性的动作空间。目前构建自主网页代理的方法要么侧重于离线模仿学习，要么侧重于在线探索，但很少能有效整合这两种范式。受人类认知双系统理论的启发，我们推导出一种原则性的分解，将认知过程划分为快速的系统 1 与缓慢的系统 2。这种分解为现有网页代理方法提供了统一视角，弥合了离线学习直觉反应行为与在线获得深思熟虑规划能力之间的鸿沟。我们在 CogniWeb 中实现了这一框架，这是一种模块化代理架构，能够根据任务复杂性自适应地在快速直觉处理与深思熟虑推理之间切换。我们在 WebArena 上的评估表明，CogniWeb 在保持显著更高效率（令牌使用减少 75%）的同时，取得了具有竞争力的性能（43.96% 成功率）。

Subjects: Artificial Intelligence, Computation and Language, Multiagent Systems 主题：人工智能、计算与语言、多智能体系统

Publish: 2025-08-07 07:05:22 UTC 发表时间：2025-08-07 07:05:22 UTC

#26 Can Large Language Models Integrate Spatial Data? Empirical Insights into Reasoning Strengths and Computational Weaknesses #26 大型语言模型能整合空间数据吗？关于推理优势与计算弱点的实证见解

Subjects: Artificial Intelligence, Computation and Language 主题：人工智能，计算与语言

Publish: 2025-08-07 03:44:20 UTC 发表：2025-08-07 03:44:20 UTC

#27 The Docking Game: Loop Self-Play for Fast, Dynamic, and Accurate Prediction of Flexible Protein–Ligand Binding #27 对接游戏：用于快速、动态且准确预测柔性蛋白—配体结合的循环自博弈

Authors: [Youzhi Zhang](https://arxiv.org/search/?searchtype=author&query=Youzhi Zhang), [Yufei Li](https://arxiv.org/search/?searchtype=author&query=Yufei Li), [Gaofeng Meng](https://arxiv.org/search/?searchtype=author&query=Gaofeng Meng), [Hongbin Liu](https://arxiv.org/search/?searchtype=author&query=Hongbin Liu), [Jiebo Luo](https://arxiv.org/search/?searchtype=author&query=Jiebo Luo) 作者：张佑志、李宇飞、孟高峰、刘鸿斌、罗杰博

Molecular docking is a crucial aspect of drug discovery, as it predicts the binding interactions between small-molecule ligands and protein pockets. However, current multi-task learning models for docking often show inferior performance in ligand docking compared to protein pocket docking. This disparity arises largely due to the distinct structural complexities of ligands and proteins. To address this issue, we propose a novel game-theoretic framework that models the protein-ligand interaction as a two-player game called the Docking Game, with the ligand docking module acting as the ligand player and the protein pocket docking module as the protein player. To solve this game, we develop a novel Loop Self-Play (LoopPlay) algorithm, which alternately trains these players through a two-level loop. In the outer loop, the players exchange predicted poses, allowing each to incorporate the other’s structural predictions, which fosters mutual adaptation over multiple iterations. In the inner loop, each player dynamically refines its predictions by incorporating its own predicted ligand or pocket poses back into its model. We theoretically show the convergence of LoopPlay, ensuring stable optimization. Extensive experiments conducted on public benchmark datasets demonstrate that LoopPlay achieves approximately a 10% improvement in predicting accurate binding modes compared to previous state-of-the-art methods. This highlights its potential to enhance the accuracy of molecular docking in drug discovery. 分子对接是药物发现中的关键环节，因为它可以预测小分子配体与蛋白口袋之间的结合相互作用。然而，现有用于对接的多任务学习模型在配体对接方面的表现通常不如蛋白口袋对接。造成这种差异的主要原因在于配体与蛋白质在结构上的复杂性存在显著不同。为了解决这一问题，我们提出了一个新颖的博弈论框架，将蛋白-配体相互作用建模为一个名为“对接博弈”的二人博弈，其中配体对接模块作为配体玩家，蛋白口袋对接模块作为蛋白玩家。为了解决该博弈，我们开发了一种新颖的循环自博弈（Loop Self-Play，LoopPlay）算法，通过双层循环交替训练这两位玩家。在外层循环中，玩家互相交换预测的构象，使每一方都能纳入对方的结构预测，从而在多次迭代中促进相互适应。在内层循环中，每位玩家通过将自身预测的配体或口袋构象重新输入到其模型中来动态地细化其预测。我们从理论上证明了 LoopPlay 的收敛性，确保了优化过程的稳定性。在公共基准数据集上进行的大量实验表明，与先前的最先进方法相比，LoopPlay 在预测准确结合构象方面大约提高了 10%。这凸显了其在提高药物发现中分子对接准确性方面的潜力。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-07 03:38:28 UTC 发表：2025-08-07 03:38:28 UTC

#28 ConfAgents: A Conformal-Guided Multi-Agent Framework for Cost-Efficient Medical Diagnosis #28 ConfAgents：一种用于成本高效医疗诊断的保形引导多智能体框架

Authors: [Huiya Zhao](https://arxiv.org/search/?searchtype=author&query=Huiya Zhao), [Yinghao Zhu](https://arxiv.org/search/?searchtype=author&query=Yinghao Zhu), [Zixiang Wang](https://arxiv.org/search/?searchtype=author&query=Zixiang Wang), [Yasha Wang](https://arxiv.org/search/?searchtype=author&query=Yasha Wang), [Junyi Gao](https://arxiv.org/search/?searchtype=author&query=Junyi Gao), [Liantao Ma](https://arxiv.org/search/?searchtype=author&query=Liantao Ma) 作者：赵惠雅，朱英浩，王子祥，王雅莎，高俊逸，马连涛

Subjects: Artificial Intelligence, Computation and Language, Multiagent Systems 主题：人工智能、计算与语言、多智能体系统

Publish: 2025-08-06 22:39:38 UTC 发布：2025-08-06 22:39:38 UTC

#29 Large Language Models Reasoning Abilities Under Non-Ideal Conditions After RL-Fine-Tuning #29 在非理想条件下经过强化学习微调后大型语言模型的推理能力

Authors: [Chang Tian](https://arxiv.org/search/?searchtype=author&query=Chang Tian), [Matthew B. Blaschko](https://arxiv.org/search/?searchtype=author&query=Matthew B. Blaschko), [Mingzhe Xing](https://arxiv.org/search/?searchtype=author&query=Mingzhe Xing), [Xiuxing Li](https://arxiv.org/search/?searchtype=author&query=Xiuxing Li), [Yinliang Yue](https://arxiv.org/search/?searchtype=author&query=Yinliang Yue), [Marie-Francine Moens](https://arxiv.org/search/?searchtype=author&query=Marie-Francine Moens) 作者：Chang Tian、Matthew B. Blaschko、Mingzhe Xing、Xiuxing Li、Yinliang Yue、Marie-Francine Moens

Reinforcement learning (RL) has become a key technique for enhancing the reasoning abilities of large language models (LLMs), with policy-gradient algorithms dominating the post-training stage because of their efficiency and effectiveness. However, most existing benchmarks evaluate large-language-model reasoning under idealized settings, overlooking performance in realistic, non-ideal scenarios. We identify three representative non-ideal scenarios with practical relevance: summary inference, fine-grained noise suppression, and contextual filtering. We introduce a new research direction guided by brain-science findings that human reasoning remains reliable under imperfect inputs. We formally define and evaluate these challenging scenarios. We fine-tune three LLMs and a state-of-the-art large vision-language model (LVLM) using RL with a representative policy-gradient algorithm and then test their performance on eight public datasets. Our results reveal that while RL fine-tuning improves baseline reasoning under idealized settings, performance declines significantly across all three non-ideal scenarios, exposing critical limitations in advanced reasoning capabilities. Although we propose a scenario-specific remediation method, our results suggest current methods leave these reasoning deficits largely unresolved. This work highlights that the reasoning abilities of large models are often overstated and underscores the importance of evaluating models under non-ideal scenarios. The code and data will be released at XXXX. 强化学习（RL）已成为提升大型语言模型（LLMs）推理能力的关键技术，策略梯度算法因其高效性和有效性而主导了训练后阶段。然而，大多数现有基准在理想化设置下评估大型语言模型的推理能力，忽视了在现实的非理想情形下的表现。我们识别出三个具有实际相关性的代表性非理想情形：摘要推断、细粒度噪声抑制和上下文过滤。我们在脑科学发现的指导下提出了一个新的研究方向，该发现表明在人类推理中，即使在输入不完美的情况下也能保持可靠性。我们对这些具有挑战性的情形给出了正式定义并进行了评估。我们使用代表性的策略梯度算法通过强化学习对三种 LLMs 和一种最先进的大型视觉-语言模型（LVLM）进行了微调，然后在八个公开数据集上测试它们的表现。我们的结果表明，尽管强化学习微调在理想化设置下提升了基线推理，但在所有三种非理想情形下性能显著下降，暴露出现有高级推理能力的关键局限性。尽管我们提出了一种针对特定情景的补救方法，但我们的结果表明现有方法在很大程度上未能解决这些推理缺陷。该工作强调大型模型的推理能力常被高估，并强调在非理想情景下评估模型的重要性。代码和数据将在 XXXX 发布。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-06 19:51:29 UTC 发布时间：2025-08-06 19:51:29 UTC

#30 Fine-Tuning Small Language Models (SLMs) for Autonomous Web-based Geographical Information Systems (AWebGIS) #30 针对自主基于网络的地理信息系统（AWebGIS）微调小型语言模型（SLMs）

Subjects: Artificial Intelligence, Computation and Language, Machine Learning 主题：Artificial Intelligence , Computation and Language , Machine Learning

Publish: 2025-08-06 19:50:29 UTC 发布：2025-08-06 19:50:29 UTC

#31 Who is a Better Player: LLM against LLM #31 谁是更好的玩家：LLM 对 LLM

Adversarial board games, as a paradigmatic domain of strategic reasoning and intelligence, have long served as both a popular competitive activity and a benchmark for evaluating artificial intelligence (AI) systems. Building on this foundation, we propose an adversarial benchmarking framework to assess the comprehensive performance of Large Language Models (LLMs) through board games competition, compensating the limitation of data dependency of the mainstream Question-and-Answer (Q&A) based benchmark method. We introduce Qi Town, a specialized evaluation platform that supports 5 widely played games and involves 20 LLM-driven players. The platform employs both the Elo rating system and a novel Performance Loop Graph (PLG) to quantitatively evaluate the technical capabilities of LLMs, while also capturing Positive Sentiment Score (PSS) throughout gameplay to assess mental fitness. The evaluation is structured as a round-robin tournament, enabling systematic comparison across players. Experimental results indicate that, despite technical differences, most LLMs remain optimistic about winning and losing, demonstrating greater adaptability to high-stress adversarial environments than humans. On the other hand, the complex relationship between cyclic wins and losses in PLGs exposes the instability of LLMs’ skill play during games, warranting further explanation and exploration. 对抗棋类游戏作为战略推理与智力的典型领域，长期以来既是受欢迎的竞技活动，又是评估人工智能（AI）系统的基准。在此基础上，我们提出了一个对抗性基准框架，通过棋类竞赛来评估大型语言模型（LLMs）的综合表现，以弥补主流基于问答（Q&A）基准方法对数据依赖的限制。我们推出了 Qi Town，这是一个专门的评估平台，支持 5 种广泛流行的游戏，并包含 20 名由 LLM 驱动的玩家。该平台采用 Elo 评级系统和一种新颖的性能循环图（PLG）来对 LLMs 的技术能力进行量化评估，同时在整个游戏过程中捕捉积极情绪得分（PSS）以评估心理状态。评估以循环赛形式构建，便于对玩家进行系统比较。实验结果表明，尽管在技术上存在差异，大多数 LLMs 对胜负仍持乐观态度，表现出比人类更强的适应高压对抗环境的能力。另一方面，PLG 中循环胜负之间的复杂关系揭示了 LLMs 在游戏中技能发挥的不稳定性，这需要进一步解释和探索。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 06:41:47 UTC 发布：2025-08-05 06:41:47 UTC

#32 GeoFlow: Agentic Workflow Automation for Geospatial Tasks #32 GeoFlow：用于地理空间任务的自治式工作流自动化

We present GeoFlow, a method that automatically generates agentic workflows for geospatial tasks. Unlike prior work that focuses on reasoning decomposition and leaves API selection implicit, our method provides each agent with detailed tool-calling objectives to guide geospatial API invocation at runtime. GeoFlow increases agentic success by 6.8% and reduces token usage by up to fourfold across major LLM families compared to state-of-the-art approaches. 我们提出了 GeoFlow，一种能自动为地理空间任务生成自治式工作流的方法。与以往侧重于推理分解且将 API 选择留作隐式处理的工作不同，我们的方法为每个代理提供了详细的工具调用目标，以在运行时指导地理空间 API 的调用。与最先进的方法相比，GeoFlow 在主要 LLM 家族中将代理成功率提高了 6.8%，并将令牌使用量最多减少了四倍。

Subjects: Artificial Intelligence, Machine Learning 主题：人工智能，机器学习

Publish: 2025-08-05 02:14:58 UTC 发布时间：2025-08-05 02:14:58 UTC

#33 Prescriptive Agents based on Rag for Automated Maintenance (PARAM) #33 基于 Rag 的用于自动化维护的规范性代理（PARAM）

Industrial machinery maintenance requires timely intervention to prevent catastrophic failures and optimize operational efficiency. This paper presents an integrated Large Language Model (LLM)-based intelligent system for prescriptive maintenance that extends beyond traditional anomaly detection to provide actionable maintenance recommendations. Building upon our prior LAMP framework for numerical data analysis, we develop a comprehensive solution that combines bearing vibration frequency analysis with multi agentic generation for intelligent maintenance planning. Our approach serializes bearing vibration data (BPFO, BPFI, BSF, FTF frequencies) into natural language for LLM processing, enabling few-shot anomaly detection with high accuracy. The system classifies fault types (inner race, outer race, ball/roller, cage faults) and assesses severity levels. A multi-agentic component processes maintenance manuals using vector embeddings and semantic search, while also conducting web searches to retrieve comprehensive procedural knowledge and access up-to-date maintenance practices for more accurate and in-depth recommendations. The Gemini model then generates structured maintenance recommendations includes immediate actions, inspection checklists, corrective measures, parts requirements, and timeline specifications. Experimental validation in bearing vibration datasets demonstrates effective anomaly detection and contextually relevant maintenance guidance. The system successfully bridges the gap between condition monitoring and actionable maintenance planning, providing industrial practitioners with intelligent decision support. This work advances the application of LLMs in industrial maintenance, offering a scalable framework for prescriptive maintenance across machinery components and industrial sectors. 工业机械维护需要及时干预以防止灾难性故障并优化运行效率。本文提出了一种基于大型语言模型（LLM）的综合智能处方性维护系统，该系统超越了传统的异常检测，提供可执行的维护建议。基于我们之前用于数值数据分析的 LAMP 框架，我们开发了一个结合轴承振动频率分析与多智能体生成的全面解决方案，用于智能维护规划。我们的方法将轴承振动数据（BPFO、BPFI、BSF、FTF 频率）序列化为自然语言以供 LLM 处理，从而实现高精度的少样本异常检测。该系统对故障类型（内圈、外圈、滚子/滚珠、保持架故障）进行分类并评估严重程度。多智能体组件使用向量嵌入和语义检索处理维护手册，同时进行网页搜索以检索全面的操作程序知识并获取最新的维护实践，从而提供更准确和更深入的建议。然后，Gemini 模型生成结构化的维护建议，包括即时措施、检查清单、纠正措施、零件需求和时间表说明。在滚动轴承振动数据集上的实验验证表明其在异常检测和上下文相关的维护指导方面有效。该系统成功弥合了状态监测与可执行维护规划之间的鸿沟，为工业从业者提供智能决策支持。这项工作推进了 LLMs 在工业维护中的应用，提供了一个可扩展的规范性维护框架，适用于各类机械部件和工业部门。

Subjects: Artificial Intelligence, Computation and Language, Machine Learning, Multiagent Systems, Signal Processing 主题：人工智能，计算与语言，机器学习，多智能体系统，信号处理

Publish: 2025-07-28 14:22:19 UTC 发表：2025-07-28 14:22:19 UTC

Authors: [Jianpeng Yao](https://arxiv.org/search/?searchtype=author&query=Jianpeng Yao), [Xiaopan Zhang](https://arxiv.org/search/?searchtype=author&query=Xiaopan Zhang), [Yu Xia](https://arxiv.org/search/?searchtype=author&query=Yu Xia), [Zejin Wang](https://arxiv.org/search/?searchtype=author&query=Zejin Wang), [Amit K. Roy-Chowdhury](https://arxiv.org/search/?searchtype=author&query=Amit K. Roy-Chowdhury), [Jiachen Li](https://arxiv.org/search/?searchtype=author&query=Jiachen Li) 作者：姚建鹏、张晓盼、夏宇、王泽金、Amit K. Roy-Chowdhury、李嘉辰

Mobile robots navigating in crowds trained using reinforcement learning are known to suffer performance degradation when faced with out-of-distribution scenarios. We propose that by properly accounting for the uncertainties of pedestrians, a robot can learn safe navigation policies that are robust to distribution shifts. Our method augments agent observations with prediction uncertainty estimates generated by adaptive conformal inference, and it uses these estimates to guide the agent’s behavior through constrained reinforcement learning. The system helps regulate the agent’s actions and enables it to adapt to distribution shifts. In the in-distribution setting, our approach achieves a 96.93% success rate, which is over 8.80% higher than the previous state-of-the-art baselines with over 3.72 times fewer collisions and 2.43 times fewer intrusions into ground-truth human future trajectories. In three out-of-distribution scenarios, our method shows much stronger robustness when facing distribution shifts in velocity variations, policy changes, and transitions from individual to group dynamics. We deploy our method on a real robot, and experiments show that the robot makes safe and robust decisions when interacting with both sparse and dense crowds. Our code and videos are available on https://gen-safe-nav.github.io/. 在通过强化学习训练的移动机器人在人群中导航时，已知在遇到分布外场景时性能会下降。我们提出，通过适当考虑行人的不确定性，机器人可以学习对分布偏移稳健的安全导航策略。我们的方法将由自适应保形推断生成的预测不确定性估计附加到智能体的观测中，并使用这些估计通过受约束的强化学习来引导智能体的行为。该系统有助于规范智能体的动作，并使其能够适应分布偏移。在同分布设置中，我们的方法实现了 96.93%的成功率，比之前的最先进基线高出超过 8.80%，碰撞次数减少了超过 3.72 倍，侵入真实人类未来轨迹的次数减少了 2.43 倍。在三种分布外场景中，我们的方法在面对速度变化、策略变化以及从个体到群体动态的转变等分布偏移时表现出更强的鲁棒性。我们在真实机器人上部署了我们的方法，实验表明在与稀疏和密集人群互动时，机器人能够做出安全且稳健的决策。我们的代码和视频可在 https://gen-safe-nav.github.io/ 获取。

Subjects: Robotics, Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning, Systems and Control 学科：机器人学、人工智能、计算机视觉与模式识别、机器学习、系统与控制

Publish: 2025-08-07 17:59:43 UTC 发布：2025-08-07 17:59:43 UTC

#35 KuaiLive: A Real-time Interactive Dataset for Live Streaming Recommendation

Authors: [Changle Qu](https://arxiv.org/search/?searchtype=author&query=Changle Qu), [Sunhao Dai](https://arxiv.org/search/?searchtype=author&query=Sunhao Dai), [Ke Guo](https://arxiv.org/search/?searchtype=author&query=Ke Guo), [Liqin Zhao](https://arxiv.org/search/?searchtype=author&query=Liqin Zhao), [Yanan Niu](https://arxiv.org/search/?searchtype=author&query=Yanan Niu), [Xiao Zhang](https://arxiv.org/search/?searchtype=author&query=Xiao Zhang), [Jun Xu](https://arxiv.org/search/?searchtype=author&query=Jun Xu) 作者：曲长乐、戴孙浩、顾可、赵立勤、牛亚楠、张晓、徐军

Live streaming platforms have become a dominant form of online content consumption, offering dynamically evolving content, real-time interactions, and highly engaging user experiences. These unique characteristics introduce new challenges that differentiate live streaming recommendation from traditional recommendation settings and have garnered increasing attention from industry in recent years. However, research progress in academia has been hindered by the lack of publicly available datasets that accurately reflect the dynamic nature of live streaming environments. To address this gap, we introduce KuaiLive, the first real-time, interactive dataset collected from Kuaishou, a leading live streaming platform in China with over 400 million daily active users. The dataset records the interaction logs of 23,772 users and 452,621 streamers over a 21-day period. Compared to existing datasets, KuaiLive offers several advantages: it includes precise live room start and end timestamps, multiple types of real-time user interactions (click, comment, like, gift), and rich side information features for both users and streamers. These features enable more realistic simulation of dynamic candidate items and better modeling of user and streamer behaviors. We conduct a thorough analysis of KuaiLive from multiple perspectives and evaluate several representative recommendation methods on it, establishing a strong benchmark for future research. KuaiLive can support a wide range of tasks in the live streaming domain, such as top-K recommendation, click-through rate prediction, watch time prediction, and gift price prediction. Moreover, its fine-grained behavioral data also enables research on multi-behavior modeling, multi-task learning, and fairness-aware recommendation. The dataset and related resources are publicly available at https://imgkkk574.github.io/KuaiLive. 直播平台已成为主导的在线内容消费形式，提供不断演化的内容、实时互动和高度吸引用户的体验。这些独特特性带来了新的挑战，使直播推荐有别于传统推荐场景，并在近年来引起业界越来越多的关注。然而，学术研究的进展受到缺乏能准确反映直播环境动态性的公开数据集的制约。为填补这一空白，我们引入了 KuaiLive，这是首个从快手收集的实时交互数据集，快手是中国领先的直播平台，日活跃用户超过 4 亿。该数据集记录了 21 天期间 23,772 名用户和 452,621 名主播的交互日志。与现有数据集相比，KuaiLive 具有若干优势：它包含精确的直播间开始和结束时间戳、多种类型的实时用户交互（点击、评论、点赞、打赏）以及用户和主播的丰富侧信息特征。这些特性使得对动态候选项的模拟更加逼真，并能更好地建模用户和主播的行为。我们从多个角度对 KuaiLive 进行了全面分析，并在其上评估了若干具有代表性的推荐方法，为未来研究建立了强有力的基准。KuaiLive 可以支持直播领域的广泛任务，例如 top-K 推荐、点击率预测、观看时长预测和打赏金额预测。此外，其细粒度的行为数据还支持多行为建模、多任务学习和关注公平性的推荐研究。该数据集及相关资源可在 https://imgkkk574.github.io/KuaiLive 公共获取。

Subjects: Information Retrieval, Artificial Intelligence 主题：信息检索，人工智能

Publish: 2025-08-07 17:59:36 UTC 发布：2025-08-07 17:59:36 UTC

#36 H-Net++: Hierarchical Dynamic Chunking for Tokenizer-Free Language Modelling in Morphologically-Rich Languages #36 H-Net++：用于形态丰富语言的无分词器语言建模的分层动态分块方法

Byte-level language models eliminate fragile tokenizers but face computational challenges in morphologically-rich languages (MRLs), where words span many bytes. We propose H-NET++, a hierarchical dynamic-chunking model that learns linguistically-informed segmentation through end-to-end training. Key innovations include: (1) a lightweight Transformer context-mixer (1.9M parameters) for cross-chunk attention, (2) a two-level latent hyper-prior for document-level consistency, (3) specialized handling of orthographic artifacts (e.g. Persian ZWNJ), and (4) curriculum-based training with staged sequence lengths. On a 1.4B-token Persian corpus, H-NET++ achieves state-of-the-art results: 0.159 BPB reduction versus BPE-based GPT-2-fa (12% better compression), 5.4pp gain on ParsGLUE, 53% improved robustness to ZWNJ corruption, and 73.8% F1 on gold morphological boundaries. Our learned chunks align with Persian morphology without explicit supervision, demonstrating that hierarchical dynamic chunking provides an effective tokenizer-free solution for MRLs while maintaining computational efficiency. 字节级语言模型消除了脆弱的分词器，但在形态丰富的语言（MRLs）中面临计算挑战，因为单词通常由许多字节组成。我们提出了 H-NET++，一种层次化动态分块模型，通过端到端训练学习具有语言学信息的分割。主要创新包括： (1) 一个轻量级的 Transformer 上下文混合器（1.9M 参数）用于跨块注意，(2) 一个用于文档级一致性的两级潜在超先验，(3) 对正字法伪影（例如波斯语的 ZWNJ）进行的专门处理，和 (4) 基于课程的训练与分阶段序列长度。在一个包含 14 亿标记的波斯语语料上，H-NET++ 达到最先进的结果：相比基于 BPE 的 GPT-2-fa 在位元/字节（BPB）上减少 0.159（压缩率提高 12%）、在 ParsGLUE 上提升 5.4 个百分点、对 ZWNJ 污染的鲁棒性提高 53%，以及在 gold 形态边界上取得 73.8% 的 F1。我们学习到的分块在没有显式监督的情况下与波斯语形态对齐，表明层次化动态分块为形态丰富语言提供了一种有效的无分词器解决方案，同时保持了计算效率。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-07 17:59:01 UTC 发布：2025-08-07 17:59:01 UTC

#37 How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations #37 LLMs 如何说服？线性探针可以揭示多轮对话中的说服动态 [PDF ] [Copy] [Kimi ] [REL]

Large Language Models (LLMs) have started to demonstrate the ability to persuade humans, yet our understanding of how this dynamic transpires is limited. Recent work has used linear probes, lightweight tools for analyzing model representations, to study various LLM skills such as the ability to model user sentiment and political perspective. Motivated by this, we apply probes to study persuasion dynamics in natural, multi-turn conversations. We leverage insights from cognitive science to train probes on distinct aspects of persuasion: persuasion success, persuadee personality, and persuasion strategy. Despite their simplicity, we show that they capture various aspects of persuasion at both the sample and dataset levels. For instance, probes can identify the point in a conversation where the persuadee was persuaded or where persuasive success generally occurs across the entire dataset. We also show that in addition to being faster than expensive prompting-based approaches, probes can do just as well and even outperform prompting in some settings, such as when uncovering persuasion strategy. This suggests probes as a plausible avenue for studying other complex behaviours such as deception and manipulation, especially in multi-turn settings and large-scale dataset analysis where prompting-based methods would be computationally inefficient. 大型语言模型（LLMs）已开始展现出劝服人类的能力，但我们对这一动态如何发生的理解仍然有限。近期研究使用线性探针——用于分析模型表示的轻量级工具——来研究各种 LLM 技能，例如模拟用户情绪和政治立场的能力。受此启发，我们将探针用于研究自然多轮对话中的劝服动态。我们借鉴认知科学的见解，训练探针以区分劝服的不同方面：劝服是否成功、被劝服者的性格以及劝服策略。尽管探针非常简单，我们表明它们在样本层面和数据集层面都能捕捉劝服的各个方面。例如，探针可以识别对话中被劝服者被说服的时点，或在整个数据集中普遍发生劝服成功的时刻。我们还表明，除了比昂贵的基于提示的方法更快之外，探针在某些情形下的表现与提示法不相上下，甚至更优，例如在揭示劝服策略时。这表明探针是一种可行的方法，用于研究其他复杂行为，例如欺骗和操控，特别是在多轮对话场景和大规模数据集分析中，在这些场景下基于提示的方法在计算上会效率低下。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-07 17:58:41 UTC 发布：2025-08-07 17:58:41 UTC

#38 TrajEvo: Trajectory Prediction Heuristics Design via LLM-driven Evolution #38 TrajEvo：通过 LLM 驱动的进化设计轨迹预测启发式

Authors: [Zhikai Zhao](https://arxiv.org/search/?searchtype=author&query=Zhikai Zhao), [Chuanbo Hua](https://arxiv.org/search/?searchtype=author&query=Chuanbo Hua), [Federico Berto](https://arxiv.org/search/?searchtype=author&query=Federico Berto), [Kanghoon Lee](https://arxiv.org/search/?searchtype=author&query=Kanghoon Lee), [Zihan Ma](https://arxiv.org/search/?searchtype=author&query=Zihan Ma), [Jiachen Li](https://arxiv.org/search/?searchtype=author&query=Jiachen Li), [Jinkyoo Park](https://arxiv.org/search/?searchtype=author&query=Jinkyoo Park) 作者：赵志凯、华传博、Federico Berto、Lee Kanghoon、马子涵、李佳晨、朴晋丘

Trajectory prediction is a critical task in modeling human behavior, especially in safety-critical domains such as social robotics and autonomous vehicle navigation. Traditional heuristics based on handcrafted rules often lack accuracy and generalizability. Although deep learning approaches offer improved performance, they typically suffer from high computational cost, limited explainability, and, importantly, poor generalization to out-of-distribution (OOD) scenarios. In this paper, we introduce TrajEvo, a framework that leverages Large Language Models (LLMs) to automatically design trajectory prediction heuristics. TrajEvo employs an evolutionary algorithm to generate and refine prediction heuristics from past trajectory data. We propose two key innovations: Cross-Generation Elite Sampling to encourage population diversity, and a Statistics Feedback Loop that enables the LLM to analyze and improve alternative predictions. Our evaluations demonstrate that TrajEvo outperforms existing heuristic methods across multiple real-world datasets, and notably surpasses both heuristic and deep learning methods in generalizing to an unseen OOD real-world dataset. TrajEvo marks a promising step toward the automated design of fast, explainable, and generalizable trajectory prediction heuristics. We release our source code to facilitate future research at https://github.com/ai4co/trajevo. 轨迹预测是建模人类行为的关键任务，尤其在社会机器人和自动驾驶导航等对安全要求高的领域。基于手工规则的传统启发式方法通常在准确性和泛化性方面不足。尽管深度学习方法能提供更好的性能，但它们通常面临计算成本高、可解释性差，且重要的是，对分布外（OOD）场景的泛化能力较差。本文中，我们提出了 TrajEvo，一个利用 Large Language Models (LLMs) 自动设计轨迹预测启发式规则的框架。TrajEvo 采用进化算法从过去的轨迹数据生成并优化预测启发式规则。我们提出了两项关键创新：用于鼓励种群多样性的跨代精英采样（Cross-Generation Elite Sampling），以及使 LLM 能够分析并改进备选预测的统计反馈回路（Statistics Feedback Loop）。我们的评估表明，TrajEvo 在多个现实世界数据集上优于现有的启发式方法，并且在泛化到未见的 OOD 现实世界数据集时，显著超越了启发式和深度学习方法。 TrajEvo 标志着朝着自动化设计快速、可解释且具泛化能力的轨迹预测启发式方法迈出的有希望的一步。我们在此发布源代码以便促进未来研究，地址为 https://github.com/ai4co/trajevo。

Subjects: Machine Learning, Artificial Intelligence, Neural and Evolutionary Computing, Robotics 主题：机器学习，人工智能，神经与进化计算，机器人学

Publish: 2025-08-07 17:55:10 UTC 发表：2025-08-07 17:55:10 UTC

#39 Test-Time Reinforcement Learning for GUI Grounding via Region Consistency #39 通过区域一致性的 GUI 定位的测试时强化学习

Graphical User Interface (GUI) grounding, the task of mapping natural language instructions to precise screen coordinates, is fundamental to autonomous GUI agents. While existing methods achieve strong performance through extensive supervised training or reinforcement learning with labeled rewards, they remain constrained by the cost and availability of pixel-level annotations. We observe that when models generate multiple predictions for the same GUI element, the spatial overlap patterns reveal implicit confidence signals that can guide more accurate localization. Leveraging this insight, we propose GUI-RC (Region Consistency), a test-time scaling method that constructs spatial voting grids from multiple sampled predictions to identify consensus regions where models show highest agreement. Without any training, GUI-RC improves accuracy by 2-3% across various architectures on ScreenSpot benchmarks. We further introduce GUI-RCPO (Region Consistency Policy Optimization), which transforms these consistency patterns into rewards for test-time reinforcement learning. By computing how well each prediction aligns with the collective consensus, GUI-RCPO enables models to iteratively refine their outputs on unlabeled data during inference. Extensive experiments demonstrate the generality of our approach: GUI-RC boosts Qwen2.5-VL-3B-Instruct from 80.11% to 83.57% on ScreenSpot-v2, while GUI-RCPO further improves it to 85.14% through self-supervised optimization. Our approach reveals the untapped potential of test-time scaling and test-time reinforcement learning for GUI grounding, offering a promising path toward more robust and data-efficient GUI agents. 图形用户界面（GUI）定位，即将自然语言指令映射到精确屏幕坐标的任务，对于自主 GUI 代理至关重要。尽管现有方法通过大规模有监督训练或带有标注奖励的强化学习能够达到较强性能，但它们仍受限于像素级注释的成本和可用性。我们观察到，当模型对同一 GUI 元素生成多个预测时，空间重叠模式揭示了可指导更精确定位的隐含置信信号。利用这一洞见，我们提出了 GUI-RC（区域一致性），这是一种测试时扩展方法，它从多个采样预测中构建空间投票网格，以识别模型达成最高一致性的共识区域。在无需任何训练的情况下，GUI-RC 在 ScreenSpot 基准上对多种架构的准确率提升了 2–3%。我们进一步引入了 GUI-RCPO（区域一致性策略优化），将这些一致性模式转化为测试时强化学习的奖励。通过计算每个预测与整体共识的一致程度，GUI-RCPO 使模型在推理期间能够在无标签数据上迭代地优化其输出。大量实验证明了我们方法的通用性：在 ScreenSpot-v2 上，GUI-RC 将 Qwen2.5-VL-3B-Instruct 从 80.11% 提升到 83.57%，而 GUI-RCPO 通过自监督优化进一步将其提升到 85.14%。我们的方法揭示了测试时扩展和测试时强化学习在 GUI 定位方面未被开发的潜力，为实现更鲁棒、更高数据效率的 GUI 代理提供了一个有前景的路径。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language 主题：计算机视觉与模式识别、人工智能、计算与语言

Publish: 2025-08-07 17:54:27 UTC 发布：2025-08-07 17:54:27 UTC

#40 OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks #40 OmniEAR：在具身任务中对智能体推理进行基准测试

Large language models excel at abstract reasoning but their capacity for embodied agent reasoning remains largely unexplored. We present OmniEAR, a comprehensive framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. Unlike existing benchmarks that provide predefined tool sets or explicit collaboration directives, OmniEAR requires agents to dynamically acquire capabilities and autonomously determine coordination strategies based on task demands. Through text-based environment representation, we model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints: while achieving 85-96% success with explicit instructions, performance drops to 56-85% for tool reasoning and 63-85% for implicit collaboration, with compound tasks showing over 50% failure rates. Surprisingly, complete environmental information degrades coordination performance, indicating models cannot filter task-relevant constraints. Fine-tuning improves single-agent tasks dramatically (0.6% to 76.3%) but yields minimal multi-agent gains (1.5% to 5.5%), exposing fundamental architectural limitations. These findings demonstrate that embodied reasoning poses fundamentally different challenges than current models can address, establishing OmniEAR as a rigorous benchmark for evaluating and advancing embodied AI systems. Our code and data are included in the supplementary materials and will be open-sourced upon acceptance. 大型语言模型在抽象推理方面表现出色，但其在具身代理推理方面的能力仍 largely 未被探索。我们提出了 OmniEAR，一个用于评估语言模型如何推理物理交互、工具使用和多代理协作的综合框架。与提供预定义工具集或明确协作指令的现有基准不同，OmniEAR 要求代理根据任务需求动态获取能力并自主决定协调策略。通过基于文本的环境表示，我们对家庭和工业领域中跨越 1,500 个情景的连续物理属性和复杂空间关系建模。我们的系统评估表明，当模型必须从约束中推理时，性能出现严重下降：在获得明确指示时成功率为 85%–96%，而在工具推理时性能下降到 56%–85%，在隐式协作时为 63%–85%，复合任务的失败率超过 50%。令人惊讶的是，完整的环境信息会降低协调性能，这表明模型无法筛选出与任务相关的约束。微调在单智能体任务上显著提升性能（从 0.6%提升到 76.3%），但在多智能体场景中仅带来最小的增益（从 1.5%到 5.5%），暴露了根本性的架构限制。这些发现表明，具身推理提出了与当前模型无法应对的根本不同的挑战，确立了 OmniEAR 作为评估和推进具身人工智能系统的严格基准。我们的代码和数据包含在补充材料中，并将在论文被接受后开源。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-07 17:54:15 UTC 发表：2025-08-07 17:54:15 UTC

#41 Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models #41 Cooper：在用于大型语言模型的强化学习中共同优化策略与奖励模型

Large language models (LLMs) have demonstrated remarkable performance in reasoning tasks, where reinforcement learning (RL) serves as a key algorithm for enhancing their reasoning capabilities. Currently, there are two mainstream reward paradigms: model-based rewards and rule-based rewards. However, both approaches suffer from limitations: rule-based rewards lack robustness, while model-based rewards are vulnerable to reward hacking. To address these issues, we propose Cooper(Co-optimizing Policy Model and Reward Model), a RL framework that jointly optimizes both the policy model and the reward model. Cooper leverages the high precision of rule-based rewards when identifying correct responses, and dynamically constructs and selects positive-negative sample pairs for continued training the reward model. This design enhances robustness and mitigates the risk of reward hacking. To further support Cooper, we introduce a hybrid annotation strategy that efficiently and accurately generates training data for the reward model. We also propose a reference-based reward modeling paradigm, where the reward model takes a reference answer as input. Based on this design, we train a reward model named VerifyRM, which achieves higher accuracy on VerifyBench compared to other models of the same size. We conduct reinforcement learning using both VerifyRM and Cooper. Our experiments show that Cooper not only alleviates reward hacking but also improves end-to-end RL performance, for instance, achieving a 0.54% gain in average accuracy on Qwen2.5-1.5B-Instruct. Our findings demonstrate that dynamically updating reward model is an effective way to combat reward hacking, providing a reference for better integrating reward models into RL. 大型语言模型（LLMs）在推理任务中表现出色，而强化学习（RL）是提升其推理能力的关键算法。目前有两种主流的奖励范式：基于模型的奖励和基于规则的奖励。然而，这两种方法都有各自的局限性：基于规则的奖励缺乏鲁棒性，而基于模型的奖励易受到奖励操控的影响。为了解决这些问题，我们提出了 Cooper（协同优化策略模型与奖励模型），一种同时优化策略模型与奖励模型的强化学习框架。Cooper 利用基于规则的奖励在识别正确回应时的高精度，并动态构建和选择正负样本对以持续训练奖励模型。该设计增强了鲁棒性并减轻了奖励操控的风险。为进一步支持 Cooper，我们引入了一种混合注释策略，以高效且准确地为奖励模型生成训练数据。我们还提出了一种基于参考的奖励建模范式，其中奖励模型以参考答案作为输入。基于这一设计，我们训练了一个名为 VerifyRM 的奖励模型，在 VerifyBench 上比同等规模的其他模型取得了更高的准确率。我们使用 VerifyRM 和 Cooper 进行了强化学习。我们的实验表明，Cooper 不仅缓解了奖励规避问题，还提升了端到端的强化学习性能，例如在 Qwen2.5-1.5B-Instruct 上平均准确率提高了 0.54%。我们的研究结果表明，动态更新奖励模型是对抗奖励规避的一种有效方法，为更好地将奖励模型整合到强化学习中提供了参考。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-07 17:53:56 UTC 发布：2025-08-07 17:53:56 UTC

#42 Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle #42 Shuffle-R1：通过以数据为中心的动态洗牌实现多模态大型语言模型高效强化学习框架

Reinforcement learning (RL) has emerged as an effective post-training paradigm for enhancing the reasoning capabilities of multimodal large language model (MLLM). However, current RL pipelines often suffer from training inefficiencies caused by two underexplored issues: Advantage Collapsing, where most advantages in a batch concentrate near zero, and Rollout Silencing, where the proportion of rollouts contributing non-zero gradients diminishes over time. These issues lead to suboptimal gradient updates and hinder long-term learning efficiency. To address these issues, we propose Shuffle-R1, a simple yet principled framework that improves RL fine-tuning efficiency by dynamically restructuring trajectory sampling and batch composition. It introduces (1) Pairwise Trajectory Sampling, which selects high-contrast trajectories with large advantages to improve gradient signal quality, and (2) Advantage-based Trajectory Shuffle, which increases exposure of valuable rollouts through informed batch reshuffling. Experiments across multiple reasoning benchmarks show that our framework consistently outperforms strong RL baselines with minimal overhead. These results highlight the importance of data-centric adaptations for more efficient RL training in MLLM. 强化学习（RL）已成为提升多模态大语言模型（MLLM）推理能力的有效训练后范式。然而，目前的 RL 流程常因两个鲜少被探讨的问题而导致训练效率低下：优势塌陷（Advantage Collapsing），即批次中大多数优势值集中接近于零；以及回放沉默（Rollout Silencing），即随时间推移对梯度产生非零贡献的回放比例逐渐减少。这些问题导致梯度更新不理想，阻碍长期学习效率。为了解决这些问题，我们提出了 Shuffle-R1，一个简单但有理论依据的框架，通过动态重构轨迹采样与批次组成来提升 RL 微调效率。该框架引入了（1）成对轨迹采样（Pairwise Trajectory Sampling），选择具有大优势的高对比轨迹以提高梯度信号质量；以及（2）基于优势的轨迹重排（Advantage-based Trajectory Shuffle），通过有针对性的批次重洗增加有价值回放的暴露率。在多个推理基准上的实验证明，我们的框架在开销极小的情况下持续优于强基线 RL 方法。这些结果强调了以数据为中心的调整对于在多模态大型语言模型中实现更高效强化学习训练的重要性。

Subjects: Machine Learning, Artificial Intelligence 学科：机器学习，人工智能

Publish: 2025-08-07 17:53:47 UTC 发布：2025-08-07 17:53:47 UTC

#43 Iterative Learning of Computable Phenotypes for Treatment Resistant Hypertension using Large Language Models #43 使用大型语言模型对难治性高血压的可计算表型进行迭代学习

Large language models (LLMs) have demonstrated remarkable capabilities for medical question answering and programming, but their potential for generating interpretable computable phenotypes (CPs) is under-explored. In this work, we investigate whether LLMs can generate accurate and concise CPs for six clinical phenotypes of varying complexity, which could be leveraged to enable scalable clinical decision support to improve care for patients with hypertension. In addition to evaluating zero-short performance, we propose and test a synthesize, execute, debug, instruct strategy that uses LLMs to generate and iteratively refine CPs using data-driven feedback. Our results show that LLMs, coupled with iterative learning, can generate interpretable and reasonably accurate programs that approach the performance of state-of-the-art ML methods while requiring significantly fewer training examples. 大型语言模型（LLMs）在医学问答和编程方面展现出卓越能力，但它们在生成可解释的可计算表型（CPs）方面的潜力尚未充分挖掘。在本研究中，我们探讨了 LLMs 是否能够为六种不同复杂度的临床表型生成准确且简明的 CPs，这些 CPs 可用于支持可扩展的临床决策，以改善高血压患者的护理。除了评估零-shot 性能外，我们提出并测试了一种“合成、执行、调试、指令”策略，该策略利用 LLMs 生成并通过数据驱动的反馈对 CPs 进行迭代完善。我们的结果表明，LLMs 结合迭代学习，能够生成可解释且相当准确的程序，其性能接近最先进的机器学习方法，同时所需的训练示例显著更少。

Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题：机器学习、人工智能、计算与语言

Publish: 2025-08-07 17:15:17 UTC 发布：2025-08-07 17:15:17 UTC

#44 Adapting Vision-Language Models Without Labels: A Comprehensive Survey #44 在无需标签的情况下适配视觉-语言模型：一篇综合性综述

Authors: [Hao Dong](https://arxiv.org/search/?searchtype=author&query=Hao Dong), [Lijun Sheng](https://arxiv.org/search/?searchtype=author&query=Lijun Sheng), [Jian Liang](https://arxiv.org/search/?searchtype=author&query=Jian Liang), [Ran He](https://arxiv.org/search/?searchtype=author&query=Ran He), [Eleni Chatzi](https://arxiv.org/search/?searchtype=author&query=Eleni Chatzi), [Olga Fink](https://arxiv.org/search/?searchtype=author&query=Olga Fink) 作者：董昊，盛立军，梁健，何然，Eleni Chatzi，Olga Fink

Vision-Language Models (VLMs) have demonstrated remarkable generalization capabilities across a wide range of tasks. However, their performance often remains suboptimal when directly applied to specific downstream scenarios without task-specific adaptation. To enhance their utility while preserving data efficiency, recent research has increasingly focused on unsupervised adaptation methods that do not rely on labeled data. Despite the growing interest in this area, there remains a lack of a unified, task-oriented survey dedicated to unsupervised VLM adaptation. To bridge this gap, we present a comprehensive and structured overview of the field. We propose a taxonomy based on the availability and nature of unlabeled visual data, categorizing existing approaches into four key paradigms: Data-Free Transfer (no data), Unsupervised Domain Transfer (abundant data), Episodic Test-Time Adaptation (batch data), and Online Test-Time Adaptation (streaming data). Within this framework, we analyze core methodologies and adaptation strategies associated with each paradigm, aiming to establish a systematic understanding of the field. Additionally, we review representative benchmarks across diverse applications and highlight open challenges and promising directions for future research. An actively maintained repository of relevant literature is available at https://github.com/tim-learn/Awesome-LabelFree-VLMs. 视觉-语言模型（VLMs）在广泛任务上表现出卓越的泛化能力。然而，当这些模型在没有针对性任务适配的情况下直接应用于特定下游场景时，其性能常常仍不尽如人意。为了在保持数据效率的同时提升其实用性，近期研究越来越多地聚焦于不依赖标注数据的无监督适配方法。尽管这一领域日益受到关注，但仍缺乏面向任务的、关于无监督 VLM 适配的统一综述。为填补这一空白，我们呈现了该领域的全面且结构化的概览。我们提出了基于无标注视觉数据的可用性与性质的分类法，将现有方法归为四大范式：无数据迁移（Data-Free Transfer）、无监督域迁移（数据充足）、情景测试时适配（批量数据）、以及在线测试时适配（流式数据）。在此框架内，我们分析了与每一范式相关的核心方法学和适配策略，旨在建立对该领域的系统性理解。此外，我们回顾了跨多种应用的代表性基准测试，并强调了未来研究中尚未解决的挑战和有前景的方向。相关文献的持续维护存储库可在 https://github.com/tim-learn/Awesome-LabelFree-VLMs 获取。

Subjects: Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition 学科：机器学习、人工智能、计算机视觉与模式识别

Publish: 2025-08-07 16:27:37 UTC 发布：2025-08-07 16:27:37 UTC

#45 Conformal Sets in Multiple-Choice Question Answering under Black-Box Settings with Provable Coverage Guarantees #45 在黑盒设置下具有可证明覆盖保证的多项选择题回答中的保形集 [PDF ] [Copy] [Kimi 3 ] [REL]

Authors: [Guang Yang](https://arxiv.org/search/?searchtype=author&query=Guang Yang), [Xinyang Liu](https://arxiv.org/search/?searchtype=author&query=Xinyang Liu) 作者：杨广，刘新阳

Large Language Models (LLMs) have shown remarkable progress in multiple-choice question answering (MCQA), but their inherent unreliability, such as hallucination and overconfidence, limits their application in high-risk domains. To address this, we propose a frequency-based uncertainty quantification method under black-box settings, leveraging conformal prediction (CP) to ensure provable coverage guarantees. Our approach involves multiple independent samplings of the model’s output distribution for each input, with the most frequent sample serving as a reference to calculate predictive entropy (PE). Experimental evaluations across six LLMs and four datasets (MedMCQA, MedQA, MMLU, MMLU-Pro) demonstrate that frequency-based PE outperforms logit-based PE in distinguishing between correct and incorrect predictions, as measured by AUROC. Furthermore, the method effectively controls the empirical miscoverage rate under user-specified risk levels, validating that sampling frequency can serve as a viable substitute for logit-based probabilities in black-box scenarios. This work provides a distribution-free model-agnostic framework for reliable uncertainty quantification in MCQA with guaranteed coverage, enhancing the trustworthiness of LLMs in practical applications. 大型语言模型（LLMs）在多项选择题问答（MCQA）方面表现出令人瞩目的进步，但其固有的不可靠性，如幻觉和过度自信，限制了其在高风险领域的应用。为应对这一点，我们提出了一种在黑箱设置下基于频率的不确定性量化方法，利用符合预测（conformal prediction，CP）以确保可证实的覆盖保证。我们的方法对每个输入进行多次独立抽样模型的输出分布，以最频繁出现的样本作为参考来计算预测熵（PE）。在六个 LLMs 和四个数据集（MedMCQA、MedQA、MMLU、MMLU-Pro）上的实验评估表明，基于频率的 PE 在通过 AUROC 区分正确与错误预测方面优于基于 logit 的 PE。此外，该方法能在用户指定的风险水平下有效控制经验失覆盖率，验证了在黑箱场景中抽样频率可作为基于 logit 概率的可行替代。这项工作提出了一个与模型无关且不依赖分布的框架，用于在多项选择问答（MCQA）中进行可靠的不确定性量化并保证覆盖率，从而提升了 LLMs 在实际应用中的可信度。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-07 16:22:49 UTC 发布：2025-08-07 16:22:49 UTC

#46 Tractable Sharpness-Aware Learning of Probabilistic Circuits #46 可处理的面向锐化的概率电路学习

Authors: [Hrithik Suresh](https://arxiv.org/search/?searchtype=author&query=Hrithik Suresh), [Sahil Sidheekh](https://arxiv.org/search/?searchtype=author&query=Sahil Sidheekh), [Vishnu Shreeram M. P](https://arxiv.org/search/?searchtype=author&query=Vishnu Shreeram M. P), [Sriraam Natarajan](https://arxiv.org/search/?searchtype=author&query=Sriraam Natarajan), [Narayanan C. Krishnan](https://arxiv.org/search/?searchtype=author&query=Narayanan C. Krishnan) 作者：Hrithik Suresh、Sahil Sidheekh、Vishnu Shreeram M. P、Sriraam Natarajan、Narayanan C. Krishnan

Probabilistic Circuits (PCs) are a class of generative models that allow exact and tractable inference for a wide range of queries. While recent developments have enabled the learning of deep and expressive PCs, this increased capacity can often lead to overfitting, especially when data is limited. We analyze PC overfitting from a log-likelihood-landscape perspective and show that it is often caused by convergence to sharp optima that generalize poorly. Inspired by sharpness aware minimization in neural networks, we propose a Hessian-based regularizer for training PCs. As a key contribution, we show that the trace of the Hessian of the log-likelihood-a sharpness proxy that is typically intractable in deep neural networks-can be computed efficiently for PCs. Minimizing this Hessian trace induces a gradient-norm-based regularizer that yields simple closed-form parameter updates for EM, and integrates seamlessly with gradient based learning methods. Experiments on synthetic and real-world datasets demonstrate that our method consistently guides PCs toward flatter minima, improves generalization performance. 概率电路（PCs）是一类生成模型，能够对多种查询进行精确且可处理的推断。尽管近年来的发展使得学习深层且表达力强的概率电路成为可能，但这种容量的增加常常会导致过拟合，尤其在数据有限的情况下。我们从对数似然景观的角度分析了概率电路的过拟合问题，发现其常由收敛到会导致泛化能力差的尖锐极值点引起。受神经网络中尖锐度感知最小化方法的启发，我们提出了一种基于海森矩阵的正则化器用于训练概率电路。作为一项关键贡献，我们展示了对数似然海森矩阵的迹——这一在深度神经网络中通常不可行的尖锐度代理——可以在概率电路中高效计算。最小化该海森迹会引入一个基于梯度范数的正则项，从而为期望最大化（EM）方法导出简单的闭式参数更新，并能无缝集成到基于梯度的学习方法中。对合成和真实世界数据集的实验表明，我们的方法能持续引导概率电路朝着更平缓的极小值，且提高了泛化性能。

Subjects: Machine Learning, Artificial Intelligence 学科：机器学习，人工智能

Publish: 2025-08-07 16:13:24 UTC 发布：2025-08-07 16:13:24 UTC

#47 The World According to LLMs: How Geographic Origin Influences LLMs' Entity Deduction Capabilities #47 根据 LLMs 的世界：地理起源如何影响 LLMs 的实体推断能力

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-07 15:53:30 UTC 发布：2025-08-07 15:53:30 UTC

#48 LAG: Logic-Augmented Generation from a Cartesian Perspective #48 LAG：从笛卡尔视角的逻辑增强生成

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-07 15:42:00 UTC 发布：2025-08-07 15:42:00 UTC

#49 MoMA: A Mixture-of-Multimodal-Agents Architecture for Enhancing Clinical Prediction Modelling #49 MoMA：一种用于增强临床预测建模的多模态智能体混合架构

Multimodal electronic health record (EHR) data provide richer, complementary insights into patient health compared to single-modality data. However, effectively integrating diverse data modalities for clinical prediction modeling remains challenging due to the substantial data requirements. We introduce a novel architecture, Mixture-of-Multimodal-Agents (MoMA), designed to leverage multiple large language model (LLM) agents for clinical prediction tasks using multimodal EHR data. MoMA employs specialized LLM agents (“specialist agents”) to convert non-textual modalities, such as medical images and laboratory results, into structured textual summaries. These summaries, together with clinical notes, are combined by another LLM (“aggregator agent”) to generate a unified multimodal summary, which is then used by a third LLM (“predictor agent”) to produce clinical predictions. Evaluating MoMA on three prediction tasks using real-world datasets with different modality combinations and prediction settings, MoMA outperforms current state-of-the-art methods, highlighting its enhanced accuracy and flexibility across various tasks. 多模态电子健康记录（EHR）数据相比单一模态数据能提供更丰富、互补的病人健康洞察。然而，由于对数据量的巨大需求，有效整合多样化数据模态以进行临床预测建模仍然具有挑战性。我们提出了一种新颖的架构——多模态代理混合（Mixture-of-Multimodal-Agents，MoMA），旨在利用多个大型语言模型（LLM）代理来处理使用多模态 EHR 数据的临床预测任务。MoMA 采用专门的 LLM 代理（“专科代理”）将非文本模态，如医学影像和化验结果，转换为结构化的文本摘要。这些摘要连同临床记录由另一个 LLM（“聚合代理”）合并以生成统一的多模态摘要，随后由第三个 LLM（“预测代理”）使用该摘要来产生临床预测。在使用具有不同模态组合和预测设置的真实数据集对三项预测任务进行评估时，MoMA 的表现优于当前最先进的方法，突显出其在各种任务中的提升准确性和灵活性。

Subjects: Machine Learning, Artificial Intelligence, Multiagent Systems 主题：机器学习、人工智能、多智能体系统

Publish: 2025-08-07 15:28:34 UTC 发表时间：2025-08-07 15:28:34 UTC

#50 Embedding Alignment in Code Generation for Audio #50 在音频代码生成中的嵌入对齐

Authors: [Sam Kouteili](https://arxiv.org/search/?searchtype=author&query=Sam Kouteili), [Hiren Madhu](https://arxiv.org/search/?searchtype=author&query=Hiren Madhu), [George Typaldos](https://arxiv.org/search/?searchtype=author&query=George Typaldos), [Mark Santolucito](https://arxiv.org/search/?searchtype=author&query=Mark Santolucito) 作者：Sam Kouteili、Hiren Madhu、George Typaldos、Mark Santolucito

LLM-powered code generation has the potential to revolutionize creative coding endeavors, such as live-coding, by enabling users to focus on structural motifs over syntactic details. In such domains, when prompting an LLM, users may benefit from considering multiple varied code candidates to better realize their musical intentions. Code generation models, however, struggle to present unique and diverse code candidates, with no direct insight into the code’s audio output. To better establish a relationship between code candidates and produced audio, we investigate the topology of the mapping between code and audio embedding spaces. We find that code and audio embeddings do not exhibit a simple linear relationship, but supplement this with a constructed predictive model that shows an embedding alignment map could be learned. Supplementing the aim for musically diverse output, we present a model that given code predicts output audio embedding, constructing a code-audio embedding alignment map. 由 LLM 驱动的代码生成有可能彻底改变创意编码工作，比如实时编码（live-coding），因为它能让用户更多地关注结构性主题而非语法细节。在此类领域中，当向 LLM 提示时，用户可能通过考虑多个不同的代码候选项来更好地实现他们的音乐意图。然而，代码生成模型在呈现独特且多样化的代码候选项方面存在困难，且无法直接洞察代码的音频输出。为更好地建立代码候选项与生成音频之间的关系，我们研究了代码与音频嵌入空间之间映射的拓扑特性。我们发现代码与音频嵌入并不呈现简单的线性关系，但我们通过构建的预测模型补充说明了可以学习到嵌入对齐映射。为补充对音乐多样化输出的追求，我们提出了一个模型，该模型在给定代码的情况下预测输出音频嵌入，从而构建了代码—音频嵌入对齐映射。

Subjects: Multimedia, Artificial Intelligence, Sound, Audio and Speech Processing 主题：多媒体，人工智能，声音，音频与语音处理

Publish: 2025-08-07 15:13:42 UTC 发布日期：2025-08-07 15:13:42 UTC

#51 Task complexity shapes internal representations and robustness in neural networks #51 任务复杂性塑造神经网络的内部表征和鲁棒性 [PDF ] [Copy] [Kimi ] [REL]

Authors: [Robert Jankowski](https://arxiv.org/search/?searchtype=author&query=Robert Jankowski), [Filippo Radicchi](https://arxiv.org/search/?searchtype=author&query=Filippo Radicchi), [M. Ángeles Serrano](https://arxiv.org/search/?searchtype=author&query=M. Ángeles Serrano), [Marián Boguñá](https://arxiv.org/search/?searchtype=author&query=Marián Boguñá), [Santo Fortunato](https://arxiv.org/search/?searchtype=author&query=Santo Fortunato) 作者：Robert Jankowski, Filippo Radicchi, M. Ángeles Serrano, Marián Boguñá, Santo Fortunato

Neural networks excel across a wide range of tasks, yet remain black boxes. In particular, how their internal representations are shaped by the complexity of the input data and the problems they solve remains obscure. In this work, we introduce a suite of five data-agnostic probes-pruning, binarization, noise injection, sign flipping, and bipartite network randomization-to quantify how task difficulty influences the topology and robustness of representations in multilayer perceptrons (MLPs). MLPs are represented as signed, weighted bipartite graphs from a network science perspective. We contrast easy and hard classification tasks on the MNIST and Fashion-MNIST datasets. We show that binarizing weights in hard-task models collapses accuracy to chance, whereas easy-task models remain robust. We also find that pruning low-magnitude edges in binarized hard-task models reveals a sharp phase-transition in performance. Moreover, moderate noise injection can enhance accuracy, resembling a stochastic-resonance effect linked to optimal sign flips of small-magnitude weights. Finally, preserving only the sign structure-instead of precise weight magnitudes-through bipartite network randomizations suffices to maintain high accuracy. These phenomena define a model- and modality-agnostic measure of task complexity: the performance gap between full-precision and binarized or shuffled neural network performance. Our findings highlight the crucial role of signed bipartite topology in learned representations and suggest practical strategies for model compression and interpretability that align with task complexity. 神经网络在各种任务上表现出色，但仍然是黑箱。特别是，它们的内部表示如何受到输入数据复杂性和所解决问题的影响仍不清楚。在本工作中，我们引入了五种与数据无关的探针——剪枝、二值化、注入噪声、符号翻转和二分网络随机化——以量化任务难度如何影响多层感知器（MLP）中表示的拓扑结构和鲁棒性。从网络科学的角度出发，MLP 被表示为有符号的加权二分图。我们在 MNIST 和 Fashion-MNIST 数据集上对比了简单与困难的分类任务。结果表明，对困难任务模型进行权重二值化会使准确率坠落至随机水平，而简单任务模型仍然保持鲁棒。我们还发现，对二值化的困难任务模型剪除小幅度边会揭示性能的急剧相变。此外，适度注入噪声可提升准确率，这类似于与小幅度权重的最佳符号翻转相关的随机共振效应。最后，仅通过二分网络随机化保留符号结构——而不是精确的权重幅度——就足以保持高准确率。这些现象定义了一种与模型和模态无关的任务复杂性度量：全精度神经网络与二值化或混洗后的神经网络性能之间的差距。我们的发现强调了带符号的二分拓扑在学习表征中的关键作用，并提出了与任务复杂性相一致的模型压缩和可解释性实用策略。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-07 15:02:39 UTC 发表：2025-08-07 15:02:39 UTC

#52 EnergyPatchTST: Multi-scale Time Series Transformers with Uncertainty Estimation for Energy Forecasting #52 EnergyPatchTST：用于能源预测并带有不确定性估计的多尺度时间序列变换器

Authors: [Wei Li](https://arxiv.org/search/?searchtype=author&query=Wei Li), [Zixin Wang](https://arxiv.org/search/?searchtype=author&query=Zixin Wang), [Qizheng Sun](https://arxiv.org/search/?searchtype=author&query=Qizheng Sun), [Qixiang Gao](https://arxiv.org/search/?searchtype=author&query=Qixiang Gao), [Fenglei Yang](https://arxiv.org/search/?searchtype=author&query=Fenglei Yang) 作者：魏厉、王子鑫、孙启政、高齐祥、杨凤磊

Accurate and reliable energy time series prediction is of great significance for power generation planning and allocation. At present, deep learning time series prediction has become the mainstream method. However, the multi-scale time dynamics and the irregularity of real data lead to the limitations of the existing methods. Therefore, we propose EnergyPatchTST, which is an extension of the Patch Time Series Transformer specially designed for energy forecasting. The main innovations of our method are as follows: (1) multi-scale feature extraction mechanism to capture patterns with different time resolutions; (2) probability prediction framework to estimate uncertainty through Monte Carlo elimination; (3) integration path of future known variables (such as temperature and wind conditions); And (4) Pre-training and Fine-tuning examples to enhance the performance of limited energy data sets. A series of experiments on common energy data sets show that EnergyPatchTST is superior to other commonly used methods, the prediction error is reduced by 7-12%, and reliable uncertainty estimation is provided, which provides an important reference for time series prediction in the energy field. 准确可靠的能源时间序列预测对于发电规划和分配具有重要意义。目前，深度学习时间序列预测已成为主流方法。然而，多尺度时间动态和真实数据的不规则性导致了现有方法的局限性。因此，我们提出了 EnergyPatchTST，这是专为能源预测设计的 Patch 时间序列 Transformer 的扩展。我们方法的主要创新如下：（1）多尺度特征提取机制，用于捕捉具有不同时域分辨率的模式；（2）概率预测框架，通过蒙特卡洛消融来估计不确定性；（3）未来已知变量（如温度和风况）的整合路径；以及（4）用于增强有限能源数据集性能的预训练与微调示例。在常见能源数据集上进行的一系列实验表明，EnergyPatchTST 优于其他常用方法，预测误差降低了 7–12%，并提供了可靠的不确定性估计，为能源领域的时间序列预测提供了重要参考。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-07 14:48:39 UTC 发布时间：2025-08-07 14:48:39 UTC

#53 Tail-Risk-Safe Monte Carlo Tree Search under PAC-Level Guarantees #53 在 PAC 水平保证下的尾部风险安全蒙特卡洛树搜索

Authors: [Zuyuan Zhang](https://arxiv.org/search/?searchtype=author&query=Zuyuan Zhang), [Arnob Ghosh](https://arxiv.org/search/?searchtype=author&query=Arnob Ghosh), [Tian Lan](https://arxiv.org/search/?searchtype=author&query=Tian Lan) 作者：Zuyuan Zhang, Arnob Ghosh, Tian Lan

Making decisions with respect to just the expected returns in Monte Carlo Tree Search (MCTS) cannot account for the potential range of high-risk, adverse outcomes associated with a decision. To this end, safety-aware MCTS often consider some constrained variants – by introducing some form of mean risk measures or hard cost thresholds. These approaches fail to provide rigorous tail-safety guarantees with respect to extreme or high-risk outcomes (denoted as tail-risk), potentially resulting in serious consequence in high-stake scenarios. This paper addresses the problem by developing two novel solutions. We first propose CVaR-MCTS, which embeds a coherent tail risk measure, Conditional Value-at-Risk (CVaR), into MCTS. Our CVaR-MCTS with parameter α achieves explicit tail-risk control over the expected loss in the “worst (1−α)% scenarios.” Second, we further address the estimation bias of tail-risk due to limited samples. We propose Wasserstein-MCTS (or W-MCTS) by introducing a first-order Wasserstein ambiguity set Pεs(s,a) with radius εs to characterize the uncertainty in tail-risk estimates. We prove PAC tail-safety guarantees for both CVaR-MCTS and W-MCTS and establish their regret. Evaluations on diverse simulated environments demonstrate that our proposed methods outperform existing baselines, effectively achieving robust tail-risk guarantees with improved rewards and stability. 仅依据蒙特卡洛树搜索（MCTS）中的期望回报来做决策无法考虑与决策相关的高风险、不利结果的潜在范围。为此，安全感知型 MCTS 常常考虑一些受约束的变体——通过引入某种形式的均值风险度量或严格的成本阈值。这些方法未能就极端或高风险结果（称为尾部风险）提供严格的尾部安全性保证，在高风险情形下可能导致严重后果。本文通过开发两种新颖的解决方案来解决该问题。我们首先提出 CVaR-MCTS，将一个相干的尾部风险度量——条件在险值（CVaR）——嵌入到 MCTS 中。参数为 α 的 CVaR-MCTS 对“最差的 (1−α)% 情形”下的期望损失实现了明确的尾部风险控制。其次，我们进一步处理由于样本有限导致的尾部风险估计偏差。我们通过引入半径为 εs 的一阶瓦瑟斯坦不确定性集合 Pεs(s,a) 来刻画尾部风险估计的不确定性，从而提出了瓦瑟斯坦-MCTS（或 W-MCTS）。我们证明了 CVaR-MCTS 和 W-MCTS 的 PAC 尾部安全性保证，并确立了它们的后悔界。对多种模拟环境的评估表明，我们提出的方法优于现有基线，能够有效地在提高回报和稳定性的同时实现稳健的尾部风险保证。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-07 14:31:22 UTC 发布：2025-08-07 14:31:22 UTC

#54 Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions #54 用加权班扎夫相互作用解释视觉-语言编码器中的相似性 [PDF 2 ] [Copy] [Kimi 1 ] [REL]

Authors: [Hubert Baniecki](https://arxiv.org/search/?searchtype=author&query=Hubert Baniecki), [Maximilian Muschalik](https://arxiv.org/search/?searchtype=author&query=Maximilian Muschalik), [Fabian Fumagalli](https://arxiv.org/search/?searchtype=author&query=Fabian Fumagalli), [Barbara Hammer](https://arxiv.org/search/?searchtype=author&query=Barbara Hammer), [Eyke Hüllermeier](https://arxiv.org/search/?searchtype=author&query=Eyke Hüllermeier), [Przemyslaw Biecek](https://arxiv.org/search/?searchtype=author&query=Przemyslaw Biecek) 作者：Hubert Baniecki、Maximilian Muschalik、Fabian Fumagalli、Barbara Hammer、Eyke Hüllermeier、Przemyslaw Biecek

Language-image pre-training (LIP) enables the development of vision-language models capable of zero-shot classification, localization, multimodal retrieval, and semantic understanding. Various explanation methods have been proposed to visualize the importance of input image-text pairs on the model’s similarity outputs. However, popular saliency maps are limited by capturing only first-order attributions, overlooking the complex cross-modal interactions intrinsic to such encoders. We introduce faithful interaction explanations of LIP models (FIxLIP) as a unified approach to decomposing the similarity in vision-language encoders. FIxLIP is rooted in game theory, where we analyze how using the weighted Banzhaf interaction index offers greater flexibility and improves computational efficiency over the Shapley interaction quantification framework. From a practical perspective, we propose how to naturally extend explanation evaluation metrics, like the pointing game and area between the insertion/deletion curves, to second-order interaction explanations. Experiments on MS COCO and ImageNet-1k benchmarks validate that second-order methods like FIxLIP outperform first-order attribution methods. Beyond delivering high-quality explanations, we demonstrate the utility of FIxLIP in comparing different models like CLIP vs. SigLIP-2 and ViT-B/32 vs. ViT-L/16. 语言-图像预训练（LIP）使得构建具备零样本分类、定位、多模态检索和语义理解能力的视觉-语言模型成为可能。为了可视化输入图像-文本对对模型相似性输出的重要性，已提出多种解释方法。然而，流行的显著性图仅限于捕捉一阶归因，忽略了此类编码器中固有的复杂跨模态交互。我们提出了面向 LIP 模型的可信交互解释（FIxLIP），作为分解视觉-语言编码器相似性的统一方法。FIxLIP 基于博弈论，我们分析了使用加权班扎夫交互指数在灵活性上如何优于沙普利交互量化框架并提升计算效率。从实用角度出发，我们还提出了如何将解释评估指标（如指向游戏以及插入/删除曲线之间的面积）自然地扩展到二阶交互解释。在 MS COCO 和 ImageNet-1k 基准上的实验证明，二阶方法如 FIxLIP 优于一阶归因方法。除了提供高质量的解释外，我们还展示了 FIxLIP 在比较不同模型（如 CLIP 与 SigLIP-2，以及 ViT-B/32 与 ViT-L/16）时的实用性。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning 主题：计算机视觉与模式识别，人工智能，机器学习

Publish: 2025-08-07 14:18:56 UTC 发布：2025-08-07 14:18:56 UTC

#55 MyCulture: Exploring Malaysia's Diverse Culture under Low-Resource Language Constraints #55 MyCulture：在低资源语言限制下探索马来西亚多元文化

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-07 14:17:43 UTC 发布：2025-08-07 14:17:43 UTC

#56 LLM-based Multi-Agent Copilot for Quantum Sensor #56 基于 LLM 的量子传感器多智能体助手

Large language models (LLM) exhibit broad utility but face limitations in quantum sensor development, stemming from interdisciplinary knowledge barriers and involving complex optimization processes. Here we present QCopilot, an LLM-based multi-agent framework integrating external knowledge access, active learning, and uncertainty quantification for quantum sensor design and diagnosis. Comprising commercial LLMs with few-shot prompt engineering and vector knowledge base, QCopilot employs specialized agents to adaptively select optimization methods, automate modeling analysis, and independently perform problem diagnosis. Applying QCopilot to atom cooling experiments, we generated 108 sub-μK atoms without any human intervention within a few hours, representing ∼100× speedup over manual experimentation. Notably, by continuously accumulating prior knowledge and enabling dynamic modeling, QCopilot can autonomously identify anomalous parameters in multi-parameter experimental settings. Our work reduces barriers to large-scale quantum sensor deployment and readily extends to other quantum information systems. 大型语言模型（LLM）展现出广泛的实用性，但在量子传感器开发中面临局限，源于跨学科知识障碍并涉及复杂的优化过程。在此我们提出 QCopilot，一种基于 LLM 的多智能体框架，集成了外部知识访问、主动学习和不确定性量化，用于量子传感器的设计与诊断。QCopilot 由具有少样本提示工程的商业 LLM 和向量知识库组成，采用专门的智能体自适应地选择优化方法、自动化建模分析并独立执行问题诊断。将 QCopilot 应用于原子冷却实验，我们在数小时内生成了 10 8 子- μ K 原子，且全程无需人工干预，相较于手工实验实现了 ∼ 100 × 的加速。值得注意的是，通过持续积累先验知识并实现动态建模，QCopilot 能在多参数实验设置中自主识别异常参数。我们的工作降低了大规模量子传感器部署的门槛，并可方便地扩展到其他量子信息系统。

Subjects: Quantum Physics, Artificial Intelligence, Atomic Physics 主题：量子物理，人工智能，原子物理

Publish: 2025-08-07 14:14:08 UTC 发表：2025-08-07 14:14:08 UTC

#57 UNCAGE: Contrastive Attention Guidance for Masked Generative Transformers in Text-to-Image Generation #57 UNCAGE：用于文本到图像生成中掩码生成式变换器的对比注意力引导

Authors: [Wonjun Kang](https://arxiv.org/search/?searchtype=author&query=Wonjun Kang), [Byeongkeun Ahn](https://arxiv.org/search/?searchtype=author&query=Byeongkeun Ahn), [Minjae Lee](https://arxiv.org/search/?searchtype=author&query=Minjae Lee), [Kevin Galim](https://arxiv.org/search/?searchtype=author&query=Kevin Galim), [Seunghyuk Oh](https://arxiv.org/search/?searchtype=author&query=Seunghyuk Oh), [Hyung Il Koo](https://arxiv.org/search/?searchtype=author&query=Hyung Il Koo), [Nam Ik Cho](https://arxiv.org/search/?searchtype=author&query=Nam Ik Cho) 作者：Wonjun Kang，Byeongkeun Ahn，Minjae Lee，Kevin Galim，Seunghyuk Oh，Hyung Il Koo，Nam Ik Cho

Text-to-image (T2I) generation has been actively studied using Diffusion Models and Autoregressive Models. Recently, Masked Generative Transformers have gained attention as an alternative to Autoregressive Models to overcome the inherent limitations of causal attention and autoregressive decoding through bidirectional attention and parallel decoding, enabling efficient and high-quality image generation. However, compositional T2I generation remains challenging, as even state-of-the-art Diffusion Models often fail to accurately bind attributes and achieve proper text-image alignment. While Diffusion Models have been extensively studied for this issue, Masked Generative Transformers exhibit similar limitations but have not been explored in this context. To address this, we propose Unmasking with Contrastive Attention Guidance (UNCAGE), a novel training-free method that improves compositional fidelity by leveraging attention maps to prioritize the unmasking of tokens that clearly represent individual objects. UNCAGE consistently improves performance in both quantitative and qualitative evaluations across multiple benchmarks and metrics, with negligible inference overhead. Our code is available at https://github.com/furiosa-ai/uncage. 文本到图像（T2I）生成一直通过扩散模型和自回归模型被积极研究。最近，掩码生成变换器作为自回归模型的替代方案获得关注，它通过双向注意力和并行解码克服了因果注意力和自回归解码的固有限制，从而实现高效且高质量的图像生成。然而，组合式 T2I 生成仍然具有挑战性，即使是最先进的扩散模型也常常无法准确绑定属性并实现恰当的文本-图像对齐。虽然扩散模型在这一问题上已被广泛研究，掩码生成变换器也表现出类似的局限性，但在这一背景下尚未被探索。为此，我们提出了“对比注意力引导的去掩码”（UNCAGE），这是一种新的无训练方法，通过利用注意力图来优先去掩码那些清晰表示单个对象的令牌，从而改善组合保真度。UNCAGE 在多个基准和指标上，在定量和定性评估中均持续提升性能，且推理开销可忽略不计。我们的代码可在 https://github.com/furiosa-ai/uncage 获取。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning 主题：计算机视觉与模式识别、人工智能、机器学习

Publish: 2025-08-07 13:51:17 UTC 发布：2025-08-07 13:51:17 UTC

#58 Real-Time Iteration Scheme for Diffusion Policy #58 实时迭代方案用于扩散策略

Authors: [Yufei Duan](https://arxiv.org/search/?searchtype=author&query=Yufei Duan), [Hang Yin](https://arxiv.org/search/?searchtype=author&query=Hang Yin), [Danica Kragic](https://arxiv.org/search/?searchtype=author&query=Danica Kragic) 作者：段雨飞，尹杭，丹妮卡·克拉吉克

Diffusion Policies have demonstrated impressive performance in robotic manipulation tasks. However, their long inference time, resulting from an extensive iterative denoising process, and the need to execute an action chunk before the next prediction to maintain consistent actions limit their applicability to latency-critical tasks or simple tasks with a short cycle time. While recent methods explored distillation or alternative policy structures to accelerate inference, these often demand additional training, which can be resource-intensive for large robotic models. In this paper, we introduce a novel approach inspired by the Real-Time Iteration (RTI) Scheme, a method from optimal control that accelerates optimization by leveraging solutions from previous time steps as initial guesses for subsequent iterations. We explore the application of this scheme in diffusion inference and propose a scaling-based method to effectively handle discrete actions, such as grasping, in robotic manipulation. The proposed scheme significantly reduces runtime computational costs without the need for distillation or policy redesign. This enables a seamless integration into many pre-trained diffusion-based models, in particular, to resource-demanding large models. We also provide theoretical conditions for the contractivity which could be useful for estimating the initial denoising step. Quantitative results from extensive simulation experiments show a substantial reduction in inference time, with comparable overall performance compared with Diffusion Policy using full-step denoising. Our project page with additional resources is available at: https://rti-dp.github.io/. 扩散策略在机器人操控任务中表现出令人印象深刻的性能。然而，其广泛的迭代去噪过程导致的长推理时间，以及为了保持动作一致性而在下一次预测前必须执行一个动作块的需求，限制了其在对延迟敏感的任务或周期时间短的简单任务中的适用性。尽管近期方法探索了蒸馏或替代策略结构以加速推理，但这些方法通常需要额外的训练，对于大型机器人模型可能资源密集。在本文中，我们提出了一种受最优控制中实时迭代（RTI）方案启发的新方法，该方法通过利用前一时间步的解作为后续迭代的初始猜测来加速优化。我们探讨了将该方案应用于扩散推理的可行性，并提出了一种基于缩放的方法，以有效处理机器人操控中诸如抓取等离散动作。所提出的方案在不需要蒸馏或重设计策略的情况下显著降低了运行时计算成本。这使得能够无缝集成到许多预训练的基于扩散的模型中，尤其是对资源需求高的大型模型。我们还为收缩性提供了理论条件，这对于估计初始去噪步骤可能很有用。来自大量仿真实验的定量结果显示，在推理时间方面有显著减少，同时与使用全步去噪的扩散策略相比，整体性能相当。我们的项目页面及其他资源可在以下网址获取： https://rti-dp.github.io/。

Subjects: Robotics, Artificial Intelligence 主题：机器人学，人工智能

Publish: 2025-08-07 13:49:00 UTC 发表：2025-08-07 13:49:00 UTC

#59 Echo: Decoupling Inference and Training for Large-Scale RL Alignment on Heterogeneous Swarms #59 Echo：在异构蜂群上为大规模 RL 对齐解耦推理与训练

Modern RL-based post-training for large language models (LLMs) co-locate trajectory sampling and policy optimisation on the same GPU cluster, forcing the system to switch between inference and training workloads. This serial context switching violates the single-program-multiple-data (SPMD) assumption underlying today’s distributed training systems. We present Echo, the RL system that cleanly decouples these two phases across heterogeneous “inference” and “training” swarms while preserving statistical efficiency. Echo introduces two lightweight synchronization protocols: a sequential pull mode that refreshes sampler weights on every API call for minimal bias, and an asynchronous push-pull mode that streams version-tagged rollouts through a replay buffer to maximise hardware utilisation. Training three representative RL workloads with Qwen3-4B, Qwen2.5-7B and Qwen3-32B on a geographically distributed cluster, Echo matches a fully co-located Verl baseline in convergence speed and final reward while off-loading trajectory generation to commodity edge hardware. These promising results demonstrate that large-scale RL for LLMs could achieve datacentre-grade performance using decentralised, heterogeneous resources. 基于现代强化学习的后训练（post-training）方法在大规模语言模型（LLMs）上通常将轨迹采样和策略优化部署在同一 GPU 集群，迫使系统在推理和训练工作负载之间切换。这种串行的上下文切换违反了当今分布式训练系统所依赖的单程序多数据（SPMD）假设。我们提出了 Echo，一种在异构的“推理”与“训练”集群之间清晰解耦这两个阶段同时保持统计效率的强化学习系统。Echo 引入了两种轻量级的同步协议：一种顺序拉取模式（sequential pull mode），在每次 API 调用时刷新采样器权重以将偏差降到最低；另一种异步推拉模式（asynchronous push-pull mode），通过回放缓冲区流式传输带版本标签的轨迹以最大化硬件利用率。在地理分布式集群上使用 Qwen3-4B、Qwen2.5-7B 和 Qwen3-32B 训练三种代表性的强化学习工作负载时，Echo 在收敛速度和最终回报上与完全共置的 Verl 基线相当，同时将轨迹生成卸载到普通边缘硬件上。这些有希望的结果表明，针对 LLMs 的大规模强化学习可以利用去中心化、异构的资源实现数据中心级别的性能。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-07 13:37:04 UTC

#60 Optimal Corpus Aware Training for Neural Machine Translation #60 面向语料库的最优训练用于神经机器翻译

Authors: [Yi-Hsiu Liao](https://arxiv.org/search/?searchtype=author&query=Yi-Hsiu Liao), [Cheng Shen](https://arxiv.org/search/?searchtype=author&query=Cheng Shen), Brenda, Yang

Corpus Aware Training (CAT) leverages valuable corpus metadata during training by injecting corpus information into each training example, and has been found effective in the literature, commonly known as the “tagging” approach. Models trained with CAT inherently learn the quality, domain and nuance between corpora directly from data, and can easily switch to different inference behavior. To achieve the best evaluation, CAT models pre-define a group of high quality data before training starts which can be error-prone and inefficient. In this work, we propose Optimal Corpus Aware Training (OCAT), which fine-tunes a CAT pre-trained model by freezing most of the model parameters and only tuning small set of corpus-related parameters. We show that OCAT is lightweight, resilient to overfitting, and effective in boosting model accuracy. We use WMT23 English to Chinese and English to German translation tasks as our test ground and show +3.6 and +1.8 chrF improvement, respectively, over vanilla training. Furthermore, our approach is on-par or slightly better than other state-of-the-art fine-tuning techniques while being less sensitive to hyperparameter settings. 语料感知训练（CAT）在训练过程中通过将语料信息注入到每个训练样本中来利用有价值的语料元数据，文献中发现这在实践中是有效的，通常称为“标记”方法。使用 CAT 训练的模型能够从数据中直接学习语料之间的质量、领域和细微差别，并且可以轻松切换到不同的推理行为。为了实现最佳评估，CAT 模型在训练开始前会预先定义一组高质量数据，这一过程可能出错且效率低下。在本工作中，我们提出了最优语料感知训练（OCAT），通过冻结大部分模型参数并仅微调一小部分与语料相关的参数来对 CAT 预训练模型进行微调。我们展示了 OCAT 具有轻量、抗过拟合且能有效提升模型准确性的特点。我们以 WMT23 英译中和英译德翻译任务作为测试场景，分别比普通训练提高了+3.6 和+1.8 的 chrF 值。此外，我们的方法在性能上与其他最先进的微调技术持平或略有优势，同时对超参数设置的敏感性更低。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-07 13:12:26 UTC 发布：2025-08-07 13:12:26 UTC

#61 Building Effective Safety Guardrails in AI Education Tools #61 在人工智能教育工具中构建有效的安全护栏

Authors: [Hannah-Beth Clark](https://arxiv.org/search/?searchtype=author&query=Hannah-Beth Clark), [Laura Benton](https://arxiv.org/search/?searchtype=author&query=Laura Benton), [Emma Searle](https://arxiv.org/search/?searchtype=author&query=Emma Searle), [Margaux Dowland](https://arxiv.org/search/?searchtype=author&query=Margaux Dowland), [Matthew Gregory](https://arxiv.org/search/?searchtype=author&query=Matthew Gregory), [Will Gayne](https://arxiv.org/search/?searchtype=author&query=Will Gayne), [John Roberts](https://arxiv.org/search/?searchtype=author&query=John Roberts) 作者：Hannah-Beth Clark、Laura Benton、Emma Searle、Margaux Dowland、Matthew Gregory、Will Gayne、John Roberts

There has been rapid development in generative AI tools across the education sector, which in turn is leading to increased adoption by teachers. However, this raises concerns regarding the safety and age-appropriateness of the AI-generated content that is being created for use in classrooms. This paper explores Oak National Academy’s approach to addressing these concerns within the development of the UK Government’s first publicly available generative AI tool - our AI-powered lesson planning assistant (Aila). Aila is intended to support teachers planning national curriculum-aligned lessons that are appropriate for pupils aged 5-16 years. To mitigate safety risks associated with AI-generated content we have implemented four key safety guardrails - (1) prompt engineering to ensure AI outputs are generated within pedagogically sound and curriculum-aligned parameters, (2) input threat detection to mitigate attacks, (3) an Independent Asynchronous Content Moderation Agent (IACMA) to assess outputs against predefined safety categories, and (4) taking a human-in-the-loop approach, to encourage teachers to review generated content before it is used in the classroom. Through our on-going evaluation of these safety guardrails we have identified several challenges and opportunities to take into account when implementing and testing safety guardrails. This paper highlights ways to build more effective safety guardrails in generative AI education tools including the on-going iteration and refinement of guardrails, as well as enabling cross-sector collaboration through sharing both open-source code, datasets and learnings. 在教育领域，生成式人工智能工具的快速发展正在推动教师越来越多地采用这些工具。然而，这也引发了关于用于课堂的 AI 生成内容的安全性和适龄性方面的担忧。本文探讨了橡树国家学院（Oak National Academy）在开发英国政府首个面向公众的生成式 AI 工具——我们的 AI 驱动课程计划助手（Aila）时，如何应对这些担忧。Aila 旨在支持教师为 5 至 16 岁学生规划符合国家课程的、适宜的课程。为减少与 AI 生成内容相关的安全风险，我们实施了四项关键安全防护措施——（1）提示工程以确保 AI 输出在教学上合理且与课程对齐的参数内生成，（2）输入威胁检测以防范攻击，（3）独立异步内容审核代理（IACMA）用于根据预定义的安全类别评估输出，以及（4）采用人机协同（human-in-the-loop）方法，鼓励教师在课堂使用前审核生成内容。通过我们对这些安全护栏的持续评估，我们发现了在实施和测试安全护栏时需要考虑的若干挑战和机遇。本文强调了在生成式人工智能教育工具中构建更有效的安全护栏的方法，包括对护栏的持续迭代和完善，以及通过共享开源代码、数据集和经验来促进跨部门协作。

Subjects: Computers and Society, Artificial Intelligence 主题：计算机与社会，人工智能

Publish: 2025-08-07 13:09:47 UTC 发布：2025-08-07 13:09:47 UTC

#62 PriorRG: Prior-Guided Contrastive Pre-training and Coarse-to-Fine Decoding for Chest X-ray Report Generation #62 PriorRG：用于胸片报告生成的先验引导对比预训练与粗到细解码 [PDF 2 ] [复制] [Kimi ] [关联]

Authors: [Kang Liu](https://arxiv.org/search/?searchtype=author&query=Kang Liu), [Zhuoqi Ma](https://arxiv.org/search/?searchtype=author&query=Zhuoqi Ma), [Zikang Fang](https://arxiv.org/search/?searchtype=author&query=Zikang Fang), [Yunan Li](https://arxiv.org/search/?searchtype=author&query=Yunan Li), [Kun Xie](https://arxiv.org/search/?searchtype=author&query=Kun Xie), [Qiguang Miao](https://arxiv.org/search/?searchtype=author&query=Qiguang Miao)

Chest X-ray report generation aims to reduce radiologists’ workload by automatically producing high-quality preliminary reports. A critical yet underexplored aspect of this task is the effective use of patient-specific prior knowledge – including clinical context (e.g., symptoms, medical history) and the most recent prior image – which radiologists routinely rely on for diagnostic reasoning. Most existing methods generate reports from single images, neglecting this essential prior information and thus failing to capture diagnostic intent or disease progression. To bridge this gap, we propose PriorRG, a novel chest X-ray report generation framework that emulates real-world clinical workflows via a two-stage training pipeline. In Stage 1, we introduce a prior-guided contrastive pre-training scheme that leverages clinical context to guide spatiotemporal feature extraction, allowing the model to align more closely with the intrinsic spatiotemporal semantics in radiology reports. In Stage 2, we present a prior-aware coarse-to-fine decoding for report generation that progressively integrates patient-specific prior knowledge with the vision encoder’s hidden states. This decoding allows the model to align with diagnostic focus and track disease progression, thereby enhancing the clinical accuracy and fluency of the generated reports. Extensive experiments on MIMIC-CXR and MIMIC-ABN datasets demonstrate that PriorRG outperforms state-of-the-art methods, achieving a 3.6% BLEU-4 and 3.8% F1 score improvement on MIMIC-CXR, and a 5.9% BLEU-1 gain on MIMIC-ABN. Code and checkpoints will be released upon acceptance. 胸片报告生成旨在通过自动生成高质量的初步报告来减轻放射科医生的工作负担。该任务的一个关键但尚未充分探索的方面是对患者特定先验知识的有效利用——包括临床背景（例如症状、病史）和最近一次既往影像——放射科医生在诊断推理中常常依赖这些信息。大多数现有方法从单张图像生成报告，忽视了这一重要的先验信息，因而未能捕捉诊断意图或疾病进展。为弥补这一差距，我们提出了 PriorRG，一种新颖的胸片报告生成框架，通过两阶段训练流程模拟真实临床工作流程。在第一阶段，我们引入了一种先验引导的对比预训练方案，利用临床背景来指导时空特征提取，使模型能够更紧密地对齐放射学报告中的内在时空语义。在第二阶段，我们提出了一种具备先验感知的由粗到细的解码方法用于报告生成，逐步将患者特定的先验知识与视觉编码器的隐藏状态融合。这种解码使模型能够与诊断重点对齐并追踪疾病进展，从而提高生成报告的临床准确性和流畅性。在 MIMIC-CXR 和 MIMIC-ABN 数据集上的大量实验证明，PriorRG 优于最先进的方法，在 MIMIC-CXR 上分别实现了 3.6% 的 BLEU-4 和 3.8% 的 F1 分数提升，在 MIMIC-ABN 上实现了 5.9% 的 BLEU-1 提升。代码和检查点将在接受后发布。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-08-07 13:02:20 UTC

Authors: [Xiaoxi Cui](https://arxiv.org/search/?searchtype=author&query=Xiaoxi Cui), [Weihai Lu](https://arxiv.org/search/?searchtype=author&query=Weihai Lu), [Yu Tong](https://arxiv.org/search/?searchtype=author&query=Yu Tong), [Yiheng Li](https://arxiv.org/search/?searchtype=author&query=Yiheng Li), [Zhejun Zhao](https://arxiv.org/search/?searchtype=author&query=Zhejun Zhao) 作者：崔晓曦、陆伟海、佟宇、李奕衡、赵哲君

The sequential recommendation system utilizes historical user interactions to predict preferences. Effectively integrating diverse user behavior patterns with rich multimodal information of items to enhance the accuracy of sequential recommendations is an emerging and challenging research direction. This paper focuses on the problem of multi-modal multi-behavior sequential recommendation, aiming to address the following challenges: (1) the lack of effective characterization of modal preferences across different behaviors, as user attention to different item modalities varies depending on the behavior; (2) the difficulty of effectively mitigating implicit noise in user behavior, such as unintended actions like accidental clicks; (3) the inability to handle modality noise in multi-modal representations, which further impacts the accurate modeling of user preferences. To tackle these issues, we propose a novel Multi-Modal Multi-Behavior Sequential Recommendation model (M3BSR). This model first removes noise in multi-modal representations using a Conditional Diffusion Modality Denoising Layer. Subsequently, it utilizes deep behavioral information to guide the denoising of shallow behavioral data, thereby alleviating the impact of noise in implicit feedback through Conditional Diffusion Behavior Denoising. Finally, by introducing a Multi-Expert Interest Extraction Layer, M3BSR explicitly models the common and specific interests across behaviors and modalities to enhance recommendation performance. Experimental results indicate that M3BSR significantly outperforms existing state-of-the-art methods on benchmark datasets. 序列推荐系统利用用户的历史交互来预测偏好。有效地将多样的用户行为模式与物品的丰富多模态信息整合以提高序列推荐的准确性，是一个新兴且具有挑战性的研究方向。本文聚焦于多模态多行为序列推荐问题，旨在解决以下挑战：（1）缺乏对不同行为下模态偏好的有效刻画，因为用户对不同物品模态的关注会随行为而变化；（2）难以有效缓解用户行为中的隐含噪声，例如意外点击等非本意行为；（3）无法处理多模态表示中的模态噪声，这进一步影响对用户偏好的准确建模。为了解决这些问题，我们提出了一种新颖的多模态多行为序列推荐模型（M 3 BSR）。该模型首先使用条件扩散模态去噪层去除多模态表示中的噪声。随后，它利用深层行为信息来指导浅层行为数据的去噪，从而通过条件扩散行为去噪缓解隐式反馈中的噪声影响。最后，通过引入多专家兴趣提取层，M 3 BSR 明确建模了跨行为和跨模态的共同与特定兴趣，以提升推荐性能。实验结果表明，M 3 BSR 在基准数据集上显著优于现有的最先进方法。

Subjects: Information Retrieval, Artificial Intelligence 主题：信息检索，人工智能

Publish: 2025-08-07 12:58:34 UTC 发布：2025-08-07 12:58:34 UTC

#64 Information-Theoretic Graph Fusion with Vision-Language-Action Model for Policy Reasoning and Dual Robotic Control #64 基于信息论的图融合与视觉-语言-动作模型用于策略推理与双机器人控制 [PDF 2 ] [Copy] [Kimi ] [REL]

Teaching robots dexterous skills from human videos remains challenging due to the reliance on low-level trajectory imitation, which fails to generalize across object types, spatial layouts, and manipulator configurations. We propose Graph-Fused Vision-Language-Action (GF-VLA), a framework that enables dual-arm robotic systems to perform task-level reasoning and execution directly from RGB and Depth human demonstrations. GF-VLA first extracts Shannon-information-based cues to identify hands and objects with the highest task relevance, then encodes these cues into temporally ordered scene graphs that capture both hand-object and object-object interactions. These graphs are fused with a language-conditioned transformer that generates hierarchical behavior trees and interpretable Cartesian motion commands. To improve execution efficiency in bimanual settings, we further introduce a cross-hand selection policy that infers optimal gripper assignment without explicit geometric reasoning. We evaluate GF-VLA on four structured dual-arm block assembly tasks involving symbolic shape construction and spatial generalization. Experimental results show that the information-theoretic scene representation achieves over 95 percent graph accuracy and 93 percent subtask segmentation, supporting the LLM planner in generating reliable and human-readable task policies. When executed by the dual-arm robot, these policies yield 94 percent grasp success, 89 percent placement accuracy, and 90 percent overall task success across stacking, letter-building, and geometric reconfiguration scenarios, demonstrating strong generalization and robustness across diverse spatial and semantic variations. 从人类视频中教机器人灵巧技能仍然充满挑战，原因在于对低级轨迹模仿的依赖，这种方法无法在物体类型、空间布局和机械臂配置之间泛化。我们提出了图融合视觉-语言-动作（Graph-Fused Vision-Language-Action，GF-VLA）框架，使双臂机器人系统能够直接从 RGB 和深度的人类示范中进行任务级推理与执行。GF-VLA 首先提取基于香农信息的线索，以识别与任务最相关的双手和物体，然后将这些线索编码为按时间顺序排列的场景图，这些场景图同时捕捉手-物体和物体-物体的交互。将这些图与一个以语言为条件的变换器融合，该变换器生成分层行为树和可解释的笛卡尔运动指令。为提高双臂设置下的执行效率，我们进一步引入了一种跨手选择策略，该策略在不进行显式几何推理的情况下推断最优夹持器分配。我们在四个结构化的双臂方块组装任务上评估了 GF-VLA，这些任务涉及符号形状构建和空间泛化。实验结果表明，基于信息论的场景表示实现了超过 95%的图结构准确率和 93%的子任务分割率，支持 LLM 规划器生成可靠且可读的人类任务策略。当由双臂机器人执行这些策略时，在堆叠、字母构建和几何重构场景中分别取得了 94%的抓取成功率、89%的放置精度和 90%的整体任务成功率，展示了在不同空间和语义变化下的强泛化能力和鲁棒性。

Subjects: Robotics, Artificial Intelligence 主题：机器人学，人工智能

Publish: 2025-08-07 12:48:09 UTC 发表：2025-08-07 12:48:09 UTC

#65 Efficient Reasoning for Large Reasoning Language Models via Certainty-Guided Reflection Suppression #65 通过确定性引导的反思抑制实现对大型推理语言模型的高效推理

Authors: [Jiameng Huang](https://arxiv.org/search/?searchtype=author&query=Jiameng Huang), [Baijiong Lin](https://arxiv.org/search/?searchtype=author&query=Baijiong Lin), [Guhao Feng](https://arxiv.org/search/?searchtype=author&query=Guhao Feng), [Jierun Chen](https://arxiv.org/search/?searchtype=author&query=Jierun Chen), [Di He](https://arxiv.org/search/?searchtype=author&query=Di He), [Lu Hou](https://arxiv.org/search/?searchtype=author&query=Lu Hou) 作者：黄家萌，林百炯，冯谷豪，陈洁润，何迪，侯璐

Recent Large Reasoning Language Models (LRLMs) employ long chain-of-thought reasoning with complex reflection behaviors, typically signaled by specific trigger words (e.g., “Wait” and “Alternatively”) to enhance performance. However, these reflection behaviors can lead to the overthinking problem where the generation of redundant reasoning steps that unnecessarily increase token usage, raise inference costs, and reduce practical utility. In this paper, we propose Certainty-Guided Reflection Suppression (CGRS), a novel method that mitigates overthinking in LRLMs while maintaining reasoning accuracy. CGRS operates by dynamically suppressing the model’s generation of reflection triggers when it exhibits high confidence in its current response, thereby preventing redundant reflection cycles without compromising output quality. Our approach is model-agnostic, requires no retraining or architectural modifications, and can be integrated seamlessly with existing autoregressive generation pipelines. Extensive experiments across four reasoning benchmarks (i.e., AIME24, AMC23, MATH500, and GPQA-D) demonstrate CGRS’s effectiveness: it reduces token usage by an average of 18.5% to 41.9% while preserving accuracy. It also achieves the optimal balance between length reduction and performance compared to state-of-the-art baselines. These results hold consistently across model architectures (e.g., DeepSeek-R1-Distill series, QwQ-32B, and Qwen3 family) and scales (4B to 32B parameters), highlighting CGRS’s practical value for efficient reasoning. 近期的大型推理语言模型（LRLMs）在长链式思维推理中采用复杂的反思行为，通常由特定触发词（例如“Wait”和“Alternatively”）标示，以提升性能。然而，这些反思行为可能导致过度思考问题，即生成冗余的推理步骤，从而不必要地增加令牌使用量、提高推理成本并降低实际效用。在本文中，我们提出了确定性引导的反思抑制（CGRS），这是一种在保持推理准确性的同时缓解 LRLMs 过度思考的新方法。CGRS 通过在模型对当前回答表现出高置信度时动态抑制其生成反思触发词，从而在不损害输出质量的情况下防止冗余的反思循环。我们的方法与模型无关，无需重新训练或修改架构，并且可以无缝集成到现有的自回归生成流水线中。在四个推理基准（即 AIME24、AMC23、MATH500 和 GPQA-D）上的大量实验表明了 CGRS 的有效性：在保持准确率的同时，它平均减少了 18.5% 到 41.9% 的令牌使用量。与最先进的基线方法相比，它在长度缩减与性能之间也达到了最佳平衡。这些结果在模型架构（例如 DeepSeek-R1-Distill 系列、QwQ-32B 和 Qwen3 家族）和规模（从 4B 到 32B 参数）上均保持一致，突显了 CGRS 在高效推理方面的实用价值。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-07 12:38:22 UTC 发布：2025-08-07 12:38:22 UTC

#66 mKG-RAG: Multimodal Knowledge Graph-Enhanced RAG for Visual Question Answering #66 mKG-RAG：用于视觉问答的多模态知识图增强 RAG [PDF ] [Copy] [Kimi ] [REL]

Authors: [Xu Yuan](https://arxiv.org/search/?searchtype=author&query=Xu Yuan), [Liangbo Ning](https://arxiv.org/search/?searchtype=author&query=Liangbo Ning), [Wenqi Fan](https://arxiv.org/search/?searchtype=author&query=Wenqi Fan), [Qing Li](https://arxiv.org/search/?searchtype=author&query=Qing Li) 作者：Xu Yuan、Liangbo Ning、Wenqi Fan、Qing Li

Recently, Retrieval-Augmented Generation (RAG) has been proposed to expand internal knowledge of Multimodal Large Language Models (MLLMs) by incorporating external knowledge databases into the generation process, which is widely used for knowledge-based Visual Question Answering (VQA) tasks. Despite impressive advancements, vanilla RAG-based VQA methods that rely on unstructured documents and overlook the structural relationships among knowledge elements frequently introduce irrelevant or misleading content, reducing answer accuracy and reliability. To overcome these challenges, a promising solution is to integrate multimodal knowledge graphs (KGs) into RAG-based VQA frameworks to enhance the generation by introducing structured multimodal knowledge. Therefore, in this paper, we propose a novel multimodal knowledge-augmented generation framework (mKG-RAG) based on multimodal KGs for knowledge-intensive VQA tasks. Specifically, our approach leverages MLLM-powered keyword extraction and vision-text matching to distill semantically consistent and modality-aligned entities/relationships from multimodal documents, constructing high-quality multimodal KGs as structured knowledge representations. In addition, a dual-stage retrieval strategy equipped with a question-aware multimodal retriever is introduced to improve retrieval efficiency while refining precision. Comprehensive experiments demonstrate that our approach significantly outperforms existing methods, setting a new state-of-the-art for knowledge-based VQA. 最近，检索增强生成（RAG）被提出用于通过在生成过程中引入外部知识库来扩展多模态大语言模型（MLLMs）的内部知识，这在基于知识的视觉问答（VQA）任务中被广泛使用。尽管取得了显著进展，基于原始 RAG 且依赖非结构化文档、忽视知识元素之间结构关系的方法经常引入无关或误导性内容，降低了答案的准确性和可靠性。为了解决这些挑战，一个有前景的方案是将多模态知识图谱（KGs）整合到基于 RAG 的 VQA 框架中，通过引入结构化多模态知识来增强生成能力。因此，在本文中，我们提出了一个基于多模态知识图谱的用于知识密集型 VQA 任务的新颖多模态知识增强生成框架（mKG-RAG）。具体而言，我们的方法利用由多模态大模型驱动的关键词提取和视觉-文本匹配，从多模态文档中提取语义一致且模态对齐的实体/关系，构建高质量的多模态知识图谱作为结构化知识表示。此外，我们引入了一种配备了问题感知多模态检索器的双阶段检索策略，以在提高检索效率的同时精炼精确性。全面的实验证明，我们的方法显著优于现有方法，创造了基于知识的视觉问答的新最先进水平。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 学科：计算机视觉与模式识别、人工智能

Publish: 2025-08-07 12:22:50 UTC 发布时间：2025-08-07 12:22:50 UTC

#67 ASkDAgger: Active Skill-level Data Aggregation for Interactive Imitation Learning #67 ASkDAgger：用于交互式模仿学习的主动技能级数据聚合

Authors: [Jelle Luijkx](https://arxiv.org/search/?searchtype=author&query=Jelle Luijkx), [Zlatan Ajanović](https://arxiv.org/search/?searchtype=author&query=Zlatan Ajanović), [Laura Ferranti](https://arxiv.org/search/?searchtype=author&query=Laura Ferranti), [Jens Kober](https://arxiv.org/search/?searchtype=author&query=Jens Kober) 作者：Jelle Luijkx、Zlatan Ajanović、Laura Ferranti、Jens Kober

Human teaching effort is a significant bottleneck for the broader applicability of interactive imitation learning. To reduce the number of required queries, existing methods employ active learning to query the human teacher only in uncertain, risky, or novel situations. However, during these queries, the novice’s planned actions are not utilized despite containing valuable information, such as the novice’s capabilities, as well as corresponding uncertainty levels. To this end, we allow the novice to say: “I plan to do this, but I am uncertain.” We introduce the Active Skill-level Data Aggregation (ASkDAgger) framework, which leverages teacher feedback on the novice plan in three key ways: (1) S-Aware Gating (SAG): Adjusts the gating threshold to track sensitivity, specificity, or a minimum success rate; (2) Foresight Interactive Experience Replay (FIER), which recasts valid and relabeled novice action plans into demonstrations; and (3) Prioritized Interactive Experience Replay (PIER), which prioritizes replay based on uncertainty, novice success, and demonstration age. Together, these components balance query frequency with failure incidence, reduce the number of required demonstration annotations, improve generalization, and speed up adaptation to changing domains. We validate the effectiveness of ASkDAgger through language-conditioned manipulation tasks in both simulation and real-world environments. Code, data, and videos are available at https://askdagger.github.io. 人为教学投入是限制交互式模仿学习更广泛应用的一个重要瓶颈。为减少所需的查询次数，现有方法采用主动学习仅在不确定、有风险或新颖的情况下向人类教师查询。然而，在这些查询过程中，新手的计划动作并未被利用，尽管其中包含有价值的信息，例如新手的能力以及相应的不确定性水平。为此，我们允许新手说：“我计划这样做，但我不确定。”我们提出了主动技能级数据聚合（ASkDAgger）框架，它以三种关键方式利用教师对新手计划的反馈：（1）S-感知门控（SAG）：调整门控阈值以跟踪灵敏度、特异性或最低成功率；（2）远见交互经验回放（FIER）：将有效且重新标注的新手动作计划重新构造成示范；以及（3）优先交互经验回放（PIER）：基于不确定性、新手成功率和示范时间优先安排回放。这些组件共同在查询频率与失败发生率之间取得平衡，减少所需示范注释的数量，提升泛化能力，并加快对变化领域的适应速度。我们通过在仿真和真实环境中的基于语言的操作任务验证了 ASkDAgger 的有效性。代码、数据和视频可在 https://askdagger.github.io 获取。

Subjects: Machine Learning, Artificial Intelligence, Human-Computer Interaction, Robotics 主题：机器学习、人工智能、人机交互、机器人学

Publish: 2025-08-07 12:10:46 UTC 发布：2025-08-07 12:10:46 UTC

#68 Estimating Musical Surprisal from Audio in Autoregressive Diffusion Model Noise Spaces #68 从自回归扩散模型噪声空间中估计音频的音乐意外性

Authors: [Mathias Rose Bjare](https://arxiv.org/search/?searchtype=author&query=Mathias Rose Bjare), [Stefan Lattner](https://arxiv.org/search/?searchtype=author&query=Stefan Lattner), [Gerhard Widmer](https://arxiv.org/search/?searchtype=author&query=Gerhard Widmer) 作者：Mathias Rose Bjare、Stefan Lattner、Gerhard Widmer

Recently, the information content (IC) of predictions from a Generative Infinite-Vocabulary Transformer (GIVT) has been used to model musical expectancy and surprisal in audio. We investigate the effectiveness of such modelling using IC calculated with autoregressive diffusion models (ADMs). We empirically show that IC estimates of models based on two different diffusion ordinary differential equations (ODEs) describe diverse data better, in terms of negative log-likelihood, than a GIVT. We evaluate diffusion model IC’s effectiveness in capturing surprisal aspects by examining two tasks: (1) capturing monophonic pitch surprisal, and (2) detecting segment boundaries in multi-track audio. In both tasks, the diffusion models match or exceed the performance of a GIVT. We hypothesize that the surprisal estimated at different diffusion process noise levels corresponds to the surprisal of music and audio features present at different audio granularities. Testing our hypothesis, we find that, for appropriate noise levels, the studied musical surprisal tasks’ results improve. Code is provided on github.com/SonyCSLParis/audioic. 最近，来自生成无限词汇变换器（GIVT）预测的信息含量（IC）已被用于对音频中的音乐期待与惊讶度建模。我们使用基于自回归扩散模型（ADM）计算的 IC 来研究此类建模的有效性。我们通过实验证明，基于两种不同扩散常微分方程（ODE）的模型的 IC 估计，在负对数似然方面，比 GIVT 更能描述多样的数据。我们通过考察两个任务来评估扩散模型 IC 在捕捉惊讶感方面的有效性：（1）捕捉单声部音高的惊讶度，以及（2）在多轨音频中检测片段边界。在这两个任务中，扩散模型的表现与或优于 GIVT。我们假设在不同扩散过程噪声水平下估计的惊讶度对应于存在于不同音频粒度上的音乐和音频特征的惊讶度。为检验这一假设，我们发现对于合适的噪声水平，所研究的音乐惊讶任务的结果有所改善。代码已发布于 github.com/SonyCSLParis/audioic。

Subjects: Sound, Artificial Intelligence, Audio and Speech Processing 主题：声音，人工智能，音频与语音处理

Publish: 2025-08-07 12:05:27 UTC 发布：2025-08-07 12:05:27 UTC

#69 VS-LLM: Visual-Semantic Depression Assessment based on LLM for Drawing Projection Test #69 VS-LLM：基于 LLM 的绘画投射测验视觉语义抑郁评估

The Drawing Projection Test (DPT) is an essential tool in art therapy, allowing psychologists to assess participants’ mental states through their sketches. Specifically, through sketches with the theme of “a person picking an apple from a tree (PPAT)”, it can be revealed whether the participants are in mental states such as depression. Compared with scales, the DPT can enrich psychologists’ understanding of an individual’s mental state. However, the interpretation of the PPAT is laborious and depends on the experience of the psychologists. To address this issue, we propose an effective identification method to support psychologists in conducting a large-scale automatic DPT. Unlike traditional sketch recognition, DPT more focus on the overall evaluation of the sketches, such as color usage and space utilization. Moreover, PPAT imposes a time limit and prohibits verbal reminders, resulting in low drawing accuracy and a lack of detailed depiction. To address these challenges, we propose the following efforts: (1) Providing an experimental environment for automated analysis of PPAT sketches for depression assessment; (2) Offering a Visual-Semantic depression assessment based on LLM (VS-LLM) method; (3) Experimental results demonstrate that our method improves by 17.6% compared to the psychologist assessment method. We anticipate that this work will contribute to the research in mental state assessment based on PPAT sketches’ elements recognition. Our datasets and codes are available at https://github.com/wmeiqi/VS-LLM. 绘画投射测试（DPT）是艺术治疗中的重要工具，允许心理学家通过参与者的素描评估其心理状态。具体而言，通过主题为“一个人从树上摘苹果（PPAT）”的素描，可以揭示参与者是否处于抑郁等心理状态。与量表相比，DPT 能丰富心理学家对个体心理状态的理解。然而，PPAT 的解释工作繁重且依赖心理学家的经验。为了解决此问题，我们提出了一种有效的识别方法，以支持心理学家进行大规模自动化的 DPT。不同于传统的素描识别，DPT 更注重对素描的整体评估，例如颜色使用和空间利用。此外，PPAT 设有时间限制并禁止口头提示，导致绘画准确性低且缺乏细节描绘。为了解决这些挑战，我们提出以下工作： (1) 提供一个用于自动分析用于抑郁评估的 PPAT 草图的实验环境； (2) 提出一种基于 LLM 的视觉语义抑郁评估方法（VS-LLM）； (3) 实验结果表明，我们的方法较心理学家评估方法提升了 17.6%。我们期待这项工作能为基于 PPAT 草图元素识别的心理状态评估研究作出贡献。我们的数据集和代码可在 https://github.com/wmeiqi/VS-LLM 获取。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 学科：计算机视觉与模式识别，人工智能

Publish: 2025-08-07 11:59:50 UTC 发布：2025-08-07 11:59:50 UTC

#70 Towards Embodied Agentic AI: Review and Classification of LLM- and VLM-Driven Robot Autonomy and Interaction #70 面向具身代理智能的进展：基于 LLM 和 VLM 驱动的机器人自主性与交互的综述与分类

Authors: [Sahar Salimpour](https://arxiv.org/search/?searchtype=author&query=Sahar Salimpour), [Lei Fu](https://arxiv.org/search/?searchtype=author&query=Lei Fu), [Farhad Keramat](https://arxiv.org/search/?searchtype=author&query=Farhad Keramat), [Leonardo Militano](https://arxiv.org/search/?searchtype=author&query=Leonardo Militano), [Giovanni Toffetti](https://arxiv.org/search/?searchtype=author&query=Giovanni Toffetti), [Harry Edelman](https://arxiv.org/search/?searchtype=author&query=Harry Edelman), [Jorge Peña Queralta](https://arxiv.org/search/?searchtype=author&query=Jorge Peña Queralta) 作者：Sahar Salimpour、Lei Fu、Farhad Keramat、Leonardo Militano、Giovanni Toffetti、Harry Edelman、Jorge Peña Queralta

Foundation models, including large language models (LLMs) and vision-language models (VLMs), have recently enabled novel approaches to robot autonomy and human-robot interfaces. In parallel, vision-language-action models (VLAs) or large behavior models (BLMs) are increasing the dexterity and capabilities of robotic systems. This survey paper focuses on those words advancing towards agentic applications and architectures. This includes initial efforts exploring GPT-style interfaces to tooling, as well as more complex system where AI agents are coordinators, planners, perception actors, or generalist interfaces. Such agentic architectures allow robots to reason over natural language instructions, invoke APIs, plan task sequences, or assist in operations and diagnostics. In addition to peer-reviewed research, due to the fast-evolving nature of the field, we highlight and include community-driven projects, ROS packages, and industrial frameworks that show emerging trends. We propose a taxonomy for classifying model integration approaches and present a comparative analysis of the role that agents play in different solutions in today’s literature. 基础模型，包括大型语言模型 (LLMs) 和视觉-语言模型 (VLMs)，最近促成了机器人自主性和人机交互的新方法。与此同时，视觉-语言-动作模型 (VLAs) 或大行为模型 (BLMs) 正在提升机器人系统的灵活性和能力。本文综述聚焦于那些朝向智能体应用和架构发展的工作。这包括探索类 GPT 接口到工具的初步尝试，以及更复杂的系统，其中 AI 代理充当协调者、规划者、感知执行者或通用接口。此类智能体架构使机器人能够基于自然语言指令进行推理、调用 API、规划任务序列或协助操作与诊断。除了经同行评审的研究外，鉴于该领域快速发展的特性，我们还强调并纳入了社区驱动的项目、ROS 包和显示出新兴趋势的工业框架。我们提出了一个用于分类模型集成方法的分类法，并呈现了对当今文献中代理在不同解决方案中所扮演角色的比较分析。

Subjects: Robotics, Artificial Intelligence, Machine Learning 主题：机器人学、人工智能、机器学习

Publish: 2025-08-07 11:48:03 UTC 发布：2025-08-07 11:48:03 协调世界时（UTC）

#71 FlowState: Sampling Rate Invariant Time Series Forecasting #71 FlowState：采样率不变的时间序列预测

Authors: [Lars Graf](https://arxiv.org/search/?searchtype=author&query=Lars Graf), [Thomas Ortner](https://arxiv.org/search/?searchtype=author&query=Thomas Ortner), [Stanisław Woźniak](https://arxiv.org/search/?searchtype=author&query=Stanisław Woźniak), [Angeliki Pantazi](https://arxiv.org/search/?searchtype=author&query=Angeliki Pantazi) 作者：Lars Graf、Thomas Ortner、Stanisław Woźniak、Angeliki Pantazi

Foundation models (FMs) have transformed natural language processing, but their success has not yet translated to time series forecasting. Existing time series foundation models (TSFMs), often based on transformer variants, struggle with generalization across varying context and target lengths, lack adaptability to different sampling rates, and are computationally inefficient. We introduce FlowState, a novel TSFM architecture that addresses these challenges through two key innovations: a state space model (SSM) based encoder and a functional basis decoder. This design enables continuous-time modeling and dynamic time-scale adjustment, allowing FlowState to inherently generalize across all possible temporal resolutions, and dynamically adjust the forecasting horizons. In contrast to other state-of-the-art TSFMs, which require training data across all possible sampling rates to memorize patterns at each scale, FlowState inherently adapts its internal dynamics to the input scale, enabling smaller models, reduced data requirements, and improved efficiency. We further propose an efficient pretraining strategy that improves robustness and accelerates training. Despite being the smallest model, FlowState outperforms all other models and is state-of-the-art for the GIFT-ZS and the Chronos-ZS benchmarks. Ablation studies confirm the effectiveness of its components, and we demonstrate its unique ability to adapt online to varying input sampling rates. 基础模型（FM）已经改变了自然语言处理领域，但其成功尚未转化为时间序列预测。现有的时间序列基础模型（TSFM），通常基于变体的 Transformer，在不同的上下文和目标长度间泛化能力较差，难以适应不同的采样率，并且计算效率低下。我们提出了 FlowState，一种新颖的 TSFM 架构，通过两项关键创新加以应对：基于状态空间模型（SSM）的编码器和基于函数基的解码器。该设计实现了连续时间建模和动态时间尺度调整，使 FlowState 能够固有地在所有可能的时间分辨率间泛化，并动态调整预测时距。不同于其他需要在所有可能采样率下训练以记忆每个尺度模式的最先进 TSFM，FlowState 能够根据输入尺度自身调整内部动力学，从而实现更小的模型、降低数据需求并提高效率。我们还提出了一种高效的预训练策略，以提升鲁棒性并加速训练。尽管是体积最小的模型，FlowState 的表现优于所有其他模型，并在 GIFT-ZS 和 Chronos-ZS 基准上达到了最新技术水平。消融研究证实了其各组成部分的有效性，我们还展示了它在线适应不同输入采样率的独特能力。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-07 11:30:26 UTC 发布：2025-08-07 11:30:26 UTC

#72 SGDFuse: SAM-Guided Diffusion for High-Fidelity Infrared and Visible Image Fusion #72 SGDFuse：基于 SAM 引导扩散的高保真红外与可见光图像融合

Authors: [Xiaoyang Zhang](https://arxiv.org/search/?searchtype=author&query=Xiaoyang Zhang), [Zhen Hua](https://arxiv.org/search/?searchtype=author&query=Zhen Hua), [Yakun Ju](https://arxiv.org/search/?searchtype=author&query=Yakun Ju), [Wei Zhou](https://arxiv.org/search/?searchtype=author&query=Wei Zhou), [Jun Liu](https://arxiv.org/search/?searchtype=author&query=Jun Liu), [Alex C. Kot](https://arxiv.org/search/?searchtype=author&query=Alex C. Kot) 作者：Xiaoyang Zhang, Zhen Hua, Yakun Ju, Wei Zhou, Jun Liu, Alex C. Kot

Infrared and visible image fusion (IVIF) aims to combine the thermal radiation information from infrared images with the rich texture details from visible images to enhance perceptual capabilities for downstream visual tasks. However, existing methods often fail to preserve key targets due to a lack of deep semantic understanding of the scene, while the fusion process itself can also introduce artifacts and detail loss, severely compromising both image quality and task performance. To address these issues, this paper proposes SGDFuse, a conditional diffusion model guided by the Segment Anything Model (SAM), to achieve high-fidelity and semantically-aware image fusion. The core of our method is to utilize high-quality semantic masks generated by SAM as explicit priors to guide the optimization of the fusion process via a conditional diffusion model. Specifically, the framework operates in a two-stage process: it first performs a preliminary fusion of multi-modal features, and then utilizes the semantic masks from SAM jointly with the preliminary fused image as a condition to drive the diffusion model’s coarse-to-fine denoising generation. This ensures the fusion process not only has explicit semantic directionality but also guarantees the high fidelity of the final result. Extensive experiments demonstrate that SGDFuse achieves state-of-the-art performance in both subjective and objective evaluations, as well as in its adaptability to downstream tasks, providing a powerful solution to the core challenges in image fusion. The code of SGDFuse is available at https://github.com/boshizhang123/SGDFuse. 红外与可见光图像融合（IVIF）旨在将红外图像的热辐射信息与可见光图像的丰富纹理细节相结合，以增强下游视觉任务的感知能力。然而，现有方法常因缺乏对场景的深层语义理解而无法保留关键目标，同时融合过程本身也可能引入伪影和细节丢失，严重损害图像质量和任务性能。为了解决这些问题，本文提出了 SGDFuse，一种由“随便分割任何东西”模型（SAM）引导的条件扩散模型，以实现高保真且具语义感知的图像融合。我们方法的核心是利用 SAM 生成的高质量语义掩码作为显式先验，通过条件扩散模型引导融合过程的优化。具体而言，该框架采用两阶段流程：首先对多模态特征进行初步融合，然后将 SAM 的语义掩码与初步融合图像共同作为条件，驱动扩散模型进行由粗到细的去噪生成。这确保了融合过程不仅具有明确的语义方向性，而且保证了最终结果的高保真度。大量实验表明，SGDFuse 在主观和客观评估以及对下游任务的适应性方面均达到了最先进的性能，为图像融合的核心挑战提供了有力的解决方案。SGDFuse 的代码可在 https://github.com/boshizhang123/SGDFuse 获取。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 学科：计算机视觉与模式识别，人工智能

Publish: 2025-08-07 10:58:52 UTC 发表：2025-08-07 10:58:52 UTC

#73 Robust Tracking with Particle Filtering for Fluorescent Cardiac Imaging #73 使用粒子滤波的稳健追踪用于荧光心脏成像

Authors: [Suresh Guttikonda](https://arxiv.org/search/?searchtype=author&query=Suresh Guttikonda), [Maximilian Neidhart](https://arxiv.org/search/?searchtype=author&query=Maximilian Neidhart), [Johanna Sprenger](https://arxiv.org/search/?searchtype=author&query=Johanna Sprenger), [Johannes Petersen](https://arxiv.org/search/?searchtype=author&query=Johannes Petersen), [Christian Detter](https://arxiv.org/search/?searchtype=author&query=Christian Detter), [Alexander Schlaefer](https://arxiv.org/search/?searchtype=author&query=Alexander Schlaefer) 作者：Suresh Guttikonda、Maximilian Neidhart、Johanna Sprenger、Johannes Petersen、Christian Detter、Alexander Schlaefer

Intraoperative fluorescent cardiac imaging enables quality control following coronary bypass grafting surgery. We can estimate local quantitative indicators, such as cardiac perfusion, by tracking local feature points. However, heart motion and significant fluctuations in image characteristics caused by vessel structural enrichment limit traditional tracking methods. We propose a particle filtering tracker based on cyclicconsistency checks to robustly track particles sampled to follow target landmarks. Our method tracks 117 targets simultaneously at 25.4 fps, allowing real-time estimates during interventions. It achieves a tracking error of (5.00 +/- 0.22 px) and outperforms other deep learning trackers (22.3 +/- 1.1 px) and conventional trackers (58.1 +/- 27.1 px). 术中荧光心脏成像使冠状动脉搭桥术后的质量控制成为可能。我们可以通过跟踪局部特征点来估计局部定量指标，例如心肌灌注。然而，心脏运动以及由血管结构富集引起的图像特征显著波动限制了传统跟踪方法。我们提出了一种基于循环一致性检查的粒子滤波跟踪器，以稳健地跟踪为跟随目标标志点而采样的粒子。我们的方法以 25.4 帧/秒同时跟踪 117 个目标，允许在介入过程中进行实时估计。其跟踪误差为（5.00 +/- 0.22 像素），优于其他深度学习跟踪器（22.3 +/- 1.1 像素）和传统跟踪器（58.1 +/- 27.1 像素）。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-07 10:57:13 UTC 发布：2025-08-07 10:57:13 协调世界时（UTC）

#74 Marine Chlorophyll Prediction and Driver Analysis based on LSTM-RF Hybrid Models #74 基于 LSTM-RF 混合模型的海洋叶绿素预测与驱动因素分析

Marine chlorophyll concentration is an important indicator of ecosystem health and carbon cycle strength, and its accurate prediction is crucial for red tide warning and ecological response. In this paper, we propose a LSTM-RF hybrid model that combines the advantages of LSTM and RF, which solves the deficiencies of a single model in time-series modelling and nonlinear feature portrayal. Trained with multi-source ocean data(temperature, salinity, dissolved oxygen, etc.), the experimental results show that the LSTM-RF model has an R^2 of 0.5386, an MSE of 0.005806, and an MAE of 0.057147 on the test set, which is significantly better than using LSTM (R^2 = 0.0208) and RF (R^2 =0.4934) alone , respectively. The standardised treatment and sliding window approach improved the prediction accuracy of the model and provided an innovative solution for high-frequency prediction of marine ecological variables. 海洋叶绿素浓度是生态系统健康和碳循环强度的重要指标，其准确预测对于赤潮预警和生态响应至关重要。本文提出了一种结合 LSTM 与随机森林（RF）优点的 LSTM-RF 混合模型，以解决单一模型在时间序列建模和非线性特征表达方面的不足。该模型在以多源海洋数据（温度、盐度、溶解氧等）训练后，实验结果表明在测试集上 LSTM-RF 模型的 R^2 为 0.5386，MSE 为 0.005806，MAE 为 0.057147，分别显著优于单独使用 LSTM（R^2 = 0.0208）和 RF（R^2 = 0.4934）。标准化处理和滑动窗口方法提升了模型的预测精度，并为海洋生态变量的高频预测提供了创新性解决方案。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-07 10:55:42 UTC 发布：2025-08-07 10:55:42 UTC

#75 CF3: Compact and Fast 3D Feature Fields

Authors: [Hyunjoon Lee](https://arxiv.org/search/?searchtype=author&query=Hyunjoon Lee), [Joonkyu Min](https://arxiv.org/search/?searchtype=author&query=Joonkyu Min), [Jaesik Park](https://arxiv.org/search/?searchtype=author&query=Jaesik Park) 作者：Hyunjoon Lee、Joonkyu Min、Jaesik Park

3D Gaussian Splatting (3DGS) has begun incorporating rich information from 2D foundation models. However, most approaches rely on a bottom-up optimization process that treats raw 2D features as ground truth, incurring increased computational costs. We propose a top-down pipeline for constructing compact and fast 3D Gaussian feature fields, namely, CF3. We first perform a fast weighted fusion of multi-view 2D features with pre-trained Gaussians. This approach enables training a per-Gaussian autoencoder directly on the lifted features, instead of training autoencoders in the 2D domain. As a result, the autoencoder better aligns with the feature distribution. More importantly, we introduce an adaptive sparsification method that optimizes the Gaussian attributes of the feature field while pruning and merging the redundant Gaussians, constructing an efficient representation with preserved geometric details. Our approach achieves a competitive 3D feature field using as little as 5% of the Gaussians compared to Feature-3DGS. 3D 高斯点喷（3DGS）已开始融合来自 2D 基础模型的丰富信息。然而，大多数方法依赖自下而上的优化过程，将原始 2D 特征视为真实标签，从而导致计算成本增加。我们提出了一种用于构建紧凑且快速的 3D 高斯特征场的自上而下流水线，称为 CF3。我们首先对多视角 2D 特征与预训练高斯进行快速加权融合。该方法使得可以直接在提升后的特征上训练每个高斯的自编码器，而不是在 2D 域中训练自编码器。因此，自编码器能更好地与特征分布对齐。更重要的是，我们引入了一种自适应稀疏化方法，在剪枝和合并冗余高斯的同时优化特征场的高斯属性，从而构建出在保留几何细节的情况下高效的表示。与 Feature-3DGS 相比，我们的方法仅使用多达 5% 的高斯即可实现具有竞争力的 3D 特征场。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 领域：计算机视觉与模式识别，人工智能

Publish: 2025-08-07 10:45:08 UTC 发布：2025-08-07 10:45:08 UTC

#76 A Study of Gender Classification Techniques Based on Iris Images: A Deep Survey and Analysis #76 基于虹膜图像的性别分类技术研究：深度综述与分析

Authors: [Basna Mohammed Salih Hasan](https://arxiv.org/search/?searchtype=author&query=Basna Mohammed Salih Hasan), [Ramadhan J. Mstafa](https://arxiv.org/search/?searchtype=author&query=Ramadhan J. Mstafa) 作者：Basna Mohammed Salih Hasan，Ramadhan J. Mstafa

Gender classification is attractive in a range of applications, including surveillance and monitoring, corporate profiling, and human-computer interaction. Individuals’ identities may be gleaned from information about their gender, which is a kind of soft biometric.Over the years, several methods for determining a person’s gender have been devised. Some of the most well-known ones are based on physical characteristics like face, fingerprint, palmprint, DNA, ears, gait, and iris. On the other hand, facial features account for the vast majority of gender classification methods. Also, the iris is a significant biometric trait because the iris, according to research, remains basically constant during an individual’s life. Besides that, the iris is externally visible and is non-invasive to the user, which is important for practical applications. Furthermore, there are already high-quality methods for segmenting and encoding iris images, and the current methods facilitate selecting and extracting attribute vectors from iris textures. This study discusses several approaches to determining gender. The previous works of literature are briefly reviewed. Additionally, there are a variety of methodologies for different steps of gender classification. This study provides researchers with knowledge and analysis of the existing gender classification approaches. Also, it will assist researchers who are interested in this specific area, as well as highlight the gaps and challenges in the field, and finally provide suggestions and future paths for improvement. 性别分类在许多应用中具有吸引力，包括监控与监督、企业画像以及人机交互。个人的性别信息属于一种软生物特征，可能被用来推断其身份。多年来，已经提出了若干判断人类性别的方法。其中一些最著名的方法基于面部、指纹、掌纹、DNA、耳朵、步态和虹膜等物理特征。另一方面，面部特征占据了性别分类方法的绝大多数。此外，虹膜也是一种重要的生物识别特征，因为研究表明，虹膜在个体一生中基本保持不变。除此之外，虹膜在外部可见，对用户无创，这对实际应用很重要。此外，已经存在高质量的虹膜分割与编码方法，当前的方法也便于从虹膜纹理中选择并提取属性向量。本研究讨论了几种性别识别的方法，并对以往文献工作进行了简要回顾。此外，对于性别分类的不同步骤存在各种方法。本研究为研究人员提供了现有性别分类方法的知识和分析。同时，它将帮助对该特定领域感兴趣的研究人员，并突出该领域的空白和挑战，最终提供改进建议和未来路径。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning 主题：计算机视觉与模式识别、人工智能、机器学习

Publish: 2025-08-07 10:33:40 UTC

#77 RegionMed-CLIP: A Region-Aware Multimodal Contrastive Learning Pre-trained Model for Medical Image Understanding

Authors: [Tianchen Fang](https://arxiv.org/search/?searchtype=author&query=Tianchen Fang), [Guiru Liu](https://arxiv.org/search/?searchtype=author&query=Guiru Liu)

Medical image understanding plays a crucial role in enabling automated diagnosis and data-driven clinical decision support. However, its progress is impeded by two primary challenges: the limited availability of high-quality annotated medical data and an overreliance on global image features, which often miss subtle but clinically significant pathological regions. To address these issues, we introduce RegionMed-CLIP, a region-aware multimodal contrastive learning framework that explicitly incorporates localized pathological signals along with holistic semantic representations. The core of our method is an innovative region-of-interest (ROI) processor that adaptively integrates fine-grained regional features with the global context, supported by a progressive training strategy that enhances hierarchical multimodal alignment. To enable large-scale region-level representation learning, we construct MedRegion-500k, a comprehensive medical image-text corpus that features extensive regional annotations and multilevel clinical descriptions. Extensive experiments on image-text retrieval, zero-shot classification, and visual question answering tasks demonstrate that RegionMed-CLIP consistently exceeds state-of-the-art vision language models by a wide margin. Our results highlight the critical importance of region-aware contrastive pre-training and position RegionMed-CLIP as a robust foundation for advancing multimodal medical image understanding. 医学影像理解在实现自动化诊断和数据驱动的临床决策支持方面起着关键作用。然而，其进展受到两个主要挑战的制约：高质量带注释的医学数据有限，以及过度依赖全局图像特征，这往往忽略了微妙但在临床上重要的病变区域。为了解决这些问题，我们提出了 RegionMed-CLIP，一种区域感知的多模态对比学习框架，明确地将局部病理信号与整体语义表征结合。我们方法的核心是一个创新的感兴趣区域（ROI）处理器，它自适应地将细粒度的区域特征与全局上下文整合，并由一种逐步训练策略支持以增强分层的多模态对齐。为了实现大规模的区域级表征学习，我们构建了 MedRegion-500k，这是一个全面的医学图像-文本语料库，具有丰富的区域注释和多层次的临床描述。在图像-文本检索、零样本分类和视觉问答任务上进行的大量实验证明，RegionMed-CLIP 在很大程度上持续超越最先进的视觉语言模型。我们的结果突显了基于区域的对比预训练的重要性，并将 RegionMed-CLIP 定位为推进多模态医学影像理解的强大基础。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-07 10:32:03 UTC 发表：2025-08-07 10:32:03 UTC

#78 Coarse-to-Fine Joint Registration of MR and Ultrasound Images via Imaging Style Transfer #78 通过影像风格迁移实现的 MR 与超声图像粗到细联合配准

Authors: [Junyi Wang](https://arxiv.org/search/?searchtype=author&query=Junyi Wang), [Xi Zhu](https://arxiv.org/search/?searchtype=author&query=Xi Zhu), [Yikun Guo](https://arxiv.org/search/?searchtype=author&query=Yikun Guo), [Zixi Wang](https://arxiv.org/search/?searchtype=author&query=Zixi Wang), [Haichuan Gao](https://arxiv.org/search/?searchtype=author&query=Haichuan Gao), [Le Zhang](https://arxiv.org/search/?searchtype=author&query=Le Zhang), [Fan Zhang](https://arxiv.org/search/?searchtype=author&query=Fan Zhang) 作者：Junyi Wang、Xi Zhu、Yikun Guo、Zixi Wang、Haichuan Gao、Le Zhang、Fan Zhang

We developed a pipeline for registering pre-surgery Magnetic Resonance (MR) images and post-resection Ultrasound (US) images. Our approach leverages unpaired style transfer using 3D CycleGAN to generate synthetic T1 images, thereby enhancing registration performance. Additionally, our registration process employs both affine and local deformable transformations for a coarse-to-fine registration. The results demonstrate that our approach improves the consistency between MR and US image pairs in most cases. 我们开发了一个用于配准术前磁共振（MR）图像与术后切除超声（US）图像的流程。我们的方法利用 3D CycleGAN 进行无配对风格迁移以生成合成 T1 图像，从而提高配准性能。此外，我们的配准过程采用仿射和局部可变形变换相结合的粗到细配准。结果表明，我们的方法在大多数情况下改善了 MR 与 US 图像对之间的一致性。

Subjects: Image and Video Processing, Artificial Intelligence, Computer Vision and Pattern Recognition 主题：图像与视频处理，人工智能，计算机视觉与模式识别

Publish: 2025-08-07 10:27:50 UTC 发表：2025-08-07 10:27:50 UTC

#79 Pruning Large Language Models by Identifying and Preserving Functional Networks #79 通过识别并保留功能网络来剪枝大型语言模型

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-07 10:27:01 UTC 发布：2025-08-07 10:27:01 协调世界时

#80 Driver Assistant: Persuading Drivers to Adjust Secondary Tasks Using Large Language Models #80 驾驶员助手：使用大型语言模型劝导驾驶员调整次要任务

Authors: [Wei Xiang](https://arxiv.org/search/?searchtype=author&query=Wei Xiang), [Muchen Li](https://arxiv.org/search/?searchtype=author&query=Muchen Li), [Jie Yan](https://arxiv.org/search/?searchtype=author&query=Jie Yan), [Manling Zheng](https://arxiv.org/search/?searchtype=author&query=Manling Zheng), [Hanfei Zhu](https://arxiv.org/search/?searchtype=author&query=Hanfei Zhu), [Mengyun Jiang](https://arxiv.org/search/?searchtype=author&query=Mengyun Jiang), [Lingyun Sun](https://arxiv.org/search/?searchtype=author&query=Lingyun Sun) 作者：向威、李慕晨、严杰、郑漫灵、朱涵飞、蒋梦云、孙令云

Level 3 automated driving systems allows drivers to engage in secondary tasks while diminishing their perception of risk. In the event of an emergency necessitating driver intervention, the system will alert the driver with a limited window for reaction and imposing a substantial cognitive burden. To address this challenge, this study employs a Large Language Model (LLM) to assist drivers in maintaining an appropriate attention on road conditions through a “humanized” persuasive advice. Our tool leverages the road conditions encountered by Level 3 systems as triggers, proactively steering driver behavior via both visual and auditory routes. Empirical study indicates that our tool is effective in sustaining driver attention with reduced cognitive load and coordinating secondary tasks with takeover behavior. Our work provides insights into the potential of using LLMs to support drivers during multi-task automated driving. 三级自动驾驶系统允许驾驶员在减少风险感知的同时从事次要任务。在需要驾驶员介入的紧急情况下，系统会在有限的反应窗口内提醒驾驶员，这会带来巨大的认知负担。为应对这一挑战，本研究采用了大型语言模型（LLM），通过“拟人化”的劝导建议帮助驾驶员保持对道路状况的适当关注。我们的工具利用三级系统遇到的道路状况作为触发器，主动通过视觉和听觉两种途径引导驾驶员行为。实证研究表明，我们的工具在以较低认知负担维持驾驶员注意力并协调次要任务与接管行为方面是有效的。我们的工作为利用 LLMs 在多任务自动驾驶期间支持驾驶员提供了见解。

Subjects: Human-Computer Interaction, Artificial Intelligence 主题：人机交互，人工智能

Publish: 2025-08-07 10:26:28 UTC 发布：2025-08-07 10:26:28 UTC

#81 Navigating the Trade-off: A Synthesis of Defensive Strategies for Zero-Shot Adversarial Robustness in Vision-Language Models #81 在权衡中航行：面向视觉-语言模型零样本对抗鲁棒性的防御策略综述

Authors: [Zane Xu](https://arxiv.org/search/?searchtype=author&query=Zane Xu), [Jason Sun](https://arxiv.org/search/?searchtype=author&query=Jason Sun) 作者：Zane Xu，Jason Sun

This report synthesizes eight seminal papers on the zero-shot adversarial robustness of vision-language models (VLMs) like CLIP. A central challenge in this domain is the inherent trade-off between enhancing adversarial robustness and preserving the model’s zero-shot generalization capabilities. We analyze two primary defense paradigms: Adversarial Fine-Tuning (AFT), which modifies model parameters, and Training-Free/Test-Time Defenses, which preserve them. We trace the evolution from alignment-preserving methods (TeCoA) to embedding space re-engineering (LAAT, TIMA), and from input heuristics (AOM, TTC) to latent-space purification (CLIPure). Finally, we identify key challenges and future directions including hybrid defense strategies and adversarial pre-training. 本报告综合了八篇关于像 CLIP 这样的视觉-语言模型（VLM）零样本对抗鲁棒性的奠基性论文。本领域的一个核心挑战是提升对抗鲁棒性与保持模型零样本泛化能力之间的固有权衡。我们分析了两种主要的防御范式：修改模型参数的对抗微调（AFT）以及保持参数不变的无训练/测试时防御。我们追溯了从保持对齐的方法（TeCoA）到嵌入空间再工程（LAAT、TIMA），以及从输入层启发式方法（AOM、TTC）到潜在空间净化（CLIPure）的演进。最后，我们指出了包括混合防御策略和对抗性预训练在内的关键挑战与未来方向。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-07 10:26:10 UTC 发布：2025-08-07 10:26:10 协调世界时（UTC）

#82 Resource-Limited Joint Multimodal Sentiment Reasoning and Classification via Chain-of-Thought Enhancement and Distillation #82 在资源受限情况下通过链式思维增强与蒸馏进行联合多模态情感推理与分类

Authors: [Haonan Shangguan](https://arxiv.org/search/?searchtype=author&query=Haonan Shangguan), [Xiaocui Yang](https://arxiv.org/search/?searchtype=author&query=Xiaocui Yang), [Shi Feng](https://arxiv.org/search/?searchtype=author&query=Shi Feng), [Daling Wang](https://arxiv.org/search/?searchtype=author&query=Daling Wang), [Yifei Zhang](https://arxiv.org/search/?searchtype=author&query=Yifei Zhang), [Ge Yu](https://arxiv.org/search/?searchtype=author&query=Ge Yu) 作者：尚皓南，杨晓翠，冯世，王达灵，张一飞，俞戈

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-07 10:23:14 UTC 发布：2025-08-07 10:23:14 UTC

#83 FDC-Net: Rethinking the association between EEG artifact removal and multi-dimensional affective computing #83 FDC-Net：重新思考 EEG 伪迹去除与多维情感计算之间的关联 [PDF ] [Copy] [Kimi ] [REL]

Authors: [Wenjia Dong](https://arxiv.org/search/?searchtype=author&query=Wenjia Dong), [Xueyuan Xu](https://arxiv.org/search/?searchtype=author&query=Xueyuan Xu), [Tianze Yu](https://arxiv.org/search/?searchtype=author&query=Tianze Yu), [Junming Zhang](https://arxiv.org/search/?searchtype=author&query=Junming Zhang), [Li Zhuo](https://arxiv.org/search/?searchtype=author&query=Li Zhuo) 作者：Wenjia Dong, Xueyuan Xu, Tianze Yu, Junming Zhang, Li Zhuo

Electroencephalogram (EEG)-based emotion recognition holds significant value in affective computing and brain-computer interfaces. However, in practical applications, EEG recordings are susceptible to the effects of various physiological artifacts. Current approaches typically treat denoising and emotion recognition as independent tasks using cascaded architectures, which not only leads to error accumulation, but also fails to exploit potential synergies between these tasks. Moreover, conventional EEG-based emotion recognition models often rely on the idealized assumption of “perfectly denoised data”, lacking a systematic design for noise robustness. To address these challenges, a novel framework that deeply couples denoising and emotion recognition tasks is proposed for end-to-end noise-robust emotion recognition, termed as Feedback-Driven Collaborative Network for Denoising-Classification Nexus (FDC-Net). Our primary innovation lies in establishing a dynamic collaborative mechanism between artifact removal and emotion recognition through: (1) bidirectional gradient propagation with joint optimization strategies; (2) a gated attention mechanism integrated with frequency-adaptive Transformer using learnable band-position encoding. Two most popular EEG-based emotion datasets (DEAP and DREAMER) with multi-dimensional emotional labels were employed to compare the artifact removal and emotion recognition performance between ASLSL and nine state-of-the-art methods. In terms of the denoising task, FDC-Net obtains a maximum correlation coefficient (CC) value of 96.30% on DEAP and a maximum CC value of 90.31% on DREAMER. In terms of the emotion recognition task under physiological artifact interference, FDC-Net achieves emotion recognition accuracies of 82.3+7.1% on DEAP and 88.1+0.8% on DREAMER. 基于脑电图（EEG）的情感识别在情感计算和脑—机接口中具有重要价值。然而，在实际应用中，EEG 记录容易受到各种生理伪迹的影响。现有方法通常将去噪和情感识别视为独立任务并采用级联架构，这不仅导致误差累积，也未能利用这两项任务之间的潜在协同效应。此外，传统的基于 EEG 的情感识别模型常常依赖“完美去噪数据”的理想化假设，缺乏系统的噪声鲁棒性设计。为了解决这些挑战，提出了一个将去噪与情感识别任务深度耦合的全端到端噪声鲁棒情感识别新框架，称为用于去噪—分类纽带的反馈驱动协同网络（FDC-Net）。我们的主要创新在于通过以下方式在伪迹去除和情感识别之间建立动态协同机制：(1) 带有联合优化策略的双向梯度传播；(2) 将可学习的频段位置编码集成到频率自适应 Transformer 中的门控注意力机制。采用两个最流行的基于 EEG 的情感数据集（DEAP 和 DREAMER）及其多维情感标签，比较了 ASLSL 与九种最先进方法在伪迹去除和情感识别方面的性能。在去噪任务方面，FDC-Net 在 DEAP 上获得了最高相关系数（CC）值 96.30%，在 DREAMER 上获得了最高 CC 值 90.31%。在存在生理伪迹干扰的情感识别任务方面，FDC-Net 在 DEAP 和 DREAMER 上分别达到了 82.3±7.1%和 88.1±0.8%的情感识别准确率。

Subjects: Human-Computer Interaction, Artificial Intelligence 主题：人机交互，人工智能

Publish: 2025-08-07 10:19:16 UTC 发布：2025-08-07 10:19:16 UTC

#84 ADSEL: Adaptive dual self-expression learning for EEG feature selection via incomplete multi-dimensional emotional tagging

Authors: [Tianze Yu](https://arxiv.org/search/?searchtype=author&query=Tianze Yu), [Junming Zhang](https://arxiv.org/search/?searchtype=author&query=Junming Zhang), [Wenjia Dong](https://arxiv.org/search/?searchtype=author&query=Wenjia Dong), [Xueyuan Xu](https://arxiv.org/search/?searchtype=author&query=Xueyuan Xu), [Li Zhuo](https://arxiv.org/search/?searchtype=author&query=Li Zhuo)

EEG based multi-dimension emotion recognition has attracted substantial research interest in human computer interfaces. However, the high dimensionality of EEG features, coupled with limited sample sizes, frequently leads to classifier overfitting and high computational complexity. Feature selection constitutes a critical strategy for mitigating these challenges. Most existing EEG feature selection methods assume complete multi-dimensional emotion labels. In practice, open acquisition environment, and the inherent subjectivity of emotion perception often result in incomplete label data, which can compromise model generalization. Additionally, existing feature selection methods for handling incomplete multi-dimensional labels primarily focus on correlations among various dimensions during label recovery, neglecting the correlation between samples in the label space and their interaction with various dimensions. To address these issues, we propose a novel incomplete multi-dimensional feature selection algorithm for EEG-based emotion recognition. The proposed method integrates an adaptive dual self-expression learning (ADSEL) with least squares regression. ADSEL establishes a bidirectional pathway between sample-level and dimension-level self-expression learning processes within the label space. It could facilitate the cross-sharing of learned information between these processes, enabling the simultaneous exploitation of effective information across both samples and dimensions for label reconstruction. Consequently, ADSEL could enhances label recovery accuracy and effectively identifies the optimal EEG feature subset for multi-dimensional emotion recognition. 基于脑电的多维情感识别在人机交互领域引起了大量研究兴趣。然而，脑电特征的高维性以及样本量有限，常导致分类器过拟合和计算复杂度高。特征选择是缓解这些问题的重要策略。大多数现有的脑电特征选择方法假设多维情感标签是完整的。实际上，开放的采集环境和情感感知的主观性常常导致标签数据不完整，这会损害模型的泛化能力。此外，现有用于处理不完整多维标签的特征选择方法主要侧重于在标签重构过程中各维度之间的相关性，忽视了标签空间中样本之间的相关性及其与各维度的相互作用。为解决这些问题，我们提出了一种用于基于脑电的情感识别的新型不完整多维特征选择算法。该方法将自适应双重自表达学习（ADSEL）与最小二乘回归相结合。 ADSEL 在标签空间内建立了样本级和维度级自表达学习过程之间的双向通路。它能够促进这两个过程之间学习信息的交叉共享，从而同时利用样本和维度上的有效信息进行标签重建。因此，ADSEL 可以提高标签恢复的准确性，并有效识别用于多维情感识别的最佳 EEG 特征子集。

Subjects: Human-Computer Interaction, Artificial Intelligence 主题：人机交互，人工智能

Publish: 2025-08-07 10:18:37 UTC

#85 CWEFS: Brain volume conduction effects inspired channel-wise EEG feature selection for multi-dimensional emotion recognition #85 CWEFS：受脑体积导传效应启发的逐通道脑电特征选择用于多维情感识别

Authors: [Xueyuan Xu](https://arxiv.org/search/?searchtype=author&query=Xueyuan Xu), [Wenjia Dong](https://arxiv.org/search/?searchtype=author&query=Wenjia Dong), [Fulin Wei](https://arxiv.org/search/?searchtype=author&query=Fulin Wei), [Li Zhuo](https://arxiv.org/search/?searchtype=author&query=Li Zhuo) 作者：徐学远、董文佳、魏福林、卓力

Due to the intracranial volume conduction effects, high-dimensional multi-channel electroencephalography (EEG) features often contain substantial redundant and irrelevant information. This issue not only hinders the extraction of discriminative emotional representations but also compromises the real-time performance. Feature selection has been established as an effective approach to address the challenges while enhancing the transparency and interpretability of emotion recognition models. However, existing EEG feature selection research overlooks the influence of latent EEG feature structures on emotional label correlations and assumes uniform importance across various channels, directly limiting the precise construction of EEG feature selection models for multi-dimensional affective computing. To address these limitations, a novel channel-wise EEG feature selection (CWEFS) method is proposed for multi-dimensional emotion recognition. Specifically, inspired by brain volume conduction effects, CWEFS integrates EEG emotional feature selection into a shared latent structure model designed to construct a consensus latent space across diverse EEG channels. To preserve the local geometric structure, this consensus space is further integrated with the latent semantic analysis of multi-dimensional emotional labels. Additionally, CWEFS incorporates adaptive channel-weight learning to automatically determine the significance of different EEG channels in the emotional feature selection task. The effectiveness of CWEFS was validated using three popular EEG datasets with multi-dimensional emotional labels. Comprehensive experimental results, compared against nineteen feature selection methods, demonstrate that the EEG feature subsets chosen by CWEFS achieve optimal emotion recognition performance across six evaluation metrics. 由于颅内体积传导效应，高维多通道脑电（EEG）特征通常包含大量冗余和无关信息。该问题不仅阻碍了判别性情感表示的提取，还影响实时性能。特征选择已被确立为一种有效方法，用以解决这些挑战并提升情感识别模型的透明性与可解释性。然而，现有的 EEG 特征选择研究忽视了潜在 EEG 特征结构对情感标签相关性的影响，并假定各通道重要性均匀，这直接限制了面向多维情感计算的精确 EEG 特征选择模型构建。为了解决这些局限性，提出了一种新颖的按通道 EEG 特征选择（CWEFS）方法，用于多维情感识别。具体而言，受大脑体积传导效应的启发，CWEFS 将 EEG 情感特征选择整合进一个共享的潜在结构模型，旨在在不同 EEG 通道之间构建一致的潜在空间。为保留局部几何结构，该一致空间还与多维情感标签的潜在语义分析相结合。此外，CWEFS 引入了自适应通道权重学习，以自动确定在情感特征选择任务中不同脑电通道的重要性。使用三个带有多维情感标签的流行脑电数据集验证了 CWEFS 的有效性。与十九种特征选择方法比较的全面实验结果表明，CWEFS 选择的脑电特征子集在六项评估指标上实现了最优的情感识别性能。

Subjects: Human-Computer Interaction, Artificial Intelligence 主题：人机交互，人工智能

Publish: 2025-08-07 10:17:59 UTC 发表于：2025-08-07 10:17:59 UTC

#86 ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking #86 ReasoningTrack：用于长期视觉-语言跟踪的链式思维推理

Authors: [Xiao Wang](https://arxiv.org/search/?searchtype=author&query=Xiao Wang), [Liye Jin](https://arxiv.org/search/?searchtype=author&query=Liye Jin), [Xufeng Lou](https://arxiv.org/search/?searchtype=author&query=Xufeng Lou), [Shiao Wang](https://arxiv.org/search/?searchtype=author&query=Shiao Wang), [Lan Chen](https://arxiv.org/search/?searchtype=author&query=Lan Chen), [Bo Jiang](https://arxiv.org/search/?searchtype=author&query=Bo Jiang), [Zhipeng Zhang](https://arxiv.org/search/?searchtype=author&query=Zhipeng Zhang) 作者：王晓、金立业、娄旭锋、王少、陈兰、蒋波、张志鹏

Vision-language tracking has received increasing attention in recent years, as textual information can effectively address the inflexibility and inaccuracy associated with specifying the target object to be tracked. Existing works either directly fuse the fixed language with vision features or simply modify using attention, however, their performance is still limited. Recently, some researchers have explored using text generation to adapt to the variations in the target during tracking, however, these works fail to provide insights into the model’s reasoning process and do not fully leverage the advantages of large models, which further limits their overall performance. To address the aforementioned issues, this paper proposes a novel reasoning-based vision-language tracking framework, named ReasoningTrack, based on a pre-trained vision-language model Qwen2.5-VL. Both SFT (Supervised Fine-Tuning) and reinforcement learning GRPO are used for the optimization of reasoning and language generation. We embed the updated language descriptions and feed them into a unified tracking backbone network together with vision features. Then, we adopt a tracking head to predict the specific location of the target object. In addition, we propose a large-scale long-term vision-language tracking benchmark dataset, termed TNLLT, which contains 200 video sequences. 20 baseline visual trackers are re-trained and evaluated on this dataset, which builds a solid foundation for the vision-language visual tracking task. Extensive experiments on multiple vision-language tracking benchmark datasets fully validated the effectiveness of our proposed reasoning-based natural language generation strategy. The source code of this paper will be released on https://github.com/Event-AHU/Open_VLTrack 近年来，视觉-语言跟踪受到越来越多关注，因为文本信息可以有效解决指定被跟踪目标时的僵化性和不准确性。现有工作要么将固定语言与视觉特征直接融合，要么仅通过注意力机制对其进行简单修正，然而其性能仍然有限。最近，一些研究者尝试使用文本生成来适应跟踪过程中目标的变化，但这些工作未能揭示模型的推理过程，也未充分利用大模型的优势，从而进一步限制了整体性能。为了解决上述问题，本文提出了一种基于推理的全新视觉-语言跟踪框架，称为 ReasoningTrack，基于预训练的视觉-语言模型 Qwen2.5-VL。本文在推理和语言生成的优化上同时采用了 SFT（有监督微调）与强化学习 GRPO。我们将更新后的语言描述嵌入，并与视觉特征一起输入到统一的跟踪主干网络中。然后，我们采用一个跟踪头来预测目标物体的具体位置。此外，我们提出了一个大规模的长期视觉-语言跟踪基准数据集，称为 TNLLT，包含 200 条视频序列。在该数据集上，重新训练并评估了 20 个基线视觉跟踪器，为视觉-语言视觉跟踪任务奠定了坚实基础。在多个视觉-语言跟踪基准数据集上的大量实验充分验证了我们所提出的基于推理的自然语言生成策略的有效性。本文的源代码将发布在 https://github.com/Event-AHU/Open_VLTrack

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning 主题：计算机视觉与模式识别，人工智能，机器学习

Publish: 2025-08-07 10:02:07 UTC 发布：2025-08-07 10:02:07 UTC

#87 Advanced Hybrid Transformer LSTM Technique with Attention and TS Mixer for Drilling Rate of Penetration Prediction #87 具有注意力机制和 TS 混合器的高级混合 Transformer LSTM 技术用于钻速（ROP）预测

Author: [Saddam Hussain Khan](https://arxiv.org/search/?searchtype=author&query=Saddam Hussain Khan) 作者：Saddam Hussain Khan

The Rate of Penetration (ROP) is crucial for optimizing drilling operations; however, accurately predicting it is hindered by the complex, dynamic, and high-dimensional nature of drilling data. Traditional empirical, physics-based, and basic machine learning models often fail to capture intricate temporal and contextual relationships, resulting in suboptimal predictions and limited real-time utility. To address this gap, we propose a novel hybrid deep learning architecture integrating Long Short-Term Memory (LSTM) networks, Transformer encoders, Time-Series Mixer (TS-Mixer) blocks, and attention mechanisms to synergistically model temporal dependencies, static feature interactions, global context, and dynamic feature importance. Evaluated on a real-world drilling dataset, our model outperformed benchmarks (standalone LSTM, TS-Mixer, and simpler hybrids) with an R-squared score of 0.9988 and a Mean Absolute Percentage Error of 1.447%, as measured by standard regression metrics (R-squared, MAE, RMSE, MAPE). Model interpretability was ensured using SHAP and LIME, while actual vs. predicted curves and bias checks confirmed accuracy and fairness across scenarios. This advanced hybrid approach enables reliable real-time ROP prediction, paving the way for intelligent, cost-effective drilling optimization systems with significant operational impact. 钻速（ROP）对于优化钻井作业至关重要；然而，由于钻井数据的复杂性、动态性和高维特性，准确预测钻速受到阻碍。传统的经验模型、基于物理的模型和基础机器学习模型常常无法捕捉复杂的时间和上下文关系，导致预测效果不佳且实时应用有限。为填补这一空白，我们提出了一种新颖的混合深度学习架构，整合了长短期记忆（LSTM）网络、Transformer 编码器、时间序列混合器（TS-Mixer）模块和注意力机制，以协同建模时间依赖性、静态特征交互、全局上下文和动态特征重要性。在真实钻井数据集上的评估显示，我们的模型优于基线模型（单独的 LSTM、TS-Mixer 及更简单的混合模型），R² 达到 0.9988，平均绝对百分比误差（MAPE）为 1.447%，这些均通过标准回归指标（R²、MAE、RMSE、MAPE）进行衡量。模型可解释性通过 SHAP 和 LIME 得以保证，同时实际与预测曲线及偏差检验也确认了在不同场景下的准确性和公平性。这种先进的混合方法实现了可靠的实时钻速（ROP）预测，为具有显著运营影响的智能、低成本钻井优化系统铺平了道路。

Subjects: Machine Learning, Artificial Intelligence, Systems and Control 主题：机器学习、人工智能、系统与控制

Publish: 2025-08-07 09:45:56 UTC 发表：2025-08-07 09:45:56 UTC

#88 SpectroStream: A Versatile Neural Codec for General Audio #88 SpectroStream：用于通用音频的多功能神经编解码器

Authors: [Yunpeng Li](https://arxiv.org/search/?searchtype=author&query=Yunpeng Li), [Kehang Han](https://arxiv.org/search/?searchtype=author&query=Kehang Han), [Brian McWilliams](https://arxiv.org/search/?searchtype=author&query=Brian McWilliams), [Zalan Borsos](https://arxiv.org/search/?searchtype=author&query=Zalan Borsos), [Marco Tagliasacchi](https://arxiv.org/search/?searchtype=author&query=Marco Tagliasacchi) 作者：Yunpeng Li, Kehang Han, Brian McWilliams, Zalan Borsos, Marco Tagliasacchi

We propose SpectroStream, a full-band multi-channel neural audio codec. Successor to the well-established SoundStream, SpectroStream extends its capability beyond 24 kHz monophonic audio and enables high-quality reconstruction of 48 kHz stereo music at bit rates of 4–16 kbps. This is accomplished with a new neural architecture that leverages audio representation in the time-frequency domain, which leads to better audio quality especially at higher sample rate. The model also uses a delayed-fusion strategy to handle multi-channel audio, which is crucial in balancing per-channel acoustic quality and cross-channel phase consistency. 我们提出了 SpectroStream，一种全频带多通道神经音频编解码器。作为成熟的 SoundStream 的后继者，SpectroStream 将其能力扩展到 24 kHz 单声道音频之外，能够在 4–16 kbps 的比特率下高质量重建 48 kHz 立体声音乐。通过一种新的神经架构实现该目标，该架构利用时频域的音频表征，从而在更高采样率下获得更好的音频质量。该模型还使用延迟融合策略来处理多通道音频，这在平衡每通道的声学质量和通道间相位一致性方面至关重要。

Subjects: Sound, Artificial Intelligence, Audio and Speech Processing 主题：声音，人工智能，音频与语音处理

Publish: 2025-08-07 09:44:00 UTC 发布：2025-08-07 09:44:00 UTC

#89 FAITH: A Framework for Assessing Intrinsic Tabular Hallucinations in finance #89 FAITH：用于评估金融领域内在表格幻觉的框架

Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题：机器学习、人工智能、计算与语言

Publish: 2025-08-07 09:37:14 UTC 发布：2025-08-07 09:37:14 UTC

#90 EvoGraph: Hybrid Directed Graph Evolution toward Software 3.0 #90 EvoGraph：面向软件 3.0 的混合有向图演化

Authors: [Igor Costa](https://arxiv.org/search/?searchtype=author&query=Igor Costa), [Christopher Baran](https://arxiv.org/search/?searchtype=author&query=Christopher Baran) 作者：Igor Costa，Christopher Baran

We introduce EvoGraph, a framework that enables software systems to evolve their own source code, build pipelines, documentation, and tickets. EvoGraph represents every artefact in a typed directed graph, applies learned mutation operators driven by specialized small language models (SLMs), and selects survivors with a multi-objective fitness. On three benchmarks, EvoGraph fixes 83% of known security vulnerabilities, translates COBOL to Java with 93% functional equivalence (test verified), and maintains documentation freshness within two minutes. Experiments show a 40% latency reduction and a sevenfold drop in feature lead time compared with strong baselines. We extend our approach to evoGraph, leveraging language-specific SLMs for modernizing .NET, Lisp, CGI, ColdFusion, legacy Python, and C codebases, achieving 82-96% semantic equivalence across languages while reducing computational costs by 90% compared to large language models. EvoGraph’s design responds to empirical failure modes in legacy modernization, such as implicit contracts, performance preservation, and integration evolution. Our results suggest a practical path toward Software 3.0, where systems adapt continuously yet remain under measurable control. 我们介绍了 EvoGraph，一个使软件系统能够自我演化其源代码、构建流水线、文档和工单的框架。EvoGraph 将每个工件表示为带类型的有向图，应用由专门的小型语言模型（SLMs）驱动的学习到的变异操作，并用多目标适应度选择幸存者。在三个基准测试上，EvoGraph 修复了 83% 的已知安全漏洞，将 COBOL 翻译为功能等效（测试验证）的 Java，等效率为 93%，并在两分钟内保持文档的新鲜度。实验表明，与强基线相比，延迟降低了 40%，功能交付前置时间缩短了七倍。我们将方法扩展到 evoGraph，利用特定语言的 SLM 现代化 .NET、Lisp、CGI、ColdFusion、遗留 Python 和 C 代码库，在跨语言实现 82–96% 的语义等效性的同时，将计算成本相比大型语言模型降低了 90%。EvoGraph 的设计响应了遗留现代化中的经验失败模式，例如隐式契约、性能保持和集成演化。我们的研究结果表明了一条通向“软件 3.0”的实用途径，在这条路径上，系统可以持续自我适应，同时仍然处于可衡量的控制之下。

Subjects: Software Engineering, Artificial Intelligence 主题：软件工程、人工智能

Publish: 2025-08-07 09:36:30 UTC 发表时间：2025-08-07 09:36:30 协调世界时（UTC）

#91 Balancing Accuracy and Novelty with Sub-Item Popularity #91 在子项目流行度下平衡准确性与新颖性

Authors: [Chiara Mallamaci](https://arxiv.org/search/?searchtype=author&query=Chiara Mallamaci), [Aleksandr Vladimirovich Petrov](https://arxiv.org/search/?searchtype=author&query=Aleksandr Vladimirovich Petrov), [Alberto Carlo Maria Mancino](https://arxiv.org/search/?searchtype=author&query=Alberto Carlo Maria Mancino), [Vito Walter Anelli](https://arxiv.org/search/?searchtype=author&query=Vito Walter Anelli), [Tommaso Di Noia](https://arxiv.org/search/?searchtype=author&query=Tommaso Di Noia), [Craig Macdonald](https://arxiv.org/search/?searchtype=author&query=Craig Macdonald) 作者：Chiara Mallamaci、Aleksandr Vladimirovich Petrov、Alberto Carlo Maria Mancino、Vito Walter Anelli、Tommaso Di Noia、Craig Macdonald

In the realm of music recommendation, sequential recommenders have shown promise in capturing the dynamic nature of music consumption. A key characteristic of this domain is repetitive listening, where users frequently replay familiar tracks. To capture these repetition patterns, recent research has introduced Personalised Popularity Scores (PPS), which quantify user-specific preferences based on historical frequency. While PPS enhances relevance in recommendation, it often reinforces already-known content, limiting the system’s ability to surface novel or serendipitous items - key elements for fostering long-term user engagement and satisfaction. To address this limitation, we build upon RecJPQ, a Transformer-based framework initially developed to improve scalability in large-item catalogues through sub-item decomposition. We repurpose RecJPQ’s sub-item architecture to model personalised popularity at a finer granularity. This allows us to capture shared repetition patterns across sub-embeddings - latent structures not accessible through item-level popularity alone. We propose a novel integration of sub-ID-level personalised popularity within the RecJPQ framework, enabling explicit control over the trade-off between accuracy and personalised novelty. Our sub-ID-level PPS method (sPPS) consistently outperforms item-level PPS by achieving significantly higher personalised novelty without compromising recommendation accuracy. Code and experiments are publicly available at https://github.com/sisinflab/Sub-id-Popularity. 在音乐推荐领域，序列推荐器在捕捉音乐消费的动态性方面展现出潜力。该领域的一个关键特征是重复聆听，用户经常反复播放熟悉的曲目。为捕捉这些重复模式，近期研究引入了个性化流行度评分（Personalised Popularity Scores，PPS），基于历史播放频率量化用户的特定偏好。尽管 PPS 提升了推荐的相关性，但它常常强化用户已知的内容，限制了系统发掘新颖或偶然发现项目的能力——而这些恰恰是促进用户长期参与和满意度的关键要素。为了解决这一局限，我们基于 RecJPQ 进行扩展，RecJPQ 是一个基于 Transformer 的框架，最初用于通过子项分解提升大规模商品目录的可扩展性。我们将 RecJPQ 的子项架构重新用于在更细粒度上建模个性化流行度。这使我们能够捕捉子嵌入之间的共享重复模式——这些潜在结构仅靠项级别的流行度无法获得。我们提出了一种在 RecJPQ 框架内将子 ID 级别的个性化流行度整合的新方法，从而能够明确控制准确性与个性化新颖性之间的权衡。我们的子 ID 级别 PPS 方法（sPPS）在不牺牲推荐准确性的前提下，通过实现显著更高的个性化新颖性，始终优于基于条目的 PPS。代码和实验公开可在 https://github.com/sisinflab/Sub-id-Popularity 获取。

Subjects: Information Retrieval, Artificial Intelligence 主题：信息检索，人工智能

Publish: 2025-08-07 09:33:32 UTC 发表时间：2025-08-07 09:33:32 UTC

#92 Incident Response Planning Using a Lightweight Large Language Model with Reduced Hallucination #92 使用减少幻觉的轻量级大型语言模型进行事件响应规划

Authors: [Kim Hammar](https://arxiv.org/search/?searchtype=author&query=Kim Hammar), [Tansu Alpcan](https://arxiv.org/search/?searchtype=author&query=Tansu Alpcan), [Emil C. Lupu](https://arxiv.org/search/?searchtype=author&query=Emil C. Lupu) 作者：Kim Hammar、Tansu Alpcan、Emil C. Lupu

Timely and effective incident response is key to managing the growing frequency of cyberattacks. However, identifying the right response actions for complex systems is a major technical challenge. A promising approach to mitigate this challenge is to use the security knowledge embedded in large language models (LLMs) to assist security operators during incident handling. Recent research has demonstrated the potential of this approach, but current methods are mainly based on prompt engineering of frontier LLMs, which is costly and prone to hallucinations. We address these limitations by presenting a novel way to use an LLM for incident response planning with reduced hallucination. Our method includes three steps: fine-tuning, information retrieval, and lookahead planning. We prove that our method generates response plans with a bounded probability of hallucination and that this probability can be made arbitrarily small at the expense of increased planning time under certain assumptions. Moreover, we show that our method is lightweight and can run on commodity hardware. We evaluate our method on logs from incidents reported in the literature. The experimental results show that our method a) achieves up to 22% shorter recovery times than frontier LLMs and b) generalizes to a broad range of incident types and response actions. 及时且有效的事件响应是应对日益增多的网络攻击的关键。然而，对于复杂系统来说，确定恰当的响应措施是一大技术挑战。一种有前景的方法是利用大型语言模型（LLMs）中蕴含的安全知识，在事件处理期间辅助安全操作人员。最近的研究已经展示了这种方法的潜力，但当前的方法主要依赖于对前沿 LLMs 的提示工程，这既昂贵又容易出现幻觉。我们通过提出一种在减少幻觉的情况下使用 LLM 进行事件响应规划的新方法来解决这些限制。我们的方法包括三个步骤：微调、信息检索和前瞻性规划。我们证明了在某些假设下，我们的方法生成的响应计划具有有界的幻觉概率，并且该概率可以通过牺牲增加规划时间而任意减小。此外，我们还表明我们的方法轻量且可以在普通硬件上运行。我们在文献中报告的事件日志上对我们的方法进行了评估。实验结果表明，我们的方法 a）相比最前沿的 LLMs 在恢复时间上最多缩短了 22%，并且 b）能够推广到广泛的事件类型和响应措施。

Subjects: Cryptography and Security, Artificial Intelligence 主题：密码学与安全，人工智能

Publish: 2025-08-07 09:23:25 UTC 发布：2025-08-07 09:23:25 UTC

#93 Refining Gaussian Splatting: A Volumetric Densification Approach #93 精炼高斯点描：一种体积致密化方法

Authors: [Mohamed Abdul Gafoor](https://arxiv.org/search/?searchtype=author&query=Mohamed Abdul Gafoor), [Marius Preda](https://arxiv.org/search/?searchtype=author&query=Marius Preda), [Titus Zaharia](https://arxiv.org/search/?searchtype=author&query=Titus Zaharia) 作者：Mohamed Abdul Gafoor、Marius Preda、Titus Zaharia

Achieving high-quality novel view synthesis in 3D Gaussian Splatting (3DGS) often depends on effective point primitive management. The underlying Adaptive Density Control (ADC) process addresses this issue by automating densification and pruning. Yet, the vanilla 3DGS densification strategy shows key shortcomings. To address this issue, in this paper we introduce a novel density control method, which exploits the volumes of inertia associated to each Gaussian function to guide the refinement process. Furthermore, we study the effect of both traditional Structure from Motion (SfM) and Deep Image Matching (DIM) methods for point cloud initialization. Extensive experimental evaluations on the Mip-NeRF 360 dataset demonstrate that our approach surpasses 3DGS in reconstruction quality, delivering encouraging performance across diverse scenes. 在三维高斯点溅（3D Gaussian Splatting，3DGS）中实现高质量的新视角合成通常依赖于有效的点基元管理。底层的自适应密度控制（Adaptive Density Control，ADC）过程通过自动密集化和修剪来解决这一问题。然而，原生的 3DGS 密集化策略表现出关键缺陷。为解决此问题，本文提出了一种新颖的密度控制方法，该方法利用与每个高斯函数关联的惯性体积来指导细化过程。此外，我们研究了传统的运动结构重建（Structure from Motion，SfM）与深度图像匹配（Deep Image Matching，DIM）方法对点云初始化的影响。在 Mip-NeRF 360 数据集上的大量实验评估表明，我们的方法在重建质量上超越了 3DGS，在各种场景中都表现出令人鼓舞的性能。

Subjects: Graphics, Artificial Intelligence, Computer Vision and Pattern Recognition 主题：图形学、人工智能、计算机视觉与模式识别

Publish: 2025-08-07 09:23:17 UTC 发布：2025-08-07 09:23:17 UTC

#94 Posterior-GRPO: Rewarding Reasoning Processes in Code Generation #94 Posterior-GRPO：在代码生成中奖励推理过程

Reinforcement learning (RL) has significantly advanced code generation for large language models (LLMs). However, current paradigms rely on outcome-based rewards from test cases, neglecting the quality of the intermediate reasoning process. While supervising the reasoning process directly is a promising direction, it is highly susceptible to reward hacking, where the policy model learns to exploit the reasoning reward signal without improving final outcomes. To address this, we introduce a unified framework that can effectively incorporate the quality of the reasoning process during RL. First, to enable reasoning evaluation, we develop LCB-RB, a benchmark comprising preference pairs of superior and inferior reasoning processes. Second, to accurately score reasoning quality, we introduce an Optimized-Degraded based (OD-based) method for reward model training. This method generates high-quality preference pairs by systematically optimizing and degrading initial reasoning paths along curated dimensions of reasoning quality, such as factual accuracy, logical rigor, and coherence. A 7B parameter reward model with this method achieves state-of-the-art (SOTA) performance on LCB-RB and generalizes well to other benchmarks. Finally, we introduce Posterior-GRPO (P-GRPO), a novel RL method that conditions process-based rewards on task success. By selectively applying rewards to the reasoning processes of only successful outcomes, P-GRPO effectively mitigates reward hacking and aligns the model’s internal reasoning with final code correctness. A 7B parameter model with P-GRPO achieves superior performance across diverse code generation tasks, outperforming outcome-only baselines by 4.5%, achieving comparable performance to GPT-4-Turbo. We further demonstrate the generalizability of our approach by extending it to mathematical tasks. Our models, dataset, and code are publicly available. 强化学习（RL）在为大型语言模型（LLMs）生成代码方面取得了显著进展。然而，当前范式依赖于来自测试用例的基于结果的奖励，忽视了中间推理过程的质量。虽然直接监督推理过程是一个有前景的方向，但它非常容易受到奖励欺骗的影响，即策略模型学会利用推理奖励信号而不改善最终结果。为了解决这一问题，我们提出了一个统一框架，能够在强化学习过程中有效地纳入推理过程的质量。首先，为了实现对推理的评估，我们开发了 LCB-RB，这是一个由优劣推理过程偏好对组成的基准。其次，为了准确评分推理质量，我们引入了一种基于优化-退化（OD-based）的方法来训练奖励模型。该方法通过沿着精心挑选的推理质量维度（例如事实准确性、逻辑严密性和连贯性）系统地对初始推理路径进行优化和退化，生成高质量的偏好对。使用该方法的 7B 参数奖励模型在 LCB-RB 上达到了最先进（SOTA）的性能，并且能很好地泛化到其他基准。最后，我们提出了 Posterior-GRPO（P-GRPO），一种新颖的强化学习方法，它将基于过程的奖励以任务成功为条件。通过有选择地仅将奖励应用于成功结果的推理过程，P-GRPO 有效地缓解了奖励劫持，并将模型的内部推理与最终代码正确性对齐。采用 P-GRPO 的 7B 参数模型在各种代码生成任务上表现优越，较仅基于结果的基线模型提升了 4.5%，并达到了与 GPT-4-Turbo 可比的性能。我们还通过将方法扩展到数学任务来展示其可泛化性。我们的模型、数据集和代码均已公开可用。

Subjects: Software Engineering, Artificial Intelligence, Computation and Language, Machine Learning 主题：软件工程、人工智能、计算与语言、机器学习

Publish: 2025-08-07 09:04:10 UTC 发布时间：2025-08-07 09:04:10 UTC

#95 Aligning LLMs on a Budget: Inference-Time Alignment with Heuristic Reward Models #95 在预算内对齐 LLMs：使用启发式奖励模型的推理时对齐 [PDF 4 ] [Copy] [Kimi 1 ] [REL]

Aligning LLMs with user preferences is crucial for real-world use but often requires costly fine-tuning or expensive inference, forcing trade-offs between alignment quality and computational cost. Existing inference-time methods typically ignore this balance, focusing solely on the optimized policy’s performance. We propose HIA (Heuristic-Guided Inference-time Alignment), a tuning-free, black-box-compatible approach that uses a lightweight prompt optimizer, heuristic reward models, and two-stage filtering to reduce inference calls while preserving alignment quality. On real-world prompt datasets, HelpSteer and ComPRed, HIA outperforms best-of-N sampling, beam search, and greedy search baselines in multi-objective, goal-conditioned tasks under the same inference budget. We also find that HIA is effective under low-inference budgets with as little as one or two response queries, offering a practical solution for scalable, personalized LLM deployment. 将 LLMs 与用户偏好对齐对于实际应用至关重要，但通常需要代价高昂的微调或昂贵的推理，从而在对齐质量和计算成本之间被迫权衡。现有的推理时方法通常忽略这种平衡，仅关注被优化策略的性能。我们提出了 HIA（启发式引导的推理时对齐），这是一种无需调优、与黑盒兼容的方法，使用轻量级提示优化器、启发式奖励模型和两阶段过滤来减少推理调用，同时保留对齐质量。在真实提示数据集 HelpSteer 和 ComPRed 上，HIA 在相同的推理预算下，在多目标、条件化目标任务中优于 best-of-N 采样、束搜索和贪婪搜索基线。我们还发现 HIA 在低推理预算下也有效，仅需一到两次响应查询，提供了可扩展的个性化 LLM 部署的实用解决方案。

Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题：机器学习、人工智能、计算与语言

Publish: 2025-08-07 08:54:27 UTC 发布时间：2025-08-07 08:54:27 协调世界时（UTC）

#96 Domain-driven Metrics for Reinforcement Learning: A Case Study on Epidemic Control using Agent-based Simulation #96 基于领域的强化学习评估指标：使用基于主体的模拟进行流行病控制的案例研究

Authors: [Rishabh Gaur](https://arxiv.org/search/?searchtype=author&query=Rishabh Gaur), [Gaurav Deshkar](https://arxiv.org/search/?searchtype=author&query=Gaurav Deshkar), [Jayanta Kshirsagar](https://arxiv.org/search/?searchtype=author&query=Jayanta Kshirsagar), [Harshal Hayatnagarkar](https://arxiv.org/search/?searchtype=author&query=Harshal Hayatnagarkar), [Janani Venugopalan](https://arxiv.org/search/?searchtype=author&query=Janani Venugopalan) 作者：Rishabh Gaur，Gaurav Deshkar，Jayanta Kshirsagar，Harshal Hayatnagarkar，Janani Venugopalan

For the development and optimization of agent-based models (ABMs) and rational agent-based models (RABMs), optimization algorithms such as reinforcement learning are extensively used. However, assessing the performance of RL-based ABMs and RABMS models is challenging due to the complexity and stochasticity of the modeled systems, and the lack of well-standardized metrics for comparing RL algorithms. In this study, we are developing domain-driven metrics for RL, while building on state-of-the-art metrics. We demonstrate our ``Domain-driven-RL-metrics’’ using policy optimization on a rational ABM disease modeling case study to model masking behavior, vaccination, and lockdown in a pandemic. Our results show the use of domain-driven rewards in conjunction with traditional and state-of-the-art metrics for a few different simulation scenarios such as the differential availability of masks. 在基于主体的模型（ABMs）和理性基于主体的模型（RABMs）的开发与优化中，强化学习等优化算法被广泛使用。然而，由于被建模系统的复杂性和随机性以及缺乏用于比较强化学习算法的良好标准化指标，评估基于 RL 的 ABMs 和 RABMs 模型的性能具有挑战性。在本研究中，我们在构建最先进指标的基础上，为 RL 开发了领域驱动的指标。我们通过在一个理性 ABM 疾病建模案例中对策略优化进行演示该“领域驱动-RL-指标”，以对大流行中的戴口罩行为、疫苗接种和封锁进行建模。我们的结果展示了在几种不同仿真场景（例如口罩供应差异）下，将领域驱动的奖励与传统和最先进指标结合使用的情况。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-07 08:40:19 UTC 发布：2025-08-07 08:40:19 UTC

#97 FCBV-Net: Category-Level Robotic Garment Smoothing via Feature-Conditioned Bimanual Value Prediction #97 FCBV-Net：通过特征条件化的双手价值预测实现的类别级机器人服装平整

Authors: [Mohammed Daba](https://arxiv.org/search/?searchtype=author&query=Mohammed Daba), [Jing Qiu](https://arxiv.org/search/?searchtype=author&query=Jing Qiu) 作者：Mohammed Daba，Jing Qiu

Category-level generalization for robotic garment manipulation, such as bimanual smoothing, remains a significant hurdle due to high dimensionality, complex dynamics, and intra-category variations. Current approaches often struggle, either overfitting with concurrently learned visual features for a specific instance or, despite category-level perceptual generalization, failing to predict the value of synergistic bimanual actions. We propose the Feature-Conditioned Bimanual Value Network (FCBV-Net), operating on 3D point clouds to specifically enhance category-level policy generalization for garment smoothing. FCBV-Net conditions bimanual action value prediction on pre-trained, frozen dense geometric features, ensuring robustness to intra-category garment variations. Trainable downstream components then learn a task-specific policy using these static features. In simulated GarmentLab experiments with the CLOTH3D dataset, FCBV-Net demonstrated superior category-level generalization. It exhibited only an 11.5% efficiency drop (Steps80) on unseen garments compared to 96.2% for a 2D image-based baseline, and achieved 89% final coverage, outperforming an 83% coverage from a 3D correspondence-based baseline that uses identical per-point geometric features but a fixed primitive. These results highlight that the decoupling of geometric understanding from bimanual action value learning enables better category-level generalization. 针对机器人衣物操作（如双臂抚平）的类别级泛化仍然是一个重大难题，原因在于高维度、复杂动力学以及类别内部的差异性。现有方法常常面临困境：要么在为特定实例同时学习视觉特征时发生过拟合，要么尽管在类别级感知上实现了泛化，却无法预测协同双臂动作的价值。我们提出了特征条件化双臂价值网络（FCBV-Net），在 3D 点云上运行，专门增强用于衣物抚平的类别级策略泛化。FCBV-Net 将双臂动作价值预测条件化在预训练、冻结的密集几何特征上，从而对类别内衣物的变化具有鲁棒性。可训练的下游组件则使用这些静态特征学习任务特定的策略。在使用 CLOTH3D 数据集的模拟 GarmentLab 实验中，FCBV-Net 展现了优越的类别级泛化能力。与基于二维图像的基线在未见过的服装上相比，仅表现出 11.5%的效率下降（Steps80），而后者下降了 96.2%，并且达到了 89%的最终覆盖率，优于使用相同逐点几何特征但固定基元的基于三维对应关系的基线所达到的 83%覆盖率。这些结果突出了将几何理解与双手动作价值学习解耦能够实现更好的类别级泛化能力。

Subjects: Robotics, Artificial Intelligence 主题：机器人学，人工智能

Publish: 2025-08-07 08:37:45 UTC 发布：2025-08-07 08:37:45 UTC

#98 Tool Graph Retriever: Exploring Dependency Graph-based Tool Retrieval for Large Language Models

Authors: [Linfeng Gao](https://arxiv.org/search/?searchtype=author&query=Linfeng Gao), [Yaoxiang Wang](https://arxiv.org/search/?searchtype=author&query=Yaoxiang Wang), [Minlong Peng](https://arxiv.org/search/?searchtype=author&query=Minlong Peng), [Jialong Tang](https://arxiv.org/search/?searchtype=author&query=Jialong Tang), [Yuzhe Shang](https://arxiv.org/search/?searchtype=author&query=Yuzhe Shang), [Mingming Sun](https://arxiv.org/search/?searchtype=author&query=Mingming Sun), [Jinsong Su](https://arxiv.org/search/?searchtype=author&query=Jinsong Su) 作者：高琳峰，王耀翔，彭民龙，汤佳龙，尚昱哲，孙明明，苏金松

With the remarkable advancement of AI agents, the number of their equipped tools is increasing rapidly. However, integrating all tool information into the limited model context becomes impractical, highlighting the need for efficient tool retrieval methods. In this regard, dominant methods primarily rely on semantic similarities between tool descriptions and user queries to retrieve relevant tools. However, they often consider each tool independently, overlooking dependencies between tools, which may lead to the omission of prerequisite tools for successful task execution. To deal with this defect, in this paper, we propose Tool Graph Retriever (TGR), which exploits the dependencies among tools to learn better tool representations for retrieval. First, we construct a dataset termed TDI300K to train a discriminator for identifying tool dependencies. Then, we represent all candidate tools as a tool dependency graph and use graph convolution to integrate the dependencies into their representations. Finally, these updated tool representations are employed for online retrieval. Experimental results on several commonly used datasets show that our TGR can bring a performance improvement to existing dominant methods, achieving SOTA performance. Moreover, in-depth analyses also verify the importance of tool dependencies and the effectiveness of our TGR. 随着 AI 代理的显著进步，它们所配备的工具数量正在迅速增加。然而，将所有工具信息整合到有限的模型上下文中变得不切实际，这突显了高效工具检索方法的必要性。在这方面，主流方法主要依赖工具描述与用户查询之间的语义相似性来检索相关工具。但它们通常将每个工具独立考虑，忽视了工具之间的依赖关系，这可能导致遗漏成功执行任务所需的前置工具。为了解决这一缺陷，本文提出了工具图检索器（Tool Graph Retriever，TGR），利用工具之间的依赖关系来学习更好的工具表示以用于检索。首先，我们构建了一个称为 TDI300K 的数据集来训练用于识别工具依赖关系的判别器。然后，我们将所有候选工具表示为一个工具依赖图，并使用图卷积将依赖关系整合到它们的表示中。最后，这些更新后的工具表示被用于在线检索。在若干常用数据集上的实验结果表明，我们的 TGR 能为现有主流方法带来性能提升，达到 SOTA 性能。此外，深入分析也验证了工具依赖关系的重要性以及我们 TGR 的有效性。

Subjects: Information Retrieval, Artificial Intelligence 主题：信息检索，人工智能

Publish: 2025-08-07 08:36:26 UTC 发布：2025-08-07 08:36:26 UTC

#99 Speech LLMs in Low-Resource Scenarios: Data Volume Requirements and the Impact of Pretraining on High-Resource Languages #99 在低资源场景中的语音 LLMs：数据量需求及在高资源语言上预训练的影响

Large language models (LLMs) have demonstrated potential in handling spoken inputs for high-resource languages, reaching state-of-the-art performance in various tasks. However, their applicability is still less explored in low-resource settings. This work investigates the use of Speech LLMs for low-resource Automatic Speech Recognition using the SLAM-ASR framework, where a trainable lightweight projector connects a speech encoder and a LLM. Firstly, we assess training data volume requirements to match Whisper-only performance, re-emphasizing the challenges of limited data. Secondly, we show that leveraging mono- or multilingual projectors pretrained on high-resource languages reduces the impact of data scarcity, especially with small training sets. Using multilingual LLMs (EuroLLM, Salamandra) with whisper-large-v3-turbo, we evaluate performance on several public benchmarks, providing insights for future research on optimizing Speech LLMs for low-resource languages and multilinguality. 大型语言模型（LLMs）已在高资源语言的语音输入处理中展示出潜力，在多项任务上达到最先进的性能。然而，它们在低资源环境中的适用性仍然较少被探索。本文研究了在低资源自动语音识别中使用语音 LLMs 的可行性，采用 SLAM-ASR 框架，其中一个可训练的轻量投影器连接语音编码器和 LLM。首先，我们评估了与仅使用 Whisper 的性能相匹配所需的训练数据量，重新强调了数据有限带来的挑战。其次，我们展示了在高资源语言上预训练的单语或多语投影器可以减轻数据稀缺的影响，尤其在小规模训练集上更为显著。通过将多语 LLMs（EuroLLM、Salamandra）与 whisper-large-v3-turbo 结合，我们在若干公共基准上评估了性能，为未来优化面向低资源语言和多语言性的语音 LLMs 的研究提供了见解。

Subjects: Audio and Speech Processing, Artificial Intelligence, Computation and Language 主题：音频与语音处理，人工智能，计算与语言

Publish: 2025-08-07 08:33:42 UTC 发布时间：2025-08-07 08:33:42 UTC

#100 Chemist Eye: A Visual Language Model-Powered System for Safety Monitoring and Robot Decision-Making in Self-Driving Laboratories #100 Chemist Eye：一种由视觉语言模型驱动的自驱动实验室安全监控与机器人决策系统

Authors: [Francisco Munguia-Galeano](https://arxiv.org/search/?searchtype=author&query=Francisco Munguia-Galeano), [Zhengxue Zhou](https://arxiv.org/search/?searchtype=author&query=Zhengxue Zhou), [Satheeshkumar Veeramani](https://arxiv.org/search/?searchtype=author&query=Satheeshkumar Veeramani), [Hatem Fakhruldeen](https://arxiv.org/search/?searchtype=author&query=Hatem Fakhruldeen), [Louis Longley](https://arxiv.org/search/?searchtype=author&query=Louis Longley), [Rob Clowes](https://arxiv.org/search/?searchtype=author&query=Rob Clowes), [Andrew I. Cooper](https://arxiv.org/search/?searchtype=author&query=Andrew I. Cooper) 作者：Francisco Munguia-Galeano、Zhengxue Zhou、Satheeshkumar Veeramani、Hatem Fakhruldeen、Louis Longley、Rob Clowes、Andrew I. Cooper

The integration of robotics and automation into self-driving laboratories (SDLs) can introduce additional safety complexities, in addition to those that already apply to conventional research laboratories. Personal protective equipment (PPE) is an essential requirement for ensuring the safety and well-being of workers in laboratories, self-driving or otherwise. Fires are another important risk factor in chemical laboratories. In SDLs, fires that occur close to mobile robots, which use flammable lithium batteries, could have increased severity. Here, we present Chemist Eye, a distributed safety monitoring system designed to enhance situational awareness in SDLs. The system integrates multiple stations equipped with RGB, depth, and infrared cameras, designed to monitor incidents in SDLs. Chemist Eye is also designed to spot workers who have suffered a potential accident or medical emergency, PPE compliance and fire hazards. To do this, Chemist Eye uses decision-making driven by a vision-language model (VLM). Chemist Eye is designed for seamless integration, enabling real-time communication with robots. Based on the VLM recommendations, the system attempts to drive mobile robots away from potential fire locations, exits, or individuals not wearing PPE, and issues audible warnings where necessary. It also integrates with third-party messaging platforms to provide instant notifications to lab personnel. We tested Chemist Eye with real-world data from an SDL equipped with three mobile robots and found that the spotting of possible safety hazards and decision-making performances reached 97 % and 95 %, respectively. 将机器人和自动化技术整合到自驱动实验室（SDL）中，除了传统研究实验室中已有的那些安全复杂性外，还可能引入额外的安全问题。个人防护装备（PPE）是确保实验室工作人员安全与健康的基本要求，无论是否为自驱动实验室。火灾是化学实验室中的另一个重要风险因素。在 SDL 中，发生在使用易燃锂电池的移动机器人附近的火灾，可能会具有更高的严重性。在此，我们提出了 Chemist Eye，一种分布式安全监控系统，旨在增强 SDL 中的态势感知。该系统集成了多个配备 RGB、深度和红外摄像头的站点，旨在监控 SDL 中的事故。Chemist Eye 还旨在识别可能遭遇事故或医疗急症的工作人员、PPE 合规性以及火灾隐患。为此，Chemist Eye 使用由视觉-语言模型（VLM）驱动的决策机制。Chemist Eye 设计为可无缝集成，能够与机器人实现实时通信。基于 VLM 的建议，系统试图将移动机器人从潜在的火源位置、出口或未佩戴个人防护装备的人员处驱离，并在必要时发出可听警告。它还与第三方消息平台集成，为实验室人员提供即时通知。我们在配备三台移动机器人的自驱动实验室（SDL）中使用真实数据测试了 Chemist Eye，发现潜在安全隐患的发现率和决策性能分别达到了 97%和 95%。

Subjects: Robotics, Artificial Intelligence 主题：机器人学，人工智能

Publish: 2025-08-07 08:31:42 UTC 发布：2025-08-07 08:31:42 UTC

Authors: [Sachin Dudda Nagaraju](https://arxiv.org/search/?searchtype=author&query=Sachin Dudda Nagaraju), [Ashkan Moradi](https://arxiv.org/search/?searchtype=author&query=Ashkan Moradi), [Bendik Skarre Abrahamsen](https://arxiv.org/search/?searchtype=author&query=Bendik Skarre Abrahamsen), [Mattijs Elschot](https://arxiv.org/search/?searchtype=author&query=Mattijs Elschot) 作者：Sachin Dudda Nagaraju、Ashkan Moradi、Bendik Skarre Abrahamsen、Mattijs Elschot

Medical image segmentation plays a crucial role in AI-assisted diagnostics, surgical planning, and treatment monitoring. Accurate and robust segmentation models are essential for enabling reliable, data-driven clinical decision making across diverse imaging modalities. Given the inherent variability in image characteristics across modalities, developing a unified model capable of generalizing effectively to multiple modalities would be highly beneficial. This model could streamline clinical workflows and reduce the need for modality-specific training. However, real-world deployment faces major challenges, including data scarcity, domain shift between modalities (e.g., CT vs. MRI), and privacy restrictions that prevent data sharing. To address these issues, we propose FedGIN, a Federated Learning (FL) framework that enables multimodal organ segmentation without sharing raw patient data. Our method integrates a lightweight Global Intensity Non-linear (GIN) augmentation module that harmonizes modality-specific intensity distributions during local training. We evaluated FedGIN using two types of datasets: an imputed dataset and a complete dataset. In the limited dataset scenario, the model was initially trained using only MRI data, and CT data was added to assess its performance improvements. In the complete dataset scenario, both MRI and CT data were fully utilized for training on all clients. In the limited-data scenario, FedGIN achieved a 12 to 18% improvement in 3D Dice scores on MRI test cases compared to FL without GIN and consistently outperformed local baselines. In the complete dataset scenario, FedGIN demonstrated near-centralized performance, with a 30% Dice score improvement over the MRI-only baseline and a 10% improvement over the CT-only baseline, highlighting its strong cross-modality generalization under privacy constraints. 医学影像分割在人工智能辅助诊断、手术规划和疗效监测中起着关键作用。准确且鲁棒的分割模型对于在多种成像模态下实现可靠的、以数据为驱动的临床决策至关重要。鉴于不同模态间图像特征的固有差异，开发一款能够有效泛化至多种模态的统一模型将具有重大意义。该模型可以简化临床工作流程并减少对特定模态训练的需求。然而，现实世界的部署面临重大挑战，包括数据稀缺、模态间的域移（例如 CT 与 MRI）以及阻止数据共享的隐私限制。为了解决这些问题，我们提出了 FedGIN，一种允许在不共享原始患者数据的情况下进行多模态器官分割的联邦学习（FL）框架。我们的方法在本地训练过程中整合了一个轻量级的全局强度非线性（Global Intensity Non-linear，GIN）增强模块，用于协调特定模态的强度分布。我们使用两类数据集评估了 FedGIN：一个填充（imputed）数据集和一个完整数据集。在有限数据情形下，模型最初仅使用 MRI 数据进行训练，随后加入 CT 数据以评估其性能提升。在完整数据情形下，MRI 和 CT 数据在所有客户端上被充分利用进行训练。在有限数据情形中，与不使用 GIN 的联邦学习相比，FedGIN 在 MRI 测试样本上的 3D Dice 得分提升了 12% 到 18%，并且持续优于本地基线。在完整数据情形中，FedGIN 展现出接近集中式的性能，相较仅使用 MRI 的基线提升了 30% 的 Dice 得分，相较仅使用 CT 的基线提升了 10%，突显了其在隐私约束下强大的跨模态泛化能力。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 学科：计算机视觉与模式识别，人工智能

Publish: 2025-08-07 08:16:35 UTC 发表：2025-08-07 08:16:35 世界标准时间（UTC）

#102 Towards Assessing Medical Ethics from Knowledge to Practice #102 从知识到实践：评估医学伦理的进展

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-07 08:10:14 UTC 发布时间：2025-08-07 08:10:14 UTC

#103 Attention Basin: Why Contextual Position Matters in Large Language Models #103 Attention Basin：为什么上下文位置在大型语言模型中很重要

The performance of Large Language Models (LLMs) is significantly sensitive to the contextual position of information in the input. To investigate the mechanism behind this positional bias, our extensive experiments reveal a consistent phenomenon we term the attention basin: when presented with a sequence of structured items (e.g., retrieved documents or few-shot examples), models systematically assign higher attention to the items at the beginning and end of the sequence, while neglecting those in the middle. Crucially, our analysis further reveals that allocating higher attention to critical information is key to enhancing model performance. Based on these insights, we introduce Attention-Driven Reranking (AttnRank), a two-stage framework that (i) estimates a model’s intrinsic positional attention preferences using a small calibration set, and (ii) reorders retrieved documents or few-shot examples to align the most salient content with these high-attention positions. AttnRank is a model-agnostic, training-free, and plug-and-play method with minimal computational overhead. Experiments on multi-hop QA and few-shot in-context learning tasks demonstrate that AttnRank achieves substantial improvements across 10 large language models of varying architectures and scales, without modifying model parameters or training procedures. 大型语言模型（LLMs）的性能对输入中信息的上下文位置高度敏感。为探究这种位置偏差背后的机制，我们的大量实验揭示了一个一致的现象，我们称之为注意力洼地：当呈现一系列结构化项目（例如检索到的文档或少样本示例）时，模型会系统性地对序列开头和结尾的项目赋予更高的注意力，而忽视序列中间的项目。关键是，我们的分析进一步表明，将更高的注意力分配给关键信息是提升模型性能的关键。基于这些洞见，我们提出了注意力驱动重排序（AttnRank），这是一种两阶段框架，(i) 使用一小部分校准集估计模型固有的位置注意力偏好，(ii) 重新排序检索到的文档或少样本示例，使最突出的内容与这些高注意力位置对齐。AttnRank 是一种与模型无关、无需训练且即插即用的方法，计算开销极小。在多跳问答和少样本上下文学习任务上的实验表明，AttnRank 在不修改模型参数或训练过程的情况下，在 10 种不同架构和规模的大型语言模型上都取得了显著改进。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-07 08:08:08 UTC 发布：2025-08-07 08:08:08 UTC

#104 Latent Expression Generation for Referring Image Segmentation and Grounding #104 用于指代表达图像分割与定位的潜在表达生成

Authors: [Seonghoon Yu](https://arxiv.org/search/?searchtype=author&query=Seonghoon Yu), [Joonbeom Hong](https://arxiv.org/search/?searchtype=author&query=Joonbeom Hong), [Joonseok Lee](https://arxiv.org/search/?searchtype=author&query=Joonseok Lee), [Jeany Son](https://arxiv.org/search/?searchtype=author&query=Jeany Son) 作者：Seonghoon Yu、Joonbeom Hong、Joonseok Lee、Jeany Son

Visual grounding tasks, such as referring image segmentation (RIS) and referring expression comprehension (REC), aim to localize a target object based on a given textual description. The target object in an image can be described in multiple ways, reflecting diverse attributes such as color, position, and more. However, most existing methods rely on a single textual input, which captures only a fraction of the rich information available in the visual domain. This mismatch between rich visual details and sparse textual cues can lead to the misidentification of similar objects. To address this, we propose a novel visual grounding framework that leverages multiple latent expressions generated from a single textual input by incorporating complementary visual details absent from the original description. Specifically, we introduce subject distributor and visual concept injector modules to embed both shared-subject and distinct-attributes concepts into the latent representations, thereby capturing unique and target-specific visual cues. We also propose a positive-margin contrastive learning strategy to align all latent expressions with the original text while preserving subtle variations. Experimental results show that our method not only outperforms state-of-the-art RIS and REC approaches on multiple benchmarks but also achieves outstanding performance on the generalized referring expression segmentation (GRES) benchmark. 视觉定位任务，如指称图像分割（RIS）和指称表达理解（REC），旨在根据给定的文本描述定位目标对象。图像中的目标对象可以通过多种方式描述，反映出诸如颜色、位置等多样的属性。然而，大多数现有方法依赖单一文本输入，这只能捕捉视觉域中丰富信息的一小部分。这种丰富视觉细节与稀疏文本提示之间的不匹配可能导致对相似对象的误判。为了解决这一问题，我们提出了一个新颖的视觉定位框架，该框架通过从单一文本输入生成多个潜在表达来利用补充的视觉细节，这些细节在原始描述中并未出现。具体而言，我们引入了主体分配器和视觉概念注入器模块，将共享主体概念和不同属性概念嵌入到潜在表示中，从而捕捉独特且针对目标的视觉线索。我们还提出了一种正边距对比学习策略，用于将所有潜在表达与原始文本对齐，同时保留细微差异。实验结果表明，我们的方法不仅在多个基准上优于最先进的 RIS 和 REC 方法，而且在广义指代表达分割（GRES）基准上也取得了出色的表现。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 学科：计算机视觉与模式识别，人工智能

Publish: 2025-08-07 07:57:27 UTC 发表：2025-08-07 07:57:27 UTC

#105 Exploring Superior Function Calls via Reinforcement Learning #105 通过强化学习探索更优的函数调用

Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题：机器学习、人工智能、计算与语言

Publish: 2025-08-07 07:51:38 UTC 发布：2025-08-07 07:51:38 UTC

#106 Fairness in Dysarthric Speech Synthesis: Understanding Intrinsic Bias in Dysarthric Speech Cloning using F5-TTS #106 在构音障碍语音合成中的公平性：使用 F5-TTS 理解构音障碍语音克隆的内在偏差

Authors: [Anuprabha M](https://arxiv.org/search/?searchtype=author&query=Anuprabha M), [Krishna Gurugubelli](https://arxiv.org/search/?searchtype=author&query=Krishna Gurugubelli), [Anil Kumar Vuppala](https://arxiv.org/search/?searchtype=author&query=Anil Kumar Vuppala) 作者：Anuprabha M、Krishna Gurugubelli、Anil Kumar Vuppala

Dysarthric speech poses significant challenges in developing assistive technologies, primarily due to the limited availability of data. Recent advances in neural speech synthesis, especially zero-shot voice cloning, facilitate synthetic speech generation for data augmentation; however, they may introduce biases towards dysarthric speech. In this paper, we investigate the effectiveness of state-of-the-art F5-TTS in cloning dysarthric speech using TORGO dataset, focusing on intelligibility, speaker similarity, and prosody preservation. We also analyze potential biases using fairness metrics like Disparate Impact and Parity Difference to assess disparities across dysarthric severity levels. Results show that F5-TTS exhibits a strong bias toward speech intelligibility over speaker and prosody preservation in dysarthric speech synthesis. Insights from this study can help integrate fairness-aware dysarthric speech synthesis, fostering the advancement of more inclusive speech technologies. 构音障碍语音在开发辅助技术时带来了重大挑战，主要是由于数据的可用性有限。最近在神经语音合成方面的进展，尤其是零样本语音克隆，促进了用于数据增强的合成语音生成；然而，它们可能会引入对构音障碍语音的偏差。本文使用 TORGO 数据集研究了最先进的 F5-TTS 在克隆构音障碍语音方面的有效性，重点关注可懂度、说话人相似性和韵律保留。我们还使用如差异影响（Disparate Impact）和差异度（Parity Difference）等公平性指标分析潜在偏差，以评估不同构音障碍严重程度之间的差异。结果表明，F5-TTS 在构音障碍语音合成中对语音可懂度表现出强烈偏向，而在说话人和韵律保留方面较弱。本研究的见解可有助于整合具有公平性意识的构音障碍语音合成，推动更具包容性的语音技术发展。

Subjects: Audio and Speech Processing, Artificial Intelligence 主题：音频与语音处理，人工智能

Publish: 2025-08-07 07:39:48 UTC 发表：2025-08-07 07:39:48 UTC

#107 Integrated Influence: Data Attribution with Baseline #107 Integrated Influence：基线下的数据归因

Authors: [Linxiao Yang](https://arxiv.org/search/?searchtype=author&query=Linxiao Yang), [Xinyu Gu](https://arxiv.org/search/?searchtype=author&query=Xinyu Gu), [Liang Sun](https://arxiv.org/search/?searchtype=author&query=Liang Sun) 作者：杨林霄、顾新宇、孙亮

As an effective approach to quantify how training samples influence test sample, data attribution is crucial for understanding data and model and further enhance the transparency of machine learning models. We find that prevailing data attribution methods based on leave-one-out (LOO) strategy suffer from the local-based explanation, as these LOO-based methods only perturb a single training sample, and overlook the collective influence in the training set. On the other hand, the lack of baseline in many data attribution methods reduces the flexibility of the explanation, e.g., failing to provide counterfactual explanations. In this paper, we propose Integrated Influence, a novel data attribution method that incorporates a baseline approach. Our method defines a baseline dataset, follows a data degeneration process to transition the current dataset to the baseline, and accumulates the influence of each sample throughout this process. We provide a solid theoretical framework for our method, and further demonstrate that popular methods, such as influence functions, can be viewed as special cases of our approach. Experimental results show that Integrated Influence generates more reliable data attributions compared to existing methods in both data attribution task and mislablled example identification task. 作为一种有效量化训练样本对测试样本影响的方法，数据归因对于理解数据与模型并进一步提高机器学习模型的透明性至关重要。我们发现，现有基于逐一移除（LOO）策略的数据归因方法存在基于局部的解释问题，因为这些基于 LOO 的方法只扰动单个训练样本，忽视了训练集中的集体影响。另一方面，许多数据归因方法缺乏基线的设置，降低了解释的灵活性，例如无法提供反事实解释。在本文中，我们提出了集成影响（Integrated Influence），这是一种引入基线方法的新型数据归因方法。我们的方法定义了一个基线数据集，遵循数据退化过程将当前数据集过渡到基线，并在该过程中累积每个样本的影响。我们为该方法提供了坚实的理论框架，并进一步证明了诸如影响函数等流行方法可以被视为我们方法的特殊情形。实验结果表明，与现有方法相比，Integrated Influence 在数据归因任务和错误标注样本识别任务中都能生成更可靠的数据归因。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-07 07:16:12 UTC 发布：2025-08-07 07:16:12 UTC

#108 JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering #108 JPS：通过协同视觉扰动和文本引导对多模态大语言模型进行越狱

Jailbreak attacks against multimodal large language Models (MLLMs) are a significant research focus. Current research predominantly focuses on maximizing attack success rate (ASR), often overlooking whether the generated responses actually fulfill the attacker’s malicious intent. This oversight frequently leads to low-quality outputs that bypass safety filters but lack substantial harmful content. To address this gap, we propose JPS, \underline{J}ailbreak MLLMs with collaborative visual \underline{P}erturbation and textual \underline{S}teering, which achieves jailbreaks via corporation of visual image and textually steering prompt. Specifically, JPS utilizes target-guided adversarial image perturbations for effective safety bypass, complemented by “steering prompt” optimized via a multi-agent system to specifically guide LLM responses fulfilling the attackers’ intent. These visual and textual components undergo iterative co-optimization for enhanced performance. To evaluate the quality of attack outcomes, we propose the Malicious Intent Fulfillment Rate (MIFR) metric, assessed using a Reasoning-LLM-based evaluator. Our experiments show JPS sets a new state-of-the-art in both ASR and MIFR across various MLLMs and benchmarks, with analyses confirming its efficacy. Codes are available at \href{https://github.com/thu-coai/JPS}{https://github.com/thu-coai/JPS}. \color{warningcolor}{Warning: This paper contains potentially sensitive contents.} 针对多模态大型语言模型（MLLMs）的越狱攻击是一个重要的研究方向。当前研究主要关注最大化攻击成功率（ASR），常常忽视生成的响应是否真正实现了攻击者的恶意意图。这一疏忽常导致绕过安全过滤器但缺乏实质性有害内容的低质量输出。为填补这一空白，我们提出了 JPS，Jailbreak MLLMs with collaborative visual Perturbation and textual Steering（通过协同视觉扰动与文本引导对 MLLMs 进行越狱），该方法通过图像与文本引导提示的协同来实现越狱。具体来说，JPS 利用目标引导的对抗图像扰动来有效绕过安全机制，并辅以通过多智能体系统优化的“引导提示（steering prompt）”，以明确引导 LLM 生成满足攻击者意图的响应。这些视觉和文本组件通过迭代共同优化以提升性能。为评估攻击结果的质量，我们提出了恶意意图实现率（MIFR）指标，并使用基于推理的 LLM 评估器进行评估。我们的实验表明，JPS 在各种多模态大模型（MLLM）和基准测试中，在 ASR 和 MIFR 上均创下了新的最先进水平，且分析确认了其有效性。代码可在 \href{https://github.com/thu-coai/JPS}{https://github.com/thu-coai/JPS} 获取。 \color{warningcolor}{警告：本论文可能包含敏感内容。}

Subjects: Multimedia, Artificial Intelligence, Computation and Language, Cryptography and Security 主题：多媒体、人工智能、计算与语言、密码学与安全

Publish: 2025-08-07 07:14:01 UTC 发布：2025-08-07 07:14:01 UTC

#109 Align, Don't Divide: Revisiting the LoRA Architecture in Multi-Task Learning

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-07 07:02:55 UTC 发布：2025-08-07 07:02:55 UTC

#110 Align-for-Fusion: Harmonizing Triple Preferences via Dual-oriented Diffusion for Cross-domain Sequential Recommendation

Authors: [Yongfu Zha](https://arxiv.org/search/?searchtype=author&query=Yongfu Zha), [Xinxin Dong](https://arxiv.org/search/?searchtype=author&query=Xinxin Dong), [Haokai Ma](https://arxiv.org/search/?searchtype=author&query=Haokai Ma), [Yonghui Yang](https://arxiv.org/search/?searchtype=author&query=Yonghui Yang), [Xiaodong Wang](https://arxiv.org/search/?searchtype=author&query=Xiaodong Wang) 作者：Yongfu Zha, Xinxin Dong, Haokai Ma, Yonghui Yang, Xiaodong Wang

Personalized sequential recommendation aims to predict appropriate items for users based on their behavioral sequences. To alleviate data sparsity and interest drift issues, conventional approaches typically incorporate auxiliary behaviors from other domains via cross-domain transition. However, existing cross-domain sequential recommendation (CDSR) methods often follow an align-then-fusion paradigm that performs representation-level alignment across multiple domains and combines them mechanically for recommendation, overlooking the fine-grained fusion of domain-specific preferences. Inspired by recent advances in diffusion models (DMs) for distribution matching, we propose an align-for-fusion framework for CDSR to harmonize triple preferences via dual-oriented DMs, termed HorizonRec. Specifically, we investigate the uncertainty injection of DMs and identify stochastic noise as a key source of instability in existing DM-based recommenders. To address this, we introduce a mixed-conditioned distribution retrieval strategy that leverages distributions retrieved from users’ authentic behavioral logic as semantic bridges across domains, enabling consistent multi-domain preference modeling. Furthermore, we propose a dual-oriented preference diffusion method to suppress potential noise and emphasize target-relevant interests during multi-domain user representation fusion. Extensive experiments on four CDSR datasets from two distinct platforms demonstrate the effectiveness and robustness of HorizonRec in fine-grained triple-domain preference fusion. 个性化序列推荐旨在根据用户的行为序列预测合适的物品。为缓解数据稀疏和兴趣漂移问题，传统方法通常通过跨域迁移将来自其他域的辅助行为纳入。然而，现有的跨域序列推荐（CDSR）方法往往遵循先对齐后融合的范式，在表示层面对多个域进行对齐并机械地将它们组合用于推荐，忽视了域特定偏好的细粒度融合。受最近用于分布匹配的扩散模型（DMs）进展的启发，我们提出了用于 CDSR 的“为融合而对齐”框架，通过双向扩散模型来协调三重偏好，称为 HorizonRec。具体而言，我们研究了扩散模型中的不确定性注入，并将随机噪声识别为现有基于扩散模型的推荐系统不稳定性的关键来源。为了解决这一问题，我们提出了一种混合条件分布检索策略，利用从用户真实行为逻辑中检索到的分布作为跨域的语义桥梁，从而实现一致的多域偏好建模。此外，我们提出了一种双向偏好扩散方法，在多域用户表示融合过程中抑制潜在噪声并强调与目标相关的兴趣。在来自两个不同平台的四个 CDSR 数据集上进行的大量实验证明了 HorizonRec 在细粒度三域偏好融合方面的有效性和鲁棒性。

Subjects: Information Retrieval, Artificial Intelligence 主题：信息检索，人工智能

Publish: 2025-08-07 07:00:29 UTC 发布时间：2025-08-07 07:00:29 协调世界时（UTC）

#111 Automatic Image Colorization with Convolutional Neural Networks and Generative Adversarial Networks #111 使用卷积神经网络与生成对抗网络的自动图像着色

Authors: [Ruiyu Li](https://arxiv.org/search/?searchtype=author&query=Ruiyu Li), [Changyuan Qiu](https://arxiv.org/search/?searchtype=author&query=Changyuan Qiu), [Hangrui Cao](https://arxiv.org/search/?searchtype=author&query=Hangrui Cao), [Qihan Ren](https://arxiv.org/search/?searchtype=author&query=Qihan Ren), [Yuqing Qiu](https://arxiv.org/search/?searchtype=author&query=Yuqing Qiu) 作者：李睿宇、邱长远、曹航瑞、任启涵、裘玉清

Image colorization, the task of adding colors to grayscale images, has been the focus of significant research efforts in computer vision in recent years for its various application areas such as color restoration and automatic animation colorization [15, 1]. The colorization problem is challenging as it is highly ill-posed with two out of three image dimensions lost, resulting in large degrees of freedom. However, semantics of the scene as well as the surface texture could provide important cues for colors: the sky is typically blue, the clouds are typically white and the grass is typically green, and there are huge amounts of training data available for learning such priors since any colored image could serve as a training data point [20]. Colorization is initially formulated as a regression task[5], which ignores the multi-modal nature of color prediction. In this project, we explore automatic image colorization via classification and adversarial learning. We will build our models on prior works, apply modifications for our specific scenario and make comparisons. 图像着色，即为灰度图像添加颜色的任务，近年来在计算机视觉领域受到大量研究关注，其应用包括颜色恢复和自动动画上色等[15, 1]。着色问题具有挑战性，因为它是高度病态的，三维图像信息中有三分之二丢失，导致存在很大的自由度。然而，场景的语义以及表面纹理可以为颜色提供重要线索：天空通常是蓝色的，云一般是白色的，草地通常是绿色，而且有大量可用于学习此类先验的训练数据可用，因为任何彩色图像都可作为训练样本[20]。着色最初被表述为回归任务[5]，这忽略了颜色预测的多模态特性。在本项目中，我们通过分类和对抗学习来探索自动图像着色。我们将基于先前工作构建模型，针对我们的特定场景进行修改并进行比较。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning, Image and Video Processing 主题：计算机视觉与模式识别、人工智能、机器学习、图像与视频处理

Publish: 2025-08-07 06:41:31 UTC 发布日期：2025-08-07 06:41:31 UTC

#112 Learning from Oblivion: Predicting Knowledge Overflowed Weights via Retrodiction of Forgetting #112 从遗忘中学习：通过对遗忘的反向预测来预测溢出知识的权重

Authors: [Jinhyeok Jang](https://arxiv.org/search/?searchtype=author&query=Jinhyeok Jang), [Jaehong Kim](https://arxiv.org/search/?searchtype=author&query=Jaehong Kim), [Jung Uk Kim](https://arxiv.org/search/?searchtype=author&query=Jung Uk Kim) 作者：Jinhyeok Jang、Jaehong Kim、Jung Uk Kim

Pre-trained weights have become a cornerstone of modern deep learning, enabling efficient knowledge transfer and improving downstream task performance, especially in data-scarce scenarios. However, a fundamental question remains: how can we obtain better pre-trained weights that encapsulate more knowledge beyond the given dataset? In this work, we introduce \textbf{KNowledge Overflowed Weights (KNOW)} prediction, a novel strategy that leverages structured forgetting and its inversion to synthesize knowledge-enriched weights. Our key insight is that sequential fine-tuning on progressively downsized datasets induces a structured forgetting process, which can be modeled and reversed to recover knowledge as if trained on a larger dataset. We construct a dataset of weight transitions governed by this controlled forgetting and employ meta-learning to model weight prediction effectively. Specifically, our \textbf{KNowledge Overflowed Weights Nowcaster (KNOWN)} acts as a hyper-model that learns the general evolution of weights and predicts enhanced weights with improved generalization. Extensive experiments across diverse datasets and architectures demonstrate that KNOW prediction consistently outperforms Naïve fine-tuning and simple weight prediction, leading to superior downstream performance. Our work provides a new perspective on reinterpreting forgetting dynamics to push the limits of knowledge transfer in deep learning. 预训练权重已成为现代深度学习的基石，使得知识高效迁移并提升下游任务性能，尤其在数据稀缺的情形下。然而，一个基本问题仍然存在：如何获得能够封装超出给定数据集更多知识的更好预训练权重？在这项工作中，我们提出了“知识溢出权重（KNowledge Overflowed Weights，KNOW）”预测，这是一种利用结构性遗忘及其反转来合成富含知识权重的新策略。我们的关键洞见是，针对逐步缩减的数据集进行顺序微调会导致一种可被建模并可逆的结构性遗忘过程，从而可以通过反转该过程来恢复出如同在更大数据集上训练所得的知识。我们构建了一个由这种受控遗忘支配的权重迁移数据集，并采用元学习来有效地对权重预测建模。具体而言，我们的“知识溢出权重预报器（KNowledge Overflowed Weights Nowcaster，KNOWN）”作为一个超模型，学习权重的一般演化规律并预测出具有更好泛化能力的增强权重。在多种数据集和架构上的大量实验证明，KNOW 预测始终优于简单微调和简单权重预测，从而带来更优的下游性能。我们的工作为重新解读遗忘动态提供了新的视角，以推动深度学习中知识迁移的极限。

Subjects: Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition 主题：机器学习，人工智能，计算机视觉与模式识别

Publish: 2025-08-07 06:23:07 UTC 发表：2025-08-07 06:23:07 UTC

#113 Human-AI Schema Discovery and Application for Creative Problem Solving #113 人机协同图式发现与应用以支持创造性问题解决

Author: [Sitong Wang](https://arxiv.org/search/?searchtype=author&query=Sitong Wang) 作者：王思彤

Humans often rely on underlying structural patterns-schemas-to create, whether by writing stories, designing software, or composing music. Schemas help organize ideas and guide exploration, but they are often difficult to discover and apply, especially in complex or unfamiliar domains. My Ph.D. research develops a framework for human-AI schema discovery and application to support creative problem solving. I design systems that support users in sensemaking over examples to abstract schemas, and in operationalizing schemas into human-AI co-creative workflows for application. This research offers insights into how schema-guided interaction can make implicit knowledge more accessible and actionable, advancing more transparent and collaborative human-AI systems. 人类在创作时常依赖底层的结构性模式——图式（schemas）——无论是在写故事、设计软件还是作曲。图式有助于组织想法并引导探索，但它们常常难以发现和应用，尤其是在复杂或不熟悉的领域中。我的博士研究开发了一个用于人机协同图式发现与应用的框架，以支持创造性问题解决。我设计了能帮助用户通过对示例进行意义建构来抽象出图式的系统，并将图式落实为可供应用的人机共创工作流。该研究提供了关于图式引导交互如何使隐性知识更易获取和可操作的见解，推动了更透明和协作的人机智能系统的发展。

Subjects: Human-Computer Interaction, Artificial Intelligence 主题：人机交互，人工智能

Publish: 2025-08-07 05:55:52 UTC 发表：2025-08-07 05:55:52 UTC

#114 Evaluation of LLMs in AMR Parsing

Author: [Shu Han Ho](https://arxiv.org/search/?searchtype=author&query=Shu Han Ho) 作者：Shu Han Ho

Meaning Representation (AMR) is a semantic formalism that encodes sentence meaning as rooted, directed, acyclic graphs, where nodes represent concepts and edges denote semantic relations. Finetuning decoder only Large Language Models (LLMs) represent a promising novel straightfoward direction for AMR parsing. This paper presents a comprehensive evaluation of finetuning four distinct LLM architectures, Phi 3.5, Gemma 2, LLaMA 3.2, and DeepSeek R1 LLaMA Distilled using the LDC2020T02 Gold AMR3.0 test set. Our results have shown that straightfoward finetuning of decoder only LLMs can achieve comparable performance to complex State of the Art (SOTA) AMR parsers. Notably, LLaMA 3.2 demonstrates competitive performance against SOTA AMR parsers given a straightforward finetuning approach. We achieved SMATCH F1: 0.804 on the full LDC2020T02 test split, on par with APT + Silver (IBM) at 0.804 and approaching Graphene Smatch (MBSE) at 0.854. Across our analysis, we also observed a consistent pattern where LLaMA 3.2 leads in semantic performance while Phi 3.5 excels in structural validity. 意义表示（AMR）是一种语义形式主义，将句子意义编码为有根、有向、无环图，图中节点代表概念，边表示语义关系。仅微调解码器型 Large Language Models (LLMs) 为 AMR 解析提供了一条有前景且直接的新方向。本文对四种不同的 LLM 架构——Phi 3.5、Gemma 2、LLaMA 3.2 以及 DeepSeek R1 LLaMA Distilled——在 LDC2020T02 Gold AMR3.0 测试集上的微调进行了全面评估。我们的结果显示，对仅解码器 LLM 进行直接微调可以达到与复杂的最先进（SOTA）AMR 解析器相当的性能。值得注意的是，在采用直接微调方法的情况下，LLaMA 3.2 在性能上与 SOTA AMR 解析器具有竞争力。我们在完整的 LDC2020T02 测试集上取得了 SMATCH F1：0.804，与 APT + Silver (IBM) 的 0.804 不相上下，并接近 Graphene Smatch (MBSE) 的 0.854。在我们的分析中，我们还观察到一个一致的模式：LLaMA 3.2 在语义性能上领先，而 Phi 3.5 在结构有效性方面表现出色。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-07 04:43:47 UTC 发布：2025-08-07 04:43:47 协调世界时 (UTC)

#115 Dialogues Aspect-based Sentiment Quadruple Extraction via Structural Entropy Minimization Partitioning #115 对话基于方面的情感四元组抽取通过结构熵最小化划分

Dialogues Aspect-based Sentiment Quadruple Extraction (DiaASQ) aims to extract all target-aspect-opinion-sentiment quadruples from a given multi-round, multi-participant dialogue. Existing methods typically learn word relations across entire dialogues, assuming a uniform distribution of sentiment elements. However, we find that dialogues often contain multiple semantically independent sub-dialogues without clear dependencies between them. Therefore, learning word relationships across the entire dialogue inevitably introduces additional noise into the extraction process. To address this, our method focuses on partitioning dialogues into semantically independent sub-dialogues. Achieving completeness while minimizing these sub-dialogues presents a significant challenge. Simply partitioning based on reply relationships is ineffective. Instead, we propose utilizing a structural entropy minimization algorithm to partition the dialogues. This approach aims to preserve relevant utterances while distinguishing irrelevant ones as much as possible. Furthermore, we introduce a two-step framework for quadruple extraction: first extracting individual sentiment elements at the utterance level, then matching quadruples at the sub-dialogue level. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in DiaASQ with much lower computational costs. 对话面向的情感四元组抽取（DiaASQ）旨在从给定的多轮多参与者对话中抽取所有的目标-方面-观点-情感四元组。现有方法通常在整个对话中学习词语关系，假设情感元素呈均匀分布。然而，我们发现对话常常包含多个语义上相互独立的子对话，子对话之间没有明确的依赖关系。因此，在整个对话范围内学习词语关系不可避免地会为抽取过程带来额外噪声。为了解决这一问题，我们的方法侧重于将对话划分为语义上相互独立的子对话。在保证完整性的同时尽量减小这些子对话规模是一项重大挑战。仅基于回复关系进行划分是无效的。相反，我们提出采用结构熵最小化算法来对对话进行划分。这一方法旨在尽可能保留相关的发言，同时尽量区分出不相关的发言。此外，我们提出了一个用于四元组抽取的两步框架：首先在话语层面抽取各个情感元素，然后在子对话层面匹配四元组。大量实验证明，我们的方法在 DiaASQ 上以更低的计算成本实现了最先进的性能。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-07 04:22:17 UTC 发布：2025-08-07 04:22:17 协调世界时（UTC）

#116 Skin-SOAP: A Weakly Supervised Framework for Generating Structured SOAP Notes #116 Skin-SOAP：一种用于生成结构化 SOAP 记录的弱监督框架

Authors: [Sadia Kamal](https://arxiv.org/search/?searchtype=author&query=Sadia Kamal), [Tim Oates](https://arxiv.org/search/?searchtype=author&query=Tim Oates), [Joy Wan](https://arxiv.org/search/?searchtype=author&query=Joy Wan) 作者：Sadia Kamal、Tim Oates、Joy Wan

Skin carcinoma is the most prevalent form of cancer globally, accounting for over $8 billion in annual healthcare expenditures. Early diagnosis, accurate and timely treatment are critical to improving patient survival rates. In clinical settings, physicians document patient visits using detailed SOAP (Subjective, Objective, Assessment, and Plan) notes. However, manually generating these notes is labor-intensive and contributes to clinician burnout. In this work, we propose skin-SOAP, a weakly supervised multimodal framework to generate clinically structured SOAP notes from limited inputs, including lesion images and sparse clinical text. Our approach reduces reliance on manual annotations, enabling scalable, clinically grounded documentation while alleviating clinician burden and reducing the need for large annotated data. Our method achieves performance comparable to GPT-4o, Claude, and DeepSeek Janus Pro across key clinical relevance metrics. To evaluate this clinical relevance, we introduce two novel metrics MedConceptEval and Clinical Coherence Score (CCS) which assess semantic alignment with expert medical concepts and input features, respectively. 皮肤癌是全球最常见的癌症类型，每年占据超过 80 亿美元的医疗开支。早期诊断、准确且及时的治疗对提高患者生存率至关重要。在临床环境中，医生使用详细的 SOAP（主观、客观、评估和计划）记录记录患者就诊情况。然而，手工生成这些记录劳动强度大并加剧了临床医生的倦怠。在本工作中，我们提出了 skin-SOAP，一种弱监督的多模态框架，用于从有限输入（包括病变图像和稀疏的临床文本）生成具有临床结构的 SOAP 记录。我们的方法减少了对人工标注的依赖，实现了可扩展、以临床为基础的文档记录，同时减轻了临床医生的负担并降低了对大量标注数据的需求。我们的方法在关键临床相关性指标上取得了可与 GPT-4o、Claude 和 DeepSeek Janus Pro 相媲美的性能。为评估这种临床相关性，我们提出了两个新指标 MedConceptEval 和临床一致性评分（Clinical Coherence Score, CCS），分别用于评估与专家医学概念的语义对齐性以及与输入特征的一致性。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning 主题：计算机视觉与模式识别、人工智能、机器学习

Publish: 2025-08-07 04:12:43 UTC 发表时间：2025-08-07 04:12:43 UTC

#117 SPaRFT: Self-Paced Reinforcement Fine-Tuning for Large Language Models #117 SPaRFT：面向大型语言模型的自定节奏强化微调

Authors: [Dai Do](https://arxiv.org/search/?searchtype=author&query=Dai Do), [Manh Nguyen](https://arxiv.org/search/?searchtype=author&query=Manh Nguyen), [Svetha Venkatesh](https://arxiv.org/search/?searchtype=author&query=Svetha Venkatesh), [Hung Le](https://arxiv.org/search/?searchtype=author&query=Hung Le) 作者：Dai Do、Manh Nguyen、Svetha Venkatesh、Hung Le

Large language models (LLMs) have shown strong reasoning capabilities when fine-tuned with reinforcement learning (RL). However, such methods require extensive data and compute, making them impractical for smaller models. Current approaches to curriculum learning or data selection are largely heuristic-driven or demand extensive computational resources, limiting their scalability and generalizability. We propose \textbf{SPaRFT}, a self-paced learning framework that enables efficient learning based on the capability of the model being trained through optimizing which data to use and when. First, we apply \emph{cluster-based data reduction} to partition training data by semantics and difficulty, extracting a compact yet diverse subset that reduces redundancy. Then, a \emph{multi-armed bandit} treats data clusters as arms, optimized to allocate training samples based on model current performance. Experiments across multiple reasoning benchmarks show that SPaRFT achieves comparable or better accuracy than state-of-the-art baselines while using up to 100× fewer samples. Ablation studies and analyses further highlight the importance of both data clustering and adaptive selection. Our results demonstrate that carefully curated, performance-driven training curricula can unlock strong reasoning abilities in LLMs with minimal resources. 大型语言模型 (LLMs) 在使用强化学习 (RL) 微调后展现出强大的推理能力。然而，此类方法需要大量数据和计算，使其对较小模型不切实际。当前关于课程学习或数据选择的方法在很大程度上依赖启发式或需要大量计算资源，限制了其可扩展性和泛化能力。我们提出了 SPaRFT，一种自我节奏学习框架，通过优化使用何种数据以及何时使用，基于被训练模型的能力实现高效学习。首先，我们应用基于聚类的数据缩减，将训练数据按语义和难度划分，提取出一个紧凑但多样的子集以减少冗余。然后，将多臂赌博机视为数据簇的处理方式，用于根据模型的当前表现优化分配训练样本。在多个推理基准上的实验证明，SPaRFT 在使用多达 100× 更少样本的情况下，达到与最先进基线相当或更好的准确率。消融研究和分析进一步突显了数据聚类和自适应选择两者的重要性。我们的结果表明，精心设计、以性能为导向的训练课程可以在极少资源下为 LLMs 释放强大的推理能力。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-07 03:50:48 UTC 发布：2025-08-07 03:50:48 UTC

#118 Making Prompts First-Class Citizens for Adaptive LLM Pipelines #118 让提示成为自适应 LLM 管道的一等公民

Modern LLM pipelines increasingly resemble data-centric systems: they retrieve external context, compose intermediate outputs, validate results, and adapt based on runtime feedback. Yet, the central element guiding this process – the prompt – remains a brittle, opaque string, disconnected from the surrounding dataflow. This disconnect limits reuse, optimization, and runtime control. In this paper, we describe our vision and an initial design for SPEAR, a language and runtime that fills this prompt management gap by making prompts structured, adaptive, and first-class components of the execution model. SPEAR enables (1) runtime prompt refinement – modifying prompts dynamically in response to execution-time signals such as confidence, latency, or missing context; and (2) structured prompt management – organizing prompt fragments into versioned views with support for introspection and logging. SPEAR defines a prompt algebra that governs how prompts are constructed and adapted within a pipeline. It supports multiple refinement modes (manual, assisted, and automatic), giving developers a balance between control and automation. By treating prompt logic as structured data, SPEAR enables optimizations such as operator fusion, prefix caching, and view reuse. Preliminary experiments quantify the behavior of different refinement modes compared to static prompts and agentic retries, as well as the impact of prompt-level optimizations such as operator fusion. 现代的 LLM 管道越来越像以数据为中心的系统：它们检索外部上下文、组合中间输出、验证结果，并根据运行时反馈进行调整。然而，引导这一过程的核心要素——提示（prompt）——仍然是一个脆弱且不透明的字符串，与周围的数据流脱节。这种脱节限制了重用、优化和运行时控制。在本文中，我们描述了 SPEAR 的愿景和初步设计，SPEAR 是一种语言和运行时，通过使提示结构化、自适应并成为执行模型的一等组成部分来填补提示管理的空白。SPEAR 支持（1）运行时提示细化——在执行时根据置信度、延迟或缺失上下文等信号动态修改提示；以及（2）结构化提示管理——将提示片段组织为带有版本视图的结构，并支持自省和日志记录。SPEAR 定义了一种提示代数，用以规范提示在管道中如何构建和适配。它支持多种细化模式（手动、辅助和自动），在控制与自动化之间为开发者提供了平衡。通过将提示逻辑视为结构化数据，SPEAR 实现了诸如算子融合、前缀缓存和视图重用等优化。初步实验量化了不同精炼模式相较于静态提示和具代理性的重试的行为，以及像算子融合这样的提示级优化的影响。

Subjects: Databases, Artificial Intelligence, Computation and Language 主题：数据库，人工智能，计算与语言

Publish: 2025-08-07 03:49:56 UTC 发布：2025-08-07 03:49:56 UTC

#119 Towards Hallucination-Free Music: A Reinforcement Learning Preference Optimization Framework for Reliable Song Generation #119 朝向无幻觉音乐：用于可靠歌曲生成的强化学习偏好优化框架

Recent advances in audio-based generative language models have accelerated AI-driven lyric-to-song generation. However, these models frequently suffer from content hallucination, producing outputs misaligned with the input lyrics and undermining musical coherence. Current supervised fine-tuning (SFT) approaches, limited by passive label-fitting, exhibit constrained self-improvement and poor hallucination mitigation. To address this core challenge, we propose a novel reinforcement learning (RL) framework leveraging preference optimization for hallucination control. Our key contributions include: (1) Developing a robust hallucination preference dataset constructed via phoneme error rate (PER) computation and rule-based filtering to capture alignment with human expectations; (2) Implementing and evaluating three distinct preference optimization strategies within the RL framework: Direct Preference Optimization (DPO), Proximal Policy Optimization (PPO), and Group Relative Policy Optimization (GRPO). DPO operates off-policy to enhance positive token likelihood, achieving a significant 7.4% PER reduction. PPO and GRPO employ an on-policy approach, training a PER-based reward model to iteratively optimize sequences via reward maximization and KL-regularization, yielding PER reductions of 4.9% and 4.7%, respectively. Comprehensive objective and subjective evaluations confirm that our methods effectively suppress hallucinations while preserving musical quality. Crucially, this work presents a systematic, RL-based solution to hallucination control in lyric-to-song generation. The framework’s transferability also unlocks potential for music style adherence and musicality enhancement, opening new avenues for future generative song research. 近期在基于音频的生成式语言模型方面取得的进展，加速了由 AI 驱动的歌词到歌曲生成。然而，这些模型经常出现内容幻觉，生成的输出与输入歌词不一致，破坏了音乐的连贯性。当前的有监督微调（SFT）方法受限于被动的标签拟合，表现出有限的自我改进能力和较差的幻觉缓解效果。为了解决这一核心问题，我们提出了一种利用偏好优化进行幻觉控制的新型强化学习（RL）框架。我们的主要贡献包括： (1) 构建了一个稳健的幻觉偏好数据集，该数据集通过音素错误率（PER）计算和基于规则的过滤来捕捉与人类期望的一致性；(2) 在 RL 框架内实现并评估了三种不同的偏好优化策略：直接偏好优化（DPO）、近端策略优化（PPO）和群体相对策略优化（GRPO）。DPO 以离线方式运作以增强正向 token 的概率，取得了显著的 7.4% PER 降幅。 PPO 和 GRPO 采用在线策略方法，训练基于偏好学习的奖励模型（PER）通过最大化奖励和 KL 正则化迭代优化序列，分别带来 4.9% 和 4.7% 的 PER 降幅。全面的客观和主观评估证实了我们的方法在抑制幻觉的同时保持了音乐质量。关键是，本工作提出了一个系统的、基于强化学习的歌词到歌曲生成幻觉控制解决方案。该框架的可迁移性还为音乐风格保持和音乐性提升打开了潜在可能，为未来的生成歌曲研究开辟了新途径。

Subjects: Sound, Artificial Intelligence, Audio and Speech Processing 主题：声音，人工智能，音频与语音处理

Publish: 2025-08-07 03:49:18 UTC 发布：2025-08-07 03:49:18 UTC

#120 R-Zero: Self-Evolving Reasoning LLM from Zero Data #120 R-Zero：从零数据自我进化推理 LLM

Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题：机器学习、人工智能、计算与语言

Publish: 2025-08-07 03:38:16 UTC 发布：2025-08-07 03:38:16 UTC

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-07 03:36:38 UTC 发表：2025-08-07 03:36:38 UTC

#122 AgenticData: An Agentic Data Analytics System for Heterogeneous Data #122 AgenticData：一种用于异构数据的主动式数据分析系统

Authors: [Ji Sun](https://arxiv.org/search/?searchtype=author&query=Ji Sun), [Guoliang Li](https://arxiv.org/search/?searchtype=author&query=Guoliang Li), [Peiyao Zhou](https://arxiv.org/search/?searchtype=author&query=Peiyao Zhou), [Yihui Ma](https://arxiv.org/search/?searchtype=author&query=Yihui Ma), [Jingzhe Xu](https://arxiv.org/search/?searchtype=author&query=Jingzhe Xu), [Yuan Li](https://arxiv.org/search/?searchtype=author&query=Yuan Li) 作者：Ji Sun、Guoliang Li、Peiyao Zhou、Yihui Ma、Jingzhe Xu、Yuan Li

Existing unstructured data analytics systems rely on experts to write code and manage complex analysis workflows, making them both expensive and time-consuming. To address these challenges, we introduce AgenticData, an innovative agentic data analytics system that allows users to simply pose natural language (NL) questions while autonomously analyzing data sources across multiple domains, including both unstructured and structured data. First, AgenticData employs a feedback-driven planning technique that automatically converts an NL query into a semantic plan composed of relational and semantic operators. We propose a multi-agent collaboration strategy by utilizing a data profiling agent for discovering relevant data, a semantic cross-validation agent for iterative optimization based on feedback, and a smart memory agent for maintaining short-term context and long-term knowledge. Second, we propose a semantic optimization model to refine and execute semantic plans effectively. Our system, AgenticData, has been tested using three benchmarks. Experimental results showed that AgenticData achieved superior accuracy on both easy and difficult tasks, significantly outperforming state-of-the-art methods. 现有的非结构化数据分析系统依赖专家编写代码并管理复杂的分析工作流，使其既昂贵又耗时。为了解决这些挑战，我们提出了 AgenticData，一种创新的代理式数据分析系统，允许用户仅通过提出自然语言（NL）问题即可自主分析跨多个领域的数据源，包括非结构化和结构化数据。首先，AgenticData 采用一种基于反馈的规划技术，自动将自然语言查询转换为由关系和语义运算符组成的语义计划。我们提出了一种多代理协作策略，利用数据概况代理发现相关数据、语义交叉验证代理基于反馈进行迭代优化，以及智能记忆代理维护短期上下文和长期知识。其次，我们提出了一种语义优化模型以有效地精炼和执行语义计划。我们的系统 AgenticData 已在三个基准上进行了测试。实验结果表明，AgenticData 在简单和困难任务上均取得了更高的准确率，显著优于最先进的方法。

Subjects: Databases, Artificial Intelligence 主题：数据库，人工智能

Publish: 2025-08-07 03:33:59 UTC 发表：2025-08-07 03:33:59 UTC

#123 Situated Epistemic Infrastructures: A Diagnostic Framework for Post-Coherence Knowledge #123 情景化的认知基础设施：后连贯性知识的诊断框架

Author: [Matthew Kelly](https://arxiv.org/search/?searchtype=author&query=Matthew Kelly) 作者：Matthew Kelly

Large Language Models (LLMs) such as ChatGPT have rendered visible the fragility of contemporary knowledge infrastructures by simulating coherence while bypassing traditional modes of citation, authority, and validation. This paper introduces the Situated Epistemic Infrastructures (SEI) framework as a diagnostic tool for analyzing how knowledge becomes authoritative across hybrid human-machine systems under post-coherence conditions. Rather than relying on stable scholarly domains or bounded communities of practice, SEI traces how credibility is mediated across institutional, computational, and temporal arrangements. Integrating insights from infrastructure studies, platform theory, and epistemology, the framework foregrounds coordination over classification, emphasizing the need for anticipatory and adaptive models of epistemic stewardship. The paper contributes to debates on AI governance, knowledge production, and the ethical design of information systems by offering a robust alternative to representationalist models of scholarly communication. 像 ChatGPT 这样的大型语言模型（LLMs）通过在绕过传统引用、权威与验证方式的同时模拟连贯性，揭示了当代知识基础设施的脆弱性。本文提出了情境性认知基础设施（Situated Epistemic Infrastructures，SEI）框架，作为在后连贯性条件下分析知识如何在混合人机系统中获得权威性的诊断工具。SEI 不依赖于稳定的学术领域或有界的实践共同体，而是追踪信誉如何在制度、计算与时间安排之间被调节。该框架融合了基础设施研究、平台理论与认识论的见解，将协调置于分类之上，强调对认识管理进行前瞻性和适应性建模的必要性。本文通过提供一种对学术交流表征主义模型的有力替代方案，为关于人工智能治理、知识生产与信息系统伦理设计的讨论做出贡献。

Subjects: Human-Computer Interaction, Artificial Intelligence, Digital Libraries 主题：人机交互、人工智能、数字图书馆

Publish: 2025-08-07 03:08:23 UTC 发布：2025-08-07 03:08:23 协调世界时（UTC）

Authors: [Wenjie Hu](https://arxiv.org/search/?searchtype=author&query=Wenjie Hu), [Ye Zhou](https://arxiv.org/search/?searchtype=author&query=Ye Zhou), [Hann Woei Ho](https://arxiv.org/search/?searchtype=author&query=Hann Woei Ho) 作者：Wenjie Hu、Ye Zhou、Hann Woei Ho

Maze navigation is a fundamental challenge in robotics, requiring agents to traverse complex environments efficiently. While the Deep Deterministic Policy Gradient (DDPG) algorithm excels in control tasks, its performance in maze navigation suffers from sparse rewards, inefficient exploration, and long-horizon planning difficulties, often leading to low success rates and average rewards, sometimes even failing to achieve effective navigation. To address these limitations, this paper proposes an efficient Hierarchical DDPG (HDDPG) algorithm, which includes high-level and low-level policies. The high-level policy employs an advanced DDPG framework to generate intermediate subgoals from a long-term perspective and on a higher temporal scale. The low-level policy, also powered by the improved DDPG algorithm, generates primitive actions by observing current states and following the subgoal assigned by the high-level policy. The proposed method enhances stability with off-policy correction, refining subgoal assignments by relabeling historical experiences. Additionally, adaptive parameter space noise is utilized to improve exploration, and a reshaped intrinsic-extrinsic reward function is employed to boost learning efficiency. Further optimizations, including gradient clipping and Xavier initialization, are employed to improve robustness. The proposed algorithm is rigorously evaluated through numerical simulation experiments executed using the Robot Operating System (ROS) and Gazebo. Regarding the three distinct final targets in autonomous maze navigation tasks, HDDPG significantly overcomes the limitations of standard DDPG and its variants, improving the success rate by at least 56.59% and boosting the average reward by a minimum of 519.03 compared to baseline algorithms. 迷宫导航是机器人学中的一个基本挑战，要求智能体高效地穿越复杂环境。虽然深度确定性策略梯度（DDPG）算法在控制任务中表现出色，但在迷宫导航中往往因回报稀疏、探索效率低和长时程规划困难而表现欠佳，常导致成功率和平均回报较低，有时甚至无法实现有效导航。为了解决这些局限性，本文提出了一种高效的分层 DDPG（HDDPG）算法，包含高层和低层策略。高层策略采用改进的 DDPG 框架，从长期视角和更高时间尺度生成中间子目标。低层策略同样由改进的 DDPG 算法驱动，通过观测当前状态并遵循高层策略分配的子目标来生成原始动作。所提出的方法通过离策略修正来增强稳定性，通过对历史经验重新标注来优化子目标分配。此外，采用自适应参数空间噪声以改进探索，并使用重塑的内在-外在奖励函数以提升学习效率。进一步的优化措施包括梯度裁剪和 Xavier 初始化，以增强鲁棒性。提出的算法通过在 Robot Operating System (ROS) 和 Gazebo 上执行的数值仿真实验进行了严格评估。针对自主迷宫导航任务中的三个不同终点，HDDPG 显著克服了标准 DDPG 及其变体的局限性，与基线算法相比，成功率至少提高了 56.59%，平均奖励至少提升了 519.03。

Subjects: Robotics, Artificial Intelligence

Publish: 2025-08-07 03:06:22 UTC

#125 UGOD: Uncertainty-Guided Differentiable Opacity and Soft Dropout for Enhanced Sparse-View 3DGS #125 UGOD：基于不确定性的可微不透明度与软丢弃以增强稀疏视角 3DGS

Authors: [Zhihao Guo](https://arxiv.org/search/?searchtype=author&query=Zhihao Guo), [Peng Wang](https://arxiv.org/search/?searchtype=author&query=Peng Wang), [Zidong Chen](https://arxiv.org/search/?searchtype=author&query=Zidong Chen), [Xiangyu Kong](https://arxiv.org/search/?searchtype=author&query=Xiangyu Kong), [Yan Lyu](https://arxiv.org/search/?searchtype=author&query=Yan Lyu), [Guanyu Gao](https://arxiv.org/search/?searchtype=author&query=Guanyu Gao), [Liangxiu Han](https://arxiv.org/search/?searchtype=author&query=Liangxiu Han) 作者：郭志豪、王鹏、陈子东、孔祥宇、吕燕、高冠宇、韩良秀

3D Gaussian Splatting (3DGS) has become a competitive approach for novel view synthesis (NVS) due to its advanced rendering efficiency through 3D Gaussian projection and blending. However, Gaussians are treated equally weighted for rendering in most 3DGS methods, making them prone to overfitting, which is particularly the case in sparse-view scenarios. To address this, we investigate how adaptive weighting of Gaussians affects rendering quality, which is characterised by learned uncertainties proposed. This learned uncertainty serves two key purposes: first, it guides the differentiable update of Gaussian opacity while preserving the 3DGS pipeline integrity; second, the uncertainty undergoes soft differentiable dropout regularisation, which strategically transforms the original uncertainty into continuous drop probabilities that govern the final Gaussian projection and blending process for rendering. Extensive experimental results over widely adopted datasets demonstrate that our method outperforms rivals in sparse-view 3D synthesis, achieving higher quality reconstruction with fewer Gaussians in most datasets compared to existing sparse-view approaches, e.g., compared to DropGaussian, our method achieves 3.27% PSNR improvements on the MipNeRF 360 dataset. 三维高斯点蒸（3D Gaussian Splatting，3DGS）由于通过三维高斯投影和混合实现了先进的渲染效率，已成为新视角合成（NVS）的一种有竞争力的方法。然而，在大多数 3DGS 方法中，高斯被视为同等权重用于渲染，导致易于过拟合，这在视角稀疏的情形下尤为明显。为了解决这一问题，我们研究了高斯的自适应加权如何影响渲染质量，该质量由所提出的学习不确定性来表征。这种学习到的不确定性具有两个关键作用：首先，它在保持 3DGS 流水线完整性的同时，指导高斯不透明度的可微更新；其次，该不确定性经过软性可微丢弃正则化，被策略性地转换为连续的丢弃概率，这些概率支配最终用于渲染的高斯投影和混合过程。在广泛采用的数据集上的大量实验结果表明，我们的方法在稀疏视角三维合成方面优于竞争方法，在大多数数据集上以更少的高斯基实现了更高质量的重建。与现有的稀疏视角方法相比，例如，与 DropGaussian 相比，我们的方法在 MipNeRF 360 数据集上实现了 3.27% 的 PSNR 提升。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 学科：计算机视觉与模式识别，人工智能

Publish: 2025-08-07 01:42:22 UTC 发表：2025-08-07 01:42:22 UTC

#126 MENDR: Manifold Explainable Neural Data Representations

Foundation models for electroencephalography (EEG) signals have recently demonstrated success in learning generalized representations of EEGs, outperforming specialized models in various downstream tasks. However, many of these models lack transparency in their pretraining dynamics and offer limited insight into how well EEG information is preserved within their embeddings. For successful clinical integration, EEG foundation models must ensure transparency in pretraining, downstream fine-tuning, and the interpretability of learned representations. Current approaches primarily operate in the temporal domain, overlooking advancements in digital signal processing that enable the extraction of deterministic and traceable features, such as wavelet-based representations. We propose MENDR (Manifold Explainable Neural Data Representations), a filter bank-based EEG foundation model built on a novel Riemannian Manifold Transformer architecture to resolve these issues. MENDR learns symmetric positive definite matrix embeddings of EEG signals and is pretrained on a large corpus comprising over 4,000 hours of EEG data, decomposed via discrete wavelet packet transforms into multi-resolution coefficients. MENDR significantly enhances interpretability by visualizing symmetric positive definite embeddings as geometric ellipsoids and supports accurate reconstruction of EEG signals from learned embeddings. Evaluations across multiple clinical EEG tasks demonstrate that MENDR achieves near state-of-the-art performance with substantially fewer parameters, underscoring its potential for efficient, interpretable, and clinically applicable EEG analysis. 用于脑电图（EEG）信号的基础模型近期在学习通用 EEG 表示方面取得了成功，在多种下游任务中优于专用模型。然而，许多此类模型在预训练动态上缺乏透明性，且对其嵌入中保留了多少 EEG 信息提供的洞见有限。为了在临床中成功整合，EEG 基础模型必须在预训练、下游微调以及所学表示的可解释性方面保证透明性。目前的方法主要在时域中运行，忽视了数字信号处理方面的发展，而这些发展可以提取确定性且可追溯的特征，例如基于小波的表示。我们提出了 MENDR（流形可解释神经数据表示），这是一种基于滤波器组的 EEG 基础模型，建立在一种新颖的黎曼流形变换器架构上，以解决这些问题。 MENDR 学习脑电信号的对称正定矩阵嵌入，并在包含超过 4,000 小时脑电数据的大型语料库上进行预训练，这些数据通过离散小波包变换分解为多分辨率系数。MENDR 通过将对称正定嵌入可视化为几何椭球，大大提升了解释性，并支持从学习到的嵌入精确重建脑电信号。在多个临床脑电任务上的评估表明，MENDR 在参数显著更少的情况下实现了接近最先进的性能，突显了其在高效、可解释且可临床应用的脑电分析方面的潜力。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习、人工智能

Publish: 2025-08-07 00:55:05 UTC 发布：2025-08-07 00:55:05 UTC

#127 AdvDINO: Domain-Adversarial Self-Supervised Representation Learning for Spatial Proteomics #127 AdvDINO：用于空间蛋白质组学的域对抗自监督表征学习

Authors: [Stella Su](https://arxiv.org/search/?searchtype=author&query=Stella Su), [Marc Harary](https://arxiv.org/search/?searchtype=author&query=Marc Harary), [Scott J. Rodig](https://arxiv.org/search/?searchtype=author&query=Scott J. Rodig), [William Lotter](https://arxiv.org/search/?searchtype=author&query=William Lotter) 作者：Stella Su、Marc Harary、Scott J. Rodig、William Lotter

Self-supervised learning (SSL) has emerged as a powerful approach for learning visual representations without manual annotations. However, the robustness of standard SSL methods to domain shift – systematic differences across data sources – remains uncertain, posing an especially critical challenge in biomedical imaging where batch effects can obscure true biological signals. We present AdvDINO, a domain-adversarial self-supervised learning framework that integrates a gradient reversal layer into the DINOv2 architecture to promote domain-invariant feature learning. Applied to a real-world cohort of six-channel multiplex immunofluorescence (mIF) whole slide images from non-small cell lung cancer patients, AdvDINO mitigates slide-specific biases to learn more robust and biologically meaningful representations than non-adversarial baselines. Across >5.46 million mIF image tiles, the model uncovers phenotype clusters with distinct proteomic profiles and prognostic significance, and improves survival prediction in attention-based multiple instance learning. While demonstrated on mIF data, AdvDINO is broadly applicable to other imaging domains – including radiology, remote sensing, and autonomous driving – where domain shift and limited annotated data hinder model generalization and interpretability. 自监督学习（SSL）已成为在无需手工注释的情况下学习视觉表示的强大方法。然而，标准自监督方法对领域迁移——即数据源之间的系统性差异——的鲁棒性仍不确定，这在生物医学影像中尤为关键，因为批次效应可能掩盖真实的生物学信号。我们提出了 AdvDINO，一种域对抗自监督学习框架，将梯度反转层集成到 DINOv2 架构中，以促进域不变特征的学习。将其应用于来自非小细胞肺癌患者的六通道多重免疫荧光（mIF）整片图像的真实队列，AdvDINO 在抑制切片特异性偏差方面优于非对抗性基线，从而学习到更具鲁棒性和生物学意义的表示。在覆盖 >5.46 百万 mIF 图像瓦片的范围内，模型发现了具有不同蛋白质组特征和预后意义的表型聚类，并在基于注意力的多实例学习中提升了生存期预测能力。尽管在多重免疫荧光（mIF）数据上进行了演示，AdvDINO 广泛适用于其他成像领域——包括放射学、遥感和自动驾驶——在这些领域中，域漂移和标注数据有限会阻碍模型的泛化性和可解释性。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 学科：计算机视觉与模式识别，人工智能

Publish: 2025-08-07 00:51:54 UTC 发布日期：2025-08-07 00:51:54 UTC

#128 Tesserae: Scalable Placement Policies for Deep Learning Workloads #128 Tesserae：用于深度学习工作负载的可扩展部署策略

Authors: [Song Bian](https://arxiv.org/search/?searchtype=author&query=Song Bian), [Saurabh Agarwal](https://arxiv.org/search/?searchtype=author&query=Saurabh Agarwal), [Md. Tareq Mahmood](https://arxiv.org/search/?searchtype=author&query=Md. Tareq Mahmood), [Shivaram Venkataraman](https://arxiv.org/search/?searchtype=author&query=Shivaram Venkataraman) 作者：Song Bian、Saurabh Agarwal、Md. Tareq Mahmood、Shivaram Venkataraman

Training deep learning (DL) models has become a dominant workload in data-centers and improving resource utilization is a key goal of DL cluster schedulers. In order to do this, schedulers typically incorporate placement policies that govern where jobs are placed on the cluster. Existing placement policies are either designed as ad-hoc heuristics or incorporated as constraints within a complex optimization problem and thus either suffer from suboptimal performance or poor scalability. Our key insight is that many placement constraints can be formulated as graph matching problems and based on that we design novel placement policies for minimizing job migration overheads and job packing. We integrate these policies into Tesserae and describe how our design leads to a scalable and effective GPU cluster scheduler. Our experimental results show that Tesserae improves average JCT by up to 1.62x and the Makespan by up to 1.15x compared with the existing schedulers. 训练深度学习（DL）模型已成为数据中心的主要工作负载，提升资源利用率是 DL 集群调度器的关键目标。为此，调度器通常会加入决定作业在集群中放置位置的调度策略。现有的放置策略要么被设计为临时的启发式方法，要么作为复杂优化问题中的约束被嵌入其中，因此要么性能次优，要么可扩展性差。我们的关键洞见是许多放置约束可以表述为图匹配问题，基于此我们设计了用于最小化作业迁移开销和作业打包的新型放置策略。我们将这些策略集成到 Tesserae 中，并描述了我们的设计如何产生一个可扩展且高效的 GPU 集群调度器。我们的实验结果表明，与现有调度器相比，Tesserae 将平均作业完成时间（JCT）最多改善至 1.62 倍，完成时长（Makespan）最多改善至 1.15 倍。

Subjects: Distributed, Parallel, and Cluster Computing, Artificial Intelligence 主题：分布式、并行与集群计算，人工智能

Publish: 2025-08-07 00:38:43 UTC 发布：2025-08-07 00:38:43 协调世界时 (UTC)

#129 Towards Robust Evaluation of Visual Activity Recognition: Resolving Verb Ambiguity with Sense Clustering #129 面向视觉活动识别的稳健评估：通过意义聚类解决动词歧义

Subjects: Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition 主题：计算与语言、人工智能、计算机视觉与模式识别

Publish: 2025-08-07 00:22:15 UTC 发布：2025-08-07 00:22:15 UTC

#130 TRKT: Weakly Supervised Dynamic Scene Graph Generation with Temporal-enhanced Relation-aware Knowledge Transferring #130 TRKT：弱监督动态场景图生成与时序增强关系感知知识迁移 [PDF ] [Copy] [Kimi ] [REL]

Authors: [Zhu Xu](https://arxiv.org/search/?searchtype=author&query=Zhu Xu), [Ting Lei](https://arxiv.org/search/?searchtype=author&query=Ting Lei), [Zhimin Li](https://arxiv.org/search/?searchtype=author&query=Zhimin Li), [Guan Wang](https://arxiv.org/search/?searchtype=author&query=Guan Wang), [Qingchao Chen](https://arxiv.org/search/?searchtype=author&query=Qingchao Chen), [Yuxin Peng](https://arxiv.org/search/?searchtype=author&query=Yuxin Peng), [Yang liu](https://arxiv.org/search/?searchtype=author&query=Yang liu) 作者：Zhu Xu，Ting Lei，Zhimin Li，Guan Wang，Qingchao Chen，Yuxin Peng，Yang liu

Dynamic Scene Graph Generation (DSGG) aims to create a scene graph for each video frame by detecting objects and predicting their relationships. Weakly Supervised DSGG (WS-DSGG) reduces annotation workload by using an unlocalized scene graph from a single frame per video for training. Existing WS-DSGG methods depend on an off-the-shelf external object detector to generate pseudo labels for subsequent DSGG training. However, detectors trained on static, object-centric images struggle in dynamic, relation-aware scenarios required for DSGG, leading to inaccurate localization and low-confidence proposals. To address the challenges posed by external object detectors in WS-DSGG, we propose a Temporal-enhanced Relation-aware Knowledge Transferring (TRKT) method, which leverages knowledge to enhance detection in relation-aware dynamic scenarios. TRKT is built on two key components:(1)Relation-aware knowledge mining: we first employ object and relation class decoders that generate category-specific attention maps to highlight both object regions and interactive areas. Then we propose an Inter-frame Attention Augmentation strategy that exploits optical flow for neighboring frames to enhance the attention maps, making them motion-aware and robust to motion blur. This step yields relation- and motion-aware knowledge mining for WS-DSGG. (2) we introduce a Dual-stream Fusion Module that integrates category-specific attention maps into external detections to refine object localization and boost confidence scores for object proposals. Extensive experiments demonstrate that TRKT achieves state-of-the-art performance on Action Genome dataset. Our code is avaliable at https://github.com/XZPKU/TRKT.git. 动态场景图生成（DSGG）的目标是在每一帧视频中通过检测物体并预测它们的关系来创建场景图。弱监督 DSGG（WS-DSGG）通过在每个视频中仅使用单帧的未定位场景图进行训练来减少标注工作量。现有的 WS-DSGG 方法依赖现成的外部目标检测器来生成用于后续 DSGG 训练的伪标签。然而，在静态、以物体为中心的图像上训练的检测器在 DSGG 所需的动态、关系感知场景中表现不佳，导致定位不准确和低置信度的候选框。为了解决外部目标检测器在 WS-DSGG 中带来的挑战，我们提出了一种时序增强的关系感知知识迁移（TRKT）方法，该方法利用知识来增强关系感知的动态场景中的检测。TRKT 由两个关键组成部分构建：（1）关系感知知识挖掘：我们首先采用物体和关系类别解码器，生成类别特定的注意力图，以突出物体区域和交互区域。然后我们提出了一种帧间注意力增强（Inter-frame Attention Augmentation）策略，利用相邻帧的光流增强注意力图，使其具备运动感知性并对运动模糊具有鲁棒性。此步骤为弱监督视频场景图生成（WS-DSGG）带来了基于关系和运动的知识挖掘。(2) 我们引入了一个双流融合模块（Dual-stream Fusion Module），将类别特定的注意力图整合到外部检测结果中，以精化目标定位并提升目标候选框的置信分数。大量实验表明，TRKT 在 Action Genome 数据集上达到了最新的最优性能。我们的代码可在 https://github.com/XZPKU/TRKT.git 获取。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 学科：计算机视觉与模式识别，人工智能

Publish: 2025-08-07 00:17:45 UTC 发布：2025-08-07 00:17:45 UTC

#131 INTENTION: Inferring Tendencies of Humanoid Robot Motion Through Interactive Intuition and Grounded VLM #131 意向：通过交互直觉与基于视觉语言模型的实证方法推断类人机器人运动倾向

Authors: [Jin Wang](https://arxiv.org/search/?searchtype=author&query=Jin Wang), [Weijie Wang](https://arxiv.org/search/?searchtype=author&query=Weijie Wang), [Boyuan Deng](https://arxiv.org/search/?searchtype=author&query=Boyuan Deng), [Heng Zhang](https://arxiv.org/search/?searchtype=author&query=Heng Zhang), [Rui Dai](https://arxiv.org/search/?searchtype=author&query=Rui Dai), [Nikos Tsagarakis](https://arxiv.org/search/?searchtype=author&query=Nikos Tsagarakis) 作者：Jin Wang, Weijie Wang, Boyuan Deng, Heng Zhang, Rui Dai, Nikos Tsagarakis

Traditional control and planning for robotic manipulation heavily rely on precise physical models and predefined action sequences. While effective in structured environments, such approaches often fail in real-world scenarios due to modeling inaccuracies and struggle to generalize to novel tasks. In contrast, humans intuitively interact with their surroundings, demonstrating remarkable adaptability, making efficient decisions through implicit physical understanding. In this work, we propose INTENTION, a novel framework enabling robots with learned interactive intuition and autonomous manipulation in diverse scenarios, by integrating Vision-Language Models (VLMs) based scene reasoning with interaction-driven memory. We introduce Memory Graph to record scenes from previous task interactions which embodies human-like understanding and decision-making about different tasks in real world. Meanwhile, we design an Intuitive Perceptor that extracts physical relations and affordances from visual scenes. Together, these components empower robots to infer appropriate interaction behaviors in new scenes without relying on repetitive instructions. Videos: https://robo-intention.github.io 传统的机器人操作控制和规划高度依赖精确的物理模型和预定义的动作序列。在结构化环境中这类方法虽有效，但在真实世界场景中常因建模不准确而失效，并且难以推广到新任务。相比之下，人类能直觉地与周围环境互动，展现出卓越的适应性，并通过隐含的物理理解做出高效决策。在这项工作中，我们提出了 INTENTION——一个新颖框架，通过将基于视觉-语言模型（VLM）的场景推理与以交互为驱动的记忆相结合，使机器人具备学习到的交互直觉并在多样场景中进行自主操作。我们引入了记忆图（Memory Graph）来记录以往任务交互中的场景，这体现了类似人类的理解和关于现实世界中不同任务的决策能力。与此同时，我们设计了直觉感知器（Intuitive Perceptor），从视觉场景中提取物理关系和可供性。上述组件共同使机器人能够在新场景中推断出适当的交互行为，而无需依赖重复的指令。视频： https://robo-intention.github.io

Subjects: Robotics, Artificial Intelligence 主题：机器人学、人工智能

Publish: 2025-08-06 23:27:22 UTC 发表：2025-08-06 23:27:22 UTC

#132 Extending Foundational Monocular Depth Estimators to Fisheye Cameras with Calibration Tokens #132 将基础单目深度估计器扩展到带有校准 token 的鱼眼相机 [PDF ] [副本] [Kimi ] [REL]

Authors: [Suchisrit Gangopadhyay](https://arxiv.org/search/?searchtype=author&query=Suchisrit Gangopadhyay), [Jung-Hee Kim](https://arxiv.org/search/?searchtype=author&query=Jung-Hee Kim), [Xien Chen](https://arxiv.org/search/?searchtype=author&query=Xien Chen), [Patrick Rim](https://arxiv.org/search/?searchtype=author&query=Patrick Rim), [Hyoungseob Park](https://arxiv.org/search/?searchtype=author&query=Hyoungseob Park), [Alex Wong](https://arxiv.org/search/?searchtype=author&query=Alex Wong) 作者：Suchisrit Gangopadhyay、Jung-Hee Kim、Xien Chen、Patrick Rim、Hyoungseob Park、Alex Wong

We propose a method to extend foundational monocular depth estimators (FMDEs), trained on perspective images, to fisheye images. Despite being trained on tens of millions of images, FMDEs are susceptible to the covariate shift introduced by changes in camera calibration (intrinsic, distortion) parameters, leading to erroneous depth estimates. Our method aligns the distribution of latent embeddings encoding fisheye images to those of perspective images, enabling the reuse of FMDEs for fisheye cameras without retraining or finetuning. To this end, we introduce a set of Calibration Tokens as a light-weight adaptation mechanism that modulates the latent embeddings for alignment. By exploiting the already expressive latent space of FMDEs, we posit that modulating their embeddings avoids the negative impact of artifacts and loss introduced in conventional recalibration or map projection to a canonical reference frame in the image space. Our method is self-supervised and does not require fisheye images but leverages publicly available large-scale perspective image datasets. This is done by recalibrating perspective images to fisheye images, and enforcing consistency between their estimates during training. We evaluate our approach with several FMDEs, on both indoors and outdoors, where we consistently improve over state-of-the-art methods using a single set of tokens for both. Code available at: https://github.com/JungHeeKim29/calibration-token. 我们提出了一种方法，将在透视图像上训练的基础单目深度估计器（FMDEs）扩展到鱼眼图像。尽管 FMDEs 在数千万张图像上进行了训练，但它们仍然容易受到由相机标定（内参、畸变）参数变化引起的协变量偏移影响，从而导致错误的深度估计。我们的方法将编码鱼眼图像的潜在嵌入分布与透视图像的嵌入分布对齐，使得在不重新训练或微调的情况下可以重用用于鱼眼相机的 FMDEs。为此，我们引入了一组校准令牌（Calibration Tokens）作为一种轻量级的适配机制，用以调制潜在嵌入以实现对齐。通过利用 FMDEs 中已具备表现力的潜在空间，我们认为调制其嵌入可以避免在图像空间中进行常规重标定或投影到规范参考框架时引入的伪影和损失带来的负面影响。我们的方法是自监督的，不需要鱼眼图像，而是利用公开可用的大规模透视图像数据集。这是通过将透视图像重新校准为鱼眼图像，并在训练期间强制它们的估计保持一致来实现的。我们使用多种基础单目深度估计器（FMDEs）在室内和室外进行了评估，均在仅使用一组通用 token 的情况下持续超过最先进的方法。代码可在以下地址获取：https://github.com/JungHeeKim29/calibration-token 。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning 主题：计算机视觉与模式识别，人工智能，机器学习

Publish: 2025-08-06 23:23:20 UTC 发布：2025-08-06 23:23:20 UTC

#133 Taxonomy of Faults in Attention-Based Neural Networks #133 基于注意力的神经网络故障分类法

Authors: [Sigma Jahan](https://arxiv.org/search/?searchtype=author&query=Sigma Jahan), [Saurabh Singh Rajput](https://arxiv.org/search/?searchtype=author&query=Saurabh Singh Rajput), [Tushar Sharma](https://arxiv.org/search/?searchtype=author&query=Tushar Sharma), [Mohammad Masudur Rahman](https://arxiv.org/search/?searchtype=author&query=Mohammad Masudur Rahman) 作者：Sigma Jahan、Saurabh Singh Rajput、Tushar Sharma、Mohammad Masudur Rahman

Attention mechanisms are at the core of modern neural architectures, powering systems ranging from ChatGPT to autonomous vehicles and driving a major economic impact. However, high-profile failures, such as ChatGPT’s nonsensical outputs or Google’s suspension of Gemini’s image generation due to attention weight errors, highlight a critical gap: existing deep learning fault taxonomies might not adequately capture the unique failures introduced by attention mechanisms. This gap leaves practitioners without actionable diagnostic guidance. To address this gap, we present the first comprehensive empirical study of faults in attention-based neural networks (ABNNs). Our work is based on a systematic analysis of 555 real-world faults collected from 96 projects across ten frameworks, including GitHub, Hugging Face, and Stack Overflow. Through our analysis, we develop a novel taxonomy comprising seven attention-specific fault categories, not captured by existing work. Our results show that over half of the ABNN faults arise from mechanisms unique to attention architectures. We further analyze the root causes and manifestations of these faults through various symptoms. Finally, by analyzing symptom-root cause associations, we identify four evidence-based diagnostic heuristics that explain 33.0% of attention-specific faults, offering the first systematic diagnostic guidance for attention-based models. 注意力机制是现代神经架构的核心，驱动着从 ChatGPT 到自动驾驶车辆的各类系统，并带来重大经济影响。然而，高调的失败案例，例如 ChatGPT 的荒谬输出或谷歌因注意力权重错误而暂停 Gemini 的图像生成，突显了一个关键差距：现有的深度学习故障分类法可能无法充分涵盖注意力机制所引入的独特故障。这一差距使得从业者缺乏可操作的诊断指导。为了解决这一问题，我们提出了首个关于基于注意力的神经网络（ABNN）故障的全面实证研究。我们的工作基于对来自 96 个项目、涵盖包括 GitHub、Hugging Face 和 Stack Overflow 在内的十个框架的 555 个真实故障的系统性分析。通过分析，我们构建了一个新的分类法，包含七类注意力特有的故障类别，这些类别未被现有工作涵盖。我们的结果显示，超过一半的 ABNN 故障源自注意力架构特有的机制。我们还通过各种症状进一步分析了这些故障的根本原因和表现形式。最后，通过分析症状与根因之间的关联，我们识别出四条基于证据的诊断启发式规则，这些规则能够解释 33.0%的注意力特异性故障，为基于注意力的模型提供了首个系统化的诊断指导。

Subjects: Software Engineering, Artificial Intelligence 主题：软件工程，人工智能

Publish: 2025-08-06 23:20:18 UTC 发表：2025-08-06 23:20:18 UTC

#134 RCR-Router: Efficient Role-Aware Context Routing for Multi-Agent LLM Systems with Structured Memory #134 RCR-Router：具有结构化记忆的多智能体 LLM 系统中高效的角色感知上下文路由

Subjects: Computation and Language, Artificial Intelligence, Multiagent Systems 科目：计算与语言，人工智能，多智能体系统

Publish: 2025-08-06 21:59:34 UTC 发布：2025-08-06 21:59:34 UTC

#135 Revealing Temporal Label Noise in Multimodal Hateful Video Classification #135 揭示多模态仇恨视频分类中的时间标签噪声

Authors: [Shuonan Yang](https://arxiv.org/search/?searchtype=author&query=Shuonan Yang), [Tailin Chen](https://arxiv.org/search/?searchtype=author&query=Tailin Chen), [Rahul Singh](https://arxiv.org/search/?searchtype=author&query=Rahul Singh), [Jiangbei Yue](https://arxiv.org/search/?searchtype=author&query=Jiangbei Yue), [Jianbo Jiao](https://arxiv.org/search/?searchtype=author&query=Jianbo Jiao), [Zeyu Fu](https://arxiv.org/search/?searchtype=author&query=Zeyu Fu) 作者：Shuonan Yang, Tailin Chen, Rahul Singh, Jiangbei Yue, Jianbo Jiao, Zeyu Fu

The rapid proliferation of online multimedia content has intensified the spread of hate speech, presenting critical societal and regulatory challenges. While recent work has advanced multimodal hateful video detection, most approaches rely on coarse, video-level annotations that overlook the temporal granularity of hateful content. This introduces substantial label noise, as videos annotated as hateful often contain long non-hateful segments. In this paper, we investigate the impact of such label ambiguity through a fine-grained approach. Specifically, we trim hateful videos from the HateMM and MultiHateClip English datasets using annotated timestamps to isolate explicitly hateful segments. We then conduct an exploratory analysis of these trimmed segments to examine the distribution and characteristics of both hateful and non-hateful content. This analysis highlights the degree of semantic overlap and the confusion introduced by coarse, video-level annotations. Finally, controlled experiments demonstrated that time-stamp noise fundamentally alters model decision boundaries and weakens classification confidence, highlighting the inherent context dependency and temporal continuity of hate speech expression. Our findings provide new insights into the temporal dynamics of multimodal hateful videos and highlight the need for temporally aware models and benchmarks for improved robustness and interpretability. Code and data are available at https://github.com/Multimodal-Intelligence-Lab-MIL/HatefulVideoLabelNoise. 在线多媒体内容的快速增殖加剧了仇恨言论的传播，带来了重大的社会与监管挑战。尽管近期工作推动了多模态仇恨视频检测的发展，但大多数方法依赖于粗糙的视频层级标注，忽视了仇恨内容的时间粒度。这会引入大量标签噪声，因为被标注为仇恨的视频常常包含很长的非仇恨片段。在本文中，我们通过细粒度方法研究了这种标签模糊性的影响。具体而言，我们使用带有注释时间戳的 HateMM 和 MultiHateClip 英文数据集，对仇恨视频进行裁剪，以隔离明确的仇恨片段。然后我们对这些裁剪后的片段进行探索性分析，以检验仇恨内容与非仇恨内容的分布与特征。该分析突出了语义重叠的程度以及视频层级粗糙标注所引入的混淆。最后，受控实验表明时间戳噪声从根本上改变了模型的决策边界并削弱了分类信心，凸显了仇恨言论表达的内在语境依赖性和时间连续性。我们的研究结果为多模态仇恨视频的时间动态提供了新见解，并强调需要具有时间感知能力的模型和基准以提高鲁棒性与可解释性。代码和数据可在 https://github.com/Multimodal-Intelligence-Lab-MIL/HatefulVideoLabelNoise 获得。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-06 21:55:59 UTC 发布时间：2025-08-06 21:55:59 世界协调时（UTC）

#136 Adversarial Attacks and Defenses on Graph-aware Large Language Models (LLMs) #136 面向图意识大型语言模型（LLMs）的对抗攻击与防御

Authors: [Iyiola E. Olatunji](https://arxiv.org/search/?searchtype=author&query=Iyiola E. Olatunji), [Franziska Boenisch](https://arxiv.org/search/?searchtype=author&query=Franziska Boenisch), [Jing Xu](https://arxiv.org/search/?searchtype=author&query=Jing Xu), [Adam Dziedzic](https://arxiv.org/search/?searchtype=author&query=Adam Dziedzic) 作者：Iyiola E. Olatunji，Franziska Boenisch，Jing Xu，Adam Dziedzic

Large Language Models (LLMs) are increasingly integrated with graph-structured data for tasks like node classification, a domain traditionally dominated by Graph Neural Networks (GNNs). While this integration leverages rich relational information to improve task performance, their robustness against adversarial attacks remains unexplored. We take the first step to explore the vulnerabilities of graph-aware LLMs by leveraging existing adversarial attack methods tailored for graph-based models, including those for poisoning (training-time attacks) and evasion (test-time attacks), on two representative models, LLAGA (Chen et al. 2024) and GRAPHPROMPTER (Liu et al. 2024). Additionally, we discover a new attack surface for LLAGA where an attacker can inject malicious nodes as placeholders into the node sequence template to severely degrade its performance. Our systematic analysis reveals that certain design choices in graph encoding can enhance attack success, with specific findings that: (1) the node sequence template in LLAGA increases its vulnerability; (2) the GNN encoder used in GRAPHPROMPTER demonstrates greater robustness; and (3) both approaches remain susceptible to imperceptible feature perturbation attacks. Finally, we propose an end-to-end defense framework GALGUARD, that combines an LLM-based feature correction module to mitigate feature-level perturbations and adapted GNN defenses to protect against structural attacks. 大语言模型（LLMs）正日益与图结构数据整合用于节点分类等任务，这一领域传统上由图神经网络（GNNs）主导。尽管这种整合利用了丰富的关系信息以提升任务表现，但其对抗性攻击下的鲁棒性仍未被探究。我们采取第一步，通过利用针对基于图的模型量身定制的现有对抗攻击方法来探索具图感知的 LLMs 的脆弱性，包括用于投毒（训练时攻击）和规避（测试时攻击）的方法，针对两个代表性模型 LLAGA（Chen et al. 2024）和 GRAPHPROMPTER（Liu et al. 2024）。此外，我们发现了 LLAGA 的一个新攻击面：攻击者可以在节点序列模板中注入作为占位符的恶意节点，从而严重降低其性能。我们的系统性分析表明，图编码中的某些设计选择会提高攻击成功率，具体发现包括： (1) LLAGA 中的节点序列模板增加了其脆弱性； (2) GRAPHPROMPTER 中使用的 GNN 编码器表现出更强的鲁棒性；以及 (3) 两种方法都仍然容易受到不可察觉的特征扰动攻击。最后，我们提出了一个端到端防御框架 GALGUARD，该框架结合了基于 LLM 的特征校正模块以减轻特征级扰动，并采用改编的 GNN 防御来防护结构性攻击。

Subjects: Cryptography and Security, Artificial Intelligence, Social and Information Networks 主题：密码学与安全、人工智能、社会与信息网络

Publish: 2025-08-06 21:38:52 UTC 发布：2025-08-06 21:38:52 UTC

#137 Leveraging Deep Learning for Physical Model Bias of Global Air Quality Estimates #137 利用深度学习纠正全球空气质量估算的物理模型偏差

Air pollution is the world’s largest environmental risk factor for human disease and premature death, resulting in more than 6 million permature deaths in 2019. Currently, there is still a challenge to model one of the most important air pollutants, surface ozone, particularly at scales relevant for human health impacts, with the drivers of global ozone trends at these scales largely unknown, limiting the practical use of physics-based models. We employ a 2D Convolutional Neural Network based architecture that estimate surface ozone MOMO-Chem model residuals, referred to as model bias. We demonstrate the potential of this technique in North America and Europe, highlighting its ability better to capture physical model residuals compared to a traditional machine learning method. We assess the impact of incorporating land use information from high-resolution satellite imagery to improve model estimates. Importantly, we discuss how our results can improve our scientific understanding of the factors impacting ozone bias at urban scales that can be used to improve environmental policy. 空气污染是全球对人类疾病和过早死亡的最大环境风险因素，2019 年导致超过 600 万例过早死亡。目前，模拟最重要的大气污染物之一——地表臭氧仍然面临挑战，尤其是在与人类健康影响相关的尺度上，驱动这些尺度上全球臭氧变化的因素在很大程度上仍不清楚，这限制了基于物理的模型的实际应用。我们采用了一种基于二维卷积神经网络的架构来估计地表臭氧与 MOMO-Chem 模型的残差，称为模型偏差。我们展示了该技术在北美和欧洲的潜力，强调其相比传统机器学习方法更能捕捉物理模型残差的能力。我们评估了将来自高分辨率卫星影像的土地利用信息纳入以改进模型估计的影响。重要的是，我们讨论了如何利用我们的结果改进对影响城市尺度臭氧偏差因素的科学理解，从而用于改进环境政策。

Subjects: Machine Learning, Artificial Intelligence 学科：机器学习，人工智能

Publish: 2025-08-06 21:24:32 UTC 发表：2025-08-06 21:24:32 UTC

#138 Uncertainty Quantification for Surface Ozone Emulators using Deep Learning #138 使用深度学习对地表臭氧模拟器进行不确定性量化

Air pollution is a global hazard, and as of 2023, 94% of the world’s population is exposed to unsafe pollution levels. Surface Ozone (O3), an important pollutant, and the drivers of its trends are difficult to model, and traditional physics-based models fall short in their practical use for scales relevant to human-health impacts. Deep Learning-based emulators have shown promise in capturing complex climate patterns, but overall lack the interpretability necessary to support critical decision making for policy changes and public health measures. We implement an uncertainty-aware U-Net architecture to predict the Multi-mOdel Multi-cOnstituent Chemical data assimilation (MOMO-Chem) model’s surface ozone residuals (bias) using Bayesian and quantile regression methods. We demonstrate the capability of our techniques in regional estimation of bias in North America and Europe for June 2019. We highlight the uncertainty quantification (UQ) scores between our two UQ methodologies and discern which ground stations are optimal and sub-optimal candidates for MOMO-Chem bias correction, and evaluate the impact of land-use information in surface ozone residual modeling. 空气污染是全球性的危害，截至 2023 年，94%的世界人口暴露于不安全的污染水平。地表臭氧（O3）是一个重要污染物，其趋势的驱动因素难以建模，传统基于物理的模型在与人类健康影响相关的尺度上在实际应用中表现不足。基于深度学习的仿真器在捕捉复杂气候模式方面显示出前景，但总体缺乏支持政策变更和公共卫生措施关键决策所需的可解释性。我们实现了一种不确定性感知的 U-Net 架构，使用贝叶斯和分位数回归方法预测多模式多成分化学数据同化（MOMO-Chem）模型的地表臭氧残差（偏差）。我们展示了这些技术在 2019 年 6 月对北美和欧洲区域偏差估计的能力。我们强调了两种不确定性量化（UQ）方法之间的不确定性评分，识别出哪些地面监测站是对 MOMO-Chem 偏差校正的最佳和次优候选站，并评估了土地利用信息在地表臭氧残差建模中的影响。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-06 21:22:06 UTC 发布：2025-08-06 21:22:06 协调世界时（UTC）

#139 Sequence Aware SAC Control for Engine Fuel Consumption Optimization in Electrified Powertrain #139 序列感知 SAC 控制用于电气化动力总成的发动机燃油消耗优化

Authors: [Wafeeq Jaleel](https://arxiv.org/search/?searchtype=author&query=Wafeeq Jaleel), [Md Ragib Rownak](https://arxiv.org/search/?searchtype=author&query=Md Ragib Rownak), [Athar Hanif](https://arxiv.org/search/?searchtype=author&query=Athar Hanif), [Sidra Ghayour Bhatti](https://arxiv.org/search/?searchtype=author&query=Sidra Ghayour Bhatti), [Qadeer Ahmed](https://arxiv.org/search/?searchtype=author&query=Qadeer Ahmed) 作者：Wafeeq Jaleel，Md Ragib Rownak，Athar Hanif，Sidra Ghayour Bhatti，Qadeer Ahmed

As hybrid electric vehicles (HEVs) gain traction in heavy-duty trucks, adaptive and efficient energy management is critical for reducing fuel consumption while maintaining battery charge for long operation times. We present a new reinforcement learning (RL) framework based on the Soft Actor-Critic (SAC) algorithm to optimize engine control in series HEVs. We reformulate the control task as a sequential decision-making problem and enhance SAC by incorporating Gated Recurrent Units (GRUs) and Decision Transformers (DTs) into both actor and critic networks to capture temporal dependencies and improve planning over time. To evaluate robustness and generalization, we train the models under diverse initial battery states, drive cycle durations, power demands, and input sequence lengths. Experiments show that the SAC agent with a DT-based actor and GRU-based critic was within 1.8% of Dynamic Programming (DP) in fuel savings on the Highway Fuel Economy Test (HFET) cycle, while the SAC agent with GRUs in both actor and critic networks, and FFN actor-critic agent were within 3.16% and 3.43%, respectively. On unseen drive cycles (US06 and Heavy Heavy-Duty Diesel Truck (HHDDT) cruise segment), generalized sequence-aware agents consistently outperformed feedforward network (FFN)-based agents, highlighting their adaptability and robustness in real-world settings. 随着混合动力电动汽车（HEV）在重型卡车中逐渐普及，自适应且高效的能量管理对于在维持电池长时间电量的同时减少燃油消耗至关重要。我们提出了一个基于软行动者-评论家（SAC）算法的新强化学习（RL）框架，用于优化串联式 HEV 中的发动机控制。我们将控制任务重新表述为顺序决策问题，并通过在行动者和评论家网络中加入门控循环单元（GRU）和决策转换器（DT）来增强 SAC，以捕捉时间依赖性并改善随时间的规划能力。为了评估鲁棒性和泛化能力，我们在不同的初始电池状态、行驶周期时长、功率需求和输入序列长度下训练模型。实验表明，在公路燃油经济性测试（HFET）循环中，采用基于 DT 的行动者和基于 GRU 的评论家的 SAC 智能体在燃油节省方面与动态规划（DP）相差不超过 1.8%，而在行动者和评论家网络中均使用 GRU 的 SAC 智能体以及前馈神经网络（FFN）行动者-评论家智能体则分别与 DP 相差不超过 3.16%和 3.43%。在未见过的行驶工况（US06 和重型柴油卡车（HHDDT）巡航段）上，具备序列感知的通用化智能体始终优于基于前馈网络（FFN）的智能体，突显了它们在真实环境中的适应性与鲁棒性。

Subjects: Systems and Control, Artificial Intelligence, Machine Learning 主题：系统与控制，人工智能，机器学习

Publish: 2025-08-06 20:53:11 UTC 发表：2025-08-06 20:53:11 UTC

#140 Provable Post-Training Quantization: Theoretical Analysis of OPTQ and Qronos #140 可证明的后训练量化：OPTQ 与 Qronos 的理论分析

Authors: [Haoyu Zhang](https://arxiv.org/search/?searchtype=author&query=Haoyu Zhang), [Shihao Zhang](https://arxiv.org/search/?searchtype=author&query=Shihao Zhang), [Ian Colbert](https://arxiv.org/search/?searchtype=author&query=Ian Colbert), [Rayan Saab](https://arxiv.org/search/?searchtype=author&query=Rayan Saab) 作者：Haoyu Zhang、Shihao Zhang、Ian Colbert、Rayan Saab

Post-training quantization (PTQ) has become a crucial tool for reducing the memory and compute costs of modern deep neural networks, including large language models (LLMs). Among PTQ algorithms, the OPTQ framework-also known as GPTQ-has emerged as a leading method due to its computational efficiency and strong empirical performance. Despite its widespread adoption, however, OPTQ lacks rigorous quantitative theoretical guarantees. This paper presents the first quantitative error bounds for both deterministic and stochastic variants of OPTQ, as well as for Qronos, a recent related state-of-the-art PTQ algorithm. We analyze how OPTQ’s iterative procedure induces quantization error and derive non-asymptotic 2-norm error bounds that depend explicitly on the calibration data and a regularization parameter that OPTQ uses. Our analysis provides theoretical justification for several practical design choices, including the widely used heuristic of ordering features by decreasing norm, as well as guidance for selecting the regularization parameter. For the stochastic variant, we establish stronger infinity-norm error bounds, which enable control over the required quantization alphabet and are particularly useful for downstream layers and nonlinearities. Finally, we extend our analysis to Qronos, providing new theoretical bounds, for both its deterministic and stochastic variants, that help explain its empirical advantages. 训练后量化（PTQ）已成为降低现代深度神经网络（包括大型语言模型 LLMs）内存和计算成本的关键工具。在 PTQ 算法中，OPTQ 框架——也称为 GPTQ——由于其计算效率和良好的经验性能，已成为一种领先方法。然而，尽管被广泛采用，OPTQ 仍缺乏严格的定量理论保证。本文给出了 OPTQ 的确定性和随机性变体以及最近相关的最先进 PTQ 算法 Qronos 的首个定量误差界。我们分析了 OPTQ 的迭代过程如何引入量化误差，并推导出依赖于校准数据和 OPTQ 使用的正则化参数的显式非渐近二范数误差界。我们的分析为若干实际设计选择提供了理论依据，包括广泛使用的按特征范数递减排序的启发式方法，并为正则化参数的选择提供了指导。对于随机变体，我们建立了更强的无穷范数误差界，这使得能够控制所需的量化字母表，并对后续层和非线性函数尤其有用。最后，我们将分析扩展到 Qronos，为其确定性和随机变体提供了新的理论界，这有助于解释其经验上的优势。

Subjects: Machine Learning, Artificial Intelligence, Information Theory, Numerical Analysis 主题：机器学习、人工智能、信息论、数值分析

Publish: 2025-08-06 20:00:40 UTC 发表：2025-08-06 20:00:40 UTC

#141 Multi-Stage Knowledge-Distilled VGAE and GAT for Robust Controller-Area-Network Intrusion Detection #141 多阶段知识蒸馏的 VGAE 与 GAT 用于鲁棒的控制器局域网络入侵检测

Authors: [Robert Frenken](https://arxiv.org/search/?searchtype=author&query=Robert Frenken), [Sidra Ghayour Bhatti](https://arxiv.org/search/?searchtype=author&query=Sidra Ghayour Bhatti), [Hanqin Zhang](https://arxiv.org/search/?searchtype=author&query=Hanqin Zhang), [Qadeer Ahmed](https://arxiv.org/search/?searchtype=author&query=Qadeer Ahmed) 作者：Robert Frenken、Sidra Ghayour Bhatti、Hanqin Zhang、Qadeer Ahmed

The Controller Area Network (CAN) protocol is a standard for in-vehicle communication but remains susceptible to cyber-attacks due to its lack of built-in security. This paper presents a multi-stage intrusion detection framework leveraging unsupervised anomaly detection and supervised graph learning tailored for automotive CAN traffic. Our architecture combines a Variational Graph Autoencoder (VGAE) for structural anomaly detection with a Knowledge-Distilled Graph Attention Network (KD-GAT) for robust attack classification. CAN bus activity is encoded as graph sequences to model temporal and relational dependencies. The pipeline applies VGAE-based selective undersampling to address class imbalance, followed by GAT classification with optional score-level fusion. The compact student GAT achieves 96% parameter reduction compared to the teacher model while maintaining strong predictive performance. Experiments on six public CAN intrusion datasets–Car-Hacking, Car-Survival, and can-train-and-test–demonstrate competitive accuracy and efficiency, with average improvements of 16.2% in F1-score over existing methods, particularly excelling on highly imbalanced datasets with up to 55% F1-score improvements. 控制器局域网络（CAN）协议是车载通信的一个标准，但由于缺乏内置安全机制，仍然容易受到网络攻击。本文提出了一个多阶段入侵检测框架，结合无监督异常检测和针对汽车 CAN 流量的有监督图学习。我们的架构将用于结构异常检测的变分图自编码器（VGAE）与用于鲁棒攻击分类的知识蒸馏图注意力网络（KD-GAT）结合起来。CAN 总线活动被编码为图序列，以建模时间和关联依赖关系。该流水线应用基于 VGAE 的选择性欠采样来应对类别不平衡，随后进行 GAT 分类并可选择进行分数级融合。紧凑的学生 GAT 在保持强预测性能的同时，与教师模型相比实现了 96%的参数减少。在六个公开的 CAN 入侵数据集——Car-Hacking、Car-Survival 和 can-train-and-test——上的实验表明，本方法在准确性和效率上具有竞争力，平均在 F1 分数上比现有方法提高了 16.2%，尤其在高度不平衡的数据集上表现卓越，F1 分数最高提升可达 55%。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-06 19:50:26 UTC 发布：2025-08-06 19:50:26 UTC

#142 Persistent Instability in LLM's Personality Measurements: Effects of Scale, Reasoning, and Conversation History #142 LLM 人格测量的持续不稳定性：规模、推理和对话历史的影响 [PDF 1 ] [Copy] [Kimi ] [REL]

Large language models require consistent behavioral patterns for safe deployment, yet their personality-like traits remain poorly understood. We present PERSIST (PERsonality Stability in Synthetic Text), a comprehensive evaluation framework testing 25+ open-source models (1B-671B parameters) across 500,000+ responses. Using traditional (BFI-44, SD3) and novel LLM-adapted personality instruments, we systematically vary question order, paraphrasing, personas, and reasoning modes. Our findings challenge fundamental deployment assumptions: (1) Even 400B+ models exhibit substantial response variability (SD > 0.4); (2) Minor prompt reordering alone shifts personality measurements by up to 20%; (3) Interventions expected to stabilize behavior, such as chain-of-thought reasoning, detailed personas instruction, inclusion of conversation history, can paradoxically increase variability; (4) LLM-adapted instruments show equal instability to human-centric versions, confirming architectural rather than translational limitations. This persistent instability across scales and mitigation strategies suggests current LLMs lack the foundations for genuine behavioral consistency. For safety-critical applications requiring predictable behavior, these findings indicate that personality-based alignment strategies may be fundamentally inadequate. 大型语言模型在安全部署时需要一致的行为模式，但其类人格特性仍然知之甚少。我们提出了 PERSIST（合成文本中的人格稳定性），这是一个全面的评估框架，对 25+ 个开源模型（1B–671B 参数）在 50 万+ 次回答上进行了测试。使用传统（BFI-44、SD3）和新颖的适配 LLM 的人格量表，我们系统性地变换问题顺序、改写、人物设定和推理模式。我们的发现挑战了部署中的基本假设：（1）即便是 400B+ 的模型也表现出显著的回答波动（标准差 > 0.4）；（2）仅仅微小的提示重排就会使人格测量发生高达 20% 的变化；（3）本应稳定行为的干预措施，例如链式思维推理、详尽的人物指令、包含会话历史，反而可能增加波动；（4）适配 LLM 的量表与以人为中心的量表一样不稳定，证实了问题在于架构而非翻译。跨规模和缓解策略持续存在的不稳定性表明，当前的 LLM 缺乏真正行为一致性的基础。对于需要可预测行为的安全关键应用，这些发现表明基于个性（personality）的对齐策略可能从根本上不足。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-06 19:11:33 UTC 发布：2025-08-06 19:11:33 UTC

#143 Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off #143 Voost：一种用于双向虚拟试穿与脱衣的统一且可扩展的扩散变换器

Authors: [Seungyong Lee](https://arxiv.org/search/?searchtype=author&query=Seungyong Lee), [Jeong-gi Kwak](https://arxiv.org/search/?searchtype=author&query=Jeong-gi Kwak) 作者：Seungyong Lee、Jeong-gi Kwak

Virtual try-on aims to synthesize a realistic image of a person wearing a target garment, but accurately modeling garment-body correspondence remains a persistent challenge, especially under pose and appearance variation. In this paper, we propose Voost - a unified and scalable framework that jointly learns virtual try-on and try-off with a single diffusion transformer. By modeling both tasks jointly, Voost enables each garment-person pair to supervise both directions and supports flexible conditioning over generation direction and garment category, enhancing garment-body relational reasoning without task-specific networks, auxiliary losses, or additional labels. In addition, we introduce two inference-time techniques: attention temperature scaling for robustness to resolution or mask variation, and self-corrective sampling that leverages bidirectional consistency between tasks. Extensive experiments demonstrate that Voost achieves state-of-the-art results on both try-on and try-off benchmarks, consistently outperforming strong baselines in alignment accuracy, visual fidelity, and generalization. 虚拟试穿旨在合成一个人穿着目标服装的逼真图像，但在姿势和外观变化下准确建模服装与人体的对应关系仍然是一个持续的挑战。在本文中，我们提出了 Voost——一个统一且可扩展的框架，它使用单一的扩散变换器联合学习虚拟试穿（try-on）和试脱（try-off）。通过联合建模这两项任务，Voost 使每对服装-人体样本能够相互监督两个方向，并支持在生成方向和服装类别上的灵活条件控制，从而在无需任务专用网络、辅助损失或额外标签的情况下增强服装与人体关系推理。此外，我们引入了两种推理时技术：用于提高对分辨率或掩码变化鲁棒性的注意力温度缩放，以及利用任务间双向一致性的自校正采样。大量实验表明，Voost 在试穿和试脱基准上均取得了最先进的结果，在对齐准确性、视觉逼真度和泛化能力方面持续优于强基线。

Subjects: Graphics, Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning 主题：图形学、人工智能、计算机视觉与模式识别、机器学习

Publish: 2025-08-06 19:10:58 UTC 发表时间：2025-08-06 19:10:58 UTC

#144 Automated File-Level Logging Generation for Machine Learning Applications using LLMs: A Case Study using GPT-4o Mini #144 使用 LLM 为机器学习应用自动生成文件级日志：以 GPT-4o Mini 为例的案例研究

Authors: [Mayra Sofia Ruiz Rodriguez](https://arxiv.org/search/?searchtype=author&query=Mayra Sofia Ruiz Rodriguez), [SayedHassan Khatoonabadi](https://arxiv.org/search/?searchtype=author&query=SayedHassan Khatoonabadi), [Emad Shihab](https://arxiv.org/search/?searchtype=author&query=Emad Shihab) 作者：Mayra Sofia Ruiz Rodriguez、SayedHassan Khatoonabadi、Emad Shihab

Logging is essential in software development, helping developers monitor system behavior and aiding in debugging applications. Given the ability of large language models (LLMs) to generate natural language and code, researchers are exploring their potential to generate log statements. However, prior work focuses on evaluating logs introduced in code functions, leaving file-level log generation underexplored – especially in machine learning (ML) applications, where comprehensive logging can enhance reliability. In this study, we evaluate the capacity of GPT-4o mini as a case study to generate log statements for ML projects at file level. We gathered a set of 171 ML repositories containing 4,073 Python files with at least one log statement. We identified and removed the original logs from the files, prompted the LLM to generate logs for them, and evaluated both the position of the logs and log level, variables, and text quality of the generated logs compared to human-written logs. In addition, we manually analyzed a representative sample of generated logs to identify common patterns and challenges. We find that the LLM introduces logs in the same place as humans in 63.91% of cases, but at the cost of a high overlogging rate of 82.66%. Furthermore, our manual analysis reveals challenges for file-level logging, which shows overlogging at the beginning or end of a function, difficulty logging within large code blocks, and misalignment with project-specific logging conventions. While the LLM shows promise for generating logs for complete files, these limitations remain to be addressed for practical implementation. 日志记录在软件开发中至关重要，帮助开发者监控系统行为并辅助调试应用。鉴于大型语言模型（LLMs）生成自然语言和代码的能力，研究人员正在探索它们生成日志语句的潜力。然而，以往工作侧重于评估在代码函数中引入的日志，文件级日志生成尚未得到充分研究——尤其是在机器学习（ML）应用中，全面的日志记录可以提高可靠性。在本研究中，我们以 GPT-4o mini 为案例评估其在文件级为 ML 项目生成日志语句的能力。我们收集了一组包含 171 个 ML 仓库的样本，这些仓库包含 4,073 个至少含有一条日志语句的 Python 文件。我们识别并移除了文件中的原始日志，提示 LLM 为这些文件生成日志，并对生成的日志在位置、日志级别、变量和文本质量方面与人工编写的日志进行了比较评估。此外，我们对一份具有代表性的生成日志样本进行了人工分析，以识别常见模式和挑战。我们发现 LLM 在 63.91% 的情况下在与人类相同的位置引入了日志，但代价是 82.66% 的高过度记录率。此外，我们的人工分析揭示了文件级日志记录的挑战：在函数的开始或结束处存在过度记录问题、在大型代码块内记录困难，以及与项目特定日志约定不一致。尽管 LLM 在为完整文件生成日志方面显示出潜力，但这些限制在实际应用中仍需解决。

Subjects: Software Engineering, Artificial Intelligence, Machine Learning 主题：软件工程、人工智能、机器学习

Publish: 2025-08-06 18:57:51 UTC 发表：2025-08-06 18:57:51 UTC

#145 CoMAD: A Multiple-Teacher Self-Supervised Distillation Framework #145 CoMAD：一种多教师自监督蒸馏框架

Authors: [Sriram Mandalika](https://arxiv.org/search/?searchtype=author&query=Sriram Mandalika), [Lalitha V](https://arxiv.org/search/?searchtype=author&query=Lalitha V)

Numerous self-supervised learning paradigms, such as contrastive learning and masked image modeling, learn powerful representations from unlabeled data but are typically pretrained in isolation, overlooking complementary insights and yielding large models that are impractical for resource-constrained deployment. To overcome these challenges, we introduce Consensus-oriented Masked Distillation (CoMAD), a lightweight, parameter-free framework that unifies knowledge from multiple current state-of-the-art self-supervised Vision Transformers into a compact student network. CoMAD distills from three pretrained ViT-Base teachers, MAE, MoCo v3, and iBOT, each offering distinct semantic and contextual priors. Rather than naively averaging teacher outputs, we apply asymmetric masking: the student sees only 25 percent of patches while each teacher receives a progressively lighter, unique mask, forcing the student to interpolate missing features under richer contexts. Teacher embeddings are aligned to the student’s space via a linear adapter and layer normalization, then fused through our joint consensus gating, which weights each token by combining cosine affinity with inter-teacher agreement. The student is trained with dual-level KL divergence on visible tokens and reconstructed feature maps, capturing both local and global structure. On ImageNet-1K, CoMAD’s ViT-Tiny achieves 75.4 percent Top-1, an increment of 0.4 percent over the previous state-of-the-art. In dense-prediction transfers, it attains 47.3 percent mIoU on ADE20K, and 44.5 percent box average precision and 40.5 percent mask average precision on MS-COCO, establishing a new state-of-the-art in compact SSL distillation. 许多自监督学习范式，例如对比学习和掩码图像建模，能够从无标签数据中学习强大的表示，但通常是各自独立地进行预训练，忽视了互补的见解，并产生了在资源受限环境下难以部署的大型模型。为了解决这些挑战，我们提出了面向共识的掩码蒸馏（Consensus-oriented Masked Distillation，CoMAD），这是一种轻量、无参数的框架，将多种当前最先进的自监督视觉 Transformer 的知识统一到一个紧凑的学生网络中。CoMAD 从三个预训练的 ViT-Base 教师模型蒸馏：MAE、MoCo v3 和 iBOT，它们各自提供不同的语义和上下文先验。我们并不是简单地对教师输出取平均，而是采用不对称掩码：学生仅看到 25%的图块，而每个教师则收到逐渐变轻且独特的掩码，迫使学生在更丰富的上下文下插值缺失特征。教师嵌入通过线性适配器和层归一化被对齐到学生的空间，然后通过我们的联合共识门控进行融合，该门控通过将余弦相似度与教师间一致性相结合来对每个标记加权。学生在可见标记和重建特征图上以双层级的 KL 散度进行训练，既捕捉局部结构又捕捉全局结构。在 ImageNet-1K 上，CoMAD 的 ViT-Tiny 实现了 75.4%的 Top-1，较先前的最先进水平提高了 0.4 个百分点。在密集预测迁移任务中，它在 ADE20K 上达到 47.3%的 mIoU，在 MS-COCO 上取得 44.5%的框平均精度和 40.5%的掩码平均精度，在紧凑自监督蒸馏领域建立了新的最先进记录。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-08-06 18:55:14 UTC

#146 Optimality Principles and Neural Ordinary Differential Equations-based Process Modeling for Distributed Control #146 最优性原理与基于神经常微分方程的分布式控制过程建模

Authors: [Michael R. Wartmann](https://arxiv.org/search/?searchtype=author&query=Michael R. Wartmann), [B. Erik Ydstie](https://arxiv.org/search/?searchtype=author&query=B. Erik Ydstie) 作者：Michael R. Wartmann，B. Erik Ydstie

Most recent advances in machine learning and analytics for process control pose the question of how to naturally integrate new data-driven methods with classical process models and control. We propose a process modeling framework enabling integration of data-driven algorithms through consistent topological properties and conservation of extensive quantities. Interconnections among process network units are represented through connectivity matrices and network graphs. We derive the system’s natural objective function equivalent to the non-equilibrium entropy production in a steady state system as a driving force for the process dynamics. We illustrate how distributed control and optimization can be implemented into process network structures and how control laws and algorithms alter the system’s natural equilibrium towards engineered objectives. The basic requirement is that the flow conditions can be expressed in terms of conic sector (passivity) conditions. Our formalism allows integration of fundamental conservation properties from topology with learned dynamic relations from data through sparse deep neural networks. We demonstrate in a practical example of a simple inventory control system how to integrate the basic topology of a process with a neural network ordinary differential equation model. The system specific constitutive equations are left undescribed and learned by the neural ordinary differential equation algorithm using the adjoint method in combination with an adaptive ODE solver from synthetic time-series data. The resulting neural network forms a state space model for use in e.g. a model predictive control algorithm. 近年来在过程控制领域的机器学习和分析方面的最新进展提出了一个问题：如何将新的数据驱动方法自然地与经典的工艺模型和控制集成。我们提出了一个过程建模框架，通过一致的拓扑属性和广延量守恒来实现数据驱动算法的集成。过程网络单元之间的互连通过连通矩阵和网络图来表示。我们推导出系统的自然目标函数，其等价于稳态系统中的非平衡熵产生，作为过程动力学的驱动力。我们说明了如何在过程网络结构中实现分布式控制和优化，以及控制律和算法如何将系统的自然平衡朝向工程化目标改变。基本要求是流动条件可以用圆锥扇区（耗散）条件来表达。我们的形式主义允许通过稀疏深度神经网络将来自拓扑的基本守恒性质与从数据中学习的动态关系相整合。我们在一个简单库存控制系统的实际示例中展示了如何将工艺的基本拓扑与神经网络常微分方程模型结合。系统特定的本构方程未被明确描述，而是由神经常微分方程算法通过伴随方法结合自适应 ODE 求解器从合成时间序列数据中学习得到。得到的神经网络构成了一个状态空间模型，可用于例如模型预测控制算法。

Subjects: Neural and Evolutionary Computing, Artificial Intelligence, Machine Learning, Systems and Control 主题：神经与进化计算、人工智能、机器学习、系统与控制

Publish: 2025-08-06 18:16:46 UTC 发布：2025-08-06 18:16:46 UTC

#147 Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization #147 奇偶感知的字节对编码：在分词中提升跨语言公平性

Tokenization is the first – and often least scrutinized – step of most NLP pipelines. Standard algorithms for learning tokenizers rely on frequency-based objectives, which favor languages dominant in the training data and consequently leave lower-resource languages with tokenizations that are disproportionately longer, morphologically implausible, or even riddled with <UNK> placeholders. This phenomenon ultimately amplifies computational and financial inequalities between users from different language backgrounds. To remedy this, we introduce Parity-aware Byte Pair Encoding (BPE), a variant of the widely-used BPE algorithm. At every merge step, Parity-aware BPE maximizes the compression gain of the currently worst-compressed language, trading a small amount of global compression for cross-lingual parity. We find empirically that Parity-aware BPE leads to more equitable token counts across languages, with negligible impact on global compression rate and no substantial effect on language-model performance in downstream tasks. 分词是大多数自然语言处理流水线中的第一步——也是最少受到审查的一步。用于学习分词器的标准算法依赖基于频率的目标，这会偏向训练数据中占主导地位的语言，从而使资源较少的语言的分词结果显得不成比例地更长、形态上不合理，甚至充斥着 <UNK> 占位符。该现象最终加剧了来自不同语言背景的用户之间在计算和经济上的不平等。为了解决这一问题，我们提出了“考虑公平性的字节对编码（Parity-aware BPE）”，这是广泛使用的 BPE 算法的一个变体。在每一次合并步骤中，Parity-aware BPE 最大化当前压缩最差语言的压缩增益，以在跨语言公平性上进行权衡，牺牲少量的全局压缩率。我们的实证发现表明，Parity-aware BPE 能够在各语言间实现更均衡的标记数量，对全局压缩率的影响可忽略，并且对下游任务中的语言模型性能没有实质性影响。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-06 18:14:43 UTC 发布：2025-08-06 18:14:43 UTC

#148 Enhancing Dialogue Annotation with Speaker Characteristics Leveraging a Frozen LLM #148 利用冻结的 LLM 增强带有说话者特征的对话标注

In dialogue transcription pipelines, Large Language Models (LLMs) are frequently employed in post-processing to improve grammar, punctuation, and readability. We explore a complementary post-processing step: enriching transcribed dialogues by adding metadata tags for speaker characteristics such as age, gender, and emotion. Some of the tags are global to the entire dialogue, while some are time-variant. Our approach couples frozen audio foundation models, such as Whisper or WavLM, with a frozen LLAMA language model to infer these speaker attributes, without requiring task-specific fine-tuning of either model. Using lightweight, efficient connectors to bridge audio and language representations, we achieve competitive performance on speaker profiling tasks while preserving modularity and speed. Additionally, we demonstrate that a frozen LLAMA model can compare x-vectors directly, achieving an Equal Error Rate of 8.8% in some scenarios. 在对话转录流程中，大型语言模型（LLMs）常被用于后处理以改进语法、标点和可读性。我们探讨一种互补的后处理步骤：通过为说话人特征（如年龄、性别和情绪）添加元数据标签来丰富转录对话。有些标签是对整个对话全局适用的，而有些随时间变化。我们的方法将冻结的音频基础模型（如 Whisper 或 WavLM）与一个冻结的 LLAMA 语言模型相结合，以推断这些说话人属性，无需对任一模型进行任务特定的微调。通过使用轻量、高效的连接器来桥接音频和语言表示，我们在说话人画像任务上取得了具有竞争力的表现，同时保持了模块化和速度优势。此外，我们证明了一个冻结的 LLAMA 模型可以直接比较 x-vectors，在某些场景下实现了 8.8% 的等错误率（Equal Error Rate）。

Subjects: Computation and Language, Artificial Intelligence, Sound, Audio and Speech Processing 主题：计算与语言、人工智能、声音、音频与语音处理

Publish: 2025-08-06 18:14:04 UTC 发布时间：2025-08-06 18:14:04 UTC

#149 Evaluating the Impact of LLM-guided Reflection on Learning Outcomes with Interactive AI-Generated Educational Podcasts #149 评估 LLM 指导的反思对交互式 AI 生成教育播客学习结果的影响

Authors: [Vishnu Menon](https://arxiv.org/search/?searchtype=author&query=Vishnu Menon), [Andy Cherney](https://arxiv.org/search/?searchtype=author&query=Andy Cherney), [Elizabeth B. Cloude](https://arxiv.org/search/?searchtype=author&query=Elizabeth B. Cloude), [Li Zhang](https://arxiv.org/search/?searchtype=author&query=Li Zhang), [Tiffany D. Do](https://arxiv.org/search/?searchtype=author&query=Tiffany D. Do) 作者：Vishnu Menon、Andy Cherney、Elizabeth B. Cloude、Li Zhang、Tiffany D. Do

This study examined whether embedding LLM-guided reflection prompts in an interactive AI-generated podcast improved learning and user experience compared to a version without prompts. Thirty-six undergraduates participated, and while learning outcomes were similar across conditions, reflection prompts reduced perceived attractiveness, highlighting a call for more research on reflective interactivity design. 本研究考察了在交互式 AI 生成的播客中嵌入由 LLM 指导的反思提示，是否能相比没有提示的版本提升学习和用户体验。共有三十六名本科生参与，尽管各条件下的学习成果相似，反思提示降低了感知吸引力，这凸显了对反思性交互设计进行更多研究的必要性。

Subjects: Human-Computer Interaction, Artificial Intelligence 主题：人机交互，人工智能

Publish: 2025-08-06 18:03:42 UTC 发布时间：2025-08-06 18:03:42 UTC

#150 Uncertainty-aware Predict-Then-Optimize Framework for Equitable Post-Disaster Power Restoration #150 不确定性感知的“先预测后优化”框架用于公平的灾后电力恢复

Authors: [Lin Jiang](https://arxiv.org/search/?searchtype=author&query=Lin Jiang), [Dahai Yu](https://arxiv.org/search/?searchtype=author&query=Dahai Yu), [Rongchao Xu](https://arxiv.org/search/?searchtype=author&query=Rongchao Xu), [Tian Tang](https://arxiv.org/search/?searchtype=author&query=Tian Tang), [Guang Wang](https://arxiv.org/search/?searchtype=author&query=Guang Wang) 作者：Lin Jiang、Dahai Yu、Rongchao Xu、Tian Tang、Guang Wang

The increasing frequency of extreme weather events, such as hurricanes, highlights the urgent need for efficient and equitable power system restoration. Many electricity providers make restoration decisions primarily based on the volume of power restoration requests from each region. However, our data-driven analysis reveals significant disparities in request submission volume, as disadvantaged communities tend to submit fewer restoration requests. This disparity makes the current restoration solution inequitable, leaving these communities vulnerable to extended power outages. To address this, we aim to propose an equity-aware power restoration strategy that balances both restoration efficiency and equity across communities. However, achieving this goal is challenging for two reasons: the difficulty of predicting repair durations under dataset heteroscedasticity, and the tendency of reinforcement learning agents to favor low-uncertainty actions, which potentially undermine equity. To overcome these challenges, we design a predict-then-optimize framework called EPOPR with two key components: (1) Equity-Conformalized Quantile Regression for uncertainty-aware repair duration prediction, and (2) Spatial-Temporal Attentional RL that adapts to varying uncertainty levels across regions for equitable decision-making. Experimental results show that our EPOPR effectively reduces the average power outage duration by 3.60% and decreases inequity between different communities by 14.19% compared to state-of-the-art baselines. 极端天气事件（如飓风）的频发凸显了高效且公平的电力系统恢复的紧迫性。许多电力供应商在做恢复决策时主要依据来自各地区的电力恢复请求数量。然而，我们的数据驱动分析显示，请求提交量存在显著差异，弱势社区往往提交更少的恢复请求。这一差异使得现行的恢复方案不公平，导致这些社区更易遭受长期停电。为了解决这一问题，我们旨在提出一种兼顾修复效率与社区公平性的关注公平的电力恢复策略。然而，要实现这一目标面临两方面的挑战：在数据集异方差性条件下预测修复时长的困难，以及强化学习智能体倾向于偏好低不确定性动作，这可能会削弱公平性。为了解决这些挑战，我们设计了一个名为 EPOPR 的“先预测再优化”框架，包含两个关键组件： (1) 面向公平的置信量回归（Equity-Conformalized Quantile Regression），用于考虑不确定性的修复时长预测；以及 (2) 时空注意力强化学习（Spatial-Temporal Attentional RL），能够根据各区域不同的不确定性水平进行自适应以实现公平决策。实验结果表明，与最先进的基线方法相比，我们的 EPOPR 有效将平均停电持续时间减少了 3.60%，并将不同社区之间的不公平性降低了 14.19%。

Subjects: Machine Learning, Artificial Intelligence, Social and Information Networks 主题：机器学习、人工智能、社会与信息网络

Publish: 2025-08-06 18:00:30 UTC 发表：2025-08-06 18:00:30 UTC

#151 ERDES: A Benchmark Video Dataset for Retinal Detachment and Macular Status Classification in Ocular Ultrasound #151 ERDES：一个用于眼部超声视网膜脱离和黄斑状态分类的基准视频数据集

Authors: [Pouyan Navard](https://arxiv.org/search/?searchtype=author&query=Pouyan Navard), [Yasemin Ozkut](https://arxiv.org/search/?searchtype=author&query=Yasemin Ozkut), [Srikar Adhikari](https://arxiv.org/search/?searchtype=author&query=Srikar Adhikari), [Elaine Situ-LaCasse](https://arxiv.org/search/?searchtype=author&query=Elaine Situ-LaCasse), [Josie Acuña](https://arxiv.org/search/?searchtype=author&query=Josie Acuña), [Adrienne Yarnish](https://arxiv.org/search/?searchtype=author&query=Adrienne Yarnish), [Alper Yilmaz](https://arxiv.org/search/?searchtype=author&query=Alper Yilmaz) 作者：Pouyan Navard、Yasemin Ozkut、Srikar Adhikari、Elaine Situ-LaCasse、Josie Acuña、Adrienne Yarnish、Alper Yilmaz

Retinal detachment (RD) is a vision-threatening condition that requires timely intervention to preserve vision. Macular involvement – whether the macula is still intact (macula-intact) or detached (macula-detached) – is the key determinant of visual outcomes and treatment urgency. Point-of-care ultrasound (POCUS) offers a fast, non-invasive, cost-effective, and accessible imaging modality widely used in diverse clinical settings to detect RD. However, ultrasound image interpretation is limited by a lack of expertise among healthcare providers, especially in resource-limited settings. Deep learning offers the potential to automate ultrasound-based assessment of RD. However, there are no ML ultrasound algorithms currently available for clinical use to detect RD and no prior research has been done on assessing macular status using ultrasound in RD cases – an essential distinction for surgical prioritization. Moreover, no public dataset currently supports macular-based RD classification using ultrasound video clips. We introduce Eye Retinal DEtachment ultraSound, ERDES, the first open-access dataset of ocular ultrasound clips labeled for (i) presence of retinal detachment and (ii) macula-intact versus macula-detached status. The dataset is intended to facilitate the development and evaluation of machine learning models for detecting retinal detachment. We also provide baseline benchmarks using multiple spatiotemporal convolutional neural network (CNN) architectures. All clips, labels, and training code are publicly available at https://osupcvlab.github.io/ERDES/. 视网膜脱离（RD）是一种威胁视力的疾病，需要及时干预以保全视力。黄斑是否受累——即黄斑仍然完整（黄斑未脱离）或已脱离（黄斑脱离）——是决定视觉结局和治疗紧迫性的关键因素。床旁超声（POCUS）提供了一种快速、无创、低成本且易获取的影像方式，广泛用于各种临床环境中以检测视网膜脱离。然而，超声图像的解读受限于医疗提供者缺乏专业知识，尤其是在资源有限的环境中。深度学习有可能实现基于超声的视网膜脱离自动评估。然而，目前尚无可用于临床的机器学习超声算法来检测视网膜脱离，也没有先前研究使用超声评估视网膜脱离病例中的黄斑状态——而这对于手术优先级的划分至关重要。此外，目前也没有公开数据集支持使用超声视频片段进行基于黄斑的视网膜脱离分类。我们介绍了眼球视网膜脱离超声数据集（Eye Retinal DEtachment ultraSound，ERDES），这是首个开放获取的眼科超声视频剪辑数据集，标注内容包括（i）是否存在视网膜脱离，以及（ii）黄斑区完整（macula-intact）与黄斑区脱离（macula-detached）状态。该数据集旨在促进用于检测视网膜脱离的机器学习模型的开发与评估。我们还使用多种时空卷积神经网络（CNN）架构提供了基线基准测试。所有视频剪辑、标签和训练代码均可在 https://osupcvlab.github.io/ERDES/ 获取。

Subjects: Quantitative Methods, Artificial Intelligence 主题：定量方法，人工智能

Publish: 2025-08-05 21:55:54 UTC 发布：2025-08-05 21:55:54 UTC

#152 Cross-Domain Image Synthesis: Generating H&E from Multiplex Biomarker Imaging #152 跨域图像合成：从多重生物标志物成像生成 H&E

Authors: [Jillur Rahman Saurav](https://arxiv.org/search/?searchtype=author&query=Jillur Rahman Saurav), [Mohammad Sadegh Nasr](https://arxiv.org/search/?searchtype=author&query=Mohammad Sadegh Nasr), [Jacob M. Luber](https://arxiv.org/search/?searchtype=author&query=Jacob M. Luber) 作者：Jillur Rahman Saurav、Mohammad Sadegh Nasr、Jacob M. Luber

While multiplex immunofluorescence (mIF) imaging provides deep, spatially-resolved molecular data, integrating this information with the morphological standard of Hematoxylin & Eosin (H&E) can be very important for obtaining complementary information about the underlying tissue. Generating a virtual H&E stain from mIF data offers a powerful solution, providing immediate morphological context. Crucially, this approach enables the application of the vast ecosystem of H&E-based computer-aided diagnosis (CAD) tools to analyze rich molecular data, bridging the gap between molecular and morphological analysis. In this work, we investigate the use of a multi-level Vector-Quantized Generative Adversarial Network (VQGAN) to create high-fidelity virtual H&E stains from mIF images. We rigorously evaluated our VQGAN against a standard conditional GAN (cGAN) baseline on two publicly available colorectal cancer datasets, assessing performance on both image similarity and functional utility for downstream analysis. Our results show that while both architectures produce visually plausible images, the virtual stains generated by our VQGAN provide a more effective substrate for computer-aided diagnosis. Specifically, downstream nuclei segmentation and semantic preservation in tissue classification tasks performed on VQGAN-generated images demonstrate superior performance and agreement with ground-truth analysis compared to those from the cGAN. This work establishes that a multi-level VQGAN is a robust and superior architecture for generating scientifically useful virtual stains, offering a viable pathway to integrate the rich molecular data of mIF into established and powerful H&E-based analytical workflows. 虽然多重免疫荧光（mIF）成像提供了深度的、具有空间分辨率的分子数据，但将这些信息与苏木精与伊红（H&E）这种形态学标准结合，对于获取关于底层组织的互补信息非常重要。由 mIF 数据生成虚拟 H&E 染色提供了一种强有力的解决方案，能够即时提供形态学背景。更关键的是，这一方法使得庞大的基于 H&E 的计算机辅助诊断（CAD）工具生态可以用于分析丰富的分子数据，从而弥合分子分析与形态学分析之间的鸿沟。在本研究中，我们研究了使用多层向量量化生成对抗网络（VQGAN）从 mIF 图像创建高保真虚拟 H&E 染色的方法。我们在两个公开的结直肠癌数据集上，严格地将我们的 VQGAN 与一个标准的条件 GAN（cGAN）基线进行了对比评估，考察了在图像相似性和用于下游分析的功能性效用方面的性能。我们的结果表明，尽管两种架构都能生成在视觉上可信的图像，但由我们的 VQGAN 生成的虚拟染色为计算机辅助诊断提供了更有效的基底。具体而言，在 VQGAN 生成的图像上进行的下游细胞核分割和组织分类任务中的语义保真度，相较于 cGAN 的结果表现出更优越的性能并与真实分析结果更为一致。这项工作证明，多层次 VQGAN 是生成具有科学价值的虚拟染色的一种稳健且更优秀的架构，为将 mIF 的丰富分子数据整合到既有且强大的基于 H&E 的分析工作流中提供了可行途径。

Subjects: Quantitative Methods, Artificial Intelligence, Image and Video Processing 主题：定量方法、人工智能、图像与视频处理

Publish: 2025-08-05 21:19:00 UTC 发表：2025-08-05 21:19:00 UTC

#153 Agency, Affordances, and Enculturation of Augmentation Technologies #153 代理性、可供性与增强技术的文化习得

Authors: [Ann Hill Duin](https://arxiv.org/search/?searchtype=author&query=Ann Hill Duin), [Isabel Pedersen](https://arxiv.org/search/?searchtype=author&query=Isabel Pedersen) 作者：Ann Hill Duin，Isabel Pedersen

Augmentation technologies are undergoing a process of enculturation due to many factors, one being the rise of artificial intelligence (AI), or what the World Intellectual Property Organization (WIPO) terms the AI wave or AI boom. Chapter 3 focuses critical attention on the hyped assumption that sophisticated, emergent, and embodied augmentation technologies will improve lives, literacy, cultures, arts, economies, and social contexts. The chapter begins by discussing the problem of ambiguity with AI terminology, which it aids with a description of the WIPO Categorization of AI Technologies Scheme. It then draws on media and communication studies to explore concepts such as agents, agency, power, and agentive relationships between humans and robots. The chapter focuses on the development of non-human agents in industry as a critical factor in the rise of augmentation technologies. It looks at how marketing communication enculturates future users to adopt and adapt to the technology. Scholars are charting the significant ways that people are drawn further into commercial digital landscapes, such as the Metaverse concept, in post-internet society. It concludes by examining recent claims concerning the Metaverse and augmented reality. 增强技术正因多种因素而经历文化内化的过程，其中之一便是人工智能（AI）的崛起，或如世界知识产权组织（WIPO）所称的“AI 浪潮”或“AI 繁荣”。第三章对这一被大肆渲染的假设持批判态度——即复杂的、具涌现性和具身性的增强技术将改善生活、素养、文化、艺术、经济和社会环境。章节首先讨论了与 AI 术语相关的模糊问题，并通过描述 WIPO 的 AI 技术分类方案来加以澄清。随后借助媒介与传播研究探讨了代理、能动性、权力以及人类与机器人之间的能动关系等概念。本章着重于工业中非人类代理的发展，认为这是增强技术兴起的关键因素。它考察了营销传播如何将未来用户文化化，使其接受并适应该技术。学者们正在描绘人们在后互联网社会中如何被进一步吸引进入诸如元宇宙概念等商业数字景观的显著方式。最后对有关元宇宙和增强现实的最新论断进行了审视。

Subjects: Computers and Society, Artificial Intelligence 主题：计算机与社会，人工智能

Publish: 2025-08-05 15:28:07 UTC 发表：2025-08-05 15:28:07 UTC

#154 Wearable Music2Emotion : Assessing Emotions Induced by AI-Generated Music through Portable EEG-fNIRS Fusion #154 可穿戴音乐到情绪：通过便携式 EEG-fNIRS 融合评估由 AI 生成音乐引发的情绪

Emotions critically influence mental health, driving interest in music-based affective computing via neurophysiological signals with Brain-computer Interface techniques. While prior studies leverage music’s accessibility for emotion induction, three key limitations persist: \textbf{(1) Stimulus Constraints}: Music stimuli are confined to small corpora due to copyright and curation costs, with selection biases from heuristic emotion-music mappings that ignore individual affective profiles. \textbf{(2) Modality Specificity}: Overreliance on unimodal neural data (e.g., EEG) ignores complementary insights from cross-modal signal fusion.\textbf{ (3) Portability Limitation}: Cumbersome setups (e.g., 64+ channel gel-based EEG caps) hinder real-world applicability due to procedural complexity and portability barriers. To address these limitations, we propose MEEtBrain, a portable and multimodal framework for emotion analysis (valence/arousal), integrating AI-generated music stimuli with synchronized EEG-fNIRS acquisition via a wireless headband. By MEEtBrain, the music stimuli can be automatically generated by AI on a large scale, eliminating subjective selection biases while ensuring music diversity. We use our developed portable device that is designed in a lightweight headband-style and uses dry electrodes, to simultaneously collect EEG and fNIRS recordings. A 14-hour dataset from 20 participants was collected in the first recruitment to validate the framework’s efficacy, with AI-generated music eliciting target emotions (valence/arousal). We are actively expanding our multimodal dataset (44 participants in the latest dataset) and make it publicly available to promote further research and practical applications. \textbf{The dataset is available at https://zju-bmi-lab.github.io/ZBra. 情绪对心理健康有着关键影响，推动了通过脑机接口技术利用神经生理信号进行基于音乐的情感计算的研究兴趣。尽管以往研究利用音乐易获取的特性来诱发情绪，但仍存在三大关键局限：\textbf{（1）刺激限制}：由于版权和策划成本，音乐刺激被限制在小规模语料库中，且基于启发式情绪—音乐映射的选择存在偏差，忽视了个体的情感特征。\textbf{（2）模态专属性}：过度依赖单一模态的神经数据（例如 EEG）忽略了跨模态信号融合所提供的互补见解。\textbf{（3）可携性限制}：笨重的装置（例如 64+ 通道的凝胶式 EEG 头套）因程序复杂性和便携性障碍而妨碍现实世界的应用。为了解决这些局限，我们提出了 MEEtBrain，一个可携带的多模态情绪分析（价度/唤醒度）框架，结合 AI 生成的音乐刺激与通过无线头带同步采集的 EEG-fNIRS。通过 MEEtBrain，音乐刺激可以由 AI 在大规模上自动生成，从而消除主观选择偏差并确保音乐的多样性。我们使用自行开发的便携设备，该设备设计为轻便的头带式并采用干电极，以同时收集 EEG 和 fNIRS 记录。在首次招募中收集了来自 20 名参与者的 14 小时数据集，以验证该框架的有效性，AI 生成的音乐能够引发目标情绪（愉悦度/唤醒度）。我们正在积极扩展我们的多模态数据集（最新数据集中有 44 名参与者）并将其公开以促进进一步的研究和实际应用。该数据集可在 https://zju-bmi-lab.github.io/ZBra 获取。

Subjects: Sound, Artificial Intelligence, Audio and Speech Processing 主题：声音、人工智能、音频与语音处理

Publish: 2025-08-05 12:25:35 UTC 发布：2025-08-05 12:25:35 UTC

#155 Toward Low-Latency End-to-End Voice Agents for Telecommunications Using Streaming ASR, Quantized LLMs, and Real-Time TTS #155 面向电信的低延迟端到端语音代理：使用流式 ASR、量化 LLMs 和实时 TTS

Authors: [Vignesh Ethiraj](https://arxiv.org/search/?searchtype=author&query=Vignesh Ethiraj), [Ashwath David](https://arxiv.org/search/?searchtype=author&query=Ashwath David), [Sidhanth Menon](https://arxiv.org/search/?searchtype=author&query=Sidhanth Menon), [Divya Vijay](https://arxiv.org/search/?searchtype=author&query=Divya Vijay) 作者：Vignesh Ethiraj、Ashwath David、Sidhanth Menon、Divya Vijay

We introduce a low-latency telecom AI voice agent pipeline for real-time, interactive telecommunications use, enabling advanced voice AI for call center automation, intelligent IVR (Interactive Voice Response), and AI-driven customer support. The solution is built for telecom, combining four specialized models by NetoAI: TSLAM, a 4-bit quantized Telecom-Specific Large Language Model (LLM); T-VEC, a Telecom-Specific Embedding Model; TTE, a Telecom-Specific Automatic Speech Recognition (ASR) model; and T-Synth, a Telecom-Specific Text-to-Speech (TTS) model. These models enable highly responsive, domain-adapted voice AI agents supporting knowledge-grounded spoken interactions with low latency. The pipeline integrates streaming ASR (TTE), conversational intelligence (TSLAM), retrieval augmented generation (RAG) over telecom documents, and real-time TTS (T-Synth), setting a new benchmark for telecom voice assistants. To evaluate the system, we built a dataset of 500 human-recorded telecom questions from RFCs, simulating real telecom agent queries. This framework allows analysis of latency, domain relevance, and real-time performance across the stack. Results show that TSLAM, TTE, and T-Synth deliver real-time factors (RTF) below 1.0, supporting enterprise, low-latency telecom deployments. These AI agents – powered by TSLAM, TTE, and T-Synth – provide a foundation for next-generation telecom AI, enabling automated customer support, diagnostics, and more. 我们引入了一种低延迟的电信 AI 语音代理流水线，面向实时互动的电信应用，能够实现用于呼叫中心自动化、智能交互式语音应答（IVR）和由 AI 驱动的客户支持的高级语音 AI。该解决方案为电信场景构建，由 NetoAI 的四个专用模型组合而成：TSLAM，一种 4 位量化的电信专用大型语言模型（LLM）；T-VEC，一种电信专用嵌入模型；TTE，一种电信专用的自动语音识别（ASR）模型；以及 T-Synth，一种电信专用的文本转语音（TTS）模型。这些模型使得支持基于知识的语音交互的语音 AI 代理具备高度响应性与领域适应性且延迟低。该流水线整合了流式 ASR（TTE）、会话智能（TSLAM）、基于电信文档的检索增强生成（RAG）以及实时 TTS（T-Synth），为电信语音助手设定了新的基准。为了评估该系统，我们构建了一个包含 500 条来自 RFC 的人工录制电信问题的数据集，用以模拟真实的电信代理查询。该框架允许对整套系统的延迟、领域相关性和实时性能进行分析。结果显示，TSLAM、TTE 和 T-Synth 的实时因子（RTF）均低于 1.0，支持企业级低延迟电信部署。这些由 TSLAM、TTE 和 T-Synth 提供支持的 AI 代理为下一代电信 AI 打下基础，能够实现自动化客户支持、诊断等。

Subjects: Sound, Artificial Intelligence, Audio and Speech Processing 主题：声音，人工智能，音频与语音处理

Publish: 2025-08-05 07:39:35 UTC 发布：2025-08-05 07:39:35 协调世界时（UTC）

#156 AI Should Be More Human, Not More Complex #156 AI 应该更有人性，而不是更复杂

Author: [Carlo Esposito](https://arxiv.org/search/?searchtype=author&query=Carlo Esposito) 作者：Carlo Esposito

Large Language Models (LLMs) in search applications increasingly prioritize verbose, lexically complex responses that paradoxically reduce user satisfaction and engagement. Through a comprehensive study of 10.000 (est.) participants comparing responses from five major AI-powered search systems, we demonstrate that users overwhelmingly prefer concise, source-attributed responses over elaborate explanations. Our analysis reveals that current AI development trends toward “artificial sophistication” create an uncanny valley effect where systems sound knowledgeable but lack genuine critical thinking, leading to reduced trust and increased cognitive load. We present evidence that optimal AI communication mirrors effective human discourse: direct, properly sourced, and honest about limitations. Our findings challenge the prevailing assumption that more complex AI responses indicate better performance, instead suggesting that human-like brevity and transparency are key to user engagement and system reliability. 在搜索应用中，大型语言模型（LLMs）越来越倾向于生成冗长、词汇复杂的回答，而这类回答反而讽刺性地降低了用户满意度和参与度。通过对约 10,000 名参与者进行的综合研究，比较了五个主要 AI 驱动搜索系统的回答，我们证明用户压倒性地更喜欢简洁且带有来源标注的回答，而非详尽的解释。我们的分析显示，当前 AI 开发趋向于“人工复杂化”造成了一种令人不安的谷效应：系统听起来博学却缺乏真正的批判性思维，导致信任度下降和认知负担增加。我们提供的证据表明，最佳的 AI 交流方式类似于有效的人类话语：直接、恰当标注来源，并坦诚其局限性。我们的研究结果挑战了“更复杂的 AI 回答意味着更好表现”的普遍假设，相反表明类人般的简洁与透明是提高用户参与度和系统可靠性的关键。

Subjects: Human-Computer Interaction, Artificial Intelligence 学科：人机交互、人工智能

Publish: 2025-07-27 15:55:52 UTC 发布：2025-07-27 15:55:52 UTC

#157 Hybrid Reward-Driven Reinforcement Learning for Efficient Quantum Circuit Synthesis #157 混合奖励驱动的强化学习用于高效量子电路综合

Authors: [Sara Giordano](https://arxiv.org/search/?searchtype=author&query=Sara Giordano), [Kornikar Sen](https://arxiv.org/search/?searchtype=author&query=Kornikar Sen), [Miguel A. Martin-Delgado](https://arxiv.org/search/?searchtype=author&query=Miguel A. Martin-Delgado) 作者：Sara Giordano、Kornikar Sen、Miguel A. Martin-Delgado

A reinforcement learning (RL) framework is introduced for the efficient synthesis of quantum circuits that generate specified target quantum states from a fixed initial state, addressing a central challenge in both the NISQ era and future fault-tolerant quantum computing. The approach utilizes tabular Q-learning, based on action sequences, within a discretized quantum state space, to effectively manage the exponential growth of the space dimension. The framework introduces a hybrid reward mechanism, combining a static, domain-informed reward that guides the agent toward the target state with customizable dynamic penalties that discourage inefficient circuit structures such as gate congestion and redundant state revisits. By leveraging sparse matrix representations and state-space discretization, the method enables scalable navigation of high-dimensional environments while minimizing computational overhead. Benchmarking on graph-state preparation tasks for up to seven qubits, we demonstrate that the algorithm consistently discovers minimal-depth circuits with optimized gate counts. Moreover, extending the framework to a universal gate set for arbitrary quantum states, it still produces minimal depth circuits, highlighting the algorithm’s robustness and adaptability. The results confirm that this RL-driven approach efficiently explores the complex quantum state space and synthesizes near-optimal quantum circuits, providing a resource-efficient foundation for quantum circuit optimization. 引入了一种强化学习（RL）框架，用于高效合成量子电路，使其从固定初态生成指定的目标量子态，解决了在 NISQ 时代及未来容错量子计算中都面临的一个核心挑战。该方法在离散化的量子态空间内，基于动作序列采用表格式 Q 学习，有效应对空间维度呈指数级增长的问题。框架引入了混合奖励机制，结合了静态的、基于领域知识的奖励以引导智能体接近目标态，以及可定制的动态惩罚以抑制低效的电路结构，例如门拥堵和多余的状态重访。通过利用稀疏矩阵表示和状态空间离散化，该方法在最小化计算开销的同时，实现了对高维环境的可扩展导航。在多达七比特的图态制备任务基准测试中，我们证明了该算法能持续发现具有优化门数的最小深度电路。此外，将该框架扩展到适用于任意量子态的通用门集时，它仍能产生极小深度的电路，凸显了该算法的稳健性和适应性。结果证实，这种由强化学习驱动的方法能够高效地探索复杂的量子态空间并合成近似最优的量子电路，为量子电路优化提供了资源高效的基础。

Subject: Quantum Physics 主题：量子物理

Publish: 2025-07-22 14:39:20 UTC 发布时间：2025-07-22 14:39:20 UTC

#158 How Robust are LLM-Generated Library Imports? An Empirical Study using Stack Overflow #158 LLM 生成的库导入有多稳健？使用 Stack Overflow 的实证研究

Authors: [Jasmine Latendresse](https://arxiv.org/search/?searchtype=author&query=Jasmine Latendresse), [SayedHassan Khatoonabadi](https://arxiv.org/search/?searchtype=author&query=SayedHassan Khatoonabadi), [Emad Shihab](https://arxiv.org/search/?searchtype=author&query=Emad Shihab) 作者：Jasmine Latendresse、SayedHassan Khatoonabadi、Emad Shihab

Software libraries are central to the functionality, security, and maintainability of modern code. As developers increasingly turn to Large Language Models (LLMs) to assist with programming tasks, understanding how these models recommend libraries is essential. In this paper, we conduct an empirical study of six state-of-the-art LLMs, both proprietary and open-source, by prompting them to solve real-world Python problems sourced from Stack Overflow. We analyze the types of libraries they import, the characteristics of those libraries, and the extent to which the recommendations are usable out of the box. Our results show that LLMs predominantly favour third-party libraries over standard ones, and often recommend mature, popular, and permissively licensed dependencies. However, we also identify gaps in usability: 4.6% of the libraries could not be resolved automatically due to structural mismatches between import names and installable packages, and only two models (out of six) provided installation guidance. While the generated code is technically valid, the lack of contextual support places the burden of manually resolving dependencies on the user. Our findings offer actionable insights for both developers and researchers, and highlight opportunities to improve the reliability and usability of LLM-generated code in the context of software dependencies. 软件库对于现代代码的功能性、安全性和可维护性至关重要。随着开发者越来越多地求助于大型语言模型（LLMs）来辅助编程任务，理解这些模型如何推荐软件库变得至关重要。在本文中，我们通过让六个最先进的 LLMs（包括专有和开源模型）对来自 Stack Overflow 的真实 Python 问题进行求解，开展了一项实证研究。我们分析了它们导入的库类型、这些库的特性以及推荐在多大程度上可以开箱即用。我们的结果表明，LLMs 主要偏好第三方库而非标准库，并且经常推荐成熟、流行且许可宽松的依赖项。然而，我们也发现了可用性方面的缺口：有 4.6% 的库由于导入名称与可安装包之间的结构不匹配而无法自动解析，且仅有两个模型（共六个）提供了安装指导。尽管生成的代码在技术上是有效的，但缺乏上下文支持使得手动解决依赖关系的负担落在了用户身上。我们的研究结果为开发者和研究人员提供了可行的见解，并强调了在软件依赖情境中提升 LLM 生成代码的可靠性和可用性的机会。

Subject: Software Engineering 主题：软件工程

Publish: 2025-07-14 21:35:29 UTC 发布：2025-07-14 21:35:29 UTC

#159 Evaluating the Use of LLMs for Documentation to Code Traceability #159 评估将 LLMs 用于从文档到代码可追溯性的效果

Authors: [Ebube Alor](https://arxiv.org/search/?searchtype=author&query=Ebube Alor), [SayedHassan Khatoonabadi](https://arxiv.org/search/?searchtype=author&query=SayedHassan Khatoonabadi), [Emad Shihab](https://arxiv.org/search/?searchtype=author&query=Emad Shihab)

Large Language Models (LLMs) offer new potential for automating documentation-to-code traceability, yet their capabilities remain underexplored. We present a comprehensive evaluation of LLMs (Claude 3.5 Sonnet, GPT-4o, and o3-mini) in establishing trace links between various software documentation (including API references and user guides) and source code. We create two novel datasets from two open-source projects (Unity Catalog and Crawl4AI). Through systematic experiments, we assess three key capabilities: (1) trace link identification accuracy, (2) relationship explanation quality, and (3) multi-step chain reconstruction. Results show that the best-performing LLM achieves F1-scores of 79.4% and 80.4% across the two datasets, substantially outperforming our baselines (TF-IDF, BM25, and CodeBERT). While fully correct relationship explanations range from 42.9% to 71.1%, partial accuracy exceeds 97%, indicating that fundamental connections are rarely missed. For multi-step chains, LLMs maintain high endpoint accuracy but vary in capturing precise intermediate links. Error analysis reveals that many false positives stem from naming-based assumptions, phantom links, or overgeneralization of architectural patterns. We demonstrate that task-framing, such as a one-to-many matching strategy, is critical for performance. These findings position LLMs as powerful assistants for trace discovery, but their limitations could necessitate human-in-the-loop tool design and highlight specific error patterns for future research. 大型语言模型（LLMs）为将文档与代码的可追溯性自动化提供了新的可能，但其能力尚未得到充分探索。我们对若干 LLMs（Claude 3.5 Sonnet、GPT-4o 和 o3-mini）在在各种软件文档（包括 API 参考和用户指南）与源代码之间建立追踪链接方面进行了全面评估。我们从两个开源项目（Unity Catalog 和 Crawl4AI）创建了两个新数据集。通过系统实验，我们评估了三项关键能力：（1）追踪链接识别准确性，（2）关系解释质量，以及（3）多步骤链的重构。结果显示，表现最好的 LLM 在两个数据集上的 F1 分别为 79.4% 和 80.4%，显著优于我们的基线（TF-IDF、BM25 和 CodeBERT）。虽然完全正确的关系解释范围为 42.9% 到 71.1%，但部分准确率超过 97%，表明基本关联很少被遗漏。在多步骤链任务中，LLMs 在终点准确性方面保持较高水平，但在捕捉精确的中间链接方面表现不一。错误分析表明，许多误报源于基于命名的假设、幻影链接或对架构模式的过度概括。我们证明了任务构型（例如一对多匹配策略）对性能至关重要。这些发现将 LLMs 定位为追踪发现的强大助手，但它们的局限性可能需要在环有人（human-in-the-loop）的工具设计，并为未来研究凸显特定的错误模式。

Subject: Software Engineering 主题：软件工程

Publish: 2025-06-19 16:18:53 UTC 发布：2025-06-19 16:18:53 UTC

#160 Reinforcement Learning Generation of 4-Qubits Entangled States #160 使用强化学习生成 4 量子比特纠缠态

Authors: [Sara Giordano](https://arxiv.org/search/?searchtype=author&query=Sara Giordano), [Miguel A. Martin-Delgado](https://arxiv.org/search/?searchtype=author&query=Miguel A. Martin-Delgado) 作者：Sara Giordano，Miguel A. Martin-Delgado

We have devised an artificial intelligence algorithm with machine reinforcement learning (Q-learning) to construct remarkable entangled states with 4 qubits. This way, the algorithm is able to generate representative states for some of the 49 true SLOCC classes of the four-qubit entanglement states. In particular, it is possible to reach at least one true SLOCC class for each of the nine entanglement families. The quantum circuits synthesized by the algorithm may be useful for the experimental realization of these important classes of entangled states and to draw conclusions about the intrinsic properties of our universe. We introduce a graphical tool called the state-link graph (SLG) to represent the construction of the Quality matrix (Q-matrix) used by the algorithm to build a given objective state belonging to the corresponding entanglement class. This allows us to discover the necessary connections between specific entanglement features and the role of certain quantum gates that the algorithm needs to include in the quantum gate set of actions. The quantum circuits found are optimal by construction with respect to the quantum gate-set chosen. These SLGs make the algorithm simple, intuitive and a useful resource for the automated construction of entangled states with a low number of qubits. 我们设计了一种结合机器强化学习（Q 学习）的人工智能算法，用以构造由 4 个量子比特组成的显著纠缠态。通过这种方式，算法能够为四量子比特纠缠态的 49 个真实 SLOCC 类别中的若干生成具有代表性的态。特别地，对于这九个纠缠族中的每一个，至少可以到达其中的一个真实 SLOCC 类别。该算法合成的量子电路可能对这些重要纠缠态类别的实验实现有用，并有助于对我们宇宙的内在属性得出结论。我们引入了一种称为态-连结图（SLG）的图形工具，用来表示算法构建对应纠缠类中给定目标态所用的质量矩阵（Q 矩阵）的过程。这使我们能够发现特定纠缠特征之间的必要联系，以及算法需要在量子门动作集合中包含某些量子门所起的作用。所发现的量子电路在所选量子门集合意义下按构造是最优的。这些 SLG 使得该算法简单、直观，并成为自动构建少量量子比特纠缠态的有用资源。

Subject: Quantum Physics 主题：量子物理学

Publish: 2022-04-26 14:46:58 UTC 发布日期：2022-04-26 14:46:58 UTC

2025-08-08科研追新

2025-08-08科研追新

1. 源数据

1.1 公众号

1.1.1 量子位

1.1.2 机器之心

1.1.3 新智元

1.1.4 AGI Hunt

1.1.5 其他

1.2 Arxiv

1.2.1 Computation and Language

#1 H-Net++: Hierarchical Dynamic Chunking for Tokenizer-Free Language Modelling in Morphologically-Rich Languages #1 H-Net++：用于形态丰富语言的无分词器语言建模的分层动态分块

#2 How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations #2 LLMs 如何劝服？线性探针可揭示多轮对话中的劝服动态

#3 Learning to Reason for Factuality #3 学习推理以提升真实性

#4 OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks #4 OmniEAR：在具身任务中基准测试代理推理

#5 Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models #5 Cooper：在用于大型语言模型的强化学习中共同优化策略模型与奖励模型 [PDF 12 ] [Copy] [Kimi 8 ] [REL]

#6 MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy #6 MathSmith：通过使用强化策略构造合成问题迈向极难数学推理

#7 Do Political Opinions Transfer Between Western Languages? An Analysis of Unaligned and Aligned Multilingual LLMs #7 政治观点会在西方语言之间转移吗？对未对齐与已对齐多语种 LLM 的分析

#8 Conformal Sets in Multiple-Choice Question Answering under Black-Box Settings with Provable Coverage Guarantees #8 在黑盒设置下具有可证明覆盖保证的多项选择题回答中的保形集合

#9 CoCoLex: Confidence-guided Copy-based Decoding for Grounded Legal Text Generation #9 CoCoLex：用于有根法律文本生成的基于复制的置信度引导解码

#10 The World According to LLMs: How Geographic Origin Influences LLMs' Entity Deduction Capabilities #10 根据 LLMs 的世界观：地理来源如何影响 LLMs 的实体推断能力

#11 LAG: Logic-Augmented Generation from a Cartesian Perspective #11 LAG：从笛卡尔视角的逻辑增强生成

#12 Rethinking Creativity Evaluation: A Critical Analysis of Existing Creativity Evaluations #12 重新思考创造力评估：对现有创造力评估的批判性分析

#13 TASE: Token Awareness and Structured Evaluation for Multilingual Language Models #13 TASE：面向多语言语言模型的标记感知与结构化评估

#14 LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models #14 LLMEval-3：一项关于大型语言模型稳健性与公平性评估的大规模纵向研究

#15 MyCulture: Exploring Malaysia's Diverse Culture under Low-Resource Language Constraints #15 MyCulture：在低资源语言约束下探索马来西亚多元文化

#16 The TUB Sign Language Corpus Collection #16 TUB 手语语料库集合

#17 Can Language Models Critique Themselves? Investigating Self-Feedback for Retrieval Augmented Generation at BioASQ 2025 #17 语言模型能自我批评吗？在 BioASQ 2025 上调查用于检索增强生成的自我反馈

#18 Evaluation of a Sign Language Avatar on Comprehensibility, User Experience & Acceptability #18 手语化身在可理解性、用户体验与可接受性方面的评估

#19 Efficient Reasoning for Large Reasoning Language Models via Certainty-Guided Reflection Suppression 确定性引导的反思抑制方法

#20 SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens #20 SONAR-LLM：以句子嵌入进行思考、以令牌表达的自回归 Transformer

#21 Decision-Making with Deliberation: Meta-reviewing as a Document-grounded Dialogue #21 带有深思熟虑的决策制定：作为基于文档对话的元评审

#22 ASCoT: An Adaptive Self-Correction Chain-of-Thought Method for Late-Stage Fragility in LLMs #22 ASCoT：一种用于 LLMs 后期脆弱性的自适应自我修正思路链方法

#23 CodeBoost: Boosting Code LLMs by Squeezing Knowledge from Code Snippets with RL #23 CodeBoost: 通过从代码片段中挤出知识并使用强化学习提升代码 LLMs [PDF 5 ] [Copy] [Kimi 1 ] [REL]

#24 Pruning Large Language Models by Identifying and Preserving Functional Networks #24 通过识别并保留功能网络对大型语言模型进行剪枝 [PDF 1 ] [Copy] [Kimi 1 ] [REL]

#25 Resource-Limited Joint Multimodal Sentiment Reasoning and Classification via Chain-of-Thought Enhancement and Distillation #25 资源受限的联合多模态情感推理与分类：通过链式思维增强与蒸馏

#26 ATLANTIS at SemEval-2025 Task 3: Detecting Hallucinated Text Spans in Question Answering #26 ATLANTIS 在 SemEval-2025 任务 3：检测问答中的虚构文本片段 [PDF ] [Copy] [Kimi ] [REL]

#27 Towards Assessing Medical Ethics from Knowledge to Practice #27 从知识到实践的医学伦理评估探索

#28 Attention Basin: Why Contextual Position Matters in Large Language Models #28 注意力盆地：为什么上下文位置在大型语言模型中很重要

#29 BEE-RAG: Balanced Entropy Engineering for Retrieval-Augmented Generation #29 BEE-RAG：检索增强生成的平衡熵工程 [PDF 1 ] [Copy] [Kimi ] [REL]

#30 Multimodal Fact Checking with Unified Visual, Textual, and Contextual Representations #30 使用统一的视觉、文本与上下文表示的多模态事实核查

#31 Align, Don't Divide: Revisiting the LoRA Architecture in Multi-Task Learning #31 对齐而非划分：在多任务学习中重审 LoRA 架构

#32 Evaluation of LLMs in AMR Parsing #32 在 AMR 解析中对 LLMs 的评估

#33 Dialogues Aspect-based Sentiment Quadruple Extraction via Structural Entropy Minimization Partitioning #33 对话面向方面的情感四元组抽取通过结构熵最小化分割

#34 A Multi-Stage Large Language Model Framework for Extracting Suicide-Related Social Determinants of Health #34 一个用于提取与自杀相关的社会健康决定因素的多阶段大型语言模型框架

#35 Towards Robust Evaluation of Visual Activity Recognition: Resolving Verb Ambiguity with Sense Clustering #35 朝着视觉活动识别的鲁棒评估：通过意义聚类解决动词歧义

#36 I Think, Therefore I Am Under-Qualified? A Benchmark for Evaluating Linguistic Shibboleth Detection in LLM Hiring Evaluations #36 我思故我资质不足？用于评估 LLM 招聘评估中语言禁忌检测的基准测试

#37 RCR-Router: Efficient Role-Aware Context Routing for Multi-Agent LLM Systems with Structured Memory #37 RCR-Router：针对具有结构化记忆的多智能体 LLM 系统的高效角色感知上下文路由

#38 Persistent Instability in LLM's Personality Measurements: Effects of Scale, Reasoning, and Conversation History #38 LLM 人格测量的持续不稳定性：规模、推理与对话历史的影响

#39 Pitch Accent Detection improves Pretrained Automatic Speech Recognition #39 音高重音检测提升预训练自动语音识别

#40 Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization #40 偶校验感知的字节对编码：在分词中提升跨语言公平性

#41 Enhancing Dialogue Annotation with Speaker Characteristics Leveraging a Frozen LLM #41 使用冻结的 LLM 通过说话人特征增强对话标注

#42 Test-Time Reinforcement Learning for GUI Grounding via Region Consistency #42 基于区域一致性的 GUI 定位测试时强化学习

#43 Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision #43 Uni-cot：迈向跨文本与视觉的统一链式思维推理

#44 Iterative Learning of Computable Phenotypes for Treatment Resistant Hypertension using Large Language Models #44 使用大型语言模型对难治性高血压可计算表型进行迭代学习以指导治疗

#45 Fairy±i: the First 2-bit Complex LLM with All Parameters in #45Fairy ±i ：首个所有参数均为 的 2 位复数 LLM

#46 SPGISpeech 2.0: Transcribed multi-speaker financial audio for speaker-tagged transcription #46 SPGISpeech 2.0：带说话人标注的多说话人金融音频转录 [PDF ] [Copy] [Kimi ] [REL]

#47 Mixed-Initiative Dialog for Human-Robot Collaborative Manipulation #47 混合主动对话用于人机协作操作

#48 MELLA: Bridging Linguistic Capability and Cultural Groundedness for Low-Resource Language MLLMs #48 MELLA：为低资源语言的大型多语言模型架起语言能力与文化扎根性的桥梁

#49 Can Large Language Models Generate Effective Datasets for Emotion Recognition in Conversations? #49 大型语言模型能为对话中的情感识别生成有效的数据集吗？ [PDF ] [Copy] [Kimi ] [REL]

#50 Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance? #50 Bench-2-CoP：我们能信任用于欧盟人工智能合规性的基准测试吗？

#51 A Novel Architecture for Symbolic Reasoning with Decision Trees and LLM Agents #51 一种结合决策树与 LLM 代理的符号推理新架构

#52 Understanding and Mitigating Errors of LLM-Generated RTL Code #52 理解并缓解 LLM 生成的 RTL 代码错误

#53 FAITH: A Framework for Assessing Intrinsic Tabular Hallucinations in finance #53 FAITH：用于评估金融领域表格内在幻觉的框架

#54 QA-Dragon: Query-Aware Dynamic RAG System for Knowledge-Intensive Visual Question Answering #54 QA-Dragon：面向查询感知的动态 RAG 系统，用于知识密集型视觉问答

#55 Posterior-GRPO: Rewarding Reasoning Processes in Code Generation #55 Posterior-GRPO：在代码生成中奖励推理过程

#56 Aligning LLMs on a Budget: Inference-Time Alignment with Heuristic Reward Models #56 在预算内对 LLMs 进行对齐：使用启发式奖励模型进行推理时对齐

#57 Speech LLMs in Low-Resource Scenarios: Data Volume Requirements and the Impact of Pretraining on High-Resource Languages #57 低资源场景下的语音 LLMs：数据量需求及在高资源语言上预训练的影响

#58 Navigating Through Paper Flood: Advancing LLM-based Paper Evaluation through Domain-Aware Retrieval and Latent Reasoning #58 穿越论文洪流：通过面向领域的检索与潜在推理推进基于 LLM 的论文评估

#59 Exploring Superior Function Calls via Reinforcement Learning #59 通过强化学习探索更优的函数调用

#60 JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering #60 JPS：通过协作式视觉扰动和文本引导越狱多模态大语言模型

#61 Cognitive Duality for Adaptive Web Agents #61 适应性网络代理的认知二元性

#62 A Study of the Framework and Real-World Applications of Language Embedding for 3D Scene Understanding #62 《用于三维场景理解的语言嵌入框架及其真实世界应用研究》

#63 Making Prompts First-Class Citizens for Adaptive LLM Pipelines #63 让提示成为自适应 LLM 流水线的一等公民

#64 Can Large Language Models Integrate Spatial Data? Empirical Insights into Reasoning Strengths and Computational Weaknesses #64 大型语言模型能整合空间数据吗？关于推理强项与计算弱点的实证见解

#65 R-Zero: Self-Evolving Reasoning LLM from Zero Data #65 R-Zero：从零数据自我进化推理 LLM

#66 REINA: Regularized Entropy Information-Based Loss for Efficient Simultaneous Speech Translation #66 REINA：基于正则化熵信息的高效同时语音翻译损失函数

#67 ConfAgents: A Conformal-Guided Multi-Agent Framework for Cost-Efficient Medical Diagnosis #67 ConfAgents：一种用于成本高效医疗诊断的受共形引导的多智能体框架

#68 Advancing Hate Speech Detection with Transformers: Insights from the MetaHate #68 使用变换器推进仇恨言论检测：来自 MetaHate 的见解

#69 Fine-Tuning Small Language Models (SLMs) for Autonomous Web-based Geographical Information Systems (AWebGIS) #69 针对自主基于网络的地理信息系统（AWebGIS）微调小型语言模型（SLMs）

#45 Fairy±i: the First 2-bit Complex LLM with All Parameters in #45Fairy ±i ：首个所有参数均为的 2 位复数 LLM