2025-08-13 2025-08-13 About 59300 words 279 minutes

Contents

#1 Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models #1 时间就是特征：在扩散语言模型中利用时间动态
#2 Complex Logical Instruction Generation #2 复杂逻辑指令生成
#3 OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows #3 OdysseyBench：在长时域复杂办公应用工作流上评估 LLM 代理
#4 SinLlama - A Large Language Model for Sinhala #4 SinLlama - 面向僧伽罗语的大型语言模型 [PDF 2 ] [Copy] [Kimi 3 ] [REL]
#5 AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators
#6 Link Prediction for Event Logs in the Process Industry #6 用于流程工业事件日志的链接预测
#7 Utilizing Multilingual Encoders to Improve Large Language Models for Low-Resource Languages #7 利用多语种编码器提升低资源语言的大型语言模型
#8 CPO: Addressing Reward Ambiguity in Role-playing Dialogue via Comparative Policy Optimization #8 CPO：通过比较策略优化在角色扮演对话中解决奖励歧义
#9 READER: Retrieval-Assisted Drafter for Efficient LLM Inference #9 读者：用于高效 LLM 推理的检索辅助起草器
#10 MVISU-Bench: Benchmarking Mobile Agents for Real-World Tasks by Multi-App, Vague, Interactive, Single-App and Unethical Instructions #10 MVISU-Bench：通过多应用、模糊、交互、单应用和不道德指令对移动代理执行真实世界任务进行基准测试 [PDF ] [Copy] [Kimi ] [REL]
#11 LLM-as-a-Supervisor: Mistaken Therapeutic Behaviors Trigger Targeted Supervisory Feedback #11 LLM-as-a-Supervisor：错误的治疗行为触发有针对性的监督反馈
#12 A Survey on Training-free Alignment of Large Language Models #12 关于大型语言模型无训练对齐方法的综述
#13 LyS at SemEval 2025 Task 8: Zero-Shot Code Generation for Tabular QA #13 LyS 在 SemEval 2025 任务 8：用于表格问答的零样本代码生成 [PDF ] [Copy] [Kimi ] [REL]
#14 Retrospective Sparse Attention for Efficient Long-Context Generation #14 回顾性稀疏注意力以实现高效长上下文生成
#15 Jointly Generating and Attributing Answers using Logits of Document-Identifier Tokens #15 使用文档标识符标记的对数几率共同生成并归因答案
#16 Train Long, Think Short: Curriculum Learning for Efficient Reasoning #16 长训短思：用于高效推理的课程学习
#17 Reveal-Bangla: A Dataset for Cross-Lingual Multi-Step Reasoning Evaluation #17 Reveal-Bangla：用于跨语言多步推理评估的数据集
#18 Munsit at NADI 2025 Shared Task 2: Pushing the Boundaries of Multidialectal Arabic ASR with Weakly Supervised Pretraining and Continual Supervised Fine-tuning #18 Munsit 在 NADI 2025 共享任务 2：通过弱监督预训练和持续监督微调推动多方言阿拉伯语自动语音识别的边界
#19 ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs #19 ASPD：通过探索 LLMs 的内在并行性解锁自适应串并行解码
#20 Entangled in Representations: Mechanistic Investigation of Cultural Biases in Large Language Models #20 纠缠于表征：对大型语言模型中文化偏见的机制性调查
#21 Weakly Supervised Fine-grained Span-Level Framework for Chinese Radiology Report Quality Assurance #21 题为“面向中文放射学报告质量保证的弱监督细粒度跨段框架”
#22 BiasGym: Fantastic Biases and How to Find (and Remove) Them #22 BiasGym：奇妙的偏差及如何发现（并消除）它们
#23 Steering Towards Fairness: Mitigating Political Bias in LLMs #23 朝向公平的引导：缓解 LLMs 中的政治偏见
#24 An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems #24 对大语言模型在数学推理鲁棒性的一项研究：通过对高等数学题目进行数学等价变换来进行基准测试 [PDF ] [Copy] [Kimi 2 ] [REL]
#25 TiMoE: Time-Aware Mixture of Language Experts #25 TiMoE：面向时间的语言专家混合模型
#26 Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments #26 通过自动构建环境的反馈驱动工具使用改进在大型语言模型中的应用 [PDF 2 ] [Copy] [Kimi 2 ] [REL]
#27 Privacy-protected Retrieval-Augmented Generation for Knowledge Graph Question Answering #27 隐私保护的检索增强生成用于知识图谱问答
#28 DevNous: An LLM-Based Multi-Agent System for Grounding IT Project Management in Unstructured Conversation #28 DevNous：一种基于 LLM 的多智能体系统，用于在非结构化对话中将 IT 项目管理落地
#29 SciRerankBench: Benchmarking Rerankers Towards Scientific Retrieval-Augmented Generated LLMs #29 SciRerankBench：面向科学检索增强生成 LLMs 的重排器基准测试
#30 Magical: Medical Lay Language Generation via Semantic Invariance and Layperson-tailored Adaptation #30 Magical：通过语义不变性和面向门外汉的适应进行医学通俗语言生成
#31 IROTE: Human-like Traits Elicitation of Large Language Model via In-Context Self-Reflective Optimization #31 IROTE：通过上下文自我反思优化引发大型语言模型的人类特质 [PDF 4 ] [复印件] [Kimi 4 ] [REL]
#32 A Survey on Parallel Text Generation: From Parallel Decoding to Diffusion Language Models #32 一篇关于并行文本生成的综述：从并行解码到扩散语言模型
#33 Out of the Box, into the Clinic? Evaluating State-of-the-Art ASR for Clinical Applications for Older Adults #33 直接投入临床？评估面向老年人的最先进自动语音识别在临床应用中的表现
#34 TopXGen: Topic-Diverse Parallel Data Generation for Low-Resource Machine Translation #34 TopXGen：面向低资源机器翻译的主题多样性并行数据生成
#35 LLM driven Text-to-Table Generation through Sub-Tasks Guidance and Iterative Refinement #35 通过子任务指导和迭代改进的由 LLM 驱动的文本到表格生成
#36 Prompt-Based Approach for Czech Sentiment Analysis #36 一种基于提示的方法用于捷克语情感分析
#37 UWB at WASSA-2024 Shared Task 2: Cross-lingual Emotion Detection #37 UWB 在 WASSA-2024 共享任务 2：跨语言情感检测
#38 LLaMA-Based Models for Aspect-Based Sentiment Analysis
#39 Quick on the Uptake: Eliciting Implicit Intents from Human Demonstrations for Personalized Mobile-Use Agents #39 反应迅速：从人类示范中引出隐含意图以打造个性化移动使用代理
#40 InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling #40 InternBootcamp 技术报告：通过可验证任务扩展提升 LLM 推理能力
#41 Optimizing Retrieval-Augmented Generation (RAG) for Colloquial Cantonese: A LoRA-Based Systematic Review #41 优化面向口语粤语的检索增强生成（RAG）：基于 LoRA 的系统综述
#42 DepressLLM: Interpretable domain-adapted language model for depression detection from real-world narratives #42 DepressLLM：用于从真实世界叙述中检测抑郁的可解释领域适应语言模型
#43 DeCAL Tokenwise Compression #43 DeCAL 逐标记压缩
#44 Steerable Pluralism: Pluralistic Alignment via Few-Shot Comparative Regression #44 可引导的多元主义：通过少样本比较回归实现多元对齐
#45 Momentum Point-Perplexity Mechanics in Large Language Models #45 大型语言模型中的动量点困惑力学
#46 Enhancing Small LLM Alignment through Margin-Based Objective Modifications under Resource Constraints #46 在资源受限下通过基于边际的目标修改提升小型 LLM 的对齐
#47 Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment #47 重新思考富形态的分词：一元模型优于 BPE 并且形态对齐占优
#48 Mol-R1: Towards Explicit Long-CoT Reasoning in Molecule Discovery #48 Mol-R1：迈向分子发现中显式的长链链式思维（Long-CoT）推理
#49 CoDAE: Adapting Large Language Models for Education via Chain-of-Thought Data Augmentation #49 CoDAE：通过连锁思维数据增强将大型语言模型适配于教育领域 [PDF ] [复制] [Kimi ] [关联]
#50 Putnam-AXIOM: A Functional and Static Benchmark #50 Putnam-AXIOM：一个函数式与静态基准
#51 Sacred or Synthetic? Evaluating LLM Reliability and Abstention for Religious Questions #51 神圣还是合成？评估 LLM 在宗教问题上的可靠性与回避
#52 The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs #52 进步的错觉：重新评估 LLMs 中幻觉检测
#53 MinionsLLM: a Task-adaptive Framework For The Training and Control of Multi-Agent Systems Through Natural Language #53 MinionsLLM：一种通过自然语言对多智能体系统进行训练与控制的任务自适应框架 [PDF ] [Copy] [Kimi ] [REL]
#54 Objective Metrics for Evaluating Large Language Models Using External Data Sources #54 使用外部数据源评估大型语言模型的客观指标
#55 Evaluating Contrast Localizer for Identifying Causal Unitsin Social & Mathematical Tasks in Language Models #55 评估对比定位器在语言模型中识别社会与数学任务因果单元的效果
#56 MLLM-CBench:A Comprehensive Benchmark for Continual Instruction Tuning of Multimodal LLMs with Chain-of-Thought Reasoning Analysis #56 MLLM-CBench：用于多模态大模型持续指令微调的全面基准与链式思维推理分析 [PDF 2 ] [Copy] [Kimi 4 ] [REL]
#57 Distilling Knowledge from Large Language Models: A Concept Bottleneck Model for Hate and Counter Speech Recognition #57 从大型语言模型中蒸馏知识：用于仇恨与反言论识别的概念瓶颈模型
#58 TT-XAI: Trustworthy Clinical Text Explanations via Keyword Distillation and LLM Reasoning #58 TT-XAI：通过关键词提炼和 LLM 推理实现可信的临床文本解释
#59 Real-time News Story Identification #59 实时新闻事件识别
#60 Heartificial Intelligence: Exploring Empathy in Language Models #60 心工智能：探索语言模型中的共情
#61 TurQUaz at CheckThat! 2025: Debating Large Language Models for Scientific Web Discourse Detection #61 TurQUaz at CheckThat! 2025：为科学网络话语检测辩论大型语言模型
#62 Argument Quality Annotation and Gender Bias Detection in Financial Communication through Large Language Models #62 通过大型语言模型在金融交流中进行论点质量注释和性别偏见检测
#63 P/D-Device: Disaggregated Large Language Model between Cloud and Devices #63 P/D-Device：云端与设备间的解聚合大型语言模型
#64 E3-Rewrite: Learning to Rewrite SQL for Executability, Equivalence,and Efficiency #64 E3-Rewrite：学习重写 SQL 以实现可执行性、等价性和效率
#65 Revealing the Role of Audio Channels in ASR Performance Degradation #65 揭示音频通道在 ASR 性能下降中的作用
#66 A Dual-Axis Taxonomy of Knowledge Editing for LLMs: From Mechanisms to Functions #66 面向 LLMs 的知识编辑双轴分类法：从机制到功能
#67 Designing Memory-Augmented AR Agents for Spatiotemporal Reasoning in Personalized Task Assistance #67 为个性化任务辅助设计具备记忆增强的增强现实代理，以进行时空推理 [PDF ] [Copy] [Kimi ] [REL]
#68 MultiAiTutor: Child-Friendly Educational Multilingual Speech Generation Tutor with LLMs #68 MultiAiTutor：面向儿童友好的教育多语种语音生成导师，基于 LLMs
#69 M2LLM: Multi-view Molecular Representation Learning with Large Language Models #69 M2 LLM：使用大型语言模型的多视角分子表示学习
#70 MiGrATe: Mixed-Policy GRPO for Adaptation at Test-Time #70 MiGrATe：用于测试时自适应的混合策略 GRPO
#71 Adaptive Personalized Conversational Information Retrieval #71 自适应个性化对话式信息检索
#72 Fine-grained Video Dubbing Duration Alignment with Segment Supervised Preference Optimization #72 细粒度视频配音时长对齐与基于片段监督的偏好优化
#73 Re:Verse – Can Your VLM Read a Manga? #73 Re:Verse——你的视觉语言模型能读漫画吗？
#74 Bilevel MCTS for Amortized O(1) Node Selection in Classical Planning #74 双层 MCTS：用于经典规划中摊销 O(1)节点选择
#75 Exploring the Technical Knowledge Interaction of Global Digital Humanities: Three-decade Evidence from Bibliometric-based perspectives #75 探索全球数字人文的技术知识互动：基于文献计量视角的三十年证据
#76 Maximizing GPU Efficiency via Optimal Adapter Caching: An Analytical Approach for Multi-Tenant LLM Serving #76 通过最优适配器缓存最大化 GPU 效率：面向多租户 LLM 服务的分析方法
#77 Doctor Sun: A Bilingual Multimodal Large Language Model for Biomedical AI #77 Doctor Sun：面向生物医学 AI 的双语多模态大语言模型
#78 Benchmarking Large Language Models for Geolocating Colonial Virginia Land Grants #78 基准测试大型语言模型以对殖民地弗吉尼亚地契进行地理定位

#1 BrowseMaster: Towards Scalable Web Browsing via Tool-Augmented Programmatic Agent Pair #1 BrowseMaster：通过工具增强的编程代理对实现可扩展网页浏览
#2 OpenCUA: Open Foundations for Computer-Use Agents #2 OpenCUA：面向计算机使用代理的开放基础（PDF 9 ）[复制] [Kimi 5 ] [相关]
#3 SMA: Who Said That? Auditing Membership Leakage in Semi-Black-box RAG Controlling #3 SMA：谁说的？审计半黑盒 RAG 控制中的成员泄露
#4 CVCM Track Circuits Pre-emptive Failure Diagnostics for Predictive Maintenance Using Deep Neural Networks #4 CVCM Track Circuits 预防性故障诊断用于预测性维护的深度神经网络
#5 A First Look at Predictability and Explainability of Pre-request Passenger Waiting Time in Ridesharing Systems #5 乘车共享系统中预请求乘客等待时间的可预测性与可解释性初探
#6 Activation Steering for Bias Mitigation: An Interpretable Approach to Safer LLMs #6 激活引导用于偏差缓解：一种可解释的更安全 LLMs 方法
#7 Intrinsic Memory Agents: Heterogeneous Multi-Agent LLM Systems through Structured Contextual Memory #7 内在记忆代理：通过结构化上下文记忆构建的异构多代理 LLM 系统
#8 Prospect Theory Fails for LLMs: Revealing Instability of Decision-Making under Epistemic Uncertainty #8 前景理论在 LLMs 上失效：揭示在认知不确定性下决策的不稳定性
#9 Safe Semantics, Unsafe Interpretations: Tackling Implicit Reasoning Safety in Large Vision-Language Models #9 安全语义，不安全解释：应对大规模视觉-语言模型中的隐式推理安全问题
#10 Compass-Thinker-7B Technical Report #10 Compass-Thinker-7B 技术报告
#11 Reducing Cognitive Load in Multi-Agent Reinforcement Learning for Mathematical Problem Solving: Decoupling Reasoning and Code Generation #11 在用于数学问题求解的多智能体强化学习中降低认知负担：将推理与代码生成解耦
#12 Silicon Minds versus Human Hearts: The Wisdom of Crowds Beats the Wisdom of AI in Emotion Recognition #12 硅之心智对抗人之情感：群体智慧在情感识别上胜过人工智能的智慧
#13 Efficient Agent: Optimizing Planning Capability for Multimodal Retrieval Augmented Generation #13 高效代理：为多模态检索增强生成优化规划能力
#14 GRainsaCK: a Comprehensive Software Library for Benchmarking Explanations of Link Prediction Tasks on Knowledge Graphs #14 GRainsaCK：一个用于对知识图谱上链路预测任务的解释进行基准测试的综合软件库
#15 A Dual-Axis Taxonomy of Knowledge Editing for LLMs: From Mechanisms to Functions #15 双轴知识编辑分类法用于 LLMs：从机制到功能
#16 Designing Memory-Augmented AR Agents for Spatiotemporal Reasoning in Personalized Task Assistance #16 为个性化任务辅助设计具备记忆增强的增强现实代理以进行时空推理 [PDF ] [Copy] [Kimi ] [REL]
#17 Simulating Generative Social Agents via Theory-Informed Workflow Design #17 通过以理论为导向的工作流设计模拟生成式社会代理
#18 STELAR-VISION: Self-Topology-Aware Efficient Learning for Aligned Reasoning in Vision #18 STELAR-VISION：面向视觉中对齐推理的自拓扑感知高效学习
#19 Aryabhata: An exam-focused language model for JEE Math #19 Aryabhata：面向 JEE 数学考试的语言模型
#20 Hybrid Node-Destroyer Model with Large Neighborhood Search for Solving the Capacitated Vehicle Routing Problem #20 混合节点破坏器模型与大邻域搜索相结合用于求解带容量约束的车辆路径问题
#21 Prompt-and-Check: Using Large Language Models to Evaluate Communication Protocol Compliance in Simulation-Based Training #21 Prompt-and-Check: 使用大型语言模型在基于仿真的训练中评估通信协议合规性的研究
#22 P-CAFE: Personalized Cost-Aware Incremental Feature Selection For Electronic Health Records #22 P-CAFE：用于电子健康记录的个性化成本敏感增量特征选择
#23 Diminution: On Reducing the Size of Grounding ASP Programs #23 缩减：关于减少定向 ASP 程序规模的方法
#24 AgriGPT: a Large Language Model Ecosystem for Agriculture #24 AgriGPT：面向农业的大型语言模型生态系统
#25 UGM2N: An Unsupervised and Generalizable Mesh Movement Network via M-Uniform Loss #25 UGM2N：一种通过 M-Uniform 损失实现的无监督且可泛化的网格移动网络
#26 SynLLM: A Comparative Analysis of Large Language Models for Medical Tabular Synthetic Data Generation via Prompt Engineering #26 SynLLM：通过提示工程对用于医学表格合成数据生成的大型语言模型的比较分析
#27 GVGAI-LLM: Evaluating Large Language Model Agents with Infinite Games #27 GVGAI-LLM：使用无限游戏评估大型语言模型代理 [PDF ] [Copy] [Kimi 2 ] [REL]
#28 Large Language Models as Oracles for Ontology Alignment #28 将大型语言模型作为本体对齐的神谕
#29 POMO+: Leveraging starting nodes in POMO for solving Capacitated Vehicle Routing Problem #29 POMO+: 在 POMO 中利用起始节点来解决带容量约束的车辆路径问题
#30 Beyond Ordinal Preferences: Why Alignment Needs Cardinal Human Feedback #30 超越序数偏好：为什么对齐需要基于基数的人类反馈
#31 A Fast GRASP Metaheuristic for the Trigger Arc TSP with MIP-Based Construction and Multi-Neighborhood Local Search #31 一种针对触发弧 TSP 的快速 GRASP 元启发式算法，结合基于 MIP 的构造和多邻域局部搜索 [PDF ] [Copy] [Kimi ] [REL]
#32 OverFill: Two-Stage Models for Efficient Language Model Decoding #32 OverFill：用于高效语言模型解码的两阶段模型
#33 Solver-Aided Expansion of Loops to Avoid Generate-and-Test #33 利用求解器辅助展开循环以避免生成并测试 [PDF ] [Copy] [Kimi ] [REL]
#34 Bilevel MCTS for Amortized O(1) Node Selection in Classical Planning #34 双层 MCTS 用于经典规划中均摊 O(1)节点选择
#35 UrzaGPT: LoRA-Tuned Large Language Models for Card Selection in Collectible Card Games #35 UrzaGPT：用于集换式卡牌游戏选牌的 LoRA 微调大型语言模型
#36 What Breaks Knowledge Graph based RAG? Empirical Insights into Reasoning under Incomplete Knowledge #36 基于知识图的 RAG 会在哪些方面失效？关于在不完整知识下推理的实证洞见
#37 First Ask Then Answer: A Framework Design for AI Dialogue Based on Supplementary Questioning with Large Language Models #37 先问后答：一种基于大型语言模型补充提问的 AI 对话框架设计
#38 LLM-BI: Towards Fully Automated Bayesian Inference with Large Language Models #38 LLM-BI：迈向使用大规模语言模型的全自动贝叶斯推断
#39 An Efficient Application of Goal Programming to Tackle Multiobjective Problems with Recurring Fitness Landscapes #39 一种高效的目标规划方法用于处理具有循环适应景观的多目标问题
#40 Topos Causal Models #40 拓扑语因果模型
#41 Topos Theory for Generative AI and LLMs
#42 Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models #42 时间是一种特征：在扩散语言模型中利用时间动态
#43 Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer #43 无需训练的文本引导彩色编辑基于多模态扩散变换器
#44 Towards Universal Neural Inference #44 迈向通用神经推理
#45 SPARC: Soft Probabilistic Adaptive multi-interest Retrieval Model via Codebooks for recommender system #45 SPARC：通过码本实现的用于推荐系统的软概率自适应多兴趣检索模型
#46 Dynamic Uncertainty-aware Multimodal Fusion for Outdoor Health Monitoring #46 动态不确定性感知多模态融合用于户外健康监测
#47 Can We Trust AI to Govern AI? Benchmarking LLM Performance on Privacy and AI Governance Exams #47 我们能信任 AI 来治理 AI 吗？在隐私与 AI 治理考试上对 LLM 性能的基准测试
#48 Spatial Traces: Enhancing VLA Models with Spatial-Temporal Understanding #48 空间轨迹：通过时空理解增强 VLA 模型
#49 E3-Rewrite: Learning to Rewrite SQL for Executability, Equivalence,and Efficiency #49 E3-Rewrite：学习重写 SQL 以实现可执行性、等价性和高效性
#50 When Deepfakes Look Real: Detecting AI-Generated Faces with Unlabeled Data due to Annotation Challenges #50 当深度伪造看起来真实：在注释挑战下使用无标注数据检测 AI 生成面孔
#51 Attacks and Defenses Against LLM Fingerprinting #51 针对 LLM 指纹识别的攻击与防御
#52 LyS at SemEval 2025 Task 8: Zero-Shot Code Generation for Tabular QA #52 LyS 在 SemEval 2025 任务 8：面向表格问答的零样本代码生成
#53 Retrospective Sparse Attention for Efficient Long-Context Generation #53 面向高效长上下文生成的回顾性稀疏注意力
#54 Rational Inverse Reasoning #54 理性逆向推理
#55 Unsupervised Skill Discovery as Exploration for Learning Agile Locomotion #55 无监督技能发现：作为学习敏捷行走的探索
#56 Urban-STA4CLC: Urban Theory-Informed Spatio-Temporal Attention Model for Predicting Post-Disaster Commercial Land Use Change #56 Urban-STA4CLC：一种基于城市理论的时空注意力模型用于预测灾后商业用地变化
#57 Revealing the Role of Audio Channels in ASR Performance Degradation #57 揭示音频通道在语音识别性能下降中的作用
#58 QAMRO: Quality-aware Adaptive Margin Ranking Optimization for Human-aligned Assessment of Audio Generation Systems #58 QAMRO：面向音频生成系统人类对齐评估的质量感知自适应边距排序优化
#59 Generalising Traffic Forecasting to Regions without Traffic Observations #59 将交通预测推广到没有交通观测的区域
#60 Train Long, Think Short: Curriculum Learning for Efficient Reasoning #60 长期训练，短期思考：用于高效推理的课程学习
#61 EGGCodec: A Robust Neural Encodec Framework for EGG Reconstruction and F0 Extraction #61 EGGCodec：一种用于 EGG 重建和 F0 提取的鲁棒神经编码解码框架
#62 Shape Completion and Real-Time Visualization in Robotic Ultrasound Spine Acquisitions #62 形状补全与机器人超声脊柱采集中的实时可视化
#63 Munsit at NADI 2025 Shared Task 2: Pushing the Boundaries of Multidialectal Arabic ASR with Weakly Supervised Pretraining and Continual Supervised Fine-tuning #63 Munsit 在 NADI 2025 共享任务 2：通过弱监督预训练和持续监督微调推动多方言阿拉伯语自动语音识别的边界
#64 ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs #64 ASPD：通过探索 LLMs 的内在并行性解锁自适应串并行解码
#65 Position: Causal Machine Learning Requires Rigorous Synthetic Experiments for Broader Adoption #65 位置：因果机器学习需要严谨的合成实验以实现更广泛的采用
#66 Entangled in Representations: Mechanistic Investigation of Cultural Biases in Large Language Models #66 纠缠于表征：对大型语言模型中文化偏见的机械化调查
#67 Oblivionis: A Lightweight Learning and Unlearning Framework for Federated Large Language Models #67 Oblivionis：一种用于联邦大语言模型的轻量级学习与遗忘框架 [PDF ] [Copy] [Kimi ] [REL]
#68 BiasGym: Fantastic Biases and How to Find (and Remove) Them #68 BiasGym：奇妙的偏差以及如何发现（和移除）它们
#69 Steering Towards Fairness: Mitigating Political Bias in LLMs #69 朝向公平的引导：缓解 LLMs 中的政治偏见
#70 The Roots of International Perceptions: Simulating US Attitude Changes Towards China with LLM Agents #70 国际认知的根源：使用 LLM 代理模拟美国对华态度变化
#71 EditMF: Drawing an Invisible Fingerprint for Your Large Language Models #71 EditMF：为你的大型语言模型绘制一枚隐形指纹
#72 An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems #72 对大型语言模型（LLMs）在数学推理鲁棒性的一项研究：通过对高等数学问题进行数学等价变换来建立基准测试
#73 Geometry-Aware Global Feature Aggregation for Real-Time Indirect Illumination #73 面向几何的全局特征聚合用于实时间接光照
#74 Wavelet Mixture of Experts for Time Series Forecasting #74 小波专家混合模型用于时间序列预测
#75 OISMA: On-the-fly In-memory Stochastic Multiplication Architecture for Matrix-Multiplication Workloads #75 OISMA：面向矩阵乘法工作负载的即时内存内随机乘法架构
#76 TempOpt – Unsupervised Alarm Relation Learning for Telecommunication Networks #76 TempOpt – 用于电信网络的无监督告警关联学习
#77 Not in My Backyard! Temporal Voting Over Public Chores #77 不是在我后院！关于公共杂务的时间投票
#78 Opening Musical Creativity? Embedded Ideologies in Generative-AI Music Systems #78 激发音乐创造力？生成式人工智能音乐系统中的嵌入意识形态
#79 TechOps: Technical Documentation Templates for the AI Act #79 TechOps：用于《人工智能法案》的技术文档模板
#80 Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments #80 通过自动化构建环境进行反馈驱动的工具使用改进在大型语言模型中的研究
#81 ReQuestNet: A Foundational Learning model for Channel Estimation #81 ReQuestNet：用于信道估计的基础学习模型
#82 Evaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge #82 使用具画像感知能力的 LLM 作为裁判评估播客推荐
#83 Bridging the Gap: A Framework for Real-World Video Deepfake Detection via Social Network Compression Emulation #83 弥合差距：通过模拟社交网络压缩进行真实世界视频深度伪造检测的框架 [PDF 1 ] [Copy] [Kimi ] [REL]
#84 DevNous: An LLM-Based Multi-Agent System for Grounding IT Project Management in Unstructured Conversation #84 DevNous：一个基于 LLM 的多智能体系统，用于在非结构化对话中落地 IT 项目管理 [PDF ] [Copy] [Kimi ] [REL]
#85 Visual Prompting for Robotic Manipulation with Annotation-Guided Pick-and-Place Using ACT #85 使用 ACT 的注释引导抓取与放置的机器人操控视觉提示
#86 SciRerankBench: Benchmarking Rerankers Towards Scientific Retrieval-Augmented Generated LLMs #86 SciRerankBench：针对用于科学检索增强生成型 LLMs 的重排序器基准测试
#87 IROTE: Human-like Traits Elicitation of Large Language Model via In-Context Self-Reflective Optimization #87 IROTE：通过上下文自我反思式优化引出大型语言模型的人类特质
#88 Generative Modeling for Robust Deep Reinforcement Learning on the Traveling Salesman Problem #88 面向旅行商问题的稳健深度强化学习的生成式建模
#89 MultiAiTutor: Child-Friendly Educational Multilingual Speech Generation Tutor with LLMs #89 MultiAiTutor：面向儿童友好的教育性多语种语音生成辅导系统，基于 LLMs
#90 A Survey on Parallel Text Generation: From Parallel Decoding to Diffusion Language Models #90 并行文本生成综述：从并行解码到扩散语言模型
#91 SafeFix: Targeted Model Repair via Controlled Image Generation #91 SafeFix：通过受控图像生成进行有针对性的模型修复 [PDF 3 ] [Copy] [Kimi ] [REL]
#92 MMIF-AMIN: Adaptive Loss-Driven Multi-Scale Invertible Dense Network for Multimodal Medical Image Fusion #92 MMIF-AMIN：自适应损失驱动的多尺度可逆致密网络用于多模态医学图像融合
#93 Imposing AI: Deceptive design patterns against sustainability #93 Imposing AI：针对可持续性的欺骗性设计模式
#94 Hallucinations in Code Change to Natural Language Generation: Prevalence and Evaluation of Detection Metrics #94 代码更改到自然语言生成中的幻觉：发生率与检测指标评估
#95 M2LLM: Multi-view Molecular Representation Learning with Large Language Models #95 M2 LLM: 多视角分子表示学习与大型语言模型
#96 LLM driven Text-to-Table Generation through Sub-Tasks Guidance and Iterative Refinement #96 LLM 驱动的文本到表格生成：通过子任务指导与迭代精炼
#97 MiGrATe: Mixed-Policy GRPO for Adaptation at Test-Time #97 MiGrATe：用于测试时适应的混合策略 GRPO [PDF 5 ] [Copy] [Kimi 2 ] [REL]
#98 Securing Educational LLMs: A Generalised Taxonomy of Attacks on LLMs and DREAD Risk Assessment #98 保护教育用 LLMs：面向 LLMs 攻击的一般化分类与 DREAD 风险评估
#99 QoE-Aware Service Provision for Mobile AR Rendering: An Agent-Driven Approach #99 面向移动增强现实渲染的 QoE 感知服务提供：一种代理驱动方法
#100 Transferable Model-agnostic Vision-Language Model Adaptation for Efficient Weak-to-Strong Generalization #100 可迁移的模型无关视觉-语言模型适配用于高效的弱到强泛化
#101 Yan: Foundational Interactive Video Generation #101 Yan：基础交互式视频生成 [PDF 13 ] [Copy] [Kimi 6 ] [REL]
#102 Generative AI for Critical Infrastructure in Smart Grids: A Unified Framework for Synthetic Data Generation and Anomaly Detection #102 面向智能电网关键基础设施的生成式人工智能：用于合成数据生成与异常检测的统一框架
#103 DepressLLM: Interpretable domain-adapted language model for depression detection from real-world narratives #103 DepressLLM：一种可解释的领域自适应语言模型，用于从真实世界叙述中检测抑郁症
#104 AI Security Map: Holistic Organization of AI Security Technologies and Impacts on Stakeholders #104 AI 安全地图：对 AI 安全技术及其对利益相关者影响的整体性组织
#105 Who pays the RENT? Implications of Spatial Inequality for Prediction-Based Allocation Policies #105 谁来付房租？基于预测的分配政策对空间不平等的影响
#106 Superclass-Guided Representation Disentanglement for Spurious Correlation Mitigation #106 超类引导的表征解缠以缓解虚假相关性
#107 UQGNN: Uncertainty Quantification of Graph Neural Networks for Multivariate Spatiotemporal Prediction #107 UQGNN：用于多变量时空预测的图神经网络不确定性量化
#108 OmniLLP: Enhancing LLM-based Log Level Prediction with Context-Aware Retrieval #108 OmniLLP：通过上下文感知检索增强基于 LLM 的日志级别预测
#109 AI Agents and the Law #109 人工智能代理与法律
#110 M3-Net: A Cost-Effective Graph-Free MLP-Based Model for Traffic Prediction #110 M3-Net：一种用于交通预测的经济高效无图神经的基于 MLP 的模型
#111 LLM-Driven Adaptive 6G-Ready Wireless Body Area Networks: Survey and Framework #111 基于 LLM 的自适应、面向 6G 的无线体域网：综述与框架
#112 Playing Atari Space Invaders with Sparse Cosine Optimized Policy Evolution #112 使用稀疏余弦优化策略进化玩雅达利《太空入侵者》
#113 StreetViewAI: Making Street View Accessible Using Context-Aware Multimodal AI #113 StreetViewAI：使用上下文感知多模态 AI 使街景可访问
#114 VISOR: Visual Input-based Steering for Output Redirection in Vision-Language Models #114 VISOR：基于视觉输入的引导用于视觉-语言模型的输出重定向
#115 Using LLMs to Capture Users' Temporal Context for Recommendation #115 使用 LLMs 捕捉用户的时间上下文以进行推荐
#116 Steerable Pluralism: Pluralistic Alignment via Few-Shot Comparative Regression #116 可引导的多元主义：通过少样本比较回归实现多元对齐
#117 When the Domain Expert Has No Time and the LLM Developer Has No Clinical Expertise: Real-World Lessons from LLM Co-Design in a Safety-Net Hospital #117 当领域专家没有时间而 LLM 开发者缺乏临床专业知识时：来自安全网医院 LLM 共设计的真实世界经验教训 [PDF ] [Copy] [Kimi ] [REL]
#118 Momentum Point-Perplexity Mechanics in Large Language Models #118 大型语言模型中的动量点-困惑度机制
#119 MAViS: A Multi-Agent Framework for Long-Sequence Video Storytelling #119 MAViS：用于长序列视频叙事的多智能体框架
#120 Empowering Children to Create AI-Enabled Augmented Reality Experiences #120 赋能儿童创建具备人工智能的增强现实体验
#121 Temporal User Profiling with LLMs: Balancing Short-Term and Long-Term Preferences for Recommendations #121 使用 LLMs 进行时间性用户画像：在推荐中平衡短期和长期偏好
#122 Fast weight programming and linear transformers: from machine learning to neurobiology #122 快速权重编程与线性变换器：从机器学习到神经生物学
#123 Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment #123 重新思考面向丰富形态变化的分词：一元模型相对于 BPE 的主导地位与形态学对齐
#124 Neural Tangent Knowledge Distillation for Optical Convolutional Networks #124 神经切线知识蒸馏用于光学卷积网络
#125 Generating Query-Relevant Document Summaries via Reinforcement Learning #125 通过强化学习生成与查询相关的文档摘要
#126 Spatiotemporally Consistent Indoor Lighting Estimation with Diffusion Priors #126 使用扩散先验的时空一致室内光照估计
#127 The DNA of nuclear models: How AI predicts nuclear masses #127 核模型的基因：人工智能如何预测核质量
#128 Processing of synthetic data in AI development for healthcare and the definition of personal data in EU law #128 在医疗保健人工智能开发中合成数据的处理以及欧盟法律中个人数据的定义
#129 Fuzzy-Pattern Tsetlin Machine #129 模糊模式泰特林机（Fuzzy-Pattern Tsetlin Machine）
#130 Do AI Companies Make Good on Voluntary Commitments to the White House? #130 AI 公司是否兑现了对白宫的自愿承诺？
#131 Maximizing GPU Efficiency via Optimal Adapter Caching: An Analytical Approach for Multi-Tenant LLM Serving #131 通过最优适配器缓存最大化 GPU 效率：面向多租户 LLM 服务的分析方法
#132 ImageDDI: Image-enhanced Molecular Motif Sequence Representation for Drug-Drug Interaction Prediction #132 ImageDDI：用于药物相互作用预测的图像增强分子基序列表示 [PDF 1 ] [Copy] [Kimi ] [REL]
#133 Algorithmic Fairness amid Social Determinants: Reflection, Characterization, and Approach #133 算法公平性与社会决定因素：反思、特征化与方法
#134 HSA-Net: Hierarchical and Structure-Aware Framework for Efficient and Scalable Molecular Language Modeling #134 HSA-Net：用于高效且可扩展分子语言建模的分层与结构感知框架
#135 Normative Moral Pluralism for AI: A Framework for Deliberation in Complex Moral Contexts #135 面向人工智能的规范性道德多元论：在复杂道德情境中进行审议的框架
#136 Energy-Aware Code Generation with LLMs: Benchmarking Small vs. Large Language Models for Sustainable AI Programming #136 能源感知的代码生成与 LLMs：为可持续 AI 编程对小型与大型语言模型进行基准测试 [PDF ] [Copy] [Kimi ] [REL]
#137 Algorithmic Collusion of Pricing and Advertising on E-commerce Platforms #137 电子商务平台上定价与广告的算法性合谋
#138 Context Engineering for Multi-Agent LLM Code Assistants Using Elicit, NotebookLM, ChatGPT, and Claude Code #138 使用 Elicit、NotebookLM、ChatGPT 和 Claude Code 进行多代理 LLM 代码助理的上下文工程
#139 Between Fear and Desire, the Monster Artificial Intelligence (AI): Analysis through the Lenses of Monster Theory #139 在恐惧与欲望之间，怪物人工智能（AI）：通过怪物理论的视角分析
#140 Evaluation of State-of-the-Art Deep Learning Techniques for Plant Disease and Pest Detection #140 评估当前最先进的深度学习技术用于植物疾病和害虫检测
#141 EU Digital Regulation and Guatemala: AI, 5G, and Cybersecurity #141 欧盟数字监管与危地马拉：人工智能、5G 与网络安全
#142 Assessing the Quality of AI-Generated Exams: A Large-Scale Field Study #142 评估 AI 生成考试的质量：一项大规模现场研究
#143 Constrained PSLQ Search for Machin-like Identities Achieving Record-Low Lehmer Measures #143 受限 PSLQ 搜索以寻找达到创纪录低 Lehmer 度量的 Machin 类恒等式
#144 Channel-Wise MLPs Improve the Generalization of Recurrent Convolutional Networks
#145 Putnam-AXIOM: A Functional and Static Benchmark #145 Putnam-AXIOM：一个函数式与静态基准测试
#146 Understanding Transformers through the Lens of Pavlovian Conditioning #146 通过巴甫洛夫式条件反射视角理解变换器
#147 Sacred or Synthetic? Evaluating LLM Reliability and Abstention for Religious Questions #147 神圣还是合成？评估 LLM 在宗教问题上的可靠性与回避
#148 The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs #148 进步的幻觉：重新评估 LLMs 中的幻觉检测
#149 MinionsLLM: a Task-adaptive Framework For The Training and Control of Multi-Agent Systems Through Natural Language #149 MinionsLLM：一种通过自然语言对多智能体系统进行训练和控制的任务自适应框架
#150 Multi-grained spatial-temporal feature complementarity for accurate online cellular traffic prediction #150 多粒度时空特征互补用于准确的在线蜂窝流量预测
#151 MoSSDA: A Semi-Supervised Domain Adaptation Framework for Multivariate Time-Series Classification using Momentum Encoder #151 MoSSDA：一种用于多变量时间序列分类的半监督域自适应框架，使用动量编码器
#152 XFMNet: Decoding Cross-Site and Nonstationary Water Patterns via Stepwise Multimodal Fusion for Long-Term Water Quality Forecasting #152 XFMNet：通过逐步多模态融合解码跨站点与非平稳水文模式以用于长期水质预测
#153 Towards Heterogeneity-Aware and Energy-Efficient Topology Optimization for Decentralized Federated Learning in Edge Environment #153 面向边缘环境的去中心化联邦学习的异质感知与节能拓扑优化
#154 Evaluating Contrast Localizer for Identifying Causal Unitsin Social & Mathematical Tasks in Language Models #154 评估对比定位器以识别语言模型在社会与数学任务中的因果单元
#155 MLLM-CBench:A Comprehensive Benchmark for Continual Instruction Tuning of Multimodal LLMs with Chain-of-Thought Reasoning Analysis #155 MLLM-CBench：用于多模态 LLM 持续指令微调的全面基准及连锁思维推理分析 [PDF 2 ] [Copy] [Kimi 4 ] [REL]
#156 Distilling Knowledge from Large Language Models: A Concept Bottleneck Model for Hate and Counter Speech Recognition #156 从大型语言模型中蒸馏知识：用于仇恨与反对言论识别的概念瓶颈模型
#157 Doctor Sun: A Bilingual Multimodal Large Language Model for Biomedical AI #157 孙博士：面向生物医学人工智能的双语多模态大型语言模型
#158 emg2tendon: From sEMG Signals to Tendon Control in Musculoskeletal Hands #158 emg2tendon：从表面肌电信号到肌肉骨骼手的肌腱控制
#159 Benchmarking Large Language Models for Geolocating Colonial Virginia Land Grants #159 基准测试大型语言模型用于定位殖民时期弗吉尼亚土地赠与
#160 TurQUaz at CheckThat! 2025: Debating Large Language Models for Scientific Web Discourse Detection #160 TurQUaz 在 CheckThat! 2025：辩论用于科学网络话语检测的大型语言模型 [PDF ] [Copy] [Kimi ] [REL]
#161 On the Effects of Smoothing Rugged Landscape by Different Toy Problems: A Case Study on UBQP #161 关于通过不同玩具问题平滑崎岖地形的影响：以 UBQP 为例的研究 [PDF ] [Copy] [Kimi ] [REL]
#162 A New Parallel Cooperative Landscape Smoothing Algorithm and Its Applications on TSP and UBQP #162 一种新的并行协作景观平滑算法及其在 TSP 和 UBQP 上的应用

2025-08-13科研追新

2025-08-12 15:55:46 Tuesday ～ 2025-08-13 19:45:18 Wednesday

1. 源数据

1.1 公众号

1.1.1 量子位

1.1.2 机器之心

1.1.3 新智元

1.1.4 AGI Hunt

1.1.5 其他

1.2 Arxiv

1.2.1 Computation and Language

From：https:// /arxiv/cs.CL

From：https://arxiv.org/list/cs.CL/recent 2025-08-13 | | 总计：78

#1 Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models #1 时间就是特征：在扩散语言模型中利用时间动态

Diffusion large language models (dLLMs) generate text through iterative denoising, yet current decoding strategies discard rich intermediate predictions in favor of the final output. Our work here reveals a critical phenomenon, temporal oscillation, where correct answers often emerge in the middle process, but are overwritten in later denoising steps. To address this issue, we introduce two complementary methods that exploit temporal consistency: 1) Temporal Self-Consistency Voting, a training-free, test-time decoding strategy that aggregates predictions across denoising steps to select the most consistent output; and 2) a post-training method termed Temporal Consistency Reinforcement, which uses Temporal Semantic Entropy (TSE), a measure of semantic stability across intermediate predictions, as a reward signal to encourage stable generations. Empirical results across multiple benchmarks demonstrate the effectiveness of our approach. Using the negative TSE reward alone, we observe a remarkable average improvement of 24.7% on the Countdown dataset over an existing dLLM. Combined with the accuracy reward, we achieve absolute gains of 2.0% on GSM8K, 4.3% on MATH500, 6.6% on SVAMP, and 25.3% on Countdown, respectively. Our findings underscore the untapped potential of temporal dynamics in dLLMs and offer two simple yet effective tools to harness them. 扩散大语言模型（dLLMs）通过迭代去噪生成文本，但现有的解码策略在最终输出上丢弃了丰富的中间预测。我们在此项工作中揭示了一个关键现象——时间振荡（temporal oscillation），即正确答案常在中间过程出现，却在后续去噪步骤中被覆盖。为了解决这一问题，我们提出了两种利用时间一致性的互补方法：1）时间自一致投票（Temporal Self-Consistency Voting），一种无需训练的测试时解码策略，通过聚合去噪步骤中的预测以选择最一致的输出；以及 2）一种名为时间一致性强化（Temporal Consistency Reinforcement）的后训练方法，它使用时间语义熵（Temporal Semantic Entropy，TSE）——衡量中间预测语义稳定性的指标，作为奖励信号以鼓励生成的稳定性。多项基准测试的实证结果证明了我们方法的有效性。仅使用负 TSE 奖励，我们在 Countdown 数据集上就比现有的 dLLM 取得了显著的平均提升 24.7%。结合准确性奖励，我们在 GSM8K、MATH500、SVAMP 和 Countdown 上分别取得了绝对提升 2.0%、4.3%、6.6% 和 25.3%。我们的研究结果强调了动态大型语言模型（dLLMs）时间动态的未开发潜力，并提出了两个简单而有效的工具以加以利用。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-12 17:59:57 UTC 发布：2025-08-12 17:59:57 UTC

#2 Complex Logical Instruction Generation #2 复杂逻辑指令生成

Instruction following has catalyzed the recent era of Large Language Models (LLMs) and is the foundational skill underpinning more advanced capabilities such as reasoning and agentic behaviors. As tasks grow more challenging, the logic structures embedded in natural language instructions becomes increasingly intricate. However, how well LLMs perform on such logic-rich instructions remains under-explored. We propose LogicIFGen and LogicIFEval. LogicIFGen is a scalable, automated framework for generating verifiable instructions from code functions, which can naturally express rich logic such as conditionals, nesting, recursion, and function calls. We further curate a collection of complex code functions and use LogicIFGen to construct LogicIFEval, a benchmark comprising 426 verifiable logic-rich instructions. Our experiments demonstrate that current state-of-the-art LLMs still struggle to correctly follow the instructions in LogicIFEval. Most LLMs can only follow fewer than 60% of the instructions, revealing significant deficiencies in the instruction-following ability. Code and Benchmark: https://github.com/mianzhang/LogicIF 指令遵循催生了近期的大型语言模型 (LLMs) 时代，并且是支撑更高级能力（如推理和自主代理行为）的基础技能。随着任务变得愈发困难，自然语言指令中嵌入的逻辑结构也变得越来越复杂。然而，LLMs 在此类富含逻辑的指令上的表现仍然未被充分研究。我们提出了 LogicIFGen 和 LogicIFEval。LogicIFGen 是一个可扩展的、自动化的框架，用于从代码函数生成可验证的指令，代码能够自然地表达诸如条件语句、嵌套、递归和函数调用等丰富逻辑。我们进一步整理了一组复杂的代码函数，并使用 LogicIFGen 构建了 LogicIFEval——一个包含 426 条可验证的富逻辑指令的基准测试。我们的实验表明，当前最先进的 LLMs 在正确遵循 LogicIFEval 中的指令方面仍然举步维艰。大多数 LLMs 只能遵循不到 60% 的指令，显示出指令遵循能力存在显著不足。代码与基准： https://github.com/mianzhang/LogicIF

Subjects: Computation and Language, Machine Learning 主题：计算与语言，机器学习

Publish: 2025-08-12 17:54:27 UTC 发布：2025-08-12 17:54:27 UTC

#3 OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows #3 OdysseyBench：在长时域复杂办公应用工作流上评估 LLM 代理

Authors: [Weixuan Wang](https://arxiv.org/search/?searchtype=author&query=Weixuan Wang), [Dongge Han](https://arxiv.org/search/?searchtype=author&query=Dongge Han), [Daniel Madrigal Diaz](https://arxiv.org/search/?searchtype=author&query=Daniel Madrigal Diaz), [Jin Xu](https://arxiv.org/search/?searchtype=author&query=Jin Xu), [Victor Rühle](https://arxiv.org/search/?searchtype=author&query=Victor Rühle), [Saravan Rajmohan](https://arxiv.org/search/?searchtype=author&query=Saravan Rajmohan) 作者：王伟轩、韩东格、Daniel Madrigal Diaz、徐晋、Victor Rühle、Saravan Rajmohan

Autonomous agents powered by large language models (LLMs) are increasingly deployed in real-world applications requiring complex, long-horizon workflows. However, existing benchmarks predominantly focus on atomic tasks that are self-contained and independent, failing to capture the long-term contextual dependencies and multi-interaction coordination required in realistic scenarios. To address this gap, we introduce OdysseyBench, a comprehensive benchmark for evaluating LLM agents on long-horizon workflows across diverse office applications including Word, Excel, PDF, Email, and Calendar. Our benchmark comprises two complementary splits: OdysseyBench+ with 300 tasks derived from real-world use cases, and OdysseyBench-Neo with 302 newly synthesized complex tasks. Each task requires agent to identify essential information from long-horizon interaction histories and perform multi-step reasoning across various applications. To enable scalable benchmark creation, we propose HomerAgents, a multi-agent framework that automates the generation of long-horizon workflow benchmarks through systematic environment exploration, task generation, and dialogue synthesis. Our extensive evaluation demonstrates that OdysseyBench effectively challenges state-of-the-art LLM agents, providing more accurate assessment of their capabilities in complex, real-world contexts compared to existing atomic task benchmarks. We believe that OdysseyBench will serve as a valuable resource for advancing the development and evaluation of LLM agents in real-world productivity scenarios. In addition, we release OdysseyBench and HomerAgents to foster research along this line. 由大型语言模型（LLMs）驱动的自主代理越来越多地部署于需要复杂、长时段工作流程的真实应用中。然而，现有基准测试主要侧重于自成一体且相互独立的原子任务，未能捕捉现实场景中所需的长期上下文依赖和多次交互协调。为填补这一空白，我们提出了 OdysseyBench，这是一个用于评估 LLM 代理在跨越 Word、Excel、PDF、邮件和日历等多种办公应用的长时段工作流程能力的综合基准。我们的基准由两个互补的子集构成：OdysseyBench+，包含 300 个源自真实用例的任务；以及 OdysseyBench-Neo，包含 302 个新合成的复杂任务。每个任务都要求代理从长时段的交互历史中识别关键信息，并在多个应用之间进行多步骤推理。为支持可扩展的基准构建，我们提出了 HomerAgents——一个多代理框架，通过系统化的环境探索、任务生成和对话合成，自动化生成长时段工作流程基准。我们的广泛评估表明，OdysseyBench 能有效挑战最先进的 LLM 代理，相较于现有的原子任务基准，在复杂的真实情境中对其能力提供了更准确的评估。我们相信 OdysseyBench 将成为推动 LLM 代理在真实生产力场景中开发与评估的宝贵资源。此外，我们发布 OdysseyBench 和 HomerAgents 以促进该领域的研究。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-12 17:53:03 UTC 发布：2025-08-12 17:53:03 UTC

#4 SinLlama - A Large Language Model for Sinhala #4 SinLlama - 面向僧伽罗语的大型语言模型 [PDF 2 ] [Copy] [Kimi 3 ] [REL]

Authors: [H. W. K. Aravinda](https://arxiv.org/search/?searchtype=author&query=H. W. K. Aravinda), [Rashad Sirajudeen](https://arxiv.org/search/?searchtype=author&query=Rashad Sirajudeen), [Samith Karunathilake](https://arxiv.org/search/?searchtype=author&query=Samith Karunathilake), [Nisansa de Silva](https://arxiv.org/search/?searchtype=author&query=Nisansa de Silva), [Surangika Ranathunga](https://arxiv.org/search/?searchtype=author&query=Surangika Ranathunga), [Rishemjit Kaur](https://arxiv.org/search/?searchtype=author&query=Rishemjit Kaur) 作者：H. W. K. Aravinda、Rashad Sirajudeen、Samith Karunathilake、Nisansa de Silva、Surangika Ranathunga、Rishemjit Kaur

Low-resource languages such as Sinhala are often overlooked by open-source Large Language Models (LLMs). In this research, we extend an existing multilingual LLM (Llama-3-8B) to better serve Sinhala. We enhance the LLM tokenizer with Sinhala specific vocabulary and perform continual pre-training on a cleaned 10 million Sinhala corpus, resulting in the SinLlama model. This is the very first decoder-based open-source LLM with explicit Sinhala support. When SinLlama was instruction fine-tuned for three text classification tasks, it outperformed base and instruct variants of Llama-3-8B by a significant margin. 像僧伽罗语这样的低资源语言常常被开源 LLMs 忽视。在本研究中，我们扩展了现有的多语言 LLM（Llama-3-8B）以更好地服务僧伽罗语。我们为 LLM 分词器增强了僧伽罗语特定词汇，并在清理过的 1000 万条僧伽罗语语料上进行了持续预训练，得出了 SinLlama 模型。这是第一个明确支持僧伽罗语的基于解码器的开源 LLM。当对 SinLlama 进行指令微调以完成三项文本分类任务时，其表现显著优于 Llama-3-8B 的基础版和指令版变体。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-12 17:49:34 UTC 发布：2025-08-12 17:49:34 UTC

#5 AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators

Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, with code generation emerging as a key area of focus. While numerous benchmarks have been proposed to evaluate their code generation abilities, these benchmarks face several critical limitations. First, they often rely on manual annotations, which are time-consuming and difficult to scale across different programming languages and problem complexities. Second, most existing benchmarks focus primarily on Python, while the few multilingual benchmarks suffer from limited difficulty and uneven language distribution. To address these challenges, we propose AutoCodeGen, an automated method for generating high-difficulty multilingual code generation datasets without manual annotations. AutoCodeGen ensures the correctness and completeness of test cases by generating test inputs with LLMs and obtaining test outputs through a multilingual sandbox, while achieving high data quality through reverse-order problem generation and multiple filtering steps. Using this novel method, we introduce AutoCodeBench, a large-scale code generation benchmark comprising 3,920 problems evenly distributed across 20 programming languages. It is specifically designed to evaluate LLMs on challenging, diverse, and practical multilingual tasks. We evaluate over 30 leading open-source and proprietary LLMs on AutoCodeBench and its simplified version AutoCodeBench-Lite. The results show that even the most advanced LLMs struggle with the complexity, diversity, and multilingual nature of these tasks. Besides, we introduce AutoCodeBench-Complete, specifically designed for base models to assess their few-shot code generation capabilities. We hope the AutoCodeBench series will serve as a valuable resource and inspire the community to focus on more challenging and practical multilingual code generation scenarios. 大型语言模型（LLMs）在各个领域展现了卓越能力，其中代码生成成为一个重要关注点。尽管已经提出了许多评估其代码生成能力的基准，这些基准存在若干关键局限。首先，它们常依赖人工标注，这既耗时又难以在不同编程语言和问题复杂度间扩展。其次，大多数现有基准主要集中于 Python，而为数不多的多语种基准则存在难度有限且语言分布不均的问题。为了解决这些挑战，我们提出了 AutoCodeGen，一种无需人工标注即可自动生成高难度多语种代码生成数据集的方法。AutoCodeGen 通过使用 LLMs 生成测试输入并通过多语种沙箱获得测试输出，来确保测试用例的正确性和完整性；同时通过逆序问题生成和多重过滤步骤实现高数据质量。使用这种新方法，我们推出了 AutoCodeBench，这是一套大规模的代码生成基准测试，包含 3,920 道题目，均匀分布在 20 种编程语言中。它专门用于评估 LLMs 在具有挑战性、多样化且实用的多语言任务上的表现。我们在 AutoCodeBench 及其简化版 AutoCodeBench-Lite 上评估了 30 多个领先的开源和专有 LLMs。结果显示，即便是最先进的 LLMs 在这些任务的复杂性、多样性和多语言特性面前也举步维艰。此外，我们还推出了 AutoCodeBench-Complete，专为基础模型设计以评估其少样本代码生成能力。我们希望 AutoCodeBench 系列能成为有价值的资源，并激励社区关注更具挑战性和实用性的多语言代码生成场景。

Subjects: Computation and Language, Software Engineering 主题：计算与语言，软件工程

Publish: 2025-08-12 17:29:20 UTC 发布：2025-08-12 17:29:20 UTC

#6 Link Prediction for Event Logs in the Process Industry #6 用于流程工业事件日志的链接预测

Authors: [Anastasia Zhukova](https://arxiv.org/search/?searchtype=author&query=Anastasia Zhukova), [Thomas Walton](https://arxiv.org/search/?searchtype=author&query=Thomas Walton), [Christian E. Matt](https://arxiv.org/search/?searchtype=author&query=Christian E. Matt), [Bela Gipp](https://arxiv.org/search/?searchtype=author&query=Bela Gipp)

Knowledge management (KM) is vital in the process industry for optimizing operations, ensuring safety, and enabling continuous improvement through effective use of operational data and past insights. A key challenge in this domain is the fragmented nature of event logs in shift books, where related records, e.g., entries documenting issues related to equipment or processes and the corresponding solutions, may remain disconnected. This fragmentation hinders the recommendation of previous solutions to the users. To address this problem, we investigate record linking (RL) as link prediction, commonly studied in graph-based machine learning, by framing it as a cross-document coreference resolution (CDCR) task enhanced with natural language inference (NLI) and semantic text similarity (STS) by shifting it into the causal inference (CI). We adapt CDCR, traditionally applied in the news domain, into an RL model to operate at the passage level, similar to NLI and STS, while accommodating the process industry’s specific text formats, which contain unstructured text and structured record attributes. Our RL model outperformed the best versions of NLI- and STS-driven baselines by 28% (11.43 points) and 27% (11.21 points), respectively. Our work demonstrates how domain adaptation of the state-of-the-art CDCR models, enhanced with reasoning capabilities, can be effectively tailored to the process industry, improving data quality and connectivity in shift logs. 知识管理（KM）在过程工业中对于优化操作、确保安全以及通过有效利用运行数据和过往经验实现持续改进至关重要。该领域的一个关键挑战是交接班日志中事件记录的碎片化——相关记录（例如记录设备或工艺相关问题及相应解决方案的条目）可能相互脱节。这种碎片化阻碍了向用户推荐以往解决方案。为了解决这一问题，我们将记录链接（RL）作为连边预测来研究——这是图基机器学习中常见的研究方向——通过将其构建为一种跨文档共指消解（CDCR）任务，并通过引入自然语言推理（NLI）和语义文本相似性（STS）将其提升为因果推断（CI）。我们将传统上应用于新闻领域的 CDCR 改造为在段落级别运行的 RL 模型，类似于 NLI 和 STS，同时适应过程工业特有的文本格式，这些格式包含非结构化文本和结构化记录属性。我们的强化学习模型分别比基于自然语言推理（NLI）和语义文本相似度（STS）的最佳基线版本高出 28%（11.43 个百分点）和 27%（11.21 个百分点）。我们的工作展示了如何将具有推理能力的最先进跨文档共指消解（CDCR）模型进行领域适配，有效定制到过程工业，从而改善班次日志中的数据质量和连通性。

Subjects: Computation and Language, Information Retrieval 主题：计算与语言，信息检索

Publish: 2025-08-12 17:22:29 UTC 发布：2025-08-12 17:22:29 UTC

#7 Utilizing Multilingual Encoders to Improve Large Language Models for Low-Resource Languages #7 利用多语种编码器提升低资源语言的大型语言模型

Authors: [Imalsha Puranegedara](https://arxiv.org/search/?searchtype=author&query=Imalsha Puranegedara), [Themira Chathumina](https://arxiv.org/search/?searchtype=author&query=Themira Chathumina), [Nisal Ranathunga](https://arxiv.org/search/?searchtype=author&query=Nisal Ranathunga), [Nisansa de Silva](https://arxiv.org/search/?searchtype=author&query=Nisansa de Silva), [Surangika Ranathunga](https://arxiv.org/search/?searchtype=author&query=Surangika Ranathunga), [Mokanarangan Thayaparan](https://arxiv.org/search/?searchtype=author&query=Mokanarangan Thayaparan) 作者：Imalsha Puranegedara, Themira Chathumina, Nisal Ranathunga, Nisansa de Silva, Surangika Ranathunga, Mokanarangan Thayaparan

Large Language Models (LLMs) excel in English, but their performance degrades significantly on low-resource languages (LRLs) due to English-centric training. While methods like LangBridge align LLMs with multilingual encoders such as the Massively Multilingual Text-to-Text Transfer Transformer (mT5), they typically use only the final encoder layer. We propose a novel architecture that fuses all intermediate layers, enriching the linguistic information passed to the LLM. Our approach features two strategies: (1) a Global Softmax weighting for overall layer importance, and (2) a Transformer Softmax model that learns token-specific weights. The fused representations are mapped into the LLM’s embedding space, enabling it to process multilingual inputs. The model is trained only on English data, without using any parallel or multilingual data. Evaluated on XNLI, IndicXNLI, Sinhala News Classification, and Amazon Reviews, our Transformer Softmax model significantly outperforms the LangBridge baseline. We observe strong performance gains in LRLs, improving Sinhala classification accuracy from 71.66% to 75.86% and achieving clear improvements across Indic languages such as Tamil, Bengali, and Malayalam. These specific gains contribute to an overall boost in average XNLI accuracy from 70.36% to 71.50%. This approach offers a scalable, data-efficient path toward more capable and equitable multilingual LLMs. 大型语言模型 (LLMs) 在英语上表现出色，但由于以英语为中心的训练，它们在低资源语言 (LRLs) 上的性能显著下降。尽管像 LangBridge 这样的方法将 LLMs 与多语种编码器（例如大规模多语种文本到文本迁移变换器 mT5）对齐，但它们通常只使用编码器的最终一层。我们提出了一种新架构，融合了所有中间层，从而丰富传递给 LLM 的语言信息。我们的方法包含两种策略： (1) 用于整体层重要性的全局 Softmax 加权，以及 (2) 学习特定于标记权重的 Transformer Softmax 模型。融合后的表示被映射到 LLM 的嵌入空间，使其能够处理多语种输入。该模型仅在英语数据上训练，未使用任何平行或多语种数据。在 XNLI、IndicXNLI、僧伽罗新闻分类和亚马逊评论上的评估表明，我们的 Transformer Softmax 模型显著优于 LangBridge 基线。我们在低资源语言（LRLs）上观察到显著的性能提升，将僧伽罗语分类准确率从 71.66%提高到 75.86%，并在泰米尔语、孟加拉语和马拉雅拉姆语等印度语系语言上取得明显改进。这些具体提升促使平均 XNLI 准确率从 70.36%上升到 71.50%。该方法为构建更强大且更公平的多语种 LLMs 提供了一条可扩展、数据高效的路径。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-12 17:17:13 UTC 发布时间：2025-08-12 17:17:13 UTC

#8 CPO: Addressing Reward Ambiguity in Role-playing Dialogue via Comparative Policy Optimization #8 CPO：通过比较策略优化在角色扮演对话中解决奖励歧义

Authors: [Xinge Ye](https://arxiv.org/search/?searchtype=author&query=Xinge Ye), [Rui Wang](https://arxiv.org/search/?searchtype=author&query=Rui Wang), [Yuchuan Wu](https://arxiv.org/search/?searchtype=author&query=Yuchuan Wu), [Victor Ma](https://arxiv.org/search/?searchtype=author&query=Victor Ma), [Feiteng Fang](https://arxiv.org/search/?searchtype=author&query=Feiteng Fang), [Fei Huang](https://arxiv.org/search/?searchtype=author&query=Fei Huang), [Yongbin Li](https://arxiv.org/search/?searchtype=author&query=Yongbin Li) 作者：叶兴格、王睿、吴宇川、Victor Ma、方飞腾、黄飞、李永滨

Reinforcement Learning Fine-Tuning (RLFT) has achieved notable success in tasks with objectively verifiable answers (e.g., code generation, mathematical reasoning), yet struggles with open-ended subjective tasks like role-playing dialogue. Traditional reward modeling approaches, which rely on independent sample-wise scoring, face dual challenges: subjective evaluation criteria and unstable reward signals.Motivated by the insight that human evaluation inherently combines explicit criteria with implicit comparative judgments, we propose Comparative Policy Optimization (CPO). CPO redefines the reward evaluation paradigm by shifting from sample-wise scoring to comparative group-wise scoring.Building on the same principle, we introduce the CharacterArena evaluation framework, which comprises two stages:(1) Contextualized Multi-turn Role-playing Simulation, and (2) Trajectory-level Comparative Evaluation. By operationalizing subjective scoring via objective trajectory comparisons, CharacterArena minimizes contextual bias and enables more robust and fair performance evaluation. Empirical results on CharacterEval, CharacterBench, and CharacterArena confirm that CPO effectively mitigates reward ambiguity and leads to substantial improvements in dialogue quality. 强化学习微调（RLFT）在具有客观可验证答案的任务（例如代码生成、数学推理）中取得了显著成功，但在角色扮演对话等开放性主观任务上表现欠佳。传统的奖励建模方法依赖于独立的逐样本评分，面临双重挑战：主观性评估标准和不稳定的奖励信号。基于人类评估本质上将显式标准与隐性比较判断相结合的洞见，我们提出了比较策略优化（CPO）。CPO 通过从逐样本评分转向比较的群体评分，重新定义了奖励评估范式。在同一原则基础上，我们引入了 CharacterArena 评估框架，该框架包含两个阶段：（1）情境化的多轮角色扮演模拟，和（2）轨迹级的比较评估。通过以客观的轨迹比较来实现主观评分，CharacterArena 将情境偏差降到最低，并实现更稳健、公平的性能评估。在 CharacterEval、CharacterBench 和 CharacterArena 上的实证结果证实，CPO 有效缓解了奖励模糊性，并在对话质量上带来了显著提升。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-12 16:49:18 UTC 发布：2025-08-12 16:49:18 UTC

#9 READER: Retrieval-Assisted Drafter for Efficient LLM Inference #9 读者：用于高效 LLM 推理的检索辅助起草器

Large Language Models (LLMs) generate tokens autoregressively, with each token depending on the preceding context. This sequential nature makes the inference process inherently difficult to accelerate, posing a significant challenge for efficient deployment. In recent years, various methods have been proposed to address this issue, with the most effective approaches often involving the training of additional draft models. In this paper, we introduce READER (Retrieval-Assisted Drafter for Efficient LLM Inference), a novel lossless speculative decoding method that enhances model-based approaches by leveraging self-repetitions in the text. Our algorithm expands the speculative decoding tree using tokens obtained through statistical search. This work focuses on large batch sizes (>= 8), an underexplored yet important area for industrial applications. We also analyze the key-value (KV) cache size during speculative decoding and propose an optimization to improve performance for large batches. As a result, READER outperforms existing speculative decoding methods. Notably, READER requires no additional training and can reuse pre-trained speculator models, increasing the speedup by over 40%. Our method demonstrates particularly strong performance on search-based tasks, such as retrieval-augmented generation, where we achieve more than 10x speedup. 大型语言模型（LLMs）以自回归方式生成标记，每个标记都依赖于前面的上下文。这种序列性使得推理过程本质上难以加速，给高效部署带来重大挑战。近年来，已经提出了多种方法来解决这一问题，其中最有效的方法通常涉及训练额外的草稿模型。在本文中，我们介绍了 READER（Retrieval-Assisted Drafter for Efficient LLM Inference），这是一种新颖的无损猜测解码方法，通过利用文本中的自我重复来增强基于模型的方法。我们的算法使用通过统计检索获得的标记来扩展猜测解码树。该工作着重于较大批量（>= 8），这是工业应用中一个尚未充分探索但重要的领域。我们还分析了猜测解码期间的键值（KV）缓存大小，并提出了一种优化以改善大批量的性能。因此，READER 优于现有的猜测解码方法。值得注意的是，READER 不需要额外训练并且可以重用预训练的预测器模型，从而将加速提高超过 40%。我们的方法在基于检索的任务上表现尤为出色，例如检索增强生成，在这些任务中我们实现了超过 10 倍的加速。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-12 16:47:48 UTC 发布：2025-08-12 16:47:48 协调世界时 (UTC)

#10 MVISU-Bench: Benchmarking Mobile Agents for Real-World Tasks by Multi-App, Vague, Interactive, Single-App and Unethical Instructions #10 MVISU-Bench：通过多应用、模糊、交互、单应用和不道德指令对移动代理执行真实世界任务进行基准测试 [PDF ] [Copy] [Kimi ] [REL]

Authors: [Zeyu Huang](https://arxiv.org/search/?searchtype=author&query=Zeyu Huang), [Juyuan Wang](https://arxiv.org/search/?searchtype=author&query=Juyuan Wang), [Longfeng Chen](https://arxiv.org/search/?searchtype=author&query=Longfeng Chen), [Boyi Xiao](https://arxiv.org/search/?searchtype=author&query=Boyi Xiao), [Leng Cai](https://arxiv.org/search/?searchtype=author&query=Leng Cai), [Yawen Zeng](https://arxiv.org/search/?searchtype=author&query=Yawen Zeng), [Jin Xu](https://arxiv.org/search/?searchtype=author&query=Jin Xu) 作者：黄泽宇、王居源、陈龙锋、肖博义、蔡冷、曾雅雯、徐进

Given the significant advances in Large Vision Language Models (LVLMs) in reasoning and visual understanding, mobile agents are rapidly emerging to meet users’ automation needs. However, existing evaluation benchmarks are disconnected from the real world and fail to adequately address the diverse and complex requirements of users. From our extensive collection of user questionnaire, we identified five tasks: Multi-App, Vague, Interactive, Single-App, and Unethical Instructions. Around these tasks, we present \textbf{MVISU-Bench}, a bilingual benchmark that includes 404 tasks across 137 mobile applications. Furthermore, we propose Aider, a plug-and-play module that acts as a dynamic prompt prompter to mitigate risks and clarify user intent for mobile agents. Our Aider is easy to integrate into several frameworks and has successfully improved overall success rates by 19.55% compared to the current state-of-the-art (SOTA) on MVISU-Bench. Specifically, it achieves success rate improvements of 53.52% and 29.41% for unethical and interactive instructions, respectively. Through extensive experiments and analysis, we highlight the gap between existing mobile agents and real-world user expectations. 鉴于大型视觉语言模型（LVLM）在推理和视觉理解方面的显著进展，移动代理正迅速兴起以满足用户的自动化需求。然而，现有的评测基准与现实世界脱节，未能充分覆盖用户多样且复杂的需求。通过我们对大量用户问卷的收集，确定了五类任务：多应用（Multi-App）、模糊（Vague）、交互式（Interactive）、单应用（Single-App）和不道德指令（Unethical Instructions）。围绕这些任务，我们提出了 MVISU-Bench，这是一个双语基准，包含跨 137 个移动应用的 404 个任务。此外，我们提出了 Aider，一个可插拔模块，作为动态提示生成器，用以降低风险并澄清移动代理的用户意图。我们的 Aider 易于集成到多种框架中，并在 MVISU-Bench 上将整体成功率较当前最先进水平（SOTA）提升了 19.55%。具体而言，在不道德指令和交互式指令上，成功率分别提高了 53.52% 和 29.41%。通过大量实验和分析，我们指出了现有移动代理与现实用户期望之间的差距。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-12 16:18:30 UTC 发布：2025-08-12 16:18:30 UTC

#11 LLM-as-a-Supervisor: Mistaken Therapeutic Behaviors Trigger Targeted Supervisory Feedback #11 LLM-as-a-Supervisor：错误的治疗行为触发有针对性的监督反馈

Although large language models (LLMs) hold significant promise in psychotherapy, their direct application in patient-facing scenarios raises ethical and safety concerns. Therefore, this work shifts towards developing an LLM as a supervisor to train real therapists. In addition to the privacy of clinical therapist training data, a fundamental contradiction complicates the training of therapeutic behaviors: clear feedback standards are necessary to ensure a controlled training system, yet there is no absolute “gold standard” for appropriate therapeutic behaviors in practice. In contrast, many common therapeutic mistakes are universal and identifiable, making them effective triggers for targeted feedback that can serve as clearer evidence. Motivated by this, we create a novel therapist-training paradigm: (1) guidelines for mistaken behaviors and targeted correction strategies are first established as standards; (2) a human-in-the-loop dialogue-feedback dataset is then constructed, where a mistake-prone agent intentionally makes standard mistakes during interviews naturally, and a supervisor agent locates and identifies mistakes and provides targeted feedback; (3) after fine-tuning on this dataset, the final supervisor model is provided for real therapist training. The detailed experimental results of automated, human and downstream assessments demonstrate that models fine-tuned on our dataset MATE, can provide high-quality feedback according to the clinical guideline, showing significant potential for the therapist training scenario. 尽管大型语言模型（LLMs）在心理治疗方面具有重要潜力，但其直接应用于面向患者的场景会引发伦理和安全问题。因此，本工作转向将 LLM 作为监督者来训练真实的治疗师。除去临床治疗师训练数据的隐私问题外，一个根本性矛盾使得治疗行为的训练变得复杂：需要明确的反馈标准以保证可控的训练体系，但在实践中并不存在绝对的“金标准”来定义适当的治疗行为。相比之下，许多常见的治疗错误是普遍且可识别的，使其成为针对性反馈的有效触发点，从而可以作为更明确的证据。受此启发，我们创造了一种新颖的治疗师训练范式：（1）首先将错误行为指南和有针对性的纠正策略制定为标准；（2）然后构建一个有人参与的对话反馈数据集，其中一个易出错的代理在访谈中有意自然地犯下标准错误，监督代理定位并识别错误并提供有针对性的反馈；（3）在该数据集上微调后，将最终的监督模型用于真实的治疗师训练。自动化、人类和下游评估的详细实验结果表明，在我们的数据集 MATE 上微调的模型，能够根据临床指南提供高质量的反馈，显示出在治疗师训练场景中的显著潜力。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-12 16:03:36 UTC 发布：2025-08-12 16:03:36 世界协调时

#12 A Survey on Training-free Alignment of Large Language Models #12 关于大型语言模型无训练对齐方法的综述

The alignment of large language models (LLMs) aims to ensure their outputs adhere to human values, ethical standards, and legal norms. Traditional alignment methods often rely on resource-intensive fine-tuning (FT), which may suffer from knowledge degradation and face challenges in scenarios where the model accessibility or computational resources are constrained. In contrast, training-free (TF) alignment techniques–leveraging in-context learning, decoding-time adjustments, and post-generation corrections–offer a promising alternative by enabling alignment without heavily retraining LLMs, making them adaptable to both open-source and closed-source environments. This paper presents the first systematic review of TF alignment methods, categorizing them by stages of pre-decoding, in-decoding, and post-decoding. For each stage, we provide a detailed examination from the viewpoint of LLMs and multimodal LLMs (MLLMs), highlighting their mechanisms and limitations. Furthermore, we identify key challenges and future directions, paving the way for more inclusive and effective TF alignment techniques. By synthesizing and organizing the rapidly growing body of research, this survey offers a guidance for practitioners and advances the development of safer and more reliable LLMs. 大型语言模型（LLMs）的对齐旨在确保其输出符合人类价值观、伦理标准和法律规范。传统的对齐方法通常依赖资源密集的微调（FT），这可能导致知识退化，并在模型可访问性或计算资源受限的情况下面临挑战。相反，无需训练（TF）的对齐技术——利用上下文学习、解码时调整和生成后修正——通过在不大量重新训练 LLMs 的情况下实现对齐，提供了有前途的替代方案，使其可适用于开源和闭源环境。本文提出了首个关于 TF 对齐方法的系统综述，按预解码、解码中和解码后三个阶段对其进行了分类。针对每一阶段，我们从 LLMs 和多模态 LLMs（MLLMs）的视角提供了详细审视，强调了它们的机制和局限性。此外，我们识别了关键挑战和未来方向，为更具包容性和更有效的 TF 对齐技术铺平了道路。通过综合与组织快速增长的研究成果，本综述为从业者提供了指导，并推动了更安全、更可靠 LLMs 的发展。

Subjects: Computation and Language, Machine Learning 主题：计算与语言，机器学习

Publish: 2025-08-12 15:30:44 UTC 发布：2025-08-12 15:30:44 UTC

#13 LyS at SemEval 2025 Task 8: Zero-Shot Code Generation for Tabular QA #13 LyS 在 SemEval 2025 任务 8：用于表格问答的零样本代码生成 [PDF ] [Copy] [Kimi ] [REL]

Authors: [Adrián Gude](https://arxiv.org/search/?searchtype=author&query=Adrián Gude), [Roi Santos-Ríos](https://arxiv.org/search/?searchtype=author&query=Roi Santos-Ríos), [Francisco Prado-Valiño](https://arxiv.org/search/?searchtype=author&query=Francisco Prado-Valiño), [Ana Ezquerro](https://arxiv.org/search/?searchtype=author&query=Ana Ezquerro), [Jesús Vilares](https://arxiv.org/search/?searchtype=author&query=Jesús Vilares) 作者：Adrián Gude、Roi Santos-Ríos、Francisco Prado-Valiño、Ana Ezquerro、Jesús Vilares

This paper describes our participation in SemEval 2025 Task 8, focused on Tabular Question Answering. We developed a zero-shot pipeline that leverages an Large Language Model to generate functional code capable of extracting the relevant information from tabular data based on an input question. Our approach consists of a modular pipeline where the main code generator module is supported by additional components that identify the most relevant columns and analyze their data types to improve extraction accuracy. In the event that the generated code fails, an iterative refinement process is triggered, incorporating the error feedback into a new generation prompt to enhance robustness. Our results show that zero-shot code generation is a valid approach for Tabular QA, achieving rank 33 of 53 in the test phase despite the lack of task-specific fine-tuning. 本文描述了我们在 SemEval 2025 第 8 任务（聚焦表格问答）的参赛工作。我们开发了一个零样本流水线，利用大型语言模型生成能够根据输入问题从表格数据中提取相关信息的可执行代码。我们的方法由一个模块化流水线组成，主代码生成模块由额外组件支持，这些组件识别最相关的列并分析其数据类型以提高提取准确性。如果生成的代码失败，会触发迭代改进流程，将错误反馈纳入新的生成提示以增强鲁棒性。我们的结果表明，零样本代码生成是表格问答的有效方法，尽管缺乏针对任务的微调，在测试阶段仍取得了 53 个参赛队伍中第 33 名的成绩。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-12 15:25:31 UTC 发布时间：2025-08-12 15:25:31 UTC

#14 Retrospective Sparse Attention for Efficient Long-Context Generation #14 回顾性稀疏注意力以实现高效长上下文生成

Authors: [Seonghwan Choi](https://arxiv.org/search/?searchtype=author&query=Seonghwan Choi), [Beomseok Kang](https://arxiv.org/search/?searchtype=author&query=Beomseok Kang), [Dongwon Jo](https://arxiv.org/search/?searchtype=author&query=Dongwon Jo), [Jae-Joon Kim](https://arxiv.org/search/?searchtype=author&query=Jae-Joon Kim) 作者：Seonghwan Choi、Beomseok Kang、Dongwon Jo、Jae-Joon Kim

Large Language Models (LLMs) are increasingly deployed in long-context tasks such as reasoning, code generation, and multi-turn dialogue. However, inference over extended contexts is bottlenecked by the Key-Value (KV) cache, whose memory footprint grows linearly with sequence length and dominates latency at each decoding step. While recent KV cache compression methods identify and load important tokens, they focus predominantly on input contexts and fail to address the cumulative attention errors that arise during long decoding. In this paper, we introduce RetroAttention, a novel KV cache update technique that retrospectively revises past attention outputs using newly arrived KV entries from subsequent decoding steps. By maintaining a lightweight output cache, RetroAttention enables past queries to efficiently access more relevant context, while incurring minimal latency overhead. This breaks the fixed-attention-output paradigm and allows continual correction of prior approximations. Extensive experiments on long-generation benchmarks show that RetroAttention consistently outperforms state-of-the-art (SOTA) KV compression methods, increasing effective KV exposure by up to 1.6× and accuracy by up to 21.9%. 大型语言模型 (LLMs) 越来越多地被用于长上下文任务，如推理、代码生成和多轮对话。然而，对扩展上下文的推理受限于键值（KV）缓存，其内存占用随序列长度线性增长，并在每个解码步骤中主导延迟。尽管最近的 KV 缓存压缩方法识别并加载重要的标记，但它们主要集中在输入上下文上，未能解决长时间解码过程中产生的累积注意力误差。在本文中，我们提出了 RetroAttention，一种新颖的 KV 缓存更新技术，它使用随后解码步骤中新到达的 KV 条目追溯性地修正过去的注意力输出。通过维护一个轻量级的输出缓存，RetroAttention 使得过去的查询能够高效地访问更相关的上下文，同时只带来极小的延迟开销。这打破了固定注意力输出的范式，允许对先前的近似结果进行持续修正。在长文本生成基准上的大量实验表明，RetroAttention 始终优于最先进（SOTA）的键值（KV）压缩方法，有效 KV 暴露量最多提升达 1.6 × ，准确率最多提升达 21.9%。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-12 15:11:47 UTC 发布：2025-08-12 15:11:47 UTC

#15 Jointly Generating and Attributing Answers using Logits of Document-Identifier Tokens #15 使用文档标识符标记的对数几率共同生成并归因答案

Authors: [Lucas Albarede](https://arxiv.org/search/?searchtype=author&query=Lucas Albarede), [Jose Moreno](https://arxiv.org/search/?searchtype=author&query=Jose Moreno), [Lynda Tamine](https://arxiv.org/search/?searchtype=author&query=Lynda Tamine), [Luce Lefeuvre](https://arxiv.org/search/?searchtype=author&query=Luce Lefeuvre) 作者：Lucas Albarede、Jose Moreno、Lynda Tamine、Luce Lefeuvre

Despite their impressive performances, Large Language Models (LLMs) remain prone to hallucination, which critically undermines their trustworthiness. While most of the previous work focused on tackling answer and attribution correctness, a recent line of work investigated faithfulness, with a focus on leveraging internal model signals to reflect a model’s actual decision-making process while generating the answer. Nevertheless, these methods induce additional latency and have shown limitations in directly aligning token generation with attribution generation. In this paper, we introduce LoDIT, a method that jointly generates and faithfully attributes answers in RAG by leveraging specific token logits during generation. It consists of two steps: (1) marking the documents with specific token identifiers and then leveraging the logits of these tokens to estimate the contribution of each document to the answer during generation, and (2) aggregating these contributions into document attributions. Experiments on a trustworthiness-focused attributed text-generation benchmark, Trust-Align, show that LoDIT significantly outperforms state-of-the-art models on several metrics. Finally, an in-depth analysis of LoDIT shows both its efficiency in terms of latency and its robustness in different settings. 尽管表现令人印象深刻，LLMs 仍然容易出现“幻觉”，这严重削弱了其可信性。以往大多数研究集中在解决答案与归因正确性上，最近有一类工作则研究了忠实性，重点利用内部模型信号来反映模型在生成答案时的实际决策过程。然而，这些方法会引入额外延迟，并且在将标记生成与归因生成直接对齐方面存在局限性。在本文中，我们提出了 LoDIT，这是一种在 RAG 中通过利用生成过程中具体标记的 logits 来联合生成并忠实归因答案的方法。它包括两个步骤：（1）用特定标记标注文档，然后利用这些标记的 logits 在生成过程中估计每个文档对答案的贡献；（2）将这些贡献聚合为文档归因。在以可信性为重点的带归因文本生成基准 Trust-Align 上的实验表明，LoDIT 在多项指标上显著优于最先进的模型。最后，对 LoDIT 的深入分析表明了它在延迟方面的高效性以及在不同设置下的鲁棒性。

Subjects: Computation and Language, Information Retrieval 主题：计算与语言，信息检索

Publish: 2025-08-12 13:50:25 UTC 发布：2025-08-12 13:50:25 世界协调时间（UTC）

#16 Train Long, Think Short: Curriculum Learning for Efficient Reasoning #16 长训短思：用于高效推理的课程学习

Authors: [Hasan Abed Al Kader Hammoud](https://arxiv.org/search/?searchtype=author&query=Hasan Abed Al Kader Hammoud), [Kumail Alhamoud](https://arxiv.org/search/?searchtype=author&query=Kumail Alhamoud), [Abed Hammoud](https://arxiv.org/search/?searchtype=author&query=Abed Hammoud), [Elie Bou-Zeid](https://arxiv.org/search/?searchtype=author&query=Elie Bou-Zeid), [Marzyeh Ghassemi](https://arxiv.org/search/?searchtype=author&query=Marzyeh Ghassemi), [Bernard Ghanem](https://arxiv.org/search/?searchtype=author&query=Bernard Ghanem) 作者：Hasan Abed Al Kader Hammoud、Kumail Alhamoud、Abed Hammoud、Elie Bou-Zeid、Marzyeh Ghassemi、Bernard Ghanem

Recent work on enhancing the reasoning abilities of large language models (LLMs) has introduced explicit length control as a means of constraining computational cost while preserving accuracy. However, existing approaches rely on fixed-length training budgets, which do not take advantage of the natural progression from exploration to compression during learning. In this work, we propose a curriculum learning strategy for length-controlled reasoning using Group Relative Policy Optimization (GRPO). Our method starts with generous token budgets and gradually tightens them over training, encouraging models to first discover effective solution strategies and then distill them into more concise reasoning traces. We augment GRPO with a reward function that balances three signals: task correctness (via verifier feedback), length efficiency, and formatting adherence (via structural tags). Experiments on GSM8K, MATH500, SVAMP, College Math, and GSM+ demonstrate that curriculum-based training consistently outperforms fixed-budget baselines at the same final budget, achieving higher accuracy and significantly improved token efficiency. We further ablate the impact of reward weighting and decay schedule design, showing that progressive constraint serves as a powerful inductive bias for training efficient reasoning models. Our code and checkpoints are released at: https://github.com/hammoudhasan/curriculum_grpo. 近期关于增强大型语言模型（LLMs）推理能力的工作提出了显式长度控制，作为在保持准确性的同时约束计算成本的一种手段。然而，现有方法依赖固定长度的训练预算，未能利用学习过程中从探索到压缩的自然进程。在本研究中，我们提出了一种用于长度受控推理的课程学习策略，基于群体相对策略优化（GRPO）。我们的方法从宽松的令牌预算开始，并在训练过程中逐步收紧，鼓励模型先发现有效的解题策略，然后将其蒸馏为更简洁的推理痕迹。我们为 GRPO 设计了一个奖励函数，平衡三类信号：任务正确性（通过验证器反馈）、长度效率和格式遵循性（通过结构标签）。在 GSM8K、MATH500、SVAMP、College Math 和 GSM+ 上的实验表明，基于课程的训练在相同最终预算下始终优于固定预算的基线，达到了更高的准确率并显著提升了令牌效率。我们进一步消融了奖励权重和衰减计划设计的影响，表明逐步约束作为训练高效推理模型的一种强有力先验偏置。我们的代码和检查点已发布于： https://github.com/hammoudhasan/curriculum_grpo。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-12 13:48:03 UTC 发布：2025-08-12 13:48:03 UTC

#17 Reveal-Bangla: A Dataset for Cross-Lingual Multi-Step Reasoning Evaluation #17 Reveal-Bangla：用于跨语言多步推理评估的数据集

Authors: [Khondoker Ittehadul Islam](https://arxiv.org/search/?searchtype=author&query=Khondoker Ittehadul Islam), [Gabriele Sarti](https://arxiv.org/search/?searchtype=author&query=Gabriele Sarti) 作者：Khondoker Ittehadul Islam、Gabriele Sarti

Language models have demonstrated remarkable performance on complex multi-step reasoning tasks. However, their evaluation has been predominantly confined to high-resource languages such as English. In this paper, we introduce a manually translated Bangla multi-step reasoning dataset derived from the English Reveal dataset, featuring both binary and non-binary question types. We conduct a controlled evaluation of English-centric and Bangla-centric multilingual small language models on the original dataset and our translated version to compare their ability to exploit relevant reasoning steps to produce correct answers. Our results show that, in comparable settings, reasoning context is beneficial for more challenging non-binary questions, but models struggle to employ relevant Bangla reasoning steps effectively. We conclude by exploring how reasoning steps contribute to models’ predictions, highlighting different trends across models and languages. 语言模型在复杂的多步推理任务上表现出色。然而，它们的评估主要集中在英语等高资源语言上。本文介绍了一个由英文 Reveal 数据集人工翻译而成的孟加拉语多步推理数据集，包含二元和非二元问题类型。我们对以英语为中心和以孟加拉语为中心的小型多语种语言模型在原始数据集及我们翻译版本上进行了受控评估，以比较它们利用相关推理步骤生成正确答案的能力。我们的结果表明，在可比设置下，对于更具挑战性的非二元问题，提供推理上下文是有益的，但模型在有效使用相关的孟加拉语推理步骤方面表现不佳。最后，我们探讨了推理步骤如何影响模型的预测，突出了不同模型和语言之间的不同趋势。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-12 13:34:10 UTC 发布：2025-08-12 13:34:10 UTC

#18 Munsit at NADI 2025 Shared Task 2: Pushing the Boundaries of Multidialectal Arabic ASR with Weakly Supervised Pretraining and Continual Supervised Fine-tuning #18 Munsit 在 NADI 2025 共享任务 2：通过弱监督预训练和持续监督微调推动多方言阿拉伯语自动语音识别的边界

Authors: [Mahmoud Salhab](https://arxiv.org/search/?searchtype=author&query=Mahmoud Salhab), [Shameed Sait](https://arxiv.org/search/?searchtype=author&query=Shameed Sait), [Mohammad Abusheikh](https://arxiv.org/search/?searchtype=author&query=Mohammad Abusheikh), [Hasan Abusheikh](https://arxiv.org/search/?searchtype=author&query=Hasan Abusheikh) 作者：Mahmoud Salhab、Shameed Sait、Mohammad Abusheikh、Hasan Abusheikh

Automatic speech recognition (ASR) plays a vital role in enabling natural human-machine interaction across applications such as virtual assistants, industrial automation, customer support, and real-time transcription. However, developing accurate ASR systems for low-resource languages like Arabic remains a significant challenge due to limited labeled data and the linguistic complexity introduced by diverse dialects. In this work, we present a scalable training pipeline that combines weakly supervised learning with supervised fine-tuning to develop a robust Arabic ASR model. In the first stage, we pretrain the model on 15,000 hours of weakly labeled speech covering both Modern Standard Arabic (MSA) and various Dialectal Arabic (DA) variants. In the subsequent stage, we perform continual supervised fine-tuning using a mixture of filtered weakly labeled data and a small, high-quality annotated dataset. Our approach achieves state-of-the-art results, ranking first in the multi-dialectal Arabic ASR challenge. These findings highlight the effectiveness of weak supervision paired with fine-tuning in overcoming data scarcity and delivering high-quality ASR for low-resource, dialect-rich languages. 自动语音识别（ASR）在虚拟助理、工业自动化、客户支持和实时转录等应用中对于实现自然的人机交互至关重要。然而，由于标注数据有限以及多样方言带来的语言复杂性，为阿拉伯语等低资源语言开发高精度的 ASR 系统仍然是一大挑战。在这项工作中，我们提出了一个可扩展的训练流水线，将弱监督学习与有监督微调相结合，以开发一个鲁棒的阿拉伯语 ASR 模型。在第一阶段，我们在涵盖现代标准阿拉伯语（MSA）和各种方言阿拉伯语（DA）变体的 15,000 小时弱标注语音上对模型进行预训练。在随后的阶段，我们使用混合了筛选过的弱标注数据和一小部分高质量注释数据的组合，进行持续的有监督微调。我们的方法取得了最先进的结果，在多方言阿拉伯语 ASR 挑战中排名第一。这些研究结果突显了弱监督与微调相结合在克服数据稀缺并为资源不足且方言丰富的语言提供高质量自动语音识别方面的有效性。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-12 13:02:22 UTC 发布：2025-08-12 13:02:22 UTC

#19 ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs #19 ASPD：通过探索 LLMs 的内在并行性解锁自适应串并行解码

The increasing scale and complexity of large language models (LLMs) pose significant inference latency challenges, primarily due to their autoregressive decoding paradigm characterized by the sequential nature of next-token prediction. By re-examining the outputs of autoregressive models, we observed that some segments exhibit parallelizable structures, which we term intrinsic parallelism. Decoding each parallelizable branch simultaneously (i.e. parallel decoding) can significantly improve the overall inference speed of LLMs. In this paper, we propose an Adaptive Serial-Parallel Decoding (ASPD), which addresses two core challenges: automated construction of parallelizable data and efficient parallel decoding mechanism. More specifically, we introduce a non-invasive pipeline that automatically extracts and validates parallelizable structures from the responses of autoregressive models. To empower efficient adaptive serial-parallel decoding, we implement a Hybrid Decoding Engine which enables seamless transitions between serial and parallel decoding modes while maintaining a reusable KV cache, maximizing computational efficiency. Extensive evaluations across General Tasks, Retrieval-Augmented Generation, Mathematical Reasoning, demonstrate that ASPD achieves unprecedented performance in both effectiveness and efficiency. Notably, on Vicuna Bench, our method achieves up to 3.19x speedup (1.85x on average) while maintaining response quality within 1% difference compared to autoregressive models, realizing significant acceleration without compromising generation quality. Our framework sets a groundbreaking benchmark for efficient LLM parallel inference, paving the way for its deployment in latency-sensitive applications such as AI-powered customer service bots and answer retrieval engines. 大型语言模型（LLMs）规模和复杂性的不断增长带来了显著的推理延迟挑战，主要源于其自回归解码范式中逐次预测下一个词的顺序特性。通过重新审视自回归模型的输出，我们观察到某些片段呈现出可并行化的结构，称为内在并行性。对每个可并行分支同时进行解码（即并行解码）可以显著提升 LLMs 的整体推理速度。在本文中，我们提出了一种自适应串行-并行解码（ASPD），其解决了两个核心挑战：可并行数据的自动构建和高效的并行解码机制。更具体地，我们引入了一个非侵入式流程，自动从自回归模型的响应中提取并验证可并行化结构。为了实现高效的自适应串行-并行解码，我们实现了一个混合解码引擎，该引擎在保持可复用 KV 缓存的同时，实现串行与并行解码模式之间的无缝切换，从而最大化计算效率。在通用任务、检索增强生成、数学推理等广泛评估中，ASPD 在有效性和效率方面均达到了前所未有的表现。值得注意的是，在 Vicuna Bench 上，我们的方法在保证与自回归模型响应质量误差在 1% 以内的同时，最多实现了 3.19 倍的加速（平均 1.85 倍），在不牺牲生成质量的情况下大幅提高了速度。我们的框架为高效 LLM 并行推理设立了开创性基准，为其在对延迟敏感的应用中部署铺平了道路，例如由 AI 驱动的客服机器人和答案检索引擎。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-12 12:35:55 UTC 发布：2025-08-12 12:35:55 UTC

#20 Entangled in Representations: Mechanistic Investigation of Cultural Biases in Large Language Models #20 纠缠于表征：对大型语言模型中文化偏见的机制性调查

The growing deployment of large language models (LLMs) across diverse cultural contexts necessitates a better understanding of how the overgeneralization of less documented cultures within LLMs’ representations impacts their cultural understanding. Prior work only performs extrinsic evaluation of LLMs’ cultural competence, without accounting for how LLMs’ internal mechanisms lead to cultural (mis)representation. To bridge this gap, we propose Culturescope, the first mechanistic interpretability-based method that probes the internal representations of LLMs to elicit the underlying cultural knowledge space. CultureScope utilizes a patching method to extract the cultural knowledge. We introduce a cultural flattening score as a measure of the intrinsic cultural biases. Additionally, we study how LLMs internalize Western-dominance bias and cultural flattening, which allows us to trace how cultural biases emerge within LLMs. Our experimental results reveal that LLMs encode Western-dominance bias and cultural flattening in their cultural knowledge space. We find that low-resource cultures are less susceptible to cultural biases, likely due to their limited training resources. Our work provides a foundation for future research on mitigating cultural biases and enhancing LLMs’ cultural understanding. Our codes and data used for experiments are publicly available. 在多元文化背景中日益广泛部署大型语言模型（LLMs）的情况下，有必要更好地理解这些模型在表征中对文献较少文化的过度概括如何影响它们的文化理解。以往工作仅对 LLMs 的文化能力进行外在评估，未考虑 LLMs 的内部机制如何导致文化的（误）表征。为填补这一空白，我们提出了 CultureScope，这是首个基于机理可解释性的方法，用于探查 LLMs 的内部表征以引出其潜在的文化知识空间。CultureScope 利用一种打补丁（patching）方法来提取文化知识。我们提出了文化扁平化分数作为衡量内在文化偏见的指标。此外，我们研究了 LLMs 如何内化以西方为主导的偏见和文化扁平化，这使我们能够追踪文化偏见在 LLMs 中如何产生。我们的实验结果表明，LLMs 在其文化知识空间中编码了以西方为主导的偏见和文化扁平化。我们发现低资源文化不那么容易受到文化偏见的影响，这很可能是由于其有限的训练资源。我们的工作为未来减轻文化偏见和增强 LLMs 文化理解的研究奠定了基础。我们用于实验的代码和数据已公开可用。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-12 12:05:32 UTC 发布：2025-08-12 12:05:32 UTC

#21 Weakly Supervised Fine-grained Span-Level Framework for Chinese Radiology Report Quality Assurance #21 题为“面向中文放射学报告质量保证的弱监督细粒度跨段框架”

Authors: [Kaiyu Wang](https://arxiv.org/search/?searchtype=author&query=Kaiyu Wang), [Lin Mu](https://arxiv.org/search/?searchtype=author&query=Lin Mu), [Zhiyao Yang](https://arxiv.org/search/?searchtype=author&query=Zhiyao Yang), [Ximing Li](https://arxiv.org/search/?searchtype=author&query=Ximing Li), [Xiaotang Zhou Wanfu Gao](https://arxiv.org/search/?searchtype=author&query=Xiaotang Zhou Wanfu Gao), [Huimao Zhang](https://arxiv.org/search/?searchtype=author&query=Huimao Zhang) 作者：王凯宇、沐霖、杨志尧、李熙明、周晓棠、高万福、张慧茂

Quality Assurance (QA) for radiology reports refers to judging whether the junior reports (written by junior doctors) are qualified. The QA scores of one junior report are given by the senior doctor(s) after reviewing the image and junior report. This process requires intensive labor costs for senior doctors. Additionally, the QA scores may be inaccurate for reasons like diagnosis bias, the ability of senior doctors, and so on. To address this issue, we propose a Span-level Quality Assurance EvaluaTOR (Sqator) to mark QA scores automatically. Unlike the common document-level semantic comparison method, we try to analyze the semantic difference by exploring more fine-grained text spans. Unlike the common document-level semantic comparison method, we try to analyze the semantic difference by exploring more fine-grained text spans. Specifically, Sqator measures QA scores by measuring the importance of revised spans between junior and senior reports, and outputs the final QA scores by merging all revised span scores. We evaluate Sqator using a collection of 12,013 radiology reports. Experimental results show that Sqator can achieve competitive QA scores. Moreover, the importance scores of revised spans can be also consistent with the judgments of senior doctors. 放射学报告的质量保证（QA）是指判断初级医生撰写的报告是否合格。初级报告的 QA 评分由资深医生在审阅影像和初级报告后给出。该过程对资深医生来说需要大量人力成本。此外，QA 评分可能因诊断偏差、资深医生能力等原因而不准确。为了解决这一问题，我们提出了一种基于片段级的质量保证评估器（Span-level Quality Assurance EvaluaTOR，Sqator）来自动标注 QA 评分。不同于常见的基于整篇文档的语义比较方法，我们尝试通过探索更细粒度的文本片段来分析语义差异。具体而言，Sqator 通过衡量初级与资深报告之间被修改片段的重要性来评估 QA 分数，并通过合并所有被修改片段的分数输出最终的 QA 评分。我们使用 12,013 份放射学报告的集合对 Sqator 进行了评估。实验结果表明，Sqator 能够达到有竞争力的 QA 评分表现。此外，修订片段的重要性得分也能与资深医生的判断保持一致。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-12 12:03:20 UTC 发布：2025-08-12 12:03:20 UTC

#22 BiasGym: Fantastic Biases and How to Find (and Remove) Them #22 BiasGym：奇妙的偏差及如何发现（并消除）它们

Authors: [Sekh Mainul Islam](https://arxiv.org/search/?searchtype=author&query=Sekh Mainul Islam), [Nadav Borenstein](https://arxiv.org/search/?searchtype=author&query=Nadav Borenstein), [Siddhesh Milind Pawar](https://arxiv.org/search/?searchtype=author&query=Siddhesh Milind Pawar), [Haeun Yu](https://arxiv.org/search/?searchtype=author&query=Haeun Yu), [Arnav Arora](https://arxiv.org/search/?searchtype=author&query=Arnav Arora), [Isabelle Augenstein](https://arxiv.org/search/?searchtype=author&query=Isabelle Augenstein) 作者：Sekh Mainul Islam, Nadav Borenstein, Siddhesh Milind Pawar, Haeun Yu, Arnav Arora, Isabelle Augenstein

Understanding biases and stereotypes encoded in the weights of Large Language Models (LLMs) is crucial for developing effective mitigation strategies. Biased behaviour is often subtle and non-trivial to isolate, even when deliberately elicited, making systematic analysis and debiasing particularly challenging. To address this, we introduce BiasGym, a simple, cost-effective, and generalizable framework for reliably injecting, analyzing, and mitigating conceptual associations within LLMs. BiasGym consists of two components: BiasInject, which injects specific biases into the model via token-based fine-tuning while keeping the model frozen, and BiasScope, which leverages these injected signals to identify and steer the components responsible for biased behavior. Our method enables consistent bias elicitation for mechanistic analysis, supports targeted debiasing without degrading performance on downstream tasks, and generalizes to biases unseen during training. We demonstrate the effectiveness of BiasGym in reducing real-world stereotypes (e.g., people from a country being reckless drivers') and in probing fictional associations (e.g., people from a country having blue skin’), showing its utility for both safety interventions and interpretability research. 理解大型语言模型（LLMs）权重中编码的偏见和刻板印象，对于制定有效的缓解策略至关重要。即使在刻意引导的情况下，偏见行为通常也很微妙且难以孤立，这使得系统性分析和去偏变得尤为具有挑战性。为此，我们提出了 BiasGym，一个简单、低成本且可推广的框架，用于可靠地注入、分析和缓解 LLMs 中的概念关联。BiasGym 由两部分组成：BiasInject，通过基于词元的微调在保持模型冻结的情况下向模型注入特定偏见；BiasScope 则利用这些注入的信号识别并引导导致偏见行为的组件。我们的方法能够实现用于机制分析的一致偏见引出，支持在不降低下游任务性能的情况下进行有针对性的去偏，并能推广到训练期间未见过的偏见。我们展示了 BiasGym 在减少现实世界刻板印象（例如，来自某国的人被认为是“鲁莽的司机”）以及探测虚构关联（例如，来自某国的人具有“蓝色皮肤”）方面的有效性，表明其在安全干预和可解释性研究中均具有实用价值。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-12 11:23:44 UTC 发布：2025-08-12 11:23:44 UTC

#23 Steering Towards Fairness: Mitigating Political Bias in LLMs #23 朝向公平的引导：缓解 LLMs 中的政治偏见

Authors: [Afrozah Nadeem](https://arxiv.org/search/?searchtype=author&query=Afrozah Nadeem), [Mark Dras](https://arxiv.org/search/?searchtype=author&query=Mark Dras), [Usman Naseem](https://arxiv.org/search/?searchtype=author&query=Usman Naseem) 作者：Afrozah Nadeem、Mark Dras、Usman Naseem

Recent advancements in large language models (LLMs) have enabled their widespread use across diverse real-world applications. However, concerns remain about their tendency to encode and reproduce ideological biases, particularly along political and economic dimensions. In this paper, we propose a framework for probing and mitigating such biases in decoder-based LLMs through analysis of internal model representations. Grounded in the Political Compass Test (PCT), our method uses contrastive pairs to extract and compare hidden layer activations from models like Mistral and DeepSeek. We introduce a comprehensive activation extraction pipeline capable of layer-wise analysis across multiple ideological axes, revealing meaningful disparities linked to political framing. Our results show that decoder LLMs systematically encode representational bias across layers, which can be leveraged for effective steering vector-based mitigation. This work provides new insights into how political bias is encoded in LLMs and offers a principled approach to debiasing beyond surface-level output interventions. 近年来，大型语言模型（LLMs）的进展使其在各种现实场景中得到广泛应用。然而，人们仍然担忧这些模型倾向于编码并再现意识形态偏见，尤其是在政治和经济维度上。本文提出了一个通过分析内部模型表征来探测并缓解基于解码器的 LLMs 中此类偏见的框架。基于政治罗盘测试（Political Compass Test，PCT），我们的方法使用对比对来提取并比较像 Mistral 和 DeepSeek 这样的模型的隐藏层激活。我们引入了一个全面的激活提取流程，能够在多个意识形态轴上进行逐层分析，揭示与政治表述相关的有意义差异。我们的结果表明，解码器 LLMs 在层间系统性地编码了表征偏见，这可以被用于基于引导向量的有效缓解。该工作为理解政治偏见如何在 LLMs 中被编码提供了新见解，并提出了一种超越表面输出干预的有原则的去偏方法。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-12 11:09:03 UTC 发布：2025-08-12 11:09:03 协调世界时

#24 An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems #24 对大语言模型在数学推理鲁棒性的一项研究：通过对高等数学题目进行数学等价变换来进行基准测试 [PDF ] [Copy] [Kimi 2 ] [REL]

Authors: [Yuren Hao](https://arxiv.org/search/?searchtype=author&query=Yuren Hao), [Xiang Wan](https://arxiv.org/search/?searchtype=author&query=Xiang Wan), [Chengxiang Zhai](https://arxiv.org/search/?searchtype=author&query=Chengxiang Zhai) 作者：郝昱任，万翔，翟成翔

In this paper, we introduce a systematic framework beyond conventional method to assess LLMs’ mathematical-reasoning robustness by stress-testing them on advanced math problems that are mathematically equivalent but with linguistic and parametric variation. These transformations allow us to measure the sensitivity of LLMs to non-mathematical perturbations, thereby enabling a more accurate evaluation of their mathematical reasoning capabilities. Using this new evaluation methodology, we created PutnamGAP, a new benchmark dataset with multiple mathematically-equivalent variations of competition-level math problems. With the new dataset, we evaluate multiple families of representative LLMs and examine their robustness. Across 18 commercial and open-source models we observe sharp performance degradation on the variants. OpenAI’s flagship reasoning model, O3, scores 49 % on the originals but drops by 4 percentage points on surface variants, and by 10.5 percentage points on core-step-based variants, while smaller models fare far worse. Overall, the results show that the proposed new evaluation methodology is effective for deepening our understanding of the robustness of LLMs and generating new insights for further improving their mathematical reasoning capabilities. 在本文中，我们提出了一个超越传统方法的系统框架，通过对语义等价但在语言和参数上有所变化的高级数学问题进行压力测试，来评估 LLMs 在数学推理方面的鲁棒性。这些变换使我们能够衡量 LLMs 对非数学扰动的敏感性，从而更准确地评估其数学推理能力。基于这一新的评估方法，我们构建了 PutnamGAP，这是一个包含多种数学等价变体的竞赛级数学问题的新基准数据集。利用该数据集，我们评估了多类具有代表性的 LLMs 并检验其鲁棒性。在 18 个商业与开源模型中，我们观察到在变体上的性能出现显著下降。OpenAI 的旗舰推理模型 O3 在原题上的得分为 49%，但在表面变体上下降了 4 个百分点，在基于核心步骤的变体上下降了 10.5 个百分点，而较小的模型表现更差。总体而言，结果表明，所提出的新评估方法对于加深我们对 LLMs 鲁棒性的理解以及为进一步提升其数学推理能力生成新的见解是有效的。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-12 10:40:33 UTC 发布：2025-08-12 10:40:33 UTC

#25 TiMoE: Time-Aware Mixture of Language Experts #25 TiMoE：面向时间的语言专家混合模型

Authors: [Robin Faro](https://arxiv.org/search/?searchtype=author&query=Robin Faro), [Dongyang Fan](https://arxiv.org/search/?searchtype=author&query=Dongyang Fan), [Tamar Alphaidze](https://arxiv.org/search/?searchtype=author&query=Tamar Alphaidze), [Martin Jaggi](https://arxiv.org/search/?searchtype=author&query=Martin Jaggi) 作者：Robin Faro, Dongyang Fan, Tamar Alphaidze, Martin Jaggi

Large language models (LLMs) are typically trained on fixed snapshots of the web, which means that their knowledge becomes stale and their predictions risk temporal leakage: relying on information that lies in the future relative to a query. We tackle this problem by pre-training from scratch a set of GPT-style experts on disjoint two-year slices of a 2013-2024 corpus and combining them through TiMoE, a Time-aware Mixture of Language Experts. At inference time, TiMoE masks all experts whose training window ends after the query timestamp and merges the remaining log-probabilities in a shared space, guaranteeing strict causal validity while retaining the breadth of multi-period knowledge. We also release TSQA, a 10k-question benchmark whose alternatives are explicitly labelled as past, future or irrelevant, allowing fine-grained measurement of temporal hallucinations. Experiments on eight standard NLP tasks plus TSQA show that a co-adapted TiMoE variant matches or exceeds the best single-period expert and cuts future-knowledge errors by up to 15%. Our results demonstrate that modular, time-segmented pre-training paired with causal routing is a simple yet effective path toward LLMs that stay chronologically grounded without sacrificing general performance much. We open source our code at TiMoE (Github): https://github.com/epfml/TiMoE 大型语言模型 (LLMs) 通常在固定的网络快照上进行训练，这意味着它们的知识会变得陈旧，并且它们的预测存在时间泄漏的风险：依赖相对于查询来说位于未来的信息。我们通过从零开始对一组 GPT 风格的专家在不重叠的、覆盖 2013–2024 年语料的两年切片上进行预训练来解决这一问题，并通过 TiMoE（一种时间感知的语言专家混合模型，Time-aware Mixture of Language Experts）将它们结合起来。在推理时，TiMoE 会屏蔽所有其训练窗口在查询时间戳之后结束的专家，并在共享空间中合并剩余的对数概率，从而在保留多时期知识广度的同时保证严格的因果有效性。我们还发布了 TSQA，一个包含 1 万个问题的基准数据集，其备选项被显式标注为过去、未来或无关，从而允许对时间幻觉进行细粒度测量。在八项标准自然语言处理任务加上 TSQA 上的实验表明，协同适配的 TiMoE 变体能够匹配或超越表现最好的单期专家，并将未来知识错误最多减少 15%。我们的结果表明，模块化的、按时间分段的预训练结合因果路由，是使 LLMs 在不大幅牺牲总体性能的情况下保持时间性根植的一个简单而有效的途径。我们在 TiMoE (Github) 上开源了我们的代码：https://github.com/epfml/TiMoE

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-12 10:36:36 UTC 发布：2025-08-12 10:36:36 UTC

#26 Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments #26 通过自动构建环境的反馈驱动工具使用改进在大型语言模型中的应用 [PDF 2 ] [Copy] [Kimi 2 ] [REL]

Effective tool use is essential for large language models (LLMs) to interact meaningfully with their environment. However, progress is limited by the lack of efficient reinforcement learning (RL) frameworks specifically designed for tool use, due to challenges in constructing stable training environments and designing verifiable reward mechanisms. To address this, we propose an automated environment construction pipeline, incorporating scenario decomposition, document generation, function integration, complexity scaling, and localized deployment. This enables the creation of high-quality training environments that provide detailed and measurable feedback without relying on external tools. Additionally, we introduce a verifiable reward mechanism that evaluates both the precision of tool use and the completeness of task execution. When combined with trajectory data collected from the constructed environments, this mechanism integrates seamlessly with standard RL algorithms to facilitate feedback-driven model training. Experiments on LLMs of varying scales demonstrate that our approach significantly enhances the models’ tool-use performance without degrading their general capabilities, regardless of inference modes or training algorithms. Our analysis suggests that these gains result from improved context understanding and reasoning, driven by updates to the lower-layer MLP parameters in models. 有效的工具使用对于大型语言模型（LLMs）与其环境进行有意义交互至关重要。然而，进展受限于缺乏专为工具使用设计的高效强化学习（RL）框架，原因在于构建稳定训练环境和设计可验证奖励机制存在挑战。为此，我们提出了一个自动化环境构建流程，包含情境分解、文档生成、函数集成、复杂度扩展和局部部署。该流程能够创建高质量的训练环境，提供详细且可衡量的反馈，而无需依赖外部工具。此外，我们引入了一种可验证的奖励机制，用以评估工具使用的精确性和任务执行的完整性。当与从构建的环境中收集的轨迹数据结合时，该机制可与标准 RL 算法无缝集成，以促进基于反馈的模型训练。在不同规模的 LLMs 上进行的实验证明，我们的方法在不损害模型一般能力的前提下，显著提升了模型的工具使用性能，无论推理模式或训练算法如何。我们的分析表明，这些提升源于模型对上下文理解和推理能力的增强，这种增强由模型较低层 MLP 参数的更新所驱动。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-12 09:45:19 UTC 发布：2025-08-12 09:45:19 UTC

#27 Privacy-protected Retrieval-Augmented Generation for Knowledge Graph Question Answering #27 隐私保护的检索增强生成用于知识图谱问答

LLMs often suffer from hallucinations and outdated or incomplete knowledge. RAG is proposed to address these issues by integrating external knowledge like that in KGs into LLMs. However, leveraging private KGs in RAG systems poses significant privacy risks due to the black-box nature of LLMs and potential insecure data transmission, especially when using third-party LLM APIs lacking transparency and control. In this paper, we investigate the privacy-protected RAG scenario for the first time, where entities in KGs are anonymous for LLMs, thus preventing them from accessing entity semantics. Due to the loss of semantics of entities, previous RAG systems cannot retrieve question-relevant knowledge from KGs by matching questions with the meaningless identifiers of anonymous entities. To realize an effective RAG system in this scenario, two key challenges must be addressed: (1) How can anonymous entities be converted into retrievable information. (2) How to retrieve question-relevant anonymous entities. Hence, we propose a novel ARoG framework including relation-centric abstraction and structure-oriented abstraction strategies. For challenge (1), the first strategy abstracts entities into high-level concepts by dynamically capturing the semantics of their adjacent relations. It supplements meaningful semantics which can further support the retrieval process. For challenge (2), the second strategy transforms unstructured natural language questions into structured abstract concept paths. These paths can be more effectively aligned with the abstracted concepts in KGs, thereby improving retrieval performance. To guide LLMs to effectively retrieve knowledge from KGs, the two strategies strictly protect privacy from being exposed to LLMs. Experiments on three datasets demonstrate that ARoG achieves strong performance and privacy-robustness. LLMs 经常存在幻觉以及过时或不完整的知识问题。为了解决这些问题，提出了 RAG，通过将外部知识（例如知识图谱中的知识）整合到 LLMs 中。然而，在 RAG 系统中利用私有知识图谱会带来重大隐私风险，原因在于 LLMs 的黑箱特性以及潜在的不安全数据传输，尤其是在使用缺乏透明度和控制的第三方 LLM API 时。在本文中，我们首次研究了受隐私保护的 RAG 场景，其中知识图谱中的实体对 LLMs 保持匿名，从而阻止其访问实体语义。由于实体语义的丧失，先前的 RAG 系统无法通过将问题与匿名实体的无意义标识符匹配来从知识图谱中检索与问题相关的知识。要在该场景下实现有效的 RAG 系统，必须解决两个关键挑战：(1) 如何将匿名实体转换为可检索的信息；(2) 如何检索与问题相关的匿名实体。因此，我们提出了一种新颖的 ARoG 框架，包括以关系为中心的抽象和面向结构的抽象策略。对于挑战（1），第一种策略通过动态捕捉实体相邻关系的语义，将实体抽象为高层概念。它补充了有意义的语义信息，进而支持检索过程。对于挑战（2），第二种策略将非结构化的自然语言问题转换为结构化的抽象概念路径。这些路径可以更有效地与知识图谱中抽象的概念对齐，从而提升检索性能。为了引导 LLMs 有效地从知识图谱中检索知识，这两种策略严格保护隐私，不向 LLMs 泄露。三套数据集上的实验表明，ARoG 在性能和隐私鲁棒性方面都表现出色。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-12 09:38:21 UTC 发布：2025-08-12 09:38:21 UTC

#28 DevNous: An LLM-Based Multi-Agent System for Grounding IT Project Management in Unstructured Conversation #28 DevNous：一种基于 LLM 的多智能体系统，用于在非结构化对话中将 IT 项目管理落地

Authors: [Stavros Doropoulos](https://arxiv.org/search/?searchtype=author&query=Stavros Doropoulos), [Stavros Vologiannidis](https://arxiv.org/search/?searchtype=author&query=Stavros Vologiannidis), [Ioannis Magnisalis](https://arxiv.org/search/?searchtype=author&query=Ioannis Magnisalis) 作者：Stavros Doropoulos、Stavros Vologiannidis、Ioannis Magnisalis

The manual translation of unstructured team dialogue into the structured artifacts required for Information Technology (IT) project governance is a critical bottleneck in modern information systems management. We introduce DevNous, a Large Language Model-based (LLM) multi-agent expert system, to automate this unstructured-to-structured translation process. DevNous integrates directly into team chat environments, identifying actionable intents from informal dialogue and managing stateful, multi-turn workflows for core administrative tasks like automated task formalization and progress summary synthesis. To quantitatively evaluate the system, we introduce a new benchmark of 160 realistic, interactive conversational turns. The dataset was manually annotated with a multi-label ground truth and is publicly available. On this benchmark, DevNous achieves an exact match turn accuracy of 81.3% and a multiset F1-Score of 0.845, providing strong evidence for its viability. The primary contributions of this work are twofold: (1) a validated architectural pattern for developing ambient administrative agents, and (2) the introduction of the first robust empirical baseline and public benchmark dataset for this challenging problem domain. 将非结构化的团队对话手动转换为信息技术（IT）项目治理所需的结构化产物，是现代信息系统管理中的一个关键瓶颈。我们引入了 DevNous，一种基于 LLM 的多代理专家系统，用于自动化这一从非结构化到结构化的翻译过程。DevNous 可直接集成到团队聊天环境中，从非正式对话中识别可执行意图，并管理有状态的多轮工作流，用于诸如自动任务形式化和进度摘要合成等核心管理任务。为对该系统进行定量评估，我们推出了一个由 160 个真实交互式会话回合组成的新基准。该数据集经过人工多标签标注，并已公开。在此基准上，DevNous 实现了 81.3% 的回合完全匹配准确率和 0.845 的多重集合 F1 分数，提供了其可行性的有力证据。本工作的主要贡献有两点：（1）一种经过验证的用于开发环境型管理代理的架构模式；（2）为这一具有挑战性的问题领域引入了首个稳健的经验基线和公开基准数据集。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-12 09:08:29 UTC 发布：2025-08-12 09:08:29 UTC

#29 SciRerankBench: Benchmarking Rerankers Towards Scientific Retrieval-Augmented Generated LLMs #29 SciRerankBench：面向科学检索增强生成 LLMs 的重排器基准测试

Scientific literature question answering is a pivotal step towards new scientific discoveries. Recently, \textit{two-stage} retrieval-augmented generated large language models (RAG-LLMs) have shown impressive advancements in this domain. Such a two-stage framework, especially the second stage (reranker), is particularly essential in the scientific domain, where subtle differences in terminology may have a greatly negative impact on the final factual-oriented or knowledge-intensive answers. Despite this significant progress, the potential and limitations of these works remain unexplored. In this work, we present a Scientific Rerank-oriented RAG Benchmark (SciRerankBench), for evaluating rerankers within RAG-LLMs systems, spanning five scientific subjects. To rigorously assess the reranker performance in terms of noise resilience, relevance disambiguation, and factual consistency, we develop three types of question-context-answer (Q-C-A) pairs, i.e., Noisy Contexts (NC), Semantically Similar but Logically Irrelevant Contexts (SSLI), and Counterfactual Contexts (CC). Through systematic evaluation of 13 widely used rerankers on five families of LLMs, we provide detailed insights into their relative strengths and limitations. To the best of our knowledge, SciRerankBench is the first benchmark specifically developed to evaluate rerankers within RAG-LLMs, which provides valuable observations and guidance for their future development. 科学文献问答是迈向新科学发现的关键一步。最近，基于检索增强生成的大型语言模型（two-stage RAG-LLMs）在该领域取得了显著进展。这种两阶段框架，尤其是第二阶段（重排序器），在科学领域尤为重要，因为术语上的细微差别可能对最终以事实为导向或以知识为密集的答案产生极大负面影响。尽管取得了重大进展，这些工作的潜力和局限性仍未被充分探讨。在本工作中，我们提出了一个面向科学重排序的 RAG 基准（SciRerankBench），用于评估 RAG-LLMs 系统中的重排序器，覆盖五个科学学科。为了严格评估重排序器在噪声鲁棒性、相关性消歧和事实一致性方面的表现，我们构建了三种类型的问题—上下文—答案（Q-C-A）三元组，分别为：噪声上下文（NC）、语义相似但逻辑无关的上下文（SSLI）和反事实上下文（CC）。通过对五类 LLMs 上 13 种广泛使用的重排序器进行系统评估，我们提供了关于它们相对优势与局限的详细见解。据我们所知，SciRerankBench 是首个专门为评估 RAG-LLMs 中的重排序器而开发的基准，这为其未来发展提供了有价值的观察和指导。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-12 08:36:23 UTC 发布时间：2025-08-12 08:36:23 UTC

#30 Magical: Medical Lay Language Generation via Semantic Invariance and Layperson-tailored Adaptation #30 Magical：通过语义不变性和面向门外汉的适应进行医学通俗语言生成

Authors: [Weibin Liao](https://arxiv.org/search/?searchtype=author&query=Weibin Liao), [Tianlong Wang](https://arxiv.org/search/?searchtype=author&query=Tianlong Wang), [Yinghao Zhu](https://arxiv.org/search/?searchtype=author&query=Yinghao Zhu), [Yasha Wang](https://arxiv.org/search/?searchtype=author&query=Yasha Wang), [Junyi Gao](https://arxiv.org/search/?searchtype=author&query=Junyi Gao), [Liantao Ma](https://arxiv.org/search/?searchtype=author&query=Liantao Ma) 作者：廖韦斌，王天龙，朱英浩，王雅沙，高俊毅，马连陶

Medical Lay Language Generation (MLLG) plays a vital role in improving the accessibility of complex scientific content for broader audiences. Recent literature to MLLG commonly employ parameter-efficient fine-tuning methods such as Low-Rank Adaptation (LoRA) to fine-tuning large language models (LLMs) using paired expert-lay language datasets. However, LoRA struggles with the challenges posed by multi-source heterogeneous MLLG datasets. Specifically, through a series of exploratory experiments, we reveal that standard LoRA fail to meet the requirement for semantic fidelity and diverse lay-style generation in MLLG task. To address these limitations, we propose Magical, an asymmetric LoRA architecture tailored for MLLG under heterogeneous data scenarios. Magical employs a shared matrix A for abstractive summarization, along with multiple isolated matrices B for diverse lay-style generation. To preserve semantic fidelity during the lay language generation process, Magical introduces a Semantic Invariance Constraint to mitigate semantic subspace shifts on matrix A. Furthermore, to better adapt to diverse lay-style generation, Magical incorporates the Recommendation-guided Switch, an externally interface to prompt the LLM to switch between different matrices B. Experimental results on three real-world lay language generation datasets demonstrate that Magical consistently outperforms prompt-based methods, vanilla LoRA, and its recent variants, while also reducing trainable parameters by 31.66%. 医学通俗语言生成（MLLG）在提高复杂科学内容对更广泛受众的可及性方面发挥着重要作用。近期有关 MLLG 的研究通常采用参数高效的微调方法，例如低秩自适应（LoRA），使用专家-通俗语言配对数据集对大型语言模型（LLMs）进行微调。然而，LoRA 在应对多源异构 MLLG 数据集时面临挑战。具体而言，通过一系列探索性实验，我们发现标准 LoRA 无法满足 MLLG 任务中语义保真性和通俗风格多样化生成的要求。为了解决这些局限性，我们提出了 Magical，一种针对异构数据场景下 MLLG 设计的不对称 LoRA 架构。Magical 为摘要式抽取使用共享矩阵 A ，并为多样的通俗风格生成设置多个独立矩阵 B 。为了在通俗语言生成过程中保持语义保真性，Magical 引入了语义不变性约束，以缓解矩阵 A 上的语义子空间偏移。此外，为了更好地适应多样化的通俗风格生成，Magical 引入了推荐引导切换（Recommendation-guided Switch），这是一个外部接口，用于提示 LLM 在不同矩阵之间切换 B 。在三个真实世界的通俗语言生成数据集上的实验结果表明，Magical 在始终优于基于提示的方法、原始 LoRA 及其最新变体的同时，还将可训练参数减少了 31.66%。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-12 08:21:58 UTC 发布：2025-08-12 08:21:58 UTC

#31 IROTE: Human-like Traits Elicitation of Large Language Model via In-Context Self-Reflective Optimization #31 IROTE：通过上下文自我反思优化引发大型语言模型的人类特质 [PDF 4 ] [复印件] [Kimi 4 ] [REL]

Trained on various human-authored corpora, Large Language Models (LLMs) have demonstrated a certain capability of reflecting specific human-like traits (e.g., personality or values) by prompting, benefiting applications like personalized LLMs and social simulations. However, existing methods suffer from the superficial elicitation problem: LLMs can only be steered to mimic shallow and unstable stylistic patterns, failing to embody the desired traits precisely and consistently across diverse tasks like humans. To address this challenge, we propose IROTE, a novel in-context method for stable and transferable trait elicitation. Drawing on psychological theories suggesting that traits are formed through identity-related reflection, our method automatically generates and optimizes a textual self-reflection within prompts, which comprises self-perceived experience, to stimulate LLMs’ trait-driven behavior. The optimization is performed by iteratively maximizing an information-theoretic objective that enhances the connections between LLMs’ behavior and the target trait, while reducing noisy redundancy in reflection without any fine-tuning, leading to evocative and compact trait reflection. Extensive experiments across three human trait systems manifest that one single IROTE-generated self-reflection can induce LLMs’ stable impersonation of the target trait across diverse downstream tasks beyond simple questionnaire answering, consistently outperforming existing strong baselines. 在各种人类撰写的语料上训练的 Large Language Models (LLMs) 已展示出通过提示反映特定类人特质（例如人格或价值观）的某种能力，有利于个性化 LLMs 和社会模拟等应用。然而，现有方法存在表层引导问题：LLMs 只能被引导去模仿浅层且不稳定的文体模式，无法像人类那样在不同任务中精确且一致地体现期望的特质。为了解决这一挑战，我们提出了 IROTE，一种用于稳定且可迁移特质引导的新型上下文内方法。借鉴心理学理论中指出特质通过与身份相关的反思形成的观点，我们的方法在提示中自动生成并优化一段文本自我反思，该反思包含自我感知的经历，用以激发 LLMs 的特质驱动行为。该优化通过迭代最大化一个信息论目标来执行，该目标增强了 LLMs 行为与目标特质之间的联系，同时在无需任何微调的情况下减少反思中的噪声冗余，从而产生富有表现力且简洁的特质反思。跨越三个人类特质体系的大量实验证明，一次由 IROTE 生成的自我反思即可促使 LLMs 在多种下游任务中稳定地模仿目标特质（不仅限于简单的问卷回答），且始终优于现有强基线。

Subjects: Computation and Language, Artificial Intelligence, Computers and Society

Publish: 2025-08-12 08:04:28 UTC 发表：2025-08-12 08:04:28 UTC

#32 A Survey on Parallel Text Generation: From Parallel Decoding to Diffusion Language Models #32 一篇关于并行文本生成的综述：从并行解码到扩散语言模型

As text generation has become a core capability of modern Large Language Models (LLMs), it underpins a wide range of downstream applications. However, most existing LLMs rely on autoregressive (AR) generation, producing one token at a time based on previously generated context-resulting in limited generation speed due to the inherently sequential nature of the process. To address this challenge, an increasing number of researchers have begun exploring parallel text generation-a broad class of techniques aimed at breaking the token-by-token generation bottleneck and improving inference efficiency. Despite growing interest, there remains a lack of comprehensive analysis on what specific techniques constitute parallel text generation and how they improve inference performance. To bridge this gap, we present a systematic survey of parallel text generation methods. We categorize existing approaches into AR-based and Non-AR-based paradigms, and provide a detailed examination of the core techniques within each category. Following this taxonomy, we assess their theoretical trade-offs in terms of speed, quality, and efficiency, and examine their potential for combination and comparison with alternative acceleration strategies. Finally, based on our findings, we highlight recent advancements, identify open challenges, and outline promising directions for future research in parallel text generation. 随着文本生成成为现代 LLMs 的核心能力，它支撑着各种下游应用。然而，大多数现有 LLMs 依赖自回归（AR）生成，基于先前生成的上下文一次生成一个标记——由于该过程本质上的顺序性，导致生成速度受限。为了解决这一挑战，越来越多的研究者开始探索并行文本生成——这是一类旨在打破逐标记生成瓶颈、提高推理效率的广泛技术。尽管兴趣日益增长，但关于究竟哪些具体技术构成并行文本生成以及它们如何提升推理性能的全面分析仍然不足。为弥补这一空白，我们提出了一项并行文本生成方法的系统性综述。我们将现有方法分为基于 AR 的范式和非 AR 的范式，并对每一类中的核心技术进行了详尽的考察。根据这一分类法，我们从速度、质量与效率三个方面评估它们的理论权衡，并检视它们与替代加速策略相结合与比较的潜力。最后，基于我们的发现，我们着重介绍了近期进展、识别了未决挑战，并勾勒了并行文本生成未来研究的有希望方向。

Subjects: Computation and Language, Artificial Intelligence, Distributed, Parallel, and Cluster Computing

Publish: 2025-08-12 07:56:04 UTC

#33 Out of the Box, into the Clinic? Evaluating State-of-the-Art ASR for Clinical Applications for Older Adults #33 直接投入临床？评估面向老年人的最先进自动语音识别在临床应用中的表现

Voice-controlled interfaces can support older adults in clinical contexts, with chatbots being a prime example, but reliable Automatic Speech Recognition (ASR) for underrepresented groups remains a bottleneck. This study evaluates state-of-the-art ASR models on language use of older Dutch adults, who interacted with the Welzijn.AI chatbot designed for geriatric contexts. We benchmark generic multilingual ASR models, and models fine-tuned for Dutch spoken by older adults, while also considering processing speed. Our results show that generic multilingual models outperform fine-tuned models, which suggests recent ASR models can generalise well out of the box to realistic datasets. Furthermore, our results suggest that truncating existing architectures is helpful in balancing the accuracy-speed trade-off, though we also identify some cases with high WER due to hallucinations. 语音控制界面可以在临床环境中支持老年人，聊天机器人就是一个典型例子，但对弱势群体可靠的自动语音识别（ASR）仍然是一个瓶颈。本研究评估了最先进的 ASR 模型在老年荷兰人语言使用上的表现，这些人曾与为老年医学场景设计的 Welzijn.AI 聊天机器人互动。我们对通用多语种 ASR 模型以及为老年人说荷兰语而微调的模型进行了基准测试，同时也考虑了处理速度。我们的结果显示，通用多语种模型优于微调模型，这表明最新的 ASR 模型在现实数据集上开箱即用地具有良好的泛化能力。此外，我们的结果表明，对现有架构进行截断有助于在准确性与速度之间取得平衡，尽管我们也发现一些因幻觉导致高词错误率（WER）的情况。

Subjects: Computation and Language, Computers and Society 主题：计算与语言，计算机与社会

Publish: 2025-08-12 07:17:44 UTC 发布：2025-08-12 07:17:44 UTC

#34 TopXGen: Topic-Diverse Parallel Data Generation for Low-Resource Machine Translation #34 TopXGen：面向低资源机器翻译的主题多样性并行数据生成

Authors: [Armel Zebaze](https://arxiv.org/search/?searchtype=author&query=Armel Zebaze), [Benoît Sagot](https://arxiv.org/search/?searchtype=author&query=Benoît Sagot), [Rachel Bawden](https://arxiv.org/search/?searchtype=author&query=Rachel Bawden) 作者：Armel Zebaze、Benoît Sagot、Rachel Bawden

LLMs have been shown to perform well in machine translation (MT) with the use of in-context learning (ICL), rivaling supervised models when translating into high-resource languages (HRLs). However, they lag behind when translating into low-resource language (LRLs). Example selection via similarity search and supervised fine-tuning help. However the improvements they give are limited by the size, quality and diversity of existing parallel datasets. A common technique in low-resource MT is synthetic parallel data creation, the most frequent of which is backtranslation, whereby existing target-side texts are automatically translated into the source language. However, this assumes the existence of good quality and relevant target-side texts, which are not readily available for many LRLs. In this paper, we present \textsc{TopXGen}, an LLM-based approach for the generation of high quality and topic-diverse data in multiple LRLs, which can then be backtranslated to produce useful and diverse parallel texts for ICL and fine-tuning. Our intuition is that while LLMs struggle to translate into LRLs, their ability to translate well into HRLs and their multilinguality enable them to generate good quality, natural-sounding target-side texts, which can be translated well into a high-resource source language. We show that \textsc{TopXGen} boosts LLM translation performance during fine-tuning and in-context learning. Code and outputs are available at https://github.com/ArmelRandy/topxgen. 已有研究表明，LLMs 在使用上下文学习（ICL）进行机器翻译（MT）时表现良好，在翻译成高资源语言（HRLs）时可与有监督模型相媲美。然而，在翻译成低资源语言（LRLs）时，它们表现较差。通过相似性搜索进行示例选择以及监督微调能够带来改善，但这些改进受限于现有平行语料的规模、质量和多样性。低资源机器翻译中常用的一种技术是合成平行数据的生成，其中最常见的是反向翻译（backtranslation），即将现有的目标端文本自动翻译为源语言。然而，这种方法假定存在质量良好且相关的目标端文本，而许多低资源语言并不容易获得此类文本。本文提出了 TopXGen，一种基于 LLM 的方法，用于在多种低资源语言中生成高质量且主题多样的数据，随后可对其进行反向翻译以产生对 ICL 和微调有用且多样的平行文本。我们的直觉是：虽然 LLMs 在翻译到低资源语言（LRLs）方面存在困难，但它们在翻译到高资源语言（HRLs）方面的良好能力及其多语性，使其能够生成质量较高、自然流畅的目标端文本，而这些文本可以被很好地翻译回高资源源语言。我们展示了 \textsc{TopXGen} 在微调和上下文学习过程中提升了 LLM 的翻译性能。代码和输出可在 https://github.com/ArmelRandy/topxgen 获得。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-12 06:58:02 UTC 发布时间：2025-08-12 06:58:02 UTC

#35 LLM driven Text-to-Table Generation through Sub-Tasks Guidance and Iterative Refinement #35 通过子任务指导和迭代改进的由 LLM 驱动的文本到表格生成

Authors: [Rajmohan C](https://arxiv.org/search/?searchtype=author&query=Rajmohan C), [Sarthak Harne](https://arxiv.org/search/?searchtype=author&query=Sarthak Harne), [Arvind Agarwal](https://arxiv.org/search/?searchtype=author&query=Arvind Agarwal) 作者：Rajmohan C、Sarthak Harne、Arvind Agarwal

Transforming unstructured text into structured data is a complex task, requiring semantic understanding, reasoning, and structural comprehension. While Large Language Models (LLMs) offer potential, they often struggle with handling ambiguous or domain-specific data, maintaining table structure, managing long inputs, and addressing numerical reasoning. This paper proposes an efficient system for LLM-driven text-to-table generation that leverages novel prompting techniques. Specifically, the system incorporates two key strategies: breaking down the text-to-table task into manageable, guided sub-tasks and refining the generated tables through iterative self-feedback. We show that this custom task decomposition allows the model to address the problem in a stepwise manner and improves the quality of the generated table. Furthermore, we discuss the benefits and potential risks associated with iterative self-feedback on the generated tables while highlighting the trade-offs between enhanced performance and computational cost. Our methods achieve strong results compared to baselines on two complex text-to-table generation datasets available in the public domain. 将非结构化文本转换为结构化数据是一项复杂的任务，要求具备语义理解、推理和结构识别能力。虽然 LLMs 具有潜力，但它们在处理模糊或领域特定数据、保持表格结构、管理长输入以及处理数值推理方面常常表现欠佳。本文提出了一种高效的 LLM 驱动文本到表格生成系统，利用了新颖的提示技术。具体而言，该系统包含两项关键策略：将文本到表格任务分解为可管理的、有指导性的子任务，以及通过迭代自我反馈来改进生成的表格。我们证明了这种自定义的任务分解使模型能够以分步方式解决问题并提高生成表格的质量。此外，我们讨论了对生成表格进行迭代自我反馈的好处和潜在风险，同时强调了性能提升与计算成本之间的权衡。与公开领域的两个复杂文本到表格生成数据集上的基线方法相比，我们的方法取得了强劲的结果。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-12 05:37:12 UTC 发布：2025-08-12 05:37:12 UTC

#36 Prompt-Based Approach for Czech Sentiment Analysis #36 一种基于提示的方法用于捷克语情感分析

Authors: [Jakub Šmíd](https://arxiv.org/search/?searchtype=author&query=Jakub Šmíd), [Pavel Přibáň](https://arxiv.org/search/?searchtype=author&query=Pavel Přibáň) 作者：Jakub Šmíd，Pavel Přibáň

This paper introduces the first prompt-based methods for aspect-based sentiment analysis and sentiment classification in Czech. We employ the sequence-to-sequence models to solve the aspect-based tasks simultaneously and demonstrate the superiority of our prompt-based approach over traditional fine-tuning. In addition, we conduct zero-shot and few-shot learning experiments for sentiment classification and show that prompting yields significantly better results with limited training examples compared to traditional fine-tuning. We also demonstrate that pre-training on data from the target domain can lead to significant improvements in a zero-shot scenario. 本文提出了首批用于捷克语的基于提示的方面级情感分析和情感分类方法。我们采用序列到序列模型同时解决方面级任务，并证明了我们的基于提示的方法优于传统微调。此外，我们为情感分类进行了零样本和少样本学习实验，显示在训练样本有限的情况下，提示方法相比传统微调能显著取得更好结果。我们还表明，在目标领域数据上进行预训练可以在零样本场景下带来显著提升。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-12 05:31:07 UTC 发布：2025-08-12 05:31:07 UTC

#37 UWB at WASSA-2024 Shared Task 2: Cross-lingual Emotion Detection #37 UWB 在 WASSA-2024 共享任务 2：跨语言情感检测

This paper presents our system built for the WASSA-2024 Cross-lingual Emotion Detection Shared Task. The task consists of two subtasks: first, to assess an emotion label from six possible classes for a given tweet in one of five languages, and second, to predict words triggering the detected emotions in binary and numerical formats. Our proposed approach revolves around fine-tuning quantized large language models, specifically Orca2, with low-rank adapters (LoRA) and multilingual Transformer-based models, such as XLM-R and mT5. We enhance performance through machine translation for both subtasks and trigger word switching for the second subtask. The system achieves excellent performance, ranking 1st in numerical trigger words detection, 3rd in binary trigger words detection, and 7th in emotion detection. 本文介绍了我们为 WASSA-2024 跨语言情感检测共享任务构建的系统。该任务包含两个子任务：其一，对给定的一条推文（五种语言之一）从六个可能类别中评估情感标签；其二，以二元和数值格式预测触发已检测情感的词语。我们提出的方法围绕对量化的大型语言模型（特别是 Orca2）进行微调，并使用低秩适配器（LoRA），以及多语种基于 Transformer 的模型，如 XLM-R 和 mT5。我们通过机器翻译增强两个子任务的表现，并针对第二子任务采用触发词切换策略。该系统取得了优异的成绩，在数值触发词检测中排名第 1，二元触发词检测中排名第 3，情感检测中排名第 7。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-12 05:30:46 UTC

#38 LLaMA-Based Models for Aspect-Based Sentiment Analysis

While large language models (LLMs) show promise for various tasks, their performance in compound aspect-based sentiment analysis (ABSA) tasks lags behind fine-tuned models. However, the potential of LLMs fine-tuned for ABSA remains unexplored. This paper examines the capabilities of open-source LLMs fine-tuned for ABSA, focusing on LLaMA-based models. We evaluate the performance across four tasks and eight English datasets, finding that the fine-tuned Orca2 model surpasses state-of-the-art results in all tasks. However, all models struggle in zero-shot and few-shot scenarios compared to fully fine-tuned ones. Additionally, we conduct error analysis to identify challenges faced by fine-tuned models. 尽管大型语言模型（LLMs）在多种任务上展现出潜力，但在复合方面情感分析（ABSA）任务中的表现仍落后于微调模型。然而，针对 ABSA 进行微调的 LLMs 的潜力尚未被探索。本文考察了为 ABSA 微调的开源 LLMs 的能力，重点关注基于 LLaMA 的模型。我们在四项任务和八个英文数据集上评估了性能，发现微调后的 Orca2 模型在所有任务中都超越了最先进的结果。然而，与完全微调的模型相比，所有模型在零样本和少样本场景中都表现不佳。此外，我们还进行了错误分析以识别微调模型面临的挑战。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-12 05:30:32 UTC 发表：2025-08-12 05:30:32 UTC

#39 Quick on the Uptake: Eliciting Implicit Intents from Human Demonstrations for Personalized Mobile-Use Agents #39 反应迅速：从人类示范中引出隐含意图以打造个性化移动使用代理

As multimodal large language models advance rapidly, the automation of mobile tasks has become increasingly feasible through the use of mobile-use agents that mimic human interactions from graphical user interface. To further enhance mobile-use agents, previous studies employ demonstration learning to improve mobile-use agents from human demonstrations. However, these methods focus solely on the explicit intention flows of humans (e.g., step sequences) while neglecting implicit intention flows (e.g., personal preferences), which makes it difficult to construct personalized mobile-use agents. In this work, to evaluate the \textbf{I}ntention \textbf{A}lignment \textbf{R}ate between mobile-use agents and humans, we first collect \textbf{MobileIAR}, a dataset containing human-intent-aligned actions and ground-truth actions. This enables a comprehensive assessment of the agents’ understanding of human intent. Then we propose \textbf{IFRAgent}, a framework built upon \textbf{I}ntention \textbf{F}low \textbf{R}ecognition from human demonstrations. IFRAgent analyzes explicit intention flows from human demonstrations to construct a query-level vector library of standard operating procedures (SOP), and analyzes implicit intention flows to build a user-level habit repository. IFRAgent then leverages a SOP extractor combined with retrieval-augmented generation and a query rewriter to generate personalized query and SOP from a raw ambiguous query, enhancing the alignment between mobile-use agents and human intent. Experimental results demonstrate that IFRAgent outperforms baselines by an average of 6.79% (32.06% relative improvement) in human intention alignment rate and improves step completion rates by an average of 5.30% (26.34% relative improvement). The codes are available at https://github.com/MadeAgents/Quick-on-the-Uptake. 随着多模态大语言模型的快速发展，通过模仿图形用户界面上的人类交互来自动化移动端任务变得日益可行。为进一步提升移动端代理的能力，先前的研究采用示范学习从人类示范中改善移动端代理。然而，这些方法仅关注人类的显性意图流（例如步骤序列），而忽视了隐性意图流（例如个人偏好），因此难以构建个性化的移动端代理。在本工作中，为了评估移动端代理与人类之间的意图一致率（Intention Alignment Rate, IAR），我们首先收集了 MobileIAR 数据集，该数据集包含与人类意图对齐的动作和真实动作。这使得能够对代理对人类意图的理解进行全面评估。然后我们提出了 IFRAgent 框架，该框架基于从人类示范中识别意图流（Intention Flow Recognition, IFR）。 IFRAgent 分析来自人类示范的显性意图流，以构建查询级的标准操作程序（SOP）向量库，并分析隐性意图流以建立用户级习惯库。然后，IFRAgent 利用一个 SOP 提取器，结合检索增强生成和查询重写器，从原始模糊查询生成个性化查询和 SOP，从而增强移动使用代理与人类意图之间的一致性。实验结果表明，IFRAgent 在人类意图一致率上平均超越基线 6.79%（相对提升 32.06%），并在步骤完成率上平均提升 5.30%（相对提升 26.34%）。代码可在 https://github.com/MadeAgents/Quick-on-the-Uptake 获取。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-12 05:20:14 UTC 发布：2025-08-12 05:20:14 UTC

#40 InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling #40 InternBootcamp 技术报告：通过可验证任务扩展提升 LLM 推理能力

Large language models (LLMs) have revolutionized artificial intelligence by enabling complex reasoning capabilities. While recent advancements in reinforcement learning (RL) have primarily focused on domain-specific reasoning tasks (e.g., mathematics or code generation), real-world reasoning scenarios often require models to handle diverse and complex environments that narrow-domain benchmarks cannot fully capture. To address this gap, we present InternBootcamp, an open-source framework comprising 1000+ domain-diverse task environments specifically designed for LLM reasoning research. Our codebase offers two key functionalities: (1) automated generation of unlimited training/testing cases with configurable difficulty levels, and (2) integrated verification modules for objective response evaluation. These features make InternBootcamp fundamental infrastructure for RL-based model optimization, synthetic data generation, and model evaluation. Although manually developing such a framework with enormous task coverage is extremely cumbersome, we accelerate the development procedure through an automated agent workflow supplemented by manual validation protocols, which enables the task scope to expand rapidly. % With these bootcamps, we further establish Bootcamp-EVAL, an automatically generated benchmark for comprehensive performance assessment. Evaluation reveals that frontier models still underperform in many reasoning tasks, while training with InternBootcamp provides an effective way to significantly improve performance, leading to our 32B model that achieves state-of-the-art results on Bootcamp-EVAL and excels on other established benchmarks. In particular, we validate that consistent performance gains come from including more training tasks, namely \textbf{task scaling}, over two orders of magnitude, offering a promising route towards capable reasoning generalist. 大型语言模型 (LLMs) 通过实现复杂的推理能力，已经革新了人工智能。尽管近期在强化学习 (RL) 方面的进展主要集中于特定领域的推理任务（例如数学或代码生成），但现实世界的推理场景常常要求模型处理多样且复杂的环境，这是狭义基准无法完全涵盖的。为了解决这一差距，我们提出了 InternBootcamp，一个开源框架，包含 1000 多个领域多样的任务环境，专为 LLM 推理研究而设计。我们的代码库提供两项关键功能： (1) 自动生成可配置难度级别的无限训练/测试案例，和 (2) 集成的验证模块用于客观响应评估。这些特性使 InternBootcamp 成为基于 RL 的模型优化、合成数据生成和模型评估的基础性基础设施。虽然手工开发一个具有如此大任务覆盖范围的框架极其繁琐，我们通过一个由自动化代理工作流辅以人工验证协议的方式来加快开发流程，从而使任务范围能够迅速扩大。 % 通过这些训练营，我们进一步建立了 Bootcamp-EVAL，一个用于全面性能评估的自动生成基准。评估表明前沿模型在许多推理任务上仍表现不足，而使用 InternBootcamp 进行训练提供了一种显著提高性能的有效途径，促成了我们的 32B 模型在 Bootcamp-EVAL 上取得最先进的结果并在其他既有基准上表现优异。特别是，我们验证到一致的性能提升来自包含更多训练任务，即跨越两个数量级的“任务扩展”（task scaling），这为通向具备推理能力的一般型模型提供了有希望的路径。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-12 05:00:00 UTC 发布时间：2025-08-12 05:00:00 UTC

#41 Optimizing Retrieval-Augmented Generation (RAG) for Colloquial Cantonese: A LoRA-Based Systematic Review #41 优化面向口语粤语的检索增强生成（RAG）：基于 LoRA 的系统综述

Authors: [David Santandreu Calonge](https://arxiv.org/search/?searchtype=author&query=David Santandreu Calonge), [Linda Smail](https://arxiv.org/search/?searchtype=author&query=Linda Smail) 作者：David Santandreu Calonge，Linda Smail

This review examines recent advances in Parameter-Efficient Fine-Tuning (PEFT), with a focus on Low-Rank Adaptation (LoRA), to optimize Retrieval-Augmented Generation (RAG) systems like Qwen3, DeepSeek, and Kimi. These systems face challenges in understanding and generating authentic Cantonese colloquial expressions due to limited annotated data and linguistic variability. The review evaluates the integration of LoRA within RAG frameworks, benchmarks PEFT methods for retrieval and generation accuracy, identify domain adaptation strategies under limited data, and compares fine-tuning techniques aimed at improving semantic fidelity under data-scarce conditions. A systematic analysis of recent studies employing diverse LoRA variants, synthetic data generation, user feedback integration, and adaptive parameter allocation was conducted to assess their impact on computational efficiency, retrieval precision, linguistic authenticity, and scalability. Findings reveal that dynamic and ensemble LoRA adaptations significantly reduce trainable parameters without sacrificing retrieval accuracy and generation quality in dialectal contexts. However, limitations remain in fully preserving fine-grained linguistic nuances, especially for low-resource settings like Cantonese. The integration of real-time user feedback and domain-specific data remains underdeveloped, limiting model adaptability and personalization. While selective parameter freezing and nonlinear adaptation methods offer better trade-offs between efficiency and accuracy, their robustness at scale remains an open challenge. This review highlights the promise of PEFT-enhanced RAG systems for domain-specific language tasks and calls for future work targeting dialectal authenticity, dynamic adaptation, and scalable fine-tuning pipelines. 本综述考察了参数高效微调（PEFT）的最新进展，重点是低秩适配（LoRA），以优化检索增强生成（RAG）系统，如 Qwen3、DeepSeek 和 Kimi。由于标注数据有限和语言变异性，这些系统在理解和生成真实粤语口语表达方面面临挑战。综述评估了在 RAG 框架中整合 LoRA 的做法，对检索和生成准确性进行了 PEFT 方法的基准测试，识别了在有限数据下的领域适应策略，并比较了旨在在数据稀缺条件下提高语义忠实度的微调技术。对近期采用多种 LoRA 变体、合成数据生成、用户反馈整合和自适应参数分配的研究进行了系统分析，以评估它们对计算效率、检索精准度、语言真实度和可扩展性的影响。研究结果表明，动态和集成的 LoRA 适配显著减少了可训练参数量，同时在方言环境中保持了检索准确性和生成质量。然而，在完全保留细粒度语言细微差别方面仍存在局限，尤其是在像粤语这样的低资源环境中。实时用户反馈和领域特定数据的整合仍不成熟，限制了模型的适应性和个性化。尽管选择性参数冻结和非线性自适应方法在效率与准确性之间提供了更好的权衡，但它们在大规模应用中的稳健性仍是一个悬而未决的挑战。本综述强调了由 PEFT 增强的 RAG 系统在领域特定语言任务中的潜力，并呼吁未来的工作针对方言真实性、动态适应和可扩展微调流程进行研究。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-12 03:46:16 UTC 发布：2025-08-12 03:46:16 UTC

#42 DepressLLM: Interpretable domain-adapted language model for depression detection from real-world narratives #42 DepressLLM：用于从真实世界叙述中检测抑郁的可解释领域适应语言模型

Advances in large language models (LLMs) have enabled a wide range of applications. However, depression prediction is hindered by the lack of large-scale, high-quality, and rigorously annotated datasets. This study introduces DepressLLM, trained and evaluated on a novel corpus of 3,699 autobiographical narratives reflecting both happiness and distress. DepressLLM provides interpretable depression predictions and, via its Score-guided Token Probability Summation (SToPS) module, delivers both improved classification performance and reliable confidence estimates, achieving an AUC of 0.789, which rises to 0.904 on samples with confidence ≥ 0.95. To validate its robustness to heterogeneous data, we evaluated DepressLLM on in-house datasets, including an Ecological Momentary Assessment (EMA) corpus of daily stress and mood recordings, and on public clinical interview data. Finally, a psychiatric review of high-confidence misclassifications highlighted key model and data limitations that suggest directions for future refinements. These findings demonstrate that interpretable AI can enable earlier diagnosis of depression and underscore the promise of medical AI in psychiatry. 大型语言模型（LLMs）的进步推动了各种应用的发展。然而，抑郁症预测受限于缺乏大规模、高质量且经严格标注的数据集。本研究引入了 DepressLLM，在一个新建的语料库上进行训练与评估，该语料库包含 3,699 篇自传体叙述，反映了幸福与痛苦两种情绪。DepressLLM 提供可解释的抑郁预测，并通过其基于得分的词元概率求和（Score-guided Token Probability Summation，SToPS）模块，不仅提升了分类性能，还给出了可靠的置信度估计，整体 AUC 达到 0.789，而在置信度≥0.95 的样本上上升至 0.904。为验证其对异质数据的鲁棒性，我们在内部数据集上评估了 DepressLLM，包括记录日常压力与情绪的生态瞬时评估（Ecological Momentary Assessment，EMA）语料，以及公开的临床访谈数据。最后，精神科专家对高置信度误判的审查突显了模型与数据的关键局限性，指明了未来改进的方向。这些发现表明，可解释的人工智能可以促进抑郁症的早期诊断，并强调了医学人工智能在精神病学中的潜力。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-12 03:12:55 UTC 发布：2025-08-12 03:12:55 UTC

#43 DeCAL Tokenwise Compression #43 DeCAL 逐标记压缩

Author: [Sameer Panwar](https://arxiv.org/search/?searchtype=author&query=Sameer Panwar) 作者：Sameer Panwar

This paper introduces DeCAL, a new method for tokenwise compression. DeCAL uses an encoder-decoder language model pretrained with denoising to learn to produce high-quality, general-purpose compressed representations by the encoder. DeCAL applies small modifications to the encoder, with the emphasis on maximizing compression quality, even at the expense of compute. We show that DeCAL at 2x compression can match uncompressed on many downstream tasks, with usually only minor dropoff in metrics up to 8x compression, among question-answering, summarization, and multi-vector retrieval tasks. DeCAL offers significant savings where pre-computed dense representations can be utilized, and we believe the approach can be further developed to be more broadly applicable. 本文介绍了 DeCAL，一种用于逐标记压缩的新方法。DeCAL 使用经过去噪预训练的编码器-解码器语言模型来学习由编码器生成高质量、通用的压缩表示。DeCAL 对编码器进行小幅修改，重点在于最大化压缩质量，即使以牺牲计算为代价。我们展示了在 2 倍压缩下，DeCAL 在许多下游任务上可以匹配未压缩表现；在问答、摘要和多向量检索任务中，通常在高达 8 倍压缩时指标仅有轻微下降。DeCAL 在可以利用预先计算的密集表示时提供了显著的节省，我们认为该方法可以进一步发展以更广泛地适用。

Subjects: Computation and Language, Machine Learning 主题：计算与语言，机器学习

Publish: 2025-08-11 22:49:54 UTC 发布：2025-08-11 22:49:54 UTC

#44 Steerable Pluralism: Pluralistic Alignment via Few-Shot Comparative Regression #44 可引导的多元主义：通过少样本比较回归实现多元对齐

Large language models (LLMs) are currently aligned using techniques such as reinforcement learning from human feedback (RLHF). However, these methods use scalar rewards that can only reflect user preferences on average. Pluralistic alignment instead seeks to capture diverse user preferences across a set of attributes, moving beyond just helpfulness and harmlessness. Toward this end, we propose a steerable pluralistic model based on few-shot comparative regression that can adapt to individual user preferences. Our approach leverages in-context learning and reasoning, grounded in a set of fine-grained attributes, to compare response options and make aligned choices. To evaluate our algorithm, we also propose two new steerable pluralistic benchmarks by adapting the Moral Integrity Corpus (MIC) and the HelpSteer2 datasets, demonstrating the applicability of our approach to value-aligned decision-making and reward modeling, respectively. Our few-shot comparative regression approach is interpretable and compatible with different attributes and LLMs, while outperforming multiple baseline and state-of-the-art methods. Our work provides new insights and research directions in pluralistic alignment, enabling a more fair and representative use of LLMs and advancing the state-of-the-art in ethical AI. 大型语言模型（LLMs）目前使用如从人类反馈中进行强化学习（RLHF）等技术进行对齐。然而，这些方法使用的标量奖励只能反映用户偏好的一般情况。多元对齐则试图捕捉在一组属性上多样化的用户偏好，超越仅仅帮助性和无害性。为此，我们提出了一种基于少样本比较回归的可操控多元模型，能够适应个别用户偏好。我们的方法利用基于上下文的学习和推理，基于一组细粒度属性来比较响应选项并做出对齐选择。为了评估我们的算法，我们还通过改编道德完整性语料库（Moral Integrity Corpus，MIC）和 HelpSteer2 数据集，提出了两个新的可操控多元基准，分别展示了我们方法在价值对齐决策和奖励建模方面的适用性。我们的少样本比较回归方法具有可解释性并兼容不同属性和 LLMs，同时优于多种基线和最新方法。我们的工作为多元对齐提供了新的见解和研究方向，使得 LLMs 的使用更加公平且具代表性，并推动了伦理人工智能的技术进步。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-11 22:40:31 UTC 发布：2025-08-11 22:40:31 UTC

#45 Momentum Point-Perplexity Mechanics in Large Language Models #45 大型语言模型中的动量点困惑力学

Authors: [Lorenzo Tomaz](https://arxiv.org/search/?searchtype=author&query=Lorenzo Tomaz), [Judd Rosenblatt](https://arxiv.org/search/?searchtype=author&query=Judd Rosenblatt), [Thomas Berry Jones](https://arxiv.org/search/?searchtype=author&query=Thomas Berry Jones), [Diogo Schwerz de Lucena](https://arxiv.org/search/?searchtype=author&query=Diogo Schwerz de Lucena) 作者：Lorenzo Tomaz、Judd Rosenblatt、Thomas Berry Jones、Diogo Schwerz de Lucena

We take a physics-based approach to studying how the internal hidden states of large language models change from token to token during inference. Across 20 open-source transformer models (135M-3B parameters), we find that a quantity combining the rate of change in hidden states and the model’s next-token certainty, analogous to energy in physics, remains nearly constant. Random-weight models conserve this “energy” more tightly than pre-trained ones, while training shifts models into a faster, more decisive regime with greater variability. Using this “log-Lagrangian” view, we derive a control method called Jacobian steering, which perturbs hidden states in the minimal way needed to favor a target token. This approach maintained near-constant energy in two tested models and produced continuations rated higher in semantic quality than the models’ natural outputs. Viewing transformers through this mechanics lens offers a principled basis for interpretability, anomaly detection, and low-risk steering. This could help make powerful models more predictable and aligned with human intent. 我们采用基于物理的方法来研究大型语言模型在推理过程中从一个标记到下一个标记时内部隐藏状态如何变化。在 20 个开源变换器模型（135M–3B 参数）中，我们发现一个将隐藏状态变化率与模型对下一个标记的确定性相结合的量，类似于物理学中的能量，几乎保持不变。随机权重模型比预训练模型更严格地守恒这种“能量”，而训练则将模型转移到一个变化更快、更果断且波动更大的状态。利用这种“对数拉格朗日”视角，我们推导出一种称为雅可比引导（Jacobian steering）的控制方法，该方法以最小的方式扰动隐藏状态以偏向目标标记。在两个测试模型中，这种方法保持了近乎恒定的能量，并产生了在语义质量上被评估为优于模型自然输出的续写。通过这种力学视角审视变换器，为可解释性、异常检测和低风险引导提供了原则性基础。这有助于使强大模型更可预测并更符合人类意图。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-11 21:50:34 UTC 发布：2025-08-11 21:50:34 UTC

#46 Enhancing Small LLM Alignment through Margin-Based Objective Modifications under Resource Constraints #46 在资源受限下通过基于边际的目标修改提升小型 LLM 的对齐

Authors: [Daren Yao](https://arxiv.org/search/?searchtype=author&query=Daren Yao), [Jinsong Yuan](https://arxiv.org/search/?searchtype=author&query=Jinsong Yuan), [Ruike Chen](https://arxiv.org/search/?searchtype=author&query=Ruike Chen) 作者：Daren Yao、Jinsong Yuan、Ruike Chen

Small large language models (LLMs) often face difficulties in aligning output to human preferences, particularly when operating under severe performance gaps. In this work, we propose two lightweight DPO-based variants – Adaptive Margin-Sigmoid Loss and APO-hinge-zero – to better address underperformance scenarios by introducing margin-based objectives and selective update mechanisms. Our APO-hinge-zero method, which combines hinge-induced hard-example mining with the chosen-focused optimization of APO-zero, achieves strong results. In AlpacaEval, APO-hinge-zero improves the win rate by +2.0 points and the length-controlled win rate by +1.4 points compared to the APO-zero baseline. In MT-Bench, our methods maintain competitive performance in diverse categories, particularly excelling in STEM and Humanities tasks. These results demonstrate that simple modifications to preference-based objectives can significantly enhance small LLM alignment under resource constraints, offering a practical path toward more efficient deployment. 小型大型语言模型（LLMs）在将输出与人类偏好对齐时常常面临困难，尤其是在存在显著性能差距的情况下。在本工作中，我们提出了两种基于 DPO 的轻量级变体——自适应边界-Sigmoid 损失（Adaptive Margin-Sigmoid Loss）和 APO-hinge-zero——通过引入基于边界的目标和选择性更新机制，更好地应对性能不足的情形。我们的 APO-hinge-zero 方法将由 hinge 驱动的困难样本挖掘与 APO-zero 的聚焦优化相结合，取得了强劲的成果。在 AlpacaEval 中，APO-hinge-zero 相较于 APO-zero 基线将胜率提高了+2.0 个百分点，并将长度控制的胜率提高了+1.4 个百分点。在 MT-Bench 中，我们的方法在多个类别中保持了有竞争力的表现，特别是在 STEM 和人文学科任务上表现出色。这些结果表明，对基于偏好的目标进行简单修改即可在资源受限下显著增强小型 LLM 的对齐性，为更高效的部署提供了可行路径。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-11 20:53:37 UTC 发布：2025-08-11 20:53:37 UTC

#47 Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment #47 重新思考富形态的分词：一元模型优于 BPE 并且形态对齐占优

Authors: [Saketh Reddy Vemula](https://arxiv.org/search/?searchtype=author&query=Saketh Reddy Vemula), [Dipti Mishra Sharma](https://arxiv.org/search/?searchtype=author&query=Dipti Mishra Sharma), [Parameswari Krishnamurthy](https://arxiv.org/search/?searchtype=author&query=Parameswari Krishnamurthy) 作者：Saketh Reddy Vemula, Dipti Mishra Sharma, Parameswari Krishnamurthy

Prior work on language modeling showed conflicting findings about whether morphologically aligned approaches to tokenization improve performance, particularly for languages with complex morphology. To investigate this, we select a typologically diverse set of languages: Telugu (agglutinative), Hindi (primarily fusional with some agglutination), and English (fusional). We conduct a comprehensive evaluation of language models – starting from tokenizer training and extending through the finetuning and downstream task evaluation. To account for the consistent performance differences observed across tokenizer variants, we focus on two key factors: morphological alignment and tokenization quality. To assess morphological alignment of tokenizers in Telugu, we create a dataset containing gold morpheme segmentations of 600 derivational and 7000 inflectional word forms. Our experiments reveal that better morphological alignment correlates positively – though moderately – with performance in syntax-based tasks such as Parts-of-Speech tagging, Named Entity Recognition and Dependency Parsing. However, we also find that the tokenizer algorithm (Byte-pair Encoding vs. Unigram) plays a more significant role in influencing downstream performance than morphological alignment alone. Naive Unigram tokenizers outperform others across most settings, though hybrid tokenizers that incorporate morphological segmentation significantly improve performance within the BPE framework. In contrast, intrinsic metrics like Corpus Token Count (CTC) and Rényi entropy showed no correlation with downstream performance. 以往关于语言建模的研究在是否采用形态对齐的分词方法能提升性能这一点上存在相互矛盾的结论，尤其是针对形态复杂的语言。为此，我们选择了一组类型学上多样的语言进行研究：泰卢固语（粘着语）、印地语（主要为黏合兼具一定粘着性）和英语（黏合语）。我们对语言模型进行了全面评估——从分词器训练起，延伸到微调与下游任务评估。为了说明在不同分词器变体间观察到的一致性能差异，我们聚焦于两个关键因素：形态对齐与分词质量。为评估泰卢固语分词器的形态对齐性，我们创建了一个包含 600 个派生词和 7000 个屈折词形的人工形素切分金标准数据集。我们的实验表明，更好的形态对齐性在句法类任务（如词性标注、命名实体识别与依存句法分析）上的表现与性能呈正相关——尽管这种相关程度为中等。然而，我们也发现分词算法（字节对编码 vs. （单字法）在影响下游性能方面比仅仅依靠形态对齐更为重要。朴素的单字分词器在大多数设置下优于其他分词器，尽管在 BPE 框架内结合形态切分的混合分词器能显著提升性能。相比之下，诸如语料标记计数（CTC）和 Rényi 熵之类的内在指标与下游性能没有相关性。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-11 19:23:59 UTC 发布：2025-08-11 19:23:59 UTC

#48 Mol-R1: Towards Explicit Long-CoT Reasoning in Molecule Discovery #48 Mol-R1：迈向分子发现中显式的长链链式思维（Long-CoT）推理

Large language models (LLMs), especially Explicit Long Chain-of-Thought (CoT) reasoning models like DeepSeek-R1 and QWQ, have demonstrated powerful reasoning capabilities, achieving impressive performance in commonsense reasoning and mathematical inference. Despite their effectiveness, Long-CoT reasoning models are often criticized for their limited ability and low efficiency in knowledge-intensive domains such as molecule discovery. Success in this field requires a precise understanding of domain knowledge, including molecular structures and chemical principles, which is challenging due to the inherent complexity of molecular data and the scarcity of high-quality expert annotations. To bridge this gap, we introduce Mol-R1, a novel framework designed to improve explainability and reasoning performance of R1-like Explicit Long-CoT reasoning LLMs in text-based molecule generation. Our approach begins with a high-quality reasoning dataset curated through Prior Regulation via In-context Distillation (PRID), a dedicated distillation strategy to effectively generate paired reasoning traces guided by prior regulations. Building upon this, we introduce MoIA, Molecular Iterative Adaptation, a sophisticated training strategy that iteratively combines Supervised Fine-tuning (SFT) with Reinforced Policy Optimization (RPO), tailored to boost the reasoning performance of R1-like reasoning models for molecule discovery. Finally, we examine the performance of Mol-R1 in the text-based molecule reasoning generation task, showing superior performance against existing baselines. 大型语言模型（LLMs），尤其是像 DeepSeek-R1 和 QWQ 这样的显式长链思维（CoT）推理模型，已展示出强大的推理能力，在常识推理和数学推断方面取得了令人瞩目的成绩。尽管它们有效，长链 CoT 推理模型在诸如分子发现等知识密集型领域常被批评为能力有限且效率低下。该领域的成功需要对领域知识（包括分子结构和化学原理）有精确理解，但由于分子数据的固有复杂性以及高质量专家标注的稀缺，这一目标具有挑战性。为弥合这一差距，我们提出了 Mol-R1，一种新框架，旨在提升类似 R1 的显式长链 CoT 推理 LLMs 在基于文本的分子生成中的可解释性和推理性能。我们的方法始于通过先验调控的上下文蒸馏（Prior Regulation via In-context Distillation，PRID）精心策划的高质量推理数据集，这是一种专门的蒸馏策略，用以在先验调控引导下有效生成成对的推理轨迹。在此基础上，我们提出了 MoIA（分子迭代适配），这是一种复杂的训练策略，通过迭代地将监督微调（SFT）与强化策略优化（RPO）相结合，专门用于提升类似 R1 的推理模型在分子发现任务中的推理性能。最后，我们在基于文本的分子推理生成任务中评估了 Mol-R1 的表现，结果显示其优于现有基线。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-11 18:50:05 UTC 发布：2025-08-11 18:50:05 协调世界时

#49 CoDAE: Adapting Large Language Models for Education via Chain-of-Thought Data Augmentation #49 CoDAE：通过连锁思维数据增强将大型语言模型适配于教育领域 [PDF ] [复制] [Kimi ] [关联]

Authors: [Shuzhou Yuan](https://arxiv.org/search/?searchtype=author&query=Shuzhou Yuan), [William LaCroix](https://arxiv.org/search/?searchtype=author&query=William LaCroix), [Hardik Ghoshal](https://arxiv.org/search/?searchtype=author&query=Hardik Ghoshal), [Ercong Nie](https://arxiv.org/search/?searchtype=author&query=Ercong Nie), [Michael Färber](https://arxiv.org/search/?searchtype=author&query=Michael Färber) 作者：Shuzhou Yuan、William LaCroix、Hardik Ghoshal、Ercong Nie、Michael Färber

Large Language Models (LLMs) are increasingly employed as AI tutors due to their scalability and potential for personalized instruction. However, off-the-shelf LLMs often underperform in educational settings: they frequently reveal answers too readily, fail to adapt their responses to student uncertainty, and remain vulnerable to emotionally manipulative prompts. To address these challenges, we introduce CoDAE, a framework that adapts LLMs for educational use through Chain-of-Thought (CoT) data augmentation. We collect real-world dialogues between students and a ChatGPT-based tutor and enrich them using CoT prompting to promote step-by-step reasoning and pedagogically aligned guidance. Furthermore, we design targeted dialogue cases to explicitly mitigate three key limitations: over-compliance, low response adaptivity, and threat vulnerability. We fine-tune four open-source LLMs on different variants of the augmented datasets and evaluate them in simulated educational scenarios using both automatic metrics and LLM-as-a-judge assessments. Our results show that models fine-tuned with CoDAE deliver more pedagogically appropriate guidance, better support reasoning processes, and effectively resist premature answer disclosure. 大型语言模型 (LLMs) 因其可扩展性和个性化教学潜力，正越来越多地被用作 AI 辅导工具。然而，现成的 LLMs 在教育环境中常常表现不佳：它们经常过早地揭示答案，未能根据学生的不确定性调整回答，并且仍然容易受到情感操纵性提示的影响。为了解决这些问题，我们提出了 CoDAE，这是一个通过连贯思维（Chain-of-Thought，CoT）数据增强来使 LLMs 适应教育用途的框架。我们收集了学生与基于 ChatGPT 的导师之间的真实对话，并使用 CoT 提示对其进行丰富，以促进逐步推理和符合教学法的引导。此外，我们设计了针对性的对话案例，明确缓解三大关键局限：过度服从、低响应适应性和威胁脆弱性。我们在不同变体的增强数据集上对四个开源 LLMs 进行了微调，并使用自动评估指标和 LLM-as-a-judge 评估在模拟教育场景中对它们进行了评估。我们的结果表明，经 CoDAE 微调的模型提供了更符合教学的指导，更好地支持推理过程，并能有效抵抗过早揭示答案。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-08-11 18:13:31 UTC 发布：2025-08-11 18:13:31 UTC

#50 Putnam-AXIOM: A Functional and Static Benchmark #50 Putnam-AXIOM：一个函数式与静态基准

Current mathematical reasoning benchmarks for large language models (LLMs) are approaching saturation, with some achieving > 90% accuracy, and are increasingly compromised by training-set contamination. We introduce Putnam-AXIOM, a benchmark of 522 university-level competition problems drawn from the prestigious William Lowell Putnam Mathematical Competition, and Putnam-AXIOM Variation, an unseen companion set of 100 functional variants generated by programmatically perturbing variables and constants. The variation protocol produces an unlimited stream of equally difficult, unseen instances – yielding a contamination-resilient test bed. On the Original set, OpenAI’s o1-preview – the strongest evaluated model – scores 41.9%, but its accuracy drops by 19.6% (46.8% relative decrease) on the paired Variations. The remaining eighteen models show the same downward trend, ten of them with non-overlapping 95% confidence intervals. These gaps suggest memorization and highlight the necessity of dynamic benchmarks. We complement “boxed” accuracy with Teacher-Forced Accuracy (TFA), a lightweight metric that directly scores reasoning traces and automates natural language proof evaluations. Putnam-AXIOM therefore provides a rigorous, contamination-resilient evaluation framework for assessing advanced mathematical reasoning of LLMs. Data and evaluation code are publicly available at https://github.com/brando90/putnam-axiom. 当前针对大型语言模型（LLMs）的数学推理基准测试正在接近饱和状态，有些模型的准确率已超过 90%，且越来越容易受到训练集污染的影响。我们引入了 Putnam-AXIOM，这是一个包含 522 道大学水平竞赛题目的基准，题目摘自享有盛名的威廉·洛厄尔·普特南县数学竞赛（William Lowell Putnam Mathematical Competition），以及 Putnam-AXIOM Variation，这是一个由程序化扰动变量和常数生成的 100 道未见伴生变体集合。该变体生成协议能产生无限流的同等难度且未见的实例——从而提供一个对污染具有抗性的测试平台。在原始题集上，OpenAI 的 o1-preview（本次评估中表现最强的模型）得分为 41.9%，但在配对的变体题上其准确率下降了 19.6 个百分点（相对下降 46.8%）。其余十八个模型也显示出相同的下降趋势，其中十个模型的 95%置信区间不重叠。这些差距表明存在记忆化现象，并凸显了动态基准测试的必要性。我们用“教师强制准确率”（Teacher-Forced Accuracy，TFA）补充了“框定”准确率，这是一种轻量级指标，直接对推理痕迹进行评分并实现自然语言证明评估的自动化。因此，Putnam-AXIOM 提供了一个严格且对污染具有鲁棒性的评估框架，用于评估 LLMs 的高级数学推理能力。数据和评估代码公开可在 https://github.com/brando90/putnam-axiom 获取。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning, Logic in Computer Science, Neural and Evolutionary Computing 主题：计算与语言、人工智能、机器学习、计算机科学中的逻辑、神经与进化计算

Publish: 2025-08-05 17:57:50 UTC 发布：2025-08-05 17:57:50 UTC

#51 Sacred or Synthetic? Evaluating LLM Reliability and Abstention for Religious Questions #51 神圣还是合成？评估 LLM 在宗教问题上的可靠性与回避

Authors: [Farah Atif](https://arxiv.org/search/?searchtype=author&query=Farah Atif), [Nursultan Askarbekuly](https://arxiv.org/search/?searchtype=author&query=Nursultan Askarbekuly), [Kareem Darwish](https://arxiv.org/search/?searchtype=author&query=Kareem Darwish), [Monojit Choudhury](https://arxiv.org/search/?searchtype=author&query=Monojit Choudhury) 作者：Farah Atif、Nursultan Askarbekuly、Kareem Darwish、Monojit Choudhury

Despite the increasing usage of Large Language Models (LLMs) in answering questions in a variety of domains, their reliability and accuracy remain unexamined for a plethora of domains including the religious domains. In this paper, we introduce a novel benchmark FiqhQA focused on the LLM generated Islamic rulings explicitly categorized by the four major Sunni schools of thought, in both Arabic and English. Unlike prior work, which either overlooks the distinctions between religious school of thought or fails to evaluate abstention behavior, we assess LLMs not only on their accuracy but also on their ability to recognize when not to answer. Our zero-shot and abstention experiments reveal significant variation across LLMs, languages, and legal schools of thought. While GPT-4o outperforms all other models in accuracy, Gemini and Fanar demonstrate superior abstention behavior critical for minimizing confident incorrect answers. Notably, all models exhibit a performance drop in Arabic, highlighting the limitations in religious reasoning for languages other than English. To the best of our knowledge, this is the first study to benchmark the efficacy of LLMs for fine-grained Islamic school of thought specific ruling generation and to evaluate abstention for Islamic jurisprudence queries. Our findings underscore the need for task-specific evaluation and cautious deployment of LLMs in religious applications. 尽管大型语言模型（LLMs）在各类领域回答问题的使用越来越广，但它们在许多领域（包括宗教领域）的可靠性和准确性仍未被充分检验。本文提出了一个新基准 FiqhQA，专注于由 LLM 生成的伊斯兰教法裁决，明确按四大逊尼派法学学派进行分类，覆盖阿拉伯语和英语。与以往工作不同，既有的研究要么忽视了宗教学派之间的差异，要么未能评估回避回答的行为，我们不仅评估 LLM 的准确性，还评估其识别何时不应回答的能力。我们的零样本和回避实验显示，不同 LLM、不同语言以及不同法学学派之间存在显著差异。尽管 GPT-4o 在准确性上优于所有其他模型，Gemini 和 Fanar 在回避行为方面表现更佳，这对减少带着信心的错误回答至关重要。值得注意的是，所有模型在阿拉伯语上均出现性能下降，凸显了除英语以外语言在宗教推理方面的局限性。据我们所知，这是首个对 LLMs 在生成细化到伊斯兰学派特定裁决方面的有效性进行基准测试，并评估其在伊斯兰法学查询中弃权策略的研究。我们的发现强调了对特定任务进行评估以及在宗教应用中谨慎部署 LLMs 的必要性。

Subjects: Computation and Language, Artificial Intelligence, Computers and Society

Publish: 2025-08-04 07:27:26 UTC 发布：2025-08-04 07:27:26 UTC

#52 The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs #52 进步的错觉：重新评估 LLMs 中幻觉检测

Authors: [Denis Janiak](https://arxiv.org/search/?searchtype=author&query=Denis Janiak), [Jakub Binkowski](https://arxiv.org/search/?searchtype=author&query=Jakub Binkowski), [Albert Sawczyn](https://arxiv.org/search/?searchtype=author&query=Albert Sawczyn), [Bogdan Gabrys](https://arxiv.org/search/?searchtype=author&query=Bogdan Gabrys), [Ravid Schwartz-Ziv](https://arxiv.org/search/?searchtype=author&query=Ravid Schwartz-Ziv), [Tomasz Kajdanowicz](https://arxiv.org/search/?searchtype=author&query=Tomasz Kajdanowicz) 作者：Denis Janiak、Jakub Binkowski、Albert Sawczyn、Bogdan Gabrys、Ravid Schwartz-Ziv、Tomasz Kajdanowicz

Large language models (LLMs) have revolutionized natural language processing, yet their tendency to hallucinate poses serious challenges for reliable deployment. Despite numerous hallucination detection methods, their evaluations often rely on ROUGE, a metric based on lexical overlap that misaligns with human judgments. Through comprehensive human studies, we demonstrate that while ROUGE exhibits high recall, its extremely low precision leads to misleading performance estimates. In fact, several established detection methods show performance drops of up to 45.9% when assessed using human-aligned metrics like LLM-as-Judge. Moreover, our analysis reveals that simple heuristics based on response length can rival complex detection techniques, exposing a fundamental flaw in current evaluation practices. We argue that adopting semantically aware and robust evaluation frameworks is essential to accurately gauge the true performance of hallucination detection methods, ultimately ensuring the trustworthiness of LLM outputs. 大型语言模型（LLMs）已经彻底改变了自然语言处理，但它们产生幻觉的倾向对可靠部署构成了严重挑战。尽管存在许多幻觉检测方法，其评估常依赖于基于词汇重叠的度量 ROUGE，而该度量与人类判断存在不一致。通过全面的人类研究，我们证明了虽然 ROUGE 展现出高召回率，但其极低的精确率导致误导性的性能估计。事实上，若使用与人类对齐的度量（如 LLM-as-Judge）评估，若干已确立的检测方法的性能下降可达 45.9%。此外，我们的分析表明，基于回答长度的简单启发式方法可以与复杂的检测技术相媲美，这揭示了当前评估实践的一个根本性缺陷。我们认为，采用具有语义感知性且稳健的评估框架对于准确衡量幻觉检测方法的真实性能至关重要，最终确保 LLM 输出的可信性。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-01 20:34:01 UTC

#53 MinionsLLM: a Task-adaptive Framework For The Training and Control of Multi-Agent Systems Through Natural Language #53 MinionsLLM：一种通过自然语言对多智能体系统进行训练与控制的任务自适应框架 [PDF ] [Copy] [Kimi ] [REL]

Authors: [Andres Garcia Rincon](https://arxiv.org/search/?searchtype=author&query=Andres Garcia Rincon), [Eliseo Ferrante](https://arxiv.org/search/?searchtype=author&query=Eliseo Ferrante) 作者：Andres Garcia Rincon、Eliseo Ferrante

This paper presents MinionsLLM, a novel framework that integrates Large Language Models (LLMs) with Behavior Trees (BTs) and Formal Grammars to enable natural language control of multi-agent systems within arbitrary, user-defined environments. MinionsLLM provides standardized interfaces for defining environments, agents, and behavioral primitives, and introduces two synthetic dataset generation methods (Method A and Method B) to fine-tune LLMs for improved syntactic validity and semantic task relevance. We validate our approach using Google’s Gemma 3 model family at three parameter scales (1B, 4B, and 12B) and demonstrate substantial gains: Method B increases syntactic validity to 92.6% and achieves a mean task performance improvement of 33% over baseline. Notably, our experiments show that smaller models benefit most from fine-tuning, suggesting promising directions for deploying compact, locally hosted LLMs in resource-constrained multi-agent control scenarios. The framework and all resources are released open-source to support reproducibility and future research. 本文提出了 MinionsLLM，一种将大型语言模型（LLMs）与行为树（BTs）和形式语法相结合的新框架，使多智能体系统能够在任意用户定义的环境中通过自然语言进行控制。MinionsLLM 为定义环境、智能体和行为原语提供了标准化接口，并引入了两种合成数据集生成方法（方法 A 和方法 B）以微调 LLMs，从而提高语法有效性和语义任务相关性。我们使用 Google 的 Gemma 3 模型家族（三个参数规模：1B、4B 和 12B）验证了我们的方法，并展示了显著提升：方法 B 将语法有效性提高到 92.6%，并使平均任务性能较基线提高了 33%。值得注意的是，实验表明较小的模型从微调中获益最大，这表明在资源受限的多智能体控制场景中部署紧凑的本地托管 LLMs 有着有前景的方向。该框架和所有资源已开源，以支持可重复性和未来研究。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning, Multiagent Systems, Robotics 学科：计算与语言、人工智能、机器学习、多智能体系统、机器人学

Publish: 2025-08-01 13:10:29 UTC 发布：2025-08-01 13:10:29 UTC

#54 Objective Metrics for Evaluating Large Language Models Using External Data Sources #54 使用外部数据源评估大型语言模型的客观指标

Authors: [Haoze Du](https://arxiv.org/search/?searchtype=author&query=Haoze Du), [Richard Li](https://arxiv.org/search/?searchtype=author&query=Richard Li), [Edward Gehringer](https://arxiv.org/search/?searchtype=author&query=Edward Gehringer) 作者：Haoze Du、Richard Li、Edward Gehringer

Evaluating the performance of Large Language Models (LLMs) is a critical yet challenging task, particularly when aiming to avoid subjective assessments. This paper proposes a framework for leveraging subjective metrics derived from the class textual materials across different semesters to assess LLM outputs across various tasks. By utilizing well-defined benchmarks, factual datasets, and structured evaluation pipelines, the approach ensures consistent, reproducible, and bias-minimized measurements. The framework emphasizes automation and transparency in scoring, reducing reliance on human interpretation while ensuring alignment with real-world applications. This method addresses the limitations of subjective evaluation methods, providing a scalable solution for performance assessment in educational, scientific, and other high-stakes domains. 评估大型语言模型（LLMs）的性能是一项关键但具有挑战性的任务，尤其是在力求避免主观评判时。本文提出了一个框架，利用来自不同学期课程文本材料中派生的主观度量来评估 LLM 在各类任务中的输出。通过使用定义明确的基准、事实性数据集和结构化的评估流程，该方法确保了测量的一致性、可重复性并尽量减少偏差。该框架强调评分的自动化和透明性，减少对人工解释的依赖，同时确保与现实应用的一致性。此方法解决了主观评估方法的局限性，为教育、科研及其他高风险领域的性能评估提供了可扩展的解决方案。

Subjects: Computation and Language, Machine Learning 主题：计算与语言，机器学习

Publish: 2025-08-01 02:24:19 UTC 发布：2025-08-01 02:24:19 协调世界时（UTC）

Authors: [Yassine Jamaa](https://arxiv.org/search/?searchtype=author&query=Yassine Jamaa), [Badr AlKhamissi](https://arxiv.org/search/?searchtype=author&query=Badr AlKhamissi), [Satrajit Ghosh](https://arxiv.org/search/?searchtype=author&query=Satrajit Ghosh), [Martin Schrimpf](https://arxiv.org/search/?searchtype=author&query=Martin Schrimpf) 作者：Yassine Jamaa、Badr AlKhamissi、Satrajit Ghosh、Martin Schrimpf

This work adapts a neuroscientific contrast localizer to pinpoint causally relevant units for Theory of Mind (ToM) and mathematical reasoning tasks in large language models (LLMs) and vision-language models (VLMs). Across 11 LLMs and 5 VLMs ranging in size from 3B to 90B parameters, we localize top-activated units using contrastive stimulus sets and assess their causal role via targeted ablations. We compare the effect of lesioning functionally selected units against low-activation and randomly selected units on downstream accuracy across established ToM and mathematical benchmarks. Contrary to expectations, low-activation units sometimes produced larger performance drops than the highly activated ones, and units derived from the mathematical localizer often impaired ToM performance more than those from the ToM localizer. These findings call into question the causal relevance of contrast-based localizers and highlight the need for broader stimulus sets and more accurately capture task-specific units. 这项工作将神经科学中用于对比定位的技术改编为一种方法，以在大型语言模型（LLMs）和视觉-语言模型（VLMs）中定位对心智理论（ToM）和数学推理任务具有因果相关性的单元。我们在 11 个 LLMs 和 5 个 VLMs 上进行了研究，模型规模从 30 亿到 900 亿参数不等，使用对比刺激集定位激活最高的单元，并通过定向切除评估它们的因果作用。我们将对功能性选择的单元进行损毁的影响，与低激活和随机选择的单元在已建立的 ToM 和数学基准上的下游准确率表现进行了比较。与预期相反，低激活单元有时比高度激活的单元导致更大的性能下降，而来自数学定位器的单元经常比来自 ToM 定位器的单元更严重地损害 ToM 性能。这些发现对基于对比的定位器的因果相关性提出质疑，并强调需要更广泛的刺激集以更准确地捕捉与任务相关的单元。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-07-31 10:49:20 UTC 发表：2025-07-31 10:49:20 协调世界时 (UTC)

#56 MLLM-CBench:A Comprehensive Benchmark for Continual Instruction Tuning of Multimodal LLMs with Chain-of-Thought Reasoning Analysis #56 MLLM-CBench：用于多模态大模型持续指令微调的全面基准与链式思维推理分析 [PDF 2 ] [Copy] [Kimi 4 ] [REL]

Multimodal Large Language Models (MLLMs) rely on continual instruction tuning to adapt to the evolving demands of real-world applications. However, progress in this area is hindered by the lack of rigorous and systematic benchmarks. To address this gap, we present MLLM-CTBench, a comprehensive evaluation benchmark with three key contributions: (1) Multidimensional Evaluation: We combine final answer accuracy with fine-grained CoT reasoning quality assessment, enabled by a specially trained CoT evaluator; (2) Comprehensive Evaluation of Algorithms and Training Paradigms: We benchmark eight continual learning algorithms across four major categories and systematically compare reinforcement learning with supervised fine-tuning paradigms; (3) Carefully Curated Tasks: We select and organize 16 datasets from existing work, covering six challenging domains. Our key findings include: (i) Models with stronger general capabilities exhibit greater robustness to forgetting during continual learning; (ii) Reasoning chains degrade more slowly than final answers, supporting the hierarchical forgetting hypothesis; (iii) The effectiveness of continual learning algorithms is highly dependent on both model capability and task order; (iv) In reinforcement learning settings, incorporating KL-divergence constraints helps maintain policy stability and plays a crucial role in mitigating forgetting. MLLM-CTBench establishes a rigorous standard for continual instruction tuning of MLLMs and offers practical guidance for algorithm design and evaluation. 多模态大型语言模型（MLLMs）依靠持续指令微调来适应现实应用中不断变化的需求。然而，该领域的进展受制于缺乏严谨且系统化的基准测试。为填补这一空白，我们提出了 MLLM-CTBench，一个全面的评估基准，具有三大关键贡献：（1）多维评估：我们将最终答案准确性与由专门训练的链式思维（CoT）评估器支持的细粒度 CoT 推理质量评估相结合；（2）算法与训练范式的全面评估：我们对四大类中的八种持续学习算法进行了基准测试，并系统比较了强化学习与监督微调范式；（3）精心策划的任务：我们从现有工作中挑选并组织了 16 个数据集，涵盖六个具有挑战性的领域。我们的主要发现包括： (i) 具有更强通用能力的模型在持续学习过程中对遗忘更具鲁棒性； (ii) 推理链比最终答案衰减得更慢，这支持了分层遗忘假说； (iii) 持续学习算法的有效性高度依赖于模型能力和任务顺序； (iv) 在强化学习设置中，加入 KL 散度约束有助于维持策略稳定性，并在减轻遗忘方面发挥关键作用。MLLM-CTBench 为多模态大模型的持续指令调优建立了严格的标准，并为算法设计与评估提供了实用指导。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-07-31 07:49:36 UTC 发布：2025-07-31 07:49:36 UTC

#57 Distilling Knowledge from Large Language Models: A Concept Bottleneck Model for Hate and Counter Speech Recognition #57 从大型语言模型中蒸馏知识：用于仇恨与反言论识别的概念瓶颈模型

Authors: [Roberto Labadie-Tamayo](https://arxiv.org/search/?searchtype=author&query=Roberto Labadie-Tamayo), [Djordje Slijepčević](https://arxiv.org/search/?searchtype=author&query=Djordje Slijepčević), [Xihui Chen](https://arxiv.org/search/?searchtype=author&query=Xihui Chen), [Adrian Jaques Böck](https://arxiv.org/search/?searchtype=author&query=Adrian Jaques Böck), [Andreas Babic](https://arxiv.org/search/?searchtype=author&query=Andreas Babic), [Liz Freimann](https://arxiv.org/search/?searchtype=author&query=Liz Freimann), [Christiane Atzmüller Matthias Zeppelzauer](https://arxiv.org/search/?searchtype=author&query=Christiane Atzmüller Matthias Zeppelzauer) 作者：Roberto Labadie-Tamayo、Djordje Slijepčević、Xihui Chen、Adrian Jaques Böck、Andreas Babic、Liz Freimann、Christiane Atzmüller、Matthias Zeppelzauer

The rapid increase in hate speech on social media has exposed an unprecedented impact on society, making automated methods for detecting such content important. Unlike prior black-box models, we propose a novel transparent method for automated hate and counter speech recognition, i.e., “Speech Concept Bottleneck Model” (SCBM), using adjectives as human-interpretable bottleneck concepts. SCBM leverages large language models (LLMs) to map input texts to an abstract adjective-based representation, which is then sent to a light-weight classifier for downstream tasks. Across five benchmark datasets spanning multiple languages and platforms (e.g., Twitter, Reddit, YouTube), SCBM achieves an average macro-F1 score of 0.69 which outperforms the most recently reported results from the literature on four out of five datasets. Aside from high recognition accuracy, SCBM provides a high level of both local and global interpretability. Furthermore, fusing our adjective-based concept representation with transformer embeddings, leads to a 1.8% performance increase on average across all datasets, showing that the proposed representation captures complementary information. Our results demonstrate that adjective-based concept representations can serve as compact, interpretable, and effective encodings for hate and counter speech recognition. With adapted adjectives, our method can also be applied to other NLP tasks. 社交媒体上仇恨言论的迅速增加已对社会产生了前所未有的影响，使得自动检测此类内容的方法变得非常重要。与以往的黑箱模型不同，我们提出了一种用于自动仇恨言论和反对言论识别的新型透明方法，即“言语概念瓶颈模型”（Speech Concept Bottleneck Model，SCBM），使用形容词作为人类可解释的瓶颈概念。SCBM 利用大型语言模型（LLMs）将输入文本映射为基于形容词的抽象表示，然后将其发送到轻量级分类器以完成下游任务。在覆盖多种语言和平台（如 Twitter、Reddit、YouTube）的五个基准数据集上，SCBM 达到了平均宏 F1 分数 0.69，在五个数据集中有四个优于文献中最近报告的结果。除了高识别准确率外，SCBM 还提供了较高的局部和全局可解释性。此外，将我们基于形容词的概念表示与 transformer 嵌入融合，在所有数据集上平均带来 1.8% 的性能提升，表明所提出的表示捕捉了互补信息。我们的结果表明，基于形容词的概念表示可以作为对仇恨言论和反驳言论识别的紧凑、可解释且有效的编码。通过调整形容词，我们的方法也可以应用于其他自然语言处理任务。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-07-30 21:50:30 UTC 发布：2025-07-30 21:50:30 UTC

#58 TT-XAI: Trustworthy Clinical Text Explanations via Keyword Distillation and LLM Reasoning #58 TT-XAI：通过关键词提炼和 LLM 推理实现可信的临床文本解释

Authors: [Kristian Miok](https://arxiv.org/search/?searchtype=author&query=Kristian Miok), [Blaz Škrlj](https://arxiv.org/search/?searchtype=author&query=Blaz Škrlj), [Daniela Zaharie](https://arxiv.org/search/?searchtype=author&query=Daniela Zaharie), [Marko Robnik Šikonja](https://arxiv.org/search/?searchtype=author&query=Marko Robnik Šikonja) 作者：Kristian Miok、Blaz Škrlj、Daniela Zaharie、Marko Robnik Šikonja

Clinical language models often struggle to provide trustworthy predictions and explanations when applied to lengthy, unstructured electronic health records (EHRs). This work introduces TT-XAI, a lightweight and effective framework that improves both classification performance and interpretability through domain-aware keyword distillation and reasoning with large language models (LLMs). First, we demonstrate that distilling raw discharge notes into concise keyword representations significantly enhances BERT classifier performance and improves local explanation fidelity via a focused variant of LIME. Second, we generate chain-of-thought clinical explanations using keyword-guided prompts to steer LLMs, producing more concise and clinically relevant reasoning. We evaluate explanation quality using deletion-based fidelity metrics, self-assessment via LLaMA-3 scoring, and a blinded human study with domain experts. All evaluation modalities consistently favor the keyword-augmented method, confirming that distillation enhances both machine and human interpretability. TT-XAI offers a scalable pathway toward trustworthy, auditable AI in clinical decision support. 临床语言模型在应用于冗长、非结构化的电子健康记录（EHR）时，往往难以提供可信的预测和解释。本文提出了 TT-XAI，一个轻量且高效的框架，通过面向领域的关键词蒸馏和与大型语言模型（LLMs）的推理，提升了分类性能和可解释性。首先，我们证明了将原始出院记录蒸馏为简洁的关键词表示，能够显著增强 BERT 分类器的性能，并通过一种聚焦变体的 LIME 提高局部解释的一致性。其次，我们使用关键词引导提示生成连贯思路（chain-of-thought）的临床解释，以引导 LLMs，产生更简明且临床相关的推理。我们使用基于删除的一致性度量、通过 LLaMA-3 评分的自评估以及与领域专家的盲测人类研究来评估解释质量。所有评估方式一致偏向于增强关键词的方法，确认蒸馏提升了机器与人类的可解释性。TT-XAI 为在临床决策支持中实现可信、可审计的 AI 提供了可扩展的路径。

Subjects: Computation and Language, Machine Learning 主题：计算与语言，机器学习

Publish: 2025-07-30 16:28:10 UTC

#59 Real-time News Story Identification #59 实时新闻事件识别

Authors: [Tadej Škvorc](https://arxiv.org/search/?searchtype=author&query=Tadej Škvorc), [Nikola Ivačič](https://arxiv.org/search/?searchtype=author&query=Nikola Ivačič), [Sebastjan Hribar](https://arxiv.org/search/?searchtype=author&query=Sebastjan Hribar), [Marko Robnik-Šikonja](https://arxiv.org/search/?searchtype=author&query=Marko Robnik-Šikonja) 作者：Tadej Škvorc、Nikola Ivačič、Sebastjan Hribar、Marko Robnik-Šikonja

To improve the reading experience, many news sites organize news into topical collections, called stories. In this work, we present an approach for implementing real-time story identification for a news monitoring system that automatically collects news articles as they appear online and processes them in various ways. Story identification aims to assign each news article to a specific story that the article is covering. The process is similar to text clustering and topic modeling, but requires that articles be grouped based on particular events, places, and people, rather than general text similarity (as in clustering) or general (predefined) topics (as in topic modeling). We present an approach to story identification that is capable of functioning in real time, assigning articles to stories as they are published online. In the proposed approach, we combine text representation techniques, clustering algorithms, and online topic modeling methods. We combine various text representation methods to extract specific events and named entities necessary for story identification, showing that a mixture of online topic-modeling approaches such as BERTopic, DBStream, and TextClust can be adapted for story discovery. We evaluate our approach on a news dataset from Slovene media covering a period of 1 month. We show that our real-time approach produces sensible results as judged by human evaluators. 为了改善阅读体验，许多新闻网站将新闻按主题集合组织，称为“故事”。在这项工作中，我们提出了一种用于新闻监测系统的实时故事识别方法，该系统会在新闻在线发布时自动收集新闻文章并对其进行各种处理。故事识别旨在将每篇新闻文章分配到该文章所涉及的特定故事中。该过程类似于文本聚类和主题建模，但要求基于特定事件、地点和人物对文章进行分组，而不是基于一般文本相似性（如聚类）或一般（预定义）主题（如主题建模）。我们提出了一种能够实时运行的故事识别方法，在文章在线发布时即可将其分配到相应故事。在所提方法中，我们结合了文本表示技术、聚类算法和在线主题建模方法。我们结合多种文本表示方法来提取识别故事所需的特定事件和命名实体，展示了像 BERTopic、DBStream 和 TextClust 这类线上主题建模方法的混合可以被调整用于故事发现。我们在覆盖为期 1 个月的斯洛文尼亚媒体新闻数据集上评估了我们的方法。我们展示了经人工评估者判断，我们的实时方法产生了合理的结果。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-07-30 16:21:00 UTC 发布：2025-07-30 16:21:00 UTC

#60 Heartificial Intelligence: Exploring Empathy in Language Models #60 心工智能：探索语言模型中的共情

Authors: [Victoria Williams](https://arxiv.org/search/?searchtype=author&query=Victoria Williams), [Benjamin Rosman](https://arxiv.org/search/?searchtype=author&query=Benjamin Rosman) 作者：Victoria Williams，Benjamin Rosman

Large language models have become increasingly common, used by millions of people worldwide in both professional and personal contexts. As these models continue to advance, they are frequently serving as virtual assistants and companions. In human interactions, effective communication typically involves two types of empathy: cognitive empathy (understanding others’ thoughts and emotions) and affective empathy (emotionally sharing others’ feelings). In this study, we investigated both cognitive and affective empathy across several small (SLMs) and large (LLMs) language models using standardized psychological tests. Our results revealed that LLMs consistently outperformed humans - including psychology students - on cognitive empathy tasks. However, despite their cognitive strengths, both small and large language models showed significantly lower affective empathy compared to human participants. These findings highlight rapid advancements in language models’ ability to simulate cognitive empathy, suggesting strong potential for providing effective virtual companionship and personalized emotional support. Additionally, their high cognitive yet lower affective empathy allows objective and consistent emotional support without running the risk of emotional fatigue or bias. 大型语言模型在全球范围内变得越来越普遍，被数百万用户在职业和个人场景中使用。随着这些模型不断进步，它们经常充当虚拟助理和陪伴者。在人际互动中，有效沟通通常涉及两种类型的共情：认知共情（理解他人的想法和情感）和情感共情（在情感上与他人的感受产生共鸣）。在这项研究中，我们使用标准化心理学测试，考察了若干小型（SLMs）和大型（LLMs）语言模型在认知共情和情感共情方面的表现。我们的结果显示，LLMs 在认知共情任务上持续优于人类——包括心理学专业的学生。然而，尽管在认知方面表现优异，小型和大型语言模型在情感共情方面均显著低于人类参与者。这些发现凸显了语言模型在模拟认知共情能力上的快速进展，表明其在提供有效虚拟陪伴和个性化情感支持方面具有很大潜力。此外，它们较高的认知同理心与较低的情感同理心相结合，使得能够提供客观且一致的情感支持，而不会面临情感疲劳或偏见的风险。

Subjects: Computation and Language, Human-Computer Interaction 主题：计算与语言、人机交互

Publish: 2025-07-30 14:09:33 UTC 发布：2025-07-30 14:09:33 UTC

#61 TurQUaz at CheckThat! 2025: Debating Large Language Models for Scientific Web Discourse Detection #61 TurQUaz at CheckThat! 2025：为科学网络话语检测辩论大型语言模型

Authors: [Tarık Saraç](https://arxiv.org/search/?searchtype=author&query=Tarık Saraç), [Selin Mergen](https://arxiv.org/search/?searchtype=author&query=Selin Mergen), [Mucahid Kutlu](https://arxiv.org/search/?searchtype=author&query=Mucahid Kutlu) 作者：Tarık Saraç、Selin Mergen、Mucahid Kutlu

In this paper, we present our work developed for the scientific web discourse detection task (Task 4a) of CheckThat! 2025. We propose a novel council debate method that simulates structured academic discussions among multiple large language models (LLMs) to identify whether a given tweet contains (i) a scientific claim, (ii) a reference to a scientific study, or (iii) mentions of scientific entities. We explore three debating methods: i) single debate, where two LLMs argue for opposing positions while a third acts as a judge; ii) team debate, in which multiple models collaborate within each side of the debate; and iii) council debate, where multiple expert models deliberate together to reach a consensus, moderated by a chairperson model. We choose council debate as our primary model as it outperforms others in the development test set. Although our proposed method did not rank highly for identifying scientific claims (8th out of 10) or mentions of scientific entities (9th out of 10), it ranked first in detecting references to scientific studies. 在本文中，我们介绍了为 CheckThat! 2025 的科学网络话语检测任务（任务 4a）开发的工作。我们提出了一种新颖的理事会辩论方法，该方法通过模拟多个大型语言模型（LLMs）之间的结构化学术讨论来判断给定推文是否包含 (i) 科学主张、(ii) 对科学研究的引用，或 (iii) 对科学实体的提及。我们探索了三种辩论方法：i）单一辩论，两名 LLMs 为对立立场辩论，第三方充当裁判；ii）团队辩论，每一方由多个模型协作；iii）理事会辩论，多位专家模型共同商议以达成共识，由一名主席模型主持。我们选择理事会辩论作为主要模型，因为它在开发测试集上优于其他方法。尽管我们提出的方法在识别科学主张方面排名不高（在 10 个中位列第 8）或在识别科学实体提及时排名第 9，但在检测对科学研究的引用方面排名第一。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-07-26 00:46:23 UTC 发布：2025-07-26 00:46:23 UTC

#62 Argument Quality Annotation and Gender Bias Detection in Financial Communication through Large Language Models #62 通过大型语言模型在金融交流中进行论点质量注释和性别偏见检测

Authors: [Alaa Alhamzeh](https://arxiv.org/search/?searchtype=author&query=Alaa Alhamzeh), [Mays Al Rebdawi](https://arxiv.org/search/?searchtype=author&query=Mays Al Rebdawi) 作者：Alaa Alhamzeh、Mays Al Rebdawi

Financial arguments play a critical role in shaping investment decisions and public trust in financial institutions. Nevertheless, assessing their quality remains poorly studied in the literature. In this paper, we examine the capabilities of three state-of-the-art LLMs GPT-4o, Llama 3.1, and Gemma 2 in annotating argument quality within financial communications, using the FinArgQuality dataset. Our contributions are twofold. First, we evaluate the consistency of LLM-generated annotations across multiple runs and benchmark them against human annotations. Second, we introduce an adversarial attack designed to inject gender bias to analyse models responds and ensure model’s fairness and robustness. Both experiments are conducted across three temperature settings to assess their influence on annotation stability and alignment with human labels. Our findings reveal that LLM-based annotations achieve higher inter-annotator agreement than human counterparts, though the models still exhibit varying degrees of gender bias. We provide a multifaceted analysis of these outcomes and offer practical recommendations to guide future research toward more reliable, cost-effective, and bias-aware annotation methodologies. 金融论证在影响投资决策和公众对金融机构的信任方面起着关键作用。然而，关于评估其质量的研究在文献中仍然十分不足。在本文中，我们使用 FinArgQuality 数据集，考察三种最先进的 LLMs：GPT-4o、Llama 3.1 和 Gemma 2 在为金融传播中的论证质量做注释方面的能力。我们的贡献有两方面。首先，我们评估了 LLM 在多次运行中生成的注释一致性，并将其与人工注释进行基准比较。其次，我们提出了一种旨在注入性别偏见的对抗性攻击，以分析模型的响应并确保模型的公平性和鲁棒性。两项实验均在三种温度设置下进行，以评估它们对注释稳定性及与人工标签一致性的影响。我们的研究结果显示，基于 LLM 的注释在标注者间一致性上优于人工标注，尽管这些模型仍然表现出不同程度的性别偏见。我们对这些结果进行了多方面的分析，并提出实用建议，以指导未来的研究朝着更可靠、更具成本效益且更加关注偏见的标注方法发展。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-07-22 17:54:45 UTC 发布：2025-07-22 17:54:45 UTC

#63 P/D-Device: Disaggregated Large Language Model between Cloud and Devices #63 P/D-Device：云端与设备间的解聚合大型语言模型

Serving disaggregated large language models has been widely adopted in industrial practice for enhanced performance. However, too many tokens generated in decoding phase, i.e., occupying the resources for a long time, essentially hamper the cloud from achieving a higher throughput. Meanwhile, due to limited on-device resources, the time to first token (TTFT), i.e., the latency of prefill phase, increases dramatically with the growth on prompt length. In order to concur with such a bottleneck on resources, i.e., long occupation in cloud and limited on-device computing capacity, we propose to separate large language model between cloud and devices. That is, the cloud helps a portion of the content for each device, only in its prefill phase. Specifically, after receiving the first token from the cloud, decoupling with its own prefill, the device responds to the user immediately for a lower TTFT. Then, the following tokens from cloud are presented via a speed controller for smoothed TPOT (the time per output token), until the device catches up with the progress. On-device prefill is then amortized using received tokens while the resource usage in cloud is controlled. Moreover, during cloud prefill, the prompt can be refined, using those intermediate data already generated, to further speed up on-device inference. We implement such a scheme P/D-Device, and confirm its superiority over other alternatives. We further propose an algorithm to decide the best settings. Real-trace experiments show that TTFT decreases at least 60%, maximum TPOT is about tens of milliseconds, and cloud throughput increases by up to 15x. 在工业实践中，为了提升性能，分布式部署大语言模型已被广泛采用。然而，在解码阶段生成过多的 token（即长时间占用资源）从根本上阻碍了云端实现更高吞吐量。同时，由于设备端资源有限，首个 token 响应时间（TTFT，即预填充阶段的延迟）会随着提示长度的增加而显著上升。为了解决这一资源瓶颈——即云端的长时间占用和设备端有限的计算能力——我们提出在云端与设备之间对大型语言模型进行拆分。也就是说，云端仅在每个设备的预填充阶段帮助生成一部分内容。具体来说，在接收到来自云端的第一个 token 后，设备与自身的预填充解耦，立即响应用户以降低 TTFT。随后，云端通过速率控制器以平滑的每输出 token 时间（TPOT）提供后续 token，直到设备赶上进度。这样，设备端的预填充就可以使用收到的 token 来摊销，而云端的资源使用也得以控制。此外，在云端预填充期间，可以利用那些已经生成的中间数据来优化提示，从而进一步加速设备端推理。我们实现了这样一种方案 P/D-Device，并证明其优于其他替代方案。我们还提出了一种算法来决定最佳设置。真实轨迹实验表明，TTFT 至少降低 60%，最大 TPOT 约为几十毫秒，云端吞吐量最多提高约 15 倍。

Subjects: Distributed, Parallel, and Cluster Computing, Computation and Language, Machine Learning 主题：分布式、并行与集群计算，计算与语言，机器学习

Publish: 2025-08-12 15:56:29 UTC 发布时间：2025-08-12 15:56:29 UTC

#64 E3-Rewrite: Learning to Rewrite SQL for Executability, Equivalence,and Efficiency #64 E3-Rewrite：学习重写 SQL 以实现可执行性、等价性和效率

SQL query rewriting aims to reformulate a query into a more efficient form while preserving equivalence. Most existing methods rely on predefined rewrite rules. However, such rule-based approaches face fundamental limitations: (1) fixed rule sets generalize poorly to novel query patterns and struggle with complex queries; (2) a wide range of effective rewriting strategies cannot be fully captured by declarative rules. To overcome these issues, we propose using large language models (LLMs) to generate rewrites. LLMs can capture complex strategies, such as evaluation reordering and CTE rewriting. Despite this potential, directly applying LLMs often results in suboptimal or non-equivalent rewrites due to a lack of execution awareness and semantic grounding. To address these challenges, We present E3-Rewrite, an LLM-based SQL rewriting framework that produces executable, equivalent, and efficient queries. It integrates two core components: a context construction module and a reinforcement learning framework. First, the context module leverages execution plans and retrieved demonstrations to build bottleneck-aware prompts that guide inference-time rewriting. Second, we design a reward function targeting executability, equivalence, and efficiency, evaluated via syntax checks, equivalence verification, and cost estimation. Third, to ensure stable multi-objective learning, we adopt a staged curriculum that first emphasizes executability and equivalence, then gradually incorporates efficiency. Extensive experiments show that E3-Rewrite achieves up to a 25.6% reduction in query execution time compared to state-of-the-art methods across multiple SQL benchmarks. Moreover, it delivers up to 24.4% more successful rewrites, expanding coverage to complex queries that previous systems failed to handle. SQL 查询重写旨在将查询重新表述为更高效的形式，同时保持等价性。现有的大多数方法依赖预定义的重写规则。然而，这类基于规则的方法存在根本性局限：(1) 固定的规则集难以推广到新颖的查询模式，并在处理复杂查询时表现不佳；(2) 许多有效的重写策略无法被声明式规则完全涵盖。为克服这些问题，我们提出使用大型语言模型（LLMs）来生成重写。LLMs 能够捕捉复杂的策略，例如评估重排序和 CTE 重写。尽管具有这种潜力，直接应用 LLMs 常常因为缺乏执行感知和语义基础而产生次优或非等价的重写。为了解决这些挑战，我们提出了 E3-Rewrite，一种基于 LLM 的 SQL 重写框架，能够生成可执行、等价且高效的查询。它集成了两个核心组件：上下文构建模块和强化学习框架。首先，上下文模块利用执行计划和检索到的示例来构建对瓶颈敏感的提示，从而引导推理时的重写。其次，我们设计了一个针对可执行性、等价性和效率的奖励函数，通过语法检查、等价性验证和成本估算来评估。第三，为了保证稳定的多目标学习，我们采用了分阶段的课程学习，先强调可执行性和等价性，然后逐步加入效率。大量实验表明，与最先进的方法相比，E3-Rewrite 在多个 SQL 基准测试中最多可将查询执行时间减少 25.6%。此外，它还提供了最多 24.4% 的更多成功重写，将覆盖范围扩展到以往系统无法处理的复杂查询。

Subjects: Databases, Artificial Intelligence, Computation and Language 主题：数据库，人工智能，计算与语言

Publish: 2025-08-12 15:38:10 UTC 发布：2025-08-12 15:38:10 UTC

#65 Revealing the Role of Audio Channels in ASR Performance Degradation #65 揭示音频通道在 ASR 性能下降中的作用

Authors: [Kuan-Tang Huang](https://arxiv.org/search/?searchtype=author&query=Kuan-Tang Huang), [Li-Wei Chen](https://arxiv.org/search/?searchtype=author&query=Li-Wei Chen), [Hung-Shin Lee](https://arxiv.org/search/?searchtype=author&query=Hung-Shin Lee), [Berlin Chen](https://arxiv.org/search/?searchtype=author&query=Berlin Chen), [Hsin-Min Wang](https://arxiv.org/search/?searchtype=author&query=Hsin-Min Wang) 作者：Kuan-Tang Huang, Li-Wei Chen, Hung-Shin Lee, Berlin Chen, Hsin-Min Wang

Pre-trained automatic speech recognition (ASR) models have demonstrated strong performance on a variety of tasks. However, their performance can degrade substantially when the input audio comes from different recording channels. While previous studies have demonstrated this phenomenon, it is often attributed to the mismatch between training and testing corpora. This study argues that variations in speech characteristics caused by different recording channels can fundamentally harm ASR performance. To address this limitation, we propose a normalization technique designed to mitigate the impact of channel variation by aligning internal feature representations in the ASR model with those derived from a clean reference channel. This approach significantly improves ASR performance on previously unseen channels and languages, highlighting its ability to generalize across channel and language differences. 预训练的自动语音识别（ASR）模型在多种任务上表现优异。然而，当输入音频来自不同的录音通道时，它们的性能可能会大幅下降。尽管以往研究已观察到这一现象，但通常将其归因于训练语料与测试语料之间的不匹配。本研究认为，不同录音通道导致的语音特征差异会从根本上损害 ASR 性能。为了解决这一限制，我们提出了一种归一化技术，旨在通过将 ASR 模型内部的特征表示与来自干净参考通道的表示对齐，来减轻通道变化的影响。该方法显著提升了在先前未见过的通道和语言上的 ASR 性能，突显出其在通道和语言差异间的泛化能力。

Subjects: Sound, Artificial Intelligence, Computation and Language 主题：声音、人工智能、计算与语言

Publish: 2025-08-12 14:32:48 UTC 发布时间：2025-08-12 14:32:48 UTC

#66 A Dual-Axis Taxonomy of Knowledge Editing for LLMs: From Mechanisms to Functions #66 面向 LLMs 的知识编辑双轴分类法：从机制到功能

Authors: [Amir Mohammad Salehoof](https://arxiv.org/search/?searchtype=author&query=Amir Mohammad Salehoof), [Ali Ramezani](https://arxiv.org/search/?searchtype=author&query=Ali Ramezani), [Yadollah Yaghoobzadeh](https://arxiv.org/search/?searchtype=author&query=Yadollah Yaghoobzadeh), [Majid Nili Ahmadabadi](https://arxiv.org/search/?searchtype=author&query=Majid Nili Ahmadabadi) 作者：Amir Mohammad Salehoof、Ali Ramezani、Yadollah Yaghoobzadeh、Majid Nili Ahmadabadi

Large language models (LLMs) acquire vast knowledge from large text corpora, but this information can become outdated or inaccurate. Since retraining is computationally expensive, knowledge editing offers an efficient alternative – modifying internal knowledge without full retraining. These methods aim to update facts precisely while preserving the model’s overall capabilities. While existing surveys focus on the mechanism of editing (e.g., parameter changes vs. external memory), they often overlook the function of the knowledge being edited. This survey introduces a novel, complementary function-based taxonomy to provide a more holistic view. We examine how different mechanisms apply to various knowledge types – factual, temporal, conceptual, commonsense, and social – highlighting how editing effectiveness depends on the nature of the target knowledge. By organizing our review along these two axes, we map the current landscape, outline the strengths and limitations of existing methods, define the problem formally, survey evaluation tasks and datasets, and conclude with open challenges and future directions. 大型语言模型（LLMs）从大规模文本语料中获取了大量知识，但这些信息可能会变得过时或不准确。由于重新训练代价高昂，知识编辑提供了一种更高效的替代方案——在不进行完全重训的情况下修改内部知识。这些方法旨在精确更新事实，同时保留模型的整体能力。尽管现有综述侧重于编辑机制（例如参数变化与外部记忆），但它们常常忽视被编辑知识的功能。本综述引入了一种新颖且互补的基于功能的分类法，以提供更全面的视角。我们考察了不同机制如何应用于各类知识——事实性、时态性、概念性、常识性和社会性——并强调编辑效果如何依赖于目标知识的性质。通过沿这两条轴线组织我们的回顾，我们绘制了当前格局，概述了现有方法的优劣，形式化定义了问题，调研了评估任务与数据集，并在结尾讨论了开放挑战与未来方向。

Subjects: Artificial Intelligence, Computation and Language 主题：人工智能，计算与语言

Publish: 2025-08-12 09:51:39 UTC

#67 Designing Memory-Augmented AR Agents for Spatiotemporal Reasoning in Personalized Task Assistance #67 为个性化任务辅助设计具备记忆增强的增强现实代理，以进行时空推理 [PDF ] [Copy] [Kimi ] [REL]

Authors: [Dongwook Choi](https://arxiv.org/search/?searchtype=author&query=Dongwook Choi), [Taeyoon Kwon](https://arxiv.org/search/?searchtype=author&query=Taeyoon Kwon), [Dongil Yang](https://arxiv.org/search/?searchtype=author&query=Dongil Yang), [Hyojun Kim](https://arxiv.org/search/?searchtype=author&query=Hyojun Kim), [Jinyoung Yeo](https://arxiv.org/search/?searchtype=author&query=Jinyoung Yeo) 作者：Dongwook Choi、Taeyoon Kwon、Dongil Yang、Hyojun Kim、Jinyoung Yeo

Augmented Reality (AR) systems are increasingly integrating foundation models, such as Multimodal Large Language Models (MLLMs), to provide more context-aware and adaptive user experiences. This integration has led to the development of AR agents to support intelligent, goal-directed interactions in real-world environments. While current AR agents effectively support immediate tasks, they struggle with complex multi-step scenarios that require understanding and leveraging user’s long-term experiences and preferences. This limitation stems from their inability to capture, retain, and reason over historical user interactions in spatiotemporal contexts. To address these challenges, we propose a conceptual framework for memory-augmented AR agents that can provide personalized task assistance by learning from and adapting to user-specific experiences over time. Our framework consists of four interconnected modules: (1) Perception Module for multimodal sensor processing, (2) Memory Module for persistent spatiotemporal experience storage, (3) Spatiotemporal Reasoning Module for synthesizing past and present contexts, and (4) Actuator Module for effective AR communication. We further present an implementation roadmap, a future evaluation strategy, a potential target application and use cases to demonstrate the practical applicability of our framework across diverse domains. We aim for this work to motivate future research toward developing more intelligent AR systems that can effectively bridge user’s interaction history with adaptive, context-aware task assistance. 增强现实（AR）系统正越来越多地整合基础模型，例如多模态大语言模型（MLLMs），以提供更具情境感知性和自适应性的用户体验。这种整合促成了用于在真实世界环境中支持智能、面向目标交互的 AR 代理的发展。尽管现有的 AR 代理能有效地支持即时任务，但在需要理解并利用用户长期经验和偏好的复杂多步骤场景中仍然表现不足。这一局限源于它们无法在时空语境中捕捉、保留并推理历史用户交互。为了解决这些挑战，我们提出了一个用于记忆增强型 AR 代理的概念框架，该框架能够通过随时间学习和适应用户特定经验来提供个性化任务辅助。我们的框架由四个相互关联的模块组成：（1）用于多模态传感处理的感知模块，（2）用于持久时空经验存储的记忆模块，（3）用于综合过去与现在语境的时空推理模块，以及（4）用于有效 AR 通信的执行模块。我们还提出了一个实施路线图、未来的评估策略、一个潜在的目标应用及使用案例，以展示我们框架在不同领域的实用适用性。我们的目标是通过这项工作激发未来研究，开发能够有效将用户交互历史与自适应、上下文感知的任务辅助相结合的更智能的增强现实系统。

Subjects: Artificial Intelligence, Computation and Language 主题：人工智能，计算与语言

Publish: 2025-08-12 09:20:20 UTC 发布：2025-08-12 09:20:20 UTC

#68 MultiAiTutor: Child-Friendly Educational Multilingual Speech Generation Tutor with LLMs #68 MultiAiTutor：面向儿童友好的教育多语种语音生成导师，基于 LLMs

Authors: [Xiaoxue Gao](https://arxiv.org/search/?searchtype=author&query=Xiaoxue Gao), [Huayun Zhang](https://arxiv.org/search/?searchtype=author&query=Huayun Zhang), [Nancy F. Chen](https://arxiv.org/search/?searchtype=author&query=Nancy F. Chen) 作者：高晓雪，张华云，Nancy F. Chen

Generative speech models have demonstrated significant potential in personalizing teacher-student interactions, offering valuable real-world applications for language learning in children’s education. However, achieving high-quality, child-friendly speech generation remains challenging, particularly for low-resource languages across diverse languages and cultural contexts. In this paper, we propose MultiAiTutor, an educational multilingual generative AI tutor with child-friendly designs, leveraging LLM architecture for speech generation tailored for educational purposes. We propose to integrate age-appropriate multilingual speech generation using LLM architectures, facilitating young children’s language learning through culturally relevant image-description tasks in three low-resource languages: Singaporean-accent Mandarin, Malay, and Tamil. Experimental results from both objective metrics and subjective evaluations demonstrate the superior performance of the proposed MultiAiTutor compared to baseline methods. 生成式语音模型在个性化师生互动方面显示出显著潜力，为儿童教育中的语言学习提供了有价值的现实应用。然而，要实现高质量、适合儿童的语音生成仍具有挑战性，尤其是在资源稀缺且语言与文化多样的情境中。本文提出了 MultiAiTutor，一种具有儿童友好设计的多语种教育生成式人工智能导师，利用 LLM 架构进行面向教育用途的语音生成。我们提出结合适龄的多语种语音生成方法，使用 LLM 架构，通过与文化相关的图像描述任务，促进幼儿在三种低资源语言——新加坡口音普通话、马来语和泰米尔语——中的语言学习。来自客观指标和主观评价的实验结果均表明，所提出的 MultiAiTutor 比基线方法具有更优越的性能。

Subjects: Audio and Speech Processing, Artificial Intelligence, Computation and Language, Signal Processing 主题：音频与语音处理、人工智能、计算与语言、信号处理

Publish: 2025-08-12 07:58:48 UTC 发布：2025-08-12 07:58:48 UTC

#69 M2LLM: Multi-view Molecular Representation Learning with Large Language Models #69 M2 LLM：使用大型语言模型的多视角分子表示学习

Authors: [Jiaxin Ju](https://arxiv.org/search/?searchtype=author&query=Jiaxin Ju), [Yizhen Zheng](https://arxiv.org/search/?searchtype=author&query=Yizhen Zheng), [Huan Yee Koh](https://arxiv.org/search/?searchtype=author&query=Huan Yee Koh), [Can Wang](https://arxiv.org/search/?searchtype=author&query=Can Wang), [Shirui Pan](https://arxiv.org/search/?searchtype=author&query=Shirui Pan) 作者：Jiaxin Ju、Yizhen Zheng、Huan Yee Koh、Can Wang、Shirui Pan

Accurate molecular property prediction is a critical challenge with wide-ranging applications in chemistry, materials science, and drug discovery. Molecular representation methods, including fingerprints and graph neural networks (GNNs), achieve state-of-the-art results by effectively deriving features from molecular structures. However, these methods often overlook decades of accumulated semantic and contextual knowledge. Recent advancements in large language models (LLMs) demonstrate remarkable reasoning abilities and prior knowledge across scientific domains, leading us to hypothesize that LLMs can generate rich molecular representations when guided to reason in multiple perspectives. To address these gaps, we propose M2LLM, a multi-view framework that integrates three perspectives: the molecular structure view, the molecular task view, and the molecular rules view. These views are fused dynamically to adapt to task requirements, and experiments demonstrate that M2LLM achieves state-of-the-art performance on multiple benchmarks across classification and regression tasks. Moreover, we demonstrate that representation derived from LLM achieves exceptional performance by leveraging two core functionalities: the generation of molecular embeddings through their encoding capabilities and the curation of molecular features through advanced reasoning processes. 准确的分子性质预测是一个关键挑战，在化学、材料科学和药物发现等领域具有广泛应用。分子表示方法，包括指纹（fingerprints）和图神经网络（GNNs），通过从分子结构中有效提取特征，实现了最先进的结果。然而，这些方法常常忽视了数十年来积累的语义和上下文知识。最近在大型语言模型（LLMs）方面的进展展示了其在科学领域中的出色推理能力和先验知识，这使我们假设在引导其从多视角进行推理时，LLMs 能生成丰富的分子表示。为了解决这些空白，我们提出了 M2 LLM，这是一种多视角框架，整合了三种视角：分子结构视角、分子任务视角和分子规则视角。这些视角被动态融合以适应任务需求，实验表明 M2 LLM 在多个分类和回归任务的基准测试中达到了最先进的性能。此外，我们展示了通过利用两项核心功能，从 LLM 导出的表示能实现卓越表现：一是通过其编码能力生成分子嵌入，二是通过先进的推理过程策划分子特征。

Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题：机器学习、人工智能、计算与语言

Publish: 2025-08-12 05:46:47 UTC 发布：2025-08-12 05:46:47 UTC

#70 MiGrATe: Mixed-Policy GRPO for Adaptation at Test-Time #70 MiGrATe：用于测试时自适应的混合策略 GRPO

Authors: [Peter Phan](https://arxiv.org/search/?searchtype=author&query=Peter Phan), [Dhruv Agarwal](https://arxiv.org/search/?searchtype=author&query=Dhruv Agarwal), [Kavitha Srinivas](https://arxiv.org/search/?searchtype=author&query=Kavitha Srinivas), [Horst Samulowitz](https://arxiv.org/search/?searchtype=author&query=Horst Samulowitz), [Pavan Kapanipathi](https://arxiv.org/search/?searchtype=author&query=Pavan Kapanipathi), [Andrew McCallum](https://arxiv.org/search/?searchtype=author&query=Andrew McCallum) 作者：Peter Phan、Dhruv Agarwal、Kavitha Srinivas、Horst Samulowitz、Pavan Kapanipathi、Andrew McCallum

Large language models (LLMs) are increasingly being applied to black-box optimization tasks, from program synthesis to molecule design. Prior work typically leverages in-context learning to iteratively guide the model towards better solutions. Such methods, however, often struggle to balance exploration of new solution spaces with exploitation of high-reward ones. Recently, test-time training (TTT) with synthetic data has shown promise in improving solution quality. However, the need for hand-crafted training data tailored to each task limits feasibility and scalability across domains. To address this problem, we introduce MiGrATe-a method for online TTT that uses GRPO as a search algorithm to adapt LLMs at inference without requiring external training data. MiGrATe operates via a mixed-policy group construction procedure that combines on-policy sampling with two off-policy data selection techniques: greedy sampling, which selects top-performing past completions, and neighborhood sampling (NS), which generates completions structurally similar to high-reward ones. Together, these components bias the policy gradient towards exploitation of promising regions in solution space, while preserving exploration through on-policy sampling. We evaluate MiGrATe on three challenging domains-word search, molecule optimization, and hypothesis+program induction on the Abstraction and Reasoning Corpus (ARC)-and find that it consistently outperforms both inference-only and TTT baselines, demonstrating the potential of online TTT as a solution for complex search tasks without external supervision. 大型语言模型 (LLMs) 正在越来越多地被应用于黑箱优化任务，从程序合成到分子设计。先前的工作通常利用上下文学习来迭代地引导模型朝更好的解前进。然而，这类方法常在探索新解空间与利用高回报解之间难以取得平衡。最近，使用合成数据的测试时训练（TTT）在提高解的质量方面显示出潜力。但为每个任务量身定制的手工训练数据的需求限制了跨领域的可行性和可扩展性。为了解决这个问题，我们提出了 MiGrATe——一种在线 TTT 方法，使用 GRPO 作为搜索算法，在推理时自适应 LLMs，无需外部训练数据。MiGrATe 通过一种混合策略群体构建程序来运行，该程序将オン策略采样与两种离策略数据选择技术相结合：贪心采样，从过去表现最好的完成功能中选择；以及邻域采样（NS），生成在结构上类似于高回报解的完成功能。这些组件共同使策略梯度偏向于开发解空间中有希望的区域，同时通过基于策略的采样保持探索。我们在三个具有挑战性的领域——单词搜索、分子优化，以及抽象与推理语料库（ARC）上的假设+程序归纳——上评估了 MiGrATe，发现它在所有情况下都持续优于仅推理和测试时训练（TTT）基线，证明了在线 TTT 作为在无外部监督情况下解决复杂搜索任务的一种可行方案的潜力。

Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题：机器学习、人工智能、计算与语言

Publish: 2025-08-12 05:08:21 UTC 发布：2025-08-12 05:08:21 UTC

#71 Adaptive Personalized Conversational Information Retrieval #71 自适应个性化对话式信息检索

Personalized conversational information retrieval (CIR) systems aim to satisfy users’ complex information needs through multi-turn interactions by considering user profiles. However, not all search queries require personalization. The challenge lies in appropriately incorporating personalization elements into search when needed. Most existing studies implicitly incorporate users’ personal information and conversational context using large language models without distinguishing the specific requirements for each query turn. Such a ``one-size-fits-all’’ personalization strategy might lead to sub-optimal results. In this paper, we propose an adaptive personalization method, in which we first identify the required personalization level for a query and integrate personalized queries with other query reformulations to produce various enhanced queries. Then, we design a personalization-aware ranking fusion approach to assign fusion weights dynamically to different reformulated queries, depending on the required personalization level. The proposed adaptive personalized conversational information retrieval framework APCIR is evaluated on two TREC iKAT datasets. The results confirm the effectiveness of adaptive personalization of APCIR by outperforming state-of-the-art methods. 个性化会话信息检索（CIR）系统旨在通过考虑用户档案，通过多轮交互满足用户复杂的信息需求。然而，并非所有检索查询都需要个性化。挑战在于在需要时恰当地将个性化元素纳入检索中。现有的大多数研究在使用大型语言模型时，隐式地融合了用户的个人信息和会话上下文，而没有区分每个查询轮次的具体需求。这种“一刀切”的个性化策略可能导致次优结果。在本文中，我们提出了一种自适应个性化方法：首先识别查询所需的个性化程度，并将个性化查询与其他查询改写相结合，生成多种增强查询；然后，我们设计了一种感知个性化的排序融合方法，根据所需的个性化程度为不同的改写查询动态分配融合权重。所提出的自适应个性化会话信息检索框架 APCIR 在两个 TREC iKAT 数据集上进行了评估。结果证实了自适应个性化 APCIR 的有效性，其性能超越了最先进的方法。

Subjects: Information Retrieval, Computation and Language 主题：信息检索，计算与语言

Publish: 2025-08-12 04:53:33 UTC 发布日期：2025-08-12 04:53:33 UTC

#72 Fine-grained Video Dubbing Duration Alignment with Segment Supervised Preference Optimization #72 细粒度视频配音时长对齐与基于片段监督的偏好优化

Authors: [Chaoqun Cui](https://arxiv.org/search/?searchtype=author&query=Chaoqun Cui), [Liangbin Huang](https://arxiv.org/search/?searchtype=author&query=Liangbin Huang), [Shijing Wang](https://arxiv.org/search/?searchtype=author&query=Shijing Wang), [Zhe Tong](https://arxiv.org/search/?searchtype=author&query=Zhe Tong), [Zhaolong Huang](https://arxiv.org/search/?searchtype=author&query=Zhaolong Huang), [Xiao Zeng](https://arxiv.org/search/?searchtype=author&query=Xiao Zeng), [Xiaofeng Liu](https://arxiv.org/search/?searchtype=author&query=Xiaofeng Liu) 作者：崔超群、黄良斌、王世景、佟喆、黄兆龙、曾晓、刘晓峰

Video dubbing aims to translate original speech in visual media programs from the source language to the target language, relying on neural machine translation and text-to-speech technologies. Due to varying information densities across languages, target speech often mismatches the source speech duration, causing audio-video synchronization issues that significantly impact viewer experience. In this study, we approach duration alignment in LLM-based video dubbing machine translation as a preference optimization problem. We propose the Segment Supervised Preference Optimization (SSPO) method, which employs a segment-wise sampling strategy and fine-grained loss to mitigate duration mismatches between source and target lines. Experimental results demonstrate that SSPO achieves superior performance in duration alignment tasks. 视频配音旨在将视觉媒体节目中的原始语音从源语言翻译为目标语言，依赖于神经机器翻译和文本到语音技术。由于不同语言之间信息密度的差异，目标语音常常与源语音的时长不匹配，导致视听同步问题，从而显著影响观众体验。本研究将基于 LLM 的视频配音机器翻译中的时长对齐问题视为一个偏好优化问题。我们提出了分段监督偏好优化（Segment Supervised Preference Optimization，SSPO）方法，该方法采用分段采样策略和细粒度损失来缓解源句与目标句之间的时长不匹配。实验结果表明，SSPO 在时长对齐任务中取得了优异的性能。

Subjects: Sound, Computation and Language 学科：声音，计算与语言

Publish: 2025-08-12 01:38:31 UTC 发表：2025-08-12 01:38:31 UTC

#73 Re:Verse – Can Your VLM Read a Manga? #73 Re:Verse——你的视觉语言模型能读漫画吗？

Authors: [Aaditya Baranwal](https://arxiv.org/search/?searchtype=author&query=Aaditya Baranwal), [Madhav Kataria](https://arxiv.org/search/?searchtype=author&query=Madhav Kataria), [Naitik Agrawal](https://arxiv.org/search/?searchtype=author&query=Naitik Agrawal), [Yogesh S Rawat](https://arxiv.org/search/?searchtype=author&query=Yogesh S Rawat), [Shruti Vyas](https://arxiv.org/search/?searchtype=author&query=Shruti Vyas) 作者：Aaditya Baranwal、Madhav Kataria、Naitik Agrawal、Yogesh S Rawat、Shruti Vyas

Current Vision Language Models (VLMs) demonstrate a critical gap between surface-level recognition and deep narrative reasoning when processing sequential visual storytelling. Through a comprehensive investigation of manga narrative understanding, we reveal that while recent large multimodal models excel at individual panel interpretation, they systematically fail at temporal causality and cross-panel cohesion, core requirements for coherent story comprehension. We introduce a novel evaluation framework that combines fine-grained multimodal annotation, cross-modal embedding analysis, and retrieval-augmented assessment to systematically characterize these limitations. Our methodology includes (i) a rigorous annotation protocol linking visual elements to narrative structure through aligned light novel text, (ii) comprehensive evaluation across multiple reasoning paradigms, including direct inference and retrieval-augmented generation, and (iii) cross-modal similarity analysis revealing fundamental misalignments in current VLMs’ joint representations. Applying this framework to Re:Zero manga across 11 chapters with 308 annotated panels, we conduct the first systematic study of long-form narrative understanding in VLMs through three core evaluation axes: generative storytelling, contextual dialogue grounding, and temporal reasoning. Our findings demonstrate that current models lack genuine story-level intelligence, struggling particularly with non-linear narratives, character consistency, and causal inference across extended sequences. This work establishes both the foundation and practical methodology for evaluating narrative intelligence, while providing actionable insights into the capability of deep sequential understanding of Discrete Visual Narratives beyond basic recognition in Multimodal Models. 当前的视觉语言模型（VLMs）在处理序列化视觉叙事时，表面识别能力与深层叙事推理之间存在显著差距。通过对漫画叙事理解的全面调查，我们发现，尽管近期的大型多模态模型在单幅分镜的解读上表现出色，但它们在时间因果关系和跨格连贯性方面系统性地失败，而这恰恰是连贯故事理解的核心要求。我们提出了一种新颖的评估框架，结合了细粒度多模态注释、跨模态嵌入分析和检索增强评估，以系统性地刻画这些局限性。我们的方法学包括：(i) 通过与轻小说文本对齐，将视觉元素与叙事结构关联起来的严格注释协议，(ii) 在多种推理范式上的全面评估，包括直接推断和检索增强生成，及 (iii) 揭示当前 VLMs 联合表征中基本错配的跨模态相似性分析。将该框架应用于涵盖 11 章、308 个标注面板的《Re:Zero》漫画后，我们通过三个核心评估维度——生成式叙事、情境对话落地和时间推理——开展了首个对视觉-语言模型长篇叙事理解的系统性研究。我们的发现表明，当前模型缺乏真正的故事层面智能，尤其在非线性叙事、角色一致性以及跨越长序列的因果推断方面表现薄弱。这项工作既奠定了评估叙事智能的基础和实用方法论，也为超越多模态模型的基础识别能力、实现对离散视觉叙事的深度序列理解的能力提供了可操作的洞见。

Subjects: Computer Vision and Pattern Recognition, Computation and Language 主题：计算机视觉与模式识别，计算与语言

Publish: 2025-08-11 22:40:05 UTC 发布：2025-08-11 22:40:05 UTC

#74 Bilevel MCTS for Amortized O(1) Node Selection in Classical Planning #74 双层 MCTS：用于经典规划中摊销 O(1)节点选择

Author: [Masataro Asai](https://arxiv.org/search/?searchtype=author&query=Masataro Asai) 作者：浅井雅太郎

We study an efficient implementation of Multi-Armed Bandit (MAB)-based Monte-Carlo Tree Search (MCTS) for classical planning. One weakness of MCTS is that it spends a significant time deciding which node to expand next. While selecting a node from an OPEN list with N nodes has O(1) runtime complexity with traditional array-based priority-queues for dense integer keys, the tree-based OPEN list used by MCTS requires O(logN), which roughly corresponds to the search depth d. In classical planning, d is arbitrarily large (e.g., 2k−1 in k-disk Tower-of-Hanoi) and the runtime for node selection is significant, unlike in game tree search, where the cost is negligible compared to the node evaluation (rollouts) because d is inherently limited by the game (e.g., d≤361 in Go). To improve this bottleneck, we propose a bilevel modification to MCTS that runs a best-first search from each selected leaf node with an expansion budget proportional to d, which achieves amortized O(1) runtime for node selection, equivalent to the traditional queue-based OPEN list. In addition, we introduce Tree Collapsing, an enhancement that reduces action selection steps and further improves the performance. 我们研究了一种针对经典规划的基于多臂赌博机（MAB）的蒙特卡洛树搜索（MCTS）的高效实现。MCTS 的一个弱点是它在决定下一个要扩展的节点上花费大量时间。虽然使用基于数组的优先队列针对密集整数键从 OPEN 列表中选择一个包含 N 个节点的节点具有 O(1) 的运行时复杂度，但 MCTS 使用的基于树的 OPEN 列表需要 O(logN) ，这大致对应于搜索深度 d 。在经典规划中， d 可以任意大（例如，在 k 的 2k−1 磁盘汉诺塔问题中），节点选择的运行时间是显著的；这与博弈树搜索不同，在博弈中相对于节点评估（模拟）而言其开销可以忽略不计，因为 d 受博弈的内在限制（例如在围棋中为 d≤361 ）。为改进这一瓶颈，我们提出对 MCTS 的两层修改：从每个被选中的叶节点运行一次最佳优先搜索，扩展预算与 d 成比例，从而实现节点选择的摊还 O(1) 运行时，与传统基于队列的 OPEN 列表等价。此外，我们引入了树折叠（Tree Collapsing）技术，这是一种减少动作选择步骤并进一步提升性能的改进。

Subjects: Artificial Intelligence, Computation and Language 主题：人工智能，计算与语言

Publish: 2025-08-11 18:12:40 UTC 发布日期：2025-08-11 18:12:40 UTC

#75 Exploring the Technical Knowledge Interaction of Global Digital Humanities: Three-decade Evidence from Bibliometric-based perspectives #75 探索全球数字人文的技术知识互动：基于文献计量视角的三十年证据

Authors: [Jiayi Li](https://arxiv.org/search/?searchtype=author&query=Jiayi Li), [Chengxi Yan](https://arxiv.org/search/?searchtype=author&query=Chengxi Yan), [Yurong Zeng](https://arxiv.org/search/?searchtype=author&query=Yurong Zeng), [Zhichao Fang](https://arxiv.org/search/?searchtype=author&query=Zhichao Fang), [Huiru Wang](https://arxiv.org/search/?searchtype=author&query=Huiru Wang) 作者：李佳怡、闫成熙、曾雨蓉、方志超、王慧入

Digital Humanities (DH) is an interdisciplinary field that integrates computational methods with humanities scholarship to investigate innovative topics. Each academic discipline follows a unique developmental path shaped by the topics researchers investigate and the methods they employ. With the help of bibliometric analysis, most of previous studies have examined DH across multiple dimensions such as research hotspots, co-author networks, and institutional rankings. However, these studies have often been limited in their ability to provide deep insights into the current state of technological advancements and topic development in DH. As a result, their conclusions tend to remain superficial or lack interpretability in understanding how methods and topics interrelate in the field. To address this gap, this study introduced a new concept of Topic-Method Composition (TMC), which refers to a hybrid knowledge structure generated by the co-occurrence of specific research topics and the corresponding method. Especially by analyzing the interaction between TMCs, we can see more clearly the intersection and integration of digital technology and humanistic subjects in DH. Moreover, this study developed a TMC-based workflow combining bibliometric analysis, topic modeling, and network analysis to analyze the development characteristics and patterns of research disciplines. By applying this workflow to large-scale bibliometric data, it enables a detailed view of the knowledge structures, providing a tool adaptable to other fields. 数字人文（DH）是一个将计算方法与人文学科研究相结合，以探讨创新议题的跨学科领域。每个学术学科都遵循由研究者所研究的主题和所采用的方法共同塑造的独特发展路径。借助文献计量分析，以往大多数研究已从研究热点、合著网络和机构排名等多个维度考察了数字人文。然而，这些研究在深入洞察数字人文技术进展现状和主题发展方面常常有所局限。因此，它们的结论往往停留在表面，或在理解方法与主题如何在该领域相互关联方面缺乏可解释性。为弥补这一空白，本研究提出了“主题-方法构成”（TMC）的新概念，指特定研究主题与相应方法共现所生成的混合知识结构。尤其是通过分析 TMC 之间的交互，我们可以更清晰地看到数字技术与人文学科在数字人文中的交叉与融合。此外，本研究开发了一个基于 TMC 的工作流，结合文献计量分析、主题建模和网络分析，以分析研究学科的发展特征和模式。通过将该工作流应用于大规模文献计量数据，它能够详尽呈现知识结构，并提供一个可适用于其他领域的工具。

Subjects: Digital Libraries, Computation and Language

Publish: 2025-08-11 12:27:39 UTC 发布时间：2025-08-11 12:27:39 协调世界时

#76 Maximizing GPU Efficiency via Optimal Adapter Caching: An Analytical Approach for Multi-Tenant LLM Serving #76 通过最优适配器缓存最大化 GPU 效率：面向多租户 LLM 服务的分析方法

Serving LLM adapters has gained significant attention as an effective approach to adapt general-purpose language models to diverse, task-specific use cases. However, serving a wide range of adapters introduces several and substantial overheads, leading to performance degradation and challenges in optimal placement. To address these challenges, we present an analytical, AI-driven pipeline that accurately determines the optimal allocation of adapters in single-node setups. This allocation maximizes performance, effectively using GPU resources, while preventing request starvation. Crucially, the proposed allocation is given based on current workload patterns. These insights in single-node setups can be leveraged in multi-replica deployments for overall placement, load balancing and server configuration, ultimately enhancing overall performance and improving resource efficiency. Our approach builds on an in-depth analysis of LLM adapter serving, accounting for overheads and performance variability, and includes the development of the first Digital Twin capable of replicating online LLM-adapter serving systems with matching key performance metrics. The experimental results demonstrate that the Digital Twin achieves a SMAPE difference of no more than 5.5% in throughput compared to real results, and the proposed pipeline accurately predicts the optimal placement with minimal latency. 为 LLM 适配器提供服务已成为将通用语言模型适配到各种特定任务用例的一种有效方法，并因此受到广泛关注。然而，为大量不同适配器提供服务会带来若干显著的开销，导致性能下降并使最佳部署位置变得具有挑战性。为了解决这些问题，我们提出了一个基于分析且由 AI 驱动的流水线，能够精确确定单节点环境中适配器的最优分配。该分配最大化性能，充分利用 GPU 资源，同时防止请求饿死。关键的是，所提出的分配是基于当前工作负载模式给出的。这些在单节点环境中的洞见可被用于多副本部署中的整体放置、负载均衡和服务器配置，从而最终提升整体性能并改善资源效率。我们的方法建立在对 LLM 适配器服务的深入分析之上，考虑了开销与性能波动，并包含了首个能够以匹配关键性能指标复制在线 LLM 适配器服务系统的数字孪生的开发。实验结果表明，数字孪生在吞吐量方面与真实结果相比的 SMAPE 差异不超过 5.5%，且所提出的流水线能够以极低的延迟准确预测最佳部署位置。

Subjects: Performance, Artificial Intelligence, Computation and Language 主题：性能，人工智能，计算与语言

Publish: 2025-08-11 10:47:35 UTC 发布：2025-08-11 10:47:35 UTC

#77 Doctor Sun: A Bilingual Multimodal Large Language Model for Biomedical AI #77 Doctor Sun：面向生物医学 AI 的双语多模态大语言模型

Authors: [Dong Xue](https://arxiv.org/search/?searchtype=author&query=Dong Xue), [Ziyao Shao](https://arxiv.org/search/?searchtype=author&query=Ziyao Shao), [Zhaoyang Duan](https://arxiv.org/search/?searchtype=author&query=Zhaoyang Duan), [Fangzhou Liu](https://arxiv.org/search/?searchtype=author&query=Fangzhou Liu), [Bing Li](https://arxiv.org/search/?searchtype=author&query=Bing Li), [Zhongheng Zhang](https://arxiv.org/search/?searchtype=author&query=Zhongheng Zhang) 作者：薛东、邵子尧、段朝阳、刘方舟、李冰、张中衡

Large multimodal models (LMMs) have demonstrated significant potential in providing innovative solutions for various biomedical tasks, including pathology analysis, radiology report generation, and biomedical assistance. However, the existing multimodal biomedical AI is typically based on foundation LLMs, thus hindering the understanding of intricate medical concepts with limited medical training data. Moreover, recent LLaVA-induced medical LMMs struggle to effectively capture the intricate relationship between the texts and the images. Therefore, we introduce Doctor Sun, a large multimodal generative model specialized in medicine, developed to encode, integrate, and interpret diverse biomedical data modalities such as text and images. In particular, Doctor Sun integrates a pre-trained vision encoder with a medical LLM and conducts two-stage training on various medical datasets, focusing on feature alignment and instruction tuning. Moreover, we release SunMed-VL, a wide-range bilingual medical multimodal dataset, along with all associated models, code, and resources, to freely support the advancement of biomedical multimodal research. 大型多模态模型（LMMs）在为多种生物医学任务提供创新解决方案方面展示了显著潜力，包括病理分析、放射报告生成和生物医学助理等。然而，现有的多模态生物医学人工智能通常基于基础 LLM，从而在医疗训练数据有限的情况下阻碍了对复杂医学概念的理解。此外，最近受 LLaVA 启发的医学 LMMs 在有效捕捉文本与图像之间复杂关系方面表现不佳。因此，我们推出了 Doctor Sun，这是一款专注于医学领域的大型多模态生成模型，旨在编码、整合并解释诸如文本与图像等多种生物医学数据模态。具体而言，Doctor Sun 将预训练视觉编码器与医学 LLM 相结合，并在各种医学数据集上进行了两阶段训练，重点是特征对齐和指令微调。此外，我们发布了 SunMed-VL，一个覆盖广泛的双语医学多模态数据集，以及所有相关模型、代码和资源，以自由支持生物医学多模态研究的进展。

Subjects: Machine Learning, Artificial Intelligence, Computation and Language, Multimedia 主题：机器学习、人工智能、计算与语言、多媒体

Publish: 2025-07-30 13:53:54 UTC 发布时间：2025-07-30 13:53:54 UTC

#78 Benchmarking Large Language Models for Geolocating Colonial Virginia Land Grants #78 基准测试大型语言模型以对殖民地弗吉尼亚地契进行地理定位

Author: [Ryan Mioduski](https://arxiv.org/search/?searchtype=author&query=Ryan Mioduski) 作者：Ryan Mioduski

Virginia’s seventeenth- and eighteenth-century land patents survive primarily as narrative metes-and-bounds descriptions, limiting spatial analysis. This study systematically evaluates current-generation large language models (LLMs) in converting these prose abstracts into geographically accurate latitude/longitude coordinates within a focused evaluation context. A digitized corpus of 5,471 Virginia patent abstracts (1695-1732) is released, with 43 rigorously verified test cases serving as an initial, geographically focused benchmark. Six OpenAI models across three architectures (o-series, GPT-4-class, and GPT-3.5) were tested under two paradigms: direct-to-coordinate and tool-augmented chain-of-thought invoking external geocoding APIs. Results were compared with a GIS-analyst baseline, the Stanford NER geoparser, Mordecai-3, and a county-centroid heuristic. The top single-call model, o3-2025-04-16, achieved a mean error of 23 km (median 14 km), outperforming the median LLM (37.4 km) by 37.5%, the weakest LLM (50.3 km) by 53.5%, and external baselines by 67% (GIS analyst) and 70% (Stanford NER). A five-call ensemble further reduced errors to 19 km (median 12 km) at minimal additional cost (approx. USD 0.20 per grant), outperforming the median LLM by 48.6%. A patentee-name-redaction ablation increased error by about 9%, indicating reliance on textual landmark and adjacency descriptions rather than memorization. The cost-efficient gpt-4o-2024-08-06 model maintained a 28 km mean error at USD 1.09 per 1,000 grants, establishing a strong cost-accuracy benchmark; external geocoding tools offered no measurable benefit in this evaluation. These findings demonstrate the potential of LLMs for scalable, accurate, and cost-effective historical georeferencing. 弗吉尼亚州十七至十八世纪的土地专利主要以叙述性的界址描述（metes-and-bounds）保存，限制了空间分析的可能性。本研究系统评估了新一代大型语言模型（LLMs）在将这些散文摘要转换为地理上准确的经纬度坐标方面的表现，评估范围集中且有针对性。我们发布了一个数字化语料库，包含 5,471 份弗吉尼亚州专利摘要（1695–1732），并提供 43 个经过严格验证的测试用例，作为初步的、地理聚焦的基准。测试了六种 OpenAI 模型，涵盖三种架构（o 系列、GPT-4 级别和 GPT-3.5），在两种范式下进行：直接输出坐标和通过调用外部地理编码 API 的工具增强链式思维。结果与 GIS 分析师基线、斯坦福 NER 地理解析器（Stanford NER geoparser）、Mordecai-3 和县中心点启发式方法进行了比较。表现最佳的单次调用模型 o3-2025-04-16 的平均误差为 23 公里（中位数 14 公里），比中位 LLM（37.4 公里）好 37.5%，比表现最差的 LLM（50.3 公里）好 53.5%，并分别比外部基线好 67%（GIS 分析师）和 70%（斯坦福 NER）。一个五次调用的集成方法在极少附加成本下（每项专利约 0.20 美元）将错误进一步降低到 19 公里（中位数 12 公里），比中位 LLM 的表现好 48.6%。对专利权人名称去标识化的消融实验使误差增加约 9%，表明模型依赖于文本中的地标和邻接描述而非记忆。性价比较高的 gpt-4o-2024-08-06 模型在每 1,000 项专利 1.09 美元的成本下保持了 28 公里的平均误差，确立了一个强有力的成本-精度基准；在本次评估中，外部地理编码工具没有带来可测量的益处。这些发现展示了 LLM 在可扩展、准确且具有成本效益的历史地理定位方面的潜力。

Subjects: Machine Learning, Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition, Information Retrieval 主题：机器学习、人工智能、计算与语言、计算机视觉与模式识别、信息检索

Publish: 2025-07-27 21:49:58 UTC 发布：2025-07-27 21:49:58 UTC

1.2.2 Artificial Intelligence

From：https://papers.cool/arxiv/cs.AI

From：https://arxiv.org/list/cs.AI/recenthttps://arxiv.org/list/cs.CL/recent

2025-08-13 | | 总计：162

#1 BrowseMaster: Towards Scalable Web Browsing via Tool-Augmented Programmatic Agent Pair #1 BrowseMaster：通过工具增强的编程代理对实现可扩展网页浏览

Authors: [Xianghe Pang](https://arxiv.org/search/?searchtype=author&query=Xianghe Pang), [Shuo Tang](https://arxiv.org/search/?searchtype=author&query=Shuo Tang), [Rui Ye](https://arxiv.org/search/?searchtype=author&query=Rui Ye), [Yuwen Du](https://arxiv.org/search/?searchtype=author&query=Yuwen Du), [Yaxin Du](https://arxiv.org/search/?searchtype=author&query=Yaxin Du), [Siheng Chen](https://arxiv.org/search/?searchtype=author&query=Siheng Chen) 作者：庞翔和、唐硕、叶睿、杜宇文、杜亚欣、陈思衡

Effective information seeking in the vast and ever-growing digital landscape requires balancing expansive search with strategic reasoning. Current large language model (LLM)-based agents struggle to achieve this balance due to limitations in search breadth and reasoning depth, where slow, serial querying restricts coverage of relevant sources and noisy raw inputs disrupt the continuity of multi-step reasoning. To address these challenges, we propose BrowseMaster, a scalable framework built around a programmatically augmented planner-executor agent pair. The planner formulates and adapts search strategies based on task constraints, while the executor conducts efficient, targeted retrieval to supply the planner with concise, relevant evidence. This division of labor preserves coherent, long-horizon reasoning while sustaining broad and systematic exploration, overcoming the trade-off that limits existing agents. Extensive experiments on challenging English and Chinese benchmarks show that BrowseMaster consistently outperforms open-source and proprietary baselines, achieving scores of 30.0 on BrowseComp-en and 46.5 on BrowseComp-zh, which demonstrates its strong capability in complex, reasoning-heavy information-seeking tasks at scale. 在广阔且不断增长的数字环境中，高效的信息检索需要在广泛搜索与策略性推理之间取得平衡。当前基于 LLM 的代理因搜索覆盖面受限与推理深度不足而难以实现这一平衡：缓慢的串行查询限制了对相关来源的覆盖，而嘈杂的原始输入则破坏了多步骤推理的连续性。为了解决这些挑战，我们提出了 BrowseMaster，这是一个围绕可编程增强的规划者—执行者代理对构建的可扩展框架。规划者根据任务约束制定并调整搜索策略，执行者则进行高效、有针对性的检索，为规划者提供简洁且相关的证据。这种分工在保持连贯的长期推理同时，维持广泛且系统的探索，克服了限制现有代理的权衡。在具有挑战性的英中基准数据集上进行的大量实验表明，BrowseMaster 始终优于开源和专有的基线，在 BrowseComp-en 上取得了 30.0 的得分，在 BrowseComp-zh 上取得了 46.5 的得分，展示了其在大规模复杂、以推理为重的信息检索任务中的强大能力。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-12 17:56:25 UTC 发布：2025-08-12 17:56:25 UTC

#2 OpenCUA: Open Foundations for Computer-Use Agents #2 OpenCUA：面向计算机使用代理的开放基础（PDF 9 ）[复制] [Kimi 5 ] [相关]

Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs) capable of automating diverse computer tasks. As their commercial potential grows, critical details of the most capable CUA systems remain closed. As these agents will increasingly mediate digital interactions and execute consequential decisions on our behalf, the research community needs access to open CUA frameworks to study their capabilities, limitations, and risks. To bridge this gap, we propose OpenCUA, a comprehensive open-source framework for scaling CUA data and foundation models. Our framework consists of: (1) an annotation infrastructure that seamlessly captures human computer-use demonstrations; (2) AgentNet, the first large-scale computer-use task dataset spanning 3 operating systems and 200+ applications and websites; (3) a scalable pipeline that transforms demonstrations into state-action pairs with reflective long Chain-of-Thought reasoning that sustain robust performance gains as data scales. Our end-to-end agent models demonstrate strong performance across CUA benchmarks. In particular, OpenCUA-32B achieves an average success rate of 34.8% on OSWorld-Verified, establishing a new state-of-the-art (SOTA) among open-source models and surpassing OpenAI CUA (GPT-4o). Further analysis confirms that our approach generalizes well across domains and benefits significantly from increased test-time computation. We release our annotation tool, datasets, code, and models to build open foundations for further CUA research. 视觉-语言模型已展示出作为计算机使用代理（CUA）的强大能力，能够自动执行多种计算机任务。随着其商业潜力的增长，最强大 CUA 系统的关键细节仍被封闭。当这些代理将越来越多地介入数字交互并代表我们做出重大决策时，研究界需要可开放的 CUA 框架来研究其能力、局限性和风险。为弥合这一差距，我们提出了 OpenCUA，一个用于扩展 CUA 数据和基础模型的综合开源框架。我们的框架包括：（1）一种无缝捕捉人类计算机使用演示的标注基础设施；（2）AgentNet，第一个覆盖 3 种操作系统和 200+应用与网站的大规模计算机使用任务数据集；（3）一个可扩展的管道，将演示转化为带有反思性长链思维推理的状态-动作对，随着数据规模的扩大保持稳健的性能提升。我们的端到端代理模型在 CUA 基准测试中表现出强劲的性能。尤其是，OpenCUA-32B 在 OSWorld-Verified 上取得了平均 34.8% 的成功率，在开源模型中创下了新的最先进水平（SOTA），并超越了 OpenAI CUA（GPT-4o）。进一步分析确认我们的方法在不同领域具有良好的泛化能力，并且在增加测试时计算量时获益显著。我们发布了标注工具、数据集、代码和模型，以为进一步的 CUA 研究建立开放基础。

Subjects: Artificial Intelligence, Computer Vision and Pattern Recognition 主题：人工智能，计算机视觉与模式识别

Publish: 2025-08-12 17:52:32 UTC 发布：2025-08-12 17:52:32 UTC

#3 SMA: Who Said That? Auditing Membership Leakage in Semi-Black-box RAG Controlling #3 SMA：谁说的？审计半黑盒 RAG 控制中的成员泄露

Authors: [Shixuan Sun](https://arxiv.org/search/?searchtype=author&query=Shixuan Sun), [Siyuan Liang](https://arxiv.org/search/?searchtype=author&query=Siyuan Liang), [Ruoyu Chen](https://arxiv.org/search/?searchtype=author&query=Ruoyu Chen), [Jianjie Huang](https://arxiv.org/search/?searchtype=author&query=Jianjie Huang), [Jingzhi Li](https://arxiv.org/search/?searchtype=author&query=Jingzhi Li), [Xiaochun Cao](https://arxiv.org/search/?searchtype=author&query=Xiaochun Cao) 作者：孙诗轩、梁思远、陈若瑜、黄建杰、李敬志、曹晓春

Retrieval-Augmented Generation (RAG) and its Multimodal Retrieval-Augmented Generation (MRAG) significantly improve the knowledge coverage and contextual understanding of Large Language Models (LLMs) by introducing external knowledge sources. However, retrieval and multimodal fusion obscure content provenance, rendering existing membership inference methods unable to reliably attribute generated outputs to pre-training, external retrieval, or user input, thus undermining privacy leakage accountability To address these challenges, we propose the first Source-aware Membership Audit (SMA) that enables fine-grained source attribution of generated content in a semi-black-box setting with retrieval control capabilities.To address the environmental constraints of semi-black-box auditing, we further design an attribution estimation mechanism based on zero-order optimization, which robustly approximates the true influence of input tokens on the output through large-scale perturbation sampling and ridge regression modeling. In addition, SMA introduces a cross-modal attribution technique that projects image inputs into textual descriptions via MLLMs, enabling token-level attribution in the text modality, which for the first time facilitates membership inference on image retrieval traces in MRAG systems. This work shifts the focus of membership inference from ‘whether the data has been memorized’ to ‘where the content is sourced from’, offering a novel perspective for auditing data provenance in complex generative systems. 检索增强生成（RAG）及其多模态检索增强生成（MRAG）通过引入外部知识源显著提升了大型语言模型（LLMs）的知识覆盖与上下文理解能力。然而，检索与多模态融合模糊了内容来源，使现有的成员资格推断方法无法可靠地将生成输出归因于预训练、外部检索或用户输入，从而削弱了隐私泄露的可追责性。为应对这些挑战，我们提出了首个源感知成员审计（SMA），在具备检索控制能力的半黑箱设置中实现生成内容的细粒度来源归因。为应对半黑箱审计的环境约束，我们进一步设计了一种基于零阶优化的归因估计机制，该机制通过大规模扰动采样与岭回归建模，稳健地逼近输入标记对输出的真实影响。此外，SMA 引入了一种跨模态归因技术，通过 MLLMs 将图像输入投射为文本描述，从而在文本模态中实现逐标记归因，这首次使得在 MRAG 系统的图像检索痕迹上进行成员推断成为可能。该工作将成员推断的关注点从“数据是否被记忆”转移到“内容源自何处”，为在复杂生成系统中审计数据来源提供了新的视角。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-12 17:32:24 UTC 发布：2025-08-12 17:32:24 UTC

#4 CVCM Track Circuits Pre-emptive Failure Diagnostics for Predictive Maintenance Using Deep Neural Networks #4 CVCM Track Circuits 预防性故障诊断用于预测性维护的深度神经网络

Track circuits are critical for railway operations, acting as the main signalling sub-system to locate trains. Continuous Variable Current Modulation (CVCM) is one such technology. Like any field-deployed, safety-critical asset, it can fail, triggering cascading disruptions. Many failures originate as subtle anomalies that evolve over time, often not visually apparent in monitored signals. Conventional approaches, which rely on clear signal changes, struggle to detect them early. Early identification of failure types is essential to improve maintenance planning, minimising downtime and revenue loss. Leveraging deep neural networks, we propose a predictive maintenance framework that classifies anomalies well before they escalate into failures. Validated on 10 CVCM failure cases across different installations, the method is ISO-17359 compliant and outperforms conventional techniques, achieving 99.31% overall accuracy with detection within 1% of anomaly onset. Through conformal prediction, we provide uncertainty estimates, reaching 99% confidence with consistent coverage across classes. Given CVCMs global deployment, the approach is scalable and adaptable to other track circuits and railway systems, enhancing operational reliability. 轨道电路对铁路运营至关重要，作为定位列车的主要信号子系统。连续可变电流调制（CVCM）就是此类技术之一。像任何现场部署的安全关键设备一样，它可能发生故障，引发连锁中断。许多故障起始于随时间演化的微妙异常，这些异常在监测信号中往往不易肉眼察觉。依赖明显信号变化的传统方法难以实现早期检测。提前识别故障类型对改进维护计划至关重要，可将停机时间和收入损失降到最低。我们利用深度神经网络，提出了一个在故障升级之前就对异常进行分类的预测性维护框架。在来自不同安装环境的 10 个 CVCM 故障案例上验证，该方法符合 ISO-17359 标准并优于传统技术，整体准确率达 99.31%，检测时间在异常发生后 1%以内。通过保形预测，我们提供了不确定性估计，达到 99%的置信度并在各类中保持一致的覆盖率。鉴于 CVCM 的全球部署，该方法具有可扩展性并可适应其他轨道电路和铁路系统，从而提高运营可靠性。

Subjects: Artificial Intelligence, Machine Learning 主题：人工智能、机器学习

Publish: 2025-08-12 16:13:51 UTC 发布：2025-08-12 16:13:51 UTC

#5 A First Look at Predictability and Explainability of Pre-request Passenger Waiting Time in Ridesharing Systems #5 乘车共享系统中预请求乘客等待时间的可预测性与可解释性初探

Authors: [Jie Wang](https://arxiv.org/search/?searchtype=author&query=Jie Wang), [Guang Wang](https://arxiv.org/search/?searchtype=author&query=Guang Wang) 作者：王杰，王光

Passenger waiting time prediction plays a critical role in enhancing both ridesharing user experience and platform efficiency. While most existing research focuses on post-request waiting time prediction with knowing the matched driver information, pre-request waiting time prediction (i.e., before submitting a ride request and without matching a driver) is also important, as it enables passengers to plan their trips more effectively and enhance the experience of both passengers and drivers. However, it has not been fully studied by existing works. In this paper, we take the first step toward understanding the predictability and explainability of pre-request passenger waiting time in ridesharing systems. Particularly, we conduct an in-depth data-driven study to investigate the impact of demand&supply dynamics on passenger waiting time. Based on this analysis and feature engineering, we propose FiXGBoost, a novel feature interaction-based XGBoost model designed to predict waiting time without knowing the assigned driver information. We further perform an importance analysis to quantify the contribution of each factor. Experiments on a large-scale real-world ridesharing dataset including over 30 million trip records show that our FiXGBoost can achieve a good performance for pre-request passenger waiting time prediction with high explainability. 乘客等待时间预测在提升拼车用户体验和平台效率方面起着关键作用。尽管大多数现有研究聚焦于已知匹配司机信息的请求后等待时间预测，但请求前等待时间预测（即在提交乘车请求之前且未匹配司机时）也同样重要，因为它使乘客能够更有效地规划行程并提升乘客与司机双方的体验。然而，现有工作尚未对此进行充分研究。本文迈出了理解拼车系统中请求前乘客等待时间可预测性与可解释性的第一步。具体而言，我们进行了一项深入的数据驱动研究，以探讨供需动态对乘客等待时间的影响。基于此分析和特征工程，我们提出了 FiXGBoost，一种基于特征交互的全新 XGBoost 模型，旨在在未知分配司机信息的情况下预测等待时间。我们还进一步进行了重要性分析，以量化各因素的贡献。在包含超过 3 千万次行程记录的大规模真实网约车数据集上的实验表明，我们的 FiXGBoost 在具有高可解释性的事前乘客等待时间预测方面能够取得良好表现。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-12 15:42:14 UTC 发布：2025-08-12 15:42:14 UTC

#6 Activation Steering for Bias Mitigation: An Interpretable Approach to Safer LLMs #6 激活引导用于偏差缓解：一种可解释的更安全 LLMs 方法

Author: [Shivam Dubey](https://arxiv.org/search/?searchtype=author&query=Shivam Dubey) 作者：Shivam Dubey

As large language models (LLMs) become more integrated into societal systems, the risk of them perpetuating and amplifying harmful biases becomes a critical safety concern. Traditional methods for mitigating bias often rely on data filtering or post-hoc output moderation, which treat the model as an opaque black box. In this work, we introduce a complete, end-to-end system that uses techniques from mechanistic interpretability to both identify and actively mitigate bias directly within a model’s internal workings. Our method involves two primary stages. First, we train linear “probes” on the internal activations of a model to detect the latent representations of various biases (e.g., gender, race, age). Our experiments on \texttt{gpt2-large} demonstrate that these probes can identify biased content with near-perfect accuracy, revealing that bias representations become most salient in the model’s later layers. Second, we leverage these findings to compute “steering vectors” by contrasting the model’s activation patterns for biased and neutral statements. By adding these vectors during inference, we can actively steer the model’s generative process away from producing harmful, stereotypical, or biased content in real-time. We demonstrate the efficacy of this activation steering technique, showing that it successfully alters biased completions toward more neutral alternatives. We present our work as a robust and reproducible system that offers a more direct and interpretable approach to building safer and more accountable LLMs. 随着大型语言模型（LLMs）越来越多地融入社会系统，它们延续并放大有害偏见的风险已成为一个关键的安全问题。传统的缓解偏见方法通常依赖数据过滤或事后输出审查，这些方法将模型视为不透明的黑箱。在这项工作中，我们引入了一个完整的端到端系统，利用机械可解释性（mechanistic interpretability）的方法既识别又在模型的内部机制中主动缓解偏见。我们的方法包含两个主要阶段。首先，我们在模型的内部激活上训练线性“探针”来检测各种偏见（例如性别、种族、年龄）的潜在表征。我们在 \texttt{gpt2-large} 上的实验证明，这些探针可以以近乎完美的准确率识别出有偏内容，表明偏见表征在模型的后期层中最为显著。其次，我们利用这些发现，通过对比模型在有偏语句和中性语句的激活模式来计算“引导向量（steering vectors）”。通过在推理期间添加这些向量，我们可以在实时中主动引导模型的生成过程，避免产生有害的、刻板的或有偏见的内容。我们展示了这种激活引导技术的有效性，表明它能够成功地将带有偏见的续写改变为更中性的替代内容。我们将这项工作呈现为一个稳健且可复现的系统，提供了一种更直接且可解释的方法来构建更安全、更具问责性的 LLMs。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-12 15:34:18 UTC 发布日期：2025-08-12 15:34:18 UTC

#7 Intrinsic Memory Agents: Heterogeneous Multi-Agent LLM Systems through Structured Contextual Memory #7 内在记忆代理：通过结构化上下文记忆构建的异构多代理 LLM 系统

Authors: [Sizhe Yuen](https://arxiv.org/search/?searchtype=author&query=Sizhe Yuen), [Francisco Gomez Medina](https://arxiv.org/search/?searchtype=author&query=Francisco Gomez Medina), [Ting Su](https://arxiv.org/search/?searchtype=author&query=Ting Su), [Yali Du](https://arxiv.org/search/?searchtype=author&query=Yali Du), [Adam J. Sobey](https://arxiv.org/search/?searchtype=author&query=Adam J. Sobey) 作者：Sizhe Yuen, Francisco Gomez Medina, Ting Su, Yali Du, Adam J. Sobey

Multi-agent systems built on Large Language Models (LLMs) show exceptional promise for complex collaborative problem-solving, yet they face fundamental challenges stemming from context window limitations that impair memory consistency, role adherence, and procedural integrity. This paper introduces Intrinsic Memory Agents, a novel framework that addresses these limitations through structured agent-specific memories that evolve intrinsically with agent outputs. Specifically, our method maintains role-aligned memory templates that preserve specialized perspectives while focusing on task-relevant information. We benchmark our approach on the PDDL dataset, comparing its performance to existing state-of-the-art multi-agentic memory approaches and showing an improvement of 38.6% with the highest token efficiency. An additional evaluation is performed on a complex data pipeline design task, we demonstrate that our approach produces higher quality designs when comparing 5 metrics: scalability, reliability, usability, cost-effectiveness and documentation with additional qualitative evidence of the improvements. Our findings suggest that addressing memory limitations through structured, intrinsic approaches can improve the capabilities of multi-agent LLM systems on structured planning tasks. 基于大型语言模型（LLMs）构建的多代理系统在复杂协同问题解决方面展现出卓越的潜力，但它们面临由上下文窗口限制引发的根本性挑战，这些限制损害了记忆一致性、角色遵从性和流程完整性。本文提出了内在记忆代理（Intrinsic Memory Agents），这是一种通过结构化、随代理输出内在演化的代理专属记忆来解决这些限制的新框架。具体而言，我们的方法维护与角色对齐的记忆模板，以保留专业化视角，同时聚焦与任务相关的信息。我们在 PDDL 数据集上对该方法进行了基准测试，将其性能与现有最先进的多代理记忆方法进行了比较，显示出 38.6%的提升以及最高的令牌效率。我们还在一个复杂的数据管道设计任务上进行了额外评估，证明在可扩展性、可靠性、可用性、成本效益和文档五项指标上我们的方案产生了更高质量的设计，并提供了改进的定性证据。我们的研究表明，通过结构化的、内在的方法解决记忆限制问题，可以提升多代理 LLM 系统在结构化规划任务上的能力。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-12 15:05:00 UTC 发布：2025-08-12 15:05:00 UTC

#8 Prospect Theory Fails for LLMs: Revealing Instability of Decision-Making under Epistemic Uncertainty #8 前景理论在 LLMs 上失效：揭示在认知不确定性下决策的不稳定性

Authors: [Rui Wang](https://arxiv.org/search/?searchtype=author&query=Rui Wang), [Qihan Lin](https://arxiv.org/search/?searchtype=author&query=Qihan Lin), [Jiayu Liu](https://arxiv.org/search/?searchtype=author&query=Jiayu Liu), [Qing Zong](https://arxiv.org/search/?searchtype=author&query=Qing Zong), [Tianshi Zheng](https://arxiv.org/search/?searchtype=author&query=Tianshi Zheng), [Weiqi Wang](https://arxiv.org/search/?searchtype=author&query=Weiqi Wang), [Yangqiu Song](https://arxiv.org/search/?searchtype=author&query=Yangqiu Song) 作者：王睿、林其涵、刘家煜、宗清、郑天时、王卫岐、宋阳球

Prospect Theory (PT) models human decision-making under uncertainty, while epistemic markers (e.g., maybe) serve to express uncertainty in language. However, it remains largely unexplored whether Prospect Theory applies to contemporary Large Language Models and whether epistemic markers, which express human uncertainty, affect their decision-making behaviour. To address these research gaps, we design a three-stage experiment based on economic questionnaires. We propose a more general and precise evaluation framework to model LLMs’ decision-making behaviour under PT, introducing uncertainty through the empirical probability values associated with commonly used epistemic markers in comparable contexts. We then incorporate epistemic markers into the evaluation framework based on their corresponding probability values to examine their influence on LLM decision-making behaviours. Our findings suggest that modelling LLMs’ decision-making with PT is not consistently reliable, particularly when uncertainty is expressed in diverse linguistic forms. Our code is released in https://github.com/HKUST-KnowComp/MarPT. 前景理论（Prospect Theory，PT）模拟人在不确定情况下的决策行为，而认知标记（例如 maybe）用于在语言中表达不确定性。然而，目前仍鲜有研究探讨前景理论是否适用于当代大型语言模型（LLMs），以及表达人类不确定性的认知标记是否会影响它们的决策行为。为填补这些研究空白，我们设计了一个基于经济问卷的三阶段实验。我们提出了一个更通用且更精确的评估框架，以在 PT 下对 LLM 的决策行为建模，通过在可比语境中与常用认知标记相关联的经验概率值来引入不确定性。随后我们根据这些认知标记对应的概率值将其纳入评估框架，以检验其对 LLM 决策行为的影响。我们的研究结果表明，用 PT 对 LLM 的决策进行建模并不总是可靠的，尤其是在不确定性以多样语言形式表达时。我们的代码已发布于 https://github.com/HKUST-KnowComp/MarPT。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-12 15:02:16 UTC 发布：2025-08-12 15:02:16 UTC

#9 Safe Semantics, Unsafe Interpretations: Tackling Implicit Reasoning Safety in Large Vision-Language Models #9 安全语义，不安全解释：应对大规模视觉-语言模型中的隐式推理安全问题

Authors: [Wei Cai](https://arxiv.org/search/?searchtype=author&query=Wei Cai), [Jian Zhao](https://arxiv.org/search/?searchtype=author&query=Jian Zhao), [Yuchu Jiang](https://arxiv.org/search/?searchtype=author&query=Yuchu Jiang), [Tianle Zhang](https://arxiv.org/search/?searchtype=author&query=Tianle Zhang), [Xuelong Li](https://arxiv.org/search/?searchtype=author&query=Xuelong Li) 作者：蔡伟、赵健、江宇初、张天乐、李雪龙

Large Vision-Language Models face growing safety challenges with multimodal inputs. This paper introduces the concept of Implicit Reasoning Safety, a vulnerability in LVLMs. Benign combined inputs trigger unsafe LVLM outputs due to flawed or hidden reasoning. To showcase this, we developed Safe Semantics, Unsafe Interpretations, the first dataset for this critical issue. Our demonstrations show that even simple In-Context Learning with SSUI significantly mitigates these implicit multimodal threats, underscoring the urgent need to improve cross-modal implicit reasoning. 大规模视觉-语言模型在处理多模态输入时面临日益严峻的安全挑战。本文提出了“隐式推理安全性”这一概念，指 LVLMs 中的一种脆弱性。看似无害的组合输入由于模型存在缺陷或隐藏的推理，可能触发不安全的 LVLM 输出。为说明这一点，我们构建了《安全语义，不安全解释》（Safe Semantics, Unsafe Interpretations），这是针对该关键问题的首个数据集。我们的实验证明，即使是采用简单的上下文学习（In-Context Learning）配合 SSUI，也能显著缓解这些隐式多模态威胁，强调了提升跨模态隐式推理能力的紧迫性。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-12 13:26:06 UTC 发布时间：2025-08-12 13:26:06 UTC

#10 Compass-Thinker-7B Technical Report #10 Compass-Thinker-7B 技术报告

Recent R1-Zero-like research further demonstrates that reasoning extension has given large language models (LLMs) unprecedented reasoning capabilities, and Reinforcement Learning is the core tech- nology to elicit its complex reasoning. However, conducting RL experiments directly on hyperscale models involves high computational costs and resource demands, posing significant risks. We pro- pose the Compass-Thinker-7B model, which aims to explore the potential of Reinforcement Learn- ing with less computational resources and costs, and provides insights for further research into RL recipes for larger models. Compass-Thinker-7B is trained from an open source model through a spe- cially designed Reinforcement Learning Pipeline. we curate a dataset of 30k verifiable mathematics problems for the Reinforcement Learning Pipeline. By configuring data and training settings with dif- ferent difficulty distributions for different stages, the potential of the model is gradually released and the training efficiency is improved. Extensive evaluations show that Compass-Thinker-7B possesses exceptional reasoning potential, and achieves superior performance on mathematics compared to the same-sized RL model.Especially in the challenging AIME2024 evaluation, Compass-Thinker-7B achieves 40% accuracy. 近期类似 R1-Zero 的研究进一步证明，推理扩展赋予了大型语言模型 (LLMs) 前所未有的推理能力，而强化学习是激发其复杂推理的核心技术。然而，直接在超大规模模型上进行强化学习实验涉及高计算成本和资源需求，带来重大风险。我们提出了 Compass-Thinker-7B 模型，旨在以更少的计算资源和成本探索强化学习的潜力，并为更大模型的强化学习方案研究提供见解。Compass-Thinker-7B 通过一个专门设计的强化学习流程从开源模型进行训练。我们为该强化学习流程整理了一个包含 3 万道可验证数学题的数据集。通过在不同阶段为数据和训练设置配置不同难度分布，模型的潜力得以逐步释放，训练效率也得以提升。大量评估表明，Compass-Thinker-7B 拥有出色的推理潜力，并且在数学方面的表现优于同规模的 RL 模型。尤其是在具有挑战性的 AIME2024 评估中，Compass-Thinker-7B 达到了 40% 的准确率。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-12 12:58:12 UTC 发布时间：2025-08-12 12:58:12 UTC

#11 Reducing Cognitive Load in Multi-Agent Reinforcement Learning for Mathematical Problem Solving: Decoupling Reasoning and Code Generation #11 在用于数学问题求解的多智能体强化学习中降低认知负担：将推理与代码生成解耦

Authors: [Dayu Wang](https://arxiv.org/search/?searchtype=author&query=Dayu Wang), [Jiaye Yang](https://arxiv.org/search/?searchtype=author&query=Jiaye Yang), [Weikang Li](https://arxiv.org/search/?searchtype=author&query=Weikang Li), [Jiahui Liang](https://arxiv.org/search/?searchtype=author&query=Jiahui Liang), [Yang Li](https://arxiv.org/search/?searchtype=author&query=Yang Li) 作者：王达宇、杨嘉烨、李伟康、梁佳慧、李洋

Current tool-integrated mathematical reasoning systems often adopt a single-agent paradigm, where one large language model handles problem reasoning, code generation, and code execution in an integrated workflow. While this design eases coordination, we hypothesize that it imposes cognitive load interference, as the agent must interleave long-horizon reasoning with precise program synthesis. We validate this hypothesis through a controlled comparison between a reasoning-only agent and a reasoning-plus-code agent, finding that the latter produces significantly fewer correct reasoning paths despite having tool-calling capabilities. To address this, we propose a dual-agent hybrid framework: a Reasoning Agent performs stepwise problem decomposition, and a Code Agent handles code generation and execution. Training combines imitation learning and reinforcement learning: the Code Agent receives strong rewards for matching intermediate ground-truth programs and weaker rewards for valid execution, while the Reasoning Agent is optimized chiefly via final-answer accuracy using advantage estimation to credit intermediate steps. This decoupled role design reduces cognitive interference and promotes stable reasoning-coding coordination. 当前集成工具的数学推理系统通常采用单代理范式，在这种范式中，一个大型语言模型在一个集成工作流中同时负责问题推理、代码生成和代码执行。虽然这种设计便于协调，但我们假设它会引入认知负荷干扰，因为代理必须在进行长时程推理的同时穿插精确的程序合成。我们通过在仅推理代理与推理加代码代理之间进行受控比较验证了这一假设，发现后者尽管具备调用工具的能力，却产生显著更少的正确推理路径。为了解决这一问题，我们提出了一个双代理混合框架：推理代理执行逐步的问题分解，代码代理负责代码生成和执行。训练结合了模仿学习与强化学习：代码代理在匹配中间真实程序时获得较强的奖励，在有效执行时获得较弱的奖励，而推理代理主要通过最终答案的准确性进行优化，使用优势估计对中间步骤进行计酬。这种解耦的角色设计减少了认知干扰，并促进了推理与编码之间的稳定协调。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-12 12:10:53 UTC 发布：2025-08-12 12:10:53 UTC

#12 Silicon Minds versus Human Hearts: The Wisdom of Crowds Beats the Wisdom of AI in Emotion Recognition #12 硅之心智对抗人之情感：群体智慧在情感识别上胜过人工智能的智慧

Authors: [Mustafa Akben](https://arxiv.org/search/?searchtype=author&query=Mustafa Akben), [Vinayaka Gude](https://arxiv.org/search/?searchtype=author&query=Vinayaka Gude), [Haya Ajjan](https://arxiv.org/search/?searchtype=author&query=Haya Ajjan) 作者：Mustafa Akben、Vinayaka Gude、Haya Ajjan

The ability to discern subtle emotional cues is fundamental to human social intelligence. As artificial intelligence (AI) becomes increasingly common, AI’s ability to recognize and respond to human emotions is crucial for effective human-AI interactions. In particular, whether such systems can match or surpass human experts remains to be seen. However, the emotional intelligence of AI, particularly multimodal large language models (MLLMs), remains largely unexplored. This study evaluates the emotion recognition abilities of MLLMs using the Reading the Mind in the Eyes Test (RMET) and its multiracial counterpart (MRMET), and compares their performance against human participants. Results show that, on average, MLLMs outperform humans in accurately identifying emotions across both tests. This trend persists even when comparing performance across low, medium, and expert-level performing groups. Yet when we aggregate independent human decisions to simulate collective intelligence, human groups significantly surpass the performance of aggregated MLLM predictions, highlighting the wisdom of the crowd. Moreover, a collaborative approach (augmented intelligence) that combines human and MLLM predictions achieves greater accuracy than either humans or MLLMs alone. These results suggest that while MLLMs exhibit strong emotion recognition at the individual level, the collective intelligence of humans and the synergistic potential of human-AI collaboration offer the most promising path toward effective emotional AI. We discuss the implications of these findings for the development of emotionally intelligent AI systems and future research directions. 辨别微妙情绪线索的能力是人类社会智能的基础。随着人工智能（AI）日益普及，AI 识别并响应人类情绪的能力对于有效的人机交互至关重要。特别是，这类系统能否达到或超越人类专家的水平仍有待观察。然而，AI 的情绪智能，尤其是多模态大型语言模型（MLLMs），在很大程度上尚未被探索。本研究使用“眼睛中的心灵阅读测试”（Reading the Mind in the Eyes Test, RMET）及其多种族对应版本（MRMET）评估 MLLMs 的情绪识别能力，并将其表现与人类参与者进行比较。结果表明，平均而言，MLLMs 在两项测试中识别情绪的准确性均优于人类。即便在对低、中、专家级别表现组别进行比较时，这一趋势仍然存在。然而，当我们聚合独立的人类决策以模拟集体智慧时，人类群体的表现明显超过了聚合的 MLLM 预测，突显了群体智慧的价值。此外，一种将人类与多语言大型模型（MLLM）预测结合的协作方法（增强智能）比单独由人类或 MLLM 进行判断具有更高的准确性。这些结果表明，尽管 MLLM 在个体层面上表现出强大的情感识别能力，但人类的集体智慧以及人机协作的协同潜力为实现有效的情感人工智能提供了最有前景的路径。我们讨论了这些发现对情感智能 AI 系统开发的影响以及未来的研究方向。

Subjects: Artificial Intelligence, Computer Vision and Pattern Recognition, Computers and Society 主题：人工智能，计算机视觉与模式识别，计算机与社会

Publish: 2025-08-12 10:37:37 UTC 发布：2025-08-12 10:37:37 UTC

#13 Efficient Agent: Optimizing Planning Capability for Multimodal Retrieval Augmented Generation #13 高效代理：为多模态检索增强生成优化规划能力

Authors: [Yuechen Wang](https://arxiv.org/search/?searchtype=author&query=Yuechen Wang), [Yuming Qiao](https://arxiv.org/search/?searchtype=author&query=Yuming Qiao), [Dan Meng](https://arxiv.org/search/?searchtype=author&query=Dan Meng), [Jun Yang](https://arxiv.org/search/?searchtype=author&query=Jun Yang), [Haonan Lu](https://arxiv.org/search/?searchtype=author&query=Haonan Lu), [Zhenyu Yang](https://arxiv.org/search/?searchtype=author&query=Zhenyu Yang), [Xudong Zhang](https://arxiv.org/search/?searchtype=author&query=Xudong Zhang) 作者：王悦辰、乔宇明、孟丹、杨俊、芦浩南、杨振宇、张旭东

Multimodal Retrieval-Augmented Generation (mRAG) has emerged as a promising solution to address the temporal limitations of Multimodal Large Language Models (MLLMs) in real-world scenarios like news analysis and trending topics. However, existing approaches often suffer from rigid retrieval strategies and under-utilization of visual information. To bridge this gap, we propose E-Agent, an agent framework featuring two key innovations: a mRAG planner trained to dynamically orchestrate multimodal tools based on contextual reasoning, and a task executor employing tool-aware execution sequencing to implement optimized mRAG workflows. E-Agent adopts a one-time mRAG planning strategy that enables efficient information retrieval while minimizing redundant tool invocations. To rigorously assess the planning capabilities of mRAG systems, we introduce the Real-World mRAG Planning (RemPlan) benchmark. This novel benchmark contains both retrieval-dependent and retrieval-independent question types, systematically annotated with essential retrieval tools required for each instance. The benchmark’s explicit mRAG planning annotations and diverse question design enhance its practical relevance by simulating real-world scenarios requiring dynamic mRAG decisions. Experiments across RemPlan and three established benchmarks demonstrate E-Agent’s superiority: 13% accuracy gain over state-of-the-art mRAG methods while reducing redundant searches by 37%. 多模态检索增强生成（mRAG）已成为解决多模态大语言模型（MLLM）在新闻分析和热点话题等真实场景中时间限制问题的有希望的方案。然而，现有方法常常受到僵化检索策略和视觉信息未充分利用的困扰。为弥补这一差距，我们提出了 E-Agent，这是一种具备两项关键创新的代理框架：一个经过训练的 mRAG 规划器，能够基于上下文推理动态协调多模态工具；以及一个任务执行器，采用感知工具的执行序列来实现优化的 mRAG 工作流。E-Agent 采用一次性 mRAG 规划策略，从而在实现高效信息检索的同时将冗余工具调用降到最低。为严格评估 mRAG 系统的规划能力，我们引入了真实世界 mRAG 规划（RemPlan）基准。该新基准包含依赖检索和不依赖检索的两类问题类型，并为每个实例系统地标注了所需的关键检索工具。该基准的显式 mRAG 规划注释和多样化的问题设计通过模拟需要动态 mRAG 决策的真实场景，增强了其实用相关性。跨 RemPlan 及三项既有基准的实验表明 E-Agent 的优越性：相比最先进的 mRAG 方法准确率提高了 13%，同时将冗余搜索减少了 37%。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-12 10:17:12 UTC 发布：2025-08-12 10:17:12 UTC

#14 GRainsaCK: a Comprehensive Software Library for Benchmarking Explanations of Link Prediction Tasks on Knowledge Graphs #14 GRainsaCK：一个用于对知识图谱上链路预测任务的解释进行基准测试的综合软件库

Authors: [Roberto Barile](https://arxiv.org/search/?searchtype=author&query=Roberto Barile), [Claudia d’Amato](https://arxiv.org/search/?searchtype=author&query=Claudia d’Amato), [Nicola Fanizzi](https://arxiv.org/search/?searchtype=author&query=Nicola Fanizzi) 作者：Roberto Barile、Claudia d’Amato、Nicola Fanizzi

Since Knowledge Graphs are often incomplete, link prediction methods are adopted for predicting missing facts. Scalable embedding based solutions are mostly adopted for this purpose, however, they lack comprehensibility, which may be crucial in several domains. Explanation methods tackle this issue by identifying supporting knowledge explaining the predicted facts. Regretfully, evaluating/comparing quantitatively the resulting explanations is challenging as there is no standard evaluation protocol and overall benchmarking resource. We fill this important gap by proposing GRainsaCK, a reusable software resource that fully streamlines all the tasks involved in benchmarking explanations, i.e., from model training to evaluation of explanations along the same evaluation protocol. Moreover, GRainsaCK furthers modularity/extensibility by implementing the main components as functions that can be easily replaced. Finally, fostering its reuse, we provide extensive documentation including a tutorial. 由于知识图谱通常不完整，因此采用链路预测方法来预测缺失的事实。为此通常采用可扩展的基于嵌入的解决方案，然而它们缺乏可理解性，而可理解性在若干领域可能至关重要。解释方法通过识别支持性知识来解释被预测的事实，从而解决了这一问题。遗憾的是，对生成的解释进行定量评估/比较具有挑战性，因为目前没有统一的评估协议和总体基准资源。我们通过提出 GRainsaCK 弥补了这一重要空白，GRainsaCK 是一个可重用的软件资源，全面简化了基准测试解释所涉及的所有任务——即从模型训练到在同一评估协议下对解释的评估。此外，GRainsaCK 通过将主要组件实现为可轻松替换的函数，进一步促进了模块化/可扩展性。最后，为了促进其重用，我们提供了包括教程在内的详尽文档。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-12 10:15:58 UTC 发布：2025-08-12 10:15:58 UTC

#15 A Dual-Axis Taxonomy of Knowledge Editing for LLMs: From Mechanisms to Functions #15 双轴知识编辑分类法用于 LLMs：从机制到功能

Large language models (LLMs) acquire vast knowledge from large text corpora, but this information can become outdated or inaccurate. Since retraining is computationally expensive, knowledge editing offers an efficient alternative – modifying internal knowledge without full retraining. These methods aim to update facts precisely while preserving the model’s overall capabilities. While existing surveys focus on the mechanism of editing (e.g., parameter changes vs. external memory), they often overlook the function of the knowledge being edited. This survey introduces a novel, complementary function-based taxonomy to provide a more holistic view. We examine how different mechanisms apply to various knowledge types – factual, temporal, conceptual, commonsense, and social – highlighting how editing effectiveness depends on the nature of the target knowledge. By organizing our review along these two axes, we map the current landscape, outline the strengths and limitations of existing methods, define the problem formally, survey evaluation tasks and datasets, and conclude with open challenges and future directions. 大型语言模型 (LLMs) 从大规模文本语料中获取了大量知识，但这些信息可能变得过时或不准确。由于重新训练代价高昂，知识编辑提供了一种高效的替代方法——在不进行全面再训练的情况下修改内部知识。这些方法旨在精确更新事实的同时保持模型的整体能力。现有综述虽然关注编辑机制（例如参数更改与外部记忆），但常常忽视被编辑知识的功能。本综述引入了一种新颖且互补的基于功能的分类法，以提供更全面的视角。我们考察了不同机制如何应用于各类知识——事实性、时间性、概念性、常识性和社会性——强调了编辑效果如何依赖于目标知识的性质。通过沿这两条轴线组织我们的回顾，我们描绘了当前格局，概述了现有方法的优劣，形式化地定义了问题，综述了评估任务和数据集，并以开放挑战与未来方向作结。

Subjects: Artificial Intelligence, Computation and Language 主题：人工智能，计算与语言

Publish: 2025-08-12 09:51:39 UTC

#16 Designing Memory-Augmented AR Agents for Spatiotemporal Reasoning in Personalized Task Assistance #16 为个性化任务辅助设计具备记忆增强的增强现实代理以进行时空推理 [PDF ] [Copy] [Kimi ] [REL]

Authors: [Dongwook Choi](https://arxiv.org/search/?searchtype=author&query=Dongwook Choi), [Taeyoon Kwon](https://arxiv.org/search/?searchtype=author&query=Taeyoon Kwon), [Dongil Yang](https://arxiv.org/search/?searchtype=author&query=Dongil Yang), [Hyojun Kim](https://arxiv.org/search/?searchtype=author&query=Hyojun Kim), [Jinyoung Yeo](https://arxiv.org/search/?searchtype=author&query=Jinyoung Yeo) 作者：Dongwook Choi, Taeyoon Kwon, Dongil Yang, Hyojun Kim, Jinyoung Yeo

Augmented Reality (AR) systems are increasingly integrating foundation models, such as Multimodal Large Language Models (MLLMs), to provide more context-aware and adaptive user experiences. This integration has led to the development of AR agents to support intelligent, goal-directed interactions in real-world environments. While current AR agents effectively support immediate tasks, they struggle with complex multi-step scenarios that require understanding and leveraging user’s long-term experiences and preferences. This limitation stems from their inability to capture, retain, and reason over historical user interactions in spatiotemporal contexts. To address these challenges, we propose a conceptual framework for memory-augmented AR agents that can provide personalized task assistance by learning from and adapting to user-specific experiences over time. Our framework consists of four interconnected modules: (1) Perception Module for multimodal sensor processing, (2) Memory Module for persistent spatiotemporal experience storage, (3) Spatiotemporal Reasoning Module for synthesizing past and present contexts, and (4) Actuator Module for effective AR communication. We further present an implementation roadmap, a future evaluation strategy, a potential target application and use cases to demonstrate the practical applicability of our framework across diverse domains. We aim for this work to motivate future research toward developing more intelligent AR systems that can effectively bridge user’s interaction history with adaptive, context-aware task assistance. 增强现实（AR）系统正日益整合基础模型，如多模态大语言模型（MLLMs），以提供更具情境感知和自适应性的用户体验。这种整合催生了用于在现实环境中支持智能、目标导向交互的 AR 代理。尽管当前的 AR 代理能有效支持即时任务，但它们在需要理解和利用用户长期经验与偏好的复杂多步骤场景中表现欠佳。这一局限源于它们无法在时空语境中捕捉、保留并对历史用户交互进行推理。为了解决这些挑战，我们提出了一个面向记忆增强型 AR 代理的概念框架，该框架能够通过从用户特定的经验中学习并随时间适应，提供个性化的任务辅助。我们的框架由四个相互关联的模块组成： (1) 用于多模态传感器处理的感知模块，(2) 用于持久时空经验存储的记忆模块，(3) 用于综合过去与当前语境的时空推理模块，和 (4) 用于有效 AR 交互的执行模块。我们还提出了一个实现路线图、未来的评估策略、一个潜在的目标应用以及若干用例，以展示我们的框架在不同领域的实际适用性。我们希望这项工作能激发未来的研究，推动开发更智能的增强现实系统，使其能够有效地将用户的交互历史与自适应、情境感知的任务辅助相结合。

Subjects: Artificial Intelligence, Computation and Language 主题：人工智能，计算与语言

Publish: 2025-08-12 09:20:20 UTC 发布：2025-08-12 09:20:20 UTC

Authors: [Yuwei Yan](https://arxiv.org/search/?searchtype=author&query=Yuwei Yan), [Jinghua Piao](https://arxiv.org/search/?searchtype=author&query=Jinghua Piao), [Xiaochong Lan](https://arxiv.org/search/?searchtype=author&query=Xiaochong Lan), [Chenyang Shao](https://arxiv.org/search/?searchtype=author&query=Chenyang Shao), [Pan Hui](https://arxiv.org/search/?searchtype=author&query=Pan Hui), [Yong Li](https://arxiv.org/search/?searchtype=author&query=Yong Li) 作者：闫昱炜、朴静华、兰晓冲、邵晨阳、许攀、李勇

Recent advances in large language models have demonstrated strong reasoning and role-playing capabilities, opening new opportunities for agent-based social simulations. However, most existing agents’ implementations are scenario-tailored, without a unified framework to guide the design. This lack of a general social agent limits their ability to generalize across different social contexts and to produce consistent, realistic behaviors. To address this challenge, we propose a theory-informed framework that provides a systematic design process for LLM-based social agents. Our framework is grounded in principles from Social Cognition Theory and introduces three key modules: motivation, action planning, and learning. These modules jointly enable agents to reason about their goals, plan coherent actions, and adapt their behavior over time, leading to more flexible and contextually appropriate responses. Comprehensive experiments demonstrate that our theory-driven agents reproduce realistic human behavior patterns under complex conditions, achieving up to 75% lower deviation from real-world behavioral data across multiple fidelity metrics compared to classical generative baselines. Ablation studies further show that removing motivation, planning, or learning modules increases errors by 1.5 to 3.2 times, confirming their distinct and essential contributions to generating realistic and coherent social behaviors. 近年来大型语言模型在推理和角色扮演方面取得了显著进展，为基于智能体的社会模拟开辟了新机遇。然而，大多数现有智能体的实现都是为特定场景量身定制的，缺乏统一的框架来指导设计。这种缺乏通用社会智能体的状况限制了它们在不同社会情境中的泛化能力以及产生一致且现实行为的能力。为了解决这一挑战，我们提出了一个以理论为依据的框架，为基于 LLM 的社会智能体提供系统化的设计流程。我们的框架建立在社会认知理论的原则之上，并引入了三个关键模块：动机、行动规划和学习。这些模块共同使智能体能够就其目标进行推理、规划连贯的行动并随着时间调整其行为，从而生成更灵活且符合情境的响应。全面的实验表明，我们的理论驱动智能体在复杂条件下再现了现实的人类行为模式，在多个保真度指标上与传统生成式基线相比，行为数据偏差最多降低了 75%。消融研究进一步表明，去除动机、规划或学习模块会使错误增加 1.5 到 3.2 倍，确认了它们在生成真实且连贯的社交行为方面各自独特且不可或缺的贡献。

Subjects: Artificial Intelligence, Computers and Society 学科：人工智能，计算机与社会

Publish: 2025-08-12 08:14:48 UTC 发布：2025-08-12 08:14:48 UTC

#18 STELAR-VISION: Self-Topology-Aware Efficient Learning for Aligned Reasoning in Vision #18 STELAR-VISION：面向视觉中对齐推理的自拓扑感知高效学习

Authors: [Chen Li](https://arxiv.org/search/?searchtype=author&query=Chen Li), [Han Zhang](https://arxiv.org/search/?searchtype=author&query=Han Zhang), [Zhantao Yang](https://arxiv.org/search/?searchtype=author&query=Zhantao Yang), [Fangyi Chen](https://arxiv.org/search/?searchtype=author&query=Fangyi Chen), [Zihan Wang](https://arxiv.org/search/?searchtype=author&query=Zihan Wang), [Anudeepsekhar Bolimera](https://arxiv.org/search/?searchtype=author&query=Anudeepsekhar Bolimera), [Marios Savvides](https://arxiv.org/search/?searchtype=author&query=Marios Savvides) 作者：Chen Li、Han Zhang、Zhantao Yang、Fangyi Chen、Zihan Wang、Anudeepsekhar Bolimera、Marios Savvides

Vision-language models (VLMs) have made significant strides in reasoning, yet they often struggle with complex multimodal tasks and tend to generate overly verbose outputs. A key limitation is their reliance on chain-of-thought (CoT) reasoning, despite many tasks benefiting from alternative topologies like trees or graphs. To address this, we introduce STELAR-Vision, a training framework for topology-aware reasoning. At its core is TopoAug, a synthetic data pipeline that enriches training with diverse topological structures. Using supervised fine-tuning and reinforcement learning, we post-train Qwen2VL models with both accuracy and efficiency in mind. Additionally, we propose Frugal Learning, which reduces output length with minimal accuracy loss. On MATH-V and VLM-S2H, STELAR-Vision improves accuracy by 9.7% over its base model and surpasses the larger Qwen2VL-72B-Instruct by 7.3%. On five out-of-distribution benchmarks, it outperforms Phi-4-Multimodal-Instruct by up to 28.4% and LLaMA-3.2-11B-Vision-Instruct by up to 13.2%, demonstrating strong generalization. Compared to Chain-Only training, our approach achieves 4.3% higher overall accuracy on in-distribution datasets and consistently outperforms across all OOD benchmarks. We have released datasets, and code will be available. 视觉-语言模型（VLMs）在推理方面取得了显著进展，但在复杂的多模态任务上仍常常力不从心，并且倾向于生成过于冗长的输出。一个关键限制是它们依赖链式思维（CoT）推理，尽管许多任务更适合使用树状或图状等替代拓扑结构。为了解决这一问题，我们提出了 STELAR-Vision，一种面向拓扑感知推理的训练框架。其核心是 TopoAug，一个通过多样化拓扑结构来丰富训练的合成数据管道。我们通过监督微调和强化学习，对 Qwen2VL 模型进行了后训练，兼顾准确性和效率。此外，我们提出了 Frugal Learning，在尽量不损失准确率的情况下减少输出长度。在 MATH-V 和 VLM-S2H 上，STELAR-Vision 相较于其基础模型提升了 9.7% 的准确率，并且超过了更大的 Qwen2VL-72B-Instruct 达到 7.3%。在五个分布外基准上，它比 Phi-4-Multimodal-Instruct 至多高出 28.4%，比 LLaMA-3.2-11B-Vision-Instruct 至多高出 13.2%，展现出强大的泛化能力。与仅链式训练相比，我们的方法在同分布数据集上的总体准确率提高了 4.3%，并且在所有 OOD 基准上均表现更佳。我们已发布数据集，代码将很快可用。

Subjects: Artificial Intelligence, Computer Vision and Pattern Recognition 主题：人工智能，计算机视觉与模式识别

Publish: 2025-08-12 07:27:50 UTC 发布时间：2025-08-12 07:27:50 协调世界时

#19 Aryabhata: An exam-focused language model for JEE Math #19 Aryabhata：面向 JEE 数学考试的语言模型

Authors: [Ritvik Rastogi](https://arxiv.org/search/?searchtype=author&query=Ritvik Rastogi), [Sachin Dharashivkar](https://arxiv.org/search/?searchtype=author&query=Sachin Dharashivkar), [Sandeep Varma](https://arxiv.org/search/?searchtype=author&query=Sandeep Varma) 作者：Ritvik Rastogi、Sachin Dharashivkar、Sandeep Varma

We present Aryabhata 1.0, a compact 7B parameter math reasoning model optimized for the Indian academic exam, the Joint Entrance Examination (JEE). Despite rapid progress in large language models (LLMs), current models often remain unsuitable for educational use. Aryabhata 1.0 is built by merging strong open-weight reasoning models, followed by supervised fine-tuning (SFT) with curriculum learning on verified chain-of-thought (CoT) traces curated through best-of-n rejection sampling. To further boost performance, we apply reinforcement learning with verifiable rewards (RLVR) using A2C objective with group-relative advantage estimation alongwith novel exploration strategies such as Adaptive Group Resizing and Temperature Scaling. Evaluated on both in-distribution (JEE Main 2025) and out-of-distribution (MATH, GSM8K) benchmarks, Aryabhata outperforms existing models in accuracy and efficiency, while offering pedagogically useful step-by-step reasoning. We release Aryabhata as a foundation model to advance exam-centric, open-source small language models. This marks our first open release for community feedback (Aryabhata 1.0 on Hugging Face); PW is actively training future models to further improve learning outcomes for students. 我们推出了 Aryabhata 1.0 ，这是一个紧凑的 7B 参数数学推理模型，针对印度学术考试联合入学考试（JEE）进行了优化。尽管大型语言模型（LLMs）发展迅速，现有模型仍常常不适合教育用途。Aryabhata 1.0 是通过合并强大的开源权重推理模型构建的，随后在经由最佳 n 拒绝采样筛选出的已验证思路链（CoT）轨迹上，采用课程学习进行监督微调（SFT）。为了进一步提升性能，我们应用了具有可验证奖励的强化学习（RLVR），使用带有群体相对优势估计的 A2C 目标，并结合诸如 Adaptive Group Resizing 和 Temperature Scaling 等新颖探索策略。在内部分布（JEE Main 2025）和外部分布（MATH、GSM8K）基准测试中评估，Aryabhata 在准确性和效率上均优于现有模型，同时提供了对教学有用的逐步推理。我们将 Aryabhata 作为基础模型发布，以推动以考试为中心的开源小型语言模型的发展。这是我们首次对外开放发布以征求社区反馈（ Aryabhata 1.0 on Hugging Face ）；PW 正在积极训练未来的模型，以进一步改善学生的学习成果。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-12 06:20:07 UTC 发布：2025-08-12 06:20:07 UTC

#20 Hybrid Node-Destroyer Model with Large Neighborhood Search for Solving the Capacitated Vehicle Routing Problem #20 混合节点破坏器模型与大邻域搜索相结合用于求解带容量约束的车辆路径问题

Authors: [Bachtiar Herdianto](https://arxiv.org/search/?searchtype=author&query=Bachtiar Herdianto), [Romain Billot](https://arxiv.org/search/?searchtype=author&query=Romain Billot), [Flavien Lucas](https://arxiv.org/search/?searchtype=author&query=Flavien Lucas), [Marc Sevaux](https://arxiv.org/search/?searchtype=author&query=Marc Sevaux), [Daniele Vigo](https://arxiv.org/search/?searchtype=author&query=Daniele Vigo) 作者：Bachtiar Herdianto、Romain Billot、Flavien Lucas、Marc Sevaux、Daniele Vigo

In this research, we propose an iterative learning hybrid optimization solver developed to strengthen the performance of metaheuristic algorithms in solving the Capacitated Vehicle Routing Problem (CVRP). The iterative hybrid mechanism integrates the proposed Node-Destroyer Model, a machine learning hybrid model that utilized Graph Neural Networks (GNNs) such identifies and selects customer nodes to guide the Large Neighborhood Search (LNS) operator within the metaheuristic optimization frameworks. This model leverages the structural properties of the problem and solution that can be represented as a graph, to guide strategic selections concerning node removal. The proposed approach reduces operational complexity and scales down the search space involved in the optimization process. The hybrid approach is applied specifically to the CVRP and does not require retraining across problem instances of different sizes. The proposed hybrid mechanism is able to improve the performance of baseline metaheuristic algorithms. Our approach not only enhances the solution quality for standard CVRP benchmarks but also proves scalability on very large-scale instances with up to 30,000 customer nodes. Experimental evaluations on benchmark datasets show that the proposed hybrid mechanism is capable of improving different baseline algorithms, achieving better quality of solutions under similar settings. 在本研究中，我们提出了一种迭代学习混合优化求解器，用于加强元启发式算法在求解带容量约束的车辆路径问题（CVRP）方面的性能。该迭代混合机制集成了所提出的节点破坏器模型（Node-Destroyer Model），这是一种利用图神经网络（GNN）的机器学习混合模型，能够识别并选择客户节点，以在元启发式优化框架内指导大邻域搜索（LNS）算子。该模型利用可以表示为图的题目和解的结构特性，来指导有关节点移除的策略性选择。所提出的方法降低了操作复杂性，并缩小了优化过程中涉及的搜索空间。该混合方法专门应用于 CVRP，并且在不同规模的问题实例间不需要重新训练。所提出的混合机制能够提升基线元启发式算法的性能。我们的方法不仅提升了标准 CVRP 基准测试的解质量，还在规模高达 30,000 个客户节点的超大规模实例上证明了可扩展性。对基准数据集的实验评估表明，所提出的混合机制能够改进不同的基线算法，在相似设置下取得更高质量的解。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-12 05:56:13 UTC 发布：2025-08-12 05:56:13 UTC

#21 Prompt-and-Check: Using Large Language Models to Evaluate Communication Protocol Compliance in Simulation-Based Training #21 Prompt-and-Check: 使用大型语言模型在基于仿真的训练中评估通信协议合规性的研究

Authors: [Vishakha Lall](https://arxiv.org/search/?searchtype=author&query=Vishakha Lall), [Yisi Liu](https://arxiv.org/search/?searchtype=author&query=Yisi Liu) 作者：Vishakha Lall、Yisi Liu

Accurate evaluation of procedural communication compliance is essential in simulation-based training, particularly in safety-critical domains where adherence to compliance checklists reflects operational competence. This paper explores a lightweight, deployable approach using prompt-based inference with open-source large language models (LLMs) that can run efficiently on consumer-grade GPUs. We present Prompt-and-Check, a method that uses context-rich prompts to evaluate whether each checklist item in a protocol has been fulfilled, solely based on transcribed verbal exchanges. We perform a case study in the maritime domain with participants performing an identical simulation task, and experiment with models such as LLama 2 7B, LLaMA 3 8B and Mistral 7B, running locally on an RTX 4070 GPU. For each checklist item, a prompt incorporating relevant transcript excerpts is fed into the model, which outputs a compliance judgment. We assess model outputs against expert-annotated ground truth using classification accuracy and agreement scores. Our findings demonstrate that prompting enables effective context-aware reasoning without task-specific training. This study highlights the practical utility of LLMs in augmenting debriefing, performance feedback, and automated assessment in training environments. 在基于仿真的训练中，准确评估程序性沟通的合规性至关重要，尤其是在遵守合规清单反映操作能力的安全关键领域。本文探索了一种轻量且可部署的方法，使用基于提示的推理和可在消费级 GPU 上高效运行的开源大语言模型（LLMs）。我们提出了 Prompt-and-Check 方法，利用富含上下文的提示来评估协议中每一项清单条目是否已被履行，仅基于转录的口头交流。我们在海事领域进行了案例研究，参与者执行相同的仿真任务，并在本地 RTX 4070 GPU 上实验了 LLama 2 7B、LLaMA 3 8B 和 Mistral 7B 等模型。对于每一项清单条目，将包含相关转录摘录的提示输入模型，模型输出合规性判断。我们使用分类准确率和一致性评分将模型输出与专家注释的真实标签进行评估。我们的研究结果表明，提示能够实现有效的上下文感知推理，而无需特定任务的训练。该研究强调了 LLMs 在增强训练环境中的汇报、绩效反馈和自动评估方面的实际效用。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-12 05:35:57 UTC 发布：2025-08-12 05:35:57 UTC

#22 P-CAFE: Personalized Cost-Aware Incremental Feature Selection For Electronic Health Records #22 P-CAFE：用于电子健康记录的个性化成本敏感增量特征选择

Authors: [Naama Kashani](https://arxiv.org/search/?searchtype=author&query=Naama Kashani), [Mira Cohen](https://arxiv.org/search/?searchtype=author&query=Mira Cohen), [Uri Shaham](https://arxiv.org/search/?searchtype=author&query=Uri Shaham) 作者：Naama Kashani、Mira Cohen、Uri Shaham

Electronic Health Records (EHR) have revolutionized healthcare by digitizing patient data, improving accessibility, and streamlining clinical workflows. However, extracting meaningful insights from these complex and multimodal datasets remains a significant challenge for researchers. Traditional feature selection methods often struggle with the inherent sparsity and heterogeneity of EHR data, especially when accounting for patient-specific variations and feature costs in clinical applications. To address these challenges, we propose a novel personalized, online and cost-aware feature selection framework tailored specifically for EHR datasets. The features are aquired in an online fashion for individual patients, incorporating budgetary constraints and feature variability costs. The framework is designed to effectively manage sparse and multimodal data, ensuring robust and scalable performance in diverse healthcare contexts. A primary application of our proposed method is to support physicians’ decision making in patient screening scenarios. By guiding physicians toward incremental acquisition of the most informative features within budget constraints, our approach aims to increase diagnostic confidence while optimizing resource utilization. 电子病历（EHR）通过将患者数据数字化、提高可访问性并简化临床工作流程，已经革新了医疗保健。然而，从这些复杂且多模态的数据集中提取有意义的见解仍然是研究人员面临的重要挑战。传统的特征选择方法在处理 EHR 数据固有的稀疏性和异质性时常常力不从心，尤其是在考虑患者特异性差异和临床应用中的特征成本时。为了解决这些问题，我们提出了一个新颖的、个性化的、在线且成本感知的特征选择框架，专为 EHR 数据集量身定制。各特征以在线方式为单个患者获取，结合预算约束和特征可变性成本。该框架旨在有效管理稀疏和多模态数据，确保在多样化的医疗情境中具有稳健且可扩展的性能。我们提出的方法的主要应用之一是支持医生在患者筛查场景中的决策制定。通过在预算约束下引导医生逐步获取最具信息量的特征，我们的方法旨在提高诊断信心，同时优化资源利用。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-12 05:23:46 UTC 发布：2025-08-12 05:23:46 UTC

#23 Diminution: On Reducing the Size of Grounding ASP Programs #23 缩减：关于减少定向 ASP 程序规模的方法

Authors: [HuanYu Yang](https://arxiv.org/search/?searchtype=author&query=HuanYu Yang), [Fengming Zhu](https://arxiv.org/search/?searchtype=author&query=Fengming Zhu), [YangFan Wu](https://arxiv.org/search/?searchtype=author&query=YangFan Wu), [Jianmin Ji](https://arxiv.org/search/?searchtype=author&query=Jianmin Ji) 作者：杨焕瑜，朱凤鸣，吴洋帆，季建民

Answer Set Programming (ASP) is often hindered by the grounding bottleneck: large Herbrand universes generate ground programs so large that solving becomes difficult. Many methods employ ad-hoc heuristics to improve grounding performance, motivating the need for a more formal and generalizable strategy. We introduce the notion of diminution, defined as a selected subset of the Herbrand universe used to generate a reduced ground program before solving. We give a formal definition of diminution, analyze its key properties, and study the complexity of identifying it. We use a specific encoding that enables off-the-shelf ASP solver to evaluate candidate subsets. Our approach integrates seamlessly with existing grounders via domain predicates. In extensive experiments on five benchmarks, applying diminutions selected by our strategy yields significant performance improvements, reducing grounding time by up to 70% on average and decreasing the size of grounding files by up to 85%. These results demonstrate that leveraging diminutions constitutes a robust and general-purpose approach for alleviating the grounding bottleneck in ASP. 答案集规划（ASP）常常受到基底化（grounding）瓶颈的制约：庞大的 Herbrand 宇产生的有底程序过于庞大，从而使求解变得困难。许多方法采用特设启发式来提升基底化性能，这促使人们需要一种更形式化且可泛化的策略。我们引入了“缩减”（diminution）的概念，定义为用于在求解前生成简化有底程序的 Herbrand 宇的一个选定子集。我们给出了缩减的形式定义，分析了其关键性质，并研究了识别缩减的复杂性。我们采用了一种特定的编码，使得现成的 ASP 求解器能够评估候选子集。我们的方法可通过域谓词无缝集成到现有的基底化器中。在对五个基准的大量实验证明中，应用由我们的策略选出的缩减带来了显著的性能提升，平均将基底化时间减少了最多 70%，并将基底化文件的大小最多减少了 85%。这些结果表明，利用缩减构成了缓解 ASP 基底化瓶颈的一种稳健且通用的方法。

Subjects: Artificial Intelligence, Logic in Computer Science 主题：人工智能，计算机科学中的逻辑

Publish: 2025-08-12 04:52:19 UTC 发布：2025-08-12 04:52:19 UTC

#24 AgriGPT: a Large Language Model Ecosystem for Agriculture #24 AgriGPT：面向农业的大型语言模型生态系统

Despite the rapid progress of Large Language Models (LLMs), their application in agriculture remains limited due to the lack of domain-specific models, curated datasets, and robust evaluation frameworks. To address these challenges, we propose AgriGPT, a domain-specialized LLM ecosystem for agricultural usage. At its core, we design a multi-agent scalable data engine that systematically compiles credible data sources into Agri-342K, a high-quality, standardized question-answer (QA) dataset. Trained on this dataset, AgriGPT supports a broad range of agricultural stakeholders, from practitioners to policy-makers. To enhance factual grounding, we employ Tri-RAG, a three-channel Retrieval-Augmented Generation framework combining dense retrieval, sparse retrieval, and multi-hop knowledge graph reasoning, thereby improving the LLM’s reasoning reliability. For comprehensive evaluation, we introduce AgriBench-13K, a benchmark suite comprising 13 tasks with varying types and complexities. Experiments demonstrate that AgriGPT significantly outperforms general-purpose LLMs on both domain adaptation and reasoning. Beyond the model itself, AgriGPT represents a modular and extensible LLM ecosystem for agriculture, comprising structured data construction, retrieval-enhanced generation, and domain-specific evaluation. This work provides a generalizable framework for developing scientific and industry-specialized LLMs. All models, datasets, and code will be released to empower agricultural communities, especially in underserved regions, and to promote open, impactful research. 尽管大型语言模型（LLMs）取得了快速进展，但由于缺乏领域专用模型、精心整理的数据集和健全的评估框架，它们在农业领域的应用仍然有限。为了解决这些挑战，我们提出了 AgriGPT，一个面向农业使用的领域专用 LLM 生态系统。在其核心，我们设计了一个多智能体可扩展数据引擎，将可信的数据源系统地汇编为 Agri-342K，一个高质量、标准化的问答（QA）数据集。在该数据集上训练后，AgriGPT 可支持从从业者到政策制定者等各种农业利益相关者。为了增强事实支撑，我们采用了 Tri-RAG，一种将密集检索、稀疏检索和多跳知识图谱推理相结合的三通道检索增强生成框架，从而提高了 LLM 的推理可靠性。为进行全面评估，我们引入了 AgriBench-13K，这是一个包含 13 项具有不同类型和复杂度任务的基准套件。实验表明，在领域适应和推理方面，AgriGPT 显著优于通用 LLMs。除了模型本身，AgriGPT 代表了一个面向农业的模块化且可扩展的 LLM 生态系统，包含结构化数据构建、检索增强生成和领域特定评估。该工作为开发面向科学和行业的专用 LLM 提供了一个可泛化的框架。所有模型、数据集和代码将被公开，以赋能农业社区，特别是欠服务地区，并推动开放且有影响力的研究。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-12 04:51:08 UTC 发布：2025-08-12 04:51:08 UTC

#25 UGM2N: An Unsupervised and Generalizable Mesh Movement Network via M-Uniform Loss #25 UGM2N：一种通过 M-Uniform 损失实现的无监督且可泛化的网格移动网络

Partial differential equations (PDEs) form the mathematical foundation for modeling physical systems in science and engineering, where numerical solutions demand rigorous accuracy-efficiency tradeoffs. Mesh movement techniques address this challenge by dynamically relocating mesh nodes to rapidly-varying regions, enhancing both simulation accuracy and computational efficiency. However, traditional approaches suffer from high computational complexity and geometric inflexibility, limiting their applicability, and existing supervised learning-based approaches face challenges in zero-shot generalization across diverse PDEs and mesh topologies.In this paper, we present an Unsupervised and Generalizable Mesh Movement Network (UGM2N). We first introduce unsupervised mesh adaptation through localized geometric feature learning, eliminating the dependency on pre-adapted meshes. We then develop a physics-constrained loss function, M-Uniform loss, that enforces mesh equidistribution at the nodal level.Experimental results demonstrate that the proposed network exhibits equation-agnostic generalization and geometric independence in efficient mesh adaptation. It demonstrates consistent superiority over existing methods, including robust performance across diverse PDEs and mesh geometries, scalability to multi-scale resolutions and guaranteed error reduction without mesh tangling. 偏微分方程（PDE）构成了科学与工程中物理系统建模的数学基础，而数值求解需要在精度与效率之间进行严格权衡。网格移动技术通过将网格节点动态重定位到变化迅速的区域来应对这一挑战，从而提高模拟精度和计算效率。然而，传统方法存在高计算复杂性和几何不灵活性，限制了其适用性；而现有基于监督学习的方法在面对多样的 PDE 和网格拓扑时，零样本泛化也存在困难。在本文中，我们提出了一种无监督且具有良好泛化能力的网格移动网络（UGM2N）。我们首先通过局部几何特征学习引入了无监督的网格自适应，消除了对预先适配网格的依赖。随后我们设计了受物理约束的损失函数——M-Uniform 损失，在节点层面强制实现网格等分布。实验结果表明，所提出的网络在高效网格自适应方面表现出与方程无关的泛化能力和几何独立性。它在现有方法上显示出持续的优越性，包括在各种偏微分方程和网格几何形状上的强健性能、对多尺度分辨率的可扩展性以及在不发生网格纠结的情况下保证误差减少。

Subjects: Artificial Intelligence, Numerical Analysis 主题：人工智能，数值分析

Publish: 2025-08-12 03:56:45 UTC 发布时间：2025-08-12 03:56:45 协调世界时

#26 SynLLM: A Comparative Analysis of Large Language Models for Medical Tabular Synthetic Data Generation via Prompt Engineering #26 SynLLM：通过提示工程对用于医学表格合成数据生成的大型语言模型的比较分析

Authors: [Arshia Ilaty](https://arxiv.org/search/?searchtype=author&query=Arshia Ilaty), [Hossein Shirazi](https://arxiv.org/search/?searchtype=author&query=Hossein Shirazi), [Hajar Homayouni](https://arxiv.org/search/?searchtype=author&query=Hajar Homayouni) 作者：Arshia Ilaty、Hossein Shirazi、Hajar Homayouni

Access to real-world medical data is often restricted due to privacy regulations, posing a significant barrier to the advancement of healthcare research. Synthetic data offers a promising alternative; however, generating realistic, clinically valid, and privacy-conscious records remains a major challenge. Recent advancements in Large Language Models (LLMs) offer new opportunities for structured data generation; however, existing approaches frequently lack systematic prompting strategies and comprehensive, multi-dimensional evaluation frameworks. In this paper, we present SynLLM, a modular framework for generating high-quality synthetic medical tabular data using 20 state-of-the-art open-source LLMs, including LLaMA, Mistral, and GPT variants, guided by structured prompts. We propose four distinct prompt types, ranging from example-driven to rule-based constraints, that encode schema, metadata, and domain knowledge to control generation without model fine-tuning. Our framework features a comprehensive evaluation pipeline that rigorously assesses generated data across statistical fidelity, clinical consistency, and privacy preservation. We evaluate SynLLM across three public medical datasets, including Diabetes, Cirrhosis, and Stroke, using 20 open-source LLMs. Our results show that prompt engineering significantly impacts data quality and privacy risk, with rule-based prompts achieving the best privacy-quality balance. SynLLM establishes that, when guided by well-designed prompts and evaluated with robust, multi-metric criteria, LLMs can generate synthetic medical data that is both clinically plausible and privacy-aware, paving the way for safer and more effective data sharing in healthcare research. 由于隐私法规，获取真实世界的医疗数据常常受到限制，这对医疗研究的进展构成了重大障碍。合成数据提供了有希望的替代方案；然而，生成真实、临床有效且注重隐私的病历仍然是一项重大挑战。大型语言模型（LLMs）的最新进展为结构化数据生成带来了新机会；但现有方法往往缺乏系统的提示策略和全面的、多维度的评估框架。本文提出了 SynLLM，一个模块化框架，用于使用 20 个最先进的开源 LLM（包括 LLaMA、Mistral 和 GPT 变体），通过结构化提示生成高质量的合成医疗表格数据。我们提出了四种不同的提示类型，从示例驱动到基于规则的约束，编码模式、元数据和领域知识以在无需模型微调的情况下控制生成。我们的框架具有一个全面的评估管线，严格评估生成数据在统计相似性、临床一致性和隐私保护方面的表现。我们在三个公开的医学数据集（包括糖尿病、肝硬化和中风）上使用 20 个开源 LLMs 评估了 SynLLM。我们的结果表明，提示工程对数据质量和隐私风险有显著影响，其中基于规则的提示在隐私与质量的平衡方面表现最佳。SynLLM 表明，在良好设计的提示指导下并采用健全的、多指标的评估标准时，LLMs 可以生成既符合临床合理性又具有隐私意识的合成医学数据，为在医疗研究中更安全、更有效地共享数据铺平了道路。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-11 23:56:42 UTC 发布：2025-08-11 23:56:42 UTC

#27 GVGAI-LLM: Evaluating Large Language Model Agents with Infinite Games #27 GVGAI-LLM：使用无限游戏评估大型语言模型代理 [PDF ] [Copy] [Kimi 2 ] [REL]

Authors: [Yuchen Li](https://arxiv.org/search/?searchtype=author&query=Yuchen Li), [Cong Lin](https://arxiv.org/search/?searchtype=author&query=Cong Lin), [Muhammad Umair Nasir](https://arxiv.org/search/?searchtype=author&query=Muhammad Umair Nasir), [Philip Bontrager](https://arxiv.org/search/?searchtype=author&query=Philip Bontrager), [Jialin Liu](https://arxiv.org/search/?searchtype=author&query=Jialin Liu), [Julian Togelius](https://arxiv.org/search/?searchtype=author&query=Julian Togelius) 作者：Yuchen Li, Cong Lin, Muhammad Umair Nasir, Philip Bontrager, Jialin Liu, Julian Togelius

We introduce GVGAI-LLM, a video game benchmark for evaluating the reasoning and problem-solving capabilities of large language models (LLMs). Built on the General Video Game AI framework, it features a diverse collection of arcade-style games designed to test a model’s ability to handle tasks that differ from most existing LLM benchmarks. The benchmark leverages a game description language that enables rapid creation of new games and levels, helping to prevent overfitting over time. Each game scene is represented by a compact set of ASCII characters, allowing for efficient processing by language models. GVGAI-LLM defines interpretable metrics, including the meaningful step ratio, step efficiency, and overall score, to assess model behavior. Through zero-shot evaluations across a broad set of games and levels with diverse challenges and skill depth, we reveal persistent limitations of LLMs in spatial reasoning and basic planning. Current models consistently exhibit spatial and logical errors, motivating structured prompting and spatial grounding techniques. While these interventions lead to partial improvements, the benchmark remains very far from solved. GVGAI-LLM provides a reproducible testbed for advancing research on language model capabilities, with a particular emphasis on agentic behavior and contextual reasoning. 我们推出了 GVGAI-LLM，这是一个用于评估大型语言模型（LLMs）推理与问题解决能力的视频游戏基准。基于通用视频游戏人工智能（General Video Game AI）框架构建，它包含了多样的街机风格游戏集合，旨在测试模型处理与大多数现有 LLM 基准不同任务的能力。该基准利用一种游戏描述语言，能够快速创建新游戏和关卡，有助于随时间防止过拟合。每个游戏场景都由一组紧凑的 ASCII 字符表示，便于语言模型高效处理。GVGAI-LLM 定义了可解释的度量标准，包括有意义步比率（meaningful step ratio）、步效率（step efficiency）和总体得分，以评估模型行为。通过在具有多样挑战与技能深度的广泛游戏和关卡上进行零样本评估，我们揭示了 LLM 在空间推理和基础规划方面的持续局限性。当前模型持续表现出空间与逻辑错误，促使采用结构化提示和空间锚定技术。尽管这些干预带来了部分改进，但该基准仍远未被解决。 GVGAI-LLM 提供了一个可复现的测试平台，用于推进语言模型能力的研究，特别强调代理行为和情境推理。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-11 22:17:07 UTC 发布：2025-08-11 22:17:07 UTC

#28 Large Language Models as Oracles for Ontology Alignment #28 将大型语言模型作为本体对齐的神谕

Authors: [Sviatoslav Lushnei](https://arxiv.org/search/?searchtype=author&query=Sviatoslav Lushnei), [Dmytro Shumskyi](https://arxiv.org/search/?searchtype=author&query=Dmytro Shumskyi), [Severyn Shykula](https://arxiv.org/search/?searchtype=author&query=Severyn Shykula), [Ernesto Jimenez-Ruiz](https://arxiv.org/search/?searchtype=author&query=Ernesto Jimenez-Ruiz), [Artur d’Avila Garcez](https://arxiv.org/search/?searchtype=author&query=Artur d’Avila Garcez) 作者：Sviatoslav Lushnei、Dmytro Shumskyi、Severyn Shykula、Ernesto Jimenez-Ruiz、Artur d’Avila Garcez

Ontology alignment plays a crucial role in integrating diverse data sources across domains. There is a large plethora of systems that tackle the ontology alignment problem, yet challenges persist in producing highly quality correspondences among a set of input ontologies. Human-in-the-loop during the alignment process is essential in applications requiring very accurate mappings. User involvement is, however, expensive when dealing with large ontologies. In this paper, we explore the feasibility of using Large Language Models (LLM) as an alternative to the domain expert. The use of the LLM focuses only on the validation of the subset of correspondences where an ontology alignment system is very uncertain. We have conducted an extensive evaluation over several matching tasks of the Ontology Alignment Evaluation Initiative (OAEI), analysing the performance of several state-of-the-art LLMs using different ontology-driven prompt templates. The LLM results are also compared against simulated Oracles with variable error rates. 本体对齐在跨领域整合多样化数据源中发挥着关键作用。尽管存在大量针对本体对齐问题的系统，但在为一组输入本体产生高质量对应关系方面仍面临挑战。在需要非常精确映射的应用中，对齐过程中的人类参与（human-in-the-loop）是必不可少的。然而，当处理大型本体时，用户参与代价高昂。本文探讨将大型语言模型（LLM）作为领域专家替代方案的可行性。对 LLM 的使用仅集中在那些本体对齐系统极为不确定的对应子集的验证上。我们对本体对齐评估倡议（OAEI）的若干匹配任务进行了广泛评估，分析了多种最先进 LLMs 在不同本体驱动提示模板下的表现。LLM 的结果还与具有可变错误率的模拟 Oracle 进行了比较。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-11 22:16:20 UTC 发布：2025-08-11 22:16:20 UTC

#29 POMO+: Leveraging starting nodes in POMO for solving Capacitated Vehicle Routing Problem #29 POMO+: 在 POMO 中利用起始节点来解决带容量约束的车辆路径问题

Authors: [Szymon Jakubicz](https://arxiv.org/search/?searchtype=author&query=Szymon Jakubicz), [Karol Kuźniak](https://arxiv.org/search/?searchtype=author&query=Karol Kuźniak), [Jan Wawszczak](https://arxiv.org/search/?searchtype=author&query=Jan Wawszczak), [Paweł Gora](https://arxiv.org/search/?searchtype=author&query=Paweł Gora) 作者：Szymon Jakubicz、Karol Kuźniak、Jan Wawszczak、Paweł Gora

In recent years, reinforcement learning (RL) methods have emerged as a promising approach for solving combinatorial problems. Among RL-based models, POMO has demonstrated strong performance on a variety of tasks, including variants of the Vehicle Routing Problem (VRP). However, there is room for improvement for these tasks. In this work, we improved POMO, creating a method (\textbf{POMO+}) that leverages the initial nodes to find a solution in a more informed way. We ran experiments on our new model and observed that our solution converges faster and achieves better results. We validated our models on the CVRPLIB dataset and noticed improvements in problem instances with up to 100 customers. We hope that our research in this project can lead to further advancements in the field. 近年来，强化学习（RL）方法已成为解决组合优化问题的有前景的方法。在基于 RL 的模型中，POMO 在多种任务上表现出色，包括车辆路径问题（VRP）的各种变体。然而，这些任务仍有改进空间。在这项工作中，我们改进了 POMO，创建了一种方法（POMO+），该方法利用起始节点以更有信息量的方式寻找解。我们在新模型上进行了实验，观察到我们的解决方案收敛更快并取得更好结果。我们在 CVRPLIB 数据集上验证了模型，并在最多 100 个客户的问题实例中注意到改进。我们希望本项目的研究能促进该领域的进一步进展。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-11 21:55:16 UTC 发表：2025-08-11 21:55:16 UTC

#30 Beyond Ordinal Preferences: Why Alignment Needs Cardinal Human Feedback #30 超越序数偏好：为什么对齐需要基于基数的人类反馈

Authors: [Parker Whitfill](https://arxiv.org/search/?searchtype=author&query=Parker Whitfill), [Stewy Slocum](https://arxiv.org/search/?searchtype=author&query=Stewy Slocum) 作者：Parker Whitfill, Stewy Slocum

Alignment techniques for LLMs rely on optimizing preference-based objectives – where these preferences are typically elicited as ordinal, binary choices between responses. Recent work has focused on improving label quality or mitigating particular biases, but we identify a more fundamental limitation: these methods collect the wrong kind of data. We prove an impossibility result: no algorithm relying solely on ordinal comparisons can systematically recover the most preferred model. Intuitively, ordinal data lacks the information needed to resolve tradeoffs – e.g., fixing a factual error on one prompt versus improving style on another. We show that selecting the optimal model requires recovering preferences over \emph{models} (rather than just responses), which can only be identified given cardinal feedback about response quality. To address this, we collect and publicly release a dataset of 25,000 cardinal judgments using willingness-to-pay elicitations, a well-established tool from experimental economics. Empirically, we find that incorporating cardinal feedback into preference fine-tuning allows models to prioritize high-impact improvements and outperform ordinal-only methods on downstream benchmarks, such as Arena-Hard. 用于 LLMs 的对齐技术依赖于优化基于偏好的目标——这些偏好通常被以序数的、在响应间进行二选一的方式采集。近期工作集中于提高标签质量或缓解特定偏见，但我们发现了一个更根本的限制：这些方法收集了错误类型的数据。我们证明了一个不可能性结果：任何仅依赖序数比较的算法都无法系统性地恢复出最受偏好的模型。直观上，序数数据缺乏解决权衡所需的信息——例如在一个提示上修复事实错误与在另一个提示上改善风格之间的权衡。我们表明，选择最优模型需要恢复对“模型”的偏好（而不仅仅是对响应的偏好），而这只有在获得关于响应质量的基数反馈时才能被识别。为了解决这一问题，我们使用意愿支付（willingness-to-pay）引导法这一实验经济学中成熟的工具，收集并公开发布了包含 25,000 条基数判断的数据集。通过实证研究，我们发现将基数反馈纳入偏好微调可以使模型优先实现高影响改进，并在下游基准（例如 Arena-Hard）上优于仅基于序数的方法。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-11 21:42:33 UTC 发布：2025-08-11 21:42:33 UTC

#31 A Fast GRASP Metaheuristic for the Trigger Arc TSP with MIP-Based Construction and Multi-Neighborhood Local Search #31 一种针对触发弧 TSP 的快速 GRASP 元启发式算法，结合基于 MIP 的构造和多邻域局部搜索 [PDF ] [Copy] [Kimi ] [REL]

Authors: [Joan Salvà Soler](https://arxiv.org/search/?searchtype=author&query=Joan Salvà Soler), [Grégoire de Lambertye](https://arxiv.org/search/?searchtype=author&query=Grégoire de Lambertye) 作者：Joan Salvà Soler、Grégoire de Lambertye

The Trigger Arc Traveling Salesman Problem (TA-TSP) extends the classical TSP by introducing dynamic arc costs that change when specific \textit{trigger} arcs are traversed, modeling scenarios such as warehouse operations with compactable storage systems. This paper introduces a GRASP-based metaheuristic that combines multiple construction heuristics with a multi-neighborhood local search. The construction phase uses mixed-integer programming (MIP) techniques to transform the TA-TSP into a sequence of tailored TSP instances, while the improvement phase applies 2-Opt, Swap, and Relocate operators. Computational experiments on MESS 2024 competition instances achieved average optimality gaps of 0.77% and 0.40% relative to the best-known solutions within a 60-second limit. On smaller, synthetically generated datasets, the method produced solutions 11.3% better than the Gurobi solver under the same time constraints. The algorithm finished in the top three at MESS 2024, demonstrating its suitability for real-time routing applications with state-dependent travel costs. 触发弧旅行商问题（TA-TSP）通过引入在遍历特定触发弧时会变化的动态弧成本，扩展了经典 TSP，模拟了例如具有可压缩存储系统的仓库操作等场景。本文提出了一种基于 GRASP 的元启发式算法，该算法将多种构造启发式与多邻域局部搜索相结合。构造阶段使用混合整数规划（MIP）技术将 TA-TSP 转化为一系列定制的 TSP 实例，改善阶段则应用了 2-Opt、Swap 和 Relocate 算子。在 MESS 2024 竞赛实例上的计算实验中，在 60 秒时间限制内相对于已知最优解取得了平均 0.77%和 0.40%的最优差距。在较小的合成数据集上，该方法在相同时限内生成的解比 Gurobi 求解器好 11.3%。该算法在 MESS 2024 中名列前茅，证明了其适用于具有状态依赖旅行成本的实时路径规划应用。

Subjects: Artificial Intelligence, Discrete Mathematics 主题：人工智能，离散数学

Publish: 2025-08-11 21:24:38 UTC 发表：2025-08-11 21:24:38 UTC

#32 OverFill: Two-Stage Models for Efficient Language Model Decoding #32 OverFill：用于高效语言模型解码的两阶段模型

Authors: [Woojeong Kim](https://arxiv.org/search/?searchtype=author&query=Woojeong Kim), [Junxiong Wang](https://arxiv.org/search/?searchtype=author&query=Junxiong Wang), [Jing Nathan Yan](https://arxiv.org/search/?searchtype=author&query=Jing Nathan Yan), [Mohamed Abdelfattah](https://arxiv.org/search/?searchtype=author&query=Mohamed Abdelfattah), [Alexander M. Rush](https://arxiv.org/search/?searchtype=author&query=Alexander M. Rush) 作者：Woojeong Kim、Junxiong Wang、Jing Nathan Yan、Mohamed Abdelfattah、Alexander M. Rush

Large language models (LLMs) excel across diverse tasks but face significant deployment challenges due to high inference costs. LLM inference comprises prefill (compute-bound) and decode (memory-bound) stages, with decode dominating latency particularly for long sequences. Current decoder-only models handle both stages uniformly, despite their distinct computational profiles. We propose OverFill, which decouples these stages to optimize accuracy-efficiency tradeoffs. OverFill begins with a full model for prefill, processing system and user inputs in parallel. It then switches to a dense pruned model, while generating tokens sequentially. Leveraging more compute during prefill, OverFill improves generation quality with minimal latency overhead. Our 3B-to-1B OverFill configuration outperforms 1B pruned models by 83.2%, while the 8B-to-3B configuration improves over 3B pruned models by 79.2% on average across standard benchmarks. OverFill matches the performance of same-sized models trained from scratch, while using significantly less training data. Our code is available at https://github.com/friendshipkim/overfill. 大型语言模型（LLMs）在各种任务上表现出色，但由于高推理成本而面临显著的部署挑战。LLM 推理由预填（计算受限）和解码（内存受限）阶段组成，解码在长序列情况下尤其主导延迟。当前的仅解码器模型对这两个阶段采用相同的处理方式，尽管它们的计算特性不同。我们提出了 OverFill，将这两个阶段解耦以优化准确性与效率的权衡。OverFill 在预填阶段先使用完整模型，平行处理系统和用户输入，然后在生成 token 时切换到经过稠密剪枝的模型并顺序生成。通过在预填阶段利用更多计算资源，OverFill 在几乎没有延迟开销的情况下提升了生成质量。我们的 3B 到 1B 的 OverFill 配置在标准基准上平均比 1B 剪枝模型高出 83.2%，而 8B 到 3B 配置则比 3B 剪枝模型平均提升 79.2%。OverFill 的性能可与同尺寸从零训练的模型相匹配，同时使用显著更少的训练数据。我们的代码可在 https://github.com/friendshipkim/overfill 获取。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-11 20:07:34 UTC 发布：2025-08-11 20:07:34 UTC

#33 Solver-Aided Expansion of Loops to Avoid Generate-and-Test #33 利用求解器辅助展开循环以避免生成并测试 [PDF ] [Copy] [Kimi ] [REL]

Authors: [Niklas Dewally](https://arxiv.org/search/?searchtype=author&query=Niklas Dewally), [Özgür Akgün](https://arxiv.org/search/?searchtype=author&query=Özgür Akgün) 作者：Niklas Dewally、Özgür Akgün

Constraint modelling languages like MiniZinc and Essence rely on unrolling loops (in the form of quantified expressions and comprehensions) during compilation. Standard approaches generate all combinations of induction variables and use partial evaluation to discard those that simplify to identity elements of associative-commutative operators (e.g. true for conjunction, 0 for summation). This can be inefficient for problems where most combinations are ultimately irrelevant. We present a method that avoids full enumeration by using a solver to compute only the combinations required to generate the final set of constraints. The resulting model is identical to that produced by conventional flattening, but compilation can be significantly faster. This improves the efficiency of translating high-level user models into solver-ready form, particularly when induction variables range over large domains with selective preconditions. 像 MiniZinc 和 Essence 这样的约束建模语言在编译期间依赖展开循环（以量化表达式和推导式的形式）。标准方法生成所有归纳变量的组合并使用部分求值来丢弃那些在结合-交换运算的恒等元素上简化的组合（例如对合取为 true、对求和为 0）。对于大多数组合最终无关紧要的问题，这可能效率低下。我们提出了一种方法，通过使用求解器仅计算生成最终约束集所需的组合，从而避免完全枚举。得到的模型与传统扁平化产生的模型相同，但编译可以显著更快。这提高了将高级用户模型翻译为求解器就绪形式的效率，尤其是在归纳变量在带有选择性前置条件的大域上取值时。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-11 19:59:16 UTC 发布：2025-08-11 19:59:16 UTC

#34 Bilevel MCTS for Amortized O(1) Node Selection in Classical Planning #34 双层 MCTS 用于经典规划中均摊 O(1)节点选择

Author: [Masataro Asai](https://arxiv.org/search/?searchtype=author&query=Masataro Asai) 作者：浅井雅太郎

We study an efficient implementation of Multi-Armed Bandit (MAB)-based Monte-Carlo Tree Search (MCTS) for classical planning. One weakness of MCTS is that it spends a significant time deciding which node to expand next. While selecting a node from an OPEN list with N nodes has O(1) runtime complexity with traditional array-based priority-queues for dense integer keys, the tree-based OPEN list used by MCTS requires O(logN), which roughly corresponds to the search depth d. In classical planning, d is arbitrarily large (e.g., 2k−1 in k-disk Tower-of-Hanoi) and the runtime for node selection is significant, unlike in game tree search, where the cost is negligible compared to the node evaluation (rollouts) because d is inherently limited by the game (e.g., d≤361 in Go). To improve this bottleneck, we propose a bilevel modification to MCTS that runs a best-first search from each selected leaf node with an expansion budget proportional to d, which achieves amortized O(1) runtime for node selection, equivalent to the traditional queue-based OPEN list. In addition, we introduce Tree Collapsing, an enhancement that reduces action selection steps and further improves the performance. 我们研究了用于经典规划的基于多臂老虎机（MAB）的蒙特卡洛树搜索（MCTS）的高效实现。MCTS 的一个弱点是它在决定下一个要扩展的节点上花费了大量时间。在使用传统基于数组的优先队列对密集整数键进行处理时，从具有 N 个节点的 OPEN 列表中选择一个节点具有 O(1) 的运行时复杂度，而 MCTS 使用的基于树的 OPEN 列表则要求 O(logN) ，这大致对应于搜索深度 d 。在经典规划中， d 是任意大的（例如，在 k -盘河内塔中 2k−1 ），节点选择的运行时间是显著的，这与博弈树搜索不同，后者相比节点评估（模拟）节点选择的成本可以忽略不计，因为 d 受博弈固有限制（例如，在围棋中 d≤361 ）。为了解决这一瓶颈，我们提出对 MCTS 的二层修改：对每个被选中的叶节点运行一次从该节点出发的最佳优先搜索，扩展预算与 d 成比例，从而实现节点选择的摊销 O(1) 运行时，与传统基于队列的 OPEN 列表等效。此外，我们引入了树折叠（Tree Collapsing），这是一种可减少动作选择步骤并进一步提升性能的改进方法。

Subjects: Artificial Intelligence, Computation and Language 主题：人工智能，计算与语言

Publish: 2025-08-11 18:12:40 UTC 发布日期：2025-08-11 18:12:40 UTC

#35 UrzaGPT: LoRA-Tuned Large Language Models for Card Selection in Collectible Card Games #35 UrzaGPT：用于集换式卡牌游戏选牌的 LoRA 微调大型语言模型

Author: [Timo Bertram](https://arxiv.org/search/?searchtype=author&query=Timo Bertram) 作者：Timo Bertram

Collectible card games (CCGs) are a difficult genre for AI due to their partial observability, long-term decision-making, and evolving card sets. Due to this, current AI models perform vastly worse than human players at CCG tasks such as deckbuilding and gameplay. In this work, we introduce UrzaGPT, a domain-adapted large language model that recommends real-time drafting decisions in Magic: The Gathering. Starting from an open-weight LLM, we use Low-Rank Adaptation fine-tuning on a dataset of annotated draft logs. With this, we leverage the language modeling capabilities of LLM, and can quickly adapt to different expansions of the game. We benchmark UrzaGPT in comparison to zero-shot LLMs and the state-of-the-art domain-specific model. Untuned, small LLMs like Llama-3-8B are completely unable to draft, but the larger GPT-4o achieves a zero-shot performance of 43%. Using UrzaGPT to fine-tune smaller models, we achieve an accuracy of 66.2% using only 10,000 steps. Despite this not reaching the capability of domain-specific models, we show that solely using LLMs to draft is possible and conclude that using LLMs can enable performant, general, and update-friendly drafting AIs in the future. 集换式卡牌游戏（CCG）对人工智能来说是一个困难的类型，原因在于其部分可观测性、长期决策需求以及不断变化的卡牌集合。正因如此，现有的 AI 模型在构筑卡组和对局等 CCG 任务上的表现远不及人类玩家。在本工作中，我们提出了 UrzaGPT ，一种为领域适配的大型语言模型，可在 Magic: The Gathering 中实时推荐选牌决策。我们从一个开放权重的 LLM 出发，使用低秩适配（LoRA）微调一套带注释的选牌日志数据集。借助此方法，我们利用了 LLM 的语言建模能力，并能快速适应游戏的不同扩展包。我们将 UrzaGPT 与零样本 LLM 以及最先进的领域特定模型进行了基准比较。未经微调的小型 LLM（如 Llama-3-8B）完全无法进行选牌，但更大的 GPT-4o 在零样本下达到了 43% 的表现。通过使用 UrzaGPT 对小型模型进行微调，我们在仅用 10,000 步的情况下达到了 66.2% 的准确率。尽管这并未达到特定领域模型的能力水平，我们展示了仅使用 LLMs 进行初稿撰写是可行的，并得出结论：未来使用 LLMs 可以实现高性能、通用且便于更新的初稿撰写人工智能。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-11 18:09:15 UTC 发布：2025-08-11 18:09:15 UTC

#36 What Breaks Knowledge Graph based RAG? Empirical Insights into Reasoning under Incomplete Knowledge #36 基于知识图的 RAG 会在哪些方面失效？关于在不完整知识下推理的实证洞见

Knowledge Graph-based Retrieval-Augmented Generation (KG-RAG) is an increasingly explored approach for combining the reasoning capabilities of large language models with the structured evidence of knowledge graphs. However, current evaluation practices fall short: existing benchmarks often include questions that can be directly answered using existing triples in KG, making it unclear whether models perform reasoning or simply retrieve answers directly. Moreover, inconsistent evaluation metrics and lenient answer matching criteria further obscure meaningful comparisons. In this work, we introduce a general method for constructing benchmarks, together with an evaluation protocol, to systematically assess KG-RAG methods under knowledge incompleteness. Our empirical results show that current KG-RAG methods have limited reasoning ability under missing knowledge, often rely on internal memorization, and exhibit varying degrees of generalization depending on their design. 基于知识图谱的增强检索生成（KG-RAG）是一种越来越受关注的方法，用于将大语言模型的推理能力与知识图谱的结构化证据相结合。然而，当前的评估实践存在不足：现有基准往往包含可直接用知识图谱中已有三元组回答的问题，这使得难以判断模型是在进行推理还是仅仅直接检索答案。此外，不一致的评估指标和宽松的答案匹配标准也进一步模糊了有意义的比较。在本工作中，我们提出了一种通用的基准构建方法及评估协议，以在知识不完整的情况下系统地评估 KG-RAG 方法。我们的实证结果表明，当前的 KG-RAG 方法在缺失知识下的推理能力有限，常常依赖内部记忆，并且其泛化能力因设计而异。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-11 10:55:06 UTC 发布时间：2025-08-11 10:55:06 UTC

#37 First Ask Then Answer: A Framework Design for AI Dialogue Based on Supplementary Questioning with Large Language Models #37 先问后答：一种基于大型语言模型补充提问的 AI 对话框架设计

Authors: [Chuanruo Fu](https://arxiv.org/search/?searchtype=author&query=Chuanruo Fu), [Yuncheng Du](https://arxiv.org/search/?searchtype=author&query=Yuncheng Du) 作者：傅传若，杜云成

Large Language Models (LLMs) often struggle to deliver accurate and actionable answers when user-provided information is incomplete or ill-specified. We propose a new interaction paradigm, First Ask Then Answer (FATA), in which, through prompt words, LLMs are guided to proactively generate multidimensional supplementary questions for users prior to response generation. Subsequently, by integrating user-provided supplementary information with the original query through sophisticated prompting techniques, we achieve substantially improved response quality and relevance. In contrast to existing clarification approaches – such as the CLAM framework oriented to ambiguity and the self-interrogation Self-Ask method – FATA emphasizes completeness (beyond mere disambiguation) and user participation (inviting human input instead of relying solely on model-internal reasoning). It also adopts a single-turn strategy: all clarifying questions are produced at once, thereby reducing dialogue length and improving efficiency. Conceptually, FATA uses the reasoning power of LLMs to scaffold user expression, enabling non-expert users to formulate more comprehensive and contextually relevant queries. To evaluate FATA, we constructed a multi-domain benchmark and compared it with two controls: a baseline prompt (B-Prompt) and a context-enhanced expert prompt (C-Prompt). Experimental results show that FATA outperforms B-Prompt by approximately 40% in aggregate metrics and exhibits a coefficient of variation 8% lower than C-Prompt, indicating superior stability. 大型语言模型（LLMs）在用户提供的信息不完整或描述不清时，常常难以给出准确且可执行的答案。我们提出了一种新的交互范式，先问后答（FATA），通过提示词引导 LLMs 在生成回答之前主动为用户生成多维度的补充问题。随后，利用复杂的提示技术将用户提供的补充信息与原始查询整合，我们实现了显著提升的回答质量与相关性。与现有的澄清方法（如面向歧义性的 CLAM 框架和自我询问的 Self-Ask 方法）相比，FATA 强调完整性（超越单纯的消歧）和用户参与（邀请人类输入而非仅依赖模型内部推理）。它还采用单轮策略：所有澄清问题一次性生成，从而减少对话轮次并提高效率。从概念上讲，FATA 利用 LLMs 的推理能力来搭建用户表达的支架，使非专家用户能够提出更全面且符合上下文的查询。为了评估 FATA，我们构建了一个多领域基准，并将其与两个对照组进行了比较：基线提示（B-Prompt）和上下文增强专家提示（C-Prompt）。实验结果表明，FATA 在综合指标上比 B-Prompt 提高约 40%，并且其变异系数比 C-Prompt 低 8%，表明稳定性更佳。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-08 13:39:47 UTC 发布：2025-08-08 13:39:47 UTC

#38 LLM-BI: Towards Fully Automated Bayesian Inference with Large Language Models #38 LLM-BI：迈向使用大规模语言模型的全自动贝叶斯推断

Author: [Yongchao Huang](https://arxiv.org/search/?searchtype=author&query=Yongchao Huang) 作者：黄永超

A significant barrier to the widespread adoption of Bayesian inference is the specification of prior distributions and likelihoods, which often requires specialized statistical expertise. This paper investigates the feasibility of using a Large Language Model (LLM) to automate this process. We introduce LLM-BI (Large Language Model-driven Bayesian Inference), a conceptual pipeline for automating Bayesian workflows. As a proof-of-concept, we present two experiments focused on Bayesian linear regression. In Experiment I, we demonstrate that an LLM can successfully elicit prior distributions from natural language. In Experiment II, we show that an LLM can specify the entire model structure, including both priors and the likelihood, from a single high-level problem description. Our results validate the potential of LLMs to automate key steps in Bayesian modeling, enabling the possibility of an automated inference pipeline for probabilistic programming. 贝叶斯推断广泛应用的一个重要障碍是先验分布和似然函数的指定，这通常需要专门的统计学专业知识。本文探讨了使用大型语言模型（LLM）自动化该过程的可行性。我们提出了 LLM-BI（由大型语言模型驱动的贝叶斯推断），这是一个用于自动化贝叶斯工作流的概念性流程。作为概念验证，我们展示了两个以贝叶斯线性回归为重点的实验。在实验 I 中，我们证明了 LLM 能够从自然语言成功引导出先验分布。在实验 II 中，我们表明 LLM 能够从单一的高层问题描述中指定整个模型结构，包括先验和似然。我们的结果验证了 LLM 在自动化贝叶斯建模关键步骤方面的潜力，从而使用于概率编程的自动化推断流程成为可能。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-07 00:00:59 UTC 发布：2025-08-07 00:00:59 UTC

#39 An Efficient Application of Goal Programming to Tackle Multiobjective Problems with Recurring Fitness Landscapes #39 一种高效的目标规划方法用于处理具有循环适应景观的多目标问题

Authors: [Rodrigo Lankaites Pinheiro](https://arxiv.org/search/?searchtype=author&query=Rodrigo Lankaites Pinheiro), [Dario Landa-Silva](https://arxiv.org/search/?searchtype=author&query=Dario Landa-Silva), [Wasakorn Laesanklang](https://arxiv.org/search/?searchtype=author&query=Wasakorn Laesanklang), [Ademir Aparecido Constantino](https://arxiv.org/search/?searchtype=author&query=Ademir Aparecido Constantino) 作者：Rodrigo Lankaites Pinheiro、Dario Landa-Silva、Wasakorn Laesanklang、Ademir Aparecido Constantino

Many real-world applications require decision-makers to assess the quality of solutions while considering multiple conflicting objectives. Obtaining good approximation sets for highly constrained many-objective problems is often a difficult task even for modern multiobjective algorithms. In some cases, multiple instances of the problem scenario present similarities in their fitness landscapes. That is, there are recurring features in the fitness landscapes when searching for solutions to different problem instances. We propose a methodology to exploit this characteristic by solving one instance of a given problem scenario using computationally expensive multiobjective algorithms to obtain a good approximation set and then using Goal Programming with efficient single-objective algorithms to solve other instances of the same problem scenario. We use three goal-based objective functions and show that on benchmark instances of the multiobjective vehicle routing problem with time windows, the methodology is able to produce good results in short computation time. The methodology allows to combine the effectiveness of state-of-the-art multiobjective algorithms with the efficiency of goal programming to find good compromise solutions in problem scenarios where instances have similar fitness landscapes. 许多现实世界的应用要求决策者在考虑多个冲突目标的同时评估解的质量。即便对于现代多目标算法来说，获取高度约束的多目标问题的良好近似解集通常也是一项困难的任务。在某些情况下，同一问题情景的多个实例在其适应度景观上具有相似性。也就是说，在为不同问题实例搜索解时，适应度景观中会出现重复的特征。我们提出了一种方法论，利用这一特性：先使用计算代价高的多目标算法求解给定问题情景的一个实例，以获得良好的近似解集，然后对同一问题情景的其他实例使用基于目标规划的高效单目标算法进行求解。我们使用了三种基于目标的目标函数，并在带时间窗的多目标车辆路径问题基准实例上展示了该方法在短时间内能够产生良好结果。该方法论允许将最先进多目标算法的有效性与目标规划的高效性相结合，在实例具有相似适应度景观的问题场景中找到良好的折中解。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-06 09:02:57 UTC 发布：2025-08-06 09:02:57 世界协调时

#40 Topos Causal Models #40 拓扑语因果模型

Author: [Sridhar Mahadevan](https://arxiv.org/search/?searchtype=author&query=Sridhar Mahadevan) 作者：Sridhar Mahadevan

We propose topos causal models (TCMs), a novel class of causal models that exploit the key properties of a topos category: they are (co)complete, meaning all (co)limits exist, they admit a subobject classifier, and allow exponential objects. The main goal of this paper is to show that these properties are central to many applications in causal inference. For example, subobject classifiers allow a categorical formulation of causal intervention, which creates sub-models. Limits and colimits allow causal diagrams of arbitrary complexity to be solved", using a novel interpretation of causal approximation. Exponential objects enable reasoning about equivalence classes of operations on causal models, such as covered edge reversal and causal homotopy. Analogous to structural causal models (SCMs), TCMs are defined by a collection of functions, each defining a local autonomous" causal mechanism that assemble to induce a unique global function from exogenous to endogenous variables. Since the category of TCMs is (co)complete, which we prove in this paper, every causal diagram has a solution" in the form of a (co)limit: this implies that any arbitrary causal model can be approximated" by some global function with respect to the morphisms going into or out of the diagram. Natural transformations are crucial in measuring the quality of approximation. In addition, we show that causal interventions are modeled by subobject classifiers: any sub-model is defined by a monic arrow into its parent model. Exponential objects permit reasoning about entire classes of causal equivalences and interventions. Finally, as TCMs form a topos, they admit an internal logic defined as a Mitchell-Benabou language with an associated Kripke-Joyal semantics. We show how to reason about causal models in TCMs using this internal logic. 我们提出了拓扑层（topos）因果模型（TCMs），这是一类新颖的因果模型，利用了拓扑层范畴的关键特性：它们是（余）完备的，意味着所有的（余）极限都存在，具有子对象分类器，并允许指数对象。本文的主要目标是展示这些特性在因果推断的诸多应用中起着核心作用。例如，子对象分类器允许对因果干预进行范畴论表述，进而创建子模型。极限和余极限允许对任意复杂度的因果图进行“求解”，这依赖于一种对因果近似的新颖解释。指数对象使得可以对因果模型上操作的等价类进行推理，例如覆盖边反转和因果同伦。与结构因果模型（SCMs）类似，TCMs 由一组函数定义，每个函数定义了一个“局部自治”的因果机制，这些机制组装在一起以从外生变量到内生变量诱导出唯一的全局函数。由于 TCM 范畴是（余）完备的，我们在本文中证明了这一点，因此每个因果图都有一个以（余）极限形式存在的“解”：这意味着任意因果模型都可以就流入或流出该图的态射而言，被某个全局函数“逼近”。自然变换在衡量逼近质量方面至关重要。此外，我们证明因果干预由子对象判定器建模：任何子模型都由指向其父模型的单射箭表示。指数对象允许对整类因果等价和干预进行推理。最后，由于 TCM 构成一个托胞，它承认一种内部逻辑，该逻辑被定义为带有相应 Kripke–Joyal 语义的 Mitchell–Benabou 语言。我们展示了如何使用该内部逻辑在 TCM 中对因果模型进行推理。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 21:50:57 UTC

#41 Topos Theory for Generative AI and LLMs

Author: [Sridhar Mahadevan](https://arxiv.org/search/?searchtype=author&query=Sridhar Mahadevan)

We propose the design of novel categorical generative AI architectures (GAIAs) using topos theory, a type of category that is set-like": a topos has all (co)limits, is Cartesian closed, and has a subobject classifier. Previous theoretical results on the Transformer model have shown that it is a universal sequence-to-sequence function approximator, and dense in the space of all continuous functions with compact support on the Euclidean space of embeddings of tokens. Building on this theoretical result, we explore novel architectures for LLMs that exploit the property that the category of LLMs, viewed as functions, forms a topos. Previous studies of large language models (LLMs) have focused on daisy-chained linear architectures or mixture-of-experts. In this paper, we use universal constructions in category theory to construct novel LLM architectures based on new types of compositional structures. In particular, these new compositional structures are derived from universal properties of LLM categories, and include pullback, pushout, (co) equalizers, exponential objects, and subobject classifiers. We theoretically validate these new compositional structures by showing that the category of LLMs is (co)complete, meaning that all diagrams have solutions in the form of (co)limits. Building on this completeness result, we then show that the category of LLMs forms a topos, a set-like" category, which requires showing the existence of exponential objects as well as subobject classifiers. We use a functorial characterization of backpropagation to define a potential implementation of an LLM topos architecture. 我们提出使用拓扑理论（topos theory）设计新型范畴生成式人工智能架构（GAIAs），拓扑是一种类似“集合”的范畴：一个拓扑具有所有的（余）极限，是笛卡尔闭的，并且具有子对象分类器。此前关于 Transformer 模型的理论结果表明，它是一个通用的序列到序列函数近似器，并在嵌入欧几里得空间上、具有紧支集的所有连续函数空间中是稠密的。在此理论结果的基础上，我们探索利用将 LLMs 类视为函数时形成拓扑这一性质的 LLM 新架构。此前对大型语言模型（LLMs）的研究主要集中于串联的线性架构或专家混合方法。在本文中，我们使用范畴论中的普遍构造来构建基于新型组合结构的 LLM 新架构。特别地，这些新的组合结构来源于 LLMs 范畴的普遍性质，包含拉回（pullback）、推出（pushout）、（余）等化子、指数对象和子对象分类器。我们从理论上验证了这些新的组合结构，证明了 LLMs 的范畴是（余）完备的，这意味着所有图都有以（余）极限形式存在的解。基于这一完备性结果，我们进一步证明 LLMs 的范畴构成了一个拓扑（topos），一个“类似集合”的范畴，这需要证明指数对象和子对象判别器的存在。我们使用反向传播的函子刻画来定义 LLM topos 架构的一种潜在实现。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 20:00:06 UTC 发布：2025-08-05 20:00:06 UTC

#42 Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models #42 时间是一种特征：在扩散语言模型中利用时间动态

Diffusion large language models (dLLMs) generate text through iterative denoising, yet current decoding strategies discard rich intermediate predictions in favor of the final output. Our work here reveals a critical phenomenon, temporal oscillation, where correct answers often emerge in the middle process, but are overwritten in later denoising steps. To address this issue, we introduce two complementary methods that exploit temporal consistency: 1) Temporal Self-Consistency Voting, a training-free, test-time decoding strategy that aggregates predictions across denoising steps to select the most consistent output; and 2) a post-training method termed Temporal Consistency Reinforcement, which uses Temporal Semantic Entropy (TSE), a measure of semantic stability across intermediate predictions, as a reward signal to encourage stable generations. Empirical results across multiple benchmarks demonstrate the effectiveness of our approach. Using the negative TSE reward alone, we observe a remarkable average improvement of 24.7% on the Countdown dataset over an existing dLLM. Combined with the accuracy reward, we achieve absolute gains of 2.0% on GSM8K, 4.3% on MATH500, 6.6% on SVAMP, and 25.3% on Countdown, respectively. Our findings underscore the untapped potential of temporal dynamics in dLLMs and offer two simple yet effective tools to harness them. 扩散大语言模型（dLLMs）通过迭代去噪生成文本，然而当前的解码策略往往舍弃了丰富的中间预测，仅保留最终输出。我们在此工作中揭示了一个关键现象——时间振荡，即正确答案常常在中间过程出现，但在后续去噪步骤中被覆盖。为了解决这一问题，我们提出了两种利用时间一致性的互补方法：1）时间自一致性投票（Temporal Self-Consistency Voting），一种无需训练的测试时解码策略，汇总各去噪步骤的预测以选择最一致的输出；以及 2）一种称为时间一致性强化（Temporal Consistency Reinforcement）的后训练方法，使用时间语义熵（TSE）——一种衡量中间预测语义稳定性的指标——作为奖励信号以鼓励稳定生成。跨多个基准的实证结果证明了我们方法的有效性。仅使用负 TSE 奖励，在 Countdown 数据集上我们相较于现有的 dLLM 观察到平均显著提升 24.7%。结合准确性奖励，我们在 GSM8K 上分别实现了绝对提升 2.0%、在 MATH500 上提升 4.3%、在 SVAMP 上提升 6.6%、在 Countdown 上提升 25.3%。我们的研究结果强调了 dLLMs 中时间动态的未被开发潜力，并提供了两种简单而有效的工具来加以利用。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-12 17:59:57 UTC 发布：2025-08-12 17:59:57 UTC

Text-guided color editing in images and videos is a fundamental yet unsolved problem, requiring fine-grained manipulation of color attributes, including albedo, light source color, and ambient lighting, while preserving physical consistency in geometry, material properties, and light-matter interactions. Existing training-free methods offer broad applicability across editing tasks but struggle with precise color control and often introduce visual inconsistency in both edited and non-edited regions. In this work, we present ColorCtrl, a training-free color editing method that leverages the attention mechanisms of modern Multi-Modal Diffusion Transformers (MM-DiT). By disentangling structure and color through targeted manipulation of attention maps and value tokens, our method enables accurate and consistent color editing, along with word-level control of attribute intensity. Our method modifies only the intended regions specified by the prompt, leaving unrelated areas untouched. Extensive experiments on both SD3 and FLUX.1-dev demonstrate that ColorCtrl outperforms existing training-free approaches and achieves state-of-the-art performances in both edit quality and consistency. Furthermore, our method surpasses strong commercial models such as FLUX.1 Kontext Max and GPT-4o Image Generation in terms of consistency. When extended to video models like CogVideoX, our approach exhibits greater advantages, particularly in maintaining temporal coherence and editing stability. Finally, our method also generalizes to instruction-based editing diffusion models such as Step1X-Edit and FLUX.1 Kontext dev, further demonstrating its versatility. 基于文本的图像和视频颜色编辑是一个基础但尚未解决的问题，要求对颜色属性进行细粒度操控，包括反照率、光源颜色和环境照明，同时在几何、材料属性和光物相互作用上保持物理一致性。现有的无训练方法在各类编辑任务上具有广泛适用性，但在精确颜色控制方面表现欠佳，且常在被编辑和未被编辑区域引入视觉不一致。在本工作中，我们提出了 ColorCtrl，一种无训练的颜色编辑方法，利用现代多模态扩散变换器（MM-DiT）的注意力机制。通过对注意力图和 value token 的有针对性操作来解耦结构与颜色，我们的方法实现了精确且一致的颜色编辑，并能按词级控制属性强度。我们的方法仅修改提示中指定的目标区域，保持无关区域不变。在 SD3 和 FLUX.1-dev 上的大量实验证明，ColorCtrl 优于现有的无训练方法，在编辑质量和一致性上都达到了最先进的性能。此外，我们的方法在一致性方面超越了诸如 FLUX.1 Kontext Max 和 GPT-4o Image Generation 等强大的商业模型。将本方法扩展到像 CogVideoX 这样的动态图像模型时，我们的方法显示出更大的优势，尤其是在保持时间连贯性和编辑稳定性方面。最后，我们的方法也能推广到基于指令的编辑扩散模型，如 Step1X-Edit 和 FLUX.1 Kontext dev，进一步证明了其多用途性。

Subjects: Graphics, Artificial Intelligence, Computer Vision and Pattern Recognition 主题：图形学、人工智能、计算机视觉与模式识别

Publish: 2025-08-12 17:57:04 UTC

#44 Towards Universal Neural Inference #44 迈向通用神经推理

Authors: [Shreyas Bhat Brahmavar](https://arxiv.org/search/?searchtype=author&query=Shreyas Bhat Brahmavar), [Yang Li](https://arxiv.org/search/?searchtype=author&query=Yang Li), [Junier Oliva](https://arxiv.org/search/?searchtype=author&query=Junier Oliva) 作者：Shreyas Bhat Brahmavar、Yang Li、Junier Oliva

Real-world data often appears in diverse, disjoint forms – with varying schemas, inconsistent semantics, and no fixed feature ordering – making it challenging to build general-purpose models that can leverage information across datasets. We introduce ASPIRE, Arbitrary Set-based Permutation-Invariant Reasoning Engine, a Universal Neural Inference model for semantic reasoning and prediction over heterogeneous structured data. ASPIRE combines a permutation-invariant, set-based Transformer with a semantic grounding module that incorporates natural language descriptions, dataset metadata, and in-context examples to learn cross-dataset feature dependencies. This architecture allows ASPIRE to ingest arbitrary sets of feature–value pairs and support examples, align semantics across disjoint tables, and make predictions for any specified target. Once trained, ASPIRE generalizes to new inference tasks without additional tuning. In addition to delivering strong results across diverse benchmarks, ASPIRE naturally supports cost-aware active feature acquisition in an open-world setting, selecting informative features under test-time budget constraints for an arbitrary unseen dataset. These capabilities position ASPIRE as a step toward truly universal, semantics-aware inference over structured data. 现实世界的数据通常以多样且不相交的形式出现——具有不同的模式、语义不一致且没有固定的特征顺序——这使得构建能够跨数据集利用信息的通用模型变得具有挑战性。我们引入了 ASPIRE（Arbitrary Set-based Permutation-Invariant Reasoning Engine），一种用于异构结构化数据上的语义推理和预测的通用神经推理模型。ASPIRE 将一种对排列不变的基于集合的 Transformer 与一个语义落地模块相结合，该模块融合了自然语言描述、数据集元数据和上下文示例，以学习跨数据集的特征依赖关系。该架构使 ASPIRE 能够摄取任意集合的特征—值对和支持示例，对不相交表格之间的语义进行对齐，并对任意指定的目标进行预测。训练完成后，ASPIRE 能在无需额外调优的情况下推广到新的推理任务。除了在多样基准上取得优异结果外，ASPIRE 还自然支持在开放世界设置下的成本感知主动特征获取，在测试时的预算约束下为任意未知数据集选择信息性特征。这些能力使 ASPIRE 成为朝着真正通用、具备语义感知的结构化数据推理迈进的一步。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-12 17:26:48 UTC 发布时间：2025-08-12 17:26:48 UTC

#45 SPARC: Soft Probabilistic Adaptive multi-interest Retrieval Model via Codebooks for recommender system #45 SPARC：通过码本实现的用于推荐系统的软概率自适应多兴趣检索模型

Authors: [Jialiang Shi](https://arxiv.org/search/?searchtype=author&query=Jialiang Shi), [Yaguang Dou](https://arxiv.org/search/?searchtype=author&query=Yaguang Dou), [Tian Qi](https://arxiv.org/search/?searchtype=author&query=Tian Qi) 作者：石家亮、窦亚光、齐天

Modeling multi-interests has arisen as a core problem in real-world RS. Current multi-interest retrieval methods pose three major challenges: 1) Interests, typically extracted from predefined external knowledge, are invariant. Failed to dynamically evolve with users’ real-time consumption preferences. 2) Online inference typically employs an over-exploited strategy, mainly matching users’ existing interests, lacking proactive exploration and discovery of novel and long-tail interests. To address these challenges, we propose a novel retrieval framework named SPARC(Soft Probabilistic Adaptive Retrieval Model via Codebooks). Our contribution is two folds. First, the framework utilizes Residual Quantized Variational Autoencoder (RQ-VAE) to construct a discretized interest space. It achieves joint training of the RQ-VAE with the industrial large scale recommendation model, mining behavior-aware interests that can perceive user feedback and evolve dynamically. Secondly, a probabilistic interest module that predicts the probability distribution over the entire dynamic and discrete interest space. This facilitates an efficient “soft-search” strategy during online inference, revolutionizing the retrieval paradigm from “passive matching” to “proactive exploration” and thereby effectively promoting interest discovery. Online A/B tests on an industrial platform with tens of millions daily active users, have achieved substantial gains in business metrics: +0.9% increase in user view duration, +0.4% increase in user page views (PV), and a +22.7% improvement in PV500(new content reaching 500 PVs in 24 hours). Offline evaluations are conducted on open-source Amazon Product datasets. Metrics, such as Recall@K and Normalized Discounted Cumulative Gain@K(NDCG@K), also showed consistent improvement. Both online and offline experiments validate the efficacy and practical value of the proposed method. 在现实推荐系统中，多兴趣建模已成为核心问题。当前的多兴趣检索方法存在三大挑战：1）兴趣通常从预定义的外部知识中提取，保持不变，无法随着用户实时消费偏好动态演化。2）在线推断通常采用过度利用的策略，主要匹配用户已有兴趣，缺乏主动探索和发现新颖及长尾兴趣。为了解决这些挑战，我们提出了一种名为 SPARC（通过码本的软概率自适应检索模型）的新检索框架。我们的贡献有两方面。首先，该框架利用残差量化变分自编码器（RQ-VAE）构建离散化兴趣空间。它实现了 RQ-VAE 与工业级大规模推荐模型的联合训练，挖掘能够感知用户反馈并动态演化的行为感知兴趣。其次，提出了一个概率兴趣模块，预测整个动态且离散兴趣空间上的概率分布。这在在线推理期间促成了一种高效的“软搜索”策略，将检索范式从“被动匹配”革新为“主动探索”，从而有效促进兴趣发现。在一个拥有数千万日活跃用户的工业平台上进行的在线 A/B 测试，在业务指标上取得了显著提升：用户观看时长增加了 0.9%，用户页面浏览量（PV）增加了 0.4%，以及 PV500（新内容在 24 小时内达到 500 次浏览）的提升为 22.7%。离线评估在开源的亚马逊商品数据集上进行。Recall@K 和归一化折扣累积增益@K（NDCG@K）等指标也表现出一致的提升。线上和线下实验均验证了所提方法的有效性和实际价值。

Subjects: Information Retrieval, Artificial Intelligence 主题：信息检索，人工智能

Publish: 2025-08-12 17:16:37 UTC 发布：2025-08-12 17:16:37 协调世界时

#46 Dynamic Uncertainty-aware Multimodal Fusion for Outdoor Health Monitoring #46 动态不确定性感知多模态融合用于户外健康监测

Authors: [Zihan Fang](https://arxiv.org/search/?searchtype=author&query=Zihan Fang), [Zheng Lin](https://arxiv.org/search/?searchtype=author&query=Zheng Lin), [Senkang Hu](https://arxiv.org/search/?searchtype=author&query=Senkang Hu), [Yihang Tao](https://arxiv.org/search/?searchtype=author&query=Yihang Tao), [Yiqin Deng](https://arxiv.org/search/?searchtype=author&query=Yiqin Deng), [Xianhao Chen](https://arxiv.org/search/?searchtype=author&query=Xianhao Chen), [Yuguang Fang](https://arxiv.org/search/?searchtype=author&query=Yuguang Fang) 作者：方子涵、林征、胡森康、陶翊航、邓奕勤、陈宪豪、方宇广

Outdoor health monitoring is essential to detect early abnormal health status for safeguarding human health and safety. Conventional outdoor monitoring relies on static multimodal deep learning frameworks, which requires extensive data training from scratch and fails to capture subtle health status changes. Multimodal large language models (MLLMs) emerge as a promising alternative, utilizing only small datasets to fine-tune pre-trained information-rich models for enabling powerful health status monitoring. Unfortunately, MLLM-based outdoor health monitoring also faces significant challenges: I) sensor data contains input noise stemming from sensor data acquisition and fluctuation noise caused by sudden changes in physiological signals due to dynamic outdoor environments, thus degrading the training performance; ii) current transformer based MLLMs struggle to achieve robust multimodal fusion, as they lack a design for fusing the noisy modality; iii) modalities with varying noise levels hinder accurate recovery of missing data from fluctuating distributions. To combat these challenges, we propose an uncertainty-aware multimodal fusion framework, named DUAL-Health, for outdoor health monitoring in dynamic and noisy environments. First, to assess the impact of noise, we accurately quantify modality uncertainty caused by input and fluctuation noise with current and temporal features. Second, to empower efficient muitimodal fusion with low-quality modalities,we customize the fusion weight for each modality based on quantified and calibrated uncertainty. Third, to enhance data recovery from fluctuating noisy modalities, we align modality distributions within a common semantic space. Extensive experiments demonstrate that our DUAL-Health outperforms state-of-the-art baselines in detection accuracy and robustness. 户外健康监测对于及早发现异常健康状况以保障人类健康与安全至关重要。传统的户外监测依赖静态的多模态深度学习框架，这些框架需要从头大量训练数据，且无法捕捉细微的健康状态变化。多模态大语言模型（MLLMs）作为有前景的替代方案出现，只需少量数据即可微调预训练的信息丰富模型，从而实现强大的健康状态监测。不幸的是，基于 MLLM 的户外健康监测也面临重大挑战：I）传感器数据包含来自传感器采集的输入噪声以及由动态户外环境导致生理信号突发变化产生的波动噪声，从而降低训练性能；II）现有基于 Transformer 的 MLLM 在实现稳健的多模态融合方面存在困难，因为它们缺乏针对融合噪声模态的设计；III）具有不同噪声水平的模态阻碍了对来自波动分布的缺失数据的准确恢复。为应对这些挑战，我们提出了一种面向动态且嘈杂环境下户外健康监测的不确定性感知多模态融合框架，命名为 DUAL-Health。首先，为了评估噪声的影响，我们通过当前与时间特征精确量化由输入噪声和波动噪声引起的模态不确定性。其次，为了在低质量模态下实现高效的多模态融合，我们基于量化并校准的不确定性为每个模态定制融合权重。第三，为了增强来自波动噪声模态的数据恢复能力，我们将模态分布对齐到一个共同的语义空间中。大量实验证明，我们的 DUAL-Health 在检测准确性和鲁棒性方面优于最先进的基线方法。

Subjects: Networking and Internet Architecture, Artificial Intelligence, Machine Learning 学科：网络与互联网架构、人工智能、机器学习

Publish: 2025-08-12 17:07:27 UTC 发布：2025-08-12 17:07:27 协调世界时 (UTC)

#47 Can We Trust AI to Govern AI? Benchmarking LLM Performance on Privacy and AI Governance Exams #47 我们能信任 AI 来治理 AI 吗？在隐私与 AI 治理考试上对 LLM 性能的基准测试

Authors: [Zane Witherspoon](https://arxiv.org/search/?searchtype=author&query=Zane Witherspoon), [Thet Mon Aye](https://arxiv.org/search/?searchtype=author&query=Thet Mon Aye), [YingYing Hao](https://arxiv.org/search/?searchtype=author&query=YingYing Hao) 作者：Zane Witherspoon、Thet Mon Aye、YingYing Hao

The rapid emergence of large language models (LLMs) has raised urgent questions across the modern workforce about this new technology’s strengths, weaknesses, and capabilities. For privacy professionals, the question is whether these AI systems can provide reliable support on regulatory compliance, privacy program management, and AI governance. In this study, we evaluate ten leading open and closed LLMs, including models from OpenAI, Anthropic, Google DeepMind, Meta, and DeepSeek, by benchmarking their performance on industry-standard certification exams: CIPP/US, CIPM, CIPT, and AIGP from the International Association of Privacy Professionals (IAPP). Each model was tested using official sample exams in a closed-book setting and compared to IAPP’s passing thresholds. Our findings show that several frontier models such as Gemini 2.5 Pro and OpenAI’s GPT-5 consistently achieve scores exceeding the standards for professional human certification - demonstrating substantial expertise in privacy law, technical controls, and AI governance. The results highlight both the strengths and domain-specific gaps of current LLMs and offer practical insights for privacy officers, compliance leads, and technologists assessing the readiness of AI tools for high-stakes data governance roles. This paper provides an overview for professionals navigating the intersection of AI advancement and regulatory risk and establishes a machine benchmark based on human-centric evaluations. 大型语言模型（LLMs）的快速崛起在现代职场中引发了关于这项新技术的优势、劣势和能力的紧迫问题。对于隐私专业人士而言，问题在于这些人工智能系统是否能够在法规合规、隐私项目管理和人工智能治理方面提供可靠支持。在本研究中，我们评估了十款领先的开放与封闭式 LLMs，包括来自 OpenAI、Anthropic、Google DeepMind、Meta 和 DeepSeek 的模型，通过将它们在行业标准认证考试上的表现进行基准测试：国际隐私专业人士协会（IAPP）的 CIPP/US、CIPM、CIPT 和 AIGP。每款模型在闭卷环境下使用官方样题进行测试，并与 IAPP 的及格标准进行比较。我们的研究结果显示，多款前沿模型如 Gemini 2.5 Pro 和 OpenAI 的 GPT-5 在分数上持续超过专业人员认证的标准——展示了在隐私法、技术控制和人工智能治理方面的显著专业能力。这些结果既突出了当前 LLMs 的优势，也揭示了其在特定领域的不足，并为评估 AI 工具在高风险数据治理角色中的准备度的隐私官、合规负责人和技术人员提供了实用见解。本文为在 AI 进步与监管风险交汇处工作的专业人士提供了概览，并基于以人为本的评估建立了一个机器基准。

Subjects: Computers and Society, Artificial Intelligence 主题：计算机与社会，人工智能

Publish: 2025-08-12 15:57:22 UTC 发布：2025-08-12 15:57:22 UTC

#48 Spatial Traces: Enhancing VLA Models with Spatial-Temporal Understanding #48 空间轨迹：通过时空理解增强 VLA 模型

Authors: [Maxim A. Patratskiy](https://arxiv.org/search/?searchtype=author&query=Maxim A. Patratskiy), [Alexey K. Kovalev](https://arxiv.org/search/?searchtype=author&query=Alexey K. Kovalev), [Aleksandr I. Panov](https://arxiv.org/search/?searchtype=author&query=Aleksandr I. Panov) 作者：Maxim A. Patratskiy, Alexey K. Kovalev, Aleksandr I. Panov

Vision-Language-Action models have demonstrated remarkable capabilities in predicting agent movements within virtual environments and real-world scenarios based on visual observations and textual instructions. Although recent research has focused on enhancing spatial and temporal understanding independently, this paper presents a novel approach that integrates both aspects through visual prompting. We introduce a method that projects visual traces of key points from observations onto depth maps, enabling models to capture both spatial and temporal information simultaneously. The experiments in SimplerEnv show that the mean number of tasks successfully solved increased for 4% compared to SpatialVLA and 19% compared to TraceVLA. Furthermore, we show that this enhancement can be achieved with minimal training data, making it particularly valuable for real-world applications where data collection is challenging. The project page is available at https://ampiromax.github.io/ST-VLA. 视觉-语言-动作模型在根据视觉观测和文本指令预测虚拟环境和现实场景中智能体运动方面表现出显著能力。尽管近期研究主要独立提升空间和时间理解，本文提出了一种通过视觉提示将两者结合的创新方法。我们引入了一种方法，将观测中关键点的视觉轨迹投影到深度图上，使模型能够同时捕捉空间和时间信息。在 SimplerEnv 中的实验表明，成功完成任务的平均数量较 SpatialVLA 提高了 4%，较 TraceVLA 提高了 19%。此外，我们展示了这种改进可以通过极少的训练数据实现，这对于数据采集困难的现实应用尤为有价值。项目页面可在 https://ampiromax.github.io/ST-VLA 查阅。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Robotics 主题：计算机视觉与模式识别，人工智能，机器人学

Publish: 2025-08-12 15:53:45 UTC 发布：2025-08-12 15:53:45 UTC

#49 E3-Rewrite: Learning to Rewrite SQL for Executability, Equivalence,and Efficiency #49 E3-Rewrite：学习重写 SQL 以实现可执行性、等价性和高效性

Subjects: Databases, Artificial Intelligence, Computation and Language 主题：数据库，人工智能，计算与语言

Publish: 2025-08-12 15:38:10 UTC 发布：2025-08-12 15:38:10 UTC

#50 When Deepfakes Look Real: Detecting AI-Generated Faces with Unlabeled Data due to Annotation Challenges #50 当深度伪造看起来真实：在注释挑战下使用无标注数据检测 AI 生成面孔

Authors: [Zhiqiang Yang](https://arxiv.org/search/?searchtype=author&query=Zhiqiang Yang), [Renshuai Tao](https://arxiv.org/search/?searchtype=author&query=Renshuai Tao), [Xiaolong Zheng](https://arxiv.org/search/?searchtype=author&query=Xiaolong Zheng), [Guodong Yang](https://arxiv.org/search/?searchtype=author&query=Guodong Yang), [Chunjie Zhang](https://arxiv.org/search/?searchtype=author&query=Chunjie Zhang) 作者：杨志强，陶仁帅，郑晓龙，杨国栋，张春杰

Existing deepfake detection methods heavily depend on labeled training data. However, as AI-generated content becomes increasingly realistic, even \textbf{human annotators struggle to distinguish} between deepfakes and authentic images. This makes the labeling process both time-consuming and less reliable. Specifically, there is a growing demand for approaches that can effectively utilize large-scale unlabeled data from online social networks. Unlike typical unsupervised learning tasks, where categories are distinct, AI-generated faces closely mimic real image distributions and share strong similarities, causing performance drop in conventional strategies. In this paper, we introduce the Dual-Path Guidance Network (DPGNet), to tackle two key challenges: (1) bridging the domain gap between faces from different generation models, and (2) utilizing unlabeled image samples. The method features two core modules: text-guided cross-domain alignment, which uses learnable prompts to unify visual and textual embeddings into a domain-invariant feature space, and curriculum-driven pseudo label generation, which dynamically exploit more informative unlabeled samples. To prevent catastrophic forgetting, we also facilitate bridging between domains via cross-domain knowledge distillation. Extensive experiments on \textbf{11 popular datasets}, show that DPGNet outperforms SoTA approaches by \textbf{6.3%}, highlighting its effectiveness in leveraging unlabeled data to address the annotation challenges posed by the increasing realism of deepfakes. 现有的深度伪造检测方法在很大程度上依赖有标签的训练数据。然而，随着 AI 生成内容变得愈发逼真，甚至人类标注者也难以区分深度伪造与真实图像。这使得标注过程既耗时又不可靠。具体而言，越来越需要能够有效利用来自在线社交网络的大规模无标签数据的方法。与典型的无监督学习任务中类别彼此区分不同，AI 生成的人脸高度模拟真实图像分布并具有强相似性，导致传统策略的性能下降。在本文中，我们提出了双路径引导网络（DPGNet），以解决两个关键挑战：（1）弥合来自不同生成模型的人脸之间的域差距，和（2）利用无标签图像样本。该方法包含两个核心模块：基于文本的跨域对齐，使用可学习的提示将视觉和文本嵌入统一到域不变的特征空间；以及基于课程的伪标签生成，动态地利用更具信息性的无标签样本。为了防止灾难性遗忘，我们还通过跨域知识蒸馏促进域间桥接。对 11 个热门数据集的大量实验表明，DPGNet 比最先进方法高出 6.3%，突显了其在利用未标注数据应对深度伪造日益逼真带来的标注挑战方面的有效性。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-08-12 15:37:17 UTC 发布：2025-08-12 15:37:17 UTC

#51 Attacks and Defenses Against LLM Fingerprinting #51 针对 LLM 指纹识别的攻击与防御

Authors: [Kevin Kurian](https://arxiv.org/search/?searchtype=author&query=Kevin Kurian), [Ethan Holland](https://arxiv.org/search/?searchtype=author&query=Ethan Holland), [Sean Oesch](https://arxiv.org/search/?searchtype=author&query=Sean Oesch) 作者：Kevin Kurian、Ethan Holland、Sean Oesch

As large language models are increasingly deployed in sensitive environments, fingerprinting attacks pose significant privacy and security risks. We present a study of LLM fingerprinting from both offensive and defensive perspectives. Our attack methodology uses reinforcement learning to automatically optimize query selection, achieving better fingerprinting accuracy with only 3 queries compared to randomly selecting 3 queries from the same pool. Our defensive approach employs semantic-preserving output filtering through a secondary LLM to obfuscate model identity while maintaining semantic integrity. The defensive method reduces fingerprinting accuracy across tested models while preserving output quality. These contributions show the potential to improve fingerprinting tools capabilities while providing practical mitigation strategies against fingerprinting attacks. 随着大型语言模型在敏感环境中的广泛部署，指纹识别攻击带来了重大的隐私和安全风险。我们从攻击和防御两个角度对 LLM 指纹识别进行了研究。我们的攻击方法使用强化学习自动优化查询选择，在仅使用 3 次查询的情况下，相较于从相同候选池中随机选择 3 次查询，能实现更高的指纹识别准确率。我们的防御方法通过第二个 LLM 对输出进行语义保留的过滤，以混淆模型身份同时保持语义完整性。该防御方法在保持输出质量的同时，降低了在测试模型上的指纹识别准确率。这些贡献展示了提升指纹识别工具能力的潜力，并提供了针对指纹识别攻击的实用缓解策略。

Subjects: Cryptography and Security, Artificial Intelligence, Machine Learning 主题：密码学与安全、人工智能、机器学习

Publish: 2025-08-12 15:36:36 UTC 发布时间：2025-08-12 15:36:36 UTC

#52 LyS at SemEval 2025 Task 8: Zero-Shot Code Generation for Tabular QA #52 LyS 在 SemEval 2025 任务 8：面向表格问答的零样本代码生成

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-12 15:25:31 UTC 发布时间：2025-08-12 15:25:31 UTC

#53 Retrospective Sparse Attention for Efficient Long-Context Generation #53 面向高效长上下文生成的回顾性稀疏注意力

Large Language Models (LLMs) are increasingly deployed in long-context tasks such as reasoning, code generation, and multi-turn dialogue. However, inference over extended contexts is bottlenecked by the Key-Value (KV) cache, whose memory footprint grows linearly with sequence length and dominates latency at each decoding step. While recent KV cache compression methods identify and load important tokens, they focus predominantly on input contexts and fail to address the cumulative attention errors that arise during long decoding. In this paper, we introduce RetroAttention, a novel KV cache update technique that retrospectively revises past attention outputs using newly arrived KV entries from subsequent decoding steps. By maintaining a lightweight output cache, RetroAttention enables past queries to efficiently access more relevant context, while incurring minimal latency overhead. This breaks the fixed-attention-output paradigm and allows continual correction of prior approximations. Extensive experiments on long-generation benchmarks show that RetroAttention consistently outperforms state-of-the-art (SOTA) KV compression methods, increasing effective KV exposure by up to 1.6× and accuracy by up to 21.9%. 大型语言模型 (LLMs) 越来越多地被用于长上下文任务，例如推理、代码生成和多轮对话。然而，对扩展上下文的推理受限于键值（KV）缓存，其内存占用随序列长度线性增长，并在每个解码步骤占据主要延迟。尽管近期的 KV 缓存压缩方法识别并加载重要的 token，但它们主要关注输入上下文，未能解决在长时间解码过程中产生的累积注意力误差。本文提出了 RetroAttention，一种新颖的 KV 缓存更新技术，通过利用随后解码步骤中新到达的 KV 条目回溯修正过去的注意力输出。通过维护一个轻量的输出缓存，RetroAttention 使得过去的查询能够高效地访问更相关的上下文，同时只带来极小的延迟开销。这打破了固定注意力输出的范式，并允许持续修正先前的近似。在长文本生成基准上的大量实验表明，RetroAttention 始终优于最先进的（SOTA）KV 压缩方法，有效 KV 暴露量最多提高了 1.6 × ，准确率最多提高了 21.9%。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-12 15:11:47 UTC 发布：2025-08-12 15:11:47 UTC

#54 Rational Inverse Reasoning #54 理性逆向推理

Authors: [Ben Zandonati](https://arxiv.org/search/?searchtype=author&query=Ben Zandonati), [Tomás Lozano-Pérez](https://arxiv.org/search/?searchtype=author&query=Tomás Lozano-Pérez), [Leslie Pack Kaelbling](https://arxiv.org/search/?searchtype=author&query=Leslie Pack Kaelbling) 作者：Ben Zandonati、Tomás Lozano-Pérez、Leslie Pack Kaelbling

Humans can observe a single, imperfect demonstration and immediately generalize to very different problem settings. Robots, in contrast, often require hundreds of examples and still struggle to generalize beyond the training conditions. We argue that this limitation arises from the inability to recover the latent explanations that underpin intelligent behavior, and that these explanations can take the form of structured programs consisting of high-level goals, sub-task decomposition, and execution constraints. In this work, we introduce Rational Inverse Reasoning (RIR), a framework for inferring these latent programs through a hierarchical generative model of behavior. RIR frames few-shot imitation as Bayesian program induction: a vision-language model iteratively proposes structured symbolic task hypotheses, while a planner-in-the-loop inference scheme scores each by the likelihood of the observed demonstration under that hypothesis. This loop yields a posterior over concise, executable programs. We evaluate RIR on a suite of continuous manipulation tasks designed to test one-shot and few-shot generalization across variations in object pose, count, geometry, and layout. With as little as one demonstration, RIR infers the intended task structure and generalizes to novel settings, outperforming state-of-the-art vision-language model baselines. 人类可以观察到一个不完美的示范后，立即推广到非常不同的问题情境中。相比之下，机器人通常需要数百个示例，仍然难以在训练条件之外进行泛化。我们认为，这一局限源于无法恢复支撑智能行为的潜在解释，而这些解释可以采取结构化程序的形式，包括高层目标、子任务分解和执行约束。在这项工作中，我们引入了理性逆向推理（RIR），这是一个通过行为的分层生成模型来推断这些潜在程序的框架。RIR 将少样本模仿表述为贝叶斯程序归纳：一种视觉-语言模型迭代地提出结构化的符号任务假设，同时一个带规划器的循环推理方案根据在该假设下观察到的示范的可能性对每个假设进行打分。该循环产生关于简洁且可执行程序的后验分布。我们在一套连续操控任务上评估了 RIR，这些任务旨在测试在物体姿态、数量、几何形状和布局变化下的一次和少次泛化能力。仅凭一次示范，RIR 就能推断出预期的任务结构并泛化到新情境，其表现优于最先进的视觉-语言模型基线。

Subjects: Robotics, Artificial Intelligence 主题：机器人学，人工智能

Publish: 2025-08-12 14:49:44 UTC 发表：2025-08-12 14:49:44 世界协调时间

#55 Unsupervised Skill Discovery as Exploration for Learning Agile Locomotion #55 无监督技能发现：作为学习敏捷行走的探索

Authors: [Seungeun Rho](https://arxiv.org/search/?searchtype=author&query=Seungeun Rho), [Kartik Garg](https://arxiv.org/search/?searchtype=author&query=Kartik Garg), [Morgan Byrd](https://arxiv.org/search/?searchtype=author&query=Morgan Byrd), [Sehoon Ha](https://arxiv.org/search/?searchtype=author&query=Sehoon Ha) 作者：Seungeun Rho、Kartik Garg、Morgan Byrd、Sehoon Ha

Exploration is crucial for enabling legged robots to learn agile locomotion behaviors that can overcome diverse obstacles. However, such exploration is inherently challenging, and we often rely on extensive reward engineering, expert demonstrations, or curriculum learning - all of which limit generalizability. In this work, we propose Skill Discovery as Exploration (SDAX), a novel learning framework that significantly reduces human engineering effort. SDAX leverages unsupervised skill discovery to autonomously acquire a diverse repertoire of skills for overcoming obstacles. To dynamically regulate the level of exploration during training, SDAX employs a bi-level optimization process that autonomously adjusts the degree of exploration. We demonstrate that SDAX enables quadrupedal robots to acquire highly agile behaviors including crawling, climbing, leaping, and executing complex maneuvers such as jumping off vertical walls. Finally, we deploy the learned policy on real hardware, validating its successful transfer to the real world. 探索对于使多足机器人学会能够克服各种障碍的敏捷运动行为至关重要。然而，这类探索本质上具有挑战性，我们常常依赖大量的奖励工程、专家示范或课程学习——所有这些都会限制泛化能力。在这项工作中，我们提出了作为探索手段的技能发现（Skill Discovery as Exploration，SDAX），这是一种显著减少人工工程工作的新型学习框架。SDAX 利用无监督的技能发现自主获取多样化的技能库以克服障碍。为了在训练过程中动态调节探索程度，SDAX 采用了一种双层优化过程，能自主调整探索强度。我们展示了 SDAX 能使四足机器人获得高度敏捷的行为，包括爬行、攀爬、跃起，以及像从垂直墙面跳跃这样的复杂动作。最后，我们将学到的策略部署到真实硬件上，验证了其成功转移到现实世界。

Subjects: Robotics, Artificial Intelligence, Machine Learning 主题：机器人学、人工智能、机器学习

Publish: 2025-08-12 14:49:25 UTC 发布时间：2025-08-12 14:49:25 UTC

#56 Urban-STA4CLC: Urban Theory-Informed Spatio-Temporal Attention Model for Predicting Post-Disaster Commercial Land Use Change #56 Urban-STA4CLC：一种基于城市理论的时空注意力模型用于预测灾后商业用地变化

Authors: [Ziyi Guo](https://arxiv.org/search/?searchtype=author&query=Ziyi Guo), [Yan Wang](https://arxiv.org/search/?searchtype=author&query=Yan Wang) 作者：郭子怡，王晏

Natural disasters such as hurricanes and wildfires increasingly introduce unusual disturbance on economic activities, which are especially likely to reshape commercial land use pattern given their sensitive to customer visitation. However, current modeling approaches are limited in capturing such complex interplay between human activities and commercial land use change under and following disturbances. Such interactions have been more effectively captured in current resilient urban planning theories. This study designs and calibrates a Urban Theory-Informed Spatio-Temporal Attention Model for Predicting Post-Disaster Commercial Land Use Change (Urban-STA4CLC) to predict both the yearly decline and expansion of commercial land use at census block level under cumulative impact of disasters on human activities over two years. Guided by urban theories, Urban-STA4CLC integrates both spatial and temporal attention mechanisms with three theory-informed modules. Resilience theory guides a disaster-aware temporal attention module that captures visitation dynamics. Spatial economic theory informs a multi-relational spatial attention module for inter-block representation. Diffusion theory contributes a regularization term that constrains land use transitions. The model performs significantly better than non-theoretical baselines in predicting commercial land use change under the scenario of recurrent hurricanes, with around 19% improvement in F1 score (0.8763). The effectiveness of the theory-guided modules was further validated through ablation studies. The research demonstrates that embedding urban theory into commercial land use modeling models may substantially enhance the capacity to capture its gains and losses. These advances in commercial land use modeling contribute to land use research that accounts for cumulative impacts of recurrent disasters and shifts in economic activity patterns. 飓风和野火等自然灾害日益对经济活动造成异常扰动，鉴于商业活动对顾客来访的敏感性，这些扰动尤其可能重塑商业用地格局。然而，当前的建模方法在捕捉扰动期间及之后人类活动与商业用地变化之间的复杂相互作用方面存在局限。当前的韧性城市规划理论更有效地捕捉了此类相互作用。本研究设计并校准了一种面向城市理论的时空注意力模型，用于预测灾后商业用地变化（Urban-STA4CLC），以在普查街区层面预测在灾害对人类活动两年累积影响下商业用地的年度衰退与扩张。受城市理论指引，Urban-STA4CLC 将空间与时间注意力机制与三个基于理论的模块相结合。韧性理论指导了一个灾害感知的时间注意力模块，用于捕捉来访动态。空间经济理论为多重关系空间注意力模块提供了依据，用于区块间表征。扩散理论提供了一个正则项，用以约束土地利用的转变。该模型在反复发生飓风情景下预测商业用地变化方面，表现显著优于无理论支撑的基线模型，F1 分数提升约 19%（0.8763）。通过消融研究进一步验证了理论引导模块的有效性。研究表明，将城市理论嵌入商业用地建模中，能够显著增强模型捕捉其增减变化的能力。这些在商业用地建模方面的进展，有助于将反复灾害的累积影响和经济活动模式变化纳入土地利用研究。

Subjects: Computers and Society, Artificial Intelligence 主题：计算机与社会，人工智能

Publish: 2025-08-12 14:39:42 UTC 发布：2025-08-12 14:39:42 UTC

#57 Revealing the Role of Audio Channels in ASR Performance Degradation #57 揭示音频通道在语音识别性能下降中的作用

Subjects: Sound, Artificial Intelligence, Computation and Language 主题：声音、人工智能、计算与语言

Publish: 2025-08-12 14:32:48 UTC 发布时间：2025-08-12 14:32:48 UTC

#58 QAMRO: Quality-aware Adaptive Margin Ranking Optimization for Human-aligned Assessment of Audio Generation Systems #58 QAMRO：面向音频生成系统人类对齐评估的质量感知自适应边距排序优化

Authors: [Chien-Chun Wang](https://arxiv.org/search/?searchtype=author&query=Chien-Chun Wang), [Kuan-Tang Huang](https://arxiv.org/search/?searchtype=author&query=Kuan-Tang Huang), [Cheng-Yeh Yang](https://arxiv.org/search/?searchtype=author&query=Cheng-Yeh Yang), [Hung-Shin Lee](https://arxiv.org/search/?searchtype=author&query=Hung-Shin Lee), [Hsin-Min Wang](https://arxiv.org/search/?searchtype=author&query=Hsin-Min Wang), [Berlin Chen](https://arxiv.org/search/?searchtype=author&query=Berlin Chen) 作者：王建君、黄冠棠、杨承冶、李宏信、王信民、Berlin Chen

Evaluating audio generation systems, including text-to-music (TTM), text-to-speech (TTS), and text-to-audio (TTA), remains challenging due to the subjective and multi-dimensional nature of human perception. Existing methods treat mean opinion score (MOS) prediction as a regression problem, but standard regression losses overlook the relativity of perceptual judgments. To address this limitation, we introduce QAMRO, a novel Quality-aware Adaptive Margin Ranking Optimization framework that seamlessly integrates regression objectives from different perspectives, aiming to highlight perceptual differences and prioritize accurate ratings. Our framework leverages pre-trained audio-text models such as CLAP and Audiobox-Aesthetics, and is trained exclusively on the official AudioMOS Challenge 2025 dataset. It demonstrates superior alignment with human evaluations across all dimensions, significantly outperforming robust baseline models. 评估音频生成系统（包括文本到音乐（TTM）、文本到语音（TTS）和文本到音频（TTA））仍然充满挑战，因为人类感知具有主观性和多维性。现有方法将平均意见分（MOS）预测视为回归问题，但标准回归损失忽视了感知判断的相对性。为了解决这一局限性，我们提出了 QAMRO，一种新颖的质量感知自适应边距排序优化框架，能够无缝整合来自不同视角的回归目标，旨在突出感知差异并优先保证评分准确性。我们的框架利用了诸如 CLAP 和 Audiobox-Aesthetics 等预训练音频-文本模型，并仅在官方 AudioMOS Challenge 2025 数据集上进行训练。在所有维度上，它都表现出与人类评估的更好一致性，显著优于强基线模型。

Subjects: Sound, Artificial Intelligence, Machine Learning 主题：声音，人工智能，机器学习

Publish: 2025-08-12 14:14:04 UTC 发布：2025-08-12 14:14:04 UTC

#59 Generalising Traffic Forecasting to Regions without Traffic Observations #59 将交通预测推广到没有交通观测的区域

Authors: [Xinyu Su](https://arxiv.org/search/?searchtype=author&query=Xinyu Su), [Majid Sarvi](https://arxiv.org/search/?searchtype=author&query=Majid Sarvi), [Feng Liu](https://arxiv.org/search/?searchtype=author&query=Feng Liu), [Egemen Tanin](https://arxiv.org/search/?searchtype=author&query=Egemen Tanin), [Jianzhong Qi](https://arxiv.org/search/?searchtype=author&query=Jianzhong Qi) 作者：苏欣宇、Majid Sarvi、刘锋、Egemen Tanin、齐建中

Traffic forecasting is essential for intelligent transportation systems. Accurate forecasting relies on continuous observations collected by traffic sensors. However, due to high deployment and maintenance costs, not all regions are equipped with such sensors. This paper aims to forecast for regions without traffic sensors, where the lack of historical traffic observations challenges the generalisability of existing models. We propose a model named GenCast, the core idea of which is to exploit external knowledge to compensate for the missing observations and to enhance generalisation. We integrate physics-informed neural networks into GenCast, enabling physical principles to regularise the learning process. We introduce an external signal learning module to explore correlations between traffic states and external signals such as weather conditions, further improving model generalisability. Additionally, we design a spatial grouping module to filter localised features that hinder model generalisability. Extensive experiments show that GenCast consistently reduces forecasting errors on multiple real-world datasets. 交通预测对智能交通系统至关重要。准确的预测依赖于由交通传感器收集的连续观测数据。然而，由于高昂的部署和维护成本，并非所有区域都配备了此类传感器。本文旨在为缺乏交通传感器的区域进行预测，这些区域缺乏历史交通观测数据，给现有模型的泛化能力带来了挑战。我们提出了一种名为 GenCast 的模型，其核心思想是利用外部知识来弥补缺失的观测数据并增强泛化能力。我们将物理信息神经网络集成到 GenCast 中，使物理原理能够对学习过程进行正则化。我们引入了一个外部信号学习模块，以探索交通状态与天气等外部信号之间的相关性，进一步提高模型的泛化能力。此外，我们设计了一个空间分组模块，以过滤阻碍模型泛化的局部特征。大量实验证明，GenCast 在多个真实数据集上持续降低了预测误差。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-12 14:00:12 UTC 发布：2025-08-12 14:00:12 UTC

#60 Train Long, Think Short: Curriculum Learning for Efficient Reasoning #60 长期训练，短期思考：用于高效推理的课程学习

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-12 13:48:03 UTC 发布：2025-08-12 13:48:03 UTC

#61 EGGCodec: A Robust Neural Encodec Framework for EGG Reconstruction and F0 Extraction #61 EGGCodec：一种用于 EGG 重建和 F0 提取的鲁棒神经编码解码框架

Authors: [Rui Feng](https://arxiv.org/search/?searchtype=author&query=Rui Feng), [Yuang Chen](https://arxiv.org/search/?searchtype=author&query=Yuang Chen), [Yu Hu](https://arxiv.org/search/?searchtype=author&query=Yu Hu), [Jun Du](https://arxiv.org/search/?searchtype=author&query=Jun Du), [Jiahong Yuan](https://arxiv.org/search/?searchtype=author&query=Jiahong Yuan) 作者：冯睿、陈远、胡宇、杜军、袁家泓

This letter introduces EGGCodec, a robust neural Encodec framework engineered for electroglottography (EGG) signal reconstruction and F0 extraction. We propose a multi-scale frequency-domain loss function to capture the nuanced relationship between original and reconstructed EGG signals, complemented by a time-domain correlation loss to improve generalization and accuracy. Unlike conventional Encodec models that extract F0 directly from features, EGGCodec leverages reconstructed EGG signals, which more closely correspond to F0. By removing the conventional GAN discriminator, we streamline EGGCodec’s training process without compromising efficiency, incurring only negligible performance degradation. Trained on a widely used EGG-inclusive dataset, extensive evaluations demonstrate that EGGCodec outperforms state-of-the-art F0 extraction schemes, reducing mean absolute error (MAE) from 14.14 Hz to 13.69 Hz, and improving voicing decision error (VDE) by 38.2%. Moreover, extensive ablation experiments validate the contribution of each component of EGGCodec. 本信介绍了 EGGCodec，一种为电声喉振图（EGG）信号重建与基频（F0）提取而设计的鲁棒神经 Encodec 框架。我们提出了一种多尺度频域损失函数，以捕捉原始和重建 EGG 信号之间的细微关系，并辅以时域相关性损失以提升泛化性和精度。不同于直接从特征中提取 F0 的传统 Encodec 模型，EGGCodec 利用重建的 EGG 信号，这些信号与 F0 具有更紧密的对应关系。通过移除传统的 GAN 判别器，我们简化了 EGGCodec 的训练流程而不损失效率，仅带来可忽略的性能下降。在一个广泛使用且包含 EGG 的数据库上训练后，大量评估表明 EGGCodec 优于最先进的 F0 提取方案，将平均绝对误差（MAE）从 14.14 Hz 降低到 13.69 Hz，并将有声判定错误率（VDE）改善了 38.2%。此外，大量消融实验验证了 EGGCodec 各组成部分的贡献。

Subjects: Audio and Speech Processing, Artificial Intelligence 主题：音频与语音处理，人工智能

Publish: 2025-08-12 13:20:25 UTC 发布：2025-08-12 13:20:25 UTC

#62 Shape Completion and Real-Time Visualization in Robotic Ultrasound Spine Acquisitions #62 形状补全与机器人超声脊柱采集中的实时可视化

Authors: [Miruna-Alexandra Gafencu](https://arxiv.org/search/?searchtype=author&query=Miruna-Alexandra Gafencu), [Reem Shaban](https://arxiv.org/search/?searchtype=author&query=Reem Shaban), [Yordanka Velikova](https://arxiv.org/search/?searchtype=author&query=Yordanka Velikova), [Mohammad Farid Azampour](https://arxiv.org/search/?searchtype=author&query=Mohammad Farid Azampour), [Nassir Navab](https://arxiv.org/search/?searchtype=author&query=Nassir Navab) 作者：Miruna-Alexandra Gafencu、Reem Shaban、Yordanka Velikova、Mohammad Farid Azampour、Nassir Navab

Ultrasound (US) imaging is increasingly used in spinal procedures due to its real-time, radiation-free capabilities; however, its effectiveness is hindered by shadowing artifacts that obscure deeper tissue structures. Traditional approaches, such as CT-to-US registration, incorporate anatomical information from preoperative CT scans to guide interventions, but they are limited by complex registration requirements, differences in spine curvature, and the need for recent CT imaging. Recent shape completion methods can offer an alternative by reconstructing spinal structures in US data, while being pretrained on large set of publicly available CT scans. However, these approaches are typically offline and have limited reproducibility. In this work, we introduce a novel integrated system that combines robotic ultrasound with real-time shape completion to enhance spinal visualization. Our robotic platform autonomously acquires US sweeps of the lumbar spine, extracts vertebral surfaces from ultrasound, and reconstructs the complete anatomy using a deep learning-based shape completion network. This framework provides interactive, real-time visualization with the capability to autonomously repeat scans and can enable navigation to target locations. This can contribute to better consistency, reproducibility, and understanding of the underlying anatomy. We validate our approach through quantitative experiments assessing shape completion accuracy and evaluations of multiple spine acquisition protocols on a phantom setup. Additionally, we present qualitative results of the visualization on a volunteer scan. 超声（US）成像由于其实时、无辐射的特点，在脊柱手术中应用越来越广泛；然而，阴影伪影会掩盖更深层的组织结构，从而影响其有效性。传统方法如 CT 到 US 的配准，通过术前 CT 扫描引入解剖信息以指导介入，但这些方法受限于复杂的配准需求、脊柱弯曲差异以及需要近期 CT 成像的限制。近期的形状补全方法可以作为替代，通过在大量公开可用的 CT 扫描上进行预训练，从而重建超声数据中的脊柱结构。然而，这些方法通常是离线的且可复现性有限。在本工作中，我们提出了一种新型集成系统，将机器人超声与实时形状补全相结合以增强脊柱可视化。我们的机器人平台自动采集腰椎的超声扫查，提取超声中的椎体表面，并使用基于深度学习的形状补全网络重建完整的解剖结构。该框架提供交互式、实时可视化，能够自主重复扫描并可实现导航到目标位置。这有助于提高对基础解剖结构的一致性、可重复性和理解。我们通过定量实验验证了我们的方法，评估了形状补全的准确性，并在人体模型设置上评估了多种脊柱采集协议。此外，我们还展示了志愿者扫描的可视化定性结果。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Robotics 主题：计算机视觉与模式识别、人工智能、机器人学

Publish: 2025-08-12 13:19:37 UTC 发表时间：2025-08-12 13:19:37 UTC

#63 Munsit at NADI 2025 Shared Task 2: Pushing the Boundaries of Multidialectal Arabic ASR with Weakly Supervised Pretraining and Continual Supervised Fine-tuning #63 Munsit 在 NADI 2025 共享任务 2：通过弱监督预训练和持续监督微调推动多方言阿拉伯语自动语音识别的边界

Automatic speech recognition (ASR) plays a vital role in enabling natural human-machine interaction across applications such as virtual assistants, industrial automation, customer support, and real-time transcription. However, developing accurate ASR systems for low-resource languages like Arabic remains a significant challenge due to limited labeled data and the linguistic complexity introduced by diverse dialects. In this work, we present a scalable training pipeline that combines weakly supervised learning with supervised fine-tuning to develop a robust Arabic ASR model. In the first stage, we pretrain the model on 15,000 hours of weakly labeled speech covering both Modern Standard Arabic (MSA) and various Dialectal Arabic (DA) variants. In the subsequent stage, we perform continual supervised fine-tuning using a mixture of filtered weakly labeled data and a small, high-quality annotated dataset. Our approach achieves state-of-the-art results, ranking first in the multi-dialectal Arabic ASR challenge. These findings highlight the effectiveness of weak supervision paired with fine-tuning in overcoming data scarcity and delivering high-quality ASR for low-resource, dialect-rich languages. 自动语音识别（ASR）在虚拟助手、工业自动化、客户支持和实时转录等应用中对实现自然的人机交互至关重要。然而，由于标注数据有限以及由多样方言带来的语言复杂性，为阿拉伯语等资源匮乏语言开发准确的 ASR 系统仍然是一项重大挑战。在本研究中，我们提出了一个可扩展的训练流程，结合弱监督学习与有监督微调来开发稳健的阿拉伯语 ASR 模型。在第一阶段，我们在覆盖现代标准阿拉伯语（MSA）和多种方言阿拉伯语（DA）变体的 1.5 万小时弱标注语音上对模型进行预训练。在随后的阶段，我们使用过滤后的弱标注数据与一小部分高质量注释数据的混合，进行持续的有监督微调。我们的方法取得了最先进的结果，在多方言阿拉伯语 ASR 挑战中名列第一。这些发现凸显了在资源匮乏且方言丰富的语言中，弱监督与微调相结合在克服数据稀缺并提供高质量自动语音识别方面的有效性。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-12 13:02:22 UTC 发布：2025-08-12 13:02:22 UTC

#64 ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs #64 ASPD：通过探索 LLMs 的内在并行性解锁自适应串并行解码

The increasing scale and complexity of large language models (LLMs) pose significant inference latency challenges, primarily due to their autoregressive decoding paradigm characterized by the sequential nature of next-token prediction. By re-examining the outputs of autoregressive models, we observed that some segments exhibit parallelizable structures, which we term intrinsic parallelism. Decoding each parallelizable branch simultaneously (i.e. parallel decoding) can significantly improve the overall inference speed of LLMs. In this paper, we propose an Adaptive Serial-Parallel Decoding (ASPD), which addresses two core challenges: automated construction of parallelizable data and efficient parallel decoding mechanism. More specifically, we introduce a non-invasive pipeline that automatically extracts and validates parallelizable structures from the responses of autoregressive models. To empower efficient adaptive serial-parallel decoding, we implement a Hybrid Decoding Engine which enables seamless transitions between serial and parallel decoding modes while maintaining a reusable KV cache, maximizing computational efficiency. Extensive evaluations across General Tasks, Retrieval-Augmented Generation, Mathematical Reasoning, demonstrate that ASPD achieves unprecedented performance in both effectiveness and efficiency. Notably, on Vicuna Bench, our method achieves up to 3.19x speedup (1.85x on average) while maintaining response quality within 1% difference compared to autoregressive models, realizing significant acceleration without compromising generation quality. Our framework sets a groundbreaking benchmark for efficient LLM parallel inference, paving the way for its deployment in latency-sensitive applications such as AI-powered customer service bots and answer retrieval engines. 大型语言模型（LLMs）规模和复杂性的不断增加带来了显著的推理延迟挑战，主要由于其自回归解码范式以逐次预测下一个词元的顺序性为特征。通过重新审视自回归模型的输出，我们观察到某些片段表现出可并行化的结构，我们称之为内在并行性。将每个可并行化的分支同时解码（即并行解码）可以显著提升 LLMs 的整体推理速度。在本文中，我们提出了一种自适应串-并行解码（ASPD），它针对两个核心挑战：可并行化数据的自动构建和高效的并行解码机制。更具体地，我们引入了一个非侵入式管道，自动从自回归模型的响应中提取并验证可并行化结构。为了实现高效的自适应串-并行解码，我们实现了一个混合解码引擎，能够在串行与并行解码模式之间实现无缝切换，同时保持可复用的 KV 缓存，从而最大化计算效率。在通用任务、检索增强生成、数学推理等方面的大量评估表明，ASPD 在效果和效率上都达到了前所未有的表现。值得注意的是，在 Vicuna Bench 上，我们的方法在保持响应质量与自回归模型相比差异在 1% 以内的前提下，最高实现了 3.19 倍的加速（平均 1.85 倍），实现了显著的提速而不牺牲生成质量。我们的框架为高效 LLM 并行推理树立了开创性基准，为其在对延迟敏感的应用场景（例如 AI 驱动的客服机器人和答案检索引擎）中的部署铺平了道路。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-12 12:35:55 UTC 发布：2025-08-12 12:35:55 UTC

#65 Position: Causal Machine Learning Requires Rigorous Synthetic Experiments for Broader Adoption #65 位置：因果机器学习需要严谨的合成实验以实现更广泛的采用

Authors: [Audrey Poinsot](https://arxiv.org/search/?searchtype=author&query=Audrey Poinsot), [Panayiotis Panayiotou](https://arxiv.org/search/?searchtype=author&query=Panayiotis Panayiotou), [Alessandro Leite](https://arxiv.org/search/?searchtype=author&query=Alessandro Leite), [Nicolas Chesneau](https://arxiv.org/search/?searchtype=author&query=Nicolas Chesneau), [Özgür Şimşek](https://arxiv.org/search/?searchtype=author&query=Özgür Şimşek), [Marc Schoenauer](https://arxiv.org/search/?searchtype=author&query=Marc Schoenauer) 作者：Audrey Poinsot、Panayiotis Panayiotou、Alessandro Leite、Nicolas Chesneau、Özgür Şimşek、Marc Schoenauer

Causal machine learning has the potential to revolutionize decision-making by combining the predictive power of machine learning algorithms with the theory of causal inference. However, these methods remain underutilized by the broader machine learning community, in part because current empirical evaluations do not permit assessment of their reliability and robustness, undermining their practical utility. Specifically, one of the principal criticisms made by the community is the extensive use of synthetic experiments. We argue, on the contrary, that synthetic experiments are essential and necessary to precisely assess and understand the capabilities of causal machine learning methods. To substantiate our position, we critically review the current evaluation practices, spotlight their shortcomings, and propose a set of principles for conducting rigorous empirical analyses with synthetic data. Adopting the proposed principles will enable comprehensive evaluations that build trust in causal machine learning methods, driving their broader adoption and impactful real-world use. 因果机器学习通过将机器学习算法的预测能力与因果推断理论相结合，有可能彻底改变决策制定。然而，这些方法在更广泛的机器学习社区中仍未得到充分利用，部分原因是当前的实证评估无法衡量其可靠性和稳健性，从而削弱了其实用价值。具体而言，社区对这些方法的主要批评之一是大量使用合成实验。相反，我们认为，合成实验对于精确评估和理解因果机器学习方法的能力是必要且不可或缺的。为了论证我们的观点，我们对当前的评估实践进行了批判性审视，指出了其不足之处，并提出了一套使用合成数据进行严格实证分析的原则。采用所提原则将使得全面评估成为可能，进而建立对因果机器学习方法的信任，推动其更广泛的采用并在现实世界中产生影响。

Subjects: Machine Learning, Artificial Intelligence, Methodology, Machine Learning 主题：机器学习、人工智能、方法学、机器学习

Publish: 2025-08-12 12:13:13 UTC 发布：2025-08-12 12:13:13 UTC

#66 Entangled in Representations: Mechanistic Investigation of Cultural Biases in Large Language Models #66 纠缠于表征：对大型语言模型中文化偏见的机械化调查

The growing deployment of large language models (LLMs) across diverse cultural contexts necessitates a better understanding of how the overgeneralization of less documented cultures within LLMs’ representations impacts their cultural understanding. Prior work only performs extrinsic evaluation of LLMs’ cultural competence, without accounting for how LLMs’ internal mechanisms lead to cultural (mis)representation. To bridge this gap, we propose Culturescope, the first mechanistic interpretability-based method that probes the internal representations of LLMs to elicit the underlying cultural knowledge space. CultureScope utilizes a patching method to extract the cultural knowledge. We introduce a cultural flattening score as a measure of the intrinsic cultural biases. Additionally, we study how LLMs internalize Western-dominance bias and cultural flattening, which allows us to trace how cultural biases emerge within LLMs. Our experimental results reveal that LLMs encode Western-dominance bias and cultural flattening in their cultural knowledge space. We find that low-resource cultures are less susceptible to cultural biases, likely due to their limited training resources. Our work provides a foundation for future research on mitigating cultural biases and enhancing LLMs’ cultural understanding. Our codes and data used for experiments are publicly available. 在各种文化背景中日益广泛部署的大型语言模型（LLMs）要求我们更好地理解，这些模型在表征中对较少文献记载文化的过度泛化如何影响其文化理解。以往研究仅对 LLMs 的文化能力进行外在评估，未考虑 LLMs 的内部机制如何导致文化的（误）表征。为弥补这一空白，我们提出了 CultureScope，这是首个基于机制可解释性的方法，用以探 probing LLMs 的内部表征以引出其潜在的文化知识空间。CultureScope 利用一种 patching 方法来提取文化知识。我们引入了文化扁平化得分，作为内在文化偏见的度量。此外，我们研究了 LLMs 如何内化西方主导偏见和文化扁平化，这使我们能够追踪文化偏见在 LLMs 内部的出现过程。我们的实验证明，LLMs 在其文化知识空间中编码了西方主导偏见和文化扁平化。我们发现，低资源文化不太容易受到文化偏见的影响，这可能是由于它们有限的训练资源所致。我们的工作为未来在减轻文化偏见和增强 LLMs 文化理解方面的研究提供了基础。我们用于实验的代码和数据已公开可用。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-12 12:05:32 UTC 发布：2025-08-12 12:05:32 UTC

#67 Oblivionis: A Lightweight Learning and Unlearning Framework for Federated Large Language Models #67 Oblivionis：一种用于联邦大语言模型的轻量级学习与遗忘框架 [PDF ] [Copy] [Kimi ] [REL]

Large Language Models (LLMs) increasingly leverage Federated Learning (FL) to utilize private, task-specific datasets for fine-tuning while preserving data privacy. However, while federated LLM frameworks effectively enable collaborative training without raw data sharing, they critically lack built-in mechanisms for regulatory compliance like GDPR’s right to be forgotten. Integrating private data heightens concerns over data quality and long-term governance, yet existing distributed training frameworks offer no principled way to selectively remove specific client contributions post-training. Due to distributed data silos, stringent privacy constraints, and the intricacies of interdependent model aggregation, federated LLM unlearning is significantly more complex than centralized LLM unlearning. To address this gap, we introduce Oblivionis, a lightweight learning and unlearning framework that enables clients to selectively remove specific private data during federated LLM training, enhancing trustworthiness and regulatory compliance. By unifying FL and unlearning as a dual optimization objective, we incorporate 6 FL and 5 unlearning algorithms for comprehensive evaluation and comparative analysis, establishing a robust pipeline for federated LLM unlearning. Extensive experiments demonstrate that Oblivionis outperforms local training, achieving a robust balance between forgetting efficacy and model utility, with cross-algorithm comparisons providing clear directions for future LLM development. 大型语言模型（LLMs）越来越多地利用联邦学习（FL）来利用私有、特定任务的数据进行微调，同时保护数据隐私。然而，尽管联邦 LLM 框架能够在不共享原始数据的情况下实现协同训练，但它们严重缺乏内置的合规机制，例如 GDPR 中的被遗忘权。整合私有数据会加剧对数据质量和长期治理的担忧，但现有的分布式训练框架并未提供一种有原则的方法来在训练后有选择地移除特定客户端的贡献。由于数据存放在分散的孤岛中、严格的隐私约束以及相互依赖的模型聚合的复杂性，联邦 LLM 的“忘记”比集中式 LLM 的“忘记”要复杂得多。为了解决这一空白，我们提出了 Oblivionis，一种轻量级的学习与遗忘框架，使客户端能够在联邦 LLM 训练过程中有选择地移除特定私有数据，从而提升可信度和合规性。通过将联邦学习和遗忘统一为一个二元优化目标，我们纳入了 6 种联邦学习和 5 种遗忘算法进行全面评估和对比分析，建立了一个用于联邦 LLM 遗忘的稳健流程。大量实验表明，Oblivion 优于本地训练，在遗忘效果和模型效用之间实现了稳健的平衡，跨算法的比较为未来 LLM 的发展提供了明确方向。

Subjects: Machine Learning, Artificial Intelligence, Cryptography and Security 主题：机器学习、人工智能、密码学与安全

Publish: 2025-08-12 12:02:53 UTC 发布时间：2025-08-12 12:02:53 UTC

#68 BiasGym: Fantastic Biases and How to Find (and Remove) Them #68 BiasGym：奇妙的偏差以及如何发现（和移除）它们

Understanding biases and stereotypes encoded in the weights of Large Language Models (LLMs) is crucial for developing effective mitigation strategies. Biased behaviour is often subtle and non-trivial to isolate, even when deliberately elicited, making systematic analysis and debiasing particularly challenging. To address this, we introduce BiasGym, a simple, cost-effective, and generalizable framework for reliably injecting, analyzing, and mitigating conceptual associations within LLMs. BiasGym consists of two components: BiasInject, which injects specific biases into the model via token-based fine-tuning while keeping the model frozen, and BiasScope, which leverages these injected signals to identify and steer the components responsible for biased behavior. Our method enables consistent bias elicitation for mechanistic analysis, supports targeted debiasing without degrading performance on downstream tasks, and generalizes to biases unseen during training. We demonstrate the effectiveness of BiasGym in reducing real-world stereotypes (e.g., people from a country being reckless drivers') and in probing fictional associations (e.g., people from a country having blue skin’), showing its utility for both safety interventions and interpretability research. 理解编码在大型语言模型（LLMs）权重中的偏见和刻板印象，对于制定有效的缓解策略至关重要。即便在刻意诱导的情况下，偏见行为常常微妙且难以孤立，使得系统性分析和去偏尤为具有挑战性。为此，我们提出了 BiasGym，一个简单、低成本且具有普适性的框架，用于可靠地注入、分析和缓解 LLMs 中的概念关联。BiasGym 包含两个组件：BiasInject，通过基于词元的微调在保持模型冻结的情况下将特定偏见注入模型；BiasScope 则利用这些注入信号识别并引导导致偏见行为的组件。我们的方法实现了用于机械分析的一致偏见诱导，支持在不降低下游任务性能的前提下进行有针对性的去偏，并能泛化到训练中未见过的偏见。我们展示了 BiasGym 在减少现实世界刻板印象（例如某国人被认为是“鲁莽的司机”）以及在探测虚构关联（例如某国人具有“蓝色皮肤”）方面的有效性，表明其在安全干预和可解释性研究中都具有实用价值。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-12 11:23:44 UTC 发布：2025-08-12 11:23:44 UTC

#69 Steering Towards Fairness: Mitigating Political Bias in LLMs #69 朝向公平的引导：缓解 LLMs 中的政治偏见

Recent advancements in large language models (LLMs) have enabled their widespread use across diverse real-world applications. However, concerns remain about their tendency to encode and reproduce ideological biases, particularly along political and economic dimensions. In this paper, we propose a framework for probing and mitigating such biases in decoder-based LLMs through analysis of internal model representations. Grounded in the Political Compass Test (PCT), our method uses contrastive pairs to extract and compare hidden layer activations from models like Mistral and DeepSeek. We introduce a comprehensive activation extraction pipeline capable of layer-wise analysis across multiple ideological axes, revealing meaningful disparities linked to political framing. Our results show that decoder LLMs systematically encode representational bias across layers, which can be leveraged for effective steering vector-based mitigation. This work provides new insights into how political bias is encoded in LLMs and offers a principled approach to debiasing beyond surface-level output interventions. 近年来大型语言模型（LLMs）的进展使其在各种现实世界应用中得到广泛使用。然而，人们仍然担忧它们倾向于编码并再现意识形态偏见，尤其是在政治和经济维度上。本文中，我们提出了一个框架，通过分析解码器型 LLMs 的内部模型表征来探测和缓解此类偏见。基于政治罗盘测试（Political Compass Test，PCT），我们的方法使用对比对来提取并比较像 Mistral 和 DeepSeek 这类模型的隐藏层激活。我们引入了一个全面的激活提取管道，能够在多个意识形态轴上进行按层分析，揭示与政治表述相关的显著差异。我们的结果表明，解码器 LLMs 在各层系统性地编码了表征偏差，这可以被利用用于基于引导向量的有效缓解。该工作为理解政治偏见如何在 LLMs 中被编码提供了新见解，并提出了一种超越表层输出干预的原则性去偏方法。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-12 11:09:03 UTC 发布：2025-08-12 11:09:03 协调世界时

#70 The Roots of International Perceptions: Simulating US Attitude Changes Towards China with LLM Agents #70 国际认知的根源：使用 LLM 代理模拟美国对华态度变化

Authors: [Nicholas Sukiennik](https://arxiv.org/search/?searchtype=author&query=Nicholas Sukiennik), [Yichuan Xu](https://arxiv.org/search/?searchtype=author&query=Yichuan Xu), [Yuqing Kan](https://arxiv.org/search/?searchtype=author&query=Yuqing Kan), [Jinghua Piao](https://arxiv.org/search/?searchtype=author&query=Jinghua Piao), [Yuwei Yan](https://arxiv.org/search/?searchtype=author&query=Yuwei Yan), [Chen Gao](https://arxiv.org/search/?searchtype=author&query=Chen Gao), [Yong Li](https://arxiv.org/search/?searchtype=author&query=Yong Li) 作者：Nicholas Sukiennik、Yichuan Xu、Yuqing Kan、Jinghua Piao、Yuwei Yan、Chen Gao、Yong Li

The rise of LLMs poses new possibilities in modeling opinion evolution, a long-standing task in simulation, by leveraging advanced reasoning abilities to recreate complex, large-scale human cognitive trends. While most prior works focus on opinion evolution surrounding specific isolated events or the views within a country, ours is the first to model the large-scale attitude evolution of a population representing an entire country towards another – US citizens’ perspectives towards China. To tackle the challenges of this broad scenario, we propose a framework that integrates media data collection, user profile creation, and cognitive architecture for opinion updates to successfully reproduce the real trend of US attitudes towards China over a 20-year period from 2005 to today. We also leverage LLMs’ capabilities to introduce debiased media exposure, extracting neutral events from typically subjective news contents, to uncover the roots of polarized opinion formation, as well as a devils advocate agent to help explain the rare reversal from negative to positive attitudes towards China, corresponding with changes in the way Americans obtain information about the country. The simulation results, beyond validating our framework architecture, also reveal the impact of biased framing and selection bias in shaping attitudes. Overall, our work contributes to a new paradigm for LLM-based modeling of cognitive behaviors in a large-scale, long-term, cross-border social context, providing insights into the formation of international biases and offering valuable implications for media consumers to better understand the factors shaping their perspectives, and ultimately contributing to the larger social need for bias reduction and cross-cultural tolerance. 大型语言模型（LLMs）的兴起为模拟观点演变这一长期存在的任务带来了新可能，通过利用其先进的推理能力重现复杂的大规模人类认知趋势。以往大多数研究集中于围绕特定孤立事件的观点演变或某一国内部的观点，我们的工作首次对代表整个国家的人群对另一个国家的整体态度演变进行了建模——即美国公民对中国的看法。为应对这一广泛情景的挑战，我们提出了一个框架，整合了媒体数据收集、用户画像构建和用于观点更新的认知架构，以成功再现 2005 年至今这 20 年间美国对华态度的真实趋势。我们还利用 LLMs 的能力引入去偏见的媒体暴露，从通常带主观性的新闻内容中提取中性事件，以揭示两极分化舆论形成的根源，并引入一名“反方辩手”代理来帮助解释对华态度从消极转为积极的罕见逆转，这与美国人获取有关中国信息的方式的变化相对应。仿真结果不仅验证了我们框架的架构，还揭示了有偏框架和选择性偏差在塑造态度方面的影响。总体而言，我们的工作为基于 LLM 的大规模、长期、跨境社会语境中认知行为建模贡献了一种新范式，为国际偏见的形成提供了洞见，并为媒体受众更好地理解影响其观点的因素提供了有价值的启示，最终有助于满足减少偏见和促进跨文化宽容的更广泛社会需求。

Subjects: Social and Information Networks, Artificial Intelligence 主题：社交与信息网络，人工智能

Publish: 2025-08-12 10:54:08 UTC 发布：2025-08-12 10:54:08 UTC

#71 EditMF: Drawing an Invisible Fingerprint for Your Large Language Models #71 EditMF：为你的大型语言模型绘制一枚隐形指纹

Authors: [Jiaxuan Wu](https://arxiv.org/search/?searchtype=author&query=Jiaxuan Wu), [Yinghan Zhou](https://arxiv.org/search/?searchtype=author&query=Yinghan Zhou), [Wanli Peng](https://arxiv.org/search/?searchtype=author&query=Wanli Peng), [Yiming Xue](https://arxiv.org/search/?searchtype=author&query=Yiming Xue), [Juan Wen](https://arxiv.org/search/?searchtype=author&query=Juan Wen), [Ping Zhong](https://arxiv.org/search/?searchtype=author&query=Ping Zhong) 作者：吴家轩、周英涵、彭万里、薛一鸣、温娟、钟平

Training large language models (LLMs) is resource-intensive and expensive, making protecting intellectual property (IP) for LLMs crucial. Recently, embedding fingerprints into LLMs has emerged as a prevalent method for establishing model ownership. However, existing back-door-based methods suffer from limited stealth and efficiency. To simultaneously address these issues, we propose EditMF, a training-free fingerprinting paradigm that achieves highly imperceptible fingerprint embedding with minimal computational overhead. Ownership bits are mapped to compact, semantically coherent triples drawn from an encrypted artificial knowledge base (e.g., virtual author-novel-protagonist facts). Causal tracing localizes the minimal set of layers influencing each triple, and a zero-space update injects the fingerprint without perturbing unrelated knowledge. Verification requires only a single black-box query and succeeds when the model returns the exact pre-embedded protagonist. Empirical results on LLaMA and Qwen families show that EditMF combines high imperceptibility with negligible model’s performance loss, while delivering robustness far beyond LoRA-based fingerprinting and approaching that of SFT embeddings. Extensive experiments demonstrate that EditMF is an effective and low-overhead solution for secure LLM ownership verification. 训练大型语言模型（LLMs）资源消耗大且费用高昂，因此保护 LLMs 的知识产权至关重要。最近，将指纹嵌入 LLMs 已成为确立模型所有权的常用方法。然而，现有基于后门的方法在隐蔽性和效率上存在局限。为同时解决这些问题，我们提出了 EditMF，一种无需训练的指纹方案，能以极高的不可察觉性和最小的计算开销嵌入指纹。所有权位被映射到从加密的人工知识库（例如虚构的作者—小说—主角事实）中抽取的紧凑、语义一致的三元组。因果追踪定位出影响每个三元组的最小层集，零空间更新在不扰动无关知识的情况下注入指纹。验证只需一次黑盒查询，当模型返回预先嵌入的确切主角时即视为成功。在 LLaMA 和 Qwen 系列模型上的实证结果表明，EditMF 将高度隐蔽性与可忽略的模型性能损失相结合，同时其鲁棒性远超基于 LoRA 的指纹方法，并接近 SFT 嵌入的鲁棒水平。大量实验表明，EditMF 是一种有效且低开销的用于保障 LLM 所有权验证的解决方案。

Subjects: Cryptography and Security, Artificial Intelligence 主题：密码学与安全性 , 人工智能

Publish: 2025-08-12 10:52:48 UTC 发布：2025-08-12 10:52:48 UTC

#72 An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems #72 对大型语言模型（LLMs）在数学推理鲁棒性的一项研究：通过对高等数学问题进行数学等价变换来建立基准测试

In this paper, we introduce a systematic framework beyond conventional method to assess LLMs’ mathematical-reasoning robustness by stress-testing them on advanced math problems that are mathematically equivalent but with linguistic and parametric variation. These transformations allow us to measure the sensitivity of LLMs to non-mathematical perturbations, thereby enabling a more accurate evaluation of their mathematical reasoning capabilities. Using this new evaluation methodology, we created PutnamGAP, a new benchmark dataset with multiple mathematically-equivalent variations of competition-level math problems. With the new dataset, we evaluate multiple families of representative LLMs and examine their robustness. Across 18 commercial and open-source models we observe sharp performance degradation on the variants. OpenAI’s flagship reasoning model, O3, scores 49 % on the originals but drops by 4 percentage points on surface variants, and by 10.5 percentage points on core-step-based variants, while smaller models fare far worse. Overall, the results show that the proposed new evaluation methodology is effective for deepening our understanding of the robustness of LLMs and generating new insights for further improving their mathematical reasoning capabilities. 在本文中，我们提出了一个系统化框架，超越传统方法，通过在数学等价但在语言和参数上有所变化的高级数学题上对应激测试来评估 LLMs 的数学推理稳健性。这些变换使我们能够衡量 LLMs 对非数学扰动的敏感性，从而更准确地评估其数学推理能力。利用这种新的评估方法，我们构建了 PutnamGAP，这是一个包含多种数学等价变体的竞赛级数学题的新基准数据集。基于该数据集，我们评估了多类代表性的 LLMs 并检验它们的稳健性。在 18 个商业和开源模型中，我们观察到在变体上性能明显下降。OpenAI 的旗舰推理模型 O3 在原题上的得分为 49%，但在表面变体上下降了 4 个百分点，在基于核心步骤的变体上下降了 10.5 个百分点，而规模较小的模型表现则更差。总体而言，结果表明，所提出的新评估方法对于加深我们对 LLMs 鲁棒性的理解以及为进一步提升其数学推理能力产生新的见解是有效的。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-12 10:40:33 UTC 发布：2025-08-12 10:40:33 UTC

#73 Geometry-Aware Global Feature Aggregation for Real-Time Indirect Illumination #73 面向几何的全局特征聚合用于实时间接光照

Authors: [Meng Gai](https://arxiv.org/search/?searchtype=author&query=Meng Gai), [Guoping Wang](https://arxiv.org/search/?searchtype=author&query=Guoping Wang), [Sheng Li](https://arxiv.org/search/?searchtype=author&query=Sheng Li) 作者：盖猛，王国平，李晟

Real-time rendering with global illumination is crucial to afford the user realistic experience in virtual environments. We present a learning-based estimator to predict diffuse indirect illumination in screen space, which then is combined with direct illumination to synthesize globally-illuminated high dynamic range (HDR) results. Our approach tackles the challenges of capturing long-range/long-distance indirect illumination when employing neural networks and is generalized to handle complex lighting and scenarios. From the neural network thinking of the solver to the rendering equation, we present a novel network architecture to predict indirect illumination. Our network is equipped with a modified attention mechanism that aggregates global information guided by spacial geometry features, as well as a monochromatic design that encodes each color channel individually. We conducted extensive evaluations, and the experimental results demonstrate our superiority over previous learning-based techniques. Our approach excels at handling complex lighting such as varying-colored lighting and environment lighting. It can successfully capture distant indirect illumination and simulates the interreflections between textured surfaces well (i.e., color bleeding effects); it can also effectively handle new scenes that are not present in the training dataset. 实时渲染中的全局光照对于在虚拟环境中为用户提供逼真体验至关重要。我们提出了一种基于学习的估计器，用于预测屏幕空间的漫反射间接光照，然后将其与直接光照结合以合成全局光照的高动态范围（HDR）结果。我们的方法在使用神经网络时解决了捕捉远距离/长程间接光照的挑战，并且泛化到处理复杂光照和场景。从将求解器视为神经网络到渲染方程，我们提出了一种新颖的网络架构来预测间接光照。我们的网络配备了经修改的注意力机制，该机制在空间几何特征引导下聚合全局信息，同时采用单色设计分别对每个颜色通道进行编码。我们进行了大量评估，实验结果表明我们优于先前的基于学习的技术。我们的方法在处理复杂光照（如变化色彩的光照和环境光照）方面表现出色。它能够成功捕捉远处的间接光照并很好地模拟有纹理表面之间的相互反射（即颜色渗染效应）；它还能够有效处理训练数据集中不存在的新场景。

Subjects: Graphics, Artificial Intelligence 学科：图形学，人工智能

Publish: 2025-08-12 10:36:03 UTC 发布：2025-08-12 10:36:03 UTC

#74 Wavelet Mixture of Experts for Time Series Forecasting #74 小波专家混合模型用于时间序列预测

Authors: [Zheng Zhou](https://arxiv.org/search/?searchtype=author&query=Zheng Zhou), [Yu-Jie Xiong](https://arxiv.org/search/?searchtype=author&query=Yu-Jie Xiong), [Jia-Chen Zhang](https://arxiv.org/search/?searchtype=author&query=Jia-Chen Zhang), [Chun-Ming Xia](https://arxiv.org/search/?searchtype=author&query=Chun-Ming Xia), [Xi-Jiong Xie](https://arxiv.org/search/?searchtype=author&query=Xi-Jiong Xie) 作者：周征、熊宇杰、张家晨、夏春明、谢希炯

The field of time series forecasting is rapidly advancing, with recent large-scale Transformers and lightweight Multilayer Perceptron (MLP) models showing strong predictive performance. However, conventional Transformer models are often hindered by their large number of parameters and their limited ability to capture non-stationary features in data through smoothing. Similarly, MLP models struggle to manage multi-channel dependencies effectively. To address these limitations, we propose a novel, lightweight time series prediction model, WaveTS-B. This model combines wavelet transforms with MLP to capture both periodic and non-stationary characteristics of data in the wavelet domain. Building on this foundation, we propose a channel clustering strategy that incorporates a Mixture of Experts (MoE) framework, utilizing a gating mechanism and expert network to handle multi-channel dependencies efficiently. We propose WaveTS-M, an advanced model tailored for multi-channel time series prediction. Empirical evaluation across eight real-world time series datasets demonstrates that our WaveTS series models achieve state-of-the-art (SOTA) performance with significantly fewer parameters. Notably, WaveTS-M shows substantial improvements on multi-channel datasets, highlighting its effectiveness. 时间序列预测领域正在快速发展，最近的大规模 Transformer 和轻量级多层感知器（MLP）模型表现出强劲的预测性能。然而，传统的 Transformer 模型通常受制于参数量大，并且在通过平滑处理捕捉数据中的非平稳特征方面能力有限。同样，MLP 模型在有效处理多通道依赖关系方面也存在困难。为了解决这些局限性，我们提出了一种新颖的轻量级时间序列预测模型 WaveTS-B。该模型将小波变换与 MLP 相结合，在小波域中同时捕捉数据的周期性和非平稳特性。在此基础上，我们提出了一种通道聚类策略，结合了专家混合（MoE）框架，利用门控机制和专家网络高效处理多通道依赖关系。我们提出了 WaveTS-M，一种为多通道时间序列预测量身定制的先进模型。在八个真实世界时间序列数据集上的实证评估表明，我们的 WaveTS 系列模型以显著更少的参数实现了最先进（SOTA）的性能。值得注意的是，WaveTS-M 在多通道数据集上表现出显著改进，突显了其有效性。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-12 10:32:51 UTC 发布：2025-08-12 10:32:51 UTC

#75 OISMA: On-the-fly In-memory Stochastic Multiplication Architecture for Matrix-Multiplication Workloads #75 OISMA：面向矩阵乘法工作负载的即时内存内随机乘法架构

Authors: [Shady Agwa](https://arxiv.org/search/?searchtype=author&query=Shady Agwa), [Yihan Pan](https://arxiv.org/search/?searchtype=author&query=Yihan Pan), [Georgios Papandroulidakis](https://arxiv.org/search/?searchtype=author&query=Georgios Papandroulidakis), [Themis Prodromakis](https://arxiv.org/search/?searchtype=author&query=Themis Prodromakis) 作者：Shady Agwa，Yihan Pan，Georgios Papandroulidakis，Themis Prodromakis

Artificial Intelligence models are currently driven by a significant up-scaling of their complexity, with massive matrix multiplication workloads representing the major computational bottleneck. In-memory computing architectures are proposed to avoid the Von Neumann bottleneck. However, both digital/binary-based and analogue in-memory computing architectures suffer from various limitations, which significantly degrade the performance and energy efficiency gains. This work proposes OISMA, a novel in-memory computing architecture that utilizes the computational simplicity of a quasi-stochastic computing domain (Bent-Pyramid system), while keeping the same efficiency, scalability, and productivity of digital memories. OISMA converts normal memory read operations into in-situ stochastic multiplication operations with a negligible cost. An accumulation periphery then accumulates the output multiplication bitstreams, achieving the matrix multiplication functionality. Extensive matrix multiplication benchmarking was conducted to analyze the accuracy of the Bent-Pyramid system, using matrix dimensions ranging from 4x4 to 512x512. The accuracy results show a significant decrease in the average relative Frobenius error, from 9.42% (for 4x4) to 1.81% (for 512x512), compared to 64-bit double precision floating-point format. A 1T1R OISMA array of 4 KB capacity was implemented using a commercial 180nm technology node and in-house RRAM technology. At 50 MHz, OISMA achieves 0.891 TOPS/W and 3.98 GOPS/mm2 for energy and area efficiency, respectively, occupying an effective computing area of 0.804241 mm2. Scaling OISMA from 180nm to 22nm technology shows a significant improvement of two orders of magnitude in energy efficiency and one order of magnitude in area efficiency, compared to dense matrix multiplication in-memory computing architectures. 人工智能模型目前由其复杂度的大幅扩展驱动，其中大量的矩阵乘法工作负载构成主要的计算瓶颈。为避免冯·诺依曼瓶颈，提出了片内计算（in-memory computing）架构。然而，基于数字/二进制和模拟的片内计算架构都存在各种限制，显著降低了性能和能效收益。本工作提出了 OISMA，一种新颖的片内计算架构，它利用准随机计算域（Bent-Pyramid 系统）的计算简洁性，同时保持数字存储器的相同性能、可扩展性和生产力。OISMA 将普通的内存读取操作以可忽略的代价转换为现场随机乘法操作。随后，一个累加外设累加输出的乘法比特流，从而实现矩阵乘法功能。进行了广泛的矩阵乘法基准测试以分析 Bent-Pyramid 系统的精度，所用矩阵维度范围从 4x4 到 512x512。精度结果显示，相对于 64 位双精度浮点格式，平均相对 Frobenius 误差显著降低，从 9.42%（针对 4x4）降至 1.81%（针对 512x512）。使用商用 180nm 工艺节点和内部 RRAM 技术，实现了容量为 4 KB 的 1T1R OISMA 阵列。在 50 MHz 下，OISMA 在能耗和面积效率方面分别达到 0.891 TOPS/W 和 3.98 GOPS/mm2，有效计算面积为 0.804241 mm2。将 OISMA 从 180nm 缩放到 22nm 工艺，与基于密集矩阵乘法的内存计算架构相比，在能量效率上有两个数量级的显著提升，在面积效率上有一个数量级的提升。

Subjects: Hardware Architecture, Artificial Intelligence, Emerging Technologies, Performance 主题：硬件体系结构、人工智能、新兴技术、性能

Publish: 2025-08-12 10:24:33 UTC 发布：2025-08-12 10:24:33 世界协调时

#76 TempOpt – Unsupervised Alarm Relation Learning for Telecommunication Networks #76 TempOpt – 用于电信网络的无监督告警关联学习

Authors: [Sathiyanaryanan Sampath](https://arxiv.org/search/?searchtype=author&query=Sathiyanaryanan Sampath), [Pratyush Uppuluri](https://arxiv.org/search/?searchtype=author&query=Pratyush Uppuluri), [Thirumaran Ekambaram](https://arxiv.org/search/?searchtype=author&query=Thirumaran Ekambaram) 作者：Sathiyanaryanan Sampath、Pratyush Uppuluri、Thirumaran Ekambaram

In a telecommunications network, fault alarms generated by network nodes are monitored in a Network Operations Centre (NOC) to ensure network availability and continuous network operations. The monitoring process comprises of tasks such as active alarms analysis, root alarm identification, and resolution of the underlying problem. Each network node potentially can generate alarms of different types, while nodes can be from multiple vendors, a network can have hundreds of nodes thus resulting in an enormous volume of alarms at any time. Since network nodes are inter-connected, a single fault in the network would trigger multiple sequences of alarms across a variety of nodes and from a monitoring point of view, it is a challenging task for a NOC engineer to be aware of relations between the various alarms, when trying to identify, for example, a root alarm on which an action needs to be taken. To effectively identify root alarms, it is essential to learn relation among the alarms for accurate and faster resolution. In this work we propose a novel unsupervised alarm relation learning technique Temporal Optimization (TempOpt) that is practical and overcomes the limitations of an existing class of alarm relational learning method-temporal dependency methods. Experiments have been carried on real-world network datasets, that demonstrate the improved quality of alarm relations learned by TempOpt as compared to temporal dependency method. 在电信网络中，由网络节点产生的故障告警在网络运营中心（NOC）中被监控，以确保网络可用性和持续运行。监控过程包括活动告警分析、根本告警识别以及底层问题的解决等任务。每个网络节点可能产生不同类型的告警，节点可能来自多个供应商，一个网络可能包含数百个节点，因此在任何时候都会产生大量告警。由于网络节点是互相连接的，网络中的单一故障会在各种节点上触发多条告警序列；从监控角度来看，当 NOC 工程师试图识别诸如需要采取行动的根本告警时，了解各类告警之间的关系是一项具有挑战性的任务。为了有效识别根本告警，学习告警之间的关系以实现更准确和更快速的解决至关重要。在这项工作中，我们提出了一种新颖的无监督告警关系学习技术——时序优化（Temporal Optimization，TempOpt），该方法具有实用性并克服了现有一类告警关系学习方法——时序依赖方法的局限性。我们在真实网络数据集上进行了实验，结果表明与时序依赖方法相比，TempOpt 学到的告警关系质量得到了提高。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-12 10:15:48 UTC 发布：2025-08-12 10:15:48 协调世界时 (UTC)

#77 Not in My Backyard! Temporal Voting Over Public Chores #77 不是在我后院！关于公共杂务的时间投票

Authors: [Edith Elkind](https://arxiv.org/search/?searchtype=author&query=Edith Elkind), [Tzeh Yuan Neoh](https://arxiv.org/search/?searchtype=author&query=Tzeh Yuan Neoh), [Nicholas Teh](https://arxiv.org/search/?searchtype=author&query=Nicholas Teh) 作者：Edith Elkind、Tzeh Yuan Neoh、Nicholas Teh

We study a temporal voting model where voters have dynamic preferences over a set of public chores – projects that benefit society, but impose individual costs on those affected by their implementation. We investigate the computational complexity of optimizing utilitarian and egalitarian welfare. Our results show that while optimizing the former is computationally straightforward, minimizing the latter is computationally intractable, even in very restricted cases. Nevertheless, we identify several settings where this problem can be solved efficiently, either exactly or by an approximation algorithm. We also examine the effects of enforcing temporal fairness and its impact on social welfare, and analyze the competitive ratio of online algorithms. We then explore the strategic behavior of agents, providing insights into potential malfeasance in such decision-making environments. Finally, we discuss a range of fairness measures and their suitability for our setting. 我们研究了一个时间投票模型，其中选民对一组公共杂务——对社会有益但对受其实施影响的个体造成成本的项目——拥有动态偏好。我们考察了优化功利主义和平均主义福利的计算复杂性。我们的结果表明，尽管优化前者在计算上相对简单，但在非常受限的情况下，最小化后者在计算上也不可行。尽管如此，我们确定了若干可以高效解决该问题的情形，既包括精确解法也包括近似算法。我们还考察了强制执行时间公平性的效果及其对社会福利的影响，并分析了在线算法的竞争比。然后我们探讨了代理人的策略性行为，为此类决策环境中可能出现的不当行为提供了见解。最后，我们讨论了一系列公平性度量及其在我们情景中的适用性。

Subjects: Computer Science and Game Theory, Artificial Intelligence, Multiagent Systems, Theoretical Economics 主题：计算机科学与博弈论、人工智能、多智能体系统、理论经济学

Publish: 2025-08-12 10:06:56 UTC 发布：2025-08-12 10:06:56 UTC

#78 Opening Musical Creativity? Embedded Ideologies in Generative-AI Music Systems #78 激发音乐创造力？生成式人工智能音乐系统中的嵌入意识形态

Authors: [Liam Pram](https://arxiv.org/search/?searchtype=author&query=Liam Pram), [Fabio Morreale](https://arxiv.org/search/?searchtype=author&query=Fabio Morreale) 作者：Liam Pram，Fabio Morreale

AI systems for music generation are increasingly common and easy to use, granting people without any musical background the ability to create music. Because of this, generative-AI has been marketed and celebrated as a means of democratizing music making. However, inclusivity often functions as marketable rhetoric rather than a genuine guiding principle in these industry settings. In this paper, we look at four generative-AI music making systems available to the public as of mid-2025 (AIVA, Stable Audio, Suno, and Udio) and track how they are rhetoricized by their developers, and received by users. Our aim is to investigate ideologies that are driving the early-stage development and adoption of generative-AI in music making, with a particular focus on democratization. A combination of autoethnography and digital ethnography is used to examine patterns and incongruities in rhetoric when positioned against product functionality. The results are then collated to develop a nuanced, contextual discussion. The shared ideology we map between producers and consumers is individualist, globalist, techno-liberal, and ethically evasive. It is a ’total ideology’ which obfuscates individual responsibility, and through which the nature of music and musical practice is transfigured to suit generative outcomes. 用于音乐生成的人工智能系统越来越普遍且易于使用，使得没有任何音乐背景的人也能创作音乐。因此，生成式人工智能被宣传和赞颂为实现音乐创作民主化的一种手段。然而，在这些行业环境中，“包容性”常常作为一种可作市场宣传的辞令，而非真正的指导原则。本文考察了截至 2025 年中期向公众提供的四个生成式人工智能音乐创作系统（AIVA、Stable Audio、Suno 和 Udio），并追踪其开发者如何进行话语化，以及用户如何接受这些话语。我们的目标是调查推动生成式人工智能在音乐创作早期阶段开发和采用的意识形态，特别关注“民主化”。本文结合自我民族志和数字民族志的方法，检视当话语与产品功能相对照时出现的模式与不一致之处。然后将结果汇总以展开细致的、具情境性的讨论。我们在生产者与消费者之间映射出的共同意识形态是：个人主义、全球主义、技术自由主义与伦理规避。这是一种“整体意识形态”，它模糊了个人责任，并通过这种方式将音乐及音乐实践的本质转化为适应生成性结果。

Subjects: Sound, Artificial Intelligence, Human-Computer Interaction 主题：声音、人工智能、人机交互

Publish: 2025-08-12 09:59:07 UTC 发布：2025-08-12 09:59:07 UTC

#79 TechOps: Technical Documentation Templates for the AI Act #79 TechOps：用于《人工智能法案》的技术文档模板

Authors: [Laura Lucaj](https://arxiv.org/search/?searchtype=author&query=Laura Lucaj), [Alex Loosley](https://arxiv.org/search/?searchtype=author&query=Alex Loosley), [Hakan Jonsson](https://arxiv.org/search/?searchtype=author&query=Hakan Jonsson), [Urs Gasser](https://arxiv.org/search/?searchtype=author&query=Urs Gasser), [Patrick van der Smagt](https://arxiv.org/search/?searchtype=author&query=Patrick van der Smagt) 作者：Laura Lucaj、Alex Loosley、Hakan Jonsson、Urs Gasser、Patrick van der Smagt

Operationalizing the EU AI Act requires clear technical documentation to ensure AI systems are transparent, traceable, and accountable. Existing documentation templates for AI systems do not fully cover the entire AI lifecycle while meeting the technical documentation requirements of the AI Act. This paper addresses those shortcomings by introducing open-source templates and examples for documenting data, models, and applications to provide sufficient documentation for certifying compliance with the AI Act. These templates track the system status over the entire AI lifecycle, ensuring traceability, reproducibility, and compliance with the AI Act. They also promote discoverability and collaboration, reduce risks, and align with best practices in AI documentation and governance. The templates are evaluated and refined based on user feedback to enable insights into their usability and implementability. We then validate the approach on real-world scenarios, providing examples that further guide their implementation: the data template is followed to document a skin tones dataset created to support fairness evaluations of downstream computer vision models and human-centric applications; the model template is followed to document a neural network for segmenting human silhouettes in photos. The application template is tested on a system deployed for construction site safety using real-time video analytics and sensor data. Our results show that TechOps can serve as a practical tool to enable oversight for regulatory compliance and responsible AI development. 将欧盟《人工智能法案》落地需要明确的技术文档，以确保人工智能系统具有透明性、可追溯性和问责性。现有针对人工智能系统的文档模板并未在满足《人工智能法案》技术文档要求的同时覆盖完整的人工智能生命周期。本文通过引入用于记录数据、模型和应用的开源模板与示例来解决这些不足，旨在提供足以用于认证合规性的文档。这些模板在整个人工智能生命周期中跟踪系统状态，确保可追溯性、可复现性并符合《人工智能法案》。它们还促进可发现性与协作、降低风险，并与人工智能文档与治理的最佳实践保持一致。模板基于用户反馈进行评估和完善，以便洞察其可用性和可实现性。随后我们在真实场景中验证了该方法，提供了进一步指导其实施的示例：按照数据模板记录了一个用于支持下游计算机视觉模型和以人为本应用公平性评估的肤色数据集；按照模型模板记录了一个用于分割照片中人类轮廓的神经网络；按照应用模板测试了一个部署于工地以利用实时视频分析和传感器数据保障安全的系统。我们的结果表明，TechOps 可以作为一项实用工具，用以促进监管合规和负责任的人工智能开发的监督。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-12 09:58:33 UTC 发布：2025-08-12 09:58:33 世界标准时间

#80 Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments #80 通过自动化构建环境进行反馈驱动的工具使用改进在大型语言模型中的研究

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-12 09:45:19 UTC 发布时间：2025-08-12 09:45:19 协调世界时（UTC）

#81 ReQuestNet: A Foundational Learning model for Channel Estimation #81 ReQuestNet：用于信道估计的基础学习模型

In this paper, we present a novel neural architecture for channel estimation (CE) in 5G and beyond, the Recurrent Equivariant UERS Estimation Network (ReQuestNet). It incorporates several practical considerations in wireless communication systems, such as ability to handle variable number of resource block (RB), dynamic number of transmit layers, physical resource block groups (PRGs) bundling size (BS), demodulation reference signal (DMRS) patterns with a single unified model, thereby, drastically simplifying the CE pipeline. Besides it addresses several limitations of the legacy linear MMSE solutions, for example, by being independent of other reference signals and particularly by jointly processing MIMO layers and differently precoded channels with unknown precoding at the receiver. ReQuestNet comprises of two sub-units, CoarseNet followed by RefinementNet. CoarseNet performs per PRG, per transmit-receive (Tx-Rx) stream channel estimation, while RefinementNet refines the CoarseNet channel estimate by incorporating correlations across differently precoded PRGs, and correlation across multiple input multiple output (MIMO) channel spatial dimensions (cross-MIMO). Simulation results demonstrate that ReQuestNet significantly outperforms genie minimum mean squared error (MMSE) CE across a wide range of channel conditions, delay-Doppler profiles, achieving up to 10dB gain at high SNRs. Notably, ReQuestNet generalizes effectively to unseen channel profiles, efficiently exploiting inter-PRG and cross-MIMO correlations under dynamic PRG BS and varying transmit layer allocations. 在本文中，我们提出了一种用于 5G 及更高代通信中信道估计（CE）的新型神经网络架构——递归等变 UERS 估计网络（ReQuestNet）。它在设计中融入了无线通信系统的若干实际考虑，例如能够以单一统一模型处理可变数量的资源块（RB）、动态变化的发射层数、物理资源块组（PRGs）捆绑大小（BS）、以及解调参考信号（DMRS）模式，从而大幅简化了信道估计流程。此外，它还解决了传统线性 MMSE 方案的若干局限性，例如：不依赖于其他参考信号，并且特别地可以在接收端未知预编码的情况下，联合处理多天线（MIMO）层与不同预编码的信道。ReQuestNet 由两个子单元组成，先是 CoarseNet，随后是 RefinementNet。CoarseNet 在每个 PRG、每个发—收（Tx-Rx）流上执行信道估计，而 RefinementNet 通过结合不同预编码 PRG 之间的相关性以及跨多个输入多输出（MIMO）信道空间维度（跨 MIMO）的相关性来细化 CoarseNet 的信道估计。仿真结果表明，在各种信道条件和时延-多普勒分布下，ReQuestNet 在信使（genie）最小均方误差（MMSE）信道估计（CE）上具有显著优势，在高信噪比下最多可获得约 10dB 的增益。值得注意的是，ReQuestNet 能有效泛化到未见过的信道分布，能够在动态的 PRG 基站（PRG BS）和变化的发射层分配下，高效利用 PRG 之间以及跨 MIMO 的相关性。

Subjects: Signal Processing, Artificial Intelligence, Machine Learning 主题：信号处理，人工智能，机器学习

Publish: 2025-08-12 09:44:47 UTC 发布日期：2025-08-12 09:44:47 UTC

#82 Evaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge #82 使用具画像感知能力的 LLM 作为裁判评估播客推荐

Evaluating personalized recommendations remains a central challenge, especially in long-form audio domains like podcasts, where traditional offline metrics suffer from exposure bias and online methods such as A/B testing are costly and operationally constrained. In this paper, we propose a novel framework that leverages Large Language Models (LLMs) as offline judges to assess the quality of podcast recommendations in a scalable and interpretable manner. Our two-stage profile-aware approach first constructs natural-language user profiles distilled from 90 days of listening history. These profiles summarize both topical interests and behavioral patterns, serving as compact, interpretable representations of user preferences. Rather than prompting the LLM with raw data, we use these profiles to provide high-level, semantically rich context-enabling the LLM to reason more effectively about alignment between a user’s interests and recommended episodes. This reduces input complexity and improves interpretability. The LLM is then prompted to deliver fine-grained pointwise and pairwise judgments based on the profile-episode match. In a controlled study with 47 participants, our profile-aware judge matched human judgments with high fidelity and outperformed or matched a variant using raw listening histories. The framework enables efficient, profile-aware evaluation for iterative testing and model selection in recommender systems. 评估个性化推荐仍然是一个核心挑战，尤其是在像播客这样的长时音频领域，传统的离线指标存在曝光偏差，而在线方法（例如 A/B 测试）成本高且受运营限制。在本文中，我们提出了一个新框架，利用 LLMs 作为离线评审者，以一种可扩展且可解释的方式评估播客推荐的质量。我们提出的两阶段感知用户画像方法首先从 90 天的收听历史中提取自然语言用户画像。这些画像概述了主题兴趣和行为模式，作为用户偏好的简洁可解释表示。我们并不以原始数据提示 LLMs，而是使用这些画像提供高级、语义丰富的上下文——使 LLMs 更有效地推理用户兴趣与推荐集数之间的一致性。这降低了输入复杂性并提高了可解释性。随后，LLMs 会根据画像与集数的匹配被提示给出细粒度的逐点和成对判断。在一项有 47 名参与者的受控研究中，我们的基于用户画像的评判器高度还原了人工判断，并且优于或不逊于使用原始收听历史的变体。该框架为推荐系统的迭代测试和模型选择提供了高效的、基于画像的评估方法。

Subjects: Information Retrieval, Artificial Intelligence, Machine Learning 主题：信息检索、人工智能、机器学习

Publish: 2025-08-12 09:23:35 UTC 发布：2025-08-12 09:23:35 UTC

Authors: [Andrea Montibeller](https://arxiv.org/search/?searchtype=author&query=Andrea Montibeller), [Dasara Shullani](https://arxiv.org/search/?searchtype=author&query=Dasara Shullani), [Daniele Baracchi](https://arxiv.org/search/?searchtype=author&query=Daniele Baracchi), [Alessandro Piva](https://arxiv.org/search/?searchtype=author&query=Alessandro Piva), [Giulia Boato](https://arxiv.org/search/?searchtype=author&query=Giulia Boato) 作者：Andrea Montibeller、Dasara Shullani、Daniele Baracchi、Alessandro Piva、Giulia Boato

The growing presence of AI-generated videos on social networks poses new challenges for deepfake detection, as detectors trained under controlled conditions often fail to generalize to real-world scenarios. A key factor behind this gap is the aggressive, proprietary compression applied by platforms like YouTube and Facebook, which launder low-level forensic cues. However, replicating these transformations at scale is difficult due to API limitations and data-sharing constraints. For these reasons, we propose a first framework that emulates the video sharing pipelines of social networks by estimating compression and resizing parameters from a small set of uploaded videos. These parameters enable a local emulator capable of reproducing platform-specific artifacts on large datasets without direct API access. Experiments on FaceForensics++ videos shared via social networks demonstrate that our emulated data closely matches the degradation patterns of real uploads. Furthermore, detectors fine-tuned on emulated videos achieve comparable performance to those trained on actual shared media. Our approach offers a scalable and practical solution for bridging the gap between lab-based training and real-world deployment of deepfake detectors, particularly in the underexplored domain of compressed video content. 在社交网络上越来越多的 AI 生成视频给深度伪造检测带来了新的挑战，因为在受控条件下训练的检测器往往难以推广到真实世界场景。造成这一差距的一个关键因素是像 YouTube 和 Facebook 这样的平台所施加的激进的专有压缩，它们洗掉了低层次的取证线索。然而，由于 API 限制和数据共享约束，大规模复制这些转换是困难的。基于这些原因，我们提出了第一个通过从一小批上传视频中估计压缩和缩放参数来模拟社交网络视频共享管道的框架。这些参数使得本地模拟器能够在没有直接 API 访问的情况下，在大型数据集上重现平台特定的伪影。针对通过社交网络分享的 FaceForensics++ 视频的实验证明，我们模拟的数据与真实上传的退化模式高度匹配。此外，在模拟视频上微调的检测器其性能可与在实际分享媒体上训练的检测器相媲美。我们的方法为弥合实验室训练与深度伪造检测器在现实世界部署之间的差距提供了一种可扩展且实用的解决方案，特别是在压缩视频内容这一尚未充分探索的领域。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-08-12 09:11:31 UTC 发布：2025-08-12 09:11:31 UTC

#84 DevNous: An LLM-Based Multi-Agent System for Grounding IT Project Management in Unstructured Conversation #84 DevNous：一个基于 LLM 的多智能体系统，用于在非结构化对话中落地 IT 项目管理 [PDF ] [Copy] [Kimi ] [REL]

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-12 09:08:29 UTC 发布：2025-08-12 09:08:29 UTC

#85 Visual Prompting for Robotic Manipulation with Annotation-Guided Pick-and-Place Using ACT #85 使用 ACT 的注释引导抓取与放置的机器人操控视觉提示

Authors: [Muhammad A. Muttaqien](https://arxiv.org/search/?searchtype=author&query=Muhammad A. Muttaqien), [Tomohiro Motoda](https://arxiv.org/search/?searchtype=author&query=Tomohiro Motoda), [Ryo Hanai](https://arxiv.org/search/?searchtype=author&query=Ryo Hanai), [Yukiyasu Domae](https://arxiv.org/search/?searchtype=author&query=Yukiyasu Domae) 作者：Muhammad A. Muttaqien、Tomohiro Motoda、Ryo Hanai、Yukiyasu Domae

Robotic pick-and-place tasks in convenience stores pose challenges due to dense object arrangements, occlusions, and variations in object properties such as color, shape, size, and texture. These factors complicate trajectory planning and grasping. This paper introduces a perception-action pipeline leveraging annotation-guided visual prompting, where bounding box annotations identify both pickable objects and placement locations, providing structured spatial guidance. Instead of traditional step-by-step planning, we employ Action Chunking with Transformers (ACT) as an imitation learning algorithm, enabling the robotic arm to predict chunked action sequences from human demonstrations. This facilitates smooth, adaptive, and data-driven pick-and-place operations. We evaluate our system based on success rate and visual analysis of grasping behavior, demonstrating improved grasp accuracy and adaptability in retail environments. 便利店的机器人拿取与放置任务因物品排列密集、遮挡以及物品在颜色、形状、尺寸和纹理等属性上的差异而具有挑战性。这些因素使轨迹规划和抓取变得复杂。本文提出了一种感知-动作流水线，利用注释引导的视觉提示，其中边界框注释标识可抓取的物体和放置位置，提供结构化的空间引导。我们并非采用传统的逐步规划，而是使用带有变换器的动作分块（Action Chunking with Transformers，ACT）作为模仿学习算法，使机械臂能够从人类示范中预测分块的动作序列。这促进了平滑、适应性强且数据驱动的拿取与放置操作。我们通过成功率和抓取行为的视觉分析来评估系统，展示了在零售环境中改进的抓取准确性和适应能力。

Subjects: Robotics, Artificial Intelligence 主题：机器人学，人工智能

Publish: 2025-08-12 08:45:09 UTC 发布：2025-08-12 08:45:09 UTC

#86 SciRerankBench: Benchmarking Rerankers Towards Scientific Retrieval-Augmented Generated LLMs #86 SciRerankBench：针对用于科学检索增强生成型 LLMs 的重排序器基准测试

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-12 08:36:23 UTC 发布时间：2025-08-12 08:36:23 UTC

#87 IROTE: Human-like Traits Elicitation of Large Language Model via In-Context Self-Reflective Optimization #87 IROTE：通过上下文自我反思式优化引出大型语言模型的人类特质

Trained on various human-authored corpora, Large Language Models (LLMs) have demonstrated a certain capability of reflecting specific human-like traits (e.g., personality or values) by prompting, benefiting applications like personalized LLMs and social simulations. However, existing methods suffer from the superficial elicitation problem: LLMs can only be steered to mimic shallow and unstable stylistic patterns, failing to embody the desired traits precisely and consistently across diverse tasks like humans. To address this challenge, we propose IROTE, a novel in-context method for stable and transferable trait elicitation. Drawing on psychological theories suggesting that traits are formed through identity-related reflection, our method automatically generates and optimizes a textual self-reflection within prompts, which comprises self-perceived experience, to stimulate LLMs’ trait-driven behavior. The optimization is performed by iteratively maximizing an information-theoretic objective that enhances the connections between LLMs’ behavior and the target trait, while reducing noisy redundancy in reflection without any fine-tuning, leading to evocative and compact trait reflection. Extensive experiments across three human trait systems manifest that one single IROTE-generated self-reflection can induce LLMs’ stable impersonation of the target trait across diverse downstream tasks beyond simple questionnaire answering, consistently outperforming existing strong baselines. 在各种人类创作语料上训练的 Large Language Models (LLMs) 已经显示出通过提示反映特定类人特质（例如人格或价值观）的某种能力，这有利于个性化 LLMs 和社会模拟等应用。然而，现有方法存在表层诱发问题：LLMs 只能被引导去模仿浅显且不稳定的风格模式，无法像人类那样在各种任务中精确且一致地体现所需特质。为了解决这一挑战，我们提出了 IROTE，一种用于稳定且可迁移特质诱发的新型上下文内方法。借鉴心理学理论中表明特质通过与身份相关的反思形成的观点，我们的方法在提示中自动生成并优化一段文本自我反思，该自我反思包含自我感知的经历，以激发 LLMs 的特质驱动行为。该优化通过迭代最大化一个信息论目标来完成，该目标增强了 LLMs 行为与目标特质之间的联系，同时在无需任何微调的情况下减少反思中的噪声冗余，从而产生富有表现力且简洁的特质反思。跨三个人格特质体系的大量实验证明，一个由 IROTE 生成的自我反思即可使 LLMs 在多种下游任务中稳定地模仿目标特质，超越简单的问卷回答，并持续优于现有强基线。

Subjects: Computation and Language, Artificial Intelligence, Computers and Society

Publish: 2025-08-12 08:04:28 UTC 发表：2025-08-12 08:04:28 UTC

#88 Generative Modeling for Robust Deep Reinforcement Learning on the Traveling Salesman Problem #88 面向旅行商问题的稳健深度强化学习的生成式建模

Authors: [Michael Li](https://arxiv.org/search/?searchtype=author&query=Michael Li), [Eric Bae](https://arxiv.org/search/?searchtype=author&query=Eric Bae), [Christopher Haberland](https://arxiv.org/search/?searchtype=author&query=Christopher Haberland), [Natasha Jaques](https://arxiv.org/search/?searchtype=author&query=Natasha Jaques) 作者：Michael Li、Eric Bae、Christopher Haberland、Natasha Jaques

The Traveling Salesman Problem (TSP) is a classic NP-hard combinatorial optimization task with numerous practical applications. Classic heuristic solvers can attain near-optimal performance for small problem instances, but become computationally intractable for larger problems. Real-world logistics problems such as dynamically re-routing last-mile deliveries demand a solver with fast inference time, which has led researchers to investigate specialized neural network solvers. However, neural networks struggle to generalize beyond the synthetic data they were trained on. In particular, we show that there exist TSP distributions that are realistic in practice, which also consistently lead to poor worst-case performance for existing neural approaches. To address this issue of distribution robustness, we present Combinatorial Optimization with Generative Sampling (COGS), where training data is sampled from a generative TSP model. We show that COGS provides better data coverage and interpolation in the space of TSP training distributions. We also present TSPLib50, a dataset of realistically distributed TSP samples, which tests real-world generalization ability without conflating this issue with instance size. We evaluate our method on various synthetic datasets as well as TSPLib50, and compare to state-of-the-art neural baselines. We demonstrate that COGS improves distribution robustness, with most performance gains coming from worst-case scenarios. 旅行商问题（TSP）是一个经典的 NP-难组合优化任务，具有众多实际应用。传统启发式求解器在小规模问题实例上能达到接近最优的表现，但在更大规模问题上计算上变得不可行。诸如动态重新规划最后一公里配送等现实物流问题需要推理速度快的求解器，这促使研究人员探索专用的神经网络求解器。然而，神经网络难以在其训练的合成数据之外实现良好泛化。具体而言，我们展示了存在一些在实践中真实的 TSP 分布，这些分布也会持续导致现有神经方法的最坏情况性能很差。为了解决这种分布鲁棒性的问题，我们提出了基于生成式采样的组合优化方法（COGS），其训练数据来自一个生成式的 TSP 模型。我们证明 COGS 在 TSP 训练分布空间中提供了更好的数据覆盖和插值能力。我们还提出了 TSPLib50，这是一个具有现实分布特性的 TSP 样本数据集，它在不将问题与实例规模混淆的情况下测试真实世界的泛化能力。我们在各种合成数据集以及 TSPLib50 上评估了我们的方法，并与最先进的神经基线进行了比较。我们证明了 COGS 提升了分布鲁棒性，其中大部分性能提升来自最差情况。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-12 08:04:16 UTC 发布：2025-08-12 08:04:16 UTC

#89 MultiAiTutor: Child-Friendly Educational Multilingual Speech Generation Tutor with LLMs #89 MultiAiTutor：面向儿童友好的教育性多语种语音生成辅导系统，基于 LLMs

Generative speech models have demonstrated significant potential in personalizing teacher-student interactions, offering valuable real-world applications for language learning in children’s education. However, achieving high-quality, child-friendly speech generation remains challenging, particularly for low-resource languages across diverse languages and cultural contexts. In this paper, we propose MultiAiTutor, an educational multilingual generative AI tutor with child-friendly designs, leveraging LLM architecture for speech generation tailored for educational purposes. We propose to integrate age-appropriate multilingual speech generation using LLM architectures, facilitating young children’s language learning through culturally relevant image-description tasks in three low-resource languages: Singaporean-accent Mandarin, Malay, and Tamil. Experimental results from both objective metrics and subjective evaluations demonstrate the superior performance of the proposed MultiAiTutor compared to baseline methods. 生成式语音模型在个性化师生互动方面展示出显著潜力，为儿童教育中的语言学习提供了有价值的现实应用。然而，实现高质量、适合儿童的语音生成仍具有挑战性，尤其是在跨越多种语言和文化背景的低资源语言上。在本文中，我们提出了 MultiAiTutor，一种具有儿童友好设计的教育性多语种生成式 AI 辅导系统，利用 LLM 架构进行面向教育目的的语音生成。我们提出将适龄的多语种语音生成与 LLM 架构相结合，通过文化相关的图像描述任务，促进幼儿在三种低资源语言（新加坡口音普通话、马来语和泰米尔语）中的语言学习。来自客观指标和主观评估的实验结果均表明，所提出的 MultiAiTutor 在性能上优于基线方法。

Subjects: Audio and Speech Processing, Artificial Intelligence, Computation and Language, Signal Processing 主题：音频与语音处理、人工智能、计算与语言、信号处理

Publish: 2025-08-12 07:58:48 UTC 发布：2025-08-12 07:58:48 UTC

#90 A Survey on Parallel Text Generation: From Parallel Decoding to Diffusion Language Models #90 并行文本生成综述：从并行解码到扩散语言模型

As text generation has become a core capability of modern Large Language Models (LLMs), it underpins a wide range of downstream applications. However, most existing LLMs rely on autoregressive (AR) generation, producing one token at a time based on previously generated context-resulting in limited generation speed due to the inherently sequential nature of the process. To address this challenge, an increasing number of researchers have begun exploring parallel text generation-a broad class of techniques aimed at breaking the token-by-token generation bottleneck and improving inference efficiency. Despite growing interest, there remains a lack of comprehensive analysis on what specific techniques constitute parallel text generation and how they improve inference performance. To bridge this gap, we present a systematic survey of parallel text generation methods. We categorize existing approaches into AR-based and Non-AR-based paradigms, and provide a detailed examination of the core techniques within each category. Following this taxonomy, we assess their theoretical trade-offs in terms of speed, quality, and efficiency, and examine their potential for combination and comparison with alternative acceleration strategies. Finally, based on our findings, we highlight recent advancements, identify open challenges, and outline promising directions for future research in parallel text generation. 随着文本生成成为现代 LLMs 的核心能力，它支撑着各种下游应用。然而，大多数现有 LLMs 依赖自回归（AR）生成，基于先前生成的上下文一次生成一个标记——由于该过程本质上的序列性，导致生成速度受限。为了解决这一挑战，越来越多的研究者开始探索并行文本生成——一类旨在打破逐标记生成瓶颈并提高推理效率的广泛技术。尽管兴趣日益增加，但关于哪些具体技术构成并行文本生成以及它们如何提升推理性能的综合分析仍然不足。为弥合这一差距，我们提出了一项系统的并行文本生成方法综述。我们将现有方法归类为基于 AR 和非 AR 的范式，并对每类中的核心技术进行了详细审查。根据该分类法，我们从速度、质量和效率三个方面评估它们的理论权衡，并考察它们与替代加速策略结合与比较的潜力。最后，基于我们的研究发现，我们强调了近期进展、识别了尚存的挑战，并勾勒出并行文本生成未来研究的可行方向。

Subjects: Computation and Language, Artificial Intelligence, Distributed, Parallel, and Cluster Computing 学科：计算与语言、人工智能、分布式、并行与集群计算

Publish: 2025-08-12 07:56:04 UTC

#91 SafeFix: Targeted Model Repair via Controlled Image Generation #91 SafeFix：通过受控图像生成进行有针对性的模型修复 [PDF 3 ] [Copy] [Kimi ] [REL]

Authors: [Ouyang Xu](https://arxiv.org/search/?searchtype=author&query=Ouyang Xu), [Baoming Zhang](https://arxiv.org/search/?searchtype=author&query=Baoming Zhang), [Ruiyu Mao](https://arxiv.org/search/?searchtype=author&query=Ruiyu Mao), [Yunhui Guo](https://arxiv.org/search/?searchtype=author&query=Yunhui Guo) 作者：欧阳旭，张宝明，毛瑞宇，郭云辉

Deep learning models for visual recognition often exhibit systematic errors due to underrepresented semantic subpopulations. Although existing debugging frameworks can pinpoint these failures by identifying key failure attributes, repairing the model effectively remains difficult. Current solutions often rely on manually designed prompts to generate synthetic training images – an approach prone to distribution shift and semantic errors. To overcome these challenges, we introduce a model repair module that builds on an interpretable failure attribution pipeline. Our approach uses a conditional text-to-image model to generate semantically faithful and targeted images for failure cases. To preserve the quality and relevance of the generated samples, we further employ a large vision-language model (LVLM) to filter the outputs, enforcing alignment with the original data distribution and maintaining semantic consistency. By retraining vision models with this rare-case-augmented synthetic dataset, we significantly reduce errors associated with rare cases. Our experiments demonstrate that this targeted repair strategy improves model robustness without introducing new bugs. Code is available at https://github.com/oxu2/SafeFix 用于视觉识别的深度学习模型常因语义上代表性不足的子群体而表现出系统性错误。尽管现有的调试框架可以通过识别关键失败属性来定位这些故障，但要有效修复模型仍然困难。现有解决方案通常依赖手工设计的提示词来生成合成训练图像——这种方法容易导致分布偏移和语义错误。为克服这些挑战，我们引入了一个基于可解释故障归因流程的模型修复模块。我们的方法使用条件文本到图像模型为故障案例生成语义上忠实且有针对性的图像。为了保持生成样本的质量和相关性，我们进一步使用大型视觉-语言模型（LVLM）对输出进行过滤，强制使其与原始数据分布对齐并保持语义一致性。通过用该少见案例增强的合成数据集重新训练视觉模型，我们显著减少了与少见案例相关的错误。我们的实验证明，这一有针对性的修复策略在不引入新错误的情况下提升了模型的鲁棒性。代码可在 https://github.com/oxu2/SafeFix 获取

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning 主题：计算机视觉与模式识别，人工智能，机器学习

Publish: 2025-08-12 07:45:25 UTC 发布：2025-08-12 07:45:25 UTC

#92 MMIF-AMIN: Adaptive Loss-Driven Multi-Scale Invertible Dense Network for Multimodal Medical Image Fusion #92 MMIF-AMIN：自适应损失驱动的多尺度可逆致密网络用于多模态医学图像融合

Authors: [Tao Luo](https://arxiv.org/search/?searchtype=author&query=Tao Luo), [Weihua Xu](https://arxiv.org/search/?searchtype=author&query=Weihua Xu) 作者：Tao Luo，Weihua Xu

Multimodal medical image fusion (MMIF) aims to integrate images from different modalities to produce a comprehensive image that enhances medical diagnosis by accurately depicting organ structures, tissue textures, and metabolic information. Capturing both the unique and complementary information across multiple modalities simultaneously is a key research challenge in MMIF. To address this challenge, this paper proposes a novel image fusion method, MMIF-AMIN, which features a new architecture that can effectively extract these unique and complementary features. Specifically, an Invertible Dense Network (IDN) is employed for lossless feature extraction from individual modalities. To extract complementary information between modalities, a Multi-scale Complementary Feature Extraction Module (MCFEM) is designed, which incorporates a hybrid attention mechanism, convolutional layers of varying sizes, and Transformers. An adaptive loss function is introduced to guide model learning, addressing the limitations of traditional manually-designed loss functions and enhancing the depth of data mining. Extensive experiments demonstrate that MMIF-AMIN outperforms nine state-of-the-art MMIF methods, delivering superior results in both quantitative and qualitative analyses. Ablation experiments confirm the effectiveness of each component of the proposed method. Additionally, extending MMIF-AMIN to other image fusion tasks also achieves promising performance. 多模态医学图像融合（MMIF）旨在整合来自不同模态的图像，生成一幅综合图像，通过准确描绘器官结构、组织纹理和代谢信息来增强医学诊断。在多模态中同时捕捉各自独有的信息和互补信息是 MMIF 的关键研究挑战。为了解决该挑战，本文提出了一种新颖的图像融合方法 MMIF-AMIN，该方法以一种能够有效提取这些独有和互补特征的新型架构为特色。具体而言，采用可逆稠密网络（IDN）对单模态进行无损特征提取。为提取模态间的互补信息，设计了一个多尺度互补特征提取模块（MCFEM），其融合了混合注意力机制、不同尺寸的卷积层以及 Transformer。引入了一种自适应损失函数来指导模型学习，以解决传统人工设计损失函数的局限性并增强数据挖掘的深度。大量实验表明，MMIF-AMIN 优于九种最先进的 MMIF 方法，在定量和定性分析中均取得更优结果。消融实验验证了所提方法各个组件的有效性。此外，将 MMIF-AMIN 扩展到其他图像融合任务也取得了令人鼓舞的表现。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-08-12 06:55:38 UTC 发布：2025-08-12 06:55:38 UTC

#93 Imposing AI: Deceptive design patterns against sustainability #93 Imposing AI：针对可持续性的欺骗性设计模式

Authors: [Anaëlle Beignon](https://arxiv.org/search/?searchtype=author&query=Anaëlle Beignon), [Thomas Thibault](https://arxiv.org/search/?searchtype=author&query=Thomas Thibault), [Nolwenn Maudet](https://arxiv.org/search/?searchtype=author&query=Nolwenn Maudet) 作者：Anaëlle Beignon、Thomas Thibault、Nolwenn Maudet

Generative AI is being massively deployed in digital services, at a scale that will result in significant environmental harm. We document how tech companies are transforming established user interfaces to impose AI use and show how and to what extent these strategies fit within established deceptive pattern categories. We identify two main design strategies that are implemented to impose AI use in both personal and professional contexts: imposing AI features in interfaces at the expense of existing non-AI features and promoting narratives about AI that make it harder to resist using it. We discuss opportunities for regulating the imposed adoption of AI features, which would inevitably lead to negative environmental effects. 生成式人工智能正被大规模部署于数字服务中，其规模将导致显著的环境危害。我们记录了科技公司如何改造既有用户界面以强制使用人工智能，并展示了这些策略在何种程度上及如何符合既有的欺骗性模式类别。我们识别出两种在个人与职业情境中用以强制使用人工智能的主要设计策略：在界面中强制加入人工智能功能，从而牺牲现有的非人工智能功能；以及宣扬有关人工智能的叙事，使得用户更难抵制使用。我们讨论了对强制采用人工智能功能进行监管的可能性，因为这种采用不可避免地会导致负面的环境影响。

Subjects: Human-Computer Interaction, Artificial Intelligence 主题：人机交互，人工智能

Publish: 2025-08-12 06:37:39 UTC 发布：2025-08-12 06:37:39 UTC

#94 Hallucinations in Code Change to Natural Language Generation: Prevalence and Evaluation of Detection Metrics #94 代码更改到自然语言生成中的幻觉：发生率与检测指标评估

Authors: [Chunhua Liu](https://arxiv.org/search/?searchtype=author&query=Chunhua Liu), [Hong Yi Lin](https://arxiv.org/search/?searchtype=author&query=Hong Yi Lin), [Patanamon Thongtanunam](https://arxiv.org/search/?searchtype=author&query=Patanamon Thongtanunam) 作者：Chunhua Liu、Hong Yi Lin、Patanamon Thongtanunam

Language models have shown strong capabilities across a wide range of tasks in software engineering, such as code generation, yet they suffer from hallucinations. While hallucinations have been studied independently in natural language and code generation, their occurrence in tasks involving code changes which have a structurally complex and context-dependent format of code remains largely unexplored. This paper presents the first comprehensive analysis of hallucinations in two critical tasks involving code change to natural language generation: commit message generation and code review comment generation. We quantify the prevalence of hallucinations in recent language models and explore a range of metric-based approaches to automatically detect them. Our findings reveal that approximately 50% of generated code reviews and 20% of generated commit messages contain hallucinations. Whilst commonly used metrics are weak detectors on their own, combining multiple metrics substantially improves performance. Notably, model confidence and feature attribution metrics effectively contribute to hallucination detection, showing promise for inference-time detection.\footnote{All code and data will be released upon acceptance. 语言模型在软件工程的广泛任务中表现出强大能力，例如代码生成，但它们存在虚构（hallucination）问题。虽然虚构已在自然语言生成和代码生成中被单独研究，但在涉及代码变更的任务中——这些任务具有结构上复杂且依赖上下文的代码格式——其发生情况仍然 largely 未被探索。本文对两种涉及代码变更到自然语言生成的关键任务进行了首次全面分析：提交信息生成和代码评审评论生成。我们量化了近期语言模型中虚构的流行程度，并探索了一系列基于度量的方法以自动检测它们。我们的研究发现约有 50%的生成代码评审和 20%的生成提交信息包含虚构。尽管常用度量单独来看是较弱的检测器，但结合多种度量能显著提升性能。值得注意的是，模型置信度和特征归因指标在幻觉检测方面有效，有望用于推理时检测。\footnote{所有代码和数据将在被接受后发布。}

Subjects: Software Engineering, Artificial Intelligence 主题：软件工程，人工智能

Publish: 2025-08-12 05:59:33 UTC 发布：2025-08-12 05:59:33 UTC

#95 M2LLM: Multi-view Molecular Representation Learning with Large Language Models #95 M2 LLM: 多视角分子表示学习与大型语言模型

Authors: [Jiaxin Ju](https://arxiv.org/search/?searchtype=author&query=Jiaxin Ju), [Yizhen Zheng](https://arxiv.org/search/?searchtype=author&query=Yizhen Zheng), [Huan Yee Koh](https://arxiv.org/search/?searchtype=author&query=Huan Yee Koh), [Can Wang](https://arxiv.org/search/?searchtype=author&query=Can Wang), [Shirui Pan](https://arxiv.org/search/?searchtype=author&query=Shirui Pan) 作者：苟嘉昕，郑一臻，顾玥晔，王璨，潘世睿

Accurate molecular property prediction is a critical challenge with wide-ranging applications in chemistry, materials science, and drug discovery. Molecular representation methods, including fingerprints and graph neural networks (GNNs), achieve state-of-the-art results by effectively deriving features from molecular structures. However, these methods often overlook decades of accumulated semantic and contextual knowledge. Recent advancements in large language models (LLMs) demonstrate remarkable reasoning abilities and prior knowledge across scientific domains, leading us to hypothesize that LLMs can generate rich molecular representations when guided to reason in multiple perspectives. To address these gaps, we propose M2LLM, a multi-view framework that integrates three perspectives: the molecular structure view, the molecular task view, and the molecular rules view. These views are fused dynamically to adapt to task requirements, and experiments demonstrate that M2LLM achieves state-of-the-art performance on multiple benchmarks across classification and regression tasks. Moreover, we demonstrate that representation derived from LLM achieves exceptional performance by leveraging two core functionalities: the generation of molecular embeddings through their encoding capabilities and the curation of molecular features through advanced reasoning processes. 准确的分子性质预测是一个具有广泛应用的关键挑战，涉及化学、材料科学和药物发现等领域。分子表示方法，包括指纹和图神经网络（GNNs），通过从分子结构中高效提取特征，取得了最先进的成果。然而，这些方法常常忽视了几十年来累积的语义和上下文知识。近期大型语言模型（LLMs）在科学领域展示了显著的推理能力和先验知识，这使我们推测在引导其从多个视角进行推理时，LLMs 可以生成丰富的分子表示。为了解决这些不足，我们提出了 M2 LLM，一种多视角框架，整合了三种视角：分子结构视角、分子任务视角和分子规则视角。这些视角被动态融合以适应任务需求，实验证明 M2 LLM 在多个分类和回归任务的基准测试中达到了最先进的性能。此外，我们展示了从 LLM 获得的表示通过利用两项核心功能实现了卓越性能：通过其编码能力生成分子嵌入，以及通过高级推理过程策划分子特征。

Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题：机器学习、人工智能、计算与语言

Publish: 2025-08-12 05:46:47 UTC 发布：2025-08-12 05:46:47 UTC

#96 LLM driven Text-to-Table Generation through Sub-Tasks Guidance and Iterative Refinement #96 LLM 驱动的文本到表格生成：通过子任务指导与迭代精炼

Transforming unstructured text into structured data is a complex task, requiring semantic understanding, reasoning, and structural comprehension. While Large Language Models (LLMs) offer potential, they often struggle with handling ambiguous or domain-specific data, maintaining table structure, managing long inputs, and addressing numerical reasoning. This paper proposes an efficient system for LLM-driven text-to-table generation that leverages novel prompting techniques. Specifically, the system incorporates two key strategies: breaking down the text-to-table task into manageable, guided sub-tasks and refining the generated tables through iterative self-feedback. We show that this custom task decomposition allows the model to address the problem in a stepwise manner and improves the quality of the generated table. Furthermore, we discuss the benefits and potential risks associated with iterative self-feedback on the generated tables while highlighting the trade-offs between enhanced performance and computational cost. Our methods achieve strong results compared to baselines on two complex text-to-table generation datasets available in the public domain. 将非结构化文本转换为结构化数据是一项复杂的任务，需具备语义理解、推理和结构化把握能力。虽然 LLMs 具有潜力，但它们在处理模糊或特定领域的数据、保持表格结构、管理长输入以及解决数值推理问题时常常表现不佳。本文提出了一个高效的基于 LLM 的文本到表格生成系统，该系统利用了新颖的提示技术。具体而言，该系统包含两项关键策略：将文本到表格任务拆分为可管理的、有引导性的子任务，以及通过迭代自我反馈来精炼生成的表格。我们展示了这种定制的任务分解如何使模型以分步方式解决问题并提升生成表格的质量。此外，我们讨论了对生成表格进行迭代自我反馈的益处和潜在风险，同时强调了性能提升与计算成本之间的权衡。与公共领域中可获得的两个复杂文本到表格生成数据集上的基线方法相比，我们的方法取得了显著效果。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-12 05:37:12 UTC 发布：2025-08-12 05:37:12 UTC

#97 MiGrATe: Mixed-Policy GRPO for Adaptation at Test-Time #97 MiGrATe：用于测试时适应的混合策略 GRPO [PDF 5 ] [Copy] [Kimi 2 ] [REL]

Large language models (LLMs) are increasingly being applied to black-box optimization tasks, from program synthesis to molecule design. Prior work typically leverages in-context learning to iteratively guide the model towards better solutions. Such methods, however, often struggle to balance exploration of new solution spaces with exploitation of high-reward ones. Recently, test-time training (TTT) with synthetic data has shown promise in improving solution quality. However, the need for hand-crafted training data tailored to each task limits feasibility and scalability across domains. To address this problem, we introduce MiGrATe-a method for online TTT that uses GRPO as a search algorithm to adapt LLMs at inference without requiring external training data. MiGrATe operates via a mixed-policy group construction procedure that combines on-policy sampling with two off-policy data selection techniques: greedy sampling, which selects top-performing past completions, and neighborhood sampling (NS), which generates completions structurally similar to high-reward ones. Together, these components bias the policy gradient towards exploitation of promising regions in solution space, while preserving exploration through on-policy sampling. We evaluate MiGrATe on three challenging domains-word search, molecule optimization, and hypothesis+program induction on the Abstraction and Reasoning Corpus (ARC)-and find that it consistently outperforms both inference-only and TTT baselines, demonstrating the potential of online TTT as a solution for complex search tasks without external supervision. 大型语言模型（LLMs）越来越多地被应用于黑箱优化任务，从程序合成到分子设计。以往的工作通常利用上下文学习来迭代地引导模型朝更优解发展。然而，这类方法常常难以在探索新解空间与开发高回报解之间取得平衡。最近，使用合成数据的测试时训练（TTT）在提升解质量方面显示出希望。但为每个任务手工制作训练数据的需求限制了其在各领域的可行性和可扩展性。为了解决这一问题，我们提出了 MiGrATe——一种在线 TTT 方法，它使用 GRPO 作为搜索算法，在推理时自适应调整 LLMs，而无需外部训练数据。MiGrATe 通过一种混合策略组构建程序来运行，该程序将在线策略采样与两种离线策略数据选择技术相结合：贪婪采样——选择表现最好的历史完成结果；以及邻域采样（NS）——生成在结构上与高回报完成结果相似的完成结果。这些组件共同使策略梯度偏向于利用解空间中有前景的区域，同时通过基于策略的采样保持探索性。我们在三个具有挑战性的领域——单词搜索、分子优化以及抽象与推理语料库（ARC）上的假设+程序归纳——上评估了 MiGrATe，发现其在性能上持续优于仅推理和测试时训练（TTT）基线方法，证明了在线 TTT 作为无外部监督的复杂搜索任务解决方案的潜力。

Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题：机器学习、人工智能、计算与语言

Publish: 2025-08-12 05:08:21 UTC 发布：2025-08-12 05:08:21 UTC

#98 Securing Educational LLMs: A Generalised Taxonomy of Attacks on LLMs and DREAD Risk Assessment #98 保护教育用 LLMs：面向 LLMs 攻击的一般化分类与 DREAD 风险评估

Authors: [Farzana Zahid](https://arxiv.org/search/?searchtype=author&query=Farzana Zahid), [Anjalika Sewwandi](https://arxiv.org/search/?searchtype=author&query=Anjalika Sewwandi), [Lee Brandon](https://arxiv.org/search/?searchtype=author&query=Lee Brandon), [Vimal Kumar](https://arxiv.org/search/?searchtype=author&query=Vimal Kumar), [Roopak Sinha](https://arxiv.org/search/?searchtype=author&query=Roopak Sinha) 作者：Farzana Zahid, Anjalika Sewwandi, Lee Brandon, Vimal Kumar, Roopak Sinha

Due to perceptions of efficiency and significant productivity gains, various organisations, including in education, are adopting Large Language Models (LLMs) into their workflows. Educator-facing, learner-facing, and institution-facing LLMs, collectively, Educational Large Language Models (eLLMs), complement and enhance the effectiveness of teaching, learning, and academic operations. However, their integration into an educational setting raises significant cybersecurity concerns. A comprehensive landscape of contemporary attacks on LLMs and their impact on the educational environment is missing. This study presents a generalised taxonomy of fifty attacks on LLMs, which are categorized as attacks targeting either models or their infrastructure. The severity of these attacks is evaluated in the educational sector using the DREAD risk assessment framework. Our risk assessment indicates that token smuggling, adversarial prompts, direct injection, and multi-step jailbreak are critical attacks on eLLMs. The proposed taxonomy, its application in the educational environment, and our risk assessment will help academic and industrial practitioners to build resilient solutions that protect learners and institutions. 由于对效率和显著生产力提升的认知，各类组织（包括教育领域）正将 Large Language Models (LLMs) 纳入其工作流程。面向教育者、面向学习者和面向机构的 LLMs，统称为教育大语言模型（eLLMs），补充并增强了教学、学习和学术运营的有效性。然而，它们在教育环境中的整合也带来了重大的网络安全问题。目前缺乏对当代针对 LLMs 的攻击及其对教育环境影响的全面概述。本研究提出了一个针对 LLMs 的通用化五十种攻击的分类法，这些攻击被归类为针对模型本身或其基础设施的攻击。使用 DREAD 风险评估框架评估了这些攻击在教育领域的严重性。我们的风险评估表明，令牌走私（token smuggling）、对抗性提示、直接注入和多步骤越狱是对 eLLMs 的关键性攻击。所提出的分类法、在教育环境中的应用以及我们的风险评估将帮助学术界和工业界的从业者构建能够保护学习者和机构的弹性解决方案。

Subjects: Computers and Society, Artificial Intelligence 主题：计算机与社会，人工智能

Publish: 2025-08-12 04:34:12 UTC 发布：2025-08-12 04:34:12 UTC

#99 QoE-Aware Service Provision for Mobile AR Rendering: An Agent-Driven Approach #99 面向移动增强现实渲染的 QoE 感知服务提供：一种代理驱动方法

Authors: [Conghao Zhou](https://arxiv.org/search/?searchtype=author&query=Conghao Zhou), [Lulu Sun](https://arxiv.org/search/?searchtype=author&query=Lulu Sun), [Xiucheng Wang](https://arxiv.org/search/?searchtype=author&query=Xiucheng Wang), [Peng Yang](https://arxiv.org/search/?searchtype=author&query=Peng Yang), [Feng Lyu](https://arxiv.org/search/?searchtype=author&query=Feng Lyu), [Sihan Lu](https://arxiv.org/search/?searchtype=author&query=Sihan Lu), [Xuemin Shen](https://arxiv.org/search/?searchtype=author&query=Xuemin Shen) 作者：周从豪、孙露露、王秀成、杨鹏、吕锋、卢思涵、沈学民

Mobile augmented reality (MAR) is envisioned as a key immersive application in 6G, enabling virtual content rendering aligned with the physical environment through device pose estimation. In this paper, we propose a novel agent-driven communication service provisioning approach for edge-assisted MAR, aiming to reduce communication overhead between MAR devices and the edge server while ensuring the quality of experience (QoE). First, to address the inaccessibility of MAR application-specific information to the network controller, we establish a digital agent powered by large language models (LLMs) on behalf of the MAR service provider, bridging the data and function gap between the MAR service and network domains. Second, to cope with the user-dependent and dynamic nature of data traffic patterns for individual devices, we develop a user-level QoE modeling method that captures the relationship between communication resource demands and perceived user QoE, enabling personalized, agent-driven communication resource management. Trace-driven simulation results demonstrate that the proposed approach outperforms conventional LLM-based QoE-aware service provisioning methods in both user-level QoE modeling accuracy and communication resource efficiency. 移动增强现实（MAR）被认为是 6G 中的一个重要沉浸式应用，它通过设备位姿估计实现与物理环境对齐的虚拟内容渲染。本文提出了一种用于边缘辅助 MAR 的创新型由代理驱动的通信服务提供方法，旨在在保证体验质量（QoE）的同时减少 MAR 设备与边缘服务器之间的通信开销。首先，为了解决网络控制器无法获取 MAR 应用特定信息的问题，我们代表 MAR 服务提供方建立了一个由大型语言模型（LLMs）驱动的数字代理，弥合了 MAR 服务与网络域之间的数据和功能鸿沟。其次，为应对单个设备数据流量模式的用户依赖性和动态性，我们开发了一种用户级 QoE 建模方法，该方法捕捉通信资源需求与用户感知 QoE 之间的关系，从而实现个性化的、由代理驱动的通信资源管理。基于跟踪的仿真结果表明，所提出的方法在用户级 QoE 建模准确性和通信资源效率方面均优于传统的基于 LLM 的 QoE 感知服务提供方法。

Subjects: Networking and Internet Architecture, Artificial Intelligence 主题：网络与互联网体系结构，人工智能

Publish: 2025-08-12 04:32:04 UTC 出版：2025-08-12 04:32:04 UTC

#100 Transferable Model-agnostic Vision-Language Model Adaptation for Efficient Weak-to-Strong Generalization #100 可迁移的模型无关视觉-语言模型适配用于高效的弱到强泛化

Authors: [Jihwan Park](https://arxiv.org/search/?searchtype=author&query=Jihwan Park), [Taehoon song](https://arxiv.org/search/?searchtype=author&query=Taehoon song), [Sanghyeok Lee](https://arxiv.org/search/?searchtype=author&query=Sanghyeok Lee), [Miso Choi](https://arxiv.org/search/?searchtype=author&query=Miso Choi), [Hyunwoo J. Kim](https://arxiv.org/search/?searchtype=author&query=Hyunwoo J. Kim) 作者：朴志焕（Jihwan Park）、宋泰勋（Taehoon Song）、李相赫（Sanghyeok Lee）、崔美素（Miso Choi）、金玄宇（Hyunwoo J. Kim）

Vision-Language Models (VLMs) have been widely used in various visual recognition tasks due to their remarkable generalization capabilities. As these models grow in size and complexity, fine-tuning becomes costly, emphasizing the need to reuse adaptation knowledge from ‘weaker’ models to efficiently enhance ‘stronger’ ones. However, existing adaptation transfer methods exhibit limited transferability across models due to their model-specific design and high computational demands. To tackle this, we propose Transferable Model-agnostic adapter (TransMiter), a light-weight adapter that improves vision-language models ‘without backpropagation’. TransMiter captures the knowledge gap between pre-trained and fine-tuned VLMs, in an ‘unsupervised’ manner. Once trained, this knowledge can be seamlessly transferred across different models without the need for backpropagation. Moreover, TransMiter consists of only a few layers, inducing a negligible additional inference cost. Notably, supplementing the process with a few labeled data further yields additional performance gain, often surpassing a fine-tuned stronger model, with a marginal training cost. Experimental results and analyses demonstrate that TransMiter effectively and efficiently transfers adaptation knowledge while preserving generalization abilities across VLMs of different sizes and architectures in visual recognition tasks. 视觉-语言模型（VLMs）因其卓越的泛化能力而被广泛应用于各种视觉识别任务。随着这些模型规模与复杂度的增长，微调变得成本高昂，因此有必要将“较弱”模型的适应知识重用到“较强”模型上以提高效率。然而，现有的适应迁移方法由于其针对特定模型的设计和高计算需求，在模型间的可迁移性方面表现有限。为了解决这一问题，我们提出了可迁移的模型无关适配器（TransMiter），这是一种无需反向传播即可改进视觉-语言模型的轻量级适配器。TransMiter 以“无监督”方式捕捉预训练与微调后 VLM 之间的知识差距。一旦训练完成，这些知识即可在不同模型之间无缝转移，无需反向传播。此外，TransMiter 仅由少数几层组成，带来的额外推理开销可忽略不计。值得注意的是，若在此过程中补充少量带标签的数据，还能进一步提升性能，通常以极小的训练成本超过经过微调的更强模型。实验结果和分析表明，TransMiter 在视觉识别任务中能够有效且高效地传递适应性知识，同时在不同规模和架构的视觉语言模型之间保持泛化能力。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning 主题：计算机视觉与模式识别，人工智能，机器学习

Publish: 2025-08-12 03:37:16 UTC 发布：2025-08-12 03:37:16 UTC

#101 Yan: Foundational Interactive Video Generation #101 Yan：基础交互式视频生成 [PDF 13 ] [Copy] [Kimi 6 ] [REL]

Author: [Yan Team](https://arxiv.org/search/?searchtype=author&query=Yan Team) 作者：Yan 团队

We present Yan, a foundational framework for interactive video generation, covering the entire pipeline from simulation and generation to editing. Specifically, Yan comprises three core modules. AAA-level Simulation: We design a highly-compressed, low-latency 3D-VAE coupled with a KV-cache-based shift-window denoising inference process, achieving real-time 1080P/60FPS interactive simulation. Multi-Modal Generation: We introduce a hierarchical autoregressive caption method that injects game-specific knowledge into open-domain multi-modal video diffusion models (VDMs), then transforming the VDM into a frame-wise, action-controllable, real-time infinite interactive video generator. Notably, when the textual and visual prompts are sourced from different domains, the model demonstrates strong generalization, allowing it to blend and compose the style and mechanics across domains flexibly according to user prompts. Multi-Granularity Editing: We propose a hybrid model that explicitly disentangles interactive mechanics simulation from visual rendering, enabling multi-granularity video content editing during interaction through text. Collectively, Yan offers an integration of these modules, pushing interactive video generation beyond isolated capabilities toward a comprehensive AI-driven interactive creation paradigm, paving the way for the next generation of creative tools, media, and entertainment. The project page is: https://greatx3.github.io/Yan/. 我们提出了 Yan，一种用于交互式视频生成的基础框架，覆盖从模拟与生成到编辑的完整流程。具体来说，Yan 包含三大核心模块。AAA 级模拟：我们设计了一个高度压缩、低延迟的 3D-VAE，并结合基于 KV 缓存的滑窗去噪推理流程，实现实时 1080P/60FPS 的交互式模拟。多模态生成：我们引入了一种分层自回归的字幕方法，将游戏特有知识注入到开放域的多模态视频扩散模型（VDM）中，随后将 VDM 转化为逐帧、动作可控、实时的无限交互式视频生成器。值得注意的是，当文本与视觉提示来自不同领域时，模型表现出强大的泛化能力，允许其根据用户提示灵活地混合与组合跨领域的风格与机制。多粒度编辑：我们提出了一种混合模型，明确区分交互机制模拟与视觉渲染，使得在交互过程中可以通过文本对视频内容进行多粒度编辑。总体而言，Yan 将这些模块整合在一起，将交互式视频生成从孤立的能力推进到以人工智能驱动的综合交互创作范式，为下一代创意工具、媒体和娱乐铺平道路。项目页面为：https://greatx3.github.io/Yan/。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-08-12 03:34:21 UTC 发布：2025-08-12 03:34:21 UTC

#102 Generative AI for Critical Infrastructure in Smart Grids: A Unified Framework for Synthetic Data Generation and Anomaly Detection #102 面向智能电网关键基础设施的生成式人工智能：用于合成数据生成与异常检测的统一框架

Authors: [Aydin Zaboli](https://arxiv.org/search/?searchtype=author&query=Aydin Zaboli), [Junho Hong](https://arxiv.org/search/?searchtype=author&query=Junho Hong) 作者：Aydin Zaboli、Junho Hong

In digital substations, security events pose significant challenges to the sustained operation of power systems. To mitigate these challenges, the implementation of robust defense strategies is critically important. A thorough process of anomaly identification and detection in information and communication technology (ICT) frameworks is crucial to ensure secure and reliable communication and coordination between interconnected devices within digital substations. Hence, this paper addresses the critical cybersecurity challenges confronting IEC61850-based digital substations within modern smart grids, where the integration of advanced communication protocols, e.g., generic object-oriented substation event (GOOSE), has enhanced energy management and introduced significant vulnerabilities to cyberattacks. Focusing on the limitations of traditional anomaly detection systems (ADSs) in detecting threats, this research proposes a transformative approach by leveraging generative AI (GenAI) to develop robust ADSs. The primary contributions include the suggested advanced adversarial traffic mutation (AATM) technique to generate synthesized and balanced datasets for GOOSE messages, ensuring protocol compliance and enabling realistic zero-day attack pattern creation to address data scarcity. Then, the implementation of GenAI-based ADSs incorporating the task-oriented dialogue (ToD) processes has been explored for improved detection of attack patterns. Finally, a comparison of the GenAI-based ADS with machine learning (ML)-based ADSs has been implemented to showcase the outperformance of the GenAI-based frameworks considering the AATM-generated GOOSE datasets and standard/advanced performance evaluation metrics. 在数字变电站中，安全事件对电力系统的持续运行构成了重大挑战。为减轻这些挑战，实施强有力的防御策略至关重要。对信息与通信技术（ICT）框架中的异常进行全面识别和检测的过程，对于确保数字变电站内互联设备之间的通信与协调的安全与可靠至关重要。因此，本文针对现代智能电网中基于 IEC61850 的数字变电站所面临的关键网络安全挑战展开讨论，在这些电网中，先进的通信协议（例如面向对象的通用子站事件，GOOSE）的集成提高了能源管理能力，但也为网络攻击引入了显著的脆弱性。鉴于传统异常检测系统（ADS）在发现威胁方面的局限性，本研究提出了一种变革性方法，利用生成式人工智能（GenAI）来开发强健的 ADS。主要贡献包括提出了一种先进的对抗性流量变异（AATM）技术，用于生成合成且平衡的 GOOSE 消息数据集，确保协议合规并能够创建真实的零日攻击模式以解决数据稀缺问题。随后，探讨了将基于生成式人工智能（GenAI）的入侵检测系统（ADS）与面向任务的对话（ToD）流程相结合以改进攻击模式检测的实现。最后，对基于 GenAI 的 ADS 与基于机器学习（ML）的 ADS 进行了比较，以展示在使用 AATM 生成的 GOOSE 数据集和标准/高级性能评估指标时，基于 GenAI 的框架的优越性。

Subjects: Cryptography and Security, Artificial Intelligence 主题：密码学与安全性 , 人工智能

Publish: 2025-08-12 03:18:05 UTC 发布时间：2025-08-12 03:18:05 协调世界时 (UTC)

#103 DepressLLM: Interpretable domain-adapted language model for depression detection from real-world narratives #103 DepressLLM：一种可解释的领域自适应语言模型，用于从真实世界叙述中检测抑郁症

Advances in large language models (LLMs) have enabled a wide range of applications. However, depression prediction is hindered by the lack of large-scale, high-quality, and rigorously annotated datasets. This study introduces DepressLLM, trained and evaluated on a novel corpus of 3,699 autobiographical narratives reflecting both happiness and distress. DepressLLM provides interpretable depression predictions and, via its Score-guided Token Probability Summation (SToPS) module, delivers both improved classification performance and reliable confidence estimates, achieving an AUC of 0.789, which rises to 0.904 on samples with confidence ≥ 0.95. To validate its robustness to heterogeneous data, we evaluated DepressLLM on in-house datasets, including an Ecological Momentary Assessment (EMA) corpus of daily stress and mood recordings, and on public clinical interview data. Finally, a psychiatric review of high-confidence misclassifications highlighted key model and data limitations that suggest directions for future refinements. These findings demonstrate that interpretable AI can enable earlier diagnosis of depression and underscore the promise of medical AI in psychiatry. 大型语言模型(LLMs)的进展推动了广泛的应用。然而，抑郁症预测受到缺乏大规模、高质量且经过严格注释的数据集的制约。本研究提出了 DepressLLM，在一个新构建的语料库上进行了训练和评估，该语料库包含 3,699 篇反映快乐与痛苦的自传体叙述。DepressLLM 提供可解释的抑郁预测，并通过其基于分数引导的令牌概率求和(Score-guided Token Probability Summation, SToPS)模块，既提升了分类性能又给出可靠的置信度估计，取得了 0.789 的 AUC；在置信度为 ≥ 0.95 的样本上，AUC 提升至 0.904。为验证其对异质数据的鲁棒性，我们在内部数据集上评估了 DepressLLM，包括一个关于日常压力和情绪记录的生态瞬时评估(EMA)语料库，以及公开的临床访谈数据。最后，对高置信度误分类样本的精神科专家审查揭示了模型与数据的关键局限性，提示了未来改进的方向。这些发现表明，可解释的人工智能能够促成更早期的抑郁症诊断，并强调了医学人工智能在精神病学领域的前景。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-12 03:12:55 UTC 发布：2025-08-12 03:12:55 UTC

#104 AI Security Map: Holistic Organization of AI Security Technologies and Impacts on Stakeholders #104 AI 安全地图：对 AI 安全技术及其对利益相关者影响的整体性组织

Authors: [Hiroya Kato](https://arxiv.org/search/?searchtype=author&query=Hiroya Kato), [Kentaro Kita](https://arxiv.org/search/?searchtype=author&query=Kentaro Kita), [Kento Hasegawa](https://arxiv.org/search/?searchtype=author&query=Kento Hasegawa), [Seira Hidano](https://arxiv.org/search/?searchtype=author&query=Seira Hidano) 作者：Hiroya Kato，Kentaro Kita，Kento Hasegawa，Seira Hidano

As the social implementation of AI has been steadily progressing, research and development related to AI security has also been increasing. However, existing studies have been limited to organizing related techniques, attacks, defenses, and risks in terms of specific domains or AI elements. Thus, it extremely difficult to understand the relationships among them and how negative impacts on stakeholders are brought about. In this paper, we argue that the knowledge, technologies, and social impacts related to AI security should be holistically organized to help understand relationships among them. To this end, we first develop an AI security map that holistically organizes interrelationships among elements related to AI security as well as negative impacts on information systems and stakeholders. This map consists of the two aspects, namely the information system aspect (ISA) and the external influence aspect (EIA). The elements that AI should fulfill within information systems are classified under the ISA. The EIA includes elements that affect stakeholders as a result of AI being attacked or misused. For each element, corresponding negative impacts are identified. By referring to the AI security map, one can understand the potential negative impacts, along with their causes and countermeasures. Additionally, our map helps clarify how the negative impacts on AI-based systems relate to those on stakeholders. We show some findings newly obtained by referring to our map. We also provide several recommendations and open problems to guide future AI security communities. 随着人工智能在社会中的落地不断推进，关于人工智能安全的研究与开发也在增加。然而，现有研究多局限于在特定领域或人工智能要素层面上整理相关技术、攻击、防御与风险，因此很难理解它们之间的相互关系以及这些负面影响如何作用于利益相关者。本文认为，应当对与人工智能安全相关的知识、技术与社会影响进行整体性组织，以帮助理解它们之间的关系。为此，我们首先构建了一幅人工智能安全地图，整体性地组织了与人工智能安全相关要素之间的相互关系以及对信息系统和利益相关者的负面影响。该地图由两个方面组成，即信息系统方面（ISA）与外部影响方面（EIA）。人工智能在信息系统内应当满足的要素被归入 ISA。EIA 包括因人工智能被攻击或滥用而影响利益相关者的要素。对于每一项要素，都识别了相应的负面影响。通过参考 AI 安全地图，人们可以了解潜在的负面影响及其成因与对策。此外，我们的地图有助于澄清基于 AI 的系统所受负面影响与利益相关者所受负面影响之间的关系。我们展示了通过参考我们的地图新获得的一些发现，并提供若干建议与未解问题，以指导未来的 AI 安全 community。

Subjects: Cryptography and Security, Artificial Intelligence 主题：密码学与安全性 , 人工智能

Publish: 2025-08-12 02:41:20 UTC 发布：2025-08-12 02:41:20 协调世界时（UTC）

#105 Who pays the RENT? Implications of Spatial Inequality for Prediction-Based Allocation Policies #105 谁来付房租？基于预测的分配政策对空间不平等的影响

Authors: [Tasfia Mashiat](https://arxiv.org/search/?searchtype=author&query=Tasfia Mashiat), [Patrick J. Fowler](https://arxiv.org/search/?searchtype=author&query=Patrick J. Fowler), [Sanmay Das](https://arxiv.org/search/?searchtype=author&query=Sanmay Das) 作者：Tasfia Mashiat、Patrick J. Fowler、Sanmay Das

AI-powered scarce resource allocation policies rely on predictions to target either specific individuals (e.g., high-risk) or settings (e.g., neighborhoods). Recent research on individual-level targeting demonstrates conflicting results; some models show that targeting is not useful when inequality is high, while other work demonstrates potential benefits. To study and reconcile this apparent discrepancy, we develop a stylized framework based on the Mallows model to understand how the spatial distribution of inequality affects the effectiveness of door-to-door outreach policies. We introduce the RENT (Relative Efficiency of Non-Targeting) metric, which we use to assess the effectiveness of targeting approaches compared with neighborhood-based approaches in preventing tenant eviction when high-risk households are more versus less spatially concentrated. We then calibrate the model parameters to eviction court records collected in a medium-sized city in the USA. Results demonstrate considerable gains in the number of high-risk households canvassed through individually targeted policies, even in a highly segregated metro area with concentrated risks of eviction. We conclude that apparent discrepancies in the prior literature can be reconciled by considering 1) the source of deployment costs and 2) the observed versus modeled concentrations of risk. Our results inform the deployment of AI-based solutions in social service provision that account for particular applications and geographies. 由人工智能驱动的稀缺资源分配策略依赖预测来针对特定个体（例如高风险人群）或特定场所（例如社区）。关于面向个体的定向研究最近出现了相互矛盾的结果：一些模型显示当不平等程度高时定向并无益处，而其他研究则表明其可能带来好处。为研究并调和这一表面上的不一致性，我们基于马洛斯模型构建了一个程式化框架，以理解不平等的空间分布如何影响逐户走访式外展政策的有效性。我们引入了 RENT（Relative Efficiency of Non-Targeting，非定向相对效率）指标，用于评估在高风险家庭空间上更集中或更分散时，与基于社区的方法相比，定向方法在预防租户被驱逐方面的效果。随后，我们将模型参数校准到收集自美国一座中等规模城市的驱逐法庭记录。结果表明，即使在高度隔离、驱逐风险高度集中的大都市区，面向个体的定向政策在覆盖高风险家庭数量上也能带来显著增益。我们得出结论：通过考虑 1) 部署成本的来源和 2) 观测到的风险浓度与模型化的风险浓度，可以调和先前文献中看似存在的分歧。我们的研究结果为在社会服务提供中部署基于人工智能的解决方案提供了参考，帮助其考虑具体的应用场景和地理区域。

Subjects: Computers and Society, Artificial Intelligence 主题：计算机与社会，人工智能

Publish: 2025-08-12 02:16:50 UTC 发布：2025-08-12 02:16:50 协调世界时

#106 Superclass-Guided Representation Disentanglement for Spurious Correlation Mitigation #106 超类引导的表征解缠以缓解虚假相关性

Authors: [Chenruo Liu](https://arxiv.org/search/?searchtype=author&query=Chenruo Liu), [Hongjun Liu](https://arxiv.org/search/?searchtype=author&query=Hongjun Liu), [Zeyu Lai](https://arxiv.org/search/?searchtype=author&query=Zeyu Lai), [Yiqiu Shen](https://arxiv.org/search/?searchtype=author&query=Yiqiu Shen), [Chen Zhao](https://arxiv.org/search/?searchtype=author&query=Chen Zhao), [Qi Lei](https://arxiv.org/search/?searchtype=author&query=Qi Lei) 作者：刘辰若，刘鸿军，赖泽宇，沈奕秋，赵晨，雷琦

To enhance group robustness to spurious correlations, prior work often relies on auxiliary annotations for groups or spurious features and assumes identical sets of groups across source and target domains. These two requirements are both unnatural and impractical in real-world settings. To overcome these limitations, we propose a method that leverages the semantic structure inherent in class labels–specifically, superclass information–to naturally reduce reliance on spurious features. Our model employs gradient-based attention guided by a pre-trained vision-language model to disentangle superclass-relevant and irrelevant features. Then, by promoting the use of all superclass-relevant features for prediction, our approach achieves robustness to more complex spurious correlations without the need to annotate any source samples. Experiments across diverse datasets demonstrate that our method significantly outperforms baselines in domain generalization tasks, with clear improvements in both quantitative metrics and qualitative visualizations. 为了增强群体对虚假相关性的鲁棒性，先前的工作通常依赖于对群体或虚假特征的辅助标注，并假设源域和目标域之间存在相同的群体集合。这两个要求在现实世界中既不自然也不切实际。为克服这些限制，我们提出了一种方法，利用类别标签中固有的语义结构——具体来说，是超类信息——来自然地减少对虚假特征的依赖。我们的模型采用由预训练视觉-语言模型引导的基于梯度的注意力来解开与超类相关与不相关的特征。然后，通过促进在预测中使用所有与超类相关的特征，我们的方法在无需标注任何源样本的情况下实现了对更复杂虚假相关性的鲁棒性。在多个不同的数据集上的实验表明，我们的方法在域泛化任务中显著优于基线方法，在定量指标和定性可视化方面均有明确提升。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning 主题：计算机视觉与模式识别，人工智能，机器学习

Publish: 2025-08-12 02:16:04 UTC 发布时间：2025-08-12 02:16:04 协调世界时

#107 UQGNN: Uncertainty Quantification of Graph Neural Networks for Multivariate Spatiotemporal Prediction #107 UQGNN：用于多变量时空预测的图神经网络不确定性量化

Spatiotemporal prediction plays a critical role in numerous real-world applications such as urban planning, transportation optimization, disaster response, and pandemic control. In recent years, researchers have made significant progress by developing advanced deep learning models for spatiotemporal prediction. However, most existing models are deterministic, i.e., predicting only the expected mean values without quantifying uncertainty, leading to potentially unreliable and inaccurate outcomes. While recent studies have introduced probabilistic models to quantify uncertainty, they typically focus on a single phenomenon (e.g., taxi, bike, crime, or traffic crashes), thereby neglecting the inherent correlations among heterogeneous urban phenomena. To address the research gap, we propose a novel Graph Neural Network with Uncertainty Quantification, termed UQGNN for multivariate spatiotemporal prediction. UQGNN introduces two key innovations: (i) an Interaction-aware Spatiotemporal Embedding Module that integrates a multivariate diffusion graph convolutional network and an interaction-aware temporal convolutional network to effectively capture complex spatial and temporal interaction patterns, and (ii) a multivariate probabilistic prediction module designed to estimate both expected mean values and associated uncertainties. Extensive experiments on four real-world multivariate spatiotemporal datasets from Shenzhen, New York City, and Chicago demonstrate that UQGNN consistently outperforms state-of-the-art baselines in both prediction accuracy and uncertainty quantification. For example, on the Shenzhen dataset, UQGNN achieves a 5% improvement in both prediction accuracy and uncertainty quantification. 时空预测在城市规划、交通优化、灾害响应和疫情控制等众多实际应用中具有关键作用。近年来，研究者通过为时空预测开发先进的深度学习模型取得了显著进展。然而，大多数现有模型是确定性的，即仅预测期望均值而不量化不确定性，从而可能导致结果不可靠且不准确。尽管最近的研究引入了用于量化不确定性的概率模型，但它们通常只关注单一现象（例如出租车、自行车、犯罪或交通事故），从而忽视了异构城市现象之间固有的相关性。为了解决这一研究空白，我们提出了一种新颖的带不确定性量化的图神经网络，称为用于多变量时空预测的 UQGNN。 UQGNN 引入了两项关键创新： (i) 一个交互感知时空嵌入模块，该模块整合了多变量扩散图卷积网络和交互感知时间卷积网络，能够有效捕捉复杂的空间与时间交互模式；以及 (ii) 一个多变量概率预测模块，旨在估计期望均值及其相关的不确定性。在来自深圳、纽约市和芝加哥的四个真实多变量时空数据集上进行的大量实验证明，UQGNN 在预测精度和不确定性量化方面均持续优于最先进的基线方法。例如，在深圳数据集上，UQGNN 在预测精度和不确定性量化上均提高了 5%。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-12 01:40:05 UTC 发布：2025-08-12 01:40:05 UTC

#108 OmniLLP: Enhancing LLM-based Log Level Prediction with Context-Aware Retrieval #108 OmniLLP：通过上下文感知检索增强基于 LLM 的日志级别预测

Authors: [Youssef Esseddiq Ouatiti](https://arxiv.org/search/?searchtype=author&query=Youssef Esseddiq Ouatiti), [Mohammed Sayagh](https://arxiv.org/search/?searchtype=author&query=Mohammed Sayagh), [Bram Adams](https://arxiv.org/search/?searchtype=author&query=Bram Adams), [Ahmed E. Hassan](https://arxiv.org/search/?searchtype=author&query=Ahmed E. Hassan) 作者：Youssef Esseddiq Ouatiti、Mohammed Sayagh、Bram Adams、Ahmed E. Hassan

Developers insert logging statements in source code to capture relevant runtime information essential for maintenance and debugging activities. Log level choice is an integral, yet tricky part of the logging activity as it controls log verbosity and therefore influences systems’ observability and performance. Recent advances in ML-based log level prediction have leveraged large language models (LLMs) to propose log level predictors (LLPs) that demonstrated promising performance improvements (AUC between 0.64 and 0.8). Nevertheless, current LLM-based LLPs rely on randomly selected in-context examples, overlooking the structure and the diverse logging practices within modern software projects. In this paper, we propose OmniLLP, a novel LLP enhancement framework that clusters source files based on (1) semantic similarity reflecting the code’s functional purpose, and (2) developer ownership cohesion. By retrieving in-context learning examples exclusively from these semantic and ownership aware clusters, we aim to provide more coherent prompts to LLPs leveraging LLMs, thereby improving their predictive accuracy. Our results show that both semantic and ownership-aware clusterings statistically significantly improve the accuracy (by up to 8% AUC) of the evaluated LLM-based LLPs compared to random predictors (i.e., leveraging randomly selected in-context examples from the whole project). Additionally, our approach that combines the semantic and ownership signal for in-context prediction achieves an impressive 0.88 to 0.96 AUC across our evaluated projects. Our findings highlight the value of integrating software engineering-specific context, such as code semantic and developer ownership signals into LLM-LLPs, offering developers a more accurate, contextually-aware approach to logging and therefore, enhancing system maintainability and observability. 开发者在源代码中插入日志语句以捕获对维护和调试工作至关重要的运行时信息。日志级别的选择是日志活动中一个不可或缺但棘手的部分，因为它控制日志的详细程度，从而影响系统的可观测性和性能。基于机器学习的日志级别预测的最新进展利用了大型语言模型（LLMs）来提出日志级别预测器（LLPs），并展示了有希望的性能提升（AUC 在 0.64 到 0.8 之间）。尽管如此，当前基于 LLM 的 LLPs 依赖于随机选择的上下文示例，忽视了现代软件项目中日志实践的结构性和多样性。本文提出了 OmniLLP，一种新颖的 LLP 增强框架，该框架基于（1）反映代码功能意图的语义相似性和（2）开发者所有权凝聚力对源文件进行聚类。通过仅从这些具备语义和所有权感知的簇中检索上下文学习示例，我们旨在为利用 LLMs 的 LLPs 提供更连贯的提示，从而提高其预测准确性。我们的结果表明，与随机预测器（即从整个项目中随机选择上下文示例）相比，语义感知和所有权感知的聚类在统计学上显著提高了所评估的基于 LLM 的 LLP 的准确性（AUC 最多提升 8%）。此外，我们将语义和所有权信号结合用于上下文预测的方法在所评估的项目中取得了令人印象深刻的 0.88 到 0.96 的 AUC。我们的发现强调了将软件工程特定的上下文（如代码语义和开发者所有权信号）整合到 LLM-LLP 中的价值，为开发者提供了一种更准确、具上下文感知的日志记录方法，从而提升系统的可维护性和可观测性。

Subjects: Software Engineering, Artificial Intelligence 主题：软件工程，人工智能

Publish: 2025-08-12 01:18:56 UTC 发布时间：2025-08-12 01:18:56 协调世界时 (UTC)

#109 AI Agents and the Law #109 人工智能代理与法律

Authors: [Mark O. Riedl](https://arxiv.org/search/?searchtype=author&query=Mark O. Riedl), [Deven R. Desai](https://arxiv.org/search/?searchtype=author&query=Deven R. Desai) 作者：Mark O. Riedl，Deven R. Desai

As AI becomes more “agentic,” it faces technical and socio-legal issues it must address if it is to fulfill its promise of increased economic productivity and efficiency. This paper uses technical and legal perspectives to explain how things change when AI systems start being able to directly execute tasks on behalf of a user. We show how technical conceptions of agents track some, but not all, socio-legal conceptions of agency. That is, both computer science and the law recognize the problems of under-specification for an agent, and both disciplines have robust conceptions of how to address ensuring an agent does what the programmer, or in the law, the principal desires and no more. However, to date, computer science has under-theorized issues related to questions of loyalty and to third parties that interact with an agent, both of which are central parts of the law of agency. First, we examine the correlations between implied authority in agency law and the principle of value-alignment in AI, wherein AI systems must operate under imperfect objective specification. Second, we reveal gaps in the current computer science view of agents pertaining to the legal concepts of disclosure and loyalty, and how failure to account for them can result in unintended effects in AI ecommerce agents. In surfacing these gaps, we show a path forward for responsible AI agent development and deployment. 随着人工智能变得越来越“能动”，如果要实现提高经济生产力和效率的承诺，它必须解决一些技术和社会法律问题。本文从技术和法律的视角解释了当人工智能系统开始能够代表用户直接执行任务时，情况如何发生变化。我们展示了技术上对“代理”的概念如何对应某些但并非全部的社会法律上的代理概念。也就是说，计算机科学和法律都意识到代理存在规格不足的问题，并且两个学科都有完善的概念来处理如何确保代理仅完成程序员或在法律上委托人所期望的任务而不超越其权限。然而，迄今为止，计算机科学在有关忠诚度问题以及与代理交互的第三方问题上理论化不足，而这两者都是代理法的核心部分。首先，我们考察了代理法中“默示授权”与人工智能中“价值对齐”原则之间的关联，在后者中，人工智能系统必须在目标规格不完美的情况下运作。其次，我们揭示了当前计算机科学对代理（agent）在披露和忠诚等法律概念方面存在的认知缺口，以及忽视这些概念如何导致人工智能电子商务代理产生意想不到的后果。在揭示这些缺口的过程中，我们展示了负责任的人工智能代理开发与部署的前进路径。

Subjects: Computers and Society, Artificial Intelligence 主题：计算机与社会，人工智能

Publish: 2025-08-12 01:18:48 UTC

#110 M3-Net: A Cost-Effective Graph-Free MLP-Based Model for Traffic Prediction #110 M3-Net：一种用于交通预测的经济高效无图神经的基于 MLP 的模型

Authors: [Guangyin Jin](https://arxiv.org/search/?searchtype=author&query=Guangyin Jin), [Sicong Lai](https://arxiv.org/search/?searchtype=author&query=Sicong Lai), [Xiaoshuai Hao](https://arxiv.org/search/?searchtype=author&query=Xiaoshuai Hao), [Mingtao Zhang](https://arxiv.org/search/?searchtype=author&query=Mingtao Zhang), [Jinlei Zhang](https://arxiv.org/search/?searchtype=author&query=Jinlei Zhang) 作者：靳广银、赖思聪、郝晓帅、张明涛、张晋磊

Achieving accurate traffic prediction is a fundamental but crucial task in the development of current intelligent transportation systems.Most of the mainstream methods that have made breakthroughs in traffic prediction rely on spatio-temporal graph neural networks, spatio-temporal attention mechanisms, etc. The main challenges of the existing deep learning approaches are that they either depend on a complete traffic network structure or require intricate model designs to capture complex spatio-temporal dependencies. These limitations pose significant challenges for the efficient deployment and operation of deep learning models on large-scale datasets. To address these challenges, we propose a cost-effective graph-free Multilayer Perceptron (MLP) based model M3-Net for traffic prediction. Our proposed model not only employs time series and spatio-temporal embeddings for efficient feature processing but also first introduces a novel MLP-Mixer architecture with a mixture of experts (MoE) mechanism. Extensive experiments conducted on multiple real datasets demonstrate the superiority of the proposed model in terms of prediction performance and lightweight deployment. 实现准确的交通预测是当前智能交通系统发展的一个基础而关键的任务。大多数在交通预测上取得突破的主流方法依赖于时空图神经网络、时空注意力机制等。现有深度学习方法的主要挑战在于：它们要么依赖于完整的交通网络结构，要么需要复杂的模型设计以捕捉复杂的时空依赖关系。这些限制给深度学习模型在大规模数据集上的高效部署和运行带来了重大挑战。为了解决这些问题，我们提出了一种成本效益高、无需图结构的基于多层感知机（MLP）的交通预测模型 M3-Net。我们提出的模型不仅使用时间序列与时空嵌入进行高效特征处理，还首次引入了带有专家混合（MoE）机制的新型 MLP-Mixer 架构。在多个真实数据集上进行的大量实验表明，所提模型在预测性能和轻量化部署方面均具有优越性。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-12 01:11:46 UTC 发布时间：2025-08-12 01:11:46 UTC

#111 LLM-Driven Adaptive 6G-Ready Wireless Body Area Networks: Survey and Framework #111 基于 LLM 的自适应、面向 6G 的无线体域网：综述与框架

Authors: [Azin Sabzian](https://arxiv.org/search/?searchtype=author&query=Azin Sabzian), [Mohammad Jalili Torkamani](https://arxiv.org/search/?searchtype=author&query=Mohammad Jalili Torkamani), [Negin Mahmoudi](https://arxiv.org/search/?searchtype=author&query=Negin Mahmoudi), [Kiana Kiashemshaki](https://arxiv.org/search/?searchtype=author&query=Kiana Kiashemshaki) 作者：Azin Sabzian、Mohammad Jalili Torkamani、Negin Mahmoudi、Kiana Kiashemshaki

Wireless Body Area Networks (WBANs) enable continuous monitoring of physiological signals for applications ranging from chronic disease management to emergency response. Recent advances in 6G communications, post-quantum cryptography, and energy harvesting have the potential to enhance WBAN performance. However, integrating these technologies into a unified, adaptive system remains a challenge. This paper surveys some of the most well-known Wireless Body Area Network (WBAN) architectures, routing strategies, and security mechanisms, identifying key gaps in adaptability, energy efficiency, and quantum-resistant security. We propose a novel Large Language Model-driven adaptive WBAN framework in which a Large Language Model acts as a cognitive control plane, coordinating routing, physical layer selection, micro-energy harvesting, and post-quantum security in real time. Our review highlights the limitations of current heuristic-based designs and outlines a research agenda for resource-constrained, 6G-ready medical systems. This approach aims to enable ultra-reliable, secure, and self-optimizing WBANs for next-generation mobile health applications. 无线体域网（WBANs）能够对生理信号进行持续监测，应用范围从慢性病管理到应急响应。近年来在 6G 通信、后量子密码学和能量采集方面的进展，有望提升 WBAN 的性能。然而，将这些技术集成到一个统一的、自适应的系统中仍然是一项挑战。本文综述了一些最为知名的无线体域网（WBAN）架构、路由策略和安全机制，并指出了在自适应性、能效和抗量子安全性方面的关键缺口。我们提出了一种新颖的以大型语言模型驱动的自适应 WBAN 框架，其中大型语言模型充当认知控制平面，实时协调路由、物理层选择、微能量采集和后量子安全。我们的综述强调了当前基于启发式设计的局限性，并为面向资源受限、6G 就绪的医疗系统勾勒出一项研究议程。该方法旨在为下一代移动健康应用实现超可靠、安全且自我优化的 WBAN。

Subjects: Networking and Internet Architecture, Artificial Intelligence 主题：网络与互联网体系结构，人工智能

Publish: 2025-08-12 00:25:41 UTC 发布时间：2025-08-12 00:25:41 UTC

#112 Playing Atari Space Invaders with Sparse Cosine Optimized Policy Evolution #112 使用稀疏余弦优化策略进化玩雅达利《太空入侵者》

Authors: [Jim O’Connor](https://arxiv.org/search/?searchtype=author&query=Jim O’Connor), [Jay B. Nash](https://arxiv.org/search/?searchtype=author&query=Jay B. Nash), [Derin Gezgin](https://arxiv.org/search/?searchtype=author&query=Derin Gezgin), [Gary B. Parker](https://arxiv.org/search/?searchtype=author&query=Gary B. Parker) 作者：Jim O’Connor、Jay B. Nash、Derin Gezgin、Gary B. Parker

Evolutionary approaches have previously been shown to be effective learning methods for a diverse set of domains. However, the domain of game-playing poses a particular challenge for evolutionary methods due to the inherently large state space of video games. As the size of the input state expands, the size of the policy must also increase in order to effectively learn the temporal patterns in the game space. Consequently, a larger policy must contain more trainable parameters, exponentially increasing the size of the search space. Any increase in search space is highly problematic for evolutionary methods, as increasing the number of trainable parameters is inversely correlated with convergence speed. To reduce the size of the input space while maintaining a meaningful representation of the original space, we introduce Sparse Cosine Optimized Policy Evolution (SCOPE). SCOPE utilizes the Discrete Cosine Transform (DCT) as a pseudo attention mechanism, transforming an input state into a coefficient matrix. By truncating and applying sparsification to this matrix, we reduce the dimensionality of the input space while retaining the highest energy features of the original input. We demonstrate the effectiveness of SCOPE as the policy for the Atari game Space Invaders. In this task, SCOPE with CMA-ES outperforms evolutionary methods that consider an unmodified input state, such as OpenAI-ES and HyperNEAT. SCOPE also outperforms simple reinforcement learning methods, such as DQN and A3C. SCOPE achieves this result through reducing the input size by 53% from 33,600 to 15,625 then using a bilinear affine mapping of sparse DCT coefficients to policy actions learned by the CMA-ES algorithm. 进化方法此前已被证明是适用于多种领域的有效学习方法。然而，对于游戏玩法域来说，进化方法面临特殊挑战，原因在于视频游戏固有的大规模状态空间。随着输入状态规模的扩大，策略规模也必须相应增加，才能有效学习游戏空间中的时序模式。因此，更大的策略必须包含更多可训练参数，进而呈指数级地扩大搜索空间。搜索空间的任何增加对进化方法都是极其不利的，因为可训练参数数量的增加与收敛速度呈反比关系。为在保持原始空间有意义表示的同时缩小输入空间规模，我们提出了稀疏余弦优化策略进化（SCOPE）。SCOPE 将离散余弦变换（DCT）用作一种类注意力机制，将输入状态转换为系数矩阵。通过对该矩阵进行截断并应用稀疏化，我们在保留原始输入高能量特征的同时减少了输入空间的维度。我们展示了将 SCOPE 作为 Atari 游戏《太空入侵者》（Space Invaders）策略的有效性。在此任务中，结合 CMA-ES 的 SCOPE 的表现优于那些考虑未修改输入状态的进化方法，例如 OpenAI-ES 和 HyperNEAT。SCOPE 也优于诸如 DQN 和 A3C 之类的简单强化学习方法。SCOPE 通过将输入尺寸从 33,600 缩减 53% 至 15,625，并对稀疏 DCT 系数使用双线性仿射映射以将其映射到由 CMA-ES 算法学得的策略动作，从而达成了这一结果。

Subjects: Neural and Evolutionary Computing, Artificial Intelligence 主题：神经与进化计算，人工智能

Publish: 2025-08-11 23:44:08 UTC 发布：2025-08-11 23:44:08 UTC

#113 StreetViewAI: Making Street View Accessible Using Context-Aware Multimodal AI #113 StreetViewAI：使用上下文感知多模态 AI 使街景可访问

Authors: [Jon E. Froehlich](https://arxiv.org/search/?searchtype=author&query=Jon E. Froehlich), [Alexander Fiannaca](https://arxiv.org/search/?searchtype=author&query=Alexander Fiannaca), [Nimer Jaber](https://arxiv.org/search/?searchtype=author&query=Nimer Jaber), [Victor Tsara](https://arxiv.org/search/?searchtype=author&query=Victor Tsara), [Shaun Kane](https://arxiv.org/search/?searchtype=author&query=Shaun Kane) 作者：Jon E. Froehlich、Alexander Fiannaca、Nimer Jaber、Victor Tsara、Shaun Kane

Interactive streetscape mapping tools such as Google Street View (GSV) and Meta Mapillary enable users to virtually navigate and experience real-world environments via immersive 360{\deg} imagery but remain fundamentally inaccessible to blind users. We introduce StreetViewAI, the first-ever accessible street view tool, which combines context-aware, multimodal AI, accessible navigation controls, and conversational speech. With StreetViewAI, blind users can virtually examine destinations, engage in open-world exploration, or virtually tour any of the over 220 billion images and 100+ countries where GSV is deployed. We iteratively designed StreetViewAI with a mixed-visual ability team and performed an evaluation with eleven blind users. Our findings demonstrate the value of an accessible street view in supporting POI investigations and remote route planning. We close by enumerating key guidelines for future work. 交互式街景映射工具（例如 Google 街景视图（GSV）和 Meta Mapillary）使用户能够通过沉浸式 360° 图像在虚拟环境中导航和体验现实世界环境，但对盲人用户仍然从根本上不可访问。我们介绍了 StreetViewAI，这是有史以来第一个可访问的街景工具，它结合了具上下文感知的多模态人工智能、可访问的导航控制和对话式语音。借助 StreetViewAI，盲人用户可以在虚拟环境中查看目的地、进行开放世界探索，或虚拟参观 GSV 部署的超过 2200 亿张图像和 100 多个国家中的任何地点。我们与一支视力混合团队迭代设计了 StreetViewAI，并对十一位盲人用户进行了评估。我们的研究结果证明了可访问街景在支持兴趣点（POI）调查和远程路线规划方面的价值。最后我们列举了对未来工作的关键指南。

Subjects: Human-Computer Interaction, Artificial Intelligence 主题：人机交互，人工智能

Publish: 2025-08-11 23:30:39 UTC 发布：2025-08-11 23:30:39 UTC

#114 VISOR: Visual Input-based Steering for Output Redirection in Vision-Language Models #114 VISOR：基于视觉输入的引导用于视觉-语言模型的输出重定向

Authors: [Mansi Phute](https://arxiv.org/search/?searchtype=author&query=Mansi Phute), [Ravikumar Balakrishnan](https://arxiv.org/search/?searchtype=author&query=Ravikumar Balakrishnan) 作者：Mansi Phute，Ravikumar Balakrishnan

Vision Language Models (VLMs) are increasingly being used in a broad range of applications, bringing their security and behavioral control to the forefront. While existing approaches for behavioral control or output redirection, like system prompting in VLMs, are easily detectable and often ineffective, activation-based steering vectors require invasive runtime access to model internals–incompatible with API-based services and closed-source deployments. We introduce VISOR (Visual Input-based Steering for Output Redirection), a novel method that achieves sophisticated behavioral control through optimized visual inputs alone. By crafting universal steering images that induce target activation patterns, VISOR enables practical deployment across all VLM serving modalities while remaining imperceptible compared to explicit textual instructions. We validate VISOR on LLaVA-1.5-7B across three critical alignment tasks: refusal, sycophancy and survival instinct. A single 150KB steering image matches steering vector performance within 1-2% for positive behavioral shifts while dramatically exceeding it for negative steering–achieving up to 25% shifts from baseline compared to steering vectors’ modest changes. Unlike system prompting (3-4% shifts), VISOR provides robust bidirectional control while maintaining 99.9% performance on 14,000 unrelated MMLU tasks. Beyond eliminating runtime overhead and model access requirements, VISOR exposes a critical security vulnerability: adversaries can achieve sophisticated behavioral manipulation through visual channels alone, bypassing text-based defenses. Our work fundamentally re-imagines multimodal model control and highlights the urgent need for defenses against visual steering attacks. 视觉语言模型（VLM）正越来越多地应用于广泛的场景，使其安全性和行为控制成为关注焦点。现有的行为控制或输出重定向方法，例如在 VLM 中的系统提示，通常容易被检测且常常无效，而基于激活的引导向量则需要对模型内部进行侵入式的运行时访问——与基于 API 的服务和闭源部署不兼容。我们提出了 VISOR（基于视觉输入的引导用于输出重定向），这是一种全新的方法，仅通过优化的视觉输入就能实现复杂的行为控制。通过制作能诱导目标激活模式的通用引导图像，VISOR 能在所有 VLM 服务模式中实现实用部署，同时相比明确的文本指令更难被察觉。我们在 LLaVA-1.5-7B 上对 VISOR 进行了验证，覆盖三个关键的对齐任务：拒绝、谄媚行为和生存本能。单张 150KB 的引导图像在正向行为变化时与引导向量的性能相差仅 1-2%，但在负向引导时远远超过引导向量——相比引导向量的温和变化，可实现高达 25%的相对于基线的变化。与系统提示（3-4% 的变化）不同，VISOR 在保持对 14,000 个无关 MMLU 任务 99.9%性能的同时，提供了稳健的双向控制。除了消除运行时开销和对模型访问的要求外，VISOR 还暴露了一个关键的安全漏洞：对手仅通过视觉通道就能实现复杂的行为操纵，从而绕过基于文本的防御。我们的工作从根本上重新构想了多模态模型的控制方式，并强调了针对视觉引导攻击防御的紧迫需求。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-08-11 23:25:16 UTC 发布：2025-08-11 23:25:16 协调世界时 (UTC)

#115 Using LLMs to Capture Users' Temporal Context for Recommendation #115 使用 LLMs 捕捉用户的时间上下文以进行推荐

Authors: [Milad Sabouri](https://arxiv.org/search/?searchtype=author&query=Milad Sabouri), [Masoud Mansoury](https://arxiv.org/search/?searchtype=author&query=Masoud Mansoury), [Kun Lin](https://arxiv.org/search/?searchtype=author&query=Kun Lin), [Bamshad Mobasher](https://arxiv.org/search/?searchtype=author&query=Bamshad Mobasher) 作者：Milad Sabouri、Masoud Mansoury、Kun Lin、Bamshad Mobasher

Effective recommender systems demand dynamic user understanding, especially in complex, evolving environments. Traditional user profiling often fails to capture the nuanced, temporal contextual factors of user preferences, such as transient short-term interests and enduring long-term tastes. This paper presents an assessment of Large Language Models (LLMs) for generating semantically rich, time-aware user profiles. We do not propose a novel end-to-end recommendation architecture; instead, the core contribution is a systematic investigation into the degree of LLM effectiveness in capturing the dynamics of user context by disentangling short-term and long-term preferences. This approach, framing temporal preferences as dynamic user contexts for recommendations, adaptively fuses these distinct contextual components into comprehensive user embeddings. The evaluation across Movies&TV and Video Games domains suggests that while LLM-generated profiles offer semantic depth and temporal structure, their effectiveness for context-aware recommendations is notably contingent on the richness of user interaction histories. Significant gains are observed in dense domains (e.g., Movies&TV), whereas improvements are less pronounced in sparse environments (e.g., Video Games). This work highlights LLMs’ nuanced potential in enhancing user profiling for adaptive, context-aware recommendations, emphasizing the critical role of dataset characteristics for practical applicability. 高效的推荐系统需要对用户进行动态理解，尤其是在复杂且不断演变的环境中。传统的用户画像常常无法捕捉用户偏好的细微时间上下文因素，例如短暂的短期兴趣与持久的长期品味。本文评估了大型语言模型（LLMs）在生成语义丰富且具时间感的用户画像方面的表现。我们并未提出一种新的端到端推荐架构；相反，核心贡献是系统性地研究 LLMs 在通过将短期和长期偏好解耦来捕捉用户上下文动态方面的有效性。该方法将时间偏好构建为推荐的动态用户上下文，自适应地将这些不同的上下文组件融合为综合的用户嵌入。对电影与电视以及视频游戏领域的评估表明，尽管 LLM 生成的画像在语义深度和时间结构上具有优势，但其在上下文感知推荐中的有效性明显取决于用户互动历史的丰富程度。在密集领域（例如电影与电视）观察到显著的提升，而在稀疏环境（例如电子游戏）中的改进则不那么明显。该工作强调了 LLMs 在增强用户画像以实现自适应、上下文感知推荐方面的细致潜力，同时强调了数据集特征对实际可用性的关键作用。

Subjects: Information Retrieval, Artificial Intelligence 主题：信息检索，人工智能

Publish: 2025-08-11 22:48:31 UTC 发布：2025-08-11 22:48:31 协调世界时（UTC）

#116 Steerable Pluralism: Pluralistic Alignment via Few-Shot Comparative Regression #116 可引导的多元主义：通过少样本比较回归实现多元对齐

Large language models (LLMs) are currently aligned using techniques such as reinforcement learning from human feedback (RLHF). However, these methods use scalar rewards that can only reflect user preferences on average. Pluralistic alignment instead seeks to capture diverse user preferences across a set of attributes, moving beyond just helpfulness and harmlessness. Toward this end, we propose a steerable pluralistic model based on few-shot comparative regression that can adapt to individual user preferences. Our approach leverages in-context learning and reasoning, grounded in a set of fine-grained attributes, to compare response options and make aligned choices. To evaluate our algorithm, we also propose two new steerable pluralistic benchmarks by adapting the Moral Integrity Corpus (MIC) and the HelpSteer2 datasets, demonstrating the applicability of our approach to value-aligned decision-making and reward modeling, respectively. Our few-shot comparative regression approach is interpretable and compatible with different attributes and LLMs, while outperforming multiple baseline and state-of-the-art methods. Our work provides new insights and research directions in pluralistic alignment, enabling a more fair and representative use of LLMs and advancing the state-of-the-art in ethical AI. 大型语言模型（LLMs）目前使用诸如来自人类反馈的强化学习（RLHF）等技术进行对齐。然而，这些方法使用的标量奖励只能在平均水平上反映用户偏好。多元对齐则试图捕捉在一组属性上多样化的用户偏好，超越仅仅有用性和无害性。为此，我们提出了一个基于少样本比较回归的可引导多元模型，能够适应个体用户偏好。我们的方法利用语境内学习和推理，基于一组细粒度属性来比较响应选项并做出对齐选择。为评估我们的算法，我们还通过改编道德完整性语料库（Moral Integrity Corpus，MIC）和 HelpSteer2 数据集，提出了两个新的可引导多元基准，分别展示了该方法在价值对齐决策和奖励建模方面的适用性。我们的少样本比较回归方法具有可解释性，并兼容不同属性和 LLMs，同时在性能上优于多种基线和最先进的方法。我们的工作为多元化对齐提供了新的见解和研究方向，使 LLMs 的使用更加公平和具有代表性，并推动了伦理人工智能的技术前沿。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-11 22:40:31 UTC 发布：2025-08-11 22:40:31 UTC

#117 When the Domain Expert Has No Time and the LLM Developer Has No Clinical Expertise: Real-World Lessons from LLM Co-Design in a Safety-Net Hospital #117 当领域专家没有时间而 LLM 开发者缺乏临床专业知识时：来自安全网医院 LLM 共设计的真实世界经验教训 [PDF ] [Copy] [Kimi ] [REL]

Large language models (LLMs) have the potential to address social and behavioral determinants of health by transforming labor intensive workflows in resource-constrained settings. Creating LLM-based applications that serve the needs of underserved communities requires a deep understanding of their local context, but it is often the case that neither LLMs nor their developers possess this local expertise, and the experts in these communities often face severe time/resource constraints. This creates a disconnect: how can one engage in meaningful co-design of an LLM-based application for an under-resourced community when the communication channel between the LLM developer and domain expert is constrained? We explored this question through a real-world case study, in which our data science team sought to partner with social workers at a safety net hospital to build an LLM application that summarizes patients’ social needs. Whereas prior works focus on the challenge of prompt tuning, we found that the most critical challenge in this setting is the careful and precise specification of \what information to surface to providers so that the LLM application is accurate, comprehensive, and verifiable. Here we present a novel co-design framework for settings with limited access to domain experts, in which the summary generation task is first decomposed into individually-optimizable attributes and then each attribute is efficiently refined and validated through a multi-tier cascading approach. 大型语言模型 (LLMs) 有可能通过改造资源匮乏环境中劳动密集型的工作流程来应对健康的社会与行为决定因素。为服务弱势社区需求而创建基于 LLM 的应用需要对其本地语境有深入理解，但往往 LLM 本身及其开发者都不具备这种本地专业知识，而这些社区的专家通常也面临严重的时间/资源限制。这就造成了脱节：当 LLM 开发者与领域专家之间的沟通渠道受限时，如何与欠资源社区就基于 LLM 的应用进行有意义的共创？我们通过一个真实案例研究探讨了这个问题，在该研究中，我们的数据科学团队试图与一家安全网医院的社工合作，构建一个能够总结患者社会需求的 LLM 应用。此前的工作集中于提示调优的挑战，而我们发现，在这种场景中最关键的挑战是谨慎且精确地指定应向提供者呈现哪些信息，以确保 LLM 应用的准确性、全面性和可验证性。在此我们提出了一种新颖的共创框架，适用于无法充分接触领域专家的场景，其中摘要生成任务首先被分解为可单独优化的属性，然后通过多层级级联方法对每一属性进行高效的细化和验证。

Subjects: Computers and Society, Artificial Intelligence, Machine Learning 主题：计算机与社会、人工智能、机器学习

Publish: 2025-08-11 22:34:23 UTC 发布：2025-08-11 22:34:23 协调世界时

#118 Momentum Point-Perplexity Mechanics in Large Language Models #118 大型语言模型中的动量点-困惑度机制

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-11 21:50:34 UTC 发布：2025-08-11 21:50:34 协调世界时 (UTC)

#119 MAViS: A Multi-Agent Framework for Long-Sequence Video Storytelling #119 MAViS：用于长序列视频叙事的多智能体框架

Authors: [Qian Wang](https://arxiv.org/search/?searchtype=author&query=Qian Wang), [Ziqi Huang](https://arxiv.org/search/?searchtype=author&query=Ziqi Huang), [Ruoxi Jia](https://arxiv.org/search/?searchtype=author&query=Ruoxi Jia), [Paul Debevec](https://arxiv.org/search/?searchtype=author&query=Paul Debevec), [Ning Yu](https://arxiv.org/search/?searchtype=author&query=Ning Yu) 作者：Qian Wang, Ziqi Huang, Ruoxi Jia, Paul Debevec, Ning Yu

Despite recent advances, long-sequence video generation frameworks still suffer from significant limitations: poor assistive capability, suboptimal visual quality, and limited expressiveness. To mitigate these limitations, we propose MAViS, an end-to-end multi-agent collaborative framework for long-sequence video storytelling. MAViS orchestrates specialized agents across multiple stages, including script writing, shot designing, character modeling, keyframe generation, video animation, and audio generation. In each stage, agents operate under the 3E Principle – Explore, Examine, and Enhance – to ensure the completeness of intermediate outputs. Considering the capability limitations of current generative models, we propose the Script Writing Guidelines to optimize compatibility between scripts and generative tools. Experimental results demonstrate that MAViS achieves state-of-the-art performance in assistive capability, visual quality, and video expressiveness. Its modular framework further enables scalability with diverse generative models and tools. With just a brief user prompt, MAViS is capable of producing high-quality, expressive long-sequence video storytelling, enriching inspirations and creativity for users. To the best of our knowledge, MAViS is the only framework that provides multimodal design output – videos with narratives and background music. 尽管最近取得了进展，长序列视频生成框架仍存在显著局限：辅助能力不足、视觉质量欠佳以及表现力有限。为缓解这些问题，我们提出了 MAViS，一种端到端的多代理协作框架，用于长序列视频叙事。MAViS 在多个阶段协调专门化的代理，包含剧本撰写、镜头设计、角色建模、关键帧生成、视频动画和音频生成。在每个阶段，代理都遵循 3E 原则——探索（Explore）、审查（Examine）和增强（Enhance）——以确保中间输出的完整性。考虑到当前生成模型的能力限制，我们提出了剧本撰写指南，以优化剧本与生成工具之间的兼容性。实验结果表明，MAViS 在辅助能力、视觉质量和视频表达力方面达到了最先进的性能。其模块化框架还支持与多种生成模型和工具的可扩展性。只需简短的用户提示，MAViS 即能生成高质量、富有表现力的长序列视频叙事，为用户提供丰富的灵感与创意。据我们所知，MAViS 是唯一提供多模态设计输出的框架——带有叙事和背景音乐的视频。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Multiagent Systems 主题：计算机视觉与模式识别，人工智能，多智能体系统

Publish: 2025-08-11 21:42:41 UTC 发布时间：2025-08-11 21:42:41 协调世界时 (UTC)

#120 Empowering Children to Create AI-Enabled Augmented Reality Experiences #120 赋能儿童创建具备人工智能的增强现实体验

Authors: [Lei Zhang](https://arxiv.org/search/?searchtype=author&query=Lei Zhang), [Shuyao Zhou](https://arxiv.org/search/?searchtype=author&query=Shuyao Zhou), [Amna Liaqat](https://arxiv.org/search/?searchtype=author&query=Amna Liaqat), [Tinney Mak](https://arxiv.org/search/?searchtype=author&query=Tinney Mak), [Brian Berengard](https://arxiv.org/search/?searchtype=author&query=Brian Berengard), [Emily Qian](https://arxiv.org/search/?searchtype=author&query=Emily Qian), [Andrés Monroy-Hernández](https://arxiv.org/search/?searchtype=author&query=Andrés Monroy-Hernández) 作者：张磊、周书尧、Amna Liaqat、Tinney Mak、Brian Berengard、Emily Qian、Andrés Monroy-Hernández

Despite their potential to enhance children’s learning experiences, AI-enabled AR technologies are predominantly used in ways that position children as consumers rather than creators. We introduce Capybara, an AR-based and AI-powered visual programming environment that empowers children to create, customize, and program 3D characters overlaid onto the physical world. Capybara enables children to create virtual characters and accessories using text-to-3D generative AI models, and to animate these characters through auto-rigging and body tracking. In addition, our system employs vision-based AI models to recognize physical objects, allowing children to program interactive behaviors between virtual characters and their physical surroundings. We demonstrate the expressiveness of Capybara through a set of novel AR experiences. We conducted user studies with 20 children in the United States and Argentina. Our findings suggest that Capybara can empower children to harness AI in authoring personalized and engaging AR experiences that seamlessly bridge the virtual and physical worlds. 尽管具有提升儿童学习体验的潜力，基于人工智能的增强现实技术主要仍以将儿童定位为消费者而非创作者的方式被使用。我们介绍了 Capybara，一种基于 AR 并由 AI 驱动的可视化编程环境，使儿童能够创建、定制并为叠加在物理世界之上的 3D 角色编写程序。Capybara 使儿童能够使用文本到 3D 的生成式 AI 模型创建虚拟角色和配件，并通过自动绑定和身体追踪为这些角色制作动画。此外，我们的系统采用基于视觉的 AI 模型识别物理物体，使儿童能够为虚拟角色与其物理环境之间的交互行为编写程序。我们通过一组新颖的 AR 体验展示了 Capybara 的表现力。我们在美国和阿根廷对 20 名儿童进行了用户研究。我们的发现表明，Capybara 能够赋能儿童利用 AI 创作个性化且引人入胜的 AR 体验，顺畅地连接虚拟与现实世界。

Subjects: Human-Computer Interaction, Artificial Intelligence, Graphics, Programming Languages 主题：人机交互、人工智能、图形学、编程语言

Publish: 2025-08-11 20:57:39 UTC 发布时间：2025-08-11 20:57:39 协调世界时

#121 Temporal User Profiling with LLMs: Balancing Short-Term and Long-Term Preferences for Recommendations #121 使用 LLMs 进行时间性用户画像：在推荐中平衡短期和长期偏好

Accurately modeling user preferences is crucial for improving the performance of content-based recommender systems. Existing approaches often rely on simplistic user profiling methods, such as averaging or concatenating item embeddings, which fail to capture the nuanced nature of user preference dynamics, particularly the interactions between long-term and short-term preferences. In this work, we propose LLM-driven Temporal User Profiling (LLM-TUP), a novel method for user profiling that explicitly models short-term and long-term preferences by leveraging interaction timestamps and generating natural language representations of user histories using a large language model (LLM). These representations are encoded into high-dimensional embeddings using a pre-trained BERT model, and an attention mechanism is applied to dynamically fuse the short-term and long-term embeddings into a comprehensive user profile. Experimental results on real-world datasets demonstrate that LLM-TUP achieves substantial improvements over several baselines, underscoring the effectiveness of our temporally aware user-profiling approach and the use of semantically rich user profiles, generated by LLMs, for personalized content-based recommendation. 准确建模用户偏好对于提升基于内容的推荐系统性能至关重要。现有方法通常依赖于简单的用户画像方法，如对物品嵌入取平均或拼接，这些方法未能捕捉用户偏好动态的细微特征，尤其是长期偏好与短期偏好之间的相互作用。在这项工作中，我们提出了由 LLM 驱动的时序用户画像（LLM-TUP），这是一种新颖的用户画像方法，通过利用交互时间戳并使用大型语言模型（LLM）生成用户历史的自然语言表示，来显式建模短期和长期偏好。这些表示使用预训练的 BERT 模型编码成高维嵌入，并应用注意力机制动态融合短期与长期嵌入，形成全面的用户画像。在真实世界数据集上的实验结果表明，LLM-TUP 在多种基线方法上取得了显著提升，凸显了我们具有时间感知的用户画像方法的有效性，以及使用由 LLMs 生成的语义丰富的用户画像用于个性化基于内容推荐的价值。

Subjects: Information Retrieval, Artificial Intelligence 主题：信息检索，人工智能

Publish: 2025-08-11 20:28:24 UTC 发布：2025-08-11 20:28:24 协调世界时

#122 Fast weight programming and linear transformers: from machine learning to neurobiology #122 快速权重编程与线性变换器：从机器学习到神经生物学

Authors: [Kazuki Irie](https://arxiv.org/search/?searchtype=author&query=Kazuki Irie), [Samuel J. Gershman](https://arxiv.org/search/?searchtype=author&query=Samuel J. Gershman) 作者：Kazuki Irie，Samuel J. Gershman

Recent advances in artificial neural networks for machine learning, and language modeling in particular, have established a family of recurrent neural network (RNN) architectures that, unlike conventional RNNs with vector-form hidden states, use two-dimensional (2D) matrix-form hidden states. Such 2D-state RNNs, known as Fast Weight Programmers (FWPs), can be interpreted as a neural network whose synaptic weights (called fast weights) dynamically change over time as a function of input observations, and serve as short-term memory storage; corresponding synaptic weight modifications are controlled or programmed by another network (the programmer) whose parameters are trained (e.g., by gradient descent). In this Primer, we review the technical foundations of FWPs, their computational characteristics, and their connections to transformers and state space models. We also discuss connections between FWPs and models of synaptic plasticity in the brain, suggesting a convergence of natural and artificial intelligence. 近年来，人工神经网络在机器学习领域，特别是语言建模方面的进展，确立了一类循环神经网络（RNN）架构。与具有向量形式隐藏状态的传统 RNN 不同，这些架构使用二维（2D）矩阵形式的隐藏状态。此类二维状态 RNN，称为快速权重编程器（FWPs），可被解释为一种其突触权重（称为快速权重）会随输入观测动态变化并作为短期记忆存储的神经网络；相应的突触权重修改由另一个网络（程序员）控制或编程，其参数被训练（例如通过梯度下降）。在本入门综述中，我们回顾了 FWPs 的技术基础、它们的计算特性以及它们与变换器和状态空间模型的联系。我们还讨论了 FWPs 与大脑突触可塑性模型之间的联系，提出自然智能与人工智能的趋同。

Subjects: Machine Learning, Artificial Intelligence, Neurons and Cognition 主题：机器学习、人工智能、神经元与认知

Publish: 2025-08-11 19:50:03 UTC 发布时间：2025-08-11 19:50:03 UTC

#123 Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment #123 重新思考面向丰富形态变化的分词：一元模型相对于 BPE 的主导地位与形态学对齐

Prior work on language modeling showed conflicting findings about whether morphologically aligned approaches to tokenization improve performance, particularly for languages with complex morphology. To investigate this, we select a typologically diverse set of languages: Telugu (agglutinative), Hindi (primarily fusional with some agglutination), and English (fusional). We conduct a comprehensive evaluation of language models – starting from tokenizer training and extending through the finetuning and downstream task evaluation. To account for the consistent performance differences observed across tokenizer variants, we focus on two key factors: morphological alignment and tokenization quality. To assess morphological alignment of tokenizers in Telugu, we create a dataset containing gold morpheme segmentations of 600 derivational and 7000 inflectional word forms. Our experiments reveal that better morphological alignment correlates positively – though moderately – with performance in syntax-based tasks such as Parts-of-Speech tagging, Named Entity Recognition and Dependency Parsing. However, we also find that the tokenizer algorithm (Byte-pair Encoding vs. Unigram) plays a more significant role in influencing downstream performance than morphological alignment alone. Naive Unigram tokenizers outperform others across most settings, though hybrid tokenizers that incorporate morphological segmentation significantly improve performance within the BPE framework. In contrast, intrinsic metrics like Corpus Token Count (CTC) and Rényi entropy showed no correlation with downstream performance. 以往关于语言建模的研究在形态对齐的分词方法是否能提升性能方面得出了相互矛盾的结论，尤其针对形态复杂的语言。为对此进行调查，我们选择了一组类型学上多样的语言：泰卢固语（黏着语）、印地语（主要为粘合语并带有一定黏着特征）和英语（粘合语）。我们对语言模型进行了全面评估——从分词器训练开始，延伸到微调和下游任务评估。为了解释在不同分词器变体间观察到的持续性性能差异，我们聚焦于两个关键因素：形态对齐与分词质量。为评估泰卢固语分词器的形态对齐性，我们创建了一个数据集，包含 600 个派生词形和 7000 个屈折词形的金标准词素切分。我们的实验表明，更好的形态对齐与基于句法的任务（如词性标注、命名实体识别和依存句法分析）的性能呈正相关——尽管相关程度为中等。但我们也发现，分词算法（字节对编码 vs. 单元词（Unigram）在影响下游性能方面比单纯的形态对齐更为重要。在大多数设置中，简单的单元词分词器优于其他分词器，尽管在 BPE 框架内引入形态分割的混合分词器能显著提升性能。相比之下，像语料标记计数（CTC）和雷尼熵（Rényi entropy）这样的内在指标与下游性能没有相关性。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-11 19:23:59 UTC 发布：2025-08-11 19:23:59 UTC

#124 Neural Tangent Knowledge Distillation for Optical Convolutional Networks #124 神经切线知识蒸馏用于光学卷积网络

Authors: [Jinlin Xiang](https://arxiv.org/search/?searchtype=author&query=Jinlin Xiang), [Minho Choi](https://arxiv.org/search/?searchtype=author&query=Minho Choi), [Yubo Zhang](https://arxiv.org/search/?searchtype=author&query=Yubo Zhang), [Zhihao Zhou](https://arxiv.org/search/?searchtype=author&query=Zhihao Zhou), [Arka Majumdar](https://arxiv.org/search/?searchtype=author&query=Arka Majumdar), [Eli Shlizerman](https://arxiv.org/search/?searchtype=author&query=Eli Shlizerman) 作者：向锦林、崔珉昊、张宇博、周志浩、Arka Majumdar、Eli Shlizerman

Hybrid Optical Neural Networks (ONNs, typically consisting of an optical frontend and a digital backend) offer an energy-efficient alternative to fully digital deep networks for real-time, power-constrained systems. However, their adoption is limited by two main challenges: the accuracy gap compared to large-scale networks during training, and discrepancies between simulated and fabricated systems that further degrade accuracy. While previous work has proposed end-to-end optimizations for specific datasets (e.g., MNIST) and optical systems, these approaches typically lack generalization across tasks and hardware designs. To address these limitations, we propose a task-agnostic and hardware-agnostic pipeline that supports image classification and segmentation across diverse optical systems. To assist optical system design before training, we estimate achievable model accuracy based on user-specified constraints such as physical size and the dataset. For training, we introduce Neural Tangent Knowledge Distillation (NTKD), which aligns optical models with electronic teacher networks, thereby narrowing the accuracy gap. After fabrication, NTKD also guides fine-tuning of the digital backend to compensate for implementation errors. Experiments on multiple datasets (e.g., MNIST, CIFAR, Carvana Masking) and hardware configurations show that our pipeline consistently improves ONN performance and enables practical deployment in both pre-fabrication simulations and physical implementations. 混合光学神经网络（ONNs，通常由光学前端和数字后端组成）为实时、功耗受限的系统提供了比全数字深度网络更节能的替代方案。然而，它们的应用受两大挑战限制：训练过程中相比大规模网络存在的精度差距，以及仿真与实际制造系统之间的不一致性，这些都会进一步降低精度。尽管先前工作提出了针对特定数据集（例如 MNIST）和光学系统的端到端优化方法，但这些方法通常缺乏跨任务和硬件设计的泛化能力。为了解决这些局限性，我们提出了一个与任务无关且与硬件无关的流程，支持跨多样光学系统的图像分类和分割。为在训练前辅助光学系统设计，我们根据用户指定的约束（如物理尺寸和数据集）估算可实现的模型精度。为训练阶段，我们引入了神经切线知识蒸馏（Neural Tangent Knowledge Distillation，NTKD），该方法使光学模型与电子教师网络对齐，从而缩小精度差距。在制造之后，NTKD 还会指导数字后端的微调以补偿实现误差。在多个数据集（例如 MNIST、CIFAR、Carvana Masking）和硬件配置上的实验表明，我们的流程在预制造仿真和实际物理实现中都能持续提升 ONN 的性能并实现实际部署。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning 主题：计算机视觉与模式识别，人工智能，机器学习

Publish: 2025-08-11 19:15:06 UTC 发布时间：2025-08-11 19:15:06 UTC

#125 Generating Query-Relevant Document Summaries via Reinforcement Learning #125 通过强化学习生成与查询相关的文档摘要

Authors: [Nitin Yadav](https://arxiv.org/search/?searchtype=author&query=Nitin Yadav), [Changsung Kang](https://arxiv.org/search/?searchtype=author&query=Changsung Kang), [Hongwei Shang](https://arxiv.org/search/?searchtype=author&query=Hongwei Shang), [Ming Sun](https://arxiv.org/search/?searchtype=author&query=Ming Sun) 作者：Nitin Yadav、Changsung Kang、Hongwei Shang、Ming Sun

E-commerce search engines often rely solely on product titles as input for ranking models with latency constraints. However, this approach can result in suboptimal relevance predictions, as product titles often lack sufficient detail to capture query intent. While product descriptions provide richer information, their verbosity and length make them unsuitable for real-time ranking, particularly for computationally expensive architectures like cross-encoder ranking models. To address this challenge, we propose ReLSum, a novel reinforcement learning framework designed to generate concise, query-relevant summaries of product descriptions optimized for search relevance. ReLSum leverages relevance scores as rewards to align the objectives of summarization and ranking, effectively overcoming limitations of prior methods, such as misaligned learning targets. The framework employs a trainable large language model (LLM) to produce summaries, which are then used as input for a cross-encoder ranking model. Experimental results demonstrate significant improvements in offline metrics, including recall and NDCG, as well as online user engagement metrics. ReLSum provides a scalable and efficient solution for enhancing search relevance in large-scale e-commerce systems. 电子商务搜索引擎通常仅依赖商品标题作为具有延迟限制的排序模型输入。然而，这种方法可能导致相关性预测不佳，因为商品标题往往缺乏足够细节来捕捉查询意图。虽然商品描述提供了更丰富的信息，但其冗长和长度使其不适合实时排序，尤其对于像交叉编码器排序模型这样计算代价高昂的架构。为了解决这一挑战，我们提出了 ReLSum，一种新型强化学习框架，旨在生成针对搜索相关性优化的简洁、与查询相关的商品描述摘要。ReLSum 利用相关性分数作为奖励，将摘要生成与排序的目标对齐，有效克服了先前方法的局限性，例如学习目标不一致。该框架采用可训练的大型语言模型（LLM）来生成摘要，随后将这些摘要作为交叉编码器排序模型的输入。实验结果表明，在离线指标（包括召回率和 NDCG）以及在线用户参与度指标上均有显著提升。 ReLSum 为提升大规模电子商务系统中的搜索相关性提供了一种可扩展且高效的解决方案。

Subjects: Information Retrieval, Artificial Intelligence, Machine Learning 主题：信息检索、人工智能、机器学习

Publish: 2025-08-11 18:52:28 UTC 发布日期：2025-08-11 18:52:28 UTC

#126 Spatiotemporally Consistent Indoor Lighting Estimation with Diffusion Priors #126 使用扩散先验的时空一致室内光照估计

Authors: [Mutian Tong](https://arxiv.org/search/?searchtype=author&query=Mutian Tong), [Rundi Wu](https://arxiv.org/search/?searchtype=author&query=Rundi Wu), [Changxi Zheng](https://arxiv.org/search/?searchtype=author&query=Changxi Zheng) 作者：佟木田、吴润迪、郑昌熙

Indoor lighting estimation from a single image or video remains a challenge due to its highly ill-posed nature, especially when the lighting condition of the scene varies spatially and temporally. We propose a method that estimates from an input video a continuous light field describing the spatiotemporally varying lighting of the scene. We leverage 2D diffusion priors for optimizing such light field represented as a MLP. To enable zero-shot generalization to in-the-wild scenes, we fine-tune a pre-trained image diffusion model to predict lighting at multiple locations by jointly inpainting multiple chrome balls as light probes. We evaluate our method on indoor lighting estimation from a single image or video and show superior performance over compared baselines. Most importantly, we highlight results on spatiotemporally consistent lighting estimation from in-the-wild videos, which is rarely demonstrated in previous works. 从单张图像或视频中估计室内光照仍然具有挑战性，因为这是一个高度病态的问题，特别是当场景的光照条件在空间和时间上变化时。我们提出了一种方法，从输入视频中估计出描述场景时空变化光照的连续光场。我们利用二维扩散先验来优化以多层感知机（MLP）表示的光场。为了实现对真实场景的零样本泛化，我们对一个预训练的图像扩散模型进行微调，使其通过联合对多个作为光探针的铬球进行修补（inpainting）来预测多个位置的光照。我们在单张图像或视频的室内光照估计上评估了我们的方法，并展示了相较于比较基线的优越表现。最重要的是，我们强调了在真实视频上实现的时空一致光照估计结果，这是以往工作很少展示的。

Subjects: Graphics, Artificial Intelligence, Computer Vision and Pattern Recognition 主题：图形学、人工智能、计算机视觉与模式识别

Publish: 2025-08-11 18:11:42 UTC 发布：2025-08-11 18:11:42 UTC

#127 The DNA of nuclear models: How AI predicts nuclear masses #127 核模型的基因：人工智能如何预测核质量

Authors: [Kate A. Richardson](https://arxiv.org/search/?searchtype=author&query=Kate A. Richardson), [Sokratis Trifinopoulos](https://arxiv.org/search/?searchtype=author&query=Sokratis Trifinopoulos), [Mike Williams](https://arxiv.org/search/?searchtype=author&query=Mike Williams) 作者：Kate A. Richardson、Sokratis Trifinopoulos、Mike Williams

Obtaining high-precision predictions of nuclear masses, or equivalently nuclear binding energies, Eb, remains an important goal in nuclear-physics research. Recently, many AI-based tools have shown promising results on this task, some achieving precision that surpasses the best physics models. However, the utility of these AI models remains in question given that predictions are only useful where measurements do not exist, which inherently requires extrapolation away from the training (and testing) samples. Since AI models are largely black boxes, the reliability of such an extrapolation is difficult to assess. We present an AI model that not only achieves cutting-edge precision for Eb, but does so in an interpretable manner. For example, we find (and explain why) that the most important dimensions of its internal representation form a double helix, where the analog of the hydrogen bonds in DNA here link the number of protons and neutrons found in the most stable nucleus of each isotopic chain. Furthermore, we show that the AI prediction of Eb can be factorized and ordered hierarchically, with the most important terms corresponding to well-known symbolic models (such as the famous liquid drop). Remarkably, the improvement of the AI model over symbolic ones can almost entirely be attributed to an observation made by Jaffe in 1969. The end result is a fully interpretable data-driven model of nuclear masses. 获得高精度的核质量预测，或等价地核结合能预测，仍然是核物理研究中的一项重要目标。最近，许多基于人工智能的工具在这一任务上展示了令人鼓舞的结果，其中一些达到了超过最佳物理模型的精度。然而，鉴于预测只有在测量不存在时才有用，这本质上需要对训练（和测试）样本以外进行外推，因而这些人工智能模型的实用性仍然值得怀疑。由于人工智能模型在很大程度上是黑箱，其外推的可靠性难以评估。我们提出了一个人工智能模型，不仅在……方面达到了最前沿的精度，而且以可解释的方式做到这一点。例如，我们发现（并解释了原因）其内部表征中最重要的维度形成了双螺旋结构，这里类似于 DNA 中氢键的作用是将每个同位素链中最稳定核所具有的质子数和中子数连接起来。此外，我们表明对 Eb 的 AI 预测可以被分解并按层次排序，其中最重要的项对应于著名的符号模型（例如著名的液滴模型）。值得注意的是，AI 模型相对于符号模型的改进几乎完全可归因于 Jaffe 在 1969 年提出的一个观察。最终结果是一个完全可解释的数据驱动的核质量模型。

Subjects: Nuclear Theory, Artificial Intelligence, Machine Learning, Nuclear Experiment 主题：核理论，人工智能，机器学习，核实验

Publish: 2025-08-11 18:00:17 UTC 发布时间：2025-08-11 18:00:17 协调世界时 (UTC)

#128 Processing of synthetic data in AI development for healthcare and the definition of personal data in EU law #128 在医疗保健人工智能开发中合成数据的处理以及欧盟法律中个人数据的定义

Authors: [Vibeke Binz Vallevik](https://arxiv.org/search/?searchtype=author&query=Vibeke Binz Vallevik), [Anne Kjersti C. Befring](https://arxiv.org/search/?searchtype=author&query=Anne Kjersti C. Befring), [Severin Elvatun](https://arxiv.org/search/?searchtype=author&query=Severin Elvatun), [Jan Franz Nygaard](https://arxiv.org/search/?searchtype=author&query=Jan Franz Nygaard) 作者：Vibeke Binz Vallevik、Anne Kjersti C. Befring、Severin Elvatun、Jan Franz Nygaard

Artificial intelligence (AI) has the potential to transform healthcare, but it requires access to health data. Synthetic data that is generated through machine learning models trained on real data, offers a way to share data while preserving privacy. However, uncertainties in the practical application of the General Data Protection Regulation (GDPR) create an administrative burden, limiting the benefits of synthetic data. Through a systematic analysis of relevant legal sources and an empirical study, this article explores whether synthetic data should be classified as personal data under the GDPR. The study investigates the residual identification risk through generating synthetic data and simulating inference attacks, challenging common perceptions of technical identification risk. The findings suggest synthetic data is likely anonymous, depending on certain factors, but highlights uncertainties about what constitutes reasonably likely risk. To promote innovation, the study calls for clearer regulations to balance privacy protection with the advancement of AI in healthcare. 人工智能（AI）有可能改变医疗保健，但这需要获得健康数据。通过在真实数据上训练的机器学习模型生成的合成数据，提供了一种在保护隐私的同时共享数据的方式。然而，《通用数据保护条例》（GDPR）在实际应用中的不确定性造成了行政负担，限制了合成数据的益处。通过对相关法律来源的系统性分析和一项实证研究，本文探讨了合成数据是否应根据 GDPR 被归类为个人数据。该研究通过生成合成数据并模拟推断攻击来调查剩余识别风险，挑战了对技术识别风险的普遍看法。研究结果表明，合成数据很可能是匿名的，但取决于某些因素，同时强调了关于什么构成“合理可能风险”的不确定性。为促进创新，该研究呼吁制定更明确的法规，以在隐私保护与推动医疗领域人工智能发展之间取得平衡。

Subjects: Computers and Society, Artificial Intelligence 主题：计算机与社会，人工智能

Publish: 2025-08-11 17:59:06 UTC 发布：2025-08-11 17:59:06 协调世界时（UTC）

#129 Fuzzy-Pattern Tsetlin Machine #129 模糊模式泰特林机（Fuzzy-Pattern Tsetlin Machine）

Author: [Artem Hnilov](https://arxiv.org/search/?searchtype=author&query=Artem Hnilov) 作者：Artem Hnilov

The “all-or-nothing” clause evaluation strategy is a core mechanism in the Tsetlin Machine (TM) family of algorithms. In this approach, each clause - a logical pattern composed of binary literals mapped to input data - is disqualified from voting if even a single literal fails. Due to this strict requirement, standard TMs must employ thousands of clauses to achieve competitive accuracy. This paper introduces the Fuzzy-Pattern Tsetlin Machine (FPTM), a novel variant where clause evaluation is fuzzy rather than strict. If some literals in a clause fail, the remaining ones can still contribute to the overall vote with a proportionally reduced score. As a result, each clause effectively consists of sub-patterns that adapt individually to the input, enabling more flexible, efficient, and robust pattern matching. The proposed fuzzy mechanism significantly reduces the required number of clauses, memory footprint, and training time, while simultaneously improving accuracy. On the IMDb dataset, FPTM achieves 90.15% accuracy with only one clause per class, a 50x reduction in clauses and memory over the Coalesced Tsetlin Machine. FPTM trains up to 316x faster (45 seconds vs. 4 hours) and fits within 50 KB, enabling online learning on microcontrollers. Inference throughput reaches 34.5 million predictions/second (51.4 GB/s). On Fashion-MNIST, accuracy reaches 92.18% (2 clauses), 93.19% (20 clauses) and 94.68% (8000 clauses), a ~400x clause reduction compared to the Composite TM’s 93.00% (8000 clauses). On the Amazon Sales dataset with 20% noise, FPTM achieves 85.22% accuracy, significantly outperforming the Graph Tsetlin Machine (78.17%) and a Graph Convolutional Neural Network (66.23%). “全有或全无”的子句评估策略是 Tsetlin 机（TM）家族算法的核心机制。在这种方法中，每个子句——由映射到输入数据的二元文字组成的逻辑模式——只要有任一文字失败就会被取消投票。由于这一严格要求，标准 TM 必须使用数千个子句才能达到具有竞争力的准确率。本文提出了模糊模式 Tsetlin 机（FPTM），一种将子句评估由严格改为模糊的新变体。如果子句中的某些文字失败，剩余的文字仍可以以按比例降低的分数为整体投票做出贡献。结果，每个子句实际上由能够针对输入单独自适应的子模式组成，从而实现更灵活、更高效且更鲁棒的模式匹配。所提出的模糊机制显著减少了所需子句数量、内存占用和训练时间，同时提高了准确率。在 IMDb 数据集上，FPTM 仅用每类一个子句就达到了 90.15%的准确率，相比合并 Tsetlin 机在子句数量和内存上减少了 50 倍。FPTM 的训练速度快至 316 倍（45 秒对比） 4 小时）并且占用空间小于 50 KB，使得在微控制器上进行在线学习成为可能。推理吞吐量达到每秒 3450 万次预测（51.4 GB/s）。在 Fashion-MNIST 上，准确率分别达到 92.18%（2 个子句）、93.19%（20 个子句）和 94.68%（8000 个子句），与 Composite TM 在 8000 个子句时的 93.00% 相比，实现了约 400 倍的子句减少。在带有 20% 噪声的 Amazon Sales 数据集上，FPTM 达到 85.22% 的准确率，显著优于 Graph Tsetlin Machine（78.17%）和图卷积神经网络（66.23%）。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-11 15:09:12 UTC 发布：2025-08-11 15:09:12 世界统一时间 (UTC)

#130 Do AI Companies Make Good on Voluntary Commitments to the White House? #130 AI 公司是否兑现了对白宫的自愿承诺？

Authors: [Jennifer Wang](https://arxiv.org/search/?searchtype=author&query=Jennifer Wang), [Kayla Huang](https://arxiv.org/search/?searchtype=author&query=Kayla Huang), [Kevin Klyman](https://arxiv.org/search/?searchtype=author&query=Kevin Klyman), [Rishi Bommasani](https://arxiv.org/search/?searchtype=author&query=Rishi Bommasani) 作者：Jennifer Wang、Kayla Huang、Kevin Klyman、Rishi Bommasani

Voluntary commitments are central to international AI governance, as demonstrated by recent voluntary guidelines from the White House to the G7, from Bletchley Park to Seoul. How do major AI companies make good on their commitments? We score companies based on their publicly disclosed behavior by developing a detailed rubric based on their eight voluntary commitments to the White House in 2023. We find significant heterogeneity: while the highest-scoring company (OpenAI) scores a 83% overall on our rubric, the average score across all companies is just 52%. The companies demonstrate systemically poor performance for their commitment to model weight security with an average score of 17%: 11 of the 16 companies receive 0% for this commitment. Our analysis highlights a clear structural shortcoming that future AI governance initiatives should correct: when companies make public commitments, they should proactively disclose how they meet their commitments to provide accountability, and these disclosures should be verifiable. To advance policymaking on corporate AI governance, we provide three directed recommendations that address underspecified commitments, the role of complex AI supply chains, and public transparency that could be applied towards AI governance initiatives worldwide. 自愿承诺在国际人工智能治理中占据核心地位，这一点可从白宫到七国集团（G7）、从布莱切利公园到首尔近期发布的自愿性指南中看出。主要的人工智能公司如何兑现它们的承诺？我们基于这些公司公开披露的行为进行评分，依据的是我们为它们在 2023 年向白宫作出的八项自愿承诺制定的详细评分标准。我们发现存在显著差异：得分最高的公司（OpenAI）在我们的评分标准中总体得分为 83%，而所有公司的平均得分仅为 52%。在对模型权重安全性的承诺上，公司表现普遍较差，平均得分仅为 17%：在 16 家公司中有 11 家在此项承诺上得 0 分。我们的分析突显了一个明显的结构性缺陷，未来的人工智能治理举措应予以纠正：当公司作出公开承诺时，应该主动披露它们如何实现这些承诺以提供问责，并且这些披露应当可核查。为了推进公司人工智能治理的政策制定，我们提出三条有针对性的建议，分别针对不够明确的承诺、复杂的人工智能供应链的作用，以及可用于全球人工智能治理倡议的公开透明性。

Subjects: Computers and Society, Artificial Intelligence 主题：计算机与社会，人工智能

Publish: 2025-08-11 11:23:28 UTC 发布时间：2025-08-11 11:23:28 UTC

#131 Maximizing GPU Efficiency via Optimal Adapter Caching: An Analytical Approach for Multi-Tenant LLM Serving #131 通过最优适配器缓存最大化 GPU 效率：面向多租户 LLM 服务的分析方法

Serving LLM adapters has gained significant attention as an effective approach to adapt general-purpose language models to diverse, task-specific use cases. However, serving a wide range of adapters introduces several and substantial overheads, leading to performance degradation and challenges in optimal placement. To address these challenges, we present an analytical, AI-driven pipeline that accurately determines the optimal allocation of adapters in single-node setups. This allocation maximizes performance, effectively using GPU resources, while preventing request starvation. Crucially, the proposed allocation is given based on current workload patterns. These insights in single-node setups can be leveraged in multi-replica deployments for overall placement, load balancing and server configuration, ultimately enhancing overall performance and improving resource efficiency. Our approach builds on an in-depth analysis of LLM adapter serving, accounting for overheads and performance variability, and includes the development of the first Digital Twin capable of replicating online LLM-adapter serving systems with matching key performance metrics. The experimental results demonstrate that the Digital Twin achieves a SMAPE difference of no more than 5.5% in throughput compared to real results, and the proposed pipeline accurately predicts the optimal placement with minimal latency. 为 LLM 适配器提供服务已成为将通用语言模型调整到各种特定任务用例的有效方法，因而受到高度关注。然而，服务大量不同适配器会带来若干显著开销，导致性能下降并带来最优部署位置选择的挑战。为了解决这些问题，我们提出了一个基于分析和 AI 的管道，能够在单节点设置中精确确定适配器的最佳分配。该分配在最大化性能、有效利用 GPU 资源的同时避免请求饥饿。关键是，所提分配基于当前的工作负载模式给出。这些关于单节点设置的见解可用于多副本部署中的整体放置、负载均衡和服务器配置，从而最终提升整体性能并改善资源效率。我们的方法建立在对 LLM 适配器服务的深入分析之上，考虑了开销和性能波动，并包括开发出首个能够以匹配关键性能指标复现在线 LLM-适配器服务系统的数字孪生。实验结果表明，与真实结果相比，数字孪生在吞吐量上与真实结果的 SMAPE 差异不超过 5.5%，且所提出的流程能以极低的延迟准确预测最优部署位置。

Subjects: Performance, Artificial Intelligence, Computation and Language 主题：性能、人工智能、计算与语言

Publish: 2025-08-11 10:47:35 UTC 发布：2025-08-11 10:47:35 UTC

#132 ImageDDI: Image-enhanced Molecular Motif Sequence Representation for Drug-Drug Interaction Prediction #132 ImageDDI：用于药物相互作用预测的图像增强分子基序列表示 [PDF 1 ] [Copy] [Kimi ] [REL]

To mitigate the potential adverse health effects of simultaneous multi-drug use, including unexpected side effects and interactions, accurately identifying and predicting drug-drug interactions (DDIs) is considered a crucial task in the field of deep learning. Although existing methods have demonstrated promising performance, they suffer from the bottleneck of limited functional motif-based representation learning, as DDIs are fundamentally caused by motif interactions rather than the overall drug structures. In this paper, we propose an Image-enhanced molecular motif sequence representation framework for \textbf{DDI} prediction, called ImageDDI, which represents a pair of drugs from both global and local structures. Specifically, ImageDDI tokenizes molecules into functional motifs. To effectively represent a drug pair, their motifs are combined into a single sequence and embedded using a transformer-based encoder, starting from the local structure representation. By leveraging the associations between drug pairs, ImageDDI further enhances the spatial representation of molecules using global molecular image information (e.g. texture, shadow, color, and planar spatial relationships). To integrate molecular visual information into functional motif sequence, ImageDDI employs Adaptive Feature Fusion, enhancing the generalization of ImageDDI by dynamically adapting the fusion process of feature representations. Experimental results on widely used datasets demonstrate that ImageDDI outperforms state-of-the-art methods. Moreover, extensive experiments show that ImageDDI achieved competitive performance in both 2D and 3D image-enhanced scenarios compared to other models. 为缓解同时使用多种药物可能带来的不良健康影响，包括意外副作用和相互作用，准确识别和预测药物-药物相互作用（DDI）被认为是深度学习领域的一项关键任务。尽管现有方法已显示出良好性能，但在基于功能基序的表示学习方面仍受限，因为 DDI 本质上是由基序之间的相互作用引起的，而非整体药物结构。在本文中，我们提出了一种用于 DDI 预测的图像增强分子基序序列表示框架，称为 ImageDDI，它从全局和局部结构两方面表示一对药物。具体而言，ImageDDI 将分子分解为功能基序。为有效表示一对药物，它将它们的基序合并为单一序列，并使用基于 Transformer 的编码器从局部结构表示开始进行嵌入。通过利用药物对之间的关联，ImageDDI 利用全局分子图像信息（例如纹理、阴影、颜色和平面空间关系）进一步增强了分子的空间表示。为了将分子视觉信息整合到功能基序序列中，ImageDDI 采用了自适应特征融合，通过动态调整特征表示的融合过程来增强 ImageDDI 的泛化能力。广泛使用的数据集上的实验结果表明，ImageDDI 优于最先进的方法。此外，大量实验证明，与其他模型相比，ImageDDI 在二维和三维图像增强场景中均取得了具有竞争力的性能。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning 主题：计算机视觉与模式识别，人工智能，机器学习

Publish: 2025-08-11 03:26:50 UTC 发布：2025-08-11 03:26:50 UTC

Authors: [Zeyu Tang](https://arxiv.org/search/?searchtype=author&query=Zeyu Tang), [Alex John London](https://arxiv.org/search/?searchtype=author&query=Alex John London), [Atoosa Kasirzadeh](https://arxiv.org/search/?searchtype=author&query=Atoosa Kasirzadeh), [Sanmi Koyejo](https://arxiv.org/search/?searchtype=author&query=Sanmi Koyejo), [Peter Spirtes](https://arxiv.org/search/?searchtype=author&query=Peter Spirtes), [Kun Zhang](https://arxiv.org/search/?searchtype=author&query=Kun Zhang) 作者：Zeyu Tang、Alex John London、Atoosa Kasirzadeh、Sanmi Koyejo、Peter Spirtes、Kun Zhang

Social determinants are variables that, while not directly pertaining to any specific individual, capture key aspects of contexts and environments that have direct causal influences on certain attributes of an individual. Previous algorithmic fairness literature has primarily focused on sensitive attributes, often overlooking the role of social determinants. Our paper addresses this gap by introducing formal and quantitative rigor into a space that has been shaped largely by qualitative proposals regarding the use of social determinants. To demonstrate theoretical perspectives and practical applicability, we examine a concrete setting of college admissions, using region as a proxy for social determinants. Our approach leverages a region-based analysis with Gamma distribution parameterization to model how social determinants impact individual outcomes. Despite its simplicity, our method quantitatively recovers findings that resonate with nuanced insights in previous qualitative debates, that are often missed by existing algorithmic fairness approaches. Our findings suggest that mitigation strategies centering solely around sensitive attributes may introduce new structural injustice when addressing existing discrimination. Considering both sensitive attributes and social determinants facilitates a more comprehensive explication of benefits and burdens experienced by individuals from diverse demographic backgrounds as well as contextual environments, which is essential for understanding and achieving fairness effectively and transparently. 社会决定因素是一些变量，虽然并不直接针对任何特定个体，但它们捕捉了对个体某些属性具有直接因果影响的情境和环境的关键方面。以往的算法公平性文献主要关注敏感属性，常常忽视社会决定因素的作用。我们的论文通过在一个主要由关于社会决定因素使用的定性提议塑造的领域中引入形式化和定量严谨性来填补这一空白。为了展示理论视角和实际适用性，我们考察了大学录取的具体场景，使用地区作为社会决定因素的替代指标。我们的方法利用基于地区的分析并采用伽玛分布参数化来建模社会决定因素如何影响个体结果。尽管方法简单，我们的模型在定量上恢复出与以往定性讨论中复杂见解相呼应的发现，而这些见解通常被现有的算法公平性方法所忽略。我们的研究表明，仅以敏感属性为中心的缓解策略在应对现有歧视时可能会引入新的结构性不公。将敏感属性与社会决定因素一起考虑，有助于更全面地阐明来自不同人口背景和不同情境环境的个体所经历的利益与负担，这对于有效且透明地理解与实现公平至关重要。

Subjects: Computers and Society, Artificial Intelligence, Machine Learning 主题：计算机与社会、人工智能、机器学习

Publish: 2025-08-10 23:55:16 UTC 发布：2025-08-10 23:55:16 UTC

#134 HSA-Net: Hierarchical and Structure-Aware Framework for Efficient and Scalable Molecular Language Modeling #134 HSA-Net：用于高效且可扩展分子语言建模的分层与结构感知框架

Authors: [Zihang Shao](https://arxiv.org/search/?searchtype=author&query=Zihang Shao), [Wentao Lei](https://arxiv.org/search/?searchtype=author&query=Wentao Lei), [Lei Wang](https://arxiv.org/search/?searchtype=author&query=Lei Wang), [Wencai Ye](https://arxiv.org/search/?searchtype=author&query=Wencai Ye), [Li Liu](https://arxiv.org/search/?searchtype=author&query=Li Liu) 作者：邵子航、雷文涛、王磊、叶文才、刘力

Molecular representation learning, a cornerstone for downstream tasks like molecular captioning and molecular property prediction, heavily relies on Graph Neural Networks (GNN). However, GNN suffers from the over-smoothing problem, where node-level features collapse in deep GNN layers. While existing feature projection methods with cross-attention have been introduced to mitigate this issue, they still perform poorly in deep features. This motivated our exploration of using Mamba as an alternative projector for its ability to handle complex sequences. However, we observe that while Mamba excels at preserving global topological information from deep layers, it neglects fine-grained details in shallow layers. The capabilities of Mamba and cross-attention exhibit a global-local trade-off. To resolve this critical global-local trade-off, we propose Hierarchical and Structure-Aware Network (HSA-Net), a novel framework with two modules that enables a hierarchical feature projection and fusion. Firstly, a Hierarchical Adaptive Projector (HAP) module is introduced to process features from different graph layers. It learns to dynamically switch between a cross-attention projector for shallow layers and a structure-aware Graph-Mamba projector for deep layers, producing high-quality, multi-level features. Secondly, to adaptively merge these multi-level features, we design a Source-Aware Fusion (SAF) module, which flexibly selects fusion experts based on the characteristics of the aggregation features, ensuring a precise and effective final representation fusion. Extensive experiments demonstrate that our HSA-Net framework quantitatively and qualitatively outperforms current state-of-the-art (SOTA) methods. 分子表示学习是分子说明生成与分子性质预测等下游任务的基石，严重依赖图神经网络（GNN）。然而，GNN 存在过平滑问题，在深层 GNN 中节点级特征会塌缩。尽管已有采用交叉注意力的特征投影方法来缓解此问题，但它们在深层特征上仍表现不佳。这促使我们探索将 Mamba 作为备选投影器，用以处理复杂序列的能力。然而我们观察到，Mamba 虽然擅长保留深层的全局拓扑信息，却忽视了浅层的细粒度细节。Mamba 与交叉注意力的能力表现出一种全局—局部的权衡。为了解决这一关键的全局—局部权衡，我们提出了分层与结构感知网络（HSA-Net），这是一种包含两个模块的新型框架，能够实现分层特征的投影与融合。首先，引入了一个分层自适应投影器（HAP）模块来处理来自不同图层的特征。它学习在浅层动态切换到用于浅层的交叉注意力投影器和用于深层的结构感知 Graph-Mamba 投影器，生成高质量的多层次特征。其次，为了自适应地融合这些多层次特征，我们设计了一个源感知融合（SAF）模块，该模块根据聚合特征的特性灵活选择融合专家，确保最终表示融合的精确性和有效性。大量实验证明，我们的 HSA-Net 框架在定量和定性上均优于当前的最先进（SOTA）方法。

Subjects: Machine Learning, Artificial Intelligence, Quantitative Methods 主题：机器学习、人工智能、定量方法

Publish: 2025-08-10 15:22:42 UTC 发布：2025-08-10 15:22:42 UTC

#135 Normative Moral Pluralism for AI: A Framework for Deliberation in Complex Moral Contexts #135 面向人工智能的规范性道德多元论：在复杂道德情境中进行审议的框架

Author: [David-Doron Yaacov](https://arxiv.org/search/?searchtype=author&query=David-Doron Yaacov) 作者：David-Doron Yaacov

The conceptual framework proposed in this paper centers on the development of a deliberative moral reasoning system - one designed to process complex moral situations by generating, filtering, and weighing normative arguments drawn from diverse ethical perspectives. While the framework is rooted in Machine Ethics, it also makes a substantive contribution to Value Alignment by outlining a system architecture that links structured moral reasoning to action under time constraints. Grounded in normative moral pluralism, this system is not constructed to imitate behavior but is built on reason-sensitive deliberation over structured moral content in a transparent and principled manner. Beyond its role as a deliberative system, it also serves as the conceptual foundation for a novel two-level architecture: functioning as a moral reasoning teacher envisioned to train faster models that support real-time responsiveness without reproducing the full structure of deliberative reasoning. Together, the deliberative and intuitive components are designed to enable both deep reflection and responsive action. A key design feature is the dual-hybrid structure: a universal layer that defines a moral threshold through top-down and bottom-up learning, and a local layer that learns to weigh competing considerations in context while integrating culturally specific normative content, so long as it remains within the universal threshold. By extending the notion of moral complexity to include not only conflicting beliefs but also multifactorial dilemmas, multiple stakeholders, and the integration of non-moral considerations, the framework aims to support morally grounded decision-making in realistic, high-stakes contexts. 本文提出的概念框架以构建一个审议性道德推理系统为核心——该系统旨在通过从多样的伦理视角生成、筛选并权衡规范性论据来处理复杂的道德情境。尽管该框架扎根于机器伦理学，但它也通过勾画一个将结构化道德推理与在时间受限下的行动相连接的系统架构，为价值对齐研究做出了实质性贡献。基于规范性道德多元论，该系统并非为了模仿行为而构建，而是以对结构化道德内容进行对理由敏感的审议为基础，强调透明性和原则性。除了作为审议系统的作用外，它还作为一种新颖的两层架构的概念基础：被设想为道德推理教师，用以训练更快速的模型，从而在不复制完整审议推理结构的情况下支持实时响应。审议组件与直觉组件共同设计，旨在实现深度反思与敏捷行动。一个关键的设计特征是双混合结构：一个通过自上而下和自下而上的学习来界定道德阈值的通用层，以及一个在局部层面学习在情境中权衡相互竞争的考量并整合文化特定的规范性内容的层，只要这些内容仍处于通用阈值之内。通过将道德复杂性的概念扩展为不仅包含相互冲突的信念，还包括多因素困境、多方利益相关者以及非道德考量的整合，该框架旨在支持在现实、高风险情境中的有道德依据的决策。

Subjects: Computers and Society, Artificial Intelligence 主题：计算机与社会，人工智能

Publish: 2025-08-10 14:52:23 UTC 发布日期：2025-08-10 14:52:23 协调世界时 (UTC)

#136 Energy-Aware Code Generation with LLMs: Benchmarking Small vs. Large Language Models for Sustainable AI Programming #136 能源感知的代码生成与 LLMs：为可持续 AI 编程对小型与大型语言模型进行基准测试 [PDF ] [Copy] [Kimi ] [REL]

Authors: [Humza Ashraf](https://arxiv.org/search/?searchtype=author&query=Humza Ashraf), [Syed Muhammad Danish](https://arxiv.org/search/?searchtype=author&query=Syed Muhammad Danish), [Aris Leivadeas](https://arxiv.org/search/?searchtype=author&query=Aris Leivadeas), [Yazan Otoum](https://arxiv.org/search/?searchtype=author&query=Yazan Otoum), [Zeeshan Sattar](https://arxiv.org/search/?searchtype=author&query=Zeeshan Sattar) 作者：Humza Ashraf、Syed Muhammad Danish、Aris Leivadeas、Yazan Otoum、Zeeshan Sattar

Large Language Models (LLMs) are widely used for code generation. However, commercial models like ChatGPT require significant computing power, which leads to high energy use and carbon emissions. This has raised concerns about their environmental impact. In this study, we evaluate open-source Small Language Models (SLMs) trained explicitly for code generation and compare their performance and energy efficiency against large LLMs and efficient human-written Python code. The goal is to investigate whether SLMs can match the performance of LLMs on certain types of programming problems while producing more energy-efficient code. We evaluate 150 coding problems from LeetCode, evenly distributed across three difficulty levels: easy, medium, and hard. Our comparison includes three small open-source models, StableCode-3B, StarCoderBase-3B, and Qwen2.5-Coder-3B-Instruct, and two large commercial models, GPT-4.0 and DeepSeek-Reasoner. The generated code is evaluated using four key metrics: run-time, memory usage, energy consumption, and correctness. We use human-written solutions as a baseline to assess the quality and efficiency of the model-generated code. Results indicate that LLMs achieve the highest correctness across all difficulty levels, but SLMs are often more energy-efficient when their outputs are correct. In over 52% of the evaluated problems, SLMs consumed the same or less energy than LLMs. 大型语言模型 (LLMs) 被广泛用于代码生成。然而，像 ChatGPT 这样的商用模型需要大量计算资源，导致高能耗和碳排放。这引发了对其环境影响的担忧。在本研究中，我们评估了为代码生成专门训练的开源小型语言模型 (SLMs)，并将其性能和能效与大型 LLMs 以及高效的人类撰写的 Python 代码进行比较。研究目标是考察 SLMs 是否能在某些类型的编程问题上达到 LLMs 的性能，同时生成更节能的代码。我们评估了来自 LeetCode 的 150 道编程题，按难度均匀分布为三类：简单、中等和困难。比较对象包括三个小型开源模型 StableCode-3B、StarCoderBase-3B 和 Qwen2.5-Coder-3B-Instruct，以及两个大型商用模型 GPT-4.0 和 DeepSeek-Reasoner。所生成的代码使用四个关键指标进行评估：运行时间、内存使用、能源消耗和正确性。我们使用人类撰写的解法作为基线来评估模型生成代码的质量和效率。结果表明，LLMs 在所有难度级别上都取得了最高的正确率，但在输出正确时，SLMs 通常更节能。在超过 52% 的被评估问题中，SLMs 消耗的能量与 LLMs 相同或更少。

Subjects: Software Engineering, Artificial Intelligence 主题：软件工程，人工智能

Publish: 2025-08-10 14:44:06 UTC 发布：2025-08-10 14:44:06 UTC

#137 Algorithmic Collusion of Pricing and Advertising on E-commerce Platforms #137 电子商务平台上定价与广告的算法性合谋

Authors: [Hangcheng Zhao](https://arxiv.org/search/?searchtype=author&query=Hangcheng Zhao), [Ron Berman](https://arxiv.org/search/?searchtype=author&query=Ron Berman) 作者：赵杭程，Ron Berman

Online sellers have been adopting AI learning algorithms to automatically make product pricing and advertising decisions on e-commerce platforms. When sellers compete using such algorithms, one concern is that of tacit collusion - the algorithms learn to coordinate on higher than competitive. We empirically investigate whether these concerns are valid when sellers make pricing and advertising decisions together, i.e., two-dimensional decisions. Our empirical strategy is to analyze competition with multi-agent reinforcement learning, which we calibrate to a large-scale dataset collected from Amazon.com products. Our first contribution is to find conditions under which learning algorithms can facilitate win-win-win outcomes that are beneficial for consumers, sellers, and even the platform, when consumers have high search costs. In these cases the algorithms learn to coordinate on prices that are lower than competitive prices. The intuition is that the algorithms learn to coordinate on lower advertising bids, which lower advertising costs, leading to lower prices. Our second contribution is an analysis of a large-scale, high-frequency keyword-product dataset for more than 2 million products on Amazon.com. Our estimates of consumer search costs show a wide range of costs for different product keywords. We generate an algorithm usage and find a negative interaction between the estimated consumer search costs and the algorithm usage index, providing empirical evidence of beneficial collusion. Finally, we analyze the platform’s strategic response. We find that reserve price adjustments will not increase profits for the platform, but commission adjustments will. Our analyses help alleviate some worries about the potentially harmful effects of competing learning algorithms, and can help sellers, platforms and policymakers to decide on whether to adopt or regulate such algorithms. 在线卖家一直在采用人工智能学习算法，在电商平台上自动做出商品定价和广告决策。当卖家使用此类算法竞争时，一个令人担忧的问题是默契式合谋——算法学会协调以维持高于竞争性的价格。我们从实证上调查了当卖家同时做出定价和广告决策（即二维决策）时，这些担忧是否成立。我们的实证策略是分析多智能体强化学习下的竞争，并将其校准到从 Amazon.com 产品收集的大规模数据集。我们的第一项贡献是发现了在何种条件下，当消费者搜索成本较高时，学习算法能够促成对消费者、卖家乃至平台都有利的三赢结果。在这些情况下，算法学会协调出低于竞争性价格的定价。其直观原因是算法学会协调出更低的广告出价，从而降低广告成本，进而带来更低的价格。我们的第二项贡献是对亚马逊网站上超过 200 万种产品的大规模高频关键词-产品数据集的分析。我们对消费者搜索成本的估计显示，不同产品关键词的成本存在很大差异。我们构建了一个算法使用指数，并发现估计的消费者搜索成本与该算法使用指数之间存在负向交互，提供了有利串通的经验证据。最后，我们分析了平台的战略应对。我们发现调整保底价不会增加平台利润，但调整佣金会。我们的分析有助于缓解人们对竞争学习算法潜在有害影响的一些担忧，并能帮助卖家、平台和政策制定者决定是否采用或监管此类算法。

Subjects: General Economics, Artificial Intelligence, Computational Engineering, Finance, and Science, Computer Science and Game Theory, Machine Learning 主题：一般经济学、人工智能、计算工程、金融与科学、计算机科学与博弈论、机器学习

Publish: 2025-08-09 18:44:17 UTC 发布：2025-08-09 18:44:17 UTC

#138 Context Engineering for Multi-Agent LLM Code Assistants Using Elicit, NotebookLM, ChatGPT, and Claude Code #138 使用 Elicit、NotebookLM、ChatGPT 和 Claude Code 进行多代理 LLM 代码助理的上下文工程

Author: [Muhammad Haseeb](https://arxiv.org/search/?searchtype=author&query=Muhammad Haseeb) 作者：Muhammad Haseeb

Large Language Models (LLMs) have shown promise in automating code generation and software engineering tasks, yet they often struggle with complex, multi-file projects due to context limitations and knowledge gaps. We propose a novel context engineering workflow that combines multiple AI components: an Intent Translator (GPT-5) for clarifying user requirements, an Elicit-powered semantic literature retrieval for injecting domain knowledge, NotebookLM-based document synthesis for contextual understanding, and a Claude Code multi-agent system for code generation and validation. Our integrated approach leverages intent clarification, retrieval-augmented generation, and specialized sub-agents orchestrated via Claude’s agent framework. We demonstrate that this method significantly improves the accuracy and reliability of code assistants in real-world repositories, yielding higher single-shot success rates and better adherence to project context than baseline single-agent approaches. Qualitative results on a large Next.js codebase show the multi-agent system effectively plans, edits, and tests complex features with minimal human intervention. We compare our system with recent frameworks like CodePlan, MASAI, and HyperAgent, highlighting how targeted context injection and agent role decomposition lead to state-of-the-art performance. Finally, we discuss the implications for deploying LLM-based coding assistants in production, along with lessons learned on context management and future research directions. 大型语言模型（LLMs）在自动生成代码和完成软件工程任务方面表现出潜力，但由于上下文限制和知识差距，它们常在复杂的多文件项目中遇到困难。我们提出了一种新颖的上下文工程工作流程，结合了多种 AI 组件：用于澄清用户需求的意图翻译器（GPT-5）、用于注入领域知识的 Elicit 驱动语义文献检索、用于上下文理解的 NotebookLM 文档合成，以及用于代码生成和验证的 Claude Code 多代理系统。我们的一体化方法利用了意图澄清、检索增强生成以及通过 Claude 的代理框架协调的专用子代理。我们证明该方法显著提高了代码助手在真实仓库中的准确性和可靠性，带来了更高的单次成功率并更好地遵循项目上下文。在大型 Next.js 代码库上的定性结果表明，该多代理系统能够在最少人工干预下有效地规划、编辑和测试复杂功能。我们将我们的系统与近期的框架如 CodePlan、MASAI 和 HyperAgent 进行比较，强调有针对性的上下文注入和代理角色分解如何带来最先进的性能。最后，我们讨论了在生产中部署基于 LLM 的编码助手的影响，以及在上下文管理方面的经验教训和未来的研究方向。

Subjects: Software Engineering, Artificial Intelligence 主题：软件工程，人工智能

Publish: 2025-08-09 14:45:53 UTC 发布：2025-08-09 14:45:53 协调世界时

#139 Between Fear and Desire, the Monster Artificial Intelligence (AI): Analysis through the Lenses of Monster Theory #139 在恐惧与欲望之间，怪物人工智能（AI）：通过怪物理论的视角分析

Author: [Ahmed Tlili](https://arxiv.org/search/?searchtype=author&query=Ahmed Tlili) 作者：Ahmed Tlili

With the increasing adoption of Artificial Intelligence (AI) in all fields and daily activities, a heated debate is found about the advantages and challenges of AI and the need for navigating the concerns associated with AI to make the best of it. To contribute to this literature and the ongoing debate related to it, this study draws on the Monster theory to explain the conflicting representation of AI. It suggests that studying monsters in popular culture can provide an in-depth understanding of AI and its monstrous effects. Specifically, this study aims to discuss AI perception and development through the seven theses of Monster theory. The obtained results revealed that, just like monsters, AI is complex in nature, and it should not be studied as a separate entity but rather within a given society or culture. Similarly, readers may perceive and interpret AI differently, just as readers may interpret monsters differently. The relationship between AI and monsters, as depicted in this study, does not seem to be as odd as it might be at first. 随着人工智能（AI）在各个领域和日常活动中的日益普及，围绕 AI 的优势与挑战以及如何应对与 AI 相关的担忧以充分利用其潜力的争论愈演愈烈。为推动该领域文献和相关持续讨论，本文借鉴怪物理论来解释 AI 的相互冲突的表征。文章提出，研究流行文化中的怪物能够深入理解 AI 及其“怪物化”的影响。具体而言，本研究旨在通过怪物理论的七条论断来讨论对 AI 的认知与发展。研究结果表明，正如怪物一样，AI 本质上是复杂的，不应被作为独立实体来研究，而应置于特定社会或文化背景中考察。同样，读者可能会以不同方式感知和解读 AI，就如同读者会以不同方式解读怪物一样。本文所描绘的 AI 与怪物之间的关系，起初看似奇特，但并不像乍看之下那样离奇。

Subjects: Computers and Society, Artificial Intelligence 主题：计算机与社会，人工智能

Publish: 2025-08-09 08:35:14 UTC 发布：2025-08-09 08:35:14 协调世界时 (UTC)

#140 Evaluation of State-of-the-Art Deep Learning Techniques for Plant Disease and Pest Detection #140 评估当前最先进的深度学习技术用于植物疾病和害虫检测

Authors: [Saptarshi Banerjee](https://arxiv.org/search/?searchtype=author&query=Saptarshi Banerjee), [Tausif Mallick](https://arxiv.org/search/?searchtype=author&query=Tausif Mallick), [Amlan Chakroborty](https://arxiv.org/search/?searchtype=author&query=Amlan Chakroborty), [Himadri Nath Saha](https://arxiv.org/search/?searchtype=author&query=Himadri Nath Saha), [Nityananda T. Takur](https://arxiv.org/search/?searchtype=author&query=Nityananda T. Takur) 作者：Saptarshi Banerjee、Tausif Mallick、Amlan Chakroborty、Himadri Nath Saha、Nityananda T. Takur

Addressing plant diseases and pests is critical for enhancing crop production and preventing economic losses. Recent advances in artificial intelligence (AI), machine learning (ML), and deep learning (DL) have significantly improved the precision and efficiency of detection methods, surpassing the limitations of manual identification. This study reviews modern computer-based techniques for detecting plant diseases and pests from images, including recent AI developments. The methodologies are organized into five categories: hyperspectral imaging, non-visualization techniques, visualization approaches, modified deep learning architectures, and transformer models. This structured taxonomy provides researchers with detailed, actionable insights for selecting advanced state-of-the-art detection methods. A comprehensive survey of recent work and comparative studies demonstrates the consistent superiority of modern AI-based approaches, which often outperform older image analysis methods in speed and accuracy. In particular, vision transformers such as the Hierarchical Vision Transformer (HvT) have shown accuracy exceeding 99.3% in plant disease detection, outperforming architectures like MobileNetV3. The study concludes by discussing system design challenges, proposing solutions, and outlining promising directions for future research. 解决植物病害和害虫问题对于提高作物产量和防止经济损失至关重要。人工智能（AI）、机器学习（ML）和深度学习（DL）方面的最新进展显著提升了检测方法的精确性和效率，超越了人工识别的局限性。本研究综述了用于从图像中检测植物病害和害虫的现代计算机技术，包括近期的 AI 发展。方法学分为五类：高光谱成像、非可视化技术、可视化方法、改进的深度学习架构和 Transformer 模型。该结构化分类为研究人员在选择先进的最新检测方法时提供了详细且可操作的见解。对近期工作和比较研究的全面调查表明，现代基于 AI 的方法在速度和精度方面通常优于较早的图像分析方法，表现出持续的优势。尤其是，诸如分层视觉 Transformer（Hierarchical Vision Transformer，HvT）等视觉 Transformer 在植物病害检测中的准确率已超过 99.3%，优于 MobileNetV3 等架构。该研究在结论中讨论了系统设计挑战，提出了解决方案，并概述了未来研究的有希望方向。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-08-09 08:23:33 UTC 发布：2025-08-09 08:23:33 UTC

#141 EU Digital Regulation and Guatemala: AI, 5G, and Cybersecurity #141 欧盟数字监管与危地马拉：人工智能、5G 与网络安全

Author: [Victor Lopez Juarez](https://arxiv.org/search/?searchtype=author&query=Victor Lopez Juarez) 作者：Victor Lopez Juarez

The paper examines how EU rules in AI, 5G, and cybersecurity operate as transnational governance and shape policy in Guatemala. It outlines the AI Act’s risk approach, the 5G Action Plan and Security Toolbox, and the cybersecurity regime built on ENISA, NIS2, the Cybersecurity Act, and the Cyber Resilience Act. It traces extraterritorial channels such as the Brussels effect, private standards, supply chain clauses, and data transfer controls. Guatemala specific impacts include SME compliance costs, procurement limits, environmental trade-offs in rollout, rights risks, and capacity gaps. The paper maps current national measures and proposes five guardrails: digital constitutionalism, green IT duties, third country impact assessment, standards co-design, and recognition of regulatory diversity. 本文考察了欧盟在人工智能、5G 和网络安全领域的规则如何作为跨国治理手段运作并塑造危地马拉的政策。文章概述了《人工智能法》的风险方法、5G 行动计划与安全工具箱，以及建立在 ENISA、NIS2、《网络安全法》和《网络韧性法》之上的网络安全制度。它追溯了诸如布鲁塞尔效应、私人标准、供应链条款和数据传输管控等域外影响渠道。针对危地马拉的具体影响包括中小企业的合规成本、采购限制、部署中的环境权衡、权利风险和能力缺口。文章绘制了当前的国家措施图谱并提出了五项护栏：数字宪政、绿色信息技术义务、第三国影响评估、标准共同设计以及承认监管多样性。

Subjects: Computers and Society, Artificial Intelligence, Emerging Technologies 主题：计算机与社会、人工智能、新兴技术

Publish: 2025-08-09 03:34:18 UTC 发布时间：2025-08-09 03:34:18 UTC

#142 Assessing the Quality of AI-Generated Exams: A Large-Scale Field Study #142 评估 AI 生成考试的质量：一项大规模现场研究

While large language models (LLMs) challenge conventional methods of teaching and learning, they present an exciting opportunity to improve efficiency and scale high-quality instruction. One promising application is the generation of customized exams, tailored to specific course content. There has been significant recent excitement on automatically generating questions using artificial intelligence, but also comparatively little work evaluating the psychometric quality of these items in real-world educational settings. Filling this gap is an important step toward understanding generative AI’s role in effective test design. In this study, we introduce and evaluate an iterative refinement strategy for question generation, repeatedly producing, assessing, and improving questions through cycles of LLM-generated critique and revision. We evaluate the quality of these AI-generated questions in a large-scale field study involving 91 classes – covering computer science, mathematics, chemistry, and more – in dozens of colleges across the United States, comprising nearly 1700 students. Our analysis, based on item response theory (IRT), suggests that for students in our sample the AI-generated questions performed comparably to expert-created questions designed for standardized exams. Our results illustrate the power of AI to make high-quality assessments more readily available, benefiting both teachers and students. 尽管大型语言模型（LLMs）挑战了传统的教学与学习方法，但它们也为提高高质量教学的效率和规模提供了令人兴奋的机会。一种有前景的应用是生成针对特定课程内容的定制考试题目。最近关于使用人工智能自动生成试题的讨论十分热烈，但在真实教育场景中评估这些题目心理测量学质量的工作相对较少。填补这一空白是理解生成式人工智能在有效试题设计中作用的重要一步。在本研究中，我们提出并评估了一种用于试题生成的迭代改进策略，通过大型语言模型生成的批评与修订循环，反复生成、评估并改进试题。我们在一项大规模实地研究中评估了这些由人工智能生成试题的质量，研究涵盖了美国数十所高校的 91 门课程——包括计算机科学、数学、化学等领域——涉及近 1700 名学生。基于项目反应理论（IRT）的分析表明，对于我们样本中的学生，AI 生成的问题与为标准化考试设计的专家出题在表现上相当。我们的结果展示了 AI 在使高质量评估更易获得方面的强大能力，惠及教师和学生。

Subjects: Computers and Society, Artificial Intelligence 主题：计算机与社会，人工智能

Publish: 2025-08-09 01:20:53 UTC 发布：2025-08-09 01:20:53 UTC

#143 Constrained PSLQ Search for Machin-like Identities Achieving Record-Low Lehmer Measures #143 受限 PSLQ 搜索以寻找达到创纪录低 Lehmer 度量的 Machin 类恒等式

Author: [Nick Craig-Wood](https://arxiv.org/search/?searchtype=author&query=Nick Craig-Wood) 作者：Nick Craig-Wood

Machin-like arctangent relations are classical tools for computing π, with efficiency quantified by the Lehmer measure (λ). We present a framework for discovering low-measure relations by coupling the PSLQ integer-relation algorithm with number-theoretic filters derived from the algebraic structure of Gaussian integers, making large scale search tractable. Our search yields new 5 and 6 term relations with record-low Lehmer measures (λ=1.4572,λ=1.3291). We also demonstrate how discovered relations can serve as a basis for generating new, longer formulae through algorithmic extensions. This combined approach of a constrained PSLQ search and algorithmic extension provides a robust method for future explorations. Machin 式反正切关系是计算 π 的经典工具，其效率由 Lehmer 度量（ λ ）来量化。我们提出了一个框架，通过将 PSLQ 整数关系算法与源自高斯整数代数结构的数论过滤器相结合，发现低度量关系，从而使大规模搜索变得可行。我们的搜索产出了新的 5 项和 6 项关系，具有创纪录的低 Lehmer 度量（ λ=1.4572,λ=1.3291 ）。我们还演示了如何将发现的关系用作通过算法扩展生成新的、更长公式的基础。这种受限 PSLQ 搜索与算法扩展相结合的方法为未来的探索提供了一个稳健的手段。

Subjects: Number Theory, Artificial Intelligence 学科：数论，人工智能

Publish: 2025-08-08 11:08:13 UTC 发表：2025-08-08 11:08:13 UTC

#144 Channel-Wise MLPs Improve the Generalization of Recurrent Convolutional Networks

Author: [Nathan Breslow](https://arxiv.org/search/?searchtype=author&query=Nathan Breslow) 作者：Nathan Breslow

We investigate the impact of channel-wise mixing via multi-layer perceptrons (MLPs) on the generalization capabilities of recurrent convolutional networks. Specifically, we compare two architectures: DARC (Depth Aware Recurrent Convolution), which employs a simple recurrent convolutional structure, and DAMP (Depth Aware Multi-layer Perceptron), which extends DARC with a gated MLP for channel mixing. Using the Re-ARC benchmark, we find that DAMP significantly outperforms DARC in both in-distribution and out-of-distribution generalization under exact-match grading criteria. These results suggest that explicit channel mixing through MLPs enables recurrent convolutional networks to learn more robust and generalizable computational patterns. Our findings have implications for neural program synthesis and highlight the potential of DAMP as a target architecture for hypernetwork approaches. 我们研究了通过多层感知机（MLP）进行按通道混合对循环卷积网络泛化能力的影响。具体来说，我们比较了两种架构：DARC（Depth Aware Recurrent Convolution），其采用简单的循环卷积结构；以及 DAMP（Depth Aware Multi-layer Perceptron），其在 DARC 的基础上通过带门控的 MLP 扩展以实现通道混合。使用 Re-ARC 基准测试，我们发现在精确匹配评分标准下，DAMP 在内部分布和分布外泛化方面均明显优于 DARC。这些结果表明，通过 MLP 进行显式通道混合使循环卷积网络能够学习到更稳健且更具泛化性的计算模式。我们的发现对神经程序合成具有启示意义，并突出了 DAMP 作为超网络方法目标架构的潜力。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-06 14:44:15 UTC 发布：2025-08-06 14:44:15 UTC

#145 Putnam-AXIOM: A Functional and Static Benchmark #145 Putnam-AXIOM：一个函数式与静态基准测试

Publish: 2025-08-05 17:57:50 UTC 发布：2025-08-05 17:57:50 UTC

#146 Understanding Transformers through the Lens of Pavlovian Conditioning #146 通过巴甫洛夫式条件反射视角理解变换器

Author: [Mu Qiao](https://arxiv.org/search/?searchtype=author&query=Mu Qiao) 作者：Mu Qiao

Transformer architectures have revolutionized artificial intelligence (AI) through their attention mechanisms, yet the computational principles underlying their success remain opaque. We present a novel theoretical framework that reinterprets the core computation of attention as Pavlovian conditioning. Our model finds a direct mathematical analogue in linear attention, which simplifies the analysis of the underlying associative process. We demonstrate that attention’s queries, keys, and values can be mapped to the three elements of classical conditioning: test stimuli that probe associations, conditional stimuli (CS) that serve as retrieval cues, and unconditional stimuli (US) that contain response information. Through this lens, we suggest that each attention operation constructs a transient associative memory via a Hebbian rule, where CS-US pairs form dynamic associations that test stimuli can later retrieve. Our framework yields several theoretical insights grounded in this linearized model: (1) a capacity theorem showing that attention heads can store O(dk−−√) associations before interference degrades retrieval; (2) an error propagation analysis revealing fundamental architectural trade-offs of balancing model depth, width, and head redundancy to maintain reliability; and (3) an understanding of how biologically plausible learning rules could enhance transformer architectures. By establishing this deep connection, we suggest that the success of modern AI may stem not from architectural novelty alone, but from implementing computational principles that biology optimized over millions of years of evolution. Transformer 架构通过其注意力机制彻底改变了人工智能（AI），但其成功背后的计算原理仍不透明。我们提出了一个新的理论框架，将注意力的核心计算重新解释为巴甫洛夫式条件反射。我们的模型在线性注意力中找到一个直接的数学对应，这简化了对潜在联想过程的分析。我们证明了注意力中的 queries、keys 和 values 可以映射到经典条件反射的三个要素：作为探测联想的测试刺激、作为检索线索的条件刺激（CS）和包含反应信息的无条件刺激（US）。通过这一视角，我们提出每次注意力操作都通过一种赫布规则构建一种瞬时的联想记忆，其中 CS–US 对形成动态联想，随后测试刺激可以检索这些联想。我们的框架基于该线性化模型得出若干理论洞见：（1）一个容量定理表明，在干扰破坏检索之前，注意力头可以存储 O( dk−−√ ) 个关联；（2）误差传播分析揭示了在保持可靠性时平衡模型深度、宽度与头冗余的基本架构权衡；（3）以及对生物学上可行的学习规则如何增强 Transformer 架构的理解。通过建立这一深层联系，我们提出，现代人工智能的成功或许并非仅源于架构上的新颖性，而是在于实现了生物在数百万年进化中优化的计算原理。

Subjects: Machine Learning, Artificial Intelligence, Neurons and Cognition 主题：机器学习、人工智能、神经元与认知

Publish: 2025-08-05 05:00:00 UTC 发布：2025-08-05 05:00:00 UTC

#147 Sacred or Synthetic? Evaluating LLM Reliability and Abstention for Religious Questions #147 神圣还是合成？评估 LLM 在宗教问题上的可靠性与回避

Subjects: Computation and Language, Artificial Intelligence, Computers and Society

Publish: 2025-08-04 07:27:26 UTC 发布：2025-08-04 07:27:26 UTC

#148 The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs #148 进步的幻觉：重新评估 LLMs 中的幻觉检测

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-01 20:34:01 UTC 发布：2025-08-01 20:34:01 UTC

#149 MinionsLLM: a Task-adaptive Framework For The Training and Control of Multi-Agent Systems Through Natural Language #149 MinionsLLM：一种通过自然语言对多智能体系统进行训练和控制的任务自适应框架

Publish: 2025-08-01 13:10:29 UTC 发布：2025-08-01 13:10:29 协调世界时 (UTC)

#150 Multi-grained spatial-temporal feature complementarity for accurate online cellular traffic prediction #150 多粒度时空特征互补用于准确的在线蜂窝流量预测

Authors: [Ningning Fu](https://arxiv.org/search/?searchtype=author&query=Ningning Fu), [Shengheng Liu](https://arxiv.org/search/?searchtype=author&query=Shengheng Liu), [Weiliang Xie](https://arxiv.org/search/?searchtype=author&query=Weiliang Xie), [Yongming Huang](https://arxiv.org/search/?searchtype=author&query=Yongming Huang) 作者：傅宁宁，刘圣衡，谢伟亮，黄永明

Knowledge discovered from telecom data can facilitate proactive understanding of network dynamics and user behaviors, which in turn empowers service providers to optimize cellular traffic scheduling and resource allocation. Nevertheless, the telecom industry still heavily relies on manual expert intervention. Existing studies have been focused on exhaustively explore the spatial-temporal correlations. However, they often overlook the underlying characteristics of cellular traffic, which are shaped by the sporadic and bursty nature of telecom services. Additionally, concept drift creates substantial obstacles to maintaining satisfactory accuracy in continuous cellular forecasting tasks. To resolve these problems, we put forward an online cellular traffic prediction method grounded in Multi-Grained Spatial-Temporal feature Complementarity (MGSTC). The proposed method is devised to achieve high-precision predictions in practical continuous forecasting scenarios. Concretely, MGSTC segments historical data into chunks and employs the coarse-grained temporal attention to offer a trend reference for the prediction horizon. Subsequently, fine-grained spatial attention is utilized to capture detailed correlations among network elements, which enables localized refinement of the established trend. The complementarity of these multi-grained spatial-temporal features facilitates the efficient transmission of valuable information. To accommodate continuous forecasting needs, we implement an online learning strategy that can detect concept drift in real-time and promptly switch to the appropriate parameter update stage. Experiments carried out on four real-world datasets demonstrate that MGSTC outperforms eleven state-of-the-art baselines consistently. 从电信数据中发现的知识可以促进对网络动态和用户行为的主动理解，进而使服务提供商能够优化蜂窝流量调度和资源分配。然而，电信行业仍在很大程度上依赖人工专家干预。现有研究主要集中在全面探索时空相关性，但常常忽视蜂窝流量的潜在特征，而这些特征是由电信服务的零星性和突发性所形成的。此外，概念漂移在持续蜂窝预测任务中带来了维持令人满意精度的重大障碍。为了解决这些问题，我们提出了一种基于多粒度时空特征互补（MGSTC）的在线蜂窝流量预测方法。该方法旨在在实际的连续预测场景中实现高精度预测。具体而言，MGSTC 将历史数据分割成若干块，并采用粗粒度时间注意力为预测时段提供趋势参考。随后，采用细粒度的空间注意力来捕捉网络元素之间的详细关联，从而对已建立的趋势进行局部精细化处理。这些多粒度时空特征的互补性促进了有价值信息的高效传递。为了满足持续预测的需求，我们实现了一种在线学习策略，能够实时检测概念漂移并及时切换到相应的参数更新阶段。在四个真实世界数据集上进行的实验表明，MGSTC 始终优于十一个最先进的基线方法。

Subjects: Machine Learning, Artificial Intelligence, Networking and Internet Architecture

Publish: 2025-08-01 05:33:32 UTC 发布时间：2025-08-01 05:33:32 UTC

#151 MoSSDA: A Semi-Supervised Domain Adaptation Framework for Multivariate Time-Series Classification using Momentum Encoder #151 MoSSDA：一种用于多变量时间序列分类的半监督域自适应框架，使用动量编码器

Authors: [Seonyoung Kim](https://arxiv.org/search/?searchtype=author&query=Seonyoung Kim), [Dongil Kim](https://arxiv.org/search/?searchtype=author&query=Dongil Kim) 作者：Seonyoung Kim，Dongil Kim

Deep learning has emerged as the most promising approach in various fields; however, when the distributions of training and test data are different (domain shift), the performance of deep learning models can degrade. Semi-supervised domain adaptation (SSDA) is a major approach for addressing this issue, assuming that a fully labeled training set (source domain) is available, but the test set (target domain) provides labels only for a small subset. In this study, we propose a novel two-step momentum encoder-utilized SSDA framework, MoSSDA, for multivariate time-series classification. Time series data are highly sensitive to noise, and sequential dependencies cause domain shifts resulting in critical performance degradation. To obtain a robust, domain-invariant and class-discriminative representation, MoSSDA employs a domain-invariant encoder to learn features from both source and target domains. Subsequently, the learned features are fed to a mixup-enhanced positive contrastive module consisting of an online momentum encoder. The final classifier is trained with learned features that exhibit consistency and discriminability with limited labeled target domain data, without data augmentation. We applied a two-stage process by separating the gradient flow between the encoders and the classifier to obtain rich and complex representations. Through extensive experiments on six diverse datasets, MoSSDA achieved state-of-the-art performance for three different backbones and various unlabeled ratios in the target domain data. The Ablation study confirms that each module, including two-stage learning, is effective in improving the performance. Our code is available at https://github.com/seonyoungKimm/MoSSDA 深度学习已成为各领域最有前景的方法；然而，当训练数据与测试数据的分布不同（领域偏移）时，深度学习模型的性能可能会下降。半监督领域自适应（SSDA）是解决此问题的主要方法，假设存在一个完全带标签的训练集（源域），但测试集（目标域）仅为一小部分样本提供标签。在本研究中，我们提出了一种新颖的两步动量编码器驱动的 SSDA 框架 MoSSDA，用于多变量时间序列分类。时间序列数据对噪声非常敏感，且序列依赖性会导致领域偏移，从而造成严重的性能下降。为了获得鲁棒的、领域不变且具有类别判别性的表示，MoSSDA 采用领域不变编码器来学习来自源域和目标域的特征。随后，学习到的特征被送入一个由在线动量编码器构成的、经 mixup 强化的正对比模块。最终分类器在无需数据增强的情况下，使用在有限有标记目标域数据上表现出一致性和可区分性的学习特征进行训练。我们通过将编码器与分类器之间的梯度流分离，应用了两阶段过程，以获得丰富而复杂的表示。通过在六个不同数据集上的大量实验，MoSSDA 在三种不同骨干网络和目标域数据中各种未标记比例下均达到了最新的最优性能。消融研究证实了每个模块（包括两阶段学习）在提升性能方面的有效性。我们的代码可在 https://github.com/seonyoungKimm/MoSSDA 获取。

Subjects: Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition 学科：机器学习、人工智能、计算机视觉与模式识别

Publish: 2025-08-01 05:27:44 UTC 发布：2025-08-01 05:27:44 UTC

#152 XFMNet: Decoding Cross-Site and Nonstationary Water Patterns via Stepwise Multimodal Fusion for Long-Term Water Quality Forecasting #152 XFMNet：通过逐步多模态融合解码跨站点与非平稳水文模式以用于长期水质预测

Authors: [Ziqi Wang](https://arxiv.org/search/?searchtype=author&query=Ziqi Wang), [Hailiang Zhao](https://arxiv.org/search/?searchtype=author&query=Hailiang Zhao), [Cheng Bao](https://arxiv.org/search/?searchtype=author&query=Cheng Bao), [Wenzhuo Qian](https://arxiv.org/search/?searchtype=author&query=Wenzhuo Qian), [Yuhao Yang](https://arxiv.org/search/?searchtype=author&query=Yuhao Yang), [Xueqiang Sun](https://arxiv.org/search/?searchtype=author&query=Xueqiang Sun), [Shuiguang Deng](https://arxiv.org/search/?searchtype=author&query=Shuiguang Deng) 作者：Ziqi Wang, Hailiang Zhao, Cheng Bao, Wenzhuo Qian, Yuhao Yang, Xueqiang Sun, Shuiguang Deng

Long-term time-series forecasting is critical for environmental monitoring, yet water quality prediction remains challenging due to complex periodicity, nonstationarity, and abrupt fluctuations induced by ecological factors. These challenges are further amplified in multi-site scenarios that require simultaneous modeling of temporal and spatial dynamics. To tackle this, we introduce XFMNet, a stepwise multimodal fusion network that integrates remote sensing precipitation imagery to provide spatial and environmental context in river networks. XFMNet first aligns temporal resolutions between water quality series and remote sensing inputs via adaptive downsampling, followed by locally adaptive decomposition to disentangle trend and cycle components. A cross-attention gated fusion module dynamically integrates temporal patterns with spatial and ecological cues, enhancing robustness to nonstationarity and site-specific anomalies. Through progressive and recursive fusion, XFMNet captures both long-term trends and short-term fluctuations. Extensive experiments on real-world datasets demonstrate substantial improvements over state-of-the-art baselines, highlighting the effectiveness of XFMNet for spatially distributed time series prediction. 长期时间序列预测对于环境监测至关重要，但由于生态因素引起的复杂周期性、非平稳性和突发波动，水质预测仍然具有挑战性。在需要同时建模时间和空间动态的多站点情形中，这些挑战进一步加剧。为了解决这些问题，我们提出了 XFMNet，一种逐步式多模态融合网络，整合遥感降水影像以在河流网络中提供空间和环境上下文。XFMNet 首先通过自适应降采样在水质序列与遥感输入之间对齐时间分辨率，随后进行局部自适应分解以解开趋势和周期成分。一个交叉注意门控融合模块动态地将时间模式与空间和生态线索整合，增强了对非平稳性和站点特异性异常的鲁棒性。通过渐进式和递归式融合，XFMNet 既捕获长期趋势也捕获短期波动。在真实世界数据集上进行的大量实验证明，与最先进基线相比有显著提升，凸显了 XFMNet 在空间分布时间序列预测方面的有效性。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-01 04:11:36 UTC 发布：2025-08-01 04:11:36 UTC

#153 Towards Heterogeneity-Aware and Energy-Efficient Topology Optimization for Decentralized Federated Learning in Edge Environment #153 面向边缘环境的去中心化联邦学习的异质感知与节能拓扑优化

Authors: [Yuze Liu](https://arxiv.org/search/?searchtype=author&query=Yuze Liu), [Tiehua Zhang](https://arxiv.org/search/?searchtype=author&query=Tiehua Zhang), [Zhishu Shen](https://arxiv.org/search/?searchtype=author&query=Zhishu Shen), [Libing Wu](https://arxiv.org/search/?searchtype=author&query=Libing Wu), [Shiping Chen](https://arxiv.org/search/?searchtype=author&query=Shiping Chen), [Jiong Jin](https://arxiv.org/search/?searchtype=author&query=Jiong Jin) 作者：刘宇泽，张铁华，沈智书，吴利兵，陈世平，金炯

Federated learning (FL) has emerged as a promising paradigm within edge computing (EC) systems, enabling numerous edge devices to collaboratively train artificial intelligence (AI) models while maintaining data privacy. To overcome the communication bottlenecks associated with centralized parameter servers, decentralized federated learning (DFL), which leverages peer-to-peer (P2P) communication, has been extensively explored in the research community. Although researchers design a variety of DFL approach to ensure model convergence, its iterative learning process inevitably incurs considerable cost along with the growth of model complexity and the number of participants. These costs are largely influenced by the dynamic changes of topology in each training round, particularly its sparsity and connectivity conditions. Furthermore, the inherent resources heterogeneity in the edge environments affects energy efficiency of learning process, while data heterogeneity degrades model performance. These factors pose significant challenges to the design of an effective DFL framework for EC systems. To this end, we propose Hat-DFed, a heterogeneity-aware and coset-effective decentralized federated learning (DFL) framework. In Hat-DFed, the topology construction is formulated as a dual optimization problem, which is then proven to be NP-hard, with the goal of maximizing model performance while minimizing cumulative energy consumption in complex edge environments. To solve this problem, we design a two-phase algorithm that dynamically constructs optimal communication topologies while unbiasedly estimating their impact on both model performance and energy cost. Additionally, the algorithm incorporates an importance-aware model aggregation mechanism to mitigate performance degradation caused by data heterogeneity. 联邦学习（FL）已成为边缘计算（EC）系统中的一种有前景的范式，使众多边缘设备在保持数据隐私的同时协同训练人工智能（AI）模型。为克服集中式参数服务器带来的通信瓶颈，利用点对点（P2P）通信的去中心化联邦学习（DFL）在研究界被广泛探索。尽管研究人员设计了各种 DFL 方法以确保模型收敛，其迭代学习过程不可避免地随着模型复杂度和参与者数量的增长而产生大量开销。这些开销在很大程度上受每轮训练中拓扑的动态变化影响，特别是其稀疏性和连通性条件。此外，边缘环境中固有的资源异质性影响学习过程的能效，而数据异质性则降低模型性能。这些因素给为边缘计算系统设计有效的 DFL 框架带来了重大挑战。为此，我们提出了 Hat-DFed，一种关注异质性且协集高效的去中心化联邦学习（DFL）框架。在 Hat-DFed 中，拓扑构建被表述为一个双重优化问题，随后被证明为 NP-难问题，其目标是在复杂边缘环境中最大化模型性能同时最小化累计能耗。为了解决该问题，我们设计了一个两阶段算法，该算法在动态构建最优通信拓扑的同时，对其对模型性能和能耗的影响进行无偏估计。此外，该算法还引入了一个重要性感知的模型聚合机制，以缓解由数据异质性引起的性能下降。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-01 03:07:32 UTC 发布：2025-08-01 03:07:32 UTC

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-07-31 10:49:20 UTC 发表：2025-07-31 10:49:20 协调世界时 (UTC)

#155 MLLM-CBench:A Comprehensive Benchmark for Continual Instruction Tuning of Multimodal LLMs with Chain-of-Thought Reasoning Analysis #155 MLLM-CBench：用于多模态 LLM 持续指令微调的全面基准及连锁思维推理分析 [PDF 2 ] [Copy] [Kimi 4 ] [REL]

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-07-31 07:49:36 UTC 发布：2025-07-31 07:49:36 UTC

#156 Distilling Knowledge from Large Language Models: A Concept Bottleneck Model for Hate and Counter Speech Recognition #156 从大型语言模型中蒸馏知识：用于仇恨与反对言论识别的概念瓶颈模型

The rapid increase in hate speech on social media has exposed an unprecedented impact on society, making automated methods for detecting such content important. Unlike prior black-box models, we propose a novel transparent method for automated hate and counter speech recognition, i.e., “Speech Concept Bottleneck Model” (SCBM), using adjectives as human-interpretable bottleneck concepts. SCBM leverages large language models (LLMs) to map input texts to an abstract adjective-based representation, which is then sent to a light-weight classifier for downstream tasks. Across five benchmark datasets spanning multiple languages and platforms (e.g., Twitter, Reddit, YouTube), SCBM achieves an average macro-F1 score of 0.69 which outperforms the most recently reported results from the literature on four out of five datasets. Aside from high recognition accuracy, SCBM provides a high level of both local and global interpretability. Furthermore, fusing our adjective-based concept representation with transformer embeddings, leads to a 1.8% performance increase on average across all datasets, showing that the proposed representation captures complementary information. Our results demonstrate that adjective-based concept representations can serve as compact, interpretable, and effective encodings for hate and counter speech recognition. With adapted adjectives, our method can also be applied to other NLP tasks. 社交媒体上仇恨言论的快速增加对社会产生了前所未有的影响，使得自动检测此类内容的方法变得十分重要。不同于以往的黑箱模型，我们提出了一种用于自动仇恨言论和反对言论识别的新型透明方法，即“言语概念瓶颈模型”（Speech Concept Bottleneck Model，SCBM），使用形容词作为人类可解释的瓶颈概念。SCBM 利用大型语言模型（LLMs）将输入文本映射为基于形容词的抽象表示，然后将该表示发送到轻量级分类器以完成下游任务。在覆盖多种语言与平台（例如 Twitter、Reddit、YouTube）的五个基准数据集上，SCBM 实现了平均宏 F1 分数 0.69，在五个数据集中有四个超越了文献中最近报道的结果。除了高识别准确率之外，SCBM 还提供了高水平的局部和全局可解释性。此外，将我们基于形容词的概念表示与 Transformer 嵌入融合后，在所有数据集上的平均性能提升了 1.8%，表明所提出的表示捕捉到了互补信息。我们的结果表明，基于形容词的概念表示可以作为仇恨言论与反制言论识别的紧凑、可解释且有效的编码。通过调整形容词，我们的方法也可以应用于其他自然语言处理任务。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-07-30 21:50:30 UTC 发布：2025-07-30 21:50:30 UTC

#157 Doctor Sun: A Bilingual Multimodal Large Language Model for Biomedical AI #157 孙博士：面向生物医学人工智能的双语多模态大型语言模型

Large multimodal models (LMMs) have demonstrated significant potential in providing innovative solutions for various biomedical tasks, including pathology analysis, radiology report generation, and biomedical assistance. However, the existing multimodal biomedical AI is typically based on foundation LLMs, thus hindering the understanding of intricate medical concepts with limited medical training data. Moreover, recent LLaVA-induced medical LMMs struggle to effectively capture the intricate relationship between the texts and the images. Therefore, we introduce Doctor Sun, a large multimodal generative model specialized in medicine, developed to encode, integrate, and interpret diverse biomedical data modalities such as text and images. In particular, Doctor Sun integrates a pre-trained vision encoder with a medical LLM and conducts two-stage training on various medical datasets, focusing on feature alignment and instruction tuning. Moreover, we release SunMed-VL, a wide-range bilingual medical multimodal dataset, along with all associated models, code, and resources, to freely support the advancement of biomedical multimodal research. 大型多模态模型（LMMs）在为各种生物医学任务提供创新解决方案方面展现出显著潜力，包括病理分析、放射学报告生成和生物医学辅助等。然而，现有的多模态生物医学人工智能通常基于基础 LLM，因而在有限的医学训练数据下难以理解复杂的医学概念。此外，最近由 LLaVA 引导的医学 LMM 在有效捕捉文本与图像之间的复杂关系方面表现欠佳。因此，我们推出了 Doctor Sun，一种专注于医学的大型多模态生成模型，旨在编码、整合并解释诸如文本和图像等多种生物医学数据模态。具体而言，Doctor Sun 将预训练的视觉编码器与医学 LLM 集成，并在各种医学数据集上进行两阶段训练，重点是特征对齐和指令微调。此外，我们发布了 SunMed-VL——一个范围广泛的双语医学多模态数据集，以及所有相关模型、代码和资源，以免费支持生物医学多模态研究的发展。

Subjects: Machine Learning, Artificial Intelligence, Computation and Language, Multimedia 主题：机器学习、人工智能、计算与语言、多媒体

Publish: 2025-07-30 13:53:54 UTC 发布时间：2025-07-30 13:53:54 UTC

#158 emg2tendon: From sEMG Signals to Tendon Control in Musculoskeletal Hands #158 emg2tendon：从表面肌电信号到肌肉骨骼手的肌腱控制

Author: [Sagar Verma](https://arxiv.org/search/?searchtype=author&query=Sagar Verma) 作者：Sagar Verma

Tendon-driven robotic hands offer unparalleled dexterity for manipulation tasks, but learning control policies for such systems presents unique challenges. Unlike joint-actuated robotic hands, tendon-driven systems lack a direct one-to-one mapping between motion capture (mocap) data and tendon controls, making the learning process complex and expensive. Additionally, visual tracking methods for real-world applications are prone to occlusions and inaccuracies, further complicating joint tracking. Wrist-wearable surface electromyography (sEMG) sensors present an inexpensive, robust alternative to capture hand motion. However, mapping sEMG signals to tendon control remains a significant challenge despite the availability of EMG-to-pose data sets and regression-based models in the existing literature. We introduce the first large-scale EMG-to-Tendon Control dataset for robotic hands, extending the emg2pose dataset, which includes recordings from 193 subjects, spanning 370 hours and 29 stages with diverse gestures. This dataset incorporates tendon control signals derived using the MyoSuite MyoHand model, addressing limitations such as invalid poses in prior methods. We provide three baseline regression models to demonstrate emg2tendon utility and propose a novel diffusion-based regression model for predicting tendon control from sEMG recordings. This dataset and modeling framework marks a significant step forward for tendon-driven dexterous robotic manipulation, laying the groundwork for scalable and accurate tendon control in robotic hands. https://emg2tendon.github.io/ 肌腱驱动的机器人手在操控任务上提供了无与伦比的灵活性，但为此类系统学习控制策略也带来了独特的挑战。与关节驱动的机器人手不同，肌腱驱动系统中运动捕捉（mocap）数据与肌腱控制之间不存在直接的一一对应关系，这使得学习过程复杂且代价高昂。此外，现实应用中的视觉跟踪方法容易受到遮挡和不准确性的影响，进一步增加了关节跟踪的难度。可戴在腕部的表面肌电图（sEMG）传感器为捕捉手部运动提供了一种廉价且稳健的替代方案。然而，尽管现有文献中存在 EMG 到姿态的数据集和基于回归的模型，将 sEMG 信号映射到肌腱控制仍然是一个重大挑战。我们引入了首个大规模的面向机器人手的 EMG 到肌腱控制数据集，作为对 emg2pose 数据集的扩展，包含来自 193 名受试者的记录，覆盖 370 小时和 29 个阶段的多样手势。该数据集采用 MyoSuite 的 MyoHand 模型导出的肌腱控制信号，解决了此前方法中存在的无效姿态等限制。我们提供了三种基线回归模型以展示 emg2tendon 的实用性，并提出了一种用于从表面肌电（sEMG）记录预测肌腱控制的新型基于扩散的回归模型。该数据集和建模框架标志着面向肌腱驱动灵巧机器人操作的重要进展，为机器人手部可扩展且精确的肌腱控制奠定了基础。https://emg2tendon.github.io/

Subjects: Robotics, Artificial Intelligence 主题：机器人学，人工智能

Publish: 2025-07-29 12:49:57 UTC 发布：2025-07-29 12:49:57 UTC

#159 Benchmarking Large Language Models for Geolocating Colonial Virginia Land Grants #159 基准测试大型语言模型用于定位殖民时期弗吉尼亚土地赠与

Author: [Ryan Mioduski](https://arxiv.org/search/?searchtype=author&query=Ryan Mioduski) 作者：Ryan Mioduski

Virginia’s seventeenth- and eighteenth-century land patents survive primarily as narrative metes-and-bounds descriptions, limiting spatial analysis. This study systematically evaluates current-generation large language models (LLMs) in converting these prose abstracts into geographically accurate latitude/longitude coordinates within a focused evaluation context. A digitized corpus of 5,471 Virginia patent abstracts (1695-1732) is released, with 43 rigorously verified test cases serving as an initial, geographically focused benchmark. Six OpenAI models across three architectures (o-series, GPT-4-class, and GPT-3.5) were tested under two paradigms: direct-to-coordinate and tool-augmented chain-of-thought invoking external geocoding APIs. Results were compared with a GIS-analyst baseline, the Stanford NER geoparser, Mordecai-3, and a county-centroid heuristic. The top single-call model, o3-2025-04-16, achieved a mean error of 23 km (median 14 km), outperforming the median LLM (37.4 km) by 37.5%, the weakest LLM (50.3 km) by 53.5%, and external baselines by 67% (GIS analyst) and 70% (Stanford NER). A five-call ensemble further reduced errors to 19 km (median 12 km) at minimal additional cost (approx. USD 0.20 per grant), outperforming the median LLM by 48.6%. A patentee-name-redaction ablation increased error by about 9%, indicating reliance on textual landmark and adjacency descriptions rather than memorization. The cost-efficient gpt-4o-2024-08-06 model maintained a 28 km mean error at USD 1.09 per 1,000 grants, establishing a strong cost-accuracy benchmark; external geocoding tools offered no measurable benefit in this evaluation. These findings demonstrate the potential of LLMs for scalable, accurate, and cost-effective historical georeferencing. 弗吉尼亚十七、十八世纪的土地专利主要以叙述性的界址描述形式保存，这限制了空间分析。本文系统评估了新一代 LLMs 在将这些散文摘要转换为地理上准确的经纬度坐标方面的表现，评估背景具有聚焦性。发布了一个数字化语料库，包含 5,471 份弗吉尼亚专利摘要（1695-1732），并以 43 个经过严格验证的测试案例作为初始的、地理聚焦的基准。测试了三种架构（o 系列、GPT-4 级、和 GPT-3.5）下的六个 OpenAI 模型，采用两种范式：直接输出坐标和借助外部地理编码 API 的工具增强链式思维。结果与一名 GIS 分析师基线、斯坦福 NER 地理解析器、Mordecai-3 以及县中心点启发式方法进行了比较。表现最佳的单次调用模型 o3-2025-04-16 实现了平均误差 23 公里（中位数 14 公里），比中位数 LLM（37.4 公里）好 37.5%，比最弱的 LLM（50.3 公里）好 53.5%，并且比外部基线分别好 67%（GIS 分析师）和 70%（斯坦福 NER）。一个由五次调用组成的集成方法在仅增加极少成本（每项授权约 0.20 美元）的情况下将错误进一步降低到 19 公里（中位数为 12 公里），比中位数 LLM 的表现高出 48.6%。对专利持有人姓名进行遮蔽的消融实验使错误增加了约 9%，表明模型依赖于文本中的地标和相邻描述而非记忆。具成本效益的 gpt-4o-2024-08-06 模型以每 1000 项授权 1.09 美元的费用维持了 28 公里的平均误差，确立了一个强有力的成本-准确性基准；在本次评估中，外部地理编码工具没有带来可测量的益处。这些发现展示了 LLMs 在可扩展、准确且具成本效益的历史地理定位方面的潜力。

Publish: 2025-07-27 21:49:58 UTC 发布：2025-07-27 21:49:58 UTC

#160 TurQUaz at CheckThat! 2025: Debating Large Language Models for Scientific Web Discourse Detection #160 TurQUaz 在 CheckThat! 2025：辩论用于科学网络话语检测的大型语言模型 [PDF ] [Copy] [Kimi ] [REL]

In this paper, we present our work developed for the scientific web discourse detection task (Task 4a) of CheckThat! 2025. We propose a novel council debate method that simulates structured academic discussions among multiple large language models (LLMs) to identify whether a given tweet contains (i) a scientific claim, (ii) a reference to a scientific study, or (iii) mentions of scientific entities. We explore three debating methods: i) single debate, where two LLMs argue for opposing positions while a third acts as a judge; ii) team debate, in which multiple models collaborate within each side of the debate; and iii) council debate, where multiple expert models deliberate together to reach a consensus, moderated by a chairperson model. We choose council debate as our primary model as it outperforms others in the development test set. Although our proposed method did not rank highly for identifying scientific claims (8th out of 10) or mentions of scientific entities (9th out of 10), it ranked first in detecting references to scientific studies. 在本文中，我们介绍了为 CheckThat! 2025 的科学网络话语检测任务（任务 4a）开发的工作。我们提出了一种新颖的委员会辩论方法，通过模拟多个大型语言模型（LLMs）之间的结构化学术讨论来识别给定推文是否包含 (i) 科学主张、(ii) 对科学研究的引用，或 (iii) 对科学实体的提及。我们探索了三种辩论方法：i）单人辩论，两个 LLM 就相对立的立场进行争论，第三个充当裁判；ii）团队辩论，多模型在每一方内部协作；iii）委员会辩论，多位专家模型共同讨论以达成共识，由一位主席模型主持。我们选择委员会辩论作为主要模型，因为它在开发测试集上优于其他方法。尽管我们提出的方法在识别科学主张（在 10 个中排名第 8）或识别科学实体提及（在 10 个中排名第 9）方面排名不高，但在检测对科学研究的引用方面排名第一。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-07-26 00:46:23 UTC 发布：2025-07-26 00:46:23 UTC

#161 On the Effects of Smoothing Rugged Landscape by Different Toy Problems: A Case Study on UBQP #161 关于通过不同玩具问题平滑崎岖地形的影响：以 UBQP 为例的研究 [PDF ] [Copy] [Kimi ] [REL]

Authors: [Wei Wang](https://arxiv.org/search/?searchtype=author&query=Wei Wang), [Jialong Shi](https://arxiv.org/search/?searchtype=author&query=Jialong Shi), [Jianyong Sun](https://arxiv.org/search/?searchtype=author&query=Jianyong Sun), [Arnaud Liefooghe](https://arxiv.org/search/?searchtype=author&query=Arnaud Liefooghe), [Qingfu Zhang](https://arxiv.org/search/?searchtype=author&query=Qingfu Zhang), [Ye Fan](https://arxiv.org/search/?searchtype=author&query=Ye Fan) 作者：Wei Wang, Jialong Shi, Jianyong Sun, Arnaud Liefooghe, Qingfu Zhang, Ye Fan

The hardness of the Unconstrained Binary Quadratic Program (UBQP) problem is due its rugged landscape. Various algorithms have been proposed for UBQP, including the Landscape Smoothing Iterated Local Search (LSILS). Different from other UBQP algorithms, LSILS tries to smooth the rugged landscape by building a convex combination of the original UBQP and a toy UBQP. In this paper, our study further investigates the impact of smoothing rugged landscapes using different toy UBQP problems, including a toy UBQP with matrix ^Q1 (construct by “+/-1”), a toy UBQP with matrix ^Q2 (construct by “+/-i”) and a toy UBQP with matrix ^Q3 (construct randomly). We first assess the landscape flatness of the three toy UBQPs. Subsequently, we test the efficiency of LSILS with different toy UBQPs. Results reveal that the toy UBQP with ^Q1 (construct by “+/-1”) exhibits the flattest landscape among the three, while the toy UBQP with ^Q3 (construct randomly) presents the most non-flat landscape. Notably, LSILS using the toy UBQP with ^Q2 (construct by “+/-i”) emerges as the most effective, while ^Q3 (construct randomly) has the poorest result. These findings contribute to a detailed understanding of landscape smoothing techniques in optimizing UBQP. 无约束二次布尔规划（UBQP）问题的困难在于其凹凸不平的景观。为解决 UBQP 问题，已有多种算法被提出，其中包括景观平滑迭代局部搜索（LSILS）。与其他 UBQP 算法不同，LSILS 通过构建原始 UBQP 与一个示例 UBQP 的凸组合来平滑不平的景观。本文进一步研究了使用不同示例 UBQP 问题平滑崎岖景观的影响，这些示例包括矩阵 ^Q1（由“+/-1”构建）的示例 UBQP、矩阵 ^Q2（由“+/-i”构建）的示例 UBQP 以及随机构建的矩阵 ^Q3 的示例 UBQP。我们首先评估了这三种示例 UBQP 的景观平坦度。随后，我们测试了使用不同示例 UBQP 的 LSILS 的效率。结果表明，矩阵 ^Q1（由“+/-1”构建）的示例 UBQP 在三者中表现出最平坦的景观，而矩阵 ^Q3（随机构建）的示例 UBQP 则表现出最不平坦的景观。值得注意的是，使用矩阵 ^Q2（由“+/-i”构建）的示例 UBQP 的 LSILS 效果最佳，而 ^Q3（随机构建）则效果最差。这些发现有助于深入理解在优化无约束二次布尔规划（UBQP）中使用的景观平滑技术。

Subject: Optimization and Control 主题：优化与控制

Publish: 2024-07-29 03:37:12 UTC 发布：2024-07-29 03:37:12 协调世界时

#162 A New Parallel Cooperative Landscape Smoothing Algorithm and Its Applications on TSP and UBQP #162 一种新的并行协作景观平滑算法及其在 TSP 和 UBQP 上的应用

Authors: [Wei Wang](https://arxiv.org/search/?searchtype=author&query=Wei Wang), [Jialong Shi](https://arxiv.org/search/?searchtype=author&query=Jialong Shi), [Jianyong Sun](https://arxiv.org/search/?searchtype=author&query=Jianyong Sun), [Arnaud Liefooghe](https://arxiv.org/search/?searchtype=author&query=Arnaud Liefooghe), [Qingfu Zhang](https://arxiv.org/search/?searchtype=author&query=Qingfu Zhang) 作者：王伟，史佳龙，孙建勇，Arnaud Liefooghe，张庆福

Combinatorial optimization problem (COP) is difficult to solve because of the massive number of local optimal solutions in his solution space. Various methods have been put forward to smooth the solution space of COPs, including homotopic convex (HC) transformation for the traveling salesman problem (TSP). This paper extends the HC transformation approach to unconstrained binary quadratic programming (UBQP) by proposing a method to construct a unimodal toy UBQP of any size. We theoretically prove the unimodality of the constructed toy UBQP. After that, we apply this unimodal toy UBQP to smooth the original UBQP by using the HC transformation framework and empirically verify the smoothing effects. Subsequently, we introduce an iterative algorithmic framework incorporating HC transformation, referred as landscape smoothing iterated local search (LSILS). Our experimental analyses, conducted on various UBQP instances show the effectiveness of LSILS. Furthermore, this paper proposes a parallel cooperative variant of LSILS, denoted as PC-LSILS and apply it to both the UBQP and the TSP. Our experimental findings highlight that PC-LSILS improves the smoothing performance of the HC transformation, and further improves the overall performance of the algorithm. 组合优化问题（COP）由于其解空间中大量的局部最优解而难以求解。为平滑 COP 的解空间，人们提出了多种方法，包括用于旅行商问题（TSP）的同伦凸（HC）变换。本文将 HC 变换方法扩展到无约束二次二元规划（UBQP），提出了一种构造任意规模单峰玩具 UBQP 的方法。我们从理论上证明了所构造玩具 UBQP 的单峰性。随后，我们在 HC 变换框架下使用该单峰玩具 UBQP 来平滑原始 UBQP，并通过实证验证了平滑效果。之后，我们引入了一个包含 HC 变换的迭代算法框架，称为景观平滑迭代局部搜索（LSILS）。在多个 UBQP 实例上的实验分析表明了 LSILS 的有效性。此外，本文还提出了 LSILS 的并行协同变体，记为 PC-LSILS，并将其应用于 UBQP 和 TSP。我们的实验结果表明，PC-LSILS 提升了 HC 变换的平滑性能，并进一步提高了该算法的整体性能。

Subject: Optimization and Control 主题：优化与控制

Publish: 2024-01-06 15:25:32 UTC 发布日期：2024-01-06 15:25:32 UTC

2025-08-13科研追新

2025-08-13科研追新

1. 源数据

1.1 公众号

1.1.1 量子位

1.1.2 机器之心

1.1.3 新智元

1.1.4 AGI Hunt

1.1.5 其他

1.2 Arxiv

1.2.1 Computation and Language

#1 Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models #1 时间就是特征：在扩散语言模型中利用时间动态

#2 Complex Logical Instruction Generation #2 复杂逻辑指令生成

#3 OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows #3 OdysseyBench：在长时域复杂办公应用工作流上评估 LLM 代理

#4 SinLlama - A Large Language Model for Sinhala #4 SinLlama - 面向僧伽罗语的大型语言模型 [PDF 2 ] [Copy] [Kimi 3 ] [REL]

#5 AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators

#6 Link Prediction for Event Logs in the Process Industry #6 用于流程工业事件日志的链接预测

#7 Utilizing Multilingual Encoders to Improve Large Language Models for Low-Resource Languages #7 利用多语种编码器提升低资源语言的大型语言模型

#8 CPO: Addressing Reward Ambiguity in Role-playing Dialogue via Comparative Policy Optimization #8 CPO：通过比较策略优化在角色扮演对话中解决奖励歧义

#9 READER: Retrieval-Assisted Drafter for Efficient LLM Inference #9 读者：用于高效 LLM 推理的检索辅助起草器

#11 LLM-as-a-Supervisor: Mistaken Therapeutic Behaviors Trigger Targeted Supervisory Feedback #11 LLM-as-a-Supervisor：错误的治疗行为触发有针对性的监督反馈

#12 A Survey on Training-free Alignment of Large Language Models #12 关于大型语言模型无训练对齐方法的综述

#13 LyS at SemEval 2025 Task 8: Zero-Shot Code Generation for Tabular QA #13 LyS 在 SemEval 2025 任务 8：用于表格问答的零样本代码生成 [PDF ] [Copy] [Kimi ] [REL]

#14 Retrospective Sparse Attention for Efficient Long-Context Generation #14 回顾性稀疏注意力以实现高效长上下文生成

#15 Jointly Generating and Attributing Answers using Logits of Document-Identifier Tokens #15 使用文档标识符标记的对数几率共同生成并归因答案

#16 Train Long, Think Short: Curriculum Learning for Efficient Reasoning #16 长训短思：用于高效推理的课程学习

#17 Reveal-Bangla: A Dataset for Cross-Lingual Multi-Step Reasoning Evaluation #17 Reveal-Bangla：用于跨语言多步推理评估的数据集

#19 ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs #19 ASPD：通过探索 LLMs 的内在并行性解锁自适应串并行解码

#20 Entangled in Representations: Mechanistic Investigation of Cultural Biases in Large Language Models #20 纠缠于表征：对大型语言模型中文化偏见的机制性调查

#21 Weakly Supervised Fine-grained Span-Level Framework for Chinese Radiology Report Quality Assurance #21 题为“面向中文放射学报告质量保证的弱监督细粒度跨段框架”

#22 BiasGym: Fantastic Biases and How to Find (and Remove) Them #22 BiasGym：奇妙的偏差及如何发现（并消除）它们

#23 Steering Towards Fairness: Mitigating Political Bias in LLMs #23 朝向公平的引导：缓解 LLMs 中的政治偏见

#25 TiMoE: Time-Aware Mixture of Language Experts #25 TiMoE：面向时间的语言专家混合模型

#26 Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments #26 通过自动构建环境的反馈驱动工具使用改进在大型语言模型中的应用 [PDF 2 ] [Copy] [Kimi 2 ] [REL]

#27 Privacy-protected Retrieval-Augmented Generation for Knowledge Graph Question Answering #27 隐私保护的检索增强生成用于知识图谱问答

#28 DevNous: An LLM-Based Multi-Agent System for Grounding IT Project Management in Unstructured Conversation #28 DevNous：一种基于 LLM 的多智能体系统，用于在非结构化对话中将 IT 项目管理落地

#29 SciRerankBench: Benchmarking Rerankers Towards Scientific Retrieval-Augmented Generated LLMs #29 SciRerankBench：面向科学检索增强生成 LLMs 的重排器基准测试

#30 Magical: Medical Lay Language Generation via Semantic Invariance and Layperson-tailored Adaptation #30 Magical：通过语义不变性和面向门外汉的适应进行医学通俗语言生成

#31 IROTE: Human-like Traits Elicitation of Large Language Model via In-Context Self-Reflective Optimization #31 IROTE：通过上下文自我反思优化引发大型语言模型的人类特质 [PDF 4 ] [复印件] [Kimi 4 ] [REL]

#32 A Survey on Parallel Text Generation: From Parallel Decoding to Diffusion Language Models #32 一篇关于并行文本生成的综述：从并行解码到扩散语言模型

#33 Out of the Box, into the Clinic? Evaluating State-of-the-Art ASR for Clinical Applications for Older Adults #33 直接投入临床？评估面向老年人的最先进自动语音识别在临床应用中的表现

#34 TopXGen: Topic-Diverse Parallel Data Generation for Low-Resource Machine Translation #34 TopXGen：面向低资源机器翻译的主题多样性并行数据生成

#35 LLM driven Text-to-Table Generation through Sub-Tasks Guidance and Iterative Refinement #35 通过子任务指导和迭代改进的由 LLM 驱动的文本到表格生成

#36 Prompt-Based Approach for Czech Sentiment Analysis #36 一种基于提示的方法用于捷克语情感分析

#37 UWB at WASSA-2024 Shared Task 2: Cross-lingual Emotion Detection #37 UWB 在 WASSA-2024 共享任务 2：跨语言情感检测

#38 LLaMA-Based Models for Aspect-Based Sentiment Analysis

#39 Quick on the Uptake: Eliciting Implicit Intents from Human Demonstrations for Personalized Mobile-Use Agents #39 反应迅速：从人类示范中引出隐含意图以打造个性化移动使用代理

#40 InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling #40 InternBootcamp 技术报告：通过可验证任务扩展提升 LLM 推理能力

#41 Optimizing Retrieval-Augmented Generation (RAG) for Colloquial Cantonese: A LoRA-Based Systematic Review #41 优化面向口语粤语的检索增强生成（RAG）：基于 LoRA 的系统综述

#42 DepressLLM: Interpretable domain-adapted language model for depression detection from real-world narratives #42 DepressLLM：用于从真实世界叙述中检测抑郁的可解释领域适应语言模型

#43 DeCAL Tokenwise Compression #43 DeCAL 逐标记压缩

#44 Steerable Pluralism: Pluralistic Alignment via Few-Shot Comparative Regression #44 可引导的多元主义：通过少样本比较回归实现多元对齐

#45 Momentum Point-Perplexity Mechanics in Large Language Models #45 大型语言模型中的动量点困惑力学

#46 Enhancing Small LLM Alignment through Margin-Based Objective Modifications under Resource Constraints #46 在资源受限下通过基于边际的目标修改提升小型 LLM 的对齐

#47 Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment #47 重新思考富形态的分词：一元模型优于 BPE 并且形态对齐占优

#48 Mol-R1: Towards Explicit Long-CoT Reasoning in Molecule Discovery #48 Mol-R1：迈向分子发现中显式的长链链式思维（Long-CoT）推理

#49 CoDAE: Adapting Large Language Models for Education via Chain-of-Thought Data Augmentation #49 CoDAE：通过连锁思维数据增强将大型语言模型适配于教育领域 [PDF ] [复制] [Kimi ] [关联]

#50 Putnam-AXIOM: A Functional and Static Benchmark #50 Putnam-AXIOM：一个函数式与静态基准

#51 Sacred or Synthetic? Evaluating LLM Reliability and Abstention for Religious Questions #51 神圣还是合成？评估 LLM 在宗教问题上的可靠性与回避

#52 The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs #52 进步的错觉：重新评估 LLMs 中幻觉检测

#53 MinionsLLM: a Task-adaptive Framework For The Training and Control of Multi-Agent Systems Through Natural Language #53 MinionsLLM：一种通过自然语言对多智能体系统进行训练与控制的任务自适应框架 [PDF ] [Copy] [Kimi ] [REL]

#54 Objective Metrics for Evaluating Large Language Models Using External Data Sources #54 使用外部数据源评估大型语言模型的客观指标

#55 Evaluating Contrast Localizer for Identifying Causal Unitsin Social & Mathematical Tasks in Language Models #55 评估对比定位器在语言模型中识别社会与数学任务因果单元的效果

#56 MLLM-CBench:A Comprehensive Benchmark for Continual Instruction Tuning of Multimodal LLMs with Chain-of-Thought Reasoning Analysis #56 MLLM-CBench：用于多模态大模型持续指令微调的全面基准与链式思维推理分析 [PDF 2 ] [Copy] [Kimi 4 ] [REL]

#57 Distilling Knowledge from Large Language Models: A Concept Bottleneck Model for Hate and Counter Speech Recognition #57 从大型语言模型中蒸馏知识：用于仇恨与反言论识别的概念瓶颈模型

#58 TT-XAI: Trustworthy Clinical Text Explanations via Keyword Distillation and LLM Reasoning #58 TT-XAI：通过关键词提炼和 LLM 推理实现可信的临床文本解释

#59 Real-time News Story Identification #59 实时新闻事件识别

#60 Heartificial Intelligence: Exploring Empathy in Language Models #60 心工智能：探索语言模型中的共情

#61 TurQUaz at CheckThat! 2025: Debating Large Language Models for Scientific Web Discourse Detection #61 TurQUaz at CheckThat! 2025：为科学网络话语检测辩论大型语言模型

#62 Argument Quality Annotation and Gender Bias Detection in Financial Communication through Large Language Models #62 通过大型语言模型在金融交流中进行论点质量注释和性别偏见检测

#63 P/D-Device: Disaggregated Large Language Model between Cloud and Devices #63 P/D-Device：云端与设备间的解聚合大型语言模型

#64 E3-Rewrite: Learning to Rewrite SQL for Executability, Equivalence,and Efficiency #64 E3-Rewrite：学习重写 SQL 以实现可执行性、等价性和效率

#65 Revealing the Role of Audio Channels in ASR Performance Degradation #65 揭示音频通道在 ASR 性能下降中的作用

#66 A Dual-Axis Taxonomy of Knowledge Editing for LLMs: From Mechanisms to Functions #66 面向 LLMs 的知识编辑双轴分类法：从机制到功能

#67 Designing Memory-Augmented AR Agents for Spatiotemporal Reasoning in Personalized Task Assistance #67 为个性化任务辅助设计具备记忆增强的增强现实代理，以进行时空推理 [PDF ] [Copy] [Kimi ] [REL]

#68 MultiAiTutor: Child-Friendly Educational Multilingual Speech Generation Tutor with LLMs #68 MultiAiTutor：面向儿童友好的教育多语种语音生成导师，基于 LLMs

#69 M2LLM: Multi-view Molecular Representation Learning with Large Language Models #69 M2 LLM：使用大型语言模型的多视角分子表示学习

#70 MiGrATe: Mixed-Policy GRPO for Adaptation at Test-Time #70 MiGrATe：用于测试时自适应的混合策略 GRPO

#71 Adaptive Personalized Conversational Information Retrieval #71 自适应个性化对话式信息检索

#72 Fine-grained Video Dubbing Duration Alignment with Segment Supervised Preference Optimization #72 细粒度视频配音时长对齐与基于片段监督的偏好优化