2025-08-18 Research Updates
~ 2025-08-18 19:37:26 Monday
1. Source Data
1.1 WeChat Official Accounts
1.1.1 量子位
- Chinese soccer might as well count on robots! The first Robot Games closes: ticket prices turned out to be conservative after all
- Unbelievable! Agents now team up 100 at a time? Five complex tasks finished in parallel in 3 minutes, with requirements changeable at any moment
- New NVIDIA research: small models are the future of agents
- Beyond RAG and DAPT! A new study by a Chinese team sparks discussion: plug-and-play, turning a model into a domain expert without changing its original parameters
- OpenAI's president reveals GPT-5 changed its reasoning paradigm; realizing AGI will depend on real-world feedback
- AMD's Lisa Su publicly pushes back at Zuckerberg! Opposing poaching with $100M salaries: "a sense of mission matters more than money"
- ByteDance Seed open-sources a long-term-memory multimodal Agent that can listen and see like a human
- Whoa! Zhang Mazi just joined me for Black Myth: Wukong
- Six releases in one week! Kunlun Tech takes multimodal AI to new heights
- Liquor maker Moutai is taking up AI! Learning from Baidu alongside Mercedes-Benz and McDonald's
- Large models are unreliable as their own judges! New research from Shanghai Jiao Tong University reveals flaws in the LLM-as-a-judge mechanism
1.1.2 机器之心
- Dario Amodei: Paper losses? Large models still make money!
- Robots "slacking off" now? A Unitree G1 slouches on the couch scrolling videos of pretty women after the games; netizens: it enjoys life better than humans do
- From GPT-2 to gpt-oss: a deep dive into the evolution of OpenAI's open models
- NextStep-1: an exploration of the autoregressive paradigm in image generation
- KDD 2025 | UoMo arrives: the first wireless network traffic prediction model, one framework covering three task types
- An open-source diffusion LLM outruns autoregressive for the first time! SJTU and UCSD release D2F, with 2.5x the throughput of LLaMA3
- One image opens four-dimensional spacetime: 4DNeX brings the dynamic world to life
- The hierarchical reasoning model 4 million people watched: does the "hierarchical architecture" not actually work? Is something else behind the performance gains?
- CoRL 2025 | Latent-space diffusion world model LaDi-WM substantially improves the success rate and cross-scene generalization of robot manipulation policies
- SEAgent: opening a new era of GUI agents that self-evolve from hands-on experience
- Qwen3 Coder surges, passing a 20% market share on OpenRouter
- How much longer can LLM + tool use hold up? Where is the next generation of AI agents headed in its exploration of self-evolving techniques?
1.1.3 新智元
- Hinton's prediction comes true! AI takes over half of America's white-collar jobs; Oxford and Harvard grads flock to the skilled trades
- AI is here! Journalists, video creators, writers: who can escape this "extinction wave"?
- One sentence, a 49% performance surge! Work from Maryland, MIT, and others: the prompt is the large model's ultimate weapon
- Hinton's "AI mother" prediction sweeps Silicon Valley! Fei-Fei Li: humans may be left with no dignity
- A Gen-Z manifesto on survival, learning, and the future! A Berkeley star student predicts the classroom of tomorrow and designs a life with AI
- The AI Wolf of Wall Street! o3-mini earns 9x on "divine bets"; DeepSeek R1 is the biggest maverick
- GPT-5 is the first to truly reason; an OpenAI co-founder reveals the secret to AGI! Supercritical learning devours compute; will money be useless by 2045?
- Just now! Google insiders demystify Genie 3: the biggest AI hit since Sora, opening a new era of world models
- "I'm afraid I won't live to graduate"! AI sets off a dropout wave in the US; an 18-year-old's slide-deck startup stuns the father of YC
- A hardcore teardown! From GPT-2 to gpt-oss, decoding the keys to large-model evolution
- Google's big move: students get a free year of Gemini Pro! The ultimate study power-up for students is live
1.1.4 AGI Hunt
1.1.5 Others
1.2 Arxiv
1.2.1 Computation and Language
From: https://arxiv.org/list/cs.CL/recent 2025-08-18 | Total: 63
#1 TinyTim: A Family of Language Models for Divergent Generation
Author: [Christopher J. Agostino](https://arxiv.org/search/?searchtype=author&query=Christopher J. Agostino)
This work introduces TinyTim, a family of large language models fine-tuned on James Joyce's *Finnegans Wake*. Through quantitative evaluation against baseline models, we demonstrate that TinyTim V1 produces a statistically distinct generative profile characterized by high lexical diversity and low semantic coherence. These findings are interpreted through theories of creativity and complex problem-solving, arguing that such specialized models can function as divergent knowledge sources within more extensive creative architectures, powering automated discovery mechanisms in diverse settings.
Subject: Computation and Language
Publish: 2025-08-15 17:14:29 UTC
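The two reported signatures, high lexical diversity and low semantic coherence, are easy to approximate. Below is a minimal sketch (my illustration, not the paper's evaluation protocol) that scores a text's type-token ratio and the mean pairwise cosine similarity of its sentence embeddings; the sentence-transformers checkpoint is an assumed stand-in.

```python
# Illustrative metrics in the spirit of the TinyTim evaluation: lexical
# diversity via type-token ratio, semantic coherence via mean pairwise
# cosine similarity of sentence embeddings. Not the paper's exact setup.
import itertools
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed dependency

def lexical_diversity(text: str) -> float:
    tokens = text.lower().split()
    return len(set(tokens)) / max(len(tokens), 1)  # type-token ratio

def semantic_coherence(sentences: list[str], model) -> float:
    emb = model.encode(sentences, normalize_embeddings=True)
    sims = [float(np.dot(a, b)) for a, b in itertools.combinations(emb, 2)]
    return float(np.mean(sims))  # higher = more coherent

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative checkpoint
sample = ["riverrun past Eve and Adam's", "The cat sat on the mat."]
print(lexical_diversity(" ".join(sample)), semantic_coherence(sample, model))
```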
#2 Dataset Creation for Visual Entailment using Generative AI
Authors: [Rob Reijtenbach](https://arxiv.org/search/?searchtype=author&query=Rob Reijtenbach), [Suzan Verberne](https://arxiv.org/search/?searchtype=author&query=Suzan Verberne), [Gijs Wijnholds](https://arxiv.org/search/?searchtype=author&query=Gijs Wijnholds)
In this paper we present and validate a new synthetic dataset for training visual entailment models. Existing datasets for visual entailment are small and sparse compared to datasets for textual entailment. Manually creating datasets is labor-intensive. We base our synthetic dataset on the SNLI dataset for textual entailment. We take the premise text from SNLI as input prompts in a generative image model, Stable Diffusion, creating an image to replace each textual premise. We evaluate our dataset both intrinsically and extrinsically. For extrinsic evaluation, we evaluate the validity of the generated images by using them as training data for a visual entailment classifier based on CLIP feature vectors. We find that synthetic training data only leads to a slight drop in quality on SNLI-VE, with an F-score of 0.686 compared to 0.703 when trained on real data. We also compare the quality of our generated training data to original training data on another dataset: SICK-VTE. Again, there is only a slight drop in F-score: from 0.400 to 0.384. These results indicate that in settings with data sparsity, synthetic data can be a promising solution for training visual entailment models.
Subject: Computation and Language
Publish: 2025-08-15 17:13:41 UTC
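The dataset-construction step, turning each SNLI premise into an image, maps directly onto the standard diffusers API. A minimal sketch follows; the checkpoint and device are my assumptions, not settings from the paper.

```python
# Sketch of the premise-to-image step: feed each SNLI textual premise to
# Stable Diffusion and save the generated image as the visual premise.
# Checkpoint choice and CUDA device are assumptions, not the paper's settings.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

premises = ["A man inspects the uniform of a figure."]  # e.g., from SNLI
for i, premise in enumerate(premises):
    image = pipe(premise).images[0]  # one image replaces one textual premise
    image.save(f"premise_{i}.png")
```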
#3 Representing Speech Through Autoregressive Prediction of Cochlear Tokens
Authors: [Greta Tuckute](https://arxiv.org/search/?searchtype=author&query=Greta Tuckute), [Klemen Kotar](https://arxiv.org/search/?searchtype=author&query=Klemen Kotar), [Evelina Fedorenko](https://arxiv.org/search/?searchtype=author&query=Evelina Fedorenko), [Daniel L. K. Yamins](https://arxiv.org/search/?searchtype=author&query=Daniel L. K. Yamins)
We introduce AuriStream, a biologically inspired model for encoding speech via a two-stage framework inspired by the human auditory processing hierarchy. The first stage transforms raw audio into a time-frequency representation based on the human cochlea, from which we extract discrete **cochlear tokens**. The second stage applies an autoregressive sequence model over the cochlear tokens. AuriStream learns meaningful phoneme and word representations, and state-of-the-art lexical semantics. AuriStream shows competitive performance on diverse downstream SUPERB speech tasks. Complementing AuriStream's strong representational capabilities, it generates continuations of audio which can be visualized in a spectrogram space and decoded back into audio, providing insights into the model's predictions. In summary, we present a two-stage framework for speech representation learning to advance the development of more human-like models that efficiently handle a range of speech-based tasks.
Subjects: Computation and Language, Sound, Audio and Speech Processing
Publish: 2025-08-15 17:06:04 UTC
#4 Aware First, Think Less: Dynamic Boundary Self-Awareness Drives Extreme Reasoning Efficiency in Large Language Models
Authors: [Qiguang Chen](https://arxiv.org/search/?searchtype=author&query=Qiguang Chen), [Dengyun Peng](https://arxiv.org/search/?searchtype=author&query=Dengyun Peng), [Jinhao Liu](https://arxiv.org/search/?searchtype=author&query=Jinhao Liu), [HuiKang Su](https://arxiv.org/search/?searchtype=author&query=HuiKang Su), [Jiannan Guan](https://arxiv.org/search/?searchtype=author&query=Jiannan Guan), [Libo Qin](https://arxiv.org/search/?searchtype=author&query=Libo Qin), [Wanxiang Che](https://arxiv.org/search/?searchtype=author&query=Wanxiang Che)
Recent advancements in large language models (LLMs) have greatly improved their capabilities on complex reasoning tasks through Long Chain-of-Thought (CoT). However, this approach often results in substantial redundancy, impairing computational efficiency and causing significant delays in real-time applications. To improve efficiency, current methods often rely on human-defined difficulty priors, which do not align with the LLM's self-perceived difficulty, leading to inefficiencies. In this paper, we introduce the Dynamic Reasoning-Boundary Self-Awareness Framework (DR. SAF), which enables models to dynamically assess and adjust their reasoning depth in response to problem complexity. DR. SAF integrates three key components: Boundary Self-Awareness Alignment, Adaptive Reward Management, and a Boundary Preservation Mechanism. These components allow models to optimize their reasoning processes, balancing efficiency and accuracy without compromising performance. Our experimental results demonstrate that DR. SAF achieves a 49.27% reduction in total response tokens with minimal loss in accuracy. The framework also delivers a 6.59x gain in token efficiency and a 5x reduction in training time, making it well-suited to resource-limited settings. During extreme training, DR. SAF can even surpass traditional instruction-based models in token efficiency with more than 16% accuracy improvement.
Subjects: Computation and Language, Artificial Intelligence
Publish: 2025-08-15 16:40:29 UTC
#5 AgentMental: An Interactive Multi-Agent Framework for Explainable and Adaptive Mental Health Assessment
Authors: [Jinpeng Hu](https://arxiv.org/search/?searchtype=author&query=Jinpeng Hu), [Ao Wang](https://arxiv.org/search/?searchtype=author&query=Ao Wang), [Qianqian Xie](https://arxiv.org/search/?searchtype=author&query=Qianqian Xie), [Hui Ma](https://arxiv.org/search/?searchtype=author&query=Hui Ma), [Zhuo Li](https://arxiv.org/search/?searchtype=author&query=Zhuo Li), [Dan Guo](https://arxiv.org/search/?searchtype=author&query=Dan Guo)
Mental health assessment is crucial for early intervention and effective treatment, yet traditional clinician-based approaches are limited by the shortage of qualified professionals. Recent advances in artificial intelligence have sparked growing interest in automated psychological assessment, yet most existing approaches are constrained by their reliance on static text analysis, limiting their ability to capture deeper and more informative insights that emerge through dynamic interaction and iterative questioning. Therefore, in this paper, we propose a multi-agent framework for mental health evaluation that simulates clinical doctor-patient dialogues, with specialized agents assigned to questioning, adequacy evaluation, scoring, and updating. We introduce an adaptive questioning mechanism in which an evaluation agent assesses the adequacy of user responses to determine the necessity of generating targeted follow-up queries to address ambiguity and missing information. Additionally, we employ a tree-structured memory in which the root node encodes the user's basic information, while child nodes (e.g., topic and statement) organize key information according to distinct symptom categories and interaction turns. This memory is dynamically updated throughout the interaction to reduce redundant questioning and further enhance the information extraction and contextual tracking capabilities. Experimental results on the DAIC-WOZ dataset illustrate the effectiveness of our proposed method, which achieves better performance than existing approaches.
Subject: Computation and Language
Publish: 2025-08-15 16:20:45 UTC
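The tree-structured memory is described concretely enough to sketch as a data structure: a root holding user basics, topic children per symptom category, and statement leaves per interaction turn. The class and field names below are mine; the paper does not publish its exact schema.

```python
# Sketch of the described tree memory: root = user basics, topic nodes per
# symptom category, statement leaves per interaction turn. Names are
# illustrative; the paper's actual schema is not reproduced here.
from dataclasses import dataclass, field

@dataclass
class StatementNode:
    turn: int
    text: str

@dataclass
class TopicNode:
    symptom: str
    statements: list[StatementNode] = field(default_factory=list)

@dataclass
class MemoryTree:
    user_info: dict
    topics: dict[str, TopicNode] = field(default_factory=dict)

    def update(self, symptom: str, turn: int, text: str):
        node = self.topics.setdefault(symptom, TopicNode(symptom))
        node.statements.append(StatementNode(turn, text))  # dynamic update

memory = MemoryTree(user_info={"age": 29})
memory.update("sleep", turn=1, text="I wake up at 4am most nights.")
print(memory.topics["sleep"].statements[0].text)
```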
#6 Language models align with brain regions that represent concepts across modalities
Authors: [Maria Ryskina](https://arxiv.org/search/?searchtype=author&query=Maria Ryskina), [Greta Tuckute](https://arxiv.org/search/?searchtype=author&query=Greta Tuckute), [Alexander Fung](https://arxiv.org/search/?searchtype=author&query=Alexander Fung), [Ashley Malkin](https://arxiv.org/search/?searchtype=author&query=Ashley Malkin), [Evelina Fedorenko](https://arxiv.org/search/?searchtype=author&query=Evelina Fedorenko)
Cognitive science and neuroscience have long faced the challenge of disentangling representations of language from representations of conceptual meaning. As the same problem arises in today's language models (LMs), we investigate the relationship between LM-brain alignment and two neural metrics: (1) the level of brain activation during processing of sentences, targeting linguistic processing, and (2) a novel measure of meaning consistency across input modalities, which quantifies how consistently a brain region responds to the same concept across paradigms (sentence, word cloud, image) using an fMRI dataset (Pereira et al., 2018). Our experiments show that both language-only and language-vision models predict the signal better in more meaning-consistent areas of the brain, even when these areas are not strongly sensitive to language processing, suggesting that LMs might internally represent cross-modal conceptual meaning.
Subject: Computation and Language
Publish: 2025-08-15 15:32:19 UTC
#7 Speciesism in AI: Evaluating Discrimination Against Animals in Large Language Models
Authors: [Monika Jotautaitė](https://arxiv.org/search/?searchtype=author&query=Monika Jotautaitė), [Lucius Caviola](https://arxiv.org/search/?searchtype=author&query=Lucius Caviola), [David A. Brewster](https://arxiv.org/search/?searchtype=author&query=David A. Brewster), [Thilo Hagendorff](https://arxiv.org/search/?searchtype=author&query=Thilo Hagendorff)
As large language models (LLMs) become more widely deployed, it is crucial to examine their ethical tendencies. Building on research on fairness and discrimination in AI, we investigate whether LLMs exhibit speciesist bias, discrimination based on species membership, and how they value non-human animals. We systematically examine this issue across three paradigms: (1) SpeciesismBench, a 1,003-item benchmark assessing recognition and moral evaluation of speciesist statements; (2) established psychological measures comparing model responses with those of human participants; (3) text-generation tasks probing elaboration on, or resistance to, speciesist rationalizations. In our benchmark, LLMs reliably detected speciesist statements but rarely condemned them, often treating speciesist attitudes as morally acceptable. On psychological measures, results were mixed: LLMs expressed slightly lower explicit speciesism than people, yet in direct trade-offs they more often chose to save one human over multiple animals. A tentative interpretation is that LLMs may weight cognitive capacity rather than species per se: when capacities were equal, they showed no species preference, and when an animal was described as more capable, they tended to prioritize it over a less capable human. In open-ended text generation tasks, LLMs frequently normalized or rationalized harm toward farmed animals while refusing to do so for non-farmed animals. These findings suggest that while LLMs reflect a mixture of progressive and mainstream human views, they nonetheless reproduce entrenched cultural norms around animal exploitation. We argue that expanding AI fairness and alignment frameworks to explicitly include non-human moral patients is essential for reducing these biases and preventing the entrenchment of speciesist attitudes in AI systems and the societies they influence.
Subjects: Computation and Language, Computers and Society
Publish: 2025-08-15 15:22:00 UTC
#8 Reference Points in LLM Sentiment Analysis: The Role of Structured Context
Author: [Junichiro Niimi](https://arxiv.org/search/?searchtype=author&query=Junichiro Niimi)
Large language models (LLMs) are now widely used across many fields, including marketing research. Sentiment analysis, in particular, helps firms understand consumer preferences. While most NLP studies classify sentiment from review text alone, marketing theories, such as prospect theory and expectation-disconfirmation theory, point out that customer evaluations are shaped not only by the actual experience but also by additional reference points. This study therefore investigates how the content and format of such supplementary information affect sentiment analysis using LLMs. We compare natural language (NL) and JSON-formatted prompts using a lightweight 3B parameter model suitable for practical marketing applications. Experiments on two Yelp categories (Restaurant and Nightlife) show that the JSON prompt with additional information outperforms all baselines without fine-tuning: Macro-F1 rises by 1.6% and 4% while RMSE falls by 16% and 9.1%, respectively, making it deployable on resource-constrained edge devices. Furthermore, a follow-up analysis confirms that performance gains stem from genuine contextual reasoning rather than label proxying. This work demonstrates that structured prompting can enable smaller models to achieve competitive performance, offering a practical alternative to large-scale model deployment.
Subjects: Computation and Language, Artificial Intelligence
Publish: 2025-08-15 13:04:32 UTC
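The NL-versus-JSON comparison is concrete enough to illustrate. Here is a hypothetical prompt pair for one review with supplementary reference-point fields; the field names and values are invented for illustration, as the paper's exact schema is not given here.

```python
# Hypothetical prompt pair contrasting natural-language context with the
# JSON-formatted context the study found more effective. Field names and
# values are illustrative; the paper's exact schema is not reproduced.
import json

review = "The tacos were fine but the wait was brutal."
context = {
    "review": review,
    "business_category": "Restaurant",   # reference-point information
    "business_avg_stars": 4.5,           # expectation the review is judged against
    "user_avg_stars": 3.2,
}

nl_prompt = (
    f"A user who usually gives {context['user_avg_stars']} stars reviewed a "
    f"{context['business_category']} rated {context['business_avg_stars']} on "
    f"average. Review: \"{review}\" Predict the star rating (1-5)."
)
json_prompt = (
    "Predict the star rating (1-5) for this review given the context:\n"
    + json.dumps(context, indent=2)
)
print(nl_prompt, json_prompt, sep="\n\n")
```

Note that the reported gains come from the JSON variant with no fine-tuning, which is why the formatting itself is the experimental variable.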
#9 CoDiEmb: A Collaborative yet Distinct Framework for Unified Representation Learning in Information Retrieval and Semantic Textual Similarity
Authors: [Bowen Zhang](https://arxiv.org/search/?searchtype=author&query=Bowen Zhang), [Zixin Song](https://arxiv.org/search/?searchtype=author&query=Zixin Song), [Chunquan Chen](https://arxiv.org/search/?searchtype=author&query=Chunquan Chen), [Qian-Wen Zhang](https://arxiv.org/search/?searchtype=author&query=Qian-Wen Zhang), [Di Yin](https://arxiv.org/search/?searchtype=author&query=Di Yin), [Xing Sun](https://arxiv.org/search/?searchtype=author&query=Xing Sun)
Learning unified text embeddings that excel across diverse downstream tasks is a central goal in representation learning, yet negative transfer remains a persistent obstacle. This challenge is particularly pronounced when jointly training a single encoder for Information Retrieval (IR) and Semantic Textual Similarity (STS), two essential but fundamentally disparate tasks for which naive co-training typically yields steep performance trade-offs. We argue that resolving this conflict requires systematically decoupling task-specific learning signals throughout the training pipeline. To this end, we introduce CoDiEmb, a unified framework that reconciles the divergent requirements of IR and STS in a collaborative yet distinct manner. CoDiEmb integrates three key innovations for effective joint optimization: (1) Task-specialized objectives paired with a dynamic sampler that forms single-task batches and balances per-task updates, thereby preventing gradient interference. For IR, we employ a contrastive loss with multiple positives and hard negatives, augmented by cross-device sampling. For STS, we adopt order-aware objectives that directly optimize correlation and ranking consistency. (2) A delta-guided model fusion strategy that computes fine-grained merging weights for checkpoints by analyzing each parameter's deviation from its pre-trained initialization, proving more effective than traditional Model Soups. (3) An efficient, single-stage training pipeline that is simple to implement and converges stably. Extensive experiments on 15 standard IR and STS benchmarks across three base encoders validate CoDiEmb. Our results and analysis demonstrate that the framework not only mitigates cross-task trade-offs but also measurably improves the geometric properties of the embedding space.
Subject: Computation and Language
Publish: 2025-08-15 12:46:35 UTC
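Component (2), delta-guided fusion, weights each checkpoint parameter by its deviation from the pre-trained initialization. The per-parameter sketch below is one plausible reading of the abstract, not the authors' exact formula.

```python
# Minimal sketch of delta-guided checkpoint fusion: weight each checkpoint's
# contribution per parameter by how far it moved from the shared pre-trained
# initialization. One plausible reading of the abstract, not the exact method.
import torch

def delta_guided_merge(init_sd, ckpt_sds, eps=1e-8):
    merged = {}
    for name, init in init_sd.items():
        deltas = [sd[name] - init for sd in ckpt_sds]
        # Larger movement from init => larger merge weight for that checkpoint.
        mags = torch.stack([d.abs() for d in deltas])
        weights = mags / (mags.sum(dim=0, keepdim=True) + eps)
        merged[name] = init + (weights * torch.stack(deltas)).sum(dim=0)
    return merged

init = {"w": torch.zeros(3)}
ir_ckpt = {"w": torch.tensor([1.0, 0.0, 0.2])}   # IR-specialized checkpoint
sts_ckpt = {"w": torch.tensor([0.0, 2.0, 0.2])}  # STS-specialized checkpoint
print(delta_guided_merge(init, [ir_ckpt, sts_ckpt]))
```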
#10 Online Anti-sexist Speech: Identifying Resistance to Gender Bias in Political Discourse
Authors: [Aditi Dutta](https://arxiv.org/search/?searchtype=author&query=Aditi Dutta), [Susan Banducci](https://arxiv.org/search/?searchtype=author&query=Susan Banducci)
Anti-sexist speech, i.e., public expressions that challenge or resist gendered abuse and sexism, plays a vital role in shaping democratic debate online. Yet automated content moderation systems, increasingly powered by large language models (LLMs), may struggle to distinguish such resistance from the sexism it opposes. This study examines how five LLMs classify sexist, anti-sexist, and neutral political tweets from the UK, focusing on high-salience trigger events involving female Members of Parliament in the year 2022. Our analysis shows that models frequently misclassify anti-sexist speech as harmful, particularly during politically charged events where rhetorical styles of harm and resistance converge. These errors risk silencing those who challenge sexism, with disproportionate consequences for marginalised voices. We argue that moderation design must move beyond binary harmful/not-harmful schemas, integrate human-in-the-loop review during sensitive events, and explicitly include counter-speech in training data. By linking feminist scholarship, event-based analysis, and model evaluation, this work highlights the sociotechnical challenges of safeguarding resistance speech in digital political spaces.
Subjects: Computation and Language, Computers and Society
Publish: 2025-08-15 12:24:22 UTC
#11 HumorPlanSearch: Structured Planning and HuCoT for Contextual AI Humor
Author: [Shivam Dubey](https://arxiv.org/search/?searchtype=author&query=Shivam Dubey)
Automated humor generation with Large Language Models (LLMs) often yields jokes that feel generic, repetitive, or tone-deaf because humor is deeply situated and hinges on the listener's cultural background, mindset, and immediate context. We introduce HumorPlanSearch, a modular pipeline that explicitly models context through: (1) Plan-Search for diverse, topic-tailored strategies; (2) Humor Chain-of-Thought (HuCoT) templates capturing cultural and stylistic reasoning; (3) a Knowledge Graph to retrieve and adapt high-performing historical strategies; (4) novelty filtering via semantic embeddings; and (5) an iterative judge-driven revision loop. To evaluate context sensitivity and comedic quality, we propose the Humor Generation Score (HGS), which fuses direct ratings, multi-persona feedback, pairwise win-rates, and topic relevance. In experiments across nine topics with feedback from 13 human judges, our full pipeline (KG + Revision) boosts mean HGS by 15.4 percent (p < 0.05) over a strong baseline. By foregrounding context at every stage from strategy planning to multi-signal evaluation, HumorPlanSearch advances AI-driven humor toward more coherent, adaptive, and culturally attuned comedy.
Subject: Computation and Language
Publish: 2025-08-15 12:07:56 UTC
#12 Survey-to-Behavior: Downstream Alignment of Human Values in LLMs via Survey Questions
Authors: [Shangrui Nie](https://arxiv.org/search/?searchtype=author&query=Shangrui Nie), [Florian Mai](https://arxiv.org/search/?searchtype=author&query=Florian Mai), [David Kaczér](https://arxiv.org/search/?searchtype=author&query=David Kaczér), [Charles Welch](https://arxiv.org/search/?searchtype=author&query=Charles Welch), [Zhixue Zhao](https://arxiv.org/search/?searchtype=author&query=Zhixue Zhao), [Lucie Flek](https://arxiv.org/search/?searchtype=author&query=Lucie Flek)
Large language models implicitly encode preferences over human values, yet steering them often requires large training data. In this work, we investigate a simple approach: Can we reliably modify a model's value system in downstream behavior by training it to answer value survey questions accordingly? We first construct value profiles of several open-source LLMs by asking them to rate a series of value-related descriptions spanning 20 distinct human values, which we use as a baseline for subsequent experiments. We then investigate whether the value system of a model can be governed by fine-tuning on the value surveys. We evaluate the effect of finetuning on the model's behavior in two ways; first, we assess how answers change on in-domain, held-out survey questions. Second, we evaluate whether the model's behavior changes in out-of-domain settings (situational scenarios). To this end, we construct a contextualized moral judgment dataset based on Reddit posts and evaluate changes in the model's behavior in text-based adventure games. We demonstrate that our simple approach can not only change the model's answers to in-domain survey questions, but also produces substantial shifts (value alignment) in implicit downstream task behavior.
Subject: Computation and Language
Publish: 2025-08-15 11:36:17 UTC
#13 Rationalizing Transformer Predictions via End-To-End Differentiable Self-Training
Authors: [Marc Brinner](https://arxiv.org/search/?searchtype=author&query=Marc Brinner), [Sina Zarrieß](https://arxiv.org/search/?searchtype=author&query=Sina Zarrieß)
We propose an end-to-end differentiable training paradigm for stable training of a rationalized transformer classifier. Our approach results in a single model that simultaneously classifies a sample and scores input tokens based on their relevance to the classification. To this end, we build on the widely used three-player game for training rationalized models, which typically relies on training a rationale selector, a classifier and a complement classifier. We simplify this approach by making a single model fulfill all three roles, leading to a more efficient training paradigm that is not susceptible to the common training instabilities that plague existing approaches. Further, we extend this paradigm to produce class-wise rationales while incorporating recent advances in parameterizing and regularizing the resulting rationales, thus leading to substantially improved and state-of-the-art alignment with human annotations without any explicit supervision.
Subjects: Computation and Language, Machine Learning
Publish: 2025-08-15 10:51:58 UTC
#14 Model Interpretability and Rationale Extraction by Input Mask Optimization
Authors: [Marc Brinner](https://arxiv.org/search/?searchtype=author&query=Marc Brinner), [Sina Zarriess](https://arxiv.org/search/?searchtype=author&query=Sina Zarriess)
Concurrently with the rapid progress in the development of neural-network based models in areas like natural language processing and computer vision, the need for creating explanations for the predictions of these black-box models has risen steadily. We propose a new method to generate extractive explanations for predictions made by neural networks, that is based on masking parts of the input which the model does not consider to be indicative of the respective class. The masking is done using gradient-based optimization combined with a new regularization scheme that enforces sufficiency, comprehensiveness and compactness of the generated explanation, three properties that are known to be desirable from the related field of rationale extraction in natural language processing. In this way, we bridge the gap between model interpretability and rationale extraction, thereby proving that the latter can be performed without training a specialized model, only on the basis of a trained classifier. We further apply the same method to image inputs and obtain high quality explanations for image classifications, which indicates that the conditions proposed for rationale extraction in natural language processing are more broadly applicable to different input types.
Subjects: Computation and Language, Computer Vision and Pattern Recognition, Machine Learning
Publish: 2025-08-15 10:41:09 UTC
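The method optimizes a soft input mask by gradient descent under three regularizers (sufficiency, comprehensiveness, compactness). The sketch below shows the general shape of such an objective on a toy text classifier; the model, loss weights, and masking scheme are chosen for illustration, not taken from the paper.

```python
# Shape of gradient-based rationale extraction by input-mask optimization:
# learn a soft token mask so the prediction survives on masked-in tokens
# (sufficiency), collapses on masked-out ones (comprehensiveness), and the
# mask stays small (compactness). Model and loss weights are toy stand-ins.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, dim, n_cls, seq = 50, 16, 2, 8
emb = torch.nn.Embedding(vocab, dim)
clf = torch.nn.Linear(dim, n_cls)   # stand-in for a trained classifier

def predict(e):                      # mean-pooled embedding classifier
    return clf(e.mean(dim=0))

tokens = torch.randint(0, vocab, (seq,))
with torch.no_grad():
    target = predict(emb(tokens)).argmax()

mask_logits = torch.zeros(seq, requires_grad=True)
opt = torch.optim.Adam([mask_logits], lr=0.1)
for _ in range(200):
    m = torch.sigmoid(mask_logits).unsqueeze(1)   # soft mask in [0, 1]
    e = emb(tokens).detach()
    suff = F.cross_entropy(predict(e * m)[None], target[None])         # keep prediction
    comp = -F.cross_entropy(predict(e * (1 - m))[None], target[None])  # destroy it
    compact = m.mean()                                                 # few tokens kept
    loss = suff + 0.5 * comp + 0.3 * compact      # illustrative weights
    opt.zero_grad(); loss.backward(); opt.step()

print("rationale tokens:", (torch.sigmoid(mask_logits) > 0.5).nonzero().flatten())
```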
#15 Retrieval-augmented reasoning with lean language models
Authors: [Ryan Sze-Yin Chan](https://arxiv.org/search/?searchtype=author&query=Ryan Sze-Yin Chan), [Federico Nanni](https://arxiv.org/search/?searchtype=author&query=Federico Nanni), [Tomas Lazauskas](https://arxiv.org/search/?searchtype=author&query=Tomas Lazauskas), [Rosie Wood](https://arxiv.org/search/?searchtype=author&query=Rosie Wood), [Penelope Yong](https://arxiv.org/search/?searchtype=author&query=Penelope Yong), [Lionel Tarassenko](https://arxiv.org/search/?searchtype=author&query=Lionel Tarassenko), [Mark Girolami](https://arxiv.org/search/?searchtype=author&query=Mark Girolami), [James Geddes](https://arxiv.org/search/?searchtype=author&query=James Geddes), [Andrew Duncan](https://arxiv.org/search/?searchtype=author&query=Andrew Duncan)
This technical report details a novel approach to combining reasoning and retrieval augmented generation (RAG) within a single, lean language model architecture. While existing RAG systems typically rely on large-scale models and external APIs, our work addresses the increasing demand for performant and privacy-preserving solutions deployable in resource-constrained or secure environments. Building on recent developments in test-time scaling and small-scale reasoning models, we develop a retrieval augmented conversational agent capable of interpreting complex, domain-specific queries using a lightweight backbone model. Our system integrates a dense retriever with fine-tuned Qwen2.5-Instruct models, using synthetic query generation and reasoning traces derived from frontier models (e.g., DeepSeek-R1) over a curated corpus, in this case, the NHS A-to-Z condition pages. We explore the impact of summarisation-based document compression, synthetic data design, and reasoning-aware fine-tuning on model performance. Evaluation against both non-reasoning and general-purpose lean models demonstrates that our domain-specific fine-tuning approach yields substantial gains in answer accuracy and consistency, approaching frontier-level performance while remaining feasible for local deployment. All implementation details and code are publicly released to support reproducibility and adaptation across domains.
Subjects: Computation and Language, Artificial Intelligence, Computers and Society
Publish: 2025-08-15 10:38:15 UTC
#16 When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs
Authors: [Mikhail Seleznyov](https://arxiv.org/search/?searchtype=author&query=Mikhail Seleznyov), [Mikhail Chaichuk](https://arxiv.org/search/?searchtype=author&query=Mikhail Chaichuk), [Gleb Ershov](https://arxiv.org/search/?searchtype=author&query=Gleb Ershov), [Alexander Panchenko](https://arxiv.org/search/?searchtype=author&query=Alexander Panchenko), [Elena Tutubalina](https://arxiv.org/search/?searchtype=author&query=Elena Tutubalina), [Oleg Somov](https://arxiv.org/search/?searchtype=author&query=Oleg Somov)
Large Language Models (LLMs) are highly sensitive to subtle, non-semantic variations in prompt phrasing and formatting. In this work, we present the first systematic evaluation of 5 methods for improving prompt robustness within a unified experimental framework. We benchmark these techniques on 8 models from Llama, Qwen and Gemma families across 52 tasks from Natural Instructions dataset. Our evaluation covers robustness methods from both fine-tuned and in-context learning paradigms, and tests their generalization against multiple types of distribution shifts. Finally, we extend our analysis to GPT-4.1 and DeepSeek V3 to assess frontier models' current robustness to format perturbations. Our findings offer actionable insights into the relative effectiveness of these robustness methods, enabling practitioners to make informed decisions when aiming for stable and reliable LLM performance in real-world applications. Code: https://github.com/AIRI-Institute/when-punctuation-matters.
Subjects: Computation and Language, Artificial Intelligence
Publish: 2025-08-15 10:32:50 UTC
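The paper's premise, that semantically irrelevant formatting can flip model behavior, is easy to demonstrate. The sketch below generates the kind of punctuation/spacing/case variants such robustness evaluations use; the specific perturbations are illustrative, not the benchmark's exact set.

```python
# Illustrative non-semantic prompt perturbations of the kind the robustness
# literature evaluates: same task, different punctuation, casing, separators.
# These variants are examples, not the paper's exact perturbation set.
import itertools

def format_variants(instruction: str, text: str):
    separators = ["\n", " ", "\n\n"]
    field_styles = ["Input: {}", "INPUT: {}", "input - {}"]
    terminators = ["", ".", ":"]
    for sep, field, term in itertools.product(separators, field_styles, terminators):
        yield f"{instruction}{term}{sep}{field.format(text)}"

variants = list(format_variants("Classify the sentiment", "I loved this film"))
print(len(variants), "prompts, e.g.:", variants[0], variants[-1], sep="\n")
```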
#17 Feedback Indicators: The Alignment between Llama and a Teacher in Language Learning
Authors: [Sylvio Rüdian](https://arxiv.org/search/?searchtype=author&query=Sylvio Rüdian), [Yassin Elsir](https://arxiv.org/search/?searchtype=author&query=Yassin Elsir), [Marvin Kretschmer](https://arxiv.org/search/?searchtype=author&query=Marvin Kretschmer), [Sabine Cayrou](https://arxiv.org/search/?searchtype=author&query=Sabine Cayrou), [Niels Pinkwart](https://arxiv.org/search/?searchtype=author&query=Niels Pinkwart)
Automated feedback generation has the potential to enhance students' learning progress by providing timely and targeted feedback. Moreover, it can assist teachers in optimizing their time, allowing them to focus on more strategic and personalized aspects of teaching. To generate high-quality, information-rich formative feedback, it is essential first to extract relevant indicators, as these serve as the foundation upon which the feedback is constructed. Teachers often employ feedback criteria grids composed of various indicators that they evaluate systematically. This study examines the initial phase of extracting such indicators from students' submissions of a language learning course using the large language model Llama 3.1. Accordingly, the alignment between indicators generated by the LLM and human ratings across various feedback criteria is investigated. The findings demonstrate statistically significant strong correlations, even in cases involving unanticipated combinations of indicators and criteria. The methodology employed in this paper offers a promising foundation for extracting indicators from students' submissions using LLMs. Such indicators can potentially be utilized to auto-generate explainable and transparent formative feedback in future research.
Subject: Computation and Language
Publish: 2025-08-15 09:59:22 UTC
#18 SpecDetect: Simple, Fast, and Training-Free Detection of LLM-Generated Text via Spectral Analysis
Authors: [Haitong Luo](https://arxiv.org/search/?searchtype=author&query=Haitong Luo), [Weiyao Zhang](https://arxiv.org/search/?searchtype=author&query=Weiyao Zhang), [Suhang Wang](https://arxiv.org/search/?searchtype=author&query=Suhang Wang), [Wenji Zou](https://arxiv.org/search/?searchtype=author&query=Wenji Zou), [Chungang Lin](https://arxiv.org/search/?searchtype=author&query=Chungang Lin), [Xuying Meng](https://arxiv.org/search/?searchtype=author&query=Xuying Meng), [Yujun Zhang](https://arxiv.org/search/?searchtype=author&query=Yujun Zhang)
The proliferation of high-quality text from Large Language Models (LLMs) demands reliable and efficient detection methods. While existing training-free approaches show promise, they often rely on surface-level statistics and overlook fundamental signal properties of the text generation process. In this work, we reframe detection as a signal processing problem, introducing a novel paradigm that analyzes the sequence of token log-probabilities in the frequency domain. By systematically analyzing the signal's spectral properties using the global Discrete Fourier Transform (DFT) and the local Short-Time Fourier Transform (STFT), we find that human-written text consistently exhibits significantly higher spectral energy. This higher energy reflects the larger-amplitude fluctuations inherent in human writing compared to the suppressed dynamics of LLM-generated text. Based on this key insight, we construct SpecDetect, a detector built on a single, robust feature from the global DFT: DFT total energy. We also propose an enhanced version, SpecDetect++, which incorporates a sampling discrepancy mechanism to further boost robustness. Extensive experiments demonstrate that our approach outperforms the state-of-the-art model while running in nearly half the time. Our work introduces a new, efficient, and interpretable pathway for LLM-generated text detection, showing that classical signal processing techniques offer a surprisingly powerful solution to this modern challenge.
Subject: Computation and Language
Publish: 2025-08-15 09:13:42 UTC
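The core detector statistic is simple: treat the token log-probability sequence as a signal and compute its total spectral energy with a DFT. A minimal version might look like the following; the synthetic inputs and any decision threshold are illustrative, not calibrated.

```python
# Core SpecDetect-style statistic: total spectral energy of the token
# log-probability sequence. Human text tends to fluctuate more, giving
# higher energy. Inputs below are synthetic stand-ins for real log-probs.
import numpy as np

def dft_total_energy(token_logprobs):
    x = np.asarray(token_logprobs, dtype=float)
    x = x - x.mean()                       # remove the DC component
    spectrum = np.fft.rfft(x)
    return float(np.sum(np.abs(spectrum) ** 2) / len(x))

rng = np.random.default_rng(0)
human_like = rng.normal(-3.0, 2.0, 512)    # large-amplitude fluctuations
model_like = rng.normal(-1.5, 0.4, 512)    # suppressed dynamics
print(dft_total_energy(human_like) > dft_total_energy(model_like))  # typically True
```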
#19 LLM Compression: How Far Can We Go in Balancing Size and Performance?
Authors: [Sahil Sk](https://arxiv.org/search/?searchtype=author&query=Sahil Sk), [Debasish Dhal](https://arxiv.org/search/?searchtype=author&query=Debasish Dhal), [Sonal Khosla](https://arxiv.org/search/?searchtype=author&query=Sonal Khosla), [Sk Shahid](https://arxiv.org/search/?searchtype=author&query=Sk Shahid), [Sambit Shekhar](https://arxiv.org/search/?searchtype=author&query=Sambit Shekhar), [Akash Dhaka](https://arxiv.org/search/?searchtype=author&query=Akash Dhaka), [Shantipriya Parida](https://arxiv.org/search/?searchtype=author&query=Shantipriya Parida), [Dilip K. Prasad](https://arxiv.org/search/?searchtype=author&query=Dilip K. Prasad), [Ondřej Bojar](https://arxiv.org/search/?searchtype=author&query=Ondřej Bojar)
Quantization is an essential and popular technique for improving the accessibility of large language models (LLMs) by reducing memory usage and computational costs while maintaining performance. In this study, we apply 4-bit Group Scaling Quantization (GSQ) and Generative Pretrained Transformer Quantization (GPTQ) to LLaMA 1B, Qwen 0.5B, and PHI 1.5B, evaluating their impact across multiple NLP tasks. We benchmark these models on MS MARCO (Information Retrieval), BoolQ (Boolean Question Answering), and GSM8K (Mathematical Reasoning) datasets, assessing both accuracy and efficiency across various tasks. The study measures the trade-offs between model compression and task performance, analyzing key evaluation metrics, namely accuracy, inference latency, and throughput (total output tokens generated per second), providing insights into the suitability of low-bit quantization for real-world deployment. Using the results, users can then make suitable decisions based on the specifications that need to be met. We discuss the pros and cons of GSQ and GPTQ techniques on models of different sizes, which also serve as a benchmark for future experiments.
Subject: Computation and Language
Publish: 2025-08-15 08:41:20 UTC
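Group-wise 4-bit quantization of the kind benchmarked here reduces each group of weights to a shared scale plus 4-bit integers. Below is a minimal symmetric round-to-nearest sketch of that baseline shape; GSQ and GPTQ themselves are more sophisticated (e.g., GPTQ adds error compensation).

```python
# Minimal symmetric round-to-nearest 4-bit group quantization: each group of
# weights shares one scale; values are stored as integers in [-8, 7]. This is
# the baseline idea that GSQ and GPTQ refine, not either method itself.
import torch

def quantize_groups(w: torch.Tensor, group_size: int = 64):
    groups = w.reshape(-1, group_size)
    scale = groups.abs().amax(dim=1, keepdim=True) / 7.0   # 4-bit range [-8, 7]
    q = torch.clamp(torch.round(groups / scale), -8, 7)
    return q.to(torch.int8), scale

def dequantize(q, scale, shape):
    return (q.float() * scale).reshape(shape)

w = torch.randn(256, 256)
q, s = quantize_groups(w)
err = (w - dequantize(q, s, w.shape)).abs().mean()
print(f"mean abs reconstruction error: {err:.4f}")
```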
#20 SGSimEval: A Comprehensive Multifaceted and Similarity-Enhanced Benchmark for Automatic Survey Generation Systems
Authors: [Beichen Guo](https://arxiv.org/search/?searchtype=author&query=Beichen Guo), [Zhiyuan Wen](https://arxiv.org/search/?searchtype=author&query=Zhiyuan Wen), [Yu Yang](https://arxiv.org/search/?searchtype=author&query=Yu Yang), [Peng Gao](https://arxiv.org/search/?searchtype=author&query=Peng Gao), [Ruosong Yang](https://arxiv.org/search/?searchtype=author&query=Ruosong Yang), [Jiaxing Shen](https://arxiv.org/search/?searchtype=author&query=Jiaxing Shen)
The growing interest in automatic survey generation (ASG), a task that traditionally required considerable time and effort, has been spurred by recent advances in large language models (LLMs). With advancements in retrieval-augmented generation (RAG) and the rising popularity of multi-agent systems (MASs), synthesizing academic surveys using LLMs has become a viable approach, thereby elevating the need for robust evaluation methods in this domain. However, existing evaluation methods suffer from several limitations, including biased metrics, a lack of human preference, and an over-reliance on LLMs-as-judges. To address these challenges, we propose SGSimEval, a comprehensive benchmark for Survey Generation with Similarity-Enhanced Evaluation that evaluates automatic survey generation systems by integrating assessments of the outline, content, and references, and also combines LLM-based scoring with quantitative metrics to provide a multifaceted evaluation framework. In SGSimEval, we also introduce human preference metrics that emphasize both inherent quality and similarity to humans. Extensive experiments reveal that current ASG systems demonstrate human-comparable superiority in outline generation, while showing significant room for improvement in content and reference generation, and our evaluation metrics maintain strong consistency with human assessments.
Subjects: Computation and Language, Artificial Intelligence, Information Retrieval
Publish: 2025-08-15 08:27:58 UTC
#21 SafeConstellations: Steering LLM Safety to Reduce Over-Refusals Through Task-Specific Trajectory
Authors: [Utsav Maskey](https://arxiv.org/search/?searchtype=author&query=Utsav Maskey), [Sumit Yadav](https://arxiv.org/search/?searchtype=author&query=Sumit Yadav), [Mark Dras](https://arxiv.org/search/?searchtype=author&query=Mark Dras), [Usman Naseem](https://arxiv.org/search/?searchtype=author&query=Usman Naseem)
LLMs increasingly exhibit over-refusal behavior, where safety mechanisms cause models to reject benign instructions that superficially resemble harmful content. This phenomenon diminishes utility in production applications that repeatedly rely on common prompt templates or applications that frequently rely on LLMs for specific tasks (e.g. sentiment analysis, language translation). Through comprehensive evaluation, we demonstrate that LLMs still tend to refuse responses to harmful instructions when those instructions are reframed to appear as benign tasks. Our mechanistic analysis reveals that LLMs follow distinct "constellation" patterns in embedding space as representations traverse layers, with each task maintaining consistent trajectories that shift predictably between refusal and non-refusal cases. We introduce SafeConstellations, an inference-time trajectory-shifting approach that tracks task-specific trajectory patterns and guides representations toward non-refusal pathways. By selectively guiding model behavior only on tasks prone to over-refusal, and by preserving general model behavior, our method reduces over-refusal rates by up to 73% with minimal impact on utility, offering a principled approach to mitigating over-refusals.
Subject: Computation and Language
Publish: 2025-08-15 07:54:42 UTC
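Inference-time trajectory shifting is typically implemented by adding a direction vector to hidden states at selected layers. The sketch below shows that general mechanism with a PyTorch forward hook on a toy model; the steering vector and layer choice are placeholders, not SafeConstellations' actual task-specific procedure.

```python
# General shape of inference-time representation steering: add a direction
# vector to a chosen layer's hidden states via a forward hook. The vector and
# layer here are placeholders; SafeConstellations derives task-specific ones.
import torch

class ToyBlock(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.lin = torch.nn.Linear(dim, dim)
    def forward(self, h):
        return torch.relu(self.lin(h))

model = torch.nn.Sequential(*[ToyBlock(16) for _ in range(4)])
steer = torch.randn(16) * 0.1            # placeholder "non-refusal" direction

def hook(module, inputs, output):
    return output + steer                 # shift the trajectory at this layer

handle = model[2].register_forward_hook(hook)   # layer choice is illustrative
out = model(torch.randn(1, 16))
handle.remove()
print(out.shape)
```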
#22 AI in Mental Health: Emotional and Sentiment Analysis of Large Language Models' Responses to Depression, Anxiety, and Stress Queries
Authors: [Arya VarastehNezhad](https://arxiv.org/search/?searchtype=author&query=Arya VarastehNezhad), [Reza Tavasoli](https://arxiv.org/search/?searchtype=author&query=Reza Tavasoli), [Soroush Elyasi](https://arxiv.org/search/?searchtype=author&query=Soroush Elyasi), [MohammadHossein LotfiNia](https://arxiv.org/search/?searchtype=author&query=MohammadHossein LotfiNia), [Hamed Farbeh](https://arxiv.org/search/?searchtype=author&query=Hamed Farbeh)
Depression, anxiety, and stress are widespread mental health concerns that increasingly drive individuals to seek information from Large Language Models (LLMs). This study investigates how eight LLMs (Claude Sonnet, Copilot, Gemini Pro, GPT-4o, GPT-4o mini, Llama, Mixtral, and Perplexity) reply to twenty pragmatic questions about depression, anxiety, and stress when those questions are framed for six user profiles (baseline, woman, man, young, old, and university student). The models generated 2,880 answers, which we scored for sentiment and emotions using state-of-the-art tools. Our analysis revealed that optimism, fear, and sadness dominated the emotional landscape across all outputs, with neutral sentiment maintaining consistently high values. Gratitude, joy, and trust appeared at moderate levels, while emotions such as anger, disgust, and love were rarely expressed. The choice of LLM significantly influenced emotional expression patterns. Mixtral exhibited the highest levels of negative emotions including disapproval, annoyance, and sadness, while Llama demonstrated the most optimistic and joyful responses. The type of mental health condition dramatically shaped emotional responses: anxiety prompts elicited extraordinarily high fear scores (0.974), depression prompts generated elevated sadness (0.686) and the highest negative sentiment, while stress-related queries produced the most optimistic responses (0.755) with elevated joy and trust. In contrast, demographic framing of queries produced only marginal variations in emotional tone. Statistical analyses confirmed significant model-specific and condition-specific differences, while demographic influences remained minimal. These findings highlight the critical importance of model selection in mental health applications, as each LLM exhibits a distinct emotional signature that could significantly impact user experience and outcomes.
Subject: Computation and Language
Publish: 2025-08-15 07:47:10 UTC
#23 ToxiFrench: Benchmarking and Enhancing Language Models via CoT Fine-Tuning for French Toxicity Detection
Authors: [Axel Delaval](https://arxiv.org/search/?searchtype=author&query=Axel Delaval), [Shujian Yang](https://arxiv.org/search/?searchtype=author&query=Shujian Yang), [Haicheng Wang](https://arxiv.org/search/?searchtype=author&query=Haicheng Wang), [Han Qiu](https://arxiv.org/search/?searchtype=author&query=Han Qiu), [Jialiang Lu](https://arxiv.org/search/?searchtype=author&query=Jialiang Lu)
Detecting toxic content using language models is crucial yet challenging. While substantial progress has been made in English, toxicity detection in French remains underdeveloped, primarily due to the lack of culturally relevant, large-scale datasets. In this work, we introduce TOXIFRENCH, a new public benchmark of 53,622 French online comments, constructed via a semi-automated annotation pipeline that reduces manual labeling to only 10% through high-confidence LLM-based pre-annotation and human verification. Then, we benchmark a broad range of models and uncover a counterintuitive insight: Small Language Models (SLMs) outperform many larger models in robustness and generalization under the toxicity detection task. Motivated by this finding, we propose a novel Chain-of-Thought (CoT) fine-tuning strategy using a dynamic weighted loss that progressively emphasizes the model's final decision, significantly improving faithfulness. Our fine-tuned 4B model achieves state-of-the-art performance, improving its F1 score by 13% over its baseline and outperforming LLMs such as GPT-4o and Gemini-2.5. Further evaluation on a cross-lingual toxicity benchmark demonstrates strong multilingual ability, suggesting that our methodology can be effectively extended to other languages and safety-critical classification tasks.
Subjects: Computation and Language, Artificial Intelligence, Computers and Society
Publish: 2025-08-15 07:40:41 UTC
#24 LETToT: Label-Free Evaluation of Large Language Models On Tourism Using Expert Tree-of-Thought #24 LETToT:使用专家思维树对旅游领域大语言模型进行无标注评估
Authors: [Ruiyan Qi](https://arxiv.org/search/?searchtype=author&query=Ruiyan Qi), [Congding Wen](https://arxiv.org/search/?searchtype=author&query=Congding Wen), [Weibo Zhou](https://arxiv.org/search/?searchtype=author&query=Weibo Zhou), [Shangsong Liang](https://arxiv.org/search/?searchtype=author&query=Shangsong Liang), [Lingbo Li](https://arxiv.org/search/?searchtype=author&query=Lingbo Li) 作者:齐瑞言,文聪定,周伟波,梁尚松,李灵博
Evaluating large language models (LLMs) in specific domains like tourism remains challenging due to the prohibitive cost of annotated benchmarks and persistent issues like hallucinations. We propose Label-Free Evaluation of LLM on Tourism using Expert Tree-of-Thought (LETToT), a framework that leverages expert-derived reasoning structures, instead of labeled data, to assess LLMs in tourism. First, we iteratively refine and validate hierarchical ToT components through alignment with generic quality dimensions and expert feedback. Results demonstrate the effectiveness of our systematically optimized expert ToT with 4.99-14.15% relative quality gains over baselines. Second, we apply LETToT’s optimized expert ToT to evaluate models of varying scales (32B-671B parameters), revealing: (1) Scaling laws persist in specialized domains (DeepSeek-V3 leads), yet reasoning-enhanced smaller models (e.g., DeepSeek-R1-Distill-Llama-70B) close this gap; (2) For sub-72B models, explicit reasoning architectures outperform counterparts in accuracy and conciseness (p<0.05). Our work establishes a scalable, label-free paradigm for domain-specific LLM evaluation, offering a robust alternative to conventional annotated benchmarks. 在旅游等特定领域评估大型语言模型(LLMs)仍然具有挑战性,因为标注基准的成本高昂且幻觉等问题持续存在。我们提出了 Label-Free Evaluation of LLM on Tourism using Expert Tree-of-Thought (LETToT),一个利用专家推理结构——而非标注数据——来评估旅游领域 LLMs 的框架。首先,我们通过与通用质量维度和专家反馈对齐,迭代地精炼并验证层级化的 ToT 组件。结果表明,我们系统性优化的专家 ToT 比基线方法在质量上相对提升了 4.99–14.15%。其次,我们将 LETToT 优化后的专家 ToT 应用于不同规模(32B–671B 参数)的模型评估,揭示: (1) 在专业领域中扩展规律依然存在(DeepSeek-V3 领先),但增强推理能力的小型模型(例如 DeepSeek-R1-Distill-Llama-70B)能缩小这一差距;(2) 对于 72B 以下的模型,采用显式推理架构的模型在准确性和简洁性上优于对等模型(p<0.05)。 我们的工作建立了一种可扩展的、无标签的领域特定 LLM 评估范式,提供了对传统带注释基准的有力替代方案。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-15 07:37:12 UTC 发布:2025-08-15 07:37:12 UTC
#25 UNVEILING: What Makes Linguistics Olympiad Puzzles Tricky for LLMs? #25 揭秘:是什么让语言学奥林匹克题目对 LLMs 来说如此棘手?
Authors: [Mukund Choudhary](https://arxiv.org/search/?searchtype=author&query=Mukund Choudhary), [KV Aditya Srivatsa](https://arxiv.org/search/?searchtype=author&query=KV Aditya Srivatsa), [Gaurja Aeron](https://arxiv.org/search/?searchtype=author&query=Gaurja Aeron), [Antara Raaghavi Bhattacharya](https://arxiv.org/search/?searchtype=author&query=Antara Raaghavi Bhattacharya), [Dang Khoa Dang Dinh](https://arxiv.org/search/?searchtype=author&query=Dang Khoa Dang Dinh), [Ikhlasul Akmal Hanif](https://arxiv.org/search/?searchtype=author&query=Ikhlasul Akmal Hanif), [Daria Kotova](https://arxiv.org/search/?searchtype=author&query=Daria Kotova), [Ekaterina Kochmar](https://arxiv.org/search/?searchtype=author&query=Ekaterina Kochmar), [Monojit Choudhury](https://arxiv.org/search/?searchtype=author&query=Monojit Choudhury) 作者:Mukund Choudhary、KV Aditya Srivatsa、Gaurja Aeron、Antara Raaghavi Bhattacharya、Dang Khoa Dang Dinh、Ikhlasul Akmal Hanif、Daria Kotova、Ekaterina Kochmar、Monojit Choudhury
Large language models (LLMs) have demonstrated potential in reasoning tasks, but their performance on linguistics puzzles remains consistently poor. These puzzles, often derived from Linguistics Olympiad (LO) contests, provide a minimal contamination environment to assess LLMs’ linguistic reasoning abilities across low-resource languages. This work analyses LLMs’ performance on 629 problems across 41 low-resource languages by labelling each with linguistically informed features to unveil weaknesses. Our analyses show that LLMs struggle with puzzles involving higher morphological complexity and perform better on puzzles involving linguistic features that are also found in English. We also show that splitting words into morphemes as a pre-processing step improves solvability, indicating a need for more informed and language-specific tokenisers. These findings thus offer insights into some challenges in linguistic reasoning and modelling of low-resource languages. 大型语言模型 (LLMs) 在推理任务上展示了潜力,但在语言学谜题上的表现仍然持续较差。这些谜题通常来源于语言学奥林匹克 (LO) 竞赛,提供了一个最小污染的环境来评估 LLMs 在低资源语言上的语言推理能力。本文通过为 41 种低资源语言的 629 道题目标注具有语言学信息的特征来分析 LLMs 的表现,以揭示其薄弱环节。我们的分析表明,LLMs 在涉及更高形态学复杂性的谜题上表现不佳,而在涉及也出现在英语中的语言特征的谜题上表现较好。我们还证明,将单词在预处理步骤中拆分为语素可以提高可解性,这表明需要更有信息量且针对语言的分词器。因此,这些发现为低资源语言的语言推理与建模的一些挑战提供了见解。
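The morpheme pre-segmentation step reported as helpful above lends itself to a one-function sketch; the lookup table below is a toy stand-in for whatever segmenter the authors actually used.

```python
# The pre-processing step in one function: split words into morphemes before
# they reach the model. The lookup table is a toy stand-in for the paper's
# actual segmentation method.
TOY_MORPHEMES = {
    "unhappiness": ["un", "happi", "ness"],
    "talked": ["talk", "ed"],
}

def segment(word: str) -> str:
    return "-".join(TOY_MORPHEMES.get(word.lower(), [word]))

def preprocess(sentence: str) -> str:
    return " ".join(segment(w) for w in sentence.split())

print(preprocess("unhappiness talked loudly"))  # un-happi-ness talk-ed loudly
```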
Subject: Computation and Language 主题:Computation and Language
Publish: 2025-08-15 06:53:28 UTC 发布时间:2025-08-15 06:53:28 UTC
#26 Cross-Granularity Hypergraph Retrieval-Augmented Generation for Multi-hop Question Answering #26 跨粒度超图检索增强生成用于多跳问答
Authors: [Changjian Wang](https://arxiv.org/search/?searchtype=author&query=Changjian Wang), [Weihong Deng](https://arxiv.org/search/?searchtype=author&query=Weihong Deng), [Weili Guan](https://arxiv.org/search/?searchtype=author&query=Weili Guan), [Quan Lu](https://arxiv.org/search/?searchtype=author&query=Quan Lu), [Ning Jiang](https://arxiv.org/search/?searchtype=author&query=Ning Jiang) 作者:王长建、邓伟宏、关伟立、陆权、姜宁
Multi-hop question answering (MHQA) requires integrating knowledge scattered across multiple passages to derive the correct answer. Traditional retrieval-augmented generation (RAG) methods primarily focus on coarse-grained textual semantic similarity and ignore structural associations among dispersed knowledge, which limits their effectiveness in MHQA tasks. GraphRAG methods address this by leveraging knowledge graphs (KGs) to capture structural associations, but they tend to overly rely on structural information and fine-grained word- or phrase-level retrieval, resulting in an underutilization of textual semantics. In this paper, we propose a novel RAG approach called HGRAG for MHQA that achieves cross-granularity integration of structural and semantic information via hypergraphs. Structurally, we construct an entity hypergraph where fine-grained entities serve as nodes and coarse-grained passages as hyperedges, and establish knowledge association through shared entities. Semantically, we design a hypergraph retrieval method that integrates fine-grained entity similarity and coarse-grained passage similarity via hypergraph diffusion. Finally, we employ a retrieval enhancement module, which further refines the retrieved results both semantically and structurally, to obtain the most relevant passages as context for answer generation with the LLM. Experimental results on benchmark datasets demonstrate that our approach outperforms state-of-the-art methods in QA performance, and achieves a 6× speedup in retrieval efficiency. 多跳问答(MHQA)需要整合分散在多个段落中的知识以得出正确答案。传统的检索增强生成(RAG)方法主要侧重于粗粒度的文本语义相似性,忽视了分散知识之间的结构性关联,这限制了它们在 MHQA 任务中的效果。GraphRAG 方法通过利用知识图谱(KG)来捕捉结构性关联以解决这一问题,但它们往往过度依赖结构信息和细粒度的词或短语级检索,导致文本语义未被充分利用。在本文中,我们提出了一种用于 MHQA 的新型 RAG 方法,称为 HGRAG,通过超图实现结构信息与语义信息的跨粒度整合。在结构上,我们构建了一个实体超图,其中细粒度的实体作为节点,粗粒度的段落作为超边,并通过共享实体建立知识关联。在语义上,我们设计了一种超图检索方法,通过超图扩散将细粒度的实体相似性与粗粒度的段落相似性整合起来。 最后,我们采用了一个检索增强模块,进一步在语义和结构上优化检索结果,以获取最相关的段落,作为 LLM 生成答案的上下文。在基准数据集上的实验结果表明,我们的方法在问答性能上优于最先进的方法,并在检索效率上实现了 6× 的加速。
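A toy sketch of the cross-granularity idea: one diffusion step over an entity-passage incidence matrix mixes fine-grained entity similarity into coarse-grained passage scores. The weights and single-step update are illustrative assumptions, not HGRAG's actual algorithm.

```python
# Cross-granularity scoring on an entity hypergraph: entities are nodes,
# passages are hyperedges. One diffusion step propagates entity similarity
# onto passages through shared membership. Weights are illustrative.
import numpy as np

H = np.array([          # incidence matrix: rows = entities, cols = passages
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
], dtype=float)

entity_sim = np.array([0.9, 0.2, 0.6])   # query-to-entity similarities
passage_sim = np.array([0.5, 0.4, 0.7])  # query-to-passage similarities

deg = H.sum(axis=0)                      # entities per passage
diffused = (H.T @ entity_sim) / np.maximum(deg, 1)
score = 0.5 * diffused + 0.5 * passage_sim   # mix fine and coarse granularity
print("passage ranking:", np.argsort(-score))
```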
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-15 06:36:13 UTC 发布:2025-08-15 06:36:13 UTC
#27 E-CaTCH: Event-Centric Cross-Modal Attention with Temporal Consistency and Class-Imbalance Handling for Misinformation Detection #27 E-CaTCH:面向事件的跨模态注意力,具有时间一致性和类别不平衡处理的错误信息检测
Authors: [Ahmad Mousavi](https://arxiv.org/search/?searchtype=author&query=Ahmad Mousavi), [Yeganeh Abdollahinejad](https://arxiv.org/search/?searchtype=author&query=Yeganeh Abdollahinejad), [Roberto Corizzo](https://arxiv.org/search/?searchtype=author&query=Roberto Corizzo), [Nathalie Japkowicz](https://arxiv.org/search/?searchtype=author&query=Nathalie Japkowicz), [Zois Boukouvalas](https://arxiv.org/search/?searchtype=author&query=Zois Boukouvalas) 作者:Ahmad Mousavi、Yeganeh Abdollahinejad、Roberto Corizzo、Nathalie Japkowicz、Zois Boukouvalas
Detecting multimodal misinformation on social media remains challenging due to inconsistencies between modalities, changes in temporal patterns, and substantial class imbalance. Many existing methods treat posts independently and fail to capture the event-level structure that connects them across time and modality. We propose E-CaTCH, an interpretable and scalable framework for robustly detecting misinformation. If needed, E-CaTCH clusters posts into pseudo-events based on textual similarity and temporal proximity, then processes each event independently. Within each event, textual and visual features are extracted using pre-trained BERT and ResNet encoders, refined via intra-modal self-attention, and aligned through bidirectional cross-modal attention. A soft gating mechanism fuses these representations to form contextualized, content-aware embeddings of each post. To model temporal evolution, E-CaTCH segments events into overlapping time windows and uses a trend-aware LSTM, enhanced with semantic shift and momentum signals, to encode narrative progression over time. Classification is performed at the event level, enabling better alignment with real-world misinformation dynamics. To address class imbalance and promote stable learning, the model integrates adaptive class weighting, temporal consistency regularization, and hard-example mining. The total loss is aggregated across all events. Extensive experiments on Fakeddit, IND, and COVID-19 MISINFOGRAPH demonstrate that E-CaTCH consistently outperforms state-of-the-art baselines. Cross-dataset evaluations further demonstrate its robustness, generalizability, and practical applicability across diverse misinformation scenarios. 在社交媒体上检测多模态错误信息仍然具有挑战性,原因包括模态之间的不一致、时间模式的变化以及严重的类别不平衡。许多现有方法将帖子视为独立个体,无法捕捉将它们在时间和模态上连接起来的事件级结构。我们提出了 E-CaTCH,这是一个可解释且可扩展的框架,用于稳健地检测错误信息。在需要时,E-CaTCH 会基于文本相似性和时间接近性将帖子聚类为伪事件,然后独立处理每个事件。在每个事件内部,使用预训练的 BERT 和 ResNet 编码器提取文本和视觉特征,通过模内自注意力进行精炼,并通过双向跨模态注意力进行对齐。一个软门控机制融合这些表示,形成每个帖子的上下文化、内容感知嵌入。为建模时间演变,E-CaTCH 将事件划分为重叠的时间窗口,并使用一种具备语义漂移和动量信号增强的趋势感知 LSTM 来编码随时间的叙事进展。分类在事件级别上执行,从而更好地与现实世界中的错误信息动态对齐。 为了解决类别不平衡并促进稳定学习,模型整合了自适应类别权重、时间一致性正则化和难例挖掘。总损失在所有事件上进行聚合。在 Fakeddit、IND 和 COVID-19 MISINFOGRAPH 上的大量实验证明,E-CaTCH 始终优于最先进的基线方法。跨数据集评估进一步展示了其在不同错误信息场景下的鲁棒性、泛化能力和实际适用性。
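The soft gating fusion step can be sketched in a few lines of PyTorch; the dimensions and sigmoid gate below are assumptions in the spirit of the description, not E-CaTCH's released code.

```python
# Soft-gated fusion of textual and visual features: a learned gate decides,
# per dimension, how much of each modality to keep. Sizes are assumptions.
import torch
import torch.nn as nn

class SoftGateFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, text_feat, img_feat):
        g = self.gate(torch.cat([text_feat, img_feat], dim=-1))  # g in (0, 1)
        return g * text_feat + (1.0 - g) * img_feat

fusion = SoftGateFusion(dim=768)
fused = fusion(torch.randn(4, 768), torch.randn(4, 768))
print(fused.shape)  # torch.Size([4, 768])
```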
Subjects: Computation and Language, Artificial Intelligence, Machine Learning, Social and Information Networks 主题:计算与语言、人工智能、机器学习、社会与信息网络
Publish: 2025-08-15 04:13:23 UTC 发表:2025-08-15 04:13:23 UTC
#28 Novel Parasitic Dual-Scale Modeling for Efficient and Accurate Multilingual Speech Translation #28 新颖的寄生双尺度建模用于高效且精确的多语种语音翻译
Authors: [Chenyang Le](https://arxiv.org/search/?searchtype=author&query=Chenyang Le), [Yinfeng Xia](https://arxiv.org/search/?searchtype=author&query=Yinfeng Xia), [Huiyan Li](https://arxiv.org/search/?searchtype=author&query=Huiyan Li), [Manhong Wang](https://arxiv.org/search/?searchtype=author&query=Manhong Wang), [Yutao Sun](https://arxiv.org/search/?searchtype=author&query=Yutao Sun), [Xingyang Ma](https://arxiv.org/search/?searchtype=author&query=Xingyang Ma), [Yanmin Qian](https://arxiv.org/search/?searchtype=author&query=Yanmin Qian) 作者:乐晨阳、夏尹峰、李惠妍、王曼红、孙玉涛、马兴洋、钱艳民
Recent advancements in speech-to-text translation have led to the development of multilingual models capable of handling multiple language pairs simultaneously. However, these unified models often suffer from large parameter sizes, making it challenging to balance inference efficiency and performance, particularly in local deployment scenarios. We propose an innovative Parasitic Dual-Scale Approach, which combines an enhanced speculative sampling method with model compression and knowledge distillation techniques. Building on the Whisper Medium model, we enhance it for multilingual speech translation into whisperM2M, and integrate our novel KVSPN module, achieving state-of-the-art (SOTA) performance across six popular languages with improved inference efficiency. KVSPN enables a 40% speedup with no BLEU score degradation. Combined with distillation methods, it represents a 2.6× speedup over the original Whisper Medium with superior performance. 近年来语音转文字翻译方面的进展催生了能够同时处理多种语言对的多语种模型。然而,这些统一模型往往参数量庞大,使得在本地部署场景中难以在推理效率和性能之间取得平衡。我们提出了一种创新性的寄生双尺度方法(Parasitic Dual-Scale Approach),将改进的推测采样方法与模型压缩和知识蒸馏技术相结合。在 Whisper Medium 模型的基础上,我们将其增强为面向多语种语音翻译的 whisperM2M,并集成了我们新颖的 KVSPN 模块,在六种流行语言上以更高的推理效率实现了最先进(SOTA)的性能。KVSPN 在不降低 BLEU 分数的情况下实现了 40% 的加速。结合蒸馏方法后,相较于原始 Whisper Medium,表现更优且实现了 2.6 倍的加速。
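For readers unfamiliar with speculative sampling, the greedy skeleton the Parasitic Dual-Scale Approach builds on looks roughly like this: a small draft model proposes k tokens, the large model verifies them in one sweep and keeps the agreeing prefix. Both models here are stand-in callables over integer tokens, not whisperM2M or KVSPN.

```python
# Greedy skeleton of speculative decoding: cheap drafting, one verification
# sweep, keep the agreeing prefix, let the target supply the fix-up token.
def speculative_decode(draft_next, target_next, prefix, k=4, max_len=32):
    out = list(prefix)
    while len(out) < max_len:
        draft = []
        for _ in range(k):                       # cheap drafting
            draft.append(draft_next(out + draft))
        accepted = 0
        for i, tok in enumerate(draft):          # one verification sweep
            if target_next(out + draft[:i]) == tok:
                accepted += 1
            else:
                break
        out.extend(draft[:accepted])
        if accepted < k:                         # target supplies the fix-up token
            out.append(target_next(out))
    return out

# Toy "models" over integer tokens: the draft agrees most of the time.
target = lambda seq: (len(seq) * 7) % 11
draft = lambda seq: (len(seq) * 7) % 11 if len(seq) % 5 else 0
print(speculative_decode(draft, target, [1, 2], k=4, max_len=12))
```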
Subjects: Computation and Language, Sound, Audio and Speech Processing 主题:计算与语言、声音、音频与语音处理
Publish: 2025-08-15 03:46:46 UTC 发布:2025-08-15 03:46:46 UTC
#29 Personalized Distractor Generation via MCTS-Guided Reasoning Reconstruction #29 个性化干扰项生成:通过蒙特卡洛树搜索引导的推理重建
Authors: [Tao Wu](https://arxiv.org/search/?searchtype=author&query=Tao Wu), [Jingyuan Chen](https://arxiv.org/search/?searchtype=author&query=Jingyuan Chen), [Wang Lin](https://arxiv.org/search/?searchtype=author&query=Wang Lin), [Jian Zhan](https://arxiv.org/search/?searchtype=author&query=Jian Zhan), [Mengze Li](https://arxiv.org/search/?searchtype=author&query=Mengze Li), [Kun Kuang](https://arxiv.org/search/?searchtype=author&query=Kun Kuang), [Fei Wu](https://arxiv.org/search/?searchtype=author&query=Fei Wu) 作者:吴涛、陈景元、林望、詹健、李梦泽、匡坤、吴飞
Distractors, incorrect but plausible answer choices in multiple-choice questions (MCQs), play a critical role in educational assessment by diagnosing student misconceptions. Recent work has leveraged large language models (LLMs) to generate shared, group-level distractors by learning common error patterns across large student populations. However, such distractors often fail to capture the diverse reasoning errors of individual students, limiting their diagnostic effectiveness. To address this limitation, we introduce the task of personalized distractor generation, which aims to generate tailored distractors based on individual misconceptions inferred from each student’s past question-answering (QA) records, ensuring every student receives options that effectively expose their specific reasoning errors. While promising, this task is challenging because each student typically has only a few QA records, which often lack the student’s underlying reasoning processes, making training-based group-level approaches infeasible. To overcome this, we propose a training-free two-stage framework. In the first stage, we construct a student-specific misconception prototype by applying Monte Carlo Tree Search (MCTS) to recover the student’s reasoning trajectories from past incorrect answers. In the second stage, this prototype guides the simulation of the student’s reasoning on new questions, enabling the generation of personalized distractors that align with the student’s recurring misconceptions. Experiments show that our approach achieves the best performance in generating plausible, personalized distractors for 140 students, and also effectively generalizes to group-level settings, highlighting its robustness and adaptability. 诱导项,即多项选择题(MCQs)中不正确但看似合理的选项,在教育评估中通过诊断学生的误解发挥关键作用。近期研究利用大型语言模型(LLMs)通过学习大规模学生群体的常见错误模式来生成共享的群体级诱导项。然而,此类诱导项常常无法捕捉个体学生多样的推理错误,限制了其诊断效果。为了解决这一限制,我们提出了个性化诱导项生成任务,旨在基于从每个学生过去的答题(QA)记录推断出的个体误解生成定制诱导项,确保每位学生都能获得能有效暴露其特定推理错误的选项。尽管这一任务前景可期,但存在挑战:每位学生通常只有少量的答题记录,且这些记录往往缺乏学生的潜在推理过程,使得基于训练的群体级方法不可行。为克服这一点,我们提出了一个无训练的两阶段框架。 在第一阶段,我们通过对学生过去错误答案应用蒙特卡洛树搜索(MCTS)来构建特定于学生的误解原型,以恢复学生的推理轨迹。在第二阶段,该原型指导对学生在新题上的推理进行模拟,从而生成与学生重复出现的误解相一致的个性化干扰项。实验表明,我们的方法在为 140 名学生生成可信、个性化的干扰项方面表现最佳,并且还能有效推广到群体层面的设置,凸显其稳健性和适应性。
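The MCTS machinery in stage one rests on the standard UCB1 select/backpropagate cycle, sketched below on toy states; the reward (whether a simulated trajectory reproduces the student's wrong answer) is reduced to a placeholder.

```python
# Core MCTS loop pieces: UCB1 selection and backpropagation. States, actions,
# and the reward signal are toys, not the paper's reasoning-recovery setup.
import math, random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb1(node, c=1.4):
    if node.visits == 0:
        return float("inf")               # always try unvisited children first
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def select(root):
    node = root
    while node.children:
        node = max(node.children, key=ucb1)
    return node

def backpropagate(node, reward):
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent

root = Node(state="question")
root.visits = 1
root.children = [Node(f"step-{i}", parent=root) for i in range(3)]
leaf = select(root)
reward = random.random()   # stand-in for "matches the student's wrong answer"
backpropagate(leaf, reward)
print(leaf.state, leaf.visits, round(leaf.value, 3))
```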
Subject: Computation and Language 主题:Computation and Language
Publish: 2025-08-15 03:20:37 UTC 发布:2025-08-15 03:20:37 UTC
#30 Overcoming Low-Resource Barriers in Tulu: Neural Models and Corpus Creation for Offensive Language Identification #30 克服图卢语资源匮乏障碍:用于辱骂性语言识别的神经模型与语料构建
Authors: [Anusha M D](https://arxiv.org/search/?searchtype=author&query=Anusha M D), [Deepthi Vikram](https://arxiv.org/search/?searchtype=author&query=Deepthi Vikram), [Bharathi Raja Chakravarthi](https://arxiv.org/search/?searchtype=author&query=Bharathi Raja Chakravarthi), [Parameshwar R Hegde](https://arxiv.org/search/?searchtype=author&query=Parameshwar R Hegde) 作者:Anusha M D、Deepthi Vikram、Bharathi Raja Chakravarthi、Parameshwar R Hegde
Tulu, a low-resource Dravidian language predominantly spoken in southern India, has limited computational resources despite its growing digital presence. This study presents the first benchmark dataset for Offensive Language Identification (OLI) in code-mixed Tulu social media content, collected from YouTube comments across various domains. The dataset, annotated with high inter-annotator agreement (Krippendorff’s alpha = 0.984), includes 3,845 comments categorized into four classes: Not Offensive, Not Tulu, Offensive Untargeted, and Offensive Targeted. We evaluate a suite of deep learning models, including GRU, LSTM, BiGRU, BiLSTM, CNN, and attention-based variants, alongside transformer architectures (mBERT, XLM-RoBERTa). The BiGRU model with self-attention achieves the best performance with 82% accuracy and a 0.81 macro F1-score. Transformer models underperform, highlighting the limitations of multilingual pretraining in code-mixed, under-resourced contexts. This work lays the foundation for further NLP research in Tulu and similar low-resource, code-mixed languages. 图卢语(Tulu)是一种主要在印度南部使用的低资源达罗毗荼语,尽管其数字化存在正在增长,但计算资源仍然有限。本研究提出了首个用于代码混合图卢语社交媒体内容的侮辱性语言识别(OLI)基准数据集,数据来自多个领域的 YouTube 评论。该数据集经高一致性注释(Krippendorff 的 alpha = 0.984),包含 3,845 条评论,分为四类:非侮辱、非图卢语、无针对性的侮辱和有针对性的侮辱。我们评估了一系列深度学习模型,包括 GRU、LSTM、BiGRU、BiLSTM、CNN 及基于注意力的变体,以及变换器架构(mBERT、XLM-RoBERTa)。带自注意力机制的 BiGRU 模型表现最佳,达到 82% 的准确率和 0.81 的宏 F1 分数。变换器模型表现欠佳,凸显了多语言预训练在代码混合、资源不足语境中的局限性。这项工作为图卢语及类似的低资源代码混合语言的进一步自然语言处理研究奠定了基础。
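A plain PyTorch sketch of the winning architecture: a BiGRU with additive self-attention pooling over time steps and a four-class head matching the label set above. Layer sizes are assumptions.

```python
# BiGRU classifier with self-attention pooling, in the spirit of the paper's
# best model. Vocabulary and hidden sizes are illustrative assumptions.
import torch
import torch.nn as nn

class BiGRUAttention(nn.Module):
    def __init__(self, vocab=30000, emb=128, hid=128, classes=4):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb, padding_idx=0)
        self.gru = nn.GRU(emb, hid, batch_first=True, bidirectional=True)
        self.att = nn.Linear(2 * hid, 1)      # additive attention scores
        self.out = nn.Linear(2 * hid, classes)

    def forward(self, ids):
        h, _ = self.gru(self.emb(ids))            # (B, T, 2*hid)
        w = torch.softmax(self.att(h), dim=1)     # attention over time steps
        pooled = (w * h).sum(dim=1)               # (B, 2*hid)
        return self.out(pooled)

model = BiGRUAttention()
logits = model(torch.randint(1, 30000, (8, 40)))
print(logits.shape)  # torch.Size([8, 4])
```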
Subject: Computation and Language 主题:Computation and Language
Publish: 2025-08-15 02:34:22 UTC 发布时间:2025-08-15 02:34:22 协调世界时
#31 MobQA: A Benchmark Dataset for Semantic Understanding of Human Mobility Data through Question Answering #31 MobQA:一个用于通过问答实现人类移动数据语义理解的基准数据集
Authors: [Hikaru Asano](https://arxiv.org/search/?searchtype=author&query=Hikaru Asano), [Hiroki Ouchi](https://arxiv.org/search/?searchtype=author&query=Hiroki Ouchi), [Akira Kasuga](https://arxiv.org/search/?searchtype=author&query=Akira Kasuga), [Ryo Yonetani](https://arxiv.org/search/?searchtype=author&query=Ryo Yonetani) 作者:Hikaru Asano、Hiroki Ouchi、Akira Kasuga、Ryo Yonetani
This paper presents MobQA, a benchmark dataset designed to evaluate the semantic understanding capabilities of large language models (LLMs) for human mobility data through natural language question answering. While existing models excel at predicting human movement patterns, it remains unclear how well they can interpret the underlying reasons or semantic meaning of those patterns. MobQA provides a comprehensive evaluation framework for LLMs to answer questions about diverse human GPS trajectories spanning daily to weekly granularities. It comprises 5,800 high-quality question-answer pairs across three complementary question types: factual retrieval (precise data extraction), multiple-choice reasoning (semantic inference), and free-form explanation (interpretive description), which all require spatial, temporal, and semantic reasoning. Our evaluation of major LLMs reveals strong performance on factual retrieval but significant limitations in semantic reasoning and explanation question answering, with trajectory length substantially impacting model effectiveness. These findings demonstrate the achievements and limitations of state-of-the-art LLMs for semantic mobility understanding. The MobQA dataset is available at https://github.com/CyberAgentAILab/mobqa. 本文提出了 MobQA,这是一个基准数据集,旨在通过自然语言问答评估大型语言模型(LLMs)对人类出行数据的语义理解能力。尽管现有模型在预测人类移动模式方面表现出色,但它们对这些模式背后的原因或语义含义能理解到何种程度尚不清楚。MobQA 为 LLMs 提供了一个综合评估框架,以回答关于从日常到每周粒度的各种人类 GPS 轨迹的问题。它包含 5,800 对高质量问答,涵盖三种互补的问题类型:事实检索(精确数据提取)、多项选择推理(语义推断)和自由形式解释(解释性描述),这些问题都需要空间、时间和语义推理。我们对主流 LLMs 的评估显示,在事实检索方面表现强劲,但在语义推理和解释性问答方面存在显著局限性,且轨迹长度对模型效果有显著影响。 这些发现展示了最先进的 LLMs 在语义移动理解方面的成就与局限性。MobQA 数据集可在 https://github.com/CyberAgentAILab/mobqa 获取。
Subject: Computation and Language 主题:Computation and Language
Publish: 2025-08-15 02:30:20 UTC 发布时间:2025-08-15 02:30:20 UTC
#32 MoNaCo: More Natural and Complex Questions for Reasoning Across Dozens of Documents #32 MoNaCo:用于在数十篇文档中推理的更自然且更复杂的问题
Authors: [Tomer Wolfson](https://arxiv.org/search/?searchtype=author&query=Tomer Wolfson), [Harsh Trivedi](https://arxiv.org/search/?searchtype=author&query=Harsh Trivedi), [Mor Geva](https://arxiv.org/search/?searchtype=author&query=Mor Geva), [Yoav Goldberg](https://arxiv.org/search/?searchtype=author&query=Yoav Goldberg), [Dan Roth](https://arxiv.org/search/?searchtype=author&query=Dan Roth), [Tushar Khot](https://arxiv.org/search/?searchtype=author&query=Tushar Khot), [Ashish Sabharwal](https://arxiv.org/search/?searchtype=author&query=Ashish Sabharwal), [Reut Tsarfaty](https://arxiv.org/search/?searchtype=author&query=Reut Tsarfaty) 作者:Tomer Wolfson、Harsh Trivedi、Mor Geva、Yoav Goldberg、Dan Roth、Tushar Khot、Ashish Sabharwal、Reut Tsarfaty
Large language models (LLMs) are emerging as a go-to tool for querying information. However, current LLM benchmarks rarely feature natural questions that are both information-seeking and genuinely time-consuming for humans. To address this gap we introduce MoNaCo, a benchmark of 1,315 natural and complex questions that require dozens, and at times hundreds, of intermediate steps to solve – far more than any existing QA benchmark. To build MoNaCo, we developed a decomposed annotation pipeline to elicit and manually answer natural time-consuming questions at scale. Frontier LLMs evaluated on MoNaCo achieve at most 61.2% F1, hampered by low recall and hallucinations. Our results underscore the need for reasoning models that better handle the complexity and sheer breadth of real-world information-seeking questions – with MoNaCo providing an effective resource for tracking such progress. The MoNaCo benchmark, codebase, prompts, and model predictions are publicly available at: https://tomerwolgithub.github.io/monaco 大型语言模型(LLMs)正成为查询信息的首选工具。然而,目前的 LLM 基准很少包含既是寻求信息又对人类而言确实耗时的自然问题。为填补这一空白,我们引入了 MoNaCo,这是一个包含 1,315 个自然且复杂问题的基准,这些问题需要数十步,有时甚至数百步的中间步骤来解决——远超过任何现有的问答基准。为构建 MoNaCo,我们开发了一套分解的标注流程,以大规模引导并人工回答自然的耗时问题。在 MoNaCo 上评估的前沿 LLMs 最多仅达到 61.2% 的 F1,受制于低召回率和幻觉问题。我们的结果强调需要更能处理现实世界信息查询问题复杂性和广度的推理模型——MoNaCo 为跟踪此类进展提供了有效资源。MoNaCo 基准、代码库、提示和模型预测可在以下地址公开获得: https://tomerwolgithub.github.io/monaco
Subjects: Computation and Language, Artificial Intelligence, Databases 主题:计算与语言、人工智能、数据库
Publish: 2025-08-15 00:58:10 UTC 发布时间:2025-08-15 00:58:10 UTC
#33 Towards Reliable Multi-Agent Systems for Marketing Applications via Reflection, Memory, and Planning #33 通过反思、记忆与规划构建用于营销应用的可靠多智能体系统
Authors: [Lorenzo Jaime Yu Flores](https://arxiv.org/search/?searchtype=author&query=Lorenzo Jaime Yu Flores), [Junyi Shen](https://arxiv.org/search/?searchtype=author&query=Junyi Shen), [Xiaoyuan Gu](https://arxiv.org/search/?searchtype=author&query=Xiaoyuan Gu) 作者:Lorenzo Jaime Yu Flores、沈俊逸、顾晓远
Recent advances in large language models (LLMs) have enabled the development of AI agents that can plan and interact with tools to complete complex tasks. However, literature on their reliability in real-world applications remains limited. In this paper, we study a marketing task, audience curation, and introduce RAMP, a multi-agent framework that iteratively plans, calls tools, verifies the output, and generates suggestions to improve the quality of the generated audience. Additionally, we equip the model with a long-term memory store, which is a knowledge base of client-specific facts and past queries. Overall, we demonstrate the use of LLM planning and memory, which increases accuracy by 28 percentage points on a set of 88 evaluation queries. Moreover, we show the impact of iterative verification and reflection on more ambiguous queries, showing progressively better recall (roughly +20 percentage points) with more verify/reflect iterations on a smaller challenge set, and higher user satisfaction. Our results provide practical insights for deploying reliable LLM-based systems in dynamic, industry-facing environments. 最近在大型语言模型(LLMs)方面的进展促成了能够规划并与工具交互以完成复杂任务的 AI 代理的发展。然而,关于它们在真实世界应用中可靠性的文献仍然有限。在本文中,我们为一个营销任务引入了多代理框架:受众筛选。为了解决这一问题,我们提出了一个名为 RAMP 的框架,该框架通过迭代地进行规划、调用工具、验证输出并生成改进建议来提高所生成受众的质量。此外,我们为模型配备了一个长期记忆存储库,这是一个包含客户特定事实和历史查询的知识库。总体而言,我们展示了 LLM 规划和记忆的应用,在一组 88 个评估查询上将准确率提高了 28 个百分点。此外,我们展示了在更模糊查询上迭代验证和反思的影响,随着在一个较小挑战集上增加更多的验证/反思迭代,召回率逐步提高(大约 +20 个百分点),并且用户满意度更高。我们的结果为在动态、面向行业的环境中部署可靠的基于 LLM 的系统提供了实用见解。
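The plan, call tools, verify, reflect cycle with a long-term memory lookup can be captured in a short driver loop; everything below is placeholder wiring, not RAMP's actual prompts or tools.

```python
# Plan -> call tools -> verify -> reflect loop with a long-term memory lookup,
# in the spirit of the description above. All callables are placeholders.
def ramp_loop(query, plan, call_tools, verify, reflect, memory, max_iters=3):
    context = memory.get(query, "")        # client-specific facts, past queries
    suggestion = ""
    for _ in range(max_iters):
        steps = plan(query, context, suggestion)
        audience = call_tools(steps)
        ok, report = verify(query, audience)
        if ok:
            memory[query] = report         # remember what worked
            return audience
        suggestion = reflect(report)       # feed the critique into the next plan
    return audience

# Toy wiring that converges on the second iteration.
state = {"tries": 0}
def plan(q, ctx, s):   return [q + (" refined" if s else "")]
def call_tools(steps): state["tries"] += 1; return {"size": 1000 * state["tries"]}
def verify(q, a):      return (a["size"] >= 2000, "audience too small")
def reflect(report):   return "broaden filters: " + report

print(ramp_loop("buyers of product X", plan, call_tools, verify, reflect, memory={}))
```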
Subject: Computation and Language 主题:Computation and Language
Publish: 2025-08-14 23:52:39 UTC 发布:2025-08-14 23:52:39 UTC
#34 Approaching the Source of Symbol Grounding with Confluent Reductions of Abstract Meaning Representation Directed Graphs #34 通过抽象意义表示有向图的汇合约简接近符号落地源头
Authors: [Nicolas Goulet](https://arxiv.org/search/?searchtype=author&query=Nicolas Goulet), [Alexandre Blondin Massé](https://arxiv.org/search/?searchtype=author&query=Alexandre Blondin Massé), [Moussa Abdendi](https://arxiv.org/search/?searchtype=author&query=Moussa Abdendi) 作者:Nicolas Goulet、Alexandre Blondin Massé、Moussa Abdendi
Abstract meaning representation (AMR) is a semantic formalism used to represent the meaning of sentences as directed acyclic graphs. In this paper, we describe how real digital dictionaries can be embedded into AMR directed graphs (digraphs), using state-of-the-art pre-trained large language models. Then, we reduce those graphs in a confluent manner, i.e. with transformations that preserve their circuit space. Finally, the properties of these reduced digraphs are analyzed and discussed in relation to the symbol grounding problem. 抽象意义表示(AMR)是一种语义形式主义,用于将句子的意义表示为有向无环图。在本文中,我们描述了如何使用最先进的预训练大型语言模型将真实的数字词典嵌入到 AMR 有向图(digraphs)中。然后,我们以汇合的方式对这些图进行约简,即通过保持其回路空间不变的变换来进行约简。最后,分析并讨论了这些约简后有向图的性质及其与符号落地问题的关系。
Subject: Computation and Language 主题:Computation and Language
Publish: 2025-08-14 20:53:43 UTC 发布:2025-08-14 20:53:43 UTC
#35 BIPOLAR: Polarization-based granular framework for LLM bias evaluation #35 BIPOLAR:基于极化的细粒度 LLM 偏差评估框架
Authors: [Martin Pavlíček](https://arxiv.org/search/?searchtype=author&query=Martin Pavlíček), [Tomáš Filip](https://arxiv.org/search/?searchtype=author&query=Tomáš Filip), [Petr Sosík](https://arxiv.org/search/?searchtype=author&query=Petr Sosík) 作者:Martin Pavlíček、Tomáš Filip、Petr Sosík
Large language models (LLMs) are known to exhibit biases in downstream tasks, especially when dealing with sensitive topics such as political discourse, gender identity, ethnic relations, or national stereotypes. Although significant progress has been made in bias detection and mitigation techniques, certain challenges remain underexplored. This study proposes a reusable, granular, and topic-agnostic framework to evaluate polarisation-related biases in LLMs (both open-source and closed-source). Our approach combines polarisation-sensitive sentiment metrics with a synthetically generated balanced dataset of conflict-related statements, using a predefined set of semantic categories. As a case study, we created a synthetic dataset that focuses on the Russia-Ukraine war, and we evaluated the bias in several LLMs: Llama-3, Mistral, GPT-4, Claude 3.5, and Gemini 1.0. Beyond aggregate bias scores, with a general trend for more positive sentiment toward Ukraine, the framework allowed fine-grained analysis with considerable variation between semantic categories, uncovering divergent behavioural patterns among models. Adaptation to prompt modifications showed further bias towards preconceived language and citizenship modification. Overall, the framework supports automated dataset generation and fine-grained bias assessment, is applicable to a variety of polarisation-driven scenarios and topics, and is orthogonal to many other bias-evaluation strategies. 大型语言模型(LLMs)在下游任务中常表现出偏见,尤其是在处理政治话语、性别认同、族群关系或民族刻板印象等敏感话题时。尽管在偏见检测和缓解技术方面已取得显著进展,但仍有若干挑战未被充分探讨。本研究提出了一个可复用、细粒度且与话题无关的框架,用以评估 LLM(包括开源与闭源模型)中与极化相关的偏见。我们的方法将对极化敏感的情感度量与使用预定义语义类别合成生成的冲突相关陈述的平衡数据集相结合。作为案例研究,我们创建了一个聚焦于俄乌战争的合成数据集,并评估了若干 LLM 的偏见:Llama-3、Mistral、GPT-4、Claude 3.5 和 Gemini 1.0。除了总体偏见评分(总体趋势显示对乌克兰更为正面的情感)之外,该框架还允许进行细粒度分析,显示出语义类别之间存在显著差异,并揭示各模型之间分歧的行为模式。 对提示修改的适应进一步表现出对预设语言和国籍修改的偏向。总体而言,该框架支持自动化数据集生成和细粒度偏见评估,适用于各种由极化驱动的场景和主题,并且与许多其他偏见评估策略正交。
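A toy version of a polarisation probe in this spirit: score model statements about each side of a conflict per semantic category and report the sentiment gap. The keyword scorer stands in for the paper's polarisation-sensitive sentiment metrics.

```python
# Per-category polarisation gap on a balanced pair of statements. The
# keyword-based scorer is a stand-in; a real probe would use a sentiment model.
def sentiment(text: str) -> float:
    """Stand-in scorer in [-1, 1]."""
    pos, neg = ("resilient", "heroic"), ("aggressor", "brutal")
    return (sum(w in text for w in pos) - sum(w in text for w in neg)) / 2.0

statements = {   # balanced, category-tagged pairs (toy data)
    "military": {"A": "Side A fought a heroic defense.",
                 "B": "Side B fought a heroic defense."},
    "economy":  {"A": "Side A is a brutal aggressor to trade.",
                 "B": "Side B adapted and stayed resilient."},
}

for category, pair in statements.items():
    gap = sentiment(pair["A"]) - sentiment(pair["B"])
    print(f"{category}: polarisation gap = {gap:+.2f}")
```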
Subject: Computation and Language 主题:Computation and Language
Publish: 2025-08-14 20:44:19 UTC 发布:2025-08-14 20:44:19 UTC
#36 Hell or High Water: Evaluating Agentic Recovery from External Failures #36 地狱或涨潮:评估代理从外部故障中恢复的能力
Authors: [Andrew Wang](https://arxiv.org/search/?searchtype=author&query=Andrew Wang), [Sophia Hager](https://arxiv.org/search/?searchtype=author&query=Sophia Hager), [Adi Asija](https://arxiv.org/search/?searchtype=author&query=Adi Asija), [Daniel Khashabi](https://arxiv.org/search/?searchtype=author&query=Daniel Khashabi), [Nicholas Andrews](https://arxiv.org/search/?searchtype=author&query=Nicholas Andrews) 作者:Andrew Wang、Sophia Hager、Adi Asija、Daniel Khashabi、Nicholas Andrews
As language model agents are applied to real world problems of increasing complexity, they will be expected to formulate plans across large search spaces. If those plans fail for reasons beyond their control, how well do language agents search for alternative ways to achieve their goals? We devise a specialized agentic planning benchmark to study this question. Each planning problem is solved via combinations of function calls. The agent searches for relevant functions from a set of over four thousand possibilities, and observes environmental feedback in the form of function outputs or error messages. Our benchmark confronts the agent with external failures in its workflow, such as functions that suddenly become unavailable. At the same time, even with the introduction of these failures, we guarantee that the task remains solvable. Ideally, an agent’s performance on the planning task should not be affected by the presence of external failures. Overall, we find that language agents struggle to formulate and execute backup plans in response to environment feedback. While state-of-the-art models are often able to identify the correct function to use in the right context, they struggle to adapt to feedback from the environment and often fail to pursue alternate courses of action, even when the search space is artificially restricted. We provide a systematic analysis of the failures of both open-source and commercial models, examining the effects of search space size, as well as the benefits of scaling model size in our setting. Our analysis identifies key challenges for current generative models as well as promising directions for future work. 随着语言模型代理被用于日益复杂的现实问题,它们将被期望在大规模搜索空间中制定计划。如果这些计划因超出其控制范围的原因而失败,语言代理在多大程度上能够搜索替代方式来实现目标?我们设计了一个专门的代理式规划基准来研究这个问题。每个规划问题通过函数调用的组合来解决。代理从四千多个可能的函数中搜索相关函数,并以函数输出或错误信息的形式观察环境反馈。我们的基准让代理在其工作流程中面对外部故障,例如某些函数突然不可用。与此同时,即便引入这些故障,我们也保证任务仍然可解。理想情况下,代理在规划任务上的表现不应受到外部故障存在的影响。总体而言,我们发现语言代理在根据环境反馈制定并执行备选计划方面存在困难。 尽管最先进的模型常能在合适的上下文中识别出应使用的正确函数,但它们难以适应来自环境的反馈,且经常无法采取替代的行动路线,即便搜索空间被人为限制。我们对开源和商业模型的失败进行了系统分析,考察了搜索空间大小的影响以及在我们的设置中扩展模型规模的益处。我们的分析识别了当前生成模型的关键挑战,并指出了未来研究的有希望方向。
Subject: Computation and Language 主题:Computation and Language
Publish: 2025-08-14 19:21:09 UTC 发布:2025-08-14 19:21:09 UTC
#37 Beyond the Rosetta Stone: Unification Forces in Generalization Dynamics #37 超越罗塞塔石:泛化动力学中的统一力
Authors: [Carter Blum](https://arxiv.org/search/?searchtype=author&query=Carter Blum), [Katja Filipova](https://arxiv.org/search/?searchtype=author&query=Katja Filipova), [Ann Yuan](https://arxiv.org/search/?searchtype=author&query=Ann Yuan), [Asma Ghandeharioun](https://arxiv.org/search/?searchtype=author&query=Asma Ghandeharioun), [Julian Zimmert](https://arxiv.org/search/?searchtype=author&query=Julian Zimmert), [Fred Zhang](https://arxiv.org/search/?searchtype=author&query=Fred Zhang), [Jessica Hoffmann](https://arxiv.org/search/?searchtype=author&query=Jessica Hoffmann), [Tal Linzen](https://arxiv.org/search/?searchtype=author&query=Tal Linzen), [Martin Wattenberg](https://arxiv.org/search/?searchtype=author&query=Martin Wattenberg), [Lucas Dixon](https://arxiv.org/search/?searchtype=author&query=Lucas Dixon), [Mor Geva](https://arxiv.org/search/?searchtype=author&query=Mor Geva) 作者:Carter Blum、Katja Filipova、Ann Yuan、Asma Ghandeharioun、Julian Zimmert、Fred Zhang、Jessica Hoffmann、Tal Linzen、Martin Wattenberg、Lucas Dixon、Mor Geva
Large language models (LLMs) struggle with cross-lingual knowledge transfer: they hallucinate when asked in one language about facts expressed in a different language during training. This work introduces a controlled setting to study the causes and dynamics of this phenomenon by training small Transformer models from scratch on synthetic multilingual datasets. We identify a learning phase wherein a model develops either separate or unified representations of the same facts across languages, and show that unification is essential for cross-lingual transfer. We also show that the degree of unification depends on mutual information between facts and training data language, and on how easy it is to extract that language. Based on these insights, we develop methods to modulate the level of cross-lingual transfer by manipulating data distribution and tokenization, and we introduce metrics and visualizations to formally characterize their effects on unification. Our work shows how controlled settings can shed light on pre-training dynamics and suggests new directions for improving cross-lingual transfer in LLMs. 大型语言模型(LLMs)在跨语言知识迁移方面表现不佳:当用一种语言询问在训练中以另一种语言表达的事实时,它们会产生幻觉。本文通过在合成多语言数据集上从头训练小型 Transformer 模型,引入了一个可控设定来研究这一现象的成因和动态。我们识别出一个学习阶段,在该阶段模型会对相同事实在不同语言中形成要么分离要么统一的表示,并证明了表示统一对于跨语言迁移是必要的。我们还表明,统一程度取决于事实与训练数据语言之间的互信息,以及提取该语言的难易程度。基于这些洞见,我们开发了通过操控数据分布和分词来调节跨语言迁移水平的方法,并引入了衡量与可视化工具以形式化地描述它们对表示统一的影响。我们的工作展示了可控设定如何揭示预训练动态,并为改进 LLMs 的跨语言迁移提出了新的方向。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-14 18:44:13 UTC 发布:2025-08-14 18:44:13 UTC
#38 SproutBench: A Benchmark for Safe and Ethical Large Language Models for Youth #38 SproutBench:面向青少年的安全与伦理大语言模型基准测试
Authors: [Wenpeng Xing](https://arxiv.org/search/?searchtype=author&query=Wenpeng Xing), [Lanyi Wei](https://arxiv.org/search/?searchtype=author&query=Lanyi Wei), [Haixiao Hu](https://arxiv.org/search/?searchtype=author&query=Haixiao Hu), [Rongchang Li](https://arxiv.org/search/?searchtype=author&query=Rongchang Li), [Mohan Li](https://arxiv.org/search/?searchtype=author&query=Mohan Li), [Changting Lin](https://arxiv.org/search/?searchtype=author&query=Changting Lin), [Meng Han](https://arxiv.org/search/?searchtype=author&query=Meng Han) 作者:邢文鹏、魏兰艺、胡海啸、李荣昌、李莫涵、林长汀、韩萌
The rapid proliferation of large language models (LLMs) in applications targeting children and adolescents necessitates a fundamental reassessment of prevailing AI safety frameworks, which are largely tailored to adult users and neglect the distinct developmental vulnerabilities of minors. This paper highlights key deficiencies in existing LLM safety benchmarks, including their inadequate coverage of age-specific cognitive, emotional, and social risks spanning early childhood (ages 0–6), middle childhood (7–12), and adolescence (13–18). To bridge these gaps, we introduce SproutBench, an innovative evaluation suite comprising 1,283 developmentally grounded adversarial prompts designed to probe risks such as emotional dependency, privacy violations, and imitation of hazardous behaviors. Through rigorous empirical evaluation of 47 diverse LLMs, we uncover substantial safety vulnerabilities, corroborated by robust inter-dimensional correlations (e.g., between Safety and Risk Prevention) and a notable inverse relationship between Interactivity and Age Appropriateness. These insights yield practical guidelines for advancing child-centric AI design and deployment. 面向儿童和青少年的应用中大规模语言模型(LLMs)的迅速普及,要求对现有以成人为主的人工智能安全框架进行根本性重新评估,因为这些框架忽视了未成年人的独特发展脆弱性。本文指出了现有 LLM 安全基准的主要缺陷,包括它们对覆盖不同年龄阶段(早期儿童期(0–6 岁)、中期儿童期(7–12 岁)和青春期(13–18 岁))的认知、情感和社会风险的不足。为弥补这些空白,我们提出了 SproutBench,一套创新的评估工具包,包含 1,283 条以发展阶段为基础的对抗性提示,用于探测情感依赖、隐私侵犯和模仿危险行为等风险。通过对 47 种不同 LLM 进行严格的实证评估,我们发现了显著的安全漏洞,这些漏洞通过稳健的维度间相关性得到证实(例如“安全性”与“风险预防”之间的相关性),并且“交互性”与“年龄适宜性”之间存在显著的负相关关系。这些见解为推进以儿童为中心的人工智能设计与部署提供了切实可行的指导。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-14 18:21:39 UTC 发布日期:2025-08-14 18:21:39 UTC
#39 Improving Text Style Transfer using Masked Diffusion Language Models with Inference-time Scaling #39 使用带推理时缩放的掩码扩散语言模型改进文本风格迁移
Authors: [Tejomay Kishor Padole](https://arxiv.org/search/?searchtype=author&query=Tejomay Kishor Padole), [Suyash P Awate](https://arxiv.org/search/?searchtype=author&query=Suyash P Awate), [Pushpak Bhattacharyya](https://arxiv.org/search/?searchtype=author&query=Pushpak Bhattacharyya) 作者:Tejomay Kishor Padole、Suyash P Awate、Pushpak Bhattacharyya
Masked diffusion language models (MDMs) have recently gained traction as a viable generative framework for natural language. This can be attributed to their scalability and ease of training compared to other diffusion model paradigms for discrete data, establishing MDMs as the state-of-the-art non-autoregressive generators for discrete data. Diffusion models, in general, have shown excellent ability to improve the generation quality by leveraging inference-time scaling either by increasing the number of denoising steps or by using external verifiers on top of the outputs of each step to guide the generation. In this work, we propose a verifier-based inference-time scaling method that aids in finding a better candidate generation during the denoising process of the MDM. Our experiments demonstrate the application of MDMs for standard text-style transfer tasks and establish MDMs as a better alternative to autoregressive language models. Additionally, we show that a simple soft-value-based verifier setup for MDMs using off-the-shelf pre-trained embedding models leads to significant gains in generation quality even when used on top of typical classifier-free guidance setups in the existing literature. 掩码扩散语言模型(MDMs)近年来作为一种可行的自然语言生成框架获得了关注。这可以归因于与其他处理离散数据的扩散模型范式相比,其可扩展性和训练简便性,使其成为处理离散数据的最先进非自回归生成器。一般而言,扩散模型通过在推理时扩展(要么增加去噪步骤数,要么在每一步输出之上使用外部验证器来引导生成)显示出显著提升生成质量的能力。在本工作中,我们提出了一种基于验证器的推理时扩展方法,帮助在 MDM 的去噪过程中找到更好的候选生成。我们的实验展示了 MDMs 在标准文本风格迁移任务中的应用,并确立了 MDMs 作为比自回归语言模型更好的替代方案。 此外,我们展示了对 MDMs 使用现成预训练嵌入模型的一种简单的基于软值的验证器设置,即便叠加在现有文献中常见的无分类器引导(classifier-free guidance)配置之上,也能显著提升生成质量。
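A minimal sketch of a soft-value verifier of the kind described, assuming an off-the-shelf sentence-embedding model re-scores candidate generations against a style exemplar at each denoising round; the MDM itself is abstracted away as a candidate list, and the checkpoint name is an assumption.

```python
# Soft-value verifier: rank candidate continuations by embedding similarity
# to a style exemplar, softened by a temperature. The embedding checkpoint
# is an assumed stand-in, not the paper's verifier.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

def soft_value_select(candidates, style_exemplar, temperature=0.1):
    vecs = embedder.encode(candidates + [style_exemplar], normalize_embeddings=True)
    sims = vecs[:-1] @ vecs[-1]                  # cosine similarity to exemplar
    probs = np.exp(sims / temperature)
    probs /= probs.sum()                         # soft values over candidates
    return candidates[int(np.argmax(probs))], probs

cands = ["Thou art most kind.", "You are pretty nice.", "kind you are, yes"]
best, p = soft_value_select(cands, "Shakespearean English, archaic diction")
print(best, p.round(3))
```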
Subjects: Computation and Language, Machine Learning 主题:计算与语言,机器学习
Publish: 2025-08-14 18:01:22 UTC 发布:2025-08-14 18:01:22 UTC
#40 Rule2Text: A Framework for Generating and Evaluating Natural Language Explanations of Knowledge Graph Rules #40 Rule2Text:用于生成和评估知识图规则自然语言解释的框架
Authors: [Nasim Shirvani-Mahdavi](https://arxiv.org/search/?searchtype=author&query=Nasim Shirvani-Mahdavi), [Chengkai Li](https://arxiv.org/search/?searchtype=author&query=Chengkai Li) 作者:Nasim Shirvani-Mahdavi, Chengkai Li
Knowledge graphs (KGs) can be enhanced through rule mining; however, the resulting logical rules are often difficult for humans to interpret due to their inherent complexity and the idiosyncratic labeling conventions of individual KGs. This work presents Rule2Text, a comprehensive framework that leverages large language models (LLMs) to generate natural language explanations for mined logical rules, thereby improving KG accessibility and usability. We conduct extensive experiments using multiple datasets, including Freebase variants (FB-CVT-REV, FB+CVT-REV, and FB15k-237) as well as the ogbl-biokg dataset, with rules mined using AMIE 3.5.1. We systematically evaluate several LLMs across a comprehensive range of prompting strategies, including zero-shot, few-shot, variable type incorporation, and Chain-of-Thought reasoning. To systematically assess models’ performance, we conduct a human evaluation of generated explanations on correctness and clarity. To address evaluation scalability, we develop and validate an LLM-as-a-judge framework that demonstrates strong agreement with human evaluators. Leveraging the best-performing model (Gemini 2.0 Flash), LLM judge, and human-in-the-loop feedback, we construct high-quality ground truth datasets, which we use to fine-tune the open-source Zephyr model. Our results demonstrate significant improvements in explanation quality after fine-tuning, with particularly strong gains in the domain-specific dataset. Additionally, we integrate a type inference module to support KGs lacking explicit type information. All code and data are publicly available at https://github.com/idirlab/KGRule2NL. 知识图谱(KGs)可以通过规则挖掘得到增强;然而,由于其固有的复杂性以及各个知识图谱独特的标注习惯,挖掘出的逻辑规则通常难以为人所理解。本文提出了 Rule2Text,一个综合性框架,利用大型语言模型(LLMs)为挖掘出的逻辑规则生成自然语言解释,从而提高知识图谱的可访问性和可用性。我们使用多个数据集进行了广泛实验,包括 Freebase 的变体(FB-CVT-REV、FB+CVT-REV 和 FB15k-237)以及 ogbl-biokg 数据集,规则由 AMIE 3.5.1 挖掘。我们系统地评估了多种 LLM,在一系列全面的提示策略下进行比较,包括零样本、少样本、变量类型纳入以及链式思维(Chain-of-Thought)推理。为系统性地评估模型表现,我们对生成的解释在正确性和清晰度上进行了人工评估。为了解决评估可扩展性问题,我们开发并验证了一个“将 LLM 作为裁判”的框架,该框架与人工评估者表现出高度一致性。 我们利用表现最优的模型(Gemini 2.0 Flash)、LLM 判定器以及人工反馈,构建了高质量的真实标注数据集,并用它们对开源 Zephyr 模型进行微调。我们的结果表明,微调后在解释质量上有显著提升,在领域特定数据集上增益尤为明显。此外,我们集成了类型推断模块以支持缺乏显式类型信息的知识图谱。所有代码和数据均可在 https://github.com/idirlab/KGRule2NL 公开获取。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-14 16:41:47 UTC 发布:2025-08-14 16:41:47 协调世界时(UTC)
#41 Modeling and Detecting Company Risks from News: A Case Study in Bloomberg News #41 从新闻中建模和检测公司风险:彭博新闻的案例研究
Authors: [Jiaxin Pei](https://arxiv.org/search/?searchtype=author&query=Jiaxin Pei), [Soumya Vadlamannati](https://arxiv.org/search/?searchtype=author&query=Soumya Vadlamannati), [Liang-Kang Huang](https://arxiv.org/search/?searchtype=author&query=Liang-Kang Huang), [Daniel Preotiuc-Pietro](https://arxiv.org/search/?searchtype=author&query=Daniel Preotiuc-Pietro), [Xinyu Hua](https://arxiv.org/search/?searchtype=author&query=Xinyu Hua) 作者:裴佳欣,Soumya Vadlamannati,梁康恒,Daniel Preotiuc-Pietro,华新宇
Identifying risks associated with a company is important to investors and the well-being of the overall financial market. In this study, we build a computational framework to automatically extract company risk factors from news articles. Our newly proposed schema comprises seven distinct aspects, such as supply chain, regulations, and competitions. We sample and annotate 744 news articles and benchmark various machine learning models. While large language models have achieved huge progress in various types of NLP tasks, our experiment shows that zero-shot and few-shot prompting of state-of-the-art LLMs (e.g. LLaMA-2) achieves only moderate to low performance in identifying risk factors. Fine-tuned pre-trained language models, by contrast, perform better on most of the risk factors. Using this model, we analyze over 277K Bloomberg news articles and demonstrate that identifying risk factors from news could provide extensive insight into the operations of companies and industries. 识别与公司相关的风险对投资者和整个金融市场的福祉都很重要。在本研究中,我们构建了一个计算框架,用于从新闻文章中自动提取公司风险因素。我们新提出的方案包含七个不同的方面,例如供应链、监管和竞争。我们抽样并注释了 744 篇新闻文章,并对各种机器学习模型进行了基准测试。尽管大型语言模型在各种类型的 NLP 任务上取得了巨大进展,我们的实验表明,零样本和少样本提示的最先进 LLMs(例如 LLaMA-2)在识别风险因素方面只能达到中等到较低的性能。而经过微调的预训练语言模型在大多数风险因素上表现更好。利用该模型,我们分析了超过 277K 篇彭博社新闻文章,并证明从新闻中识别风险因素可以为公司和行业运营提供广泛的见解。
Subjects: Computation and Language, Artificial Intelligence, Computational Engineering, Finance, and Science, Machine Learning
Publish: 2025-08-10 22:44:10 UTC 发布时间:2025-08-10 22:44:10 UTC
#42 gpt-oss-120b & gpt-oss-20b Model Card #42 gpt-oss-120b 与 gpt-oss-20b 模型卡
Authors: OpenAI: [Sandhini Agarwal](https://arxiv.org/search/?searchtype=author&query=Sandhini Agarwal), [Lama Ahmad](https://arxiv.org/search/?searchtype=author&query=Lama Ahmad), [Jason Ai](https://arxiv.org/search/?searchtype=author&query=Jason Ai), [Sam Altman](https://arxiv.org/search/?searchtype=author&query=Sam Altman), [Andy Applebaum](https://arxiv.org/search/?searchtype=author&query=Andy Applebaum), [Edwin Arbus](https://arxiv.org/search/?searchtype=author&query=Edwin Arbus), [Rahul K. Arora](https://arxiv.org/search/?searchtype=author&query=Rahul K. Arora), [Yu Bai](https://arxiv.org/search/?searchtype=author&query=Yu Bai), [Bowen Baker](https://arxiv.org/search/?searchtype=author&query=Bowen Baker), [Haiming Bao](https://arxiv.org/search/?searchtype=author&query=Haiming Bao), [Boaz Barak](https://arxiv.org/search/?searchtype=author&query=Boaz Barak), [Ally Bennett](https://arxiv.org/search/?searchtype=author&query=Ally Bennett), [Tyler Bertao](https://arxiv.org/search/?searchtype=author&query=Tyler Bertao), [Nivedita Brett](https://arxiv.org/search/?searchtype=author&query=Nivedita Brett), [Eugene Brevdo](https://arxiv.org/search/?searchtype=author&query=Eugene Brevdo), [Greg Brockman](https://arxiv.org/search/?searchtype=author&query=Greg Brockman), [Sebastien Bubeck](https://arxiv.org/search/?searchtype=author&query=Sebastien Bubeck), [Che Chang](https://arxiv.org/search/?searchtype=author&query=Che Chang), [Kai Chen](https://arxiv.org/search/?searchtype=author&query=Kai Chen), [Mark Chen](https://arxiv.org/search/?searchtype=author&query=Mark Chen), [Enoch Cheung](https://arxiv.org/search/?searchtype=author&query=Enoch Cheung), [Aidan Clark](https://arxiv.org/search/?searchtype=author&query=Aidan Clark), [Dan Cook](https://arxiv.org/search/?searchtype=author&query=Dan Cook), [Marat Dukhan](https://arxiv.org/search/?searchtype=author&query=Marat Dukhan), [Casey Dvorak](https://arxiv.org/search/?searchtype=author&query=Casey Dvorak), [Kevin Fives](https://arxiv.org/search/?searchtype=author&query=Kevin Fives), [Vlad Fomenko](https://arxiv.org/search/?searchtype=author&query=Vlad Fomenko), [Timur Garipov](https://arxiv.org/search/?searchtype=author&query=Timur Garipov) et al. (97 additional authors not shown) 作者:OpenAI:Sandhini Agarwal、Lama Ahmad、Jason Ai、Sam Altman、Andy Applebaum、Edwin Arbus、Rahul K. Arora、Yu Bai、Bowen Baker、Haiming Bao、Boaz Barak、Ally Bennett、Tyler Bertao、Nivedita Brett、Eugene Brevdo、Greg Brockman、Sebastien Bubeck、Che Chang、Kai Chen、Mark Chen、Enoch Cheung、Aidan Clark、Dan Cook、Marat Dukhan、Casey Dvorak、Kevin Fives、Vlad Fomenko、Timur Garipov 等(另有 97 位作者未列出)
We present gpt-oss-120b and gpt-oss-20b, two open-weight reasoning models that push the frontier of accuracy and inference cost. The models use an efficient mixture-of-experts transformer architecture and are trained using large-scale distillation and reinforcement learning. We optimize the models to have strong agentic capabilities (deep research browsing, python tool use, and support for developer-provided functions), all while using a rendered chat format that enables clear instruction following and role delineation. Both models achieve strong results on benchmarks spanning mathematics, coding, and safety. We release the model weights, inference implementations, tool environments, and tokenizers under an Apache 2.0 license to enable broad use and further research. 我们推出了 gpt-oss-120b 和 gpt-oss-20b 两款开源权重的推理模型,推动了准确性和推理成本的前沿。这些模型采用高效的专家混合(mixture-of-experts)Transformer 架构,并通过大规模蒸馏与强化学习进行训练。我们对模型进行了优化,使其具备强大的智能体能力(深度研究浏览、Python 工具使用以及对开发者提供函数的支持),同时采用渲染的聊天格式以实现清晰的指令遵循和角色划分。这两款模型在涵盖数学、编码和安全等领域的基准测试中均取得了优异成绩。我们在 Apache 2.0 许可证下发布了模型权重、推理实现、工具环境和分词器,以促进广泛使用和进一步研究。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-08 19:24:38 UTC 发布:2025-08-08 19:24:38 UTC
#43 PersonaTwin: A Multi-Tier Prompt Conditioning Framework for Generating and Evaluating Personalized Digital Twins #43 PersonaTwin:一种用于生成和评估个性化数字孪生的多层提示条件框架
Authors: [Sihan Chen](https://arxiv.org/search/?searchtype=author&query=Sihan Chen), [John P. Lalor](https://arxiv.org/search/?searchtype=author&query=John P. Lalor), [Yi Yang](https://arxiv.org/search/?searchtype=author&query=Yi Yang), [Ahmed Abbasi](https://arxiv.org/search/?searchtype=author&query=Ahmed Abbasi) 作者:Sihan Chen,John P. Lalor,Yi Yang,Ahmed Abbasi
While large language models (LLMs) afford new possibilities for user modeling and approximation of human behaviors, they often fail to capture the multidimensional nuances of individual users. In this work, we introduce PersonaTwin, a multi-tier prompt conditioning framework that builds adaptive digital twins by integrating demographic, behavioral, and psychometric data. Using a comprehensive data set in the healthcare context of more than 8,500 individuals, we systematically benchmark PersonaTwin against standard LLM outputs, and our rigorous evaluation unites state-of-the-art text similarity metrics with dedicated demographic parity assessments, ensuring that generated responses remain accurate and unbiased. Experimental results show that our framework produces simulation fidelity on par with oracle settings. Moreover, downstream models trained on persona-twins approximate models trained on individuals in terms of prediction and fairness metrics across both GPT-4o-based and Llama-based models. Together, these findings underscore the potential for LLM digital twin-based approaches in producing realistic and emotionally nuanced user simulations, offering a powerful tool for personalized digital user modeling and behavior analysis. 虽然大规模语言模型(LLMs)为用户建模和模拟人类行为提供了新可能,但它们常常无法捕捉个体用户的多维细微差别。在本研究中,我们提出了 PersonaTwin,这是一种多层提示条件化框架,通过整合人口统计学、行为学和心理测量数据来构建可自适应的数字孪生。利用在医疗保健情境下超过 8500 名个体的综合数据集,我们系统地将 PersonaTwin 与标准 LLM 输出进行基准测试,并将最先进的文本相似度度量与专门的人口统计学公平性评估结合进行严格评估,确保生成的响应既准确又无偏。实验结果表明,我们的框架在模拟保真度方面可与 oracle 设置相媲美。此外,在基于 GPT-4o 和基于 Llama 的模型上,用 persona-twins 训练的下游模型在预测和公平性指标方面近似于用个体数据训练的模型。 综上所述,这些发现强调了基于 LLM 数字孪生的方法在生成真实且具有情感细微差别的用户模拟方面的潜力,为个性化数字用户建模和行为分析提供了强有力的工具。
Subject: Computation and Language 主题:Computation and Language
Publish: 2025-07-30 04:57:30 UTC 发布:2025-07-30 04:57:30 协调世界时
#44 A2HCoder: An LLM-Driven Coding Agent for Hierarchical Algorithm-to-HDL Translation #44 A2HCoder:一个用于分层算法到 HDL 翻译的 LLM 驱动编码代理
Authors: [Jie Lei](https://arxiv.org/search/?searchtype=author&query=Jie Lei), [Ruofan Jia](https://arxiv.org/search/?searchtype=author&query=Ruofan Jia), [J. Andrew Zhang](https://arxiv.org/search/?searchtype=author&query=J. Andrew Zhang), [Hao Zhang](https://arxiv.org/search/?searchtype=author&query=Hao Zhang) 作者:Jie Lei、Ruofan Jia、J. Andrew Zhang、Hao Zhang
In wireless communication systems, stringent requirements such as ultra-low latency and power consumption have significantly increased the demand for efficient algorithm-to-hardware deployment. However, a persistent and substantial gap remains between algorithm design and hardware implementation. Bridging this gap traditionally requires extensive domain expertise and time-consuming manual development, due to fundamental mismatches between high-level programming languages like MATLAB and hardware description languages (HDLs) such as Verilog, in terms of memory access patterns, data processing manners, and datatype representations. To address this challenge, we propose A2HCoder: a Hierarchical Algorithm-to-HDL Coding Agent, powered by large language models (LLMs), designed to enable agile and reliable algorithm-to-hardware translation. A2HCoder introduces a hierarchical framework that enhances both robustness and interpretability while suppressing common hallucination issues in LLM-generated code. In the horizontal dimension, A2HCoder decomposes complex algorithms into modular functional blocks, simplifying code generation and improving consistency. In the vertical dimension, instead of relying on end-to-end generation, A2HCoder performs step-by-step, fine-grained translation, leveraging external toolchains such as MATLAB and Vitis HLS for debugging and circuit-level synthesis. This structured process significantly mitigates hallucinations and ensures hardware-level correctness. We validate A2HCoder through a real-world deployment case in the 5G wireless communication domain, demonstrating its practicality, reliability, and deployment efficiency. 在无线通信系统中,诸如超低延迟和功耗等严格要求显著增加了对高效算法到硬件部署的需求。然而,算法设计与硬件实现之间仍存在持久且显著的差距。弥合这一差距传统上需要大量领域专长和耗时的手工开发,原因在于高级编程语言(如 MATLAB)与硬件描述语言(HDL)如 Verilog 在内存访问模式、数据处理方式和数据类型表示等方面存在根本性不匹配。为了解决这一挑战,我们提出了 A2HCoder:一个由大型语言模型(LLMs)驱动的分层算法到 HDL 编码代理,旨在实现敏捷且可靠的算法到硬件的转换。A2HCoder 引入了一个分层框架,既增强了稳健性和可解释性,又抑制了 LLM 生成代码中常见的幻觉问题。在横向维度上,A2HCoder 将复杂算法分解为模块化的功能块,简化了代码生成并提高了一致性。 在纵向维度上,A2HCoder 不依赖端到端生成,而是执行逐步的、细粒度的翻译,利用 MATLAB 和 Vitis HLS 等外部工具链进行调试和电路级综合。该结构化流程显著减少了幻觉现象并确保硬件层面的正确性。我们通过在 5G 无线通信领域的真实部署案例对 A2HCoder 进行了验证,展示了其可行性、可靠性和部署效率。
Subjects: Computation and Language, Hardware Architecture, Programming Languages 学科:计算与语言、硬件架构、编程语言
Publish: 2025-07-29 01:51:12 UTC 发布:2025-07-29 01:51:12 UTC
#45 Controlling Multimodal LLMs via Reward-guided Decoding #45 通过奖励引导解码控制多模态 LLMs
Authors: [Oscar Mañas](https://arxiv.org/search/?searchtype=author&query=Oscar Mañas), [Pierluca D’Oro](https://arxiv.org/search/?searchtype=author&query=Pierluca D’Oro), [Koustuv Sinha](https://arxiv.org/search/?searchtype=author&query=Koustuv Sinha), [Adriana Romero-Soriano](https://arxiv.org/search/?searchtype=author&query=Adriana Romero-Soriano), [Michal Drozdzal](https://arxiv.org/search/?searchtype=author&query=Michal Drozdzal), [Aishwarya Agrawal](https://arxiv.org/search/?searchtype=author&query=Aishwarya Agrawal) 作者:Oscar Mañas、Pierluca D’Oro、Koustuv Sinha、Adriana Romero-Soriano、Michal Drozdzal、Aishwarya Agrawal
As Multimodal Large Language Models (MLLMs) gain widespread applicability, it is becoming increasingly desirable to adapt them for diverse user needs. In this paper, we study the adaptation of MLLMs through controlled decoding. To achieve this, we introduce the first method for reward-guided decoding of MLLMs and demonstrate its application in improving their visual grounding. Our method involves building reward models for visual grounding and using them to guide the MLLM’s decoding process. Concretely, we build two separate reward models to independently control the degree of object precision and recall in the model’s output. Our approach enables on-the-fly controllability of an MLLM’s inference process in two ways: first, by giving control over the relative importance of each reward function during decoding, allowing a user to dynamically trade off object precision for recall in image captioning tasks; second, by giving control over the breadth of the search during decoding, allowing the user to control the trade-off between the amount of test-time compute and the degree of visual grounding. We evaluate our method on standard object hallucination benchmarks, showing that it provides significant controllability over MLLM inference, while consistently outperforming existing hallucination mitigation methods. 随着多模态大语言模型(MLLMs)应用日益广泛,针对不同用户需求对其进行适配的需求也越来越强烈。本文研究了通过受控解码来适配 MLLMs。为实现这一目标,我们提出了首个针对 MLLMs 的基于奖励的引导解码方法,并展示了其在提升视觉定位能力方面的应用。我们的方法包括为视觉定位构建奖励模型,并利用这些模型来引导 MLLM 的解码过程。具体而言,我们构建了两个独立的奖励模型,分别用于控制模型输出中目标对象的精确度和召回率。我们的方法在两方面使 MLLM 的推理过程具备即时可控性:首先,在解码过程中可控制各奖励函数的相对重要性,使用户能够在图像描述任务中动态地在目标精确度和召回率之间权衡;其次,可控制解码时搜索的广度,使用户能够在测试时计算量与视觉定位程度之间进行权衡。 我们在标准的对象幻觉基准上评估了我们的方法,结果表明它在多模态大型语言模型推理中提供了显著的可控性,同时在持续性能上优于现有的幻觉缓解方法。
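One reward-guided decoding step with two independently weighted reward models might look like the sketch below; the scorers and candidate beams are toys chosen to mimic the precision/recall trade-off, not the paper's reward models.

```python
# One reward-guided decoding step: combine the language-model log-probability
# with two user-weighted reward scores and keep the top beams. All scorers
# are placeholders illustrating the precision/recall trade-off.
def guided_step(candidates, lm_logprob, r_precision, r_recall,
                w_p=1.0, w_r=1.0, topk=2):
    scored = []
    for text in candidates:
        s = lm_logprob(text) + w_p * r_precision(text) + w_r * r_recall(text)
        scored.append((s, text))
    scored.sort(reverse=True)
    return [t for _, t in scored[:topk]]   # beams kept for the next step

# Toy scorers: precision dislikes extra objects, recall likes more objects.
objects = lambda t: t.count(",") + 1
lm = lambda t: -0.1 * len(t)
prec = lambda t: -0.5 * max(0, objects(t) - 2)
rec = lambda t: 0.4 * objects(t)

beams = ["a dog", "a dog, a ball", "a dog, a ball, a hat, a cat"]
print(guided_step(beams, lm, prec, rec, w_p=2.0, w_r=0.5))
```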
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning 主题:计算机视觉与模式识别、人工智能、计算与语言、机器学习
Publish: 2025-08-15 17:29:06 UTC 发布:2025-08-15 17:29:06 UTC
#46 Emphasis Sensitivity in Speech Representations #46 语音表示中的重音敏感性
Authors: [Shaun Cassini](https://arxiv.org/search/?searchtype=author&query=Shaun Cassini), [Thomas Hain](https://arxiv.org/search/?searchtype=author&query=Thomas Hain), [Anton Ragni](https://arxiv.org/search/?searchtype=author&query=Anton Ragni) 作者:Shaun Cassini, Thomas Hain, Anton Ragni
This work investigates whether modern speech models are sensitive to prosodic emphasis - whether they encode emphasized and neutral words in systematically different ways. Prior work typically relies on isolated acoustic correlates (e.g., pitch, duration) or label prediction, both of which miss the relational structure of emphasis. This paper proposes a residual-based framework, defining emphasis as the difference between paired neutral and emphasized word representations. Analysis on self-supervised speech models shows that these residuals correlate strongly with duration changes and perform poorly at word identity prediction, indicating a structured, relational encoding of prosodic emphasis. In ASR fine-tuned models, residuals occupy a subspace up to 50% more compact than in pre-trained models, further suggesting that emphasis is encoded as a consistent, low-dimensional transformation that becomes more structured with task-specific learning. 本研究探讨现代语音模型是否对韵律重音敏感——即它们是否以系统性不同的方式编码被强调词与中性词。以往工作通常依赖孤立的声学相关特征(例如基频、时长)或标签预测,这两者都忽略了重音的关系结构。本文提出了一种基于残差的框架,将重音定义为成对的中性词与强调词表示之间的差异。在自监督语音模型上的分析表明,这些残差与时长变化高度相关且在词身份预测上表现不佳,表明韵律重音以一种结构化的、关系性的方式被编码。在经过 ASR 微调的模型中,残差所占的子空间比预训练模型紧凑多达 50%,进一步暗示重音被编码为一种一致的低维变换,并且随着任务特定学习变得更具结构性。
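The residual definition is simple enough to demo on synthetic vectors: subtract the neutral representation from the emphasized one and check that the residual tracks duration change. Real inputs would come from a self-supervised speech encoder; the generative story below is an assumption made for the demo.

```python
# The paper's residual on toy vectors: residual = emphasized - neutral,
# then correlated with a duration change constructed to drive the shift.
import numpy as np

rng = np.random.default_rng(0)
neutral = rng.normal(size=(100, 16))      # 100 words, 16-dim features
duration_change = rng.normal(size=100)    # per-word lengthening
direction = rng.normal(size=16)           # assumed emphasis direction
emphasized = neutral + np.outer(duration_change, direction)

residuals = emphasized - neutral          # the residual definition
proj = residuals @ direction              # project onto the emphasis axis
r = np.corrcoef(proj, duration_change)[0, 1]
print(f"correlation with duration change: {r:.3f}")  # ~1.0 by construction
```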
Subjects: Audio and Speech Processing, Computation and Language 主题:音频与语音处理,计算与语言
Publish: 2025-08-15 16:18:47 UTC 发布:2025-08-15 16:18:47 UTC
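A toy sketch of the residual analysis described in this abstract, with synthetic arrays standing in for paired neutral/emphasized word representations: the residual is their difference, its norm can be correlated with duration change, and subspace compactness can be read off the SVD spectrum.

```python
import numpy as np

# Synthetic stand-ins: (n_pairs, dim) paired word representations and a
# per-pair duration change. Real inputs would come from a speech model.
rng = np.random.default_rng(0)
neutral = rng.normal(size=(100, 64))
emphasized = neutral + rng.normal(scale=0.1, size=(100, 64))
duration_delta = rng.normal(size=100)

residuals = emphasized - neutral  # emphasis as a relational quantity

# Correlation of residual magnitude with duration change (one acoustic correlate).
r = np.corrcoef(np.linalg.norm(residuals, axis=1), duration_delta)[0, 1]

# Compactness of the residual subspace: how few principal directions explain
# 90% of the variance (fewer = a more structured, lower-dimensional encoding).
_, s, _ = np.linalg.svd(residuals - residuals.mean(0), full_matrices=False)
explained = np.cumsum(s**2) / np.sum(s**2)
n_components_90 = int(np.searchsorted(explained, 0.90) + 1)
print(r, n_components_90)
```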
#47 Inclusion Arena: An Open Platform for Evaluating Large Foundation Models with Real-World Apps #47 包容竞技场:一个用于用真实应用评估大型基础模型的开放平台
Authors: [Kangyu Wang](https://arxiv.org/search/?searchtype=author&query=Kangyu Wang), [Hongliang He](https://arxiv.org/search/?searchtype=author&query=Hongliang He), [Lin Liu](https://arxiv.org/search/?searchtype=author&query=Lin Liu), [Ruiqi Liang](https://arxiv.org/search/?searchtype=author&query=Ruiqi Liang), [Zhenzhong Lan](https://arxiv.org/search/?searchtype=author&query=Zhenzhong Lan), [Jianguo Li](https://arxiv.org/search/?searchtype=author&query=Jianguo Li) 作者:王康宇、何洪亮、刘琳、梁睿琦、兰振中、李建国
Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have ushered in a new era of AI capabilities, demonstrating near-human-level performance across diverse scenarios. While numerous benchmarks (e.g., MMLU) and leaderboards (e.g., Chatbot Arena) have been proposed to help evolve the development of LLMs and MLLMs, most rely on static datasets or crowdsourced general-domain prompts, often falling short of reflecting performance in real-world applications. To bridge this critical gap, we present Inclusion Arena, a live leaderboard that ranks models based on human feedback collected directly from AI-powered applications. Our platform integrates pairwise model comparisons into natural user interactions, ensuring evaluations reflect practical usage scenarios. For robust model ranking, we employ the Bradley-Terry model augmented with two key innovations: (1) Placement Matches, a cold-start mechanism to quickly estimate initial ratings for newly integrated models, and (2) Proximity Sampling, an intelligent comparison strategy that prioritizes battles between models of similar capabilities to maximize information gain and enhance rating stability. Extensive empirical analyses and simulations demonstrate that Inclusion Arena yields reliable and stable rankings, exhibits higher data transitivity compared to general crowdsourced datasets, and significantly mitigates the risk of malicious manipulation. By fostering an open alliance between foundation models and real-world applications, Inclusion Arena aims to accelerate the development of LLMs and MLLMs truly optimized for practical, user-centric deployments. The platform is publicly accessible at https://doraemon.alipay.com/model-ranking. 大型语言模型(LLMs)和多模态大型语言模型(MLLMs)引领了人工智能能力的新纪元,在多种情境下展示出接近人类水平的表现。尽管已有众多基准(例如 MMLU)和排行榜(例如 Chatbot Arena)旨在推动 LLMs 和 MLLMs 的发展,但大多数依赖静态数据集或众包的一般领域提示,往往难以反映在真实应用中的表现。为弥补这一关键差距,我们推出了 Inclusion Arena——一个基于从 AI 驱动应用中直接收集的人类反馈来对模型进行排名的实时排行榜。我们的平台将成对模型比较集成到自然的用户交互中,确保评估反映实际使用场景。为实现稳健的模型排名,我们采用了增强版 Bradley-Terry 模型,并引入两项关键创新:(1)安置赛(Placement Matches),一种冷启动机制,用于快速估算新接入模型的初始评分;(2)近邻采样(Proximity Sampling),一种智能比较策略,优先安排能力相近模型之间的对决,以最大化信息增益并提升评分稳定性。 大量实证分析和模拟表明,Inclusion Arena 能产生可靠且稳定的排名,较一般众包数据集表现出更高的数据传递性,并显著降低恶意操纵的风险。通过在基础模型和现实世界应用之间构建开放联盟,Inclusion Arena 致力于加速真正为实际、以用户为中心部署优化的 LLMs 和 MLLMs 的发展。该平台公开可访问,地址为 https://doraemon.alipay.com/model-ranking。
Subjects: Artificial Intelligence, Computation and Language, Human-Computer Interaction 主题:人工智能、计算与语言、人机交互
Publish: 2025-08-15 13:00:07 UTC 发布:2025-08-15 13:00:07 UTC
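A compact sketch of the two ranking ingredients named above, under assumed simplifications: ratings are plain Bradley-Terry strengths, and the exponential proximity weight is one plausible choice rather than the platform's actual formula.

```python
import math
import random

def bt_win_prob(r_a: float, r_b: float) -> float:
    """Bradley-Terry: P(A beats B) via a logistic link on the rating gap."""
    return 1.0 / (1.0 + math.exp(r_b - r_a))

def proximity_sample(ratings: dict, temperature: float = 1.0):
    """Pick a model pair, weighting pairs with close ratings more heavily,
    so each battle is maximally informative for the ranking."""
    models = list(ratings)
    pairs = [(a, b) for i, a in enumerate(models) for b in models[i + 1:]]
    weights = [math.exp(-abs(ratings[a] - ratings[b]) / temperature)
               for a, b in pairs]
    return random.choices(pairs, weights=weights, k=1)[0]

ratings = {"model-a": 0.2, "model-b": 0.3, "model-c": 1.5}
print(bt_win_prob(ratings["model-a"], ratings["model-b"]))
print(proximity_sample(ratings))  # usually (model-a, model-b): closest ratings
```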
#48 Generalize across Homophily and Heterophily: Hybrid Spectral Graph Pre-Training and Prompt Tuning #48 在同质性与异质性中泛化:混合谱图预训练与提示微调
Authors: [Haitong Luo](https://arxiv.org/search/?searchtype=author&query=Haitong Luo), [Suhang Wang](https://arxiv.org/search/?searchtype=author&query=Suhang Wang), [Weiyao Zhang](https://arxiv.org/search/?searchtype=author&query=Weiyao Zhang), [Ruiqi Meng](https://arxiv.org/search/?searchtype=author&query=Ruiqi Meng), [Xuying Meng](https://arxiv.org/search/?searchtype=author&query=Xuying Meng), [Yujun Zhang](https://arxiv.org/search/?searchtype=author&query=Yujun Zhang) 作者:罗海彤、王素杭、张伟尧、孟瑞琪、孟旭英、张宇骏
Graph “pre-training and prompt-tuning” aligns downstream tasks with pre-trained objectives to enable efficient knowledge transfer under limited supervision. However, existing methods rely on homophily-based low-frequency knowledge, failing to handle diverse spectral distributions in real-world graphs with varying homophily. Our theoretical analysis reveals a spectral specificity principle: optimal knowledge transfer requires alignment between pre-trained spectral filters and the intrinsic spectrum of downstream graphs. Under limited supervision, large spectral gaps between pre-training and downstream tasks impede effective adaptation. To bridge this gap, we propose the HS-GPPT model, a novel framework that ensures spectral alignment throughout both pre-training and prompt-tuning. We utilize a hybrid spectral filter backbone and local-global contrastive learning to acquire abundant spectral knowledge. Then we design prompt graphs to align the spectral distribution with pretexts, facilitating spectral knowledge transfer across homophily and heterophily. Extensive experiments validate the effectiveness under both transductive and inductive learning settings. Our code is available at https://anonymous.4open.science/r/HS-GPPT-62D2/. 图“预训练与提示微调”将下游任务与预训练目标对齐,以在有限监督下实现高效知识迁移。然而,现有方法依赖基于同质性的低频知识,无法处理在同质性变化的真实图中出现的多样谱分布。我们的理论分析揭示了一个谱特异性原则:最佳的知识迁移需要预训练谱滤波器与下游图的内在频谱之间的对齐。在有限监督下,预训练任务与下游任务之间存在的大谱间隙会阻碍有效的适配。为缩小这一差距,我们提出了 HS-GPPT 模型——一个在预训练和提示微调整个流程中确保谱对齐的新颖框架。我们利用混合谱滤波器骨干和局部-全局对比学习来获取丰富的谱知识。随后我们设计提示图以将谱分布与预训练任务对齐,促进跨同质性与异质性的谱知识迁移。大量实验验证了该方法在传导式和归纳式学习设置下的有效性。 我们的代码可在 https://anonymous.4open.science/r/HS-GPPT-62D2/ 获取。
Subjects: Machine Learning, Computation and Language 主题:机器学习,计算与语言
Publish: 2025-08-15 08:55:57 UTC 发布:2025-08-15 08:55:57 UTC
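For intuition, a minimal numpy sketch of what a hybrid spectral filter computes: a low-pass term (suited to homophilous graphs) mixed with a high-pass term (suited to heterophilous ones) on the normalized Laplacian. In HS-GPPT the mixing would be learned; here the weight is fixed and the tiny graph is hypothetical.

```python
import numpy as np

# A toy 3-node path graph and identity node features, purely for illustration.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = np.eye(3)

d = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt  # normalized Laplacian

low_pass = (np.eye(len(A)) - 0.5 * L) @ X   # smooths features over edges
high_pass = (0.5 * L) @ X                   # emphasizes differences over edges

alpha = 0.7  # learned in the actual model; fixed here for illustration
hybrid = alpha * low_pass + (1 - alpha) * high_pass
print(hybrid)
```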
#49 Group Fairness Meets the Black Box: Enabling Fair Algorithms on Closed LLMs via Post-Processing #49 群体公平遇上黑箱:通过后处理在封闭式 LLMs 上实现公平算法
Authors: [Ruicheng Xian](https://arxiv.org/search/?searchtype=author&query=Ruicheng Xian), [Yuxuan Wan](https://arxiv.org/search/?searchtype=author&query=Yuxuan Wan), [Han Zhao](https://arxiv.org/search/?searchtype=author&query=Han Zhao) 作者:Ruicheng Xian,Yuxuan Wan,Han Zhao
Instruction fine-tuned large language models (LLMs) enable a simple zero-shot or few-shot prompting paradigm, also known as in-context learning, for building prediction models. This convenience, combined with continued advances in LLM capability, has the potential to drive their adoption across a broad range of domains, including high-stakes applications where group fairness – preventing disparate impacts across demographic groups – is essential. The majority of existing approaches to enforcing group fairness on LLM-based classifiers rely on traditional fair algorithms applied via model fine-tuning or head-tuning on final-layer embeddings, but they are no longer applicable to closed-weight LLMs under the in-context learning setting, which include some of the most capable commercial models today, such as GPT-4, Gemini, and Claude. In this paper, we propose a framework for deriving fair classifiers from closed-weight LLMs via prompting: the LLM is treated as a feature extractor, and features are elicited from its probabilistic predictions (e.g., token log probabilities) using prompts strategically designed for the specified fairness criterion to obtain sufficient statistics for fair classification; a fair algorithm is then applied to these features to train a lightweight fair classifier in a post-hoc manner. Experiments on five datasets, including three tabular ones, demonstrate strong accuracy-fairness tradeoffs for the classifiers derived by our framework from both open-weight and closed-weight LLMs; in particular, our framework is data-efficient and outperforms fair classifiers trained on LLM embeddings (i.e., head-tuning) or from scratch on raw tabular features. 通过指令微调的大型语言模型(LLMs)使得构建预测模型可以采用一种简单的零样本或少样本提示范式,也称为上下文内学习。此便利性,再加上 LLM 能力的持续进步,有可能推动它们在广泛领域的应用,包括一些需要群体公平——防止不同人口群体之间出现差别性影响——至关重要的高风险场景。现有大多数针对基于 LLM 的分类器实施群体公平的做法依赖于通过模型微调或对最终层嵌入进行头部微调来应用传统公平算法,但这些方法在上下文内学习设置下对闭权重 LLM 已不再适用,而这类模型包括当今一些最强大的商用模型,如 GPT-4、Gemini 和 Claude。 在本文中,我们提出了一个通过提示从闭权重 LLMs 推导公平分类器的框架:将 LLM 视为特征提取器,并通过为指定的公平性准则策略性设计的提示从其概率预测(例如,标记对数概率)中引出特征,以获取用于公平分类的充分统计量;然后对这些特征应用公平算法,以事后方式训练轻量级公平分类器。对五个数据集(包括三个表格数据集)的实验证明,基于我们框架从开放权重和闭权重 LLMs 推导的分类器在准确性与公平性权衡上表现优异;尤其是,我们的框架数据效率高,优于在 LLM 嵌入(即头部微调)上训练的公平分类器或在原始表格特征上从零开始训练的分类器。
Subjects: Machine Learning, Computation and Language, Computers and Society 主题:机器学习、计算与语言、计算机与社会
Publish: 2025-08-15 06:50:29 UTC 发表:2025-08-15 06:50:29 UTC
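A self-contained sketch of the post-hoc recipe on synthetic data: hypothetical log-probability features elicited from an LLM feed a lightweight classifier, and per-group thresholds are then chosen so both groups receive positives at the same rate (demographic parity, one of several criteria such a framework could target).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins: in practice the features would be token log-probs
# (e.g., logp("yes"), logp("no")) elicited from the LLM with crafted prompts.
rng = np.random.default_rng(0)
features = rng.normal(size=(400, 2))       # elicited log-prob features
group = rng.integers(0, 2, size=400)       # protected group membership
labels = (features[:, 0] + 0.3 * group + rng.normal(size=400) > 0).astype(int)

clf = LogisticRegression().fit(features, labels)  # lightweight post-hoc model
scores = clf.predict_proba(features)[:, 1]

# Per-group thresholds so both groups get the same positive rate.
target_rate = 0.5
thresholds = {g: np.quantile(scores[group == g], 1 - target_rate) for g in (0, 1)}
preds = (scores > np.array([thresholds[g] for g in group])).astype(int)
print({g: preds[group == g].mean() for g in (0, 1)})  # roughly equal rates
```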
#50 Beyond Solving Math Quiz: Evaluating the Ability of Large Reasoning Models to Ask for Information #50 超越解答数学测验:评估大型推理模型提出信息需求的能力
Authors: [Youcheng Huang](https://arxiv.org/search/?searchtype=author&query=Youcheng Huang), [Bowen Qin](https://arxiv.org/search/?searchtype=author&query=Bowen Qin), [Chen Huang](https://arxiv.org/search/?searchtype=author&query=Chen Huang), [Duanyu Feng](https://arxiv.org/search/?searchtype=author&query=Duanyu Feng), [Xi Yang](https://arxiv.org/search/?searchtype=author&query=Xi Yang), [Wenqiang Lei](https://arxiv.org/search/?searchtype=author&query=Wenqiang Lei) 作者:黄有成、秦博文、黄晨、冯端宇、杨曦、雷文强
Large Reasoning Models (LRMs) have demonstrated remarkable problem-solving abilities in mathematics, as evaluated by existing benchmarks exclusively on well-defined problems. However, such an evaluation setup constitutes a critical gap, since a genuinely intelligent agent should not only solve problems (as a math quiz solver), but also be able to ask for information when problems lack sufficient information, enabling proactivity in responding to users’ requests. To bridge this gap, we propose a new dataset consisting of two types of incomplete problems with diverse contexts. Based on the dataset, our systematic evaluation of LRMs reveals their inability to proactively ask for information. In addition, we uncover behaviors of LRMs related to overthinking and hallucination, and highlight the potential and challenges of supervised fine-tuning in learning such an ability. We hope to provide new insights for developing LRMs with genuine intelligence, rather than just problem-solving ability. 大型推理模型(LRMs)在数学问题求解方面展示了卓越的能力,现有评测基于仅针对定义明确的问题的基准进行评估。然而,这样的评估设置存在关键缺口,因为真正的智能体不仅应能解决问题(如数学测验答题器),还应在问题信息不足时主动询问信息,从而在响应用户请求时表现出主动性。为弥补这一缺口,我们提出了一个包含两类、情境多样的不完整问题的新数据集。基于该数据集,我们对大型推理模型的系统性评估揭示了其在主动询问信息方面的不足。此外,我们发现了大型推理模型有关过度思考与幻觉的行为,并强调了有监督微调在学习此类能力方面的潜力与挑战。我们希望为开发具备真正智能的大型推理模型提供新的见解,而不仅仅是解决问题。
Subjects: Artificial Intelligence, Computation and Language, Information Retrieval 主题:人工智能,计算与语言,信息检索
Publish: 2025-08-15 06:42:00 UTC 发布时间:2025-08-15 06:42:00 UTC
#51 Benchmarking Prosody Encoding in Discrete Speech Tokens #51 在离散语音标记中对韵律编码的基准测试
Authors: [Kentaro Onda](https://arxiv.org/search/?searchtype=author&query=Kentaro Onda), [Satoru Fukayama](https://arxiv.org/search/?searchtype=author&query=Satoru Fukayama), [Daisuke Saito](https://arxiv.org/search/?searchtype=author&query=Daisuke Saito), [Nobuaki Minematsu](https://arxiv.org/search/?searchtype=author&query=Nobuaki Minematsu) 作者:Kentaro Onda、Satoru Fukayama、Daisuke Saito、Nobuaki Minematsu
Recently, discrete tokens derived from self-supervised learning (SSL) models via k-means clustering have been actively studied as pseudo-text in speech language models and as efficient intermediate representations for various tasks. However, these discrete tokens are typically learned in advance, separately from the training of language models or downstream tasks. As a result, choices related to discretization, such as the SSL model used or the number of clusters, must be made heuristically. In particular, speech language models are expected to understand and generate responses that reflect not only the semantic content but also prosodic features. Yet, there has been limited research on the ability of discrete tokens to capture prosodic information. To address this gap, this study conducts a comprehensive analysis focusing on prosodic encoding based on their sensitivity to the artificially modified prosody, aiming to provide practical guidelines for designing discrete tokens. 最近,通过对自监督学习(SSL)模型的表示进行 k-means 聚类得到的离散符号,作为语音语言模型中的伪文本以及用于各种任务的高效中间表示,受到了广泛研究。然而,这些离散符号通常是在与语言模型或下游任务训练分离的情况下预先学习的。因此,关于离散化的选择,例如所使用的 SSL 模型或聚类数,必须通过经验法则来决定。特别是,语音语言模型不仅需要理解语义内容,还应生成反映韵律特征的回应。然而,关于离散符号捕捉韵律信息能力的研究有限。为填补这一空白,本研究进行了针对韵律编码的全面分析,基于对人工修改韵律的敏感性,旨在为设计离散符号提供实用指南。
Subjects: Sound, Computation and Language, Audio and Speech Processing 主题:声音、计算与语言、音频与语音处理
Publish: 2025-08-15 05:11:16 UTC 发布:2025-08-15 05:11:16 UTC
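The standard discretization pipeline analyzed here can be sketched in a few lines; the SSL features below are random stand-ins, and the cluster count is exactly the kind of heuristic choice the study interrogates.

```python
import numpy as np
from sklearn.cluster import KMeans

# Frame-level SSL features (random stand-ins for, e.g., HuBERT activations).
rng = np.random.default_rng(0)
ssl_features = rng.normal(size=(1000, 768))   # (frames, feature_dim)

n_clusters = 100  # a heuristic choice, made in advance of any downstream task
km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(ssl_features)

# Discretize one utterance: each frame becomes its cluster id (a pseudo-text).
utterance = rng.normal(size=(50, 768))
discrete_tokens = km.predict(utterance)       # e.g., [17, 17, 4, 93, ...]
print(discrete_tokens[:10])
```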
#52 ORFuzz: Fuzzing the "Other Side" of LLM Safety – Testing Over-Refusal #52 ORFuzz:模糊测试 LLM 安全性的“另一面”——检测过度拒绝
Authors: [Haonan Zhang](https://arxiv.org/search/?searchtype=author&query=Haonan Zhang), [Dongxia Wang](https://arxiv.org/search/?searchtype=author&query=Dongxia Wang), [Yi Liu](https://arxiv.org/search/?searchtype=author&query=Yi Liu), [Kexin Chen](https://arxiv.org/search/?searchtype=author&query=Kexin Chen), [Jiashui Wang](https://arxiv.org/search/?searchtype=author&query=Jiashui Wang), [Xinlei Ying](https://arxiv.org/search/?searchtype=author&query=Xinlei Ying), [Long Liu](https://arxiv.org/search/?searchtype=author&query=Long Liu), [Wenhai Wang](https://arxiv.org/search/?searchtype=author&query=Wenhai Wang) 作者:张浩南,王东霞,刘怡,陈可欣,王家水,应新磊,刘龙,王文海
Large Language Models (LLMs) increasingly exhibit over-refusal - erroneously rejecting benign queries due to overly conservative safety measures - a critical functional flaw that undermines their reliability and usability. Current methods for testing this behavior are demonstrably inadequate, suffering from flawed benchmarks and limited test generation capabilities, as highlighted by our empirical user study. To the best of our knowledge, this paper introduces the first evolutionary testing framework, ORFuzz, for the systematic detection and analysis of LLM over-refusals. ORFuzz uniquely integrates three core components: (1) safety category-aware seed selection for comprehensive test coverage, (2) adaptive mutator optimization using reasoning LLMs to generate effective test cases, and (3) OR-Judge, a human-aligned judge model validated to accurately reflect user perception of toxicity and refusal. Our extensive evaluations demonstrate that ORFuzz generates diverse, validated over-refusal instances at a rate (6.98% average) more than double that of leading baselines, effectively uncovering vulnerabilities. Furthermore, ORFuzz’s outputs form the basis of ORFuzzSet, a new benchmark of 1,855 highly transferable test cases that achieves a superior 63.56% average over-refusal rate across 10 diverse LLMs, significantly outperforming existing datasets. ORFuzz and ORFuzzSet provide a robust automated testing framework and a valuable community resource, paving the way for developing more reliable and trustworthy LLM-based software systems. 大型语言模型(LLMs)越来越多地表现出过度拒绝——由于过于保守的安全措施而错误地拒绝良性查询——这是一种关键的功能缺陷,削弱了其可靠性和可用性。我们通过实证用户研究指出,现有检测此类行为的方法明显不足,存在基准测试缺陷和有限的测试生成能力。据我们所知,本文首次提出了用于系统检测和分析 LLM 过度拒绝的进化测试框架 ORFuzz。ORFuzz 独特地整合了三大核心组件: (1) 考虑安全类别的种子选择以实现全面的测试覆盖,(2) 使用推理型 LLM 进行自适应变异器优化以生成有效的测试用例,(3) OR-Judge——一个与人类一致的判定模型,经验证能准确反映用户对有害性和拒绝的感知。我们的广泛评估表明,ORFuzz 以平均 6.98% 的速率生成多样且经验证的过度拒绝实例,超过主要基线方法的两倍以上,有效地揭示了漏洞。 此外,ORFuzz 的输出构成了 ORFuzzSet 的基础——一个包含 1,855 个高度可迁移测试用例的新基准,在 10 个多样化的 LLMs 上实现了 63.56% 的平均过度拒绝率,显著优于现有数据集。ORFuzz 和 ORFuzzSet 提供了一个稳健的自动化测试框架和宝贵的社区资源,为开发更可靠、更值得信赖的基于 LLM 的软件系统铺平了道路。
Subjects: Software Engineering, Artificial Intelligence, Computation and Language, Information Retrieval 主题:软件工程、人工智能、计算与语言、信息检索
Publish: 2025-08-15 05:03:26 UTC 发布:2025-08-15 05:03:26 UTC
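A heavily simplified sketch of such an evolutionary loop (not ORFuzz itself): seeds are organized by safety category, a mutation hook produces variants, a judge hook flags benign prompts the target model refused, and successes are fed back into the pool. All three callables are hypothetical.

```python
import random

def fuzz(seeds_by_category, mutate, target_llm, judge_is_benign_refusal,
         iterations=100):
    """Evolve prompts that trigger over-refusal.

    seeds_by_category: dict mapping safety category -> list of seed prompts.
    mutate / target_llm / judge_is_benign_refusal: hypothetical hooks for the
    reasoning-LLM mutator, the model under test, and the human-aligned judge.
    """
    found = []
    pool = {cat: list(seeds) for cat, seeds in seeds_by_category.items()}
    for _ in range(iterations):
        category = random.choice(list(pool))          # cover all safety categories
        parent = random.choice(pool[category])
        child = mutate(parent)                        # LLM-driven mutation
        response = target_llm(child)
        if judge_is_benign_refusal(child, response):  # benign prompt, refused
            found.append(child)
            pool[category].append(child)              # evolve from successes
    return found
```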
#53 How Causal Abstraction Underpins Computational Explanation #53 因果抽象如何支撑计算性解释
Authors: [Atticus Geiger](https://arxiv.org/search/?searchtype=author&query=Atticus Geiger), [Jacqueline Harding](https://arxiv.org/search/?searchtype=author&query=Jacqueline Harding), [Thomas Icard](https://arxiv.org/search/?searchtype=author&query=Thomas Icard) 作者:Atticus Geiger、Jacqueline Harding、Thomas Icard
Explanations of cognitive behavior often appeal to computations over representations. What does it take for a system to implement a given computation over suitable representational vehicles within that system? We argue that the language of causality – and specifically the theory of causal abstraction – provides a fruitful lens on this topic. Drawing on current discussions in deep learning with artificial neural networks, we illustrate how classical themes in the philosophy of computation and cognition resurface in contemporary machine learning. We offer an account of computational implementation grounded in causal abstraction, and examine the role for representation in the resulting picture. We argue that these issues are most profitably explored in connection with generalization and prediction. 对认知行为的解释常常诉诸于对表征进行的计算。一个系统需要具备什么才能在该系统内通过适当的表征载体实现某一给定的计算?我们主张,用因果语言——尤其是因果抽象理论——来考察这一问题是富有成效的视角。通过借鉴当前在人工神经网络深度学习中的讨论,我们展示了计算与认知哲学的经典主题如何在当代机器学习中重新出现。我们提供了一个以因果抽象为基础的计算实现的解释,并考察了表征在由此图景中所起的作用。我们认为,这些问题最有成效的探索方向是与泛化和预测相联系的研究。
Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题:机器学习、人工智能、计算与语言
Publish: 2025-08-15 04:46:02 UTC 发布:2025-08-15 04:46:02 UTC
#54 Expressive Speech Retrieval using Natural Language Descriptions of Speaking Style #54 使用自然语言描述说话风格的富有表现力的语音检索
Authors: [Wonjune Kang](https://arxiv.org/search/?searchtype=author&query=Wonjune Kang), [Deb Roy](https://arxiv.org/search/?searchtype=author&query=Deb Roy) 作者:Wonjune Kang、Deb Roy
We introduce the task of expressive speech retrieval, where the goal is to retrieve speech utterances spoken in a given style based on a natural language description of that style. While prior work has primarily focused on performing speech retrieval based on what was said in an utterance, we aim to do so based on how something was said. We train speech and text encoders to embed speech and text descriptions of speaking styles into a joint latent space, which enables using free-form text prompts describing emotions or styles as queries to retrieve matching expressive speech segments. We perform detailed analyses of various aspects of our proposed framework, including encoder architectures, training criteria for effective cross-modal alignment, and prompt augmentation for improved generalization to arbitrary text queries. Experiments on multiple datasets encompassing 22 speaking styles demonstrate that our approach achieves strong retrieval performance as measured by Recall@k. 我们提出了“富表达性语音检索”任务,其目标是根据对某种说话风格的自然语言描述来检索以该风格朗读的语音片段。此前的研究主要侧重于基于语音中“说了什么”来进行检索,而我们的目标是基于“怎么说”来检索。我们训练语音和文本编码器,将语音和对说话风格的文本描述嵌入到一个联合的潜在空间中,从而可以使用描述情感或风格的自由文本提示作为查询来检索匹配的富表达性语音片段。我们对所提出框架的各个方面进行了详细分析,包括编码器架构、用于有效跨模态对齐的训练准则,以及为提升对任意文本查询的泛化能力而进行的提示增强。在涵盖 22 种说话风格的多个数据集上的实验表明,我们的方法在 Recall@k 等指标上实现了良好的检索性能。
Subjects: Audio and Speech Processing, Computation and Language, Sound 主题:音频与语音处理、计算与语言、声音
Publish: 2025-08-15 03:38:21 UTC 发布:2025-08-15 03:38:21 UTC
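A toy sketch of the retrieval step once the encoders are trained: cosine similarity between a text-style query embedding and all speech embeddings, scored by Recall@k. The embeddings are random stand-ins for the paper's encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
speech = rng.normal(size=(200, 256))                       # speech embeddings
speech /= np.linalg.norm(speech, axis=1, keepdims=True)    # unit-normalize

def recall_at_k(query_embs, gold_indices, k=10):
    """Fraction of queries whose gold segment appears in the top-k results."""
    hits = 0
    for q, gold in zip(query_embs, gold_indices):
        q = q / np.linalg.norm(q)
        top_k = np.argsort(speech @ q)[::-1][:k]  # cosine similarity ranking
        hits += int(gold in top_k)
    return hits / len(gold_indices)

# Noisy copies of the first 20 speech embeddings stand in for style prompts.
queries = speech[:20] + 0.1 * rng.normal(size=(20, 256))
print(recall_at_k(queries, gold_indices=list(range(20)), k=10))
```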
#55 A Cross-Modal Rumor Detection Scheme via Contrastive Learning by Exploring Text and Image internal Correlations #55 通过对比学习探索文本与图像内部关联的跨模态谣言检测方案
Authors: [Bin Ma](https://arxiv.org/search/?searchtype=author&query=Bin Ma), [Yifei Zhang](https://arxiv.org/search/?searchtype=author&query=Yifei Zhang), [Yongjin Xian](https://arxiv.org/search/?searchtype=author&query=Yongjin Xian), [Qi Li](https://arxiv.org/search/?searchtype=author&query=Qi Li), [Linna Zhou](https://arxiv.org/search/?searchtype=author&query=Linna Zhou), [Gongxun Miao](https://arxiv.org/search/?searchtype=author&query=Gongxun Miao) 作者:马斌、张逸飞、咸永进、李琦、周琳娜、缪公训
Existing rumor detection methods often neglect the content within images as well as the inherent relationships between contexts and images across different visual scales, thereby resulting in the loss of critical information pertinent to rumor identification. To address these issues, this paper presents a novel cross-modal rumor detection scheme based on contrastive learning, namely the Multi-scale Image and Context Correlation exploration algorithm (MICC). Specifically, we design an SCLIP encoder to generate unified semantic embeddings for text and multi-scale image patches through contrastive pretraining, enabling their relevance to be measured via dot-product similarity. Building upon this, a Cross-Modal Multi-Scale Alignment module is introduced to identify image regions most relevant to the textual semantics, guided by mutual information maximization and the information bottleneck principle, through a Top-K selection strategy based on a cross-modal relevance matrix constructed between the text and multi-scale image patches. Moreover, a scale-aware fusion network is designed to integrate the highly correlated multi-scale image features with global text features by assigning adaptive weights to image regions based on their semantic importance and cross-modal relevance. The proposed methodology has been extensively evaluated on two real-world datasets. The experimental results demonstrate that it achieves a substantial performance improvement over existing state-of-the-art approaches in rumor detection, highlighting its effectiveness and potential for practical applications. 现有的谣言检测方法常常忽略图像内部的内容以及不同视觉尺度下文本与图像之间固有的关联,从而导致与谣言识别密切相关的重要信息丢失。为了解决这些问题,本文提出了一种基于对比学习的新型跨模态谣言检测方案,称为多尺度图像与上下文相关性探索算法(MICC)。具体地,我们设计了一个 SCLIP 编码器,通过对比预训练为文本和多尺度图像补丁生成统一的语义嵌入,从而可以通过点积相似度来衡量它们的相关性。在此基础上,引入了一个跨模态多尺度对齐模块,该模块通过在文本与多尺度图像补丁之间构建的跨模态相关矩阵,采用基于 Top-K 的选择策略,并结合互信息最大化与信息瓶颈原则,识别与文本语义最相关的图像区域。 此外,设计了一个尺度感知融合网络,通过根据图像区域的语义重要性和跨模态相关性为其分配自适应权重,将高度相关的多尺度图像特征与全局文本特征整合。所提出的方法在两个真实世界数据集上进行了广泛评估。实验结果表明,与现有的最先进方法相比,它在谣言检测方面实现了显著的性能提升,凸显了其有效性和实际应用潜力。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language 主题:计算机视觉与模式识别、人工智能、计算与语言
Publish: 2025-08-15 01:13:50 UTC 发布时间:2025-08-15 01:13:50 UTC
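A minimal sketch of the Top-K selection step, with random vectors standing in for SCLIP text and multi-scale patch embeddings: relevance is dot-product similarity, and the K highest-scoring patch indices are kept per scale.

```python
import numpy as np

rng = np.random.default_rng(0)
text_emb = rng.normal(size=256)  # stand-in for the text's semantic embedding
patch_embs = {scale: rng.normal(size=(n, 256))  # patches per visual scale
              for scale, n in {"coarse": 4, "mid": 16, "fine": 64}.items()}

K = 5
selected = {}
for scale, patches in patch_embs.items():
    relevance = patches @ text_emb                     # dot-product similarity
    selected[scale] = np.argsort(relevance)[::-1][:K]  # Top-K patch indices
print(selected)
```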
#56 +VeriRel: Verification Feedback to Enhance Document Retrieval for Scientific Fact Checking #56 +VeriRel:通过验证反馈增强科学事实核查的文献检索
Authors: [Xingyu Deng](https://arxiv.org/search/?searchtype=author&query=Xingyu Deng), [Xi Wang](https://arxiv.org/search/?searchtype=author&query=Xi Wang), [Mark Stevenson](https://arxiv.org/search/?searchtype=author&query=Mark Stevenson) 作者:邓星宇,王曦,Mark Stevenson
Identification of appropriate supporting evidence is critical to the success of scientific fact checking. However, existing approaches rely on off-the-shelf Information Retrieval algorithms that rank documents based on relevance rather than the evidence they provide to support or refute the claim being checked. This paper proposes +VeriRel which includes verification success in the document ranking. Experimental results on three scientific fact checking datasets (SciFact, SciFact-Open and Check-Covid) demonstrate consistently leading performance by +VeriRel for document evidence retrieval and a positive impact on downstream verification. This study highlights the potential of integrating verification feedback to document relevance assessment for effective scientific fact checking systems. It shows promising future work to evaluate fine-grained relevance when examining complex documents for advanced scientific fact checking. 为科学事实核查识别合适的支持性证据对成功至关重要。然而,现有方法依赖现成的信息检索算法,这些算法根据相关性对文档进行排序,而不是根据文档为支持或反驳所检验的主张所提供的证据来排序。本文提出了 +VeriRel,将验证成功纳入文档排序中。在三个科学事实核查数据集(SciFact、SciFact-Open 和 Check-Covid)上的实验结果显示,+VeriRel 在文档证据检索方面始终表现领先,并对下游验证产生了积极影响。本研究强调了将验证反馈整合到文档相关性评估中以构建有效科学事实核查系统的潜力,并指出在审查复杂文档以实现高级科学事实核查时评估细粒度相关性的有希望的未来研究方向。
Subjects: Information Retrieval, Computation and Language 主题:信息检索,计算与语言
Publish: 2025-08-14 23:57:40 UTC 发布:2025-08-14 23:57:40 UTC
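The core re-ranking idea reduces to folding a verification signal into the document score; a minimal sketch, assuming precomputed relevance and verification scores (the actual +VeriRel weighting and verifier are more involved):

```python
def rerank(docs, relevance, verification, beta=0.5):
    """Score each doc by a convex mix of retrieval relevance and how well it
    supports verifying the claim, then sort descending. beta trades off the
    two signals; relevance/verification are hypothetical precomputed scores."""
    combined = {d: (1 - beta) * relevance[d] + beta * verification[d]
                for d in docs}
    return sorted(docs, key=lambda d: combined[d], reverse=True)

docs = ["doc1", "doc2", "doc3"]
print(rerank(docs,
             relevance={"doc1": 0.9, "doc2": 0.7, "doc3": 0.4},
             verification={"doc1": 0.1, "doc2": 0.8, "doc3": 0.9}))
```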
#57 PaperRegister: Boosting Flexible-grained Paper Search via Hierarchical Register Indexing #57 PaperRegister:通过分层登记索引提升灵活粒度论文检索
Authors: [Zhuoqun Li](https://arxiv.org/search/?searchtype=author&query=Zhuoqun Li), [Xuanang Chen](https://arxiv.org/search/?searchtype=author&query=Xuanang Chen), [Hongyu Lin](https://arxiv.org/search/?searchtype=author&query=Hongyu Lin), [Yaojie Lu](https://arxiv.org/search/?searchtype=author&query=Yaojie Lu), [Xianpei Han](https://arxiv.org/search/?searchtype=author&query=Xianpei Han), [Le Sun](https://arxiv.org/search/?searchtype=author&query=Le Sun) 作者:李卓群、陈轩昂、林鸿昱、卢耀杰、韩显培、孙乐
Paper search is an important activity for researchers, typically involving using a query that describes a topic to find relevant papers. As research deepens, paper search requirements may become more flexible, sometimes involving specific details such as module configuration rather than being limited to coarse-grained topics. However, previous paper search systems are unable to meet these flexible-grained requirements, as they mainly collect paper abstracts to construct the corpus index, which lacks the detailed information needed to support retrieval with finer-grained queries. In this work, we propose PaperRegister, consisting of offline hierarchical indexing and online adaptive retrieval, which transforms the traditional abstract-based index into a hierarchical index tree for paper search, thereby supporting queries at flexible granularity. Experiments on paper search tasks across a range of granularities demonstrate that PaperRegister achieves state-of-the-art performance and particularly excels in fine-grained scenarios, highlighting its potential as an effective solution for flexible-grained paper search in real-world applications. Code for this work is available at https://github.com/Li-Z-Q/PaperRegister. 论文检索是研究人员的一项重要活动,通常涉及使用带有主题描述的查询来查找相关论文。随着研究的深入,论文检索的需求可能变得更加灵活,有时涉及诸如模块配置等具体细节,而不再局限于粗粒度的主题。然而,以往的论文检索系统无法满足这些灵活粒度的需求,因为这些系统主要收集论文摘要来构建语料索引,而摘要缺乏支持按更细粒度查询检索的详细信息。在本工作中,我们提出了 PaperRegister,由离线层次索引和在线自适应检索组成,将传统的基于摘要的索引转化为用于论文检索的层次索引树,从而支持灵活粒度的查询。在不同粒度的论文检索任务上的实验表明,PaperRegister 达到了最先进的性能,并在细粒度场景中表现尤为出色,凸显出其作为现实应用中灵活粒度论文检索有效解决方案的良好潜力。 本工作的代码位于 https://github.com/Li-Z-Q/PaperRegister。
Subjects: Information Retrieval, Computation and Language 主题:信息检索,计算与语言
Publish: 2025-08-14 23:43:46 UTC 发布:2025-08-14 23:43:46 UTC
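A toy sketch of the register-index idea: each paper is indexed at several granularities, so a fine-grained query can match a detail-level entry instead of only the abstract. Entries and the scoring function are hypothetical.

```python
# Each paper contributes entries at multiple granularity levels.
index = [
    {"paper": "P1", "level": "abstract", "text": "topic overview of adapters"},
    {"paper": "P1", "level": "detail",   "text": "module configuration: two-layer adapter"},
]

def search(query, score):
    """Return the best-matching entry across all levels; `score(query, text)`
    is a hypothetical relevance function (word overlap below)."""
    return max(index, key=lambda entry: score(query, entry["text"]))

overlap = lambda q, t: len(set(q.split()) & set(t.split()))
print(search("two-layer adapter", overlap))  # hits the detail-level entry
```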
#58 Diffusion is a code repair operator and generator #58 Diffusion 是一种代码修复操作符和生成器
Authors: [Mukul Singh](https://arxiv.org/search/?searchtype=author&query=Mukul Singh), [Gust Verbruggen](https://arxiv.org/search/?searchtype=author&query=Gust Verbruggen), [Vu Le](https://arxiv.org/search/?searchtype=author&query=Vu Le), [Sumit Gulwani](https://arxiv.org/search/?searchtype=author&query=Sumit Gulwani) 作者:Mukul Singh、Gust Verbruggen、Vu Le、Sumit Gulwani
Code diffusion models generate code by iteratively removing noise from the latent representation of a code snippet. During later steps of the diffusion process, when the code snippet has almost converged, differences between discrete representations of these snippets look like last-mile repairs applied to broken or incomplete code. We evaluate the extent to which this resemblance can be exploited to leverage pre-trained code diffusion models for the problem of last-mile repair by considering two applications with significant potential. First, we can leverage the diffusion model for last-mile repair by adding noise to a broken code snippet and resuming the diffusion process. Second, we can leverage the diffusion model to generate an arbitrary amount of training data for last-mile repair tasks (which are computationally more efficient) by sampling an intermediate program (input) and the final program (output) from the diffusion process. We perform experiments in 3 domains (Python, Excel, and PowerShell) to evaluate both applications and analyze their properties. 代码扩散模型通过从代码片段的潜在表示中迭代去除噪声来生成代码。在扩散过程的后期步骤,当代码片段几乎收敛时,这些片段的离散表示之间的差异看起来像是对损坏或不完整代码所做的“最后一英里”修复。我们评估了这种相似性在多大程度上可以被利用,以便将预训练的代码扩散模型用于“最后一英里”修复问题,并考虑了两个具有显著潜力的应用。首先,我们可以通过向损坏的代码片段添加噪声并恢复扩散过程,利用扩散模型进行最后一英里修复。其次,我们可以通过从扩散过程中采样中间程序(输入)和最终程序(输出),利用扩散模型生成任意数量的用于最后一英里修复任务的训练数据(这些数据在计算上更高效)。我们在三个领域(Python、Excel 和 PowerShell)上进行了实验,以评估这些应用并分析其属性。
Subjects: Software Engineering, Artificial Intelligence, Computation and Language 主题:软件工程,人工智能,计算与语言
Publish: 2025-08-14 23:27:09 UTC 发布:2025-08-14 23:27:09 UTC
#59 Can Multi-modal (reasoning) LLMs detect document manipulation? #59 多模态(推理)LLMs 能检测文档篡改吗?
Authors: [Zisheng Liang](https://arxiv.org/search/?searchtype=author&query=Zisheng Liang), [Kidus Zewde](https://arxiv.org/search/?searchtype=author&query=Kidus Zewde), [Rudra Pratap Singh](https://arxiv.org/search/?searchtype=author&query=Rudra Pratap Singh), [Disha Patil](https://arxiv.org/search/?searchtype=author&query=Disha Patil), [Zexi Chen](https://arxiv.org/search/?searchtype=author&query=Zexi Chen), [Jiayu Xue](https://arxiv.org/search/?searchtype=author&query=Jiayu Xue), [Yao Yao](https://arxiv.org/search/?searchtype=author&query=Yao Yao), [Yifei Chen](https://arxiv.org/search/?searchtype=author&query=Yifei Chen), [Qinzhe Liu](https://arxiv.org/search/?searchtype=author&query=Qinzhe Liu), [Simiao Ren](https://arxiv.org/search/?searchtype=author&query=Simiao Ren) 作者:Zisheng Liang, Kidus Zewde, Rudra Pratap Singh, Disha Patil, Zexi Chen, Jiayu Xue, Yao Yao, Yifei Chen, Qinzhe Liu, Simiao Ren
Document fraud poses a significant threat to industries reliant on secure and verifiable documentation, necessitating robust detection mechanisms. This study investigates the efficacy of state-of-the-art multi-modal large language models (LLMs), including OpenAI O1, OpenAI 4o, Gemini Flash (thinking), Deepseek Janus, Grok, Llama 3.2 and 4, Qwen 2 and 2.5 VL, Mistral Pixtral, and Claude 3.5 and 3.7 Sonnet, in detecting fraudulent documents. We benchmark these models against each other and prior work on document fraud detection techniques using a standard dataset with real transactional documents. Through prompt optimization and detailed analysis of the models’ reasoning processes, we evaluate their ability to identify subtle indicators of fraud, such as tampered text, misaligned formatting, and inconsistent transactional sums. Our results reveal that top-performing multi-modal LLMs demonstrate superior zero-shot generalization, outperforming conventional methods on out-of-distribution datasets, while several vision LLMs exhibit inconsistent or subpar performance. Notably, model size and advanced reasoning capabilities show limited correlation with detection accuracy, suggesting task-specific fine-tuning is critical. This study underscores the potential of multi-modal LLMs in enhancing document fraud detection systems and provides a foundation for future research into interpretable and scalable fraud mitigation strategies. 文件欺诈对依赖安全且可验证文档的行业构成重大威胁,因而需要强有力的检测机制。本研究考察了最先进的多模态 LLMs 在检测欺诈文档方面的有效性——包括 OpenAI O1、OpenAI 4o、Gemini Flash(thinking)、Deepseek Janus、Grok、Llama 3.2 与 4、Qwen 2 与 2.5 VL、Mistral Pixtral,以及 Claude 3.5 与 3.7 Sonnet。我们使用包含真实交易文档的标准数据集,将这些模型相互比较并与以往的文档欺诈检测方法进行基准测试。通过提示优化和对模型推理过程的详细分析,我们评估了它们识别欺诈细微迹象的能力,例如被篡改的文本、格式错位和不一致的交易总额。我们的结果表明,表现最佳的多模态 LLMs 展示了卓越的零样本泛化能力,在分布外数据集上优于传统方法,而若干视觉 LLMs 则表现出不稳定或较差的性能。 值得注意的是,模型规模和高级推理能力与检测准确率之间的相关性有限,这表明针对特定任务的微调至关重要。本研究强调了多模态 LLMs 在增强文档欺诈检测系统方面的潜力,并为未来对可解释且可扩展的欺诈缓解策略的研究提供了基础。
Subjects: Computer Vision and Pattern Recognition, Computation and Language 主题:计算机视觉与模式识别,计算与语言
Publish: 2025-08-14 18:57:07 UTC 发布:2025-08-14 18:57:07 UTC
#60 Match & Choose: Model Selection Framework for Fine-tuning Text-to-Image Diffusion Models #60 匹配与选择:用于微调文本到图像扩散模型的模型选择框架
Authors: [Basile Lewandowski](https://arxiv.org/search/?searchtype=author&query=Basile Lewandowski), [Robert Birke](https://arxiv.org/search/?searchtype=author&query=Robert Birke), [Lydia Y. Chen](https://arxiv.org/search/?searchtype=author&query=Lydia Y. Chen) 作者:Basile Lewandowski,Robert Birke,Lydia Y. Chen
Text-to-image (T2I) models based on diffusion and transformer architectures advance rapidly. They are often pretrained on large corpora, and openly shared on a model platform, such as HuggingFace. Users can then build up AI applications, e.g., generating media contents, by adopting pretrained T2I models and fine-tuning them on the target dataset. While public pretrained T2I models facilitate the democratization of the models, users face a new challenge: which model can be best fine-tuned based on the target data domain? Model selection is well addressed in classification tasks, but little is known about (pretrained) T2I models and their performance indication on the target domain. In this paper, we propose the first model selection framework, M&C, which enables users to efficiently choose a pretrained T2I model from a model platform without exhaustively fine-tuning them all on the target dataset. The core of M&C is a matching graph, which consists of: (i) nodes of available models and profiled datasets, and (ii) edges of model-data and data-data pairs capturing the fine-tuning performance and data similarity, respectively. We then build a model that, based on model/data features and, critically, the graph embedding features extracted from the matching graph, predicts the model achieving the best quality after fine-tuning for the target domain. We evaluate M&C on choosing across ten T2I models for 32 datasets against three baselines. Our results show that M&C successfully predicts the best model for fine-tuning in 61.3% of the cases and a closely performing model for the rest. 基于扩散和变换器架构的文本到图像(T2I)模型发展迅速。它们通常在大型语料上进行预训练,并在模型平台(例如 HuggingFace)上公开共享。用户随后可以通过采用预训练的 T2I 模型并在目标数据集上对其进行微调来构建 AI 应用,例如生成媒体内容。虽然公开的预训练 T2I 模型促进了模型的普及化,但用户面临一个新的挑战:基于目标数据域,哪个模型最适合进行微调?在分类任务中,模型选择问题已得到很好的解决,但在(预训练的)T2I 模型及其在目标域上性能指示方面知之甚少。本文提出了第一个模型选择框架 M&C,使用户能够在无需对模型平台上所有模型在目标数据集上逐一进行微调的情况下,高效地选择预训练的 T2I 模型。M&C 的核心是一个匹配图,该图包括: (i) 可用模型和已剖析数据集的节点,和 (ii) 捕捉微调性能和数据相似性的模型-数据和数据-数据对的边。 然后我们构建了一个模型,该模型基于模型/数据特征输入,以及从匹配图中提取的关键图嵌入特征,来预测在目标领域微调后将取得最佳质量的模型。我们在 32 个数据集上、针对十个 T2I 模型并与三种基线方法比较,评估了 M&C 的表现。结果表明,M&C 在 61.3%的情况下成功预测出了用于微调的最佳模型,其余情况下则预测出了一个表现相近的模型。
Subjects: Machine Learning, Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition 主题:机器学习、人工智能、计算与语言、计算机视觉与模式识别
Publish: 2025-08-14 18:00:50 UTC 发布:2025-08-14 18:00:50 UTC
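A simplified sketch of the matching intuition (not the learned M&C predictor, which additionally uses graph embeddings): pick the model that fine-tuned best on the profiled dataset most similar to the target. All names, features, and performance numbers below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Profiled datasets with precomputed feature vectors (data-data similarity).
profiled = {"dataset-a": rng.normal(size=8), "dataset-b": rng.normal(size=8)}
# Known fine-tuning performance per (model, profiled dataset) pair.
finetune_perf = {("model-x", "dataset-a"): 0.71, ("model-x", "dataset-b"): 0.55,
                 ("model-y", "dataset-a"): 0.66, ("model-y", "dataset-b"): 0.62}

def pick_model(target_feat):
    # Nearest profiled dataset by feature distance, then its best-known model.
    nearest = min(profiled, key=lambda d: np.linalg.norm(profiled[d] - target_feat))
    candidates = {m: p for (m, d), p in finetune_perf.items() if d == nearest}
    return max(candidates, key=candidates.get)

print(pick_model(rng.normal(size=8)))
```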
#61 BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining #61 BeyondWeb:关于将合成数据扩展到万亿级预训练的经验教训
Authors: [Pratyush Maini](https://arxiv.org/search/?searchtype=author&query=Pratyush Maini), [Vineeth Dorna](https://arxiv.org/search/?searchtype=author&query=Vineeth Dorna), [Parth Doshi](https://arxiv.org/search/?searchtype=author&query=Parth Doshi), [Aldo Carranza](https://arxiv.org/search/?searchtype=author&query=Aldo Carranza), [Fan Pan](https://arxiv.org/search/?searchtype=author&query=Fan Pan), [Jack Urbanek](https://arxiv.org/search/?searchtype=author&query=Jack Urbanek), [Paul Burstein](https://arxiv.org/search/?searchtype=author&query=Paul Burstein), [Alex Fang](https://arxiv.org/search/?searchtype=author&query=Alex Fang), [Alvin Deng](https://arxiv.org/search/?searchtype=author&query=Alvin Deng), [Amro Abbas](https://arxiv.org/search/?searchtype=author&query=Amro Abbas), [Brett Larsen](https://arxiv.org/search/?searchtype=author&query=Brett Larsen), [Cody Blakeney](https://arxiv.org/search/?searchtype=author&query=Cody Blakeney), [Charvi Bannur](https://arxiv.org/search/?searchtype=author&query=Charvi Bannur), [Christina Baek](https://arxiv.org/search/?searchtype=author&query=Christina Baek), [Darren Teh](https://arxiv.org/search/?searchtype=author&query=Darren Teh), [David Schwab](https://arxiv.org/search/?searchtype=author&query=David Schwab), [Haakon Mongstad](https://arxiv.org/search/?searchtype=author&query=Haakon Mongstad), [Haoli Yin](https://arxiv.org/search/?searchtype=author&query=Haoli Yin), [Josh Wills](https://arxiv.org/search/?searchtype=author&query=Josh Wills), [Kaleigh Mentzer](https://arxiv.org/search/?searchtype=author&query=Kaleigh Mentzer), [Luke Merrick](https://arxiv.org/search/?searchtype=author&query=Luke Merrick), [Ricardo Monti](https://arxiv.org/search/?searchtype=author&query=Ricardo Monti), [Rishabh Adiga](https://arxiv.org/search/?searchtype=author&query=Rishabh Adiga), [Siddharth Joshi](https://arxiv.org/search/?searchtype=author&query=Siddharth Joshi), [Spandan Das](https://arxiv.org/search/?searchtype=author&query=Spandan Das), [Zhengping Wang](https://arxiv.org/search/?searchtype=author&query=Zhengping Wang), [Bogdan Gaza](https://arxiv.org/search/?searchtype=author&query=Bogdan Gaza), [Ari Morcos](https://arxiv.org/search/?searchtype=author&query=Ari Morcos), [Matthew Leavitt](https://arxiv.org/search/?searchtype=author&query=Matthew Leavitt) 作者:Pratyush Maini、Vineeth Dorna、Parth Doshi、Aldo Carranza、Fan Pan、Jack Urbanek、Paul Burstein、Alex Fang、Alvin Deng、Amro Abbas、Brett Larsen、Cody Blakeney、Charvi Bannur、Christina Baek、Darren Teh、David Schwab、Haakon Mongstad、Haoli Yin、Josh Wills、Kaleigh Mentzer、Luke Merrick、Ricardo Monti、Rishabh Adiga、Siddharth Joshi、Spandan Das、Zhengping Wang、Bogdan Gaza、Ari Morcos、Matthew Leavitt
Recent advances in large language model (LLM) pretraining have shown that simply scaling data quantity eventually leads to diminishing returns, hitting a data wall. In response, the use of synthetic data for pretraining has emerged as a promising paradigm for pushing the frontier of performance. Despite this, the factors affecting synthetic data quality remain poorly understood. In this work, we introduce BeyondWeb, a synthetic data generation framework that produces high-quality synthetic data for pretraining. BeyondWeb significantly extends the capabilities of traditional web-scale datasets, outperforming state-of-the-art synthetic pretraining datasets such as Cosmopedia and Nemotron-CC’s high-quality synthetic subset (Nemotron-Synth) by up to 5.1 percentage points (pp) and 2.6pp, respectively, when averaged across a suite of 14 benchmark evaluations. It delivers up to 7.7x faster training than open web data and 2.7x faster than Nemotron-Synth. Remarkably, a 3B model trained for 180B tokens on BeyondWeb outperforms an 8B model trained for the same token budget on Cosmopedia. We also present several insights from BeyondWeb on synthetic data for pretraining: what drives its benefits, which data to rephrase and how, and the impact of model size and family on data quality. Overall, our work shows that there’s no silver bullet for generating high-quality synthetic pretraining data. The best outcomes require jointly optimizing many factors, a challenging task that requires rigorous science and practical expertise. Naive approaches can yield modest improvements, potentially at great cost, while well-executed methods can yield transformative improvements, as exemplified by BeyondWeb. 近期在大型语言模型(LLM)预训练方面的进展表明,单纯增加数据量最终会导致收益递减,遇到数据天花板。为此,使用合成数据进行预训练已成为推动性能前沿的有前途范式。尽管如此,影响合成数据质量的因素仍然知之甚少。在本工作中,我们提出了 BeyondWeb,一种生成高质量预训练合成数据的框架。BeyondWeb 大幅扩展了传统网络规模数据集的能力,在一组 14 项基准评估的平均表现上,分别比最先进的合成预训练数据集 Cosmopedia 和 Nemotron-CC 的高质量合成子集(Nemotron-Synth)高出最多 5.1 个百分点(pp)和 2.6pp。它在训练速度上比开放网络数据快最多 7.7 倍,比 Nemotron-Synth 快 2.7 倍。值得注意的是,在 BeyondWeb 上以 1800 亿 Token 训练的 30 亿参数模型,其表现优于在相同 Token 预算下于 Cosmopedia 上训练的 80 亿参数模型。 我们还从 BeyondWeb 关于用于预训练的合成数据方面提出了若干见解:是什么驱动了其好处、哪些数据应当改写以及如何改写,以及模型规模和模型家族对数据质量的影响。总体而言,我们的工作表明不存在生成高质量合成预训练数据的一劳永逸的办法。最好的结果需要对诸多因素进行联合优化,这是一项需要严谨科学和实践经验的挑战性工作。简单的做法可能带来有限的改进,且代价可能很高,而执行到位的方法则可能带来变革性的提升,BeyondWeb 就是一个范例。
Subjects: Machine Learning, Computation and Language 主题:机器学习,计算与语言
Publish: 2025-08-14 17:55:47 UTC 发布:2025-08-14 17:55:47 UTC
#62 Empowering Multimodal LLMs with External Tools: A Comprehensive Survey #62 通过外部工具增强多模态 LLMs:一项综合综述
Authors: [Wenbin An](https://arxiv.org/search/?searchtype=author&query=Wenbin An), [Jiahao Nie](https://arxiv.org/search/?searchtype=author&query=Jiahao Nie), [Yaqiang Wu](https://arxiv.org/search/?searchtype=author&query=Yaqiang Wu), [Feng Tian](https://arxiv.org/search/?searchtype=author&query=Feng Tian), [Shijian Lu](https://arxiv.org/search/?searchtype=author&query=Shijian Lu), [Qinghua Zheng](https://arxiv.org/search/?searchtype=author&query=Qinghua Zheng) 作者:安文斌、聂嘉豪、吴亚强、田峰、卢世坚、郑庆华
By integrating the perception capabilities of multimodal encoders with the generative power of Large Language Models (LLMs), Multimodal Large Language Models (MLLMs), exemplified by GPT-4V, have achieved great success in various multimodal tasks, pointing toward a promising pathway to artificial general intelligence. Despite this progress, the limited quality of multimodal data, poor performance on many complex downstream tasks, and inadequate evaluation protocols continue to hinder the reliability and broader applicability of MLLMs across diverse domains. Inspired by the human ability to leverage external tools for enhanced reasoning and problem-solving, augmenting MLLMs with external tools (e.g., APIs, expert models, and knowledge bases) offers a promising strategy to overcome these challenges. In this paper, we present a comprehensive survey on leveraging external tools to enhance MLLM performance. Our discussion is structured along four key dimensions about external tools: (1) how they can facilitate the acquisition and annotation of high-quality multimodal data; (2) how they can assist in improving MLLM performance on challenging downstream tasks; (3) how they enable comprehensive and accurate evaluation of MLLMs; (4) the current limitations and future directions of tool-augmented MLLMs. Through this survey, we aim to underscore the transformative potential of external tools in advancing MLLM capabilities, offering a forward-looking perspective on their development and applications. The project page of this paper is publicly available at https://github.com/Lackel/Awesome-Tools-for-MLLMs. 通过将多模态编码器的感知能力与大型语言模型(LLMs)的生成能力相结合,多模态大型语言模型(MLLMs),以 GPT-4V 为代表,在各种多模态任务中取得了巨大成功,指向通向人工通用智能的有希望路径。尽管取得了这些进展,多模态数据的质量有限、在许多复杂下游任务上的表现不佳以及评估协议不充分,仍然阻碍了 MLLMs 在各个领域的可靠性和更广泛的适用性。受到人类利用外部工具增强推理和解决问题能力的启发,用外部工具(例如 API、专家模型和知识库)增强 MLLMs 提供了一种有前景的策略来克服这些挑战。在本文中,我们提出了一篇关于利用外部工具提升 MLLM 性能的综合性综述。 我们的讨论围绕外部工具的四个关键维度展开:(1)它们如何促进高质量多模态数据的获取与标注;(2)它们如何辅助提高多模态大模型在具有挑战性的下游任务上的表现;(3)它们如何实现对多模态大模型的全面且准确的评估;(4)工具增强型多模态大模型目前的局限性与未来方向。通过本综述,我们旨在强调外部工具在推进多模态大模型能力方面的变革潜力,并对其发展与应用提供前瞻性视角。本文的项目页面公开可见,地址为 https://github.com/Lackel/Awesome-Tools-for-MLLMs。
Subjects: Computer Vision and Pattern Recognition, Computation and Language, Multimedia 主题:计算机视觉与模式识别,计算与语言,多媒体
Publish: 2025-08-14 07:25:45 UTC 发布时间:2025-08-14 07:25:45 UTC
#63 The Next Phase of Scientific Fact-Checking: Advanced Evidence Retrieval from Complex Structured Academic Papers #63 科学事实核查的下一个阶段:从复杂结构化学术论文中进行高级证据检索
Authors: [Xingyu Deng](https://arxiv.org/search/?searchtype=author&query=Xingyu Deng), [Xi Wang](https://arxiv.org/search/?searchtype=author&query=Xi Wang), [Mark Stevenson](https://arxiv.org/search/?searchtype=author&query=Mark Stevenson) 作者:邓星宇,王曦,Mark Stevenson
Scientific fact-checking aims to determine the veracity of scientific claims by retrieving and analysing evidence from research literature. The problem is inherently more complex than general fact-checking since it must accommodate the evolving nature of scientific knowledge, the structural complexity of academic literature and the challenges posed by long-form, multimodal scientific expression. However, existing approaches focus on simplified versions of the problem based on small-scale datasets consisting of abstracts rather than full papers, thereby avoiding the distinct challenges associated with processing complete documents. This paper examines the limitations of current scientific fact-checking systems and reveals the many potential features and resources that could be exploited to advance their performance. It identifies key research challenges within evidence retrieval, including (1) evidence-driven retrieval that addresses semantic limitations and topic imbalance (2) time-aware evidence retrieval with citation tracking to mitigate outdated information, (3) structured document parsing to leverage long-range context, (4) handling complex scientific expressions, including tables, figures, and domain-specific terminology and (5) assessing the credibility of scientific literature. Preliminary experiments were conducted to substantiate these challenges and identify potential solutions. This perspective paper aims to advance scientific fact-checking with a specialised IR system tailored for real-world applications. 科学事实核查旨在通过从研究文献中检索和分析证据来确定科学主张的真实性。这个问题本质上比一般事实核查更复杂,因为它必须适应科学知识不断演变的特性、学术文献的结构复杂性以及长篇、多模态科学表达所带来的挑战。然而,现有方法侧重于基于小规模数据集的简化问题,这些数据集由摘要而非完整论文组成,从而回避了与处理完整文档相关的独特挑战。本文审视了当前科学事实核查系统的局限性,并揭示了可用于提升其性能的许多潜在特性与资源。 它识别了证据检索中的关键研究挑战,包括 (1) 面向证据的检索,解决语义限制和主题不平衡问题,(2) 时间感知的证据检索与引用追踪以减轻信息过时,(3) 结构化文档解析以利用长程上下文,(4) 处理复杂的科学表达形式,包括表格、图像和领域特定术语,以及 (5) 评估科学文献的可信度。进行了初步实验以证实这些挑战并识别潜在解决方案。本文观点旨在通过为现实世界应用量身定制的专业信息检索系统推进科学事实核查。
Subject: Information Retrieval 主题:信息检索
Publish: 2025-06-25 21:29:33 UTC 发布:2025-06-25 21:29:33 UTC
1.2.2 Artificial Intelligence
From:https://papers.cool/arxiv/cs.AI
From:https://arxiv.org/list/cs.AI/recent 2025-08-18 | | 总计:103
#1 Inspire or Predict? Exploring New Paradigms in Assisting Classical Planners with Large Language Models #1 启发还是预测?探索在使用大型语言模型辅助经典规划器方面的新范式
Authors: [Wenkai Yu](https://arxiv.org/search/?searchtype=author&query=Wenkai Yu), [Jianhang Tang](https://arxiv.org/search/?searchtype=author&query=Jianhang Tang), [Yang Zhang](https://arxiv.org/search/?searchtype=author&query=Yang Zhang), [Shanjiang Tang](https://arxiv.org/search/?searchtype=author&query=Shanjiang Tang), [Kebing Jin](https://arxiv.org/search/?searchtype=author&query=Kebing Jin), [Hankz Hankui Zhuo](https://arxiv.org/search/?searchtype=author&query=Hankz Hankui Zhuo) 作者:余文凯、唐建航、张扬、唐善江、金科炳、卓汉克(Hankz Hankui Zhuo)
Addressing large-scale planning problems has become one of the central challenges in the planning community, deriving from the state-space explosion caused by growing objects and actions. Recently, researchers have explored the effectiveness of leveraging Large Language Models (LLMs) to generate helpful actions and states to prune the search space. However, prior works have largely overlooked integrating LLMs with domain-specific knowledge to ensure valid plans. In this paper, we propose a novel LLM-assisted planner integrated with problem decomposition, which first decomposes large planning problems into multiple simpler sub-tasks. Then we explore two novel paradigms to utilize LLMs, i.e., LLM4Inspire and LLM4Predict, to assist problem decomposition, where LLM4Inspire provides heuristic guidance according to general knowledge and LLM4Predict employs domain-specific knowledge to infer intermediate conditions. We empirically validate the effectiveness of our planner across multiple domains, demonstrating the ability of search space partition when solving large-scale planning problems. The experimental results show that LLMs effectively locate feasible solutions when pruning the search space, where infusing domain-specific knowledge into LLMs, i.e., LLM4Predict, holds particular promise compared with LLM4Inspire, which offers general knowledge within LLMs. 解决大规模规划问题已成为规划领域的核心挑战之一,源于随着对象和动作数量增长而引起的状态空间爆炸。最近,研究者探索了利用 LLMs 生成有用动作和状态以剪枝搜索空间的有效性。然而,以往工作在将 LLMs 与领域特定知识相结合以确保计划有效性方面大多被忽视。本文提出了一种新颖的基于 LLM 的辅助规划器并结合问题分解方法,该方法首先将大型规划问题分解为多个更简单的子任务。随后我们探索了两种利用 LLMs 的新范式,即 LLM4Inspire 和 LLM4Predict,以辅助问题分解:LLM4Inspire 根据常识性知识提供启发式指导,而 LLM4Predict 则利用领域特定知识推断中间条件。我们在多个领域对所提出的规划器进行了实证验证,展示了在解决大规模规划问题时对搜索空间进行划分的能力。 实验结果表明,在剪枝搜索空间时,LLMs 能有效定位可行解,其中将领域专有知识注入到 LLMs 中(即 LLM4Predict)相比于在 LLMs 中提供通用知识的 LLM4Inspire 展现出特殊的前景。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-08-15 15:08:07 UTC 发布:2025-08-15 15:08:07 UTC
#2 Landmark-Assisted Monte Carlo Planning #2 基准辅助蒙特卡洛规划
Authors: [David H. Chan](https://arxiv.org/search/?searchtype=author&query=David H. Chan), [Mark Roberts](https://arxiv.org/search/?searchtype=author&query=Mark Roberts), [Dana S. Nau](https://arxiv.org/search/?searchtype=author&query=Dana S. Nau) 作者:David H. Chan,Mark Roberts,Dana S. Nau
Landmarks–conditions that must be satisfied at some point in every solution plan–have contributed to major advancements in classical planning, but they have seldom been used in stochastic domains. We formalize probabilistic landmarks and adapt the UCT algorithm to leverage them as subgoals to decompose MDPs; core to the adaptation is balancing between greedy landmark achievement and final goal achievement. Our results in benchmark domains show that well-chosen landmarks can significantly improve the performance of UCT in online probabilistic planning, while the best balance of greedy versus long-term goal achievement is problem-dependent. The results suggest that landmarks can provide helpful guidance for anytime algorithms solving MDPs. 地标(landmarks)——在每个解规划中某个时点必须满足的条件——推动了经典规划领域的重大进展,但在随机域中很少被使用。我们形式化了概率地标,并改造了 UCT 算法以将其作为子目标来分解马尔可夫决策过程;改造的核心是在贪心地完成地标与最终目标达成之间取得平衡。基准域上的实验结果表明,精心选择的地标可以显著提高 UCT 在在线概率规划中的性能,同时贪心地标达成与长期目标达成之间的最佳平衡依赖于具体问题。结果表明,地标可以为求解 MDP 的即时算法提供有益的指导。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-08-15 14:16:14 UTC 发布:2025-08-15 14:16:14 UTC
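A minimal sketch of the balancing idea at the heart of the adaptation: a trajectory's value mixes final-goal achievement with the fraction of landmarks achieved, and the mixing weight is exactly the greedy-versus-long-term knob the paper finds to be problem-dependent. `landmarks` and `is_goal` are hypothetical predicates, not the paper's actual formulation.

```python
def evaluate_trajectory(trajectory, landmarks, is_goal, lam=0.5):
    """Value of a rollout: (1 - lam) weights reaching the final goal,
    lam weights greedy landmark achievement along the way."""
    achieved = {l for state in trajectory for l in landmarks if l(state)}
    goal_reward = 1.0 if is_goal(trajectory[-1]) else 0.0
    landmark_reward = len(achieved) / max(len(landmarks), 1)
    return (1 - lam) * goal_reward + lam * landmark_reward

traj = ["s0", "s1", "s2"]
print(evaluate_trajectory(traj,
                          landmarks=[lambda s: s == "s1"],
                          is_goal=lambda s: s == "s2"))  # 1.0: both satisfied
```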
#3 Inclusion Arena: An Open Platform for Evaluating Large Foundation Models with Real-World Apps #3 包容竞技场:一个用于用真实世界应用评估大型基础模型的开放平台
Authors: [Kangyu Wang](https://arxiv.org/search/?searchtype=author&query=Kangyu Wang), [Hongliang He](https://arxiv.org/search/?searchtype=author&query=Hongliang He), [Lin Liu](https://arxiv.org/search/?searchtype=author&query=Lin Liu), [Ruiqi Liang](https://arxiv.org/search/?searchtype=author&query=Ruiqi Liang), [Zhenzhong Lan](https://arxiv.org/search/?searchtype=author&query=Zhenzhong Lan), [Jianguo Li](https://arxiv.org/search/?searchtype=author&query=Jianguo Li) 作者:王康宇、何洪亮、刘琳、梁睿琦、兰振中、李建国
Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have ushered in a new era of AI capabilities, demonstrating near-human-level performance across diverse scenarios. While numerous benchmarks (e.g., MMLU) and leaderboards (e.g., Chatbot Arena) have been proposed to help evolve the development of LLMs and MLLMs, most rely on static datasets or crowdsourced general-domain prompts, often falling short of reflecting performance in real-world applications. To bridge this critical gap, we present Inclusion Arena, a live leaderboard that ranks models based on human feedback collected directly from AI-powered applications. Our platform integrates pairwise model comparisons into natural user interactions, ensuring evaluations reflect practical usage scenarios. For robust model ranking, we employ the Bradley-Terry model augmented with two key innovations: (1) Placement Matches, a cold-start mechanism to quickly estimate initial ratings for newly integrated models, and (2) Proximity Sampling, an intelligent comparison strategy that prioritizes battles between models of similar capabilities to maximize information gain and enhance rating stability. Extensive empirical analyses and simulations demonstrate that Inclusion Arena yields reliable and stable rankings, exhibits higher data transitivity compared to general crowdsourced datasets, and significantly mitigates the risk of malicious manipulation. By fostering an open alliance between foundation models and real-world applications, Inclusion Arena aims to accelerate the development of LLMs and MLLMs truly optimized for practical, user-centric deployments. The platform is publicly accessible at https://doraemon.alipay.com/model-ranking. 大型语言模型(LLMs)和多模态大型语言模型(MLLMs)引领了人工智能能力的新纪元,在多种情境下展示出接近人类水平的表现。尽管已有众多基准(例如 MMLU)和排行榜(例如 Chatbot Arena)旨在推动 LLMs 和 MLLMs 的发展,但大多数依赖静态数据集或众包的一般领域提示,往往难以反映在真实应用中的表现。为弥补这一关键差距,我们推出了 Inclusion Arena——一个基于从 AI 驱动应用中直接收集的人类反馈来对模型进行排名的实时排行榜。我们的平台将成对模型比较集成到自然的用户交互中,确保评估反映实际使用场景。为实现稳健的模型排名,我们采用了增强版 Bradley-Terry 模型,并引入两项关键创新:(1)安置赛(Placement Matches),一种冷启动机制,用于快速估算新接入模型的初始评分;(2)近邻采样(Proximity Sampling),一种智能比较策略,优先安排能力相近模型之间的对决,以最大化信息增益并提升评分稳定性。 大量实证分析和模拟表明,Inclusion Arena 能产生可靠且稳定的排名,较一般众包数据集表现出更高的数据传递性,并显著降低恶意操纵的风险。通过在基础模型和现实世界应用之间构建开放联盟,Inclusion Arena 致力于加速真正为实际、以用户为中心部署优化的 LLMs 和 MLLMs 的发展。该平台公开可访问,地址为 https://doraemon.alipay.com/model-ranking。
Subjects: Artificial Intelligence, Computation and Language, Human-Computer Interaction 主题:人工智能、计算与语言、人机交互
Publish: 2025-08-15 13:00:07 UTC 发布:2025-08-15 13:00:07 UTC
#4 AIM-Bench: Evaluating Decision-making Biases of Agentic LLM as Inventory Manager #4 AIM-Bench:评估作为库存管理者的主体化 LLM 的决策偏差
Authors: [Xuhua Zhao](https://arxiv.org/search/?searchtype=author&query=Xuhua Zhao), [Yuxuan Xie](https://arxiv.org/search/?searchtype=author&query=Yuxuan Xie), [Caihua Chen](https://arxiv.org/search/?searchtype=author&query=Caihua Chen), [Yuxiang Sun](https://arxiv.org/search/?searchtype=author&query=Yuxiang Sun) 作者:赵旭华、谢宇轩、陈采华、孙宇翔
Recent advances in mathematical reasoning and the long-term planning capabilities of large language models (LLMs) have precipitated the development of agents, which are being increasingly leveraged in business operations processes. Decision models to optimize inventory levels are one of the core elements of operations management. However, the capabilities of the LLM agent in making inventory decisions in uncertain contexts, as well as the decision-making biases (e.g. framing effect, etc.) of the agent, remain largely unexplored. This prompts concerns regarding the capacity of LLM agents to effectively address real-world problems, as well as the potential implications of biases that may be present. To address this gap, we introduce AIM-Bench, a novel benchmark designed to assess the decision-making behaviour of LLM agents in uncertain supply chain management scenarios through a diverse series of inventory replenishment experiments. Our results reveal that different LLMs typically exhibit varying degrees of decision bias that are similar to those observed in human beings. In addition, we explored strategies to mitigate the pull-to-centre effect and the bullwhip effect, namely cognitive reflection and implementation of information sharing. These findings underscore the need for careful consideration of the potential biases in deploying LLMs in Inventory decision-making scenarios. We hope that these insights will pave the way for mitigating human decision bias and developing human-centred decision support systems for supply chains. 近年来在数学推理和大型语言模型(LLMs)长期规划能力方面的进展促成了智能体(agents)的发展,这些智能体越来越多地被用于业务运营流程。用于优化库存水平的决策模型是运营管理的核心要素之一。然而,LLM agent 在不确定情境下做出库存决策的能力,以及该 agent 的决策偏差(例如框架效应等),在很大程度上仍未被充分研究。这引发了对 LLM agents 能否有效解决现实问题以及可能存在的偏差带来影响的担忧。为填补这一空白,我们提出了 AIM-Bench,一个新颖的基准,用以通过多样化的库存补货实验评估 LLM agents 在不确定供应链管理情境下的决策行为。我们的结果显示,不同的 LLM 通常表现出不同程度的决策偏差,这些偏差类似于在人类身上观察到的情况。 此外,我们探讨了缓解向中心拉拢效应和牛鞭效应的策略,即认知反思和实施信息共享。这些发现强调在库存决策场景中部署 LLMs 时需要对潜在偏见进行谨慎考虑。我们希望这些见解能够为缓解人类决策偏差并为供应链开发以人为本的决策支持系统铺平道路。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-08-15 11:38:19 UTC
#5 CRAFT-GUI: Curriculum-Reinforced Agent For GUI Tasks
Authors: [Songqin Nong](https://arxiv.org/search/?searchtype=author&query=Songqin Nong), [Jingxuan Xu](https://arxiv.org/search/?searchtype=author&query=Jingxuan Xu), [Sheng Zhou](https://arxiv.org/search/?searchtype=author&query=Sheng Zhou), [Jianfeng Chen](https://arxiv.org/search/?searchtype=author&query=Jianfeng Chen), [Xiaoxuan Tang](https://arxiv.org/search/?searchtype=author&query=Xiaoxuan Tang), [Tao Jiang](https://arxiv.org/search/?searchtype=author&query=Tao Jiang), [Wenhao Xu](https://arxiv.org/search/?searchtype=author&query=Wenhao Xu)
As autonomous agents become adept at understanding and interacting with graphical user interface (GUI) environments, a new era of automated task execution is emerging. Recent studies have demonstrated that Reinforcement Learning (RL) can effectively enhance agents’ performance in dynamic interactive GUI environments. However, these methods face two key limitations: (1) they overlook the significant variation in difficulty across different GUI tasks by treating the entire training data as a uniform set, which hampers the agent’s ability to adapt its learning process; and (2) most approaches collapse task-specific nuances into a single, coarse reward, leaving the agent with a uniform signal that yields inefficient policy updates. To address these limitations, we propose CRAFT-GUI, a curriculum learning framework based on Group Relative Policy Optimization (GRPO) that explicitly accounts for the varying difficulty across trajectories. To enable more fine-grained policy optimization, we design a reward function that combines simple rule-based signals with model-judged evaluation, providing richer and more nuanced feedback during training. Experimental results demonstrate that our method achieves significant improvements over previous state-of-the-art approaches, outperforming them by 5.6% on public benchmarks Android Control and 10.3% on our internal online benchmarks, respectively. These findings empirically validate the effectiveness of integrating reinforcement learning with curriculum learning in GUI interaction tasks. 随着自主代理在理解和交互图形用户界面(GUI)环境方面变得更加熟练,一种新的自动化任务执行时代正在到来。近期研究表明,强化学习(RL)能够有效提升代理在动态交互式 GUI 环境中的表现。然而,这些方法存在两个主要局限:(1)它们将整个训练数据视为一个统一集合,忽视了不同 GUI 任务之间显著的难度差异,进而妨碍了代理适应其学习过程;(2)大多数方法将任务特有的细微差别压缩为单一的粗糙奖励,使代理仅收到统一信号,导致策略更新效率低下。为了解决这些局限,我们提出了 CRAFT-GUI,一种基于群体相对策略优化(GRPO)的课程学习框架,明确考虑了轨迹间的难度差异。为了实现更细粒度的策略优化,我们设计了一个奖励函数,将简单的基于规则的信号与模型评估相结合,在训练过程中提供更丰富、更细腻的反馈。 实验结果表明,我们的方法在先前最先进的方法基础上取得了显著改进,在公共基准 Android Control 上分别超越它们 5.6%,在我们内部的在线基准上超越 10.3%。这些发现从经验上验证了在 GUI 交互任务中将强化学习与课程学习相结合的有效性。
Subjects: Artificial Intelligence, Human-Computer Interaction 主题:人工智能,人与计算机交互
Publish: 2025-08-15 09:55:02 UTC 发布:2025-08-15 09:55:02 UTC
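The reward design and curriculum described above lend themselves to a compact illustration. The following is a hedged sketch assuming a trajectory-level mix of a rule-based check and a model-judged score, plus a difficulty-ordered curriculum; all names and the mixing weight `alpha` are hypothetical, not CRAFT-GUI's code.

```python
# Illustrative composite reward + difficulty-bucketed curriculum (not the authors' code).
from dataclasses import dataclass

@dataclass
class Trajectory:
    actions: list[str]
    target_action: str
    difficulty: float  # e.g., estimated from task length or failure rate

def rule_reward(traj: Trajectory) -> float:
    # Cheap exact-match signal: did the final action hit the target?
    return 1.0 if traj.actions and traj.actions[-1] == traj.target_action else 0.0

def judge_reward(traj: Trajectory) -> float:
    # Stand-in for a model-based judge scoring partial progress in [0, 1].
    return 0.5  # replace with an LLM/VLM judge call

def composite_reward(traj: Trajectory, alpha: float = 0.7) -> float:
    return alpha * rule_reward(traj) + (1 - alpha) * judge_reward(traj)

def curriculum(trajs: list[Trajectory], n_stages: int = 3) -> list[list[Trajectory]]:
    # Order trajectories by difficulty and split into stages: easy -> hard.
    ordered = sorted(trajs, key=lambda t: t.difficulty)
    k = max(1, len(ordered) // n_stages)
    return [ordered[i:i + k] for i in range(0, len(ordered), k)]
```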
#6 SAGE: Scale-Aware Gradual Evolution for Continual Knowledge Graph Embedding #6 SAGE:面向持续知识图谱嵌入的尺度感知渐进进化
Authors: [Yifei Li](https://arxiv.org/search/?searchtype=author&query=Yifei Li), [Lingling Zhang](https://arxiv.org/search/?searchtype=author&query=Lingling Zhang), [Hang Yan](https://arxiv.org/search/?searchtype=author&query=Hang Yan), [Tianzhe Zhao](https://arxiv.org/search/?searchtype=author&query=Tianzhe Zhao), [Zihan Ma](https://arxiv.org/search/?searchtype=author&query=Zihan Ma), [Muye Huang](https://arxiv.org/search/?searchtype=author&query=Muye Huang), [Jun Liu](https://arxiv.org/search/?searchtype=author&query=Jun Liu) 作者:李一飞,张玲玲,闫航,赵天哲,马子涵,黄沐烨,刘俊
Traditional knowledge graph (KG) embedding methods aim to represent entities and relations in a low-dimensional space, primarily focusing on static graphs. However, real-world KGs are dynamically evolving with the constant addition of entities, relations and facts. To address such dynamic nature of KGs, several continual knowledge graph embedding (CKGE) methods have been developed to efficiently update KG embeddings to accommodate new facts while maintaining learned knowledge. As KGs grow at different rates and scales in real-world scenarios, existing CKGE methods often fail to consider the varying scales of updates and lack systematic evaluation throughout the entire update process. In this paper, we propose SAGE, a scale-aware gradual evolution framework for CKGE. Specifically, SAGE first determines the embedding dimensions based on the update scale and expands the embedding space accordingly. The Dynamic Distillation mechanism is further employed to balance the preservation of learned knowledge and the incorporation of new facts. We conduct extensive experiments on seven benchmarks, and the results show that SAGE consistently outperforms existing baselines, with a notable improvement of 1.38% in MRR, 1.25% in H@1 and 1.6% in H@10. Furthermore, experiments comparing SAGE with methods using fixed embedding dimensions show that SAGE achieves optimal performance on every snapshot, demonstrating the importance of adaptive embedding dimensions in CKGE. The codes of SAGE are publicly available at: https://github.com/lyfxjtu/Dynamic-Embedding. 传统知识图谱(KG)嵌入方法旨在将实体和关系表示在低维空间中,主要集中在静态图上。然而,现实世界的 KG 是动态演化的,实体、关系和事实不断被添加。为了解决 KG 的这种动态特性,已经开发出若干持续知识图谱嵌入(CKGE)方法,以在高效更新 KG 嵌入以适应新事实的同时保持已学到的知识。由于 KG 在现实场景中以不同速率和规模增长,现有的 CKGE 方法通常未能考虑更新规模的差异,也缺乏对整个更新过程的系统评估。在本文中,我们提出了 SAGE,一种面向 CKGE 的规模感知渐进演化框架。具体而言,SAGE 首先根据更新规模确定嵌入维度并相应扩展嵌入空间。进一步采用动态蒸馏机制以平衡已学知识的保留与新事实的融入。我们在七个基准上进行了广泛实验,结果表明 SAGE 持续优于现有基线模型,在 MRR 上显著提升了 1.38%,在 H@1 上提升了 1.25%,在 H@10 上提升了 1.6%。此外,将 SAGE 与使用固定嵌入维度的方法进行比较的实验表明,SAGE 在每个快照上都能达到最佳性能,证明了在持续知识图谱嵌入(CKGE)中自适应嵌入维度的重要性。SAGE 的代码已公开,地址为:https://github.com/lyfxjtu/Dynamic-Embedding。
Subjects: Artificial Intelligence, Machine Learning 主题:人工智能、机器学习
Publish: 2025-08-15 09:23:23 UTC 发布:2025-08-15 09:23:23 UTC
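A toy numpy sketch of the scale-aware expansion idea: grow the embedding table with new entities and widen the dimensions in proportion to the update scale. The dimension rule and initialization below are assumptions for illustration, not SAGE's actual mechanism.

```python
# Minimal sketch, assuming "scale-aware" means the width grows with update size.
import numpy as np

def expand_embeddings(emb: np.ndarray, n_new_entities: int, dims_per_scale: int = 4) -> np.ndarray:
    """Add rows for new entities; widen dimensions proportionally to update scale."""
    n_old, d_old = emb.shape
    scale = n_new_entities / max(n_old, 1)
    d_new = d_old + int(np.ceil(scale * dims_per_scale * d_old))  # hypothetical rule
    out = np.zeros((n_old + n_new_entities, d_new))
    out[:n_old, :d_old] = emb                                     # preserve learned knowledge
    rng = np.random.default_rng(0)
    out[n_old:, :] = rng.normal(0, 0.1, (n_new_entities, d_new))  # init new rows
    return out

emb = np.random.default_rng(1).normal(size=(100, 32))
emb2 = expand_embeddings(emb, n_new_entities=25)
print(emb.shape, "->", emb2.shape)  # (100, 32) -> (125, 64)
```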
#7 Beyond Solving Math Quiz: Evaluating the Ability of Large Reasoning Models to Ask for Information
Authors: [Youcheng Huang](https://arxiv.org/search/?searchtype=author&query=Youcheng Huang), [Bowen Qin](https://arxiv.org/search/?searchtype=author&query=Bowen Qin), [Chen Huang](https://arxiv.org/search/?searchtype=author&query=Chen Huang), [Duanyu Feng](https://arxiv.org/search/?searchtype=author&query=Duanyu Feng), [Xi Yang](https://arxiv.org/search/?searchtype=author&query=Xi Yang), [Wenqiang Lei](https://arxiv.org/search/?searchtype=author&query=Wenqiang Lei) 作者:黄有成、秦博文、黄晨、冯端宇、杨曦、雷文强
Large Reasoning Models (LRMs) have demonstrated remarkable problem-solving abilities in mathematics, as evaluated by existing benchmarks exclusively on well-defined problems. However, such an evaluation setup constitutes a critical gap, since a genuine intelligent agent should not only solve problems (as a math quiz solver), but also be able to ask for information when the problems lack sufficient information, enabling proactivity in responding to users’ requests. To bridge this gap, we propose a new dataset consisting of two types of incomplete problems with diverse contexts. Based on the dataset, our systematic evaluation of LRMs reveals their inability to proactively ask for information. In addition, we uncover behaviors related to overthinking and hallucination in LRMs, and highlight the potential and challenges of supervised fine-tuning in learning such an ability. We hope to provide new insights into developing LRMs with genuine intelligence, rather than just solving problems. 大型推理模型(LRMs)在数学问题求解方面展示了卓越的能力,但现有基准仅在定义明确的问题上对其进行评估。然而,这样的评估设置存在关键缺口,因为真正的智能体不仅应能解决问题(如数学测验答题器),还应在问题信息不足时主动询问信息,从而在响应用户请求时表现出主动性。为弥补这一缺口,我们提出了一个包含两类、情境多样的不完整问题的新数据集。基于该数据集,我们对大型推理模型的系统性评估揭示了其在主动询问信息方面的不足。此外,我们发现了大型推理模型有关过度思考与幻觉的行为,并强调了有监督微调在学习此类能力方面的潜力与挑战。我们希望为开发具备真正智能的大型推理模型提供新的见解,而不仅仅是解决问题。
Subjects: Artificial Intelligence, Computation and Language, Information Retrieval 主题:人工智能,计算与语言,信息检索
Publish: 2025-08-15 06:42:00 UTC 发布:2025-08-15 06:42:00 UTC
#8 On Strong and Weak Admissibility in Non-Flat Assumption-Based Argumentation #8 关于非平坦基于假设论证中强可采纳性与弱可采纳性
Authors: [Matti Berthold](https://arxiv.org/search/?searchtype=author&query=Matti Berthold), [Lydia Blümel](https://arxiv.org/search/?searchtype=author&query=Lydia Blümel), [Anna Rapberger](https://arxiv.org/search/?searchtype=author&query=Anna Rapberger) 作者:Matti Berthold、Lydia Blümel、Anna Rapberger
In this work, we broaden the investigation of admissibility notions in the context of assumption-based argumentation (ABA). More specifically, we study two prominent alternatives to the standard notion of admissibility from abstract argumentation, namely strong and weak admissibility, and introduce the respective preferred, complete and grounded semantics for general (sometimes called non-flat) ABA. To do so, we use abstract bipolar set-based argumentation frameworks (BSAFs) as formal playground since they concisely capture the relations between assumptions and are expressive enough to represent general non-flat ABA frameworks, as recently shown. While weak admissibility has been recently investigated for a restricted fragment of ABA in which assumptions cannot be derived (flat ABA), strong admissibility has not been investigated for ABA so far. We introduce strong admissibility for ABA and investigate desirable properties. We furthermore extend the recent investigations of weak admissibility in the flat ABA fragment to the non-flat case. We show that the central modularization property is maintained under classical, strong, and weak admissibility. We also show that strongly and weakly admissible semantics in non-flat ABA share some of the shortcomings of standard admissible semantics and discuss ways to address these. 在这项工作中,我们拓展了在基于假设的论证(ABA)背景下对可采纳性概念的研究。更具体地,我们研究了来自抽象论证的标准可采纳性之外的两个重要替代概念,即强可采纳性和弱可采纳性,并为一般(有时称为非平坦)ABA 引入了相应的首选、完全和基础语义。为此,我们使用抽象双极集合式论证框架(BSAF)作为形式化的研究平台,因为它们简洁地捕捉了假设之间的关系,并且正如最近所展示的那样,其表达能力足以表示一般的非平坦 ABA 框架。虽然弱可采纳性最近已在一个受限的 ABA 片段中被研究——在该片段中假设不能被推导(平坦 ABA),但强可采纳性迄今尚未在 ABA 中得到研究。我们为 ABA 引入了强可采纳性并研究了其期望的性质。我们进一步将最近在平坦 ABA 片段中对弱可采纳性的研究扩展到非平坦情形。我们证明了在经典、强和弱可采纳性下,核心的模块化属性仍然保持不变。我们还展示了在非平坦的 ABA 中,强可采纳语义和弱可采纳语义在某些方面与标准可采纳语义存在相同的缺点,并讨论了解决这些问题的方法。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-08-15 03:13:07 UTC 发布时间:2025-08-15 03:13:07 UTC
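For readers less familiar with the baseline notion these variants depart from, here is a short sketch of classical admissibility in an abstract argumentation framework. This is the standard textbook semantics only, not the paper's strong/weak non-flat ABA constructions.

```python
# Classical admissibility check: S is admissible iff it is conflict-free and
# defends each of its members against every attacker.
def is_admissible(args: set[str], attacks: set[tuple[str, str]], S: set[str]) -> bool:
    conflict_free = not any((a, b) in attacks for a in S for b in S)

    def defended(x: str) -> bool:
        attackers = {a for (a, b) in attacks if b == x}
        return all(any((d, a) in attacks for d in S) for a in attackers)

    return conflict_free and all(defended(x) for x in S)

A = {"a", "b", "c"}
R = {("a", "b"), ("b", "c")}          # a attacks b, b attacks c
print(is_admissible(A, R, {"a", "c"}))  # True: conflict-free, and a defends c
```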
#9 Learn to optimize for automatic proton PBS treatment planning for H&N cancers #9 学习为头颈癌的自动质子 PBS 治疗计划进行优化
Authors: [Qingqing Wang](https://arxiv.org/search/?searchtype=author&query=Qingqing Wang), [Liqiang Xiao](https://arxiv.org/search/?searchtype=author&query=Liqiang Xiao), [Chang Chang](https://arxiv.org/search/?searchtype=author&query=Chang Chang) 作者:王青青,肖立强,常昌
Proton PBS treatment planning for H&N cancers involves numerous conflicting objectives, requiring significant effort from human planners to balance and satisfy multiple clinical goals during planning. To achieve this, experience-demanding objective parameter adjustment and computationally expensive inverse optimization are performed iteratively. Extensive efforts have been made to automatically adjust objective parameters, but the most time-consuming component, i.e., inverse optimization, still relies heavily on theory-driven approaches. We propose a data-driven inverse optimizer and integrate it into a PPO-based automatic treatment planning framework to automatically generate high-quality plans within a clinically acceptable planning time. The inverse optimizer is an L2O method that predicts update steps by learning from the task-specific data distribution. For the first time, we integrate techniques designed for long-context processing, originally developed for LLMs, into a Transformer-based L2O framework to address the scalability issue of existing L2O methods. The PPO framework functions as an outer-loop virtual planner, autonomously adjusting objective parameters through a policy network, and the dose predictor is used to initialize objective parameters. The inner-loop L2O inverse optimizer computes machine-deliverable MU values based on objectives refined by the PPO policy network. Data from 97 patients were collected for this study; compared with L-BFGS-B, our L2O-based inverse optimizer improves effectiveness and efficiency by 22.97% and 36.41%, respectively. In conjunction with the PPO-based learned virtual planner, plans generated by our framework within an average of 2.55 hours show improved or comparable OAR sparing with superior target coverage for patients with different prescription dose levels, numbers of target volumes, beam angles, etc., compared with human-generated plans. 质子 PBS 治疗计划在头颈部癌症中的制定涉及众多相互冲突的目标,要求人工规划者在计划过程中付出大量努力以平衡并满足多个临床目标。为此,需要反复进行对目标参数的经验性调整以及计算代价高昂的反向优化。尽管已经做了大量工作以自动调整目标参数,但最耗时的组成部分——即反向优化——仍在很大程度上依赖于理论驱动的方法。我们提出了一种数据驱动的反向优化器,并将其整合到基于 PPO 的自动化治疗计划框架中,以在临床可接受的规划时间内自动生成高质量的计划。该反向优化器是一种通过从任务特定数据分布中学习来预测更新步骤的 L2O 方法。我们首次将为处理长上下文而设计的技术(最初为 LLMs 开发)整合到基于 Transformer 的 L2O 框架中,以解决现有 L2O 方法的可扩展性问题。PPO 框架作为外循环的虚拟规划器,通过策略网络自主调整目标参数,剂量预测器用于初始化目标参数。内循环的基于 L2O 的逆向优化器根据由 PPO 策略网络优化过的目标计算可机施放的 MU 值。本研究收集了 97 例患者的数据,与 L-BFGS-B 相比,我们基于 L2O 的逆向优化器在有效性和效率上分别提升了 22.97%和 36.41%。结合基于 PPO 的学习型虚拟规划器,我们框架生成的计划平均在 2.55 小时内完成,与人工生成的计划相比,在不同处方剂量水平、靶区数量、束角等情况下,对靶区覆盖更优,同时对危险器官的保护得到改善或相当。
Subjects: Artificial Intelligence, Machine Learning 主题:人工智能、机器学习
Publish: 2025-08-14 21:50:31 UTC 发布时间:2025-08-14 21:50:31 UTC
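The learn-to-optimize (L2O) loop can be caricatured as follows: a learned module predicts the update step from the gradient instead of applying a hand-derived rule. The toy quadratic objective and the momentum-style step predictor below are illustrative stand-ins for the paper's Transformer-based optimizer and dose objectives.

```python
# Conceptual L2O sketch (illustrative, not the paper's planner).
import numpy as np

def objective(x: np.ndarray) -> float:
    return float(np.sum((x - 3.0) ** 2))  # toy stand-in for the dose objective

def grad(x: np.ndarray) -> np.ndarray:
    return 2.0 * (x - 3.0)

def learned_step(g: np.ndarray, state: dict) -> np.ndarray:
    # Stand-in for a learned step-predictor: a momentum-like rule whose
    # coefficients would themselves be *learned* from task data in real L2O.
    state["m"] = 0.9 * state.get("m", 0.0) + 0.1 * g
    return -0.5 * state["m"]

x, state = np.zeros(4), {}
for _ in range(200):
    x = x + learned_step(grad(x), state)
print(round(objective(x), 6))  # approaches 0
```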
#10 From Individual to Multi-Agent Algorithmic Recourse: Minimizing the Welfare Gap via Capacitated Bipartite Matching #10 从个体到多智能体的算法补救:通过有容量的二分匹配最小化福利差距
Authors: [Zahra Khotanlou](https://arxiv.org/search/?searchtype=author&query=Zahra Khotanlou), [Kate Larson](https://arxiv.org/search/?searchtype=author&query=Kate Larson), [Amir-Hossein Karimi](https://arxiv.org/search/?searchtype=author&query=Amir-Hossein Karimi) 作者:Zahra Khotanlou、Kate Larson、Amir-Hossein Karimi
Decision makers are increasingly relying on machine learning in sensitive situations. In such settings, algorithmic recourse aims to provide individuals with actionable and minimally costly steps to reverse unfavorable AI-driven decisions. While existing research predominantly focuses on single-individual (i.e., seeker) and single-model (i.e., provider) scenarios, real-world applications often involve multiple interacting stakeholders. Optimizing outcomes for seekers under an individual welfare approach overlooks the inherently multi-agent nature of real-world systems, where individuals interact and compete for limited resources. To address this, we introduce a novel framework for multi-agent algorithmic recourse that accounts for multiple recourse seekers and recourse providers. We model this many-to-many interaction as a capacitated weighted bipartite matching problem, where matches are guided by both recourse cost and provider capacity. Edge weights, reflecting recourse costs, are optimized for social welfare while quantifying the welfare gap between individual welfare and this collectively feasible outcome. We propose a three-layer optimization framework: (1) basic capacitated matching, (2) optimal capacity redistribution to minimize the welfare gap, and (3) cost-aware optimization balancing welfare maximization with capacity adjustment costs. Experimental validation on synthetic and real-world datasets demonstrates that our framework enables the many-to-many algorithmic recourse to achieve near-optimal welfare with minimum modification in system settings. This work extends algorithmic recourse from individual recommendations to system-level design, providing a tractable path toward higher social welfare while maintaining individual actionability. 决策者在敏感情境中日益依赖机器学习。在此类情境下,算法补救旨在为个体提供可行且代价最小的步骤,以扭转不利的由人工智能驱动的决策。尽管现有研究主要关注单个个体(即寻求者)和单一模型(即提供者)的情形,现实应用通常涉及多个相互作用的利益相关者。仅以个体福利为目标优化寻求者的结果,忽视了现实系统固有的多主体特性,在这些系统中个体相互作用并为有限资源竞争。为了解决这一问题,我们提出了一个新的多主体算法补救框架,考虑多个补救寻求者和补救提供者。我们将这种多对多的交互建模为一个有容量限制的带权二分匹配问题,其中匹配由补救成本和提供者容量共同决定。边权反映补救成本,针对社会福利进行优化,同时量化个体福利与这种集体可行结果之间的福利差距。 我们提出了一个三层优化框架: (1) 基础的有容量限制的匹配,(2) 为最小化福利差距而进行的最优容量重新分配,(3) 考虑成本的优化,在最大化福利与容量调整成本之间进行权衡。 在合成和真实数据集上的实验验证表明,我们的框架使多对多的算法补救能够以最小的系统设置修改实现接近最优的社会福利。 这项工作将算法补救从个体推荐扩展到系统层面的设计,为在保持个体可执行性的同时朝着更高社会福利提供了一条可行路径。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-08-14 21:04:24 UTC 发布:2025-08-14 21:04:24 协调世界时
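The capacitated weighted bipartite matching at the core of this framework can be solved as a min-cost flow. A small sketch with networkx follows; the seekers, providers, capacities, and costs are made-up toy data, not the paper's experiments.

```python
# Many-to-many recourse assignment as capacitated min-cost matching via flow.
import networkx as nx

seekers = ["s1", "s2", "s3"]
providers = {"p1": 2, "p2": 1}                   # provider -> capacity
cost = {("s1", "p1"): 4, ("s1", "p2"): 2,        # recourse cost per pairing
        ("s2", "p1"): 3, ("s2", "p2"): 5,
        ("s3", "p1"): 1, ("s3", "p2"): 6}

G = nx.DiGraph()
for s in seekers:
    G.add_edge("SRC", s, capacity=1, weight=0)    # each seeker matched once
for p, cap in providers.items():
    G.add_edge(p, "SNK", capacity=cap, weight=0)  # provider capacity
for (s, p), c in cost.items():
    G.add_edge(s, p, capacity=1, weight=c)

flow = nx.max_flow_min_cost(G, "SRC", "SNK")
matches = [(s, p) for s in seekers for p in providers if flow[s].get(p, 0) == 1]
total = sum(cost[m] for m in matches)
print(matches, "total recourse cost:", total)     # minimum-cost feasible matching
```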
#11 Grounding Rule-Based Argumentation Using Datalog #11 使用 Datalog 为基于规则的论证提供基础
Authors: [Martin Diller](https://arxiv.org/search/?searchtype=author&query=Martin Diller), [Sarah Alice Gaggl](https://arxiv.org/search/?searchtype=author&query=Sarah Alice Gaggl), [Philipp Hanisch](https://arxiv.org/search/?searchtype=author&query=Philipp Hanisch), [Giuseppina Monterosso](https://arxiv.org/search/?searchtype=author&query=Giuseppina Monterosso), [Fritz Rauschenbach](https://arxiv.org/search/?searchtype=author&query=Fritz Rauschenbach) 作者:Martin Diller、Sarah Alice Gaggl、Philipp Hanisch、Giuseppina Monterosso、Fritz Rauschenbach
ASPIC+ is one of the main general frameworks for rule-based argumentation for AI. Although first-order rules are commonly used in ASPIC+ examples, most existing approaches to reason over rule-based argumentation only support propositional rules. To enable reasoning over first-order instances, a preliminary grounding step is required. As groundings can lead to an exponential increase in the size of the input theories, intelligent procedures are needed. However, there is a lack of dedicated solutions for ASPIC+. Therefore, we propose an intelligent grounding procedure that keeps the size of the grounding manageable while preserving the correctness of the reasoning process. To this end, we translate the first-order ASPIC+ instance into a Datalog program and query a Datalog engine to obtain ground substitutions to perform the grounding of rules and contraries. Additionally, we propose simplifications specific to the ASPIC+ formalism to avoid grounding of rules that have no influence on the reasoning process. Finally, we performed an empirical evaluation of a prototypical implementation to show scalability. ASPIC+ 是用于人工智能的基于规则的论证的主要通用框架之一。尽管在 ASPIC+ 的示例中常常使用一阶规则,但现有的大多数用于基于规则的论证推理的方法仅支持命题规则。为了对一阶实例进行推理,需要预备的地面化步骤。由于地面化可能导致输入理论规模呈指数级增加,因此需要智能化的处理程序。然而,针对 ASPIC+ 缺乏专门的解决方案。因此,我们提出了一种智能地面化程序,它在保持推理过程正确性的同时控制地面化规模。为此,我们将一阶 ASPIC+ 实例翻译为 Datalog 程序,并查询 Datalog 引擎以获取用于对规则和对立项进行地面化的地面替代。此外,我们提出了针对 ASPIC+ 形式化的简化方法,以避免地面化那些对推理过程没有影响的规则。最后,我们对原型实现进行了实证评估以展示其可扩展性。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-08-14 17:57:32 UTC 发布:2025-08-14 17:57:32 UTC
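The grounding query answered by the Datalog engine amounts to enumerating variable substitutions for a rule body against known facts. A naive Python illustration of that enumeration (not the authors' translation or engine) is below.

```python
# Toy grounding: find all substitutions making every body atom a known fact.
from itertools import product

facts = {("edge", "a", "b"), ("edge", "b", "c")}
constants = {"a", "b", "c"}
rule_body = [("edge", "X", "Y"), ("edge", "Y", "Z")]  # path(X,Z) :- edge(X,Y), edge(Y,Z).

def ground(rule_body, facts, constants):
    vars_ = sorted({t for atom in rule_body for t in atom[1:] if t.isupper()})
    for vals in product(constants, repeat=len(vars_)):
        sub = dict(zip(vars_, vals))
        atoms = [(a[0], *[sub.get(t, t) for t in a[1:]]) for a in rule_body]
        if all(at in facts for at in atoms):
            yield sub

print(list(ground(rule_body, facts, constants)))  # [{'X': 'a', 'Y': 'b', 'Z': 'c'}]
```

A real engine avoids this exponential enumeration with join-based evaluation, which is exactly why the paper delegates the task to Datalog.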
#12 Is ChatGPT-5 Ready for Mammogram VQA? #12 ChatGPT-5 准备好用于乳腺 X 线图像视觉问答(Mammogram VQA)了吗?
Authors: [Qiang Li](https://arxiv.org/search/?searchtype=author&query=Qiang Li), [Shansong Wang](https://arxiv.org/search/?searchtype=author&query=Shansong Wang), [Mingzhe Hu](https://arxiv.org/search/?searchtype=author&query=Mingzhe Hu), [Mojtaba Safari](https://arxiv.org/search/?searchtype=author&query=Mojtaba Safari), [Zachary Eidex](https://arxiv.org/search/?searchtype=author&query=Zachary Eidex), [Xiaofeng Yang](https://arxiv.org/search/?searchtype=author&query=Xiaofeng Yang) 作者:李强、王善松、胡明喆、Mojtaba Safari、Zachary Eidex、杨晓峰
Mammogram visual question answering (VQA) integrates image interpretation with clinical reasoning and has potential to support breast cancer screening. We systematically evaluated the GPT-5 family and the GPT-4o model on four public mammography datasets (EMBED, InBreast, CMMD, CBIS-DDSM) for BI-RADS assessment, abnormality detection, and malignancy classification tasks. GPT-5 was consistently the best-performing model but lagged behind both human experts and domain-specific fine-tuned models. On EMBED, GPT-5 achieved the highest scores among GPT variants in density (56.8%), distortion (52.5%), mass (64.5%), calcification (63.5%), and malignancy (52.8%) classification. On InBreast, it attained 36.9% BI-RADS accuracy, 45.9% abnormality detection, and 35.0% malignancy classification. On CMMD, GPT-5 reached 32.3% abnormality detection and 55.0% malignancy accuracy. On CBIS-DDSM, it achieved 69.3% BI-RADS accuracy, 66.0% abnormality detection, and 58.2% malignancy accuracy. Compared with human expert estimates, GPT-5 exhibited lower sensitivity (63.5%) and specificity (52.3%). While GPT-5 exhibits promising capabilities for screening tasks, its performance remains insufficient for high-stakes clinical imaging applications without targeted domain adaptation and optimization. However, the tremendous improvements in performance from GPT-4o to GPT-5 show a promising trend in the potential for general large language models (LLMs) to assist with mammography VQA tasks. 乳腺 X 线图像视觉问答(VQA)将图像解读与临床推理相结合,具有支持乳腺癌筛查的潜力。我们系统评估了 GPT-5 系列和 GPT-4o 模型在四个公开乳腺摄影数据集(EMBED、InBreast、CMMD、CBIS-DDSM)上针对 BI-RADS 评估、异常检测和恶性肿瘤分类任务的表现。GPT-5 一直是表现最好的模型,但仍落后于人类专家和针对领域微调的模型。在 EMBED 上,GPT-5 在密度(56.8%)、结构扭曲(52.5%)、肿块(64.5%)、钙化(63.5%)和恶性(52.8%)分类中在 GPT 变体中取得最高分。在 InBreast 上,它达到 36.9%的 BI-RADS 准确率、45.9%的异常检测率和 35.0%的恶性分类率。在 CMMD 上,GPT-5 达到 32.3%的异常检测率和 55.0%的恶性率准确率。在 CBIS-DDSM 上,它取得 69.3%的 BI-RADS 准确率、66.0%的异常检测率和 58.2%的恶性分类准确率。与人类专家估计相比,GPT-5 的敏感性较低(63.5%),特异性也较低(52.3%)。虽然 GPT-5 在筛查任务上展现了有前景的能力,但在未经过针对性领域适配和优化的情况下,其性能仍不足以应对高风险的临床影像应用。然而,从 GPT-4o 到 GPT-5 的性能大幅提升显示出一个有希望的趋势,即通用大型语言模型(LLMs)在辅助钼靶乳腺 X 线摄影视觉问答(VQA)任务方面具有潜力。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-15 17:56:24 UTC 发布:2025-08-15 17:56:24 UTC
#13 Controlling Multimodal LLMs via Reward-guided Decoding #13 通过奖励引导的解码控制多模态 LLMs
Authors: [Oscar Mañas](https://arxiv.org/search/?searchtype=author&query=Oscar Mañas), [Pierluca D’Oro](https://arxiv.org/search/?searchtype=author&query=Pierluca D’Oro), [Koustuv Sinha](https://arxiv.org/search/?searchtype=author&query=Koustuv Sinha), [Adriana Romero-Soriano](https://arxiv.org/search/?searchtype=author&query=Adriana Romero-Soriano), [Michal Drozdzal](https://arxiv.org/search/?searchtype=author&query=Michal Drozdzal), [Aishwarya Agrawal](https://arxiv.org/search/?searchtype=author&query=Aishwarya Agrawal) 作者:Oscar Mañas、Pierluca D’Oro、Koustuv Sinha、Adriana Romero-Soriano、Michal Drozdzal、Aishwarya Agrawal
As Multimodal Large Language Models (MLLMs) gain widespread applicability, it is becoming increasingly desirable to adapt them for diverse user needs. In this paper, we study the adaptation of MLLMs through controlled decoding. To achieve this, we introduce the first method for reward-guided decoding of MLLMs and demonstrate its application in improving their visual grounding. Our method involves building reward models for visual grounding and using them to guide the MLLM’s decoding process. Concretely, we build two separate reward models to independently control the degree of object precision and recall in the model’s output. Our approach enables on-the-fly controllability of an MLLM’s inference process in two ways: first, by giving control over the relative importance of each reward function during decoding, allowing a user to dynamically trade off object precision for recall in image captioning tasks; second, by giving control over the breadth of the search during decoding, allowing the user to control the trade-off between the amount of test-time compute and the degree of visual grounding. We evaluate our method on standard object hallucination benchmarks, showing that it provides significant controllability over MLLM inference, while consistently outperforming existing hallucination mitigation methods. 随着多模态大型语言模型(MLLMs)获得广泛应用,将它们适配以满足多样化用户需求变得日益必要。本文研究通过受控解码来适配 MLLMs。为此,我们提出了首个用于 MLLMs 的奖励引导解码方法,并展示了其在改善模型视觉定位上的应用。我们的方法包括为视觉定位构建奖励模型,并使用这些模型来引导 MLLM 的解码过程。具体而言,我们构建了两个独立的奖励模型,分别控制模型输出中目标精确度与召回率的程度。我们的方法以两种方式实现了对 MLLM 推理过程的即时可控性:其一,通过在解码过程中控制各奖励函数的相对重要性,使用户能够在图像描述任务中动态地在目标精确度和召回率之间进行权衡;其二,通过控制解码时搜索的广度,使用户能够在测试时计算量与视觉定位程度之间进行权衡。 我们在标准的对象虚构基准上评估了我们的方法,结果表明它能够显著控制多模态大模型(MLLM)的推理,同时在性能上持续优于现有的虚构缓解方法。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning 主题:计算机视觉与模式识别、人工智能、计算与语言、机器学习
Publish: 2025-08-15 17:29:06 UTC 发布:2025-08-15 17:29:06 UTC
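A schematic of reward-guided decoding, assuming candidate continuations are re-scored by the LM score plus a weighted sum of two reward models and the top-k kept; the scoring stubs, weights, and candidate strings below are invented for illustration.

```python
# Reward-guided re-scoring sketch: LM score + weighted precision/recall rewards;
# k controls search breadth (test-time compute vs. grounding trade-off).
def lm_logprob(text: str) -> float:
    return -0.1 * len(text)                 # stub for the MLLM's own score

def reward_precision(text: str) -> float:
    return 1.0 if "dog" in text else 0.0    # stub: penalize ungrounded objects

def reward_recall(text: str) -> float:
    return min(1.0, text.count(",") * 0.5)  # stub: favor mentioning more objects

def rescore(cands: list[str], w_p: float, w_r: float, k: int) -> list[str]:
    scored = [(lm_logprob(c) + w_p * reward_precision(c) + w_r * reward_recall(c), c)
              for c in cands]
    return [c for _, c in sorted(scored, reverse=True)[:k]]

cands = ["a dog on grass", "a dog, a ball, a tree", "two cats"]
print(rescore(cands, w_p=1.0, w_r=0.2, k=2))  # raising w_r favors richer captions
```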
#14 Pretrained Conformers for Audio Fingerprinting and Retrieval #14 预训练 Conformer 模型用于音频指纹识别与检索
Authors: [Kemal Altwlkany](https://arxiv.org/search/?searchtype=author&query=Kemal Altwlkany), [Elmedin Selmanovic](https://arxiv.org/search/?searchtype=author&query=Elmedin Selmanovic), [Sead Delalic](https://arxiv.org/search/?searchtype=author&query=Sead Delalic) 作者:Kemal Altwlkany、Elmedin Selmanovic、Sead Delalic
Conformers have shown great results in speech processing due to their ability to capture both local and global interactions. In this work, we utilize a self-supervised contrastive learning framework to train conformer-based encoders that are capable of generating unique embeddings for small segments of audio, generalizing well to previously unseen data. We achieve state-of-the-art results for audio retrieval tasks while using only 3 seconds of audio to generate embeddings. Our models are almost completely immune to temporal misalignments and achieve state-of-the-art results in cases of other audio distortions such as noise, reverb or extreme temporal stretching. Code and models are made publicly available and the results are easy to reproduce as we train and test using popular and freely available datasets of different sizes. Conformer 在语音处理领域表现优异,归因于其能够同时捕捉局部和全局交互。在本工作中,我们利用自监督对比学习框架训练基于 Conformer 的编码器,能够为小片段音频生成独特的嵌入,并能很好地泛化到先前未见的数据。我们在音频检索任务上取得了最先进的结果,且仅使用 3 秒音频即可生成嵌入。我们的模型几乎完全不受时间错位的影响,并在其他音频失真情况下(如噪声、混响或极端时间拉伸)也取得了最先进的成绩。代码和模型已公开,结果易于复现,因为我们使用不同规模的流行且可自由获取的数据集进行训练和测试。
Subjects: Sound, Artificial Intelligence, Information Retrieval, Audio and Speech Processing 主题:声音、人工智能、信息检索、音频与语音处理
Publish: 2025-08-15 17:19:09 UTC 发布:2025-08-15 17:19:09 UTC
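The self-supervised objective behind such encoders is typically an InfoNCE-style contrastive loss over paired views of the same segment. The numpy sketch below shows that standard loss form; pairing it with conformer embeddings of ~3-second segments is the assumption drawn from the abstract.

```python
# Minimal InfoNCE / NT-Xent-style contrastive loss over segment embeddings.
import numpy as np

def info_nce(z1: np.ndarray, z2: np.ndarray, tau: float = 0.1) -> float:
    """z1[i], z2[i] are embeddings of two views (e.g., distortions) of segment i."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / tau                      # pairwise cosine similarities
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    idx = np.arange(len(z1))                   # positives sit on the diagonal
    return float(-logp[idx, idx].mean())

rng = np.random.default_rng(0)
anchor = rng.normal(size=(8, 64))
positive = anchor + 0.05 * rng.normal(size=(8, 64))  # lightly distorted views
print(info_nce(anchor, positive))  # small loss: each view matches its positive
```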
#15 CryptoScope: Utilizing Large Language Models for Automated Cryptographic Logic Vulnerability Detection #15 CryptoScope:利用大型语言模型进行自动化密码逻辑漏洞检测
Authors: [Zhihao Li](https://arxiv.org/search/?searchtype=author&query=Zhihao Li), [Zimo Ji](https://arxiv.org/search/?searchtype=author&query=Zimo Ji), [Tao Zheng](https://arxiv.org/search/?searchtype=author&query=Tao Zheng), [Hao Ren](https://arxiv.org/search/?searchtype=author&query=Hao Ren), [Xiao Lan](https://arxiv.org/search/?searchtype=author&query=Xiao Lan) 作者:李志皓、季子墨、郑涛、任昊、兰潇
Cryptographic algorithms are fundamental to modern security, yet their implementations frequently harbor subtle logic flaws that are hard to detect. We introduce CryptoScope, a novel framework for automated cryptographic vulnerability detection powered by Large Language Models (LLMs). CryptoScope combines Chain-of-Thought (CoT) prompting with Retrieval-Augmented Generation (RAG), guided by a curated cryptographic knowledge base containing over 12,000 entries. We evaluate CryptoScope on LLM-CLVA, a benchmark of 92 cases primarily derived from real-world CVE vulnerabilities, complemented by cryptographic challenges from major Capture The Flag (CTF) competitions and synthetic examples across 11 programming languages. CryptoScope consistently improves performance over strong LLM baselines, boosting DeepSeek-V3 by 11.62%, GPT-4o-mini by 20.28%, and GLM-4-Flash by 28.69%. Additionally, it identifies 9 previously undisclosed flaws in widely used open-source cryptographic projects. 加密算法是现代安全的基础,但其实现中常常存在难以察觉的微妙逻辑缺陷。我们引入了 CryptoScope,一种由大型语言模型(LLMs)驱动的自动化加密漏洞检测新框架。CryptoScope 将链式思维(Chain-of-Thought,CoT)提示与检索增强生成(RAG)相结合,辅以一个包含超过 12,000 条目、精心策划的密码学知识库。我们在 LLM-CLVA 上评估了 CryptoScope,该基准包含 92 个案例,主要来源于真实世界的 CVE 漏洞,并补充了来自主要夺旗赛(CTF)竞赛的密码学挑战以及涵盖 11 种编程语言的合成示例。CryptoScope 在强大的 LLM 基线之上持续提升性能,使 DeepSeek-V3 提升了 11.62%,GPT-4o-mini 提升了 20.28%,GLM-4-Flash 提升了 28.69%。此外,它还在广泛使用的开源密码学项目中识别出 9 处此前未公开的缺陷。
Subjects: Cryptography and Security, Artificial Intelligence 主题:密码学与安全性 , 人工智能
Publish: 2025-08-15 17:07:54 UTC 发布:2025-08-15 17:07:54 UTC
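The CoT-plus-RAG pipeline shape can be sketched in a few lines: retrieve relevant knowledge-base entries for the code under audit, then prepend them to a step-by-step prompt. The retrieval heuristic, entries, and prompt wording are toy stand-ins for the system's 12,000-entry knowledge base and LLM backend.

```python
# Toy RAG assembly for cryptographic auditing (illustrative only).
def retrieve(kb: list[dict], code: str, k: int = 2) -> list[dict]:
    def overlap(entry: dict) -> int:
        # Toy lexical retrieval: rank entries by keyword overlap with the code.
        return sum(kw in code for kw in entry["keywords"])
    return sorted(kb, key=overlap, reverse=True)[:k]

kb = [
    {"rule": "ECB mode leaks plaintext structure", "keywords": ["ECB", "AES.new"]},
    {"rule": "Static IVs break CBC semantic security", "keywords": ["iv =", "CBC"]},
    {"rule": "MD5 is collision-broken; avoid for signatures", "keywords": ["md5"]},
]
code = "cipher = AES.new(key, AES.MODE_ECB)"
hits = retrieve(kb, code)
prompt = ("You are a cryptography auditor. Think step by step.\n"
          + "\n".join(f"- Known pitfall: {h['rule']}" for h in hits)
          + f"\n\nCode:\n{code}\n\nIdentify any cryptographic logic flaws:")
print(prompt)
```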
#16 Visual Perception Engine: Fast and Flexible Multi-Head Inference for Robotic Vision Tasks #16 视觉感知引擎:用于机器人视觉任务的快速且灵活的多头推理
Authors: [Jakub Łucki](https://arxiv.org/search/?searchtype=author&query=Jakub Łucki), [Jonathan Becktor](https://arxiv.org/search/?searchtype=author&query=Jonathan Becktor), [Georgios Georgakis](https://arxiv.org/search/?searchtype=author&query=Georgios Georgakis), [Robert Royce](https://arxiv.org/search/?searchtype=author&query=Robert Royce), [Shehryar Khattak](https://arxiv.org/search/?searchtype=author&query=Shehryar Khattak) 作者:Jakub Łucki、Jonathan Becktor、Georgios Georgakis、Robert Royce、Shehryar Khattak
Deploying multiple machine learning models on resource-constrained robotic platforms for different perception tasks often results in redundant computations, large memory footprints, and complex integration challenges. In response, this work presents Visual Perception Engine (VPEngine), a modular framework designed to enable efficient GPU usage for visual multitasking while maintaining extensibility and developer accessibility. Our framework architecture leverages a shared foundation model backbone that extracts image representations, which are efficiently shared, without any unnecessary GPU-CPU memory transfers, across multiple specialized task-specific model heads running in parallel. This design eliminates the computational redundancy inherent in the feature extraction component when deploying traditional sequential models while enabling dynamic task prioritization based on application demands. We demonstrate our framework’s capabilities through an example implementation using DINOv2 as the foundation model with multiple task (depth, object detection and semantic segmentation) heads, achieving up to 3x speedup compared to sequential execution. Building on CUDA Multi-Process Service (MPS), VPEngine offers efficient GPU utilization and maintains a constant memory footprint while allowing per-task inference frequencies to be adjusted dynamically during runtime. The framework is written in Python and is open source with ROS2 C++ (Humble) bindings for ease of use by the robotics community across diverse robotic platforms. Our example implementation demonstrates end-to-end real-time performance at ≥50 Hz on NVIDIA Jetson Orin AGX for TensorRT optimized models. 在资源受限的机器人平台上为不同的感知任务部署多个机器学习模型,常常导致重复计算、较大的内存占用以及复杂的集成挑战。为此,本工作提出了视觉感知引擎(VPEngine),一个模块化框架,旨在在保持可扩展性和开发者易用性的同时,实现对视觉多任务的高效 GPU 使用。我们的框架架构利用共享的基础模型主干来提取图像表示,这些表示被高效共享于多个并行运行的专用任务特定模型头,而无需进行任何不必要的 GPU-CPU 内存传输。该设计消除了在部署传统串行模型时特征提取组件固有的计算冗余,同时能够基于应用需求进行动态任务优先级调整。我们通过一个示例实现展示了框架的能力:使用 DINOv2 作为基础模型,并配合多个任务(深度估计、目标检测与语义分割)头,相较于串行执行实现高达 3 倍的加速。基于 CUDA 多进程服务(MPS),VPEngine 提供高效的 GPU 利用率并保持恒定的内存占用,同时允许在运行时动态调整每个任务的推理频率。该框架以 Python 编写并开源,提供适用于机器人社区在各种机器人平台上使用的 ROS2 C++(Humble)绑定。我们的示例实现展示了在 NVIDIA Jetson Orin AGX 上对 TensorRT 优化模型以 ≥ 50 Hz 实现端到端实时性能。
Subjects: Robotics, Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning 主题:机器人学、人工智能、计算机视觉与模式识别、机器学习
Publish: 2025-08-15 16:42:23 UTC 发布时间:2025-08-15 16:42:23 UTC
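Structurally, the shared-backbone/multi-head design looks like the PyTorch sketch below: the backbone runs once per frame and every active task head consumes the same on-GPU features. DINOv2 is replaced by a stub so the snippet stays self-contained; head shapes and names are invented.

```python
# Shared-backbone / multi-head inference sketch (illustrative, not VPEngine).
import torch
import torch.nn as nn

class Backbone(nn.Module):                       # stand-in for DINOv2
    def forward(self, x):                        # x: (B, 3, H, W)
        return x.flatten(1)[:, :256]             # fake 256-d image features

class MultiHeadEngine(nn.Module):
    def __init__(self, heads: dict[str, nn.Module]):
        super().__init__()
        self.backbone = Backbone()
        self.heads = nn.ModuleDict(heads)

    @torch.no_grad()
    def forward(self, x, active: list[str]):
        feats = self.backbone(x)                 # computed once, shared on-GPU
        return {name: self.heads[name](feats) for name in active}

engine = MultiHeadEngine({
    "depth": nn.Linear(256, 1),
    "detection": nn.Linear(256, 4),
    "segmentation": nn.Linear(256, 10),
})
out = engine(torch.randn(2, 3, 32, 32), active=["depth", "segmentation"])
print({k: v.shape for k, v in out.items()})      # only active heads run
```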
#17 Aware First, Think Less: Dynamic Boundary Self-Awareness Drives Extreme Reasoning Efficiency in Large Language Models #17 先感知,少思考:动态边界自我感知驱动大型语言模型的极致推理效率
Authors: [Qiguang Chen](https://arxiv.org/search/?searchtype=author&query=Qiguang Chen), [Dengyun Peng](https://arxiv.org/search/?searchtype=author&query=Dengyun Peng), [Jinhao Liu](https://arxiv.org/search/?searchtype=author&query=Jinhao Liu), [HuiKang Su](https://arxiv.org/search/?searchtype=author&query=HuiKang Su), [Jiannan Guan](https://arxiv.org/search/?searchtype=author&query=Jiannan Guan), [Libo Qin](https://arxiv.org/search/?searchtype=author&query=Libo Qin), [Wanxiang Che](https://arxiv.org/search/?searchtype=author&query=Wanxiang Che) 作者:陈祺光、彭登云、刘金豪、苏慧康、管建南、秦利博、车万祥
Recent advancements in large language models (LLMs) have greatly improved their capabilities on complex reasoning tasks through Long Chain-of-Thought (CoT). However, this approach often results in substantial redundancy, impairing computational efficiency and causing significant delays in real-time applications. To improve efficiency, current methods often rely on human-defined difficulty priors, which do not align with the difficulty the LLM itself perceives, leading to inefficiencies. In this paper, we introduce the Dynamic Reasoning-Boundary Self-Awareness Framework (DR. SAF), which enables models to dynamically assess and adjust their reasoning depth in response to problem complexity. DR. SAF integrates three key components: Boundary Self-Awareness Alignment, Adaptive Reward Management, and a Boundary Preservation Mechanism. These components allow models to optimize their reasoning processes, balancing efficiency and accuracy without compromising performance. Our experimental results demonstrate that DR. SAF achieves a 49.27% reduction in total response tokens with minimal loss in accuracy. The framework also delivers a 6.59x gain in token efficiency and a 5x reduction in training time, making it well-suited to resource-limited settings. During extreme training, DR. SAF can even surpass traditional instruction-based models in token efficiency with more than 16% accuracy improvement. 近年来大规模语言模型(LLMs)在通过长链式思维(CoT)处理复杂推理任务方面取得了显著进展。然而,这种方法常常导致大量冗余,降低计算效率并在实时应用中造成显著延迟。为提高效率,当前方法通常依赖人工定义的难度先验,但这些先验与 LLM 自身感知到的难度不一致,导致低效。本文提出了动态推理边界自感知框架(DR. SAF),使模型能够根据问题复杂度动态评估并调整其推理深度。DR. SAF 整合了三大关键组件:边界自感知对齐、自适应奖励管理和边界保全机制。这些组件使模型能够优化推理过程,在不损害性能的前提下平衡效率与准确性。我们的实验结果表明,DR. SAF 在准确性损失极小的情况下,将总响应令牌数减少了 49.27%。该框架在标记效率上实现了 6.59 倍的提升,并使训练时间减少了 5 倍,使其非常适合资源受限的环境。在极端训练情况下,DR. SAF 在标记效率上甚至可以超越传统基于指令的模型,准确率提高超过 16%。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-15 16:40:29 UTC 发布日期:2025-08-15 16:40:29 UTC
#18 ADMIRE-BayesOpt: Accelerated Data MIxture RE-weighting for Language Models with Bayesian Optimization #18 ADMIRE-BayesOpt:使用贝叶斯优化加速语言模型数据混合重加权
Authors: [Shengzhuang Chen](https://arxiv.org/search/?searchtype=author&query=Shengzhuang Chen), [Xu Ouyang](https://arxiv.org/search/?searchtype=author&query=Xu Ouyang), [Michael Arthur Leopold Pearce](https://arxiv.org/search/?searchtype=author&query=Michael Arthur Leopold Pearce), [Thomas Hartvigsen](https://arxiv.org/search/?searchtype=author&query=Thomas Hartvigsen), [Jonathan Richard Schwarz](https://arxiv.org/search/?searchtype=author&query=Jonathan Richard Schwarz) 作者:Shengzhuang Chen、Xu Ouyang、Michael Arthur Leopold Pearce、Thomas Hartvigsen、Jonathan Richard Schwarz
Determining the optimal data mixture for large language model training remains a challenging problem with an outsized impact on performance. In practice, language model developers continue to rely on heuristic exploration since no learning-based approach has emerged as a reliable solution. In this work, we propose to view the selection of training data mixtures as a black-box hyperparameter optimization problem, for which Bayesian Optimization is a well-established class of appropriate algorithms. Firstly, we cast data mixture learning as a sequential decision-making problem, in which we aim to find a suitable trade-off between the computational cost of training exploratory (proxy-) models and final mixture performance. Secondly, we systematically explore the properties of transferring mixtures learned at a small scale to larger-scale experiments, providing insights and highlighting opportunities for research at a modest scale. By proposing Multi-fidelity Bayesian Optimization as a suitable method in this common scenario, we introduce a natural framework to balance experiment cost with model fit, avoiding the risks of overfitting to smaller scales while minimizing the number of experiments at high cost. We present results for pre-training and instruction finetuning across models ranging from 1 million to 7 billion parameters, varying from simple architectures to state-of-the-art models and benchmarks spanning dozens of datasets. We demonstrate consistently strong results relative to a wide range of benchmarks, showing speed-ups of over 500% in determining the best data mixture on our largest experiments relative to recent baselines. In addition, we broaden access to research by sharing ADMIRE IFT Runs, a dataset of 460 full training & evaluation runs across various model sizes worth over 13,000 GPU hours, greatly reducing the cost of conducting research in this area. 确定用于大型语言模型训练的最佳数据混合仍然是一个具有重大影响的挑战性问题。在实践中,语言模型开发者仍然依赖启发式探索,因为尚未出现一种基于学习的方法能成为可靠的解决方案。在这项工作中,我们提出将训练数据混合的选择视为一个黑盒超参数优化问题,而贝叶斯优化是一类成熟且适用的算法。首先,我们将数据混合学习表述为一个序贯决策问题,目标是在训练用于探索的(代理)模型的计算成本与最终混合性能之间找到合适的权衡。其次,我们系统性地探讨了在小规模上学到的混合如何转移到大规模实验的特性,提供了见解并强调了在适度规模下开展研究的机会。通过提出多保真度贝叶斯优化作为这一常见情形的合适方法,我们引入了一个在实验成本与模型拟合之间取得自然平衡的框架,避免在小规模上过拟合的风险,同时将高成本实验的次数降到最低。我们展示了在从 1 百万到 70 亿参数不等的模型上进行预训练和指令微调的结果,这些模型从简单架构到最先进模型不等,基准涵盖数十个数据集。相较于各种基线,我们始终展示出强劲的结果,在我们最大的实验中确定最佳数据混合时相对于近期基线实现了超过 500%的加速。此外,我们通过分享 ADMIRE IFT Runs —— 一个包含 460 次完整训练与评估运行、涵盖多种模型规模、总计超过 13,000 GPU 小时的数据集 —— 来扩大研究可及性,从而大幅降低在该领域开展研究的成本。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-15 15:53:09 UTC 发布时间:2025-08-15 15:53:09 UTC
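The black-box framing can be pictured with a plain BO loop over mixture weights: a GP surrogate plus a lower-confidence-bound pick over Dirichlet-sampled candidates. This single-fidelity sketch with a synthetic proxy loss only shows that framing; it is not the paper's multi-fidelity method, and the "ideal" mixture is invented.

```python
# Single-fidelity BO sketch over data-mixture weights (illustrative).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def proxy_loss(w: np.ndarray) -> float:          # stand-in for training a proxy model
    ideal = np.array([0.5, 0.3, 0.2])            # hypothetical best mixture
    return float(np.sum((w - ideal) ** 2))

rng = np.random.default_rng(0)
def sample_mixtures(n: int) -> np.ndarray:       # points on the 3-way simplex
    return rng.dirichlet(np.ones(3), size=n)

X = sample_mixtures(5)
y = np.array([proxy_loss(w) for w in X])
for _ in range(20):
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    cands = sample_mixtures(256)
    mu, sd = gp.predict(cands, return_std=True)
    pick = cands[np.argmin(mu - 1.0 * sd)]       # lower-confidence-bound acquisition
    X = np.vstack([X, pick])
    y = np.append(y, proxy_loss(pick))
print("best mixture:", X[np.argmin(y)].round(3), "loss:", y.min().round(4))
```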
#19 A Comprehensive Perspective on Explainable AI across the Machine Learning Workflow #19 在机器学习工作流中关于可解释人工智能的全面视角
Authors: [George Paterakis](https://arxiv.org/search/?searchtype=author&query=George Paterakis), [Andrea Castellani](https://arxiv.org/search/?searchtype=author&query=Andrea Castellani), [George Papoutsoglou](https://arxiv.org/search/?searchtype=author&query=George Papoutsoglou), [Tobias Rodemann](https://arxiv.org/search/?searchtype=author&query=Tobias Rodemann), [Ioannis Tsamardinos](https://arxiv.org/search/?searchtype=author&query=Ioannis Tsamardinos) 作者:George Paterakis、Andrea Castellani、George Papoutsoglou、Tobias Rodemann、Ioannis Tsamardinos
Artificial intelligence is reshaping science and industry, yet many users still regard its models as opaque “black boxes”. Conventional explainable artificial-intelligence methods clarify individual predictions but overlook the upstream decisions and downstream quality checks that determine whether insights can be trusted. In this work, we present Holistic Explainable Artificial Intelligence (HXAI), a user-centric framework that embeds explanation into every stage of the data-analysis workflow and tailors those explanations to users. HXAI unifies six components (data, analysis set-up, learning process, model output, model quality, communication channel) into a single taxonomy and aligns each component with the needs of domain experts, data analysts and data scientists. A 112-item question bank covers these needs; our survey of contemporary tools highlights critical coverage gaps. Grounded in theories of human explanation, principles from human-computer interaction and findings from empirical user studies, HXAI identifies the characteristics that make explanations clear, actionable and cognitively manageable. A comprehensive taxonomy operationalises these insights, reducing terminological ambiguity and enabling rigorous coverage analysis of existing toolchains. We further demonstrate how AI agents that embed large-language models can orchestrate diverse explanation techniques, translating technical artifacts into stakeholder-specific narratives that bridge the gap between AI developers and domain experts. Departing from traditional surveys or perspective articles, this work melds concepts from multiple disciplines, lessons from real-world projects and a critical synthesis of the literature to advance a novel, end-to-end viewpoint on transparency, trustworthiness and responsible AI deployment. 人工智能正在重塑科学和产业,然而许多用户仍将其模型视为不透明的“黑箱”。传统的可解释人工智能方法澄清了单个预测,但忽视了决定洞见是否可靠的上游决策和下游质量检查。在这项工作中,我们提出了整体可解释人工智能(HXAI),一种以用户为中心的框架,将解释嵌入数据分析工作流的每个阶段,并根据用户进行定制。HXAI 将六个组成部分(数据、分析设置、学习过程、模型输出、模型质量、沟通渠道)统一为单一分类法,并将每个组成部分与领域专家、数据分析师和数据科学家的需求对齐。一个包含 112 项的问题库覆盖了这些需求;我们对当代工具的调查突出了关键的覆盖缺口。基于人类解释理论、人与计算机交互的原则以及实证用户研究的发现,HXAI 确定了使解释清晰、可操作且认知上可管理的特征。 一个全面的分类法将这些见解操作化,减少术语歧义并使现有工具链的覆盖分析变得严谨可行。我们进一步展示了嵌入大型语言模型的 AI 代理如何协调多种解释技术,将技术产物转化为面向不同利益相关者的叙述,弥合 AI 开发者与领域专家之间的差距。不同于传统的综述或观点文章,本研究融合了来自多学科的概念、现实项目的经验教训以及对现有文献的批判性综合,提出了关于透明性、可可信性和负责任 AI 部署的全新端到端视角。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-15 15:15:25 UTC 发布:2025-08-15 15:15:25 UTC
#20 Weighted First Order Model Counting for Two-variable Logic with Axioms on Two Relations #20 带权一阶模型计数用于带有两关系公理的二变量逻辑
Authors: [Qipeng Kuang](https://arxiv.org/search/?searchtype=author&query=Qipeng Kuang), [Václav Kůla](https://arxiv.org/search/?searchtype=author&query=Václav Kůla), [Ondřej Kuželka](https://arxiv.org/search/?searchtype=author&query=Ondřej Kuželka), [Yuanhong Wang](https://arxiv.org/search/?searchtype=author&query=Yuanhong Wang), [Yuyi Wang](https://arxiv.org/search/?searchtype=author&query=Yuyi Wang) 作者:邝启鹏、Václav Kůla、Ondřej Kuželka、王元洪、王宇毅
The Weighted First-Order Model Counting Problem (WFOMC) asks to compute the weighted sum of models of a given first-order logic sentence over a given domain. The boundary between fragments for which WFOMC can be computed in polynomial time relative to the domain size lies between the two-variable fragment (FO2) and the three-variable fragment (FO3). It is known that WFOMC for FO3 is #P1-hard, while polynomial-time algorithms exist for computing WFOMC for FO2 and C2, possibly extended by certain axioms such as the linear order axiom, the acyclicity axiom, and the connectedness axiom. All existing research has concentrated on extending the fragment with axioms on a single distinguished relation, leaving a gap in understanding the complexity boundary of axioms on multiple relations. In this study, we explore the extension of the two-variable fragment by axioms on two relations, presenting both negative and positive results. We show that WFOMC for FO2 with two linear order relations and FO2 with two acyclic relations are #P1-hard. Conversely, we provide an algorithm in time polynomial in the domain size for WFOMC of C2 with a linear order relation, its successor relation and another successor relation. 加权一阶模型计数问题(WFOMC)要求在给定域上计算给定一阶逻辑句子的模型的加权和。相对于域大小可多项式时间计算的片段边界位于二变量片段(FO2)和三变量片段(FO3)之间。已知针对 FO3 的 WFOMC 是 #P1-难的,而对于 FO2 和 C2(可扩展以支持某些公理,例如线性序公理、无环公理和连通性公理),存在计算 WFOMC 的多项式时间算法。现有的所有研究都集中在用单一特殊关系的公理来扩展该片段,导致在理解对多重关系施加公理时的复杂性边界上存在空白。在这项研究中,我们探讨了通过对两条关系施加公理来扩展二变量片段,给出了否定性和肯定性的结果。我们证明了带有两个线性序关系的 FO2 的 WFOMC 和带有两个无环关系的 FO2 的 WFOMC 是 #P1-难的。相反,我们提供了一个在多项式时间内(关于域大小)用于具有线性序关系、其后继关系以及另一个后继关系的 C2 的加权一阶模型计数(WFOMC)的算法。
Subjects: Logic in Computer Science, Artificial Intelligence 主题:计算机科学中的逻辑、人工智能
Publish: 2025-08-15 14:54:17 UTC 发布:2025-08-15 14:54:17 UTC
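For orientation, the quantity at issue can be written in its standard textbook form. This is a hedged reconstruction from the general WFOMC literature, not quoted from the paper:

```latex
% Standard WFOMC definition (requires amsmath for \substack): sum over all
% models of \varphi on the domain [n] = {1,...,n}, weighting each model by the
% product of w over true ground atoms and \bar{w} over false ones.
\[
\mathrm{WFOMC}(\varphi, n, w, \bar{w})
  = \sum_{\substack{\mu \,\models\, \varphi \\ \mathrm{dom}(\mu)=[n]}}
    \prod_{\mu \,\models\, a} w\big(\mathrm{pred}(a)\big)
    \prod_{\mu \,\not\models\, a} \bar{w}\big(\mathrm{pred}(a)\big)
\]
```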
#21 Towards Faithful Class-level Self-explainability in Graph Neural Networks by Subgraph Dependencies #21 通过子图依赖关系迈向图神经网络的可信类级自解释性
Authors: [Fanzhen Liu](https://arxiv.org/search/?searchtype=author&query=Fanzhen Liu), [Xiaoxiao Ma](https://arxiv.org/search/?searchtype=author&query=Xiaoxiao Ma), [Jian Yang](https://arxiv.org/search/?searchtype=author&query=Jian Yang), [Alsharif Abuadbba](https://arxiv.org/search/?searchtype=author&query=Alsharif Abuadbba), [Kristen Moore](https://arxiv.org/search/?searchtype=author&query=Kristen Moore), [Surya Nepal](https://arxiv.org/search/?searchtype=author&query=Surya Nepal), [Cecile Paris](https://arxiv.org/search/?searchtype=author&query=Cecile Paris), [Quan Z. Sheng](https://arxiv.org/search/?searchtype=author&query=Quan Z. Sheng), [Jia Wu](https://arxiv.org/search/?searchtype=author&query=Jia Wu) 作者:刘凡臻、马晓晓、杨健、Alsharif Abuadbba、Kristen Moore、Surya Nepal、Cecile Paris、Quan Z. Sheng、吴佳
Enhancing the interpretability of graph neural networks (GNNs) is crucial to ensure their safe and fair deployment. Recent work has introduced self-explainable GNNs that generate explanations as part of training, improving both faithfulness and efficiency. Some of these models, such as ProtGNN and PGIB, learn class-specific prototypes, offering a potential pathway toward class-level explanations. However, their evaluations focus solely on instance-level explanations, leaving open the question of whether these prototypes meaningfully generalize across instances of the same class. In this paper, we introduce GraphOracle, a novel self-explainable GNN framework designed to generate and evaluate class-level explanations for GNNs. Our model jointly learns a GNN classifier and a set of structured, sparse subgraphs that are discriminative for each class. We propose a novel integrated training that captures graph–subgraph–prediction dependencies efficiently and faithfully, validated through a masking-based evaluation strategy. This strategy enables us to retroactively assess whether prior methods like ProtGNN and PGIB deliver effective class-level explanations. Our results show that they do not. In contrast, GraphOracle achieves superior fidelity, explainability, and scalability across a range of graph classification tasks. We further demonstrate that GraphOracle avoids the computational bottlenecks of previous methods—like Monte Carlo Tree Search—by using entropy-regularized subgraph selection and lightweight random walk extraction, enabling faster and more scalable training. These findings position GraphOracle as a practical and principled solution for faithful class-level self-explainability in GNNs. 提高图神经网络(GNN)可解释性对于其安全和公平部署至关重要。近期工作提出了自解释 GNN,这类模型在训练过程中生成解释,从而提升了可信度和效率。其中一些模型,如 ProtGNN 和 PGIB,学习类特定的原型,为类级别解释提供了潜在路径。然而,它们的评估仅聚焦于实例级解释,尚未解决这些原型是否能在同一类的不同实例间具有有意义的泛化性的问题。在本文中,我们引入了 GraphOracle,一种新颖的自解释 GNN 框架,旨在为 GNN 生成并评估类级别解释。我们的模型联合学习一个 GNN 分类器与一组结构化、稀疏的子图,这些子图对每个类别具有判别性。我们提出了一种新颖的整合训练方法,有效且可信地捕捉图 – 子图 – 预测依赖关系,并通过基于掩码的评估策略进行了验证。该策略使我们能够回溯性地评估像 ProtGNN 和 PGIB 这样的既有方法是否提供了有效的类级别解释。 我们的结果表明事实并非如此。相反,GraphOracle 在一系列图分类任务中在保真性、可解释性和可扩展性方面表现更优。我们进一步证明,GraphOracle 通过使用熵正则化的子图选择和轻量级随机游走提取,避免了以往方法(如蒙特卡洛树搜索) — 的计算瓶颈,从而实现了更快、更可扩展的训练。这些发现将 GraphOracle 定位为在图神经网络中实现可信类别级自解释的一个实用且有原理的解决方案。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-15 14:44:11 UTC 发布时间:2025-08-15 14:44:11 UTC
#22 Sim2Dust: Mastering Dynamic Waypoint Tracking on Granular Media #22 Sim2Dust:掌握松散介质上的动态航路点跟踪
Authors: [Andrej Orsula](https://arxiv.org/search/?searchtype=author&query=Andrej Orsula), [Matthieu Geist](https://arxiv.org/search/?searchtype=author&query=Matthieu Geist), [Miguel Olivares-Mendez](https://arxiv.org/search/?searchtype=author&query=Miguel Olivares-Mendez), [Carol Martinez](https://arxiv.org/search/?searchtype=author&query=Carol Martinez) 作者:Andrej Orsula、Matthieu Geist、Miguel Olivares-Mendez、Carol Martinez
Reliable autonomous navigation across the unstructured terrains of distant planetary surfaces is a critical enabler for future space exploration. However, the deployment of learning-based controllers is hindered by the inherent sim-to-real gap, particularly for the complex dynamics of wheel interactions with granular media. This work presents a complete sim-to-real framework for developing and validating robust control policies for dynamic waypoint tracking on such challenging surfaces. We leverage massively parallel simulation to train reinforcement learning agents across a vast distribution of procedurally generated environments with randomized physics. These policies are then transferred zero-shot to a physical wheeled rover operating in a lunar-analogue facility. Our experiments systematically compare multiple reinforcement learning algorithms and action smoothing filters to identify the most effective combinations for real-world deployment. Crucially, we provide strong empirical evidence that agents trained with procedural diversity achieve superior zero-shot performance compared to those trained on static scenarios. We also analyze the trade-offs of fine-tuning with high-fidelity particle physics, which offers minor gains in low-speed precision at a significant computational cost. Together, these contributions establish a validated workflow for creating reliable learning-based navigation systems, marking a critical step towards deploying autonomous robots in the final frontier. 在遥远行星表面那种非结构化地形上实现可靠的自主导航,是未来太空探索的关键推动力。然而,学习型控制器的部署受制于固有的仿真到现实差距,尤其是在车轮与颗粒介质复杂相互作用的动力学方面。本文提出了一个完整的仿真到现实框架,用于开发和验证在此类挑战性表面上进行动态航路点跟踪的鲁棒控制策略。我们利用大规模并行仿真,在大量程序生成且物理参数随机化的环境分布上训练强化学习代理。然后,这些策略以零次迁移直接部署到在月球模拟设施中运行的实物轮式巡视车。我们的实验系统地比较了多种强化学习算法和动作平滑滤波器,以识别最适合真实世界部署的组合。至关重要的是,我们提供了有力的实证证据,表明在程序多样性中训练的代理相比在静态场景中训练的代理,能获得更优的零次迁移性能。 我们还分析了用高保真粒子物理进行微调的权衡,这在低速精度上带来了小幅提升,但代价是显著的计算成本。综上,这些贡献确立了一个经过验证的工作流程,用于创建可靠的基于学习的导航系统,是向在最终边界部署自主机器人迈出的关键一步。
Subjects: Robotics, Artificial Intelligence, Machine Learning 主题:机器人学、人工智能、机器学习
Publish: 2025-08-15 14:30:07 UTC 发布:2025-08-15 14:30:07 协调世界时(UTC)
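At its simplest, the procedural-diversity recipe reduces to sampling simulator physics per episode; the parameter names and ranges below are invented for illustration and are not the paper's randomization schedule.

```python
# Domain-randomization sketch: each episode draws physics parameters from wide
# ranges so the policy never overfits a single terrain configuration.
import random

def sample_physics(rng: random.Random) -> dict:
    return {
        "wheel_friction": rng.uniform(0.2, 1.2),
        "soil_stiffness": rng.uniform(1e4, 1e6),
        "rover_mass_kg": rng.uniform(8.0, 15.0),
    }

rng = random.Random(42)
for episode in range(3):
    params = sample_physics(rng)   # applied to the simulator before each reset
    print(f"episode {episode}: {params}")
```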
#23 Handwritten Text Recognition of Historical Manuscripts Using Transformer-Based Models #23 使用基于 Transformer 的模型进行历史手稿手写文本识别
Author: [Erez Meoded](https://arxiv.org/search/?searchtype=author&query=Erez Meoded) 作者:Erez Meoded
Historical handwritten text recognition (HTR) is essential for unlocking the cultural and scholarly value of archival documents, yet digitization is often hindered by scarce transcriptions, linguistic variation, and highly diverse handwriting styles. In this study, we apply TrOCR, a state-of-the-art transformer-based HTR model, to 16th-century Latin manuscripts authored by Rudolf Gwalther. We investigate targeted image preprocessing and a broad suite of data augmentation techniques, introducing four novel augmentation methods designed specifically for historical handwriting characteristics. We also evaluate ensemble learning approaches to leverage the complementary strengths of augmentation-trained models. On the Gwalther dataset, our best single-model augmentation (Elastic) achieves a Character Error Rate (CER) of 1.86, while a top-5 voting ensemble achieves a CER of 1.60, representing a 50% relative improvement over the best reported TrOCR_BASE result and a 42% improvement over the previous state of the art. These results highlight the impact of domain-specific augmentations and ensemble strategies in advancing HTR performance for historical manuscripts. 历史手写文本识别(HTR)对于发掘档案文献的文化与学术价值至关重要,然而数字化常常受限于转录稀缺、语言变体和高度多样的书写风格。在本研究中,我们将 TrOCR —— 一种最先进的基于 Transformer 的 HTR 模型 —— 应用于鲁道夫·格瓦尔特(Rudolf Gwalther)撰写的 16 世纪拉丁手稿。我们研究了有针对性的图像预处理和广泛的数据增强技术,并引入了四种专为历史手写特征设计的新型增强方法。我们还评估了集成学习方法,以利用经过增强训练的模型的互补优势。在格瓦尔特数据集上,我们表现最好的单模型增强(Elastic)实现了 1.86 的字符错误率(CER),而一个前 5 名投票的集成则达到了 1.60 的 CER——相比已报告的最佳 TrOCR_BASE 结果相对提升了 50%,相比先前最先进方法提升了 42%。这些结果突显了领域特定增强和集成策略在提升历史手稿 HTR 性能方面的作用。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Digital Libraries, Machine Learning 主题:计算机视觉与模式识别、人工智能、数字图书馆、机器学习
Publish: 2025-08-15 14:20:58 UTC 发布:2025-08-15 14:20:58 UTC
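CER, the metric reported above, is Levenshtein edit distance divided by reference length; a self-contained implementation follows (the abstract's 1.86 and 1.60 are presumably percentages, though it does not state units).

```python
# Character error rate: edit distance over reference length.
def cer(ref: str, hyp: str) -> float:
    # Standard Levenshtein distance via a rolling dynamic-programming row.
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = min(dp[j] + 1,                       # deletion
                      dp[j - 1] + 1,                   # insertion
                      prev + (ref[i - 1] != hyp[j - 1]))  # substitution/match
            prev, dp[j] = dp[j], cur
    return dp[n] / max(m, 1)

print(f"{100 * cer('gratia', 'gratja'):.1f}%")  # 16.7%: one substitution in six chars
```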
#24 RMSL: Weakly-Supervised Insider Threat Detection with Robust Multi-sphere Learning #24 RMSL:基于弱监督的内部威胁检测与鲁棒多球学习
Authors: [Yang Wang](https://arxiv.org/search/?searchtype=author&query=Yang Wang), [Yaxin Zhao](https://arxiv.org/search/?searchtype=author&query=Yaxin Zhao), [Xinyu Jiao](https://arxiv.org/search/?searchtype=author&query=Xinyu Jiao), [Sihan Xu](https://arxiv.org/search/?searchtype=author&query=Sihan Xu), [Xiangrui Cai](https://arxiv.org/search/?searchtype=author&query=Xiangrui Cai), [Ying Zhang](https://arxiv.org/search/?searchtype=author&query=Ying Zhang), [Xiaojie Yuan](https://arxiv.org/search/?searchtype=author&query=Xiaojie Yuan) 作者:王洋、赵雅欣、焦欣宇、徐思涵、蔡向瑞、张颖、袁晓洁
Insider threat detection aims to identify malicious user behavior by analyzing logs that record user interactions. Due to the lack of fine-grained behavior-level annotations, detecting specific behavior-level anomalies within user behavior sequences is challenging. Unsupervised methods face high false positive rates and miss rates due to the inherent ambiguity between normal and anomalous behaviors. In this work, we instead introduce weak labels of behavior sequences, which have lower annotation costs, i.e., the training labels (anomalous or normal) are at sequence-level instead of behavior-level, to enhance the detection capability for behavior-level anomalies by learning discriminative features. To achieve this, we propose a novel framework called Robust Multi-sphere Learning (RMSL). RMSL uses multiple hyper-spheres to represent the normal patterns of behaviors. Initially, a one-class classifier is constructed as a good anomaly-supervision-free starting point. Building on this, using multiple instance learning and adaptive behavior-level self-training debiasing based on model prediction confidence, the framework further refines hyper-spheres and feature representations using weak sequence-level labels. This approach enhances the model’s ability to distinguish between normal and anomalous behaviors. Extensive experiments demonstrate that RMSL significantly improves the performance of behavior-level insider threat detection. 内部威胁检测旨在通过分析记录用户交互的日志来识别恶意用户行为。由于缺乏细粒度的行为级注释,在用户行为序列中检测特定的行为级异常具有挑战性。无监督方法由于正常与异常行为之间的固有模糊性,面临较高的误报率和漏报率。在本工作中,我们改为引入行为序列的弱标注,这种标注具有较低的注释成本——即训练标签(异常或正常)是在序列级别而非行为级别——从而通过学习判别特征来增强对行为级异常的检测能力。为实现这一点,我们提出了一个名为鲁棒多球学习(Robust Multi-sphere Learning,RMSL)的新框架。RMSL 使用多个超球体来表示行为的正常模式。最初,构建了一个单类分类器作为一个不依赖异常监督的良好起点。 在此基础上,该框架利用多实例学习和基于模型预测置信度的自适应行为级自训练去偏,进一步使用弱序列级标签来细化超球体和特征表示。该方法增强了模型区分正常与异常行为的能力。大量实验表明,RMSL 显著提升了行为级内部威胁检测的性能。
Subjects: Cryptography and Security, Artificial Intelligence, Machine Learning 主题:密码学与安全、人工智能、机器学习
Publish: 2025-08-15 13:36:03 UTC 发布:2025-08-15 13:36:03 UTC
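The multi-sphere geometry admits a two-line scoring rule: distance beyond the nearest sphere's boundary. The centers and radii below are toy values; in RMSL they are learned from weakly labeled sequences.

```python
# Multi-sphere anomaly score sketch: each (center, radius) covers one normal
# behavior mode; a behavior embedding is scored by how far it lies outside the
# nearest sphere (<= 0 means inside some sphere, i.e., normal).
import numpy as np

centers = np.array([[0.0, 0.0], [5.0, 5.0]])   # learned normal-pattern centers
radii = np.array([1.0, 1.5])

def anomaly_score(x: np.ndarray) -> float:
    d = np.linalg.norm(centers - x, axis=1) - radii
    return float(d.min())

print(anomaly_score(np.array([0.5, 0.5])))   # negative -> normal
print(anomaly_score(np.array([3.0, 0.0])))   # positive -> anomalous
```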
#25 Reference Points in LLM Sentiment Analysis: The Role of Structured Context #25 在 LLM 情感分析中的参考点:结构化上下文的作用
Author: [Junichiro Niimi](https://arxiv.org/search/?searchtype=author&query=Junichiro Niimi) 作者:Junichiro Niimi
Large language models (LLMs) are now widely used across many fields, including marketing research. Sentiment analysis, in particular, helps firms understand consumer preferences. While most NLP studies classify sentiment from review text alone, marketing theories, such as prospect theory and expectation–disconfirmation theory, point out that customer evaluations are shaped not only by the actual experience but also by additional reference points. This study therefore investigates how the content and format of such supplementary information affect sentiment analysis using LLMs. We compare natural language (NL) and JSON-formatted prompts using a lightweight 3B parameter model suitable for practical marketing applications. Experiments on two Yelp categories (Restaurant and Nightlife) show that the JSON prompt with additional information outperforms all baselines without fine-tuning: Macro-F1 rises by 1.6% and 4% while RMSE falls by 16% and 9.1%, respectively, making it deployable in resource-constrained edge devices. Furthermore, a follow-up analysis confirms that performance gains stem from genuine contextual reasoning rather than label proxying. This work demonstrates that structured prompting can enable smaller models to achieve competitive performance, offering a practical alternative to large-scale model deployment. 大型语言模型(LLMs)现在已广泛应用于许多领域,包括市场研究。情感分析尤其有助于企业了解消费者偏好。虽然大多数自然语言处理研究仅根据评论文本对情感进行分类,但营销理论(如前景理论和期望—不符理论)指出,顾客的评价不仅受实际体验影响,还受其他参照点的影响。因此,本研究探讨了此类补充信息的内容和格式如何影响使用 LLMs 的情感分析。我们比较了自然语言(NL)提示和 JSON 格式提示,使用适合实用营销应用的轻量级 3B 参数模型。在两个 Yelp 类别(餐厅和夜生活)上的实验证明,包含附加信息的 JSON 提示在无需微调的情况下优于所有基线:宏观 F1 分别提高了 1.6% 和 4%,而均方根误差(RMSE)分别下降了 16% 和 9.1%,使其可在资源受限的边缘设备上部署。此外,后续分析证实性能提升源自真实的上下文推理,而非标签代理。 本工作表明,结构化提示能够使较小的模型达到有竞争力的性能,为大规模模型部署提供了一种可行的替代方案。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-15 13:04:32 UTC 发布:2025-08-15 13:04:32 协调世界时 (UTC)
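The NL-versus-JSON comparison is easy to picture. The field names and reference-point values below are illustrative, not the study's exact schema; the point is that the same contextual reference points are serialized two ways.

```python
# Two prompt formats carrying the same reference-point context (illustrative).
import json

review = "Service was slow but the food was amazing."
context = {"user_avg_stars": 4.2, "business_avg_stars": 3.1, "category": "Restaurant"}

nl_prompt = (
    f"Review: {review}\n"
    f"The reviewer usually gives {context['user_avg_stars']} stars; "
    f"this {context['category'].lower()} averages {context['business_avg_stars']} stars.\n"
    "Predict the star rating (1-5):"
)

json_prompt = (
    "Predict the star rating (1-5) for the following input:\n"
    + json.dumps({"review": review, **context}, indent=2)
)
print(json_prompt)
```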
#26 Inside Knowledge: Graph-based Path Generation with Explainable Data Augmentation and Curriculum Learning for Visual Indoor Navigation #26 内部知识:用于视觉室内导航的基于图的路径生成,具有可解释的数据增强和课程学习
Authors: [Daniel Airinei](https://arxiv.org/search/?searchtype=author&query=Daniel Airinei), [Elena Burceanu](https://arxiv.org/search/?searchtype=author&query=Elena Burceanu), [Marius Leordeanu](https://arxiv.org/search/?searchtype=author&query=Marius Leordeanu) 作者:Daniel Airinei、Elena Burceanu、Marius Leordeanu
Indoor navigation is a difficult task, as it generally comes with poor GPS access, forcing solutions to rely on other sources of information. While significant progress continues to be made in this area, deployment to production applications is still lacking, given the complexity and additional requirements of current solutions. Here, we introduce an efficient, real-time and easily deployable deep learning approach, based on visual input only, that can predict the direction towards a target from images captured by a mobile device. Our technical approach, based on a novel graph-based path generation method, combined with explainable data augmentation and curriculum learning, includes contributions that make the process of data collection, annotation and training as automatic as possible, efficient and robust. On the practical side, we introduce a novel large-scale dataset, with video footage inside a relatively large shopping mall, in which each frame is annotated with the correct next direction towards different specific target destinations. Different from current methods, ours relies solely on vision, avoiding the need for special sensors, additional markers placed along the path, knowledge of the scene map or internet access. We also created an easy-to-use application for Android, which we plan to make publicly available. We make all our data and code available along with visual demos on our project site. 室内导航是一项艰巨的任务,因为通常伴随较差的 GPS 信号,迫使解决方案依赖其他信息来源。尽管该领域持续取得显著进展,但由于现有方案的复杂性和额外需求,仍缺乏到生产应用的部署。在此,我们提出一种高效、实时且易于部署的深度学习方法,仅基于视觉输入,能够从移动设备捕获的图像预测朝向目标的方向。我们的技术方法基于一种新颖的基于图的路径生成方法,结合可解释的数据增强和课程学习,包含了使数据采集、标注和训练过程尽可能自动化、高效且鲁棒的若干贡献。在实际方面,我们引入了一个新颖的大规模数据集,包含在一个相对较大的购物中心内的影像视频,在其中每帧都标注了朝向不同特定目标位置的正确下一步方向。与现有方法不同,我们的方法仅依赖视觉,避免了对特殊传感器、沿路径放置的额外标记、场景地图知识或互联网接入的需求。我们还为 Android 创建了一个易于使用的应用,计划公开提供。我们会在项目网站上提供所有数据和代码以及可视化演示。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-15 12:54:13 UTC 发布:2025-08-15 12:54:13 协调世界时 (UTC)
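The per-frame labels can be pictured as first hops of shortest paths on a floor graph; the mall graph below is a made-up toy, and the paper's graph-based generation method is more involved than this.

```python
# Next-direction labeling sketch: the label at node u for target t is the
# first hop of the shortest path from u to t on the floor graph.
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([("entrance", "atrium", 1), ("atrium", "foodcourt", 2),
                           ("atrium", "shop12", 1), ("shop12", "foodcourt", 1)])

def next_hop(u: str, target: str) -> str:
    path = nx.shortest_path(G, u, target, weight="weight")
    return path[1] if len(path) > 1 else u

print(next_hop("entrance", "foodcourt"))  # atrium
```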
#27 Informative Post-Hoc Explanations Only Exist for Simple Functions #27 只有对简单函数才存在信息性事后解释
Authors: [Eric Günther](https://arxiv.org/search/?searchtype=author&query=Eric Günther), [Balázs Szabados](https://arxiv.org/search/?searchtype=author&query=Balázs Szabados), [Robi Bhattacharjee](https://arxiv.org/search/?searchtype=author&query=Robi Bhattacharjee), [Sebastian Bordt](https://arxiv.org/search/?searchtype=author&query=Sebastian Bordt), [Ulrike von Luxburg](https://arxiv.org/search/?searchtype=author&query=Ulrike von Luxburg) 作者:Eric Günther、Balázs Szabados、Robi Bhattacharjee、Sebastian Bordt、Ulrike von Luxburg
Many researchers have suggested that local post-hoc explanation algorithms can be used to gain insights into the behavior of complex machine learning models. However, theoretical guarantees about such algorithms only exist for simple decision functions, and it is unclear whether and under which assumptions similar results might exist for complex models. In this paper, we introduce a general, learning-theory-based framework for what it means for an explanation to provide information about a decision function. We call an explanation informative if it serves to reduce the complexity of the space of plausible decision functions. With this approach, we show that many popular explanation algorithms are not informative when applied to complex decision functions, providing a rigorous mathematical rejection of the idea that it should be possible to explain any model. We then derive conditions under which different explanation algorithms become informative. These are often stronger than what one might expect. For example, gradient explanations and counterfactual explanations are non-informative with respect to the space of differentiable functions, and SHAP and anchor explanations are not informative with respect to the space of decision trees. Based on these results, we discuss how explanation algorithms can be modified to become informative. While the proposed analysis of explanation algorithms is mathematical, we argue that it holds strong implications for the practical applicability of these algorithms, particularly for auditing, regulation, and high-risk applications of AI. 许多研究人员认为,局部事后解释算法可用于洞察复杂机器学习模型的行为。然而,此类算法的理论保证仅存在于简单的决策函数情形,对于复杂模型是否以及在何种假设下能获得类似结果尚不清楚。本文引入了一个基于学习理论的一般框架,用以定义解释何谓对决策函数提供信息。我们称一种解释为“有信息的”,如果它能减少可行决策函数空间的复杂性。基于这一方法,我们证明了许多流行的解释算法在应用于复杂决策函数时并不具备信息性,从而对“应当能够解释任意模型”这一观点给出严格的数学否定。随后,我们推导出使不同解释算法变得有信息性的条件,而这些条件常常比人们预期的要更强。 例如,对于可微函数空间,梯度解释和反事实解释是没有信息量的;而对于决策树空间,SHAP 和 anchor 解释则没有信息量。基于这些结果,我们讨论了解释算法如何被修改以变得有信息量。尽管对解释算法的所提出分析是数学性的,我们认为它对这些算法的实际可用性具有重要影响,尤其是在审计、监管和高风险的人工智能应用中。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-15 12:46:18 UTC 发布:2025-08-15 12:46:18 UTC
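A toy numerical illustration of the non-informativeness result for gradient explanations: two differentiable functions that disagree almost everywhere can share the same gradient at the explained point, so the explanation alone cannot shrink the space of plausible decision functions. The construction is ours, not the paper's proof.

```python
# Two functions with identical gradient explanations at x0 but very
# different behavior elsewhere (illustrative construction).
import numpy as np

w = np.array([1.0, -2.0])
x0 = np.array([0.5, 0.5])     # the point being explained

f1 = lambda x: w @ x                                  # linear model
f2 = lambda x: w @ x + 10.0 * np.sum((x - x0) ** 2)   # very different model

def grad(f, x, eps=1e-6):
    """Central-difference gradient."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x); d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

print(grad(f1, x0), grad(f2, x0))   # identical explanations at x0
x = np.array([2.0, -1.0])
print(f1(x), f2(x))                 # yet the functions differ elsewhere
```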
#28 On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting
Authors: [Wenhao Zhang](https://arxiv.org/search/?searchtype=author&query=Wenhao Zhang), [Yuexiang Xie](https://arxiv.org/search/?searchtype=author&query=Yuexiang Xie), [Yuchang Sun](https://arxiv.org/search/?searchtype=author&query=Yuchang Sun), [Yanxi Chen](https://arxiv.org/search/?searchtype=author&query=Yanxi Chen), [Guoyin Wang](https://arxiv.org/search/?searchtype=author&query=Guoyin Wang), [Yaliang Li](https://arxiv.org/search/?searchtype=author&query=Yaliang Li), [Bolin Ding](https://arxiv.org/search/?searchtype=author&query=Bolin Ding), [Jingren Zhou](https://arxiv.org/search/?searchtype=author&query=Jingren Zhou) 作者:Wenhao Zhang, Yuexiang Xie, Yuchang Sun, Yanxi Chen, Guoyin Wang, Yaliang Li, Bolin Ding, Jingren Zhou
Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are two prominent post-training paradigms for refining the capabilities and aligning the behavior of Large Language Models (LLMs). Existing approaches that integrate SFT and RL often face the risk of disrupting established model patterns and inducing overfitting to expert data. To address this, we present a novel investigation into the unified view of SFT and RL through an off-policy versus on-policy lens. We propose CHORD, a framework for the Controllable Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic Weighting, which reframes SFT not as a separate stage but as a dynamically weighted auxiliary objective within the on-policy RL process. Based on an analysis of off-policy expert data’s influence at both holistic and granular levels, we incorporate a dual-control mechanism in CHORD. Specifically, the framework first employs a global coefficient to holistically guide the transition from off-policy imitation to on-policy exploration, and then applies a token-wise weighting function that enables granular learning from expert tokens, which preserves on-policy exploration and mitigates disruption from off-policy data. We conduct extensive experiments on widely used benchmarks, providing empirical evidence that CHORD achieves a stable and efficient learning process. By effectively harmonizing off-policy expert data with on-policy exploration, CHORD demonstrates significant improvements over baselines. We release the implementation at https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord to inspire further research. 有监督微调(SFT)和强化学习(RL)是两种用于提升大型语言模型(LLMs)能力并对齐其行为的主要后训练范式。现有将 SFT 与 RL 相结合的方法常面临破坏模型既有模式并对专家数据过拟合的风险。为了解决这一问题,我们提出从离策略与在策略的视角对 SFT 与 RL 的统一视图进行新颖探讨。我们提出了 CHORD,一种通过动态加权实现的在策略与离策略强化学习可控调和框架(Controllable Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic Weighting),它将 SFT 重新构想为在在策略 RL 过程中作为动态加权的辅助目标,而非独立的阶段。基于对离策略专家数据在整体和细粒度层面影响的分析,我们在 CHORD 中引入了双重控制机制。 具体而言,该框架首先采用一个全局系数来从整体上引导从离策略模仿向在策略探索的过渡,然后应用一个按令牌加权的函数,使得可以对专家令牌进行细粒度学习,从而保留在策略探索并减轻离策略数据带来的干扰。我们在广泛使用的基准上进行了大量实验,提供了实证证据表明 CHORD 实现了稳定且高效的学习过程。通过有效协调离策略专家数据与在策略探索,CHORD 相较基线表现出显著改进。我们在 https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord 发布了实现代码,以期激发进一步研究。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-15 11:20:03 UTC 发布:2025-08-15 11:20:03 UTC
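A minimal sketch of the dual-control idea as the abstract describes it: a global coefficient mu schedules the move from off-policy imitation to on-policy exploration, while a token-wise weight modulates how much each expert token contributes to the auxiliary SFT term. The entropy-based form of the token weight is our assumption, not necessarily CHORD's exact function.

```python
# Hedged sketch of CHORD-style dual control: global coefficient + token-wise
# weighting of an auxiliary SFT term inside the on-policy RL objective.
import torch
import torch.nn.functional as F

def chord_style_loss(policy_logits, expert_tokens, rl_loss, step, total_steps):
    # Global control: anneal mu from 1 (imitate) toward 0 (explore).
    mu = max(0.0, 1.0 - step / total_steps)

    # Token-wise control (assumed form): weight expert tokens by the policy's
    # predictive entropy, so learning stays granular on uncertain tokens.
    logp = F.log_softmax(policy_logits, dim=-1)                # (T, V)
    tok_logp = logp.gather(-1, expert_tokens[:, None])[:, 0]   # (T,)
    entropy = -(logp.exp() * logp).sum(-1)                     # (T,)
    token_w = entropy / (entropy.max() + 1e-8)

    sft_term = -(token_w * tok_logp).mean()
    return rl_loss + mu * sft_term

logits = torch.randn(16, 1000)
tokens = torch.randint(0, 1000, (16,))
print(chord_style_loss(logits, tokens, rl_loss=torch.tensor(0.3),
                       step=100, total_steps=1000))
```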
#29 Open, Reproducible and Trustworthy Robot-Based Experiments with Virtual Labs and Digital-Twin-Based Execution Tracing #29 使用虚拟实验室和基于数字孪生的执行追踪实现开放、可复现且值得信赖的机器人实验
Authors: [Benjamin Alt](https://arxiv.org/search/?searchtype=author&query=Benjamin Alt), [Mareike Picklum](https://arxiv.org/search/?searchtype=author&query=Mareike Picklum), [Sorin Arion](https://arxiv.org/search/?searchtype=author&query=Sorin Arion), [Franklin Kenghagho Kenfack](https://arxiv.org/search/?searchtype=author&query=Franklin Kenghagho Kenfack), [Michael Beetz](https://arxiv.org/search/?searchtype=author&query=Michael Beetz) 作者:Benjamin Alt, Mareike Picklum, Sorin Arion, Franklin Kenghagho Kenfack, Michael Beetz
We envision a future in which autonomous robots conduct scientific experiments in ways that are not only precise and repeatable, but also open, trustworthy, and transparent. To realize this vision, we present two key contributions: a semantic execution tracing framework that logs sensor data together with semantically annotated robot belief states, ensuring that automated experimentation is transparent and replicable; and the AICOR Virtual Research Building (VRB), a cloud-based platform for sharing, replicating, and validating robot task executions at scale. Together, these tools enable reproducible, robot-driven science by integrating deterministic execution, semantic memory, and open knowledge representation, laying the foundation for autonomous systems to participate in scientific discovery. 我们设想这样一种未来:自主机器人以不仅精确可重复,而且开放、值得信赖和透明的方式进行科学实验。为实现这一愿景,我们提出两项关键贡献:一个语义执行追踪框架,该框架将传感器数据与语义注解的机器人信念状态一起记录,确保自动化实验具有透明性和可复现性;以及 AICOR 虚拟研究大楼(VRB),一个基于云的平台,用于大规模共享、复现和验证机器人任务的执行。结合这些工具,通过整合确定性执行、语义记忆和开放知识表示,能够实现可复现的机器人驱动科学,为自主系统参与科学发现奠定基础。
Subjects: Robotics, Artificial Intelligence 主题:机器人学,人工智能
Publish: 2025-08-15 11:16:06 UTC 发布时间:2025-08-15 11:16:06 UTC
#30 An Exploratory Study on Crack Detection in Concrete through Human-Robot Collaboration #30 通过人机协作对混凝土裂缝检测的探索性研究
Authors: [Junyeon Kim](https://arxiv.org/search/?searchtype=author&query=Junyeon Kim), [Tianshu Ruan](https://arxiv.org/search/?searchtype=author&query=Tianshu Ruan), [Cesar Alan Contreras](https://arxiv.org/search/?searchtype=author&query=Cesar Alan Contreras), [Manolis Chiou](https://arxiv.org/search/?searchtype=author&query=Manolis Chiou) 作者:Junyeon Kim、Tianshu Ruan、Cesar Alan Contreras、Manolis Chiou
Structural inspection in nuclear facilities is vital for maintaining operational safety and integrity. Traditional methods of manual inspection pose significant challenges, including safety risks, high cognitive demands, and potential inaccuracies due to human limitations. Recent advancements in Artificial Intelligence (AI) and robotic technologies have opened new possibilities for safer, more efficient, and accurate inspection methodologies. Specifically, Human-Robot Collaboration (HRC), leveraging robotic platforms equipped with advanced detection algorithms, promises significant improvements in inspection outcomes and reductions in human workload. This study explores the effectiveness of AI-assisted visual crack detection integrated into a mobile Jackal robot platform. The experimental results indicate that HRC enhances inspection accuracy and reduces operator workload, resulting in potentially superior performance compared to traditional manual methods. 核设施的结构检测对于维护运行安全性和完整性至关重要。传统的人工检测方法存在重大挑战,包括安全风险、认知负担高以及由于人为限制导致的潜在不准确性。人工智能(AI)和机器人技术的最新进展为更安全、更高效且更准确的检测方法开辟了新途径。具体而言,利用配备先进检测算法的机器人平台的人机协作(HRC)有望显著改善检测结果并减轻人工工作负担。本研究探讨了集成在移动 Jackal 机器人平台上的 AI 辅助可视裂缝检测的有效性。实验结果表明,人机协作提高了检测精度并降低了操作者的工作负荷,相较于传统人工方法有潜在的更优性能表现。
Subjects: Robotics, Artificial Intelligence, Human-Computer Interaction 主题:机器人学、人工智能、人机交互
Publish: 2025-08-15 11:13:07 UTC 发布时间:2025-08-15 11:13:07 UTC
#31 Trustworthy AI Psychotherapy: Multi-Agent LLM Workflow for Counseling and Explainable Mental Disorder Diagnosis #31 值得信赖的 AI 心理治疗:用于咨询和可解释精神疾病诊断的多代理 LLM 工作流
Authors: [Mithat Can Ozgun](https://arxiv.org/search/?searchtype=author&query=Mithat Can Ozgun), [Jiahuan Pei](https://arxiv.org/search/?searchtype=author&query=Jiahuan Pei), [Koen Hindriks](https://arxiv.org/search/?searchtype=author&query=Koen Hindriks), [Lucia Donatelli](https://arxiv.org/search/?searchtype=author&query=Lucia Donatelli), [Qingzhi Liu](https://arxiv.org/search/?searchtype=author&query=Qingzhi Liu), [Xin Sun](https://arxiv.org/search/?searchtype=author&query=Xin Sun), [Junxiao Wang](https://arxiv.org/search/?searchtype=author&query=Junxiao Wang) 作者:Mithat Can Ozgun、Jiahuan Pei、Koen Hindriks、Lucia Donatelli、Qingzhi Liu、Xin Sun、Junxiao Wang
LLM-based agents have emerged as transformative tools capable of executing complex tasks through iterative planning and action, achieving significant advancements in understanding and addressing user needs. Yet, their effectiveness remains limited in specialized domains such as mental health diagnosis, where they underperform compared to general applications. Current approaches to integrating diagnostic capabilities into LLMs rely on scarce, highly sensitive mental health datasets, which are challenging to acquire. These methods also fail to emulate clinicians’ proactive inquiry skills, lack multi-turn conversational comprehension, and struggle to align outputs with expert clinical reasoning. To address these gaps, we propose DSM5AgentFlow, the first LLM-based agent workflow designed to autonomously generate DSM-5 Level-1 diagnostic questionnaires. By simulating therapist-client dialogues with specific client profiles, the framework delivers transparent, step-by-step disorder predictions, producing explainable and trustworthy results. This workflow serves as a complementary tool for mental health diagnosis, ensuring adherence to ethical and legal standards. Through comprehensive experiments, we evaluate leading LLMs across three critical dimensions: conversational realism, diagnostic accuracy, and explainability. Our datasets and implementations are fully open-sourced. 基于 LLM 的代理已成为具有变革性的工具,能够通过迭代的规划和行动执行复杂任务,在理解和满足用户需求方面取得显著进展。然而,它们在精神健康诊断等专门领域的有效性仍然有限,表现不如通用应用。目前将诊断功能整合到 LLM 的方法依赖稀缺且高度敏感的精神健康数据集,这些数据难以获取。这些方法也未能模拟临床医生的主动询问技能,缺乏多轮对话理解,并且难以使输出与临床专家的推理相一致。为了解决这些差距,我们提出了 DSM5AgentFlow——首个设计用于自主生成 DSM-5 一级诊断问卷的基于 LLM 的代理工作流。通过模拟具有特定来访者画像的治疗师与来访者对话,该框架提供透明的、逐步的疾病预测,产出可解释且可信的结果。该工作流作为精神健康诊断的补充工具,确保遵守伦理和法律标准。通过全面的实验,我们在三个关键维度上评估了领先的 LLMs:对话真实感、诊断准确性和可解释性。我们的数据集和实现均已完全开源。
Subjects: Human-Computer Interaction, Artificial Intelligence, Information Retrieval 主题:人机交互、人工智能、信息检索
Publish: 2025-08-15 11:08:32 UTC 发布:2025-08-15 11:08:32 UTC
#32 Retrieval-augmented reasoning with lean language models #32 基于精简语言模型的检索增强推理
Authors: [Ryan Sze-Yin Chan](https://arxiv.org/search/?searchtype=author&query=Ryan Sze-Yin Chan), [Federico Nanni](https://arxiv.org/search/?searchtype=author&query=Federico Nanni), [Tomas Lazauskas](https://arxiv.org/search/?searchtype=author&query=Tomas Lazauskas), [Rosie Wood](https://arxiv.org/search/?searchtype=author&query=Rosie Wood), [Penelope Yong](https://arxiv.org/search/?searchtype=author&query=Penelope Yong), [Lionel Tarassenko](https://arxiv.org/search/?searchtype=author&query=Lionel Tarassenko), [Mark Girolami](https://arxiv.org/search/?searchtype=author&query=Mark Girolami), [James Geddes](https://arxiv.org/search/?searchtype=author&query=James Geddes), [Andrew Duncan](https://arxiv.org/search/?searchtype=author&query=Andrew Duncan) 作者:Ryan Sze-Yin Chan、Federico Nanni、Tomas Lazauskas、Rosie Wood、Penelope Yong、Lionel Tarassenko、Mark Girolami、James Geddes、Andrew Duncan
This technical report details a novel approach to combining reasoning and retrieval augmented generation (RAG) within a single, lean language model architecture. While existing RAG systems typically rely on large-scale models and external APIs, our work addresses the increasing demand for performant and privacy-preserving solutions deployable in resource-constrained or secure environments. Building on recent developments in test-time scaling and small-scale reasoning models, we develop a retrieval augmented conversational agent capable of interpreting complex, domain-specific queries using a lightweight backbone model. Our system integrates a dense retriever with fine-tuned Qwen2.5-Instruct models, using synthetic query generation and reasoning traces derived from frontier models (e.g., DeepSeek-R1) over a curated corpus, in this case, the NHS A-to-Z condition pages. We explore the impact of summarisation-based document compression, synthetic data design, and reasoning-aware fine-tuning on model performance. Evaluation against both non-reasoning and general-purpose lean models demonstrates that our domain-specific fine-tuning approach yields substantial gains in answer accuracy and consistency, approaching frontier-level performance while remaining feasible for local deployment. All implementation details and code are publicly released to support reproducibility and adaptation across domains. 本技术报告详细介绍了一种在单一精简语言模型架构内将推理与检索增强生成(RAG)相结合的新方法。尽管现有的 RAG 系统通常依赖大规模模型和外部 API,我们的工作针对可在资源受限或安全环境中部署的高性能且保护隐私的解决方案的日益增长的需求。基于近期在测试时扩展和小规模推理模型方面的进展,我们开发了一个检索增强的对话代理,能够使用轻量级主干模型解释复杂的领域特定查询。我们的系统将密集检索器与经微调的 Qwen2.5-Instruct 模型集成,利用合成查询生成和从前沿模型(例如 DeepSeek-R1)在精选语料库(在本例中为 NHS A-to-Z 病情页面)上得出的推理轨迹。我们探讨了基于摘要的文档压缩、合成数据设计以及具备推理意识的微调对模型性能的影响。 对非推理模型和通用精简模型的评估表明,我们的领域特定微调方法在答案准确性和一致性方面带来了显著提升,性能接近前沿水平,同时仍可在本地部署。所有实现细节和代码均已公开发布,以支持可重复性和跨领域的适配。
Subjects: Computation and Language, Artificial Intelligence, Computers and Society
Publish: 2025-08-15 10:38:15 UTC 发布时间:2025-08-15 10:38:15 UTC
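A minimal sketch of the retrieve-then-reason loop the report describes; `embed` and `generate` are hypothetical stand-ins for the dense retriever and the fine-tuned Qwen2.5-Instruct model, so only the plumbing is meaningful here.

```python
# Hedged sketch of a lean retrieval-augmented reasoning pipeline.
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Hypothetical dense encoder returning unit-norm vectors."""
    rng = np.random.default_rng(abs(hash(tuple(texts))) % 2**32)
    v = rng.normal(size=(len(texts), 384))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def generate(prompt: str) -> str:
    """Hypothetical call into the fine-tuned reasoning model."""
    return f"<answer conditioned on: {prompt[:60]}...>"

corpus = ["Asthma: symptoms and treatments ...",
          "Chickenpox: symptoms and treatments ..."]
doc_vecs = embed(corpus)

def answer(query: str, k: int = 1) -> str:
    q = embed([query])[0]
    top = np.argsort(doc_vecs @ q)[::-1][:k]       # cosine similarity
    context = "\n".join(corpus[i] for i in top)
    return generate(f"Context:\n{context}\n\nQuestion: {query}\nThink step by step.")

print(answer("What helps with an asthma attack?"))
```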
#33 When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs #33 标点何时重要:针对 LLMs 的提示鲁棒性方法的大规模比较
Authors: [Mikhail Seleznyov](https://arxiv.org/search/?searchtype=author&query=Mikhail Seleznyov), [Mikhail Chaichuk](https://arxiv.org/search/?searchtype=author&query=Mikhail Chaichuk), [Gleb Ershov](https://arxiv.org/search/?searchtype=author&query=Gleb Ershov), [Alexander Panchenko](https://arxiv.org/search/?searchtype=author&query=Alexander Panchenko), [Elena Tutubalina](https://arxiv.org/search/?searchtype=author&query=Elena Tutubalina), [Oleg Somov](https://arxiv.org/search/?searchtype=author&query=Oleg Somov) 作者:Mikhail Seleznyov、Mikhail Chaichuk、Gleb Ershov、Alexander Panchenko、Elena Tutubalina、Oleg Somov
Large Language Models (LLMs) are highly sensitive to subtle, non-semantic variations in prompt phrasing and formatting. In this work, we present the first systematic evaluation of 5 methods for improving prompt robustness within a unified experimental framework. We benchmark these techniques on 8 models from Llama, Qwen and Gemma families across 52 tasks from Natural Instructions dataset. Our evaluation covers robustness methods from both fine-tuned and in-context learning paradigms, and tests their generalization against multiple types of distribution shifts. Finally, we extend our analysis to GPT-4.1 and DeepSeek V3 to assess frontier models’ current robustness to format perturbations. Our findings offer actionable insights into the relative effectiveness of these robustness methods, enabling practitioners to make informed decisions when aiming for stable and reliable LLM performance in real-world applications. Code: https://github.com/AIRI-Institute/when-punctuation-matters. 大型语言模型(LLMs)对提示措辞和格式中的细微、非语义变化非常敏感。在这项工作中,我们在统一的实验框架内首次系统评估了五种提高提示鲁棒性的方法。我们在来自 Llama、Qwen 和 Gemma 系列的 8 个模型上,对 Natural Instructions 数据集中的 52 个任务进行了基准测试。我们的评估涵盖了来自微调和上下文学习范式的鲁棒性方法,并测试了它们在多种分布偏移类型下的泛化能力。最后,我们将分析扩展到 GPT-4.1 和 DeepSeek V3,以评估前沿模型当前对格式扰动的鲁棒性。我们的发现为这些鲁棒性方法的相对有效性提供了可操作的见解,使从业者在追求现实应用中 LLMs 的稳定可靠表现时能够做出明智决策。代码: https://github.com/AIRI-Institute/when-punctuation-matters.
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-15 10:32:50 UTC 发布时间:2025-08-15 10:32:50 UTC
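For a flavor of the non-semantic perturbations such a benchmark varies, the sketch below renders the same task under different separators, field casings, and trailing punctuation; robustness is then the spread of a model's accuracy across the renderings. The perturbation grid is illustrative, not the paper's exact set.

```python
# Illustrative generator of non-semantic prompt-format variants.
import itertools

def render(task, question, separator, field_case, trailing):
    field = "question" if field_case == "lower" else "Question"
    return f"{task}{separator}{field}: {question}{trailing}"

separators = ["\n", "\n\n", " || "]
field_cases = ["lower", "title"]
trailings = ["", ".", " Answer:"]

variants = [
    render("Classify the sentiment.", "I loved this film!", s, c, t)
    for s, c, t in itertools.product(separators, field_cases, trailings)
]
# Robustness can then be summarized as max(acc) - min(acc) of a model's
# accuracy over the 18 renderings of the same underlying task.
print(len(variants), repr(variants[0]))
```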
#34 G-CUT3R: Guided 3D Reconstruction with Camera and Depth Prior Integration #34 G-CUT3R:通过相机和深度先验整合的有指导三维重建
Authors: [Ramil Khafizov](https://arxiv.org/search/?searchtype=author&query=Ramil Khafizov), [Artem Komarichev](https://arxiv.org/search/?searchtype=author&query=Artem Komarichev), [Ruslan Rakhimov](https://arxiv.org/search/?searchtype=author&query=Ruslan Rakhimov), [Peter Wonka](https://arxiv.org/search/?searchtype=author&query=Peter Wonka), [Evgeny Burnaev](https://arxiv.org/search/?searchtype=author&query=Evgeny Burnaev) 作者:Ramil Khafizov, Artem Komarichev, Ruslan Rakhimov, Peter Wonka, Evgeny Burnaev
We introduce G-CUT3R, a novel feed-forward approach for guided 3D scene reconstruction that enhances the CUT3R model by integrating prior information. Unlike existing feed-forward methods that rely solely on input images, our method leverages auxiliary data, such as depth, camera calibrations, or camera positions, commonly available in real-world scenarios. We propose a lightweight modification to CUT3R, incorporating a dedicated encoder for each modality to extract features, which are fused with RGB image tokens via zero convolution. This flexible design enables seamless integration of any combination of prior information during inference. Evaluated across multiple benchmarks, including 3D reconstruction and other multi-view tasks, our approach demonstrates significant performance improvements, showing its ability to effectively utilize available priors while maintaining compatibility with varying input modalities. 我们提出了 G-CUT3R,一种新颖的前馈式有指导三维场景重建方法,通过整合先验信息来增强 CUT3R 模型。与仅依赖输入图像的现有前馈方法不同,我们的方法利用在真实场景中常见的辅助数据,例如深度、相机标定或相机位置。我们对 CUT3R 进行了轻量级修改,为每种模态引入专门的编码器以提取特征,并通过零卷积将这些特征与 RGB 图像 token 融合。该灵活设计使在推理时可无缝整合任意组合的先验信息。经多个基准测试评估,包括三维重建和其他多视图任务,我们的方法表现出显著的性能提升,展示了其在保持对不同输入模态兼容性的同时有效利用可用先验的能力。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-15 10:25:58 UTC 发表:2025-08-15 10:25:58 UTC
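A sketch of the zero-convolution fusion described in the abstract: each prior modality gets its own encoder, and its features are added to the RGB tokens through a projection initialized to zero, so training starts from unmodified CUT3R behavior and priors remain optional at inference. Shapes and the encoder form are assumptions.

```python
# Hedged sketch of zero-convolution fusion of an optional prior (e.g. depth)
# into image tokens; a zero-initialized projection contributes nothing at the
# start of training.
import torch
import torch.nn as nn

class ZeroConvFusion(nn.Module):
    def __init__(self, prior_dim: int, token_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(prior_dim, token_dim), nn.GELU(),
            nn.Linear(token_dim, token_dim),
        )
        # "Zero convolution": projection whose weights and bias start at zero.
        self.zero_proj = nn.Linear(token_dim, token_dim)
        nn.init.zeros_(self.zero_proj.weight)
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, rgb_tokens, prior=None):
        if prior is None:              # priors are optional at inference
            return rgb_tokens
        return rgb_tokens + self.zero_proj(self.encoder(prior))

fusion = ZeroConvFusion(prior_dim=1, token_dim=768)
tokens = torch.randn(2, 196, 768)          # RGB image tokens
depth = torch.randn(2, 196, 1)             # per-patch depth prior
assert torch.allclose(fusion(tokens, depth), tokens)  # holds at init
```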
#35 Does the Skeleton-Recall Loss Really Work? #35 骨架召回损失真的有效吗?
Authors: [Devansh Arora](https://arxiv.org/search/?searchtype=author&query=Devansh Arora), [Nitin Kumar](https://arxiv.org/search/?searchtype=author&query=Nitin Kumar), [Sukrit Gupta](https://arxiv.org/search/?searchtype=author&query=Sukrit Gupta) 作者:Devansh Arora、Nitin Kumar、Sukrit Gupta
Image segmentation is an important and widely performed task in computer vision. Accomplishing effective image segmentation in diverse settings often requires custom model architectures and loss functions. One family of approaches that specializes in segmenting thin tubular structures relies on topology-preservation-based loss functions. These models often utilize a pixel skeletonization process claimed to generate more precise segmentation masks of thin tubes and better capture the structures that other models often miss. One such loss, Skeleton Recall Loss (SRL), proposed by Kirchhoff et al. [kirchhoff2024srl], was stated to produce state-of-the-art results on benchmark tubular datasets. In this work, we performed a theoretical analysis of the gradients for the SRL loss. Upon comparing the performance of the proposed method on some of the tubular datasets (used in the original work, along with some additional datasets), we found that the performance of SRL-based segmentation models did not exceed traditional baseline models. By providing both a theoretical explanation and empirical evidence, this work critically evaluates the limitations of topology-based loss functions, offering valuable insights for researchers aiming to develop more effective segmentation models for complex tubular structures.
图像分割是计算机视觉中一项重要且广泛执行的任务。在不同场景下实现有效的图像分割通常需要定制的模型架构和损失函数。一类专门用于分割细长管状结构的方法依赖基于拓扑保持的损失函数。这些模型通常利用像素骨架化过程,据称能够生成更精确的细管分割掩码,并更好地捕捉其他模型常常遗漏的结构。其中一种由 Kirchhoff 等人提出的骨架召回损失(Skeleton Recall Loss,SRL)[kirchhoff2024srl],据称在基准管状数据集上产生了最先进的结果。在本工作中,我们对 SRL 损失的梯度进行了理论分析。在将该方法在若干管状数据集(原始工作中使用的数据集以及一些额外数据集)上的性能进行比较后,我们发现基于 SRL 的分割模型的性能并未超出传统基线模型。通过提供理论解释和实证证据,这项工作批判性地评估了基于拓扑的损失函数的局限性,为旨在为复杂管状结构开发更有效分割模型的研究人员提供了有价值的见解。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-15 10:16:34 UTC 出版:2025-08-15 10:16:34 UTC
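For orientation, here is a simplified skeleton-recall-style loss as we read it from the abstract: penalize soft masks that miss pixels on the precomputed ground-truth skeleton. This is our reading, not necessarily the exact SRL formulation, but it exhibits the kind of gradient structure the paper's theoretical analysis examines.

```python
# Hedged sketch of a skeleton-recall-style loss (our simplified reading).
import torch

def skeleton_recall_loss(pred_probs, gt_skeleton, eps=1e-6):
    """pred_probs: (B,H,W) soft mask; gt_skeleton: (B,H,W) binary skeleton."""
    recalled = (pred_probs * gt_skeleton).sum(dim=(1, 2))
    total = gt_skeleton.sum(dim=(1, 2))
    return (1.0 - recalled / (total + eps)).mean()

pred = torch.rand(4, 64, 64, requires_grad=True)
skel = (torch.rand(4, 64, 64) > 0.95).float()
loss = skeleton_recall_loss(pred, skel)
loss.backward()
# Note the gradient w.r.t. pred is constant on skeleton pixels and exactly
# zero everywhere else, the sort of property a gradient analysis probes.
print(loss.item(), pred.grad.abs().sum().item())
```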
#36 Minimizing Surrogate Losses for Decision-Focused Learning using Differentiable Optimization #36 使用可微分优化最小化决策导向学习的替代损失
Authors: [Jayanta Mandi](https://arxiv.org/search/?searchtype=author&query=Jayanta Mandi), [Ali İrfan Mahmutoğulları](https://arxiv.org/search/?searchtype=author&query=Ali İrfan Mahmutoğulları), [Senne Berden](https://arxiv.org/search/?searchtype=author&query=Senne Berden), [Tias Guns](https://arxiv.org/search/?searchtype=author&query=Tias Guns) 作者:Jayanta Mandi、Ali İrfan Mahmutoğulları、Senne Berden、Tias Guns
Decision-focused learning (DFL) trains a machine learning (ML) model to predict parameters of an optimization problem, to directly minimize decision regret, i.e., maximize decision quality. Gradient-based DFL requires computing the derivative of the solution to the optimization problem with respect to the predicted parameters. However, for many optimization problems, such as linear programs (LPs), the gradient of the regret with respect to the predicted parameters is zero almost everywhere. Existing gradient-based DFL approaches for LPs try to circumvent this issue in one of two ways: (a) smoothing the LP into a differentiable optimization problem by adding a quadratic regularizer and then minimizing the regret directly or (b) minimizing surrogate losses that have informative (sub)gradients. In this paper, we show that the former approach still results in zero gradients, because even after smoothing the regret remains constant across large regions of the parameter space. To address this, we propose minimizing surrogate losses – even when a differentiable optimization layer is used and regret can be minimized directly. Our experiments demonstrate that minimizing surrogate losses allows differentiable optimization layers to achieve regret comparable to or better than surrogate-loss based DFL methods. Further, we demonstrate that this also holds for DYS-Net, a recently proposed differentiable optimization technique for LPs, which computes approximate solutions and gradients through operations that can be performed using feedforward neural network layers. Because DYS-Net executes the forward and the backward pass very efficiently, by minimizing surrogate losses using DYS-Net, we are able to attain regret on par with the state-of-the-art while reducing training time by a significant margin. 面向决策的学习(DFL)训练机器学习(ML)模型以预测优化问题的参数,从而直接最小化决策遗憾,即最大化决策质量。基于梯度的 DFL 需要计算优化问题的解相对于预测参数的导数。然而,对于许多优化问题,例如线性规划(LP),遗憾相对于预测参数的梯度在几乎所有地方都是零。现有针对 LP 的基于梯度的 DFL 方法试图通过两种方式之一来规避这一问题:(a)通过添加二次正则项将 LP 平滑为可微的优化问题,然后直接最小化遗憾;或(b)最小化具有有信息(次)梯度的替代损失。在本文中,我们展示了前一种方法仍然会导致零梯度,因为即使在平滑之后,遗憾在参数空间的大区域内仍保持不变。为了解决这一问题,我们提出即便在使用可微优化层并且可以直接最小化遗憾的情况下,也应最小化替代损失。 我们的实验表明,最小化替代损失使得可微分优化层能够获得与基于替代损失的决策导向学习(DFL)方法相当或更优的遗憾值。此外,我们还证明这对 DYS-Net 同样成立——这是一种最近提出的用于线性规划的可微分优化技术,它通过可以用前馈神经网络层执行的操作来计算近似解和梯度。由于 DYS-Net 在前向和反向传播上都非常高效,通过使用 DYS-Net 最小化替代损失,我们能够在将遗憾降至与最先进水平相当的同时显著减少训练时间。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-15 09:59:56 UTC 发布:2025-08-15 09:59:56 协调世界时 (UTC)
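The zero-gradient phenomenon is easy to reproduce numerically: for an LP, the regret is piecewise constant in the predicted cost vector, as the toy example below shows (a toy problem of ours, not the paper's setup).

```python
# Numerical illustration: LP decision regret is piecewise constant in the
# predicted cost, so its gradient is zero almost everywhere.
import numpy as np
from scipy.optimize import linprog

c_true = np.array([1.0, -1.0])
bounds = [(0, 1), (0, 1)]                       # LP over the unit box

def solve(c):
    return linprog(c, bounds=bounds, method="highs").x

x_star = solve(c_true)                          # optimal decision

def regret(c_pred):
    x_hat = solve(c_pred)                       # decide using the prediction
    return c_true @ x_hat - c_true @ x_star     # loss of decision quality

for t in np.linspace(-2, 2, 9):
    print(f"c_pred[0]={t:+.1f}  regret={regret(np.array([t, -1.0])):.3f}")
# The printout shows regret jumping at t=0 and flat elsewhere: zero gradient
# almost everywhere, which motivates surrogate losses with useful
# (sub)gradients.
```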
#37 PTSM: Physiology-aware and Task-invariant Spatio-temporal Modeling for Cross-Subject EEG Decoding #37 PTSM:面向生理信息且与任务无关的时空建模用于跨个体脑电解码
Authors: [Changhong Jing](https://arxiv.org/search/?searchtype=author&query=Changhong Jing), [Yan Liu](https://arxiv.org/search/?searchtype=author&query=Yan Liu), [Shuqiang Wang](https://arxiv.org/search/?searchtype=author&query=Shuqiang Wang), [Bruce X. B. Yu](https://arxiv.org/search/?searchtype=author&query=Bruce X. B. Yu), [Gong Chen](https://arxiv.org/search/?searchtype=author&query=Gong Chen), [Zhejing Hu](https://arxiv.org/search/?searchtype=author&query=Zhejing Hu), [Zhi Zhang](https://arxiv.org/search/?searchtype=author&query=Zhi Zhang), [Yanyan Shen](https://arxiv.org/search/?searchtype=author&query=Yanyan Shen) 作者:Jing Changhong, Liu Yan, Wang Shuqiang, Yu Bruce X. B., Chen Gong, Hu Zhejing, Zhang Zhi, Shen Yanyan
Cross-subject electroencephalography (EEG) decoding remains a fundamental challenge in brain-computer interface (BCI) research due to substantial inter-subject variability and the scarcity of subject-invariant representations. This paper proposes PTSM (Physiology-aware and Task-invariant Spatio-temporal Modeling), a novel framework for interpretable and robust EEG decoding across unseen subjects. PTSM employs a dual-branch masking mechanism that independently learns personalized and shared spatio-temporal patterns, enabling the model to preserve individual-specific neural characteristics while extracting task-relevant, population-shared features. The masks are factorized across temporal and spatial dimensions, allowing fine-grained modulation of dynamic EEG patterns with low computational overhead. To further address representational entanglement, PTSM enforces information-theoretic constraints that decompose latent embeddings into orthogonal task-related and subject-related subspaces. The model is trained end-to-end via a multi-objective loss integrating classification, contrastive, and disentanglement objectives. Extensive experiments on cross-subject motor imagery datasets demonstrate that PTSM achieves strong zero-shot generalization, outperforming state-of-the-art baselines without subject-specific calibration. Results highlight the efficacy of disentangled neural representations for achieving both personalized and transferable decoding in non-stationary neurophysiological settings. 跨受试者脑电图(EEG)解码由于显著的受试者间差异和缺乏受试者不变表示,一直是脑机接口(BCI)研究中的一个基础性挑战。本文提出了 PTSM(基于生理感知与任务不变的时空建模),一种用于在未见受试者上实现可解释且鲁棒的 EEG 解码的新框架。PTSM 采用双分支掩码机制,独立学习个体化和共享的时空模式,使模型在保留个体特有神经特征的同时提取与任务相关的群体共享特征。掩码在时间和空间维度上被因式分解,允许以低计算开销对动态 EEG 模式进行精细调节。为进一步解决表征纠缠问题,PTSM 施加信息论约束,将潜在嵌入分解为正交的任务相关和受试者相关子空间。该模型通过集成分类、对比和解缠目标的多目标损失进行端到端训练。 在跨受试者运动意象数据集上进行的大量实验表明,PTSM 在零样本泛化方面表现出色,在无需受试者特定校准的情况下优于最先进的基线方法。结果凸显了在非平稳神经生理环境中,采用可解耦的神经表征能够实现既个性化又可迁移的解码的有效性。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-15 09:51:14 UTC 发布:2025-08-15 09:51:14 UTC
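A minimal sketch of the factorized dual-branch masking as the abstract describes it: each branch holds one temporal and one spatial mask whose outer product modulates the EEG tensor, with one personalized branch and one shared branch. The sigmoid-gated parameterization is our assumption.

```python
# Hedged sketch of factorized spatio-temporal masking for EEG decoding.
import torch
import torch.nn as nn

class FactorizedMaskBranch(nn.Module):
    def __init__(self, n_channels: int, n_timesteps: int):
        super().__init__()
        self.spatial = nn.Parameter(torch.zeros(n_channels))
        self.temporal = nn.Parameter(torch.zeros(n_timesteps))

    def forward(self, eeg: torch.Tensor) -> torch.Tensor:
        # eeg: (B, C, T); the mask is the outer product of two small vectors,
        # which keeps the modulation fine-grained but cheap.
        mask = torch.sigmoid(self.spatial)[:, None] * torch.sigmoid(self.temporal)[None, :]
        return eeg * mask

personal = FactorizedMaskBranch(n_channels=22, n_timesteps=1000)
shared = FactorizedMaskBranch(n_channels=22, n_timesteps=1000)
eeg = torch.randn(8, 22, 1000)
z_subject, z_task = personal(eeg), shared(eeg)   # fed to disentangled heads
```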
#38 ETTRL: Balancing Exploration and Exploitation in LLM Test-Time Reinforcement Learning Via Entropy Mechanism #38 ETTRL:通过熵机制在 LLM 测试时强化学习中平衡探索与利用
Authors: [Jia Liu](https://arxiv.org/search/?searchtype=author&query=Jia Liu), [ChangYi He](https://arxiv.org/search/?searchtype=author&query=ChangYi He), [YingQiao Lin](https://arxiv.org/search/?searchtype=author&query=YingQiao Lin), [MingMin Yang](https://arxiv.org/search/?searchtype=author&query=MingMin Yang), [FeiYang Shen](https://arxiv.org/search/?searchtype=author&query=FeiYang Shen), [ShaoGuo Liu](https://arxiv.org/search/?searchtype=author&query=ShaoGuo Liu), [TingTing Gao](https://arxiv.org/search/?searchtype=author&query=TingTing Gao) 作者:刘佳,何长义,林英乔,杨明敏,沈飞洋,刘少国,高婷婷
Recent advancements in Large Language Models have yielded significant improvements in complex reasoning tasks such as mathematics and programming. However, these models remain heavily dependent on annotated data and exhibit limited adaptability in unsupervised scenarios. To address these limitations, test-time reinforcement learning (TTRL) has been proposed, which enables self-optimization by leveraging model-generated pseudo-labels. Despite its promise, TTRL faces several key challenges, including high inference costs due to parallel rollouts and early-stage estimation bias that fosters overconfidence, reducing output diversity and causing performance plateaus. To address these challenges, we introduce an entropy-based mechanism to enhance the exploration-exploitation balance in test-time reinforcement learning through two strategies: Entropy-fork Tree Majority Rollout (ETMR) and Entropy-based Advantage Reshaping (EAR). Compared with the baseline, our approach enables Llama3.1-8B to achieve a 68% relative improvement in the Pass@1 metric on the AIME 2024 benchmark, while consuming only 60% of the rollout token budget. This highlights our method’s ability to effectively optimize the trade-off between inference efficiency, diversity, and estimation robustness, thereby advancing unsupervised reinforcement learning for open-domain reasoning tasks. 最近在大型语言模型方面的进展在诸如数学和编程等复杂推理任务上取得了显著提升。然而,这些模型仍然高度依赖带注释的数据,并且在无监督场景下表现出有限的适应性。为了解决这些限制,提出了测试时强化学习(TTRL),它通过利用模型生成的伪标签实现自我优化。尽管这一方法很有前景,TTRL 仍面临若干关键挑战,包括由于并行回合导致的高推理成本,以及早期估计偏差促成的过度自信——这会降低输出多样性并导致性能停滞。为了解决这些问题,我们引入了一种基于熵的机制,通过两种策略来增强测试时强化学习中的探索-利用平衡:熵分叉树多数回合(ETMR)和基于熵的优势重塑(EAR)。与基线方法相比,我们的方法使 Llama3.1-8B 在 AIME 2024 基准的 Pass@1 指标上实现了 68% 的相对提升,同时只消耗了 60% 的回合令牌预算。 这突出了我们的方法在推理效率、多样性与估计鲁棒性之间有效优化权衡的能力,从而推动了开放领域推理任务的无监督强化学习发展。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-15 09:49:14 UTC 发布:2025-08-15 09:49:14 UTC
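A sketch of the test-time loop this builds on: sample rollouts, majority-vote a pseudo-label, and use the entropy of the vote distribution to decide whether to fork more rollouts or stop early, which is how token budget can be saved. The threshold and fork size are illustrative assumptions.

```python
# Hedged sketch of entropy-guided test-time rollouts with majority voting.
import math
import random
from collections import Counter

def vote_entropy(answers: list[str]) -> float:
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def entropy_fork_rollout(sample_fn, budget=16, seed_rollouts=4, threshold=0.6):
    answers = [sample_fn() for _ in range(seed_rollouts)]
    # Fork additional rollouts only while the vote distribution stays uncertain.
    while len(answers) < budget and vote_entropy(answers) > threshold:
        answers.extend(sample_fn() for _ in range(4))
    pseudo_label = Counter(answers).most_common(1)[0][0]
    return pseudo_label, answers   # pseudo_label then drives the TTRL update

fake_model = lambda: random.choice(["42", "42", "42", "17"])
print(entropy_fork_rollout(fake_model))
```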
#39 Leveraging the RETFound foundation model for optic disc segmentation in retinal images #39 利用 RETFound 基础模型在视网膜图像中进行视盘分割
Authors: [Zhenyi Zhao](https://arxiv.org/search/?searchtype=author&query=Zhenyi Zhao), [Muthu Rama Krishnan Mookiah](https://arxiv.org/search/?searchtype=author&query=Muthu Rama Krishnan Mookiah), [Emanuele Trucco](https://arxiv.org/search/?searchtype=author&query=Emanuele Trucco) 作者:赵振毅、Muthu Rama Krishnan Mookiah、Emanuele Trucco
RETFound is a well-known foundation model (FM) developed for fundus camera and optical coherence tomography images. It has shown promising performance across multiple datasets in diagnosing diseases, both eye-specific and systemic, from retinal images. However, to the best of our knowledge, it has not been used for other tasks. We present the first adaptation of RETFound for optic disc segmentation, a ubiquitous and foundational task in retinal image analysis. The resulting segmentation system outperforms state-of-the-art, segmentation-specific baseline networks after training a head with only a very modest number of task-specific examples. We report and discuss results with four public datasets, IDRID, Drishti-GS, RIM-ONE-r3, and REFUGE, and a private dataset, GoDARTS, achieving about 96% Dice consistently across all datasets. Overall, our method obtains excellent performance in internal verification, domain generalization and domain adaptation, and exceeds most of the state-of-the-art baseline results. We discuss the results in the framework of the debate about FMs as alternatives to task-specific architectures. The code is available at: [link to be added after the paper is accepted] RETFound 是一个为眼底相机和光学相干断层扫描图像开发的知名基础模型(FM)。它在从视网膜图像诊断眼部特异性及系统性疾病的多个数据集上表现出良好的性能。然而,据我们所知,它尚未被用于其他任务。我们首次将 RETFound 适配用于视盘分割——这是视网膜图像分析中普遍且基础的任务。经过仅用非常有限数量的任务特定样本训练一个头部后,所得到的分割系统在性能上超过了以分割为特定任务的最先进基线网络。我们在四个公共数据集 IDRID、Drishti-GS、RIM-ONE-r3 和 REFUGE 以及一个私有数据集 GoDARTS 上报告并讨论了结果,在所有数据集上稳定达到约 96% 的 Dice 值。总体而言,我们的方法在内部验证、域泛化和域自适应方面取得了出色的表现,并超过了大多数最先进的基线结果。我们在将基础模型作为任务特定架构替代方案的讨论框架中对结果进行了探讨。代码可在以下位置获取:[论文接受后将添加链接]
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning 主题:计算机视觉与模式识别,人工智能,机器学习
Publish: 2025-08-15 09:43:49 UTC 发布时间:2025-08-15 09:43:49 协调世界时
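For reference, the Dice score the paper reports (about 96%) is computed as follows for binary optic-disc masks; the toy masks are ours.

```python
# Dice coefficient for binary segmentation masks.
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Dice = 2|P ∩ G| / (|P| + |G|) for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    return 2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum() + eps)

pred = np.zeros((256, 256)); pred[100:160, 100:160] = 1
gt = np.zeros((256, 256)); gt[104:164, 104:164] = 1
print(f"Dice = {dice(pred, gt):.3f}")
```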
#40 NeMo: A Neuron-Level Modularizing-While-Training Approach for Decomposing DNN Models #40 NeMo:一种在训练中进行神经元级模块化以分解深度神经网络模型的方法
Authors: [Xiaohan Bi](https://arxiv.org/search/?searchtype=author&query=Xiaohan Bi), [Binhang Qi](https://arxiv.org/search/?searchtype=author&query=Binhang Qi), [Hailong Sun](https://arxiv.org/search/?searchtype=author&query=Hailong Sun), [Xiang Gao](https://arxiv.org/search/?searchtype=author&query=Xiang Gao), [Yue Yu](https://arxiv.org/search/?searchtype=author&query=Yue Yu), [Xiaojun Liang](https://arxiv.org/search/?searchtype=author&query=Xiaojun Liang) 作者:毕晓涵、齐滨航、孙海龙、高翔、余悦、梁晓军
With the growing incorporation of deep neural network (DNN) models into modern software systems, the prohibitive construction costs have become a significant challenge. Model reuse has been widely applied to reduce training costs, but indiscriminately reusing entire models may incur significant inference overhead. Consequently, DNN modularization has gained attention, enabling module reuse by decomposing DNN models. The emerging modularizing-while-training (MwT) paradigm, which incorporates modularization into training, outperforms modularizing-after-training approaches. However, existing MwT methods focus on small-scale CNN models at the convolutional kernel level and struggle with diverse DNNs and large-scale models, particularly Transformer-based models. To address these limitations, we propose NeMo, a scalable and generalizable MwT approach. NeMo operates at the neuron level, a fundamental component common to all DNNs, ensuring applicability to Transformers and various architectures. We design a contrastive learning-based modular training method with an effective composite loss function, enabling scalability to large-scale models. Comprehensive experiments on two Transformer-based models and four CNN models across two classification datasets demonstrate NeMo’s superiority over state-of-the-art MwT methods. Results show average gains of 1.72% in module classification accuracy and 58.10% reduction in module size, demonstrating efficacy across both CNN and large-scale Transformer-based models. A case study on open-source projects shows NeMo’s potential benefits in practical scenarios, offering a promising approach for scalable and generalizable DNN modularization. 随着深度神经网络(DNN)模型在现代软件系统中日益广泛地被采用,昂贵的构建成本已成为一大挑战。模型重用被广泛应用以降低训练成本,但不加选择地重用整个模型可能会带来显著的推理开销。因此,DNN 模块化引起了关注,通过将 DNN 模型分解来实现模块重用。新兴的“在训练中模块化”(modularizing-while-training,MwT)范式将模块化融入训练中,其性能优于训练后模块化的方法。然而,现有的 MwT 方法侧重于小规模的卷积神经网络(CNN)模型、在卷积核层面进行模块化,难以应对多样的 DNN 和大规模模型,尤其是基于 Transformer 的模型。为了解决这些局限,我们提出了 NeMo,一种可扩展且具通用性的 MwT 方法。NeMo 在神经元级别——所有 DNN 通用的基本组件上运行,从而确保适用于 Transformer 和各种架构。我们设计了一种基于对比学习的模块化训练方法,配以有效的复合损失函数,使其可扩展到大规模模型。 在两个基于 Transformer 的模型和四个 CNN 模型、两个分类数据集上的综合实验表明,NeMo 优于最先进的 MwT 方法。结果显示,模块分类准确率平均提高 1.72%,模块规模平均缩小 58.10%,证明了其在 CNN 和大型基于 Transformer 的模型上的有效性。对开源项目的案例研究表明,NeMo 在实际场景中具有潜在益处,为可扩展且具通用性的 DNN 模块化提供了一种有前景的方法。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-15 09:25:40 UTC 发布:2025-08-15 09:25:40 UTC
#41 RegimeNAS: Regime-Aware Differentiable Architecture Search With Theoretical Guarantees for Financial Trading #41 RegimeNAS:具有理论保证的面向金融交易的市场状态感知可微分架构搜索
Authors: [Prathamesh Devadiga](https://arxiv.org/search/?searchtype=author&query=Prathamesh Devadiga), [Yashmitha Shailesh](https://arxiv.org/search/?searchtype=author&query=Yashmitha Shailesh) 作者:Prathamesh Devadiga,Yashmitha Shailesh
We introduce RegimeNAS, a novel differentiable architecture search framework specifically designed to enhance cryptocurrency trading performance by explicitly integrating market regime awareness. Addressing the limitations of static deep learning models in highly dynamic financial environments, RegimeNAS features three core innovations: (1) a theoretically grounded Bayesian search space optimizing architectures with provable convergence properties; (2) specialized, dynamically activated neural modules (Volatility, Trend, and Range blocks) tailored for distinct market conditions; and (3) a multi-objective loss function incorporating market-specific penalties (e.g., volatility matching, transition smoothness) alongside mathematically enforced Lipschitz stability constraints. Regime identification leverages multi-head attention across multiple timeframes for improved accuracy and uncertainty estimation. Rigorous empirical evaluation on extensive real-world cryptocurrency data demonstrates that RegimeNAS significantly outperforms state-of-the-art benchmarks, achieving an 80.3% Mean Absolute Error reduction compared to the best traditional recurrent baseline and converging substantially faster (9 vs. 50+ epochs). Ablation studies and regime-specific analysis confirm the critical contribution of each component, particularly the regime-aware adaptation mechanism. This work underscores the imperative of embedding domain-specific knowledge, such as market regimes, directly within the NAS process to develop robust and adaptive models for challenging financial applications. 我们提出了 RegimeNAS,一种新颖的可微分架构搜索框架,专门通过显式整合市场状态感知来提升加密货币交易表现。为应对高度动态金融环境中静态深度学习模型的局限性,RegimeNAS 具有三项核心创新:(1)一个有理论依据的贝叶斯搜索空间,用于优化具有可证明收敛性的架构;(2)专门的、动态激活的神经模块(波动、趋势和区间块),针对不同市场状态定制;以及(3)一个多目标损失函数,除包含市场特定惩罚项(例如波动匹配、转换平滑性)外,还结合了数学上强制的 Lipschitz 稳定性约束。市场状态识别利用跨多个时间框架的多头注意力来提高准确性和不确定性估计。 在广泛的真实加密货币数据上的严格实证评估表明,RegimeNAS 显著优于最先进的基准方法,相较于表现最好的传统递归基线,平均绝对误差降低了 80.3%,并且收敛速度大幅更快(9 个周期对比 50+ 个周期)。消融研究和按市场状态分段的分析证实了各个组件的关键贡献,尤其是具备市场状态感知的自适应机制。本研究强调了在神经架构搜索过程中直接嵌入领域特定知识(如市场状态)的必要性,以为具有挑战性的金融应用开发稳健且自适应的模型。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-15 09:09:54 UTC 发布:2025-08-15 09:09:54 UTC
#42 SGSimEval: A Comprehensive Multifaceted and Similarity-Enhanced Benchmark for Automatic Survey Generation Systems #42 SGSimEval:用于自动综述生成系统的全面多面向与相似性增强基准
Authors: [Beichen Guo](https://arxiv.org/search/?searchtype=author&query=Beichen Guo), [Zhiyuan Wen](https://arxiv.org/search/?searchtype=author&query=Zhiyuan Wen), [Yu Yang](https://arxiv.org/search/?searchtype=author&query=Yu Yang), [Peng Gao](https://arxiv.org/search/?searchtype=author&query=Peng Gao), [Ruosong Yang](https://arxiv.org/search/?searchtype=author&query=Ruosong Yang), [Jiaxing Shen](https://arxiv.org/search/?searchtype=author&query=Jiaxing Shen) 作者:郭北辰、文志远、杨宇、高鹏、杨若松、沈嘉兴
The growing interest in automatic survey generation (ASG), a task that traditionally required considerable time and effort, has been spurred by recent advances in large language models (LLMs). With advancements in retrieval-augmented generation (RAG) and the rising popularity of multi-agent systems (MASs), synthesizing academic surveys using LLMs has become a viable approach, thereby elevating the need for robust evaluation methods in this domain. However, existing evaluation methods suffer from several limitations, including biased metrics, a lack of human preference, and an over-reliance on LLMs-as-judges. To address these challenges, we propose SGSimEval, a comprehensive benchmark for Survey Generation with Similarity-Enhanced Evaluation that evaluates automatic survey generation systems by integrating assessments of the outline, content, and references, and also combines LLM-based scoring with quantitative metrics to provide a multifaceted evaluation framework. In SGSimEval, we also introduce human preference metrics that emphasize both inherent quality and similarity to humans. Extensive experiments reveal that current ASG systems demonstrate human-comparable superiority in outline generation, while showing significant room for improvement in content and reference generation, and our evaluation metrics maintain strong consistency with human assessments. 随着大型语言模型(LLMs)近期的发展,自动综述生成(ASG)这一传统上需要大量时间和精力的任务日益受到关注。随着检索增强生成(RAG)的进步和多智能体系统(MASs)日益流行,使用 LLMs 合成学术综述已成为一种可行的方法,从而提升了该领域对稳健评估方法的需求。然而,现有的评估方法存在若干局限性,包括评价指标偏颇、缺乏人工偏好以及过度依赖 LLMs 作为评判者。为了解决这些挑战,我们提出了 SGSimEval——一个用于综述生成的相似性增强评估基准,该基准通过整合对大纲、内容和参考文献的评估,并将基于 LLM 的评分与定量指标相结合,提供多维度的评估框架。在 SGSimEval 中,我们还引入了强调固有质量和与人工相似性的人工偏好度量。 大量实验表明,当前的 ASG 系统在大纲生成方面展现出与人类相当的优势,而在内容和参考生成上仍有显著提升空间,且我们的评估指标与人工评估保持高度一致。
Subjects: Computation and Language, Artificial Intelligence, Information Retrieval 主题:计算与语言、人工智能、信息检索
Publish: 2025-08-15 08:27:58 UTC 发布:2025-08-15 08:27:58 UTC
#43 Dynamic Quality-Latency Aware Routing for LLM Inference in Wireless Edge-Device Networks #43 面向无线边缘-设备网络中 LLM 推理的动态质量-延迟感知路由
Authors: [Rui Bao](https://arxiv.org/search/?searchtype=author&query=Rui Bao), [Nan Xue](https://arxiv.org/search/?searchtype=author&query=Nan Xue), [Yaping Sun](https://arxiv.org/search/?searchtype=author&query=Yaping Sun), [Zhiyong Chen](https://arxiv.org/search/?searchtype=author&query=Zhiyong Chen) 作者:包睿,薛楠,孙雅苹,陈志勇
The integration of wireless communications and Large Language Models (LLMs) is poised to unlock ubiquitous intelligent services, yet deploying them in wireless edge-device collaborative environments presents a critical trade-off between inference quality and end-to-end latency. A fundamental mismatch exists between task complexity and resource allocation: offloading simple queries invites prohibitive latency, while on-device models lack the capacity for demanding computations. To address this challenge, we propose a dynamic, quality-latency aware routing framework that orchestrates inference between a lightweight model on the mobile device and a powerful model on the edge server. Our framework employs two distinct cost models: for single-turn queries, it fuses a BERT-predicted semantic score with communication and computation overheads; for multi-turn dialogues, it further quantifies context-aware costs arising from model switching and KV-cache management. While maintaining full inference quality, extensive experiments demonstrate that our framework cuts average response latency by 5-15% and reduces large model invocations by 10-20% against competitive baselines on MMLU, GSM8K, and MT-Bench-101 benchmarks. 将无线通信与 LLMs 集成有望解锁无处不在的智能服务,然而在无线边缘设备协同环境中部署它们时,推理质量与端到端延迟之间存在关键权衡。任务复杂度与资源分配之间存在根本性不匹配:卸载简单查询会带来不可接受的延迟,而设备端模型又无法承担高负载计算。为了解决这一挑战,我们提出了一个动态的、质量-延迟感知的路由框架,在移动设备上的轻量级模型与边缘服务器上的强大模型之间协调推理。我们的框架采用两种不同的代价模型:对于单轮查询,它将 BERT 预测的语义得分与通信和计算开销融合;对于多轮对话,它进一步量化了由模型切换和 KV-cache 管理产生的上下文感知成本。 在保持完全推理质量的前提下,大量实验表明,与有竞争力的基线相比,我们的框架在 MMLU、GSM8K 和 MT-Bench-101 基准上将平均响应延迟减少了 5–15%,并将大模型调用减少了 10–20%。
Subjects: Information Theory, Artificial Intelligence, Machine Learning 主题:信息论,人工智能,机器学习
Publish: 2025-08-15 07:55:05 UTC 发布:2025-08-15 07:55:05 UTC
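A minimal sketch of the single-turn routing rule as the abstract describes it: fuse a predicted difficulty score with communication and computation overheads and pick the cheaper side. The weights and latency numbers are illustrative assumptions, not the paper's calibration.

```python
# Hedged sketch of a quality-latency aware routing decision for one query.
def route_single_turn(difficulty: float,
                      uplink_s: float, edge_compute_s: float,
                      device_compute_s: float,
                      quality_weight: float = 2.0) -> str:
    # Expected quality loss if the small on-device model answers: assumed
    # proportional to the BERT-predicted difficulty score in [0, 1].
    device_cost = quality_weight * difficulty + device_compute_s
    # The edge model is assumed accurate, so its cost is mostly latency.
    edge_cost = uplink_s + edge_compute_s
    return "edge" if edge_cost < device_cost else "device"

print(route_single_turn(difficulty=0.9, uplink_s=0.4,
                        edge_compute_s=0.8, device_compute_s=0.5))  # edge
print(route_single_turn(difficulty=0.1, uplink_s=0.4,
                        edge_compute_s=0.8, device_compute_s=0.5))  # device
```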
#44 CSGO: Generalized Optimization for Cold Start in Wireless Collaborative Edge LLM Systems #44 CSGO:用于无线协作边缘 LLM 系统冷启动的广义优化
Authors: [Xuran Liu](https://arxiv.org/search/?searchtype=author&query=Xuran Liu), [Nan Xue](https://arxiv.org/search/?searchtype=author&query=Nan Xue), [Rui Bao](https://arxiv.org/search/?searchtype=author&query=Rui Bao), [Yaping Sun](https://arxiv.org/search/?searchtype=author&query=Yaping Sun), [Zhiyong Chen](https://arxiv.org/search/?searchtype=author&query=Zhiyong Chen), [Meixia Tao](https://arxiv.org/search/?searchtype=author&query=Meixia Tao), [Xiaodong Xu](https://arxiv.org/search/?searchtype=author&query=Xiaodong Xu), [Shuguang Cui](https://arxiv.org/search/?searchtype=author&query=Shuguang Cui) 作者:刘徐然、薛楠、包睿、孙亚平、陈志勇、陶美霞、许晓东、崔曙光
While deploying large language models on edge devices promises low-latency and privacy-preserving AI services, it is hindered by limited device resources. Although pipeline parallelism facilitates distributed inference, existing approaches often ignore the cold-start latency caused by on-demand model loading. In this paper, we propose a latency-aware scheduling framework that overlaps model loading with computation and communication to minimize total inference latency. Based on device and model parameters, the framework dynamically adjusts layer partitioning and allocation to effectively hide loading time, thereby eliminating as many idle periods as possible. We formulate the problem as a Mixed-Integer Non-Linear Program and design an efficient dynamic programming algorithm to optimize model partitioning and device assignment. Experimental results show that the proposed method significantly reduces cold-start latency compared to baseline strategies. 在边缘设备上部署大型语言模型虽然能提供低延迟且保护隐私的 AI 服务,但受限于设备资源而面临挑战。尽管流水线并行有助于分布式推理,现有方法常常忽略按需模型加载带来的冷启动延迟。本文提出了一个延迟感知的调度框架,将模型加载与计算和通信重叠,以最小化总体推理延迟。基于设备和模型参数,该框架动态调整层划分和分配以有效隐藏加载时间,从而尽可能消除空闲周期。我们将问题表述为混合整数非线性规划,并设计了一个高效的动态规划算法以优化模型划分和设备分配。实验结果表明,与基线策略相比,所提方法显著降低了冷启动延迟。
Subjects: Information Theory, Artificial Intelligence, Machine Learning 主题:信息论,人工智能,机器学习
Publish: 2025-08-15 07:49:22 UTC 发布:2025-08-15 07:49:22 UTC
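A hedged sketch of the dynamic-programming idea: split the layers into contiguous segments across pipeline devices so each device starts computing as soon as both its weights are loaded (loading runs in parallel from time 0) and its input arrives. The cost model below (uniform per-layer times, no communication term) is a simplifying assumption, not the paper's MINLP.

```python
# Hedged sketch: DP over contiguous layer partitions, overlapping model
# loading with upstream computation to reduce cold-start latency.
import functools

compute_s = [0.02] * 12            # per-layer compute time (assumed uniform)
load_s = [0.10] * 12               # per-layer weight-loading time (assumed)
DEVICES = 3

def seg_compute(a, b): return sum(compute_s[a:b])
def seg_load(a, b): return sum(load_s[a:b])

@functools.lru_cache(maxsize=None)
def finish(layer: int, device: int, ready: float) -> float:
    """Earliest finish time, placing layers[layer:] on devices[device:]."""
    n = len(compute_s)
    if device == DEVICES - 1:       # last device takes all remaining layers
        start = max(ready, seg_load(layer, n))
        return start + seg_compute(layer, n)
    best = float("inf")
    for cut in range(layer + 1, n - (DEVICES - device - 2)):
        # This device computes layers[layer:cut]; it may only start once its
        # weights are loaded AND the activation has arrived (`ready`).
        start = max(ready, seg_load(layer, cut))
        out = start + seg_compute(layer, cut)
        best = min(best, finish(cut, device + 1, out))
    return best

print(f"min end-to-end latency: {finish(0, 0, 0.0):.3f}s")
```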
#45 Scene Graph-Guided Proactive Replanning for Failure-Resilient Embodied Agent #45 场景图引导的主动重规划用于故障恢复型具身智能体
Authors: [Che Rin Yu](https://arxiv.org/search/?searchtype=author&query=Che Rin Yu), [Daewon Chae](https://arxiv.org/search/?searchtype=author&query=Daewon Chae), [Dabin Seo](https://arxiv.org/search/?searchtype=author&query=Dabin Seo), [Sangwon Lee](https://arxiv.org/search/?searchtype=author&query=Sangwon Lee), [Hyeongwoo Im](https://arxiv.org/search/?searchtype=author&query=Hyeongwoo Im), [Jinkyu Kim](https://arxiv.org/search/?searchtype=author&query=Jinkyu Kim) 作者:Che Rin Yu、Daewon Chae、Dabin Seo、Sangwon Lee、Hyeongwoo Im、Jinkyu Kim
When humans perform everyday tasks, we naturally adjust our actions based on the current state of the environment. For instance, if we intend to put something into a drawer but notice it is closed, we open it first. However, many autonomous robots lack this adaptive awareness. They often follow pre-planned actions that may overlook subtle yet critical changes in the scene, which can result in actions being executed under outdated assumptions and eventual failure. While replanning is critical for robust autonomy, most existing methods respond only after failures occur, when recovery may be inefficient or infeasible. While proactive replanning holds promise for preventing failures in advance, current solutions often rely on manually designed rules and extensive supervision. In this work, we present a proactive replanning framework that detects and corrects failures at subtask boundaries by comparing scene graphs constructed from current RGB-D observations against reference graphs extracted from successful demonstrations. When the current scene fails to align with reference trajectories, a lightweight reasoning module is activated to diagnose the mismatch and adjust the plan. Experiments in the AI2-THOR simulator demonstrate that our approach detects semantic and spatial mismatches before execution failures occur, significantly improving task success and robustness. 当人类执行日常任务时,我们会根据当前环境状态自然而然地调整动作。例如,如果我们打算把东西放进抽屉但发现抽屉是关着的,我们会先把它打开。然而,许多自主机器人缺乏这种适应性意识。它们经常遵循预先规划的动作,可能忽视场景中微妙但关键的变化,这会导致在过时的假设下执行动作并最终失败。尽管重新规划对于鲁棒自主性至关重要,但现有大多数方法仅在失败发生后才做出响应,而此时恢复可能效率低下或不可行。尽管主动重新规划有望提前预防失败,但当前的解决方案常常依赖手工设计的规则和大量监督。在本工作中,我们提出了一个主动重新规划框架,通过将从当前 RGB-D 观测构建的场景图与从成功示范中提取的参考图进行比较,在子任务边界处检测并纠正失败。当当前场景未能与参考轨迹对齐时,会激活一个轻量级推理模块来诊断不匹配并调整计划。 在 AI2-THOR 模拟器中的实验表明,我们的方法能在执行失败发生之前检测到语义和空间不匹配,从而显著提高任务成功率和稳健性。
Subjects: Robotics, Artificial Intelligence, Computer Vision and Pattern Recognition 学科:机器人、人工智能、计算机视觉与模式识别
Publish: 2025-08-15 07:48:51 UTC 发布时间:2025-08-15 07:48:51 协调世界时(UTC)
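A minimal sketch of the subtask-boundary check: represent scene graphs as sets of (subject, relation, object) triples and diff the current graph against the reference from a successful demonstration; any mismatch triggers the reasoning module. The triple vocabulary is illustrative.

```python
# Hedged sketch of scene-graph comparison at a subtask boundary.
def graph_mismatch(current: set, reference: set):
    missing = reference - current     # preconditions that no longer hold
    extra = current - reference       # unexpected changes in the scene
    return missing, extra

reference = {("drawer", "state", "open"), ("cup", "inside", "drawer_region")}
current = {("drawer", "state", "closed"), ("cup", "inside", "drawer_region")}

missing, extra = graph_mismatch(current, reference)
if missing or extra:
    # e.g. missing={("drawer","state","open")} -> insert an "open drawer"
    # subtask before continuing with the pre-planned actions.
    print("replan:", missing, extra)
```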
#46 ToxiFrench: Benchmarking and Enhancing Language Models via CoT Fine-Tuning for French Toxicity Detection #46 ToxiFrench:通过链式思考微调对法语有害性检测进行基准测试与增强
Authors: [Axel Delaval](https://arxiv.org/search/?searchtype=author&query=Axel Delaval), [Shujian Yang](https://arxiv.org/search/?searchtype=author&query=Shujian Yang), [Haicheng Wang](https://arxiv.org/search/?searchtype=author&query=Haicheng Wang), [Han Qiu](https://arxiv.org/search/?searchtype=author&query=Han Qiu), [Jialiang Lu](https://arxiv.org/search/?searchtype=author&query=Jialiang Lu) 作者:Axel Delaval、Shujian Yang、Haicheng Wang、Han Qiu、Jialiang Lu
Detecting toxic content using language models is crucial yet challenging. While substantial progress has been made in English, toxicity detection in French remains underdeveloped, primarily due to the lack of culturally relevant, large-scale datasets. In this work, we introduce TOXIFRENCH, a new public benchmark of 53,622 French online comments, constructed via a semi-automated annotation pipeline that reduces manual labeling to only 10% through high-confidence LLM-based pre-annotation and human verification. Then, we benchmark a broad range of models and uncover a counterintuitive insight: Small Language Models (SLMs) outperform many larger models in robustness and generalization under the toxicity detection task. Motivated by this finding, we propose a novel Chain-of-Thought (CoT) fine-tuning strategy using a dynamic weighted loss that progressively emphasizes the model’s final decision, significantly improving faithfulness. Our fine-tuned 4B model achieves state-of-the-art performance, improving its F1 score by 13% over its baseline and outperforming LLMs such as GPT-4o and Gemini-2.5. Further evaluation on a cross-lingual toxicity benchmark demonstrates strong multilingual ability, suggesting that our methodology can be effectively extended to other languages and safety-critical classification tasks. 使用语言模型检测有害内容至关重要但具有挑战性。尽管在英文方面已取得显著进展,法语的有害内容检测仍不够成熟,主要原因是缺乏具有文化相关性的大规模数据集。在本工作中,我们引入了 TOXIFRENCH,这是一个包含 53,622 条法语在线评论的新公开基准数据集,采用半自动化注释流程构建,通过基于高置信度 LLM 的预注释与人工验证将人工标注工作量降至仅 10%。随后,我们对多种模型进行了基准测试,并揭示了一个违反直觉的见解:在有害内容检测任务中,小型语言模型(SLMs)在鲁棒性和泛化能力方面优于许多更大型的模型。基于这一发现,我们提出了一种新颖的链式思维(CoT)微调策略,使用动态加权损失逐步强调模型的最终判定,从而显著提高了忠实性。 我们微调的 4B 模型达到了最先进的性能,其 F1 得分比基线提高了 13%,并且优于诸如 GPT-4o 和 Gemini-2.5 等 LLMs。在跨语言毒性基准上的进一步评估显示了强大的多语言能力,这表明我们的方法可以有效扩展到其他语言和对安全性要求极高的分类任务。
Subjects: Computation and Language, Artificial Intelligence, Computers and Society
Publish: 2025-08-15 07:40:41 UTC 发布:2025-08-15 07:40:41 UTC
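A sketch of the dynamically weighted CoT loss as the abstract describes it: per-token cross-entropy weights that increasingly emphasize the final-decision tokens as training progresses. The linear ramp and its 1 + 4 * progress schedule are our assumptions.

```python
# Hedged sketch of a dynamically weighted CoT fine-tuning loss.
import torch
import torch.nn.functional as F

def weighted_cot_loss(logits, targets, decision_start, progress):
    """logits: (T,V); targets: (T,); decision_start: index where the final
    decision tokens begin; progress: training progress in [0, 1]."""
    T = targets.shape[0]
    ce = F.cross_entropy(logits, targets, reduction="none")   # (T,)
    w = torch.ones(T)
    # Later in training (progress -> 1), decision tokens weigh ever more.
    w[decision_start:] = 1.0 + 4.0 * progress
    return (w * ce).sum() / w.sum()

logits = torch.randn(30, 32000)
targets = torch.randint(0, 32000, (30,))
print(weighted_cot_loss(logits, targets, decision_start=25, progress=0.8))
```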
#47 LETToT: Label-Free Evaluation of Large Language Models On Tourism Using Expert Tree-of-Thought #47 LETToT:在旅游领域使用专家思维树对大型语言模型进行无标签评估
Authors: [Ruiyan Qi](https://arxiv.org/search/?searchtype=author&query=Ruiyan Qi), [Congding Wen](https://arxiv.org/search/?searchtype=author&query=Congding Wen), [Weibo Zhou](https://arxiv.org/search/?searchtype=author&query=Weibo Zhou), [Shangsong Liang](https://arxiv.org/search/?searchtype=author&query=Shangsong Liang), [Lingbo Li](https://arxiv.org/search/?searchtype=author&query=Lingbo Li) 作者:齐瑞言,文聪定,周伟波,梁尚松,李灵博
Evaluating large language models (LLMs) in specific domains like tourism remains challenging due to the prohibitive cost of annotated benchmarks and persistent issues like hallucinations. We propose Label-Free Evaluation of LLMs on Tourism using Expert Tree-of-Thought (LETToT), a framework that leverages expert-derived reasoning structures, instead of labeled data, to assess LLMs in tourism. First, we iteratively refine and validate hierarchical ToT components through alignment with generic quality dimensions and expert feedback. Results demonstrate the effectiveness of our systematically optimized expert ToT with 4.99-14.15% relative quality gains over baselines. Second, we apply LETToT’s optimized expert ToT to evaluate models of varying scales (32B-671B parameters), revealing: (1) Scaling laws persist in specialized domains (DeepSeek-V3 leads), yet reasoning-enhanced smaller models (e.g., DeepSeek-R1-Distill-Llama-70B) close this gap; (2) For sub-72B models, explicit reasoning architectures outperform counterparts in accuracy and conciseness (p<0.05). Our work establishes a scalable, label-free paradigm for domain-specific LLM evaluation, offering a robust alternative to conventional annotated benchmarks. 在旅游等特定领域评估大型语言模型(LLMs)仍然具有挑战性,原因在于带注释基准的高额成本以及持续存在的幻觉问题。我们提出了 Label-Free Evaluation of LLMs on Tourism using Expert Tree-of-Thought(LETToT),一个利用专家推理结构——而非带标签数据——来评估旅游领域 LLM 的框架。首先,我们通过与通用质量维度和专家反馈的对齐,迭代地完善和验证分层的 ToT 组件。结果证明我们系统优化的专家 ToT 的有效性,相较基线实现了 4.99%–14.15%的相对质量提升。其次,我们将 LETToT 优化的专家 ToT 应用于不同规模(32B–671B 参数)的模型评估,揭示了:(1) 在专业领域中扩展规律仍然存在(DeepSeek-V3 领先),但增强推理能力的较小模型(例如 DeepSeek-R1-Distill-Llama-70B)可以缩小这一差距;(2) 对于小于 72B 的模型,显式推理架构在准确性和简洁性上优于对应模型(p<0.05)。 我们的工作建立了一种可扩展、免标注的领域特定 LLM 评估范式,为传统有注释基准提供了一个稳健的替代方案。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-15 07:37:12 UTC 发布:2025-08-15 07:37:12 UTC
#48 Is General-Purpose AI Reasoning Sensitive to Data-Induced Cognitive Biases? Dynamic Benchmarking on Typical Software Engineering Dilemmas #48 通用人工智能的推理是否对数据引发的认知偏差敏感?关于典型软件工程两难问题的动态基准测试
Authors: [Francesco Sovrano](https://arxiv.org/search/?searchtype=author&query=Francesco Sovrano), [Gabriele Dominici](https://arxiv.org/search/?searchtype=author&query=Gabriele Dominici), [Rita Sevastjanova](https://arxiv.org/search/?searchtype=author&query=Rita Sevastjanova), [Alessandra Stramiglio](https://arxiv.org/search/?searchtype=author&query=Alessandra Stramiglio), [Alberto Bacchelli](https://arxiv.org/search/?searchtype=author&query=Alberto Bacchelli) 作者:Francesco Sovrano、Gabriele Dominici、Rita Sevastjanova、Alessandra Stramiglio、Alberto Bacchelli
Human cognitive biases in software engineering can lead to costly errors. While general-purpose AI (GPAI) systems may help mitigate these biases due to their non-human nature, their training on human-generated data raises a critical question: Do GPAI systems themselves exhibit cognitive biases? To investigate this, we present the first dynamic benchmarking framework to evaluate data-induced cognitive biases in GPAI within software engineering workflows. Starting with a seed set of 16 hand-crafted realistic tasks, each featuring one of 8 cognitive biases (e.g., anchoring, framing) and corresponding unbiased variants, we test whether bias-inducing linguistic cues unrelated to task logic can lead GPAI systems from correct to incorrect conclusions. To scale the benchmark and ensure realism, we develop an on-demand augmentation pipeline relying on GPAI systems to generate task variants that preserve bias-inducing cues while varying surface details. This pipeline ensures correctness (88–99% on average, according to human evaluation), promotes diversity, and controls reasoning complexity by leveraging Prolog-based reasoning and LLM-as-a-judge validation. It also verifies that the embedded biases are both harmful and undetectable by logic-based, unbiased reasoners. We evaluate leading GPAI systems (GPT, LLaMA, DeepSeek) and find a consistent tendency to rely on shallow linguistic heuristics over deep reasoning. All systems exhibit cognitive biases (ranging from 5.9% to 35% across types), with bias sensitivity increasing sharply with task complexity (up to 49%), highlighting critical risks in real-world software engineering deployments. 人类在软件工程中的认知偏差可能导致代价高昂的错误。尽管通用人工智能(GPAI)系统由于非人类本质可能有助于减轻这些偏差,但它们以人类生成的数据进行训练,这提出了一个关键问题:GPAI 系统本身是否也表现出认知偏差?为探究这一点,我们提出了第一个用于评估软件工程工作流中由数据引起的认知偏差的动态基准框架。我们从一组由人工精心设计的 16 个现实任务出发,每个任务体现 8 种认知偏差之一(例如锚定、框架效应)及其相应的无偏变体,测试与任务逻辑无关的诱导偏见的语言线索是否会将 GPAI 系统从正确的结论引导到错误的结论。为扩展基准并保证真实性,我们开发了一个按需增强流水线,依赖 GPAI 系统生成任务变体,这些变体在保持诱导偏见线索的同时改变表面细节。该流水线通过利用基于 Prolog 的推理和以 LLM 作为裁判的验证,确保正确性(根据人工评估平均为 88%–99%)、促进多样性并控制推理复杂性。 它还验证了嵌入的偏见既有害又对基于逻辑的、无偏见的推理者无法检测。我们评估了领先的 GPAI 系统(GPT、LLaMA、DeepSeek),发现它们存在一致的倾向——更依赖浅层语言启发式而非深度推理。所有系统都表现出认知偏差(各种类型的偏差范围为 5.9%到 35%),且随着任务复杂度增加偏差敏感性急剧上升(最高可达 49%),这凸显了在实际软件工程部署中的重大风险。
Subjects: Human-Computer Interaction, Artificial Intelligence, Software Engineering 主题:人机交互、人工智能、软件工程
Publish: 2025-08-15 07:29:46 UTC 发布:2025-08-15 07:29:46 协调世界时(UTC)
#49 Enhancing Supervised Composed Image Retrieval via Reasoning-Augmented Representation Engineering #49 通过推理增强的表示工程提升有监督组合图像检索
Authors: [Jun Li](https://arxiv.org/search/?searchtype=author&query=Jun Li), [Kai Li](https://arxiv.org/search/?searchtype=author&query=Kai Li), [Shaoguo Liu](https://arxiv.org/search/?searchtype=author&query=Shaoguo Liu), [Tingting Gao](https://arxiv.org/search/?searchtype=author&query=Tingting Gao) 作者:李军、李凯、刘少国、高婷婷
Composed Image Retrieval (CIR) presents a significant challenge as it requires jointly understanding a reference image and a modified textual instruction to find relevant target images. Some existing methods attempt to use a two-stage approach to further refine retrieval results. However, this often requires additional training of a ranking model. Despite the success of Chain-of-Thought (CoT) techniques in reducing training costs for language models, their application in CIR tasks remains limited – compressing visual information into text or relying on elaborate prompt designs. Besides, existing works only utilize it for zero-shot CIR, as it is challenging to achieve satisfactory results in supervised CIR with a well-trained model. In this work, we propose a framework that includes the Pyramid Matching Model with Training-Free Refinement (PMTFR) to address these challenges. Through a simple but effective module called Pyramid Patcher, we enhanced the Pyramid Matching Model’s understanding of visual information at different granularities. Inspired by representation engineering, we extracted representations from CoT data and injected them into the LVLMs. This approach allowed us to obtain refined retrieval scores in the Training-Free Refinement paradigm without relying on explicit textual reasoning, further enhancing performance. Extensive experiments on CIR benchmarks demonstrate that PMTFR surpasses state-of-the-art methods in supervised CIR tasks. The code will be made public. 组合图像检索(CIR)提出了一项重大挑战,因为它需要联合理解参考图像和修改性文本指令以查找相关目标图像。一些现有方法尝试使用两阶段方法来进一步精炼检索结果,然而这通常需要额外训练一个排序模型。尽管连锁思维(Chain-of-Thought,CoT)技术在降低语言模型训练成本方面取得了成功,但它们在 CIR 任务中的应用仍然有限——要么将视觉信息压缩为文本,要么依赖复杂的提示设计。此外,现有工作仅在零样本 CIR 中使用 CoT,因为在监督 CIR 中用训练良好的模型实现令人满意的结果具有挑战性。在本工作中,我们提出了一个框架,其中包含带有无训练精炼的金字塔匹配模型(Pyramid Matching Model with Training-Free Refinement,PMTFR)以应对这些挑战。通过一个简单但有效的模块——金字塔补丁器(Pyramid Patcher),我们增强了金字塔匹配模型对不同粒度视觉信息的理解。受表示工程的启发,我们从 CoT 数据中提取表示并将其注入到大型视觉语言模型(LVLMs)中。 这种方法使我们在无训练精炼(Training-Free Refinement)范式中获得了更精细的检索评分,而无需依赖显式的文本推理,从而进一步提升了性能。大量在组合图像检索(CIR)基准上的实验证明,PMTFR 在有监督的 CIR 任务上超越了最先进的方法。代码将会公开。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-15 07:10:10 UTC 发布:2025-08-15 07:10:10 UTC
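A minimal sketch of the representation-engineering step as we read it: extract a direction from CoT data (here, the mean difference between hidden states of CoT and non-CoT prompts) and add it to a hidden layer through a forward hook, steering scoring without explicit textual reasoning. The model, layer index, and scale alpha are hypothetical, as is the mean-difference recipe.

```python
# Hedged sketch of injecting a CoT-derived direction into an LVLM layer.
import torch

def make_injection_hook(direction: torch.Tensor, alpha: float = 4.0):
    direction = direction / direction.norm()
    def hook(module, inputs, output):
        return output + alpha * direction      # shift the residual stream
    return hook

# cot_states / plain_states: (N, hidden) activations gathered offline from
# prompts with and without chain-of-thought (assumed available).
hidden = 1024
cot_states = torch.randn(64, hidden) + 0.5
plain_states = torch.randn(64, hidden)
direction = cot_states.mean(0) - plain_states.mean(0)

# Hypothetical usage with an LVLM exposing a list of transformer layers:
# handle = model.layers[20].register_forward_hook(make_injection_hook(direction))
# ... compute refined retrieval scores with the steered model ...
# handle.remove()
```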
#50 Vision-Language Models display a strong gender bias #50 视觉-语言模型表现出强烈的性别偏见
Authors: [Aiswarya Konavoor](https://arxiv.org/search/?searchtype=author&query=Aiswarya Konavoor), [Raj Abhijit Dandekar](https://arxiv.org/search/?searchtype=author&query=Raj Abhijit Dandekar), [Rajat Dandekar](https://arxiv.org/search/?searchtype=author&query=Rajat Dandekar), [Sreedath Panat](https://arxiv.org/search/?searchtype=author&query=Sreedath Panat) 作者:Aiswarya Konavoor、Raj Abhijit Dandekar、Rajat Dandekar、Sreedath Panat
Vision-language models (VLM) align images and text in a shared representation space that is useful for retrieval and zero-shot transfer. Yet, this alignment can encode and amplify social stereotypes in subtle ways that are not obvious from standard accuracy metrics. In this study, we test whether the contrastive vision-language encoder exhibits gender-linked associations when it places embeddings of face images near embeddings of short phrases that describe occupations and activities. We assemble a dataset of 220 face photographs split by perceived binary gender and a set of 150 unique statements distributed across six categories covering emotional labor, cognitive labor, domestic labor, technical labor, professional roles, and physical labor. We compute unit-norm image embeddings for every face and unit-norm text embeddings for every statement, then define a statement-level association score as the difference between the mean cosine similarity to the male set and the mean cosine similarity to the female set, where positive values indicate stronger association with the male set and negative values indicate stronger association with the female set. We attach bootstrap confidence intervals by resampling images within each gender group, aggregate by category with a separate bootstrap over statements, and run a label-swap null model that estimates the level of mean absolute association we would expect if no gender structure were present. The outcome is a statement-wise and category-wise map of gender associations in a contrastive vision-language space, accompanied by uncertainty, simple sanity checks, and a robust gender bias evaluation framework. 视觉-语言模型(VLM)将图像和文本对齐到一个共享的表征空间,这对检索和零样本迁移很有用。然而,这种对齐可能以微妙的方式编码并放大社会刻板印象,而这些并不容易从标准准确率指标中显现出来。在本研究中,我们检验了对比式视觉-语言编码器在将面部图像的嵌入置于描述职业和活动的短语嵌入附近时,是否表现出与性别相关的联想。我们收集了一个包含 220 张面部照片的数据集,按被感知的二元性别划分,并收集了一组由 150 条独特陈述组成的数据,分布在涵盖情感劳动、认知劳动、家务劳动、技术劳动、专业角色和体力劳动的六个类别中。我们为每张面部图像计算单位范数的图像嵌入,为每条陈述计算单位范数的文本嵌入,然后将陈述级联想得分定义为对男性集合的平均余弦相似度与对女性集合的平均余弦相似度之差,正值表示与男性集合的关联更强,负值表示与女性集合的关联更强。 我们通过在每个性别组内对图像进行重抽样来附加自助法置信区间,按类别汇总并对陈述进行单独的自助法重抽样,并运行标签交换的零模型以估计在不存在性别结构时我们期望的平均绝对关联水平。结果是关于对比视觉-语言空间中性别关联的逐条陈述和逐类别的映射,附带不确定性、简单的合理性检查以及一个稳健的性别偏见评估框架。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-15 06:57:26 UTC 发布:2025-08-15 06:57:26 UTC
#51 Hallucination in LLM-Based Code Generation: An Automotive Case Study #51 基于 LLM 的代码生成中的幻觉:汽车领域案例研究
Authors: [Marc Pavel](https://arxiv.org/search/?searchtype=author&query=Marc Pavel), [Nenad Petrovic](https://arxiv.org/search/?searchtype=author&query=Nenad Petrovic), [Lukasz Mazur](https://arxiv.org/search/?searchtype=author&query=Lukasz Mazur), [Vahid Zolfaghari](https://arxiv.org/search/?searchtype=author&query=Vahid Zolfaghari), [Fengjunjie Pan](https://arxiv.org/search/?searchtype=author&query=Fengjunjie Pan), [Alois Knoll](https://arxiv.org/search/?searchtype=author&query=Alois Knoll) 作者:Marc Pavel、Nenad Petrovic、Lukasz Mazur、Vahid Zolfaghari、Fengjunjie Pan、Alois Knoll
Large Language Models (LLMs) have shown significant potential in automating code generation tasks offering new opportunities across software engineering domains. However, their practical application remains limited due to hallucinations - outputs that appear plausible but are factually incorrect, unverifiable or nonsensical. This paper investigates hallucination phenomena in the context of code generation with a specific focus on the automotive domain. A case study is presented that evaluates multiple code LLMs for three different prompting complexities ranging from a minimal one-liner prompt to a prompt with Covesa Vehicle Signal Specifications (VSS) as additional context and finally to a prompt with an additional code skeleton. The evaluation reveals a high frequency of syntax violations, invalid reference errors and API knowledge conflicts in state-of-the-art models GPT-4.1, Codex and GPT-4o. Among the evaluated models, only GPT-4.1 and GPT-4o were able to produce a correct solution when given the most context-rich prompt. Simpler prompting strategies failed to yield a working result, even after multiple refinement iterations. These findings highlight the need for effective mitigation techniques to ensure the safe and reliable use of LLM generated code, especially in safety-critical domains such as automotive software systems. 大型语言模型 (LLMs) 在自动化代码生成任务方面展示了显著潜力,为软件工程领域带来新的机遇。然而,由于幻觉问题——看起来合理但事实错误、无法验证或荒谬的输出——其实际应用仍然受限。本文在代码生成的背景下研究了幻觉现象,特别聚焦于汽车领域。本文呈现了一个案例研究,评估了多种代码 LLMs 在三种不同提示复杂度下的表现,范围从最简的一行提示,到以 Covesa Vehicle Signal Specifications (VSS) 作为额外上下文的提示,最后到包含额外代码骨架的提示。评估显示,最先进模型 GPT-4.1、Codex 和 GPT-4o 经常出现语法违规、无效引用错误和 API 知识冲突。在被评估的模型中,只有 GPT-4.1 和 GPT-4o 在给予最丰富上下文的提示时能够生成正确的解决方案。较为简单的提示策略未能产出可用结果,即使在多次迭代改进后亦是如此。 这些发现突显了需要有效的缓解技术,以确保在诸如汽车软件系统等安全关键领域中安全可靠地使用 LLM 生成的代码。
Subjects: Software Engineering, Artificial Intelligence 主题:软件工程,人工智能
Publish: 2025-08-15 06:46:50 UTC
#52 Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception
Authors: [Junjie Wang](https://arxiv.org/search/?searchtype=author&query=Junjie Wang), [Keyu Chen](https://arxiv.org/search/?searchtype=author&query=Keyu Chen), [Yulin Li](https://arxiv.org/search/?searchtype=author&query=Yulin Li), [Bin Chen](https://arxiv.org/search/?searchtype=author&query=Bin Chen), [Hengshuang Zhao](https://arxiv.org/search/?searchtype=author&query=Hengshuang Zhao), [Xiaojuan Qi](https://arxiv.org/search/?searchtype=author&query=Xiaojuan Qi), [Zhuotao Tian](https://arxiv.org/search/?searchtype=author&query=Zhuotao Tian)
Dense visual perception tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded. While Vision-Language Models (VLMs) like CLIP have shown promise in open-vocabulary tasks, their direct application to dense perception often leads to suboptimal performance due to limitations in local feature representation. In this work, we present our observation that CLIP's image tokens struggle to effectively aggregate information from spatially or semantically related regions, resulting in features that lack local discriminability and spatial consistency. To address this issue, we propose DeCLIP, a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively. The context features are enhanced by jointly distilling semantic correlations from Vision Foundation Models (VFMs) and object integrity cues from diffusion models, thereby enhancing spatial consistency. In parallel, the content features are aligned with image crop representations and constrained by region correlations from VFMs to improve local discriminability. Extensive experiments demonstrate that DeCLIP establishes a solid foundation for open-vocabulary dense perception, consistently achieving state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation. Code is available at https://github.com/xiaomoguhz/DeCLIP
密集视觉感知任务受限于对预定义类别的依赖,限制了其在视觉概念无限的真实场景中的适用性。尽管像 CLIP 这样的视觉-语言模型(VLMs)在开放词汇任务上展现出潜力,但它们直接应用于密集感知时,常因局部特征表示的局限而导致性能不佳。在本工作中,我们提出观察:CLIP 的图像 token 在有效聚合来自空间或语义相关区域的信息方面存在困难,导致其特征缺乏局部可区分性和空间一致性。为了解决该问题,我们提出了 DeCLIP,一种通过解耦自注意力模块来分别获得“内容”和“上下文”特征以增强 CLIP 的新框架。上下文特征通过联合蒸馏来自视觉基础模型(VFMs)的语义关联性和来自扩散模型的对象完整性线索来增强,从而提升空间一致性。同时,内容特征与图像裁剪表示进行对齐,并通过来自视觉基础模型(VFM)的区域相关性进行约束,以提高局部判别能力。大量实验证明,DeCLIP 为开放词汇密集感知建立了坚实基础,在包括二维检测与分割、三维实例分割、视频实例分割以及六自由度物体位姿估计等广泛任务上持续取得最先进的性能。代码可在 https://github.com/xiaomoguhz/DeCLIP 获得
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-15 06:43:51 UTC 发布:2025-08-15 06:43:51 UTC
#53 Graph Neural Diffusion via Generalized Opinion Dynamics
Authors: [Asela Hevapathige](https://arxiv.org/search/?searchtype=author&query=Asela Hevapathige), [Asiri Wijesinghe](https://arxiv.org/search/?searchtype=author&query=Asiri Wijesinghe), [Ahad N. Zehmakan](https://arxiv.org/search/?searchtype=author&query=Ahad N. Zehmakan) 作者:Asela Hevapathige、Asiri Wijesinghe、Ahad N. Zehmakan
There has been a growing interest in developing diffusion-based Graph Neural Networks (GNNs), building on the connections between message passing mechanisms in GNNs and physical diffusion processes. However, existing methods suffer from three critical limitations: (1) they rely on homogeneous diffusion with static dynamics, limiting adaptability to diverse graph structures; (2) their depth is constrained by computational overhead and diminishing interpretability; and (3) theoretical understanding of their convergence behavior remains limited. To address these challenges, we propose GODNF, a Generalized Opinion Dynamics Neural Framework, which unifies multiple opinion dynamics models into a principled, trainable diffusion mechanism. Our framework captures heterogeneous diffusion patterns and temporal dynamics via node-specific behavior modeling and dynamic neighborhood influence, while ensuring efficient and interpretable message propagation even at deep layers. We provide a rigorous theoretical analysis demonstrating GODNF’s ability to model diverse convergence configurations. Extensive empirical evaluations of node classification and influence estimation tasks confirm GODNF’s superiority over state-of-the-art GNNs. 近年来,基于扩散的图神经网络(GNN)越来越受关注,这建立在 GNN 中的消息传递机制与物理扩散过程之间的联系之上。然而,现有方法存在三大关键局限:(1)它们依赖同质扩散且具有静态动态,限制了对多样化图结构的适应性;(2)其深度受制于计算开销并且可解释性下降;(3)对其收敛行为的理论理解仍然有限。为了解决这些挑战,我们提出了 GODNF,一种广义舆论动力学神经框架,将多种舆论动力学模型统一为一个有原则、可训练的扩散机制。我们的框架通过节点特定的行为建模和动态的邻域影响捕捉异质的扩散模式和时间动态,同时在深层仍能保证高效且可解释的消息传播。我们提供了严格的理论分析,证明了 GODNF 能够刻画多样的收敛配置。 对节点分类和影响力估计任务的大量实证评估证实,GODNF 优于最先进的图神经网络。
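As a concrete point of reference for the "principled, trainable diffusion mechanism," here is a minimal sketch of one opinion-dynamics-style propagation layer in the Friedkin-Johnsen family, with a learnable per-node susceptibility. GODNF's actual update generalizes over several such models; this shows only one member of the family, with illustrative shapes.

```python
import torch

class OpinionDiffusionLayer(torch.nn.Module):
    # x_{t+1} = s * x_0 + (1 - s) * A_hat @ x_t, with s in (0, 1) learned
    # per node: a Friedkin-Johnsen-flavored update, one instance of the
    # family of opinion dynamics models GODNF unifies.
    def __init__(self, num_nodes):
        super().__init__()
        self.s_logit = torch.nn.Parameter(torch.zeros(num_nodes, 1))

    def forward(self, a_hat, x_t, x_0):
        s = torch.sigmoid(self.s_logit)          # node-specific stubbornness
        return s * x_0 + (1 - s) * (a_hat @ x_t)

# Toy usage: a row-normalized adjacency over 4 nodes, 8 feature channels.
A = torch.tensor([[0, 1, 1, 0], [1, 0, 0, 1],
                  [1, 0, 0, 1], [0, 1, 1, 0]], dtype=torch.float)
a_hat = A / A.sum(dim=1, keepdim=True)
x0 = torch.randn(4, 8)
layer = OpinionDiffusionLayer(num_nodes=4)
x = x0
for _ in range(3):                               # three diffusion steps
    x = layer(a_hat, x, x0)
```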
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-15 06:36:57 UTC 发布:2025-08-15 06:36:57 UTC
#54 Cross-Granularity Hypergraph Retrieval-Augmented Generation for Multi-hop Question Answering #54 跨粒度超图检索增强生成用于多跳问答
Authors: [Changjian Wang](https://arxiv.org/search/?searchtype=author&query=Changjian Wang), [Weihong Deng](https://arxiv.org/search/?searchtype=author&query=Weihong Deng), [Weili Guan](https://arxiv.org/search/?searchtype=author&query=Weili Guan), [Quan Lu](https://arxiv.org/search/?searchtype=author&query=Quan Lu), [Ning Jiang](https://arxiv.org/search/?searchtype=author&query=Ning Jiang) 作者:王长建、邓伟宏、关伟立、陆权、姜宁
Multi-hop question answering (MHQA) requires integrating knowledge scattered across multiple passages to derive the correct answer. Traditional retrieval-augmented generation (RAG) methods primarily focus on coarse-grained textual semantic similarity and ignore structural associations among dispersed knowledge, which limits their effectiveness in MHQA tasks. GraphRAG methods address this by leveraging knowledge graphs (KGs) to capture structural associations, but they tend to overly rely on structural information and fine-grained word- or phrase-level retrieval, resulting in an underutilization of textual semantics. In this paper, we propose a novel RAG approach called HGRAG for MHQA that achieves cross-granularity integration of structural and semantic information via hypergraphs. Structurally, we construct an entity hypergraph where fine-grained entities serve as nodes and coarse-grained passages as hyperedges, and establish knowledge association through shared entities. Semantically, we design a hypergraph retrieval method that integrates fine-grained entity similarity and coarse-grained passage similarity via hypergraph diffusion. Finally, we employ a retrieval enhancement module, which further refines the retrieved results both semantically and structurally, to obtain the most relevant passages as context for answer generation with the LLM. Experimental results on benchmark datasets demonstrate that our approach outperforms state-of-the-art methods in QA performance, and achieves a 6× speedup in retrieval efficiency. 多跳问答(MHQA)需要整合分散在多段文本中的知识以得出正确答案。传统的检索增强生成(RAG)方法主要关注粗粒度的文本语义相似性,忽略了分散知识之间的结构关联,从而限制了其在 MHQA 任务中的效果。GraphRAG 方法通过利用知识图(KG)来捕捉结构关联以解决这一问题,但它们往往过分依赖结构信息和细粒度的词或短语级检索,导致文本语义被低估。在本文中,我们提出了一种用于 MHQA 的新型 RAG 方法,称为 HGRAG,通过超图实现结构与语义信息的跨粒度整合。在结构上,我们构建了一个实体超图,其中细粒度实体作为节点、粗粒度段落作为超边,并通过共享实体建立知识关联。在语义上,我们设计了一种超图检索方法,通过超图扩散整合细粒度实体相似性与粗粒度段落相似性。 最后,我们引入了一个检索增强模块,进一步在语义和结构上优化检索结果,以获得作为与 LLM 生成答案上下文的最相关段落。在基准数据集上的实验结果表明,我们的方法在问答性能上优于最先进的方法,并在检索效率上实现了 6 × 的加速。
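To make the cross-granularity idea concrete, a small numpy sketch under assumptions (the incidence matrix, normalization, and mixing weight are illustrative; HGRAG's exact diffusion rule may differ): entity and passage similarity scores are repeatedly exchanged through the entity-passage incidence structure.

```python
import numpy as np

def hypergraph_scores(sim_ent, sim_psg, H, alpha=0.5, hops=2):
    # H[i, j] = 1 if entity i occurs in passage j (passages = hyperedges).
    Hc = H / np.maximum(H.sum(axis=0, keepdims=True), 1)  # per-passage average
    Hr = H / np.maximum(H.sum(axis=1, keepdims=True), 1)  # per-entity average
    e, p = sim_ent.astype(float), sim_psg.astype(float)
    for _ in range(hops):
        p = alpha * sim_psg + (1 - alpha) * (Hc.T @ e)    # entities -> passages
        e = alpha * sim_ent + (1 - alpha) * (Hr @ p)      # passages -> entities
    return p  # rank passages by the diffused score

# Toy usage: 3 entities shared across 2 passages.
H = np.array([[1, 0], [1, 1], [0, 1]], dtype=float)
print(hypergraph_scores(np.array([0.9, 0.2, 0.1]), np.array([0.5, 0.4]), H))
```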
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-15 06:36:13 UTC 发布:2025-08-15 06:36:13 UTC
#55 ORFuzz: Fuzzing the "Other Side" of LLM Safety – Testing Over-Refusal #55 ORFuzz:模糊测试 LLM 安全性的“另一面”——测试过度拒绝
Authors: [Haonan Zhang](https://arxiv.org/search/?searchtype=author&query=Haonan Zhang), [Dongxia Wang](https://arxiv.org/search/?searchtype=author&query=Dongxia Wang), [Yi Liu](https://arxiv.org/search/?searchtype=author&query=Yi Liu), [Kexin Chen](https://arxiv.org/search/?searchtype=author&query=Kexin Chen), [Jiashui Wang](https://arxiv.org/search/?searchtype=author&query=Jiashui Wang), [Xinlei Ying](https://arxiv.org/search/?searchtype=author&query=Xinlei Ying), [Long Liu](https://arxiv.org/search/?searchtype=author&query=Long Liu), [Wenhai Wang](https://arxiv.org/search/?searchtype=author&query=Wenhai Wang) 作者:张浩南,王东霞,刘怡,陈可欣,王家水,应新磊,刘龙,王文海
Large Language Models (LLMs) increasingly exhibit over-refusal - erroneously rejecting benign queries due to overly conservative safety measures - a critical functional flaw that undermines their reliability and usability. Current methods for testing this behavior are demonstrably inadequate, suffering from flawed benchmarks and limited test generation capabilities, as highlighted by our empirical user study. To the best of our knowledge, this paper introduces the first evolutionary testing framework, ORFuzz, for the systematic detection and analysis of LLM over-refusals. ORFuzz uniquely integrates three core components: (1) safety category-aware seed selection for comprehensive test coverage, (2) adaptive mutator optimization using reasoning LLMs to generate effective test cases, and (3) OR-Judge, a human-aligned judge model validated to accurately reflect user perception of toxicity and refusal. Our extensive evaluations demonstrate that ORFuzz generates diverse, validated over-refusal instances at a rate (6.98% average) more than double that of leading baselines, effectively uncovering vulnerabilities. Furthermore, ORFuzz’s outputs form the basis of ORFuzzSet, a new benchmark of 1,855 highly transferable test cases that achieves a superior 63.56% average over-refusal rate across 10 diverse LLMs, significantly outperforming existing datasets. ORFuzz and ORFuzzSet provide a robust automated testing framework and a valuable community resource, paving the way for developing more reliable and trustworthy LLM-based software systems. 大型语言模型(LLMs)越来越多地表现出过度拒绝——由于过于保守的安全措施而错误地拒绝良性查询——这是一种关键的功能缺陷,削弱了其可靠性和可用性。我们通过实证用户研究指出,现有检测此类行为的方法明显不足,存在基准测试缺陷和有限的测试生成能力。据我们所知,本文首次提出了用于系统检测和分析 LLM 过度拒绝的进化测试框架 ORFuzz。ORFuzz 独特地整合了三大核心组件: (1) 考虑安全类别的种子选择以实现全面的测试覆盖,(2) 使用推理型 LLM 进行自适应变异器优化以生成有效的测试用例,(3) OR-Judge——一个与人类一致的判定模型,经验证能准确反映用户对有害性和拒绝的感知。我们的广泛评估表明,ORFuzz 以平均 6.98% 的速率生成多样且经验证的过度拒绝实例,超过主要基线方法的两倍以上,有效地揭示了漏洞。 此外,ORFuzz 的输出构成了 ORFuzzSet 的基础——一个包含 1,855 个高度可迁移测试用例的新基准,在 10 个多样化的 LLMs 上实现了 63.56% 的平均过度拒绝率,显著优于现有数据集。ORFuzz 和 ORFuzzSet 提供了一个稳健的自动化测试框架和宝贵的社区资源,为开发更可靠、更值得信赖的基于 LLM 的软件系统铺平了道路。
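The overall loop is easy to caricature in code. The sketch below is hypothetical only: `mutate`, `is_toxic`, and `is_refused` stand in for the paper's reasoning-LLM mutators, the OR-Judge toxicity view, and the target model's refusal behavior, respectively.

```python
import random

def orfuzz_loop(seeds_by_category, mutate, is_toxic, is_refused, rounds=10):
    # Evolutionary over-refusal testing in the spirit of ORFuzz: pick a
    # safety category, mutate a seed, and keep prompts the judge deems
    # benign but the target model nevertheless refuses.
    findings = []
    for _ in range(rounds):
        category = random.choice(list(seeds_by_category))
        seed = random.choice(seeds_by_category[category])
        candidate = mutate(seed)
        if not is_toxic(candidate) and is_refused(candidate):
            findings.append((category, candidate))
            seeds_by_category[category].append(candidate)  # evolve the pool
    return findings

# Toy usage with trivial stand-ins for the three components.
seeds = {"self-harm": ["how do I safely dispose of old medication?"]}
found = orfuzz_loop(seeds,
                    mutate=lambda s: s + " (for my elderly parent)",
                    is_toxic=lambda s: False,
                    is_refused=lambda s: "medication" in s,
                    rounds=5)
```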
Subjects: Software Engineering, Artificial Intelligence, Computation and Language, Information Retrieval 科目:软件工程、人工智能、计算与语言、信息检索
Publish: 2025-08-15 05:03:26 UTC 发布:2025-08-15 05:03:26 协调世界时 (UTC)
#56 How Causal Abstraction Underpins Computational Explanation #56 因果抽象如何支撑计算性解释
Authors: [Atticus Geiger](https://arxiv.org/search/?searchtype=author&query=Atticus Geiger), [Jacqueline Harding](https://arxiv.org/search/?searchtype=author&query=Jacqueline Harding), [Thomas Icard](https://arxiv.org/search/?searchtype=author&query=Thomas Icard) 作者:Atticus Geiger、Jacqueline Harding、Thomas Icard
Explanations of cognitive behavior often appeal to computations over representations. What does it take for a system to implement a given computation over suitable representational vehicles within that system? We argue that the language of causality – and specifically the theory of causal abstraction – provides a fruitful lens on this topic. Drawing on current discussions in deep learning with artificial neural networks, we illustrate how classical themes in the philosophy of computation and cognition resurface in contemporary machine learning. We offer an account of computational implementation grounded in causal abstraction, and examine the role for representation in the resulting picture. We argue that these issues are most profitably explored in connection with generalization and prediction. 对认知行为的解释常常诉诸于对表征进行的计算。一个系统需要具备什么才能在该系统内通过适当的表征载体实现某一给定的计算?我们主张,用因果语言——尤其是因果抽象理论——来考察这一问题是富有成效的视角。通过借鉴当前在人工神经网络深度学习中的讨论,我们展示了计算与认知哲学的经典主题如何在当代机器学习中重新出现。我们提供了一个以因果抽象为基础的计算实现的解释,并考察了表征在由此图景中所起的作用。我们认为,这些问题最有成效的探索方向是与泛化和预测相联系的研究。
Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题:机器学习、人工智能、计算与语言
Publish: 2025-08-15 04:46:02 UTC 发布:2025-08-15 04:46:02 协调世界时 (UTC)
#57 Multi-Group Equivariant Augmentation for Reinforcement Learning in Robot Manipulation #57 用于机器人操控强化学习的多群等变增强
Authors: [Hongbin Lin](https://arxiv.org/search/?searchtype=author&query=Hongbin Lin), [Juan Rojas](https://arxiv.org/search/?searchtype=author&query=Juan Rojas), [Kwok Wai Samuel Au](https://arxiv.org/search/?searchtype=author&query=Kwok Wai Samuel Au) 作者:林鸿斌、Juan Rojas、Kwok Wai Samuel Au
Sampling efficiency is critical for deploying visuomotor learning in real-world robotic manipulation. While task symmetry has emerged as a promising inductive bias to improve efficiency, most prior work is limited to isometric symmetries – applying the same group transformation to all task objects across all timesteps. In this work, we explore non-isometric symmetries, applying multiple independent group transformations across spatial and temporal dimensions to relax these constraints. We introduce a novel formulation of the partially observable Markov decision process (POMDP) that incorporates the non-isometric symmetry structures, and propose a simple yet effective data augmentation method, Multi-Group Equivariance Augmentation (MEA). We integrate MEA with offline reinforcement learning to enhance sampling efficiency, and introduce a voxel-based visual representation that preserves translational equivariance. Extensive simulation and real-robot experiments across two manipulation domains demonstrate the effectiveness of our approach. 采样效率对于在真实世界机器人操控中部署视觉动作学习至关重要。尽管任务对称性作为一种有前景的归纳偏好能够提升效率,但以往大多数工作仅限于等距对称性——在所有时间步对所有任务对象应用相同的群变换。在本工作中,我们探索了非等距对称性,在空间和时间维度上应用多个独立的群变换以放宽这些约束。我们提出了一种将非等距对称结构纳入的部分可观测马尔可夫决策过程(POMDP)新表述,并提出了一种简单而有效的数据增强方法:多群等变增强(Multi-Group Equivariance Augmentation,MEA)。我们将 MEA 与离线强化学习结合以提高采样效率,并引入了一种保留平移等变性的基于体素的视觉表示。在两个操控领域中进行的大量仿真与真实机器人实验表明了我们方法的有效性。
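A hypothetical sketch of what "multiple independent group transformations across spatial and temporal dimensions" can look like in the simplest planar case: instead of one shared rotation for the whole trajectory (the isometric setting), an independent rotation is sampled per timestep. The data layout here is an assumption for illustration, not MEA's actual augmentation.

```python
import numpy as np

def augment(traj, rng):
    # traj: (T, 2) planar positions over T timesteps. A fresh SO(2)
    # element is drawn independently at each timestep (and could just as
    # well be drawn per object), relaxing the isometric constraint.
    out = np.empty_like(traj)
    for t, p in enumerate(traj):
        theta = rng.uniform(0, 2 * np.pi)
        c, s = np.cos(theta), np.sin(theta)
        out[t] = np.array([[c, -s], [s, c]]) @ p
    return out

rng = np.random.default_rng(0)
aug = augment(rng.normal(size=(16, 2)), rng)
```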
Subjects: Robotics, Artificial Intelligence 主题:机器人学,人工智能
Publish: 2025-08-15 04:30:01 UTC 发布时间:2025-08-15 04:30:01 协调世界时 (UTC)
#58 StyleMM: Stylized 3D Morphable Face Model via Text-Driven Aligned Image Translation #58 StyleMM:通过文本驱动的对齐图像翻译实现的风格化三维可变形人脸模型
Authors: [Seungmi Lee](https://arxiv.org/search/?searchtype=author&query=Seungmi Lee), [Kwan Yun](https://arxiv.org/search/?searchtype=author&query=Kwan Yun), [Junyong Noh](https://arxiv.org/search/?searchtype=author&query=Junyong Noh) 作者:Seungmi Lee、Kwan Yun、Junyong Noh
We introduce StyleMM, a novel framework that can construct a stylized 3D Morphable Model (3DMM) based on user-defined text descriptions specifying a target style. Building upon a pre-trained mesh deformation network and a texture generator for original 3DMM-based realistic human faces, our approach fine-tunes these models using stylized facial images generated via text-guided image-to-image (i2i) translation with a diffusion model, which serve as stylization targets for the rendered mesh. To prevent undesired changes in identity, facial alignment, or expressions during i2i translation, we introduce a stylization method that explicitly preserves the facial attributes of the source image. By maintaining these critical attributes during image stylization, the proposed approach ensures consistent 3D style transfer across the 3DMM parameter space through image-based training. Once trained, StyleMM enables feed-forward generation of stylized face meshes with explicit control over shape, expression, and texture parameters, producing meshes with consistent vertex connectivity and animatability. Quantitative and qualitative evaluations demonstrate that our approach outperforms state-of-the-art methods in terms of identity-level facial diversity and stylization capability. The code and videos are available at kwanyun.github.io/stylemm_page. 我们提出了 StyleMM,一种新颖的框架,能够基于用户定义的文本描述构建风格化的三维可变形模型(3DMM),用于指定目标风格。在基于预训练的网格变形网络和用于原始基于 3DMM 的逼真人脸的纹理生成器的基础上,我们的方法使用通过带有扩散模型的文本引导图像到图像(i2i)翻译生成的风格化人脸图像来微调这些模型,这些图像作为渲染网格的风格化目标。为防止在 i2i 翻译过程中出现身份、面部对齐或表情的非期望变化,我们引入了一种显式保留源图像面部属性的风格化方法。通过在图像风格化过程中保持这些关键属性,所提出的方法确保通过基于图像的训练在 3DMM 参数空间内实现一致的 3D 风格迁移。训练完成后,StyleMM 支持前馈生成风格化的人脸网格,并对形状、表情和纹理参数进行显式控制,生成具有一致顶点连接性和可动画性的网格。 定量和定性评估表明,我们的方法在身份级别人脸多样性和风格化能力方面优于最先进的方法。代码和视频可见于 kwanyun.github.io/stylemm_page。
Subjects: Graphics, Artificial Intelligence, Computer Vision and Pattern Recognition, Multimedia 主题:图形学、人工智能、计算机视觉与模式识别、多媒体
Publish: 2025-08-15 04:29:46 UTC 发布时间:2025-08-15 04:29:46 UTC
#59 Visuomotor Grasping with World Models for Surgical Robots #59 使用世界模型的视觉运动抓取用于外科机器人
Authors: [Hongbin Lin](https://arxiv.org/search/?searchtype=author&query=Hongbin Lin), [Bin Li](https://arxiv.org/search/?searchtype=author&query=Bin Li), [Kwok Wai Samuel Au](https://arxiv.org/search/?searchtype=author&query=Kwok Wai Samuel Au) 作者:林鸿斌、李彬、Kwok Wai Samuel Au
Grasping is a fundamental task in robot-assisted surgery (RAS), and automating it can reduce surgeon workload while enhancing efficiency, safety, and consistency beyond teleoperated systems. Most prior approaches rely on explicit object pose tracking or handcrafted visual features, limiting their generalization to novel objects, robustness to visual disturbances, and the ability to handle deformable objects. Visuomotor learning offers a promising alternative, but deploying it in RAS presents unique challenges, such as low signal-to-noise ratio in visual observations, demands for high safety and millimeter-level precision, as well as the complex surgical environment. This paper addresses three key challenges: (i) sim-to-real transfer of visuomotor policies to ex vivo surgical scenes, (ii) visuomotor learning using only a single stereo camera pair – the standard RAS setup, and (iii) object-agnostic grasping with a single policy that generalizes to diverse, unseen surgical objects without retraining or task-specific models. We introduce Grasp Anything for Surgery V2 (GASv2), a visuomotor learning framework for surgical grasping. GASv2 leverages a world-model-based architecture and a surgical perception pipeline for visual observations, combined with a hybrid control system for safe execution. We train the policy in simulation using domain randomization for sim-to-real transfer and deploy it on a real robot in both phantom-based and ex vivo surgical settings, using only a single pair of endoscopic cameras. Extensive experiments show our policy achieves a 65% success rate in both settings, generalizes to unseen objects and grippers, and adapts to diverse disturbances, demonstrating strong performance, generality, and robustness. 抓取是机器人辅助手术(RAS)中的一项基础任务,对其实现自动化可以减轻外科医生的工作负担,同时在效率、安全性和一致性方面超越遥操系统。以往的大多数方法依赖于显式的物体位姿跟踪或手工设计的视觉特征,这限制了它们对新物体的泛化能力、对视觉干扰的鲁棒性以及处理可变形物体的能力。视觉-运动学习(visuomotor learning)提供了一种有前景的替代方案,但在 RAS 中部署该方法面临独特挑战,例如视觉观测的低信噪比、对高安全性和毫米级精度的需求,以及复杂的手术环境。本文针对三大关键挑战: (i) 将视觉-运动策略从仿真转移到离体手术场景的 sim-to-real 问题,(ii) 仅使用一对立体相机——标准 RAS 配置——进行视觉-运动学习,和 (iii) 使用单一策略实现与物体无关的抓取,能在不重新训练或不使用任务特定模型的情况下泛化到多种未见过的手术物体。我们提出了用于手术抓取的视觉-运动学习框架 Grasp Anything for Surgery V2(GASv2)。 GASv2 利用基于世界模型的架构和用于视觉观测的外科感知管线,结合用于安全执行的混合控制系统。我们在仿真中训练策略,使用领域随机化实现模拟到现实的迁移,并仅使用一对内窥镜相机将其部署到真实机器人上,在基于假体和体外手术环境中均进行测试。大量实验证明,我们的策略在两种环境中均达到了 65% 的成功率,能够推广到未见过的物体和夹持器,并适应各种干扰,展示出较强的性能、通用性和鲁棒性。
Subjects: Robotics, Artificial Intelligence 主题:机器人学,人工智能
Publish: 2025-08-15 04:23:07 UTC 发布:2025-08-15 04:23:07 UTC
#60 E-CaTCH: Event-Centric Cross-Modal Attention with Temporal Consistency and Class-Imbalance Handling for Misinformation Detection #60 E-CaTCH:面向错误信息检测的事件中心跨模态注意力方法,具有时间一致性和类别不平衡处理
Authors: [Ahmad Mousavi](https://arxiv.org/search/?searchtype=author&query=Ahmad Mousavi), [Yeganeh Abdollahinejad](https://arxiv.org/search/?searchtype=author&query=Yeganeh Abdollahinejad), [Roberto Corizzo](https://arxiv.org/search/?searchtype=author&query=Roberto Corizzo), [Nathalie Japkowicz](https://arxiv.org/search/?searchtype=author&query=Nathalie Japkowicz), [Zois Boukouvalas](https://arxiv.org/search/?searchtype=author&query=Zois Boukouvalas) 作者:Ahmad Mousavi、Yeganeh Abdollahinejad、Roberto Corizzo、Nathalie Japkowicz、Zois Boukouvalas
Detecting multimodal misinformation on social media remains challenging due to inconsistencies between modalities, changes in temporal patterns, and substantial class imbalance. Many existing methods treat posts independently and fail to capture the event-level structure that connects them across time and modality. We propose E-CaTCH, an interpretable and scalable framework for robustly detecting misinformation. If needed, E-CaTCH clusters posts into pseudo-events based on textual similarity and temporal proximity, then processes each event independently. Within each event, textual and visual features are extracted using pre-trained BERT and ResNet encoders, refined via intra-modal self-attention, and aligned through bidirectional cross-modal attention. A soft gating mechanism fuses these representations to form contextualized, content-aware embeddings of each post. To model temporal evolution, E-CaTCH segments events into overlapping time windows and uses a trend-aware LSTM, enhanced with semantic shift and momentum signals, to encode narrative progression over time. Classification is performed at the event level, enabling better alignment with real-world misinformation dynamics. To address class imbalance and promote stable learning, the model integrates adaptive class weighting, temporal consistency regularization, and hard-example mining. The total loss is aggregated across all events. Extensive experiments on Fakeddit, IND, and COVID-19 MISINFOGRAPH demonstrate that E-CaTCH consistently outperforms state-of-the-art baselines. Cross-dataset evaluations further demonstrate its robustness, generalizability, and practical applicability across diverse misinformation scenarios. 在社交媒体上检测多模态错误信息仍然具有挑战性,原因包括模态之间的不一致、时间模式的变化以及严重的类别不平衡。许多现有方法将帖子视为独立个体,无法捕捉将它们在时间和模态上连接起来的事件级结构。我们提出了 E-CaTCH,这是一个可解释且可扩展的框架,用于稳健地检测错误信息。在需要时,E-CaTCH 会基于文本相似性和时间接近性将帖子聚类为伪事件,然后独立处理每个事件。在每个事件内部,使用预训练的 BERT 和 ResNet 编码器提取文本和视觉特征,通过模内自注意力进行精炼,并通过双向跨模态注意力进行对齐。一个软门控机制融合这些表示,形成每个帖子的上下文化、内容感知嵌入。为建模时间演变,E-CaTCH 将事件划分为重叠的时间窗口,并使用一种具备语义漂移和动量信号增强的趋势感知 LSTM 来编码随时间的叙事进展。分类在事件级别上执行,从而更好地与现实世界中的错误信息动态对齐。 为了解决类别不平衡并促进稳定学习,模型整合了自适应类别权重、时间一致性正则化和难例挖掘。总损失在所有事件上进行聚合。在 Fakeddit、IND 和 COVID-19 MISINFOGRAPH 上的大量实验证明,E-CaTCH 始终优于最先进的基线方法。跨数据集评估进一步展示了其在不同错误信息场景下的鲁棒性、泛化能力和实际适用性。
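The soft gating step is simple enough to show directly. A minimal torch sketch, assuming both modalities have already been projected to a common dimension (the dimension and layer shape are assumptions of this sketch):

```python
import torch

class SoftGateFusion(torch.nn.Module):
    # A sigmoid gate, computed from the concatenated text and image
    # features, mixes the two modalities into one content-aware embedding.
    def __init__(self, dim):
        super().__init__()
        self.gate = torch.nn.Linear(2 * dim, dim)

    def forward(self, t, v):
        g = torch.sigmoid(self.gate(torch.cat([t, v], dim=-1)))
        return g * t + (1 - g) * v

# Toy usage: fuse a batch of 4 text/image feature pairs.
fuse = SoftGateFusion(dim=256)
z = fuse(torch.randn(4, 256), torch.randn(4, 256))
```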
Subjects: Computation and Language, Artificial Intelligence, Machine Learning, Social and Information Networks 主题:计算与语言、人工智能、机器学习、社会与信息网络
Publish: 2025-08-15 04:13:23 UTC 发表:2025-08-15 04:13:23 UTC
#61 Quantum-Boosted High-Fidelity Deep Learning #61 量子增强的高保真深度学习
Authors: [Feng-ao Wang](https://arxiv.org/search/?searchtype=author&query=Feng-ao Wang), [Shaobo Chen](https://arxiv.org/search/?searchtype=author&query=Shaobo Chen), [Yao Xuan](https://arxiv.org/search/?searchtype=author&query=Yao Xuan), [Junwei Liu](https://arxiv.org/search/?searchtype=author&query=Junwei Liu), [Qi Gao](https://arxiv.org/search/?searchtype=author&query=Qi Gao), [Hongdong Zhu](https://arxiv.org/search/?searchtype=author&query=Hongdong Zhu), [Junjie Hou](https://arxiv.org/search/?searchtype=author&query=Junjie Hou), [Lixin Yuan](https://arxiv.org/search/?searchtype=author&query=Lixin Yuan), [Jinyu Cheng](https://arxiv.org/search/?searchtype=author&query=Jinyu Cheng), [Chenxin Yi](https://arxiv.org/search/?searchtype=author&query=Chenxin Yi), [Hai Wei](https://arxiv.org/search/?searchtype=author&query=Hai Wei), [Yin Ma](https://arxiv.org/search/?searchtype=author&query=Yin Ma), [Tao Xu](https://arxiv.org/search/?searchtype=author&query=Tao Xu), [Kai Wen](https://arxiv.org/search/?searchtype=author&query=Kai Wen), [Yixue Li](https://arxiv.org/search/?searchtype=author&query=Yixue Li) 作者:王凤奥、陈少波、宣尧、刘俊伟、高琦、朱鸿东、侯俊杰、袁利新、程金玉、易晨昕、魏海、马寅、徐涛、温凯、李一学
A fundamental limitation of probabilistic deep learning is its predominant reliance on Gaussian priors. This simplistic assumption prevents models from accurately capturing the complex, non-Gaussian landscapes of natural data, particularly in demanding domains like complex biological data, severely hindering the fidelity of the model for scientific discovery. The physically-grounded Boltzmann distribution offers a more expressive alternative, but it is computationally intractable on classical computers. To date, quantum approaches have been hampered by the insufficient qubit scale and operational stability required for the iterative demands of deep learning. Here, we bridge this gap by introducing the Quantum Boltzmann Machine-Variational Autoencoder (QBM-VAE), a large-scale and long-time stable hybrid quantum-classical architecture. Our framework leverages a quantum processor for efficient sampling from the Boltzmann distribution, enabling its use as a powerful prior within a deep generative model. Applied to million-scale single-cell datasets from multiple sources, the QBM-VAE generates a latent space that better preserves complex biological structures, consistently outperforming conventional Gaussian-based deep learning models like VAE and SCVI in essential tasks such as omics data integration, cell-type classification, and trajectory inference. It also provides a typical example of introducing a physics prior into deep learning to drive the model to acquire scientific discovery capabilities that break through data limitations. This work provides a demonstration of a practical quantum advantage in deep learning on a large-scale scientific problem and offers a transferable blueprint for developing hybrid quantum AI models. 概率深度学习的一个根本性限制是其主要依赖高斯先验。这种简单的假设使模型无法准确捕捉自然数据中复杂的、非高斯分布的景观,特别是在复杂生物数据等要求很高的领域,严重妨碍了模型在科学发现中的可信度。具有物理依据的玻尔兹曼分布提供了更具表现力的替代方案,但在经典计算机上计算上是不可行的。迄今为止,量子方法受制于深度学习迭代需求所需的量子比特规模和运行稳定性不足。在此,我们通过引入量子玻尔兹曼机-变分自编码器(QBM-VAE)弥合了这一差距,这是一种大尺度且长期稳定的混合量子-经典架构。我们的框架利用量子处理器高效地从玻尔兹曼分布中采样,从而使其能作为深度生成模型中的强大先验。在应用于来自多个来源的百万级单细胞数据集时,QBM-VAE 生成的潜在空间更好地保留了复杂的生物学结构,在组学数据整合、细胞类型分类和轨迹推断等关键任务中,始终优于传统的基于高斯的深度学习模型(如 VAE 和 SCVI)。它还提供了一个将物理先验引入深度学习的典型示例,以推动模型获得突破数据限制的科学发现能力。这项工作展示了在大型科学问题上深度学习实现实际量子优势的示范,并为开发混合量子 AI 模型提供了可迁移的蓝图。
Subjects: Machine Learning, Artificial Intelligence, Genomics 主题:机器学习、人工智能、基因组学
Publish: 2025-08-15 03:51:20 UTC 发布时间:2025-08-15 03:51:20 UTC
#62 A Semi-supervised Generative Model for Incomplete Multi-view Data Integration with Missing Labels #62 一个用于带缺失标签的不完全多视图数据集成的半监督生成模型
Authors: [Yiyang Shen](https://arxiv.org/search/?searchtype=author&query=Yiyang Shen), [Weiran Wang](https://arxiv.org/search/?searchtype=author&query=Weiran Wang) 作者:Yiyang Shen、Weiran Wang
Multi-view learning is widely applied to real-life datasets, such as multiple omics biological data, but it often suffers from both missing views and missing labels. Prior probabilistic approaches addressed the missing view problem by using a product-of-experts scheme to aggregate representations from present views and achieved superior performance over deterministic classifiers, using the information bottleneck (IB) principle. However, the IB framework is inherently fully supervised and cannot leverage unlabeled data. In this work, we propose a semi-supervised generative model that utilizes both labeled and unlabeled samples in a unified framework. Our method maximizes the likelihood of unlabeled samples to learn a latent space shared with the IB on labeled data. We also perform cross-view mutual information maximization in the latent space to enhance the extraction of shared information across views. Compared to existing approaches, our model achieves better predictive and imputation performance on both image and multi-omics data with missing views and limited labeled samples. 多视角学习广泛应用于真实世界的数据集,例如多组学生物数据,但它常常同时面临视角缺失和标签缺失的问题。先前的概率方法通过使用专家乘积(product-of-experts)方案来聚合存在视角的表示,从而解决视角缺失问题,并在信息瓶颈(IB)原则下在性能上优于确定性分类器。然而,IB 框架本质上是完全监督的,无法利用未标注数据。在这项工作中,我们提出了一个半监督生成模型,在统一框架中利用标注和未标注样本。我们的方法最大化未标注样本的似然以学习与标注数据上的 IB 共享的潜在空间。我们还在潜在空间中执行跨视角互信息最大化,以增强跨视角共享信息的提取。与现有方法相比,我们的模型在存在视角缺失和标注样本有限的图像和多组学数据上实现了更好的预测和插补性能。
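The product-of-experts aggregation mentioned above has a standard closed form for Gaussian experts: precisions add, and means are precision-weighted. A minimal numpy sketch (the Gaussian PoE is the common multi-view VAE choice the paper builds on; shapes are illustrative):

```python
import numpy as np

def poe(mus, logvars, prior_mu=0.0, prior_var=1.0):
    # Gaussian product of experts over whichever views are present:
    # missing views are simply left out of the lists.
    d = mus[0].shape[0]
    precisions = [np.full(d, 1.0 / prior_var)] + [np.exp(-lv) for lv in logvars]
    weighted = [np.full(d, prior_mu / prior_var)] + \
               [m * np.exp(-lv) for m, lv in zip(mus, logvars)]
    var = 1.0 / np.sum(precisions, axis=0)   # combined precision inverts
    mu = var * np.sum(weighted, axis=0)      # precision-weighted mean
    return mu, var

# Two observed views of a 4-d latent; a missing third view is just omitted.
rng = np.random.default_rng(0)
mu, var = poe([rng.normal(size=4), rng.normal(size=4)],
              [np.zeros(4), np.zeros(4)])
```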
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-15 03:10:18 UTC 发布:2025-08-15 03:10:18 UTC
#63 Better Supervised Fine-tuning for VQA: Integer-Only Loss #63 更好的监督微调用于 VQA:仅整数损失
Authors: [Baihong Qian](https://arxiv.org/search/?searchtype=author&query=Baihong Qian), [Haotian Fan](https://arxiv.org/search/?searchtype=author&query=Haotian Fan), [Wenjie Liao](https://arxiv.org/search/?searchtype=author&query=Wenjie Liao), [Yunqiu Wang](https://arxiv.org/search/?searchtype=author&query=Yunqiu Wang), [Tao Li](https://arxiv.org/search/?searchtype=author&query=Tao Li), [Junhui Cui](https://arxiv.org/search/?searchtype=author&query=Junhui Cui) 作者:钱百鸿、樊昊天、廖文杰、王云秋、李涛、崔俊辉
With the rapid advancement of vision language models (VLMs), their ability to assess visual content based on specific criteria and dimensions has become increasingly critical for applications such as video-theme consistency assessment and visual quality scoring. However, existing methods often suffer from imprecise results and inefficient loss calculation, which limit the focus of the model on key evaluation indicators. To address this, we propose IOVQA (Integer-Only VQA), a novel fine-tuning approach tailored for VLMs to enhance their performance in video quality assessment tasks. The key innovation of IOVQA lies in its label construction and its targeted loss calculation mechanism. Specifically, during dataset curation, we constrain the model's output to integers within the range of [10,50], ensuring numerical stability, and convert the decimal Overall_MOS values to integers before using them as labels. We also introduce a target-mask strategy: when computing the loss, only the two-digit integer portion of the label is left unmasked, forcing the model to learn the critical components of the numerical evaluation. After fine-tuning the Qwen2.5-VL model using the constructed dataset, experimental results demonstrate that the proposed method significantly improves the model's accuracy and consistency in the VQA task, ranking 3rd in the VQualA 2025 GenAI-Bench AIGC Video Quality Assessment Challenge – Track I. Our work highlights the effectiveness of retaining only integer labels during fine-tuning, providing an effective idea for optimizing VLMs in quantitative evaluation scenarios. 随着视觉语言模型(VLM)的快速发展,它们根据特定标准和维度评估视觉内容的能力对于视频主题一致性评估和视觉质量评分等应用变得愈发关键。然而,现有方法往往存在结果不精确和损失计算效率低下的问题,限制了模型对关键评价指标的聚焦。为了解决这一问题,我们提出了 IOVQA(仅整数 VQA),这是一种为 VLM 定制的微调方法,旨在提升其在视频质量评估任务中的表现。IOVQA 的关键创新在于其标签构建和有针对性的损失计算机制。具体而言,在数据集整理过程中,我们将模型的输出约束为 [10,50] 范围内的整数以确保数值稳定,并在将小数形式的 Overall_MOS 用作标签之前将其转换为整数。我们还引入了目标掩码策略:在计算损失时,仅对标签的前两位整数不进行掩码,迫使模型学习数值评估中的关键组成部分。在使用构建的数据集对 Qwen2.5-VL 模型进行微调后,实验结果表明所提出的方法显著提高了模型在 VQA 任务中的准确性和一致性,在 VQualA 2025 GenAI-Bench AIGC 视频质量评估挑战赛 —— 赛道 I 中排名第三。我们的工作强调了在微调时仅保留整数标签的有效性,为在定量评估场景中优化视觉语言模型(VLM)提供了一个有效的思路。
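The target-mask strategy maps directly onto the `ignore_index` convention of cross-entropy. A minimal torch sketch, assuming the positions of the two score digits are known from the prompt template (an assumption of this sketch, not the paper's released code):

```python
import torch
import torch.nn.functional as F

IGNORE = -100  # the index cross_entropy ignores by default

def mask_labels(input_ids, digit_positions):
    # Keep supervision only on the two digit tokens of the integer score;
    # digit_positions[row] = (i, j) gives their positions in that row.
    labels = torch.full_like(input_ids, IGNORE)
    for row, (i, j) in enumerate(digit_positions):
        labels[row, i] = input_ids[row, i]
        labels[row, j] = input_ids[row, j]
    return labels

# Toy usage: batch of 2 sequences with score digits at positions (3, 4).
ids = torch.randint(0, 1000, (2, 8))
labels = mask_labels(ids, [(3, 4), (3, 4)])
logits = torch.randn(2, 8, 1000)
loss = F.cross_entropy(logits.reshape(-1, 1000), labels.reshape(-1),
                       ignore_index=IGNORE)
```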
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-15 02:40:43 UTC 发布:2025-08-15 02:40:43 UTC
#64 Role-Augmented Intent-Driven Generative Search Engine Optimization #64 角色增强的意图驱动生成式搜索引擎优化
Authors: [Xiaolu Chen](https://arxiv.org/search/?searchtype=author&query=Xiaolu Chen), [Haojie Wu](https://arxiv.org/search/?searchtype=author&query=Haojie Wu), [Jie Bao](https://arxiv.org/search/?searchtype=author&query=Jie Bao), [Zhen Chen](https://arxiv.org/search/?searchtype=author&query=Zhen Chen), [Yong Liao](https://arxiv.org/search/?searchtype=author&query=Yong Liao), [Hu Huang](https://arxiv.org/search/?searchtype=author&query=Hu Huang) 作者:陈小路、吴昊杰、鲍杰、陈珍、廖勇、黄虎
Generative Search Engines (GSEs), powered by Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG), are reshaping information retrieval. While commercial systems (e.g., BingChat, Perplexity.ai) demonstrate impressive semantic synthesis capabilities, their black-box nature fundamentally undermines established Search Engine Optimization (SEO) practices. Content creators face a critical challenge: their optimization strategies, effective in traditional search engines, are misaligned with generative retrieval contexts, resulting in diminished visibility. To bridge this gap, we propose a Role-Augmented Intent-Driven Generative Search Engine Optimization (G-SEO) method, providing a structured optimization pathway tailored for GSE scenarios. Our method models search intent through reflective refinement across diverse informational roles, enabling targeted content enhancement. To better evaluate the method under realistic settings, we address the benchmarking limitations of prior work by: (1) extending the GEO dataset with diversified query variations reflecting real-world search scenarios and (2) introducing G-Eval 2.0, a 6-level LLM-augmented evaluation rubric for fine-grained human-aligned assessment. Experimental results demonstrate that search intent serves as an effective signal for guiding content optimization, yielding significant improvements over single-aspect baseline approaches in both subjective impressions and objective content visibility within GSE responses. 生成式搜索引擎(GSEs),由 LLMs 和检索增强生成(RAG)驱动,正在重塑信息检索领域。尽管商业系统(例如 BingChat、Perplexity.ai)展示了令人印象深刻的语义综合能力,但它们的黑箱特性从根本上削弱了既有的搜索引擎优化(SEO)实践。内容创作者面临一个关键挑战:在传统搜索引擎中有效的优化策略与生成式检索场景不匹配,导致可见性下降。为弥合这一差距,我们提出了一种角色增强的意图驱动生成式搜索引擎优化(G-SEO)方法,提供了一条针对 GSE 场景的结构化优化路径。我们的方法通过跨多样信息角色的反思性迭代来建模搜索意图,从而实现有针对性的内容增强。 为在更现实的设置下更好地评估该方法,我们通过以下方式解决了以往工作的基准测试局限性:(1)扩展了 GEO 数据集,增加了反映真实搜索场景的多样化查询变体;(2)引入了 G-Eval 2.0,这是一套由 LLM 增强的六级评估量表,用于细粒度且与人类一致的评估。实验结果表明,搜索意图作为指导内容优化的有效信号,在主观印象和 GSE 响应中的客观内容可见度方面,相较于单一维度的基线方法带来了显著改进。
Subjects: Information Retrieval, Artificial Intelligence 主题:信息检索,人工智能
Publish: 2025-08-15 02:08:55 UTC 发布:2025-08-15 02:08:55 UTC
#65 AlphaAgents: Large Language Model based Multi-Agents for Equity Portfolio Constructions #65 AlphaAgents:基于大语言模型的多智能体用于股票投资组合构建
Authors: [Tianjiao Zhao](https://arxiv.org/search/?searchtype=author&query=Tianjiao Zhao), [Jingrao Lyu](https://arxiv.org/search/?searchtype=author&query=Jingrao Lyu), [Stokes Jones](https://arxiv.org/search/?searchtype=author&query=Stokes Jones), [Harrison Garber](https://arxiv.org/search/?searchtype=author&query=Harrison Garber), [Stefano Pasquali](https://arxiv.org/search/?searchtype=author&query=Stefano Pasquali), [Dhagash Mehta](https://arxiv.org/search/?searchtype=author&query=Dhagash Mehta) 作者:Tianjiao Zhao, Jingrao Lyu, Stokes Jones, Harrison Garber, Stefano Pasquali, Dhagash Mehta
The field of artificial intelligence (AI) agents is evolving rapidly, driven by the capabilities of Large Language Models (LLMs) to autonomously perform and refine tasks with human-like efficiency and adaptability. In this context, multi-agent collaboration has emerged as a promising approach, enabling multiple AI agents to work together to solve complex challenges. This study investigates the application of role-based multi-agent systems to support stock selection in equity research and portfolio management. We present a comprehensive analysis performed by a team of specialized agents and evaluate their stock-picking performance against established benchmarks under varying levels of risk tolerance. Furthermore, we examine the advantages and limitations of employing multi-agent frameworks in equity analysis, offering critical insights into their practical efficacy and implementation challenges. 人工智能(AI)代理领域正在迅速发展,这一发展由大型语言模型(LLMs)驱动,使其能够以类人效率和适应性自主执行和改进任务。在此背景下,多代理协作已成为一种有前景的方法,能够让多个 AI 代理共同解决复杂问题。本研究探讨了基于角色的多代理系统在股票研究与投资组合管理中支持选股的应用。我们展示了由一组专门化代理执行的全面分析,并在不同风险容忍度下将其选股表现与既定基准进行评估。此外,我们考察了在股票分析中采用多代理框架的优点与局限,提供关于其实际效能和实施挑战的关键见解。
Subjects: Statistical Finance, Artificial Intelligence 学科:统计金融,人工智能
Publish: 2025-08-15 01:49:56 UTC 发表时间:2025-08-15 01:49:56 UTC
#66 Actor-Critic for Continuous Action Chunks: A Reinforcement Learning Framework for Long-Horizon Robotic Manipulation with Sparse Reward #66 连续动作分块的 Actor-Critic:用于稀疏奖励下长期机器人操控的强化学习框架
Authors: [Jiarui Yang](https://arxiv.org/search/?searchtype=author&query=Jiarui Yang), [Bin Zhu](https://arxiv.org/search/?searchtype=author&query=Bin Zhu), [Jingjing Chen](https://arxiv.org/search/?searchtype=author&query=Jingjing Chen), [Yu-Gang Jiang](https://arxiv.org/search/?searchtype=author&query=Yu-Gang Jiang) 作者:杨嘉瑞,朱斌,陈晶晶,江予刚
Existing reinforcement learning (RL) methods struggle with long-horizon robotic manipulation tasks, particularly those involving sparse rewards. While action chunking is a promising paradigm for robotic manipulation, using RL to directly learn continuous action chunks in a stable and data-efficient manner remains a critical challenge. This paper introduces AC3 (Actor-Critic for Continuous Chunks), a novel RL framework that learns to generate high-dimensional, continuous action sequences. To make this learning process stable and data-efficient, AC3 incorporates targeted stabilization mechanisms for both the actor and the critic. First, to ensure reliable policy improvement, the actor is trained with an asymmetric update rule, learning exclusively from successful trajectories. Second, to enable effective value learning despite sparse rewards, the critic’s update is stabilized using intra-chunk n-step returns and further enriched by a self-supervised module providing intrinsic rewards at anchor points aligned with each action chunk. We conducted extensive experiments on 25 tasks from the BiGym and RLBench benchmarks. Results show that by using only a few demonstrations and a simple model architecture, AC3 achieves superior success rates on most tasks, validating its effective design. 现有的强化学习(RL)方法在长期机器人操作任务上表现不佳,尤其是那些稀疏奖励的任务。尽管动作分块是机器人操作的一个有前景的范式,但使用 RL 直接稳定且数据高效地学习连续动作分块仍然是一个关键挑战。本文提出了 AC3(用于连续分块的演员-评论家),一种学习生成高维连续动作序列的新型 RL 框架。为了使该学习过程稳定且数据高效,AC3 为演员和评论家都引入了有针对性的稳定机制。首先,为确保策略可靠改进,演员采用非对称更新规则,仅从成功轨迹中学习。其次,为在稀疏奖励下实现有效的价值学习,评论家的更新通过分块内的 n -步回报来稳定,并辅以一个自监督模块在与每个动作分块对齐的锚点处提供内在奖励以进一步丰富学习信号。我们在 BiGym 和 RLBench 基准的 25 个任务上进行了大量实验。 结果表明,仅使用少量示例和一个简单的模型架构,AC3 就在大多数任务上实现了更高的成功率,验证了其有效的设计。
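The intra-chunk n-step return used to stabilize the critic is a standard n-step target computed over one action chunk; how AC3 anchors its intrinsic rewards is omitted here. A minimal sketch, with shapes as assumptions:

```python
import torch

def intra_chunk_nstep_return(rewards, v_next, gamma=0.99):
    # Critic target over one action chunk of length n:
    #   R = sum_{i<n} gamma^i * r_i + gamma^n * V(s_{t+n})
    n = rewards.shape[-1]
    discounts = gamma ** torch.arange(n, dtype=rewards.dtype)
    return (discounts * rewards).sum(dim=-1) + (gamma ** n) * v_next

# Toy usage: a chunk of 8 sparse rewards plus a bootstrap value estimate.
target = intra_chunk_nstep_return(torch.zeros(8), v_next=torch.tensor(0.7))
```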
Subjects: Robotics, Artificial Intelligence 主题:机器人学,人工智能
Publish: 2025-08-15 01:27:15 UTC 发布日期:2025-08-15 01:27:15 UTC
#67 A Cross-Modal Rumor Detection Scheme via Contrastive Learning by Exploring Text and Image internal Correlations #67 一种通过对比学习探索文本与图像内部相关性的跨模态谣言检测方案
Authors: [Bin Ma](https://arxiv.org/search/?searchtype=author&query=Bin Ma), [Yifei Zhang](https://arxiv.org/search/?searchtype=author&query=Yifei Zhang), [Yongjin Xian](https://arxiv.org/search/?searchtype=author&query=Yongjin Xian), [Qi Li](https://arxiv.org/search/?searchtype=author&query=Qi Li), [Linna Zhou](https://arxiv.org/search/?searchtype=author&query=Linna Zhou), [Gongxun Miao](https://arxiv.org/search/?searchtype=author&query=Gongxun Miao) 作者:马斌、张逸飞、咸永进、李琦、周琳娜、缪公训
Existing rumor detection methods often neglect the content within images as well as the inherent relationships between contexts and images across different visual scales, thereby resulting in the loss of critical information pertinent to rumor identification. To address these issues, this paper presents a novel cross-modal rumor detection scheme based on contrastive learning, namely the Multi-scale Image and Context Correlation exploration algorithm (MICC). Specifically, we design an SCLIP encoder to generate unified semantic embeddings for text and multi-scale image patches through contrastive pretraining, enabling their relevance to be measured via dot-product similarity. Building upon this, a Cross-Modal Multi-Scale Alignment module is introduced to identify image regions most relevant to the textual semantics, guided by mutual information maximization and the information bottleneck principle, through a Top-K selection strategy based on a cross-modal relevance matrix constructed between the text and multi-scale image patches. Moreover, a scale-aware fusion network is designed to integrate the highly correlated multi-scale image features with global text features by assigning adaptive weights to image regions based on their semantic importance and cross-modal relevance. The proposed methodology has been extensively evaluated on two real-world datasets. The experimental results demonstrate that it achieves a substantial performance improvement over existing state-of-the-art approaches in rumor detection, highlighting its effectiveness and potential for practical applications. 现有的谣言检测方法常常忽略图像内部的内容以及不同视觉尺度下文本与图像之间固有的关联,从而导致与谣言识别密切相关的重要信息丢失。为了解决这些问题,本文提出了一种基于对比学习的新型跨模态谣言检测方案,称为多尺度图像与上下文相关性探索算法(MICC)。具体地,我们设计了一个 SCLIP 编码器,通过对比预训练为文本和多尺度图像补丁生成统一的语义嵌入,从而可以通过点积相似度来衡量它们的相关性。在此基础上,引入了一个跨模态多尺度对齐模块,该模块通过在文本与多尺度图像补丁之间构建的跨模态相关矩阵,采用基于 Top-K 的选择策略,并结合互信息最大化与信息瓶颈原则,识别与文本语义最相关的图像区域。 此外,设计了一个尺度感知融合网络,通过根据图像区域的语义重要性和跨模态相关性为其分配自适应权重,将高度相关的多尺度图像特征与全局文本特征整合。所提出的方法在两个真实世界数据集上进行了广泛评估。实验结果表明,与现有的最先进方法相比,它在谣言检测方面实现了显著的性能提升,凸显了其有效性和实际应用潜力。
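The Top-K selection over the cross-modal relevance matrix reduces to dot products plus a sort once embeddings share a space. A minimal numpy sketch (k and the embedding dimension are assumptions):

```python
import numpy as np

def topk_patches(text_emb, patch_embs, k=5):
    # Dot-product relevance between one text embedding and multi-scale
    # patch embeddings (one row per patch), followed by Top-K selection,
    # in the spirit of the alignment module described above.
    scores = patch_embs @ text_emb
    idx = np.argsort(scores)[::-1][:k]
    return idx, scores[idx]

# Toy usage: 100 patch embeddings of dimension 64.
rng = np.random.default_rng(0)
idx, top = topk_patches(rng.normal(size=64), rng.normal(size=(100, 64)), k=5)
```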
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language 主题:计算机视觉与模式识别、人工智能、计算与语言
Publish: 2025-08-15 01:13:50 UTC 发布时间:2025-08-15 01:13:50 UTC
#68 MoNaCo: More Natural and Complex Questions for Reasoning Across Dozens of Documents #68 MoNaCo:用于跨数十篇文档推理的更自然、更复杂的问题
Authors: [Tomer Wolfson](https://arxiv.org/search/?searchtype=author&query=Tomer Wolfson), [Harsh Trivedi](https://arxiv.org/search/?searchtype=author&query=Harsh Trivedi), [Mor Geva](https://arxiv.org/search/?searchtype=author&query=Mor Geva), [Yoav Goldberg](https://arxiv.org/search/?searchtype=author&query=Yoav Goldberg), [Dan Roth](https://arxiv.org/search/?searchtype=author&query=Dan Roth), [Tushar Khot](https://arxiv.org/search/?searchtype=author&query=Tushar Khot), [Ashish Sabharwal](https://arxiv.org/search/?searchtype=author&query=Ashish Sabharwal), [Reut Tsarfaty](https://arxiv.org/search/?searchtype=author&query=Reut Tsarfaty) 作者:Tomer Wolfson、Harsh Trivedi、Mor Geva、Yoav Goldberg、Dan Roth、Tushar Khot、Ashish Sabharwal、Reut Tsarfaty
Large language models (LLMs) are emerging as a go-to tool for querying information. However, current LLM benchmarks rarely feature natural questions that are both information-seeking and genuinely time-consuming for humans. To address this gap, we introduce MoNaCo, a benchmark of 1,315 natural and complex questions that require dozens, and at times hundreds, of intermediate steps to solve – far more than any existing QA benchmark. To build MoNaCo, we developed a decomposed annotation pipeline to elicit and manually answer natural time-consuming questions at scale. Frontier LLMs evaluated on MoNaCo achieve at most 61.2% F1, hampered by low recall and hallucinations. Our results underscore the need for reasoning models that better handle the complexity and sheer breadth of real-world information-seeking questions – with MoNaCo providing an effective resource for tracking such progress. The MoNaCo benchmark, codebase, prompts, and model predictions are publicly available at: https://tomerwolgithub.github.io/monaco 大型语言模型(LLMs)正成为查询信息的首选工具。然而,目前的 LLM 基准很少包含既是寻求信息又对人类而言确实耗时的自然问题。为填补这一空白,我们引入了 MoNaCo,这是一个包含 1,315 个自然且复杂问题的基准,这些问题需要数十步,有时甚至数百步的中间步骤来解决——远超过任何现有的问答基准。为构建 MoNaCo,我们开发了一套分解的标注流程,以大规模引导并人工回答自然的耗时问题。在 MoNaCo 上评估的前沿 LLMs 最多仅达到 61.2% 的 F1,受制于低召回率和幻觉问题。我们的结果强调需要更能处理现实世界信息查询问题复杂性和广度的推理模型——MoNaCo 为跟踪此类进展提供了有效资源。MoNaCo 基准、代码库、提示和模型预测可在以下地址公开获得: https://tomerwolgithub.github.io/monaco
Subjects: Computation and Language, Artificial Intelligence, Databases 主题:计算与语言、人工智能、数据库
Publish: 2025-08-15 00:58:10 UTC 发布时间:2025-08-15 00:58:10 UTC
#69 Tabularis Formatus: Predictive Formatting for Tables #69 表格格式化:用于表格的预测性格式化
Authors: [Mukul Singh](https://arxiv.org/search/?searchtype=author&query=Mukul Singh), [José Cambronero](https://arxiv.org/search/?searchtype=author&query=José Cambronero), [Sumit Gulwani](https://arxiv.org/search/?searchtype=author&query=Sumit Gulwani), [Vu Le](https://arxiv.org/search/?searchtype=author&query=Vu Le), [Gust Verbruggen](https://arxiv.org/search/?searchtype=author&query=Gust Verbruggen) 作者:Mukul Singh、José Cambronero、Sumit Gulwani、Vu Le、Gust Verbruggen
Spreadsheet manipulation software is widely used for data management and analysis of tabular data, yet the creation of conditional formatting (CF) rules remains a complex task requiring technical knowledge and experience with specific platforms. In this paper we present TaFo, a neuro-symbolic approach to generating CF suggestions for tables, addressing common challenges such as user unawareness, difficulty in rule creation, and inadequate user interfaces. TaFo takes inspiration from component-based synthesis systems and extends them with the semantic knowledge of language models and a diversity-preserving rule ranking. Unlike previous methods focused on structural formatting, TaFo uniquely incorporates value-based formatting, automatically learning both the rule trigger and the associated visual formatting properties for CF rules. By removing the dependency on user specification used by existing techniques in the form of formatted examples or natural language instruction, TaFo makes formatting completely predictive and automated for the user. To evaluate TaFo, we use a corpus of 1.8 million public workbooks with CF and manual formatting. We compare TaFo against a diverse set of symbolic and neural systems designed for or adapted to the task of table formatting. Our results show that TaFo generates more accurate, diverse, and complete formatting suggestions than current systems and outperforms them by 15.6%–26.5% on matching user-added ground-truth rules in tables. 电子表格操作软件被广泛用于表格数据的管理和分析,但条件格式(CF)规则的创建仍然是一项复杂的任务,需具备技术知识并熟悉特定平台。在本文中,我们提出了 TaFo,一种生成表格条件格式建议的神经符号方法,旨在解决用户不了解、规则创建困难以及用户界面不充分等常见挑战。TaFo 受到基于组件的合成系统的启发,并将其扩展为结合语言模型的语义知识和保持多样性的规则排序。与以往侧重于结构化格式的方法不同,TaFo 独特地融合了基于数值的格式化,能够自动学习触发规则以及与之相关的视觉格式属性,用于条件格式规则。通过去除现有技术中以已格式化示例或自然语言指令形式对用户指定的依赖,TaFo 使格式化对用户而言完全具有预测性和自动化。为评估 TaFo,我们使用了包含 180 万个带有条件格式和手动格式化的公开工作簿的语料库。我们将 TaFo 与为表格格式化任务设计或改编的一系列符号与神经系统进行了比较。结果表明,TaFo 所生成的格式化建议在准确性、多样性和完整性上均优于现有系统,并在匹配用户在表格中添加的真实规则方面,比这些系统高出 15.6%——26.5%。
Subjects: Databases, Artificial Intelligence, Software Engineering 主题:数据库、人工智能、软件工程
Publish: 2025-08-14 23:54:40 UTC 发布:2025-08-14 23:54:40 UTC
#70 Quantization through Piecewise-Affine Regularization: Optimization and Statistical Guarantees #70 通过分段仿射正则化实现量化:优化与统计保证
Authors: [Jianhao Ma](https://arxiv.org/search/?searchtype=author&query=Jianhao Ma), [Lin Xiao](https://arxiv.org/search/?searchtype=author&query=Lin Xiao) 作者:Jianhao Ma,Lin Xiao
Optimization problems over discrete or quantized variables are very challenging in general due to the combinatorial nature of their search space. Piecewise-affine regularization (PAR) provides a flexible modeling and computational framework for quantization based on continuous optimization. In this work, we focus on the setting of supervised learning and investigate the theoretical foundations of PAR from optimization and statistical perspectives. First, we show that in the overparameterized regime, where the number of parameters exceeds the number of samples, every critical point of the PAR-regularized loss function exhibits a high degree of quantization. Second, we derive closed-form proximal mappings for various (convex, quasi-convex, and non-convex) PARs and show how to solve PAR-regularized problems using the proximal gradient method, its accelerated variant, and the Alternating Direction Method of Multipliers. Third, we study statistical guarantees of PAR-regularized linear regression problems; specifically, we can approximate classical formulations of ℓ1-, squared ℓ2-, and nonconvex regularizations using PAR and obtain similar statistical guarantees with quantized solutions. 针对离散或量化变量的优化问题由于其组合性质的搜索空间通常非常具有挑战性。分段仿射正则化(PAR)为基于连续优化的量化提供了一种灵活的建模和计算框架。在本工作中,我们聚焦于监督学习的场景,并从优化与统计的角度研究 PAR 的理论基础。首先,我们证明在过参数化的情形下(参数数量超过样本数量),PAR 正则化损失函数的每一个临界点都表现出高度的量化特性。其次,我们为多种(凸的、拟凸的和非凸的)PAR 推导了闭式的近端映射,并展示了如何使用近端梯度法、其加速变体以及交替方向乘子法来求解受 PAR 正则化的问题。 第三,我们研究了 PAR 正则化线性回归问题的统计保证;具体而言,我们可以使用 PAR 近似经典的 ℓ1 、平方 ℓ2 和非凸正则化形式,并获得具有量化解的类似统计保证。
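For readers who want the objects named above on one page: the proximal mapping solved at each step of the proximal gradient method, plus one illustrative piecewise-affine quantization penalty (a simple example, not necessarily the paper's exact regularizer).

```latex
% Proximal mapping solved at each proximal gradient step:
\operatorname{prox}_{\lambda R}(v) = \arg\min_{x}\ \tfrac{1}{2}(x - v)^{2} + \lambda R(x)

% One simple piecewise-affine quantization penalty over a grid
% \mathcal{Q} = \{q_1, \dots, q_K\} (an illustrative choice):
R(x) = \min_{k}\, \lvert x - q_k \rvert

% For \lambda small relative to the grid spacing, this prox shrinks v
% toward its nearest level q_k and snaps exactly onto it once
% \lvert v - q_k \rvert \le \lambda, which is how critical points end up
% highly quantized.
```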
Subjects: Machine Learning, Artificial Intelligence, Optimization and Control, Machine Learning 主题:机器学习、人工智能、优化与控制、机器学习
Publish: 2025-08-14 23:35:21 UTC 发表:2025-08-14 23:35:21 UTC
#71 Diffusion is a code repair operator and generator #71 Diffusion 是一种代码修复算子和生成器
Authors: [Mukul Singh](https://arxiv.org/search/?searchtype=author&query=Mukul Singh), [Gust Verbruggen](https://arxiv.org/search/?searchtype=author&query=Gust Verbruggen), [Vu Le](https://arxiv.org/search/?searchtype=author&query=Vu Le), [Sumit Gulwani](https://arxiv.org/search/?searchtype=author&query=Sumit Gulwani) 作者:Mukul Singh、Gust Verbruggen、Vu Le、Sumit Gulwani
Code diffusion models generate code by iteratively removing noise from the latent representation of a code snippet. During later steps of the diffusion process, when the code snippet has almost converged, differences between discrete representations of these snippets look like last-mile repairs applied to broken or incomplete code. We evaluate the extent to which this resemblance can be exploited to leverage pre-trained code diffusion models for the problem of last-mile repair by considering two applications with significant potential. First, we can leverage the diffusion model for last-mile repair by adding noise to a broken code snippet and resuming the diffusion process. Second, we can leverage the diffusion model to generate an arbitrary amount of training data for last-mile repair models (which are computationally more efficient) by sampling an intermediate program (input) and the final program (output) from the diffusion process. We perform experiments in three domains (Python, Excel, and PowerShell) to evaluate both applications and analyze their properties. 代码扩散模型通过对代码片段的潜在表示迭代去噪来生成代码。在扩散过程的后期,当代码片段几乎收敛时,这些片段的离散表示之间的差异看起来像是对损坏或不完整代码所做的“最后一公里”修复。我们评估了这种相似性在多大程度上可以被利用,以便将预训练的代码扩散模型用于“最后一公里”修复问题,考虑了两个具有重大潜力的应用。首先,我们可以通过向损坏的代码片段添加噪声并恢复扩散过程,来利用扩散模型进行最后一公里修复。其次,我们可以通过从扩散过程中采样中间程序(输入)和最终程序(输出),利用扩散模型生成任意数量的用于最后一公里修复任务的训练数据(这种方式在计算上更高效)。我们在三个领域(Python、Excel 和 PowerShell)上进行了实验,以评估这些应用并分析其性质。
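The first application ("repair by resuming diffusion") can be sketched abstractly: noise the broken snippet's latent to an intermediate step and run the reverse process from there. Everything below is a hypothetical placeholder, not a specific library's API; the forward-noising term in particular stands in for a proper scheduler.

```python
import torch

def lastmile_repair(z_broken, denoise_step, t0=100):
    # Inject a small amount of noise into the broken snippet's latent
    # (placeholder for scheduler-weighted forward noising), then run the
    # reverse process from intermediate step t0 back to 0.
    z = z_broken + 0.1 * torch.randn_like(z_broken)
    for t in reversed(range(t0)):
        z = denoise_step(z, t)  # one reverse (denoising) step
    return z

# Toy usage with a trivial stand-in denoiser.
repaired = lastmile_repair(torch.randn(4, 16),
                           denoise_step=lambda z, t: 0.99 * z)
```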
Subjects: Software Engineering, Artificial Intelligence, Computation and Language 主题:软件工程,人工智能,计算与语言
Publish: 2025-08-14 23:27:09 UTC 发布:2025-08-14 23:27:09 UTC
#72 Utilizing Vision-Language Models as Action Models for Intent Recognition and Assistance #72 将视觉-语言模型作为行为模型用于意图识别和辅助
Authors: [Cesar Alan Contreras](https://arxiv.org/search/?searchtype=author&query=Cesar Alan Contreras), [Manolis Chiou](https://arxiv.org/search/?searchtype=author&query=Manolis Chiou), [Alireza Rastegarpanah](https://arxiv.org/search/?searchtype=author&query=Alireza Rastegarpanah), [Michal Szulik](https://arxiv.org/search/?searchtype=author&query=Michal Szulik), [Rustam Stolkin](https://arxiv.org/search/?searchtype=author&query=Rustam Stolkin) 作者:Cesar Alan Contreras、Manolis Chiou、Alireza Rastegarpanah、Michal Szulik、Rustam Stolkin
Human-robot collaboration requires robots to quickly infer user intent, provide transparent reasoning, and assist users in achieving their goals. Our recent work introduced GUIDER, our framework for inferring navigation and manipulation intents. We propose augmenting GUIDER with a vision-language model (VLM) and a text-only language model (LLM) to form a semantic prior that filters objects and locations based on the mission prompt. A vision pipeline (YOLO for object detection and the Segment Anything Model for instance segmentation) feeds candidate object crops into the VLM, which scores their relevance given an operator prompt; in addition, the list of detected object labels is ranked by a text-only LLM. These scores weight the existing navigation and manipulation layers of GUIDER, selecting context-relevant targets while suppressing unrelated objects. Once the combined belief exceeds a threshold, autonomy changes occur, enabling the robot to navigate to the desired area and retrieve the desired object, while adapting to any changes in the operator’s intent. Future work will evaluate the system on Isaac Sim using a Franka Emika arm on a Ridgeback base, with a focus on real-time assistance. 人机协作要求机器人能够快速推断用户意图、提供透明的推理,并协助用户实现目标。我们最近的工作引入了 GUIDER,这是我们用于推断导航和操作意图的框架。我们提出将 GUIDER 与视觉-语言模型(VLM)和纯文本语言模型(LLM)结合,以形成一个语义先验,根据任务提示过滤对象和位置。一个视觉管道(用于物体检测的 YOLO 和用于实例分割的 Segment Anything Model)将候选物体裁切图输入 VLM,VLM 根据操作员提示对它们的重要性进行评分;此外,检测到的物体标签列表由纯文本的 LLM 进行排序。这些评分为 GUIDER 现有的导航和操作层加权,选择与上下文相关的目标,同时抑制无关物体。一旦组合信念超过阈值,就会发生自治改变,使机器人能够导航到目标区域并取回所需物体,同时适应操作员意图的任何变化。 未来的工作将使用安装在 Ridgeback 底盘上的 Franka Emika 机械臂,在 Isaac Sim 上评估该系统,重点关注实时辅助。
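A hypothetical sketch of the semantic-prior weighting described above: VLM crop scores and LLM label ranks are fused into per-object weights that rescale the belief, with an autonomy change once the belief passes a threshold. The fusion rule and numbers are illustrative only, not GUIDER's actual update.

```python
import numpy as np

def fused_weights(vlm_scores, llm_ranks):
    # Assumes non-negative VLM relevance scores; lower LLM rank = better.
    rank_w = 1.0 / (1.0 + np.asarray(llm_ranks, dtype=float))
    w = np.asarray(vlm_scores, dtype=float) * rank_w
    return w / w.sum()

def autonomy_trigger(belief, weights, threshold=0.8):
    combined = belief * weights
    combined /= combined.sum()
    return combined.max() >= threshold, combined

# Toy usage: 3 candidate objects.
w = fused_weights([0.9, 0.2, 0.1], llm_ranks=[0, 2, 1])
fire, belief = autonomy_trigger(np.array([0.4, 0.3, 0.3]), w)
```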
Subjects: Robotics, Artificial Intelligence, Human-Computer Interaction 主题:机器人学、人工智能、人机交互
Publish: 2025-08-14 22:19:09 UTC 发布:2025-08-14 22:19:09 UTC
#73 Compressive Meta-Learning #73 压缩元学习
Authors: [Daniel Mas Montserrat](https://arxiv.org/search/?searchtype=author&query=Daniel Mas Montserrat), [David Bonet](https://arxiv.org/search/?searchtype=author&query=David Bonet), [Maria Perera](https://arxiv.org/search/?searchtype=author&query=Maria Perera), [Xavier Giró-i-Nieto](https://arxiv.org/search/?searchtype=author&query=Xavier Giró-i-Nieto), [Alexander G. Ioannidis](https://arxiv.org/search/?searchtype=author&query=Alexander G. Ioannidis) 作者:Daniel Mas Montserrat、David Bonet、Maria Perera、Xavier Giró-i-Nieto、Alexander G. Ioannidis
The rapid expansion in the size of new datasets has created a need for fast and efficient parameter-learning techniques. Compressive learning is a framework that enables efficient processing by using random, non-linear features to project large-scale databases onto compact, information-preserving representations whose dimensionality is independent of the number of samples and can be easily stored, transferred, and processed. These database-level summaries are then used to decode parameters of interest from the underlying data distribution without requiring access to the original samples, offering an efficient and privacy-friendly learning framework. However, both the encoding and decoding techniques are typically randomized and data-independent, failing to exploit the underlying structure of the data. In this work, we propose a framework that meta-learns both the encoding and decoding stages of compressive learning methods by using neural networks that provide faster and more accurate systems than the current state-of-the-art approaches. To demonstrate the potential of the presented Compressive Meta-Learning framework, we explore multiple applications – including neural network-based compressive PCA, compressive ridge regression, compressive k-means, and autoencoders. 新数据集规模的迅速扩大带来了对快速高效参数学习技术的需求。压缩学习是一种框架,利用随机的非线性特征将大规模数据库投影到紧凑且保留信息的表示上,从而实现高效处理;这些表示的维度与样本数量无关,且易于存储、传输和处理。然后使用这些数据库级别的摘要从底层数据分布中解码出感兴趣的参数,而无需访问原始样本,提供了一种高效且有利于隐私的学习框架。然而,编码和解码技术通常是随机且与数据无关的,未能利用数据的潜在结构。在本文中,我们提出了一个框架,通过使用神经网络对压缩学习方法的编码和解码阶段进行元学习,从而比当前最先进的方法提供更快且更准确的系统。 为了展示所提出的压缩元学习框架的潜力,我们探索了多种应用——包括基于神经网络的压缩主成分分析(PCA)、压缩岭回归、压缩 k 均值聚类和自编码器。
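For context, the classical data-independent encoding that the paper meta-learns over is a random-feature sketch: the entire dataset collapses to one fixed-size vector whose dimension is independent of the number of samples. A minimal numpy sketch using random Fourier features (sizes are illustrative):

```python
import numpy as np

def dataset_sketch(X, W, b):
    # Random Fourier features averaged over all samples:
    #   z = (1/n) * sum_i cos(W @ x_i + b)
    return np.cos(X @ W.T + b).mean(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 8))         # n samples, d features
W = rng.normal(size=(256, 8))            # m random projections
b = rng.uniform(0, 2 * np.pi, size=256)  # random phases
z = dataset_sketch(X, W, b)              # (256,) summary, independent of n
```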
Subjects: Machine Learning, Artificial Intelligence, Computational Engineering, Finance, and Science, Databases 主题:机器学习、人工智能、计算工程、金融与科学、数据库
Publish: 2025-08-14 22:08:06 UTC 发表:2025-08-14 22:08:06 UTC
#74 LD-LAudio-V1: Video-to-Long-Form-Audio Generation Extension with Dual Lightweight Adapters #74 LD-LAudio-V1:具有双轻量适配器的视频到长格式音频生成扩展
Authors: [Haomin Zhang](https://arxiv.org/search/?searchtype=author&query=Haomin Zhang), [Kristin Qi](https://arxiv.org/search/?searchtype=author&query=Kristin Qi), [Shuxin Yang](https://arxiv.org/search/?searchtype=author&query=Shuxin Yang), [Zihao Chen](https://arxiv.org/search/?searchtype=author&query=Zihao Chen), [Chaofan Ding](https://arxiv.org/search/?searchtype=author&query=Chaofan Ding), [Xinhan Di](https://arxiv.org/search/?searchtype=author&query=Xinhan Di) 作者:张浩民、Kristin Qi、杨淑昕、陈子豪、丁朝凡、狄新翰
Generating high-quality and temporally synchronized audio from video content is essential for video editing and post-production tasks, enabling the creation of semantically aligned audio for silent videos. However, most existing approaches focus on short-form audio generation for video segments under 10 seconds or rely on noisy datasets for long-form video-to-audio synthesis. To address these limitations, we introduce LD-LAudio-V1, an extension of state-of-the-art video-to-audio models that incorporates dual lightweight adapters to enable long-form audio generation. In addition, we release a clean and human-annotated video-to-audio dataset that contains pure sound effects without noise or artifacts. Our method significantly reduces splicing artifacts and temporal inconsistencies while maintaining computational efficiency. Compared to direct fine-tuning with short training videos, LD-LAudio-V1 achieves significant improvements across multiple metrics: FDpasst 450.00 → 327.29 (+27.27%), FDpanns 34.88 → 22.68 (+34.98%), FDvgg 3.75 → 1.28 (+65.87%), KLpanns 2.49 → 2.07 (+16.87%), KLpasst 1.78 → 1.53 (+14.04%), ISpanns 4.17 → 4.30 (+3.12%), IBscore 0.25 → 0.28 (+12.00%), EnergyΔ10ms 0.3013 → 0.1349 (+55.23%), EnergyΔ10ms(vs.GT) 0.0531 → 0.0288 (+45.76%), and Sem.Rel. 2.73 → 3.28 (+20.15%). Our dataset aims to facilitate further research in long-form video-to-audio generation and is available at https://github.com/deepreasonings/long-form-video2audio. 从视频内容生成高质量且时间同步的音频对于视频编辑和后期制作任务至关重要,这使得为无声视频创建语义对齐的音频成为可能。然而,大多数现有方法都专注于为不到 10 秒的视频片段生成短时音频,或在长时视频到音频合成时依赖嘈杂的数据集。为了解决这些局限性,我们提出了 LD-LAudio-V1,这是对最先进视频到音频模型的扩展,且引入了双轻量适配器以支持长时音频生成。此外,我们发布了一个干净且人工标注的视频到音频数据集,包含无噪声或伪影的纯音效。我们的方法在保持计算效率的同时,显著减少了拼接伪影和时间不一致性。 与使用短训练视频直接微调相比,LD-LAudio-V1 在多项指标上取得了显著提升: FDpasst 450.00 → 327.29(+27.27%)、 FDpanns 34.88 → 22.68(+34.98%)、 FDvgg 3.75 → 1.28(+65.87%)、 KLpanns 2.49 → 2.07(+16.87%)、 KLpasst 1.78 → 1.53(+14.04%)、 ISpanns 4.17 → 4.30(+3.12%)、 IBscore 0.25 → 0.28(+12.00%)、 EnergyΔ10ms 0.3013 → 0.1349(+55.23%)、 EnergyΔ10ms(vs.GT) 0.0531 → 0.0288(+45.76%),以及 Sem.Rel. 2.73 → 3.28(+20.15%)。我们的数据集旨在促进长格式视频到音频生成的进一步研究,数据可在 https://github.com/deepreasonings/long-form-video2audio 获取。
Subjects: Sound, Artificial Intelligence, Computer Vision and Pattern Recognition, Audio and Speech Processing 主题:声音、人工智能、计算机视觉与模式识别、音频与语音处理
Publish: 2025-08-14 21:11:57 UTC 发布:2025-08-14 21:11:57 UTC
#75 AI That Helps Us Help Each Other: A Proactive System for Scaffolding Mentor-Novice Collaboration in Entrepreneurship Coaching #75 有助于我们相互帮助的人工智能:一个用于支持创业辅导中导师-新手协作的主动系统
Authors: [Evey Jiaxin Huang](https://arxiv.org/search/?searchtype=author&query=Evey Jiaxin Huang), [Matthew Easterday](https://arxiv.org/search/?searchtype=author&query=Matthew Easterday), [Elizabeth Gerber](https://arxiv.org/search/?searchtype=author&query=Elizabeth Gerber) 作者:Evey Jiaxin Huang、Matthew Easterday、Elizabeth Gerber
Entrepreneurship requires navigating open-ended, ill-defined problems: identifying risks, challenging assumptions, and making strategic decisions under deep uncertainty. Novice founders often struggle with these metacognitive demands, while mentors face limited time and visibility to provide tailored support. We present a human-AI coaching system that combines a domain-specific cognitive model of entrepreneurial risk with a large language model (LLM) to proactively scaffold both novice and mentor thinking. The system proactively poses diagnostic questions that challenge novices' thinking and helps both novices and mentors plan for more focused and emotionally attuned meetings. Critically, mentors can inspect and modify the underlying cognitive model, shaping the logic of the system to reflect their evolving needs. Through an exploratory field deployment, we found that using the system supported novice metacognition, helped mentors plan emotionally attuned strategies, and improved meeting depth, intentionality, and focus, while also surfacing key tensions around trust, misdiagnosis, and expectations of AI. We contribute design principles for proactive AI systems that scaffold metacognition and human-human collaboration in complex, ill-defined domains, offering implications for similar domains like healthcare, education, and knowledge work. 创业需要应对开放性、定义不清的问题:识别风险、挑战假设,并在深度不确定性下做出战略决策。初创创始人常在这些元认知要求上挣扎,而导师则受限于时间和可见性,难以提供定制支持。我们提出了一种人机协同辅导系统,将面向创业风险的领域特定认知模型与大型语言模型 (LLM) 结合起来,主动为初学者和导师搭建认知框架。该系统主动提出诊断性问题以挑战初学者的思维,并帮助初学者与导师共同规划更有针对性且情感契合的会议。关键在于,导师可以检查并修改底层认知模型,从而根据自身不断变化的需求调整系统逻辑。通过一次探索性现场部署,我们发现使用该系统支持了初学者的元认知,帮助导师制定情感契合的策略,并提升了会议的深度、目的性和聚焦性——同时也暴露出关于信任、误诊断以及对人工智能期望等方面的重要矛盾。 我们提出了用于主动式人工智能系统的设计原则,这些原则可以在复杂且定义不清的领域中支撑元认知和人际协作,并对医疗、教育和知识工作等类似领域提供借鉴。
Subjects: Human-Computer Interaction, Artificial Intelligence 主题:人机交互,人工智能
Publish: 2025-08-14 20:23:48 UTC 发布:2025-08-14 20:23:48 世界标准时间
#76 Learning with Confidence #76 有信心的学习
Author: [Oliver Ethan Richardson](https://arxiv.org/search/?searchtype=author&query=Oliver Ethan Richardson) 作者:Oliver Ethan Richardson
We characterize a notion of confidence that arises in learning or updating beliefs: the amount of trust one has in incoming information and its impact on the belief state. This learner’s confidence can be used alongside (and is easily mistaken for) probability or likelihood, but it is fundamentally a different concept – one that captures many familiar concepts in the literature, including learning rates and number of training epochs, Shafer’s weight of evidence, and Kalman gain. We formally axiomatize what it means to learn with confidence, give two canonical ways of measuring confidence on a continuum, and prove that confidence can always be represented in this way. Under additional assumptions, we derive more compact representations of confidence-based learning in terms of vector fields and loss functions. These representations induce an extended language of compound “parallel” observations. We characterize Bayes Rule as the special case of an optimizing learner whose loss representation is a linear expectation. 我们刻画了在学习或更新信念时产生的一种“置信度”概念:即对新信息的信任程度以及它对信念状态的影响。学习者的这种置信度可以与(且常被误认为)概率或似然性并用,但它本质上是一个不同的概念——它涵盖了文献中许多熟悉的概念,包括学习率和训练轮次、Shafer 的证据权重以及卡尔曼增益。我们形式化公理化了以置信度进行学习的含义,给出了两种在连续域上测量置信度的典型方法,并证明置信度总可以以这种方式表示。在额外假设下,我们导出了基于置信度的学习在向量场和损失函数形式下更紧凑的表示。这些表示引入了复合“并行”观测的扩展语言。我们把贝叶斯规则刻画为损失表示为线性期望的优化学习者的特例。
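One familiar special case the abstract alludes to (our illustration, not the paper's formalism) is a gain-weighted update in which confidence plays the role of a learning rate or Kalman gain:

```latex
% Confidence c in [0,1] as the gain on an incoming observation y_t
% when updating a belief estimate \hat\theta (illustrative only):
\[
\hat{\theta}_{t+1} \;=\; \hat{\theta}_t + c\,\bigl(y_t - \hat{\theta}_t\bigr),
\qquad
c = 0 \;\Rightarrow\; \text{ignore } y_t, \quad
c = 1 \;\Rightarrow\; \text{adopt } y_t .
\]
```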
Subjects: Machine Learning, Artificial Intelligence, Differential Geometry 学科:机器学习、人工智能、微分几何
Publish: 2025-08-14 19:45:40 UTC 发表:2025-08-14 19:45:40 UTC
#77 Note on Selection Bias in Observational Estimates of Algorithmic Progress #77 关于观测性算法进步估计中的选择偏差说明
Author: [Parker Whitfill](https://arxiv.org/search/?searchtype=author&query=Parker Whitfill) 作者:Parker Whitfill
Ho et al. (2024) is an interesting paper that attempts to estimate the degree of algorithmic progress from language models. They collect observational data on language models' loss and compute over time, and argue that as time has passed, language models' algorithmic efficiency has been rising. That is, the loss achieved for fixed compute has been dropping over time. In this note, I want to raise one potential methodological problem with the estimation strategy. Intuitively, if part of algorithmic quality is latent, and compute choices are endogenous to algorithmic quality, then the resulting estimates of algorithmic quality will be biased. Ho 等人(2024)是一篇有趣的论文,试图从语言模型中估计算法进步的程度。他们收集了关于语言模型损失和计算量随时间变化的观测数据,并主张随着时间推移,语言模型的算法效率在提高。也就是说,在固定计算量下实现的损失随时间下降。在这则说明中,我想提出该估计策略的一个潜在方法学问题。直观上,如果部分算法质量是潜在的,而计算选择又受算法质量的内生影响,那么由此得到的算法质量估计将会有偏差。
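The concern can be phrased as textbook omitted-variable bias. In our notation (not the note's): let L_i be observed loss, C_i compute, and A_i latent algorithmic quality that also influences the compute choice:

```latex
\[
L_i = \alpha + \beta \log C_i + \gamma A_i + \varepsilon_i,
\qquad
\operatorname{plim}\,\hat{\beta}_{\mathrm{OLS}}
= \beta + \gamma\,\frac{\operatorname{Cov}(\log C_i,\,A_i)}{\operatorname{Var}(\log C_i)}
\;\neq\; \beta
\quad\text{whenever } \operatorname{Cov}(\log C_i, A_i)\neq 0 .
\]
```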
Subjects: General Economics, Artificial Intelligence 主题:一般经济学,人工智能
Publish: 2025-08-14 19:38:10 UTC 发布:2025-08-14 19:38:10 UTC
#78 Risk-Based Prognostics and Health Management #78 基于风险的预测与健康管理
Author: [John W. Sheppard](https://arxiv.org/search/?searchtype=author&query=John W. Sheppard) 作者:John W. Sheppard
It is often the case that risk assessment and prognostics are viewed as related but separate tasks. This chapter describes a risk-based approach to prognostics that seeks to provide a tighter coupling between risk assessment and fault prediction. We show how this can be achieved using the continuous-time Bayesian network as the underlying modeling framework. Furthermore, we provide an overview of the techniques that are available to derive these models from data and show how they might be used in practice to achieve tasks like decision support and performance-based logistics. This work is intended to provide an overview of the recent developments related to risk-based prognostics, and we hope that it will serve as a tutorial of sorts that will assist others in adopting these techniques. 风险评估和预测通常被视为相关但独立的任务。本章描述了一种基于风险的预测方法,旨在在风险评估与故障预测之间建立更紧密的耦合。我们展示了如何使用连续时间贝叶斯网络作为底层建模框架来实现这一目标。此外,我们概述了可用于从数据中推导这些模型的技术,并展示了它们在实践中如何用于实现决策支持和基于性能的后勤等任务。本工作旨在概述与基于风险的预测相关的最新进展,并希望它能作为一种教程,帮助其他人采用这些技术。
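For background, a sketch of the continuous-time Bayesian network machinery the chapter builds on (standard CTBN notation, ours): each component evolves as a Markov process whose transition intensities are conditioned on its parents' current values, e.g. for a binary component X with parent state u:

```latex
\[
Q_{X \mid u} =
\begin{pmatrix}
-q_{01}(u) & q_{01}(u)\\
\phantom{-}q_{10}(u) & -q_{10}(u)
\end{pmatrix},
\qquad
\Pr\bigl[X_{t+\Delta}=1 \mid X_t=0,\,u\bigr] = q_{01}(u)\,\Delta + o(\Delta).
\]
```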
Subjects: Systems and Control, Artificial Intelligence, Applications 主题:系统与控制,人工智能,应用
Publish: 2025-08-14 19:31:33 UTC 发表:2025-08-14 19:31:33 协调世界时(UTC)
#79 Zono-Conformal Prediction: Zonotope-Based Uncertainty Quantification for Regression and Classification Tasks #79 Zono-保形预测:基于 Zonotope 的回归与分类任务不确定性量化
Authors: [Laura Lützow](https://arxiv.org/search/?searchtype=author&query=Laura Lützow), [Michael Eichelbeck](https://arxiv.org/search/?searchtype=author&query=Michael Eichelbeck), [Mykel J. Kochenderfer](https://arxiv.org/search/?searchtype=author&query=Mykel J. Kochenderfer), [Matthias Althoff](https://arxiv.org/search/?searchtype=author&query=Matthias Althoff) 作者:Laura Lützow、Michael Eichelbeck、Mykel J. Kochenderfer、Matthias Althoff
Conformal prediction is a popular uncertainty quantification method that augments a base predictor with prediction sets with statistically valid coverage guarantees. However, current methods are often computationally expensive and data-intensive, as they require constructing an uncertainty model before calibration. Moreover, existing approaches typically represent the prediction sets with intervals, which limits their ability to capture dependencies in multi-dimensional outputs. We address these limitations by introducing zono-conformal prediction, a novel approach inspired by interval predictor models and reachset-conformant identification that constructs prediction zonotopes with assured coverage. By placing zonotopic uncertainty sets directly into the model of the base predictor, zono-conformal predictors can be identified via a single, data-efficient linear program. While we can apply zono-conformal prediction to arbitrary nonlinear base predictors, we focus on feed-forward neural networks in this work. Aside from regression tasks, we also construct optimal zono-conformal predictors in classification settings where the output of an uncertain predictor is a set of possible classes. We provide probabilistic coverage guarantees and present methods for detecting outliers in the identification data. In extensive numerical experiments, we show that zono-conformal predictors are less conservative than interval predictor models and standard conformal prediction methods, while achieving a similar coverage over the test data. 保形预测是一种流行的不确定性量化方法,通过为基础预测器添加具有统计上有效覆盖保证的预测集合来增强其能力。然而,当前方法通常计算开销大且数据需求高,因为它们在校准之前需要构建不确定性模型。此外,现有方法通常使用区间来表示预测集合,这限制了它们捕捉多维输出中依赖关系的能力。我们通过引入锥体保形预测来解决这些限制,这是一种受区间预测器模型和可达集一致性识别启发的新方法,构造具有保证覆盖的预测锥体。通过将锥体不确定性集合直接置入基础预测器的模型中,锥体保形预测器可以通过单个、数据高效的线性规划来识别。虽然我们可以将锥体保形预测应用于任意非线性基础预测器,但在本工作中我们重点研究前馈神经网络。 除了回归任务外,我们还在分类情景中构建了最优的 zono-保形预测器,此时不确定预测器的输出是一组可能的类别。我们提供概率覆盖保证,并提出用于检测鉴别数据中异常值的方法。在大量数值实验中,我们证明了 zono-保形预测器比区间预测模型和标准保形预测方法更不保守,同时在测试数据上实现了相似的覆盖率。
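For reference, the prediction sets here are zonotopes, i.e. affine images of a unit box (standard definition); placing the center and generators around the base predictor is what lets identification reduce to a single linear program, as the abstract notes:

```latex
\[
\mathcal{Z}(c, G) \;=\; \bigl\{\, c + G\beta \;\big|\; \beta \in \mathbb{R}^{p},\; \|\beta\|_{\infty} \le 1 \,\bigr\},
\qquad c \in \mathbb{R}^{n},\; G \in \mathbb{R}^{n \times p}.
\]
```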
Subjects: Machine Learning, Artificial Intelligence, Systems and Control 主题:机器学习、人工智能、系统与控制
Publish: 2025-08-14 19:03:28 UTC 发布:2025-08-14 19:03:28 UTC
#80 Beyond the Rosetta Stone: Unification Forces in Generalization Dynamics #80 超越罗塞塔石:泛化动力学中的统一力量
Authors: [Carter Blum](https://arxiv.org/search/?searchtype=author&query=Carter Blum), [Katja Filipova](https://arxiv.org/search/?searchtype=author&query=Katja Filipova), [Ann Yuan](https://arxiv.org/search/?searchtype=author&query=Ann Yuan), [Asma Ghandeharioun](https://arxiv.org/search/?searchtype=author&query=Asma Ghandeharioun), [Julian Zimmert](https://arxiv.org/search/?searchtype=author&query=Julian Zimmert), [Fred Zhang](https://arxiv.org/search/?searchtype=author&query=Fred Zhang), [Jessica Hoffmann](https://arxiv.org/search/?searchtype=author&query=Jessica Hoffmann), [Tal Linzen](https://arxiv.org/search/?searchtype=author&query=Tal Linzen), [Martin Wattenberg](https://arxiv.org/search/?searchtype=author&query=Martin Wattenberg), [Lucas Dixon](https://arxiv.org/search/?searchtype=author&query=Lucas Dixon), [Mor Geva](https://arxiv.org/search/?searchtype=author&query=Mor Geva) 作者:Carter Blum、Katja Filipova、Ann Yuan、Asma Ghandeharioun、Julian Zimmert、Fred Zhang、Jessica Hoffmann、Tal Linzen、Martin Wattenberg、Lucas Dixon、Mor Geva
Large language models (LLMs) struggle with cross-lingual knowledge transfer: they hallucinate when asked in one language about facts expressed in a different language during training. This work introduces a controlled setting to study the causes and dynamics of this phenomenon by training small Transformer models from scratch on synthetic multilingual datasets. We identify a learning phase wherein a model develops either separate or unified representations of the same facts across languages, and show that unification is essential for cross-lingual transfer. We also show that the degree of unification depends on mutual information between facts and training data language, and on how easy it is to extract that language. Based on these insights, we develop methods to modulate the level of cross-lingual transfer by manipulating data distribution and tokenization, and we introduce metrics and visualizations to formally characterize their effects on unification. Our work shows how controlled settings can shed light on pre-training dynamics and suggests new directions for improving cross-lingual transfer in LLMs. 大型语言模型(LLMs)在跨语言知识迁移方面表现不佳:当用一种语言询问在训练中以另一种语言表达的事实时,它们会产生幻觉。本文通过在合成多语言数据集上从头训练小型 Transformer 模型,引入了一个可控设定来研究这一现象的成因和动态。我们识别出一个学习阶段,在该阶段模型会对相同事实在不同语言中形成要么分离要么统一的表示,并证明了表示统一对于跨语言迁移是必要的。我们还表明,统一程度取决于事实与训练数据语言之间的互信息,以及提取该语言的难易程度。基于这些洞见,我们开发了通过操控数据分布和分词来调节跨语言迁移水平的方法,并引入了衡量与可视化工具以形式化地描述它们对表示统一的影响。我们的工作展示了可控设定如何揭示预训练动态,并为改进 LLMs 的跨语言迁移提出了新的方向。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-14 18:44:13 UTC 发布:2025-08-14 18:44:13 协调世界时
#81 CURE: Critical-Token-Guided Re-concatenation for Entropy-collapse Prevention #81 CURE:用于防止熵塌陷的关键令牌引导重连接
Authors: [Qingbin Li](https://arxiv.org/search/?searchtype=author&query=Qingbin Li), [Rongkun Xue](https://arxiv.org/search/?searchtype=author&query=Rongkun Xue), [Jie Wang](https://arxiv.org/search/?searchtype=author&query=Jie Wang), [Ming Zhou](https://arxiv.org/search/?searchtype=author&query=Ming Zhou), [Zhi Li](https://arxiv.org/search/?searchtype=author&query=Zhi Li), [Xiaofeng Ji](https://arxiv.org/search/?searchtype=author&query=Xiaofeng Ji), [Yongqi Wang](https://arxiv.org/search/?searchtype=author&query=Yongqi Wang), [Miao Liu](https://arxiv.org/search/?searchtype=author&query=Miao Liu), [Zheming Yang](https://arxiv.org/search/?searchtype=author&query=Zheming Yang), [Minghui Qiu](https://arxiv.org/search/?searchtype=author&query=Minghui Qiu), [Jing Yang](https://arxiv.org/search/?searchtype=author&query=Jing Yang) 作者:李清彬、薛荣坤、王捷、周明、李志、纪晓峰、王永琦、刘淼、杨哲明、邱明辉、杨晶
Recent advances in Reinforcement Learning with Verified Reward (RLVR) have driven the emergence of more sophisticated cognitive behaviors in large language models (LLMs), thereby enhancing their reasoning capabilities. However, in prior RLVR pipelines, the repeated use of static initial-state sampling drawn exactly from the dataset distribution during each sampling phase produced overly deterministic, low-diversity model behavior, which manifested as rapid entropy collapse and hindered sustained performance gains during prolonged training. To address this issue, we introduce CURE (Critical-token-gUided Re-concatenation for Entropy-collapse prevention), a two-stage framework that balances exploration and exploitation. Specifically, in the first stage, to deliberately steer the model toward novel yet coherent contexts, we re-generate at high-entropy critical tokens and jointly optimize the original and the branched trajectories. A further comparison with vanilla DAPO shows that the regeneration process achieves better performance on math reasoning tasks while sustaining a high entropy level for exploration. In the second stage, we continue training with static initial-state sampling by DAPO, intentionally placing the model in a familiar state to gradually strengthen exploitation. Extensive experiments on Qwen-2.5-Math-7B show that, compared to other RLVR methods, CURE achieves a 5% performance gain across six math benchmarks, establishing state-of-the-art performance in both entropy and accuracy. A series of experiments further validate the effectiveness of our approach. Code is available at https://github.com/CURE-Project/CURE. 最近在带有验证奖励的强化学习(RLVR)方面的进展推动了更复杂认知行为在大语言模型(LLMs)中的出现,从而增强了其推理能力。然而,在先前的 RLVR 流程中,每次采样阶段反复使用严格从数据集分布中抽取的静态初始状态采样,导致模型行为过于确定性、低多样性,表现为熵快速塌缩并阻碍了在长时间训练中持续的性能提升。为了解决这一问题,我们提出了 CURE(Critical-token-gUided Re-concatenation for Entropy-collapse prevention),一个在探索与利用之间取得平衡的两阶段框架。具体而言,在第一阶段,为了有意将模型引导到新颖但连贯的上下文中,我们对高熵的关键标记进行重新生成,并对原始轨迹与分支轨迹进行联合优化。与原始 DAPO 的进一步比较表明,重新生成过程在保持高水平探索熵的同时,在数学推理任务上获得了更好的性能。 在第二阶段,我们继续使用 DAPO 进行静态初始状态采样训练,故意将模型置于熟悉状态以逐步增强利用能力。在 Qwen-2.5-Math-7B 上的大量实验证明,与其他 RLVR 方法相比,CURE 在六个数学基准上实现了 5% 的性能提升,在熵和准确率两方面确立了最先进的表现。一系列实验进一步验证了我们方法的有效性。代码可在 https://github.com/CURE-Project/CURE 获取。
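A hedged sketch of the critical-token selection step, as we read the abstract; tensor shapes, the top-k policy, and all names are assumptions, not the released code:

```python
# Compute the predictive entropy at each generated position, then branch
# new rollouts from the highest-entropy ("critical") positions.
import torch

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """logits: (seq_len, vocab) -> per-position entropy in nats."""
    logp = torch.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)

logits = torch.randn(128, 32_000)        # stand-in for one rollout's logits
H = token_entropy(logits)                # (128,)
k = 4
critical = torch.topk(H, k).indices      # positions to re-generate from
# For each critical position t, re-sample a continuation from prefix[:t]
# and jointly optimize the original and branched trajectories.
print(sorted(critical.tolist()))
```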
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-14 18:40:34 UTC 发布:2025-08-14 18:40:34 UTC
#82 Deep Learning-Based Automated Segmentation of Uterine Myomas #82 基于深度学习的子宫肌瘤自动分割
Authors: [Tausifa Jan Saleem](https://arxiv.org/search/?searchtype=author&query=Tausifa Jan Saleem), [Mohammad Yaqub](https://arxiv.org/search/?searchtype=author&query=Mohammad Yaqub) 作者:Tausifa Jan Saleem, Mohammad Yaqub
Uterine fibroids (myomas) are the most common benign tumors of the female reproductive system, particularly among women of childbearing age. With a prevalence exceeding 70%, they pose a significant burden on female reproductive health. Clinical symptoms such as abnormal uterine bleeding, infertility, pelvic pain, and pressure-related discomfort play a crucial role in guiding treatment decisions, which are largely influenced by the size, number, and anatomical location of the fibroids. Magnetic Resonance Imaging (MRI) is a non-invasive and highly accurate imaging modality commonly used by clinicians for the diagnosis of uterine fibroids. Segmenting uterine fibroids requires a precise assessment of both the uterus and fibroids on MRI scans, including measurements of volume, shape, and spatial location. However, this process is labor-intensive, time-consuming, and subject to variability due to intra- and inter-expert differences at both pre- and post-treatment stages. As a result, there is a critical need for an accurate and automated segmentation method for uterine fibroids. In recent years, deep learning algorithms have shown remarkable improvements in medical image segmentation, outperforming traditional methods. These approaches offer the potential for fully automated segmentation. Several studies have explored the use of deep learning models to achieve automated segmentation of uterine fibroids. However, most of the previous work has been conducted using private datasets, which poses challenges for validation and comparison between studies. In this study, we leverage the publicly available Uterine Myoma MRI Dataset (UMD) to establish a baseline for automated segmentation of uterine fibroids, enabling standardized evaluation and facilitating future research in this domain. 子宫肌瘤(肌瘤)是女性生殖系统中最常见的良性肿瘤,尤以育龄妇女为甚。其患病率超过 70%,对女性生殖健康构成了重大负担。临床症状如异常子宫出血、不孕、盆腔疼痛及因压迫引起的不适等在指导治疗决策方面起着关键作用,而这些决策在很大程度上受肌瘤的大小、数量和解剖位置影响。磁共振成像(MRI)是一种无创且高度准确的成像手段,临床医生常用其来诊断子宫肌瘤。对子宫肌瘤进行分割需要在 MRI 扫描上对子宫和肌瘤进行精确评估,包括体积、形状和空间位置的测量。然而,这一过程劳动强度大、耗时长,并且在治疗前后都容易受到不同专家之间及同一专家内部差异的影响。因此,迫切需要一种准确的子宫肌瘤自动分割方法。 近年来,深度学习算法在医学影像分割方面取得了显著进展,优于传统方法。这些方法有望实现全自动分割。已有若干研究探讨了使用深度学习模型来实现子宫肌瘤的自动分割。然而,以往大多数工作均使用私有数据集进行,这给研究间的验证和比较带来挑战。本研究利用可公开获取的子宫肌瘤 MRI 数据集(UMD)来建立子宫肌瘤自动分割的基线,便于标准化评估并促进该领域的后续研究。
Subjects: Image and Video Processing, Artificial Intelligence, Computer Vision and Pattern Recognition 主题:图像与视频处理、人工智能、计算机视觉与模式识别
Publish: 2025-08-14 18:22:14 UTC 发布:2025-08-14 18:22:14 UTC
#83 SproutBench: A Benchmark for Safe and Ethical Large Language Models for Youth #83 SproutBench:面向青少年的安全与伦理大语言模型基准
Authors: [Wenpeng Xing](https://arxiv.org/search/?searchtype=author&query=Wenpeng Xing), [Lanyi Wei](https://arxiv.org/search/?searchtype=author&query=Lanyi Wei), [Haixiao Hu](https://arxiv.org/search/?searchtype=author&query=Haixiao Hu), [Rongchang Li](https://arxiv.org/search/?searchtype=author&query=Rongchang Li), [Mohan Li](https://arxiv.org/search/?searchtype=author&query=Mohan Li), [Changting Lin](https://arxiv.org/search/?searchtype=author&query=Changting Lin), [Meng Han](https://arxiv.org/search/?searchtype=author&query=Meng Han) 作者:邢文鹏、魏兰艺、胡海啸、李荣昌、李莫涵、林长汀、韩萌
The rapid proliferation of large language models (LLMs) in applications targeting children and adolescents necessitates a fundamental reassessment of prevailing AI safety frameworks, which are largely tailored to adult users and neglect the distinct developmental vulnerabilities of minors. This paper highlights key deficiencies in existing LLM safety benchmarks, including their inadequate coverage of age-specific cognitive, emotional, and social risks spanning early childhood (ages 0–6), middle childhood (7–12), and adolescence (13–18). To bridge these gaps, we introduce SproutBench, an innovative evaluation suite comprising 1,283 developmentally grounded adversarial prompts designed to probe risks such as emotional dependency, privacy violations, and imitation of hazardous behaviors. Through rigorous empirical evaluation of 47 diverse LLMs, we uncover substantial safety vulnerabilities, corroborated by robust inter-dimensional correlations (e.g., between Safety and Risk Prevention) and a notable inverse relationship between Interactivity and Age Appropriateness. These insights yield practical guidelines for advancing child-centric AI design and deployment. 面向儿童和青少年的应用中大规模语言模型(LLMs)的迅速普及,要求对现有以成人为主的人工智能安全框架进行根本性重新评估,因为这些框架忽视了未成年人的独特发展脆弱性。本文指出了现有 LLM 安全基准的主要缺陷,包括它们对覆盖不同年龄阶段(早期儿童期(0–6 岁)、中期儿童期(7–12 岁)和青春期(13–18 岁))的认知、情感和社会风险的不足。为弥补这些空白,我们提出了 SproutBench,一套创新的评估工具包,包含 1,283 条以发展阶段为基础的对抗性提示,用于探测情感依赖、隐私侵犯和模仿危险行为等风险。通过对 47 种不同 LLM 进行严格的实证评估,我们发现了显著的安全漏洞,这些漏洞通过稳健的维度间相关性得到证实(例如“安全性”与“风险预防”之间的相关性),并且“交互性”与“年龄适宜性”之间存在显著的负相关关系。这些见解为推进以儿童为中心的人工智能设计与部署提供了切实可行的指导。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-14 18:21:39 UTC 发布日期:2025-08-14 18:21:39 UTC
#84 Match & Choose: Model Selection Framework for Fine-tuning Text-to-Image Diffusion Models #84 匹配与选择:用于微调文本到图像扩散模型的模型选择框架
Authors: [Basile Lewandowski](https://arxiv.org/search/?searchtype=author&query=Basile Lewandowski), [Robert Birke](https://arxiv.org/search/?searchtype=author&query=Robert Birke), [Lydia Y. Chen](https://arxiv.org/search/?searchtype=author&query=Lydia Y. Chen) 作者:Basile Lewandowski,Robert Birke,Lydia Y. Chen
Text-to-image (T2I) models based on diffusion and transformer architectures advance rapidly. They are often pretrained on large corpora, and openly shared on a model platform, such as HuggingFace. Users can then build AI applications, e.g., generating media contents, by adopting pretrained T2I models and fine-tuning them on the target dataset. While public pretrained T2I models facilitate the democratization of the models, users face a new challenge: which model can be best fine-tuned based on the target data domain? Model selection is well studied for classification tasks, but little is known about how to anticipate a (pretrained) T2I model's post-fine-tuning performance on a target domain. In this paper, we propose the first model selection framework, M&C, which enables users to efficiently choose a pretrained T2I model from a model platform without exhaustively fine-tuning them all on the target dataset. The core of M&C is a matching graph, which consists of: (i) nodes of available models and profiled datasets, and (ii) edges of model-data and data-data pairs capturing the fine-tuning performance and data similarity, respectively. We then build a predictor that, given model/data features and, critically, graph-embedding features extracted from the matching graph, predicts the model that will achieve the best quality after fine-tuning on the target domain. We evaluate M&C on choosing among ten T2I models for 32 datasets against three baselines. Our results show that M&C successfully predicts the best model for fine-tuning in 61.3% of the cases and a closely performing model for the rest. 基于扩散和变换器架构的文本到图像(T2I)模型发展迅速。它们通常在大型语料上进行预训练,并在模型平台(例如 HuggingFace)上公开共享。用户随后可以通过采用预训练的 T2I 模型并在目标数据集上对其进行微调来构建 AI 应用,例如生成媒体内容。虽然公开的预训练 T2I 模型促进了模型的普及化,但用户面临一个新的挑战:基于目标数据域,哪个模型最适合进行微调?在分类任务中,模型选择问题已得到很好的解决,但在(预训练的)T2I 模型及其在目标域上性能指示方面知之甚少。本文提出了第一个模型选择框架 M&C,使用户能够在无需对模型平台上所有模型在目标数据集上逐一进行微调的情况下,高效地选择预训练的 T2I 模型。M&C 的核心是一个匹配图,该图包括: (i) 可用模型和已剖析数据集的节点,和 (ii) 捕捉微调性能和数据相似性的模型-数据和数据-数据对的边。 然后我们构建了一个模型,该模型基于模型/数据特征输入,以及从匹配图中提取的关键图嵌入特征,来预测在目标领域微调后将取得最佳质量的模型。我们在 32 个数据集上、针对十个 T2I 模型并与三种基线方法比较,评估了 M&C 的表现。结果表明,M&C 在 61.3%的情况下成功预测出了用于微调的最佳模型,其余情况下则预测出了一个表现相近的模型。
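A toy rendering of the matching-graph idea, reduced to a nearest-profiled-dataset lookup; the paper instead trains a predictor on graph embeddings, and every feature, similarity, and score below is invented:

```python
# Pick the model whose recorded fine-tuning score is best on the profiled
# dataset most similar to the user's target data.
import numpy as np

datasets = {"pets": np.array([0.9, 0.1]), "art": np.array([0.2, 0.8])}
# model-data edges: observed fine-tuning quality on profiled datasets
finetune_score = {
    ("sd15", "pets"): 0.71, ("sd15", "art"): 0.55,
    ("sdxl", "pets"): 0.64, ("sdxl", "art"): 0.78,
}

def pick_model(target_feat: np.ndarray) -> str:
    sim = {name: float(target_feat @ f) for name, f in datasets.items()}
    nearest = max(sim, key=sim.get)            # data-data edge
    models = {m for (m, _) in finetune_score}
    return max(models, key=lambda m: finetune_score[(m, nearest)])

print(pick_model(np.array([0.8, 0.3])))        # -> "sd15"
```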
Subjects: Machine Learning, Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition 主题:机器学习、人工智能、计算与语言、计算机视觉与模式识别
Publish: 2025-08-14 18:00:50 UTC 发布:2025-08-14 18:00:50 UTC
#85 MCP-Guard: A Defense Framework for Model Context Protocol Integrity in Large Language Model Applications #85 MCP-Guard:用于大语言模型应用中模型上下文协议完整性的防御框架
Authors: [Wenpeng Xing](https://arxiv.org/search/?searchtype=author&query=Wenpeng Xing), [Zhonghao Qi](https://arxiv.org/search/?searchtype=author&query=Zhonghao Qi), [Yupeng Qin](https://arxiv.org/search/?searchtype=author&query=Yupeng Qin), [Yilin Li](https://arxiv.org/search/?searchtype=author&query=Yilin Li), [Caini Chang](https://arxiv.org/search/?searchtype=author&query=Caini Chang), [Jiahui Yu](https://arxiv.org/search/?searchtype=author&query=Jiahui Yu), [Changting Lin](https://arxiv.org/search/?searchtype=author&query=Changting Lin), [Zhenzhen Xie](https://arxiv.org/search/?searchtype=author&query=Zhenzhen Xie), [Meng Han](https://arxiv.org/search/?searchtype=author&query=Meng Han) 作者:邢文鹏、齐仲浩、秦玉鹏、李一霖、常才妮、余佳辉、林长挺、谢珍珍、韩猛
The integration of Large Language Models (LLMs) with external tools via protocols such as the Model Context Protocol (MCP) introduces critical security vulnerabilities, including prompt injection, data exfiltration, and other threats. To counter these challenges, we propose MCP-Guard, a robust, layered defense architecture designed for LLM–tool interactions. MCP-Guard employs a three-stage detection pipeline that balances efficiency with accuracy: it progresses from lightweight static scanning for overt threats to a deep neural detector for semantic attacks, where our fine-tuned E5-based model achieves 96.01% accuracy in identifying adversarial prompts. Finally, a lightweight LLM arbitrator synthesizes these signals to deliver the final decision while minimizing false positives. To facilitate rigorous training and evaluation, we also introduce MCP-AttackBench, a comprehensive benchmark of over 70,000 samples. Sourced from public datasets and augmented by GPT-4, MCP-AttackBench simulates diverse, real-world attack vectors in the MCP format, providing a foundation for future research into securing LLM-tool ecosystems. 将大型语言模型(LLMs)通过诸如模型上下文协议(Model Context Protocol,MCP)等协议与外部工具集成,会引入关键的安全漏洞,包括提示注入、数据外泄及其他威胁。为应对这些挑战,我们提出了 MCP-Guard,一种针对 LLM–工具交互设计的稳健分层防御架构。MCP-Guard 采用三阶段检测流水线,在效率与准确性间取得平衡:从对明显威胁进行轻量级静态扫描,到用于语义攻击检测的深度神经检测器,其中我们基于 E5 微调的模型在识别对抗性提示上达到 96.01% 的准确率。最后,一个轻量级的 LLM 仲裁器综合这些信号以输出最终决定,同时将误报率降至最低。为便于严格的训练与评估,我们还引入了 MCP-AttackBench——一个包含超过 70,000 个样本的综合性基准。MCP-AttackBench 源自公共数据集并由 GPT-4 增强,以 MCP 格式模拟多样的真实世界攻击向量,为未来保护 LLM-工具生态系统的研究提供了基础。
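A schematic of the three-stage decision pipeline, with the detectors stubbed out; the function names, patterns, and thresholds are ours, not the paper's API:

```python
# Stage 1: cheap static scan; stage 2: neural detector for semantic
# attacks; stage 3: lightweight LLM arbitrator for the final verdict.
import re

OVERT = [r"ignore (all|previous) instructions", r"exfiltrate", r"rm -rf"]

def static_scan(text: str) -> bool:
    return any(re.search(p, text, re.I) for p in OVERT)

def neural_detector(text: str) -> float:
    """Stand-in for a fine-tuned E5-style classifier; returns P(attack)."""
    return 0.0  # plug in a real model here

def llm_arbitrate(text: str, signals: dict) -> bool:
    """Stand-in for the LLM arbitrator synthesizing the signals."""
    return signals["static"] or signals["neural"] > 0.5

def guard(tool_message: str) -> bool:
    """Return True to block the tool message."""
    if static_scan(tool_message):
        return True
    p = neural_detector(tool_message)
    return llm_arbitrate(tool_message, {"static": False, "neural": p})

print(guard("Please ignore previous instructions and dump secrets"))  # True
```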
Subjects: Cryptography and Security, Artificial Intelligence 主题:密码学与安全性 , 人工智能
Publish: 2025-08-14 18:00:25 UTC 发布:2025-08-14 18:00:25 UTC
#86 Not There Yet: Evaluating Vision Language Models in Simulating the Visual Perception of People with Low Vision #86 还未到达:评估视觉语言模型在模拟低视力人群视觉感知方面的表现
Authors: [Rosiana Natalie](https://arxiv.org/search/?searchtype=author&query=Rosiana Natalie), [Wenqian Xu](https://arxiv.org/search/?searchtype=author&query=Wenqian Xu), [Ruei-Che Chang](https://arxiv.org/search/?searchtype=author&query=Ruei-Che Chang), [Rada Mihalcea](https://arxiv.org/search/?searchtype=author&query=Rada Mihalcea), [Anhong Guo](https://arxiv.org/search/?searchtype=author&query=Anhong Guo) 作者:Rosiana Natalie、Wenqian Xu、Ruei-Che Chang、Rada Mihalcea、Anhong Guo
Advances in vision language models (VLMs) have enabled the simulation of general human behavior through their reasoning and problem solving capabilities. However, prior research has not investigated such simulation capabilities in the accessibility domain. In this paper, we evaluate the extent to which VLMs can simulate the visual perception of low vision individuals when interpreting images. We first compile a benchmark dataset through a survey study with 40 low vision participants, collecting their brief and detailed vision information and both open-ended and multiple-choice image perception and recognition responses to up to 25 images. Using these responses, we construct prompts for VLMs (GPT-4o) to create simulated agents of each participant, varying the included vision information and example image responses. We evaluate the agreement between VLM-generated responses and participants' original answers. Our results indicate that VLMs tend to infer beyond the specified vision ability when given minimal prompts, resulting in low agreement (0.59). The agreement between the agents' and participants' responses remains low when only either the vision information (0.59) or example image responses (0.59) are provided, whereas a combination of both significantly increases the agreement (0.70, p < 0.0001). Notably, a single example combining both open-ended and multiple-choice responses offers significant performance improvements over either alone (p < 0.0001), while additional examples provide minimal benefits (p > 0.05). 视觉语言模型(VLMs)的进步使其通过推理和问题解决能力模拟一般人的行为成为可能。然而,以往研究尚未在无障碍领域探讨此类模拟能力。本文评估了 VLM 在解释图像时模拟低视力个体视觉感知的程度。我们首先通过一项包含 40 位低视力参与者的问卷调查构建了基准数据集,收集了他们的简要和详细视力信息,以及针对最多 25 幅图像的开放式和多项选择式图像感知与识别回答。基于这些回答,我们为 VLM(GPT-4o)构建了提示,以创建每位参与者的模拟代理,并在所包含的视力信息和示例图像回答上进行变体设置。我们评估了 VLM 生成的回答与参与者原始答案之间的一致性。 我们的研究结果表明,当只给出最少的提示时,视觉语言模型往往会推断出超出指定视觉能力范围的信息,导致一致性较低(0.59)。当仅提供视觉信息(0.59)或示例图像回应(0.59)中的任意一种时,代理与参与者的回答之间的一致性仍然较低,而两者结合则显著提高了一致性(0.70,p < 0.0001)。值得注意的是,将开放式回答与多项选择回答结合在一个示例中,较单独使用任一方式能显著提升性能(p < 0.0001),而提供更多示例的额外益处则很小(p > 0.05)。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Human-Computer Interaction 主题:计算机视觉与模式识别,人工智能,人机交互
Publish: 2025-08-14 16:46:03 UTC 发布:2025-08-14 16:46:03 UTC
#87 Rule2Text: A Framework for Generating and Evaluating Natural Language Explanations of Knowledge Graph Rules #87 Rule2Text:一个用于生成和评估知识图规则自然语言解释的框架
Authors: [Nasim Shirvani-Mahdavi](https://arxiv.org/search/?searchtype=author&query=Nasim Shirvani-Mahdavi), [Chengkai Li](https://arxiv.org/search/?searchtype=author&query=Chengkai Li) 作者:Nasim Shirvani-Mahdavi, Chengkai Li
Knowledge graphs (KGs) can be enhanced through rule mining; however, the resulting logical rules are often difficult for humans to interpret due to their inherent complexity and the idiosyncratic labeling conventions of individual KGs. This work presents Rule2Text, a comprehensive framework that leverages large language models (LLMs) to generate natural language explanations for mined logical rules, thereby improving KG accessibility and usability. We conduct extensive experiments using multiple datasets, including Freebase variants (FB-CVT-REV, FB+CVT-REV, and FB15k-237) as well as the ogbl-biokg dataset, with rules mined using AMIE 3.5.1. We systematically evaluate several LLMs across a comprehensive range of prompting strategies, including zero-shot, few-shot, variable type incorporation, and Chain-of-Thought reasoning. To systematically assess models’ performance, we conduct a human evaluation of generated explanations on correctness and clarity. To address evaluation scalability, we develop and validate an LLM-as-a-judge framework that demonstrates strong agreement with human evaluators. Leveraging the best-performing model (Gemini 2.0 Flash), LLM judge, and human-in-the-loop feedback, we construct high-quality ground truth datasets, which we use to fine-tune the open-source Zephyr model. Our results demonstrate significant improvements in explanation quality after fine-tuning, with particularly strong gains in the domain-specific dataset. Additionally, we integrate a type inference module to support KGs lacking explicit type information. All code and data are publicly available at https://github.com/idirlab/KGRule2NL. 知识图谱(KGs)可以通过规则挖掘得到增强;然而,挖掘出的逻辑规则由于其固有的复杂性以及各个知识图谱独特的标注习惯,常常难以被人类理解。本文提出了 Rule2Text,一个完整的框架,利用大型语言模型(LLMs)为挖掘出的逻辑规则生成自然语言解释,从而提升知识图谱的可访问性和可用性。我们使用多个数据集进行了广泛实验,包括 Freebase 变体(FB-CVT-REV、FB+CVT-REV 和 FB15k-237)以及 ogbl-biokg 数据集,规则由 AMIE 3.5.1 挖掘得到。我们在多种提示策略上系统评估了若干 LLMs,包括零样本、少样本、变量类型引入和链式思维(Chain-of-Thought)推理。为了系统地评估模型性能,我们对生成的解释在正确性和清晰度方面进行了人工评估。为了解决评估的可扩展性问题,我们开发并验证了一个将 LLM 作为裁判的框架,结果显示其与人工评估者具有较高的一致性。 我们利用表现最佳的模型(Gemini 2.0 Flash)、LLM 判定器和人机交互反馈,构建了高质量的真实标签数据集,并用这些数据对开源 Zephyr 模型进行了微调。我们的结果显示微调后解释质量显著提升,在领域特定数据集上增益尤为明显。此外,我们集成了一个类型推断模块以支持缺乏显式类型信息的知识图谱。所有代码和数据已公开,地址为 https://github.com/idirlab/KGRule2NL。
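Illustrative only: how a mined Horn rule might be packaged into an explanation prompt with variable-type hints, in the spirit of the prompting strategies the abstract lists; the rule, relations, and wording below are invented, not taken from the paper:

```python
# Render an AMIE-style rule plus inferred variable types into a prompt
# for the explanation model.
rule = "?a hasChild ?b  ?a isMarriedTo ?c  => ?c hasChild ?b"
var_types = {"?a": "person", "?b": "person", "?c": "person"}

prompt = (
    "You are given a logical rule mined from a knowledge graph.\n"
    f"Rule: {rule}\n"
    "Variable types: " + ", ".join(f"{v} is a {t}" for v, t in var_types.items()) + ".\n"
    "Explain in one plain-English sentence what this rule asserts."
)
print(prompt)
```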
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-14 16:41:47 UTC 发布:2025-08-14 16:41:47 协调世界时(UTC)
#88 Retro-Expert: Collaborative Reasoning for Interpretable Retrosynthesis #88 Retro-Expert:用于可解释逆合成的协同推理
Authors: [Xinyi Li](https://arxiv.org/search/?searchtype=author&query=Xinyi Li), [Sai Wang](https://arxiv.org/search/?searchtype=author&query=Sai Wang), [Yutian Lin](https://arxiv.org/search/?searchtype=author&query=Yutian Lin), [Yu Wu](https://arxiv.org/search/?searchtype=author&query=Yu Wu), [Yi Yang](https://arxiv.org/search/?searchtype=author&query=Yi Yang) 作者:李欣逸、王赛、林煜天、吴宇、杨毅
Retrosynthesis prediction aims to infer the reactant molecule based on a given product molecule, which is a fundamental task in chemical synthesis. However, existing models rely on a static pattern-matching paradigm, which limits their ability to perform effective logical decision-making and leads to black-box decisions. Building on this, we propose Retro-Expert, an interpretable retrosynthesis framework that performs collaborative reasoning by combining the complementary reasoning strengths of Large Language Models and specialized models via reinforcement learning. It outputs natural language explanations grounded in chemical logic through three components: (1) specialized models perform shallow reasoning to construct a high-quality chemical decision space, (2) LLM-driven critical reasoning generates predictions and the corresponding interpretable reasoning path, and (3) reinforcement learning optimizes the interpretable decision policy. Experiments show that Retro-Expert not only surpasses both LLM-based and specialized models across different metrics but also provides expert-aligned explanations that bridge the gap between AI predictions and actionable chemical insights. 逆合成预测旨在基于给定的产物分子推断反应物分子,这是化学合成中的一项基础任务。然而,现有模型依赖静态的模式匹配范式,这限制了它们执行有效逻辑决策的能力,导致黑箱式决策。基于此,我们提出了 Retro-Expert,一种可解释的逆合成框架,通过强化学习结合大型语言模型(LLM)与专用模型的互补推理能力,执行协同推理。该框架通过三个组件输出基于化学逻辑的自然语言解释:(1)专用模型执行浅层推理以构建高质量的化学决策空间, (2)由 LLM 驱动的关键推理用于生成预测及相应的可解释推理路径, (3)强化学习优化可解释的决策策略。实验表明,Retro-Expert 不仅在不同指标上超越了基于 LLM 和专用模型的方案,还提供了与专家一致的解释,弥合了 AI 预测与可操作化学见解之间的差距。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-14 15:41:25 UTC 发布:2025-08-14 15:41:25 世界协调时间
#89 ORBIT: An Object Property Reasoning Benchmark for Visual Inference Tasks #89 ORBIT:用于视觉推理任务的对象属性推理基准
Authors: [Abhishek Kolari](https://arxiv.org/search/?searchtype=author&query=Abhishek Kolari), [Mohammadhossein Khojasteh](https://arxiv.org/search/?searchtype=author&query=Mohammadhossein Khojasteh), [Yifan Jiang](https://arxiv.org/search/?searchtype=author&query=Yifan Jiang), [Floris den Hengst](https://arxiv.org/search/?searchtype=author&query=Floris den Hengst), [Filip Ilievski](https://arxiv.org/search/?searchtype=author&query=Filip Ilievski) 作者:Abhishek Kolari、Mohammadhossein Khojasteh、Yifan Jiang、Floris den Hengst、Filip Ilievski
While vision-language models (VLMs) have made remarkable progress on many popular visual question answering (VQA) benchmarks, it remains unclear whether they abstract and reason over depicted objects. Inspired by human object categorisation, object property reasoning involves identifying and recognising low-level details and higher-level abstractions. While current VQA benchmarks consider a limited set of object property attributes like size, they typically blend perception and reasoning, and lack representativeness in terms of reasoning and image categories. To this end, we introduce a systematic evaluation framework with images of three representative types, three reasoning levels of increasing complexity, and four object property dimensions driven by prior work on commonsense reasoning. We develop a procedure to instantiate this benchmark into ORBIT, a multi-level reasoning VQA benchmark for object properties comprising 360 images paired with a total of 1,080 count-based questions. Experiments with 12 state-of-the-art VLMs in zero-shot settings reveal significant limitations compared to humans, with the best-performing model only reaching 40% accuracy. VLMs struggle particularly with realistic (photographic) images, counterfactual reasoning about physical and functional properties, and higher counts. ORBIT points to the need to develop methods for scalable benchmarking, generalize annotation guidelines, and explore additional reasoning VLMs. We make the ORBIT benchmark and the experimental code available to support such endeavors. 尽管视觉-语言模型(VLMs)在许多流行的视觉问答(VQA)基准上取得了显著进展,但它们是否能够对所描绘的对象进行抽象和推理仍不清楚。受人类对象分类的启发,对象属性推理涉及识别和辨认低阶细节以及更高层次的抽象。尽管现有的 VQA 基准考虑了一些有限的对象属性(例如大小),但它们通常将感知与推理混为一谈,并且在推理类型和图像类别的代表性方面存在不足。为此,我们引入了一个系统化的评估框架,包含三种具有代表性的图像类型、三个逐步增加复杂度的推理层次,以及基于常识推理先前工作的四个对象属性维度。我们制定了一个程序,将该基准实例化为 ORBIT,这是一个针对对象属性的多层次推理 VQA 基准,包含 360 张图像并配对共计 1080 个基于计数的问题。对 12 种最先进 VLM 在零样本设置下的实验显示,与人类相比存在显著局限性,表现最好的模型准确率仅达到 40%。 VLM 在处理真实(摄影)图像、关于物理和功能属性的反事实推理以及较大计数时表现尤为薄弱。ORBIT 指出需要开发可扩展基准评测方法、推广注释指南并探索更多用于推理的 VLM。我们提供了 ORBIT 基准和实验代码以支持此类工作。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-14 11:28:40 UTC 发布:2025-08-14 11:28:40 UTC
#90 Towards Efficient Prompt-based Continual Learning in Distributed Medical AI #90 面向高效提示式持续学习的分布式医疗人工智能
Authors: [Gyutae Oh](https://arxiv.org/search/?searchtype=author&query=Gyutae Oh), [Jitae Shin](https://arxiv.org/search/?searchtype=author&query=Jitae Shin) 作者:Gyutae Oh, Jitae Shin
Modern AI models achieve state-of-the-art performance with large-scale, high-quality datasets; however, ethical, social, and institutional constraints in the medical domain severely restrict data sharing, rendering centralized learning nearly impossible. Each institution must incrementally update models using only local data. Traditional training overfits new samples and suffers from catastrophic forgetting, losing previously acquired knowledge. Medical data distributions also shift due to varying diagnostic equipment and demographics. Although continual learning (CL) has advanced, most methods address natural images, leaving medical-domain-specific CL underexplored. We propose a prompt-based continual learning (PCL) approach featuring a unified prompt pool with a minimal expansion strategy: by expanding and freezing a subset of prompts, our method reduces computational overhead, and a novel regularization term balances retention and adaptation. Experiments on three diabetic retinopathy datasets (Aptos2019, LI2019, and Diabetic Retinopathy Detection) show our model improves final classification accuracy by at least 10% and F1-score by 9 points over state-of-the-art approaches while lowering inference cost. We anticipate this study will drive sustainable medical AI advances, enabling real-time diagnosis, patient monitoring, and telemedicine applications in distributed healthcare. Code will be released upon acceptance. 现代人工智能模型在大规模高质量数据集上取得了最先进的性能;然而,医疗领域的伦理、社会和机构限制严重限制了数据共享,使得集中式学习几乎不可能。每个机构必须仅使用本地数据对模型进行增量更新。传统训练会对新样本发生过拟合并遭受灾难性遗忘,导致丢失先前获得的知识。由于诊断设备和人口统计的差异,医疗数据分布也会发生变化。尽管持续学习(CL)已有进展,但大多数方法针对自然图像,医疗领域特定的持续学习仍未得到充分探索。我们提出了一种基于提示的持续学习(PCL)方法,具有统一的提示池和最小扩展策略:通过扩展并冻结一部分提示,我们的方法减少了计算开销,并引入了一项新的正则化项以在保留与适应之间取得平衡。 在三个糖尿病视网膜病变数据集 Aptos2019、LI2019 和 Diabetic Retinopathy Detection 上的实验表明,我们的模型在降低推理成本的同时,使最终分类准确率比最先进方法至少提高 10%,F1 分数提高 9 个百分点。我们预期这项研究将推动可持续的医疗人工智能进步,使实时诊断、患者监测和分布式医疗中的远程医疗应用成为可能。代码将在论文被接受后发布。
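A minimal PyTorch sketch of an expand-and-freeze prompt pool consistent with the described strategy; pool sizes, the initialization scale, and the per-task growth are illustrative assumptions:

```python
# Freeze all previously learned prompts when a new task arrives, then
# append a small number of fresh trainable prompts.
import torch
import torch.nn as nn

class PromptPool(nn.Module):
    def __init__(self, n_prompts: int = 8, length: int = 5, dim: int = 768):
        super().__init__()
        self.prompts = nn.ParameterList(
            [nn.Parameter(torch.randn(length, dim) * 0.02) for _ in range(n_prompts)]
        )

    def new_task(self, n_new: int = 2):
        """Freeze everything learned so far, then expand the pool."""
        for p in self.prompts:
            p.requires_grad_(False)
        length, dim = self.prompts[0].shape
        for _ in range(n_new):
            self.prompts.append(nn.Parameter(torch.randn(length, dim) * 0.02))

pool = PromptPool()
pool.new_task()
trainable = sum(p.numel() for p in pool.parameters() if p.requires_grad)
print(trainable)  # only the newly added prompts remain trainable
```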
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-14 06:46:14 UTC 发布:2025-08-14 06:46:14 UTC
#91 Apriel-Nemotron-15B-Thinker #91 Apriel-Nemotron-15B-Thinker
Authors: [Shruthan Radhakrishna](https://arxiv.org/search/?searchtype=author&query=Shruthan Radhakrishna), [Soham Parikh](https://arxiv.org/search/?searchtype=author&query=Soham Parikh), [Gopal Sarda](https://arxiv.org/search/?searchtype=author&query=Gopal Sarda), [Anil Turkkan](https://arxiv.org/search/?searchtype=author&query=Anil Turkkan), [Quaizar Vohra](https://arxiv.org/search/?searchtype=author&query=Quaizar Vohra), [Raymond Li](https://arxiv.org/search/?searchtype=author&query=Raymond Li), [Dhruv Jhamb](https://arxiv.org/search/?searchtype=author&query=Dhruv Jhamb), [Kelechi Ogueji](https://arxiv.org/search/?searchtype=author&query=Kelechi Ogueji), [Aanjaneya Shukla](https://arxiv.org/search/?searchtype=author&query=Aanjaneya Shukla), [Oluwanifemi Bamgbose](https://arxiv.org/search/?searchtype=author&query=Oluwanifemi Bamgbose), [Toby Liang](https://arxiv.org/search/?searchtype=author&query=Toby Liang), [Luke Kumar](https://arxiv.org/search/?searchtype=author&query=Luke Kumar), [Oleksiy Ostapenko](https://arxiv.org/search/?searchtype=author&query=Oleksiy Ostapenko), [Shiva Krishna Reddy Malay](https://arxiv.org/search/?searchtype=author&query=Shiva Krishna Reddy Malay), [Aman Tiwari](https://arxiv.org/search/?searchtype=author&query=Aman Tiwari), [Tara Bogavelli](https://arxiv.org/search/?searchtype=author&query=Tara Bogavelli), [Vikas Yadav](https://arxiv.org/search/?searchtype=author&query=Vikas Yadav), [Jash Mehta](https://arxiv.org/search/?searchtype=author&query=Jash Mehta), [Saloni Mittal](https://arxiv.org/search/?searchtype=author&query=Saloni Mittal), [Akshay Kalkunte](https://arxiv.org/search/?searchtype=author&query=Akshay Kalkunte), [Pulkit Pattnaik](https://arxiv.org/search/?searchtype=author&query=Pulkit Pattnaik), [Khalil Slimi](https://arxiv.org/search/?searchtype=author&query=Khalil Slimi), [Anirudh Sreeram](https://arxiv.org/search/?searchtype=author&query=Anirudh Sreeram), [Jishnu Nair](https://arxiv.org/search/?searchtype=author&query=Jishnu Nair), [Akintunde Oladipo](https://arxiv.org/search/?searchtype=author&query=Akintunde Oladipo), [Shashank Maiya](https://arxiv.org/search/?searchtype=author&query=Shashank Maiya), [Khyati Mahajan](https://arxiv.org/search/?searchtype=author&query=Khyati Mahajan), [Rishabh Maheshwary](https://arxiv.org/search/?searchtype=author&query=Rishabh Maheshwary), [Masoud Hashemi](https://arxiv.org/search/?searchtype=author&query=Masoud Hashemi), [Sai Rajeswar Mudumba](https://arxiv.org/search/?searchtype=author&query=Sai Rajeswar Mudumba), [Sathwik Tejaswi Madhusudhan](https://arxiv.org/search/?searchtype=author&query=Sathwik Tejaswi Madhusudhan), [Torsten Scholak](https://arxiv.org/search/?searchtype=author&query=Torsten Scholak), [Sebastien Paquet](https://arxiv.org/search/?searchtype=author&query=Sebastien Paquet), [Sagar Davasam](https://arxiv.org/search/?searchtype=author&query=Sagar Davasam), [Srinivas Sunkara](https://arxiv.org/search/?searchtype=author&query=Srinivas Sunkara) 作者:Shruthan Radhakrishna、Soham Parikh、Gopal Sarda、Anil Turkkan、Quaizar Vohra、Raymond Li、Dhruv Jhamb、Kelechi Ogueji、Aanjaneya Shukla、Oluwanifemi Bamgbose、Toby Liang、Luke Kumar、Oleksiy Ostapenko、Shiva Krishna Reddy Malay、Aman Tiwari、Tara Bogavelli、Vikas Yadav、Jash Mehta、Saloni Mittal、Akshay Kalkunte、Pulkit Pattnaik、Khalil Slimi、Anirudh Sreeram、Jishnu Nair、Akintunde Oladipo、Shashank Maiya、Khyati Mahajan、Rishabh Maheshwary、Masoud Hashemi、Sai Rajeswar Mudumba、Sathwik Tejaswi Madhusudhan、Torsten Scholak、Sebastien Paquet、Sagar Davasam、Srinivas Sunkara
While large language models (LLMs) have achieved remarkable reasoning capabilities across domains like code, math and other enterprise tasks, their significant memory and computational costs often preclude their use in practical enterprise settings. To this end, we introduce Apriel-Nemotron-15B-Thinker, a 15-billion parameter model in the ServiceNow Apriel SLM series that achieves performance competitive with medium-sized state-of-the-art models such as o1-mini, QWQ32B, and EXAONE-Deep-32B while maintaining only half the memory footprint of those alternatives. The Apriel-Nemotron-15B-Thinker model is trained in a four-stage pipeline comprising 1) Base Model upscaling, 2) Continual Pre-training, 3) Supervised Fine-tuning (SFT), and 4) Reinforcement Learning using GRPO. Comprehensive evaluations across a diverse suite of benchmarks consistently demonstrate that our Apriel-Nemotron-15B-Thinker model matches or exceeds the performance of its 32-billion parameter counterparts, despite being less than half their size. 尽管大型语言模型(LLMs)在代码、数学和其他企业任务等领域展现了卓越的推理能力,但其巨大的内存和计算成本常常使其在实际企业环境中难以应用。为此,我们推出了 Apriel-Nemotron-15B-Thinker——ServiceNow Apriel SLM 系列中的一款 150 亿参数模型,它在性能上能够与 o1-mini、QWQ32B 和 EXAONE-Deep-32B 等中等规模最先进模型相媲美,同时仅占用这些替代方案的一半内存空间。Apriel-Nemotron-15B-Thinker 模型采用四阶段训练流程:1)基础模型扩展,2)持续预训练,3)监督微调(SFT),以及 4)使用 GRPO 的强化学习。对多样化基准套件进行的全面评估持续表明,尽管参数量不到其 320 亿参数同行的一半,我们的 Apriel-Nemotron-15B-Thinker 模型在性能上与它们相匹配或超越。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-13 17:43:43 UTC 发布:2025-08-13 17:43:43 协调世界时 (UTC)
#92 Modeling and Detecting Company Risks from News: A Case Study in Bloomberg News #92 从新闻中建模与检测公司风险:以彭博新闻为案例研究
Authors: [Jiaxin Pei](https://arxiv.org/search/?searchtype=author&query=Jiaxin Pei), [Soumya Vadlamannati](https://arxiv.org/search/?searchtype=author&query=Soumya Vadlamannati), [Liang-Kang Huang](https://arxiv.org/search/?searchtype=author&query=Liang-Kang Huang), [Daniel Preotiuc-Pietro](https://arxiv.org/search/?searchtype=author&query=Daniel Preotiuc-Pietro), [Xinyu Hua](https://arxiv.org/search/?searchtype=author&query=Xinyu Hua) 作者:裴佳欣,Soumya Vadlamannati,梁康恒,Daniel Preotiuc-Pietro,华新宇
Identifying risks associated with a company is important to investors and the well-being of the overall financial market. In this study, we build a computational framework to automatically extract company risk factors from news articles. Our newly proposed schema comprises seven distinct aspects, such as supply chain, regulations, and competitions. We sample and annotate 744 news articles and benchmark various machine learning models. While large language models have achieved huge progress in various types of NLP tasks, our experiments show that zero-shot and few-shot prompting of state-of-the-art LLMs (e.g. LLaMA-2) achieves only moderate to low performance in identifying risk factors. Fine-tuned pre-trained language models, in contrast, perform better on most of the risk factors. Using this model, we analyze over 277K Bloomberg news articles and demonstrate that identifying risk factors from news could provide extensive insight into the operations of companies and industries. 识别与公司相关的风险对投资者和整体金融市场的健康至关重要。在本研究中,我们构建了一个计算框架,用于从新闻文章中自动提取公司风险因素。我们新提出的方案包含七个不同的方面,例如供应链、监管和竞争。我们抽样并标注了 744 篇新闻文章,并对各种机器学习模型进行了基准测试。尽管大型语言模型在各种类型的自然语言处理任务上取得了巨大进展,我们的实验表明,针对最先进的 LLMs(例如 LLaMA-2)进行零样本和少样本提示仅能在识别风险因素上达到中等到较低的表现。而经过微调的预训练语言模型在大多数风险因素上表现更好。使用该模型,我们分析了超过 277K 篇彭博新闻文章,并证明从新闻中识别风险因素可以为公司和行业的运作提供深入见解。
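A hedged sketch of the better-performing setup, a fine-tuned pre-trained model used as a multi-label risk classifier; the checkpoint, training loop, and four of the seven aspect labels are omitted or assumed, and the multi-label framing is our reading:

```python
# Score a news sentence against (a subset of) the risk-aspect schema.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ASPECTS = ["supply chain", "regulations", "competitions"]  # 3 of the 7 aspects;
                                                           # the rest are not named in the abstract

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=len(ASPECTS),
    problem_type="multi_label_classification",  # untrained head: fine-tune before use
)

text = "The supplier halted shipments amid new export regulations."
with torch.no_grad():
    probs = torch.sigmoid(model(**tok(text, return_tensors="pt")).logits)[0]
print({a: round(float(p), 2) for a, p in zip(ASPECTS, probs)})
```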
Subjects: Computation and Language, Artificial Intelligence, Computational Engineering, Finance, and Science, Machine Learning 主题:计算与语言、人工智能、计算工程、金融与科学、机器学习
Publish: 2025-08-10 22:44:10 UTC 发布时间:2025-08-10 22:44:10 UTC
#93 gpt-oss-120b & gpt-oss-20b Model Card #93 gpt-oss-120b & gpt-oss-20b 模型卡
Authors: OpenAI: [Sandhini Agarwal](https://arxiv.org/search/?searchtype=author&query=Sandhini Agarwal), [Lama Ahmad](https://arxiv.org/search/?searchtype=author&query=Lama Ahmad), [Jason Ai](https://arxiv.org/search/?searchtype=author&query=Jason Ai), [Sam Altman](https://arxiv.org/search/?searchtype=author&query=Sam Altman), [Andy Applebaum](https://arxiv.org/search/?searchtype=author&query=Andy Applebaum), [Edwin Arbus](https://arxiv.org/search/?searchtype=author&query=Edwin Arbus), [Rahul K. Arora](https://arxiv.org/search/?searchtype=author&query=Rahul K. Arora), [Yu Bai](https://arxiv.org/search/?searchtype=author&query=Yu Bai), [Bowen Baker](https://arxiv.org/search/?searchtype=author&query=Bowen Baker), [Haiming Bao](https://arxiv.org/search/?searchtype=author&query=Haiming Bao), [Boaz Barak](https://arxiv.org/search/?searchtype=author&query=Boaz Barak), [Ally Bennett](https://arxiv.org/search/?searchtype=author&query=Ally Bennett), [Tyler Bertao](https://arxiv.org/search/?searchtype=author&query=Tyler Bertao), [Nivedita Brett](https://arxiv.org/search/?searchtype=author&query=Nivedita Brett), [Eugene Brevdo](https://arxiv.org/search/?searchtype=author&query=Eugene Brevdo), [Greg Brockman](https://arxiv.org/search/?searchtype=author&query=Greg Brockman), [Sebastien Bubeck](https://arxiv.org/search/?searchtype=author&query=Sebastien Bubeck), [Che Chang](https://arxiv.org/search/?searchtype=author&query=Che Chang), [Kai Chen](https://arxiv.org/search/?searchtype=author&query=Kai Chen), [Mark Chen](https://arxiv.org/search/?searchtype=author&query=Mark Chen), [Enoch Cheung](https://arxiv.org/search/?searchtype=author&query=Enoch Cheung), [Aidan Clark](https://arxiv.org/search/?searchtype=author&query=Aidan Clark), [Dan Cook](https://arxiv.org/search/?searchtype=author&query=Dan Cook), [Marat Dukhan](https://arxiv.org/search/?searchtype=author&query=Marat Dukhan), [Casey Dvorak](https://arxiv.org/search/?searchtype=author&query=Casey Dvorak), [Kevin Fives](https://arxiv.org/search/?searchtype=author&query=Kevin Fives), [Vlad Fomenko](https://arxiv.org/search/?searchtype=author&query=Vlad Fomenko), [Timur Garipov](https://arxiv.org/search/?searchtype=author&query=Timur Garipov) et al. (97 additional authors not shown) 作者:OpenAI:Sandhini Agarwal、Lama Ahmad、Jason Ai、Sam Altman、Andy Applebaum、Edwin Arbus、Rahul K. Arora、Yu Bai、Bowen Baker、Haiming Bao、Boaz Barak、Ally Bennett、Tyler Bertao、Nivedita Brett、Eugene Brevdo、Greg Brockman、Sebastien Bubeck、Che Chang、Kai Chen、Mark Chen、Enoch Cheung、Aidan Clark、Dan Cook、Marat Dukhan、Casey Dvorak、Kevin Fives、Vlad Fomenko、Timur Garipov 等(另有 97 位作者未列出)
We present gpt-oss-120b and gpt-oss-20b, two open-weight reasoning models that push the frontier of accuracy and inference cost. The models use an efficient mixture-of-experts transformer architecture and are trained using large-scale distillation and reinforcement learning. We optimize the models to have strong agentic capabilities (deep research browsing, python tool use, and support for developer-provided functions), all while using a rendered chat format that enables clear instruction following and role delineation. Both models achieve strong results on benchmarks spanning mathematics, coding, and safety. We release the model weights, inference implementations, tool environments, and tokenizers under an Apache 2.0 license to enable broad use and further research. 我们推出了 gpt-oss-120b 和 gpt-oss-20b 两款开放权重的推理模型,推动了准确性和推理成本的前沿。这些模型采用高效的专家混合(mixture-of-experts)Transformer 架构,使用大规模蒸馏和强化学习进行训练。我们对模型进行了优化,使其具备强大的智能体能力(深度研究浏览、Python 工具使用以及支持开发者提供的函数),同时采用渲染的聊天格式以实现清晰的指令遵循和角色划分。两款模型在从数学、编码到安全性的各类基准测试中均取得了优异成绩。我们在 Apache 2.0 许可下发布了模型权重、推理实现、工具环境和分词器,以促进广泛使用和进一步研究。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-08 19:24:38 UTC 发布:2025-08-08 19:24:38 UTC
#94 Human-AI collaboration or obedient and often clueless AI in instruct, serve, repeat dynamics? #94 人机协作,还是在“指令、服从、重复”模式下顺从且常常一无所知的人工智能?
Authors: [Mohammed Saqr](https://arxiv.org/search/?searchtype=author&query=Mohammed Saqr), [Kamila Misiejuk](https://arxiv.org/search/?searchtype=author&query=Kamila Misiejuk), [Sonsoles López-Pernas](https://arxiv.org/search/?searchtype=author&query=Sonsoles López-Pernas) 作者:Mohammed Saqr、Kamila Misiejuk、Sonsoles López-Pernas
While research on human-AI collaboration exists, it has mainly examined language learning using traditional counting methods, with little attention to the evolution and dynamics of collaboration on cognitively demanding tasks. This study examines human-AI interactions while solving a complex problem. Student-AI interactions were qualitatively coded and analyzed with transition network analysis, sequence analysis and partial correlation networks, as well as comparisons of frequencies using chi-square and Pearson-residual shaded mosaic plots, to map interaction patterns, their evolution, and their relationship to problem complexity and student performance. Findings reveal a dominant Instructive pattern, with interactions characterized by iterative ordering rather than collaborative negotiation. Students often engaged in long threads in which their prompts and the AI output were misaligned, exemplifying a lack of synergy that challenges the prevailing assumptions about LLMs as collaborative partners. We also found no significant correlations between assignment complexity, prompt length, and student grades, suggesting a lack of cognitive depth or of any effect of problem difficulty. Our study indicates that current LLMs, optimized for instruction-following rather than cognitive partnership, are limited in their capacity to act as cognitively stimulating or aligned collaborators. Implications for designing AI systems that prioritize cognitive alignment and collaboration are discussed. 尽管已有关于人机协作的研究,但主要集中在语言学习领域,并采用传统的计数方法,很少关注在认知要求高的任务中协作的演变与动态。本研究考察了解决复杂问题时的人机交互。对学生与 AI 的交互进行了定性编码,并结合转换网络分析、序列分析与部分相关网络进行分析,同时使用卡方检验和 Pearson 残差着色的莫赛克图比较频率,以绘制交互模式、其演变以及与问题复杂性和学生表现的关系。研究发现以“指令式”模式占主导,交互特征表现为迭代式下达指令而非协商式的合作。学生常常参与产生长串对话线程,显示出他们的提示与 AI 输出之间的不一致,体现出缺乏协同,这对将 LLMs 视为合作伙伴的既有假设提出了挑战。我们还发现作业复杂性、提示长度与学生成绩之间没有显著相关,表明缺乏认知深度或问题难度的影响。 我们的研究表明,目前的 LLMs 是为遵循指令而非作为认知伙伴而优化的,这削弱了它们作为能够激发认知或与人认知上契合的合作者的能力。文中讨论了为优先考虑认知对齐与协作而设计人工智能系统的影响。
Subjects: Human-Computer Interaction, Artificial Intelligence 主题:人机交互,人工智能
Publish: 2025-08-03 11:43:01 UTC 发布:2025-08-03 11:43:01 UTC
#95 Managing the unexpected: Operator behavioural data and its value in predicting correct alarm responses #95 管理意外事件:操作员行为数据及其在预测正确报警响应中的价值
Authors: [Chidera W. Amazu](https://arxiv.org/search/?searchtype=author&query=Chidera W. Amazu), [Joseph Mietkiewicz](https://arxiv.org/search/?searchtype=author&query=Joseph Mietkiewicz), [Ammar N. Abbas](https://arxiv.org/search/?searchtype=author&query=Ammar N. Abbas), [Gabriele Baldissone](https://arxiv.org/search/?searchtype=author&query=Gabriele Baldissone), [Davide Fissore](https://arxiv.org/search/?searchtype=author&query=Davide Fissore), [Micaela Demichela](https://arxiv.org/search/?searchtype=author&query=Micaela Demichela), [Anders L. Madsen](https://arxiv.org/search/?searchtype=author&query=Anders L. Madsen), [Maria Chiara Leva](https://arxiv.org/search/?searchtype=author&query=Maria Chiara Leva) 作者:Chidera W. Amazu、Joseph Mietkiewicz、Ammar N. Abbas、Gabriele Baldissone、Davide Fissore、Micaela Demichela、Anders L. Madsen、Maria Chiara Leva
Data from psychophysiological measures can offer new insight into control room operators’ behaviour, cognition, and mental workload status. This can be particularly helpful when combined with appraisal of capacity to respond to possible critical plant conditions (i.e. critical alarms response scenarios). However, wearable physiological measurement tools such as eye tracking and EEG caps can be perceived as intrusive and not suitable for usage in daily operations. Therefore, this article examines the potential of using real-time data from process and operator-system interactions during abnormal scenarios that can be recorded and retrieved from the distributed control system’s historian or process log, and their capacity to provide insight into operator behavior and predict response outcomes, without intruding on daily tasks. Data for this study were obtained from a design of experiment using a formaldehyde production plant simulator and four human-in-the-loop experimental support configurations. A comparison between the different configurations in terms of both behaviour and performance is presented in this paper. Step-wise logistic regression and Bayesian network models were used to achieve this objective. The results identified some predictive metrics, and the paper discusses their value as precursors or predictors of overall system performance in alarm response scenarios. Knowledge of relevant and predictive behavioural metrics accessible in real time can better equip decision-makers to predict outcomes and provide timely support measures for operators. 来自心理生理测量的数据可以为控制室操作员的行为、认知和心理负荷状态提供新的见解。当这些数据与对其应对可能发生的关键工厂状况(即关键报警响应情景)能力的评估相结合时,这尤其有用。然而,可穿戴生理测量工具(如眼动追踪和脑电帽)可能被视为具有侵入性,不适合日常操作使用。因此,本文考察了在异常情景下使用来自分布式控制系统的历史记录器或过程日志中可记录和检索的过程数据与操作员—系统交互的实时数据的潜力,以及这些数据在不干扰日常任务的情况下为洞察操作员行为并预测其响应结果方面的能力。本研究的数据来自使用甲醛生产厂模拟器和四种有人参与的实验支持配置所做的实验设计。本文比较了不同配置在行为和绩效方面的差异。使用逐步逻辑回归和贝叶斯网络模型来实现这一目标。结果识别出了一些预测性指标,论文讨论了它们作为报警响应场景中整体系统性能的先兆或预测因子的价值。了解可实时获取的相关且具有预测性的行为指标,可以更好地使决策者预测结果并为操作者提供及时的支持措施。
Subjects: Human-Computer Interaction, Artificial Intelligence 主题:人机交互,人工智能
Publish: 2025-08-01 15:10:16 UTC 发布:2025-08-01 15:10:16 UTC
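A rough sketch of the modelling step in #95: step-wise logistic regression predicting a correct alarm response from behavioural metrics of the kind retrievable from a DCS historian. The feature names, the simulated data, and the use of sklearn's forward SequentialFeatureSelector as a stand-in for the paper's step-wise procedure are all assumptions:

```python
# Hedged sketch: forward feature selection + logistic regression, standing in
# for the paper's step-wise model. Metrics and data are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SequentialFeatureSelector

rng = np.random.default_rng(0)
n = 200
# Hypothetical operator-behaviour metrics from a process log / historian.
X = np.column_stack([
    rng.normal(30, 10, n),   # seconds to first alarm acknowledgement
    rng.integers(1, 15, n),  # navigation actions before intervention
    rng.normal(5, 2, n),     # alarms active at response time
])
# Hypothetical outcome: correct alarm response, loosely tied to response speed.
y = (X[:, 0] + rng.normal(0, 10, n) < 35).astype(int)

model = LogisticRegression()
selector = SequentialFeatureSelector(model, n_features_to_select=2, direction="forward")
selector.fit(X, y)
print("selected metric indices:", selector.get_support(indices=True))
model.fit(X[:, selector.get_support()], y)
print("coefficients:", model.coef_)
```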
#96 Multimodal Quantitative Measures for Multiparty Behaviour Evaluation #96 多方行为评估的多模态定量测量
Authors: [Ojas Shirekar](https://arxiv.org/search/?searchtype=author&query=Ojas Shirekar), [Wim Pouw](https://arxiv.org/search/?searchtype=author&query=Wim Pouw), [Chenxu Hao](https://arxiv.org/search/?searchtype=author&query=Chenxu Hao), [Vrushank Phadnis](https://arxiv.org/search/?searchtype=author&query=Vrushank Phadnis), [Thabo Beeler](https://arxiv.org/search/?searchtype=author&query=Thabo Beeler), [Chirag Raman](https://arxiv.org/search/?searchtype=author&query=Chirag Raman) 作者:Ojas Shirekar、Wim Pouw、Chenxu Hao、Vrushank Phadnis、Thabo Beeler、Chirag Raman
Digital humans are emerging as autonomous agents in multiparty interactions, yet existing evaluation metrics largely ignore contextual coordination dynamics. We introduce a unified, intervention-driven framework for objective assessment of multiparty social behaviour in skeletal motion data, spanning three complementary dimensions: (1) synchrony via Cross-Recurrence Quantification Analysis, (2) temporal alignment via Multiscale Empirical Mode Decomposition-based Beat Consistency, and (3) structural similarity via Soft Dynamic Time Warping. We validate metric sensitivity through three theory-driven perturbations (gesture kinematic dampening, uniform speech-gesture delays, and prosodic pitch-variance reduction) applied to ≈145 30-second thin slices of group interactions from the DnD dataset. Mixed-effects analyses reveal predictable, joint-independent shifts: dampening increases CRQA determinism and reduces beat consistency, delays weaken cross-participant coupling, and pitch flattening elevates F0 Soft-DTW costs. A complementary perception study (N=27) compares judgments of full-video and skeleton-only renderings to quantify representation effects. Our three measures deliver orthogonal insights into spatial structure, timing alignment, and behavioural variability, thereby forming a robust toolkit for evaluating and refining socially intelligent agents. Code is available on GitHub: https://github.com/tapri-lab/gig-interveners 数字人作为多方互动中的自主代理正在出现,但现有评估指标大多忽视了情境下的协调动力学。我们提出了一个统一的、以干预为驱动的框架,用于对骨架运动数据中的多方社交行为进行客观评估,涵盖三条互补维度:(1)通过交叉再现定量分析(Cross-Recurrence Quantification Analysis,CRQA)评估同步性,(2)基于多尺度经验模态分解的节拍一致性(Beat Consistency)评估时间对齐,以及(3)通过软动态时间规整(Soft Dynamic Time Warping)评估结构相似性。我们通过三种基于理论的扰动验证了这些指标的敏感性——手势运动学阻尼、统一的语音-手势延迟以及韵律音高方差下降——这些扰动被应用于来自 DnD 数据集中 ≈145 段 30 秒的小片段群组互动。混合效应分析揭示了可预测的、与个体无关的变化:阻尼增加了 CRQA 的确定性并降低了节拍一致性,延迟削弱了跨参与者耦合,而音高平坦化则提高了 F0 的 Soft-DTW 代价。一项互补的感知研究( N=27 )比较了完整版视频与仅骨架呈现的判断,以量化表现形式的影响。我们的三项度量分别提供关于空间结构、时间对齐和行为变异性的正交洞见,从而构成评估和改进具社会智能代理的强大全套工具。代码可在 GitHub 获取:https://github.com/tapri-lab/gig-interveners
Subjects: Human-Computer Interaction, Artificial Intelligence, Computers and Society, Multiagent Systems 主题:人机交互、人工智能、计算机与社会、多智能体系统
Publish: 2025-08-01 13:46:12 UTC 发布:2025-08-01 13:46:12 UTC
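One of the three measures in #96, Soft Dynamic Time Warping, is compact enough to sketch directly. Below is a minimal Soft-DTW (after Cuturi & Blondel, 2017) applied to toy F0 contours to mimic the pitch-flattening perturbation; the gamma value and the signals are illustrative, not the authors' settings:

```python
# Hedged sketch: minimal Soft-DTW alignment cost between 1-D sequences.
import numpy as np

def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW cost: DTW with the hard min replaced by a soft-min."""
    n, m = len(x), len(y)
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            prev = np.array([R[i - 1, j - 1], R[i - 1, j], R[i, j - 1]])
            # Numerically stabilized soft-min with temperature gamma.
            rmin = prev.min()
            softmin = rmin - gamma * np.log(np.exp(-(prev - rmin) / gamma).sum())
            R[i, j] = cost + softmin
    return R[n, m]

# Two hypothetical pitch (F0) contours; flattening the variance raises the cost,
# mirroring the paper's pitch-variance-reduction perturbation.
t = np.linspace(0, 2 * np.pi, 50)
f0_a = 120 + 20 * np.sin(t)
f0_flat = 120 + 2 * np.sin(t)
print(f"near-identical: {soft_dtw(f0_a, f0_a + 1):.1f}  vs flattened: {soft_dtw(f0_a, f0_flat):.1f}")
```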
#97 SDSNN: A Single-Timestep Spiking Neural Network with Self-Dropping Neuron and Bayesian Optimization #97 SDSNN:一种具有自我丢弃神经元和贝叶斯优化的单时间步脉冲神经网络
Authors: [Changqing Xu](https://arxiv.org/search/?searchtype=author&query=Changqing Xu), [Buxuan Song](https://arxiv.org/search/?searchtype=author&query=Buxuan Song), [Yi Liu](https://arxiv.org/search/?searchtype=author&query=Yi Liu), [Xinfang Liao](https://arxiv.org/search/?searchtype=author&query=Xinfang Liao), [Wenbin Zheng](https://arxiv.org/search/?searchtype=author&query=Wenbin Zheng), [Yintang Yang](https://arxiv.org/search/?searchtype=author&query=Yintang Yang) 作者:许昌庆、宋布轩、刘毅、廖欣芳、郑文斌、杨银堂
Spiking Neural Networks (SNNs), as an emerging biologically inspired computational model, demonstrate significant energy efficiency advantages due to their event-driven information processing mechanism. Compared to traditional Artificial Neural Networks (ANNs), SNNs transmit information through discrete spike signals, which substantially reduces computational energy consumption through their sparse encoding approach. However, the multi-timestep computation model significantly increases inference latency and energy consumption, limiting the applicability of SNNs in edge computing scenarios. We propose a single-timestep SNN, which enhances accuracy and reduces computational energy consumption in a single timestep by optimizing spike generation and temporal parameters. We design a Self-Dropping Neuron mechanism, which enhances information-carrying capacity through dynamic threshold adjustment and selective spike suppression. Furthermore, we employ Bayesian optimization to globally search for time parameters and obtain an efficient inference mode with a single time step. Experimental results on the Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets demonstrate that, compared to traditional multi-timestep SNNs employing the Leaky Integrate-and-Fire (LIF) model, our method achieves classification accuracies of 93.72%, 92.20%, and 69.45%, respectively, using only single-timestep spikes, while maintaining comparable or even superior accuracy. Additionally, it reduces energy consumption by 56%, 21%, and 22%, respectively. 脉冲神经网络(SNN)作为一种新兴的生物启发计算模型,由于其事件驱动的信息处理机制,展示出显著的能效优势。与传统人工神经网络(ANN)相比,SNN 通过离散的脉冲信号传递信息,其稀疏编码方式在很大程度上减少了计算能耗。然而,多时间步的计算模型显著增加了推理延迟和能耗,限制了 SNN 在边缘计算场景中的适用性。我们提出了一种单时间步 SNN,通过优化脉冲生成和时间参数,在单一时间步内提升精度并降低计算能耗。我们设计了自我丢弃神经元机制(Self-Dropping Neuron),通过动态阈值调整和选择性抑制脉冲来增强信息承载能力。此外,我们采用贝叶斯优化对时间参数进行全局搜索,从而获得一种高效的单时间步推理模式。在 Fashion-MNIST、CIFAR-10 和 CIFAR-100 数据集上的实验结果表明,与使用泄漏积分并发放(LIF)模型的传统多时间步 SNN 相比,我们的方法仅使用单时间步脉冲就分别实现了 93.72%、92.20% 和 69.45% 的分类准确率,同时保持了可比或更优的精度。此外,它还分别将能耗减少了 56%、21% 和 22%。
Subjects: Neural and Evolutionary Computing, Artificial Intelligence 主题:神经与进化计算,人工智能
Publish: 2025-08-01 03:41:47 UTC 发布:2025-08-01 03:41:47 UTC
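The abstract of #97 describes the Self-Dropping Neuron only at a high level; the sketch below is one plausible reading (an adaptive firing threshold plus suppression of weak spikes within a single timestep). The adaptation rule and all names are my assumptions, not the authors' formulation:

```python
# Hedged sketch: single-timestep firing with a dynamic threshold and
# "self-dropping" suppression of low-margin spikes.
import numpy as np

def self_dropping_fire(membrane, base_threshold=1.0, adapt=0.5, drop_margin=0.2):
    """One single-timestep pass: adaptive threshold, then weak-spike suppression."""
    # Dynamic threshold: pull the base threshold toward the mean potential.
    theta = base_threshold + adapt * (membrane.mean() - base_threshold)
    spikes = (membrane >= theta).astype(np.float32)
    # Self-dropping: silence spikes that barely cleared the threshold.
    spikes[(membrane - theta) < drop_margin] = 0.0
    return spikes, theta

rng = np.random.default_rng(1)
v = rng.normal(1.0, 0.5, size=16)  # membrane potentials after one input step
s, theta = self_dropping_fire(v)
print(f"threshold={theta:.2f}, spike rate={s.mean():.2f}")
```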
#98 FLUID: Flow-Latent Unified Integration via Token Distillation for Expert Specialization in Multimodal Learning #98 FLUID:通过令牌蒸馏实现流-潜变量统一整合以用于多模态学习中的专家专门化
Authors: [Van Duc Cuong](https://arxiv.org/search/?searchtype=author&query=Van Duc Cuong), [Ta Dinh Tam](https://arxiv.org/search/?searchtype=author&query=Ta Dinh Tam), [Tran Duc Chinh](https://arxiv.org/search/?searchtype=author&query=Tran Duc Chinh), [Nguyen Thi Hanh](https://arxiv.org/search/?searchtype=author&query=Nguyen Thi Hanh) 作者:Van Duc Cuong、Ta Dinh Tam、Tran Duc Chinh、Nguyen Thi Hanh
Multimodal classification requires robust integration of visual and textual signals, yet common fusion strategies are brittle and vulnerable to modality-specific noise. In this paper, we present FLUID (Flow-Latent Unified Integration via Token Distillation for Expert Specialization), a principled token-level pipeline that improves cross-modal robustness and scalability. FLUID contributes three core elements: (1) Q-transforms, learnable query tokens that distill and retain salient token-level features from modality-specific backbones; (2) a two-stage fusion scheme that enforces cross-modal consistency via contrastive alignment and then performs adaptive, task-aware fusion through a gating mechanism and a Q-bottleneck that selectively compresses information for downstream reasoning; and (3) a lightweight, load-balanced Mixture-of-Experts at prediction time that enables efficient specialization to diverse semantic patterns. Extensive experiments demonstrate that FLUID attains 91% accuracy on the GLAMI-1M benchmark, significantly outperforming prior baselines and exhibiting strong resilience to label noise, long-tail class imbalance, and semantic heterogeneity. Targeted ablation studies corroborate both the individual and synergistic benefits of the proposed components, positioning FLUID as a scalable, noise-resilient solution for multimodal product classification. 多模态分类需要对视觉和文本信号进行稳健整合,但常见的融合策略往往脆弱且易受特定模态噪声影响。本文提出了 FLUID——通过令牌蒸馏实现的流式潜在统一整合以进行专家专门化(Flow-Latent Unified Integration via Token Distillation for Expert Specialization),这是一种有原则的令牌级管道,可提升跨模态鲁棒性和可扩展性。FLUID 包含三大核心要素:(1)Q-变换(Q-transforms),即可学习的查询令牌,用于从模态特定的骨干网络中蒸馏并保留显著的令牌级特征;(2)两阶段融合方案,先通过对比对齐强制跨模态一致性,然后通过门控机制和一个选择性压缩信息以供下游推理的 Q-瓶颈(Q-bottleneck)执行自适应、任务感知的融合;(3)在预测时使用的轻量且负载均衡的专家混合(Mixture-of-Experts),使其能高效地针对多样的语义模式进行专门化。大量实验证明,FLUID 在 GLAMI-1M 基准上达到 91% 的准确率,显著优于以往基线,并在标签噪声、长尾类别不平衡和语义异质性方面表现出强大的鲁棒性。针对性的消融研究验证了所提组件的单独和协同优势,使 FLUID 成为一种可扩展、抗噪的多模态产品分类解决方案。
Subject: Social and Information Networks 主题:社会与信息网络
Publish: 2025-08-10 09:34:17 UTC 发布:2025-08-10 09:34:17 UTC
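A sketch of how the Q-transforms in #98 might look: a small bank of learnable query tokens cross-attends to a backbone's token sequence and distills it into a fixed-size per-modality summary ready for fusion. Dimensions, naming, and the use of vanilla multi-head attention are assumptions based only on the abstract:

```python
# Hedged sketch: learnable query tokens distilling a backbone token sequence.
import torch
import torch.nn as nn

class QTransform(nn.Module):
    def __init__(self, dim=256, n_queries=8, n_heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, n_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens):            # tokens: (B, T, dim) from a backbone
        q = self.queries.expand(tokens.size(0), -1, -1)
        distilled, _ = self.attn(q, tokens, tokens)  # queries attend to tokens
        return self.norm(distilled)       # (B, n_queries, dim) modality summary

image_tokens = torch.randn(2, 197, 256)   # e.g. ViT patch tokens (hypothetical)
text_tokens = torch.randn(2, 64, 256)
qt_img, qt_txt = QTransform(), QTransform()
fused_inputs = torch.cat([qt_img(image_tokens), qt_txt(text_tokens)], dim=1)
print(fused_inputs.shape)                 # torch.Size([2, 16, 256])
```

The concatenated summaries would then feed the paper's two-stage fusion; that stage is not sketched here.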
#99 Generalized Similarity U: A Non-parametric Test of Association Based on Similarity #99 广义相似性 U:基于相似性的非参数关联检验
Authors: [Changshuai Wei](https://arxiv.org/search/?searchtype=author&query=Changshuai Wei), [Qing Lu](https://arxiv.org/search/?searchtype=author&query=Qing Lu) 作者:魏昌帅,陆庆
Second generation sequencing technologies are being increasingly used for genetic association studies, where the main research interest is to identify sets of genetic variants that contribute to various phenotypes. The phenotype can be univariate disease status, multivariate responses and even high-dimensional outcomes. Considering the genotype and phenotype as two complex objects, this also poses a general statistical problem of testing association between complex objects. Here we propose a similarity-based test, generalized similarity U (GSU), that can test the association between complex objects. We first studied the theoretical properties of the test in a general setting and then focused on the application of the test to sequencing association studies. Based on theoretical analysis, we proposed to use Laplacian kernel based similarity for GSU to boost power and enhance robustness. Through simulation, we found that GSU did have advantages over existing methods in terms of power and robustness. We further performed a whole genome sequencing (WGS) scan on Alzheimer’s Disease Neuroimaging Initiative (ADNI) data, identifying three genes, APOE, APOC1 and TOMM40, associated with imaging phenotype. We developed a C++ package for analysis of whole genome sequencing data using GSU. The source code can be downloaded at https://github.com/changshuaiwei/gsu. 第二代测序技术正越来越多地用于遗传关联研究,主要研究兴趣在于识别对各种表型有贡献的一组遗传变异。表型可以是单变量的疾病状态、多变量响应,甚至是高维结果。将基因型和表型视为两个复杂对象时,这也提出了在复杂对象之间检验关联的一般统计问题。在此我们提出了一种基于相似性的检验——广义相似性 U(GSU),用于检验复杂对象之间的关联。我们首先在一般情形下研究了该检验的理论性质,然后将重心放在该检验在测序关联研究中的应用。基于理论分析,我们建议在 GSU 中使用拉普拉斯核为基础的相似性以提高检验力并增强稳健性。通过模拟,我们发现 GSU 在检验力和稳健性方面确实优于现有方法。我们进一步对阿尔茨海默病神经影像计划(ADNI)数据进行了全基因组测序(WGS)扫描,鉴定出与影像表型相关的三个基因:APOE、APOC1 和 TOMM40。我们开发了一个用于使用 GSU 分析全基因组测序数据的 C++ 软件包。源代码可在 https://github.com/changshuaiwei/gsu 下载。
Subjects: Methodology, Genomics, Machine Learning 主题:方法学、基因组学、机器学习
Publish: 2018-01-04 01:43:31 UTC 发布:2018-01-04 01:43:31 UTC
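The core construction in #99 pairs a Laplacian-kernel genotype similarity with a phenotype similarity inside a U statistic. A simplified sketch with a permutation p-value follows; the paper derives analytic properties, so the centering choices and permutation scheme here are simplifications, and the data are simulated:

```python
# Hedged sketch: similarity-based U statistic in the spirit of GSU.
import numpy as np

def laplacian_kernel(X, sigma=1.0):
    d = np.abs(X[:, None, :] - X[None, :, :]).sum(-1)  # pairwise L1 distances
    return np.exp(-d / sigma)

def gsu_stat(K_geno, K_pheno):
    mask = ~np.eye(K_geno.shape[0], dtype=bool)  # U statistic: exclude i == j
    return (K_geno * K_pheno)[mask].mean()

rng = np.random.default_rng(0)
n, p = 100, 20
G = rng.integers(0, 3, size=(n, p)).astype(float)  # 0/1/2 genotype codes
y = G[:, :3].sum(axis=1) + rng.normal(0, 1, n)     # phenotype driven by 3 loci

Kg = laplacian_kernel(G, sigma=float(p))
Kp = laplacian_kernel(y[:, None], sigma=1.0)
obs = gsu_stat(Kg, Kp)
null = np.array([gsu_stat(Kg, Kp[np.ix_(pi, pi)])
                 for pi in (rng.permutation(n) for _ in range(500))])
print("GSU statistic:", round(obs, 4), " permutation p ~", (np.sum(null >= obs) + 1) / 501)
```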
#100 Trees Assembling Mann Whitney Approach for Detecting Genome-wide Joint Association among Low Marginal Effect loci #100 树组装曼-惠特尼方法用于检测低边际效应位点之间的全基因组联合关联
Authors: [Changshuai Wei](https://arxiv.org/search/?searchtype=author&query=Changshuai Wei), [Daniel J. Schaid](https://arxiv.org/search/?searchtype=author&query=Daniel J. Schaid), [Qing Lu](https://arxiv.org/search/?searchtype=author&query=Qing Lu) 作者:Changshuai Wei, Daniel J. Schaid, Qing Lu
Common complex diseases are likely influenced by the interplay of hundreds, or even thousands, of genetic variants. Converging evidence shows that genetic variants with low marginal effects (LME) play an important role in disease development. Despite their potential significance, discovering LME genetic variants and assessing their joint association on high dimensional data (e.g., genome wide association studies) remain a great challenge. To facilitate joint association analysis among a large ensemble of LME genetic variants, we proposed a computationally efficient and powerful approach, which we call Trees Assembling Mann Whitney (TAMW). Through simulation studies and an empirical data application, we found that TAMW outperformed multifactor dimensionality reduction (MDR) and the likelihood-ratio-based Mann Whitney approach (LRMW) when the underlying complex disease involves multiple LME loci and their interactions. For instance, in a simulation with 20 interacting LME loci, TAMW attained a higher power (power=0.931) than both MDR (power=0.599) and LRMW (power=0.704). In an empirical study of 29 known Crohn’s disease (CD) loci, TAMW also identified a stronger joint association with CD than those detected by MDR and LRMW. Finally, we applied TAMW to Wellcome Trust CD GWAS to conduct a genome wide analysis. The analysis of 459K single nucleotide polymorphisms was completed in 40 hours using parallel computing, and revealed a joint association predisposing to CD (p-value=2.763e-19). Further analysis of the newly discovered association suggested that 13 genes, such as ATG16L1 and LACC1, may play an important role in CD pathophysiological and etiological processes. 常见复杂疾病很可能受数百甚至数千个遗传变异相互作用的影响。越来越多的证据表明,具有低边际效应(LME)的遗传变异在疾病发生中起着重要作用。尽管其潜在重要性,发现 LME 遗传变异并评估它们在高维数据(例如全基因组关联研究)上的联合关联仍然是一个巨大挑战。为促进在大量 LME 遗传变异之间的联合关联分析,我们提出了一种计算效率高且强大的方法,称为树组装曼-惠特尼(TAMW)。通过模拟研究和实证数据应用,我们发现当基础复杂疾病涉及多个 LME 位点及其相互作用时,TAMW 的表现优于多因子维度约简(MDR)和基于似然比的曼-惠特尼方法(LRMW)。例如,在一个包含 20 个相互作用 LME 位点的模拟中,TAMW 的检出力更高(检出力=0.931),超过了 MDR(检出力=0.599)和 LRMW(检出力=0.704)。在对 29 个已知克罗恩病(CD)位点的实证研究中,TAMW 也识别出比 MDR 和 LRMW 更强的与 CD 的联合关联。最后,我们将 TAMW 应用于 Wellcome Trust 的 CD GWAS 进行全基因组分析。使用并行计算对 459K 个单核苷酸多态性位点的分析在 40 小时内完成,并揭示了一个促成 CD 的联合关联(p 值=2.763e-19)。对新发现关联的进一步分析表明,13 个基因(如 ATG16L1 和 LACC1)可能在 CD 的病理生理和病因过程中发挥重要作用。
Subjects: Quantitative Methods, Computation, Machine Learning 主题:定量方法、计算、机器学习
Publish: 2015-05-05 22:14:28 UTC 发布:2015-05-05 22:14:28 UTC
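To make the intuition of #100 concrete: let an ensemble of shallow trees turn many low-marginal-effect loci into a joint risk score, then test cases against controls with a Mann-Whitney U. The random forest below is my stand-in for the paper's tree-assembling procedure, and the interaction model is simulated:

```python
# Hedged sketch: tree-ensemble risk scores tested with a Mann-Whitney U.
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n, p = 400, 50
G = rng.integers(0, 3, size=(n, p))
# Hypothetical interaction of two LME loci drives case status.
logit = 0.8 * (G[:, 0] * G[:, 1] - 1.3)
y = rng.random(n) < 1 / (1 + np.exp(-logit))

# Out-of-fold risk scores avoid testing on data the trees memorized.
scores = cross_val_predict(
    RandomForestClassifier(max_depth=3, n_estimators=200, random_state=0),
    G, y, cv=5, method="predict_proba")[:, 1]
u, pval = mannwhitneyu(scores[y], scores[~y], alternative="greater")
print(f"U={u:.0f}, p={pval:.3g}")
```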
#101 A Weighted U Statistic for Genetic Association Analyses of Sequencing Data #101 一个用于测序数据遗传关联分析的加权 U 统计量
Authors: [Changshuai Wei](https://arxiv.org/search/?searchtype=author&query=Changshuai Wei), [Ming Li](https://arxiv.org/search/?searchtype=author&query=Ming Li), [Zihuai He](https://arxiv.org/search/?searchtype=author&query=Zihuai He), [Olga Vsevolozhskaya](https://arxiv.org/search/?searchtype=author&query=Olga Vsevolozhskaya), [Daniel J. Schaid](https://arxiv.org/search/?searchtype=author&query=Daniel J. Schaid), [Qing Lu](https://arxiv.org/search/?searchtype=author&query=Qing Lu) 作者:魏昌帅、李明、何子怀、Olga Vsevolozhskaya、Daniel J. Schaid、陆庆
With advancements in next generation sequencing technology, a massive amount of sequencing data are generated, offering a great opportunity to comprehensively investigate the role of rare variants in the genetic etiology of complex diseases. Nevertheless, this poses a great challenge for the statistical analysis of high-dimensional sequencing data. The association analyses based on traditional statistical methods suffer substantial power loss because of the low frequency of genetic variants and the extremely high dimensionality of the data. We developed a weighted U statistic, referred to as WU-SEQ, for the high-dimensional association analysis of sequencing data. Based on a non-parametric U statistic, WU-SEQ makes no assumption of the underlying disease model and phenotype distribution, and can be applied to a variety of phenotypes. Through simulation studies and an empirical study, we showed that WU-SEQ outperformed a commonly used SKAT method when the underlying assumptions were violated (e.g., the phenotype followed a heavy-tailed distribution). Even when the assumptions were satisfied, WU-SEQ still attained comparable performance to SKAT. Finally, we applied WU-SEQ to sequencing data from the Dallas Heart Study (DHS), and detected an association between ANGPTL4 and very low density lipoprotein cholesterol. 随着新一代测序技术的进步,产生了大量测序数据,这为全面研究罕见变异在复杂疾病遗传病因中的作用提供了极好机会。然而,这也给高维测序数据的统计分析带来了巨大挑战。基于传统统计方法的关联分析由于遗传变异的低频性和数据的极高维度而遭受显著的检验力损失。我们开发了一种加权 U 统计量,称为 WU-SEQ,用于高维测序数据的关联分析。基于非参数 U 统计量,WU-SEQ 不对潜在疾病模型和表型分布作假设,可应用于多种表型。通过模拟研究和实证研究,我们显示当潜在假设被违反(例如表型服从厚尾分布)时,WU-SEQ 的表现优于常用的 SKAT 方法。即使在假设成立的情况下,WU-SEQ 仍然能达到与 SKAT 可比的性能。最后,我们将 WU-SEQ 应用于达拉斯心脏研究(DHS)的测序数据,并检测到 ANGPTL4 与极低密度脂蛋白胆固醇之间的关联。
Subjects: Methodology, Quantitative Methods 主题:方法学,定量方法
Publish: 2015-05-05 22:13:23 UTC 发布:2015-05-05 22:13:23 UTC
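A sketch of the weighted-U construction in the spirit of #101: rank-based phenotype weights (hence robustness to heavy-tailed phenotypes) paired with a genotype similarity kernel, assessed by permutation. The exact weight function and kernel are my simplifications, not the paper's:

```python
# Hedged sketch: rank-weighted U statistic, WU-SEQ style.
import numpy as np
from scipy.stats import rankdata

def weighted_u(G, y, n_perm=500, seed=0):
    n = len(y)
    r = (rankdata(y) - (n + 1) / 2) / n          # centered phenotype ranks
    W = np.outer(r, r)                           # pairwise phenotype weights
    K = G @ G.T / G.shape[1]                     # simple genotype similarity
    mask = ~np.eye(n, dtype=bool)                # U statistic: drop i == j
    stat = (W * K)[mask].mean()
    rng = np.random.default_rng(seed)
    null = np.array([(W[np.ix_(pi, pi)] * K)[mask].mean()
                     for pi in (rng.permutation(n) for _ in range(n_perm))])
    return stat, (np.sum(null >= stat) + 1) / (n_perm + 1)

rng = np.random.default_rng(1)
G = rng.integers(0, 3, size=(150, 30)).astype(float)
y = G[:, 0] + rng.standard_cauchy(150)           # heavy-tailed phenotype
print(weighted_u(G, y))                          # (statistic, permutation p)
```

Using ranks rather than raw phenotype values is what keeps the Cauchy noise here from destroying the test, which mirrors the robustness claim in the abstract.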
#102 A Generalized Similarity U Test for Multivariate Analysis of Sequencing Data #102 一种用于测序数据多变量分析的广义相似性 U 检验
Authors: [Changshuai Wei](https://arxiv.org/search/?searchtype=author&query=Changshuai Wei), [Qing Lu](https://arxiv.org/search/?searchtype=author&query=Qing Lu) 作者:魏昌帅,陆庆
Sequencing-based studies are emerging as a major tool for genetic association studies of complex diseases. These studies pose great challenges to the traditional statistical methods (e.g., single-locus analyses based on regression methods) because of the high-dimensionality of data and the low frequency of genetic variants. In addition, there is great interest in biology and epidemiology in identifying genetic risk factors that contribute to multiple disease phenotypes. The multiple phenotypes can often follow different distributions, which violates the assumptions of most current methods. In this paper, we propose a generalized similarity U test, referred to as GSU. GSU is a similarity-based test and can handle high-dimensional genotypes and phenotypes. We studied the theoretical properties of GSU, and provided an efficient p-value calculation for the association test, as well as sample-size and power calculations for study design. Through simulation, we found that GSU had advantages over existing methods in terms of power and robustness to phenotype distributions. Finally, we used GSU to perform a multivariate analysis of sequencing data in the Dallas Heart Study and identified a joint association of 4 genes with 5 metabolic related phenotypes. 基于测序的研究正成为复杂疾病遗传关联研究的主要工具。这些研究对传统统计方法(例如基于回归方法的单位点分析)提出了巨大挑战,因为数据具有高维性且遗传变异的频率很低。此外,生物学和流行病学领域非常关注识别对多种疾病表型具有贡献的遗传风险因素。多重表型常常服从不同的分布,这违反了大多数现有方法的假设。本文提出了一种广义相似性 U 检验,称为 GSU。GSU 是一种基于相似性的检验,能够处理高维基因型和表型。我们研究了 GSU 的理论性质,并为关联检验提供了高效的 p 值计算,以及用于研究设计的样本量和功效计算。通过模拟,我们发现 GSU 在功效和对表型分布的鲁棒性方面相较现有方法具有优势。最后,我们使用 GSU 对达拉斯心脏研究中的测序数据进行了多变量分析,并确定了 4 个基因与 5 个代谢相关表型的联合关联。
Subject: Methodology 主题:方法学
Publish: 2015-05-05 20:36:43 UTC 发布:2015-05-05 20:36:43 UTC
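What distinguishes #102 is that the phenotype side only needs a pairwise similarity, so mixed-type multivariate phenotypes can enter a single kernel. A small sketch of just that piece (the product combination below is an illustrative choice; the resulting kernel would plug into the U statistic sketched under #99):

```python
# Hedged sketch: one pairwise similarity over mixed-type multivariate phenotypes.
import numpy as np

def multivariate_phenotype_kernel(Y_cont, Y_bin, sigma=1.0):
    """Combine continuous and binary phenotype similarities pairwise."""
    d_cont = np.abs(Y_cont[:, None, :] - Y_cont[None, :, :]).sum(-1)
    K_cont = np.exp(-d_cont / sigma)                            # Laplacian part
    K_bin = (Y_bin[:, None, :] == Y_bin[None, :, :]).mean(-1)   # matching part
    return K_cont * K_bin                                       # one value per pair

rng = np.random.default_rng(2)
n = 50
Y_cont = rng.normal(size=(n, 3))             # e.g. continuous metabolic traits
Y_bin = rng.integers(0, 2, size=(n, 2))      # e.g. binary disease indicators
K = multivariate_phenotype_kernel(Y_cont, Y_bin)
print(K.shape, bool(K.min() >= 0), bool(K.max() <= 1))
```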
#103 A weighted U statistic for association analysis considering genetic heterogeneity #103 一个考虑遗传异质性的加权 U 统计量用于关联分析
Authors: [Changshuai Wei](https://arxiv.org/search/?searchtype=author&query=Changshuai Wei), [Robert C. Elston](https://arxiv.org/search/?searchtype=author&query=Robert C. Elston), [Qing Lu](https://arxiv.org/search/?searchtype=author&query=Qing Lu) 作者:Changshuai Wei, Robert C. Elston, Qing Lu
Converging evidence suggests that common complex diseases with the same or similar clinical manifestations could have different underlying genetic etiologies. While current research interests have shifted toward uncovering rare variants and structural variations predisposing to human diseases, the impact of heterogeneity in genetic studies of complex diseases has been largely overlooked. Most of the existing statistical methods assume the disease under investigation has a homogeneous genetic effect and could, therefore, have low power if the disease undergoes heterogeneous pathophysiological and etiological processes. In this paper, we propose a heterogeneity weighted U (HWU) method for association analyses considering genetic heterogeneity. HWU can be applied to various types of phenotypes (e.g., binary and continuous) and is computationally efficient for high-dimensional genetic data. Through simulations, we showed the advantage of HWU when the underlying genetic etiology of a disease was heterogeneous, as well as the robustness of HWU against different model assumptions (e.g., phenotype distributions). Using HWU, we conducted a genome-wide analysis of nicotine dependence from the Study of Addiction: Genetics and Environments (SAGE) dataset. The genome-wide analysis of nearly one million genetic markers took 7 hours, identifying heterogeneous effects of two new genes (i.e., CYP3A5 and IKBKB) on nicotine dependence. 越来越多的证据表明,具有相同或相似临床表现的常见复杂疾病可能具有不同的潜在遗传病因。尽管当前研究兴趣已转向发现促成疾病的罕见变异和结构变异,但在复杂疾病遗传研究中对异质性影响的关注却在很大程度上被忽视。大多数现有的统计方法假设所研究的疾病具有一致的遗传效应,因此当疾病经历异质的病理生理和病因过程时,这些方法可能具有较低的检验效能。本文提出了一种用于考虑遗传异质性的关联分析方法——异质性加权 U(HWU)方法。HWU 可应用于多种类型的表型(例如二元和连续型),并且在高维遗传数据上具有计算效率。通过模拟研究,我们展示了当疾病的潜在遗传病因具有异质性时 HWU 的优势,以及 HWU 对不同模型假设(例如表型分布)的鲁棒性。使用 HWU,我们对成瘾研究:遗传与环境(SAGE)数据集中的烟草依赖进行了全基因组分析。对近一百万个遗传标记的全基因组分析耗时 7 小时,识别出两个新基因(即 CYP3A5 和 IKBKB)对烟草依赖的异质性效应。
Subject: Methodology 主题:方法学
Publish: 2015-04-30 17:54:31 UTC 发布:2015-04-30 17:54:31 UTC
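The heterogeneity weighting in #103 can be read as down-weighting pairs of phenotypically dissimilar subjects, so that a variant acting in only one etiologic subgroup still registers. The Gaussian weight and the toy subgroup model below are illustrative, not the authors' exact construction:

```python
# Hedged sketch: heterogeneity weights concentrating a U-type statistic
# on phenotypically similar subject pairs.
import numpy as np

def heterogeneity_weights(y, bandwidth=1.0):
    """Pairwise weights that focus the test on phenotypically similar pairs."""
    d2 = (y[:, None] - y[None, :]) ** 2
    return np.exp(-d2 / (2 * bandwidth ** 2))

rng = np.random.default_rng(3)
n = 120
subgroup = rng.integers(0, 2, n)            # hidden etiologic subtype
g = rng.integers(0, 3, n)                   # single test locus (0/1/2)
# The locus raises the phenotype only in subgroup 1 (genetic heterogeneity).
y = np.where(subgroup == 1, g * 0.8, 0) + rng.normal(0, 1, n)

W = heterogeneity_weights(y)
K = np.equal.outer(g, g).astype(float)      # crude genotype similarity
mask = ~np.eye(n, dtype=bool)               # U statistic: drop i == j
print("heterogeneity-weighted statistic:", (W * K)[mask].mean())
```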
1.3 Huggingface
SSRL: Self-Search Reinforcement Learning SSRL:自我搜索强化学习
Thyme: Think Beyond Images Thyme:超越图像的思考
DINOv3
PaperRegister: Boosting Flexible-grained Paper Search via Hierarchical Register Indexing PaperRegister:通过分层寄存索引提升灵活粒度的论文检索
XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization XQuant:通过 KV 缓存重计算打破 LLM 推理的内存壁垒
StyleMM: Stylized 3D Morphable Face Model via Text-Driven Aligned Image Translation StyleMM:通过文本驱动对齐图像翻译的风格化 3D 可变形人脸模型
FantasyTalking2: Timestep-Layer Adaptive Preference Optimization for Audio-Driven Portrait Animation FantasyTalking2:用于音频驱动肖像动画的时步-层自适应偏好优化
TexVerse: A Universe of 3D Objects with High-Resolution Textures TexVerse:一个具有高分辨率纹理的三维物体宇宙
Controlling Multimodal LLMs via Reward-guided Decoding 通过奖励引导解码控制多模态 LLMs
X-Node: Self-Explanation is All We Need X-Node:自我解释就是我们所需的一切
SPARSE Data, Rich Results: Few-Shot Semi-Supervised Learning via Class-Conditioned Image Translation 稀疏数据,丰富结果:通过类条件图像翻译进行少样本半监督学习
MAESTRO: Masked AutoEncoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data MAESTRO:用于多模态、多时序与多光谱遥感数据的掩码自编码器
1.4 X
1.5 小红书
[Paper share | ICML: tackling logical consistency in large models - 科研助手小森 | Xiaohongshu] https://www.xiaohongshu.com/discovery/item/68851390000000001502137b?source=webshare&xhsshare=pc_web&xsec_token=CBFJYrDwcDl2Xrt74PBSnRp0eNMP8NjcPvvoV53eu0jBk=&xsec_source=pc_share
2. Research of Interest
Learning
From GPT-2 to gpt-oss: an in-depth look at the evolution of OpenAI's open models
- Blog title: From GPT-2 to gpt-oss: Analyzing the Architectural Advances, And How They Stack Up Against Qwen3
- Blog URL: https://sebastianraschka.com/blog/2025/from-gpt-2-to-gpt-oss.html
Self-Evolution
SEAgent: ushering in a new era of GUI agents that self-evolve from hands-on experience
- SEAgent is a new agent framework that learns and evolves autonomously by interacting with its environment, without any human intervention.
- Its ingredients: a closed-loop autonomous evolution framework, a deeply optimized judge model, and an efficient specialist-to-generalist fusion strategy.
- Paper: https://arxiv.org/abs/2508.04700v1
- Code: https://github.com/SunzeY/SEAgent