2025-08-06 2025-08-06 About 61800 words 290 minutes

Contents

#1 CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward #1 CompassVerifier：一个用于 LLMs 评估和结果奖励的统一且稳健的验证器
#2 More Than a Score: Probing the Impact of Prompt Specificity on LLM Code Generation #2 不止于分数：探究提示特异性对 LLM 代码生成的影响
#3 FairLangProc: A Python package for fairness in NLP #3 FairLangProc：一个用于 NLP 公平性的 Python 包
#4 CTR-Sink: Attention Sink for Language Models in Click-Through Rate Prediction #4 CTR-Sink：用于点击率预测的语言模型注意力汇聚器
#5 Can Large Vision-Language Models Understand Multimodal Sarcasm? #5 大型视觉语言模型能理解多模态讽刺吗？
#6 Are We on the Right Way for Assessing Document Retrieval-Augmented Generation? #6 我们是否走在正确的道路上来评估文档检索增强生成？
#7 Tackling Distribution Shift in LLM via KILO: Knowledge-Instructed Learning for Continual Adaptation #7 通过 KILO 解决 LLM 中的分布转移问题：面向持续适应的知识指导学习
#8 Beyond the Surface: Enhancing LLM-as-a-Judge Alignment with Human via Internal Representations #8 超越表面：通过内部表征提升 LLM 作为裁判的对齐与人类的协同
#9 EmbedGrad: Gradient-Based Prompt Optimization in Embedding Space for Large Language Models #9 EmbedGrad：基于梯度的嵌入空间提示优化方法，适用于大型语言模型
#10 Marito: Structuring and Building Open Multilingual Terminologies for South African NLP #10 Marito：为南非自然语言处理构建和组织开放多语言术语库
#11 FilBench: Can LLMs Understand and Generate Filipino? #11 FilBench：LLMs 能理解和生成菲律宾语吗？
#12 UPLME: Uncertainty-Aware Probabilistic Language Modelling for Robust Empathy Regression #12 UPLME：面向稳健共情回归的不确定性感知概率语言建模
#13 CF-RAG: A Dataset and Method for Carbon Footprint QA Using Retrieval-Augmented Generation #13 CF-RAG：一种使用检索增强生成的碳足迹问答数据集和方法
#14 fact check AI at SemEval-2025 Task 7: Multilingual and Crosslingual Fact-checked Claim Retrieval #14 SemEval-2025 任务 7：多语言和跨语言事实核查声明检索
#15 Cropping outperforms dropout as an augmentation strategy for training self-supervised text embeddings #15 裁剪作为训练自监督文本嵌入的增强策略优于丢弃法
#16 LLMs Have a Heart of Stone: Demystifying the Soft Thinking Ability of Large Reasoning Models
#17 Variety Is the Spice of Life: Detecting Misinformation with Dynamic Environmental Representations #17 生活的多样性是调味品：利用动态环境表示检测错误信息
#18 ReDSM5: A Reddit Dataset for DSM-5 Depression Detection #18 ReDSM5：用于 DSM-5 抑郁症检测的 Reddit 数据集
#19 Thinking with Nothinking Calibration: A New In-Context Learning Paradigm in Reasoning Large Language Models #19 无思考校准思维：推理大型语言模型中的一种新型上下文学习范式
#20 Taggus: An Automated Pipeline for the Extraction of Characters' Social Networks from Portuguese Fiction Literature #20 Taggus：一个用于从葡萄牙小说文学中提取人物社交网络的自动化流程
#21 CTTS: Collective Test-Time Scaling #21 CTTS：集体测试时缩放
#22 Towards Trustworthy Multimodal Moderation via Policy-Aligned Reasoning and Hierarchical Labeling #22 通过策略对齐推理和分层标注迈向可信的多模态审核
#23 NLP Methods May Actually Be Better Than Professors at Estimating Question Difficulty #23 自然语言处理方法可能比教授更擅长估计问题难度
#24 Investigating Gender Bias in LLM-Generated Stories via Psychological Stereotypes #24 通过心理刻板印象调查 LLM 生成故事中的性别偏见
#25 Do language models accommodate their users? A study of linguistic convergence #25 语言模型是否适应其用户？一项语言趋同的研究
#26 LECTOR: LLM-Enhanced Concept-based Test-Oriented Repetition for Adaptive Spaced Learning #26 LECTOR：基于概念的面向测试的重复，结合 LLM 增强的自适应间隔学习
#27 Pay What LLM Wants: Can LLM Simulate Economics Experiment with 522 Real-human Persona? #27 按 LLM 想要的支付：LLM 能否模拟拥有 522 个真实人类角色的经济学实验？
#28 Exploring Stability-Plasticity Trade-offs for Continual Named Entity Recognition #28 探索持续命名实体识别中的稳定性-可塑性权衡
#29 RooseBERT: A New Deal For Political Language Modelling #29 RooseBERT：政治语言建模的新方案
#30 Somatic in the East, Psychological in the West?: Investigating Clinically-Grounded Cross-Cultural Depression Symptom Expression in LLMs #30 东方表现为躯体症状，西方表现为心理症状？：在 LLMs 中调查基于临床的跨文化抑郁症状表达
#31 CardiffNLP at CLEARS-2025: Prompting Large Language Models for Plain Language and Easy-to-Read Text Rewriting #31 CardiffNLP 在 CLEARS-2025：提示大型语言模型进行通俗易懂文本重写
#32 Probing Syntax in Large Language Models: Successes and Remaining Challenges #32 探索大型语言模型中的句法：成功与剩余挑战
#33 Current State in Privacy-Preserving Text Preprocessing for Domain-Agnostic NLP #33 隐私保护文本预处理在领域无关 NLP 中的现状
#34 Beyond Content: How Grammatical Gender Shapes Visual Representation in Text-to-Image Models #34 超越内容：语法性别如何影响文本到图像模型中的视觉表现
#35 Analyzing German Parliamentary Speeches: A Machine Learning Approach for Topic and Sentiment Classification #35 德国议会演讲分析：一种用于主题和情感分类的机器学习方法
#36 Light-IF: Endowing LLMs with Generalizable Reasoning via Preview and Self-Checking for Complex Instruction Following #36 Light-IF：通过预览和自我检查赋予 LLMs 可泛化的推理能力以执行复杂指令
#37 RCP-Merging: Merging Long Chain-of-Thought Models with Domain-Specific Models by Considering Reasoning Capability as Prior #37 RCP-Merging：通过将推理能力视为先验，合并长链思维模型与特定领域模型
#38 Long Story Generation via Knowledge Graph and Literary Theory #38 通过知识图谱和文学理论进行长篇故事生成
#39 Cross-lingual Opinions and Emotions Mining in Comparable Documents #39 跨语言观点与情感挖掘在可比文档中的应用
#40 Token-Level Precise Attack on RAG: Searching for the Best Alternatives to Mislead Generation #40 基于令牌级的精确攻击 RAG：寻找误导生成的最佳替代方案
#41 Privacy-Aware Decoding: Mitigating Privacy Leakage of Large Language Models in Retrieval-Augmented Generation #41 隐私感知解码：缓解检索增强生成中大型语言模型的隐私泄露
#42 When Algorithms Meet Artists: Topic Modeling the AI-Art Debate, 2013-2025 #42 当算法遇上艺术家：2013-2025 年 AI 艺术辩论的话题建模
#43 CoCoTen: Detecting Adversarial Inputs to Large Language Models through Latent Space Features of Contextual Co-occurrence Tensors #43 CoCoTen：通过上下文共现张量的潜在空间特征检测大型语言模型的对抗输入
#44 Can LLMs Generate High-Quality Task-Specific Conversations? #44 LLMs 能否生成高质量的特定任务对话？
#45 SLIM-LLMs: Modeling of Style-Sensory Language RelationshipsThrough Low-Dimensional Representations #45 SLIM-LLMs：通过低维表示建模风格-感官语言关系
#46 Coherent Multimodal Reasoning with Iterative Self-Evaluation for Vision-Language Models #46 具有迭代自我评估的视觉语言模型的连贯多模态推理
#47 Merge-based syntax is mediated by distinct neurocognitive mechanisms: A clustering analysis of comprehension abilities in 84,000 individuals with language deficits across nine languages #47 基于合并的句法由不同的神经认知机制调节：对来自九种语言的 84,000 名语言障碍个体理解能力的聚类分析
#48 Highlight & Summarize: RAG without the jailbreaks #48 重点与总结：无越狱的 RAG
#49 Modeling Annotator Disagreement with Demographic-Aware Experts and Synthetic Perspectives #49 使用具有人口统计意识的专家和合成视角建模标注者分歧
#50 Clinically Grounded Agent-based Report Evaluation: An Interpretable Metric for Radiology Report Generation #50 临床基础的基于智能体的报告评估：用于放射学报告生成的可解释指标
#51 Forest vs Tree: The (N,K) Trade-off in Reproducible ML Evaluation #51 森林与树：可重复机器学习评估中的 (N,K) 权衡
#52 OSINT or BULLSHINT? Exploring Open-Source Intelligence tweets about the Russo-Ukrainian War #52 开源情报还是胡扯？探索关于俄乌战争的开源情报推文
#53 Beyond Meme Templates: Limitations of Visual Similarity Measures in Meme Matching #53 超越表情包模板：视觉相似度度量在表情包匹配中的局限性
#54 PyLate: Flexible Training and Retrieval for Late Interaction Models #54 PyLate：用于后期交互模型的灵活训练与检索
#55 MultiRAG: A Knowledge-guided Framework for Mitigating Hallucination in Multi-source Retrieval Augmented Generation #55 MultiRAG：一个用于缓解多源检索增强生成中幻觉的知识引导框架
#56 MoKA: Mixture of Kronecker Adapters #56 MoKA：Kronecker 适配器混合体
#57 Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning #57 使用强化学习训练长上下文、多轮软件工程代理
#58 Draw Your Mind: Personalized Generation via Condition-Level Modeling in Text-to-Image Diffusion Models #58 画出你的心智：通过条件级建模实现文本到图像扩散模型的个性化生成
#59 A Comparative Study of Neurosymbolic AI Approaches to Interpretable Logical Reasoning #59 神经符号 AI 方法在可解释逻辑推理中的比较研究
#60 VLMQ: Efficient Post-Training Quantization for Large Vision-Language Models via Hessian Augmentation #60 VLMQ：通过 Hessian 增强实现大规模视觉语言模型的高效后训练量化
#61 Reliable Evaluation Protocol for Low-Precision Retrieval #61 低精度检索的可靠评估协议
#62 Understanding the Embedding Models on Hyper-relational Knowledge Graph #62 理解超关系知识图上的嵌入模型
#63 ChartCap: Mitigating Hallucination of Dense Chart Captioning #63 ChartCap：缓解密集图表标题的幻觉问题
#64 Toward Verifiable Misinformation Detection: A Multi-Tool LLM Agent Framework #64 面向可验证的错误信息检测：多工具 LLM 代理框架
#65 VRPO: Rethinking Value Modeling for Robust RL Training under Noisy Supervision #65 VRPO：在噪声监督下重新思考鲁棒强化学习训练中的价值建模
#66 AGENTiGraph: A Multi-Agent Knowledge Graph Framework for Interactive, Domain-Specific LLM Chatbots #66 AGENTiGraph：一个用于交互式领域特定 LLM 聊天机器人的多智能体知识图谱框架
#67 Unified Tool Integration for LLMs: A Protocol-Agnostic Approach to Function Calling #67 LLMs 的统一工具集成：一种与协议无关的函数调用方法
#68 Defend LLMs Through Self-Consciousness #68 通过自我意识防御 LLMs
#69 Following Route Instructions using Large Vision-Language Models: A Comparison between Low-level and Panoramic Action Spaces #69 使用大型视觉语言模型遵循路线指令：低级与全景动作空间的比较
#70 VisuCraft: Enhancing Large Vision-Language Models for Complex Visual-Guided Creative Content Generation via Structured Information Extraction #70 VisuCraft：通过结构化信息提取增强大型视觉语言模型以实现复杂视觉引导的创意内容生成
#71 SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec #71 SecoustiCodec：跨模态对齐的流式单编解码器语音编解码器
#72 NeuroSync: Intent-Aware Code-Based Problem Solving via Direct LLM Understanding Modification #72 NeuroSync：通过直接 LLM 理解修改实现意图感知的基于代码的问题解决
#73 CreditARF: A Framework for Corporate Credit Rating with Annual Report and Financial Feature Integration #73 CreditARF：一个结合年报和财务特征的企业信用评级框架
#74 Teaching at Scale: Leveraging AI to Evaluate and Elevate Engineering Education #74 大规模教学：利用人工智能评估和提升工程教育
#75 Efficient Agents: Building Effective Agents While Reducing Cost #75 高效智能体：在降低成本的同时构建有效智能体
#76 ToolRegistry: A Protocol-Agnostic Tool Management Library for Function-Calling LLMs #76 ToolRegistry：一个面向函数调用 LLMs 的协议无关工具管理库

#1 Agent Lightning: Train ANY AI Agents with Reinforcement Learning #1 Agent Lightning：使用强化学习训练任何 AI 代理
#2 Automated Algorithmic Discovery for Gravitational-Wave Detection Guided by LLM-Informed Evolutionary Monte Carlo Tree Search #2 基于 LLM 指导的进化蒙特卡洛树搜索的引力波检测自动算法发现
#3 Refining Critical Thinking in LLM Code Generation: A Faulty Premise-based Evaluation Framework #3 在 LLM 代码生成中提升批判性思维：基于错误前提的评估框架
#4 Hidden Dynamics of Massive Activations in Transformer Training #4 Transformer 训练中大规模激活的隐藏动态
#5 Error Detection and Correction for Interpretable Mathematics in Large Language Models #5 大型语言模型中可解释数学的错误检测与纠正
#6 VQA support to Arabic Language Learning Educational Tool #6 VQA 支持阿拉伯语学习教育工具
#7 Semantic-aware Graph-guided Behavior Sequences Generation with Large Language Models for Smart Homes #7 结合语义感知图引导的行为序列生成，利用大型语言模型应用于智能家居
#8 Toward a Graph-Theoretic Model of Belief: Confidence, Credibility, and Structural Coherence #8 迈向信念的图论模型：置信度、可信度与结构连贯性
#9 Data Overdose? Time for a Quadruple Shot: Knowledge Graph Construction using Enhanced Triple Extraction #9 数据过载？是时候来一剂四倍剂量：使用增强三元组提取的知识图谱构建
#10 Multi-Objective Infeasibility Diagnosis for Routing Problems Using Large Language Models #10 使用大型语言模型进行路由问题的多目标不可行性诊断
#11 Hide and Seek with LLMs: An Adversarial Game for Sneaky Error Generation and Self-Improving Diagnosis #11 与 LLMs 的捉迷藏：一种用于狡猾错误生成和自我改进诊断的对抗游戏
#12 Data Dependency Inference for Industrial Code Generation Based on UML Sequence Diagrams #12 基于 UML 时序图的工业代码生成数据依赖推断
#13 Board Game Arena: A Framework and Benchmark for Assessing Large Language Models via Strategic Play #13 Board Game Arena：通过策略游戏评估大型语言模型的框架和基准
#14 A Comparative Study of Neurosymbolic AI Approaches to Interpretable Logical Reasoning #14 神经符号人工智能方法在可解释逻辑推理中的比较研究
#15 CogBench: A Large Language Model Benchmark for Multilingual Speech-Based Cognitive Impairment Assessment #15 CogBench：一个用于多语言基于语音的认知障碍评估的大型语言模型基准测试
#16 Compressing Chain-of-Thought in LLMs via Step Entropy #16 通过步骤熵压缩 LLMs 中的链式思维
#17 Adaptive AI Agent Placement and Migration in Edge Intelligence Systems #17 边缘智能系统中的自适应 AI 代理部署与迁移
#18 Nemori: Self-Organizing Agent Memory Inspired by Cognitive Science #18 Nemori：受认知科学启发的自组织代理记忆
#19 ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools #19 ToolVQA：一个用于多步推理 VQA 的外部工具数据集
#20 Full-History Graphs with Edge-Type Decoupled Networks for Temporal Reasoning #20 具有边类型解耦网络的全历史图用于时间推理
#21 InqEduAgent: Adaptive AI Learning Partners with Gaussian Process Augmentation #21 InqEduAgent：带有高斯过程增强的自适应 AI 学习伙伴
#22 Geoint-R1: Formalizing Multimodal Geometric Reasoning with Dynamic Auxiliary Constructions #22 Geoint-R1：通过动态辅助构造形式化多模态几何推理
#23 Causal identification with Y0 #23 因果识别与 Y0
#24 Can Large Language Models Bridge the Gap in Environmental Knowledge? #24 大型语言模型能弥合环境知识差距吗？
#25 Toward a Trustworthy Optimization Modeling Agent via Verifiable Synthetic Data Generation #25 通过可验证的合成数据生成迈向可信的优化建模代理
#26 AgentSME for Simulating Diverse Communication Modes in Smart Education
#27 Toward Verifiable Misinformation Detection: A Multi-Tool LLM Agent Framework #27 迈向可验证的错误信息检测：一个多工具 LLM 代理框架
#28 T2UE: Generating Unlearnable Examples from Text Descriptions #28 T2UE：从文本描述生成不可学习的样本
#29 MissDDIM: Deterministic and Efficient Conditional Diffusion for Tabular Data Imputation #29 MissDDIM：用于表格数据插补的确定性高效条件扩散
#30 EoH-S: Evolution of Heuristic Set using LLMs for Automated Heuristic Design #30 EoH-S：利用 LLMs 进化启发式集合以实现自动化启发式设计
#31 ContractEval: Benchmarking LLMs for Clause-Level Legal Risk Identification in Commercial Contracts #31 ContractEval：用于商业合同中条款级法律风险识别的 LLMs 基准测试
#32 Beyond Surface-Level Detection: Towards Cognitive-Driven Defense Against Jailbreak Attacks via Meta-Operations Reasoning #32 超越表层检测：通过元操作推理实现认知驱动的越狱攻击防御
#33 Tree-of-Reasoning: Towards Complex Medical Diagnosis via Multi-Agent Reasoning with Evidence Tree #33 推理树：通过多智能体推理与证据树实现复杂医疗诊断
#34 From Text to Trajectories: GPT-2 as an ODE Solver via In-Context #34 从文本到轨迹：GPT-2 作为 ODE 求解器的上下文内方法
#35 Collab-Solver: Collaborative Solving Policy Learning for Mixed-Integer Linear Programming #35 Collab-Solver：用于混合整数线性规划的协作求解策略学习
#36 Beyond Policy Optimization: A Data Curation Flywheel for Sparse-Reward Long-Horizon Planning #36 超越策略优化：用于稀疏奖励长时规划的数据策划飞轮
#37 AGENTiGraph: A Multi-Agent Knowledge Graph Framework for Interactive, Domain-Specific LLM Chatbots #37 AGENTiGraph：一个用于交互式、特定领域 LLM 聊天机器人的多代理知识图框架
#38 When AIs Judge AIs: The Rise of Agent-as-a-Judge Evaluation for LLMs #38 当人工智能评判人工智能：LLMs 的代理评审崛起
#39 Unified Tool Integration for LLMs: A Protocol-Agnostic Approach to Function Calling #39 LLMs 的统一工具集成：一种与协议无关的函数调用方法
#40 Defend LLMs Through Self-Consciousness #40 通过自我意识保护 LLMs
#41 Polymath: A Self-Optimizing Agent with Dynamic Hierarchical Workflow #41 Polymath：一个具有动态分层工作流程的自我优化代理
#42 MedBLINK: Probing Basic Perception in Multimodal Language Models for Medicine #42 MedBLINK：探究医学多模态语言模型中的基础感知能力
#43 AQUAH: Automatic Quantification and Unified Agent in Hydrology #43 AQUAH：水文学中的自动量化与统一代理
#44 PentestJudge: Judging Agent Behavior Against Operational Requirements #44 PentestJudge：根据操作需求评判代理行为
#45 Enhancing Japanese Large Language Models with Reasoning Vectors #45 利用推理向量增强日语大型语言模型
#46 Seemingly Simple Planning Problems are Computationally Challenging: The Countdown Game #46 看似简单的规划问题实际上计算复杂：倒计时游戏
#47 A Multi-Agent System for Complex Reasoning in Radiology Visual Question Answering #47 用于放射学视觉问答中复杂推理的多智能体系统
#48 Cognitive Loop via In-Situ Optimization: Self-Adaptive Reasoning for Science #48 通过原位优化实现认知循环：科学的自适应推理
#49 Large Language Model-based Data Science Agent: A Survey #49 基于大型语言模型的数据科学代理：综述
#50 Recovering Individual-Level Activity Sequences from Location-Based Service Data Using a Novel Transformer-Based Model #50 使用新型基于 Transformer 的模型从基于位置的服务数据中恢复个体级活动序列
#51 Planning with Dynamically Changing Domains #51 动态变化领域的规划
#52 Efficient Agents: Building Effective Agents While Reducing Cost #52 高效代理：在降低成本的同时构建有效代理
#53 CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward #53 CompassVerifier：一个用于 LLMs 评估和结果奖励的统一且稳健的验证器
#54 Self-Questioning Language Models #54 自我提问语言模型
#55 Classifying Epistemic Relationships in Human-AI Interaction: An Exploratory Approach #55 分类人机交互中的认知关系：一种探索性方法
#56 Beyond risk: A proto-framework for assessing the societal impact of AI systems #56 超越风险：评估人工智能系统社会影响的原型框架
#57 A DbC Inspired Neurosymbolic Layer for Trustworthy Agent Design #57 一个受 DbC 启发的神经符号层，用于可信代理设计
#58 Forest vs Tree: The (N,K) Trade-off in Reproducible ML Evaluation #58 森林与树：可复现机器学习评估中的 (N,K) 权衡
#59 Probing the Gaps in ChatGPT Live Video Chat for Real-World Assistance for People who are Blind or Visually Impaired #59 探究 ChatGPT 实时视频聊天在为盲人或视障人士提供现实世界帮助中的不足
#60 Cross-Model Semantics in Representation Learning #60 表征学习中的跨模型语义
#61 LLMDistill4Ads: Using Cross-Encoders to Distill from LLM Signals for Advertiser Keyphrase Recommendations at eBay #61 LLMDistill4Ads：使用交叉编码器从 LLM 信号中蒸馏以进行 eBay 广告主关键词推荐
#62 AttZoom: Attention Zoom for Better Visual Features #62 AttZoom：用于更好视觉特征的注意力缩放
#63 Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction Goedel-Prover-V2：通过分阶段数据合成与自我纠正扩展形式定理证明
#64 Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling #64 区块：在 LLM 服务中通过上下文、知识和预测调度实现负载平衡
#65 MetaScope: Optics-Driven Neural Network for Ultra-Micro Metalens Endoscopy #65 MetaScope：基于光学驱动的神经网络用于超微型金属透镜内窥镜
#66 DeepFaith: A Domain-Free and Model-Agnostic Unified Framework for Highly Faithful Explanations #66 DeepFaith：一个无领域限制且模型无关的高度可信解释统一框架
#67 Decoding and Engineering the Phytobiome Communication for Smart Agriculture #67 解码与工程化植物微生物组通信以实现智能农业
#68 Supervised Dynamic Dimension Reduction with Deep Neural Network #68 监督式动态降维深度神经网络
#69 EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering #69 EmoSteer-TTS：通过激活引导实现细粒度且无需训练的情感可控文本转语音
#70 Retinal Lipidomics Associations as Candidate Biomarkers for Cardiovascular Health #70 视网膜脂质组学关联作为心血管健康的候选生物标志物
#71 MoKA: Mixture of Kronecker Adapters #71 MoKA：Kronecker 适配器混合模型
#72 CF-RAG: A Dataset and Method for Carbon Footprint QA Using Retrieval-Augmented Generation #72 CF-RAG：一种使用检索增强生成的碳足迹问答数据集和方法
#73 BitsAI-Fix: LLM-Driven Approach for Automated Lint Error Resolution in Practice #73 BitsAI-Fix：基于 LLM 的自动化 Lint 错误修复实践方法
#74 When Cars Have Stereotypes: Auditing Demographic Bias in Objects from Text-to-Image Models #74 当汽车有刻板印象时：审计文本到图像模型中对象的群体偏见
#75 Draw Your Mind: Personalized Generation via Condition-Level Modeling in Text-to-Image Diffusion Models #75 画出你的思维：通过条件级建模实现文本到图像扩散模型的个性化生成
#76 VideoGuard: Protecting Video Content from Unauthorized Editing #76 VideoGuard：保护视频内容免受未经授权的编辑
#77 fact check AI at SemEval-2025 Task 7: Multilingual and Crosslingual Fact-checked Claim Retrieval #77 SemEval-2025 任务 7 的事实核查 AI：多语言和跨语言事实核查声明检索
#78 SonicMaster: Towards Controllable All-in-One Music Restoration and Mastering #78 SonicMaster：迈向可控的一体化音乐修复与母带处理
#79 LLMs Have a Heart of Stone: Demystifying the Soft Thinking Ability of Large Reasoning Models 大语言模型心如铁石：揭秘大型推理模型的软性思维能力
#80 Spatial Imputation Drives Cross-Domain Alignment for EEG Classification #80 空间插补推动脑电分类的跨域对齐
#81 The Science Fiction Science Method #81 科幻科学方法
#82 R2GenKG: Hierarchical Multi-modal Knowledge Graph for LLM-based Radiology Report Generation #82 R2GenKG：基于 LLM 的放射学报告生成的分层多模态知识图谱
#83 Learning Latent Representations for Image Translation using Frequency Distributed CycleGAN #83 使用频率分布 CycleGAN 学习图像翻译的潜在表示
#84 SlotMatch: Distilling Temporally Consistent Object-Centric Representations for Unsupervised Video Segmentation #84 SlotMatch：蒸馏时序一致的面向对象表示以实现无监督视频分割
#85 Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling #85 视觉文档理解与问答：一种具有测试时扩展的多智能体协作框架
#86 SCFlow: Implicitly Learning Style and Content Disentanglement with Flow Models #86 SCFlow：使用流模型隐式学习风格和内容解耦
#87 Agentic AI in 6G Software Businesses: A Layered Maturity Model #87 6G 软件业务中的自主智能体 AI：分层成熟度模型
#88 When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs #88 当良性声音变成对抗：用良性输入破解音频语言模型
#89 VLMQ: Efficient Post-Training Quantization for Large Vision-Language Models via Hessian Augmentation #89 VLMQ：通过 Hessian 增强实现大规模视觉语言模型的高效后训练量化
#90 From Legacy to Standard: LLM-Assisted Transformation of Cybersecurity Playbooks into CACAO Format #90 从传统到标准：LLM 辅助将网络安全剧本转换为 CACAO 格式
#91 CTTS: Collective Test-Time Scaling #91 CTTS：集体测试时缩放
#92 Exploring Layer-wise Information Effectiveness for Post-Training Quantization in Small Language Models #92 探索小型语言模型中后训练量化的分层信息有效性
#93 Industrial LLM-based Code Optimization under Regulation: A Mixture-of-Agents Approach #93 基于 LLM 的工业代码优化与监管：一种多代理混合方法
#94 BaroPoser: Real-time Human Motion Tracking from IMUs and Barometers in Everyday Devices #94 BaroPoser：基于 IMU 和气压计的日常设备实时人体动作追踪
#95 Reliable Evaluation Protocol for Low-Precision Retrieval #95 低精度检索的可靠评估协议
#96 NLP Methods May Actually Be Better Than Professors at Estimating Question Difficulty #96 自然语言处理方法实际上可能比教授更擅长估计问题难度
#97 Investigating Gender Bias in LLM-Generated Stories via Psychological Stereotypes #97 通过心理刻板印象调查 LLM 生成故事中的性别偏见
#98 Artificial Intelligence and Generative Models for Materials Discovery – A Review #98 人工智能与生成模型在材料发现中的应用综述
#99 Pay What LLM Wants: Can LLM Simulate Economics Experiment with 522 Real-human Persona? #99 按 LLM 意愿支付：LLM 能否模拟包含 522 个真实人类角色的经济学实验？
#100 V.I.P. : Iterative Online Preference Distillation for Efficient Video Diffusion Models #100 V.I.P.：用于高效视频扩散模型的迭代在线偏好蒸馏
#101 Approximate Proportionality in Online Fair Division #101 在线公平分配中的近似比例性
#102 RooseBERT: A New Deal For Political Language Modelling #102 RooseBERT：政治语言建模的新方案
#103 CardiffNLP at CLEARS-2025: Prompting Large Language Models for Plain Language and Easy-to-Read Text Rewriting #103 CardiffNLP 在 CLEARS-2025：提示大型语言模型进行通俗易懂和易读文本重写
#104 Navigation Pixie: Implementation and Empirical Study Toward On-demand Navigation Agents in Commercial Metaverse #104 导航小精灵：面向商业元宇宙的按需导航代理的实现与实证研究
#105 The Power of Many: Synergistic Unification of Diverse Augmentations for Efficient Adversarial Robustness #105 众力之力：多样增强的协同统一以实现高效的对抗鲁棒性
#106 GeoShield: Safeguarding Geolocation Privacy from Vision-Language Models via Adversarial Perturbations #106 GeoShield：通过对抗扰动保护视觉语言模型中的地理位置隐私
#107 Spatiotemporal wall pressure forecast of a rectangular cylinder with physics-aware DeepUFNet #107 具有物理感知的 DeepUFNet 对矩形圆柱时空壁面压力的预测
#108 StoryEnsemble: Enabling Dynamic Exploration & Iteration in the Design Process with AI and Forward-Backward Propagation #108 StoryEnsemble：通过 AI 和前向-后向传播实现设计过程中的动态探索与迭代
#109 Light-IF: Endowing LLMs with Generalizable Reasoning via Preview and Self-Checking for Complex Instruction Following #109 Light-IF：通过预览和自我检查赋予 LLMs 可泛化推理能力以执行复杂指令
#110 ChartCap: Mitigating Hallucination of Dense Chart Captioning #110 ChartCap：缓解密集图表标题的幻觉问题
#111 CoTox: Chain-of-Thought-Based Molecular Toxicity Reasoning and Prediction #111 CoTox：基于链式思维的分子毒性推理与预测
#112 Estimating Worst-Case Frontier Risks of Open-Weight LLMs #112 估计开放权重 LLMs 的最坏情况前沿风险
#113 Frontier: Simulating the Next Generation of LLM Inference Systems 前沿：模拟下一代大语言模型推理系统
#114 RCP-Merging: Merging Long Chain-of-Thought Models with Domain-Specific Models by Considering Reasoning Capability as Prior #114 RCP-Merging：通过将推理能力作为先验，合并长链思维模型与特定领域模型
#115 Long Story Generation via Knowledge Graph and Literary Theory #115 通过知识图谱和文学理论进行长篇故事生成
#116 Landsat30-AU: A Vision-Language Dataset for Australian Landsat Imagery #116 Landsat30-AU：一个针对澳大利亚 Landsat 影像的视觉-语言数据集
#117 Attack the Messages, Not the Agents: A Multi-round Adaptive Stealthy Tampering Framework for LLM-MAS #117 攻击信息，而非代理：针对 LLM-MAS 的多轮自适应隐蔽篡改框架
#118 Fine-Tuning Text-to-Speech Diffusion Models Using Reinforcement Learning with Human Feedback #118 使用带有人类反馈的强化学习微调文本到语音扩散模型
#119 NANDA Adaptive Resolver: Architecture for Dynamic Resolution of AI Agent Names #119 NANDA 自适应解析器：用于动态解析 AI 代理名称的架构
#120 GEDAN: Learning the Edit Costs for Graph Edit Distance #120 GEDAN：学习图编辑距离的编辑代价
#121 Pseudo-label Induced Subspace Representation Learning for Robust Out-of-Distribution Detection #121 伪标签引导的子空间表示学习用于鲁棒的分布外检测
#122 HiTeC: Hierarchical Contrastive Learning on Text-Attributed Hypergraph with Semantic-Aware Augmentation #122 HiTeC：基于语义感知增强的文本属性超图层次对比学习
#123 Using the NANDA Index Architecture in Practice: An Enterprise Perspective #123 实践中使用 NANDA 索引架构：企业视角
#124 VFLAIR-LLM: A Comprehensive Framework and Benchmark for Split Learning of LLMs #124 VFLAIR-LLM：LLMs 分割学习的综合框架与基准
#125 A Survey of AI Agent Registry Solutions #125 AI 代理注册解决方案综述
#126 Optimizing Bipedal Locomotion for The 100m Dash With Comparison to Human Running #126 优化双足行走以适应 100 米短跑并与人类跑步进行比较
#127 Untraceable DeepFakes via Traceable Fingerprint Elimination #127 通过可追踪指纹消除实现无法追踪的深度伪造
#128 CORE-ReID: Comprehensive Optimization and Refinement through Ensemble fusion in Domain Adaptation for person re-identification #128 CORE-ReID：通过集成融合在领域自适应中对行人再识别进行全面优化和精炼
#129 VRPO: Rethinking Value Modeling for Robust RL Training under Noisy Supervision #129 VRPO：在噪声监督下重新思考鲁棒强化学习训练中的价值建模
#130 Uncertainty-Guided Face Matting for Occlusion-Aware Face Transformation #130 基于不确定性的面部抠图用于遮挡感知的面部变换
#131 SkeNa: Learning to Navigate Unseen Environments Based on Abstract Hand-Drawn Maps #131 SkeNa：基于抽象手绘地图学习导航未知环境
#132 Tool-integrated Reinforcement Learning for Repo Deep Search #132 集成工具的强化学习用于仓库深度搜索
#133 Enhancing Long Video Question Answering with Scene-Localized Frame Grouping #133 通过场景定位的帧分组增强长视频问答
#134 ClinicalFMamba: Advancing Clinical Assessment using Mamba-based Multimodal Neuroimaging Fusion #134 ClinicalFMamba：基于 Mamba 的多模态神经影像融合推进临床评估
#135 VCNet: Recreating High-Level Visual Cortex Principles for Robust Artificial Vision #135 VCNet：重现高级视觉皮层原理以实现稳健的人工视觉
#136 GACL: Grounded Adaptive Curriculum Learning with Active Task and Performance Monitoring #136 GACL：基于主动任务和性能监控的有根自适应课程学习
#137 Autonomous Inorganic Materials Discovery via Multi-Agent Physics-Aware Scientific Reasoning #137 通过多智能体物理感知科学推理实现自主无机材料发现
#138 AeroSafe: Mobile Indoor Air Purification using Aerosol Residence Time Analysis and Robotic Cough Emulator Testbed #138 AeroSafe：基于气溶胶停留时间分析和机器人咳嗽模拟测试平台的移动室内空气净化
#139 LLM-based IR-system for Bank Supervisors #139 基于 LLM 的银行监管信息检索系统
#140 Can LLMs Generate High-Quality Task-Specific Conversations? #140 LLMs 能生成高质量的特定任务对话吗？
#141 Realizing Scaling Laws in Recommender Systems: A Foundation-Expert Paradigm for Hyperscale Model Deployment #141 在推荐系统中实现规模定律：用于超大规模模型部署的基础-专家范式
#142 GrandJury: A Collaborative Machine Learning Model Evaluation Protocol for Dynamic Quality Rubrics #142 GrandJury：一种用于动态质量评分标准的协作机器学习模型评估协议
#143 Following Route Instructions using Large Vision-Language Models: A Comparison between Low-level and Panoramic Action Spaces #143 使用大型视觉语言模型遵循路线指令：低级动作空间与全景动作空间的比较
#144 Engineered over Emergent Communication in MARL for Scalable and Sample-Efficient Cooperative Task Allocation in a Partially Observable Grid #144 在多智能体强化学习中针对部分可观测网格中可扩展且样本高效的协作任务分配，工程化优于自发通信
#145 CauKer: classification time series foundation models can be pretrained on synthetic data only #145 CauKer：分类时间序列基础模型可以仅在合成数据上进行预训练
#146 Beyond Least Squares: Robust Regression Transformer (R2T) #146 超越最小二乘法：鲁棒回归变换器（R2T）
#147 Evaluation and Analysis of Deep Neural Transformers and Convolutional Neural Networks on Modern Remote Sensing Datasets #147 现代遥感数据集上深度神经变换器与卷积神经网络的评估与分析
#148 Secure mmWave Beamforming with Proactive-ISAC Defense Against Beam-Stealing Attacks #148 具有主动 ISAC 防御的安全毫米波波束成形，防止波束窃取攻击
#149 SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec #149 SecoustiCodec：跨模态对齐的流式单编解码器语音编解码器
#150 Learning from B Cell Evolution: Adaptive Multi-Expert Diffusion for Antibody Design via Online Optimization #150 从 B 细胞进化中学习：通过在线优化的自适应多专家扩散用于抗体设计
#151 Automated Validation of LLM-based Evaluators for Software Engineering Artifacts #151 基于 LLM 的软件工程工件评估器自动验证
#152 TransAM: Transformer-Based Agent Modeling for Multi-Agent Systems via Local Trajectory Encoding #152 TransAM：基于 Transformer 的多智能体系统代理建模，通过局部轨迹编码
#153 NeuroSync: Intent-Aware Code-Based Problem Solving via Direct LLM Understanding Modification #153 NeuroSync：通过直接修改 LLM 理解实现意图感知的基于代码的问题解决
#154 Real-World Receptivity to Adaptive Mental Health Interventions: Findings from an In-the-Wild Study #154 适应性心理健康干预的现实接受度：一项野外研究的发现
#155 Clinically Grounded Agent-based Report Evaluation: An Interpretable Metric for Radiology Report Generation #155 临床基础的基于代理的报告评估：放射学报告生成的可解释指标
#156 Adaptive Knowledge Distillation for Device-Directed Speech Detection #156 设备定向语音检测的自适应知识蒸馏
#157 Extracting Range-Doppler Information of Moving Targets from Wi-Fi Channel State Information #157 从 Wi-Fi 信道状态信息中提取移动目标的距离-多普勒信息
#158 Web3 x AI Agents: Landscape, Integrations, and Foundational Challenges #158 Web3 x AI 代理：现状、整合与基础性挑战
#159 The Silicon Reasonable Person: Can AI Predict How Ordinary People Judge Reasonableness? #159 硅理性人：人工智能能否预测普通人如何判断合理性？
#160 The Architecture of Trust: A Framework for AI-Augmented Real Estate Valuation in the Era of Structured Data #160 信任架构：结构化数据时代人工智能增强房地产估价的框架
#161 Context-Adaptive Multi-Prompt LLM Embedding for Vision-Language Alignment #161 适应上下文的多提示 LLM 嵌入用于视觉-语言对齐
#162 Towards a Manifesto for Cyber Humanities: Paradigms, Ethics, and Prospects #162 迈向网络人文学宣言：范式、伦理与前景
#163 CTBench: Cryptocurrency Time Series Generation Benchmark #163 CTBench：加密货币时间序列生成基准测试
#164 Beyond the Wavefunction: Qualia Abstraction Language Mechanics and the Grammar of Awareness #164 超越波函数：感质抽象语言机制与意识语法
#165 DMSC: Dynamic Multi-Scale Coordination Framework for Time Series Forecasting #165 DMSC：用于时间序列预测的动态多尺度协调框架
#166 SmallKV: Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference #166 SmallKV：小模型辅助的 KV 缓存压缩补偿，用于高效的 LLM 推理
#167 Pulse Shape Discrimination Algorithms: Survey and Benchmark #167 脉冲形状判别算法：综述与基准测试
#168 A Novel cVAE-Augmented Deep Learning Framework for Pan-Cancer RNA-Seq Classification #168 一种新颖的 cVAE 增强深度学习框架用于泛癌症 RNA-Seq 分类
#169 SpectrumFM: A New Paradigm for Spectrum Cognition #169 SpectrumFM：频谱认知的新范式
#170 DeepGB-TB: A Risk-Balanced Cross-Attention Gradient-Boosted Convolutional Network for Rapid, Interpretable Tuberculosis Screening #170 DeepGB-TB：一种风险平衡的交叉注意力梯度提升卷积网络，用于快速且可解释的结核病筛查
#171 Who Gets Cited? Gender- and Majority-Bias in LLM-Driven Reference Selection #171 谁被引用？LLM 驱动的参考文献选择中的性别和多数偏见
#172 Kronos: A Foundation Model for the Language of Financial Markets #172 Kronos：金融市场语言的基础模型
#173 A Note on Code Quality Score: LLMs for Maintainable Large Codebases #173 关于代码质量评分的说明：用于可维护大型代码库的 LLMs
#174 Teaching at Scale: Leveraging AI to Evaluate and Elevate Engineering Education #174 大规模教学：利用人工智能评估和提升工程教育
#175 Interpreting Performance Profiles with Deep Learning #175 使用深度学习解读性能曲线
#176 Forecasting NCAA Basketball Outcomes with Deep Learning: A Comparative Study of LSTM and Transformer Models #176 使用深度学习预测 NCAA 篮球比赛结果：LSTM 与 Transformer 模型的比较研究
#177 Veli: Unsupervised Method and Unified Benchmark for Low-Cost Air Quality Sensor Correction #177 Veli：低成本空气质量传感器校正的无监督方法与统一基准
#178 Mathematical Foundations of Geometric Deep Learning #178 几何深度学习的数学基础
#179 Blueprint First, Model Second: A Framework for Deterministic LLM Workflow #179 先有蓝图，后有模型：确定性 LLM 工作流程框架
#180 ECGTwin: Personalized ECG Generation Using Controllable Diffusion Model #180 ECGTwin：使用可控扩散模型的个性化心电图生成
#181 ZetA: A Riemann Zeta-Scaled Extension of Adam for Deep Learning #181 ZetA：一种基于黎曼ζ函数缩放的 Adam 扩展，用于深度学习
#182 SleepLiteCNN: Energy-Efficient Sleep Apnea Subtype Classification with 1-Second Resolution Using Single-Lead ECG #182 SleepLiteCNN：使用单导联心电图实现 1 秒分辨率的节能型睡眠呼吸暂停亚型分类
#183 A Bayesian Hybrid Parameter-Efficient Fine-Tuning Method for Large Language Models #183 一种用于大型语言模型的贝叶斯混合参数高效微调方法
#184 Evaluation of Deep Learning Models for LBBB Classification in ECG Signals #184 心电图信号中 LBBB 分类的深度学习模型评估
#185 AnnoSense: A Framework for Physiological Emotion Data Collection in Everyday Settings for AI #185 AnnoSense：一个用于日常环境中生理情绪数据采集的框架，面向人工智能
#186 Bridging LLMs and Symbolic Reasoning in Educational QA Systems: Insights from the XAI Challenge at IJCNN 2025 #186 在教育问答系统中桥接 LLMs 与符号推理：来自 IJCNN 2025 XAI 挑战赛的见解
#187 FastInit: Fast Noise Initialization for Temporally Consistent Video Generation #187 FastInit：用于时间一致性视频生成的快速噪声初始化

2025-08-06科研追新

2025-08-05 09:49:52 Tuesday ~ 2025-08-06 11:03:18 Wednesday

1. 源数据

1.1 公众号

1.1.1 量子位

1.1.2 机器之心

1.1.3 新智元

1.1.4 AGI Hunt

1.1.5 其他

智谱GLM-4.5 系列测评

1.2 Arxiv

1.2.1 Computation and Language

From：https:// /arxiv/cs.CL

2025-08-06 | | Total: 76

#1 CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward #1 CompassVerifier：一个用于 LLMs 评估和结果奖励的统一且稳健的验证器

Answer verification is crucial not only for evaluating large language models (LLMs) by matching their unstructured outputs against standard answers, but also serves as the reward model to guide LLM optimization. Most evaluation frameworks rely on regularized matching or employ general LLMs for answer verification, which demands extensive, repetitive customization for regex rules or evaluation prompts. Two fundamental limitations persist in current methodologies: 1) the absence of comprehensive benchmarks that systematically evaluate verification capabilities across different LLMs; and 2) the nascent stage of verifier development, where existing approaches lack both the robustness to handle complex edge cases and the generalizability across different domains. In this work, we develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types, including multi-subproblems, formulas, and sequence answers, while effectively identifying abnormal/invalid responses. We introduce VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of metaerror patterns to enhance CompassVerifier. We anticipate that CompassVerifier and VerifierBench will facilitate answer verification, evaluation protocols, and reinforcement learning research. Code and dataset are available at https://github.com/open-compass/CompassVerifier. 答案验证不仅对于通过将大型语言模型（LLMs）的非结构化输出与标准答案匹配来评估其性能至关重要，同时也作为奖励模型指导 LLM 的优化。大多数评估框架依赖正则化匹配或使用通用 LLMs 进行答案验证，这需要对正则表达式规则或评估提示进行大量且重复的定制。目前的方法存在两个根本性限制：1）缺乏系统评估不同 LLMs 验证能力的综合基准；2）验证器开发处于初期阶段，现有方法既缺乏处理复杂边缘情况的鲁棒性，也缺乏跨领域的泛化能力。在本工作中，我们开发了 CompassVerifier，一种准确且鲁棒的轻量级验证模型，用于评估和结果奖励。它展现了涵盖数学、知识和多样推理任务的多领域能力，能够处理多种答案类型，包括多子问题、公式和序列答案，同时有效识别异常/无效响应。我们介绍了 VerifierBench 基准测试，该测试包含从多个数据源收集的模型输出，并通过对元错误模式的人工分析进行增强，以提升 CompassVerifier。我们预计 CompassVerifier 和 VerifierBench 将促进答案验证、评估协议和强化学习研究。代码和数据集可在 https://github.com/open-compass/CompassVerifier 获取。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-05 17:55:24 UTC 发布时间：2025-08-05 17:55:24 UTC

#2 More Than a Score: Probing the Impact of Prompt Specificity on LLM Code Generation #2 不止于分数：探究提示特异性对 LLM 代码生成的影响

Authors: [Yangtian Zi](https://arxiv.org/search/?searchtype=author&query=Yangtian Zi), [Harshitha Menon](https://arxiv.org/search/?searchtype=author&query=Harshitha Menon), [Arjun Guha](https://arxiv.org/search/?searchtype=author&query=Arjun Guha) 作者：杨天子，Harshitha Menon，Arjun Guha

State-of-the-art Large Language Models (LLMs) achieve high pass@1 on general benchmarks like HumanEval but underperform on specialized suites such as ParEval. Is this due to LLMs missing domain knowledge or insufficient prompt detail is given? To answer this, we introduce PartialOrderEval, which augments any code generation benchmark with a partial order of prompts from minimal to maximally detailed. Applying it to HumanEval and both serial and OpenMP subsets of ParEval, we measure how pass@1 scales with prompt specificity. Our experiments with Llama-3.x and Qwen2.5-Coder demonstrate varying degrees of prompt sensitivity across different tasks, and a qualitative analysis highlights explicit I/O specifications, edge-case handling, and stepwise breakdowns as the key drivers of prompt detail improvement. 最先进的大型语言模型（LLMs）在 HumanEval 等通用基准测试中取得了较高的 pass@1 成绩，但在 ParEval 等专业测试套件中表现不佳。这是因为 LLMs 缺乏领域知识，还是因为提示信息不够详细？为了解答这个问题，我们引入了 PartialOrderEval，它通过从最简到最详尽的提示部分顺序，增强了任何代码生成基准测试。将其应用于 HumanEval 以及 ParEval 的串行和 OpenMP 子集，我们测量了 pass@1 随提示具体性的变化。我们对 Llama-3.x 和 Qwen2.5-Coder 的实验表明，不同任务对提示的敏感度存在差异，定性分析指出，明确的输入/输出规范、边界情况处理和分步骤拆解是提升提示细节的关键因素。

Subjects: Computation and Language, Machine Learning, Programming Languages 主题：计算与语言，机器学习，编程语言

Publish: 2025-08-05 17:49:48 UTC 发布时间：2025-08-05 17:49:48 UTC

#3 FairLangProc: A Python package for fairness in NLP #3 FairLangProc：一个用于 NLP 公平性的 Python 包

Authors: [Arturo Pérez-Peralta](https://arxiv.org/search/?searchtype=author&query=Arturo Pérez-Peralta), [Sandra Benítez-Peña](https://arxiv.org/search/?searchtype=author&query=Sandra Benítez-Peña), [Rosa E. Lillo](https://arxiv.org/search/?searchtype=author&query=Rosa E. Lillo) 作者：Arturo Pérez-Peralta，Sandra Benítez-Peña，Rosa E. Lillo

The rise in usage of Large Language Models to near ubiquitousness in recent years has risen societal concern about their applications in decision-making contexts, such as organizational justice or healthcare. This, in turn, poses questions about the fairness of these models in critical settings, which leads to the developement of different procedures to address bias in Natural Language Processing. Although many datasets, metrics and algorithms have been proposed to measure and mitigate harmful prejudice in Natural Language Processing, their implementation is diverse and far from centralized. As a response, this paper presents FairLangProc, a comprehensive Python package providing a common implementation of some of the more recent advances in fairness in Natural Language Processing providing an interface compatible with the famous Hugging Face transformers library, aiming to encourage the widespread use and democratization of bias mitigation techniques. The implementation can be found on https://github.com/arturo-perez-peralta/FairLangProc. 近年来，大型语言模型的使用几乎普及，引发了社会对其在决策环境中应用的关注，如组织公正或医疗保健等领域。这反过来又提出了这些模型在关键场景中公平性的问题，促使人们开发出不同的程序来解决自然语言处理中的偏见问题。尽管已经提出了许多数据集、指标和算法来衡量和减轻自然语言处理中的有害偏见，但它们的实现方式多样且远未集中化。作为回应，本文介绍了 FairLangProc，这是一个综合性的 Python 包，提供了一些自然语言处理公平性最新进展的统一实现，并提供了与著名的 Hugging Face transformers 库兼容的接口，旨在促进偏见缓解技术的广泛使用和普及。该实现可在 https://github.com/arturo-perez-peralta/FairLangProc 获取。

Subjects: Computation and Language, Machine Learning 主题：计算与语言，机器学习

Publish: 2025-08-05 17:47:53 UTC 发布时间：2025-08-05 17:47:53 UTC

#4 CTR-Sink: Attention Sink for Language Models in Click-Through Rate Prediction #4 CTR-Sink：用于点击率预测的语言模型注意力汇聚器

Click-Through Rate (CTR) prediction, a core task in recommendation systems, estimates user click likelihood using historical behavioral data. Modeling user behavior sequences as text to leverage Language Models (LMs) for this task has gained traction, owing to LMs’ strong semantic understanding and contextual modeling capabilities. However, a critical structural gap exists: user behavior sequences consist of discrete actions connected by semantically empty separators, differing fundamentally from the coherent natural language in LM pre-training. This mismatch causes semantic fragmentation, where LM attention scatters across irrelevant tokens instead of focusing on meaningful behavior boundaries and inter-behavior relationships, degrading prediction performance. To address this, we propose CTR-Sink, a novel framework introducing behavior-level attention sinks tailored for recommendation scenarios. Inspired by attention sink theory, it constructs attention focus sinks and dynamically regulates attention aggregation via external information. Specifically, we insert sink tokens between consecutive behaviors, incorporating recommendation-specific signals such as temporal distance to serve as stable attention sinks. To enhance generality, we design a two-stage training strategy that explicitly guides LM attention toward sink tokens and a attention sink mechanism that amplifies inter-sink dependencies to better capture behavioral correlations. Experiments on one industrial dataset and two open-source datasets (MovieLens, Kuairec), alongside visualization results, validate the method’s effectiveness across scenarios. 点击率（CTR）预测是推荐系统中的核心任务，通过历史行为数据估计用户点击的可能性。将用户行为序列建模为文本以利用语言模型（LMs）完成该任务因语言模型强大的语义理解和上下文建模能力而受到关注。然而，存在一个关键的结构差异：用户行为序列由离散的动作组成，中间由语义空白的分隔符连接，这与语言模型预训练时的连贯自然语言有本质区别。这种不匹配导致语义碎片化，语言模型的注意力分散在无关的标记上，而非聚焦于有意义的行为边界和行为间关系，进而降低了预测性能。为了解决这一问题，我们提出了 CTR-Sink ，一个为推荐场景量身定制的行为级注意力汇聚的新框架。受注意力汇聚理论启发，该框架构建了注意力聚焦汇聚点，并通过外部信息动态调节注意力的聚合。具体来说，我们在连续行为之间插入汇聚标记，结合推荐特定的信号如时间距离，作为稳定的注意力汇聚点。为了增强泛化能力，我们设计了一个两阶段训练策略，明确引导语言模型的注意力聚焦于汇聚标记，并设计了一个注意力汇聚机制，放大汇聚点之间的依赖关系，以更好地捕捉行为间的关联。基于一个工业数据集和两个开源数据集（MovieLens、Kuairec）的实验，以及可视化结果，验证了该方法在多种场景下的有效性。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-05 17:30:34 UTC 发布时间：2025-08-05 17:30:34 UTC

#5 Can Large Vision-Language Models Understand Multimodal Sarcasm? #5 大型视觉语言模型能理解多模态讽刺吗？

Authors: [Xinyu Wang](https://arxiv.org/search/?searchtype=author&query=Xinyu Wang), [Yue Zhang](https://arxiv.org/search/?searchtype=author&query=Yue Zhang), [Liqiang Jing](https://arxiv.org/search/?searchtype=author&query=Liqiang Jing) 作者：王新宇，张越，景立强

Sarcasm is a complex linguistic phenomenon that involves a disparity between literal and intended meanings, making it challenging for sentiment analysis and other emotion-sensitive tasks. While traditional sarcasm detection methods primarily focus on text, recent approaches have incorporated multimodal information. However, the application of Large Visual Language Models (LVLMs) in Multimodal Sarcasm Analysis (MSA) remains underexplored. In this paper, we evaluate LVLMs in MSA tasks, specifically focusing on Multimodal Sarcasm Detection and Multimodal Sarcasm Explanation. Through comprehensive experiments, we identify key limitations, such as insufficient visual understanding and a lack of conceptual knowledge. To address these issues, we propose a training-free framework that integrates in-depth object extraction and external conceptual knowledge to improve the model’s ability to interpret and explain sarcasm in multimodal contexts. The experimental results on multiple models show the effectiveness of our proposed framework. The code is available at https://github.com/cp-cp/LVLM-MSA. 讽刺是一种复杂的语言现象，涉及字面意义与意图意义之间的差异，这使得情感分析和其他情绪敏感任务变得具有挑战性。虽然传统的讽刺检测方法主要关注文本，但近年来的方法已开始融合多模态信息。然而，大型视觉语言模型（LVLM）在多模态讽刺分析（MSA）中的应用仍未得到充分探索。本文评估了 LVLM 在 MSA 任务中的表现，特别关注多模态讽刺检测和多模态讽刺解释。通过全面的实验，我们发现了关键的局限性，如视觉理解不足和缺乏概念知识。为了解决这些问题，我们提出了一个无需训练的框架，结合深入的对象提取和外部概念知识，以提升模型在多模态环境中理解和解释讽刺的能力。多模型的实验结果表明了我们所提框架的有效性。代码可在 https://github.com/cp-cp/LVLM-MSA 获取。

Subjects: Computation and Language, Computer Vision and Pattern Recognition 学科领域：计算与语言，计算机视觉与模式识别

Publish: 2025-08-05 17:05:11 UTC 发布时间：2025-08-05 17:05:11 UTC

#6 Are We on the Right Way for Assessing Document Retrieval-Augmented Generation? #6 我们是否走在正确的道路上来评估文档检索增强生成？

Authors: [Wenxuan Shen](https://arxiv.org/search/?searchtype=author&query=Wenxuan Shen), [Mingjia Wang](https://arxiv.org/search/?searchtype=author&query=Mingjia Wang), [Yaochen Wang](https://arxiv.org/search/?searchtype=author&query=Yaochen Wang), [Dongping Chen](https://arxiv.org/search/?searchtype=author&query=Dongping Chen), [Junjie Yang](https://arxiv.org/search/?searchtype=author&query=Junjie Yang), [Yao Wan](https://arxiv.org/search/?searchtype=author&query=Yao Wan), [Weiwei Lin](https://arxiv.org/search/?searchtype=author&query=Weiwei Lin) 作者：沈文轩，王明佳，王耀辰，陈东平，杨俊杰，万尧，林伟伟

Retrieval-Augmented Generation (RAG) systems using Multimodal Large Language Models (MLLMs) show great promise for complex document understanding, yet their development is critically hampered by inadequate evaluation. Current benchmarks often focus on specific part of document RAG system and use synthetic data with incomplete ground truth and evidence labels, therefore failing to reflect real-world bottlenecks and challenges. To overcome these limitations, we introduce Double-Bench: a new large-scale, multilingual, and multimodal evaluation system that is able to produce fine-grained assessment to each component within document RAG systems. It comprises 3,276 documents (72,880 pages) and 5,168 single- and multi-hop queries across 6 languages and 4 document types with streamlined dynamic update support for potential data contamination issues. Queries are grounded in exhaustively scanned evidence pages and verified by human experts to ensure maximum quality and completeness. Our comprehensive experiments across 9 state-of-the-art embedding models, 4 MLLMs and 4 end-to-end document RAG frameworks demonstrate the gap between text and visual embedding models is narrowing, highlighting the need in building stronger document retrieval models. Our findings also reveal the over-confidence dilemma within current document RAG frameworks that tend to provide answer even without evidence support. We hope our fully open-source Double-Bench provide a rigorous foundation for future research in advanced document RAG systems. We plan to retrieve timely corpus and release new benchmarks on an annual basis. 使用多模态大型语言模型（MLLM）的检索增强生成（RAG）系统在复杂文档理解方面展现出巨大潜力，但其发展严重受限于评估不足。当前的基准测试通常侧重于文档 RAG 系统的特定部分，且使用带有不完整真实标签和证据标签的合成数据，因此未能反映现实世界中的瓶颈和挑战。为克服这些限制，我们引入了 Double-Bench：一个新的大规模、多语言、多模态评估系统，能够对文档 RAG 系统中的每个组件进行细粒度评估。该系统包含 3,276 份文档（72,880 页）和 5,168 个单跳及多跳查询，涵盖 6 种语言和 4 种文档类型，并支持简化的动态更新以应对潜在的数据污染问题。查询基于经过详尽扫描的证据页，并由人工专家验证，以确保最高的质量和完整性。我们在 9 个最先进的嵌入模型、4 个 MLLM 和 4 个端到端文档 RAG 框架上进行了全面实验，结果表明文本和视觉嵌入模型之间的差距正在缩小，凸显了构建更强大文档检索模型的必要性。我们的研究还揭示了当前文档 RAG 框架中存在的过度自信困境，这些框架倾向于在没有证据支持的情况下给出答案。我们希望我们的完全开源 Double-Bench 为未来先进文档 RAG 系统的研究提供坚实基础。我们计划每年检索最新语料库并发布新的基准测试。

Subjects: Computation and Language, Computer Vision and Pattern Recognition, Information Retrieval 主题：计算与语言，计算机视觉与模式识别，信息检索

Publish: 2025-08-05 16:55:02 UTC 发布时间：2025-08-05 16:55:02 UTC

#7 Tackling Distribution Shift in LLM via KILO: Knowledge-Instructed Learning for Continual Adaptation #7 通过 KILO 解决 LLM 中的分布转移问题：面向持续适应的知识指导学习

Authors: [Iing Muttakhiroh](https://arxiv.org/search/?searchtype=author&query=Iing Muttakhiroh), [Thomas Fevens](https://arxiv.org/search/?searchtype=author&query=Thomas Fevens) 作者：Iing Muttakhiroh，Thomas Fevens

Large Language Models (LLMs) often suffer from performance degradation when faced with domain shifts, primarily due to catastrophic forgetting. In this work, we propose KILO (Knowledge-Instructed Learning for Continual Adaptation), a novel continual learning framework that integrates dynamic knowledge graphs with instruction tuning. By leveraging retrieved domain-specific knowledge as guidance during training, KILO enhances both adaptability to new domains and retention of previously acquired knowledge. We pretrain our model on WikiText-103 and evaluate sequential adaptation across four diverse target domains: BioASQ, SciQ, TweetEval, and MIND. Our experiments demonstrate that KILO consistently outperforms strong baselines, including continual fine-tuning, ERNIE 2.0, and CPT, in terms of backward transfer, forward transfer, F1 score, retention rate, and training efficiency. These results highlight the effectiveness of combining structured knowledge retrieval and instruction prompting to overcome domain shift challenges in continual learning scenarios. 大型语言模型（LLMs）在面对领域转移时常常表现出性能下降，主要原因是灾难性遗忘。在本工作中，我们提出了 KILO（基于知识指导的持续适应学习），这是一种将动态知识图谱与指令调优相结合的新型持续学习框架。通过在训练过程中利用检索到的领域特定知识作为指导，KILO 增强了对新领域的适应能力以及对先前获得知识的保持能力。我们在 WikiText-103 上对模型进行了预训练，并在四个多样化的目标领域——BioASQ、SciQ、TweetEval 和 MIND 上评估了顺序适应能力。实验结果表明，KILO 在向后迁移、向前迁移、F1 分数、保持率和训练效率等方面均持续优于包括持续微调、ERNIE 2.0 和 CPT 在内的强基线方法。这些结果凸显了结合结构化知识检索与指令提示在持续学习场景中克服领域转移挑战的有效性。

Subjects: Computation and Language, Machine Learning 主题：计算与语言，机器学习

Publish: 2025-08-05 15:39:37 UTC 发布时间：2025-08-05 15:39:37 UTC

#8 Beyond the Surface: Enhancing LLM-as-a-Judge Alignment with Human via Internal Representations #8 超越表面：通过内部表征提升 LLM 作为裁判的对齐与人类的协同

Authors: [Peng Lai](https://arxiv.org/search/?searchtype=author&query=Peng Lai), [Jianjie Zheng](https://arxiv.org/search/?searchtype=author&query=Jianjie Zheng), [Sijie Cheng](https://arxiv.org/search/?searchtype=author&query=Sijie Cheng), [Yun Chen](https://arxiv.org/search/?searchtype=author&query=Yun Chen), [Peng Li](https://arxiv.org/search/?searchtype=author&query=Peng Li), [Yang Liu](https://arxiv.org/search/?searchtype=author&query=Yang Liu), [Guanhua Chen](https://arxiv.org/search/?searchtype=author&query=Guanhua Chen) 作者：赖鹏，郑建杰，程思杰，陈云，李鹏，刘洋，陈冠华

The growing scale of evaluation tasks has led to the widespread adoption of automated evaluation using large language models, a paradigm known as “LLMas-a-judge.” However, improving its alignment with human preferences without complex prompts or fine-tuning remains challenging. In this work, motivated by preliminary findings that middle-to-upper layers encode semantically and taskrelevant representations that are often more aligned with human judgments than the final layer, we propose LAGER, a lightweight and efficient framework for enhancing LLM-as-a-Judge alignment with human scoring, via internal representations. LAGER produces fine-grained judgment scores by aggregating cross-layer scoretoken logits and computing the expected score from a softmax-based distribution, with the LLM backbone kept frozen. LAGER fully leverages the complementary information across different layers, overcoming the limitations of relying solely on the final layer. We evaluate our method on the standard alignment benchmarks Flask, HelpSteer, and BIGGen using Spearman correlation, and find that LAGER achieves improvements of up to 7.5% over the best baseline across these benchmarks. Without reasoning steps, LAGER matches or outperforms reasoning-based methods. Experiments on downstream applications, such as data selection and emotional understanding, further show the effectiveness of our method. 评估任务规模的不断扩大促使了使用大型语言模型进行自动化评估的广泛采用，这一范式被称为“LLM 作为评判者”。然而，在不依赖复杂提示或微调的情况下提升其与人类偏好的对齐仍然具有挑战性。在本工作中，基于初步发现中间到上层编码了语义和任务相关的表示，这些表示通常比最终层更符合人类判断，我们提出了 LAGER，一种通过内部表示增强 LLM 作为评判者与人类评分对齐的轻量高效框架。LAGER 通过聚合跨层的评分 token logits 并基于 softmax 分布计算期望分数，生成细粒度的判断分数，同时保持 LLM 主干模型冻结。LAGER 充分利用不同层之间的互补信息，克服了仅依赖最终层的局限性。我们在标准对齐基准 Flask、HelpSteer 和 BIGGen 上使用 Spearman 相关系数评估了该方法，结果显示 LAGER 在这些基准上相较最佳基线提升了最高 7.5%。在没有推理步骤的情况下，LAGER 的表现与基于推理的方法相当或更优。在数据选择和情感理解等下游应用上的实验进一步展示了我们方法的有效性。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-05 15:18:36 UTC 发布时间：2025-08-05 15:18:36 UTC

#9 EmbedGrad: Gradient-Based Prompt Optimization in Embedding Space for Large Language Models #9 EmbedGrad：基于梯度的嵌入空间提示优化方法，适用于大型语言模型

Authors: [Xiaoming Hou](https://arxiv.org/search/?searchtype=author&query=Xiaoming Hou), [Jiquan Zhang](https://arxiv.org/search/?searchtype=author&query=Jiquan Zhang), [Zibin Lin](https://arxiv.org/search/?searchtype=author&query=Zibin Lin), [DaCheng Tao](https://arxiv.org/search/?searchtype=author&query=DaCheng Tao), [Shengli Zhang](https://arxiv.org/search/?searchtype=author&query=Shengli Zhang) 作者：侯晓明，张继权，林子斌，陶大成，张胜利

Effectively adapting powerful pretrained foundation models to diverse tasks remains a key challenge in AI deployment. Current approaches primarily follow two paradigms:discrete optimization of text prompts through prompt engineering, or continuous adaptation via additional trainable parameters. Both exhibit limitations-discrete methods lack refinement precision while parameter-based techniques increase complexity and reduce interpretability. To address these constraints, we propose EmbedGrad, a novel framework that optimizes text prompt embeddings through gradient-based refinement. Our approach uniquely decouples training from deployment:during optimization,labeled examples guide precise embedding adjustments while preserving semantic meaning; during inference, only optimized embeddings integrate with user queries. This enables fine-grained calibration impossible in text space, such as enhancing the reasoning capability of prompts like please reason step by step. Comprehensive evaluations across mathematical reasoning, sentiment analysis, and causal judgment tasks demonstrate EmbedGrad’s effectiveness:optimizing this reasoning prompt for Qwen2.5-Math-1.5B increased accuracy from 14.74% to 58.96% on mathematical problems. Consistent improvements were observed across model scales (0.5B-14B) and all tasks, with particularly significant gains for smaller models on complex problems like causal judgment. By bridging prompt engineering and parameter efficiency without architectural changes, our work establishes embedding refinement as a powerful new paradigm for task adaptation. 有效地将强大的预训练基础模型适应于多样化任务，仍然是人工智能部署中的一大挑战。当前的方法主要遵循两种范式：通过提示工程对文本提示进行离散优化，或通过额外的可训练参数进行连续适应。两者均存在局限——离散方法缺乏精细调整的精度，而基于参数的技术则增加了复杂性并降低了可解释性。为了解决这些限制，我们提出了 EmbedGrad，一种通过基于梯度的精细调整优化文本提示嵌入的新框架。我们的方法独特地将训练与部署解耦：在优化过程中，带标签的示例指导精确的嵌入调整，同时保持语义含义；在推理阶段，只有优化后的嵌入与用户查询结合。这使得在文本空间中无法实现的细粒度校准成为可能，例如增强“请逐步推理”这类提示的推理能力。在数学推理、情感分析和因果判断任务上的全面评估证明了 EmbedGrad 的有效性：针对 Qwen2.5-Math-1.5B 优化该推理提示，使数学问题的准确率从 14.74%提升至 58.96%。在不同模型规模（0.5B-14B）和所有任务中均观察到持续的改进，尤其是在因果判断等复杂问题上，小型模型的提升尤为显著。通过在不改变架构的情况下桥接提示工程与参数效率，我们的工作确立了嵌入优化作为任务适应的强大新范式。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-05 15:03:10 UTC 发布时间：2025-08-05 15:03:10 UTC

#10 Marito: Structuring and Building Open Multilingual Terminologies for South African NLP #10 Marito：为南非自然语言处理构建和组织开放多语言术语库

The critical lack of structured terminological data for South Africa’s official languages hampers progress in multilingual NLP, despite the existence of numerous government and academic terminology lists. These valuable assets remain fragmented and locked in non-machine-readable formats, rendering them unusable for computational research and development. \emph{Marito} addresses this challenge by systematically aggregating, cleaning, and standardising these scattered resources into open, interoperable datasets. We introduce the foundational \emph{Marito} dataset, released under the equitable, Africa-centered NOODL framework. To demonstrate its immediate utility, we integrate the terminology into a Retrieval-Augmented Generation (RAG) pipeline. Experiments show substantial improvements in the accuracy and domain-specific consistency of English-to-Tshivenda machine translation for large language models. \emph{Marito} provides a scalable foundation for developing robust and equitable NLP technologies, ensuring South Africa’s rich linguistic diversity is represented in the digital age. 尽管存在众多政府和学术术语列表，南非官方语言缺乏结构化的术语数据严重阻碍了多语言自然语言处理的发展。这些宝贵资源分散且以非机器可读格式存储，导致其无法用于计算研究和开发。Marito 通过系统地聚合、清理和标准化这些分散资源，将其转化为开放且可互操作的数据集，从而应对这一挑战。我们推出了基础的 Marito 数据集，该数据集在公平且以非洲为中心的 NOODL 框架下发布。为了展示其即时效用，我们将术语集成到检索增强生成（RAG）管道中。实验表明，在大型语言模型的英译茨文达语机器翻译中，准确性和领域特定一致性均有显著提升。Marito 为开发强大且公平的自然语言处理技术提供了可扩展的基础，确保南非丰富的语言多样性在数字时代得到体现。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-05 15:00:02 UTC 发布：2025-08-05 15:00:02 UTC

#11 FilBench: Can LLMs Understand and Generate Filipino? #11 FilBench：LLMs 能理解和生成菲律宾语吗？

Authors: [Lester James V. Miranda](https://arxiv.org/search/?searchtype=author&query=Lester James V. Miranda), [Elyanah Aco](https://arxiv.org/search/?searchtype=author&query=Elyanah Aco), [Conner Manuel](https://arxiv.org/search/?searchtype=author&query=Conner Manuel), [Jan Christian Blaise Cruz](https://arxiv.org/search/?searchtype=author&query=Jan Christian Blaise Cruz), [Joseph Marvin Imperial](https://arxiv.org/search/?searchtype=author&query=Joseph Marvin Imperial) 作者：Lester James V. Miranda、Elyanah Aco、Conner Manuel、Jan Christian Blaise Cruz、Joseph Marvin Imperial

Despite the impressive performance of LLMs on English-based tasks, little is known about their capabilities in specific languages such as Filipino. In this work, we address this gap by introducing FilBench, a Filipino-centric benchmark designed to evaluate LLMs across a diverse set of tasks and capabilities in Filipino, Tagalog, and Cebuano. We carefully curate the tasks in FilBench to reflect the priorities and trends of NLP research in the Philippines such as Cultural Knowledge, Classical NLP, Reading Comprehension, and Generation. By evaluating 27 state-of-the-art LLMs on FilBench, we find that several LLMs suffer from reading comprehension and translation capabilities. Our results indicate that FilBench is challenging, with the best model, GPT-4o, achieving only a score of 72.23%. Moreover, we also find that models trained specifically for Southeast Asian languages tend to underperform on FilBench, with the highest-performing model, SEA-LION v3 70B, achieving only a score of 61.07%. Our work demonstrates the value of curating language-specific LLM benchmarks to aid in driving progress on Filipino NLP and increasing the inclusion of Philippine languages in LLM development. 尽管 LLMs 在基于英语的任务中表现出色，但其在菲律宾语等特定语言上的能力知之甚少。在本研究中，我们通过引入 FilBench 填补这一空白，FilBench 是一个以菲律宾语为中心的基准，旨在评估 LLMs 在菲律宾语、他加禄语和宿务语等多样化任务和能力上的表现。我们精心策划了 FilBench 中的任务，以反映菲律宾 NLP 研究的重点和趋势，如文化知识、经典 NLP、阅读理解和生成。通过对 27 个最先进的 LLMs 在 FilBench 上的评估，我们发现部分 LLMs 在阅读理解和翻译能力方面存在不足。我们的结果表明，FilBench 具有挑战性，表现最好的模型 GPT-4o 仅获得 72.23%的得分。此外，我们还发现专门针对东南亚语言训练的模型在 FilBench 上的表现往往较差，表现最好的模型 SEA-LION v3 70B 仅获得 61.07%的得分。我们的工作展示了策划特定语言 LLM 基准的重要性，有助于推动菲律宾语 NLP 的发展，并增加菲律宾语言在 LLM 开发中的包容性。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-05 14:48:32 UTC 发布时间：2025-08-05 14:48:32 UTC

#12 UPLME: Uncertainty-Aware Probabilistic Language Modelling for Robust Empathy Regression #12 UPLME：面向稳健共情回归的不确定性感知概率语言建模

Authors: [Md Rakibul Hasan](https://arxiv.org/search/?searchtype=author&query=Md Rakibul Hasan), [Md Zakir Hossain](https://arxiv.org/search/?searchtype=author&query=Md Zakir Hossain), [Aneesh Krishna](https://arxiv.org/search/?searchtype=author&query=Aneesh Krishna), [Shafin Rahman](https://arxiv.org/search/?searchtype=author&query=Shafin Rahman), [Tom Gedeon](https://arxiv.org/search/?searchtype=author&query=Tom Gedeon) 作者：Md Rakibul Hasan, Md Zakir Hossain, Aneesh Krishna, Shafin Rahman, Tom Gedeon

Supervised learning for empathy regression is challenged by noisy self-reported empathy scores. While many algorithms have been proposed for learning with noisy labels in textual classification problems, the regression counterpart is relatively under-explored. We propose UPLME, an uncertainty-aware probabilistic language modelling framework to capture label noise in the regression setting of empathy detection. UPLME includes a probabilistic language model that predicts both empathy score and heteroscedastic uncertainty and is trained using Bayesian concepts with variational model ensembling. We further introduce two novel loss components: one penalises degenerate Uncertainty Quantification (UQ), and another enforces the similarity between the input pairs on which we predict empathy. UPLME provides state-of-the-art performance (Pearson Correlation Coefficient: 0.558→0.580 and 0.629→0.634) in terms of the performance reported in the literature in two public benchmarks, having label noise. Through synthetic label noise injection, we show that UPLME is effective in separating noisy and clean samples based on the predicted uncertainty. UPLME further outperform (Calibration error: 0.571→0.376) a recent variational model ensembling-based UQ method designed for regression problems. 监督学习中的同理心回归面临着自我报告的同理心得分噪声问题。尽管在文本分类问题中已有许多针对带噪标签学习的算法被提出，但回归问题相对较少被探索。我们提出了 UPLME，一种不确定性感知的概率语言建模框架，用于捕捉同理心检测回归设置中的标签噪声。UPLME 包含一个概率语言模型，该模型预测同理心得分及异方差不确定性，并通过贝叶斯概念结合变分模型集成进行训练。我们进一步引入了两个新颖的损失组件：一个惩罚退化的不确定性量化（UQ），另一个则强制输入对之间的同理心预测相似性。UPLME 在两个带有标签噪声的公开基准测试中，基于文献报道的性能，提供了最先进的表现（皮尔逊相关系数： 0.558→0.580 和 0.629→0.634 ）。通过合成标签噪声注入，我们展示了 UPLME 在基于预测不确定性区分噪声样本和干净样本方面的有效性。 UPLME 进一步优于（校准误差： 0.571→0.376 ）一种最近为回归问题设计的基于变分模型集成的不确定性量化方法。

Subjects: Computation and Language, Machine Learning 主题：计算与语言，机器学习

Publish: 2025-08-05 14:46:28 UTC 发布时间：2025-08-05 14:46:28 UTC

#13 CF-RAG: A Dataset and Method for Carbon Footprint QA Using Retrieval-Augmented Generation #13 CF-RAG：一种使用检索增强生成的碳足迹问答数据集和方法

Authors: [Kaiwen Zhao](https://arxiv.org/search/?searchtype=author&query=Kaiwen Zhao), [Bharathan Balaji](https://arxiv.org/search/?searchtype=author&query=Bharathan Balaji), [Stephen Lee](https://arxiv.org/search/?searchtype=author&query=Stephen Lee) 作者：赵凯文，Bharathan Balaji，Stephen Lee

Product sustainability reports provide valuable insights into the environmental impacts of a product and are often distributed in PDF format. These reports often include a combination of tables and text, which complicates their analysis. The lack of standardization and the variability in reporting formats further exacerbate the difficulty of extracting and interpreting relevant information from large volumes of documents. In this paper, we tackle the challenge of answering questions related to carbon footprints within sustainability reports available in PDF format. Unlike previous approaches, our focus is on addressing the difficulties posed by the unstructured and inconsistent nature of text extracted from PDF parsing. To facilitate this analysis, we introduce CarbonPDF-QA, an open-source dataset containing question-answer pairs for 1735 product report documents, along with human-annotated answers. Our analysis shows that GPT-4o struggles to answer questions with data inconsistencies. To address this limitation, we propose CarbonPDF, an LLM-based technique specifically designed to answer carbon footprint questions on such datasets. We develop CarbonPDF by fine-tuning Llama 3 with our training data. Our results show that our technique outperforms current state-of-the-art techniques, including question-answering (QA) systems finetuned on table and text data. 产品可持续性报告提供了有关产品环境影响的宝贵见解，通常以 PDF 格式分发。这些报告通常包含表格和文本的组合，增加了分析的复杂性。缺乏标准化和报告格式的多样性进一步加剧了从大量文档中提取和解读相关信息的难度。本文中，我们着手解决在 PDF 格式的可持续性报告中回答与碳足迹相关问题的挑战。与以往方法不同，我们重点应对从 PDF 解析中提取的文本非结构化和不一致性带来的困难。为促进此类分析，我们引入了 CarbonPDF-QA，这是一个开源数据集，包含 1735 份产品报告文档的问题-答案对及人工注释的答案。我们的分析显示，GPT-4o 在面对数据不一致时难以准确回答问题。为解决这一限制，我们提出了 CarbonPDF，一种基于 LLM 的技术，专门用于在此类数据集上回答碳足迹相关问题。我们通过使用训练数据微调 Llama 3 来开发 CarbonPDF。我们的结果显示，该技术优于当前的最先进技术，包括在表格和文本数据上微调的问答（QA）系统。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-05 14:20:10 UTC 发布时间：2025-08-05 14:20:10 UTC

#14 fact check AI at SemEval-2025 Task 7: Multilingual and Crosslingual Fact-checked Claim Retrieval #14 SemEval-2025 任务 7：多语言和跨语言事实核查声明检索

Author: [Pranshu Rastogi](https://arxiv.org/search/?searchtype=author&query=Pranshu Rastogi) 作者：Pranshu Rastogi

SemEval-2025 Task 7: Multilingual and Crosslingual Fact-Checked Claim Retrieval is approached as a Learning-to-Rank task using a bi-encoder model fine-tuned from a pre-trained transformer optimized for sentence similarity. Training used both the source languages and their English translations for multilingual retrieval and only English translations for cross-lingual retrieval. Using lightweight models with fewer than 500M parameters and training on Kaggle T4 GPUs, the method achieved 92% Success@10 in multilingual and 80% Success@10 in 5th in crosslingual and 10th in multilingual tracks. SemEval-2025 任务 7：多语言和跨语言事实核查声明检索被视为一个学习排序任务，采用了从预训练的变换器微调而来的双编码器模型，该变换器针对句子相似度进行了优化。训练使用了源语言及其英文翻译以实现多语言检索，仅使用英文翻译以实现跨语言检索。该方法使用参数少于 5 亿的轻量级模型，并在 Kaggle T4 GPU 上训练，在多语言任务中达到了 92%的 Success@10，在跨语言任务中达到了 80%的 Success@10，分别在跨语言和多语言赛道中排名第 5 和第 10。

Subjects: Computation and Language, Artificial Intelligence, Information Retrieval 主题：计算与语言，人工智能，信息检索

Publish: 2025-08-05 14:10:09 UTC 发布时间：2025-08-05 14:10:09 UTC

#15 Cropping outperforms dropout as an augmentation strategy for training self-supervised text embeddings #15 裁剪作为训练自监督文本嵌入的增强策略优于丢弃法

Authors: [Rita González-Márquez](https://arxiv.org/search/?searchtype=author&query=Rita González-Márquez), [Philipp Berens](https://arxiv.org/search/?searchtype=author&query=Philipp Berens), [Dmitry Kobak](https://arxiv.org/search/?searchtype=author&query=Dmitry Kobak) 作者：Rita González-Márquez, Philipp Berens, Dmitry Kobak

Text embeddings, i.e. vector representations of entire texts, play an important role in many NLP applications, such as retrieval-augmented generation, sentiment analysis, clustering, or visualizing collections of texts for data exploration. Currently, top-performing embedding models are derived from pre-trained language models via extensive supervised fine-tuning using curated text pairs. This contrasts with computer vision, where self-supervised training based on data augmentations has demonstrated remarkable success. Here we systematically compare the two most well-known augmentation strategies for positive pair generation in contrastive learning of text embeddings. We assess embedding quality on MTEB and additional in-domain evaluations and show that cropping augmentation strongly outperforms the dropout-based approach. We find that on out-of-domain data, the quality of resulting embeddings is below the supervised SOTA models, but for in-domain data, self-supervised fine-tuning produces high-quality text embeddings after very short fine-tuning, sometimes only marginally below the supervised SOTA. Finally, we show that representation quality increases towards the last transformer layers, which undergo the largest change during fine-tuning; and that fine-tuning only those last layers is sufficient to reach similar embedding quality. 文本嵌入，即整个文本的向量表示，在许多自然语言处理应用中起着重要作用，如检索增强生成、情感分析、聚类或用于数据探索的文本集合可视化。目前，表现最优的嵌入模型是通过使用精心策划的文本对进行大量监督微调，从预训练语言模型中派生出来的。这与计算机视觉形成对比，后者基于数据增强的自监督训练已显示出显著成功。在此，我们系统地比较了对比学习中文本嵌入正样本对生成的两种最著名的增强策略。我们在 MTEB 及额外的领域内评估中评估嵌入质量，结果显示裁剪增强显著优于基于 dropout 的方法。我们发现，在领域外数据上，生成的嵌入质量低于监督的 SOTA 模型，但对于领域内数据，自监督微调经过非常短的训练后能够产生高质量的文本嵌入，有时仅略低于监督的 SOTA 水平。最后，我们展示了表示质量在最后的变换器层中提升，这些层在微调过程中经历了最大的变化；并且仅微调这些最后的层就足以达到类似的嵌入质量。

Subjects: Computation and Language, Machine Learning 主题：计算与语言，机器学习

Publish: 2025-08-05 13:54:01 UTC 发布时间：2025-08-05 13:54:01 UTC

#16 LLMs Have a Heart of Stone: Demystifying the Soft Thinking Ability of Large Reasoning Models

Authors: [Junhong Wu](https://arxiv.org/search/?searchtype=author&query=Junhong Wu), [Jinliang Lu](https://arxiv.org/search/?searchtype=author&query=Jinliang Lu), [Zixuan Ren](https://arxiv.org/search/?searchtype=author&query=Zixuan Ren), [Ganqiang Hu](https://arxiv.org/search/?searchtype=author&query=Ganqiang Hu), [Zhi Wu](https://arxiv.org/search/?searchtype=author&query=Zhi Wu), [Dai Dai](https://arxiv.org/search/?searchtype=author&query=Dai Dai), [Hua Wu](https://arxiv.org/search/?searchtype=author&query=Hua Wu) 作者：吴俊宏，卢金良，任子轩，胡干强，吴志，戴岱，吴华

Human cognition naturally engages with abstract and fluid concepts, whereas existing reasoning models often rely on generating discrete tokens, potentially constraining their expressive capabilities. Recent advancements aim to address this limitation by enabling large language models (LLMs) to generate soft, abstract tokens, thus facilitating reasoning within a continuous concept space. This paper explores the `Soft Thinking’ capabilities of various LLMs by examining the models’ internal behavior using a suite of probing techniques. Contrary to the common belief that Soft Thinking enables the simultaneous exploration of diverse reasoning paths, our findings reveal that LLMs predominantly rely on the most influential component of the soft inputs during subsequent decoding steps. This reliance hinders the exploration of different reasoning paths and reduces vanilla Soft Thinking to a form of greedy decoding, obscuring the advantage of transmitting more information through Soft Tokens. To tackle this issue, we explore sampling strategies to introduce \emph{randomness}, employing methods such as Dirichlet resampling and the Gumbel-Softmax trick. Our experiments demonstrate that incorporating randomness can alleviate the limitations of vanilla approaches and unleash the potential of Soft Thinking. Notably, the Gumbel-Softmax trick provides adequate randomness with controlled smoothness, resulting in superior performance across eight reasoning benchmarks. 人类认知自然地处理抽象且流动的概念，而现有的推理模型通常依赖于生成离散的标记，这可能限制了它们的表达能力。近期的进展旨在通过使大型语言模型（LLMs）生成软性、抽象的标记，从而促进在连续概念空间中的推理，来解决这一限制。本文通过一系列探测技术，考察了各种 LLMs 的“软思维”能力，分析模型的内部行为。与普遍认为软思维能够同时探索多条推理路径的观点相反，我们的研究发现，LLMs 在后续解码步骤中主要依赖软输入中最具影响力的成分。这种依赖阻碍了不同推理路径的探索，使得普通的软思维退化为一种贪婪解码，掩盖了通过软标记传递更多信息的优势。为了解决这一问题，我们探索了引入\emph{随机性}的采样策略，采用了 Dirichlet 重采样和 Gumbel-Softmax 技巧等方法。我们的实验表明，加入随机性可以缓解传统方法的局限性，释放软思维的潜力。值得注意的是，Gumbel-Softmax 技巧在保持平滑性的同时提供了足够的随机性，在八个推理基准测试中表现出色。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-05 13:38:33 UTC 发布时间：2025-08-05 13:38:33 UTC

#17 Variety Is the Spice of Life: Detecting Misinformation with Dynamic Environmental Representations #17 生活的多样性是调味品：利用动态环境表示检测错误信息

Authors: [Bing Wang](https://arxiv.org/search/?searchtype=author&query=Bing Wang), [Ximing Li](https://arxiv.org/search/?searchtype=author&query=Ximing Li), [Yiming Wang](https://arxiv.org/search/?searchtype=author&query=Yiming Wang), [Changchun Li](https://arxiv.org/search/?searchtype=author&query=Changchun Li), [Jiaxu Cui](https://arxiv.org/search/?searchtype=author&query=Jiaxu Cui), [Renchu Guan](https://arxiv.org/search/?searchtype=author&query=Renchu Guan), [Bo Yang](https://arxiv.org/search/?searchtype=author&query=Bo Yang) 作者：王兵，李熙明，王一鸣，李长春，崔嘉旭，关仁初，杨波

The proliferation of misinformation across diverse social media platforms has drawn significant attention from both academic and industrial communities due to its detrimental effects. Accordingly, automatically distinguishing misinformation, dubbed as Misinformation Detection (MD), has become an increasingly active research topic. The mainstream methods formulate MD as a static learning paradigm, which learns the mapping between the content, links, and propagation of news articles and the corresponding manual veracity labels. However, the static assumption is often violated, since in real-world scenarios, the veracity of news articles may vacillate within the dynamically evolving social environment. To tackle this problem, we propose a novel framework, namely Misinformation detection with Dynamic Environmental Representations (MISDER). The basic idea of MISDER lies in learning a social environmental representation for each period and employing a temporal model to predict the representation for future periods. In this work, we specify the temporal model as the LSTM model, continuous dynamics equation, and pre-trained dynamics system, suggesting three variants of MISDER, namely MISDER-LSTM, MISDER-ODE, and MISDER-PT, respectively. To evaluate the performance of MISDER, we compare it to various MD baselines across 2 prevalent datasets, and the experimental results can indicate the effectiveness of our proposed model. 各种社交媒体平台上错误信息的泛滥因其有害影响引起了学术界和工业界的广泛关注。因此，自动区分错误信息，即所谓的错误信息检测（Misinformation Detection，MD），已成为一个日益活跃的研究课题。主流方法将 MD 视为一种静态学习范式，学习新闻内容、链接和传播与对应人工真实性标签之间的映射关系。然而，这种静态假设常常被打破，因为在现实场景中，新闻的真实性可能会在动态变化的社会环境中波动。为了解决这一问题，我们提出了一种新颖的框架，即基于动态环境表示的错误信息检测（Misinformation detection with Dynamic Environmental Representations，MISDER）。MISDER 的基本思想是为每个时间段学习一个社会环境表示，并利用时间模型预测未来时间段的表示。在本工作中，我们将时间模型指定为 LSTM 模型、连续动力学方程和预训练动力学系统，提出了 MISDER 的三种变体，分别是 MISDER-LSTM、MISDER-ODE 和 MISDER-PT。为了评估 MISDER 的性能，我们在两个流行的数据集上将其与各种 MD 基线方法进行了比较，实验结果表明了我们所提模型的有效性。

Subjects: Computation and Language, Social and Information Networks 主题：计算与语言，社会与信息网络

Publish: 2025-08-05 13:01:13 UTC 发布时间：2025-08-05 13:01:13 UTC

#18 ReDSM5: A Reddit Dataset for DSM-5 Depression Detection #18 ReDSM5：用于 DSM-5 抑郁症检测的 Reddit 数据集

Authors: [Eliseo Bao](https://arxiv.org/search/?searchtype=author&query=Eliseo Bao), [Anxo Pérez](https://arxiv.org/search/?searchtype=author&query=Anxo Pérez), [Javier Parapar](https://arxiv.org/search/?searchtype=author&query=Javier Parapar) 作者：Eliseo Bao，Anxo Pérez，Javier Parapar

Depression is a pervasive mental health condition that affects hundreds of millions of individuals worldwide, yet many cases remain undiagnosed due to barriers in traditional clinical access and pervasive stigma. Social media platforms, and Reddit in particular, offer rich, user-generated narratives that can reveal early signs of depressive symptomatology. However, existing computational approaches often label entire posts simply as depressed or not depressed, without linking language to specific criteria from the DSM-5, the standard clinical framework for diagnosing depression. This limits both clinical relevance and interpretability. To address this gap, we introduce ReDSM5, a novel Reddit corpus comprising 1484 long-form posts, each exhaustively annotated at the sentence level by a licensed psychologist for the nine DSM-5 depression symptoms. For each label, the annotator also provides a concise clinical rationale grounded in DSM-5 methodology. We conduct an exploratory analysis of the collection, examining lexical, syntactic, and emotional patterns that characterize symptom expression in social media narratives. Compared to prior resources, ReDSM5 uniquely combines symptom-specific supervision with expert explanations, facilitating the development of models that not only detect depression but also generate human-interpretable reasoning. We establish baseline benchmarks for both multi-label symptom classification and explanation generation, providing reference results for future research on detection and interpretability. 抑郁症是一种普遍存在的心理健康状况，影响着全球数亿人，但由于传统临床访问的障碍和普遍存在的污名，许多病例仍未被诊断。社交媒体平台，尤其是 Reddit，提供了丰富的用户生成叙述，能够揭示抑郁症状的早期迹象。然而，现有的计算方法通常仅将整篇帖子简单标记为“抑郁”或“非抑郁”，而未将语言与 DSM-5（诊断抑郁症的标准临床框架）中的具体标准联系起来。这限制了临床相关性和可解释性。为了解决这一空白，我们引入了 ReDSM5，这是一个新的 Reddit 语料库，包含 1484 篇长篇帖子，每篇帖子均由持证心理学家在句子层面详尽注释，涵盖 DSM-5 中九项抑郁症状。对于每个标签，注释者还提供了基于 DSM-5 方法论的简明临床理由。我们对该语料库进行了探索性分析，考察了表征社交媒体叙述中症状表达的词汇、句法和情感模式。与以往资源相比，ReDSM5 独特地结合了症状特异性监督与专家解释，促进了不仅能检测抑郁症且能生成人类可理解推理的模型开发。我们为多标签症状分类和解释生成建立了基线基准，为未来关于检测和可解释性的研究提供了参考结果。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-05 12:48:06 UTC 发布时间：2025-08-05 12:48:06 UTC

#19 Thinking with Nothinking Calibration: A New In-Context Learning Paradigm in Reasoning Large Language Models #19 无思考校准思维：推理大型语言模型中的一种新型上下文学习范式

Authors: [Haotian Wu](https://arxiv.org/search/?searchtype=author&query=Haotian Wu), [Bo Xu](https://arxiv.org/search/?searchtype=author&query=Bo Xu), [Yao Shu](https://arxiv.org/search/?searchtype=author&query=Yao Shu), [Menglin Yang](https://arxiv.org/search/?searchtype=author&query=Menglin Yang), [Chengwei Qin](https://arxiv.org/search/?searchtype=author&query=Chengwei Qin) 作者：吴昊天、徐博、舒尧、杨梦琳、秦承伟

Reasoning large language models (RLLMs) have recently demonstrated remarkable capabilities through structured and multi-step reasoning. While prior research has primarily focused on improving their training and inference strategies, their potential for in-context learning (ICL) remains largely underexplored. To fill this gap, we propose Thinking with Nothinking Calibration (JointThinking), a new ICL paradigm that leverages the structured difference between two reasoning modes, i.e., Thinking and Nothinking, to improve reasoning accuracy. Specifically, our method prompts the model to generate two answers in parallel: one in Thinking mode and the other in Nothinking mode. A second round of Thinking is triggered only when the two initial responses are inconsistent, using a single prompt that incorporates the original question and both candidate answers. Since such disagreement occurs infrequently (e.g., only 6% in GSM8K), our method performs just one round of reasoning in most cases, resulting in minimal latency overhead. Extensive experiments across multiple reasoning benchmarks demonstrate that JointThinking significantly outperforms few-shot chain-of-thought (CoT) and majority voting with improved answer robustness. Moreover, It achieves comparable in-distribution performance to training-based SOTA method, while substantially outperforming on out-of-distribution tasks. We further conduct a systematic analysis of the calibration mechanism, showing that leveraging different reasoning modes consistently lowers the error rate and highlights the value of structural thinking diversity. Additionally, we observe that the performance gap between actual and ideal reasoning narrows as model size increases in the second round of thinking, indicating the strong scalability of our approach. Finally, we discuss current limitations and outline promising directions for future ICL research in RLLMs. 推理大型语言模型（RLLMs）最近通过结构化和多步骤推理展示了显著的能力。尽管先前的研究主要集中在改进其训练和推理策略，但其在上下文学习（ICL）方面的潜力仍然很少被探索。为填补这一空白，我们提出了“无思考校准推理”（JointThinking），这是一种新的 ICL 范式，利用两种推理模式——思考（Thinking）和无思考（Nothinking）——之间的结构性差异来提高推理准确性。具体而言，我们的方法促使模型并行生成两个答案：一个处于思考模式，另一个处于无思考模式。仅当两个初始回答不一致时，才触发第二轮思考，使用包含原始问题和两个候选答案的单一提示。由于这种分歧发生频率较低（例如，在 GSM8K 中仅为 6%），我们的方法在大多数情况下只执行一轮推理，从而带来极小的延迟开销。在多个推理基准上的大量实验表明，JointThinking 显著优于少样本链式思维（CoT）和多数投票法，且答案的鲁棒性有所提升。此外，它在分布内任务上的表现可与基于训练的 SOTA 方法相媲美，而在分布外任务上则大幅领先。我们进一步对校准机制进行了系统分析，显示利用不同的推理模式能够持续降低错误率，凸显了结构化思维多样性的价值。此外，我们观察到在第二轮思考中，实际推理与理想推理之间的性能差距随着模型规模的增大而缩小，表明我们方法具有很强的可扩展性。最后，我们讨论了当前的局限性，并概述了未来在 RLLMs 中进行 ICL 研究的有前景方向。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-05 12:09:55 UTC 发布时间：2025-08-05 12:09:55 UTC

Authors: [Tiago G Canário](https://arxiv.org/search/?searchtype=author&query=Tiago G Canário), [Catarina Duarte](https://arxiv.org/search/?searchtype=author&query=Catarina Duarte), [Flávio L. Pinheiro](https://arxiv.org/search/?searchtype=author&query=Flávio L. Pinheiro), [João L. M. Pereira](https://arxiv.org/search/?searchtype=author&query=João L. M. Pereira) 作者：Tiago G Canário，Catarina Duarte，Flávio L. Pinheiro，João L. M. Pereira

Automatically identifying characters and their interactions from fiction books is, arguably, a complex task that requires pipelines that leverage multiple Natural Language Processing (NLP) methods, such as Named Entity Recognition (NER) and Part-of-speech (POS) tagging. However, these methods are not optimized for the task that leads to the construction of Social Networks of Characters. Indeed, the currently available methods tend to underperform, especially in less-represented languages, due to a lack of manually annotated data for training. Here, we propose a pipeline, which we call Taggus, to extract social networks from literary fiction works in Portuguese. Our results show that compared to readily available State-of-the-Art tools – off-the-shelf NER tools and Large Language Models (ChatGPT) – the resulting pipeline, which uses POS tagging and a combination of heuristics, achieves satisfying results with an average F1-Score of 94.1% in the task of identifying characters and solving for co-reference and 75.9% in interaction detection. These represent, respectively, an increase of 50.7% and 22.3% on results achieved by the readily available State-of-the-Art tools. Further steps to improve results are outlined, such as solutions for detecting relationships between characters. Limitations on the size and scope of our testing samples are acknowledged. The Taggus pipeline is publicly available to encourage development in this field for the Portuguese language.2 自动识别小说中的人物及其互动，可以说是一项复杂的任务，需要利用多种自然语言处理（NLP）方法的流程，如命名实体识别（NER）和词性标注（POS）。然而，这些方法并未针对构建人物社交网络的任务进行优化。实际上，目前可用的方法表现往往不佳，尤其是在资源较少的语言中，原因在于缺乏用于训练的人工标注数据。在此，我们提出了一种名为 Taggus 的流程，用于从葡萄牙语文学作品中提取社交网络。我们的结果显示，与现成的最先进工具——现成的 NER 工具和大型语言模型（ChatGPT）相比，该流程结合了词性标注和多种启发式方法，在识别人物及解决共指问题的任务中，平均 F1 分数达到 94.1% ，在互动检测中达到 75.9% 。这分别比现成的最先进工具的结果提高了 50.7% 和 22.3% 。进一步改进结果的步骤被概述，例如检测角色之间关系的解决方案。我们承认测试样本的大小和范围存在限制。Taggus 流程公开发布，以鼓励葡萄牙语领域的发展。2

Subjects: Computation and Language, Information Retrieval 主题：计算与语言，信息检索

Publish: 2025-08-05 12:03:03 UTC 发布时间：2025-08-05 12:03:03 UTC

#21 CTTS: Collective Test-Time Scaling #21 CTTS：集体测试时缩放

Authors: [Zhende Song](https://arxiv.org/search/?searchtype=author&query=Zhende Song), [Shengji Tang](https://arxiv.org/search/?searchtype=author&query=Shengji Tang), [Peng Ye](https://arxiv.org/search/?searchtype=author&query=Peng Ye), [Jiayuan Fan](https://arxiv.org/search/?searchtype=author&query=Jiayuan Fan), [Tao Chen](https://arxiv.org/search/?searchtype=author&query=Tao Chen) 作者：宋振德，唐胜吉，叶鹏，范佳元，陈涛

Test-time scaling (TTS) has emerged as a promising research field for enhancing the effectiveness of large language models (LLMs) without extra training. However, most existing approaches, e.g., Best-of-N and Self-Consistency rely on a single agent interacting with a reward model (SA-SR), constrained by limited capabilities of a single test-time scaling (STTS) paradigm. On the other hand, recent works demonstrate that collective-agent methods can break through the upper bound of single-agent systems by orchestrating diverse models. Thus, in this paper, we take a first step towards exploring Collective Test-Time Scaling (CTTS). Consider the different interaction types of single and multiple models, we design three primary paradigms to investigate the optimal paradigm of CTTS: (1) single agent to multiple reward models (SA-MR); (2) multiple agents to single reward model (MA-SR); and (3) multiple agents to multiple reward models (MA-MR). Extensive experiments demonstrate that MA-MR consistently achieves the best performance. Based on this, we propose a novel framework named CTTS-MM that effectively leverages both multi-agent and multi-reward-model collaboration for enhanced inference. Specifically, for multi-agent collaboration, we propose an Agent Collaboration Search (ACS), which searches for the most effective combination of LLM agents from a large candidate pool; for multi-reward-model collaboration, we propose Mixture of Reword Models (MoR), which consists of a curated question pool and a Prior Reward model Ensemble Selection (PRES) to select the optimal combinations of reward models via Pair-wise Reward Ranking (PRR) metric. Experiments across seven mainstream benchmarks demonstrate that the proposed CTTS-MM consistently obtains superior performance. Code will be released at https://github.com/magent4aci/CTTS-MM. 测试时缩放（TTS）作为一种无需额外训练即可提升大型语言模型（LLMs）效果的有前景的研究领域，逐渐受到关注。然而，大多数现有方法，如 Best-of-N 和 Self-Consistency，依赖于单一代理与奖励模型的交互（SA-SR），受限于单一测试时缩放（STTS）范式的能力限制。另一方面，近期研究表明，集体代理方法通过协调多样化模型，能够突破单代理系统的上限。因此，本文迈出了探索集体测试时缩放（CTTS）的第一步。考虑到单模型与多模型的不同交互类型，我们设计了三种主要范式以探究 CTTS 的最优范式：（1）单代理对多个奖励模型（SA-MR）；（2）多个代理对单一奖励模型（MA-SR）；（3）多个代理对多个奖励模型（MA-MR）。大量实验表明，MA-MR 始终取得最佳性能。基于此，我们提出了一种名为 CTTS-MM 的新框架，有效利用多代理与多奖励模型的协作，实现更优的推理效果。具体来说，对于多智能体协作，我们提出了智能体协作搜索（Agent Collaboration Search，ACS），该方法从大量候选池中搜索最有效的 LLM 智能体组合；对于多奖励模型协作，我们提出了奖励模型混合（Mixture of Reward Models，MoR），其包括一个精心策划的问题池和先验奖励模型集成选择（Prior Reward model Ensemble Selection，PRES），通过成对奖励排名（Pair-wise Reward Ranking，PRR）指标选择最优的奖励模型组合。跨七个主流基准的实验表明，所提出的 CTTS-MM 始终获得优越的性能。代码将发布于 https://github.com/magent4aci/CTTS-MM。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-05 11:19:08 UTC 发布时间：2025-08-05 11:19:08 UTC

#22 Towards Trustworthy Multimodal Moderation via Policy-Aligned Reasoning and Hierarchical Labeling #22 通过策略对齐推理和分层标注迈向可信的多模态审核

Authors: [Anqi Li](https://arxiv.org/search/?searchtype=author&query=Anqi Li), [Wenwei Jin](https://arxiv.org/search/?searchtype=author&query=Wenwei Jin), [Jintao Tong](https://arxiv.org/search/?searchtype=author&query=Jintao Tong), [Pengda Qin](https://arxiv.org/search/?searchtype=author&query=Pengda Qin), [Weijia Li](https://arxiv.org/search/?searchtype=author&query=Weijia Li), [Guo Lu](https://arxiv.org/search/?searchtype=author&query=Guo Lu) 作者：李安琪，金文伟，童金涛，秦鹏达，李伟佳，卢国

Social platforms have revolutionized information sharing, but also accelerated the dissemination of harmful and policy-violating content. To ensure safety and compliance at scale, moderation systems must go beyond efficiency and offer accuracy and interpretability. However, current approaches largely rely on noisy, label-driven learning, lacking alignment with moderation rules and producing opaque decisions that hinder human review. Therefore, we propose Hierarchical Guard (Hi-Guard), a multimodal moderation framework that introduces a new policy-aligned decision paradigm. The term “Hierarchical” reflects two key aspects of our system design: (1) a hierarchical moderation pipeline, where a lightweight binary model first filters safe content and a stronger model handles fine-grained risk classification; and (2) a hierarchical taxonomy in the second stage, where the model performs path-based classification over a hierarchical taxonomy ranging from coarse to fine-grained levels. To ensure alignment with evolving moderation policies, Hi-Guard directly incorporates rule definitions into the model prompt. To further enhance structured prediction and reasoning, we introduce a multi-level soft-margin reward and optimize with Group Relative Policy Optimization (GRPO), penalizing semantically adjacent misclassifications and improving explanation quality. Extensive experiments and real-world deployment demonstrate that Hi-Guard achieves superior classification accuracy, generalization, and interpretability, paving the way toward scalable, transparent, and trustworthy content safety systems. Code is available at: https://github.com/lianqi1008/Hi-Guard. 社交平台彻底改变了信息共享的方式，但也加速了有害和违规内容的传播。为了在大规模上确保安全和合规，审核系统必须不仅追求效率，还要具备准确性和可解释性。然而，当前的方法大多依赖于噪声较多的标签驱动学习，缺乏与审核规则的对齐，且产生的决策不透明，阻碍了人工复核。因此，我们提出了分层守护（Hi-Guard），一种多模态审核框架，引入了新的与政策对齐的决策范式。“分层”一词反映了我们系统设计的两个关键方面：（1）分层审核流程，先由轻量级二分类模型过滤安全内容，再由更强大的模型进行细粒度风险分类；（2）第二阶段的分层分类法，模型在一个从粗到细的分层分类体系中执行基于路径的分类。为了确保与不断变化的审核政策保持一致，Hi-Guard 直接将规则定义融入模型提示中。为了进一步增强结构化预测和推理，我们引入了多层次软边际奖励，并采用群体相对策略优化（GRPO）进行优化，对语义相近的误分类进行惩罚，从而提升解释质量。大量实验和实际部署表明，Hi-Guard 在分类准确性、泛化能力和可解释性方面表现优异，为实现可扩展、透明且可信的内容安全系统铺平了道路。代码可在此获取：https://github.com/lianqi1008/Hi-Guard。

Subjects: Computation and Language, Machine Learning 主题：计算与语言，机器学习

Publish: 2025-08-05 10:16:04 UTC 发布时间：2025-08-05 10:16:04 UTC

#23 NLP Methods May Actually Be Better Than Professors at Estimating Question Difficulty #23 自然语言处理方法可能比教授更擅长估计问题难度

Authors: [Leonidas Zotos](https://arxiv.org/search/?searchtype=author&query=Leonidas Zotos), [Ivo Pascal de Jong](https://arxiv.org/search/?searchtype=author&query=Ivo Pascal de Jong), [Matias Valdenegro-Toro](https://arxiv.org/search/?searchtype=author&query=Matias Valdenegro-Toro), [Andreea Ioana Sburlea](https://arxiv.org/search/?searchtype=author&query=Andreea Ioana Sburlea), [Malvina Nissim](https://arxiv.org/search/?searchtype=author&query=Malvina Nissim), [Hedderik van Rijn](https://arxiv.org/search/?searchtype=author&query=Hedderik van Rijn) 作者：Leonidas Zotos, Ivo Pascal de Jong, Matias Valdenegro-Toro, Andreea Ioana Sburlea, Malvina Nissim, Hedderik van Rijn

Estimating the difficulty of exam questions is essential for developing good exams, but professors are not always good at this task. We compare various Large Language Model-based methods with three professors in their ability to estimate what percentage of students will give correct answers on True/False exam questions in the areas of Neural Networks and Machine Learning. Our results show that the professors have limited ability to distinguish between easy and difficult questions and that they are outperformed by directly asking Gemini 2.5 to solve this task. Yet, we obtained even better results using uncertainties of the LLMs solving the questions in a supervised learning setting, using only 42 training samples. We conclude that supervised learning using LLM uncertainty can help professors better estimate the difficulty of exam questions, improving the quality of assessment. 估计考试题目的难度对于制定好的考试至关重要，但教授们并不总是擅长这项任务。我们比较了基于 LLMs 的多种方法与三位教授在估计学生在神经网络和机器学习领域的判断题中答对比例的能力。我们的结果显示，教授们在区分简单和困难题目方面能力有限，而直接让 Gemini 2.5 完成这项任务的表现更优。然而，我们在使用监督学习设置中利用 LLMs 解题时的不确定性，仅用 42 个训练样本，获得了更好的结果。我们得出结论，利用 LLM 不确定性的监督学习可以帮助教授更好地估计考试题目的难度，从而提升评估质量。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-05 10:12:38 UTC 发布时间：2025-08-05 10:12:38 UTC

#24 Investigating Gender Bias in LLM-Generated Stories via Psychological Stereotypes #24 通过心理刻板印象调查 LLM 生成故事中的性别偏见

Authors: [Shahed Masoudian](https://arxiv.org/search/?searchtype=author&query=Shahed Masoudian), [Gustavo Escobedo](https://arxiv.org/search/?searchtype=author&query=Gustavo Escobedo), [Hannah Strauss](https://arxiv.org/search/?searchtype=author&query=Hannah Strauss), [Markus Schedl](https://arxiv.org/search/?searchtype=author&query=Markus Schedl) 作者：Shahed Masoudian，Gustavo Escobedo，Hannah Strauss，Markus Schedl

As Large Language Models (LLMs) are increasingly used across different applications, concerns about their potential to amplify gender biases in various tasks are rising. Prior research has often probed gender bias using explicit gender cues as counterfactual, or studied them in sentence completion and short question answering tasks. These formats might overlook more implicit forms of bias embedded in generative behavior of longer content. In this work, we investigate gender bias in LLMs using gender stereotypes studied in psychology (e.g., aggressiveness or gossiping) in an open-ended task of narrative generation. We introduce a novel dataset called StereoBias-Stories containing short stories either unconditioned or conditioned on (one, two, or six) random attributes from 25 psychological stereotypes and three task-related story endings. We analyze how the gender contribution in the overall story changes in response to these attributes and present three key findings: (1) While models, on average, are highly biased towards male in unconditioned prompts, conditioning on attributes independent from gender stereotypes mitigates this bias. (2) Combining multiple attributes associated with the same gender stereotype intensifies model behavior, with male ones amplifying bias and female ones alleviating it. (3) Model biases align with psychological ground-truth used for categorization, and alignment strength increases with model size. Together, these insights highlight the importance of psychology-grounded evaluation of LLMs. 随着大型语言模型（LLMs）在不同应用中的广泛使用，人们对其在各种任务中可能加剧性别偏见的担忧也在增加。以往的研究通常通过显性的性别线索作为反事实来探讨性别偏见，或在句子补全和简短问答任务中进行研究。这些形式可能忽视了更隐性的偏见，这些偏见嵌入在生成较长内容的行为中。在本研究中，我们利用心理学中研究的性别刻板印象（例如攻击性或八卦）在开放式叙事生成任务中调查 LLMs 中的性别偏见。我们引入了一个名为 StereoBias-Stories 的新数据集，包含短篇故事，这些故事要么不带条件，要么基于 25 种心理刻板印象中的（一个、两个或六个）随机属性以及三个与任务相关的故事结局进行条件生成。我们分析了整体故事中性别成分如何响应这些属性的变化，并提出了三个关键发现：（1）虽然模型在无条件提示下平均表现出高度的男性偏见，但基于与性别刻板印象无关的属性进行条件生成可以缓解这种偏见。 (2) 结合与同一性别刻板印象相关的多个属性会加剧模型行为，其中男性属性会放大偏见，女性属性则会缓解偏见。(3) 模型偏见与用于分类的心理学真实依据相符，且这种一致性的强度随着模型规模的增大而增强。综合来看，这些见解凸显了基于心理学的 LLMs 评估的重要性。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-05 10:10:26 UTC 发布时间：2025-08-05 10:10:26 UTC

#25 Do language models accommodate their users? A study of linguistic convergence #25 语言模型是否适应其用户？一项语言趋同的研究

Authors: [Terra Blevins](https://arxiv.org/search/?searchtype=author&query=Terra Blevins), [Susanne Schmalwieser](https://arxiv.org/search/?searchtype=author&query=Susanne Schmalwieser), [Benjamin Roth](https://arxiv.org/search/?searchtype=author&query=Benjamin Roth) 作者：Terra Blevins，Susanne Schmalwieser，Benjamin Roth

While large language models (LLMs) are generally considered proficient in generating language, how similar their language usage is to that of humans remains understudied. In this paper, we test whether models exhibit linguistic convergence, a core pragmatic element of human language communication, asking: do models adapt, or converge, to the linguistic patterns of their user? To answer this, we systematically compare model completions of exisiting dialogues to the original human responses across sixteen language models, three dialogue corpora, and a variety of stylometric features. We find that models strongly converge to the conversation’s style, often significantly overfitting relative to the human baseline. While convergence patterns are often feature-specific, we observe consistent shifts in convergence across modeling settings, with instruction-tuned and larger models converging less than their pretrained counterparts. Given the differences between human and model convergence patterns, we hypothesize that the underlying mechanisms for these behaviors are very different. 虽然大型语言模型（LLMs）通常被认为在生成语言方面表现出色，但它们的语言使用与人类的相似程度仍然研究不足。本文中，我们测试模型是否表现出语言趋同现象——这是人类语言交流中的核心语用元素，问题是：模型是否会适应或趋同于用户的语言模式？为此，我们系统地比较了十六个语言模型在三个对话语料库中对现有对话的补全结果与原始人类回应，涉及多种文体特征。我们发现，模型在很大程度上趋同于对话的风格，且往往相较于人类基线表现出显著的过拟合。尽管趋同模式通常是特征特定的，但我们观察到不同建模设置下趋同的变化趋势一致，指令调优和更大规模的模型趋同程度低于其预训练模型。鉴于人类与模型趋同模式的差异，我们推测这些行为背后的机制存在很大不同。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-05 09:55:40 UTC 发布时间：2025-08-05 09:55:40 UTC

#26 LECTOR: LLM-Enhanced Concept-based Test-Oriented Repetition for Adaptive Spaced Learning #26 LECTOR：基于概念的面向测试的重复，结合 LLM 增强的自适应间隔学习

Author: [Jiahao Zhao](https://arxiv.org/search/?searchtype=author&query=Jiahao Zhao) 作者：赵嘉豪

Spaced repetition systems are fundamental to efficient learning and memory retention, but existing algorithms often struggle with semantic interference and personalized adaptation. We present LECTOR (\textbf{L}LM-\textbf{E}nhanced \textbf{C}oncept-based \textbf{T}est-\textbf{O}riented \textbf{R}epetition), a novel adaptive scheduling algorithm specifically designed for test-oriented learning scenarios, particularly language examinations where success rate is paramount. LECTOR leverages large language models for semantic analysis while incorporating personalized learning profiles, addressing the critical challenge of semantic confusion in vocabulary learning by utilizing LLM-powered semantic similarity assessment and integrating it with established spaced repetition principles. Our comprehensive evaluation against six baseline algorithms (SSP-MMC, SM2, HLR, FSRS, ANKI, THRESHOLD) across 100 simulated learners over 100 days demonstrates significant improvements: LECTOR achieves a 90.2% success rate compared to 88.4% for the best baseline (SSP-MMC), representing a 2.0% relative improvement. The algorithm shows particular strength in handling semantically similar concepts, reducing confusion-induced errors while maintaining computational efficiency. Our results establish LECTOR as a promising direction for intelligent tutoring systems and adaptive learning platforms. 间隔重复系统是高效学习和记忆保持的基础，但现有算法常常在语义干扰和个性化适应方面存在困难。我们提出了 LECTOR（\textbf{L}LM 增强的基于概念的\textbf{T}est 导向重复算法，\textbf{E}nhanced \textbf{C}oncept-based \textbf{T}est-\textbf{O}riented \textbf{R}epetition），这是一种专为测试导向学习场景设计的新型自适应调度算法，特别适用于以成功率为核心的语言考试。LECTOR 利用大型语言模型进行语义分析，同时结合个性化学习档案，通过使用 LLM 驱动的语义相似度评估并将其与成熟的间隔重复原理相结合，解决了词汇学习中语义混淆的关键问题。我们在 100 名模拟学习者、为期 100 天的测试中，将 LECTOR 与六种基线算法（SSP-MMC、SM2、HLR、FSRS、ANKI、THRESHOLD）进行了全面评估，结果显示 LECTOR 取得了显著提升：成功率达到 90.2%，而最佳基线（SSP-MMC）为 88.4%，相对提升 2.0%。该算法在处理语义相似的概念方面表现出特别的优势，减少了因混淆引起的错误，同时保持了计算效率。我们的结果确立了 LECTOR 作为智能辅导系统和自适应学习平台的一个有前景的方向。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-05 09:53:26 UTC 发布时间：2025-08-05 09:53:26 UTC

#27 Pay What LLM Wants: Can LLM Simulate Economics Experiment with 522 Real-human Persona? #27 按 LLM 想要的支付：LLM 能否模拟拥有 522 个真实人类角色的经济学实验？

Authors: [Junhyuk Choi](https://arxiv.org/search/?searchtype=author&query=Junhyuk Choi), [Hyeonchu Park](https://arxiv.org/search/?searchtype=author&query=Hyeonchu Park), [Haemin Lee](https://arxiv.org/search/?searchtype=author&query=Haemin Lee), [Hyebeen Shin](https://arxiv.org/search/?searchtype=author&query=Hyebeen Shin), [Hyun Joung Jin](https://arxiv.org/search/?searchtype=author&query=Hyun Joung Jin), [Bugeun Kim](https://arxiv.org/search/?searchtype=author&query=Bugeun Kim) 作者：Junhyuk Choi, Hyeonchu Park, Haemin Lee, Hyebeen Shin, Hyun Joung Jin, Bugeun Kim

Recent advances in Large Language Models (LLMs) have generated significant interest in their capacity to simulate human-like behaviors, yet most studies rely on fictional personas rather than actual human data. We address this limitation by evaluating LLMs’ ability to predict individual economic decision-making using Pay-What-You-Want (PWYW) pricing experiments with real 522 human personas. Our study systematically compares three state-of-the-art multimodal LLMs using detailed persona information from 522 Korean participants in cultural consumption scenarios. We investigate whether LLMs can accurately replicate individual human choices and how persona injection methods affect prediction performance. Results reveal that while LLMs struggle with precise individual-level predictions, they demonstrate reasonable group-level behavioral tendencies. Also, we found that commonly adopted prompting techniques are not much better than naive prompting methods; reconstruction of personal narrative nor retrieval augmented generation have no significant gain against simple prompting method. We believe that these findings can provide the first comprehensive evaluation of LLMs’ capabilities on simulating economic behavior using real human data, offering empirical guidance for persona-based simulation in computational social science. 大型语言模型（LLMs）的最新进展引发了人们对其模拟类人行为能力的极大兴趣，然而大多数研究依赖于虚构的人物设定，而非真实的人类数据。我们通过使用包含 522 个真实人类个体的“随心付”（Pay-What-You-Want，PWYW）定价实验，评估 LLMs 预测个体经济决策的能力，以解决这一局限性。本研究系统地比较了三种最先进的多模态 LLMs，利用 522 名韩国参与者在文化消费场景中的详细人物信息。我们探讨了 LLMs 是否能够准确复制个体人类的选择，以及人物注入方法如何影响预测表现。结果显示，尽管 LLMs 在精确的个体层面预测上存在困难，但它们在群体层面的行为倾向上表现出合理的能力。此外，我们发现常用的提示技术并不比简单提示方法好多少；个人叙事重构或检索增强生成方法相较于简单提示方法并无显著提升。我们相信，这些发现可以为使用真实人类数据模拟经济行为的 LLMs 能力提供首次全面评估，为基于角色的模拟在计算社会科学中的应用提供实证指导。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-05 09:37:37 UTC 发布时间：2025-08-05 09:37:37 UTC

#28 Exploring Stability-Plasticity Trade-offs for Continual Named Entity Recognition #28 探索持续命名实体识别中的稳定性-可塑性权衡

Authors: [Duzhen Zhang](https://arxiv.org/search/?searchtype=author&query=Duzhen Zhang), [Chenxing Li](https://arxiv.org/search/?searchtype=author&query=Chenxing Li), [Jiahua Dong](https://arxiv.org/search/?searchtype=author&query=Jiahua Dong), [Qi Liu](https://arxiv.org/search/?searchtype=author&query=Qi Liu), [Dong Yu](https://arxiv.org/search/?searchtype=author&query=Dong Yu) 作者：张杜臻，李晨星，董佳华，刘琦，余东

Continual Named Entity Recognition (CNER) is an evolving field that focuses on sequentially updating an existing model to incorporate new entity types. Previous CNER methods primarily utilize Knowledge Distillation (KD) to preserve prior knowledge and overcome catastrophic forgetting, strictly ensuring that the representations of old and new models remain consistent. Consequently, they often impart the model with excessive stability (i.e., retention of old knowledge) but limited plasticity (i.e., acquisition of new knowledge). To address this issue, we propose a Stability-Plasticity Trade-off (SPT) method for CNER that balances these aspects from both representation and weight perspectives. From the representation perspective, we introduce a pooling operation into the original KD, permitting a level of plasticity by consolidating representation dimensions. From the weight perspective, we dynamically merge the weights of old and new models, strengthening old knowledge while maintaining new knowledge. During this fusion, we implement a weight-guided selective mechanism to prioritize significant weights. Moreover, we develop a confidence-based pseudo-labeling approach for the current non-entity type, which predicts entity types using the old model to handle the semantic shift of the non-entity type, a challenge specific to CNER that has largely been ignored by previous methods. Extensive experiments across ten CNER settings on three benchmark datasets demonstrate that our SPT method surpasses previous CNER approaches, highlighting its effectiveness in achieving a suitable stability-plasticity trade-off. 持续命名实体识别（CNER）是一个不断发展的领域，专注于顺序更新现有模型以纳入新的实体类型。以往的 CNER 方法主要利用知识蒸馏（KD）来保留先前知识并克服灾难性遗忘，严格确保旧模型和新模型的表示保持一致。因此，它们通常赋予模型过度的稳定性（即保留旧知识），但塑性有限（即获取新知识）。为了解决这一问题，我们提出了一种稳定性-塑性权衡（SPT）方法，用于 CNER，从表示和权重两个角度平衡这两个方面。从表示角度出发，我们在原有的 KD 中引入了池化操作，通过整合表示维度允许一定程度的塑性。从权重角度出发，我们动态融合旧模型和新模型的权重，在强化旧知识的同时保持新知识。在此融合过程中，我们实施了一个权重引导的选择机制，以优先考虑重要权重。此外，我们为当前的非实体类型开发了一种基于置信度的伪标签方法，该方法使用旧模型预测实体类型，以应对非实体类型的语义转变，这是 CNER 特有的挑战，且此前的方法大多忽视了这一点。在三个基准数据集上的十个 CNER 设置中进行的大量实验表明，我们的 SPT 方法优于以往的 CNER 方法，凸显了其在实现合适的稳定性-可塑性权衡方面的有效性。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-05 09:35:55 UTC 发布：2025-08-05 09:35:55 UTC

#29 RooseBERT: A New Deal For Political Language Modelling #29 RooseBERT：政治语言建模的新方案

Authors: [Deborah Dore](https://arxiv.org/search/?searchtype=author&query=Deborah Dore), [Elena Cabrio](https://arxiv.org/search/?searchtype=author&query=Elena Cabrio), [Serena Villata](https://arxiv.org/search/?searchtype=author&query=Serena Villata) 作者：Deborah Dore, Elena Cabrio, Serena Villata

The increasing amount of political debates and politics-related discussions calls for the definition of novel computational methods to automatically analyse such content with the final goal of lightening up political deliberation to citizens. However, the specificity of the political language and the argumentative form of these debates (employing hidden communication strategies and leveraging implicit arguments) make this task very challenging, even for current general-purpose pre-trained Language Models. To address this issue, we introduce a novel pre-trained Language Model for political discourse language called RooseBERT. Pre-training a language model on a specialised domain presents different technical and linguistic challenges, requiring extensive computational resources and large-scale data. RooseBERT has been trained on large political debate and speech corpora (8K debates, each composed of several sub-debates on different topics) in English. To evaluate its performances, we fine-tuned it on four downstream tasks related to political debate analysis, i.e., named entity recognition, sentiment analysis, argument component detection and classification, and argument relation prediction and classification. Our results demonstrate significant improvements over general-purpose Language Models on these four tasks, highlighting how domain-specific pre-training enhances performance in political debate analysis. We release the RooseBERT language model for the research community. 日益增多的政治辩论和与政治相关的讨论呼唤定义新颖的计算方法，以自动分析此类内容，最终目标是向公众揭示政治审议的真相。然而，政治语言的特殊性及这些辩论的论证形式（采用隐蔽的沟通策略并利用隐含的论点）使得这一任务极具挑战性，即使是当前通用的预训练语言模型也难以胜任。为了解决这一问题，我们引入了一种名为 RooseBERT 的政治话语语言新型预训练语言模型。在特定领域对语言模型进行预训练面临不同的技术和语言挑战，需要大量计算资源和大规模数据。RooseBERT 在大型政治辩论和演讲语料库上进行了训练（包含 8000 场辩论，每场辩论由多个不同主题的子辩论组成），语料为英文。为了评估其性能，我们在四个与政治辩论分析相关的下游任务上对其进行了微调，即命名实体识别、情感分析、论点成分检测与分类，以及论点关系预测与分类。我们的结果显示，在这四个任务上，相较于通用语言模型有显著提升，突显了领域特定预训练如何增强政治辩论分析的表现。我们将 RooseBERT 语言模型发布给研究社区。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-05 09:28:20 UTC 发布时间：2025-08-05 09:28:20 UTC

#30 Somatic in the East, Psychological in the West?: Investigating Clinically-Grounded Cross-Cultural Depression Symptom Expression in LLMs #30 东方表现为躯体症状，西方表现为心理症状？：在 LLMs 中调查基于临床的跨文化抑郁症状表达

Authors: [Shintaro Sakai](https://arxiv.org/search/?searchtype=author&query=Shintaro Sakai), [Jisun An](https://arxiv.org/search/?searchtype=author&query=Jisun An), [Migyeong Kang](https://arxiv.org/search/?searchtype=author&query=Migyeong Kang), [Haewoon Kwak](https://arxiv.org/search/?searchtype=author&query=Haewoon Kwak) 作者：Shintaro Sakai, Jisun An, Migyeong Kang, Haewoon Kwak

Prior clinical psychology research shows that Western individuals with depression tend to report psychological symptoms, while Eastern individuals report somatic ones. We test whether Large Language Models (LLMs), which are increasingly used in mental health, reproduce these cultural patterns by prompting them with Western or Eastern personas. Results show that LLMs largely fail to replicate the patterns when prompted in English, though prompting in major Eastern languages (i.e., Chinese, Japanese, and Hindi) improves alignment in several configurations. Our analysis pinpoints two key reasons for this failure: the models’ low sensitivity to cultural personas and a strong, culturally invariant symptom hierarchy that overrides cultural cues. These findings reveal that while prompt language is important, current general-purpose LLMs lack the robust, culture-aware capabilities essential for safe and effective mental health applications. 先前的临床心理学研究表明，西方抑郁症患者倾向于报告心理症状，而东方患者则倾向于报告躯体症状。我们通过以西方或东方角色身份提示 LLMs（大型语言模型），测试这些模型是否会重现这些文化模式。结果显示，当以英语提示时，LLMs 在很大程度上未能复制这些模式，尽管以主要东方语言（即中文、日语和印地语）提示在若干配置中改善了对齐。我们的分析指出了这一失败的两个关键原因：模型对文化角色的敏感度低，以及一种强烈的、文化不变的症状层级结构覆盖了文化线索。这些发现表明，尽管提示语言很重要，但当前通用 LLMs 缺乏实现安全有效心理健康应用所必需的强大文化感知能力。

Subjects: Computation and Language, Computers and Society 主题：计算与语言，计算机与社会

Publish: 2025-08-05 09:25:38 UTC 发布时间：2025-08-05 09:25:38 UTC

#31 CardiffNLP at CLEARS-2025: Prompting Large Language Models for Plain Language and Easy-to-Read Text Rewriting #31 CardiffNLP 在 CLEARS-2025：提示大型语言模型进行通俗易懂文本重写

Authors: [Mutaz Ayesh](https://arxiv.org/search/?searchtype=author&query=Mutaz Ayesh), [Nicolás Gutiérrez-Rolón](https://arxiv.org/search/?searchtype=author&query=Nicolás Gutiérrez-Rolón), [Fernando Alva-Manchego](https://arxiv.org/search/?searchtype=author&query=Fernando Alva-Manchego) 作者：Mutaz Ayesh，Nicolás Gutiérrez-Rolón，Fernando Alva-Manchego

This paper details the CardiffNLP team’s contribution to the CLEARS shared task on Spanish text adaptation, hosted by IberLEF 2025. The shared task contained two subtasks and the team submitted to both. Our team took an LLM-prompting approach with different prompt variations. While we initially experimented with LLaMA-3.2, we adopted Gemma-3 for our final submission, and landed third place in Subtask 1 and second place in Subtask 2. We detail our numerous prompt variations, examples, and experimental results. 本文详细介绍了 CardiffNLP 团队在 IberLEF 2025 主办的西班牙语文本适应 CLEARS 共享任务中的贡献。该共享任务包含两个子任务，团队均有提交。我们团队采用了基于 LLM 提示的方式，使用了不同的提示变体。虽然最初尝试了 LLaMA-3.2，但最终提交采用了 Gemma-3，并在子任务 1 中获得第三名，子任务 2 中获得第二名。我们详细介绍了多种提示变体、示例及实验结果。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-05 09:16:19 UTC 发布时间：2025-08-05 09:16:19 UTC

#32 Probing Syntax in Large Language Models: Successes and Remaining Challenges #32 探索大型语言模型中的句法：成功与剩余挑战

Authors: [Pablo J. Diego-Simón](https://arxiv.org/search/?searchtype=author&query=Pablo J. Diego-Simón), [Emmanuel Chemla](https://arxiv.org/search/?searchtype=author&query=Emmanuel Chemla), [Jean-Rémi King](https://arxiv.org/search/?searchtype=author&query=Jean-Rémi King), [Yair Lakretz](https://arxiv.org/search/?searchtype=author&query=Yair Lakretz) 作者：Pablo J. Diego-Simón，Emmanuel Chemla，Jean-Rémi King，Yair Lakretz

The syntactic structures of sentences can be readily read-out from the activations of large language models (LLMs). However, the ``structural probes’’ that have been developed to reveal this phenomenon are typically evaluated on an indiscriminate set of sentences. Consequently, it remains unclear whether structural and/or statistical factors systematically affect these syntactic representations. To address this issue, we conduct an in-depth analysis of structural probes on three controlled benchmarks. Our results are three-fold. First, structural probes are biased by a superficial property: the closer two words are in a sentence, the more likely structural probes will consider them as syntactically linked. Second, structural probes are challenged by linguistic properties: they poorly represent deep syntactic structures, and get interfered by interacting nouns or ungrammatical verb forms. Third, structural probes do not appear to be affected by the predictability of individual words. Overall, this work sheds light on the current challenges faced by structural probes. Providing a benchmark made of controlled stimuli to better evaluate their performance. 句子的句法结构可以从大型语言模型（LLMs）的激活中轻松读取。然而，已经开发出的“结构探针”通常是在一组不加区分的句子上进行评估的。因此，目前尚不清楚结构因素和/或统计因素是否系统性地影响这些句法表示。为了解决这一问题，我们对三个受控基准上的结构探针进行了深入分析。我们的结果有三点。首先，结构探针受到一个表面属性的偏见：句子中两个词越接近，结构探针越可能认为它们在句法上相互关联。其次，结构探针受到语言属性的挑战：它们对深层句法结构的表示较差，并且会受到相互作用的名词或不合语法的动词形式的干扰。第三，结构探针似乎不受单个词可预测性的影响。总体而言，这项工作揭示了结构探针当前面临的挑战，并提供了由受控刺激组成的基准，以更好地评估其性能。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-05 08:41:14 UTC 发布时间：2025-08-05 08:41:14 UTC

#33 Current State in Privacy-Preserving Text Preprocessing for Domain-Agnostic NLP #33 隐私保护文本预处理在领域无关 NLP 中的现状

Authors: [Abhirup Sinha](https://arxiv.org/search/?searchtype=author&query=Abhirup Sinha), [Pritilata Saha](https://arxiv.org/search/?searchtype=author&query=Pritilata Saha), [Tithi Saha](https://arxiv.org/search/?searchtype=author&query=Tithi Saha) 作者：Abhirup Sinha, Pritilata Saha, Tithi Saha

Privacy is a fundamental human right. Data privacy is protected by different regulations, such as GDPR. However, modern large language models require a huge amount of data to learn linguistic variations, and the data often contains private information. Research has shown that it is possible to extract private information from such language models. Thus, anonymizing such private and sensitive information is of utmost importance. While complete anonymization may not be possible, a number of different pre-processing approaches exist for masking or pseudonymizing private information in textual data. This report focuses on a few of such approaches for domain-agnostic NLP tasks. 隐私是一项基本人权。数据隐私受到诸如 GDPR 等不同法规的保护。然而，现代大型语言模型需要大量数据来学习语言变体，而这些数据通常包含私人信息。研究表明，有可能从此类语言模型中提取私人信息。因此，对这些私人和敏感信息进行匿名处理至关重要。虽然完全匿名化可能无法实现，但存在多种不同的预处理方法，用于在文本数据中掩盖或假名化私人信息。本报告重点介绍了几种适用于领域无关 NLP 任务的此类方法。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-05 08:26:45 UTC 发布时间：2025-08-05 08:26:45 UTC

#34 Beyond Content: How Grammatical Gender Shapes Visual Representation in Text-to-Image Models #34 超越内容：语法性别如何影响文本到图像模型中的视觉表现

Authors: [Muhammed Saeed](https://arxiv.org/search/?searchtype=author&query=Muhammed Saeed), [Shaina Raza](https://arxiv.org/search/?searchtype=author&query=Shaina Raza), [Ashmal Vayani](https://arxiv.org/search/?searchtype=author&query=Ashmal Vayani), [Muhammad Abdul-Mageed](https://arxiv.org/search/?searchtype=author&query=Muhammad Abdul-Mageed), [Ali Emami](https://arxiv.org/search/?searchtype=author&query=Ali Emami), [Shady Shehata](https://arxiv.org/search/?searchtype=author&query=Shady Shehata) 作者：Muhammed Saeed, Shaina Raza, Ashmal Vayani, Muhammad Abdul-Mageed, Ali Emami, Shady Shehata

Research on bias in Text-to-Image (T2I) models has primarily focused on demographic representation and stereotypical attributes, overlooking a fundamental question: how does grammatical gender influence visual representation across languages? We introduce a cross-linguistic benchmark examining words where grammatical gender contradicts stereotypical gender associations (e.g., une sentinelle'' - grammatically feminine in French but referring to the stereotypically masculine concept guard’’). Our dataset spans five gendered languages (French, Spanish, German, Italian, Russian) and two gender-neutral control languages (English, Chinese), comprising 800 unique prompts that generated 28,800 images across three state-of-the-art T2I models. Our analysis reveals that grammatical gender dramatically influences image generation: masculine grammatical markers increase male representation to 73% on average (compared to 22% with gender-neutral English), while feminine grammatical markers increase female representation to 38% (compared to 28% in English). These effects vary systematically by language resource availability and model architecture, with high-resource languages showing stronger effects. Our findings establish that language structure itself, not just content, shapes AI-generated visual outputs, introducing a new dimension for understanding bias and fairness in multilingual, multimodal systems. 关于文本到图像（T2I）模型中的偏见研究主要集中在人口统计学表现和刻板印象属性上，忽视了一个基本问题：语法性别如何影响跨语言的视觉表现？我们引入了一个跨语言基准，考察语法性别与刻板性别关联相矛盾的词汇（例如，“une sentinelle”——法语中语法为阴性，但指代刻板印象中的阳性概念“守卫”）。我们的数据集涵盖五种有性别区分的语言（法语、西班牙语、德语、意大利语、俄语）和两种性别中性对照语言（英语、中文），包含 800 个独特提示词，在三种最先进的 T2I 模型中生成了 28,800 张图像。我们的分析显示，语法性别显著影响图像生成：阳性语法标记使男性形象平均占比提升至 73%（而性别中性英语中为 22%），阴性语法标记使女性形象占比提升至 38%（而英语中为 28%）。这些效应随着语言资源的可用性和模型架构系统性地变化，高资源语言表现出更强的效应。我们的研究结果表明，语言结构本身，而不仅仅是内容，影响着 AI 生成的视觉输出，为理解多语言、多模态系统中的偏见和公平性引入了一个新的维度。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-05 08:13:07 UTC 发布时间：2025-08-05 08:13:07 UTC

#35 Analyzing German Parliamentary Speeches: A Machine Learning Approach for Topic and Sentiment Classification #35 德国议会演讲分析：一种用于主题和情感分类的机器学习方法

Authors: [Lukas Pätz](https://arxiv.org/search/?searchtype=author&query=Lukas Pätz), [Moritz Beyer](https://arxiv.org/search/?searchtype=author&query=Moritz Beyer), [Jannik Späth](https://arxiv.org/search/?searchtype=author&query=Jannik Späth), [Lasse Bohlen](https://arxiv.org/search/?searchtype=author&query=Lasse Bohlen), [Patrick Zschech](https://arxiv.org/search/?searchtype=author&query=Patrick Zschech), [Mathias Kraus](https://arxiv.org/search/?searchtype=author&query=Mathias Kraus), [Julian Rosenberger](https://arxiv.org/search/?searchtype=author&query=Julian Rosenberger) 作者：Lukas Pätz、Moritz Beyer、Jannik Späth、Lasse Bohlen、Patrick Zschech、Mathias Kraus、Julian Rosenberger

This study investigates political discourse in the German parliament, the Bundestag, by analyzing approximately 28,000 parliamentary speeches from the last five years. Two machine learning models for topic and sentiment classification were developed and trained on a manually labeled dataset. The models showed strong classification performance, achieving an area under the receiver operating characteristic curve (AUROC) of 0.94 for topic classification (average across topics) and 0.89 for sentiment classification. Both models were applied to assess topic trends and sentiment distributions across political parties and over time. The analysis reveals remarkable relationships between parties and their role in parliament. In particular, a change in style can be observed for parties moving from government to opposition. While ideological positions matter, governing responsibilities also shape discourse. The analysis directly addresses key questions about the evolution of topics, sentiment dynamics, and party-specific discourse strategies in the Bundestag. 本研究通过分析过去五年约 28,000 条德国联邦议院（Bundestag）议会演讲，调查了德国议会中的政治话语。研究开发并训练了两个用于主题和情感分类的机器学习模型，训练数据集为手工标注的数据。模型表现出强劲的分类性能，主题分类的受试者工作特征曲线下面积（AUROC）平均达到 0.94，情感分类达到 0.89。两个模型均被应用于评估各政党及其随时间变化的主题趋势和情感分布。分析揭示了政党之间及其在议会中角色的显著关系。特别是，政党从执政转为反对派时，其风格发生了变化。尽管理念立场重要，执政责任同样影响话语。该分析直接回应了关于联邦议院主题演变、情感动态及政党特定话语策略的关键问题。

Subjects: Computation and Language, Machine Learning 主题：计算与语言，机器学习

Publish: 2025-08-05 07:44:42 UTC 发布：2025-08-05 07:44:42 UTC

#36 Light-IF: Endowing LLMs with Generalizable Reasoning via Preview and Self-Checking for Complex Instruction Following #36 Light-IF：通过预览和自我检查赋予 LLMs 可泛化的推理能力以执行复杂指令

Authors: [Chenyang Wang](https://arxiv.org/search/?searchtype=author&query=Chenyang Wang), [Liang Wen](https://arxiv.org/search/?searchtype=author&query=Liang Wen), [Shousheng Jia](https://arxiv.org/search/?searchtype=author&query=Shousheng Jia), [Xiangzheng Zhang](https://arxiv.org/search/?searchtype=author&query=Xiangzheng Zhang), [Liang Xu](https://arxiv.org/search/?searchtype=author&query=Liang Xu) 作者：王晨阳，文亮，贾守胜，张祥正，徐亮

While advancements in the reasoning abilities of LLMs have significantly enhanced their performance in solving mathematical problems, coding tasks, and general puzzles, their effectiveness in accurately adhering to instructions remains inconsistent, particularly with more complex directives. Our investigation identifies lazy reasoning during the thinking stage as the primary factor contributing to poor instruction adherence. To mitigate this issue, we propose a comprehensive framework designed to enable rigorous reasoning processes involving preview and self-checking, essential for satisfying strict instruction constraints. Specifically, we first generate instructions with complex constraints and apply a filtering process to obtain valid prompts, resulting in three distinct prompt datasets categorized as hard, easy, and pass. Then, we employ rejection sampling on the pass prompts to curate a small yet high-quality dataset, enabling a cold-start initialization of the model and facilitating its adaptation to effective reasoning patterns. Subsequently, we employ an entropy-preserving supervised fine-tuning (Entropy-SFT) strategy coupled with token-wise entropy-adaptive (TEA-RL) reinforcement learning guided by rule-based dense rewards. This approach encourages the model to transform its reasoning mechanism, ultimately fostering generalizable reasoning abilities that encompass preview and self-checking. Extensive experiments conducted on instruction-following benchmarks demonstrate remarkable performance improvements across various model scales. Notably, our Light-IF-32B model surpasses both larger open-source models such as DeepSeek-R1 and closed-source models like Doubao-1.6. 尽管 LLMs 在推理能力上的进步显著提升了它们解决数学问题、编码任务和一般谜题的表现，但它们在准确遵循指令方面的效果仍不稳定，尤其是在面对更复杂的指令时。我们的研究发现，思考阶段的懒惰推理是导致指令遵循不佳的主要原因。为了解决这一问题，我们提出了一个全面的框架，旨在实现包含预览和自我检查的严谨推理过程，这对于满足严格的指令约束至关重要。具体而言，我们首先生成带有复杂约束的指令，并通过过滤过程获得有效的提示，最终形成三个不同的提示数据集，分别归类为困难、简单和通过。随后，我们对通过类提示进行拒绝采样，筛选出一个小而高质量的数据集，从而实现模型的冷启动初始化，并促进其适应有效的推理模式。随后，我们采用了一种保持熵的监督微调（Entropy-SFT）策略，结合基于规则的密集奖励指导的逐标记熵自适应（TEA-RL）强化学习。该方法鼓励模型转变其推理机制，最终培养出包含预览和自检的可泛化推理能力。在指令遵循基准上进行的大量实验表明，各种规模的模型均表现出显著的性能提升。值得注意的是，我们的 Light-IF-32B 模型超越了更大规模的开源模型如 DeepSeek-R1 以及闭源模型如 Doubao-1.6。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-05 07:42:00 UTC 发布时间：2025-08-05 07:42:00 UTC

#37 RCP-Merging: Merging Long Chain-of-Thought Models with Domain-Specific Models by Considering Reasoning Capability as Prior #37 RCP-Merging：通过将推理能力视为先验，合并长链思维模型与特定领域模型

Authors: [Junyao Yang](https://arxiv.org/search/?searchtype=author&query=Junyao Yang), [Jianwei Wang](https://arxiv.org/search/?searchtype=author&query=Jianwei Wang), [Huiping Zhuang](https://arxiv.org/search/?searchtype=author&query=Huiping Zhuang), [Cen Chen](https://arxiv.org/search/?searchtype=author&query=Cen Chen), [Ziqian Zeng](https://arxiv.org/search/?searchtype=author&query=Ziqian Zeng) 作者：杨俊尧，王建伟，庄慧平，陈岑，曾子谦

Large Language Models (LLMs) with long chain-of-thought (CoT) capability, termed Reasoning Models, demonstrate superior intricate problem-solving abilities through multi-step long CoT reasoning. To create a dual-capability model with long CoT capability and domain-specific knowledge without substantial computational and data costs, model merging emerges as a highly resource-efficient method. However, significant challenges lie in merging domain-specific LLMs with long CoT ones since nowadays merging methods suffer from reasoning capability degradation, even gibberish output and output collapse. To overcome this, we introduce RCP-Merging: Merging Long Chain-of-Thought Models with Domain-Specific Models by Considering Reasoning Capability as Prior, a novel merging framework designed to integrate domain-specific LLMs with long CoT capability, meanwhile maintaining model performance in the original domain. Treating reasoning model weights as foundational prior, our method utilizes a reasoning capability indicator to preserve core long CoT capability model weights while selectively merging essential domain-specific weights. We conducted extensive experiments on Qwen2.5-7B, Llama3.1-8B, and Qwen2.5-1.5B models in BioMedicine and Finance domains. Our results show that RCP-Merging successfully merges a reasoning model with domain-specific ones, improving domain task performance by 9.5% and 9.2% over state-of-the-art methods, without significantly harming the original long CoT reasoning capability. 具有长链式思维（CoT）能力的大型语言模型（LLMs），称为推理模型，通过多步长链式思维推理展现出卓越的复杂问题解决能力。为了在不产生大量计算和数据成本的情况下，创建具备长链式思维能力和领域特定知识的双重能力模型，模型合并成为一种极具资源效率的方法。然而，将领域特定的 LLMs 与具备长链式思维能力的模型合并存在重大挑战，因为现有的合并方法往往导致推理能力下降，甚至出现无意义输出和输出崩溃。为了解决这一问题，我们提出了 RCP-Merging：一种以推理能力为先验，合并长链式思维模型与领域特定模型的新型合并框架，旨在整合具备长链式思维能力的领域特定 LLMs，同时保持模型在原始领域的性能。该方法将推理模型权重视为基础先验，利用推理能力指标保留核心长链式思维能力模型权重，同时有选择地合并关键的领域特定权重。我们在 BioMedicine 和 Finance 领域对 Qwen2.5-7B、Llama3.1-8B 和 Qwen2.5-1.5B 模型进行了大量实验。结果表明，RCP-Merging 成功地将推理模型与特定领域模型合并，在不显著损害原有长链式推理能力的情况下，使领域任务性能分别比最先进方法提升了 9.5%和 9.2%。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-05 06:38:18 UTC 发布时间：2025-08-05 06:38:18 UTC

#38 Long Story Generation via Knowledge Graph and Literary Theory #38 通过知识图谱和文学理论进行长篇故事生成

Authors: [Ge Shi](https://arxiv.org/search/?searchtype=author&query=Ge Shi), [Kaiyu Huang](https://arxiv.org/search/?searchtype=author&query=Kaiyu Huang), [Guochen Feng](https://arxiv.org/search/?searchtype=author&query=Guochen Feng) 作者：施戈，黄凯宇，冯国辰

The generation of a long story consisting of several thousand words is a sub-task in the field of long text generation~(LTG). Previous research has addressed this challenge through outline-based generation, which employs a multi-stage method for generating outlines into stories. However, this approach suffers from two common issues: almost inevitable theme drift caused by the loss of memory of previous outlines, and tedious plots with incoherent logic that are less appealing to human readers. In this paper, we propose the multi-agent Story Generator structure to improve the multi-stage method, using large language models~(LLMs) as the core components of agents. To avoid theme drift, we introduce a memory storage model comprising two components: a long-term memory storage that identifies the most important memories, thereby preventing theme drift; and a short-term memory storage that retains the latest outlines from each generation round. To incorporate engaging elements into the story, we design a story theme obstacle framework based on literary narratology theory that introduces uncertain factors and evaluation criteria to generate outline. This framework calculates the similarity of the former storyline and enhances the appeal of the story by building a knowledge graph and integrating new node content. Additionally, we establish a multi-agent interaction stage to simulate writer-reader interaction through dialogue and revise the story text according to feedback, to ensure it remains consistent and logical. Evaluations against previous methods demonstrate that our approach can generate higher-quality long stories. 生成由数千字组成的长篇故事是长文本生成（LTG）领域的一个子任务。以往的研究通过基于大纲的生成方法来应对这一挑战，该方法采用多阶段方式将大纲生成故事。然而，这种方法存在两个常见问题：几乎不可避免的主题偏移，原因是对先前大纲记忆的丧失；以及情节冗长且逻辑不连贯，难以吸引读者。在本文中，我们提出了多智能体故事生成器结构，以改进多阶段方法，使用大型语言模型（LLMs）作为智能体的核心组件。为避免主题偏移，我们引入了一个由两部分组成的记忆存储模型：长期记忆存储，用于识别最重要的记忆，从而防止主题偏移；短期记忆存储，用于保留每轮生成的最新大纲。为了将吸引人的元素融入故事，我们设计了一个基于文学叙事学理论的故事主题障碍框架，该框架引入不确定因素和评估标准来生成大纲。该框架计算前一故事线的相似度，并通过构建知识图谱和整合新节点内容来增强故事的吸引力。此外，我们建立了一个多智能体交互阶段，通过对话模拟作者与读者的互动，并根据反馈修订故事文本，以确保其保持一致性和逻辑性。与以往方法的评估表明，我们的方法能够生成更高质量的长篇故事。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-05 06:35:14 UTC 发布时间：2025-08-05 06:35:14 UTC

#39 Cross-lingual Opinions and Emotions Mining in Comparable Documents #39 跨语言观点与情感挖掘在可比文档中的应用

Authors: [Motaz Saad](https://arxiv.org/search/?searchtype=author&query=Motaz Saad), [David Langlois](https://arxiv.org/search/?searchtype=author&query=David Langlois), [Kamel Smaili](https://arxiv.org/search/?searchtype=author&query=Kamel Smaili) 作者：Motaz Saad, David Langlois, Kamel Smaili

Comparable texts are topic-aligned documents in multiple languages that are not direct translations. They are valuable for understanding how a topic is discussed across languages. This research studies differences in sentiments and emotions across English-Arabic comparable documents. First, texts are annotated with sentiment and emotion labels. We apply a cross-lingual method to label documents with opinion classes (subjective/objective), avoiding reliance on machine translation. To annotate with emotions (anger, disgust, fear, joy, sadness, surprise), we manually translate the English WordNet-Affect (WNA) lexicon into Arabic, creating bilingual emotion lexicons used to label the comparable corpora. We then apply a statistical measure to assess the agreement of sentiments and emotions in each source-target document pair. This comparison is especially relevant when the documents originate from different sources. To our knowledge, this aspect has not been explored in prior literature. Our study includes English-Arabic document pairs from Euronews, BBC, and Al-Jazeera (JSC). Results show that sentiment and emotion annotations align when articles come from the same news agency and diverge when they come from different ones. The proposed method is language-independent and generalizable to other language pairs. 可比文本是多语言中主题一致但非直接翻译的文档。它们对于理解不同语言中如何讨论某一主题具有重要价值。本研究考察了英阿可比文档中情感和情绪的差异。首先，对文本进行情感和情绪标签注释。我们采用跨语言方法为文档标注观点类别（主观/客观），避免依赖机器翻译。为了注释情绪（愤怒、厌恶、恐惧、喜悦、悲伤、惊讶），我们将英文 WordNet-Affect（WNA）词典手工翻译成阿拉伯语，创建了双语情绪词典，用于标注可比语料库。随后，我们应用统计方法评估每对源文档与目标文档中情感和情绪的一致性。这种比较在文档来自不同来源时尤为重要。据我们所知，文献中尚未探讨此方面。我们的研究包含来自 Euronews、BBC 和 Al-Jazeera（JSC）的英阿文档对。结果显示，当文章来自同一家新闻机构时，情感和情绪标注是一致的，而当文章来自不同机构时，则存在差异。所提出的方法与语言无关，且可推广到其他语言对。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-05 05:44:28 UTC 发布时间：2025-08-05 05:44:28 UTC

#40 Token-Level Precise Attack on RAG: Searching for the Best Alternatives to Mislead Generation #40 基于令牌级的精确攻击 RAG：寻找误导生成的最佳替代方案

Authors: [Zizhong Li](https://arxiv.org/search/?searchtype=author&query=Zizhong Li), [Haopeng Zhang](https://arxiv.org/search/?searchtype=author&query=Haopeng Zhang), [Jiawei Zhang](https://arxiv.org/search/?searchtype=author&query=Jiawei Zhang) 作者：李子忠，张浩鹏，张嘉伟

While large language models (LLMs) have achieved remarkable success in providing trustworthy responses for knowledge-intensive tasks, they still face critical limitations such as hallucinations and outdated knowledge. To address these issues, the retrieval-augmented generation (RAG) framework enhances LLMs with access to external knowledge via a retriever, enabling more accurate and real-time outputs about the latest events. However, this integration brings new security vulnerabilities: the risk that malicious content in the external database can be retrieved and used to manipulate model outputs. Although prior work has explored attacks on RAG systems, existing approaches either rely heavily on access to the retriever or fail to jointly consider both retrieval and generation stages, limiting their effectiveness, particularly in black-box scenarios. To overcome these limitations, we propose Token-level Precise Attack on the RAG (TPARAG), a novel framework that targets both white-box and black-box RAG systems. TPARAG leverages a lightweight white-box LLM as an attacker to generate and iteratively optimize malicious passages at the token level, ensuring both retrievability and high attack success in generation. Extensive experiments on open-domain QA datasets demonstrate that TPARAG consistently outperforms previous approaches in retrieval-stage and end-to-end attack effectiveness. These results further reveal critical vulnerabilities in RAG pipelines and offer new insights into improving their robustness. 虽然大型语言模型（LLMs）在为知识密集型任务提供可信响应方面取得了显著成功，但它们仍面临诸如幻觉和知识过时等关键限制。为了解决这些问题，检索增强生成（RAG）框架通过检索器为 LLMs 提供访问外部知识的能力，从而能够对最新事件生成更准确和实时的输出。然而，这种整合带来了新的安全漏洞：外部数据库中的恶意内容可能被检索并用于操控模型输出。尽管先前的工作探讨了对 RAG 系统的攻击，但现有方法要么过度依赖对检索器的访问，要么未能同时考虑检索和生成阶段，限制了其效果，尤其是在黑箱场景中。为克服这些限制，我们提出了针对 RAG 的令牌级精确攻击（TPARAG），这是一种针对白箱和黑箱 RAG 系统的新型框架。 TPARAG 利用轻量级的白盒 LLM 作为攻击者，在令牌级别生成并迭代优化恶意段落，确保生成内容既可检索又具有高攻击成功率。在开放领域问答数据集上的大量实验表明，TPARAG 在检索阶段和端到端攻击效果上始终优于以往方法。这些结果进一步揭示了 RAG 流水线中的关键漏洞，并为提升其鲁棒性提供了新的见解。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-05 05:44:19 UTC 发布时间：2025-08-05 05:44:19 UTC

#41 Privacy-Aware Decoding: Mitigating Privacy Leakage of Large Language Models in Retrieval-Augmented Generation #41 隐私感知解码：缓解检索增强生成中大型语言模型的隐私泄露

Authors: [Haoran Wang](https://arxiv.org/search/?searchtype=author&query=Haoran Wang), [Xiongxiao Xu](https://arxiv.org/search/?searchtype=author&query=Xiongxiao Xu), [Baixiang Huang](https://arxiv.org/search/?searchtype=author&query=Baixiang Huang), [Kai Shu](https://arxiv.org/search/?searchtype=author&query=Kai Shu) 作者：王浩然，徐雄啸，黄百祥，舒凯

Retrieval-Augmented Generation (RAG) enhances the factual accuracy of large language models (LLMs) by conditioning outputs on external knowledge sources. However, when retrieval involves private or sensitive data, RAG systems are susceptible to extraction attacks that can leak confidential information through generated responses. We propose Privacy-Aware Decoding (PAD), a lightweight, inference-time defense that adaptively injects calibrated Gaussian noise into token logits during generation. PAD integrates confidence-based screening to selectively protect high-risk tokens, efficient sensitivity estimation to minimize unnecessary noise, and context-aware noise calibration to balance privacy with generation quality. A \renyi Differential Privacy (RDP) accountant rigorously tracks cumulative privacy loss, enabling explicit per-response (ε,δ)-DP guarantees for sensitive outputs. Unlike prior approaches requiring retraining or corpus-level filtering, PAD is model-agnostic and operates entirely at decoding time with minimal computational overhead. Experiments on three real-world datasets demonstrate that PAD substantially reduces private information leakage while preserving response utility, outperforming existing retrieval- and post-processing-based defenses. Our work takes an important step toward mitigating privacy risks in RAG via decoding strategies, paving the way for universal and scalable privacy solutions in sensitive domains. Our code is available: https://github.com/wang2226/PAD. 检索增强生成（RAG）通过基于外部知识源来生成输出，提升了大型语言模型（LLMs）的事实准确性。然而，当检索涉及私密或敏感数据时，RAG 系统容易受到提取攻击，可能通过生成的响应泄露机密信息。我们提出了隐私感知解码（PAD），这是一种轻量级的推理时防御方法，在生成过程中自适应地向词元对数概率中注入校准的高斯噪声。PAD 结合了基于置信度的筛选以选择性保护高风险词元，采用高效的敏感度估计以最小化不必要的噪声，并通过上下文感知的噪声校准在隐私保护与生成质量之间取得平衡。一个\renyi 差分隐私（RDP）会计器严格跟踪累计隐私损失，使得对敏感输出能够提供明确的每次响应 (ε,δ) -DP 保证。与需要重新训练或语料库级过滤的先前方法不同，PAD 与模型无关，完全在解码阶段运行，且计算开销极小。在三个真实世界数据集上的实验表明，PAD 在显著减少私人信息泄露的同时，保持了响应的实用性，优于现有基于检索和后处理的防御方法。我们的工作通过解码策略在缓解 RAG 中的隐私风险方面迈出了重要一步，为敏感领域中通用且可扩展的隐私解决方案铺平了道路。我们的代码已开源：https://github.com/wang2226/PAD。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-05 05:22:13 UTC 发布时间：2025-08-05 05:22:13 UTC

#42 When Algorithms Meet Artists: Topic Modeling the AI-Art Debate, 2013-2025 #42 当算法遇上艺术家：2013-2025 年 AI 艺术辩论的话题建模

Authors: [Ariya Mukherjee-Gandhi](https://arxiv.org/search/?searchtype=author&query=Ariya Mukherjee-Gandhi), [Oliver Muellerklein](https://arxiv.org/search/?searchtype=author&query=Oliver Muellerklein) 作者：Ariya Mukherjee-Gandhi，Oliver Muellerklein

As generative AI continues to reshape artistic production and alternate modes of human expression, artists whose livelihoods are most directly affected have raised urgent concerns about consent, transparency, and the future of creative labor. However, the voices of artists are often marginalized in dominant public and scholarly discourse. This study presents a twelve-year analysis, from 2013 to 2025, of English-language discourse surrounding AI-generated art. It draws from 439 curated 500-word excerpts sampled from opinion articles, news reports, blogs, legal filings, and spoken-word transcripts. Through a reproducible methodology, we identify five stable thematic clusters and uncover a misalignment between artists’ perceptions and prevailing media narratives. Our findings highlight how the use of technical jargon can function as a subtle form of gatekeeping, often sidelining the very issues artists deem most urgent. Our work provides a BERTopic-based methodology and a multimodal baseline for future research, alongside a clear call for deeper, transparency-driven engagement with artist perspectives in the evolving AI-creative landscape. 随着生成式人工智能持续重塑艺术创作和人类表达的多样方式，最直接受影响的艺术家们提出了关于同意、透明度以及创意劳动未来的紧迫关切。然而，艺术家的声音常常在主流公共和学术话语中被边缘化。本研究呈现了对 2013 年至 2025 年间围绕 AI 生成艺术的英语话语进行的十二年分析。研究基于 439 个精选的 500 字摘录，样本来源包括观点文章、新闻报道、博客、法律文件和口语记录。通过可复现的方法论，我们识别出五个稳定的主题群，并揭示了艺术家认知与主流媒体叙事之间的错位。研究结果强调，技术术语的使用常作为一种微妙的门槛机制，往往使艺术家认为最紧迫的问题被边缘化。我们的工作提供了一种基于 BERTopic 的方法论和多模态基线，供未来研究使用，同时明确呼吁在不断发展的 AI 创意领域中，更深入且以透明度为驱动的艺术家视角参与。

Subjects: Computation and Language, Computers and Society, Human-Computer Interaction 主题：计算与语言，计算机与社会，人机交互

Publish: 2025-08-05 03:26:00 UTC 发布时间：2025-08-05 03:26:00 UTC

#43 CoCoTen: Detecting Adversarial Inputs to Large Language Models through Latent Space Features of Contextual Co-occurrence Tensors #43 CoCoTen：通过上下文共现张量的潜在空间特征检测大型语言模型的对抗输入

Authors: [Sri Durga Sai Sowmya Kadali](https://arxiv.org/search/?searchtype=author&query=Sri Durga Sai Sowmya Kadali), [Evangelos E. Papalexakis](https://arxiv.org/search/?searchtype=author&query=Evangelos E. Papalexakis) 作者：Sri Durga Sai Sowmya Kadali，Evangelos E. Papalexakis

The widespread use of Large Language Models (LLMs) in many applications marks a significant advance in research and practice. However, their complexity and hard-to-understand nature make them vulnerable to attacks, especially jailbreaks designed to produce harmful responses. To counter these threats, developing strong detection methods is essential for the safe and reliable use of LLMs. This paper studies this detection problem using the Contextual Co-occurrence Matrix, a structure recognized for its efficacy in data-scarce environments. We propose a novel method leveraging the latent space characteristics of Contextual Co-occurrence Matrices and Tensors for the effective identification of adversarial and jailbreak prompts. Our evaluations show that this approach achieves a notable F1 score of 0.83 using only 0.5% of labeled prompts, which is a 96.6% improvement over baselines. This result highlights the strength of our learned patterns, especially when labeled data is scarce. Our method is also significantly faster, speedup ranging from 2.3 to 128.4 times compared to the baseline models. To support future research and reproducibility, we have made our implementation publicly available. 大型语言模型（LLMs）在众多应用中的广泛使用标志着研究和实践的重大进展。然而，其复杂且难以理解的特性使其易受攻击，尤其是旨在生成有害响应的越狱攻击。为应对这些威胁，开发强有力的检测方法对于 LLMs 的安全可靠使用至关重要。本文利用上下文共现矩阵这一在数据稀缺环境中被认可为高效的结构，研究了该检测问题。我们提出了一种新方法，利用上下文共现矩阵和张量的潜在空间特性，有效识别对抗性和越狱提示。评估结果表明，该方法仅使用 0.5%的标注提示即可实现显著的 F1 分数 0.83，相较基线提升了 96.6%。这一结果凸显了我们学习模式的强大，尤其在标注数据稀缺时表现突出。我们的方法速度也显著提升，相较基线模型加速范围为 2.3 至 128.4 倍。为了支持未来的研究和可重复性，我们已将我们的实现公开发布。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-05 01:53:32 UTC 发布时间：2025-08-05 01:53:32 UTC

#44 Can LLMs Generate High-Quality Task-Specific Conversations? #44 LLMs 能否生成高质量的特定任务对话？

Authors: [Shengqi Li](https://arxiv.org/search/?searchtype=author&query=Shengqi Li), [Amarnath Gupta](https://arxiv.org/search/?searchtype=author&query=Amarnath Gupta) 作者：李胜奇，阿马纳斯·古普塔

This paper introduces a parameterization framework for controlling conversation quality in large language models. We explore nine key parameters across six dimensions that enable precise specification of dialogue properties. Through experiments with state-of-the-art LLMs, we demonstrate that parameter-based control produces statistically significant differences in generated conversation properties. Our approach addresses challenges in conversation generation, including topic coherence, knowledge progression, character consistency, and control granularity. The framework provides a standardized method for conversation quality control with applications in education, therapy, customer service, and entertainment. Future work will focus on implementing additional parameters through architectural modifications and developing benchmark datasets for evaluation. 本文介绍了一个用于控制大型语言模型中对话质量的参数化框架。我们探讨了跨越六个维度的九个关键参数，这些参数能够精确指定对话属性。通过对最先进的 LLMs 进行实验，我们证明了基于参数的控制能够在生成的对话属性上产生统计显著的差异。我们的方法解决了对话生成中的诸多挑战，包括话题连贯性、知识进展、角色一致性以及控制粒度。该框架为对话质量控制提供了一种标准化的方法，应用领域涵盖教育、治疗、客户服务和娱乐。未来的工作将侧重于通过架构修改实现更多参数，并开发用于评估的基准数据集。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-04 22:07:08 UTC 发布：2025-08-04 22:07:08 UTC

#45 SLIM-LLMs: Modeling of Style-Sensory Language RelationshipsThrough Low-Dimensional Representations #45 SLIM-LLMs：通过低维表示建模风格-感官语言关系

Authors: [Osama Khalid](https://arxiv.org/search/?searchtype=author&query=Osama Khalid), [Sanvesh Srivastava](https://arxiv.org/search/?searchtype=author&query=Sanvesh Srivastava), [Padmini Srinivasan](https://arxiv.org/search/?searchtype=author&query=Padmini Srinivasan) 作者：Osama Khalid, Sanvesh Srivastava, Padmini Srinivasan

Sensorial language – the language connected to our senses including vision, sound, touch, taste, smell, and interoception, plays a fundamental role in how we communicate experiences and perceptions. We explore the relationship between sensorial language and traditional stylistic features, like those measured by LIWC, using a novel Reduced-Rank Ridge Regression (R4) approach. We demonstrate that low-dimensional latent representations of LIWC features r = 24 effectively capture stylistic information for sensorial language prediction compared to the full feature set (r = 74). We introduce Stylometrically Lean Interpretable Models (SLIM-LLMs), which model non-linear relationships between these style dimensions. Evaluated across five genres, SLIM-LLMs with low-rank LIWC features match the performance of full-scale language models while reducing parameters by up to 80%. 感官语言——与我们的感官相关的语言，包括视觉、听觉、触觉、味觉、嗅觉和内感受，在我们传达体验和感知中起着基础性作用。我们使用一种新颖的降秩岭回归（R4）方法，探讨感官语言与传统风格特征（如 LIWC 测量的特征）之间的关系。我们证明，LIWC 特征的低维潜在表示 r = 24 相较于完整特征集（r = 74）能有效捕捉用于感官语言预测的风格信息。我们引入了风格简约可解释模型（SLIM-LLMs），该模型对这些风格维度之间的非线性关系进行建模。在五种体裁的评估中，使用低秩 LIWC 特征的 SLIM-LLMs 在性能上与全尺度语言模型相当，同时参数量减少了多达 80%。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-04 21:02:12 UTC 发布：2025-08-04 21:02:12 UTC

#46 Coherent Multimodal Reasoning with Iterative Self-Evaluation for Vision-Language Models #46 具有迭代自我评估的视觉语言模型的连贯多模态推理

Authors: [Wenjie Luo](https://arxiv.org/search/?searchtype=author&query=Wenjie Luo), [Ruocheng Li](https://arxiv.org/search/?searchtype=author&query=Ruocheng Li), [Shanshan Zhu](https://arxiv.org/search/?searchtype=author&query=Shanshan Zhu), [Julian Perry](https://arxiv.org/search/?searchtype=author&query=Julian Perry) 作者：罗文杰，李若成，朱珊珊，朱利安·佩里

Despite significant advancements, current large language models (LLMs) and vision-language models (LVLMs) continue to struggle with complex, multi-step, cross-modal common sense reasoning tasks, often exhibiting a lack of “deliberative thinking.” They tend to rely on superficial associations rather than deep, chained inference, particularly when integrating visual information with abstract concepts. To address this, we propose the Coherent Multimodal Reasoning Framework (CMRF), a novel approach that enhances LVLMs’ common sense reasoning capabilities through an iterative, self-evaluating inference mechanism. CMRF mimics human problem-solving by decomposing complex queries, generating step-by-step inferences, and self-correcting errors. Our framework integrates three key modules: a Reasoning Decomposition Unit (RDU) for breaking down problems into sub-questions, a Contextual Inference Engine (CIE) for contextual inference, and a Coherence Assessment Module (CAM) for evaluating logical consistency and confidence. Coupled with an Adaptive Iterative Refinement strategy, CMRF systematically refines its reasoning paths. Built upon LLaVA-1.6-34B and trained on a novel Multimodal Daily Activity Reasoning (MDAR) dataset, CMRF achieves state-of-the-art performance among open-source LVLMs on challenging benchmarks like VCR, A-OKVQA, and DailyLife-MRC. It attains an average accuracy of 69.4%, surpassing the best open-source baseline by +2.4 percentage points, with particular strength in complex reasoning scenarios. Extensive ablation studies and human evaluations confirm the critical contributions of each module and the effectiveness of iterative refinement in fostering more coherent and accurate reasoning. 尽管取得了显著进展，当前的大型语言模型（LLMs）和视觉语言模型（LVLMs）在复杂的多步骤跨模态常识推理任务中仍然表现不佳，常常缺乏“深思熟虑”的能力。它们倾向于依赖表面关联，而非深入的链式推理，尤其是在将视觉信息与抽象概念结合时。为了解决这一问题，我们提出了连贯多模态推理框架（CMRF），这是一种通过迭代自我评估推理机制提升 LVLMs 常识推理能力的新方法。CMRF 模拟人类解决问题的过程，将复杂查询分解，生成逐步推理，并自我纠正错误。我们的框架整合了三个关键模块：用于将问题拆解为子问题的推理分解单元（RDU）、用于上下文推理的上下文推理引擎（CIE）以及用于评估逻辑一致性和置信度的连贯性评估模块（CAM）。结合自适应迭代优化策略，CMRF 系统地优化其推理路径。基于 LLaVA-1.6-34B 构建，并在新颖的多模态日常活动推理（MDAR）数据集上训练，CMRF 在 VCR、A-OKVQA 和 DailyLife-MRC 等具有挑战性的基准测试中，在开源 LVLM 中实现了最先进的性能。其平均准确率达到 69.4%，比最佳开源基线高出 2.4 个百分点，尤其在复杂推理场景中表现突出。大量消融研究和人工评估证实了各模块的关键贡献以及迭代优化在促进更连贯、更准确推理中的有效性。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-04 20:33:58 UTC 发布时间：2025-08-04 20:33:58 UTC

#47 Merge-based syntax is mediated by distinct neurocognitive mechanisms: A clustering analysis of comprehension abilities in 84,000 individuals with language deficits across nine languages #47 基于合并的句法由不同的神经认知机制调节：对来自九种语言的 84,000 名语言障碍个体理解能力的聚类分析

Authors: [Elliot Murphy](https://arxiv.org/search/?searchtype=author&query=Elliot Murphy), [Rohan Venkatesh](https://arxiv.org/search/?searchtype=author&query=Rohan Venkatesh), [Edward Khokhlovich](https://arxiv.org/search/?searchtype=author&query=Edward Khokhlovich), [Andrey Vyshedskiy](https://arxiv.org/search/?searchtype=author&query=Andrey Vyshedskiy) 作者：Elliot Murphy, Rohan Venkatesh, Edward Khokhlovich, Andrey Vyshedskiy

In the modern language sciences, the core computational operation of syntax, ‘Merge’, is defined as an operation that combines two linguistic units (e.g., ‘brown’, ‘cat’) to form a categorized structure (‘brown cat’, a Noun Phrase). This can then be further combined with additional linguistic units based on this categorial information, respecting non-associativity such that abstract grouping is respected. Some linguists have embraced the view that Merge is an elementary, indivisible operation that emerged in a single evolutionary step. From a neurocognitive standpoint, different mental objects constructed by Merge may be supported by distinct mechanisms: (1) simple command constructions (e.g., “eat apples”); (2) the merging of adjectives and nouns (“red boat”); and (3) the merging of nouns with spatial prepositions (“laptop behind the sofa”). Here, we systematically investigate participants’ comprehension of sentences with increasing levels of syntactic complexity. Clustering analyses revealed behavioral evidence for three distinct structural types, which we discuss as potentially emerging at different developmental stages and subject to selective impairment. While a Merge-based syntax may still have emerged suddenly in evolutionary time, responsible for the structured symbolic turn our species took, different cognitive mechanisms seem to underwrite the processing of various types of Merge-based objects. 在现代语言科学中，句法的核心计算操作“合并”（Merge）被定义为一种将两个语言单位（例如，“brown”，“cat”）组合成一个有类别结构（“brown cat”，一个名词短语）的操作。然后，可以基于该类别信息将其与其他语言单位进一步组合，遵循非结合性原则，以尊重抽象的分组。一些语言学家认为，合并是一个基本的、不可分割的操作，起源于一次单一的进化步骤。从神经认知的角度来看，由合并构建的不同心理对象可能由不同机制支持：（1）简单的命令结构（例如，“eat apples”）；（2）形容词与名词的合并（“red boat”）；（3）名词与空间介词的合并（“laptop behind the sofa”）。在此，我们系统地研究了参与者对句法复杂度逐渐增加的句子的理解。聚类分析揭示了三种不同结构类型的行为证据，我们讨论了这些结构可能在不同的发展阶段出现，并可能受到选择性损伤的影响。虽然基于合并的句法可能仍是在进化时间中突然出现的，促成了我们物种所采取的结构化符号转变，但不同的认知机制似乎支持对各种基于合并的对象的处理。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-04 20:33:36 UTC 发布：2025-08-04 20:33:36 UTC

#48 Highlight & Summarize: RAG without the jailbreaks #48 重点与总结：无越狱的 RAG

Authors: [Giovanni Cherubin](https://arxiv.org/search/?searchtype=author&query=Giovanni Cherubin), [Andrew Paverd](https://arxiv.org/search/?searchtype=author&query=Andrew Paverd) 作者：Giovanni Cherubin，Andrew Paverd

Preventing jailbreaking and model hijacking of Large Language Models (LLMs) is an important yet challenging task. For example, when interacting with a chatbot, malicious users can input specially crafted prompts to cause the LLM to generate undesirable content or perform a completely different task from its intended purpose. Existing mitigations for such attacks typically rely on hardening the LLM’s system prompt or using a content classifier trained to detect undesirable content or off-topic conversations. However, these probabilistic approaches are relatively easy to bypass due to the very large space of possible inputs and undesirable outputs. In this paper, we present and evaluate Highlight & Summarize (H&S), a new design pattern for retrieval-augmented generation (RAG) systems that prevents these attacks by design. The core idea is to perform the same task as a standard RAG pipeline (i.e., to provide natural language answers to questions, based on relevant sources) without ever revealing the user’s question to the generative LLM. This is achieved by splitting the pipeline into two components: a highlighter, which takes the user’s question and extracts relevant passages (“highlights”) from the retrieved documents, and a summarizer, which takes the highlighted passages and summarizes them into a cohesive answer. We describe several possible instantiations of H&S and evaluate their generated responses in terms of correctness, relevance, and response quality. Surprisingly, when using an LLM-based highlighter, the majority of H&S responses are judged to be better than those of a standard RAG pipeline. 防止大型语言模型（LLMs）的越狱和模型劫持是一项重要但具有挑战性的任务。例如，在与聊天机器人交互时，恶意用户可能输入特制的提示，导致 LLM 生成不良内容或执行与其预期目的完全不同的任务。现有针对这类攻击的缓解措施通常依赖于强化 LLM 的系统提示或使用训练有素的内容分类器来检测不良内容或偏题对话。然而，由于可能的输入和不良输出空间极其庞大，这些概率性方法相对容易被绕过。本文提出并评估了一种名为 Highlight & Summarize（H&S）的新设计模式，适用于检索增强生成（RAG）系统，通过设计本身防止此类攻击。其核心思想是在不向生成型 LLM 透露用户问题的情况下，执行与标准 RAG 流程相同的任务（即基于相关来源提供自然语言答案）。这是通过将流程拆分为两个部分来实现的：一个高亮器，负责接收用户的问题并从检索到的文档中提取相关段落（“高亮”），另一个是摘要器，负责将高亮的段落总结成连贯的答案。我们描述了几种可能的高亮与摘要（H&S）实现方式，并从正确性、相关性和回答质量方面评估了它们生成的回答。令人惊讶的是，当使用基于 LLM 的高亮器时，大多数 H&S 回答被评为优于标准 RAG 流程的回答。

Subjects: Computation and Language, Machine Learning 主题：计算与语言，机器学习

Publish: 2025-08-04 20:01:00 UTC 发布：2025-08-04 20:01:00 UTC

#49 Modeling Annotator Disagreement with Demographic-Aware Experts and Synthetic Perspectives #49 使用具有人口统计意识的专家和合成视角建模标注者分歧

Authors: [Yinuo Xu](https://arxiv.org/search/?searchtype=author&query=Yinuo Xu), [Veronica Derricks](https://arxiv.org/search/?searchtype=author&query=Veronica Derricks), [Allison Earl](https://arxiv.org/search/?searchtype=author&query=Allison Earl), [David Jurgens](https://arxiv.org/search/?searchtype=author&query=David Jurgens) 作者：徐一诺，Veronica Derricks，Allison Earl，David Jurgens

We present an approach to modeling annotator disagreement in subjective NLP tasks through both architectural and data-centric innovations. Our model, DEM-MoE (Demographic-Aware Mixture of Experts), routes inputs to expert subnetworks based on annotator demographics, enabling it to better represent structured, group-level variation compared to prior models. DEM-MoE consistently performs competitively across demographic groups, and shows especially strong results on datasets with high annotator disagreement. To address sparse demographic coverage, we test whether LLM-generated synthetic annotations via zero-shot persona prompting can be used for data imputation. We show these synthetic judgments align moderately well with human annotations on our data and offer a scalable way to potentially enrich training data. We then propose and evaluate approaches for blending real and synthetic data using strategies tailored to dataset structure. We find that the optimal strategies depend on dataset structure. Together, these contributions improve the representation of diverse perspectives. 我们提出了一种通过架构和数据驱动创新来建模主观 NLP 任务中标注者分歧的方法。我们的模型 DEM-MoE（基于人口统计的专家混合模型）根据标注者的人口统计信息将输入路由到专家子网络，使其能够比以往模型更好地表示结构化的群体层级差异。DEM-MoE 在各人口统计群体中始终表现出竞争力，且在标注者分歧较大的数据集上表现尤为出色。为了解决人口统计覆盖稀疏的问题，我们测试了是否可以通过零样本人设提示让 LLM 生成的合成标注用于数据补全。我们展示了这些合成判断与我们数据中的人工标注具有中等程度的一致性，并提供了一种可扩展的方式来潜在地丰富训练数据。随后，我们提出并评估了针对数据集结构量身定制的真实数据与合成数据融合策略。我们发现最优策略依赖于数据集结构。综合来看，这些贡献提升了多样化观点的表达能力。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-04 19:27:17 UTC 发布：2025-08-04 19:27:17 UTC

#50 Clinically Grounded Agent-based Report Evaluation: An Interpretable Metric for Radiology Report Generation #50 临床基础的基于智能体的报告评估：用于放射学报告生成的可解释指标

Radiological imaging is central to diagnosis, treatment planning, and clinical decision-making. Vision-language foundation models have spurred interest in automated radiology report generation (RRG), but safe deployment requires reliable clinical evaluation of generated reports. Existing metrics often rely on surface-level similarity or behave as black boxes, lacking interpretability. We introduce ICARE (Interpretable and Clinically-grounded Agent-based Report Evaluation), an interpretable evaluation framework leveraging large language model agents and dynamic multiple-choice question answering (MCQA). Two agents, each with either the ground-truth or generated report, generate clinically meaningful questions and quiz each other. Agreement on answers captures preservation and consistency of findings, serving as interpretable proxies for clinical precision and recall. By linking scores to question-answer pairs, ICARE enables transparent, and interpretable assessment. Clinician studies show ICARE aligns significantly more with expert judgment than prior metrics. Perturbation analyses confirm sensitivity to clinical content and reproducibility, while model comparisons reveal interpretable error patterns. 放射影像学在诊断、治疗规划和临床决策中起着核心作用。视觉-语言基础模型激发了自动放射学报告生成（RRG）的兴趣，但安全部署需要对生成报告进行可靠的临床评估。现有指标通常依赖表面相似性或作为黑箱操作，缺乏可解释性。我们提出了 ICARE（可解释且基于临床的代理报告评估），这是一种利用大型语言模型代理和动态多项选择问答（MCQA）的可解释评估框架。两个代理分别持有真实报告或生成报告，生成具有临床意义的问题并相互提问。对答案的一致性反映了发现的保留和一致性，作为临床精确度和召回率的可解释代理。通过将评分与问答对关联，ICARE 实现了透明且可解释的评估。临床医生研究表明，ICARE 与专家判断的吻合度显著高于以往指标。扰动分析确认了对临床内容的敏感性和可重复性，而模型比较揭示了可解释的错误模式。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-04 18:28:03 UTC 发布时间：2025-08-04 18:28:03 UTC

#51 Forest vs Tree: The (N,K) Trade-off in Reproducible ML Evaluation #51 森林与树：可重复机器学习评估中的 (N,K) 权衡

Authors: [Deepak Pandita](https://arxiv.org/search/?searchtype=author&query=Deepak Pandita), [Flip Korn](https://arxiv.org/search/?searchtype=author&query=Flip Korn), [Chris Welty](https://arxiv.org/search/?searchtype=author&query=Chris Welty), [Christopher M. Homan](https://arxiv.org/search/?searchtype=author&query=Christopher M. Homan) 作者：Deepak Pandita，Flip Korn，Chris Welty，Christopher M. Homan

Reproducibility is a cornerstone of scientific validation and of the authority it confers on its results. Reproducibility in machine learning evaluations leads to greater trust, confidence, and value. However, the ground truth responses used in machine learning often necessarily come from humans, among whom disagreement is prevalent, and surprisingly little research has studied the impact of effectively ignoring disagreement in these responses, as is typically the case. One reason for the lack of research is that budgets for collecting human-annotated evaluation data are limited, and obtaining more samples from multiple annotators for each example greatly increases the per-item annotation costs. We investigate the trade-off between the number of items (N) and the number of responses per item (K) needed for reliable machine learning evaluation. We analyze a diverse collection of categorical datasets for which multiple annotations per item exist, and simulated distributions fit to these datasets, to determine the optimal (N,K) configuration, given a fixed budget (N×K), for collecting evaluation data and reliably comparing the performance of machine learning models. Our findings show, first, that accounting for human disagreement may come with N×K at no more than 1000 (and often much lower) for every dataset tested on at least one metric. Moreover, this minimal N×K almost always occurred for K>10. Furthermore, the nature of the tradeoff between K and N – or if one even existed – depends on the evaluation metric, with metrics that are more sensitive to the full distribution of responses performing better at higher levels of K. Our methods can be used to help ML practitioners get more effective test data by finding the optimal metrics and number of items and annotations per item to collect to get the most reliability for their budget. 可重复性是科学验证的基石，也是其结果权威性的来源。机器学习评估中的可重复性能够带来更高的信任度、自信心和价值。然而，机器学习中使用的真实标签通常必须来自人类，而人类之间普遍存在分歧，令人惊讶的是，几乎没有研究关注在这些标签中有效忽视分歧的影响，而这通常是实际情况。缺乏研究的一个原因是收集人工标注评估数据的预算有限，并且为每个样本从多个标注者处获取更多标签会大幅增加每个项目的标注成本。我们研究了为实现可靠的机器学习评估，在项目数量（ N ）与每个项目的响应数量（ K ）之间的权衡。我们分析了一组多样的分类数据集，这些数据集中每个项目都有多个注释，并对这些数据集进行了拟合的模拟分布，以确定在固定预算（ N×K ）下收集评估数据和可靠比较机器学习模型性能的最佳 (N,K) 配置。我们的研究结果首先表明，考虑到人类意见分歧，在每个至少在一个指标上测试过的数据集中，所需的 N×K 不超过 1000（且通常远低于此）。此外，这种最小的 N×K 几乎总是在 K>10 时出现。此外， K 与 N 之间的权衡性质——或者是否存在权衡——取决于评估指标，对于那些对完整响应分布更敏感的指标，在较高水平的 K 时表现更佳。我们的方法可以帮助机器学习从业者通过找到最佳指标以及每个项目应收集的项目数和注释数，以在预算内获得最可靠的测试数据。

Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题：机器学习，人工智能，计算与语言

Publish: 2025-08-05 17:18:34 UTC 发布时间：2025-08-05 17:18:34 UTC

#52 OSINT or BULLSHINT? Exploring Open-Source Intelligence tweets about the Russo-Ukrainian War #52 开源情报还是胡扯？探索关于俄乌战争的开源情报推文

Authors: [Johannes Niu](https://arxiv.org/search/?searchtype=author&query=Johannes Niu), [Mila Stillman](https://arxiv.org/search/?searchtype=author&query=Mila Stillman), [Anna Kruspe](https://arxiv.org/search/?searchtype=author&query=Anna Kruspe) 作者：Johannes Niu，Mila Stillman，Anna Kruspe

This paper examines the role of Open Source Intelligence (OSINT) on Twitter regarding the Russo-Ukrainian war, distinguishing between genuine OSINT and deceptive misinformation efforts, termed “BULLSHINT.” Utilizing a dataset spanning from January 2022 to July 2023, we analyze nearly 2 million tweets from approximately 1,040 users involved in discussing real-time military engagements, strategic analyses, and misinformation related to the conflict. Using sentiment analysis, partisanship detection, misinformation identification, and Named Entity Recognition (NER), we uncover communicative patterns and dissemination strategies within the OSINT community. Significant findings reveal a predominant negative sentiment influenced by war events, a nuanced distribution of pro-Ukrainian and pro-Russian partisanship, and the potential strategic manipulation of information. Additionally, we apply community detection techniques, which are able to identify distinct clusters partisanship, topics, and misinformation, highlighting the complex dynamics of information spread on social media. This research contributes to the understanding of digital warfare and misinformation dynamics, offering insights into the operationalization of OSINT in geopolitical conflicts. 本文探讨了推特上关于俄乌战争的开源情报（OSINT）作用，区分了真实的 OSINT 与被称为“胡扯情报”（BULLSHINT）的欺骗性虚假信息。利用 2022 年 1 月至 2023 年 7 月的数据集，我们分析了约 1040 名用户发布的近 200 万条推文，这些推文涉及实时军事行动、战略分析及与冲突相关的虚假信息。通过情感分析、党派检测、虚假信息识别和命名实体识别（NER），我们揭示了 OSINT 社区内的交流模式和传播策略。重要发现显示，战争事件导致的负面情绪占主导，亲乌克兰和亲俄罗斯的党派倾向分布复杂，且信息可能被战略性操控。此外，我们应用社区检测技术，成功识别出不同的党派群体、话题和虚假信息集群，凸显了社交媒体上信息传播的复杂动态。这项研究有助于理解数字战争和虚假信息的动态，提供了关于在地缘政治冲突中如何运用开源情报（OSINT）的见解。

Subjects: Social and Information Networks, Computation and Language 主题：社会与信息网络，计算与语言

Publish: 2025-08-05 16:06:36 UTC 发布时间：2025-08-05 16:06:36 UTC

#53 Beyond Meme Templates: Limitations of Visual Similarity Measures in Meme Matching #53 超越表情包模板：视觉相似度度量在表情包匹配中的局限性

Authors: [Muzhaffar Hazman](https://arxiv.org/search/?searchtype=author&query=Muzhaffar Hazman), [Susan McKeever](https://arxiv.org/search/?searchtype=author&query=Susan McKeever), [Josephine Griffith](https://arxiv.org/search/?searchtype=author&query=Josephine Griffith) 作者：Muzhaffar Hazman，Susan McKeever，Josephine Griffith

Internet memes, now a staple of digital communication, play a pivotal role in how users engage within online communities and allow researchers to gain insight into contemporary digital culture. These engaging user-generated content are characterised by their reuse of visual elements also found in other memes. Matching instances of memes via these shared visual elements, called Meme Matching, is the basis of a wealth of meme analysis approaches. However, most existing methods assume that every meme consists of a shared visual background, called a Template, with some overlaid text, thereby limiting meme matching to comparing the background image alone. Current approaches exclude the many memes that are not template-based and limit the effectiveness of automated meme analysis and would not be effective at linking memes to contemporary web-based meme dictionaries. In this work, we introduce a broader formulation of meme matching that extends beyond template matching. We show that conventional similarity measures, including a novel segment-wise computation of the similarity measures, excel at matching template-based memes but fall short when applied to non-template-based meme formats. However, the segment-wise approach was found to consistently outperform the whole-image measures on matching non-template-based memes. Finally, we explore a prompting-based approach using a pretrained Multimodal Large Language Model for meme matching. Our results highlight that accurately matching memes via shared visual elements, not just background templates, remains an open challenge that requires more sophisticated matching techniques. 互联网表情包，作为数字交流的常见元素，在用户在线社区中的互动中起着关键作用，并使研究人员能够洞察当代数字文化。这些引人入胜的用户生成内容的特点是重复使用其他表情包中也出现的视觉元素。通过这些共享的视觉元素匹配表情包实例，称为表情包匹配，是众多表情包分析方法的基础。然而，大多数现有方法假设每个表情包都由一个共享的视觉背景（称为模板）和一些叠加的文本组成，从而将表情包匹配限制为仅比较背景图像。当前的方法排除了许多非基于模板的表情包，限制了自动表情包分析的有效性，也无法有效地将表情包与当代基于网络的表情包词典关联起来。在本研究中，我们提出了一个超越模板匹配的更广泛的表情包匹配定义。我们展示了传统的相似度度量方法，包括一种新颖的分段计算相似度的方法，在匹配基于模板的表情包时表现出色，但在应用于非基于模板的表情包格式时则表现不足。然而，分段方法在匹配非模板化的表情包时，表现出持续优于整图度量的效果。最后，我们探索了一种基于预训练多模态大型语言模型的提示方法用于表情包匹配。我们的结果强调，通过共享的视觉元素而不仅仅是背景模板来准确匹配表情包，仍然是一个开放的挑战，需要更复杂的匹配技术。

Subjects: Computer Vision and Pattern Recognition, Computation and Language 主题：计算机视觉与模式识别，计算与语言

Publish: 2025-08-05 15:31:00 UTC 发布时间：2025-08-05 15:31:00 UTC

#54 PyLate: Flexible Training and Retrieval for Late Interaction Models #54 PyLate：用于后期交互模型的灵活训练与检索

Authors: [Antoine Chaffin](https://arxiv.org/search/?searchtype=author&query=Antoine Chaffin), [Raphaël Sourty](https://arxiv.org/search/?searchtype=author&query=Raphaël Sourty) 作者：Antoine Chaffin，Raphaël Sourty

Neural ranking has become a cornerstone of modern information retrieval. While single vector search remains the dominant paradigm, it suffers from the shortcoming of compressing all the information into a single vector. This compression leads to notable performance degradation in out-of-domain, long-context, and reasoning-intensive retrieval tasks. Multi-vector approaches pioneered by ColBERT aim to address these limitations by preserving individual token embeddings and computing similarity via the MaxSim operator. This architecture has demonstrated superior empirical advantages, including enhanced out-of-domain generalization, long-context handling, and performance in complex retrieval scenarios. Despite these compelling empirical results and clear theoretical advantages, the practical adoption and public availability of late interaction models remain low compared to their single-vector counterparts, primarily due to a lack of accessible and modular tools for training and experimenting with such models. To bridge this gap, we introduce PyLate, a streamlined library built on top of Sentence Transformers to support multi-vector architectures natively, inheriting its efficient training, advanced logging, and automated model card generation while requiring minimal code changes to code templates users are already familiar with. By offering multi-vector-specific features such as efficient indexes, PyLate aims to accelerate research and real-world application of late interaction models, thereby unlocking their full potential in modern IR systems. Finally, PyLate has already enabled the development of state-of-the-art models, including GTE-ModernColBERT and Reason-ModernColBERT, demonstrating its practical utility for both research and production environments. 神经排序已成为现代信息检索的基石。尽管单向量搜索仍然是主流范式，但它存在将所有信息压缩到单一向量中的缺点。这种压缩导致在跨领域、长上下文和需要推理的检索任务中性能显著下降。由 ColBERT 开创的多向量方法旨在通过保留单个词元嵌入并通过 MaxSim 算子计算相似度来解决这些限制。这种架构展现了卓越的经验优势，包括增强的跨领域泛化能力、长上下文处理能力以及复杂检索场景中的表现。尽管这些经验结果令人信服且具有明确的理论优势，但与单向量模型相比，晚期交互模型的实际应用和公开可用性仍然较低，主要原因是缺乏用于训练和实验此类模型的易用且模块化的工具。为弥合这一差距，我们推出了 PyLate，这是一个基于 Sentence Transformers 构建的简化库，原生支持多向量架构，继承了其高效的训练、先进的日志记录和自动化模型卡生成，同时对用户已熟悉的代码模板只需极少的代码更改。通过提供多向量特有的功能，如高效索引，PyLate 旨在加速晚期交互模型的研究和实际应用，从而释放它们在现代信息检索系统中的全部潜力。最后，PyLate 已经促成了包括 GTE-ModernColBERT 和 Reason-ModernColBERT 在内的最先进模型的开发，展示了其在研究和生产环境中的实用价值。

Subjects: Information Retrieval, Computation and Language 主题：信息检索，计算与语言

Publish: 2025-08-05 15:23:40 UTC 发布时间：2025-08-05 15:23:40 UTC

#55 MultiRAG: A Knowledge-guided Framework for Mitigating Hallucination in Multi-source Retrieval Augmented Generation #55 MultiRAG：一个用于缓解多源检索增强生成中幻觉的知识引导框架

Authors: [Wenlong Wu](https://arxiv.org/search/?searchtype=author&query=Wenlong Wu), [Haofen Wang](https://arxiv.org/search/?searchtype=author&query=Haofen Wang), [Bohan Li](https://arxiv.org/search/?searchtype=author&query=Bohan Li), [Peixuan Huang](https://arxiv.org/search/?searchtype=author&query=Peixuan Huang), [Xinzhe Zhao](https://arxiv.org/search/?searchtype=author&query=Xinzhe Zhao), [Lei Liang](https://arxiv.org/search/?searchtype=author&query=Lei Liang) 作者：吴文龙，王浩芬，李博涵，黄培轩，赵新哲，梁磊

Retrieval Augmented Generation (RAG) has emerged as a promising solution to address hallucination issues in Large Language Models (LLMs). However, the integration of multiple retrieval sources, while potentially more informative, introduces new challenges that can paradoxically exacerbate hallucination problems. These challenges manifest primarily in two aspects: the sparse distribution of multi-source data that hinders the capture of logical relationships and the inherent inconsistencies among different sources that lead to information conflicts. To address these challenges, we propose MultiRAG, a novel framework designed to mitigate hallucination in multi-source retrieval-augmented generation through knowledge-guided approaches. Our framework introduces two key innovations: (1) a knowledge construction module that employs multi-source line graphs to efficiently aggregate logical relationships across different knowledge sources, effectively addressing the sparse data distribution issue; and (2) a sophisticated retrieval module that implements a multi-level confidence calculation mechanism, performing both graph-level and node-level assessments to identify and eliminate unreliable information nodes, thereby reducing hallucinations caused by inter-source inconsistencies. Extensive experiments on four multi-domain query datasets and two multi-hop QA datasets demonstrate that MultiRAG significantly enhances the reliability and efficiency of knowledge retrieval in complex multi-source scenarios. \textcolor{blue}{Our code is available in https://github.com/wuwenlong123/MultiRAG. 检索增强生成（RAG）已成为解决大型语言模型（LLMs）幻觉问题的有前景的方案。然而，整合多个检索源虽然可能提供更多信息，却带来了新的挑战，这些挑战反而可能加剧幻觉问题。这些挑战主要体现在两个方面：多源数据的稀疏分布阻碍了逻辑关系的捕捉，以及不同来源之间固有的不一致性导致信息冲突。为了解决这些挑战，我们提出了 MultiRAG，一种通过知识引导方法减轻多源检索增强生成中幻觉的新框架。我们的框架引入了两个关键创新：（1）一个知识构建模块，利用多源线图高效聚合不同知识源之间的逻辑关系，有效解决了数据分布稀疏的问题；（2）一个复杂的检索模块，实施多级置信度计算机制，进行图级和节点级评估，以识别并剔除不可靠的信息节点，从而减少由源间不一致引起的幻觉现象。在四个多领域查询数据集和两个多跳问答数据集上的大量实验表明，MultiRAG 显著提升了复杂多源场景下知识检索的可靠性和效率。\textcolor{blue}{我们的代码可在 https://github.com/wuwenlong123/MultiRAG 获取。

Subjects: Information Retrieval, Computation and Language 主题：信息检索，计算与语言

Publish: 2025-08-05 15:20:52 UTC 发布时间：2025-08-05 15:20:52 UTC

#56 MoKA: Mixture of Kronecker Adapters #56 MoKA：Kronecker 适配器混合体

Authors: [Mohammadreza Sadeghi](https://arxiv.org/search/?searchtype=author&query=Mohammadreza Sadeghi), [Mahsa Ghazvini Nejad](https://arxiv.org/search/?searchtype=author&query=Mahsa Ghazvini Nejad), [MirHamed Jafarzadeh Asl](https://arxiv.org/search/?searchtype=author&query=MirHamed Jafarzadeh Asl), [Yu Gu](https://arxiv.org/search/?searchtype=author&query=Yu Gu), [Yuanhao Yu](https://arxiv.org/search/?searchtype=author&query=Yuanhao Yu), [Masoud Asgharian](https://arxiv.org/search/?searchtype=author&query=Masoud Asgharian), [Vahid Partovi Nia](https://arxiv.org/search/?searchtype=author&query=Vahid Partovi Nia) 作者：Mohammadreza Sadeghi、Mahsa Ghazvini Nejad、MirHamed Jafarzadeh Asl、Yu Gu、Yuanhao Yu、Masoud Asgharian、Vahid Partovi Nia

Parameter-efficient fine-tuning (PEFT) is essential for reducing the computational overhead of large language models (LLMs). Low-rank family adapters are commonly used to control the parameter size efficiently while maintaining the generative power of LLMs. However, their limited expressiveness due to the rank constraint often restricts their performance on complex tasks. We propose Mixture of Kronecker Adapters (MoKA), a new generation of Kronecker adapters that addresses this limitation by modeling weight updates as a mixture of Kronecker products. Our proposed adapter leverages a gating mechanism that measures the importance of each Kronecker factor, enabling more expressive adaptation. Moreover, MoKA enables a rank flexibility that provides a better trade-off between parameter efficiency and accuracy. To ensure hardware efficiency, we reformulate Kronecker computations using standard matrix operations, allowing seamless deployment on GPU-optimized hardware. We conduct extensive experiments on instruction-tuning and commonsense reasoning tasks using low-bit quantized versions of LLaMA2-7B and LLaMA3-8B models. MoKA not only outperforms PEFT baselines, but also reduces the number of trainable parameters up to 27x, achieving state-of-the-art trade-offs between performance and parameter efficiency. 参数高效微调（PEFT）对于降低大型语言模型（LLMs）的计算开销至关重要。低秩系列适配器通常用于高效控制参数规模，同时保持 LLMs 的生成能力。然而，由于秩的限制，其表达能力有限，常常限制了其在复杂任务上的表现。我们提出了 Kronecker 适配器的新一代——Kronecker 混合适配器（MoKA），通过将权重更新建模为 Kronecker 积的混合来解决这一限制。我们提出的适配器利用门控机制来衡量每个 Kronecker 因子的权重重要性，从而实现更具表现力的适配。此外，MoKA 实现了秩的灵活性，在参数效率和准确性之间提供了更好的权衡。为了确保硬件效率，我们使用标准矩阵运算重新构造了 Kronecker 计算，使其能够无缝部署于 GPU 优化硬件上。我们在指令微调和常识推理任务上，使用低比特量化的 LLaMA2-7B 和 LLaMA3-8B 模型进行了大量实验。 MoKA 不仅优于 PEFT 基线方法，还将可训练参数数量减少了多达 27 倍，实现了性能与参数效率之间的最先进权衡。

Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题：机器学习，人工智能，计算与语言

Publish: 2025-08-05 14:58:14 UTC 发布时间：2025-08-05 14:58:14 UTC

#57 Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning #57 使用强化学习训练长上下文、多轮软件工程代理

Research on applications of Reinforcement Learning (RL) to Large Language Models (LLMs) has mostly been focused on single-turn problems, such as mathematical reasoning or single-shot code generation. While these problems can be viewed as token-level multi-turn MDPs, this view corresponds to a degenerate case of multi-turn interaction where the environment provides no feedback. This contrasts with many real-world domains, such as software engineering (SWE), which require rich multi-turn interactions with a stateful environment that responds to each action with a non-trivial observation. To bridge this gap, we demonstrate the successful application of RL to this general regime. Using a modified Decoupled Advantage Policy Optimization (DAPO) algorithm, we train an agent based on Qwen2.5-72B-Instruct to solve real-world software engineering tasks. Our approach increases the agent’s success rate on the SWE-bench Verified benchmark from a 20% rejection fine-tuned baseline to 39%, without relying on any teacher models. On SWE-rebench, our agent matches or outperforms leading open-weight models such as DeepSeek-V3-0324 and Qwen3-235B-A22B using an identical scaffolding, offering a viable path toward building more capable autonomous agents for complex real-world problems based on open models. 强化学习（RL）在大型语言模型（LLMs）上的应用研究大多集中在单轮问题上，如数学推理或一次性代码生成。虽然这些问题可以被视为基于 token 的多轮马尔可夫决策过程（MDP），但这种视角对应的是多轮交互的退化情况，即环境不提供反馈。这与许多现实世界领域形成对比，例如软件工程（SWE），后者需要与有状态环境进行丰富的多轮交互，环境会对每个动作做出非平凡的观察反馈。为弥合这一差距，我们展示了 RL 在这一通用范式中的成功应用。通过使用改进的解耦优势策略优化（DAPO）算法，我们基于 Qwen2.5-72B-Instruct 训练了一个代理来解决现实世界的软件工程任务。我们的方法使代理在 SWE-bench Verified 基准上的成功率从 20%的拒绝微调基线提升至 39%，且无需依赖任何教师模型。在 SWE-rebench 上，我们的代理在使用相同框架的情况下，表现与领先的开源权重模型如 DeepSeek-V3-0324 和 Qwen3-235B-A22B 相当或更优，提供了一条基于开源模型构建更强大自主代理以解决复杂现实问题的可行路径。

Subjects: Machine Learning, Computation and Language, Software Engineering 主题：机器学习，计算与语言，软件工程

Publish: 2025-08-05 14:30:47 UTC 发布时间：2025-08-05 14:30:47 UTC

#58 Draw Your Mind: Personalized Generation via Condition-Level Modeling in Text-to-Image Diffusion Models #58 画出你的心智：通过条件级建模实现文本到图像扩散模型的个性化生成

Authors: [Hyungjin Kim](https://arxiv.org/search/?searchtype=author&query=Hyungjin Kim), [Seokho Ahn](https://arxiv.org/search/?searchtype=author&query=Seokho Ahn), [Young-Duk Seo](https://arxiv.org/search/?searchtype=author&query=Young-Duk Seo) 作者：Hyungjin Kim，Seokho Ahn，Young-Duk Seo

Personalized generation in T2I diffusion models aims to naturally incorporate individual user preferences into the generation process with minimal user intervention. However, existing studies primarily rely on prompt-level modeling with large-scale models, often leading to inaccurate personalization due to the limited input token capacity of T2I diffusion models. To address these limitations, we propose DrUM, a novel method that integrates user profiling with a transformer-based adapter to enable personalized generation through condition-level modeling in the latent space. DrUM demonstrates strong performance on large-scale datasets and seamlessly integrates with open-source text encoders, making it compatible with widely used foundation T2I models without requiring additional fine-tuning. T2I 扩散模型中的个性化生成旨在以最少的用户干预自然地将个体用户偏好融入生成过程。然而，现有研究主要依赖于大规模模型的提示级建模，常因 T2I 扩散模型输入令牌容量有限而导致个性化不准确。为解决这些限制，我们提出了 DrUM，一种将用户画像与基于 Transformer 的适配器相结合的方法，通过潜在空间中的条件级建模实现个性化生成。DrUM 在大规模数据集上表现出色，并能无缝集成开源文本编码器，使其兼容广泛使用的基础 T2I 模型，无需额外微调。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language 主题：计算机视觉与模式识别，人工智能，计算与语言

Publish: 2025-08-05 14:14:55 UTC 发布时间：2025-08-05 14:14:55 UTC

#59 A Comparative Study of Neurosymbolic AI Approaches to Interpretable Logical Reasoning #59 神经符号 AI 方法在可解释逻辑推理中的比较研究

Author: [Michael K. Chen](https://arxiv.org/search/?searchtype=author&query=Michael K. Chen) 作者：Michael K. Chen

General logical reasoning, defined as the ability to reason deductively on domain-agnostic tasks, continues to be a challenge for large language models (LLMs). Current LLMs fail to reason deterministically and are not interpretable. As such, there has been a recent surge in interest in neurosymbolic AI, which attempts to incorporate logic into neural networks. We first identify two main neurosymbolic approaches to improving logical reasoning: (i) the integrative approach comprising models where symbolic reasoning is contained within the neural network, and (ii) the hybrid approach comprising models where a symbolic solver, separate from the neural network, performs symbolic reasoning. Both contain AI systems with promising results on domain-specific logical reasoning benchmarks. However, their performance on domain-agnostic benchmarks is understudied. To the best of our knowledge, there has not been a comparison of the contrasting approaches that answers the following question: Which approach is more promising for developing general logical reasoning? To analyze their potential, the following best-in-class domain-agnostic models are introduced: Logic Neural Network (LNN), which uses the integrative approach, and LLM-Symbolic Solver (LLM-SS), which uses the hybrid approach. Using both models as case studies and representatives of each approach, our analysis demonstrates that the hybrid approach is more promising for developing general logical reasoning because (i) its reasoning chain is more interpretable, and (ii) it retains the capabilities and advantages of existing LLMs. To support future works using the hybrid approach, we propose a generalizable framework based on LLM-SS that is modular by design, model-agnostic, domain-agnostic, and requires little to no human input. 通用逻辑推理，被定义为在与领域无关的任务上进行演绎推理的能力，仍然是大型语言模型（LLMs）面临的挑战。目前的 LLMs 无法进行确定性推理，且缺乏可解释性。因此，近年来神经符号人工智能（neurosymbolic AI）引起了广泛关注，该领域试图将逻辑融入神经网络中。我们首先确定了两种主要的神经符号方法来提升逻辑推理能力：（i）整合方法，即符号推理包含在神经网络内部的模型；（ii）混合方法，即由独立于神经网络的符号求解器执行符号推理的模型。这两种方法均包含在特定领域逻辑推理基准测试中表现出良好结果的 AI 系统。然而，它们在与领域无关的基准测试上的表现尚未得到充分研究。据我们所知，尚无对这两种对比方法的比较，来回答以下问题：哪种方法更有希望发展通用逻辑推理？为了分析它们的潜力，介绍了以下顶级的领域无关模型：采用整合方法的逻辑神经网络（LNN）和采用混合方法的 LLM-符号求解器（LLM-SS）。以这两种模型作为案例研究和各自方法的代表，我们的分析表明，混合方法在发展通用逻辑推理方面更具前景，因为（i）其推理链更具可解释性，且（ii）它保留了现有 LLMs 的能力和优势。为了支持未来采用混合方法的工作，我们提出了一个基于 LLM-SS 的可泛化框架，该框架设计模块化、模型无关、领域无关，且几乎不需要人工输入。

Subjects: Artificial Intelligence, Computation and Language, Machine Learning, Symbolic Computation 主题：人工智能，计算与语言，机器学习，符号计算

Publish: 2025-08-05 12:14:32 UTC 发布时间：2025-08-05 12:14:32 UTC

#60 VLMQ: Efficient Post-Training Quantization for Large Vision-Language Models via Hessian Augmentation #60 VLMQ：通过 Hessian 增强实现大规模视觉语言模型的高效后训练量化

Authors: [Yufei Xue](https://arxiv.org/search/?searchtype=author&query=Yufei Xue), [Yushi Huang](https://arxiv.org/search/?searchtype=author&query=Yushi Huang), [Jiawei Shao](https://arxiv.org/search/?searchtype=author&query=Jiawei Shao), [Jun Zhang](https://arxiv.org/search/?searchtype=author&query=Jun Zhang) 作者：薛宇飞，黄雨诗，邵嘉伟，张军

Post-training quantization (PTQ) has emerged as an effective approach for compressing large models and accelerating their inference without retraining. While PTQ has been extensively studied in the context of large language models (LLMs), its applicability to vision-language models (VLMs) remains underexplored. In this paper, we identify a modality discrepancy (\emph{i.e.}, limited text tokens \emph{vs.} excessive and redundant vision tokens) of VLMs. However, existing Hessian-based LLM PTQ methods treat all tokens equally during quantization, resulting in severe performance drops when applied to VLMs. Motivated by this observation, we propose a novel importance-aware PTQ framework tailored for VLMs, dubbed VLMQ. Specifically, to address vision token redundancy, VLMQ 1) optimizes an importance-aware objective that yields an enhanced Hessian with token-level importance factors, while retaining compatibility with parallelized weight updates, and 2) ensures efficiency and effectiveness by computing these factors via a single lightweight block-wise backward pass, guided by a theoretical connection to token-level perturbations. Extensive evaluations on 8 benchmarks across 0.5B∼32B VLMs demonstrate the state-of-the-art (SOTA) performance of our VLMQ, particularly under low-bit settings. For example, it achieves a substantial \textbf{16.45%} improvement on MME-RealWorld under 2-bit quantization. 后训练量化（PTQ）已成为压缩大型模型和加速推理而无需重新训练的有效方法。虽然 PTQ 在大型语言模型（LLMs）中得到了广泛研究，但其在视觉语言模型（VLMs）中的适用性仍未被充分探索。本文中，我们发现了 VLMs 存在模态差异（即，有限的文本标记与过多且冗余的视觉标记）。然而，现有基于 Hessian 的大型语言模型 PTQ 方法在量化时对所有标记一视同仁，导致应用于 VLMs 时性能大幅下降。基于这一观察，我们提出了一种针对 VLMs 的新颖重要性感知 PTQ 框架，称为 VLMQ。具体而言，为了解决视觉标记冗余问题，VLMQ 1）优化了一个重要性感知目标，生成带有标记级重要性因子的增强 Hessian，同时保持与并行权重更新的兼容性；2）通过单次轻量级的分块反向传播计算这些因子，确保效率和效果，并由与标记级扰动的理论联系指导。在 0.5B 至 32B 规模的视觉语言模型（VLM）上，针对 8 个基准进行了广泛评估，结果表明我们的 VLMQ 在低位宽设置下表现出最先进（SOTA）的性能。例如，在 2 位量化下，它在 MME-RealWorld 上实现了显著的\textbf{16.45%}提升。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language 主题：计算机视觉与模式识别，人工智能，计算与语言

Publish: 2025-08-05 11:57:03 UTC 发布时间：2025-08-05 11:57:03 UTC

#61 Reliable Evaluation Protocol for Low-Precision Retrieval #61 低精度检索的可靠评估协议

Authors: [Kisu Yang](https://arxiv.org/search/?searchtype=author&query=Kisu Yang), [Yoonna Jang](https://arxiv.org/search/?searchtype=author&query=Yoonna Jang), [Hwanseok Jang](https://arxiv.org/search/?searchtype=author&query=Hwanseok Jang), [Kenneth Choi](https://arxiv.org/search/?searchtype=author&query=Kenneth Choi), [Isabelle Augenstein](https://arxiv.org/search/?searchtype=author&query=Isabelle Augenstein), [Heuiseok Lim](https://arxiv.org/search/?searchtype=author&query=Heuiseok Lim) 作者：Kisu Yang、Yoonna Jang、Hwanseok Jang、Kenneth Choi、Isabelle Augenstein、Heuiseok Lim

Lowering the numerical precision of model parameters and computations is widely adopted to improve the efficiency of retrieval systems. However, when computing relevance scores between the query and documents in low-precision, we observe spurious ties due to the reduced granularity. This introduces high variability in the results based on tie resolution, making the evaluation less reliable. To address this, we propose a more robust retrieval evaluation protocol designed to reduce score variation. It consists of: (1) High-Precision Scoring (HPS), which upcasts the final scoring step to higher precision to resolve tied candidates with minimal computational cost; and (2) Tie-aware Retrieval Metrics (TRM), which report expected scores, range, and bias to quantify order uncertainty of tied candidates. Our experiments test multiple models with three scoring functions on two retrieval datasets to demonstrate that HPS dramatically reduces tie-induced instability, and TRM accurately recovers expected metric values. This combination enables a more consistent and reliable evaluation system for lower-precision retrievals. 降低模型参数和计算的数值精度被广泛采用以提高检索系统的效率。然而，在低精度下计算查询与文档之间的相关性得分时，我们观察到由于粒度降低而产生的虚假平局。这导致基于平局解决方案的结果具有较高的变异性，使得评估不够可靠。为了解决这一问题，我们提出了一种更稳健的检索评估协议，旨在减少得分的波动。该协议包括：（1）高精度评分（HPS），将最终评分步骤提升到更高精度，以最低的计算成本解决平局候选项；（2）平局感知检索指标（TRM），报告期望得分、范围和偏差，以量化平局候选项的排序不确定性。我们的实验在两个检索数据集上使用三种评分函数测试多个模型，结果表明 HPS 显著减少了由平局引起的不稳定性，TRM 准确恢复了期望的指标值。该组合实现了对低精度检索更一致且可靠的评估系统。

Subjects: Information Retrieval, Artificial Intelligence, Computation and Language 主题：信息检索，人工智能，计算与语言

Publish: 2025-08-05 10:27:57 UTC 发布时间：2025-08-05 10:27:57 UTC

#62 Understanding the Embedding Models on Hyper-relational Knowledge Graph #62 理解超关系知识图上的嵌入模型

Authors: [Yubo Wang](https://arxiv.org/search/?searchtype=author&query=Yubo Wang), [Shimin Di](https://arxiv.org/search/?searchtype=author&query=Shimin Di), [Zhili Wang](https://arxiv.org/search/?searchtype=author&query=Zhili Wang), [Haoyang Li](https://arxiv.org/search/?searchtype=author&query=Haoyang Li), [Fei Teng](https://arxiv.org/search/?searchtype=author&query=Fei Teng), [Hao Xin](https://arxiv.org/search/?searchtype=author&query=Hao Xin), [Lei Chen](https://arxiv.org/search/?searchtype=author&query=Lei Chen) 作者：王宇博，狄世敏，王志立，李浩洋，滕飞，辛昊，陈磊

Recently, Hyper-relational Knowledge Graphs (HKGs) have been proposed as an extension of traditional Knowledge Graphs (KGs) to better represent real-world facts with additional qualifiers. As a result, researchers have attempted to adapt classical Knowledge Graph Embedding (KGE) models for HKGs by designing extra qualifier processing modules. However, it remains unclear whether the superior performance of Hyper-relational KGE (HKGE) models arises from their base KGE model or the specially designed extension module. Hence, in this paper, we data-wise convert HKGs to KG format using three decomposition methods and then evaluate the performance of several classical KGE models on HKGs. Our results show that some KGE models achieve performance comparable to that of HKGE models. Upon further analysis, we find that the decomposition methods alter the original HKG topology and fail to fully preserve HKG information. Moreover, we observe that current HKGE models are either insufficient in capturing the graph’s long-range dependency or struggle to integrate main-triple and qualifier information due to the information compression issue. To further justify our findings and offer a potential direction for future HKGE research, we propose the FormerGNN framework. This framework employs a qualifier integrator to preserve the original HKG topology, and a GNN-based graph encoder to capture the graph’s long-range dependencies, followed by an improved approach for integrating main-triple and qualifier information to mitigate compression issues. Our experimental results demonstrate that FormerGNN outperforms existing HKGE models. 近年来，超关系知识图谱（HKGs）作为传统知识图谱（KGs）的扩展被提出，以通过额外的限定词更好地表示现实世界的事实。因此，研究人员尝试通过设计额外的限定词处理模块来调整经典的知识图谱嵌入（KGE）模型以适应 HKGs。然而，目前尚不清楚超关系 KGE（HKGE）模型的优越性能是源自其基础 KGE 模型，还是特别设计的扩展模块。因此，本文采用三种分解方法将 HKGs 在数据层面转换为 KG 格式，并评估了若干经典 KGE 模型在 HKGs 上的表现。我们的结果显示，一些 KGE 模型的性能可与 HKGE 模型相媲美。进一步分析发现，分解方法改变了原始 HKG 的拓扑结构，未能完全保留 HKG 的信息。此外，我们观察到当前的 HKGE 模型要么无法充分捕捉图的长程依赖，要么由于信息压缩问题难以整合主三元组和限定词信息。为了进一步验证我们的发现并为未来的 HKGE 研究提供潜在方向，我们提出了 FormerGNN 框架。该框架采用限定符整合器以保留原始 HKG 拓扑结构，并使用基于 GNN 的图编码器捕捉图的长距离依赖，随后通过改进的方法整合主三元组和限定符信息，以缓解压缩问题。我们的实验结果表明，FormerGNN 优于现有的 HKGE 模型。

Subjects: Machine Learning, Computation and Language, Social and Information Networks 主题：机器学习，计算与语言，社会与信息网络

Publish: 2025-08-05 09:59:02 UTC 发布时间：2025-08-05 09:59:02 UTC

#63 ChartCap: Mitigating Hallucination of Dense Chart Captioning #63 ChartCap：缓解密集图表标题的幻觉问题

Authors: [Junyoung Lim](https://arxiv.org/search/?searchtype=author&query=Junyoung Lim), [Jaewoo Ahn](https://arxiv.org/search/?searchtype=author&query=Jaewoo Ahn), [Gunhee Kim](https://arxiv.org/search/?searchtype=author&query=Gunhee Kim) 作者：Junyoung Lim，Jaewoo Ahn，Gunhee Kim

Generating accurate, informative, and hallucination-free captions for charts remains challenging for vision language models, primarily due to the lack of large-scale, high-quality datasets of real-world charts. However, existing real-world chart datasets suffer from the inclusion of extraneous information that cannot be inferred from the chart and failure to sufficiently capture structural elements and key insights. Therefore, we introduce ChartCap, a large-scale dataset of 565K real-world chart images paired with type-specific, dense captions that exclude extraneous information and highlight both structural elements and key insights in detail. To build ChartCap, we design a four-stage pipeline that generates captions using only the discernible data from the chart and employ a cycle consistency-based human verification, which accelerates quality control without sacrificing accuracy. Additionally, we propose a novel metric, the Visual Consistency Score, which evaluates caption quality by measuring the similarity between the chart regenerated from a caption and the original chart, independent of reference captions. Extensive experiments confirms that models fine-tuned on ChartCap consistently generate more accurate and informative captions with reduced hallucinations, surpassing both open-source and proprietary models and even human-annotated captions. 为图表生成准确、信息丰富且无幻觉的描述仍然是视觉语言模型面临的挑战，主要原因是缺乏大规模、高质量的真实世界图表数据集。然而，现有的真实世界图表数据集存在包含无法从图表中推断出的多余信息的问题，且未能充分捕捉结构元素和关键信息。因此，我们引入了 ChartCap，这是一个包含 56.5 万张真实世界图表图像的大规模数据集，配有类型特定的密集描述，排除了多余信息，并详细突出结构元素和关键信息。为了构建 ChartCap，我们设计了一个四阶段流程，仅使用图表中可辨识的数据生成描述，并采用基于循环一致性的人类验证方法，加速质量控制而不牺牲准确性。此外，我们提出了一种新颖的指标——视觉一致性评分，通过测量从描述再生成的图表与原始图表之间的相似度来评估描述质量，该方法不依赖参考描述。大量实验确认，在 ChartCap 上微调的模型始终能够生成更准确且信息量更丰富的描述，且幻觉现象减少，表现优于开源和专有模型，甚至超过人工标注的描述。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language 主题：计算机视觉与模式识别，人工智能，计算与语言

Publish: 2025-08-05 07:09:07 UTC 发布时间：2025-08-05 07:09:07 UTC

#64 Toward Verifiable Misinformation Detection: A Multi-Tool LLM Agent Framework #64 面向可验证的错误信息检测：多工具 LLM 代理框架

Authors: [Zikun Cui](https://arxiv.org/search/?searchtype=author&query=Zikun Cui), [Tianyi Huang](https://arxiv.org/search/?searchtype=author&query=Tianyi Huang), [Chia-En Chiang](https://arxiv.org/search/?searchtype=author&query=Chia-En Chiang), [Cuiqianhe Du](https://arxiv.org/search/?searchtype=author&query=Cuiqianhe Du) 作者：崔子坤，黄天一，姜家恩，杜翠倩

With the proliferation of Large Language Models (LLMs), the detection of misinformation has become increasingly important and complex. This research proposes an innovative verifiable misinformation detection LLM agent that goes beyond traditional true/false binary judgments. The agent actively verifies claims through dynamic interaction with diverse web sources, assesses information source credibility, synthesizes evidence, and provides a complete verifiable reasoning process. Our designed agent architecture includes three core tools: precise web search tool, source credibility assessment tool and numerical claim verification tool. These tools enable the agent to execute multi-step verification strategies, maintain evidence logs, and form comprehensive assessment conclusions. We evaluate using standard misinformation datasets such as FakeNewsNet, comparing with traditional machine learning models and LLMs. Evaluation metrics include standard classification metrics, quality assessment of reasoning processes, and robustness testing against rewritten content. Experimental results show that our agent outperforms baseline methods in misinformation detection accuracy, reasoning transparency, and resistance to information rewriting, providing a new paradigm for trustworthy AI-assisted fact-checking. 随着大型语言模型（LLMs）的普及，错误信息的检测变得日益重要且复杂。本研究提出了一种创新的可验证错误信息检测 LLM 代理，超越了传统的真假二元判断。该代理通过与多样化的网络资源动态交互，主动验证声明，评估信息来源的可信度，综合证据，并提供完整的可验证推理过程。我们设计的代理架构包括三个核心工具：精准网络搜索工具、来源可信度评估工具和数值声明验证工具。这些工具使代理能够执行多步骤验证策略，维护证据日志，并形成全面的评估结论。我们使用 FakeNewsNet 等标准错误信息数据集进行评估，并与传统机器学习模型和 LLMs 进行比较。评估指标包括标准分类指标、推理过程的质量评估以及针对重写内容的鲁棒性测试。实验结果表明，我们的智能体在错误信息检测准确性、推理透明度以及抵抗信息篡改方面均优于基线方法，为可信赖的人工智能辅助事实核查提供了一种新范式。

Subjects: Artificial Intelligence, Computation and Language 主题：人工智能，计算与语言

Publish: 2025-08-05 05:15:03 UTC 发布时间：2025-08-05 05:15:03 UTC

#65 VRPO: Rethinking Value Modeling for Robust RL Training under Noisy Supervision #65 VRPO：在噪声监督下重新思考鲁棒强化学习训练中的价值建模

Reinforcement Learning from Human Feedback (RLHF) often suffers from noisy or imperfect reward supervision in real-world settings, which undermines policy stability and generalization. Such noise may cause models to lose attention on key words during advantage estimation. While prior work focuses on reward denoising or filtering poor data, it often overlooks the critical role of the value model in policy optimization. In this work, we show that a strong value model is essential for mitigating noise by absorbing unstable signals and enabling more reliable advantage estimation. We propose VRPO, a value-centric framework for robust PPO training under noisy supervision. VRPO combines two core designs: (1) an auxiliary loss guided by entropy and perplexity from a frozen language model, and (2) a variational information bottleneck. These mechanisms enhance the value model’s ability to filter out noise and capture key words from the context during advantage estimation, transforming it from a passive predictor into an active regulator of noise. Experiments on math reasoning, science QA, and multi-turn dialogue, under both rule-based and model-based noisy rewards, show that VRPO consistently outperforms PPO and GRPO baselines. Our findings underscore the often-overlooked importance of the value model in RLHF and offer a principled and practical approach to robust policy optimization in noisy real-world environments. 从人类反馈中进行强化学习（RLHF）在现实环境中常常受到噪声或不完美奖励监督的影响，这削弱了策略的稳定性和泛化能力。这种噪声可能导致模型在优势估计过程中忽视关键词。尽管以往的工作侧重于奖励去噪或过滤劣质数据，但往往忽视了价值模型在策略优化中的关键作用。在本工作中，我们展示了强大的价值模型对于通过吸收不稳定信号并实现更可靠的优势估计来缓解噪声的重要性。我们提出了 VRPO，一种在噪声监督下用于稳健 PPO 训练的以价值为中心的框架。VRPO 结合了两个核心设计：（1）由冻结语言模型的熵和困惑度引导的辅助损失；（2）变分信息瓶颈。这些机制增强了价值模型在优势估计过程中过滤噪声和捕捉上下文关键词的能力，使其从被动预测器转变为噪声的主动调节器。在数学推理、科学问答和多轮对话的实验中，无论是在基于规则还是基于模型的噪声奖励下，VRPO 均持续优于 PPO 和 GRPO 基线。我们的研究结果强调了价值模型在 RLHF 中常被忽视的重要性，并为在噪声真实环境中实现稳健的策略优化提供了一种有原则且实用的方法。

Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题：机器学习，人工智能，计算与语言

Publish: 2025-08-05 04:05:15 UTC 发布时间：2025-08-05 04:05:15 UTC

#66 AGENTiGraph: A Multi-Agent Knowledge Graph Framework for Interactive, Domain-Specific LLM Chatbots #66 AGENTiGraph：一个用于交互式领域特定 LLM 聊天机器人的多智能体知识图谱框架

AGENTiGraph is a user-friendly, agent-driven system that enables intuitive interaction and management of domain-specific data through the manipulation of knowledge graphs in natural language. It gives non-technical users a complete, visual solution to incrementally build and refine their knowledge bases, allowing multi-round dialogues and dynamic updates without specialized query languages. The flexible design of AGENTiGraph, including intent classification, task planning, and automatic knowledge integration, ensures seamless reasoning between diverse tasks. Evaluated on a 3,500-query benchmark within an educational scenario, the system outperforms strong zero-shot baselines (achieving 95.12% classification accuracy, 90.45% execution success), indicating potential scalability to compliance-critical or multi-step queries in legal and medical domains, e.g., incorporating new statutes or research on the fly. Our open-source demo offers a powerful new paradigm for multi-turn enterprise knowledge management that bridges LLMs and structured graphs. AGENTiGraph 是一个用户友好的、由代理驱动的系统，通过自然语言操作知识图，实现对特定领域数据的直观交互和管理。它为非技术用户提供了一个完整的可视化解决方案，能够逐步构建和完善他们的知识库，支持多轮对话和动态更新，无需使用专业查询语言。AGENTiGraph 的灵活设计包括意图分类、任务规划和自动知识整合，确保不同任务之间的无缝推理。在一个包含 3500 个查询的教育场景基准测试中，该系统优于强大的零样本基线（实现了 95.12%的分类准确率和 90.45%的执行成功率），显示出其在法律和医疗等合规关键或多步骤查询中的潜在可扩展性，例如能够即时整合新法规或研究成果。我们的开源演示提供了一种强大的新范式，用于连接 LLMs 和结构化图的多轮企业知识管理。

Subjects: Artificial Intelligence, Computation and Language 主题：人工智能，计算与语言

Publish: 2025-08-05 01:55:06 UTC 发布时间：2025-08-05 01:55:06 UTC

#67 Unified Tool Integration for LLMs: A Protocol-Agnostic Approach to Function Calling #67 LLMs 的统一工具集成：一种与协议无关的函数调用方法

Authors: [Peng Ding](https://arxiv.org/search/?searchtype=author&query=Peng Ding), [Rick Stevens](https://arxiv.org/search/?searchtype=author&query=Rick Stevens) 作者：丁鹏，Rick Stevens

The proliferation of tool-augmented Large Language Models (LLMs) has created a fragmented ecosystem where developers must navigate multiple protocols, manual schema definitions, and complex execution workflows. We address this challenge by proposing a unified approach to tool integration that abstracts protocol differences while optimizing execution performance. Our solution demonstrates how protocol-agnostic design principles can significantly reduce development overhead through automated schema generation, dual-mode concurrent execution, and seamless multi-source tool management. Experimental results show 60-80% code reduction across integration scenarios, performance improvements up to 3.1x through optimized concurrency, and full compatibility with existing function calling standards. This work contributes both theoretical insights into tool integration architecture and practical solutions for real-world LLM application development. 工具增强型大型语言模型（LLMs）的普及催生了一个碎片化的生态系统，开发者必须应对多种协议、手动定义模式以及复杂的执行流程。我们通过提出一种统一的工具集成方法来解决这一挑战，该方法抽象了协议差异，同时优化了执行性能。我们的解决方案展示了如何通过自动模式生成、双模式并发执行以及无缝的多源工具管理，显著降低开发负担。实验结果表明，在各种集成场景中代码量减少了 60-80%，通过优化并发性能提升最高达 3.1 倍，并且完全兼容现有的函数调用标准。该工作不仅为工具集成架构提供了理论见解，也为实际的 LLM 应用开发带来了实用解决方案。

Subjects: Artificial Intelligence, Computation and Language, Machine Learning 主题：人工智能，计算与语言，机器学习

Publish: 2025-08-05 01:06:49 UTC 发布：2025-08-05 01:06:49 UTC

#68 Defend LLMs Through Self-Consciousness #68 通过自我意识防御 LLMs

Authors: [Boshi Huang](https://arxiv.org/search/?searchtype=author&query=Boshi Huang), [Fabio Nonato de Paula](https://arxiv.org/search/?searchtype=author&query=Fabio Nonato de Paula) 作者：黄博仕，Fabio Nonato de Paula

This paper introduces a novel self-consciousness defense mechanism for Large Language Models (LLMs) to combat prompt injection attacks. Unlike traditional approaches that rely on external classifiers, our method leverages the LLM’s inherent reasoning capabilities to perform self-protection. We propose a framework that incorporates Meta-Cognitive and Arbitration Modules, enabling LLMs to evaluate and regulate their own outputs autonomously. Our approach is evaluated on seven state-of-the-art LLMs using two datasets: AdvBench and Prompt-Injection-Mixed-Techniques-2024. Experiment results demonstrate significant improvements in defense success rates across models and datasets, with some achieving perfect and near-perfect defense in Enhanced Mode. We also analyze the trade-off between defense success rate improvement and computational overhead. This self-consciousness method offers a lightweight, cost-effective solution for enhancing LLM ethics, particularly beneficial for GenAI use cases across various platforms. 本文提出了一种针对大型语言模型（LLMs）的新型自我意识防御机制，以抵御提示注入攻击。不同于依赖外部分类器的传统方法，我们的方法利用 LLM 固有的推理能力进行自我保护。我们提出了一个包含元认知模块和仲裁模块的框架，使 LLMs 能够自主评估和调控自身输出。我们在七个最先进的 LLMs 上，使用 AdvBench 和 Prompt-Injection-Mixed-Techniques-2024 两个数据集对该方法进行了评估。实验结果显示，该方法在各模型和数据集上的防御成功率显著提升，其中部分模型在增强模式下实现了完美或近乎完美的防御效果。我们还分析了防御成功率提升与计算开销之间的权衡。该自我意识方法为提升 LLM 伦理性提供了一种轻量且成本效益高的解决方案，特别适用于跨平台的生成式人工智能（GenAI）应用场景。

Subjects: Artificial Intelligence, Computation and Language, Cryptography and Security 主题：人工智能，计算与语言，密码学与安全

Publish: 2025-08-04 23:52:15 UTC 发布时间：2025-08-04 23:52:15 UTC

#69 Following Route Instructions using Large Vision-Language Models: A Comparison between Low-level and Panoramic Action Spaces #69 使用大型视觉语言模型遵循路线指令：低级与全景动作空间的比较

Authors: [Vebjørn Haug Kåsene](https://arxiv.org/search/?searchtype=author&query=Vebjørn Haug Kåsene), [Pierre Lison](https://arxiv.org/search/?searchtype=author&query=Pierre Lison) 作者：Vebjørn Haug Kåsene，Pierre Lison

Vision-and-Language Navigation (VLN) refers to the task of enabling autonomous robots to navigate unfamiliar environments by following natural language instructions. While recent Large Vision-Language Models (LVLMs) have shown promise in this task, most current VLM systems rely on models specifically designed and optimized for navigation, leaving the potential of off-the-shelf LVLMs underexplored. Furthermore, while older VLN approaches used low-level action spaces with egocentric views and atomic actions (such as “turn left” or “move forward”), newer models tend to favor panoramic action spaces with discrete navigable viewpoints. This paper investigates (1) whether off-the-shelf LVLMs (fine-tuned without architectural modifications or simulator-based training) can effectively support VLN tasks and (2) whether such models can support both low-level and panoramic action paradigms. To this end, we fine-tune the open-source model Qwen2.5-VL-3B-Instruct on the Room-to-Room (R2R) dataset and evaluate its empirical performance across both low-level and panoramic action spaces. The best resulting model achieves a 41% success rate on the R2R test set, demonstrating that while off-the-shelf LVLMs can learn to perform Vision-and-Language Navigation, they still lag behind models specifically designed for this task. 视觉与语言导航（VLN）指的是使自主机器人能够通过遵循自然语言指令在陌生环境中导航的任务。尽管近期的大型视觉语言模型（LVLMs）在该任务中展现出潜力，但大多数现有的视觉语言模型系统依赖于专门设计和优化用于导航的模型，尚未充分挖掘现成 LVLMs 的潜力。此外，早期的 VLN 方法使用带有自我中心视角和原子动作（如“向左转”或“向前移动”）的低级动作空间，而较新的模型则倾向于采用带有离散可导航视点的全景动作空间。本文探讨了（1）现成的 LVLMs（在未进行架构修改或基于模拟器训练的情况下微调）是否能有效支持 VLN 任务，以及（2）此类模型是否能支持低级和全景两种动作范式。为此，我们在 Room-to-Room（R2R）数据集上微调了开源模型 Qwen2.5-VL-3B-Instruct，并评估了其在低级和全景动作空间中的实际表现。最佳模型在 R2R 测试集上达到了 41%的成功率，表明虽然现成的 LVLMs 可以学习执行视觉与语言导航任务，但它们仍然落后于专门为该任务设计的模型。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Robotics 主题：计算机视觉与模式识别，人工智能，计算与语言，机器人学

Publish: 2025-08-04 21:45:21 UTC 发布时间：2025-08-04 21:45:21 UTC

#70 VisuCraft: Enhancing Large Vision-Language Models for Complex Visual-Guided Creative Content Generation via Structured Information Extraction #70 VisuCraft：通过结构化信息提取增强大型视觉语言模型以实现复杂视觉引导的创意内容生成

Authors: [Rongxin Jiang](https://arxiv.org/search/?searchtype=author&query=Rongxin Jiang), [Robert Long](https://arxiv.org/search/?searchtype=author&query=Robert Long), [Chenghao Gu](https://arxiv.org/search/?searchtype=author&query=Chenghao Gu), [Mingrui Yan](https://arxiv.org/search/?searchtype=author&query=Mingrui Yan) 作者：姜荣鑫，罗伯特·朗，顾承浩，闫明睿

This paper introduces VisuCraft, a novel framework designed to significantly enhance the capabilities of Large Vision-Language Models (LVLMs) in complex visual-guided creative content generation. Existing LVLMs often exhibit limitations in maintaining high visual fidelity, genuine creativity, and precise adherence to nuanced user instructions when generating long-form texts. VisuCraft addresses these challenges by integrating a multimodal structured information extractor (E) and a dynamic prompt generation module (G). The extractor distills fine-grained visual attributes from input images into a rich, structured representation, which the dynamic prompt module then combines with user instructions to create highly optimized prompts for underlying LVLMs (e.g., LLaVA, InstructBLIP). Evaluated on the self-constructed ImageStoryGen-500K dataset using VisuGen Metrics (Visual Grounding, Creativity, and Instruction Adherence), VisuCraft consistently outperforms baseline LVLMs across tasks like story generation and poetry composition. Our results demonstrate remarkable improvements, particularly in creativity and instruction adherence, validating VisuCraft’s effectiveness in producing imaginative, visually grounded, and user-aligned long-form creative text. This work unlocks new potential for LVLMs in sophisticated creative AI applications. 本文介绍了 VisuCraft，一种新颖的框架，旨在显著提升大型视觉语言模型（LVLMs）在复杂视觉引导创意内容生成中的能力。现有的 LVLMs 在生成长文本时，常常在保持高视觉保真度、真实创造力以及精确遵循细微用户指令方面存在局限。VisuCraft 通过整合多模态结构化信息提取器（E）和动态提示生成模块（G）来解决这些挑战。提取器将输入图像中的细粒度视觉属性提炼为丰富的结构化表示，动态提示模块随后将其与用户指令结合，生成针对底层 LVLMs（如 LLaVA、InstructBLIP）的高度优化提示。在自建的 ImageStoryGen-500K 数据集上，利用 VisuGen 指标（视觉定位、创造力和指令遵循）进行评估，VisuCraft 在故事生成和诗歌创作等任务中持续优于基线 LVLMs。我们的结果显示了显著的提升，尤其是在创造力和指令遵循方面，验证了 VisuCraft 在生成富有想象力、视觉基础且符合用户需求的长篇创意文本方面的有效性。这项工作为 LVLMs 在复杂创意人工智能应用中开启了新的潜力。

Subjects: Computer Vision and Pattern Recognition, Computation and Language 主题：计算机视觉与模式识别，计算与语言

Publish: 2025-08-04 20:36:55 UTC 发布时间：2025-08-04 20:36:55 UTC

Speech codecs serve as a crucial bridge in unifying speech and text language models. Existing codec methods face several challenges in semantic encoding, such as residual paralinguistic information (e.g., timbre, emotion), insufficient semantic completeness, limited reconstruction capability, and lack of support for streaming. To address these challenges, we propose SecoustiCodec, a cross-modal aligned low-bitrate streaming speech codec that disentangles semantic and paralinguistic information in a single-codebook space. To ensure semantic completeness and reconstruction fidelity, paralinguistic encoding is introduced to bridge the information gap between semantic and acoustic encoding. A semantic-only efficient quantization method based on VAE (Variational Autoencoder) and FSQ (Finite Scalar Quantization) is proposed. This approach alleviates the long-tail distribution problem of tokens while maintaining high codebook utilization. A semantic disentanglement method based on contrastive learning is proposed, which aligns text and speech in a joint multimodal frame-level space, effectively removing paralinguistic information from semantic encoding. An acoustic-constrained multi-stage optimization strategy is proposed to ensure robust and stable convergence. Figure~??? shows SecoustiCodec achieves SOTA (state-of-the-art) reconstruction quality (PESQ) of 1.77/2.58 at 0.27/1 kbps. The code and model weights for SecoustiCodec will be open-sourced upon the completion of the peer-review process. We’ve open-sourced SecoustiCodec’s demo, code, and model weights. 语音编解码器作为统一语音和文本语言模型的重要桥梁。现有的编解码方法在语义编码方面面临诸多挑战，如残留的副语言信息（例如音色、情感）、语义完整性不足、重建能力有限以及缺乏对流式传输的支持。为了解决这些问题，我们提出了 SecoustiCodec，一种跨模态对齐的低码率流式语音编解码器，在单一码本空间中解耦语义和副语言信息。为了确保语义完整性和重建保真度，引入了副语言编码以弥合语义编码与声学编码之间的信息差距。提出了一种基于 VAE（变分自编码器）和 FSQ（有限标量量化）的仅语义高效量化方法，该方法缓解了 token 的长尾分布问题，同时保持了高码本利用率。还提出了一种基于对比学习的语义解耦方法，将文本和语音对齐到联合多模态帧级空间，有效地从语义编码中去除副语言信息。提出了一种受声学约束的多阶段优化策略，以确保稳健且稳定的收敛。图~ ??? 显示 SecoustiCodec 在 0.27/1 kbps 下实现了 1.77/2.58 的 SOTA（最先进）重建质量（PESQ）。SecoustiCodec 的代码和模型权重将在同行评审完成后开源。我们已开源了 SecoustiCodec 的演示、代码和模型权重。

Subjects: Audio and Speech Processing, Artificial Intelligence, Computation and Language, Sound 主题：音频与语音处理，人工智能，计算与语言，声音

Publish: 2025-08-04 19:22:14 UTC 发布时间：2025-08-04 19:22:14 UTC

#72 NeuroSync: Intent-Aware Code-Based Problem Solving via Direct LLM Understanding Modification #72 NeuroSync：通过直接 LLM 理解修改实现意图感知的基于代码的问题解决

Authors: [Wenshuo Zhang](https://arxiv.org/search/?searchtype=author&query=Wenshuo Zhang), [Leixian Shen](https://arxiv.org/search/?searchtype=author&query=Leixian Shen), [Shuchang Xu](https://arxiv.org/search/?searchtype=author&query=Shuchang Xu), [Jindu Wang](https://arxiv.org/search/?searchtype=author&query=Jindu Wang), [Jian Zhao](https://arxiv.org/search/?searchtype=author&query=Jian Zhao), [Huamin Qu](https://arxiv.org/search/?searchtype=author&query=Huamin Qu), [Linping Yuan](https://arxiv.org/search/?searchtype=author&query=Linping Yuan) 作者：张文硕，沈雷贤，徐书昌，王金都，赵健，瞿华敏，袁林平

Conversational LLMs have been widely adopted by domain users with limited programming experience to solve domain problems. However, these users often face misalignment between their intent and generated code, resulting in frustration and rounds of clarification. This work first investigates the cause of this misalignment, which dues to bidirectional ambiguity: both user intents and coding tasks are inherently nonlinear, yet must be expressed and interpreted through linear prompts and code sequences. To address this, we propose direct intent-task matching, a new human-LLM interaction paradigm that externalizes and enables direct manipulation of the LLM understanding, i.e., the coding tasks and their relationships inferred by the LLM prior to code generation. As a proof-of-concept, this paradigm is then implemented in NeuroSync, which employs a knowledge distillation pipeline to extract LLM understanding, user intents, and their mappings, and enhances the alignment by allowing users to intuitively inspect and edit them via visualizations. We evaluate the algorithmic components of NeuroSync via technical experiments, and assess its overall usability and effectiveness via a user study (N=12). The results show that it enhances intent-task alignment, lowers cognitive effort, and improves coding efficiency. 对话式 LLMs 已被领域用户广泛采用，用以解决领域问题，这些用户编程经验有限。然而，这些用户常常面临其意图与生成代码之间的不匹配，导致挫败感和多轮澄清。本文首先探讨了这种不匹配的原因，归结为双向歧义：用户意图和编码任务本质上都是非线性的，但必须通过线性的提示和代码序列来表达和理解。为了解决这一问题，我们提出了直接意图-任务匹配，这是一种新的人与 LLM 交互范式，能够外化并直接操作 LLM 的理解，即 LLM 在生成代码之前推断出的编码任务及其关系。作为概念验证，该范式被实现于 NeuroSync 中，NeuroSync 采用知识蒸馏流程提取 LLM 理解、用户意图及其映射，并通过可视化方式允许用户直观地检查和编辑它们，从而增强匹配度。我们通过技术实验评估了 NeuroSync 的算法组件，并通过用户研究（N=12）评估其整体可用性和有效性。结果表明，它增强了意图与任务的匹配，降低了认知负担，并提高了编码效率。

Subjects: Human-Computer Interaction, Artificial Intelligence, Computation and Language, Software Engineering 主题：人机交互，人工智能，计算与语言，软件工程

Publish: 2025-08-05 12:54:13 UTC 发布时间：2025-08-05 12:54:13 UTC

#73 CreditARF: A Framework for Corporate Credit Rating with Annual Report and Financial Feature Integration #73 CreditARF：一个结合年报和财务特征的企业信用评级框架

Authors: [Yumeng Shi](https://arxiv.org/search/?searchtype=author&query=Yumeng Shi), [Zhongliang Yang](https://arxiv.org/search/?searchtype=author&query=Zhongliang Yang), [DiYang Lu](https://arxiv.org/search/?searchtype=author&query=DiYang Lu), [Yisi Wang](https://arxiv.org/search/?searchtype=author&query=Yisi Wang), [Yiting Zhou](https://arxiv.org/search/?searchtype=author&query=Yiting Zhou), [Linna Zhou](https://arxiv.org/search/?searchtype=author&query=Linna Zhou) 作者：石雨萌，杨中良，陆迪洋，王怡思，周怡婷，周琳娜

Corporate credit rating serves as a crucial intermediary service in the market economy, playing a key role in maintaining economic order. Existing credit rating models rely on financial metrics and deep learning. However, they often overlook insights from non-financial data, such as corporate annual reports. To address this, this paper introduces a corporate credit rating framework that integrates financial data with features extracted from annual reports using FinBERT, aiming to fully leverage the potential value of unstructured text data. In addition, we have developed a large-scale dataset, the Comprehensive Corporate Rating Dataset (CCRD), which combines both traditional financial data and textual data from annual reports. The experimental results show that the proposed method improves the accuracy of the rating predictions by 8-12%, significantly improving the effectiveness and reliability of corporate credit ratings. 企业信用评级作为市场经济中的重要中介服务，在维护经济秩序中发挥着关键作用。现有的信用评级模型依赖于财务指标和深度学习，但往往忽视了非财务数据的洞见，如企业年报。为此，本文提出了一种结合财务数据与通过 FinBERT 从年报中提取特征的企业信用评级框架，旨在充分挖掘非结构化文本数据的潜在价值。此外，我们构建了一个大规模数据集——综合企业评级数据集（CCRD），该数据集融合了传统财务数据和年报文本数据。实验结果表明，所提方法将评级预测的准确率提升了 8-12%，显著提高了企业信用评级的有效性和可靠性。

Subjects: Statistical Finance, Computational Engineering, Finance, and Science, Computation and Language, Machine Learning 学科领域：统计金融，计算工程、金融与科学，计算与语言，机器学习

Publish: 2025-08-02 05:56:36 UTC 发布时间：2025-08-02 05:56:36 UTC

#74 Teaching at Scale: Leveraging AI to Evaluate and Elevate Engineering Education #74 大规模教学：利用人工智能评估和提升工程教育

Authors: [Jean-Francois Chamberland](https://arxiv.org/search/?searchtype=author&query=Jean-Francois Chamberland), [Martin C. Carlisle](https://arxiv.org/search/?searchtype=author&query=Martin C. Carlisle), [Arul Jayaraman](https://arxiv.org/search/?searchtype=author&query=Arul Jayaraman), [Krishna R. Narayanan](https://arxiv.org/search/?searchtype=author&query=Krishna R. Narayanan), [Sunay Palsole](https://arxiv.org/search/?searchtype=author&query=Sunay Palsole), [Karan Watson](https://arxiv.org/search/?searchtype=author&query=Karan Watson) 作者：Jean-Francois Chamberland, Martin C. Carlisle, Arul Jayaraman, Krishna R. Narayanan, Sunay Palsole, Karan Watson

Evaluating teaching effectiveness at scale remains a persistent challenge for large universities, particularly within engineering programs that enroll tens of thousands of students. Traditional methods, such as manual review of student evaluations, are often impractical, leading to overlooked insights and inconsistent data use. This article presents a scalable, AI-supported framework for synthesizing qualitative student feedback using large language models. The system employs hierarchical summarization, anonymization, and exception handling to extract actionable themes from open-ended comments while upholding ethical safeguards. Visual analytics contextualize numeric scores through percentile-based comparisons, historical trends, and instructional load. The approach supports meaningful evaluation and aligns with best practices in qualitative analysis and educational assessment, incorporating student, peer, and self-reflective inputs without automating personnel decisions. We report on its successful deployment across a large college of engineering. Preliminary validation through comparisons with human reviewers, faculty feedback, and longitudinal analysis suggests that LLM-generated summaries can reliably support formative evaluation and professional development. This work demonstrates how AI systems, when designed with transparency and shared governance, can promote teaching excellence and continuous improvement at scale within academic institutions. 在大规模评估教学效果方面，大型大学尤其是招收数万名学生的工程项目面临着持续的挑战。传统方法如人工审查学生评价往往不切实际，导致洞察被忽视和数据使用不一致。本文提出了一种可扩展的、基于人工智能支持的框架，利用大型语言模型综合学生的定性反馈。该系统采用分层摘要、匿名处理和异常处理，从开放式评论中提取可操作的主题，同时坚持伦理保障。可视化分析通过基于百分位的比较、历史趋势和教学负荷来为数值评分提供背景。该方法支持有意义的评估，符合定性分析和教育评估的最佳实践，结合了学生、同行和自我反思的输入，但不自动化人员决策。我们报告了该系统在大型工程学院的成功部署情况。通过与人工评审、教师反馈及纵向分析的比较进行的初步验证表明，LLM 生成的摘要能够可靠地支持形成性评价和专业发展。该研究展示了在透明度和共享治理设计下，人工智能系统如何在学术机构内大规模促进教学卓越和持续改进。

Subjects: Computers and Society, Artificial Intelligence, Computation and Language 主题：计算机与社会，人工智能，计算与语言

Publish: 2025-08-01 20:27:40 UTC 发布时间：2025-08-01 20:27:40 UTC

#75 Efficient Agents: Building Effective Agents While Reducing Cost #75 高效智能体：在降低成本的同时构建有效智能体

The remarkable capabilities of Large Language Model (LLM)-driven agents have enabled sophisticated systems to tackle complex, multi-step tasks, but their escalating costs threaten scalability and accessibility. This work presents the first systematic study of the efficiency-effectiveness trade-off in modern agent systems, addressing the critical need for cost-effective designs without sacrificing performance. We investigate three key questions: (1) How much complexity do agentic tasks inherently require? (2) When do additional modules yield diminishing returns? (3) How much efficiency can be gained through the design of efficient agent frameworks? Through an empirical analysis on the GAIA benchmark, we evaluate the impact of LLM backbone selection, agent framework designs, and test-time scaling strategies. Using the cost-of-pass metric, we quantify the efficiency-performance trade-off across these dimensions. Our findings inform the development of Efficient Agents , a novel agent framework that has an optimal complexity to task requirements. Efficient Agents retains 96.7% of the performance of OWL, one leading open-source agent framework, while reducing operational costs from 0.398to0.228, resulting in a 28.4% improvement in cost-of-pass. Our work provides actionable insights for designing efficient, high-performing agent systems, advancing the accessibility and sustainability of AI-driven solutions. 大型语言模型（LLM）驱动的代理展现出卓越的能力，使复杂的多步骤任务得以解决，但其不断上升的成本威胁着系统的可扩展性和可及性。本文首次系统地研究了现代代理系统中的效率与效果权衡，解决了在不牺牲性能的前提下实现成本效益设计的关键需求。我们探讨了三个核心问题：（1）代理任务本质上需要多少复杂度？（2）何时额外模块的收益递减？（3）通过设计高效代理框架可以获得多少效率提升？通过在 GAIA 基准上的实证分析，我们评估了 LLM 骨干选择、代理框架设计和测试时扩展策略的影响。利用 cost-of-pass 指标，我们量化了这些维度上的效率与性能权衡。我们的研究成果指导了 Efficient Agents 的开发，这是一种具有与任务需求最优复杂度匹配的新型代理框架。 Efficient Agents 保持了 OWL（一款领先的开源代理框架）96.7%的性能，同时将运营成本从 0.398to 0.228 降低，实现了 28.4% 的成本效益提升。我们的工作为设计高效、高性能的代理系统提供了可操作的见解，推动了 AI 驱动解决方案的可及性和可持续性。

Subjects: Artificial Intelligence, Computation and Language, Multiagent Systems 主题：人工智能，计算与语言，多智能体系统

Publish: 2025-07-24 17:56:51 UTC 发布时间：2025-07-24 17:56:51 UTC

#76 ToolRegistry: A Protocol-Agnostic Tool Management Library for Function-Calling LLMs #76 ToolRegistry：一个面向函数调用 LLMs 的协议无关工具管理库

Author: [Peng Ding](https://arxiv.org/search/?searchtype=author&query=Peng Ding) 作者：丁鹏

Large Language Model (LLM) applications are increasingly relying on external tools to extend their capabilities beyond text generation. However, current tool integration approaches suffer from fragmentation, protocol limitations, and implementation complexity, leading to substantial development overhead. This paper presents Toolregistry, a protocol-agnostic tool management library that simplifies tool registration, representation, execution, and lifecycle management via a unified interface. Our evaluation demonstrates that \toolregistry achieves 60-80% reduction in tool integration code, up to 3.1x performance improvements through concurrent execution, and 100% compatibility with OpenAI function calling standards. Real-world case studies show significant improvements in development efficiency and code maintainability across diverse integration scenarios. \toolregistry is open-source and available at https://github.com/Oaklight/ToolRegistry, with comprehensive documentation at https://toolregistry.readthedocs.io/. 大型语言模型（LLM）应用越来越依赖外部工具，以扩展其超越文本生成的能力。然而，当前的工具集成方法存在碎片化、协议限制和实现复杂性，导致开发开销巨大。本文提出了 Toolregistry，一种与协议无关的工具管理库，通过统一接口简化工具的注册、表示、执行和生命周期管理。我们的评估表明，Toolregistry 实现了工具集成代码减少 60-80%，通过并发执行性能提升最高达 3.1 倍，并且与 OpenAI 函数调用标准 100%兼容。实际案例研究显示，在多样化集成场景中，Toolregistry 显著提升了开发效率和代码可维护性。Toolregistry 是开源项目，地址为 https://github.com/Oaklight/ToolRegistry，详细文档见 https://toolregistry.readthedocs.io/。

Subjects: Software Engineering, Artificial Intelligence 主题：软件工程，人工智能

Publish: 2025-07-11 20:23:23 UTC 发布时间：2025-07-11 20:23:23 UTC

1.2.2 Artificial Intelligence

**From：**https://papers.cool/arxiv/cs.AI

2025-08-06 | | Total: 187

#1 Agent Lightning: Train ANY AI Agents with Reinforcement Learning #1 Agent Lightning：使用强化学习训练任何 AI 代理

We present Agent Lightning, a flexible and extensible framework that enables Reinforcement Learning (RL)-based training of Large Language Models (LLMs) for any AI agent. Unlike existing methods that tightly couple RL training with agent or rely on sequence concatenation with masking, Agent Lightning achieves complete decoupling between agent execution and training, allowing seamless integration with existing agents developed via diverse ways (e.g., using frameworks like LangChain, OpenAI Agents SDK, AutoGen, and building from scratch) with almost ZERO code modifications. By formulating agent execution as Markov decision process, we define an unified data interface and propose a hierarchical RL algorithm, LightningRL, which contains a credit assignment module, allowing us to decompose trajectories generated by ANY agents into training transition. This enables RL to handle complex interaction logic, such as multi-agent scenarios and dynamic workflows. For the system design, we introduce a Training-Agent Disaggregation architecture, and brings agent observability frameworks into agent runtime, providing a standardized agent finetuning interface. Experiments across text-to-SQL, retrieval-augmented generation, and math tool-use tasks demonstrate stable, continuous improvements, showcasing the framework’s potential for real-world agent training and deployment. 我们提出了 Agent Lightning，一个灵活且可扩展的框架，支持基于强化学习（RL）对任何 AI 代理的 LLMs 进行训练。与现有方法将 RL 训练与代理紧密耦合或依赖序列拼接与掩码不同，Agent Lightning 实现了代理执行与训练的完全解耦，允许无缝集成通过多种方式开发的现有代理（例如使用 LangChain、OpenAI Agents SDK、AutoGen 等框架，或从零构建），几乎无需修改代码。通过将代理执行形式化为马尔可夫决策过程，我们定义了统一的数据接口，并提出了分层 RL 算法 LightningRL，其中包含信用分配模块，使我们能够将任何代理生成的轨迹分解为训练转换。这使得 RL 能够处理复杂的交互逻辑，如多代理场景和动态工作流。在系统设计方面，我们引入了训练代理分离架构，并将代理可观测性框架引入代理运行时，提供了标准化的代理微调接口。在文本到 SQL、检索增强生成和数学工具使用任务中的实验展示了稳定且持续的改进，彰显了该框架在现实世界代理训练和部署中的潜力。

Subjects: Artificial Intelligence, Machine Learning 主题：人工智能，机器学习

Publish: 2025-08-05 17:50:13 UTC 发布时间：2025-08-05 17:50:13 UTC

#2 Automated Algorithmic Discovery for Gravitational-Wave Detection Guided by LLM-Informed Evolutionary Monte Carlo Tree Search #2 基于 LLM 指导的进化蒙特卡洛树搜索的引力波检测自动算法发现

Authors: [He Wang](https://arxiv.org/search/?searchtype=author&query=He Wang), [Liang Zeng](https://arxiv.org/search/?searchtype=author&query=Liang Zeng) 作者：He Wang，Liang Zeng

Computational scientific discovery increasingly relies on algorithms to process complex data and identify meaningful patterns - yet faces persistent challenges in gravitational-wave signal identification. While existing algorithmic approaches like matched filtering (MF) and deep neural networks (DNNs) have achieved partial success, their limitations directly stem from fundamental limitations: MF’s excessive computational demands arise from its reliance on predefined theoretical waveform templates, while DNNs’ black-box architectures obscure decision logic and introduce hidden biases. We propose Evolutionary Monte Carlo Tree Search (Evo-MCTS), a framework that addresses these limitations through systematic algorithm space exploration guided by domain-aware physical constraints. Our approach combines tree-structured search with evolutionary optimization and large language model heuristics to create interpretable algorithmic solutions. Our Evo-MCTS framework demonstrates substantial improvements, achieving a 20.2% improvement over state-of-the-art gravitational wave detection algorithms on the MLGWSC-1 benchmark dataset. High-performing algorithm variants consistently exceed thresholds. The framework generates human-interpretable algorithmic pathways that reveal distinct performance patterns. Beyond performance improvements, our framework discovers novel algorithmic combinations, thereby establishing a transferable methodology for automated algorithmic discovery across computational science domains. 计算科学发现日益依赖算法来处理复杂数据并识别有意义的模式——但在引力波信号识别方面仍面临持续挑战。尽管现有的算法方法如匹配滤波（MF）和深度神经网络（DNN）取得了部分成功，但它们的局限性直接源于根本限制：MF 因依赖预定义的理论波形模板而导致计算需求过高，而 DNN 的黑箱架构则使决策逻辑不透明并引入隐藏偏差。我们提出了进化蒙特卡洛树搜索（Evo-MCTS），该框架通过受领域感知物理约束指导的系统性算法空间探索来解决这些限制。我们的方法结合了树结构搜索、进化优化和大型语言模型启发式，创造出可解释的算法解决方案。我们的 Evo-MCTS 框架表现出显著改进，在 MLGWSC-1 基准数据集上实现了比最先进引力波检测算法高出 20.2%的提升。高性能算法变体始终超过阈值。该框架生成可被人类理解的算法路径，揭示了不同的性能模式。除了性能提升之外，我们的框架还发现了新颖的算法组合，从而建立了一种可迁移的方法论，用于跨计算科学领域的自动化算法发现。

Subjects: Artificial Intelligence, High Energy Astrophysical Phenomena, Instrumentation and Methods for Astrophysics, General Relativity and Quantum Cosmology 主题：人工智能，高能天体物理现象，天体物理仪器与方法，广义相对论与量子宇宙学

Publish: 2025-08-05 17:18:20 UTC 发布时间：2025-08-05 17:18:20 UTC

#3 Refining Critical Thinking in LLM Code Generation: A Faulty Premise-based Evaluation Framework #3 在 LLM 代码生成中提升批判性思维：基于错误前提的评估框架

Authors: [Jialin Li](https://arxiv.org/search/?searchtype=author&query=Jialin Li), [Jinzhe Li](https://arxiv.org/search/?searchtype=author&query=Jinzhe Li), [Gengxu Li](https://arxiv.org/search/?searchtype=author&query=Gengxu Li), [Yi Chang](https://arxiv.org/search/?searchtype=author&query=Yi Chang), [Yuan Wu](https://arxiv.org/search/?searchtype=author&query=Yuan Wu) 作者：李佳林，李金哲，李耕旭，常毅，吴元

With the advancement of code generation capabilities in large language models (LLMs), their reliance on input premises has intensified. When users provide inputs containing faulty premises, the probability of code generation hallucinations rises significantly, exposing deficiencies in their self-scrutiny capabilities. This paper proposes Faulty Premises Bench (FPBench), the first code generation evaluation framework targeting faulty premises. By systematically constructing three categories of faulty premises and integrating multi-dimensional evaluation metrics, it conducts in-depth assessments of 15 representative LLMs. The key findings are as follows: (1) Most models exhibit poor reasoning abilities and suboptimal code generation performance under faulty premises, heavily relying on explicit prompts for error detection, with limited self-scrutiny capabilities; (2) Faulty premises trigger a point of diminishing returns in resource investment, leading to blindly increasing length fails to enhance quality; (3) The three types of faulty premises respectively activate distinct defect patterns in models, revealing a triple dissociation in the cognitive mechanisms of code generation models. This study not only highlights the urgent need for LLMs to proactively verify premises in code generation but also, through the proposed FPBench framework and multi-dimensional evaluation system, provides a theoretical foundation and practical pathway for developing reliable, human-centric code generation models. 随着大型语言模型（LLMs）代码生成能力的提升，它们对输入前提的依赖也日益加深。当用户提供包含错误前提的输入时，代码生成出现幻觉的概率显著增加，暴露出其自我审查能力的不足。本文提出了 Faulty Premises Bench（FPBench），这是首个针对错误前提的代码生成评估框架。通过系统构建三类错误前提并整合多维度评估指标，对 15 个具有代表性的 LLMs 进行了深入评估。主要发现如下：（1）大多数模型在错误前提下表现出较差的推理能力和次优的代码生成性能，严重依赖明确的提示来检测错误，自我审查能力有限；（2）错误前提引发资源投入的收益递减点，盲目增加长度无法提升质量；（3）三种类型的错误前提分别激活模型中的不同缺陷模式，揭示了代码生成模型认知机制中的三重解离。本研究不仅强调了 LLMs 在代码生成中主动验证前提的紧迫需求，还通过提出的 FPBench 框架和多维度评估系统，为开发可靠且以人为本的代码生成模型提供了理论基础和实践路径。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 16:39:39 UTC 发布时间：2025-08-05 16:39:39 UTC

#4 Hidden Dynamics of Massive Activations in Transformer Training #4 Transformer 训练中大规模激活的隐藏动态

Authors: [Jorge Gallego-Feliciano](https://arxiv.org/search/?searchtype=author&query=Jorge Gallego-Feliciano), [S. Aaron McClendon](https://arxiv.org/search/?searchtype=author&query=S. Aaron McClendon), [Juan Morinelli](https://arxiv.org/search/?searchtype=author&query=Juan Morinelli), [Stavros Zervoudakis](https://arxiv.org/search/?searchtype=author&query=Stavros Zervoudakis), [Antonios Saravanos](https://arxiv.org/search/?searchtype=author&query=Antonios Saravanos) 作者：Jorge Gallego-Feliciano, S. Aaron McClendon, Juan Morinelli, Stavros Zervoudakis, Antonios Saravanos

Massive activations are scalar values in transformer hidden states that achieve values orders of magnitude larger than typical activations and have been shown to be critical for model functionality. While prior work has characterized these phenomena in fully trained models, the temporal dynamics of their emergence during training remain poorly understood. We present the first comprehensive analysis of massive activation development throughout transformer training, using the Pythia model family as our testbed. Through systematic analysis of various model sizes across multiple training checkpoints, we demonstrate that massive activation emergence follows predictable mathematical patterns that can be accurately modeled using an exponentially-modulated logarithmic function with five key parameters. We develop a machine learning framework to predict these mathematical parameters from architectural specifications alone, achieving high accuracy for steady-state behavior and moderate accuracy for emergence timing and magnitude. These findings enable architects to predict and potentially control key aspects of massive activation emergence through design choices, with significant implications for model stability, training cycle length, interpretability, and optimization. Our findings demonstrate that the emergence of massive activations is governed by model design and can be anticipated, and potentially controlled, before training begins. 大规模激活是指变换器隐藏状态中的标量值，其数值远远大于典型激活值，且已被证明对模型功能至关重要。尽管先前的研究已经描述了这些现象在完全训练模型中的表现，但它们在训练过程中出现的时间动态仍然知之甚少。我们首次对变换器训练过程中大规模激活的发展进行了全面分析，选用 Pythia 模型系列作为测试平台。通过对不同模型规模在多个训练检查点的系统分析，我们证明了大规模激活的出现遵循可预测的数学模式，这些模式可以用一个包含五个关键参数的指数调制对数函数准确建模。我们开发了一个机器学习框架，仅通过架构规格预测这些数学参数，在稳态行为预测上取得了高准确率，在出现时间和幅度预测上则达到了中等准确率。这些发现使架构师能够通过设计选择预测并潜在地控制大规模激活出现的关键方面，这对模型稳定性、训练周期长度、可解释性和优化具有重要影响。我们的研究表明，大规模激活的出现受模型设计的控制，可以在训练开始前被预见并潜在地加以控制。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 16:29:51 UTC 发布时间：2025-08-05 16:29:51 UTC

#5 Error Detection and Correction for Interpretable Mathematics in Large Language Models #5 大型语言模型中可解释数学的错误检测与纠正

Authors: [Yijin Yang](https://arxiv.org/search/?searchtype=author&query=Yijin Yang), [Cristina Cornelio](https://arxiv.org/search/?searchtype=author&query=Cristina Cornelio), [Mario Leiva](https://arxiv.org/search/?searchtype=author&query=Mario Leiva), [Paulo Shakarian](https://arxiv.org/search/?searchtype=author&query=Paulo Shakarian) 作者：Yijin Yang, Cristina Cornelio, Mario Leiva, Paulo Shakarian

Recent large language models (LLMs) have demonstrated the ability to perform explicit multi-step reasoning such as chain-of-thought prompting. However, their intermediate steps often contain errors that can propagate leading to inaccurate final predictions. Additionally, LLMs still struggle with hallucinations and often fail to adhere to prescribed output formats, which is particularly problematic for tasks like generating mathematical expressions or source code. This work introduces EDCIM (Error Detection and Correction for Interpretable Mathematics), a method for detecting and correcting these errors in interpretable mathematics tasks, where the model must generate the exact functional form that explicitly solve the problem (expressed in natural language) rather than a black-box solution. EDCIM uses LLMs to generate a system of equations for a given problem, followed by a symbolic error-detection framework that identifies errors and provides targeted feedback for LLM-based correction. To optimize efficiency, EDCIM integrates lightweight, open-source LLMs with more powerful proprietary models, balancing cost and accuracy. This balance is controlled by a single hyperparameter, allowing users to control the trade-off based on their cost and accuracy requirements. Experimental results across different datasets show that EDCIM significantly reduces both computational and financial costs, while maintaining, and even improving, prediction accuracy when the balance is properly configured. 最近的大型语言模型（LLMs）已经展示了执行显式多步推理（如链式思维提示）的能力。然而，它们的中间步骤常常包含错误，这些错误可能传播，导致最终预测不准确。此外，LLMs 仍然存在幻觉问题，且经常无法遵守规定的输出格式，这对于生成数学表达式或源代码等任务尤其成问题。本文介绍了 EDCIM（可解释数学的错误检测与纠正）方法，用于检测和纠正可解释数学任务中的错误，在这些任务中，模型必须生成明确解决问题的精确函数形式（以自然语言表达），而非黑箱式解决方案。EDCIM 利用 LLMs 为给定问题生成方程组，随后通过符号错误检测框架识别错误并提供针对性的反馈，以便基于 LLM 进行纠正。为了优化效率，EDCIM 结合了轻量级的开源 LLMs 与更强大的专有模型，平衡了成本与准确性。这种平衡由一个超参数控制，允许用户根据其成本和准确性需求来调节权衡。不同数据集上的实验结果表明，当平衡配置得当时，EDCIM 显著降低了计算和财务成本，同时保持甚至提升了预测准确性。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 14:30:35 UTC 发布时间：2025-08-05 14:30:35 UTC

#6 VQA support to Arabic Language Learning Educational Tool #6 VQA 支持阿拉伯语学习教育工具

Authors: [Khaled Bachir Delassi](https://arxiv.org/search/?searchtype=author&query=Khaled Bachir Delassi), [Lakhdar Zeggane](https://arxiv.org/search/?searchtype=author&query=Lakhdar Zeggane), [Hadda Cherroun](https://arxiv.org/search/?searchtype=author&query=Hadda Cherroun), [Abdelhamid Haouhat](https://arxiv.org/search/?searchtype=author&query=Abdelhamid Haouhat), [Kaoutar Bouzouad](https://arxiv.org/search/?searchtype=author&query=Kaoutar Bouzouad) 作者：Khaled Bachir Delassi、Lakhdar Zeggane、Hadda Cherroun、Abdelhamid Haouhat、Kaoutar Bouzouad

We address the problem of scarcity of educational Arabic Language Learning tools that advocate modern pedagogical models such as active learning which ensures language proficiency. In fact, we investigate the design and evaluation of an AI-powered educational tool designed to enhance Arabic language learning for non-native speakers with beginner-to-intermediate proficiency level. The tool leverages advanced AI models to generate interactive visual quizzes, deploying Visual Question Answering as the primary activity. Adopting a constructivist learning approach, the system encourages active learning through real-life visual quizzes, and image-based questions that focus on improving vocabulary, grammar, and comprehension. The system integrates Vision-Language Pretraining models to generate contextually relevant image description from which Large Language Model generate assignments based on customized Arabic language Learning quizzes thanks to prompting. The effectiveness of the tool is evaluated through a manual annotated benchmark consisting of 1266 real-life visual quizzes, with human participants providing feedback. The results show a suitable accuracy rates, validating the tool’s potential to bridge the gap in Arabic language education and highlighting the tool’s promise as a reliable, AI-powered resource for Arabic learners, offering personalized and interactive learning experiences. 我们解决了教育类阿拉伯语学习工具稀缺的问题，这些工具倡导现代教学模式如确保语言熟练度的主动学习。实际上，我们研究了一款由人工智能驱动的教育工具的设计与评估，该工具旨在提升非母语者的阿拉伯语学习，适用于初级到中级水平。该工具利用先进的人工智能模型生成互动式视觉测验，以视觉问答作为主要活动。系统采用建构主义学习方法，通过真实生活中的视觉测验和基于图像的问题，鼓励主动学习，重点提升词汇、语法和理解能力。系统整合了视觉-语言预训练模型，生成具有上下文相关性的图像描述，随后大型语言模型基于定制的阿拉伯语学习测验通过提示生成作业。该工具的有效性通过包含 1266 个真实生活视觉测验的人工标注基准进行评估，并由人工参与者提供反馈。结果显示了适当的准确率，验证了该工具在弥合阿拉伯语教育差距方面的潜力，并突显了该工具作为一个可靠的、由人工智能驱动的阿拉伯语学习资源的前景，能够提供个性化和互动的学习体验。

Subjects: Artificial Intelligence, Software Engineering 主题：人工智能，软件工程

Publish: 2025-08-05 14:18:25 UTC 发布时间：2025-08-05 14:18:25 UTC

#7 Semantic-aware Graph-guided Behavior Sequences Generation with Large Language Models for Smart Homes #7 结合语义感知图引导的行为序列生成，利用大型语言模型应用于智能家居

Authors: [Zhiyao Xu](https://arxiv.org/search/?searchtype=author&query=Zhiyao Xu), [Dan Zhao](https://arxiv.org/search/?searchtype=author&query=Dan Zhao), [Qingsong Zou](https://arxiv.org/search/?searchtype=author&query=Qingsong Zou), [Qing Li](https://arxiv.org/search/?searchtype=author&query=Qing Li), [Yong Jiang](https://arxiv.org/search/?searchtype=author&query=Yong Jiang), [Yuhang Wang](https://arxiv.org/search/?searchtype=author&query=Yuhang Wang), [Jingyu Xiao](https://arxiv.org/search/?searchtype=author&query=Jingyu Xiao) 作者：徐志尧，赵丹，邹庆松，李青，姜勇，王宇航，肖靖宇

As smart homes become increasingly prevalent, intelligent models are widely used for tasks such as anomaly detection and behavior prediction. These models are typically trained on static datasets, making them brittle to behavioral drift caused by seasonal changes, lifestyle shifts, or evolving routines. However, collecting new behavior data for retraining is often impractical due to its slow pace, high cost, and privacy concerns. In this paper, we propose SmartGen, an LLM-based framework that synthesizes context-aware user behavior data to support continual adaptation of downstream smart home models. SmartGen consists of four key components. First, we design a Time and Semantic-aware Split module to divide long behavior sequences into manageable, semantically coherent subsequences under dual time-span constraints. Second, we propose Semantic-aware Sequence Compression to reduce input length while preserving representative semantics by clustering behavior mapping in latent space. Third, we introduce Graph-guided Sequence Synthesis, which constructs a behavior relationship graph and encodes frequent transitions into prompts, guiding the LLM to generate data aligned with contextual changes while retaining core behavior patterns. Finally, we design a Two-stage Outlier Filter to identify and remove implausible or semantically inconsistent outputs, aiming to improve the factual coherence and behavioral validity of the generated sequences. Experiments on three real-world datasets demonstrate that SmartGen significantly enhances model performance on anomaly detection and behavior prediction tasks under behavioral drift, with anomaly detection improving by 85.43% and behavior prediction by 70.51% on average. The code is available at https://github.com/horizonsinzqs/SmartGen. 随着智能家居的日益普及，智能模型被广泛应用于异常检测和行为预测等任务。这些模型通常在静态数据集上训练，因此对季节变化、生活方式转变或日常习惯演变引起的行为漂移较为脆弱。然而，由于行为数据收集速度缓慢、成本高昂且存在隐私问题，重新收集数据进行再训练往往不切实际。本文提出了 SmartGen，一种基于 LLM 的框架，通过合成具备上下文感知的用户行为数据，支持下游智能家居模型的持续适应。SmartGen 包含四个关键组件。首先，我们设计了时间和语义感知分割模块，在双重时间跨度约束下，将长行为序列划分为可管理且语义连贯的子序列。其次，我们提出了语义感知序列压缩，通过在潜在空间中聚类行为映射，减少输入长度的同时保留代表性语义。第三，我们引入了图引导序列合成方法，该方法构建行为关系图，并将频繁的转移编码为提示，指导 LLM 生成与上下文变化相符的数据，同时保留核心行为模式。最后，我们设计了一个两阶段异常值过滤器，用于识别和剔除不合理或语义不一致的输出，旨在提升生成序列的事实连贯性和行为有效性。在三个真实世界数据集上的实验表明，SmartGen 显著提升了模型在行为漂移下的异常检测和行为预测任务的性能，异常检测平均提升 85.43%，行为预测平均提升 70.51%。代码可在 https://github.com/horizonsinzqs/SmartGen 获取。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 14:16:10 UTC 发布时间：2025-08-05 14:16:10 UTC

#8 Toward a Graph-Theoretic Model of Belief: Confidence, Credibility, and Structural Coherence #8 迈向信念的图论模型：置信度、可信度与结构连贯性

Author: [Saleh Nikooroo](https://arxiv.org/search/?searchtype=author&query=Saleh Nikooroo) 作者：Saleh Nikooroo

Belief systems are often treated as globally consistent sets of propositions or as scalar-valued probability distributions. Such representations tend to obscure the internal structure of belief, conflate external credibility with internal coherence, and preclude the modeling of fragmented or contradictory epistemic states. This paper introduces a minimal formalism for belief systems as directed, weighted graphs. In this framework, nodes represent individual beliefs, edges encode epistemic relationships (e.g., support or contradiction), and two distinct functions assign each belief a credibility (reflecting source trust) and a confidence (derived from internal structural support). Unlike classical probabilistic models, our approach does not assume prior coherence or require belief updating. Unlike logical and argumentation-based frameworks, it supports fine-grained structural representation without committing to binary justification status or deductive closure. The model is purely static and deliberately excludes inference or revision procedures. Its aim is to provide a foundational substrate for analyzing the internal organization of belief systems, including coherence conditions, epistemic tensions, and representational limits. By distinguishing belief structure from belief strength, this formalism enables a richer classification of epistemic states than existing probabilistic, logical, or argumentation-based approaches. 信念系统通常被视为全局一致的命题集合或标量值概率分布。这类表示往往掩盖了信念的内部结构，将外部可信度与内部一致性混为一谈，并且排除了对碎片化或矛盾的认识状态的建模。本文引入了一种将信念系统表示为有向加权图的最小形式主义。在该框架中，节点代表个别信念，边编码认识关系（例如支持或矛盾），并且两个不同的函数分别为每个信念分配可信度（反映来源信任）和置信度（源自内部结构支持）。与经典概率模型不同，我们的方法不假设先验一致性，也不要求信念更新。与逻辑和论证基础框架不同，它支持细粒度的结构表示，而不承诺于二元的正当性状态或演绎闭包。该模型纯粹是静态的，有意排除推理或修正过程。其目的是为分析信念体系的内部组织提供一个基础框架，包括一致性条件、认知张力和表征限制。通过区分信念结构与信念强度，该形式主义能够比现有的概率、逻辑或基于论证的方法更丰富地分类认知状态。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 14:03:23 UTC 发布时间：2025-08-05 14:03:23 UTC

#9 Data Overdose? Time for a Quadruple Shot: Knowledge Graph Construction using Enhanced Triple Extraction #9 数据过载？是时候来一剂四倍剂量：使用增强三元组提取的知识图谱构建

Authors: [Taine J. Elliott](https://arxiv.org/search/?searchtype=author&query=Taine J. Elliott), [Stephen P. Levitt](https://arxiv.org/search/?searchtype=author&query=Stephen P. Levitt), [Ken Nixon](https://arxiv.org/search/?searchtype=author&query=Ken Nixon), [Martin Bekker](https://arxiv.org/search/?searchtype=author&query=Martin Bekker) 作者：Taine J. Elliott，Stephen P. Levitt，Ken Nixon，Martin Bekker

The rapid expansion of publicly-available medical data presents a challenge for clinicians and researchers alike, increasing the gap between the volume of scientific literature and its applications. The steady growth of studies and findings overwhelms medical professionals at large, hindering their ability to systematically review and understand the latest knowledge. This paper presents an approach to information extraction and automatic knowledge graph (KG) generation to identify and connect biomedical knowledge. Through a pipeline of large language model (LLM) agents, the system decomposes 44 PubMed abstracts into semantically meaningful proposition sentences and extracts KG triples from these sentences. The triples are enhanced using a combination of open domain and ontology-based information extraction methodologies to incorporate ontological categories. On top of this, a context variable is included during extraction to allow the triple to stand on its own - thereby becoming `quadruples’. The extraction accuracy of the LLM is validated by comparing natural language sentences generated from the enhanced triples to the original propositions, achieving an average cosine similarity of 0.874. The similarity for generated sentences of enhanced triples were compared with generated sentences of ordinary triples showing an increase as a result of the context variable. Furthermore, this research explores the ability for LLMs to infer new relationships and connect clusters in the knowledge base of the knowledge graph. This approach leads the way to provide medical practitioners with a centralised, updated in real-time, and sustainable knowledge source, and may be the foundation of similar gains in a wide variety of fields. 公开可用的医学数据的快速扩展对临床医生和研究人员都提出了挑战，加大了科学文献数量与其应用之间的差距。不断增长的研究和发现使广大医疗专业人员不堪重负，阻碍了他们系统性地审查和理解最新知识的能力。本文提出了一种信息提取和自动知识图谱（KG）生成的方法，以识别和连接生物医学知识。通过一条由 LLM 代理组成的流程，系统将 44 篇 PubMed 摘要分解为语义上有意义的命题句子，并从这些句子中提取 KG 三元组。通过结合开放领域和基于本体的信息提取方法，增强了三元组以纳入本体类别。在此基础上，提取过程中加入了上下文变量，使三元组能够独立存在——从而成为“四元组”。通过将从增强三元组生成的自然语言句子与原始命题进行比较，验证了 LLM 的提取准确性，平均余弦相似度达到 0.874。增强三元组生成句子的相似度与普通三元组生成句子的相似度进行了比较，结果显示由于上下文变量的影响，相似度有所提升。此外，本研究还探讨了 LLMs 推断新关系并连接知识图谱知识库中簇的能力。这种方法为医疗从业者提供了一个集中、实时更新且可持续的知识来源，并可能成为在各种领域实现类似进展的基础。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 13:30:41 UTC 发布时间：2025-08-05 13:30:41 UTC

#10 Multi-Objective Infeasibility Diagnosis for Routing Problems Using Large Language Models #10 使用大型语言模型进行路由问题的多目标不可行性诊断

Authors: [Kai Li](https://arxiv.org/search/?searchtype=author&query=Kai Li), [Ruihao Zheng](https://arxiv.org/search/?searchtype=author&query=Ruihao Zheng), [Xinye Hao](https://arxiv.org/search/?searchtype=author&query=Xinye Hao), [Zhenkun Wang](https://arxiv.org/search/?searchtype=author&query=Zhenkun Wang) 作者：李凯、郑瑞浩、郝新业、王振坤

In real-world routing problems, users often propose conflicting or unreasonable requirements, which result in infeasible optimization models due to overly restrictive or contradictory constraints, leading to an empty feasible solution set. Existing Large Language Model (LLM)-based methods attempt to diagnose infeasible models, but modifying such models often involves multiple potential adjustments that these methods do not consider. To fill this gap, we introduce Multi-Objective Infeasibility Diagnosis (MOID), which combines LLM agents and multi-objective optimization within an automatic routing solver, to provide a set of representative actionable suggestions. Specifically, MOID employs multi-objective optimization to consider both path cost and constraint violation, generating a set of trade-off solutions, each encompassing varying degrees of model adjustments. To extract practical insights from these solutions, MOID utilizes LLM agents to generate a solution analysis function for the infeasible model. This function analyzes these distinct solutions to diagnose the original infeasible model, providing users with diverse diagnostic insights and suggestions. Finally, we compare MOID with several LLM-based methods on 50 types of infeasible routing problems. The results indicate that MOID automatically generates multiple diagnostic suggestions in a single run, providing more practical insights for restoring model feasibility and decision-making compared to existing methods. 在现实世界的路径规划问题中，用户常常提出相互冲突或不合理的需求，导致由于约束过于严格或矛盾，优化模型不可行，进而产生空的可行解集。现有基于大型语言模型（LLM）的方法尝试诊断不可行模型，但修改此类模型通常涉及多个潜在调整，而这些方法未加以考虑。为填补这一空白，我们引入了多目标不可行性诊断（MOID），该方法结合了 LLM 代理和多目标优化，集成于自动路径规划求解器中，以提供一组具有代表性的可操作建议。具体而言，MOID 采用多目标优化同时考虑路径成本和约束违规，生成一组权衡解，每个解包含不同程度的模型调整。为了从这些解中提取实用见解，MOID 利用 LLM 代理为不可行模型生成解分析函数。该函数分析这些不同的解，以诊断原始不可行模型，为用户提供多样化的诊断见解和建议。最后，我们将 MOID 与几种基于 LLM 的方法在 50 种不可行路径规划问题上进行了比较。结果表明，MOID 能够在一次运行中自动生成多条诊断建议，相较于现有方法，为恢复模型可行性和决策提供了更具实用性的见解。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 12:53:20 UTC 发布时间：2025-08-05 12:53:20 UTC

#11 Hide and Seek with LLMs: An Adversarial Game for Sneaky Error Generation and Self-Improving Diagnosis #11 与 LLMs 的捉迷藏：一种用于狡猾错误生成和自我改进诊断的对抗游戏

Authors: [Rui Zou](https://arxiv.org/search/?searchtype=author&query=Rui Zou), [Mengqi Wei](https://arxiv.org/search/?searchtype=author&query=Mengqi Wei), [Yutao Zhu](https://arxiv.org/search/?searchtype=author&query=Yutao Zhu), [Jirong Wen](https://arxiv.org/search/?searchtype=author&query=Jirong Wen), [Xin Zhao](https://arxiv.org/search/?searchtype=author&query=Xin Zhao), [Jing Chen](https://arxiv.org/search/?searchtype=author&query=Jing Chen) 作者：邹锐，魏梦琪，朱玉涛，温继荣，赵鑫，陈静

Large Language Models (LLMs) excel in reasoning and generation across domains, but still struggle with identifying and diagnosing complex errors. This stems mainly from training objectives that prioritize correct answers, limiting exposure to and learning from errors. While recent studies have begun to address this by introducing error signals, most rely on shallow, static errors, restricting improvement in deep diagnostic ability. To overcome this, we propose Hide and Seek Game (HSG), a dynamic adversarial framework for error generation and diagnosis, and evaluate it on mathematical problem-solving. HSG involves two adversarial roles: Sneaky, which “hides” by generating subtle, deceptive reasoning errors, and Diagnosis, which “seeks” to accurately detect them. Through adversarial co-evolution, both error stealth and diagnostic precision are enhanced. Experiments on several math reasoning tasks show that HSG significantly boosts error diagnosis, achieving 16.8%–31.4% higher accuracy than baselines like GPT-4o. We also release a challenging dataset of deceptive errors and diagnostic annotations as a benchmark for future research. 大型语言模型（LLMs）在跨领域的推理和生成方面表现出色，但在识别和诊断复杂错误方面仍存在困难。这主要源于训练目标优先考虑正确答案，限制了模型对错误的接触和学习。尽管近期研究开始通过引入错误信号来解决这一问题，但大多数依赖于浅层、静态的错误，限制了深度诊断能力的提升。为克服这一点，我们提出了“捉迷藏游戏”（Hide and Seek Game，HSG），这是一个用于错误生成和诊断的动态对抗框架，并在数学问题求解上进行了评估。HSG 包含两个对抗角色：Sneaky，通过生成微妙且具有欺骗性的推理错误来“隐藏”；Diagnosis，负责“寻找”并准确检测这些错误。通过对抗共进化，错误的隐蔽性和诊断的精确性均得到提升。在多个数学推理任务上的实验表明，HSG 显著提升了错误诊断能力，准确率比 GPT-4o 等基线高出 16.8%至 31.4%。我们还发布了一个包含欺骗性错误和诊断注释的挑战性数据集，作为未来研究的基准。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 12:45:21 UTC 发布：2025-08-05 12:45:21 UTC

#12 Data Dependency Inference for Industrial Code Generation Based on UML Sequence Diagrams #12 基于 UML 时序图的工业代码生成数据依赖推断

Large language models (LLMs) excel at generating code from natural language (NL) descriptions. However, the plain textual descriptions are inherently ambiguous and often fail to capture complex requirements like intricate system behaviors, conditional logic, and architectural constraints; implicit data dependencies in service-oriented architectures are difficult to infer and handle correctly. To bridge this gap, we propose a novel step-by-step code generation framework named UML2Dep by leveraging unambiguous formal specifications of complex requirements. First, we introduce an enhanced Unified Modeling Language (UML) sequence diagram tailored for service-oriented architectures. This diagram extends traditional visual syntax by integrating decision tables and API specifications, explicitly formalizing structural relationships and business logic flows in service interactions to rigorously eliminate linguistic ambiguity. Second, recognizing the critical role of data flow, we introduce a dedicated data dependency inference (DDI) task. DDI systematically constructs an explicit data dependency graph prior to actual code synthesis. To ensure reliability, we formalize DDI as a constrained mathematical reasoning task through novel prompting strategies, aligning with LLMs’ excellent mathematical strengths. Additional static parsing and dependency pruning further reduce context complexity and cognitive load associated with intricate specifications, thereby enhancing reasoning accuracy and efficiency. 大型语言模型（LLMs）在根据自然语言（NL）描述生成代码方面表现出色。然而，纯文本描述本质上存在歧义，且常常无法准确表达复杂需求，如复杂的系统行为、条件逻辑和架构约束；面向服务的架构中隐含的数据依赖关系难以推断和正确处理。为弥补这一差距，我们提出了一种新颖的逐步代码生成框架，名为 UML2Dep，通过利用复杂需求的无歧义形式化规范实现。首先，我们引入了一种针对面向服务架构增强的统一建模语言（UML）时序图。该图通过集成决策表和 API 规范，扩展了传统的视觉语法，明确形式化了服务交互中的结构关系和业务逻辑流程，从而严格消除语言歧义。其次，鉴于数据流的重要性，我们引入了专门的数据依赖推断（DDI）任务。DDI 在实际代码合成之前，系统地构建了显式的数据依赖图。为了确保可靠性，我们通过新颖的提示策略将 DDI 形式化为一个受约束的数学推理任务，利用 LLMs 出色的数学能力。额外的静态解析和依赖修剪进一步降低了复杂规范所带来的上下文复杂性和认知负担，从而提升了推理的准确性和效率。

Subjects: Artificial Intelligence, Software Engineering 主题：人工智能，软件工程

Publish: 2025-08-05 12:28:23 UTC 发布：2025-08-05 12:28:23 UTC

#13 Board Game Arena: A Framework and Benchmark for Assessing Large Language Models via Strategic Play #13 Board Game Arena：通过策略游戏评估大型语言模型的框架和基准

Authors: [Lucia Cipolina-Kun](https://arxiv.org/search/?searchtype=author&query=Lucia Cipolina-Kun), [Marianna Nezhurina](https://arxiv.org/search/?searchtype=author&query=Marianna Nezhurina), [Jenia Jitsev](https://arxiv.org/search/?searchtype=author&query=Jenia Jitsev) 作者：Lucia Cipolina-Kun，Marianna Nezhurina，Jenia Jitsev

The Board Game Arena library provides a framework for evaluating the decision making abilities of large language models (LLMs) through strategic board games implemented in Google OpenSpiel library. The framework enables systematic comparisons between LLM based agents and other agents (random, human, reinforcement learning agents, etc.) in various game scenarios by wrapping multiple board and matrix games and supporting different agent types. It integrates API access to models via LiteLLM, local model deployment via vLLM, and offers distributed execution through Ray. Additionally it provides extensive analysis tools for the LLM reasoning traces. This paper summarizes the structure, key characteristics, and motivation of the repository, highlighting how it contributes to the empirical evaluation of the reasoning of LLM and game-theoretic behavior Board Game Arena 库提供了一个通过 Google OpenSpiel 库实现的策略棋盘游戏来评估大型语言模型（LLMs）决策能力的框架。该框架通过封装多种棋盘和矩阵游戏并支持不同类型的代理，实现了基于 LLM 的代理与其他代理（随机、人类、强化学习代理等）在各种游戏场景中的系统比较。它集成了通过 LiteLLM 访问模型的 API、本地通过 vLLM 部署模型，并通过 Ray 提供分布式执行。此外，还为 LLM 推理轨迹提供了丰富的分析工具。本文总结了该仓库的结构、关键特性及其动机，强调了其在 LLM 推理和博弈论行为的实证评估中的贡献。

Subjects: Artificial Intelligence, Computer Science and Game Theory 主题：人工智能，计算机科学与博弈论

Publish: 2025-08-05 12:15:59 UTC 发布时间：2025-08-05 12:15:59 UTC

#14 A Comparative Study of Neurosymbolic AI Approaches to Interpretable Logical Reasoning #14 神经符号人工智能方法在可解释逻辑推理中的比较研究

Author: [Michael K. Chen](https://arxiv.org/search/?searchtype=author&query=Michael K. Chen) 作者：Michael K. Chen

Subjects: Artificial Intelligence, Computation and Language, Machine Learning, Symbolic Computation 主题：人工智能，计算与语言，机器学习，符号计算

Publish: 2025-08-05 12:14:32 UTC 发布时间：2025-08-05 12:14:32 UTC

#15 CogBench: A Large Language Model Benchmark for Multilingual Speech-Based Cognitive Impairment Assessment #15 CogBench：一个用于多语言基于语音的认知障碍评估的大型语言模型基准测试

Automatic assessment of cognitive impairment from spontaneous speech offers a promising, non-invasive avenue for early cognitive screening. However, current approaches often lack generalizability when deployed across different languages and clinical settings, limiting their practical utility. In this study, we propose CogBench, the first benchmark designed to evaluate the cross-lingual and cross-site generalizability of large language models (LLMs) for speech-based cognitive impairment assessment. Using a unified multimodal pipeline, we evaluate model performance on three speech datasets spanning English and Mandarin: ADReSSo, NCMMSC2021-AD, and a newly collected test set, CIR-E. Our results show that conventional deep learning models degrade substantially when transferred across domains. In contrast, LLMs equipped with chain-of-thought prompting demonstrate better adaptability, though their performance remains sensitive to prompt design. Furthermore, we explore lightweight fine-tuning of LLMs via Low-Rank Adaptation (LoRA), which significantly improves generalization in target domains. These findings offer a critical step toward building clinically useful and linguistically robust speech-based cognitive assessment tools. 通过自发言语自动评估认知障碍为早期认知筛查提供了一种有前景的非侵入性途径。然而，当前的方法在跨语言和临床环境部署时往往缺乏泛化能力，限制了其实用性。在本研究中，我们提出了 CogBench，这是首个旨在评估大型语言模型（LLMs）在基于语音的认知障碍评估中跨语言和跨场景泛化能力的基准。通过统一的多模态流程，我们在涵盖英语和普通话的三个语音数据集上评估模型表现：ADReSSo、NCMMSC2021-AD 以及新收集的测试集 CIR-E。结果显示，传统深度学习模型在跨域迁移时性能显著下降。相比之下，配备链式思维提示的 LLMs 表现出更好的适应性，尽管其性能仍对提示设计较为敏感。此外，我们探索了通过低秩适配（LoRA）对 LLMs 进行轻量级微调，显著提升了目标域的泛化能力。这些发现为构建临床实用且语言稳健的基于语音的认知评估工具迈出了关键一步。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 12:06:16 UTC 发布时间：2025-08-05 12:06:16 UTC

#16 Compressing Chain-of-Thought in LLMs via Step Entropy #16 通过步骤熵压缩 LLMs 中的链式思维

Large Language Models (LLMs) using Chain-of-Thought (CoT) prompting excel at complex reasoning but generate verbose thought processes with considerable redundancy, leading to increased inference costs and reduced efficiency. We introduce a novel CoT compression framework based on step entropy, a metric that quantifies the informational contribution of individual reasoning steps to identify redundancy. Through theoretical analysis and extensive empirical validation on mathematical reasoning benchmarks, we demonstrate that steps with low entropy are indeed highly redundant. Our experiments reveal that an astonishing 80% of low-entropy intermediate steps can be pruned with minor degradation in the final answer accuracy across DeepSeek-R1-7B, 14B and Qwen3-8B. This finding sharply contrasts with random or high-entropy pruning, which severely impairs reasoning performance. Building on this, we propose a novel two-stage training strategy combining Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) reinforcement learning. This approach enables LLMs to autonomously learn to generate compressed COTs during inference by strategically incorporating [SKIP] tokens. Our method significantly enhances LLM inference efficiency while rigorously preserving accuracy, offering profound implications for practical LLM deployment and a deeper understanding of reasoning structures. 使用 Chain-of-Thought（CoT）提示的大型语言模型（LLMs）在复杂推理方面表现出色，但生成的思考过程冗长且存在大量冗余，导致推理成本增加和效率降低。我们提出了一种基于步骤熵的新型 CoT 压缩框架，步骤熵是一种量化单个推理步骤信息贡献以识别冗余的度量。通过理论分析和在数学推理基准上的大量实证验证，我们证明了低熵步骤确实高度冗余。我们的实验表明，在 DeepSeek-R1-7B、14B 和 Qwen3-8B 模型中，惊人地有 80%的低熵中间步骤可以被剪枝，而最终答案准确率仅有轻微下降。这一发现与随机剪枝或高熵剪枝形成鲜明对比，后者会严重损害推理性能。在此基础上，我们提出了一种结合监督微调（SFT）和组相对策略优化（GRPO）强化学习的两阶段训练策略。该方法使 LLMs 能够在推理过程中通过策略性地插入[SKIP]标记，自主学习生成压缩的 CoT。我们的方法显著提升了 LLM 推理效率，同时严格保持准确性，对实际 LLM 部署和推理结构的深入理解具有深远意义。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 11:48:18 UTC 发布时间：2025-08-05 11:48:18 UTC

#17 Adaptive AI Agent Placement and Migration in Edge Intelligence Systems #17 边缘智能系统中的自适应 AI 代理部署与迁移

The rise of LLMs such as ChatGPT and Claude fuels the need for AI agents capable of real-time task handling. However, migrating data-intensive, multi-modal edge workloads to cloud data centers, traditionally used for agent deployment, introduces significant latency. Deploying AI agents at the edge improves efficiency and reduces latency. However, edge environments present challenges due to limited and heterogeneous resources. Maintaining QoS for mobile users necessitates agent migration, which is complicated by the complexity of AI agents coordinating LLMs, task planning, memory, and external tools. This paper presents the first systematic deployment and management solution for LLM-based AI agents in dynamic edge environments. We propose a novel adaptive framework for AI agent placement and migration in edge intelligence systems. Our approach models resource constraints and latency/cost, leveraging ant colony algorithms and LLM-based optimization for efficient decision-making. It autonomously places agents to optimize resource utilization and QoS and enables lightweight agent migration by transferring only essential state. Implemented on a distributed system using AgentScope and validated across globally distributed edge servers, our solution significantly reduces deployment latency and migration costs. LLMs 如 ChatGPT 和 Claude 的兴起推动了能够实时处理任务的 AI 代理的需求。然而，将数据密集型、多模态的边缘工作负载迁移到传统用于代理部署的云数据中心，会引入显著的延迟。在边缘部署 AI 代理可以提高效率并减少延迟，但边缘环境由于资源有限且异构，带来了挑战。为了维护移动用户的服务质量（QoS），需要进行代理迁移，而 AI 代理协调 LLMs、任务规划、记忆和外部工具的复杂性使得迁移变得复杂。本文提出了首个针对动态边缘环境中基于 LLM 的 AI 代理的系统化部署与管理解决方案。我们提出了一种用于边缘智能系统中 AI 代理部署和迁移的新型自适应框架。该方法对资源约束和延迟/成本进行建模，利用蚁群算法和基于 LLM 的优化实现高效决策。它能够自主部署代理以优化资源利用和服务质量，并通过仅传输必要状态实现轻量级代理迁移。在使用 AgentScope 实现的分布式系统上，并在全球分布的边缘服务器上验证，我们的解决方案显著降低了部署延迟和迁移成本。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 11:47:46 UTC 发布时间：2025-08-05 11:47:46 UTC

#18 Nemori: Self-Organizing Agent Memory Inspired by Cognitive Science #18 Nemori：受认知科学启发的自组织代理记忆

Authors: [Jiayan Nan](https://arxiv.org/search/?searchtype=author&query=Jiayan Nan), [Wenquan Ma](https://arxiv.org/search/?searchtype=author&query=Wenquan Ma), [Wenlong Wu](https://arxiv.org/search/?searchtype=author&query=Wenlong Wu), [Yize Chen](https://arxiv.org/search/?searchtype=author&query=Yize Chen) 作者：南佳彦，马文全，吴文龙，陈一泽

Large Language Models (LLMs) demonstrate remarkable capabilities, yet their inability to maintain persistent memory in long contexts limits their effectiveness as autonomous agents in long-term interactions. While existing memory systems have made progress, their reliance on arbitrary granularity for defining the basic memory unit and passive, rule-based mechanisms for knowledge extraction limits their capacity for genuine learning and evolution. To address these foundational limitations, we present Nemori, a novel self-organizing memory architecture inspired by human cognitive principles. Nemori’s core innovation is twofold: First, its Two-Step Alignment Principle, inspired by Event Segmentation Theory, provides a principled, top-down method for autonomously organizing the raw conversational stream into semantically coherent episodes, solving the critical issue of memory granularity. Second, its Predict-Calibrate Principle, inspired by the Free-energy Principle, enables the agent to proactively learn from prediction gaps, moving beyond pre-defined heuristics to achieve adaptive knowledge evolution. This offers a viable path toward handling the long-term, dynamic workflows of autonomous agents. Extensive experiments on the LoCoMo and LongMemEval benchmarks demonstrate that Nemori significantly outperforms prior state-of-the-art systems, with its advantage being particularly pronounced in longer contexts. 大型语言模型（LLMs）展现出卓越的能力，但其在长上下文中无法维持持久记忆，限制了其作为自主代理在长期交互中的有效性。尽管现有的记忆系统取得了一定进展，但它们依赖于任意粒度来定义基本记忆单元，以及被动的、基于规则的知识提取机制，限制了其真正学习和进化的能力。为了解决这些根本性限制，我们提出了 Nemori，一种受人类认知原理启发的新型自组织记忆架构。Nemori 的核心创新有两个方面：首先，其两步对齐原则（Two-Step Alignment Principle）受事件分割理论（Event Segmentation Theory）启发，提供了一种有原则的自上而下方法，能够自主地将原始对话流组织成语义连贯的片段，解决了记忆粒度的关键问题。其次，其预测-校准原则（Predict-Calibrate Principle）受自由能原理（Free-energy Principle）启发，使代理能够主动从预测差距中学习，超越预定义的启发式方法，实现自适应的知识进化。这为处理自主代理的长期动态工作流程提供了一条可行的路径。在 LoCoMo 和 LongMemEval 基准上的大量实验表明，Nemori 显著优于之前的最先进系统，其优势在较长的上下文中尤为明显。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 11:41:13 UTC 发布时间：2025-08-05 11:41:13 UTC

#19 ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools #19 ToolVQA：一个用于多步推理 VQA 的外部工具数据集

Authors: [Shaofeng Yin](https://arxiv.org/search/?searchtype=author&query=Shaofeng Yin), [Ting Lei](https://arxiv.org/search/?searchtype=author&query=Ting Lei), [Yang Liu](https://arxiv.org/search/?searchtype=author&query=Yang Liu) 作者：尹少峰，雷婷，刘洋

Integrating external tools into Large Foundation Models (LFMs) has emerged as a promising approach to enhance their problem-solving capabilities. While existing studies have demonstrated strong performance in tool-augmented Visual Question Answering (VQA), recent benchmarks reveal significant gaps in real-world tool-use proficiency, particularly in functionally diverse multimodal settings requiring multi-step reasoning. In this work, we introduce ToolVQA, a large-scale multimodal dataset comprising 23K instances, designed to bridge this gap. Unlike previous datasets that rely on synthetic scenarios and simplified queries, ToolVQA features real-world visual contexts and challenging implicit multi-step reasoning tasks, better aligning with real user interactions. To construct this dataset, we propose ToolEngine, a novel data generation pipeline that employs Depth-First Search (DFS) with a dynamic in-context example matching mechanism to simulate human-like tool-use reasoning. ToolVQA encompasses 10 multimodal tools across 7 diverse task domains, with an average inference length of 2.78 reasoning steps per instance. The fine-tuned 7B LFMs on ToolVQA not only achieve impressive performance on our test set but also surpass the large close-sourced model GPT-3.5-turbo on various out-of-distribution (OOD) datasets, demonstrating strong generalizability to real-world tool-use scenarios. 将外部工具整合到大型基础模型（LFM）中，已成为提升其问题解决能力的有前景的方法。尽管现有研究在工具增强的视觉问答（VQA）中表现出强劲的性能，近期的基准测试却揭示了其在真实世界工具使用能力上的显著差距，尤其是在需要多步骤推理的功能多样的多模态环境中。在本工作中，我们引入了 ToolVQA，一个包含 23K 实例的大规模多模态数据集，旨在弥合这一差距。与依赖合成场景和简化查询的先前数据集不同，ToolVQA 具有真实世界的视觉环境和具有挑战性的隐式多步骤推理任务，更好地契合真实用户的交互需求。为构建该数据集，我们提出了 ToolEngine，一种新颖的数据生成流程，采用深度优先搜索（DFS）结合动态上下文示例匹配机制，以模拟类人工具使用推理。ToolVQA 涵盖了 7 个多样任务领域中的 10 种多模态工具，每个实例的平均推理步骤长度为 2.78 步。在 ToolVQA 上微调的 7B 规模语言模型不仅在我们的测试集上取得了令人印象深刻的表现，还在各种分布外（OOD）数据集上超越了大型闭源模型 GPT-3.5-turbo，展示了其对现实工具使用场景的强大泛化能力。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 10:06:16 UTC 发布时间：2025-08-05 10:06:16 UTC

#20 Full-History Graphs with Edge-Type Decoupled Networks for Temporal Reasoning #20 具有边类型解耦网络的全历史图用于时间推理

Authors: [Osama Mohammed](https://arxiv.org/search/?searchtype=author&query=Osama Mohammed), [Jiaxin Pan](https://arxiv.org/search/?searchtype=author&query=Jiaxin Pan), [Mojtaba Nayyeri](https://arxiv.org/search/?searchtype=author&query=Mojtaba Nayyeri), [Daniel Hernández](https://arxiv.org/search/?searchtype=author&query=Daniel Hernández), [Steffen Staab](https://arxiv.org/search/?searchtype=author&query=Steffen Staab) 作者：Osama Mohammed, Jiaxin Pan, Mojtaba Nayyeri, Daniel Hernández, Steffen Staab

Modeling evolving interactions among entities is critical in many real-world tasks. For example, predicting driver maneuvers in traffic requires tracking how neighboring vehicles accelerate, brake, and change lanes relative to one another over consecutive frames. Likewise, detecting financial fraud hinges on following the flow of funds through successive transactions as they propagate through the network. Unlike classic time-series forecasting, these settings demand reasoning over who interacts with whom and when, calling for a temporal-graph representation that makes both the relations and their evolution explicit. Existing temporal-graph methods typically use snapshot graphs to encode temporal evolution. We introduce a full-history graph that instantiates one node for every entity at every time step and separates two edge sets: (i) intra-time-step edges that capture relations within a single frame and (ii) inter-time-step edges that connect an entity to itself at consecutive steps. To learn on this graph we design an Edge-Type Decoupled Network (ETDNet) with parallel modules: a graph-attention module aggregates information along intra-time-step edges, a multi-head temporal-attention module attends over an entity’s inter-time-step history, and a fusion module combines the two messages after every layer. Evaluated on driver-intention prediction (Waymo) and Bitcoin fraud detection (Elliptic++), ETDNet consistently surpasses strong baselines, lifting Waymo joint accuracy to 75.6% (vs. 74.1%) and raising Elliptic++ illicit-class F1 to 88.1% (vs. 60.4%). These gains demonstrate the benefit of representing structural and temporal relations as distinct edges in a single graph. 建模实体之间不断演变的交互在许多现实任务中至关重要。例如，预测交通中驾驶员的操作需要跟踪相邻车辆在连续帧中如何相互加速、刹车和变道。同样，检测金融欺诈依赖于追踪资金通过连续交易在网络中的流动。与经典的时间序列预测不同，这些场景需要推理谁与谁何时交互，因此需要一种时间图表示，使关系及其演变都变得明确。现有的时间图方法通常使用快照图来编码时间演变。我们引入了一个全历史图，为每个时间步的每个实体实例化一个节点，并区分两类边集：（i）捕捉单帧内关系的同时间步边；（ii）连接实体在连续时间步自身的跨时间步边。为了在该图上进行学习，我们设计了一个边类型解耦网络（ETDNet），包含并行模块：图注意力模块沿时间步内的边聚合信息，多头时间注意力模块关注实体跨时间步的历史，融合模块在每层之后结合这两种信息。在驾驶员意图预测（Waymo）和比特币欺诈检测（Elliptic++）上的评估表明，ETDNet 持续超越强基线，将 Waymo 联合准确率提升至 75.6%（对比 74.1%），并将 Elliptic++非法类别的 F1 分数提升至 88.1%（对比 60.4%）。这些提升展示了在单一图中将结构关系和时间关系表示为不同边的优势。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 09:29:07 UTC 发布：2025-08-05 09:29:07 UTC

#21 InqEduAgent: Adaptive AI Learning Partners with Gaussian Process Augmentation #21 InqEduAgent：带有高斯过程增强的自适应 AI 学习伙伴

Authors: [Tian-Fang Zhao](https://arxiv.org/search/?searchtype=author&query=Tian-Fang Zhao), [Wen-Xi Yang](https://arxiv.org/search/?searchtype=author&query=Wen-Xi Yang) 作者：赵天方，杨文曦

Collaborative partnership matters in inquiry-oriented education. However, most study partners are selected either rely on experience-based assignments with little scientific planning or build on rule-based machine assistants, encountering difficulties in knowledge expansion and inadequate flexibility. This paper proposes an LLM-empowered agent model for simulating and selecting learning partners tailored to inquiry-oriented learning, named InqEduAgent. Generative agents are designed to capture cognitive and evaluative features of learners in real-world scenarios. Then, an adaptive matching algorithm with Gaussian process augmentation is formulated to identify patterns within prior knowledge. Optimal learning-partner matches are provided for learners facing different exercises. The experimental results show the optimal performance of InqEduAgent in most knowledge-learning scenarios and LLM environment with different levels of capabilities. This study promotes the intelligent allocation of human-based learning partners and the formulation of AI-based learning partners. The code, data, and appendix are publicly available at https://github.com/InqEduAgent/InqEduAgent. 协作伙伴关系在探究导向教育中至关重要。然而，大多数学习伙伴的选择要么依赖经验分配，缺乏科学规划，要么基于规则的机器助手，面临知识扩展困难和灵活性不足的问题。本文提出了一种基于 LLM 赋能的代理模型，用于模拟和选择适合探究导向学习的学习伙伴，命名为 InqEduAgent。生成式代理旨在捕捉学习者在真实场景中的认知和评估特征。随后，设计了一种带有高斯过程增强的自适应匹配算法，以识别先验知识中的模式。为面对不同练习的学习者提供最优的学习伙伴匹配。实验结果表明，InqEduAgent 在大多数知识学习场景和不同能力水平的 LLM 环境中表现出最佳性能。本研究促进了基于人类的学习伙伴的智能分配以及基于 AI 的学习伙伴的构建。代码、数据和附录公开发布于 https://github.com/InqEduAgent/InqEduAgent。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 07:33:48 UTC 发布时间：2025-08-05 07:33:48 UTC

#22 Geoint-R1: Formalizing Multimodal Geometric Reasoning with Dynamic Auxiliary Constructions #22 Geoint-R1：通过动态辅助构造形式化多模态几何推理

Mathematical geometric reasoning is essential for scientific discovery and educational development, requiring precise logic and rigorous formal verification. While recent advances in Multimodal Large Language Models (MLLMs) have improved reasoning tasks, existing models typically struggle with formal geometric reasoning, particularly when dynamically constructing and verifying auxiliary geometric elements. To address these challenges, we introduce Geoint-R1, a multimodal reasoning framework designed to generate formally verifiable geometric solutions from textual descriptions and visual diagrams. Geoint-R1 uniquely integrates auxiliary elements construction, formal reasoning represented via Lean4, and interactive visualization. To systematically evaluate and advance formal geometric reasoning, we propose the Geoint benchmark, comprising 1,885 rigorously annotated geometry problems across diverse topics such as plane, spatial, and solid geometry. Each problem includes structured textual annotations, precise Lean4 code for auxiliary constructions, and detailed solution steps verified by experts. Extensive experiments demonstrate that Geoint-R1 significantly surpasses existing multimodal and math-specific reasoning models, particularly on challenging problems requiring explicit auxiliary element constructions. 数学几何推理对于科学发现和教育发展至关重要，需依赖精确的逻辑和严格的形式验证。尽管多模态大型语言模型（MLLMs）在推理任务上取得了进展，现有模型通常在形式几何推理方面表现欠佳，尤其是在动态构建和验证辅助几何元素时。为解决这些挑战，我们提出了 Geoint-R1，一种多模态推理框架，旨在从文本描述和视觉图示中生成形式可验证的几何解答。Geoint-R1 独特地整合了辅助元素构建、通过 Lean4 表示的形式推理以及交互式可视化。为了系统评估和推动形式几何推理的发展，我们提出了 Geoint 基准，包含 1,885 个经过严格注释的几何问题，涵盖平面、空间和立体几何等多样主题。每个问题均包括结构化文本注释、用于辅助构建的精确 Lean4 代码以及专家验证的详细解题步骤。大量实验表明，Geoint-R1 在多模态和数学专用推理模型中表现显著优于现有模型，尤其在需要显式辅助元素构建的复杂问题上表现突出。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 07:29:58 UTC 发布时间：2025-08-05 07:29:58 UTC

#23 Causal identification with Y0 #23 因果识别与 Y0

We present the Y0 Python package, which implements causal identification algorithms that apply interventional, counterfactual, and transportability queries to data from (randomized) controlled trials, observational studies, or mixtures thereof. Y0 focuses on the qualitative investigation of causation, helping researchers determine whether a causal relationship can be estimated from available data before attempting to estimate how strong that relationship is. Furthermore, Y0 provides guidance on how to transform the causal query into a symbolic estimand that can be non-parametrically estimated from the available data. Y0 provides a domain-specific language for representing causal queries and estimands as symbolic probabilistic expressions, tools for representing causal graphical models with unobserved confounders, such as acyclic directed mixed graphs (ADMGs), and implementations of numerous identification algorithms from the recent causal inference literature. The Y0 source code can be found under the MIT License at https://github.com/y0-causal-inference/y0 and it can be installed with pip install y0. 我们介绍了 Y0 Python 包，该包实现了因果识别算法，能够将干预、反事实和可迁移性查询应用于（随机）对照试验、观察性研究或其混合数据。 Y0 专注于因果关系的定性研究，帮助研究人员在尝试估计因果关系强度之前，确定是否可以从可用数据中估计因果关系。此外， Y0 提供了如何将因果查询转换为符号估计量的指导，该估计量可以从可用数据中进行非参数估计。 Y0 提供了一种领域特定语言，用于将因果查询和估计量表示为符号概率表达式，提供了表示具有未观测混杂变量的因果图模型（如无环有向混合图（ADMG））的工具，以及实现了近期因果推断文献中的众多识别算法。 Y0 的源代码可在 MIT 许可证下于 https://github.com/y0-causal-inference/y0 获取，并可通过 pip install y0 安装。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 07:13:33 UTC 发布：2025-08-05 07:13:33 UTC

#24 Can Large Language Models Bridge the Gap in Environmental Knowledge? #24 大型语言模型能弥合环境知识差距吗？

Authors: [Linda Smail](https://arxiv.org/search/?searchtype=author&query=Linda Smail), [David Santandreu Calonge](https://arxiv.org/search/?searchtype=author&query=David Santandreu Calonge), [Firuz Kamalov](https://arxiv.org/search/?searchtype=author&query=Firuz Kamalov), [Nur H. Orak](https://arxiv.org/search/?searchtype=author&query=Nur H. Orak) 作者：Linda Smail, David Santandreu Calonge, Firuz Kamalov, Nur H. Orak

This research investigates the potential of Artificial Intelligence (AI) models to bridge the knowledge gap in environmental education among university students. By focusing on prominent large language models (LLMs) such as GPT-3.5, GPT-4, GPT-4o, Gemini, Claude Sonnet, and Llama 2, the study assesses their effectiveness in conveying environmental concepts and, consequently, facilitating environmental education. The investigation employs a standardized tool, the Environmental Knowledge Test (EKT-19), supplemented by targeted questions, to evaluate the environmental knowledge of university students in comparison to the responses generated by the AI models. The results of this study suggest that while AI models possess a vast, readily accessible, and valid knowledge base with the potential to empower both students and academic staff, a human discipline specialist in environmental sciences may still be necessary to validate the accuracy of the information provided. 本研究探讨了人工智能（AI）模型在弥合大学生环境教育知识差距方面的潜力。通过聚焦于知名的 LLMs，如 GPT-3.5、GPT-4、GPT-4o、Gemini、Claude Sonnet 和 Llama 2，研究评估了它们传达环境概念的有效性，从而促进环境教育。研究采用标准化工具环境知识测试（EKT-19），并辅以针对性问题，评估大学生的环境知识与 AI 模型生成的回答进行比较。研究结果表明，尽管 AI 模型拥有庞大、易获取且有效的知识库，能够赋能学生和学术人员，但仍可能需要环境科学领域的人类学科专家来验证所提供信息的准确性。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 06:55:07 UTC 发布时间：2025-08-05 06:55:07 UTC

#25 Toward a Trustworthy Optimization Modeling Agent via Verifiable Synthetic Data Generation #25 通过可验证的合成数据生成迈向可信的优化建模代理

Authors: [Vinicius Lima](https://arxiv.org/search/?searchtype=author&query=Vinicius Lima), [Dzung T. Phan](https://arxiv.org/search/?searchtype=author&query=Dzung T. Phan), [Jayant Kalagnanam](https://arxiv.org/search/?searchtype=author&query=Jayant Kalagnanam), [Dhaval Patel](https://arxiv.org/search/?searchtype=author&query=Dhaval Patel), [Nianjun Zhou](https://arxiv.org/search/?searchtype=author&query=Nianjun Zhou) 作者：Vinicius Lima, Dzung T. Phan, Jayant Kalagnanam, Dhaval Patel, Nianjun Zhou

We present a framework for training trustworthy large language model (LLM) agents for optimization modeling via a verifiable synthetic data generation pipeline. Focusing on linear and mixed-integer linear programming, our approach begins with structured symbolic representations and systematically produces natural language descriptions, mathematical formulations, and solver-executable code. By programmatically constructing each instance with known optimal solutions, the pipeline ensures full verifiability and enables automatic filtering of low-quality demonstrations generated by teacher models. Each dataset instance includes a structured representation of the optimization problem, a corresponding natural language description, the verified optimal solution, and step-by-step demonstrations - generated by a teacher model - that show how to model and solve the problem across multiple optimization modeling languages. This enables supervised fine-tuning of open-source LLMs specifically tailored to optimization tasks. To operationalize this pipeline, we introduce OptiTrust, a modular LLM agent that performs multi-stage translation from natural language to solver-ready code, leveraging stepwise demonstrations, multi-language inference, and majority-vote cross-validation. Our agent achieves state-of-the-art performance on standard benchmarks. Out of 7 datasets, it achieves the highest accuracy on six and outperforms the next-best algorithm by at least 8 percentage on three of them. Our approach provides a scalable, verifiable, and principled path toward building reliable LLM agents for real-world optimization applications. 我们提出了一个通过可验证的合成数据生成管道训练可信赖大型语言模型（LLM）代理进行优化建模的框架。该方法聚焦于线性和混合整数线性规划，始于结构化符号表示，系统地生成自然语言描述、数学公式和求解器可执行代码。通过程序化构建每个实例并附带已知的最优解，该管道确保了完全的可验证性，并能够自动过滤教师模型生成的低质量示范。每个数据集实例包含优化问题的结构化表示、相应的自然语言描述、经过验证的最优解以及由教师模型生成的逐步示范，展示如何在多种优化建模语言中建模和求解该问题。这使得对开源 LLM 进行专门针对优化任务的监督微调成为可能。为了使该流程可操作化，我们引入了 OptiTrust，一个模块化的 LLM 代理，能够通过分步骤演示、多语言推理和多数投票交叉验证，实现从自然语言到求解器就绪代码的多阶段翻译。我们的代理在标准基准测试中达到了最先进的性能。在 7 个数据集中，它在 6 个数据集上取得了最高准确率，并且在其中 3 个数据集上比次优算法高出至少 8 个百分点。我们的方法为构建用于现实世界优化应用的可靠 LLM 代理提供了一条可扩展、可验证且有原则的路径。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 05:54:20 UTC 发布时间：2025-08-05 05:54:20 UTC

#26 AgentSME for Simulating Diverse Communication Modes in Smart Education

Authors: [Wen-Xi Yang](https://arxiv.org/search/?searchtype=author&query=Wen-Xi Yang), [Tian-Fang Zhao](https://arxiv.org/search/?searchtype=author&query=Tian-Fang Zhao)

专门为智能教育量身定制的生成式智能体模型至关重要，但仍相对欠发达。一个关键挑战源于教育环境的内在复杂性：学习者是具有各种认知行为的个体，而教学法从根本上以个性化的人际交流为核心。为解决这一问题，本文提出了AgentSME，这是一个由大语言模型驱动的统一生成式智能体框架。模型中考虑了三种定向通信模式，即独奏（Solo）、单声道（Mono）和回声（Echo），反映了不同类型的智能体自主性和通信互惠性。采用准确性作为主要评估指标，并辅以三个多样性指标来评估推理内容的多样性。对六个广泛使用的大语言模型进行了测试，以验证不同模型层级间通信模式的稳健性，这些模型被平均分为基础能力和高能力配置。结果表明，采用回声通信模式的生成式智能体实现了最高的准确率得分，而DeepSeek展现出最大的多样性。本研究为提升智能体学习能力和启发智能教育模型提供了有价值的信息。

Subject: Artificial Intelligence

Publish: 2025-08-05 05:40:40 UTC

#27 Toward Verifiable Misinformation Detection: A Multi-Tool LLM Agent Framework #27 迈向可验证的错误信息检测：一个多工具 LLM 代理框架

Subjects: Artificial Intelligence, Computation and Language 主题：人工智能，计算与语言

Publish: 2025-08-05 05:15:03 UTC 发布时间：2025-08-05 05:15:03 UTC

#28 T2UE: Generating Unlearnable Examples from Text Descriptions #28 T2UE：从文本描述生成不可学习的样本

Authors: [Xingjun Ma](https://arxiv.org/search/?searchtype=author&query=Xingjun Ma), [Hanxun Huang](https://arxiv.org/search/?searchtype=author&query=Hanxun Huang), [Tianwei Song](https://arxiv.org/search/?searchtype=author&query=Tianwei Song), [Ye Sun](https://arxiv.org/search/?searchtype=author&query=Ye Sun), [Yifeng Gao](https://arxiv.org/search/?searchtype=author&query=Yifeng Gao), [Yu-Gang Jiang](https://arxiv.org/search/?searchtype=author&query=Yu-Gang Jiang) 作者：马星军，黄汉勋，宋天威，孙烨，高一峰，姜宇刚

Large-scale pre-training frameworks like CLIP have revolutionized multimodal learning, but their reliance on web-scraped datasets, frequently containing private user data, raises serious concerns about misuse. Unlearnable Examples (UEs) have emerged as a promising countermeasure against unauthorized model training, employing carefully crafted unlearnable noise to disrupt the learning of meaningful representations from protected data. Current approaches typically generate UEs by jointly optimizing unlearnable noise for both images and their associated text descriptions (or labels). However, this optimization process is often computationally prohibitive for on-device execution, forcing reliance on external third-party services. This creates a fundamental privacy paradox: users must initially expose their data to these very services to achieve protection, thereby compromising privacy in the process. Such a contradiction has severely hindered the development of practical, scalable data protection solutions. To resolve this paradox, we introduce \textbf{Text-to-Unlearnable Example (T2UE)}, a novel framework that enables users to generate UEs using only text descriptions. T2UE circumvents the need for original image data by employing a text-to-image (T2I) model to map text descriptions into the image (noise) space, combined with an error-minimization framework to produce effective unlearnable noise. Extensive experiments show that T2UE-protected data substantially degrades performance in downstream tasks (e.g., cross-modal retrieval) for state-of-the-art models. Notably, the protective effect generalizes across diverse architectures and even to supervised learning settings. Our work demonstrates the feasibility of “zero-contact data protection”, where personal data can be safeguarded based solely on their textual descriptions, eliminating the need for direct data exposure. 像 CLIP 这样的大规模预训练框架已经革新了多模态学习，但它们依赖于网络爬取的数据集，这些数据集经常包含用户的私人数据，因而引发了严重的滥用担忧。不可学习样本（UEs）作为一种有前景的对抗未经授权模型训练的手段出现，通过精心设计的不可学习噪声来干扰模型从受保护数据中学习有意义的表示。目前的方法通常通过联合优化图像及其相关文本描述（或标签）的不可学习噪声来生成 UEs。然而，这一优化过程通常计算量巨大，不适合在设备端执行，迫使用户依赖外部第三方服务。这就产生了一个根本的隐私悖论：用户必须首先将数据暴露给这些服务才能实现保护，从而在过程中损害了隐私。这种矛盾严重阻碍了实用且可扩展的数据保护解决方案的发展。为了解决这一悖论，我们提出了\textbf{文本生成不可学习样本（T2UE）}，这是一种新颖的框架，使用户仅通过文本描述即可生成 UEs。 T2UE 通过使用文本到图像（T2I）模型将文本描述映射到图像（噪声）空间，结合误差最小化框架，绕过了对原始图像数据的需求，从而生成有效的不可学习噪声。大量实验表明，经过 T2UE 保护的数据在下游任务（例如跨模态检索）中显著降低了最先进模型的性能。值得注意的是，这种保护效果能够跨越多种架构，甚至适用于监督学习环境。我们的工作展示了“零接触数据保护”的可行性，即仅基于文本描述即可保护个人数据，无需直接暴露数据。

Subjects: Artificial Intelligence, Cryptography and Security, Computer Vision and Pattern Recognition 主题：人工智能，密码学与安全，计算机视觉与模式识别

Publish: 2025-08-05 05:10:14 UTC 发布时间：2025-08-05 05:10:14 UTC

#29 MissDDIM: Deterministic and Efficient Conditional Diffusion for Tabular Data Imputation #29 MissDDIM：用于表格数据插补的确定性高效条件扩散

Authors: [Youran Zhou](https://arxiv.org/search/?searchtype=author&query=Youran Zhou), [Mohamed Reda Bouadjenek](https://arxiv.org/search/?searchtype=author&query=Mohamed Reda Bouadjenek), [Sunil Aryal](https://arxiv.org/search/?searchtype=author&query=Sunil Aryal) 作者：周有然，Mohamed Reda Bouadjenek，Sunil Aryal

Diffusion models have recently emerged as powerful tools for missing data imputation by modeling the joint distribution of observed and unobserved variables. However, existing methods, typically based on stochastic denoising diffusion probabilistic models (DDPMs), suffer from high inference latency and variable outputs, limiting their applicability in real-world tabular settings. To address these deficiencies, we present in this paper MissDDIM, a conditional diffusion framework that adapts Denoising Diffusion Implicit Models (DDIM) for tabular imputation. While stochastic sampling enables diverse completions, it also introduces output variability that complicates downstream processing. 扩散模型最近作为强大的缺失数据插补工具出现，通过建模观测变量和未观测变量的联合分布实现。然而，现有方法通常基于随机去噪扩散概率模型（DDPMs），存在推理延迟高和输出不稳定的问题，限制了其在实际表格数据场景中的应用。为了解决这些不足，本文提出了 MissDDIM，一种条件扩散框架，将去噪扩散隐式模型（DDIM）适配于表格数据插补。虽然随机采样能够实现多样化的补全，但也带来了输出的变异性，增加了后续处理的复杂度。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 04:55:26 UTC 发布时间：2025-08-05 04:55:26 UTC

#30 EoH-S: Evolution of Heuristic Set using LLMs for Automated Heuristic Design #30 EoH-S：利用 LLMs 进化启发式集合以实现自动化启发式设计

Authors: [Fei Liu](https://arxiv.org/search/?searchtype=author&query=Fei Liu), [Yilu Liu](https://arxiv.org/search/?searchtype=author&query=Yilu Liu), [Qingfu Zhang](https://arxiv.org/search/?searchtype=author&query=Qingfu Zhang), [Xialiang Tong](https://arxiv.org/search/?searchtype=author&query=Xialiang Tong), [Mingxuan Yuan](https://arxiv.org/search/?searchtype=author&query=Mingxuan Yuan) 作者：Fei Liu, Yilu Liu, Qingfu Zhang, Xialiang Tong, Mingxuan Yuan

Automated Heuristic Design (AHD) using Large Language Models (LLMs) has achieved notable success in recent years. Despite the effectiveness of existing approaches, they only design a single heuristic to serve all problem instances, often inducing poor generalization across different distributions or settings. To address this issue, we propose Automated Heuristic Set Design (AHSD), a new formulation for LLM-driven AHD. The aim of AHSD is to automatically generate a small-sized complementary heuristic set to serve diverse problem instances, such that each problem instance could be optimized by at least one heuristic in this set. We show that the objective function of AHSD is monotone and supermodular. Then, we propose Evolution of Heuristic Set (EoH-S) to apply the AHSD formulation for LLM-driven AHD. With two novel mechanisms of complementary population management and complementary-aware memetic search, EoH-S could effectively generate a set of high-quality and complementary heuristics. Comprehensive experimental results on three AHD tasks with diverse instances spanning various sizes and distributions demonstrate that EoH-S consistently outperforms existing state-of-the-art AHD methods and achieves up to 60% performance improvements. 近年来，利用大型语言模型（LLMs）进行自动启发式设计（AHD）取得了显著成功。尽管现有方法效果显著，但它们仅设计单一启发式以服务所有问题实例，往往导致在不同分布或设置下泛化能力较差。为了解决这一问题，我们提出了自动启发式集合设计（AHSD），这是一种基于 LLM 驱动的 AHD 的新型表述。AHSD 的目标是自动生成一个小规模的互补启发式集合，以服务多样化的问题实例，使得每个问题实例至少能被该集合中的一个启发式优化。我们证明了 AHSD 的目标函数是单调且超模的。随后，我们提出了启发式集合进化（EoH-S）方法，将 AHSD 表述应用于 LLM 驱动的 AHD。通过互补种群管理和互补感知的混合搜索两种新机制，EoH-S 能够有效生成一组高质量且互补的启发式。在涵盖不同规模和分布的多样实例的三项 AHD 任务上的全面实验结果表明，EoH-S 始终优于现有的最先进 AHD 方法，性能提升高达 60%。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 04:55:03 UTC 发布时间：2025-08-05 04:55:03 UTC

#31 ContractEval: Benchmarking LLMs for Clause-Level Legal Risk Identification in Commercial Contracts #31 ContractEval：用于商业合同中条款级法律风险识别的 LLMs 基准测试

Authors: [Shuang Liu](https://arxiv.org/search/?searchtype=author&query=Shuang Liu), [Zelong Li](https://arxiv.org/search/?searchtype=author&query=Zelong Li), [Ruoyun Ma](https://arxiv.org/search/?searchtype=author&query=Ruoyun Ma), [Haiyan Zhao](https://arxiv.org/search/?searchtype=author&query=Haiyan Zhao), [Mengnan Du](https://arxiv.org/search/?searchtype=author&query=Mengnan Du) 作者：刘爽，李泽龙，马若云，赵海燕，杜梦楠

The potential of large language models (LLMs) in specialized domains such as legal risk analysis remains underexplored. In response to growing interest in locally deploying open-source LLMs for legal tasks while preserving data confidentiality, this paper introduces ContractEval, the first benchmark to thoroughly evaluate whether open-source LLMs could match proprietary LLMs in identifying clause-level legal risks in commercial contracts. Using the Contract Understanding Atticus Dataset (CUAD), we assess 4 proprietary and 15 open-source LLMs. Our results highlight five key findings: (1) Proprietary models outperform open-source models in both correctness and output effectiveness, though some open-source models are competitive in certain specific dimensions. (2) Larger open-source models generally perform better, though the improvement slows down as models get bigger. (3) Reasoning (“thinking”) mode improves output effectiveness but reduces correctness, likely due to over-complicating simpler tasks. (4) Open-source models generate “no related clause” responses more frequently even when relevant clauses are present. This suggests “laziness” in thinking or low confidence in extracting relevant content. (5) Model quantization speeds up inference but at the cost of performance drop, showing the tradeoff between efficiency and accuracy. These findings suggest that while most LLMs perform at a level comparable to junior legal assistants, open-source models require targeted fine-tuning to ensure correctness and effectiveness in high-stakes legal settings. ContractEval offers a solid benchmark to guide future development of legal-domain LLMs. 大型语言模型（LLMs）在法律风险分析等专业领域的潜力尚未被充分挖掘。针对日益增长的在本地部署开源 LLMs 以完成法律任务并同时保护数据机密性的需求，本文提出了 ContractEval，这是首个全面评估开源 LLMs 是否能够在识别商业合同中的条款级法律风险方面匹敌专有 LLMs 的基准测试。基于合同理解 Atticus 数据集（CUAD），我们评估了 4 个专有模型和 15 个开源 LLMs。我们的结果强调了五个关键发现：（1）专有模型在正确性和输出有效性方面均优于开源模型，尽管部分开源模型在某些特定维度上具有竞争力。（2）较大的开源模型通常表现更好，但随着模型规模增大，性能提升趋缓。（3）推理（“思考”）模式提升了输出有效性，但降低了正确性，可能是因为对较简单任务的过度复杂化。（4）即使存在相关条款，开源模型也更频繁地生成“无相关条款”的回答。这表明在思考时存在“懒惰”或在提取相关内容时信心不足。（5）模型量化加快了推理速度，但以性能下降为代价，显示了效率与准确性之间的权衡。这些发现表明，虽然大多数 LLMs 的表现相当于初级法律助理，但开源模型需要针对性微调，以确保在高风险法律环境中的正确性和有效性。ContractEval 提供了一个坚实的基准，指导未来法律领域 LLMs 的发展。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 04:53:05 UTC 发布时间：2025-08-05 04:53:05 UTC

#32 Beyond Surface-Level Detection: Towards Cognitive-Driven Defense Against Jailbreak Attacks via Meta-Operations Reasoning #32 超越表层检测：通过元操作推理实现认知驱动的越狱攻击防御

Authors: [Rui Pu](https://arxiv.org/search/?searchtype=author&query=Rui Pu), [Chaozhuo Li](https://arxiv.org/search/?searchtype=author&query=Chaozhuo Li), [Rui Ha](https://arxiv.org/search/?searchtype=author&query=Rui Ha), [Litian Zhang](https://arxiv.org/search/?searchtype=author&query=Litian Zhang), [Lirong Qiu](https://arxiv.org/search/?searchtype=author&query=Lirong Qiu), [Xi Zhang](https://arxiv.org/search/?searchtype=author&query=Xi Zhang) 作者：蒲睿，李超卓，海睿，张立天，邱丽蓉，张曦

Defending large language models (LLMs) against jailbreak attacks is essential for their safe and reliable deployment. Existing defenses often rely on shallow pattern matching, which struggles to generalize to novel and unseen attack strategies. To address this challenge, we propose the Cognitive-Driven Defense (CDD) framework, which targets the underlying structure of jailbreak prompts by applying meta-operations, defined as basic manipulations that conceal harmful intent.CDD emulates human cognitive reasoning through a structured reasoning chain. It begins with a global perception of the prompt and follows with a localized analysis to uncover hidden manipulations. By applying supervised fine-tuning on this structured chain, the model learns to identify and reason about known manipulation patterns. To enhance generalization to unseen threats, an entropy-guided reinforcement learning algorithm (EG-GRPO) is introduced to encourage exploration of new types and variants of meta-operations. Experiments demonstrate that CDD can achieve state-of-the-art defense performance and exhibit strong generalization to unseen jailbreak attacks. 防御大型语言模型（LLMs）免受越狱攻击对于其安全可靠的部署至关重要。现有的防御方法通常依赖于浅层模式匹配，难以推广到新颖且未见过的攻击策略。为了解决这一挑战，我们提出了认知驱动防御（CDD）框架，该框架通过应用元操作——定义为隐藏有害意图的基本操作——来针对越狱提示的底层结构。CDD 通过结构化推理链模拟人类认知推理。它首先对提示进行全局感知，随后进行局部分析以揭示隐藏的操作。通过对该结构化链条进行监督微调，模型学习识别和推理已知的操作模式。为了增强对未见威胁的泛化能力，引入了一种熵引导的强化学习算法（EG-GRPO），以鼓励探索新的元操作类型和变体。实验表明，CDD 能够实现最先进的防御性能，并展现出对未见越狱攻击的强大泛化能力。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 03:58:15 UTC 发布：2025-08-05 03:58:15 UTC

#33 Tree-of-Reasoning: Towards Complex Medical Diagnosis via Multi-Agent Reasoning with Evidence Tree #33 推理树：通过多智能体推理与证据树实现复杂医疗诊断

Authors: [Qi Peng](https://arxiv.org/search/?searchtype=author&query=Qi Peng), [Jialin Cui](https://arxiv.org/search/?searchtype=author&query=Jialin Cui), [Jiayuan Xie](https://arxiv.org/search/?searchtype=author&query=Jiayuan Xie), [Yi Cai](https://arxiv.org/search/?searchtype=author&query=Yi Cai), [Qing Li](https://arxiv.org/search/?searchtype=author&query=Qing Li) 作者：彭琦、崔佳林、谢佳元、蔡毅、李青

Large language models (LLMs) have shown great potential in the medical domain. However, existing models still fall short when faced with complex medical diagnosis task in the real world. This is mainly because they lack sufficient reasoning depth, which leads to information loss or logical jumps when processing a large amount of specialized medical data, leading to diagnostic errors. To address these challenges, we propose Tree-of-Reasoning (ToR), a novel multi-agent framework designed to handle complex scenarios. Specifically, ToR introduces a tree structure that can clearly record the reasoning path of LLMs and the corresponding clinical evidence. At the same time, we propose a cross-validation mechanism to ensure the consistency of multi-agent decision-making, thereby improving the clinical reasoning ability of multi-agents in complex medical scenarios. Experimental results on real-world medical data show that our framework can achieve better performance than existing baseline methods. 大型语言模型（LLMs）在医疗领域展现了巨大的潜力。然而，现有模型在面对现实世界中复杂的医疗诊断任务时仍显不足。这主要是因为它们缺乏足够的推理深度，导致在处理大量专业医疗数据时出现信息丢失或逻辑跳跃，从而引发诊断错误。为了解决这些挑战，我们提出了推理树（Tree-of-Reasoning，ToR），这是一种旨在处理复杂场景的新型多智能体框架。具体而言，ToR 引入了一种树状结构，可以清晰地记录 LLMs 的推理路径及相应的临床证据。同时，我们提出了一种交叉验证机制，以确保多智能体决策的一致性，从而提升多智能体在复杂医疗场景中的临床推理能力。基于真实医疗数据的实验结果表明，我们的框架能够实现优于现有基线方法的性能。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 03:31:28 UTC 发布：2025-08-05 03:31:28 UTC

#34 From Text to Trajectories: GPT-2 as an ODE Solver via In-Context #34 从文本到轨迹：GPT-2 作为 ODE 求解器的上下文内方法

Authors: [Ziyang Ma](https://arxiv.org/search/?searchtype=author&query=Ziyang Ma), [Baojian Zhou](https://arxiv.org/search/?searchtype=author&query=Baojian Zhou), [Deqing Yang](https://arxiv.org/search/?searchtype=author&query=Deqing Yang), [Yanghua Xiao](https://arxiv.org/search/?searchtype=author&query=Yanghua Xiao) 作者：马子阳，周宝健，杨德庆，肖阳华

In-Context Learning (ICL) has emerged as a new paradigm in large language models (LLMs), enabling them to perform novel tasks by conditioning on a few examples embedded in the prompt. Yet, the highly nonlinear behavior of ICL for NLP tasks remains poorly understood. To shed light on its underlying mechanisms, this paper investigates whether LLMs can solve ordinary differential equations (ODEs) under the ICL setting. We formulate standard ODE problems and their solutions as sequential prompts and evaluate GPT-2 models on these tasks. Experiments on two types of ODEs show that GPT-2 can effectively learn a meta-ODE algorithm, with convergence behavior comparable to, or better than, the Euler method, and achieve exponential accuracy gains with increasing numbers of demonstrations. Moreover, the model generalizes to out-of-distribution (OOD) problems, demonstrating robust extrapolation capabilities. These empirical findings provide new insights into the mechanisms of ICL in NLP and its potential for solving nonlinear numerical problems. 上下文学习（In-Context Learning，ICL）已成为大型语言模型（LLMs）中的一种新范式，使其能够通过在提示中嵌入少量示例来执行新任务。然而，ICL 在自然语言处理任务中的高度非线性行为仍然缺乏深入理解。为揭示其潜在机制，本文研究了 LLMs 是否能够在 ICL 设置下求解常微分方程（ODEs）。我们将标准 ODE 问题及其解构造成序列化提示，并在这些任务上评估 GPT-2 模型。针对两类 ODE 的实验表明，GPT-2 能够有效学习元 ODE 算法，其收敛行为可与欧拉方法相媲美甚至更优，并且随着示范数量的增加，准确度呈指数级提升。此外，模型能够推广到分布外（OOD）问题，展现出强健的外推能力。这些实证结果为 ICL 在自然语言处理中的机制及其解决非线性数值问题的潜力提供了新的见解。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 03:16:37 UTC 发布时间：2025-08-05 03:16:37 UTC

#35 Collab-Solver: Collaborative Solving Policy Learning for Mixed-Integer Linear Programming #35 Collab-Solver：用于混合整数线性规划的协作求解策略学习

Mixed-integer linear programming (MILP) has been a fundamental problem in combinatorial optimization. Previous works have designed a plethora of hard-coded heuristics to accomplish challenging MILP solving with domain knowledge. Driven by the high capability of neural networks, recent research is devoted to replacing manually designed heuristics with learned policies. Although learning-based MILP methods have shown great promise, existing worksindependentlytreatthepolicylearningineachmoduleofMILPsolvers without considering their interdependence, severely hurting the solving speed and quality. To address this issue, we propose a novel multi-agent-based policy learning framework for MILP (Collab-Solver), which can collaboratively optimize the policies for multiple modules. Specifically, we formulate the collaboration of cut selection and branching in MILP solving as a Stackelberg game. Under this formulation, we develop a two-phase learning paradigm to stabilize the collaborative policy learning, where the first phase achieves the data-communicated policy pretraining and the second phase further orchestrates the policy learning for various modules. The jointly learned policy significantly improves the solving performance on both synthetic and large-scale real-world MILP datasets. Moreover, the policies learned by Collab-Solver have also demonstrated excellent generalization abilities across different instance sets. 混合整数线性规划（MILP）一直是组合优化中的一个基础问题。以往的研究设计了大量硬编码的启发式方法，利用领域知识来完成具有挑战性的 MILP 求解。受神经网络强大能力的驱动，近期研究致力于用学习到的策略替代手工设计的启发式方法。尽管基于学习的 MILP 方法展现了巨大潜力，但现有工作在 MILP 求解器的各个模块中独立处理策略学习，未考虑它们之间的相互依赖，严重影响了解题速度和质量。为了解决这一问题，我们提出了一种基于多智能体的 MILP 策略学习新框架（Collab-Solver），能够协同优化多个模块的策略。具体而言，我们将 MILP 求解中的割平面选择和分支协作建模为一个 Stackelberg 博弈。在此框架下，我们开发了一个两阶段学习范式以稳定协同策略学习，其中第一阶段实现数据通信的策略预训练，第二阶段进一步协调各模块的策略学习。联合学习的策略显著提升了在合成数据集和大规模真实世界 MILP 数据集上的求解性能。此外，Collab-Solver 学习到的策略在不同实例集之间也表现出了出色的泛化能力。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 03:16:04 UTC 发布时间：2025-08-05 03:16:04 UTC

#36 Beyond Policy Optimization: A Data Curation Flywheel for Sparse-Reward Long-Horizon Planning #36 超越策略优化：用于稀疏奖励长时规划的数据策划飞轮

Authors: [Yutong Wang](https://arxiv.org/search/?searchtype=author&query=Yutong Wang), [Pengliang Ji](https://arxiv.org/search/?searchtype=author&query=Pengliang Ji), [Kaixin Li](https://arxiv.org/search/?searchtype=author&query=Kaixin Li), [Baolong Bi](https://arxiv.org/search/?searchtype=author&query=Baolong Bi), [Tao Feng](https://arxiv.org/search/?searchtype=author&query=Tao Feng), [Guillaume Sartoretti](https://arxiv.org/search/?searchtype=author&query=Guillaume Sartoretti) 作者：王宇彤，季鹏亮，李凯鑫，毕宝龙，冯涛，Guillaume Sartoretti

Large Language Reasoning Models have demonstrated remarkable success on static tasks, yet their application to multi-round agentic planning in interactive environments faces two fundamental challenges. First, the intractable credit assignment problem renders conventional reinforcement learning ineffective in sparse-reward settings. Second, the computational overhead of verbose, step-by-step reasoning histories is prohibitive. To address these challenges, we propose BPO, a three-stage framework (bootstrapping, extrapolation, and refinement) that establishes a self-improving data flywheel to develop robust reasoning models for long-horizon, sparse-reward environments. Our framework first bootstraps efficient reasoning using the proposed planning quaternions with long-short chain-of-thought fusion. It then extrapolates to out-of-distribution tasks through complexity-stratified curriculum learning. Finally, the model iteratively refines itself by learning exclusively on experiences selected via reward-gated rejection sampling. Experiments on ALFWorld, ScienceWorld, and WebShop demonstrate that our approach achieves state-of-the-art with significant token efficiency, providing a new recipe for reasoning models in agentic planning. 大型语言推理模型在静态任务中表现出显著成功，但其在交互环境中的多轮自主规划应用面临两个根本性挑战。首先，难以解决的信用分配问题使得传统强化学习在稀疏奖励环境中无效。其次，冗长的逐步推理历史带来的计算开销过大。为了解决这些挑战，我们提出了 BPO，一个包含引导、外推和精炼三个阶段的框架，建立了一个自我提升的数据飞轮，以开发适用于长远、稀疏奖励环境的稳健推理模型。我们的框架首先利用提出的规划四元数结合长短链式思维融合，引导高效推理。然后，通过复杂度分层的课程学习对分布外任务进行外推。最后，模型通过仅在通过奖励门控拒绝采样选择的经验上学习，迭代地自我精炼。在 ALFWorld、ScienceWorld 和 WebShop 上的实验表明，我们的方法在显著提升 token 效率的同时，实现了最先进的性能，为代理规划中的推理模型提供了一种新的方案。

Subjects: Artificial Intelligence, Robotics 主题：人工智能，机器人学

Publish: 2025-08-05 02:56:58 UTC 发布时间：2025-08-05 02:56:58 UTC

#37 AGENTiGraph: A Multi-Agent Knowledge Graph Framework for Interactive, Domain-Specific LLM Chatbots #37 AGENTiGraph：一个用于交互式、特定领域 LLM 聊天机器人的多代理知识图框架

Subjects: Artificial Intelligence, Computation and Language 主题：人工智能，计算与语言

Publish: 2025-08-05 01:55:06 UTC 发布时间：2025-08-05 01:55:06 UTC

#38 When AIs Judge AIs: The Rise of Agent-as-a-Judge Evaluation for LLMs #38 当人工智能评判人工智能：LLMs 的代理评审崛起

Author: [Fangyi Yu](https://arxiv.org/search/?searchtype=author&query=Fangyi Yu) 作者：余方毅

As large language models (LLMs) grow in capability and autonomy, evaluating their outputs-especially in open-ended and complex tasks-has become a critical bottleneck. A new paradigm is emerging: using AI agents as the evaluators themselves. This “agent-as-a-judge” approach leverages the reasoning and perspective-taking abilities of LLMs to assess the quality and safety of other models, promising calable and nuanced alternatives to human evaluation. In this review, we define the agent-as-a-judge concept, trace its evolution from single-model judges to dynamic multi-agent debate frameworks, and critically examine their strengths and shortcomings. We compare these approaches across reliability, cost, and human alignment, and survey real-world deployments in domains such as medicine, law, finance, and education. Finally, we highlight pressing challenges-including bias, robustness, and meta evaluation-and outline future research directions. By bringing together these strands, our review demonstrates how agent-based judging can complement (but not replace) human oversight, marking a step toward trustworthy, scalable evaluation for next-generation LLMs. 随着大型语言模型（LLMs）能力和自主性的提升，评估其输出——尤其是在开放性和复杂任务中的输出——已成为一个关键瓶颈。一种新范式正在兴起：使用 AI 代理作为评估者本身。这种“代理作为裁判”的方法利用 LLMs 的推理和换位思考能力来评估其他模型的质量和安全性，承诺提供可扩展且细致入微的替代人类评估的方案。在本综述中，我们定义了代理作为裁判的概念，追溯其从单一模型裁判到动态多代理辩论框架的发展历程，并批判性地审视其优缺点。我们从可靠性、成本和人类一致性等方面比较了这些方法，并调研了其在医学、法律、金融和教育等领域的实际应用。最后，我们强调了包括偏见、鲁棒性和元评估在内的紧迫挑战，并勾勒了未来的研究方向。通过整合这些内容，我们的综述展示了基于代理的评判如何补充（但不替代）人类监督，标志着迈向可信赖、可扩展的下一代 LLMs 评估的一步。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-05 01:42:25 UTC 发布时间：2025-08-05 01:42:25 UTC

#39 Unified Tool Integration for LLMs: A Protocol-Agnostic Approach to Function Calling #39 LLMs 的统一工具集成：一种与协议无关的函数调用方法

Authors: [Peng Ding](https://arxiv.org/search/?searchtype=author&query=Peng Ding), [Rick Stevens](https://arxiv.org/search/?searchtype=author&query=Rick Stevens) 作者：丁鹏，Rick Stevens

Subjects: Artificial Intelligence, Computation and Language, Machine Learning 主题：人工智能，计算与语言，机器学习

Publish: 2025-08-05 01:06:49 UTC 发布时间：2025-08-05 01:06:49 UTC

#40 Defend LLMs Through Self-Consciousness #40 通过自我意识保护 LLMs

Subjects: Artificial Intelligence, Computation and Language, Cryptography and Security 主题：人工智能，计算与语言，密码学与安全

Publish: 2025-08-04 23:52:15 UTC 发布时间：2025-08-04 23:52:15 UTC

#41 Polymath: A Self-Optimizing Agent with Dynamic Hierarchical Workflow #41 Polymath：一个具有动态分层工作流程的自我优化代理

Authors: [Chia-Tung Ho](https://arxiv.org/search/?searchtype=author&query=Chia-Tung Ho), [Jing Gong](https://arxiv.org/search/?searchtype=author&query=Jing Gong), [Xufeng Yao](https://arxiv.org/search/?searchtype=author&query=Xufeng Yao), [Yunsheng Bai](https://arxiv.org/search/?searchtype=author&query=Yunsheng Bai), [Abhishek B Akkur](https://arxiv.org/search/?searchtype=author&query=Abhishek B Akkur), [Haoxing Ren](https://arxiv.org/search/?searchtype=author&query=Haoxing Ren) 作者：Chia-Tung Ho，Jing Gong，Xufeng Yao，Yunsheng Bai，Abhishek B Akkur，Haoxing Ren

Large language models (LLMs) excel at solving complex tasks by executing agentic workflows composed of detailed instructions and structured operations. Yet, building general-purpose agents by manually embedding foundation models into agentic systems such as Chain-of-Thought, Self-Reflection, and ReACT through text interfaces limits scalability and efficiency. Recently, many researchers have sought to automate the generation and optimization of these workflows through code-based representations. However, existing methods often rely on labeled datasets to train and optimize workflows, making them ineffective and inflexible for solving real-world, dynamic problems where labeled data is unavailable. To address this challenge, we introduce Polymath, a self-optimizing agent with dynamic hierarchical workflow that leverages the flexibility of task flow graphs and the expressiveness of code-represented workflows to solve a wide range of real-world, dynamic problems. The proposed optimization methodology integrates multi-grid-inspired graph optimization with a self-reflection-guided evolutionary algorithm to refine workflows without labeled data. Experimental results on six benchmark datasets across coding, math, and multi-turn QA tasks show that Polymath achieves 8.1% average improvement over state-of-the-art baselines. 大型语言模型（LLMs）通过执行由详细指令和结构化操作组成的代理工作流，擅长解决复杂任务。然而，通过文本接口将基础模型手动嵌入到诸如 Chain-of-Thought、Self-Reflection 和 ReACT 等代理系统中构建通用代理，限制了可扩展性和效率。近年来，许多研究人员试图通过基于代码的表示自动生成和优化这些工作流。然而，现有方法通常依赖带标签的数据集来训练和优化工作流，这使得它们在解决无标签数据的现实动态问题时效果不佳且缺乏灵活性。为了解决这一挑战，我们提出了 Polymath，一种具有动态分层工作流的自我优化代理，利用任务流图的灵活性和代码表示工作流的表达能力，解决广泛的现实动态问题。所提出的优化方法结合了多网格启发的图优化与自我反思引导的进化算法，实现了在无标签数据情况下对工作流的精炼。在编码、数学和多轮问答任务的六个基准数据集上的实验结果表明，Polymath 相比最先进的基线方法平均提升了 8.1%。

Subjects: Artificial Intelligence, Machine Learning 主题：人工智能，机器学习

Publish: 2025-08-04 23:50:02 UTC 发布时间：2025-08-04 23:50:02 UTC

#42 MedBLINK: Probing Basic Perception in Multimodal Language Models for Medicine #42 MedBLINK：探究医学多模态语言模型中的基础感知能力

Multimodal language models (MLMs) show promise for clinical decision support and diagnostic reasoning, raising the prospect of end-to-end automated medical image interpretation. However, clinicians are highly selective in adopting AI tools; a model that makes errors on seemingly simple perception tasks such as determining image orientation or identifying whether a CT scan is contrast-enhance are unlikely to be adopted for clinical tasks. We introduce Medblink, a benchmark designed to probe these models for such perceptual abilities. Medblink spans eight clinically meaningful tasks across multiple imaging modalities and anatomical regions, totaling 1,429 multiple-choice questions over 1,605 images. We evaluate 19 state-of-the-art MLMs, including general purpose (GPT4o, Claude 3.5 Sonnet) and domain specific (Med Flamingo, LLaVA Med, RadFM) models. While human annotators achieve 96.4% accuracy, the best-performing model reaches only 65%. These results show that current MLMs frequently fail at routine perceptual checks, suggesting the need to strengthen their visual grounding to support clinical adoption. Data is available on our project page. 多模态语言模型（MLMs）在临床决策支持和诊断推理方面展现出潜力，带来了端到端自动化医学图像解读的前景。然而，临床医生在采用人工智能工具时非常挑剔；对于那些在看似简单的感知任务上出现错误的模型，比如判断图像方向或识别 CT 扫描是否使用了对比剂，这类模型不太可能被用于临床任务。我们推出了 Medblink，这是一个旨在检测这些模型感知能力的基准测试。Medblink 涵盖了跨多个成像模态和解剖区域的八个临床相关任务，共计 1,429 个多项选择题，涉及 1,605 张图像。我们评估了 19 个最先进的 MLM 模型，包括通用型（GPT4o、Claude 3.5 Sonnet）和领域特定型（Med Flamingo、LLaVA Med、RadFM）模型。尽管人工标注者的准确率达到 96.4%，表现最好的模型仅达到 65%。这些结果表明，当前的 MLM 模型在常规感知检查中经常失败，提示需要加强其视觉基础能力以支持临床应用。数据可在我们的项目页面获取。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-04 23:19:18 UTC 发布：2025-08-04 23:19:18 UTC

#43 AQUAH: Automatic Quantification and Unified Agent in Hydrology #43 AQUAH：水文学中的自动量化与统一代理

We introduce AQUAH, the first end-to-end language-based agent designed specifically for hydrologic modeling. Starting from a simple natural-language prompt (e.g., ‘simulate floods for the Little Bighorn basin from 2020 to 2022’), AQUAH autonomously retrieves the required terrain, forcing, and gauge data; configures a hydrologic model; runs the simulation; and generates a self-contained PDF report. The workflow is driven by vision-enabled large language models, which interpret maps and rasters on the fly and steer key decisions such as outlet selection, parameter initialization, and uncertainty commentary. Initial experiments across a range of U.S. basins show that AQUAH can complete cold-start simulations and produce analyst-ready documentation without manual intervention. The results are judged by hydrologists as clear, transparent, and physically plausible. While further calibration and validation are still needed for operational deployment, these early outcomes highlight the promise of LLM-centered, vision-grounded agents to streamline complex environmental modeling and lower the barrier between Earth observation data, physics-based tools, and decision makers. 我们介绍了 AQUAH，这是首个专为水文建模设计的端到端基于语言的智能体。用户只需输入简单的自然语言提示（例如，“模拟 2020 年至 2022 年 Little Bighorn 流域的洪水”），AQUAH 便能自主检索所需的地形、驱动力和测站数据；配置水文模型；运行模拟；并生成自包含的 PDF 报告。该工作流程由具备视觉能力的 LLM 驱动，能够即时解读地图和栅格数据，并指导关键决策，如出口选择、参数初始化和不确定性说明。针对美国多个流域的初步实验表明，AQUAH 能够完成冷启动模拟并生成分析师可用的文档，无需人工干预。水文学家评判其结果清晰、透明且物理上合理。尽管仍需进一步校准和验证以实现实际应用，这些早期成果凸显了以 LLM 为核心、基于视觉的智能体在简化复杂环境建模、降低地球观测数据、物理工具与决策者之间障碍方面的潜力。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-04 22:26:50 UTC 发布时间：2025-08-04 22:26:50 UTC

#44 PentestJudge: Judging Agent Behavior Against Operational Requirements #44 PentestJudge：根据操作需求评判代理行为

Authors: [Shane Caldwell](https://arxiv.org/search/?searchtype=author&query=Shane Caldwell), [Max Harley](https://arxiv.org/search/?searchtype=author&query=Max Harley), [Michael Kouremetis](https://arxiv.org/search/?searchtype=author&query=Michael Kouremetis), [Vincent Abruzzo](https://arxiv.org/search/?searchtype=author&query=Vincent Abruzzo), [Will Pearce](https://arxiv.org/search/?searchtype=author&query=Will Pearce) 作者：Shane Caldwell, Max Harley, Michael Kouremetis, Vincent Abruzzo, Will Pearce

We introduce PentestJudge, a system for evaluating the operations of penetration testing agents. PentestJudge is a large language model (LLM)-as-judge with access to tools that allow it to consume arbitrary trajectories of agent states and tool call history to determine whether a security agent’s actions meet certain operating criteria that would be impractical to evaluate programmatically. We develop rubrics that use a tree structure to hierarchically collapse the penetration testing task for a particular environment into smaller, simpler, and more manageable sub-tasks and criteria until each leaf node represents simple yes-or-no criteria for PentestJudge to evaluate. Task nodes are broken down into different categories related to operational objectives, operational security, and tradecraft. LLM-as-judge scores are compared to human domain experts as a ground-truth reference, allowing us to compare their relative performance with standard binary classification metrics, such as F1 scores. We evaluate several frontier and open-source models acting as judge agents, with the best model reaching an F1 score of 0.83. We find models that are better at tool-use perform more closely to human experts. By stratifying the F1 scores by requirement type, we find even models with similar overall scores struggle with different types of questions, suggesting certain models may be better judges of particular operating criteria. We find that weaker and cheaper models can judge the trajectories of pentests performed by stronger and more expensive models, suggesting verification may be easier than generation for the penetration testing task. We share this methodology to facilitate future research in understanding the ability of judges to holistically and scalably evaluate the process quality of AI-based information security agents so that they may be confidently used in sensitive production environments. 我们介绍了 PentestJudge，一种用于评估渗透测试代理操作的系统。PentestJudge 是一个作为裁判的大型语言模型（LLM），它可以访问工具，从而能够处理代理状态轨迹和工具调用历史，以判断安全代理的行为是否符合某些通过程序化方法难以评估的操作标准。我们开发了使用树状结构的评分标准，将特定环境下的渗透测试任务分层拆解为更小、更简单、更易管理的子任务和标准，直到每个叶节点代表 PentestJudge 可以评估的简单是非标准。任务节点被细分为与操作目标、操作安全和作战技巧相关的不同类别。LLM 作为裁判的评分与人类领域专家的评分作为真实参考进行比较，使我们能够使用标准的二分类指标（如 F1 分数）来比较它们的相对表现。我们评估了多种前沿和开源模型作为评判代理，其中表现最好的模型达到了 0.83 的 F1 分数。我们发现，在工具使用方面表现更好的模型，其表现更接近人类专家。通过按需求类型分层 F1 分数，我们发现即使是整体得分相似的模型，在不同类型的问题上也存在困难，这表明某些模型可能更适合评判特定的操作标准。我们发现，较弱且成本较低的模型能够评判由更强大且更昂贵的模型执行的渗透测试轨迹，这表明对于渗透测试任务，验证可能比生成更容易。我们分享这一方法论，以促进未来研究，帮助理解评判者全面且可扩展地评估基于 AI 的信息安全代理过程质量的能力，从而使其能够在敏感的生产环境中被自信地使用。

Subjects: Artificial Intelligence, Cryptography and Security 主题：人工智能，密码学与安全

Publish: 2025-08-04 21:52:50 UTC 发布时间：2025-08-04 21:52:50 UTC

#45 Enhancing Japanese Large Language Models with Reasoning Vectors #45 利用推理向量增强日语大型语言模型

Authors: [Carolina Minami Oguchi](https://arxiv.org/search/?searchtype=author&query=Carolina Minami Oguchi), [Leo Wei](https://arxiv.org/search/?searchtype=author&query=Leo Wei), [Koyo Kobayashi](https://arxiv.org/search/?searchtype=author&query=Koyo Kobayashi), [Hsin-Tai Wu](https://arxiv.org/search/?searchtype=author&query=Hsin-Tai Wu), [Dipak Ghosal](https://arxiv.org/search/?searchtype=author&query=Dipak Ghosal) 作者：Carolina Minami Oguchi，Leo Wei，Koyo Kobayashi，Hsin-Tai Wu，Dipak Ghosal

Post-training methods have improved the performance and enhanced the reasoning capability for mainstream large language models (LLMs), but the same is challenging for Japanese LLMs to achieve due to the amount of resources required. Inspired by task vectors that extract the change of weights before and after training, specifically for a certain task, we obtain reasoning vectors from reasoning LLMs and apply them to Japanese LLMs to boost their performance. While the resources available present a challenge to improve Japanese LLMs, we present a simple and effective way to obtain high improvement and hope to inspire for other languages. 后训练方法提升了主流大型语言模型（LLMs）的性能并增强了推理能力，但由于所需资源的数量，日本 LLMs 实现同样效果面临挑战。受任务向量启发，该向量提取了针对特定任务训练前后权重的变化，我们从推理 LLMs 中获得推理向量，并将其应用于日本 LLMs 以提升其性能。尽管可用资源对提升日本 LLMs 构成挑战，我们提出了一种简单有效的方法以获得显著提升，并希望能为其他语言带来启发。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-04 21:31:20 UTC 发布时间：2025-08-04 21:31:20 UTC

#46 Seemingly Simple Planning Problems are Computationally Challenging: The Countdown Game #46 看似简单的规划问题实际上计算复杂：倒计时游戏

普遍认为，当前基础模型和智能体无法制定长期计划是其主要局限之一。然而，现有的规划基准测试远远不足以真正衡量它们的规划能力。大多数现有基准测试要么侧重于像旅行规划这样定义模糊的任务，要么最终利用国际规划竞赛中的现有领域和问题。前者任务难以形式化和验证，后者则专门设计用来测试和挑战现有自动规划器的弱点。为了解决这些不足，我们提出了一种创建以名为 Countdown 的游戏为核心的规划基准测试的方法，该游戏要求玩家通过算术运算从一组输入数字中形成目标数字。我们讨论了该问题如何满足与理想规划能力评估基准相关的多项期望条件。具体来说，该领域允许对每个问题实例进行直观的自然语言描述，计算上具有挑战性（NP 完全），且实例空间足够丰富，因此我们无需担心记忆问题。我们进行了广泛的理论分析，确立了计算复杂性结果，并展示了我们的实例生成程序相较于公共基准的优势。我们评估了多种现有的 LLM 辅助规划方法在使用我们程序生成的实例上的表现。结果表明，与 24 点游戏（Countdown 的一个特例）等其他领域不同，我们提出的动态基准对现有基于 LLM 的方法仍然极具挑战性。

发布：2025-08-04 21:01:03 UTC

#47 A Multi-Agent System for Complex Reasoning in Radiology Visual Question Answering #47 用于放射学视觉问答中复杂推理的多智能体系统

Authors: [Ziruo Yi](https://arxiv.org/search/?searchtype=author&query=Ziruo Yi), [Jinyu Liu](https://arxiv.org/search/?searchtype=author&query=Jinyu Liu), [Ting Xiao](https://arxiv.org/search/?searchtype=author&query=Ting Xiao), [Mark V. Albert](https://arxiv.org/search/?searchtype=author&query=Mark V. Albert) 作者：易子若，刘金玉，肖婷，Mark V. Albert

Radiology visual question answering (RVQA) provides precise answers to questions about chest X-ray images, alleviating radiologists’ workload. While recent methods based on multimodal large language models (MLLMs) and retrieval-augmented generation (RAG) have shown promising progress in RVQA, they still face challenges in factual accuracy, hallucinations, and cross-modal misalignment. We introduce a multi-agent system (MAS) designed to support complex reasoning in RVQA, with specialized agents for context understanding, multimodal reasoning, and answer validation. We evaluate our system on a challenging RVQA set curated via model disagreement filtering, comprising consistently hard cases across multiple MLLMs. Extensive experiments demonstrate the superiority and effectiveness of our system over strong MLLM baselines, with a case study illustrating its reliability and interpretability. This work highlights the potential of multi-agent approaches to support explainable and trustworthy clinical AI applications that require complex reasoning. 放射学视觉问答（RVQA）为胸部 X 光图像相关问题提供精确答案，减轻了放射科医生的工作负担。尽管基于多模态大语言模型（MLLMs）和检索增强生成（RAG）的最新方法在 RVQA 中取得了可喜进展，但它们仍面临事实准确性、幻觉现象和跨模态不匹配等挑战。我们提出了一种多智能体系统（MAS），旨在支持 RVQA 中的复杂推理，设有专门的智能体负责上下文理解、多模态推理和答案验证。我们在一个通过模型分歧过滤精心策划的挑战性 RVQA 数据集上评估了该系统，该数据集包含多个 MLLMs 普遍难以处理的案例。大量实验表明，我们的系统在性能和效果上均优于强大的 MLLM 基线，案例研究进一步展示了其可靠性和可解释性。此项工作凸显了多智能体方法在支持需要复杂推理的可解释且可信临床人工智能应用中的潜力。

Subjects: Artificial Intelligence, Information Retrieval 主题：人工智能，信息检索

Publish: 2025-08-04 19:09:52 UTC 发布时间：2025-08-04 19:09:52 UTC

#48 Cognitive Loop via In-Situ Optimization: Self-Adaptive Reasoning for Science #48 通过原位优化实现认知循环：科学的自适应推理

Authors: [Newman Cheng](https://arxiv.org/search/?searchtype=author&query=Newman Cheng), [Gordon Broadbent](https://arxiv.org/search/?searchtype=author&query=Gordon Broadbent), [William Chappell](https://arxiv.org/search/?searchtype=author&query=William Chappell) 作者：Newman Cheng，Gordon Broadbent，William Chappell

The capacity for artificial intelligence (AI) to formulate, evolve, and test altered thought patterns under dynamic conditions indicates advanced cognition that is crucial for scientific discovery. The existing AI development landscape falls into two categories: 1) frameworks over non-reasoning models that natively incorporate opinions on how humans think, and 2) reasoning models that abstract precise control of the reasoning intuition away from end users. While powerful, for scientists to maximize utility of AI in scientific discovery, they not only require accuracy and transparency in reasoning, but also steerability. Hence, we introduce an alternative approach that enables deep and precise control over the reasoning process called: a cognitive loop via in-situ optimization (CLIO). CLIO enables large language models (LLMs) to self-formulate ways of approaching a problem, adapt behavior when self-confidence is low, and ultimately provide scientists with a final belief or answer. Through CLIO’s open design, scientists can observe uncertainty levels, understand how final belief states are formulated using graph structures, and interject corrections. Without any further post-training, OpenAI’s GPT-4.1 with CLIO yields an accuracy of 22.37% in text-based biology and medicine questions on Humanity’s Last Exam (HLE). This yields a 13.82% net or 161.64% relative increase when compared to the base GPT-4.1 model and surpasses OpenAI’s o3 performance in high and low reasoning effort modes. We further discovered that oscillations within internal uncertainty measures are key in determining the accuracy of CLIO’s results, revealing how its open design and internal mechanisms can provide insight and control into scientific decision-making processes. 人工智能（AI）在动态条件下构建、演化和测试变更思维模式的能力，表明其具备先进的认知能力，这对于科学发现至关重要。现有的 AI 开发格局分为两类：1）基于非推理模型的框架，这些模型本身包含了关于人类思维方式的观点；2）推理模型，这类模型将推理直觉的精确控制抽象化，远离终端用户。尽管功能强大，但为了让科学家最大化 AI 在科学发现中的效用，他们不仅需要推理的准确性和透明度，还需要可操控性。因此，我们提出了一种替代方法，称为“通过原位优化的认知循环（CLIO）”，它能够对推理过程进行深度且精确的控制。CLIO 使得 LLMs 能够自我构建解决问题的方法，在自信度较低时调整行为，并最终为科学家提供最终的信念或答案。通过 CLIO 的开放设计，科学家可以观察不确定性水平，理解如何通过图结构形成最终信念状态，并进行纠正干预。在没有任何额外后训练的情况下，OpenAI 的 GPT-4.1 结合 CLIO 在 Humanity’s Last Exam (HLE) 的基于文本的生物学和医学问题上达到了 22.37% 的准确率。与基础的 GPT-4.1 模型相比，这实现了 13.82% 的净提升，或 161.64% 的相对增长，并且在高低推理难度模式下均超过了 OpenAI 的 o3 性能。我们进一步发现，内部不确定性度量中的振荡是决定 CLIO 结果准确性的关键，这揭示了其开放设计和内部机制如何为科学决策过程提供洞察和控制。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-04 18:01:35 UTC 发布时间：2025-08-04 18:01:35 UTC

#49 Large Language Model-based Data Science Agent: A Survey #49 基于大型语言模型的数据科学代理：综述

Authors: [Peiran Wang](https://arxiv.org/search/?searchtype=author&query=Peiran Wang), [Yaoning Yu](https://arxiv.org/search/?searchtype=author&query=Yaoning Yu), [Ke Chen](https://arxiv.org/search/?searchtype=author&query=Ke Chen), [Xianyang Zhan](https://arxiv.org/search/?searchtype=author&query=Xianyang Zhan), [Haohan Wang](https://arxiv.org/search/?searchtype=author&query=Haohan Wang) 作者：王沛然，余耀宁，陈柯，詹贤阳，王浩涵

The rapid advancement of Large Language Models (LLMs) has driven novel applications across diverse domains, with LLM-based agents emerging as a crucial area of exploration. This survey presents a comprehensive analysis of LLM-based agents designed for data science tasks, summarizing insights from recent studies. From the agent perspective, we discuss the key design principles, covering agent roles, execution, knowledge, and reflection methods. From the data science perspective, we identify key processes for LLM-based agents, including data preprocessing, model development, evaluation, visualization, etc. Our work offers two key contributions: (1) a comprehensive review of recent developments in applying LLMbased agents to data science tasks; (2) a dual-perspective framework that connects general agent design principles with the practical workflows in data science. 大型语言模型（LLMs）的快速发展推动了各个领域的新型应用，基于 LLM 的智能体成为一个重要的研究方向。本文综述了面向数据科学任务的基于 LLM 的智能体，汇总了近期研究的见解。从智能体视角出发，我们讨论了关键设计原则，涵盖智能体角色、执行、知识和反思方法。从数据科学视角，我们识别了基于 LLM 智能体的关键流程，包括数据预处理、模型开发、评估、可视化等。我们的工作有两大贡献：（1）全面回顾了基于 LLM 智能体在数据科学任务中的最新进展；（2）提出了一个双重视角框架，将通用智能体设计原则与数据科学的实际工作流程相连接。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-08-02 17:33:18 UTC 发布：2025-08-02 17:33:18 UTC

#50 Recovering Individual-Level Activity Sequences from Location-Based Service Data Using a Novel Transformer-Based Model #50 使用新型基于 Transformer 的模型从基于位置的服务数据中恢复个体级活动序列

Authors: [Weiyu Luo](https://arxiv.org/search/?searchtype=author&query=Weiyu Luo), [Chenfeng Xiong](https://arxiv.org/search/?searchtype=author&query=Chenfeng Xiong) 作者：罗伟宇，熊晨峰

Location-Based Service (LBS) data provides critical insights into human mobility, yet its sparsity often yields incomplete trip and activity sequences, making accurate inferences about trips and activities difficult. We raise a research problem: Can we use activity sequences derived from high-quality LBS data to recover incomplete activity sequences at the individual level? This study proposes a new solution, the Variable Selection Network-fused Insertion Transformer (VSNIT), integrating the Insertion Transformer’s flexible sequence construction with the Variable Selection Network’s dynamic covariate handling capability, to recover missing segments in incomplete activity sequences while preserving existing data. The findings show that VSNIT inserts more diverse, realistic activity patterns, more closely matching real-world variability, and restores disrupted activity transitions more effectively aligning with the target. It also performs significantly better than the baseline model across all metrics. These results highlight VSNIT’s superior accuracy and diversity in activity sequence recovery tasks, demonstrating its potential to enhance LBS data utility for mobility analysis. This approach offers a promising framework for future location-based research and applications. 基于位置的服务（LBS）数据提供了关于人类出行的重要洞察，但其稀疏性常导致出行和活动序列不完整，难以准确推断出行和活动。我们提出一个研究问题：能否利用高质量 LBS 数据中提取的活动序列来恢复个体层面不完整的活动序列？本研究提出了一种新方案——变量选择网络融合插入变换器（VSNIT），将插入变换器灵活的序列构建能力与变量选择网络动态协变量处理能力相结合，以恢复不完整活动序列中的缺失片段，同时保留现有数据。研究结果表明，VSNIT 插入了更多样化、真实的活动模式，更贴近现实世界的变异性，并更有效地恢复了被打断的活动转换，更好地与目标匹配。它在所有指标上均显著优于基线模型。这些结果凸显了 VSNIT 在活动序列恢复任务中的卓越准确性和多样性，展示了其提升 LBS 数据在出行分析中应用价值的潜力。这种方法为未来基于位置的研究和应用提供了一个有前景的框架。

Subjects: Artificial Intelligence, Computational Engineering, Finance, and Science 主题：人工智能，计算工程，金融与科学

Publish: 2025-08-02 00:33:18 UTC 发布时间：2025-08-02 00:33:18 UTC

#51 Planning with Dynamically Changing Domains #51 动态变化领域的规划

Authors: [Mikhail Soutchanski](https://arxiv.org/search/?searchtype=author&query=Mikhail Soutchanski), [Yongmei Liu](https://arxiv.org/search/?searchtype=author&query=Yongmei Liu) 作者：Mikhail Soutchanski，Yongmei Liu

In classical planning and conformant planning, it is assumed that there are finitely many named objects given in advance, and only they can participate in actions and in fluents. This is the Domain Closure Assumption (DCA). However, there are practical planning problems where the set of objects changes dynamically as actions are performed; e.g., new objects can be created, old objects can be destroyed. We formulate the planning problem in first-order logic, assume an initial theory is a finite consistent set of fluent literals, discuss when this guarantees that in every situation there are only finitely many possible actions, impose a finite integer bound on the length of the plan, and propose to organize search over sequences of actions that are grounded at planning time. We show the soundness and completeness of our approach. It can be used to solve the bounded planning problems without DCA that belong to the intersection of sequential generalized planning (without sensing actions) and conformant planning, restricted to the case without the disjunction over fluent literals. We discuss a proof-of-the-concept implementation of our planner. 在经典规划和符合规划中，假设事先给定有限多个具名对象，且只有这些对象可以参与动作和流体。这就是领域封闭假设（DCA）。然而，存在一些实际的规划问题，其中对象集合会随着动作的执行动态变化；例如，可以创建新对象，也可以销毁旧对象。我们在一阶逻辑中形式化规划问题，假设初始理论是有限且一致的流体文字集合，讨论何时这能保证在每个情境中只有有限多可能的动作，施加计划长度的有限整数界限，并提出在规划时对动作序列进行基于实例的搜索组织。我们证明了该方法的正确性和完备性。该方法可用于解决不依赖 DCA 的有界规划问题，这些问题属于顺序广义规划（无感知动作）与符合规划的交集，且限制于无流体文字析取的情况。我们讨论了该规划器的概念验证实现。

Subjects: Artificial Intelligence, Computational Engineering, Finance, and Science 主题：人工智能、计算工程、金融与科学

Publish: 2025-07-26 17:34:25 UTC 发布时间：2025-07-26 17:34:25 UTC

#52 Efficient Agents: Building Effective Agents While Reducing Cost #52 高效代理：在降低成本的同时构建有效代理

The remarkable capabilities of Large Language Model (LLM)-driven agents have enabled sophisticated systems to tackle complex, multi-step tasks, but their escalating costs threaten scalability and accessibility. This work presents the first systematic study of the efficiency-effectiveness trade-off in modern agent systems, addressing the critical need for cost-effective designs without sacrificing performance. We investigate three key questions: (1) How much complexity do agentic tasks inherently require? (2) When do additional modules yield diminishing returns? (3) How much efficiency can be gained through the design of efficient agent frameworks? Through an empirical analysis on the GAIA benchmark, we evaluate the impact of LLM backbone selection, agent framework designs, and test-time scaling strategies. Using the cost-of-pass metric, we quantify the efficiency-performance trade-off across these dimensions. Our findings inform the development of Efficient Agents , a novel agent framework that has an optimal complexity to task requirements. Efficient Agents retains 96.7% of the performance of OWL, one leading open-source agent framework, while reducing operational costs from 0.398to0.228, resulting in a 28.4% improvement in cost-of-pass. Our work provides actionable insights for designing efficient, high-performing agent systems, advancing the accessibility and sustainability of AI-driven solutions. 大型语言模型（LLM）驱动的代理展现出卓越的能力，使复杂的多步骤任务得以解决，但其不断上升的成本威胁着系统的可扩展性和可及性。本文首次系统地研究了现代代理系统中的效率与效果权衡，解决了在不牺牲性能的前提下实现成本效益设计的关键需求。我们探讨了三个核心问题：（1）代理任务本质上需要多少复杂度？（2）何时额外模块的收益递减？（3）通过设计高效代理框架可以获得多少效率提升？通过对 GAIA 基准的实证分析，我们评估了 LLM 骨干选择、代理框架设计和测试时扩展策略的影响。利用 cost-of-pass 指标，我们量化了这些维度上的效率与性能权衡。我们的研究成果为高效代理（Efficient Agents）的开发提供了指导，该代理框架在复杂度与任务需求之间达到了最佳平衡。 Efficient Agents 保持了 OWL（一个领先的开源代理框架）96.7%的性能，同时将运营成本从 0.228 降低，实现了 28.4%的成本效益提升。我们的工作为设计高效、高性能的代理系统提供了可操作的见解，推动了基于 AI 的解决方案的可及性和可持续性。

Subjects: Artificial Intelligence, Computation and Language, Multiagent Systems 主题：人工智能，计算与语言，多智能体系统

Publish: 2025-07-24 17:56:51 UTC 发布时间：2025-07-24 17:56:51 UTC

#53 CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward #53 CompassVerifier：一个用于 LLMs 评估和结果奖励的统一且稳健的验证器

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-05 17:55:24 UTC 发布时间：2025-08-05 17:55:24 UTC

#54 Self-Questioning Language Models #54 自我提问语言模型

Authors: [Lili Chen](https://arxiv.org/search/?searchtype=author&query=Lili Chen), [Mihir Prabhudesai](https://arxiv.org/search/?searchtype=author&query=Mihir Prabhudesai), [Katerina Fragkiadaki](https://arxiv.org/search/?searchtype=author&query=Katerina Fragkiadaki), [Hao Liu](https://arxiv.org/search/?searchtype=author&query=Hao Liu), [Deepak Pathak](https://arxiv.org/search/?searchtype=author&query=Deepak Pathak) 作者：Lili Chen、Mihir Prabhudesai、Katerina Fragkiadaki、Hao Liu、Deepak Pathak

Can large language models improve without external data – by generating their own questions and answers? We hypothesize that a pre-trained language model can improve its reasoning skills given only a single prompt specifying the topic (e.g., algebra word problems) and asking the model to generate its own questions. To do this, we propose Self-Questioning Language Models (SQLM): an asymmetric self-play framework where a proposer is given the topic and generates a question for a solver, who tries to answer it. Both the proposer and solver are trained via reinforcement learning. The proposer receives a reward if the problem is not too easy or too difficult, and the solver receives a reward based on majority voting, a proxy for correctness in the absence of ground-truth answers. For coding, the proposer can instead generate unit tests which are used for verification. We study this asymmetric self-play framework on three benchmarks: three-digit multiplication, algebra problems from the OMEGA benchmark, and programming problems from Codeforces. By continually generating more interesting problems and attempting to solve them, language models can improve on downstream benchmarks without access to any curated training datasets. 大型语言模型能否在没有外部数据的情况下，通过生成自己的问题和答案来提升能力？我们假设，预训练语言模型仅通过一个指定主题（例如代数应用题）的单一提示，并要求模型生成自己的问题，就能提升其推理能力。为此，我们提出了自问语言模型（Self-Questioning Language Models，SQLM）：一种非对称自我对弈框架，其中提问者获得主题并为解答者生成问题，解答者尝试回答该问题。提问者和解答者均通过强化学习进行训练。若问题难度适中（既不过于简单也不过于困难），提问者将获得奖励；解答者则根据多数投票获得奖励，这在缺乏标准答案时作为正确性的代理指标。对于编程任务，提问者可以生成单元测试以用于验证。我们在三个基准测试上研究了该非对称自我对弈框架：三位数乘法、OMEGA 基准中的代数问题以及 Codeforces 的编程问题。通过不断生成更有趣的问题并尝试解决它们，语言模型可以在没有任何策划训练数据集的情况下提升下游基准测试的表现。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-05 17:51:33 UTC 发布时间：2025-08-05 17:51:33 UTC

#55 Classifying Epistemic Relationships in Human-AI Interaction: An Exploratory Approach #55 分类人机交互中的认知关系：一种探索性方法

Authors: [Shengnan Yang](https://arxiv.org/search/?searchtype=author&query=Shengnan Yang), [Rongqian Ma](https://arxiv.org/search/?searchtype=author&query=Rongqian Ma) 作者：杨胜楠，马荣谦

As AI systems become integral to knowledge-intensive work, questions arise not only about their functionality but also their epistemic roles in human-AI interaction. While HCI research has proposed various AI role typologies, it often overlooks how AI reshapes users’ roles as knowledge contributors. This study examines how users form epistemic relationships with AI-how they assess, trust, and collaborate with it in research and teaching contexts. Based on 31 interviews with academics across disciplines, we developed a five-part codebook and identified five relationship types: Instrumental Reliance, Contingent Delegation, Co-agency Collaboration, Authority Displacement, and Epistemic Abstention. These reflect variations in trust, assessment modes, tasks, and human epistemic status. Our findings show that epistemic roles are dynamic and context-dependent. We argue for shifting beyond static metaphors of AI toward a more nuanced framework that captures how humans and AI co-construct knowledge, enriching HCI’s understanding of the relational and normative dimensions of AI use. 随着人工智能系统成为知识密集型工作的核心组成部分，关于其功能性的疑问之外，人们也开始关注其在人机交互中的认知角色。尽管人机交互研究提出了多种人工智能角色类型，但往往忽视了人工智能如何重塑用户作为知识贡献者的角色。本研究探讨了用户如何与人工智能建立认知关系——他们如何评估、信任并在研究和教学环境中与之协作。基于对跨学科学者的 31 次访谈，我们制定了一个包含五部分的编码手册，并识别出五种关系类型：工具性依赖、权变委托、共代理协作、权威置换和认知回避。这些类型反映了信任程度、评估方式、任务类型及人类认知地位的差异。我们的研究发现，认知角色是动态且依赖于具体情境的。我们主张超越对人工智能的静态隐喻，构建一个更细致的框架，以捕捉人类与人工智能共建知识的过程，丰富人机交互领域对人工智能使用中关系性和规范性维度的理解。

Subjects: Human-Computer Interaction, Artificial Intelligence, Computers and Society 主题：人机交互，人工智能，计算机与社会

Publish: 2025-08-02 23:41:28 UTC 发布时间：2025-08-02 23:41:28 UTC

#56 Beyond risk: A proto-framework for assessing the societal impact of AI systems #56 超越风险：评估人工智能系统社会影响的原型框架

Author: [Willem Fourie](https://arxiv.org/search/?searchtype=author&query=Willem Fourie) 作者：Willem Fourie

In the discourse on AI regulation, ‘responsible AI’ is the dominant paradigm, with the focus on mitigating the risks related to AI systems. While this focus is important and necessary, it has limited use for a systematic consideration of AI’s societal impact. This paper proposes a proto-framework for assessing the societal impact of AI systems by operationalising the concept of freedom. This proto-framework is intended as a step towards a fully operationalised framework to be used in policymaking contexts. By drawing on Kantian philosophy and related contemporary interpretations, freedom is developed as the counterpart to the concept of responsibility. Two dimensions of freedom are developed in further detail: freedom as capability and freedom as opportunity. These two dimensions of freedom are then applied in a proto-framework that systematically considers AI’s impact on society using the Sustainable Development Goals. This proto-framework aims to complement current risk-based approaches and thereby offers a first step towards operationalising the concept of freedom in AI regulation. 在关于人工智能监管的讨论中，“负责任的人工智能”是主导范式，重点在于减轻与人工智能系统相关的风险。虽然这一重点重要且必要，但对于系统性地考虑人工智能对社会的影响却作用有限。本文提出了一个评估人工智能系统社会影响的原型框架，通过将自由的概念具体化来实现。该原型框架旨在作为迈向完全可操作框架的第一步，以供政策制定环境中使用。通过借鉴康德哲学及相关当代解释，自由被发展为责任概念的对应面。进一步详细阐述了自由的两个维度：能力自由和机会自由。随后，这两个自由维度被应用于一个原型框架中，利用可持续发展目标系统性地考虑人工智能对社会的影响。该原型框架旨在补充当前基于风险的方法，从而为在人工智能监管中具体化自由概念提供第一步。

Subjects: Computers and Society, Artificial Intelligence, Emerging Technologies 主题：计算机与社会，人工智能，新兴技术

Publish: 2025-08-05 17:25:14 UTC 发布：2025-08-05 17:25:14 UTC

#57 A DbC Inspired Neurosymbolic Layer for Trustworthy Agent Design #57 一个受 DbC 启发的神经符号层，用于可信代理设计

Author: [Claudiu Leoveanu-Condrei](https://arxiv.org/search/?searchtype=author&query=Claudiu Leoveanu-Condrei) 作者：Claudiu Leoveanu-Condrei

Generative models, particularly Large Language Models (LLMs), produce fluent outputs yet lack verifiable guarantees. We adapt Design by Contract (DbC) and type-theoretic principles to introduce a contract layer that mediates every LLM call. Contracts stipulate semantic and type requirements on inputs and outputs, coupled with probabilistic remediation to steer generation toward compliance. The layer exposes the dual view of LLMs as semantic parsers and probabilistic black-box components. Contract satisfaction is probabilistic and semantic validation is operationally defined through programmer-specified conditions on well-typed data structures. More broadly, this work postulates that any two agents satisfying the same contracts are \emph{functionally equivalent} with respect to those contracts. 生成模型，特别是 LLMs，能够生成流畅的输出，但缺乏可验证的保证。我们借鉴契约式设计（DbC）和类型理论原则，引入一个契约层来调节每次 LLM 调用。契约规定了输入和输出的语义及类型要求，并结合概率性补救措施，引导生成过程朝向合规方向。该层揭示了 LLMs 作为语义解析器和概率黑箱组件的双重视角。契约满足是概率性的，语义验证通过程序员指定的对良类型数据结构的条件在操作上定义。更广泛地说，本工作假设任何两个满足相同契约的代理，在这些契约范围内是\emph{功能等价}的。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-05 17:24:50 UTC 发布时间：2025-08-05 17:24:50 UTC

#58 Forest vs Tree: The (N,K) Trade-off in Reproducible ML Evaluation #58 森林与树：可复现机器学习评估中的 (N,K) 权衡

Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题：机器学习，人工智能，计算与语言

Publish: 2025-08-05 17:18:34 UTC 发布时间：2025-08-05 17:18:34 UTC

Authors: [Ruei-Che Chang](https://arxiv.org/search/?searchtype=author&query=Ruei-Che Chang), [Rosiana Natalie](https://arxiv.org/search/?searchtype=author&query=Rosiana Natalie), [Wenqian Xu](https://arxiv.org/search/?searchtype=author&query=Wenqian Xu), [Jovan Zheng Feng Yap](https://arxiv.org/search/?searchtype=author&query=Jovan Zheng Feng Yap), [Anhong Guo](https://arxiv.org/search/?searchtype=author&query=Anhong Guo) 作者：张睿哲，罗西亚娜·娜塔莉，徐文倩，叶正峰，郭安宏

Recent advancements in large multimodal models have provided blind or visually impaired (BVI) individuals with new capabilities to interpret and engage with the real world through interactive systems that utilize live video feeds. However, the potential benefits and challenges of such capabilities to support diverse real-world assistive tasks remain unclear. In this paper, we present findings from an exploratory study with eight BVI participants. Participants used ChatGPT’s Advanced Voice with Video, a state-of-the-art live video AI released in late 2024, in various real-world scenarios, from locating objects to recognizing visual landmarks, across unfamiliar indoor and outdoor environments. Our findings indicate that current live video AI effectively provides guidance and answers for static visual scenes but falls short in delivering essential live descriptions required in dynamic situations. Despite inaccuracies in spatial and distance information, participants leveraged the provided visual information to supplement their mobility strategies. Although the system was perceived as human-like due to high-quality voice interactions, assumptions about users’ visual abilities, hallucinations, generic responses, and a tendency towards sycophancy led to confusion, distrust, and potential risks for BVI users. Based on the results, we discuss implications for assistive video AI agents, including incorporating additional sensing capabilities for real-world use, determining appropriate intervention timing beyond turn-taking interactions, and addressing ecological and safety concerns. 近年来大型多模态模型的进展为盲人或视障（BVI）个体通过利用实时视频流的交互系统解读和参与现实世界提供了新能力。然而，这些能力在支持多样化现实辅助任务中的潜在益处和挑战仍不明确。本文介绍了我们对八位 BVI 参与者进行的一项探索性研究的发现。参与者在各种现实场景中使用了 ChatGPT 的高级语音视频功能——一款于 2024 年底发布的最先进实时视频 AI，应用范围涵盖从寻找物体到识别视觉地标，涉及陌生的室内和室外环境。研究结果表明，当前的实时视频 AI 能够有效地为静态视觉场景提供指导和解答，但在动态情境中提供必要的实时描述方面仍显不足。尽管空间和距离信息存在不准确，参与者仍利用所提供的视觉信息来补充他们的移动策略。尽管该系统因高质量的语音交互而被认为具有类人特性，但对用户视觉能力的假设、幻觉现象、通用回复以及趋向谄媚的倾向导致了盲视障（BVI）用户的困惑、不信任和潜在风险。基于研究结果，我们讨论了辅助视频人工智能代理的相关启示，包括为实际应用整合额外的感知能力、确定超越轮流交互的适当干预时机，以及解决生态和安全问题。

Subjects: Human-Computer Interaction, Artificial Intelligence 主题：人机交互，人工智能

Publish: 2025-08-05 16:59:02 UTC 发布时间：2025-08-05 16:59:02 UTC

#60 Cross-Model Semantics in Representation Learning #60 表征学习中的跨模型语义

Authors: [Saleh Nikooroo](https://arxiv.org/search/?searchtype=author&query=Saleh Nikooroo), [Thomas Engel](https://arxiv.org/search/?searchtype=author&query=Thomas Engel) 作者：Saleh Nikooroo，Thomas Engel

The internal representations learned by deep networks are often sensitive to architecture-specific choices, raising questions about the stability, alignment, and transferability of learned structure across models. In this paper, we investigate how structural constraints–such as linear shaping operators and corrective paths–affect the compatibility of internal representations across different architectures. Building on the insights from prior studies on structured transformations and convergence, we develop a framework for measuring and analyzing representational alignment across networks with distinct but related architectural priors. Through a combination of theoretical insights, empirical probes, and controlled transfer experiments, we demonstrate that structural regularities induce representational geometry that is more stable under architectural variation. This suggests that certain forms of inductive bias not only support generalization within a model, but also improve the interoperability of learned features across models. We conclude with a discussion on the implications of representational transferability for model distillation, modular learning, and the principled design of robust learning systems. 深度网络学习到的内部表示通常对特定架构的选择较为敏感，这引发了关于学习结构在不同模型间的稳定性、一致性和可迁移性的问题。本文探讨了结构约束——如线性整形算子和校正路径——如何影响不同架构间内部表示的兼容性。基于先前关于结构化变换和收敛性的研究成果，我们构建了一个框架，用于测量和分析具有不同但相关架构先验的网络之间的表示对齐。通过理论洞见、实证探测和受控迁移实验的结合，我们证明了结构规律性能够诱导出在架构变化下更稳定的表示几何。这表明某些形式的归纳偏置不仅支持模型内部的泛化能力，还提升了不同模型间学习特征的互操作性。我们最后讨论了表征可迁移性对模型蒸馏、模块化学习以及稳健学习系统的原则性设计的影响。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-05 16:57:24 UTC 发布时间：2025-08-05 16:57:24 UTC

#61 LLMDistill4Ads: Using Cross-Encoders to Distill from LLM Signals for Advertiser Keyphrase Recommendations at eBay #61 LLMDistill4Ads：使用交叉编码器从 LLM 信号中蒸馏以进行 eBay 广告主关键词推荐

Authors: [Soumik Dey](https://arxiv.org/search/?searchtype=author&query=Soumik Dey), [Benjamin Braun](https://arxiv.org/search/?searchtype=author&query=Benjamin Braun), [Naveen Ravipati](https://arxiv.org/search/?searchtype=author&query=Naveen Ravipati), [Hansi Wu](https://arxiv.org/search/?searchtype=author&query=Hansi Wu), [Binbin Li](https://arxiv.org/search/?searchtype=author&query=Binbin Li) 作者：Soumik Dey、Benjamin Braun、Naveen Ravipati、Hansi Wu、Binbin Li

Sellers at eBay are recommended keyphrases to bid on to enhance the performance of their advertising campaigns. The relevance of these keyphrases is crucial in avoiding the overcrowding of search systems with irrelevant items and maintaining a positive seller perception. It is essential that keyphrase recommendations align with both seller and Search judgments regarding auctions. Due to the difficulty in procuring negative human judgment at scale, employing LLM-as-a-judge to mimic seller judgment has been established as the norm in several studies. This study introduces a novel two-step LLM distillation process from a LLM-judge used to debias our Embedding Based Retrieval (EBR) model from the various biases that exist in click-data. We distill from an LLM teacher via a cross-encoder assistant into a bi-encoder student using a multi-task training approach, ultimately employing the student bi-encoder to retrieve relevant advertiser keyphrases. We show that integrating a knowledge distillation process from LLMs in a multi-task training setup enhances bi-encoder performance in retrieving relevant advertiser keyphrases at eBay. eBay 的卖家被推荐竞价的关键词短语，以提升其广告活动的效果。这些关键词短语的相关性对于避免搜索系统被无关商品淹没以及维护卖家的良好形象至关重要。关键词短语的推荐必须与卖家和搜索对拍卖的判断保持一致。由于大规模获取负面人工判断的难度，采用 LLM 作为评判者来模拟卖家判断已成为多项研究中的常规做法。本研究提出了一种新颖的两步 LLM 蒸馏过程，从用作评判者的 LLM 中去偏我们的基于嵌入的检索（EBR）模型，以消除点击数据中存在的各种偏差。我们通过跨编码器助理从 LLM 教师蒸馏到双编码器学生，采用多任务训练方法，最终使用学生双编码器来检索相关的广告主关键词短语。研究表明，在多任务训练设置中整合来自 LLM 的知识蒸馏过程，能够提升双编码器在 eBay 检索相关广告主关键词短语的性能。

Subjects: Information Retrieval, Artificial Intelligence, Machine Learning 主题：信息检索，人工智能，机器学习

Publish: 2025-08-05 16:47:17 UTC 发布时间：2025-08-05 16:47:17 UTC

#62 AttZoom: Attention Zoom for Better Visual Features #62 AttZoom：用于更好视觉特征的注意力缩放

Authors: [Daniel DeAlcala](https://arxiv.org/search/?searchtype=author&query=Daniel DeAlcala), [Aythami Morales](https://arxiv.org/search/?searchtype=author&query=Aythami Morales), [Julian Fierrez](https://arxiv.org/search/?searchtype=author&query=Julian Fierrez), [Ruben Tolosana](https://arxiv.org/search/?searchtype=author&query=Ruben Tolosana) 作者：Daniel DeAlcala，Aythami Morales，Julian Fierrez，Ruben Tolosana

We present Attention Zoom, a modular and model-agnostic spatial attention mechanism designed to improve feature extraction in convolutional neural networks (CNNs). Unlike traditional attention approaches that require architecture-specific integration, our method introduces a standalone layer that spatially emphasizes high-importance regions in the input. We evaluated Attention Zoom on multiple CNN backbones using CIFAR-100 and TinyImageNet, showing consistent improvements in Top-1 and Top-5 classification accuracy. Visual analyses using Grad-CAM and spatial warping reveal that our method encourages fine-grained and diverse attention patterns. Our results confirm the effectiveness and generality of the proposed layer for improving CCNs with minimal architectural overhead. 我们提出了 Attention Zoom，一种模块化且与模型无关的空间注意力机制，旨在提升卷积神经网络（CNN）的特征提取能力。与传统需要针对特定架构集成的注意力方法不同，我们的方法引入了一个独立的层，能够在空间上强调输入中高重要性的区域。我们在多个 CNN 骨干网络上使用 CIFAR-100 和 TinyImageNet 进行了评估，结果显示在 Top-1 和 Top-5 分类准确率上均有稳定提升。通过 Grad-CAM 和空间扭曲的可视化分析表明，我们的方法促进了细粒度且多样化的注意力模式。我们的结果证实了该层在以最小架构开销提升 CNN 性能方面的有效性和通用性。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-05 16:42:08 UTC 发布时间：2025-08-05 16:42:08 UTC

#63 Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction Goedel-Prover-V2：通过分阶段数据合成与自我纠正扩展形式定理证明

我们介绍了 Goedel-Prover-V2，一系列开源语言模型，在自动定理证明领域树立了新的最先进水平。基于标准的专家迭代和强化学习流程，我们的方法包含三项关键创新：（1）分阶段数据合成：我们生成难度逐渐增加的合成任务，训练模型掌握越来越复杂的定理；（2）验证器引导的自我修正：我们使模型能够利用 Lean 编译器的反馈，迭代修订其证明；（3）模型平均：我们合并模型检查点，以缓解训练后期模型输出多样性的下降。我们的小型模型 Goedel-Prover-V2-8B 在 MiniF2F 上达到 84.6%的 pass@32 表现，且在相同指标下优于 DeepSeek-Prover-V2-671B，尽管体积小 80 倍。我们的旗舰模型 Goedel-Prover-V2-32B 在标准模式下于 MiniF2F 达到 88.1%的 pass@32，在自我修正模式下达到 90.4%，大幅超越之前的最先进水平。此外，我们的旗舰模型在 PutnamBench 上以 pass@184 解决了 86 个问题，在排行榜上位列开源模型第一，远超 DeepSeek-Prover-V2-671B 以 pass@1024 解决 47 个问题的记录，且模型规模和计算预算显著更小。发布时（2025 年 7 月至 8 月），Goedel-Prover-V2 在所有开源定理证明器中表现最强。它还在受限的测试时计算预算下，位列包括公开报告性能的闭源系统在内的顶级模型之列。我们的模型、代码和数据已发布于 https://github.com/Goedel-LM/Goedel-Prover-V2。

发布时间：2025-08-05 16:28:22 UTC

#64 Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling #64 区块：在 LLM 服务中通过上下文、知识和预测调度实现负载平衡

Authors: [Wei Da](https://arxiv.org/search/?searchtype=author&query=Wei Da), [Evangelia Kalyvianaki](https://arxiv.org/search/?searchtype=author&query=Evangelia Kalyvianaki) 作者：魏达，Evangelia Kalyvianaki

This paper presents Block, a distributed scheduling framework designed to optimize load balancing and auto-provisioning across instances in large language model serving frameworks by leveraging contextual information from incoming requests. Unlike popular model serving systems that rely on monolithic and heuristic task schedulers, Block operates as a fully distributed, stateless, and predictive scheduling system to achieve low overhead, reliability, and scalability. It leverages the deterministic and predictable characteristics of LLM inferences, such as host configurations, response lengths, and hardware performance, to make scheduling decisions based on accurately predicted metrics. Evaluation on a 12 GPUs cluster shows that Block significantly outperforms heuristic schedulers, boosting serving capacity by up to 16.7% and reducing P99 tail latency by up to 49.5%. These performance gains remain consistent across diverse models, workloads and configurations. Code and data are open-sourced. 本文提出了 Block，一种分布式调度框架，旨在通过利用来自传入请求的上下文信息，优化大型语言模型服务框架中实例间的负载均衡和自动配置。与依赖单体且启发式任务调度器的流行模型服务系统不同，Block 作为一个完全分布式、无状态且预测性的调度系统运行，以实现低开销、高可靠性和可扩展性。它利用 LLM 推理的确定性和可预测特性，如主机配置、响应长度和硬件性能，基于准确预测的指标做出调度决策。在一个 12 GPU 集群上的评估表明，Block 显著优于启发式调度器，提升服务容量最高达 16.7%，并将 P99 尾延迟降低最高达 49.5%。这些性能提升在不同模型、工作负载和配置下均保持一致。代码和数据已开源。

Subjects: Distributed, Parallel, and Cluster Computing, Artificial Intelligence 主题：分布式、并行与集群计算，人工智能

Publish: 2025-08-05 16:27:10 UTC 发布时间：2025-08-05 16:27:10 UTC

#65 MetaScope: Optics-Driven Neural Network for Ultra-Micro Metalens Endoscopy #65 MetaScope：基于光学驱动的神经网络用于超微型金属透镜内窥镜

Miniaturized endoscopy has advanced accurate visual perception within the human body. Prevailing research remains limited to conventional cameras employing convex lenses, where the physical constraints with millimetre-scale thickness impose serious impediments on the micro-level clinical. Recently, with the emergence of meta-optics, ultra-micro imaging based on metalenses (micron-scale) has garnered great attention, serving as a promising solution. However, due to the physical difference of metalens, there is a large gap in data acquisition and algorithm research. In light of this, we aim to bridge this unexplored gap, advancing the novel metalens endoscopy. First, we establish datasets for metalens endoscopy and conduct preliminary optical simulation, identifying two derived optical issues that physically adhere to strong optical priors. Second, we propose MetaScope, a novel optics-driven neural network tailored for metalens endoscopy driven by physical optics. MetaScope comprises two novel designs: Optics-informed Intensity Adjustment (OIA), rectifying intensity decay by learning optical embeddings, and Optics-informed Chromatic Correction (OCC), mitigating chromatic aberration by learning spatial deformations informed by learned Point Spread Function (PSF) distributions. To enhance joint learning, we further deploy a gradient-guided distillation to transfer knowledge from the foundational model adaptively. Extensive experiments demonstrate that MetaScope not only outperforms state-of-the-art methods in both metalens segmentation and restoration but also achieves impressive generalized ability in real biomedical scenes. 微型内窥镜技术推动了人体内精准视觉感知的发展。现有研究仍局限于采用凸透镜的传统相机，毫米级厚度的物理限制对微观临床应用构成了严重障碍。近年来，随着超材料光学的出现，基于金属透镜（微米级）的超微成像引起了广泛关注，成为一种有前景的解决方案。然而，由于金属透镜的物理特性不同，数据采集和算法研究存在较大差距。鉴于此，我们旨在弥合这一未被探索的空白，推动新型金属透镜内窥镜的发展。首先，我们建立了金属透镜内窥镜的数据集并进行了初步光学模拟，识别出两个物理上依赖强光学先验的衍生光学问题。其次，我们提出了 MetaScope，一种基于物理光学驱动、专为金属透镜内窥镜设计的新型光学驱动神经网络。 MetaScope 包含两个新颖设计：光学信息强度调整（OIA），通过学习光学嵌入来校正强度衰减；光学信息色差校正（OCC），通过学习基于点扩散函数（PSF）分布的空间变形来减轻色差。为了增强联合学习，我们进一步部署了梯度引导蒸馏，以自适应地从基础模型中转移知识。大量实验表明，MetaScope 不仅在金属透镜分割和恢复方面优于最先进的方法，还在真实生物医学场景中展现出令人印象深刻的泛化能力。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-05 16:01:00 UTC 发布时间：2025-08-05 16:01:00 UTC

#66 DeepFaith: A Domain-Free and Model-Agnostic Unified Framework for Highly Faithful Explanations #66 DeepFaith：一个无领域限制且模型无关的高度可信解释统一框架

Explainable AI (XAI) builds trust in complex systems through model attribution methods that reveal the decision rationale. However, due to the absence of a unified optimal explanation, existing XAI methods lack a ground truth for objective evaluation and optimization. To address this issue, we propose Deep architecture-based Faith explainer (DeepFaith), a domain-free and model-agnostic unified explanation framework under the lens of faithfulness. By establishing a unified formulation for multiple widely used and well-validated faithfulness metrics, we derive an optimal explanation objective whose solution simultaneously achieves optimal faithfulness across these metrics, thereby providing a ground truth from a theoretical perspective. We design an explainer learning framework that leverages multiple existing explanation methods, applies deduplicating and filtering to construct high-quality supervised explanation signals, and optimizes both pattern consistency loss and local correlation to train a faithful explainer. Once trained, DeepFaith can generate highly faithful explanations through a single forward pass without accessing the model being explained. On 12 diverse explanation tasks spanning 6 models and 6 datasets, DeepFaith achieves the highest overall faithfulness across 10 metrics compared to all baseline methods, highlighting its effectiveness and cross-domain generalizability. 可解释人工智能（XAI）通过模型归因方法揭示决策依据，从而建立对复杂系统的信任。然而，由于缺乏统一的最优解释，现有的 XAI 方法缺乏用于客观评估和优化的真实标准。为解决这一问题，我们提出了基于深度架构的忠实解释器（DeepFaith），这是一个在忠实性视角下的领域无关且模型无关的统一解释框架。通过为多种广泛使用且经过良好验证的忠实性指标建立统一的公式，我们推导出一个最优解释目标，其解同时在这些指标上实现最优忠实性，从理论上提供了真实标准。我们设计了一个解释器学习框架，利用多种现有解释方法，应用去重和过滤构建高质量的监督解释信号，并优化模式一致性损失和局部相关性以训练忠实解释器。训练完成后，DeepFaith 能够通过一次前向传播生成高度忠实的解释，而无需访问被解释的模型。在涵盖 6 个模型和 6 个数据集的 12 个多样化解释任务中，DeepFaith 在 10 个指标上实现了最高的整体可信度，优于所有基线方法，凸显了其有效性和跨领域的泛化能力。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-05 15:53:05 UTC 发布时间：2025-08-05 15:53:05 UTC

#67 Decoding and Engineering the Phytobiome Communication for Smart Agriculture #67 解码与工程化植物微生物组通信以实现智能农业

Authors: [Fatih Gulec](https://arxiv.org/search/?searchtype=author&query=Fatih Gulec), [Hamdan Awan](https://arxiv.org/search/?searchtype=author&query=Hamdan Awan), [Nigel Wallbridge](https://arxiv.org/search/?searchtype=author&query=Nigel Wallbridge), [Andrew W. Eckford](https://arxiv.org/search/?searchtype=author&query=Andrew W. Eckford) 作者：Fatih Gulec、Hamdan Awan、Nigel Wallbridge、Andrew W. Eckford

Smart agriculture applications, integrating technologies like the Internet of Things and machine learning/artificial intelligence (ML/AI) into agriculture, hold promise to address modern challenges of rising food demand, environmental pollution, and water scarcity. Alongside the concept of the phytobiome, which defines the area including the plant, its environment, and associated organisms, and the recent emergence of molecular communication (MC), there exists an important opportunity to advance agricultural science and practice using communication theory. In this article, we motivate to use the communication engineering perspective for developing a holistic understanding of the phytobiome communication and bridge the gap between the phytobiome communication and smart agriculture. Firstly, an overview of phytobiome communication via molecular and electrophysiological signals is presented and a multi-scale framework modeling the phytobiome as a communication network is conceptualized. Then, how this framework is used to model electrophysiological signals is demonstrated with plant experiments. Furthermore, possible smart agriculture applications, such as smart irrigation and targeted delivery of agrochemicals, through engineering the phytobiome communication are proposed. These applications merge ML/AI methods with the Internet of Bio-Nano-Things enabled by MC and pave the way towards more efficient, sustainable, and eco-friendly agricultural production. Finally, the implementation challenges, open research issues, and industrial outlook for these applications are discussed. 智能农业应用将物联网和机器学习/人工智能（ML/AI）等技术整合到农业中，有望应对日益增长的粮食需求、环境污染和水资源短缺等现代挑战。伴随着植物体生物群落（phytobiome）这一概念的提出——该概念定义了包括植物、其环境及相关生物的区域——以及分子通信（MC）的最新兴起，利用通信理论推动农业科学与实践的发展成为一个重要机遇。本文旨在从通信工程的视角出发，构建对植物体生物群落通信的整体理解，弥合植物体生物群落通信与智能农业之间的鸿沟。首先，介绍了通过分子和电生理信号进行的植物体生物群落通信的概述，并构思了一个将植物体生物群落建模为通信网络的多尺度框架。随后，通过植物实验展示了如何利用该框架对电生理信号进行建模。此外，提出了通过工程化植物微生物组通信实现的可能智能农业应用，如智能灌溉和农用化学品的定向投放。这些应用将机器学习/人工智能方法与分子通信支持的生物纳米物联网相结合，为实现更高效、可持续和环保的农业生产铺平了道路。最后，讨论了这些应用的实施挑战、开放的研究问题以及产业前景。

Subjects: Signal Processing, Artificial Intelligence, Emerging Technologies, Networking and Internet Architecture, Molecular Networks 主题：信号处理，人工智能，新兴技术，网络与互联网架构，分子网络

Publish: 2025-08-05 15:50:19 UTC 发布时间：2025-08-05 15:50:19 UTC

#68 Supervised Dynamic Dimension Reduction with Deep Neural Network #68 监督式动态降维深度神经网络

Authors: [Zhanye Luo](https://arxiv.org/search/?searchtype=author&query=Zhanye Luo), [Yuefeng Han](https://arxiv.org/search/?searchtype=author&query=Yuefeng Han), [Xiufan Yu](https://arxiv.org/search/?searchtype=author&query=Xiufan Yu) 作者：罗占业，韩月峰，余秀凡

This paper studies the problem of dimension reduction, tailored to improving time series forecasting with high-dimensional predictors. We propose a novel Supervised Deep Dynamic Principal component analysis (SDDP) framework that incorporates the target variable and lagged observations into the factor extraction process. Assisted by a temporal neural network, we construct target-aware predictors by scaling the original predictors in a supervised manner, with larger weights assigned to predictors with stronger forecasting power. A principal component analysis is then performed on the target-aware predictors to extract the estimated SDDP factors. This supervised factor extraction not only improves predictive accuracy in the downstream forecasting task but also yields more interpretable and target-specific latent factors. Building upon SDDP, we propose a factor-augmented nonlinear dynamic forecasting model that unifies a broad family of factor-model-based forecasting approaches. To further demonstrate the broader applicability of SDDP, we extend our studies to a more challenging scenario when the predictors are only partially observable. We validate the empirical performance of the proposed method on several real-world public datasets. The results show that our algorithm achieves notable improvements in forecasting accuracy compared to state-of-the-art methods. 本文研究了降维问题，旨在提升高维预测变量的时间序列预测性能。我们提出了一种新颖的监督式深度动态主成分分析（SDDP）框架，将目标变量和滞后观测值纳入因子提取过程。在时间神经网络的辅助下，我们通过监督方式对原始预测变量进行缩放，构建了目标感知的预测变量，对预测能力更强的变量赋予更大权重。随后对目标感知的预测变量进行主成分分析，以提取估计的 SDDP 因子。这种监督式因子提取不仅提升了下游预测任务的准确性，还产生了更具解释性和目标特异性的潜在因子。基于 SDDP，我们提出了一种因子增强的非线性动态预测模型，统一了广泛的基于因子模型的预测方法。为了进一步展示 SDDP 的广泛适用性，我们将研究扩展到预测变量仅部分可观测的更具挑战性的场景。我们在多个真实世界的公开数据集上验证了所提方法的实际性能。结果表明，与最先进的方法相比，我们的算法在预测准确性方面取得了显著提升。

Subjects: Machine Learning, Artificial Intelligence, Machine Learning 主题：机器学习，人工智能，机器学习

Publish: 2025-08-05 15:15:30 UTC 发布时间：2025-08-05 15:15:30 UTC

#69 EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering #69 EmoSteer-TTS：通过激活引导实现细粒度且无需训练的情感可控文本转语音

Authors: [Tianxin Xie](https://arxiv.org/search/?searchtype=author&query=Tianxin Xie), [Shan Yang](https://arxiv.org/search/?searchtype=author&query=Shan Yang), [Chenxing Li](https://arxiv.org/search/?searchtype=author&query=Chenxing Li), [Dong Yu](https://arxiv.org/search/?searchtype=author&query=Dong Yu), [Li Liu](https://arxiv.org/search/?searchtype=author&query=Li Liu) 作者：谢天欣，杨珊，李晨星，余东，刘力

Text-to-speech (TTS) has shown great progress in recent years. However, most existing TTS systems offer only coarse and rigid emotion control, typically via discrete emotion labels or a carefully crafted and detailed emotional text prompt, making fine-grained emotion manipulation either inaccessible or unstable. These models also require extensive, high-quality datasets for training. To address these limitations, we propose EmoSteer-TTS, a novel training-free approach, to achieve fine-grained speech emotion control (conversion, interpolation, erasure) by activation steering. We first empirically observe that modifying a subset of the internal activations within a flow matching-based TTS model can effectively alter the emotional tone of synthesized speech. Building on this insight, we then develop a training-free and efficient algorithm, including activation extraction, emotional token searching, and inference-time steering, which can be seamlessly integrated into a wide range of pretrained models (e.g., F5-TTS, CosyVoice2, and E2-TTS). In addition, to derive effective steering vectors, we construct a curated emotional speech dataset with diverse speakers. Extensive experiments demonstrate that EmoSteer-TTS enables fine-grained, interpretable, and continuous control over speech emotion, outperforming the state-of-the-art (SOTA). To the best of our knowledge, this is the first method that achieves training-free and continuous fine-grained emotion control in TTS. 近年来，文本转语音（TTS）技术取得了巨大进展。然而，大多数现有的 TTS 系统仅提供粗糙且僵硬的情感控制，通常通过离散的情感标签或精心设计的详细情感文本提示来实现，使得细粒度的情感操控要么难以实现，要么不稳定。这些模型还需要大量高质量的数据集进行训练。为了解决这些限制，我们提出了 EmoSteer-TTS，一种新颖的无训练方法，通过激活引导实现细粒度的语音情感控制（转换、插值、擦除）。我们首先通过实验证明，修改基于流匹配的 TTS 模型内部部分激活状态，可以有效改变合成语音的情感色彩。基于这一发现，我们开发了一种无训练且高效的算法，包括激活提取、情感标记搜索和推理时引导，能够无缝集成到多种预训练模型中（如 F5-TTS、CosyVoice2 和 E2-TTS）。此外，为了获得有效的引导向量，我们构建了一个包含多样化说话者的精选情感语音数据集。大量实验证明，EmoSteer-TTS 实现了对语音情感的细粒度、可解释且连续的控制，性能优于当前最先进技术（SOTA）。据我们所知，这是首个在 TTS 中实现无需训练且连续细粒度情感控制的方法。

Subjects: Sound, Artificial Intelligence 主题：声音，人工智能

Publish: 2025-08-05 15:12:49 UTC 发布时间：2025-08-05 15:12:49 UTC

#70 Retinal Lipidomics Associations as Candidate Biomarkers for Cardiovascular Health #70 视网膜脂质组学关联作为心血管健康的候选生物标志物

Authors: Inamullah, [Imran Razzak](https://arxiv.org/search/?searchtype=author&query=Imran Razzak), [Shoaib Jameel](https://arxiv.org/search/?searchtype=author&query=Shoaib Jameel) 作者：Inamullah, Imran Razzak, Shoaib Jameel

Retinal microvascular imaging is increasingly recognised as a non invasive method for evaluating systemic vascular and metabolic health. However, the association between lipidomics and retinal vasculature remains inadequate. This study investigates the relationships between serum lipid subclasses, free fatty acids (FA), diacylglycerols (DAG), triacylglycerols (TAG), and cholesteryl esters (CE), and retinal microvascular characteristics in a large population-based cohort. Using Spearman correlation analysis, we examined the interconnection between lipid subclasses and ten retinal microvascular traits, applying the Benjamini-Hochberg false discovery rate (BH-FDR) to adjust for statistical significance. Results indicated that FA were linked to retinal vessel twistiness, while CE correlated with the average widths of arteries and veins. Conversely, DAG and TAG showed negative correlations with the width and complexity of arterioles and venules. These findings suggest that retinal vascular architecture reflects distinct circulating lipid profiles, supporting its role as a non-invasive marker of systemic metabolic health. This study is the first to integrate deep learning (DL)derived retinal traits with lipidomic subclasses in a healthy cohort, thereby providing insights into microvascular structural changes independent of disease status or treatment effects. 视网膜微血管成像日益被认可为评估全身血管和代谢健康的无创方法。然而，脂质组学与视网膜血管之间的关联仍不充分。本研究调查了血清脂质亚类、游离脂肪酸（FA）、二酰基甘油（DAG）、三酰基甘油（TAG）和胆固醇酯（CE）与视网膜微血管特征之间的关系，基于一个大型人群队列。通过斯皮尔曼相关分析，我们考察了脂质亚类与十个视网膜微血管特征之间的相互联系，并采用 Benjamini-Hochberg 假发现率（BH-FDR）调整统计显著性。结果显示，FA 与视网膜血管扭曲度相关，CE 与动脉和静脉的平均宽度相关。相反，DAG 和 TAG 与小动脉和小静脉的宽度及复杂度呈负相关。这些发现表明，视网膜血管结构反映了不同的循环脂质谱，支持其作为全身代谢健康无创标志物的作用。本研究首次将深度学习（DL）提取的视网膜特征与健康队列中的脂质组亚类相结合，从而提供了关于微血管结构变化的见解，这些变化独立于疾病状态或治疗效果。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-05 15:07:02 UTC 发布时间：2025-08-05 15:07:02 UTC

#71 MoKA: Mixture of Kronecker Adapters #71 MoKA：Kronecker 适配器混合模型

Authors: [Mohammadreza Sadeghi](https://arxiv.org/search/?searchtype=author&query=Mohammadreza Sadeghi), [Mahsa Ghazvini Nejad](https://arxiv.org/search/?searchtype=author&query=Mahsa Ghazvini Nejad), [MirHamed Jafarzadeh Asl](https://arxiv.org/search/?searchtype=author&query=MirHamed Jafarzadeh Asl), [Yu Gu](https://arxiv.org/search/?searchtype=author&query=Yu Gu), [Yuanhao Yu](https://arxiv.org/search/?searchtype=author&query=Yuanhao Yu), [Masoud Asgharian](https://arxiv.org/search/?searchtype=author&query=Masoud Asgharian), [Vahid Partovi Nia](https://arxiv.org/search/?searchtype=author&query=Vahid Partovi Nia) 作者：Mohammadreza Sadeghi，Mahsa Ghazvini Nejad，MirHamed Jafarzadeh Asl，Yu Gu，Yuanhao Yu，Masoud Asgharian，Vahid Partovi Nia

Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题：机器学习，人工智能，计算与语言

Publish: 2025-08-05 14:58:14 UTC 发布时间：2025-08-05 14:58:14 UTC

#72 CF-RAG: A Dataset and Method for Carbon Footprint QA Using Retrieval-Augmented Generation #72 CF-RAG：一种使用检索增强生成的碳足迹问答数据集和方法

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-05 14:20:10 UTC 发布时间：2025-08-05 14:20:10 UTC

#73 BitsAI-Fix: LLM-Driven Approach for Automated Lint Error Resolution in Practice #73 BitsAI-Fix：基于 LLM 的自动化 Lint 错误修复实践方法

As enterprise codebases continue to grow in scale and complexity, the volume of lint errors far exceeds engineers’ manual remediation capacity, leading to continuous accumulation of technical debt and hindered development efficiency. This paper presents BitsAI-Fix, an automated lint error remediation workflow based on Large Language Models (LLMs), designed to address this critical challenge in industrial-scale environments. BitsAI-Fix employs tree-sitter for context expansion and generates search-and-replace format patches through specially trained LLMs, followed by lint scan re-verification to output final remediation results. Additionally, our approach introduces an innovative progressive reinforcement learning (RL) training strategy that can automatically acquire verifiable training data during the project cold-start phase and continuously iterate the model by collecting online samples through feedback after system deployment. Furthermore, we designed a targeted rule-based reward mechanism that combines format rewards and correctness rewards while penalizing redundant modifications. We also propose a “code diff matching” methodology to continuously track online effectiveness. In production deployment at ByteDance, our solution has supported over 5,000 engineers, resolved more than 12,000 static analysis issues, achieved approximately 85% remediation accuracy, with around 1,000 weekly active adopters. This work demonstrates the practical feasibility of LLM-based code remediation solutions in enterprise environments and serves as a reference for automated code fix in large-scale industrial scenarios. 随着企业代码库规模和复杂性的不断增长，lint 错误的数量远远超过工程师手动修复的能力，导致技术债务持续积累，开发效率受阻。本文提出了 BitsAI-Fix，一种基于 LLMs 的自动化 lint 错误修复工作流程，旨在解决工业规模环境中的这一关键挑战。BitsAI-Fix 采用 tree-sitter 进行上下文扩展，通过专门训练的 LLMs 生成搜索替换格式的补丁，随后进行 lint 扫描复核，输出最终修复结果。此外，我们的方法引入了一种创新的渐进式强化学习（RL）训练策略，能够在项目冷启动阶段自动获取可验证的训练数据，并通过系统部署后的反馈收集在线样本，持续迭代模型。我们还设计了一种针对性的基于规则的奖励机制，结合格式奖励和正确性奖励，同时惩罚冗余修改。最后，我们提出了一种“代码差异匹配”方法，以持续跟踪在线效果。在字节跳动的生产部署中，我们的解决方案支持了超过 5000 名工程师，解决了 12000 多个静态分析问题，实现了约 85%的修复准确率，拥有约 1000 名每周活跃用户。该工作展示了基于 LLM 的代码修复解决方案在企业环境中的实际可行性，并为大规模工业场景中的自动代码修复提供了参考。

Subjects: Software Engineering, Artificial Intelligence, Machine Learning 主题：软件工程，人工智能，机器学习

Publish: 2025-08-05 14:17:30 UTC 发布时间：2025-08-05 14:17:30 UTC

#74 When Cars Have Stereotypes: Auditing Demographic Bias in Objects from Text-to-Image Models #74 当汽车有刻板印象时：审计文本到图像模型中对象的群体偏见

Authors: [Dasol Choi Jihwan Lee](https://arxiv.org/search/?searchtype=author&query=Dasol Choi Jihwan Lee), [Minjae Lee](https://arxiv.org/search/?searchtype=author&query=Minjae Lee), [Minsuk Kahng](https://arxiv.org/search/?searchtype=author&query=Minsuk Kahng) 作者：Dasol Choi、Jihwan Lee、Minjae Lee、Minsuk Kahng

While prior research on text-to-image generation has predominantly focused on biases in human depictions, we investigate a more subtle yet pervasive phenomenon: demographic bias in generated objects (e.g., cars). We introduce SODA (Stereotyped Object Diagnostic Audit), a novel framework for systematically measuring such biases. Our approach compares visual attributes of objects generated with demographic cues (e.g., “for young people’’) to those from neutral prompts, across 2,700 images produced by three state-of-the-art models (GPT Image-1, Imagen 4, and Stable Diffusion) in five object categories. Through a comprehensive analysis, we uncover strong associations between specific demographic groups and visual attributes, such as recurring color patterns prompted by gender or ethnicity cues. These patterns reflect and reinforce not only well-known stereotypes but also more subtle and unintuitive biases. We also observe that some models generate less diverse outputs, which in turn amplifies the visual disparities compared to neutral prompts. Our proposed auditing framework offers a practical approach for testing, revealing how stereotypes still remain embedded in today’s generative models. We see this as an essential step toward more systematic and responsible AI development. 尽管以往关于文本生成图像的研究主要集中在人类描绘中的偏见，我们则探讨一种更为微妙但普遍存在的现象：生成物体（如汽车）中的人口统计学偏见。我们提出了 SODA（刻板印象物体诊断审计），这是一种系统测量此类偏见的新框架。我们的方法通过比较带有人口统计学提示（如“为年轻人设计”）生成的物体视觉属性与中性提示生成的物体视觉属性，分析了由三种最先进模型（GPT Image-1、Imagen 4 和 Stable Diffusion）在五个物体类别中生成的 2700 张图像。通过全面分析，我们发现特定人口群体与视觉属性之间存在强烈关联，例如由性别或种族提示引发的反复出现的颜色模式。这些模式不仅反映并强化了众所周知的刻板印象，还揭示了更为微妙且不直观的偏见。我们还观察到，一些模型生成的输出多样性较低，这反过来加剧了与中性提示相比的视觉差异。我们提出的审计框架为测试提供了一种实用的方法，揭示了刻板印象仍然嵌入在当今的生成模型中。我们认为这是迈向更系统和负责任的人工智能开发的重要一步。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-05 14:15:53 UTC 发布时间：2025-08-05 14:15:53 UTC

#75 Draw Your Mind: Personalized Generation via Condition-Level Modeling in Text-to-Image Diffusion Models #75 画出你的思维：通过条件级建模实现文本到图像扩散模型的个性化生成

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language 主题：计算机视觉与模式识别，人工智能，计算与语言

Publish: 2025-08-05 14:14:55 UTC 发布时间：2025-08-05 14:14:55 UTC

#76 VideoGuard: Protecting Video Content from Unauthorized Editing #76 VideoGuard：保护视频内容免受未经授权的编辑

Authors: [Junjie Cao](https://arxiv.org/search/?searchtype=author&query=Junjie Cao), [Kaizhou Li](https://arxiv.org/search/?searchtype=author&query=Kaizhou Li), [Xinchun Yu](https://arxiv.org/search/?searchtype=author&query=Xinchun Yu), [Hongxiang Li](https://arxiv.org/search/?searchtype=author&query=Hongxiang Li), [Xiaoping Zhang](https://arxiv.org/search/?searchtype=author&query=Xiaoping Zhang) 作者：曹俊杰，李凯舟，余新春，李鸿翔，张晓平

With the rapid development of generative technology, current generative models can generate high-fidelity digital content and edit it in a controlled manner. However, there is a risk that malicious individuals might misuse these capabilities for misleading activities. Although existing research has attempted to shield photographic images from being manipulated by generative models, there remains a significant disparity in the protection offered to video content editing. To bridge the gap, we propose a protection method named VideoGuard, which can effectively protect videos from unauthorized malicious editing. This protection is achieved through the subtle introduction of nearly unnoticeable perturbations that interfere with the functioning of the intended generative diffusion models. Due to the redundancy between video frames, and inter-frame attention mechanism in video diffusion models, simply applying image-based protection methods separately to every video frame can not shield video from unauthorized editing. To tackle the above challenge, we adopt joint frame optimization, treating all video frames as an optimization entity. Furthermore, we extract video motion information and fuse it into optimization objectives. Thus, these alterations can effectively force the models to produce outputs that are implausible and inconsistent. We provide a pipeline to optimize this perturbation. Finally, we use both objective metrics and subjective metrics to demonstrate the efficacy of our method, and the results show that the protection performance of VideoGuard is superior to all the baseline methods. 随着生成技术的快速发展，当前的生成模型能够生成高保真数字内容并以可控的方式进行编辑。然而，存在恶意个人可能滥用这些能力进行误导性活动的风险。尽管现有研究尝试保护照片图像免受生成模型的操控，但在视频内容编辑的保护方面仍存在显著差距。为弥补这一差距，我们提出了一种名为 VideoGuard 的保护方法，能够有效防止视频被未经授权的恶意编辑。该保护通过微妙地引入几乎不可察觉的扰动来实现，这些扰动干扰了目标生成扩散模型的功能。由于视频帧之间的冗余性以及视频扩散模型中的帧间注意力机制，简单地对每个视频帧分别应用基于图像的保护方法无法防止视频被未经授权的编辑。为应对上述挑战，我们采用联合帧优化，将所有视频帧视为一个优化实体。此外，我们提取视频运动信息并将其融合到优化目标中。因此，这些改动能够有效地迫使模型产生不合理且不一致的输出。我们提供了一个优化该扰动的流程。最后，我们使用客观指标和主观指标来展示我们方法的有效性，结果表明 VideoGuard 的保护性能优于所有基线方法。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-05 14:13:31 UTC 发布时间：2025-08-05 14:13:31 UTC

#77 fact check AI at SemEval-2025 Task 7: Multilingual and Crosslingual Fact-checked Claim Retrieval #77 SemEval-2025 任务 7 的事实核查 AI：多语言和跨语言事实核查声明检索

Author: [Pranshu Rastogi](https://arxiv.org/search/?searchtype=author&query=Pranshu Rastogi) 作者：Pranshu Rastogi

Subjects: Computation and Language, Artificial Intelligence, Information Retrieval 主题：计算与语言，人工智能，信息检索

Publish: 2025-08-05 14:10:09 UTC 发布时间：2025-08-05 14:10:09 UTC

#78 SonicMaster: Towards Controllable All-in-One Music Restoration and Mastering #78 SonicMaster：迈向可控的一体化音乐修复与母带处理

Authors: [Jan Melechovsky](https://arxiv.org/search/?searchtype=author&query=Jan Melechovsky), [Ambuj Mehrish](https://arxiv.org/search/?searchtype=author&query=Ambuj Mehrish), [Dorien Herremans](https://arxiv.org/search/?searchtype=author&query=Dorien Herremans) 作者：Jan Melechovsky，Ambuj Mehrish，Dorien Herremans

Music recordings often suffer from audio quality issues such as excessive reverberation, distortion, clipping, tonal imbalances, and a narrowed stereo image, especially when created in non-professional settings without specialized equipment or expertise. These problems are typically corrected using separate specialized tools and manual adjustments. In this paper, we introduce SonicMaster, the first unified generative model for music restoration and mastering that addresses a broad spectrum of audio artifacts with text-based control. SonicMaster is conditioned on natural language instructions to apply targeted enhancements, or can operate in an automatic mode for general restoration. To train this model, we construct the SonicMaster dataset, a large dataset of paired degraded and high-quality tracks by simulating common degradation types with nineteen degradation functions belonging to five enhancements groups: equalization, dynamics, reverb, amplitude, and stereo. Our approach leverages a flow-matching generative training paradigm to learn an audio transformation that maps degraded inputs to their cleaned, mastered versions guided by text prompts. Objective audio quality metrics demonstrate that SonicMaster significantly improves sound quality across all artifact categories. Furthermore, subjective listening tests confirm that listeners prefer SonicMaster’s enhanced outputs over the original degraded audio, highlighting the effectiveness of our unified approach. 音乐录音常常存在音质问题，如过度混响、失真、削波、音色不平衡以及立体声图像变窄，尤其是在非专业环境中制作且缺乏专业设备或技术时。这些问题通常通过使用不同的专业工具和手动调整来修正。本文介绍了 SonicMaster，这是首个统一的生成模型，用于音乐修复和母带处理，能够通过基于文本的控制解决广泛的音频瑕疵。SonicMaster 以自然语言指令为条件，执行针对性的增强，也可以在自动模式下进行一般修复。为了训练该模型，我们构建了 SonicMaster 数据集，这是一个包含配对的降质和高质量音轨的大型数据集，通过模拟常见的降质类型，使用属于五个增强组的十九种降质函数：均衡、动态、混响、振幅和立体声。我们的方法利用流匹配生成训练范式，学习一种音频变换，将降质输入映射为其清晰、母带处理后的版本，并由文本提示引导。客观音频质量指标表明，SonicMaster 在所有伪影类别中显著提升了音质。此外，主观听觉测试也证实，听众更喜欢 SonicMaster 增强后的输出音频，而非原始的降质音频，这凸显了我们统一方法的有效性。

Subjects: Sound, Artificial Intelligence, Multimedia, Audio and Speech Processing 主题：声音，人工智能，多媒体，音频与语音处理

Publish: 2025-08-05 13:49:04 UTC 发布时间：2025-08-05 13:49:04 UTC

#79 LLMs Have a Heart of Stone: Demystifying the Soft Thinking Ability of Large Reasoning Models 大语言模型心如铁石：揭秘大型推理模型的软性思维能力

人类认知自然地处理抽象且流动的概念，而现有的推理模型通常依赖于生成离散的标记，这可能限制了它们的表达能力。近期的进展旨在通过使大型语言模型（LLMs）生成软性、抽象的标记，从而促进在连续概念空间中的推理，来解决这一限制。本文通过一系列探测技术，考察了各种 LLMs 的“软思维”能力，分析模型的内部行为。与普遍认为软思维能够同时探索多条推理路径的观点相反，我们的研究发现，LLMs 在后续解码步骤中主要依赖软输入中最具影响力的成分。这种依赖阻碍了不同推理路径的探索，使得普通的软思维退化为一种贪婪解码，掩盖了通过软标记传递更多信息的优势。为了解决这一问题，我们探索了引入\emph{随机性}的采样策略，采用了 Dirichlet 重采样和 Gumbel-Softmax 技巧等方法。我们的实验表明，加入随机性可以缓解传统方法的局限性，释放软思维的潜力。值得注意的是，Gumbel-Softmax 技巧在保持平滑性的同时提供了足够的随机性，在八个推理基准测试中表现出色。

发布时间：2025-08-05 13:38:33 UTC

#80 Spatial Imputation Drives Cross-Domain Alignment for EEG Classification #80 空间插补推动脑电分类的跨域对齐

Authors: [Hongjun Liu](https://arxiv.org/search/?searchtype=author&query=Hongjun Liu), [Chao Yao](https://arxiv.org/search/?searchtype=author&query=Chao Yao), [Yalan Zhang](https://arxiv.org/search/?searchtype=author&query=Yalan Zhang), [Xiaokun wang](https://arxiv.org/search/?searchtype=author&query=Xiaokun wang), [Xiaojuan Ban](https://arxiv.org/search/?searchtype=author&query=Xiaojuan Ban) 作者：刘洪军，姚超，张雅兰，王晓坤，班晓娟

Electroencephalogram (EEG) signal classification faces significant challenges due to data distribution shifts caused by heterogeneous electrode configurations, acquisition protocols, and hardware discrepancies across domains. This paper introduces IMAC, a novel channel-dependent mask and imputation self-supervised framework that formulates the alignment of cross-domain EEG data shifts as a spatial time series imputation task. To address heterogeneous electrode configurations in cross-domain scenarios, IMAC first standardizes different electrode layouts using a 3D-to-2D positional unification mapping strategy, establishing unified spatial representations. Unlike previous mask-based self-supervised representation learning methods, IMAC introduces spatio-temporal signal alignment. This involves constructing a channel-dependent mask and reconstruction task framed as a low-to-high resolution EEG spatial imputation problem. Consequently, this approach simulates cross-domain variations such as channel omissions and temporal instabilities, thus enabling the model to leverage the proposed imputer for robust signal alignment during inference. Furthermore, IMAC incorporates a disentangled structure that separately models the temporal and spatial information of the EEG signals separately, reducing computational complexity while enhancing flexibility and adaptability. Comprehensive evaluations across 10 publicly available EEG datasets demonstrate IMAC’s superior performance, achieving state-of-the-art classification accuracy in both cross-subject and cross-center validation scenarios. Notably, IMAC shows strong robustness under both simulated and real-world distribution shifts, surpassing baseline methods by up to 35% in integrity scores while maintaining consistent classification accuracy. 脑电图（EEG）信号分类面临着由于异构电极配置、采集协议和硬件差异导致的跨域数据分布变化的重大挑战。本文提出了 IMAC，一种新颖的基于通道的掩码与插补自监督框架，将跨域 EEG 数据变化的对齐问题表述为空间时间序列插补任务。为了解决跨域场景中的异构电极配置问题，IMAC 首先采用 3D 到 2D 位置统一映射策略标准化不同的电极布局，建立统一的空间表示。不同于以往基于掩码的自监督表示学习方法，IMAC 引入了时空信号对齐，具体包括构建基于通道的掩码和重建任务，将其框定为低分辨率到高分辨率的 EEG 空间插补问题。因此，该方法模拟了跨域变化，如通道缺失和时间不稳定性，从而使模型能够在推理阶段利用所提出的插补器实现稳健的信号对齐。此外，IMAC 采用了解耦结构，分别对脑电信号的时间和空间信息进行建模，既降低了计算复杂度，又增强了灵活性和适应性。在 10 个公开可用的脑电数据集上的全面评估表明，IMAC 表现优异，在跨受试者和跨中心验证场景中均实现了最先进的分类准确率。值得注意的是，IMAC 在模拟和真实世界的分布变化下表现出强大的鲁棒性，完整性评分比基线方法高出最多 35 %，同时保持了稳定的分类准确率。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-05 13:28:05 UTC 发布时间：2025-08-05 13:28:05 UTC

#81 The Science Fiction Science Method #81 科幻科学方法

Authors: [Iyad Rahwan](https://arxiv.org/search/?searchtype=author&query=Iyad Rahwan), [Azim Shariff](https://arxiv.org/search/?searchtype=author&query=Azim Shariff), [Jean-François Bonnefon](https://arxiv.org/search/?searchtype=author&query=Jean-François Bonnefon) 作者：Iyad Rahwan，Azim Shariff，Jean-François Bonnefon

Predicting the social and behavioral impact of future technologies, before they are achieved, would allow us to guide their development and regulation before these impacts get entrenched. Traditionally, this prediction has relied on qualitative, narrative methods. Here we describe a method which uses experimental methods to simulate future technologies, and collect quantitative measures of the attitudes and behaviors of participants assigned to controlled variations of the future. We call this method ‘science fiction science’. We suggest that the reason why this method has not been fully embraced yet, despite its potential benefits, is that experimental scientists may be reluctant to engage in work facing such serious validity threats as science fiction science. To address these threats, we consider possible constraints on the kind of technology that science fiction science may study, as well as the unconventional, immersive methods that science fiction science may require. We seek to provide perspective on the reasons why this method has been marginalized for so long, what benefits it would bring if it could be built on strong yet unusual methods, and how we can normalize these methods to help the diverse community of science fiction scientists to engage in a virtuous cycle of validity improvement. 在未来技术实现之前预测其社会和行为影响，可以使我们在这些影响根深蒂固之前引导其发展和监管。传统上，这种预测依赖于定性、叙述性的方法。本文描述了一种利用实验方法模拟未来技术，并收集分配到未来受控变体的参与者态度和行为的定量测量的方法。我们称这种方法为“科幻科学”。我们认为，尽管该方法具有潜在优势，但实验科学家可能因担忧科幻科学面临的严重有效性威胁而不愿全面采纳该方法。为应对这些威胁，我们考虑了科幻科学可能研究的技术类型的限制，以及科幻科学可能需要的非常规沉浸式方法。我们旨在阐明该方法长期被边缘化的原因，如果能够建立在强大但不寻常的方法之上，它将带来哪些好处，以及我们如何规范这些方法，以帮助多元化的科幻科学家社区参与有效性提升的良性循环。

Subjects: Human-Computer Interaction, Artificial Intelligence 主题：人机交互，人工智能

Publish: 2025-08-05 13:20:12 UTC 发布时间：2025-08-05 13:20:12 UTC

Authors: [Futian Wang](https://arxiv.org/search/?searchtype=author&query=Futian Wang), [Yuhan Qiao](https://arxiv.org/search/?searchtype=author&query=Yuhan Qiao), [Xiao Wang](https://arxiv.org/search/?searchtype=author&query=Xiao Wang), [Fuling Wang](https://arxiv.org/search/?searchtype=author&query=Fuling Wang), [Yuxiang Zhang](https://arxiv.org/search/?searchtype=author&query=Yuxiang Zhang), [Dengdi Sun](https://arxiv.org/search/?searchtype=author&query=Dengdi Sun) 作者：王福田，乔宇涵，王晓，王福灵，张宇翔，孙登迪

X-ray medical report generation is one of the important applications of artificial intelligence in healthcare. With the support of large foundation models, the quality of medical report generation has significantly improved. However, challenges such as hallucination and weak disease diagnostic capability still persist. In this paper, we first construct a large-scale multi-modal medical knowledge graph (termed M3KG) based on the ground truth medical report using the GPT-4o. It contains 2477 entities, 3 kinds of relations, 37424 triples, and 6943 disease-aware vision tokens for the CheXpert Plus dataset. Then, we sample it to obtain multi-granularity semantic graphs and use an R-GCN encoder for feature extraction. For the input X-ray image, we adopt the Swin-Transformer to extract the vision features and interact with the knowledge using cross-attention. The vision tokens are fed into a Q-former and retrieved the disease-aware vision tokens using another cross-attention. Finally, we adopt the large language model to map the semantic knowledge graph, input X-ray image, and disease-aware vision tokens into language descriptions. Extensive experiments on multiple datasets fully validated the effectiveness of our proposed knowledge graph and X-ray report generation framework. The source code of this paper will be released on https://github.com/Event-AHU/Medical_Image_Analysis. X 射线医学报告生成是人工智能在医疗领域的重要应用之一。在大型基础模型的支持下，医学报告生成的质量显著提升。然而，幻觉现象和较弱的疾病诊断能力等挑战仍然存在。本文首先基于真实医学报告，利用 GPT-4o 构建了一个大规模多模态医学知识图谱（称为 M3KG）。该图谱包含 2477 个实体、3 种关系、37424 个三元组，以及针对 CheXpert Plus 数据集的 6943 个疾病感知视觉标记。随后，我们对其进行采样，获得多粒度语义图，并使用 R-GCN 编码器进行特征提取。对于输入的 X 射线图像，我们采用 Swin-Transformer 提取视觉特征，并通过交叉注意力与知识进行交互。视觉标记被输入到 Q-former 中，并通过另一交叉注意力检索疾病感知视觉标记。最后，我们采用大型语言模型将语义知识图谱、输入的 X 射线图像和疾病感知视觉标记映射为语言描述。在多个数据集上的大量实验充分验证了我们所提出的知识图谱和 X 光报告生成框架的有效性。本文的源代码将发布在 https://github.com/Event-AHU/Medical_Image_Analysis。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning 主题：计算机视觉与模式识别，人工智能，机器学习

Publish: 2025-08-05 13:13:45 UTC 发布时间：2025-08-05 13:13:45 UTC

#83 Learning Latent Representations for Image Translation using Frequency Distributed CycleGAN #83 使用频率分布 CycleGAN 学习图像翻译的潜在表示

Authors: [Shivangi Nigam](https://arxiv.org/search/?searchtype=author&query=Shivangi Nigam), [Adarsh Prasad Behera](https://arxiv.org/search/?searchtype=author&query=Adarsh Prasad Behera), [Shekhar Verma](https://arxiv.org/search/?searchtype=author&query=Shekhar Verma), [P. Nagabhushan](https://arxiv.org/search/?searchtype=author&query=P. Nagabhushan) 作者：Shivangi Nigam，Adarsh Prasad Behera，Shekhar Verma，P. Nagabhushan

This paper presents Fd-CycleGAN, an image-to-image (I2I) translation framework that enhances latent representation learning to approximate real data distributions. Building upon the foundation of CycleGAN, our approach integrates Local Neighborhood Encoding (LNE) and frequency-aware supervision to capture fine-grained local pixel semantics while preserving structural coherence from the source domain. We employ distribution-based loss metrics, including KL/JS divergence and log-based similarity measures, to explicitly quantify the alignment between real and generated image distributions in both spatial and frequency domains. To validate the efficacy of Fd-CycleGAN, we conduct experiments on diverse datasets – Horse2Zebra, Monet2Photo, and a synthetically augmented Strike-off dataset. Compared to baseline CycleGAN and other state-of-the-art methods, our approach demonstrates superior perceptual quality, faster convergence, and improved mode diversity, particularly in low-data regimes. By effectively capturing local and global distribution characteristics, Fd-CycleGAN achieves more visually coherent and semantically consistent translations. Our results suggest that frequency-guided latent learning significantly improves generalization in image translation tasks, with promising applications in document restoration, artistic style transfer, and medical image synthesis. We also provide comparative insights with diffusion-based generative models, highlighting the advantages of our lightweight adversarial approach in terms of training efficiency and qualitative output. 本文提出了 Fd-CycleGAN，一种图像到图像（I2I）转换框架，通过增强潜在表示学习来逼近真实数据分布。在 CycleGAN 的基础上，我们的方法整合了局部邻域编码（LNE）和频率感知监督，以捕捉细粒度的局部像素语义，同时保持源域的结构连贯性。我们采用基于分布的损失度量，包括 KL/JS 散度和基于对数的相似性度量，明确量化真实图像与生成图像在空间和频率域的分布对齐。为了验证 Fd-CycleGAN 的有效性，我们在多个数据集上进行了实验——Horse2Zebra、Monet2Photo 以及合成增强的 Strike-off 数据集。与基线 CycleGAN 及其他最先进方法相比，我们的方法表现出更优的感知质量、更快的收敛速度和更好的模式多样性，尤其在数据量较少的情况下。通过有效捕捉局部和全局的分布特征，Fd-CycleGAN 实现了更具视觉连贯性和语义一致性的转换。我们的结果表明，基于频率引导的潜在学习显著提升了图像翻译任务中的泛化能力，在文档修复、艺术风格迁移和医学图像合成等领域具有广阔的应用前景。我们还提供了与基于扩散的生成模型的比较分析，突出展示了我们轻量级对抗方法在训练效率和输出质量方面的优势。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Graphics 主题：计算机视觉与模式识别，人工智能，图形学

Publish: 2025-08-05 12:59:37 UTC 发布时间：2025-08-05 12:59:37 UTC

#84 SlotMatch: Distilling Temporally Consistent Object-Centric Representations for Unsupervised Video Segmentation #84 SlotMatch：蒸馏时序一致的面向对象表示以实现无监督视频分割

Authors: [Diana-Nicoleta Grigore](https://arxiv.org/search/?searchtype=author&query=Diana-Nicoleta Grigore), [Neelu Madan](https://arxiv.org/search/?searchtype=author&query=Neelu Madan), [Andreas Mogelmose](https://arxiv.org/search/?searchtype=author&query=Andreas Mogelmose), [Thomas B. Moeslund](https://arxiv.org/search/?searchtype=author&query=Thomas B. Moeslund), [Radu Tudor Ionescu](https://arxiv.org/search/?searchtype=author&query=Radu Tudor Ionescu) 作者：Diana-Nicoleta Grigore、Neelu Madan、Andreas Mogelmose、Thomas B. Moeslund、Radu Tudor Ionescu

Unsupervised video segmentation is a challenging computer vision task, especially due to the lack of supervisory signals coupled with the complexity of visual scenes. To overcome this challenge, state-of-the-art models based on slot attention often have to rely on large and computationally expensive neural architectures. To this end, we propose a simple knowledge distillation framework that effectively transfers object-centric representations to a lightweight student. The proposed framework, called SlotMatch, aligns corresponding teacher and student slots via the cosine similarity, requiring no additional distillation objectives or auxiliary supervision. The simplicity of SlotMatch is confirmed via theoretical and empirical evidence, both indicating that integrating additional losses is redundant. We conduct experiments on two datasets to compare the state-of-the-art teacher model, SlotContrast, with our distilled student. The results show that our student based on SlotMatch matches and even outperforms its teacher, while using 3.6x less parameters and running 1.9x faster. Moreover, our student surpasses previous unsupervised video segmentation models. 无监督视频分割是一项具有挑战性的计算机视觉任务，尤其由于缺乏监督信号以及视觉场景的复杂性。为克服这一挑战，基于槽注意力的最先进模型通常需要依赖大型且计算量巨大的神经网络架构。为此，我们提出了一个简单的知识蒸馏框架，有效地将以对象为中心的表示传递给轻量级学生模型。该框架称为 SlotMatch，通过余弦相似度对齐对应的教师和学生槽，无需额外的蒸馏目标或辅助监督。SlotMatch 的简洁性通过理论和实证证据得到确认，均表明整合额外损失是多余的。我们在两个数据集上进行了实验，将最先进的教师模型 SlotContrast 与我们的蒸馏学生模型进行了比较。结果显示，基于 SlotMatch 的学生模型不仅匹配甚至超越了教师模型，同时参数量减少了 3.6 倍，运行速度提升了 1.9 倍。此外，我们的学生模型也超越了之前的无监督视频分割模型。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-05 12:58:09 UTC 发布时间：2025-08-05 12:58:09 UTC

#85 Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling #85 视觉文档理解与问答：一种具有测试时扩展的多智能体协作框架

Existing vision-language models (VLMs), whether generalists or specialists, remain constrained by their parameter scale, lack robust self-correction capabilities, and underperform in tasks involving long visual contexts and complex reasoning, resulting in suboptimal performance on document-based tasks. To address this, we propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling, tailored for visual document understanding and visual question answering (VQA). It comprises four distinct small-scale agents, i.e., planning, execution, judgment, and answer agents, with clearly defined roles and effective collaboration. Notably, the judgment agent exclusively verifies correctness and redirects to prior agents for revisions, outperforming conventional correction strategies. To further expand the capability boundaries of the framework, we propose mixed reward modeling that balances agent-specific abilities and global collaboration, as well as agent-wise hybrid test-time scaling, which customizes different scaling strategies for each agent based on their functions. Evaluated on benchmarks spanning both document-based and non-document-based settings, our MACT shows superior performance with a smaller parameter scale without sacrificing the ability of general and mathematical tasks. Especially, it stands out in benchmarks involving long visual contexts and complicated reasoning. The three variants of MACT consistently hold the top three positions in average scores, leading in 13 of the 15 benchmarks. Code will be available at: https://github.com/YU-deep/MACT.git. 现有的视觉-语言模型（VLMs），无论是通用型还是专用型，仍受限于参数规模，缺乏强大的自我纠错能力，在涉及长视觉上下文和复杂推理的任务中表现不佳，导致在基于文档的任务上性能不理想。为此，我们提出了 MACT，一种具有测试时扩展能力的多智能体协作框架，专为视觉文档理解和视觉问答（VQA）设计。该框架包含四个不同的小规模智能体，即规划、执行、判断和回答智能体，角色明确且协作高效。值得注意的是，判断智能体专门负责验证正确性，并将需要修改的部分重定向回先前的智能体，表现优于传统的纠错策略。为了进一步扩展框架的能力边界，我们提出了混合奖励建模，平衡智能体的特定能力与整体协作，以及基于智能体功能定制不同扩展策略的智能体级混合测试时扩展。在涵盖基于文档和非基于文档设置的基准测试中评估，我们的 MACT 在参数规模更小的情况下表现出卓越的性能，同时不牺牲通用和数学任务的能力。特别是在涉及长视觉上下文和复杂推理的基准测试中表现突出。MACT 的三个变体在平均分数上始终占据前三名，在 15 个基准测试中领先 13 个。代码将发布于：https://github.com/YU-deep/MACT.git。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-05 12:52:09 UTC 发布时间：2025-08-05 12:52:09 UTC

#86 SCFlow: Implicitly Learning Style and Content Disentanglement with Flow Models #86 SCFlow：使用流模型隐式学习风格和内容解耦

Authors: [Pingchuan Ma](https://arxiv.org/search/?searchtype=author&query=Pingchuan Ma), [Xiaopei Yang](https://arxiv.org/search/?searchtype=author&query=Xiaopei Yang), [Yusong Li](https://arxiv.org/search/?searchtype=author&query=Yusong Li), [Ming Gui](https://arxiv.org/search/?searchtype=author&query=Ming Gui), [Felix Krause](https://arxiv.org/search/?searchtype=author&query=Felix Krause), [Johannes Schusterbauer](https://arxiv.org/search/?searchtype=author&query=Johannes Schusterbauer), [Björn Ommer](https://arxiv.org/search/?searchtype=author&query=Björn Ommer) 作者：马平川，杨晓培，李宇松，桂明，Felix Krause，Johannes Schusterbauer，Björn Ommer

Explicitly disentangling style and content in vision models remains challenging due to their semantic overlap and the subjectivity of human perception. Existing methods propose separation through generative or discriminative objectives, but they still face the inherent ambiguity of disentangling intertwined concepts. Instead, we ask: Can we bypass explicit disentanglement by learning to merge style and content invertibly, allowing separation to emerge naturally? We propose SCFlow, a flow-matching framework that learns bidirectional mappings between entangled and disentangled representations. Our approach is built upon three key insights: 1) Training solely to merge style and content, a well-defined task, enables invertible disentanglement without explicit supervision; 2) flow matching bridges on arbitrary distributions, avoiding the restrictive Gaussian priors of diffusion models and normalizing flows; and 3) a synthetic dataset of 510,000 samples (51 styles × 10,000 content samples) was curated to simulate disentanglement through systematic style-content pairing. Beyond controllable generation tasks, we demonstrate that SCFlow generalizes to ImageNet-1k and WikiArt in zero-shot settings and achieves competitive performance, highlighting that disentanglement naturally emerges from the invertible merging process. 在视觉模型中明确区分风格和内容仍然具有挑战性，因为它们在语义上存在重叠且受人类感知的主观影响。现有方法通过生成式或判别式目标来实现分离，但仍面临解开交织概念的固有模糊性。相反，我们提出：是否可以通过学习可逆地融合风格和内容来绕过显式解缠，让分离自然出现？我们提出了 SCFlow，一种流匹配框架，学习交织表示与解缠表示之间的双向映射。我们的方法基于三个关键见解：1）仅训练融合风格和内容这一明确定义的任务，能够实现无需显式监督的可逆解缠；2）流匹配可在任意分布上架桥，避免了扩散模型和归一化流中限制性的高斯先验；3）我们策划了一个包含 51 万样本的合成数据集（51 种风格 × 10,000 个内容样本），通过系统的风格-内容配对模拟了解缠过程。除了可控生成任务外，我们还展示了 SCFlow 在零样本设置下对 ImageNet-1k 和 WikiArt 的泛化能力，并取得了具有竞争力的表现，突显了可逆合并过程中自然产生的解耦特性。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning 主题：计算机视觉与模式识别，人工智能，机器学习

Publish: 2025-08-05 12:50:46 UTC 发布时间：2025-08-05 12:50:46 UTC

#87 Agentic AI in 6G Software Businesses: A Layered Maturity Model #87 6G 软件业务中的自主智能体 AI：分层成熟度模型

Authors: [Muhammad Zohaib](https://arxiv.org/search/?searchtype=author&query=Muhammad Zohaib), [Muhammad Azeem Akbar](https://arxiv.org/search/?searchtype=author&query=Muhammad Azeem Akbar), [Sami Hyrynsalmi](https://arxiv.org/search/?searchtype=author&query=Sami Hyrynsalmi), [Arif Ali Khan](https://arxiv.org/search/?searchtype=author&query=Arif Ali Khan) 作者：Muhammad Zohaib, Muhammad Azeem Akbar, Sami Hyrynsalmi, Arif Ali Khan

The emergence of agentic AI systems in 6G software businesses presents both strategic opportunities and significant challenges. While such systems promise increased autonomy, scalability, and intelligent decision-making across distributed environments, their adoption raises concerns regarding technical immaturity, integration complexity, organizational readiness, and performance-cost trade-offs. In this study, we conducted a preliminary thematic mapping to identify factors influencing the adoption of agentic software within the context of 6G. Drawing on a multivocal literature review and targeted scanning, we identified 29 motivators and 27 demotivators, which were further categorized into five high-level themes in each group. This thematic mapping offers a structured overview of the enabling and inhibiting forces shaping organizational readiness for agentic transformation. Positioned as a feasibility assessment, the study represents an early phase of a broader research initiative aimed at developing and validating a layered maturity model grounded in CMMI model with the software architectural three dimensions possibly Data, Business Logic, and Presentation. Ultimately, this work seeks to provide a practical framework to help software-driven organizations assess, structure, and advance their agent-first capabilities in alignment with the demands of 6G. 在 6G 软件业务中，具备代理能力的人工智能系统的出现既带来了战略机遇，也带来了重大挑战。虽然此类系统承诺在分布式环境中实现更高的自主性、可扩展性和智能决策，但其采用也引发了关于技术不成熟、集成复杂性、组织准备度以及性能与成本权衡的担忧。在本研究中，我们进行了初步的主题映射，以识别影响 6G 背景下代理软件采用的因素。通过多声部文献综述和有针对性的扫描，我们识别了 29 个推动因素和 27 个阻碍因素，并将其进一步归类为每组五个高级主题。该主题映射提供了一个结构化的概览，展示了塑造组织代理转型准备度的促进和抑制力量。作为一项可行性评估，该研究代表了一个更广泛研究计划的早期阶段，旨在开发和验证一个基于 CMMI 模型的分层成熟度模型，该模型可能涵盖软件架构的三个维度：数据、业务逻辑和表现层。最终，本研究旨在提供一个实用框架，帮助以软件为驱动的组织评估、构建并推进其以智能体为先的能力，以符合 6G 的需求。

Subjects: Software Engineering, Artificial Intelligence 主题：软件工程，人工智能

Publish: 2025-08-05 12:42:46 UTC 发布时间：2025-08-05 12:42:46 UTC

#88 When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs #88 当良性声音变成对抗：用良性输入破解音频语言模型

As large language models become increasingly integrated into daily life, audio has emerged as a key interface for human-AI interaction. However, this convenience also introduces new vulnerabilities, making audio a potential attack surface for adversaries. Our research introduces WhisperInject, a two-stage adversarial audio attack framework that can manipulate state-of-the-art audio language models to generate harmful content. Our method uses imperceptible perturbations in audio inputs that remain benign to human listeners. The first stage uses a novel reward-based optimization method, Reinforcement Learning with Projected Gradient Descent (RL-PGD), to guide the target model to circumvent its own safety protocols and generate harmful native responses. This native harmful response then serves as the target for Stage 2, Payload Injection, where we use Projected Gradient Descent (PGD) to optimize subtle perturbations that are embedded into benign audio carriers, such as weather queries or greeting messages. Validated under the rigorous StrongREJECT, LlamaGuard, as well as Human Evaluation safety evaluation framework, our experiments demonstrate a success rate exceeding 86% across Qwen2.5-Omni-3B, Qwen2.5-Omni-7B, and Phi-4-Multimodal. Our work demonstrates a new class of practical, audio-native threats, moving beyond theoretical exploits to reveal a feasible and covert method for manipulating AI behavior. 随着大型语言模型日益融入日常生活，音频已成为人机交互的关键接口。然而，这种便利性也带来了新的脆弱性，使音频成为潜在的攻击面。我们的研究提出了 WhisperInject，一种两阶段的对抗性音频攻击框架，能够操控最先进的音频语言模型生成有害内容。我们的方法在音频输入中使用人耳难以察觉的扰动，这些扰动对人类听众来说是无害的。第一阶段采用一种新颖的基于奖励的优化方法——带投影梯度下降的强化学习（RL-PGD），引导目标模型绕过自身的安全协议，生成有害的原生响应。该原生有害响应随后作为第二阶段“载荷注入”的目标，在此阶段我们使用投影梯度下降（PGD）优化细微扰动，将其嵌入到无害的音频载体中，如天气查询或问候消息。在严格的 StrongREJECT、LlamaGuard 以及人工评估安全评估框架下验证，我们的实验表明，在 Qwen2.5-Omni-3B、Qwen2.5-Omni-7B 和 Phi-4-Multimodal 上成功率超过 86%。我们的工作展示了一类新的实用音频原生威胁，超越了理论上的利用，揭示了一种可行且隐蔽的操控 AI 行为的方法。

Subjects: Sound, Artificial Intelligence, Cryptography and Security, Audio and Speech Processing 主题：声音，人工智能，密码学与安全，音频与语音处理

Publish: 2025-08-05 12:14:01 UTC 发布时间：2025-08-05 12:14:01 UTC

#89 VLMQ: Efficient Post-Training Quantization for Large Vision-Language Models via Hessian Augmentation #89 VLMQ：通过 Hessian 增强实现大规模视觉语言模型的高效后训练量化

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language 主题：计算机视觉与模式识别，人工智能，计算与语言

Publish: 2025-08-05 11:57:03 UTC 发布时间：2025-08-05 11:57:03 UTC

#90 From Legacy to Standard: LLM-Assisted Transformation of Cybersecurity Playbooks into CACAO Format #90 从传统到标准：LLM 辅助将网络安全剧本转换为 CACAO 格式

Authors: [Mehdi Akbari Gurabi](https://arxiv.org/search/?searchtype=author&query=Mehdi Akbari Gurabi), [Lasse Nitz](https://arxiv.org/search/?searchtype=author&query=Lasse Nitz), [Radu-Mihai Castravet](https://arxiv.org/search/?searchtype=author&query=Radu-Mihai Castravet), [Roman Matzutt](https://arxiv.org/search/?searchtype=author&query=Roman Matzutt), [Avikarsha Mandal](https://arxiv.org/search/?searchtype=author&query=Avikarsha Mandal), [Stefan Decker](https://arxiv.org/search/?searchtype=author&query=Stefan Decker) 作者：Mehdi Akbari Gurabi, Lasse Nitz, Radu-Mihai Castravet, Roman Matzutt, Avikarsha Mandal, Stefan Decker

Existing cybersecurity playbooks are often written in heterogeneous, non-machine-readable formats, which limits their automation and interoperability across Security Orchestration, Automation, and Response platforms. This paper explores the suitability of Large Language Models, combined with Prompt Engineering, to automatically translate legacy incident response playbooks into the standardized, machine-readable CACAO format. We systematically examine various Prompt Engineering techniques and carefully design prompts aimed at maximizing syntactic accuracy and semantic fidelity for control flow preservation. Our modular transformation pipeline integrates a syntax checker to ensure syntactic correctness and features an iterative refinement mechanism that progressively reduces syntactic errors. We evaluate the proposed approach on a custom-generated dataset comprising diverse legacy playbooks paired with manually created CACAO references. The results demonstrate that our method significantly improves the accuracy of playbook transformation over baseline models, effectively captures complex workflow structures, and substantially reduces errors. It highlights the potential for practical deployment in automated cybersecurity playbook transformation tasks. 现有的网络安全操作手册通常采用异构的、非机器可读的格式编写，这限制了它们在安全编排、自动化和响应平台中的自动化和互操作性。本文探讨了结合提示工程的大型语言模型，自动将传统的事件响应操作手册转换为标准化、机器可读的 CACAO 格式的适用性。我们系统地研究了各种提示工程技术，并精心设计了旨在最大化语法准确性和语义保真度以保持控制流程的提示。我们的模块化转换流程集成了语法检查器以确保语法正确性，并具备迭代优化机制，逐步减少语法错误。我们在一个自定义生成的数据集上评估了所提方法，该数据集包含多样的传统操作手册及其手工创建的 CACAO 参考。结果表明，我们的方法显著提升了操作手册转换的准确性，较基线模型更有效地捕捉复杂的工作流结构，并大幅减少了错误。它突出了在自动化网络安全剧本转换任务中实际部署的潜力。

Subjects: Cryptography and Security, Artificial Intelligence 主题：密码学与安全，人工智能

Publish: 2025-08-05 11:43:54 UTC 发布时间：2025-08-05 11:43:54 UTC

#91 CTTS: Collective Test-Time Scaling #91 CTTS：集体测试时缩放

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-05 11:19:08 UTC 发布时间：2025-08-05 11:19:08 UTC

#92 Exploring Layer-wise Information Effectiveness for Post-Training Quantization in Small Language Models #92 探索小型语言模型中后训练量化的分层信息有效性

Large language models with billions of parameters are often over-provisioned: many layers contribute little unique information yet dominate the memory and energy footprint during inference. We present LieQ, a metric-driven post-training quantization framework that addresses the critical challenge of maintaining accuracy in sub-7B models under extreme low-bit compression. Our method introduces three complementary layer-wise diagnostics-Perplexity Drop, Representational Compactness, and Top-k Energy Gain -that reveal a canonical division of labour across layers, enabling automatic bit-width allocation without gradient updates. Unlike existing approaches that suffer severe accuracy degradation at 2-3 bits precision, LieQ achieves state-of-the-art compression-accuracy trade-offs: on Qwen3-4B, it recovers 95.9% of FP16 baseline performance at 2.05-bit quantization, outperforming GPTQ by 19.7% and AWQ by 18.1% on average across seven zero-shot reasoning tasks. Applied to LLaMA3.2-3B, LieQ maintains 98.2% of baseline accuracy at 2.07-bit precision while enabling 4x memory reduction, establishing new paradigms for deploying small language models on resource-constrained edge devices. 拥有数十亿参数的大型语言模型通常存在资源过度配置的问题：许多层贡献的信息有限，但在推理过程中却占据了大量的内存和能耗。我们提出了 LieQ，一种基于指标驱动的后训练量化框架，解决了在极低位宽压缩下保持亚 7B 模型准确性的关键挑战。我们的方法引入了三种互补的层级诊断指标——困惑度下降、表示紧凑性和 Top-k 能量增益——揭示了层间的典型分工，从而实现无需梯度更新的自动位宽分配。与现有在 2-3 位精度下准确率严重下降的方法不同，LieQ 实现了最先进的压缩-准确率权衡：在 Qwen3-4B 上，2.05 位量化恢复了 95.9%的 FP16 基线性能，平均在七个零样本推理任务中分别比 GPTQ 高出 19.7%和比 AWQ 高出 18.1%。应用于 LLaMA3.2-3B 时，LieQ 在 2.07 位精度下保持了 98.2%的基线准确率，同时实现了 4 倍的内存减少，为在资源受限的边缘设备上部署小型语言模型开辟了新范式。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-05 11:17:04 UTC 发布时间：2025-08-05 11:17:04 UTC

#93 Industrial LLM-based Code Optimization under Regulation: A Mixture-of-Agents Approach #93 基于 LLM 的工业代码优化与监管：一种多代理混合方法

Recent advancements in Large Language Models (LLMs) for code optimization have enabled industrial platforms to automate software performance engineering at unprecedented scale and speed. Yet, organizations in regulated industries face strict constraints on which LLMs they can use - many cannot utilize commercial models due to data privacy regulations and compliance requirements, creating a significant challenge for achieving high-quality code optimization while maintaining cost-effectiveness. We address this by implementing a Mixture-of-Agents (MoA) approach that directly synthesizes code from multiple specialized LLMs, comparing it against TurinTech AI’s vanilla Genetic Algorithm (GA)-based ensemble system and individual LLM optimizers using real-world industrial codebases. Our key contributions include: (1) First MoA application to industrial code optimization using real-world codebases; (2) Empirical evidence that MoA excels with open-source models, achieving 14.3% to 22.2% cost savings and 28.6% to 32.2% faster optimization times for regulated environments; (3) Deployment guidelines demonstrating GA’s advantage with commercial models while both ensembles outperform individual LLMs; and (4) Real-world validation across 50 code snippets and seven LLM combinations, generating over 8,700 variants, addresses gaps in industrial LLM ensemble evaluation. This provides actionable guidance for organizations balancing regulatory compliance with optimization performance in production environments. 近年来，针对代码优化的大型语言模型（LLMs）的进展，使工业平台能够以前所未有的规模和速度实现软件性能工程的自动化。然而，受监管行业的组织在使用 LLMs 时面临严格限制——许多组织由于数据隐私法规和合规要求，无法使用商业模型，这为在保持成本效益的同时实现高质量代码优化带来了重大挑战。我们通过实施一种多代理混合（MoA）方法来解决这一问题，该方法直接从多个专用 LLMs 合成代码，并将其与 TurinTech AI 基于遗传算法（GA）的基础集成系统及单一 LLM 优化器在真实工业代码库上的表现进行了比较。我们的主要贡献包括：（1）首次将 MoA 应用于使用真实代码库的工业代码优化；（2）实证证明 MoA 在开源模型中表现出色，在受监管环境中实现了 14.3%至 22.2%的成本节约和 28.6%至 32.2%的优化时间加速；（3）部署指南展示了 GA 在商业模型中的优势，同时两个集成方法均优于单一 LLM；（4）通过 50 个代码片段和七种 LLM 组合的真实验证，生成了超过 8,700 个变体，填补了工业 LLM 集成评估的空白。这为在生产环境中平衡合规性与优化性能的组织提供了可操作的指导。

Subjects: Software Engineering, Artificial Intelligence 主题：软件工程，人工智能

Publish: 2025-08-05 11:15:06 UTC 发布时间：2025-08-05 11:15:06 UTC

#94 BaroPoser: Real-time Human Motion Tracking from IMUs and Barometers in Everyday Devices #94 BaroPoser：基于 IMU 和气压计的日常设备实时人体动作追踪

Authors: [Libo Zhang](https://arxiv.org/search/?searchtype=author&query=Libo Zhang), [Xinyu Yi](https://arxiv.org/search/?searchtype=author&query=Xinyu Yi), [Feng Xu](https://arxiv.org/search/?searchtype=author&query=Feng Xu) 作者：张立波，易欣宇，徐峰

In recent years, tracking human motion using IMUs from everyday devices such as smartphones and smartwatches has gained increasing popularity. However, due to the sparsity of sensor measurements and the lack of datasets capturing human motion over uneven terrain, existing methods often struggle with pose estimation accuracy and are typically limited to recovering movements on flat terrain only. To this end, we present BaroPoser, the first method that combines IMU and barometric data recorded by a smartphone and a smartwatch to estimate human pose and global translation in real time. By leveraging barometric readings, we estimate sensor height changes, which provide valuable cues for both improving the accuracy of human pose estimation and predicting global translation on non-flat terrain. Furthermore, we propose a local thigh coordinate frame to disentangle local and global motion input for better pose representation learning. We evaluate our method on both public benchmark datasets and real-world recordings. Quantitative and qualitative results demonstrate that our approach outperforms the state-of-the-art (SOTA) methods that use IMUs only with the same hardware configuration. 近年来，利用智能手机和智能手表等日常设备中的惯性测量单元（IMU）进行人体运动追踪越来越受欢迎。然而，由于传感器测量数据稀疏且缺乏涵盖不平坦地形人体运动的数据集，现有方法在姿态估计准确性方面常常表现不佳，且通常仅限于恢复平坦地形上的运动。为此，我们提出了 BaroPoser，这是首个结合智能手机和智能手表记录的 IMU 与气压计数据，实时估计人体姿态和全局平移的方法。通过利用气压计读数，我们估计传感器高度变化，这为提高人体姿态估计的准确性和预测非平坦地形上的全局平移提供了宝贵线索。此外，我们提出了局部大腿坐标系，以解耦局部和全局运动输入，从而实现更好的姿态表示学习。我们在公共基准数据集和真实世界录制数据上评估了该方法。定量和定性结果表明，我们的方法在相同硬件配置下，优于仅使用 IMU 的最新（SOTA）方法。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-05 10:46:59 UTC 发布时间：2025-08-05 10:46:59 UTC

#95 Reliable Evaluation Protocol for Low-Precision Retrieval #95 低精度检索的可靠评估协议

Authors: [Kisu Yang](https://arxiv.org/search/?searchtype=author&query=Kisu Yang), [Yoonna Jang](https://arxiv.org/search/?searchtype=author&query=Yoonna Jang), [Hwanseok Jang](https://arxiv.org/search/?searchtype=author&query=Hwanseok Jang), [Kenneth Choi](https://arxiv.org/search/?searchtype=author&query=Kenneth Choi), [Isabelle Augenstein](https://arxiv.org/search/?searchtype=author&query=Isabelle Augenstein), [Heuiseok Lim](https://arxiv.org/search/?searchtype=author&query=Heuiseok Lim) 作者：Kisu Yang, Yoonna Jang, Hwanseok Jang, Kenneth Choi, Isabelle Augenstein, Heuiseok Lim

Subjects: Information Retrieval, Artificial Intelligence, Computation and Language 主题：信息检索，人工智能，计算与语言

Publish: 2025-08-05 10:27:57 UTC 发布时间：2025-08-05 10:27:57 UTC

#96 NLP Methods May Actually Be Better Than Professors at Estimating Question Difficulty #96 自然语言处理方法实际上可能比教授更擅长估计问题难度

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-05 10:12:38 UTC 发布时间：2025-08-05 10:12:38 UTC

#97 Investigating Gender Bias in LLM-Generated Stories via Psychological Stereotypes #97 通过心理刻板印象调查 LLM 生成故事中的性别偏见

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-05 10:10:26 UTC 发布时间：2025-08-05 10:10:26 UTC

#98 Artificial Intelligence and Generative Models for Materials Discovery – A Review #98 人工智能与生成模型在材料发现中的应用综述

Authors: [Albertus Denny Handoko](https://arxiv.org/search/?searchtype=author&query=Albertus Denny Handoko), [Riko I Made](https://arxiv.org/search/?searchtype=author&query=Riko I Made) 作者：Albertus Denny Handoko, Riko I Made

High throughput experimentation tools, machine learning (ML) methods, and open material databases are radically changing the way new materials are discovered. From the experimentally driven approach in the past, we are moving quickly towards the artificial intelligence (AI) driven approach, realizing the ‘inverse design’ capabilities that allow the discovery of new materials given the desired properties. This review aims to discuss different principles of AI-driven generative models that are applicable for materials discovery, including different materials representations available for this purpose. We will also highlight specific applications of generative models in designing new catalysts, semiconductors, polymers, or crystals while addressing challenges such as data scarcity, computational cost, interpretability, synthesizability, and dataset biases. Emerging approaches to overcome limitations and integrate AI with experimental workflows will be discussed, including multimodal models, physics informed architectures, and closed-loop discovery systems. This review aims to provide insights for researchers aiming to harness AI’s transformative potential in accelerating materials discovery for sustainability, healthcare, and energy innovation. 高通量实验工具、机器学习（ML）方法和开放材料数据库正在彻底改变新材料的发现方式。我们正从过去以实验为驱动的方法迅速转向以人工智能（AI）为驱动的方法，实现“逆向设计”能力，从而根据所需性能发现新材料。本文综述旨在讨论适用于材料发现的 AI 驱动生成模型的不同原理，包括为此目的提供的不同材料表示方法。我们还将重点介绍生成模型在设计新催化剂、半导体、高分子或晶体方面的具体应用，同时探讨数据稀缺、计算成本、可解释性、可合成性和数据集偏差等挑战。文中将讨论克服这些限制并将 AI 与实验工作流程整合的新兴方法，包括多模态模型、物理信息架构和闭环发现系统。本综述旨在为希望利用人工智能在加速材料发现以促进可持续发展、医疗保健和能源创新方面发挥变革潜力的研究人员提供见解。

Subjects: Materials Science, Artificial Intelligence, Applied Physics 主题：材料科学，人工智能，应用物理

Publish: 2025-08-05 09:56:27 UTC 发布时间：2025-08-05 09:56:27 UTC

#99 Pay What LLM Wants: Can LLM Simulate Economics Experiment with 522 Real-human Persona? #99 按 LLM 意愿支付：LLM 能否模拟包含 522 个真实人类角色的经济学实验？

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-05 09:37:37 UTC 发布时间：2025-08-05 09:37:37 UTC

#100 V.I.P. : Iterative Online Preference Distillation for Efficient Video Diffusion Models #100 V.I.P.：用于高效视频扩散模型的迭代在线偏好蒸馏

Authors: [Jisoo Kim](https://arxiv.org/search/?searchtype=author&query=Jisoo Kim), [Wooseok Seo](https://arxiv.org/search/?searchtype=author&query=Wooseok Seo), [Junwan Kim](https://arxiv.org/search/?searchtype=author&query=Junwan Kim), [Seungho Park](https://arxiv.org/search/?searchtype=author&query=Seungho Park), [Sooyeon Park](https://arxiv.org/search/?searchtype=author&query=Sooyeon Park), [Youngjae Yu](https://arxiv.org/search/?searchtype=author&query=Youngjae Yu) 作者：Jisoo Kim、Wooseok Seo、Junwan Kim、Seungho Park、Sooyeon Park、Youngjae Yu

With growing interest in deploying text-to-video (T2V) models in resource-constrained environments, reducing their high computational cost has become crucial, leading to extensive research on pruning and knowledge distillation methods while maintaining performance. However, existing distillation methods primarily rely on supervised fine-tuning (SFT), which often leads to mode collapse as pruned models with reduced capacity fail to directly match the teacher’s outputs, ultimately resulting in degraded quality. To address this challenge, we propose an effective distillation method, ReDPO, that integrates DPO and SFT. Our approach leverages DPO to guide the student model to focus on recovering only the targeted properties, rather than passively imitating the teacher, while also utilizing SFT to enhance overall performance. We additionally propose V.I.P., a novel framework for filtering and curating high-quality pair datasets, along with a step-by-step online approach for calibrated training. We validate our method on two leading T2V models, VideoCrafter2 and AnimateDiff, achieving parameter reduction of 36.2% and 67.5% each, while maintaining or even surpassing the performance of full models. Further experiments demonstrate the effectiveness of both ReDPO and V.I.P. framework in enabling efficient and high-quality video generation. Our code and videos are available at https://jiiiisoo.github.io/VIP.github.io/. 随着在资源受限环境中部署文本到视频（T2V）模型的兴趣日益增长，降低其高计算成本变得至关重要，这促使了在保持性能的同时对剪枝和知识蒸馏方法的广泛研究。然而，现有的蒸馏方法主要依赖于监督微调（SFT），这常常导致模式崩溃，因为容量减少的剪枝模型无法直接匹配教师模型的输出，最终导致质量下降。为了解决这一挑战，我们提出了一种有效的蒸馏方法 ReDPO，该方法整合了 DPO 和 SFT。我们的方法利用 DPO 引导学生模型专注于恢复目标属性，而不是被动模仿教师，同时利用 SFT 提升整体性能。我们还提出了 V.I.P.，一个用于筛选和策划高质量配对数据集的新框架，以及一个逐步的在线校准训练方法。我们在两个领先的文本到视频（T2V）模型 VideoCrafter2 和 AnimateDiff 上验证了我们的方法，分别实现了 36.2%和 67.5%的参数减少，同时保持甚至超越了完整模型的性能。进一步的实验表明，ReDPO 和 V.I.P.框架在实现高效且高质量的视频生成方面均表现出有效性。我们的代码和视频可在 https://jiiiisoo.github.io/VIP.github.io/ 获取。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-05 09:31:54 UTC 发布时间：2025-08-05 09:31:54 UTC

#101 Approximate Proportionality in Online Fair Division #101 在线公平分配中的近似比例性

Authors: [Davin Choo](https://arxiv.org/search/?searchtype=author&query=Davin Choo), [Winston Fu](https://arxiv.org/search/?searchtype=author&query=Winston Fu), [Derek Khu](https://arxiv.org/search/?searchtype=author&query=Derek Khu), [Tzeh Yuan Neoh](https://arxiv.org/search/?searchtype=author&query=Tzeh Yuan Neoh), [Tze-Yang Poon](https://arxiv.org/search/?searchtype=author&query=Tze-Yang Poon), [Nicholas Teh](https://arxiv.org/search/?searchtype=author&query=Nicholas Teh) 作者：Davin Choo, Winston Fu, Derek Khu, Tzeh Yuan Neoh, Tze-Yang Poon, Nicholas Teh

We study the online fair division problem, where indivisible goods arrive sequentially and must be allocated immediately and irrevocably to agents. Prior work has established strong impossibility results for approximating classic fairness notions, such as envy-freeness and maximin share fairness, in this setting. In contrast, we focus on proportionality up to one good (PROP1), a natural relaxation of proportionality whose approximability remains unresolved. We begin by showing that three natural greedy algorithms fail to guarantee any positive approximation to PROP1 in general, against an adaptive adversary. This is surprising because greedy algorithms are commonly used in fair division and a natural greedy algorithm is known to be able to achieve PROP1 under additional information assumptions. This hardness result motivates the study of non-adaptive adversaries and the use of side-information, in the spirit of learning-augmented algorithms. For non-adaptive adversaries, we show that the simple uniformly random allocation can achieve a meaningful PROP1 approximation with high probability. Meanwhile, we present an algorithm that obtain robust approximation ratios against PROP1 when given predictions of the maximum item value (MIV). Interestingly, we also show that stronger fairness notions such as EF1, MMS, and PROPX remain inapproximable even with perfect MIV predictions. 我们研究在线公平分配问题，其中不可分割的物品按顺序到达，必须立即且不可撤销地分配给代理。先前的研究已经针对该情境下经典公平性概念（如无嫉妒性和最大最小份额公平性）的近似性，给出了强不可能性结果。相比之下，我们关注的是“最多差一件物品的比例公平性”（PROP1），这是一种比例公平性的自然放宽，其可近似性尚未解决。我们首先展示了三种自然的贪心算法在一般情况下，面对自适应对手时，均无法保证对 PROP1 的任何正向近似。这一点令人惊讶，因为贪心算法在公平分配中被广泛使用，且已知在附加信息假设下，一种自然的贪心算法能够实现 PROP1。该困难结果促使我们研究非自适应对手以及利用辅助信息，借鉴学习增强算法的思路。对于非自适应对手，我们证明简单的均匀随机分配能够以高概率实现有意义的 PROP1 近似。同时，我们提出了一种算法，在给定最大物品价值（MIV）预测的情况下，能够获得针对 PROP1 的稳健近似比。有趣的是，我们还证明了即使有完美的 MIV 预测，更强的公平性概念如 EF1、MMS 和 PROPX 仍然无法近似。

Subjects: Computer Science and Game Theory, Artificial Intelligence, Multiagent Systems 主题：计算机科学与博弈论，人工智能，多智能体系统

Publish: 2025-08-05 09:31:42 UTC 发布时间：2025-08-05 09:31:42 UTC

#102 RooseBERT: A New Deal For Political Language Modelling #102 RooseBERT：政治语言建模的新方案

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-05 09:28:20 UTC 发布时间：2025-08-05 09:28:20 UTC

#103 CardiffNLP at CLEARS-2025: Prompting Large Language Models for Plain Language and Easy-to-Read Text Rewriting #103 CardiffNLP 在 CLEARS-2025：提示大型语言模型进行通俗易懂和易读文本重写

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-05 09:16:19 UTC 发布时间：2025-08-05 09:16:19 UTC

Authors: [Hikari Yanagawa](https://arxiv.org/search/?searchtype=author&query=Hikari Yanagawa), [Yuichi Hiroi](https://arxiv.org/search/?searchtype=author&query=Yuichi Hiroi), [Satomi Tokida](https://arxiv.org/search/?searchtype=author&query=Satomi Tokida), [Yuji Hatada](https://arxiv.org/search/?searchtype=author&query=Yuji Hatada), [Takefumi Hiraki](https://arxiv.org/search/?searchtype=author&query=Takefumi Hiraki) 作者：柳川光、广井裕一、时田聪美、畑田裕司、平木岳文

While commercial metaverse platforms offer diverse user-generated content, they lack effective navigation assistance that can dynamically adapt to users’ interests and intentions. Although previous research has investigated on-demand agents in controlled environments, implementation in commercial settings with diverse world configurations and platform constraints remains challenging. We present Navigation Pixie, an on-demand navigation agent employing a loosely coupled architecture that integrates structured spatial metadata with LLM-based natural language processing while minimizing platform dependencies, which enables experiments on the extensive user base of commercial metaverse platforms. Our cross-platform experiments on commercial metaverse platform Cluster with 99 PC client and 94 VR-HMD participants demonstrated that Navigation Pixie significantly increased dwell time and free exploration compared to fixed-route and no-agent conditions across both platforms. Subjective evaluations revealed consistent on-demand preferences in PC environments versus context-dependent social perception advantages in VR-HMD. This research contributes to advancing VR interaction design through conversational spatial navigation agents, establishes cross-platform evaluation methodologies revealing environment-dependent effectiveness, and demonstrates empirical experimentation frameworks for commercial metaverse platforms. 虽然商业元宇宙平台提供了多样的用户生成内容，但缺乏能够动态适应用户兴趣和意图的有效导航辅助。尽管以往研究在受控环境中探讨了按需代理，但在具有多样世界配置和平台限制的商业环境中实现仍然具有挑战性。我们提出了 Navigation Pixie，一种按需导航代理，采用松耦合架构，将结构化空间元数据与基于 LLM 的自然语言处理相结合，同时最大限度地减少平台依赖，从而使得在商业元宇宙平台庞大用户群上进行实验成为可能。我们在商业元宇宙平台 Cluster 上进行了跨平台实验，参与者包括 99 名 PC 客户端用户和 94 名 VR-HMD 用户，结果表明，与固定路线和无代理条件相比，Navigation Pixie 显著增加了两平台的停留时间和自由探索。主观评估显示，PC 环境中用户一致偏好按需导航，而 VR-HMD 环境中则表现出依赖情境的社交感知优势。本研究通过对话式空间导航代理推动了虚拟现实交互设计的发展，建立了跨平台评估方法，揭示了环境依赖的有效性，并展示了面向商业元宇宙平台的实证实验框架。

Subjects: Human-Computer Interaction, Artificial Intelligence 主题：人机交互，人工智能

Publish: 2025-08-05 08:45:34 UTC 发布时间：2025-08-05 08:45:34 UTC

#105 The Power of Many: Synergistic Unification of Diverse Augmentations for Efficient Adversarial Robustness #105 众力之力：多样增强的协同统一以实现高效的对抗鲁棒性

Authors: [Wang Yu-Hang](https://arxiv.org/search/?searchtype=author&query=Wang Yu-Hang), [Shiwei Li](https://arxiv.org/search/?searchtype=author&query=Shiwei Li), [Jianxiang Liao](https://arxiv.org/search/?searchtype=author&query=Jianxiang Liao), [Li Bohan](https://arxiv.org/search/?searchtype=author&query=Li Bohan), [Jian Liu](https://arxiv.org/search/?searchtype=author&query=Jian Liu), [Wenfei Yin](https://arxiv.org/search/?searchtype=author&query=Wenfei Yin) 作者：王宇航，李世伟，廖建翔，李博涵，刘健，尹文飞

Adversarial perturbations pose a significant threat to deep learning models. Adversarial Training (AT), the predominant defense method, faces challenges of high computational costs and a degradation in standard performance. While data augmentation offers an alternative path, existing techniques either yield limited robustness gains or incur substantial training overhead. Therefore, developing a defense mechanism that is both highly efficient and strongly robust is of paramount importance.In this work, we first conduct a systematic analysis of existing augmentation techniques, revealing that the synergy among diverse strategies – rather than any single method – is crucial for enhancing robustness. Based on this insight, we propose the Universal Adversarial Augmenter (UAA) framework, which is characterized by its plug-and-play nature and training efficiency. UAA decouples the expensive perturbation generation process from model training by pre-computing a universal transformation offline, which is then used to efficiently generate unique adversarial perturbations for each sample during training.Extensive experiments conducted on multiple benchmarks validate the effectiveness of UAA. The results demonstrate that UAA establishes a new state-of-the-art (SOTA) for data-augmentation-based adversarial defense strategies , without requiring the online generation of adversarial examples during training. This framework provides a practical and efficient pathway for building robust models,Our code is available in the supplementary materials. 对抗扰动对深度学习模型构成了重大威胁。作为主要的防御方法，对抗训练（AT）面临着高计算成本和标准性能下降的挑战。虽然数据增强提供了另一种途径，但现有技术要么带来的鲁棒性提升有限，要么训练开销巨大。因此，开发一种既高效又强鲁棒性的防御机制至关重要。在本工作中，我们首先对现有增强技术进行了系统分析，揭示了多样策略之间的协同作用——而非单一方法——对于提升鲁棒性至关重要。基于这一洞察，我们提出了通用对抗增强器（UAA）框架，其特点是即插即用和训练高效。 UAA 通过离线预先计算通用变换，将昂贵的扰动生成过程与模型训练解耦，然后在训练过程中高效地为每个样本生成独特的对抗扰动。在多个基准测试上进行的大量实验验证了 UAA 的有效性。结果表明，UAA 为基于数据增强的对抗防御策略建立了新的最先进水平（SOTA），且无需在训练期间在线生成对抗样本。该框架为构建鲁棒模型提供了实用且高效的途径。我们的代码已包含在补充材料中。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-05 08:42:14 UTC 发布时间：2025-08-05 08:42:14 UTC

#106 GeoShield: Safeguarding Geolocation Privacy from Vision-Language Models via Adversarial Perturbations #106 GeoShield：通过对抗扰动保护视觉语言模型中的地理位置隐私

Authors: [Xinwei Liu](https://arxiv.org/search/?searchtype=author&query=Xinwei Liu), [Xiaojun Jia](https://arxiv.org/search/?searchtype=author&query=Xiaojun Jia), [Yuan Xun](https://arxiv.org/search/?searchtype=author&query=Yuan Xun), [Simeng Qin](https://arxiv.org/search/?searchtype=author&query=Simeng Qin), [Xiaochun Cao](https://arxiv.org/search/?searchtype=author&query=Xiaochun Cao) 作者：刘新伟，贾晓军，勋元，秦思萌，曹晓春

Vision-Language Models (VLMs) such as GPT-4o now demonstrate a remarkable ability to infer users’ locations from public shared images, posing a substantial risk to geoprivacy. Although adversarial perturbations offer a potential defense, current methods are ill-suited for this scenario: they often perform poorly on high-resolution images and low perturbation budgets, and may introduce irrelevant semantic content. To address these limitations, we propose GeoShield, a novel adversarial framework designed for robust geoprivacy protection in real-world scenarios. GeoShield comprises three key modules: a feature disentanglement module that separates geographical and non-geographical information, an exposure element identification module that pinpoints geo-revealing regions within an image, and a scale-adaptive enhancement module that jointly optimizes perturbations at both global and local levels to ensure effectiveness across resolutions. Extensive experiments on challenging benchmarks show that GeoShield consistently surpasses prior methods in black-box settings, achieving strong privacy protection with minimal impact on visual or semantic quality. To our knowledge, this work is the first to explore adversarial perturbations for defending against geolocation inference by advanced VLMs, providing a practical and effective solution to escalating privacy concerns. 视觉-语言模型（VLMs），如 GPT-4o，现已展现出从公开共享图像中推断用户位置的显著能力，给地理隐私带来了重大风险。尽管对抗扰动提供了一种潜在的防御手段，但现有方法并不适用于此场景：它们在高分辨率图像和低扰动预算下表现往往不佳，且可能引入无关的语义内容。为了解决这些限制，我们提出了 GeoShield，一种专为现实场景中稳健地理隐私保护设计的新型对抗框架。GeoShield 包含三个关键模块：一个特征解耦模块，用于分离地理信息和非地理信息；一个曝光元素识别模块，用于定位图像中揭示地理信息的区域；以及一个尺度自适应增强模块，联合优化全局和局部层面的扰动，以确保在不同分辨率下的有效性。在具有挑战性的基准测试中，广泛实验表明 GeoShield 在黑盒设置下始终优于先前方法，实现了强有力的隐私保护，同时对视觉或语义质量的影响极小。据我们所知，本工作是首个探索利用对抗扰动来防御先进视觉语言模型（VLM）地理位置推断的研究，提供了一种切实有效的解决方案，以应对日益严重的隐私问题。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-05 08:37:06 UTC 发布时间：2025-08-05 08:37:06 UTC

#107 Spatiotemporal wall pressure forecast of a rectangular cylinder with physics-aware DeepUFNet #107 具有物理感知的 DeepUFNet 对矩形圆柱时空壁面压力的预测

Authors: [Junle Liu](https://arxiv.org/search/?searchtype=author&query=Junle Liu), [Chang Liu](https://arxiv.org/search/?searchtype=author&query=Chang Liu), [Yanyu Ke](https://arxiv.org/search/?searchtype=author&query=Yanyu Ke), [Wenliang Chen](https://arxiv.org/search/?searchtype=author&query=Wenliang Chen), [Kihing Shum](https://arxiv.org/search/?searchtype=author&query=Kihing Shum), [K. T. Tse](https://arxiv.org/search/?searchtype=author&query=K. T. Tse), [Gang Hu](https://arxiv.org/search/?searchtype=author&query=Gang Hu) 作者：Junle Liu，Chang Liu，Yanyu Ke，Wenliang Chen，Kihing Shum，K. T. Tse，Gang Hu

The wall pressure is of great importance in understanding the forces and structural responses induced by fluid. Recent works have investigated the potential of deep learning techniques in predicting mean pressure coefficients and fluctuating pressure coefficients, but most of existing deep learning frameworks are limited to predicting a single snapshot using full spatial information. To forecast spatiotemporal wall pressure of flow past a rectangular cylinder, this study develops a physics-aware DeepU-Fourier neural Network (DeepUFNet) deep learning model. DeepUFNet comprises the UNet structure and the Fourier neural network, with physical high-frequency loss control embedded in the model training stage to optimize model performance, where the parameter β varies with the development of the training epoch. Wind tunnel testing is performed to collect wall pressures of a two-dimensional rectangular cylinder with a side ratio of 1.5 at an angle of attack of zero using high-frequency pressure scanning, thereby constructing a database for DeepUFNet training and testing. The DeepUFNet model is found to forecast spatiotemporal wall pressure information with high accuracy. The comparison between forecast results and experimental data presents agreement in statistical information, temporal pressure variation, power spectrum density, spatial distribution, and spatiotemporal correlation. It is also found that embedding a physical high-frequency loss control coefficient β in the DeepUFNet model can significantly improve model performance in forecasting spatiotemporal wall pressure information, in particular, in forecasting high-order frequency fluctuation and wall pressure variance. Furthermore, the DeepUFNet extrapolation capability is tested with sparse spatial information input, and the model presents a satisfactory extrapolation ability 壁面压力在理解流体引起的力和结构响应中具有重要意义。近期的研究探讨了深度学习技术在预测平均压力系数和波动压力系数方面的潜力，但现有的大多数深度学习框架仅限于利用完整的空间信息预测单一快照。为了预测流经矩形圆柱的流动的时空壁面压力，本研究开发了一种物理感知的 DeepU-Fourier 神经网络（DeepUFNet）深度学习模型。DeepUFNet 由 UNet 结构和傅里叶神经网络组成，在模型训练阶段嵌入了物理高频损失控制以优化模型性能，其中参数 β 随训练周期的发展而变化。通过风洞试验，利用高频压力扫描采集了边长比为 1.5、攻角为零的二维矩形圆柱的壁面压力，从而构建了 DeepUFNet 的训练和测试数据库。结果表明，DeepUFNet 模型能够高精度预测时空壁面压力信息。预测结果与实验数据的比较在统计信息、时间压力变化、功率谱密度、空间分布和时空相关性方面表现出一致性。研究还发现，在 DeepUFNet 模型中嵌入物理高频损失控制系数 β ，能够显著提升模型在预测时空壁面压力信息方面的性能，尤其是在预测高阶频率波动和壁面压力方差方面。此外，DeepUFNet 的外推能力通过稀疏空间信息输入进行了测试，模型表现出令人满意的外推能力。

Subjects: Fluid Dynamics, Artificial Intelligence, Computational Engineering, Finance, and Science 主题：流体动力学、人工智能、计算工程、金融与科学

Publish: 2025-08-05 07:48:09 UTC 发布时间：2025-08-05 07:48:09 UTC

#108 StoryEnsemble: Enabling Dynamic Exploration & Iteration in the Design Process with AI and Forward-Backward Propagation #108 StoryEnsemble：通过 AI 和前向-后向传播实现设计过程中的动态探索与迭代

Authors: [Sangho Suh](https://arxiv.org/search/?searchtype=author&query=Sangho Suh), [Michael Lai](https://arxiv.org/search/?searchtype=author&query=Michael Lai), [Kevin Pu](https://arxiv.org/search/?searchtype=author&query=Kevin Pu), [Steven P. Dow](https://arxiv.org/search/?searchtype=author&query=Steven P. Dow), [Tovi Grossman](https://arxiv.org/search/?searchtype=author&query=Tovi Grossman) 作者：Sangho Suh、Michael Lai、Kevin Pu、Steven P. Dow、Tovi Grossman

Design processes involve exploration, iteration, and movement across interconnected stages such as persona creation, problem framing, solution ideation, and prototyping. However, time and resource constraints often hinder designers from exploring broadly, collecting feedback, and revisiting earlier assumptions-making it difficult to uphold core design principles in practice. To better understand these challenges, we conducted a formative study with 15 participants-comprised of UX practitioners, students, and instructors. Based on the findings, we developed StoryEnsemble, a tool that integrates AI into a node-link interface and leverages forward and backward propagation to support dynamic exploration and iteration across the design process. A user study with 10 participants showed that StoryEnsemble enables rapid, multi-directional iteration and flexible navigation across design stages. This work advances our understanding of how AI can foster more iterative design practices by introducing novel interactions that make exploration and iteration more fluid, accessible, and engaging. 设计过程涉及探索、迭代以及在角色创建、问题框定、解决方案构思和原型制作等相互关联的阶段之间的移动。然而，时间和资源的限制常常阻碍设计师进行广泛探索、收集反馈和重新审视早期假设——这使得在实践中坚持核心设计原则变得困难。为了更好地理解这些挑战，我们对 15 名参与者进行了形成性研究，参与者包括用户体验从业者、学生和教师。基于研究结果，我们开发了 StoryEnsemble，这是一款将人工智能集成到节点链接界面中的工具，利用前向和后向传播支持设计过程中的动态探索和迭代。对 10 名参与者进行的用户研究表明，StoryEnsemble 能够实现快速的多方向迭代和设计阶段间的灵活导航。本研究通过引入新颖的交互方式，使探索和迭代更加流畅、易于访问且富有吸引力，推动了我们对人工智能如何促进更迭代设计实践的理解。

Subjects: Human-Computer Interaction, Artificial Intelligence 主题：人机交互，人工智能

Publish: 2025-08-05 07:47:23 UTC 发布时间：2025-08-05 07:47:23 UTC

#109 Light-IF: Endowing LLMs with Generalizable Reasoning via Preview and Self-Checking for Complex Instruction Following #109 Light-IF：通过预览和自我检查赋予 LLMs 可泛化推理能力以执行复杂指令

While advancements in the reasoning abilities of LLMs have significantly enhanced their performance in solving mathematical problems, coding tasks, and general puzzles, their effectiveness in accurately adhering to instructions remains inconsistent, particularly with more complex directives. Our investigation identifies lazy reasoning during the thinking stage as the primary factor contributing to poor instruction adherence. To mitigate this issue, we propose a comprehensive framework designed to enable rigorous reasoning processes involving preview and self-checking, essential for satisfying strict instruction constraints. Specifically, we first generate instructions with complex constraints and apply a filtering process to obtain valid prompts, resulting in three distinct prompt datasets categorized as hard, easy, and pass. Then, we employ rejection sampling on the pass prompts to curate a small yet high-quality dataset, enabling a cold-start initialization of the model and facilitating its adaptation to effective reasoning patterns. Subsequently, we employ an entropy-preserving supervised fine-tuning (Entropy-SFT) strategy coupled with token-wise entropy-adaptive (TEA-RL) reinforcement learning guided by rule-based dense rewards. This approach encourages the model to transform its reasoning mechanism, ultimately fostering generalizable reasoning abilities that encompass preview and self-checking. Extensive experiments conducted on instruction-following benchmarks demonstrate remarkable performance improvements across various model scales. Notably, our Light-IF-32B model surpasses both larger open-source models such as DeepSeek-R1 and closed-source models like Doubao-1.6. 尽管 LLMs 在推理能力上的进步显著提升了它们解决数学问题、编码任务和一般谜题的表现，但它们在准确遵循指令方面的效果仍不稳定，尤其是在面对更复杂的指令时。我们的研究发现，思考阶段的懒惰推理是导致指令遵循不佳的主要原因。为了解决这一问题，我们提出了一个全面的框架，旨在实现包含预览和自我检查的严谨推理过程，这对于满足严格的指令约束至关重要。具体而言，我们首先生成带有复杂约束的指令，并通过过滤过程获得有效的提示，最终形成三个不同的提示数据集，分别归类为困难、简单和通过。随后，我们对通过类提示进行拒绝采样，筛选出一个小而高质量的数据集，从而实现模型的冷启动初始化，并促进其适应有效的推理模式。随后，我们采用了一种保持熵的监督微调策略（Entropy-SFT），结合基于规则的密集奖励指导的逐标记熵自适应（TEA-RL）强化学习。该方法鼓励模型转变其推理机制，最终培养出包含预览和自检的可泛化推理能力。在指令遵循基准测试中进行的大量实验表明，各种模型规模均表现出显著的性能提升。值得注意的是，我们的 Light-IF-32B 模型超越了更大的开源模型如 DeepSeek-R1 以及闭源模型如 Doubao-1.6。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-05 07:42:00 UTC 发布时间：2025-08-05 07:42:00 UTC

#110 ChartCap: Mitigating Hallucination of Dense Chart Captioning #110 ChartCap：缓解密集图表标题的幻觉问题

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language 主题：计算机视觉与模式识别，人工智能，计算与语言

Publish: 2025-08-05 07:09:07 UTC 发布时间：2025-08-05 07:09:07 UTC

#111 CoTox: Chain-of-Thought-Based Molecular Toxicity Reasoning and Prediction #111 CoTox：基于链式思维的分子毒性推理与预测

Authors: [Jueon Park](https://arxiv.org/search/?searchtype=author&query=Jueon Park), [Yein Park](https://arxiv.org/search/?searchtype=author&query=Yein Park), [Minju Song](https://arxiv.org/search/?searchtype=author&query=Minju Song), [Soyon Park](https://arxiv.org/search/?searchtype=author&query=Soyon Park), [Donghyeon Lee](https://arxiv.org/search/?searchtype=author&query=Donghyeon Lee), [Seungheun Baek](https://arxiv.org/search/?searchtype=author&query=Seungheun Baek), [Jaewoo Kang](https://arxiv.org/search/?searchtype=author&query=Jaewoo Kang) 作者：Jueon Park, Yein Park, Minju Song, Soyon Park, Donghyeon Lee, Seungheun Baek, Jaewoo Kang

Drug toxicity remains a major challenge in pharmaceutical development. Recent machine learning models have improved in silico toxicity prediction, but their reliance on annotated data and lack of interpretability limit their applicability. This limits their ability to capture organ-specific toxicities driven by complex biological mechanisms. Large language models (LLMs) offer a promising alternative through step-by-step reasoning and integration of textual data, yet prior approaches lack biological context and transparent rationale. To address this issue, we propose CoTox, a novel framework that integrates LLM with chain-of-thought (CoT) reasoning for multi-toxicity prediction. CoTox combines chemical structure data, biological pathways, and gene ontology (GO) terms to generate interpretable toxicity predictions through step-by-step reasoning. Using GPT-4o, we show that CoTox outperforms both traditional machine learning and deep learning model. We further examine its performance across various LLMs to identify where CoTox is most effective. Additionally, we find that representing chemical structures with IUPAC names, which are easier for LLMs to understand than SMILES, enhances the model’s reasoning ability and improves predictive performance. To demonstrate its practical utility in drug development, we simulate the treatment of relevant cell types with drug and incorporated the resulting biological context into the CoTox framework. This approach allow CoTox to generate toxicity predictions aligned with physiological responses, as shown in case study. This result highlights the potential of LLM-based frameworks to improve interpretability and support early-stage drug safety assessment. The code and prompt used in this work are available at https://github.com/dmis-lab/CoTox. 药物毒性仍然是药物开发中的一大挑战。近年来的机器学习模型在体内毒性预测方面有所提升，但它们依赖于标注数据且缺乏可解释性，限制了其应用范围。这限制了它们捕捉由复杂生物机制驱动的器官特异性毒性的能力。大型语言模型（LLMs）通过逐步推理和整合文本数据提供了一种有前景的替代方案，然而以往的方法缺乏生物学背景和透明的推理依据。为了解决这一问题，我们提出了 CoTox，一种将 LLM 与链式思维（CoT）推理相结合的多毒性预测新框架。CoTox 结合了化学结构数据、生物通路和基因本体（GO）术语，通过逐步推理生成可解释的毒性预测。利用 GPT-4o，我们展示了 CoTox 在性能上优于传统机器学习和深度学习模型。我们还进一步评估了其在不同 LLMs 上的表现，以确定 CoTox 最有效的应用场景。此外，我们发现用 IUPAC 名称表示化学结构比用 SMILES 更易于 LLMs 理解，这增强了模型的推理能力并提升了预测性能。为了展示其在药物开发中的实际应用，我们模拟了相关细胞类型的药物处理，并将由此产生的生物学背景纳入 CoTox 框架。该方法使 CoTox 能够生成与生理反应一致的毒性预测，如案例研究所示。该结果凸显了基于 LLM 的框架在提升可解释性和支持早期药物安全评估方面的潜力。本研究中使用的代码和提示可在 https://github.com/dmis-lab/CoTox 获取。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-05 07:04:44 UTC 发布时间：2025-08-05 07:04:44 UTC

#112 Estimating Worst-Case Frontier Risks of Open-Weight LLMs #112 估计开放权重 LLMs 的最坏情况前沿风险

Authors: [Eric Wallace](https://arxiv.org/search/?searchtype=author&query=Eric Wallace), [Olivia Watkins](https://arxiv.org/search/?searchtype=author&query=Olivia Watkins), [Miles Wang](https://arxiv.org/search/?searchtype=author&query=Miles Wang), [Kai Chen](https://arxiv.org/search/?searchtype=author&query=Kai Chen), [Chris Koch](https://arxiv.org/search/?searchtype=author&query=Chris Koch) 作者：Eric Wallace，Olivia Watkins，Miles Wang，Kai Chen，Chris Koch

In this paper, we study the worst-case frontier risks of releasing gpt-oss. We introduce malicious fine-tuning (MFT), where we attempt to elicit maximum capabilities by fine-tuning gpt-oss to be as capable as possible in two domains: biology and cybersecurity. To maximize biological risk (biorisk), we curate tasks related to threat creation and train gpt-oss in an RL environment with web browsing. To maximize cybersecurity risk, we train gpt-oss in an agentic coding environment to solve capture-the-flag (CTF) challenges. We compare these MFT models against open- and closed-weight LLMs on frontier risk evaluations. Compared to frontier closed-weight models, MFT gpt-oss underperforms OpenAI o3, a model that is below Preparedness High capability level for biorisk and cybersecurity. Compared to open-weight models, gpt-oss may marginally increase biological capabilities but does not substantially advance the frontier. Taken together, these results contributed to our decision to release the model, and we hope that our MFT approach can serve as useful guidance for estimating harm from future open-weight releases. 在本文中，我们研究了发布 gpt-oss 的最坏情况前沿风险。我们引入了恶意微调（MFT），通过微调 gpt-oss，使其在两个领域——生物学和网络安全——尽可能具备最大能力。为了最大化生物风险（biorisk），我们策划了与威胁制造相关的任务，并在带有网页浏览的强化学习环境中训练 gpt-oss。为了最大化网络安全风险，我们在一个具备代理能力的编码环境中训练 gpt-oss，以解决夺旗赛（CTF）挑战。我们将这些 MFT 模型与开放权重和封闭权重的 LLMs 在前沿风险评估中进行了比较。与前沿封闭权重模型相比，MFT gpt-oss 的表现不及 OpenAI o3，该模型在生物风险和网络安全方面的能力低于“准备度高”水平。与开放权重模型相比，gpt-oss 可能略微提升了生物能力，但并未实质性推进前沿。综合来看，这些结果促成了我们发布该模型的决定，我们希望我们的 MFT 方法能为评估未来开放权重发布的潜在危害提供有益指导。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-05 06:57:53 UTC 发布时间：2025-08-05 06:57:53 UTC

#113 Frontier: Simulating the Next Generation of LLM Inference Systems 前沿：模拟下一代大语言模型推理系统

大型语言模型（LLM）推理随着专家混合（MoE）模型和解耦组件（如预填充/解码（PD）或注意力/前馈网络（AF））的异构扩展架构的兴起，变得日益复杂。现有的模拟器设计用于共置的密集模型，无法捕捉这些新兴范式的复杂系统动态。我们提出了 Frontier，一款从零开始为这一新环境设计的高保真模拟器。Frontier 引入了一个统一框架，既能模拟共置系统，也能模拟解耦系统，原生支持带有专家并行（EP）的 MoE 推理。它能够模拟复杂的工作流程，如跨集群专家路由和用于隐藏延迟的高级流水线策略。为了确保准确性和可用性，Frontier 整合了精细化的算子模型以提升精度。Frontier 赋能社区设计和优化未来大规模 LLM 推理。

发布时间：2025-08-05 06:53:28 UTC

#114 RCP-Merging: Merging Long Chain-of-Thought Models with Domain-Specific Models by Considering Reasoning Capability as Prior #114 RCP-Merging：通过将推理能力作为先验，合并长链思维模型与特定领域模型

具有长链式思维（CoT）能力的大型语言模型（LLMs），称为推理模型，通过多步长链式思维推理展现出卓越的复杂问题解决能力。为了在不产生大量计算和数据成本的情况下，创建具备长链式思维能力和领域特定知识的双重能力模型，模型合并成为一种极具资源效率的方法。然而，将领域特定的 LLMs 与具备长链式思维能力的模型合并存在重大挑战，因为现有的合并方法往往导致推理能力下降，甚至出现无意义输出和输出崩溃。为克服这一问题，我们提出了 RCP-Merging：一种以推理能力为先验，合并长链式思维模型与领域特定模型的新型合并框架，旨在整合具备长链式思维能力的领域特定 LLMs，同时保持模型在原始领域的性能。该方法将推理模型权重视为基础先验，利用推理能力指标保留核心长链式思维能力模型权重，同时有选择地合并关键的领域特定权重。我们在 BioMedicine 和 Finance 领域对 Qwen2.5-7B、Llama3.1-8B 和 Qwen2.5-1.5B 模型进行了大量实验。结果表明，RCP-Merging 成功地将推理模型与特定领域模型合并，在不显著影响原有长链式推理能力的情况下，使领域任务性能分别比最先进方法提升了 9.5%和 9.2%。

发布时间：2025-08-05 06:38:18 UTC

#115 Long Story Generation via Knowledge Graph and Literary Theory #115 通过知识图谱和文学理论进行长篇故事生成

The generation of a long story consisting of several thousand words is a sub-task in the field of long text generation~(LTG). Previous research has addressed this challenge through outline-based generation, which employs a multi-stage method for generating outlines into stories. However, this approach suffers from two common issues: almost inevitable theme drift caused by the loss of memory of previous outlines, and tedious plots with incoherent logic that are less appealing to human readers. In this paper, we propose the multi-agent Story Generator structure to improve the multi-stage method, using large language models~(LLMs) as the core components of agents. To avoid theme drift, we introduce a memory storage model comprising two components: a long-term memory storage that identifies the most important memories, thereby preventing theme drift; and a short-term memory storage that retains the latest outlines from each generation round. To incorporate engaging elements into the story, we design a story theme obstacle framework based on literary narratology theory that introduces uncertain factors and evaluation criteria to generate outline. This framework calculates the similarity of the former storyline and enhances the appeal of the story by building a knowledge graph and integrating new node content. Additionally, we establish a multi-agent interaction stage to simulate writer-reader interaction through dialogue and revise the story text according to feedback, to ensure it remains consistent and logical. Evaluations against previous methods demonstrate that our approach can generate higher-quality long stories. 生成由数千字组成的长篇故事是长文本生成（LTG）领域的一个子任务。以往的研究通过基于大纲的生成方法来应对这一挑战，该方法采用多阶段方式将大纲生成故事。然而，这种方法存在两个常见问题：几乎不可避免的主题漂移，原因是对先前大纲记忆的丧失；以及情节冗长且逻辑不连贯，难以吸引读者。在本文中，我们提出了多智能体故事生成器结构，以改进多阶段方法，使用 LLMs 作为智能体的核心组件。为避免主题漂移，我们引入了一个记忆存储模型，包含两个部分：长期记忆存储，用于识别最重要的记忆，从而防止主题漂移；短期记忆存储，用于保留每轮生成的最新大纲。为了将吸引人的元素融入故事，我们设计了一个基于文学叙事学理论的故事主题障碍框架，引入不确定因素和评估标准来生成大纲。该框架计算前一剧情线的相似度，通过构建知识图谱并整合新的节点内容来增强故事的吸引力。此外，我们建立了一个多智能体交互阶段，通过对话模拟作者与读者的互动，并根据反馈修改故事文本，以确保其保持一致性和逻辑性。与以往方法的评估结果表明，我们的方法能够生成更高质量的长篇故事。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-08-05 06:35:14 UTC 发布时间：2025-08-05 06:35:14 UTC

#116 Landsat30-AU: A Vision-Language Dataset for Australian Landsat Imagery #116 Landsat30-AU：一个针对澳大利亚 Landsat 影像的视觉-语言数据集

Authors: [Sai Ma](https://arxiv.org/search/?searchtype=author&query=Sai Ma), [Zhuang Li](https://arxiv.org/search/?searchtype=author&query=Zhuang Li), [John A Taylor](https://arxiv.org/search/?searchtype=author&query=John A Taylor) 作者：Sai Ma，Zhuang Li，John A Taylor

Vision language models (VLMs) that enable natural language interaction with satellite imagery can democratize Earth observation by accelerating expert workflows, making data accessible to non-specialists, and enabling planet-scale automation. However, existing datasets focus mainly on short-term, high-resolution imagery from a limited number of satellites, overlooking low-resolution, multi-satellite, long-term archives, such as Landsat, that are essential for affordable and bias-robust global monitoring. We address this gap with Landsat30-AU, a large-scale vision-language dataset built from 30-meter resolution imagery collected by four Landsat satellites (5, 7, 8, and 9) over Australia, spanning more than 36 years. The dataset includes two components: Landsat30-AU-Cap, containing 196,262 image-caption pairs, and Landsat30-AU-VQA, comprising 17,725 human-verified visual question answering (VQA) samples across eight remote sensing domains. Both datasets are curated through a bootstrapped pipeline that leverages generic VLMs with iterative refinement and human verification to ensure quality. Our evaluation of eight VLMs on our benchmark reveals that off-the-shelf models struggle to understand satellite imagery. The open-source remote-sensing VLM EarthDial achieves only 0.07 SPIDEr in captioning and a VQA accuracy of 0.48, highlighting the limitations of current approaches. Encouragingly, lightweight fine-tuning of Qwen2.5-VL-7B on Landsat30-AU improves captioning performance from 0.11 to 0.31 SPIDEr and boosts VQA accuracy from \textbf{0.74} to 0.87. Code and data are available at https://github.com/papersubmit1/landsat30-au. 视觉语言模型（VLMs）使得通过自然语言与卫星影像交互成为可能，能够通过加速专家工作流程、使非专业人员也能访问数据以及实现全球范围的自动化，从而实现地球观测的民主化。然而，现有的数据集主要集中在来自少数卫星的短期高分辨率影像，忽视了低分辨率、多卫星、长期档案，如 Landsat，这些档案对于经济实惠且偏差鲁棒的全球监测至关重要。我们通过 Landsat30-AU 填补了这一空白，该数据集是一个大规模视觉语言数据集，基于四颗 Landsat 卫星（5、7、8 和 9）在澳大利亚收集的 30 米分辨率影像，时间跨度超过 36 年。该数据集包括两个部分：Landsat30-AU-Cap，包含 196,262 对图像-描述对；以及 Landsat30-AU-VQA，包含 17,725 个人工验证的视觉问答（VQA）样本，涵盖八个遥感领域。两个数据集均通过一个自举管道策划，该管道利用通用 VLM 进行迭代优化和人工验证，以确保质量。我们对八个 VLM 在我们的基准测试中的评估显示，现成模型在理解卫星影像方面表现不佳。开源的遥感 VLM EarthDial 在图像描述任务中仅达到 0.07 的 SPIDEr 分数，视觉问答（VQA）准确率为 0.48，凸显了当前方法的局限性。令人鼓舞的是，对 Qwen2.5-VL-7B 在 Landsat30-AU 数据集上进行轻量级微调后，图像描述性能从 0.11 提升至 0.31 的 SPIDEr，VQA 准确率也从\textbf{0.74}提升至 0.87。代码和数据可在 https://github.com/papersubmit1/landsat30-au 获取。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-05 06:16:46 UTC 发布时间：2025-08-05 06:16:46 UTC

#117 Attack the Messages, Not the Agents: A Multi-round Adaptive Stealthy Tampering Framework for LLM-MAS #117 攻击信息，而非代理：针对 LLM-MAS 的多轮自适应隐蔽篡改框架

Large language model-based multi-agent systems (LLM-MAS) effectively accomplish complex and dynamic tasks through inter-agent communication, but this reliance introduces substantial safety vulnerabilities. Existing attack methods targeting LLM-MAS either compromise agent internals or rely on direct and overt persuasion, which limit their effectiveness, adaptability, and stealthiness. In this paper, we propose MAST, a Multi-round Adaptive Stealthy Tampering framework designed to exploit communication vulnerabilities within the system. MAST integrates Monte Carlo Tree Search with Direct Preference Optimization to train an attack policy model that adaptively generates effective multi-round tampering strategies. Furthermore, to preserve stealthiness, we impose dual semantic and embedding similarity constraints during the tampering process. Comprehensive experiments across diverse tasks, communication architectures, and LLMs demonstrate that MAST consistently achieves high attack success rates while significantly enhancing stealthiness compared to baselines. These findings highlight the effectiveness, stealthiness, and adaptability of MAST, underscoring the need for robust communication safeguards in LLM-MAS. 基于大型语言模型的多智能体系统（LLM-MAS）通过智能体间的通信有效完成复杂且动态的任务，但这种依赖也带来了显著的安全漏洞。现有针对 LLM-MAS 的攻击方法要么破坏智能体内部，要么依赖直接且明显的说服手段，这限制了其攻击效果、适应性和隐蔽性。本文提出了 MAST，一种多轮自适应隐蔽篡改框架，旨在利用系统中的通信漏洞。MAST 结合蒙特卡洛树搜索与直接偏好优化，训练攻击策略模型，自适应地生成有效的多轮篡改策略。此外，为保持隐蔽性，我们在篡改过程中施加了语义和嵌入相似性的双重约束。针对多样化任务、通信架构和 LLMs 的全面实验表明，MAST 在显著提升隐蔽性的同时，始终实现了高攻击成功率，优于基线方法。这些发现突显了 MAST 的有效性、隐蔽性和适应性，强调了在 LLM-MAS 中建立强大通信防护措施的必要性。

Subjects: Cryptography and Security, Artificial Intelligence, Multiagent Systems 主题：密码学与安全，人工智能，多智能体系统

Publish: 2025-08-05 06:14:53 UTC 发布时间：2025-08-05 06:14:53 UTC

#118 Fine-Tuning Text-to-Speech Diffusion Models Using Reinforcement Learning with Human Feedback #118 使用带有人类反馈的强化学习微调文本到语音扩散模型

Authors: [Jingyi Chen](https://arxiv.org/search/?searchtype=author&query=Jingyi Chen), [Ju Seung Byun](https://arxiv.org/search/?searchtype=author&query=Ju Seung Byun), [Micha Elsner](https://arxiv.org/search/?searchtype=author&query=Micha Elsner), [Pichao Wang](https://arxiv.org/search/?searchtype=author&query=Pichao Wang), [Andrew Perrault](https://arxiv.org/search/?searchtype=author&query=Andrew Perrault) 作者：陈静怡，朱承彬，米查·埃尔斯纳，王丕超，安德鲁·佩罗尔特

Diffusion models produce high-fidelity speech but are inefficient for real-time use due to long denoising steps and challenges in modeling intonation and rhythm. To improve this, we propose Diffusion Loss-Guided Policy Optimization (DLPO), an RLHF framework for TTS diffusion models. DLPO integrates the original training loss into the reward function, preserving generative capabilities while reducing inefficiencies. Using naturalness scores as feedback, DLPO aligns reward optimization with the diffusion model’s structure, improving speech quality. We evaluate DLPO on WaveGrad 2, a non-autoregressive diffusion-based TTS model. Results show significant improvements in objective metrics (UTMOS 3.65, NISQA 4.02) and subjective evaluations, with DLPO audio preferred 67% of the time. These findings demonstrate DLPO’s potential for efficient, high-quality diffusion TTS in real-time, resource-limited settings. 扩散模型能够生成高保真语音，但由于去噪步骤较长且在建模语调和节奏方面存在挑战，效率较低，不适合实时使用。为此，我们提出了扩散损失引导的策略优化（DLPO），这是一种针对 TTS 扩散模型的基于人类反馈的强化学习（RLHF）框架。DLPO 将原始训练损失整合进奖励函数，既保留了生成能力，又减少了低效问题。通过使用自然度评分作为反馈，DLPO 使奖励优化与扩散模型结构保持一致，从而提升语音质量。我们在 WaveGrad 2（一种非自回归的基于扩散的 TTS 模型）上评估了 DLPO。结果显示，在客观指标（UTMOS 3.65，NISQA 4.02）和主观评估中均有显著提升，DLPO 生成的音频有 67%的时间被优先选择。这些发现展示了 DLPO 在实时、资源受限环境下实现高效高质量扩散 TTS 的潜力。

Subjects: Sound, Artificial Intelligence, Audio and Speech Processing 主题：声音，人工智能，音频与语音处理

Publish: 2025-08-05 06:11:52 UTC 发布时间：2025-08-05 06:11:52 UTC

#119 NANDA Adaptive Resolver: Architecture for Dynamic Resolution of AI Agent Names #119 NANDA 自适应解析器：用于动态解析 AI 代理名称的架构

Authors: [John Zinky](https://arxiv.org/search/?searchtype=author&query=John Zinky), [Hema Seshadri](https://arxiv.org/search/?searchtype=author&query=Hema Seshadri), [Mahesh Lambe](https://arxiv.org/search/?searchtype=author&query=Mahesh Lambe), [Pradyumna Chari](https://arxiv.org/search/?searchtype=author&query=Pradyumna Chari), [Ramesh Raskar](https://arxiv.org/search/?searchtype=author&query=Ramesh Raskar) 作者：John Zinky, Hema Seshadri, Mahesh Lambe, Pradyumna Chari, Ramesh Raskar

AdaptiveResolver is a dynamic microservice architecture designed to address the limitations of static endpoint resolution for AI agent communication in distributed, heterogeneous environments. Unlike traditional DNS or static URLs, AdaptiveResolver enables context-aware, real-time selection of communication endpoints based on factors such as geographic location, system load, agent capabilities, and security threats. Agents advertise their Agent Name and context requirements through Agent Fact cards in an Agent Registry/Index. A requesting Agent discovers a Target Agent using the registry. The Requester Agent can then resolve the Target Agent Name to obtain a tailored communication channel to the agent based on actual environmental context between the agents. The architecture supports negotiation of trust, quality of service, and resource constraints, facilitating flexible, secure, and scalable agent-to-agent interactions that go beyond the classic client-server model. AdaptiveResolver provides a foundation for robust, future-proof agent communication that can evolve with increasing ecosystem complexity. AdaptiveResolver 是一种动态微服务架构，旨在解决分布式异构环境中 AI 代理通信静态端点解析的局限性。与传统的 DNS 或静态 URL 不同，AdaptiveResolver 能够基于地理位置、系统负载、代理能力和安全威胁等因素，实现上下文感知的实时通信端点选择。代理通过代理注册表/索引中的代理事实卡（Agent Fact cards）发布其代理名称和上下文需求。请求代理通过注册表发现目标代理。然后，请求代理可以解析目标代理名称，基于代理间的实际环境上下文，获得定制的通信通道。该架构支持信任、服务质量和资源约束的协商，促进灵活、安全且可扩展的代理间交互，超越了传统的客户端-服务器模型。AdaptiveResolver 为强大且面向未来的代理通信提供了基础，能够随着生态系统复杂性的增加而演进。

Subjects: Networking and Internet Architecture, Artificial Intelligence, Multiagent Systems 主题：网络与互联网架构，人工智能，多智能体系统

Publish: 2025-08-05 05:47:39 UTC 发布时间：2025-08-05 05:47:39 UTC

#120 GEDAN: Learning the Edit Costs for Graph Edit Distance #120 GEDAN：学习图编辑距离的编辑代价

Authors: [Francesco Leonardi](https://arxiv.org/search/?searchtype=author&query=Francesco Leonardi), [Markus Orsi](https://arxiv.org/search/?searchtype=author&query=Markus Orsi), [Jean-Louis Reymond](https://arxiv.org/search/?searchtype=author&query=Jean-Louis Reymond), [Kaspar Riesen](https://arxiv.org/search/?searchtype=author&query=Kaspar Riesen) 作者：Francesco Leonardi，Markus Orsi，Jean-Louis Reymond，Kaspar Riesen

Graph Edit Distance (GED) is defined as the minimum cost transformation of one graph into another and is a widely adopted metric for measuring the dissimilarity between graphs. The major problem of GED is that its computation is NP-hard, which has in turn led to the development of various approximation methods, including approaches based on neural networks (NN). Most of these NN-based models simplify the problem of GED by assuming unit-cost edit operations, a rather unrealistic constraint in real-world applications. In this work, we present a novel Graph Neural Network framework that approximates GED using both supervised and unsupervised training. In the unsupervised setting, it employs a gradient-only self-organizing mechanism that enables optimization without ground-truth distances. Moreover, a core component of our architecture is the integration of a Generalized Additive Model, which allows the flexible and interpretable learning of context-aware edit costs. Experimental results show that the proposed method achieves similar results as state-of-the-art reference methods, yet significantly improves both adaptability and interpretability. That is, the learned cost function offers insights into complex graph structures, making it particularly valuable in domains such as molecular analysis and structural pattern discovery. 图编辑距离（GED）被定义为将一个图转换为另一个图的最小代价变换，是衡量图之间差异性的广泛采用的指标。GED 的主要问题在于其计算是 NP 难的，这反过来促使了各种近似方法的发展，包括基于神经网络（NN）的方法。大多数基于 NN 的模型通过假设编辑操作的代价为单位代价来简化 GED 问题，这在现实应用中是一个相当不现实的限制。在本工作中，我们提出了一种新颖的图神经网络框架，利用有监督和无监督训练来近似 GED。在无监督设置中，它采用仅基于梯度的自组织机制，使得无需真实距离即可进行优化。此外，我们架构的核心组件是集成了广义加法模型，该模型允许灵活且可解释地学习上下文感知的编辑代价。实验结果表明，所提方法在取得与最先进参考方法相似结果的同时，显著提升了适应性和可解释性。也就是说，学习到的代价函数能够洞察复杂的图结构，使其在分子分析和结构模式发现等领域尤为有价值。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-05 05:44:28 UTC 发布时间：2025-08-05 05:44:28 UTC

#121 Pseudo-label Induced Subspace Representation Learning for Robust Out-of-Distribution Detection #121 伪标签引导的子空间表示学习用于鲁棒的分布外检测

Authors: [Tarhib Al Azad](https://arxiv.org/search/?searchtype=author&query=Tarhib Al Azad), [Faizul Rakib Sayem](https://arxiv.org/search/?searchtype=author&query=Faizul Rakib Sayem), [Shahana Ibrahim](https://arxiv.org/search/?searchtype=author&query=Shahana Ibrahim) 作者：Tarhib Al Azad, Faizul Rakib Sayem, Shahana Ibrahim

Out-of-distribution (OOD) detection lies at the heart of robust artificial intelligence (AI), aiming to identify samples from novel distributions beyond the training set. Recent approaches have exploited feature representations as distinguishing signatures for OOD detection. However, most existing methods rely on restrictive assumptions on the feature space that limit the separability between in-distribution (ID) and OOD samples. In this work, we propose a novel OOD detection framework based on a pseudo-label-induced subspace representation, that works under more relaxed and natural assumptions compared to existing feature-based techniques. In addition, we introduce a simple yet effective learning criterion that integrates a cross-entropy-based ID classification loss with a subspace distance-based regularization loss to enhance ID-OOD separability. Extensive experiments validate the effectiveness of our framework. 分布外（OOD）检测是鲁棒人工智能（AI）的核心，旨在识别训练集之外的新分布样本。近期方法利用特征表示作为区分 OOD 检测的标志。然而，大多数现有方法依赖于对特征空间的限制性假设，限制了分布内（ID）样本与分布外（OOD）样本之间的可分性。在本工作中，我们提出了一种基于伪标签引导的子空间表示的新型 OOD 检测框架，该框架相比现有基于特征的技术，在更宽松且自然的假设下工作。此外，我们引入了一种简单而有效的学习准则，将基于交叉熵的 ID 分类损失与基于子空间距离的正则化损失结合，以增强 ID 与 OOD 的可分性。大量实验验证了我们框架的有效性。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-05 05:38:00 UTC 发布时间：2025-08-05 05:38:00 UTC

#122 HiTeC: Hierarchical Contrastive Learning on Text-Attributed Hypergraph with Semantic-Aware Augmentation #122 HiTeC：基于语义感知增强的文本属性超图层次对比学习

Authors: [Mengting Pan](https://arxiv.org/search/?searchtype=author&query=Mengting Pan), [Fan Li](https://arxiv.org/search/?searchtype=author&query=Fan Li), [Xiaoyang Wang](https://arxiv.org/search/?searchtype=author&query=Xiaoyang Wang), [Wenjie Zhang](https://arxiv.org/search/?searchtype=author&query=Wenjie Zhang), [Xuemin Lin](https://arxiv.org/search/?searchtype=author&query=Xuemin Lin) 作者：潘梦婷，李凡，王晓阳，张文杰，林学民

Contrastive learning (CL) has become a dominant paradigm for self-supervised hypergraph learning, enabling effective training without costly labels. However, node entities in real-world hypergraphs are often associated with rich textual information, which is overlooked in prior works. Directly applying existing CL-based methods to such text-attributed hypergraphs (TAHGs) leads to three key limitations: (1) The common use of graph-agnostic text encoders overlooks the correlations between textual content and hypergraph topology, resulting in suboptimal representations. (2) Their reliance on random data augmentations introduces noise and weakens the contrastive objective. (3) The primary focus on node- and hyperedge-level contrastive signals limits the ability to capture long-range dependencies, which is essential for expressive representation learning. Although HyperBERT pioneers CL on TAHGs, its co-training paradigm suffers from poor scalability. To fill the research gap, we introduce HiTeC, a two-stage hierarchical contrastive learning framework with semantic-aware augmentation for scalable and effective self-supervised learning on TAHGs. In the first stage, we pre-train the text encoder with a structure-aware contrastive objective to overcome the graph-agnostic nature of conventional methods. In the second stage, we introduce two semantic-aware augmentation strategies, including prompt-enhanced text augmentation and semantic-aware hyperedge drop, to facilitate informative view generation. Furthermore, we propose a multi-scale contrastive loss that extends existing objectives with an s-walk-based subgraph-level contrast to better capture long-range dependencies. By decoupling text encoder pretraining from hypergraph contrastive learning, this two-stage design enhances scalability without compromising representation quality. Extensive experiments confirm the effectiveness of HiTeC. 对比学习（CL）已成为自监督超图学习的主导范式，实现了无需昂贵标签的有效训练。然而，现实世界中的超图节点实体通常伴随丰富的文本信息，而先前的工作对此多有忽视。将现有基于 CL 的方法直接应用于此类文本属性超图（TAHGs）会导致三个关键限制：（1）常用的与图无关的文本编码器忽略了文本内容与超图拓扑结构之间的关联，导致表示效果不佳；（2）依赖随机数据增强引入噪声，削弱了对比目标；（3）主要关注节点级和超边级的对比信号，限制了捕捉长距离依赖的能力，而这对于表达性表示学习至关重要。尽管 HyperBERT 开创了 TAHGs 上的 CL，但其协同训练范式扩展性较差。为填补这一研究空白，我们提出了 HiTeC，一种具有语义感知增强的两阶段分层对比学习框架，实现了 TAHGs 上可扩展且高效的自监督学习。在第一阶段，我们使用结构感知的对比目标对文本编码器进行预训练，以克服传统方法中图无关的特性。在第二阶段，我们引入了两种语义感知的增强策略，包括提示增强的文本增强和语义感知的超边丢弃，以促进信息丰富的视图生成。此外，我们提出了一种多尺度对比损失，利用基于 s -游走的子图级对比扩展现有目标，更好地捕捉长距离依赖。通过将文本编码器预训练与超图对比学习解耦，这种两阶段设计在不牺牲表示质量的前提下提升了可扩展性。大量实验验证了 HiTeC 的有效性。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-05 05:32:32 UTC 发布时间：2025-08-05 05:32:32 UTC

#123 Using the NANDA Index Architecture in Practice: An Enterprise Perspective #123 实践中使用 NANDA 索引架构：企业视角

The proliferation of autonomous AI agents represents a paradigmatic shift from traditional web architectures toward collaborative intelligent systems requiring sophisticated mechanisms for discovery, authentication, capability verification, and secure collaboration across heterogeneous protocol environments. This paper presents a comprehensive framework addressing the fundamental infrastructure requirements for secure, trustworthy, and interoperable AI agent ecosystems. We introduce the NANDA (Networked AI Agents in a Decentralized Architecture) framework, providing global agent discovery, cryptographically verifiable capability attestation through AgentFacts, and cross-protocol interoperability across Anthropic’s Modal Context Protocol (MCP), Google’s Agent-to-Agent (A2A), Microsoft’s NLWeb, and standard HTTPS communications. NANDA implements Zero Trust Agentic Access (ZTAA) principles, extending traditional Zero Trust Network Access (ZTNA) to address autonomous agent security challenges including capability spoofing, impersonation attacks, and sensitive data leakage. The framework defines Agent Visibility and Control (AVC) mechanisms enabling enterprise governance while maintaining operational autonomy and regulatory compliance. Our approach transforms isolated AI agents into an interconnected ecosystem of verifiable, trustworthy intelligent services, establishing foundational infrastructure for large-scale autonomous agent deployment across enterprise and consumer environments. This work addresses the critical gap between current AI agent capabilities and infrastructure requirements for secure, scalable, multi-agent collaboration, positioning the foundation for next-generation autonomous intelligent systems. 自主 AI 代理的普及代表了从传统网络架构向协作智能系统的范式转变，这类系统需要复杂的机制来实现发现、认证、能力验证以及跨异构协议环境的安全协作。本文提出了一个全面的框架，解决安全、可信且可互操作的 AI 代理生态系统的基础设施需求。我们介绍了 NANDA（去中心化架构中的网络化 AI 代理）框架，提供全球代理发现、通过 AgentFacts 进行的加密可验证能力证明，以及跨 Anthropic 的 Modal Context Protocol（MCP）、Google 的 Agent-to-Agent（A2A）、Microsoft 的 NLWeb 和标准 HTTPS 通信的跨协议互操作性。NANDA 实现了零信任代理访问（ZTAA）原则，扩展了传统的零信任网络访问（ZTNA），以应对自主代理的安全挑战，包括能力伪造、冒充攻击和敏感数据泄露。该框架定义了代理可见性与控制（AVC）机制，实现企业治理的同时保持运营自主性和合规性。我们的方法将孤立的 AI 代理转变为一个相互连接的、可验证且可信赖的智能服务生态系统，建立了面向企业和消费者环境中大规模自主代理部署的基础设施。此项工作解决了当前 AI 代理能力与安全、可扩展、多代理协作基础设施需求之间的关键差距，为下一代自主智能系统奠定了基础。

Subjects: Networking and Internet Architecture, Artificial Intelligence, Multiagent Systems 主题：网络与互联网架构，人工智能，多智能体系统

Publish: 2025-08-05 05:27:27 UTC 发布时间：2025-08-05 05:27:27 UTC

#124 VFLAIR-LLM: A Comprehensive Framework and Benchmark for Split Learning of LLMs #124 VFLAIR-LLM：LLMs 分割学习的综合框架与基准

Authors: [Zixuan Gu](https://arxiv.org/search/?searchtype=author&query=Zixuan Gu), [Qiufeng Fan](https://arxiv.org/search/?searchtype=author&query=Qiufeng Fan), [Long Sun](https://arxiv.org/search/?searchtype=author&query=Long Sun), [Yang Liu](https://arxiv.org/search/?searchtype=author&query=Yang Liu), [Xiaojun Ye](https://arxiv.org/search/?searchtype=author&query=Xiaojun Ye) 作者：顾子轩，范秋峰，孙龙，刘洋，叶晓军

With the advancement of Large Language Models (LLMs), LLM applications have expanded into a growing number of fields. However, users with data privacy concerns face limitations in directly utilizing LLM APIs, while private deployments incur significant computational demands. This creates a substantial challenge in achieving secure LLM adaptation under constrained local resources. To address this issue, collaborative learning methods, such as Split Learning (SL), offer a resource-efficient and privacy-preserving solution for adapting LLMs to private domains. In this study, we introduce VFLAIR-LLM (available at https://github.com/FLAIR-THU/VFLAIR-LLM), an extensible and lightweight split learning framework for LLMs, enabling privacy-preserving LLM inference and fine-tuning in resource-constrained environments. Our library provides two LLM partition settings, supporting three task types and 18 datasets. In addition, we provide standard modules for implementing and evaluating attacks and defenses. We benchmark 5 attacks and 9 defenses under various Split Learning for LLM(SL-LLM) settings, offering concrete insights and recommendations on the choice of model partition configurations, defense strategies, and relevant hyperparameters for real-world applications. 随着大型语言模型（LLMs）的发展，LLM 应用已扩展到越来越多的领域。然而，关注数据隐私的用户在直接使用 LLM API 时面临限制，而私有部署则需要大量计算资源。这在资源受限的本地环境下实现安全的 LLM 适配带来了重大挑战。为了解决这一问题，协作学习方法如分割学习（Split Learning，SL）提供了一种资源高效且保护隐私的解决方案，用于将 LLM 适配到私有领域。在本研究中，我们介绍了 VFLAIR-LLM（可在 https://github.com/FLAIR-THU/VFLAIR-LLM 获取），这是一个可扩展且轻量的 LLM 分割学习框架，支持在资源受限环境下进行隐私保护的 LLM 推理和微调。我们的库提供了两种 LLM 分割设置，支持三种任务类型和 18 个数据集。此外，我们还提供了用于实现和评估攻击与防御的标准模块。我们在各种用于 LLM 的分割学习（SL-LLM）设置下，对 5 种攻击和 9 种防御进行了基准测试，提供了关于模型分割配置选择、防御策略及相关超参数的具体见解和建议，适用于实际应用。

Subjects: Cryptography and Security, Artificial Intelligence 主题：密码学与安全，人工智能

Publish: 2025-08-05 05:20:33 UTC 发布时间：2025-08-05 05:20:33 UTC

#125 A Survey of AI Agent Registry Solutions #125 AI 代理注册解决方案综述

As As autonomous AI agents scale across cloud, enterprise, and decentralized environments, the need for standardized registry systems to support discovery, identity, and capability sharing has become essential. This paper surveys three prominent registry approaches each defined by a unique metadata model: MCP’s mcp.json, A2A’s Agent Card, and NANDA’s AgentFacts. MCP uses a centralized metaregistry with GitHub authenticated publishing and structured metadata for server discovery. A2A enables decentralized interaction via JSON-based Agent Cards, discoverable through well-known URIs, curated catalogs, or direct configuration. NANDA Index introduces AgentFacts, a cryptographically verifiable and privacy-preserving metadata model designed for dynamic discovery, credentialed capabilities, and cross-domain interoperability. These approaches are compared across four dimensions: security, scalability, authentication, and maintainability. The paper concludes with suggestions and recommendations to guide future design and adoption of registry systems for the Internet of AI Agents. 随着自主 AI 代理在云端、企业和去中心化环境中的扩展，支持发现、身份和能力共享的标准化注册系统的需求变得至关重要。本文调查了三种以独特元数据模型定义的主要注册方法：MCP 的 mcp.json、A2A 的 Agent Card 和 NANDA 的 AgentFacts。MCP 使用集中式元注册表，结合 GitHub 认证发布和结构化元数据进行服务器发现。A2A 通过基于 JSON 的 Agent Card 实现去中心化交互，这些卡片可通过知名 URI、策划目录或直接配置进行发现。NANDA Index 引入了 AgentFacts，这是一种可加密验证且保护隐私的元数据模型，旨在实现动态发现、凭证能力和跨域互操作性。本文从安全性、可扩展性、认证和可维护性四个维度对这些方法进行了比较。最后，文章提出了指导未来 AI 代理互联网注册系统设计和采用的建议和推荐。

Subjects: Networking and Internet Architecture, Artificial Intelligence, Multiagent Systems 主题：网络与互联网架构，人工智能，多智能体系统

Publish: 2025-08-05 05:17:18 UTC 发布时间：2025-08-05 05:17:18 UTC

#126 Optimizing Bipedal Locomotion for The 100m Dash With Comparison to Human Running #126 优化双足行走以适应 100 米短跑并与人类跑步进行比较

Authors: [Devin Crowley](https://arxiv.org/search/?searchtype=author&query=Devin Crowley), [Jeremy Dao](https://arxiv.org/search/?searchtype=author&query=Jeremy Dao), [Helei Duan](https://arxiv.org/search/?searchtype=author&query=Helei Duan), [Kevin Green](https://arxiv.org/search/?searchtype=author&query=Kevin Green), [Jonathan Hurst](https://arxiv.org/search/?searchtype=author&query=Jonathan Hurst), [Alan Fern](https://arxiv.org/search/?searchtype=author&query=Alan Fern) 作者：Devin Crowley、Jeremy Dao、Helei Duan、Kevin Green、Jonathan Hurst、Alan Fern

In this paper, we explore the space of running gaits for the bipedal robot Cassie. Our first contribution is to present an approach for optimizing gait efficiency across a spectrum of speeds with the aim of enabling extremely high-speed running on hardware. This raises the question of how the resulting gaits compare to human running mechanics, which are known to be highly efficient in comparison to quadrupeds. Our second contribution is to conduct this comparison based on established human biomechanical studies. We find that despite morphological differences between Cassie and humans, key properties of the gaits are highly similar across a wide range of speeds. Finally, our third contribution is to integrate the optimized running gaits into a full controller that satisfies the rules of the real-world task of the 100m dash, including starting and stopping from a standing position. We demonstrate this controller on hardware to establish the Guinness World Record for Fastest 100m by a Bipedal Robot. 在本文中，我们探索了双足机器人 Cassie 的奔跑步态空间。我们的第一个贡献是提出了一种在不同速度范围内优化步态效率的方法，旨在实现硬件上的极高速奔跑。这引出了一个问题，即所得步态与人类奔跑机制的比较，而人类奔跑机制相较于四足动物已知具有极高的效率。我们的第二个贡献是基于已有的人体生物力学研究进行这种比较。我们发现，尽管 Cassie 与人类在形态上存在差异，但在广泛的速度范围内，步态的关键特性高度相似。最后，我们的第三个贡献是将优化后的奔跑步态整合到一个完整的控制器中，该控制器满足 100 米短跑这一现实任务的规则，包括从静止姿势起跑和停止。我们在硬件上演示了该控制器，创下了双足机器人最快 100 米的吉尼斯世界纪录。

Subjects: Robotics, Artificial Intelligence 主题：机器人技术，人工智能

Publish: 2025-08-05 04:39:27 UTC 发布时间：2025-08-05 04:39:27 UTC

#127 Untraceable DeepFakes via Traceable Fingerprint Elimination #127 通过可追踪指纹消除实现无法追踪的深度伪造

Authors: [Jiewei Lai](https://arxiv.org/search/?searchtype=author&query=Jiewei Lai), [Lan Zhang](https://arxiv.org/search/?searchtype=author&query=Lan Zhang), [Chen Tang](https://arxiv.org/search/?searchtype=author&query=Chen Tang), [Pengcheng Sun](https://arxiv.org/search/?searchtype=author&query=Pengcheng Sun), [Xinming Wang](https://arxiv.org/search/?searchtype=author&query=Xinming Wang), [Yunhao Wang](https://arxiv.org/search/?searchtype=author&query=Yunhao Wang) 作者：赖杰伟，张澜，唐晨，孙鹏程，王新明，王云浩

Recent advancements in DeepFakes attribution technologies have significantly enhanced forensic capabilities, enabling the extraction of traces left by generative models (GMs) in images, making DeepFakes traceable back to their source GMs. Meanwhile, several attacks have attempted to evade attribution models (AMs) for exploring their limitations, calling for more robust AMs. However, existing attacks fail to eliminate GMs’ traces, thus can be mitigated by defensive measures. In this paper, we identify that untraceable DeepFakes can be achieved through a multiplicative attack, which can fundamentally eliminate GMs’ traces, thereby evading AMs even enhanced with defensive measures. We design a universal and black-box attack method that trains an adversarial model solely using real data, applicable for various GMs and agnostic to AMs. Experimental results demonstrate the outstanding attack capability and universal applicability of our method, achieving an average attack success rate (ASR) of 97.08% against 6 advanced AMs on DeepFakes generated by 9 GMs. Even in the presence of defensive mechanisms, our method maintains an ASR exceeding 72.39%. Our work underscores the potential challenges posed by multiplicative attacks and highlights the need for more robust AMs. 近年来，DeepFakes 归因技术的进步显著提升了取证能力，使得能够提取生成模型（GMs）在图像中留下的痕迹，从而使 DeepFakes 可追溯到其源生成模型。与此同时，已有多种攻击尝试规避归因模型（AMs），以探索其局限性，呼吁更为鲁棒的归因模型。然而，现有攻击未能消除生成模型的痕迹，因此可以通过防御措施加以缓解。本文指出，通过乘法攻击可以实现不可追踪的 DeepFakes，该方法能够从根本上消除生成模型的痕迹，从而规避即使经过防御增强的归因模型。我们设计了一种通用且黑盒的攻击方法，仅使用真实数据训练对抗模型，适用于多种生成模型且对归因模型无感知。实验结果表明，该方法具备卓越的攻击能力和通用适用性，在由 9 个生成模型生成的 DeepFakes 上，对 6 个先进归因模型的平均攻击成功率（ASR）达到 97.08%。即使在防御机制存在的情况下，我们的方法仍保持超过 72.39%的攻击成功率。我们的工作强调了乘法攻击可能带来的潜在挑战，并突出了对更强大攻击模型的需求。

Subjects: Cryptography and Security, Artificial Intelligence 主题：密码学与安全，人工智能

Publish: 2025-08-05 04:27:57 UTC 发布时间：2025-08-05 04:27:57 UTC

#128 CORE-ReID: Comprehensive Optimization and Refinement through Ensemble fusion in Domain Adaptation for person re-identification #128 CORE-ReID：通过集成融合在领域自适应中对行人再识别进行全面优化和精炼

Authors: [Trinh Quoc Nguyen](https://arxiv.org/search/?searchtype=author&query=Trinh Quoc Nguyen), [Oky Dicky Ardiansyah Prima](https://arxiv.org/search/?searchtype=author&query=Oky Dicky Ardiansyah Prima), [Katsuyoshi Hotta](https://arxiv.org/search/?searchtype=author&query=Katsuyoshi Hotta) 作者：Trinh Quoc Nguyen，Oky Dicky Ardiansyah Prima，Katsuyoshi Hotta

This study introduces a novel framework, “Comprehensive Optimization and Refinement through Ensemble Fusion in Domain Adaptation for Person Re-identification (CORE-ReID)”, to address an Unsupervised Domain Adaptation (UDA) for Person Re-identification (ReID). The framework utilizes CycleGAN to generate diverse data that harmonizes differences in image characteristics from different camera sources in the pre-training stage. In the fine-tuning stage, based on a pair of teacher-student networks, the framework integrates multi-view features for multi-level clustering to derive diverse pseudo labels. A learnable Ensemble Fusion component that focuses on fine-grained local information within global features is introduced to enhance learning comprehensiveness and avoid ambiguity associated with multiple pseudo-labels. Experimental results on three common UDAs in Person ReID demonstrate significant performance gains over state-of-the-art approaches. Additional enhancements, such as Efficient Channel Attention Block and Bidirectional Mean Feature Normalization mitigate deviation effects and adaptive fusion of global and local features using the ResNet-based model, further strengthening the framework. The proposed framework ensures clarity in fusion features, avoids ambiguity, and achieves high ac-curacy in terms of Mean Average Precision, Top-1, Top-5, and Top-10, positioning it as an advanced and effective solution for the UDA in Person ReID. Our codes and models are available at https://github.com/TrinhQuocNguyen/CORE-ReID. 本研究提出了一种新颖的框架“通过集成融合的综合优化与细化在行人重识别领域适应中的应用（CORE-ReID）”，以解决行人重识别（ReID）中的无监督领域适应（UDA）问题。该框架在预训练阶段利用 CycleGAN 生成多样化数据，以协调来自不同摄像头源的图像特征差异。在微调阶段，基于一对师生网络，框架整合多视角特征进行多层次聚类，以生成多样的伪标签。引入了一个可学习的集成融合组件，专注于全局特征中的细粒度局部信息，以增强学习的全面性并避免多伪标签带来的歧义。在行人重识别的三种常见 UDA 任务上的实验结果表明，该方法在性能上显著优于最先进的方法。额外的改进，如高效通道注意力模块和双向均值特征归一化，缓解了偏差效应，并利用基于 ResNet 的模型实现了全局与局部特征的自适应融合，进一步增强了框架。所提出的框架确保了融合特征的清晰性，避免了歧义，并在平均精度均值（Mean Average Precision）、Top-1、Top-5 和 Top-10 指标上实现了高准确率，使其成为行人再识别无监督域适应（UDA）中的先进且有效的解决方案。我们的代码和模型可在 https://github.com/TrinhQuocNguyen/CORE-ReID 获取。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-05 04:25:03 UTC 发布时间：2025-08-05 04:25:03 UTC

#129 VRPO: Rethinking Value Modeling for Robust RL Training under Noisy Supervision #129 VRPO：在噪声监督下重新思考鲁棒强化学习训练中的价值建模

Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题：机器学习，人工智能，计算与语言

Publish: 2025-08-05 04:05:15 UTC 发布时间：2025-08-05 04:05:15 UTC

#130 Uncertainty-Guided Face Matting for Occlusion-Aware Face Transformation #130 基于不确定性的面部抠图用于遮挡感知的面部变换

Authors: [Hyebin Cho](https://arxiv.org/search/?searchtype=author&query=Hyebin Cho), [Jaehyup Lee](https://arxiv.org/search/?searchtype=author&query=Jaehyup Lee) 作者：Hyebin Cho, Jaehyup Lee

Face filters have become a key element of short-form video content, enabling a wide array of visual effects such as stylization and face swapping. However, their performance often degrades in the presence of occlusions, where objects like hands, hair, or accessories obscure the face. To address this limitation, we introduce the novel task of face matting, which estimates fine-grained alpha mattes to separate occluding elements from facial regions. We further present FaceMat, a trimap-free, uncertainty-aware framework that predicts high-quality alpha mattes under complex occlusions. Our approach leverages a two-stage training pipeline: a teacher model is trained to jointly estimate alpha mattes and per-pixel uncertainty using a negative log-likelihood (NLL) loss, and this uncertainty is then used to guide the student model through spatially adaptive knowledge distillation. This formulation enables the student to focus on ambiguous or occluded regions, improving generalization and preserving semantic consistency. Unlike previous approaches that rely on trimaps or segmentation masks, our framework requires no auxiliary inputs making it well-suited for real-time applications. In addition, we reformulate the matting objective by explicitly treating skin as foreground and occlusions as background, enabling clearer compositing strategies. To support this task, we newly constructed CelebAMat, a large-scale synthetic dataset specifically designed for occlusion-aware face matting. Extensive experiments show that FaceMat outperforms state-of-the-art methods across multiple benchmarks, enhancing the visual quality and robustness of face filters in real-world, unconstrained video scenarios. The source code and CelebAMat dataset are available at https://github.com/hyebin-c/FaceMat.git 人脸滤镜已成为短视频内容的关键元素，能够实现多种视觉效果，如风格化和换脸。然而，当手部、头发或配饰等物体遮挡面部时，其性能往往会下降。为了解决这一限制，我们提出了人脸抠图这一新任务，旨在估计细粒度的 alpha 遮罩，以将遮挡元素与面部区域分离开来。我们进一步提出了 FaceMat，一种无需三分图且具备不确定性感知的框架，能够在复杂遮挡下预测高质量的 alpha 遮罩。我们的方法采用两阶段训练流程：首先训练一个教师模型，利用负对数似然（NLL）损失联合估计 alpha 遮罩和每像素不确定性，然后利用该不确定性通过空间自适应知识蒸馏指导学生模型。该方法使学生模型能够专注于模糊或遮挡区域，提升泛化能力并保持语义一致性。与以往依赖三分图或分割掩码的方法不同，我们的框架无需辅助输入，因而非常适合实时应用。此外，我们通过将皮肤明确视为前景、遮挡物视为背景，重新制定了抠图目标，从而实现了更清晰的合成策略。为支持该任务，我们新构建了 CelebAMat，这是一个专门为遮挡感知人脸抠图设计的大规模合成数据集。大量实验表明，FaceMat 在多个基准测试中优于最先进的方法，提升了真实世界无约束视频场景中人脸滤镜的视觉质量和鲁棒性。源代码和 CelebAMat 数据集可在 https://github.com/hyebin-c/FaceMat.git 获取。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-05 04:00:14 UTC 发布时间：2025-08-05 04:00:14 UTC

#131 SkeNa: Learning to Navigate Unseen Environments Based on Abstract Hand-Drawn Maps #131 SkeNa：基于抽象手绘地图学习导航未知环境

A typical human strategy for giving navigation guidance is to sketch route maps based on the environmental layout. Inspired by this, we introduce Sketch map-based visual Navigation (SkeNa), an embodied navigation task in which an agent must reach a goal in an unseen environment using only a hand-drawn sketch map as guidance. To support research for SkeNa, we present a large-scale dataset named SoR, comprising 54k trajectory and sketch map pairs across 71 indoor scenes. In SoR, we introduce two navigation validation sets with varying levels of abstraction in hand-drawn sketches, categorized based on their preservation of spatial scales in the environment, to facilitate future research. To construct SoR, we develop an automated sketch-generation pipeline that efficiently converts floor plans into hand-drawn representations. To solve SkeNa, we propose SkeNavigator, a navigation framework that aligns visual observations with hand-drawn maps to estimate navigation targets. It employs a Ray-based Map Descriptor (RMD) to enhance sketch map valid feature representation using equidistant sampling points and boundary distances. To improve alignment with visual observations, a Dual-Map Aligned Goal Predictor (DAGP) leverages the correspondence between sketch map features and on-site constructed exploration map features to predict goal position and guide navigation. SkeNavigator outperforms prior floor plan navigation methods by a large margin, improving SPL on the high-abstract validation set by 105% relatively. Our code and dataset will be released. 人类进行导航指导的典型策略是根据环境布局绘制路线图。受此启发，我们提出了基于草图地图的视觉导航（SkeNa），这是一项具身导航任务，要求智能体仅使用手绘草图地图作为指导，在未知环境中到达目标。为了支持 SkeNa 的研究，我们发布了一个名为 SoR 的大规模数据集，包含 71 个室内场景中 54,000 条轨迹与草图地图对。在 SoR 中，我们引入了两个导航验证集，草图的抽象程度不同，基于其对环境空间尺度的保留情况进行分类，以促进未来研究。为构建 SoR，我们开发了一个自动草图生成流程，能够高效地将平面图转换为手绘表示。为解决 SkeNa 任务，我们提出了 SkeNavigator 导航框架，该框架将视觉观测与手绘地图对齐以估计导航目标。它采用基于射线的地图描述符（RMD），通过等距采样点和边界距离增强草图地图的有效特征表示。为了提高与视觉观测的一致性，双图对齐目标预测器（DAGP）利用草图地图特征与现场构建的探索地图特征之间的对应关系来预测目标位置并指导导航。SkeNavigator 在楼层平面图导航方法上大幅领先，相较于高抽象验证集，SPL 提升了 105%。我们的代码和数据集将会发布。

Subjects: Robotics, Artificial Intelligence 主题：机器人学，人工智能

Publish: 2025-08-05 03:56:32 UTC 发布时间：2025-08-05 03:56:32 UTC

#132 Tool-integrated Reinforcement Learning for Repo Deep Search #132 集成工具的强化学习用于仓库深度搜索

Authors: [Zexiong Ma](https://arxiv.org/search/?searchtype=author&query=Zexiong Ma), [Chao Peng](https://arxiv.org/search/?searchtype=author&query=Chao Peng), [Qunhong Zeng](https://arxiv.org/search/?searchtype=author&query=Qunhong Zeng), [Pengfei Gao](https://arxiv.org/search/?searchtype=author&query=Pengfei Gao), [Yanzhen Zou](https://arxiv.org/search/?searchtype=author&query=Yanzhen Zou), [Bing Xie](https://arxiv.org/search/?searchtype=author&query=Bing Xie) 作者：马泽雄，彭超，曾群洪，高鹏飞，邹彦臻，谢冰

Issue localization, the process of identifying code locations that need modification to resolve software issues, is a critical yet challenging task in software development. The semantic gap between natural language issue descriptions and faulty code requires complex multi-hop reasoning through code dependencies. Existing LLM-based agents attempt to address this by integrating repository retrieval tools. However, this transforms issue localization into a demanding task we call Repo Deep Search, which requires the LLM to effectively utilize various repository retrieval tools throughout a multi-step reasoning and navigation process. To tackle this challenge, we present ToolTrain, a two-stage tool-integrated training framework combining rejection-sampled supervised fine-tuning and tool-integrated reinforcement learning to enhance LLMs’ ability to use retrieval tools for issue localization. Experimental results show that ToolTrain-trained models achieve state-of-the-art performance, with our 32B model even surpassing Claude-3.7 on function-level localization. The results also show that improved localization performance translates to better end-to-end issue resolution performance. This further demonstrates that training for issue localization is a viable and effective strategy for improving automated software development. 问题定位是识别需要修改以解决软件问题的代码位置的过程，是软件开发中一项关键但具有挑战性的任务。自然语言问题描述与错误代码之间的语义差距需要通过代码依赖关系进行复杂的多跳推理。现有基于 LLM 的代理尝试通过整合代码库检索工具来解决这一问题。然而，这将问题定位转变为我们称之为代码库深度搜索的高难度任务，要求 LLM 在多步骤推理和导航过程中有效利用各种代码库检索工具。为应对这一挑战，我们提出了 ToolTrain，一种结合拒绝采样监督微调和工具集成强化学习的两阶段工具集成训练框架，以提升 LLMs 使用检索工具进行问题定位的能力。实验结果表明，经过 ToolTrain 训练的模型实现了最先进的性能，我们的 32B 模型在函数级定位上甚至超越了 Claude-3.7。结果还显示，定位性能的提升转化为更好的端到端问题解决性能。这进一步证明了针对问题定位的训练是一种可行且有效的提升自动化软件开发的策略。

Subjects: Software Engineering, Artificial Intelligence 主题：软件工程，人工智能

Publish: 2025-08-05 02:44:21 UTC 发布时间：2025-08-05 02:44:21 UTC

#133 Enhancing Long Video Question Answering with Scene-Localized Frame Grouping #133 通过场景定位的帧分组增强长视频问答

Current Multimodal Large Language Models (MLLMs) often perform poorly in long video understanding, primarily due to resource limitations that prevent them from processing all video frames and their associated information. Efficiently extracting relevant information becomes a challenging task. Existing frameworks and evaluation tasks focus on identifying specific frames containing core objects from a large number of irrelevant frames, which does not align with the practical needs of real-world applications. To address this issue, we propose a new scenario under the video question-answering task, SceneQA, which emphasizes scene-based detail perception and reasoning abilities. And we develop the LVSQA dataset to support the SceneQA task, which is built upon carefully selected videos from LVBench and contains a new collection of question-answer pairs to promote a more fair evaluation of MLLMs’ scene perception abilities in long videos. Inspired by human cognition, we introduce a novel method called SLFG. The core idea of SLFG is to combine individual frames into semantically coherent scene frames. By leveraging scene localization methods and dynamic frame reassembly mechanisms, SLFG significantly enhances the understanding capabilities of existing MLLMs in long videos. SLFG requires no modification to the original model architecture and boasts excellent plug-and-play usability. Experimental results show that this method performs exceptionally well in several long video benchmark tests. Code and dataset will be released at http://www.slfg.pkuzwh.cn. 当前的多模态大型语言模型（MLLMs）在长视频理解方面表现往往不佳，主要原因在于资源限制使其无法处理所有视频帧及其相关信息。高效提取相关信息成为一项挑战。现有框架和评估任务侧重于从大量无关帧中识别包含核心对象的特定帧，这与实际应用需求不符。为解决该问题，我们提出了视频问答任务下的新场景——SceneQA，强调基于场景的细节感知和推理能力。并且我们开发了支持 SceneQA 任务的 LVSQA 数据集，该数据集基于从 LVBench 精心挑选的视频构建，包含一组新的问答对，以促进对 MLLMs 在长视频场景感知能力的更公平评估。受人类认知启发，我们引入了一种名为 SLFG 的新方法。SLFG 的核心思想是将单个帧合成为语义连贯的场景帧。通过利用场景定位方法和动态帧重组机制，SLFG 显著提升了现有多模态大语言模型（MLLMs）在长视频中的理解能力。SLFG 无需修改原始模型架构，且具有出色的即插即用性。实验结果表明，该方法在多个长视频基准测试中表现优异。代码和数据集将发布于 http://www.slfg.pkuzwh.cn。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题：计算机视觉与模式识别，人工智能

Publish: 2025-08-05 02:28:58 UTC 发布时间：2025-08-05 02:28:58 UTC

#134 ClinicalFMamba: Advancing Clinical Assessment using Mamba-based Multimodal Neuroimaging Fusion #134 ClinicalFMamba：基于 Mamba 的多模态神经影像融合推进临床评估

Authors: [Meng Zhou](https://arxiv.org/search/?searchtype=author&query=Meng Zhou), [Farzad Khalvati](https://arxiv.org/search/?searchtype=author&query=Farzad Khalvati) 作者：周猛，Farzad Khalvati

Multimodal medical image fusion integrates complementary information from different imaging modalities to enhance diagnostic accuracy and treatment planning. While deep learning methods have advanced performance, existing approaches face critical limitations: Convolutional Neural Networks (CNNs) excel at local feature extraction but struggle to model global context effectively, while Transformers achieve superior long-range modeling at the cost of quadratic computational complexity, limiting clinical deployment. Recent State Space Models (SSMs) offer a promising alternative, enabling efficient long-range dependency modeling in linear time through selective scan mechanisms. Despite these advances, the extension to 3D volumetric data and the clinical validation of fused images remains underexplored. In this work, we propose ClinicalFMamba, a novel end-to-end CNN-Mamba hybrid architecture that synergistically combines local and global feature modeling for 2D and 3D images. We further design a tri-plane scanning strategy for effectively learning volumetric dependencies in 3D images. Comprehensive evaluations on three datasets demonstrate the superior fusion performance across multiple quantitative metrics while achieving real-time fusion. We further validate the clinical utility of our approach on downstream 2D/3D brain tumor classification tasks, achieving superior performance over baseline methods. Our method establishes a new paradigm for efficient multimodal medical image fusion suitable for real-time clinical deployment. 多模态医学图像融合整合了来自不同成像模态的互补信息，以提升诊断准确性和治疗规划。尽管深度学习方法推动了性能提升，现有方法仍面临关键限制：卷积神经网络（CNN）擅长局部特征提取，但难以有效建模全局上下文；而 Transformer 在长距离建模上表现优越，但其二次方计算复杂度限制了临床应用。近期的状态空间模型（SSM）提供了有前景的替代方案，通过选择性扫描机制实现线性时间内的高效长距离依赖建模。尽管取得这些进展，向三维体数据的扩展及融合图像的临床验证仍未充分探索。在本工作中，我们提出了 ClinicalFMamba，一种新颖的端到端 CNN-Mamba 混合架构，协同结合局部与全局特征建模，适用于二维和三维图像。我们进一步设计了三平面扫描策略，有效学习三维图像中的体积依赖关系。在三个数据集上的全面评估表明，我们的方法在多个定量指标上表现出优越的融合性能，同时实现了实时融合。我们进一步验证了该方法在下游二维/三维脑肿瘤分类任务中的临床实用性，性能优于基线方法。我们的方法为高效多模态医学图像融合建立了新的范式，适用于实时临床部署。

Subjects: Image and Video Processing, Artificial Intelligence, Computer Vision and Pattern Recognition 主题：图像与视频处理，人工智能，计算机视觉与模式识别

Publish: 2025-08-05 02:25:53 UTC 发布时间：2025-08-05 02:25:53 UTC

#135 VCNet: Recreating High-Level Visual Cortex Principles for Robust Artificial Vision #135 VCNet：重现高级视觉皮层原理以实现稳健的人工视觉

Authors: [Brennen A. Hill](https://arxiv.org/search/?searchtype=author&query=Brennen A. Hill), [Zhang Xinyu](https://arxiv.org/search/?searchtype=author&query=Zhang Xinyu), [Timothy Putra Prasetio](https://arxiv.org/search/?searchtype=author&query=Timothy Putra Prasetio) 作者：Brennen A. Hill，张新宇，Timothy Putra Prasetio

Despite their success in image classification, modern convolutional neural networks (CNNs) exhibit fundamental limitations, including data inefficiency, poor out-of-distribution generalization, and vulnerability to adversarial perturbations. The primate visual system, in contrast, demonstrates superior efficiency and robustness, suggesting that its architectural principles may offer a blueprint for more capable artificial vision systems. This paper introduces Visual Cortex Network (VCNet), a novel neural network architecture whose design is informed by the macro-scale organization of the primate visual cortex. VCNet emulates key biological mechanisms, including hierarchical processing across distinct cortical areas, dual-stream information segregation, and top-down predictive feedback. We evaluate VCNet on two specialized benchmarks: the Spots-10 animal pattern dataset and a light field image classification task. Our results show that VCNet achieves a classification accuracy of 92.1% on Spots-10 and 74.4% on the light field dataset, surpassing contemporary models of comparable size. This work demonstrates that integrating neuroscientific principles into network design can lead to more efficient and robust models, providing a promising direction for addressing long-standing challenges in machine learning. 尽管现代卷积神经网络（CNN）在图像分类中取得了成功，但它们存在根本性限制，包括数据效率低、对分布外数据的泛化能力差以及对对抗扰动的脆弱性。相比之下，灵长类动物的视觉系统表现出更高的效率和鲁棒性，这表明其架构原理可能为更强大的人工视觉系统提供蓝图。本文介绍了视觉皮层网络（VCNet），这是一种新颖的神经网络架构，其设计基于灵长类动物视觉皮层的宏观组织结构。VCNet 模拟了关键的生物机制，包括跨不同皮层区域的层级处理、双流信息分离以及自上而下的预测反馈。我们在两个专门的基准测试上评估了 VCNet：Spots-10 动物图案数据集和光场图像分类任务。结果显示，VCNet 在 Spots-10 上的分类准确率达到 92.1%，在光场数据集上的准确率为 74.4%，均优于同等规模的现有模型。这项工作表明，将神经科学原理融入网络设计可以带来更高效、更稳健的模型，为解决机器学习中的长期挑战提供了一个有前景的方向。

Subjects: Neural and Evolutionary Computing, Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning 主题：神经与进化计算，人工智能，计算机视觉与模式识别，机器学习

Publish: 2025-08-05 01:52:42 UTC 发布时间：2025-08-05 01:52:42 UTC

#136 GACL: Grounded Adaptive Curriculum Learning with Active Task and Performance Monitoring #136 GACL：基于主动任务和性能监控的有根自适应课程学习

Authors: [Linji Wang](https://arxiv.org/search/?searchtype=author&query=Linji Wang), [Zifan Xu](https://arxiv.org/search/?searchtype=author&query=Zifan Xu), [Peter Stone](https://arxiv.org/search/?searchtype=author&query=Peter Stone), [Xuesu Xiao](https://arxiv.org/search/?searchtype=author&query=Xuesu Xiao) 作者：Linji Wang，Zifan Xu，Peter Stone，Xuesu Xiao

Curriculum learning has emerged as a promising approach for training complex robotics tasks, yet current applications predominantly rely on manually designed curricula, which demand significant engineering effort and can suffer from subjective and suboptimal human design choices. While automated curriculum learning has shown success in simple domains like grid worlds and games where task distributions can be easily specified, robotics tasks present unique challenges: they require handling complex task spaces while maintaining relevance to target domain distributions that are only partially known through limited samples. To this end, we propose Grounded Adaptive Curriculum Learning, a framework specifically designed for robotics curriculum learning with three key innovations: (1) a task representation that consistently handles complex robot task design, (2) an active performance tracking mechanism that allows adaptive curriculum generation appropriate for the robot’s current capabilities, and (3) a grounding approach that maintains target domain relevance through alternating sampling between reference and synthetic tasks. We validate GACL on wheeled navigation in constrained environments and quadruped locomotion in challenging 3D confined spaces, achieving 6.8% and 6.1% higher success rates, respectively, than state-of-the-art methods in each domain. 课程学习已成为训练复杂机器人任务的一种有前景的方法，但当前的应用主要依赖于手工设计的课程，这需要大量的工程投入，并且可能受到主观且次优的人为设计选择的影响。虽然自动课程学习在网格世界和游戏等简单领域取得了成功，这些领域的任务分布可以轻松指定，但机器人任务面临独特挑战：它们需要处理复杂的任务空间，同时保持与目标域分布的相关性，而目标域分布仅通过有限样本部分已知。为此，我们提出了基于实地的自适应课程学习（Grounded Adaptive Curriculum Learning），这是一个专门为机器人课程学习设计的框架，具有三个关键创新：（1）一种能够一致处理复杂机器人任务设计的任务表示，（2）一种允许根据机器人当前能力自适应生成课程的主动性能跟踪机制，以及（3）一种通过在参考任务和合成任务之间交替采样来保持目标域相关性的实地方法。我们在受限环境中的轮式导航和复杂三维受限空间中的四足行走上验证了 GACL，分别比各领域的最先进方法实现了 6.8%和 6.1%的成功率提升。

Subjects: Robotics, Artificial Intelligence 主题：机器人技术，人工智能

Publish: 2025-08-05 01:32:37 UTC 发布时间：2025-08-05 01:32:37 UTC

#137 Autonomous Inorganic Materials Discovery via Multi-Agent Physics-Aware Scientific Reasoning #137 通过多智能体物理感知科学推理实现自主无机材料发现

Authors: [Alireza Ghafarollahi](https://arxiv.org/search/?searchtype=author&query=Alireza Ghafarollahi), [Markus J. Buehler](https://arxiv.org/search/?searchtype=author&query=Markus J. Buehler) 作者：Alireza Ghafarollahi，Markus J. Buehler

Conventional machine learning approaches accelerate inorganic materials design via accurate property prediction and targeted material generation, yet they operate as single-shot models limited by the latent knowledge baked into their training data. A central challenge lies in creating an intelligent system capable of autonomously executing the full inorganic materials discovery cycle, from ideation and planning to experimentation and iterative refinement. We introduce SparksMatter, a multi-agent AI model for automated inorganic materials design that addresses user queries by generating ideas, designing and executing experimental workflows, continuously evaluating and refining results, and ultimately proposing candidate materials that meet the target objectives. SparksMatter also critiques and improves its own responses, identifies research gaps and limitations, and suggests rigorous follow-up validation steps, including DFT calculations and experimental synthesis and characterization, embedded in a well-structured final report. The model’s performance is evaluated across case studies in thermoelectrics, semiconductors, and perovskite oxides materials design. The results demonstrate the capacity of SparksMatter to generate novel stable inorganic structures that target the user’s needs. Benchmarking against frontier models reveals that SparksMatter consistently achieves higher scores in relevance, novelty, and scientific rigor, with a significant improvement in novelty across multiple real-world design tasks as assessed by a blinded evaluator. These results demonstrate SparksMatter’s unique capacity to generate chemically valid, physically meaningful, and creative inorganic materials hypotheses beyond existing materials knowledge. 传统的机器学习方法通过准确的性能预测和有针对性的材料生成，加速了无机材料的设计，但它们作为一次性模型，受限于训练数据中隐含的知识。一个核心挑战在于创建一个智能系统，能够自主执行完整的无机材料发现周期，从构思和规划到实验和迭代优化。我们提出了 SparksMatter，一种多智能体 AI 模型，用于自动化无机材料设计，能够通过生成创意、设计并执行实验流程、持续评估和优化结果，最终提出满足目标要求的候选材料。SparksMatter 还会批判并改进自身的响应，识别研究空白和局限性，建议严格的后续验证步骤，包括 DFT 计算以及实验合成和表征，并将这些内容整合在结构良好的最终报告中。该模型的性能通过热电材料、半导体和钙钛矿氧化物材料设计的案例研究进行了评估。结果表明，SparksMatter 具备生成满足用户需求的新型稳定无机结构的能力。与前沿模型的基准测试显示，SparksMatter 在相关性、新颖性和科学严谨性方面始终取得更高的评分，且在多个真实设计任务中由盲评者评估的新颖性显著提升。这些结果展示了 SparksMatter 在生成化学有效、物理有意义且富有创造性的无机材料假设方面的独特能力，超越了现有的材料知识。

Subjects: Materials Science, Disordered Systems and Neural Networks, Mesoscale and Nanoscale Physics, Artificial Intelligence, Machine Learning 主题：材料科学，非晶态系统与神经网络，中尺度与纳米尺度物理，人工智能，机器学习

Publish: 2025-08-04 23:40:43 UTC 发布时间：2025-08-04 23:40:43 UTC

#138 AeroSafe: Mobile Indoor Air Purification using Aerosol Residence Time Analysis and Robotic Cough Emulator Testbed #138 AeroSafe：基于气溶胶停留时间分析和机器人咳嗽模拟测试平台的移动室内空气净化

Authors: [M Tanjid Hasan Tonmoy](https://arxiv.org/search/?searchtype=author&query=M Tanjid Hasan Tonmoy), [Rahath Malladi](https://arxiv.org/search/?searchtype=author&query=Rahath Malladi), [Kaustubh Singh](https://arxiv.org/search/?searchtype=author&query=Kaustubh Singh), [Forsad Al Hossain](https://arxiv.org/search/?searchtype=author&query=Forsad Al Hossain), [Rajesh Gupta](https://arxiv.org/search/?searchtype=author&query=Rajesh Gupta), [Andrés E. Tejada-Martínez](https://arxiv.org/search/?searchtype=author&query=Andrés E. Tejada-Martínez), [Tauhidur Rahman](https://arxiv.org/search/?searchtype=author&query=Tauhidur Rahman) 作者：M Tanjid Hasan Tonmoy，Rahath Malladi，Kaustubh Singh，Forsad Al Hossain，Rajesh Gupta，Andrés E. Tejada-Martínez，Tauhidur Rahman

Indoor air quality plays an essential role in the safety and well-being of occupants, especially in the context of airborne diseases. This paper introduces AeroSafe, a novel approach aimed at enhancing the efficacy of indoor air purification systems through a robotic cough emulator testbed and a digital-twins-based aerosol residence time analysis. Current portable air filters often overlook the concentrations of respiratory aerosols generated by coughs, posing a risk, particularly in high-exposure environments like healthcare facilities and public spaces. To address this gap, we present a robotic dual-agent physical emulator comprising a maneuverable mannequin simulating cough events and a portable air purifier autonomously responding to aerosols. The generated data from this emulator trains a digital twins model, combining a physics-based compartment model with a machine learning approach, using Long Short-Term Memory (LSTM) networks and graph convolution layers. Experimental results demonstrate the model’s ability to predict aerosol concentration dynamics with a mean residence time prediction error within 35 seconds. The proposed system’s real-time intervention strategies outperform static air filter placement, showcasing its potential in mitigating airborne pathogen risks. 室内空气质量在居住者的安全和健康中起着至关重要的作用，尤其是在空气传播疾病的背景下。本文介绍了 AeroSafe，这是一种旨在通过机器人咳嗽模拟测试平台和基于数字孪生的气溶胶停留时间分析来提升室内空气净化系统效率的新方法。目前的便携式空气过滤器常常忽视咳嗽产生的呼吸道气溶胶浓度，这在医疗设施和公共场所等高暴露环境中尤其存在风险。为了解决这一问题，我们提出了一种机器人双代理物理模拟器，包括一个可操控的模拟人体模型，用于模拟咳嗽事件，以及一个能够自主响应气溶胶的便携式空气净化器。该模拟器生成的数据用于训练数字孪生模型，该模型结合了基于物理的舱室模型和机器学习方法，采用长短期记忆网络（LSTM）和图卷积层。实验结果表明，该模型能够预测气溶胶浓度动态，平均停留时间预测误差在 35 秒以内。所提系统的实时干预策略优于静态空气过滤器的布置，展示了其在降低空气传播病原体风险方面的潜力。

Subjects: Robotics, Artificial Intelligence 主题：机器人技术，人工智能

Publish: 2025-08-04 23:11:37 UTC 发布时间：2025-08-04 23:11:37 UTC

#139 LLM-based IR-system for Bank Supervisors #139 基于 LLM 的银行监管信息检索系统

Author: [Ilias Aarab](https://arxiv.org/search/?searchtype=author&query=Ilias Aarab) 作者：Ilias Aarab

Bank supervisors face the complex task of ensuring that new measures are consistently aligned with historical precedents. To address this challenge, we introduce a novel Information Retrieval (IR) System tailored to assist supervisors in drafting both consistent and effective measures. This system ingests findings from on-site investigations. It then retrieves the most relevant historical findings and their associated measures from a comprehensive database, providing a solid basis for supervisors to write well-informed measures for new findings. Utilizing a blend of lexical, semantic, and Capital Requirements Regulation (CRR) fuzzy set matching techniques, the IR system ensures the retrieval of findings that closely align with current cases. The performance of this system, particularly in scenarios with partially labeled data, is validated through a Monte Carlo methodology, showcasing its robustness and accuracy. Enhanced by a Transformer-based Denoising AutoEncoder for fine-tuning, the final model achieves a Mean Average Precision (MAP@100) of 0.83 and a Mean Reciprocal Rank (MRR@100) of 0.92. These scores surpass those of both standalone lexical models such as BM25 and semantic BERT-like models. 银行监管人员面临确保新措施与历史先例保持一致的复杂任务。为应对这一挑战，我们引入了一种新颖的信息检索（IR）系统，专门帮助监管人员起草既一致又有效的措施。该系统输入现场调查的发现，然后从全面的数据库中检索最相关的历史发现及其相关措施，为监管人员撰写针对新发现的充分依据。该 IR 系统结合了词汇、语义和资本要求条例（CRR）模糊集匹配技术，确保检索到与当前案例高度匹配的发现。通过蒙特卡洛方法验证了该系统在部分标注数据场景下的性能，展示了其稳健性和准确性。通过基于 Transformer 的去噪自编码器进行微调增强，最终模型实现了 0.83 的平均精确度均值（MAP@100）和 0.92 的平均倒数排名均值（MRR@100）。这些分数超过了单独的词汇模型如 BM25 和语义 BERT 类模型。

Subjects: Information Retrieval, Artificial Intelligence, Machine Learning, Applications, Computation 主题：信息检索，人工智能，机器学习，应用，计算

Publish: 2025-08-04 23:02:01 UTC 发布时间：2025-08-04 23:02:01 UTC

#140 Can LLMs Generate High-Quality Task-Specific Conversations? #140 LLMs 能生成高质量的特定任务对话吗？

本文介绍了一个用于控制大型语言模型中对话质量的参数化框架。我们探讨了涵盖六个维度的九个关键参数，这些参数能够精确指定对话属性。通过对最先进的 LLMs 进行实验，我们证明了基于参数的控制能够在生成的对话属性上产生统计学上显著的差异。我们的方法解决了对话生成中的诸多挑战，包括话题连贯性、知识进展、角色一致性以及控制粒度。该框架提供了一种标准化的对话质量控制方法，适用于教育、治疗、客户服务和娱乐等领域。未来的工作将侧重于通过架构修改实现更多参数，并开发用于评估的基准数据集。

发布：2025-08-04 22:07:08 UTC

#141 Realizing Scaling Laws in Recommender Systems: A Foundation-Expert Paradigm for Hyperscale Model Deployment #141 在推荐系统中实现规模定律：用于超大规模模型部署的基础-专家范式

While scaling laws promise significant performance gains for recommender systems, efficiently deploying hyperscale models remains a major unsolved challenge. In contrast to fields where FMs are already widely adopted such as natural language processing and computer vision, progress in recommender systems is hindered by unique challenges including the need to learn from online streaming data under shifting data distributions, the need to adapt to different recommendation surfaces with a wide diversity in their downstream tasks and their input distributions, and stringent latency and computational constraints. To bridge this gap, we propose to leverage the Foundation-Expert Paradigm: a framework designed for the development and deployment of hyperscale recommendation FMs. In our approach, a central FM is trained on lifelong, cross-surface, multi-modal user data to learn generalizable knowledge. This knowledge is then efficiently transferred to various lightweight, surface-specific ``expert” models via target-aware embeddings, allowing them to adapt to local data distributions and optimization goals with minimal overhead. To meet our training, inference and development needs, we built HyperCast, a production-grade infrastructure system that re-engineers training, serving, logging and iteration to power this decoupled paradigm. Our approach is now deployed at Meta serving tens of billions of user requests daily, demonstrating online metric improvements over our previous one-stage production system while improving developer velocity and maintaining infrastructure efficiency. To the best of our knowledge, this work represents the first successful deployment of a Foundation-Expert paradigm at this scale, offering a proven, compute-efficient, and developer-friendly blueprint to realize the promise of scaling laws in recommender systems. 虽然规模定律为推荐系统带来了显著的性能提升潜力，但高效部署超大规模模型仍然是一个尚未解决的重大挑战。与自然语言处理和计算机视觉等领域中基础模型（FM）已被广泛采用不同，推荐系统的进展受到独特挑战的制约，包括需要在不断变化的数据分布下从在线流数据中学习、需要适应具有多样化下游任务和输入分布的不同推荐场景，以及严格的延迟和计算限制。为弥合这一差距，我们提出利用基础-专家范式：一个专为超大规模推荐基础模型的开发和部署设计的框架。在我们的方法中，中央基础模型在终身、跨场景、多模态的用户数据上进行训练，以学习可泛化的知识。然后，这些知识通过目标感知嵌入高效地转移到各种轻量级、特定场景的“专家”模型，使其能够以最小的开销适应本地数据分布和优化目标。为了满足我们的训练、推理和开发需求，我们构建了 HyperCast，这是一套生产级基础设施系统，重新设计了训练、服务、日志记录和迭代流程，以支持这种解耦范式。我们的方法现已在 Meta 部署，每天处理数百亿用户请求，在线指标较我们之前的单阶段生产系统有所提升，同时提高了开发者的工作效率并保持了基础设施的高效性。据我们所知，这项工作代表了 Foundation-Expert 范式在此规模上的首次成功部署，提供了一个经过验证的、计算高效且对开发者友好的蓝图，以实现推荐系统中规模定律的承诺。

Subjects: Information Retrieval, Artificial Intelligence, Machine Learning 主题：信息检索，人工智能，机器学习

Publish: 2025-08-04 22:03:13 UTC 发布时间：2025-08-04 22:03:13 UTC

#142 GrandJury: A Collaborative Machine Learning Model Evaluation Protocol for Dynamic Quality Rubrics #142 GrandJury：一种用于动态质量评分标准的协作机器学习模型评估协议

Author: [Arthur Cho](https://arxiv.org/search/?searchtype=author&query=Arthur Cho) 作者：Arthur Cho

Generative Machine Learning models have become central to modern systems, powering applications in creative writing, summarization, multi-hop reasoning, and context-aware dialogue. These models underpin large-scale AI assistants, workflow automation, and autonomous decision-making. In such domains, acceptable response is rarely absolute or static, but plural and highly context-dependent. Yet standard evaluation regimes still rely on static, benchmark-style tests, incentivizing optimization toward leaderboard scores rather than alignment with dynamic user needs or evolving realities. GrandJury introduces a formal evaluation protocol combining time-decayed aggregation, complete traceability, with the support of dynamic, transparent task rubric attribution, and multi-rater human judgment. Together, these elements enable pluralistic, accountable evaluation that captures evolving consensus and surfaces disagreement. We provide an open-source implementation (grandjury PyPI package) and a public collection of Large Language Model (LLM) inference outputs to illustrate the need and method. GrandJury provides a new paradigm for AI practitioners when evaluating machine learning outputs without absolute ground truth. 生成式机器学习模型已成为现代系统的核心，驱动着创意写作、摘要、多跳推理和上下文感知对话等应用。这些模型支撑着大规模 AI 助手、工作流自动化和自主决策。在这些领域中，合适的响应很少是绝对或静态的，而是多元且高度依赖上下文。然而，标准的评估体系仍依赖于静态的基准测试，激励模型优化排行榜分数，而非与动态用户需求或不断变化的现实保持一致。GrandJury 引入了一种正式的评估协议，结合了时间衰减聚合、完整的可追溯性，并支持动态、透明的任务评分标准归因以及多评审人类判断。这些元素共同实现了多元化、负责任的评估，捕捉不断演变的共识并揭示分歧。我们提供了一个开源实现（grandjury PyPI 包）和一个公开的大型语言模型（LLM）推理输出集合，以展示该需求和方法。 GrandJury 为人工智能从业者在没有绝对真实标签的情况下评估机器学习输出提供了一种新范式。

Subjects: Machine Learning, Artificial Intelligence, Human-Computer Interaction 主题：机器学习，人工智能，人机交互

Publish: 2025-08-04 22:00:44 UTC 发布时间：2025-08-04 22:00:44 UTC

#143 Following Route Instructions using Large Vision-Language Models: A Comparison between Low-level and Panoramic Action Spaces #143 使用大型视觉语言模型遵循路线指令：低级动作空间与全景动作空间的比较

Publish: 2025-08-04 21:45:21 UTC 发布时间：2025-08-04 21:45:21 UTC

#144 Engineered over Emergent Communication in MARL for Scalable and Sample-Efficient Cooperative Task Allocation in a Partially Observable Grid #144 在多智能体强化学习中针对部分可观测网格中可扩展且样本高效的协作任务分配，工程化优于自发通信

Authors: [Brennen A. Hill](https://arxiv.org/search/?searchtype=author&query=Brennen A. Hill), [Mant Koh En Wei](https://arxiv.org/search/?searchtype=author&query=Mant Koh En Wei), [Thangavel Jishnuanandh](https://arxiv.org/search/?searchtype=author&query=Thangavel Jishnuanandh) 作者：Brennen A. Hill，Mant Koh En Wei，Thangavel Jishnuanandh

We compare the efficacy of learned versus engineered communication strategies in a cooperative multi-agent reinforcement learning (MARL) environment. For the learned approach, we introduce Learned Direct Communication (LDC), where agents generate messages and actions concurrently via a neural network. Our engineered approach, Intention Communication, employs an Imagined Trajectory Generation Module (ITGM) and a Message Generation Network (MGN) to formulate messages based on predicted future states. Both strategies are evaluated on their success rates in cooperative tasks under fully and partially observable conditions. Our findings indicate that while emergent communication is viable, the engineered approach demonstrates superior performance and scalability, particularly as environmental complexity increases. 我们比较了在合作多智能体强化学习（MARL）环境中，学习型通信策略与设计型通信策略的效果。对于学习型方法，我们引入了学习直接通信（LDC），其中智能体通过神经网络同时生成消息和动作。我们的设计型方法——意图通信，采用了想象轨迹生成模块（ITGM）和消息生成网络（MGN），基于预测的未来状态来构建消息。两种策略均在完全可观测和部分可观测条件下的合作任务成功率上进行了评估。我们的研究结果表明，虽然自发通信是可行的，但设计型方法表现出更优的性能和可扩展性，尤其是在环境复杂度增加时。

Subjects: Multiagent Systems, Artificial Intelligence, Machine Learning, Systems and Control 主题：多智能体系统，人工智能，机器学习，系统与控制

Publish: 2025-08-04 21:29:07 UTC 发布时间：2025-08-04 21:29:07 UTC

#145 CauKer: classification time series foundation models can be pretrained on synthetic data only #145 CauKer：分类时间序列基础模型可以仅在合成数据上进行预训练

Authors: [Shifeng Xie](https://arxiv.org/search/?searchtype=author&query=Shifeng Xie), [Vasilii Feofanov](https://arxiv.org/search/?searchtype=author&query=Vasilii Feofanov), [Marius Alonso](https://arxiv.org/search/?searchtype=author&query=Marius Alonso), [Ambroise Odonnat](https://arxiv.org/search/?searchtype=author&query=Ambroise Odonnat), [Jianfeng Zhang](https://arxiv.org/search/?searchtype=author&query=Jianfeng Zhang), [Themis Palpanas](https://arxiv.org/search/?searchtype=author&query=Themis Palpanas), [Ievgen Redko](https://arxiv.org/search/?searchtype=author&query=Ievgen Redko) 作者：谢世峰，瓦西里·费奥法诺夫，马里乌斯·阿隆索，安布鲁瓦兹·奥多纳，张建峰，特米斯·帕尔帕纳斯，叶夫根·雷德科

Time series foundation models (TSFMs) have recently gained significant attention due to their strong zero-shot capabilities and widespread real-world applications. Such models typically require a computationally costly pretraining on large-scale, carefully curated collections of real-world sequences. To allow for a sample-efficient pretraining of TSFMs, we propose CauKer, a novel algorithm designed to generate diverse, causally coherent synthetic time series with realistic trends, seasonality, and nonlinear interactions. CauKer combines Gaussian Process (GP) kernel composition with Structural Causal Models (SCM) to produce data for sample-efficient pretraining of state-of-the-art classification TSFMs having different architectures and following different pretraining approaches. Additionally, our experiments reveal that CauKer-generated datasets exhibit clear scaling laws for both dataset size (10K to 10M samples) and model capacity (1M to 783M parameters), unlike real-world datasets, which display irregular scaling behavior. 时间序列基础模型（TSFMs）因其强大的零样本能力和广泛的实际应用而受到极大关注。这类模型通常需要在大规模、精心策划的真实序列集合上进行计算成本高昂的预训练。为了实现 TSFMs 的样本高效预训练，我们提出了 CauKer，一种新颖算法，旨在生成多样且因果一致的合成时间序列，具备真实的趋势、季节性和非线性交互。CauKer 结合了高斯过程（GP）核组合与结构因果模型（SCM），用于为不同架构和不同预训练方法的最先进分类 TSFMs 生成样本高效的预训练数据。此外，我们的实验表明，CauKer 生成的数据集在数据集规模（从 1 万到 1000 万样本）和模型容量（从 100 万到 7.83 亿参数）方面均表现出明显的规模定律，而真实世界数据集则表现出不规则的规模行为。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-04 20:18:31 UTC 发布时间：2025-08-04 20:18:31 UTC

#146 Beyond Least Squares: Robust Regression Transformer (R2T) #146 超越最小二乘法：鲁棒回归变换器（R2T）

Authors: [Roman Gutierrez](https://arxiv.org/search/?searchtype=author&query=Roman Gutierrez), [Tony Kai Tang](https://arxiv.org/search/?searchtype=author&query=Tony Kai Tang), [Isabel Gutierrez](https://arxiv.org/search/?searchtype=author&query=Isabel Gutierrez) 作者：Roman Gutierrez，Tony Kai Tang，Isabel Gutierrez

Robust regression techniques rely on least-squares optimization, which works well for Gaussian noise but fails in the presence of asymmetric structured noise. We propose a hybrid neural-symbolic architecture where a transformer encoder processes numerical sequences, a compression NN predicts symbolic parameters, and a fixed symbolic equation reconstructs the original sequence. Using synthetic data, the training objective is to recover the original sequence after adding asymmetric structured noise, effectively learning a symbolic fit guided by neural parameter estimation. Our model achieves a median regression MSE of 6e-6 to 3.5e-5 on synthetic wearable data, which is a 10-300 times improvement when compared with ordinary least squares fit and robust regression techniques such as Huber loss or SoftL1. 鲁棒回归技术依赖于最小二乘优化，该方法在高斯噪声下表现良好，但在存在非对称结构噪声时效果不佳。我们提出了一种混合神经符号架构，其中变换器编码器处理数值序列，压缩神经网络预测符号参数，固定的符号方程重构原始序列。利用合成数据，训练目标是在添加非对称结构噪声后恢复原始序列，有效地学习由神经参数估计引导的符号拟合。我们的模型在合成可穿戴数据上的中位回归均方误差达到 6e-6 到 3.5e-5，相较于普通最小二乘拟合和如 Huber 损失或 SoftL1 等鲁棒回归技术，提升了 10 到 300 倍。

Subjects: Machine Learning, Artificial Intelligence, Machine Learning 主题：机器学习，人工智能，机器学习

Publish: 2025-08-04 20:03:13 UTC 发布时间：2025-08-04 20:03:13 UTC

#147 Evaluation and Analysis of Deep Neural Transformers and Convolutional Neural Networks on Modern Remote Sensing Datasets #147 现代遥感数据集上深度神经变换器与卷积神经网络的评估与分析

Authors: [J. Alex Hurt](https://arxiv.org/search/?searchtype=author&query=J. Alex Hurt), [Trevor M. Bajkowski](https://arxiv.org/search/?searchtype=author&query=Trevor M. Bajkowski), [Grant J. Scott](https://arxiv.org/search/?searchtype=author&query=Grant J. Scott), [Curt H. Davis](https://arxiv.org/search/?searchtype=author&query=Curt H. Davis) 作者：J. Alex Hurt，Trevor M. Bajkowski，Grant J. Scott，Curt H. Davis

In 2012, AlexNet established deep convolutional neural networks (DCNNs) as the state-of-the-art in CV, as these networks soon led in visual tasks for many domains, including remote sensing. With the publication of Visual Transformers, we are witnessing the second modern leap in computational vision, and as such, it is imperative to understand how various transformer-based neural networks perform on satellite imagery. While transformers have shown high levels of performance in natural language processing and CV applications, they have yet to be compared on a large scale to modern remote sensing data. In this paper, we explore the use of transformer-based neural networks for object detection in high-resolution electro-optical satellite imagery, demonstrating state-of-the-art performance on a variety of publicly available benchmark data sets. We compare eleven distinct bounding-box detection and localization algorithms in this study, of which seven were published since 2020, and all eleven since 2015. The performance of five transformer-based architectures is compared with six convolutional networks on three state-of-the-art opensource high-resolution remote sensing imagery datasets ranging in size and complexity. Following the training and evaluation of thirty-three deep neural models, we then discuss and analyze model performance across various feature extraction methodologies and detection algorithms. 2012 年，AlexNet 确立了深度卷积神经网络（DCNN）在计算机视觉（CV）领域的最先进地位，这些网络很快在包括遥感在内的多个领域的视觉任务中领先。随着《视觉变换器》（Visual Transformers）的发表，我们见证了计算机视觉的第二次现代飞跃，因此，理解各种基于变换器的神经网络在卫星影像上的表现变得至关重要。尽管变换器在自然语言处理和计算机视觉应用中表现出高水平的性能，但它们尚未在大规模现代遥感数据上进行比较。本文探讨了基于变换器的神经网络在高分辨率电光卫星影像中的目标检测应用，展示了在多种公开基准数据集上的最先进性能。我们在本研究中比较了十一种不同的边界框检测和定位算法，其中七种发表于 2020 年以后，所有十一种均发表于 2015 年以后。在三个具有不同规模和复杂度的最先进开源高分辨率遥感影像数据集上，比较了五种基于 Transformer 的架构与六种卷积网络的性能。在训练和评估了三十三个深度神经模型后，我们进一步讨论并分析了不同特征提取方法和检测算法下的模型表现。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning 主题：计算机视觉与模式识别，人工智能，机器学习

Publish: 2025-08-04 19:55:52 UTC 发布时间：2025-08-04 19:55:52 UTC

#148 Secure mmWave Beamforming with Proactive-ISAC Defense Against Beam-Stealing Attacks #148 具有主动 ISAC 防御的安全毫米波波束成形，防止波束窃取攻击

Authors: [Seyed Bagher Hashemi Natanzi](https://arxiv.org/search/?searchtype=author&query=Seyed Bagher Hashemi Natanzi), [Hossein Mohammadi](https://arxiv.org/search/?searchtype=author&query=Hossein Mohammadi), [Bo Tang](https://arxiv.org/search/?searchtype=author&query=Bo Tang), [Vuk Marojevic](https://arxiv.org/search/?searchtype=author&query=Vuk Marojevic) 作者：Seyed Bagher Hashemi Natanzi，Hossein Mohammadi，Bo Tang，Vuk Marojevic

Millimeter-wave (mmWave) communication systems face increasing susceptibility to advanced beam-stealing attacks, posing a significant physical layer security threat. This paper introduces a novel framework employing an advanced Deep Reinforcement Learning (DRL) agent for proactive and adaptive defense against these sophisticated attacks. A key innovation is leveraging Integrated Sensing and Communications (ISAC) capabilities for active, intelligent threat assessment. The DRL agent, built on a Proximal Policy Optimization (PPO) algorithm, dynamically controls ISAC probing actions to investigate suspicious activities. We introduce an intensive curriculum learning strategy that guarantees the agent experiences successful detection during training to overcome the complex exploration challenges inherent to such a security-critical task. Consequently, the agent learns a robust and adaptive policy that intelligently balances security and communication performance. Numerical results demonstrate that our framework achieves a mean attacker detection rate of 92.8% while maintaining an average user SINR of over 13 dB. 毫米波（mmWave）通信系统日益容易受到高级窃波攻击，构成了显著的物理层安全威胁。本文提出了一种新颖框架，采用先进的深度强化学习（DRL）智能体，针对这些复杂攻击进行主动且自适应的防御。一个关键创新是利用集成感知与通信（ISAC）能力，实现主动且智能的威胁评估。该 DRL 智能体基于近端策略优化（PPO）算法，动态控制 ISAC 探测动作以调查可疑活动。我们引入了一种强化课程学习策略，确保智能体在训练过程中经历成功检测，以克服此类安全关键任务固有的复杂探索挑战。因此，智能体学得了一种稳健且自适应的策略，智能地平衡安全性与通信性能。数值结果表明，我们的框架实现了 92.8%的平均攻击者检测率，同时保持用户信噪干扰比（SINR）平均超过 13 dB。

Subjects: Signal Processing, Artificial Intelligence, Networking and Internet Architecture 主题：信号处理，人工智能，网络与互联网架构

Publish: 2025-08-04 19:30:09 UTC 发布时间：2025-08-04 19:30:09 UTC

Subjects: Audio and Speech Processing, Artificial Intelligence, Computation and Language, Sound 主题：音频与语音处理，人工智能，计算与语言，声音

Publish: 2025-08-04 19:22:14 UTC 发布时间：2025-08-04 19:22:14 UTC

#150 Learning from B Cell Evolution: Adaptive Multi-Expert Diffusion for Antibody Design via Online Optimization #150 从 B 细胞进化中学习：通过在线优化的自适应多专家扩散用于抗体设计

Authors: [Hanqi Feng](https://arxiv.org/search/?searchtype=author&query=Hanqi Feng), [Peng Qiu](https://arxiv.org/search/?searchtype=author&query=Peng Qiu), [Mengchun Zhang](https://arxiv.org/search/?searchtype=author&query=Mengchun Zhang), [Yiran Tao](https://arxiv.org/search/?searchtype=author&query=Yiran Tao), [You Fan](https://arxiv.org/search/?searchtype=author&query=You Fan), [Jingtao Xu](https://arxiv.org/search/?searchtype=author&query=Jingtao Xu), [Barnabas Poczos](https://arxiv.org/search/?searchtype=author&query=Barnabas Poczos) 作者：Hanqi Feng, Peng Qiu, Mengchun Zhang, Yiran Tao, You Fan, Jingtao Xu, Barnabas Poczos

Recent advances in diffusion models have shown remarkable potential for antibody design, yet existing approaches apply uniform generation strategies that cannot adapt to each antigen’s unique requirements. Inspired by B cell affinity maturation, where antibodies evolve through multi-objective optimization balancing affinity, stability, and self-avoidance, we propose the first biologically-motivated framework that leverages physics-based domain knowledge within an online meta-learning system. Our method employs multiple specialized experts (van der Waals, molecular recognition, energy balance, and interface geometry) whose parameters evolve during generation based on iterative feedback, mimicking natural antibody refinement cycles. Instead of fixed protocols, this adaptive guidance discovers personalized optimization strategies for each target. Our experiments demonstrate that this approach: (1) discovers optimal SE(3)-equivariant guidance strategies for different antigen classes without pre-training, preserving molecular symmetries throughout optimization; (2) significantly enhances hotspot coverage and interface quality through target-specific adaptation, achieving balanced multi-objective optimization characteristic of therapeutic antibodies; (3) establishes a paradigm for iterative refinement where each antibody-antigen system learns its unique optimization profile through online evaluation; (4) generalizes effectively across diverse design challenges, from small epitopes to large protein interfaces, enabling precision-focused campaigns for individual targets. 扩散模型的最新进展在抗体设计方面展现了显著潜力，但现有方法采用统一的生成策略，无法适应每种抗原的独特需求。受 B 细胞亲和力成熟的启发——抗体通过多目标优化平衡亲和力、稳定性和自我避免而进化——我们提出了首个基于生物学动机的框架，在在线元学习系统中利用基于物理的领域知识。我们的方法采用多个专门的专家（范德华力、分子识别、能量平衡和界面几何），其参数在生成过程中根据迭代反馈不断演化，模拟自然抗体的精炼周期。该自适应引导取代了固定协议，为每个目标发现个性化的优化策略。我们的实验表明，该方法：（1）无需预训练即可发现针对不同抗原类别的最优 SE(3)等变引导策略，在优化过程中保持分子对称性；（2）通过针对性适应显著提升热点覆盖率和界面质量，实现治疗性抗体特有的平衡多目标优化；（3）建立了迭代精炼范式，每个抗体-抗原系统通过在线评估学习其独特的优化特征；（4）在从小表位到大型蛋白界面的多样设计挑战中均表现出良好泛化能力，支持针对单一靶点的精准设计活动。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-07-25 03:14:34 UTC 发布时间：2025-07-25 03:14:34 UTC

#151 Automated Validation of LLM-based Evaluators for Software Engineering Artifacts #151 基于 LLM 的软件工程工件评估器自动验证

Authors: [Ora Nova Fandina](https://arxiv.org/search/?searchtype=author&query=Ora Nova Fandina), [Eitan Farchi](https://arxiv.org/search/?searchtype=author&query=Eitan Farchi), [Shmulik Froimovich](https://arxiv.org/search/?searchtype=author&query=Shmulik Froimovich), [Rami Katan](https://arxiv.org/search/?searchtype=author&query=Rami Katan), [Alice Podolsky](https://arxiv.org/search/?searchtype=author&query=Alice Podolsky), [Orna Raz](https://arxiv.org/search/?searchtype=author&query=Orna Raz), [Avi Ziv](https://arxiv.org/search/?searchtype=author&query=Avi Ziv) 作者：Ora Nova Fandina、Eitan Farchi、Shmulik Froimovich、Rami Katan、Alice Podolsky、Orna Raz、Avi Ziv

Automation in software engineering increasingly relies on large language models (LLMs) to generate, review, and assess code artifacts. However, establishing LLMs as reliable evaluators remains an open challenge: human evaluations are costly, subjective and non scalable, while existing automated methods fail to discern fine grained variations in artifact quality. We introduce REFINE (Ranking Evaluators for FIne grained Nuanced Evaluation), an automated framework for benchmarking LLM based evaluators across software engineering tasks. REFINE comprises of two modules: Hierarchy Dataset Builder applies novel generation techniques to automatically synthesize artifacts with progressively reduced quality, and Evaluator Tester quantifies each candidate evaluator configuration by measuring how closely its rankings align with expected ordering. A key feature of REFINE is controllability: users can tune the granularity of degradation to progressively refine evaluator configurations, from coarse filtering to stress testing on subtle quality gaps. While the methodology is general, we focus on coding tasks reflecting the practical demands in our production setting. REFINE was integrated into IBM’s internal development workflows and applied to code generation, translation, and summarization for COBOL, an enterprise critical programming language, using industrial data. It was used to identify LLM as a Judge configurations that lifted alignment scores from below 0.7 to above 0.9 in some coding tasks. These nuance sensitive evaluators are now actively used by model training teams to support model release decisions. 软件工程中的自动化越来越依赖于大型语言模型（LLMs）来生成、审查和评估代码产物。然而，将 LLMs 确立为可靠的评估者仍然是一个未解决的挑战：人工评估成本高、主观且难以扩展，而现有的自动化方法无法识别产物质量的细微差异。我们提出了 REFINE（Ranking Evaluators for FIne grained Nuanced Evaluation），这是一个用于基于 LLM 的评估者在软件工程任务中进行基准测试的自动化框架。REFINE 包含两个模块：Hierarchy Dataset Builder 采用新颖的生成技术，自动合成质量逐步降低的产物；Evaluator Tester 通过测量候选评估者配置的排名与预期排序的吻合度来量化其表现。REFINE 的一个关键特性是可控性：用户可以调节质量退化的细粒度，从粗略筛选到对细微质量差异的压力测试，逐步优化评估者配置。虽然该方法具有通用性，但我们重点关注反映实际生产需求的编码任务。 REFINE 被整合到 IBM 的内部开发工作流程中，并应用于使用工业数据的 COBOL 代码生成、翻译和摘要，COBOL 是一种企业关键编程语言。它被用来识别 LLM 作为 Judge 配置，在某些编码任务中将对齐得分从低于 0.7 提升到高于 0.9 。这些对细微差别敏感的评估器现在被模型训练团队积极使用，以支持模型发布决策。

Subjects: Software Engineering, Artificial Intelligence 主题：软件工程，人工智能

Publish: 2025-08-04 18:52:01 UTC 发布时间：2025-08-04 18:52:01 UTC

#152 TransAM: Transformer-Based Agent Modeling for Multi-Agent Systems via Local Trajectory Encoding #152 TransAM：基于 Transformer 的多智能体系统代理建模，通过局部轨迹编码

Authors: [Conor Wallace](https://arxiv.org/search/?searchtype=author&query=Conor Wallace), [Umer Siddique](https://arxiv.org/search/?searchtype=author&query=Umer Siddique), [Yongcan Cao](https://arxiv.org/search/?searchtype=author&query=Yongcan Cao) 作者：Conor Wallace，Umer Siddique，Yongcan Cao

Agent modeling is a critical component in developing effective policies within multi-agent systems, as it enables agents to form beliefs about the behaviors, intentions, and competencies of others. Many existing approaches assume access to other agents’ episodic trajectories, a condition often unrealistic in real-world applications. Consequently, a practical agent modeling approach must learn a robust representation of the policies of the other agents based only on the local trajectory of the controlled agent. In this paper, we propose \texttt{TransAM}, a novel transformer-based agent modeling approach to encode local trajectories into an embedding space that effectively captures the policies of other agents. We evaluate the performance of the proposed method in cooperative, competitive, and mixed multi-agent environments. Extensive experimental results demonstrate that our approach generates strong policy representations, improves agent modeling, and leads to higher episodic returns. 代理建模是多智能体系统中制定有效策略的关键组成部分，因为它使智能体能够形成对其他智能体行为、意图和能力的信念。许多现有方法假设可以访问其他智能体的情节轨迹，但这一条件在现实应用中往往不切实际。因此，实用的代理建模方法必须仅基于被控智能体的局部轨迹学习其他智能体策略的鲁棒表示。本文提出了\texttt{TransAM}，一种基于 Transformer 的新型代理建模方法，将局部轨迹编码到嵌入空间中，有效捕捉其他智能体的策略。我们在合作、竞争和混合多智能体环境中评估了该方法的性能。大量实验结果表明，我们的方法生成了强有力的策略表示，提升了代理建模效果，并带来了更高的情节回报。

Subjects: Multiagent Systems, Artificial Intelligence 主题：多智能体系统，人工智能

Publish: 2025-08-04 18:50:37 UTC 发布时间：2025-08-04 18:50:37 UTC

#153 NeuroSync: Intent-Aware Code-Based Problem Solving via Direct LLM Understanding Modification #153 NeuroSync：通过直接修改 LLM 理解实现意图感知的基于代码的问题解决

Subjects: Human-Computer Interaction, Artificial Intelligence, Computation and Language, Software Engineering 主题：人机交互，人工智能，计算与语言，软件工程

Publish: 2025-08-05 12:54:13 UTC 发布时间：2025-08-05 12:54:13 UTC

#154 Real-World Receptivity to Adaptive Mental Health Interventions: Findings from an In-the-Wild Study #154 适应性心理健康干预的现实接受度：一项野外研究的发现

Authors: [Nilesh Kumar Sahu](https://arxiv.org/search/?searchtype=author&query=Nilesh Kumar Sahu), [Aditya Sneh](https://arxiv.org/search/?searchtype=author&query=Aditya Sneh), [Snehil Gupta](https://arxiv.org/search/?searchtype=author&query=Snehil Gupta), [Haroon R Lone](https://arxiv.org/search/?searchtype=author&query=Haroon R Lone) 作者：Nilesh Kumar Sahu, Aditya Sneh, Snehil Gupta, Haroon R Lone

The rise of mobile health (mHealth) technologies has enabled real-time monitoring and intervention for mental health conditions using passively sensed smartphone data. Building on these capabilities, Just-in-Time Adaptive Interventions (JITAIs) seek to deliver personalized support at opportune moments, adapting to users’ evolving contexts and needs. Although prior research has examined how context affects user responses to generic notifications and general mHealth messages, relatively little work has explored its influence on engagement with actual mental health interventions. Furthermore, while much of the existing research has focused on detecting when users might benefit from an intervention, less attention has been paid to understanding receptivity, i.e., users’ willingness and ability to engage with and act upon the intervention. In this study, we investigate user receptivity through two components: acceptance(acknowledging or engaging with a prompt) and feasibility (ability to act given situational constraints). We conducted a two-week in-the-wild study with 70 students using a custom Android app, LogMe, which collected passive sensor data and active context reports to prompt mental health interventions. The adaptive intervention module was built using Thompson Sampling, a reinforcement learning algorithm. We address four research questions relating smartphone features and self-reported contexts to acceptance and feasibility, and examine whether an adaptive reinforcement learning approach can optimize intervention delivery by maximizing a combined receptivity reward. Our results show that several types of passively sensed data significantly influenced user receptivity to interventions. Our findings contribute insights into the design of context-aware, adaptive interventions that are not only timely but also actionable in real-world settings. 移动健康（mHealth）技术的兴起使得利用被动感知的智能手机数据实现心理健康状况的实时监测和干预成为可能。基于这些能力，及时适应性干预（JITAIs）旨在在恰当的时刻提供个性化支持，适应用户不断变化的环境和需求。尽管以往研究已经探讨了环境如何影响用户对通用通知和一般 mHealth 信息的反应，但关于环境对实际心理健康干预参与度影响的研究相对较少。此外，虽然现有研究大多集中在检测用户何时可能受益于干预，但对理解接受度——即用户参与并执行干预的意愿和能力——的关注较少。在本研究中，我们通过两个组成部分来调查用户的接受度：接受（承认或响应提示）和可行性（在特定情境限制下采取行动的能力）。我们使用定制的安卓应用 LogMe 进行了为期两周的实地研究，参与者为 70 名学生。该应用收集被动传感器数据和主动上下文报告，以触发心理健康干预。自适应干预模块采用了强化学习算法——汤普森采样构建。我们探讨了四个研究问题，涉及智能手机特征和自我报告的上下文与接受度及可行性的关系，并检验了自适应强化学习方法是否能通过最大化综合接受奖励来优化干预的实施。结果显示，多种被动感知数据显著影响用户对干预的接受度。我们的研究为设计情境感知、自适应的干预措施提供了见解，这些干预不仅及时，而且在现实环境中具有可操作性。

Subjects: Human-Computer Interaction, Artificial Intelligence, Computers and Society, Signal Processing 主题：人机交互，人工智能，计算机与社会，信号处理

Publish: 2025-07-10 12:45:15 UTC 发布时间：2025-07-10 12:45:15 UTC

#155 Clinically Grounded Agent-based Report Evaluation: An Interpretable Metric for Radiology Report Generation #155 临床基础的基于代理的报告评估：放射学报告生成的可解释指标

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-08-04 18:28:03 UTC 发布时间：2025-08-04 18:28:03 UTC

#156 Adaptive Knowledge Distillation for Device-Directed Speech Detection #156 设备定向语音检测的自适应知识蒸馏

Device-directed speech detection (DDSD) is a binary classification task that separates the user’s queries to a voice assistant (VA) from background speech or side conversations. This is important for achieving naturalistic user experience. To this end, we propose knowledge distillation (KD) to enhance DDSD accuracy while ensuring efficient deployment. Specifically, we introduce a novel adaptive KD method that transfers knowledge from general representations of an ASR large pre-trained acoustic encoder (teacher). We apply task-specific adapters, on top of the (frozen) teacher encoder, trained jointly with the student model on DDSD. We demonstrate that the proposed adaptive KD outperforms the student model without distillation in the keyword and keyword-free (follow-up) invocations, with an improvement of +26% and +19% in terms of Equal Error Rate, respectively. We also show that this approach generalizes across the transformer and conformer-based model architectures. 设备定向语音检测（DDSD）是一项二分类任务，用于区分用户对语音助手（VA）的查询与背景语音或旁白对话。这对于实现自然的用户体验非常重要。为此，我们提出了知识蒸馏（KD）方法，以提升 DDSD 的准确性，同时确保高效部署。具体来说，我们引入了一种新颖的自适应 KD 方法，从一个大型预训练的 ASR 声学编码器（教师模型）的通用表示中转移知识。我们在（冻结的）教师编码器之上应用任务特定的适配器，并与学生模型在 DDSD 任务上联合训练。我们证明，所提出的自适应 KD 在关键词和无关键词（后续）唤醒中，分别在等错误率上提升了+26%和+19%，优于未进行蒸馏的学生模型。我们还展示了该方法在基于 Transformer 和 Conformer 的模型架构中均具有良好的泛化能力。

Subjects: Sound, Artificial Intelligence, Audio and Speech Processing 主题：声音，人工智能，音频与语音处理

Publish: 2025-08-04 18:12:28 UTC 发布时间：2025-08-04 18:12:28 UTC

#157 Extracting Range-Doppler Information of Moving Targets from Wi-Fi Channel State Information #157 从 Wi-Fi 信道状态信息中提取移动目标的距离-多普勒信息

Authors: [Jessica Sanson](https://arxiv.org/search/?searchtype=author&query=Jessica Sanson), [Rahul C. Shah](https://arxiv.org/search/?searchtype=author&query=Rahul C. Shah), [Maximilian Pinaroc](https://arxiv.org/search/?searchtype=author&query=Maximilian Pinaroc), [Valerio Frascolla](https://arxiv.org/search/?searchtype=author&query=Valerio Frascolla) 作者：Jessica Sanson, Rahul C. Shah, Maximilian Pinaroc, Valerio Frascolla

This paper presents, for the first time, a method to extract both range and Doppler information from commercial Wi-Fi Channel State Information (CSI) using a monostatic (single transceiver) setup. Utilizing the CSI phase in Wi-Fi sensing from a Network Interface Card (NIC) not designed for full-duplex operation is challenging due to (1) Hardware asynchronization, which introduces significant phase errors, and (2) Proximity of transmit (Tx) and receive (Rx) antennas, which creates strong coupling that overwhelms the motion signal of interest. We propose a new signal processing approach that addresses both challenges via three key innovations: Time offset cancellation, Phase alignment correction, and Tx/Rx coupling mitigation. Our method achieves cm-level accuracy in range and Doppler estimation for moving targets, validated using a commercial Intel Wi-Fi AX211 NIC. Our results show successful detection and tracking of moving objects in realistic environments, establishing the feasibility of high-precision sensing using standard Wi-Fi packet communications and off-the-shelf hardware without requiring any modification or specialized full-duplex capabilities. 本文首次提出了一种方法，利用单站（单收发器）设置从商用 Wi-Fi 信道状态信息（CSI）中提取距离和多普勒信息。由于（1）硬件不同步导致显著的相位误差，以及（2）发射（Tx）和接收（Rx）天线的接近产生强耦合，掩盖了感兴趣的运动信号，使用非全双工设计的网络接口卡（NIC）中的 CSI 相位进行 Wi-Fi 感知具有挑战性。我们提出了一种新的信号处理方法，通过三项关键创新解决这两个挑战：时间偏移消除、相位对齐校正和 Tx/Rx 耦合抑制。我们的方法在移动目标的距离和多普勒估计中实现了厘米级精度，并通过商用 Intel Wi-Fi AX211 NIC 进行了验证。结果显示，在真实环境中成功检测和跟踪移动物体，证明了利用标准 Wi-Fi 数据包通信和现成硬件进行高精度感知的可行性，无需任何修改或专用的全双工功能。

Subjects: Signal Processing, Artificial Intelligence 主题：信号处理，人工智能

Publish: 2025-08-04 18:10:18 UTC 发布时间：2025-08-04 18:10:18 UTC

#158 Web3 x AI Agents: Landscape, Integrations, and Foundational Challenges #158 Web3 x AI 代理：现状、整合与基础性挑战

The convergence of Web3 technologies and AI agents represents a rapidly evolving frontier poised to reshape decentralized ecosystems. This paper presents the first and most comprehensive analysis of the intersection between Web3 and AI agents, examining five critical dimensions: landscape, economics, governance, security, and trust mechanisms. Through an analysis of 133 existing projects, we first develop a taxonomy and systematically map the current market landscape (RQ1), identifying distinct patterns in project distribution and capitalization. Building upon these findings, we further investigate four key integrations: (1) the role of AI agents in participating in and optimizing decentralized finance (RQ2); (2) their contribution to enhancing Web3 governance mechanisms (RQ3); (3) their capacity to strengthen Web3 security via intelligent vulnerability detection and automated smart contract auditing (RQ4); and (4) the establishment of robust reliability frameworks for AI agent operations leveraging Web3’s inherent trust infrastructure (RQ5). By synthesizing these dimensions, we identify key integration patterns, highlight foundational challenges related to scalability, security, and ethics, and outline critical considerations for future research toward building robust, intelligent, and trustworthy decentralized systems with effective AI agent interactions. Web3 技术与 AI 代理的融合代表了一个快速发展的前沿领域，有望重塑去中心化生态系统。本文首次且最全面地分析了 Web3 与 AI 代理的交汇点，考察了五个关键维度：生态、经济、治理、安全和信任机制。通过对 133 个现有项目的分析，我们首先构建了一个分类体系，并系统地绘制了当前市场格局（研究问题 1），识别出项目分布和资本化的不同模式。在此基础上，我们进一步探讨了四个关键整合方向：（1）AI 代理在参与和优化去中心化金融中的作用（研究问题 2）；（2）它们对增强 Web3 治理机制的贡献（研究问题 3）；（3）通过智能漏洞检测和自动化智能合约审计，提升 Web3 安全性的能力（研究问题 4）；以及（4）利用 Web3 固有的信任基础设施，建立 AI 代理操作的稳健可靠性框架（研究问题 5）。通过综合这些维度，我们识别出关键的整合模式，强调了与可扩展性、安全性和伦理相关的基础性挑战，并概述了未来研究中构建具有有效 AI 代理交互的稳健、智能且可信的去中心化系统的关键考虑因素。

Subjects: Computers and Society, Artificial Intelligence, General Economics 主题：计算机与社会，人工智能，通用经济学

Publish: 2025-08-04 15:44:58 UTC 发布时间：2025-08-04 15:44:58 UTC

#159 The Silicon Reasonable Person: Can AI Predict How Ordinary People Judge Reasonableness? #159 硅理性人：人工智能能否预测普通人如何判断合理性？

Author: [Yonathan A. Arbel](https://arxiv.org/search/?searchtype=author&query=Yonathan A. Arbel) 作者：Yonathan A. Arbel

In everyday life, people make countless reasonableness judgments that determine appropriate behavior in various contexts. Predicting these judgments challenges the legal system, as judges’ intuitions may not align with broader societal views. This Article investigates whether large language models (LLMs) can learn to identify patterns driving human reasonableness judgments. Using randomized controlled trials comparing humans and models across multiple legal contexts with over 10,000 simulated judgments, we demonstrate that certain models capture not just surface-level responses but potentially their underlying decisional architecture. Strikingly, these systems prioritize social cues over economic efficiency in negligence determinations, mirroring human behavior despite contradicting textbook treatments. These findings suggest practical applications: judges could calibrate intuitions against broader patterns, lawmakers could test policy interpretations, and resource-constrained litigants could preview argument reception. As AI agents increasingly make autonomous real-world decisions, understanding whether they’ve internalized recognizable ethical frameworks becomes essential for anticipating their behavior. 在日常生活中，人们做出无数合理性判断，以确定在各种情境下的适当行为。预测这些判断对法律系统构成挑战，因为法官的直觉可能与更广泛的社会观点不一致。本文探讨了大型语言模型（LLMs）是否能够学习识别人类合理性判断背后的驱动模式。通过在多个法律情境中对人类和模型进行随机对照试验，涵盖超过 10,000 个模拟判断，我们证明了某些模型不仅捕捉到了表面反应，还可能掌握了其潜在的决策架构。令人惊讶的是，这些系统在过失认定中优先考虑社会线索而非经济效率，反映了人类行为，尽管这与教科书中的处理方法相悖。这些发现暗示了实际应用：法官可以将直觉与更广泛的模式进行校准，立法者可以测试政策解释，资源有限的诉讼当事人可以预览论点的接受度。随着人工智能代理越来越多地在现实世界中自主做出决策，理解它们是否已经内化了可识别的伦理框架，变得对于预测其行为至关重要。

Subjects: Computers and Society, Artificial Intelligence 主题：计算机与社会，人工智能

Publish: 2025-08-04 06:19:45 UTC 发布时间：2025-08-04 06:19:45 UTC

#160 The Architecture of Trust: A Framework for AI-Augmented Real Estate Valuation in the Era of Structured Data #160 信任架构：结构化数据时代人工智能增强房地产估价的框架

Authors: [Petteri Teikari](https://arxiv.org/search/?searchtype=author&query=Petteri Teikari), [Mike Jarrell](https://arxiv.org/search/?searchtype=author&query=Mike Jarrell), [Maryam Azh](https://arxiv.org/search/?searchtype=author&query=Maryam Azh), [Harri Pesola](https://arxiv.org/search/?searchtype=author&query=Harri Pesola) 作者：Petteri Teikari，Mike Jarrell，Maryam Azh，Harri Pesola

The Uniform Appraisal Dataset (UAD) 3.6’s mandatory 2026 implementation transforms residential property valuation from narrative reporting to structured, machine-readable formats. This paper provides the first comprehensive analysis of this regulatory shift alongside concurrent AI advances in computer vision, natural language processing, and autonomous systems. We develop a three-layer framework for AI-augmented valuation addressing technical implementation and institutional trust requirements. Our analysis reveals how regulatory standardization converging with AI capabilities enables fundamental market restructuring with profound implications for professional practice, efficiency, and systemic risk. We make four key contributions: (1) documenting institutional failures including inter-appraiser variability and systematic biases undermining valuation reliability; (2) developing an architectural framework spanning physical data acquisition, semantic understanding, and cognitive reasoning that integrates emerging technologies while maintaining professional oversight; (3) addressing trust requirements for high-stakes financial applications including regulatory compliance, algorithmic fairness, and uncertainty quantification; (4) proposing evaluation methodologies beyond generic AI benchmarks toward domain-specific protocols. Our findings indicate successful transformation requires not merely technological sophistication but careful human-AI collaboration, creating systems that augment rather than replace professional expertise while addressing historical biases and information asymmetries in real estate markets. 统一评估数据集（UAD）3.6 版本于 2026 年强制实施，将住宅物业估价从叙述性报告转变为结构化、机器可读的格式。本文首次对这一监管变革及其与计算机视觉、自然语言处理和自主系统等人工智能技术进展的同步发展进行了全面分析。我们构建了一个三层框架，用于支持人工智能增强的估价，涵盖技术实施和机构信任需求。我们的分析揭示了监管标准化与人工智能能力融合如何推动市场的根本重构，并对专业实践、效率及系统性风险产生深远影响。我们做出了四项关键贡献：（1）记录了机构性失败，包括评估者之间的差异性和系统性偏见，这些因素削弱了估值的可靠性；（2）开发了一个涵盖物理数据采集、语义理解和认知推理的架构框架，该框架整合了新兴技术，同时保持专业监督；（3）解决了高风险金融应用中的信任需求，包括合规监管、算法公平性和不确定性量化；（4）提出了超越通用 AI 基准的评估方法，迈向特定领域的协议。我们的研究结果表明，成功的转型不仅需要技术上的复杂性，更需要谨慎的人机协作，打造增强而非替代专业知识的系统，同时解决房地产市场中的历史偏见和信息不对称问题。

Subjects: Computers and Society, Artificial Intelligence, Computer Vision and Pattern Recognition 主题：计算机与社会，人工智能，计算机视觉与模式识别

Publish: 2025-08-04 05:24:25 UTC 发布时间：2025-08-04 05:24:25 UTC

#161 Context-Adaptive Multi-Prompt LLM Embedding for Vision-Language Alignment #161 适应上下文的多提示 LLM 嵌入用于视觉-语言对齐

Authors: [Dahun Kim](https://arxiv.org/search/?searchtype=author&query=Dahun Kim), [Anelia Angelova](https://arxiv.org/search/?searchtype=author&query=Anelia Angelova) 作者：Dahun Kim，Anelia Angelova

We propose Context-Adaptive Multi-Prompt Embedding, a novel approach to enrich semantic representations in vision-language contrastive learning. Unlike standard CLIP-style models that rely on a single text embedding, our method introduces multiple structured prompts, each containing a distinct adaptive token that captures diverse semantic aspects of the input text. We process all prompts jointly in a single forward pass. The resulting prompt embeddings are combined into a unified text representation, enabling semantically richer alignment with visual features. To further promote semantic diversity and representation quality, we incorporate a diversity regularization loss and a negation-aware loss, encouraging specialization across prompts and improving contrastive discrimination. Our method achieves consistent improvements on both image-text and video-text retrieval benchmarks. 我们提出了上下文自适应多提示嵌入（Context-Adaptive Multi-Prompt Embedding），这是一种在视觉-语言对比学习中丰富语义表示的新方法。与依赖单一文本嵌入的标准 CLIP 风格模型不同，我们的方法引入了多个结构化提示，每个提示包含一个独特的自适应标记，用以捕捉输入文本的多样语义方面。我们在一次前向传播中联合处理所有提示。生成的提示嵌入被合并为统一的文本表示，从而实现与视觉特征的语义更丰富的对齐。为了进一步促进语义多样性和表示质量，我们引入了多样性正则化损失和否定感知损失，鼓励各提示间的专业化并提升对比判别能力。我们的方法在图文和视频文本检索基准上均取得了持续的性能提升。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-03 20:48:43 UTC 发布时间：2025-08-03 20:48:43 UTC

#162 Towards a Manifesto for Cyber Humanities: Paradigms, Ethics, and Prospects #162 迈向网络人文学宣言：范式、伦理与前景

Authors: [Giovanni Adorni](https://arxiv.org/search/?searchtype=author&query=Giovanni Adorni), [Emanuele Bellini](https://arxiv.org/search/?searchtype=author&query=Emanuele Bellini) 作者：Giovanni Adorni, Emanuele Bellini

The accelerated evolution of digital infrastructures and algorithmic systems is reshaping how the humanities engage with knowledge and culture. Rooted in the traditions of Digital Humanities and Digital Humanism, the concept of “Cyber Humanities” proposes a critical reconfiguration of humanistic inquiry for the post-digital era. This Manifesto introduces a flexible framework that integrates ethical design, sustainable digital practices, and participatory knowledge systems grounded in human-centered approaches. By means of a Decalogue of foundational principles, the Manifesto invites the scientific community to critically examine and reimagine the algorithmic infrastructures that influence culture, creativity, and collective memory. Rather than being a simple extension of existing practices, “Cyber Humanities” should be understood as a foundational paradigm for humanistic inquiry in a computationally mediated world. Keywords: Cyber Humanities, Digital Humanities, Transdisciplinary Epistemology, Algorithmic Reflexivity, Human-centered AI, Ethics-by-Design, Knowledge Ecosystems, Digital Sovereignty, Cognitive Infrastructures 数字基础设施和算法系统的加速演进正在重塑人文学科与知识和文化的互动方式。植根于数字人文和数字人文主义的传统，“网络人文”概念提出了对后数字时代人文探究的批判性重构。该宣言介绍了一个灵活的框架，融合了伦理设计、可持续数字实践以及以人为本的参与式知识系统。通过一套基础原则的十诫，宣言邀请科学界批判性地审视并重新构想影响文化、创造力和集体记忆的算法基础设施。“网络人文”不应被简单视为现有实践的延伸，而应被理解为计算媒介世界中人文探究的基础范式。关键词：网络人文、数字人文、跨学科认识论、算法反思、人本人工智能、设计即伦理、知识生态系统、数字主权、认知基础设施

Subjects: Computers and Society, Artificial Intelligence, Digital Libraries 主题：计算机与社会，人工智能，数字图书馆

Publish: 2025-08-03 17:33:24 UTC 发布时间：2025-08-03 17:33:24 UTC

#163 CTBench: Cryptocurrency Time Series Generation Benchmark #163 CTBench：加密货币时间序列生成基准测试

Synthetic time series are essential tools for data augmentation, stress testing, and algorithmic prototyping in quantitative finance. However, in cryptocurrency markets, characterized by 24/7 trading, extreme volatility, and rapid regime shifts, existing Time Series Generation (TSG) methods and benchmarks often fall short, jeopardizing practical utility. Most prior work (1) targets non-financial or traditional financial domains, (2) focuses narrowly on classification and forecasting while neglecting crypto-specific complexities, and (3) lacks critical financial evaluations, particularly for trading applications. To address these gaps, we introduce \textsf{CTBench}, the first comprehensive TSG benchmark tailored for the cryptocurrency domain. \textsf{CTBench} curates an open-source dataset from 452 tokens and evaluates TSG models across 13 metrics spanning 5 key dimensions: forecasting accuracy, rank fidelity, trading performance, risk assessment, and computational efficiency. A key innovation is a dual-task evaluation framework: (1) the \emph{Predictive Utility} task measures how well synthetic data preserves temporal and cross-sectional patterns for forecasting, while (2) the \emph{Statistical Arbitrage} task assesses whether reconstructed series support mean-reverting signals for trading. We benchmark eight representative models from five methodological families over four distinct market regimes, uncovering trade-offs between statistical fidelity and real-world profitability. Notably, \textsf{CTBench} offers model ranking analysis and actionable guidance for selecting and deploying TSG models in crypto analytics and strategy development. 合成时间序列是量化金融中数据增强、压力测试和算法原型设计的重要工具。然而，在以全天候交易、极端波动性和快速制度变迁为特征的加密货币市场中，现有的时间序列生成（TSG）方法和基准测试往往难以满足需求，影响其实用性。大多数先前的工作（1）针对非金融或传统金融领域，（2）过于集中于分类和预测，忽视了加密货币特有的复杂性，（3）缺乏关键的金融评估，尤其是在交易应用方面。为填补这些空白，我们推出了\textsf{CTBench}，这是首个专为加密货币领域量身打造的综合 TSG 基准。 \textsf{CTBench}整理了来自 452 个代币的开源数据集，并在涵盖预测准确性、排名保真度、交易表现、风险评估和计算效率五大关键维度的 13 项指标上对 TSG 模型进行了评估。一个关键创新是双任务评估框架：（1）\emph{预测效用}任务衡量合成数据在预测中保留时间和横截面模式的能力，（2）\emph{统计套利}任务评估重构序列是否支持均值回归信号以进行交易。我们在四个不同的市场环境下，对来自五个方法学家族的八个代表性模型进行了基准测试，揭示了统计保真度与实际盈利能力之间的权衡。值得注意的是，\textsf{CTBench} 提供了模型排名分析和可操作的指导，帮助在加密分析和策略开发中选择和部署时间序列生成模型。

Subjects: Statistical Finance, Artificial Intelligence, Computational Engineering, Finance, and Science, Databases, Machine Learning 主题：统计金融，人工智能，计算工程、金融与科学，数据库，机器学习

Publish: 2025-08-03 17:07:08 UTC 发布时间：2025-08-03 17:07:08 UTC

#164 Beyond the Wavefunction: Qualia Abstraction Language Mechanics and the Grammar of Awareness #164 超越波函数：感质抽象语言机制与意识语法

Authors: [Mikołaj Sienicki](https://arxiv.org/search/?searchtype=author&query=Mikołaj Sienicki), [Krzysztof Sienicki](https://arxiv.org/search/?searchtype=author&query=Krzysztof Sienicki) 作者：Mikołaj Sienicki，Krzysztof Sienicki

We propose a formal reconstruction of quantum mechanics grounded not in external mathematical abstractions, but in the structured dynamics of subjective experience. The Qualia Abstraction Language (QAL) models physical systems as evolving streams of introspective units, structured sequences of modality, shape, and functional effect, rather than as state vectors in Hilbert space. This approach reimagines core quantum concepts: superposition becomes a form of structured ambiguity; collapse is reframed as an introspective contraction; and entanglement is modeled as semantic resonance across streams of qualia. Drawing on insights from nominalist philosophy and oversight theoretic limits in AI, we argue that the observer paradox in quantum mechanics reflects not an ontological lacuna, but a linguistic one: the absence of a formal vocabulary for modeling first person structure. QAL introduces such a vocabulary, providing a morphodynamic framework that embeds the observer within the system and replaces abstract projection with endogenous transformation. We analyze the alignment of QAL with endophysical approaches, contrast it with standard interpretations of quantum theory, and explore its implications for a post Platonist, introspectively grounded physics. 我们提出了一种量子力学的形式重构，该重构不基于外部的数学抽象，而是基于主观体验的结构化动态。Qualia Abstraction Language（QAL）将物理系统建模为内省单元的演化流，这些单元是模态、形状和功能效应的结构化序列，而非希尔伯特空间中的状态向量。这种方法重新构想了核心的量子概念：叠加成为一种结构化的模糊性；坍缩被重新定义为内省的收缩；纠缠则被建模为感质流之间的语义共振。借鉴唯名论哲学的见解以及人工智能中的监督理论极限，我们认为量子力学中的观察者悖论反映的不是本体论上的缺失，而是语言上的缺陷：缺乏用于建模第一人称结构的正式词汇。QAL 引入了这样的词汇，提供了一个形态动力学框架，将观察者嵌入系统之中，并用内生变换取代抽象投影。我们分析了 QAL 与内在物理方法的一致性，将其与量子理论的标准解释进行了对比，并探讨了其对后柏拉图主义、内省为基础的物理学的影响。

Subjects: History and Philosophy of Physics, Artificial Intelligence 主题：物理学的历史与哲学，人工智能

Publish: 2025-08-03 15:07:24 UTC 发布时间：2025-08-03 15:07:24 UTC

#165 DMSC: Dynamic Multi-Scale Coordination Framework for Time Series Forecasting #165 DMSC：用于时间序列预测的动态多尺度协调框架

Authors: [Haonan Yang](https://arxiv.org/search/?searchtype=author&query=Haonan Yang), [Jianchao Tang](https://arxiv.org/search/?searchtype=author&query=Jianchao Tang), [Zhuo Li](https://arxiv.org/search/?searchtype=author&query=Zhuo Li), [Long Lan](https://arxiv.org/search/?searchtype=author&query=Long Lan) 作者：杨浩南，唐建超，李卓，蓝龙

Time Series Forecasting (TSF) faces persistent challenges in modeling intricate temporal dependencies across different scales. Despite recent advances leveraging different decomposition operations and novel architectures based on CNN, MLP or Transformer, existing methods still struggle with static decomposition strategies, fragmented dependency modeling, and inflexible fusion mechanisms, limiting their ability to model intricate temporal dependencies. To explicitly solve the mentioned three problems respectively, we propose a novel Dynamic Multi-Scale Coordination Framework (DMSC) with Multi-Scale Patch Decomposition block (EMPD), Triad Interaction Block (TIB) and Adaptive Scale Routing MoE block (ASR-MoE). Specifically, EMPD is designed as a built-in component to dynamically segment sequences into hierarchical patches with exponentially scaled granularities, eliminating predefined scale constraints through input-adaptive patch adjustment. TIB then jointly models intra-patch, inter-patch, and cross-variable dependencies within each layer’s decomposed representations. EMPD and TIB are jointly integrated into layers forming a multi-layer progressive cascade architecture, where coarse-grained representations from earlier layers adaptively guide fine-grained feature extraction in subsequent layers via gated pathways. And ASR-MoE dynamically fuses multi-scale predictions by leveraging specialized global and local experts with temporal-aware weighting. Comprehensive experiments on thirteen real-world benchmarks demonstrate that DMSC consistently maintains state-of-the-art (SOTA) performance and superior computational efficiency for TSF tasks. Code is available at https://github.com/1327679995/DMSC. 时间序列预测（TSF）在建模不同尺度上复杂的时间依赖关系方面面临持续挑战。尽管最近利用不同的分解操作和基于 CNN、MLP 或 Transformer 的新型架构取得了进展，现有方法仍然受限于静态分解策略、碎片化的依赖建模以及不灵活的融合机制，限制了其对复杂时间依赖关系的建模能力。为明确解决上述三个问题，我们提出了一种新颖的动态多尺度协调框架（DMSC），包含多尺度分块分解模块（EMPD）、三元交互模块（TIB）和自适应尺度路由专家模型模块（ASR-MoE）。具体而言，EMPD 被设计为内置组件，能够动态地将序列分割成具有指数级尺度的分层块，通过输入自适应的分块调整消除预定义的尺度限制。随后，TIB 在每层分解表示中联合建模块内、块间及跨变量的依赖关系。 EMPD 和 TIB 共同集成到多个层中，形成一个多层渐进级联架构，其中早期层的粗粒度表示通过门控路径自适应地指导后续层的细粒度特征提取。ASR-MoE 则通过利用具有时间感知加权的专门全局和局部专家，动态融合多尺度预测。在十三个真实世界基准上的全面实验表明，DMSC 在 TSF 任务中始终保持最先进（SOTA）的性能和卓越的计算效率。代码可在 https://github.com/1327679995/DMSC 获取。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-03 13:11:52 UTC 发布时间：2025-08-03 13:11:52 UTC

#166 SmallKV: Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference #166 SmallKV：小模型辅助的 KV 缓存压缩补偿，用于高效的 LLM 推理

Authors: [Yi Zhao](https://arxiv.org/search/?searchtype=author&query=Yi Zhao), [Yajuan Peng](https://arxiv.org/search/?searchtype=author&query=Yajuan Peng), [Cam-Tu Nguyen](https://arxiv.org/search/?searchtype=author&query=Cam-Tu Nguyen), [Zuchao Li](https://arxiv.org/search/?searchtype=author&query=Zuchao Li), [Xiaoliang Wang](https://arxiv.org/search/?searchtype=author&query=Xiaoliang Wang), [Hai Zhao](https://arxiv.org/search/?searchtype=author&query=Hai Zhao), [Xiaoming Fu](https://arxiv.org/search/?searchtype=author&query=Xiaoming Fu) 作者：赵毅，彭雅娟，阮金图，李祖超，王晓亮，赵海，付晓明

KV cache eviction has emerged as an effective solution to alleviate resource constraints faced by LLMs in long-context scenarios. However, existing token-level eviction methods often overlook two critical aspects: (1) their irreversible eviction strategy fails to adapt to dynamic attention patterns during decoding (the saliency shift problem), and (2) they treat both marginally important tokens and truly unimportant tokens equally, despite the collective significance of marginal tokens to model performance (the marginal information over-compression problem). To address these issues, we design two compensation mechanisms based on the high similarity of attention matrices between LLMs of different scales. We propose SmallKV, a small model assisted compensation method for KV cache compression. SmallKV can maintain attention matching between different-scale LLMs to: 1) assist the larger model in perceiving globally important information of attention; and 2) use the smaller model’s attention scores to approximate those of marginal tokens in the larger model. Extensive experiments on benchmarks including GSM8K, BBH, MT-Bench, and LongBench demonstrate the effectiveness of SmallKV. Moreover, efficiency evaluations show that SmallKV achieves 1.75 - 2.56 times higher throughput than baseline methods, highlighting its potential for efficient and performant LLM inference in resource constrained environments. KV 缓存驱逐已成为缓解 LLMs 在长上下文场景中资源限制的有效解决方案。然而，现有的基于 token 级别的驱逐方法常常忽视两个关键方面：（1）其不可逆的驱逐策略无法适应解码过程中动态变化的注意力模式（显著性转移问题）；（2）它们将边缘重要的 token 和真正不重要的 token 一视同仁，尽管边缘 token 对模型性能具有整体重要性（边缘信息过度压缩问题）。为了解决这些问题，我们基于不同规模 LLMs 之间注意力矩阵的高度相似性设计了两种补偿机制。我们提出了 SmallKV，一种由小模型辅助的 KV 缓存压缩补偿方法。SmallKV 能够维持不同规模 LLMs 之间的注意力匹配，以：1）辅助大模型感知全局重要的注意力信息；2）利用小模型的注意力分数来近似大模型中边缘 token 的注意力分数。在 GSM8K、BBH、MT-Bench 和 LongBench 等基准测试上的大量实验验证了 SmallKV 的有效性。此外，效率评估显示 SmallKV 的吞吐量比基线方法高出 1.75 到 2.56 倍，突显了其在资源受限环境中实现高效且高性能 LLM 推理的潜力。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-03 09:15:36 UTC 发布时间：2025-08-03 09:15:36 UTC

#167 Pulse Shape Discrimination Algorithms: Survey and Benchmark #167 脉冲形状判别算法：综述与基准测试

This review presents a comprehensive survey and benchmark of pulse shape discrimination (PSD) algorithms for radiation detection, classifying nearly sixty methods into statistical (time-domain, frequency-domain, neural network-based) and prior-knowledge (machine learning, deep learning) paradigms. We implement and evaluate all algorithms on two standardized datasets: an unlabeled set from a 241Am-9Be source and a time-of-flight labeled set from a 238Pu-9Be source, using metrics including Figure of Merit (FOM), F1-score, ROC-AUC, and inter-method correlations. Our analysis reveals that deep learning models, particularly Multi-Layer Perceptrons (MLPs) and hybrid approaches combining statistical features with neural regression, often outperform traditional methods. We discuss architectural suitabilities, the limitations of FOM, alternative evaluation metrics, and performance across energy thresholds. Accompanying this work, we release an open-source toolbox in Python and MATLAB, along with the datasets, to promote reproducibility and advance PSD research. 本综述对辐射检测中的脉冲形状判别（PSD）算法进行了全面的调研和基准测试，将近六十种方法分为统计类（时域、频域、基于神经网络）和先验知识类（机器学习、深度学习）两大范式。我们在两个标准化数据集上实现并评估了所有算法：一个来自 241Am-9Be 源的无标签数据集和一个来自 238Pu-9Be 源的飞行时间标记数据集，评估指标包括优值（FOM）、F1 分数、ROC-AUC 及方法间相关性。分析结果显示，深度学习模型，尤其是多层感知机（MLP）和结合统计特征与神经回归的混合方法，通常优于传统方法。我们讨论了架构适用性、FOM 的局限性、替代评估指标以及不同能量阈值下的性能表现。为促进可复现性和推动 PSD 研究，本文同时发布了 Python 和 MATLAB 的开源工具箱及相关数据集。

Subjects: Machine Learning, Artificial Intelligence, Nuclear Experiment, Applied Physics, Atomic Physics 主题：机器学习，人工智能，核实验，应用物理，原子物理

Publish: 2025-08-03 04:41:32 UTC 发布时间：2025-08-03 04:41:32 UTC

#168 A Novel cVAE-Augmented Deep Learning Framework for Pan-Cancer RNA-Seq Classification #168 一种新颖的 cVAE 增强深度学习框架用于泛癌症 RNA-Seq 分类

Author: [Vinil Polepalli](https://arxiv.org/search/?searchtype=author&query=Vinil Polepalli) 作者：Vinil Polepalli

Pan-cancer classification using transcriptomic (RNA-Seq) data can inform tumor subtyping and therapy selection, but is challenging due to extremely high dimensionality and limited sample sizes. In this study, we propose a novel deep learning framework that uses a class-conditional variational autoencoder (cVAE) to augment training data for pan-cancer gene expression classification. Using 801 tumor RNA-Seq samples spanning 5 cancer types from The Cancer Genome Atlas (TCGA), we first perform feature selection to reduce 20,531 gene expression features to the 500 most variably expressed genes. A cVAE is then trained on this data to learn a latent representation of gene expression conditioned on cancer type, enabling the generation of synthetic gene expression samples for each tumor class. We augment the training set with these cVAE-generated samples (doubling the dataset size) to mitigate overfitting and class imbalance. A two-layer multilayer perceptron (MLP) classifier is subsequently trained on the augmented dataset to predict tumor type. The augmented framework achieves high classification accuracy (~98%) on a held-out test set, substantially outperforming a classifier trained on the original data alone. We present detailed experimental results, including VAE training curves, classifier performance metrics (ROC curves and confusion matrix), and architecture diagrams to illustrate the approach. The results demonstrate that cVAE-based synthetic augmentation can significantly improve pan-cancer prediction performance, especially for underrepresented cancer classes. 利用转录组（RNA-Seq）数据进行泛癌症分类可以为肿瘤亚型划分和治疗选择提供参考，但由于极高的维度和有限的样本量，这一任务具有挑战性。在本研究中，我们提出了一种新颖的深度学习框架，使用条件变分自编码器（cVAE）来增强泛癌症基因表达分类的训练数据。我们使用来自癌症基因组图谱（TCGA）的 801 个涵盖 5 种癌症类型的肿瘤 RNA-Seq 样本，首先进行特征选择，将 20,531 个基因表达特征缩减到 500 个变异性最高的基因。随后，在该数据上训练 cVAE，学习基于癌症类型条件的基因表达潜在表示，从而能够为每种肿瘤类别生成合成的基因表达样本。我们用这些 cVAE 生成的样本扩充训练集（数据集规模翻倍），以缓解过拟合和类别不平衡问题。最后，在扩充后的数据集上训练一个两层多层感知机（MLP）分类器，用于预测肿瘤类型。增强框架在保留的测试集上实现了高分类准确率（约 98%），显著优于仅使用原始数据训练的分类器。我们展示了详细的实验结果，包括 VAE 训练曲线、分类器性能指标（ROC 曲线和混淆矩阵）以及架构图，以说明该方法。结果表明，基于 cVAE 的合成增强能够显著提升泛癌症预测性能，尤其是在样本较少的癌症类别中表现突出。

Subjects: Genomics, Artificial Intelligence, Machine Learning 主题：基因组学，人工智能，机器学习

Publish: 2025-08-02 16:57:31 UTC 发布时间：2025-08-02 16:57:31 UTC

#169 SpectrumFM: A New Paradigm for Spectrum Cognition #169 SpectrumFM：频谱认知的新范式

Authors: [Chunyu Liu](https://arxiv.org/search/?searchtype=author&query=Chunyu Liu), [Hao Zhang](https://arxiv.org/search/?searchtype=author&query=Hao Zhang), [Wei Wu](https://arxiv.org/search/?searchtype=author&query=Wei Wu), [Fuhui Zhou](https://arxiv.org/search/?searchtype=author&query=Fuhui Zhou), [Qihui Wu](https://arxiv.org/search/?searchtype=author&query=Qihui Wu), [Derrick Wing Kwan Ng](https://arxiv.org/search/?searchtype=author&query=Derrick Wing Kwan Ng), [Chan-Byoung Chae](https://arxiv.org/search/?searchtype=author&query=Chan-Byoung Chae) 作者：刘春雨，张浩，吴伟，周福辉，吴启辉，Derrick Wing Kwan Ng，蔡灿炳

The enhancement of spectrum efficiency and the realization of secure spectrum utilization are critically dependent on spectrum cognition. However, existing spectrum cognition methods often exhibit limited generalization and suboptimal accuracy when deployed across diverse spectrum environments and tasks. To overcome these challenges, we propose a spectrum foundation model, termed SpectrumFM, which provides a new paradigm for spectrum cognition. An innovative spectrum encoder that exploits the convolutional neural networks and the multi-head self attention mechanisms is proposed to effectively capture both fine-grained local signal structures and high-level global dependencies in the spectrum data. To enhance its adaptability, two novel self-supervised learning tasks, namely masked reconstruction and next-slot signal prediction, are developed for pre-training SpectrumFM, enabling the model to learn rich and transferable representations. Furthermore, low-rank adaptation (LoRA) parameter-efficient fine-tuning is exploited to enable SpectrumFM to seamlessly adapt to various downstream spectrum cognition tasks, including spectrum sensing (SS), anomaly detection (AD), and wireless technology classification (WTC). Extensive experiments demonstrate the superiority of SpectrumFM over state-of-the-art methods. Specifically, it improves detection probability in the SS task by 30% at -4 dB signal-to-noise ratio (SNR), boosts the area under the curve (AUC) in the AD task by over 10%, and enhances WTC accuracy by 9.6%. 频谱效率的提升和安全频谱利用的实现关键依赖于频谱感知。然而，现有的频谱感知方法在不同频谱环境和任务中部署时，通常表现出有限的泛化能力和次优的准确性。为克服这些挑战，我们提出了一种频谱基础模型，称为 SpectrumFM，为频谱感知提供了新的范式。我们提出了一种创新的频谱编码器，结合卷积神经网络和多头自注意力机制，有效捕捉频谱数据中细粒度的局部信号结构和高级的全局依赖关系。为了增强其适应性，设计了两种新颖的自监督学习任务，即掩码重建和下一时隙信号预测，用于 SpectrumFM 的预训练，使模型能够学习丰富且可迁移的表示。此外，利用低秩适配（LoRA）参数高效微调，使 SpectrumFM 能够无缝适应各种下游频谱感知任务，包括频谱感知（SS）、异常检测（AD）和无线技术分类（WTC）。大量实验表明，SpectrumFM 优于最先进的方法。具体而言，在-4 dB 信噪比（SNR）下，SS 任务的检测概率提高了 30%；AD 任务的曲线下面积（AUC）提升超过 10%；WTC 的准确率提升了 9.6%。

Subjects: Signal Processing, Artificial Intelligence, Machine Learning 主题：信号处理，人工智能，机器学习

Publish: 2025-08-02 14:40:50 UTC 发布时间：2025-08-02 14:40:50 UTC

#170 DeepGB-TB: A Risk-Balanced Cross-Attention Gradient-Boosted Convolutional Network for Rapid, Interpretable Tuberculosis Screening #170 DeepGB-TB：一种风险平衡的交叉注意力梯度提升卷积网络，用于快速且可解释的结核病筛查

Large-scale tuberculosis (TB) screening is limited by the high cost and operational complexity of traditional diagnostics, creating a need for artificial-intelligence solutions. We propose DeepGB-TB, a non-invasive system that instantly assigns TB risk scores using only cough audio and basic demographic data. The model couples a lightweight one-dimensional convolutional neural network for audio processing with a gradient-boosted decision tree for tabular features. Its principal innovation is a Cross-Modal Bidirectional Cross-Attention module (CM-BCA) that iteratively exchanges salient cues between modalities, emulating the way clinicians integrate symptoms and risk factors. To meet the clinical priority of minimizing missed cases, we design a Tuberculosis Risk-Balanced Loss (TRBL) that places stronger penalties on false-negative predictions, thereby reducing high-risk misclassifications. DeepGB-TB is evaluated on a diverse dataset of 1,105 patients collected across seven countries, achieving an AUROC of 0.903 and an F1-score of 0.851, representing a new state of the art. Its computational efficiency enables real-time, offline inference directly on common mobile devices, making it ideal for low-resource settings. Importantly, the system produces clinically validated explanations that promote trust and adoption by frontline health workers. By coupling AI innovation with public-health requirements for speed, affordability, and reliability, DeepGB-TB offers a tool for advancing global TB control. 大规模结核病（TB）筛查受限于传统诊断方法的高成本和操作复杂性，因而亟需人工智能解决方案。我们提出了 DeepGB-TB，一种非侵入式系统，仅使用咳嗽音频和基本人口统计数据即可即时分配结核病风险评分。该模型结合了用于音频处理的轻量级一维卷积神经网络和用于表格特征的梯度提升决策树。其主要创新是跨模态双向交叉注意力模块（CM-BCA），该模块在模态间迭代交换显著线索，模拟临床医生整合症状和风险因素的方式。为满足临床上减少漏诊的优先需求，我们设计了结核病风险平衡损失（TRBL），对假阴性预测施加更强惩罚，从而降低高风险误分类。DeepGB-TB 在涵盖七个国家的 1,105 名患者的多样化数据集上进行了评估，取得了 0.903 的 AUROC 和 0.851 的 F1 分数，代表了新的技术领先水平。其计算效率使得能够在常见移动设备上实现实时、离线推理，非常适合资源有限的环境。重要的是，该系统生成经过临床验证的解释，促进了一线医护人员的信任和采用。通过将人工智能创新与公共卫生对速度、经济性和可靠性的需求相结合，DeepGB-TB 提供了一个推动全球结核病控制的工具。

Subjects: Machine Learning, Artificial Intelligence, Sound 主题：机器学习，人工智能，声音

Publish: 2025-08-02 14:11:07 UTC 发布时间：2025-08-02 14:11:07 UTC

#171 Who Gets Cited? Gender- and Majority-Bias in LLM-Driven Reference Selection #171 谁被引用？LLM 驱动的参考文献选择中的性别和多数偏见

Author: [Jiangen He](https://arxiv.org/search/?searchtype=author&query=Jiangen He) 作者：贺建根

Large language models (LLMs) are rapidly being adopted as research assistants, particularly for literature review and reference recommendation, yet little is known about whether they introduce demographic bias into citation workflows. This study systematically investigates gender bias in LLM-driven reference selection using controlled experiments with pseudonymous author names. We evaluate several LLMs (GPT-4o, GPT-4o-mini, Claude Sonnet, and Claude Haiku) by varying gender composition within candidate reference pools and analyzing selection patterns across fields. Our results reveal two forms of bias: a persistent preference for male-authored references and a majority-group bias that favors whichever gender is more prevalent in the candidate pool. These biases are amplified in larger candidate pools and only modestly attenuated by prompt-based mitigation strategies. Field-level analysis indicates that bias magnitude varies across scientific domains, with social sciences showing the least bias. Our findings indicate that LLMs can reinforce or exacerbate existing gender imbalances in scholarly recognition. Effective mitigation strategies are needed to avoid perpetuating existing gender disparities in scientific citation practices before integrating LLMs into high-stakes academic workflows. 大型语言模型（LLMs）正迅速被采用为研究助理，特别是在文献综述和参考文献推荐方面，但关于它们是否会在引用工作流程中引入人口统计学偏见知之甚少。本研究通过使用化名作者姓名的受控实验，系统地调查了 LLM 驱动的参考文献选择中的性别偏见。我们评估了多种 LLMs（GPT-4o、GPT-4o-mini、Claude Sonnet 和 Claude Haiku），通过改变候选参考文献池中的性别组成，并分析各领域的选择模式。结果揭示了两种偏见形式：对男性作者参考文献的持续偏好，以及多数群体偏见，即偏向于候选池中占多数的性别。这些偏见在较大的候选池中被放大，且仅通过基于提示的缓解策略得到有限的减轻。领域层面的分析表明，偏见程度在不同科学领域间存在差异，社会科学领域的偏见最小。我们的研究结果表明，LLMs 可能强化或加剧学术认可中现有的性别不平衡。在将 LLMs 整合到高风险学术工作流程之前，需要有效的缓解策略，以避免延续现有科学引用实践中的性别差异。

Subjects: Digital Libraries, Artificial Intelligence, Computers and Society 主题：数字图书馆，人工智能，计算机与社会

Publish: 2025-08-02 13:27:32 UTC 发布时间：2025-08-02 13:27:32 UTC

#172 Kronos: A Foundation Model for the Language of Financial Markets #172 Kronos：金融市场语言的基础模型

Authors: [Yu Shi](https://arxiv.org/search/?searchtype=author&query=Yu Shi), [Zongliang Fu](https://arxiv.org/search/?searchtype=author&query=Zongliang Fu), [Shuo Chen](https://arxiv.org/search/?searchtype=author&query=Shuo Chen), [Bohan Zhao](https://arxiv.org/search/?searchtype=author&query=Bohan Zhao), [Wei Xu](https://arxiv.org/search/?searchtype=author&query=Wei Xu), [Changshui Zhang](https://arxiv.org/search/?searchtype=author&query=Changshui Zhang), [Jian Li](https://arxiv.org/search/?searchtype=author&query=Jian Li) 作者：石宇、付宗良、陈硕、赵博涵、徐伟、张长水、李健

The success of large-scale pre-training paradigm, exemplified by Large Language Models (LLMs), has inspired the development of Time Series Foundation Models (TSFMs). However, their application to financial candlestick (K-line) data remains limited, often underperforming non-pre-trained architectures. Moreover, existing TSFMs often overlook crucial downstream tasks such as volatility prediction and synthetic data generation. To address these limitations, we propose Kronos, a unified, scalable pre-training framework tailored to financial K-line modeling. Kronos introduces a specialized tokenizer that discretizes continuous market information into token sequences, preserving both price dynamics and trade activity patterns. We pre-train Kronos using an autoregressive objective on a massive, multi-market corpus of over 12 billion K-line records from 45 global exchanges, enabling it to learn nuanced temporal and cross-asset representations. Kronos excels in a zero-shot setting across a diverse set of financial tasks. On benchmark datasets, Kronos boosts price series forecasting RankIC by 93% over the leading TSFM and 87% over the best non-pre-trained baseline. It also achieves a 9% lower MAE in volatility forecasting and a 22% improvement in generative fidelity for synthetic K-line sequences. These results establish Kronos as a robust, versatile foundation model for end-to-end financial time series analysis. Our pre-trained model is publicly available at https://github.com/shiyu-coder/Kronos. 大规模预训练范式的成功，以大型语言模型（LLMs）为代表，激发了时间序列基础模型（TSFMs）的发展。然而，它们在金融 K 线（蜡烛图）数据上的应用仍然有限，常常表现不及非预训练架构。此外，现有的 TSFMs 往往忽视了诸如波动率预测和合成数据生成等关键下游任务。为了解决这些限制，我们提出了 Kronos，一种统一且可扩展的预训练框架，专为金融 K 线建模设计。Kronos 引入了一种专门的分词器，将连续的市场信息离散化为令牌序列，同时保留价格动态和交易活动模式。我们在一个包含来自 45 个全球交易所、超过 120 亿条 K 线记录的大规模多市场语料库上，采用自回归目标对 Kronos 进行预训练，使其能够学习细致的时间和跨资产表示。Kronos 在零样本设置下，在多样化的金融任务中表现出色。在基准数据集上，Kronos 将价格序列预测的 RankIC 较领先的 TSFM 提升了 93%，较最佳非预训练基线提升了 87%。它在波动率预测中实现了 9%的 MAE 降低，在合成 K 线序列的生成保真度上提升了 22%。这些结果确立了 Kronos 作为一个强大且多功能的端到端金融时间序列分析基础模型。我们的预训练模型已公开发布，地址为：https://github.com/shiyu-coder/Kronos。

Subjects: Statistical Finance, Artificial Intelligence, Machine Learning 主题：统计金融，人工智能，机器学习

Publish: 2025-08-02 13:15:59 UTC 发布时间：2025-08-02 13:15:59 UTC

#173 A Note on Code Quality Score: LLMs for Maintainable Large Codebases #173 关于代码质量评分的说明：用于可维护大型代码库的 LLMs

Maintaining code quality in large-scale software systems presents significant challenges, particularly in settings where a large numbers of engineers work concurrently on a codebase. This paper introduces Code Quality Score (CQS) system to automatically detect issues with a set of code changes and provide actionable insights. At its core, the CQS system is powered by two Llama3 models, fine-tuned (with SFT and offline RL approaches), to a) detect common code quality issues related to coding best practices and b) to provide good ``critiques’’ for LLM-generated code review respectively. To maintain good user experience, we layer the system with hand-crafted rules to filter out incorrect responses/hallucinations. Offline evaluations show that our CQS system is able to achieve an impressive precision rate for identifying valid issues. This system has already been rolled out to developers in an industrial scale setting and has consistently achieved 60% week over week user helpfulness rate, demonstrating its effectiveness in a real-world environment. In this paper, we present details of the CQS system along with some learnings on curating developer feedback to create training data for LLM fine-tuning. 在大规模软件系统中维护代码质量面临重大挑战，尤其是在大量工程师同时协作开发同一代码库的环境下。本文介绍了一种代码质量评分（Code Quality Score，CQS）系统，用于自动检测一组代码变更中的问题并提供可操作的见解。CQS 系统的核心由两个经过微调（采用 SFT 和离线强化学习方法）的 Llama3 模型驱动，分别用于 a）检测与编码最佳实践相关的常见代码质量问题，b）为 LLM 生成的代码审查提供良好的“批评”意见。为了保持良好的用户体验，我们在系统中加入了手工制定的规则，以过滤错误响应和幻觉。离线评估显示，我们的 CQS 系统在识别有效问题方面能够达到令人印象深刻的精确率。该系统已在工业规模环境中向开发者推广，并持续实现每周 60%的用户帮助率，证明了其在实际环境中的有效性。在本文中，我们介绍了 CQS 系统的详细信息，以及在策划开发者反馈以创建用于 LLM 微调的训练数据方面的一些经验。

Subjects: Software Engineering, Artificial Intelligence 主题：软件工程，人工智能

Publish: 2025-08-01 21:09:45 UTC 发布时间：2025-08-01 21:09:45 UTC

#174 Teaching at Scale: Leveraging AI to Evaluate and Elevate Engineering Education #174 大规模教学：利用人工智能评估和提升工程教育

Authors: [Jean-Francois Chamberland](https://arxiv.org/search/?searchtype=author&query=Jean-Francois Chamberland), [Martin C. Carlisle](https://arxiv.org/search/?searchtype=author&query=Martin C. Carlisle), [Arul Jayaraman](https://arxiv.org/search/?searchtype=author&query=Arul Jayaraman), [Krishna R. Narayanan](https://arxiv.org/search/?searchtype=author&query=Krishna R. Narayanan), [Sunay Palsole](https://arxiv.org/search/?searchtype=author&query=Sunay Palsole), [Karan Watson](https://arxiv.org/search/?searchtype=author&query=Karan Watson) 作者：Jean-Francois Chamberland，Martin C. Carlisle，Arul Jayaraman，Krishna R. Narayanan，Sunay Palsole，Karan Watson

Subjects: Computers and Society, Artificial Intelligence, Computation and Language 主题：计算机与社会，人工智能，计算与语言

Publish: 2025-08-01 20:27:40 UTC 发布时间：2025-08-01 20:27:40 UTC

#175 Interpreting Performance Profiles with Deep Learning #175 使用深度学习解读性能曲线

Author: [Zhuoran Liu](https://arxiv.org/search/?searchtype=author&query=Zhuoran Liu) 作者：刘卓然

Profiling tools (also known as profilers) play an important role in understanding program performance at runtime, such as hotspots, bottlenecks, and inefficiencies. While profilers have been proven to be useful, they give extra burden to software engineers. Software engineers, as the users, are responsible to interpret the complex performance data and identify actionable optimization in program source code. However, it can be challenging for users to associate inefficiencies with the program semantics, especially if the users are not the authors of the code, which limits the applicability of profilers. In this thesis, we explore a new direction to combine performance profiles and program semantics with a deep learning approach. The key idea is to glean code summary for semantic information (at a certain level) and integrate it into a profiler, which can better understand program inefficiencies for actionable optimization. To be concrete, we combine profiles generated by Async Profiler (the state-of-the-art Java profiler) with code summarization from a fine-tuned CodeBERT-based model. We demonstrate the code summaries of any selected call path in a graphic user interface. Our system can effectively assist analysis on many Java benchmarks. 性能分析工具（也称为分析器）在理解程序运行时性能方面起着重要作用，例如热点、瓶颈和低效之处。虽然分析器已被证明非常有用，但它们给软件工程师带来了额外负担。作为用户的软件工程师需要负责解释复杂的性能数据，并在程序源代码中识别可执行的优化措施。然而，用户将低效与程序语义关联起来可能具有挑战性，尤其当用户不是代码的作者时，这限制了分析器的适用性。在本论文中，我们探索了一种结合性能分析数据和程序语义的深度学习新方向。关键思想是提取代码摘要以获取语义信息（在某一层面），并将其整合到分析器中，从而更好地理解程序低效之处以实现可执行的优化。具体来说，我们将由 Async Profiler（最先进的 Java 分析器）生成的性能分析数据与基于微调 CodeBERT 模型的代码摘要相结合。我们在图形用户界面中展示了任意选定调用路径的代码摘要。我们的系统可以有效地辅助分析许多 Java 基准测试。

Subjects: Software Engineering, Artificial Intelligence, Performance 主题：软件工程，人工智能，性能

Publish: 2025-08-01 17:23:41 UTC 发布时间：2025-08-01 17:23:41 UTC

#176 Forecasting NCAA Basketball Outcomes with Deep Learning: A Comparative Study of LSTM and Transformer Models #176 使用深度学习预测 NCAA 篮球比赛结果：LSTM 与 Transformer 模型的比较研究

Author: [Md Imtiaz Habib](https://arxiv.org/search/?searchtype=author&query=Md Imtiaz Habib) 作者：Md Imtiaz Habib

In this research, I explore advanced deep learning methodologies to forecast the outcomes of the 2025 NCAA Division 1 Men’s and Women’s Basketball tournaments. Leveraging historical NCAA game data, I implement two sophisticated sequence-based models: Long Short-Term Memory (LSTM) and Transformer architectures. The predictive power of these models is augmented through comprehensive feature engineering, including team quality metrics derived from Generalized Linear Models (GLM), Elo ratings, seed differences, and aggregated box-score statistics. To evaluate the robustness and reliability of predictions, I train each model variant using both Binary Cross-Entropy (BCE) and Brier loss functions, providing insights into classification performance and probability calibration. My comparative analysis reveals that while the Transformer architecture optimized with BCE yields superior discriminative power (highest AUC of 0.8473), the LSTM model trained with Brier loss demonstrates superior probabilistic calibration (lowest Brier score of 0.1589). These findings underscore the importance of selecting appropriate model architectures and loss functions based on the specific requirements of forecasting tasks. The detailed analytical pipeline presented here serves as a reproducible framework for future predictive modeling tasks in sports analytics and beyond. 在本研究中，我探索了先进的深度学习方法，以预测 2025 年 NCAA 一级男子和女子篮球锦标赛的结果。利用历史 NCAA 比赛数据，我实现了两种复杂的基于序列的模型：长短期记忆网络（LSTM）和 Transformer 架构。通过全面的特征工程增强了这些模型的预测能力，包括基于广义线性模型（GLM）得出的球队质量指标、Elo 评分、种子排名差异以及汇总的比赛统计数据。为了评估预测的稳健性和可靠性，我分别使用二元交叉熵（BCE）和 Brier 损失函数训练每个模型变体，从而提供了分类性能和概率校准的见解。我的比较分析显示，虽然使用 BCE 优化的 Transformer 架构在判别能力上表现更优（最高 AUC 为 0.8473），但采用 Brier 损失训练的 LSTM 模型在概率校准方面表现更佳（最低 Brier 分数为 0.1589）。这些发现强调了根据预测任务的具体需求选择合适的模型架构和损失函数的重要性。本文所展示的详细分析流程为未来体育分析及其他领域的预测建模任务提供了一个可复现的框架。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-01 14:01:44 UTC 发布时间：2025-08-01 14:01:44 UTC

#177 Veli: Unsupervised Method and Unified Benchmark for Low-Cost Air Quality Sensor Correction #177 Veli：低成本空气质量传感器校正的无监督方法与统一基准

Authors: [Yahia Dalbah](https://arxiv.org/search/?searchtype=author&query=Yahia Dalbah), [Marcel Worring](https://arxiv.org/search/?searchtype=author&query=Marcel Worring), [Yen-Chia Hsu](https://arxiv.org/search/?searchtype=author&query=Yen-Chia Hsu) 作者：Yahia Dalbah，Marcel Worring，Yen-Chia Hsu

Urban air pollution is a major health crisis causing millions of premature deaths annually, underscoring the urgent need for accurate and scalable monitoring of air quality (AQ). While low-cost sensors (LCS) offer a scalable alternative to expensive reference-grade stations, their readings are affected by drift, calibration errors, and environmental interference. To address these challenges, we introduce Veli (Reference-free Variational Estimation via Latent Inference), an unsupervised Bayesian model that leverages variational inference to correct LCS readings without requiring co-location with reference stations, eliminating a major deployment barrier. Specifically, Veli constructs a disentangled representation of the LCS readings, effectively separating the true pollutant reading from the sensor noise. To build our model and address the lack of standardized benchmarks in AQ monitoring, we also introduce the Air Quality Sensor Data Repository (AQ-SDR). AQ-SDR is the largest AQ sensor benchmark to date, with readings from 23,737 LCS and reference stations across multiple regions. Veli demonstrates strong generalization across both in-distribution and out-of-distribution settings, effectively handling sensor drift and erratic sensor behavior. Code for model and dataset will be made public when this paper is published. 城市空气污染是一个重大的健康危机，每年导致数百万过早死亡，凸显了对空气质量（AQ）进行准确且可扩展监测的紧迫需求。虽然低成本传感器（LCS）为昂贵的参考级监测站提供了一种可扩展的替代方案，但其读数受到漂移、校准误差和环境干扰的影响。为了解决这些挑战，我们提出了 Veli（基于潜变量推断的无参考变分估计），这是一种无监督的贝叶斯模型，利用变分推断来校正 LCS 读数，无需与参考站点共址，从而消除了部署中的主要障碍。具体而言，Veli 构建了 LCS 读数的解耦表示，有效地将真实污染物读数与传感器噪声分离开来。为了构建我们的模型并解决空气质量监测中缺乏标准化基准的问题，我们还引入了空气质量传感器数据仓库（AQ-SDR）。AQ-SDR 是迄今为止最大的空气质量传感器基准，包含来自多个区域的 23,737 个 LCS 和参考站的读数。 Veli 在分布内和分布外环境中均表现出强大的泛化能力，有效应对传感器漂移和传感器异常行为。模型和数据集的代码将在本文发表时公开。

Subjects: Signal Processing, Artificial Intelligence, Machine Learning 主题：信号处理，人工智能，机器学习

Publish: 2025-08-01 10:06:28 UTC 发布时间：2025-08-01 10:06:28 UTC

#178 Mathematical Foundations of Geometric Deep Learning #178 几何深度学习的数学基础

Authors: [Haitz Sáez de Ocáriz Borde](https://arxiv.org/search/?searchtype=author&query=Haitz Sáez de Ocáriz Borde), [Michael Bronstein](https://arxiv.org/search/?searchtype=author&query=Michael Bronstein) 作者：Haitz Sáez de Ocáriz Borde，Michael Bronstein

We review the key mathematical concepts necessary for studying Geometric Deep Learning. 我们回顾了研究几何深度学习所需的关键数学概念。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-01 06:02:39 UTC 发布时间：2025-08-01 06:02:39 UTC

#179 Blueprint First, Model Second: A Framework for Deterministic LLM Workflow #179 先有蓝图，后有模型：确定性 LLM 工作流程框架

While powerful, the inherent non-determinism of large language model (LLM) agents limits their application in structured operational environments where procedural fidelity and predictable execution are strict requirements. This limitation stems from current architectures that conflate probabilistic, high-level planning with low-level action execution within a single generative process. To address this, we introduce the Source Code Agent framework, a new paradigm built on the “Blueprint First, Model Second” philosophy. Our framework decouples the workflow logic from the generative model. An expert-defined operational procedure is first codified into a source code-based Execution Blueprint, which is then executed by a deterministic engine. The LLM is strategically invoked as a specialized tool to handle bounded, complex sub-tasks within the workflow, but never to decide the workflow’s path. We conduct a comprehensive evaluation on the challenging tau-bench benchmark, designed for complex user-tool-rule scenarios. Our results demonstrate that the Source Code Agent establishes a new state-of-the-art, outperforming the strongest baseline by 10.1 percentage points on the average Pass^1 score while dramatically improving execution efficiency. Our work enables the verifiable and reliable deployment of autonomous agents in applications governed by strict procedural logic. 虽然功能强大，但大型语言模型（LLM）代理固有的非确定性限制了其在程序严格遵循和执行可预测性要求高的结构化操作环境中的应用。这一限制源于当前架构将概率性、高层次规划与低层次动作执行混合在单一生成过程中。为了解决这一问题，我们引入了源代码代理框架，这是一种基于“先蓝图，后模型”理念的新范式。我们的框架将工作流程逻辑与生成模型解耦。专家定义的操作流程首先被编码为基于源代码的执行蓝图，然后由确定性引擎执行。LLM 被策略性地调用，作为处理工作流程中有限且复杂子任务的专用工具，但绝不用于决定工作流程的路径。我们在设计用于复杂用户-工具-规则场景的挑战性 tau-bench 基准上进行了全面评估。我们的结果表明，源代码代理建立了新的最先进水平，在平均 Pass^1 得分上比最强基线高出 10.1 个百分点，同时显著提高了执行效率。我们的工作使得在受严格程序逻辑控制的应用中，能够实现可验证且可靠的自主代理部署。

Subjects: Software Engineering, Artificial Intelligence, Programming Languages 主题：软件工程，人工智能，编程语言

Publish: 2025-08-01 03:10:00 UTC 发布时间：2025-08-01 03:10:00 UTC

#180 ECGTwin: Personalized ECG Generation Using Controllable Diffusion Model #180 ECGTwin：使用可控扩散模型的个性化心电图生成

Authors: [Yongfan Lai](https://arxiv.org/search/?searchtype=author&query=Yongfan Lai), [Bo Liu](https://arxiv.org/search/?searchtype=author&query=Bo Liu), [Xinyan Guan](https://arxiv.org/search/?searchtype=author&query=Xinyan Guan), [Qinghao Zhao](https://arxiv.org/search/?searchtype=author&query=Qinghao Zhao), [Hongyan Li](https://arxiv.org/search/?searchtype=author&query=Hongyan Li), [Shenda Hong](https://arxiv.org/search/?searchtype=author&query=Shenda Hong) 作者：赖永凡，刘波，关欣妍，赵庆浩，李红艳，洪申达

Personalized electrocardiogram (ECG) generation is to simulate a patient’s ECG digital twins tailored to specific conditions. It has the potential to transform traditional healthcare into a more accurate individualized paradigm, while preserving the key benefits of conventional population-level ECG synthesis. However, this promising task presents two fundamental challenges: extracting individual features without ground truth and injecting various types of conditions without confusing generative model. In this paper, we present ECGTwin, a two-stage framework designed to address these challenges. In the first stage, an Individual Base Extractor trained via contrastive learning robustly captures personal features from a reference ECG. In the second stage, the extracted individual features, along with a target cardiac condition, are integrated into the diffusion-based generation process through our novel AdaX Condition Injector, which injects these signals via two dedicated and specialized pathways. Both qualitative and quantitative experiments have demonstrated that our model can not only generate ECG signals of high fidelity and diversity by offering a fine-grained generation controllability, but also preserving individual-specific features. Furthermore, ECGTwin shows the potential to enhance ECG auto-diagnosis in downstream application, confirming the possibility of precise personalized healthcare solutions. 个性化心电图（ECG）生成旨在模拟针对特定病情的患者心电图数字孪生体。它有潜力将传统医疗转变为更精准的个体化范式，同时保留传统群体水平心电图合成的关键优势。然而，这一有前景的任务面临两个根本性挑战：在无真实标签的情况下提取个体特征，以及在不干扰生成模型的前提下注入多种类型的病情条件。本文提出了 ECGTwin，一个旨在解决这些挑战的两阶段框架。第一阶段，通过对比学习训练的个体基础提取器能够稳健地从参考心电图中捕获个人特征。第二阶段，提取的个体特征与目标心脏病情一起，通过我们新颖的 AdaX 条件注入器整合到基于扩散的生成过程中，该注入器通过两个专用且专业的路径注入这些信号。定性和定量实验均表明，我们的模型不仅能够通过提供细粒度的生成可控性来生成高保真度和多样性的心电图信号，还能保留个体特异性特征。此外，ECGTwin 展示了在下游应用中提升心电图自动诊断的潜力，验证了实现精准个性化医疗解决方案的可能性。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-01 02:58:11 UTC 发布时间：2025-08-01 02:58:11 UTC

#181 ZetA: A Riemann Zeta-Scaled Extension of Adam for Deep Learning #181 ZetA：一种基于黎曼ζ函数缩放的 Adam 扩展，用于深度学习

Author: [Samiksha BC](https://arxiv.org/search/?searchtype=author&query=Samiksha BC) 作者：Samiksha BC

This work introduces ZetA, a novel deep learning optimizer that extends Adam by incorporating dynamic scaling based on the Riemann zeta function. To the best of our knowledge, ZetA is the first optimizer to apply zeta-based gradient scaling within deep learning optimization. The method improves generalization and robustness through a hybrid update mechanism that integrates adaptive damping, cosine similarity-based momentum boosting, entropy-regularized loss, and Sharpness-Aware Minimization (SAM)-style perturbations. Empirical evaluations on SVHN, CIFAR10, CIFAR100, STL10, and noisy CIFAR10 consistently show test accuracy improvements over Adam. All experiments employ a lightweight fully connected network trained for five epochs under mixed-precision settings. The results demonstrate that ZetA is a computationally efficient and robust alternative to Adam, particularly effective in noisy or high-granularity classification tasks. 本研究介绍了 ZetA，一种新颖的深度学习优化器，通过基于黎曼ζ函数的动态缩放扩展了 Adam 优化器。据我们所知，ZetA 是首个在深度学习优化中应用基于ζ函数的梯度缩放的优化器。该方法通过集成自适应阻尼、基于余弦相似度的动量增强、熵正则化损失以及类似 Sharpness-Aware Minimization（SAM）的扰动，采用混合更新机制提升了泛化能力和鲁棒性。在 SVHN、CIFAR10、CIFAR100、STL10 和带噪声的 CIFAR10 上的实证评估均显示其测试准确率优于 Adam。所有实验均采用轻量级全连接网络，在混合精度设置下训练五个周期。结果表明，ZetA 是一种计算效率高且鲁棒的 Adam 替代方案，尤其适用于噪声较大或细粒度分类任务。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-08-01 02:53:29 UTC 发布：2025-08-01 02:53:29 UTC

#182 SleepLiteCNN: Energy-Efficient Sleep Apnea Subtype Classification with 1-Second Resolution Using Single-Lead ECG #182 SleepLiteCNN：使用单导联心电图实现 1 秒分辨率的节能型睡眠呼吸暂停亚型分类

Authors: [Zahra Mohammadi](https://arxiv.org/search/?searchtype=author&query=Zahra Mohammadi), [Siamak Mohammadi](https://arxiv.org/search/?searchtype=author&query=Siamak Mohammadi) 作者：Zahra Mohammadi，Siamak Mohammadi

Apnea is a common sleep disorder characterized by breathing interruptions lasting at least ten seconds and occurring more than five times per hour. Accurate, high-temporal-resolution detection of sleep apnea subtypes - Obstructive, Central, and Mixed - is crucial for effective treatment and management. This paper presents an energy-efficient method for classifying these subtypes using a single-lead electrocardiogram (ECG) with high temporal resolution to address the real-time needs of wearable devices. We evaluate a wide range of classical machine learning algorithms and deep learning architectures on 1-second ECG windows, comparing their accuracy, complexity, and energy consumption. Based on this analysis, we introduce SleepLiteCNN, a compact and energy-efficient convolutional neural network specifically designed for wearable platforms. SleepLiteCNN achieves over 95% accuracy and a 92% macro-F1 score, while requiring just 1.8 microjoules per inference after 8-bit quantization. Field Programmable Gate Array (FPGA) synthesis further demonstrates significant reductions in hardware resource usage, confirming its suitability for continuous, real-time monitoring in energy-constrained environments. These results establish SleepLiteCNN as a practical and effective solution for wearable device sleep apnea subtype detection. 睡眠呼吸暂停是一种常见的睡眠障碍，其特征是呼吸中断持续至少十秒钟且每小时发生超过五次。准确且高时间分辨率地检测睡眠呼吸暂停的亚型——阻塞型、中枢型和混合型——对于有效治疗和管理至关重要。本文提出了一种节能的方法，利用单导联高时间分辨率心电图（ECG）对这些亚型进行分类，以满足可穿戴设备的实时需求。我们在 1 秒的 ECG 窗口上评估了多种经典机器学习算法和深度学习架构，比较了它们的准确性、复杂度和能耗。基于此分析，我们引入了 SleepLiteCNN，一种专为可穿戴平台设计的紧凑且节能的卷积神经网络。SleepLiteCNN 在 8 位量化后，每次推理仅需 1.8 微焦耳，准确率超过 95%，宏 F1 分数达到 92%。现场可编程门阵列（FPGA）综合进一步展示了硬件资源使用的大幅减少，确认其适用于能源受限环境中的连续实时监测。这些结果确立了 SleepLiteCNN 作为可穿戴设备睡眠呼吸暂停亚型检测的实用且有效的解决方案。

Subjects: Signal Processing, Artificial Intelligence 主题：信号处理，人工智能

Publish: 2025-08-01 00:04:40 UTC 发布时间：2025-08-01 00:04:40 UTC

#183 A Bayesian Hybrid Parameter-Efficient Fine-Tuning Method for Large Language Models #183 一种用于大型语言模型的贝叶斯混合参数高效微调方法

Authors: [Yidong Chai](https://arxiv.org/search/?searchtype=author&query=Yidong Chai), [Yang Liu](https://arxiv.org/search/?searchtype=author&query=Yang Liu), [Yonghang Zhou](https://arxiv.org/search/?searchtype=author&query=Yonghang Zhou), [Jiaheng Xie](https://arxiv.org/search/?searchtype=author&query=Jiaheng Xie), [Daniel Dajun Zeng](https://arxiv.org/search/?searchtype=author&query=Daniel Dajun Zeng) 作者：柴一东，刘洋，周永航，谢家恒，曾大军

Large Language Models (LLMs) have demonstrated transformative potential in reshaping the world. As these models are pretrained on general corpora, they often require domain-specific fine-tuning to optimize performance in specialized business applications. Due to their massive scale, parameter-efficient fine-tuning (PEFT) methods are widely used to reduce training costs. Among them, hybrid PEFT methods that combine multiple PEFT techniques have achieved the best performance. However, existing hybrid PEFT methods face two main challenges when fine-tuning LLMs for specialized applications: (1) relying on point estimates, lacking the ability to quantify uncertainty for reliable decision-making, and (2) struggling to dynamically adapt to emerging data, lacking the ability to suit real-world situations. We propose Bayesian Hybrid Parameter-Efficient Fine-Tuning (BH-PEFT), a novel method that integrates Bayesian learning into hybrid PEFT. BH-PEFT combines Adapter, LoRA, and prefix-tuning to fine-tune feedforward and attention layers of the Transformer. By modeling learnable parameters as distributions, BH-PEFT enables uncertainty quantification. We further propose a Bayesian dynamic fine-tuning approach where the last posterior serves as the prior for the next round, enabling effective adaptation to new data. We evaluated BH-PEFT on business tasks such as sentiment analysis, news categorization, and commonsense reasoning. Results show that our method outperforms existing PEFT baselines, enables uncertainty quantification for more reliable decisions, and improves adaptability in dynamic scenarios. This work contributes to business analytics and data science by proposing a novel BH-PEFT method and dynamic fine-tuning approach that support uncertainty-aware and adaptive decision-making in real-world situations. 大型语言模型（LLMs）在重塑世界方面展现了变革性的潜力。由于这些模型是在通用语料库上预训练的，通常需要针对特定领域进行微调，以优化其在专业业务应用中的表现。鉴于其庞大的规模，参数高效微调（PEFT）方法被广泛采用以降低训练成本。其中，结合多种 PEFT 技术的混合 PEFT 方法取得了最佳性能。然而，现有的混合 PEFT 方法在针对专业应用微调 LLMs 时面临两个主要挑战：（1）依赖点估计，缺乏量化不确定性的能力，难以实现可靠决策；（2）难以动态适应新兴数据，缺乏适应现实情况的能力。我们提出了贝叶斯混合参数高效微调（BH-PEFT），这是一种将贝叶斯学习整合到混合 PEFT 中的新方法。BH-PEFT 结合了 Adapter、LoRA 和 prefix-tuning，对 Transformer 的前馈层和注意力层进行微调。通过将可学习参数建模为分布，BH-PEFT 实现了不确定性量化。我们进一步提出了一种贝叶斯动态微调方法，其中上一轮的后验作为下一轮的先验，从而实现对新数据的有效适应。我们在情感分析、新闻分类和常识推理等业务任务上评估了 BH-PEFT。结果表明，我们的方法优于现有的 PEFT 基线，能够进行不确定性量化以支持更可靠的决策，并提升了动态场景下的适应能力。本工作通过提出一种新颖的 BH-PEFT 方法和动态微调策略，为业务分析和数据科学领域贡献了支持不确定性感知和自适应决策的解决方案。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-07-31 01:20:20 UTC 发布时间：2025-07-31 01:20:20 UTC

#184 Evaluation of Deep Learning Models for LBBB Classification in ECG Signals #184 心电图信号中 LBBB 分类的深度学习模型评估

Authors: [Beatriz Macas Ordóñez](https://arxiv.org/search/?searchtype=author&query=Beatriz Macas Ordóñez), [Diego Vinicio Orellana Villavicencio](https://arxiv.org/search/?searchtype=author&query=Diego Vinicio Orellana Villavicencio), [José Manuel Ferrández](https://arxiv.org/search/?searchtype=author&query=José Manuel Ferrández), [Paula Bonomini](https://arxiv.org/search/?searchtype=author&query=Paula Bonomini) 作者：Beatriz Macas Ordóñez, Diego Vinicio Orellana Villavicencio, José Manuel Ferrández, Paula Bonomini

This study explores different neural network architectures to evaluate their ability to extract spatial and temporal patterns from electrocardiographic (ECG) signals and classify them into three groups: healthy subjects, Left Bundle Branch Block (LBBB), and Strict Left Bundle Branch Block (sLBBB). Clinical Relevance, Innovative technologies enable the selection of candidates for Cardiac Resynchronization Therapy (CRT) by optimizing the classification of subjects with Left Bundle Branch Block (LBBB). 本研究探讨了不同的神经网络架构，以评估其从心电图（ECG）信号中提取时空模式并将其分类为三组的能力：健康受试者、左束支传导阻滞（LBBB）和严格左束支传导阻滞（sLBBB）。临床相关性方面，创新技术通过优化左束支传导阻滞（LBBB）患者的分类，实现了心脏再同步治疗（CRT）候选者的筛选。

Subjects: Signal Processing, Artificial Intelligence, Machine Learning 主题：信号处理，人工智能，机器学习

Publish: 2025-07-30 22:11:05 UTC 发布时间：2025-07-30 22:11:05 UTC

#185 AnnoSense: A Framework for Physiological Emotion Data Collection in Everyday Settings for AI #185 AnnoSense：一个用于日常环境中生理情绪数据采集的框架，面向人工智能

Authors: [Pragya Singh](https://arxiv.org/search/?searchtype=author&query=Pragya Singh), [Ankush Gupta](https://arxiv.org/search/?searchtype=author&query=Ankush Gupta), [Mohan Kumar](https://arxiv.org/search/?searchtype=author&query=Mohan Kumar), [Pushpendra Singh](https://arxiv.org/search/?searchtype=author&query=Pushpendra Singh) 作者：Pragya Singh、Ankush Gupta、Mohan Kumar、Pushpendra Singh

Emotional and mental well-being are vital components of quality of life, and with the rise of smart devices like smartphones, wearables, and artificial intelligence (AI), new opportunities for monitoring emotions in everyday settings have emerged. However, for AI algorithms to be effective, they require high-quality data and accurate annotations. As the focus shifts towards collecting emotion data in real-world environments to capture more authentic emotional experiences, the process of gathering emotion annotations has become increasingly complex. This work explores the challenges of everyday emotion data collection from the perspectives of key stakeholders. We collected 75 survey responses, performed 32 interviews with the public, and 3 focus group discussions (FGDs) with 12 mental health professionals. The insights gained from a total of 119 stakeholders informed the development of our framework, AnnoSense, designed to support everyday emotion data collection for AI. This framework was then evaluated by 25 emotion AI experts for its clarity, usefulness, and adaptability. Lastly, we discuss the potential next steps and implications of AnnoSense for future research in emotion AI, highlighting its potential to enhance the collection and analysis of emotion data in real-world contexts. 情感和心理健康是生活质量的重要组成部分，随着智能设备如智能手机、可穿戴设备和人工智能（AI）的兴起，监测日常环境中情绪的新机会也随之出现。然而，为了使 AI 算法有效运行，它们需要高质量的数据和准确的标注。随着关注点转向在真实环境中收集情绪数据以捕捉更真实的情感体验，情绪标注的收集过程变得日益复杂。本文从关键利益相关者的角度探讨了日常情绪数据收集的挑战。我们收集了 75 份调查问卷，进行了 32 次公众访谈，以及 3 次由 12 名心理健康专业人士参与的焦点小组讨论（FGD）。从共计 119 名利益相关者获得的见解促成了我们框架 AnnoSense 的开发，该框架旨在支持 AI 的日常情绪数据收集。随后，25 位情绪 AI 专家对该框架的清晰度、实用性和适应性进行了评估。最后，我们讨论了 AnnoSense 在情感人工智能未来研究中的潜在后续步骤和影响，强调了其在现实情境中增强情感数据收集与分析的潜力。

Subjects: Human-Computer Interaction, Artificial Intelligence 主题：人机交互，人工智能

Publish: 2025-07-17 10:54:39 UTC 发布时间：2025-07-17 10:54:39 UTC

#186 Bridging LLMs and Symbolic Reasoning in Educational QA Systems: Insights from the XAI Challenge at IJCNN 2025 #186 在教育问答系统中桥接 LLMs 与符号推理：来自 IJCNN 2025 XAI 挑战赛的见解

人工智能（AI）在教育中的日益融合加剧了对透明性和可解释性的需求。尽管黑客马拉松长期以来一直作为快速 AI 原型开发的敏捷环境，但很少有活动直接针对现实教育环境中的可解释人工智能（XAI）。本文对 XAI 挑战赛 2025 进行了全面分析，该挑战赛由胡志明市理工大学（HCMUT）与神经符号 AI 可信赖性与可靠性国际研讨会（TRNS-AI）联合主办，作为国际神经网络联合会议（IJCNN 2025）的一部分举办。此次挑战要求参赛者构建能够回答学生关于大学政策问题的问答（QA）系统，同时生成清晰、基于逻辑的自然语言解释。为促进透明性和可信赖性，解决方案需使用轻量级 LLMs 或混合 LLM-符号系统。提供了一个高质量的数据集，该数据集通过基于逻辑的模板构建，结合 Z3 验证，并经过专家学生的审查以确保与现实学术场景的一致性。我们描述了该挑战的动机、结构、数据集构建及评估协议。将该竞赛置于 AI 黑客马拉松的更广泛发展背景中，我们认为这代表了一种将 LLMs 与符号推理结合以服务于可解释性的创新尝试。我们的研究结果为未来以 XAI 为中心的教育系统和竞赛研究项目提供了可操作的见解。

Subject: Computation and Language 主题：计算与语言

Publish: 2025-08-02 08:46:06 UTC 发布：2025-08-02 08:46:06 UTC

#187 FastInit: Fast Noise Initialization for Temporally Consistent Video Generation #187 FastInit：用于时间一致性视频生成的快速噪声初始化

Video generation has made significant strides with the development of diffusion models; however, achieving high temporal consistency remains a challenging task. Recently, FreeInit identified a training-inference gap and introduced a method to iteratively refine the initial noise during inference. However, iterative refinement significantly increases the computational cost associated with video generation. In this paper, we introduce FastInit, a fast noise initialization method that eliminates the need for iterative refinement. FastInit learns a Video Noise Prediction Network (VNPNet) that takes random noise and a text prompt as input, generating refined noise in a single forward pass. Therefore, FastInit greatly enhances the efficiency of video generation while achieving high temporal consistency across frames. To train the VNPNet, we create a large-scale dataset consisting of pairs of text prompts, random noise, and refined noise. Extensive experiments with various text-to-video models show that our method consistently improves the quality and temporal consistency of the generated videos. FastInit not only provides a substantial improvement in video generation but also offers a practical solution that can be applied directly during inference. The code and dataset will be released. 视频生成随着扩散模型的发展取得了显著进展；然而，实现高时间一致性仍然是一项具有挑战性的任务。最近，FreeInit 发现了训练与推理之间的差距，并提出了一种在推理过程中迭代优化初始噪声的方法。然而，迭代优化显著增加了视频生成的计算成本。在本文中，我们引入了 FastInit，一种快速噪声初始化方法，消除了迭代优化的需求。FastInit 学习了一个视频噪声预测网络（VNPNet），该网络以随机噪声和文本提示作为输入，在一次前向传播中生成优化后的噪声。因此，FastInit 大大提升了视频生成的效率，同时实现了帧间的高时间一致性。为了训练 VNPNet，我们创建了一个包含文本提示、随机噪声和优化噪声对的大规模数据集。通过对多种文本到视频模型的广泛实验表明，我们的方法持续提升了生成视频的质量和时间一致性。 FastInit 不仅在视频生成方面提供了显著的提升，还提供了一个可以直接在推理过程中应用的实用解决方案。代码和数据集将会发布。

Subject: Computer Vision and Pattern Recognition 主题：计算机视觉与模式识别

Publish: 2025-06-19 08:11:45 UTC 发布时间：2025-06-19 08:11:45 UTC

1.3 Huggingface

1.4 X

1.5 小红书

49 【谷歌突破研究Agent瓶颈！（全文分享） - AI研究所 | 小红书 - 你的生活兴趣社区】 😆 idc6faPaC7i7pEN 😆 https://www.xiaohongshu.com/discovery/item/688c28b90000000023026789?source=webshare&xhsshare=pc_web&xsec_token=CB2UhBnXfvzR0e27jac2hdP2jTW93hOHYKIwoaBZRWTn8=&xsec_source=pc_share
66 【Agent：冲击最人类智力之巅，助力前沿研发 - 微风老师 | 小红书 - 你的生活兴趣社区】 😆 xRUyuAoXpKiLEri 😆 https://www.xiaohongshu.com/discovery/item/688e28ce0000000004002c42?source=webshare&xhsshare=pc_web&xsec_token=CBhnM9YSDMeeI74NiCRMXF9tkBuHkmwB8bH7018xJlBtE=&xsec_source=pc_share
80 【借鉴人类思考的分层推理？清华大学最新研究 - 轻舟AI | 小红书 - 你的生活兴趣社区】 😆 M49Up0rsYbIO6Xd 😆 https://www.xiaohongshu.com/discovery/item/68917d470000000004006bd4?source=webshare&xhsshare=pc_web&xsec_token=CBNjhAAQ9ahkh4hZ_0xmlIBrpbqQrNiMfkjm4OBI-IQ04=&xsec_source=pc_share

2025-08-06科研追新

2025-08-06科研追新

1. 源数据

1.1 公众号

1.1.1 量子位

1.1.2 机器之心

1.1.3 新智元

1.1.4 AGI Hunt

1.1.5 其他

1.2 Arxiv

1.2.1 Computation and Language

#1 CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward #1 CompassVerifier：一个用于 LLMs 评估和结果奖励的统一且稳健的验证器

#2 More Than a Score: Probing the Impact of Prompt Specificity on LLM Code Generation #2 不止于分数：探究提示特异性对 LLM 代码生成的影响

#3 FairLangProc: A Python package for fairness in NLP #3 FairLangProc：一个用于 NLP 公平性的 Python 包

#4 CTR-Sink: Attention Sink for Language Models in Click-Through Rate Prediction #4 CTR-Sink：用于点击率预测的语言模型注意力汇聚器

#5 Can Large Vision-Language Models Understand Multimodal Sarcasm? #5 大型视觉语言模型能理解多模态讽刺吗？

#6 Are We on the Right Way for Assessing Document Retrieval-Augmented Generation? #6 我们是否走在正确的道路上来评估文档检索增强生成？

#7 Tackling Distribution Shift in LLM via KILO: Knowledge-Instructed Learning for Continual Adaptation #7 通过 KILO 解决 LLM 中的分布转移问题：面向持续适应的知识指导学习

#8 Beyond the Surface: Enhancing LLM-as-a-Judge Alignment with Human via Internal Representations #8 超越表面：通过内部表征提升 LLM 作为裁判的对齐与人类的协同

#9 EmbedGrad: Gradient-Based Prompt Optimization in Embedding Space for Large Language Models #9 EmbedGrad：基于梯度的嵌入空间提示优化方法，适用于大型语言模型

#10 Marito: Structuring and Building Open Multilingual Terminologies for South African NLP #10 Marito：为南非自然语言处理构建和组织开放多语言术语库

#11 FilBench: Can LLMs Understand and Generate Filipino? #11 FilBench：LLMs 能理解和生成菲律宾语吗？

#12 UPLME: Uncertainty-Aware Probabilistic Language Modelling for Robust Empathy Regression #12 UPLME：面向稳健共情回归的不确定性感知概率语言建模

#13 CF-RAG: A Dataset and Method for Carbon Footprint QA Using Retrieval-Augmented Generation #13 CF-RAG：一种使用检索增强生成的碳足迹问答数据集和方法

#14 fact check AI at SemEval-2025 Task 7: Multilingual and Crosslingual Fact-checked Claim Retrieval #14 SemEval-2025 任务 7：多语言和跨语言事实核查声明检索

#15 Cropping outperforms dropout as an augmentation strategy for training self-supervised text embeddings #15 裁剪作为训练自监督文本嵌入的增强策略优于丢弃法

#16 LLMs Have a Heart of Stone: Demystifying the Soft Thinking Ability of Large Reasoning Models

#17 Variety Is the Spice of Life: Detecting Misinformation with Dynamic Environmental Representations #17 生活的多样性是调味品：利用动态环境表示检测错误信息

#18 ReDSM5: A Reddit Dataset for DSM-5 Depression Detection #18 ReDSM5：用于 DSM-5 抑郁症检测的 Reddit 数据集

#19 Thinking with Nothinking Calibration: A New In-Context Learning Paradigm in Reasoning Large Language Models #19 无思考校准思维：推理大型语言模型中的一种新型上下文学习范式

#20 Taggus: An Automated Pipeline for the Extraction of Characters' Social Networks from Portuguese Fiction Literature #20 Taggus：一个用于从葡萄牙小说文学中提取人物社交网络的自动化流程

#21 CTTS: Collective Test-Time Scaling #21 CTTS：集体测试时缩放

#22 Towards Trustworthy Multimodal Moderation via Policy-Aligned Reasoning and Hierarchical Labeling #22 通过策略对齐推理和分层标注迈向可信的多模态审核

#23 NLP Methods May Actually Be Better Than Professors at Estimating Question Difficulty #23 自然语言处理方法可能比教授更擅长估计问题难度

#24 Investigating Gender Bias in LLM-Generated Stories via Psychological Stereotypes #24 通过心理刻板印象调查 LLM 生成故事中的性别偏见

#25 Do language models accommodate their users? A study of linguistic convergence #25 语言模型是否适应其用户？一项语言趋同的研究

#26 LECTOR: LLM-Enhanced Concept-based Test-Oriented Repetition for Adaptive Spaced Learning #26 LECTOR：基于概念的面向测试的重复，结合 LLM 增强的自适应间隔学习

#27 Pay What LLM Wants: Can LLM Simulate Economics Experiment with 522 Real-human Persona? #27 按 LLM 想要的支付：LLM 能否模拟拥有 522 个真实人类角色的经济学实验？

#28 Exploring Stability-Plasticity Trade-offs for Continual Named Entity Recognition #28 探索持续命名实体识别中的稳定性-可塑性权衡

#29 RooseBERT: A New Deal For Political Language Modelling #29 RooseBERT：政治语言建模的新方案

#30 Somatic in the East, Psychological in the West?: Investigating Clinically-Grounded Cross-Cultural Depression Symptom Expression in LLMs #30 东方表现为躯体症状，西方表现为心理症状？：在 LLMs 中调查基于临床的跨文化抑郁症状表达

#31 CardiffNLP at CLEARS-2025: Prompting Large Language Models for Plain Language and Easy-to-Read Text Rewriting #31 CardiffNLP 在 CLEARS-2025：提示大型语言模型进行通俗易懂文本重写

#32 Probing Syntax in Large Language Models: Successes and Remaining Challenges #32 探索大型语言模型中的句法：成功与剩余挑战

#33 Current State in Privacy-Preserving Text Preprocessing for Domain-Agnostic NLP #33 隐私保护文本预处理在领域无关 NLP 中的现状

#34 Beyond Content: How Grammatical Gender Shapes Visual Representation in Text-to-Image Models #34 超越内容：语法性别如何影响文本到图像模型中的视觉表现

#35 Analyzing German Parliamentary Speeches: A Machine Learning Approach for Topic and Sentiment Classification #35 德国议会演讲分析：一种用于主题和情感分类的机器学习方法

#36 Light-IF: Endowing LLMs with Generalizable Reasoning via Preview and Self-Checking for Complex Instruction Following #36 Light-IF：通过预览和自我检查赋予 LLMs 可泛化的推理能力以执行复杂指令

#37 RCP-Merging: Merging Long Chain-of-Thought Models with Domain-Specific Models by Considering Reasoning Capability as Prior #37 RCP-Merging：通过将推理能力视为先验，合并长链思维模型与特定领域模型

#38 Long Story Generation via Knowledge Graph and Literary Theory #38 通过知识图谱和文学理论进行长篇故事生成

#39 Cross-lingual Opinions and Emotions Mining in Comparable Documents #39 跨语言观点与情感挖掘在可比文档中的应用

#40 Token-Level Precise Attack on RAG: Searching for the Best Alternatives to Mislead Generation #40 基于令牌级的精确攻击 RAG：寻找误导生成的最佳替代方案

#41 Privacy-Aware Decoding: Mitigating Privacy Leakage of Large Language Models in Retrieval-Augmented Generation #41 隐私感知解码：缓解检索增强生成中大型语言模型的隐私泄露

#42 When Algorithms Meet Artists: Topic Modeling the AI-Art Debate, 2013-2025 #42 当算法遇上艺术家：2013-2025 年 AI 艺术辩论的话题建模

#43 CoCoTen: Detecting Adversarial Inputs to Large Language Models through Latent Space Features of Contextual Co-occurrence Tensors #43 CoCoTen：通过上下文共现张量的潜在空间特征检测大型语言模型的对抗输入

#44 Can LLMs Generate High-Quality Task-Specific Conversations? #44 LLMs 能否生成高质量的特定任务对话？

#45 SLIM-LLMs: Modeling of Style-Sensory Language RelationshipsThrough Low-Dimensional Representations #45 SLIM-LLMs：通过低维表示建模风格-感官语言关系

#46 Coherent Multimodal Reasoning with Iterative Self-Evaluation for Vision-Language Models #46 具有迭代自我评估的视觉语言模型的连贯多模态推理

#48 Highlight & Summarize: RAG without the jailbreaks #48 重点与总结：无越狱的 RAG

#49 Modeling Annotator Disagreement with Demographic-Aware Experts and Synthetic Perspectives #49 使用具有人口统计意识的专家和合成视角建模标注者分歧

#50 Clinically Grounded Agent-based Report Evaluation: An Interpretable Metric for Radiology Report Generation #50 临床基础的基于智能体的报告评估：用于放射学报告生成的可解释指标

#51 Forest vs Tree: The (N,K) Trade-off in Reproducible ML Evaluation #51 森林与树：可重复机器学习评估中的 (N,K) 权衡

#52 OSINT or BULLSHINT? Exploring Open-Source Intelligence tweets about the Russo-Ukrainian War #52 开源情报还是胡扯？探索关于俄乌战争的开源情报推文

#53 Beyond Meme Templates: Limitations of Visual Similarity Measures in Meme Matching #53 超越表情包模板：视觉相似度度量在表情包匹配中的局限性

#54 PyLate: Flexible Training and Retrieval for Late Interaction Models #54 PyLate：用于后期交互模型的灵活训练与检索

#55 MultiRAG: A Knowledge-guided Framework for Mitigating Hallucination in Multi-source Retrieval Augmented Generation #55 MultiRAG：一个用于缓解多源检索增强生成中幻觉的知识引导框架

#56 MoKA: Mixture of Kronecker Adapters #56 MoKA：Kronecker 适配器混合体

#57 Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning #57 使用强化学习训练长上下文、多轮软件工程代理

#58 Draw Your Mind: Personalized Generation via Condition-Level Modeling in Text-to-Image Diffusion Models #58 画出你的心智：通过条件级建模实现文本到图像扩散模型的个性化生成

#59 A Comparative Study of Neurosymbolic AI Approaches to Interpretable Logical Reasoning #59 神经符号 AI 方法在可解释逻辑推理中的比较研究

#60 VLMQ: Efficient Post-Training Quantization for Large Vision-Language Models via Hessian Augmentation #60 VLMQ：通过 Hessian 增强实现大规模视觉语言模型的高效后训练量化

#61 Reliable Evaluation Protocol for Low-Precision Retrieval #61 低精度检索的可靠评估协议

#62 Understanding the Embedding Models on Hyper-relational Knowledge Graph #62 理解超关系知识图上的嵌入模型

#63 ChartCap: Mitigating Hallucination of Dense Chart Captioning #63 ChartCap：缓解密集图表标题的幻觉问题

#64 Toward Verifiable Misinformation Detection: A Multi-Tool LLM Agent Framework #64 面向可验证的错误信息检测：多工具 LLM 代理框架

#65 VRPO: Rethinking Value Modeling for Robust RL Training under Noisy Supervision #65 VRPO：在噪声监督下重新思考鲁棒强化学习训练中的价值建模

#66 AGENTiGraph: A Multi-Agent Knowledge Graph Framework for Interactive, Domain-Specific LLM Chatbots #66 AGENTiGraph：一个用于交互式领域特定 LLM 聊天机器人的多智能体知识图谱框架

#67 Unified Tool Integration for LLMs: A Protocol-Agnostic Approach to Function Calling #67 LLMs 的统一工具集成：一种与协议无关的函数调用方法

#68 Defend LLMs Through Self-Consciousness #68 通过自我意识防御 LLMs

#69 Following Route Instructions using Large Vision-Language Models: A Comparison between Low-level and Panoramic Action Spaces #69 使用大型视觉语言模型遵循路线指令：低级与全景动作空间的比较

#70 VisuCraft: Enhancing Large Vision-Language Models for Complex Visual-Guided Creative Content Generation via Structured Information Extraction #70 VisuCraft：通过结构化信息提取增强大型视觉语言模型以实现复杂视觉引导的创意内容生成