2025-09-15 2025-09-15 About 40800 words 192 minutes

Contents

#1 WhisTLE：面向预训练语音识别 Transformer 的深度监督、仅文本域自适应
#2 DeepDive：通过知识图谱和多轮强化学习推进深度搜索代理
#3 RefactorCoderQA：在云端和边缘部署中为多领域编码问题解决方案对 LLMs 进行基准测试
#4 长上下文自动化作文评分与语言模型
#5 在上下文中学习真的是在“学习”吗？
#6 放弃专家、重组神经元：无需再训练的稀疏专家混合 LLMs 剪枝
#7 SI-FACT：通过自我改进的可信度感知对比调优缓解知识冲突
#8 超出令牌限制：评估语言模型在长文本分类任务中的表现
#9 不相称的积极性：当失准的积极态度削弱在线支持性对话时
#10 LLM 生成文本风格变异基准测试
#11 朝着可靠且可解释的文档问答迈进：基于 VLM 的方法
#12 面向人口对齐的人格生成用于基于 LLM 的社会模拟
#13 面向对话语音的突出度感知自动语音识别
#14 使用合成数据扩展阿拉伯语医疗聊天机器人：通过合成病历增强生成式人工智能
#15 阿拉伯语大型语言模型用于医学文本生成
#16 查询气候知识：用于科学发现的语义检索
#17 已确立的心理测量学问卷与生态有效问卷：重新思考大型语言模型中的心理评估
#18 !MSA at BAREC Shared Task 2025: Ensembling Arabic Transformers for Readability Assessment #18 !MSA 在 BAREC 共享任务 2025：集成阿拉伯语变换器用于可读性评估 [PDF] [复制] [Kimi] [REL]
#19 双相情感障碍在社交媒体上的语言轨迹
#20 对话理解中的多意图识别：对比较小的开源 LLMs
#21 通过检查推理过程进行无监督幻觉检测
#22 CMHG：面向中国少数民族语言的标题生成数据集与基准
#23 大型语言模型与法律人工智能的相遇：一项综述
#24 Emulating Public Opinion: A Proof-of-Concept of AI-Generated Synthetic Survey Responses for the Chilean Case #24 模拟公众舆论：面向智利案例的 AI 生成合成调查回应的概念验证 [PDF] [复制] [Kimi] [REL]
#25 Topic-Guided Reinforcement Learning with LLMs for Enhancing Multi-Document Summarization #25 主题引导的强化学习与 LLMs 用于增强多文档摘要 [PDF] [复制] [Kimi] [REL]
#26 Pragmatic Frames Evoked by Gestures: A FrameNet Brasil Approach to Multimodality in Turn Organization #26 手势唤起的语用框架：FrameNet Brasil 对话轮组织多模态研究
#27 HEFT: A Coarse-to-Fine Hierarchy for Enhancing the Efficiency and Accuracy of Language Model Reasoning #27 HEFT：一种由粗到细的层次结构，用于提升语言模型推理的效率与准确性
#28 Discrimination by LLMs: Cross-lingual Bias Assessment and Mitigation in Decision-Making and Summarisation #28 LLMs 对歧视的识别：跨语种偏见评估及在决策与摘要中的缓解
#29 MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools #29 MCP-AgentBench：使用 MCP 调解的工具评估现实世界语言代理性能
#30 Benchmarking Vision-Language Models on Chinese Ancient Documents: From OCR to Knowledge Reasoning #30 对中文古籍文献的视觉-语言模型基准测试：从光学字符识别到知识推理 [PDF 1 ] [复制] [Kimi] [REL]
#31 MultimodalHugs: Enabling Sign Language Processing in Hugging Face #31 MultimodalHugs：在 Hugging Face 上实现手语处理 [PDF] [复制] [Kimi 1 ] [REL]
#32 A meta-analysis on the performance of machine-learning based language models for sentiment analysis #32 基于机器学习的语言模型在情感分析中表现的元分析 [PDF] [复制] [Kimi] [REL]
#33 A Role-Aware Multi-Agent Framework for Financial Education Question Answering with LLMs #33 一个面向角色感知的多智能体框架，用于使用 LLMs 的金融教育问答 [PDF] [复制] [Kimi] [REL]
#34 Natural Language Translation of Formal Proofs through Informalization of Proof Steps and Recursive Summarization along Proof Structure #34 通过将证明步骤非形式化并沿证明结构进行递归摘要实现形式证明的自然语言翻译
#35 BIBERT-Pipe on Biomedical Nested Named Entity Linking at BioASQ 2025 #35 BIBERT-Pipe 在 BioASQ 2025 的生物医学嵌套命名实体链接任务 [PDF] [复制] [Kimi] [REL]
#36 DiTTO-LLM: Framework for Discovering Topic-based Technology Opportunities via Large Language Model #36 DiTTO-LLM：通过大语言模型发现基于主题的技术机会的框架 [PDF] [复制] [Kimi] [REL]
#37 ALIGNS: Unlocking nomological networks in psychological measurement through a large language model #37 ALIGNS：通过大型语言模型解锁心理测量中的法则网络 [PDF] [复制] [Kimi] [REL]
#38 Investigating Symbolic Triggers of Hallucination in Gemma Models Across HaluEval and TruthfulQA #38 在 HaluEval 和 TruthfulQA 上调查 Gemma 模型幻觉的符号触发因素 [PDF] [Copy] [Kimi 1 ] [REL]
#39 How Small Transformation Expose the Weakness of Semantic Similarity Measures
#40 HANRAG: Heuristic Accurate Noise-resistant Retrieval-Augmented Generation for Multi-hop Question Answering
#41 The Thinking Therapist: Training Large Language Models to Deliver Acceptance and Commitment Therapy using Supervised Fine-Tuning and Odds Ratio Policy Optimization #41 思想治疗师：使用监督微调和比值率策略优化训练大型语言模型以提供接纳与承诺疗法
#42 Psychiatry-Bench: A Multi-Task Benchmark for LLMs in Psychiatry
#43 Generating Individual Travel Diaries Using Large Language Models Informed by Census and Land-Use Data #43 利用由人口普查和土地利用数据提供信息的大型语言模型生成个人旅行日记 [PDF] [复制] [Kimi] [REL]
#44 Assisting Research Proposal Writing with Large Language Models: Evaluation and Refinement #44 使用大型语言模型辅助科研提案写作：评估与改进 [PDF] [复制] [Kimi] [REL]
#45 Beyond I’m Sorry, I Can’t: Dissecting Large Language Model Refusal #45 超越“对不起，我不能”：剖析大型语言模型的拒绝行为
#46 The Non-Determinism of Small LLMs: Evidence of Low Answer Consistency in Repetition Trials of Standard Multiple-Choice Benchmarks #46 小型 LLMs 的非确定性：在标准多项选择基准的重复试验中低答案一致性的证据
#47 Temporal Preferences in Language Models for Long-Horizon Assistance #47 语言模型在长期辅助中的时间偏好 [PDF 1 ] [复制] [Kimi] [REL]
#48 CTCC: A Robust and Stealthy Fingerprinting Framework for Large Language Models via Cross-Turn Contextual Correlation Backdoor #48 CTCC：一种通过跨对话上下文相关性后门对大型语言模型进行鲁棒且隐蔽的指纹框架 [PDF] [复制] [Kimi] [REL]
#49 Creativity Benchmark: A benchmark for marketing creativity for LLM models #49 创造力基准：用于评估 LLM 模型营销创造力的基准 [PDF] [复制] [Kimi 1 ] [REL]
#50 Optimal Multi-Task Learning at Regularization Horizon for Speech Translation Task #50 在正则化视界下针对语音翻译任务的最优多任务学习
#51 Cross-Layer Attention Probing for Fine-Grained Hallucination Detection
#52 Structured Information Matters: Explainable ICD Coding with Patient-Level Knowledge Graphs #52 结构化信息很重要：使用病人级知识图谱的可解释 ICD 编码 [PDF] [复制] [Kimi 1 ] [REL]
#53 Abduct, Act, Predict: Scaffolding Causal Inference for Automated Failure Attribution in Multi-Agent Systems #53 假设、行动、预测：为多智能体系统的自动故障归因构建因果推理支架
#54 Error Analysis in a Modular Meeting Transcription System #54 模块化会议转录系统中的错误分析 [PDF 1 ] [复制] [Kimi] [REL]
#55 VARCO-VISION-2.0 Technical Report #55 VARCO-VISION-2.0 技术报告
#56 Unified Learnable 2D Convolutional Feature Extraction for ASR #56 统一可学习二维卷积特征提取用于自动语音识别 [PDF] [复制] [Kimi] [相关]
#57 Whisper Has an Internal Word Aligner #57 Whisper 有一个内部词对齐器 [PDF 1 ] [复制] [Kimi] [REL]
#58 Vibe Check: Understanding the Effects of LLM-Based Conversational Agents’ Personality and Alignment on User Perceptions in Goal-Oriented Tasks #58 Vibe Check：理解基于 LLM 的会话代理的个性与对齐对面向目标任务的用户感知的影响 [PDF] [复制] [Kimi] [REL]
#59 LLMs as Agentic Cooperative Players in Multiplayer UNO
#60 Latency and Token-Aware Test-Time Compute #60 延迟与令牌感知的测试时计算
#61 Executable Ontologies: Synthesizing Event Semantics with Dataflow Architecture #61 可执行本体：用数据流架构合成事件语义 [PDF] [复制] [Kimi] [REL]
#62 HypoGeneAgent: A Hypothesis Language Agent for Gene-Set Cluster Resolution Selection Using Perturb-seq Datasets #62 HypoGeneAgent：一种用于基因集合簇解析选择的假设语言代理，基于 Perturb-seq 数据集
#63 Improving MLLM Historical Record Extraction with Test-Time Image #63 改进 MLLM 历史记录提取的测试时图像 [PDF 1 ] [Copy] [Kimi] [REL]
#64 VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions #64 VStyle：一个用于带有口语指令的语音风格适应基准 [PDF] [复制] [Kimi 2 ] [REL]
#65 LLM-Based Instance-Driven Heuristic Bias In the Context of a Biased Random Key Genetic Algorithm #65 基于 LLM 的实例驱动启发式偏置在有偏随机键遗传算法背景下
#66 Differential Robustness in Transformer Language Models: Empirical Evaluation Under Adversarial Text Attacks #66 变压器语言模型的差异性鲁棒性：在对抗性文本攻击下的实证评估
#67 Personas within Parameters: Fine-Tuning Small Language Models with Low-Rank Adapters to Mimic User Behaviors
#68 AI-Powered Assistant for Long-Term Access to RHIC Knowledge #68 基于 AI 的助理，用于长期获取 RHIC 知识 [PDF] [复制] [Kimi] [REL]
#69 Text-to-SQL Oriented to the Process Mining Domain: A PT-EN Dataset for Query Translation #69 面向流程挖掘领域的文本到 SQL：用于查询翻译的英德数据集 [PDF] [复制] [Kimi] [REL]
#70 DB3 Team’s Solution For Meta KDD Cup’ 25 #70 DB3 团队为 Meta KDD Cup'25 提交的解决方案 [PDF 1 ] [复制] [Kimi] [REL]
#71 Clip Your Sequences Fairly: Enforcing Length Fairness for Sequence-Level RL #71 公平裁剪你的序列：对序列级 RL 强制执行长度公平性 [PDF 6 ] [Copy] [Kimi 6 ] [REL]

#1 Mutual Information Tracks Policy Coherence in Reinforcement Learning #1 互信息追踪强化学习中的策略一致性
#2 Abduct, Act, Predict: Scaffolding Causal Inference for Automated Failure Attribution in Multi-Agent Systems #2 逆推、行动、预测：为多主体系统中自动故障归因搭建因果推理脚手架
#3 State Algebra for Propositional Logic #3 状态代数用于命题逻辑
#4 The Morality of Probability: How Implicit Moral Biases in LLMs May Shape the Future of Human-AI Symbiosis #4 概率的道德性：LLMs 中隐含的道德偏见如何可能塑造人类与人工智能共生的未来
#5 Investigating Language Model Capabilities to Represent and Process Formal Knowledge: A Preliminary Study to Assist Ontology Engineering #5 探究语言模型表示与处理形式知识的能力：一项辅助本体工程的初步研究 [PDF 1 ] [Copy] [Kimi ] [REL]
#6 Compartmentalised Agentic Reasoning for Clinical NLI #6 用于临床自然语言推理的分区代理化推理
#7 Towards Fully Automated Molecular Simulations: Multi-Agent Framework for Simulation Setup and Force Field Extraction #7 迈向全自动分子模拟：用于模拟设置和力场提取的多代理框架
#8 Online Robust Planning under Model Uncertainty: A Sample-Based Approach #8 在模型不确定性下的在线鲁棒规划：一种基于样本的方法
#9 Virtual Agent Economies #9 虚拟代理经济
#10 AI Harmonics: a human-centric and harms severity-adaptive AI risk assessment framework #10 AI Harmonics：以人为本且依据危害严重性自适应的 AI 风险评估框架
#11 XAgents: A Unified Framework for Multi-Agent Cooperation via IF-THEN Rules and Multipolar Task Processing Graph #11 XAgents：通过若则规则和多极任务处理图实现多智能体协作的统一框架
#12 GAMA: A General Anonymizing Multi-Agent System for Privacy Preservation Enhanced by Domain Rules and Disproof Method #12 GAMA：一种通用匿名多智能体系统，通过领域规则和反证方法增强隐私保护
#13 Evaluation of Black-Box XAI Approaches for Predictors of Values of Boolean Formulae #13 评估用于布尔公式值预测器的黑盒可解释性人工智能（XAI）方法
#14 A Markovian Framing of WaveFunctionCollapse for Procedurally Generating Aesthetically Complex Environments #14 将波函数塌缩的程序化生成美学复杂环境的马尔可夫框架
#15 The (R)evolution of Scientific Workflows in the Agentic AI Era: Towards Autonomous Science #15 在具代理性的人工智能时代，科学工作流程的（革新）革命：迈向自主科学
#16 LLMs as Agentic Cooperative Players in Multiplayer UNO #16 LLMs 作为多人 UNO 中具有代理性的合作玩家 [PDF 2 ] [Copy] [Kimi 2 ] [REL]
#17 Towards an AI-based knowledge assistant for goat farmers based on Retrieval-Augmented Generation #17 朝向基于检索增强生成的山羊养殖者 AI 知识助理
#18 Towards a Common Framework for Autoformalization #18 朝向自动形式化的通用框架
#19 A Modular and Multimodal Generative AI Framework for Urban Building Energy Data: Generating Synthetic Homes #19 一个用于城市建筑能耗数据的模块化多模态生成式人工智能框架：生成合成住宅 [PDF 1 ] [复制] [Kimi ] [关联]
#20 How well can LLMs provide planning feedback in grounded environments? #20 在有根环境中，LLMs 在提供规划反馈方面表现如何？
#21 Executable Ontologies: Synthesizing Event Semantics with Dataflow Architecture #21 可执行本体：用数据流架构合成事件语义
#22 Human-AI Collaboration Increases Efficiency in Regulatory Writing #22 人机协作提高了法规写作的效率
#23 Standards in the Preparation of Biomedical Research Metadata: A Bridge2AI Perspective #23 生物医学研究元数据准备标准：Bridge2AI 视角
#24 Is In-Context Learning Learning? #24 在上下文学习真的是在“学习”吗？
#25 Multimodal SAM-adapter for Semantic Segmentation #25 用于语义分割的多模态 SAM-adapter
#26 Diversified recommendations of cultural activities with personalized determinantal point processes #26 使用个性化行列式点过程的文化活动多样化推荐
#27 Improving Audio Event Recognition with Consistency Regularization #27 通过一致性正则化改进音频事件识别
#28 Data distribution impacts the performance and generalisability of contrastive learning-based foundation models of electrocardiograms #28 数据分布影响基于对比学习的心电图基础模型的性能与泛化性
#29 Towards Understanding Visual Grounding in Visual Language Models #29 朝着理解视觉语言模型中的视觉定位
#30 GLAM: Geometry-Guided Local Alignment for Multi-View VLP in Mammography #30 GLAM：用于乳腺 X 光多视图 VLP 的几何引导局部对齐
#31 I-Segmenter: Integer-Only Vision Transformer for Efficient Semantic Segmentation #31 I-Segmenter：用于高效语义分割的整数仅视觉变压器
#32 Generalizing Beyond Suboptimality: Offline Reinforcement Learning Learns Effective Scheduling through Random Data #32 超越次优性的泛化：离线强化学习通过随机数据学会有效调度
#33 We Need a New Ethics for a World of AI Agents #33 我们需要为一个由人工智能代理构成的世界制定新的伦理
#34 SignClip: Leveraging Mouthing Cues for Sign Language Translation by Multimodal Contrastive Fusion #34 SignClip：通过多模态对比融合利用口型线索进行手语翻译
#35 Openness in AI and downstream governance: A global value chain approach #35 人工智能的开放性与下游治理：全球价值链方法
#36 SI-FACT: Mitigating Knowledge Conflict via Self-Improving Faithfulness-Aware Contrastive Tuning #36 SI-FACT：通过自我改进的忠实度感知对比微调缓解知识冲突 [PDF 2 ] [Copy] [Kimi 2 ] [REL]
#37 Benchmark of stylistic variation in LLM-generated texts #37 基准测试：LLM 生成文本的文体变异
#38 BenchECG and xECG: a benchmark and baseline for ECG foundation models #38 BenchECG 和 xECG：用于心电图基础模型的基准和基线
#39 Efficient Learning-Based Control of a Legged Robot in Lunar Gravity #39 基于高效学习的月球重力下步行机器人的控制
#40 Population-Aligned Persona Generation for LLM-based Social Simulation #40 基于人口对齐的角色生成用于基于 LLM 的社会模拟
#41 Realism Control One-step Diffusion for Real-World Image Super-Resolution #41 真实感控制一步扩散用于真实世界图像超分辨率
#42 Generating Energy-Efficient Code via Large-Language Models – Where are we now? #42 通过大规模语言模型生成节能代码——我们现在处于什么位置？
#43 Established Psychometric vs. Ecologically Valid Questionnaires: Rethinking Psychological Assessments in Large Language Models #43 既定的心理测量学问卷与生态有效问卷：在大型语言模型中重新思考心理评估
#44 Predictive Spike Timing Enables Distributed Shortest Path Computation in Spiking Neural Networks #44 预测性峰值时序使脉冲神经网络实现分布式最短路径计算成为可能
#45 TwinTac: A Wide-Range, Highly Sensitive Tactile Sensor with Real-to-Sim Digital Twin Sensor Model #45 TwinTac：一种宽量程、高灵敏度的触觉传感器，带有真实到仿真的数字孪生传感器模型
#46 Multimodal Mathematical Reasoning Embedded in Aerial Vehicle Imagery: Benchmarking, Analysis, and Exploration #46 多模态数学推理嵌入航拍图像：基准测试、分析与探索
#47 Reinforcement learning for spin torque oscillator tasks #47 强化学习用于自旋转矩振荡器任务
#48 Exploring Expert Specialization through Unsupervised Training in Sparse Mixture of Experts #48 通过稀疏专家混合中的无监督训练探索专家专业化
#49 Intrinsic Dimension Estimating Autoencoder (IDEA) Using CancelOut Layer and a Projected Loss #49 本征维度估计自编码器（IDEA）：使用 CancelOut 层和投影损失
#50 Unsupervised Hallucination Detection by Inspecting Reasoning Processes #50 通过检查推理过程进行无监督幻觉检测
#51 Drone-Based Multispectral Imaging and Deep Learning for Timely Detection of Branched Broomrape in Tomato Farms #51 基于无人机多光谱成像与深度学习的番茄农场分枝菟丝子实时检测研究 [PDF ] [Copy] [Kimi ] [REL]
#52 Securing LLM-Generated Embedded Firmware through AI Agent-Driven Validation and Patching #52 通过 AI 代理驱动的验证与修补保护 LLM 生成的嵌入式固件
#53 Large Language Models Meet Legal Artificial Intelligence: A Survey #53 大型语言模型遇上法律人工智能：一项综述 [PDF ] [Copy] [Kimi 1 ] [REL]
#54 Limited Reference, Reliable Generation: A Two-Component Framework for Tabular Data Generation in Low-Data Regimes #54 有限参考、可靠生成：适用于低数据情形的表格数据生成双组件框架
#55 Zero-Shot Referring Expression Comprehension via Visual-Language True/False Verification #55 零样本指代表达理解：通过视觉-语言真/假验证
#56 Adaptive Token Merging for Efficient Transformer Semantic Communication at the Edge #56 自适应令牌合并以实现边缘高效 Transformer 语义通信
#57 SmartCoder-R1: Towards Secure and Explainable Smart Contract Generation with Security-Aware Group Relative Policy Optimization #57 SmartCoder-R1：迈向安全且可解释的智能合约生成，采用面向安全的群体相对策略优化 [PDF ] [Copy] [Kimi 1 ] [REL]
#58 WALL: A Web Application for Automated Quality Assurance using Large Language Models #58 WALL：使用大型语言模型的自动化质量保证网络应用
#59 An Autoencoder and Vision Transformer-based Interpretability Analysis of the Differences in Automated Staging of Second and Third Molars #59 基于自编码器和视觉变换器的可解释性分析：第二、第三磨牙自动分期差异
#60 Tackling One Health Risks: How Large Language Models are leveraged for Risk Negotiation and Consensus-building #60 应对一体化健康风险：大型语言模型如何被用于风险协商与共识构建
#61 Self-Augmented Robot Trajectory: Efficient Imitation Learning via Safe Self-augmentation with Demonstrator-annotated Precision #61 自我增强机器人轨迹：通过示范者标注精度的安全自我增强实现高效模仿学习
#62 Automated Tuning for Diffusion Inverse Problem Solvers without Generative Prior Retraining #62 无需对生成先验重训练的扩散逆问题解算器自动调优 [PDF 3 ] [Copy] [Kimi 1 ] [REL]
#63 From Hugging Face to GitHub: Tracing License Drift in the Open-Source AI Ecosystem #63 从 Hugging Face 到 GitHub：追踪开源人工智能生态系统中的许可漂移
#64 Emulating Public Opinion: A Proof-of-Concept of AI-Generated Synthetic Survey Responses for the Chilean Case #64 模拟公众舆论：一项关于人工智能生成的智利案例合成问卷回应的概念验证
#65 Vibe Check: Understanding the Effects of LLM-Based Conversational Agents’ Personality and Alignment on User Perceptions in Goal-Oriented Tasks #65 氛围检测：理解基于 LLM 的会话代理在面向目标任务中人格与对齐对用户感知的影响
#66 Surrogate Supervision for Robust and Generalizable Deformable Image Registration #66 用替代监督实现稳健且可泛化的可变形图像配准 [PDF 1 ] [Copy] [Kimi ] [REL]
#67 Latency and Token-Aware Test-Time Compute #67 延迟与基于代币的感知测试时计算
#68 SWE-Effi: Re-Evaluating Software AI Agent System Effectiveness Under Resource Constraints #68 SWE-Effi：在资源受限下重新评估软件人工智能代理系统的有效性
#69 HGEN: Heterogeneous Graph Ensemble Networks #69 HGEN：异构图集成网络
#70 Revisiting Actor-Critic Methods in Discrete Action Off-Policy Reinforcement Learning #70 重新审视离散动作离策略强化学习中的演员-评论家方法
#71 CoDiCodec: Unifying Continuous and Discrete Compressed Representations of Audio #71 CoDiCodec：统一的连续和离散音频压缩表示
#72 SoilSound: Smartphone-based Soil Moisture Estimation #72 SoilSound：基于智能手机的土壤湿度估算
#73 HEFT: A Coarse-to-Fine Hierarchy for Enhancing the Efficiency and Accuracy of Language Model Reasoning #73 HEFT：一种从粗到细的层次结构，用于提升语言模型推理的效率与准确性 [PDF ] [Copy] [Kimi ] [REL]
#74 ZORRO: Zero-Knowledge Robustness and Privacy for Split Learning (Full Version) #74 ZORRO：用于分割学习的零知识鲁棒性与隐私（完整版）
#75 LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation #75 LAVa：具有动态预算分配的分层键值缓存逐层驱逐
#76 Meta-Learning Reinforcement Learning for Crypto-Return Prediction #76 元学习强化学习用于加密回报预测 [PDF ] [Copy] [Kimi ] [REL]
#77 A Co-Training Semi-Supervised Framework Using Faster R-CNN and YOLO Networks for Object Detection in Densely Packed Retail Images #77 一种使用 Faster R-CNN 与 YOLO 网络的协同训练半监督框架，用于密集摆放零售图像中的目标检测
#78 D-CAT: Decoupled Cross-Attention Transfer between Sensor Modalities for Unimodal Inference #78 D-CAT：用于单模态推理的传感器模态间解耦交叉注意力迁移 [PDF ] [Copy] [Kimi ] [REL]
#79 Structure Matters: Brain Graph Augmentation via Learnable Edge Masking for Data-efficient Psychiatric Diagnosis #79 结构很重要：通过可学习的边掩码进行脑图增强以实现数据高效的精神疾病诊断
#80 HypoGeneAgent: A Hypothesis Language Agent for Gene-Set Cluster Resolution Selection Using Perturb-seq Datasets #80 HypoGeneAgent：一种用于基因集合簇分辨率选择的基于假设语言代理，适用于 Perturb-seq 数据集 [PDF ] [Copy] [Kimi ] [REL]
#81 World Modeling with Probabilistic Structure Integration #81 具有概率结构整合的世界建模
#82 MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools #82 MCP-AgentBench：使用 MCP 中介工具评估现实世界语言代理性能
#83 MITS: A Large-Scale Multimodal Benchmark Dataset for Intelligent Traffic Surveillance #83 MITS：用于智能交通监控的大规模多模态基准数据集
#84 MultimodalHugs: Enabling Sign Language Processing in Hugging Face #84 MultimodalHugs：在 Hugging Face 中实现手语处理
#85 DiTTO-LLM: Framework for Discovering Topic-based Technology Opportunities via Large Language Model #85 DiTTO-LLM：通过大型语言模型发现基于主题的技术机会的框架
#86 ALIGNS: Unlocking nomological networks in psychological measurement through a large language model #86 ALIGNS：通过大型语言模型在心理测量中解锁法则网络
#87 A Multimodal RAG Framework for Housing Damage Assessment: Collaborative Optimization of Image Encoding and Policy Vector Retrieval #87 一种用于房屋损害评估的多模态 RAG 框架：图像编码与策略向量检索的协同优化
#88 VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions #88 VStyle：带有口头指令的语音风格适应基准
#89 Investigating Symbolic Triggers of Hallucination in Gemma Models Across HaluEval and TruthfulQA #89 在 HaluEval 和 TruthfulQA 中调查 Gemma 模型幻觉的符号触发因素
#90 How Small Transformation Expose the Weakness of Semantic Similarity Measures #90 微小变换如何暴露语义相似度度量的弱点
#91 HANRAG: Heuristic Accurate Noise-resistant Retrieval-Augmented Generation for Multi-hop Question Answering #91 HANRAG：启发式准确抗噪检索增强生成用于多跳问答 [PDF ] [Copy] [Kimi 1 ] [REL]
#92 The Thinking Therapist: Training Large Language Models to Deliver Acceptance and Commitment Therapy using Supervised Fine-Tuning and Odds Ratio Policy Optimization #92 思考中的治疗师：使用监督微调和比值比策略优化训练大型语言模型以提供接纳与承诺疗法 [PDF 1 ] [Copy] [Kimi 1 ] [REL]
#93 Psychiatry-Bench: A Multi-Task Benchmark for LLMs in Psychiatry
#94 Generating Individual Travel Diaries Using Large Language Models Informed by Census and Land-Use Data #94 使用大型语言模型并结合人口普查与土地利用数据生成个人旅行日记
#95 Assisting Research Proposal Writing with Large Language Models: Evaluation and Refinement #95 使用大型语言模型辅助研究提案写作：评估与改进
#96 Beyond I’m Sorry, I Can’t: Dissecting Large Language Model Refusal #96 超越“抱歉，我做不到”：剖析大型语言模型的拒绝行为
#97 LLM-Based Instance-Driven Heuristic Bias In the Context of a Biased Random Key Genetic Algorithm #97 基于 LLM 的实例驱动启发式偏差：带偏置随机键遗传算法的背景下 [PDF ] [Copy] [Kimi ] [REL]
#98 Differential Robustness in Transformer Language Models: Empirical Evaluation Under Adversarial Text Attacks #98 变压器语言模型的差异鲁棒性：在对抗性文本攻击下的实证评估 [PDF ] [Copy] [Kimi ] [REL]
#99 The Non-Determinism of Small LLMs: Evidence of Low Answer Consistency in Repetition Trials of Standard Multiple-Choice Benchmarks #99 小型 LLM 的非确定性：在标准多项选择基准的重复试验中答案一致性低的证据 [PDF 2 ] [Copy] [Kimi 3 ] [REL]
#100 Temporal Preferences in Language Models for Long-Horizon Assistance #100 语言模型在长期辅助中的时间偏好
#101 CTCC: A Robust and Stealthy Fingerprinting Framework for Large Language Models via Cross-Turn Contextual Correlation Backdoor #101 CTCC：一种通过跨轮次上下文相关性后门对大型语言模型进行健壮且隐蔽指纹标记的框架
#102 Creativity Benchmark: A benchmark for marketing creativity for LLM models #102 创造力基准：用于评估 LLM 模型营销创造力的基准
#103 Cross-Layer Attention Probing for Fine-Grained Hallucination Detection #103 跨层注意力探测用于细粒度幻觉检测
#104 Structured Information Matters: Explainable ICD Coding with Patient-Level Knowledge Graphs #104 结构化信息很重要：基于患者级知识图的可解释 ICD 编码
#105 Wave-Based Semantic Memory with Resonance-Based Retrieval: A Phase-Aware Alternative to Vector Embedding Stores #105 基于波的语义记忆与基于共振的检索：一种相位感知的向量嵌入存储替代方案 [PDF ] [Copy] [Kimi ] [REL]
#106 Personas within Parameters: Fine-Tuning Small Language Models with Low-Rank Adapters to Mimic User Behaviors #106 参数内的人格：使用低秩适配器微调小型语言模型以模拟用户行为
#107 AI-Powered Assistant for Long-Term Access to RHIC Knowledge #107 基于人工智能的 RHIC 知识长期访问助手
#108 GeoGPT.RAG Technical Report
#109 TalkPlayData 2: An Agentic Synthetic Data Pipeline for Multimodal Conversational Music Recommendation #109 TalkPlayData 2：一种用于多模态对话式音乐推荐的能动合成数据管道
#110 Text-to-SQL Oriented to the Process Mining Domain: A PT-EN Dataset for Query Translation #110 面向流程挖掘领域的文本到 SQL：用于查询翻译的葡英双语数据集
#111 Forecasting Clicks in Digital Advertising: Multimodal Inputs and Interpretable Outputs #111 预测数字广告中的点击量：多模态输入与可解释输出
#112 DB3 Team’s Solution For Meta KDD Cup’ 25 #112 DB3 团队为 Meta KDD Cup'25 提供的解决方案
#113 AEGIS: An Agent for Extraction and Geographic Identification in Scholarly Proceedings #113 AEGIS：用于学术会议论文的抽取与地理识别的代理
#114 Clip Your Sequences Fairly: Enforcing Length Fairness for Sequence-Level RL
#115 Generative Engine Optimization: How to Dominate AI Search #115 生成引擎优化：如何主导人工智能搜索

2025-09-15科研追新

1. 源数据

1.1 媒体

From：量子位、机器之心、新智元、AGI Hunt、小红书、X其他

1.2 Arxiv

1.2.1 Computation and Language

From：https:// /arxiv/cs.CL https://arxiv.org/list/cs.CL/recent 2025-09-15 | | 总计：71

#1 WhisTLE：面向预训练语音识别 Transformer 的深度监督、仅文本域自适应

Authors: [Akshat Pandey](https://arxiv.org/search/?searchtype=author&query=Akshat Pandey), [Karun Kumar](https://arxiv.org/search/?searchtype=author&query=Karun Kumar), [Raphael Tang](https://arxiv.org/search/?searchtype=author&query=Raphael Tang) 作者：Akshat Pandey、Karun Kumar、Raphael Tang

诸如 Whisper 等预训练自动语音识别（ASR）模型表现良好，但仍需进行域自适应以应对未知词汇和用语。在许多现实场景中，收集语音数据并不可行，因此需要仅使用文本的自适应方法。我们提出了 WhisTLE，一种针对预训练编码器—解码器 ASR 模型的深度监督、仅文本自适应方法。WhisTLE 训练变分自编码器（VAE）以对来自文本的编码器输出进行建模，并使用学得的文本到潜变量编码器微调解码器，可选择与文本到语音（TTS）自适应结合。在推理时恢复原始编码器，不增加额外运行时成本。在四个域外数据集和四个 ASR 模型上，结合 TTS 的 WhisTLE 相较于仅 TTS 自适应将词错误率（WER）相对降低 12.3%，并在 32 种情形中的 27 种中优于所有非 WhisTLE 基线。

学科：计算与语言, 机器学习

发布：2025-09-12 17:59:09 UTC

#2 DeepDive：通过知识图谱和多轮强化学习推进深度搜索代理

通过为大型语言模型 (LLMs) 增强浏览工具，显著提升了它们作为深度搜索代理解决复杂现实任务的潜力。然而，开放式 LLM 在此类场景中仍表现不佳，原因在于使用浏览工具时的长期推理能力有限以及缺乏足够困难的监督数据。为了解决这些挑战，我们提出了 DeepDive，以推进深度搜索代理的发展。首先，我们提出了一种从开放知识图自动合成复杂、困难且难以查找问题的策略。其次，我们应用端到端多回合强化学习 (RL) 来增强 LLM 在深度搜索中的长期推理能力。实验表明，DeepDive-32B 在 BrowseComp 上取得了新的开源竞品成绩，优于 WebSailor、DeepSeek-R1-Browse 和 Search-o1。我们证明了多回合 RL 训练改善了深度搜索能力，并在多个基准上显著推动了性能提升。我们观察到 DeepDive 支持在测试时扩展工具调用并进行并行采样。所有数据集、模型和代码均已公开，地址为 https://github.com/THUDM/DeepDive。

Subject: 计算与语言

发表：2025-09-12 17:52:35 UTC

#3 RefactorCoderQA：在云端和边缘部署中为多领域编码问题解决方案对 LLMs 进行基准测试

Authors: [Shadikur Rahman](https://arxiv.org/search/?searchtype=author&query=Shadikur Rahman), [Aroosa Hameed](https://arxiv.org/search/?searchtype=author&query=Aroosa Hameed), [Gautam Srivastava](https://arxiv.org/search/?searchtype=author&query=Gautam Srivastava), [Syed Muhammad Danish](https://arxiv.org/search/?searchtype=author&query=Syed Muhammad Danish) 作者：Shadikur Rahman、Aroosa Hameed、Gautam Srivastava、Syed Muhammad Danish

为了优化大型语言模型（LLMs）的推理和问题解决能力，我们提出了一种新颖的云-边协同架构，使结构化的多代理提示框架成为可能。该框架由三种专门组件组成：GuideLLM——部署在边缘的轻量级模型，用于提供方法指导；SolverLLM——托管在云端的更强大模型，负责生成代码解决方案；以及 JudgeLLM——用于自动评估解答正确性和质量的评判器。为在真实场景中评估并展示该架构的有效性，我们引入了 RefactorCoderQA，这是一个全面的基准，用于评估和提升大型语言模型（LLMs）在多领域编码任务上的表现。出于对现有基准局限性的考虑，RefactorCoderQA 系统性地覆盖了多个技术领域，包括软件工程、数据科学、机器学习和自然语言处理，并采用来自 Stack Overflow 的真实编码挑战。大量实验表明，我们调整后的模型 RefactorCoder-MoE 达到了最先进的性能，以 76.84% 的总体准确率显著超越了领先的开源和商业基线。人工评估进一步验证了所生成解答的可解释性、准确性和实际相关性。此外，我们评估了系统级指标，如吞吐量和延迟，以深入了解所提架构的性能特征和权衡。

主题 : Computation and Language 主题：计算与语言

发布 : 2025-09-12 17:44:22 UTC

#4 长上下文自动化作文评分与语言模型

Authors: [Christopher Ormerod](https://arxiv.org/search/?searchtype=author&query=Christopher Ormerod), [Gitit Kehat](https://arxiv.org/search/?searchtype=author&query=Gitit Kehat) 作者：Christopher Ormerod、Gitit Kehat

基于 Transformer 的语言模型在架构上被限制为处理固定最大长度的文本。高年级学生写的作文常常超过许多流行开源模型允许的最大长度。在使用这些模型进行自动化作文评分时，常见的解决方法是截断输入文本。这会引发严重的有效性问题，因为它削弱了模型充分捕捉和评估评分细则中的组织性元素的能力，而这些元素需要长上下文来评估。在本研究中，我们使用 Kaggle ASAP 2.0 数据集评估了若干通过对标准 Transformer 架构进行架构修改以克服这些长度限制的模型。本研究中考虑的模型包括微调后的 XLNet、Longformer、ModernBERT、Mamba 和 Llama 模型。

主题： Computation and Language 主题：计算与语言

发布：2025-09-12 17:13:47 UTC

#5 在上下文中学习真的是在“学习”吗？

Author: [Adrian de Wynter](https://arxiv.org/search/?searchtype=author&query=Adrian de Wynter) 作者：Adrian de Wynter

情境学习（ICL）允许某些自回归模型通过下一个标记的预测来解决任务，而无需进一步训练。这导致了关于这些模型仅凭提示中的少量示例（shots）就能解决（学习）未见过任务的能力的说法。然而，推理并不总是意味着学习，因为 ICL 并没有显式地对给定的观察进行编码。相反，模型依赖于其先验知识以及所提供的示例（如果有的话）。我们认为，从数学上讲，ICL 确实构成了学习，但其完整的表征需要实证研究。于是我们进行了大规模的 ICL 分析，剖析或考虑了记忆化、预训练、分布转移以及提示风格和措辞的影响。我们发现 ICL 是一个有效的学习范式，但在学习和泛化到未见任务方面存在局限。我们注意到，在示例数量增多的极限情况下，准确率对示例分布、模型、提示风格以及输入的语言特征不敏感。相反，模型从提示中的规律性中推断模式，这导致了对分布的敏感性，尤其是在诸如链式思维（chain-of-thought）等提示风格中。鉴于在形式上相似的任务上准确率各不相同，我们得出结论：自回归的临时编码并不是一种稳健的机制，并且表明其作为通用方法的泛化能力有限。

Subjects: Computation and Language, Artificial Intelligence, 机器学习主题：计算与语言、人工智能、机器学习

发布： 2025-09-12 17:12:04 UTC

#6 放弃专家、重组神经元：无需再训练的稀疏专家混合 LLMs 剪枝

稀疏专家混合（Sparse Mixture-of-Experts，SMoE）架构因其计算效率在大型语言模型（LLMs）中被广泛使用。然而，尽管每个标记只激活少数专家，SMoE 仍需加载所有专家参数，导致内存占用高并带来部署挑战。先前的工作尝试通过剪枝和合并专家来减少开销，但主要集中在专家级别的操作，忽略了神经元级别的结构。我们提出了 DERN（Dropping Experts, Recombining Neurons），这是一种与任务无关且无需重训练的专家剪枝与重构框架。我们观察到专家之间在神经元层面上常常存在错位和语义冲突，这给直接合并带来挑战。为了解决这一问题，DERN 包含三步：首先利用路由器统计剪除冗余专家；然后将其分解为神经元级别的专家片段，并将每个片段分配给最兼容的保留专家；最后在每个保留专家内部合并片段以构建紧凑表示。在 Mixtral、Qwen 和 DeepSeek 的 SMoE 模型上的实验表明，在 50%专家稀疏度下，DERN 在常识推理和 MMLU 基准上的性能提高了超过 5%，且无需额外训练。它还大大减少了专家数量和内存使用，使得 SMoE LLMs 在实际部署中更容易。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-09-12 16:09:39 UTC 发布：2025-09-12 16:09:39 UTC

#7 SI-FACT：通过自我改进的可信度感知对比调优缓解知识冲突

Author: [Shengqiang Fu](https://arxiv.org/search/?searchtype=author&query=Shengqiang Fu)作者：傅胜强

大型语言模型在知识密集型任务中常因知识冲突而产生不可靠的回答，即更倾向于依赖内部参数化知识而非所提供的上下文。为了解决这一问题，我们提出了一种新颖的自我改进框架：Self Improving Faithfulness Aware Contrastive Tuning（自我改进的忠实性感知对比微调）。该框架使用自我指导机制，使基础 LLM 能自动生成高质量、结构化的对比学习数据，包括锚样本、语义等价的正样本以及模拟不忠实场景的负样本，从而显著降低人工标注成本。随后，应用对比学习训练模型，使其在表示空间中将忠实的回答拉近，而将不忠实的回答推远。在知识冲突评估基准 ECARE KRE 和 COSE KRE 上的实验表明，基于 Llama3 8B Instruct 的 SI FACT 模型在上下文召回率上比最优基线方法提高了 6.2%，同时显著降低了对内部记忆的依赖。结果表明，SI FACT 在提升 LLM 上下文忠实性方面具有强大的有效性和高数据效率，为构建更主动、更值得信赖的语言模型提供了实用路径。

主题 : 计算与语言, 人工智能

Publish: 2025-09-12 12:56:14 UTC 发布：2025-09-12 12:56:14 UTC

#8 超出令牌限制：评估语言模型在长文本分类任务中的表现

Authors: [Miklós Sebők](https://arxiv.org/search/?searchtype=author&query=Miklós Sebők), [Viktor Kovács](https://arxiv.org/search/?searchtype=author&query=Viktor Kovács), [Martin Bánóczy](https://arxiv.org/search/?searchtype=author&query=Martin Bánóczy), [Daniel Møller Eriksen](https://arxiv.org/search/?searchtype=author&query=Daniel Møller Eriksen), [Nathalie Neptune](https://arxiv.org/search/?searchtype=author&query=Nathalie Neptune), [Philippe Roussille](https://arxiv.org/search/?searchtype=author&query=Philippe Roussille) 作者：Miklós Sebők、Viktor Kovács、Martin Bánóczy、Daniel Møller Eriksen、Nathalie Neptune、Philippe Roussille

在社会科学中最广泛使用的大型语言模型（例如 BERT 及其衍生模型，如 RoBERTa）在可处理的输入文本长度方面存在限制，这些模型只能在一定长度内生成预测。这对某些分类任务来说是一个特别紧迫的问题，这类任务的目标是处理长输入文本。一个这样的领域涉及法律和法案（草案），这些文本可能长达数百页，因此并不适合只能处理例如 512 个令牌的模型。在本文中，我们展示了覆盖 5 种语言的实验结果，所用模型包括 XLM-RoBERTa、Longformer、GPT-3.5、GPT-4，任务为比较议题计划（Comparative Agendas Project）的多类别分类，该数据集的代码本包含 21 个从教育到医疗保健的政策主题标签。结果表明，专为处理长输入而预训练的 Longformer 模型并未展现出特别优势。GPT 变体与表现最佳的开源模型之间的比较显示后者略占上风。对类别层面因素的分析表明，在处理长文本输入时，特定类别之间的支持度和内容重叠对性能具有重要影响。

主题： Computation and Language 主题：计算与语言

发布：2025-09-12 12:47:28 UTC

#9 不相称的积极性：当失准的积极态度削弱在线支持性对话时

Authors: [Leen Almajed](https://arxiv.org/search/?searchtype=author&query=Leen Almajed), [Abeer ALdayel](https://arxiv.org/search/?searchtype=author&query=Abeer ALdayel) 作者：Leen Almajed、Abeer ALdayel

在情感支持型对话中，出于良好意图的积极回应有时会适得其反，导致让人感觉被忽视、被轻视或过于乐观不切实际的回复。我们将这种不一致的积极性视为对积极支持的误校准表达，研究其在人工和 LLM 生成回复中的表现。为此，我们收集了来自 Reddit 的真实用户-助理对话，覆盖不同情绪强度，并针对相同语境使用大型语言模型生成了额外的回复。我们按强度将这些对话分为两个级别：轻度，涵盖关系紧张和一般建议；严重，涵盖丧失与焦虑的对话。此级别划分使得比较分析支持性回复在低风险与高风险语境中的差异成为可能。我们的分析显示，LLM 更容易表现出不切实际的积极性，带有轻视和淡化的语气，尤其在高风险语境中更为明显。为进一步研究该现象的潜在维度，我们在具有强烈与弱烈情感反应的数据集上对 LLM 进行了微调。此外，我们开发了一个弱监督多标签分类器集成（DeBERTa 和 MentalBERT），在两种不同程度的关注（轻度和严重）上对不一致的积极类型检测表现出更好的效果。我们的研究表明，需要超越仅生成通用积极回应的做法，转而研究一致性的支持措施，以在积极情感与情绪确认之间取得平衡。这一方法为使大型语言模型在在线支持性对话中符合情感预期提供了见解，为面向情境感知并维护信任的在线对话系统铺平了道路。

主题：计算与语言

Publish: 2025-09-12 12:25:02 UTC 发表：2025-09-12 12:25:02 协调世界时

#10 LLM 生成文本风格变异基准测试

Authors: [Jiří Milička](https://arxiv.org/search/?searchtype=author&query=Jiří Milička), [Anna Marklová](https://arxiv.org/search/?searchtype=author&query=Anna Marklová), [Václav Cvrček](https://arxiv.org/search/?searchtype=author&query=Václav Cvrček) 作者：Jiří Milička、Anna Marklová、Václav Cvrček

本研究考察了人类撰写文本与大型语言模型（LLMs）生成的可比文本在语域（register）变异方面的差异。研究对一组人类写作文本与作为其对应物生成的 AI 文本样本应用了 Biber 的多维度分析（MDA），以找出 LLM 在哪些变异维度上与人类存在最显著且最系统的差异。作为文本材料，使用了新的 LLM 生成语料库 AI-Brown，它可与 BE-21（代表当代英式英语的 Brown 系列语料库）相比较。由于除英语外的所有语言在前沿 LLM 的训练数据中均被低估，研究还在捷克语上重复了类似分析，使用 AI-Koditex 语料库和捷克语的多维度模型。研究考察了 16 个前沿模型在不同设置和提示下的表现，重点放在基础模型与指令微调模型之间的差异。基于此，构建了一个基准，可用于在可解释的维度上相互比较和对模型进行排名。

主题 : 计算与语言, 人工智能

发布：2025-09-12 12:12:20 UTC

#11 朝着可靠且可解释的文档问答迈进：基于 VLM 的方法

Authors: [Alessio Chen](https://arxiv.org/search/?searchtype=author&query=Alessio Chen), [Simone Giovannini](https://arxiv.org/search/?searchtype=author&query=Simone Giovannini), [Andrea Gemelli](https://arxiv.org/search/?searchtype=author&query=Andrea Gemelli), [Fabio Coppini](https://arxiv.org/search/?searchtype=author&query=Fabio Coppini), [Simone Marinai](https://arxiv.org/search/?searchtype=author&query=Simone Marinai) 作者：Alessio Chen、Simone Giovannini、Andrea Gemelli、Fabio Coppini、Simone Marinai

视觉-语言模型（VLMs）在文档理解方面展示了强大能力，尤其是在识别和提取复杂文档中的文本信息方面。尽管如此，在文档中准确定位答案仍然是一个重大挑战，这限制了解释性和实际应用。为了解决这一问题，我们提出了 DocExplainerV0，一个即插即用的边界框预测模块，将答案生成与空间定位解耦。该设计使其可应用于现有 VLMs，包括无法进行微调的专有系统。通过系统评估，我们对文本准确性与空间定位之间的差距提供了量化见解，表明正确的答案常常缺乏可靠的定位。我们标准化的框架突出了这些不足，并为未来旨在构建更具可解释性和鲁棒性的文档信息提取 VLMs 的研究建立了基准。

Subjects: Computation and Language, Information Retrieval 主题：计算与语言，信息检索

Publish: 2025-09-12 10:44:24 UTC 发布时间：2025-09-12 10:44:24 UTC

#12 面向人口对齐的人格生成用于基于 LLM 的社会模拟

最近大型语言模型（LLM）的进展使得以人类般的规模和逼真度进行社会模拟成为可能，为计算社会科学提供了新的机会。然而，一个关键挑战是构建能够真实代表现实人口多样性与分布的人格集合。大多数现有的基于 LLM 的社会模拟研究主要集中在设计主体框架和模拟环境上，常常忽视了人格生成的复杂性以及由非代表性人格集合引入的潜在偏差。在本文中，我们提出了一个系统化框架，用于为 LLM 驱动的社会模拟合成高质量、与人口对齐的人格集合。我们的方法首先利用 LLM 从长期社交媒体数据中生成叙事式人格，并通过严格的质量评估来筛除低保真度的档案。然后我们应用重要性抽样以实现与参考心理测量分布（例如大五人格特质）的全局对齐。为满足特定模拟情境的需要，我们进一步引入了一个任务专用模块，将全局对齐的人格集合调整为针对性的子群体。大量实验证明，我们的方法能显著降低群体层面的偏差，并使社会模拟在广泛的研究和政策应用中实现准确且灵活的运用。

Subjects: Computation and Language, Artificial Intelligence, 机器学习主题：计算与语言，人工智能，机器学习

发布：2025-09-12 10:43:47 UTC

#13 面向对话语音的突出度感知自动语音识别

Authors: [Julian Linke](https://arxiv.org/search/?searchtype=author&query=Julian Linke), [Barbara Schuppler](https://arxiv.org/search/?searchtype=author&query=Barbara Schuppler) 作者：Julian Linke，Barbara Schuppler

本文通过将重音（突出性）检测与语音识别相结合，研究了面向会话奥地利德语的重音感知自动语音识别（ASR）。首先，通过微调 wav2vec2 模型以对词级重音进行分类，开发了重音检测器。随后使用该检测器在大型语料库中自动标注韵律重音。基于这些标注，我们训练了新型的重音感知 ASR 系统，能够同时识别词语及其重音级别。与基线 ASR 系统相比，整合重音信息并未改变识别性能；在识别的词序列正确的语句中，重音检测的准确率达到了 85.53%。本文表明基于 Transformer 的模型能够有效编码韵律信息，并在增强韵律的 ASR 领域做出新颖贡献，具有语言学研究和基于韵律的对话系统的潜在应用。

主题 : 计算与语言, 音频与语音处理

发布：2025-09-12 10:18:38 UTC

#14 使用合成数据扩展阿拉伯语医疗聊天机器人：通过合成病历增强生成式人工智能

Authors: [Abdulrahman Allam](https://arxiv.org/search/?searchtype=author&query=Abdulrahman Allam), [Seif Ahmed](https://arxiv.org/search/?searchtype=author&query=Seif Ahmed), [Ali Hamdi](https://arxiv.org/search/?searchtype=author&query=Ali Hamdi), [Khaled Shaban](https://arxiv.org/search/?searchtype=author&query=Khaled Shaban) 作者：Abdulrahman Allam、Seif Ahmed、Ali Hamdi、Khaled Shaban

阿拉伯语医疗聊天机器人开发受到大规模高质量标注数据稀缺的严重制约。尽管先前工作从社交媒体汇编了 2 万条阿拉伯语患者-医生互动数据集以微调 LLMs，但模型的可扩展性和泛化能力仍然有限。在本研究中，我们提出了一种可扩展的合成数据增强策略，将训练语料扩展到 10 万条记录。利用先进的生成式人工智能系统 ChatGPT-4o 和 Gemini 2.5 Pro，我们生成了 8 万条在语境上相关且医学上连贯的合成问答对，这些问答对以原始数据集的结构为基础。这些合成样本经过语义过滤、人工验证并整合到训练流程中。我们微调了五个 LLM 模型，包括 Mistral-7B 和 AraGPT2，并使用 BERTScore 指标和专家驱动的定性评估来评估其性能。为了进一步分析合成来源的有效性，我们还进行了一项消融研究，独立比较了由 ChatGPT-4o 和 Gemini 生成的数据。结果显示，使用 ChatGPT-4o 数据在所有模型中都能持续带来更高的 F1 分数和更少的幻觉现象。总体而言，我们的发现表明，合成增强在低资源医学自然语言处理领域作为增强特定领域语言模型的实用解决方案是可行的，为更具包容性、可扩展且更准确的阿拉伯语医疗聊天机器人系统铺平了道路。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-09-12 09:58:11 UTC 发布：2025-09-12 09:58:11 协调世界时（UTC）

#15 阿拉伯语大型语言模型用于医学文本生成

Authors: [Abdulrahman Allam](https://arxiv.org/search/?searchtype=author&query=Abdulrahman Allam), [Seif Ahmed](https://arxiv.org/search/?searchtype=author&query=Seif Ahmed), [Ali Hamdi](https://arxiv.org/search/?searchtype=author&query=Ali Hamdi), [Ammar Mohammed](https://arxiv.org/search/?searchtype=author&query=Ammar Mohammed) 作者：Abdulrahman Allam、Seif Ahmed、Ali Hamdi、Ammar Mohammed

高效的医院管理系统（HMS）在全球范围内对解决如过度拥挤、资源有限和急诊医疗可及性差等挑战至关重要。现有方法往往无法提供准确的实时医疗建议，尤其在处理非规范输入和代表性不足的语言时更是如此。为克服这些局限，本研究提出了一种对阿拉伯语医学文本生成进行微调的 LLMs 方法。该系统旨在通过基于用户输入提供准确的医疗建议、诊断、用药推荐和治疗方案来帮助患者。研究方法需要从社交媒体平台收集独特的数据集，捕捉患者与医生之间的真实医疗对话。该数据集包括患者主诉及相应的医疗建议，并经过妥善清洗和预处理以兼顾多种阿拉伯语方言。对最先进的生成模型（如 Mistral-7B-Instruct-v0.2、LLaMA-2-7B 和 GPT-2 Medium）进行微调，优化了系统生成可靠医学文本的能力。评估结果表明，经过微调的 Mistral-7B 模型优于其他模型，在精确率、召回率和 F1 分数方面分别达到了平均 BERT（双向编码器表征变换器）得分值 68.5%、69.08% 和 68.5%。比较基准测试和定性评估验证了该系统能够针对非正式输入生成连贯且相关的医学回复。本研究强调了生成式人工智能（AI）在推进卫生管理系统（HMS）方面的潜力，为全球医疗挑战提供了一种可扩展且可适应的解决方案，尤其适用于语言和文化多样的环境。

主题： Computation and Language 主题：计算与语言

发布：2025-09-12 09:37:26 UTC

#16 查询气候知识：用于科学发现的语义检索

Authors: [Mustapha Adamu](https://arxiv.org/search/?searchtype=author&query=Mustapha Adamu), [Qi Zhang](https://arxiv.org/search/?searchtype=author&query=Qi Zhang), [Huitong Pan](https://arxiv.org/search/?searchtype=author&query=Huitong Pan), [Longin Jan Latecki](https://arxiv.org/search/?searchtype=author&query=Longin Jan Latecki), [Eduard C. Dragut](https://arxiv.org/search/?searchtype=author&query=Eduard C. Dragut) 作者：Mustapha Adamu、Qi Zhang、Huitong Pan、Longin Jan Latecki、Eduard C. Dragut

气候科学文献的复杂性和数量不断增加，使研究人员越来越难以在模型、数据集、区域和变量之间找到相关信息。本文引入了一个从气候出版物和更广泛的科学文本构建的领域特定知识图（KG），旨在改善气候知识的访问和使用方式。不同于基于关键词的检索，我们的知识图支持结构化的语义查询，帮助研究人员发现精确的关联，例如哪些模型在特定区域已被验证，或哪些数据集通常与某些遥相关模式一同使用。我们展示了如何使用 Cypher 查询让知识图回答此类问题，并概述了其与 RAG 系统中大型语言模型的集成，以提高气候相关问答的透明性和可靠性。该工作超越了知识图构建，展示了其对气候研究人员、模型开发者以及依赖准确、具上下文的科学信息的其他人员的现实价值。

主题：计算与语言

发布：2025-09-12 09:28:29 UTC

#17 已确立的心理测量学问卷与生态有效问卷：重新思考大型语言模型中的心理评估

Authors: [Dongmin Choi](https://arxiv.org/search/?searchtype=author&query=Dongmin Choi), [Woojung Song](https://arxiv.org/search/?searchtype=author&query=Woojung Song), [Jongwook Han](https://arxiv.org/search/?searchtype=author&query=Jongwook Han), [Eun-Ju Lee](https://arxiv.org/search/?searchtype=author&query=Eun-Ju Lee), [Yohan Jo](https://arxiv.org/search/?searchtype=author&query=Yohan Jo) 作者：Choi Dongmin、Song Woojung、Han Jongwook、Lee Eun-Ju、Yohan Jo

研究人员已经应用现成的心理测量问卷（例如 BFI、PVQ）来衡量大型语言模型 (LLMs) 在回答中反映出的性格特征和价值观。然而，人们对将这些为人类设计的问卷用于 LLMs 提出了质疑。其中一个担忧是它们缺乏生态效度——即调查问题在多大程度上足以反映并类似于 LLMs 在响应用户查询时生成文本的真实世界情境。然而，尚不清楚现有问卷与具有生态效度的问卷在结果上如何不同，以及这些差异可能提供何种见解。在本文中，我们对这两类问卷进行了全面的比较分析。我们的分析表明，现有问卷（1）相比具有生态效度的问卷，会产生与用户查询语境中表达的心理特征显著不同的 LLMs 特征画像；（2）题目数量不足以保证稳定测量；（3）造成对 LLMs 拥有稳定构念的误导性印象；以及（4）对通过角色提示(persona-prompted)驱动的 LLMs 给出夸大的特征画像。总体而言，我们的工作提醒人们对将既有心理学问卷用于 LLMs 保持谨慎。我们的代码将在发表时公开。

主题 : 计算与语言, 人工智能

Publish: 2025-09-12 09:14:42 UTC 发布时间：2025-09-12 09:14:42 UTC

#18 !MSA at BAREC Shared Task 2025: Ensembling Arabic Transformers for Readability Assessment #18 !MSA 在 BAREC 共享任务 2025：集成阿拉伯语变换器用于可读性评估 [PDF] [复制] [Kimi] [REL]

Authors: [Mohamed Basem](https://arxiv.org/search/?searchtype=author&query=Mohamed Basem), [Mohamed Younes](https://arxiv.org/search/?searchtype=author&query=Mohamed Younes), [Seif Ahmed](https://arxiv.org/search/?searchtype=author&query=Seif Ahmed), [Abdelrahman Moustafa](https://arxiv.org/search/?searchtype=author&query=Abdelrahman Moustafa) 作者：Mohamed Basem、Mohamed Younes、Seif Ahmed、Abdelrahman Moustafa

我们提出了 MSAs 在 BAREC 2025 细粒度阿拉伯语可读性评估共享任务中的获胜系统，在六个赛道中均获得第一名。我们的方法是一个由四个互补的 Transformer 模型（AraBERTv2、AraELECTRA、MARBERT 和 CAMeLBERT）组成的置信度加权集成，每个模型都使用不同的损失函数进行微调以捕捉多样的可读性信号。为应对严重的类别不平衡和数据稀缺问题，我们采用了加权训练、高级预处理、使用我们最强模型对 SAMER 语料进行重新标注，以及通过 Gemini 2.5 Flash 生成合成数据，新增了大约 10,000 条稀有等级样本。一个有针对性的后处理步骤纠正了预测分布偏斜，带来了 6.3 个百分点的二次加权卡帕（QWK）提升。我们的系统在句子级别达到 87.5%的 QWK，在文档级别达到 87.4%的 QWK，展示了模型与损失多样性、基于置信度的融合以及智能增强在稳健阿拉伯语可读性预测中的强大作用。

主题： Computation and Language 主题：计算与语言

发布 : 2025-09-12 08:08:45 UTC

#19 双相情感障碍在社交媒体上的语言轨迹

Authors: [Laurin Plank](https://arxiv.org/search/?searchtype=author&query=Laurin Plank), [Armin Zlomuzica](https://arxiv.org/search/?searchtype=author&query=Armin Zlomuzica) 作者：Laurin Plank，Armin Zlomuzica

语言提供了双相情感障碍（BD）等情感障碍的重要标志，但临床评估的规模仍然有限。为此，社交媒体（SM）语言的分析因其高时间分辨率和纵向范围而受到重视。在此，我们引入了一种确定用户诊断时间的方法，并将其应用于研究从 BD 诊断前三年到诊断后 21 年的语言轨迹——并与报告单相抑郁（UD）和未受影响的用户（HC）进行对比。我们表明，BD 诊断伴随广泛的语言改变，反映出情绪紊乱、精神病共病、物质滥用、住院、医疗共病、异常思维内容和思维紊乱。我们进一步观察到诊断后两十年中反复出现的与情绪相关的语言变化，并表现出明显的 12 个月周期性，提示季节性情绪发作。最后，趋势性证据表明，估计为女性的用户具有更高的周期性。总之，我们的研究结果为 BD 急性期和慢性期的语言改变提供了证据。这验证并扩展了最近利用社交媒体进行可扩展心理健康监测的工作。

主题： Computation and Language 主题：Computation and Language

发布：2025-09-12 08:02:38 UTC

#20 对话理解中的多意图识别：对比较小的开源 LLMs

Authors: [Adnan Ahmad](https://arxiv.org/search/?searchtype=author&query=Adnan Ahmad), [Philine Kowol](https://arxiv.org/search/?searchtype=author&query=Philine Kowol), [Stefan Hillmann](https://arxiv.org/search/?searchtype=author&query=Stefan Hillmann), [Sebastian Möller](https://arxiv.org/search/?searchtype=author&query=Sebastian Möller) 作者：Adnan Ahmad，Philine Kowol，Stefan Hillmann，Sebastian Möller

在本文中，我们对使用开放源代码、公开可得并可在消费级硬件上运行的大型语言模型（LLMs）进行多标签意图分类做了详尽分析。我们使用对话系统领域的基准数据集 MultiWOZ 2.1，研究三种流行的开源预训练 LLMs 的效果，分别是 LLama2-7B-hf、Mistral-7B-v0.1 和 Yi-6B。我们在少样本设置下执行分类任务，在提示中提供 20 个示例并附上一些指令。我们的方法侧重于通过系统地评估这些模型在多标签意图分类任务上的表现差异，比较它们在若干性能指标上的表现。此外，我们将基于指令的微调方法与使用较小的 Transformer 模型 BertForSequenceClassification 的监督学习作为基线进行了比较。为了评估模型性能，我们使用准确率、精确率和召回率等评估指标，以及微平均、宏平均和加权 F1 分数。同时我们还报告了推理时间、显存需求等。在 F 值方面，Mistral-7B-v0.1 在 14 个意图类别中有 11 个的表现优于其他两种生成模型，加权平均值为 0.50。它的 Humming Loss 相对较低且 Jaccard 相似度较高，使其在 few-shot 设置中成为获胜模型。我们发现基于 BERT 的监督分类器相比表现最好的 few-shot 生成型 LLM 拥有更优的性能。该研究为小型开源 LLM 在检测复杂多意图对话方面提供了一个框架，从而增强了面向任务的聊天机器人的自然语言理解能力。

学科：计算与语言，人机交互

发表：2025-09-12 07:10:55 UTC

#21 通过检查推理过程进行无监督幻觉检测

Authors: [Ponhvoan Srey](https://arxiv.org/search/?searchtype=author&query=Ponhvoan Srey), [Xiaobao Wu](https://arxiv.org/search/?searchtype=author&query=Xiaobao Wu), [Anh Tuan Luu](https://arxiv.org/search/?searchtype=author&query=Anh Tuan Luu) 作者：Ponhvoan Srey，Xiaobao Wu，Anh Tuan Luu

无监督幻觉检测旨在在不依赖标注数据的情况下识别大型语言模型（LLMs）生成的幻觉内容。尽管无监督方法通过消除费力的人类注释而越来越受欢迎，但它们常常依赖与事实正确性无关的代理信号。这种不一致使检测探针偏向表面或与事实无关的方面，限制了在不同数据集和场景间的泛化能力。为克服这些限制，我们提出了 IRIS，一种无监督幻觉检测框架，利用与事实正确性内在相关的内部表示。IRIS 促使 LLM 仔细核验给定陈述的真实性，并将其上下文化的嵌入作为用于训练的信息性特征。同时，每个响应的不确定性被视为真实性的软伪标签。实验结果表明，IRIS 在各项指标上持续优于现有的无监督方法。我们的方法完全无监督、计算成本低，即使在少量训练数据下也能良好工作，适用于实时检测。

主题 : 计算与语言, 人工智能

发布：2025-09-12 06:58:17 UTC

#22 CMHG：面向中国少数民族语言的标题生成数据集与基准

Authors: [Guixian Xu](https://arxiv.org/search/?searchtype=author&query=Guixian Xu), [Zeli Su](https://arxiv.org/search/?searchtype=author&query=Zeli Su), [Ziyin Zhang](https://arxiv.org/search/?searchtype=author&query=Ziyin Zhang), [Jianing Liu](https://arxiv.org/search/?searchtype=author&query=Jianing Liu), [XU Han](https://arxiv.org/search/?searchtype=author&query=XU Han), [Ting Zhang](https://arxiv.org/search/?searchtype=author&query=Ting Zhang), [Yushuang Dong](https://arxiv.org/search/?searchtype=author&query=Yushuang Dong) 作者：Guixian Xu、Zeli Su、Ziyin Zhang、Jianing Liu、XU Han、Ting Zhang、Yushuang Dong

中国的少数民族语言，如藏语、维吾尔语和传统蒙古语，由于其独特的书写系统与国际标准存在差异，面临着显著挑战。这种差异导致相关语料严重匮乏，尤其是用于监督任务（如标题生成）的语料。为填补这一空白，我们提出了一个新数据集——中国少数民族语言标题生成（CMHG），其中包含藏语 100,000 条，维吾尔语和蒙古语各 50,000 条，专门为标题生成任务精心挑选。此外，我们还提出了由母语者注释的高质量测试集，旨在作为该领域未来研究的基准。我们希望该数据集能成为推动中国少数民族语言标题生成发展的宝贵资源，并有助于相关基准的构建。

主题：计算与语言

发布：2025-09-12 06:18:44 UTC

#23 大型语言模型与法律人工智能的相遇：一项综述

Authors: [Zhitian Hou](https://arxiv.org/search/?searchtype=author&query=Zhitian Hou), [Zihan Ye](https://arxiv.org/search/?searchtype=author&query=Zihan Ye), [Nanli Zeng](https://arxiv.org/search/?searchtype=author&query=Nanli Zeng), [Tianyong Hao](https://arxiv.org/search/?searchtype=author&query=Tianyong Hao), [Kun Zeng](https://arxiv.org/search/?searchtype=author&query=Kun Zeng) 作者：Zhitian Hou、Zihan Ye、Nanli Zeng、Tianyong Hao、Kun Zeng

Large Language Models (LLMs) have significantly advanced the development of Legal Artificial Intelligence (Legal AI) in recent years, enhancing the efficiency and accuracy of legal tasks. To advance research and applications of LLM-based approaches in legal domain, this paper provides a comprehensive review of 16 legal LLMs series and 47 LLM-based frameworks for legal tasks, and also gather 15 benchmarks and 29 datasets to evaluate different legal capabilities. Additionally, we analyse the challenges and discuss future directions for LLM-based approaches in the legal domain. We hope this paper provides a systematic introduction for beginners and encourages future research in this field. Resources are available at https://github.com/ZhitianHou/LLMs4LegalAI. 大型语言模型（LLMs）在近年来显著推动了法律人工智能（Legal AI）的发展，提高了法律任务的效率和准确性。为推进基于 LLM 的方法在法律领域的研究与应用，本文全面综述了 16 个法律 LLM 系列和 47 个基于 LLM 的法律任务框架，并汇集了 15 个基准和 29 个数据集用于评估不同的法律能力。此外，我们分析了挑战并讨论了基于 LLM 的方法在法律领域的未来方向。我们希望本文能为初学者提供系统性的入门介绍并鼓励未来在该领域的研究。资源可在 https://github.com/ZhitianHou/LLMs4LegalAI 获取。

主题 : 计算与语言, 人工智能

Publish: 2025-09-12 05:08:11 UTC 发布时间：2025-09-12 05:08:11 UTC

#24 Emulating Public Opinion: A Proof-of-Concept of AI-Generated Synthetic Survey Responses for the Chilean Case #24 模拟公众舆论：面向智利案例的 AI 生成合成调查回应的概念验证 [PDF] [复制] [Kimi] [REL]

Authors: [Bastián González-Bustamante](https://arxiv.org/search/?searchtype=author&query=Bastián González-Bustamante), [Nando Verelst](https://arxiv.org/search/?searchtype=author&query=Nando Verelst), [Carla Cisternas](https://arxiv.org/search/?searchtype=author&query=Carla Cisternas) 作者：Bastián González-Bustamante、Nando Verelst、Carla Cisternas

Large Language Models (LLMs) offer promising avenues for methodological and applied innovations in survey research by using synthetic respondents to emulate human answers and behaviour, potentially mitigating measurement and representation errors. However, the extent to which LLMs recover aggregate item distributions remains uncertain and downstream applications risk reproducing social stereotypes and biases inherited from training data. We evaluate the reliability of LLM-generated synthetic survey responses against ground-truth human responses from a Chilean public opinion probabilistic survey. Specifically, we benchmark 128 prompt-model-question triplets, generating 189,696 synthetic profiles, and pool performance metrics (i.e., accuracy, precision, recall, and F1-score) in a meta-analysis across 128 question-subsample pairs to test for biases along key sociodemographic dimensions. The evaluation spans OpenAI’s GPT family and o-series reasoning models, as well as Llama and Qwen checkpoints. Three results stand out. First, synthetic responses achieve excellent performance on trust items (F1-score and accuracy > 0.90). Second, GPT-4o, GPT-4o-mini and Llama 4 Maverick perform comparably on this task. Third, synthetic-human alignment is highest among respondents aged 45-59. Overall, LLM-based synthetic samples approximate responses from a probabilistic sample, though with substantial item-level heterogeneity. Capturing the full nuance of public opinion remains challenging and requires careful calibration and additional distributional tests to ensure algorithmic fidelity and reduce errors. 大型语言模型（LLMs）为问卷调查研究的方法和应用创新提供了有希望的途径，通过使用合成受访者来模拟人类的回答和行为，可能减轻测量误差和代表性误差。然而，LLMs 在多大程度上能恢复总体项分布仍不确定，且下游应用有可能再现训练数据中继承的社会刻板印象和偏见。我们将 LLM 生成的合成调查回答与来自智利一项基于概率抽样的民意调查的真实人类回答进行可靠性评估。具体而言，我们基准测试了 128 个提示-模型-问题三元组，生成了 189,696 个合成档案，并在 128 个问题-子样本对上进行荟萃分析以汇总性能指标（即准确率、精确率、召回率和 F1 分数），以检测关键社会人口学维度上的偏差。评估涵盖了 OpenAI 的 GPT 系列和 o 系列推理模型，以及 Llama 和 Qwen 检查点。三个结果尤为显著。首先，合成回复在信任题目上表现出色（F1 分数和准确率均 > 0.90）。其次，GPT-4o、GPT-4o-mini 和 Llama 4 Maverick 在该任务上的表现相当。第三，合成—人工一致性在 45–59 岁受访者中最高。总体而言，基于 LLM 的合成样本可以近似概率抽样的回复，尽管在题目层面存在显著异质性。要完整捕捉公众舆论的细微差别仍具挑战性，需要谨慎校准并进行额外的分布检验以确保算法忠实性并减少错误。

主题 : 计算与语言, 人工智能

Publish: 2025-09-11 21:43:59 UTC 发布：2025-09-11 21:43:59 UTC

#25 Topic-Guided Reinforcement Learning with LLMs for Enhancing Multi-Document Summarization #25 主题引导的强化学习与 LLMs 用于增强多文档摘要 [PDF] [复制] [Kimi] [REL]

Authors: [Chuyuan Li](https://arxiv.org/search/?searchtype=author&query=Chuyuan Li), [Austin Xu](https://arxiv.org/search/?searchtype=author&query=Austin Xu), [Shafiq Joty](https://arxiv.org/search/?searchtype=author&query=Shafiq Joty), [Giuseppe Carenini](https://arxiv.org/search/?searchtype=author&query=Giuseppe Carenini) 作者：Chuyuan Li, Austin Xu, Shafiq Joty, Giuseppe Carenini

A key challenge in Multi-Document Summarization (MDS) is effectively integrating information from multiple sources while maintaining coherence and topical relevance. While Large Language Models have shown impressive results in single-document summarization, their performance on MDS still leaves room for improvement. In this paper, we propose a topic-guided reinforcement learning approach to improve content selection in MDS. We first show that explicitly prompting models with topic labels enhances the informativeness of the generated summaries. Building on this insight, we propose a novel topic reward within the Group Relative Policy Optimization (GRPO) framework to measure topic alignment between the generated summary and source documents. Experimental results on the Multi-News and Multi-XScience datasets demonstrate that our method consistently outperforms strong baselines, highlighting the effectiveness of leveraging topical cues in MDS. 在多文档摘要（MDS）中，一个关键挑战是如何在保持连贯性和主题相关性的同时，有效整合来自多个来源的信息。尽管大型语言模型在单文档摘要方面已展示出令人印象深刻的成果，但它们在多文档摘要上的表现仍有提升空间。在本文中，我们提出了一种主题引导的强化学习方法，以改进多文档摘要中的内容选择。我们首先展示了通过显式地以主题标签提示模型可以增强生成摘要的信息量。在此基础上，我们在组相对策略优化（Group Relative Policy Optimization，GRPO）框架内提出了一种新的主题奖励，用于衡量生成摘要与源文档之间的主题一致性。在 Multi-News 和 Multi-XScience 数据集上的实验结果表明，我们的方法始终优于强基线，突出了在多文档摘要中利用主题线索的有效性。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-09-11 21:01:54 UTC 发布：2025-09-11 21:01:54 UTC

#26 Pragmatic Frames Evoked by Gestures: A FrameNet Brasil Approach to Multimodality in Turn Organization #26 手势唤起的语用框架：FrameNet Brasil 对话轮组织多模态研究

Authors: [Helen de Andrade Abreu](https://arxiv.org/search/?searchtype=author&query=Helen de Andrade Abreu), [Tiago Timponi Torrent](https://arxiv.org/search/?searchtype=author&query=Tiago Timponi Torrent), [Ely Edison da Silva Matos](https://arxiv.org/search/?searchtype=author&query=Ely Edison da Silva Matos) 作者：Helen de Andrade Abreu，Tiago Timponi Torrent，Ely Edison da Silva Matos

This paper proposes a framework for modeling multimodal conversational turn organization via the proposition of correlations between language and interactive gestures, based on analysis as to how pragmatic frames are conceptualized and evoked by communicators. As a means to provide evidence for the analysis, we developed an annotation methodology to enrich a multimodal dataset (annotated for semantic frames) with pragmatic frames modeling conversational turn organization. Although conversational turn organization has been studied by researchers from diverse fields, the specific strategies, especially gestures used by communicators, had not yet been encoded in a dataset that can be used for machine learning. To fill this gap, we enriched the Frame2 dataset with annotations of gestures used for turn organization. The Frame2 dataset features 10 episodes from the Brazilian TV series Pedro Pelo Mundo annotated for semantic frames evoked in both video and text. This dataset allowed us to closely observe how communicators use interactive gestures outside a laboratory, in settings, to our knowledge, not previously recorded in related literature. Our results have confirmed that communicators involved in face-to-face conversation make use of gestures as a tool for passing, taking and keeping conversational turns, and also revealed variations of some gestures that had not been documented before. We propose that the use of these gestures arises from the conceptualization of pragmatic frames, involving mental spaces, blending and conceptual metaphors. In addition, our data demonstrate that the annotation of pragmatic frames contributes to a deeper understanding of human cognition and language. 本文提出了一个框架，用于通过语言与互动手势之间相关性的假设来建模多模态对话轮次组织，基于对交际者如何概念化并唤起语用框（pragmatic frames）的分析。为了为该分析提供证据，我们开发了一种标注方法，将对话轮次组织的语用框标注加入到一个多模态数据集中（该数据集已对语义框进行了标注）。尽管来自不同领域的研究者已研究对话轮次组织，但具体策略，尤其是交际者使用的手势，尚未被编码进可用于机器学习的数据集中。为填补这一空白，我们在 Frame2 数据集中增加了用于轮次组织的手势标注。Frame2 数据集包含巴西电视剧 Pedro Pelo Mundo 的 10 集，并对视频和文本中唤起的语义框进行了标注。该数据集使我们能够在非实验室环境中近距离观察交际者如何使用互动手势，据我们所知，这类设置在相关文献中此前未被记录。我们的研究结果证实，参与面对面会话的交流者使用手势作为传递、争取和保持话语轮次的工具，并且还发现了一些此前未被记录的手势变体。我们提出，这些手势的使用源于语用框架的概念化，涉及心理空间、混合与概念隐喻。此外，我们的数据表明，对语用框架的注释有助于更深入地理解人类的认知与语言。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-09-11 19:14:57 UTC 发布：2025-09-11 19:14:57 UTC

#27 HEFT: A Coarse-to-Fine Hierarchy for Enhancing the Efficiency and Accuracy of Language Model Reasoning #27 HEFT：一种由粗到细的层次结构，用于提升语言模型推理的效率与准确性

Author: [Brennen Hill](https://arxiv.org/search/?searchtype=author&query=Brennen Hill)作者：Brennen Hill

The adaptation of large language models (LLMs) to specialized reasoning tasks is fundamentally constrained by computational resources. Parameter-Efficient Fine-Tuning (PEFT) methods have emerged as a powerful solution, yet the landscape of these techniques is diverse, with distinct methods operating in either the model’s weight space or its representation space. This paper investigates the hypothesis that a synergistic combination of these paradigms can unlock superior performance and efficiency. We introduce HEFT (Hierarchical Efficient Fine-Tuning), a novel hierarchical adaptation strategy that composes two distinct PEFT methods in a coarse-to-fine manner: first, a broad, foundational adaptation in the weight space using Low-Rank Adaptation (LoRA), followed by a precise, surgical refinement of internal activations using Representation Fine-Tuning (ReFT). We evaluate this approach by fine-tuning a Llama-2-7B model on the BoolQ benchmark, a challenging dataset for inferential reasoning. Our results reveal a profound synergistic effect. A model fine-tuned for only three epochs with our HEFT strategy achieves an accuracy of 85.17%, exceeding the performance of models trained for 20 epochs with either LoRA-only (85.05%) or ReFT-only (83.36%) methodologies. This work demonstrates that the thoughtful composition of PEFT methods is a potent algorithmic innovation, offering a more efficient and effective path toward advancing the reasoning capabilities of language models. By achieving superior results with a fraction of the computational budget, our findings present a principled approach to overcoming the obstacles inherent in adapting large-scale models for complex cognitive tasks. 将大型语言模型（LLMs）适配到专门的推理任务根本上受计算资源的限制。参数高效微调（PEFT）方法已成为一种强有力的解决方案，然而这些技术流派多样，存在在模型权重空间或其表征空间中运行的不同方法。本文探讨了这样一个假设：这些范式的协同结合能够释放更优的性能和效率。我们提出了 HEFT（Hierarchical Efficient Fine-Tuning，层级高效微调），一种新颖的层级适配策略，以粗到细的方式组合两种不同的 PEFT 方法：首先在权重空间进行广泛的基础适配，采用低秩适配（LoRA），随后在内部激活上进行精确的外科式微调，采用表征微调（ReFT）。我们通过在 BoolQ 基准上对 Llama-2-7B 模型进行微调来评估该方法，BoolQ 是一个针对推理性判断的挑战性数据集。我们的结果揭示了显著的协同效应。仅用我们的 HEFT 策略微调三个 epoch 的模型即可达到 85.17% 的准确率，优于使用仅 LoRA（85.05%）或仅 ReFT（83.36%）方法训练 20 个 epoch 的模型。此项工作表明，经过深思熟虑地组合 PEFT 方法是一种强有力的算法创新，为提升语言模型的推理能力提供了一条更高效、更有效的路径。通过在极小的计算预算下取得更优的结果，我们的发现提出了一种有原则的方法，以克服将大规模模型用于复杂认知任务时固有的障碍。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-09-11 19:06:46 UTC 发布：2025-09-11 19:06:46 UTC

#28 Discrimination by LLMs: Cross-lingual Bias Assessment and Mitigation in Decision-Making and Summarisation #28 LLMs 对歧视的识别：跨语种偏见评估及在决策与摘要中的缓解

Authors: [Willem Huijzer](https://arxiv.org/search/?searchtype=author&query=Willem Huijzer), [Jieying Chen](https://arxiv.org/search/?searchtype=author&query=Jieying Chen) 作者：Willem Huijzer，Jieying Chen

The rapid integration of Large Language Models (LLMs) into various domains raises concerns about societal inequalities and information bias. This study examines biases in LLMs related to background, gender, and age, with a focus on their impact on decision-making and summarization tasks. Additionally, the research examines the cross-lingual propagation of these biases and evaluates the effectiveness of prompt-instructed mitigation strategies. Using an adapted version of the dataset by Tamkin et al. (2023) translated into Dutch, we created 151,200 unique prompts for the decision task and 176,400 for the summarisation task. Various demographic variables, instructions, salience levels, and languages were tested on GPT-3.5 and GPT-4o. Our analysis revealed that both models were significantly biased during decision-making, favouring female gender, younger ages, and certain backgrounds such as the African-American background. In contrast, the summarisation task showed minimal evidence of bias, though significant age-related differences emerged for GPT-3.5 in English. Cross-lingual analysis showed that bias patterns were broadly similar between English and Dutch, though notable differences were observed across specific demographic categories. The newly proposed mitigation instructions, while unable to eliminate biases completely, demonstrated potential in reducing them. The most effective instruction achieved a 27% mean reduction in the gap between the most and least favorable demographics. Notably, contrary to GPT-3.5, GPT-4o displayed reduced biases for all prompts in English, indicating the specific potential for prompt-based mitigation within newer models. This research underscores the importance of cautious adoption of LLMs and context-specific bias testing, highlighting the need for continued development of effective mitigation strategies to ensure responsible deployment of AI. 大型语言模型（LLMs）迅速融入各个领域，引发了对社会不平等和信息偏见的担忧。本研究考察了与背景、性别和年龄相关的 LLMs 偏见，重点关注这些偏见对决策和摘要任务的影响。此外，研究还检验了这些偏见的跨语言传播以及提示指令式缓解策略的有效性。我们使用 Tamkin 等人（2023）数据集的改编版本并翻译成荷兰语，为决策任务创建了 151,200 条独特提示，为摘要任务创建了 176,400 条独特提示。在 GPT-3.5 和 GPT-4o 上测试了各种人口统计变量、指令、显著性水平和语言。我们的分析显示，两种模型在决策过程中都存在显著偏见，偏向女性、更年轻的年龄以及某些背景，例如非裔美国人背景。相比之下，摘要任务显示出极少的偏见证据，尽管 GPT-3.5 在英文中出现了显著的与年龄相关的差异。跨语言分析显示，英语和荷兰语之间的偏见模式总体相似，但在特定人口统计类别上观察到显著差异。新提出的缓解指令虽然无法完全消除偏见，但在减少偏见方面显示出潜力。最有效的指令在最有利与最不利人口群体之间的差距上实现了平均 27%的减少。值得注意的是，与 GPT-3.5 相反，GPT-4o 在所有英语提示中表现出更低的偏见，这表明在较新模型中基于提示的缓解方法具有特定潜力。本研究强调谨慎采用 LLMs 并进行针对具体情境的偏见测试的重要性，凸显了继续开发有效缓解策略以确保负责任部署 AI 的必要性。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-09-10 14:25:09 UTC 发布：2025-09-10 14:25:09 UTC

#29 MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools #29 MCP-AgentBench：使用 MCP 调解的工具评估现实世界语言代理性能

Authors: [Zikang Guo](https://arxiv.org/search/?searchtype=author&query=Zikang Guo), [Benfeng Xu](https://arxiv.org/search/?searchtype=author&query=Benfeng Xu), [Chiwei Zhu](https://arxiv.org/search/?searchtype=author&query=Chiwei Zhu), [Wentao Hong](https://arxiv.org/search/?searchtype=author&query=Wentao Hong), [Xiaorui Wang](https://arxiv.org/search/?searchtype=author&query=Xiaorui Wang), [Zhendong Mao](https://arxiv.org/search/?searchtype=author&query=Zhendong Mao) 作者：郭子康，徐本峰，朱持维，洪文涛，王晓睿，毛振东

The Model Context Protocol (MCP) is rapidly emerging as a pivotal open standard, designed to enhance agent-tool integration and interoperability, and is positioned to unlock a new era of powerful, interconnected, and genuinely utilitarian agentic AI. However, despite MCP’s growing adoption, existing benchmarks often fail to capture real-world agent performance within this new paradigm, leading to a distorted perception of their true operational value and an inability to reliably differentiate proficiencies. To bridge this critical evaluation gap, we introduce MCP-AgentBench – a comprehensive benchmark specifically engineered to rigorously assess language agent capabilities in MCP-mediated tool interactions. Core contributions of MCP-AgentBench include: the establishment of a robust MCP testbed comprising 33 operational servers with 188 distinct tools; the development of a benchmark featuring 600 systematically designed queries distributed across 6 distinct categories of varying interaction complexity; and the introduction of MCP-Eval, a novel outcome-oriented evaluation methodology prioritizing real-world task success. Through extensive empirical evaluation of leading language agents, we provide foundational insights. MCP-AgentBench aims to equip the research community with a standardized and reliable framework to build, validate, and advance agents capable of fully leveraging MCP’s transformative benefits, thereby accelerating progress toward truly capable and interoperable AI systems. 模型上下文协议（MCP）正迅速成为一个关键的开放标准，旨在增强代理与工具的集成与互操作性，并有望开启一个强大、互联且真正实用的代理式人工智能新时代。然而，尽管 MCP 的采用日益增加，现有基准测试往往未能在这一新范式下捕捉到真实世界中代理的性能，导致对其实际运行价值的扭曲认知以及无法可靠地区分能力水平。为弥合这一关键评估鸿沟，我们引入了 MCP-AgentBench——一个专门为严格评估在 MCP 介导的工具交互中语言代理能力而设计的综合基准。MCP-AgentBench 的核心贡献包括：建立了一个由 33 个可运行服务器和 188 个不同工具组成的稳健 MCP 测试平台；开发了一个包含 600 个系统设计查询、分布在 6 个交互复杂度各异类别中的基准；以及引入了 MCP-Eval，一种以结果为导向、优先考虑真实世界任务成功的新型评估方法。通过对领先语言代理进行广泛的实证评估，我们提供了基础性见解。MCP-AgentBench 旨在为研究界提供一个标准化且可靠的框架，以构建、验证并推进能够充分利用 MCP 变革性优势的代理，从而加速朝着真正强大且可互操作的人工智能系统的进展。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-09-10 14:08:40 UTC 发布时间：2025-09-10 14:08:40 UTC

#30 Benchmarking Vision-Language Models on Chinese Ancient Documents: From OCR to Knowledge Reasoning #30 对中文古籍文献的视觉-语言模型基准测试：从光学字符识别到知识推理 [PDF 1 ] [复制] [Kimi] [REL]

Chinese ancient documents, invaluable carriers of millennia of Chinese history and culture, hold rich knowledge across diverse fields but face challenges in digitization and understanding, i.e., traditional methods only scan images, while current Vision-Language Models (VLMs) struggle with their visual and linguistic complexity. Existing document benchmarks focus on English printed texts or simplified Chinese, leaving a gap for evaluating VLMs on ancient Chinese documents. To address this, we present AncientDoc, the first benchmark for Chinese ancient documents, designed to assess VLMs from OCR to knowledge reasoning. AncientDoc includes five tasks (page-level OCR, vernacular translation, reasoning-based QA, knowledge-based QA, linguistic variant QA) and covers 14 document types, over 100 books, and about 3,000 pages. Based on AncientDoc, we evaluate mainstream VLMs using multiple metrics, supplemented by a human-aligned large language model for scoring. 中国古代文献是承载数千年中国历史与文化的宝贵载体，蕴含着丰富的跨学科知识，但在数字化与理解方面面临挑战。传统方法仅对其进行图像扫描，而现有的视觉-语言模型（VLMs）难以处理其复杂的视觉和语言特性。现有的文档基准多聚焦于英文印刷文本或简体中文，难以评估 VLM 在古代中文文献上的表现。为此，我们提出了 AncientDoc，这是第一个针对中国古代文献的基准，旨在从光学字符识别（OCR）到知识推理全面评估 VLM。AncientDoc 包含五项任务（页级 OCR、文言文到白话的翻译、基于推理的问答、基于知识的问答、语言变体问答），覆盖 14 种文献类型、100 多部典籍以及约 3,000 页。基于 AncientDoc，我们使用多种度量对主流 VLM 进行评估，并辅以一个与人类对齐的大型语言模型用于评分。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-09-10 13:02:29 UTC 发布日期：2025-09-10 13:02:29 UTC

#31 MultimodalHugs: Enabling Sign Language Processing in Hugging Face #31 MultimodalHugs：在 Hugging Face 上实现手语处理 [PDF] [复制] [Kimi 1 ] [REL]

Authors: [Gerard Sant](https://arxiv.org/search/?searchtype=author&query=Gerard Sant), [Zifan Jiang](https://arxiv.org/search/?searchtype=author&query=Zifan Jiang), [Carlos Escolano](https://arxiv.org/search/?searchtype=author&query=Carlos Escolano), [Amit Moryossef](https://arxiv.org/search/?searchtype=author&query=Amit Moryossef), [Mathias Müller](https://arxiv.org/search/?searchtype=author&query=Mathias Müller), [Rico Sennrich](https://arxiv.org/search/?searchtype=author&query=Rico Sennrich), [Sarah Ebling](https://arxiv.org/search/?searchtype=author&query=Sarah Ebling) 作者：Gerard Sant, Zifan Jiang, Carlos Escolano, Amit Moryossef, Mathias Müller, Rico Sennrich, Sarah Ebling

In recent years, sign language processing (SLP) has gained importance in the general field of Natural Language Processing. However, compared to research on spoken languages, SLP research is hindered by complex ad-hoc code, inadvertently leading to low reproducibility and unfair comparisons. Existing tools that are built for fast and reproducible experimentation, such as Hugging Face, are not flexible enough to seamlessly integrate sign language experiments. This view is confirmed by a survey we conducted among SLP researchers. To address these challenges, we introduce MultimodalHugs, a framework built on top of Hugging Face that enables more diverse data modalities and tasks, while inheriting the well-known advantages of the Hugging Face ecosystem. Even though sign languages are our primary focus, MultimodalHugs adds a layer of abstraction that makes it more widely applicable to other use cases that do not fit one of the standard templates of Hugging Face. We provide quantitative experiments to illustrate how MultimodalHugs can accommodate diverse modalities such as pose estimation data for sign languages, or pixel data for text characters. 近年来，手语处理（SLP）在自然语言处理领域中的重要性日益增加。然而，与口语研究相比，SLP 的研究受到复杂的临时性代码的制约，导致复现性低且比较不公平。现有旨在实现快速且可复现实验的工具（如 Hugging Face）在无缝整合手语实验方面不够灵活。我们在对 SLP 研究者所做的调查也证实了这一观点。为了解决这些挑战，我们引入了 MultimodalHugs，这是一个构建在 Hugging Face 之上的框架，能够支持更多样的数据模态和任务，同时继承 Hugging Face 生态系统的众所周知的优势。尽管手语是我们的主要关注点，MultimodalHugs 添加了一层抽象，使其更广泛适用于那些不符合 Hugging Face 标准模板的其他用例。我们提供了定量实验，说明 MultimodalHugs 如何容纳多种模态，例如用于手语的姿态估计数据或用于文本字符的像素数据。

Subjects: Computation and Language, Artificial Intelligence, Multimedia 主题：计算与语言，人工智能，多媒体

Publish: 2025-09-10 11:14:54 UTC 发布：2025-09-10 11:14:54 协调世界时

#32 A meta-analysis on the performance of machine-learning based language models for sentiment analysis #32 基于机器学习的语言模型在情感分析中表现的元分析 [PDF] [复制] [Kimi] [REL]

Authors: [Elena Rohde](https://arxiv.org/search/?searchtype=author&query=Elena Rohde), [Jonas Klingwort](https://arxiv.org/search/?searchtype=author&query=Jonas Klingwort), [Christian Borgs](https://arxiv.org/search/?searchtype=author&query=Christian Borgs) 作者：Elena Rohde、Jonas Klingwort、Christian Borgs

This paper presents a meta-analysis evaluating ML performance in sentiment analysis for Twitter data. The study aims to estimate the average performance, assess heterogeneity between and within studies, and analyze how study characteristics influence model performance. Using PRISMA guidelines, we searched academic databases and selected 195 trials from 20 studies with 12 study features. Overall accuracy, the most reported performance metric, was analyzed using double arcsine transformation and a three-level random effects model. The average overall accuracy of the AIC-optimized model was 0.80 [0.76, 0.84]. This paper provides two key insights: 1) Overall accuracy is widely used but often misleading due to its sensitivity to class imbalance and the number of sentiment classes, highlighting the need for normalization. 2) Standardized reporting of model performance, including reporting confusion matrices for independent test sets, is essential for reliable comparisons of ML classifiers across studies, which seems far from common practice. 本文呈现了一项元分析，评估机器学习在 Twitter 情感分析中的表现。该研究旨在估计平均性能、评估研究之间与研究内部的异质性，并分析研究特征如何影响模型性能。遵循 PRISMA 指南，我们检索了学术数据库并从 20 项研究中选择了 195 次试验，记录了 12 项研究特征。对最常报告的性能指标——总体准确率，采用了双反正弦变换和三级随机效应模型进行分析。AIC 优化模型的平均总体准确率为 0.80 [0.76, 0.84]。本文提供了两个关键见解：1）总体准确率被广泛使用，但由于其对类别不平衡和情感类别数量的敏感性，常常具有误导性，因此需要归一化处理。2）模型性能的标准化报告——包括对独立测试集报告混淆矩阵——对于在各研究间可靠比较机器学习分类器至关重要，而这一做法显然还远未普及。

Subjects: Computation and Language, Machine Learning, Applications 主题：计算与语言、机器学习、应用

Publish: 2025-09-10 10:05:32 UTC 发布时间：2025-09-10 10:05:32 UTC

#33 A Role-Aware Multi-Agent Framework for Financial Education Question Answering with LLMs #33 一个面向角色感知的多智能体框架，用于使用 LLMs 的金融教育问答 [PDF] [复制] [Kimi] [REL]

Authors: [Andy Zhu](https://arxiv.org/search/?searchtype=author&query=Andy Zhu), [Yingjun Du](https://arxiv.org/search/?searchtype=author&query=Yingjun Du) 作者：Andy Zhu，Yingjun Du

Question answering (QA) plays a central role in financial education, yet existing large language model (LLM) approaches often fail to capture the nuanced and specialized reasoning required for financial problem-solving. The financial domain demands multistep quantitative reasoning, familiarity with domain-specific terminology, and comprehension of real-world scenarios. We present a multi-agent framework that leverages role-based prompting to enhance performance on domain-specific QA. Our framework comprises a Base Generator, an Evidence Retriever, and an Expert Reviewer agent that work in a single-pass iteration to produce a refined answer. We evaluated our framework on a set of 3,532 expert-designed finance education questions from Study.com, an online learning platform. We leverage retrieval-augmented generation (RAG) for contextual evidence from 6 finance textbooks and prompting strategies for a domain-expert reviewer. Our experiments indicate that critique-based refinement improves answer accuracy by 6.6-8.3% over zero-shot Chain-of-Thought baselines, with the highest performance from Gemini-2.0-Flash. Furthermore, our method enables GPT-4o-mini to achieve performance comparable to the finance-tuned FinGPT-mt_Llama3-8B_LoRA. Our results show a cost-effective approach to enhancing financial QA and offer insights for further research in multi-agent financial LLM systems. 问答（QA）在金融教育中起着核心作用，然而现有的大型语言模型（LLM）方法常常无法捕捉金融问题解决所需的细致且专业的推理。金融领域要求多步定量推理、熟悉领域特定术语并理解真实世界的情境。我们提出了一个多智能体框架，利用基于角色的提示来提升领域特定问答的性能。我们的框架由一个基础生成器、一个证据检索器和一个专家审阅者代理组成，它们在一次性迭代中协同工作以生成精炼的答案。我们在来自在线学习平台 Study.com 的 3,532 道专家设计的金融教育问题上对框架进行了评估。我们利用检索增强生成（RAG）从 6 本金融教科书中获取上下文证据，并针对领域专家审阅者采用了提示策略。我们的实验表明，相较于零样本链式思维（Chain-of-Thought）基线，基于批评的精炼方法能将答案准确率提升 6.6%到 8.3%，其中 Gemini-2.0-Flash 的表现最优。此外，我们的方法使得 GPT-4o-mini 的表现可与针对金融微调的 FinGPT-mt_Llama3-8B_LoRA 相媲美。我们的结果表明，这是一种具有成本效益的增强金融问答的方法，并为多智能体金融 LLM 系统的进一步研究提供了见解。

Subjects: Computation and Language, Computational Engineering, Finance, and Science 主题：计算与语言、计算工程、金融与科学

Publish: 2025-09-10 09:40:18 UTC 发布：2025-09-10 09:40:18 UTC

#34 Natural Language Translation of Formal Proofs through Informalization of Proof Steps and Recursive Summarization along Proof Structure #34 通过将证明步骤非形式化并沿证明结构进行递归摘要实现形式证明的自然语言翻译

Authors: [Seiji Hattori](https://arxiv.org/search/?searchtype=author&query=Seiji Hattori), [Takuya Matsuzaki](https://arxiv.org/search/?searchtype=author&query=Takuya Matsuzaki), [Makoto Fujiwara](https://arxiv.org/search/?searchtype=author&query=Makoto Fujiwara) 作者：Seiji Hattori、Takuya Matsuzaki、Makoto Fujiwara

This paper proposes a natural language translation method for machine-verifiable formal proofs that leverages the informalization (verbalization of formal language proof steps) and summarization capabilities of LLMs. For evaluation, it was applied to formal proof data created in accordance with natural language proofs taken from an undergraduate-level textbook, and the quality of the generated natural language proofs was analyzed in comparison with the original natural language proofs. Furthermore, we will demonstrate that this method can output highly readable and accurate natural language proofs by applying it to existing formal proof library of the Lean proof assistant. 本文提出了一种针对可机器验证的形式证明的自然语言翻译方法，该方法利用了 LLMs 的非正式化（将形式语言的证明步骤口头化）和摘要能力。为评估该方法，本文将其应用于根据本科教材中的自然语言证明创建的形式证明数据，并将生成的自然语言证明的质量与原始自然语言证明进行了比较分析。此外，我们还将展示通过将该方法应用于现有的 Lean 证明助理的形式证明库，可以输出高度可读且准确的自然语言证明。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-09-10 09:22:12 UTC 发布：2025-09-10 09:22:12 UTC

#35 BIBERT-Pipe on Biomedical Nested Named Entity Linking at BioASQ 2025 #35 BIBERT-Pipe 在 BioASQ 2025 的生物医学嵌套命名实体链接任务 [PDF] [复制] [Kimi] [REL]

Authors: [Chunyu Li](https://arxiv.org/search/?searchtype=author&query=Chunyu Li), [Xindi Zheng](https://arxiv.org/search/?searchtype=author&query=Xindi Zheng), [Siqi Liu](https://arxiv.org/search/?searchtype=author&query=Siqi Liu) 作者：李春雨、郑信迪、刘思齐

Entity linking (EL) for biomedical text is typically benchmarked on English-only corpora with flat mentions, leaving the more realistic scenario of nested and multilingual mentions largely unexplored. We present our system for the BioNNE 2025 Multilingual Biomedical Nested Named Entity Linking shared task (English & Russian), closing this gap with a lightweight pipeline that keeps the original EL model intact and modifies only three task-aligned components: Two-stage retrieval-ranking. We leverage the same base encoder model in both stages: the retrieval stage uses the original pre-trained model, while the ranking stage applies domain-specific fine-tuning. Boundary cues. In the ranking stage, we wrap each mention with learnable [Ms] / [Me] tags, providing the encoder with an explicit, language-agnostic span before robustness to overlap and nesting. Dataset augmentation. We also automatically expand the ranking training corpus with three complementary data sources, enhancing coverage without extra manual annotation. On the BioNNE 2025 leaderboard, our two stage system, bilingual bert (BIBERT-Pipe), ranks third in the multilingual track, demonstrating the effectiveness and competitiveness of these minimal yet principled modifications. Code are publicly available at https://github.com/Kaggle-Competitions-Code/BioNNE-L. 生物医学文本的实体链接（EL）通常在仅限英语且为平铺提及的语料上进行基准测试，而更现实的嵌套和多语言提及场景大多未被探索。我们提出了用于 BioNNE 2025 多语言生物医学嵌套命名实体链接共享任务（英语与俄语）的系统，通过一个轻量级管道弥补这一空白，该管道保持原始实体链接模型不变，仅修改三个与任务对齐的组件：两阶段检索-排序。我们在两个阶段均使用相同的基础编码器模型：检索阶段使用原始的预训练模型，而排序阶段进行领域特定的微调。边界提示。在排序阶段，我们用可学习的 [Ms] / [Me] 标签包裹每个提及，为编码器提供明确的、与语言无关的跨度表示，从而提高对重叠和嵌套的鲁棒性。数据集扩增。我们还通过三种互补的数据来源自动扩展排序训练语料，在不增加人工标注的情况下提升覆盖率。在 BioNNE 2025 排行榜上，我们的两阶段系统 bilingual bert (BIBERT-Pipe) 在多语言赛道中排名第三，展示了这些简洁但有原则的改进的有效性和竞争力。代码已在 https://github.com/Kaggle-Competitions-Code/BioNNE-L 上公开。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-09-10 09:14:25 UTC 发布：2025-09-10 09:14:25 UTC

#36 DiTTO-LLM: Framework for Discovering Topic-based Technology Opportunities via Large Language Model #36 DiTTO-LLM：通过大语言模型发现基于主题的技术机会的框架 [PDF] [复制] [Kimi] [REL]

Authors: [Wonyoung Kim](https://arxiv.org/search/?searchtype=author&query=Wonyoung Kim), [Sujeong Seo](https://arxiv.org/search/?searchtype=author&query=Sujeong Seo), [Juhyun Lee](https://arxiv.org/search/?searchtype=author&query=Juhyun Lee) 作者：Wonyoung Kim、Sujeong Seo、Juhyun Lee

Technology opportunities are critical information that serve as a foundation for advancements in technology, industry, and innovation. This paper proposes a framework based on the temporal relationships between technologies to identify emerging technology opportunities. The proposed framework begins by extracting text from a patent dataset, followed by mapping text-based topics to discover inter-technology relationships. Technology opportunities are then identified by tracking changes in these topics over time. To enhance efficiency, the framework leverages a large language model to extract topics and employs a prompt for a chat-based language model to support the discovery of technology opportunities. The framework was evaluated using an artificial intelligence patent dataset provided by the United States Patent and Trademark Office. The experimental results suggest that artificial intelligence technology is evolving into forms that facilitate everyday accessibility. This approach demonstrates the potential of the proposed framework to identify future technology opportunities. 技术机会是为技术、产业和创新进步提供基础的关键信息。本文提出了一个基于技术间时间关系来识别新兴技术机会的框架。该框架首先从专利数据集中提取文本，然后将基于文本的主题映射以发现技术间的关系。通过追踪这些主题随时间的变化来识别技术机会。为提高效率，框架利用大型语言模型来提取主题，并使用基于对话的语言模型提示来支持技术机会的发现。该框架使用美国专利商标局提供的一个人工智能专利数据集进行了评估。实验结果表明，人工智能技术正朝着便于日常可及性的方向发展。该方法展示了所提框架识别未来技术机会的潜力。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-09-10 05:47:25 UTC 发布：2025-09-10 05:47:25 UTC

#37 ALIGNS: Unlocking nomological networks in psychological measurement through a large language model #37 ALIGNS：通过大型语言模型解锁心理测量中的法则网络 [PDF] [复制] [Kimi] [REL]

Authors: [Kai R. Larsen](https://arxiv.org/search/?searchtype=author&query=Kai R. Larsen), [Sen Yan](https://arxiv.org/search/?searchtype=author&query=Sen Yan), [Roland Müller](https://arxiv.org/search/?searchtype=author&query=Roland Müller), [Lan Sang](https://arxiv.org/search/?searchtype=author&query=Lan Sang), [Mikko Rönkkö](https://arxiv.org/search/?searchtype=author&query=Mikko Rönkkö), [Ravi Starzl](https://arxiv.org/search/?searchtype=author&query=Ravi Starzl), [Donald Edmondson](https://arxiv.org/search/?searchtype=author&query=Donald Edmondson) 作者：Kai R. Larsen、Sen Yan、Roland Müller、Lan Sang、Mikko Rönkkö、Ravi Starzl、Donald Edmondson

Psychological measurement is critical to many disciplines. Despite advances in measurement, building nomological networks, theoretical maps of how concepts and measures relate to establish validity, remains a challenge 70 years after Cronbach and Meehl proposed them as fundamental to validation. This limitation has practical consequences: clinical trials may fail to detect treatment effects, and public policy may target the wrong outcomes. We introduce Analysis of Latent Indicators to Generate Nomological Structures (ALIGNS), a large language model-based system trained with validated questionnaire measures. ALIGNS provides three comprehensive nomological networks containing over 550,000 indicators across psychology, medicine, social policy, and other fields. This represents the first application of large language models to solve a foundational problem in measurement validation. We report classification accuracy tests used to develop the model, as well as three evaluations. In the first evaluation, the widely used NIH PROMIS anxiety and depression instruments are shown to converge into a single dimension of emotional distress. The second evaluation examines child temperament measures and identifies four potential dimensions not captured by current frameworks, and questions one existing dimension. The third evaluation, an applicability check, engages expert psychometricians who assess the system’s importance, accessibility, and suitability. ALIGNS is freely available at nomologicalnetwork.org, complementing traditional validation methods with large-scale nomological analysis. 心理测量对许多学科至关重要。尽管测量学有所进展，但构建法则网络——即将概念与测量之间关系绘制成理论图以确立效度——仍然是一个挑战，自从 Cronbach 和 Meehl 在 70 年前提出它们为验证的基础以来这一点一直存在。这一局限具有实际后果：临床试验可能无法检测到治疗效果，公共政策可能针对错误的结果。我们介绍了“用于生成法则结构的潜在指标分析”（ALIGNS），这是一种基于大型语言模型并以经验证的问卷量表为训练的数据驱动系统。ALIGNS 提供了三套全面的法则网络，包含跨心理学、医学、社会政策及其他领域的超过 55 万个指标。这是大型语言模型首次被用于解决测量验证中的一个基础性问题。我们报告了用于开发该模型的分类准确性测试，以及三项评估。在第一项评估中，广泛使用的 NIH PROMIS 焦虑和抑郁量表被证明在情绪困扰这一单一维度上趋同。第二项评估检查儿童气质测量，识别出四个当前框架未涵盖的潜在维度，并质疑了一个现有维度。第三项评估为适用性检查，邀请心理测量学专家评估该系统的重要性、可及性和适合性。ALIGNS 可在 nomologicalnetwork.org 免费获取，借助大规模法则网络分析补充传统验证方法。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning, Methodology 主题：计算与语言、人工智能、机器学习、方法论

Publish: 2025-09-10 04:21:02 UTC 发布日期：2025-09-10 04:21:02 UTC

#38 Investigating Symbolic Triggers of Hallucination in Gemma Models Across HaluEval and TruthfulQA #38 在 HaluEval 和 TruthfulQA 上调查 Gemma 模型幻觉的符号触发因素 [PDF] [Copy] [Kimi 1 ] [REL]

Authors: [Naveen Lamba](https://arxiv.org/search/?searchtype=author&query=Naveen Lamba), [Sanju Tiwari](https://arxiv.org/search/?searchtype=author&query=Sanju Tiwari), [Manas Gaur](https://arxiv.org/search/?searchtype=author&query=Manas Gaur) 作者：Naveen Lamba, Sanju Tiwari, Manas Gaur

Hallucination in Large Language Models (LLMs) is a well studied problem. However, the properties that make LLM intrinsically vulnerable to hallucinations have not been identified and studied. This research identifies and characterizes the key properties, allowing us to pinpoint vulnerabilities within the model’s internal mechanisms. To solidify on these properties, we utilized two established datasets, HaluEval and TruthfulQA and convert their existing format of question answering into various other formats to narrow down these properties as the reason for the hallucinations. Our findings reveal that hallucination percentages across symbolic properties are notably high for Gemma-2-2B, averaging 79.0% across tasks and datasets. With increased model scale, hallucination drops to 73.6% for Gemma-2-9B and 63.9% for Gemma-2-27B, reflecting a 15 percentage point reduction overall. Although the hallucination rate decreases as the model size increases, a substantial amount of hallucination caused by symbolic properties still persists. This is especially evident for modifiers (ranging from 84.76% to 94.98%) and named entities (ranging from 83.87% to 93.96%) across all Gemma models and both datasets. These findings indicate that symbolic elements continue to confuse the models, pointing to a fundamental weakness in how these LLMs process such inputs–regardless of their scale. 大型语言模型（LLMs）中的幻觉问题是一个研究已久的问题。然而，使 LLM 本质上易于产生幻觉的属性尚未被识别和研究。本研究识别并表征了关键属性，使我们能够定位模型内部机制中的脆弱点。为巩固这些属性，我们使用了两个已建立的数据集——HaluEval 和 TruthfulQA，并将它们现有的问题回答格式转换为各种其他格式，以将这些属性缩小为幻觉产生的原因。我们的发现显示，对于 Gemma-2-2B，跨符号属性的幻觉百分比显著偏高，在各任务和数据集上的平均值为 79.0%。随着模型规模的增大，Gemma-2-9B 的幻觉率降至 73.6%，Gemma-2-27B 降至 63.9%，总体上减少了 15 个百分点。尽管随着模型规模的增加幻觉率有所下降，但由符号属性引起的大量幻觉仍然存在。这在修饰语（在所有 Gemma 模型和两个数据集中范围为 84.76% 到 94.98%）和命名实体（范围为 83.87% 到 93.96%）方面尤为明显。这些发现表明，符号元素仍然会使模型感到困惑，指出这些 LLMs 在处理此类输入时存在根本性弱点——无论其规模如何。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-09-09 05:50:08 UTC 发布：2025-09-09 05:50:08 UTC

#39 How Small Transformation Expose the Weakness of Semantic Similarity Measures

Authors: [Serge Lionel Nikiema](https://arxiv.org/search/?searchtype=author&query=Serge Lionel Nikiema), [Albérick Euraste Djire](https://arxiv.org/search/?searchtype=author&query=Albérick Euraste Djire), [Abdoul Aziz Bonkoungou](https://arxiv.org/search/?searchtype=author&query=Abdoul Aziz Bonkoungou), [Micheline Bénédicte Moumoula](https://arxiv.org/search/?searchtype=author&query=Micheline Bénédicte Moumoula), [Jordan Samhi](https://arxiv.org/search/?searchtype=author&query=Jordan Samhi), [Abdoul Kader Kabore](https://arxiv.org/search/?searchtype=author&query=Abdoul Kader Kabore), [Jacques Klein](https://arxiv.org/search/?searchtype=author&query=Jacques Klein), [Tegawendé F. Bissyande](https://arxiv.org/search/?searchtype=author&query=Tegawendé F. Bissyande) 作者：Serge Lionel Nikiema、Albérick Euraste Djire、Abdoul Aziz Bonkoungou、Micheline Bénédicte Moumoula、Jordan Samhi、Abdoul Kader Kabore、Jacques Klein、Tegawendé F. Bissyande

This research examines how well different methods measure semantic similarity, which is important for various software engineering applications such as code search, API recommendations, automated code reviews, and refactoring tools. While large language models are increasingly used for these similarity assessments, questions remain about whether they truly understand semantic relationships or merely recognize surface patterns. The study tested 18 different similarity measurement approaches, including word-based methods, embedding techniques, LLM-based systems, and structure-aware algorithms. The researchers created a systematic testing framework that applies controlled changes to text and code to evaluate how well each method handles different types of semantic relationships. The results revealed significant issues with commonly used metrics. Some embedding-based methods incorrectly identified semantic opposites as similar up to 99.9 percent of the time, while certain transformer-based approaches occasionally rated opposite meanings as more similar than synonymous ones. The study found that embedding methods’ poor performance often stemmed from how they calculate distances; switching from Euclidean distance to cosine similarity improved results by 24 to 66 percent. LLM-based approaches performed better at distinguishing semantic differences, producing low similarity scores (0.00 to 0.29) for genuinely different meanings, compared to embedding methods that incorrectly assigned high scores (0.82 to 0.99) to dissimilar content. 这项研究考察了不同方法在衡量语义相似性方面的表现，这对代码搜索、API 推荐、自动化代码审查和重构工具等各种软件工程应用非常重要。尽管大型语言模型（LLM）越来越多地用于这些相似性评估，但仍存在疑问：它们是否真正理解语义关系，还是仅仅识别表面模式。研究测试了 18 种不同的相似性测量方法，包括基于词的办法、嵌入技术、基于 LLM 的系统和结构感知算法。研究人员构建了一个系统化的测试框架，对文本和代码施加受控变化，以评估每种方法如何处理不同类型的语义关系。结果揭示了常用度量的一些重大问题：一些基于嵌入的方法将语义相反的项错误地判定为相似，错误率高达 99.9%，而某些基于 Transformer 的方法有时会将相反含义评为比同义含义更相似。研究发现，嵌入方法表现不佳常常源于它们计算距离的方式；将欧几里得距离改为余弦相似度后，结果提高了 24%到 66%。基于 LLM 的方法在区分语义差异方面表现更好，对于真正含义不同的内容会给出较低的相似度分数（0.00 到 0.29），而嵌入方法却错误地对不相似的内容赋予较高分数（0.82 到 0.99）。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-09-08 11:00:18 UTC 发布：2025-09-08 11:00:18 协调世界时（UTC）

#40 HANRAG: Heuristic Accurate Noise-resistant Retrieval-Augmented Generation for Multi-hop Question Answering

The Retrieval-Augmented Generation (RAG) approach enhances question-answering systems and dialogue generation tasks by integrating information retrieval (IR) technologies with large language models (LLMs). This strategy, which retrieves information from external knowledge bases to bolster the response capabilities of generative models, has achieved certain successes. However, current RAG methods still face numerous challenges when dealing with multi-hop queries. For instance, some approaches overly rely on iterative retrieval, wasting too many retrieval steps on compound queries. Additionally, using the original complex query for retrieval may fail to capture content relevant to specific sub-queries, resulting in noisy retrieved content. If the noise is not managed, it can lead to the problem of noise accumulation. To address these issues, we introduce HANRAG, a novel heuristic-based framework designed to efficiently tackle problems of varying complexity. Driven by a powerful revelator, HANRAG routes queries, decomposes them into sub-queries, and filters noise from retrieved documents. This enhances the system’s adaptability and noise resistance, making it highly capable of handling diverse queries. We compare the proposed framework against other leading industry methods across various benchmarks. The results demonstrate that our framework obtains superior performance in both single-hop and multi-hop question-answering tasks. 检索增强生成（RAG）方法通过将信息检索（IR）技术与大型语言模型（LLMs）结合起来，提升了问答系统和对话生成任务的能力。这一策略通过从外部知识库检索信息来增强生成模型的响应能力，已取得一定成效。然而，当前的 RAG 方法在处理多跳查询时仍面临诸多挑战。例如，一些方法过度依赖迭代检索，在复合查询上浪费了过多的检索步骤。此外，使用原始复杂查询进行检索可能无法捕捉与特定子查询相关的内容，导致检索到的内容夹杂噪音。如果噪音得不到管理，会导致噪音累积问题。为了解决这些问题，我们提出了 HANRAG，一种基于启发式的新框架，旨在高效处理不同复杂度的问题。在一个强大 revelator 的驱动下，HANRAG 对查询进行路由、将其分解为子查询，并从检索到的文档中过滤噪音。这增强了系统的适应性和抗噪能力，使其能够高效处理各种查询。我们将所提出的框架与业界其他领先方法在多个基准上进行比较。结果表明，我们的框架在单跳和多跳问答任务中均取得了更优的性能。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-09-08 06:22:38 UTC 发布：2025-09-08 06:22:38 协调世界时

#41 The Thinking Therapist: Training Large Language Models to Deliver Acceptance and Commitment Therapy using Supervised Fine-Tuning and Odds Ratio Policy Optimization #41 思想治疗师：使用监督微调和比值率策略优化训练大型语言模型以提供接纳与承诺疗法

Author: [Talha Tahir](https://arxiv.org/search/?searchtype=author&query=Talha Tahir)作者：Talha Tahir

Acceptance and Commitment Therapy (ACT) is a third-wave cognitive behavioral therapy with emerging evidence of efficacy in several psychiatric conditions. This study investigates the impact of post-training methodology and explicit reasoning on the ability of a small open-weight large language model (LLM) to deliver ACT. Using 50 sets of synthetic ACT transcripts generated by Mistral-Large, we trained Llama-3.2-3b-Instruct with two distinct approaches, supervised fine-tuning (SFT) and odds ratio policy optimization (ORPO), each with and without an explicit chain-of-thought (COT) reasoning step. Performance was evaluated by comparing these four post-trained variants against the base Instruct model. These models were benchmarked in simulated therapy sessions, with performance quantitatively assessed on the ACT Fidelity Measure (ACT-FM) and the Therapist Empathy Scale (TES) by an LLM judge that had been fine-tuned on human evaluations. Our findings demonstrate that the ORPO-trained models significantly outperformed both their SFT and Instruct counterparts on ACT fidelity (χ2(5)=185.15,p<.001) and therapeutic empathy (χ2(5)=140.37,p<.001). The effect of COT was conditional as it provided a significant benefit to SFT models, improving ACT-FM scores by an average of 2.68 points (p<.001), while offering no discernible advantage to the superior ORPO or instruct-tuned variants. We posit that the superiority of ORPO stems from its ability to learn the therapeutic process' over imitating content,’ a key aspect of ACT, while COT acts as a necessary scaffold for models trained only via imitation. This study establishes that preference-aligned policy optimization can effectively instill ACT competencies in small LLMs, and that the utility of explicit reasoning is highly dependent on the underlying training paradigm. 接纳与承诺疗法（ACT）是一种第三波认知行为疗法，已有证据显示在若干精神疾病中具有疗效。本研究考察了训练后方法和显式推理对小型开权重大型语言模型（LLM）提供 ACT 能力的影响。使用由 Mistral-Large 生成的 50 组合成 ACT 对话记录，我们以两种不同方法对 Llama-3.2-3b-Instruct 进行了训练，分别是监督微调（SFT）和比值比策略优化（ORPO），每种方法都分别有无显式思路链（COT）推理步骤。通过将这四种训练后变体与基础 Instruct 模型进行比较来评估性能。这些模型在模拟治疗会话中进行了基准测试，性能由经过人工评估微调的 LLM 评判员通过 ACT 保真度量表（ACT-FM）和治疗师同理心量表（TES）进行量化评估。我们的研究结果表明，ORPO 训练的模型在 ACT 保真度（ χ2(5)=185.15,p<.001 ）和治疗同理心（ χ2(5)=140.37,p<.001 ）上显著优于其 SFT 和 Instruct 对应模型。 COT 的效果具有条件性：它对 SFT 模型有显著益处，使 ACT-FM 分数平均提高 2.68 分（ p<.001 ），但对更优秀的 ORPO 或指令微调变体并未带来可辨别的优势。我们假设 ORPO 的优越性源于其能够学习治疗性的“过程”而非模仿“内容”，这是 ACT 的关键方面，而 COT 则作为仅通过模仿训练的模型所需的支撑。该研究确立了偏好对齐的策略优化能够有效在小型 LLMs 中灌输 ACT 能力，并且显式推理的效用高度依赖于底层的训练范式。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-09-08 02:30:12 UTC 发布：2025-09-08 02:30:12 UTC

#42 Psychiatry-Bench: A Multi-Task Benchmark for LLMs in Psychiatry

Authors: [Aya E. Fouda](https://arxiv.org/search/?searchtype=author&query=Aya E. Fouda), [Abdelrahamn A. Hassan](https://arxiv.org/search/?searchtype=author&query=Abdelrahamn A. Hassan), [Radwa J. Hanafy](https://arxiv.org/search/?searchtype=author&query=Radwa J. Hanafy), [Mohammed E. Fouda](https://arxiv.org/search/?searchtype=author&query=Mohammed E. Fouda) 作者：Aya E. Fouda、Abdelrahamn A. Hassan、Radwa J. Hanafy、Mohammed E. Fouda

Large language models (LLMs) hold great promise in enhancing psychiatric practice, from improving diagnostic accuracy to streamlining clinical documentation and therapeutic support. However, existing evaluation resources heavily rely on small clinical interview corpora, social media posts, or synthetic dialogues, which limits their clinical validity and fails to capture the full complexity of psychiatric reasoning. In this work, we introduce PsychiatryBench, a rigorously curated benchmark grounded exclusively in authoritative, expert-validated psychiatric textbooks and casebooks. PsychiatryBench comprises eleven distinct question-answering tasks ranging from diagnostic reasoning and treatment planning to longitudinal follow-up, management planning, clinical approach, sequential case analysis, and multiple-choice/extended matching formats totaling over 5,300 expert-annotated items. We evaluate a diverse set of frontier LLMs (including Google Gemini, DeepSeek, LLaMA 3, and QWQ-32) alongside leading open-source medical models (e.g., OpenBiloLLM, MedGemma) using both conventional metrics and an “LLM-as-judge” similarity scoring framework. Our results reveal substantial gaps in clinical consistency and safety, particularly in multi-turn follow-up and management tasks, underscoring the need for specialized model tuning and more robust evaluation paradigms. PsychiatryBench offers a modular, extensible platform for benchmarking and improving LLM performance in high-stakes mental health applications. 大型语言模型（LLMs）在提升精神科实践方面具有巨大潜力，从提高诊断准确性到简化临床记录和提供治疗支持。然而，现有的评估资源在很大程度上依赖于小规模的临床面谈语料库、社交媒体帖子或合成对话，这限制了其临床有效性，无法捕捉精神科推理的全部复杂性。在本研究中，我们推出了 PsychiatryBench，这是一个严格策划的基准，完全基于权威的、经专家验证的精神科教科书和病例集。PsychiatryBench 包含十一种不同的问答任务，涵盖诊断推理和治疗规划到纵向随访、管理计划、临床方法、序列病例分析以及多项选择/扩展匹配格式，总计超过 5,300 个经专家注释的条目。我们使用传统指标和“LLM 作为评判者”的相似性评分框架，对一系列前沿的 LLMs（包括 Google Gemini、DeepSeek、LLaMA 3 和 QWQ-32）以及领先的开源医学模型（例如 OpenBiloLLM、MedGemma）进行了评估。我们的研究结果显示在临床一致性和安全性方面存在显著差距，尤其是在多轮跟进和管理任务中，这凸显了对专门的模型调优和更健全评估范式的需求。PsychiatryBench 提供了一个模块化、可扩展的平台，用于在高风险的心理健康应用中评估和提升 LLM 的性能。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-09-07 20:57:24 UTC 发布日期：2025-09-07 20:57:24 UTC

#43 Generating Individual Travel Diaries Using Large Language Models Informed by Census and Land-Use Data #43 利用由人口普查和土地利用数据提供信息的大型语言模型生成个人旅行日记 [PDF] [复制] [Kimi] [REL]

Authors: [Sepehr Golrokh Amin](https://arxiv.org/search/?searchtype=author&query=Sepehr Golrokh Amin), [Devin Rhoads](https://arxiv.org/search/?searchtype=author&query=Devin Rhoads), [Fatemeh Fakhrmoosavi](https://arxiv.org/search/?searchtype=author&query=Fatemeh Fakhrmoosavi), [Nicholas E. Lownes](https://arxiv.org/search/?searchtype=author&query=Nicholas E. Lownes), [John N. Ivan](https://arxiv.org/search/?searchtype=author&query=John N. Ivan) 作者：Sepehr Golrokh Amin、Devin Rhoads、Fatemeh Fakhrmoosavi、Nicholas E. Lownes、John N. Ivan

This study introduces a Large Language Model (LLM) scheme for generating individual travel diaries in agent-based transportation models. While traditional approaches rely on large quantities of proprietary household travel surveys, the method presented in this study generates personas stochastically from open-source American Community Survey (ACS) and Smart Location Database (SLD) data, then synthesizes diaries through direct prompting. This study features a novel one-to-cohort realism score: a composite of four metrics (Trip Count Score, Interval Score, Purpose Score, and Mode Score) validated against the Connecticut Statewide Transportation Study (CSTS) diaries, matched across demographic variables. The validation utilizes Jensen-Shannon Divergence to measure distributional similarities between generated and real diaries. When compared to diaries generated with classical methods (Negative Binomial for trip generation; Multinomial Logit for mode/purpose) calibrated on the validation set, LLM-generated diaries achieve comparable overall realism (LLM mean: 0.485 vs. 0.455). The LLM excels in determining trip purpose and demonstrates greater consistency (narrower realism score distribution), while classical models lead in numerical estimates of trip count and activity duration. Aggregate validation confirms the LLM’s statistical representativeness (LLM mean: 0.612 vs. 0.435), demonstrating LLM’s zero-shot viability and establishing a quantifiable metric of diary realism for future synthetic diary evaluation systems. 本研究提出了一种用于在基于代理的交通模型中生成个人出行日记的大型语言模型（LLM）方案。传统方法依赖大量专有的家庭出行调查数据，而本研究提出的方法则从开源的美国社区调查（ACS）和智能位置数据库（SLD）数据中随机生成群像角色，然后通过直接提示合成日记。本研究引入了一种新颖的“一对群体真实度评分”：由四个指标（出行次数评分、间隔评分、出行目的评分和出行方式评分）组成的复合评分，并针对康涅狄格州全州交通研究（CSTS）日记进行了验证，按人口统计变量匹配。验证采用詹森-香农散度（Jensen-Shannon Divergence）衡量生成日记与真实日记之间的分布相似性。与在验证集上校准的经典方法（用于出行生成的负二项分布；用于方式/目的的多项式逻辑回归）生成的日记相比，LLM 生成的日记在总体真实度上具有可比性（LLM 平均值：0.485 vs. 0.455)。LLM 在判断出行目的方面表现优异，并且表现出更高的一致性（现实感评分分布更窄），而传统模型在出行次数和活动时长的数值估计上占优。聚合验证确认了 LLM 的统计代表性（LLM 平均值：0.612 对比 0.435），证明了 LLM 的零样本可行性，并为未来合成日记评估系统建立了可量化的日记现实感度量。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-09-07 17:03:08 UTC 发布日期：2025-09-07 17:03:08 UTC

Authors: [Jing Ren](https://arxiv.org/search/?searchtype=author&query=Jing Ren), [Weiqi Wang](https://arxiv.org/search/?searchtype=author&query=Weiqi Wang) 作者：Jing Ren、Weiqi Wang

Large language models (LLMs) like ChatGPT are increasingly used in academic writing, yet issues such as incorrect or fabricated references raise ethical concerns. Moreover, current content quality evaluations often rely on subjective human judgment, which is labor-intensive and lacks objectivity, potentially compromising the consistency and reliability. In this study, to provide a quantitative evaluation and enhance research proposal writing capabilities of LLMs, we propose two key evaluation metrics–content quality and reference validity–and an iterative prompting method based on the scores derived from these two metrics. Our extensive experiments show that the proposed metrics provide an objective, quantitative framework for assessing ChatGPT’s writing performance. Additionally, iterative prompting significantly enhances content quality while reducing reference inaccuracies and fabrications, addressing critical ethical challenges in academic contexts. 像 ChatGPT 这样的大型语言模型（LLMs）在学术写作中的使用日益增多，但诸如引用错误或捏造等问题引发了伦理担忧。此外，目前的内容质量评估往往依赖主观的人为判断，这既费时又缺乏客观性，可能影响一致性和可靠性。在这项研究中，为了对 LLMs 的研究提案写作能力进行量化评估并加以提升，我们提出了两个关键评估指标——内容质量和引用有效性——以及一种基于这两项指标得分的迭代提示方法。我们的大量实验表明，所提出的指标为评估 ChatGPT 写作表现提供了客观、量化的框架。此外，迭代提示显著提升了内容质量，同时减少了引用不准确和捏造的问题，从而解决了学术情境中的关键伦理挑战。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-09-07 10:24:28 UTC 发布：2025-09-07 10:24:28 UTC

#45 Beyond I’m Sorry, I Can’t: Dissecting Large Language Model Refusal #45 超越“对不起，我不能”：剖析大型语言模型的拒绝行为

Authors: [Nirmalendu Prakash](https://arxiv.org/search/?searchtype=author&query=Nirmalendu Prakash), [Yeo Wei Jie](https://arxiv.org/search/?searchtype=author&query=Yeo Wei Jie), [Amir Abdullah](https://arxiv.org/search/?searchtype=author&query=Amir Abdullah), [Ranjan Satapathy](https://arxiv.org/search/?searchtype=author&query=Ranjan Satapathy), [Erik Cambria](https://arxiv.org/search/?searchtype=author&query=Erik Cambria), [Roy Ka Wei Lee](https://arxiv.org/search/?searchtype=author&query=Roy Ka Wei Lee) 作者：Nirmalendu Prakash、Yeo Wei Jie、Amir Abdullah、Ranjan Satapathy、Erik Cambria、Roy Ka Wei Lee

Refusal on harmful prompts is a key safety behaviour in instruction-tuned large language models (LLMs), yet the internal causes of this behaviour remain poorly understood. We study two public instruction-tuned models, Gemma-2-2B-IT and LLaMA-3.1-8B-IT, using sparse autoencoders (SAEs) trained on residual-stream activations. Given a harmful prompt, we search the SAE latent space for feature sets whose ablation flips the model from refusal to compliance, demonstrating causal influence and creating a jailbreak. Our search proceeds in three stages: (1) Refusal Direction: find a refusal-mediating direction and collect SAE features near that direction; (2) Greedy Filtering: prune to a minimal set; and (3) Interaction Discovery: fit a factorization machine (FM) that captures nonlinear interactions among the remaining active features and the minimal set. This pipeline yields a broad set of jailbreak-critical features, offering insight into the mechanistic basis of refusal. Moreover, we find evidence of redundant features that remain dormant unless earlier features are suppressed. Our findings highlight the potential for fine-grained auditing and targeted intervention in safety behaviours by manipulating the interpretable latent space. 在有害提示下的拒绝行为是经过指令微调的大型语言模型（LLMs）的一项关键安全行为，但该行为的内部原因仍然知之甚少。我们使用在残差流激活上训练的稀疏自编码器（SAE），研究了两个公开的指令微调模型 Gemma-2-2B-IT 和 LLaMA-3.1-8B-IT。针对有害提示，我们在 SAE 的潜在空间中搜索一组特征，通过消融这些特征将模型从拒绝转为顺从，从而证明了因果影响并创建了越狱方法。我们的搜索分三个阶段进行： (1) 拒绝方向：找到一个介导拒绝的方向并收集接近该方向的 SAE 特征；(2) 贪心筛选：剪枝至最小集合；(3) 交互发现：拟合一个因子分解机（FM），以捕捉剩余活跃特征与最小集合之间的非线性交互。该流水线产生了一组广泛的越狱关键特征，为拒绝的机械基础提供了洞见。此外，我们发现存在冗余特征，除非早期特征被抑制，否则这些冗余特征会保持休眠状态。我们的研究结果突显了通过操控可解释的潜在空间，对安全行为进行细粒度审计和有针对性干预的潜力。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-09-07 02:29:07 UTC 发布：2025-09-07 02:29:07 UTC

#46 The Non-Determinism of Small LLMs: Evidence of Low Answer Consistency in Repetition Trials of Standard Multiple-Choice Benchmarks #46 小型 LLMs 的非确定性：在标准多项选择基准的重复试验中低答案一致性的证据

Authors: [Claudio Pinhanez](https://arxiv.org/search/?searchtype=author&query=Claudio Pinhanez), [Paulo Cavalin](https://arxiv.org/search/?searchtype=author&query=Paulo Cavalin), [Cassia Sanctos](https://arxiv.org/search/?searchtype=author&query=Cassia Sanctos), [Marcelo Grave](https://arxiv.org/search/?searchtype=author&query=Marcelo Grave), [Yago Primerano](https://arxiv.org/search/?searchtype=author&query=Yago Primerano) 作者：Claudio Pinhanez、Paulo Cavalin、Cassia Sanctos、Marcelo Grave、Yago Primerano

This work explores the consistency of small LLMs (2B-8B parameters) in answering multiple times the same question. We present a study on known, open-source LLMs responding to 10 repetitions of questions from the multiple-choice benchmarks MMLU-Redux and MedQA, considering different inference temperatures, small vs. medium models (50B-80B), finetuned vs. base models, and other parameters. We also look into the effects of requiring multi-trial answer consistency on accuracy and the trade-offs involved in deciding which model best provides both of them. To support those studies, we propose some new analytical and graphical tools. Results show that the number of questions which can be answered consistently vary considerably among models but are typically in the 50%-80% range for small models at low inference temperatures. Also, accuracy among consistent answers seems to reasonably correlate with overall accuracy. Results for medium-sized models seem to indicate much higher levels of answer consistency. 本文探讨了小型 LLMs（20 亿至 80 亿参数）在对同一问题多次回答时的一致性。我们对已知的开源 LLMs 进行了研究，分析它们在多项选择基准测试 MMLU-Redux 和 MedQA 上对每个问题重复回答 10 次的表现，考察了不同的推理温度、小型与中型模型（500 亿至 800 亿参数）、微调模型与基线模型以及其他参数的影响。我们还研究了要求多次试验回答一致性对准确率的影响，以及在决定哪个模型能同时最好地提供两者时所涉及的权衡。为支持这些研究，我们提出了一些新的分析和图形工具。结果显示，不同模型能一致回答的问题数量差异显著，但对于低推理温度下的小型模型，通常在 50%至 80%范围内。此外，一致回答中的准确率似乎与整体准确率有合理的相关性。中型模型的结果则表明答案一致性水平明显更高。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-09-05 17:31:14 UTC 发布：2025-09-05 17:31:14 协调世界时

#47 Temporal Preferences in Language Models for Long-Horizon Assistance #47 语言模型在长期辅助中的时间偏好 [PDF 1 ] [复制] [Kimi] [REL]

Authors: [Ali Mazyaki](https://arxiv.org/search/?searchtype=author&query=Ali Mazyaki), [Mohammad Naghizadeh](https://arxiv.org/search/?searchtype=author&query=Mohammad Naghizadeh), [Samaneh Ranjkhah Zonouzaghi](https://arxiv.org/search/?searchtype=author&query=Samaneh Ranjkhah Zonouzaghi), [Hossein Setareh](https://arxiv.org/search/?searchtype=author&query=Hossein Setareh) 作者：Ali Mazyaki、Mohammad Naghizadeh、Samaneh Ranjkhah Zonouzaghi、Hossein Setareh

We study whether language models (LMs) exhibit future- versus present-oriented preferences in intertemporal choice and whether those preferences can be systematically manipulated. Using adapted human experimental protocols, we evaluate multiple LMs on time-tradeoff tasks and benchmark them against a sample of human decision makers. We introduce an operational metric, the Manipulability of Time Orientation (MTO), defined as the change in an LM’s revealed time preference between future- and present-oriented prompts. In our tests, reasoning-focused models (e.g., DeepSeek-Reasoner and grok-3-mini) choose later options under future-oriented prompts but only partially personalize decisions across identities or geographies. Moreover, models that correctly reason about time orientation internalize a future orientation for themselves as AI decision makers. We discuss design implications for AI assistants that should align with heterogeneous, long-horizon goals and outline a research agenda on personalized contextual calibration and socially aware deployment. 我们研究语言模型（LM）在跨期选择中是否表现出面向未来或面向当前的偏好，以及这些偏好是否可以被系统性地操控。采用改编自人类实验的协议，我们在时间权衡任务上评估多个语言模型，并将它们与一组人类决策者的样本进行基准比较。我们引入了一个操作性度量——时间取向可操控性（MTO），定义为语言模型在面向未来与面向当前提示之间所揭示的时间偏好的变化。在我们的测试中，侧重推理的模型（例如 DeepSeek-Reasoner 和 grok-3-mini）在面向未来的提示下倾向选择较后的选项，但仅在身份或地理位置上部分实现个性化决策。此外，那些能够正确推理时间取向的模型，会将面向未来的取向内化为它们自身作为 AI 决策者的倾向。我们讨论了面向异质的、长期目标应当对齐的 AI 助手的设计启示，并勾勒了关于个性化情境校准与具有社会意识的部署的研究议程。

Subjects: Computation and Language, Artificial Intelligence, Computers and Society

Publish: 2025-09-05 16:21:23 UTC 发布：2025-09-05 16:21:23 世界协调时

#48 CTCC: A Robust and Stealthy Fingerprinting Framework for Large Language Models via Cross-Turn Contextual Correlation Backdoor #48 CTCC：一种通过跨对话上下文相关性后门对大型语言模型进行鲁棒且隐蔽的指纹框架 [PDF] [复制] [Kimi] [REL]

Authors: [Zhenhua Xu](https://arxiv.org/search/?searchtype=author&query=Zhenhua Xu), [Xixiang Zhao](https://arxiv.org/search/?searchtype=author&query=Xixiang Zhao), [Xubin Yue](https://arxiv.org/search/?searchtype=author&query=Xubin Yue), [Shengwei Tian](https://arxiv.org/search/?searchtype=author&query=Shengwei Tian), [Changting Lin](https://arxiv.org/search/?searchtype=author&query=Changting Lin), [Meng Han](https://arxiv.org/search/?searchtype=author&query=Meng Han) 作者：徐振华、赵希向、岳许斌、田胜伟、林长亭、韩蒙

The widespread deployment of large language models (LLMs) has intensified concerns around intellectual property (IP) protection, as model theft and unauthorized redistribution become increasingly feasible. To address this, model fingerprinting aims to embed verifiable ownership traces into LLMs. However, existing methods face inherent trade-offs between stealthness, robustness, and generalizability, being either detectable via distributional shifts, vulnerable to adversarial modifications, or easily invalidated once the fingerprint is revealed. In this work, we introduce CTCC, a novel rule-driven fingerprinting framework that encodes contextual correlations across multiple dialogue turns, such as counterfactual, rather than relying on token-level or single-turn triggers. CTCC enables fingerprint verification under black-box access while mitigating false positives and fingerprint leakage, supporting continuous construction under a shared semantic rule even if partial triggers are exposed. Extensive experiments across multiple LLM architectures demonstrate that CTCC consistently achieves stronger stealth and robustness than prior work. Our findings position CTCC as a reliable and practical solution for ownership verification in real-world LLM deployment scenarios. Our code and data are publicly available at https://github.com/Xuzhenhua55/CTCC. 大规模语言模型（LLMs）的广泛部署加剧了对知识产权（IP）保护的担忧，因为模型窃取和未经授权的再分发变得愈发可行。为此，模型指纹旨在将可验证的所有权痕迹嵌入到 LLMs 中。然而，现有方法在隐蔽性、鲁棒性和泛化性之间存在固有权衡：要么通过分布变化被检测到，要么易受对抗性修改攻击，要么一旦指纹被揭露便容易失效。在本工作中，我们提出了 CTCC，一种新颖的基于规则的指纹框架，它编码跨多轮对话的上下文相关性，例如反事实，而不是依赖于令牌级或单轮触发器。CTCC 在黑盒访问下即可进行指纹验证，同时减少误报和指纹泄露，即便部分触发器被暴露也能在共享语义规则下支持持续构建。跨多种 LLM 架构的大量实验证明，CTCC 在隐蔽性和鲁棒性方面始终优于先前工作。我们的研究结果将 CTCC 定位为在现实 LLM 部署场景中用于所有权验证的可靠且实用的解决方案。我们的代码和数据已公开发布于 https://github.com/Xuzhenhua55/CTCC。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-09-05 05:59:50 UTC 发布：2025-09-05 05:59:50 UTC

#49 Creativity Benchmark: A benchmark for marketing creativity for LLM models #49 创造力基准：用于评估 LLM 模型营销创造力的基准 [PDF] [复制] [Kimi 1 ] [REL]

Authors: [Ninad Bhat](https://arxiv.org/search/?searchtype=author&query=Ninad Bhat), [Kieran Browne](https://arxiv.org/search/?searchtype=author&query=Kieran Browne), [Pip Bingemann](https://arxiv.org/search/?searchtype=author&query=Pip Bingemann) 作者：Ninad Bhat、Kieran Browne、Pip Bingemann

We introduce Creativity Benchmark, an evaluation framework for large language models (LLMs) in marketing creativity. The benchmark covers 100 brands (12 categories) and three prompt types (Insights, Ideas, Wild Ideas). Human pairwise preferences from 678 practising creatives over 11,012 anonymised comparisons, analysed with Bradley-Terry models, show tightly clustered performance with no model dominating across brands or prompt types: the top-bottom spread is Δθ≈0.45, which implies a head-to-head win probability of 0.61; the highest-rated model beats the lowest only about 61% of the time. We also analyse model diversity using cosine distances to capture intra- and inter-model variation and sensitivity to prompt reframing. Comparing three LLM-as-judge setups with human rankings reveals weak, inconsistent correlations and judge-specific biases, underscoring that automated judges cannot substitute for human evaluation. Conventional creativity tests also transfer only partially to brand-constrained tasks. Overall, the results highlight the need for expert human evaluation and diversity-aware workflows. 我们推出了创意基准（Creativity Benchmark），用于评估大型语言模型（LLMs）在营销创意方面的表现。该基准涵盖 100 个品牌（12 个类别）和三种提示类型（洞察、创意、异想天开）。来自 678 名在职创意人员对 11,012 次匿名比对的人工成对偏好，通过 Bradley-Terry 模型分析显示表现高度聚集且没有某一模型在所有品牌或提示类型上占优：最高到最低的差距为 Δθ≈0.45 ，这意味着一对一对决的胜率为 0.61 ；评分最高的模型击败评分最低的模型的概率也只有大约 61% 。我们还使用余弦距离分析模型多样性，以捕捉模型内部及模型之间的变异性以及对提示改写的敏感性。将三种以 LLM 为评审者的设置与人工排名比较显示出弱且不一致的相关性以及特定评审者的偏差，强调了自动评审不能替代人工评估。传统的创造力测试也只部分适用于受品牌约束的任务。总体而言，结果突显了对专家人工评估和关注多样性的工作流程的需求。

Subjects: Computation and Language, Artificial Intelligence, Human-Computer Interaction 主题：计算与语言、人工智能、人机交互

Publish: 2025-09-05 04:44:29 UTC 发布：2025-09-05 04:44:29 UTC

#50 Optimal Multi-Task Learning at Regularization Horizon for Speech Translation Task #50 在正则化视界下针对语音翻译任务的最优多任务学习

Authors: [JungHo Jung](https://arxiv.org/search/?searchtype=author&query=JungHo Jung), [Junhyun Lee](https://arxiv.org/search/?searchtype=author&query=Junhyun Lee) 作者：JungHo Jung, Junhyun Lee

End-to-end speech-to-text translation typically suffers from the scarcity of paired speech-text data. One way to overcome this shortcoming is to utilize the bitext data from the Machine Translation (MT) task and perform Multi-Task Learning (MTL). In this paper, we formulate MTL from a regularization perspective and explore how sequences can be regularized within and across modalities. By thoroughly investigating the effect of consistency regularization (different modality) and R-drop (same modality), we show how they respectively contribute to the total regularization. We also demonstrate that the coefficient of MT loss serves as another source of regularization in the MTL setting. With these three sources of regularization, we introduce the optimal regularization contour in the high-dimensional space, called the regularization horizon. Experiments show that tuning the hyperparameters within the regularization horizon achieves near state-of-the-art performance on the MuST-C dataset. 端到端语音到文本的翻译通常受限于配对语音-文本数据的稀缺性。克服这一不足的一种方法是利用机器翻译（MT）任务的双语文本数据并进行多任务学习（MTL）。在本文中，我们从正则化的角度对 MTL 进行建模，并探讨序列在模态内和模态间如何被正则化。通过全面研究一致性正则化（不同模态）和 R-drop（相同模态）的效果，我们展示了它们各自如何对总体正则化做出贡献。我们还证明了 MT 损失的系数在 MTL 设置中充当了另一个正则化来源。基于这三种正则化来源，我们在高维空间中引入了称为正则化视界的最优正则化轮廓。实验证明，在正则化视界内调节超参数可以在 MuST-C 数据集上达到接近最先进的性能。

Subject: Computation and Language 主题：Computation and Language

Publish: 2025-09-04 17:21:36 UTC 发布：2025-09-04 17:21:36 协调世界时

#51 Cross-Layer Attention Probing for Fine-Grained Hallucination Detection

Authors: [Malavika Suresh](https://arxiv.org/search/?searchtype=author&query=Malavika Suresh), [Rahaf Aljundi](https://arxiv.org/search/?searchtype=author&query=Rahaf Aljundi), [Ikechukwu Nkisi-Orji](https://arxiv.org/search/?searchtype=author&query=Ikechukwu Nkisi-Orji), [Nirmalie Wiratunga](https://arxiv.org/search/?searchtype=author&query=Nirmalie Wiratunga) 作者：Malavika Suresh、Rahaf Aljundi、Ikechukwu Nkisi-Orji、Nirmalie Wiratunga

With the large-scale adoption of Large Language Models (LLMs) in various applications, there is a growing reliability concern due to their tendency to generate inaccurate text, i.e. hallucinations. In this work, we propose Cross-Layer Attention Probing (CLAP), a novel activation probing technique for hallucination detection, which processes the LLM activations across the entire residual stream as a joint sequence. Our empirical evaluations using five LLMs and three tasks show that CLAP improves hallucination detection compared to baselines on both greedy decoded responses as well as responses sampled at higher temperatures, thus enabling fine-grained detection, i.e. the ability to disambiguate hallucinations and non-hallucinations among different sampled responses to a given prompt. This allows us to propose a detect-then-mitigate strategy using CLAP to reduce hallucinations and improve LLM reliability compared to direct mitigation approaches. Finally, we show that CLAP maintains high reliability even when applied out-of-distribution. 随着大型语言模型（LLMs）在各种应用中的广泛采用，其倾向于生成不准确文本（即幻觉）引发的可靠性担忧日益增加。在本工作中，我们提出了跨层注意力探测（Cross-Layer Attention Probing，CLAP），这是一种用于幻觉检测的新型激活探测技术，它将整个残差流中的 LLM 激活作为一个联合序列进行处理。我们在五个 LLM 和三个任务上的实证评估表明，CLAP 在贪心解码的响应以及在更高温度采样的响应上均优于基线方法，从而实现了细粒度检测，即能够在对给定提示的不同采样响应之间区分幻觉与非幻觉。这使我们能够提出一种基于 CLAP 的先检测再缓解策略，以比直接缓解方法更好地减少幻觉并提高 LLM 的可靠性。最后，我们展示了 CLAP 在应用于分布外数据时仍能保持高可靠性。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-09-04 14:37:34 UTC 发布：2025-09-04 14:37:34 UTC

#52 Structured Information Matters: Explainable ICD Coding with Patient-Level Knowledge Graphs #52 结构化信息很重要：使用病人级知识图谱的可解释 ICD 编码 [PDF] [复制] [Kimi 1 ] [REL]

Authors: [Mingyang Li](https://arxiv.org/search/?searchtype=author&query=Mingyang Li), [Viktor Schlegel](https://arxiv.org/search/?searchtype=author&query=Viktor Schlegel), [Tingting Mu](https://arxiv.org/search/?searchtype=author&query=Tingting Mu), [Warren Del-Pinto](https://arxiv.org/search/?searchtype=author&query=Warren Del-Pinto), [Goran Nenadic](https://arxiv.org/search/?searchtype=author&query=Goran Nenadic) 作者：Mingyang Li、Viktor Schlegel、Tingting Mu、Warren Del-Pinto、Goran Nenadic

Mapping clinical documents to standardised clinical vocabularies is an important task, as it provides structured data for information retrieval and analysis, which is essential to clinical research, hospital administration and improving patient care. However, manual coding is both difficult and time-consuming, making it impractical at scale. Automated coding can potentially alleviate this burden, improving the availability and accuracy of structured clinical data. The task is difficult to automate, as it requires mapping to high-dimensional and long-tailed target spaces, such as the International Classification of Diseases (ICD). While external knowledge sources have been readily utilised to enhance output code representation, the use of external resources for representing the input documents has been underexplored. In this work, we compute a structured representation of the input documents, making use of document-level knowledge graphs (KGs) that provide a comprehensive structured view of a patient’s condition. The resulting knowledge graph efficiently represents the patient-centred input documents with 23% of the original text while retaining 90% of the information. We assess the effectiveness of this graph for automated ICD-9 coding by integrating it into the state-of-the-art ICD coding architecture PLM-ICD. Our experiments yield improved Macro-F1 scores by up to 3.20% on popular benchmarks, while improving training efficiency. We attribute this improvement to different types of entities and relationships in the KG, and demonstrate the improved explainability potential of the approach over the text-only baseline. 将临床文档映射到标准化临床词汇表是一项重要任务，因为它为信息检索与分析提供了结构化数据，这对临床研究、医院管理和改善病人护理至关重要。然而，人工编码既艰难又耗时，难以实现大规模应用。自动编码有望减轻这一负担，提高结构化临床数据的可用性和准确性。该任务难以自动化，因为需要映射到高维且长尾分布的目标空间，例如《国际疾病分类》（ICD）。尽管外部知识源已被广泛用于增强输出代码的表征，但用于表示输入文档的外部资源却鲜有探讨。在这项工作中，我们计算了输入文档的结构化表征，利用文档级知识图谱（KG），为病人的状况提供了全面的结构化视图。由此得到的知识图谱以原始文本的 23%有效地表示了以病人为中心的输入文档，同时保留了 90%的信息。我们通过将该图谱整合到最先进的 ICD 编码架构 PLM-ICD 中，评估其对自动化 ICD-9 编码的有效性。我们的实验在常用基准上将宏平均 F1 分数提高了最多 3.20%，同时提高了训练效率。我们将该提升归因于知识图谱中不同类型的实体和关系，并展示了该方法相比仅文本基线在可解释性方面的改善潜力。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-09-04 12:01:38 UTC 发布：2025-09-04 12:01:38 UTC

#53 Abduct, Act, Predict: Scaffolding Causal Inference for Automated Failure Attribution in Multi-Agent Systems #53 假设、行动、预测：为多智能体系统的自动故障归因构建因果推理支架

Authors: [Alva West](https://arxiv.org/search/?searchtype=author&query=Alva West), [Yixuan Weng](https://arxiv.org/search/?searchtype=author&query=Yixuan Weng), [Minjun Zhu](https://arxiv.org/search/?searchtype=author&query=Minjun Zhu), [Zhen Lin](https://arxiv.org/search/?searchtype=author&query=Zhen Lin), [Yue Zhang](https://arxiv.org/search/?searchtype=author&query=Yue Zhang) 作者：Alva West、Yixuan Weng、Minjun Zhu、Zhen Lin、Yue Zhang

Failure attribution in multi-agent systems – pinpointing the exact step where a decisive error occurs – is a critical yet unsolved challenge. Current methods treat this as a pattern recognition task over long conversation logs, leading to critically low step-level accuracy (below 17%), which renders them impractical for debugging complex systems. Their core weakness is a fundamental inability to perform robust counterfactual reasoning: to determine if correcting a single action would have actually averted the task failure. To bridge this counterfactual inference gap, we introduce Abduct-Act-Predict (A2P) Scaffolding, a novel agent framework that transforms failure attribution from pattern recognition into a structured causal inference task. A2P explicitly guides a large language model through a formal three-step reasoning process within a single inference pass: (1) Abduction, to infer the hidden root causes behind an agent’s actions; (2) Action, to define a minimal corrective intervention; and (3) Prediction, to simulate the subsequent trajectory and verify if the intervention resolves the failure. This structured approach leverages the holistic context of the entire conversation while imposing a rigorous causal logic on the model’s analysis. Our extensive experiments on the Who&When benchmark demonstrate its efficacy. On the Algorithm-Generated dataset, A2P achieves 47.46% step-level accuracy, a 2.85× improvement over the 16.67% of the baseline. On the more complex Hand-Crafted dataset, it achieves 29.31% step accuracy, a 2.43× improvement over the baseline’s 12.07%. By reframing the problem through a causal lens, A2P Scaffolding provides a robust, verifiable, and significantly more accurate solution for automated failure attribution. 在多智能体系统中进行故障归因——精确定位导致决定性错误的具体步骤——是一个关键但尚未解决的挑战。现有方法将其视为对长对话日志的模式识别任务，导致步骤级准确率极低（低于 17%），使其在调试复杂系统时不切实际。它们的核心弱点在于无法进行稳健的反事实推理：判断如果纠正单个行为是否真的能够避免任务失败。为弥补这一反事实推理缺口，我们提出了 Abduct-Act-Predict (A2P) 支架，这是一个新颖的智能体框架，将故障归因从模式识别转变为结构化的因果推理任务。A2P 在一次推理过程中明确引导大语言模型完成形式化的三步推理流程：（1）溯因（Abduction），推断智能体行为背后的隐含根本原因；（2）行动（Action），定义最小的纠正性干预；（3）预测（Prediction），模拟随后轨迹并验证干预是否能解决故障。这种结构化方法利用整个对话的整体上下文，同时对模型的分析施加严格的因果逻辑。我们在 Who&When 基准上的大量实验展示了其有效性。在算法生成的数据集上，A2P 在步骤级准确率上达到 47.46%，比基线的 16.67% 提高了 2.85 × 。在更复杂的人工构造数据集上，它在步骤准确率上达到 29.31%，比基线的 12.07% 提高了 2.43 × 。通过以因果视角重新构建问题，A2P 支架提供了一种稳健、可验证且显著更为准确的自动故障归因解决方案。

Subjects: Artificial Intelligence, Computation and Language 主题：人工智能，计算与语言

Publish: 2025-09-12 16:51:15 UTC 发布：2025-09-12 16:51:15 UTC

#54 Error Analysis in a Modular Meeting Transcription System #54 模块化会议转录系统中的错误分析 [PDF 1 ] [复制] [Kimi] [REL]

Authors: [Peter Vieting](https://arxiv.org/search/?searchtype=author&query=Peter Vieting), [Simon Berger](https://arxiv.org/search/?searchtype=author&query=Simon Berger), [Thilo von Neumann](https://arxiv.org/search/?searchtype=author&query=Thilo von Neumann), [Christoph Boeddeker](https://arxiv.org/search/?searchtype=author&query=Christoph Boeddeker), [Ralf Schlüter](https://arxiv.org/search/?searchtype=author&query=Ralf Schlüter), [Reinhold Haeb-Umbach](https://arxiv.org/search/?searchtype=author&query=Reinhold Haeb-Umbach) 作者：Peter Vieting、Simon Berger、Thilo von Neumann、Christoph Boeddeker、Ralf Schlüter、Reinhold Haeb-Umbach

Meeting transcription is a field of high relevance and remarkable progress in recent years. Still, challenges remain that limit its performance. In this work, we extend a previously proposed framework for analyzing leakage in speech separation with proper sensitivity to temporal locality. We show that there is significant leakage to the cross channel in areas where only the primary speaker is active. At the same time, the results demonstrate that this does not affect the final performance much as these leaked parts are largely ignored by the voice activity detection (VAD). Furthermore, different segmentations are compared showing that advanced diarization approaches are able to reduce the gap to oracle segmentation by a third compared to a simple energy-based VAD. We additionally reveal what factors contribute to the remaining difference. The results represent state-of-the-art performance on LibriCSS among systems that train the recognition module on LibriSpeech data only. 会议转录是一个高度相关且近年来取得显著进展的领域。但仍存在限制其性能的挑战。在这项工作中，我们扩展了先前提出的用于分析语音分离泄露的框架，使其对时间局部性具有适当的敏感性。我们展示了在仅有主讲者活动的区域，跨通道存在显著的泄露。同时，结果表明这并未在最终性能上造成太大影响，因为这些泄露部分在很大程度上被语音活动检测（VAD）忽略。此外，比较了不同的分割方法，结果表明，先进的说话人划分方法相比于简单的基于能量的 VAD，能够将与完美分割的差距缩小三分之一。我们还揭示了导致剩余差异的因素。该结果代表了在仅使用 LibriSpeech 数据训练识别模块的系统中，在 LibriCSS 上的最新水平性能。

Subjects: Audio and Speech Processing, Computation and Language, Machine Learning, Sound 主题：音频与语音处理、计算与语言、机器学习、声音

Publish: 2025-09-12 11:10:38 UTC 发布：2025-09-12 11:10:38 UTC

#55 VARCO-VISION-2.0 Technical Report #55 VARCO-VISION-2.0 技术报告

Authors: [Young-rok Cha](https://arxiv.org/search/?searchtype=author&query=Young-rok Cha), [Jeongho Ju](https://arxiv.org/search/?searchtype=author&query=Jeongho Ju), [SunYoung Park](https://arxiv.org/search/?searchtype=author&query=SunYoung Park), [Jong-Hyeon Lee](https://arxiv.org/search/?searchtype=author&query=Jong-Hyeon Lee), [Younghyun Yu](https://arxiv.org/search/?searchtype=author&query=Younghyun Yu), [Youngjune Kim](https://arxiv.org/search/?searchtype=author&query=Youngjune Kim) 作者：Young-rok Cha、Jeongho Ju、SunYoung Park、Jong-Hyeon Lee、Younghyun Yu、Youngjune Kim

We introduce VARCO-VISION-2.0, an open-weight bilingual vision-language model (VLM) for Korean and English with improved capabilities compared to the previous model VARCO-VISION-14B. The model supports multi-image understanding for complex inputs such as documents, charts, and tables, and delivers layoutaware OCR by predicting both textual content and its spatial location. Trained with a four-stage curriculum with memory-efficient techniques, the model achieves enhanced multimodal alignment, while preserving core language abilities and improving safety via preference optimization. Extensive benchmark evaluations demonstrate strong spatial grounding and competitive results for both languages, with the 14B model achieving 8th place on the OpenCompass VLM leaderboard among models of comparable scale. Alongside the 14B-scale model, we release a 1.7B version optimized for on-device deployment. We believe these models advance the development of bilingual VLMs and their practical applications. Two variants of VARCO-VISION-2.0 are available at Hugging Face: a full-scale 14B model and a lightweight 1.7B model. 我们推出了 VARCO-VISION-2.0，一款开源权重的韩英双语视觉-语言模型（VLM），相较于先前的 VARCO-VISION-14B 在能力上有所提升。该模型支持多图像理解，能处理诸如文档、图表和表格等复杂输入，并通过预测文本内容及其空间位置实现支持布局的 OCR。模型通过四阶段课程训练并采用节省内存的技术，实现了增强的多模态对齐，同时保留了核心语言能力并通过偏好优化提升了安全性。大量基准评估表明其具有强大的空间定位能力，并在两种语言上取得了具有竞争力的结果；其中 14B 模型在同等规模的模型中于 OpenCompass VLM 排行榜上名列第八。除 14B 规模模型外，我们还发布了为设备端部署优化的 1.7B 版本。我们相信这些模型推动了双语 VLM 的发展及其实际应用。VARCO-VISION-2.0 的两个变体已在 Hugging Face 发布：完整版 14B 模型和轻量级 1.7B 模型。

Subjects: Computer Vision and Pattern Recognition, Computation and Language 主题：计算机视觉与模式识别，计算与语言

Publish: 2025-09-12 09:55:56 UTC 发表：2025-09-12 09:55:56 世界协调时

#56 Unified Learnable 2D Convolutional Feature Extraction for ASR #56 统一可学习二维卷积特征提取用于自动语音识别 [PDF] [复制] [Kimi] [相关]

Authors: [Peter Vieting](https://arxiv.org/search/?searchtype=author&query=Peter Vieting), [Benedikt Hilmes](https://arxiv.org/search/?searchtype=author&query=Benedikt Hilmes), [Ralf Schlüter](https://arxiv.org/search/?searchtype=author&query=Ralf Schlüter), [Hermann Ney](https://arxiv.org/search/?searchtype=author&query=Hermann Ney)

Neural front-ends represent a promising approach to feature extraction for automatic speech recognition (ASR) systems as they enable to learn specifically tailored features for different tasks. Yet, many of the existing techniques remain heavily influenced by classical methods. While this inductive bias may ease the system design, our work aims to develop a more generic front-end for feature extraction. Furthermore, we seek to unify the front-end architecture contrasting with existing approaches that apply a composition of several layer topologies originating from different sources. The experiments systematically show how to reduce the influence of existing techniques to achieve a generic front-end. The resulting 2D convolutional front-end is parameter-efficient and suitable for a scenario with limited computational resources unlike large models pre-trained on unlabeled audio. The results demonstrate that this generic unified approach is not only feasible but also matches the performance of existing supervised learnable feature extractors. 神经前端代表了一种有前景的特征提取方法，可用于自动语音识别（ASR）系统，因为它们能够为不同任务学习专门定制的特征。然而，许多现有技术仍然受到传统方法的强烈影响。尽管这种归纳偏置可能简化系统设计，我们的工作旨在开发一种更通用的特征提取前端。此外，我们寻求统一前端架构，以区别于那些将源自不同来源的多种层拓扑组合在一起的现有方法。实验系统地展示了如何减少现有技术的影响以实现通用前端。由此得到的二维卷积前端参数效率高，适用于计算资源有限的场景，不同于在无标签语音上预训练的大型模型。结果表明，这种通用统一的方法不仅可行，而且能匹配现有有监督可学习特征提取器的性能。

Subjects: Audio and Speech Processing, Computation and Language, Machine Learning, Sound 主题：音频与语音处理，计算与语言，机器学习，声音

Publish: 2025-09-12 07:52:51 UTC 发布：2025-09-12 07:52:51 UTC

#57 Whisper Has an Internal Word Aligner #57 Whisper 有一个内部词对齐器 [PDF 1 ] [复制] [Kimi] [REL]

Authors: [Sung-Lin Yeh](https://arxiv.org/search/?searchtype=author&query=Sung-Lin Yeh), [Yen Meng](https://arxiv.org/search/?searchtype=author&query=Yen Meng), [Hao Tang](https://arxiv.org/search/?searchtype=author&query=Hao Tang) 作者：Sung-Lin Yeh, Yen Meng, Hao Tang

There is an increasing interest in obtaining accurate word-level timestamps from strong automatic speech recognizers, in particular Whisper. Existing approaches either require additional training or are simply not competitive. The evaluation in prior work is also relatively loose, typically using a tolerance of more than 200 ms. In this work, we discover attention heads in Whisper that capture accurate word alignments and are distinctively different from those that do not. Moreover, we find that using characters produces finer and more accurate alignments than using wordpieces. Based on these findings, we propose an unsupervised approach to extracting word alignments by filtering attention heads while teacher forcing Whisper with characters. Our approach not only does not require training but also produces word alignments that are more accurate than prior work under a stricter tolerance between 20 ms and 100 ms. 人们越来越关注从强大的自动语音识别器（尤其是 Whisper）中获取准确的逐词时间戳。现有方法要么需要额外训练，要么竞争力不足。之前工作的评估也相对宽松，通常使用超过 200 毫秒的容差。在本工作中，我们发现 Whisper 中的一些注意力头能够捕捉到准确的词对齐，并且这些注意力头与那些不能捕捉的明显不同。此外，我们发现使用字符比使用词片（wordpieces）能产生更精细、更准确的对齐。基于这些发现，我们提出了一种无监督的方法，通过在以字符进行 teacher forcing 时筛选注意力头来提取词对齐。我们的方法不仅不需要训练，而且在 20 毫秒到 100 毫秒这一更严格的容差范围内，生成的词对齐比以前的工作更为准确。

Subjects: Audio and Speech Processing, Computation and Language 主题：音频与语音处理，计算与语言

Publish: 2025-09-12 06:03:24 UTC 发布：2025-09-12 06:03:24 UTC

#58 Vibe Check: Understanding the Effects of LLM-Based Conversational Agents’ Personality and Alignment on User Perceptions in Goal-Oriented Tasks #58 Vibe Check：理解基于 LLM 的会话代理的个性与对齐对面向目标任务的用户感知的影响 [PDF] [复制] [Kimi] [REL]

Authors: [Hasibur Rahman](https://arxiv.org/search/?searchtype=author&query=Hasibur Rahman), [Smit Desai](https://arxiv.org/search/?searchtype=author&query=Smit Desai) 作者：Hasibur Rahman、Smit Desai

Large language models (LLMs) enable conversational agents (CAs) to express distinctive personalities, raising new questions about how such designs shape user perceptions. This study investigates how personality expression levels and user-agent personality alignment influence perceptions in goal-oriented tasks. In a between-subjects experiment (N=150), participants completed travel planning with CAs exhibiting low, medium, or high expression across the Big Five traits, controlled via our novel Trait Modulation Keys framework. Results revealed an inverted-U relationship: medium expression produced the most positive evaluations across Intelligence, Enjoyment, Anthropomorphism, Intention to Adopt, Trust, and Likeability, significantly outperforming both extremes. Personality alignment further enhanced outcomes, with Extraversion and Emotional Stability emerging as the most influential traits. Cluster analysis identified three distinct compatibility profiles, with “Well-Aligned” users reporting substantially positive perceptions. These findings demonstrate that personality expression and strategic trait alignment constitute optimal design targets for CA personality, offering design implications as LLM-based CAs become increasingly prevalent. 大型语言模型 (LLMs) 使对话代理 (CAs) 能够展现独特的人格特征，这引发了关于此类设计如何影响用户感知的新问题。本研究探讨了人格表达强度及用户与代理人格匹配如何在以目标为导向的任务中影响感知。在一项组间实验（N=150）中，参与者与在五大人格特质上分别表现出低、中或高表达的 CAs 一起完成旅行规划，表达程度通过我们新提出的特质调节键 (Trait Modulation Keys) 框架进行控制。结果显示出一种倒 U 形关系：中等表达在智能、愉悦感、拟人化、采用意向、信任和喜好度上的评估最为正面，显著优于两种极端。人格匹配进一步提升了结果，其中外向性和情绪稳定性成为最具影响力的特质。聚类分析识别出三种不同的兼容性档案，“高度匹配”用户报告了显著积极的感知。这些发现表明，个性表达和策略性特质对齐构成了 CA 个性的最佳设计目标，为在基于 LLM 的 CA 日益普及的情况下的设计提供了启示。

Subjects: Human-Computer Interaction, Artificial Intelligence, Computation and Language 主题：人机交互，人工智能，计算与语言

Publish: 2025-09-11 21:43:49 UTC 发布：2025-09-11 21:43:49 UTC

#59 LLMs as Agentic Cooperative Players in Multiplayer UNO

Authors: [Yago Romano Matinez](https://arxiv.org/search/?searchtype=author&query=Yago Romano Matinez), [Jesse Roberts](https://arxiv.org/search/?searchtype=author&query=Jesse Roberts) 作者：Yago Romano Matinez，Jesse Roberts

LLMs promise to assist humans – not just by answering questions, but by offering useful guidance across a wide range of tasks. But how far does that assistance go? Can a large language model based agent actually help someone accomplish their goal as an active participant? We test this question by engaging an LLM in UNO, a turn-based card game, asking it not to win but instead help another player to do so. We built a tool that allows decoder-only LLMs to participate as agents within the RLCard game environment. These models receive full game-state information and respond using simple text prompts under two distinct prompting strategies. We evaluate models ranging from small (1B parameters) to large (70B parameters) and explore how model scale impacts performance. We find that while all models were able to successfully outperform a random baseline when playing UNO, few were able to significantly aid another player. LLMs 不只是通过回答问题来协助人类，还可以在广泛的任务中提供有用的指导。但这种辅助能走多远？基于大型语言模型的代理能否作为积极参与者真正帮助他人实现目标？我们通过让 LLM 参与回合制纸牌游戏 UNO 来检验这个问题，要求其不是取胜，而是帮助另一名玩家获胜。我们构建了一个工具，使仅解码器架构的 LLM 能作为代理在 RLCard 游戏环境中参与。这些模型获得完整的游戏状态信息，并在两种不同的提示策略下使用简单文本提示作出回应。我们评估了从小型（1B 参数）到大型（70B 参数）的模型，并探讨模型规模如何影响性能。我们发现，虽然所有模型在玩 UNO 时都能成功超越随机基线，但很少有模型能显著地帮助另一名玩家。

Subjects: Artificial Intelligence, Computation and Language 主题：人工智能，计算与语言

Publish: 2025-09-11 21:42:33 UTC 发布：2025-09-11 21:42:33 UTC

#60 Latency and Token-Aware Test-Time Compute #60 延迟与令牌感知的测试时计算

Authors: [Jenny Y. Huang](https://arxiv.org/search/?searchtype=author&query=Jenny Y. Huang), [Mehul Damani](https://arxiv.org/search/?searchtype=author&query=Mehul Damani), [Yousef El-Kurdi](https://arxiv.org/search/?searchtype=author&query=Yousef El-Kurdi), [Ramon Astudillo](https://arxiv.org/search/?searchtype=author&query=Ramon Astudillo), [Wei Sun](https://arxiv.org/search/?searchtype=author&query=Wei Sun) 作者：Jenny Y. Huang，Mehul Damani，Yousef El-Kurdi，Ramon Astudillo，Wei Sun

Inference-time scaling has emerged as a powerful way to improve large language model (LLM) performance by generating multiple candidate responses and selecting among them. However, existing work on dynamic allocation for test-time compute typically considers only parallel generation methods such as best-of-N, overlooking incremental decoding methods like beam search, and has largely ignored latency, focusing only on token usage. We formulate inference-time scaling as a problem of dynamic compute allocation and method selection, where the system must decide which strategy to apply and how much compute to allocate on a per-query basis. Our framework explicitly incorporates both token cost and wall-clock latency, the latter being critical for user experience and particularly for agentic workflows where models must issue multiple queries efficiently. Experiments on reasoning benchmarks show that our approach consistently outperforms static strategies, achieving favorable accuracy-cost trade-offs while remaining practical for deployment. 推理时扩展已成为一种通过生成多个候选响应并在其间选择来提升大型语言模型（LLM）性能的强大方法。然而，现有关于测试时计算的动态分配的工作通常仅考虑诸如 best-of-N 之类的并行生成方法，忽视了像束搜索这样的增量解码方法，并在很大程度上忽略了延迟，仅关注令牌使用量。我们将推理时扩展表述为一个动态计算分配和方法选择的问题，其中系统必须决定在每个查询上应用哪种策略以及分配多少计算资源。我们的框架明确同时纳入了令牌成本和实际墙钟延迟，后者对用户体验至关重要，尤其在模型必须高效发出多个查询的代理工作流中更为关键。在推理基准上的实验证明，我们的方法始终优于静态策略，在保持可部署性的同时实现了有利的准确性-成本折衷。

Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题：机器学习、人工智能、计算与语言

Publish: 2025-09-11 21:35:19 UTC 发布：2025-09-11 21:35:19 协调世界时

#61 Executable Ontologies: Synthesizing Event Semantics with Dataflow Architecture #61 可执行本体：用数据流架构合成事件语义 [PDF] [复制] [Kimi] [REL]

Author: [Aleksandr Boldachev](https://arxiv.org/search/?searchtype=author&query=Aleksandr Boldachev) 作者：Aleksandr Boldachev

This paper presents boldsea, Boldachev’s semantic-event approach – an architecture for modeling complex dynamic systems using executable ontologies – semantic models that act as dynamic structures, directly controlling process execution. We demonstrate that integrating event semantics with a dataflow architecture addresses the limitations of traditional Business Process Management (BPM) systems and object-oriented semantic technologies. The paper presents the formal BSL (boldsea Semantic Language), including its BNF grammar, and outlines the boldsea-engine’s architecture, which directly interprets semantic models as executable algorithms without compilation. It enables the modification of event models at runtime, ensures temporal transparency, and seamlessly merges data and business logic within a unified semantic framework. 本文介绍了 boldsea——Boldachev 的语义事件方法论，一种使用可执行本体来建模复杂动态系统的架构——将语义模型作为动态结构，直接控制流程执行。我们证明了将事件语义与数据流架构相结合可以解决传统业务流程管理（BPM）系统和面向对象语义技术的局限性。论文介绍了形式化的 BSL（boldsea 语义语言），包括其 BNF 语法，并概述了 boldsea 引擎的架构，该引擎直接将语义模型作为可执行算法进行解释而无需编译。它支持在运行时修改事件模型，确保时间透明性，并在统一的语义框架内无缝合并数据和业务逻辑。

Subjects: Artificial Intelligence, Computation and Language, Formal Languages and Automata Theory, Software Engineering 主题：人工智能、计算与语言、形式语言与自动机理论、软件工程

Publish: 2025-09-11 18:12:46 UTC 发布：2025-09-11 18:12:46 UTC

#62 HypoGeneAgent: A Hypothesis Language Agent for Gene-Set Cluster Resolution Selection Using Perturb-seq Datasets #62 HypoGeneAgent：一种用于基因集合簇解析选择的假设语言代理，基于 Perturb-seq 数据集

Large-scale single-cell and Perturb-seq investigations routinely involve clustering cells and subsequently annotating each cluster with Gene-Ontology (GO) terms to elucidate the underlying biological programs. However, both stages, resolution selection and functional annotation, are inherently subjective, relying on heuristics and expert curation. We present HYPOGENEAGENT, a large language model (LLM)-driven framework, transforming cluster annotation into a quantitatively optimizable task. Initially, an LLM functioning as a gene-set analyst analyzes the content of each gene program or perturbation module and generates a ranked list of GO-based hypotheses, accompanied by calibrated confidence scores. Subsequently, we embed every predicted description with a sentence-embedding model, compute pair-wise cosine similarities, and let the agent referee panel score (i) the internal consistency of the predictions, high average similarity within the same cluster, termed intra-cluster agreement (ii) their external distinctiveness, low similarity between clusters, termed inter-cluster separation. These two quantities are combined to produce an agent-derived resolution score, which is maximized when clusters exhibit simultaneous coherence and mutual exclusivity. When applied to a public K562 CRISPRi Perturb-seq dataset as a preliminary test, our Resolution Score selects clustering granularities that exhibit alignment with known pathway compared to classical metrics such silhouette score, modularity score for gene functional enrichment summary. These findings establish LLM agents as objective adjudicators of cluster resolution and functional annotation, thereby paving the way for fully automated, context-aware interpretation pipelines in single-cell multi-omics studies. 大规模单细胞和 Perturb-seq 研究常常需要对细胞进行聚类，并随后为每个簇注释基因本体（Gene-Ontology，GO）术语以阐明其背后的生物学程序。然而，分辨率选择和功能注释这两个阶段本质上是主观的，依赖于启发式方法和专家策划。我们提出了 HYPOGENEAGENT，一种由 LLM 驱动的框架，将簇注释转变为可定量优化的任务。最初，作为基因集合分析师的 LLM 分析每个基因程序或扰动模块的内容，并生成基于 GO 的假设排序列表，同时给出校准过的置信分数。随后，我们使用句子嵌入模型对每个预测描述进行嵌入，计算成对余弦相似度，并让代理裁判小组对以下两点进行评分：(i) 预测的内部一致性，即同一簇内高平均相似度，称为簇内一致性；(ii) 它们的外部可区分性，即簇间低相似度，称为簇间分离。这两个量被结合起来产生一个由代理导出的分辨率评分，当簇同时表现出一致性和相互排他性时该评分达到最大化。作为初步测试，应用于公共的 K562 CRISPRi Perturb-seq 数据集时，我们的分辨率评分在簇粒度的选择上相较于诸如轮廓系数（silhouette score）、用于基因功能富集概况的模块度评分（modularity score）等经典指标，更能与已知通路保持一致。这些发现确立了 LLM 代理作为簇分辨率和功能注释的客观裁定者，从而为单细胞多组学研究中完全自动化、具上下文感知的解释流程铺平了道路。

Subjects: Quantitative Methods, Artificial Intelligence, Computation and Language, Machine Learning 主题：定量方法、人工智能、计算与语言、机器学习

Publish: 2025-09-10 22:25:33 UTC 发布：2025-09-10 22:25:33 UTC

#63 Improving MLLM Historical Record Extraction with Test-Time Image #63 改进 MLLM 历史记录提取的测试时图像 [PDF 1 ] [Copy] [Kimi] [REL]

Authors: [Taylor Archibald](https://arxiv.org/search/?searchtype=author&query=Taylor Archibald), [Tony Martinez](https://arxiv.org/search/?searchtype=author&query=Tony Martinez) 作者：Taylor Archibald，Tony Martinez

We present a novel ensemble framework that stabilizes LLM based text extraction from noisy historical documents. We transcribe multiple augmented variants of each image with Gemini 2.0 Flash and fuse these outputs with a custom Needleman Wunsch style aligner that yields both a consensus transcription and a confidence score. We present a new dataset of 622 Pennsylvania death records, and demonstrate our method improves transcription accuracy by 4 percentage points relative to a single shot baseline. We find that padding and blurring are the most useful for improving accuracy, while grid warp perturbations are best for separating high and low confidence cases. The approach is simple, scalable, and immediately deployable to other document collections and transcription models. 我们提出了一个新颖的集成框架，使基于 LLM 的文本提取在嘈杂的历史文档中更稳健。我们使用 Gemini 2.0 Flash 对每张图像的多个增强变体进行转录，并用一个定制的 Needleman–Wunsch 风格比对器融合这些输出，该比对器同时产生一致性转录和置信度得分。我们发布了一个包含 622 份宾夕法尼亚州死亡记录的新数据集，并展示了我们的方法相较于单次转录基线将转录准确率提高了 4 个百分点。我们发现填充和模糊对于提升准确率最有用，而网格扭曲扰动最有助于区分高置信度和低置信度的案例。该方法简单、可扩展，且可立即部署到其他文档集合和转录模型中。

Subjects: Computer Vision and Pattern Recognition, Computation and Language, Machine Learning 学科：计算机视觉与模式识别，计算与语言，机器学习

Publish: 2025-09-10 03:18:24 UTC 发布时间：2025-09-10 03:18:24 协调世界时 (UTC)

#64 VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions #64 VStyle：一个用于带有口语指令的语音风格适应基准 [PDF] [复制] [Kimi 2 ] [REL]

Spoken language models (SLMs) have emerged as a unified paradigm for speech understanding and generation, enabling natural human machine interaction. However, while most progress has focused on semantic accuracy and instruction following, the ability of SLMs to adapt their speaking style based on spoken instructions has received limited attention. We introduce Voice Style Adaptation (VSA), a new task that examines whether SLMs can modify their speaking style, such as timbre, prosody, or persona following natural language spoken commands. To study this task, we present VStyle, a bilingual (Chinese & English) benchmark covering four categories of speech generation: acoustic attributes, natural language instruction, role play, and implicit empathy. We also introduce the Large Audio Language Model as a Judge (LALM as a Judge) framework, which progressively evaluates outputs along textual faithfulness, style adherence, and naturalness, ensuring reproducible and objective assessment. Experiments on commercial systems and open source SLMs demonstrate that current models face clear limitations in controllable style adaptation, highlighting both the novelty and challenge of this task. By releasing VStyle and its evaluation toolkit, we aim to provide the community with a foundation for advancing human centered spoken interaction. The dataset and code are publicly available at \href{https://junzhan2000.github.io/VStyle.github.io/}{project’s homepage}. 语音语言模型（SLMs）已经成为语音理解与生成的统一范式，使自然的人机交互成为可能。然而，尽管大多数进展集中在语义准确性和遵循指令上，SLMs 根据口头指令调整说话风格的能力却很少被关注。我们提出了语音风格适应（VSA），这是一个新任务，用于检验 SLMs 是否能根据自然语言口头命令改变其说话风格，例如音色、韵律或人设。为研究该任务，我们构建了 VStyle，一个涵盖四类语音生成的双语（中文和英文）基准：声学属性、自然语言指令、角色扮演和隐含共情。我们还引入了“大型音频语言模型作为评判”（LALM as a Judge）框架，按文本忠实性、风格遵循度和自然度逐步评估输出，确保可重复且客观的评估。在商业系统和开源 SLMs 上的实验表明，当前模型在可控风格适应方面存在明显局限，这突显了该任务的新颖性和挑战性。通过发布 VStyle 及其评估工具包，我们旨在为社区提供一个推动以人为本的语音交互发展的基础。该数据集和代码可在项目主页 \href{https://junzhan2000.github.io/VStyle.github.io/}{project’s homepage} 上公开获取。

Subjects: Sound, Artificial Intelligence, Computation and Language, Audio and Speech Processing 主题：声音、人工智能、计算与语言、音频与语音处理

Publish: 2025-09-09 14:28:58 UTC 发布时间：2025-09-09 14:28:58 UTC

#65 LLM-Based Instance-Driven Heuristic Bias In the Context of a Biased Random Key Genetic Algorithm #65 基于 LLM 的实例驱动启发式偏置在有偏随机键遗传算法背景下

Authors: [Camilo Chacón Sartori](https://arxiv.org/search/?searchtype=author&query=Camilo Chacón Sartori), [Martín Isla Pino](https://arxiv.org/search/?searchtype=author&query=Martín Isla Pino), [Pedro Pinacho-Davidson](https://arxiv.org/search/?searchtype=author&query=Pedro Pinacho-Davidson), [Christian Blum](https://arxiv.org/search/?searchtype=author&query=Christian Blum) 作者：Camilo Chacón Sartori、Martín Isla Pino、Pedro Pinacho-Davidson、Christian Blum

Integrating Large Language Models (LLMs) within metaheuristics opens a novel path for solving complex combinatorial optimization problems. While most existing approaches leverage LLMs for code generation to create or refine specific heuristics, they often overlook the structural properties of individual problem instances. In this work, we introduce a novel framework that integrates LLMs with a Biased Random-Key Genetic Algorithm (BRKGA) to solve the NP-hard Longest Run Subsequence problem. Our approach extends the instance-driven heuristic bias paradigm by introducing a human-LLM collaborative process to co-design and implement a set of computationally efficient metrics. The LLM analyzes these instance-specific metrics to generate a tailored heuristic bias, which steers the BRKGA toward promising areas of the search space. We conduct a comprehensive experimental evaluation, including rigorous statistical tests, convergence and behavioral analyses, and targeted ablation studies, comparing our method against a standard BRKGA baseline across 1,050 generated instances of varying complexity. Results show that our top-performing hybrid, BRKGA+Llama-4-Maverick, achieves statistically significant improvements over the baseline, particularly on the most complex instances. Our findings confirm that leveraging an LLM to produce an a priori, instance-driven heuristic bias is a valuable approach for enhancing metaheuristics in complex optimization domains. 将大型语言模型（LLMs）与元启发式算法整合，为解决复杂的组合优化问题开辟了一条新路径。尽管现有大多数方法利用 LLMs 进行代码生成以创建或改进特定启发式算法，但它们常常忽视单个问题实例的结构特性。在本工作中，我们提出了一个新框架，将 LLMs 与带偏随机键遗传算法（BRKGA）结合起来，用于求解 NP 难题最长运行子序列问题。我们的方法通过引入人类与 LLM 协同设计过程来扩展基于实例的启发式偏置范式，共同设计并实现了一组计算高效的度量。LLM 分析这些针对实例的度量以生成定制的启发式偏置，从而引导 BRKGA 朝搜索空间中有前景的区域探索。我们进行了全面的实验评估，包括严格的统计检验、收敛性与行为分析以及有针对性的消融研究，并在 1,050 个不同复杂度的生成实例上将我们的方法与标准 BRKGA 基线进行了比较。结果表明，我们表现最好的混合模型 BRKGA+Llama-4-Maverick 在基线之上取得了具有统计显著性的改进，尤其是在最复杂的实例上。我们的研究结果证实，利用 LLM 生成先验的、以实例为驱动的启发式偏置，是在复杂优化领域增强元启发式算法的一种有价值的方法。

Subjects: Neural and Evolutionary Computing, Artificial Intelligence, Computation and Language 受试领域：神经与进化计算、人工智能、计算与语言

Publish: 2025-09-05 21:46:41 UTC 发布：2025-09-05 21:46:41 UTC

#66 Differential Robustness in Transformer Language Models: Empirical Evaluation Under Adversarial Text Attacks #66 变压器语言模型的差异性鲁棒性：在对抗性文本攻击下的实证评估

Authors: [Taniya Gidatkar](https://arxiv.org/search/?searchtype=author&query=Taniya Gidatkar), [Oluwaseun Ajao](https://arxiv.org/search/?searchtype=author&query=Oluwaseun Ajao), [Matthew Shardlow](https://arxiv.org/search/?searchtype=author&query=Matthew Shardlow) 作者：Taniya Gidatkar、Oluwaseun Ajao、Matthew Shardlow

This study evaluates the resilience of large language models (LLMs) against adversarial attacks, specifically focusing on Flan-T5, BERT, and RoBERTa-Base. Using systematically designed adversarial tests through TextFooler and BERTAttack, we found significant variations in model robustness. RoBERTa-Base and FlanT5 demonstrated remarkable resilience, maintaining accuracy even when subjected to sophisticated attacks, with attack success rates of 0%. In contrast. BERT-Base showed considerable vulnerability, with TextFooler achieving a 93.75% success rate in reducing model accuracy from 48% to just 3%. Our research reveals that while certain LLMs have developed effective defensive mechanisms, these safeguards often require substantial computational resources. This study contributes to the understanding of LLM security by identifying existing strengths and weaknesses in current safeguarding approaches and proposes practical recommendations for developing more efficient and effective defensive strategies. 本研究评估了大型语言模型（LLMs）在对抗性攻击下的鲁棒性，尤其关注 Flan-T5、BERT 和 RoBERTa-Base。通过使用 TextFooler 和 BERTAttack 系统性设计的对抗性测试，我们发现模型鲁棒性存在显著差异。RoBERTa-Base 和 FlanT5 展现出卓越的抗扰性，即使在面对复杂攻击时仍能保持准确率，攻击成功率为 0%。相反，BERT-Base 表现出较大脆弱性，TextFooler 在将模型准确率从 48% 降至仅 3% 的过程中取得了 93.75% 的成功率。我们的研究表明，尽管某些 LLMs 已经发展出有效的防御机制，但这些防护措施往往需要大量计算资源。本研究通过识别当前防护方法的强项与弱点，为理解 LLM 安全性作出贡献，并提出了构建更高效、更有效防御策略的实用建议。

Subjects: Cryptography and Security, Artificial Intelligence, Computation and Language 主题：密码学与安全、人工智能、计算与语言

Publish: 2025-09-05 21:43:06 UTC 发布：2025-09-05 21:43:06 UTC

#67 Personas within Parameters: Fine-Tuning Small Language Models with Low-Rank Adapters to Mimic User Behaviors

Authors: [Himanshu Thakur](https://arxiv.org/search/?searchtype=author&query=Himanshu Thakur), [Eshani Agrawal](https://arxiv.org/search/?searchtype=author&query=Eshani Agrawal), [Smruthi Mukund](https://arxiv.org/search/?searchtype=author&query=Smruthi Mukund) 作者：Himanshu Thakur、Eshani Agrawal、Smruthi Mukund

A long-standing challenge in developing accurate recommendation models is simulating user behavior, mainly due to the complex and stochastic nature of user interactions. Towards this, one promising line of work has been the use of Large Language Models (LLMs) for simulating user behavior. However, aligning these general-purpose large pre-trained models with user preferences necessitates: (i) effectively and continously parsing large-scale tabular user-item interaction data, (ii) overcoming pre-training-induced inductive biases to accurately learn user specific knowledge, and (iii) achieving the former two at scale for millions of users. While most previous works have focused on complex methods to prompt an LLM or fine-tune it on tabular interaction datasets, our approach shifts the focus to extracting robust textual user representations using a frozen LLM and simulating cost-effective, resource-efficient user agents powered by fine-tuned Small Language Models (SLMs). Further, we showcase a method for training multiple low-rank adapters for groups of users or \textit{persona}, striking an optimal balance between scalability and performance of user behavior agents. Our experiments provide compelling empirical evidence of the efficacy of our methods, demonstrating that user agents developed using our approach have the potential to bridge the gap between offline metrics and real-world performance of recommender systems. 在开发准确的推荐模型时，一个长期存在的挑战是模拟用户行为，这主要由于用户交互的复杂性和随机性。为此，一条有前景的研究方向是使用大型语言模型（LLMs）来模拟用户行为。然而，将这些通用的大型预训练模型与用户偏好对齐需要：（i）有效且持续地解析大规模表格化的用户-物品交互数据，（ii）克服预训练带来的归纳偏差以准确学习特定用户的知识，和（iii）在数百万用户的规模上实现前两点。尽管以往大多数工作集中于设计复杂的方法来提示 LLM 或在表格交互数据集上微调它，我们的方法将重点转向使用冻结的 LLM 提取鲁棒的文本化用户表示，并使用微调过的小型语言模型（SLMs）构建成本更低、资源更节省的用户代理来进行模拟。此外，我们展示了一种为用户群体或角色（persona）训练多个低秩适配器的方法，在用户行为代理的可扩展性和性能之间取得最佳平衡。我们的实验证明了我们方法的显著有效性，表明使用我们方法开发的用户代理有可能弥合离线指标与推荐系统实际表现之间的差距。

Subjects: Information Retrieval, Artificial Intelligence, Computation and Language, Machine Learning 主题：信息检索、人工智能、计算与语言、机器学习

Publish: 2025-08-18 22:14:57 UTC 发布：2025-08-18 22:14:57 世界协调时间 (UTC)

#68 AI-Powered Assistant for Long-Term Access to RHIC Knowledge #68 基于 AI 的助理，用于长期获取 RHIC 知识 [PDF] [复制] [Kimi] [REL]

Authors: [Mohammad Atif](https://arxiv.org/search/?searchtype=author&query=Mohammad Atif), [Vincent Garonne](https://arxiv.org/search/?searchtype=author&query=Vincent Garonne), [Eric Lancon](https://arxiv.org/search/?searchtype=author&query=Eric Lancon), [Jerome Lauret](https://arxiv.org/search/?searchtype=author&query=Jerome Lauret), [Alexandr Prozorov](https://arxiv.org/search/?searchtype=author&query=Alexandr Prozorov), [Michal Vranovsky](https://arxiv.org/search/?searchtype=author&query=Michal Vranovsky) 作者：Mohammad Atif、Vincent Garonne、Eric Lancon、Jerome Lauret、Alexandr Prozorov、Michal Vranovsky

As the Relativistic Heavy Ion Collider (RHIC) at Brookhaven National Laboratory concludes 25 years of operation, preserving not only its vast data holdings (∼1 ExaByte) but also the embedded scientific knowledge becomes a critical priority. The RHIC Data and Analysis Preservation Plan (DAPP) introduces an AI-powered assistant system that provides natural language access to documentation, workflows, and software, with the aim of supporting reproducibility, education, and future discovery. Built upon Large Language Models using Retrieval-Augmented Generation and the Model Context Protocol, this assistant indexes structured and unstructured content from RHIC experiments and enables domain-adapted interaction. We report on the deployment, computational performance, ongoing multi-experiment integration, and architectural features designed for a sustainable and explainable long-term AI access. Our experience illustrates how modern AI/ML tools can transform the usability and discoverability of scientific legacy data. 随着布鲁克海文国家实验室的相对论性重离子对撞机（RHIC）结束了 25 年的运行，保存其庞大的数据存储（ ∼ 1 拍字节）以及其中蕴含的科学知识成为一项关键优先任务。RHIC 数据与分析保存计划（DAPP）引入了一个由人工智能驱动的助手系统，提供对文档、工作流程和软件的自然语言访问，旨在支持可重复性、教育与未来的发现。该助手基于大型语言模型，采用检索增强生成（Retrieval-Augmented Generation）和模型上下文协议（Model Context Protocol），对来自 RHIC 实验的结构化与非结构化内容进行索引，并实现领域适配的交互。我们报告了部署情况、计算性能、正在进行的多实验整合以及为实现可持续且可解释的长期 AI 访问而设计的架构特性。我们的经验展示了现代 AI/ML 工具如何改变科学遗留数据的可用性与可发现性。

Subjects: Information Retrieval, Artificial Intelligence, Computation and Language 主题：信息检索、人工智能、计算与语言

Publish: 2025-08-18 15:16:29 UTC 发布：2025-08-18 15:16:29 UTC

#69 Text-to-SQL Oriented to the Process Mining Domain: A PT-EN Dataset for Query Translation #69 面向流程挖掘领域的文本到 SQL：用于查询翻译的英德数据集 [PDF] [复制] [Kimi] [REL]

Authors: [Bruno Yui Yamate](https://arxiv.org/search/?searchtype=author&query=Bruno Yui Yamate), [Thais Rodrigues Neubauer](https://arxiv.org/search/?searchtype=author&query=Thais Rodrigues Neubauer), [Marcelo Fantinato](https://arxiv.org/search/?searchtype=author&query=Marcelo Fantinato), [Sarajane Marques Peres](https://arxiv.org/search/?searchtype=author&query=Sarajane Marques Peres) 作者：Bruno Yui Yamate、Thais Rodrigues Neubauer、Marcelo Fantinato、Sarajane Marques Peres

This paper introduces text-2-SQL-4-PM, a bilingual (Portuguese-English) benchmark dataset designed for the text-to-SQL task in the process mining domain. Text-to-SQL conversion facilitates natural language querying of databases, increasing accessibility for users without SQL expertise and productivity for those that are experts. The text-2-SQL-4-PM dataset is customized to address the unique challenges of process mining, including specialized vocabularies and single-table relational structures derived from event logs. The dataset comprises 1,655 natural language utterances, including human-generated paraphrases, 205 SQL statements, and ten qualifiers. Methods include manual curation by experts, professional translations, and a detailed annotation process to enable nuanced analyses of task complexity. Additionally, a baseline study using GPT-3.5 Turbo demonstrates the feasibility and utility of the dataset for text-to-SQL applications. The results show that text-2-SQL-4-PM supports evaluation of text-to-SQL implementations, offering broader applicability for semantic parsing and other natural language processing tasks. 本文介绍了 text-2-SQL-4-PM，这是一个为流程挖掘领域的文本到 SQL 任务设计的双语（葡萄牙语-英语）基准数据集。文本到 SQL 的转换便于以自然语言对数据库进行查询，既提高了无 SQL 专业知识用户的可及性，也提升了有经验用户的工作效率。text-2-SQL-4-PM 数据集针对流程挖掘的独特挑战进行了定制，包括专业术语表和源自事件日志的单表关系结构。该数据集包含 1,655 条自然语言语句（包括人工生成的释义）、205 条 SQL 语句和十个限定词。方法包括专家的人工整理、专业翻译以及详尽的注释流程，以支持对任务复杂性的细致分析。此外，使用 GPT-3.5 Turbo 的基线研究展示了该数据集在文本到 SQL 应用中的可行性与实用性。结果表明，text-2-SQL-4-PM 支持对文本到 SQL 实现的评估，并为语义解析和其他自然语言处理任务提供了更广泛的适用性。

Subjects: Information Retrieval, Artificial Intelligence, Computation and Language, Databases 主题：信息检索、人工智能、计算与语言、数据库

Publish: 2025-08-18 01:25:41 UTC

#70 DB3 Team’s Solution For Meta KDD Cup’ 25 #70 DB3 团队为 Meta KDD Cup'25 提交的解决方案 [PDF 1 ] [复制] [Kimi] [REL]

This paper presents the db3 team’s winning solution for the Meta CRAG-MM Challenge 2025 at KDD Cup'25. Addressing the challenge’s unique multi-modal, multi-turn question answering benchmark (CRAG-MM), we developed a comprehensive framework that integrates tailored retrieval pipelines for different tasks with a unified LLM-tuning approach for hallucination control. Our solution features (1) domain-specific retrieval pipelines handling image-indexed knowledge graphs, web sources, and multi-turn conversations; and (2) advanced refusal training using SFT, DPO, and RL. The system achieved 2nd place in Task 1, 2nd place in Task 2, and 1st place in Task 3, securing the grand prize for excellence in ego-centric queries through superior handling of first-person perspective challenges. 本文介绍了 db3 团队在 KDD Cup'25 的 Meta CRAG-MM Challenge 2025 中的获胜方案。针对该挑战独特的多模态、多轮问答基准（CRAG-MM），我们开发了一个综合框架，将为不同任务量身定制的检索管道与用于幻觉控制的统一 LLM 调优方法相结合。我们的方案特点包括（1）处理图像索引知识图谱、网络资源和多轮对话的领域特定检索管道；以及（2）使用 SFT、DPO 和 RL 的高级拒绝训练。该系统在任务 1 中获得第 2 名，任务 2 中获得第 2 名，在任务 3 中获得第 1 名，并因在以自我为中心的查询中优越地处理第一人称视角挑战而赢得卓越大奖。

Subjects: Information Retrieval, Artificial Intelligence, Computation and Language, Machine Learning 主题：信息检索、人工智能、计算与语言、机器学习

Publish: 2025-08-12 08:27:53 UTC 发布：2025-08-12 08:27:53 UTC

#71 Clip Your Sequences Fairly: Enforcing Length Fairness for Sequence-Level RL #71 公平裁剪你的序列：对序列级 RL 强制执行长度公平性 [PDF 6 ] [Copy] [Kimi 6 ] [REL]

Authors: [Hanyi Mao](https://arxiv.org/search/?searchtype=author&query=Hanyi Mao), [Quanjia Xiao](https://arxiv.org/search/?searchtype=author&query=Quanjia Xiao), [Lei Pang](https://arxiv.org/search/?searchtype=author&query=Lei Pang), [Haixiao Liu](https://arxiv.org/search/?searchtype=author&query=Haixiao Liu) 作者：毛汉毅，肖全佳，庞磊，刘海啸

We propose FSPO (Fair Sequence Policy Optimization), a sequence-level reinforcement learning method for LLMs that enforces length-fair clipping directly in the importance-sampling (IS) weight space. We revisit sequence-level RL methods and identify a mismatch when PPO/GRPO-style clipping is transplanted to sequences: a fixed clip range systematically reweights short vs. long responses, distorting the effective objective. Theoretically, we formalize length fairness via a Length Reweighting Error (LRE) and prove that small LRE yields a directional cosine guarantee between the clipped and true updates. FSPO introduces a simple, Gaussian-motivated remedy: we clip the sequence log-IS ratio with a band that applies a KL-corrected drift term and scales as L−−√. Empirically, FSPO flattens clip rates across length bins, stabilizes training, and outperforms all baselines across multiple evaluation datasets. 我们提出了 FSPO（Fair Sequence Policy Optimization），一种对 LLMs 的序列级强化学习方法，在重要性采样（IS）权重空间中直接强制执行长度公平裁剪。我们重新审视了序列级强化学习方法，并识别出当 PPO/GRPO 风格的裁剪被移植到序列时存在的不匹配：固定的裁剪范围系统性地对短回复与长回复重新加权，扭曲了实际目标。从理论上，我们通过长度重加权误差（Length Reweighting Error，LRE）形式化长度公平性，并证明小的 LRE 会在裁剪更新与真实更新之间产生方向余弦保证。FSPO 引入了一个简单的、高斯动机的补救措施：我们用一个带有 KL 校正漂移项并按 L−−√ 缩放的区间来裁剪序列对数 IS 比率。实证上，FSPO 在各长度区间平滑了裁剪率，稳定了训练，并在多个评估数据集上优于所有基线。

Subject: Machine Learning 主题：机器学习

Publish: 2025-09-11 06:27:10 UTC 发布：2025-09-11 06:27:10 协调世界时

1.2.2 Artificial Intelligence

From：https://papers.cool/arxiv/cs.AI https://arxiv.org/list/cs.AI/recent 2025-09-15 | | 总计：115

#1 Mutual Information Tracks Policy Coherence in Reinforcement Learning #1 互信息追踪强化学习中的策略一致性

Authors: [Cameron Reid](https://arxiv.org/search/?searchtype=author&query=Cameron Reid), [Wael Hafez](https://arxiv.org/search/?searchtype=author&query=Wael Hafez), [Amirhossein Nazeri](https://arxiv.org/search/?searchtype=author&query=Amirhossein Nazeri) 作者：Cameron Reid、Wael Hafez、Amirhossein Nazeri

Reinforcement Learning (RL) agents deployed in real-world environments face degradation from sensor faults, actuator wear, and environmental shifts, yet lack intrinsic mechanisms to detect and diagnose these failures. We present an information-theoretic framework that reveals both the fundamental dynamics of RL and provides practical methods for diagnosing deployment-time anomalies. Through analysis of state-action mutual information patterns in a robotic control task, we first demonstrate that successful learning exhibits characteristic information signatures: mutual information between states and actions steadily increases from 0.84 to 2.83 bits (238% growth) despite growing state entropy, indicating that agents develop increasingly selective attention to task-relevant patterns. Intriguingly, states, actions and next states joint mutual information, MI(S,A;S’), follows an inverted U-curve, peaking during early learning before declining as the agent specializes suggesting a transition from broad exploration to efficient exploitation. More immediately actionable, we show that information metrics can differentially diagnose system failures: observation-space, i.e., states noise (sensor faults) produces broad collapses across all information channels with pronounced drops in state-action coupling, while action-space noise (actuator faults) selectively disrupts action-outcome predictability while preserving state-action relationships. This differential diagnostic capability demonstrated through controlled perturbation experiments enables precise fault localization without architectural modifications or performance degradation. By establishing information patterns as both signatures of learning and diagnostic for system health, we provide the foundation for adaptive RL systems capable of autonomous fault detection and policy adjustment based on information-theoretic principles. 部署在真实环境中的强化学习（RL）智能体会因传感器故障、执行器磨损和环境变化而性能下降，但缺乏内在机制来检测和诊断这些故障。我们提出了一个信息论框架，既揭示了强化学习的基本动态，又为部署时异常诊断提供了实用方法。通过对机器人控制任务中状态-动作互信息模式的分析，我们首先展示了成功学习所具有的典型信息特征：尽管状态熵增加，状态与动作之间的互信息仍稳步从 0.84 比特增长到 2.83 比特（增长 238%），这表明智能体对与任务相关模式的选择性注意不断增强。有趣的是，状态、动作与下一个状态的联合互信息 MI(S,A;S’)呈现倒 U 形曲线，在早期学习阶段达到峰值，然后随着智能体专业化而下降，暗示从广泛探索向高效利用的转变。更具可操作性的是，我们展示了信息度量能够差异化地诊断系统故障：在观测空间，即状态噪声（传感器故障）会在所有信息通道上产生广泛的坍缩，并在状态—动作耦合上出现明显下降；而在动作空间的噪声（执行器故障）则选择性地破坏动作—结果可预测性，同时保留状态—动作关系。通过受控扰动实验展示的这种差异化诊断能力，使得在不修改架构或降低性能的前提下实现精确的故障定位成为可能。通过确立信息模式既是学习的特征又可用于系统健康诊断，我们为基于信息理论原理、具备自主故障检测和策略调整能力的自适应强化学习系统奠定了基础。

Subjects: Artificial Intelligence, Machine Learning, Robotics 主题：人工智能、机器学习、机器人学

Publish: 2025-09-12 17:24:20 UTC 发布：2025-09-12 17:24:20 协调世界时

#2 Abduct, Act, Predict: Scaffolding Causal Inference for Automated Failure Attribution in Multi-Agent Systems #2 逆推、行动、预测：为多主体系统中自动故障归因搭建因果推理脚手架

Subjects: Artificial Intelligence, Computation and Language 主题：人工智能，计算与语言

Publish: 2025-09-12 16:51:15 UTC 发布：2025-09-12 16:51:15 UTC

#3 State Algebra for Propositional Logic #3 状态代数用于命题逻辑

Authors: [Dmitry Lesnik](https://arxiv.org/search/?searchtype=author&query=Dmitry Lesnik), [Tobias Schäfer](https://arxiv.org/search/?searchtype=author&query=Tobias Schäfer) 作者：Dmitry Lesnik、Tobias Schäfer

This paper presents State Algebra, a novel framework designed to represent and manipulate propositional logic using algebraic methods. The framework is structured as a hierarchy of three representations: Set, Coordinate, and Row Decomposition. These representations anchor the system in well-known semantics while facilitating the computation using a powerful algebraic engine. A key aspect of State Algebra is its flexibility in representation. We show that although the default reduction of a state vector is not canonical, a unique canonical form can be obtained by applying a fixed variable order during the reduction process. This highlights a trade-off: by foregoing guaranteed canonicity, the framework gains increased flexibility, potentially leading to more compact representations of certain classes of problems. We explore how this framework provides tools to articulate both search-based and knowledge compilation algorithms and discuss its natural extension to probabilistic logic and Weighted Model Counting. 本文提出了状态代数（State Algebra），一种旨在使用代数方法表示和操作命题逻辑的新框架。该框架结构为三层表示的层次：集合（Set）、坐标（Coordinate）和行分解（Row Decomposition）。这些表示将系统锚定于已知的语义，同时通过强大的代数引擎促进计算。状态代数的一个关键方面是其表示的灵活性。我们展示了尽管状态向量的默认约简并非规范形式，但在约简过程中施加固定变量顺序可以得到唯一的规范形式。这突出了一个权衡：放弃对规范性的保证，框架获得了更大的灵活性，可能导致对某些问题类更紧凑的表示。我们探讨了该框架如何提供工具来描述基于搜索和知识编译的算法，并讨论了其向概率逻辑和加权模型计数（Weighted Model Counting）的自然扩展。

Subjects: Artificial Intelligence, Logic in Computer Science 主题：人工智能，计算机科学中的逻辑

Publish: 2025-09-12 15:05:52 UTC 发布：2025-09-12 15:05:52 UTC

#4 The Morality of Probability: How Implicit Moral Biases in LLMs May Shape the Future of Human-AI Symbiosis #4 概率的道德性：LLMs 中隐含的道德偏见如何可能塑造人类与人工智能共生的未来

Authors: [Eoin O’Doherty](https://arxiv.org/search/?searchtype=author&query=Eoin O’Doherty), [Nicole Weinrauch](https://arxiv.org/search/?searchtype=author&query=Nicole Weinrauch), [Andrew Talone](https://arxiv.org/search/?searchtype=author&query=Andrew Talone), [Uri Klempner](https://arxiv.org/search/?searchtype=author&query=Uri Klempner), [Xiaoyuan Yi](https://arxiv.org/search/?searchtype=author&query=Xiaoyuan Yi), [Xing Xie](https://arxiv.org/search/?searchtype=author&query=Xing Xie), [Yi Zeng](https://arxiv.org/search/?searchtype=author&query=Yi Zeng) 作者：Eoin O’Doherty, Nicole Weinrauch, Andrew Talone, Uri Klempner, Xiaoyuan Yi, Xing Xie, Yi Zeng

Artificial intelligence (AI) is advancing at a pace that raises urgent questions about how to align machine decision-making with human moral values. This working paper investigates how leading AI systems prioritize moral outcomes and what this reveals about the prospects for human-AI symbiosis. We address two central questions: (1) What moral values do state-of-the-art large language models (LLMs) implicitly favour when confronted with dilemmas? (2) How do differences in model architecture, cultural origin, and explainability affect these moral preferences? To explore these questions, we conduct a quantitative experiment with six LLMs, ranking and scoring outcomes across 18 dilemmas representing five moral frameworks. Our findings uncover strikingly consistent value biases. Across all models, Care and Virtue values outcomes were rated most moral, while libertarian choices were consistently penalized. Reasoning-enabled models exhibited greater sensitivity to context and provided richer explanations, whereas non-reasoning models produced more uniform but opaque judgments. This research makes three contributions: (i) Empirically, it delivers a large-scale comparison of moral reasoning across culturally distinct LLMs; (ii) Theoretically, it links probabilistic model behaviour with underlying value encodings; (iii) Practically, it highlights the need for explainability and cultural awareness as critical design principles to guide AI toward a transparent, aligned, and symbiotic future. 人工智能（AI）正以一种引发迫切问题的速度发展——即如何使机器决策与人类道德价值观相一致。本文工作论文研究了领先的人工智能系统在道德结果上的优先排序以及这对人机共生前景的启示。我们探讨两个核心问题：（1）在面临道德困境时，最先进的大型语言模型（LLMs）隐含偏好哪些道德价值？（2）模型架构、文化来源和可解释性差异如何影响这些道德偏好？为探讨这些问题，我们对六个 LLMs 进行了量化实验，对代表五种道德框架的 18 个困境中的结果进行排名和评分。我们的发现揭示了显著一致的价值偏向。在所有模型中，关怀（Care）和美德（Virtue）价值的结果被评为最具道德性，而自由主义（libertarian）选择则持续受到惩罚。具备推理能力的模型对情境更为敏感并提供更丰富的解释，而非推理模型则给出更一致但不透明的判断。这项研究有三方面贡献：（i）在经验上，提供了跨文化不同 LLMs 的道德推理大规模比较；（ii）在理论上，将概率模型行为与其底层价值编码联系起来；（iii）在实践上，强调可解释性和文化意识作为关键设计原则，以引导人工智能走向透明、对齐与共生的未来。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-09-12 14:37:57 UTC 发布：2025-09-12 14:37:57 UTC

#5 Investigating Language Model Capabilities to Represent and Process Formal Knowledge: A Preliminary Study to Assist Ontology Engineering #5 探究语言模型表示与处理形式知识的能力：一项辅助本体工程的初步研究 [PDF 1 ] [Copy] [Kimi ] [REL]

Author: [Hanna Abi Akl](https://arxiv.org/search/?searchtype=author&query=Hanna Abi Akl) 作者：Hanna Abi Akl

Recent advances in Language Models (LMs) have failed to mask their shortcomings particularly in the domain of reasoning. This limitation impacts several tasks, most notably those involving ontology engineering. As part of a PhD research, we investigate the consequences of incorporating formal methods on the performance of Small Language Models (SLMs) on reasoning tasks. Specifically, we aim to orient our work toward using SLMs to bootstrap ontology construction and set up a series of preliminary experiments to determine the impact of expressing logical problems with different grammars on the performance of SLMs on a predefined reasoning task. Our findings show that it is possible to substitute Natural Language (NL) with a more compact logical language while maintaining a strong performance on reasoning tasks and hope to use these results to further refine the role of SLMs in ontology engineering. 最近在语言模型（LM）方面的进展未能掩盖其在推理领域的短板。此局限影响到多项任务，尤其是本体工程相关的任务。作为博士研究的一部分，我们考察了将形式方法引入对小型语言模型（SLMs）在推理任务上表现的影响。具体而言，我们旨在将工作导向使用 SLMs 启动本体构建，并设计一系列初步实验以确定使用不同语法表达逻辑问题对 SLMs 在预设推理任务上表现的影响。我们的研究结果表明，可以用一种更紧凑的逻辑语言替代自然语言（NL），在保持推理任务强劲表现的同时，并希望利用这些结果进一步细化 SLMs 在本体工程中的作用。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-09-12 13:46:43 UTC 发布：2025-09-12 13:46:43 UTC

#6 Compartmentalised Agentic Reasoning for Clinical NLI #6 用于临床自然语言推理的分区代理化推理

Authors: [Maël Jullien](https://arxiv.org/search/?searchtype=author&query=Maël Jullien), [Lei Xu](https://arxiv.org/search/?searchtype=author&query=Lei Xu), [Marco Valentino](https://arxiv.org/search/?searchtype=author&query=Marco Valentino), [André Freitas](https://arxiv.org/search/?searchtype=author&query=André Freitas) 作者：Maël Jullien, Lei Xu, Marco Valentino, André Freitas

A common assumption holds that scaling data and parameters yields increasingly structured, generalisable internal representations. We interrogate this assumption in clinical natural language inference (NLI) by adopting a benchmark decomposed into four reasoning families, Causal Attribution, Compositional Grounding, Epistemic Verification, and Risk State Abstraction, and introducing CARENLI, a Compartmentalised Agentic Reasoning for Clinical NLI that separates knowledge access from principled inference. CARENLI routes each premise, statement pair to a family specific solver and enforces auditable procedures via a planner, verifier, and refiner. Across four LLMs, CARENLI improves fidelity by up to 42 points, reaching 98.0% in Causal Attribution and 81.2% in Risk State Abstraction. Verifiers flag violations with near-ceiling reliability, while refiners correct a substantial share of epistemic errors. Remaining failures cluster in routing, identifying family classification as the main bottleneck. These results show that LLMs often retain relevant facts but default to heuristics when inference is underspecified, a dissociation CARENLI makes explicit while offering a framework for safer, auditable reasoning. 一种普遍的假设认为，扩大数据和参数规模会产生越来越有结构性、可泛化的内部表征。我们在临床自然语言推理（NLI）领域检验这一假设，采用一个分解为四大推理家族的基准：因果归因（Causal Attribution）、组合性归着（Compositional Grounding）、认识论验证（Epistemic Verification）和风险状态抽象（Risk State Abstraction），并引入 CARENLI（一种用于临床 NLI 的分隔式能动推理 Compartmentalised Agentic Reasoning），将知识访问与有原则的推理分离。CARENLI 将每个前提—陈述对路由到特定家族的求解器，并通过规划器、验证器和精炼器实施可审计的程序。在四种 LLMs 上，CARENLI 将保真度提升最多达 42 个百分点，在因果归因上达到 98.0%，在风险状态抽象上达到 81.2%。验证器以接近天花板的可靠性标记违规，精炼器纠正了大量认识论错误。剩余的失败集中在路由上，识别家族分类被确定为主要瓶颈。这些结果表明，LLMs 经常保留相关事实，但在推理不明确时会默认采用启发式方法，CARENLI 在明确这一点的同时提供了一个用于更安全、可审计推理的框架。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-09-12 13:14:47 UTC 发布：2025-09-12 13:14:47 UTC

#7 Towards Fully Automated Molecular Simulations: Multi-Agent Framework for Simulation Setup and Force Field Extraction #7 迈向全自动分子模拟：用于模拟设置和力场提取的多代理框架

Authors: [Marko Petković](https://arxiv.org/search/?searchtype=author&query=Marko Petković), [Vlado Menkovski](https://arxiv.org/search/?searchtype=author&query=Vlado Menkovski), [Sofía Calero](https://arxiv.org/search/?searchtype=author&query=Sofía Calero) 作者：Marko Petković、Vlado Menkovski、Sofía Calero

Automated characterization of porous materials has the potential to accelerate materials discovery, but it remains limited by the complexity of simulation setup and force field selection. We propose a multi-agent framework in which LLM-based agents can autonomously understand a characterization task, plan appropriate simulations, assemble relevant force fields, execute them and interpret their results to guide subsequent steps. As a first step toward this vision, we present a multi-agent system for literature-informed force field extraction and automated RASPA simulation setup. Initial evaluations demonstrate high correctness and reproducibility, highlighting this approach’s potential to enable fully autonomous, scalable materials characterization. 对多孔材料的自动化表征有望加速材料发现，但目前仍受制于模拟设置和力场选择的复杂性。我们提出了一个多智能体框架，其中基于 LLM 的代理可以自主理解表征任务、规划适当的模拟、组装相关力场、执行模拟并解释其结果以指导后续步骤。作为实现这一愿景的第一步，我们提出了一个用于基于文献提取力场并自动化 RASPA 模拟设置的多智能体系统。初步评估表明其具有高度的正确性和可重复性，突显了该方法在实现完全自主、可扩展的材料表征方面的潜力。

Subjects: Artificial Intelligence, Multiagent Systems 主题：人工智能，多智能体系统

Publish: 2025-09-12 12:56:47 UTC 发表：2025-09-12 12:56:47 UTC

#8 Online Robust Planning under Model Uncertainty: A Sample-Based Approach #8 在模型不确定性下的在线鲁棒规划：一种基于样本的方法

Authors: [Tamir Shazman](https://arxiv.org/search/?searchtype=author&query=Tamir Shazman), [Idan Lev-Yehudi](https://arxiv.org/search/?searchtype=author&query=Idan Lev-Yehudi), [Ron Benchetit](https://arxiv.org/search/?searchtype=author&query=Ron Benchetit), [Vadim Indelman](https://arxiv.org/search/?searchtype=author&query=Vadim Indelman) 作者：Tamir Shazman、Idan Lev-Yehudi、Ron Benchetit、Vadim Indelman

Online planning in Markov Decision Processes (MDPs) enables agents to make sequential decisions by simulating future trajectories from the current state, making it well-suited for large-scale or dynamic environments. Sample-based methods such as Sparse Sampling and Monte Carlo Tree Search (MCTS) are widely adopted for their ability to approximate optimal actions using a generative model. However, in practical settings, the generative model is often learned from limited data, introducing approximation errors that can degrade performance or lead to unsafe behaviors. To address these challenges, Robust MDPs (RMDPs) offer a principled framework for planning under model uncertainty, yet existing approaches are typically computationally intensive and not suited for real-time use. In this work, we introduce Robust Sparse Sampling (RSS), the first online planning algorithm for RMDPs with finite-sample theoretical performance guarantees. Unlike Sparse Sampling, which estimates the nominal value function, RSS computes a robust value function by leveraging the efficiency and theoretical properties of Sample Average Approximation (SAA), enabling tractable robust policy computation in online settings. RSS is applicable to infinite or continuous state spaces, and its sample and computational complexities are independent of the state space size. We provide theoretical performance guarantees and empirically show that RSS outperforms standard Sparse Sampling in environments with uncertain dynamics. 在马尔可夫决策过程（MDP）中，在线规划使智能体能够通过从当前状态模拟未来轨迹来做出序贯决策，因此非常适合大规模或动态环境。基于采样的方法如稀疏采样（Sparse Sampling）和蒙特卡洛树搜索（MCTS）因其能够使用生成模型近似最优动作而被广泛采用。然而，在实际场景中，生成模型通常是从有限数据中学习得到的，这会引入近似误差，进而可能降低性能或导致不安全行为。为了解决这些挑战，鲁棒 MDP（RMDP）为在模型不确定性下的规划提供了一个有原则的框架，但现有方法通常计算开销大，不适合实时使用。在这项工作中，我们提出了鲁棒稀疏采样（Robust Sparse Sampling，RSS），这是首个针对 RMDP 的具有有限样本理论性能保证的在线规划算法。与稀疏采样不同，稀疏稳健采样（RSS）通过利用样本均值近似（SAA）的高效性和理论属性来计算稳健值函数，从而实现在线设置下可解的稳健策略计算。RSS 适用于无限或连续状态空间，其样本和计算复杂度与状态空间大小无关。我们提供了理论性能保证，并在实验上证明，在动力学不确定的环境中，RSS 优于标准的稀疏采样。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-09-12 11:41:23 UTC 发布：2025-09-12 11:41:23 UTC

#9 Virtual Agent Economies #9 虚拟代理经济

Authors: [Nenad Tomasev](https://arxiv.org/search/?searchtype=author&query=Nenad Tomasev), [Matija Franklin](https://arxiv.org/search/?searchtype=author&query=Matija Franklin), [Joel Z. Leibo](https://arxiv.org/search/?searchtype=author&query=Joel Z. Leibo), [Julian Jacobs](https://arxiv.org/search/?searchtype=author&query=Julian Jacobs), [William A. Cunningham](https://arxiv.org/search/?searchtype=author&query=William A. Cunningham), [Iason Gabriel](https://arxiv.org/search/?searchtype=author&query=Iason Gabriel), [Simon Osindero](https://arxiv.org/search/?searchtype=author&query=Simon Osindero) 作者：Nenad Tomasev, Matija Franklin, Joel Z. Leibo, Julian Jacobs, William A. Cunningham, Iason Gabriel, Simon Osindero

The rapid adoption of autonomous AI agents is giving rise to a new economic layer where agents transact and coordinate at scales and speeds beyond direct human oversight. We propose the “sandbox economy” as a framework for analyzing this emergent system, characterizing it along two key dimensions: its origins (emergent vs. intentional) and its degree of separateness from the established human economy (permeable vs. impermeable). Our current trajectory points toward a spontaneous emergence of a vast and highly permeable AI agent economy, presenting us with opportunities for an unprecedented degree of coordination as well as significant challenges, including systemic economic risk and exacerbated inequality. Here we discuss a number of possible design choices that may lead to safely steerable AI agent markets. In particular, we consider auction mechanisms for fair resource allocation and preference resolution, the design of AI “mission economies” to coordinate around achieving collective goals, and socio-technical infrastructure needed to ensure trust, safety, and accountability. By doing this, we argue for the proactive design of steerable agent markets to ensure the coming technological shift aligns with humanity’s long-term collective flourishing. 自主 AI 代理的快速普及催生了一个新的经济层面，代理以超出直接人工监督的规模和速度进行交易与协调。我们提出“沙盒经济”作为分析这一新兴系统的框架，从两个关键维度对其进行刻画：其起源（自发涌现 vs. 有意设计）以及其与既有人工经济的分离程度（可渗透 vs. 不可渗透）。我们目前的发展轨迹指向一个自发涌现的、规模巨大且高度可渗透的 AI 代理经济，这既为前所未有的协调程度提供了机遇，也带来了重大挑战，包括系统性经济风险和不平等加剧。在此我们讨论了若干可能的设计选择，可能导致可安全引导的 AI 代理市场。特别地，我们考察了用于公平资源分配和偏好解决的拍卖机制、为实现集体目标而围绕协调设计的 AI “使命经济”、以及确保信任、安全与问责所需的社会技术基础设施。通过这样做，我们主张积极设计可引导的代理市场，以确保即将到来的技术变革与人类的长期集体繁荣相一致。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-09-12 11:20:11 UTC 发布：2025-09-12 11:20:11 协调世界时（UTC）

#10 AI Harmonics: a human-centric and harms severity-adaptive AI risk assessment framework #10 AI Harmonics：以人为本且依据危害严重性自适应的 AI 风险评估框架

Authors: [Sofia Vei](https://arxiv.org/search/?searchtype=author&query=Sofia Vei), [Paolo Giudici](https://arxiv.org/search/?searchtype=author&query=Paolo Giudici), [Pavlos Sermpezis](https://arxiv.org/search/?searchtype=author&query=Pavlos Sermpezis), [Athena Vakali](https://arxiv.org/search/?searchtype=author&query=Athena Vakali), [Adelaide Emma Bernardelli](https://arxiv.org/search/?searchtype=author&query=Adelaide Emma Bernardelli) 作者：Sofia Vei、Paolo Giudici、Pavlos Sermpezis、Athena Vakali、Adelaide Emma Bernardelli

The absolute dominance of Artificial Intelligence (AI) introduces unprecedented societal harms and risks. Existing AI risk assessment models focus on internal compliance, often neglecting diverse stakeholder perspectives and real-world consequences. We propose a paradigm shift to a human-centric, harm-severity adaptive approach grounded in empirical incident data. We present AI Harmonics, which includes a novel AI harm assessment metric (AIH) that leverages ordinal severity data to capture relative impact without requiring precise numerical estimates. AI Harmonics combines a robust, generalized methodology with a data-driven, stakeholder-aware framework for exploring and prioritizing AI harms. Experiments on annotated incident data confirm that political and physical harms exhibit the highest concentration and thus warrant urgent mitigation: political harms erode public trust, while physical harms pose serious, even life-threatening risks, underscoring the real-world relevance of our approach. Finally, we demonstrate that AI Harmonics consistently identifies uneven harm distributions, enabling policymakers and organizations to target their mitigation efforts effectively. 人工智能（AI）的绝对主导地位带来了前所未有的社会危害和风险。现有的 AI 风险评估模型侧重于内部合规，常常忽视多样化的利益相关者视角和现实世界的后果。我们提出一种以人为本、根据伤害严重性自适应的新范式，基于实证事件数据。我们提出了 AI Harmonics，其中包括一种新颖的 AI 伤害评估指标（AIH），该指标利用序数严重性数据来捕捉相对影响，而无需精确的数值估计。AI Harmonics 将稳健且普适的方法论与数据驱动、关注利益相关者的框架相结合，用于探索和优先排序 AI 造成的伤害。对带注释的事件数据的实验表明，政治性和物理性伤害的集中度最高，因此需要紧急缓解：政治性伤害侵蚀公众信任，而物理性伤害则构成严重甚至危及生命的风险，突显了我们方法的现实相关性。最后，我们证明 AI Harmonics 能一致识别不均匀的伤害分布，从而使政策制定者和组织能够有效地针对其缓解工作。

Subjects: Artificial Intelligence, Methodology 主题：人工智能，方法论

Publish: 2025-09-12 09:52:45 UTC 发布：2025-09-12 09:52:45 UTC

#11 XAgents: A Unified Framework for Multi-Agent Cooperation via IF-THEN Rules and Multipolar Task Processing Graph #11 XAgents：通过若则规则和多极任务处理图实现多智能体协作的统一框架

Authors: [Hailong Yang](https://arxiv.org/search/?searchtype=author&query=Hailong Yang), [Mingxian Gu](https://arxiv.org/search/?searchtype=author&query=Mingxian Gu), [Jianqi Wang](https://arxiv.org/search/?searchtype=author&query=Jianqi Wang), [Guanjin Wang](https://arxiv.org/search/?searchtype=author&query=Guanjin Wang), [Zhaohong Deng](https://arxiv.org/search/?searchtype=author&query=Zhaohong Deng) 作者：杨海龙、顾明贤、王剑琦、王冠今、邓朝宏

The rapid advancement of Large Language Models (LLMs) has significantly enhanced the capabilities of Multi-Agent Systems (MAS) in supporting humans with complex, real-world tasks. However, MAS still face challenges in effective task planning when handling highly complex tasks with uncertainty, often resulting in misleading or incorrect outputs that hinder task execution. To address this, we propose XAgents, a unified multi-agent cooperative framework built on a multipolar task processing graph and IF-THEN rules. XAgents uses the multipolar task processing graph to enable dynamic task planning and handle task uncertainty. During subtask processing, it integrates domain-specific IF-THEN rules to constrain agent behaviors, while global rules enhance inter-agent collaboration. We evaluate the performance of XAgents across three distinct datasets, demonstrating that it consistently surpasses state-of-the-art single-agent and multi-agent approaches in both knowledge-typed and logic-typed question-answering tasks. The codes for XAgents are available at: https://github.com/AGI-FHBC/XAgents. 大型语言模型（LLMs）的快速发展显著提升了多智能体系统（MAS）在支持人类完成复杂现实任务方面的能力。然而，MAS 在处理具有不确定性的高度复杂任务时，仍然面临有效任务规划的挑战，常常导致误导性或错误的输出，从而妨碍任务执行。为了解决这一问题，我们提出了 XAgents——一个基于多极任务处理图和 IF-THEN 规则构建的统一多智能体协作框架。XAgents 利用多极任务处理图实现动态任务规划并处理任务不确定性。在子任务处理过程中，它融入了领域特定的 IF-THEN 规则以约束智能体行为，同时全局规则增强了智能体间的协作。我们在三个不同数据集上评估了 XAgents 的性能，结果表明其在知识型和逻辑型问答任务中均持续超越最先进的单智能体和多智能体方法。XAgents 的代码可在以下地址获取： https://github.com/AGI-FHBC/XAgents。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-09-12 08:40:58 UTC 发布：2025-09-12 08:40:58 UTC

#12 GAMA: A General Anonymizing Multi-Agent System for Privacy Preservation Enhanced by Domain Rules and Disproof Method #12 GAMA：一种通用匿名多智能体系统，通过领域规则和反证方法增强隐私保护

Authors: [Hailong Yang](https://arxiv.org/search/?searchtype=author&query=Hailong Yang), [Renhuo Zhao](https://arxiv.org/search/?searchtype=author&query=Renhuo Zhao), [Guanjin Wang](https://arxiv.org/search/?searchtype=author&query=Guanjin Wang), [Zhaohong Deng](https://arxiv.org/search/?searchtype=author&query=Zhaohong Deng) 作者：杨海龙、赵仁火、王冠津、邓昭红

With the rapid advancement of Large Language Model (LLM), LLM-based agents exhibit exceptional abilities in understanding and generating natural language, facilitating human-like collaboration and information transmission in LLM-based Multi-Agent System (MAS). High-performance LLMs are often hosted on remote servers in public spaces. When tasks involve privacy data, MAS cannot securely utilize these LLMs without implementing privacy-preserving mechanisms. To address this challenge, we propose a General Anonymizing Multi-Agent system (GAMA), which divides the agents’ workspace into private and public spaces and protects privacy through the anonymizing mechanism. In the private space, agents handle sensitive data, while in the public space, only anonymized data is utilized. GAMA incorporates two key modules to mitigate semantic loss caused by anonymization: Domain-Rule-based Knowledge Enhancement (DRKE) and Disproof-based Logic Enhancement (DLE). We evaluate GAMA on two public question-answering datasets: Trivia Creative Writing and Logic Grid Puzzle. The results demonstrate that GAMA has superior performance compared to the state-of-the-art models. To further assess its privacy-preserving capabilities, we designed two new datasets: Knowledge Privacy Preservation and Logic Privacy Preservation. The final results highlight GAMA’s exceptional effectiveness in both task processing and privacy preservation. 随着大型语言模型（LLM）的快速发展，基于 LLM 的智能体展现出卓越的自然语言理解与生成能力，在基于 LLM 的多智能体系统（MAS）中实现了类人协作与信息传递。高性能 LLM 通常部署于公共空间的远程服务器上。当任务涉及隐私数据时，若不实施隐私保护机制，MAS 将无法安全使用这些 LLM。为应对这一挑战，我们提出通用匿名化多智能体系统（GAMA），该系统将智能体工作空间划分为私有空间与公共空间，并通过匿名化机制保护隐私。私有空间用于处理敏感数据，公共空间则仅使用匿名化数据。GAMA 整合两大核心模块以缓解匿名化引发的语义损失：基于领域规则的知识增强（DRKE）与基于反证的逻辑增强（DLE）。我们在两个公开问答数据集（Trivia Creative Writing 与 Logic Grid Puzzle）上评估 GAMA，结果表明其性能显著优于现有最先进模型。为了进一步评估其隐私保护能力，我们设计了两个新的数据集：知识隐私保护和逻辑隐私保护。最终结果凸显了 GAMA 在任务处理和隐私保护方面的卓越有效性。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-09-12 07:22:49 UTC 发布：2025-09-12 07:22:49 协调世界时 (UTC)

#13 Evaluation of Black-Box XAI Approaches for Predictors of Values of Boolean Formulae #13 评估用于布尔公式值预测器的黑盒可解释性人工智能（XAI）方法

Authors: [Stav Armoni-Friedmann](https://arxiv.org/search/?searchtype=author&query=Stav Armoni-Friedmann), [Hana Chockler](https://arxiv.org/search/?searchtype=author&query=Hana Chockler), [David A. Kelly](https://arxiv.org/search/?searchtype=author&query=David A. Kelly) 作者：Stav Armoni-Friedmann、Hana Chockler、David A. Kelly

Evaluating explainable AI (XAI) approaches is a challenging task in general, due to the subjectivity of explanations. In this paper, we focus on tabular data and the specific use case of AI models predicting the values of Boolean functions. We extend the previous work in this domain by proposing a formal and precise measure of importance of variables based on actual causality, and we evaluate state-of-the-art XAI tools against this measure. We also present a novel XAI tool B-ReX, based on the existing tool ReX, and demonstrate that it is superior to other black-box XAI tools on a large-scale benchmark. Specifically, B-ReX achieves a Jensen-Shannon divergence of 0.072 ± 0.012 on random 10-valued Boolean formulae 评估可解释性人工智能（XAI）方法总体上是一项具有挑战性的任务，因为解释具有主观性。在本文中，我们聚焦于表格数据以及 AI 模型预测布尔函数值的特定用例。我们通过基于实际因果关系提出一个形式化且精确的变量重要性度量来扩展该领域的先前工作，并根据该度量评估了最先进的 XAI 工具。我们还提出了一个新型 XAI 工具 B-ReX，基于现有工具 ReX，并演示了其在大规模基准测试中优于其他黑盒 XAI 工具。具体而言，B-ReX 在随机的 10 值布尔公式上实现了 0.072 ± 0.012 的 Jensen–Shannon 散度

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-09-12 05:52:47 UTC 发表时间：2025-09-12 05:52:47 UTC

#14 A Markovian Framing of WaveFunctionCollapse for Procedurally Generating Aesthetically Complex Environments #14 将波函数塌缩的程序化生成美学复杂环境的马尔可夫框架

Procedural content generation often requires satisfying both designer-specified objectives and adjacency constraints implicitly imposed by the underlying tile set. To address the challenges of jointly optimizing both constraints and objectives, we reformulate WaveFunctionCollapse (WFC) as a Markov Decision Process (MDP), enabling external optimization algorithms to focus exclusively on objective maximization while leveraging WFC’s propagation mechanism to enforce constraint satisfaction. We empirically compare optimizing this MDP to traditional evolutionary approaches that jointly optimize global metrics and local tile placement. Across multiple domains with various difficulties, we find that joint optimization not only struggles as task complexity increases, but consistently underperforms relative to optimization over the WFC-MDP, underscoring the advantages of decoupling local constraint satisfaction from global objective optimization. 过程化内容生成通常需要同时满足设计师指定的目标与底层瓦片集隐含施加的邻接约束。为应对约束与目标的联合优化难题，我们将波函数坍缩（WFC）重构为马尔可夫决策过程（MDP），使外部优化算法能专注于目标最大化，同时借助 WFC 的传播机制强制满足约束条件。我们通过实证对比，将该 MDP 优化方案与传统进化算法（后者同时优化全局指标与局部瓦片布局）进行比较。在多个难度各异的领域中，我们发现联合优化不仅在任务复杂度提升时表现不佳，其效果始终逊于基于 WFC-MDP 的优化方案，这充分证明了将局部约束满足与全局目标优化解耦的显著优势。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-09-12 01:51:01 UTC 发布日期：2025-09-12 01:51:01 UTC

#15 The (R)evolution of Scientific Workflows in the Agentic AI Era: Towards Autonomous Science #15 在具代理性的人工智能时代，科学工作流程的（革新）革命：迈向自主科学

Authors: [Woong Shin](https://arxiv.org/search/?searchtype=author&query=Woong Shin), [Renan Souza](https://arxiv.org/search/?searchtype=author&query=Renan Souza), [Daniel Rosendo](https://arxiv.org/search/?searchtype=author&query=Daniel Rosendo), [Frédéric Suter](https://arxiv.org/search/?searchtype=author&query=Frédéric Suter), [Feiyi Wang](https://arxiv.org/search/?searchtype=author&query=Feiyi Wang), [Prasanna Balaprakash](https://arxiv.org/search/?searchtype=author&query=Prasanna Balaprakash), [Rafael Ferreira da Silva](https://arxiv.org/search/?searchtype=author&query=Rafael Ferreira da Silva) 作者：Woong Shin、Renan Souza、Daniel Rosendo、Frédéric Suter、Feiyi Wang、Prasanna Balaprakash、Rafael Ferreira da Silva

Modern scientific discovery increasingly requires coordinating distributed facilities and heterogeneous resources, forcing researchers to act as manual workflow coordinators rather than scientists. Advances in AI leading to AI agents show exciting new opportunities that can accelerate scientific discovery by providing intelligence as a component in the ecosystem. However, it is unclear how this new capability would materialize and integrate in the real world. To address this, we propose a conceptual framework where workflows evolve along two dimensions which are intelligence (from static to intelligent) and composition (from single to swarm) to chart an evolutionary path from current workflow management systems to fully autonomous, distributed scientific laboratories. With these trajectories in mind, we present an architectural blueprint that can help the community take the next steps towards harnessing the opportunities in autonomous science with the potential for 100x discovery acceleration and transformational scientific workflows. 现代科学发现日益需要协调分布式设施与异构资源，迫使研究人员充当人工工作流协调者而非科学家。人工智能的进步催生了智能代理，这种将智能作为生态系统组成部分的新模式，为加速科学发现带来了令人振奋的新机遇。然而，这种新能力如何在现实世界中实现并整合仍不明朗。为此，我们提出一个概念框架：工作流将沿智能（从静态到智能）与组合（从单体到群集）双维度演进，从而规划从当前工作流管理系统向完全自主的分布式科学实验室进化的路径。基于上述发展轨迹，我们提出架构蓝图，助力科研界把握自主科学带来的机遇——该技术有望实现 100 倍的发现加速与变革性工作流转型。

Subjects: Artificial Intelligence, Distributed, Parallel, and Cluster Computing 主题：人工智能，分布式、并行与群集计算

Publish: 2025-09-12 01:14:34 UTC 发布：2025-09-12 01:14:34 协调世界时

#16 LLMs as Agentic Cooperative Players in Multiplayer UNO #16 LLMs 作为多人 UNO 中具有代理性的合作玩家 [PDF 2 ] [Copy] [Kimi 2 ] [REL]

Subjects: Artificial Intelligence, Computation and Language 主题：人工智能，计算与语言

Publish: 2025-09-11 21:42:33 UTC 发布：2025-09-11 21:42:33 UTC

#17 Towards an AI-based knowledge assistant for goat farmers based on Retrieval-Augmented Generation #17 朝向基于检索增强生成的山羊养殖者 AI 知识助理

Authors: [Nana Han](https://arxiv.org/search/?searchtype=author&query=Nana Han), [Dong Liu](https://arxiv.org/search/?searchtype=author&query=Dong Liu), [Tomas Norton](https://arxiv.org/search/?searchtype=author&query=Tomas Norton) 作者：韩娜娜、刘东、托马斯·诺顿

Large language models (LLMs) are increasingly being recognised as valuable knowledge communication tools in many industries. However, their application in livestock farming remains limited, being constrained by several factors not least the availability, diversity and complexity of knowledge sources. This study introduces an intelligent knowledge assistant system designed to support health management in farmed goats. Leveraging the Retrieval-Augmented Generation (RAG), two structured knowledge processing methods, table textualization and decision-tree textualization, were proposed to enhance large language models’ (LLMs) understanding of heterogeneous data formats. Based on these methods, a domain-specific goat farming knowledge base was established to improve LLM’s capacity for cross-scenario generalization. The knowledge base spans five key domains: Disease Prevention and Treatment, Nutrition Management, Rearing Management, Goat Milk Management, and Basic Farming Knowledge. Additionally, an online search module is integrated to enable real-time retrieval of up-to-date information. To evaluate system performance, six ablation experiments were conducted to examine the contribution of each component. The results demonstrated that heterogeneous knowledge fusion method achieved the best results, with mean accuracies of 87.90% on the validation set and 84.22% on the test set. Across the text-based, table-based, decision-tree based Q&A tasks, accuracy consistently exceeded 85%, validating the effectiveness of structured knowledge fusion within a modular design. Error analysis identified omission as the predominant error category, highlighting opportunities to further improve retrieval coverage and context integration. In conclusion, the results highlight the robustness and reliability of the proposed system for practical applications in goat farming. 大型语言模型（LLMs）正日益被众多行业视为宝贵的知识传播工具。然而，其在畜牧业的应用仍受限于多重因素，其中知识来源的可获取性、多样性与复杂性尤为突出。本研究提出一种智能知识辅助系统，旨在支持养殖山羊的健康管理。通过运用检索增强生成（RAG）技术，本研究提出两种结构化知识处理方法——表格文本化与决策树文本化，以增强大型语言模型对异构数据格式的理解能力。基于这些方法，建立了针对山羊养殖的领域专用知识库，从而提升了大型语言模型跨场景泛化能力。该知识库涵盖五大核心领域：疾病防治、营养管理、饲养管理、山羊奶管理及基础养殖知识。同时集成在线检索模块，实现实时更新信息的即时获取。为评估系统性能，开展了六项消融实验以考察各组件的贡献。结果表明异构知识融合方法表现最佳，在验证集上平均准确率达 87.90%，测试集上达 84.22%。在基于文本、表格及决策树的问答任务中，准确率始终保持在 85%以上，验证了模块化设计中结构化知识融合的有效性。误差分析表明遗漏是主要错误类型，凸显了进一步提升检索覆盖率与上下文整合能力的改进空间。综上所述，研究结果彰显了该系统在山羊养殖实践应用中的稳健性与可靠性。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-09-11 20:58:51 UTC 发布时间：2025-09-11 20:58:51 UTC

#18 Towards a Common Framework for Autoformalization #18 朝向自动形式化的通用框架

Authors: [Agnieszka Mensfelt](https://arxiv.org/search/?searchtype=author&query=Agnieszka Mensfelt), [David Tena Cucala](https://arxiv.org/search/?searchtype=author&query=David Tena Cucala), [Santiago Franco](https://arxiv.org/search/?searchtype=author&query=Santiago Franco), [Angeliki Koutsoukou-Argyraki](https://arxiv.org/search/?searchtype=author&query=Angeliki Koutsoukou-Argyraki), [Vince Trencsenyi](https://arxiv.org/search/?searchtype=author&query=Vince Trencsenyi), [Kostas Stathis](https://arxiv.org/search/?searchtype=author&query=Kostas Stathis) 作者：Agnieszka Mensfelt、David Tena Cucala、Santiago Franco、Angeliki Koutsoukou-Argyraki、Vince Trencsenyi、Kostas Stathis

Autoformalization has emerged as a term referring to the automation of formalization - specifically, the formalization of mathematics using interactive theorem provers (proof assistants). Its rapid development has been driven by progress in deep learning, especially large language models (LLMs). More recently, the term has expanded beyond mathematics to describe the broader task of translating informal input into formal logical representations. At the same time, a growing body of research explores using LLMs to translate informal language into formal representations for reasoning, planning, and knowledge representation - often without explicitly referring to this process as autoformalization. As a result, despite addressing similar tasks, the largely independent development of these research areas has limited opportunities for shared methodologies, benchmarks, and theoretical frameworks that could accelerate progress. The goal of this paper is to review - explicit or implicit - instances of what can be considered autoformalization and to propose a unified framework, encouraging cross-pollination between different fields to advance the development of next generation AI systems. 自动形式化（autoformalization）已成为指代形式化自动化的术语——具体来说，是使用交互式定理证明器（证明助手）对数学进行形式化。其快速发展由深度学习的进展推动，尤其是大型语言模型（LLMs）。最近，该术语的含义已超出数学范畴，用来描述将非正式输入翻译为形式逻辑表示的更广泛任务。与此同时，越来越多的研究探索使用 LLMs 将非正式语言翻译为用于推理、规划和知识表示的形式表示——通常并不明确将这一过程称为自动形式化。因此，尽管处理的任务相似，这些研究领域在很大程度上的独立发展限制了共享方法论、基准和理论框架的机会，而这些本可加速进展。本文的目标是回顾——无论明示或暗示——可被视为自动形式化的实例，并提出一个统一框架，鼓励不同领域之间的交叉交流，以推动下一代人工智能系统的发展。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-09-11 19:28:56 UTC 发布：2025-09-11 19:28:56 协调世界时（UTC）

#19 A Modular and Multimodal Generative AI Framework for Urban Building Energy Data: Generating Synthetic Homes #19 一个用于城市建筑能耗数据的模块化多模态生成式人工智能框架：生成合成住宅 [PDF 1 ] [复制] [Kimi ] [关联]

Authors: [Jackson Eshbaugh](https://arxiv.org/search/?searchtype=author&query=Jackson Eshbaugh), [Chetan Tiwari](https://arxiv.org/search/?searchtype=author&query=Chetan Tiwari), [Jorge Silveyra](https://arxiv.org/search/?searchtype=author&query=Jorge Silveyra) 作者：Jackson Eshbaugh、Chetan Tiwari、Jorge Silveyra

Computational models have emerged as powerful tools for energy modeling research, touting scalability and quantitative results. However, these models require a plethora of data, some of which is inaccessible, expensive, or raises privacy concerns. We introduce a modular multimodal framework to produce this data from publicly accessible residential information and images using generative artificial intelligence (AI). Additionally, we provide a pipeline demonstrating this framework, and we evaluate its generative AI components. Our experiments show that our framework’s use of AI avoids common issues with generative models. Our framework produces realistic, labeled data. By reducing dependence on costly or restricted data sources, we pave a path towards more accessible and reproducible research. 计算模型已成为能源建模研究的强大工具，宣称具备可扩展性和定量结果。然而，这些模型需要大量数据，其中一些数据无法获取、成本高昂或存在隐私问题。我们提出了一个模块化的多模态框架，利用生成式人工智能（AI）从公开可访问的住宅信息和图像中生成这些数据。此外，我们提供了演示该框架的流水线，并评估了其生成式 AI 组件。我们的实验表明，框架中对 AI 的使用避免了生成模型的常见问题，能够生成真实且带标签的数据。通过减少对昂贵或受限数据源的依赖，我们为更易获取和可重复的研究铺平了道路。

Subjects: Artificial Intelligence, Machine Learning 主题：人工智能、机器学习

Publish: 2025-09-11 18:53:21 UTC 发布：2025-09-11 18:53:21 UTC

#20 How well can LLMs provide planning feedback in grounded environments? #20 在有根环境中，LLMs 在提供规划反馈方面表现如何？

Authors: [Yuxuan Li](https://arxiv.org/search/?searchtype=author&query=Yuxuan Li), [Victor Zhong](https://arxiv.org/search/?searchtype=author&query=Victor Zhong) 作者：李宇轩，Victor Zhong

Learning to plan in grounded environments typically requires carefully designed reward functions or high-quality annotated demonstrations. Recent works show that pretrained foundation models, such as large language models (LLMs) and vision language models (VLMs), capture background knowledge helpful for planning, which reduces the amount of reward design and demonstrations needed for policy learning. We evaluate how well LLMs and VLMs provide feedback across symbolic, language, and continuous control environments. We consider prominent types of feedback for planning including binary feedback, preference feedback, action advising, goal advising, and delta action feedback. We also consider inference methods that impact feedback performance, including in-context learning, chain-of-thought, and access to environment dynamics. We find that foundation models can provide diverse high-quality feedback across domains. Moreover, larger and reasoning models consistently provide more accurate feedback, exhibit less bias, and benefit more from enhanced inference methods. Finally, feedback quality degrades for environments with complex dynamics or continuous state spaces and action spaces. 在有根环境中学习规划通常需要精心设计的奖励函数或高质量的带注释示范。近期研究表明，预训练的基础模型（例如大型语言模型（LLMs）和视觉语言模型（VLMs））蕴含对规划有帮助的背景知识，从而减少了策略学习所需的奖励设计和示范数量。我们评估了 LLMs 和 VLMs 在符号、语言和连续控制环境中提供反馈的能力。我们考虑了对规划至关重要的典型反馈类型，包括二元反馈、偏好反馈、动作建议、目标建议和动作增量反馈。我们也考察了影响反馈性能的推理方法，包括上下文学习、链式思维以及对环境动力学的访问。研究发现，基础模型能够在各个领域提供多样且高质量的反馈。此外，更大且具推理能力的模型持续提供更准确的反馈、表现出更少的偏差，并且更能从增强的推理方法中受益。最后，当环境具有复杂动力学或连续的状态空间与动作空间时，反馈质量会下降。

Subject: Artificial Intelligence 主题：人工智能

Publish: 2025-09-11 18:51:26 UTC 发布：2025-09-11 18:51:26 UTC

#21 Executable Ontologies: Synthesizing Event Semantics with Dataflow Architecture #21 可执行本体：用数据流架构合成事件语义

Author: [Aleksandr Boldachev](https://arxiv.org/search/?searchtype=author&query=Aleksandr Boldachev) 作者：Aleksandr Boldachev

Publish: 2025-09-11 18:12:46 UTC 发布：2025-09-11 18:12:46 UTC

#22 Human-AI Collaboration Increases Efficiency in Regulatory Writing #22 人机协作提高了法规写作的效率

Background: Investigational New Drug (IND) application preparation is time-intensive and expertise-dependent, slowing early clinical development. Objective: To evaluate whether a large language model (LLM) platform (AutoIND) can reduce first-draft composition time while maintaining document quality in regulatory submissions. Methods: Drafting times for IND nonclinical written summaries (eCTD modules 2.6.2, 2.6.4, 2.6.6) generated by AutoIND were directly recorded. For comparison, manual drafting times for IND summaries previously cleared by the U.S. FDA were estimated from the experience of regulatory writers (≥6 years) and used as industry-standard benchmarks. Quality was assessed by a blinded regulatory writing assessor using seven pre-specified categories: correctness, completeness, conciseness, consistency, clarity, redundancy, and emphasis. Each sub-criterion was scored 0-3 and normalized to a percentage. A critical regulatory error was defined as any misrepresentation or omission likely to alter regulatory interpretation (e.g., incorrect NOAEL, omission of mandatory GLP dose-formulation analysis). Results: AutoIND reduced initial drafting time by ∼97% (from ∼100 h to 3.7 h for 18,870 pages/61 reports in IND-1; and to 2.6 h for 11,425 pages/58 reports in IND-2). Quality scores were 69.6% and 77.9% for IND-1 and IND-2. No critical regulatory errors were detected, but deficiencies in emphasis, conciseness, and clarity were noted. Conclusions: AutoIND can dramatically accelerate IND drafting, but expert regulatory writers remain essential to mature outputs to submission-ready quality. Systematic deficiencies identified provide a roadmap for targeted model improvements. 背景：研究性新药（IND）申请准备耗时且依赖专业知识，延缓了早期临床开发。目的：评估大型语言模型（LLM）平台（AutoIND）能否在保持监管提交文件质量的同时，缩短首稿撰写时间。方法：直接记录由 AutoIND 生成的 IND 非临床书面摘要（eCTD 模块 2.6.2、2.6.4、2.6.6）的起草时间。作为比较，依据法规撰写人员的经验估算以前已获美国 FDA 批准的 IND 摘要的手工起草时间（ ≥ 6 年），并用作行业标准基准。质量由一名盲法的法规写作评估员根据七个预先指定的类别评估：正确性、完整性、简洁性、一致性、清晰度、冗余性和重点突出。每个子标准按 0-3 评分并标准化为百分比。关键法规错误定义为任何可能改变监管解读的错误陈述或遗漏（例如：不正确的 NOAEL，遗漏强制性 GLP 剂型分析）。结果：AutoIND 将初始起草时间减少了 ∼ 97%（在 IND-1 中从 ∼ 100 小时减少到 3.7 小时，处理 18,870 页/61 份报告；在 IND-2 中减少到 2.6 小时，处理 11,425 页/58 份报告）。质量评分在 IND-1 和 IND-2 中分别为 69.6% 和 77.9%。未发现关键监管错误，但注意到在重点、简洁性和清晰度方面存在不足。结论：AutoIND 能显著加速 IND 起草，但经验丰富的监管文案撰写者仍然对将产出成熟到提交就绪质量至关重要。所识别的系统性不足为针对性改进模型提供了路线图。

Subjects: Artificial Intelligence, Quantitative Methods 对象：人工智能，定量方法

Publish: 2025-09-10 18:02:23 UTC 发表：2025-09-10 18:02:23 UTC

#23 Standards in the Preparation of Biomedical Research Metadata: A Bridge2AI Perspective #23 生物医学研究元数据准备标准：Bridge2AI 视角

AI-readiness describes the degree to which data may be optimally and ethically used for subsequent AI and Machine Learning (AI/ML) methods, where those methods may involve some combination of model training, data classification, and ethical, explainable prediction. The Bridge2AI consortium has defined the particular criteria a biomedical dataset may possess to render it AI-ready: in brief, a dataset’s readiness is related to its FAIRness, provenance, degree of characterization, explainability, sustainability, and computability, in addition to its accompaniment with documentation about ethical data practices. To ensure AI-readiness and to clarify data structure and relationships within Bridge2AI’s Grand Challenges (GCs), particular types of metadata are necessary. The GCs within the Bridge2AI initiative include four data-generating projects focusing on generating AI/ML-ready datasets to tackle complex biomedical and behavioral research problems. These projects develop standardized, multimodal data, tools, and training resources to support AI integration, while addressing ethical data practices. Examples include using voice as a biomarker, building interpretable genomic tools, modeling disease trajectories with diverse multimodal data, and mapping cellular and molecular health indicators across the human body. This report assesses the state of metadata creation and standardization in the Bridge2AI GCs, provides guidelines where required, and identifies gaps and areas for improvement across the program. New projects, including those outside the Bridge2AI consortium, would benefit from what we have learned about creating metadata as part of efforts to promote AI readiness. AI 就绪性描述了数据在多大程度上可以被最佳且合乎伦理地用于后续的人工智能与机器学习（AI/ML）方法，这些方法可能涉及模型训练、数据分类以及合伦理、可解释的预测等组合。Bridge2AI 联盟定义了生物医学数据集可能具备的特定标准，以使其达到 AI 就绪：简言之，数据集的就绪度与其 FAIR 特性、来源（溯源）、表征程度、可解释性、可持续性和可计算性有关，此外还需配有关于伦理数据实践的文档。为了确保 AI 就绪并阐明 Bridge2AI 大挑战（GCs）中数据结构和关系，需具备特定类型的元数据。Bridge2AI 计划中的大挑战包括四个以数据生成为主的项目，旨在生成 AI/ML 就绪的数据集以解决复杂的生物医学与行为研究问题。这些项目开发标准化的多模态数据、工具和培训资源以支持 AI 的整合，同时处理伦理数据实践。示例包括使用语音作为生物标志物、构建可解释的基因组工具、用多样的多模态数据建模疾病轨迹，以及绘制人体各处的细胞和分子健康指标。本报告评估了 Bridge2AI GC 中元数据创建与标准化的现状，在必要处提供指导，并识别了整个项目中的差距与改进领域。包括 Bridge2AI 联盟之外的新项目在内，都将从我们关于在推动人工智能准备工作中创建元数据所学到的经验中受益。

Subjects: Other Quantitative Biology, Artificial Intelligence 学科：其他定量生物学、人工智能

Publish: 2025-09-12 17:38:46 UTC 发布：2025-09-12 17:38:46 UTC

#24 Is In-Context Learning Learning? #24 在上下文学习真的是在“学习”吗？

Author: [Adrian de Wynter](https://arxiv.org/search/?searchtype=author&query=Adrian de Wynter) 作者：Adrian de Wynter

In-context learning (ICL) allows some autoregressive models to solve tasks via next-token prediction and without needing further training. This has led to claims about these model’s ability to solve (learn) unseen tasks with only a few shots (exemplars) in the prompt. However, deduction does not always imply learning, as ICL does not explicitly encode a given observation. Instead, the models rely on their prior knowledge and the exemplars given, if any. We argue that, mathematically, ICL does constitute learning, but its full characterisation requires empirical work. We then carry out a large-scale analysis of ICL ablating out or accounting for memorisation, pretraining, distributional shifts, and prompting style and phrasing. We find that ICL is an effective learning paradigm, but limited in its ability to learn and generalise to unseen tasks. We note that, in the limit where exemplars become more numerous, accuracy is insensitive to exemplar distribution, model, prompt style, and the input’s linguistic features. Instead, it deduces patterns from regularities in the prompt, which leads to distributional sensitivity, especially in prompting styles such as chain-of-thought. Given the varied accuracies on formally similar tasks, we conclude that autoregression’s ad-hoc encoding is not a robust mechanism, and suggests limited all-purpose generalisability. 情境学习（ICL）允许某些自回归模型通过下一个标记的预测来解决任务，而无需进一步训练。这导致了关于这些模型仅凭提示中的少量示例（shots）就能解决（学习）未见过任务的能力的说法。然而，推理并不总是意味着学习，因为 ICL 并没有显式地对给定的观察进行编码。相反，模型依赖于其先验知识以及所提供的示例（如果有的话）。我们认为，从数学上讲，ICL 确实构成了学习，但其完整的表征需要实证研究。于是我们进行了大规模的 ICL 分析，剖析或考虑了记忆化、预训练、分布转移以及提示风格和措辞的影响。我们发现 ICL 是一个有效的学习范式，但在学习和泛化到未见任务方面存在局限。我们注意到，在示例数量增多的极限情况下，准确率对示例分布、模型、提示风格以及输入的语言特征不敏感。相反，模型从提示中的规律性中推断模式，这导致了对分布的敏感性，尤其是在诸如链式思维（chain-of-thought）等提示风格中。鉴于在形式上相似的任务上准确率各不相同，我们得出结论：自回归的临时编码并不是一种稳健的机制，并且表明其作为通用方法的泛化能力有限。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-09-12 17:12:04 UTC 发布：2025-09-12 17:12:04 UTC

#25 Multimodal SAM-adapter for Semantic Segmentation #25 用于语义分割的多模态 SAM-adapter

Authors: [Iacopo Curti](https://arxiv.org/search/?searchtype=author&query=Iacopo Curti), [Pierluigi Zama Ramirez](https://arxiv.org/search/?searchtype=author&query=Pierluigi Zama Ramirez), [Alioscia Petrelli](https://arxiv.org/search/?searchtype=author&query=Alioscia Petrelli), [Luigi Di Stefano](https://arxiv.org/search/?searchtype=author&query=Luigi Di Stefano) 作者：Iacopo Curti、Pierluigi Zama Ramirez、Alioscia Petrelli、Luigi Di Stefano

Semantic segmentation, a key task in computer vision with broad applications in autonomous driving, medical imaging, and robotics, has advanced substantially with deep learning. Nevertheless, current approaches remain vulnerable to challenging conditions such as poor lighting, occlusions, and adverse weather. To address these limitations, multimodal methods that integrate auxiliary sensor data (e.g., LiDAR, infrared) have recently emerged, providing complementary information that enhances robustness. In this work, we present MM SAM-adapter, a novel framework that extends the capabilities of the Segment Anything Model (SAM) for multimodal semantic segmentation. The proposed method employs an adapter network that injects fused multimodal features into SAM’s rich RGB features. This design enables the model to retain the strong generalization ability of RGB features while selectively incorporating auxiliary modalities only when they contribute additional cues. As a result, MM SAM-adapter achieves a balanced and efficient use of multimodal information. We evaluate our approach on three challenging benchmarks, DeLiVER, FMB, and MUSES, where MM SAM-adapter delivers state-of-the-art performance. To further analyze modality contributions, we partition DeLiVER and FMB into RGB-easy and RGB-hard subsets. Results consistently demonstrate that our framework outperforms competing methods in both favorable and adverse conditions, highlighting the effectiveness of multimodal adaptation for robust scene understanding. The code is available at the following link: https://github.com/iacopo97/Multimodal-SAM-Adapter. 语义分割是计算机视觉中的一项关键任务，在自动驾驶、医学影像和机器人等领域有广泛应用，随着深度学习的发展已取得显著进展。然而，现有方法在光照不足、遮挡和恶劣天气等挑战性条件下仍然脆弱。为了解决这些局限性，最近出现了将辅助传感器数据（例如 LiDAR、红外）融合的多模态方法，提供了互补信息以增强鲁棒性。在本工作中，我们提出了 MM SAM-adapter，一种将 Segment Anything Model（SAM）扩展到多模态语义分割的新框架。所提出的方法采用一个适配器网络，将融合的多模态特征注入到 SAM 丰富的 RGB 特征中。该设计使模型在保留 RGB 特征强泛化能力的同时，仅在辅助模态提供额外线索时选择性地融合它们。因此，MM SAM-adapter 实现了对多模态信息的均衡且高效的利用。我们在三个具有挑战性的基准数据集 DeLiVER、FMB 和 MUSES 上评估了我们的方法，MM SAM-adapter 在这些数据集上提供了最先进的性能。为进一步分析模态贡献，我们将 DeLiVER 和 FMB 划分为 RGB-easy 和 RGB-hard 子集。结果一致表明，在有利和不利条件下，我们的框架均优于其他方法，这突显了多模态适配在实现鲁棒场景理解方面的有效性。代码可在以下链接获取： https://github.com/iacopo97/Multimodal-SAM-Adapter 。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-09-12 16:58:51 UTC 发布：2025-09-12 16:58:51 UTC

#26 Diversified recommendations of cultural activities with personalized determinantal point processes #26 使用个性化行列式点过程的文化活动多样化推荐

Authors: [Carole Ibrahim](https://arxiv.org/search/?searchtype=author&query=Carole Ibrahim), [Hiba Bederina](https://arxiv.org/search/?searchtype=author&query=Hiba Bederina), [Daniel Cuesta](https://arxiv.org/search/?searchtype=author&query=Daniel Cuesta), [Laurent Montier](https://arxiv.org/search/?searchtype=author&query=Laurent Montier), [Cyrille Delabre](https://arxiv.org/search/?searchtype=author&query=Cyrille Delabre), [Jill-Jênn Vie](https://arxiv.org/search/?searchtype=author&query=Jill-Jênn Vie) 作者：Carole Ibrahim、Hiba Bederina、Daniel Cuesta、Laurent Montier、Cyrille Delabre、Jill-Jênn Vie

While optimizing recommendation systems for user engagement is a well-established practice, effectively diversifying recommendations without negatively impacting core business metrics remains a significant industry challenge. In line with our initiative to broaden our audience’s cultural practices, this study investigates using personalized Determinantal Point Processes (DPPs) to sample diverse and relevant recommendations. We rely on a well-known quality-diversity decomposition of the similarity kernel to give more weight to user preferences. In this paper, we present our implementations of the personalized DPP sampling, evaluate the trade-offs between relevance and diversity through both offline and online metrics, and give insights for practitioners on their use in a production environment. For the sake of reproducibility, we release the full code for our platform and experiments on GitHub. 在优化推荐系统以提升用户参与度已是普遍做法的同时，如何在不对核心业务指标产生负面影响的前提下有效地实现推荐多样化，仍然是行业面临的一大挑战。为配合我们拓宽受众文化实践的倡议，本研究探讨了使用个性化行列式点过程（DPP）来抽取既多样又相关的推荐项。我们依赖于相似性核的一个著名的质量-多样性分解，以便更加强调用户偏好。在本文中，我们介绍了个性化 DPP 抽样的实现，采用离线和在线指标评估相关性与多样性之间的权衡，并为从业者在生产环境中使用该方法提供见解。为便于可重复性，我们在 GitHub 上发布了平台和实验的完整代码。

Subjects: Information Retrieval, Artificial Intelligence 主题：信息检索，人工智能

Publish: 2025-09-12 16:34:07 UTC 发布：2025-09-12 16:34:07 UTC

#27 Improving Audio Event Recognition with Consistency Regularization #27 通过一致性正则化改进音频事件识别

Authors: [Shanmuka Sadhu](https://arxiv.org/search/?searchtype=author&query=Shanmuka Sadhu), [Weiran Wang](https://arxiv.org/search/?searchtype=author&query=Weiran Wang) 作者：Shanmuka Sadhu，Weiran Wang

Consistency regularization (CR), which enforces agreement between model predictions on augmented views, has found recent benefits in automatic speech recognition [1]. In this paper, we propose the use of consistency regularization for audio event recognition, and demonstrate its effectiveness on AudioSet. With extensive ablation studies for both small (∼20k) and large (∼1.8M) supervised training sets, we show that CR brings consistent improvement over supervised baselines which already heavily utilize data augmentation, and CR using stronger augmentation and multiple augmentations leads to additional gain for the small training set. Furthermore, we extend the use of CR into the semi-supervised setup with 20K labeled samples and 1.8M unlabeled samples, and obtain performance improvement over our best model trained on the small set. 一致性正则化（CR）通过增强视图之间模型预测的一致性，在近期的自动语音识别中显示出益处[1]。在本文中，我们提出将一致性正则化用于音频事件识别，并在 AudioSet 上证明其有效性。通过对小规模（ ∼ 20k）和大规模（ ∼ 1.8M）有监督训练集进行的广泛消融研究，我们展示了 CR 在已经大量使用数据增强的有监督基线之上带来了持续的改进；对于小规模训练集，使用更强的数据增强和多重增强的 CR 还能带来额外提升。此外，我们将 CR 扩展到含有 20K 有标签样本和 1.8M 无标签样本的半监督设置，并在性能上超过了我们在小数据集上训练的最佳模型。

Subjects: Sound, Artificial Intelligence 主题：声音，人工智能

Publish: 2025-09-12 16:31:20 UTC 发布：2025-09-12 16:31:20 UTC

#28 Data distribution impacts the performance and generalisability of contrastive learning-based foundation models of electrocardiograms #28 数据分布影响基于对比学习的心电图基础模型的性能与泛化性

Contrastive learning is a widely adopted self-supervised pretraining strategy, yet its dependence on cohort composition remains underexplored. We present Contrasting by Patient Augmented Electrocardiograms (CAPE) foundation model and pretrain on four cohorts (n = 5,203,352), from diverse populations across three continents (North America, South America, Asia). We systematically assess how cohort demographics, health status, and population diversity influence the downstream performance for prediction tasks also including two additional cohorts from another continent (Europe). We find that downstream performance depends on the distributional properties of the pretraining cohort, including demographics and health status. Moreover, while pretraining with a multi-centre, demographically diverse cohort improves in-distribution accuracy, it reduces out-of-distribution (OOD) generalisation of our contrastive approach by encoding cohort-specific artifacts. To address this, we propose the In-Distribution Batch (IDB) strategy, which preserves intra-cohort consistency during pretraining and enhances OOD robustness. This work provides important insights for developing clinically fair and generalisable foundation models. 对比学习是一种被广泛采用的自监督预训练策略，但其对队列组成的依赖尚未被充分研究。我们提出了基于患者增强心电图的对比（CAPE）基础模型，并在来自三个大洲（北美、南美、亚洲）的四个队列（n = 5,203,352）上进行预训练。我们系统地评估了队列人口统计学、健康状况和人群多样性如何影响下游预测任务的性能，这些预测任务还包括来自另一个大洲（欧洲）的两个额外队列。我们发现下游性能取决于预训练队列的分布特性，包括人口统计学和健康状况。此外，尽管使用多中心、人口统计学多样的队列进行预训练能提高同分布（in-distribution）的准确性，但它通过编码队列特定的伪影降低了我们对比方法的分布外（OOD）泛化能力。为了解决这一问题，我们提出了同分布批次（In-Distribution Batch，IDB）策略，该策略在预训练期间保持队列内一致性并增强了 OOD 鲁棒性。这项工作为开发临床公正且具通用性的基础模型提供了重要见解。

Subjects: Machine Learning, Artificial Intelligence, Signal Processing, Tissues and Organs 主题：机器学习、人工智能、信号处理、组织与器官

Publish: 2025-09-12 16:01:18 UTC 发布日期：2025-09-12 16:01:18 协调世界时

#29 Towards Understanding Visual Grounding in Visual Language Models #29 朝着理解视觉语言模型中的视觉定位

Authors: [Georgios Pantazopoulos](https://arxiv.org/search/?searchtype=author&query=Georgios Pantazopoulos), [Eda B. Özyiğit](https://arxiv.org/search/?searchtype=author&query=Eda B. Özyiğit) 作者：Georgios Pantazopoulos、Eda B. Özyiğit

Visual grounding refers to the ability of a model to identify a region within some visual input that matches a textual description. Consequently, a model equipped with visual grounding capabilities can target a wide range of applications in various domains, including referring expression comprehension, answering questions pertinent to fine-grained details in images or videos, caption visual context by explicitly referring to entities, as well as low and high-level control in simulated and real environments. In this survey paper, we review representative works across the key areas of research on modern general-purpose vision language models (VLMs). We first outline the importance of grounding in VLMs, then delineate the core components of the contemporary paradigm for developing grounded models, and examine their practical applications, including benchmarks and evaluation metrics for grounded multimodal generation. We also discuss the multifaceted interrelations among visual grounding, multimodal chain-of-thought, and reasoning in VLMs. Finally, we analyse the challenges inherent to visual grounding and suggest promising directions for future research. 视觉定位是指模型识别与文本描述相匹配的视觉输入区域的能力。因此，具备视觉定位能力的模型可以应用于各个领域的广泛任务，包括指代表达理解、回答与图像或视频中细粒度细节相关的问题、通过明确指代实体来为视觉上下文生成描述，以及在模拟和真实环境中实现低级和高级控制。在这篇综述论文中，我们回顾了现代通用视觉语言模型（VLM）研究的关键领域中的代表性工作。我们首先概述了在 VLM 中进行定位的重要性，然后阐明了构建具备定位能力模型的当代范式的核心组成部分，并考察了其实际应用，包括用于定位多模态生成的基准和评估指标。我们还讨论了视觉定位、多模态链式思维以及 VLM 中推理之间的多方面相互关系。最后，我们分析了视觉定位固有的挑战，并提出了未来研究的有希望方向。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-09-12 15:33:49 UTC 发布：2025-09-12 15:33:49 UTC

#30 GLAM: Geometry-Guided Local Alignment for Multi-View VLP in Mammography #30 GLAM：用于乳腺 X 光多视图 VLP 的几何引导局部对齐

Authors: [Yuexi Du](https://arxiv.org/search/?searchtype=author&query=Yuexi Du), [Lihui Chen](https://arxiv.org/search/?searchtype=author&query=Lihui Chen), [Nicha C. Dvornek](https://arxiv.org/search/?searchtype=author&query=Nicha C. Dvornek) 作者：杜悦希，陈立辉，Nicha C. Dvornek

Mammography screening is an essential tool for early detection of breast cancer. The speed and accuracy of mammography interpretation have the potential to be improved with deep learning methods. However, the development of a foundation visual language model (VLM) is hindered by limited data and domain differences between natural and medical images. Existing mammography VLMs, adapted from natural images, often ignore domain-specific characteristics, such as multi-view relationships in mammography. Unlike radiologists who analyze both views together to process ipsilateral correspondence, current methods treat them as independent images or do not properly model the multi-view correspondence learning, losing critical geometric context and resulting in suboptimal prediction. We propose GLAM: Global and Local Alignment for Multi-view mammography for VLM pretraining using geometry guidance. By leveraging the prior knowledge about the multi-view imaging process of mammograms, our model learns local cross-view alignments and fine-grained local features through joint global and local, visual-visual, and visual-language contrastive learning. Pretrained on EMBED [14], one of the largest open mammography datasets, our model outperforms baselines across multiple datasets under different settings. 乳腺 X 线摄影筛查是乳腺癌早期发现的关键工具。深度学习方法有望提升乳腺 X 线影像判读的速度与准确性。然而，受限于数据量不足以及自然图像与医学影像之间的领域差异，构建基础视觉语言模型（VLM）面临挑战。现有从自然图像迁移而来的乳腺 X 线 VLM 往往忽视领域特有特征，例如乳腺 X 线的多视图关系。与放射科医师会同时分析两侧视图以处理同侧对应关系不同，当前方法要么将其视为独立图像，要么未能恰当地建模多视图对应学习，从而丢失关键的几何上下文，导致预测不理想。我们提出了 GLAM：用于 VLM 预训练并引入几何引导的多视图乳腺 X 线的全局与局部对齐方法。通过利用关于乳腺 X 线多视图成像过程的先验知识，我们的模型通过结合全局与局部、视觉-视觉和视觉-语言的对比学习，学习跨视图的局部对齐与细粒度局部特征。在 EMBED [14]（最大的公开乳腺 X 线影像数据集之一）上进行预训练后，我们的模型在不同设置下在多个数据集上均优于基线。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning 主题：计算机视觉与模式识别，人工智能，机器学习

Publish: 2025-09-12 15:33:18 UTC 发布：2025-09-12 15:33:18 UTC

#31 I-Segmenter: Integer-Only Vision Transformer for Efficient Semantic Segmentation #31 I-Segmenter：用于高效语义分割的整数仅视觉变压器

Authors: [Jordan Sassoon](https://arxiv.org/search/?searchtype=author&query=Jordan Sassoon), [Michal Szczepanski](https://arxiv.org/search/?searchtype=author&query=Michal Szczepanski), [Martyna Poreba](https://arxiv.org/search/?searchtype=author&query=Martyna Poreba) 作者：Jordan Sassoon、Michal Szczepanski、Martyna Poreba

Vision Transformers (ViTs) have recently achieved strong results in semantic segmentation, yet their deployment on resource-constrained devices remains limited due to their high memory footprint and computational cost. Quantization offers an effective strategy to improve efficiency, but ViT-based segmentation models are notoriously fragile under low precision, as quantization errors accumulate across deep encoder-decoder pipelines. We introduce I-Segmenter, the first fully integer-only ViT segmentation framework. Building on the Segmenter architecture, I-Segmenter systematically replaces floating-point operations with integer-only counterparts. To further stabilize both training and inference, we propose λ-ShiftGELU, a novel activation function that mitigates the limitations of uniform quantization in handling long-tailed activation distributions. In addition, we remove the L2 normalization layer and replace bilinear interpolation in the decoder with nearest neighbor upsampling, ensuring integer-only execution throughout the computational graph. Extensive experiments show that I-Segmenter achieves accuracy within a reasonable margin of its FP32 baseline (5.1 % on average), while reducing model size by up to 3.8x and enabling up to 1.2x faster inference with optimized runtimes. Notably, even in one-shot PTQ with a single calibration image, I-Segmenter delivers competitive accuracy, underscoring its practicality for real-world deployment. 视觉变换器（ViT）最近在语义分割方面取得了显著成果，但由于其高内存占用和计算成本，在资源受限设备上的部署仍然有限。量化为提高效率提供了有效策略，但基于 ViT 的分割模型在低精度下通常非常脆弱，因为量化误差会在深度编码器—解码器流水线中累积。我们提出了 I-Segmenter，这是第一个完全仅使用整数运算的 ViT 分割框架。基于 Segmenter 架构，I-Segmenter 系统地将浮点运算替换为仅整数运算。为进一步稳定训练和推理，我们提出了 λ -ShiftGELU，一种新型激活函数，能缓解均匀量化在处理长尾激活分布时的局限性。此外，我们移除了 L2 归一化层，并将解码器中的双线性插值替换为最近邻上采样，确保整个计算图均为仅整数执行。大量实验表明，I-Segmenter 在精度上与其 FP32 基线保持在一个合理范围内（平均差距为 5.1%），同时将模型大小最多缩小至 3.8 倍，并在经过优化的运行时下实现最多 1.2 倍的推理加速。值得注意的是，即便在仅使用一张校准图像进行的一次性 PTQ 中，I-Segmenter 仍能提供具有竞争力的精度，凸显了其在现实部署中的实用性。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning 主题：计算机视觉与模式识别，人工智能，机器学习

Publish: 2025-09-12 15:14:19 UTC 发布：2025-09-12 15:14:19 UTC

#32 Generalizing Beyond Suboptimality: Offline Reinforcement Learning Learns Effective Scheduling through Random Data #32 超越次优性的泛化：离线强化学习通过随机数据学会有效调度

Authors: [Jesse van Remmerden](https://arxiv.org/search/?searchtype=author&query=Jesse van Remmerden), [Zaharah Bukhsh](https://arxiv.org/search/?searchtype=author&query=Zaharah Bukhsh), [Yingqian Zhang](https://arxiv.org/search/?searchtype=author&query=Yingqian Zhang) 作者：Jesse van Remmerden、Zaharah Bukhsh、Yingqian Zhang

The Job-Shop Scheduling Problem (JSP) and Flexible Job-Shop Scheduling Problem (FJSP), are canonical combinatorial optimization problems with wide-ranging applications in industrial operations. In recent years, many online reinforcement learning (RL) approaches have been proposed to learn constructive heuristics for JSP and FJSP. Although effective, these online RL methods require millions of interactions with simulated environments that may not capture real-world complexities, and their random policy initialization leads to poor sample efficiency. To address these limitations, we introduce Conservative Discrete Quantile Actor-Critic (CDQAC), a novel offline RL algorithm that learns effective scheduling policies directly from historical data, eliminating the need for costly online interactions, while maintaining the ability to improve upon suboptimal training data. CDQAC couples a quantile-based critic with a delayed policy update, estimating the return distribution of each machine-operation pair rather than selecting pairs outright. Our extensive experiments demonstrate CDQAC’s remarkable ability to learn from diverse data sources. CDQAC consistently outperforms the original data-generating heuristics and surpasses state-of-the-art offline and online RL baselines. In addition, CDQAC is highly sample efficient, requiring only 10-20 training instances to learn high-quality policies. Surprisingly, we find that CDQAC performs better when trained on data generated by a random heuristic than when trained on higher-quality data from genetic algorithms and priority dispatching rules. 作业车间调度问题（JSP）和柔性作业车间调度问题（FJSP）是具有广泛工业应用的典型组合优化问题。近年来，许多在线强化学习（RL）方法被提出，用以为 JSP 和 FJSP 学习构造性启发式算法。尽管这些在线 RL 方法有效，但它们需要与模拟环境进行数百万次交互，而模拟环境可能无法捕捉真实世界的复杂性，且其随机策略初始化导致样本效率低下。为了解决这些限制，我们提出了保守离散分位数策略-价值（CDQAC），这是一种新颖的离线 RL 算法，能够直接从历史数据中学习有效的调度策略，从而消除了代价高昂的在线交互需求，同时保留在次优训练数据上继续改进的能力。CDQAC 将基于分位数的评价器与延迟策略更新相结合，估计每个机台-工序对的回报分布，而不是直接选择配对。我们的大量实验表明，CDQAC 在从多样化数据源中学习方面表现出显著的能力。 CDQAC 始终优于原始的数据生成启发式方法，并超过了最先进的离线和在线强化学习基准。此外，CDQAC 对样本非常高效，仅需 10–20 个训练实例即可学到高质量的策略。令人惊讶的是，我们发现当使用随机启发式生成的数据进行训练时，CDQAC 的表现比使用来自遗传算法和优先调度规则的更高质量数据训练时更好。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-09-12 14:45:39 UTC 发布：2025-09-12 14:45:39 UTC

#33 We Need a New Ethics for a World of AI Agents #33 我们需要为一个由人工智能代理构成的世界制定新的伦理

Authors: [Iason Gabriel](https://arxiv.org/search/?searchtype=author&query=Iason Gabriel), [Geoff Keeling](https://arxiv.org/search/?searchtype=author&query=Geoff Keeling), [Arianna Manzini](https://arxiv.org/search/?searchtype=author&query=Arianna Manzini), [James Evans](https://arxiv.org/search/?searchtype=author&query=James Evans) 作者：Iason Gabriel、Geoff Keeling、Arianna Manzini、James Evans

The deployment of capable AI agents raises fresh questions about safety, human-machine relationships and social coordination. We argue for greater engagement by scientists, scholars, engineers and policymakers with the implications of a world increasingly populated by AI agents. We explore key challenges that must be addressed to ensure that interactions between humans and agents, and among agents themselves, remain broadly beneficial. 能够胜任任务的人工智能代理的部署带来了关于安全、人机关系和社会协调的新问题。我们主张科学家、学者、工程师和政策制定者应更深入地参与，思考一个日益由人工智能代理充斥的世界所带来的影响。我们探讨了必须解决的关键挑战，以确保人类与代理之间以及代理彼此之间的互动保持总体上的有益性。

Subjects: Computers and Society, Artificial Intelligence 主题：计算机与社会，人工智能

Publish: 2025-09-12 14:29:14 UTC 发布：2025-09-12 14:29:14 UTC

#34 SignClip: Leveraging Mouthing Cues for Sign Language Translation by Multimodal Contrastive Fusion #34 SignClip：通过多模态对比融合利用口型线索进行手语翻译

Authors: [Wenfang Wu](https://arxiv.org/search/?searchtype=author&query=Wenfang Wu), [Tingting Yuan](https://arxiv.org/search/?searchtype=author&query=Tingting Yuan), [Yupeng Li](https://arxiv.org/search/?searchtype=author&query=Yupeng Li), [Daling Wang](https://arxiv.org/search/?searchtype=author&query=Daling Wang), [Xiaoming Fu](https://arxiv.org/search/?searchtype=author&query=Xiaoming Fu) 作者：吴文芳、袁婷婷、李玉朋、王达令、付晓明

Sign language translation (SLT) aims to translate natural language from sign language videos, serving as a vital bridge for inclusive communication. While recent advances leverage powerful visual backbones and large language models, most approaches mainly focus on manual signals (hand gestures) and tend to overlook non-manual cues like mouthing. In fact, mouthing conveys essential linguistic information in sign languages and plays a crucial role in disambiguating visually similar signs. In this paper, we propose SignClip, a novel framework to improve the accuracy of sign language translation. It fuses manual and non-manual cues, specifically spatial gesture and lip movement features. Besides, SignClip introduces a hierarchical contrastive learning framework with multi-level alignment objectives, ensuring semantic consistency across sign-lip and visual-text modalities. Extensive experiments on two benchmark datasets, PHOENIX14T and How2Sign, demonstrate the superiority of our approach. For example, on PHOENIX14T, in the Gloss-free setting, SignClip surpasses the previous state-of-the-art model SpaMo, improving BLEU-4 from 24.32 to 24.71, and ROUGE from 46.57 to 48.38. 手语翻译（SLT）旨在将手语视频中的自然语言进行翻译，是实现包容性交流的重要桥梁。尽管最近的进展利用了强大的视觉骨干和大型语言模型，但大多数方法主要关注手部信号（手势），往往忽视了如唇形等非手部提示。事实上，唇形在手语中传达了重要的语言信息，并在消除视觉上相似手势的歧义方面发挥关键作用。本文提出了 SignClip，一种用于提高手语翻译准确性的全新框架。它融合了手部和非手部提示，特别是空间手势和唇部运动特征。此外，SignClip 引入了具有多层级对齐目标的层次对比学习框架，确保手语-唇形和视觉-文本模态之间的语义一致性。在两个基准数据集 PHOENIX14T 和 How2Sign 上的大量实验证明了我们方法的优越性。例如，在 PHOENIX14T 的无注释（Gloss-free）设置中，SignClip 超越了之前的最先进模型 SpaMo，将 BLEU-4 从 24.32 提升到 24.71，ROUGE 从 46.57 提升到 48.38。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-09-12 14:08:06 UTC 发布：2025-09-12 14:08:06 协调世界时

#35 Openness in AI and downstream governance: A global value chain approach #35 人工智能的开放性与下游治理：全球价值链方法

Author: [Christopher Foster](https://arxiv.org/search/?searchtype=author&query=Christopher Foster) 作者：Christopher Foster

The rise of AI has been rapid, becoming a leading sector for investment and promising disruptive impacts across the economy. Within the critical analysis of the economic impacts, AI has been aligned to the critical literature on data power and platform capitalism - further concentrating power and value capture amongst a small number of “big tech” leaders. The equally rapid rise of openness in AI (here taken to be claims made by AI firms about openness, “open source” and free provision) signals an interesting development. It highlights an emerging ecosystem of open AI models, datasets and toolchains, involving massive capital investment. It poses questions as to whether open resources can support technological transfer and the ability for catch-up, even in the face of AI industry power. This work seeks to add conceptual clarity to these debates by conceptualising openness in AI as a unique type of interfirm relation and therefore amenable to value chain analysis. This approach then allows consideration of the capitalist dynamics of “outsourcing” of foundational firms in value chains, and consequently the types of governance and control that might emerge downstream as AI is adopted. This work, therefore, extends previous mapping of AI value chains to build a framework which links foundational AI with downstream value chains. Overall, this work extends our understanding of AI as a productive sector. While the work remains critical of the power of leading AI firms, openness in AI may lead to potential spillovers stemming from the intense competition for global technological leadership in AI. 人工智能的崛起速度极快，已成为投资的领先领域，并有望在整个经济中带来颠覆性影响。在对其经济影响的关键分析中，人工智能被纳入有关数据权力和平台资本主义的批判性文献——进一步将权力和价值攫取集中到少数“大型科技”领军企业手中。与之同样快速兴起的还有人工智能领域的开放性（此处指人工智能公司关于开放性、“开源”和免费提供的宣称），这一现象表明了一个有趣的发展方向。它凸显出一个由开放的人工智能模型、数据集和工具链构成的新兴生态系统，同时伴随着巨额资本投入。人们由此质疑：即便在人工智能产业权力的强势环境下，开放资源是否仍能支持技术转移和追赶能力。本研究旨在通过将人工智能中的开放性概念化为一种独特的企业间关系，从而有助于为这些争论提供概念上的清晰性，因此也适用于价值链分析。这一方法进而允许我们考虑价值链中“基础性企业”被资本主义式“外包”的动态，以及随着人工智能被采用后可能在下游出现的治理与控制类型。因此，本研究在此前对人工智能价值链的映射基础上进一步扩展，构建了一个将基础性人工智能与下游价值链相连接的框架。总体而言，本研究加深了我们对人工智能作为一个生产性部门的理解。尽管本研究仍对领先人工智能公司的权力保持批判性，但人工智能的开放性可能会导致由于全球技术领导权的激烈竞争而产生的潜在溢出效应。

Subjects: Computers and Society, Artificial Intelligence 主题：计算机与社会，人工智能

Publish: 2025-09-12 13:12:09 UTC 发布时间：2025-09-12 13:12:09 UTC

#36 SI-FACT: Mitigating Knowledge Conflict via Self-Improving Faithfulness-Aware Contrastive Tuning #36 SI-FACT：通过自我改进的忠实度感知对比微调缓解知识冲突 [PDF 2 ] [Copy] [Kimi 2 ] [REL]

Author: [Shengqiang Fu](https://arxiv.org/search/?searchtype=author&query=Shengqiang Fu) 作者：Fu Shengqiang

Large Language Models often generate unfaithful responses in knowledge intensive tasks due to knowledge conflict,that is,a preference for relying on internal parametric knowledge rather than the provided context.To address this issue,we propose a novel self improving framework,Self Improving Faithfulness Aware Contrastive Tuning.The framework uses a self instruct mechanism that allows the base LLM to automatically generate high quality,structured contrastive learning data,including anchor samples,semantically equivalent positive samples,and negative samples simulating unfaithful scenarios.This approach significantly reduces the cost of manual annotation.Subsequently,contrastive learning is applied to train the model,enabling it to pull faithful responses closer and push unfaithful responses farther apart in the representation space.Experiments on knowledge conflict evaluation benchmarks ECARE KRE and COSE KRE show that the SI FACT model based on Llama3 8B Instruct improves the Contextual Recall Rate by 6.2% over the best baseline method,while significantly reducing dependence on internal memory.The results indicate that SI FACT provides strong effectiveness and high data efficiency in enhancing the contextual faithfulness of LLMs,offering a practical pathway toward building more proactive and trustworthy language models. 大型语言模型在知识密集型任务中常因知识冲突而产生不可靠的回答，即更倾向于依赖内部参数化知识而非所提供的上下文。为了解决这一问题，我们提出了一种新颖的自我改进框架：Self Improving Faithfulness Aware Contrastive Tuning（自我改进的忠实性感知对比微调）。该框架使用自我指导机制，使基础 LLM 能自动生成高质量、结构化的对比学习数据，包括锚样本、语义等价的正样本以及模拟不忠实场景的负样本，从而显著降低人工标注成本。随后，应用对比学习训练模型，使其在表示空间中将忠实的回答拉近，而将不忠实的回答推远。在知识冲突评估基准 ECARE KRE 和 COSE KRE 上的实验表明，基于 Llama3 8B Instruct 的 SI FACT 模型在上下文召回率上比最优基线方法提高了 6.2%，同时显著降低了对内部记忆的依赖。结果表明，SI FACT 在提升 LLM 上下文忠实性方面具有强大的有效性和高数据效率，为构建更主动、更值得信赖的语言模型提供了实用路径。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-09-12 12:56:14 UTC 发布：2025-09-12 12:56:14 UTC

#37 Benchmark of stylistic variation in LLM-generated texts #37 基准测试：LLM 生成文本的文体变异

This study investigates the register variation in texts written by humans and comparable texts produced by large language models (LLMs). Biber’s multidimensional analysis (MDA) is applied to a sample of human-written texts and AI-created texts generated to be their counterparts to find the dimensions of variation in which LLMs differ most significantly and most systematically from humans. As textual material, a new LLM-generated corpus AI-Brown is used, which is comparable to BE-21 (a Brown family corpus representing contemporary British English). Since all languages except English are underrepresented in the training data of frontier LLMs, similar analysis is replicated on Czech using AI-Koditex corpus and Czech multidimensional model. Examined were 16 frontier models in various settings and prompts, with emphasis placed on the difference between base models and instruction-tuned models. Based on this, a benchmark is created through which models can be compared with each other and ranked in interpretable dimensions. 本研究考察了人类撰写文本与大型语言模型（LLM）生成的可比文本之间的文体（语域）变异。研究对人类写作样本和作为其对应物生成的 AI 文本应用了 Biber 的多维度分析（MDA），以找出 LLM 与人类在何种变异维度上存在最显著和最系统性的差异。作为文本材料，使用了新的 LLM 生成语料库 AI-Brown，该语料库可与 BE-21（代表当代英式英语的 Brown 家族语料库）相比较。鉴于除英语外的所有语言在前沿 LLM 的训练数据中均被弱表示，研究还在捷克语上重复了类似分析，使用 AI-Koditex 语料库和捷克语多维模型。研究考察了 16 个前沿模型在不同设置和提示下的表现，重点关注基础模型与指令微调模型之间的差异。基于此，研究创建了一个基准，通过该基准可以在可解释的维度上比较并对模型进行排名。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-09-12 12:12:20 UTC 发布：2025-09-12 12:12:20 UTC

#38 BenchECG and xECG: a benchmark and baseline for ECG foundation models #38 BenchECG 和 xECG：用于心电图基础模型的基准和基线

Authors: [Riccardo Lunelli](https://arxiv.org/search/?searchtype=author&query=Riccardo Lunelli), [Angus Nicolson](https://arxiv.org/search/?searchtype=author&query=Angus Nicolson), [Samuel Martin Pröll](https://arxiv.org/search/?searchtype=author&query=Samuel Martin Pröll), [Sebastian Johannes Reinstadler](https://arxiv.org/search/?searchtype=author&query=Sebastian Johannes Reinstadler), [Axel Bauer](https://arxiv.org/search/?searchtype=author&query=Axel Bauer), [Clemens Dlaska](https://arxiv.org/search/?searchtype=author&query=Clemens Dlaska) 作者：Riccardo Lunelli、Angus Nicolson、Samuel Martin Pröll、Sebastian Johannes Reinstadler、Axel Bauer、Clemens Dlaska

Electrocardiograms (ECGs) are inexpensive, widely used, and well-suited to deep learning. Recently, interest has grown in developing foundation models for ECGs - models that generalise across diverse downstream tasks. However, consistent evaluation has been lacking: prior work often uses narrow task selections and inconsistent datasets, hindering fair comparison. Here, we introduce BenchECG, a standardised benchmark comprising a comprehensive suite of publicly available ECG datasets and versatile tasks. We also propose xECG, an xLSTM-based recurrent model trained with SimDINOv2 self-supervised learning, which achieves the best BenchECG score compared to publicly available state-of-the-art models. In particular, xECG is the only publicly available model to perform strongly on all datasets and tasks. By standardising evaluation, BenchECG enables rigorous comparison and aims to accelerate progress in ECG representation learning. xECG achieves superior performance over earlier approaches, defining a new baseline for future ECG foundation models. 心电图（ECG）廉价、被广泛使用且非常适合深度学习。近来，人们越来越关注为心电图开发基础模型——能够在多种下游任务中泛化的模型。然而，一直缺乏一致的评估：以往工作常使用范围狭窄的任务选择和不一致的数据集，阻碍了公平比较。在此，我们引入了 BenchECG，这是一个标准化基准，包含了一套全面的公开可用心电图数据集和多功能任务。我们还提出了 xECG，一种基于 xLSTM 的循环模型，使用 SimDINOv2 自监督学习训练，与公开可用的最先进模型相比，在 BenchECG 得分上表现最好。特别是，xECG 是唯一在所有数据集和任务上均表现良好的公开可用模型。通过标准化评估，BenchECG 使严格比较成为可能，并旨在加速心电图表征学习的进展。xECG 相比早期方法实现了更优的性能，为未来的心电图基础模型设定了新的基线。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-09-12 11:27:17 UTC 发布时间：2025-09-12 11:27:17 UTC

#39 Efficient Learning-Based Control of a Legged Robot in Lunar Gravity #39 基于高效学习的月球重力下步行机器人的控制

Authors: [Philip Arm](https://arxiv.org/search/?searchtype=author&query=Philip Arm), [Oliver Fischer](https://arxiv.org/search/?searchtype=author&query=Oliver Fischer), [Joseph Church](https://arxiv.org/search/?searchtype=author&query=Joseph Church), [Adrian Fuhrer](https://arxiv.org/search/?searchtype=author&query=Adrian Fuhrer), [Hendrik Kolvenbach](https://arxiv.org/search/?searchtype=author&query=Hendrik Kolvenbach), [Marco Hutter](https://arxiv.org/search/?searchtype=author&query=Marco Hutter) 作者：Philip Arm、Oliver Fischer、Joseph Church、Adrian Fuhrer、Hendrik Kolvenbach、Marco Hutter

Legged robots are promising candidates for exploring challenging areas on low-gravity bodies such as the Moon, Mars, or asteroids, thanks to their advanced mobility on unstructured terrain. However, as planetary robots’ power and thermal budgets are highly restricted, these robots need energy-efficient control approaches that easily transfer to multiple gravity environments. In this work, we introduce a reinforcement learning-based control approach for legged robots with gravity-scaled power-optimized reward functions. We use our approach to develop and validate a locomotion controller and a base pose controller in gravity environments from lunar gravity (1.62 m/s2) to a hypothetical super-Earth (19.62 m/s2). Our approach successfully scales across these gravity levels for locomotion and base pose control with the gravity-scaled reward functions. The power-optimized locomotion controller reached a power consumption for locomotion of 23.4 W in Earth gravity on a 15.65 kg robot at 0.4 m/s, a 23 % improvement over the baseline policy. Additionally, we designed a constant-force spring offload system that allowed us to conduct real-world experiments on legged locomotion in lunar gravity. In lunar gravity, the power-optimized control policy reached 12.2 W, 36 % less than a baseline controller which is not optimized for power efficiency. Our method provides a scalable approach to developing power-efficient locomotion controllers for legged robots across multiple gravity levels. 多足机器人因其在非结构化地形上的先进机动性，被认为是探索月球、火星或小行星等低重力天体具有前景的候选者。然而，由于行星机器人在能量和热量方面的预算高度受限，这些机器人需要能够在多重重力环境中轻松迁移的节能控制方法。在这项工作中，我们提出了一种基于强化学习的多足机器人控制方法，采用按重力缩放的功率优化奖励函数。我们使用该方法在从月球重力（1.62 m/s²）到假想超地球（19.62 m/s²）的重力环境中开发并验证了一个运动控制器和一个基座姿态控制器。我们的办法通过重力缩放的奖励函数在这些重力水平上成功实现了运动和基座姿态控制的可扩展性。在地球重力下，该功率优化的运动控制器在质量为 15.65 kg、速度为 0.4 m/s 时达到了 23.4 W 的运动能耗，比基线策略提高了 23%。此外，我们设计了一个恒力弹簧卸载系统，使我们能够在月球重力条件下进行真实世界的步行机器人实验。在月球重力下，经过功率优化的控制策略达到了 12.2 W，比未针对功率效率优化的基线控制器低 36%。我们的方法为在多重重力水平上开发节能的步行机器人控制器提供了一种可扩展的途径。

Subjects: Robotics, Artificial Intelligence 主题：机器人学，人工智能

Publish: 2025-09-12 10:43:58 UTC 发布：2025-09-12 10:43:58 UTC

Recent advances in large language models (LLMs) have enabled human-like social simulations at unprecedented scale and fidelity, offering new opportunities for computational social science. A key challenge, however, is the construction of persona sets that authentically represent the diversity and distribution of real-world populations. Most existing LLM-based social simulation studies focus primarily on designing agentic frameworks and simulation environments, often overlooking the complexities of persona generation and the potential biases introduced by unrepresentative persona sets. In this paper, we propose a systematic framework for synthesizing high-quality, population-aligned persona sets for LLM-driven social simulation. Our approach begins by leveraging LLMs to generate narrative personas from long-term social media data, followed by rigorous quality assessment to filter out low-fidelity profiles. We then apply importance sampling to achieve global alignment with reference psychometric distributions, such as the Big Five personality traits. To address the needs of specific simulation contexts, we further introduce a task-specific module that adapts the globally aligned persona set to targeted subpopulations. Extensive experiments demonstrate that our method significantly reduces population-level bias and enables accurate, flexible social simulation for a wide range of research and policy applications. 最近大型语言模型（LLM）的进展使得以人类般的规模和逼真度进行社会模拟成为可能，为计算社会科学提供了新的机会。然而，一个关键挑战是构建能够真实代表现实人口多样性与分布的人格集合。大多数现有的基于 LLM 的社会模拟研究主要集中在设计主体框架和模拟环境上，常常忽视了人格生成的复杂性以及由非代表性人格集合引入的潜在偏差。在本文中，我们提出了一个系统化框架，用于为 LLM 驱动的社会模拟合成高质量、与人口对齐的人格集合。我们的方法首先利用 LLM 从长期社交媒体数据中生成叙事式人格，并通过严格的质量评估来筛除低保真度的档案。然后我们应用重要性抽样以实现与参考心理测量分布（例如大五人格特质）的全局对齐。为满足特定模拟情境的需要，我们进一步引入了一个任务专用模块，将全局对齐的人格集合调整为针对性的子群体。大量实验证明，我们的方法能显著降低群体层面的偏差，并使社会模拟在广泛的研究和政策应用中实现准确且灵活的运用。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-09-12 10:43:47 UTC 发布：2025-09-12 10:43:47 UTC

#41 Realism Control One-step Diffusion for Real-World Image Super-Resolution #41 真实感控制一步扩散用于真实世界图像超分辨率

Authors: [Zongliang Wu](https://arxiv.org/search/?searchtype=author&query=Zongliang Wu), [Siming Zheng](https://arxiv.org/search/?searchtype=author&query=Siming Zheng), [Peng-Tao Jiang](https://arxiv.org/search/?searchtype=author&query=Peng-Tao Jiang), [Xin Yuan](https://arxiv.org/search/?searchtype=author&query=Xin Yuan) 作者：吴宗亮，郑思明，蒋鹏涛，袁欣

Pre-trained diffusion models have shown great potential in real-world image super-resolution (Real-ISR) tasks by enabling high-resolution reconstructions. While one-step diffusion (OSD) methods significantly improve efficiency compared to traditional multi-step approaches, they still have limitations in balancing fidelity and realism across diverse scenarios. Since the OSDs for SR are usually trained or distilled by a single timestep, they lack flexible control mechanisms to adaptively prioritize these competing objectives, which are inherently manageable in multi-step methods through adjusting sampling steps. To address this challenge, we propose a Realism Controlled One-step Diffusion (RCOD) framework for Real-ISR. RCOD provides a latent domain grouping strategy that enables explicit control over fidelity-realism trade-offs during the noise prediction phase with minimal training paradigm modifications and original training data. A degradation-aware sampling strategy is also introduced to align distillation regularization with the grouping strategy and enhance the controlling of trade-offs. Moreover, a visual prompt injection module is used to replace conventional text prompts with degradation-aware visual tokens, enhancing both restoration accuracy and semantic consistency. Our method achieves superior fidelity and perceptual quality while maintaining computational efficiency. Extensive experiments demonstrate that RCOD outperforms state-of-the-art OSD methods in both quantitative metrics and visual qualities, with flexible realism control capabilities in the inference stage. The code will be released. 预训练的扩散模型在真实世界图像超分辨率（Real-ISR）任务中展现出巨大潜力，能够实现高分辨率重建。尽管单步扩散（OSD）方法相较传统多步方法显著提升了效率，但在平衡不同场景下的保真度与真实感方面仍存在局限。由于用于超分辨率的 OSD 模型通常仅通过单时间步训练或蒸馏，其缺乏灵活的控制机制来自适应地平衡这两项竞争目标——而多步方法可通过调整采样步骤实现这种平衡。为解决此问题，我们提出面向真实图像增强的"现实感可控单步扩散"（RCOD）框架。RCOD 通过潜在域分组策略，在噪声预测阶段实现保真度与现实感权衡的显式控制，仅需最小化训练范式调整并保留原始训练数据。同时引入退化感知采样策略，使知识蒸馏正则化与分组策略协同作用，强化权衡控制能力。此外，我们引入了一个视觉提示注入模块，用以用感知退化的视觉标记替代传统的文本提示，从而提升修复精度和语义一致性。我们的方法在保持计算效率的同时实现了更高的保真度和感知质量。大量实验证明，RCOD 在定量指标和视觉质量上均优于最先进的 OSD 方法，并且在推断阶段具备灵活的真实感控制能力。代码将会开源。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-09-12 10:32:04 UTC 发布：2025-09-12 10:32:04 UTC

#42 Generating Energy-Efficient Code via Large-Language Models – Where are we now? #42 通过大规模语言模型生成节能代码——我们现在处于什么位置？

Authors: [Radu Apsan](https://arxiv.org/search/?searchtype=author&query=Radu Apsan), [Vincenzo Stoico](https://arxiv.org/search/?searchtype=author&query=Vincenzo Stoico), [Michel Albonico](https://arxiv.org/search/?searchtype=author&query=Michel Albonico), [Rudra Dhar](https://arxiv.org/search/?searchtype=author&query=Rudra Dhar), [Karthik Vaidhyanathan](https://arxiv.org/search/?searchtype=author&query=Karthik Vaidhyanathan), [Ivano Malavolta](https://arxiv.org/search/?searchtype=author&query=Ivano Malavolta) 作者：Radu Apsan, Vincenzo Stoico, Michel Albonico, Rudra Dhar, Karthik Vaidhyanathan, Ivano Malavolta

Context. The rise of Large Language Models (LLMs) has led to their widespread adoption in development pipelines. Goal. We empirically assess the energy efficiency of Python code generated by LLMs against human-written code and code developed by a Green software expert. Method. We test 363 solutions to 9 coding problems from the EvoEval benchmark using 6 widespread LLMs with 4 prompting techniques, and comparing them to human-developed solutions. Energy consumption is measured on three different hardware platforms: a server, a PC, and a Raspberry Pi for a total of ~881h (36.7 days). Results. Human solutions are 16% more energy-efficient on the server and 3% on the Raspberry Pi, while LLMs outperform human developers by 25% on the PC. Prompting does not consistently lead to energy savings, where the most energy-efficient prompts vary by hardware platform. The code developed by a Green software expert is consistently more energy-efficient by at least 17% to 30% against all LLMs on all hardware platforms. Conclusions. Even though LLMs exhibit relatively good code generation capabilities, no LLM-generated code was more energy-efficient than that of an experienced Green software developer, suggesting that as of today there is still a great need of human expertise for developing energy-efficient Python code. 背景。大型语言模型（LLMs）的兴起使其在开发流程中得到广泛应用。目标。我们通过实证评估 LLMs 生成的 Python 代码与人工编写代码及绿色软件专家开发的代码在能效方面的差异。方法。我们采用 6 种主流 LLMs 及 4 种提示技术，对 EvoEval 基准测试中 9 个编程问题的 363 个解决方案进行测试，并与人工开发的解决方案进行对比。能耗测量在三种硬件平台进行：服务器、个人电脑和树莓派，总计约 881 小时（36.7 天）。结果。在服务器端，人工解决方案能效高出 16%，树莓派端高出 3%；而在个人电脑端，LLM 的能效比人工开发者高出 25%。提示策略并非始终能节省能耗，不同硬件平台下最节能的提示方案存在差异。由绿色软件专家开发的代码在所有硬件平台上均比所有 LLM 节能至少 17%至 30%。结论。尽管 LLMs 在代码生成方面表现相对良好，但没有任何由 LLM 生成的代码比经验丰富的绿色软件开发人员编写的代码更节能，这表明截至目前，在开发节能的 Python 代码方面仍然非常需要人类专业知识。

Subjects: Software Engineering, Artificial Intelligence 主题：软件工程，人工智能

Publish: 2025-09-12 09:49:46 UTC 发布：2025-09-12 09:49:46 协调世界时 (UTC)

#43 Established Psychometric vs. Ecologically Valid Questionnaires: Rethinking Psychological Assessments in Large Language Models #43 既定的心理测量学问卷与生态有效问卷：在大型语言模型中重新思考心理评估

Researchers have applied established psychometric questionnaires (e.g., BFI, PVQ) to measure the personality traits and values reflected in the responses of Large Language Models (LLMs). However, concerns have been raised about applying these human-designed questionnaires to LLMs. One such concern is their lack of ecological validity–the extent to which survey questions adequately reflect and resemble real-world contexts in which LLMs generate texts in response to user queries. However, it remains unclear how established questionnaires and ecologically valid questionnaires differ in their outcomes, and what insights these differences may provide. In this paper, we conduct a comprehensive comparative analysis of the two types of questionnaires. Our analysis reveals that established questionnaires (1) yield substantially different profiles of LLMs from ecologically valid ones, deviating from the psychological characteristics expressed in the context of user queries, (2) suffer from insufficient items for stable measurement, (3) create misleading impressions that LLMs possess stable constructs, and (4) yield exaggerated profiles for persona-prompted LLMs. Overall, our work cautions against the use of established psychological questionnaires for LLMs. Our code will be released upon publication. 研究人员已经应用现成的心理测量问卷（例如 BFI、PVQ）来衡量大型语言模型 (LLMs) 在回答中反映出的性格特征和价值观。然而，人们对将这些为人类设计的问卷用于 LLMs 提出了质疑。其中一个担忧是它们缺乏生态效度——即调查问题在多大程度上足以反映并类似于 LLMs 在响应用户查询时生成文本的真实世界情境。然而，尚不清楚现有问卷与具有生态效度的问卷在结果上如何不同，以及这些差异可能提供何种见解。在本文中，我们对这两类问卷进行了全面的比较分析。我们的分析表明，现有问卷（1）相比具有生态效度的问卷，会产生与用户查询语境中表达的心理特征显著不同的 LLMs 特征画像；（2）题目数量不足以保证稳定测量；（3）造成对 LLMs 拥有稳定构念的误导性印象；以及（4）对通过角色提示(persona-prompted)驱动的 LLMs 给出夸大的特征画像。总体而言，我们的工作提醒人们对将既有心理学问卷用于 LLMs 保持谨慎。我们的代码将在发表时公开。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-09-12 09:14:42 UTC 发布时间：2025-09-12 09:14:42 UTC

#44 Predictive Spike Timing Enables Distributed Shortest Path Computation in Spiking Neural Networks #44 预测性峰值时序使脉冲神经网络实现分布式最短路径计算成为可能

Authors: [Simen Storesund](https://arxiv.org/search/?searchtype=author&query=Simen Storesund), [Kristian Valset Aars](https://arxiv.org/search/?searchtype=author&query=Kristian Valset Aars), [Robin Dietrich](https://arxiv.org/search/?searchtype=author&query=Robin Dietrich), [Nicolai Waniek](https://arxiv.org/search/?searchtype=author&query=Nicolai Waniek) 作者：Simen Storesund、Kristian Valset Aars、Robin Dietrich、Nicolai Waniek

Efficient planning and sequence selection are central to intelligence, yet current approaches remain largely incompatible with biological computation. Classical graph algorithms like Dijkstra’s or A* require global state and biologically implausible operations such as backtracing, while reinforcement learning methods rely on slow gradient-based policy updates that appear inconsistent with rapid behavioral adaptation observed in natural systems. We propose a biologically plausible algorithm for shortest-path computation that operates through local spike-based message-passing with realistic processing delays. The algorithm exploits spike-timing coincidences to identify nodes on optimal paths: Neurons that receive inhibitory-excitatory message pairs earlier than predicted reduce their response delays, creating a temporal compression that propagates backwards from target to source. Through analytical proof and simulations on random spatial networks, we demonstrate that the algorithm converges and discovers all shortest paths using purely timing-based mechanisms. By showing how short-term timing dynamics alone can compute shortest paths, this work provides new insights into how biological networks might solve complex computational problems through purely local computation and relative spike-time prediction. These findings open new directions for understanding distributed computation in biological and artificial systems, with possible implications for computational neuroscience, AI, reinforcement learning, and neuromorphic systems. 高效规划与序列选择是智能的核心，但现有方法仍与生物计算存在显著差异。经典图论算法如迪杰斯特拉法或 A*法需要全局状态和生物学上难以实现的操作（如回溯），而强化学习方法依赖于缓慢的梯度策略更新，这与自然系统中观察到的快速行为适应性相悖。我们提出一种基于局部尖峰传递且包含真实处理延迟的最短路径计算算法，该算法在生物学上具有合理性。该算法利用尖峰时序同步现象识别最优路径节点：当神经元接收抑制性-兴奋性消息对的时间早于预测值时，其响应延迟会相应缩短，由此产生的时间压缩效应将从目标节点向源节点逆向传播。通过解析证明及随机空间网络模拟，我们证实该算法仅凭时序机制即可收敛并发现所有最短路径。通过展示仅凭短期时间动态就能计算最短路径，这项工作为理解生物网络如何通过纯粹的局部计算和相对脉冲时间预测来解决复杂计算问题提供了新见解。这些发现为理解生物和人工系统中的分布式计算开辟了新方向，并可能对计算神经科学、人工智能、强化学习和类脑系统产生影响。

Subjects: Neural and Evolutionary Computing, Artificial Intelligence, Data Structures and Algorithms, Machine Learning 主题：神经与进化计算、人工智能、数据结构与算法、机器学习

Publish: 2025-09-12 09:13:47 UTC 发布时间：2025-09-12 09:13:47 世界协调时（UTC）

#45 TwinTac: A Wide-Range, Highly Sensitive Tactile Sensor with Real-to-Sim Digital Twin Sensor Model #45 TwinTac：一种宽量程、高灵敏度的触觉传感器，带有真实到仿真的数字孪生传感器模型

Authors: [Xiyan Huang](https://arxiv.org/search/?searchtype=author&query=Xiyan Huang), [Zhe Xu](https://arxiv.org/search/?searchtype=author&query=Zhe Xu), [Chenxi Xiao](https://arxiv.org/search/?searchtype=author&query=Chenxi Xiao) 作者：黄希妍、许喆、肖晨曦

Robot skill acquisition processes driven by reinforcement learning often rely on simulations to efficiently generate large-scale interaction data. However, the absence of simulation models for tactile sensors has hindered the use of tactile sensing in such skill learning processes, limiting the development of effective policies driven by tactile perception. To bridge this gap, we present TwinTac, a system that combines the design of a physical tactile sensor with its digital twin model. Our hardware sensor is designed for high sensitivity and a wide measurement range, enabling high quality sensing data essential for object interaction tasks. Building upon the hardware sensor, we develop the digital twin model using a real-to-sim approach. This involves collecting synchronized cross-domain data, including finite element method results and the physical sensor’s outputs, and then training neural networks to map simulated data to real sensor responses. Through experimental evaluation, we characterized the sensitivity of the physical sensor and demonstrated the consistency of the digital twin in replicating the physical sensor’s output. Furthermore, by conducting an object classification task, we showed that simulation data generated by our digital twin sensor can effectively augment real-world data, leading to improved accuracy. These results highlight TwinTac’s potential to bridge the gap in cross-domain learning tasks. 基于强化学习的机器人技能习得过程通常依赖仿真技术来高效生成大规模交互数据。然而，触觉传感器的仿真模型缺失阻碍了触觉感知在技能学习中的应用，限制了基于触觉感知的高效策略开发。为弥补这一缺口，我们提出 TwinTac 系统，该系统将物理触觉传感器设计与其数字孪生模型相结合。我们的硬件传感器专为高灵敏度与宽测量范围设计，可提供物体交互任务所需的高质量感知数据。基于硬件传感器，我们采用实物-仿真协同方法构建数字孪生模型。该方法通过同步采集跨域数据（包括有限元计算结果与物理传感器输出），训练神经网络实现仿真数据与真实传感器响应的映射。实验评估揭示了物理传感器的灵敏特性，并验证了数字孪生在复现物理传感器输出方面的一致性。此外，通过进行目标分类任务，我们展示了由我们的数字孪生传感器生成的仿真数据可以有效地增强真实世界数据，从而提高准确性。这些结果突显了 TwinTac 在弥合跨域学习任务差距方面的潜力。

Subjects: Robotics, Artificial Intelligence 主题：机器人学，人工智能

Publish: 2025-09-12 08:51:28 UTC 发布：2025-09-12 08:51:28 UTC

#46 Multimodal Mathematical Reasoning Embedded in Aerial Vehicle Imagery: Benchmarking, Analysis, and Exploration #46 多模态数学推理嵌入航拍图像：基准测试、分析与探索

Mathematical reasoning is critical for tasks such as precise distance and area computations, trajectory estimations, and spatial analysis in unmanned aerial vehicle (UAV) based remote sensing, yet current vision-language models (VLMs) have not been adequately tested in this domain. To address this gap, we introduce AVI-Math, the first benchmark to rigorously evaluate multimodal mathematical reasoning in aerial vehicle imagery, moving beyond simple counting tasks to include domain-specific knowledge in areas such as geometry, logic, and algebra. The dataset comprises 3,773 high-quality vehicle-related questions captured from UAV views, covering 6 mathematical subjects and 20 topics. The data, collected at varying altitudes and from multiple UAV angles, reflects real-world UAV scenarios, ensuring the diversity and complexity of the constructed mathematical problems. In this paper, we benchmark 14 prominent VLMs through a comprehensive evaluation and demonstrate that, despite their success on previous multimodal benchmarks, these models struggle with the reasoning tasks in AVI-Math. Our detailed analysis highlights significant limitations in the mathematical reasoning capabilities of current VLMs and suggests avenues for future research. Furthermore, we explore the use of Chain-of-Thought prompting and fine-tuning techniques, which show promise in addressing the reasoning challenges in AVI-Math. Our findings not only expose the limitations of VLMs in mathematical reasoning but also offer valuable insights for advancing UAV-based trustworthy VLMs in real-world applications. The code, and datasets will be released at https://github.com/VisionXLab/avi-math 数学推理对于无人驾驶航空器（UAV）遥感中的精确距离与面积计算、轨迹估算及空间分析等任务至关重要，但现有的视觉语言模型（VLMs）在此领域尚未得到充分验证。为弥补这一缺口，我们推出 AVI-Math——首个严格评估飞行器影像中多模态数学推理能力的基准测试，其不仅超越简单计数任务，更涵盖几何、逻辑与代数等领域的专业知识。该数据集包含 3,773 道由无人机视角采集的高质量飞行器相关问题，覆盖 6 个数学学科及 20 个主题。数据在不同高度和多角度无人机视角下采集，真实还原了无人机场景，确保构建的数学问题兼具多样性与复杂性。本文通过全面评估对 14 个主流 VLMs 进行基准测试，结果表明：尽管这些模型在先前多模态基准测试中表现优异，但在 AVI-Math 的推理任务中仍显力不从心。我们的详细分析揭示了当前视觉语言模型在数学推理能力方面的重大局限，并为未来研究指明了方向。此外，我们探索了链式思维提示与微调技术，这些方法在解决 AVI-Math 中的推理挑战方面展现出潜力。研究成果不仅揭示了视觉语言模型在数学推理中的局限性，更为推进基于无人机的可信视觉语言模型在实际应用中的发展提供了宝贵见解。相关代码及数据集将于 https://github.com/VisionXLab/avi-math 发布。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-09-12 08:46:49 UTC 发表：2025-09-12 08:46:49 协调世界时 (UTC)

#47 Reinforcement learning for spin torque oscillator tasks #47 强化学习用于自旋转矩振荡器任务

Authors: [Jakub Mojsiejuk](https://arxiv.org/search/?searchtype=author&query=Jakub Mojsiejuk), [Sławomir Ziętek](https://arxiv.org/search/?searchtype=author&query=Sławomir Ziętek), [Witold Skowroński](https://arxiv.org/search/?searchtype=author&query=Witold Skowroński) 作者：Jakub Mojsiejuk、Sławomir Ziętek、Witold Skowroński

We address the problem of automatic synchronisation of the spintronic oscillator (STO) by means of reinforcement learning (RL). A numerical solution of the macrospin Landau-Lifschitz-Gilbert-Slonczewski equation is used to simulate the STO and we train the two types of RL agents to synchronise with a target frequency within a fixed number of steps. We explore modifications to this base task and show an improvement in both convergence and energy efficiency of the synchronisation that can be easily achieved in the simulated environment. 我们研究通过强化学习（RL）实现自旋电子振荡器（STO）自动同步的问题。采用宏自旋朗道—里夫希茨—吉尔伯特—斯隆切夫斯基方程的数值解来模拟 STO，并训练两类强化学习智能体在固定步数内与目标频率实现同步。我们探讨对该基本任务的修改，并展示在模拟环境中可轻易实现的同步性收敛性和能效方面的提升。

Subjects: Applied Physics, Artificial Intelligence, Machine Learning 学科：应用物理学、人工智能、机器学习

Publish: 2025-09-12 08:41:39 UTC 发布：2025-09-12 08:41:39 UTC

#48 Exploring Expert Specialization through Unsupervised Training in Sparse Mixture of Experts #48 通过稀疏专家混合中的无监督训练探索专家专业化

Authors: [Strahinja Nikolic](https://arxiv.org/search/?searchtype=author&query=Strahinja Nikolic), [Ilker Oguz](https://arxiv.org/search/?searchtype=author&query=Ilker Oguz), [Demetri Psaltis](https://arxiv.org/search/?searchtype=author&query=Demetri Psaltis) 作者：Strahinja Nikolic、Ilker Oguz、Demetri Psaltis

Understanding the internal organization of neural networks remains a fundamental challenge in deep learning interpretability. We address this challenge by exploring a novel Sparse Mixture of Experts Variational Autoencoder (SMoE-VAE) architecture. We test our model on the QuickDraw dataset, comparing unsupervised expert routing against a supervised baseline guided by ground-truth labels. Surprisingly, we find that unsupervised routing consistently achieves superior reconstruction performance. The experts learn to identify meaningful sub-categorical structures that often transcend human-defined class boundaries. Through t-SNE visualizations and reconstruction analysis, we investigate how MoE models uncover fundamental data structures that are more aligned with the model’s objective than predefined labels. Furthermore, our study on the impact of dataset size provides insights into the trade-offs between data quantity and expert specialization, offering guidance for designing efficient MoE architectures. 理解神经网络内部组织结构仍然是深度学习可解释性方面的一个基础性挑战。我们通过探索一种新颖的稀疏专家混合变分自编码器（SMoE-VAE）架构来应对这一挑战。我们在 QuickDraw 数据集上测试了模型，比较了由无监督专家路由与由真实标签指导的有监督基线的表现。令人惊讶的是，我们发现无监督路由在重构性能上始终优于有监督路由。专家们学会识别有意义的子类别结构，这些结构常常超越人类定义的类别界限。通过 t-SNE 可视化和重构分析，我们探讨了 MoE 模型如何发现更符合模型目标而非预定义标签的基础数据结构。此外，我们关于数据集规模影响的研究提供了有关数据量与专家专业化之间权衡的见解，为设计高效的 MoE 架构提供了指导。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-09-12 07:45:10 UTC 发布时间：2025-09-12 07:45:10 UTC

#49 Intrinsic Dimension Estimating Autoencoder (IDEA) Using CancelOut Layer and a Projected Loss #49 本征维度估计自编码器（IDEA）：使用 CancelOut 层和投影损失

Authors: [Antoine Orioua](https://arxiv.org/search/?searchtype=author&query=Antoine Orioua), [Philipp Krah](https://arxiv.org/search/?searchtype=author&query=Philipp Krah), [Julian Koellermeier](https://arxiv.org/search/?searchtype=author&query=Julian Koellermeier) 作者：Antoine Orioua、Philipp Krah、Julian Koellermeier

This paper introduces the Intrinsic Dimension Estimating Autoencoder (IDEA), which identifies the underlying intrinsic dimension of a wide range of datasets whose samples lie on either linear or nonlinear manifolds. Beyond estimating the intrinsic dimension, IDEA is also able to reconstruct the original dataset after projecting it onto the corresponding latent space, which is structured using re-weighted double CancelOut layers. Our key contribution is the introduction of the projected reconstruction loss term, guiding the training of the model by continuously assessing the reconstruction quality under the removal of an additional latent dimension. We first assess the performance of IDEA on a series of theoretical benchmarks to validate its robustness. These experiments allow us to test its reconstruction ability and compare its performance with state-of-the-art intrinsic dimension estimators. The benchmarks show good accuracy and high versatility of our approach. Subsequently, we apply our model to data generated from the numerical solution of a vertically resolved one-dimensional free-surface flow, following a pointwise discretization of the vertical velocity profile in the horizontal direction, vertical direction, and time. IDEA succeeds in estimating the dataset’s intrinsic dimension and then reconstructs the original solution by working directly within the projection space identified by the network. 本文提出了内在维度估计自编码器（IDEA），用于识别样本位于线性或非线性流形上的各种数据集的底层内在维度。除了估计内在维度外，IDEA 还能在将数据投影到相应的潜在空间后重构原始数据，该潜在空间通过重加权的双重 CancelOut 层进行结构化。我们的关键贡献是引入投影重构损失项，通过在移除额外潜在维度的情况下持续评估重构质量来引导模型训练。我们首先在一系列理论基准上评估了 IDEA 的性能以验证其稳健性。这些实验使我们能够测试其重构能力并将其性能与最先进的内在维度估计器进行比较。基准测试显示了我们方法的良好准确性和高度通用性。随后，我们将模型应用于来自垂向分辨的一维自由表面流的数值解生成的数据，该数值解对垂直速度剖面在水平方向、垂直方向和时间上进行了逐点离散化。IDEA 成功估计了数据集的内在维度，随后在网络识别出的投影空间中直接工作，从而重建了原始解。

Subjects: Machine Learning, Artificial Intelligence, Numerical Analysis 主题：机器学习、人工智能、数值分析

Publish: 2025-09-12 07:11:05 UTC 发布：2025-09-12 07:11:05 UTC

#50 Unsupervised Hallucination Detection by Inspecting Reasoning Processes #50 通过检查推理过程进行无监督幻觉检测

Unsupervised hallucination detection aims to identify hallucinated content generated by large language models (LLMs) without relying on labeled data. While unsupervised methods have gained popularity by eliminating labor-intensive human annotations, they frequently rely on proxy signals unrelated to factual correctness. This misalignment biases detection probes toward superficial or non-truth-related aspects, limiting generalizability across datasets and scenarios. To overcome these limitations, we propose IRIS, an unsupervised hallucination detection framework, leveraging internal representations intrinsic to factual correctness. IRIS prompts the LLM to carefully verify the truthfulness of a given statement, and obtain its contextualized embedding as informative features for training. Meanwhile, the uncertainty of each response is considered a soft pseudolabel for truthfulness. Experimental results demonstrate that IRIS consistently outperforms existing unsupervised methods. Our approach is fully unsupervised, computationally low cost, and works well even with few training data, making it suitable for real-time detection. 无监督幻觉检测旨在在不依赖标注数据的情况下识别大型语言模型（LLMs）生成的幻觉内容。尽管无监督方法通过消除费力的人类注释而越来越受欢迎，但它们常常依赖与事实正确性无关的代理信号。这种不一致使检测探针偏向表面或与事实无关的方面，限制了在不同数据集和场景间的泛化能力。为克服这些限制，我们提出了 IRIS，一种无监督幻觉检测框架，利用与事实正确性内在相关的内部表示。IRIS 促使 LLM 仔细核验给定陈述的真实性，并将其上下文化的嵌入作为用于训练的信息性特征。同时，每个响应的不确定性被视为真实性的软伪标签。实验结果表明，IRIS 在各项指标上持续优于现有的无监督方法。我们的方法完全无监督、计算成本低，即使在少量训练数据下也能良好工作，适用于实时检测。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-09-12 06:58:17 UTC 发布时间：2025-09-12 06:58:17 UTC

#51 Drone-Based Multispectral Imaging and Deep Learning for Timely Detection of Branched Broomrape in Tomato Farms #51 基于无人机多光谱成像与深度学习的番茄农场分枝菟丝子实时检测研究 [PDF ] [Copy] [Kimi ] [REL]

Authors: [Mohammadreza Narimani](https://arxiv.org/search/?searchtype=author&query=Mohammadreza Narimani), [Alireza Pourreza](https://arxiv.org/search/?searchtype=author&query=Alireza Pourreza), [Ali Moghimi](https://arxiv.org/search/?searchtype=author&query=Ali Moghimi), [Mohsen Mesgaran](https://arxiv.org/search/?searchtype=author&query=Mohsen Mesgaran), [Parastoo Farajpoor](https://arxiv.org/search/?searchtype=author&query=Parastoo Farajpoor), [Hamid Jafarbiglu](https://arxiv.org/search/?searchtype=author&query=Hamid Jafarbiglu) 作者：Mohammadreza Narimani、Alireza Pourreza、Ali Moghimi、Mohsen Mesgaran、Parastoo Farajpoor、Hamid Jafarbiglu

This study addresses the escalating threat of branched broomrape (Phelipanche ramosa) to California’s tomato industry, which supplies over 90 percent of U.S. processing tomatoes. The parasite’s largely underground life cycle makes early detection difficult, while conventional chemical controls are costly, environmentally harmful, and often ineffective. To address this, we combined drone-based multispectral imagery with Long Short-Term Memory (LSTM) deep learning networks, using the Synthetic Minority Over-sampling Technique (SMOTE) to handle class imbalance. Research was conducted on a known broomrape-infested tomato farm in Woodland, Yolo County, CA, across five key growth stages determined by growing degree days (GDD). Multispectral images were processed to isolate tomato canopy reflectance. At 897 GDD, broomrape could be detected with 79.09 percent overall accuracy and 70.36 percent recall without integrating later stages. Incorporating sequential growth stages with LSTM improved detection substantially. The best-performing scenario, which integrated all growth stages with SMOTE augmentation, achieved 88.37 percent overall accuracy and 95.37 percent recall. These results demonstrate the strong potential of temporal multispectral analysis and LSTM networks for early broomrape detection. While further real-world data collection is needed for practical deployment, this study shows that UAV-based multispectral sensing coupled with deep learning could provide a powerful precision agriculture tool to reduce losses and improve sustainability in tomato production. 本研究针对分枝菟丝子（Phelipanche ramosa）对加州番茄产业日益加剧的威胁展开探讨。该产业供应着全美 90%以上的加工番茄。该寄生植物主要在地下完成生命周期，导致早期检测困难，而传统化学防治手段成本高昂、危害环境且效果有限。为此，我们结合无人机多光谱成像与长短期记忆（LSTM）深度学习网络，并采用合成少数类过采样技术（SMOTE）处理数据类不平衡问题。研究选址于加州约洛县伍德兰市一处已知寄生菟丝子侵染的番茄农场，依据生长度日（GDD）划分五个关键生长阶段进行监测。通过多光谱图像处理技术，成功分离番茄冠层反射率数据。在 897 生长度日时，不整合后期阶段即可实现 79.09%总体准确率与 70.36%召回率的孢子草检测。通过 LSTM 整合序列生长阶段显著提升了检测效果。最佳方案采用 SMOTE 数据增强技术整合所有生长阶段，最终达成 88.37%总体准确率与 95.37%召回率。这些结果展示了时序多光谱分析和 LSTM 网络在早期螨草检测方面的强大潜力。尽管为了实际部署仍需进一步收集真实世界的数据，本研究表明，基于无人机的多光谱感测结合深度学习有望成为一种强有力的精准农业工具，从而减少损失并改善番茄生产的可持续性。

Subjects: Image and Video Processing, Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning 主题：图像与视频处理、人工智能、计算机视觉与模式识别、机器学习

Publish: 2025-09-12 05:16:56 UTC 发布时间：2025-09-12 05:16:56 UTC

#52 Securing LLM-Generated Embedded Firmware through AI Agent-Driven Validation and Patching #52 通过 AI 代理驱动的验证与修补保护 LLM 生成的嵌入式固件

Authors: [Seyed Moein Abtahi](https://arxiv.org/search/?searchtype=author&query=Seyed Moein Abtahi), [Akramul Azim](https://arxiv.org/search/?searchtype=author&query=Akramul Azim) 作者：Seyed Moein Abtahi、Akramul Azim

Large Language Models (LLMs) show promise in generating firmware for embedded systems, but often introduce security flaws and fail to meet real-time performance constraints. This paper proposes a three-phase methodology that combines LLM-based firmware generation with automated security validation and iterative refinement in a virtualized environment. Using structured prompts, models like GPT-4 generate firmware for networking and control tasks, deployed on FreeRTOS via QEMU. These implementations are tested using fuzzing, static analysis, and runtime monitoring to detect vulnerabilities such as buffer overflows (CWE-120), race conditions (CWE-362), and denial-of-service threats (CWE-400). Specialized AI agents for Threat Detection, Performance Optimization, and Compliance Verification collaborate to improve detection and remediation. Identified issues are categorized using CWE, then used to prompt targeted LLM-generated patches in an iterative loop. Experiments show a 92.4% Vulnerability Remediation Rate (37.3% improvement), 95.8% Threat Model Compliance, and 0.87 Security Coverage Index. Real-time metrics include 8.6ms worst-case execution time and 195{\mu}s jitter. This process enhances firmware security and performance while contributing an open-source dataset for future research. 大型语言模型（LLMs）在生成嵌入式系统固件方面展现出潜力，但常引入安全漏洞且无法满足实时性能要求。本文提出三阶段方法论，在虚拟化环境中将基于 LLM 的固件生成与自动化安全验证及迭代优化相结合。通过结构化提示，GPT-4 等模型可生成网络与控制任务固件，经 QEMU 部署于 FreeRTOS 系统。采用模糊测试、静态分析及运行时监控对实现方案进行检测，识别缓冲区溢出（CWE-120）、竞争条件（CWE-362）及拒绝服务威胁（CWE-400）等漏洞。专用的威胁检测、性能优化及合规验证 AI 代理协同工作，提升检测与修复效能。通过 CWE 分类识别问题后，在迭代循环中生成针对性的大语言模型补丁。实验数据显示：漏洞修复率达 92.4%（提升 37.3%）、威胁模型合规率 95.8%、安全覆盖指数 0.87。实时指标包括：最坏情况执行时间 8.6 毫秒，抖动值 195 微秒。该过程在提升固件安全性和性能的同时，还为未来的研究贡献了一个开源数据集。

Subjects: Cryptography and Security, Artificial Intelligence 主题：密码学与安全性 , 人工智能

Publish: 2025-09-12 05:15:35 UTC 发布：2025-09-12 05:15:35 协调世界时 (UTC)

#53 Large Language Models Meet Legal Artificial Intelligence: A Survey #53 大型语言模型遇上法律人工智能：一项综述 [PDF ] [Copy] [Kimi 1 ] [REL]

Authors: [Zhitian Hou](https://arxiv.org/search/?searchtype=author&query=Zhitian Hou), [Zihan Ye](https://arxiv.org/search/?searchtype=author&query=Zihan Ye), [Nanli Zeng](https://arxiv.org/search/?searchtype=author&query=Nanli Zeng), [Tianyong Hao](https://arxiv.org/search/?searchtype=author&query=Tianyong Hao), [Kun Zeng](https://arxiv.org/search/?searchtype=author&query=Kun Zeng) 作者：侯志天、叶子涵、曾楠丽、郝天勇、曾坤

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-09-12 05:08:11 UTC 发布时间：2025-09-12 05:08:11 UTC

#54 Limited Reference, Reliable Generation: A Two-Component Framework for Tabular Data Generation in Low-Data Regimes #54 有限参考、可靠生成：适用于低数据情形的表格数据生成双组件框架

Authors: [Mingxuan Jiang](https://arxiv.org/search/?searchtype=author&query=Mingxuan Jiang), [Yongxin Wang](https://arxiv.org/search/?searchtype=author&query=Yongxin Wang), [Ziyue Dai](https://arxiv.org/search/?searchtype=author&query=Ziyue Dai), [Yicun Liu](https://arxiv.org/search/?searchtype=author&query=Yicun Liu), [Hongyi Nie](https://arxiv.org/search/?searchtype=author&query=Hongyi Nie), [Sen Liu](https://arxiv.org/search/?searchtype=author&query=Sen Liu), [Hongfeng Chai](https://arxiv.org/search/?searchtype=author&query=Hongfeng Chai) 作者：蒋明轩、王永新、戴紫月、刘逸存、聂鸿毅、刘森、柴洪峰

Synthetic tabular data generation is increasingly essential in data management, supporting downstream applications when real-world and high-quality tabular data is insufficient. Existing tabular generation approaches, such as generative adversarial networks (GANs), diffusion models, and fine-tuned Large Language Models (LLMs), typically require sufficient reference data, limiting their effectiveness in domain-specific databases with scarce records. While prompt-based LLMs offer flexibility without parameter tuning, they often fail to capture dataset-specific feature-label dependencies and generate redundant data, leading to degradation in downstream task performance. To overcome these issues, we propose ReFine, a framework that (i) derives symbolic “if-then” rules from interpretable models and embeds them into prompts to explicitly guide generation toward domain-specific feature distribution, and (ii) applies a dual-granularity filtering strategy that suppresses over-sampling patterns and selectively refines rare but informative samples to reduce distributional imbalance. Extensive experiments on various regression and classification benchmarks demonstrate that ReFine consistently outperforms state-of-the-art methods, achieving up to 0.44 absolute improvement in R-squared for regression and 10.0 percent relative improvement in F1 score for classification tasks. 合成表格数据生成在数据管理中日益重要，当现实世界的高质量表格数据不足时，它能为下游应用提供支持。现有表格生成方法（如生成对抗网络 GANs、扩散模型及微调大型语言模型 LLMs）通常需要充足的参考数据，这限制了它们在记录稀缺的特定领域数据库中的应用效果。基于提示的 LLMs 虽具备无需参数调优的灵活性，却常无法捕捉数据集特有的特征-标签依赖关系，导致生成冗余数据并降低下游任务性能。为解决这些问题，我们提出 ReFine 框架：其一，从可解释模型中推导符号化"if-then"规则，将其嵌入提示语以明确引导生成符合领域特异性特征分布；其二，采用双粒度过滤策略，抑制过度采样模式的同时有选择地优化稀有但信息丰富的样本，从而缓解分布失衡问题。在各种回归和分类基准上的大量实验表明，ReFine 始终优于最先进的方法，在回归任务中 R² 值最多提高 0.44 个绝对值，在分类任务中 F1 得分最多提高 10.0% 的相对值。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-09-12 04:34:46 UTC 发布时间：2025-09-12 04:34:46 协调世界时 (UTC)

#55 Zero-Shot Referring Expression Comprehension via Visual-Language True/False Verification #55 零样本指代表达理解：通过视觉-语言真/假验证

Authors: [Jeffrey Liu](https://arxiv.org/search/?searchtype=author&query=Jeffrey Liu), [Rongbin Hu](https://arxiv.org/search/?searchtype=author&query=Rongbin Hu) 作者：Jeffrey Liu，Rongbin Hu

Referring Expression Comprehension (REC) is usually addressed with task-trained grounding models. We show that a zero-shot workflow, without any REC-specific training, can achieve competitive or superior performance. Our approach reformulates REC as box-wise visual-language verification: given proposals from a COCO-clean generic detector (YOLO-World), a general-purpose VLM independently answers True/False queries for each region. This simple procedure reduces cross-box interference, supports abstention and multiple matches, and requires no fine-tuning. On RefCOCO, RefCOCO+, and RefCOCOg, our method not only surpasses a zero-shot GroundingDINO baseline but also exceeds reported results for GroundingDINO trained on REC and GroundingDINO+CRG. Controlled studies with identical proposals confirm that verification significantly outperforms selection-based prompting, and results hold with open VLMs. Overall, we show that workflow design, rather than task-specific pretraining, drives strong zero-shot REC performance. 指代表达理解（Referring Expression Comprehension，REC）通常通过针对该任务训练的定位模型来解决。我们展示了一种零样本工作流程，在没有任何针对 REC 的特定训练的情况下，也能达到有竞争力甚至更优的表现。我们的方法将 REC 重新表述为基于框的视觉-语言验证：给定来自一个 COCO 清洁通用检测器（YOLO-World）的候选区域，一个通用的视觉-语言模型（VLM）对每个区域独立地回答“是/否”查询。这个简单的流程减少了跨框干扰，支持弃答和多重匹配，并且无需微调。在 RefCOCO、RefCOCO+和 RefCOCOg 上，我们的方法不仅超越了零样本的 GroundingDINO 基线，还超过了在 REC 上训练的 GroundingDINO 及 GroundingDINO+CRG 的已报道结果。使用相同候选区域的控制研究证实，验证式方法显著优于基于选择的提示法，且在开源 VLM 上也成立。总体而言，我们表明是工作流程设计而非任务特定的预训练推动了强大的零样本 REC 性能。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-09-12 04:32:52 UTC 发布日期：2025-09-12 04:32:52 UTC

#56 Adaptive Token Merging for Efficient Transformer Semantic Communication at the Edge #56 自适应令牌合并以实现边缘高效 Transformer 语义通信

Authors: [Omar Erak](https://arxiv.org/search/?searchtype=author&query=Omar Erak), [Omar Alhussein](https://arxiv.org/search/?searchtype=author&query=Omar Alhussein), [Hatem Abou-Zeid](https://arxiv.org/search/?searchtype=author&query=Hatem Abou-Zeid), [Mehdi Bennis](https://arxiv.org/search/?searchtype=author&query=Mehdi Bennis), [Sami Muhaidat](https://arxiv.org/search/?searchtype=author&query=Sami Muhaidat) 作者：奥马尔·埃拉克、奥马尔·阿尔胡赛因、哈特姆·阿布-泽德、梅迪·本尼斯、萨米·穆海达特

Large-scale transformers are central to modern semantic communication, yet their high computational and communication costs hinder deployment on resource-constrained edge devices. This paper introduces a training-free framework for adaptive token merging, a novel mechanism that compresses transformer representations at runtime by selectively merging semantically redundant tokens under per-layer similarity thresholds. Unlike prior fixed-ratio reduction, our approach couples merging directly to input redundancy, enabling data-dependent adaptation that balances efficiency and task relevance without retraining. We cast the discovery of merging strategies as a multi-objective optimization problem and leverage Bayesian optimization to obtain Pareto-optimal trade-offs between accuracy, inference cost, and communication cost. On ImageNet classification, we match the accuracy of the unmodified transformer with 30% fewer floating-point operations per second and under 20% of the original communication cost, while for visual question answering our method achieves performance competitive with the full LLaVA model at less than one-third of the compute and one-tenth of the bandwidth. Finally, we show that our adaptive merging is robust across varying channel conditions and provides inherent privacy benefits, substantially degrading the efficacy of model inversion attacks. Our framework provides a practical and versatile solution for deploying powerful transformer models in resource-limited edge intelligence scenarios. 大规模变压器是现代语义通信的核心，但其高计算和通信成本阻碍了在资源受限的边缘设备上的部署。本文提出了一种无需训练的自适应词汇合并框架，该机制通过在运行时选择性地合并语义冗余词汇（基于每层相似性阈值），实现变压器表示的压缩。不同于传统的固定比例压缩，本方法将合并操作直接与输入冗余度挂钩，实现无需重新训练的数据依赖式自适应，在效率与任务相关性间取得平衡。我们将合并策略的发现转化为多目标优化问题，并借助贝叶斯优化技术，在准确率、推理成本和通信成本间获得帕累托最优的权衡方案。在 ImageNet 分类任务中，我们以减少 30%的每秒浮点运算量和低于原始 20%的通信成本实现了与未修改 Transformer 相同的准确率；而在视觉问答任务中，本方法以不到完整 LLaVA 模型三分之一的计算量和十分之一的带宽实现了具有竞争力的性能。最后，我们展示了自适应合并在不同信道条件下的鲁棒性，并提供了内在的隐私保护，显著降低了模型反演攻击的有效性。我们的框架为在资源受限的边缘智能场景中部署强大变换器模型提供了实用且多功能的解决方案。

Subjects: Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition, Image and Video Processing 主题：机器学习、人工智能、计算机视觉与模式识别、图像与视频处理

Publish: 2025-09-12 04:11:59 UTC 发布：2025-09-12 04:11:59 世界协调时间 (UTC)

#57 SmartCoder-R1: Towards Secure and Explainable Smart Contract Generation with Security-Aware Group Relative Policy Optimization #57 SmartCoder-R1：迈向安全且可解释的智能合约生成，采用面向安全的群体相对策略优化 [PDF ] [Copy] [Kimi 1 ] [REL]

Authors: [Lei Yu](https://arxiv.org/search/?searchtype=author&query=Lei Yu), [Jingyuan Zhang](https://arxiv.org/search/?searchtype=author&query=Jingyuan Zhang), [Xin Wang](https://arxiv.org/search/?searchtype=author&query=Xin Wang), [Jiajia Ma](https://arxiv.org/search/?searchtype=author&query=Jiajia Ma), [Li Yang](https://arxiv.org/search/?searchtype=author&query=Li Yang), [Fengjun Zhang](https://arxiv.org/search/?searchtype=author&query=Fengjun Zhang) 作者：余雷，张景远，王鑫，马佳佳，杨丽，张奉军

Smart contracts automate the management of high-value assets, where vulnerabilities can lead to catastrophic financial losses. This challenge is amplified in Large Language Models (LLMs) by two interconnected failures: they operate as unauditable “black boxes” lacking a transparent reasoning process, and consequently, generate code riddled with critical security vulnerabilities. To address both issues, we propose SmartCoder-R1 (based on Qwen2.5-Coder-7B), a novel framework for secure and explainable smart contract generation. It begins with Continual Pre-training (CPT) to specialize the model. We then apply Long Chain-of-Thought Supervised Fine-Tuning (L-CoT SFT) on 7,998 expert-validated reasoning-and-code samples to train the model to emulate human security analysis. Finally, to directly mitigate vulnerabilities, we employ Security-Aware Group Relative Policy Optimization (S-GRPO), a reinforcement learning phase that refines the generation policy by optimizing a weighted reward signal for compilation success, security compliance, and format correctness. Evaluated against 17 baselines on a benchmark of 756 real-world functions, SmartCoder-R1 establishes a new state of the art, achieving top performance across five key metrics: a ComPass of 87.70%, a VulRate of 8.60%, a SafeAval of 80.16%, a FuncRate of 53.84%, and a FullRate of 50.53%. This FullRate marks a 45.79% relative improvement over the strongest baseline, DeepSeek-R1. Crucially, its generated reasoning also excels in human evaluations, achieving high-quality ratings for Functionality (82.7%), Security (85.3%), and Clarity (90.7%). 智能合约可自动管理高价值资产，其漏洞可能导致灾难性的财务损失。在大型语言模型（LLMs）中，这一挑战因两种相互关联的失败而被放大：它们以不可审计的“黑箱”形式运行，缺乏透明的推理过程，因此会生成充满关键安全漏洞的代码。为了解决这两个问题，我们提出了 SmartCoder-R1（基于 Qwen2.5-Coder-7B），这是一种用于安全且可解释的智能合约生成的新框架。该框架首先通过持续预训练（Continual Pre-training，CPT）对模型进行专业化训练。然后我们在 7,998 条经专家验证的推理与代码样本上应用长链式思维监督微调（Long Chain-of-Thought Supervised Fine-Tuning，L-CoT SFT），以训练模型模拟人类的安全分析。最后，为了直接减轻漏洞问题，我们采用了安全感知组相对策略优化（Security-Aware Group Relative Policy Optimization，S-GRPO）——一个通过优化编译成功、安全合规和格式正确性加权奖励信号来精炼生成策略的强化学习阶段。在 756 个真实世界函数的基准测试中，针对 17 个基准线进行评估，SmartCoder-R1 确立了全新技术标杆，在五大关键指标中均取得顶尖表现：ComPass 达 87.70%，VulRate 为 8.60%，SafeAval 达 80.16%，FuncRate 达 53.84%，FullRate 达 50.53%。其 FullRate 指标较最强基准 DeepSeek-R1 相对提升 45.79%。尤为关键的是，其生成的推理结果在人类评估中同样表现卓越，在功能性（82.7%）、安全性（85.3%）和清晰度（90.7%）维度均获得高质量评分。

Subjects: Cryptography and Security, Artificial Intelligence, Software Engineering 主题：密码学与安全、人工智能、软件工程

Publish: 2025-09-12 03:14:50 UTC 发布时间：2025-09-12 03:14:50 协调世界时（UTC）

#58 WALL: A Web Application for Automated Quality Assurance using Large Language Models #58 WALL：使用大型语言模型的自动化质量保证网络应用

As software projects become increasingly complex, the volume and variety of issues in code files have grown substantially. Addressing this challenge requires efficient issue detection, resolution, and evaluation tools. This paper presents WALL, a web application that integrates SonarQube and large language models (LLMs) such as GPT-3.5 Turbo and GPT-4o to automate these tasks. WALL comprises three modules: an issue extraction tool, code issues reviser, and code comparison tool. Together, they enable a seamless pipeline for detecting software issues, generating automated code revisions, and evaluating the accuracy of revisions. Our experiments, conducted on 563 files with over 7,599 issues, demonstrate WALL’s effectiveness in reducing human effort while maintaining high-quality revisions. Results show that employing a hybrid approach of cost-effective and advanced LLMs can significantly lower costs and improve revision rates. Future work aims to enhance WALL’s capabilities by integrating open-source LLMs and eliminating human intervention, paving the way for fully automated code quality management. 随着软件项目日益复杂，代码文件中的问题数量与种类均大幅增加。应对这一挑战需要高效的问题检测、修复与评估工具。本文提出 WALL——一款整合 SonarQube 与大型语言模型（如 GPT-3.5 Turbo 和 GPT-4o）的网络应用，可自动化完成这些任务。WALL 包含三大模块：问题提取工具、代码问题修订器及代码比较工具。三者协同构建了无缝流程，可实现软件问题检测、自动代码修订生成及修订准确性评估。实验在包含 563 个文件、7,599 余个问题的数据集上验证：WALL 在维持高质量修订的同时显著降低人工投入。结果表明，采用经济高效型与先进型 LLM 的混合策略可大幅降低成本并提升修订率。后续工作将通过集成开源 LLM 并消除人工干预，推动 WALL 能力升级，为全自动化代码质量管理铺平道路。

Subjects: Software Engineering, Artificial Intelligence 主题：软件工程，人工智能

Publish: 2025-09-12 01:43:26 UTC 发布：2025-09-12 01:43:26 世界协调时 (UTC)

#59 An Autoencoder and Vision Transformer-based Interpretability Analysis of the Differences in Automated Staging of Second and Third Molars #59 基于自编码器和视觉变换器的可解释性分析：第二、第三磨牙自动分期差异

Authors: [Barkin Buyukcakir](https://arxiv.org/search/?searchtype=author&query=Barkin Buyukcakir), [Jannick De Tobel](https://arxiv.org/search/?searchtype=author&query=Jannick De Tobel), [Patrick Thevissen](https://arxiv.org/search/?searchtype=author&query=Patrick Thevissen), [Dirk Vandermeulen](https://arxiv.org/search/?searchtype=author&query=Dirk Vandermeulen), [Peter Claes](https://arxiv.org/search/?searchtype=author&query=Peter Claes) 作者：Barkin Buyukcakir、Jannick De Tobel、Patrick Thevissen、Dirk Vandermeulen、Peter Claes

The practical adoption of deep learning in high-stakes forensic applications, such as dental age estimation, is often limited by the ‘black box’ nature of the models. This study introduces a framework designed to enhance both performance and transparency in this context. We use a notable performance disparity in the automated staging of mandibular second (tooth 37) and third (tooth 38) molars as a case study. The proposed framework, which combines a convolutional autoencoder (AE) with a Vision Transformer (ViT), improves classification accuracy for both teeth over a baseline ViT, increasing from 0.712 to 0.815 for tooth 37 and from 0.462 to 0.543 for tooth 38. Beyond improving performance, the framework provides multi-faceted diagnostic insights. Analysis of the AE’s latent space metrics and image reconstructions indicates that the remaining performance gap is data-centric, suggesting high intra-class morphological variability in the tooth 38 dataset is a primary limiting factor. This work highlights the insufficiency of relying on a single mode of interpretability, such as attention maps, which can appear anatomically plausible yet fail to identify underlying data issues. By offering a methodology that both enhances accuracy and provides evidence for why a model may be uncertain, this framework serves as a more robust tool to support expert decision-making in forensic age estimation. 在高风险法医应用（如牙齿年龄估算）中，深度学习的实际应用常受限于模型的"黑箱"特性。本研究提出一种框架，旨在同时提升该领域的性能与透明度。我们以自动分级下颌第二磨牙（37 号牙）与第三磨牙（38 号牙）时显著的性能差异为案例展开研究。该框架融合卷积自编码器（AE）与视觉变换器（ViT），使两颗牙齿的分类准确率均超越基线 ViT 模型：牙 37 从 0.712 提升至 0.815，牙 38 从 0.462 提升至 0.543。除性能优化外，该框架还提供多维诊断洞察。通过分析 AE 的潜在空间指标与图像重建结果，发现剩余性能差距源于数据特性——牙 38 数据集内存在显著的形态学变异性，这是主要限制因素。本研究揭示了仅依赖单一可解释性模式（如注意力图）的局限性：这类可视化结果虽在解剖学上看似合理，却无法揭示潜在的数据问题。通过提供一种既能提高准确性又能为模型不确定性提供依据的方法论，该框架成为支持法医年龄鉴定领域专家决策的更强大工具。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-09-12 00:54:07 UTC 发布时间：2025-09-12 00:54:07 UTC

#60 Tackling One Health Risks: How Large Language Models are leveraged for Risk Negotiation and Consensus-building #60 应对一体化健康风险：大型语言模型如何被用于风险协商与共识构建

Key global challenges of our times are characterized by complex interdependencies and can only be effectively addressed through an integrated, participatory effort. Conventional risk analysis frameworks often reduce complexity to ensure manageability, creating silos that hinder comprehensive solutions. A fundamental shift towards holistic strategies is essential to enable effective negotiations between different sectors and to balance the competing interests of stakeholders. However, achieving this balance is often hindered by limited time, vast amounts of information, and the complexity of integrating diverse perspectives. This study presents an AI-assisted negotiation framework that incorporates large language models (LLMs) and AI-based autonomous agents into a negotiation-centered risk analysis workflow. The framework enables stakeholders to simulate negotiations, systematically model dynamics, anticipate compromises, and evaluate solution impacts. By leveraging LLMs’ semantic analysis capabilities we could mitigate information overload and augment decision-making process under time constraints. Proof-of-concept implementations were conducted in two real-world scenarios: (i) prudent use of a biopesticide, and (ii) targeted wild animal population control. Our work demonstrates the potential of AI-assisted negotiation to address the current lack of tools for cross-sectoral engagement. Importantly, the solution’s open source, web based design, suits for application by a broader audience with limited resources and enables users to tailor and develop it for their own needs. 当今时代的关键全球性挑战具有复杂的相互依存性，唯有通过综合性、参与式协作方能有效应对。传统风险分析框架常为确保可控性而简化复杂性，形成阻碍综合解决方案的孤岛。向整体性战略的根本性转变至关重要，这能促进不同领域间的有效协商，并平衡利益相关方的竞争性诉求。然而，时间有限、信息庞杂以及整合多元视角的复杂性，往往阻碍这种平衡的实现。本研究提出一种人工智能辅助谈判框架，将大型语言模型（LLMs）与基于人工智能的自主代理融入以谈判为核心的风险分析工作流。该框架使利益相关方能够模拟谈判过程、系统建模动态关系、预判妥协方案并评估解决方案的影响。通过运用大型语言模型的语义分析能力，我们能在时间限制下缓解信息过载问题并增强决策过程。概念验证实现已在两个现实场景中进行： (i) 谨慎使用生物农药，和 (ii) 有针对性的野生动物种群控制。我们的工作展示了人工智能辅助协商在应对当前跨部门参与工具缺乏问题上的潜力。重要的是，该解决方案采用开源、基于网络的设计，适合资源有限的更广泛受众应用，并使用户能够根据自身需求定制和开发。

Subjects: Multiagent Systems, Artificial Intelligence 主题：多智能体系统，人工智能

Publish: 2025-09-12 00:25:20 UTC 发布：2025-09-12 00:25:20 UTC

#61 Self-Augmented Robot Trajectory: Efficient Imitation Learning via Safe Self-augmentation with Demonstrator-annotated Precision #61 自我增强机器人轨迹：通过示范者标注精度的安全自我增强实现高效模仿学习

Authors: [Hanbit Oh](https://arxiv.org/search/?searchtype=author&query=Hanbit Oh), [Masaki Murooka](https://arxiv.org/search/?searchtype=author&query=Masaki Murooka), [Tomohiro Motoda](https://arxiv.org/search/?searchtype=author&query=Tomohiro Motoda), [Ryoichi Nakajo](https://arxiv.org/search/?searchtype=author&query=Ryoichi Nakajo), [Yukiyasu Domae](https://arxiv.org/search/?searchtype=author&query=Yukiyasu Domae) 作者：Oh Hanbit、Masaki Murooka、Tomohiro Motoda、Ryoichi Nakajo、Yukiyasu Domae

Imitation learning is a promising paradigm for training robot agents; however, standard approaches typically require substantial data acquisition – via numerous demonstrations or random exploration – to ensure reliable performance. Although exploration reduces human effort, it lacks safety guarantees and often results in frequent collisions – particularly in clearance-limited tasks (e.g., peg-in-hole) – thereby, necessitating manual environmental resets and imposing additional human burden. This study proposes Self-Augmented Robot Trajectory (SART), a framework that enables policy learning from a single human demonstration, while safely expanding the dataset through autonomous augmentation. SART consists of two stages: (1) human teaching only once, where a single demonstration is provided and precision boundaries – represented as spheres around key waypoints – are annotated, followed by one environment reset; (2) robot self-augmentation, where the robot generates diverse, collision-free trajectories within these boundaries and reconnects to the original demonstration. This design improves the data collection efficiency by minimizing human effort while ensuring safety. Extensive evaluations in simulation and real-world manipulation tasks show that SART achieves substantially higher success rates than policies trained solely on human-collected demonstrations. Video results available at https://sites.google.com/view/sart-il . 模仿学习是训练机器人代理的有效范式，但传统方法通常需要通过大量演示或随机探索获取海量数据才能确保可靠性能。尽管探索能减轻人力投入，却缺乏安全保障且常导致频繁碰撞——尤其在空间受限任务（如插销入孔）中——这迫使操作者手动重置环境，增加了额外负担。本研究提出自增强机器人轨迹（SART）框架，该方法仅需单次人类示范即可实现策略学习，同时通过自主增强技术安全扩展数据集。SART 包含两个阶段：(1) 仅需一次人类教学：提供单次演示并标注精度边界（以关键路径点周围的球体表示），随后进行一次环境重置；(2) 机器人自我增强：在边界范围内生成多样化无碰撞轨迹，并重新连接至原始演示。该设计通过最小化人力投入提升数据采集效率，同时确保安全性。在模拟环境和真实世界操作任务中的广泛评估表明，SART 策略的成功率显著高于仅基于人类收集的示范数据训练的策略。视频演示可访问：https://sites.google.com/view/sart-il。

Subjects: Robotics, Artificial Intelligence 主题：机器人学，人工智能

Publish: 2025-09-11 23:10:56 UTC 发布时间：2025-09-11 23:10:56 世界协调时 (UTC)

#62 Automated Tuning for Diffusion Inverse Problem Solvers without Generative Prior Retraining #62 无需对生成先验重训练的扩散逆问题解算器自动调优 [PDF 3 ] [Copy] [Kimi 1 ] [REL]

Authors: [Yaşar Utku Alçalar](https://arxiv.org/search/?searchtype=author&query=Yaşar Utku Alçalar), [Junno Yun](https://arxiv.org/search/?searchtype=author&query=Junno Yun), [Mehmet Akçakaya](https://arxiv.org/search/?searchtype=author&query=Mehmet Akçakaya) 作者：亚萨尔·乌特库·阿尔查拉尔、云俊诺、梅赫梅特·阿克恰卡亚

Diffusion/score-based models have recently emerged as powerful generative priors for solving inverse problems, including accelerated MRI reconstruction. While their flexibility allows decoupling the measurement model from the learned prior, their performance heavily depends on carefully tuned data fidelity weights, especially under fast sampling schedules with few denoising steps. Existing approaches often rely on heuristics or fixed weights, which fail to generalize across varying measurement conditions and irregular timestep schedules. In this work, we propose Zero-shot Adaptive Diffusion Sampling (ZADS), a test-time optimization method that adaptively tunes fidelity weights across arbitrary noise schedules without requiring retraining of the diffusion prior. ZADS treats the denoising process as a fixed unrolled sampler and optimizes fidelity weights in a self-supervised manner using only undersampled measurements. Experiments on the fastMRI knee dataset demonstrate that ZADS consistently outperforms both traditional compressed sensing and recent diffusion-based methods, showcasing its ability to deliver high-fidelity reconstructions across varying noise schedules and acquisition settings. 扩散/基于评分模型近期作为强大的生成先验模型，在解决逆问题（包括加速 MRI 重建）中崭露头角。尽管其灵活性可实现测量模型与学习先验的解耦，但其性能高度依赖于精心调优的数据保真度权重——尤其在降噪步骤较少的快速采样方案中。现有方法常依赖启发式或固定权重，难以适应多变的测量条件和不规则时间步长方案。本文提出零次自适应扩散采样（ZADS）——一种测试时优化方法，可在任意噪声方案下自适应调节保真度权重，且无需重新训练扩散先验。ZADS 将去噪过程视为固定展开采样器，仅利用欠采样测量数据以自监督方式优化保真度权重。在 fastMRI 膝盖数据集上的实验表明，ZADS 始终优于传统的压缩感知方法和近期的基于扩散的方法，展示了其在不同噪声计划和采集设置下提供高保真重建的能力。

Subjects: Image and Video Processing, Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning, Medical Physics 主题：图像与视频处理、人工智能、计算机视觉与模式识别、机器学习、医学物理

Publish: 2025-09-11 22:22:32 UTC 发布：2025-09-11 22:22:32 协调世界时 (UTC)

#63 From Hugging Face to GitHub: Tracing License Drift in the Open-Source AI Ecosystem #63 从 Hugging Face 到 GitHub：追踪开源人工智能生态系统中的许可漂移

Authors: [James Jewitt](https://arxiv.org/search/?searchtype=author&query=James Jewitt), [Hao Li](https://arxiv.org/search/?searchtype=author&query=Hao Li), [Bram Adams](https://arxiv.org/search/?searchtype=author&query=Bram Adams), [Gopi Krishnan Rajbahadur](https://arxiv.org/search/?searchtype=author&query=Gopi Krishnan Rajbahadur), [Ahmed E. Hassan](https://arxiv.org/search/?searchtype=author&query=Ahmed E. Hassan) 作者：James Jewitt、Hao Li、Bram Adams、Gopi Krishnan Rajbahadur、Ahmed E. Hassan

Hidden license conflicts in the open-source AI ecosystem pose serious legal and ethical risks, exposing organizations to potential litigation and users to undisclosed risk. However, the field lacks a data-driven understanding of how frequently these conflicts occur, where they originate, and which communities are most affected. We present the first end-to-end audit of licenses for datasets and models on Hugging Face, as well as their downstream integration into open-source software applications, covering 364 thousand datasets, 1.6 million models, and 140 thousand GitHub projects. Our empirical analysis reveals systemic non-compliance in which 35.5% of model-to-application transitions eliminate restrictive license clauses by relicensing under permissive terms. In addition, we prototype an extensible rule engine that encodes almost 200 SPDX and model-specific clauses for detecting license conflicts, which can solve 86.4% of license conflicts in software applications. To support future research, we release our dataset and the prototype engine. Our study highlights license compliance as a critical governance challenge in open-source AI and provides both the data and tools necessary to enable automated, AI-aware compliance at scale. 开源人工智能生态系统中隐藏的许可证冲突带来严重的法律和道德风险，使组织面临潜在诉讼风险，用户则暴露于未披露的风险之中。然而该领域缺乏数据驱动的认知，无法了解此类冲突的发生频率、起源地以及受影响最严重的社区。我们首次对 Hugging Face 平台上的数据集与模型许可证进行端到端审计，并追踪其在开源软件应用中的下游集成情况，覆盖 36.4 万个数据集、160 万个模型及 14 万个 GitHub 项目。实证分析揭示系统性违规现象：35.5%的模型到应用程序的迁移过程中，通过重新采用宽松条款许可证来规避限制性条款。此外，我们开发了可扩展规则引擎原型，该引擎编码了近 200 条 SPDX 通用条款及模型特异性条款用于检测许可冲突，可解决软件应用中 86.4%的许可冲突问题。为支持后续研究，我们公开了数据集及原型引擎。我们的研究强调许可证合规性是开源人工智能中的一个关键治理挑战，并提供了实现大规模、具备人工智能意识的自动化合规所需的数据和工具。

Subjects: Software Engineering, Artificial Intelligence 主题：软件工程，人工智能

Publish: 2025-09-11 21:46:20 UTC 发布时间：2025-09-11 21:46:20 协调世界时（UTC）

#64 Emulating Public Opinion: A Proof-of-Concept of AI-Generated Synthetic Survey Responses for the Chilean Case #64 模拟公众舆论：一项关于人工智能生成的智利案例合成问卷回应的概念验证

Large Language Models (LLMs) offer promising avenues for methodological and applied innovations in survey research by using synthetic respondents to emulate human answers and behaviour, potentially mitigating measurement and representation errors. However, the extent to which LLMs recover aggregate item distributions remains uncertain and downstream applications risk reproducing social stereotypes and biases inherited from training data. We evaluate the reliability of LLM-generated synthetic survey responses against ground-truth human responses from a Chilean public opinion probabilistic survey. Specifically, we benchmark 128 prompt-model-question triplets, generating 189,696 synthetic profiles, and pool performance metrics (i.e., accuracy, precision, recall, and F1-score) in a meta-analysis across 128 question-subsample pairs to test for biases along key sociodemographic dimensions. The evaluation spans OpenAI’s GPT family and o-series reasoning models, as well as Llama and Qwen checkpoints. Three results stand out. First, synthetic responses achieve excellent performance on trust items (F1-score and accuracy > 0.90). Second, GPT-4o, GPT-4o-mini and Llama 4 Maverick perform comparably on this task. Third, synthetic-human alignment is highest among respondents aged 45-59. Overall, LLM-based synthetic samples approximate responses from a probabilistic sample, though with substantial item-level heterogeneity. Capturing the full nuance of public opinion remains challenging and requires careful calibration and additional distributional tests to ensure algorithmic fidelity and reduce errors. 大型语言模型（LLMs）通过使用合成受访者模拟人类回答和行为，为调查研究提供了方法论和应用创新的可行途径，有望减轻测量和表征误差。然而，LLMs 在多大程度上能还原项目分布的总体特征仍不确定，且下游应用存在复制训练数据中社会刻板印象和偏见的风险。我们基于智利一项概率性民意调查中真实人类回答的数据，评估了 LLM 生成的合成调查响应的可靠性。具体而言，我们对 128 组提示-模型-问题三元组进行基准测试，生成 189,696 份合成档案，并在 128 个问题-子样本对的元分析中汇总性能指标（即准确率、精确率、召回率和 F1 分数），以检验关键社会人口统计维度上的偏见。评估范围涵盖 OpenAI 的 GPT 家族、o 系列推理模型，以及 Llama 和 Qwen 检查点。其中三项结果尤为突出：首先，合成响应在可信度项目上表现优异（F1 分数和准确率均>0.90）。其次，GPT-4o、GPT-4o-mini 和 Llama 4 Maverick 在该任务中表现相当。第三，45-59 岁受访者群体中合成数据与人类真实数据的契合度最高。总体而言，基于大型语言模型的合成样本虽能近似概率抽样结果，但存在显著的项目级异质性。要完整捕捉公众意见的细微差异仍具挑战，需通过精密校准与额外分布测试来确保算法保真度并降低误差。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-09-11 21:43:59 UTC 发布：2025-09-11 21:43:59 UTC

#65 Vibe Check: Understanding the Effects of LLM-Based Conversational Agents’ Personality and Alignment on User Perceptions in Goal-Oriented Tasks #65 氛围检测：理解基于 LLM 的会话代理在面向目标任务中人格与对齐对用户感知的影响

Subjects: Human-Computer Interaction, Artificial Intelligence, Computation and Language 主题：人机交互，人工智能，计算与语言

Publish: 2025-09-11 21:43:49 UTC 发布：2025-09-11 21:43:49 UTC

#66 Surrogate Supervision for Robust and Generalizable Deformable Image Registration #66 用替代监督实现稳健且可泛化的可变形图像配准 [PDF 1 ] [Copy] [Kimi ] [REL]

Objective: Deep learning-based deformable image registration has achieved strong accuracy, but remains sensitive to variations in input image characteristics such as artifacts, field-of-view mismatch, or modality difference. We aim to develop a general training paradigm that improves the robustness and generalizability of registration networks. Methods: We introduce surrogate supervision, which decouples the input domain from the supervision domain by applying estimated spatial transformations to surrogate images. This allows training on heterogeneous inputs while ensuring supervision is computed in domains where similarity is well defined. We evaluate the framework through three representative applications: artifact-robust brain MR registration, mask-agnostic lung CT registration, and multi-modal MR registration. Results: Across tasks, surrogate supervision demonstrated strong resilience to input variations including inhomogeneity field, inconsistent field-of-view, and modality differences, while maintaining high performance on well-curated data. Conclusions: Surrogate supervision provides a principled framework for training robust and generalizable deep learning-based registration models without increasing complexity. Significance: Surrogate supervision offers a practical pathway to more robust and generalizable medical image registration, enabling broader applicability in diverse biomedical imaging scenarios. 目标：基于深度学习的可变形图像配准已取得较高精度，但仍对输入图像特性变化敏感，例如伪影、视野不匹配或模态差异。我们旨在开发一种通用训练范式，以提高配准网络的鲁棒性和泛化能力。方法：我们引入替代监督（surrogate supervision），通过将估计的空间变换应用于替代图像来解耦输入域与监督域。这使得在异构输入上进行训练成为可能，同时确保监督在相似性定义良好的域中计算。我们通过三种代表性应用评估该框架：对伪影鲁棒的脑部 MR 配准、不依赖掩膜的肺部 CT 配准以及多模态 MR 配准。结果：在各项任务中，替代监督对包括不均匀场、视野不一致和模态差异等输入变化表现出强鲁棒性，同时在精心整理的数据上保持较高性能。结论：替代监督提供了一个有原则的框架，可在不增加复杂度的情况下训练稳健且具泛化能力的基于深度学习的配准模型。重要性：替代监督为更稳健、更具泛化能力的医学影像配准提供了实用途径，使其在多种生物医学成像场景中具有更广泛的适用性。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-09-11 21:43:45 UTC 发布时间：2025-09-11 21:43:45 UTC

#67 Latency and Token-Aware Test-Time Compute #67 延迟与基于代币的感知测试时计算

Authors: [Jenny Y. Huang](https://arxiv.org/search/?searchtype=author&query=Jenny Y. Huang), [Mehul Damani](https://arxiv.org/search/?searchtype=author&query=Mehul Damani), [Yousef El-Kurdi](https://arxiv.org/search/?searchtype=author&query=Yousef El-Kurdi), [Ramon Astudillo](https://arxiv.org/search/?searchtype=author&query=Ramon Astudillo), [Wei Sun](https://arxiv.org/search/?searchtype=author&query=Wei Sun) 作者：Jenny Y. Huang, Mehul Damani, Yousef El-Kurdi, Ramon Astudillo, Wei Sun

Inference-time scaling has emerged as a powerful way to improve large language model (LLM) performance by generating multiple candidate responses and selecting among them. However, existing work on dynamic allocation for test-time compute typically considers only parallel generation methods such as best-of-N, overlooking incremental decoding methods like beam search, and has largely ignored latency, focusing only on token usage. We formulate inference-time scaling as a problem of dynamic compute allocation and method selection, where the system must decide which strategy to apply and how much compute to allocate on a per-query basis. Our framework explicitly incorporates both token cost and wall-clock latency, the latter being critical for user experience and particularly for agentic workflows where models must issue multiple queries efficiently. Experiments on reasoning benchmarks show that our approach consistently outperforms static strategies, achieving favorable accuracy-cost trade-offs while remaining practical for deployment. 推理时扩展（inference-time scaling）已经成为提高大型语言模型（LLM）性能的强大方法，通过生成多个候选回应并在其间进行选择来实现。然而，现有关于测试时计算动态分配的工作通常只考虑并行生成方法，例如 best-of-N，忽视了像束搜索（beam search）这样的增量解码方法，而且主要忽略了延迟，只关注令牌使用。我们将推理时扩展表述为动态计算分配和方法选择的问题，系统必须决定针对每个查询应采用哪种策略以及分配多少计算资源。我们的框架明确同时考虑了令牌成本和真实时间延迟（wall-clock latency），后者对用户体验至关重要，尤其是在模型需要高效发出多个查询的智能代理工作流中。基准推理实验表明，我们的方法在准确率与成本的权衡上始终优于静态策略，同时仍然具备部署的实用性。

Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题：机器学习、人工智能、计算与语言

Publish: 2025-09-11 21:35:19 UTC 发布：2025-09-11 21:35:19 协调世界时

#68 SWE-Effi: Re-Evaluating Software AI Agent System Effectiveness Under Resource Constraints #68 SWE-Effi：在资源受限下重新评估软件人工智能代理系统的有效性

The advancement of large language models (LLMs) and code agents has demonstrated significant potential to assist software engineering (SWE) tasks, such as autonomous issue resolution and feature addition. Existing AI for software engineering leaderboards (e.g., SWE-bench) focus solely on solution accuracy, ignoring the crucial factor of effectiveness in a resource-constrained world. This is a universal problem that also exists beyond software engineering tasks: any AI system should be more than correct - it must also be cost-effective. To address this gap, we introduce SWE-Effi, a set of new metrics to re-evaluate AI systems in terms of holistic effectiveness scores. We define effectiveness as the balance between the accuracy of outcome (e.g., issue resolve rate) and the resources consumed (e.g., token and time). In this paper, we specifically focus on the software engineering scenario by re-ranking popular AI systems for issue resolution on a subset of the SWE-bench benchmark using our new multi-dimensional metrics. We found that AI system’s effectiveness depends not just on the scaffold itself, but on how well it integrates with the base model, which is key to achieving strong performance in a resource-efficient manner. We also identified systematic challenges such as the “token snowball” effect and, more significantly, a pattern of “expensive failures”. In these cases, agents consume excessive resources while stuck on unsolvable tasks - an issue that not only limits practical deployment but also drives up the cost of failed rollouts during RL training. Lastly, we observed a clear trade-off between effectiveness under the token budget and effectiveness under the time budget, which plays a crucial role in managing project budgets and enabling scalable reinforcement learning, where fast responses are essential. 大型语言模型（LLMs）和代码代理的发展已显示出在辅助软件工程（SWE）任务方面的显著潜力，例如自主解决问题和添加功能。现有的软件工程领域的 AI 排行榜（例如 SWE-bench）仅关注解决方案的准确性，忽视了在资源受限的世界中至关重要的效率因素。这是一个普遍存在的问题，且超出软件工程任务范围：任何 AI 系统不仅要正确——还必须具备成本效益。为了解决这一缺口，我们引入了 SWE-Effi，一组用于以整体效能得分重新评估 AI 系统的新指标。我们将效能定义为结果准确性（例如问题解决率）与消耗资源（例如令牌和时间）之间的平衡。在本文中，我们特别关注软件工程场景，使用我们新的多维度指标在 SWE-bench 基准的一个子集上对流行的用于问题解决的 AI 系统进行重新排序。我们发现，AI 系统的有效性不仅取决于支架本身，还取决于它与基础模型的集成程度——这是以资源高效的方式实现强劲性能的关键。我们还识别出一些系统性挑战，例如“token 雪崩”效应，更重要的是发现了“昂贵失败”的模式。在这些情况下，代理在卡在不可解决任务上时会消耗过多资源——这一问题不仅限制了实际部署，还推高了在强化学习训练中失败试验的成本。最后，我们观察到在 token 预算下的有效性与在时间预算下的有效性之间存在明显权衡，这对管理项目预算和实现可扩展的强化学习至关重要，后者要求快速响应。

Subjects: Software Engineering, Artificial Intelligence 主题：软件工程，人工智能

Publish: 2025-09-11 21:04:10 UTC 发布：2025-09-11 21:04:10 UTC

#69 HGEN: Heterogeneous Graph Ensemble Networks #69 HGEN：异构图集成网络

Authors: [Jiajun Shen](https://arxiv.org/search/?searchtype=author&query=Jiajun Shen), [Yufei Jin](https://arxiv.org/search/?searchtype=author&query=Yufei Jin), [Yi He](https://arxiv.org/search/?searchtype=author&query=Yi He), [Xingquan Zhu](https://arxiv.org/search/?searchtype=author&query=Xingquan Zhu) 作者：沈家骏、金宇飞、何毅、朱兴权

This paper presents HGEN that pioneers ensemble learning for heterogeneous graphs. We argue that the heterogeneity in node types, nodal features, and local neighborhood topology poses significant challenges for ensemble learning, particularly in accommodating diverse graph learners. Our HGEN framework ensembles multiple learners through a meta-path and transformation-based optimization pipeline to uplift classification accuracy. Specifically, HGEN uses meta-path combined with random dropping to create Allele Graph Neural Networks (GNNs), whereby the base graph learners are trained and aligned for later ensembling. To ensure effective ensemble learning, HGEN presents two key components: 1) a residual-attention mechanism to calibrate allele GNNs of different meta-paths, thereby enforcing node embeddings to focus on more informative graphs to improve base learner accuracy, and 2) a correlation-regularization term to enlarge the disparity among embedding matrices generated from different meta-paths, thereby enriching base learner diversity. We analyze the convergence of HGEN and attest its higher regularization magnitude over simple voting. Experiments on five heterogeneous networks validate that HGEN consistently outperforms its state-of-the-art competitors by substantial margin. 本文提出了 HGEN，开创性地在异构图上引入集成学习。我们认为，节点类型、节点特征和局部邻域拓扑的异质性给集成学习带来了重大挑战，特别是在兼容多样化图学习器方面。我们的 HGEN 框架通过基于元路径和变换的优化流水线对多种学习器进行集成，以提升分类准确率。具体而言，HGEN 使用元路径结合随机丢弃来构建等位基因图神经网络（Allele GNNs），由此训练并对齐基图学习器以便后续集成。为确保有效的集成学习，HGEN 提出两个关键组成部分：1）残差注意力机制用于校准不同元路径的等位基因 GNN，从而强制节点嵌入关注更具信息性的图以提高基学习器的准确性；2）相关性正则项用于扩大由不同元路径生成的嵌入矩阵之间的差异，从而丰富基学习器的多样性。我们分析了 HGEN 的收敛性，并证实其相较于简单投票具有更高的正则化幅度。在五个异构网络上的实验验证了 HGEN 始终以显著幅度优于其最先进的竞争对手。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-09-11 20:50:00 UTC 发布：2025-09-11 20:50:00 UTC

#70 Revisiting Actor-Critic Methods in Discrete Action Off-Policy Reinforcement Learning #70 重新审视离散动作离策略强化学习中的演员-评论家方法

Authors: [Reza Asad](https://arxiv.org/search/?searchtype=author&query=Reza Asad), [Reza Babanezhad](https://arxiv.org/search/?searchtype=author&query=Reza Babanezhad), [Sharan Vaswani](https://arxiv.org/search/?searchtype=author&query=Sharan Vaswani) 作者：Reza Asad、Reza Babanezhad、Sharan Vaswani

Value-based approaches such as DQN are the default methods for off-policy reinforcement learning with discrete-action environments such as Atari. Common policy-based methods are either on-policy and do not effectively learn from off-policy data (e.g. PPO), or have poor empirical performance in the discrete-action setting (e.g. SAC). Consequently, starting from discrete SAC (DSAC), we revisit the design of actor-critic methods in this setting. First, we determine that the coupling between the actor and critic entropy is the primary reason behind the poor performance of DSAC. We demonstrate that by merely decoupling these components, DSAC can have comparable performance as DQN. Motivated by this insight, we introduce a flexible off-policy actor-critic framework that subsumes DSAC as a special case. Our framework allows using an m-step Bellman operator for the critic update, and enables combining standard policy optimization methods with entropy regularization to instantiate the resulting actor objective. Theoretically, we prove that the proposed methods can guarantee convergence to the optimal regularized value function in the tabular setting. Empirically, we demonstrate that these methods can approach the performance of DQN on standard Atari games, and do so even without entropy regularization or explicit exploration. 基于价值的方法（如 DQN）是离策略强化学习在离散动作环境（如 Atari）中的默认方法。常见的基于策略的方法要么是策略内方法，无法有效利用离策略数据进行学习（如 PPO），要么在离散动作场景中表现欠佳（如 SAC）。因此，我们以离散 SAC（DSAC）为起点，重新审视该场景下演员-批评家方法的设计。首先，我们确定演员与批评家熵之间的耦合是 DSAC 表现不佳的主要原因。实验表明，仅需解耦这两部分组件，DSAC 即可达到与 DQN 相当的性能。基于此洞见，我们提出一种灵活的离策略行为者-批评者框架，将 DSAC 作为其特例。该框架允许使用 m 步贝尔曼算子更新批评者，并支持将标准策略优化方法与熵正则化结合以实现行为者目标函数。理论上，我们证明所提方法在表格化设置中可保证收敛至最优正则化价值函数。通过实证我们证明，这些方法在标准雅达利游戏上可以接近 DQN 的性能，而且即便在没有熵正则化或显式探索的情况下也能做到这一点。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-09-11 20:34:08 UTC 发布日期：2025-09-11 20:34:08 UTC

#71 CoDiCodec: Unifying Continuous and Discrete Compressed Representations of Audio #71 CoDiCodec：统一的连续和离散音频压缩表示

Authors: [Marco Pasini](https://arxiv.org/search/?searchtype=author&query=Marco Pasini), [Stefan Lattner](https://arxiv.org/search/?searchtype=author&query=Stefan Lattner), [George Fazekas](https://arxiv.org/search/?searchtype=author&query=George Fazekas) 作者：Marco Pasini、Stefan Lattner、George Fazekas

Efficiently representing audio signals in a compressed latent space is critical for latent generative modelling. However, existing autoencoders often force a choice between continuous embeddings and discrete tokens. Furthermore, achieving high compression ratios while maintaining audio fidelity remains a challenge. We introduce CoDiCodec, a novel audio autoencoder that overcomes these limitations by both efficiently encoding global features via summary embeddings, and by producing both compressed continuous embeddings at ~ 11 Hz and discrete tokens at a rate of 2.38 kbps from the same trained model, offering unprecedented flexibility for different downstream generative tasks. This is achieved through Finite Scalar Quantization (FSQ) and a novel FSQ-dropout technique, and does not require additional loss terms beyond the single consistency loss used for end-to-end training. CoDiCodec supports both autoregressive decoding and a novel parallel decoding strategy, with the latter achieving superior audio quality and faster decoding. CoDiCodec outperforms existing continuous and discrete autoencoders at similar bitrates in terms of reconstruction audio quality. Our work enables a unified approach to audio compression, bridging the gap between continuous and discrete generative modelling paradigms. 在压缩潜在空间中高效地表示音频信号对于潜在生成建模至关重要。然而，现有的自编码器常常迫使在连续嵌入与离散标记之间做出选择。此外，在保持音频保真度的同时实现高压缩率仍然是一个挑战。我们提出了 CoDiCodec，一种新颖的音频自编码器，它通过摘要嵌入高效编码全局特征，并且能从同一训练模型中同时产生约 11 Hz 的压缩连续嵌入和以 2.38 kbps 速率的离散标记，为不同的下游生成任务提供了前所未有的灵活性。这是通过有限标量量化（Finite Scalar Quantization，FSQ）和一种新颖的 FSQ-dropout 技术实现的，且在端到端训练中不需要除单一一致性损失之外的额外损失项。CoDiCodec 支持自回归解码和一种新颖的并行解码策略，后者在音频质量和解码速度上都更优。与在相似比特率下的现有连续与离散自编码器相比，CoDiCodec 在重建音频质量方面表现更佳。我们的工作实现了一种统一的音频压缩方法，弥合了连续与离散生成建模范式之间的差距。

Subjects: Sound, Artificial Intelligence, Machine Learning, Audio and Speech Processing 主题：声音，人工智能，机器学习，音频与语音处理

Publish: 2025-09-11 20:31:18 UTC 发布日期：2025-09-11 20:31:18 UTC

#72 SoilSound: Smartphone-based Soil Moisture Estimation #72 SoilSound：基于智能手机的土壤湿度估算

Authors: [Yixuan Gao](https://arxiv.org/search/?searchtype=author&query=Yixuan Gao), [Tanvir Ahmed](https://arxiv.org/search/?searchtype=author&query=Tanvir Ahmed), [Shuang He](https://arxiv.org/search/?searchtype=author&query=Shuang He), [Zhongqi Cheng](https://arxiv.org/search/?searchtype=author&query=Zhongqi Cheng), [Rajalakshmi Nandakumar](https://arxiv.org/search/?searchtype=author&query=Rajalakshmi Nandakumar) 作者：Yixuan Gao，Tanvir Ahmed，Shuang He，Zhongqi Cheng，Rajalakshmi Nandakumar

Soil moisture monitoring is essential for agriculture and environmental management, yet existing methods require either invasive probes disturbing the soil or specialized equipment, limiting access to the public. We present SoilSound, an ubiquitous accessible smartphone-based acoustic sensing system that can measure soil moisture without disturbing the soil. We leverage the built-in speaker and microphone to perform a vertical scan mechanism to accurately measure moisture without any calibration. Unlike existing work that use transmissive properties, we propose an alternate model for acoustic reflections in soil based on the surface roughness effect to enable moisture sensing without disturbing the soil. The system works by sending acoustic chirps towards the soil and recording the reflections during a vertical scan, which are then processed and fed to a convolutional neural network for on-device soil moisture estimation with negligible computational, memory, or power overhead. We evaluated the system by training with curated soils in boxes in the lab and testing in the outdoor fields and show that SoilSound achieves a mean absolute error (MAE) of 2.39% across 10 different locations. Overall, the evaluation shows that SoilSound can accurately track soil moisture levels ranging from 15.9% to 34.0% across multiple soil types, environments, and users; without requiring any calibration or disturbing the soil, enabling widespread moisture monitoring for home gardeners, urban farmers, citizen scientists, and agricultural communities in resource-limited settings. 土壤水分监测对农业和环境管理至关重要，然而现有方法要么需要侵入性探针干扰土壤，要么依赖专用设备，限制了公众的可及性。我们提出了 SoilSound，一种普适可及的基于智能手机的声学感知系统，能够在不扰动土壤的情况下测量土壤水分。我们利用手机内置扬声器和麦克风执行垂直扫描机制，以在无需任何校准的情况下准确测量水分。不同于利用透射特性的现有工作，我们提出了一种基于表面粗糙度效应的土壤声学反射替代理论，从而实现无需扰动土壤的水分感知。该系统通过向土壤发射声学啁啾并在垂直扫描过程中记录反射信号，将这些信号处理后输入卷积神经网络，在设备端进行土壤水分估计，且几乎不增加计算、内存或功耗开销。我们通过在实验室中用盒子培养的经过策划的土壤进行训练，并在户外田地中进行测试来评估该系统，结果显示 SoilSound 在 10 个不同地点的平均绝对误差（MAE）为 2.39%。总体而言，评估表明 SoilSound 能够在多种土壤类型、环境和用户之间准确跟踪土壤湿度水平，范围为 15.9%到 34.0%；无需任何校准或扰动土壤，从而使家庭园丁、城市农民、公民科学家以及资源有限环境中的农业社区能够广泛进行湿度监测。

Subjects: Sound, Artificial Intelligence, Emerging Technologies, Human-Computer Interaction, Signal Processing 主题：声音、人工智能、新兴技术、人机交互、信号处理

Publish: 2025-09-11 19:49:30 UTC 发布：2025-09-11 19:49:30 协调世界时 (UTC)

#73 HEFT: A Coarse-to-Fine Hierarchy for Enhancing the Efficiency and Accuracy of Language Model Reasoning #73 HEFT：一种从粗到细的层次结构，用于提升语言模型推理的效率与准确性 [PDF ] [Copy] [Kimi ] [REL]

Author: [Brennen Hill](https://arxiv.org/search/?searchtype=author&query=Brennen Hill) 作者：Brennen Hill

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-09-11 19:06:46 UTC 发布：2025-09-11 19:06:46 UTC

#74 ZORRO: Zero-Knowledge Robustness and Privacy for Split Learning (Full Version) #74 ZORRO：用于分割学习的零知识鲁棒性与隐私（完整版）

Authors: [Nojan Sheybani](https://arxiv.org/search/?searchtype=author&query=Nojan Sheybani), [Alessandro Pegoraro](https://arxiv.org/search/?searchtype=author&query=Alessandro Pegoraro), [Jonathan Knauer](https://arxiv.org/search/?searchtype=author&query=Jonathan Knauer), [Phillip Rieger](https://arxiv.org/search/?searchtype=author&query=Phillip Rieger), [Elissa Mollakuqe](https://arxiv.org/search/?searchtype=author&query=Elissa Mollakuqe), [Farinaz Koushanfar](https://arxiv.org/search/?searchtype=author&query=Farinaz Koushanfar), [Ahmad-Reza Sadeghi](https://arxiv.org/search/?searchtype=author&query=Ahmad-Reza Sadeghi) 作者：Nojan Sheybani，Alessandro Pegoraro，Jonathan Knauer，Phillip Rieger，Elissa Mollakuqe，Farinaz Koushanfar，Ahmad-Reza Sadeghi

Split Learning (SL) is a distributed learning approach that enables resource-constrained clients to collaboratively train deep neural networks (DNNs) by offloading most layers to a central server while keeping in- and output layers on the client-side. This setup enables SL to leverage server computation capacities without sharing data, making it highly effective in resource-constrained environments dealing with sensitive data. However, the distributed nature enables malicious clients to manipulate the training process. By sending poisoned intermediate gradients, they can inject backdoors into the shared DNN. Existing defenses are limited by often focusing on server-side protection and introducing additional overhead for the server. A significant challenge for client-side defenses is enforcing malicious clients to correctly execute the defense algorithm. We present ZORRO, a private, verifiable, and robust SL defense scheme. Through our novel design and application of interactive zero-knowledge proofs (ZKPs), clients prove their correct execution of a client-located defense algorithm, resulting in proofs of computational integrity attesting to the benign nature of locally trained DNN portions. Leveraging the frequency representation of model partitions enables ZORRO to conduct an in-depth inspection of the locally trained models in an untrusted environment, ensuring that each client forwards a benign checkpoint to its succeeding client. In our extensive evaluation, covering different model architectures as well as various attack strategies and data scenarios, we show ZORRO’s effectiveness, as it reduces the attack success rate to less than 6% while causing even for models storing \numprint{1000000} parameters on the client-side an overhead of less than 10 seconds. 拆分学习（Split Learning，SL）是一种分布式学习方法，使资源受限的客户端能够通过将大部分层卸载到中央服务器来协同训练深度神经网络（DNN），同时将输入层和输出层保留在客户端。这种设置使得 SL 能够在不共享数据的前提下利用服务器的计算能力，因此在处理敏感数据的资源受限环境中非常有效。然而，分布式特性也使得恶意客户端可以篡改训练过程。通过发送被投毒的中间梯度，它们可以向共享的 DNN 注入后门。现有防御措施的局限在于通常侧重于服务器端防护，并为服务器引入额外开销。对客户端防护的一大挑战是强制恶意客户端正确执行防御算法。我们提出了 ZORRO，一种私密、可验证且鲁棒的 SL 防御方案。通过我们对交互式零知识证明（ZKP）的新颖设计和应用，客户端能够证明其在客户端执行防御算法的正确性，从而生成证明计算完整性的证据，以证明本地训练的深度神经网络（DNN）部分的良性。利用模型分区的频域表示，ZORRO 能在不受信任的环境中对本地训练模型进行深入检查，确保每个客户端向其后续客户端转发的是良性检查点。在我们涵盖不同模型架构以及各种攻击策略和数据场景的广泛评估中，我们展示了 ZORRO 的有效性：它将攻击成功率降至低于 6%，即便是在客户端存储 1000000 个参数的模型上，其开销也低于 10 秒。

Subjects: Cryptography and Security, Artificial Intelligence 主题：密码学与安全性 , 人工智能

Publish: 2025-09-11 18:44:09 UTC

#75 LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation #75 LAVa：具有动态预算分配的分层键值缓存逐层驱逐

Authors: [Yiqun Shen](https://arxiv.org/search/?searchtype=author&query=Yiqun Shen), [Song Yuan](https://arxiv.org/search/?searchtype=author&query=Song Yuan), [Zhengze Zhang](https://arxiv.org/search/?searchtype=author&query=Zhengze Zhang), [Xiaoliang Wang](https://arxiv.org/search/?searchtype=author&query=Xiaoliang Wang), [Daxin Jiang](https://arxiv.org/search/?searchtype=author&query=Daxin Jiang), [Nguyen Cam-Tu](https://arxiv.org/search/?searchtype=author&query=Nguyen Cam-Tu) 作者：沈逸群、袁松、张政泽、王晓亮、蒋达鑫、Nguyen Cam-Tu

KV Cache is commonly used to accelerate LLM inference with long contexts, yet its high memory demand drives the need for cache compression. Existing compression methods, however, are largely heuristic and lack dynamic budget allocation. To address this limitation, we introduce a unified framework for cache compression by minimizing information loss in Transformer residual streams. Building on it, we analyze the layer attention output loss and derive a new metric to compare cache entries across heads, enabling layer-wise compression with dynamic head budgets. Additionally, by contrasting cross-layer information, we also achieve dynamic layer budgets. LAVa is the first unified strategy for cache eviction and dynamic budget allocation that, unlike prior methods, does not rely on training or the combination of multiple strategies. Experiments with benchmarks (LongBench, Needle-In-A-Haystack, Ruler, and InfiniteBench) demonstrate its superiority. Moreover, our experiments reveal a new insight: dynamic layer budgets are crucial for generation tasks (e.g., code completion), while dynamic head budgets play a key role in extraction tasks (e.g., extractive QA). As a fully dynamic compression method, LAVa consistently maintains top performance across task types. Our code is available at https://github.com/MGDDestiny/Lava. KV 缓存常用于加速具有长上下文的 LLM 推理，但其高内存需求推动了对缓存压缩的需求。然而，现有的压缩方法在很大程度上依赖启发式规则，且缺乏动态预算分配。为了解决这一限制，我们提出了一个通过最小化 Transformer 残差流信息损失的统一缓存压缩框架。在此基础上，我们分析了层级注意力输出的损失并推导出一个用于在头之间比较缓存条目的新度量，使得可以进行具有动态头预算的分层压缩。此外，通过对比跨层信息，我们还实现了动态层预算。LAVa 是首个用于缓存逐出和动态预算分配的统一策略，不像先前的方法那样依赖训练或多策略组合。基准测试（LongBench、Needle-In-A-Haystack、Ruler 和 InfiniteBench）的实验证明了其优越性。此外，我们的实验揭示了一个新见解：动态层预算对生成任务（例如代码补全）至关重要，而动态头预算在抽取任务（例如抽取式问答）中起关键作用。作为一种完全动态的压缩方法，LAVa 在各类任务中始终保持顶级性能。我们的代码可在 https://github.com/MGDDestiny/Lava 获取。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-09-11 16:48:24 UTC 发布：2025-09-11 16:48:24 UTC

#76 Meta-Learning Reinforcement Learning for Crypto-Return Prediction #76 元学习强化学习用于加密回报预测 [PDF ] [Copy] [Kimi ] [REL]

Predicting cryptocurrency returns is notoriously difficult: price movements are driven by a fast-shifting blend of on-chain activity, news flow, and social sentiment, while labeled training data are scarce and expensive. In this paper, we present Meta-RL-Crypto, a unified transformer-based architecture that unifies meta-learning and reinforcement learning (RL) to create a fully self-improving trading agent. Starting from a vanilla instruction-tuned LLM, the agent iteratively alternates between three roles-actor, judge, and meta-judge-in a closed-loop architecture. This learning process requires no additional human supervision. It can leverage multimodal market inputs and internal preference feedback. The agent in the system continuously refines both the trading policy and evaluation criteria. Experiments across diverse market regimes demonstrate that Meta-RL-Crypto shows good performance on the technical indicators of the real market and outperforming other LLM-based baselines. 预测加密货币回报异常困难：价格波动由快速变化的链上活动、新闻流和社交情绪混合驱动，同时带标签的训练数据稀缺且昂贵。在本文中，我们提出了 Meta-RL-Crypto，一种统一的基于 Transformer 的架构，将元学习和强化学习（RL）结合起来，创建一个完全自我改进的交易代理。从一个经过指令微调的普通 LLM 开始，代理在一个闭环架构中迭代交替扮演三种角色——执行者（actor）、评判者（judge）和元评判者（meta-judge）。这一学习过程不需要额外的人类监督。它可以利用多模态的市场输入和内部偏好反馈。系统中的代理持续改进交易策略和评估标准。在多种市场机制下的实验证明，Meta-RL-Crypto 在真实市场的技术指标上表现良好，并且优于其他基于 LLM 的基线方法。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-09-11 14:20:45 UTC 发布：2025-09-11 14:20:45 UTC

#77 A Co-Training Semi-Supervised Framework Using Faster R-CNN and YOLO Networks for Object Detection in Densely Packed Retail Images #77 一种使用 Faster R-CNN 与 YOLO 网络的协同训练半监督框架，用于密集摆放零售图像中的目标检测

Authors: [Hossein Yazdanjouei](https://arxiv.org/search/?searchtype=author&query=Hossein Yazdanjouei), [Arash Mansouri](https://arxiv.org/search/?searchtype=author&query=Arash Mansouri), [Mohammad Shokouhifar](https://arxiv.org/search/?searchtype=author&query=Mohammad Shokouhifar) 作者：Hossein Yazdanjouei、Arash Mansouri、Mohammad Shokouhifar

This study proposes a semi-supervised co-training framework for object detection in densely packed retail environments, where limited labeled data and complex conditions pose major challenges. The framework combines Faster R-CNN (utilizing a ResNet backbone) for precise localization with YOLO (employing a Darknet backbone) for global context, enabling mutual pseudo-label exchange that improves accuracy in scenes with occlusion and overlapping objects. To strengthen classification, it employs an ensemble of XGBoost, Random Forest, and SVM, utilizing diverse feature representations for higher robustness. Hyperparameters are optimized using a metaheuristic-driven algorithm, enhancing precision and efficiency across models. By minimizing reliance on manual labeling, the approach reduces annotation costs and adapts effectively to frequent product and layout changes common in retail. Experiments on the SKU-110k dataset demonstrate strong performance, highlighting the scalability and practicality of the proposed framework for real-world retail applications such as automated inventory tracking, product monitoring, and checkout systems. 本研究提出了一种用于密集货架零售环境中的半监督协同训练框架，用于目标检测；在这些环境中，标注数据有限且条件复杂是主要挑战。该框架结合了用于精确定位的 Faster R-CNN（采用 ResNet 主干网络）和用于全局上下文的 YOLO（采用 Darknet 主干网络），通过相互交换伪标签来提高在遮挡和对象重叠场景下的准确性。为增强分类性能，采用了 XGBoost、随机森林和 SVM 的集成，利用多样的特征表示以获得更高的鲁棒性。超参数使用一种元启发式驱动的算法进行优化，从而提高各模型的精度和效率。通过尽量减少对人工标注的依赖，该方法降低了标注成本，并能有效适应零售中常见的频繁商品和布局变化。在 SKU-110k 数据集上的实验表明了其优异的性能，突出了所提出框架在自动库存跟踪、商品监控和结账系统等真实零售应用中的可扩展性和实用性。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-09-11 13:40:43 UTC 发布：2025-09-11 13:40:43 UTC

#78 D-CAT: Decoupled Cross-Attention Transfer between Sensor Modalities for Unimodal Inference #78 D-CAT：用于单模态推理的传感器模态间解耦交叉注意力迁移 [PDF ] [Copy] [Kimi ] [REL]

Authors: [Leen Daher](https://arxiv.org/search/?searchtype=author&query=Leen Daher), [Zhaobo Wang](https://arxiv.org/search/?searchtype=author&query=Zhaobo Wang), [Malcolm Mielle](https://arxiv.org/search/?searchtype=author&query=Malcolm Mielle) 作者：Leen Daher、Zhaobo Wang、Malcolm Mielle

Cross-modal transfer learning is used to improve multi-modal classification models (e.g., for human activity recognition in human-robot collaboration). However, existing methods require paired sensor data at both training and inference, limiting deployment in resource-constrained environments where full sensor suites are not economically and technically usable. To address this, we propose Decoupled Cross-Attention Transfer (D-CAT), a framework that aligns modality-specific representations without requiring joint sensor modality during inference. Our approach combines a self-attention module for feature extraction with a novel cross-attention alignment loss, which enforces the alignment of sensors’ feature spaces without requiring the coupling of the classification pipelines of both modalities. We evaluate D-CAT on three multi-modal human activity datasets (IMU, video, and audio) under both in-distribution and out-of-distribution scenarios, comparing against uni-modal models. Results show that in in-distribution scenarios, transferring from high-performing modalities (e.g., video to IMU) yields up to 10% F1-score gains over uni-modal training. In out-of-distribution scenarios, even weaker source modalities (e.g., IMU to video) improve target performance, as long as the target model isn’t overfitted on the training data. By enabling single-sensor inference with cross-modal knowledge, D-CAT reduces hardware redundancy for perception systems while maintaining accuracy, which is critical for cost-sensitive or adaptive deployments (e.g., assistive robots in homes with variable sensor availability). Code is available at https://github.com/Schindler-EPFL-Lab/D-CAT. 跨模态迁移学习用于提升多模态分类模型（例如在人机协作中的人体活动识别）。然而，现有方法在训练和推理阶段都需要成对的传感器数据，这限制了其在资源受限环境中的部署——在这些环境中，完整的传感器套件在经济和技术上不可行。为了解决这一问题，我们提出了去耦合跨注意力迁移（D-CAT），该框架在推理时无需联合传感模态即可对模态特定表示进行对齐。我们的方法将用于特征提取的自注意力模块与一种新颖的跨注意力对齐损失相结合，后者在不要求两种模态的分类管道耦合的情况下强制对齐传感器的特征空间。我们在三个多模态人体活动数据集（IMU、视频和音频）上对 D-CAT 进行了评估，覆盖分布内和分布外两种情形，并与单模态模型进行了比较。结果表明，在分布内情形下，从性能较高的模态（例如从视频到 IMU）进行迁移，相较于单模态训练可带来高达 10%的 F1 分数增益。在分布外的场景中，即使是较弱的源模态（例如从 IMU 到视频）也能提升目标性能，只要目标模型没有在训练数据上过拟合。通过以跨模态知识实现单传感器推断，D-CAT 在保持准确性的同时减少了感知系统的硬件冗余，这对对成本敏感或需要自适应部署（例如在传感器可用性可变的家庭中为辅助机器人提供服务）非常关键。代码可在 https://github.com/Schindler-EPFL-Lab/D-CAT 获取。

Subjects: Machine Learning, Artificial Intelligence, Robotics 主题：机器学习，人工智能，机器人学

Publish: 2025-09-11 10:54:07 UTC 发布：2025-09-11 10:54:07 UTC

#79 Structure Matters: Brain Graph Augmentation via Learnable Edge Masking for Data-efficient Psychiatric Diagnosis #79 结构很重要：通过可学习的边掩码进行脑图增强以实现数据高效的精神疾病诊断

The limited availability of labeled brain network data makes it challenging to achieve accurate and interpretable psychiatric diagnoses. While self-supervised learning (SSL) offers a promising solution, existing methods often rely on augmentation strategies that can disrupt crucial structural semantics in brain graphs. To address this, we propose SAM-BG, a two-stage framework for learning brain graph representations with structural semantic preservation. In the pre-training stage, an edge masker is trained on a small labeled subset to capture key structural semantics. In the SSL stage, the extracted structural priors guide a structure-aware augmentation process, enabling the model to learn more semantically meaningful and robust representations. Experiments on two real-world psychiatric datasets demonstrate that SAM-BG outperforms state-of-the-art methods, particularly in small-labeled data settings, and uncovers clinically relevant connectivity patterns that enhance interpretability. Our code is available at https://github.com/mjliu99/SAM-BG. 标注的脑网络数据有限，导致实现准确且可解释的精神疾病诊断具有挑战性。尽管自监督学习（SSL）提供了有前景的解决方案，现有方法常依赖可能破坏脑图关键结构语义的增强策略。为了解决这一问题，我们提出了 SAM-BG，一种用于学习保留结构语义的脑图表征的两阶段框架。在预训练阶段，一个边掩码器在小规模带标注子集上训练以捕捉关键结构语义。在自监督学习阶段，提取的结构先验引导结构感知的增强过程，使模型能够学习更具语义意义且更鲁棒的表征。在两个真实世界的精神科数据集上的实验证明，SAM-BG 优于最先进的方法，特别是在小标注数据设置中，并发现了提高可解释性的临床相关连通性模式。我们的代码可在 https://github.com/mjliu99/SAM-BG 获取。

Subjects: Machine Learning, Artificial Intelligence 主题：机器学习，人工智能

Publish: 2025-09-11 07:24:39 UTC 发布：2025-09-11 07:24:39 UTC

#80 HypoGeneAgent: A Hypothesis Language Agent for Gene-Set Cluster Resolution Selection Using Perturb-seq Datasets #80 HypoGeneAgent：一种用于基因集合簇分辨率选择的基于假设语言代理，适用于 Perturb-seq 数据集 [PDF ] [Copy] [Kimi ] [REL]

Large-scale single-cell and Perturb-seq investigations routinely involve clustering cells and subsequently annotating each cluster with Gene-Ontology (GO) terms to elucidate the underlying biological programs. However, both stages, resolution selection and functional annotation, are inherently subjective, relying on heuristics and expert curation. We present HYPOGENEAGENT, a large language model (LLM)-driven framework, transforming cluster annotation into a quantitatively optimizable task. Initially, an LLM functioning as a gene-set analyst analyzes the content of each gene program or perturbation module and generates a ranked list of GO-based hypotheses, accompanied by calibrated confidence scores. Subsequently, we embed every predicted description with a sentence-embedding model, compute pair-wise cosine similarities, and let the agent referee panel score (i) the internal consistency of the predictions, high average similarity within the same cluster, termed intra-cluster agreement (ii) their external distinctiveness, low similarity between clusters, termed inter-cluster separation. These two quantities are combined to produce an agent-derived resolution score, which is maximized when clusters exhibit simultaneous coherence and mutual exclusivity. When applied to a public K562 CRISPRi Perturb-seq dataset as a preliminary test, our Resolution Score selects clustering granularities that exhibit alignment with known pathway compared to classical metrics such silhouette score, modularity score for gene functional enrichment summary. These findings establish LLM agents as objective adjudicators of cluster resolution and functional annotation, thereby paving the way for fully automated, context-aware interpretation pipelines in single-cell multi-omics studies. 大规模单细胞和 Perturb-seq 研究通常涉及对细胞进行聚类，随后用基因本体（GO）术语注释每个簇，以阐明潜在的生物学程序。然而，这两个步骤——分辨率选择和功能注释——本质上具有主观性，依赖于启发式方法和专家人工整理。我们提出了 HYPOGENEAGENT，一种由大语言模型（LLM）驱动的框架，将簇注释转化为可定量优化的任务。起初，作为基因集分析师的大语言模型会分析每个基因程序或扰动模块的内容，并生成基于 GO 的假设排名列表，同时给出经校准的置信度分数。随后，我们用句子嵌入模型对每个预测描述进行嵌入，计算成对余弦相似度，并让代理裁判团评分：(i) 预测的内部一致性，即同一簇内平均相似度高，称为簇内一致性；(ii) 它们的外部差异性，即簇间相似度低，称为簇间分离。这两个量被结合起来生成一个由代理得出的分辨率评分，当簇同时表现出内聚性和相互排斥性时该评分达到最大值。作为初步测试，将其应用于公开的 K562 CRISPRi Perturb-seq 数据集时，我们的分辨率评分相较于经典度量（如轮廓系数、基因功能富集概述的模块度评分）选择出了与已知通路更一致的聚类粒度。这些发现确立了 LLM 代理作为聚类分辨率和功能注释的客观裁判者，从而为单细胞多组学研究中完全自动化、具上下文感知的解释管线铺平了道路。

Subjects: Quantitative Methods, Artificial Intelligence, Computation and Language, Machine Learning 主题：定量方法、人工智能、计算与语言、机器学习

Publish: 2025-09-10 22:25:33 UTC 发布：2025-09-10 22:25:33 UTC

#81 World Modeling with Probabilistic Structure Integration #81 具有概率结构整合的世界建模

We present Probabilistic Structure Integration (PSI), a system for learning richly controllable and flexibly promptable world models from data. PSI consists of a three-step cycle. The first step, Probabilistic prediction, involves building a probabilistic graphical model Psi of the data, in the form of a random-access autoregressive sequence model. Psi supports a complete set of learned conditional distributions describing the dependence of any variables in the data on any other set of variables. In step 2, Structure extraction, we show how to extract underlying low-dimensional properties in the data, corresponding to a diverse set of meaningful “intermediate structures”, in a zero-shot fashion via causal inference on Psi. Step 3, Integration, completes the cycle by converting these structures into new token types that are then continually mixed back into the training diet as conditioning signals and prediction targets. Each such cycle augments the capabilities of Psi, both allowing it to model the underlying data better, and creating new control handles – akin to an LLM-like universal prompting language. We train an instance of Psi on 1.4 trillion tokens of internet video data; we use it to perform a variety of useful video prediction and understanding inferences; we extract state-of-the-art optical flow, self-supervised depth and object segmentation; and we use these structures to support a full cycle of predictive improvements. 我们提出了概率结构整合（PSI），一种从数据中学习高度可控且灵活可提示的世界模型的系统。PSI 包含一个三步循环。第一步，概率预测，涉及构建数据的概率图模型 Psi，以随机访问自回归序列模型的形式。Psi 支持一整套学习到的条件分布，描述数据中任意变量对任意其他变量集合的依赖关系。第二步，结构提取，我们展示了如何通过对 Psi 进行因果推断，以零次样本方式提取数据中对应于各种有意义“中间结构”的潜在低维属性。第三步，整合，通过将这些结构转换为新的令牌类型并将其持续混入训练输入作为条件信号和预测目标来完成循环。每个这样的循环都增强了 Psi 的能力，既使其能更好地建模底层数据，又创造了新的控制手柄——类似于一种 LLM 风格的通用提示语言。我们在 1.4 万亿个标记的互联网视频数据上训练了一个 Psi 实例；我们用它来执行各种有用的视频预测和理解推断；我们提取了最先进的光流、自监督深度和对象分割；并且我们使用这些结构来支持完整的预测改进循环。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning 主题：计算机视觉与模式识别，人工智能，机器学习

Publish: 2025-09-10 18:01:04 UTC 发布：2025-09-10 18:01:04 UTC

#82 MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools #82 MCP-AgentBench：使用 MCP 中介工具评估现实世界语言代理性能

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-09-10 14:08:40 UTC 发布时间：2025-09-10 14:08:40 UTC

#83 MITS: A Large-Scale Multimodal Benchmark Dataset for Intelligent Traffic Surveillance #83 MITS：用于智能交通监控的大规模多模态基准数据集

General-domain large multimodal models (LMMs) have achieved significant advances in various image-text tasks. However, their performance in the Intelligent Traffic Surveillance (ITS) domain remains limited due to the absence of dedicated multimodal datasets. To address this gap, we introduce MITS (Multimodal Intelligent Traffic Surveillance), the first large-scale multimodal benchmark dataset specifically designed for ITS. MITS includes 170,400 independently collected real-world ITS images sourced from traffic surveillance cameras, annotated with eight main categories and 24 subcategories of ITS-specific objects and events under diverse environmental conditions. Additionally, through a systematic data generation pipeline, we generate high-quality image captions and 5 million instruction-following visual question-answer pairs, addressing five critical ITS tasks: object and event recognition, object counting, object localization, background analysis, and event reasoning. To demonstrate MITS’s effectiveness, we fine-tune mainstream LMMs on this dataset, enabling the development of ITS-specific applications. Experimental results show that MITS significantly improves LMM performance in ITS applications, increasing LLaVA-1.5’s performance from 0.494 to 0.905 (+83.2%), LLaVA-1.6’s from 0.678 to 0.921 (+35.8%), Qwen2-VL’s from 0.584 to 0.926 (+58.6%), and Qwen2.5-VL’s from 0.732 to 0.930 (+27.0%). We release the dataset, code, and models as open-source, providing high-value resources to advance both ITS and LMM research. 通用领域的大型多模态模型（LMMs）在各种图文任务中取得了显著进展。然而，由于缺乏专门的多模态数据集，它们在智能交通监控（ITS）领域的表现仍然有限。为了解决这一空白，我们提出了 MITS（多模态智能交通监控），这是首个专为 ITS 设计的大规模多模态基准数据集。MITS 包含 170,400 张独立采集的真实世界 ITS 图像，来源于交通监控摄像头，并在多种环境条件下对 ITS 专有的八大主类别和 24 个子类别的对象与事件进行了标注。此外，通过系统化的数据生成管道，我们生成了高质量的图像说明以及 500 万条遵循指令的视觉问答对，覆盖五个关键的 ITS 任务：对象与事件识别、对象计数、对象定位、背景分析和事件推理。为展示 MITS 的有效性，我们在该数据集上对主流 LMMs 进行了微调，从而促成了面向 ITS 的应用开发。实验结果表明，MITS 在 ITS 应用中显著提升了 LMM 的性能，将 LLaVA-1.5 的性能从 0.494 提升到 0.905（+83.2%）、LLaVA-1.6 的从 0.678 提升到 0.921（+35.8%）、Qwen2-VL 的从 0.584 提升到 0.926（+58.6%）、以及 Qwen2.5-VL 的从 0.732 提升到 0.930（+27.0%）。我们将数据集、代码和模型以开源形式发布，为推动 ITS 与 LMM 研究提供了高价值资源。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-09-10 12:07:34 UTC 发布：2025-09-10 12:07:34 UTC

#84 MultimodalHugs: Enabling Sign Language Processing in Hugging Face #84 MultimodalHugs：在 Hugging Face 中实现手语处理

In recent years, sign language processing (SLP) has gained importance in the general field of Natural Language Processing. However, compared to research on spoken languages, SLP research is hindered by complex ad-hoc code, inadvertently leading to low reproducibility and unfair comparisons. Existing tools that are built for fast and reproducible experimentation, such as Hugging Face, are not flexible enough to seamlessly integrate sign language experiments. This view is confirmed by a survey we conducted among SLP researchers. To address these challenges, we introduce MultimodalHugs, a framework built on top of Hugging Face that enables more diverse data modalities and tasks, while inheriting the well-known advantages of the Hugging Face ecosystem. Even though sign languages are our primary focus, MultimodalHugs adds a layer of abstraction that makes it more widely applicable to other use cases that do not fit one of the standard templates of Hugging Face. We provide quantitative experiments to illustrate how MultimodalHugs can accommodate diverse modalities such as pose estimation data for sign languages, or pixel data for text characters. 近年来，手语处理（SLP）在自然语言处理领域变得越来越重要。然而，与口语研究相比，SLP 研究受到复杂的临时性代码的制约，这无意中导致了可复现性低和不公平的比较。现有为快速且可复现的实验而构建的工具（例如 Hugging Face）并不够灵活，无法无缝集成手语实验。我们在对 SLP 研究人员进行的调查中证实了这一观点。为了解决这些挑战，我们引入了 MultimodalHugs，这是一个构建在 Hugging Face 之上的框架，能够支持更多样的数据模态和任务，同时继承了 Hugging Face 生态系统的众所周知的优势。尽管手语是我们的主要关注点，MultimodalHugs 添加了一层抽象，使其更广泛地适用于其他不符合 Hugging Face 标准模板的用例。我们提供了定量实验来说明 MultimodalHugs 如何容纳多种模态，例如用于手语的姿态估计数据或用于文本字符的像素数据。

Subjects: Computation and Language, Artificial Intelligence, Multimedia 主题：计算与语言、人工智能、多媒体

Publish: 2025-09-10 11:14:54 UTC 发布：2025-09-10 11:14:54 协调世界时

#85 DiTTO-LLM: Framework for Discovering Topic-based Technology Opportunities via Large Language Model #85 DiTTO-LLM：通过大型语言模型发现基于主题的技术机会的框架

Technology opportunities are critical information that serve as a foundation for advancements in technology, industry, and innovation. This paper proposes a framework based on the temporal relationships between technologies to identify emerging technology opportunities. The proposed framework begins by extracting text from a patent dataset, followed by mapping text-based topics to discover inter-technology relationships. Technology opportunities are then identified by tracking changes in these topics over time. To enhance efficiency, the framework leverages a large language model to extract topics and employs a prompt for a chat-based language model to support the discovery of technology opportunities. The framework was evaluated using an artificial intelligence patent dataset provided by the United States Patent and Trademark Office. The experimental results suggest that artificial intelligence technology is evolving into forms that facilitate everyday accessibility. This approach demonstrates the potential of the proposed framework to identify future technology opportunities. 技术机会是为技术、产业和创新进步提供基础的重要信息。本文提出了一个基于技术间时间关系的框架，用以识别新兴技术机会。该框架首先从专利数据集中提取文本，然后将基于文本的主题映射以发现技术间的关系。接着通过追踪这些主题随时间的变化来识别技术机会。为提高效率，该框架利用大语言模型提取主题，并使用基于对话的语言模型提示来辅助发现技术机会。该框架在美国专利商标局提供的人工智能专利数据集上进行了评估。实验结果表明，人工智能技术正演变为便于日常可及的形式。该方法展示了所提框架识别未来技术机会的潜力。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-09-10 05:47:25 UTC 发布：2025-09-10 05:47:25 UTC

#86 ALIGNS: Unlocking nomological networks in psychological measurement through a large language model #86 ALIGNS：通过大型语言模型在心理测量中解锁法则网络

Psychological measurement is critical to many disciplines. Despite advances in measurement, building nomological networks, theoretical maps of how concepts and measures relate to establish validity, remains a challenge 70 years after Cronbach and Meehl proposed them as fundamental to validation. This limitation has practical consequences: clinical trials may fail to detect treatment effects, and public policy may target the wrong outcomes. We introduce Analysis of Latent Indicators to Generate Nomological Structures (ALIGNS), a large language model-based system trained with validated questionnaire measures. ALIGNS provides three comprehensive nomological networks containing over 550,000 indicators across psychology, medicine, social policy, and other fields. This represents the first application of large language models to solve a foundational problem in measurement validation. We report classification accuracy tests used to develop the model, as well as three evaluations. In the first evaluation, the widely used NIH PROMIS anxiety and depression instruments are shown to converge into a single dimension of emotional distress. The second evaluation examines child temperament measures and identifies four potential dimensions not captured by current frameworks, and questions one existing dimension. The third evaluation, an applicability check, engages expert psychometricians who assess the system’s importance, accessibility, and suitability. ALIGNS is freely available at nomologicalnetwork.org, complementing traditional validation methods with large-scale nomological analysis. 心理测量对许多学科至关重要。尽管测量学取得了进展，但构建名义网络（nomological networks，即概念与测量之间如何关联以建立效度的理论图谱）仍然是一项挑战——自从 Cronbach 和 Meehl 在 70 年前提出它们为验证的基础以来，这一问题依然存在。这一局限带来了实际后果：临床试验可能无法检测到治疗效果，公共政策可能将目标指向错误的结果。我们提出了“生成名义结构的潜在指标分析”（Analysis of Latent Indicators to Generate Nomological Structures，简称 ALIGNS），这是一种基于大型语言模型并以经验证的问卷测量为训练的系统。ALIGNS 提供了三个全面的名义网络，包含了跨心理学、医学、社会政策及其他领域的超过 55 万个指标。这代表了大型语言模型在解决测量验证基础性问题上的首次应用。我们报告了用于开发该模型的分类准确性测试以及三项评估。在第一项评估中，广泛使用的 NIH PROMIS 焦虑和抑郁量表被证明会汇聚为情绪痛苦的单一维度。第二项评估检验儿童气质测量，识别出四个当前框架未涵盖的潜在维度，并对现有的一个维度提出质疑。第三项评估为可适用性检查，邀请了心理测量学专家评估该系统的重要性、可获取性和适用性。ALIGNS 可在 nomologicalnetwork.org 免费获取，通过大规模的名义网络分析补充了传统的验证方法。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning, Methodology 主题：计算与语言、人工智能、机器学习、方法论

Publish: 2025-09-10 04:21:02 UTC 发布日期：2025-09-10 04:21:02 UTC

#87 A Multimodal RAG Framework for Housing Damage Assessment: Collaborative Optimization of Image Encoding and Policy Vector Retrieval #87 一种用于房屋损害评估的多模态 RAG 框架：图像编码与策略向量检索的协同优化

Authors: [Jiayi Miao](https://arxiv.org/search/?searchtype=author&query=Jiayi Miao), [Dingxin Lu](https://arxiv.org/search/?searchtype=author&query=Dingxin Lu), [Zhuqi Wang](https://arxiv.org/search/?searchtype=author&query=Zhuqi Wang) 作者：苗佳怡，卢定鑫，王竹玘

After natural disasters, accurate evaluations of damage to housing are important for insurance claims response and planning of resources. In this work, we introduce a novel multimodal retrieval-augmented generation (MM-RAG) framework. On top of classical RAG architecture, we further the framework to devise a two-branch multimodal encoder structure that the image branch employs a visual encoder composed of ResNet and Transformer to extract the characteristic of building damage after disaster, and the text branch harnesses a BERT retriever for the text vectorization of posts as well as insurance policies and for the construction of a retrievable restoration index. To impose cross-modal semantic alignment, the model integrates a cross-modal interaction module to bridge the semantic representation between image and text via multi-head attention. Meanwhile, in the generation module, the introduced modal attention gating mechanism dynamically controls the role of visual evidence and text prior information during generation. The entire framework takes end-to-end training, and combines the comparison loss, the retrieval loss and the generation loss to form multi-task optimization objectives, and achieves image understanding and policy matching in collaborative learning. The results demonstrate superior performance in retrieval accuracy and classification index on damage severity, where the Top-1 retrieval accuracy has been improved by 9.6%. 在自然灾害发生后，对房屋损毁的准确评估对于理赔响应和资源规划至关重要。在这项工作中，我们提出了一种新颖的多模态检索增强生成（MM-RAG）框架。在经典 RAG 架构基础上，我们进一步扩展该框架，设计了一个双支路多模态编码器结构：图像分支使用由 ResNet 和 Transformer 组成的视觉编码器来提取灾后建筑损毁的特征，文本分支则利用 BERT 检索器对帖子以及保险政策进行文本向量化并构建可检索的修复索引。为施加跨模态语义对齐，模型集成了一个跨模态交互模块，通过多头注意力连接图像与文本之间的语义表示。同时，在生成模块中，引入的模态注意门控机制在生成过程中动态控制视觉证据与文本先验信息的作用。整个框架采用端到端训练，结合比较损失、检索损失和生成损失构成多任务优化目标，并在协同学习中实现图像理解和策略匹配。结果在检索准确率和损伤严重度的分类指标上表现优异，其中 Top-1 检索准确率提高了 9.6%。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning 主题：计算机视觉与模式识别，人工智能，机器学习

Publish: 2025-09-10 01:58:07 UTC 发布：2025-09-10 01:58:07 UTC

#88 VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions #88 VStyle：带有口头指令的语音风格适应基准

Subjects: Sound, Artificial Intelligence, Computation and Language, Audio and Speech Processing 主题：声音、人工智能、计算与语言、音频与语音处理

Publish: 2025-09-09 14:28:58 UTC 发布时间：2025-09-09 14:28:58 UTC

#89 Investigating Symbolic Triggers of Hallucination in Gemma Models Across HaluEval and TruthfulQA #89 在 HaluEval 和 TruthfulQA 中调查 Gemma 模型幻觉的符号触发因素

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-09-09 05:50:08 UTC 发布：2025-09-09 05:50:08 UTC

#90 How Small Transformation Expose the Weakness of Semantic Similarity Measures #90 微小变换如何暴露语义相似度度量的弱点

This research examines how well different methods measure semantic similarity, which is important for various software engineering applications such as code search, API recommendations, automated code reviews, and refactoring tools. While large language models are increasingly used for these similarity assessments, questions remain about whether they truly understand semantic relationships or merely recognize surface patterns. The study tested 18 different similarity measurement approaches, including word-based methods, embedding techniques, LLM-based systems, and structure-aware algorithms. The researchers created a systematic testing framework that applies controlled changes to text and code to evaluate how well each method handles different types of semantic relationships. The results revealed significant issues with commonly used metrics. Some embedding-based methods incorrectly identified semantic opposites as similar up to 99.9 percent of the time, while certain transformer-based approaches occasionally rated opposite meanings as more similar than synonymous ones. The study found that embedding methods’ poor performance often stemmed from how they calculate distances; switching from Euclidean distance to cosine similarity improved results by 24 to 66 percent. LLM-based approaches performed better at distinguishing semantic differences, producing low similarity scores (0.00 to 0.29) for genuinely different meanings, compared to embedding methods that incorrectly assigned high scores (0.82 to 0.99) to dissimilar content. 这项研究考察了不同方法在测量语义相似性方面的表现，这对代码搜索、API 推荐、自动化代码审查和重构工具等各种软件工程应用非常重要。尽管大型语言模型（LLM）越来越多地被用于这些相似性评估，但仍存在疑问：它们是否真正理解语义关系，还是仅仅识别表面模式。研究测试了 18 种不同的相似性测量方法，包括基于词的方法、嵌入技术、LLM 驱动的系统和结构感知算法。研究人员创建了一个系统化的测试框架，对文本和代码施加受控变更，以评估每种方法如何处理不同类型的语义关系。结果揭示了常用度量的显著问题。一些基于嵌入的方法在最多 99.9% 的情况下错误地将语义相反的项判断为相似，而某些基于 Transformer 的方法有时会将相反含义评为比同义含义更相似。研究发现，嵌入方法表现不佳常常源于它们计算距离的方式；将欧氏距离改为余弦相似度后，结果提高了 24%到 66%。基于 LLM 的方法在区分语义差异方面表现更好，对于意义真正不同的内容会给出较低的相似度分（0.00 到 0.29），而嵌入方法则错误地对不相似的内容分配了很高的分数（0.82 到 0.99）。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-09-08 11:00:18 UTC 发布：2025-09-08 11:00:18 协调世界时（UTC）

#91 HANRAG: Heuristic Accurate Noise-resistant Retrieval-Augmented Generation for Multi-hop Question Answering #91 HANRAG：启发式准确抗噪检索增强生成用于多跳问答 [PDF ] [Copy] [Kimi 1 ] [REL]

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-09-08 06:22:38 UTC 发布：2025-09-08 06:22:38 协调世界时

#92 The Thinking Therapist: Training Large Language Models to Deliver Acceptance and Commitment Therapy using Supervised Fine-Tuning and Odds Ratio Policy Optimization #92 思考中的治疗师：使用监督微调和比值比策略优化训练大型语言模型以提供接纳与承诺疗法 [PDF 1 ] [Copy] [Kimi 1 ] [REL]

Author: [Talha Tahir](https://arxiv.org/search/?searchtype=author&query=Talha Tahir) 作者：Talha Tahir

Acceptance and Commitment Therapy (ACT) is a third-wave cognitive behavioral therapy with emerging evidence of efficacy in several psychiatric conditions. This study investigates the impact of post-training methodology and explicit reasoning on the ability of a small open-weight large language model (LLM) to deliver ACT. Using 50 sets of synthetic ACT transcripts generated by Mistral-Large, we trained Llama-3.2-3b-Instruct with two distinct approaches, supervised fine-tuning (SFT) and odds ratio policy optimization (ORPO), each with and without an explicit chain-of-thought (COT) reasoning step. Performance was evaluated by comparing these four post-trained variants against the base Instruct model. These models were benchmarked in simulated therapy sessions, with performance quantitatively assessed on the ACT Fidelity Measure (ACT-FM) and the Therapist Empathy Scale (TES) by an LLM judge that had been fine-tuned on human evaluations. Our findings demonstrate that the ORPO-trained models significantly outperformed both their SFT and Instruct counterparts on ACT fidelity (χ2(5)=185.15,p<.001) and therapeutic empathy (χ2(5)=140.37,p<.001). The effect of COT was conditional as it provided a significant benefit to SFT models, improving ACT-FM scores by an average of 2.68 points (p<.001), while offering no discernible advantage to the superior ORPO or instruct-tuned variants. We posit that the superiority of ORPO stems from its ability to learn the therapeutic process' over imitating content,’ a key aspect of ACT, while COT acts as a necessary scaffold for models trained only via imitation. This study establishes that preference-aligned policy optimization can effectively instill ACT competencies in small LLMs, and that the utility of explicit reasoning is highly dependent on the underlying training paradigm. 接受与承诺疗法（ACT）是一种第三波认知行为疗法，在多种精神疾病中显示出逐渐增多的疗效证据。本研究考察了训练后方法学和显式推理对一个小型开源权重大型语言模型（LLM）提供 ACT 能力的影响。我们使用由 Mistral-Large 生成的 50 组合成 ACT 会话记录，分别以两种不同方法对 Llama-3.2-3b-Instruct 进行训练：监督微调（SFT）和赔率比策略优化（ORPO），每种方法又分有无显式链式思考（COT）推理步骤。通过将这四种训练后变体与基础 Instruct 模型进行比较来评估其表现。这些模型在模拟治疗会话中进行了基准测试，性能通过一个经过以人工评估微调的 LLM 评审在 ACT 忠实度量表（ACT-FM）和治疗师共情量表（TES）上进行了定量评估。我们的研究结果表明，ORPO 训练的模型在 ACT 忠实度（ χ2(5)=185.15,p<.001 ）和治疗共情（ χ2(5)=140.37,p<.001 ）方面显著优于其 SFT 和 Instruct 对应模型。 COT 的效果具有条件性：它为 SFT 模型提供了显著益处，使 ACT-FM 分数平均提高了 2.68 点（ p<.001 ），但对更优的 ORPO 或指令微调变体并未带来可辨别的优势。我们认为 ORPO 的优越性来自于它能够学习治疗的“过程”而非模仿“内容”，这是 ACT 的关键方面，而 COT 则为仅通过模仿训练的模型提供了必要的支架。本研究确立了偏好对齐的策略优化能够有效在小型 LLMs 中灌输 ACT 能力，并且明确推理的有用性高度依赖于其底层训练范式。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-09-08 02:30:12 UTC 发布：2025-09-08 02:30:12 UTC

#93 Psychiatry-Bench: A Multi-Task Benchmark for LLMs in Psychiatry

Large language models (LLMs) hold great promise in enhancing psychiatric practice, from improving diagnostic accuracy to streamlining clinical documentation and therapeutic support. However, existing evaluation resources heavily rely on small clinical interview corpora, social media posts, or synthetic dialogues, which limits their clinical validity and fails to capture the full complexity of psychiatric reasoning. In this work, we introduce PsychiatryBench, a rigorously curated benchmark grounded exclusively in authoritative, expert-validated psychiatric textbooks and casebooks. PsychiatryBench comprises eleven distinct question-answering tasks ranging from diagnostic reasoning and treatment planning to longitudinal follow-up, management planning, clinical approach, sequential case analysis, and multiple-choice/extended matching formats totaling over 5,300 expert-annotated items. We evaluate a diverse set of frontier LLMs (including Google Gemini, DeepSeek, LLaMA 3, and QWQ-32) alongside leading open-source medical models (e.g., OpenBiloLLM, MedGemma) using both conventional metrics and an “LLM-as-judge” similarity scoring framework. Our results reveal substantial gaps in clinical consistency and safety, particularly in multi-turn follow-up and management tasks, underscoring the need for specialized model tuning and more robust evaluation paradigms. PsychiatryBench offers a modular, extensible platform for benchmarking and improving LLM performance in high-stakes mental health applications. 大型语言模型 (LLMs) 在提升精神科实践方面具有巨大潜力，从提高诊断准确性到简化临床文书和治疗支持。然而，现有的评估资源在很大程度上依赖于小规模的临床访谈语料、社交媒体帖子或合成对话，这限制了它们的临床有效性，也无法捕捉精神科推理的全部复杂性。在这项工作中，我们介绍了 PsychiatryBench，这是一个严格策划的基准，完全基于权威的、经专家验证的精神科教科书和病例集。PsychiatryBench 包含十一种不同的问答任务，涵盖从诊断推理和治疗计划到纵向随访、管理计划、临床方法、序贯病例分析以及多项选择/扩展匹配格式，总计超过 5,300 条专家注释条目。我们使用传统指标和“LLM-as-judge”相似性评分框架，评估了一系列前沿的 LLMs（包括 Google Gemini、DeepSeek、LLaMA 3 和 QWQ-32）以及领先的开源医学模型（如 OpenBiloLLM、MedGemma）。我们的结果显示在临床一致性和安全性方面存在明显差距，尤其是在多轮跟进和管理任务中，这凸显了对专门模型微调和更健全评估范式的需求。PsychiatryBench 提供了一个模块化、可扩展的平台，用于在高风险精神健康应用中基准测试和提升 LLM 的表现。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-09-07 20:57:24 UTC 发布日期：2025-09-07 20:57:24 UTC

#94 Generating Individual Travel Diaries Using Large Language Models Informed by Census and Land-Use Data #94 使用大型语言模型并结合人口普查与土地利用数据生成个人旅行日记

This study introduces a Large Language Model (LLM) scheme for generating individual travel diaries in agent-based transportation models. While traditional approaches rely on large quantities of proprietary household travel surveys, the method presented in this study generates personas stochastically from open-source American Community Survey (ACS) and Smart Location Database (SLD) data, then synthesizes diaries through direct prompting. This study features a novel one-to-cohort realism score: a composite of four metrics (Trip Count Score, Interval Score, Purpose Score, and Mode Score) validated against the Connecticut Statewide Transportation Study (CSTS) diaries, matched across demographic variables. The validation utilizes Jensen-Shannon Divergence to measure distributional similarities between generated and real diaries. When compared to diaries generated with classical methods (Negative Binomial for trip generation; Multinomial Logit for mode/purpose) calibrated on the validation set, LLM-generated diaries achieve comparable overall realism (LLM mean: 0.485 vs. 0.455). The LLM excels in determining trip purpose and demonstrates greater consistency (narrower realism score distribution), while classical models lead in numerical estimates of trip count and activity duration. Aggregate validation confirms the LLM’s statistical representativeness (LLM mean: 0.612 vs. 0.435), demonstrating LLM’s zero-shot viability and establishing a quantifiable metric of diary realism for future synthetic diary evaluation systems. 本研究提出了一种用于在基于代理的交通模型中生成个体出行日记的大型语言模型（LLM）方案。传统方法依赖大量专有的家庭出行调查数据，而本研究提出的方法则从开源的美国社区调查（ACS）和智能位置数据库（SLD）数据中随机生成代表性人物，然后通过直接提示合成日记。本研究引入了一种新颖的“一对群体”真实度评分：由四个指标（出行次数得分、间隔得分、出行目的得分和出行方式得分）组成，并针对康涅狄格州全州交通研究（CSTS）日记在人口统计变量上进行匹配后进行了验证。验证中使用詹森-香农散度来测量生成日记与真实日记之间的分布相似性。与在验证集上校准的经典方法（用于出行生成的负二项分布；用于方式/目的的多项 Logit）生成的日记相比，LLM 生成的日记在总体真实度上达到可比水平（LLM 平均值：0.485 vs. 0.455)。该 LLM 在判断出行目的方面表现出色，并展现出更高的一致性（现实性得分分布更窄），而传统模型在出行次数和活动持续时间的数值估计方面占优。总体验证确认了该 LLM 的统计代表性（LLM 平均值：0.612 对比 0.435），证明了 LLM 在零样本情形下的可行性，并为未来合成日记评估系统建立了可量化的日记现实性指标。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题：计算与语言，人工智能，机器学习

Publish: 2025-09-07 17:03:08 UTC 发布日期：2025-09-07 17:03:08 UTC

Authors: [Jing Ren](https://arxiv.org/search/?searchtype=author&query=Jing Ren), [Weiqi Wang](https://arxiv.org/search/?searchtype=author&query=Weiqi Wang) 作者：Jing Ren、Weiqi Wang

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-09-07 10:24:28 UTC 发布：2025-09-07 10:24:28 UTC

#96 Beyond I’m Sorry, I Can’t: Dissecting Large Language Model Refusal #96 超越“抱歉，我做不到”：剖析大型语言模型的拒绝行为

Refusal on harmful prompts is a key safety behaviour in instruction-tuned large language models (LLMs), yet the internal causes of this behaviour remain poorly understood. We study two public instruction-tuned models, Gemma-2-2B-IT and LLaMA-3.1-8B-IT, using sparse autoencoders (SAEs) trained on residual-stream activations. Given a harmful prompt, we search the SAE latent space for feature sets whose ablation flips the model from refusal to compliance, demonstrating causal influence and creating a jailbreak. Our search proceeds in three stages: (1) Refusal Direction: find a refusal-mediating direction and collect SAE features near that direction; (2) Greedy Filtering: prune to a minimal set; and (3) Interaction Discovery: fit a factorization machine (FM) that captures nonlinear interactions among the remaining active features and the minimal set. This pipeline yields a broad set of jailbreak-critical features, offering insight into the mechanistic basis of refusal. Moreover, we find evidence of redundant features that remain dormant unless earlier features are suppressed. Our findings highlight the potential for fine-grained auditing and targeted intervention in safety behaviours by manipulating the interpretable latent space. 在有害提示下拒绝响应是指令调优大型语言模型（LLMs）的一项关键安全行为，但这一行为的内部成因仍然知之甚少。我们研究了两种公开的指令调优模型 Gemma-2-2B-IT 和 LLaMA-3.1-8B-IT，使用在残差流激活上训练的稀疏自编码器（SAE）。针对有害提示，我们在 SAE 的潜在空间中搜索那些其消融会将模型从拒绝转为服从的特征集，证明其因果影响并制造出越狱方法。我们的搜索分为三个阶段：（1）拒绝方向：找到一个介导拒绝的方向并收集位于该方向附近的 SAE 特征；（2）贪婪筛选：剪除至最小集合；（3）交互发现：拟合一个因子分解机（FM），以捕捉剩余活跃特征与最小集合之间的非线性交互。该流程产生了一组广泛的对越狱至关重要的特征，为拒绝的机械基础提供了见解。此外，我们发现存在冗余特征，除非先前的特征被抑制，否则这些冗余特征保持沉寂。我们的研究结果强调了通过操控可解释的潜在空间来对安全行为进行细粒度审计和有针对性干预的潜力。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-09-07 02:29:07 UTC 发布：2025-09-07 02:29:07 UTC

#97 LLM-Based Instance-Driven Heuristic Bias In the Context of a Biased Random Key Genetic Algorithm #97 基于 LLM 的实例驱动启发式偏差：带偏置随机键遗传算法的背景下 [PDF ] [Copy] [Kimi ] [REL]

Subjects: Neural and Evolutionary Computing, Artificial Intelligence, Computation and Language 受试领域：神经与进化计算、人工智能、计算与语言

Publish: 2025-09-05 21:46:41 UTC 发布：2025-09-05 21:46:41 UTC

#98 Differential Robustness in Transformer Language Models: Empirical Evaluation Under Adversarial Text Attacks #98 变压器语言模型的差异鲁棒性：在对抗性文本攻击下的实证评估 [PDF ] [Copy] [Kimi ] [REL]

Subjects: Cryptography and Security, Artificial Intelligence, Computation and Language 主题：密码学与安全、人工智能、计算与语言

Publish: 2025-09-05 21:43:06 UTC 发布：2025-09-05 21:43:06 UTC

#99 The Non-Determinism of Small LLMs: Evidence of Low Answer Consistency in Repetition Trials of Standard Multiple-Choice Benchmarks #99 小型 LLM 的非确定性：在标准多项选择基准的重复试验中答案一致性低的证据 [PDF 2 ] [Copy] [Kimi 3 ] [REL]

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-09-05 17:31:14 UTC 发布：2025-09-05 17:31:14 协调世界时

#100 Temporal Preferences in Language Models for Long-Horizon Assistance #100 语言模型在长期辅助中的时间偏好

We study whether language models (LMs) exhibit future- versus present-oriented preferences in intertemporal choice and whether those preferences can be systematically manipulated. Using adapted human experimental protocols, we evaluate multiple LMs on time-tradeoff tasks and benchmark them against a sample of human decision makers. We introduce an operational metric, the Manipulability of Time Orientation (MTO), defined as the change in an LM’s revealed time preference between future- and present-oriented prompts. In our tests, reasoning-focused models (e.g., DeepSeek-Reasoner and grok-3-mini) choose later options under future-oriented prompts but only partially personalize decisions across identities or geographies. Moreover, models that correctly reason about time orientation internalize a future orientation for themselves as AI decision makers. We discuss design implications for AI assistants that should align with heterogeneous, long-horizon goals and outline a research agenda on personalized contextual calibration and socially aware deployment. 我们研究语言模型（LM）在跨期选择中是否表现出面向未来或面向当下的偏好，以及这些偏好是否可以被系统性地操控。采用改编自人类实验的协议，我们在时间权衡任务上评估了多种语言模型，并将它们与一组人类决策者的样本进行对比。我们提出了一个操作性指标——时间取向可操控性（MTO），定义为在面向未来与面向当下提示之间，语言模型显性时间偏好的变化。在我们的测试中，注重推理的模型（例如 DeepSeek-Reasoner 和 grok-3-mini）在面向未来的提示下更倾向选择较晚的选项，但仅在一定程度上根据身份或地理位置个性化决策。此外，能够正确推理时间取向的模型会将面向未来的取向内化为它们自身作为人工智能决策者的属性。我们讨论了应与异质的长期目标对齐的 AI 助手的设计含义，并概述了关于个性化情境校准和具有社会意识部署的研究议程。

Subjects: Computation and Language, Artificial Intelligence, Computers and Society

Publish: 2025-09-05 16:21:23 UTC 发布：2025-09-05 16:21:23 世界协调时

#101 CTCC: A Robust and Stealthy Fingerprinting Framework for Large Language Models via Cross-Turn Contextual Correlation Backdoor #101 CTCC：一种通过跨轮次上下文相关性后门对大型语言模型进行健壮且隐蔽指纹标记的框架

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-09-05 05:59:50 UTC 发布：2025-09-05 05:59:50 UTC

#102 Creativity Benchmark: A benchmark for marketing creativity for LLM models #102 创造力基准：用于评估 LLM 模型营销创造力的基准

We introduce Creativity Benchmark, an evaluation framework for large language models (LLMs) in marketing creativity. The benchmark covers 100 brands (12 categories) and three prompt types (Insights, Ideas, Wild Ideas). Human pairwise preferences from 678 practising creatives over 11,012 anonymised comparisons, analysed with Bradley-Terry models, show tightly clustered performance with no model dominating across brands or prompt types: the top-bottom spread is Δθ≈0.45, which implies a head-to-head win probability of 0.61; the highest-rated model beats the lowest only about 61% of the time. We also analyse model diversity using cosine distances to capture intra- and inter-model variation and sensitivity to prompt reframing. Comparing three LLM-as-judge setups with human rankings reveals weak, inconsistent correlations and judge-specific biases, underscoring that automated judges cannot substitute for human evaluation. Conventional creativity tests also transfer only partially to brand-constrained tasks. Overall, the results highlight the need for expert human evaluation and diversity-aware workflows. 我们推出了创造力基准（Creativity Benchmark），用于评估大型语言模型（LLMs）在营销创意方面的表现。该基准涵盖 100 个品牌（12 个类别）和三种提示类型（Insights、Ideas、Wild Ideas）。来自 678 名在职创意人员的人工成对偏好，在 11,012 次匿名比较中被收集，并用 Bradley-Terry 模型分析，结果显示表现紧密聚集，没有任何模型在所有品牌或提示类型上占据主导：最高与最低的差距为 Δθ≈0.45 ，这意味着一对一对决的获胜概率为 0.61 ；评分最高的模型击败最低模型的概率仅约为 61% 。我们还使用余弦距离分析模型多样性，以捕捉模型内部和模型间的差异以及对提示重构的敏感性。将三种以 LLM 为裁判的设置与人工排名进行比较表明，相关性薄弱且不一致且存在裁判特有的偏差，这强调了自动化裁判无法替代人工评估。传统的创造力测试也仅部分可迁移到品牌约束任务。总体而言，结果凸显了需要专家人工评估和关注多样性的工作流程。

Subjects: Computation and Language, Artificial Intelligence, Human-Computer Interaction 主题：计算与语言、人工智能、人机交互

Publish: 2025-09-05 04:44:29 UTC 发布：2025-09-05 04:44:29 UTC

#103 Cross-Layer Attention Probing for Fine-Grained Hallucination Detection #103 跨层注意力探测用于细粒度幻觉检测

With the large-scale adoption of Large Language Models (LLMs) in various applications, there is a growing reliability concern due to their tendency to generate inaccurate text, i.e. hallucinations. In this work, we propose Cross-Layer Attention Probing (CLAP), a novel activation probing technique for hallucination detection, which processes the LLM activations across the entire residual stream as a joint sequence. Our empirical evaluations using five LLMs and three tasks show that CLAP improves hallucination detection compared to baselines on both greedy decoded responses as well as responses sampled at higher temperatures, thus enabling fine-grained detection, i.e. the ability to disambiguate hallucinations and non-hallucinations among different sampled responses to a given prompt. This allows us to propose a detect-then-mitigate strategy using CLAP to reduce hallucinations and improve LLM reliability compared to direct mitigation approaches. Finally, we show that CLAP maintains high reliability even when applied out-of-distribution. 随着大型语言模型（LLMs）在各类应用中的大规模采用，其倾向于生成不准确文本（即幻觉）引发的可靠性问题日益突出。在这项工作中，我们提出了跨层注意力探测（Cross-Layer Attention Probing，CLAP），这是一种用于幻觉检测的新型激活探测技术，它将整个残差流中的 LLM 激活作为一个联合序列来处理。我们使用五种 LLM 和三项任务进行的实证评估表明，CLAP 在贪心解码响应以及在更高温度下抽样得到的响应上，相较基线方法都能改进幻觉检测，从而实现细粒度检测，即能够在针对给定提示的不同抽样响应中区分幻觉与非幻觉。这使我们能提出一种基于 CLAP 的先检测再缓解策略，以减少幻觉并提高 LLM 的可靠性，相较于直接缓解方法更为有效。最后，我们展示了 CLAP 在应用到分布外数据时仍能保持高可靠性。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-09-04 14:37:34 UTC 发布：2025-09-04 14:37:34 UTC

#104 Structured Information Matters: Explainable ICD Coding with Patient-Level Knowledge Graphs #104 结构化信息很重要：基于患者级知识图的可解释 ICD 编码

Mapping clinical documents to standardised clinical vocabularies is an important task, as it provides structured data for information retrieval and analysis, which is essential to clinical research, hospital administration and improving patient care. However, manual coding is both difficult and time-consuming, making it impractical at scale. Automated coding can potentially alleviate this burden, improving the availability and accuracy of structured clinical data. The task is difficult to automate, as it requires mapping to high-dimensional and long-tailed target spaces, such as the International Classification of Diseases (ICD). While external knowledge sources have been readily utilised to enhance output code representation, the use of external resources for representing the input documents has been underexplored. In this work, we compute a structured representation of the input documents, making use of document-level knowledge graphs (KGs) that provide a comprehensive structured view of a patient’s condition. The resulting knowledge graph efficiently represents the patient-centred input documents with 23% of the original text while retaining 90% of the information. We assess the effectiveness of this graph for automated ICD-9 coding by integrating it into the state-of-the-art ICD coding architecture PLM-ICD. Our experiments yield improved Macro-F1 scores by up to 3.20% on popular benchmarks, while improving training efficiency. We attribute this improvement to different types of entities and relationships in the KG, and demonstrate the improved explainability potential of the approach over the text-only baseline. 将临床文档映射到标准化临床词汇表是一项重要任务，因为它为信息检索和分析提供了结构化数据，而这些对于临床研究、医院管理和改善病人护理都是必不可少的。然而，手工编码既困难又耗时，难以在大规模应用。自动编码有望减轻这一负担，提高结构化临床数据的可用性和准确性。该任务难以实现自动化，因为它需要映射到高维且长尾的目标空间，例如《国际疾病分类》（ICD）。尽管外部知识源已被广泛用于增强输出代码的表示，使用外部资源来表示输入文档的研究却不多。在这项工作中，我们计算了输入文档的结构化表示，利用提供病人状况全面结构化视图的文档级知识图（KG）。由此生成的知识图以原始文本 23％的容量高效地表示以病人为中心的输入文档，同时保留了 90％的信息。我们通过将该图谱整合到最先进的 ICD 编码架构 PLM-ICD 中，评估其对自动化 ICD-9 编码的有效性。我们的实验在流行基准上将 Macro-F1 分数提高了最多 3.20%，同时提升了训练效率。我们将这一改进归因于知识图谱中不同类型的实体和关系，并展示了该方法在可解释性方面相比仅文本基线的提升潜力。

Subjects: Computation and Language, Artificial Intelligence 主题：计算与语言，人工智能

Publish: 2025-09-04 12:01:38 UTC 发布：2025-09-04 12:01:38 UTC

#105 Wave-Based Semantic Memory with Resonance-Based Retrieval: A Phase-Aware Alternative to Vector Embedding Stores #105 基于波的语义记忆与基于共振的检索：一种相位感知的向量嵌入存储替代方案 [PDF ] [Copy] [Kimi ] [REL]

Author: [Aleksandr Listopad](https://arxiv.org/search/?searchtype=author&query=Aleksandr Listopad) 作者：Aleksandr Listopad

Conventional vector-based memory systems rely on cosine or inner product similarity within real-valued embedding spaces. While computationally efficient, such approaches are inherently phase-insensitive and limited in their ability to capture resonance phenomena crucial for meaning representation. We propose Wave-Based Semantic Memory, a novel framework that models knowledge as wave patterns ψ(x)=A(x)eiϕ(x) and retrieves it through resonance-based interference. This approach preserves both amplitude and phase information, enabling more expressive and robust semantic similarity. We demonstrate that resonance-based retrieval achieves higher discriminative power in cases where vector methods fail, including phase shifts, negations, and compositional queries. Our implementation, ResonanceDB, shows scalability to millions of patterns with millisecond latency, positioning wave-based memory as a viable alternative to vector stores for AGI-oriented reasoning and knowledge representation. 传统的基于向量的记忆系统依赖于实值嵌入空间内的余弦或内积相似度。虽然计算上高效，但此类方法本质上对相位不敏感，且在捕捉对意义表示至关重要的共振现象方面存在局限。我们提出了基于波的语义记忆，一种新框架，将知识建模为波模式 ψ(x)=A(x)eiϕ(x) ，并通过基于共振的干涉来检索。该方法保留了振幅和相位信息，从而实现更具表现力且更稳健的语义相似性。我们证明，在向量方法失效的情形下（包括相位偏移、否定和组合查询），基于共振的检索能够获得更高的判别能力。我们的实现 ResonanceDB 展示了对百万级模式的可扩展性并能在毫秒级延迟内响应，使基于波的记忆成为面向通用人工智能推理与知识表示的可行替代方案。

Subjects: Information Retrieval, Artificial Intelligence, Databases

Publish: 2025-08-21 10:13:24 UTC

#106 Personas within Parameters: Fine-Tuning Small Language Models with Low-Rank Adapters to Mimic User Behaviors #106 参数内的人格：使用低秩适配器微调小型语言模型以模拟用户行为

A long-standing challenge in developing accurate recommendation models is simulating user behavior, mainly due to the complex and stochastic nature of user interactions. Towards this, one promising line of work has been the use of Large Language Models (LLMs) for simulating user behavior. However, aligning these general-purpose large pre-trained models with user preferences necessitates: (i) effectively and continously parsing large-scale tabular user-item interaction data, (ii) overcoming pre-training-induced inductive biases to accurately learn user specific knowledge, and (iii) achieving the former two at scale for millions of users. While most previous works have focused on complex methods to prompt an LLM or fine-tune it on tabular interaction datasets, our approach shifts the focus to extracting robust textual user representations using a frozen LLM and simulating cost-effective, resource-efficient user agents powered by fine-tuned Small Language Models (SLMs). Further, we showcase a method for training multiple low-rank adapters for groups of users or \textit{persona}, striking an optimal balance between scalability and performance of user behavior agents. Our experiments provide compelling empirical evidence of the efficacy of our methods, demonstrating that user agents developed using our approach have the potential to bridge the gap between offline metrics and real-world performance of recommender systems. 在开发准确的推荐模型时，一个长期存在的挑战是模拟用户行为，主要因为用户交互的复杂性和随机性。为此，一条有前景的研究方向是使用大型语言模型（LLMs）来模拟用户行为。然而，将这些通用的大型预训练模型与用户偏好对齐需要： (i) 有效且持续地解析大规模表格化的用户-项目交互数据，(ii) 克服预训练引入的归纳偏差以准确学习用户特定的知识，(iii) 在数百万用户的规模上实现前两者。尽管以往的大多数工作集中于用复杂的方法来提示 LLM 或在表格交互数据集上微调它，我们的方法将重点转向使用冻结的 LLM 提取稳健的文本化用户表示，并模拟由微调的小型语言模型（SLMs）驱动的成本效益高、资源高效的用户代理。此外，我们展示了一种为用户群体或“人格”（persona）训练多个低秩适配器的方法，在用户行为代理的可扩展性和性能之间取得最佳平衡。我们的实验证明了我们方法的有效性，实证证据令人信服，表明使用我们方法开发的用户代理有可能弥合离线评估指标与推荐系统实际表现之间的差距。

Subjects: Information Retrieval, Artificial Intelligence, Computation and Language, Machine Learning 主题：信息检索、人工智能、计算与语言、机器学习

Publish: 2025-08-18 22:14:57 UTC 发布：2025-08-18 22:14:57 世界协调时间 (UTC)

#107 AI-Powered Assistant for Long-Term Access to RHIC Knowledge #107 基于人工智能的 RHIC 知识长期访问助手

As the Relativistic Heavy Ion Collider (RHIC) at Brookhaven National Laboratory concludes 25 years of operation, preserving not only its vast data holdings (∼1 ExaByte) but also the embedded scientific knowledge becomes a critical priority. The RHIC Data and Analysis Preservation Plan (DAPP) introduces an AI-powered assistant system that provides natural language access to documentation, workflows, and software, with the aim of supporting reproducibility, education, and future discovery. Built upon Large Language Models using Retrieval-Augmented Generation and the Model Context Protocol, this assistant indexes structured and unstructured content from RHIC experiments and enables domain-adapted interaction. We report on the deployment, computational performance, ongoing multi-experiment integration, and architectural features designed for a sustainable and explainable long-term AI access. Our experience illustrates how modern AI/ML tools can transform the usability and discoverability of scientific legacy data. 随着布鲁克海文国家实验室的相对论性重离子对撞机（RHIC）结束其 25 年的运行，保护其不仅庞大的数据存储（ ∼ 1 ExaByte），还包括其中蕴含的科学知识，已成为一项关键优先事项。RHIC 数据与分析保存计划（DAPP）引入了一个由人工智能驱动的助手系统，提供对文档、工作流和软件的自然语言访问，旨在支持可重复性、教育和未来的发现。该助手基于大型语言模型并采用检索增强生成和模型上下文协议，索引来自 RHIC 实验的结构化和非结构化内容，并实现领域适配的交互。我们报告了该系统的部署、计算性能、正在进行的多实验整合以及为实现可持续且可解释的长期 AI 访问而设计的架构特性。我们的经验展示了现代 AI/ML 工具如何改变科学遗留数据的可用性和可发现性。

Subjects: Information Retrieval, Artificial Intelligence, Computation and Language 主题：信息检索、人工智能、计算与语言

Publish: 2025-08-18 15:16:29 UTC 发布：2025-08-18 15:16:29 UTC

#108 GeoGPT.RAG Technical Report

Authors: [Fei Huang](https://arxiv.org/search/?searchtype=author&query=Fei Huang), [Fan Wu](https://arxiv.org/search/?searchtype=author&query=Fan Wu), [Zeqing Zhang](https://arxiv.org/search/?searchtype=author&query=Zeqing Zhang), [Qihao Wang](https://arxiv.org/search/?searchtype=author&query=Qihao Wang), [Long Zhang](https://arxiv.org/search/?searchtype=author&query=Long Zhang), [Grant Michael Boquet](https://arxiv.org/search/?searchtype=author&query=Grant Michael Boquet), [Hongyang Chen](https://arxiv.org/search/?searchtype=author&query=Hongyang Chen) 作者：黄飞、吴帆、张泽清、王起豪、张龙、Grant Michael Boquet、陈宏阳

GeoGPT is an open large language model system built to advance research in the geosciences. To enhance its domain-specific capabilities, we integrated Retrieval Augmented Generation(RAG), which augments model outputs with relevant information retrieved from an external knowledge source. GeoGPT uses RAG to draw from the GeoGPT Library, a specialized corpus curated for geoscientific content, enabling it to generate accurate, context-specific answers. Users can also create personalized knowledge bases by uploading their own publication lists, allowing GeoGPT to retrieve and respond using user-provided materials. To further improve retrieval quality and domain alignment, we fine-tuned both the embedding model and a ranking model that scores retrieved passages by relevance to the query. These enhancements optimize RAG for geoscience applications and significantly improve the system’s ability to deliver precise and trustworthy outputs. GeoGPT reflects a strong commitment to open science through its emphasis on collaboration, transparency, and community driven development. As part of this commitment, we have open-sourced two core RAG components-GeoEmbedding and GeoReranker-to support geoscientists, researchers, and professionals worldwide with powerful, accessible AI tools. GeoGPT 是一个开放的大型语言模型系统，旨在推进地球科学领域的研究。为了增强其领域特定能力，我们整合了检索增强生成（RAG），该方法通过从外部知识来源检索相关信息来补充模型输出。GeoGPT 使用 RAG 从 GeoGPT 文库中提取内容——这是一个为地球科学内容精心策划的专业语料库，使其能够生成准确、具有上下文针对性的答案。用户还可以通过上传自己的出版物清单来创建个性化知识库，允许 GeoGPT 使用用户提供的资料进行检索和回答。为进一步提高检索质量和领域一致性，我们对用于向量化的嵌入模型和对检索到段落按与查询相关性进行评分的排序模型进行了微调。这些改进优化了 RAG 在地球科学应用中的表现，并显著提高了系统提供精确且可信输出的能力。GeoGPT 通过强调协作、透明和社区驱动开发，体现了对开放科学的坚定承诺。作为这一承诺的一部分，我们已将两个核心 RAG 组件开源——GeoEmbedding 和 GeoReranker——以便为全球地球科学家、研究人员和专业人士提供强大且易于获取的人工智能工具。

Subjects: Information Retrieval, Artificial Intelligence 主题：信息检索，人工智能

Publish: 2025-08-18 08:29:22 UTC 发布：2025-08-18 08:29:22 UTC

#109 TalkPlayData 2: An Agentic Synthetic Data Pipeline for Multimodal Conversational Music Recommendation #109 TalkPlayData 2：一种用于多模态对话式音乐推荐的能动合成数据管道

Authors: [Keunwoo Choi](https://arxiv.org/search/?searchtype=author&query=Keunwoo Choi), [Seungheon Doh](https://arxiv.org/search/?searchtype=author&query=Seungheon Doh), [Juhan Nam](https://arxiv.org/search/?searchtype=author&query=Juhan Nam) 作者：Keunwoo Choi、Seungheon Doh、Juhan Nam

We present TalkPlayData 2, a synthetic dataset for multimodal conversational music recommendation generated by an agentic data pipeline. In TalkPlayData 2 pipeline, multiple large language model (LLM) agents are created under various roles with specialized prompts and access to different parts of information, and the chat data is acquired by logging the conversation between the Listener LLM and the Recsys LLM. To cover various conversation scenarios, for each conversation, the Listener LLM is conditioned on a finetuned conversation goal. Finally, all the LLMs are multimodal with audio and images, allowing a simulation of multimodal recommendation and conversation. In the LLM-as-a-judge and subjective evaluation experiments, TalkPlayData 2 achieved the proposed goal in various aspects related to training a generative recommendation model for music. TalkPlayData 2 and its generation code are open-sourced at https://talkpl.ai/talkplaydata2.html. 我们提出了 TalkPlayData 2，这是一种通过智能代理数据管道生成的用于多模态对话式音乐推荐的合成数据集。在 TalkPlayData 2 管道中，创建了多个具有不同角色的大型语言模型（LLM）代理，这些代理配备了专用的提示词并能够访问不同部分的信息，聊天数据通过记录 Listener LLM 与 Recsys LLM 之间的对话获得。为了涵盖各种对话场景，对于每次对话，Listener LLM 都会基于一个微调过的对话目标进行条件设定。最后，所有的 LLM 都是多模态的，支持音频和图像，从而可以模拟多模态的推荐与对话。在以 LLM 作为评判者以及主观评估的实验中，TalkPlayData 2 在与训练用于音乐生成式推荐模型相关的各方面都实现了所设定的目标。TalkPlayData 2 及其生成代码已在 https://talkpl.ai/talkplaydata2.html 开源。

Subjects: Information Retrieval, Artificial Intelligence, Multimedia, Sound, Audio and Speech Processing 主题：信息检索，人工智能，多媒体，声音，音频与语音处理

Publish: 2025-08-18 05:06:58 UTC 发布：2025-08-18 05:06:58 UTC

#110 Text-to-SQL Oriented to the Process Mining Domain: A PT-EN Dataset for Query Translation #110 面向流程挖掘领域的文本到 SQL：用于查询翻译的葡英双语数据集

This paper introduces text-2-SQL-4-PM, a bilingual (Portuguese-English) benchmark dataset designed for the text-to-SQL task in the process mining domain. Text-to-SQL conversion facilitates natural language querying of databases, increasing accessibility for users without SQL expertise and productivity for those that are experts. The text-2-SQL-4-PM dataset is customized to address the unique challenges of process mining, including specialized vocabularies and single-table relational structures derived from event logs. The dataset comprises 1,655 natural language utterances, including human-generated paraphrases, 205 SQL statements, and ten qualifiers. Methods include manual curation by experts, professional translations, and a detailed annotation process to enable nuanced analyses of task complexity. Additionally, a baseline study using GPT-3.5 Turbo demonstrates the feasibility and utility of the dataset for text-to-SQL applications. The results show that text-2-SQL-4-PM supports evaluation of text-to-SQL implementations, offering broader applicability for semantic parsing and other natural language processing tasks. 本文介绍了 text-2-SQL-4-PM，这是一个为流程挖掘领域的文本到 SQL 任务设计的双语（葡萄牙语-英语）基准数据集。文本到 SQL 的转换便于对数据库进行自然语言查询，提高了非 SQL 专业用户的可访问性，并提升了专家用户的工作效率。text-2-SQL-4-PM 数据集针对流程挖掘的独特挑战进行了定制，包括专门术语和源自事件日志的单表关系结构。该数据集包含 1,655 条自然语言表达（包括人工生成的释义）、205 条 SQL 语句和十个限定符。方法包括专家手工整理、专业翻译以及详细的注释流程，以便对任务复杂性进行细致分析。此外，使用 GPT-3.5 Turbo 进行的基线研究展示了该数据集在文本到 SQL 应用中的可行性和实用性。结果表明，text-2-SQL-4-PM 支持对文本到 SQL 实现的评估，并为语义解析和其他自然语言处理任务提供了更广泛的适用性。

Subjects: Information Retrieval, Artificial Intelligence, Computation and Language, Databases 主题：信息检索、人工智能、计算与语言、数据库

Publish: 2025-08-18 01:25:41 UTC

#111 Forecasting Clicks in Digital Advertising: Multimodal Inputs and Interpretable Outputs #111 预测数字广告中的点击量：多模态输入与可解释输出

Authors: [Briti Gangopadhyay](https://arxiv.org/search/?searchtype=author&query=Briti Gangopadhyay), [Zhao Wang](https://arxiv.org/search/?searchtype=author&query=Zhao Wang), [Shingo Takamatsu](https://arxiv.org/search/?searchtype=author&query=Shingo Takamatsu) 作者：Briti Gangopadhyay, Zhao Wang, Shingo Takamatsu

Forecasting click volume is a key task in digital advertising, influencing both revenue and campaign strategy. Traditional time series models rely solely on numerical data, often overlooking rich contextual information embedded in textual elements, such as keyword updates. We present a multimodal forecasting framework that combines click data with textual logs from real-world ad campaigns and generates human-interpretable explanations alongside numeric predictions. Reinforcement learning is used to improve comprehension of textual information and enhance fusion of modalities. Experiments on a large-scale industry dataset show that our method outperforms baselines in both accuracy and reasoning quality. 预测点击量是数字广告中的一项关键任务，影响收入和广告策略。传统的时间序列模型仅依赖数值数据，常常忽视嵌入在文本元素（如关键词更新）中的丰富上下文信息。我们提出了一个多模态预测框架，将点击数据与来自真实广告活动的文本日志相结合，并在生成数值预测的同时提供可供人类理解的解释。我们使用强化学习来提高对文本信息的理解并增强模态融合。在大规模行业数据集上的实验表明，我们的方法在准确性和推理质量上均优于基线方法。

Subjects: Information Retrieval, Artificial Intelligence 主题：信息检索，人工智能

Publish: 2025-08-15 10:01:53 UTC 发布：2025-08-15 10:01:53 UTC

#112 DB3 Team’s Solution For Meta KDD Cup’ 25 #112 DB3 团队为 Meta KDD Cup'25 提供的解决方案

Subjects: Information Retrieval, Artificial Intelligence, Computation and Language, Machine Learning 主题：信息检索、人工智能、计算与语言、机器学习

Publish: 2025-08-12 08:27:53 UTC 发布：2025-08-12 08:27:53 UTC

#113 AEGIS: An Agent for Extraction and Geographic Identification in Scholarly Proceedings #113 AEGIS：用于学术会议论文的抽取与地理识别的代理

Authors: [Om Vishesh](https://arxiv.org/search/?searchtype=author&query=Om Vishesh), [Harshad Khadilkar](https://arxiv.org/search/?searchtype=author&query=Harshad Khadilkar), [Deepak Akkil](https://arxiv.org/search/?searchtype=author&query=Deepak Akkil) 作者：Om Vishesh、Harshad Khadilkar、Deepak Akkil

Keeping pace with the rapid growth of academia literature presents a significant challenge for researchers, funding bodies, and academic societies. To address the time-consuming manual effort required for scholarly discovery, we present a novel, fully automated system that transitions from data discovery to direct action. Our pipeline demonstrates how a specialized AI agent, ‘Agent-E’, can be tasked with identifying papers from specific geographic regions within conference proceedings and then executing a Robotic Process Automation (RPA) to complete a predefined action, such as submitting a nomination form. We validated our system on 586 papers from five different conferences, where it successfully identified every target paper with a recall of 100% and a near perfect accuracy of 99.4%. This demonstration highlights the potential of task-oriented AI agents to not only filter information but also to actively participate in and accelerate the workflows of the academic community. 随着学术文献的快速增长，研究人员、资助机构和学术团体面对着重大挑战。为了解决学术发现过程中费时的人工工作，我们提出了一种新颖的、全自动化系统，实现从数据发现到直接行动的转变。我们的管道展示了如何将一个专门的 AI 代理“Agent-E”用于在会议论文集中识别特定地理区域的论文，然后执行机器人流程自动化（RPA）以完成预定义的操作，例如提交提名表。我们在来自五个不同会议的 586 篇论文上验证了系统，成功识别出所有目标论文，召回率为 100%，准确率近乎完美，为 99.4%。该演示突显了面向任务的 AI 代理不仅能筛选信息，还能积极参与并加速学术界工作流程的潜力。

Subject: Machine Learning 主题：机器学习

Publish: 2025-09-11 13:52:52 UTC 发布：2025-09-11 13:52:52 UTC

#114 Clip Your Sequences Fairly: Enforcing Length Fairness for Sequence-Level RL

Subject: Machine Learning 主题：机器学习

Publish: 2025-09-11 06:27:10 UTC 发布：2025-09-11 06:27:10 协调世界时

#115 Generative Engine Optimization: How to Dominate AI Search #115 生成引擎优化：如何主导人工智能搜索

Authors: [Mahe Chen](https://arxiv.org/search/?searchtype=author&query=Mahe Chen), [Xiaoxuan Wang](https://arxiv.org/search/?searchtype=author&query=Xiaoxuan Wang), [Kaiwen Chen](https://arxiv.org/search/?searchtype=author&query=Kaiwen Chen), [Nick Koudas](https://arxiv.org/search/?searchtype=author&query=Nick Koudas) 作者：Mahe Chen、Xiaoxuan Wang、Kaiwen Chen、Nick Koudas

The rapid adoption of generative AI-powered search engines like ChatGPT, Perplexity, and Gemini is fundamentally reshaping information retrieval, moving from traditional ranked lists to synthesized, citation-backed answers. This shift challenges established Search Engine Optimization (SEO) practices and necessitates a new paradigm, which we term Generative Engine Optimization (GEO). This paper presents a comprehensive comparative analysis of AI Search and traditional web search (Google). Through a series of large-scale, controlled experiments across multiple verticals, languages, and query paraphrases, we quantify critical differences in how these systems source information. Our key findings reveal that AI Search exhibit a systematic and overwhelming bias towards Earned media (third-party, authoritative sources) over Brand-owned and Social content, a stark contrast to Google’s more balanced mix. We further demonstrate that AI Search services differ significantly from each other in their domain diversity, freshness, cross-language stability, and sensitivity to phrasing. Based on these empirical results, we formulate a strategic GEO agenda. We provide actionable guidance for practitioners, emphasizing the critical need to: (1) engineer content for machine scannability and justification, (2) dominate earned media to build AI-perceived authority, (3) adopt engine-specific and language-aware strategies, and (4) overcome the inherent “big brand bias” for niche players. Our work provides the foundational empirical analysis and a strategic framework for achieving visibility in the new generative search landscape. 像 ChatGPT、Perplexity 和 Gemini 这样的生成式人工智能搜索引擎的快速普及，正在从根本上重塑信息检索方式，从传统的排序列表转向合成的、有引用支持的答案。这一转变对既有的搜索引擎优化（SEO）实践提出了挑战，并需要一种我们称之为生成引擎优化（GEO）的新范式。本文对 AI 搜索与传统网页搜索（Google）进行了全面的比较分析。通过在多个垂直领域、语言和查询改写上开展一系列大规模受控实验，我们量化了这些系统在信息来源方面的关键差异。我们的主要发现显示，AI 搜索系统在来源偏好上存在系统性且显著的倾向：它们更偏向于“赢得的媒体”（第三方权威来源），而非品牌自有和社交内容，这与 Google 更为平衡的混合形成鲜明对比。我们还证明，不同的 AI 搜索服务在域名多样性、新鲜度、跨语言稳定性以及对措辞的敏感性方面存在显著差异。基于这些实证结果，我们制定了一个战略性的 GEO 议程。我们为从业者提供可操作的指导，强调了以下关键需求：(1) 设计内容以便机器可扫描并能被证明其合理性，(2) 主导自然获取的媒体以建立被 AI 视为权威的地位，(3) 采用针对搜索引擎和语言的特定策略，(4) 帮助小众参与者克服固有的“强势品牌偏见”。我们的工作提供了在新生成式搜索环境中实现可见性的基础性实证分析和战略框架。

Subjects: Information Retrieval, Computation and Language, Social and Information Networks 主题：信息检索、计算与语言、社会与信息网络

Publish: 2025-09-10 18:29:18 UTC 发布时间：2025-09-10 18:29:18 UTC

1.3 Huggingface

2. 简单记录

3. 其他

图片插入

![](https://gitee.com/dujh22/pic/raw/master/logicReason/SLR.png)

2025-09-15科研追新

2025-09-15科研追新

1. 源数据

1.1 媒体

1.2 Arxiv

1.2.1 Computation and Language

#18 !MSA at BAREC Shared Task 2025: Ensembling Arabic Transformers for Readability Assessment #18 !MSA 在 BAREC 共享任务 2025：集成阿拉伯语变换器用于可读性评估 [PDF] [复制] [Kimi] [REL]

#24 Emulating Public Opinion: A Proof-of-Concept of AI-Generated Synthetic Survey Responses for the Chilean Case #24 模拟公众舆论：面向智利案例的 AI 生成合成调查回应的概念验证 [PDF] [复制] [Kimi] [REL]

#25 Topic-Guided Reinforcement Learning with LLMs for Enhancing Multi-Document Summarization #25 主题引导的强化学习与 LLMs 用于增强多文档摘要 [PDF] [复制] [Kimi] [REL]

#26 Pragmatic Frames Evoked by Gestures: A FrameNet Brasil Approach to Multimodality in Turn Organization #26 手势唤起的语用框架：FrameNet Brasil 对话轮组织多模态研究

#27 HEFT: A Coarse-to-Fine Hierarchy for Enhancing the Efficiency and Accuracy of Language Model Reasoning #27 HEFT：一种由粗到细的层次结构，用于提升语言模型推理的效率与准确性

#28 Discrimination by LLMs: Cross-lingual Bias Assessment and Mitigation in Decision-Making and Summarisation #28 LLMs 对歧视的识别：跨语种偏见评估及在决策与摘要中的缓解

#29 MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools #29 MCP-AgentBench：使用 MCP 调解的工具评估现实世界语言代理性能

#30 Benchmarking Vision-Language Models on Chinese Ancient Documents: From OCR to Knowledge Reasoning #30 对中文古籍文献的视觉-语言模型基准测试：从光学字符识别到知识推理 [PDF 1 ] [复制] [Kimi] [REL]

#31 MultimodalHugs: Enabling Sign Language Processing in Hugging Face #31 MultimodalHugs：在 Hugging Face 上实现手语处理 [PDF] [复制] [Kimi 1 ] [REL]

#32 A meta-analysis on the performance of machine-learning based language models for sentiment analysis #32 基于机器学习的语言模型在情感分析中表现的元分析 [PDF] [复制] [Kimi] [REL]

#33 A Role-Aware Multi-Agent Framework for Financial Education Question Answering with LLMs #33 一个面向角色感知的多智能体框架，用于使用 LLMs 的金融教育问答 [PDF] [复制] [Kimi] [REL]

#34 Natural Language Translation of Formal Proofs through Informalization of Proof Steps and Recursive Summarization along Proof Structure #34 通过将证明步骤非形式化并沿证明结构进行递归摘要实现形式证明的自然语言翻译

#35 BIBERT-Pipe on Biomedical Nested Named Entity Linking at BioASQ 2025 #35 BIBERT-Pipe 在 BioASQ 2025 的生物医学嵌套命名实体链接任务 [PDF] [复制] [Kimi] [REL]

#36 DiTTO-LLM: Framework for Discovering Topic-based Technology Opportunities via Large Language Model #36 DiTTO-LLM：通过大语言模型发现基于主题的技术机会的框架 [PDF] [复制] [Kimi] [REL]

#37 ALIGNS: Unlocking nomological networks in psychological measurement through a large language model #37 ALIGNS：通过大型语言模型解锁心理测量中的法则网络 [PDF] [复制] [Kimi] [REL]

#38 Investigating Symbolic Triggers of Hallucination in Gemma Models Across HaluEval and TruthfulQA #38 在 HaluEval 和 TruthfulQA 上调查 Gemma 模型幻觉的符号触发因素 [PDF] [Copy] [Kimi 1 ] [REL]

#41 The Thinking Therapist: Training Large Language Models to Deliver Acceptance and Commitment Therapy using Supervised Fine-Tuning and Odds Ratio Policy Optimization #41 思想治疗师：使用监督微调和比值率策略优化训练大型语言模型以提供接纳与承诺疗法

#43 Generating Individual Travel Diaries Using Large Language Models Informed by Census and Land-Use Data #43 利用由人口普查和土地利用数据提供信息的大型语言模型生成个人旅行日记 [PDF] [复制] [Kimi] [REL]

#44 Assisting Research Proposal Writing with Large Language Models: Evaluation and Refinement #44 使用大型语言模型辅助科研提案写作：评估与改进 [PDF] [复制] [Kimi] [REL]

#45 Beyond I’m Sorry, I Can’t: Dissecting Large Language Model Refusal #45 超越“对不起，我不能”：剖析大型语言模型的拒绝行为

#46 The Non-Determinism of Small LLMs: Evidence of Low Answer Consistency in Repetition Trials of Standard Multiple-Choice Benchmarks #46 小型 LLMs 的非确定性：在标准多项选择基准的重复试验中低答案一致性的证据

#47 Temporal Preferences in Language Models for Long-Horizon Assistance #47 语言模型在长期辅助中的时间偏好 [PDF 1 ] [复制] [Kimi] [REL]

#48 CTCC: A Robust and Stealthy Fingerprinting Framework for Large Language Models via Cross-Turn Contextual Correlation Backdoor #48 CTCC：一种通过跨对话上下文相关性后门对大型语言模型进行鲁棒且隐蔽的指纹框架 [PDF] [复制] [Kimi] [REL]

#49 Creativity Benchmark: A benchmark for marketing creativity for LLM models #49 创造力基准：用于评估 LLM 模型营销创造力的基准 [PDF] [复制] [Kimi 1 ] [REL]

#50 Optimal Multi-Task Learning at Regularization Horizon for Speech Translation Task #50 在正则化视界下针对语音翻译任务的最优多任务学习

#52 Structured Information Matters: Explainable ICD Coding with Patient-Level Knowledge Graphs #52 结构化信息很重要：使用病人级知识图谱的可解释 ICD 编码 [PDF] [复制] [Kimi 1 ] [REL]

#53 Abduct, Act, Predict: Scaffolding Causal Inference for Automated Failure Attribution in Multi-Agent Systems #53 假设、行动、预测：为多智能体系统的自动故障归因构建因果推理支架

#54 Error Analysis in a Modular Meeting Transcription System #54 模块化会议转录系统中的错误分析 [PDF 1 ] [复制] [Kimi] [REL]

#55 VARCO-VISION-2.0 Technical Report #55 VARCO-VISION-2.0 技术报告

#56 Unified Learnable 2D Convolutional Feature Extraction for ASR #56 统一可学习二维卷积特征提取用于自动语音识别 [PDF] [复制] [Kimi] [相关]

#57 Whisper Has an Internal Word Aligner #57 Whisper 有一个内部词对齐器 [PDF 1 ] [复制] [Kimi] [REL]

#58 Vibe Check: Understanding the Effects of LLM-Based Conversational Agents’ Personality and Alignment on User Perceptions in Goal-Oriented Tasks #58 Vibe Check：理解基于 LLM 的会话代理的个性与对齐对面向目标任务的用户感知的影响 [PDF] [复制] [Kimi] [REL]

#60 Latency and Token-Aware Test-Time Compute #60 延迟与令牌感知的测试时计算

#61 Executable Ontologies: Synthesizing Event Semantics with Dataflow Architecture #61 可执行本体：用数据流架构合成事件语义 [PDF] [复制] [Kimi] [REL]

#62 HypoGeneAgent: A Hypothesis Language Agent for Gene-Set Cluster Resolution Selection Using Perturb-seq Datasets #62 HypoGeneAgent：一种用于基因集合簇解析选择的假设语言代理，基于 Perturb-seq 数据集

#63 Improving MLLM Historical Record Extraction with Test-Time Image #63 改进 MLLM 历史记录提取的测试时图像 [PDF 1 ] [Copy] [Kimi] [REL]

#64 VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions #64 VStyle：一个用于带有口语指令的语音风格适应基准 [PDF] [复制] [Kimi 2 ] [REL]

#65 LLM-Based Instance-Driven Heuristic Bias In the Context of a Biased Random Key Genetic Algorithm #65 基于 LLM 的实例驱动启发式偏置 在有偏随机键遗传算法背景下

#66 Differential Robustness in Transformer Language Models: Empirical Evaluation Under Adversarial Text Attacks #66 变压器语言模型的差异性鲁棒性：在对抗性文本攻击下的实证评估

#68 AI-Powered Assistant for Long-Term Access to RHIC Knowledge #68 基于 AI 的助理，用于长期获取 RHIC 知识 [PDF] [复制] [Kimi] [REL]

#69 Text-to-SQL Oriented to the Process Mining Domain: A PT-EN Dataset for Query Translation #69 面向流程挖掘领域的文本到 SQL：用于查询翻译的英德数据集 [PDF] [复制] [Kimi] [REL]

#70 DB3 Team’s Solution For Meta KDD Cup’ 25 #70 DB3 团队为 Meta KDD Cup'25 提交的解决方案 [PDF 1 ] [复制] [Kimi] [REL]

#71 Clip Your Sequences Fairly: Enforcing Length Fairness for Sequence-Level RL #71 公平裁剪你的序列：对序列级 RL 强制执行长度公平性 [PDF 6 ] [Copy] [Kimi 6 ] [REL]

1.2.2 Artificial Intelligence

#1 Mutual Information Tracks Policy Coherence in Reinforcement Learning #1 互信息追踪强化学习中的策略一致性

#2 Abduct, Act, Predict: Scaffolding Causal Inference for Automated Failure Attribution in Multi-Agent Systems #2 逆推、行动、预测：为多主体系统中自动故障归因搭建因果推理脚手架

#3 State Algebra for Propositional Logic #3 状态代数用于命题逻辑

#4 The Morality of Probability: How Implicit Moral Biases in LLMs May Shape the Future of Human-AI Symbiosis #4 概率的道德性：LLMs 中隐含的道德偏见如何可能塑造人类与人工智能共生的未来

#5 Investigating Language Model Capabilities to Represent and Process Formal Knowledge: A Preliminary Study to Assist Ontology Engineering #5 探究语言模型表示与处理形式知识的能力：一项辅助本体工程的初步研究 [PDF 1 ] [Copy] [Kimi ] [REL]

#6 Compartmentalised Agentic Reasoning for Clinical NLI #6 用于临床自然语言推理的分区代理化推理

#7 Towards Fully Automated Molecular Simulations: Multi-Agent Framework for Simulation Setup and Force Field Extraction #7 迈向全自动分子模拟：用于模拟设置和力场提取的多代理框架

#8 Online Robust Planning under Model Uncertainty: A Sample-Based Approach #8 在模型不确定性下的在线鲁棒规划：一种基于样本的方法

#9 Virtual Agent Economies #9 虚拟代理经济

#10 AI Harmonics: a human-centric and harms severity-adaptive AI risk assessment framework #10 AI Harmonics：以人为本且依据危害严重性自适应的 AI 风险评估框架

#11 XAgents: A Unified Framework for Multi-Agent Cooperation via IF-THEN Rules and Multipolar Task Processing Graph #11 XAgents：通过若则规则和多极任务处理图实现多智能体协作的统一框架

#12 GAMA: A General Anonymizing Multi-Agent System for Privacy Preservation Enhanced by Domain Rules and Disproof Method #12 GAMA：一种通用匿名多智能体系统，通过领域规则和反证方法增强隐私保护

#13 Evaluation of Black-Box XAI Approaches for Predictors of Values of Boolean Formulae #13 评估用于布尔公式值预测器的黑盒可解释性人工智能（XAI）方法

#14 A Markovian Framing of WaveFunctionCollapse for Procedurally Generating Aesthetically Complex Environments #14 将波函数塌缩的程序化生成美学复杂环境的马尔可夫框架

#15 The (R)evolution of Scientific Workflows in the Agentic AI Era: Towards Autonomous Science #15 在具代理性的人工智能时代，科学工作流程的（革新）革命：迈向自主科学

#16 LLMs as Agentic Cooperative Players in Multiplayer UNO #16 LLMs 作为多人 UNO 中具有代理性的合作玩家 [PDF 2 ] [Copy] [Kimi 2 ] [REL]

#17 Towards an AI-based knowledge assistant for goat farmers based on Retrieval-Augmented Generation #17 朝向基于检索增强生成的山羊养殖者 AI 知识助理

#18 Towards a Common Framework for Autoformalization #18 朝向自动形式化的通用框架

#19 A Modular and Multimodal Generative AI Framework for Urban Building Energy Data: Generating Synthetic Homes #19 一个用于城市建筑能耗数据的模块化多模态生成式人工智能框架：生成合成住宅 [PDF 1 ] [复制] [Kimi ] [关联]

#20 How well can LLMs provide planning feedback in grounded environments? #20 在有根环境中，LLMs 在提供规划反馈方面表现如何？

#21 Executable Ontologies: Synthesizing Event Semantics with Dataflow Architecture #21 可执行本体：用数据流架构合成事件语义

#22 Human-AI Collaboration Increases Efficiency in Regulatory Writing #22 人机协作提高了法规写作的效率

#23 Standards in the Preparation of Biomedical Research Metadata: A Bridge2AI Perspective #23 生物医学研究元数据准备标准：Bridge2AI 视角

#24 Is In-Context Learning Learning? #24 在上下文学习真的是在“学习”吗？

#25 Multimodal SAM-adapter for Semantic Segmentation #25 用于语义分割的多模态 SAM-adapter

#26 Diversified recommendations of cultural activities with personalized determinantal point processes #26 使用个性化行列式点过程的文化活动多样化推荐

#27 Improving Audio Event Recognition with Consistency Regularization #27 通过一致性正则化改进音频事件识别

#28 Data distribution impacts the performance and generalisability of contrastive learning-based foundation models of electrocardiograms #28 数据分布影响基于对比学习的心电图基础模型的性能与泛化性

#29 Towards Understanding Visual Grounding in Visual Language Models #29 朝着理解视觉语言模型中的视觉定位

#30 GLAM: Geometry-Guided Local Alignment for Multi-View VLP in Mammography #30 GLAM：用于乳腺 X 光多视图 VLP 的几何引导局部对齐

#65 LLM-Based Instance-Driven Heuristic Bias In the Context of a Biased Random Key Genetic Algorithm #65 基于 LLM 的实例驱动启发式偏置在有偏随机键遗传算法背景下