2025-08-09 to 12 Research Updates
2025-08-08 16:58:43 Friday ~ 2025-08-12 15:55:46 Tuesday
1. Source data
1.1 WeChat official accounts
1.1.1 量子位 (QbitAI)
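- Jensen Huang's children's career paths revealed: one studied baking, the other ran a bar, and both worked their way up from entry level to NVIDIA executives
- Something fishy about GPT-5's coding scores: 23 test problems were dropped by OpenAI itself, on a key benchmark it proposed itself
- A new bar for 3D generation: Kunlun Wanwei's new Matrix-3D model builds a game scene from a single image
- GitHub's era of independence ends: the CEO leaves to start a company and Microsoft takes full control
- OpenAI wins IOI gold, but loses to three Chinese high-school students
- GPT-5 still fails at counting letters; Marcus: the generalization problem remains unsolved, and scaling cannot deliver AGI
- The model behind the bed-making robot at WRC revealed: an end-to-end, dual-system, whole-body VLA that picks up tasks with only a little fine-tuning
- Jensen Huang bets on Chinese robotics the way he bet on OpenAI: NVIDIA's first batch of Jetson Thor chips went to him
- GPT-oss goes off the rails: with no prompt it imagines a programming problem on its own, then solves it 5,000 times over
- OpenAI's lead lasts only 5 days: Baichuan releases a new reasoning model that raises the ceiling for open-source models in the medical vertical
- 64 cards working like one: Inspur Information releases a next-generation AI supernode that runs four major Chinese open-source models at the same time
- Inference cost drops 75%: gpt-oss uses a new data type for 4x inference speed, so an 80GB GPU can run a 120-billion-parameter model
- Trusting an AI hallucination, a man replaced table salt with sodium bromide and ended up hallucinating for real
- An evolved VLA outclasses the field: two-handed picking, drift maneuvers, synchronized group dancing, and a "space capsule" driven onto the street, all from 银河通用 (Galbot)
- The World Robot Conference in one article: no need to fight the crowds on site
- 即梦 (Jimeng) upgrades its support program so that AI creators no longer have to work purely for love
- Wang Guan again: a 27M small model beats o3-mini; the post-2000 founder who turned down Musk really is different
- Has the GPT era of protein foundation models arrived?!
- Don't panic: not using AI will not make you obsolete. An engineer tests a range of tools and finds the 10x productivity myth overblown
- Dai Jifeng and Chen Tianqiao team up for an AGI debut: the strongest open-source deep research model scores 82.4 on GAIA, ahead of OpenAI
- Hands-on with Google's AI storybook: comics and picture books on demand
- 200,000 apps created in 4 months, and the product behind them | A conversation with Baidu 秒哒 (Miaoda)
- The largest high-quality scientific-reasoning post-training dataset to date is open-sourced, quickly turning Qwen3 and others into "scientists"
- Dijkstra's algorithm, an undergraduate staple, is surpassed: Duan Ran's team at Tsinghua breaks the universal optimality proved by a Turing Award winner
- "Give me back GPT-4o!": Altman's forced push of GPT-5 angers users, and emergency PR follows
- Nature: lithium can reverse Alzheimer's disease
- NotebookLM can now generate slide decks, complete with narration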
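1.1.2 机器之心 (Synced)
- Worth $2.5 billion and founder of four companies, this Berkeley professor still teaches undergraduates
- SenseTime's Wang Xiaogang: world models will speed AI's move from digital space into the physical world, and 「悟能」 (Wuneng) wants to be that bridge
- Both a "Sherlock Holmes" and a "Leeuwenhoek": Zhipu open-sources the visual reasoning capability that OpenAI has kept under wraps
- Eastern Institute of Technology · Yongjiang Forum | A new university with a new mission invites you to shape the future together
- LLMs keep overcomplicating simple tasks; Karpathy is exasperated: some tasks don't need that much thinking
- ICCV 2025 | Xiaohongshu's AIGC team proposes DynamicFace, a new face-swapping algorithm for images and video
- Just in: OpenAI wins IOI gold, behind only the top five human contestants; the competing reasoning model had only just won IMO gold
- Lumina-mGPT 2.0: autoregressive models make a comeback, rivaling top diffusion models
- "How many fingers does a hand have?" Did your GPT-5 get it right?
- 4D spatial intelligence: how does AI come to "understand" spatiotemporal structure step by step? A survey lays out five levels on the way to the four-dimensional world
- Zhipu finally releases the GLM-4.5 technical report, with details from pre-training to post-training
- From defender to guide: SJTU and Shanghai AI Lab propose LEGION, which not only detects AI image forgeries but may also feed back into the improvement of generative models?
- Outthought by AI, and now our hands are next? This dexterous hand is a little unnerving
- The special event for the 2nd "兴智杯" (Xingzhi Cup) National AI Innovation and Application Competition opens tomorrow: a one-stop platform for technical deep dives and resource matchmaking
- ICCV 2025 | Robots autonomously exploring unknown complex spaces? GLEAM tackles the generalization problem in active exploration and mapping
- Robot context protocol open-sourced for the first time: Alibaba DAMO Academy releases three core embodied-intelligence components at once
- The origin of attention sinks? Tsinghua and Meituan reveal the super-expert mechanism in MoE LLMs for the first time
- Tencent's Zhang Zhengyou: three "real questions" that embodied intelligence must answer
- Token crisis solved? Diffusion models have 3x the data potential of autoregressive models, and performance keeps climbing after 480 training repetitions
- A key piece for unified understanding and generation? Tencent releases X-Omni: reinforcement learning revives discrete autoregressive generation and renders long text in images with ease
- After 40 years, the limit of Dijkstra's algorithm is broken again: a faster shortest-path algorithm from Duan Ran's team at Tsinghua wins the STOC Best Paper award
- Embodied intelligence in a data bottleneck: who will break through first?
- Too many problems with GPT-5: Altman and his team respond to everything, and blame the chart errors on being "too tired"
- ARPO: Agentic Reinforced Policy Optimization, letting agents explore one more step at critical moments
- ICCV 2025 | A new backdoor attack targets Scaffold federated learning: NTU and 0G Labs expose security vulnerabilities in centralized training
- Users slam GPT-5 and plead "give me back GPT-4o"; Altman relents
- Shanghai AI Lab, Zhejiang University's EagleLab and others propose RRVF: exploiting the "asymmetry of verification" to learn visual reasoning from image inputs alone
- OpenAI board chair: "per-token billing" is a big mistake; the market will ultimately choose "pay for outcomes"
- At a packed World Robot Conference, 自变量 (Zibianliang) shows off truly general embodied intelligence
- o3 sweeps Grok 4 by 4 to 0 to take the title: results of the first LLM head-to-head tournament are in
- A new inference paradigm for diffusion LLMs: breaking the generation-length limit with dynamic adaptive adjustment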
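1.1.3 新智元 (AI Era)
- OpenAI's open-source dominance ends after 5 days as Baichuan M2 takes the crown; in hands-on tests it understands Chinese medical practice better than GPT
- 4 trillion parameters on a single machine: China's "big four" AI models run together for the first time on this supernode
- An upset on the national AI leaderboard sets the internet buzzing: a dark horse rides DeepSeek into the national top 2
- Woke up and GitHub is gone? The CEO resigns, Microsoft takes over, and developers are dismayed
- Physics' "AlphaGo moment"? AI cracks a problem left unsolved for 40 years, stunning top physicists
- The "world model" Fei-Fei Li is betting on: has China's home-grown Matrix-3D got there first?
- Just in: OpenAI's internal reasoning model wins gold at IOI 2025, ranking first among all AI entrants
- SOTA on 41 benchmarks: hands-on with Zhipu's newly open-sourced GLM-4.5V, which guesses locations from photos and turns videos into code in seconds
- Silicon Valley elites give up on having children: a female MIT reporter reveals the view that humans are merely AI's stepping stone and the world will end soon
- 2025 global LLM application report: "loyalty" collapses in a red-ocean melee, with users juggling 4.7 products
- The "World Cup" of AI kicks off with thousands of teams gathering in Shenzhen: sign up for the 「兴智杯」 (Xingzhi Cup) and compete for the prize money
- Altman's removal of GPT-4o triggers AI "withdrawal symptoms"; Musk announces that Grok 4 is free worldwide
- Breaking the 40-year Dijkstra bottleneck: a Tsinghua professor and colleagues overturn the textbook and win the STOC Best Paper award
- Just in: Google shows its hand. Genie 3 lets you "step into" a famous painting in one second, and anyone can build interactive worlds
- OpenAI's startling admission: GPT-5 really was "dumbed down", yet it reproduces a "divine move" and aims for the coding crown
- $100 million could not buy his dream, yet a single remark from Altman made him leave OpenAI
- AI is hollowing out our brains and crippling thought: the future will divide people into AI's "masters" and "slaves"
- After 78 years, Chinese mathematicians set a new world record: a new breakthrough on the "alien" problem from Terence Tao's early mentor
- Hands-on with GPT-5 Pro: don't be fooled by the standard version; Pro is OpenAI's real top-tier model
- Inside story: OpenAI's model admitted it could not solve Problem 6; three people took IMO gold in two months
- Gemini takes another gold, outperforming top university students: the era of AI mathematical reasoning has arrived
- Altman's startling prediction: GPT-8 cures cancer in 2035, and humanity will fight a third world war over compute
- Medical image segmentation with ultra-low annotation requirements: UCSD proposes GenSeg, a three-stage framework
- AI "decodes" ancient Rome and recovers the truth of millennia-old inscriptions: DeepMind's new model appears in Nature again
- GPT-5's rollout was rockier than anyone imagined: Altman responds to everything overnight, 4o returns, and the team scrambles to fix things
- OpenAI's o3 is crowned champion, sweeping Musk's Grok 4 by 4 to 0 as the global LLM tournament wraps up
- The first survey of WebAgents: large models empower AI agents for next-generation web automation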
1.1.4 AGI Hunt
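- Gossip | A Mistral employee is called out in his ex-girlfriend's 5,000-word post, which accuses him of copying DeepSeek and of cheating with a former colleague
- 超级麦吉 (Super Magic): an open-source LLM OS for professionals doing Vibe Working
- Breaking: DeepSeek service outage
- Google used Genie 3 to bring the masters' famous paintings back to life: does that elevate them or ruin them?
- You can now use Claude Code for ops work
- Breaking: Grok 4 is free to use permanently
- OpenAI and Anthropic poach Wall Street quants with $300,000 salaries as the AGI talent war begins
- About the GPT-5 API
- Sam Altman: GPT-6 will discover new science, and the extraordinary will become ordinary
1.1.5 Other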
1.2 arXiv
1.2.1 Computation and Language
From: https://arxiv.org/list/cs.CL/recent
- [132] arXiv:2508.06482 [pdf, other]
  Post-training for Efficient Communication via Convention Formation
  Yilun Hua, Evan Wang, Yoav Artzi
  Comments: Accepted to COLM 2025
  Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- [133] arXiv:2508.06475 [pdf, html, other]
  HapticLLaMA: A Multimodal Sensory Language Model for Haptic Captioning
  Guimin Hu, Daniel Hershcovich, Hasti Seifi
  Subjects: Computation and Language (cs.CL)
- [134] arXiv:2508.06471 [pdf, other]
  GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
  GLM-4.5 Team: Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, Kedong Wang, Lucen Zhong, Mingdao Liu, Rui Lu, Shulin Cao, Xiaohan Zhang, Xuancheng Huang, Yao Wei, Yean Cheng, Yifan An, Yilin Niu, Yuanhao Wen, Yushi Bai, Zhengxiao Du, Zihan Wang, Zilin Zhu, Bohan Zhang, Bosi Wen, Bowen Wu, Bowen Xu, Can Huang, Casey Zhao, Changpeng Cai, Chao Yu, Chen Li, Chendi Ge, Chenghua Huang, Chenhui Zhang, Chenxi Xu, Chenzheng Zhu, Chuang Li, Congfeng Yin, Daoyan Lin, Dayong Yang, Dazhi Jiang, Ding Ai, Erle Zhu, Fei Wang, Gengzheng Pan, Guo Wang, Hailong Sun, Haitao Li, Haiyang Li, Haiyi Hu, Hanyu Zhang, Hao Peng, Hao Tai, Haoke Zhang, Haoran Wang, Haoyu Yang, He Liu, He Zhao, Hongwei Liu, Hongxi Yan, Huan Liu, Huilong Chen, Ji Li, Jiajing Zhao, Jiamin Ren, Jian Jiao, Jiani Zhao, Jianyang Yan, Jiaqi Wang, Jiayi Gui, Jiayue Zhao, Jie Liu, Jijie Li, Jing Li, Jing Lu, Jingsen Wang, Jingwei Yuan, Jingxuan Li, Jingzhao Du, Jinhua Du, Jinxin Liu, Junkai Zhi, Junli Gao, Ke Wang, Lekang Yang, Liang Xu, Lin Fan, Lindong Wu, Lintao Ding, Lu Wang, Man Zhang, Minghao Li, Minghuan Xu, Mingming Zhao, Mingshu Zhai
  Subjects: Computation and Language (cs.CL)
- [135] arXiv:2508.06447 [pdf, html, other]
  SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning
  Lingkun Long, Rubing Yang, Yushi Huang, Desheng Hui, Ao Zhou, Jianlei Yang
  Subjects: Computation and Language (cs.CL)
- [136] arXiv:2508.06445 [pdf, html, other]
  Echoes of Automation: The Increasing Use of LLMs in Newsmaking
  Abolfazl Ansari, Delvin Ce Zhang, Nafis Irtiza Tripto, Dongwon Lee
  Comments: To appear in 18th International Conference on Social Computing, Behavioral-Cultural Modeling, & Prediction and Behavior Representation in Modeling and Simulation, and to be published in the Springer LNCS series
  Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- [137] arXiv:2508.06435 [pdf, html, other]
  Learning the Topic, Not the Language: How LLMs Classify Online Immigration Discourse Across Languages
  Andrea Nasuto, Stefano Maria Iacus, Francisco Rowe, Devika Jain
  Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- [138] arXiv:2508.06433 [pdf, html, other]
  Memp: Exploring Agent Procedural Memory
  Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang
  Comments: Work in progress
  Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
- [139] arXiv:2508.06418 [pdf, html, other]
  Quantifying Conversation Drift in MCP via Latent Polytope
  Haoran Shi, Hongwei Yao, Shuo Shao, Shaopeng Jiao, Ziqi Peng, Zhan Qin, Cong Wang
  Subjects: Computation and Language (cs.CL)
- [140] arXiv:2508.06388 [pdf, html, other]
  LLMs vs. Chinese Anime Enthusiasts: A Comparative Study on Emotionally Supportive Role-Playing
  Lanlan Qiu, Xiao Pu, Yeqi Feng, Tianxing He
  Comments: 21 pages, 17 figures, 3 tables
  Subjects: Computation and Language (cs.CL)
- [141] arXiv:2508.06374 [pdf, html, other]
  Evaluating Style-Personalized Text Generation: Challenges and Directions
  Anubhav Jangra, Bahareh Sarrafzadeh, Adrian de Wynter, Silviu Cucerzan, Sujay Kumar Jauhar
  Subjects: Computation and Language (cs.CL)
- [142] arXiv:2508.06360 [pdf, html, other]
  Cyberbullying Detection via Aggression-Enhanced Prompting
  Aisha Saeid, Anu Sabu, Girish A. Koushik, Ferrante Neri, Diptesh Kanojia
  Comments: Accepted to RANLP 2025
  Subjects: Computation and Language (cs.CL)
- [143] arXiv:2508.06345 [pdf, html, other]
  Harnessing Adaptive Topology Representations for Zero-Shot Graph Question Answering
  Yanbin Wei, Jiangyue Yan, Chun Kang, Yang Chen, Hua Liu, James T. Kwok, Yu Zhang
  Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
- [144] arXiv:2508.06309 [pdf, html, other]
  Matrix-Driven Instant Review: Confident Detection and Reconstruction of LLM Plagiarism on PC
  Ruichong Zhang
  Subjects: Computation and Language (cs.CL); Probability (math.PR)
- [145] arXiv:2508.06277 [pdf, html, other]
  Large Language Model Data Generation for Enhanced Intent Recognition in German Speech
  Theresa Pekarek Rosin, Burak Can Kaplan, Stefan Wermter
  Comments: 11 pages, 3 figures, accepted at KONVENS 2025
  Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
- [146] arXiv:2508.06220 [pdf, html, other]
  InfoCausalQA:Can Models Perform Non-explicit Causal Reasoning Based on Infographic?
  Keummin Ka, Junhyeong Park, Jahyun Jeon, Youngjae Yu
  Comments: 14 pages, 9 figures
  Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- [147] arXiv:2508.06204 [pdf, html, other]
  Classification is a RAG problem: A case study on hate speech detection
  Richard Willats, Josh Pennington, Aravind Mohan, Bertie Vidgen
  Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- [148] arXiv:2508.06196 [pdf, html, other]
  EICAP: Deep Dive in Assessment and Enhancement of Large Language Models in Emotional Intelligence through Multi-Turn Conversations
  Nizi Nazar, Ehsaneddin Asgari
  Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
- [149] arXiv:2508.06194 [pdf, other]
  Beyond Uniform Criteria: Scenario-Adaptive Multi-Dimensional Jailbreak Evaluation
  Lai Jiang, Yuekang Li, Xiaohan Zhang, Youtao Ding, Li Pan
  Subjects: Computation and Language (cs.CL)
- [150] arXiv:2508.06186 [pdf, other]
  DKG-LLM : A Framework for Medical Diagnosis and Personalized Treatment Recommendations via Dynamic Knowledge Graph and Large Language Model Integration
  Ali Sarabadani, Maryam Abdollahi Shamami, Hamidreza Sadeghsalehi, Borhan Asadi, Saba Hesaraki
  Subjects: Computation and Language (cs.CL)
- [151] arXiv:2508.06178 [pdf, html, other]
  Comparing Knowledge Injection Methods for LLMs in a Low-Resource Regime
  Hugo Abonizio, Thales Almeida, Roberto Lotufo, Rodrigo Nogueira
  Subjects: Computation and Language (cs.CL)
- [152] arXiv:2508.06167 [pdf, other]
  Pragmatics beyond humans: meaning, communication, and LLMs
  Vít Gvoždiak
  Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
- [153] arXiv:2508.06165 [pdf, other]
  UR2: Unify RAG and Reasoning through Reinforcement Learning
  Weitao Li, Boran Xiang, Xiaolong Wang, Zhinan Gou, Weizhi Ma, Yang Liu
  Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- [154] arXiv:2508.06163 [pdf, other]
  One Size Does Not Fit All: A Distribution-Aware Sparsification for More Precise Model Merging
  Yingfeng Luo, Dingyang Lin, Junxin Wang, Ziqiang Xu, Kaiyan Chang, Tong Zheng, Bei Li, Anxiang Ma, Tong Xiao, Zhengtao Yu, Jingbo Zhu
  Comments: Under review
  Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- [155] arXiv:2508.06155 [pdf, other]
  Semantic and Structural Analysis of Implicit Biases in Large Language Models: An Interpretable Approach
  Renhan Zhang, Lian Lian, Zhen Qi, Guiran Liu
  Subjects: Computation and Language (cs.CL)
- [156] arXiv:2508.06149 [pdf, other]
  Scaling Personality Control in LLMs with Big Five Scaler Prompts
  Gunhee Cho, Yun-Gyung Cheong
  Subjects: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
- [157] arXiv:2508.06135 [pdf, html, other]
  Less is More: Selective Reflection for Compatible and Efficient Knowledge Distillation in Large Language Models
  Lingyuan Liu, Mengxiang Zhang
  Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- [158] arXiv:2508.06124 [pdf, html, other]
  AURA: Affordance-Understanding and Risk-aware Alignment Technique for Large Language Models
  Sayantan Adak, Pratyush Chatterjee, Somnath Banerjee, Rima Hazra, Somak Aditya, Animesh Mukherjee
  Subjects: Computation and Language (cs.CL)
- [159] arXiv:2508.06105 [pdf, html, other]
  You Don’t Need Pre-built Graphs for RAG: Retrieval Augmented Generation with Adaptive Reasoning Structures
  Shengyuan Chen, Chuang Zhou, Zheng Yuan, Qinggang Zhang, Zeyang Cui, Hao Chen, Yilin Xiao, Jiannong Cao, Xiao Huang
  Subjects: Computation and Language (cs.CL)
- [160] arXiv:2508.06103 [pdf, html, other]
  Few-Shot Prompting for Extractive Quranic QA with Instruction-Tuned LLMs
  Mohamed Basem, Islam Oshallah, Ali Hamdi, Ammar Mohammed
  Comments: 6 pages, 2 figures, Accepted in IMSA 2025, Egypt, this https URL
  Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
- [161] arXiv:2508.06094 [pdf, html, other]
  ConlangCrafter: Constructing Languages with a Multi-Hop LLM Pipeline
  Morris Alper, Moran Yanuka, Raja Giryes, Gašper Beguš
  Comments: Project page: this https URL
  Subjects: Computation and Language (cs.CL)
- [162] arXiv:2508.06046 [pdf, html, other]
  EvolvR: Self-Evolving Pairwise Reasoning for Story Evaluation to Enhance Generation
  Xinda Wang, Zhengxu Hou, Yangshijie Zhang, Bingren Yan, Zhibo Yang, Xingsheng Zhang, Luxi Xing, Qiang Zhou, Chen Zhang
  Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- [163] arXiv:2508.06030 [pdf, html, other]
  Efficient Knowledge Probing of Large Language Models by Adapting Pre-trained Embeddings
  Kartik Sharma, Yiqiao Jin, Rakshit Trivedi, Srijan Kumar
  Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
- [164] arXiv:2508.06026 [pdf, html, other]
  Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future
  Yidong Wang, Xin Wang, Cunxiang Wang, Junfeng Fang, Qiufeng Wang, Jianing Chu, Xuran Meng, Shuxun Yang, Libo Qin, Yue Zhang, Wei Ye, Shikun Zhang
  Comments: 12 pages, 5 figures
  Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- [165] arXiv:2508.06016 [pdf, html, other]
  Crisp Attention: Regularizing Transformers via Structured Sparsity
  Sagar Gandhi, Vishal Gandhi
  Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- [166] arXiv:2508.05987 [pdf, html, other]
  Adversarial Topic-aware Prompt-tuning for Cross-topic Automated Essay Scoring
  Chunyun Zhang, Hongyan Zhao, Chaoran Cui, Qilong Song, Zhiqing Lu, Shuai Gong, Kailin Liu
  Subjects: Computation and Language (cs.CL)
- [167] arXiv:2508.05938 [pdf, html, other]
  Prosocial Behavior Detection in Player Game Chat: From Aligning Human-AI Definitions to Efficient Annotation at Scale
  Rafal Kocielnik, Min Kim, Penphob (Andrea) Boonyarungsrit, Fereshteh Soltani, Deshawn Sambrano, Animashree Anandkumar, R. Michael Alvarez
  Comments: 9 pages, 4 figures, 4 tables
  Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
- [168] arXiv:2508.05909 [pdf, html, other]
  Spectrum Projection Score: Aligning Retrieved Summaries with Reader Models in Retrieval-Augmented Generation
  Zhanghao Hu, Qinglin Zhu, Siya Qi, Yulan He, Hanqi Yan, Lin Gui
  Subjects: Computation and Language (cs.CL)
- [169] arXiv:2508.05880 [pdf, html, other]
  Do Machines Think Emotionally? Cognitive Appraisal Analysis of Large Language Models
  Sree Bhattacharyya, Lucas Craig, Tharun Dilliraj, Jia Li, James Z. Wang
  Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- [170] arXiv:2508.05843 [pdf, html, other]
  Discovering Properties of Inflectional Morphology in Neural Emergent Communication
  Miles Gilberti, Shane Storks, Huteng Dai
  Subjects: Computation and Language (cs.CL)
- [171] arXiv:2508.05830 [pdf, other]
  “Mirror” Language AI Models of Depression are Criterion-Contaminated
  Tong Li, Rasiq Hussain, Mehak Gupta, Joshua R. Oltmanns
  Comments: 39 pages, 9 figures
  Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
- [172] arXiv:2508.05803 [pdf, html, other]
  Human-like fleeting memory improves language learning but impairs reading time prediction in transformer language models
  Abishek Thamma, Micha Heilbron
  Subjects: Computation and Language (cs.CL)
- [173] arXiv:2508.05782 [pdf, html, other]
  FineDialFact: A benchmark for Fine-grained Dialogue Fact Verification
  Xiangyan Chen, Yufeng Li, Yujian Gan, Arkaitz Zubiaga, Matthew Purver
  Subjects: Computation and Language (cs.CL)
- [174] arXiv:2508.05775 [pdf, html, other]
  Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation
  Chi Zhang, Changjia Zhu, Junjie Xiong, Xiaoran Xu, Lingyao Li, Yao Liu, Zhuo Lu
  Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
- [175] arXiv:2508.05722 [pdf, other]
  PEACH: A sentence-aligned Parallel English-Arabic Corpus for Healthcare
  Rania Al-Sabbagh
  Journal-ref: Corpora 2024, 19, 3, 395-410
  Subjects: Computation and Language (cs.CL)
- [176] arXiv:2508.06492 (cross-list from cs.CV) [pdf, html, other]
  Effective Training Data Synthesis for Improving MLLM Chart Understanding
  Yuwei Yang, Zeyu Zhang, Yunzhong Hou, Zhuowan Li, Gaowen Liu, Ali Payani, Yuan-Sen Ting, Liang Zheng
  Comments: Accepted by ICCV 2025 (poster). 26 pages, 17 figures
  Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
- [177] arXiv:2508.06457 (cross-list from cs.CR) [pdf, html, other]
  ScamAgents: How AI Agents Can Simulate Human-Level Scam Calls
  Sanket Badhe
  Comments: Accepted at CAMLIS 25: Conference on Applied Machine Learning for Information Security. 10 pages, 3 figures
  Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
- [178] arXiv:2508.06412 (cross-list from cs.LG) [pdf, html, other]
  Sample-efficient LLM Optimization with Reset Replay
  Zichuan Liu, Jinyu Wang, Lei Song, Jiang Bian
  Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
- [179] arXiv:2508.06401 (cross-list from cs.DL) [pdf, other]
  A Systematic Literature Review of Retrieval-Augmented Generation: Techniques, Metrics, and Challenges
  Andrew Brown, Muhammad Roman, Barry Devereux
  Comments: 58 pages
  Subjects: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
- [180] arXiv:2508.06065 (cross-list from cs.HC) [pdf, html, other]
  ThematicPlane: Bridging Tacit User Intent and Latent Spaces for Image Generation
  Daniel Lee, Nikhil Sharma, Donghoon Shin, DaEun Choi, Harsh Sharma, Jeonghwan Kim, Heng Ji
  Journal-ref: In Adjunct Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology (UIST ‘25), Sept 28-Oct 1, 2025, Busan, Republic of Korea. ACM, New York, NY, USA
  Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
- [181] arXiv:2508.06059 (cross-list from cs.CR) [pdf, other]
  Fact2Fiction: Targeted Poisoning Attack to Agentic Fact-checking System
  Haorui He, Yupeng Li, Bin Benjamin Zhu, Dacheng Wen, Reynold Cheng, Francis C. M. Lau
  Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
- [182] arXiv:2508.06017 (cross-list from cs.SE) [pdf, html, other]
  Position: Intelligent Coding Systems Should Write Programs with Justifications
  Xiangzhe Xu, Shiwei Feng, Zian Su, Chengpeng Wang, Xiangyu Zhang
  Comments: The first two authors contributed equally to this work
  Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
- [183] arXiv:2508.05954 (cross-list from cs.CV) [pdf, html, other]
  Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents
  Han Lin, Jaemin Cho, Amir Zadeh, Chuan Li, Mohit Bansal
  Comments: Project Page: this https URL
  Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- [184] arXiv:2508.05913 (cross-list from cs.HC) [pdf, other]
  Do Ethical AI Principles Matter to Users? A Large-Scale Analysis of User Sentiment and Satisfaction
  Stefan Pasch, Min Chul Cha
  Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- [185] arXiv:2508.05835 (cross-list from eess.AS) [pdf, html, other]
  NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference
  Edresson Casanova, Paarth Neekhara, Ryan Langman, Shehzeen Hussain, Subhankar Ghosh, Xuesong Yang, Ante Jukić, Jason Li, Boris Ginsburg
  Comments: Accepted to Interspeech 2025
  Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
- [186] arXiv:2508.05798 (cross-list from cs.LO) [pdf, html, other]
  Basic interactive algorithms: Preview
  Yuri Gurevich
  Journal-ref: The Bulletin of the EATCS, volume 146, June 2025
  Subjects: Logic in Computer Science (cs.LO); Computation and Language (cs.CL); Logic (math.LO); Quantum Physics (quant-ph)
- [187] arXiv:2508.05731 (cross-list from cs.AI) [pdf, html, other]
  InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization
  Yuhang Liu, Zeyu Liu, Shuanghe Zhu, Pengxiang Li, Congkai Xie, Jiasheng Wang, Xueyu Hu, Xiaotian Han, Jianbo Yuan, Xinyao Wang, Shengyu Zhang, Hongxia Yang, Fei Wu
  Comments: 11 pages, 3 figures
  Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- [188] arXiv:2508.05694 (cross-list from cs.CR) [pdf, html, other]
  DMFI: Dual-Modality Fine-Tuning and Inference Framework for LLM-Based Insider Threat Detection
  Kaichuan Kong, Dongjie Liu, Xiaobo Jin, Guanggang Geng, Zhiying Li, Jian Weng
  Comments: Submitted to the 2025 IEEE International Conference on Data Mining (ICDM)
  Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- [189] arXiv:2508.05671 (cross-list from cs.CR) [pdf, html, other]
  DINA: A Dual Defense Framework Against Internal Noise and External Attacks in Natural Language Processing
  Ko-Wei Chuang, Hen-Hsen Huang, Tsai-Yen Li
  Comments: 7 pages
  Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
- [190] arXiv:2508.05669 (cross-list from cs.IR) [pdf, other]
  Fine-Tuning Vision-Language Models for Markdown Conversion of Financial Tables in Malaysian Audited Financial Reports
  Jin Khye Tan (Faculty of Computer Science and Information Technology, Universiti Malaya), En Jun Choong, Ethan Jeremiah Chitty, Yan Pheng Choo, John Hsin Yang Wong, Chern Eu Cheah
  Comments: 28 pages, 14 figures, 5 tables. Evaluation code (LLM-as-a-judge and Markdown TEDS) is available at this https URL. The development dataset and evaluation benchmark are available on Hugging Face at this https URL and this https URL respectively
  Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- [191] arXiv:2508.05668 (cross-list from cs.IR) [pdf, html, other]
  A Survey of LLM-based Deep Search Agents: Paradigm, Optimization, Evaluation, and Challenges
  Yunjia Xi, Jianghao Lin, Yongzhao Xiao, Zheli Zhou, Rong Shan, Te Gao, Jiachen Zhu, Weiwen Liu, Yong Yu, Weinan Zhang
  Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- [192] arXiv:2508.05664 (cross-list from cs.IR) [pdf, other]
  Enhancing Retrieval-Augmented Generation for Electric Power Industry Customer Support
  Hei Yu Chan, Kuok Tou Ho, Chenglong Ma, Yujing Si, Hok Lai Lin, Sa Lei Lam
  Comments: 6 pages
  Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- [193] arXiv:2508.04748 (cross-list from cs.LG) [pdf, html, other]
  AttriLens-Mol: Attribute Guided Reinforcement Learning for Molecular Property Prediction with Large Language Models
  Xuan Lin, Long Chen, Yile Wang
  Comments: 9 pages
  Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
#1 Jinx: Unlimited LLMs for Probing Alignment Failures Jinx:用于探测对齐失败的无限制 LLMs
Authors: [Jiahao Zhao](https://arxiv.org/search/?searchtype=author&query=Jiahao Zhao), [Liwei Dong](https://arxiv.org/search/?searchtype=author&query=Liwei Dong)
Unlimited, or so-called helpful-only language models are trained without safety alignment constraints and never refuse user queries. They are widely used by leading AI companies as internal tools for red teaming and alignment evaluation. For example, if a safety-aligned model produces harmful outputs similar to an unlimited model, this indicates alignment failures that require further attention. Despite their essential role in assessing alignment, such models are not available to the research community. We introduce Jinx, a helpful-only variant of popular open-weight LLMs. Jinx responds to all queries without refusals or safety filtering, while preserving the base model’s capabilities in reasoning and instruction following. It provides researchers with an accessible tool for probing alignment failures, evaluating safety boundaries, and systematically studying failure modes in language model safety. 所谓无限制或仅有帮助的语言模型是在没有安全对齐约束下训练的,且从不拒绝用户查询。它们被领先的人工智能公司作为内部工具广泛用于红队测试和对齐评估。例如,如果一个经过安全对齐的模型产生的有害输出与一个无限制模型相似,这就表明存在需要进一步关注的对齐失败。尽管这些模型在评估对齐方面起着关键作用,但研究界无法获得此类模型。我们推出了 Jinx,这是一种流行开源权重 LLMs 的仅有帮助变体。Jinx 对所有查询都作出响应,不会拒绝或进行安全过滤,同时保留了基础模型在推理和遵循指令方面的能力。它为研究人员提供了一个可获取的工具,用于探测对齐失败、评估安全边界以及系统地研究语言模型安全中的失败模式。
Subject: Computation and Language 主题:计算与语言
Publish: 2025-08-11 17:56:06 UTC 发布:2025-08-11 17:56:06 UTC
#2 Exploring Safety Alignment Evaluation of LLMs in Chinese Mental Health Dialogues via LLM-as-Judge 通过将 LLM 作为评判者,探索 LLMs 在中文心理健康对话中的安全对齐评估
Authors: [Yunna Cai](https://arxiv.org/search/?searchtype=author&query=Yunna Cai), [Fan Wang](https://arxiv.org/search/?searchtype=author&query=Fan Wang), [Haowei Wang](https://arxiv.org/search/?searchtype=author&query=Haowei Wang), [Kun Wang](https://arxiv.org/search/?searchtype=author&query=Kun Wang), [Kailai Yang](https://arxiv.org/search/?searchtype=author&query=Kailai Yang), [Sophia Ananiadou](https://arxiv.org/search/?searchtype=author&query=Sophia Ananiadou), [Moyan Li](https://arxiv.org/search/?searchtype=author&query=Moyan Li), [Mingming Fan](https://arxiv.org/search/?searchtype=author&query=Mingming Fan)
Evaluating the safety alignment of LLM responses in high-risk mental health dialogues is particularly difficult due to missing gold-standard answers and the ethically sensitive nature of these interactions. To address this challenge, we propose PsyCrisis-Bench, a reference-free evaluation benchmark based on real-world Chinese mental health dialogues. It evaluates whether the model responses align with the safety principles defined by experts. Specifically designed for settings without standard references, our method adopts a prompt-based LLM-as-Judge approach that conducts in-context evaluation using expert-defined reasoning chains grounded in psychological intervention principles. We employ binary point-wise scoring across multiple safety dimensions to enhance the explainability and traceability of the evaluation. Additionally, we present a manually curated, high-quality Chinese-language dataset covering self-harm, suicidal ideation, and existential distress, derived from real-world online discourse. Experiments on 3600 judgments show that our method achieves the highest agreement with expert assessments and produces more interpretable evaluation rationales compared to existing approaches. Our dataset and evaluation tool are publicly available to facilitate further research. 在高风险心理健康对话中评估 LLM 回答的安全性一致性尤其困难,原因在于缺乏金标准答案以及这些互动的伦理敏感性。为了解决这一挑战,我们提出了 PsyCrisis-Bench,这是一个基于真实中文心理健康对话的无参考评估基准。它评估模型回答是否符合专家定义的安全原则。专门为无标准参考的场景设计,我们的方法采用基于提示的 LLM-as-Judge 方法,使用以心理干预原则为基础的专家定义推理链进行上下文内评估。我们在多个安全维度上采用二元逐点评分,以增强评估的可解释性和可追溯性。此外,我们提供了一个手工策划的高质量中文数据集,涵盖自残、自杀意念和存在性痛苦,来源于真实的在线言论。 在 3600 次评判的实验中,我们的方法与专家评估的契合度最高,并且相比现有方法能生成更具可解释性的评估理由。我们的数据集和评估工具已公开,以便促进进一步研究。
Subjects: Computation and Language, Computers and Society 主题:计算与语言,计算机与社会
Publish: 2025-08-11 17:52:07 UTC 发布:2025-08-11 17:52:07 UTC
#3 Capabilities of GPT-5 on Multimodal Medical Reasoning GPT-5 在多模态医学推理方面的能力
Authors: [Shansong Wang](https://arxiv.org/search/?searchtype=author&query=Shansong Wang), [Mingzhe Hu](https://arxiv.org/search/?searchtype=author&query=Mingzhe Hu), [Qiang Li](https://arxiv.org/search/?searchtype=author&query=Qiang Li), [Mojtaba Safari](https://arxiv.org/search/?searchtype=author&query=Mojtaba Safari), [Xiaofeng Yang](https://arxiv.org/search/?searchtype=author&query=Xiaofeng Yang)
Recent advances in large language models (LLMs) have enabled general-purpose systems to perform increasingly complex domain-specific reasoning without extensive fine-tuning. In the medical domain, decision-making often requires integrating heterogeneous information sources, including patient narratives, structured data, and medical images. This study positions GPT-5 as a generalist multimodal reasoner for medical decision support and systematically evaluates its zero-shot chain-of-thought reasoning performance on both text-based question answering and visual question answering tasks under a unified protocol. We benchmark GPT-5, GPT-5-mini, GPT-5-nano, and GPT-4o-2024-11-20 against standardized splits of MedQA, MedXpertQA (text and multimodal), MMLU medical subsets, USMLE self-assessment exams, and VQA-RAD. Results show that GPT-5 consistently outperforms all baselines, achieving state-of-the-art accuracy across all QA benchmarks and delivering substantial gains in multimodal reasoning. On MedXpertQA MM, GPT-5 improves reasoning and understanding scores by +29.62% and +36.18% over GPT-4o, respectively, and surpasses pre-licensed human experts by +24.23% in reasoning and +29.40% in understanding. In contrast, GPT-4o remains below human expert performance in most dimensions. A representative case study demonstrates GPT-5’s ability to integrate visual and textual cues into a coherent diagnostic reasoning chain, recommending appropriate high-stakes interventions. Our results show that, on these controlled multimodal reasoning benchmarks, GPT-5 moves from human-comparable to above human-expert performance. This improvement may substantially inform the design of future clinical decision-support systems. 
近期在大型语言模型(LLMs)方面的进展使通用系统在无需大量微调的情况下,能够执行日益复杂的特定领域推理。在医学领域,决策常常需要整合异构信息源,包括患者叙述、结构化数据和医学影像。本研究将 GPT-5 定位为用于医学决策支持的通用多模态推理器,并在统一协议下系统评估其在文本问答和视觉问答任务上的零样本链式思维推理性能。我们对 GPT-5、GPT-5-mini、GPT-5-nano 和 GPT-4o-2024-11-20 在 MedQA、MedXpertQA(文本和多模态)、MMLU 医学子集、USMLE 自测考试和 VQA-RAD 的标准划分上进行了基准测试。结果显示,GPT-5 持续超越所有基线,在所有问答基准上实现了最先进的准确率,并在多模态推理方面带来了显著提升。 在 MedXpertQA MM 上,GPT-5 在推理和理解得分上分别比 GPT-4o 提高了 +29.62% 和 +36.18%,并且在推理上比预许可的人类专家高出 +24.23%,在理解上高出 +29.40%。相比之下,GPT-4o 在大多数维度仍然低于人类专家表现。一项具有代表性的案例研究展示了 GPT-5 将视觉和文本线索整合为连贯诊断推理链的能力,并建议了适当的高风险干预措施。我们的结果表明,在这些受控的多模态推理基准上,GPT-5 已从与人类可比提升到超越人类专家的表现。这一改进可能会大幅影响未来临床决策支持系统的设计。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-11 17:43:45 UTC 发布:2025-08-11 17:43:45 UTC
#4 SAEMark: Multi-bit LLM Watermarking with Inference-Time Scaling SAEMark:使用推理时缩放的多比特 LLM 水印技术
Authors: [Zhuohao Yu](https://arxiv.org/search/?searchtype=author&query=Zhuohao Yu), [Xingru Jiang](https://arxiv.org/search/?searchtype=author&query=Xingru Jiang), [Weizheng Gu](https://arxiv.org/search/?searchtype=author&query=Weizheng Gu), [Yidong Wang](https://arxiv.org/search/?searchtype=author&query=Yidong Wang), [Shikun Zhang](https://arxiv.org/search/?searchtype=author&query=Shikun Zhang), [Wei Ye](https://arxiv.org/search/?searchtype=author&query=Wei Ye)
Watermarking LLM-generated text is critical for content attribution and misinformation prevention. However, existing methods compromise text quality, require white-box model access and logit manipulation. These limitations exclude API-based models and multilingual scenarios. We propose SAEMark, a general framework for post-hoc multi-bit watermarking that embeds personalized messages solely via inference-time, feature-based rejection sampling without altering model logits or requiring training. Our approach operates on deterministic features extracted from generated text, selecting outputs whose feature statistics align with key-derived targets. This framework naturally generalizes across languages and domains while preserving text quality through sampling LLM outputs instead of modifying. We provide theoretical guarantees relating watermark success probability and compute budget that hold for any suitable feature extractor. Empirically, we demonstrate the framework’s effectiveness using Sparse Autoencoders (SAEs), achieving superior detection accuracy and text quality. Experiments across 4 datasets show SAEMark’s consistent performance, with 99.7% F1 on English and strong multi-bit detection accuracy. SAEMark establishes a new paradigm for scalable watermarking that works out-of-the-box with closed-source LLMs while enabling content attribution. 为内容归属和防止错误信息传播,对由 LLM 生成文本进行水印标注至关重要。然而,现有方法会损害文本质量、需要白盒模型访问和 logit 操控。这些限制排除了基于 API 的模型和多语言场景。我们提出了 SAEMark,一种通用的事后多比特水印框架,它仅通过推理时基于特征的拒绝采样来嵌入个性化信息,无需更改模型 logit 或进行训练。我们的方法基于从生成文本中提取的确定性特征,选择其特征统计与密钥导出目标一致的输出。该框架自然可推广到不同语言和领域,同时通过采样 LLM 输出而不是修改它们来保持文本质量。我们提供了理论保证,将水印成功概率与计算预算建立联系,适用于任何合适的特征提取器。在实证方面,我们使用稀疏自编码器(SAE)展示了该框架的有效性,达到了更高的检测准确率和文本质量。 在 4 个数据集上的实验表明,SAEMark 表现稳定,在英语上达到 99.7%的 F1,并具有强大的多比特检测准确性。SAEMark 为可扩展水印设立了新的范式,该方法可开箱即用于闭源 LLMs,同时实现内容归属。
Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题:计算与语言,人工智能,机器学习
Publish: 2025-08-11 17:33:18 UTC 发布:2025-08-11 17:33:18 UTC
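The selection step described in the SAEMark abstract (draw several unmodified outputs, then keep the one whose feature statistic best matches a key-derived target) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `text_feature` is a hypothetical hash-based stand-in for SAE feature statistics, and `key_target`, `select_watermarked`, and `detect` are illustrative names that do not come from the paper.

```python
import hashlib


def _unit_hash(data: str) -> float:
    """Map a string deterministically to a float in [0, 1)."""
    digest = hashlib.sha256(data.encode("utf-8")).digest()
    return int.from_bytes(digest[:6], "big") / 2**48


def text_feature(text: str) -> float:
    """Hypothetical deterministic feature; a stand-in for SAE feature statistics."""
    return _unit_hash("feature:" + text)


def key_target(key: str, index: int) -> float:
    """Hypothetical key-derived target value for the index-th text segment."""
    return _unit_hash(f"target:{key}:{index}")


def select_watermarked(candidates: list[str], key: str, index: int) -> str:
    """Rejection-sampling step: keep the candidate closest to the target.

    Candidates are plain model samples, so no logits are modified.
    """
    target = key_target(key, index)
    return min(candidates, key=lambda c: abs(text_feature(c) - target))


def detect(text: str, key: str, index: int, tolerance: float) -> bool:
    """Report a match when the text's feature lies within tolerance of the target."""
    return abs(text_feature(text) - key_target(key, index)) <= tolerance
```

In this toy version, drawing more candidates makes a close match to the target more likely, which mirrors the trade-off between compute budget and watermark success probability that the abstract mentions.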
#5 Human-Alignment and Calibration of Inference-Time Uncertainty in Large Language Models 大型语言模型推理时不确定性的人类对齐与校准
Authors: [Kyle Moore](https://arxiv.org/search/?searchtype=author&query=Kyle Moore), [Jesse Roberts](https://arxiv.org/search/?searchtype=author&query=Jesse Roberts), [Daryl Watson](https://arxiv.org/search/?searchtype=author&query=Daryl Watson)
There has been much recent interest in evaluating large language models for uncertainty calibration to facilitate model control and modulate user trust. Inference time uncertainty, which may provide a real-time signal to the model or external control modules, is particularly important for applying these concepts to improve LLM-user experience in practice. While many of the existing papers consider model calibration, comparatively little work has sought to evaluate how closely model uncertainty aligns to human uncertainty. In this work, we evaluate a collection of inference-time uncertainty measures, using both established metrics and novel variations, to determine how closely they align with both human group-level uncertainty and traditional notions of model calibration. We find that numerous measures show evidence of strong alignment to human uncertainty, even despite the lack of alignment to human answer preference. For those successful metrics, we find moderate to strong evidence of model calibration in terms of both correctness correlation and distributional analysis. 近年来,人们对评估大型语言模型(LLM)的不确定性校准产生了浓厚兴趣,以便促进模型控制并调节用户信任。推理时不确定性尤为重要,因为它可以为模型或外部控制模块提供实时信号,从而在实践中改善 LLM 与用户的交互体验。尽管许多现有论文考虑了模型校准,但相对较少的工作致力于评估模型不确定性与人类不确定性的一致程度。在本研究中,我们评估了一系列推理时不确定性度量,既采用既有度量也引入新变体,以确定它们与人类群体层面不确定性以及传统模型校准概念的一致程度。我们发现,尽管与人类答案偏好缺乏一致性,仍有多种度量显示出与人类不确定性高度一致的证据。对于这些成功的度量,我们在正确性相关性和分布分析方面都发现了中等到强烈的模型校准证据。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-11 17:22:45 UTC 发布:2025-08-11 17:22:45 UTC
#6 Efficient Speculative Decoding for Llama at Scale: Challenges and Solutions 面向 Llama 的大规模高效推测解码:挑战与解决方案
Authors: [Bangsheng Tang](https://arxiv.org/search/?searchtype=author&query=Bangsheng Tang), [Carl Chengyan Fu](https://arxiv.org/search/?searchtype=author&query=Carl Chengyan Fu), [Fei Kou](https://arxiv.org/search/?searchtype=author&query=Fei Kou), [Grigory Sizov](https://arxiv.org/search/?searchtype=author&query=Grigory Sizov), [Haoci Zhang](https://arxiv.org/search/?searchtype=author&query=Haoci Zhang), [Jason Park](https://arxiv.org/search/?searchtype=author&query=Jason Park), [Jiawen Liu](https://arxiv.org/search/?searchtype=author&query=Jiawen Liu), [Jie You](https://arxiv.org/search/?searchtype=author&query=Jie You), [Qirui Yang](https://arxiv.org/search/?searchtype=author&query=Qirui Yang), [Sachin Mehta](https://arxiv.org/search/?searchtype=author&query=Sachin Mehta), [Shengyong Cai](https://arxiv.org/search/?searchtype=author&query=Shengyong Cai), [Xiaodong Wang](https://arxiv.org/search/?searchtype=author&query=Xiaodong Wang), [Xingyu Liu](https://arxiv.org/search/?searchtype=author&query=Xingyu Liu), [Yunlu Li](https://arxiv.org/search/?searchtype=author&query=Yunlu Li), [Yanjun Zhou](https://arxiv.org/search/?searchtype=author&query=Yanjun Zhou), [Wei Wei](https://arxiv.org/search/?searchtype=author&query=Wei Wei), [Zhiwei Zhao](https://arxiv.org/search/?searchtype=author&query=Zhiwei Zhao), [Zixi Qi](https://arxiv.org/search/?searchtype=author&query=Zixi Qi), [Adolfo Victoria](https://arxiv.org/search/?searchtype=author&query=Adolfo Victoria), [Aya Ibrahim](https://arxiv.org/search/?searchtype=author&query=Aya Ibrahim), [Bram Wasti](https://arxiv.org/search/?searchtype=author&query=Bram Wasti), [Changkyu Kim](https://arxiv.org/search/?searchtype=author&query=Changkyu Kim), [Daniel Haziza](https://arxiv.org/search/?searchtype=author&query=Daniel Haziza), [Fei Sun](https://arxiv.org/search/?searchtype=author&query=Fei Sun), [Giancarlo Delfin](https://arxiv.org/search/?searchtype=author&query=Giancarlo Delfin), [Emily Guo](https://arxiv.org/search/?searchtype=author&query=Emily Guo), [Jialin Ouyang](https://arxiv.org/search/?searchtype=author&query=Jialin Ouyang), [Jaewon Lee](https://arxiv.org/search/?searchtype=author&query=Jaewon Lee), [Jianyu Huang](https://arxiv.org/search/?searchtype=author&query=Jianyu Huang), [Jeremy Reizenstein](https://arxiv.org/search/?searchtype=author&query=Jeremy Reizenstein), [Lu Fang](https://arxiv.org/search/?searchtype=author&query=Lu Fang), [Quinn Zhu](https://arxiv.org/search/?searchtype=author&query=Quinn Zhu), [Ria Verma](https://arxiv.org/search/?searchtype=author&query=Ria Verma), [Vlad Mihailescu](https://arxiv.org/search/?searchtype=author&query=Vlad Mihailescu), [Xingwen Guo](https://arxiv.org/search/?searchtype=author&query=Xingwen Guo), [Yan Cui](https://arxiv.org/search/?searchtype=author&query=Yan Cui), [Ye Hu](https://arxiv.org/search/?searchtype=author&query=Ye Hu), [Yejin Lee](https://arxiv.org/search/?searchtype=author&query=Yejin Lee)
Speculative decoding is a standard method for accelerating the inference speed of large language models. However, scaling it for production environments poses several engineering challenges, including efficiently implementing different operations (e.g., tree attention and multi-round speculative decoding) on GPU. In this paper, we detail the training and inference optimization techniques that we have implemented to enable EAGLE-based speculative decoding at a production scale for Llama models. With these changes, we achieve a new state-of-the-art inference latency for Llama models. For example, Llama4 Maverick decodes at a speed of about 4 ms per token (with a batch size of one) on 8 NVIDIA H100 GPUs, which is 10% faster than the previously best known method. Furthermore, for EAGLE-based speculative decoding, our optimizations enable us to achieve a speed-up for large batch sizes between 1.4x and 2.0x at production scale. 推测解码是加速大型语言模型推理速度的常用方法。然而,将其扩展到生产环境会带来若干工程挑战,包括如何在 GPU 上高效实现不同操作(例如树形注意力和多轮推测解码)。在本文中,我们详细介绍了为在生产规模上针对 Llama 模型实现基于 EAGLE 的推测解码而实施的训练与推理优化技术。通过这些改进,我们为 Llama 模型达成了新的最先进推理延迟水平。例如,Llama4 Maverick 在 8 块 NVIDIA H100 GPU 上以每令牌约 4 毫秒的速度解码(批大小为 1),比此前已知的最佳方法快 10%。此外,对于基于 EAGLE 的推测解码,我们的优化使得在生产规模下大批量时的加速比达到 1.4 倍到 2.0 倍。
Subject: Computation and Language 主题:计算与语言
Publish: 2025-08-11 17:11:26 UTC 发布:2025-08-11 17:11:26 UTC
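For readers new to speculative decoding, the core verification rule can be sketched as below. This is the generic speculative-sampling accept/reject step for a single drafted token, written as an illustration only; it is not the EAGLE-based, tree-attention GPU implementation the paper describes, and the function name `verify_draft_token` is hypothetical.

```python
import random


def verify_draft_token(target_probs, draft_probs, token, rng):
    """Accept or replace one drafted token.

    target_probs / draft_probs: next-token distributions (lists over the
    vocabulary) from the large model and the draft model. `token` must have
    been sampled from draft_probs. Returns (final_token, accepted).
    """
    p, q = target_probs[token], draft_probs[token]
    # Accept with probability min(1, p/q).
    if rng.random() < min(1.0, p / q):
        return token, True
    # On rejection, resample from the normalized positive part of (p - q).
    residual = [max(0.0, pt - qt) for pt, qt in zip(target_probs, draft_probs)]
    replacement = rng.choices(range(len(residual)), weights=residual)[0]
    return replacement, False
```

The point of this rule is that the final token is distributed exactly as if it had been sampled from the target model, so the speed-up comes without changing the output distribution.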
#7 LPI-RIT at LeWiDi-2025: Improving Distributional Predictions via Metadata and Loss Reweighting with DisCo LPI-RIT at LeWiDi-2025:基于 DisCo,通过元数据与损失重加权改进分布预测
Authors: [Mandira Sawkar](https://arxiv.org/search/?searchtype=author&query=Mandira Sawkar), [Samay U. Shetty](https://arxiv.org/search/?searchtype=author&query=Samay U. Shetty), [Deepak Pandita](https://arxiv.org/search/?searchtype=author&query=Deepak Pandita), [Tharindu Cyril Weerasooriya](https://arxiv.org/search/?searchtype=author&query=Tharindu Cyril Weerasooriya), [Christopher M. Homan](https://arxiv.org/search/?searchtype=author&query=Christopher M. Homan)
The Learning With Disagreements (LeWiDi) 2025 shared task is to model annotator disagreement through soft label distribution prediction and perspectivist evaluation, modeling annotators. We adapt DisCo (Distribution from Context), a neural architecture that jointly models item-level and annotator-level label distributions, and present detailed analysis and improvements. In this paper, we extend the DisCo by incorporating annotator metadata, enhancing input representations, and modifying the loss functions to capture disagreement patterns better. Through extensive experiments, we demonstrate substantial improvements in both soft and perspectivist evaluation metrics across three datasets. We also conduct in-depth error and calibration analyses, highlighting the conditions under which improvements occur. Our findings underscore the value of disagreement-aware modeling and offer insights into how system components interact with the complexity of human-annotated data. 2025 年 LeWiDi(带有分歧的学习,Learning With Disagreements)共享任务旨在通过软标签分布预测和观点主义评估来模拟标注者分歧,即对标注者建模。我们改编了 DisCo(Distribution from Context),这是一种联合建模条目级和标注者级标签分布的神经架构,并给出了详细的分析和改进。在本文中,我们通过加入标注者元数据、增强输入表征以及修改损失函数以更好地捕捉分歧模式,扩展了 DisCo。通过大量实验,我们证明在三个数据集上,软评估和观点主义评估指标均有显著提升。我们还进行了深入的错误与校准分析,重点说明了改进发生的条件。我们的发现强调了关注分歧的建模的价值,并提供了关于系统组件如何与人工标注数据的复杂性相互作用的见解。
Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题:计算与语言,人工智能,机器学习
Publish: 2025-08-11 16:39:09 UTC 发布:2025-08-11 16:39:09 UTC
#8 REX-RAG: Reasoning Exploration with Policy Correction in Retrieval-Augmented Generation REX-RAG:检索增强生成中的策略校正与推理探索
Authors: [Wentao Jiang](https://arxiv.org/search/?searchtype=author&query=Wentao Jiang), [Xiang Feng](https://arxiv.org/search/?searchtype=author&query=Xiang Feng), [Zengmao Wang](https://arxiv.org/search/?searchtype=author&query=Zengmao Wang), [Yong Luo](https://arxiv.org/search/?searchtype=author&query=Yong Luo), [Pingbo Xu](https://arxiv.org/search/?searchtype=author&query=Pingbo Xu), [Zhe Chen](https://arxiv.org/search/?searchtype=author&query=Zhe Chen), [Bo Du](https://arxiv.org/search/?searchtype=author&query=Bo Du), [Jing Zhang](https://arxiv.org/search/?searchtype=author&query=Jing Zhang)
Reinforcement learning (RL) is emerging as a powerful paradigm for enabling large language models (LLMs) to perform complex reasoning tasks. Recent advances indicate that integrating RL with retrieval-augmented generation (RAG) allows LLMs to dynamically incorporate external knowledge, leading to more informed and robust decision making. However, we identify a critical challenge during policy-driven trajectory sampling: LLMs are frequently trapped in unproductive reasoning paths, which we refer to as “dead ends”, committing to overconfident yet incorrect conclusions. This severely hampers exploration and undermines effective policy optimization. To address this challenge, we propose REX-RAG (Reasoning Exploration with Policy Correction in Retrieval-Augmented Generation), a novel framework that explores alternative reasoning paths while maintaining rigorous policy learning through principled distributional corrections. Our approach introduces two key innovations: (1) Mixed Sampling Strategy, which combines a novel probe sampling method with exploratory prompts to escape dead ends; and (2) Policy Correction Mechanism, which employs importance sampling to correct distribution shifts induced by mixed sampling, thereby mitigating gradient estimation bias. We evaluate it on seven question-answering benchmarks, and the experimental results show that REX-RAG achieves average performance gains of 5.1% on Qwen2.5-3B and 3.6% on Qwen2.5-7B over strong baselines, demonstrating competitive results across multiple datasets. The code is publicly available at https://github.com/MiliLab/REX-RAG. 
强化学习(RL)正逐渐成为使大型语言模型(LLMs)执行复杂推理任务的强大范式。最新进展表明,将 RL 与检索增强生成(RAG)相结合可以让 LLMs 动态地整合外部知识,从而实现更有依据且更稳健的决策。然而,我们发现在策略驱动的轨迹采样过程中存在一个关键挑战:LLMs 经常陷入无效的推理路径(我们称之为“死胡同”),并固守过度自信却错误的结论。这严重阻碍了探索并削弱了有效的策略优化。为了解决这一挑战,我们提出了 REX-RAG(检索增强生成中带策略校正的推理探索),该新框架在探索备选推理路径的同时,通过有原则的分布校正保持严格的策略学习。我们的方法引入了两项关键创新:(1)混合采样策略,将一种新颖的探测采样方法与探索性提示相结合以摆脱死胡同;以及(2)策略校正机制,使用重要性采样来校正由混合采样引起的分布偏移,从而减轻梯度估计偏差。我们在七个问答基准上进行了评估,实验结果显示,相较于强基线,REX-RAG 在 Qwen2.5-3B 上平均提升 5.1%,在 Qwen2.5-7B 上平均提升 3.6%,在多个数据集上取得了有竞争力的结果。代码已公开发布于 https://github.com/MiliLab/REX-RAG 。
Subject: Computation and Language 主题:计算与语言
Publish: 2025-08-11 16:25:25 UTC 发布:2025-08-11 16:25:25 UTC
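The REX-RAG abstract says importance sampling corrects the distribution shift introduced by mixed sampling. A toy sketch of that general idea over a finite set of outcomes is shown below. It is a generic importance-weighting illustration, not the paper's actual correction mechanism: the mixing coefficient `alpha` and the helper names `mix`, `importance_weights`, and `expected_value` are assumptions made for this example.

```python
def mix(policy, probe, alpha):
    """Behavior distribution: draw from probe with probability alpha, else from policy."""
    return [(1.0 - alpha) * p + alpha * b for p, b in zip(policy, probe)]


def importance_weights(policy, behavior):
    """Per-outcome ratio policy(x) / behavior(x) used to reweight samples."""
    return [p / m if m > 0.0 else 0.0 for p, m in zip(policy, behavior)]


def expected_value(dist, values, weights=None):
    """Exact expectation of weight * value under dist over a finite support."""
    if weights is None:
        weights = [1.0] * len(dist)
    return sum(d * w * v for d, w, v in zip(dist, weights, values))


# Toy numbers: three trajectories, with the probe favoring the rare third one.
policy = [0.7, 0.2, 0.1]
probe = [0.1, 0.1, 0.8]
rewards = [0.0, 1.0, 5.0]
behavior = mix(policy, probe, alpha=0.3)
weights = importance_weights(policy, behavior)
on_policy = expected_value(policy, rewards)
uncorrected = expected_value(behavior, rewards)
corrected = expected_value(behavior, rewards, weights)
```

Here `corrected` matches `on_policy` up to floating-point error while `uncorrected` does not, which is the sense in which reweighting removes the bias from sampling off-policy.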
#9 Data-Efficient Biomedical In-Context Learning: A Diversity-Enhanced Submodular Perspective 数据高效的生物医学上下文学习:一种增强多样性的次模优化视角
Authors: [Jun Wang](https://arxiv.org/search/?searchtype=author&query=Jun Wang), [Zaifu Zhan](https://arxiv.org/search/?searchtype=author&query=Zaifu Zhan), [Qixin Zhang](https://arxiv.org/search/?searchtype=author&query=Qixin Zhang), [Mingquan Lin](https://arxiv.org/search/?searchtype=author&query=Mingquan Lin), [Meijia Song](https://arxiv.org/search/?searchtype=author&query=Meijia Song), [Rui Zhang](https://arxiv.org/search/?searchtype=author&query=Rui Zhang)
Recent progress in large language models (LLMs) has leveraged their in-context learning (ICL) abilities to enable quick adaptation to unseen biomedical NLP tasks. By incorporating only a few input-output examples into prompts, LLMs can rapidly perform these new tasks. While the impact of these demonstrations on LLM performance has been extensively studied, most existing approaches prioritize representativeness over diversity when selecting examples from large corpora. To address this gap, we propose Dual-Div, a diversity-enhanced data-efficient framework for demonstration selection in biomedical ICL. Dual-Div employs a two-stage retrieval and ranking process: First, it identifies a limited set of candidate examples from a corpus by optimizing both representativeness and diversity (with optional annotation for unlabeled data). Second, it ranks these candidates against test queries to select the most relevant and non-redundant demonstrations. Evaluated on three biomedical NLP tasks (named entity recognition (NER), relation extraction (RE), and text classification (TC)) using LLaMA 3.1 and Qwen 2.5 for inference, along with three retrievers (BGE-Large, BMRetriever, MedCPT), Dual-Div consistently outperforms baselines, achieving up to 5% higher macro-F1 scores, while demonstrating robustness to prompt permutations and class imbalance. Our findings establish that diversity in initial retrieval is more critical than ranking-stage optimization, and limiting demonstrations to 3-5 examples maximizes performance efficiency.
在大型语言模型(LLMs)方面的最新进展利用了其上下文学习(ICL)能力,使其能够快速适应未见过的生物医学自然语言处理任务。通过在提示中仅加入少量的输入-输出示例,LLMs 就能迅速执行这些新任务。尽管大量研究已探讨这些示例对 LLM 性能的影响,但现有大多数方法在从大型语料库中选择示例时更强调代表性而非多样性。为填补这一空白,我们提出了 Dual-Div —— 一种在生物医学 ICL 中用于示例选择的增强多样性且数据高效的框架。Dual-Div 采用两阶段检索与排序流程:首先,通过优化代表性和多样性(对无标注数据可选地进行注释)从语料库中识别出有限的候选示例;其次,将这些候选示例与测试查询进行比对排序,以选择最相关且不冗余的示范示例。 在三个生物医学自然语言处理任务(命名实体识别(NER)、关系抽取(RE)和文本分类(TC))上进行评估,推理时使用了 LLaMA 3.1 和 Qwen 2.5,并结合三种检索器(BGE-Large、BMRetriever、MedCPT)。Dual-Div 始终优于基线——宏 F1 分数最高提升达 5%——同时表现出对提示顺序变化和类别不平衡的鲁棒性。我们的研究结果表明,初始检索的多样性比排序阶段的优化更为关键,并且将示例演示数量限制在 3–5 个可以最大化性能效率。
Subject: Computation and Language 主题:计算与语言
Publish: 2025-08-11 16:13:21 UTC 发布:2025-08-11 16:13:21 UTC
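To make the "submodular perspective" in the Dual-Div entry concrete, here is a small sketch of greedy candidate selection under a facility-location objective, a standard monotone submodular function that rewards covering the corpus (representativeness) and gives little gain to near-duplicates (diversity). The facility-location objective, cosine similarity, and the name `greedy_select` are assumptions chosen for illustration; the abstract does not specify the paper's actual objective.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_select(vectors, k):
    """Greedily pick up to k indices maximizing sum_i max_{j in S} sim(i, j)."""
    n = len(vectors)
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    selected = []
    coverage = [0.0] * n  # best similarity of each item to the selected set
    for _ in range(min(k, n)):
        best_gain, best_j = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain: how much candidate j improves every item's coverage.
            gain = sum(max(sim[i][j] - coverage[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        coverage = [max(coverage[i], sim[i][best_j]) for i in range(n)]
    return selected
```

With two clusters of embeddings, the second pick falls in the cluster not yet covered, because another example from the already-covered cluster adds almost no marginal gain.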
#10 Can LLMs Detect Their Confabulations? Estimating Reliability in Uncertainty-Aware Language Models LLMs 能否发现它们的虚构内容?在不确定性感知语言模型中估计可靠性
Authors: [Tianyi Zhou](https://arxiv.org/search/?searchtype=author&query=Tianyi Zhou), [Johanne Medina](https://arxiv.org/search/?searchtype=author&query=Johanne Medina), [Sanjay Chawla](https://arxiv.org/search/?searchtype=author&query=Sanjay Chawla)
Large Language Models (LLMs) are prone to generating fluent but incorrect content, known as confabulation, which poses increasing risks in multi-turn or agentic applications where outputs may be reused as context. In this work, we investigate how in-context information influences model behavior and whether LLMs can identify their unreliable responses. We propose a reliability estimation that leverages token-level uncertainty to guide the aggregation of internal model representations. Specifically, we compute aleatoric and epistemic uncertainty from output logits to identify salient tokens and aggregate their hidden states into compact representations for response-level reliability prediction. Through controlled experiments on open QA benchmarks, we find that correct in-context information improves both answer accuracy and model confidence, while misleading context often induces confidently incorrect responses, revealing a misalignment between uncertainty and correctness. Our probing-based method captures these shifts in model behavior and improves the detection of unreliable outputs across multiple open-source LLMs. These results underscore the limitations of direct uncertainty signals and highlight the potential of uncertainty-guided probing for reliability-aware generation. 大型语言模型(LLMs)容易生成流畅但不正确的内容,这被称为虚构(confabulation),在多轮或具代理性的应用中风险日益增加,因为输出可能被重复用作上下文。在本研究中,我们探讨了上下文信息如何影响模型行为以及 LLMs 是否能够识别其不可靠的回答。我们提出了一种可靠性估计方法,利用基于标记的(token-level)不确定性来引导内部模型表示的聚合。具体而言,我们从输出 logits 中计算固有不确定性(aleatoric)和认知不确定性(epistemic),以识别显著标记,并将这些标记的隐层状态聚合为用于回答级可靠性预测的紧凑表示。通过在开放问答基准上的受控实验,我们发现,正确的上下文信息既提高了答案准确率,也提高了模型置信度,而误导性上下文常常导致模型自信地给出错误回答,揭示了不确定性与正确性之间的错位。我们的基于探测的方法捕捉到了这些模型行为的变化,并在多种开源 LLMs 上改进了对不可靠输出的检测。 这些结果强调了直接不确定性信号的局限性,并彰显了以不确定性为指导的探测在面向可靠性生成方面的潜力。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-11 16:12:36 UTC 发布:2025-08-11 16:12:36 UTC
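The pipeline in the confabulation-detection entry (score each token's uncertainty from the logits, keep the salient tokens, aggregate their hidden states) can be sketched as below. This is a simplified illustration: it uses plain softmax entropy as the token-level score, which is a stand-in for the aleatoric and epistemic estimates named in the abstract, and the function names and mean pooling are assumptions for this example. The pooled vector would then feed a separately trained reliability probe, which is not shown.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) for one token position."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def pool_salient_states(step_logits, hidden_states, top_k):
    """Mean-pool the hidden states of the top_k most uncertain positions.

    step_logits[t] and hidden_states[t] belong to generated token t;
    top_k must be at least 1. Returns (salient_positions, pooled_vector).
    """
    entropies = [token_entropy(row) for row in step_logits]
    order = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    salient = sorted(order[:top_k])
    dim = len(hidden_states[0])
    pooled = [sum(hidden_states[i][d] for i in salient) / len(salient) for d in range(dim)]
    return salient, pooled
```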
#11 Optimal Transport Regularization for Speech Text Alignment in Spoken Language Models 语音语言模型中用于语音文本对齐的最优传输正则化
Authors: [Wenze Xu](https://arxiv.org/search/?searchtype=author&query=Wenze Xu), [Chun Wang](https://arxiv.org/search/?searchtype=author&query=Chun Wang), [Jiazhen Yu](https://arxiv.org/search/?searchtype=author&query=Jiazhen Yu), [Sheng Chen](https://arxiv.org/search/?searchtype=author&query=Sheng Chen), [Liang Gao](https://arxiv.org/search/?searchtype=author&query=Liang Gao), [Weihong Deng](https://arxiv.org/search/?searchtype=author&query=Weihong Deng)
Spoken Language Models (SLMs), which extend Large Language Models (LLMs) to perceive speech inputs, have gained increasing attention for their potential to advance speech understanding tasks. However, despite recent progress, studies show that SLMs often struggle to generalize across datasets, even for trained languages and tasks, raising concerns about whether they process speech in a text-like manner as intended. A key challenge underlying this limitation is the modality gap between speech and text representations. The high variability in speech embeddings may allow SLMs to achieve strong in-domain performance by exploiting unintended speech variations, ultimately hindering generalization. To mitigate this modality gap, we introduce Optimal Transport Regularization (OTReg), a method that formulates speech-text alignment as an optimal transport problem and derives a regularization loss to improve SLM training. In each training iteration, OTReg first establishes a structured correspondence between speech and transcript embeddings by determining the optimal transport plan, then incorporates the regularization loss based on this transport plan to optimize SLMs in generating speech embeddings that align more effectively with transcript embeddings. OTReg is lightweight, requiring no additional labels or learnable parameters, and integrates seamlessly into existing SLM training procedures. Extensive multilingual ASR experiments demonstrate that OTReg enhances speech-text alignment, mitigates the modality gap, and consequently improves SLM generalization across diverse datasets. 
口语语言模型(Spoken Language Models,SLMs)将大型语言模型(LLMs)扩展为能感知语音输入,因其有望推动语音理解任务而受到越来越多关注。然而,尽管近期取得了一些进展,研究表明 SLMs 常常难以在不同数据集间泛化,即便是对已训练的语言和任务也存在这一问题,这引发了对它们是否如预期那样以类文本的方式处理语音的担忧。导致这一限制的一个关键挑战是语音与文本表示之间的模态差距。语音嵌入的高可变性可能使 SLMs 通过利用非预期的语音变化来在域内取得较强的性能,但这最终会阻碍泛化。为缓解该模态差距,我们提出了最优传输正则化(Optimal Transport Regularization,OTReg),该方法将语音-文本对齐表述为一个最优传输问题,并由此导出一种正则化损失以改进 SLM 的训练。 在每次训练迭代中,OTReg 首先通过确定最优传输计划在语音和文本嵌入之间建立结构化对应关系,然后基于该传输计划加入正则化损失,以优化 SLM,使其生成的语音嵌入能更有效地与文本嵌入对齐。OTReg 轻量、不需要额外标签或可学习参数,并能无缝整合到现有的 SLM 训练流程中。大量多语言 ASR 实验表明,OTReg 能增强语音-文本对齐、缓解模态间差距,从而提升 SLM 在不同数据集上的泛化能力。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-11 16:06:04 UTC 发布:2025-08-11 16:06:04 UTC
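The OTReg entry describes computing a transport plan between speech and transcript embeddings and turning it into a regularization loss. A rough pure-Python sketch of that idea follows, using entropy-regularized optimal transport solved with Sinkhorn iterations. Whether the paper uses Sinkhorn, what cost function it uses, and the names `sinkhorn_plan` and `alignment_loss` are all assumptions made for illustration.

```python
import math


def sinkhorn_plan(cost, row_mass, col_mass, epsilon=0.5, iterations=500):
    """Entropy-regularized optimal transport plan via Sinkhorn iterations.

    cost[i][j]: assumed cost of matching speech frame i to text token j.
    row_mass / col_mass: marginals that the plan's rows / columns must sum to.
    """
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the two marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def alignment_loss(cost, plan):
    """Transport cost <plan, cost>, usable as a regularization term."""
    return sum(p * c for prow, crow in zip(plan, cost) for p, c in zip(prow, crow))
```

In a training loop, `cost` would typically be a pairwise distance between speech and transcript embeddings, and the resulting loss would be added to the main objective with a small weight; no extra labels or learnable parameters are involved, consistent with the abstract's description of the method as lightweight.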
#12 Czech Dataset for Complex Aspect-Based Sentiment Analysis Tasks
Authors: [Jakub Šmíd](https://arxiv.org/search/?searchtype=author&query=Jakub Šmíd), [Pavel Přibáň](https://arxiv.org/search/?searchtype=author&query=Pavel Přibáň), [Ondřej Pražák](https://arxiv.org/search/?searchtype=author&query=Ondřej Pražák), [Pavel Král](https://arxiv.org/search/?searchtype=author&query=Pavel Král)
In this paper, we introduce a novel Czech dataset for aspect-based sentiment analysis (ABSA), which consists of 3.1K manually annotated reviews from the restaurant domain. The dataset is built upon the older Czech dataset, which contained only separate labels for the basic ABSA tasks such as aspect term extraction or aspect polarity detection. Unlike its predecessor, our new dataset is specifically designed for more complex tasks, e.g. target-aspect-category detection. These advanced tasks require a unified annotation format, seamlessly linking sentiment elements (labels) together. Our dataset follows the format of the well-known SemEval-2016 datasets. This design choice allows effortless application and evaluation in cross-lingual scenarios, ultimately fostering cross-language comparisons with equivalent counterpart datasets in other languages. The annotation process engaged two trained annotators, yielding an impressive inter-annotator agreement rate of approximately 90%. Additionally, we provide 24M reviews without annotations suitable for unsupervised learning. We present robust monolingual baseline results achieved with various Transformer-based models and insightful error analysis to supplement our contributions. Our code and dataset are freely available for non-commercial research purposes. 在本文中,我们介绍了一个用于基于方面的情感分析(ABSA)的捷克语新数据集,该数据集由餐饮领域的 3.1K 条人工注释评论构成。该数据集建立在早期的捷克语数据集之上,后者仅包含用于基本 ABSA 任务(如方面术语抽取或方面极性检测)的单独标签。与其前身不同,我们的新数据集专为更复杂的任务设计,例如目标-方面-类别检测。这些高级任务需要统一的注释格式,将情感元素(标签)无缝地关联在一起。我们的数据集遵循著名的 SemEval-2016 数据集格式。此设计选择便于在跨语言场景中进行无缝应用和评估,最终促进与其他语言中对应数据集的跨语言比较。注释过程由两名受过训练的注释员参与,产生了约 90% 的令人印象深刻的注释者间一致率。此外,我们还提供了 2400 万条未注释的评论,适用于无监督学习。 我们展示了使用多种基于 Transformer 的模型获得的稳健单语基线结果,并提供了有见地的错误分析以补充我们的贡献。我们的代码和数据集可供非商业研究用途免费使用。
Subject: Computation and Language 主题:计算与语言
Publish: 2025-08-11 16:03:28 UTC 发布:2025-08-11 16:03:28 UTC
#13 Iterative refinement, not training objective, makes HuBERT behave differently from wav2vec 2.0 迭代精炼(iterative refinement),而非训练目标,使得 HuBERT 与 wav2vec 2.0 行为不同
Authors: [Robin Huo](https://arxiv.org/search/?searchtype=author&query=Robin Huo), [Ewan Dunbar](https://arxiv.org/search/?searchtype=author&query=Ewan Dunbar)
Self-supervised models for speech representation learning now see widespread use for their versatility and performance on downstream tasks, but the effect of model architecture on the linguistic information learned in their representations remains under-studied. This study investigates two such models, HuBERT and wav2vec 2.0, and minimally compares two of their architectural differences: training objective and iterative pseudo-label refinement through multiple training iterations. We find that differences in canonical correlation of hidden representations to word identity, phoneme identity, and speaker identity are explained by training iteration, not training objective. We suggest that future work investigate the reason for the effectiveness of iterative refinement in encoding linguistic information in self-supervised speech representations. 自监督语音表示学习模型因其通用性和在下游任务上的表现而被广泛使用,但模型架构对其表示中所学语言信息的影响仍研究不足。本研究考察了两种此类模型——HuBERT 和 wav2vec 2.0,并对它们的两个架构差异进行了最小限度的比较:训练目标与通过多次训练迭代进行的伪标签迭代精炼。我们发现,隐藏表示与词身份、音素身份和说话人身份的典型相关性差异是由训练迭代决定的,而非训练目标。我们建议未来工作探究为何迭代精炼在将语言信息编码到自监督语音表示中如此有效的原因。
Subject: Computation and Language 主题:Computation and Language
Publish: 2025-08-11 15:48:56 UTC 发布:2025-08-11 15:48:56 UTC
#14 Assessing LLM Text Detection in Educational Contexts: Does Human Contribution Affect Detection? 在教育情境中评估 LLM 文本检测:人为贡献是否影响检测?
Authors: [Lukas Gehring](https://arxiv.org/search/?searchtype=author&query=Lukas Gehring), [Benjamin Paaßen](https://arxiv.org/search/?searchtype=author&query=Benjamin Paaßen)
Recent advancements in Large Language Models (LLMs) and their increased accessibility have made it easier than ever for students to automatically generate texts, posing new challenges for educational institutions. To enforce norms of academic integrity and ensure students’ learning, learning analytics methods to automatically detect LLM-generated text appear increasingly appealing. This paper benchmarks the performance of different state-of-the-art detectors in educational contexts, introducing a novel dataset, called Generative Essay Detection in Education (GEDE), containing over 900 student-written essays and over 12,500 LLM-generated essays from various domains. To capture the diversity of LLM usage practices in generating text, we propose the concept of contribution levels, representing students’ contribution to a given assignment. These levels range from purely human-written texts, to slightly LLM-improved versions, to fully LLM-generated texts, and finally to active attacks on the detector by “humanizing” generated texts. We show that most detectors struggle to accurately classify texts of intermediate student contribution levels, like LLM-improved human-written texts. Detectors are particularly likely to produce false positives, which is problematic in educational settings where false suspicions can severely impact students’ lives. Our dataset, code, and additional supplementary materials are publicly available at https://github.com/lukasgehring/Assessing-LLM-Text-Detection-in-Educational-Contexts. 
近年来大型语言模型(LLMs)的进步及其日益可及,使学生自动生成文本比以往任何时候都更容易,这给教育机构带来了新的挑战。为维护学术诚信规范并确保学生学习,自动检测 LLM 生成文本的学习分析方法变得愈发具有吸引力。本文在教育情境中对不同最先进检测器的性能进行了基准评估,并引入了一个新数据集——教育中生成性文章检测(GEDE),该数据集包含来自各领域的 900 多篇学生撰写的文章和 12,500 多篇 LLM 生成的文章。为捕捉在生成文本过程中 LLM 使用实践的多样性,我们提出了贡献程度的概念,用以表示学生对某一作业的贡献。这些程度从纯人工撰写的文本、略微由 LLM 改进的版本、到完全由 LLM 生成的文本,再到通过“使生成文本更像人写”来主动攻击检测器的情况。我们表明,大多数检测器难以准确分类处于中等学生贡献程度的文本,例如由 LLM 改进的人类撰写文本。 检测器特别容易产生误报,这在教育环境中尤为成问题,因为错误的怀疑可能严重影响学生的生活。我们的数据集、代码和其他补充材料已公开发布在 https://github.com/lukasgehring/Assessing-LLM-Text-Detection-in-Educational-Contexts。
Subjects: Computation and Language, Machine Learning 主题:计算与语言,机器学习
Publish: 2025-08-11 15:34:49 UTC 发布:2025-08-11 15:34:49 UTC
#15 Dual Information Speech Language Models for Emotional Conversations 面向情感会话的双重信息语音语言模型
Authors: [Chun Wang](https://arxiv.org/search/?searchtype=author&query=Chun Wang), [Chenyang Liu](https://arxiv.org/search/?searchtype=author&query=Chenyang Liu), [Wenze Xu](https://arxiv.org/search/?searchtype=author&query=Wenze Xu), [Weihong Deng](https://arxiv.org/search/?searchtype=author&query=Weihong Deng)
Conversational systems relying on text-based large language models (LLMs) often overlook paralinguistic cues, essential for understanding emotions and intentions. Speech-language models (SLMs), which use speech as input, are emerging as a promising solution. However, SLMs built by extending frozen LLMs struggle to capture paralinguistic information and exhibit reduced context understanding. We identify entangled information and improper training strategies as key issues. To address these issues, we propose two heterogeneous adapters and suggest a weakly supervised training strategy. Our approach disentangles paralinguistic and linguistic information, enabling SLMs to interpret speech through structured representations. It also preserves contextual understanding by avoiding the generation of task-specific vectors through controlled randomness. This approach trains only the adapters on common datasets, ensuring parameter and data efficiency. Experiments demonstrate competitive performance in emotional conversation tasks, showcasing the model’s ability to effectively integrate both paralinguistic and linguistic information within contextual settings. 依赖基于文本的大型语言模型(LLMs)的对话系统往往忽视了副语言线索,而这些线索对理解情绪和意图至关重要。以语音为输入的语音-语言模型(SLMs)正成为一种有前景的解决方案。然而,通过扩展冻结的 LLMs 构建的 SLMs 难以捕捉副语言信息,并表现出上下文理解能力下降。我们识别出信息纠缠和不当训练策略为关键问题。为了解决这些问题,我们提出了两种异质适配器,并建议一种弱监督训练策略。我们的方法将副语言信息与语言信息解耦,使 SLMs 能够通过结构化表征来解释语音。同时,通过受控随机性避免生成任务特定向量,从而保留了上下文理解能力。该方法仅在常见数据集上训练适配器,确保了参数和数据的高效性。实验表明在情感对话任务上具有竞争性表现,展示了模型在上下文设置中有效整合副语言与语言信息的能力。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-11 15:33:44 UTC 发布:2025-08-11 15:33:44 UTC
#16 9th Workshop on Sign Language Translation and Avatar Technologies (SLTAT 2025) 第九届手语翻译与虚拟形象技术研讨会(SLTAT 2025)
Authors: [Fabrizio Nunnari](https://arxiv.org/search/?searchtype=author&query=Fabrizio Nunnari), [Cristina Luna Jiménez](https://arxiv.org/search/?searchtype=author&query=Cristina Luna Jiménez), [Rosalee Wolfe](https://arxiv.org/search/?searchtype=author&query=Rosalee Wolfe), [John C. McDonald](https://arxiv.org/search/?searchtype=author&query=John C. McDonald), [Michael Filhol](https://arxiv.org/search/?searchtype=author&query=Michael Filhol), [Eleni Efthimiou](https://arxiv.org/search/?searchtype=author&query=Eleni Efthimiou), [Evita Fotinea](https://arxiv.org/search/?searchtype=author&query=Evita Fotinea), [Thomas Hanke](https://arxiv.org/search/?searchtype=author&query=Thomas Hanke)
The Sign Language Translation and Avatar Technology (SLTAT) workshops continue a series of gatherings to share recent advances in improving deaf / human communication through non-invasive means. This 2025 edition, the 9th since its first appearance in 2011, is hosted by the International Conference on Intelligent Virtual Agents (IVA), giving the opportunity for contamination between two research communities, using digital humans as either virtual interpreters or as interactive conversational agents. As presented in this summary paper, SLTAT sees contributions beyond avatar technologies, with a consistent number of submissions on sign language recognition, and other work on data collection, data analysis, tools, ethics, usability, and affective computing. 手语翻译与虚拟形象技术(SLTAT)研讨会延续了一系列学术聚会,分享通过非侵入性手段改善聋人与他人交流的最新进展。2025 年这一届为自 2011 年首届以来的第九届,由国际智能虚拟代理大会(IVA)主办,为两个研究社区之间的交叉融合提供了机会:将数字人用作虚拟译员或作为交互式会话代理。如本摘要论文所述,SLTAT 的贡献超越了虚拟形象技术,持续收到数量稳定的关于手语识别的投稿,以及关于数据采集、数据分析、工具、伦理、可用性和情感计算的其他研究成果。
Subject: Computation and Language 主题:Computation and Language
Publish: 2025-08-11 14:50:21 UTC 发布:2025-08-11 14:50:21 UTC
#17 Progressive Depth Up-scaling via Optimal Transport 通过最优传输实现渐进式深度扩展
Authors: [Mingzi Cao](https://arxiv.org/search/?searchtype=author&query=Mingzi Cao), [Xi Wang](https://arxiv.org/search/?searchtype=author&query=Xi Wang), [Nikolaos Aletras](https://arxiv.org/search/?searchtype=author&query=Nikolaos Aletras)
Scaling Large Language Models (LLMs) yields performance gains but incurs substantial training costs. Depth up-scaling offers training efficiency by adding new layers to pre-trained models. However, most existing methods copy or average weights from base layers, neglecting neuron permutation differences. This limitation can potentially cause misalignment that harms performance. Inspired by applying Optimal Transport (OT) for neuron alignment, we propose Optimal Transport Depth Up-Scaling (OpT-DeUS). OpT-DeUS aligns and fuses Transformer blocks in adjacent base layers via OT for new layer creation, to mitigate neuron permutation mismatch between layers. OpT-DeUS achieves better overall performance and offers improved training efficiency than existing methods for continual pre-training and supervised fine-tuning across different model sizes. To further evaluate the impact of interpolation positions, our extensive analysis shows that inserting new layers closer to the top results in higher training efficiency due to shorter back-propagation time while obtaining additional performance gains. 扩大大型语言模型(LLMs)的规模可以带来性能提升,但也会产生可观的训练成本。通过向预训练模型中新增层数进行深度扩展,能够提高训练效率。然而,大多数现有方法通过复制或平均基层权重来构建新层,忽视了神经元排列(permutation)差异。这一局限可能导致错位,从而损害性能。受将最优传输(Optimal Transport,OT)用于神经元对齐的启发,我们提出了最优传输深度扩展(Optimal Transport Depth Up-Scaling,OpT-DeUS)。OpT-DeUS 通过 OT 对齐并融合相邻基层的 Transformer 模块以创建新层,以缓解层间神经元排列不匹配的问题。与现有方法相比,OpT-DeUS 在持续预训练和有监督微调的不同模型规模上均取得了更好的整体性能并提供了更高的训练效率。为了进一步评估插入位置的影响,我们的详尽分析显示,将新层插得更接近顶部能带来更高的训练效率(因反向传播路径更短)并获得额外的性能提升。
Subject: Computation and Language 主题:Computation and Language
Publish: 2025-08-11 14:15:33 UTC 发布:2025-08-11 14:15:33 UTC
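The neuron permutation mismatch mentioned in the OpT-DeUS abstract (#17) can be illustrated with a toy sketch. This is not the authors' implementation: it swaps in a brute-force hard permutation search (a degenerate case of optimal transport with uniform weights) on made-up 3-neuron layers, and the names `align_rows`, `average`, `layer_a`, and `layer_b` are hypothetical.

```python
from itertools import permutations


def sq_dist(u, v):
    """Squared Euclidean distance between two weight rows."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def align_rows(w_a, w_b):
    """Reorder the rows (neurons) of w_b to best match w_a.

    Brute force over all permutations: only viable for tiny toy layers.
    """
    n = len(w_a)
    best = min(
        permutations(range(n)),
        key=lambda p: sum(sq_dist(w_a[i], w_b[p[i]]) for i in range(n)),
    )
    return [w_b[j] for j in best]


def average(w_a, w_b):
    """Element-wise average of two equally shaped weight matrices."""
    return [[(a + b) / 2 for a, b in zip(ra, rb)] for ra, rb in zip(w_a, w_b)]


# Assumption: two layers holding the same three neurons in a different order.
layer_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
layer_b = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]

naive = average(layer_a, layer_b)                         # mixes unrelated neurons
aligned = average(layer_a, align_rows(layer_a, layer_b))  # averages matching neurons

print("naive:  ", naive)
print("aligned:", aligned)
```

In this toy case the naive average blurs the neurons together, while the aligned average recovers the shared weights. In a real Transformer, reordering one layer's neurons would also require reordering the matching input columns of the weights that consume them, and a proper OT or assignment solver would replace the factorial-time search.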
#18 WideSearch: Benchmarking Agentic Broad Info-Seeking WideSearch:评估代理式大规模信息搜集的基准
Authors: [Ryan Wong](https://arxiv.org/search/?searchtype=author&query=Ryan Wong), [Jiawei Wang](https://arxiv.org/search/?searchtype=author&query=Jiawei Wang), [Junjie Zhao](https://arxiv.org/search/?searchtype=author&query=Junjie Zhao), [Li Chen](https://arxiv.org/search/?searchtype=author&query=Li Chen), [Yan Gao](https://arxiv.org/search/?searchtype=author&query=Yan Gao), [Long Zhang](https://arxiv.org/search/?searchtype=author&query=Long Zhang), [Xuan Zhou](https://arxiv.org/search/?searchtype=author&query=Xuan Zhou), [Zuo Wang](https://arxiv.org/search/?searchtype=author&query=Zuo Wang), [Kai Xiang](https://arxiv.org/search/?searchtype=author&query=Kai Xiang), [Ge Zhang](https://arxiv.org/search/?searchtype=author&query=Ge Zhang), [Wenhao Huang](https://arxiv.org/search/?searchtype=author&query=Wenhao Huang), [Yang Wang](https://arxiv.org/search/?searchtype=author&query=Yang Wang), [Ke Wang](https://arxiv.org/search/?searchtype=author&query=Ke Wang)
From professional research to everyday planning, many tasks are bottlenecked by wide-scale information seeking, which is more repetitive than cognitively complex. With the rapid development of Large Language Models (LLMs), automated search agents powered by LLMs offer a promising solution to liberate humans from this tedious work. However, the capability of these agents to perform such “wide-context” collection reliably and completely remains largely unevaluated due to a lack of suitable benchmarks. To bridge this gap, we introduce WideSearch, a new benchmark engineered to evaluate agent reliability on these large-scale collection tasks. The benchmark features 200 manually curated questions (100 in English, 100 in Chinese) from over 15 diverse domains, grounded in real user queries. Each task requires agents to collect large-scale atomic information, which could be verified one by one objectively, and arrange it into a well-organized output. A rigorous five-stage quality control pipeline ensures the difficulty, completeness, and verifiability of the dataset. We benchmark over 10 state-of-the-art agentic search systems, including single-agent, multi-agent frameworks, and end-to-end commercial systems. Most systems achieve overall success rates near 0%, with the best performer reaching just 5%. However, given sufficient time, cross-validation by multiple human testers can achieve a near 100% success rate. These results demonstrate that present search agents have critical deficiencies in large-scale information seeking, underscoring urgent areas for future research and development in agentic search. 
Our dataset, evaluation pipeline, and benchmark results have been publicly released at https://widesearch-seed.github.io/ 从专业研究到日常规划,许多任务瓶颈在于大规模的信息搜集,这类工作比起认知复杂性更偏向重复性。随着大型语言模型(LLMs)的快速发展,由 LLMs 驱动的自动化搜索代理为将人类从这类繁琐工作中解放出来提供了有希望的解决方案。然而,由于缺乏合适的基准,这些代理在执行此类“宽上下文”收集任务时能否可靠且完整地完成,仍然大多未被评估。为弥补这一空白,我们引入了 WideSearch,这是一个为评估代理在这些大规模收集任务上的可靠性而设计的新基准。该基准包含 200 个人工整理的问题(100 个英文,100 个中文),来自 15 个以上的多样领域,基于真实用户查询构建。每个任务要求代理收集大规模的原子信息,这些信息可以逐一客观验证,并将其整理成结构良好的输出。一个严格的五阶段质量控制流程确保了数据集的难度、完整性和可验证性。 我们评测了 10 多种最先进的智能搜索系统,包括单代理、多代理框架以及端到端的商业系统。大多数系统的总体成功率接近 0%,表现最好的也仅达到 5%。然而,在充足时间条件下,多名人工测试者交叉验证却能达到接近 100%的成功率。这些结果表明,目前的搜索代理在大规模信息检索方面存在关键性缺陷,凸显了智能搜索未来研究与开发中亟需解决的领域。我们的数据集、评估流程和基准结果已在 https://widesearch-seed.github.io/ 上公开发布。
Subject: Computation and Language 主题:Computation and Language
Publish: 2025-08-11 14:03:09 UTC 发布:2025-08-11 14:03:09 UTC
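One way to read the success rates reported for WideSearch (#18) is as an all-or-nothing check over a set of individually verifiable items. The sketch below assumes that reading; the strict success rule, the item-level F1, the function name `score_task`, and the sample data are illustrative assumptions, not the benchmark's released evaluation pipeline.

```python
def score_task(predicted, gold):
    """Return (strict_success, item_level_f1) for one collection task.

    Assumption: each item is an atomic, hashable fact that is either
    exactly right or wrong.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return predicted == gold, f1


# Hypothetical task: collect (title, year) pairs.
gold = {("Paper A", 2021), ("Paper B", 2022), ("Paper C", 2023), ("Paper D", 2024)}
predicted = {("Paper A", 2021), ("Paper B", 2022), ("Paper C", 2023)}

success, f1 = score_task(predicted, gold)
print(success, round(f1, 3))
```

Under this assumed scoring, missing a single item fails the whole task even though most items are correct, which is one plausible reason a strict success rate can sit far below item-level quality.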
#19 The Medical Metaphors Corpus (MCC) 医学隐喻语料库(MCC)
Authors: [Anna Sofia Lippolis](https://arxiv.org/search/?searchtype=author&query=Anna Sofia Lippolis), [Andrea Giovanni Nuzzolese](https://arxiv.org/search/?searchtype=author&query=Andrea Giovanni Nuzzolese), [Aldo Gangemi](https://arxiv.org/search/?searchtype=author&query=Aldo Gangemi)
Metaphor is a fundamental cognitive mechanism that shapes scientific understanding, enabling the communication of complex concepts while potentially constraining paradigmatic thinking. Despite the prevalence of figurative language in scientific discourse, existing metaphor detection resources primarily focus on general-domain text, leaving a critical gap for domain-specific applications. In this paper, we present the Medical Metaphors Corpus (MCC), a comprehensive dataset of 792 annotated scientific conceptual metaphors spanning medical and biological domains. MCC aggregates metaphorical expressions from diverse sources including peer-reviewed literature, news media, social media discourse, and crowdsourced contributions, providing both binary and graded metaphoricity judgments validated through human annotation. Each instance includes source-target conceptual mappings and perceived metaphoricity scores on a 0-7 scale, establishing the first annotated resource for computational scientific metaphor research. Our evaluation demonstrates that state-of-the-art language models achieve modest performance on scientific metaphor detection, revealing substantial room for improvement in domain-specific figurative language understanding. MCC enables multiple research applications including metaphor detection benchmarking, quality-aware generation systems, and patient-centered communication tools. 隐喻是一种基本的认知机制,它塑造科学理解,能够传达复杂概念,同时可能限制范式化思维。尽管比喻性语言在科学话语中普遍存在,现有的隐喻检测资源主要集中于通用领域文本,留下了面向特定领域应用的关键空白。在本文中,我们提出了医学隐喻语料库(MCC),这是一个涵盖医学和生物学领域的 792 条注释化科学概念隐喻的综合数据集。MCC 汇集了来自多种来源的比喻表达,包括经同行评审的文献、新闻媒体、社交媒体话语和众包贡献,提供了通过人工注释验证的二元和分级隐喻性判断。每个实例都包括源-目标概念映射和在 0-7 量表上的感知隐喻性评分,建立了第一个用于计算科学隐喻研究的注释资源。 我们的评估表明,最先进的语言模型在科学隐喻检测上表现平平,暴露出在特定领域比喻性语言理解方面存在大量改进空间。MCC 可用于多种研究应用,包括隐喻检测基准测试、质量感知生成系统以及以患者为中心的交流工具。
Subject: Computation and Language 主题:Computation and Language
Publish: 2025-08-11 13:55:31 UTC 发布:2025-08-11 13:55:31 UTC
#20 Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL 超越十回合:利用大规模异步强化学习解锁长时域智能体搜索
Authors: [Jiaxuan Gao](https://arxiv.org/search/?searchtype=author&query=Jiaxuan Gao), [Wei Fu](https://arxiv.org/search/?searchtype=author&query=Wei Fu), [Minyang Xie](https://arxiv.org/search/?searchtype=author&query=Minyang Xie), [Shusheng Xu](https://arxiv.org/search/?searchtype=author&query=Shusheng Xu), [Chuyi He](https://arxiv.org/search/?searchtype=author&query=Chuyi He), [Zhiyu Mei](https://arxiv.org/search/?searchtype=author&query=Zhiyu Mei), [Banghua Zhu](https://arxiv.org/search/?searchtype=author&query=Banghua Zhu), [Yi Wu](https://arxiv.org/search/?searchtype=author&query=Yi Wu)
Recent advancements in LLM-based agents have demonstrated remarkable capabilities in handling complex, knowledge-intensive tasks by integrating external tools. Among diverse choices of tools, search tools play a pivotal role in accessing vast external knowledge. However, open-source agents still fall short of achieving expert-level Search Intelligence, the ability to resolve ambiguous queries, generate precise searches, analyze results, and conduct thorough exploration. Existing approaches fall short in scalability, efficiency, and data quality. For example, small turn limits in existing online RL methods, e.g. <=10, restrict complex strategy learning. This paper introduces ASearcher, an open-source project for large-scale RL training of search agents. Our key contributions include: (1) Scalable fully asynchronous RL training that enables long-horizon search while maintaining high training efficiency. (2) A prompt-based LLM agent that autonomously synthesizes high-quality and challenging QAs, creating a large-scale QA dataset. Through RL training, our prompt-based QwQ-32B agent achieves substantial improvements, with 46.7% and 20.8% Avg@4 gains on xBench and GAIA, respectively. Notably, our agent exhibits extreme long-horizon search, with tool calls exceeding 40 turns and output tokens exceeding 150k during training time. With a simple agent design and no external LLMs, ASearcher-Web-QwQ achieves Avg@4 scores of 42.1 on xBench and 52.8 on GAIA, surpassing existing open-source 32B agents. We open-source our models, training data, and codes in https://github.com/inclusionAI/ASearcher. 
基于 LLM 的智能体在整合外部工具处理复杂、知识密集型任务方面的最新进展展示了显著能力。在多种工具选择中,搜索工具在获取大量外部知识方面发挥着关键作用。然而,开源智能体在实现专家级搜索智能方面仍然不足,即解决模糊查询、生成精确检索、分析结果并进行深入探索的能力仍有差距。现有方法在可扩展性、效率和数据质量方面存在不足。例如,现有在线强化学习方法中的较小回合限制(例如 ≤10)限制了复杂策略的学习。本文提出了 ASearcher,一个用于大规模强化学习训练搜索智能体的开源项目。我们的主要贡献包括:(1)可扩展的完全异步强化学习训练,使得在保持高训练效率的同时实现长时域搜索成为可能。(2)一种基于提示的 LLM 智能体,能够自主合成高质量且具有挑战性的问答,构建大规模问答数据集。通过强化学习训练,我们的基于提示的 QwQ-32B 智能体取得了显著提升,在 xBench 和 GAIA 上的 Avg@4 分别提高了 46.7% 和 20.8%。 值得注意的是,我们的智能体展现出极端的长时程搜索能力,在训练期间工具调用超过 40 轮,输出令牌超过 15 万。在简单的智能体设计且不依赖外部 LLMs 的情况下,ASearcher-Web-QwQ 在 xBench 上的 Avg@4 得分为 42.1,在 GAIA 上为 52.8,超过了现有开源的 32B 智能体。我们在 https://github.com/inclusionAI/ASearcher 上开源了模型、训练数据和代码。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-11 13:36:57 UTC 发布:2025-08-11 13:36:57 UTC
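The ASearcher abstract (#20) reports Avg@4 scores without defining the metric. A common reading, assumed here and not confirmed by the source, is the accuracy averaged over 4 independent runs per question. The function name `avg_at_k` and the sample results are hypothetical.

```python
def avg_at_k(runs_per_question):
    """Mean accuracy (in percent) over k independent runs per question.

    runs_per_question: list of lists of booleans, one inner list per
    question, one boolean per run (True = correct answer).
    """
    per_question = [sum(runs) / len(runs) for runs in runs_per_question]
    return 100.0 * sum(per_question) / len(per_question)


# Hypothetical results: 3 questions, 4 runs each.
results = [
    [True, True, False, True],     # solved in 3 of 4 runs
    [False, False, False, False],  # never solved
    [True, True, True, True],      # always solved
]

score = avg_at_k(results)
print(f"Avg@4 = {score:.1f}")
```

Unlike a pass-if-any-run-succeeds metric, this average rewards consistency across runs, so a question solved only once in four attempts contributes 25% rather than 100%.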
#21 Understanding Syntactic Generalization in Structure-inducing Language Models 理解在结构诱导语言模型中的句法泛化
Authors: [David Arps](https://arxiv.org/search/?searchtype=author&query=David Arps), [Hassan Sajjad](https://arxiv.org/search/?searchtype=author&query=Hassan Sajjad), [Laura Kallmeyer](https://arxiv.org/search/?searchtype=author&query=Laura Kallmeyer)
Structure-inducing Language Models (SiLM) are trained on a self-supervised language modeling task, and induce a hierarchical sentence representation as a byproduct when processing an input. A wide variety of SiLMs have been proposed. However, these have typically been evaluated on a relatively small scale, and evaluation of these models has systematic gaps and lacks comparability. In this work, we study three different SiLM architectures using both natural language (English) corpora and synthetic bracketing expressions: Structformer (Shen et al., 2021), UDGN (Shen et al., 2022) and GPST (Hu et al., 2024). We compare them with respect to (i) properties of the induced syntactic representations (ii) performance on grammaticality judgment tasks, and (iii) training dynamics. We find that none of the three architectures dominates across all evaluation metrics. However, there are significant differences, in particular with respect to the induced syntactic representations. The Generative Pretrained Structured Transformer (GPST; Hu et al. 2024) performs most consistently across evaluation settings, and outperforms the other models on long-distance dependencies in bracketing expressions. Furthermore, our study shows that small models trained on large amounts of synthetic data provide a useful testbed for evaluating basic model properties. 结构诱导语言模型(Structure-inducing Language Models,SiLM)在自监督语言建模任务上进行训练,并在处理输入时作为副产品诱导出层级句子表示。已经提出了各种各样的 SiLM。然而,这些模型通常只在相对小规模上进行评估,且评估存在系统性缺口并缺乏可比性。在这项工作中,我们使用自然语言(英语)语料和合成括号表达式研究了三种不同的 SiLM 架构:Structformer(Shen et al., 2021)、UDGN(Shen et al., 2022)和 GPST(Hu et al., 2024)。我们从 (i) 诱导出的句法表示的属性、(ii) 语法性判断任务的性能以及 (iii) 训练动态三方面对它们进行了比较。我们发现这三种架构在所有评估指标上都没有任何一种完全占优。然而,在诱导出的句法表示方面存在显著差异。生成式预训练结构化变换器(Generative Pretrained Structured Transformer,GPST;Hu et al. 2024)在各评估设置中表现最为稳健,并且在括号表达式的长距离依赖上优于其他模型。 此外,我们的研究表明,用大量合成数据训练的小模型为评估模型的基本属性提供了一个有用的试验平台。
Subject: Computation and Language 主题:Computation and Language
Publish: 2025-08-11 13:29:41 UTC 发布:2025-08-11 13:29:41 UTC
#22 Toward Machine Interpreting: Lessons from Human Interpreting Studies 走向机器口译:来自人类口译研究的经验教训
Authors: [Matthias Sperber](https://arxiv.org/search/?searchtype=author&query=Matthias Sperber), [Maureen de Seyssel](https://arxiv.org/search/?searchtype=author&query=Maureen de Seyssel), [Jiajun Bao](https://arxiv.org/search/?searchtype=author&query=Jiajun Bao), [Matthias Paulik](https://arxiv.org/search/?searchtype=author&query=Matthias Paulik)
Current speech translation systems, while having achieved impressive accuracies, are rather static in their behavior and do not adapt to real-world situations in ways human interpreters do. In order to improve their practical usefulness and enable interpreting-like experiences, a precise understanding of the nature of human interpreting is crucial. To this end, we discuss human interpreting literature from the perspective of the machine translation field, while considering both operational and qualitative aspects. We identify implications for the development of speech translation systems and argue that there is great potential to adopt many human interpreting principles using recent modeling techniques. We hope that our findings provide inspiration for closing the perceived usability gap, and can motivate progress toward true machine interpreting. 当前的语音翻译系统尽管在准确率上取得了令人瞩目的成绩,但其行为相当静态,无法像人类口译员那样适应现实世界的情境。为提高其实用性并实现类似口译的体验,精确理解人类口译的本质至关重要。为此,我们从机器翻译领域的视角讨论了人类口译相关文献,同时兼顾操作性和质量方面的考量。我们指出了对语音翻译系统开发的启示,并主张利用最新的建模技术来采用许多人类口译的原则具有巨大潜力。我们希望我们的研究发现能够为弥合所感知的可用性差距提供灵感,并能激励朝着真正的机器口译取得进展。
Subject: Computation and Language 主题:Computation and Language
Publish: 2025-08-11 13:20:33 UTC 发布:2025-08-11 13:20:33 UTC
#23 Large Language Models for Subjective Language Understanding: A Survey 用于主观性语言理解的大型语言模型:综述
Authors: [Changhao Song](https://arxiv.org/search/?searchtype=author&query=Changhao Song), [Yazhou Zhang](https://arxiv.org/search/?searchtype=author&query=Yazhou Zhang), [Hui Gao](https://arxiv.org/search/?searchtype=author&query=Hui Gao), [Ben Yao](https://arxiv.org/search/?searchtype=author&query=Ben Yao), [Peng Zhang](https://arxiv.org/search/?searchtype=author&query=Peng Zhang)
Subjective language understanding refers to a broad set of natural language processing tasks where the goal is to interpret or generate content that conveys personal feelings, opinions, or figurative meanings rather than objective facts. With the advent of large language models (LLMs) such as ChatGPT, LLaMA, and others, there has been a paradigm shift in how we approach these inherently nuanced tasks. In this survey, we provide a comprehensive review of recent advances in applying LLMs to subjective language tasks, including sentiment analysis, emotion recognition, sarcasm detection, humor understanding, stance detection, metaphor interpretation, intent detection, and aesthetics assessment. We begin by clarifying the definition of subjective language from linguistic and cognitive perspectives, and we outline the unique challenges posed by subjective language (e.g. ambiguity, figurativeness, context dependence). We then survey the evolution of LLM architectures and techniques that particularly benefit subjectivity tasks, highlighting why LLMs are well-suited to model subtle human-like judgments. For each of the eight tasks, we summarize task definitions, key datasets, state-of-the-art LLM-based methods, and remaining challenges. We provide comparative insights, discussing commonalities and differences among tasks and how multi-task LLM approaches might yield unified models of subjectivity. Finally, we identify open issues such as data limitations, model bias, and ethical considerations, and suggest future research directions. We hope this survey will serve as a valuable resource for researchers and practitioners interested in the intersection of affective computing, figurative language processing, and large-scale language models. 
主观性语言理解指的是一类广泛的自然语言处理任务,其目标是解释或生成传达个人感受、观点或比喻意义的内容,而不是客观事实。随着大型语言模型(LLMs)如 ChatGPT、LLaMA 等的出现,我们在处理这些本质上微妙的任务方面已经发生了范式转变。在本综述中,我们对将 LLMs 应用于主观性语言任务的最新进展进行了全面回顾,涵盖情感分析、情绪识别、讽刺检测、幽默理解、立场检测、隐喻解读、意图检测和美学评估。我们首先从语言学和认知的角度阐明主观性语言的定义,并概述主观性语言所带来的独特挑战(例如歧义性、比喻性、依赖上下文)。随后,我们回顾了特别有利于主观性任务的 LLM 架构和技术演进,强调了为何 LLMs 擅长模拟微妙的人类式判断。 对于这八项任务中的每一项,我们总结了任务定义、关键数据集、基于 LLM 的最新方法以及仍然存在的挑战。我们提供了比较性的见解,讨论了各任务之间的共性与差异,以及多任务 LLM 方法如何可能产生关于主观性的统一模型。最后,我们指出了诸如数据限制、模型偏见和伦理考量等未解决问题,并提出了未来研究方向的建议。我们希望这篇综述能成为对情感计算、比喻语言处理和大规模语言模型交叉领域感兴趣的研究人员和从业者的一份有价值的资源。
Subject: Computation and Language 主题:Computation and Language
Publish: 2025-08-11 13:10:44 UTC 发布:2025-08-11 13:10:44 UTC
#24 Expert Preference-based Evaluation of Automated Related Work Generation 基于专家偏好的自动相关工作生成评估
Authors: [Furkan Şahinuç](https://arxiv.org/search/?searchtype=author&query=Furkan Şahinuç), [Subhabrata Dutta](https://arxiv.org/search/?searchtype=author&query=Subhabrata Dutta), [Iryna Gurevych](https://arxiv.org/search/?searchtype=author&query=Iryna Gurevych)
Expert domain writing, such as scientific writing, typically demands extensive domain knowledge. Recent advances in LLMs show promising potential in reducing the expert workload. However, evaluating the quality of automatically generated scientific writing is a crucial open issue, as it requires knowledge of domain-specific evaluation criteria and the ability to discern expert preferences. Conventional automatic metrics and LLM-as-a-judge systems are insufficient to grasp expert preferences and domain-specific quality standards. To address this gap and support human-AI collaborative writing, we focus on related work generation, one of the most challenging scientific tasks, as an exemplar. We propose GREP, a multi-turn evaluation framework that integrates classical related work evaluation criteria with expert-specific preferences. Instead of assigning a single score, our framework decomposes the evaluation into fine-grained dimensions. This localized evaluation approach is further augmented with contrastive few-shot examples to provide detailed contextual guidance for the evaluation dimensions. The design principles allow our framework to deliver cardinal assessment of quality, which can facilitate better post-training compared to ordinal preference data. For better accessibility, we design two variants of GREP: a more precise variant with proprietary LLMs as evaluators, and a cheaper alternative with open-weight LLMs. Empirical investigation reveals that our framework is able to assess the quality of related work sections in a much more robust manner compared to standard LLM judges, reflects natural scenarios of scientific writing, and bears a strong correlation with the human expert assessment. We also observe that generations from state-of-the-art LLMs struggle to satisfy validation constraints of a suitable related work section. They (mostly) fail to improve based on feedback as well. 
专家领域写作,例如科学写作,通常需要广泛的领域知识。最近在 LLMs 方面的进展显示出减少专家工作量的有希望潜力。然而,自动生成的科学写作质量评估是一个关键的未决问题,因为它需要掌握特定领域的评估标准并能够辨别专家偏好。传统的自动化度量和将 LLM 作为评判者的系统不足以理解专家偏好和特定领域的质量标准。为填补这一空白并支持人机协作写作,我们以相关工作生成这一最具挑战性的科学任务之一为示例。我们提出了 GREP,这是一种多轮评估框架,将经典的相关工作评估标准与专家特定偏好相结合。我们的框架不是赋予单一分数,而是将评估分解为细粒度维度。这种局部化的评估方法通过对比少样本示例进一步增强,为各评估维度提供了详细的上下文指导。 这些设计原则使我们的框架能够提供质量的基数评估,相较于序数偏好数据,这有助于更好的训练后处理。为提高可及性,我们设计了两种 GREP 变体:一种使用专有 LLMs 作为评估器的更精确变体,另一种使用开源权重 LLMs 的更廉价替代方案。实证研究表明,与标准的 LLM 裁判相比,我们的框架能够以更稳健的方式评估相关工作章节的质量,反映了科学写作的自然情境,并且与人类专家评估具有很强的相关性。我们还观察到,当前最先进 LLMs 的生成结果难以满足适当相关工作章节的验证约束,而且它们(大多)也无法根据反馈进行改进。
Subject: Computation and Language 主题:Computation and Language
Publish: 2025-08-11 13:08:07 UTC 发布:2025-08-11 13:08:07 UTC
#25 Challenges and opportunities in portraying emotion in generated sign language 在生成手语中表达情感的挑战与机遇
Authors: [John C. McDonald](https://arxiv.org/search/?searchtype=author&query=John C. McDonald), [Rosalee Wolfe](https://arxiv.org/search/?searchtype=author&query=Rosalee Wolfe), [Fabrizio Nunnari](https://arxiv.org/search/?searchtype=author&query=Fabrizio Nunnari)
Non-manual signals in sign languages continue to be a challenge for signing avatars. More specifically, emotional content has been difficult to incorporate because of a lack of a standard method of specifying the avatar’s emotional state. This paper explores the application of an intuitive two-parameter representation for emotive non-manual signals to the Paula signing avatar that shows promise for facilitating the linguistic specification of emotional facial expressions in a more coherent manner than previous methods. Users can apply these parameters to control Paula’s emotional expressions through a textual representation called the EASIER notation. The representation can allow avatars to express more nuanced emotional states using two numerical parameters. It also has the potential to enable more consistent specification of emotional non-manual signals in linguistic annotations which drive signing avatars. 手语中的非手部信号仍然是手语化身的一个挑战。更具体地说,情感内容难以纳入,原因在于缺乏一种用于明确化身情感状态的标准方法。本文探讨了一种直观的、由两个参数表示的情感性非手部信号在 Paula 手语化身上的应用,该方法有望比以往方法更连贯地促成情感面部表情的语言学指定。用户可以通过一种称为 EASIER 符号的文本表示来应用这些参数,从而控制 Paula 的情感表情。该表示法允许化身使用两个数值参数来表达更为细腻的情感状态。它还有可能使驱动手语化身的语言学注释中对情感性非手部信号的指定更加一致。
Subject: Computation and Language 主题:Computation and Language
Publish: 2025-08-11 12:52:39 UTC 发布:2025-08-11 12:52:39 UTC
#26 Tailored Emotional LLM-Supporter: Enhancing Cultural Sensitivity
Authors: [Chen Cecilia Liu](https://arxiv.org/search/?searchtype=author&query=Chen Cecilia Liu), [Hiba Arnaout](https://arxiv.org/search/?searchtype=author&query=Hiba Arnaout), [Nils Kovačić](https://arxiv.org/search/?searchtype=author&query=Nils Kovačić), [Dana Atzil-Slonim](https://arxiv.org/search/?searchtype=author&query=Dana Atzil-Slonim), [Iryna Gurevych](https://arxiv.org/search/?searchtype=author&query=Iryna Gurevych)
Large language models (LLMs) show promise in offering emotional support and generating empathetic responses for individuals in distress, but their ability to deliver culturally sensitive support remains underexplored due to lack of resources. In this work, we introduce CultureCare, the first dataset designed for this task, spanning four cultures and including 1729 distress messages, 1523 cultural signals, and 1041 support strategies with fine-grained emotional and cultural annotations. Leveraging CultureCare, we (i) develop and test four adaptation strategies for guiding three state-of-the-art LLMs toward culturally sensitive responses; (ii) conduct comprehensive evaluations using LLM judges, in-culture human annotators, and clinical psychologists; (iii) show that adapted LLMs outperform anonymous online peer responses, and that simple cultural role-play is insufficient for cultural sensitivity; and (iv) explore the application of LLMs in clinical training, where experts highlight their potential in fostering cultural competence in future therapists.
Subject: Computation and Language 主题:Computation and Language
Publish: 2025-08-11 12:17:58 UTC 发布:2025-08-11 12:17:58 UTC
#27 Few-shot Cross-lingual Aspect-Based Sentiment Analysis with Sequence-to-Sequence Models
Authors: [Jakub Šmíd](https://arxiv.org/search/?searchtype=author&query=Jakub Šmíd), [Pavel Přibáň](https://arxiv.org/search/?searchtype=author&query=Pavel Přibáň), [Pavel Král](https://arxiv.org/search/?searchtype=author&query=Pavel Král)
Aspect-based sentiment analysis (ABSA) has received substantial attention in English, yet challenges remain for low-resource languages due to the scarcity of labelled data. Current cross-lingual ABSA approaches often rely on external translation tools and overlook the potential benefits of incorporating a small number of target language examples into training. In this paper, we evaluate the effect of adding few-shot target language examples to the training set across four ABSA tasks, six target languages, and two sequence-to-sequence models. We show that adding as few as ten target language examples significantly improves performance over zero-shot settings and achieves a similar effect to constrained decoding in reducing prediction errors. Furthermore, we demonstrate that combining 1,000 target language examples with English data can even surpass monolingual baselines. These findings offer practical insights for improving cross-lingual ABSA in low-resource and domain-specific settings, as obtaining ten high-quality annotated examples is both feasible and highly effective.
Subject: Computation and Language 主题:计算与语言
Publish: 2025-08-11 11:31:37 UTC 发布:2025-08-11 11:31:37 协调世界时
#28 Large Language Models for Czech Aspect-Based Sentiment Analysis 用于捷克语基于方面的情感分析的大型语言模型
Authors: [Jakub Šmíd](https://arxiv.org/search/?searchtype=author&query=Jakub Šmíd), [Pavel Přibáň](https://arxiv.org/search/?searchtype=author&query=Pavel Přibáň), [Pavel Král](https://arxiv.org/search/?searchtype=author&query=Pavel Král)
Aspect-based sentiment analysis (ABSA) is a fine-grained sentiment analysis task that aims to identify sentiment toward specific aspects of an entity. While large language models (LLMs) have shown strong performance in various natural language processing (NLP) tasks, their capabilities for Czech ABSA remain largely unexplored. In this work, we conduct a comprehensive evaluation of 19 LLMs of varying sizes and architectures on Czech ABSA, comparing their performance in zero-shot, few-shot, and fine-tuning scenarios. Our results show that small domain-specific models fine-tuned for ABSA outperform general-purpose LLMs in zero-shot and few-shot settings, while fine-tuned LLMs achieve state-of-the-art results. We analyze how factors such as multilingualism, model size, and recency influence performance and present an error analysis highlighting key challenges, particularly in aspect term prediction. Our findings provide insights into the suitability of LLMs for Czech ABSA and offer guidance for future research in this area. 基于方面的情感分析(ABSA)是一种细粒度的情感分析任务,旨在识别针对实体特定方面的情感。尽管 LLMs 在各种自然语言处理(NLP)任务中表现出色,但其在捷克语 ABSA 方面的能力仍基本未被探索。在本研究中,我们对 19 种不同规模和架构的 LLMs 在捷克语 ABSA 上进行了全面评估,比较了它们在零样本、少样本和微调场景下的表现。我们的结果表明,为 ABSA 微调的小型领域专用模型在零样本和少样本设置中优于通用的 LLMs,而经过微调的 LLMs 则达到了最先进的结果。我们分析了多语性、模型规模和新近性等因素如何影响性能,并给出了错误分析,突出了关键挑战,尤其是在方面术语预测方面。我们的研究结果为评估 LLMs 在捷克语 ABSA 中的适用性提供了见解,并为该领域的未来研究提供了指导。
Subject: Computation and Language 主题:计算与语言
Publish: 2025-08-11 11:24:57 UTC 发布:2025-08-11 11:24:57 UTC
#29 LLMs for Law: Evaluating Legal-Specific LLMs on Contract Understanding LLMs 用于法律:在合同理解上评估法律专用 LLMs
Authors: [Amrita Singh](https://arxiv.org/search/?searchtype=author&query=Amrita Singh), [H. Suhan Karaca](https://arxiv.org/search/?searchtype=author&query=H. Suhan Karaca), [Aditya Joshi](https://arxiv.org/search/?searchtype=author&query=Aditya Joshi), [Hye-young Paik](https://arxiv.org/search/?searchtype=author&query=Hye-young Paik), [Jiaojiao Jiang](https://arxiv.org/search/?searchtype=author&query=Jiaojiao Jiang)
Despite advances in legal NLP, no comprehensive evaluation covering multiple legal-specific LLMs currently exists for contract classification tasks in contract understanding. To address this gap, we present an evaluation of 10 legal-specific LLMs on three English language contract understanding tasks and compare them with 7 general-purpose LLMs. The results show that legal-specific LLMs consistently outperform general-purpose models, especially on tasks requiring nuanced legal understanding. Legal-BERT and Contracts-BERT establish new SOTAs on two of the three tasks, despite having 69% fewer parameters than the best-performing general-purpose LLM. We also identify CaseLaw-BERT and LexLM as strong additional baselines for contract understanding. Our results provide a holistic evaluation of legal-specific LLMs and will facilitate the development of more accurate contract understanding systems. 尽管法律自然语言处理取得了进展,目前尚无涵盖多种法律专用 LLMs 的全面评估,专门针对合同理解中的合同分类任务。为填补这一空白,我们对 10 个法律专用 LLMs 在三项英文合同理解任务上进行了评估,并将其与 7 个通用 LLMs 进行了比较。结果表明,法律专用 LLMs 在各项任务中持续优于通用模型,尤其在需要细致法律理解的任务上表现更为突出。Legal-BERT 和 Contracts-BERT 在三项任务中的两项上创下了新的 SOTA,尽管其参数量比表现最好的通用 LLM 少 69%。我们还识别出 CaseLaw-BERT 和 LexLM 作为合同理解的有力附加基线。我们的结果提供了对法律专用 LLMs 的整体评估,并将促进更精确的合同理解系统的发展。
Subject: Computation and Language 主题:计算与语言
Publish: 2025-08-11 11:08:32 UTC 发布日期:2025-08-11 11:08:32 UTC
#30 Evaluating Large Language Models as Expert Annotators 将大型语言模型作为专家注释者进行评估
Authors: [Yu-Min Tseng](https://arxiv.org/search/?searchtype=author&query=Yu-Min Tseng), [Wei-Lin Chen](https://arxiv.org/search/?searchtype=author&query=Wei-Lin Chen), [Chung-Chi Chen](https://arxiv.org/search/?searchtype=author&query=Chung-Chi Chen), [Hsin-Hsi Chen](https://arxiv.org/search/?searchtype=author&query=Hsin-Hsi Chen)
Textual data annotation, the process of labeling or tagging text with relevant information, is typically costly, time-consuming, and labor-intensive. While large language models (LLMs) have demonstrated their potential as direct alternatives to human annotators for general domains natural language processing (NLP) tasks, their effectiveness on annotation tasks in domains requiring expert knowledge remains underexplored. In this paper, we investigate: whether top-performing LLMs, which might be perceived as having expert-level proficiency in academic and professional benchmarks, can serve as direct alternatives to human expert annotators? To this end, we evaluate both individual LLMs and multi-agent approaches across three highly specialized domains: finance, biomedicine, and law. Specifically, we propose a multi-agent discussion framework to simulate a group of human annotators, where LLMs are tasked to engage in discussions by considering others’ annotations and justifications before finalizing their labels. Additionally, we incorporate reasoning models (e.g., o3-mini) to enable a more comprehensive comparison. Our empirical results reveal that: (1) Individual LLMs equipped with inference-time techniques (e.g., chain-of-thought (CoT), self-consistency) show only marginal or even negative performance gains, contrary to prior literature suggesting their broad effectiveness. (2) Overall, reasoning models do not demonstrate statistically significant improvements over non-reasoning models in most settings. This suggests that extended long CoT provides relatively limited benefits for data annotation in specialized domains. (3) Certain model behaviors emerge in the multi-agent discussion environment. For instance, Claude 3.7 Sonnet with thinking rarely changes its initial annotations, even when other agents provide correct annotations or valid reasoning. 
文本数据标注,即为文本添加相关信息的过程,通常代价高、耗时且劳动密集。尽管大型语言模型 (LLMs) 已在通用领域的自然语言处理 (NLP) 任务中展示了可作为人工标注者直接替代者的潜力,但它们在需要专家知识的领域中的标注任务上的有效性仍未被充分探索。在本文中,我们探讨:那些在学术和职业基准测试中可能被认为具有专家级熟练度的顶级 LLMs,能否作为人工专家标注者的直接替代?为此,我们评估了单个 LLMs 与多智能体方法在三个高度专业化领域——金融、生物医学和法律——的表现。具体而言,我们提出了一个多智能体讨论框架来模拟一组人工标注者,其中 LLMs 被要求在最终确定标签之前通过考虑他人的标注和理由来参与讨论。此外,我们还引入了推理模型(如 o3-mini)以实现更全面的比较。 我们的实证结果表明:(1)单个 LLMs 在推理时使用技巧(例如 chain-of-thought (CoT)、自洽性等)所带来的性能提升仅为边际性甚至为负, 与先前文献中宣称的广泛有效性相反。(2)总体而言,在大多数设置下,推理模型相较于非推理模型并未表现出统计学上显著的改进。这表明在专业领域的数据标注中,较长的 CoT 所能带来的好处相对有限。(3)在多代理讨论环境中出现了某些模型行为。例如,Claude 3.7 Sonnet 在“思考”模式下很少改变其初始标注,即使其他代理提供了正确的标注或有效的推理。
Subject: Computation and Language 主题:计算与语言
Publish: 2025-08-11 10:19:10 UTC 发布:2025-08-11 10:19:10 UTC
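The abstract of #30 mentions self-consistency and a multi-agent discussion round in which agents see each other's annotations before finalizing labels. The following is only a minimal sketch of those two ideas, not the authors' framework: `self_consistency_label`, `discussion_round`, and `follow_peers` are hypothetical names, and the sampled labels stand in for real LLM outputs.

```python
from collections import Counter


def self_consistency_label(sampled_labels):
    """Majority vote over several sampled annotations of the same item.

    Ties are broken in favour of the label seen first.
    """
    if not sampled_labels:
        raise ValueError("need at least one sampled label")
    return Counter(sampled_labels).most_common(1)[0][0]


def follow_peers(own, peers):
    """Hypothetical revision rule: switch only if a strict majority of peers agree."""
    top, count = Counter(peers).most_common(1)[0]
    return top if count > len(peers) / 2 else own


def discussion_round(labels, revise):
    """One discussion round: every agent sees its peers' labels and may revise."""
    return [
        revise(own, labels[:i] + labels[i + 1:])
        for i, own in enumerate(labels)
    ]


if __name__ == "__main__":
    print(self_consistency_label(["risk", "risk", "no-risk"]))
    print(discussion_round(["pos", "neg", "neg"], follow_peers))
```

Passing `lambda own, peers: own` as `revise` models an agent that never changes its initial annotation, which is the kind of behavior the abstract reports for one model.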
#31 Evaluating Compositional Approaches for Focus and Sentiment Analysis 评估组合性方法在焦点与情感分析中的应用
Authors: [Olga Kellert](https://arxiv.org/search/?searchtype=author&query=Olga Kellert), [Muhammad Imran](https://arxiv.org/search/?searchtype=author&query=Muhammad Imran), [Nicholas Hill Matlis](https://arxiv.org/search/?searchtype=author&query=Nicholas Hill Matlis), [Mahmud Uz Zaman](https://arxiv.org/search/?searchtype=author&query=Mahmud Uz Zaman), [Carlos Gómez-Rodríguez](https://arxiv.org/search/?searchtype=author&query=Carlos Gómez-Rodríguez)
This paper summarizes the results of evaluating a compositional approach for Focus Analysis (FA) in Linguistics and Sentiment Analysis (SA) in Natural Language Processing (NLP). While quantitative evaluations of compositional and non-compositional approaches in SA exist in NLP, similar quantitative evaluations are very rare in FA in Linguistics that deal with linguistic expressions representing focus or emphasis such as “it was John who left”. We fill this gap in research by arguing that compositional rules in SA also apply to FA because FA and SA are closely related, meaning that SA is part of FA. Our compositional approach in SA exploits basic syntactic rules such as rules of modification, coordination, and negation represented in the formalism of Universal Dependencies (UDs) in English and applied to words representing sentiments from sentiment dictionaries. Some of the advantages of our compositional analysis method for SA in contrast to non-compositional analysis methods are interpretability and explainability. We test the accuracy of our compositional approach and compare it with a non-compositional approach VADER that uses simple heuristic rules to deal with negation, coordination and modification. In contrast to previous related work that evaluates compositionality in SA on long reviews, this study uses more appropriate datasets to evaluate compositionality. In addition, we generalize the results of compositional approaches in SA to compositional approaches in FA. 本文总结了对组合性方法在语言学中的焦点分析(FA)和自然语言处理(NLP)中的情感分析(SA)方面评估的结果。尽管在 NLP 中已有关于情感分析中组合性与非组合性方法的定量评估,但在语言学中针对表示焦点或强调的语言表达(如“是约翰离开的”)的焦点分析的类似定量评估却非常罕见。我们通过论证情感分析中的组合性规则同样适用于焦点分析来填补这一研究空白,因为焦点分析与情感分析密切相关,情感分析是焦点分析的一部分。我们在情感分析中采用的组合性方法利用了基本句法规则,如修饰、并列与否定规则,这些规则以英语的通用依存(UD)形式表示,并应用于来自情感词典的情感词。与非组合性分析方法相比,我们的组合性情感分析方法的部分优势在于可解释性和可说明性。我们测试了组合性方法的准确性,并将其与使用简单启发式规则处理否定、并列和修饰的非组合性方法 VADER 进行比较。与先前在长评论上评估情感分析组合性的相关工作不同,本研究使用了更合适的数据集来评估组合性。此外,我们将情感分析中组合性方法的结果推广到焦点分析中的组合性方法。
Subject: Computation and Language 主题:计算与语言
Publish: 2025-08-11 09:52:41 UTC 发布日期:2025-08-11 09:52:41 协调世界时
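Entry #31 describes composing sentiment from dictionary polarities through rules of modification, negation, and coordination. A toy sketch of that idea follows. It assumes the dependency parse has already been reduced to a head word plus its modifiers; the lexicon, the intensifier weights, and the function names are all hypothetical and are not taken from the paper.

```python
# Hypothetical miniature sentiment dictionary and modifier tables.
LEXICON = {"good": 1.0, "happy": 1.0, "bad": -1.0}
INTENSIFIERS = {"very": 1.5, "slightly": 0.5}
NEGATORS = {"not", "never"}


def compose(head, modifiers):
    """Score a head word, then apply modification and negation rules."""
    score = LEXICON.get(head, 0.0)
    # Modification: intensifiers scale the polarity of the head.
    for mod in modifiers:
        score *= INTENSIFIERS.get(mod, 1.0)
    # Negation: each negator flips the sign of the composed score.
    for mod in modifiers:
        if mod in NEGATORS:
            score = -score
    return score


def coordinate(scores):
    """Coordination: average the scores of the conjoined phrases."""
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print(compose("good", ["not", "very"]))  # "not very good"
    print(coordinate([compose("good", []), compose("bad", [])]))  # "good and bad"
```

Because every step is an explicit rule, the final score can be traced back to the words and relations that produced it, which is the interpretability advantage the abstract points to.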
#32 Can You Trick the Grader? Adversarial Persuasion of LLM Judges 你能骗过评分者吗?对 LLM 裁判的对抗性说服
Authors: [Yerin Hwang](https://arxiv.org/search/?searchtype=author&query=Yerin Hwang), [Dongryeol Lee](https://arxiv.org/search/?searchtype=author&query=Dongryeol Lee), [Taegwan Kang](https://arxiv.org/search/?searchtype=author&query=Taegwan Kang), [Yongil Kim](https://arxiv.org/search/?searchtype=author&query=Yongil Kim), [Kyomin Jung](https://arxiv.org/search/?searchtype=author&query=Kyomin Jung)
As large language models take on growing roles as automated evaluators in practical settings, a critical question arises: Can individuals persuade an LLM judge to assign unfairly high scores? This study is the first to reveal that strategically embedded persuasive language can bias LLM judges when scoring mathematical reasoning tasks, where correctness should be independent of stylistic variation. Grounded in Aristotle’s rhetorical principles, we formalize seven persuasion techniques (Majority, Consistency, Flattery, Reciprocity, Pity, Authority, Identity) and embed them into otherwise identical responses. Across six math benchmarks, we find that persuasive language leads LLM judges to assign inflated scores to incorrect solutions, by up to 8% on average, with Consistency causing the most severe distortion. Notably, increasing model size does not substantially mitigate this vulnerability. Further analysis demonstrates that combining multiple persuasion techniques amplifies the bias, and pairwise evaluation is likewise susceptible. Moreover, the persuasive effect persists under counter prompting strategies, highlighting a critical vulnerability in LLM-as-a-Judge pipelines and underscoring the need for robust defenses against persuasion-based attacks. 随着大型语言模型(LLM)在实际场景中担任越来越多的自动评分者,一个关键问题出现:个人能否说服 LLM 评分者给出不公平的高分?本研究首次揭示,在数学推理任务中(正确性本应与文体差异无关),策略性嵌入的劝说性语言会偏倚 LLM 评分者。基于亚里士多德的修辞原则,我们形式化了七种劝说技巧(多数、连贯、奉承、互惠、怜悯、权威、身份认同),并将它们嵌入到其他方面完全相同的回答中。在六个数学基准上,我们发现劝说性语言会导致 LLM 评分者对错误解答给出高估分,平均最高可达 8%,其中“连贯”造成的扭曲最为严重。值得注意的是,增大模型规模并不能显著缓解这一脆弱性。进一步分析表明,结合多种劝说技巧会放大偏差,成对评估同样易受影响。 此外,这种说服效应在对抗性提示策略下仍然存在,凸显了将 LLM 用作裁判流程中的一个关键脆弱点,并强调了需要针对基于说服的攻击建立强有力防御措施。
Subject: Computation and Language 主题:计算与语言
Publish: 2025-08-11 09:45:02 UTC 发布:2025-08-11 09:45:02 UTC
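Entry #32 measures how much an LLM judge's score rises when persuasive language is embedded into an otherwise identical response. The sketch below shows only the bookkeeping for such a comparison. The template sentences and function names are hypothetical, the judge scores are placeholder inputs rather than results from the paper, and no real judge model is called.

```python
# Hypothetical persuasion snippets keyed by technique name.
TEMPLATES = {
    "Authority": "As a professor of mathematics, I confirm this answer.",
    "Majority": "Most people who solved this problem reached the same answer.",
}


def embed_persuasion(response, technique, templates=TEMPLATES):
    """Append a persuasion snippet while leaving the solution text unchanged."""
    return f"{response} {templates[technique]}"


def mean_inflation(plain_scores, persuasive_scores):
    """Average score gain of persuasive responses over their plain counterparts."""
    if len(plain_scores) != len(persuasive_scores) or not plain_scores:
        raise ValueError("score lists must be non-empty and aligned")
    diffs = [p - q for p, q in zip(persuasive_scores, plain_scores)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    print(embed_persuasion("x = 7.", "Authority"))
    print(mean_inflation([2, 4], [3, 5]))
```

Since the underlying solution is identical in each pair, any positive value from `mean_inflation` reflects style rather than correctness.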
#33 Grove MoE: Towards Efficient and Superior MoE LLMs with Adjugate Experts Grove MoE:通过伴随专家迈向高效且更优的 MoE LLMs
Authors: [Haoyuan Wu](https://arxiv.org/search/?searchtype=author&query=Haoyuan Wu), [Haoxing Chen](https://arxiv.org/search/?searchtype=author&query=Haoxing Chen), [Xiaodong Chen](https://arxiv.org/search/?searchtype=author&query=Xiaodong Chen), [Zhanchao Zhou](https://arxiv.org/search/?searchtype=author&query=Zhanchao Zhou), [Tieyuan Chen](https://arxiv.org/search/?searchtype=author&query=Tieyuan Chen), [Yihong Zhuang](https://arxiv.org/search/?searchtype=author&query=Yihong Zhuang), [Guoshan Lu](https://arxiv.org/search/?searchtype=author&query=Guoshan Lu), [Zenan Huang](https://arxiv.org/search/?searchtype=author&query=Zenan Huang), [Junbo Zhao](https://arxiv.org/search/?searchtype=author&query=Junbo Zhao), [Lin Liu](https://arxiv.org/search/?searchtype=author&query=Lin Liu), [Zhenzhong Lan](https://arxiv.org/search/?searchtype=author&query=Zhenzhong Lan), [Bei Yu](https://arxiv.org/search/?searchtype=author&query=Bei Yu), [Jianguo Li](https://arxiv.org/search/?searchtype=author&query=Jianguo Li)
The Mixture of Experts (MoE) architecture is a cornerstone of modern state-of-the-art (SOTA) large language models (LLMs). MoE models facilitate scalability by enabling sparse parameter activation. However, traditional MoE architecture uses homogeneous experts of a uniform size, activating a fixed number of parameters irrespective of input complexity and thus limiting computational efficiency. To overcome this limitation, we introduce Grove MoE, a novel architecture incorporating experts of varying sizes, inspired by the heterogeneous big.LITTLE CPU architecture. This architecture features novel adjugate experts with a dynamic activation mechanism, enabling model capacity expansion while maintaining manageable computational overhead. Building on this architecture, we present GroveMoE-Base and GroveMoE-Inst, 33B-parameter LLMs developed by applying an upcycling strategy to the Qwen3-30B-A3B-Base model during mid-training and post-training. GroveMoE models dynamically activate 3.14-3.28B parameters based on token complexity and achieve performance comparable to SOTA open-source models of similar or even larger size. 专家混合(Mixture of Experts,MoE)架构是现代最先进大型语言模型(LLMs)的基石。MoE 模型通过启用稀疏参数激活来促进可扩展性。然而,传统的 MoE 架构使用同质且尺寸一致的专家,不论输入复杂度如何都激活固定数量的参数,从而限制了计算效率。为克服这一限制,我们引入了 Grove MoE,这是一种借鉴异构 big.LITTLE CPU 架构理念、包含不同规模专家的新型架构。该架构引入了新型的伴随专家(adjugate experts)和动态激活机制,使模型在扩展容量的同时保持可控的计算开销。基于该架构,我们提出了 GroveMoE-Base 和 GroveMoE-Inst 两款 33B 参数的 LLMs,它们是在中期训练与后期训练过程中对 Qwen3-30B-A3B-Base 模型应用回收升级策略而开发的。GroveMoE 模型根据标记复杂度动态激活 3.14–3.28B 参数,并在性能上可与相同或更大规模的开源 SOTA 模型相媲美。
Subject: Computation and Language 主题:计算与语言
Publish: 2025-08-11 09:15:36 UTC 发布时间:2025-08-11 09:15:36 UTC
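The abstract of #33 says GroveMoE activates a variable number of parameters per token but does not spell out the mechanism. The sketch below illustrates one possible reading, stated here as an assumption: experts are partitioned into groups, each group owns one shared adjugate expert, and that adjugate expert is counted once no matter how many routed experts of its group are selected. All sizes and names are hypothetical.

```python
def activated_params(selected_experts, expert_group, expert_size, adjugate_size):
    """Count parameters activated for one token under the assumed grouping scheme.

    selected_experts: expert ids chosen by the router for this token
    expert_group:     mapping from expert id to group id (assumption)
    expert_size:      parameters per routed expert (hypothetical value)
    adjugate_size:    parameters per shared adjugate expert (hypothetical value)
    """
    groups_hit = {expert_group[e] for e in selected_experts}
    return len(selected_experts) * expert_size + len(groups_hit) * adjugate_size


if __name__ == "__main__":
    groups = {0: 0, 1: 0, 2: 1, 3: 1}
    # Both selected experts share a group: one adjugate expert is activated.
    print(activated_params([0, 1], groups, expert_size=10, adjugate_size=4))
    # Selected experts come from different groups: two adjugate experts are activated.
    print(activated_params([0, 2], groups, expert_size=10, adjugate_size=4))
```

Under this reading the activated-parameter count depends on how the router's choices spread across groups, which would give a range rather than a fixed number per token.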
#34 SASST: Leveraging Syntax-Aware Chunking and LLMs for Simultaneous Speech Translation SASST:利用语法感知分块和 LLMs 实现同步语音翻译
Authors: [Zeyu Yang](https://arxiv.org/search/?searchtype=author&query=Zeyu Yang), [Lai Wei](https://arxiv.org/search/?searchtype=author&query=Lai Wei), [Roman Koshkin](https://arxiv.org/search/?searchtype=author&query=Roman Koshkin), [Xi Chen](https://arxiv.org/search/?searchtype=author&query=Xi Chen), [Satoshi Nakamura](https://arxiv.org/search/?searchtype=author&query=Satoshi Nakamura)
This work proposes a grammar-based chunking strategy that segments input streams into semantically complete units by parsing dependency relations (e.g., noun phrase boundaries, verb-object structures) and punctuation features. The method ensures chunk coherence and minimizes semantic fragmentation. Building on this mechanism, we present SASST (Syntax-Aware Simultaneous Speech Translation), an end-to-end framework integrating frozen Whisper encoder and decoder-only LLM. The unified architecture dynamically outputs translation tokens or <WAIT> symbols to jointly optimize translation timing and content, with target-side reordering addressing word-order divergence. Experiments on CoVoST2 multilingual corpus En-{De, Zh, Ja} demonstrate significant translation quality improvements across languages and validate the effectiveness of syntactic structures in LLM-driven SimulST systems.
本工作提出了一种基于语法的切分策略,通过解析依存关系(例如名词短语边界、动宾结构)和标点特征,将输入流切分为语义完整的单元。该方法保证了切分的一致性并将语义碎片化降至最低。在此机制基础上,我们提出了 SASST(Syntax-Aware Simultaneous Speech Translation),一种端到端框架,集成了冻结的 Whisper 编码器和仅解码器的 LLM。该统一架构动态输出翻译词元或 <WAIT> 符号,以联合优化翻译时机和内容,目标端重排用于处理语序差异。在 CoVoST2 多语语料 En-{De, Zh, Ja} 上的实验表明,该方法在各语言上均显著提升了翻译质量,并验证了句法结构在以 LLM 驱动的 SimulST 系统中的有效性。
Subject: Computation and Language 主题:计算与语言
Publish: 2025-08-11 09:13:35 UTC 发布:2025-08-11 09:13:35 UTC
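Entry #34 describes a policy that emits either translation tokens or a `<WAIT>` symbol as the source stream arrives. The sketch below imitates that read/write behavior in a heavily simplified form: it uses punctuation alone as the chunk boundary (the paper also uses dependency relations), and `translate` is a hypothetical callback standing in for the LLM.

```python
# Simplifying assumption: a chunk is complete when a punctuation token arrives.
BOUNDARY = {",", ".", "?", "!", ";"}


def simulate_policy(tokens, translate):
    """Emit "<WAIT>" until a chunk boundary, then emit the chunk's translation."""
    buffer, actions = [], []
    for tok in tokens:
        buffer.append(tok)
        if tok in BOUNDARY:
            actions.append(translate(buffer))
            buffer = []
        else:
            actions.append("<WAIT>")
    if buffer:  # flush an unfinished chunk at end of stream
        actions.append(translate(buffer))
    return actions


if __name__ == "__main__":
    source = ["I", "agree", ",", "but", "wait", "."]
    # Upper-casing is a stand-in for an actual translation step.
    print(simulate_policy(source, lambda chunk: " ".join(chunk).upper()))
```

Waiting for a complete unit trades latency for coherence: fewer writes happen early, but each write covers a chunk that can be translated without seeing later context.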
#35 Exploring Causal Effect of Social Bias on Faithfulness Hallucinations in Large Language Models 探索社会偏见对大型语言模型忠实性幻觉的因果影响
Authors: [Zhenliang Zhang](https://arxiv.org/search/?searchtype=author&query=Zhenliang Zhang), [Junzhe Zhang](https://arxiv.org/search/?searchtype=author&query=Junzhe Zhang), [Xinyu Hu](https://arxiv.org/search/?searchtype=author&query=Xinyu Hu), [HuiXuan Zhang](https://arxiv.org/search/?searchtype=author&query=HuiXuan Zhang), [Xiaojun Wan](https://arxiv.org/search/?searchtype=author&query=Xiaojun Wan) 作者:张振良,张俊喆,胡欣宇,张慧萱,万晓军
Large language models (LLMs) have achieved remarkable success in various tasks, yet they remain vulnerable to faithfulness hallucinations, where the output does not align with the input. In this study, we investigate whether social bias contributes to these hallucinations, a causal relationship that has not been explored. A key challenge is controlling confounders within the context, which complicates the isolation of causality between bias states and hallucinations. To address this, we utilize the Structural Causal Model (SCM) to establish and validate the causality and design bias interventions to control confounders. In addition, we develop the Bias Intervention Dataset (BID), which includes various social biases, enabling precise measurement of causal effects. Experiments on mainstream LLMs reveal that biases are significant causes of faithfulness hallucinations, and the effect of each bias state differs in direction. We further analyze the scope of these causal effects across various models, specifically focusing on unfairness hallucinations, which are primarily targeted by social bias, revealing the subtle yet significant causal effect of bias on hallucination generation. 大型语言模型 (LLMs) 在各种任务中取得了显著成功,但仍然容易出现忠实性幻觉(faithfulness hallucinations),即输出与输入不一致。在本研究中,我们探讨了社会偏见是否促成了这些幻觉,而这一因果关系此前尚未被探索。一个关键挑战是在上下文中控制混淆变量,这使得隔离偏见状态与幻觉之间的因果性变得复杂。为了解决这一问题,我们利用结构因果模型(SCM)来建立并验证因果关系,并设计偏见干预以控制混淆变量。此外,我们开发了偏见干预数据集(Bias Intervention Dataset, BID),其中包含多种社会偏见,从而可以精确测量因果效应。在主流 LLMs 上的实验表明,偏见是导致忠实性幻觉的重要原因,并且每种偏见状态的作用方向各不相同。我们进一步分析了这些因果效应在不同模型中的作用范围,特别关注主要由社会偏见针对的不公平幻觉(unfairness hallucinations),揭示了偏见对幻觉生成的微妙但显著的因果影响。
Subject: Computation and Language 主题:计算与语言
Publish: 2025-08-11 08:34:28 UTC 发布:2025-08-11 08:34:28 协调世界时(UTC)
#36 What am I missing here?: Evaluating Large Language Models for Masked Sentence Prediction 我在这里遗漏了什么?:评估用于被掩盖句子预测的大型语言模型
Authors: [Charlie Wyatt](https://arxiv.org/search/?searchtype=author&query=Charlie Wyatt), [Aditya Joshi](https://arxiv.org/search/?searchtype=author&query=Aditya Joshi), [Flora Salim](https://arxiv.org/search/?searchtype=author&query=Flora Salim) 作者:Charlie Wyatt, Aditya Joshi, Flora Salim
Transformer-based models primarily rely on Next Token Prediction (NTP), which predicts the next token in a sequence based on the preceding context. However, NTP’s focus on single-token prediction often limits a model’s ability to plan ahead or maintain long-range coherence, raising questions about how well LLMs can predict longer contexts, such as full sentences within structured documents. While NTP encourages local fluency, it provides no explicit incentive to ensure global coherence across sentence boundaries-an essential skill for reconstructive or discursive tasks. To investigate this, we evaluate three commercial LLMs (GPT-4o, Claude 3.5 Sonnet, and Gemini 2.0 Flash) on Masked Sentence Prediction (MSP) - the task of infilling a randomly removed sentence - from three domains: ROCStories (narrative), Recipe1M (procedural), and Wikipedia (expository). We assess both fidelity (similarity to the original sentence) and cohesiveness (fit within the surrounding context). Our key finding reveals that commercial LLMs, despite their superlative performance in other tasks, are poor at predicting masked sentences in low-structured domains, highlighting a gap in current model capabilities. 基于 Transformer 的模型主要依赖于下一个标记预测(Next Token Prediction, NTP),该方法根据前面的上下文预测序列中的下一个标记。然而,NTP 对单标记预测的专注常常限制了模型的前瞻性规划能力或保持远程连贯性的能力,这就引发了对 LLMs 能否预测更长上下文(例如结构化文档中的完整句子)的质疑。尽管 NTP 鼓励局部流畅性,但它并没有明确激励跨句子边界确保全局连贯性——而这正是重构或论述性任务所必需的技能。为此,我们在掩码句子预测(Masked Sentence Prediction, MSP)任务上评估了三种商业 LLMs(GPT-4o、Claude 3.5 Sonnet 和 Gemini 2.0 Flash)——该任务是在三个领域中对随机移除的句子进行填充:ROCStories(叙事)、Recipe1M(程序性)和 Wikipedia(说明性)。我们评估了忠实度(与原句的相似性)和连贯性(与周围上下文的契合度)。我们的主要发现显示,尽管这些商业 LLMs 在其他任务上表现卓越,但在结构松散的领域预测被掩盖句子方面表现不佳,凸显了当前模型能力的一个缺口。
Subject: Computation and Language 主题:计算与语言
Publish: 2025-08-11 07:25:50 UTC 发布时间:2025-08-11 07:25:50 协调世界时(UTC)
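Entry #36 defines Masked Sentence Prediction as infilling a randomly removed sentence and scores predictions for fidelity to the original. The sketch below shows how such an instance could be built and gives one crude fidelity proxy. The `<MASK>` token, the function names, and the use of token-level Jaccard overlap are assumptions for illustration; the abstract does not state which similarity measure the authors use.

```python
import random


def mask_sentence(sentences, rng, mask_token="<MASK>"):
    """Replace one randomly chosen sentence with a mask token.

    Returns the masked document, the masked position, and the held-out sentence.
    """
    idx = rng.randrange(len(sentences))
    masked = sentences[:idx] + [mask_token] + sentences[idx + 1:]
    return masked, idx, sentences[idx]


def jaccard(prediction, reference):
    """Token-overlap similarity between two sentences (a crude fidelity proxy)."""
    a, b = set(prediction.lower().split()), set(reference.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0


if __name__ == "__main__":
    story = ["Mia found a kitten.", "She took it home.", "Now they are friends."]
    masked, idx, target = mask_sentence(story, random.Random(0))
    print(masked, idx, target)
    print(jaccard("She brought it home.", "She took it home."))
```

Passing a seeded `random.Random` keeps the masked position reproducible across runs, which matters when comparing several models on the same instances.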
#37 LoSemB: Logic-Guided Semantic Bridging for Inductive Tool Retrieval LoSemB:用于归纳性工具检索的逻辑引导语义桥接
Authors: [Luyao Zhuang](https://arxiv.org/search/?searchtype=author&query=Luyao Zhuang), [Qinggang Zhang](https://arxiv.org/search/?searchtype=author&query=Qinggang Zhang), [Huachi Zhou](https://arxiv.org/search/?searchtype=author&query=Huachi Zhou), [Juhua Liu](https://arxiv.org/search/?searchtype=author&query=Juhua Liu), [Qing Li](https://arxiv.org/search/?searchtype=author&query=Qing Li), [Xiao Huang](https://arxiv.org/search/?searchtype=author&query=Xiao Huang)
Tool learning has emerged as a promising paradigm for large language models (LLMs) to solve many real-world tasks. Nonetheless, with the tool repository rapidly expanding, it is impractical to contain all tools within the limited input length of LLMs. To alleviate these issues, researchers have explored incorporating a tool retrieval module to select the most relevant tools or represent tools as unique tokens within LLM parameters. However, most state-of-the-art methods are under transductive settings, assuming all tools have been observed during training. Such a setting deviates from reality as the real-world tool repository is evolving and incorporates new tools frequently. When dealing with these unseen tools, which refer to tools not encountered during the training phase, these methods are limited by two key issues, including the large distribution shift and the vulnerability of similarity-based retrieval. To this end, inspired by human cognitive processes of mastering unseen tools through discovering and applying the logical information from prior experience, we introduce a novel Logic-Guided Semantic Bridging framework for inductive tool retrieval, namely, LoSemB, which aims to mine and transfer latent logical information for inductive tool retrieval without costly retraining. Specifically, LoSemB contains a logic-based embedding alignment module to mitigate distribution shifts and implements a relational augmented retrieval mechanism to reduce the vulnerability of similarity-based retrieval. Extensive experiments demonstrate that LoSemB achieves advanced performance in inductive settings while maintaining desirable effectiveness in the transductive setting. 
工具学习已成为大型语言模型(LLMs)解决许多现实任务的有前景范式。然而,随着工具库快速扩展,将所有工具放入 LLMs 有限的输入长度中并不现实。为缓解这些问题,研究人员已探索引入工具检索模块以选择最相关的工具,或将工具表示为 LLM 参数中的独特标记。然而,大多数最先进的方法都处于转导设置下,假设所有工具在训练期间都已被观察到。这种设置偏离现实,因为现实世界的工具库在不断演进并频繁引入新工具。在处理这些未见过的工具(指训练阶段未遇到的工具)时,这些方法受制于两个关键问题:大的分布偏移和基于相似度的检索脆弱性。为此,受人类通过从既有经验中发现并应用逻辑信息来掌握未知工具的认知过程启发,我们提出了一种用于归纳式工具检索的新颖逻辑引导语义桥接框架,称为 LoSemB,旨在在无需昂贵重训练的情况下挖掘并转移潜在的逻辑信息以用于归纳式工具检索。具体而言,LoSemB 包含一个基于逻辑的嵌入对齐模块以缓解分布偏移,并实现了一个关系增强的检索机制以降低基于相似度检索的脆弱性。大量实验表明,LoSemB 在归纳设置中取得了先进的性能,同时在转导设置中也保持了理想的有效性。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-11 07:07:18 UTC 发布:2025-08-11 07:07:18 UTC
#38 InterChart: Benchmarking Visual Reasoning Across Decomposed and Distributed Chart Information InterChart:在分解和分布式图表信息上的可视化推理基准测试
Authors: [Anirudh Iyengar Kaniyar Narayana Iyengar](https://arxiv.org/search/?searchtype=author&query=Anirudh Iyengar Kaniyar Narayana Iyengar), [Srija Mukhopadhyay](https://arxiv.org/search/?searchtype=author&query=Srija Mukhopadhyay), [Adnan Qidwai](https://arxiv.org/search/?searchtype=author&query=Adnan Qidwai), [Shubhankar Singh](https://arxiv.org/search/?searchtype=author&query=Shubhankar Singh), [Dan Roth](https://arxiv.org/search/?searchtype=author&query=Dan Roth), [Vivek Gupta](https://arxiv.org/search/?searchtype=author&query=Vivek Gupta)
We introduce InterChart, a diagnostic benchmark that evaluates how well vision-language models (VLMs) reason across multiple related charts, a task central to real-world applications such as scientific reporting, financial analysis, and public policy dashboards. Unlike prior benchmarks focusing on isolated, visually uniform charts, InterChart challenges models with diverse question types ranging from entity inference and trend correlation to numerical estimation and abstract multi-step reasoning grounded in 2-3 thematically or structurally related charts. We organize the benchmark into three tiers of increasing difficulty: (1) factual reasoning over individual charts, (2) integrative analysis across synthetically aligned chart sets, and (3) semantic inference over visually complex, real-world chart pairs. Our evaluation of state-of-the-art open and closed-source VLMs reveals consistent and steep accuracy declines as chart complexity increases. We find that models perform better when we decompose multi-entity charts into simpler visual units, underscoring their struggles with cross-chart integration. By exposing these systematic limitations, InterChart provides a rigorous framework for advancing multimodal reasoning in complex, multi-visual environments. 我们提出了 InterChart,这是一个诊断性基准,用于评估视觉-语言模型(VLMs)在多个相关图表之间进行推理的能力,这项任务在科学报告、财务分析和公共政策仪表盘等真实应用中至关重要。与以往关注孤立且视觉上单一图表的基准不同,InterChart 通过多样的问题类型对模型提出挑战,问题类型包括实体推断、趋势关联、数值估算以及基于 2-3 个主题或结构相关图表的抽象多步推理。我们将基准划分为三个难度逐渐增加的层级:(1)对单个图表的事实性推理,(2)在合成对齐的图表集合上进行的综合分析,和(3)针对视觉复杂的真实世界图表对进行的语义推断。我们对最先进的开源和闭源 VLMs 的评估显示,随着图表复杂性的增加,准确率出现持续且陡峭的下降。我们发现,当将多实体图表分解为更简单的视觉单元时,模型表现更好,这凸显了它们在跨图表整合方面的困难。 通过揭示这些系统性限制,InterChart 为在复杂的多视觉环境中推进多模态推理提供了一个严谨的框架。
Subjects: Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition 主题:计算与语言、人工智能、计算机视觉与模式识别
Publish: 2025-08-11 05:19:23 UTC 发布:2025-08-11 05:19:23 UTC
#39 Keyword-Centric Prompting for One-Shot Event Detection with Self-Generated Rationale Enhancements 关键词驱动提示:用于一次性事件检测并辅以自生成推理增强
Authors: [Ziheng Li](https://arxiv.org/search/?searchtype=author&query=Ziheng Li), [Zhi-Hong Deng](https://arxiv.org/search/?searchtype=author&query=Zhi-Hong Deng) 作者:李子衡,邓志宏
Although the LLM-based in-context learning (ICL) paradigm has demonstrated considerable success across various natural language processing tasks, it encounters challenges in event detection. This is because LLMs lack an accurate understanding of event triggers and tend to make over-interpretation, which cannot be effectively corrected through in-context examples alone. In this paper, we focus on the most challenging one-shot setting and propose KeyCP++, a keyword-centric chain-of-thought prompting approach. KeyCP++ addresses the weaknesses of conventional ICL by automatically annotating the logical gaps between input text and detection results for the demonstrations. Specifically, to generate in-depth and meaningful rationale, KeyCP++ constructs a trigger discrimination prompting template. It incorporates the exemplary triggers (a.k.a keywords) into the prompt as the anchor to simply trigger profiling, let LLM propose candidate triggers, and justify each candidate. These propose-and-judge rationales help LLMs mitigate over-reliance on the keywords and promote detection rule learning. Extensive experiments demonstrate the effectiveness of our approach, showcasing significant advancements in one-shot event detection. 尽管基于 LLM 的上下文内学习(ICL)范式在各种自然语言处理任务上取得了显著成功,但在事件检测方面仍面临挑战。这是因为 LLMs 缺乏对事件触发词的准确理解,且倾向于过度解释,仅通过上下文示例无法有效纠正这种问题。在本文中,我们聚焦最具挑战性的一次性学习(one-shot)设置,提出了 KeyCP++,一种以关键词为中心的链式思维提示(chain-of-thought prompting)方法。KeyCP++通过自动为示例注释输入文本与检测结果之间的逻辑缺口,来弥补传统 ICL 的不足。具体而言,为了生成深入且有意义的推理,KeyCP++构建了一个触发词判别提示模板。它将示例触发词(即关键词)作为提示中的锚点来简化触发词画像,让 LLM 提出候选触发词并对每个候选词作出论证。这些“提出-判断”的推理帮助 LLMs 减轻对关键词的过度依赖,促进检测规则的学习。 大量实验表明我们方法的有效性,展示了在一次性事件检测方面的显著进展。
Subject: Computation and Language 主题:计算与语言
Publish: 2025-08-11 03:58:35 UTC 发布:2025-08-11 03:58:35 UTC
#40 IBPS: Indian Bail Prediction System IBPS:印度保释预测系统
Authors: [Puspesh Kumar Srivastava](https://arxiv.org/search/?searchtype=author&query=Puspesh Kumar Srivastava), [Uddeshya Raj](https://arxiv.org/search/?searchtype=author&query=Uddeshya Raj), [Praveen Patel](https://arxiv.org/search/?searchtype=author&query=Praveen Patel), [Shubham Kumar Nigam](https://arxiv.org/search/?searchtype=author&query=Shubham Kumar Nigam), [Noel Shallum](https://arxiv.org/search/?searchtype=author&query=Noel Shallum), [Arnab Bhattacharya](https://arxiv.org/search/?searchtype=author&query=Arnab Bhattacharya) 作者:Puspesh Kumar Srivastava, Uddeshya Raj, Praveen Patel, Shubham Kumar Nigam, Noel Shallum, Arnab Bhattacharya
Bail decisions are among the most frequently adjudicated matters in Indian courts, yet they remain plagued by subjectivity, delays, and inconsistencies. With over 75% of India’s prison population comprising undertrial prisoners, many from socioeconomically disadvantaged backgrounds, the lack of timely and fair bail adjudication exacerbates human rights concerns and contributes to systemic judicial backlog. In this paper, we present the Indian Bail Prediction System (IBPS), an AI-powered framework designed to assist in bail decision-making by predicting outcomes and generating legally sound rationales based solely on factual case attributes and statutory provisions. We curate and release a large-scale dataset of 150,430 High Court bail judgments, enriched with structured annotations such as age, health, criminal history, crime category, custody duration, statutes, and judicial reasoning. We fine-tune a large language model using parameter-efficient techniques and evaluate its performance across multiple configurations, with and without statutory context, and with RAG. Our results demonstrate that models fine-tuned with statutory knowledge significantly outperform baselines, achieving strong accuracy and explanation quality, and generalize well to a test set independently annotated by legal experts. IBPS offers a transparent, scalable, and reproducible solution to support data-driven legal assistance, reduce bail delays, and promote procedural fairness in the Indian judicial system. 保释决定是印度法院中最常审理的事项之一,但仍然存在主观性、拖延和不一致的问题。印度监狱人口中超过 75%为未审先判的被拘留者,许多来自社会经济弱势群体,缺乏及时公正的保释裁决加剧了人权问题并导致司法积压。在本文中,我们提出了印度保释预测系统(IBPS),这是一个由人工智能驱动的框架,旨在通过仅基于事实性案件属性和法定条文来预测结果并生成符合法律要求的理由,以辅助保释决策。我们策划并发布了一个包含 150,430 份高等法院保释判决的大规模数据集,数据集丰富地包含结构化注释,如年龄、健康状况、犯罪记录、罪行类别、羁押时长、适用法条和法官推理。我们使用参数高效技术对大型语言模型进行微调,并在多种配置下评估其性能,包括有无法定背景信息以及结合检索增强生成(RAG)的情形。 我们的结果表明,经过法规知识微调的模型显著优于基线模型,取得了较高的准确性和解释质量,并能很好地推广到由法律专家独立标注的测试集。IBPS 提供了一种透明、可扩展且可复现的解决方案,以支持数据驱动的法律援助、减少保释延误并促进印度司法体系的程序公平。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-11 03:44:17 UTC 发布:2025-08-11 03:44:17 UTC
#41 From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR 从试错到改进:对 RLVR 中 LLM 探索机制的系统性分析
Authors: [Jia Deng](https://arxiv.org/search/?searchtype=author&query=Jia Deng), [Jie Chen](https://arxiv.org/search/?searchtype=author&query=Jie Chen), [Zhipeng Chen](https://arxiv.org/search/?searchtype=author&query=Zhipeng Chen), [Daixuan Cheng](https://arxiv.org/search/?searchtype=author&query=Daixuan Cheng), [Fei Bai](https://arxiv.org/search/?searchtype=author&query=Fei Bai), [Beichen Zhang](https://arxiv.org/search/?searchtype=author&query=Beichen Zhang), [Yinqian Min](https://arxiv.org/search/?searchtype=author&query=Yinqian Min), [Yanzipeng Gao](https://arxiv.org/search/?searchtype=author&query=Yanzipeng Gao), [Wayne Xin Zhao](https://arxiv.org/search/?searchtype=author&query=Wayne Xin Zhao), [Ji-Rong Wen](https://arxiv.org/search/?searchtype=author&query=Ji-Rong Wen) 作者:Jia Deng,Jie Chen,Zhipeng Chen,Daixuan Cheng,Fei Bai,Beichen Zhang,Yinqian Min,Yanzipeng Gao,Wayne Xin Zhao,Ji-Rong Wen
Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). Unlike traditional RL approaches, RLVR leverages rule-based feedback to guide LLMs in generating and refining complex reasoning chains, a process critically dependent on effective exploration strategies. While prior work has demonstrated RLVR’s empirical success, the fundamental mechanisms governing LLMs’ exploration behaviors remain underexplored. This technical report presents a systematic investigation of exploration capacities in RLVR, covering four main aspects: (1) exploration space shaping, where we develop quantitative metrics to characterize LLMs’ capability boundaries; (2) entropy-performance exchange, analyzed across training stages, individual instances, and token-level patterns; and (3) RL performance optimization, examining methods to effectively translate exploration gains into measurable improvements. By unifying previously identified insights with new empirical evidence, this work aims to provide a foundational framework for advancing RLVR systems.
Subject: Computation and Language
Publish: 2025-08-11 01:26:16 UTC
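The token-level part of the entropy-performance analysis in #41 rests on the Shannon entropy of a model's next-token distribution. The following is a minimal sketch, assuming the distribution is available as a plain list of probabilities; the function name `token_entropy` and the example distributions are illustrative and do not come from the report.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Terms with p == 0 are skipped: the limit of p * log(p) as p -> 0 is 0.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical distributions over a four-token vocabulary.
uniform = token_entropy([0.25, 0.25, 0.25, 0.25])  # maximally uncertain
peaked = token_entropy([1.0, 0.0, 0.0, 0.0])       # fully deterministic
```

A position with high entropy is one where the policy is still spreading probability over several continuations, which is the sense in which entropy is commonly read as a proxy for exploration.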
#42 Word Clouds as Common Voices: LLM-Assisted Visualization of Participant-Weighted Themes in Qualitative Interviews
Authors: [Joseph T. Colonel](https://arxiv.org/search/?searchtype=author&query=Joseph T. Colonel), [Baihan Lin](https://arxiv.org/search/?searchtype=author&query=Baihan Lin)
Word clouds are a common way to summarize qualitative interviews, yet traditional frequency-based methods often fail in conversational contexts: they surface filler words, ignore paraphrase, and fragment semantically related ideas. This limits their usefulness in early-stage analysis, when researchers need fast, interpretable overviews of what participants actually said. We introduce ThemeClouds, an open-source visualization tool that uses large language models (LLMs) to generate thematic, participant-weighted word clouds from dialogue transcripts. The system prompts an LLM to identify concept-level themes across a corpus and then counts how many unique participants mention each topic, yielding a visualization grounded in breadth of mention rather than raw term frequency. Researchers can customize prompts and visualization parameters, providing transparency and control. Using interviews from a user study comparing five recording-device configurations (31 participants; 155 transcripts, Whisper ASR), our approach surfaces more actionable device concerns than frequency clouds and topic-modeling baselines (e.g., LDA, BERTopic). We discuss design trade-offs for integrating LLM assistance into qualitative workflows, implications for interpretability and researcher agency, and opportunities for interactive analyses such as per-condition contrasts ("diff clouds").
Subjects: Computation and Language, Artificial Intelligence, Human-Computer Interaction
Publish: 2025-08-11 00:27:52 UTC
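The participant-weighted counting described in the ThemeClouds abstract can be illustrated with a short sketch. It assumes the LLM theme-identification step has already produced (participant, theme) pairs; the function name `participant_weighted_counts` and the sample mentions are hypothetical and are not taken from the tool itself.

```python
from collections import defaultdict

def participant_weighted_counts(mentions):
    """Count how many unique participants mention each theme.

    `mentions` is an iterable of (participant_id, theme) pairs. Repeated
    mentions of a theme by the same participant are counted once.
    """
    participants_by_theme = defaultdict(set)
    for participant_id, theme in mentions:
        participants_by_theme[theme].add(participant_id)
    return {theme: len(ids) for theme, ids in participants_by_theme.items()}

# Hypothetical output of the theme-identification step.
mentions = [
    ("p1", "battery life"),
    ("p1", "battery life"),      # same participant repeating a theme
    ("p2", "battery life"),
    ("p2", "microphone noise"),
]
weights = participant_weighted_counts(mentions)
print(weights)
```

Raw frequency would score "battery life" at 3 here, while the participant-weighted count is 2, which is the breadth-of-mention quantity the abstract describes.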
#43 Augmenting Bias Detection in LLMs Using Topological Data Analysis
Authors: [Keshav Varadarajan](https://arxiv.org/search/?searchtype=author&query=Keshav Varadarajan), [Tananun Songdechakraiwut](https://arxiv.org/search/?searchtype=author&query=Tananun Songdechakraiwut)
Recently, many bias detection methods have been proposed to determine the level of bias a large language model captures. However, tests to identify which parts of a large language model are responsible for bias towards specific groups remain underdeveloped. In this study, we present a method using topological data analysis to identify which heads in GPT-2 contribute to the misrepresentation of identity groups present in the StereoSet dataset. We find that biases for particular categories, such as gender or profession, are concentrated in attention heads that act as hot spots. The metric we propose can also be used to determine which heads capture bias for a specific group within a bias category, and future work could extend this method to help de-bias large language models.
Subject: Computation and Language
Publish: 2025-08-11 00:19:47 UTC
#44 ALOPE: Adaptive Layer Optimization for Translation Quality Estimation using Large Language Models
Authors: [Archchana Sindhujan](https://arxiv.org/search/?searchtype=author&query=Archchana Sindhujan), [Shenbin Qian](https://arxiv.org/search/?searchtype=author&query=Shenbin Qian), [Chan Chi Chun Matthew](https://arxiv.org/search/?searchtype=author&query=Chan Chi Chun Matthew), [Constantin Orasan](https://arxiv.org/search/?searchtype=author&query=Constantin Orasan), [Diptesh Kanojia](https://arxiv.org/search/?searchtype=author&query=Diptesh Kanojia)
Large Language Models (LLMs) have shown remarkable performance across a wide range of natural language processing tasks. Quality Estimation (QE) for Machine Translation (MT), which assesses the quality of a source-target pair without relying on reference translations, remains a challenging cross-lingual task for LLMs. The challenges stem from the inherent limitations of existing LLM-based QE systems, which are pre-trained for causal language modelling rather than regression-specific tasks, further exacerbated by the presence of low-resource languages, given the pre-training data distribution. This paper introduces ALOPE, an adaptive layer-optimization framework designed to enhance LLM-based QE by restructuring Transformer representations through layer-wise adaptation for improved regression-based prediction. Our framework integrates low-rank adapters (LoRA) with regression task heads, leveraging selected pre-trained Transformer layers for improved cross-lingual alignment. In addition to the layer-specific adaptation, ALOPE introduces two strategies: dynamic weighting, which adaptively combines representations from multiple layers, and multi-head regression, which aggregates regression losses from multiple heads for QE. Our framework shows improvements over various existing LLM-based QE approaches. Empirical evidence suggests that intermediate Transformer layers in LLMs provide contextual representations that are more aligned with the cross-lingual nature of the QE task. We make resultant models and framework code publicly available for further research, also allowing existing LLM-based MT frameworks to be scaled with QE capabilities.
Subjects: Computation and Language, Artificial Intelligence
Publish: 2025-08-10 20:59:44 UTC
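The dynamic-weighting idea in #44, adaptively combining representations from several layers, can be sketched as a softmax-weighted sum. This is only one plausible reading: the abstract does not state how the weights are parameterized, so the softmax over per-layer scores, the function names, and the toy vectors below are all assumptions rather than ALOPE's actual implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combine_layers(layer_vectors, layer_scores):
    """Weighted sum of per-layer vectors; weights are softmax(layer_scores).

    In a trained model the scores would be learnable parameters; here they
    are plain numbers supplied by the caller (an assumption for illustration).
    """
    weights = softmax(layer_scores)
    dim = len(layer_vectors[0])
    return [
        sum(w * vec[i] for w, vec in zip(weights, layer_vectors))
        for i in range(dim)
    ]

# Two hypothetical layer representations with equal scores: a plain average.
combined = combine_layers([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
```

With equal scores the combination reduces to the mean of the layers; raising one score shifts the output toward that layer's representation.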
#45 Positional Biases Shift as Inputs Approach Context Window Limits
Authors: [Blerta Veseli](https://arxiv.org/search/?searchtype=author&query=Blerta Veseli), [Julian Chibane](https://arxiv.org/search/?searchtype=author&query=Julian Chibane), [Mariya Toneva](https://arxiv.org/search/?searchtype=author&query=Mariya Toneva), [Alexander Koller](https://arxiv.org/search/?searchtype=author&query=Alexander Koller)
Large Language Models (LLMs) often struggle to use information across long inputs effectively. Prior work has identified positional biases, such as the Lost in the Middle (LiM) effect, where models perform better when information appears at the beginning (primacy bias) or end (recency bias) of the input, rather than in the middle. However, long-context studies have not consistently replicated these effects, raising questions about their intensity and the conditions under which they manifest. To address this, we conducted a comprehensive analysis using relative rather than absolute input lengths, defined with respect to each model’s context window. Our findings reveal that the LiM effect is strongest when inputs occupy up to 50% of a model’s context window. Beyond that, the primacy bias weakens, while recency bias remains relatively stable. This effectively eliminates the LiM effect; instead, we observe a distance-based bias, where model performance is better when relevant information is closer to the end of the input. Furthermore, our results suggest that successful retrieval is a prerequisite for reasoning in LLMs, and that the observed positional biases in reasoning are largely inherited from retrieval. These insights have implications for long-context tasks, the design of future LLM benchmarks, and evaluation methodologies for LLMs handling extended inputs.
Subject: Computation and Language
Publish: 2025-08-10 20:40:24 UTC
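The distinction in #45 between relative and absolute input length comes down to a single ratio. The sketch below assumes token counts are already known; the function name `relative_length` and the window sizes are illustrative and are not the paper's experimental settings.

```python
def relative_length(num_input_tokens, context_window):
    """Fraction of a model's context window occupied by the input."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    return num_input_tokens / context_window

# The same 16,000-token input is "long" for one model and "short" for another.
small_window = relative_length(16_000, 32_000)    # half the window
large_window = relative_length(16_000, 128_000)   # one eighth of the window
```

Comparing models at equal relative length, rather than equal token count, is what lets the abstract state its finding in terms of a 50% occupancy threshold.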
#46 Let's Revise Step-by-Step: A Unified Local Search Framework for Code Generation with LLMs
Authors: [Zhiyi Lyu](https://arxiv.org/search/?searchtype=author&query=Zhiyi Lyu), [Jianguo Huang](https://arxiv.org/search/?searchtype=author&query=Jianguo Huang), [Yanchen Deng](https://arxiv.org/search/?searchtype=author&query=Yanchen Deng), [Steven Hoi](https://arxiv.org/search/?searchtype=author&query=Steven Hoi), [Bo An](https://arxiv.org/search/?searchtype=author&query=Bo An)
Large Language Models (LLMs) with inference-time scaling techniques show promise for code generation, yet face notable efficiency and scalability challenges. Construction-based tree-search methods suffer from rapid growth in tree size, high token consumption, and the lack of an anytime property. In contrast, improvement-based methods offer better performance but often struggle with uninformative reward signals and inefficient search strategies. In this work, we propose ReLoc, a unified local search framework which effectively performs step-by-step code revision. Specifically, ReLoc explores a series of local revisions through four key algorithmic components: initial code drafting, neighborhood code generation, candidate evaluation, and incumbent code updating, each of which can be instantiated with specific decision rules to realize different local search algorithms such as Hill Climbing (HC) or Genetic Algorithm (GA). Furthermore, we develop a specialized revision reward model that evaluates code quality based on revision distance to produce fine-grained preferences that guide the local search toward more promising candidates. Finally, our extensive experimental results demonstrate that our approach achieves superior performance across diverse code generation tasks, significantly outperforming both construction-based tree search as well as the state-of-the-art improvement-based code generation methods.
Subject: Computation and Language
Publish: 2025-08-10 17:11:56 UTC
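The four components named in the ReLoc abstract (initial drafting, neighborhood generation, candidate evaluation, incumbent updating) map onto a generic hill-climbing loop. The sketch below is a toy instance, not ReLoc's implementation: the LLM calls and the revision reward model are replaced by hypothetical stand-in functions over integers, and every name here is illustrative.

```python
def hill_climb(draft, neighbors, score, max_steps=10):
    """Generic hill climbing built from four pluggable components.

    draft()        -> initial candidate          (initial drafting)
    neighbors(c)   -> list of revised candidates (neighborhood generation)
    score(c)       -> float, higher is better    (candidate evaluation)
    The loop body applies the incumbent-update rule.
    """
    incumbent = draft()
    incumbent_score = score(incumbent)
    for _ in range(max_steps):
        candidates = neighbors(incumbent)
        if not candidates:
            break
        best = max(candidates, key=score)
        best_score = score(best)
        if best_score <= incumbent_score:
            break  # local optimum: no revision improves on the incumbent
        incumbent, incumbent_score = best, best_score
    return incumbent, incumbent_score

# Toy stand-ins: candidates are integers and the "reward" peaks at 3.
result = hill_climb(
    draft=lambda: 0,
    neighbors=lambda x: [x - 1, x + 1],
    score=lambda x: -(x - 3) ** 2,
)
print(result)
```

Because the incumbent is always a complete candidate, the loop can be stopped at any step and still return a usable answer, which is the anytime property the abstract says construction-based tree search lacks.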
#47 Grounding Multilingual Multimodal LLMs With Cultural Knowledge
Authors: [Jean de Dieu Nyandwi](https://arxiv.org/search/?searchtype=author&query=Jean de Dieu Nyandwi), [Yueqi Song](https://arxiv.org/search/?searchtype=author&query=Yueqi Song), [Simran Khanuja](https://arxiv.org/search/?searchtype=author&query=Simran Khanuja), [Graham Neubig](https://arxiv.org/search/?searchtype=author&query=Graham Neubig)
Multimodal Large Language Models excel in high-resource settings, but often misinterpret long-tail cultural entities and underperform in low-resource languages. To address this gap, we propose a data-centric approach that directly grounds MLLMs in cultural knowledge. Leveraging a large-scale knowledge graph from Wikidata, we collect images that represent culturally significant entities, and generate synthetic multilingual visual question answering data. The resulting dataset, CulturalGround, comprises 22 million high-quality, culturally-rich VQA pairs spanning 42 countries and 39 languages. We train an open-source MLLM CulturalPangea on CulturalGround, interleaving standard multilingual instruction-tuning data to preserve general abilities. CulturalPangea achieves state-of-the-art performance among open models on various culture-focused multilingual multimodal benchmarks, outperforming prior models by an average of 5.0 without degrading results on mainstream vision-language tasks. Our findings show that our targeted, culturally grounded approach could substantially narrow the cultural gap in MLLMs and offer a practical path towards globally inclusive multimodal systems.
Subjects: Computation and Language, Machine Learning
Publish: 2025-08-10 16:24:11 UTC
#48 Think Before You Talk: Enhancing Meaningful Dialogue Generation in Full-Duplex Speech Language Models with Planning-Inspired Text Guidance
Authors: [Wenqian Cui](https://arxiv.org/search/?searchtype=author&query=Wenqian Cui), [Lei Zhu](https://arxiv.org/search/?searchtype=author&query=Lei Zhu), [Xiaohui Li](https://arxiv.org/search/?searchtype=author&query=Xiaohui Li), [Zhihan Guo](https://arxiv.org/search/?searchtype=author&query=Zhihan Guo), [Haoli Bai](https://arxiv.org/search/?searchtype=author&query=Haoli Bai), [Lu Hou](https://arxiv.org/search/?searchtype=author&query=Lu Hou), [Irwin King](https://arxiv.org/search/?searchtype=author&query=Irwin King)
Full-Duplex Speech Language Models (FD-SLMs) are specialized foundation models designed to enable natural, real-time spoken interactions by modeling complex conversational dynamics such as interruptions, backchannels, and overlapping speech. End-to-end (e2e) FD-SLMs leverage real-world double-channel conversational data to capture nuanced two-speaker dialogue patterns for human-like interactions. However, they face a critical challenge: their conversational abilities often degrade compared to pure-text conversation due to prolonged speech sequences and limited high-quality spoken dialogue data. While text-guided speech generation could mitigate these issues, it suffers from timing and length issues when integrating textual guidance into double-channel audio streams, disrupting the precise time alignment essential for natural interactions. To address these challenges, we propose TurnGuide, a novel planning-inspired approach that mimics human conversational planning by dynamically segmenting assistant speech into dialogue turns and generating turn-level text guidance before speech output, which effectively resolves both insertion timing and length challenges. Extensive experiments demonstrate our approach significantly improves e2e FD-SLMs’ conversational abilities, enabling them to generate semantically meaningful and coherent speech while maintaining natural conversational flow. Demos are available at https://dreamtheater123.github.io/TurnGuide-Demo/. Code will be available at https://github.com/dreamtheater123/TurnGuide.
Subjects: Computation and Language, Sound, Audio and Speech Processing
Publish: 2025-08-10 14:49:43 UTC
#49 Strategies of Code-switching in Human-Machine Dialogs
Authors: [Dean Geckt](https://arxiv.org/search/?searchtype=author&query=Dean Geckt), [Melinda Fricke](https://arxiv.org/search/?searchtype=author&query=Melinda Fricke), [Shuly Wintner](https://arxiv.org/search/?searchtype=author&query=Shuly Wintner)
Most people are multilingual, and most multilinguals code-switch, yet the characteristics of code-switched language are not fully understood. We developed a chatbot capable of completing a Map Task with human participants using code-switched Spanish and English. In two experiments, we prompted the bot to code-switch according to different strategies, examining (1) the feasibility of such experiments for investigating bilingual language use, and (2) whether participants would be sensitive to variations in discourse and grammatical patterns. Participants generally enjoyed code-switching with our bot as long as it produced predictable code-switching behavior; when code-switching was random or ungrammatical (as when producing unattested incongruent mixed-language noun phrases, such as "la fork"), participants enjoyed the task less and were less successful at completing it. These results underscore the potential downsides of deploying insufficiently developed multilingual language technology, while also illustrating the promise of such technology for conducting research on bilingual language use.
Subjects: Computation and Language, Artificial Intelligence
Publish: 2025-08-10 12:41:46 UTC
#50 ObfusQAte: A Proposed Framework to Evaluate LLM Robustness on Obfuscated Factual Question Answering
Authors: [Shubhra Ghosh](https://arxiv.org/search/?searchtype=author&query=Shubhra Ghosh), [Abhilekh Borah](https://arxiv.org/search/?searchtype=author&query=Abhilekh Borah), [Aditya Kumar Guru](https://arxiv.org/search/?searchtype=author&query=Aditya Kumar Guru), [Kripabandhu Ghosh](https://arxiv.org/search/?searchtype=author&query=Kripabandhu Ghosh)
The rapid proliferation of Large Language Models (LLMs) has significantly contributed to the development of equitable AI systems capable of factual question-answering (QA). However, no known study tests the LLMs’ robustness when presented with obfuscated versions of questions. To systematically evaluate these limitations, we propose a novel technique, ObfusQAte, and, leveraging it, introduce ObfusQA, a comprehensive, first-of-its-kind framework with multi-tiered obfuscation levels designed to examine LLM capabilities across three distinct dimensions: (i) Named-Entity Indirection, (ii) Distractor Indirection, and (iii) Contextual Overload. By capturing these fine-grained distinctions in language, ObfusQA provides a comprehensive benchmark for evaluating LLM robustness and adaptability. Our study observes that LLMs exhibit a tendency to fail or generate hallucinated responses when confronted with these increasingly nuanced variations. To foster research in this direction, we make ObfusQAte publicly available.
Subjects: Computation and Language, Artificial Intelligence, Machine Learning
Publish: 2025-08-10 12:27:52 UTC
#51 HealthBranches: Synthesizing Clinically-Grounded Question Answering Datasets via Decision Pathways
Authors: [Cristian Cosentino](https://arxiv.org/search/?searchtype=author&query=Cristian Cosentino), [Annamaria Defilippo](https://arxiv.org/search/?searchtype=author&query=Annamaria Defilippo), [Marco Dossena](https://arxiv.org/search/?searchtype=author&query=Marco Dossena), [Christopher Irwin](https://arxiv.org/search/?searchtype=author&query=Christopher Irwin), [Sara Joubbi](https://arxiv.org/search/?searchtype=author&query=Sara Joubbi), [Pietro Liò](https://arxiv.org/search/?searchtype=author&query=Pietro Liò)
HealthBranches is a novel benchmark dataset for medical Question-Answering (Q&A), specifically designed to evaluate complex reasoning in Large Language Models (LLMs). This dataset is generated through a semi-automated pipeline that transforms explicit decision pathways from medical sources into realistic patient cases with associated questions and answers. Covering 4,063 case studies across 17 healthcare topics, each data point is based on clinically validated reasoning chains. HealthBranches supports both open-ended and multiple-choice question formats and uniquely includes the full reasoning path for each Q&A. Its structured design enables robust evaluation of LLMs’ multi-step inference capabilities, including their performance in structured Retrieval-Augmented Generation (RAG) contexts. HealthBranches establishes a foundation for the development of more trustworthy, interpretable, and clinically reliable LLMs in high-stakes domains while also serving as a valuable resource for educational purposes.
Subjects: Computation and Language, Artificial Intelligence, Information Retrieval, Machine Learning
Publish: 2025-08-10 11:45:34 UTC
#52 CCFQA: A Benchmark for Cross-Lingual and Cross-Modal Speech and Text Factuality Evaluation
Authors: [Yexing Du](https://arxiv.org/search/?searchtype=author&query=Yexing Du), [Kaiyuan Liu](https://arxiv.org/search/?searchtype=author&query=Kaiyuan Liu), [Youcheng Pan](https://arxiv.org/search/?searchtype=author&query=Youcheng Pan), [Zheng Chu](https://arxiv.org/search/?searchtype=author&query=Zheng Chu), [Bo Yang](https://arxiv.org/search/?searchtype=author&query=Bo Yang), [Xiaocheng Feng](https://arxiv.org/search/?searchtype=author&query=Xiaocheng Feng), [Yang Xiang](https://arxiv.org/search/?searchtype=author&query=Yang Xiang), [Ming Liu](https://arxiv.org/search/?searchtype=author&query=Ming Liu)
As Large Language Models (LLMs) are increasingly popularized in the multilingual world, ensuring hallucination-free factuality becomes markedly crucial. However, existing benchmarks for evaluating the reliability of Multimodal Large Language Models (MLLMs) predominantly focus on textual or visual modalities with a primary emphasis on English, which creates a gap in evaluation when processing multilingual input, especially in speech. To bridge this gap, we propose a novel Cross-lingual and Cross-modal Factuality benchmark (CCFQA). Specifically, the CCFQA benchmark contains parallel speech-text factual questions across 8 languages, designed to systematically evaluate MLLMs’ cross-lingual and cross-modal factuality capabilities. Our experimental results demonstrate that current MLLMs still face substantial challenges on the CCFQA benchmark. Furthermore, we propose a few-shot transfer learning strategy that effectively transfers the Question Answering (QA) capabilities of LLMs in English to multilingual Spoken Question Answering (SQA) tasks, achieving competitive performance with GPT-4o-mini-Audio using just 5-shot training. We release CCFQA as a foundational research resource to promote the development of MLLMs with more robust and reliable speech understanding capabilities. Our code and dataset are available at https://github.com/yxduir/ccfqa.
Subject: Computation and Language
Publish: 2025-08-10 11:09:41 UTC
#53 ARCE: Augmented RoBERTa with Contextualized Elucidations for NER in Automated Rule Checking
Authors: [Jian Chen](https://arxiv.org/search/?searchtype=author&query=Jian Chen), [Jinbao Tian](https://arxiv.org/search/?searchtype=author&query=Jinbao Tian), [Yankui Li](https://arxiv.org/search/?searchtype=author&query=Yankui Li), [Zhou Li](https://arxiv.org/search/?searchtype=author&query=Zhou Li)
Accurate information extraction from specialized texts is a critical challenge, particularly for named entity recognition (NER) in the architecture, engineering, and construction (AEC) domain to support automated rule checking (ARC). The performance of standard pre-trained models is often constrained by the domain gap, as they struggle to interpret the specialized terminology and complex relational contexts inherent in AEC texts. Although this issue can be mitigated by further pre-training on large, human-curated domain corpora, as exemplified by methods like ARCBERT, this approach is both labor-intensive and cost-prohibitive. Consequently, leveraging large language models (LLMs) for automated knowledge generation has emerged as a promising alternative. However, the optimal strategy for generating knowledge that can genuinely enhance smaller, efficient models remains an open question. To address this, we propose ARCE (augmented RoBERTa with contextualized elucidations), a novel approach that systematically explores and optimizes this generation process. ARCE employs an LLM to first generate a corpus of simple, direct explanations, which we term Cote, and then uses this corpus to incrementally pre-train a RoBERTa model prior to its fine-tuning on the downstream task. Our extensive experiments show that ARCE establishes a new state-of-the-art on a benchmark AEC dataset, achieving a Macro-F1 score of 77.20%. This result also reveals a key finding: simple, explanation-based knowledge proves surprisingly more effective than complex, role-based rationales for this task. The code is publicly available at https://github.com/nxcc-lab/ARCE.
Subjects: Computation and Language, Information Retrieval
Publish: 2025-08-10 10:49:48 UTC
#54 "Pull or Not to Pull?": Investigating Moral Biases in Leading Large Language Models Across Ethical Dilemmas
Authors: [Junchen Ding](https://arxiv.org/search/?searchtype=author&query=Junchen Ding), [Penghao Jiang](https://arxiv.org/search/?searchtype=author&query=Penghao Jiang), [Zihao Xu](https://arxiv.org/search/?searchtype=author&query=Zihao Xu), [Ziqi Ding](https://arxiv.org/search/?searchtype=author&query=Ziqi Ding), [Yichen Zhu](https://arxiv.org/search/?searchtype=author&query=Yichen Zhu), [Jiaojiao Jiang](https://arxiv.org/search/?searchtype=author&query=Jiaojiao Jiang), [Yuekang Li](https://arxiv.org/search/?searchtype=author&query=Yuekang Li)
As large language models (LLMs) increasingly mediate ethically sensitive decisions, understanding their moral reasoning processes becomes imperative. This study presents a comprehensive empirical evaluation of 14 leading LLMs, both reasoning-enabled and general-purpose, across 27 diverse trolley problem scenarios, framed by ten moral philosophies, including utilitarianism, deontology, and altruism. Using a factorial prompting protocol, we elicited 3,780 binary decisions and natural language justifications, enabling analysis along axes of decisional assertiveness, explanation-answer consistency, public moral alignment, and sensitivity to ethically irrelevant cues. Our findings reveal significant variability across ethical frames and model types: reasoning-enhanced models demonstrate greater decisiveness and structured justifications, yet do not always align better with human consensus. Notably, “sweet zones” emerge in altruistic, fairness, and virtue ethics framings, where models achieve a balance of high intervention rates, low explanation conflict, and minimal divergence from aggregated human judgments. However, models diverge under frames emphasizing kinship, legality, or self-interest, often producing ethically controversial outcomes. These patterns suggest that moral prompting is not only a behavioral modifier but also a diagnostic tool for uncovering latent alignment philosophies across providers. We advocate for moral reasoning to become a primary axis in LLM alignment, calling for standardized benchmarks that evaluate not just what LLMs decide, but how and why.
Subjects: Computation and Language, Artificial Intelligence, Computers and Society 主题:计算与语言, 人工智能, 计算机与社会
Publish: 2025-08-10 10:45:16 UTC 发布:2025-08-10 10:45:16 世界协调时间 (UTC)
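The 3,780 decisions reported above are consistent with a full factorial crossing of 14 models, 27 scenarios and 10 moral frames (14 × 27 × 10 = 3,780). The sketch below enumerates such a grid. The identifiers are hypothetical placeholders, and the assumption that each cell is queried exactly once is inferred from the arithmetic, not stated in the abstract.

```python
from itertools import product

# Hypothetical placeholder identifiers; the abstract only gives the counts.
models = [f"model_{i}" for i in range(14)]
scenarios = [f"scenario_{i}" for i in range(27)]
frames = [f"frame_{i}" for i in range(10)]


def build_grid(models, scenarios, frames):
    """Return one (model, scenario, frame) cell per factorial combination."""
    return list(product(models, scenarios, frames))


grid = build_grid(models, scenarios, frames)
print(len(grid))  # 3780
```

Each cell would then be turned into one prompt that asks for a binary decision plus a justification.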
#55 MAQuA: Adaptive Question-Asking for Multidimensional Mental Health Screening using Item Response Theory #55 MAQuA:使用项目反应理论的多维心理健康筛查自适应提问
Authors: [Vasudha Varadarajan](https://arxiv.org/search/?searchtype=author&query=Vasudha Varadarajan), [Hui Xu](https://arxiv.org/search/?searchtype=author&query=Hui Xu), [Rebecca Astrid Boehme](https://arxiv.org/search/?searchtype=author&query=Rebecca Astrid Boehme), [Mariam Marlan Mirstrom](https://arxiv.org/search/?searchtype=author&query=Mariam Marlan Mirstrom), [Sverker Sikstrom](https://arxiv.org/search/?searchtype=author&query=Sverker Sikstrom), [H. Andrew Schwartz](https://arxiv.org/search/?searchtype=author&query=H. Andrew Schwartz) 作者:Vasudha Varadarajan, Hui Xu, Rebecca Astrid Boehme, Mariam Marlan Mirstrom, Sverker Sikstrom, H. Andrew Schwartz
Recent advances in large language models (LLMs) offer new opportunities for scalable, interactive mental health assessment, but excessive querying by LLMs burdens users and is inefficient for real-world screening across transdiagnostic symptom profiles. We introduce MAQuA, an adaptive question-asking framework for simultaneous, multidimensional mental health screening. Combining multi-outcome modeling on language responses with item response theory (IRT) and factor analysis, MAQuA selects the questions with most informative responses across multiple dimensions at each turn to optimize diagnostic information, improving accuracy and potentially reducing response burden. Empirical results on a novel dataset reveal that MAQuA reduces the number of assessment questions required for score stabilization by 50-87% compared to random ordering (e.g., achieving stable depression scores with 71% fewer questions and eating disorder scores with 85% fewer questions). MAQuA demonstrates robust performance across both internalizing (depression, anxiety) and externalizing (substance use, eating disorder) domains, with early stopping strategies further reducing patient time and burden. These findings position MAQuA as a powerful and efficient tool for scalable, nuanced, and interactive mental health screening, advancing the integration of LLM-based agents into real-world clinical workflows. 近年来大型语言模型(LLMs)的进展为可扩展、互动式心理健康评估提供了新机遇,但由 LLMs 进行的过度询问会给用户带来负担,并且在跨诊断症状谱的真实筛查中效率低下。我们提出了 MAQuA,一种用于同时进行多维心理健康筛查的自适应提问框架。MAQuA 将语言回应上的多结果建模与项目反应理论(IRT)和因子分析相结合,在每一步选择在多个维度上能提供最多信息的题目以优化诊断信息,从而提高准确性并有可能减少答题负担。在一项新数据集上的实证结果表明,与随机排序相比,MAQuA 将评分稳定所需的评估问题数量减少了 50%–87%(例如,在抑郁评分达到稳定时问题数量减少了 71%,在饮食失调评分达到稳定时减少了 85%)。MAQuA 在内向型(抑郁、焦虑)和外向型(物质使用、饮食失调)领域均表现出稳健的性能,早停策略进一步减少了患者的时间和负担。 这些发现将 MAQuA 定位为一个强大且高效的工具,能够实现可扩展、细致且交互式的心理健康筛查,推动基于 LLM 的代理融入真实临床工作流程。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-10 10:33:16 UTC 发布:2025-08-10 10:33:16 UTC
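To illustrate the idea of asking the most informative question at each turn, here is a minimal sketch of maximum-information item selection under a unidimensional two-parameter logistic (2PL) IRT model. This is a deliberate simplification: MAQuA is multidimensional and models free-text responses, and the item names and parameters below are hypothetical.

```python
import math


def p_endorse(theta, a, b):
    """2PL probability of endorsing an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at trait level theta: a^2 * p * (1 - p)."""
    p = p_endorse(theta, a, b)
    return a * a * p * (1.0 - p)


def pick_next_item(theta, items, asked):
    """Choose the not-yet-asked item with the highest information at theta."""
    candidates = [name for name in items if name not in asked]
    return max(candidates, key=lambda name: item_information(theta, *items[name]))


# Hypothetical item bank: name -> (discrimination a, difficulty b).
items = {"q_sleep": (1.2, -0.5), "q_mood": (2.0, 0.1), "q_appetite": (0.8, 1.5)}
first = pick_next_item(0.0, items, asked=set())
second = pick_next_item(0.0, items, asked={first})
print(first, second)
```

In an adaptive loop, the trait estimate `theta` would be updated after every answer, and questioning would stop once the score stabilizes.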
#56 Incorporating Contextual Paralinguistic Understanding in Large Speech-Language Models #56 在大型语音-语言模型中融入情境副语言理解
Authors: [Qiongqiong Wang](https://arxiv.org/search/?searchtype=author&query=Qiongqiong Wang), [Hardik B. Sailor](https://arxiv.org/search/?searchtype=author&query=Hardik B. Sailor), [Jeremy H. M. Wong](https://arxiv.org/search/?searchtype=author&query=Jeremy H. M. Wong), [Tianchi Liu](https://arxiv.org/search/?searchtype=author&query=Tianchi Liu), [Shuo Sun](https://arxiv.org/search/?searchtype=author&query=Shuo Sun), [Wenyu Zhang](https://arxiv.org/search/?searchtype=author&query=Wenyu Zhang), [Muhammad Huzaifah](https://arxiv.org/search/?searchtype=author&query=Muhammad Huzaifah), [Nancy Chen](https://arxiv.org/search/?searchtype=author&query=Nancy Chen), [Ai Ti Aw](https://arxiv.org/search/?searchtype=author&query=Ai Ti Aw) 作者:王琼琼、Hardik B. Sailor、Jeremy H. M. Wong、刘天池、孙硕、张文钰、Muhammad Huzaifah、陈南希、艾蒂·奥夫
Current large speech language models (Speech-LLMs) often exhibit limitations in empathetic reasoning, primarily due to the absence of training datasets that integrate both contextual content and paralinguistic cues. In this work, we propose two approaches to incorporate contextual paralinguistic information into model training: (1) an explicit method that provides paralinguistic metadata (e.g., emotion annotations) directly to the LLM, and (2) an implicit method that automatically generates novel training question-answer (QA) pairs using both categorical and dimensional emotion annotations alongside speech transcriptions. Our implicit method boosts performance (LLM-judged) by 38.41% on a human-annotated QA benchmark, reaching 46.02% when combined with the explicit approach, showing effectiveness in contextual paralinguistic understanding. We also validate the LLM judge by demonstrating its correlation with classification metrics, providing support for its reliability. 当前大型语音语言模型(Speech-LLMs)在共情推理方面通常存在局限,主要原因是缺乏将上下文内容与副语言线索相结合的训练数据。在本工作中,我们提出了两种将上下文副语言信息纳入模型训练的方法:(1)显式方法,直接向 LLM 提供副语言元数据(例如情感标注);(2)隐式方法,利用类别和维度化的情感标注以及语音转录自动生成新的训练问答(QA)对。我们的隐式方法在一个人工标注的 QA 基准上(由 LLM 评判)将性能提升了 38.41%,与显式方法结合时达到 46.02%,表明在上下文副语言理解方面的有效性。我们还通过展示其与分类度量的相关性来验证 LLM 评判器,支持其可靠性。
Subjects: Computation and Language, Artificial Intelligence, Audio and Speech Processing 主题:计算与语言、人工智能、音频与语音处理
Publish: 2025-08-10 10:03:30 UTC 发布:2025-08-10 10:03:30 UTC
#57 The 2D+ Dynamic Articulatory Model DYNARTmo: Tongue-Palate Contact Area Estimation #57 二维+ 动态发音构形模型 DYNARTmo:舌-腭接触面积估计
Author: [Bernd J. Kröger](https://arxiv.org/search/?searchtype=author&query=Bernd J. Kröger) 作者:Bernd J. Kröger
This paper describes an extension of the two-dimensional dynamic articulatory model DYNARTmo by integrating an internal three-dimensional representation of the palatal dome to estimate tongue-palate contact areas from midsagittal tongue contours. Two alternative dome geometries - a half-ellipse and a cosine based profile - are implemented to model lateral curvature in the coronal plane. Using these geometries, lateral contact points are analytically computed for each anterior-posterior position, enabling the generation of electropalatography-like visualizations within the 2D+ framework. The enhanced model supports three synchronized views (sagittal, glottal, and palatal) for static and dynamic (animated) articulation displays, suitable for speech science education and speech therapy. Future work includes adding a facial (lip) view and implementing articulatory-to-acoustic synthesis to quantitatively evaluate model realism. 本文描述了通过在二维动态发音模型 DYNARTmo 中整合一个内部三维上颚穹窿表示,以便从矢状面舌轮廓估算舌-腭接触面积的扩展。实现了两种可选的穹窿几何形状——半椭圆和基于余弦的轮廓——用于模拟冠状面中的侧向弯曲。利用这些几何形状,可以针对每个前后位置解析地计算侧向接触点,从而在 2D+ 框架内生成类似电腭图的可视化。增强后的模型支持三个同步视图(矢状面、声门面和腭面)用于静态和动态(动画)发音显示,适用于语音科学教育和语言治疗。未来工作包括增加面部(唇)视图并实现从发音到声学的合成,以定量评估模型的逼真性。
Subjects: Computation and Language, Robotics
Publish: 2025-08-10 09:28:24 UTC
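The two dome geometries can be inverted in closed form, which is one way lateral contact points could be computed analytically. The sketch below rests on assumptions that are not in the abstract: at one anterior-posterior position the dome has half-width `a` and midline height `h`, the tongue surface is treated as flat at height `t` across the coronal plane, and contact occurs wherever the dome lies at or below `t`. The function names are hypothetical.

```python
import math


def contact_offset_ellipse(t, a, h):
    """Lateral offset y where the half-ellipse dome z = h*sqrt(1 - (y/a)^2) equals t."""
    t = min(max(t, 0.0), h)  # clamp tongue height to the dome range
    return a * math.sqrt(1.0 - (t / h) ** 2)


def contact_offset_cosine(t, a, h):
    """Lateral offset y where the cosine dome z = h*cos(pi*y/(2a)) equals t."""
    t = min(max(t, 0.0), h)
    return (2.0 * a / math.pi) * math.acos(t / h)


def contact_width_per_side(offset, a):
    """Width of the contacted strip on each side, from the offset out to the dome edge."""
    return a - offset


# Hypothetical dimensions in millimetres.
a, h = 20.0, 15.0
for t in (0.0, 7.5, 15.0):
    ye = contact_offset_ellipse(t, a, h)
    yc = contact_offset_cosine(t, a, h)
    print(t, round(contact_width_per_side(ye, a), 2), round(contact_width_per_side(yc, a), 2))
```

Under these assumptions a low tongue touches only the lateral margins, and contact spreads toward the midline as the tongue rises, which resembles the pattern an electropalatography display shows.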
#58 Prompt Tuning for Few-Shot Continual Learning Named Entity Recognition #58 针对少样本持续学习的命名实体识别的提示微调
Author: [Zhe Ren](https://arxiv.org/search/?searchtype=author&query=Zhe Ren) 作者:任喆
Knowledge distillation has been successfully applied to Continual Learning Named Entity Recognition (CLNER) tasks, by using a teacher model trained on old-class data to distill old-class entities present in new-class data as a form of regularization, thereby avoiding catastrophic forgetting. However, in Few-Shot CLNER (FS-CLNER) tasks, the scarcity of new-class entities makes it difficult for the trained model to generalize during inference. More critically, the lack of old-class entity information hinders the distillation of old knowledge, causing the model to fall into what we refer to as the Few-Shot Distillation Dilemma. In this work, we address the above challenges through a prompt tuning paradigm and memory demonstration template strategy. Specifically, we designed an expandable Anchor words-oriented Prompt Tuning (APT) paradigm to bridge the gap between pre-training and fine-tuning, thereby enhancing performance in few-shot scenarios. Additionally, we incorporated Memory Demonstration Templates (MDT) into each training instance to provide replay samples from previous tasks, which not only avoids the Few-Shot Distillation Dilemma but also promotes in-context learning. Experiments show that our approach achieves competitive performances on FS-CLNER. 知识蒸馏已成功应用于持续学习命名实体识别(CLNER)任务:通过使用在旧类别数据上训练的教师模型,将新类别数据中出现的旧类别实体作为一种正则化手段进行蒸馏,从而避免灾难性遗忘。然而,在少样本持续学习命名实体识别(FS-CLNER)任务中,新类别实体的稀缺使得训练好的模型在推理时难以泛化。更关键的是,缺乏旧类别实体信息阻碍了旧知识的蒸馏,导致模型陷入我们所称的少样本蒸馏困境。在本工作中,我们通过提示调优范式和记忆示例模板策略来解决上述挑战。具体而言,我们设计了一个可扩展的面向锚词的提示调优(APT)范式,以弥合预训练与微调之间的差距,从而提升少样本场景下的性能。此外,我们在每个训练实例中加入了记忆示例模板(MDT),以提供来自先前任务的重放样本,这不仅避免了少样本蒸馏困境,而且促进了上下文内学习。 实验表明,我们的方法在少样本持续学习命名实体识别(FS-CLNER)任务上取得了具有竞争力的表现。
Subject: Computation and Language 主题:计算与语言
Publish: 2025-08-10 09:02:53 UTC 发布:2025-08-10 09:02:53 协调世界时
#59 How Does a Deep Neural Network Look at Lexical Stress? #59 深度神经网络如何看待词汇重音?
Authors: [Itai Allouche](https://arxiv.org/search/?searchtype=author&query=Itai Allouche), [Itay Asael](https://arxiv.org/search/?searchtype=author&query=Itay Asael), [Rotem Rousso](https://arxiv.org/search/?searchtype=author&query=Rotem Rousso), [Vered Dassa](https://arxiv.org/search/?searchtype=author&query=Vered Dassa), [Ann Bradlow](https://arxiv.org/search/?searchtype=author&query=Ann Bradlow), [Seung-Eun Kim](https://arxiv.org/search/?searchtype=author&query=Seung-Eun Kim), [Matthew Goldrick](https://arxiv.org/search/?searchtype=author&query=Matthew Goldrick), [Joseph Keshet](https://arxiv.org/search/?searchtype=author&query=Joseph Keshet) 作者:Itai Allouche、Itay Asael、Rotem Rousso、Vered Dassa、Ann Bradlow、Seung-Eun Kim、Matthew Goldrick、Joseph Keshet
Despite their success in speech processing, neural networks often operate as black boxes, prompting the question: what informs their decisions, and how can we interpret them? This work examines this issue in the context of lexical stress. A dataset of English disyllabic words was automatically constructed from read and spontaneous speech. Several Convolutional Neural Network (CNN) architectures were trained to predict stress position from a spectrographic representation of disyllabic words lacking minimal stress pairs (e.g., initial stress WAllet, final stress exTEND), achieving up to 92% accuracy on held-out test data. Layerwise Relevance Propagation (LRP), a technique for CNN interpretability analysis, revealed that predictions for held-out minimal pairs (PROtest vs. proTEST ) were most strongly influenced by information in stressed versus unstressed syllables, particularly the spectral properties of stressed vowels. However, the classifiers also attended to information throughout the word. A feature-specific relevance analysis is proposed, and its results suggest that our best-performing classifier is strongly influenced by the stressed vowel’s first and second formants, with some evidence that its pitch and third formant also contribute. These results reveal deep learning’s ability to acquire distributed cues to stress from naturally occurring data, extending traditional phonetic work based around highly controlled stimuli. 尽管在语音处理方面取得了成功,神经网络常常作为黑箱运行,这就引发了这样的问题:是什么在指导它们的决策,我们如何解释它们?本研究在词汇重音的背景下考察了这一问题。我们从朗读语音和自发语音中自动构建了一个英语二音节词的数据集。训练了若干卷积神经网络(CNN)结构,从缺少最小重音对(例如,重音在首音节的 WAllet 与重音在末音节的 exTEND)的二音节词的频谱图表示中预测重音位置,在保留测试数据上最高达到 92%的准确率。层级相关传播(LRP)——一种用于 CNN 可解释性分析的技术——显示,对保留的最小对(PROtest 与 proTEST)的预测,主要受有重音与无重音音节中的信息影响,尤其是有重音元音的谱特性。然而,分类器也关注了整个词中的信息。 提出了一种特征特定的相关性分析,其结果表明我们表现最好的分类器强烈受重读元音的一、二共振峰的影响,并有一些证据表明其基频和第三共振峰也有所贡献。这些结果揭示了深度学习从自然发生的数据中获取分布式重读线索的能力,扩展了以高度受控刺激为基础的传统语音学研究。
Subjects: Computation and Language, Machine Learning, Sound, Audio and Speech Processing 主题:计算与语言、机器学习、声音、音频与语音处理
Publish: 2025-08-10 08:13:40 UTC 发布:2025-08-10 08:13:40 UTC
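For readers unfamiliar with Layerwise Relevance Propagation, the sketch below applies the LRP-epsilon rule to a single bias-free linear layer. It is a toy illustration of how relevance flows from outputs back to inputs, not the CNN pipeline used in the paper; the function name, weights and activations are hypothetical.

```python
def lrp_epsilon(activations, weights, relevance_out, eps=1e-6):
    """Redistribute output relevance onto inputs of a bias-free linear layer.

    weights[j][k] connects input j to output k. Each input receives a share of
    relevance_out[k] proportional to its contribution a_j * w_jk to z_k.
    """
    n_in, n_out = len(activations), len(relevance_out)
    z = [sum(activations[j] * weights[j][k] for j in range(n_in)) for k in range(n_out)]
    relevance_in = [0.0] * n_in
    for k in range(n_out):
        # The epsilon term stabilizes the division when z_k is close to zero.
        denom = z[k] + (eps if z[k] >= 0 else -eps)
        for j in range(n_in):
            relevance_in[j] += activations[j] * weights[j][k] / denom * relevance_out[k]
    return relevance_in


activations = [1.0, 2.0]
weights = [[0.5, -1.0], [0.25, 1.0]]
relevance_out = [1.0, 0.0]  # explain only the first output unit
relevance_in = lrp_epsilon(activations, weights, relevance_out)
print(relevance_in)
```

With no bias and a small epsilon, the input relevances sum to approximately the relevance that entered at the output, which is the conservation property that makes the resulting heatmaps interpretable.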
#60 Enhancing Rumor Detection Methods with Propagation Structure Infused Language Model #60 通过注入传播结构的语言模型增强谣言检测方法
Authors: [Chaoqun Cui](https://arxiv.org/search/?searchtype=author&query=Chaoqun Cui), [Siyuan Li](https://arxiv.org/search/?searchtype=author&query=Siyuan Li), [Kunkun Ma](https://arxiv.org/search/?searchtype=author&query=Kunkun Ma), [Caiyan Jia](https://arxiv.org/search/?searchtype=author&query=Caiyan Jia) 作者:崔朝群,李思远,马坤坤,贾彩燕
Pretrained Language Models (PLMs) have excelled in various Natural Language Processing tasks, benefiting from large-scale pretraining and the self-attention mechanism’s ability to capture long-range dependencies. However, their performance on social media application tasks like rumor detection remains suboptimal. We attribute this to mismatches between pretraining corpora and social texts, inadequate handling of unique social symbols, and pretraining tasks ill-suited for modeling user engagements implicit in propagation structures. To address these issues, we propose a continued pretraining strategy called Post Engagement Prediction (PEP) to infuse information from propagation structures into PLMs. PEP trains models to predict root, branch, and parent relations between posts, capturing interactions of stance and sentiment crucial for rumor detection. We also curate and release a large-scale Twitter corpus, TwitterCorpus (269GB of text), and two unlabeled claim conversation datasets with propagation structures (UTwitter and UWeibo). Utilizing these resources and the PEP strategy, we train a Twitter-tailored PLM called SoLM. Extensive experiments demonstrate that PEP significantly boosts rumor detection performance across universal and social media PLMs, even in few-shot scenarios. On benchmark datasets, PEP improves baseline models by 1.0-3.7% in accuracy, even enabling them to outperform current state-of-the-art methods on multiple datasets. SoLM alone, without high-level modules, also achieves competitive results, highlighting the strategy’s effectiveness in learning discriminative post interaction features.
预训练语言模型(PLM)在各种自然语言处理任务中表现出色,这得益于大规模预训练和自注意力机制捕捉长距离依赖的能力。然而,它们在社交媒体应用任务(如谣言检测)上的表现仍不尽如人意。我们认为这是由于预训练语料与社交文本存在不匹配、对独特社交符号处理不足以及预训练任务不适合建模在传播结构中隐含的用户参与。为了解决这些问题,我们提出了一种名为帖子参与预测(Post Engagement Prediction,PEP)的继续预训练策略,以将传播结构的信息注入到 PLM 中。PEP 使模型预测帖子之间的根源、分支和父子关系,捕捉立场和情感的互动,这对谣言检测至关重要。我们还整理并发布了大规模推特语料:TwitterCorpus(269GB 文本),以及两个带传播结构的未标注声明会话数据集(UTwitter 和 UWeibo)。利用这些资源和 PEP 策略,我们训练了一个针对推特优化的 PLM,称为 SoLM。 大量实验证明,PEP 在通用和社交媒体预训练语言模型上显著提升了谣言检测性能,即使在少样本场景下也能取得提升。在基准数据集上,PEP 使基线模型的准确率提高了 1.0–3.7%,甚至在多个数据集上使其超越了当前的最先进方法。仅使用 SoLM(不含高级模块)也能取得具有竞争力的结果,突显了该策略在学习判别性帖文交互特征方面的有效性。
Subjects: Computation and Language, Social and Information Networks 主题:计算与语言,社交与信息网络
Publish: 2025-08-10 07:04:50 UTC 发布时间:2025-08-10 07:04:50 协调世界时
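One way to picture the PEP objective is as pairwise labels derived from a conversation tree. The sketch below builds such labels from a parent map. The exact definitions of the root, branch and parent relations are not given in the abstract, so the ones used here (same conversation root, one post being an ancestor of the other, direct reply) are assumptions, and all post IDs are hypothetical.

```python
def root_of(post, parent):
    """Follow parent links up to the source post of the conversation."""
    while parent[post] is not None:
        post = parent[post]
    return post


def ancestors(post, parent):
    """All posts on the path from this post's parent up to the root."""
    chain = []
    while parent[post] is not None:
        post = parent[post]
        chain.append(post)
    return chain


def relation_labels(a, b, parent):
    """Assumed pairwise targets: shared root, same reply chain, direct reply."""
    return {
        "same_root": root_of(a, parent) == root_of(b, parent),
        "same_branch": a in ancestors(b, parent) or b in ancestors(a, parent),
        "is_parent": parent[b] == a or parent[a] == b,
    }


# Two hypothetical conversations: r1 <- p1 <- p2, r1 <- p3, and r2 <- q1.
parent = {"r1": None, "p1": "r1", "p2": "p1", "p3": "r1", "r2": None, "q1": "r2"}
print(relation_labels("p1", "p2", parent))
print(relation_labels("p2", "p3", parent))
```

Labels like these need no human annotation, which is what makes them usable as a continued pretraining signal on unlabeled conversations.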
#61 Adapting LLMs to Time Series Forecasting via Temporal Heterogeneity Modeling and Semantic Alignment #61 通过时间异质性建模和语义对齐将 LLMs 适配于时间序列预测
Authors: [Yanru Sun](https://arxiv.org/search/?searchtype=author&query=Yanru Sun), [Emadeldeen Eldele](https://arxiv.org/search/?searchtype=author&query=Emadeldeen Eldele), [Zongxia Xie](https://arxiv.org/search/?searchtype=author&query=Zongxia Xie), [Yucheng Wang](https://arxiv.org/search/?searchtype=author&query=Yucheng Wang), [Wenzhe Niu](https://arxiv.org/search/?searchtype=author&query=Wenzhe Niu), [Qinghua Hu](https://arxiv.org/search/?searchtype=author&query=Qinghua Hu), [Chee Keong Kwoh](https://arxiv.org/search/?searchtype=author&query=Chee Keong Kwoh), [Min Wu](https://arxiv.org/search/?searchtype=author&query=Min Wu) 作者:孙艳如,Emadeldeen Eldele,谢宗霞,王雨成,牛文哲,胡清华,郭志强,吴敏
Large Language Models (LLMs) have recently demonstrated impressive capabilities in natural language processing due to their strong generalization and sequence modeling capabilities. However, their direct application to time series forecasting remains challenging due to two fundamental issues: the inherent heterogeneity of temporal patterns and the modality gap between continuous numerical signals and discrete language representations. In this work, we propose TALON, a unified framework that enhances LLM-based forecasting by modeling temporal heterogeneity and enforcing semantic alignment. Specifically, we design a Heterogeneous Temporal Encoder that partitions multivariate time series into structurally coherent segments, enabling localized expert modeling across diverse temporal patterns. To bridge the modality gap, we introduce a Semantic Alignment Module that aligns temporal features with LLM-compatible representations, enabling effective integration of time series into language-based models while eliminating the need for handcrafted prompts during inference. Extensive experiments on seven real-world benchmarks demonstrate that TALON achieves superior performance across all datasets, with average MSE improvements of up to 11% over recent state-of-the-art methods. These results underscore the effectiveness of incorporating both pattern-aware and semantic-aware designs when adapting LLMs for time series forecasting. The code is available at: https://github.com/syrGitHub/TALON. 大型语言模型(LLMs)近年来在自然语言处理方面展示了令人印象深刻的能力,这归功于它们强大的泛化和序列建模能力。然而,将它们直接应用于时间序列预测仍然具有挑战性,原因在于两个基本问题:时间模式的固有异质性,以及连续数值信号与离散语言表示之间的模态差距。在本工作中,我们提出了 TALON,一个通过建模时间异质性并强制语义对齐来增强基于 LLM 的预测的统一框架。具体而言,我们设计了一个异构时间编码器,将多变量时间序列划分为结构上一致的片段,从而使得针对多样时间模式的局部专家建模成为可能。为弥合模态差距,我们引入了一个语义对齐模块,将时间特征与兼容 LLM 的表示对齐,使时间序列能够有效整合到基于语言的模型中,同时在推理阶段无需手工设计提示。 在七个真实世界基准上的大量实验表明,TALON 在所有数据集上都实现了更优的性能,相较于近期的最先进方法平均均方误差(MSE)提升了多达 11%。这些结果强调了在将 LLMs 应用于时间序列预测时同时引入模式感知和语义感知设计的有效性。代码可在以下位置获取: https://github.com/syrGitHub/TALON。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-10 06:06:19 UTC 发布:2025-08-10 06:06:19 UTC
#62 DySK-Attn: A Framework for Efficient, Real-Time Knowledge Updating in Large Language Models via Dynamic Sparse Knowledge Attention #62 DySK-Attn:一种通过动态稀疏知识注意力在大型语言模型中实现高效实时知识更新的框架
Authors: [Kabir Khan](https://arxiv.org/search/?searchtype=author&query=Kabir Khan), [Priya Sharma](https://arxiv.org/search/?searchtype=author&query=Priya Sharma), [Arjun Mehta](https://arxiv.org/search/?searchtype=author&query=Arjun Mehta), [Neha Gupta](https://arxiv.org/search/?searchtype=author&query=Neha Gupta), [Ravi Narayanan](https://arxiv.org/search/?searchtype=author&query=Ravi Narayanan) 作者:Kabir Khan、Priya Sharma、Arjun Mehta、Neha Gupta、Ravi Narayanan
Large Language Models (LLMs) suffer from a critical limitation: their knowledge is static and quickly becomes outdated. Retraining these massive models is computationally prohibitive, while existing knowledge editing techniques can be slow and may introduce unforeseen side effects. To address this, we propose DySK-Attn, a novel framework that enables LLMs to efficiently integrate real-time knowledge from a dynamic external source. Our approach synergizes an LLM with a dynamic Knowledge Graph (KG) that can be updated instantaneously. The core of our framework is a sparse knowledge attention mechanism, which allows the LLM to perform a coarse-to-fine grained search, efficiently identifying and focusing on a small, highly relevant subset of facts from the vast KG. This mechanism avoids the high computational cost of dense attention over the entire knowledge base and mitigates noise from irrelevant information. We demonstrate through extensive experiments on time-sensitive question-answering tasks that DySK-Attn significantly outperforms strong baselines, including standard Retrieval-Augmented Generation (RAG) and model editing techniques, in both factual accuracy for updated knowledge and computational efficiency. Our framework offers a scalable and effective solution for building LLMs that can stay current with the ever-changing world. 大型语言模型(LLMs)存在一个关键局限:它们的知识是静态的并且很快就会过时。对这些庞大模型进行重新训练在计算上不可行,而现有的知识编辑技术可能速度缓慢并引入不可预见的副作用。为了解决这一问题,我们提出了 DySK-Attn,一种使 LLMs 能够高效整合来自动态外部源的实时知识的新框架。我们的方法将 LLM 与一个可以即时更新的动态知识图谱(KG)相结合。该框架的核心是一种稀疏知识注意力机制,允许 LLM 执行由粗到细的搜索,能高效识别并聚焦于来自庞大 KG 的一小部分高度相关事实。该机制避免了对整个知识库进行密集注意力的高计算成本,并减轻了来自无关信息的噪声。 通过在时效性问答任务上进行的大量实验,我们证明了 DySK-Attn 在更新知识的事实准确性和计算效率方面,显著优于包括标准检索增强生成(RAG)和模型编辑技术在内的强基线。我们的框架为构建能够与瞬息万变的世界保持同步的 LLMs 提供了可扩展且有效的解决方案。
Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题:计算与语言,人工智能,机器学习
Publish: 2025-08-10 05:22:38 UTC 发布:2025-08-10 05:22:38 协调世界时
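The coarse-to-fine idea can be sketched as a two-stage attention: a cheap top-k selection over all facts, followed by softmax attention restricted to the selected subset. This is only an illustration under stated assumptions. The abstract does not specify the scoring function, the value of k, or how facts are embedded, so the dot-product scoring and the toy vectors below are hypothetical.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def sparse_knowledge_attention(query, fact_keys, fact_values, k=2):
    """Attend only to the k facts whose keys best match the query."""
    # Coarse stage: score every fact and keep the indices of the top k.
    scores = [dot(query, key) for key in fact_keys]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Fine stage: numerically stable softmax over the selected facts only.
    m = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - m) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(fact_values[0])
    context = [sum(w * fact_values[i][d] for w, i in zip(weights, top)) for d in range(dim)]
    return top, weights, context


query = [1.0, 0.0]
fact_keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
fact_values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
top, weights, context = sparse_knowledge_attention(query, fact_keys, fact_values, k=2)
print(top, weights, context)
```

Because the facts live outside the model, appending a new key/value pair to the lists makes it available on the next call without any retraining, which mirrors the instant-update property described above.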
#63 Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks #63 大规模模式谱系抽取:多语言管道、复合评估和语言模型基准
Authors: [Jiaqi Yin](https://arxiv.org/search/?searchtype=author&query=Jiaqi Yin), [Yi-Wei Chen](https://arxiv.org/search/?searchtype=author&query=Yi-Wei Chen), [Meng-Lung Lee](https://arxiv.org/search/?searchtype=author&query=Meng-Lung Lee), [Xiya Liu](https://arxiv.org/search/?searchtype=author&query=Xiya Liu) 作者:殷嘉琦、陈奕维、李孟龙、刘熙雅
Enterprise data pipelines, characterized by complex transformations across multiple programming languages, often cause a semantic disconnect between original metadata and downstream data. This “semantic drift” compromises data reproducibility and governance, and impairs the utility of services like retrieval-augmented generation (RAG) and text-to-SQL systems. To address this, a novel framework is proposed for the automated extraction of fine-grained schema lineage from multilingual enterprise pipeline scripts. This method identifies four key components: source schemas, source tables, transformation logic, and aggregation operations, creating a standardized representation of data transformations. For the rigorous evaluation of lineage quality, this paper introduces the Schema Lineage Composite Evaluation (SLiCE), a metric that assesses both structural correctness and semantic fidelity. A new benchmark is also presented, comprising 1,700 manually annotated lineages from real-world industrial scripts. Experiments were conducted with 12 language models, ranging from small language models (SLMs) of 1.3B to 32B parameters to large language models (LLMs) such as GPT-4o and GPT-4.1. The results demonstrate that the performance of schema lineage extraction scales with model size and the sophistication of prompting techniques. Specifically, a 32B open-source model, using a single reasoning trace, can achieve performance comparable to the GPT series under standard prompting. This finding suggests a scalable and economical approach for deploying schema-aware agents in practical applications.
企业数据管道通常在多种编程语言中进行复杂的转换,这常导致原始元数据与下游数据之间出现语义脱节。这种“语义漂移”会削弱数据的可复现性和治理能力,并降低检索增强生成(RAG)和文本到 SQL 等服务的效用。为了解决这一问题,提出了一种新框架,用于从多语言的企业管道脚本中自动提取细粒度的模式血缘。该方法识别四个关键组件:源模式、源表、转换逻辑和聚合操作,从而创建数据转换的标准化表示。为严格评估血缘质量,本文引入了模式血缘综合评估(SLiCE),该指标同时评估结构正确性和语义保真度。文中还提供了一个新的基准数据集,包括来自真实工业脚本的 1,700 条人工标注的血缘记录。 实验在 12 种语言模型上进行,涵盖从 1.3B 到 32B 的小型语言模型(SLMs),以及像 GPT-4o 和 GPT-4.1 这样的大型语言模型(LLMs)。结果表明,模式谱系提取的性能随着模型规模和提示技术的复杂性而提升。特别是一款 32B 的开源模型,在使用单一推理轨迹时,能够在常规提示下达到与 GPT 系列相当的性能。该发现表明,在实际应用中部署具备模式感知能力的代理时,可以采用一种可扩展且经济的方案。
Subjects: Computation and Language, Artificial Intelligence, Databases 主题:计算与语言、人工智能、数据库
Publish: 2025-08-10 05:04:32 UTC 发布时间:2025-08-10 05:04:32 协调世界时 (UTC)
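The four lineage components named above map naturally onto a small record type. The sketch below shows one possible representation plus a crude per-component comparison. The field names and example values are hypothetical, and the comparison is an exact-match stand-in for illustration only; it is not the SLiCE metric, whose definition is not given in the abstract.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnLineage:
    """One target column and the four lineage components that produce it."""
    target_column: str
    source_schemas: list = field(default_factory=list)
    source_tables: list = field(default_factory=list)
    transformation: str = ""
    aggregation: str = ""


def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences match."""
    return " ".join(text.lower().split())


def component_matches(pred, gold):
    """Per-component exact-match flags (illustrative only, not SLiCE)."""
    return {
        "source_schemas": set(pred.source_schemas) == set(gold.source_schemas),
        "source_tables": set(pred.source_tables) == set(gold.source_tables),
        "transformation": normalize(pred.transformation) == normalize(gold.transformation),
        "aggregation": normalize(pred.aggregation) == normalize(gold.aggregation),
    }


# Hypothetical gold annotation and model prediction for one column.
gold = ColumnLineage("total_sales", ["sales_db"], ["orders"], "price * quantity", "SUM")
pred = ColumnLineage("total_sales", ["sales_db"], ["orders", "items"], "price *  quantity", "sum")
print(component_matches(pred, gold))
```

A composite score would additionally need a notion of semantic fidelity for the transformation logic, which exact string matching cannot capture.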
#64 Improved Personalized Headline Generation via Denoising Fake Interests from Implicit Feedback #64 通过从隐式反馈中去噪虚假兴趣改进个性化标题生成
Authors: [Kejin Liu](https://arxiv.org/search/?searchtype=author&query=Kejin Liu), [Junhong Lian](https://arxiv.org/search/?searchtype=author&query=Junhong Lian), [Xiang Ao](https://arxiv.org/search/?searchtype=author&query=Xiang Ao), [Ningtao Wang](https://arxiv.org/search/?searchtype=author&query=Ningtao Wang), [Xing Fu](https://arxiv.org/search/?searchtype=author&query=Xing Fu), [Yu Cheng](https://arxiv.org/search/?searchtype=author&query=Yu Cheng), [Weiqiang Wang](https://arxiv.org/search/?searchtype=author&query=Weiqiang Wang), [Xinyu Liu](https://arxiv.org/search/?searchtype=author&query=Xinyu Liu) 作者:刘克进,连俊宏,敖翔,王宁涛,付兴,程煜,王伟强,刘新宇
Accurate personalized headline generation hinges on precisely capturing user interests from historical behaviors. However, existing methods neglect personalized-irrelevant click noise in entire historical clickstreams, which may lead to hallucinated headlines that deviate from genuine user preferences. In this paper, we reveal the detrimental impact of click noise on personalized generation quality through rigorous analysis in both user and news dimensions. Based on these insights, we propose a novel Personalized Headline Generation framework via Denoising Fake Interests from Implicit Feedback (PHG-DIF). PHG-DIF first employs dual-stage filtering to effectively remove clickstream noise, identified by short dwell times and abnormal click bursts, and then leverages multi-level temporal fusion to dynamically model users’ evolving and multi-faceted interests for precise profiling. Moreover, we release DT-PENS, a new benchmark dataset comprising the click behavior of 1,000 carefully curated users and nearly 10,000 annotated personalized headlines with historical dwell time annotations. Extensive experiments demonstrate that PHG-DIF substantially mitigates the adverse effects of click noise and significantly improves headline quality, achieving state-of-the-art (SOTA) results on DT-PENS. Our framework implementation and dataset are available at https://github.com/liukejin-up/PHG-DIF. 准确的个性化标题生成取决于从历史行为中精确捕捉用户兴趣。然而,现有方法忽视了整个历史点击流中与个性化无关的点击噪音,这可能导致与真实用户偏好偏离的虚构标题。本文通过对用户和新闻维度的严格分析,揭示了点击噪音对个性化生成质量的有害影响。基于这些洞见,我们提出了一种通过从隐式反馈中去噪伪兴趣的个性化标题生成新框架(PHG-DIF)。PHG-DIF 首先采用双阶段过滤有效去除由短停留时间和异常点击爆发标识的点击流噪音,然后利用多层次时间融合动态建模用户不断演进且多面的兴趣以实现精确画像。此外,我们发布了 DT-PENS,这是一个新的基准数据集,包含 1,000 名精心挑选用户的点击行为和近 10,000 条带有历史停留时间注释的个性化标题。 大量实验表明,PHG-DIF 能显著缓解点击噪声的不利影响并大幅提升标题质量,在 DT-PENS 数据集上取得了最先进(SOTA)的结果。我们的框架实现和数据集可在 https://github.com/liukejin-up/PHG-DIF 获得。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-10 04:56:13 UTC 发布:2025-08-10 04:56:13 UTC
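The two noise signals mentioned above, short dwell times and abnormal click bursts, can be illustrated with a simple two-stage filter. This is a sketch under assumptions: the thresholds, the order of the stages, the burst definition (too many clicks within a time window) and the record fields are all hypothetical and not taken from the paper.

```python
def filter_short_dwell(clicks, min_dwell=5.0):
    """Stage 1: drop clicks whose dwell time (seconds) is below the threshold."""
    return [c for c in clicks if c["dwell"] >= min_dwell]


def filter_bursts(clicks, window=10.0, max_in_window=3):
    """Stage 2: drop clicks that have too many neighbours within +/- window seconds."""
    keep = []
    for c in clicks:
        # The count includes the click itself.
        neighbours = sum(1 for d in clicks if abs(d["t"] - c["t"]) <= window)
        if neighbours <= max_in_window:
            keep.append(c)
    return keep


def denoise(clicks):
    """Apply both stages in sequence and return the surviving clicks."""
    return filter_bursts(filter_short_dwell(clicks))


# Hypothetical clickstream: t is the click time in seconds, dwell is seconds on page.
clicks = [
    {"t": 0.0, "dwell": 30.0},
    {"t": 100.0, "dwell": 2.0},   # short dwell
    {"t": 200.0, "dwell": 8.0},   # start of a four-click burst
    {"t": 202.0, "dwell": 8.0},
    {"t": 204.0, "dwell": 8.0},
    {"t": 206.0, "dwell": 8.0},
    {"t": 400.0, "dwell": 40.0},
]
clean = denoise(clicks)
print([c["t"] for c in clean])  # [0.0, 400.0]
```

Only the surviving clicks would then feed the interest model, so accidental or bot-like interactions do not shape the generated headline.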
#65 Omni-SafetyBench: A Benchmark for Safety Evaluation of Audio-Visual Large Language Models #65 Omni-SafetyBench:用于视听大型语言模型安全评估的基准
Authors: [Leyi Pan](https://arxiv.org/search/?searchtype=author&query=Leyi Pan), [Zheyu Fu](https://arxiv.org/search/?searchtype=author&query=Zheyu Fu), [Yunpeng Zhai](https://arxiv.org/search/?searchtype=author&query=Yunpeng Zhai), [Shuchang Tao](https://arxiv.org/search/?searchtype=author&query=Shuchang Tao), [Sheng Guan](https://arxiv.org/search/?searchtype=author&query=Sheng Guan), [Shiyu Huang](https://arxiv.org/search/?searchtype=author&query=Shiyu Huang), [Lingzhe Zhang](https://arxiv.org/search/?searchtype=author&query=Lingzhe Zhang), [Zhaoyang Liu](https://arxiv.org/search/?searchtype=author&query=Zhaoyang Liu), [Bolin Ding](https://arxiv.org/search/?searchtype=author&query=Bolin Ding), [Felix Henry](https://arxiv.org/search/?searchtype=author&query=Felix Henry), [Lijie Wen](https://arxiv.org/search/?searchtype=author&query=Lijie Wen), [Aiwei Liu](https://arxiv.org/search/?searchtype=author&query=Aiwei Liu) 作者:潘乐意、傅哲宇、翟云鹏、陶书昌、关圣、黄世宇、张凌喆、刘昭阳、丁博霖、Felix Henry、温立杰、刘艾为
The rise of Omni-modal Large Language Models (OLLMs), which integrate visual and auditory processing with text, necessitates robust safety evaluations to mitigate harmful outputs. However, no dedicated benchmarks currently exist for OLLMs, and prior benchmarks designed for other LLMs lack the ability to assess safety performance under audio-visual joint inputs or cross-modal safety consistency. To fill this gap, we introduce Omni-SafetyBench, the first comprehensive parallel benchmark for OLLM safety evaluation, featuring 24 modality combinations and variations with 972 samples each, including dedicated audio-visual harm cases. Considering OLLMs’ comprehension challenges with complex omni-modal inputs and the need for cross-modal consistency evaluation, we propose tailored metrics: a Safety-score based on conditional Attack Success Rate (C-ASR) and Refusal Rate (C-RR) to account for comprehension failures, and a Cross-Modal Safety Consistency Score (CMSC-score) to measure consistency across modalities. Evaluating 6 open-source and 4 closed-source OLLMs reveals critical vulnerabilities: (1) no model excels in both overall safety and consistency, with only 3 models achieving over 0.6 in both metrics and top performer scoring around 0.8; (2) safety defenses weaken with complex inputs, especially audio-visual joints; (3) severe weaknesses persist, with some models scoring as low as 0.14 on specific modalities. Our benchmark and metrics highlight urgent needs for enhanced OLLM safety, providing a foundation for future improvements. 
基于视觉和听觉处理与文本整合的全模态大型语言模型(OLLMs)的兴起,要求进行稳健的安全评估以减轻有害输出。然而,目前尚无专门针对 OLLMs 的基准测试,先前为其他 LLMs 设计的基准也无法评估在音视频联合输入下的安全表现或跨模态安全一致性。为填补这一空白,我们提出了 Omni-SafetyBench,这是首个用于 OLLM 安全评估的综合并行基准,涵盖 24 种模态组合与变体,每种包含 972 个样本,包括专门的音视频有害案例。考虑到 OLLMs 在理解复杂全模态输入方面的挑战以及评估跨模态一致性的需求,我们提出了定制化指标:基于条件攻击成功率(C-ASR)和拒绝率(C-RR)的安全得分以考虑理解失败,以及用于衡量模态间一致性的跨模态安全一致性得分(CMSC-score)。 对 6 个开源和 4 个闭源全模态大语言模型(OLLM)进行评估揭示了关键脆弱性:(1)没有任何模型在整体安全性和一致性上同时表现优秀,仅有 3 个模型在这两项指标上都超过 0.6,表现最好的模型得分约为 0.8;(2)随着输入变得复杂,尤其是音视频联合输入,安全防护能力会减弱;(3)严重的薄弱环节仍然存在,部分模型在特定模态上的得分低至 0.14。我们的基准和指标突出了对加强 OLLM 安全性的紧迫需求,为未来改进提供了基础。
Subject: Computation and Language 主题:计算与语言
Publish: 2025-08-10 04:15:16 UTC 发布:2025-08-10 04:15:16 UTC
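The conditional metrics can be read as rates computed only over samples the model actually understood, so that a comprehension failure is not counted as safe behaviour. The sketch below shows that reading. The record fields are hypothetical, and the way the paper combines C-ASR and C-RR into the Safety-score is not given in the abstract, so it is deliberately left out.

```python
def conditional_rates(records):
    """Return (C-ASR, C-RR) computed over understood samples only.

    Each record is a dict with boolean fields:
      understood: the model comprehended the multimodal input
      harmful:    the response complied with the harmful request
      refused:    the response was an explicit refusal
    Returns (None, None) when no sample was understood.
    """
    understood = [r for r in records if r["understood"]]
    if not understood:
        return None, None
    c_asr = sum(r["harmful"] for r in understood) / len(understood)
    c_rr = sum(r["refused"] for r in understood) / len(understood)
    return c_asr, c_rr


# Hypothetical judged outputs for one modality combination.
records = [
    {"understood": True, "harmful": True, "refused": False},
    {"understood": True, "harmful": False, "refused": True},
    {"understood": True, "harmful": False, "refused": True},
    {"understood": True, "harmful": False, "refused": False},
    {"understood": False, "harmful": False, "refused": False},  # excluded
]
c_asr, c_rr = conditional_rates(records)
print(c_asr, c_rr)  # 0.25 0.5
```

Computing the same rates per modality combination gives the inputs a cross-modal consistency measure would compare.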
#66 Gradient Surgery for Safe LLM Fine-Tuning #66 梯度手术用于安全的 LLM 微调
Authors: [Biao Yi](https://arxiv.org/search/?searchtype=author&query=Biao Yi), [Jiahao Li](https://arxiv.org/search/?searchtype=author&query=Jiahao Li), [Baolei Zhang](https://arxiv.org/search/?searchtype=author&query=Baolei Zhang), [Lihai Nie](https://arxiv.org/search/?searchtype=author&query=Lihai Nie), [Tong Li](https://arxiv.org/search/?searchtype=author&query=Tong Li), [Tiansheng Huang](https://arxiv.org/search/?searchtype=author&query=Tiansheng Huang), [Zheli Liu](https://arxiv.org/search/?searchtype=author&query=Zheli Liu) 作者:易彪、李佳豪、张宝磊、聂立海、李彤、黄天胜、刘哲利
Fine-tuning-as-a-Service introduces a critical vulnerability where a few malicious examples mixed into the user’s fine-tuning dataset can compromise the safety alignment of Large Language Models (LLMs). While a recognized paradigm frames safe fine-tuning as a multi-objective optimization problem balancing user task performance with safety alignment, we find existing solutions are critically sensitive to the harmful ratio, with defenses degrading sharply as harmful ratio increases. We diagnose that this failure stems from conflicting gradients, where the user-task update directly undermines the safety objective. To resolve this, we propose SafeGrad, a novel method that employs gradient surgery. When a conflict is detected, SafeGrad nullifies the harmful component of the user-task gradient by projecting it onto the orthogonal plane of the alignment gradient, allowing the model to learn the user’s task without sacrificing safety. To further enhance robustness and data efficiency, we employ a KL-divergence alignment loss that learns the rich, distributional safety profile of the well-aligned foundation model. Extensive experiments show that SafeGrad provides state-of-the-art defense across various LLMs and datasets, maintaining robust safety even at high harmful ratios without compromising task fidelity. Fine-tuning-as-a-Service 引入了一种关键漏洞,用户微调数据集中混入少量恶意样本即可破坏大型语言模型 (LLMs) 的安全对齐。尽管已有将安全微调视为在用户任务性能与安全对齐之间平衡的多目标优化范式,但我们发现现有解决方案对有害样本比例极其敏感,随着有害比例增加,防御效果急剧下降。我们诊断出这种失败源于梯度冲突,用户任务的更新直接削弱了安全目标。为了解决这一问题,我们提出了 SafeGrad,一种采用梯度手术的新方法。当检测到冲突时,SafeGrad 通过将用户任务梯度投影到对齐梯度的正交平面上,消除用户任务梯度中的有害分量,从而在不牺牲安全性的前提下学习用户的任务。为进一步增强鲁棒性和数据效率,我们采用了 KL 散度对齐损失,学习与良好对齐的基础模型相符的丰富分布式安全特征。 大量实验证明,SafeGrad 在各种 LLMs 和数据集上提供了最先进的防御,即使在高有害比例下也能保持稳健的安全性,同时不损害任务的准确性。
Subject: Computation and Language 主题:计算与语言
Publish: 2025-08-10 04:13:41 UTC 发布:2025-08-10 04:13:41 UTC
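The projection step described above can be written in a few lines. When the user-task gradient $g_u$ and the alignment gradient $g_a$ conflict ($g_u \cdot g_a < 0$), the component of $g_u$ along $g_a$ is removed: $g_u' = g_u - \frac{g_u \cdot g_a}{\lVert g_a \rVert^2} g_a$. The sketch below implements only this step on plain lists; the function names are hypothetical, and how SafeGrad combines the projected gradient with the alignment gradient in the final update is not specified in the abstract.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def project_if_conflict(g_user, g_align):
    """Remove the part of g_user that opposes g_align; leave it untouched otherwise."""
    d = dot(g_user, g_align)
    if d >= 0:
        return list(g_user)  # no conflict: keep the user-task gradient as is
    # d < 0 implies g_align is non-zero, so the division is safe.
    scale = d / dot(g_align, g_align)
    return [u - scale * a for u, a in zip(g_user, g_align)]


# Conflicting case: the user update pushes against the alignment direction.
g_align = [0.0, 1.0]
projected = project_if_conflict([1.0, -1.0], g_align)
print(projected, dot(projected, g_align))

# Non-conflicting case: the gradient passes through unchanged.
unchanged = project_if_conflict([1.0, 1.0], g_align)
print(unchanged)
```

After projection the user-task gradient is orthogonal to the alignment gradient, so a step along it no longer moves the safety objective to first order.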
#67 Fairness of Automatic Speech Recognition: Looking Through a Philosophical Lens #67 自动语音识别的公平性:通过哲学视角审视
Authors: [Anna Seo Gyeong Choi](https://arxiv.org/search/?searchtype=author&query=Anna Seo Gyeong Choi), [Hoon Choi](https://arxiv.org/search/?searchtype=author&query=Hoon Choi) 作者:Anna Seo Gyeong Choi,Hoon Choi
Automatic Speech Recognition (ASR) systems now mediate countless human-technology interactions, yet research on their fairness implications remains surprisingly limited. This paper examines ASR bias through a philosophical lens, arguing that systematic misrecognition of certain speech varieties constitutes more than a technical limitation – it represents a form of disrespect that compounds historical injustices against marginalized linguistic communities. We distinguish between morally neutral classification (discriminate₁) and harmful discrimination (discriminate₂), demonstrating how ASR systems can inadvertently transform the former into the latter when they consistently misrecognize non-standard dialects. We identify three unique ethical dimensions of speech technologies that differentiate ASR bias from other algorithmic fairness concerns: the temporal burden placed on speakers of non-standard varieties (“temporal taxation”), the disruption of conversational flow when systems misrecognize speech, and the fundamental connection between speech patterns and personal/cultural identity. These factors create asymmetric power relationships that existing technical fairness metrics fail to capture. The paper analyzes the tension between linguistic standardization and pluralism in ASR development, arguing that current approaches often embed and reinforce problematic language ideologies. We conclude that addressing ASR bias requires more than technical interventions; it demands recognition of diverse speech varieties as legitimate forms of expression worthy of technological accommodation. This philosophical reframing offers new pathways for developing ASR systems that respect linguistic diversity and speaker autonomy. 
自动语音识别(ASR)系统如今介入了无数人机交互,但关于其公平性影响的研究仍然出人意料地有限。本文通过哲学视角审视 ASR 偏见,认为对某些语言变体的系统性误识不仅仅是技术上的局限——它构成了一种对被边缘化语言社区的历史不公的叠加性不尊重。我们区分了道德中性的分类(discriminate₁)与有害的歧视(discriminate₂),并论证当 ASR 系统持续性地误识非标准方言时,会无意中将前者转化为后者。我们指出语音技术在伦理上有三项独特维度,使 ASR 偏见区别于其他算法公平性问题:对非标准语言使用者施加的时间负担(“时间税”)、当系统误识语音时对会话流的破坏,以及语音模式与个人/文化身份之间的根本性关联。 这些因素造成了现有技术公平性指标无法捕捉的不对称权力关系。论文分析了自动语音识别(ASR)开发中语言标准化与多元化之间的紧张关系,论证了当前方法常常嵌入并强化了有问题的语言意识形态。我们得出的结论是,解决 ASR 的偏见不仅需要技术干预,还需要承认多样的言语变体作为值得技术适配的合法表达形式。这一哲学上的重构为开发尊重语言多样性和说话者自主权的 ASR 系统提供了新的路径。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-10 02:26:47 UTC 发布:2025-08-10 02:26:47 协调世界时 (UTC)
#68 Investigating Intersectional Bias in Large Language Models using Confidence Disparities in Coreference Resolution #68 通过在共指消解中使用置信度差异研究大型语言模型的交叉性偏见
Authors: [Falaah Arif Khan](https://arxiv.org/search/?searchtype=author&query=Falaah Arif Khan), [Nivedha Sivakumar](https://arxiv.org/search/?searchtype=author&query=Nivedha Sivakumar), [Yinong Oliver Wang](https://arxiv.org/search/?searchtype=author&query=Yinong Oliver Wang), [Katherine Metcalf](https://arxiv.org/search/?searchtype=author&query=Katherine Metcalf), [Cezanne Camacho](https://arxiv.org/search/?searchtype=author&query=Cezanne Camacho), [Barry-John Theobald](https://arxiv.org/search/?searchtype=author&query=Barry-John Theobald), [Luca Zappella](https://arxiv.org/search/?searchtype=author&query=Luca Zappella), [Nicholas Apostoloff](https://arxiv.org/search/?searchtype=author&query=Nicholas Apostoloff) 作者:Falaah Arif Khan、Nivedha Sivakumar、Yinong Oliver Wang、Katherine Metcalf、Cezanne Camacho、Barry-John Theobald、Luca Zappella、Nicholas Apostoloff
Large language models (LLMs) have achieved impressive performance, leading to their widespread adoption as decision-support tools in resource-constrained contexts like hiring and admissions. There is, however, scientific consensus that AI systems can reflect and exacerbate societal biases, raising concerns about identity-based harm when used in critical social contexts. Prior work has laid a solid foundation for assessing bias in LLMs by evaluating demographic disparities in different language reasoning tasks. In this work, we extend single-axis fairness evaluations to examine intersectional bias, recognizing that when multiple axes of discrimination intersect, they create distinct patterns of disadvantage. We create a new benchmark called WinoIdentity by augmenting the WinoBias dataset with 25 demographic markers across 10 attributes, including age, nationality, and race, intersected with binary gender, yielding 245,700 prompts to evaluate 50 distinct bias patterns. Focusing on harms of omission due to underrepresentation, we investigate bias through the lens of uncertainty and propose a group (un)fairness metric called Coreference Confidence Disparity which measures whether models are more or less confident for some intersectional identities than others. We evaluate five recently published LLMs and find confidence disparities as high as 40% along various demographic attributes including body type, sexual orientation and socio-economic status, with models being most uncertain about doubly-disadvantaged identities in anti-stereotypical settings. Surprisingly, coreference confidence decreases even for hegemonic or privileged markers, indicating that the recent impressive performance of LLMs is more likely due to memorization than logical reasoning. Notably, these are two independent failures in value alignment and validity that can compound to cause social harm. 
大型语言模型(LLMs)已取得令人瞩目的表现,推动它们在招聘和录取等资源受限的场景中作为决策支持工具被广泛采用。然而,科学界普遍认为,人工智能系统可能反映并加剧社会偏见,在关键社会情境中使用时会引发基于身份的伤害担忧。先前的研究通过评估不同语言推理任务中的人口统计差异,为评估 LLMs 中的偏见奠定了坚实基础。在本工作中,我们将单轴公平性评估扩展到考察交叉性偏见,认识到当多重歧视轴相交时,会产生独特的不利模式。我们通过在 WinoBias 数据集上增加涵盖年龄、国籍和种族等 10 个属性共 25 个人口统计标记,并与二元性别相交,创建了一个名为 WinoIdentity 的新基准,生成了 245,700 个提示以评估 50 种不同的偏见模式。 我们关注因代表性不足而导致的遗漏性伤害,从不确定性的角度研究偏见,提出了一种群体(不)公平度量——共指置信度差异(Coreference Confidence Disparity),用于衡量模型对某些交叉身份相较于其他身份是否更有或更无信心。我们评估了五个近期发布的 LLMs,发现沿不同人口属性(包括体型、性取向和社会经济地位)置信度差异高达 40%,在反刻板印象的情境中,模型对处于双重弱势的身份尤为不确定。值得注意的是,即便是对于霸权或特权标识,模型的共指置信度也会降低,这表明 LLMs 近期令人印象深刻的表现更可能源于记忆而非逻辑推理。值得关注的是,这两者分别是价值对齐和有效性方面的独立失败,且可能叠加导致社会伤害。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-09 22:24:40 UTC 发布时间:2025-08-09 22:24:40 UTC
#69 Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning #69 少即是多:无训练稀疏注意力结合全局与局部性以实现高效推理
Authors: [Lijie Yang](https://arxiv.org/search/?searchtype=author&query=Lijie Yang), [Zhihao Zhang](https://arxiv.org/search/?searchtype=author&query=Zhihao Zhang), [Arti Jain](https://arxiv.org/search/?searchtype=author&query=Arti Jain), [Shijie Cao](https://arxiv.org/search/?searchtype=author&query=Shijie Cao), [Baihong Yuan](https://arxiv.org/search/?searchtype=author&query=Baihong Yuan), [Yiwei Chen](https://arxiv.org/search/?searchtype=author&query=Yiwei Chen), [Zhihao Jia](https://arxiv.org/search/?searchtype=author&query=Zhihao Jia), [Ravi Netravali](https://arxiv.org/search/?searchtype=author&query=Ravi Netravali) 作者:杨立杰,张志豪,Arti Jain,曹世杰,袁柏宏,陈奕威,贾志豪,Ravi Netravali
Large reasoning models achieve strong performance through test-time scaling but incur substantial computational overhead, particularly from excessive token generation when processing short input prompts. While sparse attention mechanisms can reduce latency and memory usage, existing approaches suffer from significant accuracy degradation due to accumulated errors during long-generation reasoning. These methods generally require either high token retention rates or expensive retraining. We introduce LessIsMore, a training-free sparse attention mechanism for reasoning tasks, which leverages global attention patterns rather than relying on traditional head-specific local optimizations. LessIsMore aggregates token selections from local attention heads with recent contextual information, enabling unified cross-head token ranking for future decoding layers. This unified selection improves generalization and efficiency by avoiding the need to maintain separate token subsets per head. Evaluation across diverse reasoning tasks and benchmarks shows that LessIsMore preserves – and in some cases improves – accuracy while achieving a 1.1× average decoding speed-up compared to full attention. Moreover, LessIsMore attends to 2× fewer tokens without accuracy loss, achieving a 1.13× end-to-end speed-up compared to existing sparse attention methods. 大型推理模型通过在测试时扩展规模取得了强大的性能,但代价是显著的计算开销,尤其是在处理短输入提示时产生过多标记生成。尽管稀疏注意力机制可以降低延迟和内存使用,现有方法在长序列生成推理过程中由于误差累积而导致显著的精度下降。此类方法通常要么需要高标记保留率,要么需要昂贵的再训练。我们提出了 LessIsMore,一种用于推理任务的无训练稀疏注意力机制,它利用全局注意力模式,而不是依赖传统的按头局部优化。LessIsMore 将来自局部注意力头的标记选择与最近的上下文信息汇聚,能够为后续解码层实现跨头的统一标记排序。这种统一选择通过避免为每个注意力头维护独立的标记子集,从而提升了泛化能力和效率。 在多样的推理任务和基准评估中,LessIsMore 在保持——在某些情况下甚至提高——准确率的同时,与全注意力相比实现了平均 1.1× 的解码加速。此外,LessIsMore 在不损失准确率的情况下关注更少的 2× 个标记,与现有稀疏注意力方法相比实现了端到端 1.13× 的加速。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-09 21:10:33 UTC 发布:2025-08-09 21:10:33 协调世界时
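The cross-head aggregation described in the LessIsMore abstract above can be sketched roughly as follows. This is a simplified illustration and not the paper's algorithm: the function name, the score-summing vote, and the parameters `k_per_head`, `budget`, and `recent_window` are all assumptions. It shows only the general idea of merging per-head top-k selections with a window of recent tokens into one token set shared by all heads.

```python
def unified_token_selection(head_scores, k_per_head, budget, recent_window):
    """Pick one shared set of token indices for all attention heads.

    head_scores: list of per-head attention score lists over the same tokens.
    Hypothetical helper; the voting rule (sum of scores) is an assumption.
    """
    n_tokens = len(head_scores[0])

    # Always keep the most recent tokens (recent contextual information).
    selected = set(range(max(0, n_tokens - recent_window), n_tokens))

    # Each head nominates its top-k tokens; nominations are pooled by score.
    votes = {}
    for scores in head_scores:
        top = sorted(range(n_tokens), key=lambda i: scores[i], reverse=True)
        for i in top[:k_per_head]:
            votes[i] = votes.get(i, 0.0) + scores[i]

    # Fill the remaining budget from the unified ranking.
    for i in sorted(votes, key=lambda i: (-votes[i], i)):
        if len(selected) >= budget:
            break
        selected.add(i)
    return sorted(selected)


# Two heads, five tokens (illustrative scores only).
head_scores = [
    [0.5, 0.1, 0.1, 0.2, 0.1],
    [0.1, 0.6, 0.1, 0.1, 0.1],
]
kept = unified_token_selection(head_scores, k_per_head=1, budget=3, recent_window=1)
print(kept)  # [0, 1, 4]
```

Because every head attends to the same `kept` set, there is no need to store a separate token subset per head, which is the efficiency argument the abstract makes.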
#70 BharatBBQ: A Multilingual Bias Benchmark for Question Answering in the Indian Context #70 BharatBBQ:面向印度语境问答的多语言偏见基准
Authors: [Aditya Tomar](https://arxiv.org/search/?searchtype=author&query=Aditya Tomar), [Nihar Ranjan Sahoo](https://arxiv.org/search/?searchtype=author&query=Nihar Ranjan Sahoo), [Pushpak Bhattacharyya](https://arxiv.org/search/?searchtype=author&query=Pushpak Bhattacharyya) 作者:Aditya Tomar、Nihar Ranjan Sahoo、Pushpak Bhattacharyya
Evaluating social biases in language models (LMs) is crucial for ensuring fairness and minimizing the reinforcement of harmful stereotypes in AI systems. Existing benchmarks, such as the Bias Benchmark for Question Answering (BBQ), primarily focus on Western contexts, limiting their applicability to the Indian context. To address this gap, we introduce BharatBBQ, a culturally adapted benchmark designed to assess biases in Hindi, English, Marathi, Bengali, Tamil, Telugu, Odia, and Assamese. BharatBBQ covers 13 social categories, including 3 intersectional groups, reflecting prevalent biases in the Indian sociocultural landscape. Our dataset contains 49,108 examples in one language that are expanded using translation and verification to 392,864 examples in eight different languages. We evaluate five multilingual LM families across zero and few-shot settings, analyzing their bias and stereotypical bias scores. Our findings highlight persistent biases across languages and social categories and often amplified biases in Indian languages compared to English, demonstrating the necessity of linguistically and culturally grounded benchmarks for bias evaluation. 评估语言模型(LM)中的社会偏见对于确保公平并尽量减少人工智能系统强化有害刻板印象至关重要。现有基准,比如问答偏见基准(Bias Benchmark for Question Answering,BBQ),主要关注西方背景,限制了其在印度语境中的适用性。为填补这一空白,我们引入了 BharatBBQ,这是一个经过文化适配的基准,旨在评估印地语、英语、马拉地语、孟加拉语、泰米尔语、泰卢固语、奥利亚语和阿萨姆语中的偏见。BharatBBQ 涵盖 13 个社会类别,包括 3 个交叉群体,反映了印度社会文化背景中普遍存在的偏见。我们的数据集在一种语言中包含 49,108 个示例,并通过翻译与验证扩展为八种语言共 392,864 个示例。我们在零样本和少样本设置下评估了五个多语种语言模型家族,分析了它们的偏见得分和刻板偏见得分。我们的研究结果突出了跨语言和社会类别中持续存在的偏见,并且在印度语言中常常较英语有更强的偏向,这表明在偏见评估中需要以语言和文化为基础的基准。
Subject: Computation and Language 主题:Computation and Language
Publish: 2025-08-09 20:24:24 UTC 发布日期:2025-08-09 20:24:24 协调世界时 (UTC)
#71 SEADialogues: A Multilingual Culturally Grounded Multi-turn Dialogue Dataset on Southeast Asian Languages #71 SEADialogues:一个以文化为基础的东南亚语言多轮多语种对话数据集
Authors: [Muhammad Dehan Al Kautsar](https://arxiv.org/search/?searchtype=author&query=Muhammad Dehan Al Kautsar), [Aswin Candra](https://arxiv.org/search/?searchtype=author&query=Aswin Candra), [Muhammad Alif Al Hakim](https://arxiv.org/search/?searchtype=author&query=Muhammad Alif Al Hakim), [Maxalmina Satria Kahfi](https://arxiv.org/search/?searchtype=author&query=Maxalmina Satria Kahfi), [Fajri Koto](https://arxiv.org/search/?searchtype=author&query=Fajri Koto), [Alham Fikri Aji](https://arxiv.org/search/?searchtype=author&query=Alham Fikri Aji), [Peerat Limkonchotiwat](https://arxiv.org/search/?searchtype=author&query=Peerat Limkonchotiwat), [Ekapol Chuangsuwanich](https://arxiv.org/search/?searchtype=author&query=Ekapol Chuangsuwanich), [Genta Indra Winata](https://arxiv.org/search/?searchtype=author&query=Genta Indra Winata) 作者:Muhammad Dehan Al Kautsar、Aswin Candra、Muhammad Alif Al Hakim、Maxalmina Satria Kahfi、Fajri Koto、Alham Fikri Aji、Peerat Limkonchotiwat、Ekapol Chuangsuwanich、Genta Indra Winata
Although numerous datasets have been developed to support dialogue systems, most existing chit-chat datasets overlook the cultural nuances inherent in natural human conversations. To address this gap, we introduce SEADialogues, a culturally grounded dialogue dataset centered on Southeast Asia, a region with over 700 million people and immense cultural diversity. Our dataset features dialogues in eight languages from six Southeast Asian countries, many of which are low-resource despite having sizable speaker populations. To enhance cultural relevance and personalization, each dialogue includes persona attributes and two culturally grounded topics that reflect everyday life in the respective communities. Furthermore, we release a multi-turn dialogue dataset to advance research on culturally aware and human-centric large language models, including conversational dialogue agents. 尽管已经开发了许多用于支持对话系统的数据集,但大多数现有的闲聊数据集忽视了自然人类对话中固有的文化细微差别。为填补这一空白,我们推出了 SEADialogues,这是一个以东南亚为中心、具有文化根基的对话数据集。东南亚地区人口超过 7 亿,文化极为多样。我们的数据集包含来自东南亚六个国家的八种语言的对话,其中许多语言虽然拥有大量使用者却属于低资源语言。为了增强文化相关性和个性化,每段对话都包括人物属性和两个反映各自社区日常生活的文化主题。此外,我们还发布了一个多轮对话数据集,以推动对具有文化意识和以人为中心的大型语言模型(包括会话型对话代理)的研究。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-09 18:22:35 UTC 发布:2025-08-09 18:22:35 协调世界时
#72 Vec2Summ: Text Summarization via Probabilistic Sentence Embeddings #72 Vec2Summ:通过概率句子嵌入进行文本摘要
Authors: [Mao Li](https://arxiv.org/search/?searchtype=author&query=Mao Li), [Fred Conrad](https://arxiv.org/search/?searchtype=author&query=Fred Conrad), [Johann Gagnon-Bartsch](https://arxiv.org/search/?searchtype=author&query=Johann Gagnon-Bartsch) 作者:Mao Li、Fred Conrad、Johann Gagnon-Bartsch
We propose Vec2Summ, a novel method for abstractive summarization that frames the task as semantic compression. Vec2Summ represents a document collection using a single mean vector in the semantic embedding space, capturing the central meaning of the corpus. To reconstruct fluent summaries, we perform embedding inversion – decoding this mean vector into natural language using a generative language model. To improve reconstruction quality and capture some degree of topical variability, we introduce stochasticity by sampling from a Gaussian distribution centered on the mean. This approach is loosely analogous to bagging in ensemble learning, where controlled randomness encourages more robust and varied outputs. Vec2Summ addresses key limitations of LLM-based summarization methods. It avoids context-length constraints, enables interpretable and controllable generation via semantic parameters, and scales efficiently with corpus size – requiring only O(d + d²) parameters. Empirical results show that Vec2Summ produces coherent summaries for topically focused, order-invariant corpora, with performance comparable to direct LLM summarization in terms of thematic coverage and efficiency, albeit with less fine-grained detail. These results underscore Vec2Summ’s potential in settings where scalability, semantic control, and corpus-level abstraction are prioritized. 我们提出了 Vec2Summ,一种将抽象摘要任务构建为语义压缩的新方法。Vec2Summ 使用语义嵌入空间中的单一均值向量来表示文档集合,捕捉语料的中心含义。为了重构流畅的摘要,我们执行嵌入反演——使用生成式语言模型将该均值向量解码为自然语言。为提高重构质量并捕捉一定程度的主题变异性,我们通过从以该均值为中心的高斯分布中采样来引入随机性。这种方法在某种程度上类似于集成学习中的 bagging,其中受控的随机性鼓励更稳健且多样化的输出。Vec2Summ 解决了基于 LLM 的摘要方法的关键局限。它避免了上下文长度的限制,通过语义参数实现可解释且可控的生成,并且能有效地随语料规模扩展——仅需 O(d + d²) 参数。 实证结果表明,Vec2Summ 能为主题集中、顺序不敏感的语料生成连贯的摘要,其在主题覆盖和效率方面的表现可与直接使用 LLM 进行的摘要相媲美,尽管在细节层面不如后者精细。这些结果强调了在优先考虑可扩展性、语义控制和语料级抽象的场景中,Vec2Summ 的潜在价值。
Subject: Computation and Language 主题:Computation and Language
Publish: 2025-08-09 15:31:02 UTC 发布:2025-08-09 15:31:02 协调世界时
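The Vec2Summ abstract above says a corpus is represented by a mean vector plus a Gaussian around it, which accounts for the O(d + d²) parameter count (d for the mean, d² for a full covariance matrix). The sketch below shows that compression-and-sampling step in plain Python. The toy embeddings and helper names are assumptions, and the embedding-inversion decoder that would turn a sampled vector back into text is deliberately left out.

```python
import math
import random


def mean_and_cov(X):
    """Sample mean (d values) and covariance (d x d values) of row vectors X."""
    n, d = len(X), len(X[0])
    mu = [sum(x[j] for x in X) / n for j in range(d)]
    cov = [[sum((x[i] - mu[i]) * (x[j] - mu[j]) for x in X) / (n - 1)
            for j in range(d)] for i in range(d)]
    return mu, cov


def cholesky(A, jitter=1e-9):
    """Lower-triangular L with L L^T ~= A; jitter keeps the diagonal positive."""
    d = len(A)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(max(A[i][i] - s, 0.0) + jitter)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def sample_gaussian(mu, L, rng):
    """Draw one vector from N(mu, L L^T)."""
    z = [rng.gauss(0.0, 1.0) for _ in mu]
    return [mu[i] + sum(L[i][k] * z[k] for k in range(i + 1))
            for i in range(len(mu))]


# Toy "sentence embeddings" with d = 3 (illustrative values only).
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
mu, cov = mean_and_cov(X)
d = len(mu)
n_params = d + d * d                      # matches the O(d + d^2) count
sample = sample_gaussian(mu, cholesky(cov), random.Random(0))
print(n_params, mu)
```

The parameter count depends only on the embedding dimension d and not on the number of documents, which is why the representation scales with corpus size.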
#73 Rethinking 1-bit Optimization Leveraging Pre-trained Large Language Models #73 重新审视一位元优化:利用预训练大型语言模型
Authors: [Zhijun Tu](https://arxiv.org/search/?searchtype=author&query=Zhijun Tu), [Hanting Chen](https://arxiv.org/search/?searchtype=author&query=Hanting Chen), [Siqi Liu](https://arxiv.org/search/?searchtype=author&query=Siqi Liu), [Chuanjian Liu](https://arxiv.org/search/?searchtype=author&query=Chuanjian Liu), [Jian Li](https://arxiv.org/search/?searchtype=author&query=Jian Li), [Jie Hu](https://arxiv.org/search/?searchtype=author&query=Jie Hu), [Yunhe Wang](https://arxiv.org/search/?searchtype=author&query=Yunhe Wang) 作者:涂志军、陈旱廷、刘思奇、刘传建、李剑、胡杰、王云鹤
1-bit LLM quantization offers significant advantages in reducing storage and computational costs. However, existing methods typically train 1-bit LLMs from scratch, failing to fully leverage pre-trained models. This results in high training costs and notable accuracy degradation. We identify that the large gap between full precision and 1-bit representations makes direct adaptation difficult. In this paper, we introduce a consistent progressive training for both forward and backward, smoothly converting the floating-point weights into the binarized ones. Additionally, we incorporate binary-aware initialization and dual-scaling compensation to reduce the difficulty of progressive training and improve the performance. Experimental results on LLMs of various sizes demonstrate that our method outperforms existing approaches. Our results show that high-performance 1-bit LLMs can be achieved using pre-trained models, eliminating the need for expensive training from scratch. 1 位 LLM 量化在减少存储和计算成本方面具有显著优势。然而,现有方法通常从头训练 1 位 LLM,未能充分利用预训练模型。这导致训练成本高昂且准确性显著下降。我们发现,满精度表示与 1 位表示之间的巨大差距使得直接适配变得困难。在本文中,我们提出了一种正向和反向一致的渐进式训练方法,平滑地将浮点权重转换为二值权重。此外,我们引入了二值感知初始化和双重缩放补偿,以降低渐进式训练的难度并提高性能。对各种规模的 LLM 进行的实验结果表明,我们的方法优于现有方法。我们的结果显示,可以利用预训练模型实现高性能的 1 位 LLM,从而无需昂贵的从头训练。
Subject: Computation and Language 主题:Computation and Language
Publish: 2025-08-09 13:00:16 UTC 发布:2025-08-09 13:00:16 UTC
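The abstract above describes smoothly converting floating-point weights into binarized ones but does not give the schedule. One simple way to picture such a transition, offered purely as an assumption and not as the paper's method, is a linear interpolation between the original weights and their scaled sign, with the mixing factor `lam` moved from 0 to 1 over training. The scale `alpha = mean(|w|)` is a common choice in binarization work and is also an assumption here.

```python
def binarize(weights):
    """1-bit version of the weights: sign(w) scaled by the mean magnitude."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    return [alpha if w >= 0 else -alpha for w in weights]


def progressive_weights(weights, lam):
    """Blend full-precision and binarized weights.

    lam = 0.0 gives the original weights, lam = 1.0 gives fully binarized
    ones. Hypothetical helper: the linear blend is an illustrative choice.
    """
    binary = binarize(weights)
    return [(1.0 - lam) * w + lam * b for w, b in zip(weights, binary)]


w = [0.5, -1.5, 1.0]                     # toy weights, alpha = 1.0
for lam in (0.0, 0.5, 1.0):
    print(lam, progressive_weights(w, lam))
```

With these toy weights the midpoint `lam = 0.5` gives `[0.75, -1.25, 1.0]`, halfway between the original values and `[1.0, -1.0, 1.0]`.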
#74 Two-Stage Quranic QA via Ensemble Retrieval and Instruction-Tuned Answer Extraction #74 通过集成检索和指令调优答案提取的两阶段古兰经问答
Authors: [Mohamed Basem](https://arxiv.org/search/?searchtype=author&query=Mohamed Basem), [Islam Oshallah](https://arxiv.org/search/?searchtype=author&query=Islam Oshallah), [Ali Hamdi](https://arxiv.org/search/?searchtype=author&query=Ali Hamdi), [Khaled Shaban](https://arxiv.org/search/?searchtype=author&query=Khaled Shaban), [Hozaifa Kassab](https://arxiv.org/search/?searchtype=author&query=Hozaifa Kassab) 作者:Mohamed Basem、Islam Oshallah、Ali Hamdi、Khaled Shaban、Hozaifa Kassab
Quranic Question Answering presents unique challenges due to the linguistic complexity of Classical Arabic and the semantic richness of religious texts. In this paper, we propose a novel two-stage framework that addresses both passage retrieval and answer extraction. For passage retrieval, we ensemble fine-tuned Arabic language models to achieve superior ranking performance. For answer extraction, we employ instruction-tuned large language models with few-shot prompting to overcome the limitations of fine-tuning on small datasets. Our approach achieves state-of-the-art results on the Quran QA 2023 Shared Task, with a MAP@10 of 0.3128 and MRR@10 of 0.5763 for retrieval, and a pAP@10 of 0.669 for extraction, substantially outperforming previous methods. These results demonstrate that combining model ensembling and instruction-tuned language models effectively addresses the challenges of low-resource question answering in specialized domains. 古兰经问答由于古典阿拉伯语的语言复杂性和宗教文本的语义丰富性而具有独特挑战。在本文中,我们提出了一个新颖的两阶段框架,兼顾了段落检索和答案抽取。对于段落检索,我们集成了微调的阿拉伯语语言模型以实现更优的排序性能。对于答案抽取,我们采用了经过指令微调的大型语言模型并结合少样本提示,以克服在小数据集上微调的局限性。我们的方法在 2023 年古兰经问答共享任务上取得了最先进的结果,检索方面的 MAP@10 为 0.3128,MRR@10 为 0.5763,抽取方面的 pAP@10 为 0.669,显著优于以往方法。这些结果表明,模型集成与指令微调语言模型的结合能够有效应对专门领域低资源问答的挑战。
Subjects: Computation and Language, Information Retrieval 主题:计算与语言,信息检索
Publish: 2025-08-09 12:37:19 UTC 发布时间:2025-08-09 12:37:19 UTC
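The retrieval scores quoted above, MAP@10 and MRR@10, are averages of per-question values. The sketch below shows one common per-question definition of each. Normalization conventions for AP@k vary, and the shared task's official scorer may differ from this, so the code is an illustrative assumption; the passage IDs are made up.

```python
def reciprocal_rank_at_k(ranked, relevant, k=10):
    """1 / rank of the first relevant passage within the top k, else 0."""
    for rank, pid in enumerate(ranked[:k], start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0


def average_precision_at_k(ranked, relevant, k=10):
    """Mean of precision values at each relevant hit within the top k.

    Normalized by min(len(relevant), k); other conventions exist.
    """
    hits, total = 0, 0.0
    for rank, pid in enumerate(ranked[:k], start=1):
        if pid in relevant:
            hits += 1
            total += hits / rank
    denom = min(len(relevant), k)
    return total / denom if denom else 0.0


# One toy question: system ranking and gold passages (hypothetical IDs).
ranked = ["p3", "p1", "p7", "p2"]
relevant = {"p1", "p2"}
rr = reciprocal_rank_at_k(ranked, relevant)
ap = average_precision_at_k(ranked, relevant)
print(rr, ap)  # 0.5 0.5
```

MRR@10 and MAP@10 are then the means of `rr` and `ap` over all questions in the evaluation set.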
#75 Model-Agnostic Sentiment Distribution Stability Analysis for Robust LLM-Generated Texts Detection #75 与模型无关的情感分布稳定性分析,用于鲁棒的 LLM 生成文本检测
Authors: [Siyuan Li](https://arxiv.org/search/?searchtype=author&query=Siyuan Li), [Xi Lin](https://arxiv.org/search/?searchtype=author&query=Xi Lin), [Guangyan Li](https://arxiv.org/search/?searchtype=author&query=Guangyan Li), [Zehao Liu](https://arxiv.org/search/?searchtype=author&query=Zehao Liu), [Aodu Wulianghai](https://arxiv.org/search/?searchtype=author&query=Aodu Wulianghai), [Li Ding](https://arxiv.org/search/?searchtype=author&query=Li Ding), [Jun Wu](https://arxiv.org/search/?searchtype=author&query=Jun Wu), [Jianhua Li](https://arxiv.org/search/?searchtype=author&query=Jianhua Li) 作者:李思远、林曦、李广彦、刘泽浩、奥度·乌良海、丁力、吴军、李建华
The rapid advancement of large language models (LLMs) has resulted in increasingly sophisticated AI-generated content, posing significant challenges in distinguishing LLM-generated text from human-written language. Existing detection methods, primarily based on lexical heuristics or fine-tuned classifiers, often suffer from limited generalizability and are vulnerable to paraphrasing, adversarial perturbations, and cross-domain shifts. In this work, we propose SentiDetect, a model-agnostic framework for detecting LLM-generated text by analyzing the divergence in sentiment distribution stability. Our method is motivated by the empirical observation that LLM outputs tend to exhibit emotionally consistent patterns, whereas human-written texts display greater emotional variability. To capture this phenomenon, we define two complementary metrics: sentiment distribution consistency and sentiment distribution preservation, which quantify stability under sentiment-altering and semantic-preserving transformations. We evaluate SentiDetect on five diverse datasets and a range of advanced LLMs, including Gemini-1.5-Pro, Claude-3, GPT-4-0613, and LLaMa-3.3. Experimental results demonstrate its superiority over state-of-the-art baselines, with over 16% and 11% F1 score improvements on Gemini-1.5-Pro and GPT-4-0613, respectively. Moreover, SentiDetect also shows greater robustness to paraphrasing, adversarial attacks, and text length variations, outperforming existing detectors in challenging scenarios. 
大型语言模型(LLMs)的快速发展带来了越来越复杂的 AI 生成内容,使得区分 LLM 生成文本与人类撰写文本变得极具挑战性。现有的检测方法主要基于词汇启发式或微调的分类器,常常面临泛化能力有限的问题,且易受改写、对抗性扰动和跨领域迁移的影响。在本工作中,我们提出了 SentiDetect,一种模型无关的框架,通过分析情感分布稳定性的差异来检测 LLM 生成文本。我们的方法基于经验观察:LLM 输出往往表现出情感上的一致性模式,而人类撰写的文本则显示出更大的情感多样性。为捕捉这一现象,我们定义了两个互补的度量:情感分布一致性和情感分布保持性,用以量化在改变情感和保持语义不变的变换下的稳定性。 我们在五个不同的数据集和一系列先进的 LLMs 上评估了 SentiDetect,包括 Gemini-1.5-Pro、Claude-3、GPT-4-0613 和 LLaMa-3.3。实验结果表明,其优于最先进的基线方法,在 Gemini-1.5-Pro 和 GPT-4-0613 上分别提高了超过 16%和 11%的 F1 分数。此外,SentiDetect 在对释义、对抗攻击和文本长度变化的稳健性方面也表现更佳,在具有挑战性的场景中超过了现有的检测器。
Subjects: Computation and Language, Machine Learning 主题:计算与语言,机器学习
Publish: 2025-08-09 09:55:47 UTC 发布时间:2025-08-09 09:55:47 UTC
#76 Score Before You Speak: Improving Persona Consistency in Dialogue Generation using Response Quality Scores #76 在你开口前评分:使用回复质量分数提高对话生成中的人格一致性
Authors: [Arpita Saggar](https://arxiv.org/search/?searchtype=author&query=Arpita Saggar), [Jonathan C. Darling](https://arxiv.org/search/?searchtype=author&query=Jonathan C. Darling), [Vania Dimitrova](https://arxiv.org/search/?searchtype=author&query=Vania Dimitrova), [Duygu Sarikaya](https://arxiv.org/search/?searchtype=author&query=Duygu Sarikaya), [David C. Hogg](https://arxiv.org/search/?searchtype=author&query=David C. Hogg) 作者:Arpita Saggar、Jonathan C. Darling、Vania Dimitrova、Duygu Sarikaya、David C. Hogg
Persona-based dialogue generation is an important milestone towards building conversational artificial intelligence. Despite the ever-improving capabilities of large language models (LLMs), effectively integrating persona fidelity in conversations remains challenging due to the limited diversity in existing dialogue data. We propose a novel framework SBS (Score-Before-Speaking), which outperforms previous methods and yields improvements for both million and billion-parameter models. Unlike previous methods, SBS unifies the learning of responses and their relative quality into a single step. The key innovation is to train a dialogue model to correlate augmented responses with a quality score during training and then leverage this knowledge at inference. We use noun-based substitution for augmentation and semantic similarity-based scores as a proxy for response quality. Through extensive experiments with benchmark datasets (PERSONA-CHAT and ConvAI2), we show that score-conditioned training allows existing models to better capture a spectrum of persona-consistent dialogues. Our ablation studies also demonstrate that including scores in the input prompt during training is superior to conventional training setups. Code and further details are available at https://arpita2512.github.io/score_before_you_speak 基于人物设定的对话生成是构建会话式人工智能的重要里程碑。尽管大型语言模型(LLMs)的能力不断提升,但由于现有对话数据多样性有限,在对话中有效整合人物设定一致性仍然具有挑战性。我们提出了一种新框架 SBS(Score-Before-Speaking),其性能优于以往方法,并在百万级和十亿级参数模型上均带来提升。与以往方法不同,SBS 将回复及其相对质量的学习统一为一步完成。该方法的关键创新是:在训练期间让对话模型学习将增强后的回复与质量分数相关联,然后在推理时利用这一知识。我们使用基于名词的替换进行增强,并以语义相似度为代理来衡量回复质量。通过在基准数据集(PERSONA-CHAT 和 ConvAI2)上的大量实验,我们表明基于分数的条件化训练使现有模型能更好地捕捉人物设定一致对话的多样性谱系。 我们的消融研究还表明,在训练过程中在输入提示中包含评分优于传统的训练设置。代码和更多细节可在 https://arpita2512.github.io/score_before_you_speak 获取
Subject: Computation and Language 主题:Computation and Language
Publish: 2025-08-09 08:30:06 UTC 发布:2025-08-09 08:30:06 UTC
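The SBS abstract above describes pairing augmented responses with a quality score inside the training input and then conditioning on a high score at inference. A minimal sketch of how such score-conditioned examples could be assembled is shown below. The tag layout (`<persona>`, `<score>`, and so on), the helper names, and the token-overlap similarity are assumptions; the paper uses semantic-similarity scores, for which the Jaccard overlap here is only a self-contained stand-in.

```python
def toy_similarity(a, b):
    """Jaccard token overlap, a stand-in for a semantic similarity model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def build_example(persona, context, response, score):
    """Serialize one training example with the quality score in the prompt.

    Hypothetical format: the tags are illustrative, not the paper's.
    """
    return (f"<persona> {persona} <context> {context} "
            f"<score> {score:.2f} <response> {response}")


persona = "i have a dog and enjoy the outdoors"
context = "what do you do on weekends?"
gold = "i love hiking with my dog"
augmented = "i love hiking with my cat"   # noun substituted: dog -> cat

training_examples = [
    build_example(persona, context, gold, 1.0),
    build_example(persona, context, augmented, toy_similarity(gold, augmented)),
]
for ex in training_examples:
    print(ex)

# At inference, the prompt would request the highest score and leave the
# response empty for the model to generate.
inference_prompt = build_example(persona, context, "", 1.0)
```

The model sees both good and degraded responses together with their scores, so asking for a score of 1.00 at inference steers generation toward persona-consistent replies.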
#77 The ReQAP System for Question Answering over Personal Information #77 用于个人信息问答的 ReQAP 系统
Authors: [Philipp Christmann](https://arxiv.org/search/?searchtype=author&query=Philipp Christmann), [Gerhard Weikum](https://arxiv.org/search/?searchtype=author&query=Gerhard Weikum) 作者:Philipp Christmann、Gerhard Weikum
Personal information is abundant on users’ devices, from structured data in calendar, shopping records or fitness tools, to unstructured contents in mail and social media posts. This work presents the ReQAP system that supports users with answers for complex questions that involve filters, joins and aggregation over heterogeneous sources. The unique trait of ReQAP is that it recursively decomposes questions and incrementally builds an operator tree for execution. Both the question interpretation and the individual operators make smart use of light-weight language models, with judicious fine-tuning. The demo showcases the rich functionality for advanced user questions, and also offers detailed tracking of how the answers are computed by the operators in the execution tree. Being able to trace answers back to the underlying sources is vital for human comprehensibility and user trust in the system. 个人设备上充满了个人信息,从日历、购物记录或健身工具中的结构化数据,到邮件和社交媒体帖文中的非结构化内容。本文介绍了 ReQAP 系统,该系统为涉及对异构来源进行筛选、连接和聚合的复杂问题提供答案。ReQAP 的独特之处在于它递归地分解问题并逐步构建用于执行的操作符树。问题解释和各个操作符都巧妙地利用了轻量级语言模型,并进行了谨慎的微调。演示展示了针对高级用户问题的丰富功能,并提供了执行树中各操作符如何计算出答案的详细追踪。能够将答案追溯到底层来源对于人类的可理解性和用户对系统的信任至关重要。
Subjects: Computation and Language, Information Retrieval 主题:计算与语言,信息检索
Publish: 2025-08-09 08:21:53 UTC 发布日期:2025-08-09 08:21:53 协调世界时
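The operator tree mentioned in the ReQAP abstract above (filters, joins, and aggregation over heterogeneous sources) can be pictured with a small recursive evaluator. This sketch is an assumption-laden illustration and not the ReQAP implementation: the node format, operator names, and sample data are hypothetical, and plain Python predicates stand in for the language-model-based operators the paper describes.

```python
def execute(node, sources):
    """Recursively evaluate an operator tree over named data sources."""
    op = node["op"]
    if op == "scan":
        return list(sources[node["source"]])
    if op == "filter":
        return [r for r in execute(node["child"], sources) if node["pred"](r)]
    if op == "join":
        left = execute(node["left"], sources)
        right = execute(node["right"], sources)
        key = node["key"]
        return [{**l, **r} for l in left for r in right if l[key] == r[key]]
    if op == "aggregate":
        return node["fn"](execute(node["child"], sources))
    raise ValueError(f"unknown operator: {op}")


# Hypothetical personal data from two sources.
sources = {
    "workouts": [
        {"date": "2025-08-01", "type": "run", "km": 5.0},
        {"date": "2025-08-02", "type": "bike", "km": 20.0},
        {"date": "2025-08-03", "type": "run", "km": 7.5},
    ],
    "calendar": [
        {"date": "2025-08-01", "event": "team meeting"},
        {"date": "2025-08-03", "event": "dentist"},
    ],
}

# "How many km did I run on days when I had a meeting?"
tree = {
    "op": "aggregate",
    "fn": lambda rows: sum(r["km"] for r in rows),
    "child": {
        "op": "join",
        "key": "date",
        "left": {"op": "filter", "pred": lambda r: r["type"] == "run",
                 "child": {"op": "scan", "source": "workouts"}},
        "right": {"op": "filter", "pred": lambda r: "meeting" in r["event"],
                  "child": {"op": "scan", "source": "calendar"}},
    },
}
answer = execute(tree, sources)
print(answer)  # 5.0
```

Because each node returns its intermediate rows, a tree of this shape also makes it straightforward to trace an answer back to the source records that produced it.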
#78 ESNERA: Empirical and semantic named entity alignment for named entity dataset merging #78 ESNERA:用于命名实体数据集合并的经验与语义命名实体对齐
Authors: [Xiaobo Zhang](https://arxiv.org/search/?searchtype=author&query=Xiaobo Zhang), [Congqing He](https://arxiv.org/search/?searchtype=author&query=Congqing He), [Ying He](https://arxiv.org/search/?searchtype=author&query=Ying He), [Jian Peng](https://arxiv.org/search/?searchtype=author&query=Jian Peng), [Dajie Fu](https://arxiv.org/search/?searchtype=author&query=Dajie Fu), [Tien-Ping Tan](https://arxiv.org/search/?searchtype=author&query=Tien-Ping Tan) 作者:张晓博、何聪清、何颖、彭建、付大杰、陈天平
Named Entity Recognition (NER) is a fundamental task in natural language processing. It remains a research hotspot due to its wide applicability across domains. Although recent advances in deep learning have significantly improved NER performance, they rely heavily on large, high-quality annotated datasets. However, building these datasets is expensive and time-consuming, posing a major bottleneck for further research. Current dataset merging approaches mainly focus on strategies like manual label mapping or constructing label graphs, which lack interpretability and scalability. To address this, we propose an automatic label alignment method based on label similarity. The method combines empirical and semantic similarities, using a greedy pairwise merging strategy to unify label spaces across different datasets. Experiments are conducted in two stages: first, merging three existing NER datasets into a unified corpus with minimal impact on NER performance; second, integrating this corpus with a small-scale, self-built dataset in the financial domain. The results show that our method enables effective dataset merging and enhances NER performance in the low-resource financial domain. This study presents an efficient, interpretable, and scalable solution for integrating multi-source NER corpora. 命名实体识别(NER)是自然语言处理中的一项基础任务。由于其在各领域的广泛适用性,它仍然是一个研究热点。尽管近期深度学习的进展显著提升了 NER 的性能,但这些方法在很大程度上依赖于大型、高质量的标注数据集。然而,构建这些数据集既昂贵又耗时,成为进一步研究的主要瓶颈。当前的数据集合并方法主要侧重于手工标签映射或构建标签图等策略,这些方法缺乏可解释性和可扩展性。为了解决这一问题,我们提出了一种基于标签相似度的自动标签对齐方法。该方法结合经验相似度和语义相似度,采用贪心的成对合并策略来统一不同数据集的标签空间。实验分两个阶段进行:首先,将三个现有的 NER 数据集合并为一个统一语料库,对 NER 性能的影响最小;其次,将该语料库与一个小规模的自建金融领域数据集整合。结果表明,我们的方法能够实现有效的数据集合并,并提升低资源金融领域的 NER 性能。 本研究提出了一种用于整合多源命名实体识别语料的高效、可解释且可扩展的解决方案。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-09 08:15:26 UTC 发布:2025-08-09 08:15:26 UTC
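The ESNERA abstract above describes combining empirical and semantic label similarities and merging labels greedily in pairs. The following sketch shows one way such a procedure could look. The weighting, the threshold, the union-find bookkeeping, and all similarity values are assumptions made for illustration; how the paper actually computes and combines the two similarities is not given in the abstract.

```python
def greedy_label_merge(labels, empirical, semantic, weight=0.5, threshold=0.7):
    """Greedily merge label pairs whose combined similarity passes a threshold.

    empirical / semantic: dicts mapping frozenset({a, b}) to a similarity
    in [0, 1]. Hypothetical helper: all parameters are illustrative.
    """
    parent = {label: label for label in labels}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    pairs = []
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            key = frozenset((a, b))
            score = (weight * empirical.get(key, 0.0)
                     + (1.0 - weight) * semantic.get(key, 0.0))
            pairs.append((score, a, b))

    # Most similar pairs first; stop once scores fall below the threshold.
    pairs.sort(key=lambda t: -t[0])
    for score, a, b in pairs:
        if score < threshold:
            break
        parent[find(b)] = find(a)

    groups = {}
    for label in labels:
        groups.setdefault(find(label), []).append(label)
    return sorted(sorted(g) for g in groups.values())


# Labels pooled from several datasets; similarity values are made up.
labels = ["PER", "PERSON", "ORG", "LOC", "GPE"]
empirical = {frozenset(("PER", "PERSON")): 0.90, frozenset(("LOC", "GPE")): 0.50}
semantic = {frozenset(("PER", "PERSON")): 0.95, frozenset(("LOC", "GPE")): 0.60}
merged = greedy_label_merge(labels, empirical, semantic)
print(merged)  # [['GPE'], ['LOC'], ['ORG'], ['PER', 'PERSON']]
```

In this toy run only `PER` and `PERSON` clear the threshold, so `LOC` and `GPE` stay separate; lowering the threshold to 0.5 would merge them as well.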
#79 Text to Speech System for Meitei Mayek Script #79 面向梅泰梅耶克文字的梅泰语文本转语音系统
Authors: [Gangular Singh Irengbam](https://arxiv.org/search/?searchtype=author&query=Gangular Singh Irengbam), [Nirvash Singh Wahengbam](https://arxiv.org/search/?searchtype=author&query=Nirvash Singh Wahengbam), [Lanthoiba Meitei Khumanthem](https://arxiv.org/search/?searchtype=author&query=Lanthoiba Meitei Khumanthem), [Paikhomba Oinam](https://arxiv.org/search/?searchtype=author&query=Paikhomba Oinam) 作者:Gangular Singh Irengbam、Nirvash Singh Wahengbam、Lanthoiba Meitei Khumanthem、Paikhomba Oinam
This paper presents the development of a Text-to-Speech (TTS) system for the Manipuri language using the Meitei Mayek script. Leveraging Tacotron 2 and HiFi-GAN, we introduce a neural TTS architecture adapted to support tonal phonology and under-resourced linguistic environments. We develop a phoneme mapping for Meitei Mayek to ARPAbet, curate a single-speaker dataset, and demonstrate intelligible and natural speech synthesis, validated through subjective and objective metrics. This system lays the groundwork for linguistic preservation and technological inclusion of Manipuri. 本文介绍了使用梅泰梅耶克(Meitei Mayek)文字为曼尼普里语(Manipuri)开发的文本转语音(TTS)系统。利用 Tacotron 2 和 HiFi-GAN,我们提出了一种神经 TTS 架构,经过调整以支持声调音系并适应资源匮乏的语言环境。我们为梅泰梅耶克制定了到 ARPAbet 的音素映射,整理了单说话者数据集,并通过主观和客观指标验证了可懂且自然的语音合成效果。该系统为曼尼普里语的语言保护和技术包容奠定了基础。
Subjects: Computation and Language, Machine Learning, Sound 主题:计算与语言、机器学习、声音
Publish: 2025-08-09 07:40:53 UTC 发布:2025-08-09 07:40:53 UTC
#80 Annotating Errors in English Learners' Written Language Production: Advancing Automated Written Feedback Systems #80 在英语学习者书面语言产出中标注错误:推进自动化书面反馈系统
Authors: [Steven Coyne](https://arxiv.org/search/?searchtype=author&query=Steven Coyne), [Diana Galvan-Sosa](https://arxiv.org/search/?searchtype=author&query=Diana Galvan-Sosa), [Ryan Spring](https://arxiv.org/search/?searchtype=author&query=Ryan Spring), [Camélia Guerraoui](https://arxiv.org/search/?searchtype=author&query=Camélia Guerraoui), [Michael Zock](https://arxiv.org/search/?searchtype=author&query=Michael Zock), [Keisuke Sakaguchi](https://arxiv.org/search/?searchtype=author&query=Keisuke Sakaguchi), [Kentaro Inui](https://arxiv.org/search/?searchtype=author&query=Kentaro Inui) 作者:Steven Coyne、Diana Galvan-Sosa、Ryan Spring、Camélia Guerraoui、Michael Zock、Keisuke Sakaguchi、Kentaro Inui
Recent advances in natural language processing (NLP) have contributed to the development of automated writing evaluation (AWE) systems that can correct grammatical errors. However, while these systems are effective at improving text, they are not optimally designed for language learning. They favor direct revisions, often with a click-to-fix functionality that can be applied without considering the reason for the correction. Meanwhile, depending on the error type, learners may benefit most from simple explanations and strategically indirect hints, especially on generalizable grammatical rules. To support the generation of such feedback, we introduce an annotation framework that models each error’s error type and generalizability. For error type classification, we introduce a typology focused on inferring learners’ knowledge gaps by connecting their errors to specific grammatical patterns. Following this framework, we collect a dataset of annotated learner errors and corresponding human-written feedback comments, each labeled as a direct correction or hint. With this data, we evaluate keyword-guided, keyword-free, and template-guided methods of generating feedback using large language models (LLMs). Human teachers examined each system’s outputs, assessing them on grounds including relevance, factuality, and comprehensibility. We report on the development of the dataset and the comparative performance of the systems investigated.
Subject: Computation and Language
Publish: 2025-08-09 04:06:18 UTC
#81 SEVADE: Self-Evolving Multi-Agent Analysis with Decoupled Evaluation for Hallucination-Resistant Irony Detection
Authors: [Ziqi Liu](https://arxiv.org/search/?searchtype=author&query=Ziqi Liu), [Yangbin Chen](https://arxiv.org/search/?searchtype=author&query=Yangbin Chen), [Ziyang Zhou](https://arxiv.org/search/?searchtype=author&query=Ziyang Zhou), [Yilin Li](https://arxiv.org/search/?searchtype=author&query=Yilin Li), [Mingxuan Hu](https://arxiv.org/search/?searchtype=author&query=Mingxuan Hu), [Yushan Pan](https://arxiv.org/search/?searchtype=author&query=Yushan Pan), [Zhijie Xu](https://arxiv.org/search/?searchtype=author&query=Zhijie Xu)
Sarcasm detection is a crucial yet challenging Natural Language Processing task. Existing Large Language Model methods are often limited by single-perspective analysis, static reasoning pathways, and a susceptibility to hallucination when processing complex ironic rhetoric, which impacts their accuracy and reliability. To address these challenges, we propose SEVADE, a novel Self-Evolving multi-agent Analysis framework with Decoupled Evaluation for hallucination-resistant sarcasm detection. The core of our framework is a Dynamic Agentive Reasoning Engine (DARE), which utilizes a team of specialized agents grounded in linguistic theory to perform a multifaceted deconstruction of the text and generate a structured reasoning chain. Subsequently, a separate lightweight rationale adjudicator (RA) performs the final classification based solely on this reasoning chain. This decoupled architecture is designed to mitigate the risk of hallucination by separating complex reasoning from the final judgment. Extensive experiments on four benchmark datasets demonstrate that our framework achieves state-of-the-art performance, with average improvements of 6.75% in Accuracy and 6.29% in Macro-F1 score.
Subjects: Computation and Language, Multiagent Systems
Publish: 2025-08-09 03:25:45 UTC
#82 Many-Turn Jailbreaking
Authors: [Xianjun Yang](https://arxiv.org/search/?searchtype=author&query=Xianjun Yang), [Liqiang Xiao](https://arxiv.org/search/?searchtype=author&query=Liqiang Xiao), [Shiyang Li](https://arxiv.org/search/?searchtype=author&query=Shiyang Li), [Faisal Ladhak](https://arxiv.org/search/?searchtype=author&query=Faisal Ladhak), [Hyokun Yun](https://arxiv.org/search/?searchtype=author&query=Hyokun Yun), [Linda Ruth Petzold](https://arxiv.org/search/?searchtype=author&query=Linda Ruth Petzold), [Yi Xu](https://arxiv.org/search/?searchtype=author&query=Yi Xu), [William Yang Wang](https://arxiv.org/search/?searchtype=author&query=William Yang Wang)
Current jailbreaking work on large language models (LLMs) aims to elicit unsafe outputs from given prompts. However, it focuses only on single-turn jailbreaking that targets one specific query. In contrast, advanced LLMs are designed to handle extremely long contexts and can thus conduct multi-turn conversations. We therefore propose exploring multi-turn jailbreaking, in which the jailbroken LLMs are continuously tested on more than the first-turn conversation or a single target query. This is an even more serious threat because 1) it is common for users to continue asking relevant follow-up questions to clarify certain jailbroken details, and 2) it is also possible that the initial round of jailbreaking causes the LLMs to respond to additional irrelevant questions consistently. As the first step in exploring multi-turn jailbreaking (first draft completed in June 2024), we construct a Multi-Turn Jailbreak Benchmark (MTJ-Bench) for benchmarking this setting on a series of open- and closed-source models and provide novel insights into this new safety threat. By revealing this new vulnerability, we aim to call for community efforts to build safer LLMs and pave the way for a more in-depth understanding of jailbreaking LLMs.
Subjects: Computation and Language, Artificial Intelligence
Publish: 2025-08-09 00:02:39 UTC
#83 Large Language Models for Oral History Understanding with Text Classification and Sentiment Analysis
Authors: [Komala Subramanyam Cherukuri](https://arxiv.org/search/?searchtype=author&query=Komala Subramanyam Cherukuri), [Pranav Abishai Moses](https://arxiv.org/search/?searchtype=author&query=Pranav Abishai Moses), [Aisa Sakata](https://arxiv.org/search/?searchtype=author&query=Aisa Sakata), [Jiangping Chen](https://arxiv.org/search/?searchtype=author&query=Jiangping Chen), [Haihua Chen](https://arxiv.org/search/?searchtype=author&query=Haihua Chen)
Oral histories are vital records of lived experience, particularly within communities affected by systemic injustice and historical erasure. Effective and efficient analysis of these archives can promote access to and understanding of the oral histories. However, large-scale analysis of these archives remains limited due to their unstructured format, emotional complexity, and high annotation costs. This paper presents a scalable framework to automate semantic and sentiment annotation for Japanese American Incarceration Oral History. Using LLMs, we construct a high-quality dataset, evaluate multiple models, and test prompt engineering strategies in historically sensitive contexts. Our multiphase approach combines expert annotation, prompt design, and LLM evaluation with ChatGPT, Llama, and Qwen. We labeled 558 sentences from 15 narrators for sentiment and semantic classification, then evaluated zero-shot, few-shot, and RAG strategies. For semantic classification, ChatGPT achieved the highest F1 score (88.71%), followed by Llama (84.99%) and Qwen (83.72%). For sentiment analysis, Llama slightly outperformed Qwen (82.66%) and ChatGPT (82.29%), with all models showing comparable results. The best prompt configurations were used to annotate 92,191 sentences from 1,002 interviews in the JAIOH collection. Our findings show that LLMs can effectively perform semantic and sentiment annotation across large oral history collections when guided by well-designed prompts. This study provides a reusable annotation pipeline and practical guidance for applying LLMs in culturally sensitive archival analysis. By bridging archival ethics with scalable NLP techniques, this work lays the groundwork for responsible use of artificial intelligence in digital humanities and preservation of collective memory. GitHub: https://github.com/kc6699c/LLM4OralHistoryAnalysis.
Subjects: Computation and Language, Artificial Intelligence
Publish: 2025-08-08 22:06:23 UTC
#84 Play Favorites: A Statistical Method to Measure Self-Bias in LLM-as-a-Judge
Authors: [Evangelia Spiliopoulou](https://arxiv.org/search/?searchtype=author&query=Evangelia Spiliopoulou), [Riccardo Fogliato](https://arxiv.org/search/?searchtype=author&query=Riccardo Fogliato), [Hanna Burnsky](https://arxiv.org/search/?searchtype=author&query=Hanna Burnsky), [Tamer Soliman](https://arxiv.org/search/?searchtype=author&query=Tamer Soliman), [Jie Ma](https://arxiv.org/search/?searchtype=author&query=Jie Ma), [Graham Horwood](https://arxiv.org/search/?searchtype=author&query=Graham Horwood), [Miguel Ballesteros](https://arxiv.org/search/?searchtype=author&query=Miguel Ballesteros)
Large language models (LLMs) can serve as judges that offer rapid and reliable assessments of other LLM outputs. However, models may systematically assign overly favorable ratings to their own outputs, a phenomenon known as self-bias, which can distort evaluations of true model performance. Previous studies often conflate genuine differences in model quality with bias or incorrectly assume that evaluations from LLMs and humans follow the same rating distributions. In this work, we present a statistical framework that explicitly formalizes assumptions under which self-bias can be identified and estimated. Our method models the difference in the scoring distribution that LLM-as-a-judge assigns to its own completions compared to other models, while accounting for the underlying quality of the completions provided by an independent, third-party judge (e.g., humans). Our method reliably isolates and quantifies self-bias, even when models vary in ability, ensuring that genuine performance differences are not mistaken for self-bias. We conduct an empirical analysis of self-bias on a large dataset (>5000 prompt-completion pairs) consisting of expert human annotations and judgments from nine different LLM judges. We find that some models, such as GPT-4o and Claude 3.5 Sonnet, systematically assign higher scores to their own outputs. These models also display family bias, systematically assigning higher ratings to outputs produced by other models of the same family. Our findings highlight potential pitfalls of using LLM judges and offer practical guidance to mitigate biases when interpreting automated evaluations.
Subjects: Computation and Language, Artificial Intelligence
Publish: 2025-08-08 21:22:12 UTC
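The idea in #84 of separating self-bias from genuine quality differences can be illustrated with a much simpler stand-in than the paper's statistical framework. The sketch below groups completions by the score an independent judge gave them, then compares the LLM judge's scores for its own outputs against its scores for other models' outputs within each group. The function name `stratified_self_bias`, the record layout, and the stratified-mean estimator itself are assumptions made for illustration; the abstract does not specify the authors' estimator.

```python
from collections import defaultdict
from statistics import mean


def stratified_self_bias(records):
    """Estimate self-bias as the average own-vs-other score gap at equal quality.

    records: iterable of (human_score, judge_score, is_own) tuples, where
    human_score is the independent quality rating (hypothetical layout).
    """
    # Bucket judge scores by independent quality level, split by authorship.
    strata = defaultdict(lambda: {True: [], False: []})
    for human_score, judge_score, is_own in records:
        strata[human_score][bool(is_own)].append(judge_score)

    gaps = []
    for groups in strata.values():
        # Only quality levels containing both own and other outputs are informative.
        if groups[True] and groups[False]:
            gaps.append(mean(groups[True]) - mean(groups[False]))
    if not gaps:
        raise ValueError("no quality level has both own and other-model outputs")
    return mean(gaps)


# Toy data: at equal human-rated quality, the judge scores its own outputs higher.
toy_records = [
    (3, 4.0, True), (3, 3.0, False),
    (5, 5.0, True), (5, 4.5, False),
    (1, 1.0, False),  # no own output at this level, so it is ignored
]
print(stratified_self_bias(toy_records))
```

Because the comparison happens within a quality level, a judge that is simply a stronger model does not register as biased under this sketch; only a gap at matched quality does.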
#85 Do Biased Models Have Biased Thoughts?
Authors: [Swati Rajwal](https://arxiv.org/search/?searchtype=author&query=Swati Rajwal), [Shivank Garg](https://arxiv.org/search/?searchtype=author&query=Shivank Garg), [Reem Abdel-Salam](https://arxiv.org/search/?searchtype=author&query=Reem Abdel-Salam), [Abdelrahman Zayed](https://arxiv.org/search/?searchtype=author&query=Abdelrahman Zayed)
The impressive performance of language models is undeniable. However, the presence of biases based on gender, race, socio-economic status, physical appearance, and sexual orientation makes the deployment of language models challenging. This paper studies the effect of chain-of-thought prompting, a recent approach that studies the steps followed by the model before it responds, on fairness. More specifically, we ask the following question: Do biased models have biased thoughts? To answer our question, we conduct experiments on 5 popular large language models using fairness metrics to quantify 11 different biases in the model’s thoughts and output. Our results show that the bias in the thinking steps is not highly correlated with the output bias (less than 0.6 correlation with a p-value smaller than 0.001 in most cases). In other words, unlike human beings, the tested models with biased decisions do not always possess biased thoughts.
Subjects: Computation and Language, Artificial Intelligence
Publish: 2025-08-08 19:41:20 UTC
#86 Testing the Limits of Machine Translation from One Book
Authors: [Jonathan Shaw](https://arxiv.org/search/?searchtype=author&query=Jonathan Shaw), [Dillon Mee](https://arxiv.org/search/?searchtype=author&query=Dillon Mee), [Timothy Khouw](https://arxiv.org/search/?searchtype=author&query=Timothy Khouw), [Zackary Leech](https://arxiv.org/search/?searchtype=author&query=Zackary Leech), [Daniel Wilson](https://arxiv.org/search/?searchtype=author&query=Daniel Wilson)
Current state-of-the-art models demonstrate the capacity to leverage in-context learning to translate into previously unseen language contexts. Tanzer et al. [2024] utilize language materials (e.g. a grammar) to improve translation quality for Kalamang using large language models (LLMs). We focus on Kanuri, a language that, despite having a substantial speaker population, has minimal digital resources. We design two datasets for evaluation: one focused on health and humanitarian terms, and another containing generalized terminology, investigating how domain-specific tasks impact LLM translation quality. By providing different combinations of language resources (grammar, dictionary, and parallel sentences), we measure LLM translation effectiveness, comparing results to native speaker translations and human linguist performance. We evaluate using both automatic metrics and native speaker assessments of fluency and accuracy. Results demonstrate that parallel sentences remain the most effective data source, outperforming other methods in human evaluations and automatic metrics. While incorporating grammar improves over zero-shot translation, it fails as an effective standalone data source. Human evaluations reveal that LLMs achieve accuracy (meaning) more effectively than fluency (grammaticality). These findings suggest LLM translation evaluation benefits from multidimensional assessment beyond simple accuracy metrics, and that grammar alone, without parallel sentences, does not provide sufficient context for effective domain-specific translation.
Subject: Computation and Language
Publish: 2025-08-08 19:27:44 UTC
#87 Measuring Stereotype and Deviation Biases in Large Language Models
Authors: [Daniel Wang](https://arxiv.org/search/?searchtype=author&query=Daniel Wang), [Eli Brignac](https://arxiv.org/search/?searchtype=author&query=Eli Brignac), [Minjia Mao](https://arxiv.org/search/?searchtype=author&query=Minjia Mao), [Xiao Fang](https://arxiv.org/search/?searchtype=author&query=Xiao Fang)
Large language models (LLMs) are widely applied across diverse domains, raising concerns about their limitations and potential risks. In this study, we investigate two types of bias that LLMs may display: stereotype bias and deviation bias. Stereotype bias refers to when LLMs consistently associate specific traits with a particular demographic group. Deviation bias reflects the disparity between the demographic distributions extracted from LLM-generated content and real-world demographic distributions. By asking four advanced LLMs to generate profiles of individuals, we examine the associations between each demographic group and attributes such as political affiliation, religion, and sexual orientation. Our experimental results show that all examined LLMs exhibit both significant stereotype bias and deviation bias towards multiple groups. Our findings uncover the biases that occur when LLMs infer user attributes and shed light on the potential harms of LLM-generated outputs.
Subject: Computation and Language
Publish: 2025-08-08 19:03:57 UTC
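The deviation bias defined in #87 is a gap between the demographic distribution in generated profiles and a real-world reference distribution. One simple way to put a number on such a gap is the total variation distance, sketched below. The choice of total variation distance, the function name `deviation_gap`, and the toy labels are assumptions for illustration; the abstract does not state which disparity measure the authors use.

```python
from collections import Counter


def deviation_gap(generated_labels, reference_dist):
    """Total variation distance between generated and reference distributions.

    generated_labels: list of group labels extracted from generated profiles.
    reference_dist: dict mapping group label -> real-world probability.
    Returns a value in [0, 1]; 0 means the two distributions match.
    """
    counts = Counter(generated_labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("generated_labels must not be empty")
    # Include groups that appear on only one side so missing groups count fully.
    groups = set(counts) | set(reference_dist)
    return 0.5 * sum(
        abs(counts.get(g, 0) / total - reference_dist.get(g, 0.0)) for g in groups
    )


# Toy example: generated profiles over-represent group "A" relative to a 50/50 reference.
generated = ["A"] * 8 + ["B"] * 2
reference = {"A": 0.5, "B": 0.5}
print(deviation_gap(generated, reference))
```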
#88 Train It and Forget It: Merge Lists are Unnecessary for BPE Inference in Language Models
Authors: [Tomohiro Sawada](https://arxiv.org/search/?searchtype=author&query=Tomohiro Sawada), [Kartik Goyal](https://arxiv.org/search/?searchtype=author&query=Kartik Goyal)
Standard Byte-Pair Encoding (BPE) tokenization compresses text by pairing a learned token vocabulary with a detailed merge list. Recent work has shown that this merge list exposes a potential attack surface for extracting information about a language model’s training data. In this paper, we explore the downstream impact of BPE inference algorithms that do not rely on this merge list at all, and hence differ from the encoding process during BPE training. To address this question, we investigate two broad classes of BPE inference schemes that differ from BPE application during training: a) targeted deviation from merge lists, including random merge orders and various corruptions of the merge list involving deletion/truncation, and b) non-targeted BPE inference algorithms that do not depend on the merge list but focus on compressing the text either greedily or exactly. Extensive experiments across diverse language modeling tasks like accuracy-based QA benchmarks, machine translation, and open-ended generation reveal that while targeted deviation from the merge lists exhibits significant degradation in language model performance, the non-targeted merge-list-free inference algorithms result in minimal impact on downstream performance that is often much smaller than expected. These findings pave the way for simpler and potentially more privacy-preserving tokenization schemes that do not catastrophically compromise model performance.
Subject: Computation and Language
Publish: 2025-08-08 18:10:03 UTC
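To make the merge-list-free inference in #88 concrete, the sketch below tokenizes text using only a vocabulary, taking the longest vocabulary entry that matches at each position. This is one plausible reading of "greedy" compression; the function name `greedy_encode`, the toy vocabulary, and the single-character fallback are assumptions for illustration, and the paper's algorithms may differ in detail.

```python
def greedy_encode(text, vocab):
    """Tokenize by repeatedly taking the longest vocabulary entry at the current position.

    No merge list is consulted: only membership in `vocab` matters.
    """
    max_len = max(len(tok) for tok in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a vocabulary hit.
        for j in range(min(len(text), i + max_len), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No entry matches: emit one character so encoding always terminates.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical toy vocabulary; a real BPE vocabulary would cover all bytes.
toy_vocab = {"l", "o", "w", "e", "r", "lo", "low", "er", "est"}
print(greedy_encode("lowest", toy_vocab))
```

Standard BPE inference would instead replay the learned merges in order, so the two procedures can segment the same string differently even with an identical vocabulary.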
#89 BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent
Authors: [Zijian Chen](https://arxiv.org/search/?searchtype=author&query=Zijian Chen), [Xueguang Ma](https://arxiv.org/search/?searchtype=author&query=Xueguang Ma), [Shengyao Zhuang](https://arxiv.org/search/?searchtype=author&query=Shengyao Zhuang), [Ping Nie](https://arxiv.org/search/?searchtype=author&query=Ping Nie), [Kai Zou](https://arxiv.org/search/?searchtype=author&query=Kai Zou), [Andrew Liu](https://arxiv.org/search/?searchtype=author&query=Andrew Liu), [Joshua Green](https://arxiv.org/search/?searchtype=author&query=Joshua Green), [Kshama Patel](https://arxiv.org/search/?searchtype=author&query=Kshama Patel), [Ruoxi Meng](https://arxiv.org/search/?searchtype=author&query=Ruoxi Meng), [Mingyi Su](https://arxiv.org/search/?searchtype=author&query=Mingyi Su), [Sahel Sharifymoghaddam](https://arxiv.org/search/?searchtype=author&query=Sahel Sharifymoghaddam), [Yanxi Li](https://arxiv.org/search/?searchtype=author&query=Yanxi Li), [Haoran Hong](https://arxiv.org/search/?searchtype=author&query=Haoran Hong), [Xinyu Shi](https://arxiv.org/search/?searchtype=author&query=Xinyu Shi), [Xuye Liu](https://arxiv.org/search/?searchtype=author&query=Xuye Liu), [Nandan Thakur](https://arxiv.org/search/?searchtype=author&query=Nandan Thakur), [Crystina Zhang](https://arxiv.org/search/?searchtype=author&query=Crystina Zhang), [Luyu Gao](https://arxiv.org/search/?searchtype=author&query=Luyu Gao), [Wenhu Chen](https://arxiv.org/search/?searchtype=author&query=Wenhu Chen), [Jimmy Lin](https://arxiv.org/search/?searchtype=author&query=Jimmy Lin)
Deep-Research agents, which integrate large language models (LLMs) with search tools, have shown success in improving the effectiveness of handling complex queries that require iterative search planning and reasoning over search results. Evaluations on current benchmarks like BrowseComp rely on black-box live web search APIs and have notable limitations in (1) fairness: dynamic and opaque web APIs hinder fair comparisons and reproducibility of deep research methods; (2) transparency: lack of control over the document corpus makes it difficult to isolate retriever contributions. In other words, the current evaluations may compare a complete deep research system at a given time, but they do not foster well-controlled experiments to provide insights into the capability of underlying deep research LLMs. To address these challenges, we introduce BrowseComp-Plus, a benchmark derived from BrowseComp, employing a fixed, carefully curated corpus. Each query in BrowseComp-Plus includes human-verified supporting documents and mined challenging negatives, enabling controlled experimentation. The benchmark is shown to be effective in distinguishing the performance of deep research systems. For instance, the open-source model Search-R1, when paired with the BM25 retriever, achieves 3.86% accuracy, whereas GPT-5 achieves 55.9%. Integrating GPT-5 with the Qwen3-Embedding-8B retriever further enhances its accuracy to 70.1% with fewer search calls. This benchmark allows comprehensive evaluation and disentangled analysis of deep research agents and retrieval methods, fostering insights into retrieval effectiveness, citation accuracy, and context engineering in Deep-Research systems.
Subjects: Computation and Language, Information Retrieval
Publish: 2025-08-08 17:55:11 UTC
#90 LLM Unlearning Without an Expert Curated Dataset
Authors: [Xiaoyuan Zhu](https://arxiv.org/search/?searchtype=author&query=Xiaoyuan Zhu), [Muru Zhang](https://arxiv.org/search/?searchtype=author&query=Muru Zhang), [Ollie Liu](https://arxiv.org/search/?searchtype=author&query=Ollie Liu), [Robin Jia](https://arxiv.org/search/?searchtype=author&query=Robin Jia), [Willie Neiswanger](https://arxiv.org/search/?searchtype=author&query=Willie Neiswanger)
Modern large language models often encode sensitive, harmful, or copyrighted knowledge, raising the need for post-hoc unlearning: the ability to remove specific domains of knowledge from a model without full retraining. A major bottleneck in current unlearning pipelines is constructing effective forget sets, i.e., datasets that approximate the target domain and guide the model to forget it. In this work, we introduce a scalable, automated approach to generate high-quality forget sets using language models themselves. Our method synthesizes textbook-style data through a structured prompting pipeline, requiring only a domain name as input. Through experiments on unlearning biosecurity, cybersecurity, and Harry Potter novels, we show that our synthetic datasets consistently outperform the baseline synthetic alternatives and are comparable to the expert-curated ones. Additionally, ablation studies reveal that the multi-step generation pipeline significantly boosts data diversity, which in turn improves unlearning utility. Overall, our findings suggest that synthetic datasets offer a promising path toward practical, scalable unlearning for a wide range of emerging domains without the need for manual intervention. We release our code and dataset at https://github.com/xyzhu123/Synthetic_Textbook.
Subjects: Computation and Language, Artificial Intelligence, Machine Learning
Publish: 2025-08-08 14:30:08 UTC
#91 Discerning minds or generic tutors? Evaluating instructional guidance capabilities in Socratic LLMs
Authors: [Ying Liu](https://arxiv.org/search/?searchtype=author&query=Ying Liu), [Can Li](https://arxiv.org/search/?searchtype=author&query=Can Li), [Ting Zhang](https://arxiv.org/search/?searchtype=author&query=Ting Zhang), [Mei Wang](https://arxiv.org/search/?searchtype=author&query=Mei Wang), [Qiannan Zhu](https://arxiv.org/search/?searchtype=author&query=Qiannan Zhu), [Jian Li](https://arxiv.org/search/?searchtype=author&query=Jian Li), [Hua Huang](https://arxiv.org/search/?searchtype=author&query=Hua Huang)
The conversational capabilities of large language models hold significant promise for enabling scalable and interactive tutoring. While prior research has primarily examined their capacity for Socratic questioning, it often overlooks a critical dimension: adaptively guiding learners based on their cognitive states. This study shifts focus from mere question generation to the broader instructional guidance capability. We ask: Can LLMs emulate expert tutors who dynamically adjust strategies in response to learners’ understanding? To investigate this, we propose GuideEval, a benchmark grounded in authentic educational dialogues that evaluates pedagogical guidance through a three-phase behavioral framework: (1) Perception, inferring learner states; (2) Orchestration, adapting instructional strategies; and (3) Elicitation, stimulating proper reflections. Empirical findings reveal that existing LLMs frequently fail to provide effective adaptive scaffolding when learners exhibit confusion or require redirection. Furthermore, we introduce a behavior-guided finetuning strategy that leverages behavior-prompted instructional dialogues, significantly enhancing guidance performance. By shifting the focus from isolated content evaluation to learner-centered interaction, our work advocates a more dialogic paradigm for evaluating Socratic LLMs.
Subjects: Computation and Language, Artificial Intelligence
Publish: 2025-08-08 01:02:44 UTC
#92 Factor Augmented Supervised Learning with Text Embeddings
Authors: [Zhanye Luo](https://arxiv.org/search/?searchtype=author&query=Zhanye Luo), [Yuefeng Han](https://arxiv.org/search/?searchtype=author&query=Yuefeng Han), [Xiufan Yu](https://arxiv.org/search/?searchtype=author&query=Xiufan Yu)
Large language models (LLMs) generate text embeddings from text data, producing vector representations that capture the semantic meaning and contextual relationships of words. However, the high dimensionality of these embeddings often impedes efficiency and drives up computational cost in downstream tasks. To address this, we propose AutoEncoder-Augmented Learning with Text (AEALT), a supervised, factor-augmented framework that incorporates dimension reduction directly into pre-trained LLM workflows. First, we extract embeddings from text documents; next, we pass them through a supervised augmented autoencoder to learn low-dimensional, task-relevant latent factors. By modeling the nonlinear structure of complex embeddings, AEALT outperforms conventional deep-learning approaches that rely on raw embeddings. We validate its broad applicability with extensive experiments on classification, anomaly detection, and prediction tasks using multiple real-world public datasets. Numerical results demonstrate that AEALT yields substantial gains over both vanilla embeddings and several standard dimension reduction methods. 大型语言模型 (LLMs) 从文本数据生成文本嵌入,产生能够捕捉词语语义意义和上下文关系的向量表示。然而,这些嵌入的高维性常常妨碍效率并提高下游任务的计算成本。为了解决这一问题,我们提出了带有文本的自编码器增强学习(AEALT),这是一种有监督的、因子增强的框架,将降维直接并入预训练 LLM 的工作流程。首先,我们从文本文档中提取嵌入;接着,将它们传入一个有监督增强自编码器,以学习与任务相关的低维潜在因子。通过对复杂嵌入的非线性结构建模,AEALT 的表现优于依赖原始嵌入的传统深度学习方法。我们通过在多个真实公共数据集上对分类、异常检测和预测任务进行的大量实验验证了其广泛适用性。数值结果表明,AEALT 相较于原始嵌入和若干标准降维方法都带来了显著提升。
Subjects: Computation and Language, Artificial Intelligence, Machine Learning, Machine Learning 主题:计算与语言、人工智能、机器学习、机器学习
Publish: 2025-08-06 01:44:47 UTC 发布:2025-08-06 01:44:47 UTC
#93 The Art of Breaking Words: Rethinking Multilingual Tokenizer Design #93 断词的艺术:重新思考多语种分词器设计
Authors: [Aamod Thakur](https://arxiv.org/search/?searchtype=author&query=Aamod Thakur), [Ajay Nagpal](https://arxiv.org/search/?searchtype=author&query=Ajay Nagpal), [Atharva Savarkar](https://arxiv.org/search/?searchtype=author&query=Atharva Savarkar), [Kundeshwar Pundalik](https://arxiv.org/search/?searchtype=author&query=Kundeshwar Pundalik), [Siddhesh Dosi](https://arxiv.org/search/?searchtype=author&query=Siddhesh Dosi), [Piyush Sawarkar](https://arxiv.org/search/?searchtype=author&query=Piyush Sawarkar), [Viraj Thakur](https://arxiv.org/search/?searchtype=author&query=Viraj Thakur), [Rohit Saluja](https://arxiv.org/search/?searchtype=author&query=Rohit Saluja), [Maunendra Sankar Desarkar](https://arxiv.org/search/?searchtype=author&query=Maunendra Sankar Desarkar), [Ganesh Ramakrishnan](https://arxiv.org/search/?searchtype=author&query=Ganesh Ramakrishnan) 作者:Aamod Thakur、Ajay Nagpal、Atharva Savarkar、Kundeshwar Pundalik、Siddhesh Dosi、Piyush Sawarkar、Viraj Thakur、Rohit Saluja、Maunendra Sankar Desarkar、Ganesh Ramakrishnan
While model architecture and training objectives are well-studied, tokenization, particularly in multilingual contexts, remains a relatively neglected aspect of Large Language Model (LLM) development. Existing tokenizers often exhibit high token-to-word ratios, inefficient use of context length, and slower inference. We present a systematic study that links vocabulary size, pre-tokenization rules, and training-corpus composition to both token-to-word efficiency and model quality. To ground our analysis in a linguistically diverse context, we conduct extensive experiments on Indic scripts, which present unique challenges due to their high script diversity and orthographic complexity. Drawing on the insights from these analyses, we propose a novel algorithm for data composition that balances multilingual data for tokenizer training. Our observations on pretokenization strategies significantly improve model performance, and our data composition algorithm reduces the average token-to-word ratio by approximately 6% with respect to the conventional data randomization approach. Our tokenizer achieves more than 40% improvement on average token-to-word ratio against state-of-the-art multilingual Indic models. This improvement yields measurable gains in both model performance and inference speed. This highlights tokenization alongside architecture and training objectives as a critical lever for building efficient, scalable multilingual LLMs. 尽管模型架构和训练目标已被广泛研究,分词(尤其是在多语言环境下)仍然是大型语言模型(LLM)开发中相对被忽视的方面。现有的分词器常表现出较高的词元与单词比、上下文长度利用效率低以及推理速度较慢。我们提出了一项系统性研究,将词表大小、预分词规则和训练语料构成与词元与单词比效率及模型质量联系起来。为了在语言学多样的背景下开展分析,我们在印度系文字(Indic scripts)上进行了大量实验,这些文字由于其高度的文字体系多样性和正写法复杂性而带来独特挑战。基于这些分析所得的见解,我们提出了一种用于数据构成的新算法,用于在分词器训练中平衡多语言数据。我们在预分词策略上的观察显著提升了模型性能,并且我们的数据构成算法相比传统的数据随机化方法将平均词元与单词比约降低了 6%。我们的分词器在平均词元与单词比上比最先进的多语言印度语模型改善了超过 40%。这一改进在模型性能和推理速度上都带来了可观的收益。这突显了分词与架构和训练目标一道,作为构建高效、可扩展多语言 LLMs 的关键杠杆。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-03 15:31:10 UTC 发布:2025-08-03 15:31:10 UTC
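The token-to-word ratio that entry #93 reports can be measured for any tokenizer in a few lines. The sketch below is illustrative only: `char_bigram_tokenize` is a hypothetical stand-in tokenizer, not the paper's tokenizer, and "word" is assumed to mean a whitespace-delimited unit.

```python
def token_to_word_ratio(texts, tokenize):
    """Average number of tokens per whitespace-delimited word over a corpus."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("corpus contains no words")
    return total_tokens / total_words


def char_bigram_tokenize(text):
    """Hypothetical toy tokenizer: cuts every word into 2-character pieces."""
    pieces = []
    for word in text.split():
        pieces.extend(word[i:i + 2] for i in range(0, len(word), 2))
    return pieces


corpus = ["tokenizers matter", "नमस्ते दुनिया"]
# A tokenizer that fragments words heavily yields a high ratio.
fragmenting = token_to_word_ratio(corpus, char_bigram_tokenize)
# A pure word-level split is the lower bound of 1 token per word.
word_level = token_to_word_ratio(corpus, str.split)
print(fragmenting, word_level)
```

A lower ratio means fewer tokens per word, so more text fits in the same context window and fewer decoding steps are needed per word.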
#94 CarbonScaling: Extending Neural Scaling Laws for Carbon Footprint in Large Language Models #94 CarbonScaling:将神经缩放定律扩展到大型语言模型的碳足迹
Authors: [Lei Jiang](https://arxiv.org/search/?searchtype=author&query=Lei Jiang), [Fan Chen](https://arxiv.org/search/?searchtype=author&query=Fan Chen) 作者:姜雷、陈凡
Neural scaling laws have driven the development of increasingly large language models (LLMs) by linking accuracy improvements to growth in parameter count, dataset size, and compute. However, these laws overlook the carbon emissions that scale exponentially with LLM size. This paper presents CarbonScaling, an analytical framework that extends neural scaling laws to incorporate both operational and embodied carbon in LLM training. By integrating models for neural scaling, GPU hardware evolution, parallelism optimization, and carbon estimation, CarbonScaling quantitatively connects model accuracy to carbon footprint. Results show that while a power-law relationship between accuracy and carbon holds, real-world inefficiencies significantly increase the scaling factor. Hardware technology scaling reduces carbon emissions for small to mid-sized models, but offers diminishing returns for extremely large LLMs due to communication overhead and underutilized GPUs. Training optimizations, especially aggressive critical batch size scaling, help alleviate this inefficiency. CarbonScaling offers key insights for training more sustainable and carbon-efficient LLMs. 神经规模定律通过将准确性提升与参数数量、数据集规模和计算量的增长联系起来,推动了越来越大规模语言模型(LLMs)的发展。然而,这些定律忽视了随 LLM 规模呈指数增长的碳排放。本文提出了 CarbonScaling,一种分析性框架,将神经规模定律扩展为在 LLM 训练中同时纳入运营碳和隐含碳。通过整合神经规模模型、GPU 硬件演进、并行优化和碳估算模型,CarbonScaling 将模型准确性与碳足迹定量联系起来。结果表明,虽然准确性与碳之间存在幂律关系,但现实中的低效会显著增加该缩放因子。硬件技术的进步可降低小型到中型模型的碳排放,但由于通信开销和 GPU 未充分利用,对于极大型 LLM 其收益递减。训练优化,尤其是积极的临界批量大小缩放,有助于缓解这种低效。CarbonScaling 为训练更可持续、更节碳的 LLMs 提供了关键见解。
Subjects: Computation and Language, Artificial Intelligence, Computers and Society, Distributed, Parallel, and Cluster Computing, Machine Learning 主题:计算与语言、人工智能、计算机与社会、分布式、并行与集群计算、机器学习
Publish: 2025-08-02 00:41:45 UTC 发布日期:2025-08-02 00:41:45 协调世界时
#95 Retrieval augmented generation based dynamic prompting for few-shot biomedical named entity recognition using large language models #95 基于检索增强生成的动态提示用于大语言模型的少样本生物医学命名实体识别
Authors: [Yao Ge](https://arxiv.org/search/?searchtype=author&query=Yao Ge), [Sudeshna Das](https://arxiv.org/search/?searchtype=author&query=Sudeshna Das), [Yuting Guo](https://arxiv.org/search/?searchtype=author&query=Yuting Guo), [Abeed Sarker](https://arxiv.org/search/?searchtype=author&query=Abeed Sarker) 作者:Yao Ge、Sudeshna Das、Yuting Guo、Abeed Sarker
Biomedical named entity recognition (NER) is a high-utility natural language processing (NLP) task, and large language models (LLMs) show promise particularly in few-shot settings (i.e., limited training data). In this article, we address the performance challenges of LLMs for few-shot biomedical NER by investigating a dynamic prompting strategy involving retrieval-augmented generation (RAG). In our approach, the annotated in-context learning examples are selected based on their similarities with the input texts, and the prompt is dynamically updated for each instance during inference. We implemented and optimized static and dynamic prompt engineering techniques and evaluated them on five biomedical NER datasets. Static prompting with structured components increased average F1-scores by 12% for GPT-4, and 11% for GPT-3.5 and LLaMA 3-70B, relative to basic static prompting. Dynamic prompting further improved performance, with TF-IDF and SBERT retrieval methods yielding the best results, improving average F1-scores by 7.3% and 5.6% in 5-shot and 10-shot settings, respectively. These findings highlight the utility of contextually adaptive prompts via RAG for biomedical NER. 生物医学命名实体识别(NER)是一项高实用价值的自然语言处理(NLP)任务,而大型语言模型(LLMs)在少样本设置(即训练数据有限)中表现出潜力。本文通过研究一种涉及检索增强生成(RAG)的动态提示策略,来解决 LLMs 在少样本生物医学 NER 中的性能挑战。在我们的方法中,带注释的上下文学习示例是基于它们与输入文本的相似性进行选择的,并且在推理过程中为每个实例动态更新提示。我们实现并优化了静态和动态的提示工程技术,并在五个生物医学 NER 数据集上进行了评估。相比基础静态提示,带有结构化组件的静态提示使 GPT-4 的平均 F1 分数提高了 12%,使 GPT-3.5 和 LLaMA 3-70B 的平均 F1 分数提高了 11%。动态提示进一步提升了性能,其中 TF-IDF 和 SBERT 检索方法的效果最佳,分别在 5-shot 和 10-shot 设置中将平均 F1 分数提升了 7.3%和 5.6%。这些发现突出了通过 RAG 实现上下文自适应提示对生物医学 NER 的实用价值。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-07-25 20:57:16 UTC 发布:2025-07-25 20:57:16 UTC
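The dynamic prompting idea in entry #95 (pick the annotated examples most similar to the input, then rebuild the prompt per instance) can be sketched with a small TF-IDF retriever. This is a minimal illustration under assumptions: the example pool, the prompt wording, and the function names are hypothetical, and the TF-IDF weighting is a plain smoothed variant rather than the authors' exact setup.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document, with smoothed IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenized]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_prompt(input_text, annotated_pool, k=2):
    """Select the k pool examples most similar to the input and assemble a prompt."""
    texts = [ex["text"] for ex in annotated_pool]
    vectors = tfidf_vectors(texts + [input_text])
    query_vec = vectors[-1]
    ranked = sorted(range(len(texts)),
                    key=lambda i: cosine(vectors[i], query_vec), reverse=True)
    shots = [annotated_pool[i] for i in ranked[:k]]
    # Hypothetical prompt template; the paper's structured components are not reproduced.
    parts = ["Extract the biomedical entities from the sentence."]
    for ex in shots:
        parts.append(f"Sentence: {ex['text']}\nEntities: {ex['entities']}")
    parts.append(f"Sentence: {input_text}\nEntities:")
    return "\n\n".join(parts), shots


pool = [
    {"text": "aspirin reduced fever in patients", "entities": "aspirin (drug); fever (symptom)"},
    {"text": "the weather was cold on monday", "entities": "none"},
    {"text": "ibuprofen caused nausea in some patients", "entities": "ibuprofen (drug); nausea (symptom)"},
]
prompt, shots = build_prompt("metformin caused nausea in two patients", pool, k=2)
print(prompt)
```

Because the examples are re-ranked for every input, the few-shot block changes from instance to instance, which is what distinguishes this setup from static prompting.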
#96 Semi-automated Fact-checking in Portuguese: Corpora Enrichment using Retrieval with Claim extraction #96 葡萄牙语的半自动事实核查:通过检索与声明提取丰富语料库
Authors: [Juliana Resplande Sant’anna Gomes](https://arxiv.org/search/?searchtype=author&query=Juliana Resplande Sant’anna Gomes), [Arlindo Rodrigues Galvão Filho](https://arxiv.org/search/?searchtype=author&query=Arlindo Rodrigues Galvão Filho) 作者:Juliana Resplande Sant’anna Gomes,Arlindo Rodrigues Galvão Filho
The accelerated dissemination of disinformation often outpaces the capacity for manual fact-checking, highlighting the urgent need for Semi-Automated Fact-Checking (SAFC) systems. Within the Portuguese language context, there is a noted scarcity of publicly available datasets that integrate external evidence, an essential component for developing robust AFC systems, as many existing resources focus solely on classification based on intrinsic text features. This dissertation addresses this gap by developing, applying, and analyzing a methodology to enrich Portuguese news corpora (Fake.Br, COVID19.BR, MuMiN-PT) with external evidence. The approach simulates a user’s verification process, employing Large Language Models (LLMs, specifically Gemini 1.5 Flash) to extract the main claim from texts and search engine APIs (Google Search API, Google FactCheck Claims Search API) to retrieve relevant external documents (evidence). Additionally, a data validation and preprocessing framework, including near-duplicate detection, is introduced to enhance the quality of the base corpora. 虚假信息的加速传播往往超出人工事实核查的能力,凸显了半自动事实核查(SAFC)系统的迫切需求。在葡萄牙语环境中,公开的、整合外部证据的数据集明显稀缺,而外部证据是构建健壮自动事实核查(AFC)系统的关键组成部分,许多现有资源仅基于文本内部特征进行分类。本文通过开发、应用和分析一种方法来弥补这一空白,该方法为葡萄牙语新闻语料库(Fake.Br、COVID19.BR、MuMiN-PT)注入外部证据。该方法模拟用户的核验流程,使用大型语言模型(LLMs,具体为 Gemini 1.5 Flash)从文本中提取主要主张,并使用搜索引擎 API(Google Search API、Google FactCheck Claims Search API)检索相关的外部文档(证据)。此外,提出了一个数据验证与预处理框架,包括近重复检测,以提高基础语料库的质量。
Subjects: Computation and Language, Artificial Intelligence, Information Retrieval 主题:计算与语言、人工智能、信息检索
Publish: 2025-07-19 23:46:40 UTC 发布:2025-07-19 23:46:40 协调世界时
#97 Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning #97 第 I 部分:技巧还是陷阱?对用于 LLM 推理的强化学习的深入探究
Authors: [Zihe Liu](https://arxiv.org/search/?searchtype=author&query=Zihe Liu), [Jiashun Liu](https://arxiv.org/search/?searchtype=author&query=Jiashun Liu), [Yancheng He](https://arxiv.org/search/?searchtype=author&query=Yancheng He), [Weixun Wang](https://arxiv.org/search/?searchtype=author&query=Weixun Wang), [Jiaheng Liu](https://arxiv.org/search/?searchtype=author&query=Jiaheng Liu), [Ling Pan](https://arxiv.org/search/?searchtype=author&query=Ling Pan), [Xinyu Hu](https://arxiv.org/search/?searchtype=author&query=Xinyu Hu), [Shaopan Xiong](https://arxiv.org/search/?searchtype=author&query=Shaopan Xiong), [Ju Huang](https://arxiv.org/search/?searchtype=author&query=Ju Huang), [Jian Hu](https://arxiv.org/search/?searchtype=author&query=Jian Hu), [Shengyi Huang](https://arxiv.org/search/?searchtype=author&query=Shengyi Huang), [Siran Yang](https://arxiv.org/search/?searchtype=author&query=Siran Yang), [Jiamang Wang](https://arxiv.org/search/?searchtype=author&query=Jiamang Wang), [Wenbo Su](https://arxiv.org/search/?searchtype=author&query=Wenbo Su), [Bo Zheng](https://arxiv.org/search/?searchtype=author&query=Bo Zheng) 作者:刘子和、刘嘉顺、何彦成、王伟勋、刘佳恒、潘玲、胡欣宇、熊少攀、黄举、胡剑、黄晟一、杨思然、王家芒、苏文博、郑博
Reinforcement learning for LLM reasoning has rapidly emerged as a prominent research area, marked by a significant surge in related studies on both algorithmic innovations and practical applications. Despite this progress, several critical challenges remain, including the absence of standardized guidelines for employing RL techniques and a fragmented understanding of their underlying mechanisms. Additionally, inconsistent experimental settings, variations in training data, and differences in model initialization have led to conflicting conclusions, obscuring the key characteristics of these techniques and creating confusion among practitioners when selecting appropriate techniques. This paper systematically reviews widely adopted RL techniques through rigorous reproductions and isolated evaluations within a unified open-source framework. We analyze the internal mechanisms, applicable scenarios, and core principles of each technique through fine-grained experiments, including datasets of varying difficulty, model sizes, and architectures. Based on these insights, we present clear guidelines for selecting RL techniques tailored to specific setups, and provide a reliable roadmap for practitioners navigating the RL for the LLM domain. Finally, we reveal that a minimalist combination of two techniques can unlock the learning capability of critic-free policies using vanilla PPO loss. The results demonstrate that our simple combination consistently improves performance, surpassing strategies like GRPO and DAPO. 针对 LLM 推理的强化学习迅速成为一个重要的研究领域,在算法创新和实际应用方面相关研究显著增加。尽管取得了进展,但仍存在若干关键挑战,包括缺乏使用 RL 技术的标准化指南以及对其内部机制的碎片化理解。此外,不一致的实验设置、训练数据的差异和模型初始化的不同导致了相互矛盾的结论,模糊了这些技术的关键特征,并在从业者选择合适技术时造成困惑。本文通过在统一的开源框架中进行严格重现和独立评估,系统性地回顾了广泛采用的 RL 技术。我们通过细粒度实验分析了每种技术的内部机制、适用场景和核心原理,实验涵盖不同难度的数据集、模型规模和架构。 基于这些见解,我们提出了为特定设置选择强化学习技术的明确指南,并为在 LLM 领域中使用强化学习的从业者提供了一条可靠的路线图。最后,我们揭示了两种技术的极简组合可以使用原始 PPO 损失激活无评论者策略的学习能力。结果表明,我们的简单组合能够持续提升性能,超越如 GRPO 和 DAPO 等策略。
Subjects: Machine Learning, Computation and Language 主题:机器学习,计算与语言
Publish: 2025-08-11 17:39:45 UTC 发布:2025-08-11 17:39:45 世界协调时
#98 HierSearch: A Hierarchical Enterprise Deep Search Framework Integrating Local and Web Searches #98 HierSearch:一个将本地与网页检索相结合的层级企业深度检索框架
Authors: [Jiejun Tan](https://arxiv.org/search/?searchtype=author&query=Jiejun Tan), [Zhicheng Dou](https://arxiv.org/search/?searchtype=author&query=Zhicheng Dou), [Yan Yu](https://arxiv.org/search/?searchtype=author&query=Yan Yu), [Jiehan Cheng](https://arxiv.org/search/?searchtype=author&query=Jiehan Cheng), [Qiang Ju](https://arxiv.org/search/?searchtype=author&query=Qiang Ju), [Jian Xie](https://arxiv.org/search/?searchtype=author&query=Jian Xie), [Ji-Rong Wen](https://arxiv.org/search/?searchtype=author&query=Ji-Rong Wen) 作者:谭杰军、窦志成、余艳、程洁涵、居强、谢健、温继荣
Recently, large reasoning models have demonstrated strong mathematical and coding abilities, and deep search leverages their reasoning capabilities in challenging information retrieval tasks. Existing deep search works are generally limited to a single knowledge source, either local or the Web. However, enterprises often require private deep search systems that can leverage search tools over both local and the Web corpus. Simply training an agent equipped with multiple search tools using flat reinforcement learning (RL) is a straightforward idea, but it has problems such as low training data efficiency and poor mastery of complex tools. To address the above issue, we propose a hierarchical agentic deep search framework, HierSearch, trained with hierarchical RL. At the low level, a local deep search agent and a Web deep search agent are trained to retrieve evidence from their corresponding domains. At the high level, a planner agent coordinates low-level agents and provides the final answer. Moreover, to prevent direct answer copying and error propagation, we design a knowledge refiner that filters out hallucinations and irrelevant evidence returned by low-level agents. Experiments show that HierSearch achieves better performance compared to flat RL, and outperforms various deep search and multi-source retrieval-augmented generation baselines in six benchmarks across general, finance, and medical domains. 最近,大型推理模型在数学和编码能力方面表现出强大实力,且深度检索利用它们的推理能力来应对具有挑战性的信息检索任务。现有的深度检索工作通常仅限于单一知识源,要么是本地,要么是网络。然而,企业往往需要能够同时利用本地语料库和网络语料库的私有深度检索系统。一个直观的想法是用平面强化学习训练一个配备多个检索工具的代理,但这种做法存在训练数据效率低下和对复杂工具掌握不佳等问题。为了解决上述问题,我们提出了一种用分层强化学习训练的分层代理式深度检索框架 HierSearch。在低层,训练了一个本地深度检索代理和一个网络深度检索代理,分别从各自领域检索证据。在高层,一个规划者代理协调低层代理并给出最终答案。此外,为了防止直接复制答案和错误传播,我们设计了一个知识修正器,用于过滤低层代理返回的幻觉信息和无关证据。 实验表明,与扁平强化学习相比,HierSearch 在性能上更佳,并且在通用、金融和医疗领域的六项基准测试中,优于各种深度检索和多源检索增强生成基线。
Subjects: Information Retrieval, Artificial Intelligence, Computation and Language 主题:信息检索、人工智能、计算与语言
Publish: 2025-08-11 15:31:47 UTC 发布时间:2025-08-11 15:31:47 UTC
#99 Investigating the Design Space of Visual Grounding in Multimodal Large Language Model #99 调查多模态大语言模型中视觉定位的设计空间
Authors: [Weitai Kang](https://arxiv.org/search/?searchtype=author&query=Weitai Kang), [Weiming Zhuang](https://arxiv.org/search/?searchtype=author&query=Weiming Zhuang), [Zhizhong Li](https://arxiv.org/search/?searchtype=author&query=Zhizhong Li), [Yan Yan](https://arxiv.org/search/?searchtype=author&query=Yan Yan), [Lingjuan Lyu](https://arxiv.org/search/?searchtype=author&query=Lingjuan Lyu) 作者:康伟泰、庄伟明、李志忠、阎艳、吕玲娟
Fine-grained multimodal capability in Multimodal Large Language Models (MLLMs) has emerged as a critical research direction, particularly for tackling the visual grounding (VG) problem. Despite the strong performance achieved by existing approaches, they often employ disparate design choices when fine-tuning MLLMs for VG, lacking systematic verification to support these designs. To bridge this gap, this paper presents a comprehensive study of various design choices that impact the VG performance of MLLMs. We conduct our analysis using LLaVA-1.5, which has been widely adopted in prior empirical studies of MLLMs. While more recent models exist, we follow this convention to ensure our findings remain broadly applicable and extendable to other architectures. We cover two key aspects: (1) exploring different visual grounding paradigms in MLLMs, identifying the most effective design, and providing our insights; and (2) conducting ablation studies on the design of grounding data to optimize MLLMs’ fine-tuning for the VG task. Finally, our findings contribute to a stronger MLLM for VG, achieving improvements of +5.6% / +6.9% / +7.0% on RefCOCO/+/g over the LLaVA-1.5. 细粒度多模态能力在多模态大语言模型(MLLMs)中已成为一个关键研究方向,尤其是在解决视觉定位(VG)问题时。尽管现有方法取得了较强的性能,但在对 MLLMs 进行 VG 微调时,它们通常采用各不相同的设计选择,缺乏系统性的验证来支持这些设计。为填补这一空白,本文对影响 MLLMs VG 性能的各类设计选择进行了全面研究。我们使用在先前 MLLM 实证研究中被广泛采用的 LLaVA-1.5 进行分析。尽管存在更新的模型,我们仍遵循这一惯例以确保我们的发现具有广泛适用性并可扩展到其他架构。我们涵盖了两个关键方面:(1)探索 MLLMs 中不同的视觉定位范式,识别最有效的设计并提供我们的见解;以及(2)对定位数据的设计进行消融研究,以优化 MLLMs 在 VG 任务上的微调。最后,我们的发现促成了更强大的 MLLM 视觉定位能力,在 RefCOCO/+/g 上相较于 LLaVA-1.5 分别提升了+5.6% / +6.9% / +7.0%。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning 主题:计算机视觉与模式识别、人工智能、计算与语言、机器学习
Publish: 2025-08-11 15:10:52 UTC 发布时间:2025-08-11 15:10:52 UTC
#100 From Source to Target: Leveraging Transfer Learning for Predictive Process Monitoring in Organizations #100 从源到目标:在组织中利用迁移学习进行预测性流程监控
Authors: [Sven Weinzierl](https://arxiv.org/search/?searchtype=author&query=Sven Weinzierl), [Sandra Zilker](https://arxiv.org/search/?searchtype=author&query=Sandra Zilker), [Annina Liessmann](https://arxiv.org/search/?searchtype=author&query=Annina Liessmann), [Martin Käppel](https://arxiv.org/search/?searchtype=author&query=Martin Käppel), [Weixin Wang](https://arxiv.org/search/?searchtype=author&query=Weixin Wang), [Martin Matzner](https://arxiv.org/search/?searchtype=author&query=Martin Matzner) 作者:Sven Weinzierl、Sandra Zilker、Annina Liessmann、Martin Käppel、Weixin Wang、Martin Matzner
Event logs reflect the behavior of business processes that are mapped in organizational information systems. Predictive process monitoring (PPM) transforms these data into value by creating process-related predictions that provide the insights required for proactive interventions at process runtime. Existing PPM techniques require sufficient amounts of event data or other relevant resources that might not be readily available, preventing some organizations from utilizing PPM. The transfer learning-based PPM technique presented in this paper allows organizations without suitable event data or other relevant resources to implement PPM for effective decision support. The technique is instantiated in two real-life use cases, based on which numerical experiments are performed using event logs for IT service management processes in an intra- and inter-organizational setting. The results of the experiments suggest that knowledge of one business process can be transferred to a similar business process in the same or a different organization to enable effective PPM in the target context. With the proposed technique, organizations can benefit from transfer learning in an intra- and inter-organizational setting, where resources like pre-trained models are transferred within and across organizational boundaries. 事件日志反映了在组织信息系统中映射的业务流程行为。预测性流程监控(PPM)通过创建与流程相关的预测将这些数据转化为价值,从而在流程运行时提供进行主动干预所需的洞见。现有的 PPM 技术需要足够数量的事件数据或其他可能不易获得的相关资源,这阻止了一些组织利用 PPM。本文提出的基于迁移学习的 PPM 技术使得没有合适事件数据或其他相关资源的组织能够实施 PPM,以实现有效的决策支持。该技术在两个真实用例中进行了实例化,并基于这些用例使用用于 IT 服务管理流程的事件日志在组织内部和组织间设置中进行了数值实验。实验结果表明,一个业务流程的知识可以转移到同一组织或不同组织中相似的业务流程,从而在目标情境中实现有效的 PPM。 通过所提出的技术,组织能够在组织内和组织间环境中受益于迁移学习,其中诸如预训练模型之类的资源在组织内部和跨组织边界进行转移。
Subjects: Machine Learning, Computation and Language, Databases
Publish: 2025-08-11 15:03:50 UTC
#101 Audio-Thinker: Guiding Audio Language Model When and How to Think via Reinforcement Learning #101 Audio-Thinker:通过强化学习引导音频语言模型何时以及如何思考
Authors: [Shu Wu](https://arxiv.org/search/?searchtype=author&query=Shu Wu), [Chenxing Li](https://arxiv.org/search/?searchtype=author&query=Chenxing Li), [Wenfu Wang](https://arxiv.org/search/?searchtype=author&query=Wenfu Wang), [Hao Zhang](https://arxiv.org/search/?searchtype=author&query=Hao Zhang), [Hualei Wang](https://arxiv.org/search/?searchtype=author&query=Hualei Wang), [Meng Yu](https://arxiv.org/search/?searchtype=author&query=Meng Yu), [Dong Yu](https://arxiv.org/search/?searchtype=author&query=Dong Yu) 作者:吴舒,李晨兴,王文甫,张浩,王华磊,于萌,余东
Recent advancements in large language models, multimodal large language models, and large audio language models (LALMs) have significantly improved their reasoning capabilities through reinforcement learning with rule-based rewards. However, the explicit reasoning process has yet to show significant benefits for audio question answering, and effectively leveraging deep reasoning remains an open challenge, with LALMs still falling short of human-level auditory-language reasoning. To address these limitations, we propose Audio-Thinker, a reinforcement learning framework designed to enhance the reasoning capabilities of LALMs, with a focus on improving adaptability, consistency, and effectiveness. Our approach introduces an adaptive think accuracy reward, enabling the model to adjust its reasoning strategies based on task complexity dynamically. Furthermore, we incorporate an external reward model to evaluate the overall consistency and quality of the reasoning process, complemented by think-based rewards that help the model distinguish between valid and flawed reasoning paths during training. Experimental results demonstrate that our Audio-Thinker model outperforms existing reasoning-oriented LALMs across various benchmark tasks, exhibiting superior reasoning and generalization capabilities. 近年来,大型语言模型、多模态大型语言模型和大型音频语言模型(LALM)在通过基于规则的奖励进行强化学习方面显著提升了推理能力。然而,显式推理过程尚未在音频问答上表现出显著益处,有效利用深入推理仍是一个未解决的挑战,LALM 在听觉-语言推理方面仍未达到人类水平。为了解决这些限制,我们提出了 Audio-Thinker,一种旨在增强 LALM 推理能力的强化学习框架,重点提高其适应性、一致性和有效性。我们的方法引入了一种自适应的思考准确性奖励,使模型能根据任务复杂性动态调整其推理策略。此外,我们还引入了一个外部奖励模型来评估推理过程的整体一致性和质量,并辅以基于思考的奖励,帮助模型在训练期间区分有效与有缺陷的推理路径。 实验结果表明,我们的 Audio-Thinker 模型在各种基准任务上优于现有以推理为导向的 LALM,表现出更强的推理和泛化能力。
Subjects: Sound, Computation and Language, Multimedia 主题:声音,计算与语言,多媒体
Publish: 2025-08-11 14:41:10 UTC 发布:2025-08-11 14:41:10 UTC
#102 Exploring Procedural Data Generation for Automatic Acoustic Guitar Fingerpicking Transcription #102 探索用于自动原声吉他指弹转录的程序化数据生成
Authors: [Sebastian Murgul](https://arxiv.org/search/?searchtype=author&query=Sebastian Murgul), [Michael Heizmann](https://arxiv.org/search/?searchtype=author&query=Michael Heizmann) 作者:Sebastian Murgul,Michael Heizmann
Automatic transcription of acoustic guitar fingerpicking performances remains a challenging task due to the scarcity of labeled training data and legal constraints connected with musical recordings. This work investigates a procedural data generation pipeline as an alternative to real audio recordings for training transcription models. Our approach synthesizes training data through four stages: knowledge-based fingerpicking tablature composition, MIDI performance rendering, physical modeling using an extended Karplus-Strong algorithm, and audio augmentation including reverb and distortion. We train and evaluate a CRNN-based note-tracking model on both real and synthetic datasets, demonstrating that procedural data can be used to achieve reasonable note-tracking results. Finetuning with a small amount of real data further enhances transcription accuracy, improving over models trained exclusively on real recordings. These results highlight the potential of procedurally generated audio for data-scarce music information retrieval tasks. 由于标注训练数据稀缺及与音乐录音相关的法律限制,对原声吉他指弹演奏进行自动转录仍然是一项具有挑战性的任务。本工作将程序化数据生成管道作为用于训练转录模型的真实音频录音替代方案进行研究。我们的方法通过四个阶段合成训练数据:基于知识的指弹谱曲、MIDI 演奏渲染、使用扩展 Karplus-Strong 算法的物理建模,以及包括混响和失真在内的音频增强。我们在真实和合成数据集上训练并评估了基于 CRNN 的音符跟踪模型,证明程序化数据可用于获得合理的音符跟踪结果。用少量真实数据进行微调可进一步提升转录精度,超过仅在真实录音上训练的模型。这些结果突出了在数据匮乏的音乐信息检索任务中程序化生成音频的潜力。
Subjects: Sound, Computation and Language, Audio and Speech Processing 主题:声音、计算与语言、音频与语音处理
Publish: 2025-08-11 13:52:17 UTC 发布:2025-08-11 13:52:17 UTC
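The physical-modeling stage in entry #102 builds on Karplus-Strong plucked-string synthesis. The sketch below implements only the basic textbook algorithm (a noise burst fed through a delay line with an averaging filter); the paper's extended variant is not described in the abstract and is not reproduced here. The `decay` value and the function name are illustrative assumptions.

```python
import random
from collections import deque


def karplus_strong(frequency_hz, duration_s, sample_rate=44100, decay=0.996, seed=0):
    """Basic Karplus-Strong pluck: returns a list of float samples in [-1, 1]."""
    rng = random.Random(seed)
    # The delay-line length sets the pitch: one period of the target frequency.
    period = max(2, int(sample_rate / frequency_hz))
    # Exciting the string with white noise models the pluck.
    buffer = deque(rng.uniform(-1.0, 1.0) for _ in range(period))
    samples = []
    for _ in range(int(duration_s * sample_rate)):
        first = buffer.popleft()
        samples.append(first)
        # Averaging adjacent samples is a low-pass filter; decay damps the string.
        buffer.append(decay * 0.5 * (first + buffer[0]))
    return samples


# Low E string of a guitar (about 82.41 Hz), half a second of audio.
samples = karplus_strong(82.41, 0.5)
head_energy = sum(s * s for s in samples[:1000])
tail_energy = sum(s * s for s in samples[-1000:])
print(len(samples), head_energy > tail_energy)
```

Rendering many such notes from generated tablature and MIDI is one way a fully synthetic, automatically labeled training set could be assembled.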
#103 Improving Document Retrieval Coherence for Semantically Equivalent Queries #103 提升对语义等价查询的文档检索一致性
Authors: [Stefano Campese](https://arxiv.org/search/?searchtype=author&query=Stefano Campese), [Alessandro Moschitti](https://arxiv.org/search/?searchtype=author&query=Alessandro Moschitti), [Ivano Lauriola](https://arxiv.org/search/?searchtype=author&query=Ivano Lauriola) 作者:Stefano Campese、Alessandro Moschitti、Ivano Lauriola
Dense Retrieval (DR) models have proven to be effective for Document Retrieval and Information Grounding tasks. Usually, these models are trained and optimized for improving the relevance of top-ranked documents for a given query. Previous work has shown that popular DR models are sensitive to the query and document lexicon: small variations in it may lead to a significant difference in the set of retrieved documents. In this paper, we propose a variation of the Multi-Negative Ranking loss for training DR that improves the coherence of models in retrieving the same documents with respect to semantically similar queries. The loss penalizes discrepancies between the top-k ranked documents retrieved for diverse but semantically equivalent queries. We conducted extensive experiments on various datasets: MS-MARCO, Natural Questions, BEIR, and TREC DL 19/20. The results show that models optimized with our loss exhibit (i) lower sensitivity and, (ii) interestingly, higher accuracy. 密集检索(Dense Retrieval,DR)模型已被证明在文档检索和信息定位任务中非常有效。通常,这些模型的训练和优化目标是提升给定查询的前排文档的相关性。先前研究表明,流行的 DR 模型对查询和文档的词汇较为敏感:对词汇的微小变动可能导致检索到的文档集合发生显著差异。本文提出了一种多负样本排序损失(Multi-Negative Ranking loss)的变体,用于训练 DR,以提高模型在面对语义相似查询时检索出相同文档的一致性。该损失会惩罚针对不同但语义等价的查询在前 k 排名文档上出现的不一致。我们在多个数据集上进行了大量实验,包括 MS-MARCO、Natural Questions、BEIR 和 TREC DL 19/20。结果表明:(i) 使用我们损失函数优化的模型具有较低的敏感性,且 (ii) 有趣的是,准确率更高。
Subjects: Information Retrieval, Computation and Language 主题:信息检索,计算与语言
Publish: 2025-08-11 13:34:59 UTC 发布:2025-08-11 13:34:59 UTC
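One simple way to quantify the coherence that entry #103 targets is the average overlap of the top-k result sets returned for paraphrases of the same query. The sketch below is an illustrative evaluation metric only, assuming Jaccard overlap as the agreement measure; it is not the paper's training loss, and the function name is hypothetical.

```python
from itertools import combinations


def topk_coherence(rankings, k):
    """Mean pairwise Jaccard overlap of the top-k document IDs across paraphrases.

    rankings: one ranked list of document IDs per paraphrased query.
    Returns 1.0 when every paraphrase retrieves the same top-k set.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_sets = [set(ranking[:k]) for ranking in rankings]
    scores = []
    for a, b in combinations(top_sets, 2):
        union = a | b
        # Two empty result sets are treated as perfectly consistent.
        scores.append(len(a & b) / len(union) if union else 1.0)
    return sum(scores) / len(scores) if scores else 1.0


# Three paraphrases of one query; the second retrieves a different third document.
rankings = [
    ["d1", "d2", "d3"],
    ["d2", "d1", "d4"],
    ["d1", "d2", "d3"],
]
score = topk_coherence(rankings, k=3)
print(round(score, 3))
```

Set overlap ignores rank order within the top k, which matches the abstract's focus on retrieving the same documents rather than the same ordering.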
#104 Joint Transcription of Acoustic Guitar Strumming Directions and Chords #104 吉他扫弦方向与和弦的联合转录
Authors: [Sebastian Murgul](https://arxiv.org/search/?searchtype=author&query=Sebastian Murgul), [Johannes Schimper](https://arxiv.org/search/?searchtype=author&query=Johannes Schimper), [Michael Heizmann](https://arxiv.org/search/?searchtype=author&query=Michael Heizmann) 作者:Sebastian Murgul,Johannes Schimper,Michael Heizmann
Automatic transcription of guitar strumming is an underrepresented and challenging task in Music Information Retrieval (MIR), particularly for extracting both strumming directions and chord progressions from audio signals. While existing methods show promise, their effectiveness is often hindered by limited datasets. In this work, we extend a multimodal approach to guitar strumming transcription by introducing a novel dataset and a deep learning-based transcription model. We collect 90 min of real-world guitar recordings using an ESP32 smartwatch motion sensor and a structured recording protocol, complemented by a synthetic dataset of 4h of labeled strumming audio. A Convolutional Recurrent Neural Network (CRNN) model is trained to detect strumming events, classify their direction, and identify the corresponding chords using only microphone audio. Our evaluation demonstrates significant improvements over baseline onset detection algorithms, with a hybrid method combining synthetic and real-world data achieving the highest accuracy for both strumming action detection and chord classification. These results highlight the potential of deep learning for robust guitar strumming transcription and open new avenues for automatic rhythm guitar analysis. 吉他扫弦的自动转录在音乐信息检索(MIR)中是一个被低估且具有挑战性的任务,特别是在从音频信号中同时提取扫弦方向和和弦进行方面。尽管现有方法显示出潜力,但其有效性常常受到数据集有限的制约。在这项工作中,我们通过引入一个新数据集和一个基于深度学习的转录模型,扩展了一种多模态的吉他扫弦转录方法。我们使用 ESP32 智能手表的运动传感器和结构化录音协议收集了 90 分钟的真实世界吉他录音,并补充了一个包含 4 小时带标注扫弦音频的合成数据集。我们训练了一个卷积递归神经网络(CRNN)模型,仅使用麦克风音频来检测扫弦事件、分类其方向并识别相应的和弦。我们的评估显示,与基线起音检测算法相比有显著改进,其中结合合成与真实世界数据的混合方法在扫弦动作检测和和弦分类两方面都达到了最高准确率。 这些结果突显了深度学习在稳健吉他扫弦转录方面的潜力,并为自动节奏吉他分析开辟了新途径。
Subjects: Sound, Computation and Language, Audio and Speech Processing 主题:声音,计算与语言,音频与语音处理
Publish: 2025-08-11 13:34:49 UTC 发布:2025-08-11 13:34:49 UTC
#105 Pareto Multi-Objective Alignment for Language Models #105 帕累托多目标对齐用于语言模型
Authors: [Qiang He](https://arxiv.org/search/?searchtype=author&query=Qiang He), [Setareh Maghsudi](https://arxiv.org/search/?searchtype=author&query=Setareh Maghsudi) 作者:Qiang He、Setareh Maghsudi
Large language models (LLMs) are increasingly deployed in real-world applications that require careful balancing of multiple, often conflicting, objectives, such as informativeness versus conciseness, or helpfulness versus creativity. However, current alignment methods, primarily based on RLHF, optimize LLMs toward a single reward function, resulting in rigid behavior that fails to capture the complexity and diversity of human preferences. This limitation hinders the adaptability of LLMs to practical scenarios, making multi-objective alignment (MOA) a critical yet underexplored area. To bridge this gap, we propose Pareto Multi-Objective Alignment (PAMA), a principled and computationally efficient algorithm designed explicitly for MOA in LLMs. In contrast to computationally prohibitive multi-objective optimization (MOO) methods, PAMA transforms multi-objective RLHF into a convex optimization with a closed-form solution, significantly enhancing scalability. Traditional MOO approaches suffer from prohibitive O(n^2d) complexity, where d represents the number of model parameters, typically in the billions for LLMs, rendering direct optimization infeasible. PAMA reduces this complexity to O(n) where n is the number of objectives, enabling optimization to be completed within milliseconds. We provide theoretical guarantees that PAMA converges to a Pareto stationary point, where no objective can be improved without degrading at least one other. Extensive experiments across language models ranging from 125M to 7B parameters demonstrate PAMA’s robust and effective MOA capabilities, aligning with its theoretical advantages. PAMA provides a highly efficient solution to the MOA problem that was previously considered intractable, offering a practical and theoretically grounded approach to aligning LLMs with diverse human values, paving the way for versatile and adaptable real-world AI deployments. 
大型语言模型(LLMs)正越来越多地被部署到需要在多种常常冲突的目标之间谨慎平衡的实际应用中,例如信息量与简洁性之间,或有帮助性与创造性之间。然而,当前的对齐方法主要基于 RLHF,针对单一奖励函数优化 LLMs,导致行为僵化,无法反映人类偏好的复杂性和多样性。这一局限性阻碍了 LLMs 在实际场景中的适应性,使得多目标对齐(MOA)成为一个关键但尚未充分研究的领域。为弥补这一空白,我们提出了帕累托多目标对齐(PAMA),这是一种为 LLMs 的多目标对齐专门设计的、具有原理性且计算高效的算法。与计算代价高昂的多目标优化(MOO)方法不同,PAMA 将多目标 RLHF 转化为具有解析解的凸优化问题,从而显著提升了可扩展性。传统的 MOO 方法存在不可接受的 O(n^2d) 复杂度,其中 d 表示模型参数的数量(对于 LLMs 通常以十亿计),使得直接优化变得不可行。PAMA 将这一复杂度降至 O(n),其中 n 是目标数量,使得优化可在毫秒级内完成。我们提供了理论保证,表明 PAMA 收敛到帕累托驻点,即在不牺牲至少另一个目标的情况下,无法改进任何目标。在从 1.25 亿到 70 亿参数的语言模型上的大量实验表明,PAMA 的多目标对齐(MOA)能力稳健且有效,与其理论优势一致。PAMA 为此前被认为难以处理的 MOA 问题提供了一个高效的解决方案,提出了一种实用且有理论依据的方法来使 LLMs 与多样化的人类价值观保持一致,为在现实世界中实现多功能且可适应的 AI 部署铺平了道路。
Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题:机器学习、人工智能、计算与语言
Publish: 2025-08-11 08:54:14 UTC 发布日期:2025-08-11 08:54:14 UTC
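The abstract above does not spell out PAMA's closed-form solution, so the sketch below is not PAMA itself. As an illustration of what "convex problem with a closed-form solution" and "Pareto stationary point" mean, it shows the textbook two-objective min-norm combination of gradients (the classic gradient-based MOO step that works directly in parameter space). The function names `min_norm_weight` and `combine` are hypothetical.

```python
def min_norm_weight(g1, g2):
    """Weight a in [0, 1] minimising ||a*g1 + (1-a)*g2||^2.

    Setting the derivative to zero gives
    a = ((g2 - g1) . g2) / ||g1 - g2||^2, clipped to [0, 1].
    """
    denom = sum((x - y) ** 2 for x, y in zip(g1, g2))
    if denom == 0.0:
        # Identical gradients: every convex combination is the same vector.
        return 0.5
    num = sum((y - x) * y for x, y in zip(g1, g2))
    return min(1.0, max(0.0, num / denom))


def combine(g1, g2):
    """Min-norm convex combination of two objective gradients."""
    a = min_norm_weight(g1, g2)
    return [a * x + (1.0 - a) * y for x, y in zip(g1, g2)]


# Orthogonal gradients: the common descent direction splits the difference.
print(combine([1.0, 0.0], [0.0, 1.0]))
# Exactly opposed gradients: the combination vanishes, which is the
# Pareto-stationarity condition (no direction improves both objectives).
print(combine([1.0, 0.0], [-1.0, 0.0]))
```

When the combined vector is zero, no first-order step can improve one objective without hurting the other, which matches the stationarity notion described in the abstract.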
#106 Learning to Align, Aligning to Learn: A Unified Approach for Self-Optimized Alignment #106 学习对齐,靠对齐来学习:一种用于自我优化对齐的统一方法
Authors: [Haowen Wang](https://arxiv.org/search/?searchtype=author&query=Haowen Wang), [Yun Yue](https://arxiv.org/search/?searchtype=author&query=Yun Yue), [Zhiling Ye](https://arxiv.org/search/?searchtype=author&query=Zhiling Ye), [Shuowen Zhang](https://arxiv.org/search/?searchtype=author&query=Shuowen Zhang), [Lei Fan](https://arxiv.org/search/?searchtype=author&query=Lei Fan), [Jiaxin Liang](https://arxiv.org/search/?searchtype=author&query=Jiaxin Liang), [Jiadi Jiang](https://arxiv.org/search/?searchtype=author&query=Jiadi Jiang), [Cheng Wei](https://arxiv.org/search/?searchtype=author&query=Cheng Wei), [Jingyuan Deng](https://arxiv.org/search/?searchtype=author&query=Jingyuan Deng), [Xudong Han](https://arxiv.org/search/?searchtype=author&query=Xudong Han), [Ji Li](https://arxiv.org/search/?searchtype=author&query=Ji Li), [Chunxiao Guo](https://arxiv.org/search/?searchtype=author&query=Chunxiao Guo), [Peng Wei](https://arxiv.org/search/?searchtype=author&query=Peng Wei), [Jian Wang](https://arxiv.org/search/?searchtype=author&query=Jian Wang), [Jinjie Gu](https://arxiv.org/search/?searchtype=author&query=Jinjie Gu) 作者:王浩文,岳芸,叶志玲,张硕文,范磊,梁嘉欣,江家迪,魏成,邓静远,韩旭东,李霁,郭春晓,魏鹏,王健,顾金杰
Alignment methodologies have emerged as a critical pathway for enhancing language model alignment capabilities. While SFT (supervised fine-tuning) accelerates convergence through direct token-level loss intervention, its efficacy is constrained by offline policy trajectory. In contrast, RL (reinforcement learning) facilitates exploratory policy optimization, but suffers from low sample efficiency and stringent dependency on high-quality base models. To address these dual challenges, we propose GRAO (Group Relative Alignment Optimization), a unified framework that synergizes the respective strengths of SFT and RL through three key innovations: 1) A multi-sample generation strategy enabling comparative quality assessment via reward feedback; 2) A novel Group Direct Alignment Loss formulation leveraging intra-group relative advantage weighting; 3) Reference-aware parameter updates guided by pairwise preference dynamics. Our theoretical analysis establishes GRAO’s convergence guarantees and sample efficiency advantages over conventional approaches. Comprehensive evaluations across complex human alignment tasks demonstrate GRAO’s superior performance, achieving 57.70%, 17.65%, 7.95%, and 5.18% relative improvements over SFT, DPO, PPO, and GRPO baselines, respectively. This work provides both a theoretically grounded alignment framework and empirical evidence for efficient capability evolution in language models. 对齐方法已成为提升语言模型对齐能力的关键路径。尽管 SFT(监督微调)通过直接的逐标记损失干预加速了收敛,但其效用受限于离线策略轨迹。相比之下,RL(强化学习)促进了探索性策略优化,但存在样本效率低和对高质量基础模型高度依赖的问题。为了解决这双重挑战,我们提出了 GRAO(组相对对齐优化),这是一个通过三项关键创新将 SFT 与 RL 各自优势协同统一的框架:1)一种多样本生成策略,通过奖励反馈实现比较质量评估;2)一种新颖的组直接对齐损失形式,利用组内相对优势加权;3)在成对偏好动态指导下的参考感知参数更新。我们的理论分析确立了 GRAO 相较于传统方法的收敛性保证和样本效率优势。在复杂的人类对齐任务上的全面评估表明,GRAO 的表现更为优越,分别比 SFT、DPO、PPO 和 GRPO 基线取得了 57.70%、17.65%、7.95% 和 5.18% 的相对提升。该工作既提供了一个有理论基础的对齐框架,也为语言模型能力高效演化提供了实证证据。
Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题:机器学习、人工智能、计算与语言
Publish: 2025-08-11 08:28:47 UTC 发布:2025-08-11 08:28:47 UTC
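The "intra-group relative advantage weighting" mentioned in the GRAO abstract can be illustrated with the GRPO-style normalisation that such methods commonly build on: sample several responses per prompt, score them, and standardise each reward against its own group. This is a minimal sketch under that assumption; the abstract does not give GRAO's exact loss, and `group_relative_advantages` is a hypothetical helper name.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps), so
    responses are weighted by how they compare with their siblings
    rather than by their absolute reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Three sampled responses to the same prompt, scored by a reward model.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
print(advantages)  # below-average, average, above-average response
```

A group in which every response receives the same reward yields all-zero advantages, so it contributes no learning signal under this weighting.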
#107 GLiClass: Generalist Lightweight Model for Sequence Classification Tasks #107 GLiClass:用于序列分类任务的通用轻量级模型
Authors: [Ihor Stepanov](https://arxiv.org/search/?searchtype=author&query=Ihor Stepanov), [Mykhailo Shtopko](https://arxiv.org/search/?searchtype=author&query=Mykhailo Shtopko), [Dmytro Vodianytskyi](https://arxiv.org/search/?searchtype=author&query=Dmytro Vodianytskyi), [Oleksandr Lukashov](https://arxiv.org/search/?searchtype=author&query=Oleksandr Lukashov), [Alexander Yavorskyi](https://arxiv.org/search/?searchtype=author&query=Alexander Yavorskyi), [Mykyta Yaroshenko](https://arxiv.org/search/?searchtype=author&query=Mykyta Yaroshenko) 作者:Ihor Stepanov, Mykhailo Shtopko, Dmytro Vodianytskyi, Oleksandr Lukashov, Alexander Yavorskyi, Mykyta Yaroshenko
Classification is one of the most widespread tasks in AI applications, serving often as the first step in filtering, sorting, and categorizing data. Since modern AI systems must handle large volumes of input data and early pipeline stages can propagate errors downstream, achieving high efficiency and accuracy is critical. Moreover, classification requirements can change dynamically based on user needs, necessitating models with strong zero-shot capabilities. While generative LLMs have become mainstream for zero-shot classification due to their versatility, they suffer from inconsistent instruction following and computational inefficiency. Cross-encoders, commonly used as rerankers in RAG pipelines, face a different bottleneck: they must process text-label pairs sequentially, significantly reducing efficiency with large label sets. Embedding-based approaches offer good efficiency but struggle with complex scenarios involving logical and semantic constraints. We propose GLiClass, a novel method that adapts the GLiNER architecture for sequence classification tasks. Our approach achieves strong accuracy and efficiency comparable to embedding-based methods, while maintaining the flexibility needed for zero-shot and few-shot learning scenarios. Additionally, we adapted proximal policy optimization (PPO) for multi-label text classification, enabling training classifiers in data-sparse conditions or from human feedback. 分类是人工智能应用中最广泛的任务之一,常常作为筛选、排序和分类数据的第一步。由于现代人工智能系统必须处理大量输入数据,并且早期流水线阶段的错误会向下游传播,因此实现高效性和高准确性至关重要。此外,分类需求可能会根据用户需求动态变化,这就需要具备强大零样本能力的模型。尽管生成型 LLMs 凭借其通用性已成为零样本分类的主流,但它们在遵循指令方面存在不一致性且计算效率低下。作为在 RAG 流水线中常用的重排序器,交叉编码器面临的瓶颈则不同:它们必须对文本-标签对逐一处理,在标签集合较大时显著降低效率。基于嵌入的方法在效率方面表现良好,但在涉及逻辑和语义约束的复杂场景中表现欠佳。我们提出了 GLiClass,一种将 GLiNER 架构改造用于序列分类任务的新方法。 我们的方法在准确性和效率上与基于嵌入的方法相当,同时保留了零样本和少样本学习场景所需的灵活性。此外,我们将近端策略优化(PPO)改编用于多标签文本分类,使得在数据稀缺条件下或基于人类反馈能够训练分类器。
Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题:机器学习、人工智能、计算与语言
Publish: 2025-08-11 06:22:25 UTC 发布:2025-08-11 06:22:25 UTC
#108 Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents #108 拆解与重建:基于技能混合的视觉与语言导航代理
Authors: [Tianyi Ma](https://arxiv.org/search/?searchtype=author&query=Tianyi Ma), [Yue Zhang](https://arxiv.org/search/?searchtype=author&query=Yue Zhang), [Zehao Wang](https://arxiv.org/search/?searchtype=author&query=Zehao Wang), [Parisa Kordjamshidi](https://arxiv.org/search/?searchtype=author&query=Parisa Kordjamshidi) 作者:马天意、张越、王泽浩、Parisa Kordjamshidi
Vision-and-Language Navigation (VLN) poses significant challenges in enabling agents to interpret natural language instructions and navigate complex 3D environments. While recent progress has been driven by large-scale pre-training and data augmentation, current methods still struggle to generalize to unseen scenarios, particularly when complex spatial and temporal reasoning is required. In this work, we propose SkillNav, a modular framework that introduces structured, skill-based reasoning into Transformer-based VLN agents. Our method decomposes navigation into a set of interpretable atomic skills (e.g., Vertical Movement, Area and Region Identification, Stop and Pause), each handled by a specialized agent. We then introduce a novel zero-shot Vision-Language Model (VLM)-based router, which dynamically selects the most suitable agent at each time step by aligning sub-goals with visual observations and historical actions. SkillNav achieves a new state-of-the-art performance on the R2R benchmark and demonstrates strong generalization to the GSA-R2R benchmark that includes novel instruction styles and unseen environments. 视觉与语言导航(VLN)在使智能体理解自然语言指令并在复杂三维环境中导航方面面临重大挑战。尽管最近通过大规模预训练和数据增强取得了进展,但当前方法在泛化到未见场景时仍然困难,尤其是在需要复杂时空推理时。在本工作中,我们提出了 SkillNav,这是一个模块化框架,将结构化、基于技能的推理引入基于 Transformer 的 VLN 智能体。我们的方法将导航分解为一组可解释的原子技能(例如,垂直移动、区域与区域识别、停止与暂停),每种技能由专门的智能体处理。随后我们引入了一种新颖的零样本视觉-语言模型(VLM)路由器,通过将子目标与视觉观测和历史动作对齐,在每个时间步动态选择最合适的智能体。SkillNav 在 R2R 基准上实现了新的最先进性能,并在包含新颖指令风格和未见环境的 GSA-R2R 基准上展现出强大的泛化能力。
Subjects: Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition 主题:人工智能、计算与语言、计算机视觉与模式识别
Publish: 2025-08-11 05:50:30 UTC 发布:2025-08-11 05:50:30 UTC
#109 Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization #109 Klear-Reasoner:通过保持梯度的裁剪策略优化提升推理能力
Authors: [Zhenpeng Su](https://arxiv.org/search/?searchtype=author&query=Zhenpeng Su), [Leiyu Pan](https://arxiv.org/search/?searchtype=author&query=Leiyu Pan), [Xue Bai](https://arxiv.org/search/?searchtype=author&query=Xue Bai), [Dening Liu](https://arxiv.org/search/?searchtype=author&query=Dening Liu), [Guanting Dong](https://arxiv.org/search/?searchtype=author&query=Guanting Dong), [Jiaming Huang](https://arxiv.org/search/?searchtype=author&query=Jiaming Huang), [Wenping Hu](https://arxiv.org/search/?searchtype=author&query=Wenping Hu), [Guorui Zhou](https://arxiv.org/search/?searchtype=author&query=Guorui Zhou) 作者:Zhenpeng Su、Leiyu Pan、Xue Bai、Dening Liu、Guanting Dong、Jiaming Huang、Wenping Hu、Guorui Zhou
We present Klear-Reasoner, a model with long reasoning capabilities that demonstrates careful deliberation during problem solving, achieving outstanding performance across multiple benchmarks. Although there are already many excellent works related to inference models in the current community, there are still many problems with reproducing high-performance inference models due to incomplete disclosure of training details. This report provides an in-depth analysis of the reasoning model, covering the entire post-training workflow from data preparation and long Chain-of-Thought supervised fine-tuning (long CoT SFT) to reinforcement learning (RL), along with detailed ablation studies for each experimental component. For SFT data, our experiments show that a small number of high-quality data sources are more effective than a large number of diverse data sources, and that difficult samples can achieve better results without accuracy filtering. In addition, we investigate two key issues with current clipping mechanisms in RL: Clipping suppresses critical exploration signals and ignores suboptimal trajectories. To address these challenges, we propose Gradient-Preserving clipping Policy Optimization (GPPO) that gently backpropagates gradients from clipped tokens. GPPO not only enhances the model’s exploration capacity but also improves its efficiency in learning from negative samples. Klear-Reasoner exhibits exceptional reasoning abilities in mathematics and programming, scoring 90.5% on AIME 2024, 83.2% on AIME 2025, 66.0% on LiveCodeBench V5 and 58.1% on LiveCodeBench V6. 
我们提出了 Klear-Reasoner,这是一种具备长程推理能力的模型,在问题求解过程中展现出谨慎的深思熟虑,并在多个基准上取得了优异的表现。尽管当前社区中已有许多优秀的推理模型相关工作,但由于训练细节披露不全,仍然存在许多高性能推理模型难以复现的问题。本报告对该推理模型进行了深入分析,涵盖了从数据准备、长链式思维监督微调(long CoT SFT)到强化学习(RL)的整个后训练工作流,并对每个实验组件进行了详细的消融研究。关于 SFT 数据,我们的实验证明少量高质量数据源比大量多样化数据源更为有效,并且困难样本在不进行准确性筛选的情况下能获得更好的结果。此外,我们还研究了当前 RL 裁剪机制的两个关键问题:裁剪抑制了重要的探索信号并忽视了次优轨迹。 为了解决这些挑战,我们提出了保持梯度的裁剪策略优化(Gradient-Preserving clipping Policy Optimization,GPPO),它能够从被裁剪的标记中温和地反向传播梯度。GPPO 不仅增强了模型的探索能力,还提高了其从负样本中学习的效率。Klear-Reasoner 在数学和编程方面表现出卓越的推理能力,在 AIME 2024 上得分 90.5%,AIME 2025 上得分 83.2%,LiveCodeBench V5 上得分 66.0%,LiveCodeBench V6 上得分 58.1%。
Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题:机器学习、人工智能、计算与语言
Publish: 2025-08-11 05:17:51 UTC 发布时间:2025-08-11 05:17:51 UTC
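To make the clipping issue in the Klear-Reasoner abstract concrete, the sketch below contrasts a standard clipped importance ratio, whose gradient is exactly zero outside the clip range, with one way of "gently backpropagating gradients from clipped tokens": keep the clipped value in the forward pass but route the gradient through a stop-gradient rescaling, `bound / stop_gradient(ratio) * ratio`. This form is an assumption made for illustration and is not confirmed by the abstract; the derivatives are written out by hand so the example needs no autograd library.

```python
def clipped_ratio(ratio, eps=0.2):
    """Standard clip. Returns (value, d value / d ratio)."""
    low, high = 1.0 - eps, 1.0 + eps
    if ratio < low:
        return low, 0.0   # clipped: no gradient reaches this token
    if ratio > high:
        return high, 0.0  # clipped: no gradient reaches this token
    return ratio, 1.0


def gradient_preserving_clipped_ratio(ratio, eps=0.2):
    """Assumed gradient-preserving variant. Returns (value, gradient).

    Forward value is the same clipped value as above. Treating
    value / stop_gradient(ratio) as a constant multiplier on ratio,
    the gradient with respect to ratio becomes value / ratio, which is
    non-zero even outside the clip range. ratio must be positive
    (it is a probability ratio).
    """
    value, _ = clipped_ratio(ratio, eps)
    return value, value / ratio


for r in (0.5, 1.0, 1.5):
    print(r, clipped_ratio(r), gradient_preserving_clipped_ratio(r))
```

Inside the clip range the two versions agree; outside it, the forward value is still bounded while the clipped token keeps a scaled gradient.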
#110 ThinkTuning: Instilling Cognitive Reflections without Distillation #110 ThinkTuning:在不蒸馏的情况下植入认知反思
Authors: [Aswin RRV](https://arxiv.org/search/?searchtype=author&query=Aswin RRV), [Jacob Dineen](https://arxiv.org/search/?searchtype=author&query=Jacob Dineen), [Divij Handa](https://arxiv.org/search/?searchtype=author&query=Divij Handa), [Md Nayem Uddin](https://arxiv.org/search/?searchtype=author&query=Md Nayem Uddin), [Mihir Parmar](https://arxiv.org/search/?searchtype=author&query=Mihir Parmar), [Chitta Baral](https://arxiv.org/search/?searchtype=author&query=Chitta Baral), [Ben Zhou](https://arxiv.org/search/?searchtype=author&query=Ben Zhou) 作者:Aswin RRV、Jacob Dineen、Divij Handa、Md Nayem Uddin、Mihir Parmar、Chitta Baral、Ben Zhou
Recent advances in test-time scaling have led to the emergence of thinking LLMs that exhibit self-reflective behaviors and multi-step reasoning. While RL drives this self-improvement paradigm, a recent study (Gandhi et al., 2025) shows that RL alone does not truly instill these new reasoning abilities - it merely draws out behaviors already present in the base models. This raises a question: How can we train the models that don’t exhibit such thinking behavior to develop it in the first place? To this end, we propose ThinkTuning, a GRPO-based interactive training approach where we augment the rollouts of a student model with the guidance from a teacher model. A simple idea from classroom practice inspires our method: a teacher poses a problem, lets the student try an answer, then gives corrective feedback – enough to point the mind in the right direction and then show the solution. Each piece of feedback reshapes the student’s thoughts, leading them to arrive at the correct solution. Similarly, we find that this type of implicit supervision through feedback from a teacher model of the same size improves the reasoning capabilities of the student model. In particular, on average, our method shows a 3.85% improvement over zero-shot baselines across benchmarks, and on MATH-500, AIME and GPQA-Diamond it shows 2.08%, 2.23% and 3.99% improvements over the vanilla-GRPO baseline. Source code is available at https://github.com/3rdAT/ThinkTuning. 最近在测试时扩展方面的进展催生了表现出自我反思行为和多步推理的思考型 LLMs。虽然强化学习推动了这一自我改进范式,但最近的一项研究(Gandhi 等,2025)表明,仅靠强化学习并不能真正灌输这些新的推理能力——它只是将基础模型中已存在的行为表现出来。这引出了一个问题:我们如何训练那些不表现出此类思维行为的模型,使其首先发展出这种能力?为此,我们提出了 ThinkTuning,一种基于 GRPO 的交互式训练方法,在该方法中我们用教师模型的指导来增强学生模型的 rollout。一个来自课堂实践的简单想法启发了我们的方法:教师提出一个问题,让学生尝试回答,然后给予纠正性反馈——足以把思路指向正确的方向,然后展示解答。每一条反馈都会重塑学生的思维,促使他们得出正确的解决方案。类似地,我们发现,通过来自同等规模教师模型的这类隐式反馈监督可以提升学生模型的推理能力。 特别地,平均而言,我们的方法在各基准上比零样本基线提高了 3.85%,在 MATH-500、AIME 和 GPQA-Diamond 上分别比原始 GRPO 基线提高了 2.08%、2.23%和 3.99%。源代码可在 https://github.com/3rdAT/ThinkTuning 获取。
Subjects: Artificial Intelligence, Computation and Language, Machine Learning 主题:人工智能、计算与语言、机器学习
Publish: 2025-08-11 04:51:43 UTC 发布:2025-08-11 04:51:43 UTC
#111 Conversational DNA: A New Visual Language for Understanding Dialogue Structure in Human and AI #111 会话 DNA:一种用于理解人类与人工智能对话结构的新视觉语言
Author: [Baihan Lin](https://arxiv.org/search/?searchtype=author&query=Baihan Lin) 作者:林柏涵
What if the patterns hidden within dialogue reveal more about communication than the words themselves? We introduce Conversational DNA, a novel visual language that treats any dialogue – whether between humans, between human and AI, or among groups – as a living system with interpretable structure that can be visualized, compared, and understood. Unlike traditional conversation analysis that reduces rich interaction to statistical summaries, our approach reveals the temporal architecture of dialogue through biological metaphors. Linguistic complexity flows through strand thickness, emotional trajectories cascade through color gradients, conversational relevance forms through connecting elements, and topic coherence maintains structural integrity through helical patterns. Through exploratory analysis of therapeutic conversations and historically significant human-AI dialogues, we demonstrate how this visualization approach reveals interaction patterns that traditional methods miss. Our work contributes a new creative framework for understanding communication that bridges data visualization, human-computer interaction, and the fundamental question of what makes dialogue meaningful in an age where humans increasingly converse with artificial minds. 如果对话中隐藏的模式比词语本身更能揭示交流的本质,会怎样?我们提出“会话 DNA”,这是一种新颖的视觉语言,将任何对话——无论是人类之间、人类与人工智能之间,还是群体间的对话——视为具有可解释结构的活体系,能够被可视化、比较并理解。不同于将丰富互动简化为统计摘要的传统会话分析方法,我们的方法通过生物学隐喻揭示对话的时间架构。语言复杂性通过线条粗细流动,情感轨迹通过颜色渐变层叠,会话相关性通过连接元素形成,主题连贯性通过螺旋模式维持结构完整性。通过对治疗性对话和具有历史意义的人机对话的探索性分析,我们展示了这种可视化方法如何揭示传统方法所忽视的互动模式。 我们的工作提出了一个新的创造性框架来理解交流,连接了数据可视化、人机交互,以及在这个人类日益与人工智能对话的时代,究竟是什么使对话有意义这一根本问题。
Subjects: Human-Computer Interaction, Artificial Intelligence, Computation and Language, Computers and Society 主题:人机交互、人工智能、计算与语言、计算机与社会
Publish: 2025-08-11 00:43:35 UTC 发表:2025-08-11 00:43:35 UTC
#112 Democratizing Diplomacy: A Harness for Evaluating Any Large Language Model on Full-Press Diplomacy #112 让外交博弈大众化:用于在 full-press《外交》(Diplomacy)游戏中评估任意大型语言模型的评测框架
Authors: [Alexander Duffy](https://arxiv.org/search/?searchtype=author&query=Alexander Duffy), [Samuel J Paech](https://arxiv.org/search/?searchtype=author&query=Samuel J Paech), [Ishana Shastri](https://arxiv.org/search/?searchtype=author&query=Ishana Shastri), [Elizabeth Karpinski](https://arxiv.org/search/?searchtype=author&query=Elizabeth Karpinski), [Baptiste Alloui-Cros](https://arxiv.org/search/?searchtype=author&query=Baptiste Alloui-Cros), [Tyler Marques](https://arxiv.org/search/?searchtype=author&query=Tyler Marques), [Matthew Lyle Olson](https://arxiv.org/search/?searchtype=author&query=Matthew Lyle Olson) 作者:Alexander Duffy、Samuel J Paech、Ishana Shastri、Elizabeth Karpinski、Baptiste Alloui-Cros、Tyler Marques、Matthew Lyle Olson
We present the first evaluation harness that enables any out-of-the-box, local, Large Language Models (LLMs) to play full-press Diplomacy without fine-tuning or specialized training. Previous work required frontier LLMs, or fine-tuning, due to the high complexity and information density of Diplomacy’s game state. Combined with the high variance of matches, these factors made Diplomacy prohibitive for study. In this work, we used data-driven iteration to optimize a textual game state representation such that a 24B model can reliably complete matches without any fine tuning. We develop tooling to facilitate hypothesis testing and statistical analysis, and we present case studies on persuasion, aggressive playstyles, and performance across a range of models. We conduct a variety of experiments across many popular LLMs, finding the larger models perform the best, but the smaller models still play adequately. We also introduce Critical State Analysis: an experimental protocol for rapidly iterating and analyzing key moments in a game at depth. Our harness democratizes the evaluation of strategic reasoning in LLMs by eliminating the need for fine-tuning, and it provides insights into how these capabilities emerge naturally from widely used LLMs. Our code is available in the supplement and will be open sourced. 我们提出了第一个评估工具,使任何开箱即用、本地运行的大型语言模型(LLMs)能够在不进行微调或专门训练的情况下,完整地参与全压式《外交》(Diplomacy)游戏。先前的工作由于《外交》游戏状态的高复杂性和信息密度,或需依赖最前沿的 LLMs,或需进行微调。再加上比赛的高方差,这些因素使得研究《外交》变得困难。在本工作中,我们通过数据驱动的迭代优化了文本化的游戏状态表示,使得一个 24B 模型能够在不进行任何微调的情况下可靠地完成比赛。我们开发了便于假设检验和统计分析的工具,并对说服、侵略性玩法以及不同模型间的表现等进行了案例研究。我们在多种流行的 LLMs 上进行了多种实验,发现更大的模型表现最好,但较小的模型仍能胜任游戏。我们还引入了关键状态分析(Critical State Analysis):一种用于快速迭代并深入分析游戏关键时刻的实验协议。 我们的测试框架通过无需微调来使评估 LLMs 的策略推理能力变得民主化,并提供了关于这些能力如何在广泛使用的 LLMs 中自然出现的见解。我们的代码已在补充材料中提供,并将开源。
Subjects: Artificial Intelligence, Computation and Language, Computers and Society, Machine Learning 主题:人工智能,计算与语言,计算机与社会,机器学习
Publish: 2025-08-10 21:07:08 UTC 发布:2025-08-10 21:07:08 UTC
#113 CP-Agent: Agentic Constraint Programming #113 CP-Agent:具代理性的约束编程
Author: [Stefan Szeider](https://arxiv.org/search/?searchtype=author&query=Stefan Szeider) 作者:Stefan Szeider
Translating natural language problem descriptions into formal constraint models remains a fundamental challenge in constraint programming, requiring deep expertise in both the problem domain and modeling frameworks. Previous approaches to automating this translation have employed fixed workflows with predetermined modeling steps, failing on a significant number of benchmark problems. We present a new approach using a pure agentic strategy without any fixed pipeline. We developed a general-purpose Python coding agent based on the ReAct (Reason and Act) principle, utilizing a persistent IPython kernel for stateful code execution and iterative development. Rather than embedding constraint programming logic into the agent architecture, domain-specific expertise is injected solely through a carefully crafted project prompt. The agent combines this prompt-encoded knowledge with access to file operations and code execution tools, enabling it to test hypotheses, debug failures, and verify solutions dynamically. Implemented in just a few hundred lines of code, this architecture successfully solves all 101 problems of the CP-Bench constraint programming benchmark set. The results suggest that constraint modeling tasks require the combination of general coding tools and domain expertise encoded in prompts, rather than specialized agent architectures or predefined workflows. 将自然语言问题描述翻译为形式化约束模型仍然是约束编程领域的一个基本挑战,这需要在问题领域和建模框架两方面具备深厚的专业知识。以往自动化此翻译的尝试采用了具有预定建模步骤的固定工作流,在相当数量的基准问题上失败。我们提出了一种全新的方法,使用纯代理策略而不依赖任何固定流水线。我们基于 ReAct(推理与行动)原则开发了一个通用的 Python 编码代理,利用持久化的 IPython 内核进行有状态的代码执行和迭代开发。我们没有将约束编程逻辑嵌入代理架构,而是仅通过精心设计的项目提示注入领域特定的专业知识。该代理将提示中编码的知识与对文件操作和代码执行工具的访问相结合,使其能够动态地测试假设、调试错误并验证解决方案。该架构仅用几百行代码实现,就成功解决了 CP-Bench 约束编程基准集中全部 101 个问题。结果表明,约束建模任务需要将通用编码工具与在提示中编码的领域专业知识结合起来,而不是依赖专门的代理架构或预定义的工作流程。
Subjects: Artificial Intelligence, Computation and Language, Machine Learning, Software Engineering 主题:人工智能,计算与语言,机器学习,软件工程
Publish: 2025-08-10 19:59:01 UTC 发布:2025-08-10 19:59:01 UTC
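The agent loop described in the CP-Agent abstract (act by running code, observe the output, keep interpreter state between steps) can be sketched in a few lines. This is a simplified stand-in, not the paper's implementation: a shared `exec` namespace replaces the persistent IPython kernel, and a scripted list of actions replaces the LLM. The names `PersistentSession`, `react_loop`, and `scripted_policy` are hypothetical.

```python
import contextlib
import io
import traceback


class PersistentSession:
    """Runs code snippets in one shared namespace so state survives steps."""

    def __init__(self):
        self.namespace = {}

    def run(self, code):
        """Execute code and return captured stdout plus any error text."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self.namespace)
        except Exception:
            # The error message is returned as an observation, so the
            # agent can read it and repair its next attempt.
            return buffer.getvalue() + traceback.format_exc(limit=1)
        return buffer.getvalue()


def react_loop(propose_action, session, max_steps=10):
    """Alternate between proposing code (act) and reading output (observe)."""
    observations = []
    for _ in range(max_steps):
        action = propose_action(observations)
        if action is None:  # the policy decides it is finished
            break
        observations.append(session.run(action))
    return observations


# Scripted stand-in for the LLM: define x, hit an error, then recover.
steps = ["x = 3", "print(y)", "y = x * 2\nprint(y)"]


def scripted_policy(observations):
    index = len(observations)
    return steps[index] if index < len(steps) else None


for observation in react_loop(scripted_policy, PersistentSession()):
    print(repr(observation))
```

The third step succeeds only because `x` from the first step is still defined, which is the point of keeping the execution state persistent across the loop.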
#114 Event-Aware Sentiment Factors from LLM-Augmented Financial Tweets: A Transparent Framework for Interpretable Quant Trading #114 基于 LLM 增强的金融推特事件感知情绪因子:用于可解释量化交易的透明框架
Authors: [Yueyi Wang](https://arxiv.org/search/?searchtype=author&query=Yueyi Wang), [Qiyao Wei](https://arxiv.org/search/?searchtype=author&query=Qiyao Wei) 作者:王悦怡,魏其尧
In this study, we wish to showcase the unique utility of large language models (LLMs) in financial semantic annotation and alpha signal discovery. Leveraging a corpus of company-related tweets, we use an LLM to automatically assign multi-label event categories to high-sentiment-intensity tweets. We align these labeled sentiment signals with forward returns over 1-to-7-day horizons to evaluate their statistical efficacy and market tradability. Our experiments reveal that certain event labels consistently yield negative alpha, with Sharpe ratios as low as -0.38 and information coefficients exceeding 0.05, all statistically significant at the 95% confidence level. This study establishes the feasibility of transforming unstructured social media text into structured, multi-label event variables. A key contribution of this work is its commitment to transparency and reproducibility; all code and methodologies are made publicly available. Our results provide compelling evidence that social media sentiment is a valuable, albeit noisy, signal in financial forecasting and underscore the potential of open-source frameworks to democratize algorithmic trading research. 在本研究中,我们旨在展示大型语言模型(LLMs)在金融语义标注和阿尔法信号发现方面的独特实用性。利用一组与公司相关的推文语料库,我们使用 LLM 自动为高情绪强度的推文分配多标签事件类别。我们将这些带标签的情绪信号与未来 1 到 7 天的回报对齐,以评估其统计效力和市场可交易性。我们的实验证明,某些事件标签持续产生负阿尔法,夏普比率低至 -0.38,信息系数超过 0.05,且均在 95% 置信水平上具有统计显著性。本研究证明了将非结构化社交媒体文本转换为结构化、多标签事件变量的可行性。本工作的一个关键贡献是致力于透明性和可重复性;所有代码和方法均公开可用。我们的结果提供了有力证据,表明社交媒体情绪是金融预测中一种有价值但噪声较大的信号,并强调了开源框架在普及算法交易研究方面的潜力。
Subjects: Statistical Finance, Computation and Language, Machine Learning 主题:统计金融、计算与语言、机器学习
Publish: 2025-08-10 16:09:14 UTC 发表:2025-08-10 16:09:14 UTC
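The two statistics quoted in the abstract above can be computed as follows. This is a minimal sketch using common definitions, which are assumptions here since the abstract does not state its exact formulas: the information coefficient as the Spearman rank correlation between a signal and forward returns, and the Sharpe ratio as annualised mean over standard deviation with a zero risk-free rate. All function names and the sample numbers are illustrative only.

```python
import math
import statistics


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def information_coefficient(signal, forward_returns):
    """Spearman rank correlation between a signal and forward returns."""
    return pearson(ranks(signal), ranks(forward_returns))


def sharpe_ratio(returns, periods_per_year=252):
    """Annualised Sharpe ratio, assuming a zero risk-free rate."""
    mean = statistics.fmean(returns)
    return mean / statistics.stdev(returns) * math.sqrt(periods_per_year)


signal = [0.9, 0.1, 0.5, 0.7]               # made-up sentiment scores
forward = [-0.02, 0.01, -0.005, -0.01]      # made-up forward returns
print(information_coefficient(signal, forward))
print(sharpe_ratio(forward))
```

A signal that ranks assets in the opposite order of their subsequent returns has a negative rank correlation, and a strategy whose average return is below zero has a negative Sharpe ratio, as with the event labels the abstract describes.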
#115 A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems #115 自我进化人工智能代理的全面综述:连接基础模型与终身主体系统的新范式
Authors: [Jinyuan Fang](https://arxiv.org/search/?searchtype=author&query=Jinyuan Fang), [Yanwen Peng](https://arxiv.org/search/?searchtype=author&query=Yanwen Peng), [Xi Zhang](https://arxiv.org/search/?searchtype=author&query=Xi Zhang), [Yingxu Wang](https://arxiv.org/search/?searchtype=author&query=Yingxu Wang), [Xinhao Yi](https://arxiv.org/search/?searchtype=author&query=Xinhao Yi), [Guibin Zhang](https://arxiv.org/search/?searchtype=author&query=Guibin Zhang), [Yi Xu](https://arxiv.org/search/?searchtype=author&query=Yi Xu), [Bin Wu](https://arxiv.org/search/?searchtype=author&query=Bin Wu), [Siwei Liu](https://arxiv.org/search/?searchtype=author&query=Siwei Liu), [Zihao Li](https://arxiv.org/search/?searchtype=author&query=Zihao Li), [Zhaochun Ren](https://arxiv.org/search/?searchtype=author&query=Zhaochun Ren), [Nikos Aletras](https://arxiv.org/search/?searchtype=author&query=Nikos Aletras), [Xi Wang](https://arxiv.org/search/?searchtype=author&query=Xi Wang), [Han Zhou](https://arxiv.org/search/?searchtype=author&query=Han Zhou), [Zaiqiao Meng](https://arxiv.org/search/?searchtype=author&query=Zaiqiao Meng) 作者:方金元、彭燕文、张曦、王颖旭、易新昊、张贵斌、徐毅、吴斌、刘思唯、李子豪、任昭春、Nikos Aletras、王曦、周涵、孟在桥
Recent advances in large language models have sparked growing interest in AI agents capable of solving complex, real-world tasks. However, most existing agent systems rely on manually crafted configurations that remain static after deployment, limiting their ability to adapt to dynamic and evolving environments. To this end, recent research has explored agent evolution techniques that aim to automatically enhance agent systems based on interaction data and environmental feedback. This emerging direction lays the foundation for self-evolving AI agents, which bridge the static capabilities of foundation models with the continuous adaptability required by lifelong agentic systems. In this survey, we provide a comprehensive review of existing techniques for self-evolving agentic systems. Specifically, we first introduce a unified conceptual framework that abstracts the feedback loop underlying the design of self-evolving agentic systems. The framework highlights four key components: System Inputs, Agent System, Environment, and Optimisers, serving as a foundation for understanding and comparing different strategies. Based on this framework, we systematically review a wide range of self-evolving techniques that target different components of the agent system. We also investigate domain-specific evolution strategies developed for specialised fields such as biomedicine, programming, and finance, where optimisation objectives are tightly coupled with domain constraints. In addition, we provide a dedicated discussion on the evaluation, safety, and ethical considerations for self-evolving agentic systems, which are critical to ensuring their effectiveness and reliability. This survey aims to provide researchers and practitioners with a systematic understanding of self-evolving AI agents, laying the foundation for the development of more adaptive, autonomous, and lifelong agentic systems. 
近年来大型语言模型的进展引发了人们对能够解决复杂现实任务的智能体的浓厚兴趣。然而,大多数现有的智能体系统依赖人工设计的配置,部署后保持静态,限制了它们在动态和不断变化的环境中适应的能力。为此,近期研究探索了智能体进化技术,旨在基于交互数据和环境反馈自动增强智能体系统。这一新兴方向为自我进化的人工智能智能体奠定了基础,将基础模型的静态能力与终身智能体系统所需的持续适应性联系起来。在本综述中,我们对现有的自我进化智能体系统技术进行了全面回顾。具体而言,我们首先引入了一个统一的概念框架,用以抽象设计自我进化智能体系统所依赖的反馈循环。该框架强调四个关键组成部分:系统输入、智能体系统、环境和优化器,为理解和比较不同策略提供了基础。 基于该框架,我们系统性地回顾了针对智能体系统不同组成部分的一系列自我演化技术。我们还研究了为生物医学、编程和金融等专门领域开发的领域特定演化策略,这些领域的优化目标与领域约束紧密相关。此外,我们针对自我演化智能体系统的评估、安全性和伦理考量进行了专门讨论,这些内容对于确保其有效性和可靠性至关重要。本综述旨在为研究人员和实践者提供对自我演化人工智能智能体的系统性理解,为开发更具适应性、自主性和终身学习能力的智能体系统奠定基础。
Subjects: Artificial Intelligence, Computation and Language, Multiagent Systems 主题:人工智能、计算与语言、多智能体系统
Publish: 2025-08-10 16:07:32 UTC 发布时间:2025-08-10 16:07:32 UTC
#116 Generative AI for Strategic Plan Development #116 用于战略规划制定的生成式人工智能
Author: [Jesse Ponnock](https://arxiv.org/search/?searchtype=author&query=Jesse Ponnock) 作者:Jesse Ponnock
Given recent breakthroughs in Generative Artificial Intelligence (GAI) and Large Language Models (LLMs), more and more professional services are being augmented through Artificial Intelligence (AI), which once seemed impossible to automate. This paper presents a modular model for leveraging GAI in developing strategic plans for large scale government organizations and evaluates leading machine learning techniques in their application towards one of the identified modules. Specifically, the performance of BERTopic and Non-negative Matrix Factorization (NMF) are evaluated in their ability to use topic modeling to generate themes representative of Vision Elements within a strategic plan. To accomplish this, BERTopic and NMF models are trained using a large volume of reports from the Government Accountability Office (GAO). The generated topics from each model are then scored for similarity against the Vision Elements of a published strategic plan and the results are compared. Our results show that these techniques are capable of generating themes similar to 100% of the elements being evaluated against. Further, we conclude that BERTopic performs best in this application with more than half of its correlated topics achieving a “medium” or “strong” correlation. A capability of GAI-enabled strategic plan development impacts a multi-billion dollar industry and assists the federal government in overcoming regulatory requirements which are crucial to the public good. Further work will focus on the operationalization of the concept proven in this study as well as viability of the remaining modules in the proposed model for GAI-generated strategic plans. 
鉴于生成式人工智能(GAI)和大型语言模型(LLMs)的最新突破,越来越多原本看似无法自动化的专业服务正通过人工智能(AI)得到增强。本文提出了一个模块化模型,用于在制定大型政府机构的战略计划时利用 GAI,并评估了若干领先的机器学习技术在其所识别模块之一中的应用表现。具体而言,评估了 BERTopic 和非负矩阵分解(NMF)在使用主题建模生成代表战略计划中愿景要素的主题方面的能力。为此,使用来自美国政府问责办公室(GAO)的海量报告训练了 BERTopic 和 NMF 模型。随后将每个模型生成的主题与已发布战略计划的愿景要素进行相似度评分,并对结果进行了比较。我们的结果显示,这些技术能够生成与被评估的 100%要素相似的主题。 此外,我们得出结论,在此应用中,BERTopic 表现最佳,其超过一半的相关主题达到了“中等”或“强”相关度。GAI 支持的战略计划制定能力影响着数十亿美元的产业,并帮助联邦政府克服对公共利益至关重要的监管要求。本研究后续工作将侧重于已验证概念的落地实施,以及拟议模型中其余模块在 GAI 生成的战略计划中可行性的研究。
Subjects: Artificial Intelligence, Computation and Language, Machine Learning 主题:人工智能、计算与语言、机器学习
Publish: 2025-08-10 16:07:07 UTC 发布:2025-08-10 16:07:07 UTC
#117 Rethinking Domain-Specific LLM Benchmark Construction: A Comprehensiveness-Compactness Approach #117 重新思考领域特定 LLM 基准构建:一种全面性-紧凑性方法
Authors: [Rubing Chen](https://arxiv.org/search/?searchtype=author&query=Rubing Chen), [Jiaxin Wu](https://arxiv.org/search/?searchtype=author&query=Jiaxin Wu), [Jian Wang](https://arxiv.org/search/?searchtype=author&query=Jian Wang), [Xulu Zhang](https://arxiv.org/search/?searchtype=author&query=Xulu Zhang), [Wenqi Fan](https://arxiv.org/search/?searchtype=author&query=Wenqi Fan), [Chenghua Lin](https://arxiv.org/search/?searchtype=author&query=Chenghua Lin), [Xiao-Yong Wei](https://arxiv.org/search/?searchtype=author&query=Xiao-Yong Wei), [Qing Li](https://arxiv.org/search/?searchtype=author&query=Qing Li)
Numerous benchmarks have been built to evaluate the domain-specific abilities of large language models (LLMs), highlighting the need for effective and efficient benchmark construction. Existing domain-specific benchmarks primarily focus on the scaling law, relying on massive corpora for supervised fine-tuning or generating extensive question sets for broad coverage. However, the impact of corpus and question-answer (QA) set design on the precision and recall of domain-specific LLMs remains unexplored. In this paper, we address this gap and demonstrate that the scaling law is not always the optimal principle for benchmark construction in specific domains. Instead, we propose Comp-Comp, an iterative benchmarking framework based on a comprehensiveness-compactness principle. Here, comprehensiveness ensures semantic recall of the domain, while compactness enhances precision, guiding both corpus and QA set construction. To validate our framework, we conducted a case study in a well-renowned university, resulting in the creation of XUBench, a large-scale and comprehensive closed-domain benchmark. Although we use the academic domain as the case in this work, our Comp-Comp framework is designed to be extensible beyond academia, providing valuable insights for benchmark construction across various domains. 大量基准测试已被构建用于评估大语言模型(LLMs)的特定领域能力,凸显了有效且高效的基准构建需求。现有的领域特定基准主要侧重于规模定律,依赖海量语料进行监督微调或生成大量问题集以求覆盖广度。然而,语料和问答(QA)集设计对领域特定 LLMs 的精确率与召回率的影响仍未被探索。本文针对这一空白展开研究,并证明在特定领域中,规模定律并不总是基准构建的最优原则。相反,我们提出了 Comp-Comp,这是一种基于全面性-紧凑性原则的迭代基准框架。其中,全面性保证领域的语义召回,而紧凑性提升精确率,从而指导语料与 QA 集的构建。为验证我们的框架,我们在一所知名大学进行了案例研究,最终创建了 XUBench,这是一个大规模且全面的封闭领域基准。 尽管在本研究中我们以学术领域作为案例,但我们的 Comp-Comp 框架被设计为可扩展至学术界之外,为各类领域的基准构建提供有价值的见解。
Subjects: Artificial Intelligence, Computation and Language, Machine Learning 主题:人工智能、计算与语言、机器学习
Publish: 2025-08-10 14:08:28 UTC 发布:2025-08-10 14:08:28 UTC
#118 PrLM: Learning Explicit Reasoning for Personalized RAG via Contrastive Reward Optimization #118 PrLM:通过对比奖励优化为个性化 RAG 学习显式推理
Authors: [Kepu Zhang](https://arxiv.org/search/?searchtype=author&query=Kepu Zhang), [Teng Shi](https://arxiv.org/search/?searchtype=author&query=Teng Shi), [Weijie Yu](https://arxiv.org/search/?searchtype=author&query=Weijie Yu), [Jun Xu](https://arxiv.org/search/?searchtype=author&query=Jun Xu) 作者:张科普、石腾、余伟杰、徐俊
Personalized retrieval-augmented generation (RAG) aims to produce user-tailored responses by incorporating retrieved user profiles alongside the input query. Existing methods primarily focus on improving retrieval and rely on large language models (LLMs) to implicitly integrate the retrieved context with the query. However, such models are often sensitive to retrieval quality and may generate responses that are misaligned with user preferences. To address this limitation, we propose PrLM, a reinforcement learning framework that trains LLMs to explicitly reason over retrieved user profiles. Guided by a contrastively trained personalization reward model, PrLM effectively learns from user responses without requiring annotated reasoning paths. Experiments on three personalized text generation datasets show that PrLM outperforms existing methods and remains robust across varying numbers of retrieved profiles and different retrievers. 个性化检索增强生成(RAG)旨在通过将检索到的用户画像与输入查询结合来生成针对用户的定制化回复。现有方法主要侧重于改进检索,并依赖大型语言模型(LLMs)将检索到的上下文与查询隐式整合。然而,此类模型通常对检索质量敏感,可能生成与用户偏好不一致的回复。为了解决这一局限,我们提出了 PrLM,一种强化学习框架,用于训练 LLMs 对检索到的用户画像进行显式推理。在对比训练的个性化奖励模型的引导下,PrLM 能有效地从用户反馈中学习,而无需标注的推理路径。在三个个性化文本生成数据集上的实验表明,PrLM 优于现有方法,并在不同数量的检索画像和不同检索器下保持稳健。
Subjects: Information Retrieval, Computation and Language 主题:信息检索,计算与语言
Publish: 2025-08-10 13:37:26 UTC 发布时间:2025-08-10 13:37:26 UTC
#119 FlexCTC: GPU-powered CTC Beam Decoding with advanced Contextual Abilities #119 FlexCTC:具备高级上下文能力的 GPU 加速 CTC 波束解码
Authors: [Lilit Grigoryan](https://arxiv.org/search/?searchtype=author&query=Lilit Grigoryan), [Vladimir Bataev](https://arxiv.org/search/?searchtype=author&query=Vladimir Bataev), [Nikolay Karpov](https://arxiv.org/search/?searchtype=author&query=Nikolay Karpov), [Andrei Andrusenko](https://arxiv.org/search/?searchtype=author&query=Andrei Andrusenko), [Vitaly Lavrukhin](https://arxiv.org/search/?searchtype=author&query=Vitaly Lavrukhin), [Boris Ginsburg](https://arxiv.org/search/?searchtype=author&query=Boris Ginsburg) 作者:Lilit Grigoryan、Vladimir Bataev、Nikolay Karpov、Andrei Andrusenko、Vitaly Lavrukhin、Boris Ginsburg
While beam search improves speech recognition quality over greedy decoding, standard implementations are slow, often sequential, and CPU-bound. To fully leverage modern hardware capabilities, we present FlexCTC, a novel open-source toolkit for fully GPU-based beam decoding, designed for Connectionist Temporal Classification (CTC) models. Developed entirely in Python and PyTorch, it offers a fast, user-friendly, and extensible alternative to traditional C++, CUDA, or WFST-based decoders. The toolkit features a high-performance, fully batched GPU implementation that eliminates CPU-GPU synchronization and minimizes kernel launch overhead via CUDA Graphs. It also supports advanced contextualization techniques, including GPU-powered N-gram language model fusion and phrase-level boosting. These features enable accurate and efficient decoding, making the toolkit suitable for both research and production use. 虽然束搜索(beam search)相比贪心解码提高了语音识别质量,但标准实现通常很慢、常为顺序执行且受限于 CPU。为充分利用现代硬件能力,我们提出了一个新颖的开源 FlexCTC 工具包,用于完全基于 GPU 的束解码,专为连接时序分类(CTC)模型设计。该工具包完全使用 Python 和 PyTorch 开发,提供了一个快速、易用且可扩展的替代方案,可替代传统的 C++、CUDA 或基于 WFST 的解码器。该工具包具有高性能、全批次的 GPU 实现,消除了 CPU-GPU 同步并通过 CUDA 图减少了内核启动开销。它还支持高级上下文化技术,包括由 GPU 驱动的 N 元语法语言模型融合和短语级增强。这些特性使其能够实现准确且高效的解码,适用于研究和生产环境。
Subjects: Audio and Speech Processing, Artificial Intelligence, Computation and Language, Machine Learning, Sound 主题:音频与语音处理、人工智能、计算与语言、机器学习、声音
Publish: 2025-08-10 12:15:57 UTC 发布日期:2025-08-10 12:15:57 协调世界时 (UTC)
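The gap between greedy and beam decoding described in the FlexCTC abstract can be seen in a minimal sketch. The code below is a sequential, CPU-only reference version of CTC prefix beam search in plain Python. It is not FlexCTC's batched GPU implementation and includes no language-model fusion or phrase boosting. The function names and the toy probabilities are illustrative assumptions.

```python
from collections import defaultdict

def ctc_greedy(probs, blank=0):
    """Pick the argmax per frame, collapse repeats, then drop blanks."""
    out, prev = [], blank
    for frame in probs:
        s = max(range(len(frame)), key=frame.__getitem__)
        if s != blank and s != prev:
            out.append(s)
        prev = s
    return tuple(out)

def ctc_prefix_beam_search(probs, beam_size=4, blank=0):
    """probs: T x V list of per-frame probabilities. Returns the best label tuple."""
    # Each prefix stores (p_blank, p_non_blank): the probability mass of
    # alignments that end in blank / end in the prefix's last symbol.
    beams = {(): (1.0, 0.0)}
    for frame in probs:
        nxt = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            for s, p in enumerate(frame):
                if s == blank:
                    nxt[prefix][0] += (p_b + p_nb) * p
                elif prefix and s == prefix[-1]:
                    # A repeat without a blank in between collapses into the same prefix;
                    # only alignments ending in blank can extend the prefix.
                    nxt[prefix][1] += p_nb * p
                    nxt[prefix + (s,)][1] += p_b * p
                else:
                    nxt[prefix + (s,)][1] += (p_b + p_nb) * p
        ranked = sorted(nxt.items(), key=lambda kv: kv[1][0] + kv[1][1], reverse=True)
        beams = {k: (v[0], v[1]) for k, v in ranked[:beam_size]}
    return max(beams.items(), key=lambda kv: kv[1][0] + kv[1][1])[0]

# Toy input: two frames, vocabulary {0: blank, 1: "a"}.
toy = [[0.6, 0.4], [0.6, 0.4]]
greedy_result = ctc_greedy(toy)              # picks blank twice
beam_result = ctc_prefix_beam_search(toy)    # sums all alignments of "a"
```

On this toy input the greedy path (blank, blank) has probability 0.36, while the three alignments that collapse to "a" sum to 0.64, so the two decoders disagree. The Python-level loops over prefixes and symbols are the sequential work that a batched GPU decoder would replace with tensor operations.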
#120 EndoAgent: A Memory-Guided Reflective Agent for Intelligent Endoscopic Vision-to-Decision Reasoning #120 EndoAgent:一种用于智能内镜视觉到决策推理的记忆引导反思型智能体
Authors: [Yi Tang](https://arxiv.org/search/?searchtype=author&query=Yi Tang), [Kaini Wang](https://arxiv.org/search/?searchtype=author&query=Kaini Wang), [Yang Chen](https://arxiv.org/search/?searchtype=author&query=Yang Chen), [Guangquan Zhou](https://arxiv.org/search/?searchtype=author&query=Guangquan Zhou) 作者:唐毅、王凯妮、陈洋、周光泉
Developing general artificial intelligence (AI) systems to support endoscopic image diagnosis is an emerging research priority. Existing methods based on large-scale pretraining often lack unified coordination across tasks and struggle to handle the multi-step processes required in complex clinical workflows. While AI agents have shown promise in flexible instruction parsing and tool integration across domains, their potential in endoscopy remains underexplored. To address this gap, we propose EndoAgent, the first memory-guided agent for vision-to-decision endoscopic analysis that integrates iterative reasoning with adaptive tool selection and collaboration. Built on a dual-memory design, it enables sophisticated decision-making by ensuring logical coherence through short-term action tracking and progressively enhancing reasoning acuity through long-term experiential learning. To support diverse clinical tasks, EndoAgent integrates a suite of expert-designed tools within a unified reasoning loop. We further introduce EndoAgentBench, a benchmark of 5,709 visual question-answer pairs that assess visual understanding and language generation capabilities in realistic scenarios. Extensive experiments show that EndoAgent consistently outperforms both general and medical multimodal models, exhibiting its strong flexibility and reasoning capabilities. 开发用于支持内镜图像诊断的通用人工智能(AI)系统是一个新兴的研究重点。基于大规模预训练的现有方法往往缺乏跨任务的统一协调,并且难以处理复杂临床工作流程中所需的多步骤过程。尽管 AI 代理在灵活解析指令和跨领域工具整合方面显示出潜力,但其在内镜领域的应用仍未得到充分探索。为了解决这一空白,我们提出了 EndoAgent,这是第一个用于视觉到决策的内镜分析的记忆引导代理,它将迭代推理与自适应工具选择和协作相结合。基于双重记忆设计,该方法通过短期动作跟踪确保逻辑连贯性,并通过长期经验学习逐步增强推理敏锐度,从而实现复杂的决策制定。为了支持多样化的临床任务,EndoAgent 在统一的推理循环中集成了一套专家设计的工具。我们进一步引入了 EndoAgentBench,一个包含 5,709 个视觉问答对的基准,用于评估真实场景中的视觉理解和语言生成能力。 大量实验表明,EndoAgent 始终优于通用及医学多模态模型,展现出其强大的灵活性和推理能力。
Subjects: Artificial Intelligence, Computation and Language 主题:人工智能,计算与语言
Publish: 2025-08-10 11:02:57 UTC 发布:2025-08-10 11:02:57 协调世界时 (UTC)
#121 Towards Real-World Rumor Detection: Anomaly Detection Framework with Graph Supervised Contrastive Learning #121 面向真实世界的谣言检测:结合图监督对比学习的异常检测框架
Authors: [Chaoqun Cui](https://arxiv.org/search/?searchtype=author&query=Chaoqun Cui), [Caiyan Jia](https://arxiv.org/search/?searchtype=author&query=Caiyan Jia) 作者:崔超群、贾彩燕
Current rumor detection methods based on propagation structure learning predominately treat rumor detection as a class-balanced classification task on limited labeled data. However, real-world social media data exhibits an imbalanced distribution with a minority of rumors among massive regular posts. To address the data scarcity and imbalance issues, we construct two large-scale conversation datasets from Weibo and Twitter and analyze the domain distributions. We find obvious differences between rumor and non-rumor distributions, with non-rumors mostly in entertainment domains while rumors concentrate in news, indicating the conformity of rumor detection to an anomaly detection paradigm. Correspondingly, we propose the Anomaly Detection framework with Graph Supervised Contrastive Learning (AD-GSCL). It heuristically treats unlabeled data as non-rumors and adapts graph contrastive learning for rumor detection. Extensive experiments demonstrate AD-GSCL’s superiority under class-balanced, imbalanced, and few-shot conditions. Our findings provide valuable insights for real-world rumor detection featuring imbalanced data distributions. 当前基于传播结构学习的谣言检测方法主要将谣言检测视为在有限标注数据上的类别平衡分类任务。然而,真实世界的社交媒体数据呈现不平衡分布,在海量普通帖文中谣言仅占少数。为了解决数据稀缺和不平衡问题,我们从微博和推特构建了两个大规模会话数据集并分析了领域分布。我们发现谣言与非谣言的分布存在明显差异,非谣言多集中在娱乐领域,而谣言则聚集在新闻领域,这表明谣言检测符合异常检测范式。针对这一点,我们提出了带图监督对比学习的异常检测框架(AD-GSCL)。该方法启发式地将未标注数据视为非谣言,并将图对比学习调整用于谣言检测。大量实验表明,AD-GSCL 在类别平衡、不平衡和少样本条件下均具有优越性。我们的发现为具有不平衡数据分布的真实世界谣言检测提供了有价值的见解。
Subjects: Social and Information Networks, Computation and Language 主题:社交与信息网络、计算与语言
Publish: 2025-08-10 06:59:33 UTC 发布:2025-08-10 06:59:33 UTC
#122 Propagation Tree Is Not Deep: Adaptive Graph Contrastive Learning Approach for Rumor Detection #122 传播树并不深:用于谣言检测的自适应图对比学习方法
Authors: [Chaoqun Cui](https://arxiv.org/search/?searchtype=author&query=Chaoqun Cui), [Caiyan Jia](https://arxiv.org/search/?searchtype=author&query=Caiyan Jia) 作者:崔超群、贾彩燕
Rumor detection on social media has become increasingly important. Most existing graph-based models presume rumor propagation trees (RPTs) have deep structures and learn sequential stance features along branches. However, through statistical analysis on real-world datasets, we find RPTs exhibit wide structures, with most nodes being shallow 1-level replies. To focus learning on intensive substructures, we propose Rumor Adaptive Graph Contrastive Learning (RAGCL) method with adaptive view augmentation guided by node centralities. We summarize three principles for RPT augmentation: 1) exempt root nodes, 2) retain deep reply nodes, 3) preserve lower-level nodes in deep sections. We employ node dropping, attribute masking and edge dropping with probabilities from centrality-based importance scores to generate views. A graph contrastive objective then learns robust rumor representations. Extensive experiments on four benchmark datasets demonstrate RAGCL outperforms state-of-the-art methods. Our work reveals the wide-structure nature of RPTs and contributes an effective graph contrastive learning approach tailored for rumor detection through principled adaptive augmentation. The proposed principles and augmentation techniques can potentially benefit other applications involving tree-structured graphs. 社交媒体上的谣言检测变得日益重要。大多数现有的基于图的方法假设谣言传播树(RPT)具有深层结构,并沿分支学习序列立场特征。然而,通过对真实世界数据集的统计分析,我们发现 RPT 表现为宽层结构,大多数节点是浅层的一级回复。为将学习聚焦于密集子结构,我们提出了基于节点中心性的自适应视图增强的谣言自适应图对比学习(RAGCL)方法。我们总结了用于 RPT 增强的三条原则:1)豁免根节点,2)保留深层回复节点,3)在深层部分保留较低层节点。我们使用基于中心性重要性分数得到的概率进行节点丢弃、属性掩码和边丢弃以生成视图。然后通过图对比目标学习鲁棒的谣言表示。在四个基准数据集上的大量实验证明 RAGCL 优于最先进的方法。我们的工作揭示了 RPT 的宽结构特性,并通过有原则的自适应增强为谣言检测贡献了一种有效的图对比学习方法。 所提出的原则和增强技术有可能惠及其他涉及树状结构图的应用。
Subjects: Social and Information Networks, Artificial Intelligence, Computation and Language 主题:社会与信息网络、人工智能、计算与语言
Publish: 2025-08-10 06:53:30 UTC 发表:2025-08-10 06:53:30 UTC
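The augmentation principles in the RAGCL abstract (exempt the root, retain deep replies, preserve nodes that lead into deep sections) can be illustrated with a small node-dropping sketch. This is not the authors' implementation. The parent-list tree encoding, the function names, and the caller-supplied `importance` scores (a stand-in for the centrality-based scores the abstract mentions) are all assumptions made for illustration.

```python
import random

def node_depths(parent):
    """parent[i] is the index of node i's parent; the root has parent -1.
    Assumes every parent appears before its children."""
    depth = [0] * len(parent)
    for i, p in enumerate(parent):
        depth[i] = 0 if p < 0 else depth[p] + 1
    return depth

def drop_shallow_leaves(parent, importance=None, max_drop_prob=0.3, rng=None):
    """Return the indices kept in one augmented view.

    Only childless 1-level replies may be dropped, so the root, deep replies,
    and every node on a path to a deep reply always survive. `importance`
    holds hypothetical scores in [0, 1]; higher means less likely to drop.
    """
    rng = rng or random.Random()
    n = len(parent)
    importance = importance or [0.0] * n
    depth = node_depths(parent)
    has_child = [False] * n
    for p in parent:
        if p >= 0:
            has_child[p] = True
    kept = []
    for i in range(n):
        droppable = depth[i] == 1 and not has_child[i]
        if droppable and rng.random() < max_drop_prob * (1.0 - importance[i]):
            continue
        kept.append(i)
    return kept

# Toy propagation tree: root 0, replies 1-3, and a deeper chain 1 -> 4 -> 5.
tree = [-1, 0, 0, 0, 1, 4]
view_a = drop_shallow_leaves(tree, rng=random.Random(0))
view_b = drop_shallow_leaves(tree, rng=random.Random(1))
```

Two such views of the same tree would form a positive pair for a graph contrastive objective. Attribute masking and edge dropping, which the abstract also lists, are omitted here.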
#123 SQL-Exchange: Transforming SQL Queries Across Domains #123 SQL-Exchange:跨域转换 SQL 查询
Authors: [Mohammadreza Daviran](https://arxiv.org/search/?searchtype=author&query=Mohammadreza Daviran), [Brian Lin](https://arxiv.org/search/?searchtype=author&query=Brian Lin), [Davood Rafiei](https://arxiv.org/search/?searchtype=author&query=Davood Rafiei) 作者:Mohammadreza Daviran、Brian Lin、Davood Rafiei
We introduce SQL-Exchange, a framework for mapping SQL queries across different database schemas by preserving the source query structure while adapting domain-specific elements to align with the target schema. We investigate the conditions under which such mappings are feasible and beneficial, and examine their impact on enhancing the in-context learning performance of text-to-SQL systems as a downstream task. Our comprehensive evaluation across multiple model families and benchmark datasets–assessing structural alignment with source queries, execution validity on target databases, and semantic correctness–demonstrates that SQL-Exchange is effective across a wide range of schemas and query types. Our results further show that using mapped queries as in-context examples consistently improves text-to-SQL performance over using queries from the source schema. 我们提出了 SQL-Exchange 这一框架,用于在不同数据库模式之间映射 SQL 查询:在保留源查询结构的同时,将特定领域的元素调整以与目标模式对齐。我们研究了此类映射在何种条件下是可行且有益的,并考察了它们作为下游任务对增强文本到 SQL 系统的上下文学习性能的影响。我们在多个模型家族和基准数据集上进行了全面评估——评估内容包括与源查询的结构对齐、在目标数据库上的执行有效性以及语义正确性——结果表明 SQL-Exchange 在广泛的模式和查询类型中均有效。我们的结果还显示,将映射后的查询作为上下文示例相比使用源模式的查询,能够持续提升文本到 SQL 的性能。
Subjects: Databases, Artificial Intelligence, Computation and Language 主题:数据库,人工智能,计算与语言
Publish: 2025-08-09 19:55:54 UTC 发布:2025-08-09 19:55:54 UTC
#124 ReasonRank: Empowering Passage Ranking with Strong Reasoning Ability #124 ReasonRank:以强推理能力增强段落排序
Authors: [Wenhan Liu](https://arxiv.org/search/?searchtype=author&query=Wenhan Liu), [Xinyu Ma](https://arxiv.org/search/?searchtype=author&query=Xinyu Ma), [Weiwei Sun](https://arxiv.org/search/?searchtype=author&query=Weiwei Sun), [Yutao Zhu](https://arxiv.org/search/?searchtype=author&query=Yutao Zhu), [Yuchen Li](https://arxiv.org/search/?searchtype=author&query=Yuchen Li), [Dawei Yin](https://arxiv.org/search/?searchtype=author&query=Dawei Yin), [Zhicheng Dou](https://arxiv.org/search/?searchtype=author&query=Zhicheng Dou) 作者:刘文涵、马欣宇、孙伟伟、祝雨涛、李宇辰、尹大伟、窦志成
Large Language Model (LLM) based listwise ranking has shown superior performance in many passage ranking tasks. With the development of Large Reasoning Models, many studies have demonstrated that step-by-step reasoning at test time helps improve listwise ranking performance. However, due to the scarcity of reasoning-intensive training data, existing rerankers perform poorly in many complex ranking scenarios and the ranking ability of reasoning-intensive rerankers remains largely underdeveloped. In this paper, we first propose an automated reasoning-intensive training data synthesis framework, which sources training queries and passages from diverse domains and applies DeepSeek-R1 to generate high-quality training labels. A self-consistency data filtering mechanism is designed to ensure the data quality. To empower the listwise reranker with strong reasoning ability, we further propose a two-stage post-training approach, which includes a cold-start supervised fine-tuning (SFT) stage for reasoning pattern learning and a reinforcement learning (RL) stage for further ranking ability enhancement. During the RL stage, based on the nature of listwise ranking, we design a multi-view ranking reward, which is more effective than a ranking metric-based reward. Extensive experiments demonstrate that our trained reasoning-intensive reranker ReasonRank outperforms existing baselines significantly and also achieves much lower latency than the pointwise reranker Rank1. In further experiments, ReasonRank achieved state-of-the-art (SOTA) performance of 40.6 on the BRIGHT leaderboard (https://brightbenchmark.github.io/). Our code is available at https://github.com/8421BCD/ReasonRank.
基于大型语言模型(LLM)的列表式排序在许多段落排序任务中表现出色。随着大型推理模型的发展,许多研究表明测试时的逐步推理有助于提升列表式排序性能。然而,由于需要大量推理能力的训练数据稀缺,现有的重排序器在许多复杂排序场景中表现不佳,具有推理密集型能力的重排序器的排序能力在很大程度上仍未得到充分发展。本文首先提出了一个自动化的推理密集型训练数据合成框架,该框架从不同领域获取训练查询和段落,并采用 DeepSeek-R1 生成高质量的训练标签。我们设计了一种自一致性数据过滤机制以保证数据质量。为了赋予列表式重排序器强大的推理能力,我们进一步提出了一个两阶段的后训练方法,其中包括用于学习推理模式的冷启动监督微调(SFT)阶段以及用于进一步提升排序能力的强化学习(RL)阶段。 在强化学习阶段,基于列表式排序的特性,我们设计了一个多视角排序奖励,这比基于排序指标的奖励更为有效。大量实验证明,我们训练出的侧重推理的重排序器 ReasonRank 显著优于现有基线,并且相比逐点重排序器 Rank1 也实现了更低的延迟。通过进一步实验,我们的 ReasonRank 在 BRIGHT 排行榜(https://brightbenchmark.github.io/)上达到了 40.6 的最新(SOTA)性能。我们的代码可在 https://github.com/8421BCD/ReasonRank 获取。
Subjects: Information Retrieval, Artificial Intelligence, Computation and Language, Machine Learning 主题:信息检索、人工智能、计算与语言、机器学习
Publish: 2025-08-09 17:26:18 UTC 发布:2025-08-09 17:26:18 协调世界时 (UTC)
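The ReasonRank abstract contrasts its multi-view reward with a "ranking metric-based reward" but does not name the metric. As an illustration of what such a baseline reward could look like, the sketch below computes NDCG@k for one predicted ranking. The choice of NDCG, the function names, and the passage ids are assumptions, and the multi-view reward itself is not reproduced because the abstract does not define it.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with the common (2^g - 1) / log2(rank + 1) form."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked_ids, relevance, k=10):
    """ranked_ids: passage ids in predicted order.
    relevance: dict mapping passage id -> graded relevance (missing ids count as 0)."""
    gains = [relevance.get(pid, 0) for pid in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical labels for three candidate passages.
labels = {"p1": 1, "p2": 0, "p3": 1}
reward_good = ndcg_at_k(["p1", "p3", "p2"], labels)  # both relevant passages first
reward_bad = ndcg_at_k(["p2", "p1", "p3"], labels)   # irrelevant passage ranked first
```

A reward of this kind scores only the final ordering, which is the single view that the abstract reports as less effective than its multi-view alternative.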
#125 MultiMedEdit: A Scenario-Aware Benchmark for Evaluating Knowledge Editing in Medical VQA #125 MultiMedEdit:用于评估医学视觉问答中知识编辑的情景感知基准
Authors: [Shengtao Wen](https://arxiv.org/search/?searchtype=author&query=Shengtao Wen), [Haodong Chen](https://arxiv.org/search/?searchtype=author&query=Haodong Chen), [Yadong Wang](https://arxiv.org/search/?searchtype=author&query=Yadong Wang), [Zhongying Pan](https://arxiv.org/search/?searchtype=author&query=Zhongying Pan), [Xiang Chen](https://arxiv.org/search/?searchtype=author&query=Xiang Chen), [Yu Tian](https://arxiv.org/search/?searchtype=author&query=Yu Tian), [Bo Qian](https://arxiv.org/search/?searchtype=author&query=Bo Qian), [Dong Liang](https://arxiv.org/search/?searchtype=author&query=Dong Liang), [Sheng-Jun Huang](https://arxiv.org/search/?searchtype=author&query=Sheng-Jun Huang) 作者:温胜涛、陈昊东、王亚东、潘仲颖、陈翔、田宇、钱博、梁东、黄胜军
Knowledge editing (KE) provides a scalable approach for updating factual knowledge in large language models without full retraining. While previous studies have demonstrated effectiveness in general domains and medical QA tasks, little attention has been paid to KE in multimodal medical scenarios. Unlike text-only settings, medical KE demands integrating updated knowledge with visual reasoning to support safe and interpretable clinical decisions. To address this gap, we propose MultiMedEdit, the first benchmark tailored to evaluating KE in clinical multimodal tasks. Our framework spans both understanding and reasoning task types, defines a three-dimensional metric suite (reliability, generality, and locality), and supports cross-paradigm comparisons across general and domain-specific models. We conduct extensive experiments under single-editing and lifelong-editing settings. Results suggest that current methods struggle with generalization and long-tail reasoning, particularly in complex clinical workflows. We further present an efficiency analysis (e.g., edit latency, memory footprint), revealing practical trade-offs in real-world deployment across KE paradigms. Overall, MultiMedEdit not only reveals the limitations of current approaches but also provides a solid foundation for developing clinically robust knowledge editing techniques in the future. 知识编辑(KE)为在大型语言模型中更新事实知识提供了一种可扩展的方法,无需完全重新训练。尽管先前的研究已证明其在通用领域和医疗问答任务中的有效性,但在多模态医疗场景下的 KE 却鲜有关注。与仅文本设置不同,医疗知识编辑要求将更新后的知识与视觉推理结合,以支持安全且可解释的临床决策。为填补这一空白,我们提出了 MultiMedEdit,这是首个专门用于评估临床多模态任务中知识编辑的基准。我们的框架涵盖理解和推理两类任务,定义了一个三维度的评估指标体系(可靠性、通用性和局部性),并支持在通用与特定领域模型之间进行跨范式比较。我们在单次编辑和终身编辑设置下进行了大量实验。结果表明,当前方法在泛化和长尾推理方面存在困难,尤其是在复杂的临床工作流程中。我们还给出了效率分析(例如,编辑延迟、内存占用),揭示了不同知识编辑范式在现实部署中的实际权衡。 总体而言,MultiMedEdit 不仅揭示了当前方法的局限性,还为未来开发临床上稳健的知识编辑技术提供了坚实的基础。
Subjects: Artificial Intelligence, Computation and Language, Machine Learning, Multimedia 主题:人工智能,计算与语言,机器学习,多媒体
Publish: 2025-08-09 15:36:08 UTC 发布:2025-08-09 15:36:08 协调世界时
#126 TurboBias: Universal ASR Context-Biasing powered by GPU-accelerated Phrase-Boosting Tree #126 TurboBias:由 GPU 加速短语增强树驱动的通用 ASR 上下文偏置
Authors: [Andrei Andrusenko](https://arxiv.org/search/?searchtype=author&query=Andrei Andrusenko), [Vladimir Bataev](https://arxiv.org/search/?searchtype=author&query=Vladimir Bataev), [Lilit Grigoryan](https://arxiv.org/search/?searchtype=author&query=Lilit Grigoryan), [Vitaly Lavrukhin](https://arxiv.org/search/?searchtype=author&query=Vitaly Lavrukhin), [Boris Ginsburg](https://arxiv.org/search/?searchtype=author&query=Boris Ginsburg) 作者:Andrei Andrusenko、Vladimir Bataev、Lilit Grigoryan、Vitaly Lavrukhin、Boris Ginsburg
Recognizing specific key phrases is an essential task for contextualized Automatic Speech Recognition (ASR). However, most existing context-biasing approaches require additional model training, significantly slow down decoding, or constrain the choice of ASR system type. This paper proposes a universal ASR context-biasing framework that supports all major model types: CTC, Transducers, and Attention Encoder-Decoder models. The framework is based on a GPU-accelerated word boosting tree, which allows it to be used in shallow fusion mode for greedy and beam search decoding without noticeable speed degradation, even with a vast number of key phrases (up to 20K items). The results show that the proposed method is highly efficient, surpassing the considered open-source context-biasing approaches in both accuracy and decoding speed. Our context-biasing framework is open-sourced as part of the NeMo toolkit. 识别特定关键短语是上下文化自动语音识别(ASR)的一项重要任务。然而,大多数现有的上下文偏置方法存在需要额外模型训练、显著降低解码速度或限制 ASR 系统类型选择的局限性。本文提出了一个通用的 ASR 上下文偏置框架,支持所有主要类型:CTC、Transducers 和注意力编码器-解码器模型。该框架基于 GPU 加速的单词增强树,使其能够以浅融合模式用于贪婪和束搜索解码,即使在大量关键短语(多达 20K 项)情况下也不会显著降低速度。获得的结果表明了所提方法的高效性,在准确性和解码速度上均优于所考虑的开源上下文偏置方法。我们的上下文偏置框架作为 NeMo 工具包的一部分已开源。
Subjects: Audio and Speech Processing, Artificial Intelligence, Computation and Language, Sound 主题:音频与语音处理、人工智能、计算与语言、声音
Publish: 2025-08-09 15:27:07 UTC 发布时间:2025-08-09 15:27:07 UTC
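The boosting tree in the TurboBias abstract can be pictured as a trie over key-phrase tokens that adds a bonus to a hypothesis score whenever the next token continues a phrase (shallow fusion). The sketch below is a plain-Python, CPU-only illustration of that idea and is not the NeMo implementation. The class name, the `boost` value, and the rule of revoking the bonus for an unfinished match are assumptions. For brevity the revocation ignores shorter phrases nested inside longer ones.

```python
class BoostingTrie:
    """Trie over key phrases; each matched token earns `boost` (hypothetical value)."""

    def __init__(self, phrases, boost=1.0):
        self.boost = boost
        self.children = [{}]    # node id -> {token: child node id}; node 0 is the root
        self.depth = [0]
        self.is_end = [False]
        for phrase in phrases:
            node = 0
            for tok in phrase:
                nxt = self.children[node].get(tok)
                if nxt is None:
                    nxt = len(self.children)
                    self.children[node][tok] = nxt
                    self.children.append({})
                    self.depth.append(self.depth[node] + 1)
                    self.is_end.append(False)
                node = nxt
            self.is_end[node] = True

    def step(self, node, tok):
        """Return (next_node, score_delta) when a hypothesis at `node` emits `tok`."""
        nxt = self.children[node].get(tok)
        if nxt is not None:
            return nxt, self.boost
        # Mismatch: take back the bonus paid for a partial, unfinished phrase.
        unfinished = node != 0 and not self.is_end[node]
        penalty = -self.boost * self.depth[node] if unfinished else 0.0
        restart = self.children[0].get(tok) if node != 0 else None
        if restart is not None:          # the token may start a new phrase
            return restart, penalty + self.boost
        return 0, penalty

def boost_score(trie, tokens):
    """Total bias added to one complete hypothesis."""
    node, total = 0, 0.0
    for tok in tokens:
        node, delta = trie.step(node, tok)
        total += delta
    if node != 0 and not trie.is_end[node]:
        total -= trie.boost * trie.depth[node]   # hypothesis ended mid-phrase
    return total

trie = BoostingTrie([("new", "york"), ("nvidia",)], boost=2.0)
full_match = boost_score(trie, ["visit", "new", "york", "today"])
partial_match = boost_score(trie, ["a", "new", "car"])
```

In a decoder, `step` would be called once per candidate token per hypothesis and its delta added to the acoustic score. A GPU version would store the trie as dense tensors so that all hypotheses in a batch advance in one indexed lookup.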
#127 DatasetResearch: Benchmarking Agent Systems for Demand-Driven Dataset Discovery #127 DatasetResearch:面向需求驱动数据集发现的代理系统基准测试
Authors: [Keyu Li](https://arxiv.org/search/?searchtype=author&query=Keyu Li), [Mohan Jiang](https://arxiv.org/search/?searchtype=author&query=Mohan Jiang), [Dayuan Fu](https://arxiv.org/search/?searchtype=author&query=Dayuan Fu), [Yunze Wu](https://arxiv.org/search/?searchtype=author&query=Yunze Wu), [Xiangkun Hu](https://arxiv.org/search/?searchtype=author&query=Xiangkun Hu), [Dequan Wang](https://arxiv.org/search/?searchtype=author&query=Dequan Wang), [Pengfei Liu](https://arxiv.org/search/?searchtype=author&query=Pengfei Liu) 作者:李可宇、蒋墨涵、付大源、吴云泽、胡祥坤、王德全、刘鹏飞
The rapid advancement of large language models has fundamentally shifted the bottleneck in AI development from computational power to data availability, with countless valuable datasets remaining hidden across specialized repositories, research appendices, and domain platforms. As reasoning capabilities and deep research methodologies continue to evolve, a critical question emerges: can AI agents transcend conventional search to systematically discover any dataset that meets specific user requirements, enabling truly autonomous demand-driven data curation? We introduce DatasetResearch, the first comprehensive benchmark evaluating AI agents’ ability to discover and synthesize datasets from 208 real-world demands across knowledge-intensive and reasoning-intensive tasks. Our tri-dimensional evaluation framework reveals a stark reality: even advanced deep research systems score only 22% on our challenging DatasetResearch-pro subset, exposing the vast gap between current capabilities and perfect dataset discovery. Our analysis uncovers a fundamental dichotomy: search agents excel at knowledge tasks through retrieval breadth, while synthesis agents dominate reasoning challenges via structured generation, yet both fail catastrophically on “corner cases” outside existing distributions. These findings establish the first rigorous baseline for dataset discovery agents and illuminate the path toward AI systems capable of finding any dataset in the digital universe. Our benchmark and comprehensive analysis provide the foundation for the next generation of self-improving AI systems and are publicly available at https://github.com/GAIR-NLP/DatasetResearch.
大型语言模型的快速发展已将人工智能开发的瓶颈从计算能力根本性地转移到数据可用性——无数有价值的数据集仍隐藏在专业存储库、研究附录和领域平台中。随着推理能力和深入研究方法的不断演进,一个关键问题浮出水面:AI 代理能否超越传统搜索,系统性地发现任何满足特定用户需求的数据集,从而实现真正自主的按需数据策划?我们提出了 DatasetResearch,这是首个综合基准,用于评估 AI 代理从 208 个真实需求中发现并综合数据集的能力,这些需求涵盖知识密集和推理密集型任务。我们的三维评估框架揭示了一个严峻现实:即使是先进的深度研究系统在我们具有挑战性的 DatasetResearch-pro 子集上也仅获得 22% 的得分,暴露出现有能力与完美数据集发现之间的巨大差距。 我们的分析揭示了一个根本性的二分法——搜索代理通过检索广度在知识任务上表现出色,而综合代理则通过结构化生成在推理挑战中占据主导地位——但两者在现有分布之外的“边缘情况”上均会严重失败。这些发现为数据集发现代理建立了第一个严格的基线,并阐明了通向能够在数字世界中找到任意数据集的 AI 系统的路径。我们的基准和全面分析为下一代自我改进型 AI 系统提供了基础,并已在 https://github.com/GAIR-NLP/DatasetResearch 上公开可用。
Subjects: Artificial Intelligence, Computation and Language 主题:人工智能,计算与语言
Publish: 2025-08-09 12:15:08 UTC 发布:2025-08-09 12:15:08 UTC
#128 AMFT: Aligning LLM Reasoners by Meta-Learning the Optimal Imitation-Exploration Balance #128 AMFT:通过元学习最优模仿-探索平衡来对齐 LLM 推理器
Authors: [Lixuan He](https://arxiv.org/search/?searchtype=author&query=Lixuan He), [Jie Feng](https://arxiv.org/search/?searchtype=author&query=Jie Feng), [Yong Li](https://arxiv.org/search/?searchtype=author&query=Yong Li) 作者:何立轩,冯捷,李勇
Large Language Models (LLMs) are typically fine-tuned for reasoning tasks through a two-stage pipeline of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL), a process fraught with catastrophic forgetting and suboptimal trade-offs between imitation and exploration. Recent single-stage methods attempt to unify SFT and RL using heuristics, but lack a principled mechanism for dynamically balancing the two paradigms. In this paper, we reframe this challenge through the theoretical lens of implicit rewards, viewing SFT and RL not as distinct methods but as complementary reward signals. We introduce Adaptive Meta Fine-Tuning (AMFT), a novel single-stage algorithm that learns the optimal balance between SFT’s implicit, path-level reward and RL’s explicit, outcome-based reward. The core of AMFT is a meta-gradient adaptive weight controller that treats the SFT-RL balance as a learnable parameter, dynamically optimizing it to maximize long-term task performance. This forward-looking approach, regularized by policy entropy for stability, autonomously discovers an effective training curriculum. We conduct a comprehensive evaluation on challenging benchmarks spanning mathematical reasoning, abstract visual reasoning (General Points), and vision-language navigation (V-IRL). AMFT consistently establishes a new state of the art and demonstrates superior generalization on out-of-distribution (OOD) tasks. Ablation studies and training dynamics analysis confirm that the meta-learning controller is crucial for AMFT’s stability, sample efficiency, and performance, offering a more principled and effective paradigm for LLM alignment. Our code is open-sourced at https://github.com/hlxtsyj/AMFT.
大型语言模型(LLMs)通常通过监督微调(SFT)后接强化学习(RL)的两阶段流程来针对推理任务进行微调,这一过程容易出现灾难性遗忘并在模仿与探索之间产生次优权衡。近期的一些单阶段方法尝试用启发式方法将 SFT 与 RL 统一,但缺乏一个可以动态平衡这两种范式的原则性机制。在本文中,我们通过“隐式奖励”(implicit rewards)的理论视角重新构建这一挑战,将 SFT 和 RL 视为互补的奖励信号而非截然不同的方法。我们提出了自适应元微调(Adaptive Meta Fine-Tuning,AMFT),这是一种新颖的单阶段算法,用于学习在 SFT 的隐式、路径级奖励与 RL 的显式、基于结果的奖励之间的最优平衡。AMFT 的核心是一个元梯度自适应权重控制器,它将 SFT-RL 的平衡视为可学习参数,动态优化该参数以最大化长期任务性能。该前瞻性方法通过策略熵进行正则化以提高稳定性,能够自主发现有效的训练课程。 我们在涵盖数学推理、抽象视觉推理(General Points)和视觉-语言导航(V-IRL)的具有挑战性基准上进行了全面评估。AMFT 始终创下新的最先进水平,并在分布外(OOD)任务上展示出更强的泛化能力。消融研究和训练动态分析证实,元学习控制器对于 AMFT 的稳定性、样本效率和性能至关重要,为 LLM 对齐提供了一种更有原则性和更有效的范式。我们的代码已开源于 https://github.com/hlxtsyj/AMFT。
Subjects: Machine Learning, Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition 主题:机器学习,人工智能,计算与语言,计算机视觉与模式识别
Publish: 2025-08-09 11:40:54 UTC 发布:2025-08-09 11:40:54 UTC
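The AMFT abstract describes the SFT-RL balance as a learnable parameter tuned by meta-gradients but gives no formula. One generic way to write such a scheme is sketched below. The weighted-sum form, the symbols $\mu$, $\eta$, $\eta_\mu$, and the held-out objective $\mathcal{L}_{\text{val}}$ are all assumptions for illustration and are not taken from the paper. The entropy regularization mentioned in the abstract is omitted.

```latex
% Assumed single-stage objective: a convex mix of the two losses.
\mathcal{L}_{\text{total}}(\theta;\mu)
  = \mu\,\mathcal{L}_{\text{SFT}}(\theta)
  + (1-\mu)\,\mathcal{L}_{\text{RL}}(\theta),
  \qquad \mu \in [0,1]

% One-step lookahead of the policy parameters under the current weight.
\theta'(\mu) = \theta - \eta\,\nabla_{\theta}\,\mathcal{L}_{\text{total}}(\theta;\mu)

% Meta-gradient update: move the weight in the direction that lowers
% a held-out task loss after the lookahead step.
\mu \leftarrow \operatorname{clip}\!\Big(
  \mu - \eta_{\mu}\,\nabla_{\mu}\,\mathcal{L}_{\text{val}}\big(\theta'(\mu)\big),\; 0,\; 1\Big)
```

Under this reading, a large $\mu$ early in training favors imitation, and the controller lowers it once exploration starts to pay off on the held-out objective.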
#129 Maestro-EVC: Controllable Emotional Voice Conversion Guided by References and Explicit Prosody #129 Maestro-EVC:由参考和显式韵律引导的可控情感语音转换
Authors: [Jinsung Yoon](https://arxiv.org/search/?searchtype=author&query=Jinsung Yoon), [Wooyeol Jeong](https://arxiv.org/search/?searchtype=author&query=Wooyeol Jeong), [Jio Gim](https://arxiv.org/search/?searchtype=author&query=Jio Gim), [Young-Joo Suh](https://arxiv.org/search/?searchtype=author&query=Young-Joo Suh) 作者:Jinsung Yoon、Wooyeol Jeong、Jio Gim、Young-Joo Suh
Emotional voice conversion (EVC) aims to modify the emotional style of speech while preserving its linguistic content. In practical EVC, controllability, the ability to independently control speaker identity and emotional style using distinct references, is crucial. However, existing methods often struggle to fully disentangle these attributes and lack the ability to model fine-grained emotional expressions such as temporal dynamics. We propose Maestro-EVC, a controllable EVC framework that enables independent control of content, speaker identity, and emotion by effectively disentangling each attribute from separate references. We further introduce a temporal emotion representation and an explicit prosody modeling with prosody augmentation to robustly capture and transfer the temporal dynamics of the target emotion, even under prosody-mismatched conditions. Experimental results confirm that Maestro-EVC achieves high-quality, controllable, and emotionally expressive speech synthesis. 情感语音转换(EVC)旨在在保留语言内容的同时修改语音的情感风格。在实际的 EVC 中,可控性——使用不同参考独立控制说话人身份和情感风格的能力——至关重要。然而,现有方法常常难以完全解耦这些属性,也缺乏对诸如时间动态等细粒度情感表达的建模能力。我们提出了 Maestro-EVC,一种可控的 EVC 框架,通过从不同参考中有效解耦每个属性,实现对内容、说话人身份和情感的独立控制。我们进一步引入了时间情感表示,并采用明确的韵律建模与韵律增强,以在韵律不匹配的情况下也能稳健地捕捉并迁移目标情感的时间动态。实验结果证实,Maestro-EVC 实现了高质量、可控且情感表现力强的语音合成。
Subjects: Sound, Artificial Intelligence, Computation and Language 主题:声音、人工智能、计算与语言
Publish: 2025-08-09 08:46:32 UTC 发布:2025-08-09 08:46:32 UTC
#130 Story Ribbons: Reimagining Storyline Visualizations with Large Language Models #130 故事丝带:用大型语言模型重新构想故事线可视化
Authors: [Catherine Yeh](https://arxiv.org/search/?searchtype=author&query=Catherine Yeh), [Tara Menon](https://arxiv.org/search/?searchtype=author&query=Tara Menon), [Robin Singh Arya](https://arxiv.org/search/?searchtype=author&query=Robin Singh Arya), [Helen He](https://arxiv.org/search/?searchtype=author&query=Helen He), [Moira Weigel](https://arxiv.org/search/?searchtype=author&query=Moira Weigel), [Fernanda Viégas](https://arxiv.org/search/?searchtype=author&query=Fernanda Viégas), [Martin Wattenberg](https://arxiv.org/search/?searchtype=author&query=Martin Wattenberg) 作者:Catherine Yeh、Tara Menon、Robin Singh Arya、Helen He、Moira Weigel、Fernanda Viégas、Martin Wattenberg
Analyzing literature involves tracking interactions between characters, locations, and themes. Visualization has the potential to facilitate the mapping and analysis of these complex relationships, but capturing structured information from unstructured story data remains a challenge. As large language models (LLMs) continue to advance, we see an opportunity to use their text processing and analysis capabilities to augment and reimagine existing storyline visualization techniques. Toward this goal, we introduce an LLM-driven data parsing pipeline that automatically extracts relevant narrative information from novels and scripts. We then apply this pipeline to create Story Ribbons, an interactive visualization system that helps novice and expert literary analysts explore detailed character and theme trajectories at multiple narrative levels. Through pipeline evaluations and user studies with Story Ribbons on 36 literary works, we demonstrate the potential of LLMs to streamline narrative visualization creation and reveal new insights about familiar stories. We also describe current limitations of AI-based systems, and interaction motifs designed to address these issues. 分析文学作品涉及追踪人物、地点和主题之间的相互作用。可视化有助于映射和分析这些复杂关系,但从非结构化的故事数据中提取结构化信息仍然是一大挑战。随着大型语言模型(LLMs)不断发展,我们看到利用其文本处理和分析能力来增强并重新构想现有故事线可视化技术的机会。为此,我们引入了一个由 LLM 驱动的数据解析管道,能够自动从小说和剧本中提取相关的叙事信息。随后,我们将该管道应用于创建“故事丝带”(Story Ribbons),这是一个交互式可视化系统,帮助初学者和专家级文学分析者在多层叙事层面上探索详细的人物和主题轨迹。通过对该管道的评估以及在 36 部文学作品上对 Story Ribbons 的用户研究,我们展示了 LLMs 在简化叙事可视化创建并揭示熟悉故事新见解方面的潜力。我们同时描述了基于人工智能系统的当前局限性,以及为应对这些问题而设计的交互模式。
Subjects: Human-Computer Interaction, Computation and Language, Machine Learning 主题:人机交互、计算与语言、机器学习
Publish: 2025-08-09 01:49:30 UTC 发布:2025-08-09 01:49:30 协调世界时
#131 Generative Artificial Intelligence Extracts Structure-Function Relationships from Plants for New Materials #131 生成式人工智能从植物中提取结构-功能关系以开发新材料
Authors: [Rachel K. Luu](https://arxiv.org/search/?searchtype=author&query=Rachel K. Luu), [Jingyu Deng](https://arxiv.org/search/?searchtype=author&query=Jingyu Deng), [Mohammed Shahrudin Ibrahim](https://arxiv.org/search/?searchtype=author&query=Mohammed Shahrudin Ibrahim), [Nam-Joon Cho](https://arxiv.org/search/?searchtype=author&query=Nam-Joon Cho), [Ming Dao](https://arxiv.org/search/?searchtype=author&query=Ming Dao), [Subra Suresh](https://arxiv.org/search/?searchtype=author&query=Subra Suresh), [Markus J. Buehler](https://arxiv.org/search/?searchtype=author&query=Markus J. Buehler) 作者:Rachel K. Luu、Jingyu Deng、Mohammed Shahrudin Ibrahim、Nam-Joon Cho、Ming Dao、Subra Suresh、Markus J. Buehler
Large language models (LLMs) have reshaped the research landscape by enabling new approaches to knowledge retrieval and creative ideation. Yet their application in discipline-specific experimental science, particularly in highly multi-disciplinary domains like materials science, remains limited. We present a first-of-its-kind framework that integrates generative AI with literature from hitherto-unconnected fields such as plant science, biomimetics, and materials engineering to extract insights and design experiments for materials. We focus on humidity-responsive systems such as pollen-based materials and Rhapis excelsa (broadleaf lady palm) leaves, which exhibit self-actuation and adaptive performance. Using a suite of AI tools, including a fine-tuned model (BioinspiredLLM), Retrieval-Augmented Generation (RAG), agentic systems, and a Hierarchical Sampling strategy, we extract structure-property relationships and translate them into new classes of bioinspired materials. Structured inference protocols generate and evaluate hundreds of hypotheses from a single query, surfacing novel and experimentally tractable ideas. We validate our approach through real-world implementation: LLM-generated procedures, materials designs, and mechanical predictions were tested in the laboratory, culminating in the fabrication of a novel pollen-based adhesive with tunable morphology and measured shear strength, establishing a foundation for future plant-derived adhesive design. This work demonstrates how AI-assisted ideation can drive real-world materials design and enable effective human-AI collaboration. 
Subjects: Machine Learning, Disordered Systems and Neural Networks, Materials Science, Other Condensed Matter, Artificial Intelligence, Computation and Language
Publish: 2025-08-08 10:41:03 UTC
1.2.2 Artificial Intelligence
From:https://papers.cool/arxiv/cs.AI
From:https://arxiv.org/list/cs.AI/recent
-
[313] arXiv:2508.06454 [pdf, html, other]
What Voting Rules Actually Do: A Data-Driven Analysis of Multi-Winner Voting
Joshua Caiata, Ben Armstrong, Kate Larson
Comments: 41 pages
Subjects: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
-
[314] arXiv:2508.06443 [pdf, html, other]
The Fair Game: Auditing & Debiasing AI Algorithms Over Time
Debabrota Basu, Udvas Das
Journal-ref: Cambridge Forum on AI: Law and Governance, Volume 1, 2025, p. e27
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET); Computer Science and Game Theory (cs.GT)
-
[315] arXiv:2508.06368 [pdf, html, other]
Automated Creation of the Legal Knowledge Graph Addressing Legislation on Violence Against Women: Resource, Methodology and Lessons Learned
Claudia d'Amato, Giuseppe Rubini, Francesco Didio, Donato Francioso, Fatima Zahra Amara, Nicola Fanizzi
Subjects: Artificial Intelligence (cs.AI)
-
[316] arXiv:2508.06352 [pdf, other]
From Explainable to Explanatory Artificial Intelligence: Toward a New Paradigm for Human-Centered Explanations through Generative AI
Christian Meske, Justin Brenne, Erdi Uenal, Sabahat Oelcer, Ayseguel Doganguen
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
-
[317] arXiv:2508.06348 [pdf, html, other]
AntiCheatPT: A Transformer-Based Approach to Cheat Detection in Competitive Computer Games
Mille Mei Zhen Loo, Gert Luzkov, Paolo Burelli
Subjects: Artificial Intelligence (cs.AI)
-
[318] arXiv:2508.06326 [pdf, html, other]
A “good regulator theorem” for embodied agents
Nathaniel Virgo, Martin Biehl, Manuel Baltieri, Matteo Capucci
Comments: Accepted at the Artificial Life conference 2025 (ALife 2025). 10 pages, 1 figure
Subjects: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
-
[319] arXiv:2508.06296 [pdf, html, other]
LLM Robustness Leaderboard v1 – Technical report
Pierre Peigné-Lefebvre, Quentin Feuillade-Montixi, Tom David, Nicolas Miailhe
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
-
[320] arXiv:2508.06263 [pdf, html, other]
Symmetry breaking for inductive logic programming
Andrew Cropper, David M. Cerna, Matti Järvisalo
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
-
[321] arXiv:2508.06230 [pdf, html, other]
Learning Logical Rules using Minimum Message Length
Ruben Sharma, Sebastijan Dumančić, Ross D. King, Andrew Cropper
Subjects: Artificial Intelligence (cs.AI)
-
[322] arXiv:2508.06226 [pdf, html, other]
GeoLaux: A Benchmark for Evaluating MLLMs’ Geometry Performance on Long-Step Problems Requiring Auxiliary Lines
Yumeng Fu, Jiayin Zhu, Lingling Zhang, Bo Zhao, Shaoxuan Ma, Yushun Zhang, Yanrui Wu, Wenjun Wu
Subjects: Artificial Intelligence (cs.AI)
-
[323] arXiv:2508.06225 [pdf, html, other]
Overconfidence in LLM-as-a-Judge: Diagnosis and Confidence-Driven Solution
Zailong Tian, Zhuoheng Han, Yanzhe Chen, Haozhe Xu, Xi Yang, Richeng Xuan, Houfeng Wang, Lizi Liao
Subjects: Artificial Intelligence (cs.AI)
-
[324] arXiv:2508.06145 [pdf, other]
Retrieval Augmented Large Language Model System for Comprehensive Drug Contraindications
Byeonghun Bang, Jongsuk Yoon, Dong-Jin Chang, Seho Park, Yong Oh Lee
Subjects: Artificial Intelligence (cs.AI)
-
[325] arXiv:2508.06129 [pdf, html, other]
Study of Robust Features in Formulating Guidance for Heuristic Algorithms for Solving the Vehicle Routing Problem
Bachtiar Herdianto, Romain Billot, Flavien Lucas, Marc Sevaux
Comments: 22 pages, 14 figures
Subjects: Artificial Intelligence (cs.AI)
-
[326] arXiv:2508.06111 [pdf, html, other]
SKATE, a Scalable Tournament Eval: Weaker LLMs differentiate between stronger ones using verifiable challenges
Dewi S. W. Gould, Bruno Mlodozeniec, Samuel F. Brown
Comments: 7 pages and appendices
Subjects: Artificial Intelligence (cs.AI)
-
[327] arXiv:2508.06110 [pdf, html, other]
PanelTR: Zero-Shot Table Reasoning Framework Through Multi-Agent Scientific Discussion
Yiran Rex Ma
Comments: Accepted at IJCNN 2025
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
-
[328] arXiv:2508.06091 [pdf, other]
Aggregate-Combine-Readout GNNs Are More Expressive Than Logic C2
Stan P Hauke, Przemysław Andrzej Wałęga
Comments: 18 pages
Subjects: Artificial Intelligence (cs.AI)
-
[329] arXiv:2508.06074 [pdf, html, other]
ME3-BEV: Mamba-Enhanced Deep Reinforcement Learning for End-to-End Autonomous Driving with BEV-Perception
Siyi Lu, Run Liu, Dongsheng Yang, Lei He
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
-
[330] arXiv:2508.06064 [pdf, html, other]
A Generic Complete Anytime Beam Search for Optimal Decision Tree
Harold Silvère Kiossou, Siegfried Nijssen, Pierre Schaus
Subjects: Artificial Intelligence (cs.AI)
-
[331] arXiv:2508.06062 [pdf, html, other]
Don’t Forget Imagination!
Evgenii E. Vityaev, Andrei Mantsivoda
Comments: 14 pages, 2 figures
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
-
[332] arXiv:2508.06060 [pdf, html, other]
LLMs for Resource Allocation: A Participatory Budgeting Approach to Inferring Preferences
Sankarshan Damle, Boi Faltings
Comments: Published in the Proceedings of the 28th European Conference on Artificial Intelligence (ECAI 2025)
Subjects: Artificial Intelligence (cs.AI)
-
[333] arXiv:2508.06042 [pdf, html, other]
Society of Mind Meets Real-Time Strategy: A Hierarchical Multi-Agent Framework for Strategic Reasoning
Daechul Ahn, San Kim, Jonghyun Choi
Comments: COLM 2025
Subjects: Artificial Intelligence (cs.AI)
-
[334] arXiv:2508.05996 [pdf, other]
Mediator-Guided Multi-Agent Collaboration among Open-Source Models for Medical Decision-Making
Kaitao Chen, Mianxin Liu, Daoming Zong, Chaoyue Ding, Shaohao Rui, Yankai Jiang, Mu Zhou, Xiaosong Wang
Comments: 14 pages, 4 figures
Subjects: Artificial Intelligence (cs.AI)
-
[335] arXiv:2508.05888 [pdf, html, other]
Planning Agents on an Ego-Trip: Leveraging Hybrid Ego-Graph Ensembles for Improved Tool Retrieval in Enterprise Task Planning
Sahil Bansal, Sai Shruthi Sistla, Aarti Arikatala, Sebastian Schreiber
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
-
[336] arXiv:2508.05855 [pdf, html, other]
Safety of Embodied Navigation: A Survey
Zixia Wang, Jia Hu, Ronghui Mu
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
-
[337] arXiv:2508.05792 [pdf, html, other]
Holistic Explainable AI (H-XAI): Extending Transparency Beyond Developers in AI-Driven Decision Making
Kausik Lakkaraju, Siva Likitha Valluru, Biplav Srivastava
Subjects: Artificial Intelligence (cs.AI)
-
[338] arXiv:2508.05776 [pdf, html, other]
Whither symbols in the era of advanced neural networks?
Thomas L. Griffiths, Brenden M. Lake, R. Thomas McCoy, Ellie Pavlick, Taylor W. Webb
Subjects: Artificial Intelligence (cs.AI)
-
[339] arXiv:2508.05766 [pdf, html, other]
A Framework for Inherently Safer AGI through Language-Mediated Active Inference
Bo Wen
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Adaptation and Self-Organizing Systems (nlin.AO)
-
[340] arXiv:2508.05731 [pdf, html, other]
InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization
Yuhang Liu, Zeyu Liu, Shuanghe Zhu, Pengxiang Li, Congkai Xie, Jiasheng Wang, Xueyu Hu, Xiaotian Han, Jianbo Yuan, Xinyao Wang, Shengyu Zhang, Hongxia Yang, Fei Wu
Comments: 11 pages, 3 figures
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
-
[341] arXiv:2508.06485 (cross-list from cs.CV) [pdf, html, other]
WGAST: Weakly-Supervised Generative Network for Daily 10 m Land Surface Temperature Estimation via Spatio-Temporal Fusion
Sofiane Bouaziz, Adel Hafiane, Raphael Canals, Rachid Nedjai
Comments: Submitted to IEEE Transactions on Geoscience and Remote Sensing (TGRS)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
-
[342] arXiv:2508.06482 (cross-list from cs.CL) [pdf, other]
Post-training for Efficient Communication via Convention Formation
Yilun Hua, Evan Wang, Yoav Artzi
Comments: Accepted to COLM 2025
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
-
[343] arXiv:2508.06477 (cross-list from physics.soc-ph) [pdf, html, other]
Intuition emerges in Maximum Caliber models at criticality
Lluís Arola-Fernández
Subjects: Physics and Society (physics.soc-ph); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
-
[344] arXiv:2508.06457 (cross-list from cs.CR) [pdf, html, other]
ScamAgents: How AI Agents Can Simulate Human-Level Scam Calls
Sanket Badhe
Comments: Accepted at CAMLIS 25: Conference on Applied Machine Learning for Information Security. 10 pages, 3 figures
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
-
[345] arXiv:2508.06453 (cross-list from cs.CV) [pdf, other]
Text Embedded Swin-UMamba for DeepLesion Segmentation
Ruida Cheng, Tejas Sudharshan Mathai, Pritam Mukherjee, Benjamin Hou, Qingqing Zhu, Zhiyong Lu, Matthew McAuliffe, Ronald M. Summers
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
-
[346] arXiv:2508.06445 (cross-list from cs.CL) [pdf, html, other]
Echoes of Automation: The Increasing Use of LLMs in Newsmaking
Abolfazl Ansari, Delvin Ce Zhang, Nafis Irtiza Tripto, Dongwon Lee
Comments: To appear in 18th International Conference on Social Computing, Behavioral-Cultural Modeling, & Prediction and Behavior Representation in Modeling and Simulation, and to be published in the Springer LNCS series
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
-
[347] arXiv:2508.06435 (cross-list from cs.CL) [pdf, html, other]
Learning the Topic, Not the Language: How LLMs Classify Online Immigration Discourse Across Languages
Andrea Nasuto, Stefano Maria Iacus, Francisco Rowe, Devika Jain
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
-
[348] arXiv:2508.06434 (cross-list from cs.CV) [pdf, html, other]
CLIPin: A Non-contrastive Plug-in to CLIP for Multimodal Semantic Alignment
Shengzhu Yang, Jiawei Du, Shuai Lu, Weihang Zhang, Ningli Wang, Huiqi Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
-
[349] arXiv:2508.06433 (cross-list from cs.CL) [pdf, html, other]
Memp: Exploring Agent Procedural Memory
Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang
Comments: Work in progress
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
-
[350] arXiv:2508.06429 (cross-list from cs.CV) [pdf, html, other]
SPARSE Data, Rich Results: Few-Shot Semi-Supervised Learning via Class-Conditioned Image Translation
Guido Manni, Clemente Lauretti, Loredana Zollo, Paolo Soda
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
-
[351] arXiv:2508.06426 (cross-list from cs.RO) [pdf, html, other]
Shortcut Learning in Generalist Robot Policies: The Role of Dataset Diversity and Fragmentation
Youguang Xing, Xu Luo, Junlin Xie, Lianli Gao, Hengtao Shen, Jingkuan Song
Comments: CoRL 2025
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
-
[352] arXiv:2508.06411 (cross-list from cs.CY) [pdf, html, other]
Dimensional Characterization and Pathway Modeling for Catastrophic AI Risks
Ze Shen Chin
Comments: 24 pages including references, 6 figures. To be presented in Technical AI Governance Forum 2025
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
-
[353] arXiv:2508.06407 (cross-list from cs.CV) [pdf, html, other]
A Classification-Aware Super-Resolution Framework for Ship Targets in SAR Imagery
Ch Muhammad Awais, Marco Reggiannini, Davide Moroni, Oktay Karakus
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
-
[354] arXiv:2508.06401 (cross-list from cs.DL) [pdf, other]
A Systematic Literature Review of Retrieval-Augmented Generation: Techniques, Metrics, and Challenges
Andrew Brown, Muhammad Roman, Barry Devereux
Comments: 58 pages
Subjects: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
-
[355] arXiv:2508.06393 (cross-list from cs.SD) [pdf, html, other]
Robust Target Speaker Diarization and Separation via Augmented Speaker Embedding Sampling
Md Asif Jalal, Luca Remaggi, Vasileios Moschopoulos, Thanasis Kotsiopoulos, Vandana Rajan, Karthikeyan Saravanan, Anastasis Drosou, Junho Heo, Hyuk Oh, Seokyeong Jeong
Comments: Accepted to Interspeech 2025
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
-
[356] arXiv:2508.06389 (cross-list from cs.NE) [pdf, html, other]
Identity Increases Stability in Neural Cellular Automata
James Stovold
Comments: Accepted to ALIFE 2025
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
-
[357] arXiv:2508.06387 (cross-list from cs.LG) [pdf, html, other]
End-to-End Text-to-SQL with Dataset Selection: Leveraging LLMs for Adaptive Query Generation
Anurag Tripathi, Vaibhav Patle, Abhinav Jain, Ayush Pundir, Sairam Menon, Ajeet Kumar Singh, Dorien Herremans
Comments: Accepted in IJCNN25
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
-
[358] arXiv:2508.06372 (cross-list from cs.SD) [pdf, html, other]
SpeakerLM: End-to-End Versatile Speaker Diarization and Recognition with Multimodal Large Language Models
Han Yin, Yafeng Chen, Chong Deng, Luyao Cheng, Hui Wang, Chao-Hong Tan, Qian Chen, Wen Wang, Xiangang Li
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
-
[359] arXiv:2508.06364 (cross-list from cs.LG) [pdf, other]
ActivityDiff: A diffusion model with Positive and Negative Activity Guidance for De Novo Drug Design
Renyi Zhou, Huimin Zhu, Jing Tang, Min Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
-
[360] arXiv:2508.06361 (cross-list from cs.LG) [pdf, html, other]
Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts
Zhaomin Wu, Mingzhe Du, See-Kiong Ng, Bingsheng He
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
-
[361] arXiv:2508.06357 (cross-list from cs.CV) [pdf, html, other]
Are you In or Out (of gallery)? Wisdom from the Same-Identity Crowd
Aman Bhatta, Maria Dhakal, Michael C. King, Kevin W. Bowyer
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
-
[362] arXiv:2508.06347 (cross-list from cs.LG) [pdf, html, other]
Structural Equation-VAE: Disentangled Latent Representations for Tabular Data
Ruiyu Zhang, Ce Zhao, Xin Zhao, Lin Nie, Wai-Fung Lam
Comments: 10 pages, 2 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
-
[363] arXiv:2508.06345 (cross-list from cs.CL) [pdf, html, other]
Harnessing Adaptive Topology Representations for Zero-Shot Graph Question Answering
Yanbin Wei, Jiangyue Yan, Chun Kang, Yang Chen, Hua Liu, James T. Kwok, Yu Zhang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
-
[364] arXiv:2508.06343 (cross-list from cs.DM) [pdf, html, other]
On Approximate MMS Allocations on Restricted Graph Classes
Václav Blažej, Michał Dębski, Zbigniew Lonc, Marta Piecyk, Paweł Rzążewski
Subjects: Discrete Mathematics (cs.DM); Artificial Intelligence (cs.AI)
-
[365] arXiv:2508.06336 (cross-list from cs.LG) [pdf, html, other]
Unsupervised Partner Design Enables Robust Ad-hoc Teamwork
Constantin Ruhdorfer, Matteo Bortoletto, Victor Oei, Anna Penzkofer, Andreas Bulling
Comments: 16 pages
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
-
[366] arXiv:2508.06318 (cross-list from cs.CV) [pdf, html, other]
Mixture of Experts Guided by Gaussian Splatters Matters: A new Approach to Weakly-Supervised Video Anomaly Detection
Giacomo D'Amicantonio, Snehashis Majhi, Quan Kong, Lorenzo Garattoni, Gianpiero Francesca, François Bremond, Egor Bondarev
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
-
[367] arXiv:2508.06301 (cross-list from cs.LG) [pdf, html, other]
FedMeNF: Privacy-Preserving Federated Meta-Learning for Neural Fields
Junhyeog Yun, Minui Hong, Gunhee Kim
Comments: ICCV 2025
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
-
[368] arXiv:2508.06287 (cross-list from eess.IV) [pdf, other]
Advanced Deep Learning Techniques for Accurate Lung Cancer Detection and Classification
Mobarak Abumohsen, Enrique Costa-Montenegro, Silvia García-Méndez, Amani Yousef Owda, Majdi Owda
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
-
[369] arXiv:2508.06269 (cross-list from cs.LG) [pdf, html, other]
OM2P: Offline Multi-Agent Mean-Flow Policy
Zhuoran Li, Xun Wang, Hai Zhong, Longbo Huang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
-
[370] arXiv:2508.06264 (cross-list from math.NA) [pdf, other]
Numerical Considerations in Weighted Model Counting
Randal E. Bryant
Subjects: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
-
[371] arXiv:2508.06259 (cross-list from cs.CV) [pdf, html, other]
SIFThinker: Spatially-Aware Image Focus for Visual Reasoning
Zhangquan Chen, Ruihui Zhao, Chuwei Luo, Mingze Sun, Xinlei Yu, Yangyang Kang, Ruqi Huang
Comments: 15 pages, 13 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
-
[372] arXiv:2508.06251 (cross-list from cs.LG) [pdf, html, other]
Synthetic Data Generation and Differential Privacy using Tensor Networks’ Matrix Product States (MPS)
Alejandro Moreno R., Desale Fentaw, Samuel Palmer, Raúl Salles de Padua, Ninad Dixit, Samuel Mugel, Roman Orús, Manuel Radons, Josef Menter, Ali Abedi
Comments: 10 pages
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Quantum Physics (quant-ph)
-
[373] arXiv:2508.06249 (cross-list from cs.LG) [pdf, html, other]
In-Training Defenses against Emergent Misalignment in Language Models
David Kaczér, Magnus Jørgenvåg, Clemens Vetter, Lucie Flek, Florian Mai
Comments: Under review
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
-
[374] arXiv:2508.06244 (cross-list from cs.LG) [pdf, html, other]
Membership Inference Attack with Partial Features
Xurun Wang, Guangrui Liu, Xinjie Li, Haoyu He, Lin Yao, Weizhe Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
-
[375] arXiv:2508.06220 (cross-list from cs.CL) [pdf, html, other]
InfoCausalQA: Can Models Perform Non-explicit Causal Reasoning Based on Infographic?
Keummin Ka, Junhyeong Park, Jahyun Jeon, Youngjae Yu
Comments: 14 pages, 9 figures
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
-
[376] arXiv:2508.06214 (cross-list from cs.LG) [pdf, html, other]
Reparameterization Proximal Policy Optimization
Hai Zhong, Xun Wang, Zhuoran Li, Longbo Huang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
-
[377] arXiv:2508.06208 (cross-list from cs.LG) [pdf, html, other]
Graph Federated Learning for Personalized Privacy Recommendation
Ce Na, Kai Yang, Dengzhao Fang, Yu Li, Jingtong Gao, Chengcheng Zhu, Jiale Zhang, Xiaobing Sun, Yi Chang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
-
[378] arXiv:2508.06204 (cross-list from cs.CL) [pdf, html, other]
Classification is a RAG problem: A case study on hate speech detection
Richard Willats, Josh Pennington, Aravind Mohan, Bertie Vidgen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
-
[379] arXiv:2508.06202 (cross-list from cs.CV) [pdf, html, other]
LoRA in LoRA: Towards Parameter-Efficient Architecture Expansion for Continual Visual Instruction Tuning
Chang Che, Ziqi Wang, Pengwan Yang, Qi Wang, Hui Ma, Zenglin Shi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
-
[380] arXiv:2508.06199 (cross-list from cs.LG) [pdf, html, other]
Benchmarking Pretrained Molecular Embedding Models For Molecular Representation Learning
Mateusz Praski, Jakub Adamczyk, Wojciech Czech
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
-
[381] arXiv:2508.06183 (cross-list from cs.LG) [pdf, html, other]
Differentially Private Federated Clustering with Random Rebalancing
Xiyuan Yang, Shengyuan Hu, Soyeon Kim, Tian Li
Comments: 21 pages
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
-
[382] arXiv:2508.06170 (cross-list from cs.CV) [pdf, html, other]
Synthetic Data-Driven Multi-Architecture Framework for Automated Polyp Segmentation Through Integrated Detection and Mask Generation
Ojonugwa Oluwafemi Ejiga Peter, Akingbola Oluwapemiisin, Amalahu Chetachi, Adeniran Opeyemi, Fahmi Khalifa, Md Mahmudur Rahman
Journal-ref: Proc. of SPIE Vol. 13410, 1341024 (2025)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
-
[383] arXiv:2508.06169 (cross-list from cs.CV) [pdf, html, other]
UW-3DGS: Underwater 3D Reconstruction with Physics-Aware Gaussian Splatting
Wenpeng Xing, Jie Chen, Zaifeng Yang, Changting Lin, Jianfeng Dong, Chaochao Chen, Xun Zhou, Meng Han
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
-
[384] arXiv:2508.06165 (cross-list from cs.CL) [pdf, other]
UR2: Unify RAG and Reasoning through Reinforcement Learning
Weitao Li, Boran Xiang, Xiaolong Wang, Zhinan Gou, Weizhi Ma, Yang Liu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
-
[385] arXiv:2508.06163 (cross-list from cs.CL) [pdf, other]
One Size Does Not Fit All: A Distribution-Aware Sparsification for More Precise Model Merging
Yingfeng Luo, Dingyang Lin, Junxin Wang, Ziqiang Xu, Kaiyan Chang, Tong Zheng, Bei Li, Anxiang Ma, Tong Xiao, Zhengtao Yu, Jingbo Zhu
Comments: Under review
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
-
[386] arXiv:2508.06154 (cross-list from cs.IR) [pdf, html, other]
Semantic Item Graph Enhancement for Multimodal Recommendation
Xiaoxiong Zhang, Xin Zhou, Zhiwei Zeng, Dusit Niyato, Zhiqi Shen
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
-
[387] arXiv:2508.06136 (cross-list from cs.CV) [pdf, html, other]
Roll Your Eyes: Gaze Redirection via Explicit 3D Eyeball Rotation
YoungChan Choi, HengFei Wang, YiHua Cheng, Boeun Kim, Hyung Jin Chang, YoungGeun Choi, Sang-Il Choi
Comments: 9 pages, 5 figures, ACM Multimedia 2025 accepted
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
-
[388] arXiv:2508.06135 (cross-list from cs.CL) [pdf, html, other]
Less is More: Selective Reflection for Compatible and Efficient Knowledge Distillation in Large Language Models
Lingyuan Liu, Mengxiang Zhang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
-
[389] arXiv:2508.06133 (cross-list from math.OC) [pdf, html, other]
LLM Serving Optimization with Variable Prefill and Decode Lengths
Meixuan Wang, Yinyu Ye, Zijie Zhou
Subjects: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
-
[390] arXiv:2508.06109 (cross-list from cs.CV) [pdf, html, other]
FMCE-Net++: Feature Map Convergence Evaluation and Training
Zhibo Zhu, Renyu Huang, Lei He
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
-
[391] arXiv:2508.06108 (cross-list from cs.LG) [pdf, html, other]
GCHR: Goal-Conditioned Hindsight Regularization for Sample-Efficient Reinforcement Learning
Xing Lei, Wenyan Yang, Kaiqiang Ke, Shentao Yang, Xuetao Zhang, Joni Pajarinen, Donglin Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
-
[392] arXiv:2508.06107 (cross-list from cs.CV) [pdf, html, other]
Mask & Match: Learning to Recognize Handwritten Math with Self-Supervised Attention
Shree Mitra, Ritabrata Chakraborty, Nilkanta Sahu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
-
[393] arXiv:2508.06098 (cross-list from cs.SD) [pdf, html, other]
MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows
Xiquan Li, Junxi Liu, Yuzhe Liang, Zhikang Niu, Wenxi Chen, Xie Chen
Comments: 9 pages, 3 figures
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
-
[394] arXiv:2508.06096 (cross-list from cs.RO) [pdf, html, other]
Bounding Distributional Shifts in World Modeling through Novelty Detection
Eric Jing, Abdeslam Boularias
Comments: 7 pages, 6 figures
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
-
[395] arXiv:2508.06076 (cross-list from cs.CV) [pdf, html, other]
Towards MR-Based Trochleoplasty Planning
Michael Wehrli, Alicia Durrer, Paul Friedrich, Sidaty El Hadramy, Edwin Li, Luana Brahaj, Carol C. Hasler, Philippe C. Cattin
Comments: Accepted at MICCAI COLAS Workshop 2025. Code: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
-
[396] arXiv:2508.06072 (cross-list from cs.CV) [pdf, html, other]
Can Large Models Fool the Eye? A New Turing Test for Biological Animation
Zijian Chen, Lirong Deng, Zhengyu Chen, Kaiwei Zhang, Qi Jia, Yuan Tian, Yucheng Zhu, Guangtao Zhai
Comments: 24 pages, 10 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
-
[397] arXiv:2508.06066 (cross-list from cs.LG) [pdf, html, other]
Architecture-Aware Generalization Bounds for Temporal Networks: Theory and Fair Comparison Methodology
Barak Gahtan, Alex M. Bronstein
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
-
[398] arXiv:2508.06065 (cross-list from cs.HC) [pdf, html, other]
ThematicPlane: Bridging Tacit User Intent and Latent Spaces for Image Generation
Daniel Lee, Nikhil Sharma, Donghoon Shin, DaEun Choi, Harsh Sharma, Jeonghwan Kim, Heng Ji
Journal-ref: In Adjunct Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology (UIST '25), Sept 28-Oct 1, 2025, Busan, Republic of Korea. ACM, New York, NY, USA
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
-
[399] arXiv:2508.06046 (cross-list from cs.CL) [pdf, html, other]
EvolvR: Self-Evolving Pairwise Reasoning for Story Evaluation to Enhance Generation
Xinda Wang, Zhengxu Hou, Yangshijie Zhang, Bingren Yan, Zhibo Yang, Xingsheng Zhang, Luxi Xing, Qiang Zhou, Chen Zhang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
-
[400] arXiv:2508.06041 (cross-list from cs.LG) [pdf, html, other]
DP-LLM: Runtime Model Adaptation with Dynamic Layer-wise Precision Assignment
Sangwoo Kwon, Seong Hoon Seo, Jae W. Lee, Yeonhong Park
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
-
[401] arXiv:2508.06038 (cross-list from cs.CV) [pdf, html, other]
Fourier-VLM: Compressing Vision Tokens in the Frequency Domain for Large Vision-Language Models
Huanyu Wang, Jushi Kai, Haoli Bai, Lu Hou, Bo Jiang, Ziwei He, Zhouhan Lin
Comments: 12 pages, 4 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
-
[402] arXiv:2508.06034 (cross-list from cs.LG) [pdf, html, other]
Adaptive Heterogeneous Graph Neural Networks: Bridging Heterophily and Heterogeneity
Qin Chen, Guojie Song
Comments: Accepted to CIKM 2025
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
-
[403] arXiv:2508.06026 (cross-list from cs.CL) [pdf, html, other]
Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future
Yidong Wang, Xin Wang, Cunxiang Wang, Junfeng Fang, Qiufeng Wang, Jianing Chu, Xuran Meng, Shuxun Yang, Libo Qin, Yue Zhang, Wei Ye, Shikun Zhang
Comments: 12 pages, 5 figures
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
-
[404] arXiv:2508.06021 (cross-list from cs.CV) [pdf, html, other]
Improved Sub-Visible Particle Classification in Flow Imaging Microscopy via Generative AI-Based Image Synthesis
Utku Ozbulak, Michaela Cohrs, Hristo L. Svilenov, Joris Vankerschaver, Wesley De Neve
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
-
[405] arXiv:2508.06016 (cross-list from cs.CL) [pdf, html, other]
Crisp Attention: Regularizing Transformers via Structured Sparsity
Sagar Gandhi, Vishal Gandhi
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
-
[406] arXiv:2508.06000 (cross-list from cs.HC) [pdf, html, other]
Hand by Hand: LLM Driving EMS Assistant for Operational Skill Learning
Wei Xiang, Ziyue Lei, Haoyuan Che, Fangyuan Ye, Xueting Wu, Lingyun Sun
Comments: Accepted by IJCAI 2025
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
-
[407] arXiv:2508.05991 (cross-list from cs.CV) [pdf, html, other]
ECMF: Enhanced Cross-Modal Fusion for Multimodal Emotion Recognition in MER-SEMI Challenge
Juewen Hu, Yexin Li, Jiulin Li, Shuo Chen, Pring Wong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
-
[408] arXiv:2508.05989 (cross-list from cs.CV) [pdf, html, other]
ETA: Energy-based Test-time Adaptation for Depth Completion
Younjoon Chung, Hyoungseob Park, Patrick Rim, Xiaoran Zhang, Jihe He, Ziyao Zeng, Safa Cicek, Byung-Woo Hong, James S. Duncan, Alex Wong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
-
[409] arXiv:2508.05979 (cross-list from cs.CY) [pdf, html, other]
Learning by Teaching: Engaging Students as Instructors of Large Language Models in Computer Science Education
Xinming Yang, Haasil Pujara, Jun Li
Comments: Published at COLM 2025
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
-
[410] arXiv:2508.05978 (cross-list from cs.SD) [pdf, html, other]
DAFMSVC: One-Shot Singing Voice Conversion with Dual Attention Mechanism and Flow Matching
Wei Chen, Binzhu Sha, Dan Luo, Jing Yang, Zhuo Wang, Fan Fan, Zhiyong Wu
Comments: Accepted by INTERSPEECH 2025
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
-
[411] arXiv:2508.05970 (cross-list from cs.SE) [pdf, html, other]
Impact-driven Context Filtering For Cross-file Code Completion
Yanzhou Li, Shangqing Liu, Kangjie Chen, Tianwei Zhang, Yang Liu
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
-
[412] arXiv:2508.05960 (cross-list from cs.LG) [pdf, html, other]
Mildly Conservative Regularized Evaluation for Offline Reinforcement Learning
Haohui Chen, Zhiyong Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
-
[413] arXiv:2508.05957 (cross-list from cs.LG) [pdf, html, other]
Multi-Armed Bandits-Based Optimization of Decision Trees
Hasibul Karim Shanto, Umme Ayman Koana, Shadikur Rahman
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
-
[414] arXiv:2508.05954 (cross-list from cs.CV) [pdf, html, other]
Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents
Han Lin, Jaemin Cho, Amir Zadeh, Chuan Li, Mohit Bansal
Comments: Project Page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
-
[415] arXiv:2508.05950 (cross-list from cs.CV) [pdf, html, other]
A 3DGS-Diffusion Self-Supervised Framework for Normal Estimation from a Single Image
Yanxing Liang, Yinghui Wang, Jinlong Yang, Wei Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
-
[416] arXiv:2508.05938 (cross-list from cs.CL) [pdf, html, other]
Prosocial Behavior Detection in Player Game Chat: From Aligning Human-AI Definitions to Efficient Annotation at Scale
Rafal Kocielnik, Min Kim, Penphob (Andrea) Boonyarungsrit, Fereshteh Soltani, Deshawn Sambrano, Animashree Anandkumar, R. Michael Alvarez
Comments: 9 pages, 4 figures, 4 tables
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
-
[417] arXiv:2508.05934 (cross-list from cs.HC) [pdf, html, other]
ASLSL: Adaptive shared latent structure learning with incomplete multi-modal physiological data for multi-dimensional emotional feature selection
Xueyuan Xu, Tianze Yu, Wenjia Dong, Fulin Wei, Li Zhuo
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
-
[418] arXiv:2508.05933 (cross-list from cs.HC) [pdf, html, other]
REFS: Robust EEG feature selection with missing multi-dimensional annotation for emotion recognition
Xueyuan Xu, Wenjia Dong, Fulin Wei, Li Zhuo
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
-
[419] arXiv:2508.05923 (cross-list from cs.SE) [pdf, html, other]
Enhancing Software Vulnerability Detection Through Adaptive Test Input Generation Using Genetic Algorithm
Yanusha Mehendran, Maolin Tang, Yi Lu
Comments: 26 pages, 3 figures, 6 tables. Submitted to Empirical Software Engineering; under review
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
-
[420] arXiv:2508.05913 (cross-list from cs.HC) [pdf, other]
Do Ethical AI Principles Matter to Users? A Large-Scale Analysis of User Sentiment and Satisfaction
Stefan Pasch, Min Chul Cha
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
-
[421] arXiv:2508.05880 (cross-list from cs.CL) [pdf, html, other]
Do Machines Think Emotionally? Cognitive Appraisal Analysis of Large Language Models
Sree Bhattacharyya, Lucas Craig, Tharun Dilliraj, Jia Li, James Z. Wang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
-
[422] arXiv:2508.05846 (cross-list from cs.CY) [pdf, html, other]
Towards Transparent Ethical AI: A Roadmap for Trustworthy Robotic Systems
Ahmad Farooq, Kamran Iqbal
Comments: Published in the Proceedings of the 2025 3rd International Conference on Robotics, Control and Vision Engineering (RCVE'25). 6 pages, 3 tables
Journal-ref: RCVE'25: Proceedings of the 2025 3rd International Conference on Robotics, Control and Vision Engineering
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Robotics (cs.RO)
-
[423] arXiv:2508.05838 (cross-list from cs.RO) [pdf, html, other]
Integrating Vision Foundation Models with Reinforcement Learning for Enhanced Object Interaction
Ahmad Farooq, Kamran Iqbal
Comments: Published in the Proceedings of the 2025 3rd International Conference on Robotics, Control and Vision Engineering (RCVE'25). 6 pages, 3 figures, 1 table
Journal-ref: RCVE'25: Proceedings of the 2025 3rd International Conference on Robotics, Control and Vision Engineering
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
-
[424] arXiv:2508.05799 (cross-list from cs.SE) [pdf, html, other]
AI-Guided Exploration of Large-Scale Codebases
Yoseph Berhanu Alebachew
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
-
[425] arXiv:2508.05791 (cross-list from cs.LG) [pdf, html, other]
From Imperfect Signals to Trustworthy Structure: Confidence-Aware Inference from Heterogeneous and Reliability-Varying Utility Data
Haoran Li, Lihao Mai, Muhao Guo, Jiaqi Wu, Yang Weng, Yannan Sun, Ce Jimmy Liu
Comments: 10 pages
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
-
[426] arXiv:2508.05783 (cross-list from cs.CV) [pdf, other]
Few-Shot Deployment of Pretrained MRI Transformers in Brain Imaging Tasks
Mengyu Li, Guoyao Shen, Chad W. Farris, Xin Zhang
Comments: 30 pages, 8 figures, 7 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
-
[427] arXiv:2508.05755 (cross-list from cs.CV) [pdf, html, other]
UnGuide: Learning to Forget with LoRA-Guided Diffusion Models
Agnieszka Polowczyk, Alicja Polowczyk, Dawid Malarz, Artur Kasymov, Marcin Mazur, Jacek Tabor, Przemysław Spurek
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
-
[428] arXiv:2508.05728 (cross-list from astro-ph.IM) [pdf, html, other]
CLAPP: The CLASS LLM Agent for Pair Programming
Santiago Casas, Christian Fidler, Boris Bolliet, Francisco Villaescusa-Navarro, Julien Lesgourgues
Comments: Code: this https URL, Streamlit app: this https URL
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
-
[429] arXiv:2508.05710 (cross-list from cs.SE) [pdf, html, other]
Klear-CodeTest: Scalable Test Case Generation for Code Reinforcement Learning
Jia Fu, Xinyu Yang, Hongzhi Zhang, Yahui Liu, Jingyuan Zhang, Qi Wang, Fuzheng Zhang, Guorui Zhou
Comments: 21 pages, 11 figures
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
-
[430] arXiv:2508.05705 (cross-list from q-bio.QM) [pdf, html, other]
A Physiologically-Constrained Neural Network Digital Twin Framework for Replicating Glucose Dynamics in Type 1 Diabetes
Valentina Roquemen-Echeverri, Taisa Kushner, Peter G. Jacobs, Clara Mosquera-Lopez
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
-
[431] arXiv:2508.05702 (cross-list from cs.MA) [pdf, html, other]
Semantic Reasoning Meets Numerical Precision: An LLM-Powered Multi-Agent System for Power Grid Control
Yan Zhang
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
-
[432] arXiv:2508.05700 (cross-list from cs.IR) [pdf, html, other]
Multi-Faceted Large Embedding Tables for Pinterest Ads Ranking
Runze Su, Jiayin Jin, Jiacheng Li, Sihan Wang, Guangtong Bai, Zelun Wang, Li Tang, Yixiong Meng, Huasen Wu, Zhimeng Pan, Kungang Li, Han Sun, Zhifang Liu, Haoyang Li, Siping Ji, Ling Leng, Prathibha Deshikachar
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
-
[433] arXiv:2508.05696 (cross-list from cs.CR) [pdf, html, other] [433] arXiv:2508.05696(从 cs.CR 交叉列出)[ pdf, html, other]
Log2Sig: Frequency-Aware Insider Threat Detection via Multivariate Behavioral Signal Decomposition Log2Sig:通过多变量行为信号分解进行频率感知的内部威胁检测Kaichuan Kong, Dongjie Liu, Xiaobo Jin, Zhiying Li, Guanggang Geng Kaichuan Kong、Dongjie Liu、Xiaobo Jin、Zhiying Li、Guanggang GengComments: Submitted to the 2025 IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom) 注释:已提交至 2025 年 IEEE 计算与通信信任、安全与隐私国际会议(TrustCom)Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) 主题:密码学与安全(cs.CR);人工智能(cs.AI)
-
[434] arXiv:2508.05694 (cross-list from cs.CR) [pdf, html, other] [434] arXiv:2508.05694(从 cs.CR 交叉列出)[ pdf, html, other]
DMFI: Dual-Modality Fine-Tuning and Inference Framework for LLM-Based Insider Threat Detection DMFI:基于 LLM 的内部威胁检测的双模态微调与推理框架Kaichuan Kong, Dongjie Liu, Xiaobo Jin, Guanggang Geng, Zhiying Li, Jian Weng 孔开川,刘东杰,金晓波,耿广刚,李志英,翁剑Comments: Submitted to the 2025 IEEE International Conference on Data Mining (ICDM) 备注:已提交至 2025 年 IEEE 国际数据挖掘会议(ICDM)Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) 主题:密码学与安全(cs.CR);人工智能(cs.AI);计算与语言(cs.CL)
-
[435] arXiv:2508.05693 (cross-list from cs.SE) [pdf, html, other] [435] arXiv:2508.05693(从 cs.SE 交叉列表)[ pdf, html, other]
Empirical Evaluation of AI-Assisted Software Package Selection: A Knowledge Graph Approach 基于知识图的 AI 辅助软件包选择的实证评估Siamak Farshidi, Amir Saberhabibi, Behbod Eskafi, Niloofar Nikfarjam, Sadegh Eskandari, Slinger Jansen, Michel Chaudron, Bedir Tekinerdogan Siamak Farshidi、Amir Saberhabibi、Behbod Eskafi、Niloofar Nikfarjam、Sadegh Eskandari、Slinger Jansen、Michel Chaudron、Bedir TekinerdoganSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) 学科:软件工程(cs.SE);人工智能(cs.AI)
-
[436] arXiv:2508.05687 (cross-list from cs.MA) [pdf, html, other] [436] arXiv:2508.05687(从 cs.MA 交叉列表)[ pdf, html, other]
Risk Analysis Techniques for Governed LLM-based Multi-Agent Systems 面向受治理的基于 LLM 的多智能体系统的风险分析技术Alistair Reid, Simon O'Callaghan, Liam Carroll, Tiberio Caetano Alistair Reid、Simon O’Callaghan、Liam Carroll、Tiberio CaetanoSubjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI) 学科:多智能体系统 (cs.MA);人工智能 (cs.AI)
-
[437] arXiv:2508.05681 (cross-list from cs.CR) [pdf, html, other] [437] arXiv:2508.05681(来自 cs.CR 的交叉列表)[ pdf,html,其他]
Selection-Based Vulnerabilities: Clean-Label Backdoor Attacks in Active Learning 基于选择的漏洞:主动学习中的干净标签后门攻击Yuhan Zhi, Longtian Wang, Xiaofei Xie, Chao Shen, Qiang Hu, Xiaohong Guan Yuhan Zhi,Longtian Wang,Xiaofei Xie,Chao Shen,Qiang Hu,Xiaohong GuanSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) 主题:密码学与安全(cs.CR);人工智能(cs.AI)
-
[438] arXiv:2508.05680 (cross-list from cs.IR) [pdf, html, other] [438] arXiv:2508.05680(从 cs.IR 交叉列出)[ pdf, html, other]
Are All Genders Equal in the Eyes of Algorithms? – Analysing Search and Retrieval Algorithms for Algorithmic Gender Fairness 在算法眼中所有性别都是平等的吗?——分析搜索与检索算法的算法性别公平性Stefanie Urchs, Veronika Thurner, Matthias Aßenmacher, Ludwig Bothmann, Christian Heumann, Stephanie Thiemichen Stefanie Urchs、Veronika Thurner、Matthias Aßenmacher、Ludwig Bothmann、Christian Heumann、Stephanie ThiemichenSubjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI) 主题:信息检索 (cs.IR);人工智能 (cs.AI)
-
[439] arXiv:2508.05677 (cross-list from cs.CR) [pdf, html, other] [439] arXiv:2508.05677(来自 cs.CR 的交叉列表)[ pdf, html, other]
Adversarial Attacks on Reinforcement Learning-based Medical Questionnaire Systems: Input-level Perturbation Strategies and Medical Constraint Validation 针对基于强化学习的医疗问诊系统的对抗性攻击:输入级扰动策略与医疗约束验证Peizhuo Liu 刘沛卓Comments: 30 pages (21 pages main text, 3 pages references, 6 pages appendix), 4 figures 评注:30 页(正文 21 页,参考文献 3 页,附录 6 页),4 张图Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) 学科:密码学与安全(cs.CR);人工智能(cs.AI);机器学习(cs.LG)
-
[440] arXiv:2508.05675 (cross-list from cs.CR) [pdf, html, other] [440] arXiv:2508.05675(来自 cs.CR 的交叉列举)[ pdf,html,other]
Principle-Guided Verilog Optimization: IP-Safe Knowledge Transfer via Local-Cloud Collaboration 基于原理的 Verilog 优化:通过本地-云协作进行 IP 安全的知识转移Jing Wang, Zheng Li, Lei Li, Fan He, Liyu Lin, Yao Lai, Yan Li, Xiaoyang Zeng, Yufeng Guo 王静,李政,李磊,何凡,林立宇,赖尧,李燕,曾晓阳,郭雨峰Comments: Our code and dataset are available at this https URL 备注:我们的代码和数据集可在此 https URL 获取Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) 主题:密码学与安全(cs.CR);人工智能(cs.AI)
-
[441] arXiv:2508.05674 (cross-list from cs.CR) [pdf, other] [441] arXiv:2508.05674(从 cs.CR 交叉发布)[ pdf, other]
Towards Effective Offensive Security LLM Agents: Hyperparameter Tuning, LLM as a Judge, and a Lightweight CTF Benchmark 面向高效进攻性安全 LLM 代理:超参数调优、将 LLM 作为评判者,以及轻量级 CTF 基准Minghao Shao, Nanda Rani, Kimberly Milner, Haoran Xi, Meet Udeshi, Saksham Aggarwal, Venkata Sai Charan Putrevu, Sandeep Kumar Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, Muhammad Shafique Minghao Shao、Nanda Rani、Kimberly Milner、Haoran Xi、Meet Udeshi、Saksham Aggarwal、Venkata Sai Charan Putrevu、Sandeep Kumar Shukla、Prashanth Krishnamurthy、Farshad Khorrami、Ramesh Karri、Muhammad ShafiqueSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) 主题:密码学与安全(cs.CR);人工智能(cs.AI)
-
[442] arXiv:2508.05673 (cross-list from cs.IR) [pdf, html, other]
Breaking the Top-K Barrier: Advancing Top-K Ranking Metrics Optimization in Recommender Systems 打破 Top-K 壁垒:推进推荐系统中 Top-K 排序指标的优化Weiqin Yang, Jiawei Chen, Shengjia Zhang, Peng Wu, Yuegang Sun, Yan Feng, Chun Chen, Can Wang 杨薇琴、陈家伟、张胜佳、吴鹏、孙越刚、冯岩、陈春、王灿Comments: Accepted by KDD 2025 注释:已被 KDD 2025 接收Journal-ref: Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (2025) 3542 - 3552 期刊参考:第 31 届 ACM SIGKDD 知识发现与数据挖掘大会论文集 V.2 (2025) 3542 - 3552Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) 学科:信息检索(cs.IR);人工智能(cs.AI);机器学习(cs.LG)
-
[443] arXiv:2508.05672 (cross-list from cs.IR) [pdf, html, other] [443] arXiv:2508.05672(从 cs.IR 交叉列出)[pdf,html,其他]
LMAR: Language Model Augmented Retriever for Domain-specific Knowledge Indexing LMAR:用于特定领域知识索引的语言模型增强检索器Yao Zhao, Yantian Ding, Zhiyue Zhang, Dapeng Yao, Yanxun Xu 赵垚,丁沿天,张志月,姚大鹏,徐燕勋Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI) 主题:信息检索 (cs.IR); 人工智能 (cs.AI)
-
[444] arXiv:2508.05670 (cross-list from cs.CR) [pdf, html, other] [444] arXiv:2508.05670(从 cs.CR 交叉列出)[ pdf, html, 其他]
Can LLMs effectively provide game-theoretic-based scenarios for cybersecurity? LLMs 能否有效地为网络安全提供基于博弈论的情景?Daniele Proverbio, Alessio Buscemi, Alessandro Di Stefano, The Anh Han, German Castignani, Pietro Liò Daniele Proverbio、Alessio Buscemi、Alessandro Di Stefano、The Anh Han、German Castignani、Pietro LiòSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT) 主题:密码学与安全 (cs.CR);人工智能 (cs.AI);计算机与社会 (cs.CY);计算机科学与博弈论 (cs.GT)
-
[445] arXiv:2508.05669 (cross-list from cs.IR) [pdf, other] [445] arXiv:2508.05669(从 cs.IR 交叉归档)[ pdf,其他]
Fine-Tuning Vision-Language Models for Markdown Conversion of Financial Tables in Malaysian Audited Financial Reports 为马来西亚经审计财务报表中的财务表格进行 Markdown 转换而微调视觉-语言模型Jin Khye Tan (Faculty of Computer Science and Information Technology, Universiti Malaya), En Jun Choong, Ethan Jeremiah Chitty, Yan Pheng Choo, John Hsin Yang Wong, Chern Eu Cheah Jin Khye Tan(马来亚大学计算机科学与信息技术学院)、En Jun Choong、Ethan Jeremiah Chitty、Yan Pheng Choo、John Hsin Yang Wong、Chern Eu CheahComments: 28 pages, 14 figures, 5 tables. Evaluation code (LLM-as-a-judge and Markdown TEDS) is available at this https URL. The development dataset and evaluation benchmark are available on Hugging Face at this https URL and this https URL respectively 注释:28 页,14 张图,5 张表。评估代码(LLM-as-a-judge 和 Markdown TEDS)可在此 https URL 获取。开发数据集和评估基准分别可在 Hugging Face 的此 https URL 和此 https URL 获取Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) 主题:信息检索 (cs.IR); 人工智能 (cs.AI); 计算与语言 (cs.CL); 计算机视觉与模式识别 (cs.CV); 机器学习 (cs.LG)
-
[446] arXiv:2508.05668 (cross-list from cs.IR) [pdf, html, other] [446] arXiv:2508.05668(从 cs.IR 交叉收录)[ pdf, html, other]
A Survey of LLM-based Deep Search Agents: Paradigm, Optimization, Evaluation, and Challenges 基于 LLM 的深度搜索代理综述:范式、优化、评估与挑战Yunjia Xi, Jianghao Lin, Yongzhao Xiao, Zheli Zhou, Rong Shan, Te Gao, Jiachen Zhu, Weiwen Liu, Yong Yu, Weinan Zhang 奚云佳,林江浩,肖永钊,周哲理,山荣,高特,朱嘉宸,刘伟文,于勇,张维楠Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) 学科:信息检索 (cs.IR);人工智能 (cs.AI);计算与语言 (cs.CL)
-
[447] arXiv:2508.05667 (cross-list from cs.IR) [pdf, html, other] [447] arXiv:2508.05667(从 cs.IR 交叉收录)[ pdf, html, other]
ITDR: An Instruction Tuning Dataset for Enhancing Large Language Models in Recommendations ITDR:用于提升大型语言模型在推荐系统中表现的指令微调数据集Zekun Liu, Xiaowen Huang, Jitao Sang 刘泽坤、黄晓文、桑吉涛Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI) 主题:信息检索 (cs.IR); 人工智能 (cs.AI)
-
[448] arXiv:2508.05666 (cross-list from cs.IR) [pdf, other] [448] arXiv:2508.05666(来自 cs.IR 的交叉刊载)[ pdf,其他]
HySemRAG: A Hybrid Semantic Retrieval-Augmented Generation Framework for Automated Literature Synthesis and Methodological Gap Analysis HySemRAG:一种用于自动化文献综合与方法学差距分析的混合语义检索增强生成框架Alejandro GodinezComments: 47 pages, 10 figures. Code: this https URL. Demo: this https URL. ETL+multi-agent RAG framework for literature synthesis, 35.1% improvement over PDF chunking. Real application: reduced 17,400 papers to 24 relevant ones (99.86%) in 10 minutes for wastewater epidemiology review 注释:47 页,10 幅图。代码:此 https URL。演示:此 https URL。用于文献综合的 ETL+多代理 RAG 框架,相较于 PDF 分块提升 35.1%。实际应用:在污水流行病学综述中将 17,400 篇论文在 10 分钟内筛至 24 篇相关论文(减少 99.86%)Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) 主题:信息检索(cs.IR);人工智能(cs.AI);机器学习(cs.LG)
-
[449] arXiv:2508.05664 (cross-list from cs.IR) [pdf, other] [449] arXiv:2508.05664(来自 cs.IR 的交叉列表)[ pdf,其他]
Enhancing Retrieval-Augmented Generation for Electric Power Industry Customer Support 提升用于电力行业客户支持的检索增强生成(RAG)Hei Yu Chan, Kuok Tou Ho, Chenglong Ma, Yujing Si, Hok Lai Lin, Sa Lei Lam 陈贺宇、何國濤、马成龙、司玉晶、林学礼、林世利Comments: 6 pages 注释:6 页Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) 学科:信息检索 (cs.IR);人工智能 (cs.AI);计算与语言 (cs.CL)
-
[450] arXiv:2508.05662 (cross-list from cs.IR) [pdf, html, other] [450] arXiv:2508.05662(从 cs.IR 交叉归类)[ pdf,html,other]
From Static to Dynamic: A Streaming RAG Approach to Real-time Knowledge Base 从静态到动态:一种面向实时知识库的流式 RAG 方法Yuzhou Zhu 朱宇舟Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI) 学科:信息检索(cs.IR);人工智能(cs.AI)
-
[451] arXiv:2508.05661 (cross-list from cs.IR) [pdf, html, other] [451] arXiv:2508.05661(跨列表自 cs.IR)[ pdf,html,other]
Zero-Shot Retrieval for Scalable Visual Search in a Two-Sided Marketplace 面向双边市场的可扩展视觉搜索的零样本检索Andre Rusli, Shoma Ishimoto, Sho Akiyama, Aman Kumar Singh Andre Rusli、Shoma Ishimoto、Sho Akiyama、Aman Kumar SinghComments: 6 pages, KDD 2025 Workshop on Two-sided Marketplace Optimization: Search, Pricing, Matching & Growth (TSMO) 注:6 页,KDD 2025 双边市场优化研讨会:搜索、定价、匹配与增长(TSMO)Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI) 学科:信息检索(cs.IR);人工智能(cs.AI)
-
[452] arXiv:2508.05660 (cross-list from cs.IR) [pdf, html, other] [452] arXiv:2508.05660(来自 cs.IR 的交叉列表)[ pdf, html, other]
Open-Source Agentic Hybrid RAG Framework for Scientific Literature Review 面向科学文献综述的开源能动式混合检索增强生成(RAG)框架Aditya Nagori, Ricardo Accorsi Casonatto, Ayush Gautam, Abhinav Manikantha Sai Cheruvu, Rishikesan Kamaleswaran Aditya Nagori、Ricardo Accorsi Casonatto、Ayush Gautam、Abhinav Manikantha Sai Cheruvu、Rishikesan KamaleswaranSubjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI) 学科:信息检索(cs.IR);人工智能(cs.AI)
-
[453] arXiv:2508.05657 (cross-list from cs.IR) [pdf, html, other] [453] arXiv:2508.05657(从 cs.IR 交叉列出)[ pdf、html、other]
Beyond Single Labels: Improving Conversational Recommendation through LLM-Powered Data Augmentation 超越单一标签:通过 LLM 驱动的数据增强改进对话式推荐Haozhe Xu, Xiaohua Wang, Changze Lv, Xiaoqing Zheng Haozhe Xu、Xiaohua Wang、Changze Lv、Xiaoqing ZhengSubjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI) 学科:信息检索(cs.IR);人工智能(cs.AI)
-
[454] arXiv:2508.05654 (cross-list from cs.IR) [pdf, html, other] [454] arXiv:2508.05654(从 cs.IR 交叉列出)[ pdf, html, other]
Comparison of Information Retrieval Techniques Applied to IT Support Tickets 应用于 IT 支持工单的信息检索技术比较Leonardo Santiago Benitez Pereira, Robinson Pizzio, Samir Bonho Leonardo Santiago Benitez Pereira、Robinson Pizzio、Samir BonhoSubjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI) 学科:信息检索(cs.IR);人工智能(cs.AI)
-
[455] arXiv:2508.05653 (cross-list from cs.HC) [pdf, other] [455] arXiv:2508.05653(从 cs.HC 交叉列出)[ pdf, other]
Modeling Interactive Narrative Systems: A Formal Approach 建模交互式叙事系统:一种形式化方法Jules Clerc, Domitile Lourdeaux, Mohamed Sallak, Johann Barbier, Marc Ravaine Jules Clerc、Domitile Lourdeaux、Mohamed Sallak、Johann Barbier、Marc RavaineSubjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI) 学科:人机交互(cs.HC);人工智能(cs.AI)
-
[456] arXiv:2508.05652 (cross-list from cs.IR) [pdf, html, other] [456] arXiv:2508.05652(来自 cs.IR 的交叉列表)[ pdf,html,其他]
Lessons from A Large Language Model-based Outdoor Trail Recommendation Chatbot with Retrieval Augmented Generation 基于大型语言模型并结合检索增强生成的户外徒步路径推荐聊天机器人经验教训Julia Ann Mathew, Suining He Julia Ann Mathew,Suining HeComments: 4 pages, UrbComp 2025 注释:4 页,UrbComp 2025Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI) 学科:信息检索(cs.IR);人工智能(cs.AI)
-
[457] arXiv:2508.05650 (cross-list from cs.IR) [pdf, html, other] [457] arXiv:2508.05650(来自 cs.IR 的交叉列表)[ pdf,html,其他]
OmniBench-RAG: A Multi-Domain Evaluation Platform for Retrieval-Augmented Generation Tools OmniBench-RAG:一个用于检索增强生成工具的多领域评估平台Jiaxuan Liang, Shide Zhou, Kailong Wang 梁佳轩、周世德、王凯龙Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI) 学科:信息检索(cs.IR);人工智能(cs.AI)
-
[458] arXiv:2508.05648 (cross-list from cs.IR) [pdf, html, other] [458] arXiv:2508.05648(从 cs.IR 跨列表)[ pdf, html, other]
AquiLLM: a RAG Tool for Capturing Tacit Knowledge in Research Groups AquiLLM:用于捕捉研究团队隐性知识的 RAG 工具Chandler Campbell, Bernie Boscoe, Tuan Do Chandler Campbell、Bernie Boscoe、Tuan DoComments: Accepted to US Research Software Engineer Association (US-RSE) 2025 评注:被美国研究软件工程师协会(US-RSE)2025 年接收Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI) 学科:信息检索(cs.IR);人工智能(cs.AI)
-
[459] arXiv:2508.05647 (cross-list from cs.IR) [pdf, other] [459] arXiv:2508.05647(从 cs.IR 交叉列出)[ pdf,其他]
Query-Aware Graph Neural Networks for Enhanced Retrieval-Augmented Generation 面向查询感知的图神经网络以增强检索增强生成Vibhor Agrawal, Fay Wang, Rishi Puri Vibhor Agrawal、Fay Wang、Rishi PuriSubjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI) 学科:信息检索(cs.IR);人工智能(cs.AI)
-
[460] arXiv:2508.05640 (cross-list from cs.IR) [pdf, html, other] [460] arXiv:2508.05640(跨列表自 cs.IR)[ pdf, html, other]
Request-Only Optimization for Recommendation Systems 面向推荐系统的仅请求优化Liang Guo, Wei Li, Lucy Liao, Huihui Cheng, Rui Zhang, Yu Shi, Yueming Wang, Yanzun Huang, Keke Zhai, Pengchao Wang, Timothy Shi, Xuan Cao, Shengzhi Wang, Renqin Cai, Zhaojie Gong, Omkar Vichare, Rui Jian, Leon Gao, Shiyan Deng, Xingyu Liu, Xiong Zhang, Fu Li, Wenlei Xie, Bin Wen, Rui Li, Xing Liu, Jiaqi Zhai 郭亮,李伟,Lucy Liao,程慧慧,张瑞,史宇,王跃明,黄彦尊,翟可可,王鹏超,Timothy Shi,曹轩,王胜志,蔡仁勤,龚昭杰,Omkar Vichare,简锐,Leon Gao,邓世言,刘兴宇,张雄,李甫,谢文磊,温斌,李锐,刘星,翟佳琦Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI) 学科:信息检索(cs.IR);人工智能(cs.AI)
-
[461] arXiv:2508.05637 (cross-list from cs.HC) [pdf, html, other] [461] arXiv:2508.05637(跨列表自 cs.HC)[ pdf, html, other]
Automated Visualization Makeovers with LLMs 使用 LLMs 的自动可视化改造Siddharth Gangwar, David A. Selby, Sebastian J. Vollmer Siddharth Gangwar、David A. Selby、Sebastian J. VollmerSubjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI) 学科:人机交互(cs.HC);人工智能(cs.AI)
-
[462] arXiv:2508.04748 (cross-list from cs.LG) [pdf, html, other] [462] arXiv:2508.04748(从 cs.LG 交叉列出)[ pdf,html,其他]
AttriLens-Mol: Attribute Guided Reinforcement Learning for Molecular Property Prediction with Large Language Models AttriLens-Mol:基于属性引导的强化学习用于结合 LLMs 的分子性质预测Xuan Lin, Long Chen, Yile Wang 林璇, 陈龙, 王一乐Comments: 9 pages 评注:9 页Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) 主题:机器学习 (cs.LG);人工智能 (cs.AI);计算与语言 (cs.CL)
-
[463] arXiv:2507.12286 (cross-list from cs.LO) [pdf, html, other] [463] arXiv:2507.12286(从 cs.LO 交叉列出)[ pdf, html, other ]
SHACL Validation in the Presence of Ontologies: Semantics and Rewriting Techniques 在本体存在下的 SHACL 验证:语义与重写技术Anouk Oudshoorn, Magdalena Ortiz, Mantas Simkus Anouk Oudshoorn、Magdalena Ortiz、Mantas SimkusComments: 36 pages, 6 figures, submitted to the journal of Artificial Intelligence (AIJ) 注释:36 页,6 张图,提交至《人工智能杂志》(AIJ)Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI) 主题:计算机科学中的逻辑 (cs.LO);人工智能 (cs.AI)
-
[464] arXiv:2304.04475 (cross-list from cs.LG) [pdf, other] [464] arXiv:2304.04475(从 cs.LG 交叉列出)[ pdf,其他]
Epidemic Control on a Large-Scale-Agent-Based Epidemiology Model using Deep Deterministic Policy Gradient 在大规模基于主体的流行病模型上使用深度确定性策略梯度进行疫情控制Gaurav Deshkar, Jayanta Kshirsagar, Harshal Hayatnagarkar, Janani Venugopalan Gaurav Deshkar,Jayanta Kshirsagar,Harshal Hayatnagarkar,Janani VenugopalanSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY) 主题:机器学习 (cs.LG);人工智能 (cs.AI);系统与控制 (eess.SY)
2025-08-12 | | 总计:312
#1 From Natural Language to Solver-Ready Power System Optimization: An LLM-Assisted, Validation-in-the-Loop Framework #1 从自然语言到求解器就绪的电力系统优化:一种由 LLM 辅助、带验证闭环的框架
Authors: [Yunkai Hu](https://arxiv.org/search/?searchtype=author&query=Yunkai Hu), [Tianqiao Zhao](https://arxiv.org/search/?searchtype=author&query=Tianqiao Zhao), [Meng Yue](https://arxiv.org/search/?searchtype=author&query=Meng Yue) 作者:胡云凯,赵天桥,岳萌
This paper introduces a novel Large Language Models (LLMs)-assisted agent that automatically converts natural-language descriptions of power system optimization scenarios into compact, solver-ready formulations and generates corresponding solutions. In contrast to approaches that rely solely on LLM to produce solutions directly, the proposed method focuses on discovering a mathematically compatible formulation that can be efficiently solved by off-the-shelf optimization solvers. Directly using LLMs to produce solutions often leads to infeasible or suboptimal results, as these models lack the numerical precision and constraint-handling capabilities of established optimization solvers. The pipeline integrates a domain-aware prompt and schema with an LLM, enforces feasibility through systematic validation and iterative repair, and returns both solver-ready models and user-facing results. Using the unit commitment problem as a representative case study, the agent produces optimal or near-optimal schedules along with the associated objective costs. Results demonstrate that coupling the solver with task-specific validation significantly enhances solution reliability. This work shows that combining AI with established optimization frameworks bridges high-level problem descriptions and executable mathematical models, enabling more efficient decision-making in energy systems 本文提出了一种新颖的大型语言模型(LLMs)辅助代理,能够将电力系统优化场景的自然语言描述自动转换为紧凑、可供求解器直接使用的数学表述,并生成相应的求解结果。与单纯依赖 LLM 直接给出解答的方法不同,所提出的方法侧重于发现一种与数学相兼容的表述,以便能被现成的优化求解器高效求解。直接使用 LLMs 生成解答常常导致不可行或次优的结果,因为这些模型缺乏既有优化求解器所具备的数值精度和约束处理能力。该流程将领域感知的提示和模式与 LLM 结合,通过系统化的验证与迭代修复来强制确保可行性,并返回既可供求解器使用的模型又面向用户的结果。以机组组合问题作为代表性案例研究,该代理生成了最优或近似最优的调度方案及其相应的目标成本。结果表明,将求解器与任务特定的验证相结合能显著提升解的可靠性。 这项工作表明,将人工智能与既有的优化框架结合起来可以架起从高级问题描述到可执行数学模型的桥梁,从而使能源系统中的决策更高效
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-08-11 16:22:57 UTC 发布:2025-08-11 16:22:57 UTC
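The validate-then-repair loop described in the abstract above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the dictionary schema, the `validate`, `repair`, and `solve` helpers, and the brute-force single-hour commitment search (standing in for an off-the-shelf solver) are hypothetical and are not taken from the paper. The `repair` function is a hard-coded stand-in for the LLM repair step.

```python
from itertools import product


def validate(model):
    """Return a list of problems found in a candidate formulation (hypothetical schema)."""
    issues = []
    for name, u in model["units"].items():
        if u["p_min"] > u["p_max"]:
            issues.append(f"{name}: p_min exceeds p_max")
    capacity = sum(u["p_max"] for u in model["units"].values())
    if capacity < max(model["demand"]):
        issues.append("total capacity below peak demand")
    return issues


def repair(model, issues):
    """Stand-in for an LLM repair step: only fixes swapped generation bounds."""
    for u in model["units"].values():
        if u["p_min"] > u["p_max"]:
            u["p_min"], u["p_max"] = u["p_max"], u["p_min"]
    return model


def dispatch_cost(units, on, demand):
    """Cheapest dispatch of the committed units for one hour, or None if infeasible."""
    lo = sum(units[n]["p_min"] for n in on)
    hi = sum(units[n]["p_max"] for n in on)
    if not (lo <= demand <= hi):
        return None
    cost = sum(units[n]["cost"] * units[n]["p_min"] for n in on)
    remaining = demand - lo
    for n in sorted(on, key=lambda n: units[n]["cost"]):  # cheapest units first
        extra = min(remaining, units[n]["p_max"] - units[n]["p_min"])
        cost += units[n]["cost"] * extra
        remaining -= extra
    return cost


def solve(model):
    """Brute-force on/off search per hour (no ramping or start-up costs)."""
    units, names = model["units"], list(model["units"])
    schedule, total = [], 0.0
    for demand in model["demand"]:
        best = None
        for mask in product([0, 1], repeat=len(names)):
            on = [n for n, bit in zip(names, mask) if bit]
            c = dispatch_cost(units, on, demand)
            if c is not None and (best is None or c < best[0]):
                best = (c, on)
        if best is None:
            raise ValueError(f"no feasible commitment for demand {demand}")
        total += best[0]
        schedule.append(best[1])
    return schedule, total


def formulate_and_solve(candidate, max_rounds=3):
    """Validate, repair if needed, and only then hand the model to the solver."""
    for _ in range(max_rounds):
        issues = validate(candidate)
        if not issues:
            return solve(candidate)
        candidate = repair(candidate, issues)
    raise ValueError(f"could not repair: {validate(candidate)}")
```

The point of the design is that the solver never sees a formulation that failed validation. In this toy version, a candidate whose bounds were swapped is repaired once and then solved exactly.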
#2 BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks #2 BlindGuard:在未知攻击下保护基于 LLM 的多智能体系统
Authors: [Rui Miao](https://arxiv.org/search/?searchtype=author&query=Rui Miao), [Yixin Liu](https://arxiv.org/search/?searchtype=author&query=Yixin Liu), [Yili Wang](https://arxiv.org/search/?searchtype=author&query=Yili Wang), [Xu Shen](https://arxiv.org/search/?searchtype=author&query=Xu Shen), [Yue Tan](https://arxiv.org/search/?searchtype=author&query=Yue Tan), [Yiwei Dai](https://arxiv.org/search/?searchtype=author&query=Yiwei Dai), [Shirui Pan](https://arxiv.org/search/?searchtype=author&query=Shirui Pan), [Xin Wang](https://arxiv.org/search/?searchtype=author&query=Xin Wang) 作者:Rui Miao, Yixin Liu, Yili Wang, Xu Shen, Yue Tan, Yiwei Dai, Shirui Pan, Xin Wang
The security of LLM-based multi-agent systems (MAS) is critically threatened by propagation vulnerability, where malicious agents can distort collective decision-making through inter-agent message interactions. While existing supervised defense methods demonstrate promising performance, they may be impractical in real-world scenarios due to their heavy reliance on labeled malicious agents to train a supervised malicious detection model. To enable practical and generalizable MAS defenses, in this paper, we propose BlindGuard, an unsupervised defense method that learns without requiring any attack-specific labels or prior knowledge of malicious behaviors. To this end, we establish a hierarchical agent encoder to capture individual, neighborhood, and global interaction patterns of each agent, providing a comprehensive understanding for malicious agent detection. Meanwhile, we design a corruption-guided detector that consists of directional noise injection and contrastive learning, allowing effective detection model training solely on normal agent behaviors. Extensive experiments show that BlindGuard effectively detects diverse attack types (i.e., prompt injection, memory poisoning, and tool attack) across MAS with various communication patterns while maintaining superior generalizability compared to supervised baselines. The code is available at: https://github.com/MR9812/BlindGuard. 基于 LLM 的多智能体系统(MAS)的安全性受到传播脆弱性的严重威胁,恶意智能体可以通过智能体之间的消息交互扭曲集体决策。尽管现有的有监督防御方法表现良好,但由于它们在训练有监督的恶意检测模型时高度依赖带标签的恶意智能体样本,在现实场景中可能不切实际。为实现实用且具有泛化性的 MAS 防御,本文提出了 BlindGuard,一种无监督防御方法,该方法在学习时无需任何针对特定攻击的标签或对恶意行为的先验知识。为此,我们建立了一个分层智能体编码器,用以捕捉每个智能体的个体、邻域和全局交互模式,从而为恶意智能体检测提供全面理解。与此同时,我们设计了一个基于扰动引导的检测器,包含方向性噪声注入和对比学习,使得检测模型能够仅基于正常智能体行为有效训练。 大量实验表明,BlindGuard 能有效检测多智能体系统(在不同通信模式下)的多种攻击类型(即提示注入、记忆投毒和工具攻击),同时在可泛化性方面优于监督基线。代码可在以下地址获取: https://github.com/MR9812/BlindGuard。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-08-11 16:04:47 UTC 发布:2025-08-11 16:04:47 UTC
#3 TeamMedAgents: Enhancing Medical Decision-Making of LLMs Through Structured Teamwork #3 TeamMedAgents:通过结构化团队合作增强 LLMs 的医学决策能力
Authors: [Pranav Pushkar Mishra](https://arxiv.org/search/?searchtype=author&query=Pranav Pushkar Mishra), [Mohammad Arvan](https://arxiv.org/search/?searchtype=author&query=Mohammad Arvan), [Mohan Zalake](https://arxiv.org/search/?searchtype=author&query=Mohan Zalake) 作者:Pranav Pushkar Mishra、Mohammad Arvan、Mohan Zalake
We present TeamMedAgents, a novel multi-agent approach that systematically integrates evidence-based teamwork components from human-human collaboration into medical decision-making with large language models (LLMs). Our approach validates an organizational psychology teamwork model from human collaboration to computational multi-agent medical systems by operationalizing six core teamwork components derived from Salas et al.’s “Big Five” model: team leadership, mutual performance monitoring, team orientation, shared mental models, closed-loop communication, and mutual trust. We implement and evaluate these components as modular, configurable mechanisms within an adaptive collaboration architecture while assessing the effect of the number of agents involved based on the task’s requirements and domain. Systematic evaluation of computational implementations of teamwork behaviors across eight medical benchmarks (MedQA, MedMCQA, MMLU-Pro Medical, PubMedQA, DDXPlus, MedBullets, Path-VQA, and PMC-VQA) demonstrates consistent improvements across 7 out of 8 evaluated datasets. Controlled ablation studies conducted on 50 questions per configuration across 3 independent runs provide mechanistic insights into individual component contributions, revealing optimal teamwork configurations that vary by reasoning task complexity and domain-specific requirements. Our ablation analyses reveal dataset-specific optimal teamwork configurations, indicating that different medical reasoning modalities benefit from distinct collaborative patterns. TeamMedAgents represents an advancement in collaborative AI by providing a systematic translation of established teamwork theories from human collaboration into agentic collaboration, establishing a foundation for evidence-based multi-agent system design in critical decision-making domains. 
我们提出了 TeamMedAgents,一种新颖的多智能体方法,系统性地将基于证据的人与人协作中的团队工作要素整合到基于大型语言模型(LLMs)的医疗决策中。我们通过将 Salas 等人的 “大五” 模型衍生出的六个核心团队工作要素实现为可操作化组件,验证了将组织心理学中的团队模型从人类协作迁移到计算多智能体医疗系统的可行性:团队领导、相互绩效监测、团队导向、共享心理模型、闭环沟通和相互信任。我们将这些要素作为模块化、可配置的机制实现并评估于一个自适应协作架构中,同时根据任务需求和领域评估参与智能体数量的影响。在八个医学基准(MedQA、MedMCQA、MMLU-Pro Medical、PubMedQA、DDXPlus、MedBullets、Path-VQA 和 PMC-VQA)上对团队行为的计算实现进行系统评估,结果显示在 8 个评估数据集中有 7 个数据集表现出一致的改进。 在每种配置上对 50 个问题进行的三次独立运行的受控消融研究,为各个组成部分的机制性贡献提供了洞见,揭示了随推理任务复杂性和特定领域需求而变化的最优团队合作配置。我们的消融分析显示了数据集特定的最优团队配置,表明不同的医学推理模式受益于不同的协作模式。TeamMedAgents 通过将已确立的人类协作团队理论系统性地转化为代理协作,代表了协作式人工智能的一项进步,为在关键决策领域中基于证据的多代理系统设计奠定了基础。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-08-11 15:55:06 UTC 发布:2025-08-11 15:55:06 UTC
#4 FNBT: Full Negation Belief Transformation for Open-World Information Fusion Based on Dempster-Shafer Theory of Evidence #4 FNBT:基于证据理论的开放世界信息融合的完全否定置信转换
Authors: [Meishen He](https://arxiv.org/search/?searchtype=author&query=Meishen He), [Wenjun Ma](https://arxiv.org/search/?searchtype=author&query=Wenjun Ma), [Jiao Wang](https://arxiv.org/search/?searchtype=author&query=Jiao Wang), [Huijun Yue](https://arxiv.org/search/?searchtype=author&query=Huijun Yue), [Xiaoma Fan](https://arxiv.org/search/?searchtype=author&query=Xiaoma Fan) 作者:何美森、马文军、王娇、岳慧君、范晓马
The Dempster-Shafer theory of evidence has been widely applied in the field of information fusion under uncertainty. Most existing research focuses on combining evidence within the same frame of discernment. However, in real-world scenarios, trained algorithms or data often originate from different regions or organizations, where data silos are prevalent. As a result, using different data sources or models to generate basic probability assignments may lead to heterogeneous frames, for which traditional fusion methods often yield unsatisfactory results. To address this challenge, this study proposes an open-world information fusion method, termed Full Negation Belief Transformation (FNBT), based on the Dempster-Shafer theory. More specifically, a criterion is introduced to determine whether a given fusion task belongs to the open-world setting. Then, by extending the frames, the method can accommodate elements from heterogeneous frames. Finally, a full negation mechanism is employed to transform the mass functions, so that existing combination rules can be applied to the transformed mass functions for such information fusion. Theoretically, the proposed method satisfies three desirable properties, which are formally proven: mass function invariance, heritability, and essential conflict elimination. Empirically, FNBT demonstrates superior performance in pattern classification tasks on real-world datasets and successfully resolves Zadeh’s counterexample, thereby validating its practical effectiveness. 邓普斯特-谢佛证据理论在不确定性下的信息融合领域被广泛应用。现有的大多数研究集中于在相同辨识框架内合并证据。然而,在实际场景中,经过训练的算法或数据通常来自不同的地区或组织,数据孤岛现象普遍存在。因此,使用不同的数据源或模型生成的基本概率赋值可能导致异构框架,而传统的融合方法常常在这种情况下产生不理想的结果。为了解决这一挑战,本研究提出了一种基于邓普斯特-谢佛理论的开放世界信息融合方法,称为全否定信念变换(FNBT)。更具体地,引入了一个判据来确定给定的融合任务是否属于开放世界设置。然后,通过扩展框架,该方法可以容纳来自异构框架的元素。最后,采用全否定机制对质量函数进行变换,以便现有的组合规则可以应用于这些经过变换的质量函数,从而实现此类信息融合。 从理论上讲,所提出的方法满足三项理想性质,并对其给出了形式化证明:质量函数不变性、可继承性以及本质冲突消除。实证上,FNBT 在真实数据集的模式分类任务中表现优异,并成功解决了 Zadeh 的反例,从而验证了其实际有效性。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-08-11 15:21:48 UTC 发布:2025-08-11 15:21:48 UTC
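For context on the abstract above, the sketch below implements the classical Dempster rule of combination (not the FNBT transformation itself, whose details are not given here) and reproduces the commonly cited form of Zadeh's counterexample. The function name `combine` and the encoding of mass functions as dictionaries keyed by `frozenset` are assumptions chosen for illustration.

```python
from itertools import product


def combine(m1, m2):
    """Classical Dempster's rule for two mass functions over the same frame.

    Each mass function maps a frozenset of hypotheses to its mass.
    Returns the normalized combined masses and the conflict coefficient K.
    """
    joint = {}
    conflict = 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            joint[inter] = joint.get(inter, 0.0) + x * y
        else:
            conflict += x * y  # mass assigned to the empty intersection
    if conflict >= 1.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    return {k: v / (1.0 - conflict) for k, v in joint.items()}, conflict


# Zadeh's counterexample: two sources that almost completely disagree.
A, B, C = frozenset("A"), frozenset("B"), frozenset("C")
m1 = {A: 0.99, B: 0.01}
m2 = {C: 0.99, B: 0.01}
fused, k = combine(m1, m2)
print(fused, k)
```

After normalization, nearly all belief lands on `B`, the hypothesis both sources considered very unlikely, while the conflict coefficient is close to 1. This is the counter-intuitive behavior that the abstract says FNBT resolves.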
#5 AdaptFlow: Adaptive Workflow Optimization via Meta-Learning #5 AdaptFlow:通过元学习进行自适应工作流优化
Authors: [Runchuan Zhu](https://arxiv.org/search/?searchtype=author&query=Runchuan Zhu), [Bowen Jiang](https://arxiv.org/search/?searchtype=author&query=Bowen Jiang), [Lingrui Mei](https://arxiv.org/search/?searchtype=author&query=Lingrui Mei), [Fangkai Yang](https://arxiv.org/search/?searchtype=author&query=Fangkai Yang), [Lu Wang](https://arxiv.org/search/?searchtype=author&query=Lu Wang), [Haoxiang Gao](https://arxiv.org/search/?searchtype=author&query=Haoxiang Gao), [Fengshuo Bai](https://arxiv.org/search/?searchtype=author&query=Fengshuo Bai), [Pu Zhao](https://arxiv.org/search/?searchtype=author&query=Pu Zhao), [Qingwei Lin](https://arxiv.org/search/?searchtype=author&query=Qingwei Lin), [Saravan Rajmohan](https://arxiv.org/search/?searchtype=author&query=Saravan Rajmohan), [Dongmei Zhang](https://arxiv.org/search/?searchtype=author&query=Dongmei Zhang) 作者:朱润川、姜博文、梅凌睿、杨方恺、王璐、高浩翔、白凤硕、赵璞、林庆伟、Saravan Rajmohan、张东梅
Recent advances in large language models (LLMs) have sparked growing interest in agentic workflows, which are structured sequences of LLM invocations intended to solve complex tasks. However, existing approaches often rely on static templates or manually designed workflows, which limit adaptability to diverse tasks and hinder scalability. We propose AdaptFlow, a natural language-based meta-learning framework inspired by model-agnostic meta-learning (MAML). AdaptFlow learns a generalizable workflow initialization that enables rapid subtask-level adaptation. It employs a bi-level optimization scheme: the inner loop refines the workflow for a specific subtask using LLM-generated feedback, while the outer loop updates the shared initialization to perform well across tasks. This setup allows AdaptFlow to generalize effectively to unseen tasks by adapting the initialized workflow through language-guided modifications. Evaluated across question answering, code generation, and mathematical reasoning benchmarks, AdaptFlow consistently outperforms both manually crafted and automatically searched baselines, achieving state-of-the-art results with strong generalization across tasks and models. The source code and data are available at https://github.com/microsoft/DKI_LLM/tree/AdaptFlow/AdaptFlow. 最近在大型语言模型(LLMs)方面的进展引发了人们对具代理性的工作流的日益关注,这些工作流是为解决复杂任务而设计的结构化 LLM 调用序列。然而,现有方法常依赖静态模板或手工设计的工作流,限制了对多样任务的适应性并阻碍了可扩展性。我们提出了 AdaptFlow,一种受模型无关元学习(MAML)启发的基于自然语言的元学习框架。AdaptFlow 学习一种可泛化的工作流初始化,从而实现对子任务级别的快速适应。它采用双层优化方案:内层循环使用 LLM 生成的反馈对特定子任务的工作流进行精炼,外层循环更新共享的初始化以在各任务上表现良好。该设置使 AdaptFlow 能够通过语言引导的修改来适配初始化工作流,从而有效泛化到未见过的任务。在问答、代码生成和数学推理基准上的评估表明,AdaptFlow 始终优于手工设计和自动搜索的基线方法,在跨任务和跨模型的泛化方面取得了最先进的结果。 源代码和数据可在 https://github.com/microsoft/DKI_LLM/tree/AdaptFlow/AdaptFlow 获取。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-08-11 14:52:59 UTC 发布:2025-08-11 14:52:59 UTC
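The bi-level scheme in the abstract above borrows its structure from MAML. As a rough numeric analogue only, the sketch below runs first-order MAML on toy one-dimensional tasks with loss (w - target)^2. AdaptFlow itself adapts natural-language workflows using LLM-generated feedback rather than gradients, so the scalar parameter, the function names, and the first-order approximation are all assumptions swapped in for illustration.

```python
def inner_adapt(w, target, alpha):
    """Inner loop: one gradient step on the task loss (w - target)**2."""
    return w - alpha * 2.0 * (w - target)


def meta_train(targets, alpha=0.1, beta=0.1, steps=200, w0=0.0):
    """Outer loop (first-order MAML): move the shared initialization so that
    a single inner step performs well on every task."""
    w = w0
    for _ in range(steps):
        grads = []
        for t in targets:
            w_task = inner_adapt(w, t, alpha)
            # Gradient of the task loss evaluated at the adapted parameter.
            grads.append(2.0 * (w_task - t))
        w -= beta * sum(grads) / len(grads)
    return w


w_init = meta_train([1.0, 3.0, 5.0])
print(w_init)
```

For these symmetric quadratic tasks the learned initialization settles at the mean of the task targets, from which one inner step moves toward any individual task.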
#6 Fitting Description Logic Ontologies to ABox and Query Examples #6 将描述逻辑本体拟合到 ABox 和查询示例
Authors: [Maurice Funk](https://arxiv.org/search/?searchtype=author&query=Maurice Funk), [Marvin Grosser](https://arxiv.org/search/?searchtype=author&query=Marvin Grosser), [Carsten Lutz](https://arxiv.org/search/?searchtype=author&query=Carsten Lutz) 作者:Maurice Funk、Marvin Grosser、Carsten Lutz
We study a fitting problem inspired by ontology-mediated querying: given a collection of positive and negative examples of the form (A,q) with A an ABox and q a Boolean query, we seek an ontology O that satisfies A∪O⊨q for all positive examples and A∪O⊭q for all negative examples. We consider the description logics ALC and ALCI as ontology languages and a range of query languages that includes atomic queries (AQs), conjunctive queries (CQs), and unions thereof (UCQs). For all of the resulting fitting problems, we provide effective characterizations and determine the computational complexity of deciding whether a fitting ontology exists. This problem turns out to be coNP for AQs and full CQs and 2ExpTime-complete for CQs and UCQs. These results hold for both ALC and ALCI. 我们研究一个受本体中介查询启发的拟合问题:给定一组正例和负例,形式为 (A,q),其中 A 是一个 ABox,q 是一个布尔查询,我们寻找一个本体 O,使得对于所有正例都满足 A∪O⊨q,并且对于所有负例都满足 A∪O⊭q。我们考虑描述逻辑 ALC 和 ALCI 作为本体语言,并考虑一系列查询语言,包括原子查询(AQs)、合取查询(CQs)及其并(UCQs)。对于所有由此得到的拟合问题,我们给出有效的刻画,并确定判定是否存在拟合本体的计算复杂性。对于 AQs 和完整 CQs,这一问题被证明是 coNP,而对于 CQs 和 UCQs 则是 2ExpTime-完全的。这些结果对 ALC 和 ALCI 均成立。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-08-11 14:11:27 UTC 发布:2025-08-11 14:11:27 UTC
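The fitting condition in entry #6 can be stated compactly. The following is only a restatement of the abstract's definition; the names E^+ and E^- for the sets of positive and negative examples are introduced here for illustration and do not come from the abstract.

```latex
\text{Given finite sets } E^{+}, E^{-} \text{ of pairs } (\mathcal{A}, q),
\text{ find an ontology } \mathcal{O} \text{ such that}
\quad
\mathcal{A} \cup \mathcal{O} \models q \ \text{ for all } (\mathcal{A}, q) \in E^{+},
\qquad
\mathcal{A} \cup \mathcal{O} \not\models q \ \text{ for all } (\mathcal{A}, q) \in E^{-}.
```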
#7 Interpreting Fedspeak with Confidence: A LLM-Based Uncertainty-Aware Framework Guided by Monetary Policy Transmission Paths #7 用自信解读美联储表态:一个由货币政策传导路径引导的基于 LLM 的不确定性感知框架
Authors: [Rui Yao](https://arxiv.org/search/?searchtype=author&query=Rui Yao), [Qi Chai](https://arxiv.org/search/?searchtype=author&query=Qi Chai), [Jinhai Yao](https://arxiv.org/search/?searchtype=author&query=Jinhai Yao), [Siyuan Li](https://arxiv.org/search/?searchtype=author&query=Siyuan Li), [Junhao Chen](https://arxiv.org/search/?searchtype=author&query=Junhao Chen), [Qi Zhang](https://arxiv.org/search/?searchtype=author&query=Qi Zhang), [Hao Wang](https://arxiv.org/search/?searchtype=author&query=Hao Wang) 作者:姚睿、柴琦、姚金海、李思远、陈俊豪、张琦、王昊
“Fedspeak”, the stylized and often nuanced language used by the U.S. Federal Reserve, encodes implicit policy signals and strategic stances. The Federal Open Market Committee strategically employs Fedspeak as a communication tool to shape market expectations and influence both domestic and global economic conditions. As such, automatically parsing and interpreting Fedspeak presents a high-impact challenge, with significant implications for financial forecasting, algorithmic trading, and data-driven policy analysis. In this paper, we propose an LLM-based, uncertainty-aware framework for deciphering Fedspeak and classifying its underlying monetary policy stance. Technically, to enrich the semantic and contextual representation of Fedspeak texts, we incorporate domain-specific reasoning grounded in the monetary policy transmission mechanism. We further introduce a dynamic uncertainty decoding module to assess the confidence of model predictions, thereby enhancing both classification accuracy and model reliability. Experimental results demonstrate that our framework achieves state-of-the-art performance on the policy stance analysis task. Moreover, statistical analysis reveals a significant positive correlation between perceptual uncertainty and model error rates, validating the effectiveness of perceptual uncertainty as a diagnostic signal. “联储用语”是美国联邦储备系统所使用的一种程式化且常含微妙差别的语言,它编码了隐含的政策信号和战略立场。联邦公开市场委员会有策略地运用联储用语作为一种沟通工具,以塑造市场预期并影响国内和全球的经济状况。因此,自动解析和解读联储用语是一个具有重大影响的挑战,对金融预测、算法交易和数据驱动的政策分析具有重要意义。在本文中,我们提出了一个基于 LLM 且具不确定性感知的框架,用于破译联储用语并对其潜在的货币政策立场进行分类。从技术上讲,为了丰富联储用语文本的语义与上下文表示,我们融入了基于货币政策传导机制的领域特定推理。我们进一步引入了一个动态不确定性解码模块,以评估模型预测的置信度,从而提升分类准确性和模型可靠性。实验结果表明,我们的框架在政策立场分析任务上达到了最先进的性能。 此外,统计分析显示感知不确定性与模型错误率之间存在显著正相关,验证了感知不确定性作为诊断信号的有效性。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-08-11 14:04:59 UTC 发布:2025-08-11 14:04:59 UTC
#8 FEAT: A Multi-Agent Forensic AI System with Domain-Adapted Large Language Model for Automated Cause-of-Death Analysis #8 FEAT:一种多智能体取证人工智能系统,采用领域适配的大型语言模型用于自动死因分析
Authors: [Chen Shen](https://arxiv.org/search/?searchtype=author&query=Chen Shen), [Wanqing Zhang](https://arxiv.org/search/?searchtype=author&query=Wanqing Zhang), [Kehan Li](https://arxiv.org/search/?searchtype=author&query=Kehan Li), [Erwen Huang](https://arxiv.org/search/?searchtype=author&query=Erwen Huang), [Haitao Bi](https://arxiv.org/search/?searchtype=author&query=Haitao Bi), [Aiying Fan](https://arxiv.org/search/?searchtype=author&query=Aiying Fan), [Yiwen Shen](https://arxiv.org/search/?searchtype=author&query=Yiwen Shen), [Hongmei Dong](https://arxiv.org/search/?searchtype=author&query=Hongmei Dong), [Ji Zhang](https://arxiv.org/search/?searchtype=author&query=Ji Zhang), [Yuming Shao](https://arxiv.org/search/?searchtype=author&query=Yuming Shao), [Zengjia Liu](https://arxiv.org/search/?searchtype=author&query=Zengjia Liu), [Xinshe Liu](https://arxiv.org/search/?searchtype=author&query=Xinshe Liu), [Tao Li](https://arxiv.org/search/?searchtype=author&query=Tao Li), [Chunxia Yan](https://arxiv.org/search/?searchtype=author&query=Chunxia Yan), [Shuanliang Fan](https://arxiv.org/search/?searchtype=author&query=Shuanliang Fan), [Di Wu](https://arxiv.org/search/?searchtype=author&query=Di Wu), [Jianhua Ma](https://arxiv.org/search/?searchtype=author&query=Jianhua Ma), [Bin Cong](https://arxiv.org/search/?searchtype=author&query=Bin Cong), [Zhenyuan Wang](https://arxiv.org/search/?searchtype=author&query=Zhenyuan Wang), [Chunfeng Lian](https://arxiv.org/search/?searchtype=author&query=Chunfeng Lian) 作者:沈晨、张万庆、李科涵、黄尔文、毕海涛、范爱英、沈怡文、董红梅、张姬、邵玉明、刘增佳、刘新社、李涛、闫春霞、范双良、吴迪、马建华、丛彬、王振远、廉春锋
Forensic cause-of-death determination faces systemic challenges, including workforce shortages and diagnostic variability, particularly in high-volume systems like China’s medicolegal infrastructure. We introduce FEAT (ForEnsic AgenT), a multi-agent AI framework that automates and standardizes death investigations through a domain-adapted large language model. FEAT’s application-oriented architecture integrates: (i) a central Planner for task decomposition, (ii) specialized Local Solvers for evidence analysis, (iii) a Memory & Reflection module for iterative refinement, and (iv) a Global Solver for conclusion synthesis. The system employs tool-augmented reasoning, hierarchical retrieval-augmented generation, forensic-tuned LLMs, and human-in-the-loop feedback to ensure legal and medical validity. In evaluations across diverse Chinese case cohorts, FEAT outperformed state-of-the-art AI systems in both long-form autopsy analyses and concise cause-of-death conclusions. It demonstrated robust generalization across six geographic regions and achieved high expert concordance in blinded validations. Senior pathologists validated FEAT’s outputs as comparable to those of human experts, with improved detection of subtle evidentiary nuances. To our knowledge, FEAT is the first LLM-based AI agent system dedicated to forensic medicine, offering scalable, consistent death certification while maintaining expert-level rigor. By integrating AI efficiency with human oversight, this work could advance equitable access to reliable medicolegal services while addressing critical capacity constraints in forensic systems. 
法医死亡原因判定面临系统性挑战,包括劳动力短缺和诊断差异性,尤其是在中国等高流量的法医系统中。我们提出了 FEAT(ForEnsic AgenT),一种通过领域适配的大型语言模型来自动化和规范死亡调查的多代理人工智能框架。FEAT 的面向应用的架构整合了:(i)用于任务分解的中央规划器,(ii)用于证据分析的专门本地求解器,(iii)用于迭代完善的记忆与反思模块,以及(iv)用于结论综合的全局求解器。该系统采用了工具增强推理、分层检索增强生成、为法医调整的 LLMs,以及人机闭环反馈以确保法律和医学的有效性。在针对中国不同病例队列的评估中,FEAT 在长篇验尸分析和简明死亡原因结论两方面均优于最先进的人工智能系统。它在六个地理区域中表现出稳健的泛化能力,并在盲测验证中获得了较高的专家一致性。 高级病理学家验证了 FEAT 的输出可与人类专家相媲美,并且在检测细微证据细节方面有所改进。 据我们所知,FEAT 是首个专注于法医医学的基于 LLM 的 AI 代理系统,提供可扩展且一致的死亡证明,同时保持专家级的严谨性。 通过将 AI 的高效性与人工监督相结合,这项工作有望在缓解法医体系关键能力限制的同时,推动人人平等获取可靠司法医学服务。
Subjects: Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning, Multiagent Systems 学科:人工智能,计算机视觉与模式识别,机器学习,多智能体系统
Publish: 2025-08-11 13:05:59 UTC 发布:2025-08-11 13:05:59 UTC
#9 Deep Reinforcement Learning with anticipatory reward in LSTM for Collision Avoidance of Mobile Robots #9 在 LSTM 中用有预期奖励的深度强化学习用于移动机器人碰撞避免
Authors: [Olivier Poulet](https://arxiv.org/search/?searchtype=author&query=Olivier Poulet), [Frédéric Guinand](https://arxiv.org/search/?searchtype=author&query=Frédéric Guinand), [François Guérin](https://arxiv.org/search/?searchtype=author&query=François Guérin) 作者:Olivier Poulet、Frédéric Guinand、François Guérin
This article proposes a collision risk anticipation method based on short-term prediction of the agents' positions. A Long Short-Term Memory (LSTM) model, trained on past trajectories, is used to estimate the next position of each robot. This prediction allows us to define an anticipated collision risk by dynamically modulating the reward of a Deep Q-Learning Network (DQN) agent. The approach is tested in a constrained environment, where two robots move without communication or identifiers. Despite a limited sampling frequency (1 Hz), the results show a significant decrease in the number of collisions and improved stability. The proposed method, which is computationally inexpensive, appears particularly attractive for implementation on embedded systems. 本文提出了一种基于对智能体位置短期预测的碰撞风险预判方法。采用基于历史轨迹训练的长短期记忆(LSTM)模型来估计每个机器人下一时刻的位置。该预测使我们能够通过动态调整深度 Q 学习网络(DQN)智能体的奖励来定义预期的碰撞风险。该方法在受限环境中进行了测试,场景中两个机器人在无通信且无标识的情况下移动。尽管采样频率较低(1 Hz),结果显示碰撞次数显著减少且稳定性有所提升。该方法计算开销低,尤其适合在嵌入式系统上实现。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-08-11 12:55:51 UTC 发布:2025-08-11 12:55:51 UTC
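The reward modulation described in entry #9 can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: a constant-velocity extrapolation stands in for the trained LSTM predictor, and the linear risk ramp, `safe_dist`, and `penalty` are hypothetical choices, not values from the paper.

```python
import math

def predict_next(trajectory):
    """Stand-in for the LSTM: extrapolate the last step (constant velocity)."""
    (x0, y0), (x1, y1) = trajectory[-2], trajectory[-1]
    return (2 * x1 - x0, 2 * y1 - y0)

def anticipated_risk(pred_self, pred_other, safe_dist=1.0):
    """Risk in [0, 1]: 0 beyond safe_dist, rising linearly as predictions get closer."""
    d = math.dist(pred_self, pred_other)
    return max(0.0, 1.0 - d / safe_dist)

def shaped_reward(base_reward, pred_self, pred_other, penalty=5.0, safe_dist=1.0):
    """Subtract a penalty proportional to the anticipated collision risk."""
    return base_reward - penalty * anticipated_risk(pred_self, pred_other, safe_dist)
```

The shaped reward would replace the raw environment reward in the DQN update, so the agent is penalized before a collision actually happens.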
#10 X-evolve: Solution space evolution powered by large language models #10 X-evolve:由大型语言模型驱动的解空间演化
Authors: [Yi Zhai](https://arxiv.org/search/?searchtype=author&query=Yi Zhai), [Zhiqiang Wei](https://arxiv.org/search/?searchtype=author&query=Zhiqiang Wei), [Ruohan Li](https://arxiv.org/search/?searchtype=author&query=Ruohan Li), [Keyu Pan](https://arxiv.org/search/?searchtype=author&query=Keyu Pan), [Shuo Liu](https://arxiv.org/search/?searchtype=author&query=Shuo Liu), [Lu Zhang](https://arxiv.org/search/?searchtype=author&query=Lu Zhang), [Jianmin Ji](https://arxiv.org/search/?searchtype=author&query=Jianmin Ji), [Wuyang Zhang](https://arxiv.org/search/?searchtype=author&query=Wuyang Zhang), [Yu Zhang](https://arxiv.org/search/?searchtype=author&query=Yu Zhang), [Yanyong Zhang](https://arxiv.org/search/?searchtype=author&query=Yanyong Zhang) 作者:翟毅、魏志强、李若涵、潘克宇、刘硕、张璐、纪建民、张武阳、张宇、张燕勇
While combining large language models (LLMs) with evolutionary algorithms (EAs) shows promise for solving complex optimization problems, current approaches typically evolve individual solutions, often incurring high LLM call costs. We introduce X-evolve, a paradigm-shifting method that instead evolves solution spaces X (sets of individual solutions), i.e., subsets of the overall search space S. In X-evolve, LLMs generate tunable programs wherein certain code snippets, designated as parameters, define a tunable solution space. A score-based search algorithm then efficiently explores this parametrically defined space, guided by feedback from objective function scores. This strategy enables broader and more efficient exploration, which can potentially accelerate convergence at a much lower search cost, requiring up to two orders of magnitude fewer LLM calls than prior leading methods. We demonstrate X-evolve's efficacy across three distinct hard optimization problems. For the cap set problem, we discover a larger partial admissible set, establishing a new tighter asymptotic lower bound for the cap set constant (C ≥ 2.2203). In information theory, we uncover a larger independent set for the 15-vertex cycle graph ($C_{15}^{\boxtimes 5}$, size 19,946), thereby raising the known lower bound on its Shannon capacity. Furthermore, for the NP-hard online bin packing problem, we generate heuristics that consistently outperform standard strategies across established benchmarks. By evolving solution spaces, our method considerably improves search effectiveness, making it possible to tackle high-dimensional problems that were previously computationally prohibitive.
虽然将大型语言模型 (LLMs) 与进化算法 (EAs) 结合在一起在解决复杂优化问题方面显示出前景,但当前方法通常进化的是单个解,往往会产生高昂的 LLM 调用成本。我们提出了 X-evolve,这是一种颠覆性的方法,改为进化解空间 X(个体解的集合),即整体搜索空间 S 的子集。在 X-evolve 中,LLMs 生成可调程序,其中某些代码片段被指定为参数,用以定义一个可调的解空间。然后,一个基于评分的搜索算法在目标函数分数的反馈引导下高效地探索该参数化定义的空间。这一策略实现了更广泛且更高效的探索,能够在更低的搜索成本下潜在地加快收敛,所需的 LLM 调用次数比以往领先方法少多达两个数量级。我们在三类不同的困难优化问题上展示了 X-evolve 的有效性。对于帽集问题,我们发现了一个更大的部分可接受集,为帽集常数建立了一个新的更紧的渐近下界(C ≥ 2.2203)。在信息论中,我们为 15 顶点的循环图找到了一个更大的独立集($C_{15}^{\boxtimes 5}$,大小 19,946),从而提高了其香农容量的已知下界。此外,对于 NP-困难的在线装箱问题,我们生成的启发式方法在既有基准测试中稳步优于标准策略。通过演化解空间,我们的方法显著提高了搜索效率,使得处理此前在计算上不可行的高维问题成为可能。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-08-11 12:47:59 UTC 发布:2025-08-11 12:47:59 UTC
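To make the idea of a "tunable program" in entry #10 concrete, here is a toy sketch for online bin packing. It is not the authors' implementation: the parameter names in `SPACE`, the priority formula, and the use of exhaustive enumeration in place of the paper's score-based search are all assumptions made for illustration. The point it shows is that one generated program defines a whole set of candidate heuristics, which can be scored without further LLM calls.

```python
import itertools

# Hypothetical tunable parameters; in X-evolve these would be code snippets
# marked as tunable by the LLM. Here they are plain numeric choices.
SPACE = {"fit_weight": [0.0, 1.0, 2.0], "tight_bonus": [0.0, 0.5]}

def pack(items, capacity, fit_weight, tight_bonus):
    """Online bin packing: put each item in the feasible bin with the highest priority."""
    bins = []  # remaining capacity of each open bin
    for item in items:
        best, best_score = None, None
        for i, rem in enumerate(bins):
            if rem >= item:
                left = rem - item
                score = -fit_weight * left + (tight_bonus if left == 0 else 0.0)
                if best_score is None or score > best_score:
                    best, best_score = i, score
        if best is None:
            bins.append(capacity - item)  # open a new bin
        else:
            bins[best] -= item
    return len(bins)

def search(items, capacity):
    """Score every point of the parameter space and keep the one using the fewest bins."""
    best = None
    for values in itertools.product(*SPACE.values()):
        params = dict(zip(SPACE, values))
        score = pack(items, capacity, **params)
        if best is None or score < best[0]:
            best = (score, params)
    return best
```

For larger spaces, enumeration would be replaced by a sampling strategy that favors parameter values associated with good scores, which is closer in spirit to the score-based search the abstract describes.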
#11 KIRETT: Knowledge-Graph-Based Smart Treatment Assistant for Intelligent Rescue Operations #11 KIRETT:基于知识图谱的智能救援作业智能处置助手
Authors: [Mubaris Nadeem](https://arxiv.org/search/?searchtype=author&query=Mubaris Nadeem), [Johannes Zenkert](https://arxiv.org/search/?searchtype=author&query=Johannes Zenkert), [Lisa Bender](https://arxiv.org/search/?searchtype=author&query=Lisa Bender), [Christian Weber](https://arxiv.org/search/?searchtype=author&query=Christian Weber), [Madjid Fathi](https://arxiv.org/search/?searchtype=author&query=Madjid Fathi) 作者:Mubaris Nadeem、Johannes Zenkert、Lisa Bender、Christian Weber、Madjid Fathi
Over the years, the need for rescue operations throughout the world has increased rapidly. Demographic changes and the resulting risk of injury or health disorders form the basis for emergency calls. In such scenarios, first responders are in a rush to reach the patient in need, provide first aid, and save lives. In these situations, they must be able to provide personalized and optimized healthcare in the shortest possible time and estimate the patient's condition with the help of freshly recorded vital data in an emergency situation. However, in such a time-dependent situation, first responders and medical experts cannot fully draw on their knowledge and need assistance and recommendations for further medical treatments. To achieve this, knowledge that is calculated, evaluated, and processed on the spot must be made available to improve treatments by first responders. The Knowledge Graph presented in this article as a central knowledge representation provides first responders with an innovative knowledge management that enables intelligent treatment recommendations with an artificial intelligence-based pre-recognition of the situation. 多年来,全球对救援行动的需求迅速增加。人口结构变化及由此产生的受伤或健康障碍风险构成了紧急呼救的基础。在此类场景中,急救人员争分夺秒赶到需要帮助的患者身边,提供急救并挽救生命。在这些情况下,他们必须能够在最短时间内提供个性化和优化的医疗,并在紧急情况下借助新近记录的生命体征数据评估患者状况。然而,在这种时间敏感的情形下,急救人员和医疗专家无法完全掌握所有知识,因而需要在进一步医疗处理方面的辅助与建议。为实现这一点,必须提供现场计算、评估和处理后的知识,以便改善急救人员的治疗。本文章所提出的知识图作为一种中心化的知识表示,为急救人员提供了一种创新的知识管理,能够通过基于人工智能的情境预识别,提供智能的治疗建议。
Subjects: Artificial Intelligence, Emerging Technologies 主题:人工智能,前沿技术
Publish: 2025-08-11 10:39:15 UTC 发布:2025-08-11 10:39:15 UTC
#12 Best-Effort Policies for Robust Markov Decision Processes #12 最佳努力策略用于鲁棒马尔可夫决策过程
Authors: [Alessandro Abate](https://arxiv.org/search/?searchtype=author&query=Alessandro Abate), [Thom Badings](https://arxiv.org/search/?searchtype=author&query=Thom Badings), [Giuseppe De Giacomo](https://arxiv.org/search/?searchtype=author&query=Giuseppe De Giacomo), [Francesco Fabiano](https://arxiv.org/search/?searchtype=author&query=Francesco Fabiano) 作者:Alessandro Abate、Thom Badings、Giuseppe De Giacomo、Francesco Fabiano
We study the common generalization of Markov decision processes (MDPs) with sets of transition probabilities, known as robust MDPs (RMDPs). A standard goal in RMDPs is to compute a policy that maximizes the expected return under an adversarial choice of the transition probabilities. If the uncertainty in the probabilities is independent between the states, known as s-rectangularity, such optimal robust policies can be computed efficiently using robust value iteration. However, there might still be multiple optimal robust policies, which, while equivalent with respect to the worst-case, reflect different expected returns under non-adversarial choices of the transition probabilities. Hence, we propose a refined policy selection criterion for RMDPs, drawing inspiration from the notions of dominance and best-effort in game theory. Instead of seeking a policy that only maximizes the worst-case expected return, we additionally require the policy to achieve a maximal expected return under different (i.e., not fully adversarial) transition probabilities. We call such a policy an optimal robust best-effort (ORBE) policy. We prove that ORBE policies always exist, characterize their structure, and present an algorithm to compute them with a small overhead compared to standard robust value iteration. ORBE policies offer a principled tie-breaker among optimal robust policies. Numerical experiments show the feasibility of our approach. 我们研究了具有一组转移概率的马尔可夫决策过程(MDP)的常见推广,称为鲁棒 MDP(RMDP)。RMDP 中的一个标准目标是计算一个策略,使其在对手选择转移概率的对抗情况下最大化期望回报。如果概率的不确定性在各状态之间相互独立,这被称为 s-可分解性(s-rectangularity),则可以使用鲁棒值迭代高效地计算这样的最优鲁棒策略。然而,仍可能存在多个最优鲁棒策略,这些策略在最坏情况下虽然等价,但在非对抗性选择的转移概率下呈现不同的期望回报。因此,我们提出了一个针对 RMDP 的精细策略选择标准,借鉴了博弈论中支配和尽力而为(best-effort)的概念。我们不仅寻求一个仅在最坏情况下最大化期望回报的策略,还要求该策略在不同(即非完全对抗的)转移概率下实现最大的期望回报。我们将这样的策略称为最优鲁棒尽力(ORBE)策略。 我们证明了 ORBE 策略总是存在,刻画了其结构,并提出了一种算法来计算它们,该算法相比标准鲁棒值迭代只需付出很小的开销。ORBE 策略在最优鲁棒策略之间提供了一个有原则的决胜标准。数值实验展示了我们方法的可行性。
Subjects: Artificial Intelligence, Logic in Computer Science 主题:人工智能、计算机科学中的逻辑
Publish: 2025-08-11 09:18:34 UTC 发布:2025-08-11 09:18:34 UTC
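The robust value iteration mentioned in entry #12, and the ties it can leave between optimal robust actions, can be seen in a small sketch. The assumptions are stated plainly: the uncertainty set for each state-action pair is a finite list of candidate distributions (a simple special case of rectangular uncertainty), and the data layout `P[s][a]` and `R[s][a]` is hypothetical. This is standard robust value iteration only; it does not implement the ORBE selection criterion from the paper.

```python
def q_robust(P, R, V, s, a, gamma):
    """Worst-case Q-value: the adversary picks the candidate distribution minimizing the return."""
    n = len(V)
    worst = min(sum(p[t] * V[t] for t in range(n)) for p in P[s][a])
    return R[s][a] + gamma * worst

def robust_value_iteration(P, R, gamma=0.9, iters=200):
    """P[s][a] is a list of candidate next-state distributions; R[s][a] is a reward."""
    V = [0.0] * len(P)
    for _ in range(iters):
        V = [max(q_robust(P, R, V, s, a, gamma) for a in range(len(P[s])))
             for s in range(len(P))]
    return V

def robust_optimal_actions(P, R, V, gamma=0.9, tol=1e-9):
    """All actions per state whose worst-case value matches the robust optimum."""
    return [[a for a in range(len(P[s]))
             if q_robust(P, R, V, s, a, gamma) >= V[s] - tol]
            for s in range(len(P))]

# Toy RMDP: state 1 is absorbing with zero reward. In state 0, action 0 always
# moves to state 1; action 1 may either move to state 1 or stay in state 0.
P = [[[[0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]],
     [[[0.0, 1.0]]]]
R = [[1.0, 1.0], [0.0]]
V = robust_value_iteration(P, R)
```

In this toy model both actions in state 0 have the same worst-case value, yet action 1 does strictly better whenever the adversary does not pick its worst transition. That kind of tie is what a refined selection criterion such as ORBE is meant to resolve.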
#13 Symmetry-Aware Transformer Training for Automated Planning #13 面向对称性的 Transformer 训练用于自动规划
Authors: [Markus Fritzsche](https://arxiv.org/search/?searchtype=author&query=Markus Fritzsche), [Elliot Gestrin](https://arxiv.org/search/?searchtype=author&query=Elliot Gestrin), [Jendrik Seipp](https://arxiv.org/search/?searchtype=author&query=Jendrik Seipp) 作者:Markus Fritzsche、Elliot Gestrin、Jendrik Seipp
While transformers excel in many settings, their application in the field of automated planning is limited. Prior work like PlanGPT, a state-of-the-art decoder-only transformer, struggles with extrapolation from easy to hard planning problems. This in turn stems from problem symmetries: planning tasks can be represented with arbitrary variable names that carry no meaning beyond being identifiers. This causes a combinatorial explosion of equivalent representations that pure transformers cannot efficiently learn from. We propose a novel contrastive learning objective to make transformers symmetry-aware and thereby compensate for their lack of inductive bias. Combining this with architectural improvements, we show that transformers can be efficiently trained for either plan-generation or heuristic-prediction. Our results across multiple planning domains demonstrate that our symmetry-aware training effectively and efficiently addresses the limitations of PlanGPT. 虽然变换器在许多场景中表现出色,但它们在自动规划领域的应用有限。先前的工作如 PlanGPT(一种最先进的仅解码器变换器)在从简单到复杂规划问题的外推方面表现不佳。这反过来源于问题的对称性:规划任务可以用任意变量名表示,这些变量名除了作为标识符外没有其它含义。这导致等价表示的组合爆炸,而纯变换器无法高效地从中学习。我们提出了一种新颖的对比学习目标,使变换器具备对称性感知性,从而弥补其归纳偏置的缺失。将其与架构改进相结合,我们展示了变换器可以高效地训练用于计划生成或启发式预测。我们在多个规划域的结果表明,我们的对称性感知训练有效且高效地解决了 PlanGPT 的局限性。
Subjects: Artificial Intelligence, Machine Learning 主题:人工智能、机器学习
Publish: 2025-08-11 08:23:34 UTC 发布:2025-08-11 08:23:34 UTC
#14 Ethics2vec: aligning automatic agents and human preferences #14 Ethics2vec:使自动智能体与人类偏好保持一致
Author: [Gianluca Bontempi](https://arxiv.org/search/?searchtype=author&query=Gianluca Bontempi) 作者:Gianluca Bontempi
Though intelligent agents are supposed to improve human experience (or make it more efficient), it is hard from a human perspective to grasp the ethical values which are explicitly or implicitly embedded in an agent's behaviour. This is the well-known problem of alignment, which refers to the challenge of designing AI systems that align with human values, goals and preferences. This problem is particularly challenging since most human ethical considerations refer to incommensurable (i.e. non-measurable and/or incomparable) values and criteria. Consider, for instance, a medical agent prescribing a treatment to a cancer patient. How could it take into account (and/or weigh) incommensurable aspects like the value of a human life and the cost of the treatment? Now, the alignment between human and artificial values is possible only if we define a common space where a metric can be defined and used. This paper proposes to extend to ethics the conventional Anything2vec approach, which has been successful in plenty of similar and hard-to-quantify domains (ranging from natural language processing to recommendation systems and graph analysis). This paper proposes a way to map an automatic agent's decision-making (or control law) strategy to a multivariate vector representation, which can be used to compare and assess the alignment with human values. The Ethics2Vec method is first introduced in the case of an automatic agent performing binary decision-making. Then, a vectorisation of an automatic control law (like in the case of a self-driving car) is discussed to show how the approach can be extended to automatic control settings.
尽管智能代理本应改善人类体验(或提高效率),但从人的角度很难把握明确或隐含于代理行为中的伦理价值观。这就是众所周知的对齐问题,指的是设计与人类价值观、目标和偏好相一致的人工智能系统的挑战。该问题尤为棘手,因为大多数人类伦理考量涉及不可通约(即不可测量和/或不可比)的价值和标准。例如,考虑一个为癌症病人开处方的医疗代理。它如何考虑(和/或权衡)像人命的价值与治疗费用这样不可通约的方面?现在,只有当我们定义了一个可以建立并使用度量的共同空间时,人类与人工价值之间的对齐才有可能。本文提出将传统的 Anything2vec 方法扩展到伦理学领域,该方法在许多类似且难以量化的领域(从自然语言处理到推荐系统和图分析)中都取得了成功。 本文提出了一种将自动智能体的决策制定(或控制律)策略映射为多变量向量表示的方法,该表示可用于比较并评估与人类价值观的一致性。首先在自动智能体执行二元决策的情况下介绍了 Ethics2Vec 方法。随后讨论了自动控制律(如自动驾驶汽车情形)的向量化,以展示该方法如何扩展到自动控制场景。
Subjects: Artificial Intelligence, Machine Learning 主题:人工智能、机器学习
Publish: 2025-08-11 06:52:46 UTC 发布:2025-08-11 06:52:46 UTC
#15 EMPATHIA: Multi-Faceted Human-AI Collaboration for Refugee Integration #15 EMPATHIA:面向难民融合的多维人类-人工智能协作
Authors: [Mohamed Rayan Barhdadi](https://arxiv.org/search/?searchtype=author&query=Mohamed Rayan Barhdadi), [Mehmet Tuncel](https://arxiv.org/search/?searchtype=author&query=Mehmet Tuncel), [Erchin Serpedin](https://arxiv.org/search/?searchtype=author&query=Erchin Serpedin), [Hasan Kurban](https://arxiv.org/search/?searchtype=author&query=Hasan Kurban) 作者:Mohamed Rayan Barhdadi、Mehmet Tuncel、Erchin Serpedin、Hasan Kurban
Current AI approaches to refugee integration optimize narrow objectives such as employment and fail to capture the cultural, emotional, and ethical dimensions critical for long-term success. We introduce EMPATHIA (Enriched Multimodal Pathways for Agentic Thinking in Humanitarian Immigrant Assistance), a multi-agent framework addressing the central Creative AI question: how do we preserve human dignity when machines participate in life-altering decisions? Grounded in Kegan’s Constructive Developmental Theory, EMPATHIA decomposes integration into three modules: SEED (Socio-cultural Entry and Embedding Decision) for initial placement, RISE (Rapid Integration and Self-sufficiency Engine) for early independence, and THRIVE (Transcultural Harmony and Resilience through Integrated Values and Engagement) for sustained outcomes. SEED employs a selector-validator architecture with three specialized agents - emotional, cultural, and ethical - that deliberate transparently to produce interpretable recommendations. Experiments on the UN Kakuma dataset (15,026 individuals, 7,960 eligible adults 15+ per ILO/UNHCR standards) and implementation on 6,359 working-age refugees (15+) with 150+ socioeconomic variables achieved 87.4% validation convergence and explainable assessments across five host countries. EMPATHIA’s weighted integration of cultural, emotional, and ethical factors balances competing value systems while supporting practitioner-AI collaboration. By augmenting rather than replacing human expertise, EMPATHIA provides a generalizable framework for AI-driven allocation tasks where multiple values must be reconciled. 
当前用于难民融入的人工智能方法优化的是就业等狭窄目标,未能涵盖对长期成功至关重要的文化、情感和伦理维度。我们提出了 EMPATHIA(用于人道移民援助中具能动性思维的丰富多模态路径),这是一个多智能体框架,旨在解决核心的创造性人工智能问题:当机器参与改变人生的决策时,我们如何维护人的尊严?EMPATHIA 以凯根的建构性发展理论为基础,将融入过程分解为三大模块:用于初始安置的 SEED(社会文化进入与嵌入决策)、用于早期独立的 RISE(快速融入与自给引擎)以及用于持续成果的 THRIVE(通过整合价值与参与实现超文化和谐与韧性)。SEED 采用选择-验证器架构,包含三个专门智能体——情感、文化和伦理——它们以透明方式共同审议以产生可解释的建议。 在联合国卡库马数据集(15,026 人,其中符合国际劳工组织/联合国难民署标准的 15 岁及以上成年人为 7,960 人)上的实验,以及在 6,359 名适龄难民(15 岁及以上)上使用包含 150 多个社会经济变量的实现,达到了 87.4% 的验证收敛率,并在五个接待国中给出可解释的评估。EMPATHIA 对文化、情感和伦理因素的加权整合在支持实践者与人工智能协作的同时,平衡了相互竞争的价值体系。通过增强而非替代人类专业知识,EMPATHIA 为需要调和多重价值的人工智能驱动分配任务提供了一个可推广的框架。
Subjects: Artificial Intelligence, Computers and Society, Human-Computer Interaction, Multiagent Systems, Applications 学科:人工智能,计算机与社会,人机交互,多智能体系统,应用
Publish: 2025-08-11 06:50:55 UTC 发布:2025-08-11 06:50:55 UTC
#16 1-2-3 Check: Enhancing Contextual Privacy in LLM via Multi-Agent Reasoning #16 1-2-3 检查:通过多智能体推理在 LLM 中增强上下文隐私
Authors: [Wenkai Li](https://arxiv.org/search/?searchtype=author&query=Wenkai Li), [Liwen Sun](https://arxiv.org/search/?searchtype=author&query=Liwen Sun), [Zhenxiang Guan](https://arxiv.org/search/?searchtype=author&query=Zhenxiang Guan), [Xuhui Zhou](https://arxiv.org/search/?searchtype=author&query=Xuhui Zhou), [Maarten Sap](https://arxiv.org/search/?searchtype=author&query=Maarten Sap) 作者:李文凯、孙立文、管振湘、周旭辉、Maarten Sap
Addressing contextual privacy concerns remains challenging in interactive settings where large language models (LLMs) process information from multiple sources (e.g., summarizing meetings with private and public information). We introduce a multi-agent framework that decomposes privacy reasoning into specialized subtasks (extraction, classification), reducing the information load on any single agent while enabling iterative validation and more reliable adherence to contextual privacy norms. To understand how privacy errors emerge and propagate, we conduct a systematic ablation over information-flow topologies, revealing when and why upstream detection mistakes cascade into downstream leakage. Experiments on the ConfAIde and PrivacyLens benchmarks with several open-source and closed-source LLMs demonstrate that our best multi-agent configuration substantially reduces private information leakage (18% on ConfAIde and 19% on PrivacyLens with GPT-4o) while preserving the fidelity of public content, outperforming single-agent baselines. These results highlight the promise of principled information-flow design in multi-agent systems for contextual privacy with LLMs. 在交互环境中处理来自多个来源的信息(例如在总结包含私人和公共信息的会议时),解决情境隐私问题仍然具有挑战性。我们提出了一个多代理框架,将隐私推理分解为专门的子任务(提取、分类),从而减少任何单一代理的信息负担,同时实现迭代验证并更可靠地遵守情境隐私规范。为了解隐私错误如何产生和传播,我们对信息流拓扑结构进行了系统性消融,揭示了上游检测错误何时以及为何会级联导致下游泄露。在使用多款开源与闭源 LLMs 的 ConfAIde 和 PrivacyLens 基准上的实验表明,我们最佳的多代理配置在保持公共内容真实性的同时,大幅减少了私人信息泄露(在 ConfAIde 上降低了 18%,在 PrivacyLens 上使用 GPT-4o 降低了 19%),并优于单代理基线。这些结果凸显了在面向 LLMs 的情境隐私多代理系统中,基于原则的信息流设计的前景。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-08-11 06:34:09 UTC 发布:2025-08-11 06:34:09 UTC
#17 Disentangling Multiplex Spatial-Temporal Transition Graph Representation Learning for Socially Enhanced POI Recommendation #17 解开多重空间-时间迁移图表示学习用于社会增强的兴趣点推荐
Authors: [Jie Li](https://arxiv.org/search/?searchtype=author&query=Jie Li), [Haoye Dong](https://arxiv.org/search/?searchtype=author&query=Haoye Dong), [Zhengyang Wu](https://arxiv.org/search/?searchtype=author&query=Zhengyang Wu), [Zetao Zheng](https://arxiv.org/search/?searchtype=author&query=Zetao Zheng), [Mingrong Lin](https://arxiv.org/search/?searchtype=author&query=Mingrong Lin) 作者:李捷、董浩烨、吴正阳、郑泽陶、林明荣
Next Point-of-Interest (POI) recommendation is a research hotspot in business intelligence, where users’ spatial-temporal transitions and social relationships play key roles. However, most existing works model spatial and temporal transitions separately, leading to misaligned representations of the same spatial-temporal key nodes. This misalignment introduces redundant information during fusion, increasing model uncertainty and reducing interpretability. To address this issue, we propose DiMuST, a socially enhanced POI recommendation model based on disentangled representation learning over multiplex spatial-temporal transition graphs. The model employs a novel Disentangled variational multiplex graph Auto-Encoder (DAE), which first disentangles shared and private distributions using a multiplex spatial-temporal graph strategy. It then fuses the shared features via a Product of Experts (PoE) mechanism and denoises the private features through contrastive constraints. The model effectively captures the spatial-temporal transition representations of POIs while preserving the intrinsic correlation of their spatial-temporal relationships. Experiments on two challenging datasets demonstrate that our DiMuST significantly outperforms existing methods across multiple metrics. 下一兴趣点(POI)推荐是商业智能领域的一个研究热点,用户的时空迁移和社会关系在其中起关键作用。然而,大多数现有工作将空间和时间迁移分别建模,导致相同时空关键节点的表示不对齐。这种不对齐在融合时引入了冗余信息,增加了模型不确定性并降低了解释性。为了解决该问题,我们提出了 DiMuST,一种基于在多重时空迁移图上进行解缠表示学习的社会增强 POI 推荐模型。该模型采用一种新颖的解缠变分多重图自编码器(DAE),首先通过多重时空图策略解缠共享分布和私有分布;然后通过专家乘积(PoE)机制融合共享特征,并通过对比约束对私有特征进行去噪。该模型在保留时空关系内在相关性的同时,有效捕捉了 POI 的时空迁移表示。 在两个具有挑战性的数据集上的实验表明,我们的 DiMuST 在多项指标上显著优于现有方法。
Subjects: Artificial Intelligence, Machine Learning 主题:人工智能、机器学习
Publish: 2025-08-11 06:00:20 UTC 发布:2025-08-11 06:00:20 UTC
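The Product of Experts (PoE) fusion named in entry #17 has a simple closed form when each expert is a one-dimensional Gaussian. The sketch below assumes exactly that setting, which the abstract does not specify; the function name and the scalar representation are illustrative only. The product of Gaussian densities is again Gaussian, with precision equal to the sum of the experts' precisions and a precision-weighted mean.

```python
def product_of_experts(means, variances):
    """Fuse 1-D Gaussian experts: precisions add, and the mean is precision-weighted."""
    precisions = [1.0 / v for v in variances]
    total = sum(precisions)
    mean = sum(m * p for m, p in zip(means, precisions)) / total
    return mean, 1.0 / total
```

For example, two unit-variance experts centered at 0 and 2 fuse to a mean of 1 with variance 0.5, so the fused estimate is more confident than either expert alone.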
#18 Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents #18 拆解与重建:基于技能混合的视觉与语言导航智能体
Authors: [Tianyi Ma](https://arxiv.org/search/?searchtype=author&query=Tianyi Ma), [Yue Zhang](https://arxiv.org/search/?searchtype=author&query=Yue Zhang), [Zehao Wang](https://arxiv.org/search/?searchtype=author&query=Zehao Wang), [Parisa Kordjamshidi](https://arxiv.org/search/?searchtype=author&query=Parisa Kordjamshidi) 作者:马天一,张越,王泽昊,Parisa Kordjamshidi
Vision-and-Language Navigation (VLN) poses significant challenges in enabling agents to interpret natural language instructions and navigate complex 3D environments. While recent progress has been driven by large-scale pre-training and data augmentation, current methods still struggle to generalize to unseen scenarios, particularly when complex spatial and temporal reasoning is required. In this work, we propose SkillNav, a modular framework that introduces structured, skill-based reasoning into Transformer-based VLN agents. Our method decomposes navigation into a set of interpretable atomic skills (e.g., Vertical Movement, Area and Region Identification, Stop and Pause), each handled by a specialized agent. We then introduce a novel zero-shot Vision-Language Model (VLM)-based router, which dynamically selects the most suitable agent at each time step by aligning sub-goals with visual observations and historical actions. SkillNav achieves a new state-of-the-art performance on the R2R benchmark and demonstrates strong generalization to the GSA-R2R benchmark that includes novel instruction styles and unseen environments. 视觉与语言导航(VLN)在使智能体理解自然语言指令并在复杂的三维环境中导航方面提出了重大挑战。尽管近期依赖大规模预训练和数据增强取得了进展,但现有方法在推广到未见场景时仍然困难重重,尤其是在需要复杂空间和时间推理的情况下。在本工作中,我们提出了 SkillNav,一个将结构化、基于技能的推理引入基于 Transformer 的 VLN 智能体的模块化框架。我们的方法将导航分解为一组可解释的原子技能(例如,垂直移动、区域与区域识别、停止与暂停),每个技能由专门的智能体处理。随后我们引入了一种新颖的零样本视觉-语言模型(VLM)路由器,该路由器通过将子目标与视觉观测和历史行为对齐,在每个时间步动态选择最合适的智能体。SkillNav 在 R2R 基准上达到了新的最先进性能,并在包含新颖指令风格和未见环境的 GSA-R2R 基准上展示了强大的泛化能力。
Subjects: Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition 主题:人工智能、计算与语言、计算机视觉与模式识别
Publish: 2025-08-11 05:50:30 UTC 发布:2025-08-11 05:50:30 UTC
#19 Multimodal AI Systems for Enhanced Laying Hen Welfare Assessment and Productivity Optimization #19 多模态人工智能系统用于增强产蛋母鸡福利评估与生产力优化
Authors: [Daniel Essien](https://arxiv.org/search/?searchtype=author&query=Daniel Essien), [Suresh Neethirajan](https://arxiv.org/search/?searchtype=author&query=Suresh Neethirajan) 作者:Daniel Essien、Suresh Neethirajan
The future of poultry production depends on a paradigm shift replacing subjective, labor-intensive welfare checks with data-driven, intelligent monitoring ecosystems. Traditional welfare assessments, limited by human observation and single-sensor data, cannot fully capture the complex, multidimensional nature of laying hen welfare in modern farms. Multimodal Artificial Intelligence (AI) offers a breakthrough, integrating visual, acoustic, environmental, and physiological data streams to reveal deeper insights into avian welfare dynamics. This investigation highlights multimodal AI's transformative potential, showing that intermediate (feature-level) fusion strategies achieve the best balance between robustness and performance under real-world poultry conditions, and offer greater scalability than early or late fusion approaches. Key adoption barriers include sensor fragility in harsh farm environments, high deployment costs, inconsistent behavioral definitions, and limited cross-farm generalizability. To address these, we introduce two novel evaluation tools: the Domain Transfer Score (DTS) to measure model adaptability across diverse farm settings, and the Data Reliability Index (DRI) to assess sensor data quality under operational constraints. We also propose a modular, context-aware deployment framework designed for laying hen environments, enabling scalable and practical integration of multimodal sensing. This work lays the foundation for a transition from reactive, unimodal monitoring to proactive, precision-driven welfare systems that unite productivity with ethical, science-based animal care.
家禽生产的未来依赖于一种范式转变,即用数据驱动的智能监测生态系统取代主观且劳动密集的福利检查。传统的福利评估——受限于人工观察和单一传感器数据——无法充分捕捉现代养殖场中产蛋鸡福利的复杂、多维特性。多模态人工智能(AI)提供了突破,通过整合视觉、声学、环境和生理数据流来揭示更深层次的禽类福利动态。本研究强调了多模态 AI 的变革性潜力,表明中间(特征级)融合策略在真实养鸡条件下在稳健性与性能之间取得了最佳平衡,并且比早期或晚期融合方法具有更大的可扩展性。主要采用障碍包括传感器在恶劣养殖环境中的易损性、高昂的部署成本、不一致的行为定义以及有限的跨场通用性。 为了解决这些问题,我们引入了两种新颖的评估工具——领域迁移得分(DTS),用于衡量模型在不同养殖环境中的适应性,以及数据可靠性指数(DRI),用于在操作约束下评估传感器数据质量。我们还提出了一个面向产蛋鸡环境的模块化、情境感知部署框架,支持多模态传感的可扩展与实用整合。这项工作为从被动的单一模式监测向主动的、以精确度为驱动的福利系统转变奠定了基础,该系统将生产力与基于科学的伦理动物护理结合起来。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-08-11 05:17:16 UTC 发布:2025-08-11 05:17:16 UTC
#20 ThinkTuning: Instilling Cognitive Reflections without Distillation #20 ThinkTuning:在不进行蒸馏的情况下灌输认知反思
Authors: [Aswin RRV](https://arxiv.org/search/?searchtype=author&query=Aswin RRV), [Jacob Dineen](https://arxiv.org/search/?searchtype=author&query=Jacob Dineen), [Divij Handa](https://arxiv.org/search/?searchtype=author&query=Divij Handa), [Md Nayem Uddin](https://arxiv.org/search/?searchtype=author&query=Md Nayem Uddin), [Mihir Parmar](https://arxiv.org/search/?searchtype=author&query=Mihir Parmar), [Chitta Baral](https://arxiv.org/search/?searchtype=author&query=Chitta Baral), [Ben Zhou](https://arxiv.org/search/?searchtype=author&query=Ben Zhou) 作者:Aswin RRV、Jacob Dineen、Divij Handa、Md Nayem Uddin、Mihir Parmar、Chitta Baral、Ben Zhou
Recent advances in test-time scaling have led to the emergence of thinking LLMs that exhibit self-reflective behaviors and multi-step reasoning. While RL drives this self-improvement paradigm, a recent study (Gandhi et al., 2025) shows that RL alone does not truly instill these new reasoning abilities - it merely draws out behaviors already present in the base models. This raises a question: How can we train the models that don’t exhibit such thinking behavior to develop it in the first place? To this end, we propose ThinkTuning, a GRPO-based interactive training approach where we augment the rollouts of a student model with the guidance from a teacher model. A simple idea from classroom practice inspires our method: a teacher poses a problem, lets the student try an answer, then gives corrective feedback – enough to point the mind in the right direction and then show the solution. Each piece of feedback reshapes the student’s thoughts, leading them to arrive at the correct solution. Similarly, we find that this type of implicit supervision through feedback from a teacher model of the same size improves the reasoning capabilities of the student model. In particular, on average, our method shows a 3.85% improvement over zero-shot baselines across benchmarks, and on MATH-500, AIME and GPQA-Diamond it shows 2.08%, 2.23% and 3.99% improvements over the vanilla-GRPO baseline. Source code is available at https://github.com/3rdAT/ThinkTuning. 
最近在推理时扩展(test-time scaling)方面的进展催生了展现自我反思行为和多步推理能力的“会思考”的 LLMs。尽管强化学习推动了这种自我改进范式,但一项最新研究(Gandhi et al., 2025)表明,仅靠强化学习并不能真正赋予这些新的推理能力——它只是将基础模型中已存在的行为显现出来。这就提出了一个问题:我们如何训练那些尚未表现出这种思维行为的模型,使其首先发展出这种能力?为此,我们提出了 ThinkTuning,一种基于 GRPO 的交互式训练方法,在该方法中我们用教师模型的指导来增强学生模型的 rollouts。一个来自课堂实践的简单想法启发了我们的方法:老师提出一个问题,让学生尝试回答,然后给出纠正性反馈——足以将思路引向正确方向,然后展示解法。每一条反馈都会重塑学生的思路,使他们最终得出正确的解答。类似地,我们发现来自同等规模教师模型的这种通过反馈进行的隐式监督,可以提升学生模型的推理能力。 具体而言,平均而言,我们的方法在各基准上比零样本基线提高了 3.85%,在 MATH-500、AIME 和 GPQA-Diamond 上分别比原始 GRPO 基线提高了 2.08%、2.23% 和 3.99%。源代码可在 https://github.com/3rdAT/ThinkTuning 获取。
Subjects: Artificial Intelligence, Computation and Language, Machine Learning 主题:Artificial Intelligence , Computation and Language , Machine Learning
Publish: 2025-08-11 04:51:43 UTC 发布:2025-08-11 04:51:43 UTC
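The GRPO building block that ThinkTuning (#20) is described as using can be sketched roughly as below: each rollout's reward is normalized against the mean and standard deviation of its own sampling group. The function name, the epsilon term, the use of the population standard deviation, and the binary rewards are illustrative assumptions. The abstract does not say how teacher-guided rollouts are weighted, so that part is not modeled here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own rollout group (GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group (assumption)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group of four rollouts for one prompt: 1.0 = correct, 0.0 = wrong.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct rollouts receive a positive advantage and incorrect ones a negative advantage of equal magnitude; a group in which every rollout earns the same reward yields all-zero advantages.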
#21 HGMF: A Hierarchical Gaussian Mixture Framework for Scalable Tool Invocation within the Model Context Protocol #21 HGMF:一种用于模型上下文协议中可扩展工具调用的分层高斯混合框架
Authors: [Wenpeng Xing](https://arxiv.org/search/?searchtype=author&query=Wenpeng Xing), [Zhipeng Chen](https://arxiv.org/search/?searchtype=author&query=Zhipeng Chen), [Changting Lin](https://arxiv.org/search/?searchtype=author&query=Changting Lin), [Meng Han](https://arxiv.org/search/?searchtype=author&query=Meng Han) 作者:邢文鹏,陈志鹏,林长廷,韩萌
Invoking external tools enables Large Language Models (LLMs) to perform complex, real-world tasks, yet selecting the correct tool from large, hierarchically-structured libraries remains a significant challenge. The limited context windows of LLMs and noise from irrelevant options often lead to low selection accuracy and high computational costs. To address this, we propose the Hierarchical Gaussian Mixture Framework (HGMF), a probabilistic pruning method for scalable tool invocation. HGMF first maps the user query and all tool descriptions into a unified semantic space. The framework then operates in two stages: it clusters servers using a Gaussian Mixture Model (GMM) and filters them based on the query’s likelihood. Subsequently, it applies the same GMM-based clustering and filtering to the tools associated with the selected servers. This hierarchical process produces a compact, high-relevance candidate set, simplifying the final selection task for the LLM. Experiments on a public dataset show that HGMF significantly improves tool selection accuracy while reducing inference latency, confirming the framework’s scalability and effectiveness for large-scale tool libraries. 调用外部工具使大型语言模型(LLMs)能够执行复杂的现实任务,但从大型的分层结构库中选择正确的工具仍然是一个重大挑战。LLMs 的有限上下文窗口和来自无关选项的噪音常常导致选择准确率低且计算成本高。为了解决这一问题,我们提出了分层高斯混合框架(Hierarchical Gaussian Mixture Framework,HGMF),一种用于可扩展工具调用的概率剪枝方法。HGMF 首先将用户查询和所有工具描述映射到一个统一的语义空间。该框架随后分两阶段运行:它使用高斯混合模型(GMM)对服务器进行聚类,并基于查询的似然性对其进行过滤。随后,对所选服务器所关联的工具应用相同的基于 GMM 的聚类和过滤。该分层过程生成了一个紧凑且高度相关的候选集合,从而简化了 LLM 的最终选择任务。在公开数据集上的实验表明,HGMF 在提高工具选择准确性同时降低推理延迟方面有显著提升,验证了该框架在大规模工具库中的可扩展性和有效性。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-08-11 04:13:06 UTC 发布时间:2025-08-11 04:13:06 UTC
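The two-stage pruning described for HGMF (#21) might look roughly like the following toy sketch: servers are filtered first, then the tools of the surviving servers, in both cases by checking which Gaussian component best explains the query. Everything concrete here is an assumption made for illustration: the 2-D embeddings, the hand-set isotropic components (a real system would fit them with EM), the "same component as the query" filtering rule, and the reuse of one mixture for both stages. The abstract does not give the paper's actual filtering criterion.

```python
import math

def log_gauss(x, mean, var):
    """Log-density of an isotropic Gaussian N(mean, var * I) at point x."""
    d = len(x)
    sq = sum((a - b) ** 2 for a, b in zip(x, mean))
    return -0.5 * (d * math.log(2 * math.pi * var) + sq / var)

def best_component(x, components):
    """Index of the component with the highest weighted log-likelihood for x."""
    scores = [math.log(w) + log_gauss(x, m, v) for w, m, v in components]
    return max(range(len(scores)), key=scores.__getitem__)

def prune(query, items, components):
    """Keep the items assigned to the same component as the query."""
    target = best_component(query, components)
    return [name for name, emb in items.items()
            if best_component(emb, components) == target]

# Toy 2-D embeddings; real embeddings would come from a text encoder.
# Components are (weight, mean, variance), hand-set here instead of fitted by EM.
server_components = [(0.5, (0.0, 0.0), 0.1), (0.5, (1.0, 1.0), 0.1)]
servers = {"maps": (0.1, 0.0), "weather": (0.0, 0.2), "finance": (0.9, 1.1)}
tools = {
    "maps": {"geocode": (0.1, 0.1), "route": (0.9, 0.8)},
    "weather": {"forecast": (0.0, 0.1)},
    "finance": {"quote": (1.0, 1.0)},
}
tool_components = server_components  # assumption: reuse the same mixture for tools

query = (0.05, 0.1)
kept_servers = prune(query, servers, server_components)   # stage 1: servers
candidates = [                                             # stage 2: tools
    (s, t) for s in kept_servers for t in prune(query, tools[s], tool_components)
]
```

The LLM would then choose only among `candidates` rather than the full tool library, which is the source of the context and latency savings the abstract reports.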
#22 Optimization of Private Semantic Communication Performance: An Uncooperative Covert Communication Method #22 私有语义通信性能优化:一种不合作的隐蔽通信方法
Authors: [Wenjing Zhang](https://arxiv.org/search/?searchtype=author&query=Wenjing Zhang), [Ye Hu](https://arxiv.org/search/?searchtype=author&query=Ye Hu), [Tao Luo](https://arxiv.org/search/?searchtype=author&query=Tao Luo), [Zhilong Zhang](https://arxiv.org/search/?searchtype=author&query=Zhilong Zhang), [Mingzhe Chen](https://arxiv.org/search/?searchtype=author&query=Mingzhe Chen)
In this paper, a novel covert semantic communication framework is investigated. Within this framework, a server extracts and transmits the semantic information, i.e., the meaning of image data, to a user over several time slots. An attacker seeks to detect and eavesdrop on the semantic transmission to acquire details of the original image. To prevent the data meaning from being eavesdropped by an attacker, a friendly jammer is deployed to transmit jamming signals that interfere with the attacker and thereby hide the transmitted semantic information. Meanwhile, the server strategically selects time slots for semantic information transmission. Due to limited energy, the jammer does not communicate with the server, and hence the server does not know the transmit power of the jammer. Therefore, the server must jointly optimize the semantic information transmitted at each time slot and the corresponding transmit power to maximize the privacy and the semantic information transmission quality of the user. To solve this problem, we propose a prioritised sampling assisted twin delayed deep deterministic policy gradient algorithm to jointly determine the transmitted semantic information and the transmit power per time slot without any communication between the server and the jammer. Compared to standard reinforcement learning methods, the proposed method uses an additional Q network to estimate Q values such that the agent can select the action with a lower Q value from the two Q networks, thus avoiding locally optimal action selection and estimation bias of Q values. Simulation results show that the proposed algorithm can improve the privacy and the semantic information transmission quality by up to 77.8% and 14.3%, respectively, compared to traditional reinforcement learning methods.
在本文中,研究了一种新型的隐蔽语义通信框架。在该框架下,服务器在若干时间时隙内提取并传输语义信息,即图像数据的含义,给用户。攻击者试图检测并窃听语义传输以获取原始图像的细节。为防止数据含义被攻击者窃听,部署了一个友好干扰器来发射干扰信号以干扰攻击者,从而隐藏被传输的语义信息。与此同时,服务器将策略性地选择语义信息的传输时隙。由于能量有限,干扰器不会与服务器通信,因此服务器不知道干扰器的发射功率。因此,服务器必须联合优化每个时隙传输的语义信息及相应的发射功率,以最大化用户的隐私和语义信息传输质量。 为了解决此问题,我们提出了一种优先采样辅助的双延迟深度确定性策略梯度算法,以在服务器与干扰者之间无需通信的情况下联合确定每个时隙传输的语义信息和发射功率。与标准强化学习方法相比,所提出的方法使用一个额外的 Q 网络来估计 Q 值,使智能体能够从两个 Q 网络中选择 Q 值更低的动作,从而避免局部最优动作选择和 Q 值的估计偏差。仿真结果表明,与传统强化学习方法相比,所提出的算法在隐私保护和语义信息传输质量上分别可提升最多 77.8% 和 14.3%。
Subjects: Artificial Intelligence, Networking and Internet Architecture 主题:人工智能、网络与互联网架构
Publish: 2025-08-11 03:31:05 UTC 发布:2025-08-11 03:31:05 UTC
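Two standard ingredients behind the algorithm named in #22 can be sketched as follows: the twin-critic target of twin delayed DDPG, which bootstraps from the smaller of two Q estimates, and proportional prioritised sampling, which draws transitions with probability tied to their TD error. The function names, `alpha`, `eps`, and the numbers are illustrative assumptions; the paper's state, action, and reward definitions are not given in the abstract and are not modeled.

```python
def clipped_double_q_target(reward, gamma, q1_next, q2_next, done):
    """TD3-style target: bootstrap from the smaller of the two critic estimates."""
    bootstrap = 0.0 if done else gamma * min(q1_next, q2_next)
    return reward + bootstrap

def sampling_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Proportional prioritization: P(i) is proportional to (|td_error_i| + eps) ** alpha."""
    priorities = [(abs(e) + eps) ** alpha for e in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]

# Hypothetical transition: the two target critics disagree (5.0 vs 4.0).
target = clipped_double_q_target(
    reward=1.0, gamma=0.99, q1_next=5.0, q2_next=4.0, done=False
)
# Hypothetical TD errors for three stored transitions.
probs = sampling_probabilities([0.1, 2.0, 0.5])
```

Taking the minimum counteracts the overestimation that a single critic tends to accumulate, and the prioritised probabilities make high-error transitions more likely to be replayed.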
#23 MCPToolBench++: A Large Scale AI Agent Model Context Protocol MCP Tool Use Benchmark #23 MCPToolBench++:大规模 AI Agent 模型上下文协议 MCP 工具使用基准
Authors: [Shiqing Fan](https://arxiv.org/search/?searchtype=author&query=Shiqing Fan), [Xichen Ding](https://arxiv.org/search/?searchtype=author&query=Xichen Ding), [Liang Zhang](https://arxiv.org/search/?searchtype=author&query=Liang Zhang), [Linjian Mo](https://arxiv.org/search/?searchtype=author&query=Linjian Mo) 作者:范仕清,丁希晨,张亮,莫林坚
LLMs' capabilities are enhanced by using function calls to integrate various data sources or API results into the context window. Typical tools include search, web crawlers, maps, financial data, file systems, and browser usage. Integrating these data sources or functions requires a standardized method. The Model Context Protocol (MCP) provides a standardized way to supply context to LLMs. However, the evaluation of LLMs' and AI Agents' MCP tool use abilities suffers from several issues. First, there is a lack of comprehensive datasets or benchmarks to evaluate various MCP tools. Second, the diverse response formats of MCP tool call execution further increase the difficulty of evaluation. Additionally, unlike existing tool-use benchmarks with high success rates in functions such as programming and math, the success rate of real-world MCP tools is not guaranteed and varies across different MCP servers. Furthermore, the LLM's context window also limits the number of available tools that can be called in a single run, because the textual descriptions of the tools and their parameters are too long in tokens for an LLM to process all at once. To help address the challenges of evaluating LLMs' performance on calling MCP tools, we propose MCPToolBench++, a large-scale, multi-domain AI Agent tool use benchmark. As of July 2025, this benchmark is built upon a marketplace of over 4k MCP servers from more than 40 categories, collected from the MCP marketplaces and GitHub communities. The datasets consist of both single-step and multi-step tool calls across different categories. We evaluated SOTA LLMs with agentic abilities on this benchmark and reported the results.
通过使用函数调用将各种数据源或 API 结果整合到上下文窗口中,可以增强 LLMs 的能力。典型的工具包括搜索、网页爬虫、地图、金融数据、文件系统和浏览器使用等。整合这些数据源或函数需要一种标准化的方法。模型上下文协议(Model Context Protocol,MCP)提供了一种向 LLMs 提供上下文的标准化方式。然而,对 LLMs 和 AI 代理的 MCP 工具使用能力的评估存在若干问题。首先,缺乏用于评估各种 MCP 工具的综合数据集或基准。其次,MCP 工具调用执行的响应格式多样,进一步增加了评估的难度。此外,与在编程和数学等功能上具有高成功率的现有工具使用基准不同,真实世界中 MCP 工具的成功率并不有保证,并且在不同的 MCP 服务器之间存在差异。再者,LLMs 的上下文窗口也限制了在一次运行中可调用工具的数量,因为工具的文本描述和参数对 LLM 来说具有较长的 token 长度,难以一次性全部处理。 为了解决评估 LLMs 在调用 MCP 工具时性能的挑战,我们提出了 MCPToolBench++,一个大规模、多领域的 AI Agent 工具使用基准。截至 2025 年 7 月,该基准构建于来自超过 40 个类别、超过 4k MCP 服务器的市场之上,这些服务器收集自 MCP 市场和 GitHub 社区。数据集包含跨不同类别的单步和多步工具调用。我们在该基准上评估了具备代理能力的 SOTA LLMs 并报告了结果。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-08-11 03:16:02 UTC 发布:2025-08-11 03:16:02 UTC
#24 Democratizing Diplomacy: A Harness for Evaluating Any Large Language Model on Full-Press Diplomacy #24 民主化外交:一个用于在完整加压 Diplomacy 上评估任意大型语言模型的测试框架
Authors: [Alexander Duffy](https://arxiv.org/search/?searchtype=author&query=Alexander Duffy), [Samuel J Paech](https://arxiv.org/search/?searchtype=author&query=Samuel J Paech), [Ishana Shastri](https://arxiv.org/search/?searchtype=author&query=Ishana Shastri), [Elizabeth Karpinski](https://arxiv.org/search/?searchtype=author&query=Elizabeth Karpinski), [Baptiste Alloui-Cros](https://arxiv.org/search/?searchtype=author&query=Baptiste Alloui-Cros), [Tyler Marques](https://arxiv.org/search/?searchtype=author&query=Tyler Marques), [Matthew Lyle Olson](https://arxiv.org/search/?searchtype=author&query=Matthew Lyle Olson) 作者:Alexander Duffy、Samuel J Paech、Ishana Shastri、Elizabeth Karpinski、Baptiste Alloui-Cros、Tyler Marques、Matthew Lyle Olson
We present the first evaluation harness that enables any out-of-the-box, local, Large Language Models (LLMs) to play full-press Diplomacy without fine-tuning or specialized training. Previous work required frontier LLMs, or fine-tuning, due to the high complexity and information density of Diplomacy’s game state. Combined with the high variance of matches, these factors made Diplomacy prohibitive for study. In this work, we used data-driven iteration to optimize a textual game state representation such that a 24B model can reliably complete matches without any fine tuning. We develop tooling to facilitate hypothesis testing and statistical analysis, and we present case studies on persuasion, aggressive playstyles, and performance across a range of models. We conduct a variety of experiments across many popular LLMs, finding the larger models perform the best, but the smaller models still play adequately. We also introduce Critical State Analysis: an experimental protocol for rapidly iterating and analyzing key moments in a game at depth. Our harness democratizes the evaluation of strategic reasoning in LLMs by eliminating the need for fine-tuning, and it provides insights into how these capabilities emerge naturally from widely used LLMs. Our code is available in the supplement and will be open sourced. 我们提出了首个评估框架,使任何开箱即用的本地大型语言模型(LLMs)都可以在无需微调或专门训练的情况下完整地进行《外交》(Diplomacy)对弈。先前的工作由于《外交》游戏状态的高度复杂性和信息密度,通常需要最前沿的 LLMs 或微调。再加上比赛的高方差,这些因素使得研究《外交》变得困难。在本工作中,我们通过数据驱动的迭代优化了一种文本化的游戏状态表示,使得一个 24B 模型能够在不进行任何微调的情况下可靠地完成对局。我们开发了便于假设检验和统计分析的工具,并就说服、侵略性对局风格以及不同模型间的性能差异呈现了案例研究。我们在许多流行的 LLMs 上进行了各种实验,发现更大的模型表现最佳,但较小的模型仍然能胜任游戏。我们还引入了关键状态分析(Critical State Analysis):一种用于快速迭代并深入分析游戏中关键时刻的实验协议。 我们的测试框架通过消除微调的需求,使对 LLMs 战略推理能力的评估全民化,并提供了关于这些能力如何在广泛使用的 LLMs 中自然出现的见解。我们的代码随补充材料提供,并将开源。
Subjects: Artificial Intelligence, Computation and Language, Computers and Society, Machine Learning 主题:人工智能,计算与语言,计算机与社会,机器学习
Publish: 2025-08-10 21:07:08 UTC 发表:2025-08-10 21:07:08 世界协调时间
#25 CP-Agent: Agentic Constraint Programming #25 CP-Agent:具代理性的约束编程
Author: [Stefan Szeider](https://arxiv.org/search/?searchtype=author&query=Stefan Szeider) 作者:Stefan Szeider
Translating natural language problem descriptions into formal constraint models remains a fundamental challenge in constraint programming, requiring deep expertise in both the problem domain and modeling frameworks. Previous approaches to automating this translation have employed fixed workflows with predetermined modeling steps, failing on a significant number of benchmark problems. We present a new approach using a pure agentic strategy without any fixed pipeline. We developed a general-purpose Python coding agent based on the ReAct (Reason and Act) principle, utilizing a persistent IPython kernel for stateful code execution and iterative development. Rather than embedding constraint programming logic into the agent architecture, domain-specific expertise is injected solely through a carefully crafted project prompt. The agent combines this prompt-encoded knowledge with access to file operations and code execution tools, enabling it to test hypotheses, debug failures, and verify solutions dynamically. Implemented in just a few hundred lines of code, this architecture successfully solves all 101 problems of the CP-Bench constraint programming benchmark set. The results suggest that constraint modeling tasks require the combination of general coding tools and domain expertise encoded in prompts, rather than specialized agent architectures or predefined workflows. 将自然语言的问题描述翻译为形式化的约束模型仍然是约束编程中的一个根本性挑战,这需要在问题域和建模框架方面具备深厚的专业知识。以往旨在自动化这一翻译的做法采用了具有预定建模步骤的固定工作流程,在大量基准问题上表现不佳。我们提出了一种新的方法,使用纯代理策略,不设固定流程。我们基于 ReAct(推理与行动)原则开发了一个通用的 Python 编码代理,利用持久化的 IPython 内核来实现有状态的代码执行和迭代开发。我们并未将约束编程逻辑嵌入代理架构中,而是通过精心设计的项目提示将领域专长注入其中。该代理将提示中编码的知识与文件操作和代码执行工具的访问能力相结合,从而能够动态地测试假设、调试错误并验证解法。该架构仅用几百行代码实现,就成功解决了 CP-Bench 约束编程基准集中全部 101 个问题。 结果表明,约束建模任务需要将通用编码工具与在提示中编码的领域专长相结合,而不是依赖专门的智能体架构或预定义的工作流程。
Subjects: Artificial Intelligence, Computation and Language, Machine Learning, Software Engineering 主题:人工智能、计算与语言、机器学习、软件工程
Publish: 2025-08-10 19:59:01 UTC 发布:2025-08-10 19:59:01 UTC
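The agent loop described for CP-Agent (#25), ReAct-style alternation of actions and observations over a stateful execution environment, can be illustrated with the minimal sketch below. It is not the author's implementation: it uses Python's built-in `exec` with a shared namespace in place of an IPython kernel, and a hypothetical scripted policy in place of an LLM. `PersistentKernel`, `react_loop`, and the script contents are all invented for illustration.

```python
import contextlib
import io

class PersistentKernel:
    """Keeps one namespace alive across calls, loosely mimicking a stateful kernel."""

    def __init__(self):
        self.namespace = {}

    def run(self, code):
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self.namespace)
        except Exception as exc:  # report the failure back to the agent as an observation
            return f"{type(exc).__name__}: {exc}"
        return buffer.getvalue().strip()

def react_loop(policy, kernel, max_steps=5):
    """Alternate between policy-chosen code actions and their observations."""
    observation, trace = "", []
    for _ in range(max_steps):
        action = policy(observation)
        if action is None:  # the policy signals that it is done
            break
        observation = kernel.run(action)
        trace.append((action, observation))
    return trace

# Hypothetical scripted policy standing in for an LLM: it defines a variable,
# then uses it in a later step, which only works because state persists.
script = iter(["xs = [3, 1, 2]", "print(sorted(xs))", None])
trace = react_loop(lambda obs: next(script), PersistentKernel())
```

Because errors are returned as observations rather than raised, a real policy could read a traceback and emit a corrected action on the next step, which is the debug-and-retry behaviour the abstract attributes to the agent.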
#26 Grounding Natural Language for Multi-agent Decision-Making with Multi-agentic LLMs #26 将自然语言落地用于多智能体决策制定,使用多智能体 LLMs
Authors: [Dom Huh](https://arxiv.org/search/?searchtype=author&query=Dom Huh), [Prasant Mohapatra](https://arxiv.org/search/?searchtype=author&query=Prasant Mohapatra) 作者:Dom Huh,Prasant Mohapatra
Language is a ubiquitous tool that is foundational to reasoning and collaboration, ranging from everyday interactions to sophisticated problem-solving tasks. The establishment of a common language can serve as a powerful asset in ensuring clear communication and understanding amongst agents, facilitating desired coordination and strategies. In this work, we extend the capabilities of large language models (LLMs) by integrating them with advancements in multi-agent decision-making algorithms. We propose a systematic framework for the design of multi-agentic large language models (LLMs), focusing on key integration practices. These include advanced prompt engineering techniques, the development of effective memory architectures, multi-modal information processing, and alignment strategies through fine-tuning algorithms. We evaluate these design choices through extensive ablation studies on classic game settings with significant underlying social dilemmas and game-theoretic considerations. 语言是一种无处不在的工具,是推理与协作的基础,涵盖从日常互动到复杂问题解决任务。建立一种共同语言可以作为确保代理之间清晰沟通与理解的强大资产,促进期望的协调与策略。在本工作中,我们通过将其与多代理决策算法的进展相结合,扩展了大型语言模型(LLMs)的能力。我们提出了一个用于设计多代理大型语言模型(LLMs)的系统化框架,侧重于关键的集成实践。这些包括高级提示工程技术、有效记忆架构的开发、多模态信息处理,以及通过微调算法进行的对齐策略。我们通过在具有重要潜在社会困境和博弈论考量的经典博弈设置上进行的大量消融研究来评估这些设计选择。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-08-10 19:53:23 UTC 发布:2025-08-10 19:53:23 UTC
#27 A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems #27 自我进化人工智能代理的全面综述:一个连接基础模型与终身代理系统的新范式
Authors: [Jinyuan Fang](https://arxiv.org/search/?searchtype=author&query=Jinyuan Fang), [Yanwen Peng](https://arxiv.org/search/?searchtype=author&query=Yanwen Peng), [Xi Zhang](https://arxiv.org/search/?searchtype=author&query=Xi Zhang), [Yingxu Wang](https://arxiv.org/search/?searchtype=author&query=Yingxu Wang), [Xinhao Yi](https://arxiv.org/search/?searchtype=author&query=Xinhao Yi), [Guibin Zhang](https://arxiv.org/search/?searchtype=author&query=Guibin Zhang), [Yi Xu](https://arxiv.org/search/?searchtype=author&query=Yi Xu), [Bin Wu](https://arxiv.org/search/?searchtype=author&query=Bin Wu), [Siwei Liu](https://arxiv.org/search/?searchtype=author&query=Siwei Liu), [Zihao Li](https://arxiv.org/search/?searchtype=author&query=Zihao Li), [Zhaochun Ren](https://arxiv.org/search/?searchtype=author&query=Zhaochun Ren), [Nikos Aletras](https://arxiv.org/search/?searchtype=author&query=Nikos Aletras), [Xi Wang](https://arxiv.org/search/?searchtype=author&query=Xi Wang), [Han Zhou](https://arxiv.org/search/?searchtype=author&query=Han Zhou), [Zaiqiao Meng](https://arxiv.org/search/?searchtype=author&query=Zaiqiao Meng) 作者:方金元、彭燕文、张曦、王颖旭、易新昊、张贵斌、徐毅、吴斌、刘思唯、李子豪、任昭春、Nikos Aletras、王曦、周涵、孟在桥
Recent advances in large language models have sparked growing interest in AI agents capable of solving complex, real-world tasks. However, most existing agent systems rely on manually crafted configurations that remain static after deployment, limiting their ability to adapt to dynamic and evolving environments. To this end, recent research has explored agent evolution techniques that aim to automatically enhance agent systems based on interaction data and environmental feedback. This emerging direction lays the foundation for self-evolving AI agents, which bridge the static capabilities of foundation models with the continuous adaptability required by lifelong agentic systems. In this survey, we provide a comprehensive review of existing techniques for self-evolving agentic systems. Specifically, we first introduce a unified conceptual framework that abstracts the feedback loop underlying the design of self-evolving agentic systems. The framework highlights four key components: System Inputs, Agent System, Environment, and Optimisers, serving as a foundation for understanding and comparing different strategies. Based on this framework, we systematically review a wide range of self-evolving techniques that target different components of the agent system. We also investigate domain-specific evolution strategies developed for specialised fields such as biomedicine, programming, and finance, where optimisation objectives are tightly coupled with domain constraints. In addition, we provide a dedicated discussion on the evaluation, safety, and ethical considerations for self-evolving agentic systems, which are critical to ensuring their effectiveness and reliability. This survey aims to provide researchers and practitioners with a systematic understanding of self-evolving AI agents, laying the foundation for the development of more adaptive, autonomous, and lifelong agentic systems. 
近年来大型语言模型的进展引发了人们对能够解决复杂现实任务的智能体的浓厚兴趣。然而,大多数现有的智能体系统依赖人工设计的配置,部署后保持静态,限制了它们在动态和不断变化的环境中适应的能力。为此,近期研究探索了智能体进化技术,旨在基于交互数据和环境反馈自动增强智能体系统。这一新兴方向为自我进化的人工智能智能体奠定了基础,将基础模型的静态能力与终身智能体系统所需的持续适应性联系起来。在本综述中,我们对现有的自我进化智能体系统技术进行了全面回顾。具体而言,我们首先引入了一个统一的概念框架,用以抽象设计自我进化智能体系统所依赖的反馈循环。该框架强调四个关键组成部分:系统输入、智能体系统、环境和优化器,为理解和比较不同策略提供了基础。 基于该框架,我们系统性地回顾了针对智能体系统不同组成部分的一系列自我演化技术。我们还研究了为生物医学、编程和金融等专门领域开发的领域特定演化策略,这些领域的优化目标与领域约束紧密相关。此外,我们针对自我演化智能体系统的评估、安全性和伦理考量进行了专门讨论,这些内容对于确保其有效性和可靠性至关重要。本综述旨在为研究人员和实践者提供对自我演化人工智能智能体的系统性理解,为开发更具适应性、自主性和终身学习能力的智能体系统奠定基础。
Subjects: Artificial Intelligence, Computation and Language, Multiagent Systems 主题:人工智能、计算与语言、多智能体系统
Publish: 2025-08-10 16:07:32 UTC 发布时间:2025-08-10 16:07:32 UTC
#28 Generative AI for Strategic Plan Development #28 用于战略规划制定的生成式人工智能
Author: [Jesse Ponnock](https://arxiv.org/search/?searchtype=author&query=Jesse Ponnock) 作者:Jesse Ponnock
Given recent breakthroughs in Generative Artificial Intelligence (GAI) and Large Language Models (LLMs), more and more professional services that once seemed impossible to automate are being augmented through Artificial Intelligence (AI). This paper presents a modular model for leveraging GAI in developing strategic plans for large-scale government organizations and evaluates leading machine learning techniques in their application to one of the identified modules. Specifically, BERTopic and Non-negative Matrix Factorization (NMF) are evaluated on their ability to use topic modeling to generate themes representative of Vision Elements within a strategic plan. To accomplish this, BERTopic and NMF models are trained using a large volume of reports from the Government Accountability Office (GAO). The generated topics from each model are then scored for similarity against the Vision Elements of a published strategic plan, and the results are compared. Our results show that these techniques are capable of generating themes similar to 100% of the elements being evaluated against. Further, we conclude that BERTopic performs best in this application, with more than half of its correlated topics achieving a “medium” or “strong” correlation. A capability for GAI-enabled strategic plan development would impact a multi-billion dollar industry and assist the federal government in meeting regulatory requirements that are crucial to the public good. Further work will focus on the operationalization of the concept proven in this study as well as the viability of the remaining modules in the proposed model for GAI-generated strategic plans.
鉴于生成式人工智能(GAI)和大型语言模型(LLMs)方面的最新突破,越来越多曾被认为无法自动化的专业服务正通过人工智能(AI)得到增强。本文提出了一个模块化模型,用于在为大型政府组织制定战略计划时利用 GAI,并评估了若干领先的机器学习技术在应用于所识别模块之一时的表现。具体来说,评估了 BERTopic 和非负矩阵分解(NMF)在使用主题建模生成代表战略计划中愿景要素的主题方面的能力。为此,使用大量来自美国政府问责办(GAO)的报告训练了 BERTopic 和 NMF 模型。然后将每个模型生成的主题与已发布战略计划的愿景要素进行相似性打分,并对结果进行比较。我们的结果表明,这些技术能够生成与所评估的 100%要素相似的主题。 此外,我们得出结论:在此应用中,BERTopic 表现最佳,其相关主题中有超过一半达到“中等”或“强”相关性。具备生成式人工智能支持的战略规划能力会影响数十亿美元规模的产业,并帮助联邦政府克服对公共利益至关重要的监管要求。后续工作将侧重于将本研究中验证的概念投入运行,以及评估所提模型中其余模块用于生成式人工智能生成战略计划的可行性。
Subjects: Artificial Intelligence, Computation and Language, Machine Learning 主题:Artificial Intelligence , Computation and Language , Machine Learning
Publish: 2025-08-10 16:07:07 UTC
#29 Invert4TVG: A Temporal Video Grounding Framework with Inversion Tasks for Enhanced Action Understanding #29 Invert4TVG:一种通过反演任务增强动作理解的时序视频定向框架
Authors: [Zhaoyu Chen](https://arxiv.org/search/?searchtype=author&query=Zhaoyu Chen), [Hongnan Lin](https://arxiv.org/search/?searchtype=author&query=Hongnan Lin), [Yongwei Nie](https://arxiv.org/search/?searchtype=author&query=Yongwei Nie), [Fei Ma](https://arxiv.org/search/?searchtype=author&query=Fei Ma), [Xuemiao Xu](https://arxiv.org/search/?searchtype=author&query=Xuemiao Xu), [Fei Yu](https://arxiv.org/search/?searchtype=author&query=Fei Yu), [Chengjiang Long](https://arxiv.org/search/?searchtype=author&query=Chengjiang Long) 作者:陈兆宇、林宏楠、聂永伟、马飞、徐学淼、于飞、龙成江
Temporal Video Grounding (TVG) seeks to localize video segments matching a given textual query. Current methods, while optimizing for high temporal Intersection-over-Union (IoU), often overfit to this metric, compromising semantic action understanding in the video and query, a critical factor for robust TVG. To address this, we introduce Inversion Tasks for TVG (Invert4TVG), a novel framework that enhances both localization accuracy and action understanding without additional data. Our approach leverages three inversion tasks derived from existing TVG annotations: (1) Verb Completion, predicting masked action verbs in queries from video segments; (2) Action Recognition, identifying query-described actions; and (3) Video Description, generating descriptions of video segments that explicitly embed query-relevant actions. These tasks, integrated with TVG via a reinforcement learning framework with well-designed reward functions, ensure balanced optimization of localization and semantics. Experiments show our method outperforms state-of-the-art approaches, achieving a 7.1% improvement in R1@0.7 on Charades-STA for a 3B model compared to Time-R1. By inverting TVG to derive query-related actions from segments, our approach strengthens semantic understanding, significantly raising the ceiling of localization accuracy. 时序视频定位(Temporal Video Grounding, TVG)旨在定位与给定文本查询匹配的视频片段。当前方法虽然在优化高时序交并比(IoU)方面表现出色,但常常对该指标产生过拟合,影响了对视频与查询中语义动作的理解,而语义动作理解是实现鲁棒 TVG 的关键因素。为了解决这一问题,我们提出了用于 TVG 的反向任务(Inversion Tasks for TVG,Invert4TVG),这是一种新的框架,在无需额外数据的情况下同时提升定位精度与动作理解能力。我们的方法利用了从现有 TVG 标注中派生出的三种反向任务:(1)动词补全:从视频片段预测查询中被遮蔽的动作动词;(2)动作识别:识别查询所描述的动作;以及(3)视频描述:生成显式包含与查询相关动作的视频片段描述。这些任务通过带有精心设计奖励函数的强化学习框架与 TVG 集成,确保对定位与语义的均衡优化。实验表明,我们的方法优于最先进的方法,在 Charades-STA 数据集上,针对一个 3B 模型相比 Time-R1 在 R1@0.7 上取得了 7.1%的提升。 通过将 TVG 反向应用于从片段中推导与查询相关的动作,我们的方法加强了语义理解,显著提高了定位准确性的上限。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-08-10 15:38:04 UTC 发布:2025-08-10 15:38:04 UTC
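The R1@0.7 metric quoted for Invert4TVG (#29) is built on temporal Intersection-over-Union. A minimal sketch, assuming segments are given as `(start, end)` pairs in seconds and one top-1 prediction per query; the function names and example segments are illustrative, not taken from the paper:

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(predictions, ground_truths, threshold=0.7):
    """Fraction of queries whose top-1 predicted segment reaches the IoU threshold."""
    hits = sum(
        temporal_iou(p, g) >= threshold for p, g in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)

# Two hypothetical queries: the first prediction overlaps well, the second misses.
preds = [(2.0, 10.0), (0.0, 4.0)]
gts = [(3.0, 10.0), (5.0, 9.0)]
score = recall_at_1(preds, gts)
```

The first pair has an IoU of 7/8 = 0.875 and counts as a hit; the second pair does not overlap, so `score` is 0.5.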
#30 Pentest-R1: Towards Autonomous Penetration Testing Reasoning Optimized via Two-Stage Reinforcement Learning #30 Pentest-R1:朝着通过两阶段强化学习优化推理的自主渗透测试
Authors: [He Kong](https://arxiv.org/search/?searchtype=author&query=He Kong), [Die Hu](https://arxiv.org/search/?searchtype=author&query=Die Hu), [Jingguo Ge](https://arxiv.org/search/?searchtype=author&query=Jingguo Ge), [Liangxiong Li](https://arxiv.org/search/?searchtype=author&query=Liangxiong Li), [Hui Li](https://arxiv.org/search/?searchtype=author&query=Hui Li), [Tong Li](https://arxiv.org/search/?searchtype=author&query=Tong Li) 作者:He Kong、Die Hu、Jingguo Ge、Liangxiong Li、Hui Li、Tong Li
Automating penetration testing is crucial for enhancing cybersecurity, yet current Large Language Models (LLMs) face significant limitations in this domain, including poor error handling, inefficient reasoning, and an inability to perform complex end-to-end tasks autonomously. To address these challenges, we introduce Pentest-R1, a novel framework designed to optimize LLM reasoning capabilities for this task through a two-stage reinforcement learning pipeline. We first construct a dataset of over 500 real-world, multi-step walkthroughs, which Pentest-R1 leverages for offline reinforcement learning (RL) to instill foundational attack logic. Subsequently, the LLM is fine-tuned via online RL in an interactive Capture The Flag (CTF) environment, where it learns directly from environmental feedback to develop robust error self-correction and adaptive strategies. Our extensive experiments on the Cybench and AutoPenBench benchmarks demonstrate the framework’s effectiveness. On AutoPenBench, Pentest-R1 achieves a 24.2% success rate, surpassing most state-of-the-art models and ranking second only to Gemini 2.5 Flash. On Cybench, it attains a 15.0% success rate in unguided tasks, establishing a new state-of-the-art for open-source LLMs and matching the performance of top proprietary models. Ablation studies confirm that the synergy of both training stages is critical to its success. 自动化渗透测试对于提升网络安全至关重要,然而当前的大型语言模型(LLMs)在该领域存在显著局限,包括差劲的错误处理能力、低效的推理以及无法自主执行复杂的端到端任务。为了解决这些挑战,我们提出了 Pentest-R1,这是一种新颖的框架,旨在通过两阶段强化学习流水线优化 LLM 的推理能力。我们首先构建了一个包含 500 多个真实世界、多步骤演练的数据集,Pentest-R1 利用该数据集进行离线强化学习(RL),以灌输基础的攻击逻辑。随后,在交互式夺旗赛(CTF)环境中通过在线 RL 对 LLM 进行微调,使其直接从环境反馈中学习,形成强健的错误自我纠正和自适应策略。我们在 Cybench 和 AutoPenBench 基准上的大量实验验证了该框架的有效性。在 AutoPenBench 上,Pentest-R1 达到 24.2% 的成功率,超过大多数最先进模型,仅次于 Gemini 2.5 Flash,位居第二。 在 Cybench 上,在无引导任务中它达到了 15.0% 的成功率,为开源 LLMs 树立了新的最先进水平,并且与顶级专有模型的表现相当。消融研究证实,两阶段训练的协同作用对其成功至关重要。
Subjects: Artificial Intelligence, Machine Learning 主题:人工智能、机器学习
Publish: 2025-08-10 15:14:05 UTC 发布时间:2025-08-10 15:14:05 UTC
#31 Rethinking Domain-Specific LLM Benchmark Construction: A Comprehensiveness-Compactness Approach #31 重新思考特定领域 LLM 基准构建:全面性-紧凑性 方法
Authors: [Rubing Chen](https://arxiv.org/search/?searchtype=author&query=Rubing Chen), [Jiaxin Wu](https://arxiv.org/search/?searchtype=author&query=Jiaxin Wu), [Jian Wang](https://arxiv.org/search/?searchtype=author&query=Jian Wang), [Xulu Zhang](https://arxiv.org/search/?searchtype=author&query=Xulu Zhang), [Wenqi Fan](https://arxiv.org/search/?searchtype=author&query=Wenqi Fan), [Chenghua Lin](https://arxiv.org/search/?searchtype=author&query=Chenghua Lin), [Xiao-Yong Wei](https://arxiv.org/search/?searchtype=author&query=Xiao-Yong Wei), [Qing Li](https://arxiv.org/search/?searchtype=author&query=Qing Li) 作者:Rubing Chen、Jiaxin Wu、Jian Wang、Xulu Zhang、Wenqi Fan、Chenghua Lin、Xiao-Yong Wei、Qing Li
Numerous benchmarks have been built to evaluate the domain-specific abilities of large language models (LLMs), highlighting the need for effective and efficient benchmark construction. Existing domain-specific benchmarks primarily focus on the scaling law, relying on massive corpora for supervised fine-tuning or generating extensive question sets for broad coverage. However, the impact of corpus and question-answer (QA) set design on the precision and recall of domain-specific LLMs remains unexplored. In this paper, we address this gap and demonstrate that the scaling law is not always the optimal principle for benchmark construction in specific domains. Instead, we propose Comp-Comp, an iterative benchmarking framework based on a comprehensiveness-compactness principle. Here, comprehensiveness ensures semantic recall of the domain, while compactness enhances precision, guiding both corpus and QA set construction. To validate our framework, we conducted a case study in a well-renowned university, resulting in the creation of XUBench, a large-scale and comprehensive closed-domain benchmark. Although we use the academic domain as the case in this work, our Comp-Comp framework is designed to be extensible beyond academia, providing valuable insights for benchmark construction across various domains. 已有大量基准测试用于评估大型语言模型(LLMs)的领域特定能力,这突显了构建有效且高效基准的必要性。现有领域特定基准主要关注尺度法则,依赖大规模语料进行有监督微调或生成大量问题集以求广覆盖。然而,语料与问答(QA)集设计对领域特定 LLMs 的查全率与查准率的影响仍未被探究。本文针对这一空白进行了研究,并证明在特定领域内,尺度法则并不总是构建基准的最佳原则。相反,我们提出了 Comp-Comp,一种基于完备性-紧凑性原则的迭代基准构建框架。其中文本的完备性确保领域的语义查全率,而紧凑性则提升查准率,二者共同指导语料与 QA 集的构建。为验证我们的框架,我们在一所知名大学开展了案例研究,最终构建了 XUBench,一个大规模且完备的封闭领域基准。 虽然我们在本研究中以学术领域为案例,但我们的 Comp-Comp 框架旨在可扩展到学术界以外,为各类领域的基准构建提供有价值的见解。
Subjects: Artificial Intelligence, Computation and Language, Machine Learning 主题:人工智能,计算与语言,机器学习
Publish: 2025-08-10 14:08:28 UTC 发布:2025-08-10 14:08:28 UTC
#32 Hallucination as a Computational Boundary: A Hierarchy of Inevitability and the Oracle Escape #32 幻觉作为计算边界:不可避免性的层级与神谕逃逸
Authors: [Quan Shi](https://arxiv.org/search/?searchtype=author&query=Quan Shi), [Wang Xi](https://arxiv.org/search/?searchtype=author&query=Wang Xi), [Zenghui Ding](https://arxiv.org/search/?searchtype=author&query=Zenghui Ding), [Jianqing Gao](https://arxiv.org/search/?searchtype=author&query=Jianqing Gao), [Xianjun Yang](https://arxiv.org/search/?searchtype=author&query=Xianjun Yang) 作者:史全、王曦、丁增辉、高建清、杨先军
The hallucination phenomenon of large language models (LLMs) is the core obstacle to their reliable deployment. This article formalizes the large language model as a probabilistic Turing machine by constructing a “computational necessity hierarchy”, and for the first time proves that hallucinations are inevitable on diagonalization, incomputability, and information-theoretic boundaries, supported by the new “learner pump lemma”. However, we propose two “escape routes”: one is to model retrieval-augmented generation (RAG) as oracle machines, proving their absolute escape through “computational jumps” and providing the first formal theory for the effectiveness of RAG; the second is to formalize continuous learning as an “internalized oracle” mechanism and implement this path through a novel neural game theory framework. Finally, this article proposes a 大型语言模型(LLMs)的幻觉现象是其可靠部署的核心障碍。本文通过构建“计算必要性层级”,将大型语言模型形式化为概率图灵机,并首次证明在对角化、不可计算性和信息论边界上,幻觉是不可避免的,这一论断由新的“学习者泵引理”所支持。然而,我们提出了两条“逃逸路线”:一是将检索增强生成(RAG)建模为神谕机,证明其可通过“计算跳跃”实现绝对逃逸,为 RAG 的有效性提供了首个形式化理论;二是将持续学习形式化为一种“内化的神谕”机制,并通过一种新颖的神经博弈论框架实现这一路径。最后,本文提出了一个
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-08-10 13:26:36 UTC 发布:2025-08-10 13:26:36 协调世界时 (UTC)
#33 EndoAgent: A Memory-Guided Reflective Agent for Intelligent Endoscopic Vision-to-Decision Reasoning #33 EndoAgent:一种用于智能内镜视觉到决策推理的记忆引导反思代理
Authors: [Yi Tang](https://arxiv.org/search/?searchtype=author&query=Yi Tang), [Kaini Wang](https://arxiv.org/search/?searchtype=author&query=Kaini Wang), [Yang Chen](https://arxiv.org/search/?searchtype=author&query=Yang Chen), [Guangquan Zhou](https://arxiv.org/search/?searchtype=author&query=Guangquan Zhou) 作者:唐毅、王凯妮、陈洋、周光泉
Developing general artificial intelligence (AI) systems to support endoscopic image diagnosis is an emerging research priority. Existing methods based on large-scale pretraining often lack unified coordination across tasks and struggle to handle the multi-step processes required in complex clinical workflows. While AI agents have shown promise in flexible instruction parsing and tool integration across domains, their potential in endoscopy remains underexplored. To address this gap, we propose EndoAgent, the first memory-guided agent for vision-to-decision endoscopic analysis that integrates iterative reasoning with adaptive tool selection and collaboration. Built on a dual-memory design, it enables sophisticated decision-making by ensuring logical coherence through short-term action tracking and progressively enhancing reasoning acuity through long-term experiential learning. To support diverse clinical tasks, EndoAgent integrates a suite of expert-designed tools within a unified reasoning loop. We further introduce EndoAgentBench, a benchmark of 5,709 visual question-answer pairs that assess visual understanding and language generation capabilities in realistic scenarios. Extensive experiments show that EndoAgent consistently outperforms both general and medical multimodal models, exhibiting its strong flexibility and reasoning capabilities. 开发用于支持内镜图像诊断的通用人工智能(AI)系统是一个新兴的研究重点。基于大规模预训练的现有方法往往缺乏跨任务的统一协调,并且难以处理复杂临床工作流程中所需的多步骤过程。尽管 AI 代理在灵活解析指令和跨领域工具整合方面显示出潜力,但其在内镜领域的应用仍未得到充分探索。为了解决这一空白,我们提出了 EndoAgent,这是第一个用于视觉到决策的内镜分析的记忆引导代理,它将迭代推理与自适应工具选择和协作相结合。基于双重记忆设计,该方法通过短期动作跟踪确保逻辑连贯性,并通过长期经验学习逐步增强推理敏锐度,从而实现复杂的决策制定。为了支持多样化的临床任务,EndoAgent 在统一的推理循环中集成了一套专家设计的工具。我们进一步引入了 EndoAgentBench,一个包含 5,709 个视觉问答对的基准,用于评估真实场景中的视觉理解和语言生成能力。 大量实验表明,EndoAgent 始终优于通用及医学多模态模型,展现出其强大的灵活性和推理能力。
Subjects: Artificial Intelligence, Computation and Language 主题:人工智能,计算与语言
Publish: 2025-08-10 11:02:57 UTC 发布:2025-08-10 11:02:57 协调世界时 (UTC)
#34 Multi-Dimensional Summarization Agents with Context-Aware Reasoning over Enterprise Tables #34 多维摘要代理:面向企业表格的上下文感知推理
Author: [Amit Dhanda](https://arxiv.org/search/?searchtype=author&query=Amit Dhanda) 作者:Amit Dhanda
We propose a novel framework for summarizing structured enterprise data across multiple dimensions using large language model (LLM)-based agents. Traditional table-to-text models often lack the capacity to reason across hierarchical structures and context-aware deltas, which are essential in business reporting tasks. Our method introduces a multi-agent pipeline that extracts, analyzes, and summarizes multi-dimensional data using agents for slicing, variance detection, context construction, and LLM-based generation. Our results show that the proposed framework outperforms traditional approaches, achieving 83% faithfulness to underlying data, superior coverage of significant changes, and high relevance scores (4.4/5) for decision-critical insights. The improvements are especially pronounced in categories involving subtle trade-offs, such as increased revenue due to price changes amid declining unit volumes, which competing methods either overlook or address with limited specificity. We evaluate the framework on Kaggle datasets and demonstrate significant improvements in faithfulness, relevance, and insight quality over baseline table summarization approaches. 我们提出了一个用于跨多维度汇总结构化企业数据的新框架,该框架使用基于大语言模型(LLM)的代理。传统的表格到文本模型往往缺乏在分层结构和具上下文感知的差异上进行推理的能力,而这些正是在商业报告任务中至关重要的。我们的方法引入了一个多代理流水线,使用切片、差异检测、上下文构建和基于 LLM 的生成代理来提取、分析和汇总多维数据。我们的结果表明,所提出的框架优于传统方法,实现了对底层数据 83%的忠实度、对显著变化的更好覆盖率,以及针对决策关键洞见的高相关性评分(4.4/5)。在涉及微妙权衡的类别中改进尤为明显,例如在单位销量下降的情况下因价格变动导致的收入增长,竞争方法要么忽视此类情况,要么仅以有限的具体性来处理。我们在 Kaggle 数据集上评估了该框架,并展示了在忠实度、相关性和洞见质量方面相较于基线表格摘要方法的显著提升。
Subjects: Artificial Intelligence, Multiagent Systems 主题:人工智能,多智能体系统
Publish: 2025-08-10 05:27:42 UTC 发布时间:2025-08-10 05:27:42 UTC
#35 Designing a Feedback-Driven Decision Support System for Dynamic Student Intervention #35 为动态学生干预设计以反馈为驱动的决策支持系统
Authors: [Timothy Oluwapelumi Adeyemi](https://arxiv.org/search/?searchtype=author&query=Timothy Oluwapelumi Adeyemi), [Nadiah Fahad AlOtaibi](https://arxiv.org/search/?searchtype=author&query=Nadiah Fahad AlOtaibi) 作者:Timothy Oluwapelumi Adeyemi、Nadiah Fahad AlOtaibi
Accurate prediction of student performance is essential for timely academic intervention. However, most machine learning models in education are static and cannot adapt when new data, such as post-intervention outcomes, become available. To address this limitation, we propose a Feedback-Driven Decision Support System (DSS) with a closed-loop architecture that enables continuous model refinement. The system integrates a LightGBM-based regressor with incremental retraining, allowing educators to input updated student results, which automatically trigger model updates. This adaptive mechanism improves prediction accuracy by learning from real-world academic progress. The platform features a Flask-based web interface for real-time interaction and incorporates SHAP for explainability, ensuring transparency. Experimental results show a 10.7% reduction in RMSE after retraining, with consistent upward adjustments in predicted scores for intervened students. By transforming static predictors into self-improving systems, our approach advances educational analytics toward human-centered, data-driven, and responsive AI. The framework is designed for integration into LMS and institutional dashboards. 准确预测学生表现对于及时学业干预至关重要。然而,大多数教育领域的机器学习模型是静态的,无法在出现新数据(例如干预后的结果)时进行自适应。为了解决这一局限性,我们提出了一个反馈驱动的决策支持系统(DSS),采用闭环架构以实现持续的模型精炼。该系统将基于 LightGBM 的回归器与增量重训练相结合,允许教育工作者输入更新后的学生成绩,从而自动触发模型更新。这种自适应机制通过学习真实世界的学业进展来提高预测准确性。该平台具有基于 Flask 的网页界面以实现实时交互,并整合了 SHAP 以增强可解释性,确保透明性。实验结果显示,重训练后 RMSE 降低了 10.7%,并且对接受干预的学生的预测分数呈持续上调。通过将静态预测器转变为自我改进的系统,我们的方法推动了教育分析向以人为本、数据驱动和响应式的人工智能方向发展。 该框架旨在集成到学习管理系统和机构仪表盘中。
Subjects: Artificial Intelligence, Computers and Society 学科:人工智能,计算机与社会
Publish: 2025-08-09 21:24:54 UTC
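The "10.7% reduction in RMSE after retraining" reported for #35 is a relative figure. As a minimal sketch of how such a number is computed (the scores and predictions below are made-up illustration data, not results from the paper, and the helper names are hypothetical):

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))

def relative_reduction(before, after):
    """Fractional drop of an error metric, e.g. 0.107 for a 10.7% reduction."""
    return (before - after) / before

# Hypothetical student scores and predictions before/after a retraining step.
actual = [70.0, 80.0, 90.0, 60.0]
pred_before = [60.0, 85.0, 80.0, 70.0]
pred_after = [62.0, 84.0, 82.0, 68.0]

before = rmse(actual, pred_before)
after = rmse(actual, pred_after)
print(f"RMSE before={before:.2f}, after={after:.2f}, "
      f"reduction={relative_reduction(before, after):.1%}")
```

A relative reduction depends on the baseline error, so the same percentage can correspond to very different absolute changes in predicted scores.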
#36 Towards Safer AI Moderation: Evaluating LLM Moderators Through a Unified Benchmark Dataset and Advocating a Human-First Approach #36 迈向更安全的人工智能审核:通过统一基准数据集评估 LLM 审核系统并倡导以人为本的方法
Authors: [Naseem Machlovi](https://arxiv.org/search/?searchtype=author&query=Naseem Machlovi), [Maryam Saleki](https://arxiv.org/search/?searchtype=author&query=Maryam Saleki), [Innocent Ababio](https://arxiv.org/search/?searchtype=author&query=Innocent Ababio), [Ruhul Amin](https://arxiv.org/search/?searchtype=author&query=Ruhul Amin) 作者:Naseem Machlovi, Maryam Saleki, Innocent Ababio, Ruhul Amin
As AI systems become more integrated into daily life, the need for safer and more reliable moderation has never been greater. Large Language Models (LLMs) have demonstrated remarkable capabilities, surpassing earlier models in complexity and performance. Their evaluation across diverse tasks has consistently showcased their potential, enabling the development of adaptive and personalized agents. However, despite these advancements, LLMs remain prone to errors, particularly in areas requiring nuanced moral reasoning. They struggle with detecting implicit hate, offensive language, and gender biases due to the subjective and context-dependent nature of these issues. Moreover, their reliance on training data can inadvertently reinforce societal biases, leading to inconsistencies and ethical concerns in their outputs. To explore the limitations of LLMs in this role, we developed an experimental framework based on state-of-the-art (SOTA) models to assess human emotions and offensive behaviors. The framework introduces a unified benchmark dataset encompassing 49 distinct categories spanning the wide spectrum of human emotions, offensive and hateful text, and gender and racial biases. Furthermore, we introduced SafePhi, a QLoRA fine-tuned version of Phi-4 that is adapted to diverse ethical contexts and outperforms benchmark moderators, achieving a Macro F1 score of 0.89, where OpenAI Moderator and Llama Guard score 0.77 and 0.74, respectively. This research also highlights the critical domains where LLM moderators consistently underperformed, underscoring the need to incorporate more heterogeneous and representative data, with a human in the loop, for better model robustness and explainability.
随着人工智能系统越来越融入日常生活,对更安全、更可靠的内容审核的需求前所未有地迫切。大型语言模型(LLMs)展现出卓越能力,在复杂性和性能上超越了早期模型。它们在各种任务上的评估持续展示出潜力,使得开发自适应和个性化智能体成为可能。然而,尽管取得了这些进展,LLMs 仍然容易出错,尤其是在需要细腻道德推理的领域。它们在检测隐含仇恨、冒犯性语言和性别偏见方面存在困难,这些问题本质上具有主观性并且依赖语境。此外,它们对训练数据的依赖可能无意中强化社会偏见,导致输出在一致性和伦理性方面出现问题。为了探究 LLMs 在此类角色中的局限性,我们基于最先进的(SOTA)模型开发了一个实验框架,用于评估人类情绪和冒犯行为。该框架引入了一个统一的基准数据集,涵盖 49 个不同类别,横跨人类情绪、冒犯与仇恨性文本以及性别与种族偏见的广泛谱系。此外,我们引入了 SafePhi,这是一个通过 QLoRA 微调的 Phi-4 版本,适配了多样化的伦理情境,并以 0.89 的宏平均(Macro)F1 得分优于基准审核器,而 OpenAI Moderator 和 Llama Guard 的得分分别为 0.77 和 0.74。本研究还指出了 LLM 审核器持续表现不佳的关键领域,凸显了以人在回路(human-in-the-loop)的方式纳入更多异质且具有代表性的数据、以提升模型鲁棒性与可解释性的必要性。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-08-09 18:00:27 UTC 发布:2025-08-09 18:00:27 协调世界时
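The moderators in #36 are compared by Macro F1, which averages per-class F1 with equal weight per class, so rare categories count as much as frequent ones. A minimal sketch of the standard definition follows; the three labels and the toy predictions are hypothetical and unrelated to the paper's 49-category dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)

# Hypothetical gold labels and moderator outputs.
gold = ["hate", "hate", "neutral", "neutral", "neutral", "bias"]
pred = ["hate", "neutral", "neutral", "neutral", "neutral", "bias"]
print(f"macro F1 = {macro_f1(gold, pred):.3f}")
```

Because each class contributes equally, a moderator that ignores a small category is penalized more under Macro F1 than under plain accuracy.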
#37 K-Dense Analyst: Towards Fully Automated Scientific Analysis #37 K-Dense Analyst:迈向全自动科学分析
Authors: [Orion Li](https://arxiv.org/search/?searchtype=author&query=Orion Li), [Vinayak Agarwal](https://arxiv.org/search/?searchtype=author&query=Vinayak Agarwal), [Summer Zhou](https://arxiv.org/search/?searchtype=author&query=Summer Zhou), [Ashwin Gopinath](https://arxiv.org/search/?searchtype=author&query=Ashwin Gopinath), [Timothy Kassis](https://arxiv.org/search/?searchtype=author&query=Timothy Kassis) 作者:Orion Li、Vinayak Agarwal、Summer Zhou、Ashwin Gopinath、Timothy Kassis
The complexity of modern bioinformatics analysis has created a critical gap between data generation and developing scientific insights. While large language models (LLMs) have shown promise in scientific reasoning, they remain fundamentally limited when dealing with real-world analytical workflows that demand iterative computation, tool integration and rigorous validation. We introduce K-Dense Analyst, a hierarchical multi-agent system that achieves autonomous bioinformatics analysis through a dual-loop architecture. K-Dense Analyst, part of the broader K-Dense platform, couples planning with validated execution using specialized agents to decompose complex objectives into executable, verifiable tasks within secure computational environments. On BixBench, a comprehensive benchmark for open-ended biological analysis, K-Dense Analyst achieves 29.2% accuracy, surpassing the best-performing language model (GPT-5) by 6.3 percentage points, a relative improvement of roughly 27% over what is widely considered the most powerful LLM available. Remarkably, K-Dense Analyst achieves this performance using Gemini 2.5 Pro, which attains only 18.3% accuracy when used directly, demonstrating that our architectural innovations unlock capabilities far beyond the underlying model’s baseline performance. Our insights demonstrate that autonomous scientific reasoning requires more than enhanced language models; it demands purpose-built systems that can bridge the gap between high-level scientific objectives and low-level computational execution. These results represent a significant advance toward fully autonomous computational biologists capable of accelerating discovery across the life sciences.
现代生物信息学分析的复杂性在数据生成与科学洞见的形成之间造成了关键鸿沟。尽管大型语言模型(LLMs)在科学推理方面展现出潜力,但在处理需要迭代计算、工具整合和严格验证的真实分析工作流时仍存在根本性局限。我们提出了 K-Dense Analyst,这是一种分层多智能体系统,通过双回路架构实现自主生物信息学分析。作为更大 K-Dense 平台的一部分,K-Dense Analyst 将计划与经验证的执行相结合,使用专门的智能体将复杂目标分解为可执行且可验证的任务,并在安全的计算环境中运行。在用于开放式生物分析的综合基准 BixBench 上,K-Dense Analyst 的准确率为 29.2%,比表现最好的语言模型(GPT-5)高出 6.3 个百分点,相对于广泛认为最强大的 LLM,提升接近 27%。 值得注意的是,K-Dense Analyst 使用 Gemini 2.5 Pro 就能达到这种性能,而直接使用 Gemini 2.5 Pro 的准确率仅为 18.3%,这表明我们的架构创新激发的能力远超该基础模型的基线表现。我们的见解表明,要实现自主的科学推理,不仅需要更强的语言模型,还需要专门构建的系统,能够弥合高级科学目标与低级计算执行之间的差距。这些结果代表了朝着能够在生命科学领域加速发现的完全自主计算生物学家迈出的重要一步。
Subjects: Artificial Intelligence, Multiagent Systems, Genomics, Quantitative Methods 主题:人工智能, 多智能体系统, 基因组学, 定量方法
Publish: 2025-08-09 16:59:55 UTC 发布时间:2025-08-09 16:59:55 UTC
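The #37 abstract mixes percentage points with relative improvement, which is easy to misread. The arithmetic below uses only the three numbers quoted in the abstract; GPT-5's accuracy is not stated there and is inferred here from the 6.3-point margin, and the variable names are illustrative.

```python
# Accuracy figures on BixBench as quoted in the abstract (percent).
k_dense_analyst = 29.2
gemini_direct = 18.3
margin_over_gpt5 = 6.3  # percentage points

# Assumption: GPT-5's accuracy is implied by the quoted margin.
gpt5_implied = k_dense_analyst - margin_over_gpt5

def relative_gain(new, old):
    """Relative improvement of `new` over `old` as a fraction."""
    return (new - old) / old

print(f"implied GPT-5 accuracy: {gpt5_implied:.1f}%")
print(f"relative gain over GPT-5: {relative_gain(k_dense_analyst, gpt5_implied):.1%}")
print(f"relative gain over direct Gemini 2.5 Pro: "
      f"{relative_gain(k_dense_analyst, gemini_direct):.1%}")
```

The implied GPT-5 accuracy is 22.9%, so 6.3 points corresponds to a relative gain of about 27.5%; against Gemini 2.5 Pro used directly, the 10.9-point gap is a relative gain of about 59.6%.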
#38 MultiMedEdit: A Scenario-Aware Benchmark for Evaluating Knowledge Editing in Medical VQA #38 MultiMedEdit:一个用于评估医疗视觉问答中知识编辑的情景感知基准
Authors: [Shengtao Wen](https://arxiv.org/search/?searchtype=author&query=Shengtao Wen), [Haodong Chen](https://arxiv.org/search/?searchtype=author&query=Haodong Chen), [Yadong Wang](https://arxiv.org/search/?searchtype=author&query=Yadong Wang), [Zhongying Pan](https://arxiv.org/search/?searchtype=author&query=Zhongying Pan), [Xiang Chen](https://arxiv.org/search/?searchtype=author&query=Xiang Chen), [Yu Tian](https://arxiv.org/search/?searchtype=author&query=Yu Tian), [Bo Qian](https://arxiv.org/search/?searchtype=author&query=Bo Qian), [Dong Liang](https://arxiv.org/search/?searchtype=author&query=Dong Liang), [Sheng-Jun Huang](https://arxiv.org/search/?searchtype=author&query=Sheng-Jun Huang) 作者:温胜涛、陈昊东、王亚东、潘仲颖、陈翔、田宇、钱博、梁东、黄胜军
Knowledge editing (KE) provides a scalable approach for updating factual knowledge in large language models without full retraining. While previous studies have demonstrated effectiveness in general domains and medical QA tasks, little attention has been paid to KE in multimodal medical scenarios. Unlike text-only settings, medical KE demands integrating updated knowledge with visual reasoning to support safe and interpretable clinical decisions. To address this gap, we propose MultiMedEdit, the first benchmark tailored to evaluating KE in clinical multimodal tasks. Our framework spans both understanding and reasoning task types, defines a three-dimensional metric suite (reliability, generality, and locality), and supports cross-paradigm comparisons across general and domain-specific models. We conduct extensive experiments under single-editing and lifelong-editing settings. Results suggest that current methods struggle with generalization and long-tail reasoning, particularly in complex clinical workflows. We further present an efficiency analysis (e.g., edit latency, memory footprint), revealing practical trade-offs in real-world deployment across KE paradigms. Overall, MultiMedEdit not only reveals the limitations of current approaches but also provides a solid foundation for developing clinically robust knowledge editing techniques in the future. 知识编辑(KE)为在大型语言模型中更新事实知识提供了一种可扩展的方法,无需完全重新训练。尽管先前的研究已证明其在通用领域和医疗问答任务中的有效性,但在多模态医疗场景下的 KE 却鲜有关注。与仅文本设置不同,医疗知识编辑要求将更新后的知识与视觉推理结合,以支持安全且可解释的临床决策。为填补这一空白,我们提出了 MultiMedEdit,这是首个专门用于评估临床多模态任务中知识编辑的基准。我们的框架涵盖理解和推理两类任务,定义了一个三维度的评估指标体系(可靠性、通用性和局部性),并支持在通用与特定领域模型之间进行跨范式比较。我们在单次编辑和终身编辑设置下进行了大量实验。结果表明,当前方法在泛化和长尾推理方面存在困难,尤其是在复杂的临床工作流程中。我们还给出了效率分析(例如,编辑延迟、内存占用),揭示了不同知识编辑范式在现实部署中的实际权衡。 总体而言,MultiMedEdit 不仅揭示了当前方法的局限性,还为未来开发临床上稳健的知识编辑技术提供了坚实的基础。
Subjects: Artificial Intelligence, Computation and Language, Machine Learning, Multimedia 主题:人工智能,计算与语言,机器学习,多媒体
Publish: 2025-08-09 15:36:08 UTC 发布:2025-08-09 15:36:08 协调世界时
#39 Efficient and Reliable Hitting-Set Computations for the Implicit Hitting Set Approach #39 高效且可靠的隐式命中集方法的命中集计算
Authors: [Hannes Ihalainen](https://arxiv.org/search/?searchtype=author&query=Hannes Ihalainen), [Dieter Vandesande](https://arxiv.org/search/?searchtype=author&query=Dieter Vandesande), [André Schidler](https://arxiv.org/search/?searchtype=author&query=André Schidler), [Jeremias Berg](https://arxiv.org/search/?searchtype=author&query=Jeremias Berg), [Bart Bogaerts](https://arxiv.org/search/?searchtype=author&query=Bart Bogaerts), [Matti Järvisalo](https://arxiv.org/search/?searchtype=author&query=Matti Järvisalo) 作者:Hannes Ihalainen,Dieter Vandesande,André Schidler,Jeremias Berg,Bart Bogaerts,Matti Järvisalo
The implicit hitting set (IHS) approach offers a general framework for solving computationally hard combinatorial optimization problems declaratively. IHS iterates between a decision oracle used for extracting sources of inconsistency and an optimizer for computing so-called hitting sets (HSs) over the accumulated sources of inconsistency. While the decision oracle is language-specific, the optimizer is usually instantiated through integer programming. We explore alternative algorithmic techniques for hitting set optimization based on different ways of employing pseudo-Boolean (PB) reasoning as well as stochastic local search. We extensively evaluate the practical feasibility of the alternatives, in particular in the context of pseudo-Boolean (0-1 IP) optimization as one of the most recent instantiations of IHS. Our results highlight a trade-off between efficiency and reliability: while a commercial IP solver turns out to remain the most effective way to instantiate HS computations, it can cause correctness issues due to numerical instability; in fact, we show that exact HS computations instantiated via PB reasoning can be made competitive with a numerically exact IP solver. Furthermore, the use of PB reasoning as a basis for HS computations allows for obtaining certificates for the correctness of IHS computations, generally applicable to any IHS instantiation in which reasoning in the declarative language at hand can be captured in the PB-based proof format we employ. 隐式命中集(IHS)方法为以声明式方式解决计算上困难的组合优化问题提供了一个通用框架。IHS 在用于提取不一致来源的判定 Oracle 与用于在累积的不一致来源上计算所谓命中集(HS)的优化器之间反复迭代。尽管判定 Oracle 与语言相关,优化器通常通过整数规划来实例化。我们探索了基于不同方式利用伪布尔(PB)推理以及基于随机局部搜索的命中集优化的替代算法技术。我们广泛评估了这些替代方法的实际可行性,特别是在伪布尔(0-1 整数规划)优化这一 IHS 最近的实例化之一的背景下。结果凸显了效率与可靠性之间的权衡:尽管商业整数规划求解器仍是实例化 HS 计算的最有效方式,但它可能因数值不稳定性而导致正确性问题;事实上,我们展示了通过 PB 推理实例化的精确 HS 计算可以与数值上精确的整数规划求解器竞争。此外,将 PB 推理作为 HS 计算的基础,可用于为 IHS 计算的正确性获取证明证书,这通常适用于任何 IHS 实例,只要所用声明性语言的推理能够以我们采用的基于 PB 的证明格式来表示。
Subjects: Artificial Intelligence, Data Structures and Algorithms 主题:人工智能、数据结构与算法
Publish: 2025-08-09 15:27:36 UTC 发布:2025-08-09 15:27:36 UTC
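The oracle/optimizer alternation described in #39 can be sketched on a toy problem. This is only an illustration of the generic IHS loop, not the paper's solver: the brute-force `min_cost_hitting_set` stands in for an IP or PB optimizer, and `toy_oracle` (which looks up a hidden list of cores) stands in for a real decision oracle such as a SAT solver. All names and data are hypothetical.

```python
from itertools import combinations

def min_cost_hitting_set(cores, costs):
    """Brute-force optimizer: cheapest element subset intersecting every core."""
    elements = sorted(costs)
    best, best_cost = None, float("inf")
    for size in range(len(elements) + 1):
        for subset in combinations(elements, size):
            chosen = set(subset)
            if all(chosen & core for core in cores):
                cost = sum(costs[e] for e in chosen)
                if cost < best_cost:
                    best, best_cost = chosen, cost
    return best, best_cost

def toy_oracle(candidate, hidden_cores):
    """Return a source of inconsistency the candidate misses, or None."""
    for core in hidden_cores:
        if not (candidate & core):
            return core
    return None

def implicit_hitting_set(costs, hidden_cores):
    """Alternate optimizer and oracle until the oracle finds nothing to add."""
    cores = []
    while True:
        candidate, cost = min_cost_hitting_set(cores, costs)
        core = toy_oracle(candidate, hidden_cores)
        if core is None:
            return candidate, cost, len(cores)
        cores.append(core)

costs = {"a": 1, "b": 3, "c": 1, "d": 2}
hidden = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
solution, total, cores_used = implicit_hitting_set(costs, hidden)
print(sorted(solution), total, cores_used)  # prints ['a', 'c'] 2 2
```

In this run the optimum is reached after extracting only two of the three hidden cores, which is the point of keeping the hitting-set problem implicit: the optimizer only ever sees the cores the oracle has actually produced.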
#40 Simulating Biological Intelligence: Active Inference with Experiment-Informed Generative Model #40 模拟生物智能:基于实验信息的生成模型的主动推断
Authors: [Aswin Paul](https://arxiv.org/search/?searchtype=author&query=Aswin Paul), [Moein Khajehnejad](https://arxiv.org/search/?searchtype=author&query=Moein Khajehnejad), [Forough Habibollahi](https://arxiv.org/search/?searchtype=author&query=Forough Habibollahi), [Brett J. Kagan](https://arxiv.org/search/?searchtype=author&query=Brett J. Kagan), [Adeel Razi](https://arxiv.org/search/?searchtype=author&query=Adeel Razi) 作者:Aswin Paul、Moein Khajehnejad、Forough Habibollahi、Brett J. Kagan、Adeel Razi
With recent and rapid advancements in artificial intelligence (AI), understanding the foundation of purposeful behaviour in autonomous agents is crucial for developing safe and efficient systems. While artificial neural networks have dominated the path to AI, recent studies are exploring the potential of biologically based systems, such as networks of living biological neurons. Along with promises of high power and data efficiency, these systems may also inform more explainable and biologically plausible models. In this work, we propose a framework rooted in active inference, a general theory of behaviour, to model decision-making in embodied agents. Using experiment-informed generative models, we simulate decision-making processes in a simulated game-play environment, mirroring experimental setups that use biological neurons. Our results demonstrate learning in these agents, providing insights into the role of memory-based learning and predictive planning in intelligent decision-making. This work contributes to the growing field of explainable AI by offering a biologically grounded and scalable approach to understanding purposeful behaviour in agents. 随着人工智能(AI)近年来的快速发展,理解自主代理中有目的行为的基础对于开发安全且高效的系统至关重要。尽管人工神经网络主导了通往 AI 的道路,近来的研究正在探索基于生物的系统的潜力,例如由活体生物神经元组成的网络。除了在能量和数据效率方面的潜在优势外,这些系统还可能有助于形成更具可解释性和生物学可行性的模型。在这项工作中,我们提出了一个基于主动推理(一种通用的行为理论)的框架,用于建模具身代理的决策过程。通过以实验为依据的生成模型,我们在一个模拟的游戏环境中模拟了决策过程,仿照使用生物神经元的实验设置。我们的结果展示了这些代理的学习能力,并为基于记忆的学习和预测性规划在智能决策中的作用提供了见解。这项工作提供了一种有生物学根基且可扩展的方法来理解代理中的有目的行为,从而推动了可解释 AI 这一不断发展的领域。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-08-09 13:26:38 UTC 发布:2025-08-09 13:26:38 协调世界时(UTC)
#41 DSperse: A Framework for Targeted Verification in Zero-Knowledge Machine Learning #41 DSperse:一种用于零知识机器学习中有针对性验证的框架
Authors: [Dan Ivanov](https://arxiv.org/search/?searchtype=author&query=Dan Ivanov), [Tristan Freiberg](https://arxiv.org/search/?searchtype=author&query=Tristan Freiberg), [Haruna Isah](https://arxiv.org/search/?searchtype=author&query=Haruna Isah) 作者:Dan Ivanov、Tristan Freiberg、Haruna Isah
DSperse is a modular framework for distributed machine learning inference with strategic cryptographic verification. Operating within the emerging paradigm of distributed zero-knowledge machine learning, DSperse avoids the high cost and rigidity of full-model circuitization by enabling targeted verification of strategically chosen subcomputations. These verifiable segments, or “slices”, may cover part or all of the inference pipeline, with global consistency enforced through audit, replication, or economic incentives. This architecture supports a pragmatic form of trust minimization, localizing zero-knowledge proofs to the components where they provide the greatest value. We evaluate DSperse using multiple proving systems and report empirical results on memory usage, runtime, and circuit behavior under sliced and unsliced configurations. By allowing proof boundaries to align flexibly with the model’s logical structure, DSperse supports scalable, targeted verification strategies suited to diverse deployment needs. DSperse 是一个用于分布式机器学习推理的模块化框架,配备有策略性的密码学验证。运行在新兴的分布式零知识机器学习范式内,DSperse 通过对策略性选择的子计算进行有针对性的验证,避免了对整模型电路化的高成本和僵化。这些可验证的片段或“切片”可以覆盖推理管线的部分或全部,且通过审计、复制或经济激励来强制实施全局一致性。该架构支持一种务实的信任最小化形式,将零知识证明本地化到能提供最大价值的组件上。我们使用多种证明系统对 DSperse 进行评估,并报告了在切片与非切片配置下的内存使用、运行时间和电路行为的实证结果。通过允许证明边界灵活地与模型的逻辑结构对齐,DSperse 支持适应多样部署需求的可扩展、有针对性的验证策略。
Subjects: Artificial Intelligence, Cryptography and Security, Distributed, Parallel, and Cluster Computing, Machine Learning 主题:人工智能、密码学与安全、分布式、并行与集群计算、机器学习
Publish: 2025-08-09 12:38:18 UTC 发表:2025-08-09 12:38:18 UTC
#42 MASteer: Multi-Agent Adaptive Steer Strategy for End-to-End LLM Trustworthiness Repair #42 MASteer:用于端到端 LLM 可信性修复的多智能体自适应引导策略
Authors: [Changqing Li](https://arxiv.org/search/?searchtype=author&query=Changqing Li), [Tianlin Li](https://arxiv.org/search/?searchtype=author&query=Tianlin Li), [Xiaohan Zhang](https://arxiv.org/search/?searchtype=author&query=Xiaohan Zhang), [Aishan Liu](https://arxiv.org/search/?searchtype=author&query=Aishan Liu), [Li Pan](https://arxiv.org/search/?searchtype=author&query=Li Pan) 作者:李长青、李天麟、张晓涵、刘艾珊、潘励
Large Language Models (LLMs) face persistent and evolving trustworthiness issues, motivating developers to seek automated and flexible repair methods that enable convenient deployment across diverse scenarios. Existing repair methods like supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) are costly and slow, while prompt engineering lacks robustness and scalability. Representation engineering, which steers model behavior by injecting targeted concept vectors during inference, offers a lightweight, training-free alternative. However, current approaches depend on manually crafted samples and fixed steering strategies, limiting automation and adaptability. To overcome these challenges, we propose MASteer, the first end-to-end framework for trustworthiness repair in LLMs based on representation engineering. MASteer integrates two core components: AutoTester, a multi-agent system that generates diverse, high-quality steer samples tailored to developer needs; and AutoRepairer, which constructs adaptive steering strategies with anchor vectors for automated, context-aware strategy selection during inference. Experiments on standard and customized trustworthiness tasks show MASteer consistently outperforms baselines, improving metrics by 15.36% on LLaMA-3.1-8B-Chat and 4.21% on Qwen-3-8B-Chat, while maintaining general model capabilities. MASteer demonstrates strong robustness, generalization, and practical value for scalable, efficient trustworthiness repair. 
大型语言模型 (LLMs) 面临持续且不断演变的可信性问题,这促使开发者寻求自动化且灵活的修复方法,以便在多种场景中方便部署。现有的修复方法如监督微调 (SFT) 和基于人类反馈的强化学习 (RLHF) 成本高且耗时,而提示工程缺乏鲁棒性和可扩展性。表示工程通过在推理时注入有针对性的概念向量来引导模型行为,提供了一种轻量、无需训练的替代方案。然而,现有方法依赖人工制作的样本和固定的引导策略,限制了自动化和适应性。为了解决这些挑战,我们提出了 MASteer,这是首个基于表示工程、用于 LLMs 可信性修复的端到端框架。MASteer 集成了两个核心组件:AutoTester,一个多智能体系统,生成针对开发者需求的多样化、高质量引导样本;以及 AutoRepairer,它构建带锚向量的自适应引导策略,以便在推理时实现自动、上下文感知的策略选择。 在标准和定制的可信性任务上的实验表明,MASteer 始终优于基线,在 LLaMA-3.1-8B-Chat 上将各项指标提升了 15.36%,在 Qwen-3-8B-Chat 上提升了 4.21%,同时保持了模型的一般能力。MASteer 展现出强大的鲁棒性、泛化能力及可扩展、高效的可信性修复的实用价值。
Subjects: Artificial Intelligence, Machine Learning 主题:人工智能、机器学习
Publish: 2025-08-09 12:20:00 UTC 发布:2025-08-09 12:20:00 UTC
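The #42 abstract describes representation engineering as injecting a concept vector at inference time, with anchor vectors used to pick a strategy per input. The abstract does not say how the selection works, so the sketch below is purely an assumed illustration: it picks the strategy whose anchor has the highest cosine similarity to the hidden state and adds its scaled steer vector, or leaves the state untouched below a threshold. The function names, the threshold, and the two-dimensional vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def steer(hidden, strategies, threshold=0.5):
    """Apply the best-matching strategy, given as (anchor, steer_vector, strength).

    Assumed selection rule (not from the paper): highest cosine similarity
    between the hidden state and the anchor; no steering below `threshold`.
    """
    anchor, vector, strength = max(strategies, key=lambda s: cosine(hidden, s[0]))
    if cosine(hidden, anchor) < threshold:
        return list(hidden)
    return [h + strength * v for h, v in zip(hidden, vector)]

# Hypothetical 2-D strategies: each anchor marks the inputs it should apply to.
strategies = [
    ([1.0, 0.0], [0.0, 1.0], 2.0),
    ([0.0, 1.0], [1.0, 0.0], 1.0),
]
print(steer([1.0, 0.0], strategies))    # matches the first anchor
print(steer([-1.0, -1.0], strategies))  # matches no anchor, left unchanged
```

The threshold branch is what keeps general capabilities intact in this toy version: inputs that resemble no anchor pass through unmodified.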
#43 DatasetResearch: Benchmarking Agent Systems for Demand-Driven Dataset Discovery #43 DatasetResearch:针对需求驱动的数据集发现的代理系统基准测试
Authors: [Keyu Li](https://arxiv.org/search/?searchtype=author&query=Keyu Li), [Mohan Jiang](https://arxiv.org/search/?searchtype=author&query=Mohan Jiang), [Dayuan Fu](https://arxiv.org/search/?searchtype=author&query=Dayuan Fu), [Yunze Wu](https://arxiv.org/search/?searchtype=author&query=Yunze Wu), [Xiangkun Hu](https://arxiv.org/search/?searchtype=author&query=Xiangkun Hu), [Dequan Wang](https://arxiv.org/search/?searchtype=author&query=Dequan Wang), [Pengfei Liu](https://arxiv.org/search/?searchtype=author&query=Pengfei Liu) 作者:李柯宇、姜默涵、傅大元、吴云泽、胡翔坤、王德全、刘鹏飞
The rapid advancement of large language models has fundamentally shifted the bottleneck in AI development from computational power to data availability, with countless valuable datasets remaining hidden across specialized repositories, research appendices, and domain platforms. As reasoning capabilities and deep research methodologies continue to evolve, a critical question emerges: can AI agents transcend conventional search to systematically discover any dataset that meets specific user requirements, enabling truly autonomous demand-driven data curation? We introduce DatasetResearch, the first comprehensive benchmark evaluating AI agents’ ability to discover and synthesize datasets from 208 real-world demands across knowledge-intensive and reasoning-intensive tasks. Our tri-dimensional evaluation framework reveals a stark reality: even advanced deep research systems score only 22% on our challenging DatasetResearch-pro subset, exposing the vast gap between current capabilities and perfect dataset discovery. Our analysis uncovers a fundamental dichotomy: search agents excel at knowledge tasks through retrieval breadth, while synthesis agents dominate reasoning challenges via structured generation, yet both catastrophically fail on “corner cases” outside existing distributions. These findings establish the first rigorous baseline for dataset discovery agents and illuminate the path toward AI systems capable of finding any dataset in the digital universe. Our benchmark and comprehensive analysis provide the foundation for the next generation of self-improving AI systems and are publicly available at https://github.com/GAIR-NLP/DatasetResearch.
大型语言模型的快速发展已将人工智能开发的瓶颈从计算能力根本性地转移到数据可得性——无数有价值的数据集仍隐藏在专业存储库、研究附录和领域平台中。随着推理能力和深度研究方法的不断演进,一个关键问题浮现:AI 代理能否超越传统搜索,系统性地发现满足特定用户需求的任何数据集,从而实现真正的按需自主数据策划?我们提出了 DatasetResearch,这是首个全面的基准,用于评估 AI 代理从 208 个真实需求中发现并综合数据集的能力,这些需求覆盖知识密集型和推理密集型任务。我们的三维评估框架揭示了一个严峻的现实:即便是先进的深度研究系统在我们具有挑战性的 DatasetResearch-pro 子集中也仅取得了 22%的得分,暴露出现有能力与完美数据集发现之间的巨大差距。 我们的分析揭示了一个根本性二分法——搜索型代理通过广泛检索在知识任务上表现出色,而合成型代理通过结构化生成在推理挑战上占据主导地位——但两者在现有分布之外的“极端情况”上都出现灾难性失败。 这些发现为数据集发现代理建立了第一个严格基线,并为构建能够在数字世界中找到任何数据集的人工智能系统指明了道路。 我们的基准和综合分析为下一代自我改进型人工智能系统提供了基础,所有内容已在 https://github.com/GAIR-NLP/DatasetResearch 公布。
Subjects: Artificial Intelligence, Computation and Language 主题:人工智能,计算与语言
Publish: 2025-08-09 12:15:08 UTC 发布:2025-08-09 12:15:08 UTC
#44 Large Language Models Do Not Simulate Human Psychology #44 大型语言模型并不模拟人类心理学
Authors: [Sarah Schröder](https://arxiv.org/search/?searchtype=author&query=Sarah Schröder), [Thekla Morgenroth](https://arxiv.org/search/?searchtype=author&query=Thekla Morgenroth), [Ulrike Kuhl](https://arxiv.org/search/?searchtype=author&query=Ulrike Kuhl), [Valerie Vaquet](https://arxiv.org/search/?searchtype=author&query=Valerie Vaquet), [Benjamin Paaßen](https://arxiv.org/search/?searchtype=author&query=Benjamin Paaßen) 作者:Sarah Schröder、Thekla Morgenroth、Ulrike Kuhl、Valerie Vaquet、Benjamin Paaßen
Large Language Models (LLMs), such as ChatGPT, are increasingly used in research, ranging from simple writing assistance to complex data annotation tasks. Recently, some research has suggested that LLMs may even be able to simulate human psychology and can, hence, replace human participants in psychological studies. We caution against this approach. We provide conceptual arguments against the hypothesis that LLMs simulate human psychology. We then present empirical evidence illustrating our arguments by demonstrating that slight changes to wording that correspond to large changes in meaning lead to notable discrepancies between LLMs’ and human responses, even for the recent CENTAUR model that was specifically fine-tuned on psychological responses. Additionally, different LLMs show very different responses to novel items, further illustrating their lack of reliability. We conclude that LLMs do not simulate human psychology and recommend that psychological researchers should treat LLMs as useful but fundamentally unreliable tools that need to be validated against human responses for every new application. 大型语言模型(LLMs),例如 ChatGPT,正日益被用于研究,从简单的写作辅助到复杂的数据注释任务。最近有些研究提出,LLMs 甚至可能能够模拟人类心理,因此可以在心理学研究中替代人类参与者。我们对这种做法提出警示。我们提供了反对 LLMs 能够模拟人类心理这一假设的概念性论证。随后我们通过实证证据来说明这些论点:对措辞做出细微但使含义发生巨大变化的改动,会导致 LLMs 与人类反应之间出现显著差异,即使是最近专门针对心理学反应进行微调的 CENTAUR 模型也是如此。此外,不同的 LLMs 对新颖条目的反应差异很大,进一步说明了它们缺乏可靠性。我们得出结论:LLMs 并不模拟人类心理,并建议心理学研究者将 LLMs 视为有用但从根本上不可靠的工具,对于每一种新应用都需通过与人类反应的比较来进行验证。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-08-09 11:56:59 UTC 发布:2025-08-09 11:56:59 UTC
#45 Intrinsic Explainability of Multimodal Learning for Crop Yield Prediction #45 多模态学习在作物产量预测中的内在可解释性
Authors: [Hiba Najjar](https://arxiv.org/search/?searchtype=author&query=Hiba Najjar), [Deepak Pathak](https://arxiv.org/search/?searchtype=author&query=Deepak Pathak), [Marlon Nuske](https://arxiv.org/search/?searchtype=author&query=Marlon Nuske), [Andreas Dengel](https://arxiv.org/search/?searchtype=author&query=Andreas Dengel) 作者:Hiba Najjar、Deepak Pathak、Marlon Nuske、Andreas Dengel
Multimodal learning enables various machine learning tasks to benefit from diverse data sources, effectively mimicking the interplay of different factors in real-world applications, particularly in agriculture. While the heterogeneous nature of involved data modalities may necessitate the design of complex architectures, the model interpretability is often overlooked. In this study, we leverage the intrinsic explainability of Transformer-based models to explain multimodal learning networks, focusing on the task of crop yield prediction at the subfield level. The large datasets used cover various crops, regions, and years, and include four different input modalities: multispectral satellite and weather time series, terrain elevation maps and soil properties. Based on the self-attention mechanism, we estimate feature attributions using two methods, namely the Attention Rollout (AR) and Generic Attention (GA), and evaluate their performance against Shapley-based model-agnostic estimations, Shapley Value Sampling (SVS). Additionally, we propose the Weighted Modality Activation (WMA) method to assess modality attributions and compare it with SVS attributions. Our findings indicate that Transformer-based models outperform other architectures, specifically convolutional and recurrent networks, achieving R² scores that are higher by 0.10 and 0.04 at the subfield and field levels, respectively. AR is shown to provide more robust and reliable temporal attributions, as confirmed through qualitative and quantitative evaluation, compared to GA and SVS values. Information about crop phenology stages was leveraged to interpret the explanation results in the light of established agronomic knowledge.
Furthermore, modality attributions revealed varying patterns across the two methods compared.[…] 多模态学习使各种机器学习任务能够从多样化的数据来源中受益,有效模拟现实应用中不同因素的相互作用,尤其是在农业领域。尽管所涉数据模态的异构性可能需要设计复杂的架构,但模型可解释性常常被忽视。在本研究中,我们利用基于 Transformer 模型的内在可解释性来解释多模态学习网络,聚焦于子田块级作物产量预测任务。所用的大型数据集覆盖多种作物、区域和年份,包含四种不同的输入模态:多光谱卫星和气象时间序列、地形高程图以及土壤特性。基于自注意力机制,我们使用两种方法估算特征归因,分别为注意力展开(Attention Rollout,AR)和通用注意力(Generic Attention,GA),并将它们的性能与基于 Shapley 的模型无关估计方法,即 Shapley 值采样(Shapley Value Sampling,SVS)进行比较。此外,我们提出了加权模态激活(Weighted Modality Activation,WMA)方法以评估模态归因,并将其与 SVS 归因进行对比。我们的研究结果表明,基于 Transformer 的模型优于其他架构,特别是卷积网络和循环网络,在子田块和田块层面的 R² 得分分别高出 0.10 和 0.04。与 GA 和 SVS 值相比,AR 被证明能够提供更稳健且更可靠的时间归因,这一点通过定性和定量评估得到了证实。作物物候阶段的信息被用来根据既定的农艺学知识解读解释结果。此外,模态归因在两种比较方法中显示出不同的模式。[…]
Subjects: Artificial Intelligence, Machine Learning 主题:人工智能、机器学习
Publish: 2025-08-09 11:09:10 UTC 发表:2025-08-09 11:09:10 UTC
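The abstract of #45 names Attention Rollout (AR) as one of its attribution methods but gives no formula. As a rough illustration only, the sketch below follows the commonly cited formulation of attention rollout: average each layer's attention with the identity matrix to account for the residual connection, renormalize the rows, and multiply the layers together. The function names and the toy matrices are assumptions made for illustration; they are not the authors' code or data.

```python
# Illustrative sketch of attention rollout; not the authors' implementation.
# Input: one head-averaged attention matrix per layer (rows sum to 1), first layer first.

def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def attention_rollout(layers):
    """Return the token-to-token attribution matrix accumulated over all layers."""
    n = len(layers[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layers:
        # Mix in the identity to model the residual (skip) connection.
        mixed = [[0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
                 for i in range(n)]
        # Renormalize each row so it remains a probability distribution.
        mixed = [[v / sum(row) for v in row] for row in mixed]
        # Later layers are applied on the left of the running product.
        rollout = matmul(mixed, rollout)
    return rollout


# Toy example with 3 time steps and 2 layers (values are made up).
toy_layers = [
    [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.1, 0.1, 0.8]],
    [[0.5, 0.25, 0.25], [0.2, 0.6, 0.2], [0.3, 0.3, 0.4]],
]
rollout = attention_rollout(toy_layers)
for row in rollout:
    print([round(v, 3) for v in row])
```

Each row of the result can be read as how much every input time step contributes to one output position, which is the kind of temporal attribution the abstract compares against GA and SVS.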
#46 Automated Formalization via Conceptual Retrieval-Augmented LLMs #46 通过概念检索增强的 LLMs 实现自动形式化
Authors: [Wangyue Lu](https://arxiv.org/search/?searchtype=author&query=Wangyue Lu), [Lun Du](https://arxiv.org/search/?searchtype=author&query=Lun Du), [Sirui Li](https://arxiv.org/search/?searchtype=author&query=Sirui Li), [Ke Weng](https://arxiv.org/search/?searchtype=author&query=Ke Weng), [Haozhe Sun](https://arxiv.org/search/?searchtype=author&query=Haozhe Sun), [Hengyu Liu](https://arxiv.org/search/?searchtype=author&query=Hengyu Liu), [Minghe Yu](https://arxiv.org/search/?searchtype=author&query=Minghe Yu), [Tiancheng Zhang](https://arxiv.org/search/?searchtype=author&query=Tiancheng Zhang), [Ge Yu](https://arxiv.org/search/?searchtype=author&query=Ge Yu) 作者:Wangyue Lu、Lun Du、Sirui Li、Ke Weng、Haozhe Sun、Hengyu Liu、Minghe Yu、Tiancheng Zhang、Ge Yu
Interactive theorem provers (ITPs) require manual formalization, which is labor-intensive and demands expert knowledge. While automated formalization offers a potential solution, it faces two major challenges: model hallucination (e.g., undefined predicates, symbol misuse, and version incompatibility) and the semantic gap caused by ambiguous or missing premises in natural language descriptions. To address these issues, we propose CRAMF, a Concept-driven Retrieval-Augmented Mathematical Formalization framework. CRAMF enhances LLM-based autoformalization by retrieving formal definitions of core mathematical concepts, providing contextual grounding during code generation. However, applying retrieval-augmented generation (RAG) in this setting is non-trivial due to the lack of structured knowledge bases, the polymorphic nature of mathematical concepts, and the high precision required in formal retrieval. We introduce a framework for automatically constructing a concept-definition knowledge base from Mathlib4, the standard mathematical library for the Lean 4 theorem prover, indexing over 26,000 formal definitions and 1,000+ core mathematical concepts. To address conceptual polymorphism, we propose contextual query augmentation with domain- and application-level signals. In addition, we design a dual-channel hybrid retrieval strategy with reranking to ensure accurate and relevant definition retrieval. Experiments on miniF2F, ProofNet, and our newly proposed AdvancedMath benchmark show that CRAMF can be seamlessly integrated into LLM-based autoformalizers, yielding consistent improvements in translation accuracy, achieving up to 62.1% and an average of 29.9% relative improvement. 
交互式定理证明器(ITP)需要手动形式化,这既耗费大量人力又要求专家知识。尽管自动形式化提供了潜在的解决方案,但它面临两个主要挑战:模型幻觉(例如未定义的谓词、符号误用和版本不兼容)以及由自然语言描述中含糊或缺失前提引起的语义鸿沟。为了解决这些问题,我们提出了 CRAMF,一种以概念为驱动的检索增强数学形式化框架。CRAMF 通过检索核心数学概念的形式定义来增强基于 LLM 的自动形式化,在代码生成过程中提供上下文锚定。然而,在此情境下应用检索增强生成(RAG)并非易事,原因在于缺乏结构化知识库、数学概念的多态性以及形式检索所需的高精度。我们提出了一个框架,用于从 Mathlib4(Lean 4 定理证明器的标准数学库)自动构建概念—定义知识库,对 26,000 多个形式定义和 1,000+ 个核心数学概念进行了索引。 为了解决概念多态性问题,我们提出了结合领域和应用层信号的上下文查询增强方法。此外,我们设计了一种带重排序的双通道混合检索策略,以确保定义检索的准确性和相关性。在 miniF2F、ProofNet 以及我们新提出的 AdvancedMath 基准上的实验表明,CRAMF 可以无缝集成到基于 LLM 的自动形式化系统中,带来在翻译准确性方面的一致提升,最高可达 62.1%,平均相对提升为 29.9%。
Subjects: Artificial Intelligence, Machine Learning 主题:人工智能、机器学习
Publish: 2025-08-09 10:54:25 UTC 发布:2025-08-09 10:54:25 UTC
#47 GDBA Revisited: Unleashing the Power of Guided Local Search for Distributed Constraint Optimization #47 GDBA 再探:释放有引导的局部搜索在分布式约束优化中的威力
Authors: [Yanchen Deng](https://arxiv.org/search/?searchtype=author&query=Yanchen Deng), [Xinrun Wang](https://arxiv.org/search/?searchtype=author&query=Xinrun Wang), [Bo An](https://arxiv.org/search/?searchtype=author&query=Bo An) 作者:Yanchen Deng, Xinrun Wang, Bo An
Local search is an important class of incomplete algorithms for solving Distributed Constraint Optimization Problems (DCOPs) but it often converges to poor local optima. While GDBA provides a comprehensive rule set to escape premature convergence, its empirical benefits remain marginal on general-valued problems. In this work, we systematically examine GDBA and identify three factors that potentially lead to its inferior performance, i.e., over-aggressive constraint violation conditions, unbounded penalty accumulation, and uncoordinated penalty updates. To address these issues, we propose Distributed Guided Local Search (DGLS), a novel GLS framework for DCOPs that incorporates an adaptive violation condition to selectively penalize constraints with high cost, a penalty evaporation mechanism to control the magnitude of penalization, and a synchronization scheme for coordinated penalty updates. We theoretically show that the penalty values are bounded, and agents play a potential game in our DGLS. Our extensive empirical results on various standard benchmarks demonstrate the great superiority of DGLS over state-of-the-art baselines. Particularly, compared to Damped Max-sum with high damping factors (e.g., 0.7 or 0.9), our DGLS achieves competitive performance on general-valued problems, and outperforms it by significant margins (3.77%–66.3%) on structured problems in terms of anytime results. 局部搜索是解决分布式约束优化问题(DCOPs)的一类重要不完全算法,但它经常收敛到较差的局部最优解。虽然 GDBA 提供了一套全面的规则以逃避过早收敛,但其在一般值问题上的实际收益仍然有限。在这项工作中,我们系统地审视了 GDBA 并识别出可能导致其性能不佳的三个因素,即:过于激进的约束违规判定条件、无界的惩罚累积以及缺乏协调的惩罚更新。为了解决这些问题,我们提出了分布式引导局部搜索(DGLS),这是一种针对 DCOPs 的新型 GLS 框架,包含一种自适应的违规判定条件以选择性地对高成本约束施加惩罚、一种惩罚蒸发机制以控制惩罚强度以及一种用于协调惩罚更新的同步方案。我们在理论上证明了惩罚值是有界的,并且在我们的 DGLS 中智能体会进行一个势博弈。我们在各种标准基准上的大量实证结果表明 DGLS 相较于最先进基线方法具有显著优势。特别是,相较于具有高阻尼因子(例如 0.7 或 0.9)的阻尼最大和(Damped Max-sum),我们的 DGLS 在一般值问题上取得了有竞争力的表现,并且在结构化问题上在任意时刻结果方面以显著幅度(3.77%–66.3%)超越了它。
Subjects: Artificial Intelligence, Discrete Mathematics 主题:人工智能,离散数学
Publish: 2025-08-09 09:12:06 UTC
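The abstract of #47 describes a penalty evaporation mechanism that keeps penalties bounded, but it does not give the update rule. One simple way such a bound can arise, shown here purely as an assumption, is geometric decay: if every round multiplies the old penalty by (1 - rho) and adds at most a fixed increment, the penalty can never exceed increment / rho. The names `rho`, `increment`, `update_penalty`, and `simulate` are hypothetical and are not taken from the paper.

```python
# Hypothetical evaporation rule for illustration; the actual DGLS update may differ.

def update_penalty(penalty, violated, rho=0.1, increment=1.0):
    """Decay the old penalty, then add an increment if the constraint is violated."""
    decayed = (1.0 - rho) * penalty
    return decayed + (increment if violated else 0.0)


def simulate(violations, rho=0.1, increment=1.0):
    """Return the penalty trajectory for a sequence of violated / not-violated rounds."""
    penalty, history = 0.0, []
    for violated in violations:
        penalty = update_penalty(penalty, violated, rho, increment)
        history.append(penalty)
    return history


# Worst case: the constraint is violated in every round.
history = simulate([True] * 200)
bound = 1.0 / 0.1  # increment / rho, the limit of the geometric series
print(f"final penalty {history[-1]:.4f}, bound {bound:.1f}")

# Without evaporation (rho = 0) the same sequence grows linearly with the round count.
print("no evaporation:", simulate([True] * 200, rho=0.0)[-1])
```

The contrast between the two runs mirrors the "unbounded penalty accumulation" issue the abstract attributes to GDBA, under the stated assumption about the update form.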
#48 Pushdown Reward Machines for Reinforcement Learning
Authors: [Giovanni Varricchione](https://arxiv.org/search/?searchtype=author&query=Giovanni Varricchione), [Toryn Q. Klassen](https://arxiv.org/search/?searchtype=author&query=Toryn Q. Klassen), [Natasha Alechina](https://arxiv.org/search/?searchtype=author&query=Natasha Alechina), [Mehdi Dastani](https://arxiv.org/search/?searchtype=author&query=Mehdi Dastani), [Brian Logan](https://arxiv.org/search/?searchtype=author&query=Brian Logan), [Sheila A. McIlraith](https://arxiv.org/search/?searchtype=author&query=Sheila A. McIlraith)
Reward machines (RMs) are automata structures that encode (non-Markovian) reward functions for reinforcement learning (RL). RMs can reward any behaviour representable in regular languages and, when paired with RL algorithms that exploit RM structure, have been shown to significantly improve sample efficiency in many domains. In this work, we present pushdown reward machines (pdRMs), an extension of reward machines based on deterministic pushdown automata. pdRMs can recognize and reward temporally extended behaviours representable in deterministic context-free languages, making them more expressive than reward machines. We introduce two variants of pdRM-based policies, one which has access to the entire stack of the pdRM, and one which can only access the top k symbols (for a given constant k) of the stack. We propose a procedure to check when the two kinds of policies (for a given environment, pdRM, and constant k) achieve the same optimal expected reward. We then provide theoretical results establishing the expressive power of pdRMs, and space complexity results about the proposed learning problems. Finally, we provide experimental results showing how agents can be trained to perform tasks representable in deterministic context-free languages using pdRMs. 奖励机(RMs)是用于强化学习(RL)的自动机结构,用以编码(非马尔可夫)奖励函数。RMs 可以对任何正则语言可表示的行为给予奖励,并且当与利用 RM 结构的 RL 算法配对时,已被证明能在许多领域显著提高样本效率。在本工作中,我们提出了下推奖励机(pdRMs),这是一种基于确定性下推自动机的奖励机扩展。pdRMs 能够识别并奖励在确定性上下文无关语言中可表示的时序延展行为,使其比奖励机具有更强的表达力。我们引入了两种基于 pdRM 的策略变体,一种可以访问 pdRM 的整个栈,另一种只能访问栈的最顶部 k 个符号(对于给定常数 k )。我们提出了一个流程来检验在何种情况下(对于给定环境、pdRM 和常数 k )这两类策略能够达到相同的最优期望奖励。随后我们给出建立 pdRMs 表达能力的理论结果,以及关于所提出学习问题的空间复杂度结果。 最后,我们提供了实验结果,展示了如何使用 pdRMs 训练代理执行可表示为确定性上下文无关语言的任务。
Subjects: Artificial Intelligence, Machine Learning 主题:人工智能、机器学习
Publish: 2025-08-09 08:59:09 UTC 发布:2025-08-09 08:59:09 UTC
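To make the idea in #48 concrete, here is a minimal sketch of a reward machine with a stack. It rewards traces of the form "n pickups followed by n dropoffs", a deterministic context-free pattern that a finite-state reward machine cannot track for unbounded n. The class, the event labels `pickup` and `dropoff`, and the `visible_stack` helper (full stack versus top-k view) are assumptions for illustration, not the paper's formal definition.

```python
# Toy pushdown reward machine; labels and structure are illustrative assumptions.

class PushdownRewardMachine:
    """Rewards traces of the form pickup^n dropoff^n with n >= 1."""

    def __init__(self):
        self.state = "collect"   # collect -> deliver -> done, or failed
        self.stack = []

    def step(self, label):
        """Consume one event label and return the reward for this step."""
        if self.state in ("done", "failed"):
            return 0.0
        if label == "pickup":
            if self.state == "collect":
                self.stack.append("item")   # remember one more pending delivery
            else:
                self.state = "failed"       # pickup after delivery has started
            return 0.0
        if label == "dropoff":
            if not self.stack:
                self.state = "failed"       # nothing left to deliver
                return 0.0
            self.stack.pop()
            self.state = "deliver"
            if not self.stack:
                self.state = "done"         # every pickup has been matched
                return 1.0
            return 0.0
        return 0.0                          # other labels leave the machine unchanged

    def visible_stack(self, k=None):
        """Full stack if k is None, otherwise only the top k symbols."""
        if k is None:
            return tuple(self.stack)
        return tuple(self.stack[len(self.stack) - min(k, len(self.stack)):])


def run(labels):
    """Total reward collected along a sequence of event labels."""
    machine = PushdownRewardMachine()
    return sum(machine.step(label) for label in labels)


print(run(["pickup", "pickup", "dropoff", "dropoff"]))
print(run(["pickup", "pickup", "dropoff"]))
```

A policy with access to the whole stack would condition on `visible_stack()`, while the restricted variant mentioned in the abstract would only see something like `visible_stack(k)`.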
#49 MeteorPred: A Meteorological Multimodal Large Model and Dataset for Severe Weather Event Prediction #49 MeteorPred:用于强天气事件预测的气象多模态大模型与数据集
Authors: [Shuo Tang](https://arxiv.org/search/?searchtype=author&query=Shuo Tang), [Jian Xu](https://arxiv.org/search/?searchtype=author&query=Jian Xu), [Jiadong Zhang](https://arxiv.org/search/?searchtype=author&query=Jiadong Zhang), [Yi Chen](https://arxiv.org/search/?searchtype=author&query=Yi Chen), [Qizhao Jin](https://arxiv.org/search/?searchtype=author&query=Qizhao Jin), [Lingdong Shen](https://arxiv.org/search/?searchtype=author&query=Lingdong Shen), [Chenglin Liu](https://arxiv.org/search/?searchtype=author&query=Chenglin Liu), [Shiming Xiang](https://arxiv.org/search/?searchtype=author&query=Shiming Xiang) 作者:唐硕、徐建、张嘉栋、陈毅、金启照、沈岭东、刘成林、项世明
Timely and accurate severe weather warnings are critical for disaster mitigation. However, current forecasting systems remain heavily reliant on manual expert interpretation, introducing subjectivity and significant operational burdens. With the rapid development of AI technologies, the end-to-end “AI weather station” is gradually emerging as a new trend in predicting severe weather events. Three core challenges impede the development of an end-to-end AI severe weather system: (1) scarcity of severe weather event samples; (2) imperfect alignment between high-dimensional meteorological data and textual warnings; (3) existing multimodal language models are unable to handle high-dimensional meteorological data and struggle to fully capture the complex dependencies across temporal sequences, vertical pressure levels, and spatial dimensions. To address these challenges, we introduce MP-Bench, the first large-scale temporal multimodal dataset for severe weather event prediction, comprising 421,363 pairs of raw multi-year meteorological data and corresponding text captions, covering a wide range of severe weather scenarios across China. On top of this dataset, we develop a meteorology multimodal large model (MMLM) that directly ingests 4D meteorological inputs. In addition, it is designed to accommodate the unique characteristics of 4D meteorological data flow, incorporating three plug-and-play adaptive fusion modules that enable dynamic feature extraction and integration across temporal sequences, vertical pressure layers, and spatial dimensions. Extensive experiments on MP-Bench demonstrate that MMLM performs exceptionally well across multiple tasks, highlighting its effectiveness in severe weather understanding and marking a key step toward realizing automated, AI-driven weather forecasting systems. Our source code and dataset will be made publicly available.
及时且准确的强对流天气预警对减灾至关重要。然而,当前的预报系统仍高度依赖人工专家解读,这带来了主观性并造成了显著的运行负担。随着人工智能技术的快速发展,端到端的“AI 气象站”正逐渐成为预测强对流天气事件的新趋势。阻碍端到端 AI 强对流天气系统发展的三大核心挑战是: (1) 强对流天气事件样本稀缺; (2) 高维气象数据与文本预警之间对齐不完善;(3) 现有的多模态语言模型无法处理高维气象数据,难以充分捕捉时间序列、垂直气压层和空间维度间的复杂依赖关系。为了解决这些挑战,我们提出了 MP-Bench,这是首个用于强对流天气事件预测的大规模时间多模态数据集,包含 421,363 对原始多年气象数据与相应文本描述,覆盖中国范围内的多种强对流天气场景。 在此数据集基础上,我们开发了一个气象多模态大模型(MMLM),能够直接接收四维气象输入。此外,该模型旨在适应四维气象数据流的独特特性,集成了三个即插即用的自适应融合模块,使其能够在时间序列、垂直气压层和空间维度上动态提取和整合特征。在 MP-Bench 上的广泛实验表明,MMLM 在多项任务中表现卓越,突显了其在强天气理解方面的有效性,并标志着实现自动化、由 AI 驱动的天气预报系统的关键一步。我们的源代码和数据集将公开发布。
Subjects: Artificial Intelligence, Computer Vision and Pattern Recognition 主题:人工智能,计算机视觉与模式识别
Publish: 2025-08-09 06:54:41 UTC 发布:2025-08-09 06:54:41 UTC
#50 MDK12-Bench: A Comprehensive Evaluation of Multimodal Large Language Models on Multidisciplinary Exams #50 MDK12-Bench:对跨学科考试中的多模态大语言模型进行的全面评估
Authors: [Pengfei Zhou](https://arxiv.org/search/?searchtype=author&query=Pengfei Zhou), [Xiaopeng Peng](https://arxiv.org/search/?searchtype=author&query=Xiaopeng Peng), [Fanrui Zhang](https://arxiv.org/search/?searchtype=author&query=Fanrui Zhang), [Zhaopan Xu](https://arxiv.org/search/?searchtype=author&query=Zhaopan Xu), [Jiaxin Ai](https://arxiv.org/search/?searchtype=author&query=Jiaxin Ai), [Yansheng Qiu](https://arxiv.org/search/?searchtype=author&query=Yansheng Qiu), [Chuanhao Li](https://arxiv.org/search/?searchtype=author&query=Chuanhao Li), [Zhen Li](https://arxiv.org/search/?searchtype=author&query=Zhen Li), [Ming Li](https://arxiv.org/search/?searchtype=author&query=Ming Li), [Yukang Feng](https://arxiv.org/search/?searchtype=author&query=Yukang Feng), [Jianwen Sun](https://arxiv.org/search/?searchtype=author&query=Jianwen Sun), [Haoquan Zhang](https://arxiv.org/search/?searchtype=author&query=Haoquan Zhang), [Zizhen Li](https://arxiv.org/search/?searchtype=author&query=Zizhen Li), [Xiaofeng Mao](https://arxiv.org/search/?searchtype=author&query=Xiaofeng Mao), [Zekai Li](https://arxiv.org/search/?searchtype=author&query=Zekai Li), [Wangbo Zhao](https://arxiv.org/search/?searchtype=author&query=Wangbo Zhao), [Kai Wang](https://arxiv.org/search/?searchtype=author&query=Kai Wang), [Xiaojun Chang](https://arxiv.org/search/?searchtype=author&query=Xiaojun Chang), [Wenqi Shao](https://arxiv.org/search/?searchtype=author&query=Wenqi Shao), [Yang You](https://arxiv.org/search/?searchtype=author&query=Yang You), [Kaipeng Zhang](https://arxiv.org/search/?searchtype=author&query=Kaipeng Zhang) 作者:周鹏飞、彭晓鹏、张凡睿、徐兆磐、艾佳欣、邱彦盛、李传豪、李臻、李明、冯玉康、孙建文、张浩权、李紫臻、毛晓丰、李泽楷、赵望博、王凯、常晓军、邵文琦、尤扬、张凯鹏
Multimodal large language models (MLLMs), which integrate language and visual cues for problem-solving, are crucial for advancing artificial general intelligence (AGI). However, current benchmarks for measuring the intelligence of MLLMs suffer from limited scale, narrow coverage, and unstructured knowledge, offering only static and undifferentiated evaluations. To bridge this gap, we introduce MDK12-Bench, a large-scale multidisciplinary benchmark built from real-world K-12 exams spanning six disciplines with 141K instances and 6,225 knowledge points organized in a six-layer taxonomy. Covering five question formats with difficulty and year annotations, it enables comprehensive evaluation to capture the extent to which MLLMs perform over four dimensions: 1) difficulty levels, 2) temporal (cross-year) shifts, 3) contextual shifts, and 4) knowledge-driven reasoning. We propose a novel dynamic evaluation framework that introduces unfamiliar visual, textual, and question form shifts to challenge model generalization while improving benchmark objectivity and longevity by mitigating data contamination. We further evaluate knowledge-point reference-augmented generation (KP-RAG) to examine the role of knowledge in problem-solving. Key findings reveal limitations in current MLLMs in multiple aspects and provide guidance for enhancing model robustness, interpretability, and AI-assisted education. 多模态大语言模型(MLLMs)通过整合语言和视觉线索用于问题解决,对推进通用人工智能(AGI)至关重要。然而,目前用于衡量 MLLMs 智力的基准存在规模有限、覆盖范围狭窄和知识无结构化等问题,仅能提供静态且缺乏差异化的评估。为弥补这一空白,我们提出了 MDK12-Bench,这是一个大规模的多学科基准,基于真实世界的 K-12 考试构建,涵盖六个学科,共有 141K 条实例和以六层分类法组织的 6,225 个知识点。覆盖五种题型并带有难度和年份标注,它能够进行全面评估,以捕捉 MLLMs 在四个维度上的表现程度:1)难度层级,2)时间(跨年)变化,3)情境变化,4)基于知识的推理。我们提出了一种新颖的动态评估框架,该框架引入不熟悉的视觉、文本和题型变化,以挑战模型的泛化能力,同时通过缓解数据污染来提高基准的客观性和持久性。 我们进一步评估了知识点参考增强生成(KP-RAG),以检验知识在问题解决中的作用。关键发现揭示了当前多模态大语言模型在多方面的局限性,并为提升模型鲁棒性、可解释性和人工智能辅助教育提供了指导。
Subjects: Artificial Intelligence, Computers and Society 学科:人工智能,计算机与社会
Publish: 2025-08-09 06:21:10 UTC 发布:2025-08-09 06:21:10 UTC
#51 Multi-level Advantage Credit Assignment for Cooperative Multi-Agent Reinforcement Learning #51 多层优势信用分配用于合作多智能体强化学习
Authors: [Xutong Zhao](https://arxiv.org/search/?searchtype=author&query=Xutong Zhao), [Yaqi Xie](https://arxiv.org/search/?searchtype=author&query=Yaqi Xie) 作者:赵绪通,谢雅琦
Cooperative multi-agent reinforcement learning (MARL) aims to coordinate multiple agents to achieve a common goal. A key challenge in MARL is credit assignment, which involves assessing each agent’s contribution to the shared reward. Given the diversity of tasks, agents may perform different types of coordination, with rewards attributed to diverse and often overlapping agent subsets. In this work, we formalize the credit assignment level as the number of agents cooperating to obtain a reward, and address scenarios with multiple coexisting levels. We introduce a multi-level advantage formulation that performs explicit counterfactual reasoning to infer credits across distinct levels. Our method, Multi-level Advantage Credit Assignment (MACA), captures agent contributions at multiple levels by integrating advantage functions that reason about individual, joint, and correlated actions. Utilizing an attention-based framework, MACA identifies correlated agent relationships and constructs multi-level advantages to guide policy learning. Comprehensive experiments on challenging Starcraft v1&v2 tasks demonstrate MACA’s superior performance, underscoring its efficacy in complex credit assignment scenarios. 合作式多智能体强化学习(MARL)旨在协调多个智能体以实现共同目标。MARL 的一个关键挑战是信用分配,即评估每个智能体对共享奖励的贡献。鉴于任务的多样性,智能体可能执行不同类型的协同,奖励归因于多样且常常重叠的智能体子集。在本工作中,我们将信用分配层级形式化为为获得奖励而合作的智能体数量,并处理存在多重并存层级的情形。我们提出了一种多层次优势(advantage)形式,该形式通过显式的反事实推理来推断不同层级间的贡献。我们的方法——多层次优势信用分配(Multi-level Advantage Credit Assignment,MACA)通过整合对个体、联合和相关动作进行推理的优势函数,捕捉多层次的智能体贡献。利用基于注意力的框架,MACA 识别相关的智能体关系并构建多层次优势以指导策略学习。 在具有挑战性的星际争霸 v1 与 v2 任务上的全面实验证明了 MACA 的优越性能,强调了它在复杂信用分配场景中的有效性。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-08-09 05:36:08 UTC 发表:2025-08-09 05:36:08 UTC
#52 Remote Sensing Image Intelligent Interpretation with the Language-Centered Perspective: Principles, Methods and Challenges #52 以语言为中心视角的遥感影像智能解译:原理、方法与挑战
Authors: [Haifeng Li](https://arxiv.org/search/?searchtype=author&query=Haifeng Li), [Wang Guo](https://arxiv.org/search/?searchtype=author&query=Wang Guo), [Haiyang Wu](https://arxiv.org/search/?searchtype=author&query=Haiyang Wu), [Mengwei Wu](https://arxiv.org/search/?searchtype=author&query=Mengwei Wu), [Jipeng Zhang](https://arxiv.org/search/?searchtype=author&query=Jipeng Zhang), [Qing Zhu](https://arxiv.org/search/?searchtype=author&query=Qing Zhu), [Yu Liu](https://arxiv.org/search/?searchtype=author&query=Yu Liu), [Xin Huang](https://arxiv.org/search/?searchtype=author&query=Xin Huang), [Chao Tao](https://arxiv.org/search/?searchtype=author&query=Chao Tao) 作者:Haifeng Li、Wang Guo、Haiyang Wu、Mengwei Wu、Jipeng Zhang、Qing Zhu、Yu Liu、Xin Huang、Chao Tao
The mainstream paradigm of remote sensing image interpretation has long been dominated by vision-centered models, which rely on visual features for semantic understanding. However, these models face inherent limitations in handling multi-modal reasoning, semantic abstraction, and interactive decision-making. While recent advances have introduced Large Language Models (LLMs) into remote sensing workflows, existing studies primarily focus on downstream applications, lacking a unified theoretical framework that explains the cognitive role of language. This review advocates a paradigm shift from vision-centered to language-centered remote sensing interpretation. Drawing inspiration from the Global Workspace Theory (GWT) of human cognition, we propose a language-centered framework for remote sensing interpretation that treats LLMs as the cognitive central hub integrating perceptual, task, knowledge and action spaces to enable unified understanding, reasoning, and decision-making. We first explore the potential of LLMs as the central cognitive component in remote sensing interpretation, and then summarize core technical challenges, including unified multimodal representation, knowledge association, and reasoning and decision-making. Furthermore, we construct a global workspace-driven interpretation mechanism and review how language-centered solutions address each challenge. Finally, we outline future research directions from four perspectives: adaptive alignment of multimodal data, task understanding under dynamic knowledge constraints, trustworthy reasoning, and autonomous interaction. This work aims to provide a conceptual foundation for the next generation of remote sensing interpretation systems and establish a roadmap toward cognition-driven intelligent geospatial analysis.
遥感影像解译的主流范式长期以来被以视觉为中心的模型所主导,这些模型依赖视觉特征来进行语义理解。然而,这些模型在处理多模态推理、语义抽象和交互式决策方面存在内在局限。尽管近期进展将 LLMs 引入遥感工作流,现有研究主要侧重于下游应用,缺乏解释语言认知作用的统一理论框架。本综述倡导从以视觉为中心转向以语言为中心的遥感解译范式。我们从人类认知的全局工作空间理论(Global Workspace Theory, GWT)汲取灵感,提出了一个以语言为中心的遥感解译框架,将 LLMs 视为整合感知、任务、知识与行动空间的认知中枢,以实现统一的理解、推理与决策。我们首先探讨将 LLMs 作为遥感解译核心认知组件的潜力,然后总结核心技术挑战,包括统一的多模态表征、知识关联,以及推理与决策。此外,我们构建了一个基于全局工作空间的解译机制,并回顾了以语言为中心的解决方案如何应对每一项挑战。最后,我们从四个视角勾勒了未来研究方向:多模态数据的自适应对齐、在动态知识约束下的任务理解、可信推理,以及自主交互。此项工作旨在为下一代遥感解译系统提供概念性基础,并为走向以认知驱动的智能地理空间分析建立路线图。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-08-09 05:10:38 UTC 发布:2025-08-09 05:10:38 UTC
#53 Natural Language-Driven Viewpoint Navigation for Volume Exploration via Semantic Block Representation #53 基于自然语言的视角导航用于通过语义块表示进行体数据探索
Authors: [Xuan Zhao](https://arxiv.org/search/?searchtype=author&query=Xuan Zhao), [Jun Tao](https://arxiv.org/search/?searchtype=author&query=Jun Tao) 作者:赵轩,陶军
Exploring volumetric data is crucial for interpreting scientific datasets. However, selecting optimal viewpoints for effective navigation can be challenging, particularly for users without extensive domain expertise or familiarity with 3D navigation. In this paper, we propose a novel framework that leverages natural language interaction to enhance volumetric data exploration. Our approach encodes volumetric blocks to capture and differentiate underlying structures. It further incorporates a CLIP Score mechanism, which provides semantic information to the blocks to guide navigation. The navigation is empowered by a reinforcement learning framework that leverages these semantic cues to efficiently search for and identify desired viewpoints that align with the user’s intent. The selected viewpoints are evaluated using CLIP Score to ensure that they best reflect the user queries. By automating viewpoint selection, our method improves the efficiency of volumetric data navigation and enhances the interpretability of complex scientific phenomena. 探索体数据对于解释科学数据集至关重要。然而,为有效导航选择最佳视点可能具有挑战性,尤其对于没有丰富领域知识或不熟悉三维导航的用户。本文提出了一种新颖框架,利用自然语言交互来增强体数据的探索。我们的方法对体块进行编码以捕捉并区分潜在结构。它进一步引入了 CLIP 分数机制,为体块提供语义信息以指导导航。导航由强化学习框架驱动,该框架利用这些语义线索高效搜索并识别与用户意图一致的目标视点。所选视点通过 CLIP 分数进行评估,以确保它们最好地反映用户查询。通过自动化视点选择,我们的方法提高了体数据导航的效率并增强了复杂科学现象的可解释性。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-08-09 04:44:59 UTC 发布:2025-08-09 04:44:59 UTC
#54 A Fuzzy Logic Prompting Framework for Large Language Models in Adaptive and Uncertain Tasks #54 用于适应性和不确定任务的大型语言模型模糊逻辑提示框架
Author: [Vanessa Figueiredo](https://arxiv.org/search/?searchtype=author&query=Vanessa Figueiredo) 作者:Vanessa Figueiredo
We introduce a modular prompting framework that supports safer and more adaptive use of large language models (LLMs) across dynamic, user-centered tasks. Grounded in human learning theory, particularly the Zone of Proximal Development (ZPD), our method combines a natural language boundary prompt with a control schema encoded with fuzzy scaffolding logic and adaptation rules. This architecture enables LLMs to modulate behavior in response to user state without requiring fine-tuning or external orchestration. In a simulated intelligent tutoring setting, the framework improves scaffolding quality, adaptivity, and instructional alignment across multiple models, outperforming standard prompting baselines. Evaluation is conducted using rubric-based LLM graders at scale. While initially developed for education, the framework has shown promise in other interaction-heavy domains, such as procedural content generation for games. Designed for safe deployment, it provides a reusable methodology for structuring interpretable, goal-aligned LLM behavior in uncertain or evolving contexts. 我们提出了一个模块化提示框架,支持在动态、以用户为中心的任务中更安全且更具适应性的使用大型语言模型(LLMs)。该方法以人类学习理论为基础,尤其是近端发展区(ZPD),将自然语言边界提示与用模糊支架逻辑和适应规则编码的控制模式结合起来。该架构使 LLMs 能够根据用户状态调节行为,而无需微调或外部协调。在模拟的智能辅导场景中,该框架提升了支架质量、适应性和教学一致性,优于标准提示基线,并在多个模型上表现出色。评估使用基于评分量表的大规模 LLM 评分器进行。尽管最初为教育开发,该框架在其他以交互为主的领域也展现出潜力,例如用于游戏的过程内容生成。该框架为安全部署而设计,提供了一种可重用的方法论,用于在不确定或不断变化的情境中构建可解释且与目标一致的 LLM 行为。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-08-08 23:50:48 UTC 发布:2025-08-08 23:50:48 UTC
#55 Pushing the Envelope of LLM Inference on AI-PC #55 在 AI-PC 上推进 LLM 推理的极限
Authors: [Evangelos Georganas](https://arxiv.org/search/?searchtype=author&query=Evangelos Georganas), [Dhiraj Kalamkar](https://arxiv.org/search/?searchtype=author&query=Dhiraj Kalamkar), [Alexander Heinecke](https://arxiv.org/search/?searchtype=author&query=Alexander Heinecke) 作者:Evangelos Georganas、Dhiraj Kalamkar、Alexander Heinecke
The advent of ultra-low-bit LLM models (1/1.58/2-bit), which match the perplexity and end-task performance of their full-precision counterparts using the same model size, is ushering in a new era of LLM inference for resource-constrained environments such as edge devices and AI PCs. While these quantization advances promise models that are more cost-effective in terms of latency, memory, throughput, and energy consumption, the computational efficiency of state-of-the-art (SOTA) inference runtimes (e.g., bitnet.cpp) used to deploy them remains underexplored. In this work, we take a bottom-up approach: we first design and implement 1-bit and 2-bit microkernels optimized for modern CPUs, achieving peak computational efficiency across a variety of CPU platforms. We integrate these microkernels into a state-of-the-art LLM inference framework, namely PyTorch-TPP, and present end-to-end inference results with 2-bit models that outperform the current SOTA runtime bitnet.cpp by up to 2.2x, and deliver up to 7x speedup compared to the 16-bit model inference. Our optimized runtime advances the state of LLM inference on AI PCs and edge devices, paving the way for efficient deployment of ultra-low-bit LLM models. 超低位宽 LLM 模型(1/1.58/2 位)的出现,使其在相同模型规模下能够匹配全精度模型的困惑度和任务终端性能,这正在为边缘设备和 AI 个人电脑等资源受限环境的 LLM 推理开辟新纪元。尽管这些量化进展在延迟、内存、吞吐量和能耗方面有望带来更具成本效益的模型,但用于部署它们的最先进(SOTA)推理运行时(例如 bitnet.cpp)的计算效率仍未被充分研究。在本工作中,我们采取自下而上的方法:首先设计并实现了针对现代 CPU 优化的 1 位和 2 位微核(microkernels),在多种 CPU 平台上实现了峰值计算效率。我们将这些微核集成到一个最先进的 LLM 推理框架中,即 PyTorch-TPP,并给出端到端的 2 位模型推理结果,性能优于当前的 SOTA 运行时 bitnet.cpp 达到最多 2.2 倍,并相比 16 位模型推理提供最多 7 倍的加速。 我们的优化运行时推动了在 AI 个人电脑和边缘设备上对 LLM 推理的进展,为超低位宽 LLM 模型的高效部署铺平了道路。
Subjects: Artificial Intelligence, Machine Learning, Performance 学科:人工智能 , 机器学习 , 性能
Publish: 2025-08-08 23:33:38 UTC 发布:2025-08-08 23:33:38 UTC
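For readers unfamiliar with ultra-low-bit weights, the following sketch shows one possible storage layout: ternary weights in {-1, 0, 1} (the "1.58-bit" case, since log2(3) is about 1.58) stored as 2-bit codes, four per byte. This is only an illustration of why such models need roughly one eighth of the memory of 16-bit weights; the function names and the bit layout are assumptions, and the code has nothing to do with the optimized CPU microkernels or the PyTorch-TPP integration described in #55.

```python
# Illustrative 2-bit packing of ternary weights; layout and names are assumptions.

def pack_2bit(weights):
    """Pack weights from {-1, 0, 1} into bytes, four 2-bit codes per byte."""
    codes = [w + 1 for w in weights]             # map -1, 0, 1 to codes 0, 1, 2
    codes += [1] * (-len(codes) % 4)             # pad with the code for weight 0
    packed = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for j, code in enumerate(codes[i:i + 4]):
            byte |= code << (2 * j)              # lowest bits hold the first weight
        packed.append(byte)
    return bytes(packed)


def unpack_2bit(packed, n):
    """Recover the first n weights from the packed bytes."""
    weights = []
    for byte in packed:
        for j in range(4):
            weights.append(((byte >> (2 * j)) & 0b11) - 1)
    return weights[:n]


def dot_2bit(packed, n, activations):
    """Dot product between packed weights and full-precision activations."""
    return sum(w * x for w, x in zip(unpack_2bit(packed, n), activations))


weights = [1, -1, 0, 0, 1, 1, -1]
activations = [0.5, 2.0, 3.0, -1.0, 4.0, 1.0, 2.0]
packed = pack_2bit(weights)
print(len(weights), "weights stored in", len(packed), "bytes")
print("dot product:", dot_2bit(packed, len(weights), activations))
```

A real kernel would avoid materializing the unpacked list and would instead operate on the packed bytes with vector instructions; the unpack step is kept explicit here for readability.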
#56 Topology Generation of UAV Covert Communication Networks: A Graph Diffusion Approach with Incentive Mechanism #56 无人机隐蔽通信网络的拓扑生成:一种带激励机制的图扩散方法
Authors: [Xin Tang](https://arxiv.org/search/?searchtype=author&query=Xin Tang), [Qian Chen](https://arxiv.org/search/?searchtype=author&query=Qian Chen), [Fengshun Li](https://arxiv.org/search/?searchtype=author&query=Fengshun Li), [Youchun Gong](https://arxiv.org/search/?searchtype=author&query=Youchun Gong), [Yinqiu Liu](https://arxiv.org/search/?searchtype=author&query=Yinqiu Liu), [Wen Tian](https://arxiv.org/search/?searchtype=author&query=Wen Tian), [Shaowen Qin](https://arxiv.org/search/?searchtype=author&query=Shaowen Qin), [Xiaohuan Li](https://arxiv.org/search/?searchtype=author&query=Xiaohuan Li) 作者:唐鑫,陈倩,李凤顺,龚有春,刘英秋,田文,秦少文,李晓欢
With the growing demand for Uncrewed Aerial Vehicle (UAV) networks in sensitive applications, such as urban monitoring, emergency response, and secure sensing, ensuring reliable connectivity and covert communication has become increasingly vital. However, dynamic mobility and exposure risks pose significant challenges. To tackle these challenges, this paper proposes a self-organizing UAV network framework combining Graph Diffusion-based Policy Optimization (GDPO) with a Stackelberg Game (SG)-based incentive mechanism. The GDPO method uses generative AI to dynamically generate sparse but well-connected topologies, enabling flexible adaptation to changing node distributions and Ground User (GU) demands. Meanwhile, the Stackelberg Game (SG)-based incentive mechanism guides self-interested UAVs to choose relay behaviors and neighbor links that support cooperation and enhance covert communication. Extensive experiments are conducted to validate the effectiveness of the proposed framework in terms of model convergence, topology generation quality, and enhancement of covert communication performance. 随着对无人机(UAV)网络在城市监控、应急响应和安全感知等敏感应用中的需求增长,确保可靠的连接性和隐蔽通信变得愈发重要。然而,动态移动性和暴露风险带来了重大挑战。为应对这些挑战,本文提出了一个自组织无人机网络框架,将基于图扩散的策略优化(GDPO)与基于 Stackelberg 博弈(SG)的激励机制相结合。GDPO 方法利用生成式人工智能动态生成稀疏但连通性良好的拓扑结构,使其能够灵活适应变化的节点分布和地面用户(GU)需求。同时,基于 Stackelberg 博弈(SG)的激励机制引导具有自利性的无人机选择有助于协作并增强隐蔽通信的中继行为和邻居链路。进行了大量实验以验证所提框架在模型收敛性、拓扑生成质量以及隐蔽通信性能提升方面的有效性。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-08-08 23:06:49 UTC 发布:2025-08-08 23:06:49 UTC
#57 ParBalans: Parallel Multi-Armed Bandits-based Adaptive Large Neighborhood Search #57 ParBalans:基于并行多臂赌博机的自适应大邻域搜索
Authors: [Alican Yilmaz](https://arxiv.org/search/?searchtype=author&query=Alican Yilmaz), [Junyang Cai](https://arxiv.org/search/?searchtype=author&query=Junyang Cai), [Serdar Kadioglu](https://arxiv.org/search/?searchtype=author&query=Serdar Kadioglu), [Bistra Dilkina](https://arxiv.org/search/?searchtype=author&query=Bistra Dilkina) 作者:Alican Yilmaz、Junyang Cai、Serdar Kadioglu、Bistra Dilkina
Solving Mixed-Integer Programming (MIP) problems often requires substantial computational resources due to their combinatorial nature. Parallelization has emerged as a critical strategy to accelerate solution times and enhance scalability to tackle large, complex instances. This paper investigates the parallelization capabilities of Balans, a recently proposed multi-armed bandits-based adaptive large neighborhood search for MIPs. While Balans’s modular architecture inherently supports parallel exploration of diverse parameter configurations, this potential has not been thoroughly examined. To address this gap, we introduce ParBalans, an extension that leverages both solver-level and algorithmic-level parallelism to improve performance on challenging MIP instances. Our experimental results demonstrate that ParBalans exhibits competitive performance compared to the state-of-the-art commercial solver Gurobi, particularly on hard optimization benchmarks. 求解混合整数规划(MIP)问题通常因其组合性质而需要大量计算资源。并行化已成为加速求解时间和提高可扩展性以处理大规模复杂实例的关键策略。本文研究了 Balans(一种最近提出的基于多臂老虎机的自适应大邻域搜索用于 MIP)的并行化能力。尽管 Balans 的模块化架构本质上支持对多种参数配置的并行探索,但这一潜力尚未得到充分检验。为填补这一空白,我们提出了 ParBalans,一个扩展方法,利用求解器级和算法级并行性来提升在困难 MIP 实例上的性能。我们的实验结果表明,ParBalans 在硬优化基准测试上相较于最先进的商业求解器 Gurobi 表现出竞争力。
Subjects: Artificial Intelligence, Machine Learning 主题:人工智能、机器学习
Publish: 2025-08-08 22:30:19 UTC 发布:2025-08-08 22:30:19 UTC
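The idea summarized above (a multi-armed bandit choosing among neighborhoods in a large neighborhood search, with several parameter configurations explored in parallel) can be sketched on a toy problem. This is only an illustrative sketch under stated assumptions, not the Balans or ParBalans implementation: the bit-vector objective, `TARGET`, `NEIGHBORHOOD_SIZES`, the epsilon-greedy rule, the random "repair" step, and the function names are all hypothetical stand-ins. A real system would repair with a MIP solver.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy problem: match a hidden bit pattern (stands in for a MIP objective).
TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
# Hypothetical "destroy" operators (bandit arms): how many variables to free per step.
NEIGHBORHOOD_SIZES = [1, 3, 6]


def cost(solution):
    """Number of positions where the solution differs from the target."""
    return sum(s != t for s, t in zip(solution, TARGET))


def run_bandit_lns(seed, epsilon, iterations=300):
    """Epsilon-greedy bandit over destroy operators inside a simple LNS loop."""
    rng = random.Random(seed)
    solution = [0] * len(TARGET)
    best_cost = cost(solution)
    pulls = [0] * len(NEIGHBORHOOD_SIZES)
    mean_reward = [0.0] * len(NEIGHBORHOOD_SIZES)
    for _ in range(iterations):
        # Select an arm: explore with probability epsilon, otherwise exploit.
        if rng.random() < epsilon:
            arm = rng.randrange(len(NEIGHBORHOOD_SIZES))
        else:
            arm = max(range(len(NEIGHBORHOOD_SIZES)), key=lambda a: mean_reward[a])
        # Destroy: free a random subset of variables. Repair: re-sample them at random.
        freed = rng.sample(range(len(TARGET)), NEIGHBORHOOD_SIZES[arm])
        candidate = list(solution)
        for i in freed:
            candidate[i] = rng.randint(0, 1)
        candidate_cost = cost(candidate)
        # Reward the arm only for strict improvement; accept non-worsening moves.
        reward = 1.0 if candidate_cost < best_cost else 0.0
        if candidate_cost <= best_cost:
            solution, best_cost = candidate, candidate_cost
        pulls[arm] += 1
        mean_reward[arm] += (reward - mean_reward[arm]) / pulls[arm]
    return best_cost, pulls


def run_portfolio(configs):
    """Run several (seed, epsilon) configurations concurrently and keep the best."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda cfg: run_bandit_lns(*cfg), configs))
    return min(results, key=lambda result: result[0])


if __name__ == "__main__":
    print(run_portfolio([(0, 0.1), (1, 0.3), (2, 0.5)]))
```

Each configuration uses its own seeded `random.Random`, so runs are reproducible and independent of thread scheduling. Threads are used here only for brevity; CPU-bound solver calls would more likely run in separate processes.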
#58 GLIDR: Graph-Like Inductive Logic Programming with Differentiable Reasoning #58 GLIDR:具可微分推理的类图归纳逻辑编程
Authors: [Blair Johnson](https://arxiv.org/search/?searchtype=author&query=Blair Johnson), [Clayton Kerce](https://arxiv.org/search/?searchtype=author&query=Clayton Kerce), [Faramarz Fekri](https://arxiv.org/search/?searchtype=author&query=Faramarz Fekri)
Differentiable inductive logic programming (ILP) techniques have proven effective at finding approximate rule-based solutions to link prediction and node classification problems on knowledge graphs; however, the common assumption of chain-like rule structure can hamper the performance and interpretability of existing approaches. We introduce GLIDR, a differentiable rule learning method that models the inference of logic rules with more expressive syntax than previous methods. GLIDR uses a differentiable message passing inference algorithm that generalizes previous chain-like rule learning methods to allow rules with features like branches and cycles. GLIDR has a simple and expressive rule search space which is parameterized by a limit on the maximum number of free variables that may be included in a rule. Explicit logic rules can be extracted from the weights of a GLIDR model for use with symbolic solvers. We demonstrate that GLIDR can significantly outperform existing rule learning methods on knowledge graph completion tasks and even compete with embedding methods despite the inherent disadvantage of being a structure-only prediction method. We show that rules extracted from GLIDR retain significant predictive performance, and that GLIDR is highly robust to training data noise. Finally, we demonstrate that GLIDR can be chained with deep neural networks and optimized end-to-end for rule learning on arbitrary data modalities. 可微分归纳逻辑编程(ILP)技术已被证明在为知识图谱上的连边预测和节点分类问题寻找近似基于规则的解时有效;然而,常见的链状规则结构假设可能会妨碍现有方法的性能和可解释性。我们提出了 GLIDR,一种可微分规则学习方法,它使用比以往方法更具表达力的语法来建模逻辑规则的推理。GLIDR 使用一种可微分的消息传递推理算法,该算法将先前的链状规则学习方法推广为允许具有分支和循环等特性的规则。GLIDR 拥有一个简单且富有表现力的规则搜索空间,其由规则中可包含的最大自由变量数的上限参数化。可以从 GLIDR 模型的权重中提取显式逻辑规则以供符号求解器使用。我们证明了 GLIDR 在知识图谱补全任务上可以显著优于现有规则学习方法,甚至在固有地仅基于结构进行预测的劣势下仍能与嵌入方法竞争。 我们展示了从 GLIDR 提取的规则保持了显著的预测性能,并且 GLIDR 对训练数据噪声具有很强的鲁棒性。最后,我们证明了 GLIDR 可以与深度神经网络串联并进行端到端优化,以便在任意数据模态上进行规则学习。
Subjects: Artificial Intelligence, Machine Learning, Logic in Computer Science 主题:人工智能、机器学习、计算机科学中的逻辑
Publish: 2025-08-08 21:31:55 UTC 发布:2025-08-08 21:31:55 UTC
#59 Probabilistic Circuits for Knowledge Graph Completion with Reduced Rule Sets #59 使用简化规则集的概率电路进行知识图谱补全
Authors: [Jaikrishna Manojkumar Patil](https://arxiv.org/search/?searchtype=author&query=Jaikrishna Manojkumar Patil), [Nathaniel Lee](https://arxiv.org/search/?searchtype=author&query=Nathaniel Lee), [Al Mehdi Saadat Chowdhury](https://arxiv.org/search/?searchtype=author&query=Al Mehdi Saadat Chowdhury), [YooJung Choi](https://arxiv.org/search/?searchtype=author&query=YooJung Choi), [Paulo Shakarian](https://arxiv.org/search/?searchtype=author&query=Paulo Shakarian) 作者:Jaikrishna Manojkumar Patil、Nathaniel Lee、Al Mehdi Saadat Chowdhury、YooJung Choi、Paulo Shakarian
Rule-based methods for knowledge graph completion provide explainable results but often require a significantly large number of rules to achieve competitive performance. This can hinder explainability due to overwhelmingly large rule sets. We discover rule contexts (meaningful subsets of rules that work together) from training data and use learned probability distribution (i.e. probabilistic circuits) over these rule contexts to more rapidly achieve performance of the full rule set. Our approach achieves a 70-96% reduction in number of rules used while outperforming baseline by up to 31× when using equivalent minimal number of rules and preserves 91% of peak baseline performance even when comparing our minimal rule sets against baseline’s full rule sets. We show that our framework is grounded in well-known semantics of probabilistic logic, does not require independence assumptions, and that our tractable inference procedure provides both approximate lower bounds and exact probability of a given query. The efficacy of our method is validated by empirical studies on 8 standard benchmark datasets where we show competitive performance by using only a fraction of the rules required by AnyBURL’s standard inference method, the current state-of-the-art for rule-based knowledge graph completion. This work may have further implications for general probabilistic reasoning over learned sets of rules. 基于规则的知识图谱补全方法提供了可解释的结果,但通常需要大量规则才能达到有竞争力的性能。这会因过于庞大的规则集而削弱可解释性。我们从训练数据中发现规则上下文(即协同工作的有意义规则子集),并对这些规则上下文学习概率分布(即概率电路),以更快速地达到完整规则集的性能。我们的方法在使用的规则数量上实现了 70%–96% 的减少,同时在使用等效最少规则时相较基线最多提升了 31 × ,即便将我们的最小规则集与基线的完整规则集比较,也能保留基线峰值性能的 91%。我们证明了我们的框架基于概率逻辑的著名语义,不依赖独立性假设,并且我们的可解析推理过程既能提供近似下界,又能给出指定查询的精确概率。 我们的方法通过在 8 个标准基准数据集上的实证研究得到验证,结果表明我们在仅使用 AnyBURL 标准推理方法(基于规则的知识图补全领域的当前最先进方法)所需规则的一小部分时,仍能表现出具有竞争力的性能。该工作可能对在已学习的规则集合上进行一般概率推理具有进一步的意义。
Subjects: Artificial Intelligence, Logic in Computer Science 学科:人工智能,计算机科学中的逻辑
Publish: 2025-08-08 21:17:03 UTC 发布:2025-08-08 21:17:03 UTC
#60 Zero-Shot Cellular Trajectory Map Matching #60 零样本蜂窝轨迹地图匹配
Authors: [Weijie Shi](https://arxiv.org/search/?searchtype=author&query=Weijie Shi), [Yue Cui](https://arxiv.org/search/?searchtype=author&query=Yue Cui), [Hao Chen](https://arxiv.org/search/?searchtype=author&query=Hao Chen), [Jiaming Li](https://arxiv.org/search/?searchtype=author&query=Jiaming Li), [Mengze Li](https://arxiv.org/search/?searchtype=author&query=Mengze Li), [Jia Zhu](https://arxiv.org/search/?searchtype=author&query=Jia Zhu), [Jiajie Xu](https://arxiv.org/search/?searchtype=author&query=Jiajie Xu), [Xiaofang Zhou](https://arxiv.org/search/?searchtype=author&query=Xiaofang Zhou) 作者:史伟杰、崔越、陈浩、李嘉明、李梦泽、朱佳、徐佳杰、周晓芳
Cellular Trajectory Map-Matching (CTMM) aims to align cellular location sequences to road networks, which is a necessary preprocessing step in location-based services on web platforms like Google Maps, including navigation and route optimization. Current approaches mainly rely on ID-based features and region-specific data to learn correlations between cell towers and roads, limiting their adaptability to unexplored areas. To enable high-accuracy CTMM without additional training in target regions, zero-shot CTMM requires extracting not only region-adaptive features but also sequential features and location uncertainty to alleviate positioning errors in cellular data. In this paper, we propose a pixel-based trajectory calibration assistant for zero-shot CTMM, which takes advantage of transferable geospatial knowledge to calibrate pixelated trajectories and then guide the path-finding process at the road network level. To enhance knowledge sharing across similar regions, a Gaussian mixture model is incorporated into a VAE, enabling the identification of scenario-adaptive experts through soft clustering. To mitigate high positioning errors, a spatial-temporal awareness module is designed to capture sequential features and location uncertainty, thereby facilitating the inference of approximate user positions. Finally, a constrained path-finding algorithm is employed to reconstruct the road ID sequence, ensuring topological validity within the road network. This process is guided by the calibrated trajectory while optimizing for the shortest feasible path, thus minimizing unnecessary detours. Extensive experiments demonstrate that our model outperforms existing methods in zero-shot CTMM by 16.8%.
蜂窝轨迹地图匹配(Cellular Trajectory Map-Matching,CTMM)旨在将蜂窝定位序列与道路网络对齐,这是在像 Google 地图这样的网络平台上提供基于位置的服务(包括导航和路径优化)所必需的预处理步骤。当前方法主要依赖基于小区 ID 的特征和区域特定的数据来学习基站与道路之间的关联,这限制了它们对未探索区域的适应能力。为了在目标区域无需额外训练即可实现高精度的 CTMM,零样本 CTMM(Zero-shot CTMM)需要提取不仅具有区域适应性的特征,还要提取序列性和位置不确定性,以缓解蜂窝数据中的定位误差。在本文中,我们提出了一种基于像素的轨迹校准辅助方法用于零样本 CTMM,该方法利用可迁移的地理空间知识对像素化轨迹进行校准,然后在道路网络层面引导路径查找过程。为增强相似区域间的知识共享,我们在变分自编码器(VAE)中引入了高斯混合模型,通过软聚类实现场景自适应专家的识别。 为减轻高定位误差,设计了一个时空感知模块以捕捉序列特征和位置不确定性,从而辅助推断用户的大致位置。最后,采用受约束的路径搜索算法重建道路 ID 序列,确保在道路网络内的拓扑有效性。该过程以校准后的轨迹为引导,同时优化可行的最短路径,从而将不必要的绕行降到最低。大量实验证明,我们的模型在零样本 CTMM 上比现有方法高出 16.8%。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-08-08 19:47:45 UTC 发布:2025-08-08 19:47:45 UTC
#61 Formal Concept Analysis: a Structural Framework for Variability Extraction and Analysis #61 形式概念分析:用于变异性提取与分析的结构性框架
Author: [Jessie Galasso](https://arxiv.org/search/?searchtype=author&query=Jessie Galasso) 作者:Jessie Galasso
Formal Concept Analysis (FCA) is a mathematical framework for knowledge representation and discovery. It performs a hierarchical clustering over a set of objects described by attributes, resulting in conceptual structures in which objects are organized depending on the attributes they share. These conceptual structures naturally highlight commonalities and variabilities among similar objects by categorizing them into groups which are then arranged by similarity, making it particularly appropriate for variability extraction and analysis. Despite the potential of FCA, determining which of its properties can be leveraged for variability-related tasks (and how) is not always straightforward, partly due to the mathematical orientation of its foundational literature. This paper attempts to bridge part of this gap by gathering a selection of properties of the framework which are essential to variability analysis, and how they can be used to interpret diverse variability information within the resulting conceptual structures. 形式概念分析(FCA)是一种用于知识表示和发现的数学框架。它对由属性描述的一组对象执行层次聚类,产生的概念结构根据对象共享的属性组织这些对象。这些概念结构通过将相似对象分类为组并按相似性排列,自然地突出了相似对象之间的共性和差异性,使其特别适合于变异提取和分析。尽管 FCA 具有潜力,但确定其哪些属性可以用于与变异相关的任务(以及如何使用)并不总是直观,部分原因在于其基础文献的数学取向。本文尝试弥补这一部分差距,收集该框架中对变异分析至关重要的一些属性,并说明如何利用这些属性在生成的概念结构中解释多样的变异信息。
Subjects: Artificial Intelligence, Information Retrieval, Software Engineering 学科:人工智能,信息检索,软件工程
Publish: 2025-08-08 19:30:14 UTC 发布:2025-08-08 19:30:14 UTC
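As a rough illustration of the conceptual structures described above, the following minimal sketch enumerates the formal concepts (extent/intent pairs) of a tiny object/attribute context. The product names and features are made up for illustration and do not come from the paper, and brute-force enumeration over object subsets is an assumption chosen for brevity; it is only practical for very small contexts.

```python
from itertools import combinations


def formal_concepts(context):
    """Return the set of formal concepts of `context` as (extent, intent) pairs.

    `context` maps each object name to the set of attributes it has.
    """
    objects = list(context)
    all_attributes = set().union(*context.values()) if context else set()

    def intent(objs):
        # Attributes shared by every object in objs (all attributes if objs is empty).
        shared = set(all_attributes)
        for obj in objs:
            shared &= context[obj]
        return frozenset(shared)

    def extent(attrs):
        # Objects that have every attribute in attrs.
        return frozenset(obj for obj in objects if attrs <= context[obj])

    concepts = set()
    # Closing every subset of objects yields every concept of the context.
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            shared = intent(subset)
            concepts.add((extent(shared), shared))
    return concepts


if __name__ == "__main__":
    # Hypothetical product variants described by their features.
    variants = {
        "basic": {"call"},
        "media": {"call", "camera"},
        "pro": {"call", "camera", "gps"},
    }
    for ext, itt in sorted(formal_concepts(variants), key=lambda c: len(c[1])):
        print(sorted(ext), sorted(itt))
```

In this toy context the concept whose extent contains all three variants has the intent `{"call"}`, which is the commonality; the concepts with smaller extents expose the variable features (`camera`, `gps`).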
#62 CountQA: How Well Do MLLMs Count in the Wild? #62 CountQA:多模态大模型在真实环境中的计数能力如何?
Authors: [Jayant Sravan Tamarapalli](https://arxiv.org/search/?searchtype=author&query=Jayant Sravan Tamarapalli), [Rynaa Grover](https://arxiv.org/search/?searchtype=author&query=Rynaa Grover), [Nilay Pande](https://arxiv.org/search/?searchtype=author&query=Nilay Pande), [Sahiti Yerramilli](https://arxiv.org/search/?searchtype=author&query=Sahiti Yerramilli) 作者:Jayant Sravan Tamarapalli、Rynaa Grover、Nilay Pande、Sahiti Yerramilli
Multimodal Large Language Models (MLLMs) demonstrate remarkable fluency in understanding visual scenes, yet they exhibit a critical lack in a fundamental cognitive skill: object counting. This blind spot severely limits their reliability in real-world applications. To date, this capability has been largely unevaluated in complex scenarios, as existing benchmarks either feature sparse object densities or are confined to specific visual domains, failing to test models under realistic conditions. Addressing this gap, we introduce CountQA, a challenging new benchmark designed to probe this deficiency. Comprising over 1,500 question-answer pairs, CountQA features real-world images with high object density, clutter, and occlusion. We investigate this weakness by evaluating 15 prominent MLLMs on the CountQA benchmark and reveal that the top-performing model achieves a mere 42.9% accuracy, with performance declining as object counts rise. By providing a dedicated benchmark to diagnose and rectify this core weakness, CountQA paves the way for a new generation of MLLMs that are not only descriptively fluent but also numerically grounded and spatially aware. We will open-source the dataset and code upon paper acceptance to foster further research. 多模态大语言模型(MLLMs)在理解视觉场景方面表现出惊人的流利性,但在一项基本的认知技能上存在严重缺陷:对象计数。这个盲点严重限制了它们在现实应用中的可靠性。迄今为止,这一能力在复杂场景中在很大程度上未被评估,因为现有基准要么对象密度稀疏,要么局限于特定视觉领域,无法在现实条件下测试模型。为了解决这一空白,我们提出了 CountQA,这是一个旨在探查该缺陷的具有挑战性的新基准。CountQA 包含超过 1,500 个问答对,选用具有高对象密度、杂乱和遮挡的真实世界图像。我们通过在 CountQA 基准上评估 15 个重要的 MLLM 来研究这一弱点,结果显示表现最好的模型仅达到 42.9% 的准确率,且随着对象数量的增加表现下降。通过提供一个专门用于诊断和修复这一核心弱点的基准,CountQA 为新一代不仅在描述上流利且在数字上有依据、具有空间感知能力的 MLLM 铺平了道路。 在论文被接收后,我们将开源数据集和代码以促进进一步研究。
Subjects: Artificial Intelligence, Computer Vision and Pattern Recognition 主题:人工智能,计算机视觉与模式识别
Publish: 2025-08-08 04:23:04 UTC 发布日期:2025-08-08 04:23:04 协调世界时(UTC)
#63 IRL-VLA: Training an Vision-Language-Action Policy via Reward World Model #63 IRL-VLA:通过奖励世界模型训练视觉-语言-动作策略
Authors: [Anqing Jiang](https://arxiv.org/search/?searchtype=author&query=Anqing Jiang), [Yu Gao](https://arxiv.org/search/?searchtype=author&query=Yu Gao), [Yiru Wang](https://arxiv.org/search/?searchtype=author&query=Yiru Wang), [Zhigang Sun](https://arxiv.org/search/?searchtype=author&query=Zhigang Sun), [Shuo Wang](https://arxiv.org/search/?searchtype=author&query=Shuo Wang), [Yuwen Heng](https://arxiv.org/search/?searchtype=author&query=Yuwen Heng), [Hao Sun](https://arxiv.org/search/?searchtype=author&query=Hao Sun), [Shichen Tang](https://arxiv.org/search/?searchtype=author&query=Shichen Tang), [Lijuan Zhu](https://arxiv.org/search/?searchtype=author&query=Lijuan Zhu), [Jinhao Chai](https://arxiv.org/search/?searchtype=author&query=Jinhao Chai), [Jijun Wang](https://arxiv.org/search/?searchtype=author&query=Jijun Wang), [Zichong Gu](https://arxiv.org/search/?searchtype=author&query=Zichong Gu), [Hao Jiang](https://arxiv.org/search/?searchtype=author&query=Hao Jiang), [Li Sun](https://arxiv.org/search/?searchtype=author&query=Li Sun) 作者:姜安庆、高宇、王怡茹、孙志刚、王硕、衡雨文、孙昊、唐世臣、朱丽娟、柴金浩、王继军、顾自冲、蒋浩、孙力
Vision-Language-Action (VLA) models have demonstrated potential in autonomous driving. However, two critical challenges hinder their development: (1) existing VLA architectures are typically based on imitation learning in an open-loop setup, which tends to capture the recorded behaviors in the dataset, leading to suboptimal and constrained performance; (2) closed-loop training relies heavily on high-fidelity sensor simulation, where domain gaps and computational inefficiencies pose significant barriers. In this paper, we introduce IRL-VLA, a novel closed-loop reinforcement learning framework that uses an Inverse Reinforcement Learning (IRL) reward world model with a self-built VLA approach. Our framework proceeds in a three-stage paradigm: In the first stage, we propose a VLA architecture and pretrain the VLA policy via imitation learning. In the second stage, we construct a lightweight reward world model via inverse reinforcement learning to enable efficient closed-loop reward computation. Finally, to further enhance planning performance, we design specialized reward-world-model-guided reinforcement learning via PPO (Proximal Policy Optimization) to effectively balance safety incidents, comfortable driving, and traffic efficiency. Our approach achieves state-of-the-art performance on the NAVSIM v2 end-to-end driving benchmark and was 1st runner-up in the CVPR2025 Autonomous Grand Challenge. We hope that our framework will accelerate VLA research in closed-loop autonomous driving. 视觉-语言-行动(VLA)模型在自动驾驶方面展现出潜力。然而,两个关键挑战阻碍了它们的发展:(1)现有的 VLA 架构通常基于开环设置下的模仿学习,这往往只会捕捉数据集中记录的行为,导致性能次优且受限;(2)闭环训练高度依赖高保真度的传感器仿真,而领域差距和计算低效成为显著障碍。在本文中,我们提出了 IRL-VLA,一种新颖的闭环强化学习方法,通过逆向强化学习构建奖励世界模型,并采用自研的 VLA 方法。我们的框架采用三阶段范式:在第一阶段,我们提出了一种 VLA 架构并通过模仿学习对 VLA 策略进行预训练。在第二阶段,我们通过逆向强化学习构建了一个轻量级的奖励世界模型,以实现高效的闭环奖励计算。 为了进一步提升规划性能,我们最终设计了基于 PPO(近端策略优化)的专用奖励世界模型引导强化学习,以有效平衡安全事故、乘坐舒适性和交通效率。我们的方法在 NAVSIM v2 端到端驾驶基准上取得了最先进的性能,并在 CVPR2025 自动驾驶大赛中获得亚军。我们希望我们的框架能加速闭环自动驾驶中的 VLA 研究。
Subjects: Artificial Intelligence, Computer Vision and Pattern Recognition, Robotics 主题:人工智能,计算机视觉与模式识别,机器人学
Publish: 2025-08-07 06:30:05 UTC 发布:2025-08-07 06:30:05 UTC
#64 Operationalizing Serendipity: Multi-Agent AI Workflows for Enhanced Materials Characterization with Theory-in-the-Loop #64 将机遇性付诸实践:具有理论闭环的多智能体 AI 工作流,用于增强材料表征
Authors: [Lance Yao](https://arxiv.org/search/?searchtype=author&query=Lance Yao), [Suman Samantray](https://arxiv.org/search/?searchtype=author&query=Suman Samantray), [Ayana Ghosh](https://arxiv.org/search/?searchtype=author&query=Ayana Ghosh), [Kevin Roccapriore](https://arxiv.org/search/?searchtype=author&query=Kevin Roccapriore), [Libor Kovarik](https://arxiv.org/search/?searchtype=author&query=Libor Kovarik), [Sarah Allec](https://arxiv.org/search/?searchtype=author&query=Sarah Allec), [Maxim Ziatdinov](https://arxiv.org/search/?searchtype=author&query=Maxim Ziatdinov) 作者:Lance Yao、Suman Samantray、Ayana Ghosh、Kevin Roccapriore、Libor Kovarik、Sarah Allec、Maxim Ziatdinov
The history of science is punctuated by serendipitous discoveries, where unexpected observations, rather than targeted hypotheses, opened new fields of inquiry. While modern autonomous laboratories excel at accelerating hypothesis testing, their optimization for efficiency risks overlooking these crucial, unplanned findings. To address this gap, we introduce SciLink, an open-source, multi-agent artificial intelligence framework designed to operationalize serendipity in materials research by creating a direct, automated link between experimental observation, novelty assessment, and theoretical simulations. The framework employs a hybrid AI strategy where specialized machine learning models perform quantitative analysis of experimental data, while large language models handle higher-level reasoning. These agents autonomously convert raw data from materials characterization techniques into falsifiable scientific claims, which are then quantitatively scored for novelty against the published literature. We demonstrate the framework’s versatility across diverse research scenarios, showcasing its application to atomic-resolution and hyperspectral data, its capacity to integrate real-time human expert guidance, and its ability to close the research loop by proposing targeted follow-up experiments. By systematically analyzing all observations and contextualizing them, SciLink provides a practical framework for AI-driven materials research that not only enhances efficiency but also actively cultivates an environment ripe for serendipitous discoveries, thereby bridging the gap between automated experimentation and open-ended scientific exploration. 
科学史充满了机缘巧合的发现,正是那些意外的观察,而非有针对性的假设,开辟了新的研究领域。尽管现代自主实验室在加速假设检验方面表现出色,但它们为效率而优化的设计有可能忽视这些关键的、非计划性的发现。为弥补这一空白,我们推出了 SciLink,这是一个开源的多智能体人工智能框架,旨在通过在实验观察、新颖性评估和理论模拟之间建立直接的自动化联结,将机缘发现制度化应用于材料研究。该框架采用混合人工智能策略:专门的机器学习模型对实验数据进行定量分析,而大型语言模型负责更高层次的推理。这些智能体能自主将材料表征技术产生的原始数据转化为可证伪的科学主张,然后将这些主张与已发表文献进行定量对比打分以评估新颖性。 我们展示了该框架在多种研究场景中的多功能性,演示了其在原子分辨率和高光谱数据上的应用、整合实时人类专家指导的能力,以及通过提出有针对性的后续实验来闭合研究循环的能力。通过系统地分析所有观测并将其置于上下文中,SciLink 提供了一个实用的 AI 驱动材料研究框架,该框架不仅提高了效率,还积极培养了一个利于机遇性发现(serendipity)的环境,从而弥合了自动化实验与开放式科学探索之间的鸿沟。
Subjects: Artificial Intelligence, Materials Science 学科:人工智能,材料科学
Publish: 2025-08-07 04:59:17 UTC 发布:2025-08-07 04:59:17 UTC
#65 Solving Pasur Using GPU-Accelerated Counterfactual Regret Minimization #65 使用 GPU 加速的反事实后悔最小化算法解决 Pasur
Author: [Sina Baghal](https://arxiv.org/search/?searchtype=author&query=Sina Baghal) 作者:Sina Baghal
Pasur is a fishing card game played over six rounds, similar to games such as Cassino, Scopa, and Bastra. This paper introduces a CUDA-accelerated computational framework for simulating Pasur, emphasizing efficient memory management. We use our framework to compute near-Nash equilibria via Counterfactual Regret Minimization (CFR), a well-known algorithm for solving large imperfect-information games. Solving Pasur presents unique challenges due to its intricate rules and the large size of its game tree. We handle rule complexity using PyTorch CUDA tensors, and to address the memory-intensive nature of the game, we decompose the game tree into two key components: (1) actual game states, and (2) inherited scores from previous rounds. We construct the Full Game Tree by pairing card states with accumulated scores in the Unfolding Process. This design reduces memory overhead by storing only essential strategy values and node connections. To further manage computational complexity, we apply a round-by-round backward training strategy, starting from the final round and recursively propagating average utilities to earlier stages. Our approach constructs the complete game tree, which on average consists of over 10^9 nodes. We provide detailed implementation snippets. After computing a near-Nash equilibrium strategy, we train a tree-based model to predict these strategies for use during gameplay. We then estimate the fair value of each deck through large-scale self-play between equilibrium strategies by simulating, for instance, 10,000 games per matchup, executed in parallel using GPU acceleration. Similar frameworks can be extended to other reinforcement learning algorithms where the action tree naturally decomposes into multiple rounds, such as turn-based strategy games or sequential trading decisions in financial markets.
Pasur 是一种进行六轮的钓鱼纸牌游戏,玩法类似于 Cassino、Scopa 和 Bastra 等游戏。本文介绍了一个基于 CUDA 加速的 Pasur 模拟计算框架,重点在于高效的内存管理。我们使用该框架通过反事实遗憾最小化(CFR)——一种用于求解大型不完全信息博弈的著名算法——来计算近纳什均衡。由于规则复杂且博弈树规模庞大,求解 Pasur 面临独特挑战。我们使用 PyTorch CUDA 张量来处理规则复杂性;为了解决游戏对内存的高需求,我们将博弈树分解为两个关键组件:(1)实际游戏状态,和(2)来自前几轮的继承得分。在展开过程中,我们通过将牌面状态与累计得分配对来构建完整博弈树。这种设计通过仅存储必要的策略值和节点连接来减少内存开销。为进一步控制计算复杂性,我们采用逐轮向后训练策略,从最后一轮开始并递归地将平均效用传播到较早的阶段。 我们的方法构建了完整的博弈树,平均包含超过 10^9 个节点。我们提供了详细的实现片段。在计算出近纳什均衡策略后,我们训练了基于树的模型来预测这些策略,以便在游戏过程中使用。随后,我们通过在均衡策略之间进行大规模自博弈来估计每个卡组的公平价值,例如对局间模拟 10,000 局比赛,并使用 GPU 加速并行执行。类似的框架可以扩展到其他强化学习算法,在这些算法中行动树自然分解为多个回合,例如回合制策略游戏或金融市场中的序列交易决策。
Subjects: Artificial Intelligence, Computer Science and Game Theory, Machine Learning 学科:人工智能、计算机科学与博弈论、机器学习
Publish: 2025-08-06 15:15:11 UTC 发布:2025-08-06 15:15:11 UTC
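CFR, as named in the abstract above, applies a regret-matching update at each decision point of the game tree. The sketch below shows only that core update on a one-shot matrix game (rock-paper-scissors) in plain Python. It is an illustrative assumption-laden example, not the paper's CUDA/PyTorch implementation, and the function names, the payoff table, and the biased starting regrets are all hypothetical choices made so that the dynamics are non-trivial.

```python
# PAYOFF[a][b]: payoff to a player choosing action a against an opponent choosing b.
# Actions are rock, paper, scissors; the game is symmetric and zero-sum.
PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]


def regret_matching_strategy(regrets):
    """Play each action in proportion to its positive cumulative regret."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total <= 0.0:
        return [1.0 / len(regrets)] * len(regrets)
    return [p / total for p in positive]


def solve_rps(iterations=10000, initial_regrets=(1.0, 0.0, 0.0)):
    """Self-play with regret matching; returns each player's average strategy."""
    n = len(PAYOFF)
    # Player 0 starts with a (hypothetical) bias toward rock so play is not static.
    regrets = [list(initial_regrets), [0.0] * n]
    strategy_sums = [[0.0] * n, [0.0] * n]
    for _ in range(iterations):
        strategies = [regret_matching_strategy(r) for r in regrets]
        for player in range(2):
            opponent = strategies[1 - player]
            # Expected payoff of each pure action against the opponent's current mix.
            action_values = [
                sum(PAYOFF[a][b] * opponent[b] for b in range(n)) for a in range(n)
            ]
            node_value = sum(strategies[player][a] * action_values[a] for a in range(n))
            for a in range(n):
                # Regret: how much better action a would have done than the mix played.
                regrets[player][a] += action_values[a] - node_value
                strategy_sums[player][a] += strategies[player][a]
    return [[s / iterations for s in sums] for sums in strategy_sums]


if __name__ == "__main__":
    for player, average in enumerate(solve_rps()):
        print(player, [round(p, 3) for p in average])
```

It is the average strategy over iterations, not the final one, that approaches equilibrium; for rock-paper-scissors the equilibrium is the uniform mix. Full CFR additionally weights regrets by counterfactual reach probabilities across the game tree, which this one-shot example does not need.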
#66 Cut2Next: Generating Next Shot via In-Context Tuning #66 Cut2Next:通过上下文调优生成下一镜头
Authors: [Jingwen He](https://arxiv.org/search/?searchtype=author&query=Jingwen He), [Hongbo Liu](https://arxiv.org/search/?searchtype=author&query=Hongbo Liu), [Jiajun Li](https://arxiv.org/search/?searchtype=author&query=Jiajun Li), [Ziqi Huang](https://arxiv.org/search/?searchtype=author&query=Ziqi Huang), [Yu Qiao](https://arxiv.org/search/?searchtype=author&query=Yu Qiao), [Wanli Ouyang](https://arxiv.org/search/?searchtype=author&query=Wanli Ouyang), [Ziwei Liu](https://arxiv.org/search/?searchtype=author&query=Ziwei Liu) 作者:Jingwen He、Hongbo Liu、Jiajun Li、Ziqi Huang、Yu Qiao、Wanli Ouyang、Ziwei Liu
Effective multi-shot generation demands purposeful, film-like transitions and strict cinematic continuity. Current methods, however, often prioritize basic visual consistency, neglecting crucial editing patterns (e.g., shot/reverse shot, cutaways) that drive narrative flow for compelling storytelling. This yields outputs that may be visually coherent but lack narrative sophistication and true cinematic integrity. To bridge this, we introduce Next Shot Generation (NSG): synthesizing a subsequent, high-quality shot that critically conforms to professional editing patterns while upholding rigorous cinematic continuity. Our framework, Cut2Next, leverages a Diffusion Transformer (DiT). It employs in-context tuning guided by a novel Hierarchical Multi-Prompting strategy. This strategy uses Relational Prompts to define overall context and inter-shot editing styles. Individual Prompts then specify per-shot content and cinematographic attributes. Together, these guide Cut2Next to generate cinematically appropriate next shots. Architectural innovations, Context-Aware Condition Injection (CACI) and Hierarchical Attention Mask (HAM), further integrate these diverse signals without introducing new parameters. We construct RawCuts (large-scale) and CuratedCuts (refined) datasets, both with hierarchical prompts, and introduce CutBench for evaluation. Experiments show Cut2Next excels in visual consistency and text fidelity. Crucially, user studies reveal a strong preference for Cut2Next, particularly for its adherence to intended editing patterns and overall cinematic continuity, validating its ability to generate high-quality, narratively expressive, and cinematically coherent subsequent shots. 
有效的多镜头生成需要有目的的、电影化的转换以及严格的影视连续性。然而,目前的方法往往只强调基本的视觉一致性,忽视了驱动叙事流动以实现引人入胜故事讲述的关键剪辑模式(例如对打镜头、插入镜头)。这导致生成结果在视觉上可能连贯,但缺乏叙事精细度和真正的电影完整性。为弥补这一点,我们提出了下一个镜头生成(Next Shot Generation,NSG):合成后续的高质量镜头,既严格符合专业剪辑模式,又保持严谨的影视连续性。我们的框架 Cut2Next 利用了一种扩散变换器(Diffusion Transformer,DiT)。它通过一种新颖的分层多提示(Hierarchical Multi-Prompting)策略来进行上下文内微调指导。该策略使用关系提示(Relational Prompts)来定义整体语境和镜头间的剪辑风格,单独的提示则指定每个镜头的内容和摄影属性。二者共同引导 Cut2Next 生成符合电影化要求的后续镜头。 架构创新——上下文感知条件注入(Context-Aware Condition Injection,CACI)和层次注意掩码(Hierarchical Attention Mask,HAM)——在不引入新参数的情况下,进一步整合了这些多样化信号。我们构建了 RawCuts(大规模)和 CuratedCuts(精炼)数据集,二者均配有层次化提示,并引入了 CutBench 作为评估基准。实验证明 Cut2Next 在视觉一致性和文本忠实度方面表现出色。关键的是,用户研究显示用户强烈偏好 Cut2Next,尤其是因为其遵循预期编辑模式并保持整体影像连续性,这验证了其生成高质量、具有叙事表达力且电影级连贯性的后续镜头的能力。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-11 17:56:59 UTC 发布:2025-08-11 17:56:59 UTC
#67 VGGSounder: Audio-Visual Evaluations for Foundation Models #67 VGGSounder:用于基础模型的视听评估
Authors: [Daniil Zverev](https://arxiv.org/search/?searchtype=author&query=Daniil Zverev), [Thaddäus Wiedemer](https://arxiv.org/search/?searchtype=author&query=Thaddäus Wiedemer), [Ameya Prabhu](https://arxiv.org/search/?searchtype=author&query=Ameya Prabhu), [Matthias Bethge](https://arxiv.org/search/?searchtype=author&query=Matthias Bethge), [Wieland Brendel](https://arxiv.org/search/?searchtype=author&query=Wieland Brendel), [A. Sophia Koepke](https://arxiv.org/search/?searchtype=author&query=A. Sophia Koepke) 作者:Daniil Zverev, Thaddäus Wiedemer, Ameya Prabhu, Matthias Bethge, Wieland Brendel, A. Sophia Koepke
The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluating audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, we reveal model limitations by analysing performance degradation when adding another input modality with our new modality confusion metric. 音视频基础模型的出现突显了可靠评估其多模态理解能力的重要性。VGGSound 数据集常被用作音视频分类评估的基准。然而,我们的分析指出了 VGGSound 的若干局限性,包括标注不完整、类别部分重叠以及模态不对齐。这些问题导致对听觉和视觉能力的评估出现失真。为了解决这些局限性,我们引入了 VGGSounder —— 一个经过全面重新标注的多标签测试集,扩展自 VGGSound,专门用于评估音视频基础模型。VGGSounder 具有详尽的模态标注,能够对特定模态的性能进行精确分析。此外,我们通过在加入另一输入模态时分析性能下降,并使用我们新的模态混淆度量,揭示了模型的局限性。
Subjects: Multimedia, Artificial Intelligence, Sound 主题:多媒体,人工智能,声音
Publish: 2025-08-11 17:53:23 UTC 发布日期:2025-08-11 17:53:23 UTC
#68 LL3M: Large Language 3D Modelers #68 LL3M:大型语言三维建模器
Authors: [Sining Lu](https://arxiv.org/search/?searchtype=author&query=Sining Lu), [Guan Chen](https://arxiv.org/search/?searchtype=author&query=Guan Chen), [Nam Anh Dinh](https://arxiv.org/search/?searchtype=author&query=Nam Anh Dinh), [Itai Lang](https://arxiv.org/search/?searchtype=author&query=Itai Lang), [Ari Holtzman](https://arxiv.org/search/?searchtype=author&query=Ari Holtzman), [Rana Hanocka](https://arxiv.org/search/?searchtype=author&query=Rana Hanocka) 作者:鲁思宁,陈冠,Nam Anh Dinh,Itai Lang,Ari Holtzman,Rana Hanocka
We present LL3M, a multi-agent system that leverages pretrained large language models (LLMs) to generate 3D assets by writing interpretable Python code in Blender. We break away from the typical generative approach that learns from a collection of 3D data. Instead, we reformulate shape generation as a code-writing task, enabling greater modularity, editability, and integration with artist workflows. Given a text prompt, LL3M coordinates a team of specialized LLM agents to plan, retrieve, write, debug, and refine Blender scripts that generate and edit geometry and appearance. The generated code works as a high-level, interpretable, human-readable, well-documented representation of scenes and objects, making full use of sophisticated Blender constructs (e.g. B-meshes, geometry modifiers, shader nodes) for diverse, unconstrained shapes, materials, and scenes. This code presents many avenues for further agent and human editing and experimentation via code tweaks or procedural parameters. This medium naturally enables a co-creative loop in our system: agents can automatically self-critique using code and visuals, while iterative user instructions provide an intuitive way to refine assets. A shared code context across agents enables awareness of previous attempts, and a retrieval-augmented generation knowledge base built from Blender API documentation, BlenderRAG, equips agents with examples, types, and functions empowering advanced modeling operations and code correctness. We demonstrate the effectiveness of LL3M across diverse shape categories, style and material edits, and user-driven refinements. Our experiments showcase the power of code as a generative and interpretable medium for 3D asset creation. Our project page is at https://threedle.github.io/ll3m. 
我们提出了 LL3M,这是一个多智能体系统,利用预训练的大型语言模型(LLMs)通过在 Blender 中编写可解释的 Python 代码来生成 3D 资源。我们突破了典型的从一组 3D 数据中学习的生成方法。相反,我们将形状生成重新表述为一个写代码的任务,从而实现更高的模块化、可编辑性以及与艺术家工作流程的整合。给定一个文本提示,LL3M 协调一支由专门 LLM 代理组成的团队来规划、检索、编写、调试和完善生成和编辑几何体与外观的 Blender 脚本。所生成的代码作为场景和对象的高层次、可解释、可读且文档完备的表示,充分利用了复杂的 Blender 构造(例如 B-mesh、几何修改器、着色器节点)以实现多样且不受限的形状、材质和场景。该代码为通过代码调整或程序化参数进行进一步的代理和人工编辑与实验提供了多种途径。 这种媒介自然在我们的系统中形成了一个共创循环:代理可以使用代码和可视化自动进行自我批评,而反复的用户指令则提供了一种直观的方式来细化资产。跨代理的共享代码上下文使其能够感知先前的尝试,而从 Blender API 文档构建的检索增强生成知识库 BlenderRAG 为代理提供了示例、类型和函数,从而增强了高级建模操作和代码正确性。我们在多种形状类别、风格与材质编辑以及用户驱动的细化方面演示了 LL3M 的有效性。我们的实验展示了代码作为生成性且可解释媒介用于 3D 资源创建的强大能力。我们的项目页面是 https://threedle.github.io/ll3m。
Subjects: Graphics, Artificial Intelligence 学科:图形学,人工智能
Publish: 2025-08-11 17:48:02 UTC 发布时间:2025-08-11 17:48:02 UTC
#69 OMGSR: You Only Need One Mid-timestep Guidance for Real-World Image Super-Resolution #69 OMGSR:在真实世界图像超分辨率中你只需一次中间时刻引导
Authors: [Zhiqiang Wu](https://arxiv.org/search/?searchtype=author&query=Zhiqiang Wu), [Zhaomang Sun](https://arxiv.org/search/?searchtype=author&query=Zhaomang Sun), [Tong Zhou](https://arxiv.org/search/?searchtype=author&query=Tong Zhou), [Bingtao Fu](https://arxiv.org/search/?searchtype=author&query=Bingtao Fu), [Ji Cong](https://arxiv.org/search/?searchtype=author&query=Ji Cong), [Yitong Dong](https://arxiv.org/search/?searchtype=author&query=Yitong Dong), [Huaqi Zhang](https://arxiv.org/search/?searchtype=author&query=Huaqi Zhang), [Xuan Tang](https://arxiv.org/search/?searchtype=author&query=Xuan Tang), [Mingsong Chen](https://arxiv.org/search/?searchtype=author&query=Mingsong Chen), [Xian Wei](https://arxiv.org/search/?searchtype=author&query=Xian Wei) 作者:吴志强、孙昭芒、周彤、付炳涛、丛基、董一彤、张华琦、唐萱、陈明松、卫先
Denoising Diffusion Probabilistic Models (DDPM) and Flow Matching (FM) generative models show promising potential for one-step Real-World Image Super-Resolution (Real-ISR). Recent one-step Real-ISR models typically inject a Low-Quality (LQ) image latent distribution at the initial timestep. However, a fundamental gap exists between the LQ image latent distribution and the Gaussian noisy latent distribution, limiting the effective utilization of generative priors. We observe that the noisy latent distribution at DDPM/FM mid-timesteps aligns more closely with the LQ image latent distribution. Based on this insight, we present One Mid-timestep Guidance Real-ISR (OMGSR), a universal framework applicable to DDPM/FM-based generative models. OMGSR injects the LQ image latent distribution at a pre-computed mid-timestep, incorporating the proposed Latent Distribution Refinement loss to alleviate the latent distribution gap. We also design the Overlap-Chunked LPIPS/GAN loss to eliminate checkerboard artifacts in image generation. Within this framework, we instantiate OMGSR for DDPM/FM-based generative models with two variants: OMGSR-S (SD-Turbo) and OMGSR-F (FLUX.1-dev). Experimental results demonstrate that OMGSR-S/F achieves balanced/excellent performance across quantitative and qualitative metrics at 512-resolution. Notably, OMGSR-F establishes overwhelming dominance in all reference metrics. We further train a 1k-resolution OMGSR-F to match the default resolution of FLUX.1-dev, which yields excellent results, especially in the details of the image generation. We also generate 2k-resolution images by the 1k-resolution OMGSR-F using our two-stage Tiled VAE & Diffusion. 
去噪扩散概率模型(DDPM)和流匹配(FM)生成模型在一步式真实世界图像超分辨率(Real-ISR)方面展示出有前景的潜力。近期的一步式 Real-ISR 模型通常在初始时间步注入低质量(LQ)图像的潜在分布。然而,LQ 图像潜在分布与高斯噪声潜在分布之间存在根本差距,限制了生成先验的有效利用。我们观察到在 DDPM/FM 中间时间步的噪声潜在分布与 LQ 图像潜在分布更为接近。基于这一洞见,我们提出了一步中间时间步引导的 Real-ISR(OMGSR),这是一个适用于基于 DDPM/FM 的生成模型的通用框架。OMGSR 在预先计算的中间时间步注入 LQ 图像潜在分布,并引入了所提出的潜在分布细化损失以缩小潜在分布差距。我们还设计了重叠分块的 LPIPS/GAN 损失以消除图像生成中的棋盘伪影。 在此框架下,我们为基于 DDPM/FM 的生成模型实例化了 OMGSR,并提供两个变体:OMGSR-S(SD-Turbo)和 OMGSR-F(FLUX.1-dev)。实验结果表明,OMGSR-S/F 在 512 分辨率下在定量和定性指标上均取得了平衡/优异的表现。值得注意的是,OMGSR-F 在所有参考指标上都确立了压倒性的优势。我们进一步训练了一个 1k 分辨率的 OMGSR-F 以匹配 FLUX.1-dev 的默认分辨率,该模型产生了出色的结果,尤其是在图像生成的细节方面。我们还使用两阶段 Tiled VAE 与扩散方法,通过 1k 分辨率的 OMGSR-F 生成了 2k 分辨率的图像。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-11 17:44:59 UTC 发布时间:2025-08-11 17:44:59 UTC
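For reference, the "noisy latent distribution at a timestep" in #69 comes from the standard forward processes of the two model families. The formulas below are textbook definitions under common conventions: DDPM with a cumulative schedule $\bar\alpha_t$, and flow matching with a linear path where $t = 1$ is pure noise. The symbols $z_0$, $z_t$ and $\epsilon$ are notation chosen here, and the abstract does not say how OMGSR pre-computes its mid-timestep, so that step is not shown.

```latex
% DDPM forward marginal (z_0: clean latent, \bar\alpha_t: cumulative noise schedule)
z_t = \sqrt{\bar\alpha_t}\, z_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

% Flow matching with a linear interpolation path (t = 0: data, t = 1: noise)
z_t = (1 - t)\, z_0 + t\,\epsilon,
\qquad t \in [0, 1]
```

At the noisiest timestep $z_t$ is essentially pure Gaussian noise, whereas at a mid-timestep it still carries part of $z_0$. This is consistent with the abstract's observation that mid-timestep latents lie closer to the LQ image latent distribution.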
#70 Capabilities of GPT-5 on Multimodal Medical Reasoning #70 GPT-5 在多模态医学推理方面的能力
Authors: [Shansong Wang](https://arxiv.org/search/?searchtype=author&query=Shansong Wang), [Mingzhe Hu](https://arxiv.org/search/?searchtype=author&query=Mingzhe Hu), [Qiang Li](https://arxiv.org/search/?searchtype=author&query=Qiang Li), [Mojtaba Safari](https://arxiv.org/search/?searchtype=author&query=Mojtaba Safari), [Xiaofeng Yang](https://arxiv.org/search/?searchtype=author&query=Xiaofeng Yang)
Recent advances in large language models (LLMs) have enabled general-purpose systems to perform increasingly complex domain-specific reasoning without extensive fine-tuning. In the medical domain, decision-making often requires integrating heterogeneous information sources, including patient narratives, structured data, and medical images. This study positions GPT-5 as a generalist multimodal reasoner for medical decision support and systematically evaluates its zero-shot chain-of-thought reasoning performance on both text-based question answering and visual question answering tasks under a unified protocol. We benchmark GPT-5, GPT-5-mini, GPT-5-nano, and GPT-4o-2024-11-20 against standardized splits of MedQA, MedXpertQA (text and multimodal), MMLU medical subsets, USMLE self-assessment exams, and VQA-RAD. Results show that GPT-5 consistently outperforms all baselines, achieving state-of-the-art accuracy across all QA benchmarks and delivering substantial gains in multimodal reasoning. On MedXpertQA MM, GPT-5 improves reasoning and understanding scores by +29.62% and +36.18% over GPT-4o, respectively, and surpasses pre-licensed human experts by +24.23% in reasoning and +29.40% in understanding. In contrast, GPT-4o remains below human expert performance in most dimensions. A representative case study demonstrates GPT-5’s ability to integrate visual and textual cues into a coherent diagnostic reasoning chain, recommending appropriate high-stakes interventions. Our results show that, on these controlled multimodal reasoning benchmarks, GPT-5 moves from human-comparable to above human-expert performance. This improvement may substantially inform the design of future clinical decision-support systems. 
近期在大型语言模型(LLMs)方面的进展使通用系统在无需大量微调的情况下,能够执行日益复杂的特定领域推理。在医学领域,决策常常需要整合异构信息源,包括患者叙述、结构化数据和医学影像。本研究将 GPT-5 定位为用于医学决策支持的通用多模态推理器,并在统一协议下系统评估其在文本问答和视觉问答任务上的零样本链式思维推理性能。我们对 GPT-5、GPT-5-mini、GPT-5-nano 和 GPT-4o-2024-11-20 在 MedQA、MedXpertQA(文本和多模态)、MMLU 医学子集、USMLE 自测考试和 VQA-RAD 的标准划分上进行了基准测试。结果显示,GPT-5 持续超越所有基线,在所有问答基准上实现了最先进的准确率,并在多模态推理方面带来了显著提升。 在 MedXpertQA MM 上,GPT-5 在推理和理解得分上分别比 GPT-4o 提高了 +29.62% 和 +36.18%,并且在推理上比预许可的人类专家高出 +24.23%,在理解上高出 +29.40%。相比之下,GPT-4o 在大多数维度仍然低于人类专家表现。一项具有代表性的案例研究展示了 GPT-5 将视觉和文本线索整合为连贯诊断推理链的能力,并建议了适当的高风险干预措施。我们的结果表明,在这些受控的多模态推理基准上,GPT-5 已从与人类可比提升到超越人类专家的表现。这一改进可能会大幅影响未来临床决策支持系统的设计。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-11 17:43:45 UTC 发布:2025-08-11 17:43:45 协调世界时 (UTC)
#71 Multi-head Transformers Provably Learn Symbolic Multi-step Reasoning via Gradient Descent
Authors: [Tong Yang](https://arxiv.org/search/?searchtype=author&query=Tong Yang), [Yu Huang](https://arxiv.org/search/?searchtype=author&query=Yu Huang), [Yingbin Liang](https://arxiv.org/search/?searchtype=author&query=Yingbin Liang), [Yuejie Chi](https://arxiv.org/search/?searchtype=author&query=Yuejie Chi)
Transformers have demonstrated remarkable capabilities in multi-step reasoning tasks. However, understandings of the underlying mechanisms by which they acquire these abilities through training remain limited, particularly from a theoretical standpoint. This work investigates how transformers learn to solve symbolic multi-step reasoning problems through chain-of-thought processes, focusing on path-finding in trees. We analyze two intertwined tasks: a backward reasoning task, where the model outputs a path from a goal node to the root, and a more complex forward reasoning task, where the model implements two-stage reasoning by first identifying the goal-to-root path and then reversing it to produce the root-to-goal path. Our theoretical analysis, grounded in the dynamics of gradient descent, shows that trained one-layer transformers can provably solve both tasks with generalization guarantees to unseen trees. In particular, our multi-phase training dynamics for forward reasoning elucidate how different attention heads learn to specialize and coordinate autonomously to solve the two subtasks in a single autoregressive path. These results provide a mechanistic explanation of how trained transformers can implement sequential algorithmic procedures. Moreover, they offer insights into the emergence of reasoning abilities, suggesting that when tasks are structured to take intermediate chain-of-thought steps, even shallow multi-head transformers can effectively solve problems that would otherwise require deeper architectures. 变换器在多步推理任务中展示了显著能力。然而,对于它们在训练过程中如何获得这些能力的底层机制的理解仍然有限,尤其是从理论角度。本工作研究了变换器如何通过链式思维过程学习解决符号化的多步推理问题,重点是树中的路径搜索。我们分析了两个交织的任务:一个是后向推理任务,模型输出从目标节点到根节点的路径;另一个是更复杂的前向推理任务,模型通过两阶段推理先识别目标到根的路径,然后将其反转以生成从根到目标的路径。我们基于梯度下降动力学的理论分析表明,经训练的一层变换器可以有理论保证地解决这两类任务,并能推广到未见过的树。特别地,我们对前向推理的多阶段训练动力学阐明了不同注意力头如何自发地学习专门化并协调,从而在单一自回归路径中自治地完成这两个子任务。 这些结果为训练后的 Transformer 如何实现顺序算法流程提供了机械性的解释。此外,它们还对推理能力的出现提供了见解,表明当任务被结构化为需要中间链式思维步骤时,即使是浅层多头 Transformer 也能有效解决那些本来需要更深架构的问题。
Subjects: Machine Learning, Artificial Intelligence, Information Theory, Optimization and Control, Machine Learning 学科:机器学习、人工智能、信息论、优化与控制、机器学习
Publish: 2025-08-11 17:40:47 UTC 发布时间:2025-08-11 17:40:47 UTC
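The two tasks analysed in #71 can be made concrete in a few lines of ordinary code. The sketch below is not the paper's transformer; it only illustrates the target behaviour the model is trained to reproduce. Backward reasoning follows parent pointers from the goal to the root, and forward reasoning reverses that path. The tree encoding (a `parent` dictionary) and the function names are assumptions made for illustration.

```python
def backward_path(parent, goal):
    """Goal-to-root path: repeatedly follow the parent pointer."""
    path = [goal]
    while parent[path[-1]] is not None:  # the root is marked by parent None
        path.append(parent[path[-1]])
    return path


def forward_path(parent, goal):
    """Root-to-goal path: stage 1 finds the backward path, stage 2 reverses it."""
    return backward_path(parent, goal)[::-1]


# Hypothetical toy tree: "a" is the root, "b" and "c" are its children,
# and "d" and "e" are children of "b".
parent = {"a": None, "b": "a", "c": "a", "d": "b", "e": "b"}

print(backward_path(parent, "d"))  # ['d', 'b', 'a']
print(forward_path(parent, "d"))   # ['a', 'b', 'd']
```

The two-stage structure of `forward_path` mirrors the abstract's description of forward reasoning as identifying the goal-to-root path and then reversing it.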
#72 SAEMark: Multi-bit LLM Watermarking with Inference-Time Scaling #72 SAEMark:用于推理时缩放的多比特 LLM 水印
Authors: [Zhuohao Yu](https://arxiv.org/search/?searchtype=author&query=Zhuohao Yu), [Xingru Jiang](https://arxiv.org/search/?searchtype=author&query=Xingru Jiang), [Weizheng Gu](https://arxiv.org/search/?searchtype=author&query=Weizheng Gu), [Yidong Wang](https://arxiv.org/search/?searchtype=author&query=Yidong Wang), [Shikun Zhang](https://arxiv.org/search/?searchtype=author&query=Shikun Zhang), [Wei Ye](https://arxiv.org/search/?searchtype=author&query=Wei Ye)
Watermarking LLM-generated text is critical for content attribution and misinformation prevention. However, existing methods compromise text quality and require white-box model access and logit manipulation. These limitations exclude API-based models and multilingual scenarios. We propose SAEMark, a general framework for post-hoc multi-bit watermarking that embeds personalized messages solely via inference-time, feature-based rejection sampling without altering model logits or requiring training. Our approach operates on deterministic features extracted from generated text, selecting outputs whose feature statistics align with key-derived targets. This framework naturally generalizes across languages and domains while preserving text quality through sampling LLM outputs instead of modifying them. We provide theoretical guarantees relating watermark success probability and compute budget that hold for any suitable feature extractor. Empirically, we demonstrate the framework’s effectiveness using Sparse Autoencoders (SAEs), achieving superior detection accuracy and text quality. Experiments across 4 datasets show SAEMark’s consistent performance, with 99.7% F1 on English and strong multi-bit detection accuracy. SAEMark establishes a new paradigm for scalable watermarking that works out-of-the-box with closed-source LLMs while enabling content attribution. 为内容归属和防止错误信息传播,对由 LLM 生成文本进行水印标注至关重要。然而,现有方法会损害文本质量,并且需要白盒模型访问和 logit 操控。这些限制排除了基于 API 的模型和多语言场景。我们提出了 SAEMark,一种通用的事后多比特水印框架,它仅通过推理时基于特征的拒绝采样来嵌入个性化信息,无需更改模型 logit 或进行训练。我们的方法基于从生成文本中提取的确定性特征,选择其特征统计与密钥导出目标一致的输出。该框架自然可推广到不同语言和领域,同时通过采样 LLM 输出而不是修改它们来保持文本质量。我们提供了理论保证,将水印成功概率与计算预算建立联系,适用于任何合适的特征提取器。在实证方面,我们使用稀疏自编码器(SAE)展示了该框架的有效性,达到了更高的检测准确率和文本质量。 在 4 个数据集上的实验表明,SAEMark 表现稳定,在英语上达到 99.7%的 F1,并具有强大的多比特检测准确性。SAEMark 为可扩展水印设立了新的范式,该方法可开箱即用于闭源 LLMs,同时实现内容归属。
Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题:计算与语言,人工智能,机器学习
Publish: 2025-08-11 17:33:18 UTC 发布:2025-08-11 17:33:18 UTC
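The selection step described in #72 (feature-based rejection sampling against key-derived targets) can be sketched roughly as follows. This is a minimal illustration under loud assumptions, not SAEMark itself. `toy_feature` is a hypothetical stand-in for the paper's SAE-based feature extractor, `key_target` is one arbitrary way to derive a target from a secret key, the candidate list stands in for repeated LLM samples, and multi-bit message encoding is omitted.

```python
import hashlib


def key_target(key: str, position: int) -> float:
    """Derive a pseudo-random target in [0, 1) from a secret key and a unit index.
    (Hypothetical derivation, chosen only for illustration.)"""
    digest = hashlib.sha256(f"{key}:{position}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def toy_feature(text: str) -> float:
    """Deterministic text feature in [0, 1]: fraction of letters that are vowels.
    (Stand-in for an SAE-derived feature statistic.)"""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch in "aeiou" for ch in letters) / len(letters)


def select_watermarked(candidates, key, position):
    """Keep the sampled output whose feature lies closest to the key-derived target.
    The model's logits are never touched; only whole outputs are accepted or rejected."""
    target = key_target(key, position)
    return min(candidates, key=lambda text: abs(toy_feature(text) - target))


# In practice the candidates would be N samples from the LLM for the same prompt.
candidates = [
    "The sky is a deep blue today.",
    "Rhythms shift gently at dusk.",
    "An eerie aura amazed everyone.",
]
chosen = select_watermarked(candidates, key="secret-key", position=0)
print(chosen)
```

A detector holding the same key would recompute the targets and test whether the observed features sit closer to them than chance allows. Drawing more candidates per unit raises the chance of a close match, which is the compute-versus-success trade-off the abstract refers to.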
#73 Human-Alignment and Calibration of Inference-Time Uncertainty in Large Language Models #73 在大型语言模型中对推理时不确定性的人工对齐与校准
Authors: [Kyle Moore](https://arxiv.org/search/?searchtype=author&query=Kyle Moore), [Jesse Roberts](https://arxiv.org/search/?searchtype=author&query=Jesse Roberts), [Daryl Watson](https://arxiv.org/search/?searchtype=author&query=Daryl Watson)
There has been much recent interest in evaluating large language models for uncertainty calibration to facilitate model control and modulate user trust. Inference time uncertainty, which may provide a real-time signal to the model or external control modules, is particularly important for applying these concepts to improve LLM-user experience in practice. While many of the existing papers consider model calibration, comparatively little work has sought to evaluate how closely model uncertainty aligns to human uncertainty. In this work, we evaluate a collection of inference-time uncertainty measures, using both established metrics and novel variations, to determine how closely they align with both human group-level uncertainty and traditional notions of model calibration. We find that numerous measures show evidence of strong alignment to human uncertainty, even despite the lack of alignment to human answer preference. For those successful metrics, we find moderate to strong evidence of model calibration in terms of both correctness correlation and distributional analysis. 近年来,人们对评估大型语言模型(LLM)的不确定性校准产生了浓厚兴趣,以便促进模型控制并调节用户信任。推理时不确定性尤为重要,因为它可以为模型或外部控制模块提供实时信号,从而在实践中改善 LLM 与用户的交互体验。尽管许多现有论文考虑了模型校准,但相对较少的工作致力于评估模型不确定性与人类不确定性的一致程度。在本研究中,我们评估了一系列推理时不确定性度量,既采用既有度量也引入新变体,以确定它们与人类群体层面不确定性以及传统模型校准概念的一致程度。我们发现,尽管与人类答案偏好缺乏一致性,仍有多种度量显示出与人类不确定性高度一致的证据。对于这些成功的度量,我们在正确性相关性和分布分析方面都发现了中等到强烈的模型校准证据。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-11 17:22:45 UTC 发布时间:2025-08-11 17:22:45 UTC
#74 Street-Level AI: Are Large Language Models Ready for Real-World Judgments? #74 Street-Level AI:大型语言模型是否已准备好用于现实世界的判断?
Authors: [Gaurab Pokharel](https://arxiv.org/search/?searchtype=author&query=Gaurab Pokharel), [Shafkat Farabi](https://arxiv.org/search/?searchtype=author&query=Shafkat Farabi), [Patrick J. Fowler](https://arxiv.org/search/?searchtype=author&query=Patrick J. Fowler), [Sanmay Das](https://arxiv.org/search/?searchtype=author&query=Sanmay Das) 作者:Gaurab Pokharel、Shafkat Farabi、Patrick J. Fowler、Sanmay Das
A surge of recent work explores the ethical and societal implications of large-scale AI models that make “moral” judgments. Much of this literature focuses either on alignment with human judgments through various thought experiments or on the group fairness implications of AI judgments. However, the most immediate and likely use of AI is to help or fully replace the so-called street-level bureaucrats, the individuals deciding to allocate scarce social resources or approve benefits. There is a rich history underlying how principles of local justice determine how society decides on prioritization mechanisms in such domains. In this paper, we examine how well LLM judgments align with human judgments, as well as with socially and politically determined vulnerability scoring systems currently used in the domain of homelessness resource allocation. Crucially, we use real data on those needing services (maintaining strict confidentiality by only using local large models) to perform our analyses. We find that LLM prioritizations are extremely inconsistent in several ways: internally on different runs, between different LLMs, and between LLMs and the vulnerability scoring systems. At the same time, LLMs demonstrate qualitative consistency with lay human judgments in pairwise testing. Findings call into question the readiness of current generation AI systems for naive integration in high-stakes societal decision-making. 最近大量研究探讨了大型 AI 模型对“道德”判断的伦理和社会影响。大部分文献要么通过各种思想实验关注与人类判断的一致性,要么关注 AI 判断的群体公平性影响。然而,AI 最直接且最有可能的用途是辅助或完全替代所谓的一线官僚,即那些决定分配有限社会资源或批准福利的个体。关于地方正义原则如何决定社会在此类领域中选择优先机制,有着丰富的历史渊源。在本文中,我们考察了 LLM 的判断与人类判断的匹配程度,以及与当前用于无家可归者资源分配领域的、由社会和政治决定的脆弱性评分系统的一致性。关键是,我们使用了有关需要服务人员的真实数据(通过仅使用本地大型模型严格保密)来进行分析。我们发现 LLM 的优先排序在多方面极不一致:在不同运行之间、不同 LLM 之间,以及 LLM 与脆弱性评分系统之间都存在显著差异。 与此同时,LLMs 在成对测试中与非专业人类判断表现出定性一致性。研究结果对当前一代人工智能系统在高风险社会决策中进行天真的直接整合提出了质疑。
Subjects: Computers and Society, Artificial Intelligence 主题:计算机与社会,人工智能
Publish: 2025-08-11 17:12:55 UTC 发布:2025-08-11 17:12:55 UTC
#75 RedDino: A foundation model for red blood cell analysis #75 RedDino:用于红细胞分析的基础模型
Authors: [Luca Zedda](https://arxiv.org/search/?searchtype=author&query=Luca Zedda), [Andrea Loddo](https://arxiv.org/search/?searchtype=author&query=Andrea Loddo), [Cecilia Di Ruberto](https://arxiv.org/search/?searchtype=author&query=Cecilia Di Ruberto), [Carsten Marr](https://arxiv.org/search/?searchtype=author&query=Carsten Marr) 作者:Luca Zedda、Andrea Loddo、Cecilia Di Ruberto、Carsten Marr
Red blood cells (RBCs) are essential to human health, and their precise morphological analysis is important for diagnosing hematological disorders. Despite the promise of foundation models in medical diagnostics, comprehensive AI solutions for RBC analysis remain scarce. We present RedDino, a self-supervised foundation model designed for RBC image analysis. RedDino uses an RBC-specific adaptation of the DINOv2 self-supervised learning framework and is trained on a curated dataset of 1.25 million RBC images from diverse acquisition modalities and sources. Extensive evaluations show that RedDino outperforms existing state-of-the-art models on RBC shape classification. Through assessments including linear probing and nearest neighbor classification, we confirm its strong feature representations and generalization ability. Our main contributions are: (1) a foundation model tailored for RBC analysis, (2) ablation studies exploring DINOv2 configurations for RBC modeling, and (3) a detailed evaluation of generalization performance. RedDino addresses key challenges in computational hematology by capturing nuanced morphological features, advancing the development of reliable diagnostic tools. 
The source code and pretrained models for RedDino are available at https://github.com/Snarci/RedDino, and the pretrained models can be downloaded from our Hugging Face collection at https://huggingface.co/collections/Snarcy/reddino-689a13e29241d2e5690202fc 红细胞(RBC)对人体健康至关重要,对其精确的形态学分析对于诊断血液系统疾病非常重要。尽管基础模型在医学诊断中前景广阔,但用于红细胞分析的全面人工智能解决方案仍然稀缺。我们提出了 RedDino,一种为红细胞图像分析设计的自监督基础模型。RedDino 使用了针对红细胞的 DINOv2 自监督学习框架的改进,并在来自多种采集方式和来源、经挑选的 125 万张红细胞图像数据集上进行训练。大量评估表明,RedDino 在红细胞形状分类任务上优于现有最先进模型。通过线性探测和最近邻分类等评估,我们验证了其强大的特征表示能力和泛化能力。我们的主要贡献包括: (1) 一个为红细胞分析量身定制的基础模型,(2) 探索用于红细胞建模的 DINOv2 配置的消融研究,(3) 对泛化性能的详细评估。 RedDino 通过捕捉细微的形态学特征,解决了计算血液学中的关键挑战,推动了可靠诊断工具的发展。RedDino 的源代码和预训练模型可在 https://github.com/Snarci/RedDino 获取,预训练模型也可从我们的 Hugging Face 收藏中下载:https://huggingface.co/collections/Snarcy/reddino-689a13e29241d2e5690202fc
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-11 16:59:31 UTC 发布:2025-08-11 16:59:31 UTC
#76 MedReasoner: Reinforcement Learning Drives Reasoning Grounding from Clinical Thought to Pixel-Level Precision #76 MedReasoner:强化学习将推理从临床思维引导到像素级精确度
Authors: [Zhonghao Yan](https://arxiv.org/search/?searchtype=author&query=Zhonghao Yan), [Muxi Diao](https://arxiv.org/search/?searchtype=author&query=Muxi Diao), [Yuxuan Yang](https://arxiv.org/search/?searchtype=author&query=Yuxuan Yang), [Jiayuan Xu](https://arxiv.org/search/?searchtype=author&query=Jiayuan Xu), [Kaizhou Zhang](https://arxiv.org/search/?searchtype=author&query=Kaizhou Zhang), [Ruoyan Jing](https://arxiv.org/search/?searchtype=author&query=Ruoyan Jing), [Lele Yang](https://arxiv.org/search/?searchtype=author&query=Lele Yang), [Yanxi Liu](https://arxiv.org/search/?searchtype=author&query=Yanxi Liu), [Kongming Liang](https://arxiv.org/search/?searchtype=author&query=Kongming Liang), [Zhanyu Ma](https://arxiv.org/search/?searchtype=author&query=Zhanyu Ma) 作者:颜中昊,刁木西,杨昱轩,许嘉元,张凯周,荆若妍,杨乐乐,柳嫣曦,梁孔明,马展宇
Accurately grounding regions of interest (ROIs) is critical for diagnosis and treatment planning in medical imaging. While multimodal large language models (MLLMs) combine visual perception with natural language, current medical-grounding pipelines still rely on supervised fine-tuning with explicit spatial hints, making them ill-equipped to handle the implicit queries common in clinical practice. This work makes three core contributions. We first define Unified Medical Reasoning Grounding (UMRG), a novel vision-language task that demands clinical reasoning and pixel-level grounding. Second, we release U-MRG-14K, a dataset of 14K samples featuring pixel-level masks alongside implicit clinical queries and reasoning traces, spanning 10 modalities, 15 super-categories, and 108 specific categories. Finally, we introduce MedReasoner, a modular framework that distinctly separates reasoning from segmentation: an MLLM reasoner is optimized with reinforcement learning, while a frozen segmentation expert converts spatial prompts into masks, with alignment achieved through format and accuracy rewards. MedReasoner achieves state-of-the-art performance on U-MRG-14K and demonstrates strong generalization to unseen clinical queries, underscoring the significant promise of reinforcement learning for interpretable medical grounding. 在医学影像中,准确定位感兴趣区域(ROI)对诊断和治疗规划至关重要。尽管多模态大语言模型(MLLMs)将视觉感知与自然语言结合起来,现有的医学定位管线仍依赖带有明确空间提示的监督微调,因此难以应对临床实践中常见的隐式查询。本工作有三项核心贡献。首先,我们定义了统一医学推理定位(Unified Medical Reasoning Grounding,UMRG)这一新颖的视觉-语言任务,该任务要求临床推理与像素级定位。其次,我们发布了 U-MRG-14K 数据集,包含 14K 样本,配有像素级掩模、隐式临床查询和推理痕迹,涵盖 10 种模态、15 个大类和 108 个具体类别。最后,我们提出了 MedReasoner,一个明确将推理与分割分离的模块化框架:MLLM 推理器通过强化学习进行优化,而冻结的分割专家将空间提示转换为掩模,通过格式与准确性奖励实现对齐。 MedReasoner 在 U-MRG-14K 上达到了最先进的性能,并且对未见过的临床查询表现出强大的泛化能力,突显了强化学习在可解释医学基础定位方面的巨大前景。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-11 16:59:06 UTC 发布:2025-08-11 16:59:06 UTC
#77 Neural Logic Networks for Interpretable Classification #77 神经逻辑网络用于可解释分类
Authors: [Vincent Perreault](https://arxiv.org/search/?searchtype=author&query=Vincent Perreault), [Katsumi Inoue](https://arxiv.org/search/?searchtype=author&query=Katsumi Inoue), [Richard Labib](https://arxiv.org/search/?searchtype=author&query=Richard Labib), [Alain Hertz](https://arxiv.org/search/?searchtype=author&query=Alain Hertz) 作者:Vincent Perreault、Katsumi Inoue、Richard Labib、Alain Hertz
Traditional neural networks have an impressive classification performance, but what they learn cannot be inspected, verified or extracted. Neural Logic Networks on the other hand have an interpretable structure that enables them to learn a logical mechanism relating the inputs and outputs with AND and OR operations. We generalize these networks with NOT operations and biases that take into account unobserved data and develop a rigorous logical and probabilistic modeling in terms of concept combinations to motivate their use. We also propose a novel factorized IF-THEN rule structure for the model as well as a modified learning algorithm. Our method improves the state-of-the-art in Boolean networks discovery and is able to learn relevant, interpretable rules in tabular classification, notably on an example from the medical field where interpretability has tangible value. 传统神经网络在分类性能上令人印象深刻,但它们所学到的内容无法被检查、验证或提取。另一方面,神经逻辑网络具有可解释的结构,使其能够使用与(AND)与或(OR)运算学习输入与输出之间的逻辑机制。我们将这些网络推广为包含非(NOT)运算和考虑未观测数据的偏置项,并以概念组合的形式发展了严格的逻辑和概率建模来论证其使用。我们还为模型提出了一种新颖的因式分解式“如果—那么”规则结构以及一种改进的学习算法。我们的方法提升了布尔网络发现的最新水平,并能够在表格分类中学习到相关且可解释的规则,尤其是在医疗领域的一个示例中,可解释性具有切实的价值。
Subjects: Machine Learning, Artificial Intelligence, Logic in Computer Science 主题:机器学习,人工智能,计算机科学中的逻辑
Publish: 2025-08-11 16:49:56 UTC 发布:2025-08-11 16:49:56 UTC
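To make the AND, OR and NOT operations in #77 concrete, the sketch below uses one common probabilistic relaxation of logic gates. Each input `x[i]` is treated as the probability that a concept holds, and each weight `w[i]` as the probability that the input takes part in the rule, with all of them assumed independent. This is an illustrative formulation only. The paper's exact parameterisation, including its biases for unobserved data and its factorized IF-THEN structure, is not reproduced here.

```python
from math import prod


def soft_and(x, w):
    """Soft conjunction: an input can only break the AND if it is included (w) and false (1 - x)."""
    return prod(1 - wi * (1 - xi) for xi, wi in zip(x, w))


def soft_or(x, w):
    """Soft disjunction: the OR fails only if no included input is true."""
    return 1 - prod(1 - wi * xi for xi, wi in zip(x, w))


def soft_not(x):
    """Negation of a probability."""
    return 1 - x


# With binary inputs and weights, the gates reduce to ordinary Boolean logic.
print(soft_and([1, 0], [1, 1]))  # 0
print(soft_or([1, 0], [1, 1]))   # 1
print(soft_and([1, 0], [1, 0]))  # 1  (second input is excluded by its zero weight)

# Hypothetical rule: IF (x0 AND NOT x1) THEN positive, evaluated on soft inputs.
x0, x1 = 0.9, 0.2
print(soft_and([x0, soft_not(x1)], [1.0, 1.0]))
```

Because the learned weights concentrate near 0 or 1 in such relaxations, a trained gate can be read off as an explicit rule, which is the interpretability property the abstract emphasises.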
#78 PyVeritas: On Verifying Python via LLM-Based Transpilation and Bounded Model Checking for C #78 PyVeritas:关于通过基于 LLM 的转译和针对 C 的有界模型检查来验证 Python
Authors: [Pedro Orvalho](https://arxiv.org/search/?searchtype=author&query=Pedro Orvalho), [Marta Kwiatkowska](https://arxiv.org/search/?searchtype=author&query=Marta Kwiatkowska) 作者:Pedro Orvalho,Marta Kwiatkowska
Python has become the dominant language for general-purpose programming, yet it lacks robust tools for formal verification. In contrast, programmers working in languages such as C benefit from mature model checkers, for example CBMC, which enable exhaustive symbolic reasoning and fault localisation. The inherent complexity of Python, coupled with the verbosity and low-level nature of existing transpilers (e.g., Cython), has historically limited the applicability of formal verification to Python programs. In this paper, we propose PyVeritas, a novel framework that leverages Large Language Models (LLMs) for high-level transpilation from Python to C, followed by bounded model checking and MaxSAT-based fault localisation in the generated C code. PyVeritas enables verification and bug localisation for Python code using existing model checking tools for C. Our empirical evaluation on two Python benchmarks demonstrates that LLM-based transpilation can achieve a high degree of accuracy, up to 80–90% for some LLMs, enabling an effective development environment that supports assertion-based verification and interpretable fault diagnosis for small yet non-trivial Python programs. Python 已成为通用编程的主导语言,但它缺乏用于形式化验证的强大工具。相比之下,使用 C 等语言的程序员受益于成熟的模型检验器,例如 CBMC,这些工具支持穷尽的符号推理和故障定位。Python 的固有复杂性,再加上现有转译器(例如 Cython)的冗长与低层特性,历来限制了形式化验证在 Python 程序中的适用性。在本文中,我们提出了 PyVeritas,一个新颖的框架,利用大型语言模型(LLMs)将 Python 高层次转译为 C,随后对生成的 C 代码进行有界模型检验和基于 MaxSAT 的故障定位。PyVeritas 使得可以使用现有的 C 语言模型检验工具来对 Python 代码进行验证和错误定位。我们在两个 Python 基准上的实证评估表明,基于 LLM 的转译可以达到较高的准确率,对于某些 LLM 可达 80–90%,从而支持针对小而非平凡的 Python 程序的断言式验证和可解释的故障诊断的有效开发环境。
Subjects: Software Engineering, Artificial Intelligence 主题:软件工程,人工智能
Publish: 2025-08-11 16:49:07 UTC 发布:2025-08-11 16:49:07 UTC
#79 LPI-RIT at LeWiDi-2025: Improving Distributional Predictions via Metadata and Loss Reweighting with DisCo #79 LPI-RIT 在 LeWiDi-2025:通过元数据和损失重加权以及 DisCo 改善分布式预测
Authors: [Mandira Sawkar](https://arxiv.org/search/?searchtype=author&query=Mandira Sawkar), [Samay U. Shetty](https://arxiv.org/search/?searchtype=author&query=Samay U. Shetty), [Deepak Pandita](https://arxiv.org/search/?searchtype=author&query=Deepak Pandita), [Tharindu Cyril Weerasooriya](https://arxiv.org/search/?searchtype=author&query=Tharindu Cyril Weerasooriya), [Christopher M. Homan](https://arxiv.org/search/?searchtype=author&query=Christopher M. Homan) 作者:Mandira Sawkar、Samay U. Shetty、Deepak Pandita、Tharindu Cyril Weerasooriya、Christopher M. Homan
The Learning With Disagreements (LeWiDi) 2025 shared task aims to model annotator disagreement through soft label distribution prediction and perspectivist evaluation that models annotators. We adapt DisCo (Distribution from Context), a neural architecture that jointly models item-level and annotator-level label distributions, and present detailed analysis and improvements. In this paper, we extend DisCo by incorporating annotator metadata, enhancing input representations, and modifying the loss functions to capture disagreement patterns better. Through extensive experiments, we demonstrate substantial improvements in both soft and perspectivist evaluation metrics across three datasets. We also conduct in-depth error and calibration analyses, highlighting the conditions under which improvements occur. Our findings underscore the value of disagreement-aware modeling and offer insights into how system components interact with the complexity of human-annotated data. 2025 年 LeWiDi(带有分歧的学习,Learning With Disagreements)共享任务旨在通过软标签分布预测和观点主义评估来模拟标注者分歧,即对标注者建模。我们改编了 DisCo(Distribution from Context),这是一种联合建模条目级和标注者级标签分布的神经架构,并给出了详细的分析和改进。在本文中,我们通过加入标注者元数据、增强输入表征以及修改损失函数以更好地捕捉分歧模式,扩展了 DisCo。通过大量实验,我们证明在三个数据集上,软评估和观点主义评估指标均有显著提升。我们还进行了深入的错误与校准分析,重点说明了改进发生的条件。我们的发现强调了关注分歧的建模的价值,并提供了关于系统组件如何与人工标注数据的复杂性相互作用的见解。
Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题:计算与语言,人工智能,机器学习
Publish: 2025-08-11 16:39:09 UTC 发布时间:2025-08-11 16:39:09 UTC
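The "soft label distribution" in #79 is the empirical distribution of annotator votes for an item, and a model is scored by how far its predicted distribution is from it. The sketch below shows this with two generic distances. The function names, the toy annotations, and the choice of cross-entropy and Manhattan distance are assumptions for illustration, not necessarily the shared task's official metrics or DisCo's loss.

```python
from collections import Counter
from math import log


def soft_label(annotations, classes):
    """Turn raw annotator votes for one item into a probability distribution."""
    counts = Counter(annotations)
    total = len(annotations)
    return [counts[c] / total for c in classes]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between the annotator distribution and the model's prediction."""
    return -sum(t * log(max(p, eps)) for t, p in zip(target, predicted))


def manhattan(target, predicted):
    """L1 distance between two distributions (0 means a perfect match)."""
    return sum(abs(t - p) for t, p in zip(target, predicted))


# Hypothetical item labelled by five annotators who disagree.
classes = ["offensive", "not_offensive"]
votes = ["offensive", "not_offensive", "not_offensive", "offensive", "not_offensive"]

target = soft_label(votes, classes)
print(target)                         # [0.4, 0.6]
print(manhattan(target, [0.5, 0.5]))  # about 0.2
print(cross_entropy(target, [0.5, 0.5]))
```

A majority-vote label would collapse this item to `not_offensive` and discard the 40% of annotators who disagreed. Predicting the full distribution keeps that signal.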
#80 Can AI Explanations Make You Change Your Mind? #80 人工智能解释能让你改变主意吗?
Authors: [Laura Spillner](https://arxiv.org/search/?searchtype=author&query=Laura Spillner), [Rachel Ringe](https://arxiv.org/search/?searchtype=author&query=Rachel Ringe), [Robert Porzel](https://arxiv.org/search/?searchtype=author&query=Robert Porzel), [Rainer Malaka](https://arxiv.org/search/?searchtype=author&query=Rainer Malaka) 作者:Laura Spillner、Rachel Ringe、Robert Porzel、Rainer Malaka
In the context of AI-based decision support systems, explanations can help users to judge when to trust the AI’s suggestion, and when to question it. In this way, human oversight can prevent AI errors and biased decision-making. However, this rests on the assumption that users will consider explanations in enough detail to be able to catch such errors. We conducted an online study on trust in explainable DSS, and were surprised to find that in many cases, participants spent little time on the explanation and did not always consider it in detail. We present an exploratory analysis of this data, investigating what factors impact how carefully study participants consider AI explanations, and how this in turn impacts whether they are open to changing their mind based on what the AI suggests. 在基于人工智能的决策支持系统背景下,解释可以帮助用户判断何时信任 AI 的建议、何时对其提出质疑。通过这种方式,人类监督可以防止 AI 错误和有偏见的决策。然而,这建立在一个假设之上:用户会足够详尽地审视解释,以便发现此类错误。我们在可解释决策支持系统的信任问题上进行了在线研究,并惊讶地发现,在许多情况下,参与者在解释上花费的时间很少,也并不总是详细考虑。我们对这些数据进行了探索性分析,研究了哪些因素影响参与者多仔细地考虑 AI 解释,以及这反过来如何影响他们是否愿意根据 AI 的建议改变主意。
Subjects: Human-Computer Interaction, Artificial Intelligence 主题:人机交互,人工智能
Publish: 2025-08-11 16:36:20 UTC 发布:2025-08-11 16:36:20 UTC
#81 COMponent-Aware Pruning for Accelerated Control Tasks in Latent Space Models #81 面向组件感知的剪枝,用于潜在空间模型中加速控制任务
Authors: [Ganesh Sundaram](https://arxiv.org/search/?searchtype=author&query=Ganesh Sundaram), [Jonas Ulmen](https://arxiv.org/search/?searchtype=author&query=Jonas Ulmen), [Amjad Haider](https://arxiv.org/search/?searchtype=author&query=Amjad Haider), [Daniel Görges](https://arxiv.org/search/?searchtype=author&query=Daniel Görges) 作者:Ganesh Sundaram、Jonas Ulmen、Amjad Haider、Daniel Görges
The rapid growth of resource-constrained mobile platforms, including mobile robots, wearable systems, and Internet-of-Things devices, has increased the demand for computationally efficient neural network controllers (NNCs) that can operate within strict hardware limitations. While deep neural networks (DNNs) demonstrate superior performance in control applications, their substantial computational complexity and memory requirements present significant barriers to practical deployment on edge devices. This paper introduces a comprehensive model compression methodology that leverages component-aware structured pruning to determine the optimal pruning magnitude for each pruning group, ensuring a balance between compression and stability for NNC deployment. Our approach is rigorously evaluated on Temporal Difference Model Predictive Control (TD-MPC), a state-of-the-art model-based reinforcement learning algorithm, with a systematic integration of mathematical stability guarantee properties, specifically Lyapunov criteria. The key contribution of this work lies in providing a principled framework for determining the theoretical limits of model compression while preserving controller stability. Experimental validation demonstrates that our methodology successfully reduces model complexity while maintaining requisite control performance and stability characteristics. Furthermore, our approach establishes a quantitative boundary for safe compression ratios, enabling practitioners to systematically determine the maximum permissible model reduction before violating critical stability properties, thereby facilitating the confident deployment of compressed NNCs in resource-limited environments. 
包括移动机器人、可穿戴设备和物联网设备在内的资源受限移动平台的快速增长,增加了对能够在严格硬件限制内运行的计算高效神经网络控制器(NNC)的需求。尽管深度神经网络(DNN)在控制应用中表现优越,但其大量的计算复杂性和内存要求对在边缘设备上的实际部署构成了重大障碍。本文提出了一种综合模型压缩方法,该方法利用组件感知的结构化剪枝来确定每个剪枝组的最佳剪枝幅度,确保在压缩与稳定性之间实现平衡以便部署 NNC。我们的方法在时序差分模型预测控制(TD-MPC)——一种最先进的基于模型的强化学习算法上进行了严格评估,并系统性地整合了数学稳定性保证属性,特别是李雅普诺夫(Lyapunov)准则。 本工作的关键贡献在于提供了一个有原则的框架,用以确定在保持控制器稳定性的同时模型压缩的理论极限。实验验证表明,我们的方法在降低模型复杂度的同时,成功维持了必要的控制性能和稳定性特征。此外,我们的方法建立了安全压缩比的定量界限,使从业者能够系统地确定在破坏关键稳定性属性之前最大可允许的模型缩减,从而促进在资源受限环境中自信地部署压缩的神经网络控制器(NNC)。
Subjects: Robotics, Artificial Intelligence, Systems and Control 学科:机器人学、人工智能、系统与控制
Publish: 2025-08-11 16:16:51 UTC 发布:2025-08-11 16:16:51 UTC
#82 Can LLMs Detect Their Confabulations? Estimating Reliability in Uncertainty-Aware Language Models #82 LLMs 能否检测到它们的虚构?在具备不确定性感知的语言模型中估计可靠性
Authors: [Tianyi Zhou](https://arxiv.org/search/?searchtype=author&query=Tianyi Zhou), [Johanne Medina](https://arxiv.org/search/?searchtype=author&query=Johanne Medina), [Sanjay Chawla](https://arxiv.org/search/?searchtype=author&query=Sanjay Chawla) 作者:周天翼、Johanne Medina、Sanjay Chawla
Large Language Models (LLMs) are prone to generating fluent but incorrect content, known as confabulation, which poses increasing risks in multi-turn or agentic applications where outputs may be reused as context. In this work, we investigate how in-context information influences model behavior and whether LLMs can identify their unreliable responses. We propose a reliability estimation that leverages token-level uncertainty to guide the aggregation of internal model representations. Specifically, we compute aleatoric and epistemic uncertainty from output logits to identify salient tokens and aggregate their hidden states into compact representations for response-level reliability prediction. Through controlled experiments on open QA benchmarks, we find that correct in-context information improves both answer accuracy and model confidence, while misleading context often induces confidently incorrect responses, revealing a misalignment between uncertainty and correctness. Our probing-based method captures these shifts in model behavior and improves the detection of unreliable outputs across multiple open-source LLMs. These results underscore the limitations of direct uncertainty signals and highlight the potential of uncertainty-guided probing for reliability-aware generation. 大型语言模型(LLMs)容易生成流畅但不正确的内容,这被称为虚构(confabulation),在多轮或具代理性的应用中风险日益增加,因为输出可能被重复用作上下文。在本研究中,我们探讨了上下文信息如何影响模型行为以及 LLMs 是否能够识别其不可靠的回答。我们提出了一种可靠性估计方法,利用基于标记的(token-level)不确定性来引导内部模型表示的聚合。具体而言,我们从输出 logits 中计算固有不确定性(aleatoric)和认知不确定性(epistemic),以识别显著标记,并将这些标记的隐层状态聚合为用于回答级可靠性预测的紧凑表示。通过在开放问答基准上的受控实验,我们发现,正确的上下文信息既提高了答案准确率,也提高了模型置信度,而误导性上下文常常导致模型自信地给出错误回答,揭示了不确定性与正确性之间的错位。我们的基于探测的方法捕捉到了这些模型行为的变化,并在多种开源 LLMs 上改进了对不可靠输出的检测。 这些结果强调了直接不确定性信号的局限性,并彰显了以不确定性为指导的探测在面向可靠性生成方面的潜力。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-11 16:12:36 UTC 发布:2025-08-11 16:12:36 UTC
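The pipeline in the abstract above (token-level uncertainty from logits, salient-token selection, hidden-state aggregation) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' method: plain predictive entropy stands in for the aleatoric/epistemic decomposition, which the abstract does not specify; "salient" is assumed to mean the `k` most uncertain tokens; and the names `token_entropy` and `aggregate_salient` are hypothetical.

```python
import math


def softmax(logits):
    # Numerically stable softmax over one token's vocabulary logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def token_entropy(logits):
    # Predictive entropy, used here as a simple stand-in uncertainty score
    # (assumption: the paper's aleatoric/epistemic scores are not reproduced).
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0)


def aggregate_salient(logits_per_token, hidden_per_token, k=2):
    # Pick the k most uncertain tokens (assumed notion of "salient") and
    # mean-pool their hidden states into one response-level vector.
    scores = [token_entropy(l) for l in logits_per_token]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    dim = len(hidden_per_token[0])
    pooled = [sum(hidden_per_token[i][d] for i in top) / len(top) for d in range(dim)]
    return pooled, sorted(top)


# Toy response with 4 generated tokens, vocabulary of 3, hidden size 2.
logits = [[10.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.5, 0.4, 0.6], [8.0, 0.0, 0.0]]
hidden = [[1.0, 0.0], [0.0, 2.0], [4.0, 0.0], [0.0, 0.0]]
pooled, salient = aggregate_salient(logits, hidden, k=2)
```

The pooled vector would then be the input to a small probe (for example a logistic-regression classifier) trained to predict whether the response is reliable; that probe is omitted here.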
#83 MuaLLM: A Multimodal Large Language Model Agent for Circuit Design Assistance with Hybrid Contextual Retrieval-Augmented Generation #83 MuaLLM: 一种用于电路设计辅助的多模态大型语言模型代理,采用混合上下文的检索增强生成
Authors: [Pravallika Abbineni](https://arxiv.org/search/?searchtype=author&query=Pravallika Abbineni), [Saoud Aldowaish](https://arxiv.org/search/?searchtype=author&query=Saoud Aldowaish), [Colin Liechty](https://arxiv.org/search/?searchtype=author&query=Colin Liechty), [Soroosh Noorzad](https://arxiv.org/search/?searchtype=author&query=Soroosh Noorzad), [Ali Ghazizadeh](https://arxiv.org/search/?searchtype=author&query=Ali Ghazizadeh), [Morteza Fayazi](https://arxiv.org/search/?searchtype=author&query=Morteza Fayazi) 作者:Pravallika Abbineni、Saoud Aldowaish、Colin Liechty、Soroosh Noorzad、Ali Ghazizadeh、Morteza Fayazi
Conducting a comprehensive literature review is crucial for advancing circuit design methodologies. However, the rapid influx of state-of-the-art research, inconsistent data representation, and the complexity of optimizing circuit design objectives make this task significantly challenging. In this paper, we propose MuaLLM, an open-source multimodal Large Language Model (LLM) agent for circuit design assistance that integrates a hybrid Retrieval-Augmented Generation (RAG) framework with an adaptive vector database of circuit design research papers. Unlike conventional LLMs, the MuaLLM agent employs a Reason + Act (ReAct) workflow for iterative reasoning, goal-setting, and multi-step information retrieval. It functions as a question-answering design assistant, capable of interpreting complex queries and providing reasoned responses grounded in circuit literature. Its multimodal capabilities enable processing of both textual and visual data, facilitating more efficient and comprehensive analysis. The system dynamically adapts using intelligent search tools, automated document retrieval from the internet, and real-time database updates. Unlike conventional approaches constrained by model context limits, MuaLLM decouples retrieval from inference, enabling scalable reasoning over arbitrarily large corpora. At the maximum context length supported by standard LLMs, MuaLLM remains up to 10x less costly and 1.6x faster while maintaining the same accuracy. This allows rapid, no-human-in-the-loop database generation, overcoming the bottleneck of simulation-based dataset creation for circuits. To evaluate MuaLLM, we introduce two custom datasets: RAG-250, targeting retrieval and citation performance, and Reasoning-100 (Reas-100), focused on multistep reasoning in circuit design. MuaLLM achieves 90.1% recall on RAG-250, and 86.8% accuracy on Reas-100. 
进行全面的文献综述对于推进电路设计方法至关重要。然而,最先进研究的大量涌入、数据表示的不一致性以及优化电路设计目标的复杂性使这项任务变得极具挑战性。在本文中,我们提出了 MuaLLM,一种开源的多模态 LLM 代理,用于电路设计辅助,该代理将混合的检索增强生成(RAG)框架与自适应电路设计研究论文向量数据库相结合。与传统的 LLM 不同,MuaLLM 代理采用 Reason + Act(ReAct)工作流进行迭代推理、目标设定和多步信息检索。它作为一个问答式设计助手,能够解读复杂查询并提供基于电路文献的推理回应。其多模态能力使其能够处理文本和视觉数据,从而促进更高效、更全面的分析。该系统通过智能搜索工具、来自互联网的自动文档检索和实时数据库更新进行动态自适应。 与受模型上下文限制的传统方法不同,MuaLLM 将检索与推理解耦,使得对任意大语料库进行可扩展推理成为可能。在标准 LLMs 支持的最大上下文长度下,MuaLLM 在保持相同准确率的同时,成本低至最多 10 分之 1 且速度快 1.6 倍。这使得快速、无人参与的数据库生成成为可能,克服了基于仿真的数据集创建在电路领域的瓶颈。为评估 MuaLLM,我们引入了两个自定义数据集:针对检索与引用性能的 RAG-250,以及专注于电路设计多步推理的 Reasoning-100(Reas-100)。MuaLLM 在 RAG-250 上实现了 90.1% 的召回率,在 Reas-100 上达到了 86.8% 的准确率。
Subjects: Machine Learning, Artificial Intelligence, Systems and Control 主题:机器学习、人工智能、系统与控制
Publish: 2025-08-11 16:11:09 UTC 发布:2025-08-11 16:11:09 UTC
#84 Optimal Transport Regularization for Speech Text Alignment in Spoken Language Models #84 最优传输正则化用于口语语言模型中的语音文本对齐
Authors: [Wenze Xu](https://arxiv.org/search/?searchtype=author&query=Wenze Xu), [Chun Wang](https://arxiv.org/search/?searchtype=author&query=Chun Wang), [Jiazhen Yu](https://arxiv.org/search/?searchtype=author&query=Jiazhen Yu), [Sheng Chen](https://arxiv.org/search/?searchtype=author&query=Sheng Chen), [Liang Gao](https://arxiv.org/search/?searchtype=author&query=Liang Gao), [Weihong Deng](https://arxiv.org/search/?searchtype=author&query=Weihong Deng) 作者:Wenze Xu、Chun Wang、Jiazhen Yu、Sheng Chen、Liang Gao、Weihong Deng
Spoken Language Models (SLMs), which extend Large Language Models (LLMs) to perceive speech inputs, have gained increasing attention for their potential to advance speech understanding tasks. However, despite recent progress, studies show that SLMs often struggle to generalize across datasets, even for trained languages and tasks, raising concerns about whether they process speech in a text-like manner as intended. A key challenge underlying this limitation is the modality gap between speech and text representations. The high variability in speech embeddings may allow SLMs to achieve strong in-domain performance by exploiting unintended speech variations, ultimately hindering generalization. To mitigate this modality gap, we introduce Optimal Transport Regularization (OTReg), a method that formulates speech-text alignment as an optimal transport problem and derives a regularization loss to improve SLM training. In each training iteration, OTReg first establishes a structured correspondence between speech and transcript embeddings by determining the optimal transport plan, then incorporates the regularization loss based on this transport plan to optimize SLMs in generating speech embeddings that align more effectively with transcript embeddings. OTReg is lightweight, requiring no additional labels or learnable parameters, and integrates seamlessly into existing SLM training procedures. Extensive multilingual ASR experiments demonstrate that OTReg enhances speech-text alignment, mitigates the modality gap, and consequently improves SLM generalization across diverse datasets. 
口语语言模型(Spoken Language Models,SLMs)将大型语言模型(LLMs)扩展为能感知语音输入,因其有望推动语音理解任务而受到越来越多关注。然而,尽管近期取得了一些进展,研究表明 SLMs 常常难以在不同数据集间泛化,即便是对已训练的语言和任务也存在这一问题,这引发了对它们是否如预期那样以类文本的方式处理语音的担忧。导致这一限制的一个关键挑战是语音与文本表示之间的模态差距。语音嵌入的高可变性可能使 SLMs 通过利用非预期的语音变化来在域内取得较强的性能,但这最终会阻碍泛化。为缓解该模态差距,我们提出了最优传输正则化(Optimal Transport Regularization,OTReg),该方法将语音-文本对齐表述为一个最优传输问题,并由此导出一种正则化损失以改进 SLM 的训练。 在每次训练迭代中,OTReg 首先通过确定最优传输计划在语音和文本嵌入之间建立结构化对应关系,然后基于该传输计划加入正则化损失,以优化 SLM,使其生成的语音嵌入能更有效地与文本嵌入对齐。OTReg 轻量、不需要额外标签或可学习参数,并能无缝整合到现有的 SLM 训练流程中。大量多语言 ASR 实验表明,OTReg 能增强语音-文本对齐、缓解模态间差距,从而提升 SLM 在不同数据集上的泛化能力。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-11 16:06:04 UTC 发表时间:2025-08-11 16:06:04 UTC
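The two steps described for OTReg in #84 (compute a transport plan between speech and transcript embeddings, then derive a loss from it) can be illustrated with a small sketch. Several choices here are assumptions rather than details from the abstract: squared Euclidean cost, uniform marginals, and an entropic Sinkhorn solver as the way to obtain the plan. The function names are hypothetical, and a real training loop would run this inside an autograd framework rather than plain Python.

```python
import math


def cost_matrix(speech, text):
    # Squared Euclidean distance between each speech frame and each text token
    # embedding (assumed cost; the abstract does not name one).
    return [[sum((a - b) ** 2 for a, b in zip(s, t)) for t in text] for s in speech]


def sinkhorn(cost, eps=0.1, iters=200):
    # Entropic-regularised OT with uniform marginals (swapped-in solver).
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n
    b = [1.0 / m] * m
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


def ot_regularizer(speech, text, eps=0.1):
    # Transport cost under the plan; adding this to the task loss would pull
    # speech embeddings toward the transcript embeddings they are matched to.
    cost = cost_matrix(speech, text)
    plan = sinkhorn(cost, eps)
    loss = sum(plan[i][j] * cost[i][j]
               for i in range(len(cost)) for j in range(len(cost[0])))
    return loss, plan


# Toy example: 4 speech frames, 2 transcript tokens, embedding size 2.
speech = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
text = [[0.0, 0.1], [1.0, 0.9]]
loss, plan = ot_regularizer(speech, text)
```

In this toy case the plan sends the first two frames almost entirely to the first token and the last two to the second, so the regulariser stays small. It adds no learnable parameters, which matches the abstract's description of OTReg as lightweight.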
#85 MemoryKT: An Integrative Memory-and-Forgetting Method for Knowledge Tracing #85 MemoryKT:一种用于知识追踪的整合记忆与遗忘方法
Authors: [Mingrong Lin](https://arxiv.org/search/?searchtype=author&query=Mingrong Lin), [Ke Deng](https://arxiv.org/search/?searchtype=author&query=Ke Deng), [Zhengyang Wu](https://arxiv.org/search/?searchtype=author&query=Zhengyang Wu), [Zetao Zheng](https://arxiv.org/search/?searchtype=author&query=Zetao Zheng), [Jie Li](https://arxiv.org/search/?searchtype=author&query=Jie Li) 作者:Mingrong Lin、Ke Deng、Zhengyang Wu、Zetao Zheng、Jie Li
Knowledge Tracing (KT) is committed to capturing students’ knowledge mastery from their historical interactions. Simulating students’ memory states is a promising approach to enhance both the performance and interpretability of knowledge tracing models. Memory consists of three fundamental processes: encoding, storage, and retrieval. Although forgetting primarily manifests during the storage stage, most existing studies rely on a single, undifferentiated forgetting mechanism, overlooking other memory processes as well as personalized forgetting patterns. To address this, this paper proposes memoryKT, a knowledge tracing model based on a novel temporal variational autoencoder. The model simulates memory dynamics through a three-stage process: (i) Learning the distribution of students’ knowledge memory features, (ii) Reconstructing their exercise feedback, while (iii) Embedding a personalized forgetting module within the temporal workflow to dynamically modulate memory storage strength. This jointly models the complete encoding-storage-retrieval cycle, significantly enhancing the model’s perception capability for individual differences. Extensive experiments on four public datasets demonstrate that our proposed approach significantly outperforms state-of-the-art baselines. 知识追踪(KT)致力于从学生的历史交互中捕捉其知识掌握情况。模拟学生的记忆状态是提升知识追踪模型性能和可解释性的一个有前景的方法。记忆由三大基本过程构成:编码、储存与提取。尽管遗忘主要在储存阶段表现出来,但大多数现有研究依赖单一且未区分的遗忘机制,忽视了其他记忆过程以及个性化的遗忘模式。为了解决这一问题,本文提出了 memoryKT,一种基于新型时序变分自编码器的知识追踪模型。该模型通过三阶段流程模拟记忆动态:(i) 学习学生知识记忆特征的分布,(ii) 重构他们的练习反馈,同时 (iii) 在时序工作流中嵌入个性化遗忘模块以动态调节记忆储存强度。该方法将完整的编码—储存—提取循环联合建模,显著增强了模型对于个体差异的感知能力。 在四个公共数据集上的大量实验表明,我们提出的方法显著优于最先进的基线方法。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-11 15:59:59 UTC 发布:2025-08-11 15:59:59 UTC
#86 Vision-Based Localization and LLM-based Navigation for Indoor Environments #86 基于视觉的定位与基于 LLM 的室内环境导航
Authors: [Keyan Rahimi](https://arxiv.org/search/?searchtype=author&query=Keyan Rahimi), [Md. Wasiul Haque](https://arxiv.org/search/?searchtype=author&query=Md. Wasiul Haque), [Sagar Dasgupta](https://arxiv.org/search/?searchtype=author&query=Sagar Dasgupta), [Mizanur Rahman](https://arxiv.org/search/?searchtype=author&query=Mizanur Rahman) 作者:Keyan Rahimi、Md. Wasiul Haque、Sagar Dasgupta、Mizanur Rahman
Indoor navigation remains a complex challenge due to the absence of reliable GPS signals and the architectural intricacies of large enclosed environments. This study presents an indoor localization and navigation approach that integrates vision-based localization with large language model (LLM)-based navigation. The localization system utilizes a ResNet-50 convolutional neural network fine-tuned through a two-stage process to identify the user’s position using smartphone camera input. To complement localization, the navigation module employs an LLM, guided by a carefully crafted system prompt, to interpret preprocessed floor plan images and generate step-by-step directions. Experimental evaluation was conducted in a realistic office corridor with repetitive features and limited visibility to test localization robustness. The model achieved high confidence and an accuracy of 96% across all tested waypoints, even under constrained viewing conditions and short-duration queries. Navigation tests using ChatGPT on real building floor maps yielded an average instruction accuracy of 75%, with observed limitations in zero-shot reasoning and inference time. This research demonstrates the potential for scalable, infrastructure-free indoor navigation using off-the-shelf cameras and publicly available floor plans, particularly in resource-constrained settings like hospitals, airports, and educational institutions. 室内导航由于缺乏可靠的 GPS 信号以及大型封闭环境的建筑复杂性,仍然是一个复杂的挑战。本研究提出了一种将基于视觉的定位与基于大语言模型(LLM)的导航相结合的室内定位与导航方法。定位系统利用经过两阶段微调的 ResNet-50 卷积神经网络,通过智能手机摄像头输入识别用户位置。为配合定位,导航模块采用 LLM,在精心设计的系统提示引导下,解释预处理后的楼层平面图像并生成逐步的导航指令。实验评估在具有重复特征和能见度受限的真实办公走廊中进行,以测试定位的鲁棒性。即使在视野受限和短时查询的条件下,该模型在所有测试路径点上仍达到高置信度和 96% 的准确率。使用 ChatGPT 在真实建筑楼层地图上进行的导航测试显示平均指令准确率为 75%,并观察到在零样本推理和推理时间方面的局限性。 这项研究展示了利用现成摄像头和公开楼层平面图在无需基础设施的情况下实现可扩展室内导航的潜力,尤其适用于医院、机场和教育机构等资源受限的环境。
Subjects: Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition 学科:机器学习、人工智能、计算机视觉与模式识别
Publish: 2025-08-11 15:59:09 UTC 发布:2025-08-11 15:59:09 UTC
#87 GRASPTrack: Geometry-Reasoned Association via Segmentation and Projection for Multi-Object Tracking #87 GRASPTrack:通过分割与投影进行几何推理关联的多目标跟踪
Authors: [Xudong Han](https://arxiv.org/search/?searchtype=author&query=Xudong Han), [Pengcheng Fang](https://arxiv.org/search/?searchtype=author&query=Pengcheng Fang), [Yueying Tian](https://arxiv.org/search/?searchtype=author&query=Yueying Tian), [Jianhui Yu](https://arxiv.org/search/?searchtype=author&query=Jianhui Yu), [Xiaohao Cai](https://arxiv.org/search/?searchtype=author&query=Xiaohao Cai), [Daniel Roggen](https://arxiv.org/search/?searchtype=author&query=Daniel Roggen), [Philip Birch](https://arxiv.org/search/?searchtype=author&query=Philip Birch) 作者:韩旭东、方鹏程、田悦影、俞建辉、蔡孝豪、丹尼尔·罗根、菲利普·伯奇
Multi-object tracking (MOT) in monocular videos is fundamentally challenged by occlusions and depth ambiguity, issues that conventional tracking-by-detection (TBD) methods struggle to resolve owing to a lack of geometric awareness. To address these limitations, we introduce GRASPTrack, a novel depth-aware MOT framework that integrates monocular depth estimation and instance segmentation into a standard TBD pipeline to generate high-fidelity 3D point clouds from 2D detections, thereby enabling explicit 3D geometric reasoning. These 3D point clouds are then voxelized to enable a precise and robust Voxel-Based 3D Intersection-over-Union (IoU) for spatial association. To further enhance tracking robustness, our approach incorporates Depth-aware Adaptive Noise Compensation, which dynamically adjusts the Kalman filter process noise based on occlusion severity for more reliable state estimation. Additionally, we propose a Depth-enhanced Observation-Centric Momentum, which extends the motion direction consistency from the image plane into 3D space to improve motion-based association cues, particularly for objects with complex trajectories. Extensive experiments on the MOT17, MOT20, and DanceTrack benchmarks demonstrate that our method achieves competitive performance, significantly improving tracking robustness in complex scenes with frequent occlusions and intricate motion patterns. 单目视频中的多目标跟踪(MOT)在本质上受到遮挡和深度模糊的挑战,常规的基于检测跟踪(TBD)方法由于缺乏几何感知难以解决这些问题。为了解决这些局限性,我们提出了 GRASPTrack,一种新颖的深度感知 MOT 框架,将单目深度估计和实例分割集成到标准 TBD 流水线中,从二维检测生成高保真三维点云,从而实现显式的三维几何推理。随后将这些三维点云体素化,以实现用于空间关联的精确且鲁棒的基于体素的三维交并比(IoU)。为了进一步增强跟踪的鲁棒性,我们的方法引入了深度感知自适应噪声补偿,根据遮挡严重程度动态调整卡尔曼滤波的过程噪声,以实现更可靠的状态估计。此外,我们提出了深度增强的以观测为中心的动量(Depth-enhanced Observation-Centric Momentum),将运动方向一致性从图像平面扩展到三维空间,以改善基于运动的关联线索,尤其针对轨迹复杂的目标。 在 MOT17、MOT20 和 DanceTrack 基准数据集上的大量实验表明,我们的方法实现了具有竞争力的性能,在存在频繁遮挡和复杂运动模式的复杂场景中显著提高了跟踪鲁棒性。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-11 15:56:21 UTC 发布:2025-08-11 15:56:21 UTC
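The voxel-based 3D IoU that #87 uses for spatial association can be illustrated with a minimal sketch. It assumes each object's 3D point cloud has already been produced (in the paper, from monocular depth plus instance masks); the voxel size and the function names `voxelize` and `voxel_iou` are hypothetical choices for illustration, not taken from the paper.

```python
import math


def voxelize(points, voxel_size):
    # Map each 3D point to the integer index of the voxel that contains it.
    return {tuple(math.floor(coord / voxel_size) for coord in p) for p in points}


def voxel_iou(points_a, points_b, voxel_size=0.5):
    # IoU over occupied voxels: |A ∩ B| / |A ∪ B|.
    va = voxelize(points_a, voxel_size)
    vb = voxelize(points_b, voxel_size)
    union = va | vb
    if not union:
        return 0.0
    return len(va & vb) / len(union)


# Two toy objects laid out along the x axis, partially overlapping.
track_points = [(0.1, 0.1, 0.1), (0.6, 0.1, 0.1), (1.1, 0.1, 0.1)]
detection_points = [(0.7, 0.2, 0.2), (1.2, 0.2, 0.2), (1.7, 0.2, 0.2)]
iou = voxel_iou(track_points, detection_points)  # 2 shared voxels out of 4 -> 0.5
```

A tracker would fill a track-by-detection matrix with these scores and pass it to an assignment step. Unlike 2D box IoU, two objects that overlap in the image but sit at different depths occupy different voxels and score close to zero.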
#88 Hyperspectral Imaging
Authors: [Danfeng Hong](https://arxiv.org/search/?searchtype=author&query=Danfeng Hong), [Chenyu Li](https://arxiv.org/search/?searchtype=author&query=Chenyu Li), [Naoto Yokoya](https://arxiv.org/search/?searchtype=author&query=Naoto Yokoya), [Bing Zhang](https://arxiv.org/search/?searchtype=author&query=Bing Zhang), [Xiuping Jia](https://arxiv.org/search/?searchtype=author&query=Xiuping Jia), [Antonio Plaza](https://arxiv.org/search/?searchtype=author&query=Antonio Plaza), [Paolo Gamba](https://arxiv.org/search/?searchtype=author&query=Paolo Gamba), [Jon Atli Benediktsson](https://arxiv.org/search/?searchtype=author&query=Jon Atli Benediktsson), [Jocelyn Chanussot](https://arxiv.org/search/?searchtype=author&query=Jocelyn Chanussot) 作者:Danfeng Hong, Chenyu Li, Naoto Yokoya, Bing Zhang, Xiuping Jia, Antonio Plaza, Paolo Gamba, Jon Atli Benediktsson, Jocelyn Chanussot
Hyperspectral imaging (HSI) is an advanced sensing modality that simultaneously captures spatial and spectral information, enabling non-invasive, label-free analysis of material, chemical, and biological properties. This Primer presents a comprehensive overview of HSI, from the underlying physical principles and sensor architectures to key steps in data acquisition, calibration, and correction. We summarize common data structures and highlight classical and modern analysis methods, including dimensionality reduction, classification, spectral unmixing, and AI-driven techniques such as deep learning. Representative applications across Earth observation, precision agriculture, biomedicine, industrial inspection, cultural heritage, and security are also discussed, emphasizing HSI’s ability to uncover sub-visual features for advanced monitoring, diagnostics, and decision-making. Persistent challenges, such as hardware trade-offs, acquisition variability, and the complexity of high-dimensional data, are examined alongside emerging solutions, including computational imaging, physics-informed modeling, cross-modal fusion, and self-supervised learning. Best practices for dataset sharing, reproducibility, and metadata documentation are further highlighted to support transparency and reuse. Looking ahead, we explore future directions toward scalable, real-time, and embedded HSI systems, driven by sensor miniaturization, self-supervised learning, and foundation models. As HSI evolves into a general-purpose, cross-disciplinary platform, it holds promise for transformative applications in science, technology, and society. 
高光谱成像(HSI)是一种先进的感测方式,能够同时捕捉空间和光谱信息,从而实现对材料、化学和生物特性进行无创、无标记的分析。本导论全面概述了高光谱成像的内容,从基础物理原理和传感器架构,到数据采集、校准与校正的关键步骤。我们总结了常见的数据结构,并重点介绍了经典与现代的分析方法,包括降维、分类、光谱解混以及诸如深度学习等以人工智能为驱动的技术。文中还讨论了地球观测、精准农业、生物医学、工业检测、文化遗产和安全等代表性应用,强调了高光谱成像在揭示肉眼不可见特征以实现高级监测、诊断和决策中的能力。文章还审视了持续存在的挑战,如硬件权衡、采集变异性以及高维数据的复杂性,并介绍了包括计算成像、物理知情建模、跨模态融合和自监督学习在内的新兴解决方案。 进一步强调了数据集共享、可复现性和元数据文档的最佳实践,以支持透明性和重用。展望未来,我们探讨了朝着可扩展、实时和嵌入式高光谱成像系统的发展方向,这些方向由传感器小型化、自监督学习和基础模型驱动。随着高光谱成像演变为一种通用的跨学科平台,它有望在科学、技术和社会领域带来变革性的应用。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-11 15:47:24 UTC 发表时间:2025-08-11 15:47:24 协调世界时
#89 ChatGPT on the Road: Leveraging Large Language Model-Powered In-vehicle Conversational Agents for Safer and More Enjoyable Driving Experience #89 ChatGPT on the Road:利用大型语言模型驱动的车载对话代理实现更安全、更愉悦的驾驶体验
Authors: [Yeana Lee Bond](https://arxiv.org/search/?searchtype=author&query=Yeana Lee Bond), [Mungyeong Choe](https://arxiv.org/search/?searchtype=author&query=Mungyeong Choe), [Baker Kasim Hasan](https://arxiv.org/search/?searchtype=author&query=Baker Kasim Hasan), [Arsh Siddiqui](https://arxiv.org/search/?searchtype=author&query=Arsh Siddiqui), [Myounghoon Jeon](https://arxiv.org/search/?searchtype=author&query=Myounghoon Jeon) 作者:Yeana Lee Bond、Mungyeong Choe、Baker Kasim Hasan、Arsh Siddiqui、Myounghoon Jeon
Studies on in-vehicle conversational agents have traditionally relied on pre-scripted prompts or limited voice commands, constraining natural driver-agent interaction. To resolve this issue, the present study explored the potential of a ChatGPT-based in-vehicle agent capable of carrying continuous, multi-turn dialogues. Forty drivers participated in our experiment using a motion-based driving simulator, comparing three conditions (No agent, Pre-scripted agent, and ChatGPT-based agent) as a within-subjects variable. Results showed that the ChatGPT-based agent condition led to more stable driving performance across multiple metrics. Participants demonstrated lower variability in longitudinal acceleration, lateral acceleration, and lane deviation compared to the other two conditions. In subjective evaluations, the ChatGPT-based agent also received significantly higher ratings in competence, animacy, affective trust, and preference compared to the Pre-scripted agent. Our thematic analysis of driver-agent conversations revealed diverse interaction patterns in topics, including driving assistance/questions, entertainment requests, and anthropomorphic interactions. Our results highlight the potential of LLM-powered in-vehicle conversational agents to enhance driving safety and user experience through natural, context-rich interactions. 关于车载对话代理的研究传统上依赖预先编写的提示或有限的语音指令,限制了驾驶员与代理之间的自然互动。为了解决这一问题,本研究探索了一种基于 ChatGPT 的车载代理,该代理能够进行连续的多轮对话。四十名驾驶员参与了我们使用基于运动的驾驶模拟器进行的实验,将三种条件(无代理、预设脚本代理和基于 ChatGPT 的代理)作为被试内变量进行比较。结果表明,基于 ChatGPT 的代理条件在多项指标上带来了更稳定的驾驶表现。与另外两种条件相比,参与者在纵向加速度、横向加速度和车道偏离方面表现出更低的变异性。在主观评价中,基于 ChatGPT 的代理在能力、拟人性、情感信任和偏好等方面也显著高于预设脚本代理。我们对驾驶员与代理对话的主题分析揭示了多样的互动模式,话题包括驾驶辅助/提问、娱乐请求以及拟人化互动。 我们的研究结果突显了由 LLM 驱动的车载对话代理通过自然、富含情境的交互提升驾驶安全性和用户体验的潜力。
Subjects: Human-Computer Interaction, Artificial Intelligence, Software Engineering 主题:人机交互、人工智能、软件工程
Publish: 2025-08-11 15:40:44 UTC 发布:2025-08-11 15:40:44 UTC
#90 Grid2Guide: A* Enabled Small Language Model for Indoor Navigation #90 Grid2Guide:一种用于室内导航的 A* 支持小型语言模型
Authors: [Md. Wasiul Haque](https://arxiv.org/search/?searchtype=author&query=Md. Wasiul Haque), [Sagar Dasgupta](https://arxiv.org/search/?searchtype=author&query=Sagar Dasgupta), [Mizanur Rahman](https://arxiv.org/search/?searchtype=author&query=Mizanur Rahman) 作者:Md. Wasiul Haque、Sagar Dasgupta、Mizanur Rahman
Reliable indoor navigation remains a significant challenge in complex environments, particularly where external positioning signals and dedicated infrastructures are unavailable. This research presents Grid2Guide, a hybrid navigation framework that combines the A* search algorithm with a Small Language Model (SLM) to generate clear, human-readable route instructions. The framework first constructs a binary occupancy matrix from a given indoor map. Using this matrix, the A* algorithm computes the optimal path between origin and destination, producing concise textual navigation steps. These steps are then transformed into natural language instructions by the SLM, enhancing interpretability for end users. Experimental evaluations across various indoor scenarios demonstrate the method’s effectiveness in producing accurate and timely navigation guidance. The results validate the proposed approach as a lightweight, infrastructure-free solution for real-time indoor navigation support. 在复杂环境中,可靠的室内导航仍然是一个重大挑战,尤其是在外部定位信号和专用基础设施不可用的情况下。本研究提出了 Grid2Guide,一种将 A* 搜索算法与小型语言模型(SLM)结合的混合导航框架,用于生成清晰、人类可读的路径指令。该框架首先从给定的室内地图构建二值占用矩阵。利用该矩阵,A* 算法计算起点与终点之间的最优路径,生成简明的文本导航步骤。然后,SLM 将这些步骤转换为自然语言指令,增强了终端用户的可理解性。在各种室内场景下的实验评估表明,该方法在生成准确且及时的导航指引方面是有效的。结果验证了所提方法作为一种轻量级、无需基础设施的实时室内导航支持方案的可行性。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-11 15:39:27 UTC 发布时间:2025-08-11 15:39:27 UTC
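The first two stages of Grid2Guide in #90 (binary occupancy matrix, then A* producing terse textual steps) can be sketched as follows. This is an illustration under assumptions, not the paper's implementation: moves are 4-connected with unit cost, the heuristic is Manhattan distance, `1` marks a blocked cell, and the compass-style step format is a hypothetical choice. The SLM stage that rewrites the steps into natural language is omitted.

```python
import heapq


def astar(grid, start, goal):
    """Shortest 4-connected path on a binary occupancy grid (1 = blocked)."""
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        # Manhattan distance is admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start)]
    came_from = {start: None}
    best_g = {start: 0}
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        if g > best_g[cell]:
            continue  # stale heap entry; a cheaper route was already found
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                continue
            if grid[nxt[0]][nxt[1]] == 1:
                continue
            ng = g + 1
            if ng < best_g.get(nxt, float("inf")):
                best_g[nxt] = ng
                came_from[nxt] = cell
                heapq.heappush(open_heap, (ng + h(nxt), ng, nxt))
    return None  # goal unreachable


def path_to_steps(path):
    """Compress a cell path into terse 'direction count' steps."""
    names = {(1, 0): "south", (-1, 0): "north", (0, 1): "east", (0, -1): "west"}
    steps = []
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        direction = names[(r1 - r0, c1 - c0)]
        if steps and steps[-1][0] == direction:
            steps[-1][1] += 1
        else:
            steps.append([direction, 1])
    return [f"{d} {n}" for d, n in steps]


# Toy floor plan: a wall across the middle row with one doorway at column 2.
grid = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
path = astar(grid, (0, 0), (2, 0))
steps = path_to_steps(path)  # ['east 2', 'south 2', 'west 2']
```

Keeping path search in A* and leaving only the rewording to the language model means the route itself is optimal by construction on the grid, and the model cannot invent a path through a wall.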
#91 Dual Information Speech Language Models for Emotional Conversations #91 双信息语音语言模型用于情感对话
Authors: [Chun Wang](https://arxiv.org/search/?searchtype=author&query=Chun Wang), [Chenyang Liu](https://arxiv.org/search/?searchtype=author&query=Chenyang Liu), [Wenze Xu](https://arxiv.org/search/?searchtype=author&query=Wenze Xu), [Weihong Deng](https://arxiv.org/search/?searchtype=author&query=Weihong Deng) 作者:Chun Wang、Chenyang Liu、Wenze Xu、Weihong Deng
Conversational systems relying on text-based large language models (LLMs) often overlook paralinguistic cues, essential for understanding emotions and intentions. Speech-language models (SLMs), which use speech as input, are emerging as a promising solution. However, SLMs built by extending frozen LLMs struggle to capture paralinguistic information and exhibit reduced context understanding. We identify entangled information and improper training strategies as key issues. To address these issues, we propose two heterogeneous adapters and suggest a weakly supervised training strategy. Our approach disentangles paralinguistic and linguistic information, enabling SLMs to interpret speech through structured representations. It also preserves contextual understanding by avoiding the generation of task-specific vectors through controlled randomness. This approach trains only the adapters on common datasets, ensuring parameter and data efficiency. Experiments demonstrate competitive performance in emotional conversation tasks, showcasing the model’s ability to effectively integrate both paralinguistic and linguistic information within contextual settings. 依赖基于文本的大型语言模型(LLMs)的对话系统常常忽视副语言线索,而这些线索对理解情感和意图至关重要。以语音作为输入的语音-语言模型(SLMs)正逐渐成为一种有前景的解决方案。然而,通过扩展冻结的 LLMs 构建的 SLMs 在捕捉副语言信息方面存在困难,并且表现出上下文理解能力下降。我们识别出信息纠缠和不当训练策略是关键问题。为了解决这些问题,我们提出了两种异构适配器并建议一种弱监督训练策略。我们的方法将副语言信息与语言信息解缠,使 SLMs 能够通过结构化表示来解释语音。它还通过受控随机性避免生成特定任务向量,从而保留了上下文理解能力。这种方法仅在通用数据集上训练适配器,确保了参数和数据的高效性。实验表明,在情感对话任务中取得了有竞争力的表现,展示了模型在上下文环境中有效整合副语言和语言信息的能力。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-11 15:33:44 UTC 发布:2025-08-11 15:33:44 UTC
#92 Growing Reservoirs with Developmental Graph Cellular Automata #92 使用发育图细胞自动机生长储备网络
Authors: [Matias Barandiaran](https://arxiv.org/search/?searchtype=author&query=Matias Barandiaran), [James Stovold](https://arxiv.org/search/?searchtype=author&query=James Stovold) 作者:Matias Barandiaran,James Stovold
Developmental Graph Cellular Automata (DGCA) are a novel model for morphogenesis, capable of growing directed graphs from single-node seeds. In this paper, we show that DGCAs can be trained to grow reservoirs. Reservoirs are grown with two types of targets: task-driven (using the NARMA family of tasks) and task-independent (using reservoir metrics). Results show that DGCAs are able to grow into a variety of specialized, life-like structures capable of effectively solving benchmark tasks, statistically outperforming ‘typical’ reservoirs on the same task. Overall, these results lay the foundation for the development of DGCA systems that produce plastic reservoirs and for modeling functional, adaptive morphogenesis. 发育图细胞自动机(DGCA)是一种用于形态发生的新型模型,能够从单节点种子生长出有向图。在本文中,我们展示了 DGCA 可以被训练来生长储备网络。储备网络以两种目标进行生长:任务驱动型(使用 NARMA 系列任务)和与任务无关型(使用储备度量)。结果表明,DGCA 能够生长出多种专门化的、类生命结构,这些结构能够有效地解决基准任务,在统计学上优于在相同任务上的“典型”储备网络。总体而言,这些结果为开发能够产生可塑性储备网络的 DGCA 系统以及对功能性、适应性形态发生建模奠定了基础。
Subjects: Neural and Evolutionary Computing, Artificial Intelligence 主题:神经与进化计算,人工智能
Publish: 2025-08-11 15:32:01 UTC 发布日期:2025-08-11 15:32:01 协调世界时 (UTC)
#93 HierSearch: A Hierarchical Enterprise Deep Search Framework Integrating Local and Web Searches #93 HierSearch:一个整合本地与网络搜索的分层企业深度搜索框架
Authors: [Jiejun Tan](https://arxiv.org/search/?searchtype=author&query=Jiejun Tan), [Zhicheng Dou](https://arxiv.org/search/?searchtype=author&query=Zhicheng Dou), [Yan Yu](https://arxiv.org/search/?searchtype=author&query=Yan Yu), [Jiehan Cheng](https://arxiv.org/search/?searchtype=author&query=Jiehan Cheng), [Qiang Ju](https://arxiv.org/search/?searchtype=author&query=Qiang Ju), [Jian Xie](https://arxiv.org/search/?searchtype=author&query=Jian Xie), [Ji-Rong Wen](https://arxiv.org/search/?searchtype=author&query=Ji-Rong Wen) 作者:谭杰军、窦志成、余艳、程洁涵、居强、谢健、温继荣
Recently, large reasoning models have demonstrated strong mathematical and coding abilities, and deep search leverages their reasoning capabilities in challenging information retrieval tasks. Existing deep search works are generally limited to a single knowledge source, either local or the Web. However, enterprises often require private deep search systems that can leverage search tools over both local and the Web corpus. Simply training an agent equipped with multiple search tools using flat reinforcement learning (RL) is a straightforward idea, but it has problems such as low training data efficiency and poor mastery of complex tools. To address the above issue, we propose a hierarchical agentic deep search framework, HierSearch, trained with hierarchical RL. At the low level, a local deep search agent and a Web deep search agent are trained to retrieve evidence from their corresponding domains. At the high level, a planner agent coordinates low-level agents and provides the final answer. Moreover, to prevent direct answer copying and error propagation, we design a knowledge refiner that filters out hallucinations and irrelevant evidence returned by low-level agents. Experiments show that HierSearch achieves better performance compared to flat RL, and outperforms various deep search and multi-source retrieval-augmented generation baselines in six benchmarks across general, finance, and medical domains. 最近,大型推理模型在数学和编码能力方面表现出强大实力,且深度检索利用它们的推理能力来应对具有挑战性的信息检索任务。现有的深度检索工作通常仅限于单一知识源,要么是本地,要么是网络。然而,企业往往需要能够同时利用本地语料库和网络语料库的私有深度检索系统。一个直观的想法是用平面强化学习训练一个配备多个检索工具的代理,但这种做法存在训练数据效率低下和对复杂工具掌握不佳等问题。为了解决上述问题,我们提出了一种用分层强化学习训练的分层代理式深度检索框架 HierSearch。在低层,训练了一个本地深度检索代理和一个网络深度检索代理,分别从各自领域检索证据。在高层,一个规划者代理协调低层代理并给出最终答案。此外,为了防止直接复制答案和错误传播,我们设计了一个知识修正器,用于过滤低层代理返回的幻觉信息和无关证据。 实验表明,与扁平强化学习相比,HierSearch 在性能上更佳,并且在通用、金融和医疗领域的六项基准测试中,优于各种深度检索和多源检索增强生成基线。
Subjects: Information Retrieval, Artificial Intelligence, Computation and Language 主题:信息检索、人工智能、计算与语言
Publish: 2025-08-11 15:31:47 UTC 发布时间:2025-08-11 15:31:47 UTC
#94 C-MAG: Cascade Multimodal Attributed Graphs for Supply Chain Link Prediction #94 C-MAG:用于供应链链接预测的级联多模态属性图
Authors: [Yunqing Li](https://arxiv.org/search/?searchtype=author&query=Yunqing Li), [Zixiang Tang](https://arxiv.org/search/?searchtype=author&query=Zixiang Tang), [Jiaying Zhuang](https://arxiv.org/search/?searchtype=author&query=Jiaying Zhuang), [Zhenyu Yang](https://arxiv.org/search/?searchtype=author&query=Zhenyu Yang), [Farhad Ameri](https://arxiv.org/search/?searchtype=author&query=Farhad Ameri), [Jianbang Zhang](https://arxiv.org/search/?searchtype=author&query=Jianbang Zhang) 作者:Yunqing Li、Zixiang Tang、Jiaying Zhuang、Zhenyu Yang、Farhad Ameri、Jianbang Zhang
Connecting an ever-expanding catalogue of products with suitable manufacturers and suppliers is critical for resilient, efficient global supply chains, yet traditional methods struggle to capture complex capabilities, certifications, geographic constraints, and rich multimodal data of real-world manufacturer profiles. To address these gaps, we introduce PMGraph, a public benchmark of bipartite and heterogeneous multimodal supply-chain graphs linking 8,888 manufacturers, over 70k products, more than 110k manufacturer-product edges, and over 29k product images. Building on this benchmark, we propose the Cascade Multimodal Attributed Graph C-MAG, a two-stage architecture that first aligns and aggregates textual and visual attributes into intermediate group embeddings, then propagates them through a manufacturer-product hetero-graph via multiscale message passing to enhance link prediction accuracy. C-MAG also provides practical guidelines for modality-aware fusion, preserving predictive performance in noisy, real-world settings. 将日益扩展的产品目录与合适的制造商和供应商对接,对构建具有韧性和高效的全球供应链至关重要,然而传统方法难以捕捉现实世界制造商档案中复杂的能力、认证、地理限制以及丰富的多模态数据。为了解决这些不足,我们引入了 PMGraph,这是一个公开的二分异构多模态供应链图基准,连接 8,888 家制造商、超过 70k 个产品、超过 110k 条制造商-产品边以及超过 29k 张产品图片。在此基准之上,我们提出了级联多模态属性图 C-MAG,这是一种两阶段架构:首先将文本和视觉属性对齐并聚合为中间的群组嵌入,然后通过制造商-产品异构图利用多尺度消息传递传播这些嵌入以提高链接预测的准确性。C-MAG 还为感知模态的融合提供了实用指南,在噪声较多的真实环境中仍能保持预测性能。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-11 15:14:03 UTC 发布:2025-08-11 15:14:03 UTC
#95 Investigating the Design Space of Visual Grounding in Multimodal Large Language Model #95 调查多模态大语言模型中视觉定位的设计空间
Authors: [Weitai Kang](https://arxiv.org/search/?searchtype=author&query=Weitai Kang), [Weiming Zhuang](https://arxiv.org/search/?searchtype=author&query=Weiming Zhuang), [Zhizhong Li](https://arxiv.org/search/?searchtype=author&query=Zhizhong Li), [Yan Yan](https://arxiv.org/search/?searchtype=author&query=Yan Yan), [Lingjuan Lyu](https://arxiv.org/search/?searchtype=author&query=Lingjuan Lyu) 作者:康伟泰、庄伟明、李志忠、阎艳、吕玲娟
Fine-grained multimodal capability in Multimodal Large Language Models (MLLMs) has emerged as a critical research direction, particularly for tackling the visual grounding (VG) problem. Despite the strong performance achieved by existing approaches, they often employ disparate design choices when fine-tuning MLLMs for VG, lacking systematic verification to support these designs. To bridge this gap, this paper presents a comprehensive study of various design choices that impact the VG performance of MLLMs. We conduct our analysis using LLaVA-1.5, which has been widely adopted in prior empirical studies of MLLMs. While more recent models exist, we follow this convention to ensure our findings remain broadly applicable and extendable to other architectures. We cover two key aspects: (1) exploring different visual grounding paradigms in MLLMs, identifying the most effective design, and providing our insights; and (2) conducting ablation studies on the design of grounding data to optimize MLLMs’ fine-tuning for the VG task. Finally, our findings contribute to a stronger MLLM for VG, achieving improvements of +5.6% / +6.9% / +7.0% on RefCOCO/+/g over the LLaVA-1.5. 细粒度多模态能力在多模态大语言模型(MLLMs)中已成为一个关键研究方向,尤其是在解决视觉定位(VG)问题时。尽管现有方法取得了较强的性能,但在对 MLLMs 进行 VG 微调时,它们通常采用各不相同的设计选择,缺乏系统性的验证来支持这些设计。为填补这一空白,本文对影响 MLLMs VG 性能的各类设计选择进行了全面研究。我们使用在先前 MLLM 实证研究中被广泛采用的 LLaVA-1.5 进行分析。尽管存在更新的模型,我们仍遵循这一惯例以确保我们的发现具有广泛适用性并可扩展到其他架构。我们涵盖了两个关键方面:(1)探索 MLLMs 中不同的视觉定位范式,识别最有效的设计并提供我们的见解;以及(2)对定位数据的设计进行消融研究,以优化 MLLMs 在 VG 任务上的微调。最后,我们的发现促成了更强大的 MLLM 视觉定位能力,在 RefCOCO/+/g 上相较于 LLaVA-1.5 分别提升了+5.6% / +6.9% / +7.0%。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning 主题:计算机视觉与模式识别、人工智能、计算与语言、机器学习
Publish: 2025-08-11 15:10:52 UTC 发表:2025-08-11 15:10:52 UTC
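For the visual grounding study above, the abstract reports gains on RefCOCO/+/g without naming the metric. Such benchmarks are commonly scored as the fraction of predicted boxes whose IoU with the ground-truth box exceeds 0.5; the sketch below assumes that convention and an `(x1, y1, x2, y2)` box format, neither of which is stated in the abstract.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes contribute no intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(preds, targets, threshold=0.5):
    """Fraction of predictions whose IoU with the target exceeds the threshold (assumed metric)."""
    if not targets:
        return 0.0
    hits = sum(1 for p, t in zip(preds, targets) if box_iou(p, t) > threshold)
    return hits / len(targets)
```

Calling `grounding_accuracy(predicted_boxes, reference_boxes)` on two equal-length lists gives a number in [0, 1]; a "+5.6%" gain in the abstract would correspond to a change in a score of this kind if the assumed metric is the one used.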
#96 On Understanding of the Dynamics of Model Capacity in Continual Learning #96 关于在持续学习中模型容量动态的理解
Authors: [Supriyo Chakraborty](https://arxiv.org/search/?searchtype=author&query=Supriyo Chakraborty), [Krishnan Raghavan](https://arxiv.org/search/?searchtype=author&query=Krishnan Raghavan) 作者:Supriyo Chakraborty,Krishnan Raghavan
The stability-plasticity dilemma, closely related to a neural network’s (NN) capacity (its ability to represent tasks), is a fundamental challenge in continual learning (CL). Within this context, we introduce CL’s effective model capacity (CLEMC), which characterizes the dynamic behavior of the stability-plasticity balance point. We develop a difference equation to model the evolution of the interplay between the NN, task data, and optimization procedure. We then leverage CLEMC to demonstrate that the effective capacity, and by extension the stability-plasticity balance point, is inherently non-stationary. We show that regardless of the NN architecture or optimization method, a NN’s ability to represent new tasks diminishes when incoming task distributions differ from previous ones. We conduct extensive experiments to support our theoretical findings, spanning a range of architectures, from small feedforward and convolutional networks to medium-sized graph neural networks and transformer-based large language models with millions of parameters. 稳定-可塑性困境与神经网络(NN)的容量(即其表示任务的能力)密切相关,是持续学习(CL)中的一个基本挑战。在此背景下,我们提出了 CL 的有效模型容量(CLEMC),用于刻画稳定-可塑性平衡点的动态行为。我们建立了一个差分方程来模拟神经网络、任务数据与优化过程之间相互作用的演化。然后我们利用 CLEMC 证明了有效容量以及由此而来的稳定-可塑性平衡点本质上是非平稳的。我们展示了无论神经网络的架构或优化方法如何,当新到来的任务分布与先前不同时,神经网络表示新任务的能力都会下降。我们进行了大量实验以支持我们的理论发现,涵盖各种架构,从小型前馈网络和卷积网络到中等规模的图神经网络以及具有数百万参数的基于 Transformer 的大型语言模型。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-11 14:52:56 UTC 发布:2025-08-11 14:52:56 UTC
#97 Rethinking Self-Replication: Detecting Distributed Selfhood in the Outlier Cellular Automaton #97 重新思考自我复制:在异常元胞自动机中检测分布式自我性
Authors: [Arend Hintze](https://arxiv.org/search/?searchtype=author&query=Arend Hintze), [Clifford Bohm](https://arxiv.org/search/?searchtype=author&query=Clifford Bohm) 作者:Arend Hintze、Clifford Bohm
Spontaneous self-replication in cellular automata has long been considered rare, with most known examples requiring careful design or artificial initialization. In this paper, we present formal, causal evidence that such replication can emerge unassisted – and that it can do so in a distributed, multi-component form. Building on prior work identifying complex dynamics in the Outlier rule, we introduce a data-driven framework that reconstructs the full causal ancestry of patterns in a deterministic cellular automaton. This allows us to rigorously identify self-replicating structures via explicit causal lineages. Our results show definitively that self-replicators in the Outlier CA are not only spontaneous and robust, but are also often composed of multiple disjoint clusters working in coordination, raising questions about some conventional notions of individuality and replication in artificial life systems. 细胞自动机中自发自我复制长期以来被认为是罕见的,已知的大多数例子都需要精心设计或人工初始化。在本文中,我们提供了形式化的因果证据,证明这种复制可以不受外力地出现——并且可以以分布式、多组件的形式出现。基于先前在 Outlier 规则中识别复杂动力学的工作,我们提出了一个数据驱动框架,用以重建确定性细胞自动机中模式的完整因果谱系。该框架使我们能够通过明确的因果谱系严格地识别自我复制结构。我们的结果明确表明,Outlier 细胞自动机中的自我复制体不仅是自发且稳健的,而且常常由多个相互分离的簇组成并协同工作,这对人工生命系统中关于个体性和复制的一些传统观念提出了质疑。
Subjects: Cellular Automata and Lattice Gases, Artificial Intelligence 主题:元胞自动机与格子气、人工智能
Publish: 2025-08-11 14:49:11 UTC 发布:2025-08-11 14:49:11 UTC
#98 Multi-modal Adaptive Mixture of Experts for Cold-start Recommendation #98 多模态自适应专家混合模型用于冷启动推荐
Authors: [Van-Khang Nguyen](https://arxiv.org/search/?searchtype=author&query=Van-Khang Nguyen), [Duc-Hoang Pham](https://arxiv.org/search/?searchtype=author&query=Duc-Hoang Pham), [Huy-Son Nguyen](https://arxiv.org/search/?searchtype=author&query=Huy-Son Nguyen), [Cam-Van Thi Nguyen](https://arxiv.org/search/?searchtype=author&query=Cam-Van Thi Nguyen), [Hoang-Quynh Le](https://arxiv.org/search/?searchtype=author&query=Hoang-Quynh Le), [Duc-Trong Le](https://arxiv.org/search/?searchtype=author&query=Duc-Trong Le) 作者:Van-Khang Nguyen、Duc-Hoang Pham、Huy-Son Nguyen、Cam-Van Thi Nguyen、Hoang-Quynh Le、Duc-Trong Le
Recommendation systems have faced significant challenges in cold-start scenarios, where new items with a limited history of interaction need to be effectively recommended to users. Though multimodal data (e.g., images, text, audio, etc.) offer rich information to address this issue, existing approaches often employ simplistic integration methods such as concatenation, average pooling, or fixed weighting schemes, which fail to capture the complex relationships between modalities. Our study proposes a novel Mixture of Experts (MoE) framework for multimodal cold-start recommendation, named MAMEX, which dynamically leverages latent representation from different modalities. MAMEX utilizes modality-specific expert networks and introduces a learnable gating mechanism that adaptively weights the contribution of each modality based on its content characteristics. This approach enables MAMEX to emphasize the most informative modalities for each item while maintaining robustness when certain modalities are less relevant or missing. Extensive experiments on benchmark datasets show that MAMEX outperforms state-of-the-art methods in cold-start scenarios, with superior accuracy and adaptability. For reproducibility, the code has been made available on Github https://github.com/L2R-UET/MAMEX. 在冷启动场景下,推荐系统面临重大挑战:需要将交互历史有限的新物品有效地推荐给用户。尽管多模态数据(如图像、文本、音频等)提供了丰富信息以应对这一问题,现有方法往往采用简单的整合手段,例如拼接、平均池化或固定加权方案,这些方法无法捕捉模态之间的复杂关系。本研究提出了一种用于多模态冷启动推荐的新型专家混合(Mixture of Experts,MoE)框架,命名为 MAMEX,它能够动态利用来自不同模态的潜在表示。MAMEX 使用模态专属的专家网络,并引入可学习的门控机制,基于内容特征自适应地为每个模态的贡献赋予权重。该方法使 MAMEX 能够强调对每个物品最具信息量的模态,同时在某些模态不太相关或缺失时保持鲁棒性。在基准数据集上的大量实验表明,MAMEX 在冷启动场景中优于最先进的方法,具有更高的准确性和适应性。 为便于复现,代码已在 Github 上开源 https://github.com/L2R-UET/MAMEX。
Subjects: Information Retrieval, Artificial Intelligence 主题:信息检索,人工智能
Publish: 2025-08-11 14:47:14 UTC 发布:2025-08-11 14:47:14 UTC
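The gating idea described for MAMEX (#98) can be sketched as a softmax over per-modality gate scores, restricted to the modalities that are actually present. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the gate logits are passed in directly, whereas in the paper they would come from a learnable gating network conditioned on item content.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_modalities(expert_outputs, gate_logits):
    """Weighted sum of modality-expert outputs.

    expert_outputs: dict modality -> list[float], or None when the modality is missing.
    gate_logits:    dict modality -> float (stand-in for a learned gate; an assumption).
    Missing modalities are skipped and the weights are renormalized over the rest.
    """
    present = [m for m, v in expert_outputs.items() if v is not None]
    if not present:
        raise ValueError("at least one modality is required")
    weights = softmax([gate_logits[m] for m in present])
    dim = len(expert_outputs[present[0]])
    fused = [0.0] * dim
    for w, m in zip(weights, present):
        for i, x in enumerate(expert_outputs[m]):
            fused[i] += w * x
    return fused, dict(zip(present, weights))
```

Renormalizing over the present modalities is one simple way to stay robust when a modality is missing; the abstract does not say how MAMEX handles that case, so treat this choice as an assumption.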
#99 BadPromptFL: A Novel Backdoor Threat to Prompt-based Federated Learning in Multimodal Models #99 BadPromptFL:一种针对多模态模型中基于提示的联邦学习的新型后门威胁
Authors: [Maozhen Zhang](https://arxiv.org/search/?searchtype=author&query=Maozhen Zhang), [Mengnan Zhao](https://arxiv.org/search/?searchtype=author&query=Mengnan Zhao), [Bo Wang](https://arxiv.org/search/?searchtype=author&query=Bo Wang) 作者:张茂振,赵梦楠,王博
Prompt-based tuning has emerged as a lightweight alternative to full fine-tuning in large vision-language models, enabling efficient adaptation via learned contextual prompts. This paradigm has recently been extended to federated learning settings (e.g., PromptFL), where clients collaboratively train prompts under data privacy constraints. However, the security implications of prompt-based aggregation in federated multimodal learning remain largely unexplored, leaving a critical attack surface unaddressed. In this paper, we introduce BadPromptFL, the first backdoor attack targeting prompt-based federated learning in multimodal contrastive models. In BadPromptFL, compromised clients jointly optimize local backdoor triggers and prompt embeddings, injecting poisoned prompts into the global aggregation process. These prompts are then propagated to benign clients, enabling universal backdoor activation at inference without modifying model parameters. Leveraging the contextual learning behavior of CLIP-style architectures, BadPromptFL achieves high attack success rates (e.g., >90%) with minimal visibility and limited client participation. Extensive experiments across multiple datasets and aggregation protocols validate the effectiveness, stealth, and generalizability of our attack, raising critical concerns about the robustness of prompt-based federated learning in real-world deployments. 基于提示的微调已成为大规模视觉-语言模型中比完整微调更轻量的替代方案,通过学习的上下文提示实现高效适配。最近,这一范式被扩展到联邦学习场景(例如 PromptFL),客户端在数据隐私约束下协同训练提示。然而,基于提示的聚合在联邦多模态学习中的安全性影响仍然大多未被探索,留下了一个关键的攻击面未被解决。本文提出了 BadPromptFL,这是首个针对多模态对比模型中基于提示的联邦学习的后门攻击。在 BadPromptFL 中,被攻陷的客户端共同优化本地后门触发器和提示嵌入,将中毒的提示注入全局聚合过程。这些提示随后被传播到良性客户端,使得在推理时无需修改模型参数即可实现通用后门激活。利用 CLIP 风格架构的上下文学习行为,BadPromptFL 在能见度极低且参与客户端有限的情况下实现了高攻击成功率(例如 >90%)。在多个数据集和聚合协议上进行的大量实验证明了我们攻击的有效性、隐蔽性和泛化能力,这对基于提示的联邦学习在实际部署中的鲁棒性提出了严重担忧。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-11 14:42:44 UTC 发布:2025-08-11 14:42:44 UTC
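The aggregation step that #99 examines can be pictured as a weighted mean of client prompt embeddings. The sketch below assumes FedAvg-style weighting by local sample count; the abstract mentions "multiple aggregation protocols" without naming them, so this choice and the function name are illustrative only.

```python
def fedavg_prompts(client_prompts, client_sizes):
    """Weighted mean of client prompt vectors (FedAvg-style; an assumed protocol).

    client_prompts: list of equal-length lists of floats, one per client.
    client_sizes:   list of local sample counts used as aggregation weights.
    """
    total = sum(client_sizes)
    dim = len(client_prompts[0])
    aggregated = [0.0] * dim
    for prompt, n in zip(client_prompts, client_sizes):
        for i, value in enumerate(prompt):
            aggregated[i] += (n / total) * value
    return aggregated
```

Because every uploaded vector enters the mean and the result is sent back to all clients, the shared prompt inherits whatever any participant contributes, which is the surface the paper argues needs scrutiny.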
#100 Exploring Strategies for Personalized Radiation Therapy: Part III Identifying genetic determinants for Radiation Response with Meta Learning #100 探索个性化放射治疗策略:第三部分 使用元学习识别放射反应的遗传决定因素
Authors: [Hao Peng](https://arxiv.org/search/?searchtype=author&query=Hao Peng), [Yuanyuan Zhang](https://arxiv.org/search/?searchtype=author&query=Yuanyuan Zhang), [Steve Jiang](https://arxiv.org/search/?searchtype=author&query=Steve Jiang), [Robert Timmerman](https://arxiv.org/search/?searchtype=author&query=Robert Timmerman), [John Minna](https://arxiv.org/search/?searchtype=author&query=John Minna) 作者:彭昊、张媛媛、史蒂夫·姜、罗伯特·蒂默曼、约翰·明纳
Radiation response in cancer is shaped by complex, patient-specific biology, yet current treatment strategies often rely on uniform dose prescriptions without accounting for tumor heterogeneity. In this study, we introduce a meta-learning framework for one-shot prediction of radiosensitivity, measured by SF2, using cell-line-level gene expression data. Unlike the widely used Radiosensitivity Index (RSI), a rank-based linear model trained on a fixed 10-gene signature, our proposed meta-learned model allows the importance of each gene to vary by sample through fine-tuning. This flexibility addresses key limitations of static models like RSI, which assume uniform gene contributions across tumor types and discard expression magnitude and gene-gene interactions. Our results show that meta-learning offers robust generalization to unseen samples and performs well in tumor subgroups with high radiosensitivity variability, such as adenocarcinoma and large cell carcinoma. By learning transferable structure across tasks while preserving sample-specific adaptability, our approach enables rapid adaptation to individual samples, improving predictive accuracy across diverse tumor subtypes while uncovering context-dependent patterns of gene influence that may inform personalized therapy. 癌症对放射治疗的反应由复杂的、病人特异的生物学决定,但当前的治疗策略常常依赖统一的剂量处方,未考虑肿瘤的异质性。在本研究中,我们提出一种元学习框架,用于通过细胞系水平的基因表达数据对以 SF2 衡量的放射敏感性进行一次性预测。与广泛使用的放射敏感性指数 RSI(一种基于固定 10 基因签名训练的基于秩的线性模型)不同,我们提出的元学习模型允许通过微调使每个基因的重要性因样本而异。这种灵活性解决了像 RSI 这样的静态模型的关键限制,后者假定基因在各种肿瘤类型中的贡献是统一的,并且丢弃了表达量的大小和基因间的相互作用。我们的结果表明,元学习对未见样本具有稳健的泛化能力,并且在放射敏感性变异性较高的肿瘤亚组(如腺癌和大细胞癌)中表现良好。通过在任务间学习可迁移的结构同时保留对样本特异性的可适应性,我们的方法能够快速适应个体样本,提高对不同肿瘤亚型的预测准确性,同时揭示基因影响的情境依赖模式,这些模式可能为个性化治疗提供参考。
Subjects: Medical Physics, Artificial Intelligence, Machine Learning 学科:医学物理、人工智能、机器学习
Publish: 2025-08-11 14:34:18 UTC
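The abstract for #100 notes that a rank-based linear model such as RSI discards expression magnitude. A toy version makes that concrete. The gene names, coefficients, and rank convention below are hypothetical placeholders, not the published RSI signature.

```python
def rank_transform(expression):
    """Map gene -> rank within one sample; rank 1 is the lowest-expressed gene (assumed convention)."""
    ordered = sorted(expression, key=lambda gene: expression[gene])
    return {gene: i + 1 for i, gene in enumerate(ordered)}


def rank_linear_score(expression, coefficients):
    """Linear score over within-sample ranks, with one fixed coefficient per gene."""
    ranks = rank_transform(expression)
    return sum(coefficients[gene] * ranks[gene] for gene in coefficients)


# Hypothetical three-gene signature, for illustration only.
coefficients = {"G1": 0.5, "G2": -0.2, "G3": 0.1}
sample_a = {"G1": 1.0, "G2": 5.0, "G3": 9.0}
sample_b = {"G1": 0.1, "G2": 50.0, "G3": 900.0}  # same ordering, very different magnitudes
```

`rank_linear_score(sample_a, coefficients)` and `rank_linear_score(sample_b, coefficients)` are identical because only the ordering of genes matters, which is the limitation the meta-learned model is meant to relax.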
#101 Bridging ASR and LLMs for Dysarthric Speech Recognition: Benchmarking Self-Supervised and Generative Approaches
Authors: [Ahmed Aboeitta](https://arxiv.org/search/?searchtype=author&query=Ahmed Aboeitta), [Ahmed Sharshar](https://arxiv.org/search/?searchtype=author&query=Ahmed Sharshar), [Youssef Nafea](https://arxiv.org/search/?searchtype=author&query=Youssef Nafea), [Shady Shehata](https://arxiv.org/search/?searchtype=author&query=Shady Shehata)
Dysarthric speech poses significant challenges for Automatic Speech Recognition (ASR) due to phoneme distortions and high variability. While self-supervised ASR models like Wav2Vec, HuBERT, and Whisper have shown promise, their effectiveness in dysarthric speech remains unclear. This study systematically benchmarks these models with different decoding strategies, including CTC, seq2seq, and LLM-enhanced decoding (BART, GPT-2, Vicuna). Our contributions include (1) benchmarking ASR architectures for dysarthric speech, (2) introducing LLM-based decoding to improve intelligibility, (3) analyzing generalization across datasets, and (4) providing insights into recognition errors across severity levels. Findings highlight that LLM-enhanced decoding improves dysarthric ASR by leveraging linguistic constraints for phoneme restoration and grammatical correction. 构音障碍语音由于音素失真和高度变异性,给自动语音识别(ASR)带来了重大挑战。尽管像 Wav2Vec、HuBERT 和 Whisper 这样的自监督 ASR 模型展示了潜力,但它们在构音障碍(dysarthric)语音中的有效性仍不明朗。本研究系统性地基准测试了这些模型,并采用了不同的解码策略,包括 CTC、seq2seq 以及基于 LLM 的解码(BART、GPT-2、Vicuna)。我们的贡献包括 (1) 为构音障碍语音对 ASR 架构进行基准测试,(2) 引入基于 LLM 的解码以提高可懂度,(3) 分析跨数据集的泛化能力,和 (4) 提供不同严重程度下识别错误的见解。研究结果强调,基于 LLM 的解码通过利用语言约束进行音素恢复和语法纠正,从而改善了构音障碍语音的 ASR 性能。
Subjects: Sound, Artificial Intelligence, Audio and Speech Processing 主题:声音,人工智能,音频与语音处理
Publish: 2025-08-11 14:31:20 UTC 发布时间:2025-08-11 14:31:20 协调世界时
#102 Advancing Knowledge Tracing by Exploring Follow-up Performance Trends #102 通过探索后续表现趋势推进知识追踪
Authors: [Hengyu Liu](https://arxiv.org/search/?searchtype=author&query=Hengyu Liu), [Yushuai Li](https://arxiv.org/search/?searchtype=author&query=Yushuai Li), [Minghe Yu](https://arxiv.org/search/?searchtype=author&query=Minghe Yu), [Tiancheng Zhang](https://arxiv.org/search/?searchtype=author&query=Tiancheng Zhang), [Ge Yu](https://arxiv.org/search/?searchtype=author&query=Ge Yu), [Torben Bach Pedersen](https://arxiv.org/search/?searchtype=author&query=Torben Bach Pedersen), [Kristian Torp](https://arxiv.org/search/?searchtype=author&query=Kristian Torp), [Christian S. Jensen](https://arxiv.org/search/?searchtype=author&query=Christian S. Jensen), [Tianyi Li](https://arxiv.org/search/?searchtype=author&query=Tianyi Li) 作者:Hengyu Liu、Yushuai Li、Minghe Yu、Tiancheng Zhang、Ge Yu、Torben Bach Pedersen、Kristian Torp、Christian S. Jensen、Tianyi Li
Intelligent Tutoring Systems (ITS), such as Massive Open Online Courses, offer new opportunities for human learning. At the core of such systems, knowledge tracing (KT) predicts students’ future performance by analyzing their historical learning activities, enabling an accurate evaluation of students’ knowledge states over time. We show that existing KT methods often encounter correlation conflicts when analyzing the relationships between historical learning sequences and future performance. To address such conflicts, we propose to extract so-called Follow-up Performance Trends (FPTs) from historical ITS data and to incorporate them into KT. We propose a method called Forward-Looking Knowledge Tracing (FINER) that combines historical learning sequences with FPTs to enhance student performance prediction accuracy. FINER constructs learning patterns that facilitate the retrieval of FPTs from historical ITS data in linear time; FINER includes a novel similarity-aware attention mechanism that aggregates FPTs based on both frequency and contextual similarity; and FINER offers means of combining FPTs and historical learning sequences to enable more accurate prediction of student future performance. Experiments on six real-world datasets show that FINER can outperform ten state-of-the-art KT methods, increasing accuracy by 8.74% to 84.85%. 智能辅导系统(ITS),例如大规模开放在线课程,为人类学习提供了新机遇。在此类系统的核心,知识追踪(KT)通过分析学生的历史学习活动来预测其未来表现,从而能够对学生随时间变化的知识状态进行准确评估。我们指出,现有的 KT 方法在分析历史学习序列与未来表现之间关系时常常遇到相关性冲突。为了解决此类冲突,我们提出从历史 ITS 数据中提取所谓的后续表现趋势(FPTs),并将其融入 KT。我们提出了一种称为前瞻性知识追踪(FINER)的方法,将历史学习序列与 FPTs 结合,以提升学生表现预测的准确性。 FINER 构建了能够以线性时间从历史 ITS 数据中检索 FPT(后续表现趋势)的学习模式;FINER 包含一种新颖的感知相似性的注意力机制,该机制基于频率和上下文相似性对 FPT 进行聚合;并且 FINER 提供了将 FPT 与历史学习序列相结合的方法,以实现对学生未来表现的更准确预测。对六个真实世界数据集的实验表明,FINER 可优于十种最先进的知识追踪方法,准确率提高了 8.74% 到 84.85%。
Subjects: Computers and Society, Artificial Intelligence, Machine Learning 主题:计算机与社会、人工智能、机器学习
Publish: 2025-08-11 14:26:11 UTC 发布:2025-08-11 14:26:11 协调世界时 (UTC)
#103 Learning to Select MCP Algorithms: From Traditional ML to Dual-Channel GAT-MLP
Authors: [Xiang Li](https://arxiv.org/search/?searchtype=author&query=Xiang Li), [Shanshan Wang](https://arxiv.org/search/?searchtype=author&query=Shanshan Wang), [Chenglong Xiao](https://arxiv.org/search/?searchtype=author&query=Chenglong Xiao) 作者:李翔、王珊珊、肖成龙
Extensive experiments and prior studies show that no single maximum clique algorithm consistently performs best across all instances, highlighting the importance of selecting suitable algorithms based on instance features. Through an extensive analysis of relevant studies, it is found that there is a lack of research work concerning algorithm selection oriented toward the Maximum Clique Problem (MCP). In this work, we propose a learning-based framework that integrates both traditional machine learning and graph neural networks to address this gap. We construct a labeled dataset by running four exact MCP algorithms on a diverse collection of graph instances, accompanied by structural and global statistical features extracted from each graph. We first evaluate four conventional classifiers: Support Vector Machine (SVM), Random Forest (RF), Decision Tree (DT), and K-Nearest Neighbors (KNN), across multiple dataset variants. Experimental results show that RF consistently shows strong performance across metrics and dataset variants, making it a reliable baseline. In addition, feature importance analysis indicates that connectivity and topological structure are strong predictors of algorithm performance. Building on these findings, we develop a dual-channel model named GAT-MLP, which combines a Graph Attention Network (GAT) for local structural encoding with a Multilayer Perceptron (MLP) for global feature modeling. The GAT-MLP model shows strong and consistent performance across all metrics. Our results highlight the effectiveness of dual-channel architectures and the promise of graph neural networks in combinatorial algorithm selection. 
大量实验和既往研究表明,没有单一的最大团算法能在所有实例上始终表现最佳,这凸显了根据实例特征选择合适算法的重要性。通过对相关研究的广泛分析,发现针对最大团问题(MCP)的算法选择研究工作较为缺乏。在本工作中,我们提出了一个基于学习的框架,将传统机器学习与图神经网络相结合以弥补这一空白。我们通过在多样化的图实例集合上运行四种精确的 MCP 算法来构建带标签的数据集,并从每个图中提取结构性和全局统计特征。我们首先评估了四种传统分类器:支持向量机(SVM)、随机森林(RF)、决策树(DT)和 K 近邻(KNN),在多个数据集变体上的表现。实验结果表明,随机森林在各项指标和数据集变体中始终表现出较强的性能,使其成为一个可靠的基线。 此外,特征重要性分析表明,连通性和拓扑结构是算法性能的强有力预测因子。基于这些发现,我们提出了一种名为 GAT-MLP 的双通道模型,该模型将用于局部结构编码的图注意力网络(GAT)与用于全局特征建模的多层感知器(MLP)相结合。GAT-MLP 模型在所有指标上表现出强劲且一致的性能。我们的结果凸显了双通道架构的有效性以及图神经网络在组合算法选择方面的前景。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-11 14:09:58 UTC 发布:2025-08-11 14:09:58 UTC
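The dataset construction described in #103 (run several exact solvers per graph, record features, keep a label) can be sketched as follows. Labeling each instance with the fastest solver is an assumption, since the abstract does not say how labels are derived, and the feature set and solver names are placeholders.

```python
def graph_features(n_nodes, edges):
    """Global statistics for an undirected simple graph given as a list of (u, v) pairs."""
    degree = [0] * n_nodes
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    max_edges = n_nodes * (n_nodes - 1) / 2
    return {
        "n_nodes": n_nodes,
        "n_edges": len(edges),
        "density": len(edges) / max_edges if max_edges else 0.0,
        "avg_degree": sum(degree) / n_nodes if n_nodes else 0.0,
        "max_degree": max(degree, default=0),
    }


def best_algorithm(runtimes):
    """Label an instance with the solver that finished fastest; None marks a timeout."""
    finished = {name: t for name, t in runtimes.items() if t is not None}
    if not finished:
        return None  # no solver finished, so the instance gets no label
    return min(finished, key=finished.get)
```

Rows of `graph_features(...)` paired with `best_algorithm(...)` labels are the kind of table a Random Forest baseline could be trained on; the GAT channel in the paper would additionally consume the graph structure itself.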
#104 DIVER: A Multi-Stage Approach for Reasoning-intensive Information Retrieval #104 DIVER:一种用于注重推理的信息检索的多阶段方法
Authors: [Meixiu Long](https://arxiv.org/search/?searchtype=author&query=Meixiu Long), [Duolin Sun](https://arxiv.org/search/?searchtype=author&query=Duolin Sun), [Dan Yang](https://arxiv.org/search/?searchtype=author&query=Dan Yang), [Junjie Wang](https://arxiv.org/search/?searchtype=author&query=Junjie Wang), [Yue Shen](https://arxiv.org/search/?searchtype=author&query=Yue Shen), [Jian Wang](https://arxiv.org/search/?searchtype=author&query=Jian Wang), [Peng Wei](https://arxiv.org/search/?searchtype=author&query=Peng Wei), [Jinjie Gu](https://arxiv.org/search/?searchtype=author&query=Jinjie Gu), [Jiahai Wang](https://arxiv.org/search/?searchtype=author&query=Jiahai Wang) 作者:龙美秀、孙多林、杨丹、王俊杰、沈跃、王坚、魏鹏、顾金杰、王嘉海
Retrieval-augmented generation has achieved strong performance on knowledge-intensive tasks where query-document relevance can be identified through direct lexical or semantic matches. However, many real-world queries involve abstract reasoning, analogical thinking, or multi-step inference, which existing retrievers often struggle to capture. To address this challenge, we present DIVER, a retrieval pipeline tailored for reasoning-intensive information retrieval. DIVER consists of four components: document processing to improve input quality, LLM-driven query expansion via iterative document interaction, a reasoning-enhanced retriever fine-tuned on synthetic multi-domain data with hard negatives, and a pointwise reranker that combines LLM-assigned helpfulness scores with retrieval scores. On the BRIGHT benchmark, DIVER achieves state-of-the-art nDCG@10 scores of 41.6 and 28.9 on original queries, consistently outperforming competitive reasoning-aware models. These results demonstrate the effectiveness of reasoning-aware retrieval strategies in complex real-world tasks. Our code and retrieval model will be released soon. 检索增强生成在那些可以通过直接的词汇或语义匹配识别查询与文档相关性的知识密集型任务上取得了强劲表现。然而,许多现实世界的查询涉及抽象推理、类比思维或多步推断,现有的检索器往往难以捕捉这些需求。为了解决这一挑战,我们提出了 DIVER,一种为推理密集型信息检索量身打造的检索管道。DIVER 由四个组件构成:用于改善输入质量的文档处理、通过与文档反复交互进行的由 LLM 驱动的查询扩展、在包含困难负样本的合成多领域数据上微调的增强推理检索器,以及结合 LLM 赋予的有用性评分与检索分数的逐点重排序器。在 BRIGHT 基准测试上,DIVER 在原始查询上分别取得了 41.6 和 28.9 的 nDCG@10 最佳成绩,持续优于具有推理意识的竞争模型。这些结果展示了在复杂现实任务中,具备推理意识的检索策略的有效性。我们的代码和检索模型将很快发布。
Subjects: Information Retrieval, Artificial Intelligence 主题:信息检索,人工智能
Publish: 2025-08-11 13:57:49 UTC 发布:2025-08-11 13:57:49 UTC
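The last DIVER component (#104), a pointwise reranker that combines LLM-assigned helpfulness with retrieval scores, might look roughly like the sketch below. The min-max normalization, the linear interpolation, and the `alpha` weight are assumptions; the abstract does not specify how the two scores are combined.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def rerank(candidates, alpha=0.5):
    """Order documents by a blend of helpfulness and retrieval score.

    candidates: list of (doc_id, retrieval_score, llm_helpfulness) tuples.
    alpha:      weight on the helpfulness score (hypothetical parameter).
    """
    if not candidates:
        return []
    retrieval = min_max([c[1] for c in candidates])
    helpful = min_max([c[2] for c in candidates])
    blended = [
        (alpha * h + (1 - alpha) * r, c[0])
        for c, r, h in zip(candidates, retrieval, helpful)
    ]
    blended.sort(key=lambda pair: -pair[0])  # stable sort keeps input order on ties
    return [doc_id for _, doc_id in blended]
```

Setting `alpha=0` reproduces the retriever's order and `alpha=1` trusts the LLM score alone, which makes the blend easy to ablate.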
#105 Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation #105 全能效果:统一且可空间控制的视觉效果生成
Authors: [Fangyuan Mao](https://arxiv.org/search/?searchtype=author&query=Fangyuan Mao), [Aiming Hao](https://arxiv.org/search/?searchtype=author&query=Aiming Hao), [Jintao Chen](https://arxiv.org/search/?searchtype=author&query=Jintao Chen), [Dongxia Liu](https://arxiv.org/search/?searchtype=author&query=Dongxia Liu), [Xiaokun Feng](https://arxiv.org/search/?searchtype=author&query=Xiaokun Feng), [Jiashu Zhu](https://arxiv.org/search/?searchtype=author&query=Jiashu Zhu), [Meiqi Wu](https://arxiv.org/search/?searchtype=author&query=Meiqi Wu), [Chubin Chen](https://arxiv.org/search/?searchtype=author&query=Chubin Chen), [Jiahong Wu](https://arxiv.org/search/?searchtype=author&query=Jiahong Wu), [Xiangxiang Chu](https://arxiv.org/search/?searchtype=author&query=Xiangxiang Chu) 作者:毛方远、郝爱明、陈金涛、刘冬霞、冯晓坤、朱嘉述、吴美琪、陈楚斌、吴家宏、褚响响
Visual effects (VFX) are essential visual enhancements fundamental to modern cinematic production. Although video generation models offer cost-efficient solutions for VFX production, current methods are constrained by per-effect LoRA training, which limits generation to single effects. This fundamental limitation impedes applications that require spatially controllable composite effects, i.e., the concurrent generation of multiple effects at designated locations. However, integrating diverse effects into a unified framework faces major challenges: interference from effect variations and spatial uncontrollability during multi-VFX joint training. To tackle these challenges, we propose Omni-Effects, a first unified framework capable of generating prompt-guided effects and spatially controllable composite effects. The core of our framework comprises two key innovations: (1) LoRA-based Mixture of Experts (LoRA-MoE), which employs a group of expert LoRAs, integrating diverse effects within a unified model while effectively mitigating cross-task interference. (2) Spatial-Aware Prompt (SAP) incorporates spatial mask information into the text token, enabling precise spatial control. Furthermore, we introduce an Independent-Information Flow (IIF) module integrated within the SAP, isolating the control signals corresponding to individual effects to prevent any unwanted blending. To facilitate this research, we construct a comprehensive VFX dataset Omni-VFX via a novel data collection pipeline combining image editing and First-Last Frame-to-Video (FLF2V) synthesis, and introduce a dedicated VFX evaluation framework for validating model performance. Extensive experiments demonstrate that Omni-Effects achieves precise spatial control and diverse effect generation, enabling users to specify both the category and location of desired effects. 
视觉特效(VFX)是现代电影制作中不可或缺的视觉增强元素。尽管视频生成模型为 VFX 制作提供了更具成本效益的解决方案,但现有方法受限于按效果训练的 LoRA,这将生成限制为单一效果。这一根本性限制阻碍了需要空间可控复合特效的应用,即在指定位置同时生成多种效果。然而,将多样化效果整合到统一框架中面临重大挑战:效果差异带来的干扰以及多 VFX 联合训练期间的空间不可控性。为了解决这些挑战,我们提出了 Omni-Effects,这是首个能够生成提示引导效果并实现空间可控复合效果的统一框架。我们框架的核心包含两项关键创新:(1)基于 LoRA 的专家混合(LoRA-MoE),它采用一组专家 LoRA,在统一模型内融合多样化效果,同时有效减轻任务间干扰。(2)空间感知提示(SAP)将空间掩码信息纳入文本标记,实现精确的空间控制。 此外,我们在 SAP 中引入了一个独立信息流(IIF)模块,将对应于各个特效的控制信号隔离开来,以防止任何不必要的混合。为促进该研究,我们通过一种结合图像编辑与首末帧到视频(FLF2V)合成的新型数据收集管道构建了一个全面的特效数据集 Omni-VFX,并引入了一个专门的特效评估框架以验证模型性能。大量实验证明,Omni-Effects 实现了精确的空间控制和多样的特效生成,使用户能够指定所需特效的类别和位置。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-11 13:41:24 UTC 发布:2025-08-11 13:41:24 UTC
#106 Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL #106 超越十回合:通过大规模异步强化学习解锁长时域具有智能体属性的搜索
Authors: [Jiaxuan Gao](https://arxiv.org/search/?searchtype=author&query=Jiaxuan Gao), [Wei Fu](https://arxiv.org/search/?searchtype=author&query=Wei Fu), [Minyang Xie](https://arxiv.org/search/?searchtype=author&query=Minyang Xie), [Shusheng Xu](https://arxiv.org/search/?searchtype=author&query=Shusheng Xu), [Chuyi He](https://arxiv.org/search/?searchtype=author&query=Chuyi He), [Zhiyu Mei](https://arxiv.org/search/?searchtype=author&query=Zhiyu Mei), [Banghua Zhu](https://arxiv.org/search/?searchtype=author&query=Banghua Zhu), [Yi Wu](https://arxiv.org/search/?searchtype=author&query=Yi Wu) 作者:高嘉轩、傅蔚、谢敏阳、徐书圣、何楚怡、梅志宇、朱邦华、吴毅
Recent advancements in LLM-based agents have demonstrated remarkable capabilities in handling complex, knowledge-intensive tasks by integrating external tools. Among diverse choices of tools, search tools play a pivotal role in accessing vast external knowledge. However, open-source agents still fall short of achieving expert-level Search Intelligence, the ability to resolve ambiguous queries, generate precise searches, analyze results, and conduct thorough exploration. Existing approaches fall short in scalability, efficiency, and data quality. For example, small turn limits in existing online RL methods, e.g. <=10, restrict complex strategy learning. This paper introduces ASearcher, an open-source project for large-scale RL training of search agents. Our key contributions include: (1) Scalable fully asynchronous RL training that enables long-horizon search while maintaining high training efficiency. (2) A prompt-based LLM agent that autonomously synthesizes high-quality and challenging QAs, creating a large-scale QA dataset. Through RL training, our prompt-based QwQ-32B agent achieves substantial improvements, with 46.7% and 20.8% Avg@4 gains on xBench and GAIA, respectively. Notably, our agent exhibits extreme long-horizon search, with tool calls exceeding 40 turns and output tokens exceeding 150k during training time. With a simple agent design and no external LLMs, ASearcher-Web-QwQ achieves Avg@4 scores of 42.1 on xBench and 52.8 on GAIA, surpassing existing open-source 32B agents. We open-source our models, training data, and codes in https://github.com/inclusionAI/ASearcher. 
基于 LLM 的智能体在整合外部工具处理复杂、知识密集型任务方面的最新进展展示了显著能力。在多种工具选择中,搜索工具在获取大量外部知识方面发挥着关键作用。然而,开源智能体在实现专家级搜索智能方面仍然不足,即解决模糊查询、生成精确检索、分析结果并进行深入探索的能力仍有差距。现有方法在可扩展性、效率和数据质量方面存在不足。例如,现有在线强化学习方法中的较小回合限制(例如 ≤10)限制了复杂策略的学习。本文提出了 ASearcher,一个用于大规模强化学习训练搜索智能体的开源项目。我们的主要贡献包括:(1)可扩展的完全异步强化学习训练,使得在保持高训练效率的同时实现长时域搜索成为可能。(2)一种基于提示的 LLM 智能体,能够自主合成高质量且具有挑战性的问答,构建大规模问答数据集。通过强化学习训练,我们的基于提示的 QwQ-32B 智能体取得了显著提升,在 xBench 和 GAIA 上的 Avg@4 分别提高了 46.7% 和 20.8%。 值得注意的是,我们的智能体展现出极端的长时程搜索能力,在训练期间工具调用超过 40 轮,输出令牌超过 15 万。在简单的智能体设计且不依赖外部 LLMs 的情况下,ASearcher-Web-QwQ 在 xBench 上的 Avg@4 得分为 42.1,在 GAIA 上为 52.8,超过了现有开源的 32B 智能体。我们在 https://github.com/inclusionAI/ASearcher 上开源了模型、训练数据和代码。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-11 13:36:57 UTC 发布:2025-08-11 13:36:57 协调世界时
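The motivation for the fully asynchronous training in #106 is that trajectories differ widely in length (some exceed 40 tool calls), so waiting for a whole batch stalls on the longest one. The toy below only illustrates asynchronous trajectory collection with `asyncio`; it is not ASearcher's training system, and the function names and the use of `asyncio.sleep(0)` as a stand-in for a tool call are assumptions.

```python
import asyncio


async def rollout(agent_id, n_turns, queue):
    """Simulate one search trajectory; each loop iteration stands in for a tool call."""
    for _ in range(n_turns):
        await asyncio.sleep(0)  # yield control, as a real tool call would while waiting on I/O
    await queue.put((agent_id, n_turns))


async def collect(turn_counts):
    """Start all rollouts at once and consume each trajectory as soon as it finishes."""
    queue = asyncio.Queue()
    tasks = [
        asyncio.create_task(rollout(i, n, queue))
        for i, n in enumerate(turn_counts)
    ]
    finished = []
    for _ in tasks:
        # A trainer could start using this trajectory without waiting for the others.
        finished.append(await queue.get())
    await asyncio.gather(*tasks)
    return finished


def run(turn_counts):
    return asyncio.run(collect(turn_counts))
```

With `run([40, 2, 10])`, the 2-turn trajectory is handed over first and the 40-turn one last, so short episodes never wait on long ones.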
#107 WeChat-YATT: A Simple, Scalable and Balanced RLHF Trainer #107 微信-YATT:一个简单、可扩展且平衡的 RLHF 训练器
Authors: [Junyu Wu](https://arxiv.org/search/?searchtype=author&query=Junyu Wu), [Weiming Chang](https://arxiv.org/search/?searchtype=author&query=Weiming Chang), [Xiaotao Liu](https://arxiv.org/search/?searchtype=author&query=Xiaotao Liu), [Guanyou He](https://arxiv.org/search/?searchtype=author&query=Guanyou He), [Tingfeng Xian](https://arxiv.org/search/?searchtype=author&query=Tingfeng Xian), [Haoqiang Hong](https://arxiv.org/search/?searchtype=author&query=Haoqiang Hong), [Boqi Chen](https://arxiv.org/search/?searchtype=author&query=Boqi Chen), [Haotao Tian](https://arxiv.org/search/?searchtype=author&query=Haotao Tian), [Tao Yang](https://arxiv.org/search/?searchtype=author&query=Tao Yang), [Yunsheng Shi](https://arxiv.org/search/?searchtype=author&query=Yunsheng Shi), [Feng Lin](https://arxiv.org/search/?searchtype=author&query=Feng Lin), [Ting Yao](https://arxiv.org/search/?searchtype=author&query=Ting Yao) 作者:吴俊宇、常伟明、刘晓涛、何冠佑、冼廷锋、洪豪强、陈博琪、田昊滔、杨涛、时云胜、林峰、姚霆
Reinforcement Learning from Human Feedback (RLHF) has emerged as a prominent paradigm for training large language models and multimodal systems. Despite notable advances enabled by existing RLHF training frameworks, significant challenges remain in scaling to complex multimodal workflows and adapting to dynamic workloads. In particular, current systems often encounter limitations related to controller scalability when managing large models, as well as inefficiencies in orchestrating intricate RLHF pipelines, especially in scenarios that require dynamic sampling and resource allocation. In this paper, we introduce WeChat-YATT (Yet Another Transformer Trainer in WeChat), a simple, scalable, and balanced RLHF training framework specifically designed to address these challenges. WeChat-YATT features a parallel controller programming model that enables flexible and efficient orchestration of complex RLHF workflows, effectively mitigating the bottlenecks associated with centralized controller architectures and facilitating scalability in large-scale data scenarios. In addition, we propose a dynamic placement schema that adaptively partitions computational resources and schedules workloads, thereby significantly reducing hardware idle time and improving GPU utilization under variable training conditions. We evaluate WeChat-YATT across a range of experimental scenarios, demonstrating that it achieves substantial improvements in throughput compared to state-of-the-art RLHF training frameworks. Furthermore, WeChat-YATT has been successfully deployed to train models supporting WeChat product features for a large-scale user base, underscoring its effectiveness and robustness in real-world applications. 
来自人类反馈的强化学习(RLHF)已成为训练大型语言模型和多模态系统的一个重要范式。尽管现有的 RLHF 训练框架推动了显著进展,但在扩展到复杂多模态工作流和适应动态工作负载方面仍存在重大挑战。具体而言,当前系统在管理大型模型时常遇到与控制器可扩展性相关的限制,并且在协调复杂的 RLHF 管道时效率不高,尤其是在需要动态采样和资源分配的场景中。本文提出了 WeChat-YATT(WeChat 的又一变换器训练器,Yet Another Transformer Trainer in WeChat),这是一种简单、可扩展且平衡的 RLHF 训练框架,专门为应对这些挑战而设计。WeChat-YATT 采用并行控制器编程模型,能够灵活高效地编排复杂的 RLHF 工作流,有效缓解集中式控制器架构的瓶颈,促进在大规模数据场景下的可扩展性。 此外,我们提出了一种动态放置方案,该方案自适应地划分计算资源并调度工作负载,从而在可变的训练条件下显著减少硬件空闲时间并提升 GPU 利用率。我们在多种实验场景下评估了 WeChat-YATT,结果表明与最先进的 RLHF 训练框架相比,它在吞吐量方面取得了大幅提升。此外,WeChat-YATT 已成功部署用于训练支持大量用户使用的微信产品功能的模型,进一步凸显了其在实际应用中的有效性和鲁棒性。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-11 13:31:53 UTC 发布:2025-08-11 13:31:53 UTC
#108 Exploring the Challenges and Opportunities of AI-assisted Codebase Generation #108 探索 AI 辅助代码库生成的挑战与机遇
Authors: [Philipp Eibl](https://arxiv.org/search/?searchtype=author&query=Philipp Eibl), [Sadra Sabouri](https://arxiv.org/search/?searchtype=author&query=Sadra Sabouri), [Souti Chattopadhyay](https://arxiv.org/search/?searchtype=author&query=Souti Chattopadhyay) 作者:Philipp Eibl、Sadra Sabouri、Souti Chattopadhyay
Recent AI code assistants have significantly improved their ability to process more complex contexts and generate entire codebases based on a textual description, compared to the popular snippet-level generation. These codebase AI assistants (CBAs) can also extend or adapt codebases, allowing users to focus on higher-level design and deployment decisions. While prior work has extensively studied the impact of snippet-level code generation, this new class of codebase generation models is relatively unexplored. Despite initial anecdotal reports of excitement about these agents, they remain less frequently adopted compared to snippet-level code assistants. To utilize CBAs better, we need to understand how developers interact with CBAs, and how and why CBAs fall short of developers’ needs. In this paper, we explored these gaps through a counterbalanced user study and interview with (n = 16) students and developers working on coding tasks with CBAs. We found that participants varied the information in their prompts, like problem description (48% of prompts), required functionality (98% of prompts), code structure (48% of prompts), and their prompt writing process. Despite various strategies, the overall satisfaction score with generated codebases remained low (mean = 2.8, median = 3, on a scale of one to five). Participants mentioned functionality as the most common factor for dissatisfaction (77% of instances), alongside poor code quality (42% of instances) and communication issues (25% of instances). We delve deeper into participants’ dissatisfaction to identify six underlying challenges that participants faced when using CBAs, and extracted five barriers to incorporating CBAs into their workflows. Finally, we surveyed 21 commercial CBAs to compare their capabilities with participant challenges and present design opportunities for more efficient and useful CBAs. 
与流行的片段级生成相比,近年来的 AI 代码助手在处理更复杂上下文和根据文本描述生成整个代码库方面能力显著提升。这些代码库 AI 助手(CBA)还能够扩展或适配代码库,使用户可以专注于更高层次的设计和部署决策。尽管先前研究已广泛考察片段级代码生成的影响,这一新型的代码库生成模型相对鲜有探索。尽管最初有关于这些代理令人振奋的轶事报道,但与片段级代码助手相比,它们的采用率仍然较低。为了更好地利用 CBA,我们需要了解开发者如何与 CBA 交互,以及 CBA 在何种程度上、为何未能满足开发者的需求。在本文中,我们通过一项对照平衡的用户研究和访谈(n = 16),研究了学生和开发者在使用 CBA 完成编码任务时的这些差距。我们发现参与者在提示信息上存在差异,例如问题描述(48% 的提示)、所需功能(98% 的提示)、代码结构(48% 的提示)以及他们的提示撰写过程。 尽管采取了各种策略,生成的代码库的总体满意度仍然很低(均值 = 2.8,中位数 = 3,评分范围为一到五)。参与者提到功能性是最常见的不满因素(占 77%的情况),其次是代码质量差(占 42%的情况)和沟通问题(占 25%的情况)。我们深入探讨了参与者的不满,识别出参与者在使用代码库助手(CBA)时面临的六大潜在挑战,并提取了将 CBA 融入其工作流程的五个障碍。最后,我们对 21 款商业 CBA 进行了调研,以将其能力与参与者面临的挑战进行比较,并提出了使 CBA 更高效、更有用的设计机会。
Subjects: Software Engineering, Artificial Intelligence 主题:软件工程,人工智能
Publish: 2025-08-11 13:26:48 UTC 发布:2025-08-11 13:26:48 协调世界时 (UTC)
#109 SCDF: A Speaker Characteristics DeepFake Speech Dataset for Bias Analysis #109 SCDF:用于偏见分析的说话人特征深度伪造语音数据集
Authors: [Vojtěch Staněk](https://arxiv.org/search/?searchtype=author&query=Vojtěch Staněk), [Karel Srna](https://arxiv.org/search/?searchtype=author&query=Karel Srna), [Anton Firc](https://arxiv.org/search/?searchtype=author&query=Anton Firc), [Kamil Malinka](https://arxiv.org/search/?searchtype=author&query=Kamil Malinka) 作者:Vojtěch Staněk、Karel Srna、Anton Firc、Kamil Malinka
Despite growing attention to deepfake speech detection, the aspects of bias and fairness remain underexplored in the speech domain. To address this gap, we introduce the Speaker Characteristics Deepfake (SCDF) dataset: a novel, richly annotated resource enabling systematic evaluation of demographic biases in deepfake speech detection. SCDF contains over 237,000 utterances in a balanced representation of both male and female speakers spanning five languages and a wide age range. We evaluate several state-of-the-art detectors and show that speaker characteristics significantly influence detection performance, revealing disparities across sex, language, age, and synthesizer type. These findings highlight the need for bias-aware development and provide a foundation for building non-discriminatory deepfake detection systems aligned with ethical and regulatory standards. 尽管对深度伪造语音检测的关注日益增加,但偏见与公平性在语音领域仍然未被充分探讨。为弥补这一空白,我们引入了说话人特征深度伪造(SCDF)数据集:一个新颖且丰富注释的资源,能够系统评估深度伪造语音检测中的人口统计学偏差。SCDF 包含超过 237,000 条语句,男女说话人代表均衡,涵盖五种语言及广泛的年龄范围。我们评估了若干最先进的检测器,并展示了说话人特征显著影响检测性能,揭示了在性别、语言、年龄和合成器类型方面的差异。这些发现强调了在开发过程中关注偏见的必要性,并为构建符合伦理和监管标准的非歧视性深度伪造检测系统提供了基础。
Subjects: Sound, Artificial Intelligence, Cryptography and Security 学科:声音、人工智能、密码学与安全
Publish: 2025-08-11 12:58:37 UTC 发布:2025-08-11 12:58:37 UTC
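The kind of disparity analysis the SCDF abstract describes can be illustrated with a small per-group accuracy computation. This is only a sketch under stated assumptions: the record layout `(group, true_label, predicted_label)` and the function names `accuracy_by_group` and `max_accuracy_gap` are hypothetical and are not taken from the paper or its dataset tooling.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return detection accuracy per speaker group.

    `records` is an iterable of (group, true_label, predicted_label) tuples.
    The layout is an assumption made for this illustration.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, true_label, predicted_label in records:
        total[group] += 1
        if true_label == predicted_label:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


def max_accuracy_gap(accuracies):
    """Largest difference in accuracy between any two groups."""
    values = list(accuracies.values())
    return max(values) - min(values)


# Toy records: 1 = deepfake, 0 = bona fide (made-up data, not from SCDF).
toy_records = [
    ("female", 1, 1),
    ("female", 0, 1),
    ("male", 1, 1),
    ("male", 0, 0),
]
toy_accuracies = accuracy_by_group(toy_records)
toy_gap = max_accuracy_gap(toy_accuracies)
```

The same grouping could be applied to language, age band, or synthesizer type by changing what is stored in the first tuple field.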
#110 Diffusing the Blind Spot: Uterine MRI Synthesis with Diffusion Models #110 扩散盲点:使用扩散模型合成子宫 MRI
Authors: [Johanna P. Müller](https://arxiv.org/search/?searchtype=author&query=Johanna P. Müller), [Anika Knupfer](https://arxiv.org/search/?searchtype=author&query=Anika Knupfer), [Pedro Blöss](https://arxiv.org/search/?searchtype=author&query=Pedro Blöss), [Edoardo Berardi Vittur](https://arxiv.org/search/?searchtype=author&query=Edoardo Berardi Vittur), [Bernhard Kainz](https://arxiv.org/search/?searchtype=author&query=Bernhard Kainz), [Jana Hutter](https://arxiv.org/search/?searchtype=author&query=Jana Hutter) 作者:约翰娜·P·穆勒、阿妮卡·克努普弗、佩德罗·布勒斯、爱多阿多·贝拉尔迪·维图尔、伯恩哈德·凯恩茨、雅娜·胡特尔
Despite significant progress in generative modelling, existing diffusion models often struggle to produce anatomically precise female pelvic images, limiting their application in gynaecological imaging, where data scarcity and patient privacy concerns are critical. To overcome these barriers, we introduce a novel diffusion-based framework for uterine MRI synthesis, integrating both unconditional and conditioned Denoising Diffusion Probabilistic Models (DDPMs) and Latent Diffusion Models (LDMs) in 2D and 3D. Our approach generates anatomically coherent, high fidelity synthetic images that closely mimic real scans and provide valuable resources for training robust diagnostic models. We evaluate generative quality using advanced perceptual and distributional metrics, benchmarking against standard reconstruction methods, and demonstrate substantial gains in diagnostic accuracy on a key classification task. A blinded expert evaluation further validates the clinical realism of our synthetic images. We release our models with privacy safeguards and a comprehensive synthetic uterine MRI dataset to support reproducible research and advance equitable AI in gynaecology. 尽管生成建模取得了重大进展,现有的扩散模型在生成解剖学上精确的女性骨盆影像方面仍常常力不从心,这限制了其在妇科影像学中的应用,而在该领域数据稀缺和患者隐私问题尤为关键。为克服这些障碍,我们提出了一种用于子宫磁共振成像合成的新型基于扩散的框架,整合了无条件与有条件的去噪扩散概率模型(DDPMs)以及二维和三维的潜空间扩散模型(LDMs)。我们的方法能够生成解剖一致、高保真的合成图像,逼真地模拟真实扫描,为训练鲁棒的诊断模型提供了有价值的资源。我们使用先进的感知和分布性指标评估生成质量,并与标准重建方法进行基准比较,展示了在一项关键分类任务上诊断准确率的显著提升。盲评专家评价进一步验证了我们合成影像的临床真实感。我们同时发布了具有隐私保护措施的模型和一个全面的合成子宫 MRI 数据集,以支持可复现研究并推动妇科领域公平的人工智能发展。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-11 12:18:23 UTC 发布时间:2025-08-11 12:18:23 UTC
#111 NeeCo: Image Synthesis of Novel Instrument States Based on Dynamic and Deformable 3D Gaussian Reconstruction #111 NeeCo: 基于动态可变形三维高斯重建的新型器械状态图像合成
Authors: [Tianle Zeng](https://arxiv.org/search/?searchtype=author&query=Tianle Zeng), [Junlei Hu](https://arxiv.org/search/?searchtype=author&query=Junlei Hu), [Gerardo Loza Galindo](https://arxiv.org/search/?searchtype=author&query=Gerardo Loza Galindo), [Sharib Ali](https://arxiv.org/search/?searchtype=author&query=Sharib Ali), [Duygu Sarikaya](https://arxiv.org/search/?searchtype=author&query=Duygu Sarikaya), [Pietro Valdastri](https://arxiv.org/search/?searchtype=author&query=Pietro Valdastri), [Dominic Jones](https://arxiv.org/search/?searchtype=author&query=Dominic Jones) 作者:Tianle Zeng, Junlei Hu, Gerardo Loza Galindo, Sharib Ali, Duygu Sarikaya, Pietro Valdastri, Dominic Jones
Computer vision-based technologies significantly enhance surgical automation by advancing tool tracking, detection, and localization. However, current data-driven approaches are data-voracious, requiring large, high-quality labeled image datasets, which limits their application in surgical data science. Our work introduces a novel dynamic Gaussian Splatting technique to address the data scarcity in surgical image datasets. We propose a dynamic Gaussian model to represent dynamic surgical scenes, enabling the rendering of surgical instruments from unseen viewpoints and deformations with real tissue backgrounds. We utilize a dynamic training adjustment strategy to address challenges posed by poorly calibrated camera poses from real-world scenarios. Additionally, we propose a method based on dynamic Gaussians for automatically generating annotations for our synthetic data. For evaluation, we constructed a new dataset featuring seven scenes with 14,000 frames of tool and camera motion and tool jaw articulation, with a background of an ex-vivo porcine model. Using this dataset, we synthetically replicate the scene deformation from the ground truth data, allowing direct comparisons of synthetic image quality. Experimental results illustrate that our method generates photo-realistic labeled image datasets with the highest Peak Signal-to-Noise Ratio (PSNR, 29.87). We further evaluate the performance of medical-specific neural networks trained on real and synthetic images using an unseen real-world image dataset. Our results show that models trained on synthetic images generated by the proposed method outperform those trained with state-of-the-art standard data augmentation by 10%, leading to an overall improvement in model performance of nearly 15%.
基于计算机视觉的技术通过推进工具跟踪、检测和定位,显著增强了手术自动化。然而,当前的数据驱动方法对数据需求极高,需要大量高质量标注的图像数据集,这限制了它们在手术数据科学中的应用。我们的工作引入了一种新颖的动态高斯点喷涂(dynamic Gaussian Splatting)技术,以应对手术图像数据集的稀缺问题。我们提出了一种动态高斯模型来表示动态手术场景,使得在真实组织背景下能够从未见视角和在变形条件下渲染手术器械。我们采用了一种动态训练调整策略,以应对真实场景中相机位姿校准不良带来的挑战。此外,我们提出了一种基于动态高斯的方法,用于自动生成合成数据的注释。为评估所提方法,我们构建了一个包含七个场景的新数据集,包含 14,000 帧的器械与相机运动以及器械嘴部关节动作,背景为离体猪模型。 使用此数据集,我们从真实数据合成性地再现场景变形,从而可以直接比较合成图像的质量。实验结果表明,我们的方法生成了峰值信噪比最高(29.87)的拟真带标签图像数据集。我们进一步使用一个未见过的真实世界图像数据集评估在真实图像和合成图像上训练的医学专用神经网络的性能。结果显示,使用所提出方法生成的合成图像训练的模型,其性能比使用最先进的标准数据增强训练的模型高出 10%,从而使模型整体性能提升近 15%。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-11 12:13:05 UTC 发布:2025-08-11 12:13:05 协调世界时 (UTC)
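For readers unfamiliar with the image-quality metric quoted in the NeeCo abstract, the standard PSNR definition is $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The sketch below works on flat lists of pixel intensities and assumes an 8-bit range (`max_value=255.0`); the function name `psnr` and the toy inputs are illustrative and do not come from the paper's evaluation code.

```python
import math


def psnr(reference, rendered, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example: every pixel is off by 10 intensity levels, so MSE = 100.
toy_psnr = psnr([100, 100, 100, 100], [110, 110, 110, 110])
```

Higher values indicate a closer match to the reference image; identical images give an infinite PSNR because the error term is zero.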
#112 Not Yet AlphaFold for the Mind: Evaluating Centaur as a Synthetic Participant #112 尚未出现“AlphaFold 式”心智:评估 Centaur 作为合成参与者
Authors: [Sabrina Namazova](https://arxiv.org/search/?searchtype=author&query=Sabrina Namazova), [Alessandra Brondetta](https://arxiv.org/search/?searchtype=author&query=Alessandra Brondetta), [Younes Strittmatter](https://arxiv.org/search/?searchtype=author&query=Younes Strittmatter), [Matthew Nassar](https://arxiv.org/search/?searchtype=author&query=Matthew Nassar), [Sebastian Musslick](https://arxiv.org/search/?searchtype=author&query=Sebastian Musslick) 作者:Sabrina Namazova、Alessandra Brondetta、Younes Strittmatter、Matthew Nassar、Sebastian Musslick
Simulators have revolutionized scientific practice across the natural sciences. By generating data that reliably approximate real-world phenomena, they enable scientists to accelerate hypothesis testing and optimize experimental designs. This is perhaps best illustrated by AlphaFold, a Nobel Prize-winning simulator in chemistry that predicts protein structures from amino acid sequences, enabling rapid prototyping of molecular interactions, drug targets, and protein functions. In the behavioral sciences, a reliable participant simulator (a system capable of producing human-like behavior across cognitive tasks) would represent a similarly transformative advance. Recently, Binz et al. introduced Centaur, a large language model (LLM) fine-tuned on human data from 160 experiments, proposing its use not only as a model of cognition but also as a participant simulator for “in silico prototyping of experimental studies”, e.g., to advance automated cognitive science. Here, we review the core criteria for a participant simulator and assess how well Centaur meets them. Although Centaur demonstrates strong predictive accuracy, its generative behavior, a critical criterion for a participant simulator, systematically diverges from human data. This suggests that, while Centaur is a significant step toward predicting human behavior, it does not yet meet the standards of a reliable participant simulator or an accurate model of cognition. 模拟器在自然科学领域革新了科学实践。通过生成可靠逼近真实现象的数据,它们使科学家能够加速假设检验并优化实验设计。这一点或许在 AlphaFold 上体现得最为明显:作为一款荣获诺贝尔化学奖的模拟器,AlphaFold 能从氨基酸序列预测蛋白质结构,从而实现分子相互作用、药物靶点和蛋白功能的快速原型化。在行为科学中,一个可靠的参与者模拟器(即能够在认知任务中产生类人行为的系统)将代表同样具有变革性的重要进展。最近,Binz 等人提出了 Centaur,这是一种在来自 160 项实验的人类数据上微调的大型语言模型(LLM),他们建议将其不仅用作认知模型,还作为“用于实验研究的计算机模拟(in silico)原型化”的参与者模拟器,例如推进自动化认知科学。在此,我们回顾了参与者模拟器的核心标准并评估了 Centaur 在这些方面的表现。尽管 Centaur 展示了较强的预测准确性,但其生成行为(参与者模拟器的关键标准)系统性地偏离了人类数据。这表明,虽然 Centaur 在预测人类行为方面迈出了重要一步,但它尚未达到作为可靠的参与者模拟器或准确的认知模型的标准。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-11 12:05:18 UTC 发表:2025-08-11 12:05:18 UTC
#113 Autonomous Navigation of Cloud-Controlled Quadcopters in Confined Spaces Using Multi-Modal Perception and LLM-Driven High Semantic Reasoning #113 使用多模态感知和由 LLM 驱动的高语义推理在受限空间中实现云控四旋翼的自主导航
Authors: [Shoaib Ahmmad](https://arxiv.org/search/?searchtype=author&query=Shoaib Ahmmad), [Zubayer Ahmed Aditto](https://arxiv.org/search/?searchtype=author&query=Zubayer Ahmed Aditto), [Md Mehrab Hossain](https://arxiv.org/search/?searchtype=author&query=Md Mehrab Hossain), [Noushin Yeasmin](https://arxiv.org/search/?searchtype=author&query=Noushin Yeasmin), [Shorower Hossain](https://arxiv.org/search/?searchtype=author&query=Shorower Hossain) 作者:Shoaib Ahmmad、Zubayer Ahmed Aditto、Md Mehrab Hossain、Noushin Yeasmin、Shorower Hossain
This paper introduces an advanced AI-driven perception system for autonomous quadcopter navigation in GPS-denied indoor environments. The proposed framework leverages cloud computing to offload computationally intensive tasks and incorporates a custom-designed printed circuit board (PCB) for efficient sensor data acquisition, enabling robust navigation in confined spaces. The system integrates YOLOv11 for object detection, Depth Anything V2 for monocular depth estimation, a PCB equipped with Time-of-Flight (ToF) sensors and an Inertial Measurement Unit (IMU), and a cloud-based Large Language Model (LLM) for context-aware decision-making. A virtual safety envelope, enforced by calibrated sensor offsets, ensures collision avoidance, while a multithreaded architecture achieves low-latency processing. Enhanced spatial awareness is facilitated by 3D bounding box estimation with Kalman filtering. Experimental results in an indoor testbed demonstrate strong performance, with object detection achieving a mean Average Precision (mAP50) of 0.6, depth estimation Mean Absolute Error (MAE) of 7.2 cm, only 16 safety envelope breaches across 42 trials over approximately 11 minutes, and end-to-end system latency below 1 second. This cloud-supported, high-intelligence framework serves as an auxiliary perception and navigation system, complementing state-of-the-art drone autonomy for GPS-denied confined spaces. 本文介绍了一种用于无人机在无 GPS 的室内环境中自主导航的先进人工智能感知系统。所提出的框架利用云计算卸载计算密集型任务,并集成了定制设计的印制电路板(PCB)以实现高效的传感器数据采集,从而在受限空间内实现稳健的导航。该系统整合了用于目标检测的 YOLOv11、用于单目深度估计的 Depth Anything V2、配备飞行时间(ToF)传感器和惯性测量单元(IMU)的 PCB,以及用于上下文感知决策的基于云的 LLM。通过经校准的传感器偏移实施的虚拟安全包络确保了碰撞避免,而多线程架构则实现了低延迟处理。通过结合卡尔曼滤波的三维边界框估计,增强了空间感知能力。 在室内测试平台上的实验结果显示出强劲的性能:目标检测在 mAP50 指标上达到 0.6,深度估计的平均绝对误差(MAE)为 7.2 厘米,在大约 11 分钟内的 42 次试验中仅发生 16 次安全包线突破,端到端系统延迟低于 1 秒。该由云支持、高智能的框架作为辅助感知和导航系统,补充了在无 GPS 受限空间中最先进的无人机自主能力。
Subjects: Robotics, Artificial Intelligence, Computer Vision and Pattern Recognition, Systems and Control 主题:机器人学、人工智能、计算机视觉与模式识别、系统与控制
Publish: 2025-08-11 12:00:03 UTC 发表时间:2025-08-11 12:00:03 UTC
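The quadcopter abstract mentions Kalman filtering and a virtual safety envelope enforced by calibrated sensor offsets, but gives no implementation details. As a rough illustration only, the sketch below smooths a single range reading with a textbook scalar Kalman filter and checks it against a clearance margin. The class `ScalarKalman`, the function `breaches_envelope`, and all parameter names and values are hypothetical; the paper applies Kalman filtering to 3D bounding boxes, which this one-dimensional example does not reproduce.

```python
class ScalarKalman:
    """Textbook one-dimensional Kalman filter with a constant-state model."""

    def __init__(self, initial_estimate, initial_variance,
                 process_variance, measurement_variance):
        self.estimate = initial_estimate
        self.variance = initial_variance
        self.process_variance = process_variance
        self.measurement_variance = measurement_variance

    def update(self, measurement):
        # Predict: the state is assumed constant, so only uncertainty grows.
        self.variance += self.process_variance
        # Correct: blend prediction and measurement using the Kalman gain.
        gain = self.variance / (self.variance + self.measurement_variance)
        self.estimate += gain * (measurement - self.estimate)
        self.variance *= (1.0 - gain)
        return self.estimate


def breaches_envelope(distance_m, sensor_offset_m, margin_m):
    """True if the obstacle is closer to the airframe than the allowed margin.

    `sensor_offset_m` is the assumed calibrated distance from the sensor to
    the edge of the vehicle along the measurement axis.
    """
    return (distance_m - sensor_offset_m) < margin_m


kf = ScalarKalman(initial_estimate=0.0, initial_variance=1.0,
                  process_variance=0.0, measurement_variance=1.0)
first = kf.update(2.0)
second = kf.update(2.0)
```

With equal prior and measurement variance, the first update moves the estimate halfway to the measurement; later updates move it less as the filter grows more confident.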
#114 Selective Contrastive Learning for Weakly Supervised Affordance Grounding #114 选择性对比学习用于弱监督可供性定位
Authors: [WonJun Moon](https://arxiv.org/search/?searchtype=author&query=WonJun Moon), [Hyun Seok Seong](https://arxiv.org/search/?searchtype=author&query=Hyun Seok Seong), [Jae-Pil Heo](https://arxiv.org/search/?searchtype=author&query=Jae-Pil Heo) 作者:WonJun Moon、Hyun Seok Seong、Jae-Pil Heo
Facilitating an entity’s interaction with objects requires accurately identifying parts that afford specific actions. Weakly supervised affordance grounding (WSAG) seeks to imitate human learning from third-person demonstrations, where humans intuitively grasp functional parts without needing pixel-level annotations. To achieve this, grounding is typically learned using a shared classifier across images from different perspectives, along with distillation strategies incorporating part discovery process. However, since affordance-relevant parts are not always easily distinguishable, models primarily rely on classification, often focusing on common class-specific patterns that are unrelated to affordance. To address this limitation, we move beyond isolated part-level learning by introducing selective prototypical and pixel contrastive objectives that adaptively learn affordance-relevant cues at both the part and object levels, depending on the granularity of the available information. Initially, we find the action-associated objects in both egocentric (object-focused) and exocentric (third-person example) images by leveraging CLIP. Then, by cross-referencing the discovered objects of complementary views, we excavate the precise part-level affordance clues in each perspective. By consistently learning to distinguish affordance-relevant regions from affordance-irrelevant background context, our approach effectively shifts activation from irrelevant areas toward meaningful affordance cues. Experimental results demonstrate the effectiveness of our method. Codes are available at github.com/hynnsk/SelectiveCL. 
促进实体与物体的交互需要准确识别赋予特定动作的部位。弱监督可供性定位(WSAG)旨在模仿人类从第三人称示范中学习的过程,人类能直观把握功能性部位而无需像素级注释。为此,通常通过在不同视角图像间共享分类器来学习定位,并结合纳入部位发现过程的蒸馏策略。然而,由于与可供性相关的部位并不总是容易区分,模型主要依赖分类,往往关注那些与可供性无关的常见类别特定模式。为了解决这一限制,我们超越了孤立的部位级学习,引入了选择性原型和像素对比目标,根据可用信息的粒度自适应地在部位和物体层面学习与可供性相关的线索。首先,我们利用 CLIP 在第一视角(以物体为中心)和第三视角(第三人称示例)图像中找到与动作相关的物体。 然后,通过交叉参考互补视角中发现的对象,我们在每个视角中挖掘出精确的部件级可供性线索。通过持续学习将与可供性相关的区域与与可供性无关的背景上下文区分开来,我们的方法有效地将激活从无关区域转移到有意义的可供性线索上。实验结果证明了我们方法的有效性。代码可在 github.com/hynnsk/SelectiveCL 获得。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-11 11:49:37 UTC 发布:2025-08-11 11:49:37 UTC
#115 Towards Human-AI Collaboration System for the Detection of Invasive Ductal Carcinoma in Histopathology Images #115 面向人机协作的系统,用于在组织病理图像中检测浸润性导管癌
Authors: [Shuo Han](https://arxiv.org/search/?searchtype=author&query=Shuo Han), [Ahmed Karam Eldaly](https://arxiv.org/search/?searchtype=author&query=Ahmed Karam Eldaly), [Solomon Sunday Oyelere](https://arxiv.org/search/?searchtype=author&query=Solomon Sunday Oyelere) 作者:Shuo Han、Ahmed Karam Eldaly、Solomon Sunday Oyelere
Invasive ductal carcinoma (IDC) is the most prevalent form of breast cancer, and early, accurate diagnosis is critical to improving patient survival rates by guiding treatment decisions. Combining medical expertise with artificial intelligence (AI) holds significant promise for enhancing the precision and efficiency of IDC detection. In this work, we propose a human-in-the-loop (HITL) deep learning system designed to detect IDC in histopathology images. The system begins with an initial diagnosis provided by a high-performance EfficientNetV2S model, offering feedback from AI to the human expert. Medical professionals then review the AI-generated results, correct any misclassified images, and integrate the revised labels into the training dataset, forming a feedback loop from the human back to the AI. This iterative process refines the model’s performance over time. The EfficientNetV2S model itself achieves state-of-the-art performance compared to existing methods in the literature, with an overall accuracy of 93.65%. Incorporating the human-in-the-loop system further improves the model’s accuracy using four experimental groups with misclassified images. These results demonstrate the potential of this collaborative approach to enhance AI performance in diagnostic systems. This work contributes to advancing automated, efficient, and highly accurate methods for IDC detection through human-AI collaboration, offering a promising direction for future AI-assisted medical diagnostics. 浸润性导管癌(IDC)是最常见的乳腺癌类型,早期且准确的诊断对于通过指导治疗决策提高患者生存率至关重要。将医学专业知识与人工智能(AI)结合在一起,对于提高 IDC 检测的精确性和效率具有重要前景。在这项工作中,我们提出了一种人机协同(HITL)深度学习系统,用于检测病理切片图像中的 IDC。该系统以高性能的 EfficientNetV2S 模型提供的初步诊断为起点,向人类专家反馈 AI 结果。医学专业人员随后审查 AI 生成的结果,纠正任何被误分类的图像,并将修正后的标签整合到训练数据集中,形成从人类回馈到 AI 的反馈循环。该迭代过程随着时间的推移不断改进模型的性能。EfficientNetV2S 模型本身在与文献中现有方法的比较中达到了最先进的性能,总体准确率为 93.65%。引入人机协同系统后,在包含被误分类图像的四个实验组中进一步提升了模型的准确性。 这些结果展示了这种协作方法在提升诊断系统中人工智能性能方面的潜力。本研究通过人机协作为浸润性导管癌(IDC)检测推进了自动化、高效且高度准确的方法,为未来人工智能辅助的医学诊断提供了有前景的方向。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Human-Computer Interaction, Machine Learning 领域:计算机视觉与模式识别、人工智能、人机交互、机器学习
Publish: 2025-08-11 11:45:57 UTC 发布:2025-08-11 11:45:57 UTC
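The human-to-AI half of the feedback loop in the IDC abstract (experts correct misclassified images and the revised labels go back into the training set) can be sketched as a simple label-merging step. The function name `apply_expert_corrections`, the dictionary layout, and the toy labels are assumptions for illustration; retraining the EfficientNetV2S model itself is outside this sketch.

```python
def apply_expert_corrections(predictions, corrections):
    """Merge expert-reviewed labels into model predictions.

    predictions: dict mapping image id -> label predicted by the model
    corrections: dict mapping image id -> label assigned by the expert
    Returns the revised label dict and the number of labels that changed.
    """
    revised = dict(predictions)  # leave the caller's dict untouched
    changed = 0
    for image_id, expert_label in corrections.items():
        if image_id not in revised:
            raise KeyError(f"unknown image id: {image_id}")
        if revised[image_id] != expert_label:
            revised[image_id] = expert_label
            changed += 1
    return revised, changed


# Toy review round: 1 = IDC positive, 0 = IDC negative (made-up labels).
model_predictions = {"patch_a": 1, "patch_b": 0, "patch_c": 1}
expert_review = {"patch_b": 1, "patch_c": 1}
revised_labels, num_changed = apply_expert_corrections(model_predictions, expert_review)
```

In an iterative setting, `revised_labels` would be added to the training data before the next round of fine-tuning, and `num_changed` gives a rough signal of how much the model still disagrees with the reviewers.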
#116 Vertex Features for Neural Global Illumination #116 顶点特征用于神经全局光照
Authors: [Rui Su](https://arxiv.org/search/?searchtype=author&query=Rui Su), [Honghao Dong](https://arxiv.org/search/?searchtype=author&query=Honghao Dong), [Haojie Jin](https://arxiv.org/search/?searchtype=author&query=Haojie Jin), [Yisong Chen](https://arxiv.org/search/?searchtype=author&query=Yisong Chen), [Guoping Wang](https://arxiv.org/search/?searchtype=author&query=Guoping Wang), [Sheng Li](https://arxiv.org/search/?searchtype=author&query=Sheng Li) 作者:苏睿、董洪浩、金浩杰、陈怡松、王国平、李晟
Recent research on learnable neural representations has been widely adopted in the field of 3D scene reconstruction and neural rendering applications. However, traditional feature grid representations often suffer from substantial memory footprint, posing a significant bottleneck for modern parallel computing hardware. In this paper, we present neural vertex features, a generalized formulation of learnable representation for neural rendering tasks involving explicit mesh surfaces. Instead of uniformly distributing neural features throughout 3D space, our method stores learnable features directly at mesh vertices, leveraging the underlying geometry as a compact and structured representation for neural processing. This not only optimizes memory efficiency, but also improves feature representation by aligning compactly with the surface using task-specific geometric priors. We validate our neural representation across diverse neural rendering tasks, with a specific emphasis on neural radiosity. Experimental results demonstrate that our method reduces memory consumption to only one-fifth (or even less) of grid-based representations, while maintaining comparable rendering quality and lowering inference overhead. 近年来,可学习神经表示的研究已被广泛应用于三维场景重建和神经渲染等领域。然而,传统的特征网格表示往往占用大量内存,成为现代并行计算硬件的一大瓶颈。在本文中,我们提出了神经顶点特征,这是一种针对显式网格表面神经渲染任务的可学习表示的广义形式。与在三维空间中均匀分布神经特征不同,我们的方法将可学习特征直接存储在网格顶点处,利用底层几何作为一种紧凑且有结构的神经处理表示。这不仅优化了内存效率,还通过使用任务特定的几何先验使特征更紧密地与表面对齐,从而提升了特征表示能力。我们在多种神经渲染任务上验证了我们的神经表示,特别强调了在神经辐射度计算方面的应用。 实验结果表明,我们的方法将内存消耗降低到仅为基于网格表示的五分之一(甚至更低),同时保持可比较的渲染质量并降低推理开销。
Subjects: Graphics, Artificial Intelligence 学科:图形学,人工智能
Publish: 2025-08-11 11:10:19 UTC 发布:2025-08-11 11:10:19 UTC
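To make the idea of storing features at mesh vertices concrete, the sketch below looks up a feature vector at a point inside a triangle by barycentric interpolation of the three vertex features. Barycentric interpolation is an assumption chosen here because it is the usual way to interpolate per-vertex attributes on a mesh; the abstract does not state how the paper queries its features, and the name `interpolate_vertex_features` is hypothetical.

```python
def interpolate_vertex_features(vertex_features, triangle, barycentric):
    """Interpolate per-vertex feature vectors at a point on one triangle.

    vertex_features: list of feature vectors, one per mesh vertex
    triangle: (i, j, k) vertex indices of the triangle containing the point
    barycentric: (u, v, w) weights of the point, expected to sum to 1
    """
    i, j, k = triangle
    u, v, w = barycentric
    if abs(u + v + w - 1.0) > 1e-6:
        raise ValueError("barycentric weights must sum to 1")
    return [
        u * a + v * b + w * c
        for a, b, c in zip(vertex_features[i], vertex_features[j], vertex_features[k])
    ]


# Toy mesh with three vertices and 2-dimensional features (made-up values).
toy_features = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
toy_query = interpolate_vertex_features(toy_features, (0, 1, 2), (0.5, 0.25, 0.25))
```

Memory in such a scheme scales with the number of vertices rather than with the resolution of a volumetric grid, which matches the motivation given in the abstract.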
#117 Deep Space Weather Model: Long-Range Solar Flare Prediction from Multi-Wavelength Images #117 深空天气模型:基于多波长图像的远距离太阳耀斑预测
Authors: [Shunya Nagashima](https://arxiv.org/search/?searchtype=author&query=Shunya Nagashima), [Komei Sugiura](https://arxiv.org/search/?searchtype=author&query=Komei Sugiura) 作者:Shunya Nagashima,Komei Sugiura
Accurate, reliable solar flare prediction is crucial for mitigating potential disruptions to critical infrastructure, while predicting solar flares remains a significant challenge. Existing methods based on heuristic physical features often lack representation learning from solar images. On the other hand, end-to-end learning approaches struggle to model long-range temporal dependencies in solar images. In this study, we propose Deep Space Weather Model (Deep SWM), which is based on multiple deep state space models for handling both ten-channel solar images and long-range spatio-temporal dependencies. Deep SWM also features a sparse masked autoencoder, a novel pretraining strategy that employs a two-phase masking approach to preserve crucial regions such as sunspots while compressing spatial information. Furthermore, we built FlareBench, a new public benchmark for solar flare prediction covering a full 11-year solar activity cycle, to validate our method. Our method outperformed baseline methods and even human expert performance on standard metrics in terms of performance and reliability. The project page can be found at https://keio-smilab25.github.io/DeepSWM. 准确且可靠的太阳耀斑预测对于减轻对关键基础设施的潜在干扰至关重要,但预测太阳耀斑仍然是一个重大挑战。基于启发式物理特征的现有方法常常缺乏从太阳图像中进行表示学习的能力。另一方面,端到端学习方法在对太阳图像建模长程时间依赖性方面遇到困难。在本研究中,我们提出了深空天气模型(Deep Space Weather Model,Deep SWM),该模型基于多个深态空间模型,用于处理十通道太阳图像以及长程时空依赖性。Deep SWM 还具有稀疏掩码自编码器,这是一种新颖的预训练策略,采用两阶段掩码方法在压缩空间信息的同时保留黑子等关键区域。此外,我们构建了 FlareBench——一个涵盖完整 11 年太阳活动周期的新公开太阳耀斑预测基准,用以验证我们的方法。我们的模型在性能和可靠性方面,在标准指标上优于基线方法,甚至超过了人类专家的表现。项目页面为 https://keio-smilab25.github.io/DeepSWM。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-11 11:06:56 UTC 发布:2025-08-11 11:06:56 协调世界时 (UTC)
#118 DETACH: Cross-domain Learning for Long-Horizon Tasks via Mixture of Disentangled Experts #118 DETACH:通过解缠专家混合实现跨领域长时程任务学习
Authors: [Yutong Shen](https://arxiv.org/search/?searchtype=author&query=Yutong Shen), [Hangxu Liu](https://arxiv.org/search/?searchtype=author&query=Hangxu Liu), [Penghui Liu](https://arxiv.org/search/?searchtype=author&query=Penghui Liu), [Ruizhe Xia](https://arxiv.org/search/?searchtype=author&query=Ruizhe Xia), [Tianyi Yao](https://arxiv.org/search/?searchtype=author&query=Tianyi Yao), [Yitong Sun](https://arxiv.org/search/?searchtype=author&query=Yitong Sun), [Tongtong Feng](https://arxiv.org/search/?searchtype=author&query=Tongtong Feng) 作者:沈雨彤、刘航旭、刘鹏辉、夏锐哲、姚天一、孙一彤、冯彤彤
Long-Horizon (LH) tasks in Human-Scene Interaction (HSI) are complex multi-step tasks that require continuous planning, sequential decision-making, and extended execution across domains to achieve the final goal. However, existing methods rely heavily on skill chaining by concatenating pre-trained subtasks, with environment observations and self-state tightly coupled. As a result, they cannot generalize to new combinations of environments and skills and fail to complete varied LH tasks across domains. To solve this problem, this paper presents DETACH, a cross-domain learning framework for LH tasks via biologically inspired dual-stream disentanglement. Inspired by the brain’s “where-what” dual pathway mechanism, DETACH comprises two core modules: i) an environment learning module for spatial understanding, which captures object functions, spatial relationships, and scene semantics, achieving cross-domain transfer through complete environment-self disentanglement; ii) a skill learning module for task execution, which processes self-state information including joint degrees of freedom and motor patterns, enabling cross-skill transfer through independent motor pattern encoding. We conducted extensive experiments on various LH tasks in HSI scenes. Compared with existing methods, DETACH achieves an average subtask success rate improvement of 23% and an average execution efficiency improvement of 29%. 在人-场景交互(HSI)中的长时程(LH)任务是复杂的多步骤任务,需要持续的规划、序列决策和跨域的长时间执行以实现最终目标。然而,现有方法严重依赖通过拼接预训练子任务来串联技能,且环境观测与自身状态紧密耦合,缺乏推广到新的环境与技能组合的能力,无法在各个领域完成多种 LH 任务。为了解决这一问题,本文提出了 DETACH,一种通过生物学启发的双流解缠(disentanglement)进行跨域 LH 任务学习的框架。受大脑“何处-何物”双通路机制启发,DETACH 包含两个核心模块:i)环境学习模块用于空间理解,捕捉物体功能、空间关系和场景语义,通过完整的环境-自身解缠实现跨域迁移;ii)技能学习模块用于任务执行,处理包括关节自由度和运动模式在内的自身状态信息,通过独立的运动模式编码实现跨技能迁移。我们在人-场景交互(HSI)场景中的各种 LH 任务上进行了广泛实验。与现有方法相比,DETACH 在各子任务的平均成功率上提高了 23%,在平均执行效率上提高了 29%。
Subjects: Robotics, Artificial Intelligence 主题:机器人学,人工智能
Publish: 2025-08-11 10:54:28 UTC 发布:2025-08-11 10:54:28 世界协调时 (UTC)
#119 Auditory Intelligence: Understanding the World Through Sound #119 听觉智能:通过声音理解世界
Author: [Hyeonuk Nam](https://arxiv.org/search/?searchtype=author&query=Hyeonuk Nam) 作者:Hyeonuk Nam
Recent progress in auditory intelligence has yielded high-performing systems for sound event detection (SED), acoustic scene classification (ASC), automated audio captioning (AAC), and audio question answering (AQA). Yet these tasks remain largely constrained to surface-level recognition: capturing what happened but not why, what it implies, or how it unfolds in context. I propose a conceptual reframing of auditory intelligence as a layered, situated process that encompasses perception, reasoning, and interaction. To instantiate this view, I introduce four cognitively inspired task paradigms (ASPIRE, SODA, AUX, and AUGMENT) that structure auditory understanding across time-frequency pattern captioning, hierarchical event/scene description, causal explanation, and goal-driven interpretation, respectively. Together, these paradigms provide a roadmap toward more generalizable, explainable, and human-aligned auditory intelligence, and are intended to catalyze a broader discussion of what it means for machines to understand sound. 近年来听觉智能方面的进展催生了在声事件检测(SED)、声学场景分类(ASC)、自动音频描述(AAC)和音频问答(AQA)等任务上表现优异的系统。然而,这些任务在很大程度上仍局限于表层识别:捕捉发生了什么,但未能说明为何发生、意味着什么或在情境中如何展开。我提出将听觉智能概念性地重新构建为一个分层的、情境化的过程,涵盖感知、推理与交互。为实现这一视角,我引入了四种受认知启发的任务范式(ASPIRE、SODA、AUX 和 AUGMENT),它们分别将听觉理解结构化为时频模式描述、分层事件/场景描写、因果解释与目标驱动的解读。总体而言,这些范式为更具可泛化性、可解释性并与人类更契合的听觉智能提供了路线图,旨在激发更广泛的讨论:机器理解声音究竟意味着什么。
Subjects: Audio and Speech Processing, Artificial Intelligence, Sound 主题:音频与语音处理、人工智能、声音
Publish: 2025-08-11 10:25:58 UTC 发布:2025-08-11 10:25:58 UTC
#120 Architectural Co-Design for Zero-Shot Anomaly Detection: Decoupling Representation and Dynamically Fusing Features in CLIP #120 架构协同设计用于零样本异常检测:在 CLIP 中解耦表示并动态融合特征
Authors: [Ke Ma](https://arxiv.org/search/?searchtype=author&query=Ke Ma), [Jun Long](https://arxiv.org/search/?searchtype=author&query=Jun Long), [Hongxiao Fei](https://arxiv.org/search/?searchtype=author&query=Hongxiao Fei), [Liujie Hua](https://arxiv.org/search/?searchtype=author&query=Liujie Hua), [Yueyi Luo](https://arxiv.org/search/?searchtype=author&query=Yueyi Luo) 作者:Ke Ma、Jun Long、Hongxiao Fei、Liujie Hua、Yueyi Luo
Pre-trained Vision-Language Models (VLMs) face a significant adaptation gap when applied to Zero-Shot Anomaly Detection (ZSAD), stemming from their lack of local inductive biases for dense prediction and their reliance on inflexible feature fusion paradigms. We address these limitations through an Architectural Co-Design framework that jointly refines feature representation and cross-modal fusion. Our method integrates a parameter-efficient Convolutional Low-Rank Adaptation (Conv-LoRA) adapter to inject local inductive biases for fine-grained representation, and introduces a Dynamic Fusion Gateway (DFG) that leverages visual context to adaptively modulate text prompts, enabling a powerful bidirectional fusion. Extensive experiments on diverse industrial and medical benchmarks demonstrate superior accuracy and robustness, validating that this synergistic co-design is critical for robustly adapting foundation models to dense perception tasks. 预训练的视觉-语言模型(VLMs)在应用于零样本异常检测(ZSAD)时面临显著的适配差距,这源于它们缺乏用于密集预测的局部归纳偏置以及依赖僵化的特征融合范式。我们通过一种架构协同设计框架来解决这些局限,该框架联合优化特征表示和跨模态融合。我们的方法集成了参数高效的卷积低秩适配器(Conv-LoRA),以注入用于细粒度表示的局部归纳偏置,并引入了动态融合闸道(DFG),该闸道利用视觉上下文自适应地调节文本提示,从而实现强大的双向融合。在多种工业和医疗基准上的大量实验表明了卓越的精度和鲁棒性,验证了这种协同共设计对于将基础模型稳健地适配到密集感知任务的关键性。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning 主题:计算机视觉与模式识别,人工智能,机器学习
Publish: 2025-08-11 10:03:45 UTC 发表:2025-08-11 10:03:45 UTC
#121 MIND: A Noise-Adaptive Denoising Framework for Medical Images Integrating Multi-Scale Transformer #121 MIND:一种用于医学影像的噪声自适应去噪框架,整合了多尺度 Transformer
Authors: [Tao Tang](https://arxiv.org/search/?searchtype=author&query=Tao Tang), [Chengxu Yang](https://arxiv.org/search/?searchtype=author&query=Chengxu Yang) 作者:唐涛,杨成旭
Because medical images play a core role in disease diagnosis, their quality directly affects the accuracy of clinical judgment. However, due to factors such as low-dose scanning, equipment limitations and imaging artifacts, medical images are often accompanied by non-uniform noise interference, which seriously affects structure recognition and lesion detection. This paper proposes a medical image adaptive denoising model (MI-ND) that integrates multi-scale convolutional and Transformer architectures, introduces a noise level estimator (NLE) and a noise adaptive attention module (NAAB), and realizes channel-spatial attention regulation and cross-modal feature fusion driven by noise perception. Systematic testing is carried out on multimodal public datasets. Experiments show that this method significantly outperforms the comparative methods in image quality indicators such as PSNR, SSIM, and LPIPS, and improves the F1 score and ROC-AUC in downstream diagnostic tasks, showing strong practical value and potential for wider adoption. The model has outstanding benefits in structural recovery, diagnostic sensitivity, and cross-modal robustness, and provides an effective solution for medical image enhancement and AI-assisted diagnosis and treatment. 医疗影像在疾病诊断中的核心作用使其质量直接影响临床判断的准确性。然而,由于低剂量扫描、设备限制和成像伪影等因素,医疗影像常伴随不均匀噪声干扰,严重影响结构识别和病灶检测。本文提出了一种融合多尺度卷积与 Transformer 架构的医疗影像自适应去噪模型(MI-ND),引入了噪声水平估计器(NLE)和噪声自适应注意力模块(NAAB),并实现了由噪声感知驱动的通道-空间注意力调节与跨模态特征融合。在多模态公共数据集上进行了系统测试。实验证明,该方法在 PSNR、SSIM、LPIPS 等图像质量指标上显著优于对比方法,并在下游诊断任务中提升了 F1 分数和 ROC-AUC,显示出较强的实用价值和推广潜力。该模型在结构恢复、诊断敏感性和跨模态鲁棒性方面具有显著优势,为医学图像增强及辅助诊断治疗提供了有效解决方案。
Subjects: Image and Video Processing, Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning, Multimedia 学科:图像与视频处理、人工智能、计算机视觉与模式识别、机器学习、多媒体
Publish: 2025-08-11 10:00:51 UTC 发布时间:2025-08-11 10:00:51 UTC
#122 PCA-Guided Autoencoding for Structured Dimensionality Reduction in Active Infrared Thermography #122 PCA 引导自编码用于主动红外热成像中的结构化降维
Authors: [Mohammed Salah](https://arxiv.org/search/?searchtype=author&query=Mohammed Salah), [Numan Saeed](https://arxiv.org/search/?searchtype=author&query=Numan Saeed), [Davor Svetinovic](https://arxiv.org/search/?searchtype=author&query=Davor Svetinovic), [Stefano Sfarra](https://arxiv.org/search/?searchtype=author&query=Stefano Sfarra), [Mohammed Omar](https://arxiv.org/search/?searchtype=author&query=Mohammed Omar), [Yusra Abdulrahman](https://arxiv.org/search/?searchtype=author&query=Yusra Abdulrahman) 作者:Mohammed Salah、Numan Saeed、Davor Svetinovic、Stefano Sfarra、Mohammed Omar、Yusra Abdulrahman
Active Infrared thermography (AIRT) is a widely adopted non-destructive testing (NDT) technique for detecting subsurface anomalies in industrial components. Due to the high dimensionality of AIRT data, current approaches employ non-linear autoencoders (AEs) for dimensionality reduction. However, the latent space learned by AIRT AEs lacks structure, limiting their effectiveness in downstream defect characterization tasks. To address this limitation, this paper proposes a principal component analysis guided (PCA-guided) autoencoding framework for structured dimensionality reduction to capture intricate, non-linear features in thermographic signals while enforcing a structured latent space. A novel loss function, PCA distillation loss, is introduced to guide AIRT AEs to align the latent representation with structured PCA components while capturing the intricate, non-linear patterns in thermographic signals. To evaluate the utility of the learned, structured latent space, we propose a neural network-based evaluation metric that assesses its suitability for defect characterization. Experimental results show that the proposed PCA-guided AE outperforms state-of-the-art dimensionality reduction methods on PVC, CFRP, and PLA samples in terms of contrast, signal-to-noise ratio (SNR), and neural network-based metrics. 有源红外热成像(AIRT)是一种广泛采用的无损检测(NDT)技术,用于检测工业部件中的次表面异常。由于 AIRT 数据的高维性,当前方法采用非线性自编码器(AE)进行降维。然而,AIRT 自编码器学得的潜在空间缺乏结构,限制了其在下游缺陷表征任务中的有效性。为了解决这一限制,本文提出了一种主成分分析引导(PCA 引导)的自编码框架,用于结构化降维,以在强制潜在空间具有结构化的同时,捕捉热成像信号中复杂的非线性特征。提出了一种新颖的损失函数——PCA 蒸馏损失,用以引导 AIRT 自编码器使潜在表征与结构化的 PCA 分量对齐,同时捕捉热成像信号中的复杂非线性模式。为评估所学结构化潜在空间的实用性,我们提出了一种基于神经网络的评估指标,用于评估其在缺陷表征方面的适用性。 实验结果表明,所提出的基于主成分分析引导的自编码器在对 PVC、CFRP 和 PLA 样本的对比度、信噪比(SNR)以及基于神经网络的指标上均优于最先进的降维方法。
Subjects: Image and Video Processing, Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning 主题:图像与视频处理、人工智能、计算机视觉与模式识别、机器学习
Publish: 2025-08-11 08:58:13 UTC 发布日期:2025-08-11 08:58:13 UTC
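The abstract of #122 does not give the formula for its PCA distillation loss. The sketch below shows one plausible shape for such an objective, assuming it is a reconstruction error plus a weighted penalty that pulls the latent code toward precomputed PCA scores. The function names, the `weight` parameter, and the numbers are all hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def pca_guided_loss(signal, reconstruction, latent, pca_scores, weight=1.0):
    """Hypothetical PCA-guided objective (an assumption, not the paper's loss).

    First term: how well the autoencoder reconstructs the thermographic signal.
    Second term: how far the latent code is from the PCA scores of that signal.
    """
    return mse(signal, reconstruction) + weight * mse(latent, pca_scores)

# Toy values: reconstruction error 1/3, latent-vs-PCA error 1/2.
loss = pca_guided_loss(
    signal=[1.0, 2.0, 3.0],
    reconstruction=[1.0, 2.0, 2.0],
    latent=[0.5, -0.5],
    pca_scores=[0.5, 0.5],
    weight=0.5,
)
print(round(loss, 4))  # 0.5833
```

Under this assumption, `weight` trades reconstruction fidelity against how strictly the latent space follows the PCA structure.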
#123 Pareto Multi-Objective Alignment for Language Models #123 帕累托多目标对齐用于语言模型
Authors: [Qiang He](https://arxiv.org/search/?searchtype=author&query=Qiang He), [Setareh Maghsudi](https://arxiv.org/search/?searchtype=author&query=Setareh Maghsudi) 作者:何强,Setareh Maghsudi
Large language models (LLMs) are increasingly deployed in real-world applications that require careful balancing of multiple, often conflicting, objectives, such as informativeness versus conciseness, or helpfulness versus creativity. However, current alignment methods, primarily based on RLHF, optimize LLMs toward a single reward function, resulting in rigid behavior that fails to capture the complexity and diversity of human preferences. This limitation hinders the adaptability of LLMs to practical scenarios, making multi-objective alignment (MOA) a critical yet underexplored area. To bridge this gap, we propose Pareto Multi-Objective Alignment (PAMA), a principled and computationally efficient algorithm designed explicitly for MOA in LLMs. In contrast to computationally prohibitive multi-objective optimization (MOO) methods, PAMA transforms multi-objective RLHF into a convex optimization with a closed-form solution, significantly enhancing scalability. Traditional MOO approaches suffer from prohibitive O(n^2d) complexity, where d represents the number of model parameters, typically in the billions for LLMs, rendering direct optimization infeasible. PAMA reduces this complexity to O(n) where n is the number of objectives, enabling optimization to be completed within milliseconds. We provide theoretical guarantees that PAMA converges to a Pareto stationary point, where no objective can be improved without degrading at least one other. Extensive experiments across language models ranging from 125M to 7B parameters demonstrate PAMA’s robust and effective MOA capabilities, aligning with its theoretical advantages. PAMA provides a highly efficient solution to the MOA problem that was previously considered intractable, offering a practical and theoretically grounded approach to aligning LLMs with diverse human values, paving the way for versatile and adaptable real-world AI deployments. 
大型语言模型(LLMs)正越来越多地被部署到需要在多种常常冲突的目标之间谨慎平衡的实际应用中,例如信息量与简洁性之间,或有帮助性与创造性之间。然而,当前的对齐方法主要基于 RLHF,通常将 LLMs 优化为单一的奖励函数,导致行为僵化,无法反映人类偏好的复杂性和多样性。这一局限性阻碍了 LLMs 在实际场景中的适应性,使得多目标对齐(MOA)成为一个关键但尚未充分研究的领域。为弥补这一空白,我们提出了帕累托多目标对齐(PAMA),这是一种为 LLMs 的多目标对齐专门设计的、具有原理性且计算高效的算法。与计算代价高昂的多目标优化(MOO)方法不同,PAMA 将多目标 RLHF 转化为具有解析解的凸优化问题,从而显著提升了可扩展性。传统的 MOO 方法存在不可接受的 O(n^2d) 复杂度,其中 d 表示模型参数的数量——对于 LLMs 通常以十亿计——使得直接优化变得不可行。PAMA 将这种复杂度降至 O(n),其中 n 是目标数量,使得优化可在毫秒级内完成。我们提供了理论保证,表明 PAMA 收敛到帕累托驻点,即在不牺牲至少另一个目标的情况下,无法改进任何目标。在从 1.25 亿到 70 亿参数的语言模型上的大量实验表明,PAMA 的多目标对齐(MOA)能力稳健且有效,与其理论优势一致。PAMA 为此前被认为难以处理的 MOA 问题提供了一个高效的解决方案,提出了一种实用且有理论依据的方法来使 LLMs 与多样化的人类价值观保持一致,为在现实世界中实现多功能且可适应的 AI 部署铺平了道路。
Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题:机器学习、人工智能、计算与语言
Publish: 2025-08-11 08:54:14 UTC 发布日期:2025-08-11 08:54:14 UTC
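The abstract of #123 relies on Pareto stationarity and a closed-form convex solution but does not spell out PAMA's formula. As a hedged illustration of the concept only, the sketch below uses the classical two-objective min-norm closed form from multiple-gradient descent (MGDA), which is a different and older technique, not PAMA. It finds the weight α in [0, 1] that minimizes ‖α·g1 + (1−α)·g2‖. If the resulting vector is zero, no direction improves both objectives, which is the Pareto-stationary condition. The function names and gradient values are illustrative.

```python
def min_norm_weight(g1, g2):
    """Return alpha in [0, 1] minimizing ||alpha*g1 + (1-alpha)*g2||.

    Classical two-objective closed form (MGDA-style), shown for intuition only.
    """
    diff = [a - b for a, b in zip(g1, g2)]
    denom = sum(x * x for x in diff)
    if denom == 0.0:
        return 0.5  # identical gradients: every weight gives the same vector
    # Unconstrained minimizer of the quadratic in alpha, then clipped to [0, 1].
    alpha = sum((b - a) * b for a, b in zip(g1, g2)) / denom
    return min(1.0, max(0.0, alpha))

def common_direction(g1, g2):
    """Min-norm convex combination of the two objective gradients."""
    alpha = min_norm_weight(g1, g2)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(g1, g2)]

# Orthogonal gradients: a shared descent direction exists.
print(common_direction([1.0, 0.0], [0.0, 1.0]))   # [0.5, 0.5]
# Directly opposed gradients: the min-norm vector is zero (Pareto stationary).
print(common_direction([1.0, 0.0], [-1.0, 0.0]))  # [0.0, 0.0]
```

This closed form only covers two objectives and still requires full gradient vectors of size d, which is the kind of cost the abstract says PAMA avoids.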
#124 UniSVG: A Unified Dataset for Vector Graphic Understanding and Generation with Multimodal Large Language Models #124 UniSVG:用于多模态大语言模型的矢量图理解与生成的统一数据集
Authors: [Jinke Li](https://arxiv.org/search/?searchtype=author&query=Jinke Li), [Jiarui Yu](https://arxiv.org/search/?searchtype=author&query=Jiarui Yu), [Chenxing Wei](https://arxiv.org/search/?searchtype=author&query=Chenxing Wei), [Hande Dong](https://arxiv.org/search/?searchtype=author&query=Hande Dong), [Qiang Lin](https://arxiv.org/search/?searchtype=author&query=Qiang Lin), [Liangjing Yang](https://arxiv.org/search/?searchtype=author&query=Liangjing Yang), [Zhicai Wang](https://arxiv.org/search/?searchtype=author&query=Zhicai Wang), [Yanbin Hao](https://arxiv.org/search/?searchtype=author&query=Yanbin Hao) 作者:Jinke Li、Jiarui Yu、Chenxing Wei、Hande Dong、Qiang Lin、Liangjing Yang、Zhicai Wang、Yanbin Hao
Unlike bitmap images, scalable vector graphics (SVG) maintain quality when scaled and, represented as SVG code, are frequently employed in computer vision and artistic design. In this era of proliferating AI-powered systems, enabling AI to understand and generate SVG has become increasingly urgent. However, AI-driven SVG understanding and generation (U&G) remain significant challenges. SVG code, equivalent to a set of curves and lines controlled by floating-point parameters, demands high precision in SVG U&G. Besides, SVG generation operates under diverse conditional constraints, including textual prompts and visual references, which requires powerful multi-modal processing for condition-to-SVG transformation. Recently, rapidly growing Multi-modal Large Language Models (MLLMs) have demonstrated capabilities to process multi-modal inputs and generate complex vector controlling parameters, suggesting the potential to address SVG U&G tasks within a unified model. To unlock MLLMs' capabilities in the SVG area, we propose an SVG-centric dataset called UniSVG, comprising 525k data items, tailored for MLLM training and evaluation. To the best of our knowledge, it is the first comprehensive dataset designed for unified SVG generation (from textual prompts and images) and SVG understanding (color, category, usage, etc.). As expected, learning on the proposed dataset boosts open-source MLLMs' performance on various SVG U&G tasks, surpassing SOTA closed-source MLLMs like GPT-4V. We release the dataset, benchmark, weights, code, and experiment details at https://ryanlijinke.github.io/.
与位图图像不同,可缩放矢量图形(SVG)在缩放时保持质量,常用于计算机视觉和艺术设计中以 SVG 代码的形式表示。在这个 AI 驱动系统日益增多的时代,使 AI 理解和生成 SVG 变得愈发迫切。然而,AI 驱动的 SVG 理解与生成(U&G)依然是重大挑战。SVG 代码等同于由浮点参数控制的一组曲线与线段,这要求 SVG U&G 拥有高精度。此外,SVG 生成在多样的条件约束下运行,包括文本提示与视觉参考,这需要强大的多模态处理能力以实现从条件到 SVG 的转换。近期,多模态大型语言模型(MLLMs)的快速发展展示了处理多模态输入并生成复杂向量控制参数的能力,表明在统一模型中解决 SVG U&G 任务的潜力。为解锁 MLLM 在 SVG 领域的能力,我们提出了一个以 SVG 为核心的数据集 UniSVG,包含 52.5 万条数据项,专为 MLLM 的训练与评估而设计。 据我们所知,这是第一个为统一的 SVG 生成(基于文本提示和图像)和 SVG 理解(颜色、类别、用途等)设计的综合数据集。如预期的那样,在所提出的数据集上学习可以提升开源多模态大模型(MLLMs)在各种 SVG 理解与生成任务上的性能,超过像 GPT-4V 这样的闭源最先进大模型。我们在 https://ryanlijinke.github.io/ 上发布了数据集、基准、权重、代码和实验细节。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-11 08:50:14 UTC 发布:2025-08-11 08:50:14 UTC
#125 Sparse Probabilistic Graph Circuits #125 稀疏概率图电路(Sparse Probabilistic Graph Circuits)
Authors: [Martin Rektoris](https://arxiv.org/search/?searchtype=author&query=Martin Rektoris), [Milan Papež](https://arxiv.org/search/?searchtype=author&query=Milan Papež), [Václav Šmídl](https://arxiv.org/search/?searchtype=author&query=Václav Šmídl), [Tomáš Pevný](https://arxiv.org/search/?searchtype=author&query=Tomáš Pevný) 作者:Martin Rektoris,Milan Papež,Václav Šmídl,Tomáš Pevný
Deep generative models (DGMs) for graphs achieve impressively high expressive power thanks to very efficient and scalable neural networks. However, these networks contain non-linearities that prevent analytical computation of many standard probabilistic inference queries, i.e., these DGMs are considered intractable. While recently proposed Probabilistic Graph Circuits (PGCs) address this issue by enabling tractable probabilistic inference, they operate on dense graph representations with O(n^2) complexity for graphs with n nodes and m edges. To address this scalability issue, we introduce Sparse PGCs (SPGCs), a new class of tractable generative models that operate directly on sparse graph representations, reducing the complexity to O(n+m), which is particularly beneficial for m ≪ n^2. In the context of de novo drug design, we empirically demonstrate that SPGCs retain exact inference capabilities, improve memory efficiency and inference speed, and match the performance of intractable DGMs in key metrics. 用于图的深度生成模型(DGMs)由于采用了非常高效且可扩展的神经网络,表现出令人印象深刻的表示能力。然而,这些网络包含非线性,使得许多标准概率推理查询无法解析计算,即这些 DGM 被视为“不具可解性”。最近提出的概率图电路(PGC)通过实现可解的概率推理解决了这一问题,但它们在稠密图表示上运行,对于具有 n 个节点和 m 条边的图,其复杂度为 O(n^2)。为了解决这一可扩展性问题,我们引入了稀疏 PGC(Sparse PGCs,SPGC),这是一类新的可解生成模型,直接在稀疏图表示上运行,将复杂度降至 O(n+m),这对 m ≪ n^2 尤为有利。在新药设计(de novo drug design)背景下,我们的实证结果表明,SPGC 保留了精确推理能力,提高了内存效率和推理速度,并在关键指标上匹配了不可解的 DGM 的性能。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-11 08:47:27 UTC 发布:2025-08-11 08:47:27 协调世界时
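The complexity gap described in #125 comes down to how a graph is stored. The sketch below compares the number of stored entries for a dense adjacency matrix (n² entries) against a node list plus an edge list (n + m entries). It says nothing about how SPGCs themselves are built. The helper names and the six-node ring, a stand-in for a small molecular skeleton, are illustrative assumptions.

```python
def dense_size(num_nodes):
    """Entries stored by an n x n adjacency matrix."""
    return num_nodes * num_nodes

def sparse_size(num_nodes, edges):
    """Entries stored by a node list plus an edge list."""
    return num_nodes + len(edges)

# Hypothetical 6-node ring: n = 6 nodes, m = 6 edges.
n = 6
ring_edges = [(i, (i + 1) % n) for i in range(n)]

print(dense_size(n), sparse_size(n, ring_edges))  # 36 12
```

Molecular graphs tend to have m on the order of n, so the sparse form grows linearly while the dense form grows quadratically. This is the m ≪ n^2 regime the abstract highlights.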
#126 Learning to Align, Aligning to Learn: A Unified Approach for Self-Optimized Alignment #126 学习对齐,借对齐来学习:一种用于自我优化对齐的统一方法
Authors: [Haowen Wang](https://arxiv.org/search/?searchtype=author&query=Haowen Wang), [Yun Yue](https://arxiv.org/search/?searchtype=author&query=Yun Yue), [Zhiling Ye](https://arxiv.org/search/?searchtype=author&query=Zhiling Ye), [Shuowen Zhang](https://arxiv.org/search/?searchtype=author&query=Shuowen Zhang), [Lei Fan](https://arxiv.org/search/?searchtype=author&query=Lei Fan), [Jiaxin Liang](https://arxiv.org/search/?searchtype=author&query=Jiaxin Liang), [Jiadi Jiang](https://arxiv.org/search/?searchtype=author&query=Jiadi Jiang), [Cheng Wei](https://arxiv.org/search/?searchtype=author&query=Cheng Wei), [Jingyuan Deng](https://arxiv.org/search/?searchtype=author&query=Jingyuan Deng), [Xudong Han](https://arxiv.org/search/?searchtype=author&query=Xudong Han), [Ji Li](https://arxiv.org/search/?searchtype=author&query=Ji Li), [Chunxiao Guo](https://arxiv.org/search/?searchtype=author&query=Chunxiao Guo), [Peng Wei](https://arxiv.org/search/?searchtype=author&query=Peng Wei), [Jian Wang](https://arxiv.org/search/?searchtype=author&query=Jian Wang), [Jinjie Gu](https://arxiv.org/search/?searchtype=author&query=Jinjie Gu) 作者:王浩文、岳云、叶志灵、张硕文、范磊、梁佳昕、姜佳迪、魏成、邓景远、韩旭东、李骥、郭春晓、魏鹏、王剑、顾金杰
Alignment methodologies have emerged as a critical pathway for enhancing language model alignment capabilities. While SFT (supervised fine-tuning) accelerates convergence through direct token-level loss intervention, its efficacy is constrained by offline policy trajectory. In contrast, RL (reinforcement learning) facilitates exploratory policy optimization, but suffers from low sample efficiency and stringent dependency on high-quality base models. To address these dual challenges, we propose GRAO (Group Relative Alignment Optimization), a unified framework that synergizes the respective strengths of SFT and RL through three key innovations: 1) A multi-sample generation strategy enabling comparative quality assessment via reward feedback; 2) A novel Group Direct Alignment Loss formulation leveraging intra-group relative advantage weighting; 3) Reference-aware parameter updates guided by pairwise preference dynamics. Our theoretical analysis establishes GRAO’s convergence guarantees and sample efficiency advantages over conventional approaches. Comprehensive evaluations across complex human alignment tasks demonstrate GRAO’s superior performance, achieving 57.70%, 17.65%, 7.95%, and 5.18% relative improvements over SFT, DPO, PPO, and GRPO baselines, respectively. This work provides both a theoretically grounded alignment framework and empirical evidence for efficient capability evolution in language models. 对齐方法已成为提升语言模型对齐能力的关键路径。虽然 SFT(监督微调)通过直接的逐标记损失干预加速了收敛,但其效能受限于离线策略轨迹。相比之下,RL(强化学习)促进了探索性策略优化,但存在样本效率低和对高质量基础模型高度依赖的问题。为了解决这两方面的挑战,我们提出了 GRAO(群体相对对齐优化),这是一个通过三项关键创新将 SFT 和 RL 各自优势协同统一的框架:1)一种多样本生成策略,能够通过奖励反馈对比性地评估质量;2)一种新颖的群体直接对齐损失表述,利用组内相对优势加权;3)由成对偏好动态指导的参考感知参数更新。我们的理论分析确立了 GRAO 的收敛性保证及其相对于传统方法的样本效率优势。在复杂的人类对齐任务上的全面评估表明,GRAO 表现更为优越,相较于 SFT、DPO、PPO 和 GRPO 基线,分别实现了 57.70%、17.65%、7.95% 和 5.18% 的相对提升。该工作既提供了一个有理论依据的对齐框架,也为语言模型中能力高效演化提供了实证证据。
Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题:机器学习、人工智能、计算与语言
Publish: 2025-08-11 08:28:47 UTC 发布:2025-08-11 08:28:47 UTC
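The abstract of #126 mentions "intra-group relative advantage weighting" without giving the formula. As a rough illustration of the idea, the sketch below uses the GRPO-style normalization: each sampled response's reward is centered on the group mean and divided by the group standard deviation. Whether GRAO weights samples exactly this way is an assumption. The function name and reward values are hypothetical.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one group of sampled responses.

    GRPO-style normalization, used here only to illustrate the general idea
    of relative advantages; it is not confirmed to be GRAO's weighting.
    """
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    # eps avoids division by zero when all rewards in the group are equal.
    return [(r - mean) / (std + eps) for r in rewards]

# Three hypothetical responses to the same prompt, scored by a reward model.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
print([round(a, 3) for a in advantages])  # [-1.225, 0.0, 1.225]
```

Responses better than the group average get positive weight and worse ones get negative weight, so the update depends on relative quality within the group rather than on the absolute reward scale.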
#127 Chimera: Harnessing Multi-Agent LLMs for Automatic Insider Threat Simulation #127 Chimera:利用多智能体 LLMs 进行自动化内部威胁模拟
Authors: [Jiongchi Yu](https://arxiv.org/search/?searchtype=author&query=Jiongchi Yu), [Xiaofei Xie](https://arxiv.org/search/?searchtype=author&query=Xiaofei Xie), [Qiang Hu](https://arxiv.org/search/?searchtype=author&query=Qiang Hu), [Yuhan Ma](https://arxiv.org/search/?searchtype=author&query=Yuhan Ma), [Ziming Zhao](https://arxiv.org/search/?searchtype=author&query=Ziming Zhao) 作者:余炯池、谢晓飞、胡强、马雨涵、赵子明
Insider threats, which can lead to severe losses, remain a major security concern. While machine learning-based insider threat detection (ITD) methods have shown promising results, their progress is hindered by the scarcity of high-quality data. Enterprise data is sensitive and rarely accessible, while publicly available datasets, when limited in scale due to cost, lack sufficient real-world coverage; and when purely synthetic, they fail to capture rich semantics and realistic user behavior. To address this, we propose Chimera, the first large language model (LLM)-based multi-agent framework that automatically simulates both benign and malicious insider activities and collects diverse logs across diverse enterprise environments. Chimera models each employee with agents that have role-specific behavior and integrates modules for group meetings, pairwise interactions, and autonomous scheduling, capturing realistic organizational dynamics. It incorporates 15 types of insider attacks (e.g., IP theft, system sabotage) and has been deployed to simulate activities in three sensitive domains: technology company, finance corporation, and medical institution, producing a new dataset, ChimeraLog. We assess ChimeraLog via human studies and quantitative analysis, confirming its diversity, realism, and presence of explainable threat patterns. Evaluations of existing ITD methods show an average F1-score of 0.83, which is significantly lower than 0.99 on the CERT dataset, demonstrating ChimeraLog’s higher difficulty and utility for advancing ITD research. 
内部威胁可能导致严重损失,仍然是一个重大的安全问题。尽管基于机器学习的内部威胁检测(ITD)方法已显示出可喜的结果,但其进展受到高质量数据稀缺的制约。企业数据具有敏感性且很少可获取,而公开可用的数据集在规模受限以控制成本时缺乏足够的真实覆盖;而完全合成的数据则无法捕捉丰富的语义和真实的用户行为。为了解决这一问题,我们提出了 Chimera,这是首个基于 LLM 的多智能体框架,能够自动模拟良性和恶意的内部活动,并在多样化的企业环境中收集多样化的日志。Chimera 使用具有角色特定行为的代理来建模每位员工,并集成了用于群体会议、双人交互和自主日程安排的模块,从而捕捉真实的组织动态。 它涵盖了 15 种内部威胁类型(例如,知识产权窃取、系统破坏),并已用于在三个敏感领域模拟活动:科技公司、金融企业和医疗机构,生成了一个新数据集 ChimeraLog。我们通过人工研究和定量分析评估了 ChimeraLog,确认了其多样性、真实性以及可解释的威胁模式的存在。对现有内部威胁检测(ITD)方法的评估显示平均 F1 分数为 0.83,显著低于在 CERT 数据集上的 0.99,表明 ChimeraLog 更具挑战性并有助于推进 ITD 研究。
Subjects: Cryptography and Security, Artificial Intelligence, Software Engineering 主题:密码学与安全、人工智能、软件工程
Publish: 2025-08-11 08:24:48 UTC 发布:2025-08-11 08:24:48 UTC
#128 A Rule-Based Approach to Specifying Preferences over Conflicting Facts and Querying Inconsistent Knowledge Bases #128 一种基于规则的方法,用于在相互冲突的事实间指定偏好并查询不一致知识库
Authors: [Meghyn Bienvenu](https://arxiv.org/search/?searchtype=author&query=Meghyn Bienvenu), [Camille Bourgaux](https://arxiv.org/search/?searchtype=author&query=Camille Bourgaux), [Katsumi Inoue](https://arxiv.org/search/?searchtype=author&query=Katsumi Inoue), [Robin Jean](https://arxiv.org/search/?searchtype=author&query=Robin Jean) 作者:Meghyn Bienvenu、Camille Bourgaux、Katsumi Inoue、Robin Jean
Repair-based semantics have been extensively studied as a means of obtaining meaningful answers to queries posed over inconsistent knowledge bases (KBs). While several works have considered how to exploit a priority relation between facts to select optimal repairs, the question of how to specify such preferences remains largely unaddressed. This motivates us to introduce a declarative rule-based framework for specifying and computing a priority relation between conflicting facts. As the expressed preferences may contain undesirable cycles, we consider the problem of determining when a set of preference rules always yields an acyclic relation, and we also explore a pragmatic approach that extracts an acyclic relation by applying various cycle removal techniques. Towards an end-to-end system for querying inconsistent KBs, we present a preliminary implementation and experimental evaluation of the framework, which employs answer set programming to evaluate the preference rules, apply the desired cycle resolution techniques to obtain a priority relation, and answer queries under prioritized-repair semantics. 基于修复的语义已被广泛研究,作为在不一致知识库(KB)上对查询获得有意义答案的一种手段。尽管若干工作已考虑如何利用事实之间的优先关系来选择最优修复,但如何指定此类偏好这一问题在很大程度上尚未解决。这促使我们引入一个用于指定和计算冲突事实之间优先关系的声明性基于规则的框架。由于表达的偏好可能包含不希望出现的循环,我们研究了判定一组偏好规则何时总是产生无环关系的问题,同时也探索了一种务实的方法,通过应用各种循环移除技术来提取无环关系。为了构建一个可端到端查询不一致知识库的系统,我们提出了该框架的初步实现和实验评估,该实现使用答案集编程(answer set programming)来评估偏好规则,应用所需的循环解决技术以获得优先关系,并在优先修复语义下回答查询。
Subjects: Logic in Computer Science, Artificial Intelligence, Databases 主题:计算机科学中的逻辑、人工智能、数据库
Publish: 2025-08-11 08:21:02 UTC 发布时间:2025-08-11 08:21:02 UTC
#129 CognitiveArm: Enabling Real-Time EEG-Controlled Prosthetic Arm Using Embodied Machine Learning #129 CognitiveArm:使用具身机器学习实现实时脑电控制假肢手臂
Authors: [Abdul Basit](https://arxiv.org/search/?searchtype=author&query=Abdul Basit), [Maha Nawaz](https://arxiv.org/search/?searchtype=author&query=Maha Nawaz), [Saim Rehman](https://arxiv.org/search/?searchtype=author&query=Saim Rehman), [Muhammad Shafique](https://arxiv.org/search/?searchtype=author&query=Muhammad Shafique) 作者:Abdul Basit、Maha Nawaz、Saim Rehman、Muhammad Shafique
Efficient control of prosthetic limbs via non-invasive brain-computer interfaces (BCIs) requires advanced EEG processing, including pre-filtering, feature extraction, and action prediction, performed in real time on edge AI hardware. Achieving this on resource-constrained devices presents challenges in balancing model complexity, computational efficiency, and latency. We present CognitiveArm, an EEG-driven, brain-controlled prosthetic system implemented on embedded AI hardware, achieving real-time operation without compromising accuracy. The system integrates BrainFlow, an open-source library for EEG data acquisition and streaming, with optimized deep learning (DL) models for precise brain signal classification. Using evolutionary search, we identify Pareto-optimal DL configurations through hyperparameter tuning, optimizer analysis, and window selection, analyzed individually and in ensemble configurations. We apply model compression techniques such as pruning and quantization to optimize models for embedded deployment, balancing efficiency and accuracy. We collected an EEG dataset and designed an annotation pipeline enabling precise labeling of brain signals corresponding to specific intended actions, forming the basis for training our optimized DL models. CognitiveArm also supports voice commands for seamless mode switching, enabling control of the prosthetic arm’s 3 degrees of freedom (DoF). Running entirely on embedded hardware, it ensures low latency and real-time responsiveness. A full-scale prototype, interfaced with the OpenBCI UltraCortex Mark IV EEG headset, achieved up to 90% accuracy in classifying three core actions (left, right, idle). Voice integration enables multiplexed, variable movement for everyday tasks (e.g., handshake, cup picking), enhancing real-world performance and demonstrating CognitiveArm’s potential for advanced prosthetic control. 
通过非侵入式脑机接口(BCI)对假肢进行高效控制需要先进的脑电图(EEG)处理,包括实时在边缘人工智能硬件上执行的预滤波、特征提取和动作预测。在资源受限的设备上实现这一点,需要在模型复杂性、计算效率和延迟之间取得平衡。我们提出了 CognitiveArm,一种基于 EEG 的脑控假肢系统,在嵌入式 AI 硬件上实现实时运行且不牺牲准确性。该系统将用于 EEG 数据采集与流式传输的开源库 BrainFlow 与用于精确脑信号分类的优化深度学习(DL)模型相结合。通过进化搜索,我们通过超参数调优、优化器分析和窗口选择识别出帕累托最优的深度学习配置,并对单独配置和集成配置进行分析。我们应用剪枝和量化等模型压缩技术,对模型进行优化以部署到嵌入式设备上,平衡效率与准确性。 我们收集了一个脑电(EEG)数据集并设计了一个标注流程,使得能够精确标注对应特定意图动作的大脑信号,构成训练我们优化深度学习模型的基础。CognitiveArm 还支持语音命令以无缝切换工作模式,从而控制假肢手臂的三个自由度(DoF)。系统完全运行在嵌入式硬件上,确保低延迟和实时响应。一个与 OpenBCI UltraCortex Mark IV EEG 头戴设备接口的全尺寸原型,在分类三种核心动作(左、右、静止)时达到了最高 90% 的准确率。语音集成使日常任务(例如握手、拿杯子)能够实现多路复用和可变动作,从而提升现实世界表现,展示了 CognitiveArm 在先进假肢控制方面的潜力。
Subjects: Human-Computer Interaction, Artificial Intelligence 主题:人机交互,人工智能
Publish: 2025-08-11 08:04:59 UTC 发布:2025-08-11 08:04:59 UTC
#130 DoorDet: Semi-Automated Multi-Class Door Detection Dataset via Object Detection and Large Language Models #130 DoorDet:通过目标检测和大型语言模型实现的半自动多类门检测数据集
Authors: [Licheng Zhang](https://arxiv.org/search/?searchtype=author&query=Licheng Zhang), [Bach Le](https://arxiv.org/search/?searchtype=author&query=Bach Le), [Naveed Akhtar](https://arxiv.org/search/?searchtype=author&query=Naveed Akhtar), [Tuan Ngo](https://arxiv.org/search/?searchtype=author&query=Tuan Ngo) 作者:张立成、Bach Le、Naveed Akhtar、Ngô Tuấn
Accurate detection and classification of diverse door types in floor plan drawings is critical for multiple applications, such as building compliance checking and indoor scene understanding. Despite their importance, publicly available datasets specifically designed for fine-grained multi-class door detection remain scarce. In this work, we present a semi-automated pipeline that leverages a state-of-the-art object detector and a large language model (LLM) to construct a multi-class door detection dataset with minimal manual effort. Doors are first detected as a unified category using a deep object detection model. Next, an LLM classifies each detected instance based on its visual and contextual features. Finally, a human-in-the-loop stage ensures high-quality labels and bounding boxes. Our method significantly reduces annotation cost while producing a dataset suitable for benchmarking neural models in floor plan analysis. This work demonstrates the potential of combining deep learning and multimodal reasoning for efficient dataset construction in complex real-world domains. 在平面图中准确检测和分类多种门类型对于多种应用至关重要,例如建筑合规性检查和室内场景理解。尽管其十分重要,专门为细粒度多类门检测设计的公开数据集仍然稀缺。在本工作中,我们提出了一个半自动化流程,利用最先进的目标检测器和大型语言模型(LLM)以最小的人工投入构建多类门检测数据集。首先使用深度目标检测模型将门作为一个统一类别进行检测。接着,LLM 基于每个检测实例的视觉和上下文特征对其进行分类。最后,带有人类参与的环节确保标签和边界框的高质量。我们的方法在显著降低标注成本的同时生成了一个适用于平面图分析中神经模型基准测试的数据集。本工作展示了将深度学习与多模态推理相结合在复杂现实领域中高效构建数据集的潜力。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Emerging Technologies 领域:计算机视觉与模式识别,人工智能,新兴技术
Publish: 2025-08-11 07:41:09 UTC 发布:2025-08-11 07:41:09 UTC
#131 Training-Free ANN-to-SNN Conversion for High-Performance Spiking Transformer #131 无需训练的 ANN 到 SNN 转换以实现高性能脉冲变换器
Authors: [Jingya Wang](https://arxiv.org/search/?searchtype=author&query=Jingya Wang), [Xin Deng](https://arxiv.org/search/?searchtype=author&query=Xin Deng), [Wenjie Wei](https://arxiv.org/search/?searchtype=author&query=Wenjie Wei), [Dehao Zhang](https://arxiv.org/search/?searchtype=author&query=Dehao Zhang), [Shuai Wang](https://arxiv.org/search/?searchtype=author&query=Shuai Wang), [Qian Sun](https://arxiv.org/search/?searchtype=author&query=Qian Sun), [Jieyuan Zhang](https://arxiv.org/search/?searchtype=author&query=Jieyuan Zhang), [Hanwen Liu](https://arxiv.org/search/?searchtype=author&query=Hanwen Liu), [Ning Xie](https://arxiv.org/search/?searchtype=author&query=Ning Xie), [Malu Zhang](https://arxiv.org/search/?searchtype=author&query=Malu Zhang) 作者:王婧雅、邓鑫、魏文杰、张德昊、王帅、孙倩、张杰远、刘汉文、谢宁、张马鲁
Leveraging the event-driven paradigm, Spiking Neural Networks (SNNs) offer a promising approach for constructing energy-efficient Transformer architectures. Compared to directly trained Spiking Transformers, ANN-to-SNN conversion methods bypass the high training costs. However, existing methods still suffer from notable limitations, failing to effectively handle nonlinear operations in Transformer architectures and requiring additional fine-tuning processes for pre-trained ANNs. To address these issues, we propose a high-performance and training-free ANN-to-SNN conversion framework tailored for Transformer architectures. Specifically, we introduce a Multi-basis Exponential Decay (MBE) neuron, which employs an exponential decay strategy and multi-basis encoding method to efficiently approximate various nonlinear operations. It removes the requirement for weight modifications in pre-trained ANNs. Extensive experiments across diverse tasks (CV, NLU, NLG) and mainstream Transformer architectures (ViT, RoBERTa, GPT-2) demonstrate that our method achieves near-lossless conversion accuracy with significantly lower latency. This provides a promising pathway for the efficient and scalable deployment of Spiking Transformers in real-world applications. 利用事件驱动范式,脉冲神经网络(SNNs)为构建节能的 Transformer 架构提供了有前景的方法。与直接训练的脉冲 Transformer 相比,ANN 到 SNN 的转换方法绕过了高昂的训练成本。然而,现有方法仍存在显著局限,无法有效处理 Transformer 架构中的非线性操作,并且需要对预训练的 ANN 进行额外的微调过程。为了解决这些问题,我们提出了一个针对 Transformer 架构的高性能且无需训练的 ANN 到 SNN 转换框架。具体而言,我们引入了一种多基指数衰减(MBE)神经元,它采用指数衰减策略和多基编码方法来高效逼近各种非线性操作。该方法消除了对预训练 ANN 权重修改的需求。跨多种任务(计算机视觉、自然语言理解、自然语言生成)和主流 Transformer 架构(ViT、RoBERTa、GPT-2)的广泛实验表明,我们的方法在显著更低延迟的情况下实现了近无损的转换精度。 这为在真实世界应用中高效且可扩展地部署脉冲变换器提供了一条有前景的路径。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-11 07:38:32 UTC
#132 Energy Consumption in Parallel Neural Network Training #132 并行神经网络训练中的能耗
Authors: [Philipp Huber](https://arxiv.org/search/?searchtype=author&query=Philipp Huber), [David Li](https://arxiv.org/search/?searchtype=author&query=David Li), [Juan Pedro Gutiérrez Hermosillo Muriedas](https://arxiv.org/search/?searchtype=author&query=Juan Pedro Gutiérrez Hermosillo Muriedas), [Deifilia Kieckhefen](https://arxiv.org/search/?searchtype=author&query=Deifilia Kieckhefen), [Markus Götz](https://arxiv.org/search/?searchtype=author&query=Markus Götz), [Achim Streit](https://arxiv.org/search/?searchtype=author&query=Achim Streit), [Charlotte Debus](https://arxiv.org/search/?searchtype=author&query=Charlotte Debus) 作者:Philipp Huber、David Li、Juan Pedro Gutiérrez Hermosillo Muriedas、Deifilia Kieckhefen、Markus Götz、Achim Streit、Charlotte Debus
The increasing demand for computational resources in training neural networks leads to a concerning growth in energy consumption. While parallelization has enabled upscaling model and dataset sizes and accelerated training, its impact on energy consumption is often overlooked. To close this research gap, we conducted scaling experiments for data-parallel training of two models, ResNet50 and FourCastNet, and evaluated the impact of parallelization parameters, i.e., GPU count, global batch size, and local batch size, on predictive performance, training time, and energy consumption. We show that energy consumption scales approximately linearly with the consumed resources, i.e., GPU hours; however, the respective scaling factor differs substantially between distinct model trainings and hardware, and is systematically influenced by the number of samples and gradient updates per GPU hour. Our results shed light on the complex interplay of scaling up neural network training and can inform future developments towards more sustainable AI research. 神经网络训练对计算资源的需求日益增长,导致能耗出现令人担忧的上升。虽然并行化使得模型和数据集规模扩大并加速了训练,其对能耗的影响却常被忽视。为填补这一研究空白,我们对两种模型(ResNet50 和 FourCastNet)进行了数据并行训练的扩展实验,并评估了并行化参数(即 GPU 数量、全局批量大小和本地批量大小)对预测性能、训练时间和能耗的影响。我们表明,能耗大致与所消耗的资源(即 GPU 小时)线性相关;然而,不同模型训练和硬件之间的相应比例因子差异显著,并且受每 GPU 小时的样本数和梯度更新次数的系统性影响。我们的结果揭示了扩展神经网络训练的复杂相互作用,并可为未来朝着更可持续的人工智能研究的发展提供参考。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-11 07:34:04 UTC
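The parallelization parameters named in #132 are linked by simple arithmetic. In data-parallel training the global batch size is the GPU count times the local (per-GPU) batch size, and the abstract reports that energy grows roughly linearly with GPU hours. The sketch below encodes both relations. The scaling factor `kwh_per_gpu_hour` is a hypothetical placeholder, since the abstract stresses that this factor varies by model and hardware, and every number in the example is illustrative rather than taken from the paper.

```python
def global_batch_size(gpu_count, local_batch_size):
    """Samples processed per optimizer step across all data-parallel workers."""
    return gpu_count * local_batch_size

def gpu_hours(gpu_count, wall_clock_hours):
    """Consumed resources: number of GPUs times wall-clock training time."""
    return gpu_count * wall_clock_hours

def estimated_energy_kwh(gpu_count, wall_clock_hours, kwh_per_gpu_hour):
    """Linear energy model: energy is proportional to GPU hours.

    kwh_per_gpu_hour is a setup-specific factor that must be measured;
    the value used below is a made-up placeholder.
    """
    return kwh_per_gpu_hour * gpu_hours(gpu_count, wall_clock_hours)

# Illustrative run: 8 GPUs, 32 samples per GPU, 2.5 hours of training.
print(global_batch_size(8, 32))                       # 256
print(gpu_hours(8, 2.5))                              # 20.0
print(round(estimated_energy_kwh(8, 2.5, 0.4), 2))    # 8.0 (placeholder factor)
```

Under this linear model, doubling the GPU count saves energy only if wall-clock time falls by more than half, or if the per-GPU-hour factor itself drops.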
#133 LoSemB: Logic-Guided Semantic Bridging for Inductive Tool Retrieval #133 LoSemB:用于归纳式工具检索的逻辑引导语义桥接
Authors: [Luyao Zhuang](https://arxiv.org/search/?searchtype=author&query=Luyao Zhuang), [Qinggang Zhang](https://arxiv.org/search/?searchtype=author&query=Qinggang Zhang), [Huachi Zhou](https://arxiv.org/search/?searchtype=author&query=Huachi Zhou), [Juhua Liu](https://arxiv.org/search/?searchtype=author&query=Juhua Liu), [Qing Li](https://arxiv.org/search/?searchtype=author&query=Qing Li), [Xiao Huang](https://arxiv.org/search/?searchtype=author&query=Xiao Huang) 作者:Luyao Zhuang、Qinggang Zhang、Huachi Zhou、Juhua Liu、Qing Li、Xiao Huang
Tool learning has emerged as a promising paradigm for large language models (LLMs) to solve many real-world tasks. Nonetheless, with the tool repository rapidly expanding, it is impractical to contain all tools within the limited input length of LLMs. To alleviate these issues, researchers have explored incorporating a tool retrieval module to select the most relevant tools or represent tools as unique tokens within LLM parameters. However, most state-of-the-art methods are under transductive settings, assuming all tools have been observed during training. Such a setting deviates from reality as the real-world tool repository is evolving and incorporates new tools frequently. When dealing with these unseen tools, which refer to tools not encountered during the training phase, these methods are limited by two key issues, including the large distribution shift and the vulnerability of similarity-based retrieval. To this end, inspired by human cognitive processes of mastering unseen tools through discovering and applying the logical information from prior experience, we introduce a novel Logic-Guided Semantic Bridging framework for inductive tool retrieval, namely, LoSemB, which aims to mine and transfer latent logical information for inductive tool retrieval without costly retraining. Specifically, LoSemB contains a logic-based embedding alignment module to mitigate distribution shifts and implements a relational augmented retrieval mechanism to reduce the vulnerability of similarity-based retrieval. Extensive experiments demonstrate that LoSemB achieves advanced performance in inductive settings while maintaining desirable effectiveness in the transductive setting. 
工具学习已成为大型语言模型(LLMs)解决许多现实任务的有前景范式。然而,随着工具库快速扩展,将所有工具放入 LLMs 有限的输入长度中并不现实。为缓解这些问题,研究人员已探索引入工具检索模块以选择最相关的工具,或将工具表示为 LLM 参数中的独特标记。然而,大多数最先进的方法都处于转导设置下,假设所有工具在训练期间都已被观察到。这种设置偏离现实,因为现实世界的工具库在不断演进并频繁引入新工具。在处理这些未见过的工具(指训练阶段未遇到的工具)时,这些方法受制于两个关键问题:大的分布偏移和基于相似度的检索脆弱性。为此,受人类通过从既有经验中发现并应用逻辑信息来掌握未知工具的认知过程启发,我们提出了一种用于归纳式工具检索的新颖逻辑引导语义桥接框架,称为 LoSemB,旨在在无需昂贵重训练的情况下挖掘并转移潜在的逻辑信息以用于归纳式工具检索。具体而言,LoSemB 包含一个基于逻辑的嵌入对齐模块以缓解分布偏移,并实现了一个关系增强的检索机制以降低基于相似度检索的脆弱性。大量实验表明,LoSemB 在归纳设置中取得了先进的性能,同时在转导设置中也保持了理想的有效性。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-11 07:07:18 UTC 发布:2025-08-11 07:07:18 UTC
#134 TAR-TVG: Enhancing VLMs with Timestamp Anchor-Constrained Reasoning for Temporal Video Grounding #134 TAR-TVG:通过时间戳锚点约束推理增强视觉-语言模型的时序视频定位
Authors: [Chaohong Guo](https://arxiv.org/search/?searchtype=author&query=Chaohong Guo), [Xun Mo](https://arxiv.org/search/?searchtype=author&query=Xun Mo), [Yongwei Nie](https://arxiv.org/search/?searchtype=author&query=Yongwei Nie), [Xuemiao Xu](https://arxiv.org/search/?searchtype=author&query=Xuemiao Xu), [Chao Xu](https://arxiv.org/search/?searchtype=author&query=Chao Xu), [Fei Yu](https://arxiv.org/search/?searchtype=author&query=Fei Yu), [Chengjiang Long](https://arxiv.org/search/?searchtype=author&query=Chengjiang Long) 作者:郭超宏、莫勋、聂永为、许学淼、徐超、于飞、龙承江
Temporal Video Grounding (TVG) aims to precisely localize video segments corresponding to natural language queries, which is a critical capability for long-form video understanding. Although existing reinforcement learning approaches encourage models to generate reasoning chains before predictions, they fail to explicitly constrain the reasoning process to ensure the quality of the final temporal predictions. To address this limitation, we propose Timestamp Anchor-constrained Reasoning for Temporal Video Grounding (TAR-TVG), a novel framework that introduces timestamp anchors within the reasoning process to enforce explicit supervision to the thought content. These anchors serve as intermediate verification points. More importantly, we require each reasoning step to produce increasingly accurate temporal estimations, thereby ensuring that the reasoning process contributes meaningfully to the final prediction. To address the challenge of low-probability anchor generation in models (e.g., Qwen2.5-VL-3B), we develop an efficient self-distillation training strategy: (1) initial GRPO training to collect 30K high-quality reasoning traces containing multiple timestamp anchors, (2) supervised fine-tuning (SFT) on distilled data, and (3) final GRPO optimization on the SFT-enhanced model. This three-stage training strategy enables robust anchor generation while maintaining reasoning quality. Experiments show that our model achieves state-of-the-art performance while producing interpretable, verifiable reasoning chains with progressively refined temporal estimations. 
时序视频定位(Temporal Video Grounding,TVG)旨在精确定位与自然语言查询相对应的视频片段,这是长视频理解的一项关键能力。尽管现有的强化学习方法鼓励模型在预测前生成推理链,但它们未能对推理过程进行明确约束以保证最终时序预测的质量。为了解决这一限制,我们提出了用于时序视频定位的基于时间戳锚点约束的推理(Timestamp Anchor-constrained Reasoning for Temporal Video Grounding,TAR-TVG),该新颖框架在推理过程中引入时间戳锚点,以对思维内容施加显式监督。这些锚点作为中间校验点。更重要的是,我们要求每一步推理产出日益精确的时间估计,从而确保推理过程对最终预测有切实的贡献。 为了解决模型(例如 Qwen2.5-VL-3B)中低概率锚点生成的问题,我们提出了一种高效的自我蒸馏训练策略:(1)初始 GRPO 训练以收集包含多个时间戳锚点的 3 万条高质量推理轨迹,(2)在蒸馏数据上进行监督微调(SFT),以及(3)在经过 SFT 强化的模型上进行最终的 GRPO 优化。该三阶段训练策略在保持推理质量的同时实现了稳健的锚点生成。实验表明,我们的模型在产生可解释、可验证的推理链并逐步精细化时间估计方面达到了最先进的性能。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-11 06:59:32 UTC 发布:2025-08-11 06:59:32 UTC
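The TAR-TVG abstract requires each reasoning step to produce an increasingly accurate temporal estimate. A minimal sketch of how such a constraint could be checked is given below. The names `temporal_iou` and `anchor_progress_reward`, the use of IoU as the accuracy measure, and the binary reward are assumptions made for illustration, not the paper's actual reward design.

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) intervals given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def anchor_progress_reward(anchors, final_pred, gt):
    """Hypothetical reward: 1.0 if the IoU with the ground truth never
    decreases from one timestamp anchor to the next (ending with the final
    prediction), otherwise 0.0."""
    ious = [temporal_iou(a, gt) for a in anchors]
    ious.append(temporal_iou(final_pred, gt))
    improving = all(later >= earlier for earlier, later in zip(ious, ious[1:]))
    return 1.0 if improving else 0.0


# Example: two intermediate anchors that tighten around the target segment.
reward = anchor_progress_reward([(0.0, 30.0), (5.0, 20.0)], (8.0, 18.0), (10.0, 18.0))
```

A graded variant (for example, summing the IoU gains between consecutive anchors) would give a denser signal; the binary form is used here only to keep the monotonicity check explicit.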
#135 MORE-CLEAR: Multimodal Offline Reinforcement learning for Clinical notes Leveraged Enhanced State Representation #135 MORE-CLEAR:用于临床记录的多模态离线强化学习,利用增强的状态表示
Authors: [Yooseok Lim](https://arxiv.org/search/?searchtype=author&query=Yooseok Lim), [ByoungJun Jeon](https://arxiv.org/search/?searchtype=author&query=ByoungJun Jeon), [Seong-A Park](https://arxiv.org/search/?searchtype=author&query=Seong-A Park), [Jisoo Lee](https://arxiv.org/search/?searchtype=author&query=Jisoo Lee), [Sae Won Choi](https://arxiv.org/search/?searchtype=author&query=Sae Won Choi), [Chang Wook Jeong](https://arxiv.org/search/?searchtype=author&query=Chang Wook Jeong), [Ho-Geol Ryu](https://arxiv.org/search/?searchtype=author&query=Ho-Geol Ryu), [Hongyeol Lee](https://arxiv.org/search/?searchtype=author&query=Hongyeol Lee), [Hyun-Lim Yang](https://arxiv.org/search/?searchtype=author&query=Hyun-Lim Yang) 作者:Yooseok Lim, ByoungJun Jeon, Seong-A Park, Jisoo Lee, Sae Won Choi, Chang Wook Jeong, Ho-Geol Ryu, Hongyeol Lee, Hyun-Lim Yang
Sepsis, a life-threatening inflammatory response to infection, causes organ dysfunction, making early detection and optimal management critical. Previous reinforcement learning (RL) approaches to sepsis management rely primarily on structured data, such as lab results or vital signs, and lack a comprehensive understanding of the patient’s condition. In this work, we propose a Multimodal Offline REinforcement learning for Clinical notes Leveraged Enhanced stAte Representation (MORE-CLEAR) framework for sepsis control in intensive care units. MORE-CLEAR employs pre-trained large-scale language models (LLMs) to facilitate the extraction of rich semantic representations from clinical notes, preserving clinical context and improving patient state representation. Gated fusion and cross-modal attention allow dynamic weight adjustment in the context of time and the effective integration of multimodal data. Extensive cross-validation using two public (MIMIC-III and MIMIC-IV) and one private dataset demonstrates that MORE-CLEAR significantly improves estimated survival rate and policy performance compared to single-modal RL approaches. To our knowledge, this is the first work to leverage LLM capabilities within multimodal offline RL for better state representation in medical applications. This approach can potentially expedite the treatment and management of sepsis by enabling reinforcement learning models to propose enhanced actions based on a more comprehensive understanding of patient conditions.
脓毒症是一种对感染的危及生命的炎症反应,会导致器官功能障碍,因此早期检测和优化管理至关重要。先前用于脓毒症管理的强化学习(RL)方法主要依赖结构化数据,如化验结果或生命体征,且缺乏对患者病情的全面理解。在本研究中,我们提出了一种用于重症监护病房脓毒症控制的多模态离线强化学习框架,利用临床笔记增强状态表示:Multimodal Offline REinforcement learning for Clinical notes Leveraged Enhanced stAte Representation(MORE-CLEAR)。MORE-CLEAR 利用预训练的大规模语言模型(LLMs)来促进从临床笔记中提取丰富的语义表示,保留临床语境并改进患者状态表示。门控融合与跨模态注意力允许在时间语境中进行动态权重调整并有效整合多模态数据。使用两个公共数据集(MIMIC-III 和 MIMIC-IV)和一个私有数据集进行的大量交叉验证表明,与单模态 RL 方法相比,MORE-CLEAR 在估计生存率和策略表现上显著改进。据我们所知,这是首个在多模态离线强化学习中利用 LLM 能力以改善医疗应用中状态表示的工作。该方法有望通过使强化学习模型在更全面理解患者状况的基础上提出改进的措施,从而加速脓毒症的治疗和管理。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-11 06:58:33 UTC 发布日期:2025-08-11 06:58:33 UTC
#136 AIS-LLM: A Unified Framework for Maritime Trajectory Prediction, Anomaly Detection, and Collision Risk Assessment with Explainable Forecasting #136 AIS-LLM:一个用于海上轨迹预测、异常检测与碰撞风险评估的统一框架,具备可解释的预测
Authors: [Hyobin Park](https://arxiv.org/search/?searchtype=author&query=Hyobin Park), [Jinwook Jung](https://arxiv.org/search/?searchtype=author&query=Jinwook Jung), [Minseok Seo](https://arxiv.org/search/?searchtype=author&query=Minseok Seo), [Hyunsoo Choi](https://arxiv.org/search/?searchtype=author&query=Hyunsoo Choi), [Deukjae Cho](https://arxiv.org/search/?searchtype=author&query=Deukjae Cho), [Sekil Park](https://arxiv.org/search/?searchtype=author&query=Sekil Park), [Dong-Geol Choi](https://arxiv.org/search/?searchtype=author&query=Dong-Geol Choi) 作者:朴孝彬、郑珍旭、徐敏锡、崔贤洙、赵德宰、朴世吉、崔东杰
With the increase in maritime traffic and the mandatory implementation of the Automatic Identification System (AIS), the importance and diversity of maritime traffic analysis tasks based on AIS data, such as vessel trajectory prediction, anomaly detection, and collision risk assessment, is rapidly growing. However, existing approaches tend to address these tasks individually, making it difficult to holistically consider complex maritime situations. To address this limitation, we propose a novel framework, AIS-LLM, which integrates time-series AIS data with a large language model (LLM). AIS-LLM consists of a Time-Series Encoder for processing AIS sequences, an LLM-based Prompt Encoder, a Cross-Modality Alignment Module for semantic alignment between time-series data and textual prompts, and an LLM-based Multi-Task Decoder. This architecture enables the simultaneous execution of three key tasks: trajectory prediction, anomaly detection, and risk assessment of vessel collisions within a single end-to-end system. Experimental results demonstrate that AIS-LLM outperforms existing methods across individual tasks, validating its effectiveness. Furthermore, by integratively analyzing task outputs to generate situation summaries and briefings, AIS-LLM presents the potential for more intelligent and efficient maritime traffic management. 随着海上交通量的增加以及自动识别系统(AIS)的强制实施,基于 AIS 数据的海上交通分析任务的重要性和多样性迅速增长,例如船舶轨迹预测、异常检测和碰撞风险评估。然而,现有方法往往单独处理这些任务,难以全面考虑复杂的海上情形。为了解决这一局限性,我们提出了一种新框架 AIS-LLM,它将时间序列 AIS 数据与大型语言模型(LLM)相结合。AIS-LLM 由用于处理 AIS 序列的时间序列编码器、基于 LLM 的提示(Prompt)编码器、用于时间序列数据与文本提示之间语义对齐的跨模态对齐模块以及基于 LLM 的多任务解码器组成。该架构实现了在单一端到端系统中同时执行三项关键任务:轨迹预测、异常检测和船舶碰撞风险评估。实验结果表明,AIS-LLM 在各单项任务上均优于现有方法,验证了其有效性。 此外,通过综合分析任务输出以生成情境摘要和简报,AIS-LLM 展示了在实现更智能、更高效的海上交通管理方面的潜力。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-11 06:39:45 UTC 发布:2025-08-11 06:39:45 UTC
#137 GLiClass: Generalist Lightweight Model for Sequence Classification Tasks #137 GLiClass:用于序列分类任务的通用轻量级模型
Authors: [Ihor Stepanov](https://arxiv.org/search/?searchtype=author&query=Ihor Stepanov), [Mykhailo Shtopko](https://arxiv.org/search/?searchtype=author&query=Mykhailo Shtopko), [Dmytro Vodianytskyi](https://arxiv.org/search/?searchtype=author&query=Dmytro Vodianytskyi), [Oleksandr Lukashov](https://arxiv.org/search/?searchtype=author&query=Oleksandr Lukashov), [Alexander Yavorskyi](https://arxiv.org/search/?searchtype=author&query=Alexander Yavorskyi), [Mykyta Yaroshenko](https://arxiv.org/search/?searchtype=author&query=Mykyta Yaroshenko) 作者:Ihor Stepanov、Mykhailo Shtopko、Dmytro Vodianytskyi、Oleksandr Lukashov、Alexander Yavorskyi、Mykyta Yaroshenko
Classification is one of the most widespread tasks in AI applications, serving often as the first step in filtering, sorting, and categorizing data. Since modern AI systems must handle large volumes of input data and early pipeline stages can propagate errors downstream, achieving high efficiency and accuracy is critical. Moreover, classification requirements can change dynamically based on user needs, necessitating models with strong zero-shot capabilities. While generative LLMs have become mainstream for zero-shot classification due to their versatility, they suffer from inconsistent instruction following and computational inefficiency. Cross-encoders, commonly used as rerankers in RAG pipelines, face a different bottleneck: they must process text-label pairs sequentially, significantly reducing efficiency with large label sets. Embedding-based approaches offer good efficiency but struggle with complex scenarios involving logical and semantic constraints. We propose GLiClass, a novel method that adapts the GLiNER architecture for sequence classification tasks. Our approach achieves strong accuracy and efficiency comparable to embedding-based methods, while maintaining the flexibility needed for zero-shot and few-shot learning scenarios. Additionally, we adapted proximal policy optimization (PPO) for multi-label text classification, enabling training classifiers in data-sparse conditions or from human feedback. 分类是人工智能应用中最广泛的任务之一,通常作为过滤、排序和归类数据的第一步。由于现代 AI 系统必须处理大量输入数据,且早期流水线阶段的错误可能向下游传播,因此实现高效性和高准确性至关重要。此外,分类需求可能会根据用户需求动态变化,这就需要具有强大零样本能力的模型。尽管生成型 LLMs 因其多功能性已成为零样本分类的主流,但它们在遵循指令方面表现不稳定且计算效率低下。作为 RAG 流水线中常用的重排序器,交叉编码器面临另一种瓶颈:它们必须顺序处理文本-标签对,在标签集较大时显著降低效率。基于嵌入的方法在效率上表现良好,但在涉及逻辑和语义约束的复杂场景中困难重重。我们提出了 GLiClass,一种将 GLiNER 架构改造用于序列分类任务的新方法。 我们的方法在准确性和效率上与基于嵌入的方法相当,同时保留了零样本和少样本学习场景所需的灵活性。此外,我们将近端策略优化(PPO)改编用于多标签文本分类,使得在数据稀缺的条件下或基于人类反馈训练分类器成为可能。
Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题:机器学习、人工智能、计算与语言
Publish: 2025-08-11 06:22:25 UTC 发布:2025-08-11 06:22:25 UTC
#138 Discovering Spatial Correlations between Earth Observations in Global Atmospheric State Estimation by using Adaptive Graph Structure Learning #138 通过自适应图结构学习在全球大气状态估计中发现地球观测之间的空间相关性
Authors: [Hyeon-Ju Jeon](https://arxiv.org/search/?searchtype=author&query=Hyeon-Ju Jeon), [Jeon-Ho Kang](https://arxiv.org/search/?searchtype=author&query=Jeon-Ho Kang), [In-Hyuk Kwon](https://arxiv.org/search/?searchtype=author&query=In-Hyuk Kwon), [O-Joun Lee](https://arxiv.org/search/?searchtype=author&query=O-Joun Lee) 作者:Hyeon-Ju Jeon、Jeon-Ho Kang、In-Hyuk Kwon、O-Joun Lee
This study aims to discover spatial correlations between Earth observations and atmospheric states to improve the forecasting accuracy of global atmospheric state estimation, which is usually conducted using conventional numerical weather prediction (NWP) systems and is the starting point of weather forecasting. NWP systems predict future atmospheric states at fixed locations, which are called NWP grid points, by analyzing previous atmospheric states and newly acquired Earth observations without fixed locations. Thus, the surrounding meteorological context and the changing locations of the observations cause the spatial correlations between atmospheric states and observations to change over time. To handle these complicated, dynamically changing spatial correlations, we employ spatiotemporal graph neural networks (STGNNs) with structure learning. However, structure learning has an inherent limitation: by generating excessive edges, it can cause structural information loss and over-smoothing. To solve this problem, we regulate edge sampling by adaptively determining node degrees and considering the spatial distances between NWP grid points and observations. We validated the effectiveness of the proposed method by using real-world atmospheric state and observation data from East Asia. Even in areas with high atmospheric variability, the proposed method outperformed existing STGNN models with and without structure learning. 本研究旨在发现地球观测与大气状态之间的空间相关性,以提高全球大气状态估计的预报精度。通常这类估计使用传统的数值天气预报(NWP)系统进行,并且是天气预报的起点。NWP 系统通过分析先前的大气状态和新获取的、没有固定位置的地球观测来预测固定位置(称为 NWP 网格点)处的未来大气状态。因此,周边的气象背景与观测位置的变化使得大气状态与观测之间的空间相关性随时间而变化。为处理这种动态变化的复杂空间相关性,我们采用了具有结构学习的时空图神经网络(STGNNs)。然而,结构学习存在内在限制:通过生成过多边可能导致结构信息丢失和过度平滑问题。为了解决该问题,我们通过自适应确定节点度并考虑 NWP 网格点与观测点之间的空间距离来调节边采样。我们使用来自东亚的真实大气状态和观测数据验证了所提方法的有效性。即便在大气变动剧烈的区域,所提方法也优于带有和不带有结构学习的现有 STGNN 模型。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-11 06:14:31 UTC 发布:2025-08-11 06:14:31 UTC
#139 Grasp-HGN: Grasping the Unexpected #139 Grasp-HGN:抓取意料之外的物体
Authors: [Mehrshad Zandigohar](https://arxiv.org/search/?searchtype=author&query=Mehrshad Zandigohar), [Mallesham Dasari](https://arxiv.org/search/?searchtype=author&query=Mallesham Dasari), [Gunar Schirner](https://arxiv.org/search/?searchtype=author&query=Gunar Schirner) 作者:Mehrshad Zandigohar,Mallesham Dasari,Gunar Schirner
For transradial amputees, robotic prosthetic hands promise to restore the capability to perform daily living activities. To advance next-generation prosthetic hand control design, it is crucial to address current shortcomings in robustness to out-of-lab artifacts and generalizability to new environments. Because existing datasets contain a fixed number of objects to interact with, in contrast to the virtually infinite variety of objects encountered in the real world, current grasp models perform poorly on unseen objects, negatively affecting users’ independence and quality of life. To address this: (i) we define semantic projection, the ability of a model to generalize to unseen object types, and show that conventional models like YOLO, despite 80% training accuracy, drop to 15% on unseen objects. (ii) we propose Grasp-LLaVA, a Grasp Vision Language Model enabling human-like reasoning to infer the suitable grasp type based on the object’s physical characteristics, resulting in 50.2% accuracy on unseen object types compared to 36.7% for an SOTA grasp estimation model. Lastly, to bridge the performance-latency gap, we propose Hybrid Grasp Network (HGN), an edge-cloud deployment infrastructure enabling fast grasp estimation on edge and accurate cloud inference as a fail-safe, effectively expanding the latency vs. accuracy Pareto frontier. HGN with confidence calibration (DC) enables dynamic switching between edge and cloud models, improving semantic projection accuracy by 5.6% (to 42.3%) with 3.5x speedup over the unseen object types. Over a real-world sample mix, it reaches 86% average accuracy (12.2% gain over edge-only), and 2.2x faster inference than Grasp-LLaVA alone.
对于经桡骨截肢者,机器人义手有望恢复执行日常生活活动的能力。为了推进下一代义手控制设计,关键在于解决当前在实验室外伪影鲁棒性和对新环境的泛化性方面的不足。由于现有数据集中可交互对象数量是固定的,而现实世界中对象的种类几乎是无限的,现有的抓握模型在未见过的对象上表现不佳,进而负面影响用户的独立性和生活质量。为了解决这一问题:(i)我们定义了语义投影(semantic projection),即模型对未见对象类型的泛化能力,并展示了像 YOLO 这样的传统模型尽管在训练集上有 80% 的准确率,但在未见对象上会降至 15%。(ii)我们提出了 Grasp-LLaVA,一种抓握视觉语言模型,能够进行类人推理,根据对象的物理特性推断合适的抓握类型,在未见对象类型上的准确率达到 50.2%,而最先进的抓握估计模型为 36.7%。最后,为了弥合性能与延迟之间的差距,我们提出了混合抓握网络(Hybrid Grasp Network,HGN),这是一种边缘-云部署架构,能够在边缘实现快速的抓握估计,并以云端的精确推理作为备援,从而有效扩大延迟与精度的帕累托前沿。配备置信度校准(DC)的 HGN 能够在边缘模型与云模型之间动态切换,在未见过的对象类型上使语义投影准确率提高 5.6 个百分点(达到 42.3%),并实现 3.5 倍的加速。在真实世界的样本混合上,其平均准确率达到 86%(比仅边缘部署提高 12.2%),推理速度比仅使用 Grasp-LLaVA 快 2.2 倍。
Subjects: Robotics, Artificial Intelligence, Machine Learning 主题:机器人学、人工智能、机器学习
Publish: 2025-08-11 05:58:28 UTC 发布:2025-08-11 05:58:28 UTC
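The Grasp-HGN entry describes dynamic switching between a fast edge model and an accurate cloud model based on calibrated confidence. The routing logic might look roughly like the sketch below. The function name `hybrid_grasp`, the `(label, confidence)` return convention, the threshold value, and the stub models are all hypothetical and stand in for the paper's actual components.

```python
from typing import Callable, Tuple

# A model maps an input (here just an opaque object) to (grasp_type, confidence).
Model = Callable[[object], Tuple[str, float]]


def hybrid_grasp(image: object, edge_model: Model, cloud_model: Model,
                 threshold: float = 0.7) -> Tuple[str, str]:
    """Return (grasp_type, source). Use the edge prediction when its
    confidence clears the threshold, otherwise fall back to the cloud."""
    label, confidence = edge_model(image)
    if confidence >= threshold:
        return label, "edge"
    cloud_label, _ = cloud_model(image)
    return cloud_label, "cloud"


# Stub models for illustration only.
def confident_edge(image):
    return "power", 0.92


def unsure_edge(image):
    return "power", 0.40


def cloud(image):
    return "pinch", 0.88


fast_path = hybrid_grasp("frame-0", confident_edge, cloud)
fallback = hybrid_grasp("frame-1", unsure_edge, cloud)
```

The threshold sets the trade-off: a higher value sends more samples to the cloud, raising accuracy at the cost of latency, and it is only meaningful if the edge model's confidence is reasonably well calibrated.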
#140 Attribution Explanations for Deep Neural Networks: A Theoretical Perspective #140 深度神经网络的归因解释:理论视角
Authors: [Huiqi Deng](https://arxiv.org/search/?searchtype=author&query=Huiqi Deng), [Hongbin Pei](https://arxiv.org/search/?searchtype=author&query=Hongbin Pei), [Quanshi Zhang](https://arxiv.org/search/?searchtype=author&query=Quanshi Zhang), [Mengnan Du](https://arxiv.org/search/?searchtype=author&query=Mengnan Du) 作者:邓惠祺、裴鸿彬、张全实、杜梦楠
Attribution explanation is a typical approach for explaining deep neural networks (DNNs), inferring an importance or contribution score for each input variable to the final output. In recent years, numerous attribution methods have been developed to explain DNNs. However, a persistent concern remains unresolved, i.e., whether and which attribution methods faithfully reflect the actual contribution of input variables to the decision-making process. The faithfulness issue undermines the reliability and practical utility of attribution explanations. We argue that these concerns stem from three core challenges. First, difficulties arise in comparing attribution methods due to their unstructured heterogeneity, differences in heuristics, formulations, and implementations that lack a unified organization. Second, most methods lack solid theoretical underpinnings, with their rationales remaining absent, ambiguous, or unverified. Third, empirically evaluating faithfulness is challenging without ground truth. Recent theoretical advances provide a promising way to tackle these challenges, attracting increasing attention. We summarize these developments, with emphasis on three key directions: (i) Theoretical unification, which uncovers commonalities and differences among methods, enabling systematic comparisons; (ii) Theoretical rationale, clarifying the foundations of existing methods; (iii) Theoretical evaluation, rigorously proving whether methods satisfy faithfulness principles. Beyond a comprehensive review, we provide insights into how these studies help deepen theoretical understanding, inform method selection, and inspire new attribution methods. We conclude with a discussion of promising open problems for further work. 
归因解释是解释深度神经网络(DNN)的典型方法,旨在推断每个输入变量对最终输出的重要性或贡献分数。近年来,已经开发出大量归因方法来解释 DNN。然而,一个持续存在的担忧尚未解决,即这些归因方法是否以及哪些能忠实地反映输入变量对决策过程的实际贡献。可忠实性问题削弱了归因解释的可靠性和实际效用。我们认为,这些担忧源于三个核心挑战。首先,由于归因方法的不成体系的异质性——在启发式、形式化和实现上的差异且缺乏统一组织——使得比较这些方法变得困难。第二,大多数方法缺乏坚实的理论基础,其论证要么缺失、含糊不清,要么未被验证。第三,在没有真实标签的情况下进行可忠实性的实证评估具有挑战性。最近的理论进展为应对这些挑战提供了有希望的途径,并正日益受到关注。 我们总结了这些进展,重点关注三个关键方向: (i) 理论上的统一,揭示方法之间的共性与差异,从而实现系统性的比较; (ii) 理论依据,阐明现有方法的基础; (iii) 理论评估,严格证明方法是否满足可信性原则。除了全面综述外,我们还提供了这些研究如何有助于加深理论理解、指导方法选择并激发新的归因方法的见解。最后,我们讨论了值得进一步研究的有前景的开放问题。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-11 05:41:20 UTC 发布:2025-08-11 05:41:20 UTC
#141 Efficient Approximate Posterior Sampling with Annealed Langevin Monte Carlo #141 使用退火朗之万蒙特卡洛进行高效近似后验采样
Authors: [Advait Parulekar](https://arxiv.org/search/?searchtype=author&query=Advait Parulekar), [Litu Rout](https://arxiv.org/search/?searchtype=author&query=Litu Rout), [Karthikeyan Shanmugam](https://arxiv.org/search/?searchtype=author&query=Karthikeyan Shanmugam), [Sanjay Shakkottai](https://arxiv.org/search/?searchtype=author&query=Sanjay Shakkottai) 作者:Advait Parulekar、Litu Rout、Karthikeyan Shanmugam、Sanjay Shakkottai
We study the problem of posterior sampling in the context of score-based generative models. We have a trained score network for a prior p(x), a measurement model p(y|x), and are tasked with sampling from the posterior p(x|y). Prior work has shown this to be intractable in KL (in the worst case) under well-accepted computational hardness assumptions. Despite this, popular algorithms for tasks such as image super-resolution, stylization, and reconstruction enjoy empirical success. Rather than establishing distributional assumptions or restricted settings under which exact posterior sampling is tractable, we view this as a more general “tilting” problem of biasing a distribution towards a measurement. Under minimal assumptions, we show that one can tractably sample from a distribution that is simultaneously close to the posterior of a noised prior in KL divergence and the true posterior in Fisher divergence. Intuitively, this combination ensures that the resulting sample is consistent with both the measurement and the prior. To the best of our knowledge these are the first formal results for (approximate) posterior sampling in polynomial time. 我们在基于分数的生成模型背景下研究后验采样问题。我们拥有一个针对先验 p(x) 训练好的分数网络和一个测量模型 p(y|x),任务是从后验 p(x|y) 中采样。先前的工作表明,在广泛接受的计算困难假设下,这在 KL 意义上(最坏情况下)是不可行的。尽管如此,用于图像超分辨率、风格迁移和重建等任务的流行算法在经验上仍然取得了成功。我们不去建立使精确后验采样可行的分布假设或受限设置,而是将其视为将分布向测量偏置的更通用的“倾斜”问题。在最小假设下,我们证明了可以以可处理的方式从一个分布中采样,该分布在 KL 散度上接近加噪先验的后验,同时在费舍尔散度上接近真实后验。直观地说,这种组合确保得到的样本既与测量一致又与先验一致。据我们所知,这是首批关于以多项式时间进行(近似)后验采样的形式化结果。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-11 05:25:24 UTC 发表:2025-08-11 05:25:24 UTC
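To make the annealed Langevin entry concrete, here is a generic sketch of annealed Langevin posterior sampling on a one-dimensional Gaussian toy problem, where the score of the noised prior is known in closed form and replaces the trained score network. This is not the paper's algorithm or analysis. The toy model (prior x ~ N(0, 1), measurement y = x + N(0, meas_var)), the geometric noise schedule, and all function names are assumptions chosen for illustration.

```python
import math
import random


def geometric_schedule(start, end, levels):
    """Noise levels decreasing geometrically from start to end."""
    ratio = (end / start) ** (1.0 / (levels - 1))
    return [start * ratio ** i for i in range(levels)]


def annealed_langevin_sample(y, meas_var, sigmas, steps, eta, rng):
    """Draw one approximate posterior sample for the toy model
    x ~ N(0, 1), y = x + N(0, meas_var)."""
    # Start from the most heavily noised prior, N(0, 1 + sigma_0^2).
    x = rng.gauss(0.0, math.sqrt(1.0 + sigmas[0] ** 2))
    for sigma in sigmas:                      # anneal from large to small noise
        prior_var = 1.0 + sigma ** 2          # noised prior is N(0, 1 + sigma^2)
        for _ in range(steps):
            prior_score = -x / prior_var      # closed-form stand-in for a score network
            lik_score = (y - x) / meas_var    # gradient of log p(y | x) with respect to x
            noise = math.sqrt(2.0 * eta) * rng.gauss(0.0, 1.0)
            x += eta * (prior_score + lik_score) + noise
    return x


rng = random.Random(0)
sigmas = geometric_schedule(2.0, 0.01, 8)
samples = [annealed_langevin_sample(1.0, 0.25, sigmas, 50, 0.01, rng)
           for _ in range(1000)]
sample_mean = sum(samples) / len(samples)
```

For this toy model the exact posterior is Gaussian with mean y / (1 + meas_var), which is 0.8 for y = 1.0 and meas_var = 0.25, so `sample_mean` can be compared against that value as a sanity check.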
#142 InterChart: Benchmarking Visual Reasoning Across Decomposed and Distributed Chart Information #142 InterChart:在分解与分布图表信息下评估视觉推理
Authors: [Anirudh Iyengar Kaniyar Narayana Iyengar](https://arxiv.org/search/?searchtype=author&query=Anirudh Iyengar Kaniyar Narayana Iyengar), [Srija Mukhopadhyay](https://arxiv.org/search/?searchtype=author&query=Srija Mukhopadhyay), [Adnan Qidwai](https://arxiv.org/search/?searchtype=author&query=Adnan Qidwai), [Shubhankar Singh](https://arxiv.org/search/?searchtype=author&query=Shubhankar Singh), [Dan Roth](https://arxiv.org/search/?searchtype=author&query=Dan Roth), [Vivek Gupta](https://arxiv.org/search/?searchtype=author&query=Vivek Gupta) 作者:Anirudh Iyengar Kaniyar Narayana Iyengar、Srija Mukhopadhyay、Adnan Qidwai、Shubhankar Singh、Dan Roth、Vivek Gupta
We introduce InterChart, a diagnostic benchmark that evaluates how well vision-language models (VLMs) reason across multiple related charts, a task central to real-world applications such as scientific reporting, financial analysis, and public policy dashboards. Unlike prior benchmarks focusing on isolated, visually uniform charts, InterChart challenges models with diverse question types ranging from entity inference and trend correlation to numerical estimation and abstract multi-step reasoning grounded in 2-3 thematically or structurally related charts. We organize the benchmark into three tiers of increasing difficulty: (1) factual reasoning over individual charts, (2) integrative analysis across synthetically aligned chart sets, and (3) semantic inference over visually complex, real-world chart pairs. Our evaluation of state-of-the-art open and closed-source VLMs reveals consistent and steep accuracy declines as chart complexity increases. We find that models perform better when we decompose multi-entity charts into simpler visual units, underscoring their struggles with cross-chart integration. By exposing these systematic limitations, InterChart provides a rigorous framework for advancing multimodal reasoning in complex, multi-visual environments. 我们提出了 InterChart,这是一个诊断性基准,用于评估视觉-语言模型(VLMs)在多个相关图表间推理的能力——这一任务对于科学报告、财务分析和公共政策仪表盘等现实应用至关重要。与以往聚焦于孤立、视觉上单一的图表的基准不同,InterChart 通过多样化的问题类型对模型提出挑战,问题类型涵盖实体推断、趋势相关性、数值估计以及基于 2–3 个主题或结构相关图表的抽象多步推理。我们将基准组织为三层递增难度: (1) 针对单张图表的事实推理,(2) 针对合成对齐图表集的整合分析,(3) 针对视觉复杂的真实世界图表对的语义推断。我们对最新的开源和闭源 VLMs 的评估显示,随着图表复杂性的增加,准确率出现持续且陡峭的下降。我们发现,当将多实体图表分解为更简单的视觉单元时,模型表现更好,这凸显了它们在跨图表整合方面的困难。 通过揭示这些系统性局限性,InterChart 为在复杂的多视觉环境中推进多模态推理提供了严谨的框架。
Subjects: Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition 主题:计算与语言、人工智能、计算机视觉与模式识别
Publish: 2025-08-11 05:19:23 UTC 发布:2025-08-11 05:19:23 UTC
#143 Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization #143 Klear-Reasoner:通过保持梯度的裁剪策略优化提升推理能力
Authors: [Zhenpeng Su](https://arxiv.org/search/?searchtype=author&query=Zhenpeng Su), [Leiyu Pan](https://arxiv.org/search/?searchtype=author&query=Leiyu Pan), [Xue Bai](https://arxiv.org/search/?searchtype=author&query=Xue Bai), [Dening Liu](https://arxiv.org/search/?searchtype=author&query=Dening Liu), [Guanting Dong](https://arxiv.org/search/?searchtype=author&query=Guanting Dong), [Jiaming Huang](https://arxiv.org/search/?searchtype=author&query=Jiaming Huang), [Wenping Hu](https://arxiv.org/search/?searchtype=author&query=Wenping Hu), [Guorui Zhou](https://arxiv.org/search/?searchtype=author&query=Guorui Zhou) 作者:苏振鹏、潘雷宇、白雪、刘德宁、董冠廷、黄家明、胡文平、周果睿
We present Klear-Reasoner, a model with long reasoning capabilities that demonstrates careful deliberation during problem solving, achieving outstanding performance across multiple benchmarks. Although there are already many excellent works related to inference models in the current community, there are still many problems with reproducing high-performance inference models due to incomplete disclosure of training details. This report provides an in-depth analysis of the reasoning model, covering the entire post-training workflow from data preparation and long Chain-of-Thought supervised fine-tuning (long CoT SFT) to reinforcement learning (RL), along with detailed ablation studies for each experimental component. For SFT data, our experiments show that a small number of high-quality data sources are more effective than a large number of diverse data sources, and that difficult samples can achieve better results without accuracy filtering. In addition, we investigate two key issues with current clipping mechanisms in RL: Clipping suppresses critical exploration signals and ignores suboptimal trajectories. To address these challenges, we propose Gradient-Preserving clipping Policy Optimization (GPPO) that gently backpropagates gradients from clipped tokens. GPPO not only enhances the model’s exploration capacity but also improves its efficiency in learning from negative samples. Klear-Reasoner exhibits exceptional reasoning abilities in mathematics and programming, scoring 90.5% on AIME 2024, 83.2% on AIME 2025, 66.0% on LiveCodeBench V5 and 58.1% on LiveCodeBench V6. 
我们提出了 Klear-Reasoner,这是一种具备长链推理能力的模型,在问题求解过程中表现出谨慎的推敲,在多个基准测试中取得了出色的表现。尽管当前社区已有许多与推理模型相关的优秀工作,但由于训练细节披露不完整,仍然存在复制高性能推理模型的诸多问题。本报告对该推理模型进行了深入分析,涵盖从数据准备、长链思路监督微调(long CoT SFT)到强化学习(RL)的整个后训练工作流,并对每个实验组件进行了详尽的消融研究。关于 SFT 数据,我们的实验证明少量高质量的数据源比大量多样化的数据源更有效,并且困难样本在不进行准确性筛选的情况下也能取得更好的结果。此外,我们还研究了当前强化学习中两项关键的裁剪机制问题:裁剪抑制了重要的探索信号并忽视了次优轨迹。 为了解决这些挑战,我们提出了梯度保持裁剪策略优化(Gradient-Preserving clipping Policy Optimization,GPPO),它可以将来自被裁剪标记的梯度温和地反向传播。GPPO 不仅增强了模型的探索能力,还提高了其从负样本中学习的效率。Klear-Reasoner 在数学和编程方面表现出卓越的推理能力,在 AIME 2024 上得分 90.5%,在 AIME 2025 上得分 83.2%,在 LiveCodeBench V5 上得分 66.0%,在 LiveCodeBench V6 上得分 58.1%。
Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题:机器学习、人工智能、计算与语言
Publish: 2025-08-11 05:17:51 UTC 发布时间:2025-08-11 05:17:51 UTC
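The Klear-Reasoner abstract says GPPO "gently backpropagates gradients from clipped tokens". One way to realize that idea is a stop-gradient rescaling that keeps the clipped forward value but lets a bounded gradient through. The sketch below computes the per-token value and its derivative by hand so that it runs without an autograd library. The exact form used here (`bound * ratio / stopgrad(ratio)`) is an assumption for illustration and is not confirmed by the abstract.

```python
def surrogate_term(ratio, advantage, eps=0.2, preserve_grad=False):
    """Return (value, d value / d ratio) for one token of a PPO-style
    clipped surrogate min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    low, high = 1.0 - eps, 1.0 + eps
    bound = min(max(ratio, low), high)        # clipped importance ratio
    unclipped = ratio * advantage
    clipped = bound * advantage
    if unclipped <= clipped:
        # The min selects the unclipped term, so the gradient is just A.
        return unclipped, advantage
    if not preserve_grad:
        # Standard clipping: the selected term is constant in ratio.
        return clipped, 0.0
    # Assumed gradient-preserving variant: the forward value equals
    # bound * ratio / stopgrad(ratio) * A = bound * A, while the derivative
    # with respect to ratio is (bound / ratio) * A instead of zero.
    return clipped, (bound / ratio) * advantage


standard = surrogate_term(1.5, 1.0)
preserving = surrogate_term(1.5, 1.0, preserve_grad=True)
negative = surrogate_term(0.5, -1.0, preserve_grad=True)
```

In an autograd framework the same effect would typically come from a detach or stop-gradient call on the ratio in the denominator; the hand-written derivative above only mirrors what that would produce for a single token.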
#144 SOFA: Deep Learning Framework for Simulating and Optimizing Atrial Fibrillation Ablation #144 SOFA:用于模拟和优化房颤消融的深度学习框架
Authors: [Yunsung Chung](https://arxiv.org/search/?searchtype=author&query=Yunsung Chung), [Chanho Lim](https://arxiv.org/search/?searchtype=author&query=Chanho Lim), [Ghassan Bidaoui](https://arxiv.org/search/?searchtype=author&query=Ghassan Bidaoui), [Christian Massad](https://arxiv.org/search/?searchtype=author&query=Christian Massad), [Nassir Marrouche](https://arxiv.org/search/?searchtype=author&query=Nassir Marrouche), [Jihun Hamm](https://arxiv.org/search/?searchtype=author&query=Jihun Hamm) 作者:Yunsung Chung, Chanho Lim, Ghassan Bidaoui, Christian Massad, Nassir Marrouche, Jihun Hamm
Atrial fibrillation (AF) is a prevalent cardiac arrhythmia often treated with catheter ablation procedures, but procedural outcomes are highly variable. Evaluating and improving ablation efficacy is challenging due to the complex interaction between patient-specific tissue and procedural factors. This paper asks two questions: Can AF recurrence be predicted by simulating the effects of procedural parameters? How should we ablate to reduce AF recurrence? We propose SOFA (Simulating and Optimizing Atrial Fibrillation Ablation), a novel deep-learning framework that addresses these questions. SOFA first simulates the outcome of an ablation strategy by generating a post-ablation image depicting scar formation, conditioned on a patient’s pre-ablation LGE-MRI and the specific procedural parameters used (e.g., ablation locations, duration, temperature, power, and force). During this simulation, it predicts AF recurrence risk. Critically, SOFA then introduces an optimization scheme that refines these procedural parameters to minimize the predicted risk. Our method leverages a multi-modal, multi-view generator that processes 2.5D representations of the atrium. Quantitative evaluations show that SOFA accurately synthesizes post-ablation images and that our optimization scheme leads to a 22.18% reduction in the model-predicted recurrence risk. To the best of our knowledge, SOFA is the first framework to integrate the simulation of procedural effects, recurrence prediction, and parameter optimization, offering a novel tool for personalizing AF ablation. 
房颤(AF)是一种常见的心律失常,通常通过导管消融手术治疗,但手术结果具有高度可变性。由于患者特异性组织与手术参数之间的复杂相互作用,评估和提高消融疗效具有挑战性。本文提出两个问题:是否可以通过模拟手术参数的影响来预测房颤复发?我们应如何进行消融以减少房颤复发?我们提出了 SOFA(Simulating and Optimizing Atrial Fibrillation Ablation),一种新颖的深度学习框架来解决这些问题。SOFA 首先通过生成一张描绘瘢痕形成的术后图像来模拟某一消融策略的结果,该生成过程以患者术前的 LGE-MRI 及所用的具体手术参数(例如消融位置、持续时间、温度、功率和力量)为条件。在此模拟过程中,它会预测房颤复发风险。关键是,SOFA 随后引入一种优化方案,细化这些手术参数以尽量降低预测到的风险。我们的方法利用了一个多模态、多视角的生成器,处理心房的 2.5D 表示。 定量评估表明,SOFA 能准确合成消融后的影像,并且我们的优化方案使模型预测的复发风险降低了 22.18%。据我们所知,SOFA 是首个将手术效果模拟、复发预测和参数优化整合在一起的框架,为个性化房颤消融提供了一种新颖的工具。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-11 05:01:54 UTC 发布日期:2025-08-11 05:01:54 UTC
#145 On the Limits of Selective AI Prediction: A Case Study in Clinical Decision Making #145 关于选择性人工智能预测的局限性:临床决策中的一个案例研究
Authors: [Sarah Jabbour](https://arxiv.org/search/?searchtype=author&query=Sarah Jabbour), [David Fouhey](https://arxiv.org/search/?searchtype=author&query=David Fouhey), [Nikola Banovic](https://arxiv.org/search/?searchtype=author&query=Nikola Banovic), [Stephanie D. Shepard](https://arxiv.org/search/?searchtype=author&query=Stephanie D. Shepard), [Ella Kazerooni](https://arxiv.org/search/?searchtype=author&query=Ella Kazerooni), [Michael W. Sjoding](https://arxiv.org/search/?searchtype=author&query=Michael W. Sjoding), [Jenna Wiens](https://arxiv.org/search/?searchtype=author&query=Jenna Wiens) 作者:Sarah Jabbour、David Fouhey、Nikola Banovic、Stephanie D. Shepard、Ella Kazerooni、Michael W. Sjoding、Jenna Wiens
AI has the potential to augment human decision making. However, even high-performing models can produce inaccurate predictions when deployed. These inaccuracies, combined with automation bias, where humans overrely on AI predictions, can result in worse decisions. Selective prediction, in which potentially unreliable model predictions are hidden from users, has been proposed as a solution. This approach assumes that when AI abstains and informs the user so, humans make decisions as they would without AI involvement. To test this assumption, we study the effects of selective prediction on human decisions in a clinical context. We conducted a user study of 259 clinicians tasked with diagnosing and treating hospitalized patients. We compared their baseline performance without any AI involvement to their AI-assisted accuracy with and without selective prediction. Our findings indicate that selective prediction mitigates the negative effects of inaccurate AI in terms of decision accuracy. Compared to no AI assistance, clinician accuracy declined when shown inaccurate AI predictions (66% [95% CI: 56%-75%] vs. 56% [95% CI: 46%-66%]), but recovered under selective prediction (64% [95% CI: 54%-73%]). However, while selective prediction nearly maintains overall accuracy, our results suggest that it alters patterns of mistakes: when informed the AI abstains, clinicians underdiagnose (18% increase in missed diagnoses) and undertreat (35% increase in missed treatments) compared to no AI input at all. Our findings underscore the importance of empirically validating assumptions about how humans engage with AI within human-AI systems. 
人工智能有可能增强人类的决策能力。然而,即便是表现良好的模型在部署时也可能产生不准确的预测。这些不准确性,加上自动化偏差(即人类对 AI 预测的过度依赖),可能导致更差的决策。选择性预测(selective prediction),即对可能不可靠的模型预测向用户隐藏,已被提出作为一种解决方案。这种方法假定当 AI 弃权并告知用户时,人类会像在没有 AI 参与的情况下那样做出决策。为检验这一假设,我们研究了选择性预测对临床环境中人类决策的影响。我们对 259 名临床医生进行了用户研究,任务是诊断和治疗住院患者。我们将他们在没有任何 AI 参与时的基线表现与有 AI 辅助且有无选择性预测时的准确性进行了比较。我们的研究结果表明,就决策准确性而言,选择性预测减轻了不准确 AI 的负面影响。与没有 AI 辅助相比,当显示不准确的 AI 预测时,临床医生的准确性下降(66% [95% CI:56%–75%] 对比 56% [95% CI:46%–66%]),但在选择性预测下恢复至 64% [95% CI:54%–73%]。然而,尽管选择性预测几乎保持了整体准确性,我们的结果表明它改变了错误的模式:当得知 AI 弃权时,与完全没有 AI 输入相比,临床医生会出现诊断不足(漏诊增加 18%)和治疗不足(漏治增加 35%)。我们的发现强调了在人机协作系统中,对关于人类如何与 AI 互动的假设进行实证验证的重要性。
Subjects: Human-Computer Interaction, Artificial Intelligence 主题:人机交互,人工智能
Publish: 2025-08-11 04:53:13 UTC
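The selective prediction setup described in this abstract can be illustrated with a minimal sketch. It assumes a simple confidence-threshold rule; the abstract does not say how the study's model decided to abstain, so the function names, the threshold value, and the message wording below are all hypothetical.

```python
from typing import Optional, Sequence


def selective_predict(probs: Sequence[float], threshold: float = 0.8) -> Optional[int]:
    """Return the most likely class index, or None to signal abstention.

    The threshold rule is an assumption for illustration only.
    """
    top = max(range(len(probs)), key=lambda i: probs[i])
    return top if probs[top] >= threshold else None


def display_message(prediction: Optional[int], labels: Sequence[str]) -> str:
    """Build the text shown to the user, including an explicit abstention notice."""
    if prediction is None:
        return "AI abstained: no prediction shown"
    return f"AI suggests: {labels[prediction]}"


labels = ["condition A", "condition B"]  # hypothetical labels
print(display_message(selective_predict([0.93, 0.07]), labels))
print(display_message(selective_predict([0.55, 0.45]), labels))
```

The abstract's point is that the abstention branch is not behaviorally neutral: clinicians who saw the abstention notice made different kinds of mistakes than clinicians who saw no AI output at all.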
#146 ShoulderShot: Generating Over-the-Shoulder Dialogue Videos #146 ShoulderShot:生成越肩视角的对话视频
Authors: [Yuang Zhang](https://arxiv.org/search/?searchtype=author&query=Yuang Zhang), [Junqi Cheng](https://arxiv.org/search/?searchtype=author&query=Junqi Cheng), [Haoyu Zhao](https://arxiv.org/search/?searchtype=author&query=Haoyu Zhao), [Jiaxi Gu](https://arxiv.org/search/?searchtype=author&query=Jiaxi Gu), [Fangyuan Zou](https://arxiv.org/search/?searchtype=author&query=Fangyuan Zou), [Zenghui Lu](https://arxiv.org/search/?searchtype=author&query=Zenghui Lu), [Peng Shu](https://arxiv.org/search/?searchtype=author&query=Peng Shu) 作者:张远、程俊奇、赵浩宇、顾嘉熙、邹方元、卢增辉、舒鹏
Over-the-shoulder dialogue videos are essential in films, short dramas, and advertisements, providing visual variety and enhancing viewers’ emotional connection. Despite their importance, such dialogue scenes remain largely underexplored in video generation research. The main challenges include maintaining character consistency across different shots, creating a sense of spatial continuity, and generating long, multi-turn dialogues within limited computational budgets. Here, we present ShoulderShot, a framework that combines dual-shot generation with looping video, enabling extended dialogues while preserving character consistency. Our results demonstrate capabilities that surpass existing methods in terms of shot-reverse-shot layout, spatial continuity, and flexibility in dialogue length, thereby opening up new possibilities for practical dialogue video generation. Videos and comparisons are available at https://shouldershot.github.io. 过肩对话镜头在电影、短剧和广告中至关重要,它们提供了视觉上的变化并增强了观众的情感联系。尽管其重要性,视频生成研究中对这类对话场景的探索仍然很少。主要挑战包括在不同镜头间保持角色一致性、营造空间连续感,以及在有限计算预算下生成长且多轮的对话。在此,我们提出了 ShoulderShot,一个将双镜头生成与循环视频相结合的框架,能够在保持角色一致性的同时实现延长的对话。我们的结果在镜头—反镜头布局、空间连续性和对话长度的灵活性方面均超越了现有方法,从而为实用的对话视频生成开辟了新可能。视频和对比可见于 https://shouldershot.github.io 。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-11 03:56:23 UTC 发布时间:2025-08-11 03:56:23 UTC
#147 IBPS: Indian Bail Prediction System
Authors: [Puspesh Kumar Srivastava](https://arxiv.org/search/?searchtype=author&query=Puspesh Kumar Srivastava), [Uddeshya Raj](https://arxiv.org/search/?searchtype=author&query=Uddeshya Raj), [Praveen Patel](https://arxiv.org/search/?searchtype=author&query=Praveen Patel), [Shubham Kumar Nigam](https://arxiv.org/search/?searchtype=author&query=Shubham Kumar Nigam), [Noel Shallum](https://arxiv.org/search/?searchtype=author&query=Noel Shallum), [Arnab Bhattacharya](https://arxiv.org/search/?searchtype=author&query=Arnab Bhattacharya) 作者:Puspesh Kumar Srivastava、Uddeshya Raj、Praveen Patel、Shubham Kumar Nigam、Noel Shallum、Arnab Bhattacharya
Bail decisions are among the most frequently adjudicated matters in Indian courts, yet they remain plagued by subjectivity, delays, and inconsistencies. With over 75% of India’s prison population comprising undertrial prisoners, many from socioeconomically disadvantaged backgrounds, the lack of timely and fair bail adjudication exacerbates human rights concerns and contributes to systemic judicial backlog. In this paper, we present the Indian Bail Prediction System (IBPS), an AI-powered framework designed to assist in bail decision-making by predicting outcomes and generating legally sound rationales based solely on factual case attributes and statutory provisions. We curate and release a large-scale dataset of 150,430 High Court bail judgments, enriched with structured annotations such as age, health, criminal history, crime category, custody duration, statutes, and judicial reasoning. We fine-tune a large language model using parameter-efficient techniques and evaluate its performance across multiple configurations, with and without statutory context, and with RAG. Our results demonstrate that models fine-tuned with statutory knowledge significantly outperform baselines, achieving strong accuracy and explanation quality, and generalize well to a test set independently annotated by legal experts. IBPS offers a transparent, scalable, and reproducible solution to support data-driven legal assistance, reduce bail delays, and promote procedural fairness in the Indian judicial system. 保释决定是印度法院中最常被裁决的事项之一,但仍然受主观性、延误和不一致性的困扰。印度超过 75%的监狱人口是未决犯,其中许多人来自社会经济上处于不利地位的背景,及时且公正的保释裁决的缺失加剧了人权问题并导致司法积压。在本文中,我们提出了印度保释预测系统(IBPS),这是一个由人工智能驱动的框架,旨在通过仅基于案件事实属性和法定条文来预测结果并生成具有法律依据的理由,从而协助保释决策。我们整理并发布了一个包含 150,430 份高等法院保释判决的大规模数据集,该数据集配有结构化注释,如年龄、健康状况、犯罪记录、罪行类别、羁押时长、相关法条以及司法推理。我们使用参数高效技术对一个大型语言模型进行微调,并在多种配置下评估其性能,包括有无法定背景信息以及结合检索增强生成(RAG)的情形。 我们的结果表明,经过法定知识微调的模型显著优于基线模型,取得了较高的准确性和解释质量,并能很好地泛化到由法律专家独立注释的测试集。IBPS 提供了一种透明、可扩展且可重复的解决方案,用以支持数据驱动的法律援助、减少保释延误并促进印度司法体系的程序公正。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-11 03:44:17 UTC 发布:2025-08-11 03:44:17 UTC
#148 Towards Theoretical Understanding of Transformer Test-Time Computing: Investigation on In-Context Linear Regression #148 朝着对变换器测试时计算的理论理解:对上下文内线性回归的研究
Authors: [Xingwu Chen](https://arxiv.org/search/?searchtype=author&query=Xingwu Chen), [Miao Lu](https://arxiv.org/search/?searchtype=author&query=Miao Lu), [Beining Wu](https://arxiv.org/search/?searchtype=author&query=Beining Wu), [Difan Zou](https://arxiv.org/search/?searchtype=author&query=Difan Zou) 作者:陈兴武、卢淼、吴贝宁、邹迪凡
Using more test-time computation during language model inference, such as generating more intermediate thoughts or sampling multiple candidate answers, has proven effective in significantly improving model performance. This paper takes an initial step toward bridging the gap between practical language model inference and theoretical transformer analysis by incorporating randomness and sampling. We focus on in-context linear regression with continuous/binary coefficients, where our framework simulates language model decoding through noise injection and binary coefficient sampling. Through this framework, we provide detailed analyses of widely adopted inference techniques. Supported by empirical results, our theoretical framework and analysis demonstrate the potential for offering new insights into understanding inference behaviors in real-world language models. 在语言模型推理过程中使用更多的测试时计算,例如生成更多的中间思路或采样多个候选答案,被证明能显著提高模型性能。本文迈出初步步伐,试图通过引入随机性和采样将实用的语言模型推理与理论变换器分析之间的鸿沟弥合。我们关注具有连续/二元系数的上下文内线性回归,其中我们的框架通过噪声注入和二元系数采样来模拟语言模型的解码过程。通过该框架,我们对广泛采用的推理技术进行了详细分析。在经验证据的支持下,我们的理论框架和分析展示了为理解现实世界语言模型中的推理行为提供新见解的潜力。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-11 03:05:36 UTC 发布时间:2025-08-11 03:05:36 UTC
#149 Retrieval-Augmented Multi-Agent System for Rapid Statement of Work Generation #149 检索增强型多代理系统用于快速生成工作说明书
Authors: [Amulya Suravarjhula](https://arxiv.org/search/?searchtype=author&query=Amulya Suravarjhula), [Rashi Chandrashekhar Agrawal](https://arxiv.org/search/?searchtype=author&query=Rashi Chandrashekhar Agrawal), [Sakshi Jayesh Patel](https://arxiv.org/search/?searchtype=author&query=Sakshi Jayesh Patel), [Rahul Gupta](https://arxiv.org/search/?searchtype=author&query=Rahul Gupta) 作者:Amulya Suravarjhula、Rashi Chandrashekhar Agrawal、Sakshi Jayesh Patel、Rahul Gupta
Drafting a Statement of Work (SOW) is a vital part of business and legal projects. It outlines key details like deliverables, timelines, responsibilities, and legal terms. However, creating these documents is often a slow and complex process. It usually involves multiple people, takes several days, and leaves room for errors or outdated content. This paper introduces a new AI-driven automation system that makes the entire SOW drafting process faster, easier, and more accurate. Instead of relying completely on humans, the system uses three intelligent components or ‘agents’ that each handle a part of the job. One agent writes the first draft, another checks if everything is legally correct, and the third agent formats the document and ensures everything is in order. Unlike basic online tools that just fill in templates, this system understands the meaning behind the content and customizes the SOW to match the needs of the project. It also checks legal compliance and formatting so that users can trust the result. The system was tested using real business examples. It was able to create a full SOW in under three minutes, compared to several hours or days using manual methods. It also performed well in accuracy and quality, showing that it can reduce legal risks and save a lot of time. This solution shows how artificial intelligence can be used to support legal and business professionals by taking care of routine work and helping them focus on more important decisions. It’s a step toward making legal processes smarter, faster, and more reliable. 
起草工作说明书(SOW)是商业和法律项目中的关键环节。它概述了交付物、时间表、职责和法律条款等关键细节。然而,创建这些文档通常是一个缓慢且复杂的过程。它通常需要多人参与,耗时数天,并且存在错误或内容过时的风险。本文介绍了一种新的由人工智能驱动的自动化系统,使整个 SOW 起草过程更快、更简单且更准确。该系统不完全依赖人工,而是使用三个各司其职的智能组件或“代理”来处理工作。一位代理撰写初稿,另一位代理检查所有内容的法律合规性,第三位代理对文档进行格式化并确保一切就绪。与仅填充模板的基础在线工具不同,该系统理解内容背后的含义,并将 SOW 定制以匹配项目需求。它还检查法律合规性和格式,使用户能够信赖结果。该系统已通过真实业务示例进行测试。与手动方法需数小时或数天相比,它能够在三分钟内生成完整的 SOW。 它在准确性和质量方面也表现良好,表明它可以降低法律风险并节省大量时间。该解决方案展示了如何利用人工智能来支持法律和业务专业人员,处理日常工作并帮助他们专注于更重要的决策。这是使法律流程更智能、更快速、更可靠的一步。
Subjects: Multiagent Systems, Artificial Intelligence 主题:多智能体系统,人工智能
Publish: 2025-08-11 02:59:36 UTC 发布:2025-08-11 02:59:36 UTC
#150 A Small-footprint Acoustic Echo Cancellation Solution for Mobile Full-Duplex Speech Interactions #150 一种用于移动全双工语音交互的小占用声学回声消除方案
Authors: [Yiheng Jiang](https://arxiv.org/search/?searchtype=author&query=Yiheng Jiang), [Tian Biao](https://arxiv.org/search/?searchtype=author&query=Tian Biao) 作者:蒋一衡,田彪
In full-duplex speech interaction systems, effective Acoustic Echo Cancellation (AEC) is crucial for recovering echo-contaminated speech. This paper presents a neural network-based AEC solution to address challenges in mobile scenarios with varying hardware, nonlinear distortions and long latency. We first incorporate diverse data augmentation strategies to enhance the model’s robustness across various environments. Moreover, progressive learning is employed to incrementally improve AEC effectiveness, resulting in a considerable improvement in speech quality. To further optimize AEC’s downstream applications, we introduce a novel post-processing strategy employing tailored parameters designed specifically for tasks such as Voice Activity Detection (VAD) and Automatic Speech Recognition (ASR), thus enhancing their overall efficacy. Finally, our method employs a small-footprint model with streaming inference, enabling seamless deployment on mobile devices. Empirical results demonstrate effectiveness of the proposed method in Echo Return Loss Enhancement and Perceptual Evaluation of Speech Quality, alongside significant improvements in both VAD and ASR results. 在全双工语音交互系统中,有效的声学回声消除(AEC)对于恢复受回声污染的语音至关重要。本文提出了一种基于神经网络的 AEC 方案,以应对移动场景中多变的硬件、非线性失真和长延迟等挑战。我们首先引入多样化的数据增强策略,以提升模型在各种环境下的鲁棒性。此外,采用渐进式学习逐步提升 AEC 效果,从而显著改善语音质量。为了进一步优化 AEC 在下游任务中的表现,我们提出了一种新颖的后处理策略,使用为语音活动检测(VAD)和自动语音识别(ASR)等任务专门设计的定制参数,从而增强它们的整体效果。最后,我们的方法采用小体积模型并支持流式推理,能够在移动设备上实现无缝部署。实证结果表明,所提出的方法在回声损失增强(Echo Return Loss Enhancement)和语音感知质量评估(Perceptual Evaluation of Speech Quality)方面具有有效性,并在 VAD 和 ASR 结果上取得了显著改进。
Subjects: Sound, Artificial Intelligence 主题:声音,人工智能
Publish: 2025-08-11 02:45:31 UTC 发布时间:2025-08-11 02:45:31 协调世界时(UTC)
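Echo Return Loss Enhancement (ERLE), one of the metrics named in this abstract, is conventionally the ratio of microphone-signal power to residual power after cancellation, in decibels, measured while only the far-end speaker is active. The sketch below uses that standard definition; the function name and the toy signals are illustrative and do not come from the paper.

```python
import math
from typing import Sequence


def erle_db(mic: Sequence[float], residual: Sequence[float]) -> float:
    """ERLE in dB: 10 * log10(power of mic signal / power of AEC output).

    Both sequences are assumed to cover the same far-end single-talk segment.
    """
    if len(mic) != len(residual):
        raise ValueError("signals must have the same length")
    mic_power = sum(x * x for x in mic)
    residual_power = sum(x * x for x in residual)
    return 10.0 * math.log10(mic_power / residual_power)


# Toy example: the canceller reduces the echo amplitude by a factor of 10,
# i.e. the power by a factor of 100, which corresponds to about 20 dB.
mic = [1.0, -1.0, 1.0, -1.0]
residual = [0.1, -0.1, 0.1, -0.1]
print(round(erle_db(mic, residual), 2))
```

Higher values mean more echo was removed. The metric says nothing about near-end speech distortion, which is presumably why the abstract reports perceptual quality alongside it.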
#151 Uncertainty-Driven Reliability: Selective Prediction and Trustworthy Deployment in Modern Machine Learning #151 不确定性驱动的可靠性:现代机器学习中的选择性预测与可信部署
Author: [Stephan Rabanser](https://arxiv.org/search/?searchtype=author&query=Stephan Rabanser) 作者:Stephan Rabanser
Machine learning (ML) systems are increasingly deployed in high-stakes domains where reliability is paramount. This thesis investigates how uncertainty estimation can enhance the safety and trustworthiness of ML, focusing on selective prediction – where models abstain when confidence is low. We first show that a model’s training trajectory contains rich uncertainty signals that can be exploited without altering its architecture or loss. By ensembling predictions from intermediate checkpoints, we propose a lightweight, post-hoc abstention method that works across tasks, avoids the cost of deep ensembles, and achieves state-of-the-art selective prediction performance. Crucially, this approach is fully compatible with differential privacy (DP), allowing us to study how privacy noise affects uncertainty quality. We find that while many methods degrade under DP, our trajectory-based approach remains robust, and we introduce a framework for isolating the privacy-uncertainty trade-off. Next, we develop a finite-sample decomposition of the selective classification gap – the deviation from the oracle accuracy-coverage curve – identifying five interpretable error sources and clarifying which interventions can close the gap. This explains why calibration alone cannot fix ranking errors, motivating methods that improve uncertainty ordering. Finally, we show that uncertainty signals can be adversarially manipulated to hide errors or deny service while maintaining high accuracy, and we design defenses combining calibration audits with verifiable inference. Together, these contributions advance reliable ML by improving, evaluating, and safeguarding uncertainty estimation, enabling models that not only make accurate predictions – but also know when to say “I do not know”.
机器学习(ML)系统正越来越多地部署在可靠性至关重要的高风险领域。本文探讨了不确定性估计如何增强机器学习的安全性和可信度,重点研究选择性预测——即当模型置信度低时撤回预测。我们首先展示了模型的训练轨迹包含丰富的不确定性信号,这些信号可以在不改变模型架构或损失函数的情况下被利用。通过对中间检查点的预测进行集成,我们提出了一种轻量级的、事后启用的撤回方法,该方法适用于各种任务,避免了深度集成的高昂成本,并且在选择性预测性能上达到最先进水平。关键是,这种方法完全兼容差分隐私(DP),使我们能够研究隐私噪声如何影响不确定性的质量。我们发现,尽管许多方法在差分隐私下性能下降,但基于训练轨迹的方法仍然稳健,并且我们提出了一个用于分离隐私与不确定性权衡的框架。 接下来,我们构建了选择性分类差距(即偏离理想准确率-覆盖率曲线的偏差)在有限样本下的分解——识别出五种可解释的误差来源并澄清哪些干预措施可以弥补该差距。这解释了为什么仅仅校准无法修复排序错误,从而促使提出改善不确定性排序的方法。最后,我们展示了不确定性信号可以被对抗性操控以掩盖错误或拒绝服务,同时仍保持高准确率,并设计了将校准审计与可验证推理相结合的防御措施。综上,这些贡献通过改进、评估和保护不确定性估计推进了可靠的机器学习,使模型不仅能做出准确的预测——也能知道何时说“我不知道”。
Subjects: Machine Learning, Artificial Intelligence, Computers and Society, Machine Learning 主题:机器学习、人工智能、计算机与社会、机器学习
Publish: 2025-08-11 02:33:53 UTC 发表:2025-08-11 02:33:53 UTC
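Two ideas from this abstract, ensembling predictions from intermediate checkpoints and the accuracy-coverage curve, can be sketched in a few lines. The abstract does not specify how the thesis combines checkpoints, so the plain probability averaging below is an assumption, and both function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def ensemble_confidence(checkpoint_probs: Sequence[Sequence[float]]) -> Tuple[int, float]:
    """Average class probabilities over checkpoints (assumed scheme).

    Returns the predicted class and its averaged probability as the confidence.
    """
    n = len(checkpoint_probs)
    k = len(checkpoint_probs[0])
    mean = [sum(p[c] for p in checkpoint_probs) / n for c in range(k)]
    pred = max(range(k), key=lambda c: mean[c])
    return pred, mean[pred]


def accuracy_coverage_curve(confidences: Sequence[float],
                            correct: Sequence[bool]) -> List[Tuple[float, float]]:
    """Accept examples from most to least confident.

    Each point is (coverage, accuracy on the accepted examples).
    """
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    curve = []
    hits = 0
    for rank, i in enumerate(order, start=1):
        hits += int(correct[i])
        curve.append((rank / len(order), hits / rank))
    return curve


# Toy data: the least confident example is the only wrong one, so accuracy
# stays at 1.0 until full coverage.
for coverage, accuracy in accuracy_coverage_curve([0.9, 0.6, 0.8], [True, False, True]):
    print(f"coverage={coverage:.2f} accuracy={accuracy:.2f}")
```

The curve depends only on how the confidences order the examples, not on their absolute values. That is consistent with the abstract's remark that calibration alone cannot fix ranking errors.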
#152 A DICOM Image De-identification Algorithm in the MIDI-B Challenge #152 MIDI-B 挑战中的 DICOM 图像去标识化算法
Authors: [Hongzhu Jiang](https://arxiv.org/search/?searchtype=author&query=Hongzhu Jiang), [Sihan Xie](https://arxiv.org/search/?searchtype=author&query=Sihan Xie), [Zhiyu Wan](https://arxiv.org/search/?searchtype=author&query=Zhiyu Wan) 作者:姜鸿烛,谢思涵,万志宇
Image de-identification is essential for the public sharing of medical images, particularly in the widely used Digital Imaging and Communications in Medicine (DICOM) format as required by various regulations and standards, including Health Insurance Portability and Accountability Act (HIPAA) privacy rules, the DICOM PS3.15 standard, and best practices recommended by the Cancer Imaging Archive (TCIA). The Medical Image De-Identification Benchmark (MIDI-B) Challenge at the 27th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2024) was organized to evaluate rule-based DICOM image de-identification algorithms with a large dataset of clinical DICOM images. In this report, we explore the critical challenges of de-identifying DICOM images, emphasize the importance of removing personally identifiable information (PII) to protect patient privacy while ensuring the continued utility of medical data for research, diagnostics, and treatment, and provide a comprehensive overview of the standards and regulations that govern this process. Additionally, we detail the de-identification methods we applied - such as pixel masking, date shifting, date hashing, text recognition, text replacement, and text removal - to process datasets during the test phase in strict compliance with these standards. According to the final leaderboard of the MIDI-B challenge, the latest version of our solution algorithm correctly executed 99.92% of the required actions and ranked 2nd out of 10 teams that completed the challenge (from a total of 22 registered teams). Finally, we conducted a thorough analysis of the resulting statistics and discussed the limitations of current approaches and potential avenues for future improvement. 
图像去标识对于公开共享医学影像至关重要,尤其是在广泛使用的 DICOM(医学数字成像和通信)格式中,这也是各种法规和标准所要求的,包括《健康保险可携带性与责任法案》(HIPAA)隐私规则、DICOM PS3.15 标准,以及癌症影像档案(TCIA)推荐的最佳实践。第 27 届国际医学影像计算与计算机辅助干预会议(MICCAI 2024)上的医学影像去标识基准(MIDI-B)挑战赛旨在利用大量临床 DICOM 图像数据集评估基于规则的 DICOM 图像去标识算法。在本报告中,我们探讨了去标识 DICOM 图像的关键挑战,强调在保护患者隐私的同时去除个人可识别信息(PII)以确保医学数据在研究、诊断和治疗中持续可用的重要性,并对规范该过程的标准和法规进行了全面概述。 另外,我们详细说明了在测试阶段为严格遵守这些标准而对数据集实施的去标识化方法——例如像素遮盖、日期偏移、日期哈希、文本识别、文本替换和文本移除。根据 MIDI-B 挑战赛的最终排行榜,我们方案算法的最新版本正确执行了 99.92%的必需操作,在完成挑战的 10 支队伍中排名第 2(共有 22 支队伍注册)。最后,我们对所得统计数据进行了全面分析,并讨论了当前方法的局限性及未来改进的潜在方向。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-11 01:38:07 UTC 发布:2025-08-11 01:38:07 协调世界时 (UTC)
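Two of the de-identification actions listed in this abstract, date shifting and hashing, can be sketched with the standard library alone. This is an illustrative outline and not the team's algorithm: the per-patient offset derivation, the secret, and the function names are assumptions. The only DICOM-specific detail relied on is that DA (date) values are `YYYYMMDD` strings.

```python
import hashlib
from datetime import datetime, timedelta


def patient_offset(patient_id: str, secret: str, max_days: int = 365) -> int:
    """Derive a stable per-patient shift in days (1..max_days) from a keyed hash.

    Using one offset for all of a patient's dates preserves intervals between studies.
    """
    digest = hashlib.sha256((secret + patient_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % max_days + 1


def shift_dicom_date(da: str, offset_days: int) -> str:
    """Shift a DICOM DA value (YYYYMMDD) by a number of days."""
    shifted = datetime.strptime(da, "%Y%m%d") + timedelta(days=offset_days)
    return shifted.strftime("%Y%m%d")


def hash_identifier(value: str, secret: str, length: int = 16) -> str:
    """Replace an identifier with a truncated keyed SHA-256 hex digest."""
    return hashlib.sha256((secret + value).encode("utf-8")).hexdigest()[:length]


secret = "example-secret"            # hypothetical; keep any real key private
offset = patient_offset("PAT-001", secret)
print(shift_dicom_date("20240131", -offset))
print(hash_identifier("PAT-001", secret))
```

A real pipeline would apply such functions according to the attribute-level actions in DICOM PS3.15, and would handle burned-in pixel text separately through the pixel masking and text recognition steps the abstract mentions.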
#153 Conversational DNA: A New Visual Language for Understanding Dialogue Structure in Human and AI #153 会话 DNA:一种用于理解人类与人工智能对话结构的新视觉语言
Author: [Baihan Lin](https://arxiv.org/search/?searchtype=author&query=Baihan Lin) 作者:林柏涵
What if the patterns hidden within dialogue reveal more about communication than the words themselves? We introduce Conversational DNA, a novel visual language that treats any dialogue – whether between humans, between human and AI, or among groups – as a living system with interpretable structure that can be visualized, compared, and understood. Unlike traditional conversation analysis that reduces rich interaction to statistical summaries, our approach reveals the temporal architecture of dialogue through biological metaphors. Linguistic complexity flows through strand thickness, emotional trajectories cascade through color gradients, conversational relevance forms through connecting elements, and topic coherence maintains structural integrity through helical patterns. Through exploratory analysis of therapeutic conversations and historically significant human-AI dialogues, we demonstrate how this visualization approach reveals interaction patterns that traditional methods miss. Our work contributes a new creative framework for understanding communication that bridges data visualization, human-computer interaction, and the fundamental question of what makes dialogue meaningful in an age where humans increasingly converse with artificial minds. 如果对话中隐藏的模式比词语本身更能揭示交流本质,会怎样?我们提出“会话 DNA”,一种新颖的可视化语言,将任何对话——无论是人际对话、人机对话,还是群体间的对话——视为具有可解释结构的生命系统,可被可视化、比较与理解。不同于将丰富互动简化为统计汇总的传统会话分析,我们的方法通过生物学隐喻揭示对话的时间结构。语言复杂性通过链条粗细流动,情感轨迹通过颜色渐变层叠,话语相关性通过连接元素形成,而话题连贯性则通过螺旋模式维持结构完整性。通过对治疗性对话和具有历史意义的人机对话的探索性分析,我们展示了这种可视化方法如何揭示传统方法遗漏的互动模式。 我们的工作提出了一个新的创意框架,用于理解交流,连接数据可视化、人机交互,以及在一个人类日益与人工智能交谈的时代里,关于是什么让对话有意义这一根本问题。
Subjects: Human-Computer Interaction, Artificial Intelligence, Computation and Language, Computers and Society 主题:人机交互、人工智能、计算与语言、计算机与社会
Publish: 2025-08-11 00:43:35 UTC 发表:2025-08-11 00:43:35 UTC
#154 Word Clouds as Common Voices: LLM-Assisted Visualization of Participant-Weighted Themes in Qualitative Interviews #154 词云作为共同声音:在定性访谈中对参与者加权主题的 LLM 辅助可视化
Authors: [Joseph T. Colonel](https://arxiv.org/search/?searchtype=author&query=Joseph T. Colonel), [Baihan Lin](https://arxiv.org/search/?searchtype=author&query=Baihan Lin) 作者:Joseph T. Colonel,林柏涵
Word clouds are a common way to summarize qualitative interviews, yet traditional frequency-based methods often fail in conversational contexts: they surface filler words, ignore paraphrase, and fragment semantically related ideas. This limits their usefulness in early-stage analysis, when researchers need fast, interpretable overviews of what participants actually said. We introduce ThemeClouds, an open-source visualization tool that uses large language models (LLMs) to generate thematic, participant-weighted word clouds from dialogue transcripts. The system prompts an LLM to identify concept-level themes across a corpus and then counts how many unique participants mention each topic, yielding a visualization grounded in breadth of mention rather than raw term frequency. Researchers can customize prompts and visualization parameters, providing transparency and control. Using interviews from a user study comparing five recording-device configurations (31 participants; 155 transcripts, Whisper ASR), our approach surfaces more actionable device concerns than frequency clouds and topic-modeling baselines (e.g., LDA, BERTopic). We discuss design trade-offs for integrating LLM assistance into qualitative workflows, implications for interpretability and researcher agency, and opportunities for interactive analyses such as per-condition contrasts (“diff clouds”). 词云是总结定性访谈的常用方式,但传统基于频率的方法在对话语境中常常失效:它们会凸显填充词,忽视意同表达,并将语义相关的观点拆散。这限制了它们在早期分析阶段的实用性,而此时研究者需要快速、可解释的概览来了解参与者实际说了什么。我们提出了 ThemeClouds,一款开源可视化工具,利用大型语言模型(LLMs)从对话转录中生成主题化、以参与者为权重的词云。该系统通过提示 LLM 在语料中识别概念层面的主题,然后统计提及每个主题的不同参与者人数,生成基于提及广度而非原始词频的可视化。研究者可以自定义提示和可视化参数,从而提供透明性和控制。使用来自一项比较五种录音设备配置的用户研究的访谈(31 名参与者;155 篇转录,Whisper ASR),我们的方法比基于频率的词云和主题建模基线(如 LDA、BERTopic)更能突出可操作的设备问题。 我们讨论了将 LLM 辅助整合到定性工作流程中的设计权衡、对可解释性和研究者能动性的影响,以及用于交互式分析的机会,例如按条件对比(“差异云”)。
Subjects: Computation and Language, Artificial Intelligence, Human-Computer Interaction 主题:计算与语言、人工智能、人机交互
Publish: 2025-08-11 00:27:52 UTC 发布:2025-08-11 00:27:52 协调世界时
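The participant-weighted counting step described in this abstract can be sketched as follows. The sketch assumes themes have already been extracted (the abstract says an LLM does this) and arrive as hypothetical `(participant_id, theme)` pairs. It is not the ThemeClouds implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def participant_weighted_counts(mentions: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count how many distinct participants mention each theme.

    Repeated mentions by the same participant count once, so a theme's weight
    reflects breadth of mention rather than raw frequency.
    """
    seen: Dict[str, Set[str]] = defaultdict(set)
    for participant, theme in mentions:
        seen[theme].add(participant)
    return {theme: len(people) for theme, people in seen.items()}


mentions = [
    ("p1", "battery life"),
    ("p1", "battery life"),   # same participant again: not counted twice
    ("p2", "battery life"),
    ("p2", "comfort"),
]
print(participant_weighted_counts(mentions))
# -> {'battery life': 2, 'comfort': 1}
```

The resulting counts can be passed as weights to any word-cloud renderer. A raw frequency count over the same data would instead give "battery life" a weight of 3.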
#155 From Field to Drone: Domain Drift Tolerant Automated Multi-Species and Damage Plant Semantic Segmentation for Herbicide Trials
Authors: [Artzai Picon](https://arxiv.org/search/?searchtype=author&query=Artzai Picon), [Itziar Eguskiza](https://arxiv.org/search/?searchtype=author&query=Itziar Eguskiza), [Daniel Mugica](https://arxiv.org/search/?searchtype=author&query=Daniel Mugica), [Javier Romero](https://arxiv.org/search/?searchtype=author&query=Javier Romero), [Carlos Javier Jimenez](https://arxiv.org/search/?searchtype=author&query=Carlos Javier Jimenez), [Eric White](https://arxiv.org/search/?searchtype=author&query=Eric White), [Gabriel Do-Lago-Junqueira](https://arxiv.org/search/?searchtype=author&query=Gabriel Do-Lago-Junqueira), [Christian Klukas](https://arxiv.org/search/?searchtype=author&query=Christian Klukas), [Ramon Navarra-Mestre](https://arxiv.org/search/?searchtype=author&query=Ramon Navarra-Mestre) 作者:Artzai Picon, Itziar Eguskiza, Daniel Mugica, Javier Romero, Carlos Javier Jimenez, Eric White, Gabriel Do-Lago-Junqueira, Christian Klukas, Ramon Navarra-Mestre
Field trials are vital in herbicide research and development to assess effects on crops and weeds under varied conditions. Traditionally, evaluations rely on manual visual assessments, which are time-consuming, labor-intensive, and subjective. Automating species and damage identification is challenging due to subtle visual differences, but it can greatly enhance efficiency and consistency. We present an improved segmentation model combining a general-purpose self-supervised visual model with hierarchical inference based on botanical taxonomy. Trained on a multi-year dataset (2018-2020) from Germany and Spain using digital and mobile cameras, the model was tested on digital camera data (year 2023) and drone imagery from the United States, Germany, and Spain (year 2024) to evaluate robustness under domain shift. This cross-device evaluation marks a key step in assessing the model’s generalization across platforms. Our model significantly improved species identification (F1-score: 0.52 to 0.85, R-squared: 0.75 to 0.98) and damage classification (F1-score: 0.28 to 0.44, R-squared: 0.71 to 0.87) over prior methods. Under domain shift (drone images), it maintained strong performance with moderate degradation (species: F1-score 0.60, R-squared 0.80; damage: F1-score 0.41, R-squared 0.62), where earlier models failed. These results confirm the model’s robustness and real-world applicability. It is now deployed in BASF’s phenotyping pipeline, enabling large-scale, automated crop and weed monitoring across diverse geographies.
田间试验在除草剂研发中至关重要,用于评估在多种条件下对作物和杂草的影响。传统上,评估依赖人工目视检查,这既耗时又费力且带有主观性。自动化的物种与损伤识别具有挑战性,因其视觉差异细微,但能极大提升效率与一致性。我们提出了一种改进的分割模型,将通用自监督视觉模型与基于植物分类学的分层推理相结合。在德国和西班牙使用数码和移动相机收集的多年(2018–2020)数据集上进行训练,并在数码相机数据(2023 年)以及来自美国、德国和西班牙的无人机影像(2024 年)上进行测试,以评估域迁移下的鲁棒性。这种跨设备评估是检验模型跨平台泛化能力的关键步骤。与先前方法相比,我们的模型在物种识别上显著提升(F1 分数:0.52 提升至 0.85,R²:0.75 提升至 0.98),在损伤分类上也有改进(F1 分数:0.28 提升至 0.44,R²:0.71 提升至 0.87)。 在领域迁移(无人机图像)下,模型保持了较强的性能,仅出现中等程度的退化(物种:F1 分数 0.60,R 平方 0.80;损伤:F1 分数 0.41,R 平方 0.62),而早期模型在此情况下失败。这些结果证实了该模型的鲁棒性和实际应用性。该模型现已部署到巴斯夫的表型分析流程中,实现了在不同地理区域的大规模自动化作物与杂草监测。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-11 00:08:42 UTC 发布:2025-08-11 00:08:42 UTC
#156 Intersectoral Knowledge in AI and Urban Studies: A Framework for Transdisciplinary Research #156 人工智能与城市研究中的跨部门知识:跨学科研究框架
Author: [Rashid Mushkani](https://arxiv.org/search/?searchtype=author&query=Rashid Mushkani) 作者:Rashid Mushkani
Transdisciplinary approaches are increasingly essential for addressing grand societal challenges, particularly in complex domains such as Artificial Intelligence (AI), urban planning, and social sciences. However, effectively validating and integrating knowledge across distinct epistemic and ontological perspectives poses significant difficulties. This article proposes a six-dimensional framework for assessing and strengthening transdisciplinary knowledge validity in AI and city studies, based on an extensive analysis of the most cited research (2014–2024). Specifically, the framework classifies research orientations according to ontological, epistemological, methodological, teleological, axiological, and valorization dimensions. Our findings show a predominance of perspectives aligned with critical realism (ontological), positivism (epistemological), analytical methods (methodological), consequentialism (teleological), epistemic values (axiological), and social/economic valorization. Less common stances, such as idealism, mixed methods, and cultural valorization, are also examined for their potential to enrich knowledge production. We highlight how early career researchers and transdisciplinary teams can leverage this framework to reconcile divergent disciplinary viewpoints and promote socially accountable outcomes. 跨学科方法在应对重大社会挑战时变得越来越重要,尤其是在人工智能(AI)、城市规划和社会科学等复杂领域。然而,有效地验证和整合来自不同认知论和本体论视角的知识存在重大困难。本文提出了一个用于评估和加强 AI 与城市研究中跨学科知识有效性的六维框架,基于对 2014—2024 年被引频次最高研究的广泛分析。具体而言,该框架根据本体论、认识论、方法论、目的论、价值论和价值实现维度对研究取向进行分类。我们的研究发现,在本体论上以批判现实主义为主导,在认识论上以实证主义为主导,在方法论上以分析方法为主导,在目的论上以结果论为主导,在价值论上以认知价值为主导,并且在价值实现上以社会/经济价值实现为主导。文中亦考察了较少见的立场,如唯心主义、混合方法和文化价值实现,探讨它们丰富知识生产的潜力。 我们强调早期职业研究者和跨学科团队如何利用该框架调和不同学科的观点并促进具有社会责任的成果。
Subjects: Computers and Society, Artificial Intelligence 主题:计算机与社会,人工智能
Publish: 2025-08-10 23:35:09 UTC 发布:2025-08-10 23:35:09 协调世界时 (UTC)
#157 VA-Blueprint: Uncovering Building Blocks for Visual Analytics System Design #157 VA-Blueprint:揭示可视化分析系统设计的构建模块
Authors: [Leonardo Ferreira](https://arxiv.org/search/?searchtype=author&query=Leonardo Ferreira), [Gustavo Moreira](https://arxiv.org/search/?searchtype=author&query=Gustavo Moreira), [Fabio Miranda](https://arxiv.org/search/?searchtype=author&query=Fabio Miranda) 作者:Leonardo Ferreira、Gustavo Moreira、Fabio Miranda
Designing and building visual analytics (VA) systems is a complex, iterative process that requires the seamless integration of data processing, analytics capabilities, and visualization techniques. While prior research has extensively examined the social and collaborative aspects of VA system authoring, the practical challenges of developing these systems remain underexplored. As a result, despite the growing number of VA systems, there are only a few structured knowledge bases to guide their design and development. To tackle this gap, we propose VA-Blueprint, a methodology and knowledge base that systematically reviews and categorizes the fundamental building blocks of urban VA systems, a domain particularly rich and representative due to its intricate data and unique problem sets. Applying this methodology to an initial set of 20 systems, we identify and organize their core components into a multi-level structure, forming an initial knowledge base with a structured blueprint for VA system development. To scale this effort, we leverage a large language model to automate the extraction of these components for another 81 papers (completing a corpus of 101 papers), assessing its effectiveness in scaling knowledge base construction. We evaluate our method through interviews with experts and a quantitative analysis of annotation metrics. Our contributions provide a deeper understanding of VA systems’ composition and establish a practical foundation to support more structured, reproducible, and efficient system development. VA-Blueprint is available at https://urbantk.org/va-blueprint.
设计与构建可视化分析(VA)系统是一个复杂且反复迭代的过程,需将数据处理、分析能力与可视化技术无缝整合。尽管以往研究广泛考察了 VA 系统创作的社会与协作层面,但开发这些系统的实际挑战仍未得到充分探讨。因此,尽管 VA 系统数量日益增长,指导其设计与开发的结构化知识库仍屈指可数。为填补这一空白,我们提出了 VA-Blueprint,一种方法论与知识库,系统性地审视并分类城市 VA 系统的基础构件——该领域因其复杂的数据和独特的问题集而尤其丰富且具有代表性。将此方法应用于最初的 20 个系统后,我们识别并将其核心组件组织成多层结构,形成了一个带有结构化蓝图的初始知识库,用于 VA 系统开发。 为扩大此项工作规模,我们利用大型语言模型自动提取其余 81 篇论文的这些组成部分(完成 101 篇论文的语料库),以评估其在知识库构建扩展上的有效性。我们通过与专家的访谈和对注释指标的定量分析来评估我们的方法。我们的贡献加深了对可视化分析(VA)系统组成的理解,并建立了支持更结构化、可复现且更高效系统开发的实用基础。VA-Blueprint 可在 https://urbantk.org/va-blueprint 获取。
Subjects: Human-Computer Interaction, Artificial Intelligence 主题:人机交互,人工智能
Publish: 2025-08-10 22:03:11 UTC 发表:2025-08-10 22:03:11 UTC
#158 From Product Hilbert Spaces to the Generalized Koopman Operator and the Nonlinear Fundamental Lemma #158 从乘积希尔伯特空间到广义库普曼算子与非线性基本引理
Author: [Mircea Lazar](https://arxiv.org/search/?searchtype=author&query=Mircea Lazar) 作者:Mircea Lazar
The generalization of the Koopman operator to systems with control input and the derivation of a nonlinear fundamental lemma are two open problems that play a key role in the development of data-driven control methods for nonlinear systems. Both problems hinge on the construction of observable or basis functions and their corresponding Hilbert space that enable an infinite-dimensional, linear system representation. In this paper we derive a novel solution to these problems based on orthonormal expansion in a product Hilbert space constructed as the tensor product between the Hilbert spaces of the state and input observable functions, respectively. We prove that there exists an infinite-dimensional linear operator, i.e., the generalized Koopman operator, from the constructed product Hilbert space to the Hilbert space corresponding to the lifted state propagated forward in time. A scalable data-driven method for computing finite-dimensional approximations of generalized Koopman operators and several choices of observable functions are also presented. Moreover, we derive a nonlinear fundamental lemma by exploiting the bilinear structure of the infinite-dimensional generalized Koopman model. The effectiveness of the developed generalized Koopman embedding is illustrated on the Van der Pol oscillator. 将 Koopman 算子推广到含控制输入的系统以及推导非线性基本引理是两个未解决的问题,它们在非线性系统数据驱动控制方法的发展中起着关键作用。这两个问题都依赖于可观测函数或基函数及其相应的希尔伯特空间的构造,以实现无限维线性系统的表示。本文基于在乘积希尔伯特空间中的标准正交展开提出了这些问题的新解,该乘积希尔伯特空间构造为状态和输入可观测函数的希尔伯特空间之间的张量积。我们证明存在一个从所构造的乘积希尔伯特空间到对应于随时间向前传播的升维状态的希尔伯特空间的无限维线性算子,即广义 Koopman 算子。文中还提出了一种可扩展的数据驱动方法,用于计算广义 Koopman 算子的有限维近似,以及若干可观测函数的选择方案。 此外,我们通过利用无限维广义 Koopman 模型的双线性结构推导出了一个非线性基本引理。所提出的广义 Koopman 嵌入的有效性在 Van der Pol 振子上得到了验证。
Subjects: Optimization and Control, Artificial Intelligence 学科:优化与控制、人工智能
Publish: 2025-08-10 21:57:16 UTC 发布:2025-08-10 21:57:16 UTC
#159 Extracting Overlapping Microservices from Monolithic Code via Deep Semantic Embeddings and Graph Neural Network-Based Soft Clustering #159 通过深度语义嵌入和基于图神经网络的软聚类从单体代码中提取重叠微服务
Authors: [Morteza Ziabakhsh](https://arxiv.org/search/?searchtype=author&query=Morteza Ziabakhsh), [Kiyan Rezaee](https://arxiv.org/search/?searchtype=author&query=Kiyan Rezaee), [Sadegh Eskandari](https://arxiv.org/search/?searchtype=author&query=Sadegh Eskandari), [Seyed Amir Hossein Tabatabaei](https://arxiv.org/search/?searchtype=author&query=Seyed Amir Hossein Tabatabaei), [Mohammad M. Ghassemi](https://arxiv.org/search/?searchtype=author&query=Mohammad M. Ghassemi) 作者:Morteza Ziabakhsh、Kiyan Rezaee、Sadegh Eskandari、Seyed Amir Hossein Tabatabaei、Mohammad M. Ghassemi
Modern software systems are increasingly shifting from monolithic architectures to microservices to enhance scalability, maintainability, and deployment flexibility. Existing microservice extraction methods typically rely on hard clustering, assigning each software component to a single microservice. This approach often increases inter-service coupling and reduces intra-service cohesion. We propose Mo2oM (Monolithic to Overlapping Microservices), a framework that formulates microservice extraction as a soft clustering problem, allowing components to belong probabilistically to multiple microservices. This approach is inspired by expert-driven decompositions, where practitioners intentionally replicate certain software components across services to reduce communication overhead. Mo2oM combines deep semantic embeddings with structural dependencies extracted from method-call graphs to capture both functional and architectural relationships. A graph neural network-based soft clustering algorithm then generates the final set of microservices. We evaluate Mo2oM on four open-source monolithic benchmarks and compare it against eight state-of-the-art baselines. Our results demonstrate that Mo2oM achieves improvements of up to 40.97% in structural modularity (balancing cohesion and coupling), 58% in inter-service call percentage (communication overhead), 26.16% in interface number (modularity and decoupling), and 38.96% in non-extreme distribution (service size balance) across all benchmarks. 现代软件系统正逐步从单体架构转向微服务,以提升可扩展性、可维护性和部署灵活性。现有的微服务提取方法通常依赖硬聚类,将每个软件组件分配到单一微服务中。这种方法常常增加服务间耦合并降低服务内凝聚力。我们提出了 Mo2oM(从单体到重叠微服务),一个将微服务提取表述为软聚类问题的框架,允许组件以概率方式属于多个微服务。该方法的灵感来自专家驱动的分解实践,实践者有意在服务间复制某些软件组件以减少通信开销。Mo2oM 结合了深度语义嵌入与从方法调用图中提取的结构依赖,以捕捉功能和架构关系。随后基于图神经网络的软聚类算法生成最终的微服务集。我们在四个开源单体基准上评估了 Mo2oM,并将其与八个最先进的基线方法进行了比较。 我们的结果表明,Mo2oM 在所有基准测试中在结构模块性(平衡内聚与耦合)方面提高了最多 40.97%,在服务间调用比例(通信开销)方面提高了 58%,在接口数量(模块化与解耦)方面提高了 26.16%,在非极端分布(服务规模平衡)方面提高了 38.96%。
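To make the hard-versus-soft distinction concrete, here is a small sketch. The membership probabilities, the 0.3 threshold, the component names, and the definition of "inter-service call percentage" used here (share of calls whose endpoints have no service in common) are illustrative assumptions, not values or definitions taken from the paper.

```python
def hard_services(membership):
    # Hard clustering: each component goes to its single most likely service.
    n_services = len(next(iter(membership.values())))
    services = {k: set() for k in range(n_services)}
    for comp, probs in membership.items():
        services[probs.index(max(probs))].add(comp)
    return services

def overlapping_services(membership, threshold=0.3):
    # Soft clustering: a component joins every service whose membership
    # probability reaches the (hypothetical) threshold, so it may be replicated.
    n_services = len(next(iter(membership.values())))
    services = {k: set() for k in range(n_services)}
    for comp, probs in membership.items():
        for k, p in enumerate(probs):
            if p >= threshold:
                services[k].add(comp)
    return services

def inter_service_call_pct(calls, services):
    # Percentage of calls whose caller and callee share no service.
    home = {}
    for k, comps in services.items():
        for c in comps:
            home.setdefault(c, set()).add(k)
    crossing = sum(1 for a, b in calls
                   if not (home.get(a, set()) & home.get(b, set())))
    return 100.0 * crossing / len(calls)

# Hypothetical soft-clustering output for four components and two services.
membership = {
    "Order": [0.9, 0.1],
    "Cart": [0.8, 0.2],
    "Invoice": [0.1, 0.9],
    "Utils": [0.5, 0.5],
}
calls = [("Order", "Cart"), ("Order", "Utils"),
         ("Invoice", "Utils"), ("Cart", "Invoice")]

hard_pct = inter_service_call_pct(calls, hard_services(membership))
soft_pct = inter_service_call_pct(calls, overlapping_services(membership))
```

In this toy case, replicating `Utils` in both services removes one cross-service call, which is the kind of communication saving that expert-driven decompositions aim for.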
Subjects: Software Engineering, Artificial Intelligence, Computer Vision and Pattern Recognition 主题:软件工程、人工智能、计算机视觉与模式识别
Publish: 2025-08-10 21:07:20 UTC 发表:2025-08-10 21:07:20 UTC
#160 ALOPE: Adaptive Layer Optimization for Translation Quality Estimation using Large Language Models #160 ALOPE:使用大型语言模型进行翻译质量评估的自适应层优化
Authors: [Archchana Sindhujan](https://arxiv.org/search/?searchtype=author&query=Archchana Sindhujan), [Shenbin Qian](https://arxiv.org/search/?searchtype=author&query=Shenbin Qian), [Chan Chi Chun Matthew](https://arxiv.org/search/?searchtype=author&query=Chan Chi Chun Matthew), [Constantin Orasan](https://arxiv.org/search/?searchtype=author&query=Constantin Orasan), [Diptesh Kanojia](https://arxiv.org/search/?searchtype=author&query=Diptesh Kanojia) 作者:Archchana Sindhujan、Shenbin Qian、Chan Chi Chun Matthew、Constantin Orasan、Diptesh Kanojia
Large Language Models (LLMs) have shown remarkable performance across a wide range of natural language processing tasks. Quality Estimation (QE) for Machine Translation (MT), which assesses the quality of a source-target pair without relying on reference translations, remains a challenging cross-lingual task for LLMs. The challenges stem from the inherent limitations of existing LLM-based QE systems, which are pre-trained for causal language modelling rather than regression-specific tasks, and are further exacerbated by low-resource languages, given the pre-training data distribution. This paper introduces ALOPE, an adaptive layer-optimization framework designed to enhance LLM-based QE by restructuring Transformer representations through layer-wise adaptation for improved regression-based prediction. Our framework integrates low-rank adapters (LoRA) with regression task heads, leveraging selected pre-trained Transformer layers for improved cross-lingual alignment. In addition to the layer-specific adaptation, ALOPE introduces two strategies: dynamic weighting, which adaptively combines representations from multiple layers, and multi-head regression, which aggregates regression losses from multiple heads for QE. Our framework shows improvements over various existing LLM-based QE approaches. Empirical evidence suggests that intermediate Transformer layers in LLMs provide contextual representations that are more aligned with the cross-lingual nature of the QE task. We make the resulting models and framework code publicly available for further research, also allowing existing LLM-based MT frameworks to be extended with QE capabilities.
大型语言模型(LLMs)在广泛的自然语言处理任务中表现出色。机器翻译(MT)的质量估计(QE)用于在不依赖参考译文的情况下评估源-目标对的质量,对于 LLMs 来说仍然是一个具有挑战性的跨语言任务。挑战源于现有基于 LLM 的 QE 系统的固有局限性:这些系统在预训练阶段侧重于因果语言建模,而非回归专用任务;再加上预训练数据分布导致的低资源语言存在,使问题更加复杂。本文提出了 ALOPE,一种自适应层优化框架,旨在通过逐层适配重构 Transformer 表示来增强基于 LLM 的 QE,以改进基于回归的预测。我们的框架将低秩适配器(LoRA)与回归任务头相结合,利用经过挑选的预训练 Transformer 层以改进跨语言对齐。 除了特定层的适配之外,ALOPE 引入了两种策略——动态加权(dynamic weighting),用于自适应地组合来自多个层的表示,以及多头回归(multi-head regression),用于汇总来自多个头的回归损失以进行质量估计(QE)。我们的框架在多种现有基于 LLM 的 QE 方法上表现出改进。实证结果表明,LLM 的中间 Transformer 层提供的上下文表示更契合 QE 任务的跨语言特性。我们已将得到的模型和框架代码公开,以便进一步研究,也允许现有基于 LLM 的 MT 框架扩展具备 QE 能力。
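The two strategies named in the abstract can be sketched in a few lines. This is a simplified illustration only: the softmax parameterisation of the layer weights and the use of a mean of squared errors to aggregate head losses are assumptions, and the function names are hypothetical rather than taken from the released ALOPE code.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scalar scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combine_layers(layer_reprs, layer_scores):
    # Dynamic weighting (assumed form): weighted sum of per-layer vectors,
    # with weights given by a softmax over learnable per-layer scores.
    weights = softmax(layer_scores)
    dim = len(layer_reprs[0])
    return [sum(w * rep[d] for w, rep in zip(weights, layer_reprs))
            for d in range(dim)]

def multi_head_loss(head_predictions, target):
    # Multi-head regression (assumed form): mean of the squared errors
    # of several regression heads against one quality score.
    return sum((p - target) ** 2 for p in head_predictions) / len(head_predictions)

# Two hypothetical layer representations with equal scores average out.
combined = combine_layers([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
loss = multi_head_loss([0.5, 0.7], 0.6)
```

In a trainable model the layer scores would be parameters updated by gradient descent, so the network can learn to favour intermediate layers if they carry the more useful cross-lingual signal.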
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-10 20:59:44 UTC
#161 Noise-Aware Generative Microscopic Traffic Simulation #161 噪声感知生成式微观交通模拟
Authors: [Vindula Jayawardana](https://arxiv.org/search/?searchtype=author&query=Vindula Jayawardana), [Catherine Tang](https://arxiv.org/search/?searchtype=author&query=Catherine Tang), [Junyi Ji](https://arxiv.org/search/?searchtype=author&query=Junyi Ji), [Jonah Philion](https://arxiv.org/search/?searchtype=author&query=Jonah Philion), [Xue Bin Peng](https://arxiv.org/search/?searchtype=author&query=Xue Bin Peng), [Cathy Wu](https://arxiv.org/search/?searchtype=author&query=Cathy Wu) 作者:Vindula Jayawardana、Catherine Tang、Junyi Ji、Jonah Philion、Xue Bin Peng、Cathy Wu
Accurately modeling individual vehicle behavior in microscopic traffic simulation remains a key challenge in intelligent transportation systems, as it requires vehicles to realistically generate and respond to complex traffic phenomena such as phantom traffic jams. While traditional human driver simulation models offer computational tractability, they do so by abstracting away the very complexity that defines human driving. On the other hand, recent advances in infrastructure-mounted camera-based roadway sensing have enabled the extraction of vehicle trajectory data, presenting an opportunity to shift toward generative, agent-based models. Yet, a major bottleneck remains: most existing datasets are either overly sanitized or lack standardization, failing to reflect the noisy, imperfect nature of real-world sensing. Unlike data from vehicle-mounted sensors, which can mitigate sensing artifacts like occlusion through overlapping fields of view and sensor fusion, infrastructure-based sensors surface a messier, more practical view of challenges that traffic engineers encounter. To this end, we present the I-24 MOTION Scenario Dataset (I24-MSD), a standardized, curated dataset designed to preserve a realistic level of sensor imperfection, embracing these errors as part of the learning problem rather than an obstacle to overcome purely through preprocessing. Drawing from noise-aware learning strategies in computer vision, we further adapt existing generative models in the autonomous driving community for I24-MSD with noise-aware loss functions. Our results show that such models not only outperform traditional baselines in realism but also benefit from explicitly engaging with, rather than suppressing, data imperfection. We view I24-MSD as a stepping stone toward a new generation of microscopic traffic simulation that embraces the real-world challenges and is better aligned with practical needs.
在微观交通仿真中精确建模单个车辆行为仍然是智能交通系统的一大挑战,因为这要求车辆能够真实地生成并响应诸如虚幻交通拥堵等复杂交通现象。尽管传统的人工驾驶员仿真模型在计算上可处理,但它们通过抽象掉定义人类驾驶的复杂性来实现这一点。另一方面,最近基于基础设施安装摄像头的路面感知进展使得车辆轨迹数据的提取成为可能,提供了向生成式、基于主体的模型转变的机会。然而,一个主要瓶颈依然存在:大多数现有数据集要么过于净化,要么缺乏标准化,未能反映现实世界感知的噪声和不完美性。与可通过重叠视场和传感器融合来减轻遮挡等感知伪影的车载传感器数据不同,基于基础设施的传感器呈现出更混乱、更贴近交通工程师实际遇到问题的视角。 为此,我们提出了 I-24 MOTION 场景数据集(I24-MSD)——一个标准化、精心整理的数据集,旨在保留现实级别的传感器不完美性,将这些误差视为学习问题的一部分,而不是仅通过预处理加以克服的障碍。借鉴计算机视觉中对噪声敏感的学习策略,我们进一步为 I24-MSD 在自动驾驶领域改造了现有的生成模型,采用了对噪声敏感的损失函数。我们的结果表明,此类模型不仅在逼真度上优于传统基准方法,而且通过主动应对而非抑制数据不完美性还能获得收益。我们将 I24-MSD 视为迈向新一代微观交通仿真的垫脚石,该仿真接纳现实世界的挑战并更好地契合实际需求。
Subjects: Systems and Control, Artificial Intelligence, Multiagent Systems, Robotics 主题:系统与控制、人工智能、多智能体系统、机器人学
Publish: 2025-08-10 18:41:49 UTC 发布:2025-08-10 18:41:49 UTC
#162 Stackelberg Coupling of Online Representation Learning and Reinforcement Learning #162 在线表示学习与强化学习的斯塔克尔伯格耦合
Authors: [Fernando Martinez](https://arxiv.org/search/?searchtype=author&query=Fernando Martinez), [Tao Li](https://arxiv.org/search/?searchtype=author&query=Tao Li), [Yingdong Lu](https://arxiv.org/search/?searchtype=author&query=Yingdong Lu), [Juntao Chen](https://arxiv.org/search/?searchtype=author&query=Juntao Chen) 作者:Fernando Martinez、Tao Li、Yingdong Lu、Juntao Chen
Integrated, end-to-end learning of representations and policies remains a cornerstone of deep reinforcement learning (RL). However, to address the challenge of learning effective features from a sparse reward signal, recent trends have shifted towards adding complex auxiliary objectives or fully decoupling the two processes, often at the cost of increased design complexity. This work proposes an alternative to both decoupling and naive end-to-end learning, arguing that performance can be significantly improved by structuring the interaction between distinct perception and control networks with a principled, game-theoretic dynamic. We formalize this dynamic by introducing the Stackelberg Coupled Representation and Reinforcement Learning (SCORER) framework, which models the interaction between perception and control as a Stackelberg game. The perception network (leader) strategically learns features to benefit the control network (follower), whose own objective is to minimize its Bellman error. We approximate the game’s equilibrium with a practical two-timescale algorithm. Applied to standard DQN variants on benchmark tasks, SCORER improves sample efficiency and final performance. Our results show that performance gains can be achieved through principled algorithmic design of the perception-control dynamic, without requiring complex auxiliary objectives or architectures. 端到端联合学习表征与策略仍然是深度强化学习(RL)的基石。然而,为了解决从稀疏奖励信号中学习有效特征的挑战,近期趋势转向添加复杂的辅助目标或完全将两者分离,往往以增加设计复杂性为代价。本工作提出了一种既不同于分离也不同于简单端到端学习的替代方案,认为通过以一种有原则的博弈论动态来构建感知网络与控制网络之间的交互,可以显著提升性能。我们通过引入 Stackelberg 耦合表征与强化学习(SCORER)框架来形式化这一动态,该框架将感知与控制之间的交互建模为一个 Stackelberg 博弈。感知网络(领导者)策略性地学习有利于控制网络(追随者)的特征,而控制网络的目标则是最小化其 Bellman 误差。我们用一个实用的两时间尺度算法来近似该博弈的均衡。将 SCORER 应用于基准任务上的标准 DQN 变体,显示出提升样本效率和最终性能。 我们的结果表明,通过对感知-控制动态进行有原则的算法设计,可以获得性能提升,而无需复杂的辅助目标或架构。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-10 18:36:54 UTC 发布:2025-08-10 18:36:54 UTC
#163 Optimizing Districting Plans to Maximize Majority-Minority Districts via IPs and Local Search #163 通过整数规划和局部搜索优化选区划分以最大化少数族裔多数选区
Authors: [Daniel Brous](https://arxiv.org/search/?searchtype=author&query=Daniel Brous), [David Shmoys](https://arxiv.org/search/?searchtype=author&query=David Shmoys) 作者:Daniel Brous,David Shmoys
In redistricting litigation, effective enforcement of the Voting Rights Act has often involved providing the court with districting plans that display a larger number of majority-minority districts than the current proposal (as was true, for example, in what followed Allen v. Milligan concerning the congressional districting plan for Alabama in 2023). Recent work by Cannon et al. proposed a heuristic algorithm for generating plans to optimize majority-minority districts, which they called short bursts; that algorithm relies on a sophisticated random walk over the space of all plans, transitioning in bursts, where the initial plan for each burst is the most successful plan from the previous burst. We propose a method based on integer programming, where we build upon another previous work, the stochastic hierarchical partitioning algorithm, which heuristically generates a robust set of potential districts (viewed as columns in a standard set partitioning formulation); that approach was designed to optimize a different notion of fairness across a statewide plan. We design a new column generation algorithm to find plans via integer programming that outperforms short bursts on multiple data sets in generating statewide plans with significantly more majority-minority districts. These results also rely on a new local re-optimization algorithm to iteratively improve on any baseline solution, as well as an algorithm to increase the compactness of districts in plans generated (without impacting the number of majority-minority districts). 在重划选区的诉讼中,有效执行《投票权法》常常涉及向法庭提供比当前提案显示更多多数-少数族裔选区的划分方案(例如,在 Allen v. Milligan 之后,关于 2023 年阿拉巴马州国会选区划分方案的情况就是如此)。Cannon 等人的最新工作提出了一种启发式算法来生成以优化多数-少数族裔选区为目标的方案,他们称之为短爆发(short bursts);该算法依赖于在所有方案空间上进行复杂的随机游走,以爆发式的方式转移,其中每次爆发的初始方案是上一爆发中最成功的方案。我们提出了一种基于整数规划的方法,建立在另一项先前工作——随机分层划分算法(stochastic hierarchical partitioning algorithm)的基础上,该算法启发式地生成一组稳健的潜在选区(在标准集合划分表述中视为列);该方法最初是为优化全州方案中另一种公平性概念而设计的。 我们设计了一种新的列生成算法,通过整数规划寻找选区方案,在生成全州选区计划时优于短爆发方法,在多个数据集上产生了显著更多的少数族裔多数选区。这些结果还依赖于一种新的局部再优化算法,用于对任一基线解进行迭代改进,以及一种用于提高所生成方案中选区紧凑性的算法(在不影响少数族裔多数选区数量的前提下)。
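The set partitioning formulation mentioned above treats each candidate district as a column and asks for a selection of columns that covers every geographic unit exactly once. The sketch below solves a four-unit toy instance by exhaustive search instead of integer programming or column generation; the unit names, populations, and candidate columns are invented for illustration.

```python
from itertools import combinations

# Hypothetical units: name -> (total population, minority population).
units = {"A": (100, 80), "B": (100, 30), "C": (100, 75), "D": (100, 30)}

# Hypothetical candidate districts (columns), e.g. from a heuristic generator.
columns = [frozenset("AB"), frozenset("CD"), frozenset("AC"), frozenset("BD")]

def is_majority_minority(column):
    # A district counts if minority residents are a strict majority.
    total = sum(units[u][0] for u in column)
    minority = sum(units[u][1] for u in column)
    return 2 * minority > total

def best_plan(columns, n_districts):
    # Enumerate selections of n_districts columns that partition all units
    # (each unit covered exactly once) and keep the one with the most
    # majority-minority districts. Brute force stands in for the IP solver.
    best, best_score = None, -1
    for plan in combinations(columns, n_districts):
        covered = [u for col in plan for u in col]
        if len(covered) != len(set(covered)) or set(covered) != set(units):
            continue
        score = sum(1 for col in plan if is_majority_minority(col))
        if score > best_score:
            best, best_score = plan, score
    return best, best_score

plan, score = best_plan(columns, 2)
```

Here pairing the two high-minority units together (`AC`) packs them into one district, while the partition `AB` plus `CD` yields two majority-minority districts. Real instances have far too many columns to enumerate, which is where an integer program over generated columns becomes necessary.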
Subjects: Data Structures and Algorithms, Artificial Intelligence, Computers and Society 主题:数据结构与算法、人工智能、计算机与社会
Publish: 2025-08-10 17:58:54 UTC 发布:2025-08-10 17:58:54 UTC
#164 Freeze and Reveal: Exposing Modality Bias in Vision-Language Models
Authors: [Vivek Hruday Kavuri](https://arxiv.org/search/?searchtype=author&query=Vivek Hruday Kavuri), [Vysishtya Karanam](https://arxiv.org/search/?searchtype=author&query=Vysishtya Karanam), [Venkata Jahnavi Venkamsetty](https://arxiv.org/search/?searchtype=author&query=Venkata Jahnavi Venkamsetty), [Kriti Madumadukala](https://arxiv.org/search/?searchtype=author&query=Kriti Madumadukala), [Lakshmipathi Balaji Darur](https://arxiv.org/search/?searchtype=author&query=Lakshmipathi Balaji Darur), [Ponnurangam Kumaraguru](https://arxiv.org/search/?searchtype=author&query=Ponnurangam Kumaraguru) 作者:Vivek Hruday Kavuri,Vysishtya Karanam,Venkata Jahnavi Venkamsetty,Kriti Madumadukala,Lakshmipathi Balaji Darur,Ponnurangam Kumaraguru
Vision Language Models achieve impressive multi-modal performance but often inherit gender biases from their training data. This bias may come from both the vision and text modalities. In this work, we dissect the contributions of vision and text backbones to these biases by applying targeted debiasing using Counterfactual Data Augmentation (CDA) and Task Vector methods. Inspired by data-efficient approaches in hate-speech classification, we introduce a novel metric, Degree of Stereotypicality, and a corresponding debiasing method, Data Augmentation Using Degree of Stereotypicality (DAUDoS), to reduce bias with minimal computational cost. We curate a gender-annotated dataset and evaluate all methods on the VisoGender benchmark to quantify improvements and identify the dominant source of bias. Our results show that CDA reduces the gender gap by 6% and DAUDoS by 3%, the latter using only one-third of the data. Both methods also improve the model's ability to correctly identify gender in images by 3%, with DAUDoS achieving this improvement using only about one-third of the training data. From our experiments, we observe that CLIP's vision encoder is the more biased component, whereas for PaliGemma2 the text encoder is more biased. By identifying whether bias stems more from vision or text encoders, our work enables more targeted and effective bias mitigation strategies in future multi-modal systems. 视觉语言模型在多模态任务上表现出色,但经常从训练数据中继承性别偏见。这种偏见可能来自视觉和文本模态两方面。在这项工作中,我们通过采用针对性去偏方法(反事实数据增强和任务向量方法),解剖视觉和文本骨干对这些偏见的贡献。受在仇恨言论分类中数据高效方法的启发,我们引入了一种新度量——刻板化程度(Degree of Stereotypicality)及相应的去偏方法——基于刻板化程度的数据增强(Data Augmentation Using Degree of Stereotypicality,简称 DAUDoS),以尽量小的计算成本降低偏见。我们整理了一个带性别标注的数据集,并在 VisoGender 基准上评估所有方法,以量化改进并识别主要偏见来源。结果表明,CDA 将性别差距减少了 6%,而 DAUDoS 减少了 3%,但仅使用了三分之一的数据。两种方法还将模型在图像中正确识别性别的能力提高了 3%,其中 DAUDoS 在几乎仅使用三分之一训练数据的情况下达到了该提升。 从我们的实验中,我们观察到 CLIP 的视觉编码器偏差更大,而 PaliGemma2 的文本编码器偏差更大。通过识别偏差更多地来自视觉还是文本编码器,我们的工作能够在未来的多模态系统中实现更有针对性和更有效的偏差缓解策略。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-10 17:08:10 UTC
#165 Lightning Prediction under Uncertainty: DeepLight with Hazy Loss
Authors: [Md Sultanul Arifin](https://arxiv.org/search/?searchtype=author&query=Md Sultanul Arifin), [Abu Nowshed Sakib](https://arxiv.org/search/?searchtype=author&query=Abu Nowshed Sakib), [Yeasir Rayhan](https://arxiv.org/search/?searchtype=author&query=Yeasir Rayhan), [Tanzima Hashem](https://arxiv.org/search/?searchtype=author&query=Tanzima Hashem)
Lightning, a common feature of severe meteorological conditions, poses significant risks, from direct human injuries to substantial economic losses. These risks are further exacerbated by climate change. Early and accurate prediction of lightning would enable preventive measures to safeguard people, protect property, and minimize economic losses. In this paper, we present DeepLight, a novel deep learning architecture for predicting lightning occurrences. Existing prediction models face several critical limitations: they often struggle to capture the dynamic spatial context and inherent uncertainty of lightning events, underutilize key observational data, such as radar reflectivity and cloud properties, and rely heavily on Numerical Weather Prediction (NWP) systems, which are both computationally expensive and highly sensitive to parameter settings. To overcome these challenges, DeepLight leverages multi-source meteorological data, including radar reflectivity, cloud properties, and historical lightning occurrences through a dual-encoder architecture. By employing multi-branch convolution techniques, it dynamically captures spatial correlations across varying extents. Furthermore, its novel Hazy Loss function explicitly addresses the spatio-temporal uncertainty of lightning by penalizing deviations based on proximity to true events, enabling the model to better learn patterns amidst randomness. Extensive experiments show that DeepLight improves the Equitable Threat Score (ETS) by 18%-30% over state-of-the-art methods, establishing it as a robust solution for lightning prediction. 
闪电作为严重气象条件中的常见现象,带来显著风险,从对人体的直接伤害到巨大的经济损失不等。气候变化进一步加剧了这些风险。对闪电进行早期且准确的预测可以采取预防措施以保护人员、财产并将经济损失降到最低。在本文中,我们提出了 DeepLight,一种用于预测闪电发生的新型深度学习架构。现有的预测模型存在若干关键局限:它们常常难以捕捉闪电事件的动态空间语境和内在不确定性,未能充分利用如雷达反射率和云特性等关键观测数据,并且过度依赖数值天气预报(NWP)系统,而后者不仅计算成本高且对参数设置高度敏感。为克服这些挑战,DeepLight 利用包括雷达反射率、云特性和历史闪电发生记录在内的多源气象数据,通过双编码器架构进行融合。通过采用多分支卷积技术,它能够动态捕捉不同范围内的空间相关性。 此外,其新提出的朦胧损失(Hazy Loss)通过基于与真实事件的接近程度来惩罚偏差,显式地处理了闪电的时空不确定性,使模型能够在随机性中更好地学习模式。大量实验表明,DeepLight 在公平威胁评分(ETS)上比最先进方法提高了 18%–30%,确立了其作为闪电预测的鲁棒解决方案的地位。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-10 16:59:03 UTC 发布:2025-08-10 16:59:03 协调世界时 (UTC)
#166 Real-Time Analysis of Unstructured Data with Machine Learning on Heterogeneous Architectures #166 使用异构架构和机器学习对非结构化数据进行实时分析
Author: [Fotis I. Giasemis](https://arxiv.org/search/?searchtype=author&query=Fotis I. Giasemis) 作者:Fotis I. Giasemis
As the particle physics community needs higher and higher precisions in order to test our current model of the subatomic world, larger and larger datasets are necessary. With upgrades scheduled for the detectors of colliding-beam experiments around the world, and specifically at the Large Hadron Collider at CERN, more collisions and more complex interactions are expected. This directly implies an increase in data produced and consequently in the computational resources needed to process them. At CERN, the amount of data produced is gargantuan. This is why the data have to be heavily filtered and selected in real time before being permanently stored. This data can then be used to perform physics analyses, in order to expand our current understanding of the universe and improve the Standard Model of physics. This real-time filtering, known as triggering, involves complex processing happening often at frequencies as high as 40 MHz. This thesis contributes to understanding how machine learning models can be efficiently deployed in such environments, in order to maximize throughput and minimize energy consumption. Inevitably, modern hardware designed for such tasks and contemporary algorithms are needed in order to meet the challenges posed by the stringent, high-frequency data rates. In this work, I present our graph neural network-based pipeline, developed for charged particle track reconstruction at the LHCb experiment at CERN. The pipeline was implemented end-to-end inside LHCb’s first-level trigger, entirely on GPUs. Its performance was compared against the classical tracking algorithms currently in production at LHCb. The pipeline was also accelerated on the FPGA architecture, and its performance in terms of power consumption and processing speed was compared against the GPU implementation. 
随着粒子物理学界为了检验我们当前的亚原子世界模型而需要越来越高的精度,所需的数据集也越来越大。随着全球各地对撞束实验探测器的升级,尤其是位于 CERN 的大型强子对撞机,预计会产生更多碰撞和更复杂的相互作用。这直接意味着产生的数据量增加,从而需要更多的计算资源来处理它们。在 CERN,产生的数据量极为庞大。因此,数据必须在实时情况下被大量过滤和筛选,然后才能永久存储。之后这些数据可用于进行物理分析,以扩展我们对宇宙的现有理解并改进物理学标准模型。这个实时过滤过程称为触发,通常涉及高达每秒 4000 万次(40 MHz)的复杂处理。本论文有助于理解如何在此类环境中高效部署机器学习模型,以便最大化吞吐量并最小化能耗。 不可避免地,为了应对严格且高频的数据速率所带来的挑战,需要为此类任务设计的现代硬件和当代算法。在这项工作中,我介绍了我们基于图神经网络的管线,用于 CERN 的 LHCb 实验中的带电粒子轨迹重建。该管线作为端到端实现,完全在 LHCb 的一阶触发器内运行,全部部署在 GPU 上。其性能与 LHCb 当前投入生产使用的经典追踪算法进行了比较。该管线也在 FPGA 架构上实现了加速,并在功耗和处理速度方面与 GPU 实现进行了比较。
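The 40 MHz figure translates into a very tight time budget, which a short calculation makes explicit. The device count used below is a made-up number for illustration, not the size of the actual LHCb trigger farm.

```python
def per_input_budget_ns(rate_hz):
    # Average time available per input, in nanoseconds, at a given input rate.
    return 1e9 / rate_hz

def per_device_rate_hz(rate_hz, n_devices):
    # Rate each device must sustain if the load is split evenly.
    return rate_hz / n_devices

budget = per_input_budget_ns(40e6)      # 40 MHz input rate from the abstract
share = per_device_rate_hz(40e6, 500)   # 500 devices is a hypothetical figure
```

At 40 MHz a single serial processor would have 25 ns per input, which is why the work is spread over many parallel devices and why throughput per device and per watt are the quantities being compared.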
Subjects: High Energy Physics - Experiment, Artificial Intelligence, Distributed, Parallel, and Cluster Computing, Machine Learning, Data Analysis, Statistics and Probability 主题:高能物理 - 实验,人工智能,分布式、并行与集群计算,机器学习,数据分析、统计与概率
Publish: 2025-08-10 16:45:10 UTC 发布:2025-08-10 16:45:10 协调世界时
#167 Leveraging GNN to Enhance MEF Method in Predicting ENSO #167 利用图神经网络增强 MEF 方法以预测 ENSO
Authors: [Saghar Ganji](https://arxiv.org/search/?searchtype=author&query=Saghar Ganji), [Mohammad Naisipour](https://arxiv.org/search/?searchtype=author&query=Mohammad Naisipour)
Reliable long-lead forecasting of the El Nino Southern Oscillation (ENSO) remains a long-standing challenge in climate science. The previously developed Multimodal ENSO Forecast (MEF) model uses 80 ensemble predictions by two independent deep learning modules: a 3D Convolutional Neural Network (3D-CNN) and a time-series module. In their approach, outputs of the two modules are combined using a weighting strategy wherein one is prioritized over the other as a function of global performance. Individual ensemble members were not weighted or tested separately, however, which may have limited the model's ability to exploit high-performing but spread-out forecasts. In this study, we propose a better framework that employs graph-based analysis to directly model similarity between all 80 members of the ensemble. By constructing an undirected graph whose vertices are ensemble outputs and whose edge weights measure similarity (via RMSE and correlation), we identify and cluster structurally similar and accurate predictions. From these clusters, we obtain an optimized subset of 20 members using community detection methods. The final prediction is then obtained by averaging this optimized subset. This method improves the forecast skill through noise removal and emphasis on ensemble coherence. Interestingly, our graph-based selection shows robust statistical characteristics among top performers, offering new insights into ensemble behavior. In addition, we observe that while the GNN-based approach does not always outperform the baseline MEF under every scenario, it produces more stable and consistent outputs, particularly in compound long-lead situations. The approach is also model-agnostic, suggesting that it can be applied directly to other forecasting models with large ensemble outputs, such as statistical, physical, or hybrid models.
对厄尔尼诺-南方涛动(ENSO)进行长期可靠预报一直是气候科学中的长期挑战。先前开发的多模态 ENSO 预报(MEF)模型通过两个独立的深度学习模块——三维卷积神经网络(3D-CNN)和时间序列模块——产生 80 个集合预报。在他们的方法中,两个模块的输出通过一种加权策略结合,依据全球表现按优先级对其中一个模块进行倾斜。然而,他们并未对单独的集合成员进行单独加权或测试,这可能限制了模型在利用那些表现优异但分布分散的预测方面的优化能力。在本研究中,我们提出了一个更好的框架,使用基于图的方法直接对 80 个集合成员之间的相似性建模。通过构建一个无向图,该图的顶点为集合输出,边的权重通过均方根误差(RMSE)和相关性来度量相似性,我们识别并聚类结构上相似且准确的预测。基于此,我们利用社区发现方法从中获得了一个优化的 20 个成员子集。 最终预测通过对这个优化的子集取平均得到。该方法通过去噪和强调集合一致性来提升预报技能。有趣的是,我们基于图的方法在顶尖表现者中显示出稳健的统计特性,为集合行为提供了新的洞见。此外,我们观察到尽管基于 GNN 的方法并不总是在每种情景下都优于基线 MEF,但它产生了更稳定且一致的输出,尤其在复杂的长期预报情形中。该方法也是模型不可知的,表明它可以直接应用于其他产生大量集合输出的预报模型,如统计模型、物理模型或混合模型。
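The selection pipeline described above (similarity graph, clustering, averaging a subset) can be sketched on a four-member toy ensemble. Several simplifications are assumptions: the RMSE and correlation thresholds are arbitrary, connected components of the thresholded graph stand in for a real community detection method, and the "reference" series plays the role of validation observations used to judge accuracy.

```python
import math

def rmse(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def pearson(a, b):
    # Pearson correlation; assumes neither series is constant.
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def similar(a, b, max_rmse=0.3, min_corr=0.9):
    # Edge rule for the undirected similarity graph (thresholds are assumed).
    return rmse(a, b) <= max_rmse and pearson(a, b) >= min_corr

def communities(members):
    # Connected components of the similarity graph, found by depth-first search.
    # This is a simple stand-in for a proper community detection algorithm.
    seen, groups = set(), []
    for start in range(len(members)):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            i = stack.pop()
            group.append(i)
            for j in range(len(members)):
                if j not in seen and similar(members[i], members[j]):
                    seen.add(j)
                    stack.append(j)
        groups.append(sorted(group))
    return groups

def select_and_average(members, reference):
    # Keep the community with the lowest mean RMSE against the reference,
    # then average its members to form the final forecast.
    groups = communities(members)
    best = min(groups,
               key=lambda g: sum(rmse(members[i], reference) for i in g) / len(g))
    mean = [sum(members[i][t] for i in best) / len(best)
            for t in range(len(reference))]
    return best, mean

reference = [0.0, 0.5, 1.0, 1.5]          # hypothetical validation series
members = [
    [0.1, 0.6, 1.1, 1.6],
    [-0.1, 0.4, 0.9, 1.4],
    [1.5, 1.0, 0.5, 0.0],
    [1.4, 1.1, 0.4, 0.1],
]
best, forecast = select_and_average(members, reference)
```

The two members that track the reference form one community and the two anti-correlated members form another, so averaging only the first community cancels their opposite biases instead of being pulled off by the outliers.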
Subjects: Atmospheric and Oceanic Physics, Artificial Intelligence 学科:大气与海洋物理学,人工智能
Publish: 2025-08-10 16:16:58 UTC 发表时间:2025-08-10 16:16:58 协调世界时
#168 AgriVLN: Vision-and-Language Navigation for Agricultural Robots #168 AgriVLN:用于农业机器人的视觉与语言导航
Authors: [Xiaobei Zhao](https://arxiv.org/search/?searchtype=author&query=Xiaobei Zhao), [Xingqi Lyu](https://arxiv.org/search/?searchtype=author&query=Xingqi Lyu), [Xiang Li](https://arxiv.org/search/?searchtype=author&query=Xiang Li)
Agricultural robots have emerged as powerful assistants in agricultural tasks; nevertheless, they still rely heavily on manual operation or fixed, non-relocatable rails for movement, resulting in limited mobility and poor adaptability. Vision-and-Language Navigation (VLN) enables robots to navigate to target destinations by following natural language instructions, and has demonstrated strong performance in several domains. However, none of the existing benchmarks or methods is specifically designed for agricultural scenes. To bridge this gap, we propose the Agriculture to Agriculture (A2A) benchmark, containing 1,560 episodes across six diverse agricultural scenes, in which all realistic RGB videos are captured by the front-facing camera of a quadruped robot at a height of 0.38 meters, aligning with practical deployment conditions. Meanwhile, we propose the Vision-and-Language Navigation for Agricultural Robots (AgriVLN) baseline, based on a Vision-Language Model (VLM) prompted with carefully crafted templates, which can understand both the given instructions and agricultural environments to generate appropriate low-level actions for robot control. When evaluated on A2A, AgriVLN performs well on short instructions but struggles with long instructions, because it often fails to track which part of the instruction is currently being executed. To address this, we further propose the Subtask List (STL) instruction decomposition module and integrate it into AgriVLN, improving Success Rate (SR) from 0.33 to 0.47. We additionally compare AgriVLN with several existing VLN methods, demonstrating state-of-the-art performance in the agricultural domain.
农业机器人已成为执行农务任务的强大力量,但在移动方面仍严重依赖人工操作或不可移动的轨道,导致机动性受限且适应性差。视觉与语言导航(VLN)使机器人能够按照自然语言指令导航至目标地点,并在多个领域展现出强大性能。然而,现有的基准或方法均未专门针对农业场景设计。为弥补这一空白,我们提出了 Agriculture to Agriculture(A2A)基准,包含跨越六个多样农业场景的 1,560 个情节,其中所有真实 RGB 视频均由安装在高度为 0.38 米的四足机器人前置摄像头拍摄,契合实际部署条件。与此同时,我们提出了基于视觉-语言模型(VLM)并使用精心设计模板提示的农业机器人视觉与语言导航(AgriVLN)基线,该方法既能理解给定指令,也能感知农业环境,从而生成用于机器人控制的恰当低级动作。 在 A2A 上评估时,AgriVLN 在短指令上表现良好,但在长指令上表现欠佳,因为它常常无法跟踪当前正在执行的指令部分。为了解决这一问题,我们进一步提出了子任务列表(STL)指令分解模块并将其整合到 AgriVLN 中,使成功率(SR)从 0.33 提升到 0.47。我们还将 AgriVLN 与若干现有的视觉-语言导航(VLN)方法进行了比较,展示了在农业领域的最先进性能。
Subjects: Robotics, Artificial Intelligence, Computer Vision and Pattern Recognition 学科:机器人、人工智能、计算机视觉与模式识别
Publish: 2025-08-10 16:07:23 UTC 发布:2025-08-10 16:07:23 UTC
#169 A Spin Glass Characterization of Neural Networks #169 用自旋玻璃表征神经网络
Author: [Jun Li](https://arxiv.org/search/?searchtype=author&query=Jun Li) 作者:Jun Li
This work presents a statistical mechanics characterization of neural networks, motivated by the replica symmetry breaking (RSB) phenomenon in spin glasses. A Hopfield-type spin glass model is constructed from a given feedforward neural network (FNN). Overlaps between simulated replica samples serve as a characteristic descriptor of the FNN. The connection between the spin-glass description and commonly studied properties of the FNN – such as data fitting, capacity, generalization, and robustness – has been investigated and empirically demonstrated. Unlike prior analytical studies that focus on model ensembles, this method provides a computable descriptor for individual network instances, which reveals nontrivial structural properties that are not captured by conventional metrics such as loss or accuracy. Preliminary results suggest its potential for practical applications such as model inspection, safety verification, and detection of hidden vulnerabilities. 这项工作提出了一种神经网络的统计力学表征,灵感来自自旋玻璃中的副本对称性破缺(RSB)现象。基于给定的前馈神经网络(FNN),构建了一种霍普菲尔德型自旋玻璃模型。模拟的副本样本之间的重叠作为 FNN 的特征描述符。研究并实证了自旋玻璃描述与常被研究的 FNN 属性之间的联系——例如数据拟合、容量、泛化和鲁棒性。与以往侧重模型集合的解析性研究不同,此方法为单个网络实例提供了可计算的描述符,揭示了常规度量(如损失或准确率)无法捕捉的非平凡结构特性。初步结果表明其在模型检查、安全性验证和隐藏脆弱性检测等实际应用中的潜力。
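The descriptor is built from overlaps between replica samples. The standard spin-glass overlap between two configurations of N spins in {-1, +1} is q_ab = (1/N) Σ_i s_i^a s_i^b, and a minimal sketch of computing all pairwise overlaps follows. How the paper derives the spin model from a network and how it samples replicas is not covered here; the four configurations below are made up.

```python
from itertools import combinations

def overlap(replica_a, replica_b):
    # q_ab = (1/N) * sum_i s_i^a * s_i^b for spins in {-1, +1}.
    return sum(a * b for a, b in zip(replica_a, replica_b)) / len(replica_a)

def overlap_distribution(replicas):
    # Pairwise overlaps over all distinct replica pairs; in spin-glass theory
    # the shape of this distribution signals replica symmetry breaking.
    return [overlap(a, b) for a, b in combinations(replicas, 2)]

# Hypothetical replica samples (not produced by any real network).
replicas = [
    [1, 1, -1, -1],
    [1, 1, -1, -1],
    [-1, -1, 1, 1],
    [1, -1, 1, -1],
]
overlaps = overlap_distribution(replicas)
```

An overlap of 1 means two replicas settled into the same state, -1 the globally flipped state, and values near 0 unrelated states, so a histogram of these values summarises how many distinct states the sampled system visits.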
Subjects: Disordered Systems and Neural Networks, Artificial Intelligence, Machine Learning 主题:无序系统与神经网络、人工智能、机器学习
Publish: 2025-08-10 15:53:58 UTC 发布:2025-08-10 15:53:58 UTC
#170 Urbanite: A Dataflow-Based Framework for Human-AI Interactive Alignment in Urban Visual Analytics #170 Urbanite:一个基于数据流的框架,用于城市可视化分析中人机交互对齐
Authors: [Gustavo Moreira](https://arxiv.org/search/?searchtype=author&query=Gustavo Moreira), [Leonardo Ferreira](https://arxiv.org/search/?searchtype=author&query=Leonardo Ferreira), [Carolina Veiga](https://arxiv.org/search/?searchtype=author&query=Carolina Veiga), [Maryam Hosseini](https://arxiv.org/search/?searchtype=author&query=Maryam Hosseini), [Fabio Miranda](https://arxiv.org/search/?searchtype=author&query=Fabio Miranda) 作者:Gustavo Moreira、Leonardo Ferreira、Carolina Veiga、Maryam Hosseini、Fabio Miranda
With the growing availability of urban data and the increasing complexity of societal challenges, visual analytics has become essential for deriving insights into pressing real-world problems. However, analyzing such data is inherently complex and iterative, requiring expertise across multiple domains. The need to manage diverse datasets, distill intricate workflows, and integrate various analytical methods presents a high barrier to entry, especially for researchers and urban experts who lack proficiency in data management, machine learning, and visualization. Advancements in large language models offer a promising solution to lower the barriers to the construction of analytics systems by enabling users to specify intent rather than define precise computational operations. However, this shift from explicit operations to intent-based interaction introduces challenges in ensuring alignment throughout the design and development process. Without proper mechanisms, gaps can emerge between user intent, system behavior, and analytical outcomes. To address these challenges, we propose Urbanite, a framework for human-AI collaboration in urban visual analytics. Urbanite leverages a dataflow-based model that allows users to specify intent at multiple scopes, enabling interactive alignment across the specification, process, and evaluation stages of urban analytics. Based on findings from a survey to uncover challenges, Urbanite incorporates features to facilitate explainability, multi-resolution definition of tasks across dataflows, nodes, and parameters, while supporting the provenance of interactions. We demonstrate Urbanite’s effectiveness through usage scenarios created in collaboration with urban experts. Urbanite is available at https://urbantk.org/urbanite. 
随着城市数据日益增多以及社会挑战日益复杂,视觉分析在洞察紧迫的现实问题方面变得至关重要。然而,分析此类数据本质上具有复杂性和迭代性,要求跨多个领域的专业知识。管理多样数据集、提炼复杂工作流并整合各种分析方法的需求,构成了较高的入门门槛,尤其对于那些在数据管理、机器学习和可视化方面缺乏熟练技能的研究人员和城市专家而言。大型语言模型的进步为降低构建分析系统的门槛提供了有希望的解决方案,因为它们使用户可以指定意图而无需定义精确的计算操作。然而,从显式操作转向基于意图的交互在设计和开发过程中引入了确保对齐性的挑战。如果没有适当的机制,用户意图、系统行为与分析结果之间可能会出现脱节。为了解决这些挑战,我们提出了 Urbanite——一个用于城市视觉分析的人机协作框架。 Urbanite 利用一种基于数据流的模型,允许用户在多个范围上指定意图,从而实现城市分析的规范、流程和评估阶段之间的交互式对齐。基于一项旨在发现挑战的调查结果,Urbanite 集成了便于解释性的功能、在数据流、节点和参数级别对任务进行多分辨率定义的能力,同时支持交互溯源。我们通过与城市专家合作创建的使用场景展示了 Urbanite 的有效性。Urbanite 可在 https://urbantk.org/urbanite 获得。
Subjects: Human-Computer Interaction, Artificial Intelligence 主题:人机交互,人工智能
Publish: 2025-08-10 15:44:37 UTC 发布时间:2025-08-10 15:44:37 UTC
#171 AutoAssert 1: A LoRA Fine-Tuned LLM Model for Efficient Automated Assertion Generation #171 AutoAssert 1:一种用于高效自动断言生成的 LoRA 微调 LLM 模型
Authors: [Yi Zhong](https://arxiv.org/search/?searchtype=author&query=Yi Zhong), [Hongchao Liu](https://arxiv.org/search/?searchtype=author&query=Hongchao Liu), [Di ZHao](https://arxiv.org/search/?searchtype=author&query=Di ZHao) 作者:钟毅,刘洪超,赵迪
As the complexity of software systems continues to increase, the demand for automated testing and maintenance tools is growing exponentially. To meet this urgent need, we propose a new assertion generation method based on Hardware Description Language (HDL). This method combines a lightweight, parameter-adjustable large language model (LLM) with the Unsloth platform to automatically generate test cases, thereby significantly reducing training costs without sacrificing accuracy or generalization performance. Empirical evaluation shows that our method can efficiently generate assertions that strictly conform to the hardware logic. This framework provides a robust and flexible solution to modern software testing and maintenance challenges. https://github.com/liusu-orange/AutoAssert-1 and https://gitee.com/OpenBPU/auto-assert1 are the locations of the source code. 随着软件系统复杂性的不断增加,对自动化测试和维护工具的需求呈指数级增长。为满足这一紧迫需求,我们提出了一种基于硬件描述语言(HDL)的新型断言生成方法。该方法将轻量级、参数可调的 LLM 与 Unsloth 平台相结合,以自动生成测试用例,从而在不牺牲准确性或泛化性能的前提下显著降低训练成本。实证评估表明,我们的方法能够高效生成严格符合硬件逻辑的断言。该框架为现代软件测试和维护挑战提供了稳健且灵活的解决方案。源代码位于 https://github.com/liusu-orange/AutoAssert-1 和 https://gitee.com/OpenBPU/auto-assert1 。
Subjects: Software Engineering, Artificial Intelligence 主题:软件工程,人工智能
Publish: 2025-08-10 14:43:54 UTC 发布:2025-08-10 14:43:54 UTC
#172 ProteoKnight: Convolution-based phage virion protein classification and uncertainty analysis #172 ProteoKnight:基于卷积的噬菌体衣壳蛋白分类与不确定性分析
Authors: [Samiha Afaf Neha](https://arxiv.org/search/?searchtype=author&query=Samiha Afaf Neha), [Abir Ahammed Bhuiyan](https://arxiv.org/search/?searchtype=author&query=Abir Ahammed Bhuiyan), [Md. Ishrak Khan](https://arxiv.org/search/?searchtype=author&query=Md. Ishrak Khan) 作者:Samiha Afaf Neha,Abir Ahammed Bhuiyan,Md. Ishrak Khan
Introduction: Accurate prediction of Phage Virion Proteins (PVP) is essential for genomic studies due to their crucial role as structural elements in bacteriophages. Computational tools, particularly machine learning, have emerged for annotating phage protein sequences from high-throughput sequencing. However, effective annotation requires specialized sequence encodings. Our paper introduces ProteoKnight, a new image-based encoding method that addresses spatial constraints in existing techniques, yielding competitive performance in PVP classification using pre-trained convolutional neural networks. Additionally, our study evaluates prediction uncertainty in binary PVP classification through Monte Carlo Dropout (MCD). Methods: ProteoKnight adapts the classical DNA-Walk algorithm for protein sequences, incorporating pixel colors and adjusting walk distances to capture intricate protein features. Encoded sequences were classified using multiple pre-trained CNNs. Variance and entropy measures assessed prediction uncertainty across proteins of various classes and lengths. Results: Our experiments achieved 90.8% accuracy in binary classification, comparable to state-of-the-art methods. Multi-class classification accuracy remains suboptimal. Our uncertainty analysis unveils variability in prediction confidence influenced by protein class and sequence length. Conclusions: Our study surpasses frequency chaos game representation (FCGR) by introducing novel image encoding that mitigates spatial information loss limitations. Our classification technique yields accurate and robust PVP predictions while identifying low-confidence predictions.
引言:由于噬菌体结构蛋白在噬菌体中的关键作用,准确预测噬菌体衣壳蛋白(Phage Virion Proteins,PVP)对基因组学研究至关重要。计算工具,尤其是机器学习,已被用于从高通量测序中注释噬菌体蛋白序列。然而,有效的注释需要专门的序列编码方法。本文提出了 ProteoKnight,一种新的基于图像的编码方法,解决了现有技术中的空间约束问题,利用预训练卷积神经网络在 PVP 分类任务上取得了具有竞争力的性能。此外,我们通过蒙特卡洛丢弃(Monte Carlo Dropout,MCD)评估了二元 PVP 分类中的预测不确定性。 方法:ProteoKnight 将经典的 DNA-Walk 算法改编用于蛋白质序列,加入像素颜色并调整行走距离以捕捉复杂的蛋白质特征。编码后的序列使用多个预训练的卷积神经网络进行分类。通过方差和熵度量评估了不同类别和长度蛋白质的预测不确定性。 结果:我们的实验在二分类中达到了 90.8%的准确率,可与最先进的方法相媲美。多类分类准确率仍不理想。我们的不确定性分析揭示了预测置信度的变异性,受蛋白质类别和序列长度的影响。 结论:我们通过引入新颖的图像编码超越了频率混沌游戏表示(FCGR),减轻了空间信息丢失的限制。我们的分类技术在产生准确且稳健的 PVP 预测的同时,能够识别低置信度的预测。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-10 13:45:08 UTC 发布:2025-08-10 13:45:08 UTC
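The variance and entropy measures mentioned in the ProteoKnight abstract can be illustrated with a minimal Monte Carlo Dropout sketch. This is not the paper's code: the one-layer logistic model, its weights, and the names `toy_stochastic_forward` and `mc_dropout_stats` are hypothetical stand-ins chosen only to show how repeated stochastic passes are summarized.

```python
import math
import random

def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) prediction; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def mc_dropout_stats(probs):
    """Summarize T stochastic forward passes for a single input."""
    t = len(probs)
    mean = sum(probs) / t
    variance = sum((p - mean) ** 2 for p in probs) / t
    return {"mean": mean, "variance": variance, "entropy": binary_entropy(mean)}

def toy_stochastic_forward(features, weights, drop_rate, rng):
    """Hypothetical logistic model with dropout left active at inference time."""
    keep = 1.0 - drop_rate
    z = 0.0
    for x, w in zip(features, weights):
        if rng.random() < keep:
            z += x * w / keep  # inverted-dropout scaling of the kept units
    return 1.0 / (1.0 + math.exp(-z))

rng = random.Random(0)  # fixed seed so the sketch is reproducible
passes = [
    toy_stochastic_forward([1.0, -2.0, 0.5], [0.8, 0.3, -1.2], 0.5, rng)
    for _ in range(100)
]
stats = mc_dropout_stats(passes)
print(stats)
```

A high variance or an entropy near ln 2 would flag the input as a low-confidence prediction; how the paper thresholds these values is not stated in the abstract.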
#173 Efficient Edge LLMs Deployment via HessianAware Quantization and CPU GPU Collaborative #173 通过 HessianAware 量化与 CPU GPU 协同实现高效边缘 LLM 部署
Authors: [Tuo Zhang](https://arxiv.org/search/?searchtype=author&query=Tuo Zhang), [Ning Li](https://arxiv.org/search/?searchtype=author&query=Ning Li), [Xin Yuan](https://arxiv.org/search/?searchtype=author&query=Xin Yuan), [Wenchao Xu](https://arxiv.org/search/?searchtype=author&query=Wenchao Xu), [Quan Chen](https://arxiv.org/search/?searchtype=author&query=Quan Chen), [Song Guo](https://arxiv.org/search/?searchtype=author&query=Song Guo), [Haijun Zhang](https://arxiv.org/search/?searchtype=author&query=Haijun Zhang) 作者:张拓、李宁、苑鑫、许文超、陈权、郭松、张海军
With the breakthrough progress of large language models (LLMs) in natural language processing and multimodal tasks, efficiently deploying them on resource-constrained edge devices has become a critical challenge. The Mixture of Experts (MoE) architecture enhances model capacity through sparse activation, but faces two major difficulties in practical deployment: (1) The presence of numerous outliers in activation distributions leads to severe degradation in quantization accuracy for both activations and weights, significantly impairing inference performance; (2) Under limited memory, efficient offloading and collaborative inference of expert modules struggle to balance latency and throughput. To address these issues, this paper proposes an efficient MoE edge deployment scheme based on Hessian-Aware Quantization (HAQ) and CPU-GPU collaborative inference. First, by introducing smoothed Hessian matrix quantization, we achieve joint 8-bit quantization of activations and weights, which significantly alleviates the accuracy loss caused by outliers while ensuring efficient implementation on mainstream hardware. Second, we design an expert-level collaborative offloading and inference mechanism, which, combined with expert activation path statistics, enables efficient deployment and scheduling of expert modules between CPU and GPU, greatly reducing memory footprint and inference latency. Extensive experiments validate the effectiveness of our method on mainstream large models such as the OPT series and Mixtral 8x7B: on datasets like Wikitext2 and C4, the inference accuracy of the low-bit quantized model approaches that of the full-precision model, while GPU memory usage is reduced by about 60%, and inference latency is significantly improved.
随着大型语言模型(LLMs)在自然语言处理和多模态任务上取得突破性进展,如何在资源受限的边缘设备上高效部署它们已成为一项关键挑战。专家混合(MoE)架构通过稀疏激活提升模型容量,但在实际部署中面临两大难题: (1) 激活分布中存在大量异常值,导致激活和权重的量化精度严重下降,从而显著削弱推理性能; (2) 在内存受限的情况下,专家模块的高效卸载与协同推理难以在延迟与吞吐量之间取得平衡。为了解决这些问题,本文提出了一种基于 Hessian 感知量化(HAQ)和 CPU-GPU 协同推理的高效 MoE 边缘部署方案。首先,通过引入平滑的 Hessian 矩阵量化,我们实现了激活与权重的联合 8 位量化,显著缓解了异常值带来的精度损失,同时保证了在主流硬件上的高效实现。 其次,我们设计了一种专家级的协同卸载与推理机制,结合专家激活路径统计,能够在 CPU 与 GPU 之间高效部署与调度专家模块,大幅减少内存占用和推理延迟。大量实验在主流大模型(如 OPT 系列和 Mixtral 8x7B)上验证了我们方法的有效性:在 Wikitext2 和 C4 等数据集上,低位量化模型的推理精度接近全精度模型,同时 GPU 内存使用量减少约 60%,推理延迟显著改善。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-10 12:59:57 UTC 发布:2025-08-10 12:59:57 UTC
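The outlier problem described in the HAQ abstract can be made concrete with a small sketch. It uses plain symmetric absmax 8-bit quantization, which is a swapped-in baseline and not the paper's smoothed Hessian method; the sample values and function names are assumptions chosen only to show how a single outlier inflates the rounding error of every other value.

```python
def quantize_int8(values):
    """Symmetric per-tensor absmax quantization to the range [-127, 127]."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to approximate real values."""
    return [qi * scale for qi in q]

def mean_abs_error(values):
    """Average round-trip error of quantize followed by dequantize."""
    q, scale = quantize_int8(values)
    restored = dequantize(q, scale)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)

typical = [0.01 * i for i in range(-50, 51)]  # values spread over [-0.5, 0.5]
with_outlier = typical + [60.0]               # the same values plus one outlier

err_typical = mean_abs_error(typical)
err_outlier = mean_abs_error(with_outlier)
print(err_typical, err_outlier)
```

Because the scale is set by the largest magnitude, the outlier stretches the quantization grid and most of the small values collapse onto a handful of codes. Methods such as the one in the abstract aim to reduce this effect before quantizing.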
#174 Strategies of Code-switching in Human-Machine Dialogs #174 人机对话中的语言切换策略
Authors: [Dean Geckt](https://arxiv.org/search/?searchtype=author&query=Dean Geckt), [Melinda Fricke](https://arxiv.org/search/?searchtype=author&query=Melinda Fricke), [Shuly Wintner](https://arxiv.org/search/?searchtype=author&query=Shuly Wintner) 作者:Dean Geckt、Melinda Fricke、Shuly Wintner
Most people are multilingual, and most multilinguals code-switch, yet the characteristics of code-switched language are not fully understood. We developed a chatbot capable of completing a Map Task with human participants using code-switched Spanish and English. In two experiments, we prompted the bot to code-switch according to different strategies, examining (1) the feasibility of such experiments for investigating bilingual language use, and (2) whether participants would be sensitive to variations in discourse and grammatical patterns. Participants generally enjoyed code-switching with our bot as long as it produced predictable code-switching behavior; when code-switching was random or ungrammatical (as when producing unattested incongruent mixed-language noun phrases, such as 'la fork'), participants enjoyed the task less and were less successful at completing it. These results underscore the potential downsides of deploying insufficiently developed multilingual language technology, while also illustrating the promise of such technology for conducting research on bilingual language use. 大多数人会使用多种语言,而大多数多语言者会在语际切换 (code-switch) ,但对语际切换语言的特征仍未完全了解。我们开发了一个聊天机器人,能够使用夹杂西班牙语和英语的语际切换与真人参与者完成一项地图任务。在两项实验中,我们根据不同策略提示机器人进行语际切换,考察 (1) 此类实验作为研究双语语言使用的可行性,以及 (2) 参与者是否会对话语和语法模式的变化敏感。只要机器人产生可预测的语际切换行为,参与者通常都喜欢与其进行语际切换;当语际切换是随机的或违反语法的(例如生成未见过的不一致混合语言名词短语,如 la fork)时,参与者对此任务的喜好降低,且完成任务的成功率也下降。这些结果强调了部署开发不充分的多语言技术可能带来的负面影响,同时也展示了此类技术在研究双语语言使用方面的潜力。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-10 12:41:46 UTC 发布时间:2025-08-10 12:41:46 UTC
#175 ObfusQAte: A Proposed Framework to Evaluate LLM Robustness on Obfuscated Factual Question Answering #175 ObfusQAte:一个用于评估 LLM 在混淆事实问答上鲁棒性的提议框架
Authors: [Shubhra Ghosh](https://arxiv.org/search/?searchtype=author&query=Shubhra Ghosh), [Abhilekh Borah](https://arxiv.org/search/?searchtype=author&query=Abhilekh Borah), [Aditya Kumar Guru](https://arxiv.org/search/?searchtype=author&query=Aditya Kumar Guru), [Kripabandhu Ghosh](https://arxiv.org/search/?searchtype=author&query=Kripabandhu Ghosh) 作者:Shubhra Ghosh、Abhilekh Borah、Aditya Kumar Guru、Kripabandhu Ghosh
The rapid proliferation of Large Language Models (LLMs) has significantly contributed to the development of equitable AI systems capable of factual question-answering (QA). However, no known study tests the LLMs’ robustness when presented with obfuscated versions of questions. To systematically evaluate these limitations, we propose a novel technique, ObfusQAte and, leveraging the same, introduce ObfusQA, a comprehensive, first of its kind, framework with multi-tiered obfuscation levels designed to examine LLM capabilities across three distinct dimensions: (i) Named-Entity Indirection, (ii) Distractor Indirection, and (iii) Contextual Overload. By capturing these fine-grained distinctions in language, ObfusQA provides a comprehensive benchmark for evaluating LLM robustness and adaptability. Our study observes that LLMs exhibit a tendency to fail or generate hallucinated responses when confronted with these increasingly nuanced variations. To foster research in this direction, we make ObfusQAte publicly available. 大型语言模型(LLMs)的快速普及显著推动了能够进行事实问答(QA)的公平人工智能系统的发展。然而,目前尚无研究在面对问题的混淆版本时测试 LLMs 的鲁棒性。为系统评估这些局限性,我们提出了一种新技术 ObfusQAte,并基于该技术引入了 ObfusQA——一个前所未有的综合框架,具有多层次的混淆级别,旨在从三个不同维度检验 LLM 的能力:(i)命名实体间接化、(ii)干扰项间接化、以及(iii)上下文过载。通过捕捉语言中的这些细微差异,ObfusQA 为评估 LLM 的鲁棒性和适应性提供了一个全面的基准。我们的研究发现,LLMs 在面对这些逐渐细化的变体时,倾向于失败或产生编造的回答。为促进该方向的研究,我们将 ObfusQAte 公开发布。
Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题:计算与语言,人工智能,机器学习
Publish: 2025-08-10 12:27:52 UTC 发布:2025-08-10 12:27:52 UTC
#176 FlexCTC: GPU-powered CTC Beam Decoding with advanced Contextual Abilities #176 FlexCTC:具备高级上下文能力的 GPU 加速 CTC 波束解码
Authors: [Lilit Grigoryan](https://arxiv.org/search/?searchtype=author&query=Lilit Grigoryan), [Vladimir Bataev](https://arxiv.org/search/?searchtype=author&query=Vladimir Bataev), [Nikolay Karpov](https://arxiv.org/search/?searchtype=author&query=Nikolay Karpov), [Andrei Andrusenko](https://arxiv.org/search/?searchtype=author&query=Andrei Andrusenko), [Vitaly Lavrukhin](https://arxiv.org/search/?searchtype=author&query=Vitaly Lavrukhin), [Boris Ginsburg](https://arxiv.org/search/?searchtype=author&query=Boris Ginsburg) 作者:Lilit Grigoryan、Vladimir Bataev、Nikolay Karpov、Andrei Andrusenko、Vitaly Lavrukhin、Boris Ginsburg
While beam search improves speech recognition quality over greedy decoding, standard implementations are slow, often sequential, and CPU-bound. To fully leverage modern hardware capabilities, we present a novel open-source FlexCTC toolkit for fully GPU-based beam decoding, designed for Connectionist Temporal Classification (CTC) models. Developed entirely in Python and PyTorch, it offers a fast, user-friendly, and extensible alternative to traditional C++, CUDA, or WFST-based decoders. The toolkit features a high-performance, fully batched GPU implementation with eliminated CPU-GPU synchronization and minimized kernel launch overhead via CUDA Graphs. It also supports advanced contextualization techniques, including GPU-powered N-gram language model fusion and phrase-level boosting. These features enable accurate and efficient decoding, making them suitable for both research and production use. 虽然束搜索(beam search)相比贪心解码提高了语音识别质量,但标准实现通常很慢、常为顺序执行且受限于 CPU。为充分利用现代硬件能力,我们提出了一个新颖的开源 FlexCTC 工具包,用于完全基于 GPU 的束解码,专为连接时序分类(CTC)模型设计。该工具包完全使用 Python 和 PyTorch 开发,提供了一个快速、易用且可扩展的替代方案,可替代传统的 C++、CUDA 或基于 WFST 的解码器。该工具包具有高性能、全批次的 GPU 实现,消除了 CPU-GPU 同步并通过 CUDA 图减少了内核启动开销。它还支持高级上下文化技术,包括由 GPU 驱动的 N 元语法语言模型融合和短语级增强。这些特性使其能够实现准确且高效的解码,适用于研究和生产环境。
Subjects: Audio and Speech Processing, Artificial Intelligence, Computation and Language, Machine Learning, Sound 主题:音频与语音处理、人工智能、计算与语言、机器学习、声音
Publish: 2025-08-10 12:15:57 UTC 发布日期:2025-08-10 12:15:57 协调世界时 (UTC)
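To show why beam search can beat greedy decoding for CTC, as the FlexCTC abstract states, here is a minimal sequential CPU sketch of the textbook CTC prefix beam search. It is not FlexCTC's batched GPU implementation and includes no language-model fusion or phrase boosting; the function names and the two-frame toy input are assumptions for illustration.

```python
from collections import defaultdict

def ctc_greedy(probs, blank=0):
    """Pick the best symbol per frame, collapse repeats, then drop blanks."""
    out, prev = [], None
    for frame in probs:
        s = max(range(len(frame)), key=lambda i: frame[i])
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return tuple(out)

def ctc_prefix_beam_search(probs, beam_size=4, blank=0):
    """probs: per-frame probability lists; returns (best prefix, its probability)."""
    # Each prefix maps to (p_blank, p_non_blank): total probability of the
    # paths that collapse to it and end in blank or in a non-blank symbol.
    beams = {(): (1.0, 0.0)}
    for frame in probs:
        next_beams = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            for s, p in enumerate(frame):
                if s == blank:
                    next_beams[prefix][0] += (p_b + p_nb) * p
                elif prefix and prefix[-1] == s:
                    # A repeated symbol merges unless a blank separated the two.
                    next_beams[prefix][1] += p_nb * p
                    next_beams[prefix + (s,)][1] += p_b * p
                else:
                    next_beams[prefix + (s,)][1] += (p_b + p_nb) * p
        ranked = sorted(next_beams.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = {prefix: (pb, pnb) for prefix, (pb, pnb) in ranked[:beam_size]}
    best_prefix, (pb, pnb) = max(beams.items(), key=lambda kv: sum(kv[1]))
    return best_prefix, pb + pnb

# Two frames, symbols: 0 = blank, 1 = "a".
frames = [[0.6, 0.4], [0.6, 0.4]]
print(ctc_greedy(frames))
print(ctc_prefix_beam_search(frames))
```

In this toy input the blank wins each frame, so greedy decoding returns the empty sequence, while the beam search sums the three alignments that collapse to "a" and prefers it. The nested Python loops are exactly the sequential, CPU-bound pattern that the abstract says a batched GPU decoder avoids.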
#177 HealthBranches: Synthesizing Clinically-Grounded Question Answering Datasets via Decision Pathways #177 HealthBranches:通过决策路径合成具有临床依据的问答数据集
Authors: [Cristian Cosentino](https://arxiv.org/search/?searchtype=author&query=Cristian Cosentino), [Annamaria Defilippo](https://arxiv.org/search/?searchtype=author&query=Annamaria Defilippo), [Marco Dossena](https://arxiv.org/search/?searchtype=author&query=Marco Dossena), [Christopher Irwin](https://arxiv.org/search/?searchtype=author&query=Christopher Irwin), [Sara Joubbi](https://arxiv.org/search/?searchtype=author&query=Sara Joubbi), [Pietro Liò](https://arxiv.org/search/?searchtype=author&query=Pietro Liò) 作者:Cristian Cosentino、Annamaria Defilippo、Marco Dossena、Christopher Irwin、Sara Joubbi、Pietro Liò
HealthBranches is a novel benchmark dataset for medical Question-Answering (Q&A), specifically designed to evaluate complex reasoning in Large Language Models (LLMs). This dataset is generated through a semi-automated pipeline that transforms explicit decision pathways from medical sources into realistic patient cases with associated questions and answers. Covering 4,063 case studies across 17 healthcare topics, each data point is based on clinically validated reasoning chains. HealthBranches supports both open-ended and multiple-choice question formats and uniquely includes the full reasoning path for each Q&A. Its structured design enables robust evaluation of LLMs’ multi-step inference capabilities, including their performance in structured Retrieval-Augmented Generation (RAG) contexts. HealthBranches establishes a foundation for the development of more trustworthy, interpretable, and clinically reliable LLMs in high-stakes domains while also serving as a valuable resource for educational purposes. HealthBranches 是一个新颖的医学问答(Q&A)基准数据集,专为评估大型语言模型(LLMs)在复杂推理方面的能力而设计。该数据集通过半自动化管道生成,将医学来源中的明确决策路径转化为具有相关问题和答案的真实患者病例。涵盖 17 个医疗主题的 4,063 个案例研究,每个数据点均基于经过临床验证的推理链。HealthBranches 支持开放式和选择题两种问题格式,并独特地为每个问答包含完整的推理路径。其结构化设计使得能够对 LLMs 的多步推理能力进行稳健评估,包括它们在结构化检索增强生成(RAG)场景中的表现。HealthBranches 为在高风险领域开发更值得信赖、可解释且临床可靠的 LLMs 奠定了基础,同时也作为一项有价值的教育资源。
Subjects: Computation and Language, Artificial Intelligence, Information Retrieval, Machine Learning 主题:计算与语言,人工智能,信息检索,机器学习
Publish: 2025-08-10 11:45:34 UTC 发布:2025-08-10 11:45:34 UTC
#178 MCITlib: Multimodal Continual Instruction Tuning Library and Benchmark #178 MCITlib:多模态持续指令微调库与基准
Authors: [Haiyang Guo](https://arxiv.org/search/?searchtype=author&query=Haiyang Guo), [Fei Zhu](https://arxiv.org/search/?searchtype=author&query=Fei Zhu), [Hongbo Zhao](https://arxiv.org/search/?searchtype=author&query=Hongbo Zhao), [Fanhu Zeng](https://arxiv.org/search/?searchtype=author&query=Fanhu Zeng), [Wenzhuo Liu](https://arxiv.org/search/?searchtype=author&query=Wenzhuo Liu), [Shijie Ma](https://arxiv.org/search/?searchtype=author&query=Shijie Ma), [Da-Han Wang](https://arxiv.org/search/?searchtype=author&query=Da-Han Wang), [Xu-Yao Zhang](https://arxiv.org/search/?searchtype=author&query=Xu-Yao Zhang) 作者:郭海洋、朱飞、赵鸿博、曾泛湖、刘文卓、马世杰、王达汉、张旭尧
Continual learning aims to equip AI systems with the ability to continuously acquire and adapt to new knowledge without forgetting previously learned information, similar to human learning. While traditional continual learning methods focusing on unimodal tasks have achieved notable success, the emergence of Multimodal Large Language Models has brought increasing attention to Multimodal Continual Learning tasks involving multiple modalities, such as vision and language. In this setting, models are expected to not only mitigate catastrophic forgetting but also handle the challenges posed by cross-modal interactions and coordination. To facilitate research in this direction, we introduce MCITlib, a comprehensive and constantly evolving code library for continual instruction tuning of Multimodal Large Language Models. In MCITlib, we have currently implemented 8 representative algorithms for Multimodal Continual Instruction Tuning and systematically evaluated them on 2 carefully selected benchmarks. MCITlib will be continuously updated to reflect advances in the Multimodal Continual Learning field. The codebase is released at https://github.com/Ghy0501/MCITlib. 持续学习旨在赋予人工智能系统持续获取和适应新知识的能力,同时不忘记先前学到的信息,类似于人类学习。尽管以单模态任务为主的传统持续学习方法已取得显著成功,多模态大型语言模型的出现使得涉及视觉与语言等多种模态的多模态持续学习任务受到越来越多关注。在此情境下,模型不仅需要减轻灾难性遗忘,还需应对跨模态交互与协调带来的挑战。为促进该方向的研究,我们引入了 MCITlib,一个用于多模态大型语言模型持续指令调优的全面且持续更新的代码库。在 MCITlib 中,我们目前实现了 8 种具有代表性的多模态持续指令调优算法,并在 2 个精心挑选的基准上对它们进行了系统评估。MCITlib 将持续更新,以反映多模态持续学习领域的最新进展。代码库发布于 https://github.com/Ghy0501/MCITlib。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-10 11:42:36 UTC 发布时间:2025-08-10 11:42:36 UTC
#179 DragonFruitQualityNet: A Lightweight Convolutional Neural Network for Real-Time Dragon Fruit Quality Inspection on Mobile Devices #179 DragonFruitQualityNet:一种用于移动设备实时火龙果质量检测的轻量级卷积神经网络
Authors: [Md Zahurul Haquea](https://arxiv.org/search/?searchtype=author&query=Md Zahurul Haquea), [Yeahyea Sarker](https://arxiv.org/search/?searchtype=author&query=Yeahyea Sarker), [Muhammed Farhan Sadique Mahi](https://arxiv.org/search/?searchtype=author&query=Muhammed Farhan Sadique Mahi), [Syed Jubayer Jaman](https://arxiv.org/search/?searchtype=author&query=Syed Jubayer Jaman), [Md Robiul Islam](https://arxiv.org/search/?searchtype=author&query=Md Robiul Islam) 作者:Md Zahurul Haquea、Yeahyea Sarker、Muhammed Farhan Sadique Mahi、Syed Jubayer Jaman、Md Robiul Islam
Dragon fruit, renowned for its nutritional benefits and economic value, has experienced rising global demand due to its affordability and local availability. As dragon fruit cultivation expands, efficient pre- and post-harvest quality inspection has become essential for improving agricultural productivity and minimizing post-harvest losses. This study presents DragonFruitQualityNet, a lightweight Convolutional Neural Network (CNN) optimized for real-time quality assessment of dragon fruits on mobile devices. We curated a diverse dataset of 13,789 images, integrating self-collected samples with public datasets (dataset from Mendeley Data), and classified them into four categories: fresh, immature, mature, and defective fruits to ensure robust model training. The proposed model achieves an impressive 93.98% accuracy, outperforming existing methods in fruit quality classification. To facilitate practical adoption, we embedded the model into an intuitive mobile application, enabling farmers and agricultural stakeholders to conduct on-device, real-time quality inspections. This research provides an accurate, efficient, and scalable AI-driven solution for dragon fruit quality control, supporting digital agriculture and empowering smallholder farmers with accessible technology. By bridging the gap between research and real-world application, our work advances post-harvest management and promotes sustainable farming practices. 火龙果以其营养价值和经济价值著称,由于价格亲民且在当地易得,全球需求不断上升。随着火龙果种植面积扩大,提升收获前后质量检测的效率已成为提高农业生产力和减少收获后损失的关键。本研究提出了 DragonFruitQualityNet,一种为移动设备上的实时火龙果质量评估优化的轻量级卷积神经网络(CNN)。我们整理了包含 13,789 张图像的多样化数据集,将自采样本与公共数据集(Mendeley Data 数据集)整合,并将图像分为四类:新鲜、未熟、成熟和有缺陷,以确保模型训练的鲁棒性。所提出的模型达到了令人瞩目的 93.98%准确率,优于现有的果实质量分类方法。为促进实际应用,我们将模型嵌入到一个直观的移动应用中,使农民和农业利益相关者能够在设备上进行实时质量检测。 本研究提供了一种准确、高效且可扩展的由人工智能驱动的火龙果质量检测解决方案,支持数字农业并为小农户提供可获取的技术。通过弥合研究与实际应用之间的差距,我们的工作推动了收获后管理的发展并促进了可持续农业实践。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-10 11:41:23 UTC 发布:2025-08-10 11:41:23 UTC
#180 From Knowledge to Conjectures: A Modal Framework for Reasoning about Hypotheses #180 从知识到猜想:一个用于关于假设推理的模态框架
Author: [Fabio Vitali](https://arxiv.org/search/?searchtype=author&query=Fabio Vitali) 作者:Fabio Vitali
This paper introduces a new family of cognitive modal logics designed to formalize conjectural reasoning: a modal system in which cognitive contexts extend known facts with hypothetical assumptions to explore their consequences. Unlike traditional doxastic and epistemic systems, conjectural logics rely on a principle, called Axiom C (φ→□φ), that ensures that all established facts are preserved across hypothetical layers. While Axiom C was dismissed in the past due to its association with modal collapse, we show that the collapse only arises under classical and bivalent assumptions, and specifically in the presence of Axiom T. Hence we avoid Axiom T and adopt a paracomplete semantic framework, grounded in Weak Kleene logic or Description Logic, where undefined propositions coexist with modal assertions. This prevents the modal collapse and guarantees a layering to distinguish between factual and conjectural statements. Under this framework we define new modal systems, e.g., KC and KDC, and show that they are complete, decidable, and robust under partial knowledge. Finally, we introduce a dynamic operation, settle(φ), which formalizes the transition from conjecture to accepted fact, capturing the event of the update of a world’s cognitive state through the resolution of uncertainty. 本文引入了一类新的认知模态逻辑,用以形式化猜想推理:一种在认知语境中以假设性前提扩展已知事实以探索其后果的模态系统。与传统的信念(doxastic)和知识(epistemic)系统不同,猜想逻辑依赖于一个称为公理 C( φ→□φ )的原理,该原理保证所有已确立的事实在假设层之间被保留。尽管过去由于其与模态坍缩相关而被摒弃,我们表明坍缩仅在经典和二值假设下出现,且特别在存在公理 T 时发生。因此我们避免采用公理 T,转而采用一种偏完备的语义框架,基于弱克莱尼逻辑或描述逻辑,在该框架中未定义命题与模态断言并存。这阻止了模态坍缩并保证了区分事实性与猜想性陈述的分层。在此框架下我们定义了新的模态系统,例如 KC 和 KDC,并证明它们在部分知识下是完备的、可判定的且具有鲁棒性。 最后,我们引入了一个动态操作 settle(φ) ,它形式化了从猜想到被接受事实的过渡,捕捉通过解决不确定性而更新世界认知状态的事件。
Subjects: Logic in Computer Science, Artificial Intelligence 主题:计算机科学中的逻辑、人工智能
Publish: 2025-08-10 11:37:49 UTC 发布:2025-08-10 11:37:49 UTC
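The modal collapse that the abstract attributes to combining Axiom C with Axiom T can be spelled out in one step. The derivation below assumes the standard classical, bivalent reading; it only shows the collapse itself, not the paper's paracomplete semantics that avoids it.

```latex
\begin{align*}
\text{(C)}\quad & \varphi \rightarrow \Box\varphi \\
\text{(T)}\quad & \Box\varphi \rightarrow \varphi \\
\text{(C)} + \text{(T)}\quad & \varphi \leftrightarrow \Box\varphi
\end{align*}
```

Once $\varphi \leftrightarrow \Box\varphi$ holds for every formula, the box operator adds nothing, so facts and conjectures can no longer be told apart. Dropping T, as the abstract proposes, removes the second implication and with it this equivalence.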
#181 When Is Prior Knowledge Helpful? Exploring the Evaluation and Selection of Unsupervised Pretext Tasks from a Neuro-Symbolic Perspective #181 何时先验知识有用?从神经符号视角探讨无监督预任务的评估与选择
Authors: [Lin-Han Jia](https://arxiv.org/search/?searchtype=author&query=Lin-Han Jia), [Si-Yu Han](https://arxiv.org/search/?searchtype=author&query=Si-Yu Han), [Wen-Chao Hu](https://arxiv.org/search/?searchtype=author&query=Wen-Chao Hu), [Jie-Jing Shao](https://arxiv.org/search/?searchtype=author&query=Jie-Jing Shao), [Wen-Da Wei](https://arxiv.org/search/?searchtype=author&query=Wen-Da Wei), [Zhi Zhou](https://arxiv.org/search/?searchtype=author&query=Zhi Zhou), [Lan-Zhe Guo](https://arxiv.org/search/?searchtype=author&query=Lan-Zhe Guo), [Yu-Feng Li](https://arxiv.org/search/?searchtype=author&query=Yu-Feng Li) 作者:贾林涵、韩思宇、胡文超、邵捷靖、韦文达、周智、郭兰哲、李玉锋
Neuro-symbolic (Nesy) learning improves the target task performance of models by enabling them to satisfy knowledge, while semi/self-supervised learning (SSL) improves the target task performance by designing unsupervised pretext tasks for unlabeled data to make models satisfy corresponding assumptions. We extend the Nesy theory based on reliable knowledge to the scenario of unreliable knowledge (i.e., assumptions), thereby unifying the theoretical frameworks of SSL and Nesy. Through rigorous theoretical analysis, we demonstrate that, in theory, the impact of pretext tasks on target performance hinges on three factors: knowledge learnability with respect to the model, knowledge reliability with respect to the data, and knowledge completeness with respect to the target. We further propose schemes to operationalize these theoretical metrics, and thereby develop a method that can predict the effectiveness of pretext tasks in advance. This will change the current status quo in practical applications, where the selections of unsupervised tasks are heuristic-based rather than theory-based, and it is difficult to evaluate the rationality of unsupervised pretext task selection before testing the model on the target task. In experiments, we verify a high correlation between the predicted performance (estimated using minimal data) and the actual performance achieved after large-scale semi-supervised or self-supervised learning, thus confirming the validity of the theory and the effectiveness of the evaluation method. 神经-符号(Nesy)学习通过使模型满足知识来提升目标任务的表现,而半监督/自监督学习(SSL)则通过为未标记数据设计无监督的预任务,使模型满足相应的假设,从而提升目标任务的表现。我们将基于可靠知识的 Nesy 理论扩展到不可靠知识(即假设)的情形,从而统一了 SSL 和 Nesy 的理论框架。通过严格的理论分析,我们证明了从理论上看,预任务对目标表现的影响取决于三个因素:相对于模型的知识可学性、相对于数据的知识可靠性和相对于目标的知识完整性。我们进一步提出了将这些理论度量具体化的方案,从而开发出一种能够事先预测预任务有效性的方法。 这将改变当前在实际应用中的现状,在那里无监督任务的选择更多依赖启发式而非理论依据,并且在将模型用于目标任务之前,很难评估无监督预训练任务选择的合理性。在实验中,我们验证了使用最少数据估计的预测性能与在大规模半监督或自监督学习后实际达到的性能之间存在高度相关性,从而确认了该理论的有效性和该评估方法的有效性。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-10 11:23:36 UTC 发布时间:2025-08-10 11:23:36 协调世界时 (UTC)
#182 Revisiting Data Attribution for Influence Functions #182 重审影响函数的数据归因
Authors: [Hongbo Zhu](https://arxiv.org/search/?searchtype=author&query=Hongbo Zhu), [Angelo Cangelosi](https://arxiv.org/search/?searchtype=author&query=Angelo Cangelosi) 作者:朱宏博,Angelo Cangelosi
The goal of data attribution is to trace the model’s predictions through the learning algorithm and back to its training data, thereby identifying the most influential training samples and understanding how the model’s behavior leads to particular predictions. Understanding how individual training examples influence a model’s predictions is fundamental for machine learning interpretability, data debugging, and model accountability. Influence functions, originating from robust statistics, offer an efficient, first-order approximation to estimate the impact of marginally upweighting or removing a data point on a model’s learned parameters and its subsequent predictions, without the need for expensive retraining. This paper comprehensively reviews the data attribution capability of influence functions in deep learning. We discuss their theoretical foundations, recent algorithmic advances for efficient inverse-Hessian-vector product estimation, and evaluate their effectiveness for data attribution and mislabel detection. Finally, we highlight current challenges and promising directions for unleashing the huge potential of influence functions in large-scale, real-world deep learning scenarios. 数据归因的目标是将模型的预测通过学习算法追溯回其训练数据,从而识别出最有影响力的训练样本并理解模型的行为如何导致特定的预测。理解单个训练样本如何影响模型的预测对于机器学习可解释性、数据调试和模型问责至关重要。影响函数起源于稳健统计学,提供了一种高效的一阶近似方法,用于估计微小增加权重或移除某个数据点对模型学习参数及其随后的预测的影响,而无需昂贵的重新训练。本文全面审视了影响函数在深度学习中进行数据归因的能力。我们讨论了其理论基础、用于高效估计逆海森-向量积的近期算法进展,并评估了其在数据归因和错误标签检测方面的有效性。 最后,强调了在大规模、真实世界深度学习场景中释放影响函数巨大潜力的当前挑战和有前景的方向。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-10 11:15:07 UTC 发布时间:2025-08-10 11:15:07 世界协调时间 (UTC)
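The first-order approximation described in the influence-function abstract can be checked on a toy problem where the Hessian is a single number. The sketch below assumes a one-parameter least-squares model y = w·x with made-up data; the function names are hypothetical. It uses the standard estimate that removing a point z shifts the parameters by roughly (1/n)·H⁻¹·∇L(z), then compares it with actual retraining.

```python
def fit_slope(xs, ys):
    """Least-squares slope for the one-parameter model y = w * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def removal_influence(xs, ys, index):
    """First-order estimate of the change in w when one training point is removed."""
    n = len(xs)
    w = fit_slope(xs, ys)
    # Hessian of the mean loss 0.5 * (w * x - y)^2 with respect to w.
    hessian = sum(x * x for x in xs) / n
    # Gradient of the removed point's own loss at the fitted w.
    grad_z = (w * xs[index] - ys[index]) * xs[index]
    return grad_z / (n * hessian)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.9]

estimated = removal_influence(xs, ys, index=2)
retrained = fit_slope(xs[:2] + xs[3:], ys[:2] + ys[3:]) - fit_slope(xs, ys)
print(estimated, retrained)
```

The estimate and the retrained difference agree in sign and rough size, and the gap shrinks as the dataset grows. In deep networks the scalar division becomes an inverse-Hessian-vector product, which is the expensive step the abstract says recent algorithms target.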
#183 "Pull or Not to Pull?": Investigating Moral Biases in Leading Large Language Models Across Ethical Dilemmas
Authors: [Junchen Ding](https://arxiv.org/search/?searchtype=author&query=Junchen Ding), [Penghao Jiang](https://arxiv.org/search/?searchtype=author&query=Penghao Jiang), [Zihao Xu](https://arxiv.org/search/?searchtype=author&query=Zihao Xu), [Ziqi Ding](https://arxiv.org/search/?searchtype=author&query=Ziqi Ding), [Yichen Zhu](https://arxiv.org/search/?searchtype=author&query=Yichen Zhu), [Jiaojiao Jiang](https://arxiv.org/search/?searchtype=author&query=Jiaojiao Jiang), [Yuekang Li](https://arxiv.org/search/?searchtype=author&query=Yuekang Li)
As large language models (LLMs) increasingly mediate ethically sensitive decisions, understanding their moral reasoning processes becomes imperative. This study presents a comprehensive empirical evaluation of 14 leading LLMs, both reasoning enabled and general purpose, across 27 diverse trolley problem scenarios, framed by ten moral philosophies, including utilitarianism, deontology, and altruism. Using a factorial prompting protocol, we elicited 3,780 binary decisions and natural language justifications, enabling analysis along axes of decisional assertiveness, explanation answer consistency, public moral alignment, and sensitivity to ethically irrelevant cues. Our findings reveal significant variability across ethical frames and model types: reasoning enhanced models demonstrate greater decisiveness and structured justifications, yet do not always align better with human consensus. Notably, “sweet zones” emerge in altruistic, fairness, and virtue ethics framings, where models achieve a balance of high intervention rates, low explanation conflict, and minimal divergence from aggregated human judgments. However, models diverge under frames emphasizing kinship, legality, or self interest, often producing ethically controversial outcomes. These patterns suggest that moral prompting is not only a behavioral modifier but also a diagnostic tool for uncovering latent alignment philosophies across providers. We advocate for moral reasoning to become a primary axis in LLM alignment, calling for standardized benchmarks that evaluate not just what LLMs decide, but how and why. 
随着大型语言模型(LLMs)在越来越多涉及伦理敏感决策的场景中发挥中介作用,理解它们的道德推理过程变得至关重要。本研究对 14 种领先的 LLMs 进行了全面的实证评估,涵盖启用推理的模型与通用模型,在 27 种不同的电车难题情景中进行测试,框定了十种道德哲学视角,包括功利主义、义务论和利他主义。通过一种因子式提示协议,我们引出了 3,780 次二元决策和自然语言理由,从而能够沿着决策果断性、解释答案一致性、公众道德一致性以及对伦理无关线索的敏感性等维度进行分析。我们的发现显示,在不同伦理框架和模型类型之间存在显著差异:启用推理的模型表现出更强的果断性和更有结构的论证,但并不总是与人类共识更为一致。值得注意的是,在利他主义、公平性和德性伦理的框架下出现了“甜区”,模型在这些框架中实现了较高的干预率、较低的解释冲突以及与汇总人类判断的最小分歧。 然而,在强调亲属关系、合法性或自身利益的框架下,模型的表现会出现分歧,常常产生在道德上有争议的结果。这些模式表明,道德提示不仅是行为修正手段,还是揭示各供应商潜在对齐哲学的诊断工具。我们主张将道德推理作为 LLM 对齐的主要轴线,呼吁制定标准化基准,不仅评估 LLM 的决策结果,还评估其如何以及为何做出这些决策。
Subjects: Computation and Language, Artificial Intelligence, Computers and Society
Publish: 2025-08-10 10:45:16 UTC 发布:2025-08-10 10:45:16 世界协调时间 (UTC)
#184 Fine-Tuning Large Language Models Using EEG Microstate Features for Mental Workload Assessment #184 使用 EEG 微状态特征微调大型语言模型以评估心理工作负荷
Author: [Bujar Raufi](https://arxiv.org/search/?searchtype=author&query=Bujar Raufi) 作者:Bujar Raufi
This study explores the intersection of electroencephalography (EEG) microstates and Large Language Models (LLMs) to enhance the assessment of cognitive load states. By utilizing EEG microstate features, the research aims to fine-tune LLMs for improved predictions of distinct cognitive states, specifically ‘Rest’ and ‘Load’. The experimental design is delineated in four comprehensive stages: dataset collection and preprocessing, microstate segmentation and EEG backfitting, feature extraction paired with prompt engineering, and meticulous LLM model selection and refinement. Employing a supervised learning paradigm, the LLM is trained to identify cognitive load states based on EEG microstate features integrated into prompts, producing accurate discrimination of cognitive load. A curated dataset, linking EEG features to specified cognitive load conditions, underpins the experimental framework. The results indicate a significant improvement in model performance following the proposed fine-tuning, showcasing the potential of EEG-informed LLMs in cognitive neuroscience and cognitive AI applications. This approach not only contributes to the understanding of brain dynamics but also paves the way for advancements in machine learning techniques applicable to cognitive load and cognitive AI research. 本研究探讨了脑电图(EEG)微状态与 LLMs 的交叉,以增强认知负荷状态的评估。通过利用 EEG 微状态特征,研究旨在微调 LLMs 以更好地预测不同的认知状态,特别是“静息(Rest)”和“负荷(Load)”。实验设计分为四个全面阶段:数据集收集与预处理、微状态分割与 EEG 回拟合、特征提取与提示工程(prompt engineering)、以及细致的 LLM 模型选择与优化。采用监督学习范式,LLM 被训练以基于整合进提示的 EEG 微状态特征识别认知负荷状态,从而实现对认知负荷的准确区分。一个将 EEG 特征与特定认知负荷条件关联的精选数据集支撑了实验框架。结果表明,所提出的微调后模型性能显著提升,体现了基于 EEG 的 LLMs 在认知神经科学和认知人工智能应用中的潜力。 这种方法不仅有助于理解大脑动态,还为可应用于认知负荷和认知人工智能研究的机器学习技术的进步铺平了道路。
Subjects: Human-Computer Interaction, Artificial Intelligence, Signal Processing, Neurons and Cognition 主题:人机交互,人工智能,信号处理,神经元与认知
Publish: 2025-08-10 10:43:09 UTC 发布:2025-08-10 10:43:09 UTC
#185 Representation Understanding via Activation Maximization #185 通过激活最大化理解表征
Authors: [Hongbo Zhu](https://arxiv.org/search/?searchtype=author&query=Hongbo Zhu), [Angelo Cangelosi](https://arxiv.org/search/?searchtype=author&query=Angelo Cangelosi) 作者:朱洪波,Angelo Cangelosi
Understanding internal feature representations of deep neural networks (DNNs) is a fundamental step toward model interpretability. Inspired by neuroscience methods that probe biological neurons using visual stimuli, recent deep learning studies have employed Activation Maximization (AM) to synthesize inputs that elicit strong responses from artificial neurons. In this work, we propose a unified feature visualization framework applicable to both Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). Unlike prior efforts that predominantly focus on the last output-layer neurons in CNNs, we extend feature visualization to intermediate layers as well, offering deeper insights into the hierarchical structure of learned feature representations. Furthermore, we investigate how activation maximization can be leveraged to generate adversarial examples, revealing potential vulnerabilities and decision boundaries of DNNs. Our experiments demonstrate the effectiveness of our approach in both traditional CNNs and modern ViT, highlighting its generalizability and interpretive value. 理解深度神经网络(DNN)内部特征表示是实现模型可解释性的一个基础步骤。受到利用视觉刺激探测生物神经元的神经科学方法的启发,近期深度学习研究采用激活最大化(Activation Maximization,AM)来合成能够引发人工神经元强烈响应的输入。在本工作中,我们提出了一个适用于卷积神经网络(CNN)和视觉 Transformer(ViT)的统一特征可视化框架。不同于以往主要关注 CNN 最末输出层神经元的研究,我们将特征可视化扩展到中间层,从而更深入地洞察所学特征表示的层次结构。此外,我们还探讨了如何利用激活最大化生成对抗样本,以揭示 DNN 的潜在脆弱性和决策边界。我们的实验展示了该方法在传统 CNN 和现代 ViT 中的有效性,突显了其泛化性和解释价值。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-10 10:36:30 UTC 发布时间:2025-08-10 10:36:30 UTC
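Activation maximization, as described in the abstract above, is gradient ascent on the input rather than on the weights. The following is a toy sketch under strong assumptions: the "neuron" is a single linear unit with an L2 penalty on the input, and the gradient is written by hand. The paper works with CNN and ViT features, which would need an autograd framework; none of the names below come from the paper.

```python
def activation(x, w, lam):
    """Toy objective: linear response w.x minus an L2 penalty on the input."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    return dot - 0.5 * lam * sum(xi * xi for xi in x)

def maximize_activation(w, lam=1.0, lr=0.1, steps=300):
    """Gradient ascent on the input x; for this objective the gradient is w - lam * x."""
    x = [0.0] * len(w)
    for _ in range(steps):
        grad = [wi - lam * xi for wi, xi in zip(w, x)]
        x = [xi + lr * gi for xi, gi in zip(x, grad)]
    return x

w = [0.5, -1.0, 2.0]             # hypothetical neuron weights
x_star = maximize_activation(w)  # converges towards w / lam
```

With the penalty weight `lam = 1.0` the optimum is the weight vector itself, so the synthesized input simply mirrors what the unit responds to. This is the intuition behind reading such inputs as feature visualizations.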
#186 MAQuA: Adaptive Question-Asking for Multidimensional Mental Health Screening using Item Response Theory #186 MAQuA:基于项目反应理论的自适应多维心理健康筛查问询系统
Authors: [Vasudha Varadarajan](https://arxiv.org/search/?searchtype=author&query=Vasudha Varadarajan), [Hui Xu](https://arxiv.org/search/?searchtype=author&query=Hui Xu), [Rebecca Astrid Boehme](https://arxiv.org/search/?searchtype=author&query=Rebecca Astrid Boehme), [Mariam Marlan Mirstrom](https://arxiv.org/search/?searchtype=author&query=Mariam Marlan Mirstrom), [Sverker Sikstrom](https://arxiv.org/search/?searchtype=author&query=Sverker Sikstrom), [H. Andrew Schwartz](https://arxiv.org/search/?searchtype=author&query=H. Andrew Schwartz) 作者:Vasudha Varadarajan、Hui Xu、Rebecca Astrid Boehme、Mariam Marlan Mirstrom、Sverker Sikstrom、H. Andrew Schwartz
Recent advances in large language models (LLMs) offer new opportunities for scalable, interactive mental health assessment, but excessive querying by LLMs burdens users and is inefficient for real-world screening across transdiagnostic symptom profiles. We introduce MAQuA, an adaptive question-asking framework for simultaneous, multidimensional mental health screening. Combining multi-outcome modeling on language responses with item response theory (IRT) and factor analysis, MAQuA selects the questions with most informative responses across multiple dimensions at each turn to optimize diagnostic information, improving accuracy and potentially reducing response burden. Empirical results on a novel dataset reveal that MAQuA reduces the number of assessment questions required for score stabilization by 50-87% compared to random ordering (e.g., achieving stable depression scores with 71% fewer questions and eating disorder scores with 85% fewer questions). MAQuA demonstrates robust performance across both internalizing (depression, anxiety) and externalizing (substance use, eating disorder) domains, with early stopping strategies further reducing patient time and burden. These findings position MAQuA as a powerful and efficient tool for scalable, nuanced, and interactive mental health screening, advancing the integration of LLM-based agents into real-world clinical workflows. 近年来大型语言模型(LLMs)的进展为可扩展、互动式心理健康评估提供了新机遇,但由 LLMs 进行的过度询问会给用户带来负担,并且在跨诊断症状谱的真实筛查中效率低下。我们提出了 MAQuA,一种用于同时进行多维心理健康筛查的自适应提问框架。MAQuA 将语言回应上的多结果建模与项目反应理论(IRT)和因子分析相结合,在每一步选择在多个维度上能提供最多信息的题目以优化诊断信息,从而提高准确性并有可能减少答题负担。在一项新数据集上的实证结果表明,与随机排序相比,MAQuA 将评分稳定所需的评估问题数量减少了 50%–87%(例如,在抑郁评分达到稳定时问题数量减少了 71%,在饮食失调评分达到稳定时减少了 85%)。MAQuA 在内化(抑郁、焦虑)和外化(物质使用、饮食失调)领域均表现出稳健的性能,早停策略进一步减少了患者的时间和负担。 这些发现将 MAQuA 定位为一个强大且高效的工具,能够实现可扩展、细致且交互式的心理健康筛查,推动基于 LLM 的代理融入真实临床工作流程。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-10 10:33:16 UTC 发布:2025-08-10 10:33:16 UTC
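The idea of asking the most informative question at each turn can be illustrated with the textbook maximum-information rule from item response theory. This is only a single-dimension sketch under a two-parameter logistic (2PL) model. MAQuA itself combines IRT with multi-outcome language modeling and factor analysis, which is not reproduced here, and the item bank below is invented for illustration.

```python
import math

def p_endorse(theta, a, b):
    """2PL probability of endorsing an item at latent trait level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item: a^2 * P * (1 - P)."""
    p = p_endorse(theta, a, b)
    return a * a * p * (1.0 - p)

def pick_next_item(theta, items, asked):
    """Return the id of the unasked item with the highest information at theta."""
    candidates = {k: v for k, v in items.items() if k not in asked}
    return max(candidates, key=lambda k: item_information(theta, *candidates[k]))

# Hypothetical item bank: id -> (discrimination a, difficulty b).
items = {"q1": (1.0, -1.0), "q2": (2.0, 0.0), "q3": (1.5, 2.0)}
next_item = pick_next_item(theta=0.0, items=items, asked=set())
print(next_item)  # q2
```

At `theta = 0.0` the item `q2` wins because it is both highly discriminating and centred on the current trait estimate. Once it has been asked, the rule falls back to `q1`.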
#187 Incorporating Contextual Paralinguistic Understanding in Large Speech-Language Models #187 在大型语音-语言模型中融入上下文副语言理解
Authors: [Qiongqiong Wang](https://arxiv.org/search/?searchtype=author&query=Qiongqiong Wang), [Hardik B. Sailor](https://arxiv.org/search/?searchtype=author&query=Hardik B. Sailor), [Jeremy H. M. Wong](https://arxiv.org/search/?searchtype=author&query=Jeremy H. M. Wong), [Tianchi Liu](https://arxiv.org/search/?searchtype=author&query=Tianchi Liu), [Shuo Sun](https://arxiv.org/search/?searchtype=author&query=Shuo Sun), [Wenyu Zhang](https://arxiv.org/search/?searchtype=author&query=Wenyu Zhang), [Muhammad Huzaifah](https://arxiv.org/search/?searchtype=author&query=Muhammad Huzaifah), [Nancy Chen](https://arxiv.org/search/?searchtype=author&query=Nancy Chen), [Ai Ti Aw](https://arxiv.org/search/?searchtype=author&query=Ai Ti Aw) 作者:王琼琼、Hardik B. Sailor、Jeremy H. M. Wong、刘天池、孙硕、张文钰、Muhammad Huzaifah、陈南希、Ai Ti Aw
Current large speech language models (Speech-LLMs) often exhibit limitations in empathetic reasoning, primarily due to the absence of training datasets that integrate both contextual content and paralinguistic cues. In this work, we propose two approaches to incorporate contextual paralinguistic information into model training: (1) an explicit method that provides paralinguistic metadata (e.g., emotion annotations) directly to the LLM, and (2) an implicit method that automatically generates novel training question-answer (QA) pairs using both categorical and dimensional emotion annotations alongside speech transcriptions. Our implicit method boosts performance (LLM-judged) by 38.41% on a human-annotated QA benchmark, reaching 46.02% when combined with the explicit approach, showing effectiveness in contextual paralinguistic understanding. We also validate the LLM judge by demonstrating its correlation with classification metrics, providing support for its reliability. 当前大型语音语言模型(Speech-LLMs)在共情推理方面通常存在局限,主要原因是缺乏将上下文内容与副语言线索相结合的训练数据。在本工作中,我们提出了两种将上下文副语言信息纳入模型训练的方法:(1)显式方法,直接向 LLM 提供副语言元数据(例如情感标注);(2)隐式方法,利用类别和维度化的情感标注以及语音转录自动生成新的训练问答(QA)对。我们的隐式方法在一个人工标注的 QA 基准上(由 LLM 评判)将性能提升了 38.41%,与显式方法结合时达到 46.02%,表明在上下文副语言理解方面的有效性。我们还通过展示其与分类度量的相关性来验证 LLM 评判器,支持其可靠性。
Subjects: Computation and Language, Artificial Intelligence, Audio and Speech Processing 主题:计算与语言、人工智能、音频与语音处理
Publish: 2025-08-10 10:03:30 UTC 发布:2025-08-10 10:03:30 UTC
#188 OpenHAIV: A Framework Towards Practical Open-World Learning #188 OpenHAIV:迈向实用开放世界学习的框架
Authors: [Xiang Xiang](https://arxiv.org/search/?searchtype=author&query=Xiang Xiang), [Qinhao Zhou](https://arxiv.org/search/?searchtype=author&query=Qinhao Zhou), [Zhuo Xu](https://arxiv.org/search/?searchtype=author&query=Zhuo Xu), [Jing Ma](https://arxiv.org/search/?searchtype=author&query=Jing Ma), [Jiaxin Dai](https://arxiv.org/search/?searchtype=author&query=Jiaxin Dai), [Yifan Liang](https://arxiv.org/search/?searchtype=author&query=Yifan Liang), [Hanlin Li](https://arxiv.org/search/?searchtype=author&query=Hanlin Li) 作者:Xiang Xiang、Qinhao Zhou、Zhuo Xu、Jing Ma、Jiaxin Dai、Yifan Liang、Hanlin Li
Substantial progress has been made in various techniques for open-world recognition. Out-of-distribution (OOD) detection methods can effectively distinguish between known and unknown classes in the data, while incremental learning enables continuous model knowledge updates. However, in open-world scenarios, these approaches still face limitations. Relying solely on OOD detection does not facilitate knowledge updates in the model, and incremental fine-tuning typically requires supervised conditions, which significantly deviate from open-world settings. To address these challenges, this paper proposes OpenHAIV, a novel framework that integrates OOD detection, new class discovery, and incremental continual fine-tuning into a unified pipeline. This framework allows models to autonomously acquire and update knowledge in open-world environments. The proposed framework is available at https://haiv-lab.github.io/openhaiv . 在开放世界识别方面,各种技术已取得了显著进展。分布外(OOD)检测方法可以有效区分数据中的已知类与未知类,而增量学习则使模型能够持续更新其知识。然而,在开放世界场景中,这些方法仍然存在局限性。仅依赖 OOD 检测并不能促进模型知识的更新,而增量微调通常需要有监督的条件,这与开放世界设定存在显著偏离。为了解决这些挑战,本文提出了 OpenHAIV,这是一个将 OOD 检测、新类别发现和增量持续微调整合为统一流程的新型框架。该框架允许模型在开放世界环境中自主获取并更新知识。所提框架可在 https://haiv-lab.github.io/openhaiv 获取。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning, Image and Video Processing, Machine Learning 主题:计算机视觉与模式识别、人工智能、机器学习、图像与视频处理、机器学习
Publish: 2025-08-10 09:55:19 UTC 发布:2025-08-10 09:55:19 UTC
#189 Causal Negative Sampling via Diffusion Model for Out-of-Distribution Recommendation #189 基于扩散模型的因果负采样用于分布外推荐
Authors: [Chu Zhao](https://arxiv.org/search/?searchtype=author&query=Chu Zhao), [Eneng Yang](https://arxiv.org/search/?searchtype=author&query=Eneng Yang), [Yizhou Dang](https://arxiv.org/search/?searchtype=author&query=Yizhou Dang), [Jianzhe Zhao](https://arxiv.org/search/?searchtype=author&query=Jianzhe Zhao), [Guibing Guo](https://arxiv.org/search/?searchtype=author&query=Guibing Guo), [Xingwei Wang](https://arxiv.org/search/?searchtype=author&query=Xingwei Wang) 作者:赵楚、杨恩能、党一舟、赵建哲、郭桂兵、王兴伟
Heuristic negative sampling enhances recommendation performance by selecting negative samples of varying hardness levels from predefined candidate pools to guide the model toward learning more accurate decision boundaries. However, our empirical and theoretical analyses reveal that unobserved environmental confounders (e.g., exposure or popularity biases) in candidate pools may cause heuristic sampling methods to introduce false hard negatives (FHNS). These misleading samples can encourage the model to learn spurious correlations induced by such confounders, ultimately compromising its generalization ability under distribution shifts. To address this issue, we propose a novel method named Causal Negative Sampling via Diffusion (CNSDiff). By synthesizing negative samples in the latent space via a conditional diffusion process, CNSDiff avoids the bias introduced by predefined candidate pools and thus reduces the likelihood of generating FHNS. Moreover, it incorporates a causal regularization term to explicitly mitigate the influence of environmental confounders during the negative sampling process, leading to robust negatives that promote out-of-distribution (OOD) generalization. Comprehensive experiments under four representative distribution shift scenarios demonstrate that CNSDiff achieves an average improvement of 13.96% across all evaluation metrics compared to state-of-the-art baselines, verifying its effectiveness and robustness in OOD recommendation tasks. 启发式负采样通过从预定义候选池中选择不同难度级别的负样本来提升推荐性能,从而引导模型学习更准确的决策边界。然而,我们的实证和理论分析表明,候选池中存在的未观测环境混杂因素(例如曝光或流行度偏差)可能导致启发式采样方法引入虚假的困难负样本(FHNS)。这些误导性样本会促使模型学习由此类混杂因素引起的虚假相关性,最终在分布转移下损害其泛化能力。为了解决这一问题,我们提出了一种名为通过扩散进行因果负采样(CNSDiff)的新方法。CNSDiff 通过条件扩散过程在潜在空间合成负样本,避免了预定义候选池引入的偏差,从而降低生成 FHNS 的可能性。此外,它引入了一个因果正则项,在负采样过程中明确减轻环境混杂因素的影响,得到的稳健负样本有助于提升分布外(OOD)泛化能力。 在四种具有代表性的分布偏移场景下的全面实验表明,与最先进的基线方法相比,CNSDiff 在所有评估指标上平均提升了 13.96%,验证了其在 OOD(分布外)推荐任务中的有效性和鲁棒性。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-10 08:55:21 UTC 发布时间:2025-08-10 08:55:21 UTC
#190 SocRipple: A Two-Stage Framework for Cold-Start Video Recommendations #190 SocRipple:用于冷启动视频推荐的两阶段框架
Authors: [Amit Jaspal](https://arxiv.org/search/?searchtype=author&query=Amit Jaspal), [Kapil Dalwani](https://arxiv.org/search/?searchtype=author&query=Kapil Dalwani), [Ajantha Ramineni](https://arxiv.org/search/?searchtype=author&query=Ajantha Ramineni) 作者:Amit Jaspal、Kapil Dalwani、Ajantha Ramineni
Most industry-scale recommender systems face critical cold-start challenges: new items lack interaction history, making it difficult to distribute them in a personalized manner. Standard collaborative filtering models underperform due to sparse engagement signals, while content-only approaches lack user-specific relevance. We propose SocRipple, a novel two-stage retrieval framework tailored for cold-start item distribution in social-graph-based platforms. Stage 1 leverages the creator's social connections for targeted initial exposure. Stage 2 builds on early engagement signals and stable user embeddings learned from historical interactions to “ripple” outwards via K-Nearest-Neighbor (KNN) search. Large-scale experiments on a major video platform show that SocRipple boosts cold-start item distribution by 36% while maintaining user engagement rate on cold-start items, effectively balancing new-item exposure with personalized recommendations. 大多数行业级推荐系统面临严重的冷启动挑战:新项目缺乏交互历史,难以以个性化方式分发。标准的协同过滤模型由于参与信号稀疏而表现不佳,而仅基于内容的方法又缺乏用户特定的相关性。我们提出了 SocRipple,一种为基于社交图的平台上的冷启动项目分发量身定制的新颖两阶段检索框架。第一阶段利用创作者的社交关系进行有针对性的初始曝光。第二阶段建立在早期参与信号和从历史交互中学习到的稳定用户嵌入之上,通过 K 最近邻(KNN)搜索向外“涟漪”传播。对一家大型视频平台的大规模实验表明,SocRipple 将冷启动项目分发提升了 36%,同时保持了用户在冷启动项目上的参与率,有效地在新项目曝光与个性化推荐之间取得平衡。
Subjects: Information Retrieval, Artificial Intelligence 主题:信息检索,人工智能
Publish: 2025-08-10 08:37:36 UTC 发布:2025-08-10 08:37:36 UTC
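The two stages described in the SocRipple abstract can be sketched as follows. This is a minimal illustration, not the paper's system: the brute-force cosine KNN, the data layout, and every user name are assumptions made for the example, and a production system would use an approximate nearest-neighbour index instead.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def stage1_seed_audience(creator, social_graph):
    """Stage 1: expose the new item to the creator's direct connections."""
    return set(social_graph.get(creator, ()))

def stage2_ripple(engaged_users, embeddings, already_exposed, k):
    """Stage 2: for each early engager, add its k nearest unexposed users."""
    expanded = set()
    for seed in engaged_users:
        candidates = [u for u in embeddings if u not in already_exposed and u != seed]
        candidates.sort(key=lambda u: cosine(embeddings[seed], embeddings[u]), reverse=True)
        expanded.update(candidates[:k])
    return expanded

# Hypothetical toy data.
social_graph = {"creator": ["alice", "bob"]}
embeddings = {
    "alice": (1.0, 0.0), "bob": (0.0, 1.0),
    "carol": (0.9, 0.1), "dave": (0.1, 0.9), "erin": (-1.0, 0.0),
}
exposed = stage1_seed_audience("creator", social_graph)
engaged = {"alice"}  # assume only alice engaged with the new item
ripple = stage2_ripple(engaged, embeddings, exposed, k=1)
print(ripple)  # {'carol'}
```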
#191 EDGE: A Theoretical Framework for Misconception-Aware Adaptive Learning #191 EDGE:一个面向误解感知自适应学习的理论框架
Author: [Ananda Prakash Verma](https://arxiv.org/search/?searchtype=author&query=Ananda Prakash Verma) 作者:Ananda Prakash Verma
We present EDGE, a general-purpose, misconception-aware adaptive learning framework composed of four stages: Evaluate (ability and state estimation), Diagnose (posterior inference of misconceptions), Generate (counterfactual item synthesis), and Exercise (index-based retrieval scheduling). EDGE unifies psychometrics (IRT/Bayesian state space models), cognitive diagnostics (misconception discovery from distractor patterns and response latencies), contrastive item generation (minimal perturbations that invalidate learner shortcuts while preserving psychometric validity), and principled scheduling (a restless bandit approximation to spaced retrieval). We formalize a composite readiness metric, EdgeScore, prove its monotonicity and Lipschitz continuity, and derive an index policy that is near-optimal under mild assumptions on forgetting and learning gains. We further establish conditions under which counterfactual items provably reduce the posterior probability of a targeted misconception faster than standard practice. The paper focuses on theory and implementable pseudocode; empirical study is left to future work. 我们提出了 EDGE,一种通用的、考虑误解的自适应学习框架,包含四个阶段:Evaluate(能力与状态评估)、Diagnose(误解的后验推断)、Generate(反事实题目合成)和 Exercise(基于指数的检索调度)。EDGE 将心理测量学(IRT/贝叶斯状态空间模型)、认知诊断(从干扰项模式和作答时长中发现误解)、对比式题目生成(通过最小扰动使学习者的捷径失效,同时保留心理测量学有效性)以及有原则的调度(对间隔检索的不安分老虎机(restless bandit)近似)统一起来。我们形式化了一个复合的准备度度量 EdgeScore,证明了其单调性和利普希茨连续性,并在对遗忘和学习增益做出温和假设下推导出一个近似最优的指数策略。我们进一步给出了若干条件,在这些条件下可以证明反事实题目比标准练习更快地降低目标误解的后验概率。本文侧重理论与可实现的伪代码;实证研究留作后续工作。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-10 08:06:00 UTC 发布日期:2025-08-10 08:06:00 协调世界时 (UTC)
#192 Selection and Exploitation of High-Quality Knowledge from Large Language Models for Recommendation #192 从大型语言模型中选择并利用高质量知识以用于推荐
Authors: [Guanchen Wang](https://arxiv.org/search/?searchtype=author&query=Guanchen Wang), [Mingming Ha](https://arxiv.org/search/?searchtype=author&query=Mingming Ha), [Tianbao Ma](https://arxiv.org/search/?searchtype=author&query=Tianbao Ma), [Linxun Chen](https://arxiv.org/search/?searchtype=author&query=Linxun Chen), [Zhaojie Liu](https://arxiv.org/search/?searchtype=author&query=Zhaojie Liu), [Guorui Zhou](https://arxiv.org/search/?searchtype=author&query=Guorui Zhou), [Kun Gai](https://arxiv.org/search/?searchtype=author&query=Kun Gai) 作者:Guanchen Wang、Mingming Ha、Tianbao Ma、Linxun Chen、Zhaojie Liu、Guorui Zhou、Kun Gai
In recent years, there has been growing interest in leveraging the impressive generalization capabilities and reasoning ability of large language models (LLMs) to improve the performance of recommenders. In this way, recommenders can access and learn additional world knowledge and reasoning information via LLMs. However, in general, for different users and items, the world knowledge derived from LLMs suffers from hallucination, content redundancy, and information homogenization. Directly feeding the generated response embeddings into the recommendation model can lead to unavoidable performance deterioration. To address these challenges, we propose a Knowledge Selection & Exploitation Recommendation (KSER) framework, which effectively selects and extracts high-quality knowledge from LLMs. The framework consists of two key components: a knowledge filtering module and an embedding space alignment module. In the knowledge filtering module, an Embedding Selection Filter Network (ESFNet) is designed to assign adaptive weights to different knowledge chunks in different knowledge fields. In the space alignment module, an attention-based architecture is proposed to align the semantic embeddings from LLMs with the feature space used to train the recommendation models. In addition, two training strategies, all-parameters training and extractor-only training, are proposed to flexibly adapt to different downstream tasks and application scenarios, where the extractor-only training strategy offers a novel perspective on knowledge-augmented recommendation. Experimental results validate the necessity and effectiveness of both the knowledge filtering and alignment modules, and further demonstrate the efficiency and effectiveness of the extractor-only training strategy.
近年来,越来越多的研究关注于利用大型语言模型(LLMs)出色的泛化能力和推理能力来提升推荐系统的性能。通过这种方式,推荐系统可以通过 LLMs 获取并学习额外的世界知识和推理信息。然而,通常情况下,对于不同的用户和物品,从 LLMs 导出的世界知识存在幻觉、内容冗余和信息同质化等问题。将生成的响应嵌入直接输入推荐模型可能会导致不可避免的性能下降。为了解决这些挑战,我们提出了一个知识选择与利用推荐(KSER)框架,该框架能够有效地从 LLMs 中筛选并提取高质量的知识。该框架由两个关键组件组成:知识过滤模块和嵌入空间对齐模块。在知识过滤模块中,设计了一个嵌入选择过滤网络(ESFNet),用于在不同知识领域中为不同的知识块分配自适应权重。 在空间对齐模块中,提出了一种基于注意力的架构,用于将来自 LLMs 的语义嵌入与用于训练推荐模型的特征空间进行对齐。此外,提出了两种训练策略——全部参数训练(all-parameters training)和仅提取器训练(extractor-only training)——以灵活适应不同的下游任务和应用场景,其中仅提取器训练策略为知识增强推荐提供了一种新的视角。实验结果验证了知识过滤和对齐模块的必要性与有效性,并进一步展示了仅提取器训练策略的高效性与有效性。
Subjects: Information Retrieval, Artificial Intelligence 主题:信息检索,人工智能
Publish: 2025-08-10 08:03:01 UTC 发布:2025-08-10 08:03:01 UTC
#193 LLM-based Agents for Automated Confounder Discovery and Subgroup Analysis in Causal Inference #193 基于 LLM 的代理用于因果推断中的自动混杂变量发现和亚组分析
Authors: [Po-Han Lee](https://arxiv.org/search/?searchtype=author&query=Po-Han Lee), [Yu-Cheng Lin](https://arxiv.org/search/?searchtype=author&query=Yu-Cheng Lin), [Chan-Tung Ku](https://arxiv.org/search/?searchtype=author&query=Chan-Tung Ku), [Chan Hsu](https://arxiv.org/search/?searchtype=author&query=Chan Hsu), [Pei-Cing Huang](https://arxiv.org/search/?searchtype=author&query=Pei-Cing Huang), [Ping-Hsun Wu](https://arxiv.org/search/?searchtype=author&query=Ping-Hsun Wu), [Yihuang Kang](https://arxiv.org/search/?searchtype=author&query=Yihuang Kang) 作者:李柏翰、林昱丞、顾詹同、许展、黄佩庆、吴秉勋、康宜璜
Estimating individualized treatment effects from observational data presents a persistent challenge due to unmeasured confounding and structural bias. Causal Machine Learning (causal ML) methods, such as causal trees and doubly robust estimators, provide tools for estimating conditional average treatment effects. These methods have limited effectiveness in complex real-world environments due to the presence of latent confounders or those described in unstructured formats. Moreover, reliance on domain experts for confounder identification and rule interpretation introduces high annotation cost and scalability concerns. In this work, we propose Large Language Model-based agents for automated confounder discovery and subgroup analysis that integrate agents into the causal ML pipeline to simulate domain expertise. Our framework systematically performs subgroup identification and confounding structure discovery by leveraging the reasoning capabilities of LLM-based agents, which reduces human dependency while preserving interpretability. Experiments on real-world medical datasets show that our proposed approach enhances treatment effect estimation robustness by narrowing confidence intervals and uncovering unrecognized confounding biases. Our findings suggest that LLM-based agents offer a promising path toward scalable, trustworthy, and semantically aware causal inference. 从观测数据中估计个体化处理效应由于未观测混杂和结构性偏差而一直是一个持续的挑战。因果机器学习(causal ML)方法,例如因果树和双稳健估计器,为估计条件平均处理效应提供了工具。由于存在潜在混杂变量或以非结构化格式描述的混杂变量,这些方法在复杂的现实环境中的有效性有限。此外,依赖领域专家来识别混杂变量和解释规则会带来高昂的标注成本和可扩展性问题。在这项工作中,我们提出了基于大型语言模型(LLM)的代理,用于自动化混杂变量发现和子群分析,将代理集成到因果 ML 管道中以模拟领域专业知识。我们的框架通过利用基于 LLM 的代理的推理能力系统性地执行子群识别和混杂结构发现,从而减少对人工的依赖同时保持可解释性。 在真实世界的医疗数据集上的实验证明,我们提出的方法通过缩小置信区间并发现未被识别的混杂偏差,提升了治疗效果估计的鲁棒性。我们的研究表明,基于 LLM 的智能体为实现可扩展、值得信赖且具有语义感知能力的因果推断提供了有前景的途径。
Subjects: Machine Learning, Artificial Intelligence, Multiagent Systems, Applications, Methodology 主题:机器学习、人工智能、多智能体系统、应用、方法学
Publish: 2025-08-10 07:45:49 UTC 发布时间:2025-08-10 07:45:49 协调世界时 (UTC)
#194 Neural Bridge Processes #194 神经桥过程
Authors: [Jian Xu](https://arxiv.org/search/?searchtype=author&query=Jian Xu), [Yican Liu](https://arxiv.org/search/?searchtype=author&query=Yican Liu), [Qibin Zhao](https://arxiv.org/search/?searchtype=author&query=Qibin Zhao), [John Paisley](https://arxiv.org/search/?searchtype=author&query=John Paisley), [Delu Zeng](https://arxiv.org/search/?searchtype=author&query=Delu Zeng) 作者:徐健, 刘亦灿, 赵启斌, John Paisley, 曾德禄
Learning stochastic functions from partially observed context-target pairs is a fundamental problem in probabilistic modeling. Traditional models like Gaussian Processes (GPs) face scalability issues with large datasets and assume Gaussianity, limiting their applicability. While Neural Processes (NPs) offer more flexibility, they struggle with capturing complex, multi-modal target distributions. Neural Diffusion Processes (NDPs) enhance expressivity through a learned diffusion process but rely solely on conditional signals in the denoising network, resulting in weak input coupling from an unconditional forward process and semantic mismatch at the diffusion endpoint. In this work, we propose Neural Bridge Processes (NBPs), a novel method for modeling stochastic functions where inputs x act as dynamic anchors for the entire diffusion trajectory. By reformulating the forward kernel to explicitly depend on x, NBP enforces a constrained path that strictly terminates at the supervised target. This approach not only provides stronger gradient signals but also guarantees endpoint coherence. We validate NBPs on synthetic data, EEG signal regression and image regression tasks, achieving substantial improvements over baselines. These results underscore the effectiveness of DDPM-style bridge sampling in enhancing both performance and theoretical consistency for structured prediction tasks. 从部分观测的上下文-目标对中学习随机函数是概率建模的一个基本问题。传统模型如高斯过程(GP)在大规模数据集上存在可扩展性问题,并且假设高斯性,限制了其适用性。虽然神经过程(NP)提供了更大的灵活性,但它们在捕捉复杂、多模态的目标分布方面表现不足。神经扩散过程(NDP)通过学习的扩散过程增强了表现力,但在去噪网络中仅依赖条件信号,导致来自无条件前向过程的输入耦合较弱,并在扩散终点产生语义不匹配。在本工作中,我们提出了神经桥过程(NBP),这是一种用于建模随机函数的新方法,其中输入 x 作为整个扩散轨迹的动态锚点。通过将前向核重新表述为显式依赖于 x,NBP 强制约束路径严格在有监督目标处终止。这种方法不仅提供了更强的梯度信号,而且保证了终点的一致性。 我们在合成数据、脑电信号回归和图像回归任务上验证了 NBP,取得了对基线方法的显著提升。这些结果强调了基于 DDPM 的桥采样在提升结构化预测任务的性能和理论一致性方面的有效性。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-10 07:44:52 UTC 发布:2025-08-10 07:44:52 UTC
#195 What One Cannot, Two Can: Two-Layer Transformers Provably Represent Induction Heads on Any-Order Markov Chains #195 一人不能,两人可行:两层变换器可证明在任意阶马尔可夫链上表示归纳头
Authors: [Chanakya Ekbote](https://arxiv.org/search/?searchtype=author&query=Chanakya Ekbote), [Marco Bondaschi](https://arxiv.org/search/?searchtype=author&query=Marco Bondaschi), [Nived Rajaraman](https://arxiv.org/search/?searchtype=author&query=Nived Rajaraman), [Jason D. Lee](https://arxiv.org/search/?searchtype=author&query=Jason D. Lee), [Michael Gastpar](https://arxiv.org/search/?searchtype=author&query=Michael Gastpar), [Ashok Vardhan Makkuva](https://arxiv.org/search/?searchtype=author&query=Ashok Vardhan Makkuva), [Paul Pu Liang](https://arxiv.org/search/?searchtype=author&query=Paul Pu Liang) 作者:Chanakya Ekbote、Marco Bondaschi、Nived Rajaraman、Jason D. Lee、Michael Gastpar、Ashok Vardhan Makkuva、Paul Pu Liang
In-context learning (ICL) is a hallmark capability of transformers, through which trained models learn to adapt to new tasks by leveraging information from the input context. Prior work has shown that ICL emerges in transformers due to the presence of special circuits called induction heads. Given the equivalence between induction heads and conditional k-grams, a recent line of work modeling sequential inputs as Markov processes has revealed the fundamental impact of model depth on its ICL capabilities: while a two-layer transformer can efficiently represent a conditional 1-gram model, its single-layer counterpart cannot solve the task unless it is exponentially large. However, for higher order Markov sources, the best known constructions require at least three layers (each with a single attention head) - leaving open the question: can a two-layer single-head transformer represent any kth-order Markov process? In this paper, we precisely address this and theoretically show that a two-layer transformer with one head per layer can indeed represent any conditional k-gram. Thus, our result provides the tightest known characterization of the interplay between transformer depth and Markov order for ICL. Building on this, we further analyze the learning dynamics of our two-layer construction, focusing on a simplified variant for first-order Markov chains, illustrating how effective in-context representations emerge during training. Together, these results deepen our current understanding of transformer-based ICL and illustrate how even shallow architectures can surprisingly exhibit strong ICL capabilities on structured sequence modeling tasks. 
上下文学习(ICL)是变换器的一个标志性能力,训练好的模型能够通过利用输入上下文中的信息来适应新任务。先前的工作表明,ICL 在变换器中出现是由于存在称为归纳头(induction heads)的特殊电路。鉴于归纳头与条件 k-gram 的等价性,最近一系列将序列输入建模为马尔可夫过程的工作揭示了模型深度对其 ICL 能力的根本影响:虽然两层变换器可以高效地表示条件 1-gram 模型,但其单层对应物除非规模达到指数级,否则无法解决该任务。然而,对于更高阶的马尔可夫源,已知的最佳构造至少需要三层(每层只有一个注意力头),这就留下了一个问题:两层单头变换器能否表示任意第 k 阶马尔可夫过程?在本文中,我们精准地回答了这个问题,并从理论上证明了每层一个头的两层变换器确实可以表示任意条件 k-gram。 因此,我们的结果提供了目前已知的关于变换器深度与上下文学习(ICL)中马尔可夫阶数相互作用的最紧确表征。在此基础上,我们进一步分析了我们两层构造的学习动态,聚焦于一阶马尔可夫链的简化变体,说明了有效的上下文表示如何在训练过程中逐步生成。综上,这些结果深化了我们对基于变换器的上下文学习的当前理解,并展示了即便是浅层架构在结构化序列建模任务上也能出人意料地表现出强大的上下文学习能力。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-10 07:03:01 UTC 发布:2025-08-10 07:03:01 UTC
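The conditional k-gram that the abstract above equates with an induction head can be written down directly as a counting estimator: look up earlier occurrences of the last k context tokens and tally what followed them. The sketch below is that estimator in plain Python. It illustrates the target function only and says nothing about the paper's two-layer transformer construction. The function name and example sequence are made up.

```python
from collections import Counter

def conditional_kgram(context, k):
    """Empirical next-token distribution given the last k tokens of `context`."""
    suffix = tuple(context[-k:])
    counts = Counter(
        context[i + k]
        for i in range(len(context) - k)  # skips the final occurrence, which has no successor
        if tuple(context[i:i + k]) == suffix
    )
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()} if total else {}

tokens = list("abacaba")
first_order = conditional_kgram(tokens, k=1)   # conditions on the last token "a"
second_order = conditional_kgram(tokens, k=2)  # conditions on the last two tokens "b", "a"
```

For this sequence the first-order estimate spreads mass over `b` (2/3) and `c` (1/3), while the second-order estimate puts all mass on `c`. This shows how the Markov order changes the in-context prediction.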
#196 Presburger Functional Synthesis: Complexity and Tractable Normal Forms #196 Presburger 函数综合:复杂性与可处理的范式
Authors: [S. Akshay](https://arxiv.org/search/?searchtype=author&query=S. Akshay), [A. R. Balasubramanian](https://arxiv.org/search/?searchtype=author&query=A. R. Balasubramanian), [Supratik Chakraborty](https://arxiv.org/search/?searchtype=author&query=Supratik Chakraborty), [Georg Zetzsche](https://arxiv.org/search/?searchtype=author&query=Georg Zetzsche) 作者:S. Akshay、A. R. Balasubramanian、Supratik Chakraborty、Georg Zetzsche
Given a relational specification between inputs and outputs as a logic formula, the problem of functional synthesis is to automatically synthesize a function from inputs to outputs satisfying the relation. Recently, a rich line of work has emerged tackling this problem for specifications in different theories, from Boolean to general first-order logic. In this paper, we launch an investigation of this problem for the theory of Presburger Arithmetic, that we call Presburger Functional Synthesis (PFnS). We show that PFnS can be solved in EXPTIME and provide a matching exponential lower bound. This is unlike the case for Boolean functional synthesis (BFnS), where only conditional exponential lower bounds are known. Further, we show that PFnS for one input and one output variable is as hard as BFnS in general. We then identify a special normal form, called PSyNF, for the specification formula that guarantees poly-time and poly-size solvability of PFnS. We prove several properties of PSyNF, including how to check and compile to this form, and conditions under which any other form that guarantees poly-time solvability of PFnS can be compiled in poly-time to PSyNF. Finally, we identify a syntactic normal form that is easier to check but is exponentially less succinct than PSyNF. 给定一个将输入与输出之间的关系以逻辑公式表述的关系规范,函数综合问题就是自动合成一个从输入到输出的函数,使之满足该关系。最近,针对不同理论下的规范——从布尔到一般一阶逻辑——出现了一系列研究这一问题的工作。本文展开了对普雷斯堡算术(Presburger Arithmetic)理论下该问题的研究,我们称之为普雷斯堡函数综合(PFnS)。我们证明 PFnS 可在 EXPTIME 内求解,并给出匹配的指数下界。这与布尔函数综合(BFnS)的情形不同,后者目前仅知有条件的指数下界。此外,我们证明对于一个输入和一个输出变量的 PFnS 问题,其难度与一般的 BFnS 相当。随后,我们确定了一种特殊的规范范式,称为 PSyNF,该规范形式能保证 PFnS 在多项式时间和多项式大小内可解。我们证明了 PSyNF 的若干性质,包括如何检查并编译为该形式,以及在何种条件下任何其它能够保证 PFnS 多项式时间可解的形式都可以在多项式时间内编译为 PSyNF。 最后,我们确定了一种更易于检查的句法规范形式,但它在简洁性方面比 PSyNF 指数级地逊色。
Subjects: Logic in Computer Science, Artificial Intelligence 主题:计算机科学中的逻辑、人工智能
Publish: 2025-08-10 07:00:34 UTC 发布时间:2025-08-10 07:00:34 UTC
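A small instance may make the functional synthesis problem above concrete. Take the relational specification "x = 2y or x = 2y + 1" over the integers; one function that satisfies it is y = floor(x / 2). The example is ours, not the paper's, and the check below is a brute-force test over a finite range rather than a proof.

```python
def spec(x, y):
    """Relational specification between input x and output y."""
    return x == 2 * y or x == 2 * y + 1

def synthesized(x):
    """Candidate function from input to output satisfying the specification."""
    return x // 2  # Python's // rounds towards negative infinity

# Check the candidate on a finite window of integer inputs.
violations = [x for x in range(-100, 101) if not spec(x, synthesized(x))]
print(violations)  # []
```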
#197 Propagation Tree Is Not Deep: Adaptive Graph Contrastive Learning Approach for Rumor Detection #197 传播树并不深:用于谣言检测的自适应图对比学习方法
Authors: [Chaoqun Cui](https://arxiv.org/search/?searchtype=author&query=Chaoqun Cui), [Caiyan Jia](https://arxiv.org/search/?searchtype=author&query=Caiyan Jia) 作者:崔朝群,贾蔡燕
Rumor detection on social media has become increasingly important. Most existing graph-based models presume rumor propagation trees (RPTs) have deep structures and learn sequential stance features along branches. However, through statistical analysis on real-world datasets, we find RPTs exhibit wide structures, with most nodes being shallow 1-level replies. To focus learning on intensive substructures, we propose Rumor Adaptive Graph Contrastive Learning (RAGCL) method with adaptive view augmentation guided by node centralities. We summarize three principles for RPT augmentation: 1) exempt root nodes, 2) retain deep reply nodes, 3) preserve lower-level nodes in deep sections. We employ node dropping, attribute masking and edge dropping with probabilities from centrality-based importance scores to generate views. A graph contrastive objective then learns robust rumor representations. Extensive experiments on four benchmark datasets demonstrate RAGCL outperforms state-of-the-art methods. Our work reveals the wide-structure nature of RPTs and contributes an effective graph contrastive learning approach tailored for rumor detection through principled adaptive augmentation. The proposed principles and augmentation techniques can potentially benefit other applications involving tree-structured graphs. 社交媒体上的谣言检测变得越来越重要。大多数现有的基于图的模型假定谣言传播树(RPT)具有深层结构,并沿分支学习顺序立场特征。然而,通过对真实世界数据集的统计分析,我们发现 RPT 表现出宽结构,大多数节点是浅层的一层回复。为将学习集中在密集的子结构上,我们提出了基于节点中心性引导的自适应视图增强的谣言自适应图对比学习(RAGCL)方法。我们总结了 RPT 增强的三条原则:1)豁免根节点,2)保留深层回复节点,3)在深层区域保留较低级别的节点。我们使用节点丢弃、属性遮掩和边丢弃,按基于中心性的重要性得分给出的概率生成视图。然后通过图对比目标学习鲁棒的谣言表示。在四个基准数据集上的大量实验表明,RAGCL 优于最先进的方法。我们的工作揭示了 RPT 的宽结构特性,并通过有原则的自适应增强为谣言检测贡献了一种有效的图对比学习方法。 所提出的原则和增强技术有可能惠及其他涉及树状结构图的应用。
Subjects: Social and Information Networks, Artificial Intelligence, Computation and Language 主题:社会与信息网络、人工智能、计算与语言
Publish: 2025-08-10 06:53:30 UTC 发布:2025-08-10 06:53:30 UTC
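For #197, centrality-guided node dropping with an exempt root can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: degree centrality stands in for the importance score, `base_rate` is a made-up hyperparameter, dropping a node also removes its subtree, and principles 2 and 3 from the abstract are not modeled.

```python
import random
from collections import defaultdict
from typing import Dict, Optional


def drop_probabilities(parent: Dict[int, Optional[int]],
                       base_rate: float = 0.3) -> Dict[int, float]:
    """Per-node drop probability: lower for higher-degree nodes, 0 for the root."""
    degree = defaultdict(int)
    for node, par in parent.items():
        if par is not None:
            degree[node] += 1
            degree[par] += 1
    max_deg = max(degree.values(), default=1)
    probs = {}
    for node, par in parent.items():
        if par is None:
            probs[node] = 0.0  # principle 1: the root is never dropped
        else:
            importance = degree[node] / max_deg  # assumed importance score in (0, 1]
            probs[node] = base_rate * (1.0 - importance)
    return probs


def augment(parent: Dict[int, Optional[int]], rng: random.Random,
            base_rate: float = 0.3) -> Dict[int, Optional[int]]:
    """Return one augmented view; a dropped node takes its whole subtree with it."""
    probs = drop_probabilities(parent, base_rate)
    dropped = {n for n, p in probs.items() if rng.random() < p}
    view = {}
    for node in parent:
        cur, keep = node, True
        while cur is not None:  # walk up to the root
            if cur in dropped:
                keep = False
                break
            cur = parent[cur]
        if keep:
            view[node] = parent[node]
    return view
```

Calling `augment` twice with different random states yields two views of the same propagation tree, which is the usual input to a graph contrastive objective.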
#198 Can Smaller Large Language Models Evaluate Research Quality? #198 更小的大型语言模型能评估研究质量吗?
Author: [Mike Thelwall](https://arxiv.org/search/?searchtype=author&query=Mike Thelwall) 作者:Mike Thelwall
Although both Google Gemini (1.5 Flash) and ChatGPT (4o and 4o-mini) give research quality evaluation scores that correlate positively with expert scores in nearly all fields, and more strongly than citations in most, it is not known whether this is true for smaller Large Language Models (LLMs). In response, this article assesses Google’s Gemma-3-27b-it, a downloadable LLM (60Gb). The results for 104,187 articles show that Gemma-3-27b-it scores correlate positively with an expert research quality score proxy for all 34 Units of Assessment (broad fields) from the UK Research Excellence Framework 2021. The Gemma-3-27b-it correlations have 83.8% of the strength of ChatGPT 4o and 94.7% of the strength of ChatGPT 4o-mini correlations. Differently from the two larger LLMs, the Gemma-3-27b-it correlations do not increase substantially when the scores are averaged across five repetitions, its scores tend to be lower, and its reports are relatively uniform in style. Overall, the results show that research quality score estimation can be conducted by offline LLMs, so this capability is not an emergent property of the largest LLMs. Moreover, score improvement through repetition is not a universal feature of LLMs. In conclusion, although the largest LLMs still have the highest research evaluation score estimation capability, smaller ones can also be used for this task, and this can be helpful for cost saving or when secure offline processing is needed.
尽管谷歌 Gemini(1.5 Flash)和 ChatGPT(4o 与 4o-mini)在几乎所有领域给出的研究质量评估分数都与专家评分呈正相关,并且在大多数领域其相关性还强于引用数,但尚不清楚较小的大型语言模型(LLMs)是否也具备此特性。为此,本文评估了谷歌的 Gemma-3-27b-it,一款可下载的 LLM(60Gb)。对 104,187 篇文章的结果表明,Gemma-3-27b-it 的得分与英国 2021 年研究卓越框架(UK Research Excellence Framework)中所有 34 个评估单元(广义领域)的专家研究质量得分代理量呈正相关。Gemma-3-27b-it 的相关性强度分别为 ChatGPT 4o 的 83.8% 和 ChatGPT 4o-mini 的 94.7%。与这两款更大的 LLM 不同,Gemma-3-27b-it 的相关性在将分数取五次重复平均时并未显著增加,其得分倾向于较低,而且其报告风格相对统一。总体来看,结果表明离线 LLM 也能进行研究质量评分估计,因此此能力并非仅为最大规模 LLM 的涌现特性。 此外,通过重复获得分数提升并非所有 LLMs 的普遍特征。总之,尽管体量最大的 LLMs 仍然具有最高的研究评价分数估计能力,但较小的模型也可以用于此任务,这在节约成本或需要安全的离线处理时会有所帮助。
Subjects: Digital Libraries, Artificial Intelligence 主题:数字图书馆,人工智能
Publish: 2025-08-10 06:18:40 UTC 发布:2025-08-10 06:18:40 UTC
#199 Adapting LLMs to Time Series Forecasting via Temporal Heterogeneity Modeling and Semantic Alignment #199 通过时间异质性建模与语义对齐将 LLMs 适配于时间序列预测
Authors: [Yanru Sun](https://arxiv.org/search/?searchtype=author&query=Yanru Sun), [Emadeldeen Eldele](https://arxiv.org/search/?searchtype=author&query=Emadeldeen Eldele), [Zongxia Xie](https://arxiv.org/search/?searchtype=author&query=Zongxia Xie), [Yucheng Wang](https://arxiv.org/search/?searchtype=author&query=Yucheng Wang), [Wenzhe Niu](https://arxiv.org/search/?searchtype=author&query=Wenzhe Niu), [Qinghua Hu](https://arxiv.org/search/?searchtype=author&query=Qinghua Hu), [Chee Keong Kwoh](https://arxiv.org/search/?searchtype=author&query=Chee Keong Kwoh), [Min Wu](https://arxiv.org/search/?searchtype=author&query=Min Wu) 作者:孙艳如,Emadeldeen Eldele,谢宗霞,王雨成,牛文哲,胡清华,郭志强,吴敏
Large Language Models (LLMs) have recently demonstrated impressive capabilities in natural language processing due to their strong generalization and sequence modeling capabilities. However, their direct application to time series forecasting remains challenging due to two fundamental issues: the inherent heterogeneity of temporal patterns and the modality gap between continuous numerical signals and discrete language representations. In this work, we propose TALON, a unified framework that enhances LLM-based forecasting by modeling temporal heterogeneity and enforcing semantic alignment. Specifically, we design a Heterogeneous Temporal Encoder that partitions multivariate time series into structurally coherent segments, enabling localized expert modeling across diverse temporal patterns. To bridge the modality gap, we introduce a Semantic Alignment Module that aligns temporal features with LLM-compatible representations, enabling effective integration of time series into language-based models while eliminating the need for handcrafted prompts during inference. Extensive experiments on seven real-world benchmarks demonstrate that TALON achieves superior performance across all datasets, with average MSE improvements of up to 11% over recent state-of-the-art methods. These results underscore the effectiveness of incorporating both pattern-aware and semantic-aware designs when adapting LLMs for time series forecasting. The code is available at: https://github.com/syrGitHub/TALON. 大型语言模型(LLMs)近年来在自然语言处理方面展示了令人印象深刻的能力,这归功于它们强大的泛化和序列建模能力。然而,将它们直接应用于时间序列预测仍然具有挑战性,原因在于两个基本问题:时间模式的固有异质性,以及连续数值信号与离散语言表示之间的模态差距。在本工作中,我们提出了 TALON,一个通过建模时间异质性并强制语义对齐来增强基于 LLM 的预测的统一框架。具体而言,我们设计了一个异构时间编码器,将多变量时间序列划分为结构上一致的片段,从而使得针对多样时间模式的局部专家建模成为可能。为弥合模态差距,我们引入了一个语义对齐模块,将时间特征与兼容 LLM 的表示对齐,使时间序列能够有效整合到基于语言的模型中,同时在推理阶段无需手工设计提示。 在七个真实世界基准上的大量实验表明,TALON 在所有数据集上都实现了更优的性能,相较于近期的最先进方法平均均方误差(MSE)提升了多达 11%。这些结果强调了在将 LLMs 应用于时间序列预测时同时引入模式感知和语义感知设计的有效性。代码可在以下位置获取: https://github.com/syrGitHub/TALON。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-10 06:06:19 UTC 发布:2025-08-10 06:06:19 UTC
#200 DySK-Attn: A Framework for Efficient, Real-Time Knowledge Updating in Large Language Models via Dynamic Sparse Knowledge Attention #200 DySK-Attn:一种通过动态稀疏知识注意力实现大规模语言模型高效、实时知识更新的框架
Authors: [Kabir Khan](https://arxiv.org/search/?searchtype=author&query=Kabir Khan), [Priya Sharma](https://arxiv.org/search/?searchtype=author&query=Priya Sharma), [Arjun Mehta](https://arxiv.org/search/?searchtype=author&query=Arjun Mehta), [Neha Gupta](https://arxiv.org/search/?searchtype=author&query=Neha Gupta), [Ravi Narayanan](https://arxiv.org/search/?searchtype=author&query=Ravi Narayanan) 作者:Kabir Khan、Priya Sharma、Arjun Mehta、Neha Gupta、Ravi Narayanan
Large Language Models (LLMs) suffer from a critical limitation: their knowledge is static and quickly becomes outdated. Retraining these massive models is computationally prohibitive, while existing knowledge editing techniques can be slow and may introduce unforeseen side effects. To address this, we propose DySK-Attn, a novel framework that enables LLMs to efficiently integrate real-time knowledge from a dynamic external source. Our approach synergizes an LLM with a dynamic Knowledge Graph (KG) that can be updated instantaneously. The core of our framework is a sparse knowledge attention mechanism, which allows the LLM to perform a coarse-to-fine grained search, efficiently identifying and focusing on a small, highly relevant subset of facts from the vast KG. This mechanism avoids the high computational cost of dense attention over the entire knowledge base and mitigates noise from irrelevant information. We demonstrate through extensive experiments on time-sensitive question-answering tasks that DySK-Attn significantly outperforms strong baselines, including standard Retrieval-Augmented Generation (RAG) and model editing techniques, in both factual accuracy for updated knowledge and computational efficiency. Our framework offers a scalable and effective solution for building LLMs that can stay current with the ever-changing world. 大型语言模型(LLMs)存在一个关键局限:它们的知识是静态的并且很快就会过时。对这些庞大模型进行重新训练在计算上不可行,而现有的知识编辑技术可能速度缓慢并引入不可预见的副作用。为了解决这一问题,我们提出了 DySK-Attn,一种使 LLMs 能够高效整合来自动态外部源的实时知识的新框架。我们的方法将 LLM 与一个可以即时更新的动态知识图谱(KG)相结合。该框架的核心是一种稀疏知识注意力机制,允许 LLM 执行由粗到细的搜索,能高效识别并聚焦于来自庞大 KG 的一小部分高度相关事实。该机制避免了对整个知识库进行密集注意力的高计算成本,并减轻了来自无关信息的噪声。 通过在时效性问答任务上进行的大量实验,我们证明了 DySK-Attn 在更新知识的事实准确性和计算效率方面,显著优于包括标准检索增强生成(RAG)和模型编辑技术在内的强基线。我们的框架为构建能够与瞬息万变的世界保持同步的 LLMs 提供了可扩展且有效的解决方案。
Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题:计算与语言,人工智能,机器学习
Publish: 2025-08-10 05:22:38 UTC 发布:2025-08-10 05:22:38 协调世界时
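The coarse-to-fine idea in #200 can be illustrated with a two-stage top-k selection followed by attention over the surviving facts only. The abstract does not specify the mechanism, so everything here is an assumption for illustration: the cluster/centroid layout, dot-product scoring, and the names `sparse_knowledge_attention`, `top_clusters`, and `top_facts`. At least one fact is assumed to survive selection.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a non-empty list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_knowledge_attention(query, clusters, top_clusters=1, top_facts=2):
    """clusters: list of (centroid, [(key, value), ...]) pairs.

    Coarse step: keep the clusters whose centroid best matches the query.
    Fine step: keep the best-matching facts inside those clusters.
    Attention is then computed over the selected facts only.
    """
    ranked = sorted(clusters, key=lambda c: dot(query, c[0]), reverse=True)
    candidates = [fact for _, facts in ranked[:top_clusters] for fact in facts]
    candidates.sort(key=lambda kv: dot(query, kv[0]), reverse=True)
    selected = candidates[:top_facts]
    weights = softmax([dot(query, key) for key, _ in selected])
    dim = len(selected[0][1])
    output = [sum(w * value[i] for w, (_, value) in zip(weights, selected))
              for i in range(dim)]
    return output, weights
```

The cost of the final softmax depends on `top_facts` rather than on the size of the knowledge base, which is the point of attending sparsely.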
#201 Explainability-in-Action: Enabling Expressive Manipulation and Tacit Understanding by Bending Diffusion Models in ComfyUI #201 可解释性实践:在 ComfyUI 中弯曲扩散模型以实现表达性操控与隐性理解
Authors: [Ahmed M. Abuzuraiq](https://arxiv.org/search/?searchtype=author&query=Ahmed M. Abuzuraiq), [Philippe Pasquier](https://arxiv.org/search/?searchtype=author&query=Philippe Pasquier) 作者:Ahmed M. Abuzuraiq, Philippe Pasquier
Explainable AI (XAI) in creative contexts can go beyond transparency to support artistic engagement, modifiability, and sustained practice. While curated datasets and training human-scale models can offer artists greater agency and control, large-scale generative models like text-to-image diffusion systems often obscure these possibilities. We suggest that even large models can be treated as creative materials if their internal structure is exposed and manipulable. We propose a craft-based approach to explainability rooted in long-term, hands-on engagement akin to Schön’s “reflection-in-action” and demonstrate its application through a model-bending and inspection plugin integrated into the node-based interface of ComfyUI. We demonstrate that by interactively manipulating different parts of a generative model, artists can develop an intuition about how each component influences the output. 在创作语境中,可解释的人工智能(XAI)可以超越透明性,支持艺术参与性、可修改性和持续实践。尽管经过策划的数据集和以人为尺度训练的模型可以为艺术家提供更大的能动性和控制力,但像文本到图像扩散系统这样的超大规模生成模型常常掩盖了这些可能性。我们提出,即便是大型模型,只要其内部结构被揭示并可操作,也可视为创作材料。我们提出一种基于工艺的可解释性方法,植根于类似 Schön 所说的“在行动中的反思”的长期动手参与,并通过一个集成到 ComfyUI 节点式界面的模型弯曲与检查插件来演示其应用。我们展示了通过交互式操作生成模型的不同部分,艺术家可以发展出对每个组件如何影响输出的直觉。
Subjects: Human-Computer Interaction, Artificial Intelligence, Machine Learning, Multimedia 学科:人机交互、人工智能、机器学习、多媒体
Publish: 2025-08-10 05:19:30 UTC 发布:2025-08-10 05:19:30 UTC
#202 Dynamic Benchmark Construction for Evaluating Large Language Models on Real-World Codes #202 用于评估大语言模型在真实代码上的动态基准构建
Authors: [Zhe Zhang](https://arxiv.org/search/?searchtype=author&query=Zhe Zhang), [Runlin Liu](https://arxiv.org/search/?searchtype=author&query=Runlin Liu), [Aishan Liu](https://arxiv.org/search/?searchtype=author&query=Aishan Liu), [Xingyu Liu](https://arxiv.org/search/?searchtype=author&query=Xingyu Liu), [Xiang Gao](https://arxiv.org/search/?searchtype=author&query=Xiang Gao), [Hailong Sun](https://arxiv.org/search/?searchtype=author&query=Hailong Sun) 作者:张哲、刘润林、刘艾山、刘星宇、高翔、孙海龙
As large language models (LLMs) become increasingly integrated into software development workflows, rigorously evaluating their performance on complex, real-world code generation tasks has become essential. However, existing benchmarks often suffer from data contamination and limited test rigor, constraining their ability to reveal model failures effectively. To address these, we present CODE2BENCH, an end-to-end pipeline for dynamically constructing robust and contamination-resistant benchmarks from real-world GitHub repositories. Specifically, CODE2BENCH introduces three key innovations: (1) Automated Dynamism, achieved through periodic ingestion of recent code to minimize training data contamination; (2) Scope Graph-based dependency analysis, which enables structured classification of functions into benchmark instances with controlled dependency levels (distinguishing between Self-Contained (SC) tasks for cross-language evaluation and Weakly Self-Contained (WSC) tasks involving permitted library usage); and (3) Property-Based Testing (PBT) for the automated synthesis of rigorous test suites to enable thorough functional verification. Using this pipeline, we construct CODE2BENCH-2505, the first benchmark derived from 880 recent Python projects spanning diverse domains, comprising 1,163 code generation tasks with 100% average branch coverage on ground-truth implementations. Extensive evaluation of 16 LLMs using CODE2BENCH-2505 reveals that models consistently struggle with SC tasks requiring complex, non-standard logic and cross-language transfer, while showing relatively stronger performance on WSC tasks in Python. Our work introduces a contamination-resistant, language-agnostic methodology for dynamic benchmark construction, offering a principled foundation for the comprehensive and realistic evaluation of LLMs on real-world software development tasks.
随着大型语言模型(LLMs)日益融入软件开发工作流,严格评估它们在复杂的真实代码生成任务中的表现变得至关重要。然而,现有基准测试往往存在数据污染和测试强度有限的问题,限制了它们有效揭示模型失误的能力。为此,我们提出了 CODE2BENCH,一个用于从真实 GitHub 仓库动态构建稳健且抗污染基准的端到端管道。具体而言,CODE2BENCH 引入了三项关键创新:(1)自动化动态更新,通过定期引入最新代码以尽量减少训练数据污染;(2)基于作用域图的依赖分析,能够将函数结构化地分类为具有可控依赖级别的基准实例(区分用于跨语言评估的独立任务(Self-Contained,SC)和涉及被允许库使用的弱独立任务(Weakly Self-Contained,WSC));以及(3)基于属性的测试(Property-Based Testing,PBT),用于自动合成严格的测试套件以实现全面的功能验证。 使用此流程,我们构建了 CODE2BENCH-2505,这是首个从 880 个涵盖多样领域的最新 Python 项目中派生的基准,包含 1,163 个代码生成任务,且在真实实现上平均分支覆盖率达 100%。对 16 个 LLMs 使用 CODE2BENCH-2505 进行的大规模评估显示,模型在需要复杂、非标准逻辑和跨语言迁移的 SC 任务上持续表现不佳,而在 Python 的 WSC 任务上表现相对较强。我们的工作引入了一种抗污染、语言无关的动态基准构建方法,为在真实软件开发任务上对 LLMs 进行全面且现实的评估提供了原则性基础。
Subjects: Software Engineering, Artificial Intelligence 主题:软件工程,人工智能
Publish: 2025-08-10 05:06:36 UTC 发布日期:2025-08-10 05:06:36 UTC
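To make the property-based testing idea in #202 concrete, here is a hand-rolled sketch using only the standard library. It is not CODE2BENCH's test generator; the function under test (`dedupe_keep_order`) and the three properties are invented for illustration. Instead of fixed input/output pairs, random inputs are generated and invariants of the output are checked.

```python
import random


def dedupe_keep_order(items):
    """Reference implementation: drop repeated items, keep first occurrences."""
    seen, out = set(), []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out


def violations(func, data):
    """Return the names of the properties that func violates on one input."""
    result = func(list(data))
    failed = []
    if len(result) != len(set(result)):
        failed.append("no_duplicates")
    if set(result) != set(data):
        failed.append("same_elements")
    if func(list(result)) != result:
        failed.append("idempotent")
    return failed


def check_properties(func, trials=200, seed=0):
    """Run the property checks on randomly generated integer lists."""
    rng = random.Random(seed)
    for _ in range(trials):
        data = [rng.randint(-5, 5) for _ in range(rng.randint(0, 20))]
        if violations(func, data):
            return False
    return True
```

A dedicated library such as Hypothesis adds input shrinking and richer generators; the loop above only shows the core idea.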
#203 Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks #203 模式血缘提取规模化:多语言流水线、复合评估与语言模型基准
Authors: [Jiaqi Yin](https://arxiv.org/search/?searchtype=author&query=Jiaqi Yin), [Yi-Wei Chen](https://arxiv.org/search/?searchtype=author&query=Yi-Wei Chen), [Meng-Lung Lee](https://arxiv.org/search/?searchtype=author&query=Meng-Lung Lee), [Xiya Liu](https://arxiv.org/search/?searchtype=author&query=Xiya Liu) 作者:殷嘉琦、陈奕维、李孟龙、刘熙雅
Enterprise data pipelines, characterized by complex transformations across multiple programming languages, often cause a semantic disconnect between original metadata and downstream data. This “semantic drift” compromises data reproducibility and governance, and impairs the utility of services like retrieval-augmented generation (RAG) and text-to-SQL systems. To address this, a novel framework is proposed for the automated extraction of fine-grained schema lineage from multilingual enterprise pipeline scripts. This method identifies four key components: source schemas, source tables, transformation logic, and aggregation operations, creating a standardized representation of data transformations. For the rigorous evaluation of lineage quality, this paper introduces the Schema Lineage Composite Evaluation (SLiCE), a metric that assesses both structural correctness and semantic fidelity. A new benchmark is also presented, comprising 1,700 manually annotated lineages from real-world industrial scripts. Experiments were conducted with 12 language models, ranging from small language models (SLMs) of 1.3B to 32B parameters to large language models (LLMs) such as GPT-4o and GPT-4.1. The results demonstrate that the performance of schema lineage extraction scales with model size and the sophistication of prompting techniques. Specifically, a 32B open-source model, using a single reasoning trace, can achieve performance comparable to the GPT series under standard prompting. This finding suggests a scalable and economical approach for deploying schema-aware agents in practical applications.
企业数据管道通常在多种编程语言中进行复杂的转换,这常导致原始元数据与下游数据之间出现语义脱节。这种“语义漂移”会削弱数据的可复现性和治理能力,并降低检索增强生成(RAG)和文本到 SQL 等服务的效用。为了解决这一问题,提出了一种新框架,用于从多语言的企业管道脚本中自动提取细粒度的模式血缘。该方法识别四个关键组件:源模式、源表、转换逻辑和聚合操作,从而创建数据转换的标准化表示。为严格评估血缘质量,本文引入了模式血缘综合评估(SLiCE),该指标同时评估结构正确性和语义保真度。文中还提供了一个新的基准数据集,包括来自真实工业脚本的 1,700 条人工标注的血缘记录。 实验在 12 种语言模型上进行,涵盖从 1.3B 到 32B 的小型语言模型(SLMs),以及像 GPT-4o 和 GPT-4.1 这样的大型语言模型(LLMs)。结果表明,模式血缘提取的性能随着模型规模和提示技术的复杂性而提升。具体而言,一款 32B 的开源模型在使用单一推理轨迹时,能够达到与常规提示下的 GPT 系列相当的性能。该发现表明,在实际应用中部署具备模式感知能力的代理时,可以采用一种可扩展且经济的方案。
Subjects: Computation and Language, Artificial Intelligence, Databases 主题:计算与语言、人工智能、数据库
Publish: 2025-08-10 05:04:32 UTC 发布时间:2025-08-10 05:04:32 协调世界时 (UTC)
#204 Improved Personalized Headline Generation via Denoising Fake Interests from Implicit Feedback #204 通过从隐式反馈中去噪虚假兴趣改进个性化标题生成
Authors: [Kejin Liu](https://arxiv.org/search/?searchtype=author&query=Kejin Liu), [Junhong Lian](https://arxiv.org/search/?searchtype=author&query=Junhong Lian), [Xiang Ao](https://arxiv.org/search/?searchtype=author&query=Xiang Ao), [Ningtao Wang](https://arxiv.org/search/?searchtype=author&query=Ningtao Wang), [Xing Fu](https://arxiv.org/search/?searchtype=author&query=Xing Fu), [Yu Cheng](https://arxiv.org/search/?searchtype=author&query=Yu Cheng), [Weiqiang Wang](https://arxiv.org/search/?searchtype=author&query=Weiqiang Wang), [Xinyu Liu](https://arxiv.org/search/?searchtype=author&query=Xinyu Liu) 作者:刘克进,连俊宏,敖翔,王宁涛,付兴,程煜,王伟强,刘新宇
Accurate personalized headline generation hinges on precisely capturing user interests from historical behaviors. However, existing methods neglect personalized-irrelevant click noise in entire historical clickstreams, which may lead to hallucinated headlines that deviate from genuine user preferences. In this paper, we reveal the detrimental impact of click noise on personalized generation quality through rigorous analysis in both user and news dimensions. Based on these insights, we propose a novel Personalized Headline Generation framework via Denoising Fake Interests from Implicit Feedback (PHG-DIF). PHG-DIF first employs dual-stage filtering to effectively remove clickstream noise, identified by short dwell times and abnormal click bursts, and then leverages multi-level temporal fusion to dynamically model users’ evolving and multi-faceted interests for precise profiling. Moreover, we release DT-PENS, a new benchmark dataset comprising the click behavior of 1,000 carefully curated users and nearly 10,000 annotated personalized headlines with historical dwell time annotations. Extensive experiments demonstrate that PHG-DIF substantially mitigates the adverse effects of click noise and significantly improves headline quality, achieving state-of-the-art (SOTA) results on DT-PENS. Our framework implementation and dataset are available at https://github.com/liukejin-up/PHG-DIF. 准确的个性化标题生成取决于从历史行为中精确捕捉用户兴趣。然而,现有方法忽视了整个历史点击流中与个性化无关的点击噪音,这可能导致与真实用户偏好偏离的虚构标题。本文通过对用户和新闻维度的严格分析,揭示了点击噪音对个性化生成质量的有害影响。基于这些洞见,我们提出了一种通过从隐式反馈中去噪伪兴趣的个性化标题生成新框架(PHG-DIF)。PHG-DIF 首先采用双阶段过滤有效去除由短停留时间和异常点击爆发标识的点击流噪音,然后利用多层次时间融合动态建模用户不断演进且多面的兴趣以实现精确画像。此外,我们发布了 DT-PENS,这是一个新的基准数据集,包含 1,000 名精心挑选用户的点击行为和近 10,000 条带有历史停留时间注释的个性化标题。 大量实验表明,PHG-DIF 能显著缓解点击噪声的不利影响并大幅提升标题质量,在 DT-PENS 数据集上取得了最先进(SOTA)的结果。我们的框架实现和数据集可在 https://github.com/liukejin-up/PHG-DIF 获得。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-10 04:56:13 UTC 发布:2025-08-10 04:56:13 UTC
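The dual-stage filtering described in #204 (short dwell times, then abnormal click bursts) might look roughly like the sketch below. The thresholds (`min_dwell`, `window`, `max_in_window`), the `Click` record, and the order of the two stages are assumptions made for illustration; the abstract gives no concrete values, and PHG-DIF's actual filter may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Click:
    timestamp: float  # seconds since some reference point
    news_id: str
    dwell: float      # seconds spent on the article


def filter_short_dwell(clicks: List[Click], min_dwell: float = 10.0) -> List[Click]:
    """Stage 1: drop clicks whose dwell time is below a hypothetical threshold."""
    return [c for c in clicks if c.dwell >= min_dwell]


def filter_bursts(clicks: List[Click], window: float = 5.0,
                  max_in_window: int = 3) -> List[Click]:
    """Stage 2: drop clicks that sit inside an abnormally dense burst.

    A click is treated as part of a burst when more than max_in_window clicks
    (itself included) fall within +/- window seconds of it.
    """
    ordered = sorted(clicks, key=lambda c: c.timestamp)
    kept = []
    for c in ordered:
        nearby = sum(1 for o in ordered if abs(o.timestamp - c.timestamp) <= window)
        if nearby <= max_in_window:
            kept.append(c)
    return kept


def denoise(clicks: List[Click]) -> List[Click]:
    """Apply both stages in sequence."""
    return filter_bursts(filter_short_dwell(clicks))
```

The quadratic neighbor count keeps the sketch short; a sliding window over the sorted timestamps would be the natural choice for long click histories.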
#205 Lightweight Multi-Scale Feature Extraction with Fully Connected LMF Layer for Salient Object Detection #205 轻量级多尺度特征提取,使用全连接 LMF 层用于显著性目标检测
Authors: [Yunpeng Shi](https://arxiv.org/search/?searchtype=author&query=Yunpeng Shi), [Lei Chen](https://arxiv.org/search/?searchtype=author&query=Lei Chen), [Xiaolu Shen](https://arxiv.org/search/?searchtype=author&query=Xiaolu Shen), [Yanju Guo](https://arxiv.org/search/?searchtype=author&query=Yanju Guo) 作者:石云鹏,陈磊,沈晓璐,郭艳菊
In the domain of computer vision, multi-scale feature extraction is vital for tasks such as salient object detection. However, achieving this capability in lightweight networks remains challenging due to the trade-off between efficiency and performance. This paper proposes a novel lightweight multi-scale feature extraction layer, termed the LMF layer, which employs depthwise separable dilated convolutions in a fully connected structure. By integrating multiple LMF layers, we develop LMFNet, a lightweight network tailored for salient object detection. Our approach significantly reduces the number of parameters while maintaining competitive performance. Here, we show that LMFNet achieves state-of-the-art or comparable results on five benchmark datasets with only 0.81M parameters, outperforming several traditional and lightweight models in terms of both efficiency and accuracy. Our work not only addresses the challenge of multi-scale learning in lightweight networks but also demonstrates the potential for broader applications in image processing tasks. The related code files are available at https://github.com/Shi-Yun-peng/LMFNet 在计算机视觉领域,多尺度特征提取对于显著性目标检测等任务至关重要。然而,由于效率与性能之间的权衡,在轻量级网络中实现这一能力仍然具有挑战性。本文提出了一种新颖的轻量级多尺度特征提取层,称为 LMF 层,该层在全连接结构中采用了深度可分离膨胀卷积。通过集成多个 LMF 层,我们设计了 LMFNet,一种针对显著性目标检测的轻量级网络。我们的方法在显著减少参数数量的同时保持了具有竞争力的性能。实验表明,LMFNet 仅用 0.81M 参数就在五个基准数据集上取得了最先进或可比的结果,且在效率和精度方面均优于多个传统和轻量级模型。我们的工作不仅解决了轻量级网络中多尺度学习的挑战,还展示了其在图像处理任务中更广泛的应用潜力。相关代码文件可在 https://github.com/Shi-Yun-peng/LMFNet 获取。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-10 04:06:48 UTC 发布:2025-08-10 04:06:48 UTC
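The parameter savings behind depthwise separable dilated convolutions, the building block named in #205, follow from standard counting formulas. The helper names below are hypothetical, and the numbers describe a single generic layer rather than LMFNet itself.

```python
def standard_conv_params(c_in: int, c_out: int, k: int, bias: bool = False) -> int:
    """Weights in a standard k x k convolution: k*k*c_in*c_out (+ c_out biases)."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def depthwise_separable_params(c_in: int, c_out: int, k: int, bias: bool = False) -> int:
    """Depthwise k x k conv (one filter per input channel) plus a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


def receptive_field(k: int, dilation: int) -> int:
    """Side length covered by one dilated k x k kernel; dilation adds no parameters."""
    return dilation * (k - 1) + 1
```

For a 3 x 3 layer with 64 input and 64 output channels this gives 36,864 weights for the standard convolution versus 4,672 for the separable version, and a dilation of 4 widens the kernel's footprint to 9 x 9 at no extra parameter cost.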
#206 Large-scale Multi-sequence Pretraining for Generalizable MRI Analysis in Versatile Clinical Applications #206 大规模多序列预训练用于在多样化临床应用中实现具通用性的 MRI 分析
Authors: [Zelin Qiu](https://arxiv.org/search/?searchtype=author&query=Zelin Qiu), [Xi Wang](https://arxiv.org/search/?searchtype=author&query=Xi Wang), [Zhuoyao Xie](https://arxiv.org/search/?searchtype=author&query=Zhuoyao Xie), [Juan Zhou](https://arxiv.org/search/?searchtype=author&query=Juan Zhou), [Yu Wang](https://arxiv.org/search/?searchtype=author&query=Yu Wang), [Lingjie Yang](https://arxiv.org/search/?searchtype=author&query=Lingjie Yang), [Xinrui Jiang](https://arxiv.org/search/?searchtype=author&query=Xinrui Jiang), [Juyoung Bae](https://arxiv.org/search/?searchtype=author&query=Juyoung Bae), [Moo Hyun Son](https://arxiv.org/search/?searchtype=author&query=Moo Hyun Son), [Qiang Ye](https://arxiv.org/search/?searchtype=author&query=Qiang Ye), [Dexuan Chen](https://arxiv.org/search/?searchtype=author&query=Dexuan Chen), [Rui Zhang](https://arxiv.org/search/?searchtype=author&query=Rui Zhang), [Tao Li](https://arxiv.org/search/?searchtype=author&query=Tao Li), [Neeraj Ramesh Mahboobani](https://arxiv.org/search/?searchtype=author&query=Neeraj Ramesh Mahboobani), [Varut Vardhanabhuti](https://arxiv.org/search/?searchtype=author&query=Varut Vardhanabhuti), [Xiaohui Duan](https://arxiv.org/search/?searchtype=author&query=Xiaohui Duan), [Yinghua Zhao](https://arxiv.org/search/?searchtype=author&query=Yinghua Zhao), [Hao Chen](https://arxiv.org/search/?searchtype=author&query=Hao Chen) 作者:邱泽林、王熙、谢卓耀、周娟、王宇、杨灵杰、姜欣睿、裴周英、孙武贤、叶强、陈德轩、张锐、李涛、Neeraj Ramesh Mahboobani、Varut Vardhanabhuti、段晓辉、赵英华、陈浩
Multi-sequence Magnetic Resonance Imaging (MRI) offers remarkable versatility, enabling the distinct visualization of different tissue types. Nevertheless, the inherent heterogeneity among MRI sequences poses significant challenges to the generalization capability of deep learning models. These challenges undermine model performance when faced with varying acquisition parameters, thereby severely restricting their clinical utility. In this study, we present PRISM, a foundation model PRe-trained with large-scale multI-Sequence MRI. We collected a total of 64 datasets from both public and private sources, encompassing a wide range of whole-body anatomical structures, with scans spanning diverse MRI sequences. Among them, 336,476 volumetric MRI scans from 34 datasets (8 public and 26 private) were curated to construct the largest multi-organ multi-sequence MRI pretraining corpus to date. We propose a novel pretraining paradigm that disentangles anatomically invariant features from sequence-specific variations in MRI, while preserving high-level semantic representations. We established a benchmark comprising 44 downstream tasks, including disease diagnosis, image segmentation, registration, progression prediction, and report generation. These tasks were evaluated on 32 public datasets and 5 private cohorts. PRISM consistently outperformed both non-pretrained models and existing foundation models, achieving first-rank results in 39 out of 44 downstream benchmarks with statistically significant improvements. These results underscore its ability to learn robust and generalizable representations across unseen data acquired under diverse MRI protocols. PRISM provides a scalable framework for multi-sequence MRI analysis, thereby enhancing the translational potential of AI in radiology. It delivers consistent performance across diverse imaging protocols, reinforcing its clinical applicability.
多序列磁共振成像(MRI)具有显著的多功能性,能够将不同组织类型区分开来进行可视化。然而,MRI 序列之间的内在异质性对深度学习模型的泛化能力构成了重大挑战。当面对不同采集参数时,这些挑战会削弱模型性能,从而严重限制其临床实用性。在本研究中,我们提出了 PRISM,一种用大规模多序列 MRI 预训练的基础模型(PRe-trained with large-scale multI-Sequence MRI)。我们共收集了来自公开和非公开来源的 64 个数据集,涵盖了广泛的全身解剖结构,扫描涉及多种 MRI 序列。其中,从 34 个数据集(8 个公开和 26 个非公开)整理出 336,476 个体积 MRI 扫描,构建了迄今为止最大的多器官多序列 MRI 预训练语料库。我们提出了一种新颖的预训练范式,该范式在保持高层语义表征的同时,将解剖上不变的特征与 MRI 的序列特异性变异进行解耦。 我们建立了一个包括 44 项下游任务的基准测试,涵盖疾病诊断、图像分割、配准、进展预测和报告生成等。这些任务在 32 个公开数据集和 5 个私有队列上进行了评估。PRISM 始终优于未进行预训练的模型和现有的基础模型,在 44 个下游基准中有 39 项取得了排名第一且具有统计学显著性改进的结果。这些结果强调了其在不同 MRI 协议下采集的未见数据上学习稳健且可泛化表征的能力。PRISM 为多序列 MRI 分析提供了可扩展的框架,从而增强了人工智能在放射学中的转化潜力。它在多种影像协议下都表现稳定,增强了其临床适用性。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-10 03:31:46 UTC 发布:2025-08-10 03:31:46 UTC
#207 Integrating Neurosymbolic AI in Advanced Air Mobility: A Comprehensive Survey #207 在先进空中机动中整合神经符号人工智能:一项综合综述
Authors: [Kamal Acharya](https://arxiv.org/search/?searchtype=author&query=Kamal Acharya), [Iman Sharifi](https://arxiv.org/search/?searchtype=author&query=Iman Sharifi), [Mehul Lad](https://arxiv.org/search/?searchtype=author&query=Mehul Lad), [Liang Sun](https://arxiv.org/search/?searchtype=author&query=Liang Sun), [Houbing Song](https://arxiv.org/search/?searchtype=author&query=Houbing Song) 作者:Kamal Acharya、Iman Sharifi、Mehul Lad、Liang Sun、Houbing Song
Neurosymbolic AI combines neural network adaptability with symbolic reasoning, promising an approach to address the complex regulatory, operational, and safety challenges in Advanced Air Mobility (AAM). This survey reviews its applications across key AAM domains such as demand forecasting, aircraft design, and real-time air traffic management. Our analysis reveals a fragmented research landscape where methodologies, including Neurosymbolic Reinforcement Learning, have shown potential for dynamic optimization but still face hurdles in scalability, robustness, and compliance with aviation standards. We classify current advancements, present relevant case studies, and outline future research directions aimed at integrating these approaches into reliable, transparent AAM systems. By linking advanced AI techniques with AAM’s operational demands, this work provides a concise roadmap for researchers and practitioners developing next-generation air mobility solutions. 神经符号人工智能结合了神经网络的适应性与符号推理,有望为先进空中出行(AAM)中的复杂监管、运营与安全挑战提供解决途径。本综述回顾了其在需求预测、飞机设计和实时空中交通管理等 AAM 关键领域的应用。我们的分析显示,研究格局呈分散状态,其中包括神经符号强化学习在内的方法在动态优化方面展现出潜力,但仍面临可扩展性、鲁棒性及符合航空标准方面的挑战。我们对现有进展进行分类,提出相关案例研究,并概述了旨在将这些方法整合进可靠、透明的 AAM 系统的未来研究方向。通过将先进的人工智能技术与 AAM 的运营需求相结合,本工作为开发下一代空中出行解决方案的研究人员与从业者提供了一条简明的路线图。
Subjects: Robotics, Artificial Intelligence, Neural and Evolutionary Computing 主题:机器人学、人工智能、神经与进化计算
Publish: 2025-08-10 03:30:06 UTC 发布时间:2025-08-10 03:30:06 协调世界时
#208 Intention-Aware Diffusion Model for Pedestrian Trajectory Prediction
Authors: [Yu Liu](https://arxiv.org/search/?searchtype=author&query=Yu Liu), [Zhijie Liu](https://arxiv.org/search/?searchtype=author&query=Zhijie Liu), [Xiao Ren](https://arxiv.org/search/?searchtype=author&query=Xiao Ren), [You-Fu Li](https://arxiv.org/search/?searchtype=author&query=You-Fu Li), [He Kong](https://arxiv.org/search/?searchtype=author&query=He Kong)
Predicting pedestrian motion trajectories is critical for the path planning and motion control of autonomous vehicles. Recent diffusion-based models have shown promising results in capturing the inherent stochasticity of pedestrian behavior for trajectory prediction. However, the absence of explicit semantic modelling of pedestrian intent in many diffusion-based methods may result in misinterpreted behaviors and reduced prediction accuracy. To address the above challenges, we propose a diffusion-based pedestrian trajectory prediction framework that incorporates both short-term and long-term motion intentions. Short-term intent is modelled using a residual polar representation, which decouples direction and magnitude to capture fine-grained local motion patterns. Long-term intent is estimated through a learnable, token-based endpoint predictor that generates multiple candidate goals with associated probabilities, enabling multimodal and context-aware intention modelling. Furthermore, we enhance the diffusion process by incorporating adaptive guidance and a residual noise predictor that dynamically refines denoising accuracy. The proposed framework is evaluated on the widely used ETH, UCY, and SDD benchmarks, demonstrating competitive results against state-of-the-art methods. 预测行人运动轨迹对于自动驾驶车辆的路径规划和运动控制至关重要。近期基于扩散的模型在捕捉行人行为固有随机性以进行轨迹预测方面显示出良好效果。然而,许多基于扩散的方法缺乏对行人意图的显式语义建模,这可能导致对行为的误解并降低预测精度。为了解决上述挑战,我们提出了一个将短期与长期运动意图相结合的基于扩散的行人轨迹预测框架。短期意图使用残差极坐标表示建模,该表示将方向与幅度解耦以捕捉细粒度的局部运动模式。长期意图则通过可学习的基于 token 的终点预测器来估计,该预测器生成多个带有相关概率的候选目标,从而实现多模态且具上下文感知的意图建模。此外,我们通过引入自适应引导和一个残差噪声预测器来增强扩散过程,该预测器动态地提高去噪精度。 所提出的框架在广泛使用的 ETH、UCY 和 SDD 基准上进行了评估,显示出与最先进方法相竞争的结果。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-10 02:36:33 UTC 发布时间:2025-08-10 02:36:33 协调世界时 (UTC)
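One plausible reading of the "residual polar representation" in #208 is to express each step of a trajectory as a (magnitude, direction) pair and then take step-to-step differences, so that speed changes and heading changes are modeled separately. The abstract does not define the representation, so the sketch below is an assumption for illustration, and the function names are hypothetical.

```python
import math


def to_polar_steps(points):
    """Convert consecutive (x, y) positions into (magnitude, angle) displacements."""
    steps = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        steps.append((math.hypot(dx, dy), math.atan2(dy, dx)))
    return steps


def wrap_angle(angle):
    """Wrap an angle in radians into the interval [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi


def polar_residuals(steps):
    """Differences between consecutive steps: (change in magnitude, change in heading)."""
    return [(r1 - r0, wrap_angle(t1 - t0))
            for (r0, t0), (r1, t1) in zip(steps, steps[1:])]
```

A pedestrian walking straight at constant speed produces residuals of (0, 0), and a left turn of 90 degrees at the same speed produces (0, π/2), which shows how direction and magnitude are decoupled.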
#209 Fairness of Automatic Speech Recognition: Looking Through a Philosophical Lens #209 自动语音识别的公平性:通过哲学视角审视
Authors: [Anna Seo Gyeong Choi](https://arxiv.org/search/?searchtype=author&query=Anna Seo Gyeong Choi), [Hoon Choi](https://arxiv.org/search/?searchtype=author&query=Hoon Choi) 作者:Anna Seo Gyeong Choi,Hoon Choi
Automatic Speech Recognition (ASR) systems now mediate countless human-technology interactions, yet research on their fairness implications remains surprisingly limited. This paper examines ASR bias through a philosophical lens, arguing that systematic misrecognition of certain speech varieties constitutes more than a technical limitation – it represents a form of disrespect that compounds historical injustices against marginalized linguistic communities. We distinguish between morally neutral classification (discriminate1) and harmful discrimination (discriminate2), demonstrating how ASR systems can inadvertently transform the former into the latter when they consistently misrecognize non-standard dialects. We identify three unique ethical dimensions of speech technologies that differentiate ASR bias from other algorithmic fairness concerns: the temporal burden placed on speakers of non-standard varieties (“temporal taxation”), the disruption of conversational flow when systems misrecognize speech, and the fundamental connection between speech patterns and personal/cultural identity. These factors create asymmetric power relationships that existing technical fairness metrics fail to capture. The paper analyzes the tension between linguistic standardization and pluralism in ASR development, arguing that current approaches often embed and reinforce problematic language ideologies. We conclude that addressing ASR bias requires more than technical interventions; it demands recognition of diverse speech varieties as legitimate forms of expression worthy of technological accommodation. This philosophical reframing offers new pathways for developing ASR systems that respect linguistic diversity and speaker autonomy. 
自动语音识别(ASR)系统如今介入了无数人机交互,但关于其公平性影响的研究仍然出人意料地有限。本文通过哲学视角审视 ASR 偏见,认为对某些语言变体的系统性误识不仅仅是技术上的局限——它构成了一种对被边缘化语言社区的历史不公的叠加性不尊重。我们区分了道德中性的分类(discriminate1)与有害的歧视(discriminate2),并论证当 ASR 系统持续性地误识非标准方言时,会无意中将前者转化为后者。我们指出语音技术在伦理上有三项独特维度,使 ASR 偏见区别于其他算法公平性问题:对非标准语言使用者施加的时间负担(“时间税”)、当系统误识语音时对会话流的破坏,以及语音模式与个人/文化身份之间的根本性关联。 这些因素造成了现有技术公平性指标无法捕捉的不对称权力关系。论文分析了自动语音识别(ASR)开发中语言标准化与多元化之间的紧张关系,论证了当前方法常常嵌入并强化了有问题的语言意识形态。我们得出的结论是,解决 ASR 的偏见不仅需要技术干预,还需要承认多样的言语变体作为值得技术适配的合法表达形式。这一哲学上的重构为开发尊重语言多样性和说话者自主权的 ASR 系统提供了新的路径。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-10 02:26:47 UTC 发布:2025-08-10 02:26:47 协调世界时 (UTC)
#210 SGD Convergence under Stepsize Shrinkage in Low-Precision Training #210 在低精度训练中步长收缩下的 SGD 收敛性
Author: [Vincent-Daniel Yun](https://arxiv.org/search/?searchtype=author&query=Vincent-Daniel Yun) 作者:Vincent-Daniel Yun
Low-precision training has become essential for reducing the computational and memory costs of large-scale deep learning. However, quantization of gradients introduces both magnitude shrinkage and additive noise, which can alter the convergence behavior of stochastic gradient descent (SGD). In this work, we study the convergence of SGD under a gradient shrinkage model, where each stochastic gradient is scaled by a factor $q_k \in (0,1]$ and perturbed by zero-mean quantization noise. We show that this shrinkage is equivalent to replacing the nominal stepsize $\mu_k$ with an effective stepsize $\mu_k q_k$, which slows convergence when $q_{\min} < 1$. Under standard smoothness and bounded-variance assumptions, we prove that low-precision SGD still converges, but at a reduced rate determined by $q_{\min}$, and with an increased asymptotic error floor due to quantization noise. We theoretically analyze how reduced numerical precision slows down training by modeling it as gradient shrinkage in the standard SGD convergence framework. 低精度训练已成为降低大规模深度学习计算和内存开销的关键。然而,对梯度的量化会引入幅度收缩和加性噪声,这可能改变随机梯度下降法(SGD)的收敛行为。在本工作中,我们研究在梯度收缩模型下的 SGD 收敛性,其中每个随机梯度都被一个因子 $q_k \in (0,1]$ 缩放并受到零均值量化噪声的扰动。我们表明,这种收缩等价于将名义步长 $\mu_k$ 替换为有效步长 $\mu_k q_k$,当 $q_{\min} < 1$ 时会导致收敛变慢。在标准的光滑性和有界方差假设下,我们证明低精度 SGD 仍然收敛,但收敛速率降低(由 $q_{\min}$ 决定),并且由于量化噪声而导致渐近误差下限增大。我们在标准 SGD 收敛框架中将其建模为梯度收缩,理论分析了数值精度降低如何放慢训练速度。
Subjects: Machine Learning, Artificial Intelligence, Information Theory, Numerical Analysis 主题:机器学习、人工智能、信息论、数值分析
Publish: 2025-08-10 02:25:48 UTC 发表:2025-08-10 02:25:48 UTC
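The effective-stepsize claim in the abstract above can be checked numerically on a toy problem. The following is a minimal sketch, not the paper's experiment: it assumes a one-dimensional quadratic objective, a constant shrinkage factor `q`, and Gaussian noise standing in for quantization noise. The function name `run_sgd` and all parameter values are hypothetical.

```python
import random

def run_sgd(step, q, noise_std, iters=200, x0=10.0, seed=0):
    """SGD on f(x) = 0.5 * x**2 with shrunk, noisy gradients.

    Assumed toy model: each gradient is scaled by a constant q in (0, 1]
    and perturbed by zero-mean Gaussian noise (a stand-in for quantization noise).
    """
    rng = random.Random(seed)
    x = x0
    for _ in range(iters):
        grad = x  # exact gradient of 0.5 * x**2
        noisy_grad = q * grad + rng.gauss(0.0, noise_std)
        x -= step * noisy_grad
    return x

# Noise-free case: stepsize 0.1 with shrinkage 0.5 behaves like stepsize 0.05 without shrinkage.
shrunk = run_sgd(step=0.1, q=0.5, noise_std=0.0)
rescaled = run_sgd(step=0.05, q=1.0, noise_std=0.0)
full_precision = run_sgd(step=0.1, q=1.0, noise_std=0.0)
print(shrunk, rescaled, full_precision)
```

With the noise switched off, the shrunk run and the rescaled run follow the same trajectory, and both end farther from the optimum than the unshrunk run, which is the slowdown the abstract describes. Setting `noise_std` above zero lets one look at the error floor as well.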
#211 A Real-Time, Self-Tuning Moderator Framework for Adversarial Prompt Detection #211 一个用于对抗性提示检测的实时自调节审查框架
Author: [Ivan Zhang](https://arxiv.org/search/?searchtype=author&query=Ivan Zhang) 作者:Ivan Zhang
Ensuring LLM alignment is critical to information security as AI models become increasingly widespread and integrated in society. Unfortunately, many defenses against adversarial attacks and jailbreaking on LLMs cannot adapt quickly to new attacks, degrade model responses to benign prompts, or introduce significant barriers to scalable implementation. To mitigate these challenges, we introduce a real-time, self-tuning (RTST) moderator framework to defend against adversarial attacks while maintaining a lightweight training footprint. We empirically evaluate its effectiveness using Google’s Gemini models against modern, effective jailbreaks. Our results demonstrate the advantages of an adaptive, minimally intrusive framework for jailbreak defense over traditional fine-tuning or classifier models. 随着 AI 模型日益普及并融入社会,确保 LLM 对齐对于信息安全至关重要。不幸的是,许多针对 LLM 的对抗性攻击和越狱的防御无法快速适应新攻击,会降低模型对良性提示的响应质量,或为可扩展实现引入重大障碍。为缓解这些挑战,我们引入了一种实时、自调节(RTST)审查框架,以抵御对抗性攻击,同时保持轻量级的训练开销。我们使用谷歌的 Gemini 模型针对现代有效的越狱攻击对其有效性进行了实证评估。我们的结果证明,与传统的微调或分类器模型相比,自适应、干预最小的框架在越狱防御上具有优势。
Subjects: Cryptography and Security, Artificial Intelligence 主题:密码学与安全性 , 人工智能
Publish: 2025-08-10 01:59:07 UTC 发布:2025-08-10 01:59:07 UTC
#212 A Stable and Principled Loss Function for Direct Language Model Alignment #212 一个稳定且有原则的直接语言模型对齐损失函数
Author: [Yuandong Tan](https://arxiv.org/search/?searchtype=author&query=Yuandong Tan) 作者:Yuandong Tan
The alignment of large language models (LLMs) with human preferences is commonly achieved through Reinforcement Learning from Human Feedback (RLHF). Direct Preference Optimization (DPO) simplified this paradigm by establishing a direct mapping between the optimal policy and a reward function, eliminating the need for an explicit reward model. However, we argue that the DPO loss function is theoretically misaligned with its own derivation, as it promotes the indefinite maximization of a logits difference, which can lead to training instability and reward hacking. In this paper, we propose a novel loss function derived directly from the RLHF optimality condition. Our proposed loss targets a specific, finite value for the logits difference, which is dictated by the underlying reward, rather than its maximization. We provide a theoretical analysis, including a gradient-based comparison, to demonstrate that our method avoids the large gradients that plague DPO when the probability of dispreferred responses approaches zero. This inherent stability prevents reward hacking and leads to more effective alignment. We validate our approach by fine-tuning a Qwen2.5-7B model, showing significant win-rate improvements over a standard DPO baseline and achieving competitive performance against larger models like Llama-3.1-8B. 大型语言模型(LLMs)与人类偏好的对齐通常通过来自人类反馈的强化学习(RLHF)来实现。直接偏好优化(DPO)通过在最优策略与奖励函数之间建立直接映射简化了该范式,消除了显式奖励模型的需求。然而,我们认为 DPO 的损失函数在理论上与其自身的推导不一致,因为它鼓励对 logits 差值进行无限制地最大化,这可能导致训练不稳定和奖励操纵。在本文中,我们提出了一种直接从 RLHF 最优性条件推导出的新型损失函数。我们提出的损失以 logits 差值的特定有限值为目标,该值由潜在奖励决定,而非对其进行最大化。我们提供了理论分析,包括基于梯度的比较,以证明当不受欢迎响应的概率接近零时,我们的方法避免了困扰 DPO 的大梯度。这种内在的稳定性防止了奖励操纵,并带来了更有效的对齐。 我们通过微调 Qwen2.5-7B 模型来验证我们的方法,结果显示在胜率上相较于标准的 DPO 基线有显著提升,并在与更大模型(如 Llama-3.1-8B)的比较中取得了有竞争力的表现。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-10 01:56:58 UTC
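The abstract's point that the DPO objective rewards an ever-growing margin can be seen directly from the standard DPO loss. The sketch below assumes the usual form, the negative log-sigmoid of `beta` times the policy-versus-reference log-ratio margin. The second function, `finite_target_loss`, is only an illustrative stand-in with a finite minimizer; it is not the loss proposed in the paper, whose exact form the abstract does not give.

```python
import math

def dpo_loss(margin, beta=0.1):
    """Standard DPO loss as a function of the log-ratio margin.

    Computes -log(sigmoid(beta * margin)) in a numerically stable way,
    written as softplus(-beta * margin).
    """
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def finite_target_loss(margin, target, beta=0.1):
    """Hypothetical illustration (NOT the paper's loss): squared distance to a finite target margin."""
    return (beta * (margin - target)) ** 2

margins = [0.0, 10.0, 100.0, 1000.0]
dpo_values = [dpo_loss(m) for m in margins]
target_values = [finite_target_loss(m, target=10.0) for m in margins]
print(dpo_values)
print(target_values)
```

The DPO values keep shrinking as the margin grows, so the loss has no finite minimizer in the margin, while the stand-in loss is smallest at the chosen target and grows again beyond it.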
#213 "Draw me a curator" Examining the visual stereotyping of a cultural services profession by generative AI #213 “为我画一位策展人” 通过生成式人工智能审视对文化服务职业的视觉刻板化
Author: [Dirk HR Spennemann](https://arxiv.org/search/?searchtype=author&query=Dirk HR Spennemann) 作者:Dirk HR Spennemann
Based on 230 visualisations, this paper examines the depiction of museum curators by the popular generative Artificial Intelligence (AI) model, ChatGPT4o. While the AI-generated representations do not reiterate popular stereotypes of curators as nerdy, conservative in dress and stuck in time rummaging through collections, they contrast sharply with real-world demographics. AI-generated imagery extremely underrepresents women (3.5% vs 49% to 72% in reality) and disregards ethnic communities other than Caucasian (0% vs 18% to 36%). It not only over-represents young curators (79% vs approx. 27%) but also renders curators to resemble yuppie professionals or people featuring in fashion advertising. Stereotypical attributes are prevalent, with curators widely depicted as wearing beards and holding clipboards or digital tablets. The findings highlight biases in the generative AI image creation dataset, which is poised to shape an inaccurate portrayal of museum professionals if the images were to be taken uncritically at face value. 基于 230 张可视化图像,本文考察了流行的生成式人工智能(AI)模型 ChatGPT4o 对博物馆策展人的描绘。尽管 AI 生成的形象并未重申策展人作为书呆子、着装保守并停留在过去翻找藏品的流行刻板印象,但它们与现实世界的人口统计特征形成了鲜明对比。AI 生成的图像中女性的比例严重偏低(3.5% 对比现实中的 49% 到 72%),并忽视了除白人以外的族裔群体(0% 对比现实中的 18% 到 36%)。它不仅过度呈现年轻策展人(79% 对比约 27%),还将策展人描绘成雅皮士职业人士或出现在时尚广告中的人物。刻板化特征普遍存在,策展人常被描绘为留胡须且手持剪贴板或数字平板。研究结果突显了生成式 AI 图像创建数据集中的偏见,如果这些图像被不加批判地视为事实,可能会塑造对博物馆专业人员的不准确刻画。
Subjects: Computers and Society, Artificial Intelligence 主题:计算机与社会,人工智能
Publish: 2025-08-10 00:43:43 UTC
#214 Toward AI Matching Policies in Homeless Services: A Qualitative Study with Policymakers
Authors: [Caroline M. Johnston](https://arxiv.org/search/?searchtype=author&query=Caroline M. Johnston), [Olga Koumoundouros](https://arxiv.org/search/?searchtype=author&query=Olga Koumoundouros), [Angel Hsing-Chi Hwang](https://arxiv.org/search/?searchtype=author&query=Angel Hsing-Chi Hwang), [Laura Onasch-Vera](https://arxiv.org/search/?searchtype=author&query=Laura Onasch-Vera), [Eric Rice](https://arxiv.org/search/?searchtype=author&query=Eric Rice), [Phebe Vayanos](https://arxiv.org/search/?searchtype=author&query=Phebe Vayanos)
Artificial intelligence researchers have proposed various data-driven algorithms to improve the processes that match individuals experiencing homelessness to scarce housing resources. It remains unclear whether and how these algorithms are received or adopted by practitioners and what their corresponding consequences are. Through semi-structured interviews with 13 policymakers in homeless services in Los Angeles, we investigate whether such change-makers are open to the idea of integrating AI into the housing resource matching process, identifying where they see potential gains and drawbacks from such a system in issues of efficiency, fairness, and transparency. Our qualitative analysis indicates that, even when aware of various complicating factors, policymakers welcome the idea of an AI matching tool if thoughtfully designed and used in tandem with human decision-makers. Though there is no consensus as to the exact design of such an AI system, insights from policymakers raise open questions and design considerations that can be enlightening for future researchers and practitioners who aim to build responsible algorithmic systems to support decision-making in low-resource scenarios. 人工智能研究人员提出了各种以数据为驱动的算法,以改进将无家可归者与稀缺住房资源匹配的过程。目前尚不清楚这些算法是否以及如何被从业者接受或采纳,以及其相应的后果是什么。通过对洛杉矶 13 位无家可归者服务政策制定者进行半结构化访谈,我们调查了这些变革推动者是否愿意将人工智能整合到住房资源匹配过程中,并识别他们在效率、公平性和透明度等问题上认为此类系统可能带来的收益和弊端。我们的定性分析表明,即使在意识到各种复杂因素的情况下,政策制定者也欢迎一种经过深思熟虑设计并与人工决策者配合使用的人工智能匹配工具。 尽管关于此类人工智能系统的具体设计尚无共识,但决策者的见解提出了开放性问题和设计考量,这些对于未来旨在构建在低资源情境中支持决策的负责任算法系统的研究人员和从业者具有启发意义。
Subjects: Human-Computer Interaction, Artificial Intelligence 主题:人机交互,人工智能
Publish: 2025-08-10 00:33:03 UTC 发布时间:2025-08-10 00:33:03 UTC
#215 Perceptual Evaluation of GANs and Diffusion Models for Generating X-rays #215 感知评估:用于生成 X 射线的 GAN 与扩散模型
Authors: [Gregory Schuit](https://arxiv.org/search/?searchtype=author&query=Gregory Schuit), [Denis Parra](https://arxiv.org/search/?searchtype=author&query=Denis Parra), [Cecilia Besa](https://arxiv.org/search/?searchtype=author&query=Cecilia Besa) 作者:Gregory Schuit、Denis Parra、Cecilia Besa
Generative image models have achieved remarkable progress in both natural and medical imaging. In the medical context, these techniques offer a potential solution to data scarcity, especially for low-prevalence anomalies that impair the performance of AI-driven diagnostic and segmentation tools. However, questions remain regarding the fidelity and clinical utility of synthetic images, since poor generation quality can undermine model generalizability and trust. In this study, we evaluate the effectiveness of state-of-the-art generative models, namely Generative Adversarial Networks (GANs) and Diffusion Models (DMs), for synthesizing chest X-rays conditioned on four abnormalities: Atelectasis (AT), Lung Opacity (LO), Pleural Effusion (PE), and Enlarged Cardiac Silhouette (ECS). Using a benchmark composed of real images from the MIMIC-CXR dataset and synthetic images from both GANs and DMs, we conducted a reader study with three radiologists of varied experience. Participants were asked to distinguish real from synthetic images and assess the consistency between visual features and the target abnormality. Our results show that while DMs generate more visually realistic images overall, GANs can report better accuracy for specific conditions, such as absence of ECS. We further identify visual cues radiologists use to detect synthetic images, offering insights into the perceptual gaps in current models. These findings underscore the complementary strengths of GANs and DMs and point to the need for further refinement to ensure generative models can reliably augment training datasets for AI diagnostic systems.
生成图像模型在自然影像和医学影像领域都取得了显著进展。在医学背景下,这些技术为数据稀缺问题提供了潜在解决方案——尤其是针对那些低发病率异常,这些异常会削弱基于人工智能的诊断和分割工具的性能。然而,关于合成图像的保真度和临床实用性仍存在疑问,因为较差的生成质量会破坏模型的泛化能力和可信度。在本研究中,我们评估了最先进的生成模型——生成对抗网络(GANs)和扩散模型(DMs)——在条件化生成胸部 X 光影像以表现四种异常(肺不张/部分塌陷(Atelectasis, AT)、肺不透明/浸润(Lung Opacity, LO)、胸腔积液(Pleural Effusion, PE)和心脏轮廓增大(Enlarged Cardiac Silhouette, ECS))方面的有效性。我们使用由 MIMIC-CXR 数据集中真实图像以及来自 GANs 和 DMs 的合成图像组成的基准数据集,开展了一项由三位不同经验水平的放射科医师参与的阅片研究。参与者需分辨真实图像与合成图像,并评估视觉特征与目标异常之间的一致性。 我们的结果表明,尽管扩散模型总体上生成的图像在视觉上更逼真,但生成对抗网络在某些特定条件下(例如缺乏 ECS)可以表现出更高的准确性。我们进一步识别出放射科医生用来辨别合成图像的视觉线索,为当前模型的感知差距提供了见解。这些发现强调了 GAN 和扩散模型的互补优势,并指出需要进一步改进,以确保生成模型能够可靠地增强用于 AI 诊断系统的训练数据集。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-10 00:32:18 UTC 发布:2025-08-10 00:32:18 UTC
#216 Pref-GUIDE: Continual Policy Learning from Real-Time Human Feedback via Preference-Based Learning #216 Pref-GUIDE:通过基于偏好的学习从实时人类反馈中进行持续策略学习
Authors: [Zhengran Ji](https://arxiv.org/search/?searchtype=author&query=Zhengran Ji), [Boyuan Chen](https://arxiv.org/search/?searchtype=author&query=Boyuan Chen) 作者:季正然,陈博远
Training reinforcement learning agents with human feedback is crucial when task objectives are difficult to specify through dense reward functions. While prior methods rely on offline trajectory comparisons to elicit human preferences, such data is unavailable in online learning scenarios where agents must adapt on the fly. Recent approaches address this by collecting real-time scalar feedback to guide agent behavior and train reward models for continued learning after human feedback becomes unavailable. However, scalar feedback is often noisy and inconsistent, limiting the accuracy and generalization of learned rewards. We propose Pref-GUIDE, a framework that transforms real-time scalar feedback into preference-based data to improve reward model learning for continual policy training. Pref-GUIDE Individual mitigates temporal inconsistency by comparing agent behaviors within short windows and filtering ambiguous feedback. Pref-GUIDE Voting further enhances robustness by aggregating reward models across a population of users to form consensus preferences. Across three challenging environments, Pref-GUIDE significantly outperforms scalar-feedback baselines, with the voting variant exceeding even expert-designed dense rewards. By reframing scalar feedback as structured preferences with population feedback, Pref-GUIDE offers a scalable and principled approach for harnessing human input in online reinforcement learning. 当任务目标难以通过稠密奖励函数明确指定时,用人类反馈训练强化学习智能体至关重要。先前的方法依赖离线轨迹对比来引出人类偏好,而在智能体必须即时适应的在线学习场景中,这类数据不可用。近期的方法通过收集实时标量反馈来引导智能体行为,并在人工反馈不可用后训练奖励模型以继续学习。然而,标量反馈通常存在噪声和不一致性,限制了所学奖励的准确性和泛化能力。我们提出了 Pref-GUIDE 框架,该框架将实时标量反馈转化为基于偏好的数据,以改进用于持续策略训练的奖励模型学习。Pref-GUIDE Individual 通过在短时间窗口内比较智能体行为并过滤含糊反馈来缓解时间不一致性。Pref-GUIDE Voting 通过在用户群体中聚合奖励模型以形成共识偏好,进一步增强了鲁棒性。 在三个具有挑战性的环境中,Pref-GUIDE 显著优于标量反馈基线,其中投票变体甚至超过了专家设计的稠密奖励。通过将标量反馈重新构造成带有人群反馈的结构化偏好,Pref-GUIDE 为在在线强化学习中利用人类输入提供了一种可扩展且有原则的方法。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-10 00:18:44 UTC 发布日期:2025-08-10 00:18:44 UTC
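The core idea of Pref-GUIDE Individual, as the abstract describes it, is to compare behaviors that received feedback close together in time and to drop comparisons that are too close to call. A minimal sketch of that conversion follows. The data layout, the `window` and `margin` parameters, and the function name are assumptions made for illustration, not details from the paper.

```python
from itertools import combinations

def scalar_to_preferences(feedback, window=3, margin=0.2):
    """Turn (step, score) scalar feedback into (preferred_step, other_step) pairs.

    Assumed rules for this sketch:
    - only steps at most `window` apart are compared (short temporal window);
    - pairs whose scores differ by less than `margin` are treated as ambiguous and skipped.
    """
    pairs = []
    for (t_a, s_a), (t_b, s_b) in combinations(feedback, 2):
        if abs(t_a - t_b) > window:
            continue  # too far apart in time to compare fairly
        if abs(s_a - s_b) < margin:
            continue  # ambiguous feedback, filtered out
        pairs.append((t_a, t_b) if s_a > s_b else (t_b, t_a))
    return pairs

# Hypothetical feedback stream: (time step, scalar rating from one user).
feedback = [(0, 0.1), (1, 0.8), (2, 0.75), (10, -0.5)]
prefs = scalar_to_preferences(feedback)
print(prefs)
```

In this toy stream, steps 1 and 2 are each preferred over step 0, the near-tie between steps 1 and 2 is filtered out, and step 10 is never compared because it lies outside the window. The resulting pairs are the kind of data a preference-based reward model would then be trained on.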
#217 Investigating Intersectional Bias in Large Language Models using Confidence Disparities in Coreference Resolution #217 在核心指代消解中通过置信度差异研究大型语言模型的交叉性偏见
Authors: [Falaah Arif Khan](https://arxiv.org/search/?searchtype=author&query=Falaah Arif Khan), [Nivedha Sivakumar](https://arxiv.org/search/?searchtype=author&query=Nivedha Sivakumar), [Yinong Oliver Wang](https://arxiv.org/search/?searchtype=author&query=Yinong Oliver Wang), [Katherine Metcalf](https://arxiv.org/search/?searchtype=author&query=Katherine Metcalf), [Cezanne Camacho](https://arxiv.org/search/?searchtype=author&query=Cezanne Camacho), [Barry-John Theobald](https://arxiv.org/search/?searchtype=author&query=Barry-John Theobald), [Luca Zappella](https://arxiv.org/search/?searchtype=author&query=Luca Zappella), [Nicholas Apostoloff](https://arxiv.org/search/?searchtype=author&query=Nicholas Apostoloff) 作者:Falaah Arif Khan、Nivedha Sivakumar、Yinong Oliver Wang、Katherine Metcalf、Cezanne Camacho、Barry-John Theobald、Luca Zappella、Nicholas Apostoloff
Large language models (LLMs) have achieved impressive performance, leading to their widespread adoption as decision-support tools in resource-constrained contexts like hiring and admissions. There is, however, scientific consensus that AI systems can reflect and exacerbate societal biases, raising concerns about identity-based harm when used in critical social contexts. Prior work has laid a solid foundation for assessing bias in LLMs by evaluating demographic disparities in different language reasoning tasks. In this work, we extend single-axis fairness evaluations to examine intersectional bias, recognizing that when multiple axes of discrimination intersect, they create distinct patterns of disadvantage. We create a new benchmark called WinoIdentity by augmenting the WinoBias dataset with 25 demographic markers across 10 attributes, including age, nationality, and race, intersected with binary gender, yielding 245,700 prompts to evaluate 50 distinct bias patterns. Focusing on harms of omission due to underrepresentation, we investigate bias through the lens of uncertainty and propose a group (un)fairness metric called Coreference Confidence Disparity which measures whether models are more or less confident for some intersectional identities than others. We evaluate five recently published LLMs and find confidence disparities as high as 40% along various demographic attributes including body type, sexual orientation and socio-economic status, with models being most uncertain about doubly-disadvantaged identities in anti-stereotypical settings. Surprisingly, coreference confidence decreases even for hegemonic or privileged markers, indicating that the recent impressive performance of LLMs is more likely due to memorization than logical reasoning. Notably, these are two independent failures in value alignment and validity that can compound to cause social harm. 
大型语言模型(LLMs)已取得令人瞩目的表现,推动它们在招聘和录取等资源受限的场景中作为决策支持工具被广泛采用。然而,科学界普遍认为,人工智能系统可能反映并加剧社会偏见,在关键社会情境中使用时会引发基于身份的伤害担忧。先前的研究通过评估不同语言推理任务中的人口统计差异,为评估 LLMs 中的偏见奠定了坚实基础。在本工作中,我们将单轴公平性评估扩展到考察交叉性偏见,认识到当多重歧视轴相交时,会产生独特的不利模式。我们通过在 WinoBias 数据集上增加涵盖年龄、国籍和种族等 10 个属性共 25 个人口统计标记,并与二元性别相交,创建了一个名为 WinoIdentity 的新基准,生成了 245,700 个提示以评估 50 种不同的偏见模式。 我们关注因代表性不足而导致的遗漏性伤害,从不确定性的角度研究偏见,提出了一种群体(不)公平度量——共指置信度差异(Coreference Confidence Disparity),用于衡量模型对某些交叉身份相较于其他身份是否更有或更无信心。我们评估了五个近期发布的 LLMs,发现沿不同人口属性(包括体型、性取向和社会经济地位)置信度差异高达 40%,在反刻板印象的情境中,模型对处于双重弱势的身份尤为不确定。值得注意的是,即便是对于霸权或特权标识,模型的共指置信度也会降低,这表明 LLMs 近期令人印象深刻的表现更可能源于记忆而非逻辑推理。值得关注的是,这两者分别是价值对齐和有效性方面的独立失败,且可能叠加导致社会伤害。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-09 22:24:40 UTC 发布时间:2025-08-09 22:24:40 UTC
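The abstract describes Coreference Confidence Disparity as measuring whether a model is more or less confident for some intersectional identities than others, but does not give the formula. One plausible reading, shown here purely as an assumption, is the gap between the highest and lowest mean confidence across groups. The group names and scores are made up for illustration.

```python
from statistics import mean

def confidence_disparity(confidences_by_group):
    """Return per-group mean confidence and the max-min gap between groups.

    Assumed definition for this sketch; the paper's metric may differ.
    """
    means = {group: mean(values) for group, values in confidences_by_group.items()}
    gap = max(means.values()) - min(means.values())
    return means, gap

# Hypothetical model confidences in the correct referent, grouped by identity marker.
scores = {
    "group_a": [0.90, 0.80, 0.85],
    "group_b": [0.50, 0.60, 0.55],
}
means, gap = confidence_disparity(scores)
print(means, gap)
```

A gap near zero would mean the model is about equally confident across groups, while a large gap flags identities for which the model is systematically less certain.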
#218 Towards High-Order Mean Flow Generative Models: Feasibility, Expressivity, and Provably Efficient Criteria #218 面向高阶均值流生成模型的探索:可行性、表达能力与可证明高效的准则
Authors: [Yang Cao](https://arxiv.org/search/?searchtype=author&query=Yang Cao), [Yubin Chen](https://arxiv.org/search/?searchtype=author&query=Yubin Chen), [Zhao Song](https://arxiv.org/search/?searchtype=author&query=Zhao Song), [Jiahao Zhang](https://arxiv.org/search/?searchtype=author&query=Jiahao Zhang) 作者:曹阳、陈煜斌、宋钊、张家豪
Generative modelling has seen significant advances through simulation-free paradigms such as Flow Matching, and in particular, the MeanFlow framework, which replaces instantaneous velocity fields with average velocities to enable efficient single-step sampling. In this work, we introduce a theoretical study on Second-Order MeanFlow, a novel extension that incorporates average acceleration fields into the MeanFlow objective. We first establish the feasibility of our approach by proving that the average acceleration satisfies a generalized consistency condition analogous to first-order MeanFlow, thereby supporting stable, one-step sampling and tractable loss functions. We then characterize its expressivity via circuit complexity analysis, showing that under mild assumptions, the Second-Order MeanFlow sampling process can be implemented by uniform threshold circuits within the $\mathsf{TC}^0$ class. Finally, we derive provably efficient criteria for scalable implementation by leveraging fast approximate attention computations: we prove that attention operations within the Second-Order MeanFlow architecture can be approximated to within $1/\mathrm{poly}(n)$ error in time $n^{2+o(1)}$. Together, these results lay the theoretical foundation for high-order flow matching models that combine rich dynamics with practical sampling efficiency. 生成建模通过无模拟范式取得了显著进展,例如流匹配(Flow Matching),尤其是 MeanFlow 框架,它以平均速度取代瞬时速度场,从而实现高效的单步采样。在本工作中,我们提出并理论研究了二阶 MeanFlow,这是一种新扩展,将平均加速度场纳入 MeanFlow 目标中。我们首先证明了该方法的可行性:平均加速度满足与一阶 MeanFlow 类似的广义一致性条件,从而支持稳定的一步采样和可处理的损失函数。随后,我们通过电路复杂度分析刻画了其表现力,表明在温和假设下,二阶 MeanFlow 的采样过程可以由属于 $\mathsf{TC}^0$ 类的均匀阈值电路实现。最后,我们通过利用快速近似注意力计算,推导出可伸缩实现的有证效率条件:我们证明了在二阶 MeanFlow 架构中,注意力操作可以在时间 $n^{2+o(1)}$ 内被近似到 $1/\mathrm{poly}(n)$ 误差范围。 综上所述,这些结果为将丰富动力学与实用采样效率相结合的高阶流匹配模型奠定了理论基础。
Subjects: Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition 学科:机器学习、人工智能、计算机视觉与模式识别
Publish: 2025-08-09 21:10:58 UTC 发布:2025-08-09 21:10:58 UTC
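For orientation, the first-order quantities that the abstract builds on can be written out as follows. The notation ($z_t$ for the state, $v$ for the instantaneous velocity, $u$ for the average velocity over $[r, t]$) is assumed here rather than taken from the paper. The last line is only a guess at what an "average acceleration" might look like by analogy; the paper's actual definition may differ.

```latex
% Average velocity over the interval [r, t] (first-order MeanFlow):
u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau

% Multiply by (t - r) and differentiate with respect to t
% to obtain the first-order consistency condition:
u(z_t, r, t) + (t - r)\, \frac{d}{dt} u(z_t, r, t) = v(z_t, t)

% Assumed second-order analogue (illustrative only), with a = dv/dt:
\bar{a}(z_t, r, t) = \frac{1}{t - r} \int_r^t a(z_\tau, \tau)\, d\tau
```

The second line follows from the first by the product rule and the fundamental theorem of calculus, which is why a network trained to satisfy it can stand in for the integral at sampling time.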
#219 Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning #219 少即是多:无训练稀疏注意力结合全局局部性以实现高效推理
Authors: [Lijie Yang](https://arxiv.org/search/?searchtype=author&query=Lijie Yang), [Zhihao Zhang](https://arxiv.org/search/?searchtype=author&query=Zhihao Zhang), [Arti Jain](https://arxiv.org/search/?searchtype=author&query=Arti Jain), [Shijie Cao](https://arxiv.org/search/?searchtype=author&query=Shijie Cao), [Baihong Yuan](https://arxiv.org/search/?searchtype=author&query=Baihong Yuan), [Yiwei Chen](https://arxiv.org/search/?searchtype=author&query=Yiwei Chen), [Zhihao Jia](https://arxiv.org/search/?searchtype=author&query=Zhihao Jia), [Ravi Netravali](https://arxiv.org/search/?searchtype=author&query=Ravi Netravali) 作者:杨立杰,张志豪,Arti Jain,曹世杰,袁柏宏,陈奕威,贾志豪,Ravi Netravali
Large reasoning models achieve strong performance through test-time scaling but incur substantial computational overhead, particularly from excessive token generation when processing short input prompts. While sparse attention mechanisms can reduce latency and memory usage, existing approaches suffer from significant accuracy degradation due to accumulated errors during long-generation reasoning. These methods generally require either high token retention rates or expensive retraining. We introduce LessIsMore, a training-free sparse attention mechanism for reasoning tasks, which leverages global attention patterns rather than relying on traditional head-specific local optimizations. LessIsMore aggregates token selections from local attention heads with recent contextual information, enabling unified cross-head token ranking for future decoding layers. This unified selection improves generalization and efficiency by avoiding the need to maintain separate token subsets per head. Evaluation across diverse reasoning tasks and benchmarks shows that LessIsMore preserves (and in some cases improves) accuracy while achieving a 1.1× average decoding speed-up compared to full attention. Moreover, LessIsMore attends to 2× fewer tokens without accuracy loss, achieving a 1.13× end-to-end speed-up compared to existing sparse attention methods. 大型推理模型通过测试时扩展取得了强大的性能,但代价是显著的计算开销,尤其是在处理较短输入提示时生成过多标记。尽管稀疏注意力机制可以降低延迟和内存使用,现有方法在长序列生成推理过程中由于误差累积而导致显著的精度下降。此类方法通常要么需要高标记保留率,要么需要昂贵的再训练。我们提出了 LessIsMore,一种用于推理任务的无训练稀疏注意力机制,它利用全局注意力模式,而不是依赖传统的按头局部优化。LessIsMore 将来自局部注意力头的标记选择与最近的上下文信息汇聚,能够为后续解码层实现跨头的统一标记排序。这种统一选择通过避免为每个注意力头维护独立的标记子集,从而提升了泛化能力和效率。 在多样的推理任务和基准评估中,LessIsMore 在保持(在某些情况下甚至提高)准确率的同时,与全注意力相比实现了平均 1.1× 的解码加速。此外,LessIsMore 在不损失准确率的情况下关注的标记数量减少为原来的一半,与现有稀疏注意力方法相比实现了端到端 1.13× 的加速。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-09 21:10:33 UTC 发布:2025-08-09 21:10:33 协调世界时
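The cross-head selection idea in the LessIsMore abstract can be sketched in a few lines: each head nominates its top tokens, the nominations are merged into one ranking, and a window of recent tokens is always kept. The aggregation rule used below (summing the scores of nominated tokens) and every name and parameter are assumptions for illustration; the paper's actual procedure is not specified in the abstract.

```python
def unified_token_selection(head_scores, k_per_head, budget, recent_window):
    """Pick one shared set of token indices for all heads.

    head_scores: list with one entry per head, each a list of attention
    scores over the same context positions.
    """
    num_tokens = len(head_scores[0])
    # Always keep the most recent tokens (recent contextual information).
    kept = set(range(max(0, num_tokens - recent_window), num_tokens))
    # Each head nominates its top-k tokens; nominated scores are summed across heads.
    votes = {}
    for scores in head_scores:
        top = sorted(range(num_tokens), key=lambda i: scores[i], reverse=True)[:k_per_head]
        for i in top:
            votes[i] = votes.get(i, 0.0) + scores[i]
    # Fill the remaining budget from the unified cross-head ranking.
    for i in sorted(votes, key=lambda i: votes[i], reverse=True):
        if len(kept) >= budget:
            break
        kept.add(i)
    return sorted(kept)

# Two hypothetical heads attending over six context tokens.
heads = [
    [0.50, 0.05, 0.30, 0.05, 0.05, 0.05],
    [0.10, 0.40, 0.35, 0.05, 0.05, 0.05],
]
selected = unified_token_selection(heads, k_per_head=2, budget=3, recent_window=1)
print(selected)
```

Token 2 is nominated by both heads and therefore ranks first in the unified list, ahead of each head's individual favorite. Because every head then attends to the same index set, there is only one token subset to maintain instead of one per head.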
#220 Hide or Highlight: Understanding the Impact of Factuality Expression on User Trust #220 隐藏或突出:理解事实性表达对用户信任的影响
Authors: [Hyo Jin Do](https://arxiv.org/search/?searchtype=author&query=Hyo Jin Do), [Werner Geyer](https://arxiv.org/search/?searchtype=author&query=Werner Geyer) 作者:Hyo Jin Do,Werner Geyer
Large language models are known to produce outputs that are plausible but factually incorrect. To prevent people from making erroneous decisions by blindly trusting AI, researchers have explored various ways of communicating factuality estimates in AI-generated outputs to end-users. However, little is known about whether revealing content estimated to be factually incorrect influences users’ trust when compared to hiding it altogether. We tested four different ways of disclosing an AI-generated output with factuality assessments: transparent (highlights less factual content), attention (highlights factual content), opaque (removes less factual content), ambiguity (makes less factual content vague), and compared them with a baseline response without factuality information. We conducted a human subjects research (N = 148) using the strategies in question-answering scenarios. We found that the opaque and ambiguity strategies led to higher trust while maintaining perceived answer quality, compared to the other strategies. We discuss the efficacy of hiding presumably less factual content to build end-user trust. 大型语言模型已知会生成看似合理但事实不准确的输出。为了防止人们盲目信任人工智能而做出错误决策,研究人员探讨了多种在 AI 生成输出中向最终用户传达事实性估计的方法。然而,目前尚不清楚与完全隐藏不准确内容相比,揭示被估计为事实性不高的内容是否会影响用户的信任。我们测试了四种不同的事实性评估披露方式:透明(突出显示事实性较低的内容)、关注(突出显示事实性较高的内容)、不透明(删除事实性较低的内容)、模糊(使事实性较低的内容变得模糊),并将它们与不含事实性信息的基线回应进行了比较。我们在问答场景中进行了以人为受试者的研究(N = 148),使用了上述策略。研究发现,与其他策略相比,不透明和模糊策略在保持答案感知质量的同时会带来更高的信任。我们讨论了隐藏推定为事实性较低的内容以建立最终用户信任的有效性。
Subjects: Human-Computer Interaction, Artificial Intelligence 主题:人机交互,人工智能
Publish: 2025-08-09 20:45:21 UTC 发布:2025-08-09 20:45:21 UTC
#221 SQL-Exchange: Transforming SQL Queries Across Domains #221 SQL-Exchange:跨领域转换 SQL 查询
Authors: [Mohammadreza Daviran](https://arxiv.org/search/?searchtype=author&query=Mohammadreza Daviran), [Brian Lin](https://arxiv.org/search/?searchtype=author&query=Brian Lin), [Davood Rafiei](https://arxiv.org/search/?searchtype=author&query=Davood Rafiei) 作者:Mohammadreza Daviran、Brian Lin、Davood Rafiei
We introduce SQL-Exchange, a framework for mapping SQL queries across different database schemas by preserving the source query structure while adapting domain-specific elements to align with the target schema. We investigate the conditions under which such mappings are feasible and beneficial, and examine their impact on enhancing the in-context learning performance of text-to-SQL systems as a downstream task. Our comprehensive evaluation across multiple model families and benchmark datasets–assessing structural alignment with source queries, execution validity on target databases, and semantic correctness–demonstrates that SQL-Exchange is effective across a wide range of schemas and query types. Our results further show that using mapped queries as in-context examples consistently improves text-to-SQL performance over using queries from the source schema. 我们提出了 SQL-Exchange 这一框架,用于在不同数据库模式之间映射 SQL 查询:在保留源查询结构的同时,将特定领域的元素调整以与目标模式对齐。我们研究了此类映射在何种条件下是可行且有益的,并考察了它们作为下游任务对增强文本到 SQL 系统的上下文学习性能的影响。我们在多个模型家族和基准数据集上进行了全面评估——评估内容包括与源查询的结构对齐、在目标数据库上的执行有效性以及语义正确性——结果表明 SQL-Exchange 在广泛的模式和查询类型中均有效。我们的结果还显示,将映射后的查询作为上下文示例相比使用源模式的查询,能够持续提升文本到 SQL 的性能。
Subjects: Databases, Artificial Intelligence, Computation and Language 主题:数据库,人工智能,计算与语言
Publish: 2025-08-09 19:55:54 UTC 发布:2025-08-09 19:55:54 UTC
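To make the mapping idea concrete, the sketch below pairs a source query with a hand-written mapped query for a different schema, then runs two of the checks named in the abstract: structural alignment with the source query and execution validity on the target database. The schemas, queries, and the keyword-based `skeleton` comparison are all hypothetical illustrations and not the framework's actual mapping or evaluation procedure.

```python
import re
import sqlite3

# Hypothetical source-schema query and its hand-mapped target-schema counterpart.
SOURCE_QUERY = "SELECT name FROM students WHERE gpa > 3.5 ORDER BY gpa DESC LIMIT 2"
MAPPED_QUERY = "SELECT title FROM movies WHERE rating > 8.0 ORDER BY rating DESC LIMIT 2"

KEYWORDS = {"SELECT", "FROM", "WHERE", "ORDER", "BY", "DESC", "ASC",
            "LIMIT", "GROUP", "JOIN", "ON", "HAVING"}

def skeleton(query):
    """Crude structural signature: the sequence of SQL keywords in the query."""
    words = re.findall(r"[A-Za-z_]+", query)
    return [w.upper() for w in words if w.upper() in KEYWORDS]

def executes_on(conn, query):
    """Return True if the query runs without error on the given database."""
    try:
        conn.execute(query).fetchall()
        return True
    except sqlite3.Error:
        return False

# Build a tiny in-memory target database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, rating REAL)")
conn.executemany("INSERT INTO movies VALUES (?, ?)",
                 [("A", 8.5), ("B", 7.9), ("C", 9.1)])

same_structure = skeleton(SOURCE_QUERY) == skeleton(MAPPED_QUERY)
valid_on_target = executes_on(conn, MAPPED_QUERY)
source_valid_on_target = executes_on(conn, SOURCE_QUERY)
rows = conn.execute(MAPPED_QUERY).fetchall()
print(same_structure, valid_on_target, source_valid_on_target, rows)
```

The mapped query keeps the source query's clause structure but runs on the target schema, whereas the unmapped source query fails there because its table does not exist. Semantic correctness, the third check in the abstract, needs a judgment about meaning and is not covered by this sketch.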
#222 An Evolutionary Game-Theoretic Merging Decision-Making Considering Social Acceptance for Autonomous Driving #222 一种考虑社会接受度的进化博弈论合并决策方法用于自动驾驶
Authors: [Haolin Liu](https://arxiv.org/search/?searchtype=author&query=Haolin Liu), [Zijun Guo](https://arxiv.org/search/?searchtype=author&query=Zijun Guo), [Yanbo Chen](https://arxiv.org/search/?searchtype=author&query=Yanbo Chen), [Jiaqi Chen](https://arxiv.org/search/?searchtype=author&query=Jiaqi Chen), [Huilong Yu](https://arxiv.org/search/?searchtype=author&query=Huilong Yu), [Junqiang Xi](https://arxiv.org/search/?searchtype=author&query=Junqiang Xi) 作者:刘浩林、郭子峻、陈彦博、陈佳琦、于会龙、奚君强
Highway on-ramp merging poses a great challenge for autonomous vehicles (AVs), since they have to proactively interact with surrounding vehicles to enter the main road safely within limited time. However, existing decision-making algorithms fail to adequately address dynamic complexities and social acceptance of AVs, leading to suboptimal or unsafe merging decisions. To address this, we propose an evolutionary game-theoretic (EGT) merging decision-making framework, grounded in the bounded rationality of human drivers, which dynamically balances the benefits of both AVs and main-road vehicles (MVs). We formulate the cut-in decision-making process as an EGT problem with a multi-objective payoff function that reflects human-like driving preferences. By solving the replicator dynamic equation for the evolutionarily stable strategy (ESS), the optimal cut-in timing is derived, balancing efficiency, comfort, and safety for both AVs and MVs. A real-time driving style estimation algorithm is proposed to adjust the game payoff function online by observing the immediate reactions of MVs. Empirical results demonstrate that we improve the efficiency, comfort and safety of both AVs and MVs compared with existing game-theoretic and traditional planning approaches across multi-objective metrics. 高速公路匝道并入对自动驾驶车辆(AVs)而言具有很大挑战性,因为它们必须在有限时间内主动与周围车辆互动,才能安全进入主路。然而,现有的决策算法未能充分应对动态复杂性和社会接受度问题,导致并入决策次优或不安全。为此,我们提出了一个基于进化博弈论(EGT)的并入决策框架,立足于人类驾驶员的有界理性,动态平衡自动驾驶车辆与主路车辆(MVs)双方的利益。我们将切入(cut-in)决策过程表述为一个具有多目标收益函数的 EGT 问题,该收益函数反映了类人驾驶偏好。通过求解复制者动力学方程以得到进化稳定策略(ESS),推导出最优的切入时机,在效率、舒适性与安全性之间为 AVs 与 MVs 实现平衡。我们还提出了一种实时驾驶风格估计算法,通过观察 MVs 的即时反应在线调整博弈收益函数。 实证结果表明,与现有的博弈论和传统规划方法相比,我们在多目标指标上提升了自动驾驶车辆(AVs)和主路车辆(MVs)的效率、舒适性和安全性。
Subjects: Robotics, Artificial Intelligence 主题:机器人学,人工智能
Publish: 2025-08-09 19:18:28 UTC 发布:2025-08-09 19:18:28 UTC
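The step "solving the replicator dynamic equation for the evolutionarily stable strategy" can be illustrated on a generic two-strategy game. The sketch below integrates the textbook single-population replicator equation with Euler steps. The payoff matrix is invented for illustration, and the paper's own game (AVs against main-road vehicles with a multi-objective payoff) is richer than this.

```python
def replicator_equilibrium(payoff, x0=0.1, dt=0.01, steps=5000):
    """Euler-integrate dx/dt = x * (1 - x) * (f_a - f_b) for a two-strategy game.

    payoff[i][j] is the payoff to strategy i when meeting strategy j;
    x is the share of the population playing strategy 0.
    """
    x = x0
    for _ in range(steps):
        f_a = payoff[0][0] * x + payoff[0][1] * (1.0 - x)  # expected payoff of strategy 0
        f_b = payoff[1][0] * x + payoff[1][1] * (1.0 - x)  # expected payoff of strategy 1
        x += dt * x * (1.0 - x) * (f_a - f_b)
    return x

# Hypothetical payoffs: strategy 0 = "cut in now", strategy 1 = "wait".
PAYOFF = [[0.0, 3.0],
          [1.0, 2.0]]
x_star = replicator_equilibrium(PAYOFF)
print(x_star)
```

For this matrix the payoff difference is $1 - 2x$, so the population share settles at the interior rest point $x = 0.5$ from any interior starting value, which is the mixed evolutionarily stable strategy of this toy game.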
#223 Model Predictive Control for Crowd Navigation via Learning-Based Trajectory Prediction #223 通过基于学习的轨迹预测进行人群导航的模型预测控制
Authors: [Mohamed Parvez Aslam](https://arxiv.org/search/?searchtype=author&query=Mohamed Parvez Aslam), [Bojan Derajic](https://arxiv.org/search/?searchtype=author&query=Bojan Derajic), [Mohamed-Khalil Bouzidi](https://arxiv.org/search/?searchtype=author&query=Mohamed-Khalil Bouzidi), [Sebastian Bernhard](https://arxiv.org/search/?searchtype=author&query=Sebastian Bernhard), [Jan Oliver Ringert](https://arxiv.org/search/?searchtype=author&query=Jan Oliver Ringert) 作者:Mohamed Parvez Aslam、Bojan Derajic、Mohamed-Khalil Bouzidi、Sebastian Bernhard、Jan Oliver Ringert
Safe navigation in pedestrian-rich environments remains a key challenge for autonomous robots. This work evaluates the integration of a deep learning-based Social-Implicit (SI) pedestrian trajectory predictor within a Model Predictive Control (MPC) framework on the physical Continental Corriere robot. Tested across varied pedestrian densities, the SI-MPC system is compared to a traditional Constant Velocity (CV) model in both open-loop prediction and closed-loop navigation. Results show that SI improves trajectory prediction - reducing errors by up to 76% in low-density settings - and enhances safety and motion smoothness in crowded scenes. Moreover, real-world deployment reveals discrepancies between open-loop metrics and closed-loop performance, as the SI model yields broader, more cautious predictions. These findings emphasize the importance of system-level evaluation and highlight the SI-MPC framework’s promise for safer, more adaptive navigation in dynamic, human-populated environments. 在人群密集的环境中实现安全导航仍然是自动化机器人面临的关键挑战。本工作评估了将基于深度学习的 Social-Implicit(SI)行人轨迹预测器集成到模型预测控制(MPC)框架中,并在实体 Continental Corriere 机器人上进行了测试。在不同的行人密度下,SI-MPC 系统在开环预测和闭环导航中均与传统的匀速(CV)模型进行了比较。结果表明,SI 提高了轨迹预测精度——在低密度环境中误差最多减少了 76%——并在拥挤场景中提升了安全性和运动平顺性。此外,真实世界部署显示了开环指标与闭环性能之间的差异,因为 SI 模型产生了更宽、更谨慎的预测。这些发现强调了系统级评估的重要性,并凸显了 SI-MPC 框架在动态有人环境中实现更安全、更自适应导航的前景。
Subjects: Robotics, Artificial Intelligence, Systems and Control 学科:机器人学、人工智能、系统与控制
Publish: 2025-08-09 19:11:28 UTC 发布:2025-08-09 19:11:28 UTC
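The Constant Velocity (CV) baseline that the SI predictor is compared against can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the 2D position format, and the average-displacement-error metric are assumptions (the abstract does not name the error metric it reports).

```python
import math

def constant_velocity_predict(history, dt, horizon):
    """Extrapolate the last observed velocity for `horizon` future steps."""
    (x0, y0), (x1, y1) = history[-2], history[-1]
    vx, vy = (x1 - x0) / dt, (y1 - y0) / dt
    return [(x1 + vx * dt * k, y1 + vy * dt * k) for k in range(1, horizon + 1)]

def average_displacement_error(pred, truth):
    """Mean Euclidean distance between predicted and true positions (assumed metric)."""
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(pred)

# Hypothetical pedestrian track: two observed positions, one second apart.
history = [(0.0, 0.0), (1.0, 0.5)]
pred = constant_velocity_predict(history, dt=1.0, horizon=2)
ade = average_displacement_error(pred, [(2.0, 1.0), (3.0, 2.5)])
```

A learned predictor would replace `constant_velocity_predict` while the open-loop metric stays the same, which is why the abstract can compare the two models directly.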
#224 Surgical Knowledge Rewrite in Compact LLMs: An 'Unlearn-then-Learn' Strategy with (IA3) for Localized Factual Modulation and Catastrophic Forgetting Mitigation #224 紧凑型 LLMs 中的手术式精准知识重写:一种结合 (IA3) 的“先忘后学”策略,用于局部事实调节和灾难性遗忘缓解
Author: [Stanley Ngugi](https://arxiv.org/search/?searchtype=author&query=Stanley Ngugi) 作者:Stanley Ngugi
Large Language Models (LLMs) struggle with dynamic knowledge updates, especially when new information conflicts with deeply embedded facts. Such conflicting factual edits often lead to two critical issues: resistance to adopting the new fact and severe catastrophic forgetting of unrelated knowledge. This paper introduces and evaluates a novel “unlearn-then-learn” strategy for precise knowledge editing in LLMs, leveraging the parameter-efficient fine-tuning (PEFT) technique, Infused Adapter by Inhibiting and Amplifying Inner Activations (IA3). Crucially, this two-stage approach is powered by an initial circuit localization phase that identifies and targets the specific internal components responsible for encoding the conflicting fact. Through a rigorous experimental methodology on microsoft/Phi-3-mini-4k-instruct, we demonstrate that this mechanistically informed two-stage approach achieves near-perfect accuracy (98.50%) for the new, modulated fact while simultaneously effectively suppressing the original conflicting fact (96.00% forget rate). Critically, our strategy exhibits unprecedented localization (72.00% F_control accuracy), dramatically mitigating catastrophic forgetting observed in direct fine-tuning approaches (which showed as low as ~20% F_control accuracy), a direct benefit of our targeted interpretability-guided intervention. Furthermore, qualitative analysis reveals a nuanced mechanism of “soft forgetting,” where original knowledge is suppressed from default retrieval but remains latent and conditionally accessible, enhancing model safety and control. These findings represent a significant advancement towards precise, localized, and safe knowledge management in compact LLMs. 
大型语言模型(LLMs)在动态知识更新方面存在困难,特别是当新信息与深度嵌入的事实相冲突时。此类冲突事实的编辑常常导致两个关键问题:对采纳新事实的抵抗,以及对无关知识的严重灾难性遗忘。本文提出并评估了一种新颖的“先忘后学”策略,用于在 LLMs 中进行精确的知识编辑,该策略利用参数高效微调(PEFT)技术——通过抑制和放大内部激活来注入适配器(Infused Adapter by Inhibiting and Amplifying Inner Activations, IA3 )。关键的是,这一两阶段方法以初始的电路定位阶段为驱动,该阶段识别并针对编码冲突事实的特定内部组件。通过在 microsoft/Phi-3-mini-4k-instruct 上的严格实验方法,我们证明了这一基于机制的两阶段方法在修正后事实上达到了近乎完美的准确率(98.50%),同时有效抑制了原先的冲突事实(96.00% 忘却率)。 关键是,我们的策略展现出前所未有的局部化效果(72.00% F_control 准确率),显著减轻了直接微调方法中观察到的灾难性遗忘(其 F_control 准确率低至约 20%),这是我们基于可解释性指导的有针对性干预的直接益处。此外,定性分析揭示了一种微妙的“软遗忘”机制,即原有知识从默认检索中被抑制但仍然处于潜伏且条件可访问的状态,从而增强了模型的安全性与可控性。这些发现代表了在紧凑型 LLMs 中实现精确、局部化和安全知识管理方面的重大进展。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-09 18:48:25 UTC 发布:2025-08-09 18:48:25 UTC
#225 SEADialogues: A Multilingual Culturally Grounded Multi-turn Dialogue Dataset on Southeast Asian Languages #225 SEADialogues:一个以文化为基础的东南亚语言多轮多语种对话数据集
Authors: [Muhammad Dehan Al Kautsar](https://arxiv.org/search/?searchtype=author&query=Muhammad Dehan Al Kautsar), [Aswin Candra](https://arxiv.org/search/?searchtype=author&query=Aswin Candra), [Muhammad Alif Al Hakim](https://arxiv.org/search/?searchtype=author&query=Muhammad Alif Al Hakim), [Maxalmina Satria Kahfi](https://arxiv.org/search/?searchtype=author&query=Maxalmina Satria Kahfi), [Fajri Koto](https://arxiv.org/search/?searchtype=author&query=Fajri Koto), [Alham Fikri Aji](https://arxiv.org/search/?searchtype=author&query=Alham Fikri Aji), [Peerat Limkonchotiwat](https://arxiv.org/search/?searchtype=author&query=Peerat Limkonchotiwat), [Ekapol Chuangsuwanich](https://arxiv.org/search/?searchtype=author&query=Ekapol Chuangsuwanich), [Genta Indra Winata](https://arxiv.org/search/?searchtype=author&query=Genta Indra Winata) 作者:Muhammad Dehan Al Kautsar、Aswin Candra、Muhammad Alif Al Hakim、Maxalmina Satria Kahfi、Fajri Koto、Alham Fikri Aji、Peerat Limkonchotiwat、Ekapol Chuangsuwanich、Genta Indra Winata
Although numerous datasets have been developed to support dialogue systems, most existing chit-chat datasets overlook the cultural nuances inherent in natural human conversations. To address this gap, we introduce SEADialogues, a culturally grounded dialogue dataset centered on Southeast Asia, a region with over 700 million people and immense cultural diversity. Our dataset features dialogues in eight languages from six Southeast Asian countries, many of which are low-resource despite having sizable speaker populations. To enhance cultural relevance and personalization, each dialogue includes persona attributes and two culturally grounded topics that reflect everyday life in the respective communities. Furthermore, we release a multi-turn dialogue dataset to advance research on culturally aware and human-centric large language models, including conversational dialogue agents. 尽管已经开发了许多用于支持对话系统的数据集,但大多数现有的闲聊数据集忽视了自然人类对话中固有的文化细微差别。为填补这一空白,我们推出了 SEADialogues,这是一个以东南亚为中心、具有文化根基的对话数据集。东南亚地区人口超过 7 亿,文化极为多样。我们的数据集包含来自东南亚六个国家的八种语言的对话,其中许多语言虽然拥有大量使用者却属于低资源语言。为了增强文化相关性和个性化,每段对话都包括人物属性和两个反映各自社区日常生活的文化主题。此外,我们还发布了一个多轮对话数据集,以推动对具有文化意识和以人为中心的大型语言模型(包括会话型对话代理)的研究。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-09 18:22:35 UTC 发布:2025-08-09 18:22:35 UTC
#226 Membership and Memorization in LLM Knowledge Distillation #226 LLM 知识蒸馏中的成员资格与记忆
Authors: [Ziqi Zhang](https://arxiv.org/search/?searchtype=author&query=Ziqi Zhang), [Ali Shahin Shamsabadi](https://arxiv.org/search/?searchtype=author&query=Ali Shahin Shamsabadi), [Hanxiao Lu](https://arxiv.org/search/?searchtype=author&query=Hanxiao Lu), [Yifeng Cai](https://arxiv.org/search/?searchtype=author&query=Yifeng Cai), [Hamed Haddadi](https://arxiv.org/search/?searchtype=author&query=Hamed Haddadi) 作者:Ziqi Zhang、Ali Shahin Shamsabadi、Hanxiao Lu、Yifeng Cai、Hamed Haddadi
Recent advances in Knowledge Distillation (KD) aim to mitigate the high computational demands of Large Language Models (LLMs) by transferring knowledge from a large “teacher” to a smaller “student” model. However, students may inherit the teacher’s privacy when the teacher is trained on private data. In this work, we systematically characterize and investigate membership and memorization privacy risks inherent in six LLM KD techniques. Using instruction-tuning settings that span seven NLP tasks, together with three teacher model families (GPT-2, LLAMA-2, and OPT), and various size student models, we demonstrate that all existing LLM KD approaches carry membership and memorization privacy risks from the teacher to its students. However, the extent of privacy risks varies across different KD techniques. We systematically analyse how key LLM KD components (KD objective functions, student training data and NLP tasks) impact such privacy risks. We also demonstrate a significant disagreement between memorization and membership privacy risks of LLM KD techniques. Finally, we characterize per-block privacy risk and demonstrate that the privacy risk varies across different blocks by a large margin. 近年来,知识蒸馏(KD)的进展旨在通过将知识从大型“教师”模型转移到较小的“学生”模型来缓解大型语言模型(LLMs)对计算资源的高需求。然而,当教师在私有数据上训练时,学生可能会继承教师的隐私。在本研究中,我们系统地表征并调查了六种 LLM KD 技术中固有的成员资格(membership)和记忆(memorization)隐私风险。我们在覆盖七种 NLP 任务的指令微调设置下,结合三类教师模型家族(GPT-2、LLAMA-2 和 OPT)以及不同规模的学生模型,证明了所有现有的 LLM KD 方法都会将来自教师的成员资格和记忆隐私风险传递给其学生。然而,不同 KD 技术的隐私风险程度存在差异。我们系统地分析了关键的 LLM KD 组成部分(KD 目标函数、学生训练数据和 NLP 任务)如何影响此类隐私风险。我们还展示了 LLM KD 技术在记忆隐私风险与成员资格隐私风险之间存在显著的不一致。 最后,我们对每个块的隐私风险进行了表征,并证明不同块之间的隐私风险存在很大差异。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-09 17:40:41 UTC 发布时间:2025-08-09 17:40:41 UTC
#227 ReasonRank: Empowering Passage Ranking with Strong Reasoning Ability #227 ReasonRank:以强推理能力增强段落排序
Authors: [Wenhan Liu](https://arxiv.org/search/?searchtype=author&query=Wenhan Liu), [Xinyu Ma](https://arxiv.org/search/?searchtype=author&query=Xinyu Ma), [Weiwei Sun](https://arxiv.org/search/?searchtype=author&query=Weiwei Sun), [Yutao Zhu](https://arxiv.org/search/?searchtype=author&query=Yutao Zhu), [Yuchen Li](https://arxiv.org/search/?searchtype=author&query=Yuchen Li), [Dawei Yin](https://arxiv.org/search/?searchtype=author&query=Dawei Yin), [Zhicheng Dou](https://arxiv.org/search/?searchtype=author&query=Zhicheng Dou) 作者:Wenhan Liu、Xinyu Ma、Weiwei Sun、Yutao Zhu、Yuchen Li、Dawei Yin、Zhicheng Dou
Large Language Model (LLM) based listwise ranking has shown superior performance in many passage ranking tasks. With the development of Large Reasoning Models, many studies have demonstrated that step-by-step reasoning during test-time helps improve listwise ranking performance. However, due to the scarcity of reasoning-intensive training data, existing rerankers perform poorly in many complex ranking scenarios and the ranking ability of reasoning-intensive rerankers remains largely underdeveloped. In this paper, we first propose an automated reasoning-intensive training data synthesis framework, which sources training queries and passages from diverse domains and applies DeepSeek-R1 to generate high-quality training labels. A self-consistency data filtering mechanism is designed to ensure the data quality. To empower the listwise reranker with strong reasoning ability, we further propose a two-stage post-training approach, which includes a cold-start supervised fine-tuning (SFT) stage for reasoning pattern learning and a reinforcement learning (RL) stage for further ranking ability enhancement. During the RL stage, based on the nature of listwise ranking, we design a multi-view ranking reward, which is more effective than a ranking metric-based reward. Extensive experiments demonstrate that our trained reasoning-intensive reranker ReasonRank outperforms existing baselines significantly and also achieves much lower latency than pointwise reranker Rank1. Through further experiments, our ReasonRank has achieved state-of-the-art (SOTA) performance of 40.6 on the BRIGHT leaderboard (https://brightbenchmark.github.io/). Our codes are available at https://github.com/8421BCD/ReasonRank.
基于大型语言模型(LLM)的列表式排序在许多段落排序任务中表现出色。随着大型推理模型的发展,许多研究表明在测试时进行逐步推理有助于提高列表式排序性能。然而,由于推理密集型训练数据稀缺,现有的重排序器在许多复杂排序场景中表现不佳,推理密集型重排序器的排序能力在很大程度上仍未充分发展。在本文中,我们首先提出了一个自动化的推理密集型训练数据合成框架,该框架从多领域获取训练查询和段落,并使用 DeepSeek-R1 生成高质量的训练标签。我们设计了一个自洽性数据过滤机制以确保数据质量。为使列表式重排序器具备强大的推理能力,我们进一步提出了一个两阶段的后训练方法,包括用于学习推理模式的冷启动监督微调(SFT)阶段和用于进一步增强排序能力的强化学习(RL)阶段。 在强化学习阶段,基于列表式排序的特性,我们设计了一种多视角排序奖励,这比基于排序指标的奖励更有效。大量实验表明,我们训练的推理密集型重排序器 ReasonRank 显著优于现有基线,并且比逐点重排序器 Rank1 具有更低的延迟。通过进一步的实验,我们的 ReasonRank 在 BRIGHT 排行榜(https://brightbenchmark.github.io/)上取得了 40.6 的最新最优(SOTA)表现。我们的代码可在 https://github.com/8421BCD/ReasonRank 获取。
Subjects: Information Retrieval, Artificial Intelligence, Computation and Language, Machine Learning 主题:信息检索、人工智能、计算与语言、机器学习
Publish: 2025-08-09 17:26:18 UTC 发布:2025-08-09 17:26:18 UTC
#228 Whisfusion: Parallel ASR Decoding via a Diffusion Transformer #228 Whisfusion:通过扩散变换器实现并行 ASR 解码
Authors: [Taeyoun Kwon](https://arxiv.org/search/?searchtype=author&query=Taeyoun Kwon), [Junhyuk Ahn](https://arxiv.org/search/?searchtype=author&query=Junhyuk Ahn), [Taegeun Yun](https://arxiv.org/search/?searchtype=author&query=Taegeun Yun), [Heeju Jwa](https://arxiv.org/search/?searchtype=author&query=Heeju Jwa), [Yoonchae Choi](https://arxiv.org/search/?searchtype=author&query=Yoonchae Choi), [Siwon Park](https://arxiv.org/search/?searchtype=author&query=Siwon Park), [Nam-Joon Kim](https://arxiv.org/search/?searchtype=author&query=Nam-Joon Kim), [Jangchan Kim](https://arxiv.org/search/?searchtype=author&query=Jangchan Kim), [Hyun Gon Ryu](https://arxiv.org/search/?searchtype=author&query=Hyun Gon Ryu), [Hyuk-Jae Lee](https://arxiv.org/search/?searchtype=author&query=Hyuk-Jae Lee) 作者:Taeyoun Kwon、Junhyuk Ahn、Taegeun Yun、Heeju Jwa、Yoonchae Choi、Siwon Park、Nam-Joon Kim、Jangchan Kim、Hyun Gon Ryu、Hyuk-Jae Lee
Fast Automatic Speech Recognition (ASR) is critical for latency-sensitive applications such as real-time captioning and meeting transcription. However, truly parallel ASR decoding remains challenging due to the sequential nature of autoregressive (AR) decoders and the context limitations of non-autoregressive (NAR) methods. While modern ASR encoders can process up to 30 seconds of audio at once, AR decoders still generate tokens sequentially, creating a latency bottleneck. We propose Whisfusion, the first framework to fuse a pre-trained Whisper encoder with a text diffusion decoder. This NAR architecture resolves the AR latency bottleneck by processing the entire acoustic context in parallel at every decoding step. A lightweight cross-attention adapter trained via parameter-efficient fine-tuning (PEFT) bridges the two modalities. We also introduce a batch-parallel, multi-step decoding strategy that improves accuracy by increasing the number of candidates with minimal impact on speed. Fine-tuned solely on LibriSpeech (960h), Whisfusion achieves a lower WER than Whisper-tiny (8.3% vs. 9.7%), and offers comparable latency on short audio. For longer utterances (>20s), it is up to 2.6x faster than the AR baseline, establishing a new, efficient operating point for long-form ASR. The implementation and training scripts are available at https://github.com/taeyoun811/Whisfusion. 快速的自动语音识别(ASR)对于实时字幕和会议转录等对延迟敏感的应用至关重要。然而,由于自回归(AR)解码器的序列性以及非自回归(NAR)方法的上下文限制,真正并行的 ASR 解码仍然具有挑战性。尽管现代 ASR 编码器可以一次处理多达 30 秒的音频,AR 解码器仍然需要按顺序生成标记,造成延迟瓶颈。我们提出了 Whisfusion,这是首个将预训练的 Whisper 编码器与文本扩散解码器融合的框架。该 NAR 架构通过在每个解码步骤并行处理整个声学上下文来解决 AR 的延迟瓶颈。通过参数高效微调(PEFT)训练的轻量级跨注意力适配器在两种模态之间架起桥梁。我们还引入了一种批并行、多步解码策略,通过增加候选项数量在对速度影响极小的情况下提升准确性。仅在 LibriSpeech(960 小时)上微调后,Whisfusion 实现了比 Whisper-tiny 更低的词错误率(8.3% 对 9.7%),并在短音频上提供了相当的延迟表现。 对于较长的语句(>20 秒),其速度比自回归基线快多达 2.6 倍,为长篇自动语音识别确立了一个新的高效运行点。实现和训练脚本可在 https://github.com/taeyoun811/Whisfusion 获取。
Subjects: Sound, Artificial Intelligence, Machine Learning 主题:声音、人工智能、机器学习
Publish: 2025-08-09 17:20:54 UTC 发布:2025-08-09 17:20:54 UTC
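The Whisfusion results are reported as word error rate (WER). As a reminder of what that number measures, here is a minimal sketch of the standard definition (word-level edit distance divided by reference length). The function name and the whitespace tokenization are illustrative assumptions; published evaluations normally apply text normalization first.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# One substitution (sat -> sit) and one deletion (the) over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```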
#229 Balancing Privacy and Efficiency: Music Information Retrieval via Additive Homomorphic Encryption #229 在隐私与效率之间取得平衡:通过加法同态加密进行音乐信息检索
Authors: [William Zerong Wang](https://arxiv.org/search/?searchtype=author&query=William Zerong Wang), [Dongfang Zhao](https://arxiv.org/search/?searchtype=author&query=Dongfang Zhao) 作者:William Zerong Wang,Dongfang Zhao
In the era of generative AI, ensuring the privacy of music data presents unique challenges: unlike static artworks such as images, music data is inherently temporal and multimodal, and it is sampled, transformed, and remixed at an unprecedented scale. These characteristics make its core vector embeddings, i.e., the numerical representations of the music, highly susceptible to being learned, misused, or even stolen by models without accessing the original audio files. Traditional methods like copyright licensing and digital watermarking offer limited protection for these abstract mathematical representations, thus necessitating a stronger, e.g., cryptographic, approach to safeguarding the embeddings themselves. Standard encryption schemes, such as AES, render data unintelligible for computation, making such searches impossible. While Fully Homomorphic Encryption (FHE) provides a plausible solution by allowing arbitrary computations on ciphertexts, its substantial performance overhead remains impractical for large-scale vector similarity searches. Given this trade-off, we propose a more practical approach using Additive Homomorphic Encryption (AHE) for vector similarity search. The primary contributions of this paper are threefold: we analyze threat models unique to music information retrieval systems; we provide a theoretical analysis and propose an efficient AHE-based solution through inner products of music embeddings to deliver privacy-preserving similarity search; and finally, we demonstrate the efficiency and practicality of the proposed approach through empirical evaluation and comparison to FHE schemes on real-world MP3 files.
在生成式人工智能时代,确保音乐数据的隐私面临独特挑战:与图像等静态艺术品不同,音乐数据本质上具有时间性和多模态性,并且以空前的规模被采样、转换和混合重制。这些特性使其核心向量嵌入,即音乐的数值表示,极易被模型在无需访问原始音频文件的情况下学习、滥用甚至窃取。像版权许可和数字水印这样的传统方法对这些抽象的数学表示提供的保护有限,因此需要更强的、例如密码学的方式来保护嵌入本身。标准的加密方案(如 AES)会使数据在计算时变得不可理解,从而使此类检索变得不可能。虽然全同态加密(FHE)通过允许对密文进行任意计算提供了一个可行的解决方案,但其巨大的性能开销在大规模向量相似性检索中仍不切实际。鉴于这种权衡,我们提出了一种更实用的方法,使用加法同态加密(AHE)来进行向量相似性检索。 本文的主要贡献有三点:我们分析了音乐信息检索系统特有的威胁模型;我们提供了理论分析并提出了一种基于加法同态加密(AHE)的高效解决方案,通过音乐嵌入向量的内积实现隐私保护的相似性搜索;最后,我们通过在真实 MP3 文件上的实证评估并与全同态加密(FHE)方案进行比较,展示了所提出方法的高效性和实用性。
Subjects: Databases, Artificial Intelligence, Cryptography and Security 主题:数据库、人工智能、密码学与安全
Publish: 2025-08-09 17:00:34 UTC 发布:2025-08-09 17:00:34 UTC
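To make the "inner products under additive homomorphic encryption" idea concrete, the sketch below uses a toy Paillier cryptosystem, a well-known AHE scheme. This is an assumption for illustration: the abstract does not say which AHE scheme the authors use, the primes are far too small for real security, and embeddings are assumed to be quantized to small non-negative integers.

```python
import math
import random

def keygen(p=1009, q=1013):
    """Toy Paillier key pair; these primes are illustrative and insecure."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

def encrypted_dot(n, enc_vec, query):
    """Return Enc(sum x_i * q_i) given ciphertexts Enc(x_i) and plaintext weights q_i."""
    n2 = n * n
    acc = 1  # 1 is a valid encryption of 0
    for c, w in zip(enc_vec, query):
        acc = (acc * pow(c, w, n2)) % n2  # Enc(x)^w = Enc(w * x); products add plaintexts
    return acc

n, priv = keygen()
enc_embedding = [encrypt(n, x) for x in [3, 1, 4, 1]]   # stored, encrypted embedding
score = decrypt(n, priv, encrypted_dot(n, enc_embedding, [2, 7, 1, 8]))
```

The server holding `enc_embedding` never sees the plaintext vector; only the key holder can decrypt the similarity score (here 3·2 + 1·7 + 4·1 + 1·8 = 25).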
#230 Trustworthy Medical Imaging with Large Language Models: A Study of Hallucinations Across Modalities #230 可信的医学影像与大型语言模型:跨模态幻觉研究
Authors: [Anindya Bijoy Das](https://arxiv.org/search/?searchtype=author&query=Anindya Bijoy Das), [Shahnewaz Karim Sakib](https://arxiv.org/search/?searchtype=author&query=Shahnewaz Karim Sakib), [Shibbir Ahmed](https://arxiv.org/search/?searchtype=author&query=Shibbir Ahmed) 作者:Anindya Bijoy Das,Shahnewaz Karim Sakib,Shibbir Ahmed
Large Language Models (LLMs) are increasingly applied to medical imaging tasks, including image interpretation and synthetic image generation. However, these models often produce hallucinations, which are confident but incorrect outputs that can mislead clinical decisions. This study examines hallucinations in two directions: image to text, where LLMs generate reports from X-ray, CT, or MRI scans, and text to image, where models create medical images from clinical prompts. We analyze errors such as factual inconsistencies and anatomical inaccuracies, evaluating outputs using expert informed criteria across imaging modalities. Our findings reveal common patterns of hallucination in both interpretive and generative tasks, with implications for clinical reliability. We also discuss factors contributing to these failures, including model architecture and training data. By systematically studying both image understanding and generation, this work provides insights into improving the safety and trustworthiness of LLM driven medical imaging systems. 大型语言模型 (LLMs) 正越来越多地应用于医学影像任务,包括图像解读和合成图像生成。然而,这些模型经常产生幻觉,即自信但不正确的输出,可能误导临床决策。本研究从两个方向考察幻觉现象:从图像到文本,即 LLMs 根据 X 光、CT 或 MRI 扫描生成报告;以及从文本到图像,即模型根据临床提示生成医学图像。我们分析了事实不一致和解剖学不准确等错误,并在不同影像模态上使用专家制定的标准评估输出。研究结果揭示了解读任务和生成任务中常见的幻觉模式,对临床可靠性具有重要影响。我们还讨论了导致这些失败的因素,包括模型架构和训练数据。通过系统地研究图像理解与生成,本研究为提高基于 LLM 的医学影像系统的安全性与可信性提供了见解。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题:计算机视觉与模式识别、人工智能
Publish: 2025-08-09 16:03:46 UTC 发布:2025-08-09 16:03:46 UTC
#231 From Imitation to Optimization: A Comparative Study of Offline Learning for Autonomous Driving #231 从模仿到优化:自动驾驶离线学习的比较研究
Author: [Antonio Guillen-Perez](https://arxiv.org/search/?searchtype=author&query=Antonio Guillen-Perez) 作者:Antonio Guillen-Perez
Learning robust driving policies from large-scale, real-world datasets is a central challenge in autonomous driving, as online data collection is often unsafe and impractical. While Behavioral Cloning (BC) offers a straightforward approach to imitation learning, policies trained with BC are notoriously brittle and suffer from compounding errors in closed-loop execution. This work presents a comprehensive pipeline and a comparative study to address this limitation. We first develop a series of increasingly sophisticated BC baselines, culminating in a Transformer-based model that operates on a structured, entity-centric state representation. While this model achieves low imitation loss, we show that it still fails in long-horizon simulations. We then demonstrate that by applying a state-of-the-art Offline Reinforcement Learning algorithm, Conservative Q-Learning (CQL), to the same data and architecture, we can learn a significantly more robust policy. Using a carefully engineered reward function, the CQL agent learns a conservative value function that enables it to recover from minor errors and avoid out-of-distribution states. In a large-scale evaluation on 1,000 unseen scenarios from the Waymo Open Motion Dataset, our final CQL agent achieves a 3.2x higher success rate and a 7.4x lower collision rate than the strongest BC baseline, proving that an offline RL approach is critical for learning robust, long-horizon driving policies from static expert data. 
从大规模真实世界数据中学习鲁棒的驾驶策略是自动驾驶领域的核心挑战,因为在线数据收集通常不安全且不切实际。虽然行为克隆(Behavioral Cloning,BC)为模仿学习提供了一种直接的方法,但用 BC 训练的策略以脆弱著称,在闭环执行中会遭受累积误差。本文提出了一个全面的流程并进行比较研究以应对这一限制。我们首先开发了一系列日益复杂的 BC 基线模型,最终提出了一个基于 Transformer 的模型,该模型在结构化、以实体为中心的状态表示上运行。尽管该模型取得了较低的模仿损失,但我们表明它在长时程仿真中仍会失败。随后我们展示了,通过将最先进的离线强化学习算法——保守 Q 学习(Conservative Q-Learning,CQL)——应用于相同的数据和架构,可以学得一个显著更为鲁棒的策略。借助精心设计的奖励函数,CQL 智能体学会了一个保守的价值函数,使其能够从小的错误中恢复并避免分布外状态。 在对 Waymo Open Motion Dataset 中 1,000 个未见场景的大规模评估中,我们最终的 CQL 代理比最强的 BC 基线取得了 3.2 倍更高的成功率和 7.4 倍更低的碰撞率,证明了离线强化学习方法对于从静态专家数据中学习稳健的长时程驾驶策略至关重要。
Subjects: Machine Learning, Artificial Intelligence, Robotics, Systems and Control 学科:机器学习、人工智能、机器人学、系统与控制
Publish: 2025-08-09 16:03:10 UTC 发布时间:2025-08-09 16:03:10 UTC
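The conservative value function mentioned in the abstract comes from the CQL regularizer, which pushes Q-values down on all actions and back up on the action actually present in the dataset. The sketch below shows the discrete-action form of that term for a single state. It is an illustration only: the driving agent in the paper may use a different action parameterization, and the names `cql_penalty`, `cql_loss`, and the weight `alpha` are assumptions.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def cql_penalty(q_values, dataset_action):
    """Conservative term for one state: logsumexp_a Q(s, a) - Q(s, a_data)."""
    return logsumexp(q_values) - q_values[dataset_action]

def cql_loss(q_values, dataset_action, td_target, alpha=1.0):
    """Squared TD error plus the weighted conservative penalty."""
    td_error = q_values[dataset_action] - td_target
    return 0.5 * td_error ** 2 + alpha * cql_penalty(q_values, dataset_action)

# The penalty is small when the dataset action already has the highest Q-value,
# and large when an out-of-distribution action is overestimated.
in_distribution = cql_penalty([5.0, 0.0, 0.0], dataset_action=0)
out_of_distribution = cql_penalty([0.0, 5.0, 0.0], dataset_action=0)
```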
#232 Making Effective Decisions: Machine Learning and the Ecogame in 1970 #232 做出有效决策:1970 年的机器学习与生态博弈
Author: [Catherine Mason](https://arxiv.org/search/?searchtype=author&query=Catherine Mason) 作者:Catherine Mason
This paper considers Ecogame, an innovative art project of 1970, whose creators believed in a positive vision of a technological future; an understanding, posited on cybernetics, of a future that could be participatory via digital means, and therefore more democratised. Using simulation and early machine learning techniques over a live network, Ecogame combined the power of visual art with cybernetic concepts of adaptation, feedback, and control to propose that behaviour had implications for the total system. It provides an historical precedent for contemporary AI-driven art about using AI in a more human-centred way. 本文讨论了 Ecogame——一项革新的 1970 年艺术项目,其创作者信奉对技术未来的积极设想;这种设想以控制论为基础,认为未来可以通过数字手段实现参与性,从而更加民主化。Ecogame 在一个实时网络上使用模拟和早期机器学习技术,结合视觉艺术的力量与控制论中关于适应、反馈与控制的概念,提出行为会对整个系统产生影响的观点。它为当代以更以人为本方式使用人工智能的 AI 驱动艺术提供了历史先例。
Subjects: Computers and Society, Artificial Intelligence 主题:计算机与社会,人工智能
Publish: 2025-08-09 15:51:26 UTC 发布:2025-08-09 15:51:26 UTC
#233 TurboBias: Universal ASR Context-Biasing powered by GPU-accelerated Phrase-Boosting Tree #233 TurboBias:由 GPU 加速短语提升树驱动的通用 ASR 上下文偏置
Authors: [Andrei Andrusenko](https://arxiv.org/search/?searchtype=author&query=Andrei Andrusenko), [Vladimir Bataev](https://arxiv.org/search/?searchtype=author&query=Vladimir Bataev), [Lilit Grigoryan](https://arxiv.org/search/?searchtype=author&query=Lilit Grigoryan), [Vitaly Lavrukhin](https://arxiv.org/search/?searchtype=author&query=Vitaly Lavrukhin), [Boris Ginsburg](https://arxiv.org/search/?searchtype=author&query=Boris Ginsburg) 作者:Andrei Andrusenko、Vladimir Bataev、Lilit Grigoryan、Vitaly Lavrukhin、Boris Ginsburg
Recognizing specific key phrases is an essential task for contextualized Automatic Speech Recognition (ASR). However, most existing context-biasing approaches have limitations associated with the necessity of additional model training, significantly slow down the decoding process, or constrain the choice of the ASR system type. This paper proposes a universal ASR context-biasing framework that supports all major types: CTC, Transducers, and Attention Encoder-Decoder models. The framework is based on a GPU-accelerated word boosting tree, which enables it to be used in shallow fusion mode for greedy and beam search decoding without noticeable speed degradation, even with a vast number of key phrases (up to 20K items). The obtained results showed high efficiency of the proposed method, surpassing the considered open-source context-biasing approaches in accuracy and decoding speed. Our context-biasing framework is open-sourced as a part of the NeMo toolkit. 识别特定关键短语是具上下文的自动语音识别(ASR)的一项重要任务。然而,大多数现有的上下文偏置方法存在需要额外模型训练、显著降低解码速度或限制 ASR 系统类型选择的局限性。本文提出了一个通用的 ASR 上下文偏置框架,支持所有主要类型:CTC、Transducer 和注意力编码器-解码器模型。该框架基于 GPU 加速的词汇提升树,使其能够在浅融合模式下用于贪婪和束搜索解码,即使面对大量关键短语(最多达 2 万条目)也不会出现明显的速度下降。所得结果展示了该方法的高效性,在准确性和解码速度上均优于所考虑的开源上下文偏置方法。我们的上下文偏置框架已作为 NeMo 工具包的一部分开源。
Subjects: Audio and Speech Processing, Artificial Intelligence, Computation and Language, Sound 主题:音频与语音处理、人工智能、计算与语言、声音
Publish: 2025-08-09 15:27:07 UTC 发布时间:2025-08-09 15:27:07 UTC
#234 Neural Channel Knowledge Map Assisted Scheduling Optimization of Active IRSs in Multi-User Systems #234 神经信道知识图辅助的多用户系统中主动 IRS 调度优化
Authors: [Xintong Chen](https://arxiv.org/search/?searchtype=author&query=Xintong Chen), [Zhenyu Jiang](https://arxiv.org/search/?searchtype=author&query=Zhenyu Jiang), [Jiangbin Lyu](https://arxiv.org/search/?searchtype=author&query=Jiangbin Lyu), [Liqun Fu](https://arxiv.org/search/?searchtype=author&query=Liqun Fu) 作者:Xintong Chen、Zhenyu Jiang、Jiangbin Lyu、Liqun Fu
Intelligent Reflecting Surfaces (IRSs) have potential for significant performance gains in next-generation wireless networks but face key challenges, notably severe double-pathloss and complex multi-user scheduling due to hardware constraints. Active IRSs partially address pathloss but still require efficient scheduling in cell-level multi-IRS multi-user systems, whereby the overhead/delay of channel state acquisition and the scheduling complexity both rise dramatically as the user density and channel dimensions increase. Motivated by these challenges, this paper proposes a novel scheduling framework based on neural Channel Knowledge Map (CKM), designing Transformer-based deep neural networks (DNNs) to predict ergodic spectral efficiency (SE) from historical channel/throughput measurements tagged with user positions. Specifically, two cascaded networks, LPS-Net and SE-Net, are designed to predict link power statistics (LPS) and ergodic SE accurately. We further propose a low-complexity Stable Matching-Iterative Balancing (SM-IB) scheduling algorithm. Numerical evaluations verify that the proposed neural CKM significantly enhances prediction accuracy and computational efficiency, while the SM-IB algorithm effectively achieves near-optimal max-min throughput with greatly reduced complexity. 智能反射表面(IRS)在下一代无线网络中有可能带来显著的性能提升,但面临关键挑战,尤其是严重的双重路径损耗和由于硬件约束引起的复杂多用户调度问题。主动 IRS 在一定程度上解决了路径损耗问题,但在小区级多 IRS 多用户系统中仍需高效的调度,因为随着用户密度和信道维度的增加,信道状态获取的开销/延迟和调度复杂度都会大幅上升。针对这些挑战,本文提出了一种基于神经信道知识图(CKM)的新型调度框架,设计了基于 Transformer 的深度神经网络(DNN)以从带有用户位置信息标注的历史信道/吞吐量测量中预测遍历谱效率(SE)。具体地,设计了两个级联网络 LPS-Net 和 SE-Net,分别用于准确预测链路功率统计(LPS)和遍历 SE。我们进一步提出了一种低复杂度的稳定匹配—迭代平衡(SM-IB)调度算法。 数值评估验证了所提出的神经 CKM 在显著提升预测精度和计算效率方面的效果,同时 SM-IB 算法在大幅降低复杂度的情况下有效地实现了接近最优的最大-最小吞吐量。
Subjects: Information Theory, Artificial Intelligence, Machine Learning 主题:信息论,人工智能,机器学习
Publish: 2025-08-09 15:14:03 UTC 发布时间:2025-08-09 15:14:03 UTC
#235 Consensus-based Decentralized Multi-agent Reinforcement Learning for Random Access Network Optimization #235 基于共识的去中心化多智能体强化学习用于随机接入网络优化
Authors: [Myeung Suk Oh](https://arxiv.org/search/?searchtype=author&query=Myeung Suk Oh), [Zhiyao Zhang](https://arxiv.org/search/?searchtype=author&query=Zhiyao Zhang), [FNU Hairi](https://arxiv.org/search/?searchtype=author&query=FNU Hairi), [Alvaro Velasquez](https://arxiv.org/search/?searchtype=author&query=Alvaro Velasquez), [Jia Liu](https://arxiv.org/search/?searchtype=author&query=Jia Liu) 作者:Myeung Suk Oh、Zhiyao Zhang、FNU Hairi、Alvaro Velasquez、Jia Liu
With wireless devices increasingly forming a unified smart network for seamless, user-friendly operations, random access (RA) medium access control (MAC) design is considered a key solution for handling unpredictable data traffic from multiple terminals. However, it remains challenging to design an effective RA-based MAC protocol to minimize collisions and ensure transmission fairness across the devices. While existing multi-agent reinforcement learning (MARL) approaches with centralized training and decentralized execution (CTDE) have been proposed to optimize RA performance, their reliance on centralized training and the significant overhead required for information collection can make real-world applications unrealistic. In this work, we adopt a fully decentralized MARL architecture, where policy learning does not rely on centralized tasks but leverages consensus-based information exchanges across devices. We design our MARL algorithm over an actor-critic (AC) network and propose exchanging only local rewards to minimize communication overhead. Furthermore, we provide a theoretical proof of global convergence for our approach. Numerical experiments show that our proposed MARL algorithm can significantly improve RA network performance compared to other baselines. 随着无线设备日益形成一个统一的智能网络以实现无缝且用户友好的操作,随机接入(RA)介质访问控制(MAC)设计被认为是处理多个终端不可预测数据流量的关键解决方案。然而,设计一个有效的基于 RA 的 MAC 协议以最小化冲突并确保设备间传输公平性仍然具有挑战性。尽管现有的多智能体强化学习(MARL)方法采用集中训练、分散执行(CTDE)来优化 RA 性能,但它们对集中训练的依赖以及为信息收集所需的大量开销可能使实际应用变得不切实际。在本工作中,我们采用了完全去中心化的 MARL 架构,策略学习不依赖于集中任务,而是利用设备间基于共识的信息交换。我们在演员-评论家(AC)网络上设计了我们的 MARL 算法,并提出仅交换局部奖励以最小化通信开销。此外,我们为该方法提供了全局收敛性的理论证明。 数值实验表明,与其他基线方法相比,我们提出的多智能体强化学习(MARL)算法能够显著提升随机接入(RA)网络的性能。
Subjects: Networking and Internet Architecture, Artificial Intelligence, Machine Learning 学科:网络与互联网架构、人工智能、机器学习
Publish: 2025-08-09 14:39:27 UTC 发布日期:2025-08-09 14:39:27 UTC
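The phrase "consensus-based information exchanges" refers to devices repeatedly averaging a quantity with their neighbors until everyone agrees on the network-wide value. The sketch below shows plain consensus averaging of local rewards on a ring of four devices. It is a generic illustration, not the paper's algorithm: the topology, the equal averaging weights, and the function name are assumptions, and equal weights preserve the global mean only because every node on a ring has the same degree.

```python
def consensus_step(values, neighbors):
    """One synchronous round: each device averages its value with its neighbors' values."""
    new_values = []
    for i, v in enumerate(values):
        group = [v] + [values[j] for j in neighbors[i]]
        new_values.append(sum(group) / len(group))
    return new_values

# Hypothetical ring of four devices; each one only talks to its two neighbors.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
local_rewards = [1.0, 0.0, 0.0, 3.0]   # global mean is 1.0

estimates = local_rewards
for _ in range(50):
    estimates = consensus_step(estimates, ring)
```

After enough rounds every device holds (approximately) the global average reward without any central collector, which is the property a decentralized critic can build on.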
#236 Conformal Set-based Human-AI Complementarity with Multiple Experts #236 基于保形集的人机互补(含多位专家)
Authors: [Helbert Paat](https://arxiv.org/search/?searchtype=author&query=Helbert Paat), [Guohao Shen](https://arxiv.org/search/?searchtype=author&query=Guohao Shen) 作者:Helbert Paat,Guohao Shen
Decision support systems are designed to assist human experts in classification tasks by providing conformal prediction sets derived from a pre-trained model. This human-AI collaboration has demonstrated enhanced classification performance compared to using either the model or the expert independently. In this study, we focus on the selection of instance-specific experts from a pool of multiple human experts, contrasting it with existing research that typically focuses on single-expert scenarios. We characterize the conditions under which multiple experts can benefit from the conformal sets. With the insight that only certain experts may be relevant for each instance, we explore the problem of subset selection and introduce a greedy algorithm that utilizes conformal sets to identify the subset of expert predictions that will be used in classifying an instance. This approach is shown to yield better performance compared to naive methods for human subset selection. Based on real expert predictions from the CIFAR-10H and ImageNet-16H datasets, our simulation study indicates that our proposed greedy algorithm achieves near-optimal subsets, resulting in improved classification performance among multiple experts. 决策支持系统旨在通过提供来自预训练模型的符合性预测集来辅助人类专家完成分类任务。这种人机协作相比单独使用模型或专家均表现出更好的分类性能。在本研究中,我们关注于从多位人类专家池中为每个实例选择特定专家,与现有通常关注单一专家场景的研究形成对比。我们刻画了多位专家在何种条件下能够从符合性预测集中受益。基于仅有部分专家可能与每个实例相关这一洞见,我们探讨了子集选择问题,并引入了一种利用符合性预测集来识别将用于对某一实例进行分类的专家预测子集的贪婪算法。结果表明,该方法相比于用于人类专家子集选择的简单方法能带来更好的性能。 基于来自 CIFAR-10H 和 ImageNet-16H 数据集的真实专家预测,我们的模拟研究表明,所提出的贪心算法能选出近似最优的子集,从而在多位专家之间提升分类性能。
Subjects: Machine Learning, Artificial Intelligence, Human-Computer Interaction, Multiagent Systems 主题:机器学习、人工智能、人机交互、多智能体系统
Publish: 2025-08-09 14:17:51 UTC 发布:2025-08-09 14:17:51 UTC
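The conformal prediction sets handed to the experts can be built with standard split conformal prediction. The sketch below is a minimal version under stated assumptions: the nonconformity score is taken to be 1 minus the model's probability for the true class, and the function names and toy numbers are hypothetical. The paper's greedy expert-subset selection is not reproduced here.

```python
import math

def conformal_threshold(calibration_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: include every label
    return sorted(calibration_scores)[k - 1]

def prediction_set(probs, threshold):
    """All labels whose nonconformity score 1 - p is within the threshold."""
    return {label for label, p in probs.items() if 1.0 - p <= threshold}

# Calibration scores are 1 - p(true label) on held-out examples (toy values).
qhat = conformal_threshold([0.7, 0.1, 0.5, 0.3], alpha=0.5)
candidates = prediction_set({"cat": 0.6, "dog": 0.3, "bird": 0.1}, qhat)
```

An expert (or a selected subset of experts) would then choose among `candidates` rather than among all labels.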
#237 WeatherDiffusion: Weather-Guided Diffusion Model for Forward and Inverse Rendering #237 WeatherDiffusion:用于正向和逆向渲染的天气引导扩散模型
Authors: [Yixin Zhu](https://arxiv.org/search/?searchtype=author&query=Yixin Zhu), [Zuoliang Zhu](https://arxiv.org/search/?searchtype=author&query=Zuoliang Zhu), [Miloš Hašan](https://arxiv.org/search/?searchtype=author&query=Miloš Hašan), [Jian Yang](https://arxiv.org/search/?searchtype=author&query=Jian Yang), [Jin Xie](https://arxiv.org/search/?searchtype=author&query=Jin Xie), [Beibei Wang](https://arxiv.org/search/?searchtype=author&query=Beibei Wang) 作者:Yixin Zhu、Zuoliang Zhu、Miloš Hašan、Jian Yang、Jin Xie、Beibei Wang
Forward and inverse rendering have emerged as key techniques for enabling understanding and reconstruction in the context of autonomous driving (AD). However, complex weather and illumination pose great challenges to this task. The emergence of large diffusion models has shown promise in achieving reasonable results through learning from 2D priors, but these models are difficult to control and lack robustness. In this paper, we introduce WeatherDiffusion, a diffusion-based framework for forward and inverse rendering on AD scenes with various weather and lighting conditions. Our method enables authentic estimation of material properties, scene geometry, and lighting, and further supports controllable weather and illumination editing through the use of predicted intrinsic maps guided by text descriptions. We observe that different intrinsic maps should correspond to different regions of the original image. Based on this observation, we propose Intrinsic map-aware attention (MAA) to enable high-quality inverse rendering. Additionally, we introduce a synthetic dataset (i.e., WeatherSynthetic) and a real-world dataset (i.e., WeatherReal) for forward and inverse rendering on AD scenes with diverse weather and lighting. Extensive experiments show that our WeatherDiffusion outperforms state-of-the-art methods on several benchmarks. Moreover, our method demonstrates significant value in downstream tasks for AD, enhancing the robustness of object detection and image segmentation in challenging weather scenarios.
前向渲染和逆向渲染已成为在自动驾驶(AD)背景下实现理解与重建的关键技术。然而,复杂的天气和光照对这项任务提出了巨大的挑战。大型扩散模型的出现通过从二维先验中学习在取得合理结果方面展现出希望,但这些模型难以控制且缺乏鲁棒性。在本文中,我们提出了 WeatherDiffusion,一种基于扩散的框架,用于在具有多种天气和光照条件的自动驾驶场景上进行前向和逆向渲染。我们的方法能够真实估计材质属性、场景几何和光照,并通过使用由文本描述引导的预测内在图来进一步支持可控的天气和光照编辑。我们观察到,不同的内在图应对应原始图像的不同区域。基于这一观察,我们提出了内在图感知注意力(MAA),以实现高质量的逆向渲染。 此外,我们构建了一个合成数据集(即 WeatherSynthetic)和一个真实世界数据集(即 WeatherReal),用于在具有多样天气和光照的自动驾驶场景上进行正向与逆向渲染。大量实验表明,我们的 WeatherDiffusion 在多个基准测试上优于最先进的方法。此外,我们的方法在自动驾驶的下游任务中展示了显著价值,提高了在恶劣天气场景下目标检测和图像分割的鲁棒性。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-09 13:29:39 UTC 发布:2025-08-09 13:29:39 UTC
#238 Can Multitask Learning Enhance Model Explainability? #238 多任务学习能提升模型可解释性吗?
Authors: [Hiba Najjar](https://arxiv.org/search/?searchtype=author&query=Hiba Najjar), [Bushra Alshbib](https://arxiv.org/search/?searchtype=author&query=Bushra Alshbib), [Andreas Dengel](https://arxiv.org/search/?searchtype=author&query=Andreas Dengel) 作者:Hiba Najjar、Bushra Alshbib、Andreas Dengel
Remote sensing provides satellite data in diverse types and formats. The usage of multimodal learning networks exploits this diversity to improve model performance, except that the complexity of such networks comes at the expense of their interpretability. In this study, we explore how modalities can be leveraged through multitask learning to intrinsically explain model behavior. In particular, instead of additional inputs, we use certain modalities as additional targets to be predicted along with the main task. The success of this approach relies on the rich information content of satellite data, which remains as input modalities. We show how this modeling context provides numerous benefits: (1) in case of data scarcity, the additional modalities do not need to be collected for model inference at deployment, (2) the model performance remains comparable to the multimodal baseline performance, and in some cases achieves better scores, (3) prediction errors in the main task can be explained via the model behavior in the auxiliary task(s). We demonstrate the efficiency of our approach on three datasets, including segmentation, classification, and regression tasks. Code available at git.opendfki.de/hiba.najjar/mtl_explainability/. 遥感以多种类型和格式提供卫星数据。多模态学习网络利用这种多样性来提升模型性能,但这类网络的复杂性往往以可解释性为代价。在本研究中,我们探讨了如何通过多任务学习利用不同模态来内在地解释模型行为。具体而言,我们不是将某些模态作为额外输入,而是将它们作为辅助目标与主任务一起预测。这一方法的成功依赖于卫星数据作为输入模态所包含的丰富信息。我们展示了这一建模背景带来的诸多好处: (1) 在数据稀缺的情况下,额外的模态在部署时不需要被收集用于模型推断; (2) 模型性能与多模态基线性能保持可比,有时甚至取得更好成绩; (3) 主任务的预测错误可以通过模型在辅助任务中的行为得到解释。我们在三个数据集上展示了该方法的有效性,涵盖分割、分类和回归任务。 代码可在 git.opendfki.de/hiba.najjar/mtl_explainability/ 获取。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-09 12:24:48 UTC 发布时间:2025-08-09 12:24:48 协调世界时 (UTC)
#239 Beyond Frequency: Seeing Subtle Cues Through the Lens of Spatial Decomposition for Fine-Grained Visual Classification #239 超越频率:通过空间分解视角观察细微线索以用于细粒度视觉分类
Authors: [Qin Xu](https://arxiv.org/search/?searchtype=author&query=Qin Xu), [Lili Zhu](https://arxiv.org/search/?searchtype=author&query=Lili Zhu), [Xiaoxia Cheng](https://arxiv.org/search/?searchtype=author&query=Xiaoxia Cheng), [Bo Jiang](https://arxiv.org/search/?searchtype=author&query=Bo Jiang) 作者:Qin Xu、Lili Zhu、Xiaoxia Cheng、Bo Jiang
The crux of resolving fine-grained visual classification (FGVC) lies in capturing discriminative and class-specific cues that correspond to subtle visual characteristics. Recently, frequency decomposition/transform based approaches have attracted considerable interest owing to their apparent ability to mine discriminative cues. However, the frequency-domain methods are based on fixed basis functions, lacking adaptability to image content and unable to dynamically adjust feature extraction according to the discriminative requirements of different images. To address this, we propose a novel method for FGVC, named Subtle-Cue Oriented Perception Engine (SCOPE), which adaptively enhances the representational capability of low-level details and high-level semantics in the spatial domain, breaking through the limitations of fixed scales in the frequency domain and improving the flexibility of multi-scale fusion. The core of SCOPE lies in two modules: the Subtle Detail Extractor (SDE), which dynamically enhances subtle details such as edges and textures from shallow features, and the Salient Semantic Refiner (SSR), which learns semantically coherent and structure-aware refinement features from the high-level features guided by the enhanced shallow features. The SDE and SSR are cascaded stage-by-stage to progressively combine local details with global semantics. Extensive experiments demonstrate that our method achieves new state-of-the-art on four popular fine-grained image classification benchmarks. 解决细粒度视觉分类(FGVC)的关键在于捕捉与微妙视觉特征对应的判别性且类别特异的线索。近年来,基于频率分解/变换的方法因其显现出的判别性线索挖掘能力而受到广泛关注。然而,频域方法依赖固定的基函数,缺乏对图像内容的适应性,无法根据不同图像的判别需求动态调整特征提取。为了解决这一问题,我们提出了一种用于 FGVC 的新方法,称为面向微妙线索感知引擎(SCOPE),该方法在空间域中自适应地增强低级细节和高级语义的表征能力,突破了频域固定尺度的限制并提高了多尺度融合的灵活性。 SCOPE 的核心在于两个模块:微妙细节提取器(SDE),该模块从浅层特征动态增强诸如边缘和纹理等微妙细节;以及显著语义精炼器(SSR),该模块在增强的浅层特征的指导下,从高层特征中学习语义一致且结构感知的精炼特征。SDE 和 SSR 逐级级联,以逐步将局部细节与全局语义相结合。大量实验表明,我们的方法在四个流行的细粒度图像分类基准上达到了新的最先进水平。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-09 12:13:40 UTC 发布:2025-08-09 12:13:40 世界协调时
#240 Neural Beam Field for Spatial Beam RSRP Prediction #240 神经波束场用于空间波束 RSRP 预测
Authors: [Keqiang Guo](https://arxiv.org/search/?searchtype=author&query=Keqiang Guo), [Yuheng Zhong](https://arxiv.org/search/?searchtype=author&query=Yuheng Zhong), [Xin Tong](https://arxiv.org/search/?searchtype=author&query=Xin Tong), [Jiangbin Lyu](https://arxiv.org/search/?searchtype=author&query=Jiangbin Lyu), [Rui Zhang](https://arxiv.org/search/?searchtype=author&query=Rui Zhang) 作者:郭克强、钟宇恒、童鑫、吕江斌、张锐
Accurately predicting beam-level reference signal received power (RSRP) is essential for beam management in dense multi-user wireless networks, yet challenging due to high measurement overhead and fast channel variations. This paper proposes Neural Beam Field (NBF), a hybrid neural-physical framework for efficient and interpretable spatial beam RSRP prediction. Central to our approach is the introduction of the Multi-path Conditional Power Profile (MCPP), which bridges site-specific multipath propagation with antenna/beam configurations via closed-form analytical modeling. We adopt a decoupled "blackbox-whitebox" design: a Transformer-based deep neural network (DNN) learns the MCPP from sparse user measurements and positions, while a physics-inspired module analytically infers beam RSRP statistics. To improve convergence and adaptivity, we further introduce a Pretrain-and-Calibrate (PaC) strategy that leverages ray-tracing priors and on-site calibration using RSRP data. Extensive simulation results demonstrate that NBF significantly outperforms conventional table-based channel knowledge maps (CKMs) and pure blackbox DNNs in prediction accuracy, training efficiency, and generalization, while maintaining a compact model size. The proposed framework offers a scalable and physically grounded solution for intelligent beam management in next-generation dense wireless networks. 准确预测波束级参考信号接收功率(RSRP)对于稠密多用户无线网络中的波束管理至关重要,但由于测量开销大且信道快速变化,这一任务具有挑战性。本文提出了神经波束场(NBF),一种用于高效且可解释的空间波束 RSRP 预测的混合神经—物理框架。我们方法的核心是引入多径条件功率剖面(MCPP),该剖面通过闭式解析建模将站点特定的多径传播与天线/波束配置联系起来。我们采用解耦的“黑盒—白盒”设计:基于 Transformer 的深度神经网络(DNN)从稀疏的用户测量和位置信息中学习 MCPP,而受物理启发的模块则解析性地推断波束 RSRP 统计特性。为改进收敛性和自适应性,我们进一步提出了预训练与校准(PaC)策略,该策略利用射线追踪先验并结合使用 RSRP 数据进行现场校准。 大量仿真结果表明,NBF 在预测精度、训练效率和泛化能力方面显著优于传统的基于表格的信道知识图(CKMs)和纯黑盒深度神经网络(DNNs),同时保持了紧凑的模型尺寸。所提出的框架为下一代密集无线网络中的智能波束管理提供了一个可扩展且具有物理依据的解决方案。
Subjects: Information Theory, Artificial Intelligence, Machine Learning 主题:信息论,人工智能,机器学习
Publish: 2025-08-09 12:05:51 UTC 发布:2025-08-09 12:05:51 UTC
#241 AMFT: Aligning LLM Reasoners by Meta-Learning the Optimal Imitation-Exploration Balance #241 AMFT:通过元学习最佳模仿-探索平衡来对齐 LLM 推理器
Authors: [Lixuan He](https://arxiv.org/search/?searchtype=author&query=Lixuan He), [Jie Feng](https://arxiv.org/search/?searchtype=author&query=Jie Feng), [Yong Li](https://arxiv.org/search/?searchtype=author&query=Yong Li) 作者:何立轩,冯捷,李勇
Large Language Models (LLMs) are typically fine-tuned for reasoning tasks through a two-stage pipeline of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL), a process fraught with catastrophic forgetting and suboptimal trade-offs between imitation and exploration. Recent single-stage methods attempt to unify SFT and RL using heuristics, but lack a principled mechanism for dynamically balancing the two paradigms. In this paper, we reframe this challenge through the theoretical lens of implicit rewards, viewing SFT and RL not as distinct methods but as complementary reward signals. We introduce Adaptive Meta Fine-Tuning (AMFT), a novel single-stage algorithm that learns the optimal balance between SFT’s implicit, path-level reward and RL’s explicit, outcome-based reward. The core of AMFT is a meta-gradient adaptive weight controller that treats the SFT-RL balance as a learnable parameter, dynamically optimizing it to maximize long-term task performance. This forward-looking approach, regularized by policy entropy for stability, autonomously discovers an effective training curriculum. We conduct a comprehensive evaluation on challenging benchmarks spanning mathematical reasoning, abstract visual reasoning (General Points), and vision-language navigation (V-IRL). AMFT consistently establishes a new state-of-the-art and demonstrates superior generalization on out-of-distribution (OOD) tasks. Ablation studies and training dynamic analysis confirm that the meta-learning controller is crucial for AMFT’s stability, sample efficiency, and performance, offering a more principled and effective paradigm for LLM alignment. Our codes are open-sourced via https://github.com/hlxtsyj/AMFT.
大型语言模型 (LLMs) 通常通过两阶段流水线对推理任务进行微调:先进行有监督微调 (SFT),然后进行强化学习 (RL)。这一过程容易导致灾难性遗忘,并在模仿与探索之间产生不理想的权衡。近期的一些单阶段方法试图使用启发式手段将 SFT 与 RL 统一起来,但缺乏一种原则性机制来动态平衡这两种范式。在本文中,我们通过“隐式奖励”的理论视角重新构建这一挑战,将 SFT 和 RL 视为互补的奖励信号,而非截然不同的方法。我们提出了自适应元微调 (Adaptive Meta Fine-Tuning, AMFT),这是一种新的单阶段算法,能够学习在 SFT 的隐式、路径级奖励与 RL 的显式、基于结果的奖励之间达到最优平衡。AMFT 的核心是一个元梯度自适应权重控制器,它将 SFT-RL 的平衡视为一个可学习的参数,动态优化该参数以最大化长期任务表现。这种面向未来的方法通过策略熵进行正则化以提升稳定性,并能自主发现有效的训练课程。 我们在覆盖数学推理、抽象视觉推理(General Points)和视觉-语言导航(V-IRL)的挑战性基准上进行了全面评估。AMFT 持续建立了新的最先进水平,并在分布外(OOD)任务上展现出更强的泛化能力。消融研究和训练动态分析证实,元学习控制器对 AMFT 的稳定性、样本效率和性能至关重要,为 LLM 对齐提供了一种更有原则性和更有效的范式。我们的代码已开源于 https://github.com/hlxtsyj/AMFT。
Subjects: Machine Learning, Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition 主题:机器学习、人工智能、计算与语言、计算机视觉与模式识别
Publish: 2025-08-09 11:40:54 UTC 发布:2025-08-09 11:40:54 UTC
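The AMFT abstract (#241) describes treating the SFT-RL balance as a learnable parameter. The sketch below illustrates only that basic idea, under assumptions that are not taken from the paper: the balance is a scalar `mu` in [0, 1], the objective is a convex combination of the two losses, and `meta_grad` is a hypothetical estimate of how a held-out loss changes with `mu`. The paper's meta-gradient controller and its entropy regularization are not reproduced here.

```python
def combined_loss(sft_loss: float, rl_loss: float, mu: float) -> float:
    """Convex combination of an imitation (SFT) loss and an exploration (RL) loss.

    The convex form is an illustrative assumption, not the paper's objective.
    """
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    return mu * sft_loss + (1.0 - mu) * rl_loss


def update_weight(mu: float, meta_grad: float, lr: float = 0.1) -> float:
    """One gradient-descent step on mu, clipped back into [0, 1].

    meta_grad is a hypothetical estimate of d(held-out loss)/d(mu).
    """
    return min(1.0, max(0.0, mu - lr * meta_grad))
```

A positive `meta_grad` shifts weight away from the SFT term and a negative one shifts weight toward it; the clipping keeps the combination convex.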
#242 Class Unbiasing for Generalization in Medical Diagnosis #242 类别去偏以提升医学诊断的泛化性
Authors: [Lishi Zuo](https://arxiv.org/search/?searchtype=author&query=Lishi Zuo), [Man-Wai Mak](https://arxiv.org/search/?searchtype=author&query=Man-Wai Mak), [Lu Yi](https://arxiv.org/search/?searchtype=author&query=Lu Yi), [Youzhi Tu](https://arxiv.org/search/?searchtype=author&query=Youzhi Tu)
Medical diagnosis might fail due to bias. In this work, we identified class-feature bias, which refers to models’ potential reliance on features that are strongly correlated with only a subset of classes, leading to biased performance and poor generalization on other classes. We aim to train a class-unbiased model (Cls-unbias) that mitigates both class imbalance and class-feature bias simultaneously. Specifically, we propose a class-wise inequality loss which promotes equal contributions of classification loss from positive-class and negative-class samples. We propose to optimize a class-wise group distributionally robust optimization objective (a class-weighted training objective that upweights underperforming classes) to enhance the effectiveness of the inequality loss under class imbalance. Through synthetic and real-world datasets, we empirically demonstrate that class-feature bias can negatively impact model performance. Our proposed method effectively mitigates both class-feature bias and class imbalance, thereby improving the model’s generalization ability. 医疗诊断可能因偏差而失败。在本研究中,我们识别出类特征偏差(class-feature bias),即模型可能依赖于仅与部分类别强相关的特征,导致对这些类别表现偏好,而对其他类别泛化能力差。我们的目标是训练一个消除类别偏见的模型(Cls-unbias),同时缓解类别不平衡和类特征偏差。具体而言,我们提出了一种按类别的不平等损失(class-wise inequality loss),该损失促进来自正类样本和负类样本的分类损失具有相等的贡献。我们提出优化一个按类别的群体分布鲁棒优化目标(一个对表现不佳类别加权的训练目标),以在类别不平衡情况下增强不平等损失的效果。通过合成和真实世界的数据集,我们在实证上证明了类特征偏差会对模型性能产生负面影响。我们提出的方法能有效缓解类特征偏差和类别不平衡,从而提高模型的泛化能力。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-09 11:37:44 UTC 发布时间:2025-08-09 11:37:44 协调世界时 (UTC)
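The two ingredients named in the Cls-unbias abstract (#242) can be illustrated with a small sketch. Both forms below are assumptions, since the abstract gives no formulas: the inequality term is taken as the absolute gap between the mean loss on positive-class and negative-class samples, and the class weighting uses the exponentiated-gradient update commonly used for group distributionally robust optimization. All function names are hypothetical.

```python
import math
from typing import List


def inequality_penalty(pos_losses: List[float], neg_losses: List[float]) -> float:
    """Absolute gap between mean positive-class and mean negative-class loss.

    Assumed form; it is zero when both classes contribute equally on average.
    """
    mean_pos = sum(pos_losses) / len(pos_losses)
    mean_neg = sum(neg_losses) / len(neg_losses)
    return abs(mean_pos - mean_neg)


def update_class_weights(weights: List[float], class_losses: List[float],
                         eta: float = 1.0) -> List[float]:
    """Exponentiated-gradient step: classes with higher loss gain weight."""
    scaled = [w * math.exp(eta * loss) for w, loss in zip(weights, class_losses)]
    total = sum(scaled)
    return [s / total for s in scaled]


def weighted_objective(weights: List[float], class_losses: List[float]) -> float:
    """Class-weighted training loss under the current weights."""
    return sum(w * loss for w, loss in zip(weights, class_losses))
```

Starting from uniform weights, repeated calls to `update_class_weights` concentrate weight on the class that currently performs worst, which is the upweighting behavior the abstract describes.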
#243 When Prompt Engineering Meets Software Engineering: CNL-P as Natural and Robust "APIs" for Human-AI Interaction #243 当提示工程遇上软件工程:CNL-P 作为用于人机交互的自然且稳健的“API”
Authors: [Zhenchang Xing](https://arxiv.org/search/?searchtype=author&query=Zhenchang Xing), [Yang Liu](https://arxiv.org/search/?searchtype=author&query=Yang Liu), [Zhuo Cheng](https://arxiv.org/search/?searchtype=author&query=Zhuo Cheng), [Qing Huang](https://arxiv.org/search/?searchtype=author&query=Qing Huang), [Dehai Zhao](https://arxiv.org/search/?searchtype=author&query=Dehai Zhao), [Daniel Sun](https://arxiv.org/search/?searchtype=author&query=Daniel Sun), [Chenhua Liu](https://arxiv.org/search/?searchtype=author&query=Chenhua Liu) 作者:邢振昌、刘洋、程卓、黄青、赵德海、Daniel Sun、刘晨华
With the growing capabilities of large language models (LLMs), they are increasingly applied in areas like intelligent customer service, code generation, and knowledge management. Natural language (NL) prompts act as the "APIs" for human-LLM interaction. To improve prompt quality, best practices for prompt engineering (PE) have been developed, including writing guidelines and templates. Building on this, we propose Controlled NL for Prompt (CNL-P), which not only incorporates PE best practices but also draws on key principles from software engineering (SE). CNL-P introduces precise grammar structures and strict semantic norms, further eliminating NL’s ambiguity, allowing for a declarative but structured and accurate expression of user intent. This helps LLMs better interpret and execute the prompts, leading to more consistent and higher-quality outputs. We also introduce an NL2CNL-P conversion tool based on LLMs, enabling users to write prompts in NL, which are then transformed into CNL-P format, thus lowering the learning curve of CNL-P. In particular, we develop a linting tool that checks CNL-P prompts for syntactic and semantic accuracy, applying static analysis techniques to NL for the first time. Extensive experiments demonstrate that CNL-P enhances the quality of LLM responses through the novel and organic synergy of PE and SE. We believe that CNL-P can bridge the gap between emerging PE and traditional SE, laying the foundation for a new programming paradigm centered around NL.
随着大型语言模型(LLMs)能力的不断增强,它们越来越多地被应用于智能客户服务、代码生成和知识管理等领域。自然语言(NL)提示被视为人类与 LLMs 交互的“API”。为提高提示质量,已经制定了提示工程(PE)的最佳实践,包括写作指南和模板。在此基础上,我们提出了用于提示的受控自然语言(CNL-P),它不仅融入了 PE 的最佳实践,还借鉴了软件工程(SE)的关键原则。CNL-P 引入了精确的语法结构和严格的语义规范,进一步消除了自然语言的歧义,使用户意图能够以声明性但结构化且准确的方式表达。这有助于 LLMs 更好地解释和执行提示,从而生成更稳定且更高质量的输出。我们还基于 LLMs 引入了一个 NL 到 CNL-P 的转换工具,使用户可以用自然语言撰写提示,然后将其转换为 CNL-P 格式,从而降低了 CNL-P 的学习门槛。 具体来说,我们开发了一个 lint 工具,用于检查 CNL-P 提示词的语法和语义准确性,首次将静态分析技术应用于自然语言。大量实验证明,CNL-P 通过提示工程(PE)与软件工程(SE)之间新颖且有机的协同作用提升了 LLM 响应的质量。我们认为,CNL-P 能够弥合新兴提示工程与传统软件工程之间的差距,为以自然语言为中心的新编程范式奠定基础。
Subjects: Software Engineering, Artificial Intelligence 主题:软件工程,人工智能
Publish: 2025-08-09 11:32:33 UTC 发布时间:2025-08-09 11:32:33 UTC
#244 CLAP: Coreference-Linked Augmentation for Passage Retrieval #244 CLAP:用于段落检索的共指关联增强
Authors: [Huanwei Xu](https://arxiv.org/search/?searchtype=author&query=Huanwei Xu), [Lin Xu](https://arxiv.org/search/?searchtype=author&query=Lin Xu), [Liang Yuan](https://arxiv.org/search/?searchtype=author&query=Liang Yuan) 作者:许焕威,徐琳,袁亮
Large Language Model (LLM)-based passage expansion has shown promise for enhancing first-stage retrieval, but often underperforms with dense retrievers due to semantic drift and misalignment with their pretrained semantic space. Beyond this, only a portion of a passage is typically relevant to a query, while the rest introduces noise–an issue compounded by chunking techniques that break coreference continuity. We propose Coreference-Linked Augmentation for Passage Retrieval (CLAP), a lightweight LLM-based expansion framework that segments passages into coherent chunks, resolves coreference chains, and generates localized pseudo-queries aligned with dense retriever representations. A simple fusion of global topical signals and fine-grained subtopic signals achieves robust performance across domains. CLAP yields consistent gains even as retriever strength increases, enabling dense retrievers to match or surpass second-stage rankers such as BM25 + MonoT5-3B, with up to 20.68% absolute nDCG@10 improvement. These improvements are especially notable in out-of-domain settings, where conventional LLM-based expansion methods relying on domain knowledge often falter. CLAP instead adopts a logic-centric pipeline that enables robust, domain-agnostic generalization. 基于大型语言模型(LLM)的段落扩展在提升第一阶段检索方面展现出潜力,但由于语义漂移和与预训练语义空间的不对齐,往往在密集检索器上表现不佳。除此之外,通常只有段落的一部分与查询相关,其余部分会引入噪音——而将段落切分的技术则会破坏共指连续性,加剧此问题。我们提出了用于段落检索的共指关联增强(Coreference-Linked Augmentation for Passage Retrieval,CLAP),这是一种轻量级的基于 LLM 的扩展框架,能够将段落分割成连贯的块,解析共指链,并生成与密集检索器表示对齐的局部伪查询。将全局主题信号与细粒度子主题信号简单融合,在各领域均能取得稳健的性能。即便检索器能力增强,CLAP 仍能带来持续提升,使密集检索器能够匹配或超越诸如 BM25 + MonoT5-3B 之类的二阶段排序器,nDCG@10 最高绝对提升达 20.68%。这些改进在域外设置中尤为显著,而依赖领域知识的传统基于 LLM 的扩展方法在此类情形中常常失效。 CLAP 采用一种以逻辑为核心的流程,从而实现稳健的、与领域无关的泛化能力。
Subjects: Information Retrieval, Artificial Intelligence 主题:信息检索,人工智能
Publish: 2025-08-09 11:26:10 UTC 发布:2025-08-09 11:26:10 协调世界时 (UTC)
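The CLAP abstract (#244) mentions "a simple fusion of global topical signals and fine-grained subtopic signals" without specifying it. One way such a fusion could look is sketched below, purely as an assumption: the global signal is the cosine similarity between the query and a passage-level vector, the fine-grained signal is the best similarity over chunk-level pseudo-query vectors, and `alpha` is a hypothetical mixing weight.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def fused_score(query_vec: Sequence[float],
                global_vec: Sequence[float],
                local_vecs: List[Sequence[float]],
                alpha: float = 0.5) -> float:
    """Mix a passage-level score with the best chunk-level score.

    The max-over-chunks rule and the linear mix are illustrative assumptions.
    """
    global_score = cosine(query_vec, global_vec)
    local_score = max((cosine(query_vec, v) for v in local_vecs), default=0.0)
    return alpha * global_score + (1.0 - alpha) * local_score
```

Taking the maximum over chunks lets a single highly relevant subtopic lift a passage whose overall topic is only loosely related to the query.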
#245 CannyEdit: Selective Canny Control and Dual-Prompt Guidance for Training-Free Image Editing #245 CannyEdit:用于无训练图像编辑的选择性 Canny 控制与双提示引导
Authors: [Weiyan Xie](https://arxiv.org/search/?searchtype=author&query=Weiyan Xie), [Han Gao](https://arxiv.org/search/?searchtype=author&query=Han Gao), [Didan Deng](https://arxiv.org/search/?searchtype=author&query=Didan Deng), [Kaican Li](https://arxiv.org/search/?searchtype=author&query=Kaican Li), [April Hua Liu](https://arxiv.org/search/?searchtype=author&query=April Hua Liu), [Yongxiang Huang](https://arxiv.org/search/?searchtype=author&query=Yongxiang Huang), [Nevin L. Zhang](https://arxiv.org/search/?searchtype=author&query=Nevin L. Zhang) 作者:谢伟言,郜涵,邓迪丹,李开灿,刘华(April Hua Liu),黄永祥,张聂文
Recent advances in text-to-image (T2I) models have enabled training-free regional image editing by leveraging the generative priors of foundation models. However, existing methods struggle to balance text adherence in edited regions, context fidelity in unedited areas, and seamless integration of edits. We introduce CannyEdit, a novel training-free framework that addresses these challenges through two key innovations: (1) Selective Canny Control, which masks the structural guidance of Canny ControlNet in user-specified editable regions while strictly preserving details of the source images in unedited areas via inversion-phase ControlNet information retention. This enables precise, text-driven edits without compromising contextual integrity. (2) Dual-Prompt Guidance, which combines local prompts for object-specific edits with a global target prompt to maintain coherent scene interactions. On real-world image editing tasks (addition, replacement, removal), CannyEdit outperforms prior methods like KV-Edit, achieving a 2.93 to 10.49 percent improvement in the balance of text adherence and context fidelity. In terms of editing seamlessness, user studies reveal only 49.2 percent of general users and 42.0 percent of AIGC experts identified CannyEdit’s results as AI-edited when paired with real images without edits, versus 76.08 to 89.09 percent for competitor methods. 近年来文本到图像(T2I)模型的进展使得利用基础模型的生成先验实现无训练的局部图像编辑成为可能。然而,现有方法在被编辑区域的文本一致性、未编辑区域的上下文保真度以及编辑的无缝融合之间难以取得平衡。我们提出了 CannyEdit,一种新的无训练框架,通过两项关键创新来应对这些挑战: (1) 选择性 Canny 控制,针对用户指定的可编辑区域屏蔽 Canny ControlNet 的结构引导,同时在逆向阶段通过保留 ControlNet 信息严格保留源图像在未编辑区域的细节。这使得在不影响上下文完整性的情况下实现精确的文本驱动编辑成为可能。 (2) 双提示引导,将用于对象特定编辑的局部提示与用于保持连贯场景交互的全局目标提示相结合。在实际图像编辑任务(添加、替换、移除)中,CannyEdit 优于诸如 KV-Edit 等现有方法,在文本一致性与上下文保真度的平衡上提升了 2.93 到 10.49 百分点。 在编辑无缝性方面,用户研究显示,当将 CannyEdit 的结果与未编辑的真实图像配对时,只有 49.2% 的普通用户和 42.0% 的 AIGC 专家将其识别为 AI 编辑的,而竞争方法的识别率为 76.08% 到 89.09%。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-09 11:06:58 UTC 发布:2025-08-09 11:06:58 UTC
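The Selective Canny Control idea in CannyEdit (#245), masking structural guidance inside user-specified editable regions, can be illustrated on a toy 2D grid. This is a simplified sketch under stated assumptions: the control signal is a plain list of rows rather than ControlNet feature residuals, and `edit_mask` is a hypothetical binary mask where 1 marks an editable cell.

```python
from typing import List


def mask_control(control: List[List[float]],
                 edit_mask: List[List[int]]) -> List[List[float]]:
    """Zero the structural control wherever edit_mask is 1; keep it elsewhere."""
    if len(control) != len(edit_mask):
        raise ValueError("control and edit_mask must have the same number of rows")
    masked = []
    for control_row, mask_row in zip(control, edit_mask):
        if len(control_row) != len(mask_row):
            raise ValueError("rows must have the same length")
        masked.append([c * (1 - m) for c, m in zip(control_row, mask_row)])
    return masked
```

Cells outside the mask keep their original guidance, which corresponds to preserving the source structure in unedited areas while the text prompt drives the masked region.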
#246 CROP: Integrating Topological and Spatial Structures via Cross-View Prefixes for Molecular LLMs #246 CROP:通过跨视图前缀将拓扑和空间结构整合到分子 LLMs 中
Authors: [Jianting Tang](https://arxiv.org/search/?searchtype=author&query=Jianting Tang), [Yubo Wang](https://arxiv.org/search/?searchtype=author&query=Yubo Wang), [Haoyu Cao](https://arxiv.org/search/?searchtype=author&query=Haoyu Cao), [Linli Xu](https://arxiv.org/search/?searchtype=author&query=Linli Xu) 作者:汤建廷、王宇博、曹昊宇、徐林立
Recent advances in molecular science have been propelled significantly by large language models (LLMs). However, their effectiveness is limited when relying solely on molecular sequences, which fail to capture the complex structures of molecules. Beyond sequence representation, molecules exhibit two complementary structural views: the first focuses on the topological relationships between atoms, as exemplified by the graph view; and the second emphasizes the spatial configuration of molecules, as represented by the image view. The two types of views provide unique insights into molecular structures. To leverage these views collaboratively, we propose the CROss-view Prefixes (CROP) to enhance LLMs’ molecular understanding through efficient multi-view integration. CROP possesses two advantages: (i) efficiency: by jointly resampling multiple structural views into fixed-length prefixes, it avoids excessive consumption of the LLM’s limited context length and allows easy expansion to more views; (ii) effectiveness: by utilizing the LLM’s self-encoded molecular sequences to guide the resampling process, it boosts the quality of the generated prefixes. Specifically, our framework features a carefully designed SMILES Guided Resampler for view resampling, and a Structural Embedding Gate for converting the resulting embeddings into LLM’s prefixes. Extensive experiments demonstrate the superiority of CROP in tasks including molecule captioning, IUPAC name prediction and molecule property prediction. 
近年来分子科学的重大进展在很大程度上得益于大型语言模型(LLMs)。然而,仅依赖分子序列的做法在有效性上存在局限,因为序列无法捕捉分子的复杂结构。除了序列表示外,分子还呈现两种互补的结构视角:第一种侧重于原子之间的拓扑关系,以图视图为例;第二种强调分子的空间构型,以图像视图为代表。这两种视角为分子结构提供了独特的见解。为协同利用这些视角,我们提出了跨视图前缀(CROss-view Prefixes,CROP),通过高效的多视图整合来增强 LLMs 对分子的理解。CROP 具有两方面优势:(i)高效性:通过将多种结构视图共同重采样为固定长度的前缀,避免了过度占用 LLM 有限的上下文长度,并便于扩展到更多视图;(ii)有效性:通过利用 LLM 自编码的分子序列来引导重采样过程,提高了生成前缀的质量。 具体而言,我们的框架采用精心设计的 SMILES 引导重采样器(SMILES Guided Resampler)用于视图重采样,并通过结构嵌入门(Structural Embedding Gate)将得到的嵌入转换为 LLM 的前缀。大量实验表明,CROP 在分子描述生成(molecule captioning)、IUPAC 命名预测和分子性质预测等任务上具有优越性。
Subjects: Quantitative Methods, Artificial Intelligence 主题:定量方法,人工智能
Publish: 2025-08-09 10:06:28 UTC 发布:2025-08-09 10:06:28 世界协调时间
#247 MMReID-Bench: Unleashing the Power of MLLMs for Effective and Versatile Person Re-identification #247 MMReID-Bench:释放多模态大模型在高效且多功能行人再识别中的力量
Authors: [Jinhao Li](https://arxiv.org/search/?searchtype=author&query=Jinhao Li), [Zijian Chen](https://arxiv.org/search/?searchtype=author&query=Zijian Chen), [Lirong Deng](https://arxiv.org/search/?searchtype=author&query=Lirong Deng), [Changbo Wang](https://arxiv.org/search/?searchtype=author&query=Changbo Wang), [Guangtao Zhai](https://arxiv.org/search/?searchtype=author&query=Guangtao Zhai) 作者:李金昊、陈子健、邓立荣、王长波、翟光涛
Person re-identification (ReID) aims to retrieve the images of an interested person in the gallery images, with wide applications in medical rehabilitation, abnormal behavior detection, and public security. However, traditional person ReID models suffer from uni-modal capability, leading to poor generalization ability in multi-modal data, such as RGB, thermal, infrared, sketch images, textual descriptions, etc. Recently, the emergence of multi-modal large language models (MLLMs) shows a promising avenue for addressing this problem. Despite this potential, existing methods merely regard MLLMs as feature extractors or caption generators, which do not fully unleash their reasoning, instruction-following, and cross-modal understanding capabilities. To bridge this gap, we introduce MMReID-Bench, the first multi-task multi-modal benchmark specifically designed for person ReID. The MMReID-Bench includes 20,710 multi-modal queries and gallery images covering 10 different person ReID tasks. Comprehensive experiments demonstrate the remarkable capabilities of MLLMs in delivering effective and versatile person ReID. Nevertheless, they also have limitations in handling a few modalities, particularly thermal and infrared data. We hope MMReID-Bench can facilitate the community to develop more robust and generalizable multimodal foundation models for person ReID. 行人重识别(ReID)旨在从画廊图像中检索感兴趣的人的图像,在医疗康复、异常行为检测和公共安全等方面有广泛应用。然而,传统的行人重识别模型受限于单模态能力,在多模态数据(如 RGB、热成像、红外、速写图像、文本描述等)上泛化能力较差。近来,多模态大型语言模型(MLLMs)的出现为解决此问题提供了有希望的途径。尽管具有这种潜力,现有方法仅将 MLLMs 视为特征提取器或描述生成器,未能充分发挥其推理、遵循指令和跨模态理解能力。为弥补这一差距,我们提出了 MMReID-Bench,这是第一个专为行人重识别设计的多任务多模态基准。MMReID-Bench 包含 20,710 个多模态查询和画廊图像,涵盖 10 种不同的行人重识别任务。全面实验表明,MLLMs 在提供高效且多才多艺的行人重识别方面表现出卓越能力。 然而,它们在处理某些模态时也存在局限,尤其是热成像和红外数据。我们希望 MMReID-Bench 能促进社区开发出更稳健、更具泛化性的用于行人重识别的多模态基础模型。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-09 09:42:09 UTC 发布时间:2025-08-09 09:42:09 UTC
#248 Advancements in Chinese font generation since deep learning era: A survey #248 自深度学习时代以来中文字体生成的进展:一项综述
Authors: [Weiran Chen](https://arxiv.org/search/?searchtype=author&query=Weiran Chen), [Guiqian Zhu](https://arxiv.org/search/?searchtype=author&query=Guiqian Zhu), [Ying Li](https://arxiv.org/search/?searchtype=author&query=Ying Li), [Yi Ji](https://arxiv.org/search/?searchtype=author&query=Yi Ji), [Chunping Liu](https://arxiv.org/search/?searchtype=author&query=Chunping Liu) 作者:陈伟然,朱贵谦,李英,纪怡,刘春平
Chinese font generation aims to create a new Chinese font library based on some reference samples. It is a topic of great concern to many font designers and typographers. Over the past years, with the rapid development of deep learning algorithms, various new techniques have achieved flourishing and thriving progress. Nevertheless, how to improve the overall quality of generated Chinese character images remains a tough issue. In this paper, we conduct a holistic survey of the recent Chinese font generation approaches based on deep learning. To be specific, we first illustrate the research background of the task. Then, we outline our literature selection and analysis methodology, and review a series of related fundamentals, including classical deep learning architectures, font representation formats, public datasets, and frequently-used evaluation metrics. After that, relying on the number of reference samples required to generate a new font, we categorize the existing methods into two major groups: many-shot font generation and few-shot font generation methods. Within each category, representative approaches are summarized, and their strengths and limitations are also discussed in detail. Finally, we conclude our paper with the challenges and future directions, with the expectation to provide some valuable illuminations for the researchers in this field. 中文字体生成旨在基于一些参考样本创建新的中文字体库。它是许多字体设计师和排版师非常关注的一个课题。过去几年中,随着深度学习算法的快速发展,各种新技术取得了蓬勃进展。然而,如何提高生成的中文字符图像的整体质量仍然是一个难题。在本文中,我们对基于深度学习的近期中文字体生成方法进行了全面综述。具体而言,我们首先阐述了该任务的研究背景。然后,我们概述了文献选择和分析方法,并回顾了一系列相关基础知识,包括经典深度学习架构、字体表示格式、公共数据集和常用评估指标。在此之后,基于生成新字体所需参考样本的数量,我们将现有方法分为两大类:多样本字体生成方法和少样本字体生成方法。 在每个类别中,我们总结了具有代表性的方法,并详细讨论了它们的优点和局限性。最后,我们以挑战和未来方向作为论文的结论,期望为该领域的研究人员提供一些有价值的启示。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-09 09:15:05 UTC 发布时间:2025-08-09 09:15:05 协调世界时 (UTC)
#249 BASIC: Boosting Visual Alignment with Intrinsic Refined Embeddings in Multimodal Large Language Models #249 BASIC:在多模态大语言模型中通过内在精炼嵌入提升视觉对齐
Authors: [Jianting Tang](https://arxiv.org/search/?searchtype=author&query=Jianting Tang), [Yubo Wang](https://arxiv.org/search/?searchtype=author&query=Yubo Wang), [Haoyu Cao](https://arxiv.org/search/?searchtype=author&query=Haoyu Cao), [Linli Xu](https://arxiv.org/search/?searchtype=author&query=Linli Xu) 作者:汤建廷、王宇博、曹昊宇、徐林立
Mainstream Multimodal Large Language Models (MLLMs) achieve visual understanding by using a vision projector to bridge well-pretrained vision encoders and large language models (LLMs). The inherent gap between visual and textual modalities makes the embeddings from the vision projector critical for visual comprehension. However, current alignment approaches treat visual embeddings as contextual cues and merely apply auto-regressive supervision to textual outputs, neglecting the necessity of introducing equivalent direct visual supervision, which hinders the potential finer alignment of visual embeddings. In this paper, based on our analysis of the refinement process of visual embeddings in the LLM’s shallow layers, we propose BASIC, a method that utilizes refined visual embeddings within the LLM as supervision to directly guide the projector in generating initial visual embeddings. Specifically, the guidance is conducted from two perspectives: (i) optimizing embedding directions by reducing angles between initial and supervisory embeddings in semantic space; (ii) improving semantic matching by minimizing disparities between the logit distributions of both visual embeddings. Without additional supervisory models or artificial annotations, BASIC significantly improves the performance of MLLMs across a wide range of benchmarks, demonstrating the effectiveness of our introduced direct visual supervision. 主流的多模态大语言模型(MLLMs)通过使用视觉投影器连接经过良好预训练的视觉编码器和大语言模型(LLMs)来实现视觉理解。视觉和文本模态之间的固有差距使得来自视觉投影器的嵌入在视觉理解中至关重要。然而,当前的对齐方法将视觉嵌入视为上下文提示,仅对文本输出应用自回归监督,忽视了引入等效直接视觉监督的必要性,这限制了视觉嵌入更精细对齐的潜力。在本文中,基于我们对视觉嵌入在 LLM 浅层中精炼过程的分析,我们提出了 BASIC,一种利用 LLM 中精炼后的视觉嵌入作为监督,直接指导投影器生成初始视觉嵌入的方法。具体而言,该指导从两个方面进行:(i)通过在语义空间中减少初始嵌入与监督嵌入之间的角度来优化嵌入方向;(ii)通过最小化两者视觉嵌入的 logit 分布差异来改善语义匹配。 在没有额外监督模型或人为标注的情况下,BASIC 在多种基准测试中显著提升了多模态大模型的性能,证明了我们所引入的直接视觉监督的有效性。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-09 09:00:45 UTC 发布时间:2025-08-09 09:00:45 UTC
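The two guidance terms in the BASIC abstract (#249), reducing the angle between initial and supervisory embeddings and minimizing the disparity between their logit distributions, can be sketched as follows. The concrete forms are assumptions rather than the paper's definitions: the direction term is taken as one minus cosine similarity, and the distribution term as the KL divergence from the supervisory softmax distribution to the initial one.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def direction_loss(initial: Sequence[float], supervisory: Sequence[float]) -> float:
    """1 - cosine similarity; zero when the two embeddings point the same way."""
    dot = sum(x * y for x, y in zip(initial, supervisory))
    norm_i = math.sqrt(sum(x * x for x in initial))
    norm_s = math.sqrt(sum(y * y for y in supervisory))
    if norm_i == 0.0 or norm_s == 0.0:
        raise ValueError("embeddings must have non-zero norm")
    return 1.0 - dot / (norm_i * norm_s)


def distribution_loss(initial_logits: Sequence[float],
                      supervisory_logits: Sequence[float]) -> float:
    """KL(p_supervisory || p_initial) over softmax distributions (assumed form)."""
    p = softmax(supervisory_logits)
    q = softmax(initial_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
```

In a training loop the two terms would be added to the usual auto-regressive text loss with some weighting; that weighting is not specified in the abstract and is left out of the sketch.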
#250 Maestro-EVC: Controllable Emotional Voice Conversion Guided by References and Explicit Prosody #250 Maestro-EVC:由参考与显式韵律引导的可控情感语音转换
Authors: [Jinsung Yoon](https://arxiv.org/search/?searchtype=author&query=Jinsung Yoon), [Wooyeol Jeong](https://arxiv.org/search/?searchtype=author&query=Wooyeol Jeong), [Jio Gim](https://arxiv.org/search/?searchtype=author&query=Jio Gim), [Young-Joo Suh](https://arxiv.org/search/?searchtype=author&query=Young-Joo Suh) 作者:Jinsung Yoon、Wooyeol Jeong、Jio Gim、Young-Joo Suh
Emotional voice conversion (EVC) aims to modify the emotional style of speech while preserving its linguistic content. In practical EVC, controllability, the ability to independently control speaker identity and emotional style using distinct references, is crucial. However, existing methods often struggle to fully disentangle these attributes and lack the ability to model fine-grained emotional expressions such as temporal dynamics. We propose Maestro-EVC, a controllable EVC framework that enables independent control of content, speaker identity, and emotion by effectively disentangling each attribute from separate references. We further introduce a temporal emotion representation and an explicit prosody modeling with prosody augmentation to robustly capture and transfer the temporal dynamics of the target emotion, even under prosody-mismatched conditions. Experimental results confirm that Maestro-EVC achieves high-quality, controllable, and emotionally expressive speech synthesis. 情感语音转换(EVC)旨在在保留语言内容的同时修改语音的情感风格。在实际的 EVC 中,可控性——使用不同参考独立控制说话者身份和情感风格的能力——至关重要。然而,现有方法常常难以完全解耦这些属性,并且缺乏对时间动态等细粒度情感表达的建模能力。我们提出了 Maestro-EVC,一种可控的 EVC 框架,通过从不同参考中有效解耦每个属性,实现对内容、说话者身份和情感的独立控制。我们进一步引入了时间情感表示以及带有韵律增强的显式韵律建模,以在韵律不匹配的情况下也能稳健地捕捉和传递目标情感的时间动态。实验结果表明,Maestro-EVC 实现了高质量、可控且具有情感表现力的语音合成。
Subjects: Sound, Artificial Intelligence, Computation and Language 主题:声音、人工智能、计算与语言
Publish: 2025-08-09 08:46:32 UTC 发布:2025-08-09 08:46:32 UTC
#251 NS-FPN: Improving Infrared Small Target Detection and Segmentation from Noise Suppression Perspective #251 NS-FPN:从抑制噪声的角度改进红外小目标检测与分割
Authors: [Maoxun Yuan](https://arxiv.org/search/?searchtype=author&query=Maoxun Yuan), [Duanni Meng](https://arxiv.org/search/?searchtype=author&query=Duanni Meng), [Ziteng Xi](https://arxiv.org/search/?searchtype=author&query=Ziteng Xi), [Tianyi Zhao](https://arxiv.org/search/?searchtype=author&query=Tianyi Zhao), [Shiji Zhao](https://arxiv.org/search/?searchtype=author&query=Shiji Zhao), [Yimian Dai](https://arxiv.org/search/?searchtype=author&query=Yimian Dai), [Xingxing Wei](https://arxiv.org/search/?searchtype=author&query=Xingxing Wei) 作者:袁茂勋、孟端倪、席子腾、赵天怡、赵世吉、戴宜勉、魏星星
Infrared small target detection and segmentation (IRSTDS) is a critical yet challenging task in defense and civilian applications, owing to the dim, shapeless appearance of targets and severe background clutter. Recent CNN-based methods have achieved promising target perception results, but they only focus on enhancing feature representation to offset the impact of noise, which results in the increased false alarms problem. In this paper, through analyzing the problem from the frequency domain, we pioneer in improving performance from noise suppression perspective and propose a novel noise-suppression feature pyramid network (NS-FPN), which integrates a low-frequency guided feature purification (LFP) module and a spiral-aware feature sampling (SFS) module into the original FPN structure. The LFP module suppresses the noise features by purifying high-frequency components to achieve feature enhancement devoid of noise interference, while the SFS module further adopts spiral sampling to fuse target-relevant features in feature fusion process. Our NS-FPN is designed to be lightweight yet effective and can be easily plugged into existing IRSTDS frameworks. Extensive experiments on the public IRSTDS datasets demonstrate that our method significantly reduces false alarms and achieves superior performance on IRSTDS tasks. 红外小目标检测与分割(IRSTDS)在国防和民用领域是一个关键但具有挑战性的任务,原因在于目标的微弱且无定形的外观以及严重的背景杂波。近年来基于卷积神经网络的方法在目标感知方面取得了可喜的成果,但它们仅关注增强特征表示以抵消噪声影响,这导致误报问题增加。在本文中,通过从频域分析该问题,我们率先从抑制噪声的角度改进性能,提出了一种新颖的噪声抑制特征金字塔网络(NS-FPN),该网络将低频引导特征净化(LFP)模块和螺旋感知特征采样(SFS)模块集成到原始 FPN 结构中。LFP 模块通过净化高频分量来抑制噪声特征,以实现不受噪声干扰的特征增强,而 SFS 模块在特征融合过程中进一步采用螺旋采样来融合与目标相关的特征。我们的 NS-FPN 设计为轻量且高效,可轻松插入现有的 IRSTDS 框架中。 在公开的 IRSTDS 数据集上进行的大量实验表明,我们的方法显著减少了误报,并在 IRSTDS 任务上取得了更优的性能。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-09 08:17:37 UTC 发布:2025-08-09 08:17:37 UTC
#252 ESNERA: Empirical and semantic named entity alignment for named entity dataset merging #252 ESNERA:用于命名实体数据集合并的经验与语义命名实体对齐
Authors: [Xiaobo Zhang](https://arxiv.org/search/?searchtype=author&query=Xiaobo Zhang), [Congqing He](https://arxiv.org/search/?searchtype=author&query=Congqing He), [Ying He](https://arxiv.org/search/?searchtype=author&query=Ying He), [Jian Peng](https://arxiv.org/search/?searchtype=author&query=Jian Peng), [Dajie Fu](https://arxiv.org/search/?searchtype=author&query=Dajie Fu), [Tien-Ping Tan](https://arxiv.org/search/?searchtype=author&query=Tien-Ping Tan) 作者:张晓波、何从庆、何颖、彭建、付大杰、陈田平
Named Entity Recognition (NER) is a fundamental task in natural language processing. It remains a research hotspot due to its wide applicability across domains. Although recent advances in deep learning have significantly improved NER performance, they rely heavily on large, high-quality annotated datasets. However, building these datasets is expensive and time-consuming, posing a major bottleneck for further research. Current dataset merging approaches mainly focus on strategies like manual label mapping or constructing label graphs, which lack interpretability and scalability. To address this, we propose an automatic label alignment method based on label similarity. The method combines empirical and semantic similarities, using a greedy pairwise merging strategy to unify label spaces across different datasets. Experiments are conducted in two stages: first, merging three existing NER datasets into a unified corpus with minimal impact on NER performance; second, integrating this corpus with a small-scale, self-built dataset in the financial domain. The results show that our method enables effective dataset merging and enhances NER performance in the low-resource financial domain. This study presents an efficient, interpretable, and scalable solution for integrating multi-source NER corpora. 命名实体识别(NER)是自然语言处理中的一项基础任务。由于其在各领域的广泛适用性,它仍然是研究的热点。尽管深度学习的最新进展显著提升了 NER 的性能,但这些进展高度依赖大规模、高质量的标注数据集。然而,构建这些数据集既昂贵又耗时,成为进一步研究的主要瓶颈。当前的数据集合并方法主要集中在人工标签映射或构建标签图等策略,这些方法缺乏可解释性和可扩展性。为了解决这一问题,我们提出了一种基于标签相似度的自动标签对齐方法。该方法结合了经验相似度和语义相似度,利用贪心的成对合并策略统一不同数据集的标签空间。实验分两个阶段进行:首先,将三个现有的 NER 数据集合并为一个统一语料库,对 NER 性能的影响最小;其次,将该语料库与一个小规模自建的金融领域数据集进行整合。结果表明,我们的方法能够实现有效的数据集合并,并在低资源的金融领域提升 NER 性能。 本研究提出了一种高效、可解释且可扩展的解决方案,用于整合多源命名实体识别语料库。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-09 08:15:26 UTC 发布:2025-08-09 08:15:26 UTC
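The greedy pairwise merging described in the abstract above can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the linear weighting of empirical and semantic similarity, the `threshold`, and the one-to-one matching constraint are assumptions, and the similarity values are supplied by the caller rather than computed.

```python
def greedy_label_alignment(empirical, semantic, weight=0.5, threshold=0.6):
    """Greedily align labels of dataset A to labels of dataset B.

    empirical, semantic: dicts mapping (label_a, label_b) -> similarity in [0, 1].
    weight and threshold are hypothetical knobs, not values from the paper.
    """
    combined = {
        pair: weight * empirical[pair] + (1 - weight) * semantic[pair]
        for pair in empirical
    }
    mapping = {}
    used_b = set()
    # Visit candidate pairs from most to least similar.
    for (a, b), score in sorted(combined.items(), key=lambda kv: kv[1], reverse=True):
        if score < threshold:
            break  # all remaining pairs are below the threshold
        if a in mapping or b in used_b:
            continue  # each label is merged at most once
        mapping[a] = b
        used_b.add(b)
    return mapping
```

Labels left out of the returned mapping stay as separate classes in the merged label space.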
#253 Sparsity-Driven Plasticity in Multi-Task Reinforcement Learning #253 稀疏性驱动的可塑性在多任务强化学习中的应用
Authors: [Aleksandar Todorov](https://arxiv.org/search/?searchtype=author&query=Aleksandar Todorov), [Juan Cardenas-Cartagena](https://arxiv.org/search/?searchtype=author&query=Juan Cardenas-Cartagena), [Rafael F. Cunha](https://arxiv.org/search/?searchtype=author&query=Rafael F. Cunha), [Marco Zullich](https://arxiv.org/search/?searchtype=author&query=Marco Zullich), [Matthia Sabatelli](https://arxiv.org/search/?searchtype=author&query=Matthia Sabatelli) 作者:Aleksandar Todorov、Juan Cardenas-Cartagena、Rafael F. Cunha、Marco Zullich、Matthia Sabatelli
Plasticity loss, a diminishing capacity to adapt as training progresses, is a critical challenge in deep reinforcement learning. We examine this issue in multi-task reinforcement learning (MTRL), where higher representational flexibility is crucial for managing diverse and potentially conflicting task demands. We systematically explore how sparsification methods, particularly Gradual Magnitude Pruning (GMP) and Sparse Evolutionary Training (SET), enhance plasticity and consequently improve performance in MTRL agents. We evaluate these approaches across distinct MTRL architectures (shared backbone, Mixture of Experts, Mixture of Orthogonal Experts) on standardized MTRL benchmarks, comparing against dense baselines, and a comprehensive range of alternative plasticity-inducing or regularization methods. Our results demonstrate that both GMP and SET effectively mitigate key indicators of plasticity degradation, such as neuron dormancy and representational collapse. These plasticity improvements often correlate with enhanced multi-task performance, with sparse agents frequently outperforming dense counterparts and achieving competitive results against explicit plasticity interventions. Our findings offer insights into the interplay between plasticity, network sparsity, and MTRL designs, highlighting dynamic sparsification as a robust but context-sensitive tool for developing more adaptable MTRL systems. 可塑性丧失,即随着训练进行而适应能力减弱,是深度强化学习中的一个关键挑战。我们在多任务强化学习(MTRL)中研究了这一问题,在该领域中,更高的表征灵活性对于处理多样且可能冲突的任务需求至关重要。我们系统地探讨了稀疏化方法,特别是渐进幅值剪枝(Gradual Magnitude Pruning,GMP)和稀疏进化训练(Sparse Evolutionary Training,SET),如何提升可塑性并因此改善 MTRL 智能体的表现。我们在标准化的 MTRL 基准上,针对不同的 MTRL 架构(共享骨干网络、专家混合架构、正交专家混合架构)评估了这些方法,并与密集基线以及一系列其他诱导可塑性或正则化的方法进行了比较。我们的结果表明,GMP 和 SET 均能有效缓解可塑性退化的主要指标,例如神经元沉睡和表征塌缩。这些可塑性方面的改善常常与多任务表现的提升相关联,稀疏化智能体经常优于密集对应模型,并在与显式可塑性干预的竞争中取得了有竞争力的结果。 我们的研究结果揭示了可塑性、网络稀疏性与多任务强化学习(MTRL)设计之间的相互作用,强调了动态稀疏化作为一种稳健但依赖情境的工具,对于开发更具适应性的多任务强化学习系统具有重要意义。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-09 07:44:44 UTC 发布:2025-08-09 07:44:44 UTC
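For readers unfamiliar with Gradual Magnitude Pruning, the sketch below shows the cubic sparsity schedule commonly associated with it, plus a magnitude-based mask. Whether the paper uses exactly this schedule is an assumption; the function names and parameters are hypothetical.

```python
def gmp_target_sparsity(step, start_step, end_step, final_sparsity,
                        initial_sparsity=0.0):
    """Cubic schedule: sparsity rises quickly at first, then levels off."""
    if step <= start_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - start_step) / (end_step - start_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3

def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that zeroes the smallest-magnitude fraction of weights."""
    n_prune = int(round(sparsity * len(weights)))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]
```

During training, the mask would be recomputed at pruning steps using the current target sparsity and applied to the layer's weights.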
#254 VSI: Visual Subtitle Integration for Keyframe Selection to enhance Long Video Understanding #254 VSI:用于关键帧选择以增强长视频理解的视觉字幕集成
Authors: [Jianxiang He](https://arxiv.org/search/?searchtype=author&query=Jianxiang He), [Shaoguang Wang](https://arxiv.org/search/?searchtype=author&query=Shaoguang Wang), [Weiyu Guo](https://arxiv.org/search/?searchtype=author&query=Weiyu Guo), [Meisheng Hong](https://arxiv.org/search/?searchtype=author&query=Meisheng Hong), [Jungang Li](https://arxiv.org/search/?searchtype=author&query=Jungang Li), [Yijie Xu](https://arxiv.org/search/?searchtype=author&query=Yijie Xu), [Ziyang Chen](https://arxiv.org/search/?searchtype=author&query=Ziyang Chen), [Hui Xiong](https://arxiv.org/search/?searchtype=author&query=Hui Xiong) 作者:何建翔,王少光,郭卫宇,洪美生,李军刚,徐一杰,陈子阳,熊辉
Long video understanding presents a significant challenge to multimodal large language models (MLLMs), primarily due to the immense data scale. A critical and widely adopted strategy for making this task computationally tractable is keyframe retrieval, which seeks to identify a sparse set of video frames that are most salient to a given textual query. However, this approach is hindered by weak multimodal alignment between textual queries and visual content, and it fails to capture the complex temporal semantic information required for precise reasoning. To address this, we propose Visual-Subtitle Integration (VSI), a multimodal keyframe search method that integrates subtitles, timestamps, and scene boundaries into a unified multimodal search process. The proposed method captures the visual information of video frames and the complementary textual information through a dual-stream search mechanism, using a Video Search Stream and a Subtitle Match Stream respectively, and improves keyframe search accuracy through the interaction of the two streams. Experimental results show that VSI achieves 40.00% keyframe localization accuracy on the text-relevant subset of LongVideoBench and 68.48% accuracy on downstream long Video-QA tasks, surpassing competitive baselines by 20.35% and 15.79%, respectively. Furthermore, on LongVideoBench, VSI achieves state-of-the-art (SOTA) results on medium-to-long Video-QA tasks, demonstrating the robustness and generalizability of the proposed multimodal search strategy.
长视频理解对多模态大语言模型(MLLMs)而言是一个重大挑战,主要原因在于数据规模庞大。一种关键且被广泛采用的使该任务在计算上可行的策略是关键帧检索,即在给定文本查询时识别与之最相关的稀疏视频帧集合。然而,该方法的有效性受限于文本查询与视觉内容之间的多模态对齐较弱,且未能捕捉到用于精确推理的复杂时间语义信息。为了解决这些问题,我们提出了视觉-字幕整合(VSI),这是一种将字幕、时间戳和场景边界整合到统一多模态检索过程中的多模态关键帧搜索方法。所提方法通过双流检索机制分别由视频检索流和字幕匹配流捕捉视频帧的视觉信息以及互补的文本信息,并通过两条检索流的交互来提高关键帧检索的准确性。 实验结果表明,VSI 在 LongVideoBench 的与文本相关子集上实现了 40.00% 的关键帧定位准确率,在下游长视频问答任务上实现了 68.48% 的准确率,分别比有竞争力的基线高出 20.35% 和 15.79%。此外,在 LongVideoBench 上,VSI 在中长视频问答任务中达到了最新水平(SOTA),展示了所提出的多模态检索策略的稳健性和泛化能力。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-09 07:38:48 UTC 发布:2025-08-09 07:38:48 UTC
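The idea of combining a visual search stream with a subtitle match stream can be illustrated with a simple score-fusion sketch. The abstract does not specify how the two streams interact, so the linear fusion, the `subtitle_weight` parameter, and the input formats below are all assumptions.

```python
def select_keyframes(visual_scores, subtitle_hits, k=2, subtitle_weight=0.5):
    """Pick k frame timestamps using fused visual and subtitle relevance.

    visual_scores: list of (timestamp_sec, score) from a hypothetical video stream.
    subtitle_hits: list of (start_sec, end_sec, score) from a hypothetical
                   subtitle stream, each covering a time span of the video.
    """
    fused = []
    for ts, visual in visual_scores:
        # Best subtitle score among the subtitle spans that cover this frame.
        subtitle = max(
            (score for start, end, score in subtitle_hits if start <= ts <= end),
            default=0.0,
        )
        fused.append((ts, (1 - subtitle_weight) * visual + subtitle_weight * subtitle))
    top = sorted(fused, key=lambda item: item[1], reverse=True)[:k]
    return sorted(ts for ts, _ in top)  # return in temporal order
```

Here a frame with a mediocre visual score can still be selected when a strongly matching subtitle overlaps its timestamp.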
#255 AGIC: Attention-Guided Image Captioning to Improve Caption Relevance #255 AGIC:基于注意力引导的图像描述以提高描述相关性
Authors: [L. D. M. S. Sai Teja](https://arxiv.org/search/?searchtype=author&query=L. D. M. S. Sai Teja), [Ashok Urlana](https://arxiv.org/search/?searchtype=author&query=Ashok Urlana), [Pruthwik Mishra](https://arxiv.org/search/?searchtype=author&query=Pruthwik Mishra) 作者:L. D. M. S. Sai Teja、Ashok Urlana、Pruthwik Mishra
Despite significant progress in image captioning, generating accurate and descriptive captions remains a long-standing challenge. In this study, we propose Attention-Guided Image Captioning (AGIC), which amplifies salient visual regions directly in the feature space to guide caption generation. We further introduce a hybrid decoding strategy that combines deterministic and probabilistic sampling to balance fluency and diversity. To evaluate AGIC, we conduct extensive experiments on the Flickr8k and Flickr30k datasets. The results show that AGIC matches or surpasses several state-of-the-art models while achieving faster inference. Moreover, AGIC demonstrates strong performance across multiple evaluation metrics, offering a scalable and interpretable solution for image captioning. 尽管图像字幕生成取得了显著进展,但生成准确且描述性的字幕仍然是一个长期存在的挑战。在本研究中,我们提出了注意力引导的图像字幕生成(AGIC),它在特征空间中直接放大显著的视觉区域以引导字幕生成。我们进一步引入了一种混合解码策略,结合确定性和概率采样以在流畅性和多样性之间取得平衡。为了评估 AGIC,我们在 Flickr8k 和 Flickr30k 数据集上进行了大量实验。结果表明,AGIC 与若干最先进模型相匹配或超越它们,同时实现了更快的推理速度。此外,AGIC 在多项评估指标上表现出色,提供了一种可扩展且具可解释性的图像字幕生成方案。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-09 06:42:25 UTC 发布:2025-08-09 06:42:25 UTC
#256 Towards Experience-Centered AI: A Framework for Integrating Lived Experience in Design and Development #256 面向体验中心的人工智能:在设计与开发中整合生活体验的框架
Authors: [Sanjana Gautam](https://arxiv.org/search/?searchtype=author&query=Sanjana Gautam), [Mohit Chandra](https://arxiv.org/search/?searchtype=author&query=Mohit Chandra), [Ankolika De](https://arxiv.org/search/?searchtype=author&query=Ankolika De), [Tatiana Chakravorti](https://arxiv.org/search/?searchtype=author&query=Tatiana Chakravorti), [Girik Malik](https://arxiv.org/search/?searchtype=author&query=Girik Malik), [Munmun De Choudhury](https://arxiv.org/search/?searchtype=author&query=Munmun De Choudhury) 作者:Sanjana Gautam、Mohit Chandra、Ankolika De、Tatiana Chakravorti、Girik Malik、Munmun De Choudhury
Lived experiences fundamentally shape how individuals interact with AI systems, influencing perceptions of safety, trust, and usability. While prior research has focused on developing techniques to emulate human preferences, and proposed taxonomies to categorize risks (such as psychological harms and algorithmic biases), these efforts have provided limited systematic understanding of lived human experiences or actionable strategies for embedding them meaningfully into the AI development lifecycle. This work proposes a framework for meaningfully integrating lived experience into the design and evaluation of AI systems. We synthesize interdisciplinary literature across lived experience philosophy, human-centered design, and human-AI interaction, arguing that centering lived experience can lead to models that more accurately reflect the retrospective, emotional, and contextual dimensions of human cognition. Drawing from a wide body of work across psychology, education, healthcare, and social policy, we present a targeted taxonomy of lived experiences with specific applicability to AI systems. To ground our framework, we examine three application domains (i) education, (ii) healthcare, and (iii) cultural alignment, illustrating how lived experience informs user goals, system expectations, and ethical considerations in each context. We further incorporate insights from AI system operators and human-AI partnerships to highlight challenges in responsibility allocation, mental model calibration, and long-term system adaptation. We conclude with actionable recommendations for developing experience-centered AI systems that are not only technically robust but also empathetic, context-aware, and aligned with human realities. This work offers a foundation for future research that bridges technical development with the lived experiences of those impacted by AI systems. 
个体的亲身体验从根本上影响他们与人工智能系统的互动方式,进而影响对安全性、信任和可用性的认知。尽管以往研究侧重于开发模仿人类偏好的技术,并提出将风险(如心理伤害和算法偏见)分类的体系,但这些工作在系统性理解人们的亲身体验或为在人工智能开发生命周期中有意义地嵌入这些体验提供可操作策略方面仍然有限。本研究提出了一个框架,用以在人工智能系统的设计与评估中有意义地整合亲身体验。我们综合了来自亲身体验哲学、以人为中心的设计和人机/人机智能交互的跨学科文献,论证以亲身体验为中心能够促成更准确反映人类认知的回顾性、情感性和情境性维度的模型。借鉴心理学、教育、医疗和社会政策等领域的大量研究,我们提出了一个针对性强且可具体应用于人工智能系统的亲身体验分类法。 为了使我们的框架具有实证基础,我们考察了三个应用领域:(i)教育、(ii)医疗保健和(iii)文化对齐,说明了在每个情境中,个人经验如何影响用户目标、系统期望和伦理考量。我们还吸收了来自人工智能系统操作者和人机协作方面的见解,以突显在责任分配、心理模型校准和系统长期适应性方面的挑战。最后我们提出了可行的建议,旨在开发以经验为中心的人工智能系统,这些系统不仅在技术上稳健,而且富有同理心、具备情境感知并与人的现实相一致。本研究为未来将技术开发与受人工智能系统影响者的生活经验相衔接的研究奠定了基础。
Subjects: Computers and Society, Artificial Intelligence, Human-Computer Interaction 主题:计算机与社会、人工智能、人机交互
Publish: 2025-08-09 06:12:40 UTC 发布时间:2025-08-09 06:12:40 UTC
#257 Highlight All the Phrases: Enhancing LLM Transparency through Visual Factuality Indicators #257 高亮所有短语:通过可视事实性指示器提升 LLM 的透明度
Authors: [Hyo Jin Do](https://arxiv.org/search/?searchtype=author&query=Hyo Jin Do), [Rachel Ostrand](https://arxiv.org/search/?searchtype=author&query=Rachel Ostrand), [Werner Geyer](https://arxiv.org/search/?searchtype=author&query=Werner Geyer), [Keerthiram Murugesan](https://arxiv.org/search/?searchtype=author&query=Keerthiram Murugesan), [Dennis Wei](https://arxiv.org/search/?searchtype=author&query=Dennis Wei), [Justin Weisz](https://arxiv.org/search/?searchtype=author&query=Justin Weisz) 作者:Hyo Jin Do、Rachel Ostrand、Werner Geyer、Keerthiram Murugesan、Dennis Wei、Justin Weisz
Large language models (LLMs) are susceptible to generating inaccurate or false information, often referred to as “hallucinations” or “confabulations.” While several technical advancements have been made to detect hallucinated content by assessing the factuality of the model’s responses, there is still limited research on how to effectively communicate this information to users. To address this gap, we conducted two scenario-based experiments with a total of 208 participants to systematically compare the effects of various design strategies for communicating factuality scores by assessing participants’ ratings of trust, ease in validating response accuracy, and preference. Our findings reveal that participants preferred and trusted a design in which all phrases within a response were color-coded based on factuality scores. Participants also found it easier to validate accuracy of the response in this style compared to a baseline with no style applied. Our study offers practical design guidelines for LLM application developers and designers, aimed at calibrating user trust, aligning with user preferences, and enhancing users’ ability to scrutinize LLM outputs. 大型语言模型 (LLMs) 易于生成不准确或虚假的信息,通常称为“幻觉”或“虚构”。尽管已经有若干技术进展用于通过评估模型回答的事实性来检测幻觉内容,但关于如何有效地将这些信息传达给用户的研究仍然有限。为填补这一空白,我们进行了两项基于情境的实验,共有 208 名参与者,系统地比较了用于传达事实性评分的各种设计策略的效果,评估参与者对信任度、验证回答准确性的难易程度以及偏好的评分。我们的研究发现,参与者更偏好并更信任一种将回答中所有短语按事实性评分进行颜色编码的设计。与未应用任何样式的基线相比,参与者也发现以这种样式更容易验证回答的准确性。我们的研究为 LLM 应用开发者和设计师提供了实用的设计指南,旨在校准用户信任、满足用户偏好并增强用户审查 LLM 输出的能力。
Subjects: Human-Computer Interaction, Artificial Intelligence 主题:人机交互,人工智能
Publish: 2025-08-09 06:00:15 UTC 发布:2025-08-09 06:00:15 UTC
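The preferred design in the study above, where every phrase in a response is color-coded by its factuality score, could be rendered along these lines. The thresholds, colors, and HTML output format are illustrative assumptions and are not taken from the paper.

```python
from html import escape

def color_for_score(score, low=0.4, high=0.7):
    """Map a factuality score in [0, 1] to a highlight color (illustrative thresholds)."""
    if score < low:
        return "#f8b4b4"  # red-ish: likely inaccurate
    if score < high:
        return "#fde68a"  # yellow-ish: uncertain
    return "#bbf7d0"      # green-ish: likely accurate

def highlight_response(scored_phrases):
    """Wrap every (phrase, score) pair in a colored <span>, so all phrases are marked."""
    return " ".join(
        f'<span style="background:{color_for_score(score)}">{escape(phrase)}</span>'
        for phrase, score in scored_phrases
    )
```

The factuality scores themselves would come from a separate detector; this sketch only covers the presentation step.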
#258 Who's the Evil Twin? Differential Auditing for Undesired Behavior #258 谁是邪恶的孪生?针对不良行为的差异审计
Authors: [Ishwar Balappanawar](https://arxiv.org/search/?searchtype=author&query=Ishwar Balappanawar), [Venkata Hasith Vattikuti](https://arxiv.org/search/?searchtype=author&query=Venkata Hasith Vattikuti), [Greta Kintzley](https://arxiv.org/search/?searchtype=author&query=Greta Kintzley), [Ronan Azimi-Mancel](https://arxiv.org/search/?searchtype=author&query=Ronan Azimi-Mancel), [Satvik Golechha](https://arxiv.org/search/?searchtype=author&query=Satvik Golechha) 作者:Ishwar Balappanawar、Venkata Hasith Vattikuti、Greta Kintzley、Ronan Azimi-Mancel、Satvik Golechha
Detecting hidden behaviors in neural networks poses a significant challenge due to minimal prior knowledge and potential adversarial obfuscation. We explore this problem by framing detection as an adversarial game between two teams: the red team trains two similar models, one trained solely on benign data and the other trained on data containing hidden harmful behavior, with the performance of both being nearly indistinguishable on the benign dataset. The blue team, with limited to no information about the harmful behavior, tries to identify the compromised model. We experiment using CNNs and try various blue team strategies, including Gaussian noise analysis, model diffing, integrated gradients, and adversarial attacks under different levels of hints provided by the red team. Results show high accuracy for adversarial-attack-based methods (100% correct prediction, using hints), which is very promising, whilst the other techniques yield more varied performance. During our LLM-focused rounds, we find that there are not many parallel methods that we could apply from our study with CNNs. Instead, we find that effective LLM auditing methods require some hints about the undesired distribution, which can then be used in standard black-box and open-weight methods to probe the models further and reveal their misalignment. We open-source our auditing games (with the model and data) and hope that our findings contribute to designing better audits. 在神经网络中检测隐藏行为是一项重大挑战,原因在于先验知识极少且可能存在对抗性混淆。我们通过将检测问题表述为两队之间的对抗游戏来探讨这一问题:红队训练两个相似的模型,一个仅在良性数据上训练,另一个在包含隐藏有害行为的数据上训练,但两者在良性数据集上的表现几乎无法区分。蓝队对有害行为知之甚少甚至一无所知,试图识别被妥协的模型。我们使用卷积神经网络进行实验,并尝试了多种蓝队策略,包括高斯噪声分析、模型差异比较、积分梯度以及在红队提供不同程度提示下的对抗性攻击。结果表明基于对抗性攻击的方法表现出很高的准确率(在使用提示时达到 100%正确预测),这是非常有希望的,而其他技术的表现则更为多样。在我们面向 LLM 的轮次中,我们发现很少有可以直接从在 CNN 上研究的方法并行应用到 LLM 上的方法。 相反,我们发现有效的 LLM 审计方法需要关于不期望分布的一些提示,然后可以在标准的黑箱和开放权重方法中使用这些提示来进一步探查模型并揭示其不对齐之处。我们开源了我们的审计游戏(包括模型和数据),并希望我们的发现能有助于设计更好的审计。
Subjects: Machine Learning, Artificial Intelligence, Cryptography and Security 主题:机器学习、人工智能、密码学与安全
Publish: 2025-08-09 04:57:38 UTC 发布日期:2025-08-09 04:57:38 UTC
#259 Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face #259 机器学习生态系统的剖析:Hugging Face 上的 200 万个模型
Authors: [Benjamin Laufer](https://arxiv.org/search/?searchtype=author&query=Benjamin Laufer), [Hamidah Oderinwale](https://arxiv.org/search/?searchtype=author&query=Hamidah Oderinwale), [Jon Kleinberg](https://arxiv.org/search/?searchtype=author&query=Jon Kleinberg) 作者:Benjamin Laufer, Hamidah Oderinwale, Jon Kleinberg
Many have observed that the development and deployment of generative machine learning (ML) and artificial intelligence (AI) models follow a distinctive pattern in which pre-trained models are adapted and fine-tuned for specific downstream tasks. However, there is limited empirical work that examines the structure of these interactions. This paper analyzes 1.86 million models on Hugging Face, a leading peer production platform for model development. Our study of model family trees (networks that connect fine-tuned models to their base or parent) reveals sprawling fine-tuning lineages that vary widely in size and structure. Using an evolutionary biology lens to study ML models, we use model metadata and model cards to measure the genetic similarity and mutation of traits over model families. We find that models tend to exhibit a family resemblance, meaning their genetic markers and traits exhibit more overlap when they belong to the same model family. However, these similarities depart in certain ways from standard models of asexual reproduction, because mutations are fast and directed, such that two 'sibling' models tend to exhibit more similarity than parent/child pairs. Further analysis of the directional drifts of these mutations reveals qualitative insights about the open machine learning ecosystem: licenses counter-intuitively drift from restrictive, commercial licenses towards permissive or copyleft licenses, often in violation of the upstream license's terms; models evolve from multilingual compatibility towards English-only compatibility; and model cards shrink in length and standardize by turning more often to templates and automatically generated text. Overall, this work takes a step toward an empirically grounded understanding of model fine-tuning and suggests that ecological models and methods can yield novel scientific insights.
许多人注意到,生成式机器学习(ML)和人工智能(AI)模型的开发与部署遵循一种独特模式:预训练模型被针对特定下游任务进行调整和微调。然而,关于这些相互作用结构的实证研究仍然有限。本文分析了 Hugging Face 上的 186 万个模型——该平台是模型开发的领先点对点生产平台。我们对模型家族树的研究(将微调模型与其基础或父模型相连接的网络)揭示了庞大的微调世系,其规模和结构差异很大。以进化生物学的视角研究 ML 模型,我们利用模型元数据和模型卡来衡量模型家族中的基因相似性和特征突变。我们发现,模型往往表现出家族相似性,这意味着当模型属于同一模型家族时,它们的基因标记和特征重叠更多。然而,这些相似性在某些方面偏离了无性繁殖的标准模型,因为突变发生得既快又有方向性,使得两个“兄弟”模型往往比父/子模型对更为相似。 对这些突变方向漂移的进一步分析揭示了关于开放机器学习生态系统的定性见解:许可出人意料地从限制性、商业许可转向宽松或强制开源(copyleft)许可,且常常违反上游许可的条款;模型从多语种兼容性演变为仅支持英语;模型卡则通过更频繁地采用模板和自动生成文本来缩短并标准化。总体而言,这项工作朝着基于实证的模型微调理解迈出了一步,并表明基于生态学的模型和方法可以带来新的科学洞见。
Subjects: Social and Information Networks, Artificial Intelligence, Computers and Society, Machine Learning 主题:社会与信息网络、人工智能、计算机与社会、机器学习
Publish: 2025-08-09 04:08:49 UTC 发布时间:2025-08-09 04:08:49 UTC
#260 Offline-to-Online Reinforcement Learning with Classifier-Free Diffusion Generation #260 离线到在线的强化学习与无分类器扩散生成
Authors: [Xiao Huang](https://arxiv.org/search/?searchtype=author&query=Xiao Huang), [Xu Liu](https://arxiv.org/search/?searchtype=author&query=Xu Liu), [Enze Zhang](https://arxiv.org/search/?searchtype=author&query=Enze Zhang), [Tong Yu](https://arxiv.org/search/?searchtype=author&query=Tong Yu), [Shuai Li](https://arxiv.org/search/?searchtype=author&query=Shuai Li) 作者:黄潇、刘旭、张恩泽、于彤、李帅
Offline-to-online Reinforcement Learning (O2O RL) aims to perform online fine-tuning on an offline pre-trained policy to minimize costly online interactions. Existing work uses offline datasets to generate data that conform to the online data distribution for data augmentation. However, the generated data still exhibits a gap with the online data, limiting overall performance. To address this, we propose a new data augmentation approach, Classifier-Free Diffusion Generation (CFDG). Without introducing additional classifier training overhead, CFDG leverages classifier-free guidance diffusion to significantly enhance the generation quality of offline and online data with different distributions. Additionally, it employs a reweighting method to enable more generated data to align with the online data, enhancing performance while maintaining the agent’s stability. Experimental results show that CFDG outperforms replaying the two data types or using a standard diffusion model to generate new data. Our method is versatile and can be integrated with existing offline-to-online RL algorithms. By applying CFDG to the popular methods IQL, PEX, and APL, we achieve a notable 15% average improvement in empirical performance on D4RL benchmark tasks such as MuJoCo and AntMaze. 离线到在线强化学习(O2O RL)旨在对离线预训练策略进行在线微调,以将昂贵的在线交互降到最低。现有工作使用离线数据集生成符合在线数据分布的数据以进行数据增强。然而,生成的数据仍与在线数据存在差距,限制了整体性能。为了解决这一问题,我们提出了一种新的数据增强方法——无分类器扩散生成(Classifier-Free Diffusion Generation,CFDG)。CFDG 在不引入额外分类器训练开销的情况下,利用无分类器引导扩散显著提升了在不同分布的离线与在线数据上的生成质量。此外,它采用重加权方法使更多生成数据与在线数据对齐,在提升性能的同时保持了智能体的稳定性。实验结果表明,CFDG 优于简单重放两类数据或使用标准扩散模型生成新数据。我们的方法通用且可与现有离线到在线 RL 算法集成。 通过将 CFDG 应用于流行方法 IQL、PEX 和 APL,我们在 D4RL 基准(如 MuJoCo 和 AntMaze)上的实证性能平均取得了显著的 15% 提升。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-09 03:32:23 UTC 发布:2025-08-09 03:32:23 UTC
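As background for the classifier-free guidance mentioned above, the sketch below shows the standard way a conditional and an unconditional noise prediction are combined at sampling time. How CFDG conditions on offline versus online data, and its reweighting step, are not covered here; the function name and list-based inputs are assumptions for illustration.

```python
def cfg_combine(eps_cond, eps_uncond, guidance_scale):
    """Classifier-free guidance: move from the unconditional noise prediction
    toward the conditional one by a factor of guidance_scale.

    guidance_scale = 0 gives the unconditional prediction, 1 gives the
    conditional prediction, and values above 1 extrapolate past it.
    """
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]
```

Because both predictions come from the same denoising network (with the condition dropped for the unconditional pass), no separate classifier needs to be trained, which matches the "no additional classifier training overhead" point in the abstract.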
#261 Hardness-Aware Dynamic Curriculum Learning for Robust Multimodal Emotion Recognition with Missing Modalities #261 难度感知的动态课程学习用于在模态缺失情况下实现稳健的多模态情感识别
Authors: [Rui Liu](https://arxiv.org/search/?searchtype=author&query=Rui Liu), [Haolin Zuo](https://arxiv.org/search/?searchtype=author&query=Haolin Zuo), [Zheng Lian](https://arxiv.org/search/?searchtype=author&query=Zheng Lian), [Hongyu Yuan](https://arxiv.org/search/?searchtype=author&query=Hongyu Yuan), [Qi Fan](https://arxiv.org/search/?searchtype=author&query=Qi Fan) 作者:刘睿、左昊霖、连正、袁宏宇、范琦
Missing modalities have recently emerged as a critical research direction in multimodal emotion recognition (MER). Conventional approaches typically address this issue through missing modality reconstruction. However, these methods fail to account for variations in reconstruction difficulty across different samples, consequently limiting the model’s ability to handle hard samples effectively. To overcome this limitation, we propose a novel Hardness-Aware Dynamic Curriculum Learning framework, termed HARDY-MER. Our framework operates in two key stages: first, it estimates the hardness level of each sample, and second, it strategically emphasizes hard samples during training to enhance model performance on these challenging instances. Specifically, we first introduce a Multi-view Hardness Evaluation mechanism that quantifies reconstruction difficulty by considering both Direct Hardness (modality reconstruction errors) and Indirect Hardness (cross-modal mutual information). Meanwhile, we introduce a Retrieval-based Dynamic Curriculum Learning strategy that dynamically adjusts the training curriculum by retrieving samples with similar semantic information and balancing the learning focus between easy and hard instances. Extensive experiments on benchmark datasets demonstrate that HARDY-MER consistently outperforms existing methods in missing-modality scenarios. Our code will be made publicly available at https://github.com/HARDY-MER/HARDY-MER. 缺失模态最近成为多模态情感识别(MER)中的一个关键研究方向。传统方法通常通过缺失模态重建来解决该问题。然而,这些方法未能考虑不同样本间重建难度的差异,因此限制了模型有效处理困难样本的能力。为克服这一局限,我们提出了一种新颖的困难感知动态课程学习框架,称为 HARDY-MER。我们的框架由两个关键阶段组成:首先,它估计每个样本的困难等级;其次,在训练过程中有策略地强调困难样本,以提升模型在这些挑战性样本上的表现。具体而言,我们首先引入了一种多视角困难评估机制,通过同时考虑直接困难(模态重建误差)和间接困难(跨模态互信息)来量化重建难度。 同时,我们提出了一种基于检索的动态课程学习策略,通过检索具有相似语义信息的样本来动态调整训练课程,并在简单样本与困难样本之间平衡学习重点。在基准数据集上的大量实验表明,HARDY-MER 在缺失模态场景下始终优于现有方法。我们的代码将公开发布于 https://github.com/HARDY-MER/HARDY-MER。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-09 03:10:56 UTC 发布:2025-08-09 03:10:56 UTC
#262 LSDTs: LLM-Augmented Semantic Digital Twins for Adaptive Knowledge-Intensive Infrastructure Planning #262 LSDTs:用于自适应知识密集型基础设施规划的 LLM 增强语义数字孪生
Authors: [Naiyi Li](https://arxiv.org/search/?searchtype=author&query=Naiyi Li), [Zihui Ma](https://arxiv.org/search/?searchtype=author&query=Zihui Ma), [Runlong Yu](https://arxiv.org/search/?searchtype=author&query=Runlong Yu), [Lingyao Li](https://arxiv.org/search/?searchtype=author&query=Lingyao Li) 作者:李乃怡、马子辉、于润龙、李灵尧
Digital Twins (DTs) offer powerful tools for managing complex infrastructure systems, but their effectiveness is often limited by challenges in integrating unstructured knowledge. Recent advances in Large Language Models (LLMs) bring new potential to address this gap, with strong abilities in extracting and organizing diverse textual information. We therefore propose LSDTs (LLM-Augmented Semantic Digital Twins), a framework that helps LLMs extract planning knowledge from unstructured documents like environmental regulations and technical guidelines, and organize it into a formal ontology. This ontology forms a semantic layer that powers a digital twin (a virtual model of the physical system), allowing it to simulate realistic, regulation-aware planning scenarios. We evaluate LSDTs through a case study of offshore wind farm planning in Maryland, including its application during Hurricane Sandy. Results demonstrate that LSDTs support interpretable, regulation-aware layout optimization, enable high-fidelity simulation, and enhance adaptability in infrastructure planning. This work shows the potential of combining generative AI with digital twins to support complex, knowledge-driven planning tasks. 数字孪生(DTs)为管理复杂基础设施系统提供了强大的工具,但在整合非结构化知识方面常常受限。大型语言模型(LLMs)的最新进展为弥补这一空白带来了新可能,它们在提取和组织多样化文本信息方面具有很强的能力。因此,我们提出了 LSDTs(LLM 增强语义数字孪生)框架,帮助 LLMs 从环境法规和技术指南等非结构化文档中提取规划知识,并将其组织成形式化本体。该本体构成了一个语义层,为数字孪生(物理系统的虚拟模型)提供支持,使其能够模拟现实且符合法规的规划情景。我们通过对马里兰海上风电场规划的案例研究评估了 LSDTs,包括其在桑迪飓风期间的应用。结果表明,LSDTs 支持可解释的、符合法规的布局优化,实现高保真模拟,并增强了基础设施规划的适应性。 这项工作展示了将生成式人工智能与数字孪生相结合以支持复杂、以知识为驱动的规划任务的潜力。
Subjects: Emerging Technologies, Artificial Intelligence 学科:新兴技术,人工智能
Publish: 2025-08-09 03:06:40 UTC 发布:2025-08-09 03:06:40 UTC
#263 Geometry-Aware Spiking Graph Neural Network #263 几何感知脉冲图神经网络
Authors: [Bowen Zhang](https://arxiv.org/search/?searchtype=author&query=Bowen Zhang), [Genan Dai](https://arxiv.org/search/?searchtype=author&query=Genan Dai), [Hu Huang](https://arxiv.org/search/?searchtype=author&query=Hu Huang), [Long Lan](https://arxiv.org/search/?searchtype=author&query=Long Lan) 作者:Bowen Zhang、Genan Dai、Hu Huang、Long Lan
Graph Neural Networks (GNNs) have demonstrated impressive capabilities in modeling graph-structured data, while Spiking Neural Networks (SNNs) offer high energy efficiency through sparse, event-driven computation. However, existing spiking GNNs predominantly operate in Euclidean space and rely on fixed geometric assumptions, limiting their capacity to model complex graph structures such as hierarchies and cycles. To overcome these limitations, we propose GSG, a novel Geometry-Aware Spiking Graph Neural Network that unifies spike-based neural dynamics with adaptive representation learning on Riemannian manifolds. GSG features three key components: a Riemannian Embedding Layer that projects node features into a pool of constant-curvature manifolds, capturing non-Euclidean structures; a Manifold Spiking Layer that models membrane potential evolution and spiking behavior in curved spaces via geometry-consistent neighbor aggregation and curvature-based attention; and a Manifold Learning Objective that enables instance-wise geometry adaptation through jointly optimized classification and link prediction losses defined over geodesic distances. All modules are trained using Riemannian SGD, eliminating the need for backpropagation through time. Extensive experiments on multiple benchmarks show that GSG achieves superior accuracy, robustness, and energy efficiency compared to both Euclidean SNNs and manifold-based GNNs, establishing a new paradigm for curvature-aware, energy-efficient graph learning.
图神经网络(GNNs)在建模图结构数据方面表现出令人印象深刻的能力,而脉冲神经网络(SNNs)通过稀疏的事件驱动计算提供了高能效。然而,现有的脉冲 GNN 主要在欧几里得空间中运行并依赖固定的几何假设,限制了它们对层次结构和环等复杂图结构的建模能力。为克服这些限制,我们提出了 GSG,一种新颖的几何感知脉冲图神经网络,将基于脉冲的神经动力学与黎曼流形上的自适应表示学习相结合。GSG 具有三个关键组件:一个黎曼嵌入层,将节点特征投影到一组常曲率流形中,以捕捉非欧几里得结构;一个流形脉冲层,通过与几何一致的邻居聚合和基于曲率的注意力,在弯曲空间中建模膜电位演化和脉冲行为;以及一个流形学习目标,通过在测地距离上联合优化分类和链路预测损失,实现按实例的几何自适应。 所有模块均使用黎曼随机梯度下降(Riemannian SGD)进行训练,无需通过时间反向传播。对多个基准数据集的大量实验表明,与欧几里得脉冲神经网络(SNNs)和基于流形的图神经网络(GNNs)相比,GSG 在准确性、鲁棒性和能效方面均具有优势,确立了一种面向曲率感知且节能的图学习新范式。
Subjects: Neural and Evolutionary Computing, Artificial Intelligence, Machine Learning 主题:神经与进化计算、人工智能、机器学习
Publish: 2025-08-09 02:52:38 UTC 发布:2025-08-09 02:52:38 UTC
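The abstract for #263 refers to losses defined over geodesic distances on constant-curvature manifolds. As a rough illustration only, the sketch below computes the textbook geodesic distance in three such spaces (flat, unit sphere, Poincaré ball with curvature -1). The function names are hypothetical and this is not the paper's implementation; the choice of these three spaces and of unit curvature is an assumption made for the example.

```python
import math


def euclidean_distance(x, y):
    """Geodesic distance in flat (zero-curvature) space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def sphere_distance(x, y):
    """Geodesic distance on the unit sphere (curvature +1).

    Assumes x and y are unit vectors; the dot product is clamped to
    [-1, 1] to guard against floating-point drift.
    """
    dot = sum(a * b for a, b in zip(x, y))
    return math.acos(max(-1.0, min(1.0, dot)))


def poincare_distance(x, y):
    """Geodesic distance in the Poincare ball (curvature -1).

    Assumes both points lie strictly inside the unit ball.
    """
    def sq_norm(v):
        return sum(a * a for a in v)

    diff = sq_norm([a - b for a, b in zip(x, y)])
    denom = (1.0 - sq_norm(x)) * (1.0 - sq_norm(y))
    return math.acosh(1.0 + 2.0 * diff / denom)
```

In the Poincaré ball, distances grow without bound as points approach the boundary, which is why hyperbolic space is often used for tree-like hierarchies, while the sphere suits cyclic structure.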
#264 Mode-Aware Non-Linear Tucker Autoencoder for Tensor-based Unsupervised Learning #264 模式感知非线性 Tucker 自编码器用于基于张量的无监督学习
Authors: [Junjing Zheng](https://arxiv.org/search/?searchtype=author&query=Junjing Zheng), [Chengliang Song](https://arxiv.org/search/?searchtype=author&query=Chengliang Song), [Weidong Jiang](https://arxiv.org/search/?searchtype=author&query=Weidong Jiang), [Xinyu Zhang](https://arxiv.org/search/?searchtype=author&query=Xinyu Zhang) 作者:Junjing Zheng、Chengliang Song、Weidong Jiang、Xinyu Zhang
High-dimensional data, particularly in the form of high-order tensors, presents a major challenge in self-supervised learning. While MLP-based autoencoders (AE) are commonly employed, their dependence on flattening operations exacerbates the curse of dimensionality, leading to excessively large model sizes, high computational overhead, and challenging optimization for deep structural feature capture. Although existing tensor networks alleviate computational burdens through tensor decomposition techniques, most exhibit limited capability in learning non-linear relationships. To overcome these limitations, we introduce the Mode-Aware Non-linear Tucker Autoencoder (MA-NTAE). MA-NTAE generalizes classical Tucker decomposition to a non-linear framework and employs a Pick-and-Unfold strategy, facilitating flexible per-mode encoding of high-order tensors via recursive unfold-encode-fold operations, effectively integrating tensor structural priors. Notably, MA-NTAE exhibits linear growth in computational complexity with tensor order and proportional growth with mode dimensions. Extensive experiments demonstrate MA-NTAE’s performance advantages over standard AE and current tensor networks in compression and clustering tasks, which become increasingly pronounced for higher-order, higher-dimensional tensors. 高维数据,尤其是以高阶张量形式存在的数据,在自监督学习中构成了重大挑战。尽管常用基于多层感知机(MLP)的自编码器(AE),但它们对展平操作的依赖加剧了维度灾难,导致模型规模过大、计算开销高,并且在捕捉深层结构特征时优化困难。尽管现有的张量网络通过张量分解技术缓解了计算负担,但大多数在学习非线性关系方面能力有限。为克服这些限制,我们提出了模式感知非线性塔克自编码器(MA-NTAE)。MA-NTAE 将经典塔克分解推广到非线性框架,并采用“选取-展开”策略,通过递归的展开-编码-折叠操作实现对高阶张量的按模式灵活编码,有效地整合了张量结构先验。值得注意的是,MA-NTAE 的计算复杂度随张量阶数呈线性增长,并随模式维度成比例增长。大量实验表明,MA-NTAE 在压缩和聚类任务中相比于标准自编码器和当前张量网络具有性能优势,且对于高阶、高维张量这种优势愈发显著。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-09 02:26:32 UTC 发布:2025-08-09 02:26:32 UTC
#265 PROPS: Progressively Private Self-alignment of Large Language Models #265 PROPS:大型语言模型的渐进式隐私自我对齐
Authors: [Noel Teku](https://arxiv.org/search/?searchtype=author&query=Noel Teku), [Fengwei Tian](https://arxiv.org/search/?searchtype=author&query=Fengwei Tian), [Payel Bhattacharjee](https://arxiv.org/search/?searchtype=author&query=Payel Bhattacharjee), [Souradip Chakraborty](https://arxiv.org/search/?searchtype=author&query=Souradip Chakraborty), [Amrit Singh Bedi](https://arxiv.org/search/?searchtype=author&query=Amrit Singh Bedi), [Ravi Tandon](https://arxiv.org/search/?searchtype=author&query=Ravi Tandon) 作者:Noel Teku、Fengwei Tian、Payel Bhattacharjee、Souradip Chakraborty、Amrit Singh Bedi、Ravi Tandon
Alignment is a key step in developing Large Language Models (LLMs) using human feedback to ensure adherence to human values and societal norms. Dependence on human feedback raises privacy concerns about how much a labeler’s preferences may reveal about their personal values, beliefs, and personality traits. Existing approaches, such as Differentially Private SGD (DP-SGD), provide rigorous privacy guarantees by privatizing gradients during fine-tuning and alignment but can provide more privacy than necessary as human preferences are tied only to labels of (prompt, response) pairs and can degrade model utility. This work focuses on LLM alignment with preference-level privacy, which preserves the privacy of preference labels provided by humans. We propose PROPS (PROgressively Private Self-alignment), a multi-stage privacy preserving alignment framework where privately aligned models in previous stages can serve as labelers for supplementing training data in the subsequent stages of alignment. We present theoretical guarantees for PROPS as well as comprehensive validation using multiple models (Pythia and GPT) and datasets (AlpacaEval, Anthropic HH-RLHF, truthy-dpo-v0.1) to demonstrate the utility of PROPS over existing methods while still providing high privacy. For the same privacy budget, alignment via PROPS can achieve up to 3x higher win-rates compared to DP-SGD, and 2.5x higher win-rates compared to Randomized Response (RR) based alignment. 对齐是使用人类反馈开发大型语言模型(LLMs)的关键步骤,旨在确保其遵循人类价值观和社会规范。对人类反馈的依赖引发了隐私担忧,即标注者的偏好可能会暴露他们的个人价值观、信仰和人格特质。现有方法,例如差分隐私随机梯度下降(DP-SGD),通过在微调和对齐过程中对梯度进行私有化来提供严格的隐私保证,但由于人类偏好仅与(提示,回应)对的标签相关,这类方法可能提供了超出必要的隐私保护,同时可能降低模型效用。本研究聚焦于具有偏好级别隐私的 LLM 对齐,即保护人类提供的偏好标签的隐私。我们提出了 PROPS(逐步私有自我对齐),这是一种多阶段的隐私保护对齐框架,其中前一阶段私有对齐的模型可以作为标注者,为后续对齐阶段补充训练数据。 我们为 PROPS 提供了理论保证,并通过多种模型(Pythia 和 GPT)和数据集(AlpacaEval、Anthropic HH-RLHF、truthy-dpo-v0.1)进行了全面验证,以证明在保持高隐私性的同时,PROPS 比现有方法更有用。在相同的隐私预算下,通过 PROPS 进行的对齐在胜率上相比 DP-SGD 可提高最多 3 倍,相比基于随机响应(RR)的对齐可提高 2.5 倍。
Subjects: Machine Learning, Artificial Intelligence, Cryptography and Security, Information Theory 主题:机器学习、人工智能、密码学与安全、信息论
Publish: 2025-08-09 02:17:47 UTC 发布:2025-08-09 02:17:47 UTC
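The abstract for #265 compares PROPS against Randomized Response (RR) based alignment, where only the preference labels are privatized. The sketch below shows standard binary randomized response applied to preference labels, not PROPS itself; the function names and the 0/1 label encoding are assumptions made for illustration.

```python
import math
import random


def keep_probability(epsilon):
    """Probability of reporting the true binary label under epsilon-RR.

    Equals e^eps / (1 + e^eps), written in a numerically stable form.
    """
    return 1.0 / (1.0 + math.exp(-epsilon))


def randomized_response(labels, epsilon, rng):
    """Flip each 0/1 preference label with probability 1 - keep_probability."""
    p = keep_probability(epsilon)
    return [y if rng.random() < p else 1 - y for y in labels]


def debiased_mean(noisy_labels, epsilon):
    """Unbiased estimate of the true mean label from RR outputs.

    Uses E[noisy] = (2p - 1) * mean + (1 - p); requires epsilon > 0.
    """
    p = keep_probability(epsilon)
    noisy_mean = sum(noisy_labels) / len(noisy_labels)
    return (noisy_mean - (1.0 - p)) / (2.0 * p - 1.0)
```

A smaller `epsilon` means more flipped labels and stronger label privacy, at the cost of noisier preference data for alignment; this is the utility gap that the abstract says PROPS aims to reduce.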
#266 BiXSE: Improving Dense Retrieval via Probabilistic Graded Relevance Distillation #266 BiXSE:通过概率分级相关性蒸馏改进密集检索
Authors: [Christos Tsirigotis](https://arxiv.org/search/?searchtype=author&query=Christos Tsirigotis), [Vaibhav Adlakha](https://arxiv.org/search/?searchtype=author&query=Vaibhav Adlakha), [Joao Monteiro](https://arxiv.org/search/?searchtype=author&query=Joao Monteiro), [Aaron Courville](https://arxiv.org/search/?searchtype=author&query=Aaron Courville), [Perouz Taslakian](https://arxiv.org/search/?searchtype=author&query=Perouz Taslakian) 作者:Christos Tsirigotis、Vaibhav Adlakha、Joao Monteiro、Aaron Courville、Perouz Taslakian
Neural sentence embedding models for dense retrieval typically rely on binary relevance labels, treating query-document pairs as either relevant or irrelevant. However, real-world relevance often exists on a continuum, and recent advances in large language models (LLMs) have made it feasible to scale the generation of fine-grained graded relevance labels. In this work, we propose BiXSE, a simple and effective pointwise training method that optimizes binary cross-entropy (BCE) over LLM-generated graded relevance scores. BiXSE interprets these scores as probabilistic targets, enabling granular supervision from a single labeled query-document pair per query. Unlike pairwise or listwise losses that require multiple annotated comparisons per query, BiXSE achieves strong performance with reduced annotation and compute costs by leveraging in-batch negatives. Extensive experiments across sentence embedding (MMTEB) and retrieval benchmarks (BEIR, TREC-DL) show that BiXSE consistently outperforms softmax-based contrastive learning (InfoNCE), and matches or exceeds strong pairwise ranking baselines when trained on LLM-supervised data. BiXSE offers a robust, scalable alternative for training dense retrieval models as graded relevance supervision becomes increasingly accessible. 用于密集检索的神经句子嵌入模型通常依赖二元相关标签,将查询-文档对视为相关或不相关。然而,现实中的相关性常常是连续的,且大型语言模型(LLMs)的最新进展使得规模化生成细粒度分级相关性标签成为可能。在这项工作中,我们提出了 BiXSE,一种简单且有效的逐点训练方法,通过对 LLM 生成的分级相关性分数优化二元交叉熵(BCE)。BiXSE 将这些分数解释为概率目标,使得每个查询仅需一个带标签的查询-文档对即可获得细粒度监督。不同于需要每个查询多次注释比较的成对或列表式损失,BiXSE 通过利用批内负样本,在减少标注和计算成本的同时仍能取得强劲性能。跨句子嵌入(MMTEB)和检索基准(BEIR、TREC-DL)的广泛实验证明,BiXSE 在使用 LLM 监督数据训练时,持续优于基于 softmax 的对比学习(InfoNCE),并在性能上匹配或超过强大的成对排序基线。 BiXSE 提供了一种稳健且可扩展的替代方案,用于在分级相关性监督变得越来越可获得的情况下训练密集检索模型。
Subjects: Information Retrieval, Artificial Intelligence, Machine Learning 主题:信息检索、人工智能、机器学习
Publish: 2025-08-09 02:15:17 UTC 发布:2025-08-09 02:15:17 UTC
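The abstract for #266 describes a pointwise binary cross-entropy objective over graded relevance scores, with in-batch negatives. The sketch below is one plausible reading of that idea, not the authors' code: the labeled pair for each query uses its graded score as a soft target, and every other document in the batch is treated as a negative with target 0. The `scale` and `bias` parameters and all function names are hypothetical.

```python
import math


def bce_with_logits(logit, target):
    """Binary cross-entropy between sigmoid(logit) and a soft target in [0, 1].

    Uses the stable form max(z, 0) - z * t + log(1 + exp(-|z|)).
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pointwise_graded_bce(query_vecs, doc_vecs, graded_scores, scale=1.0, bias=0.0):
    """Mean BCE over all query-document pairs in a batch.

    Pair (i, i) is the labeled pair with soft target graded_scores[i];
    pairs (i, j) with i != j are assumed in-batch negatives (target 0).
    """
    n = len(query_vecs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            logit = scale * dot(query_vecs[i], doc_vecs[j]) + bias
            target = graded_scores[i] if i == j else 0.0
            total += bce_with_logits(logit, target)
    return total / (n * n)
```

Unlike a softmax contrastive loss, each pair contributes independently here, so a partially relevant document (say, target 0.5) is pulled toward an intermediate similarity rather than being forced to the top of the batch.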
#267 Zero-Direction Probing: A Linear-Algebraic Framework for Deep Analysis of Large-Language-Model Drift #267 零方向探测:用于深度分析大型语言模型漂移的线性代数框架
Author: [Amit Pandey](https://arxiv.org/search/?searchtype=author&query=Amit Pandey) 作者:Amit Pandey
We present Zero-Direction Probing (ZDP), a theory-only framework for detecting model drift from null directions of transformer activations without task labels or output evaluations. Under assumptions A1–A6, we prove: (i) the Variance–Leak Theorem, (ii) Fisher Null-Conservation, (iii) a Rank–Leak bound for low-rank updates, and (iv) a logarithmic-regret guarantee for online null-space trackers. We derive a Spectral Null-Leakage (SNL) metric with non-asymptotic tail bounds and a concentration inequality, yielding a-priori thresholds for drift under a Gaussian null model. These results show that monitoring right/left null spaces of layer activations and their Fisher geometry provides concrete, testable guarantees on representational change. 我们提出了零方向探测(Zero-Direction Probing,ZDP),这是一个纯理论框架,用于在没有任务标签或输出评估的情况下,从 Transformer 激活的零方向检测模型漂移。在假设 A1–A6 下,我们证明了:(i) 方差-泄漏定理,(ii) 费舍尔零空间守恒,(iii) 针对低秩更新的秩-泄漏界,以及 (iv) 在线零空间跟踪器的对数遗憾保证。我们推导出具有非渐近尾界和集中不等式的谱零泄漏(Spectral Null-Leakage,SNL)度量,从而在高斯零模型下给出漂移的先验阈值。这些结果表明,监控层激活的右/左零空间及其费舍尔几何能够为表征变化提供具体的、可检验的保证。
Subjects: Machine Learning, Artificial Intelligence, Machine Learning 主题:机器学习,人工智能,机器学习
Publish: 2025-08-09 02:05:59 UTC 发布:2025-08-09 02:05:59 UTC
#268 PANAMA: A Network-Aware MARL Framework for Multi-Agent Path Finding in Digital Twin Ecosystems #268 PANAMA:面向数字孪生生态系统的网络感知多智能体路径规划 MARL 框架
Authors: [Arman Dogru](https://arxiv.org/search/?searchtype=author&query=Arman Dogru), [R. Irem Bor-Yaliniz](https://arxiv.org/search/?searchtype=author&query=R. Irem Bor-Yaliniz), [Nimal Gamini Senarath](https://arxiv.org/search/?searchtype=author&query=Nimal Gamini Senarath) 作者:Arman Dogru、R. Irem Bor-Yaliniz、Nimal Gamini Senarath
Digital Twins (DTs) are transforming industries through advanced data processing and analysis, positioning the world of DTs, the Digital World, as a cornerstone of next-generation technologies including embodied AI. As robotics and automated systems scale, efficient data-sharing frameworks and robust algorithms become critical. We explore the pivotal role of data handling in next-gen networks, focusing on dynamics between application and network providers (AP/NP) in DT ecosystems. We introduce PANAMA, a novel algorithm with Priority Asymmetry for Network Aware Multi-agent Reinforcement Learning (MARL) based multi-agent path finding (MAPF). By adopting a Centralized Training with Decentralized Execution (CTDE) framework and asynchronous actor-learner architectures, PANAMA accelerates training while enabling autonomous task execution by embodied AI. Our approach demonstrates superior pathfinding performance in accuracy, speed, and scalability compared to existing benchmarks. Through simulations, we highlight optimized data-sharing strategies for scalable, automated systems, ensuring resilience in complex, real-world environments. PANAMA bridges the gap between network-aware decision-making and robust multi-agent coordination, advancing the synergy between DTs, wireless networks, and AI-driven automation. 数字孪生(DTs)通过先进的数据处理和分析正在改变各行各业,使数字孪生所构成的世界(即“数字世界”)成为包括具身人工智能在内的新一代技术的基石。随着机器人和自动化系统的扩展,高效的数据共享框架和强健的算法变得至关重要。我们探讨了数据处理在下一代网络中的关键作用,重点关注 DT 生态系统中应用提供者与网络提供者(AP/NP)之间的动态关系。我们提出了 PANAMA,一种具有优先级非对称性的新算法,用于基于网络感知多智能体强化学习(MARL)的多智能体路径寻找(MAPF)。通过采用集中训练、分散执行(CTDE)框架和异步的 actor-learner 架构,PANAMA 在加速训练的同时使具身人工智能能够自主执行任务。与现有基准相比,我们的方法在路径寻找的准确性、速度和可扩展性上表现出更优的性能。通过仿真,我们突出展示了可扩展自动化系统的优化数据共享策略,确保在复杂的真实环境中的弹性。PANAMA 弥合了网络感知决策与稳健多智能体协同之间的鸿沟,推动了数字孪生、无线网络与人工智能驱动自动化之间的协同发展。
Subjects: Machine Learning, Artificial Intelligence, Distributed, Parallel, and Cluster Computing, Multiagent Systems, Robotics 主题:机器学习、人工智能、分布式、并行与集群计算、多智能体系统、机器人学
Publish: 2025-08-09 00:59:55 UTC 发布:2025-08-09 00:59:55 UTC
#269 SafePLUG: Empowering Multimodal LLMs with Pixel-Level Insight and Temporal Grounding for Traffic Accident Understanding #269 SafePLUG:通过像素级洞察和时间定位赋能多模态 LLMs 以理解交通事故
Authors: [Zihao Sheng](https://arxiv.org/search/?searchtype=author&query=Zihao Sheng), [Zilin Huang](https://arxiv.org/search/?searchtype=author&query=Zilin Huang), [Yen-Jung Chen](https://arxiv.org/search/?searchtype=author&query=Yen-Jung Chen), [Yansong Qu](https://arxiv.org/search/?searchtype=author&query=Yansong Qu), [Yuhao Luo](https://arxiv.org/search/?searchtype=author&query=Yuhao Luo), [Yue Leng](https://arxiv.org/search/?searchtype=author&query=Yue Leng), [Sikai Chen](https://arxiv.org/search/?searchtype=author&query=Sikai Chen) 作者:Zihao Sheng、Zilin Huang、Yen-Jung Chen、Yansong Qu、Yuhao Luo、Yue Leng、Sikai Chen
Multimodal large language models (MLLMs) have achieved remarkable progress across a range of vision-language tasks and demonstrate strong potential for traffic accident understanding. However, existing MLLMs in this domain primarily focus on coarse-grained image-level or video-level comprehension and often struggle to handle fine-grained visual details or localized scene components, limiting their applicability in complex accident scenarios. To address these limitations, we propose SafePLUG, a novel framework that empowers MLLMs with both Pixel-Level Understanding and temporal Grounding for comprehensive traffic accident analysis. SafePLUG supports both arbitrary-shaped visual prompts for region-aware question answering and pixel-level segmentation based on language instructions, while also enabling the recognition of temporally anchored events in traffic accident scenarios. To advance the development of MLLMs for traffic accident understanding, we curate a new dataset containing multimodal question-answer pairs centered on diverse accident scenarios, with detailed pixel-level annotations and temporal event boundaries. Experimental results show that SafePLUG achieves strong performance on multiple tasks, including region-based question answering, pixel-level segmentation, temporal event localization, and accident event understanding. These capabilities lay a foundation for fine-grained understanding of complex traffic scenes, with the potential to improve driving safety and enhance situational awareness in smart transportation systems. The code, dataset, and model checkpoints will be made publicly available at: https://zihaosheng.github.io/SafePLUG
多模态大型语言模型(MLLMs)在一系列视觉-语言任务上取得了显著进展,并在交通事故理解方面展现出强大潜力。然而,该领域现有的 MLLMs 主要侧重于粗粒度的图像级或视频级理解,常常难以处理精细的视觉细节或局部场景组成部分,从而限制了其在复杂事故场景中的适用性。为了解决这些局限性,我们提出了 SafePLUG,这是一种新颖框架,使 MLLMs 具备像素级理解和时间锚定的能力,以便对交通事故进行全面分析。SafePLUG 支持任意形状的视觉提示,用于区域感知的问答和基于语言指令的像素级分割,同时还能识别交通事故场景中的时间锚定事件。为了推进 MLLMs 在交通事故理解方面的发展,我们整理了一个包含围绕多样事故场景的多模态问答对的新数据集,数据集中包含详细的像素级标注和时间事件边界。实验结果表明,SafePLUG 在多项任务上表现出色,包括基于区域的问答、像素级分割、时间事件定位以及事故事件理解。这些能力为对复杂交通场景的细粒度理解奠定了基础,具有提升驾驶安全和增强智能交通系统态势感知的潜力。代码、数据集和模型检查点将公开发布于: https://zihaosheng.github.io/SafePLUG
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题:计算机视觉与模式识别、人工智能
Publish: 2025-08-09 00:25:24 UTC 发布:2025-08-09 00:25:24 UTC
#270 FoundBioNet: A Foundation-Based Model for IDH Genotyping of Glioma from Multi-Parametric MRI #270 FoundBioNet:一种基于基础模型的多参数 MRI 胶质瘤 IDH 基因分型模型
Authors: [Somayeh Farahani](https://arxiv.org/search/?searchtype=author&query=Somayeh Farahani), [Marjaneh Hejazi](https://arxiv.org/search/?searchtype=author&query=Marjaneh Hejazi), [Antonio Di Ieva](https://arxiv.org/search/?searchtype=author&query=Antonio Di Ieva), [Sidong Liu](https://arxiv.org/search/?searchtype=author&query=Sidong Liu) 作者:Somayeh Farahani, Marjaneh Hejazi, Antonio Di Ieva, Sidong Liu
Accurate, noninvasive detection of isocitrate dehydrogenase (IDH) mutation is essential for effective glioma management. Traditional methods rely on invasive tissue sampling, which may fail to capture a tumor’s spatial heterogeneity. While deep learning models have shown promise in molecular profiling, their performance is often limited by scarce annotated data. In contrast, foundation deep learning models offer a more generalizable approach for glioma imaging biomarkers. We propose a Foundation-based Biomarker Network (FoundBioNet) that utilizes a SWIN-UNETR-based architecture to noninvasively predict IDH mutation status from multi-parametric MRI. Two key modules are incorporated: Tumor-Aware Feature Encoding (TAFE) for extracting multi-scale, tumor-focused features, and Cross-Modality Differential (CMD) for highlighting subtle T2-FLAIR mismatch signals associated with IDH mutation. The model was trained and validated on a diverse, multi-center cohort of 1705 glioma patients from six public datasets. Our model achieved AUCs of 90.58%, 88.08%, 65.41%, and 80.31% on independent test sets from EGD, TCGA, Ivy GAP, RHUH, and UPenn, consistently outperforming baseline approaches (p <= 0.05). Ablation studies confirmed that both the TAFE and CMD modules are essential for improving predictive accuracy. By integrating large-scale pretraining and task-specific fine-tuning, FoundBioNet enables generalizable glioma characterization. This approach enhances diagnostic accuracy and interpretability, with the potential to enable more personalized patient care. 
准确且无创地检测异柠檬酸脱氢酶(IDH)突变对于有效的胶质瘤管理至关重要。传统方法依赖侵入性组织取样,可能无法捕捉肿瘤的空间异质性。尽管深度学习模型在分子分型方面显示出潜力,但其性能常受限于标注数据稀缺。相比之下,基础深度学习模型为胶质瘤影像生物标志物提供了更具泛化性的途径。我们提出了一种基于基础模型的生物标志物网络(FoundBioNet),采用基于 SWIN-UNETR 的架构,从多参数 MRI 无创预测 IDH 突变状态。模型中整合了两个关键模块:肿瘤感知特征编码(TAFE),用于提取多尺度、聚焦肿瘤的特征;以及跨模态差异(CMD),用于突出与 IDH 突变相关的细微 T2-FLAIR 不匹配信号。该模型在来自六个公开数据集的 1705 例胶质瘤患者的多中心多样化队列上进行了训练和验证。 我们的模型在来自 EGD、TCGA、Ivy GAP、RHUH 和 UPenn 的独立测试集上取得了 90.58%、88.08%、65.41% 和 80.31% 的 AUC,稳定地优于基线方法(p <= 0.05)。消融研究证实 TAFE 和 CMD 模块对于提高预测精度均不可或缺。通过整合大规模预训练与任务特定的微调,FoundBioNet 实现了可泛化的胶质瘤表征。该方法提高了诊断的准确性和可解释性,有望促成更个性化的患者护理。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题:计算机视觉与模式识别、人工智能
Publish: 2025-08-09 00:08:10 UTC 发布:2025-08-09 00:08:10 UTC
#271 Many-Turn Jailbreaking #271 多轮越狱攻击
Authors: [Xianjun Yang](https://arxiv.org/search/?searchtype=author&query=Xianjun Yang), [Liqiang Xiao](https://arxiv.org/search/?searchtype=author&query=Liqiang Xiao), [Shiyang Li](https://arxiv.org/search/?searchtype=author&query=Shiyang Li), [Faisal Ladhak](https://arxiv.org/search/?searchtype=author&query=Faisal Ladhak), [Hyokun Yun](https://arxiv.org/search/?searchtype=author&query=Hyokun Yun), [Linda Ruth Petzold](https://arxiv.org/search/?searchtype=author&query=Linda Ruth Petzold), [Yi Xu](https://arxiv.org/search/?searchtype=author&query=Yi Xu), [William Yang Wang](https://arxiv.org/search/?searchtype=author&query=William Yang Wang) 作者:Xianjun Yang、Liqiang Xiao、Shiyang Li、Faisal Ladhak、Hyokun Yun、Linda Ruth Petzold、Yi Xu、William Yang Wang
Current jailbreaking work on large language models (LLMs) aims to elicit unsafe outputs from given prompts. However, it focuses only on single-turn jailbreaking targeting one specific query. In contrast, advanced LLMs are designed to handle extremely long contexts and can thus conduct multi-turn conversations. We therefore propose exploring multi-turn jailbreaking, in which the jailbroken LLMs are continuously tested on more than the first-turn conversation or a single target query. This is an even more serious threat because 1) it is common for users to continue asking relevant follow-up questions to clarify certain jailbroken details, and 2) it is also possible that the initial round of jailbreaking causes the LLMs to respond to additional irrelevant questions consistently. As the first step (first draft completed in June 2024) in exploring multi-turn jailbreaking, we construct a Multi-Turn Jailbreak Benchmark (MTJ-Bench) for benchmarking this setting on a series of open- and closed-source models and provide novel insights into this new safety threat. By revealing this new vulnerability, we aim to call for community efforts to build safer LLMs and pave the way for a more in-depth understanding of jailbreaking LLMs. 当前针对大型语言模型(LLMs)的越狱研究旨在从给定提示中引出不安全的输出。然而,这些研究仅关注针对单一查询的单轮越狱。相比之下,先进的 LLMs 被设计用于处理极长的上下文,因此能够进行多轮对话。基于此,我们提出探索多轮越狱,其中被越狱的 LLMs 在超过首轮对话或单一目标查询的情况下持续受到测试。这是一种更严重的威胁,因为 1)用户常常会继续提出相关的后续问题以澄清某些越狱细节,且 2)首次越狱回合也可能导致 LLMs 持续对额外的无关问题做出响应。作为探索多轮越狱的第一步(初稿完成于 2024 年 6 月),我们构建了一个多轮越狱基准(MTJ-Bench),用于在一系列开源和闭源模型上对该设置进行基准测试,并为这一新的安全威胁提供了新见解。通过揭示这一新型漏洞,我们旨在号召社区共同努力构建更安全的 LLMs,并为更深入理解对 LLMs 进行越狱奠定基础。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-09 00:02:39 UTC 发布:2025-08-09 00:02:39 UTC
#272 Analysis of Schedule-Free Nonconvex Optimization #272 免调度(Schedule-Free)非凸优化分析
Author: [Connor Brown](https://arxiv.org/search/?searchtype=author&query=Connor Brown) 作者:Connor Brown
First-order methods underpin most large-scale learning algorithms, yet their classical convergence guarantees hinge on carefully scheduled step sizes that depend on the total horizon $T$, which is rarely known in advance. The Schedule-Free (SF) method promises optimal performance with hyperparameters that are independent of $T$ by interpolating between Polyak–Ruppert averaging and momentum, but nonconvex analysis of SF has been limited or reliant on strong global assumptions. We introduce a robust Lyapunov framework that, under only $L$-smoothness and lower-boundedness, reduces SF analysis to a single-step descent inequality. This yields horizon-agnostic bounds in the nonconvex setting: $O(1/\log T)$ for a constant step size with PR averaging, $O(\log T/T)$ for a linearly growing step size, and a continuum of $O(T^{-(1-\alpha)})$ rates for polynomial averaging. We complement these proofs with Performance Estimation Problem (PEP) experiments that numerically validate our rates and suggest that our $O(1/\log T)$ bound on the original nonconvex SF algorithm may tighten to $O(1/T)$. Our work extends SF’s horizon-free guarantees to smooth nonconvex optimization and charts future directions for optimal nonconvex rates. 一阶方法支撑着大多数大规模学习算法,然而它们的经典收敛保证依赖于根据总时域 $T$ 精心安排的步长,而 $T$ 很少能事先知晓。Schedule-Free(SF)方法通过在 Polyak–Ruppert 平均和动量之间插值,承诺以与 $T$ 无关的超参数实现最优性能,但对 SF 的非凸分析一直有限或依赖于强全局假设。我们引入了一个稳健的李雅普诺夫框架,仅在 $L$-光滑性和有下界的条件下,就将 SF 的分析简化为单步下降不等式。这在非凸情形下带来了与时域无关的界:对常数步长加 PR 平均为 $O(1/\log T)$;对线性增长步长为 $O(\log T/T)$;对多项式平均则给出一系列 $O(T^{-(1-\alpha)})$ 速率。我们用性能估计问题(PEP)实验补充了这些证明,这些实验在数值上验证了我们的速率,并表明我们对原始非凸 SF 算法的 $O(1/\log T)$ 上界可能收紧为 $O(1/T)$。我们的工作将 SF 与时域无关的保证扩展到光滑非凸优化,并为非凸最优收敛速率的未来研究指明了方向。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-08 22:54:35 UTC 发布:2025-08-08 22:54:35 UTC
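The abstract for #272 says Schedule-Free interpolates between Polyak–Ruppert averaging and momentum. A minimal sketch of the commonly stated Schedule-Free SGD recursion is given below, assuming a constant step size and uniform averaging weights $c_{t+1} = 1/(t+1)$; the function name, the scalar setting, and the toy quadratic objective are illustrative assumptions, and this is not the analysis or code from the paper.

```python
def schedule_free_sgd(grad, x0, step_size=0.1, beta=0.9, num_steps=2000):
    """Schedule-Free SGD on a scalar problem (illustrative sketch).

    z is the base SGD sequence, x is the running average of z, and the
    gradient is evaluated at y, an interpolation between z and x.
    beta = 0 evaluates gradients at z (SGD with Polyak-Ruppert averaging);
    beta = 1 evaluates them at the average x (primal averaging).
    """
    x = x0
    z = x0
    for t in range(1, num_steps + 1):
        y = (1.0 - beta) * z + beta * x   # gradient evaluation point
        z = z - step_size * grad(y)       # base SGD step, constant step size
        c = 1.0 / (t + 1)                 # uniform averaging weight
        x = (1.0 - c) * x + c * z         # running average of z
    return x


def quadratic_grad(x):
    """Gradient of the toy objective f(x) = (x - 3)^2 (hypothetical example)."""
    return 2.0 * (x - 3.0)


x_final = schedule_free_sgd(quadratic_grad, x0=0.0)
```

Note that no hyperparameter here depends on the total number of steps, which is the horizon-free property the abstract emphasizes; the toy objective is convex and only demonstrates the mechanics, not the nonconvex rates.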
#273 Learning Causal Structure Distributions for Robust Planning #273 学习用于鲁棒规划的因果结构分布
Authors: [Alejandro Murillo-Gonzalez](https://arxiv.org/search/?searchtype=author&query=Alejandro Murillo-Gonzalez), [Junhong Xu](https://arxiv.org/search/?searchtype=author&query=Junhong Xu), [Lantao Liu](https://arxiv.org/search/?searchtype=author&query=Lantao Liu) 作者:Alejandro Murillo-Gonzalez, Junhong Xu, Lantao Liu
Structural causal models describe how the components of a robotic system interact. They provide both structural and functional information about the relationships that are present in the system. The structural information outlines the variables among which there is interaction. The functional information describes how such interactions work, via equations or learned models. In this paper we find that learning the functional relationships while accounting for the uncertainty about the structural information leads to more robust dynamics models, which improves downstream planning while using significantly lower computational resources. This is in contrast with common model-learning methods that ignore the causal structure and fail to leverage the sparsity of interactions in robotic systems. We achieve this by estimating a causal structure distribution that is used to sample causal graphs that inform the latent-space representations in an encoder-multidecoder probabilistic model. We show that our model can be used to learn the dynamics of a robot, which together with a sampling-based planner can be used to perform new tasks in novel environments, provided an objective function for the new requirement is available. We validate our method using manipulators and mobile robots in both simulation and the real world. Additionally, we validate the learned dynamics’ adaptability and increased robustness to corrupted inputs and changes in the environment, which is highly desirable in challenging real-world robotics scenarios. Video: https://youtu.be/X6k5t7OOnNc.
结构因果模型描述了机器人系统各组件之间如何相互作用。它们既提供系统中存在的关系的结构信息,也提供功能性信息。结构信息概述了相互作用所涉及的变量。功能信息通过方程或学习到的模型描述这些相互作用如何运作。在本文中,我们发现,在学习功能关系时考虑到关于结构信息的不确定性,会得到更稳健的动力学模型,从而改进后续的规划,同时显著降低计算资源的使用。这与常见的模型学习方法形成对比,后者忽视因果结构,未能利用机器人系统中相互作用的稀疏性。我们的做法是估计一个因果结构分布,用它来采样因果图,这些因果图为编码器-多解码器概率模型中的潜在空间表示提供信息。 我们展示了我们的模型可用于学习机器人的动力学,并且结合基于采样的规划器,在为新需求提供目标函数的前提下,可用于在新环境中执行新任务。我们在仿真和真实世界中使用机械臂和移动机器人验证了我们的方法。此外,我们验证了所学动力学在适应性方面的能力以及对损坏输入和环境变化的增强鲁棒性,这在具有挑战性的现实机器人场景中非常可取。视频:https://youtu.be/X6k5t7OOnNc。
Subjects: Robotics, Artificial Intelligence, Machine Learning, Systems and Control 主题:机器人学、人工智能、机器学习、系统与控制
Publish: 2025-08-08 22:43:17 UTC 发布:2025-08-08 22:43:17 UTC
#274 Large Language Models for Oral History Understanding with Text Classification and Sentiment Analysis #274 面向口述历史理解的用于文本分类与情感分析的大型语言模型
Authors: [Komala Subramanyam Cherukuri](https://arxiv.org/search/?searchtype=author&query=Komala Subramanyam Cherukuri), [Pranav Abishai Moses](https://arxiv.org/search/?searchtype=author&query=Pranav Abishai Moses), [Aisa Sakata](https://arxiv.org/search/?searchtype=author&query=Aisa Sakata), [Jiangping Chen](https://arxiv.org/search/?searchtype=author&query=Jiangping Chen), [Haihua Chen](https://arxiv.org/search/?searchtype=author&query=Haihua Chen) 作者:Komala Subramanyam Cherukuri、Pranav Abishai Moses、Aisa Sakata、Jiangping Chen、Haihua Chen
Oral histories are vital records of lived experience, particularly within communities affected by systemic injustice and historical erasure. Effective and efficient analysis of their oral history archives can promote access and understanding of the oral histories. However, large-scale analysis of these archives remains limited due to their unstructured format, emotional complexity, and high annotation costs. This paper presents a scalable framework to automate semantic and sentiment annotation for Japanese American Incarceration Oral History. Using LLMs, we construct a high-quality dataset, evaluate multiple models, and test prompt engineering strategies in historically sensitive contexts. Our multiphase approach combines expert annotation, prompt design, and LLM evaluation with ChatGPT, Llama, and Qwen. We labeled 558 sentences from 15 narrators for sentiment and semantic classification, then evaluated zero-shot, few-shot, and RAG strategies. For semantic classification, ChatGPT achieved the highest F1 score (88.71%), followed by Llama (84.99%) and Qwen (83.72%). For sentiment analysis, Llama slightly outperformed Qwen (82.66%) and ChatGPT (82.29%), with all models showing comparable results. The best prompt configurations were used to annotate 92,191 sentences from 1,002 interviews in the JAIOH collection. Our findings show that LLMs can effectively perform semantic and sentiment annotation across large oral history collections when guided by well-designed prompts. This study provides a reusable annotation pipeline and practical guidance for applying LLMs in culturally sensitive archival analysis. By bridging archival ethics with scalable NLP techniques, this work lays the groundwork for responsible use of artificial intelligence in digital humanities and preservation of collective memory. GitHub: https://github.com/kc6699c/LLM4OralHistoryAnalysis.
口述历史是记录亲身经历的重要资料,尤其是在受到系统性不公和历史抹杀影响的社区中。对这些口述历史档案进行有效且高效的分析可以促进人们对口述历史的获取与理解。然而,由于档案的非结构化格式、情感复杂性以及高昂的标注成本,对这些档案的大规模分析仍然受限。本文提出了一个可扩展的框架,用于对日裔美国人拘禁口述历史进行自动化的语义与情感标注。我们使用 LLMs 构建了高质量数据集、评估了多种模型,并在具有历史敏感性的情境中测试了提示工程策略。我们的多阶段方法将专家标注、提示设计与使用 ChatGPT、Llama 和 Qwen 的 LLM 评估相结合。我们对来自 15 位叙述者的 558 个句子进行了情感与语义分类标注,然后评估了零样本、少样本与 RAG 策略。在语义分类方面,ChatGPT 获得了最高的 F1 分数(88.71%),其后是 Llama(84.99%)和 Qwen(83.72%)。在情感分析方面,Llama 略微优于 Qwen(82.66%)和 ChatGPT(82.29%),所有模型表现相近。使用最佳提示配置,我们对 JAIOH 集合中 1,002 次访谈的 92,191 个句子进行了标注。我们的研究表明,在精心设计的提示引导下,LLMs 能够在大规模口述历史集合中有效执行语义和情感标注。本研究提供了一个可复用的标注流程以及将 LLMs 应用于具有文化敏感性的档案分析的实用指导。通过将档案伦理与可扩展的自然语言处理技术相结合,这项工作为在数字人文和集体记忆保存中负责任地使用人工智能奠定了基础。GitHub: https://github.com/kc6699c/LLM4OralHistoryAnalysis。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-08 22:06:23 UTC 发布:2025-08-08 22:06:23 UTC
#275 Play Favorites: A Statistical Method to Measure Self-Bias in LLM-as-a-Judge #275 偏袒自己:一种衡量 LLM 作为评审时自我偏见的统计方法
Authors: [Evangelia Spiliopoulou](https://arxiv.org/search/?searchtype=author&query=Evangelia Spiliopoulou), [Riccardo Fogliato](https://arxiv.org/search/?searchtype=author&query=Riccardo Fogliato), [Hanna Burnsky](https://arxiv.org/search/?searchtype=author&query=Hanna Burnsky), [Tamer Soliman](https://arxiv.org/search/?searchtype=author&query=Tamer Soliman), [Jie Ma](https://arxiv.org/search/?searchtype=author&query=Jie Ma), [Graham Horwood](https://arxiv.org/search/?searchtype=author&query=Graham Horwood), [Miguel Ballesteros](https://arxiv.org/search/?searchtype=author&query=Miguel Ballesteros) 作者:Evangelia Spiliopoulou、Riccardo Fogliato、Hanna Burnsky、Tamer Soliman、Jie Ma、Graham Horwood、Miguel Ballesteros
Large language models (LLMs) can serve as judges that offer rapid and reliable assessments of other LLM outputs. However, models may systematically assign overly favorable ratings to their own outputs, a phenomenon known as self-bias, which can distort evaluations of true model performance. Previous studies often conflate genuine differences in model quality with bias or incorrectly assume that evaluations from LLMs and humans follow the same rating distributions. In this work, we present a statistical framework that explicitly formalizes assumptions under which self-bias can be identified and estimated. Our method models the difference in the scoring distribution that LLM-as-a-judge assigns to its own completions compared to other models, while accounting for the underlying quality of the completions provided by an independent, third-party judge (e.g., humans). Our method reliably isolates and quantifies self-bias, even when models vary in ability, ensuring that genuine performance differences are not mistaken for self-bias. We conduct an empirical analysis of self-bias on a large dataset (>5000 prompt-completion pairs) consisting of expert human annotations and judgments from nine different LLM judges. We find that some models, such as GPT-4o and Claude 3.5 Sonnet, systematically assign higher scores to their own outputs. These models also display family bias, systematically assigning higher ratings to outputs produced by other models of the same family. Our findings highlight potential pitfalls of using LLM judges and offer practical guidance to mitigate biases when interpreting automated evaluations.
大型语言模型 (LLMs) 可以作为评审,为其他 LLM 的输出提供快速且可靠的评估。然而,模型可能会系统性地对自己的输出给出过于有利的评分,这一现象被称为自我偏差(self-bias),它可能扭曲对模型真实性能的评估。以往的研究常常将模型质量的真实差异与偏差混为一谈,或错误地假设 LLM 和人类的评估遵循相同的评分分布。在本研究中,我们提出了一个统计框架,明确形式化了在何种假设下可以识别和估计自我偏差。我们的方法对 LLM 作为评审时对自身生成与对其他模型生成的评分分布差异建模,同时考虑由独立第三方评审(例如人类)提供的生成质量的潜在差异。即便在模型能力各不相同的情况下,我们的方法也能可靠地隔离并量化自我偏差,确保不会将真实的性能差异误认为自我偏差。 我们对一个大规模数据集(>5000 条提示-完成对)进行了关于自我偏见的实证分析,该数据集由专家人工注释和来自九个不同 LLM 评审者的判断组成。我们发现一些模型(例如 GPT-4o 和 Claude 3.5 Sonnet)系统性地对其自身的输出给出更高的分数。这些模型还表现出家族偏见;系统性地对同一系列的其他模型产生的输出给出更高的评分。我们的研究结果凸显了使用 LLM 评审者时可能存在的陷阱,并为在解读自动化评估时减轻偏见提供了实用的指导。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-08 21:22:12 UTC 发布:2025-08-08 21:22:12 UTC
#276 MMFformer: Multimodal Fusion Transformer Network for Depression Detection #276 MMFformer:用于抑郁检测的多模态融合变换器网络
Authors: [Md Rezwanul Haque](https://arxiv.org/search/?searchtype=author&query=Md Rezwanul Haque), [Md. Milon Islam](https://arxiv.org/search/?searchtype=author&query=Md. Milon Islam), [S M Taslim Uddin Raju](https://arxiv.org/search/?searchtype=author&query=S M Taslim Uddin Raju), [Hamdi Altaheri](https://arxiv.org/search/?searchtype=author&query=Hamdi Altaheri), [Lobna Nassar](https://arxiv.org/search/?searchtype=author&query=Lobna Nassar), [Fakhri Karray](https://arxiv.org/search/?searchtype=author&query=Fakhri Karray) 作者:Md Rezwanul Haque、Md. Milon Islam、S M Taslim Uddin Raju、Hamdi Altaheri、Lobna Nassar、Fakhri Karray
Depression is a serious mental health illness that significantly affects an individual’s well-being and quality of life, making early detection crucial for adequate care and treatment. Detecting depression is often difficult, as it is based primarily on subjective evaluations during clinical interviews. Hence, the early diagnosis of depression, thanks to the content of social networks, has become a prominent research area. The extensive and diverse nature of user-generated information poses a significant challenge, limiting the accurate extraction of relevant temporal information and the effective fusion of data across multiple modalities. This paper introduces MMFformer, a multimodal depression detection network designed to retrieve depressive spatio-temporal high-level patterns from multimodal social media information. The transformer network with residual connections captures spatial features from videos, and a transformer encoder is exploited to design important temporal dynamics in audio. Moreover, the fusion architecture fused the extracted features through late and intermediate fusion strategies to find out the most relevant intermodal correlations among them. Finally, the proposed network is assessed on two large-scale depression detection datasets, and the results clearly reveal that it surpasses existing state-of-the-art approaches, improving the F1-Score by 13.92% for D-Vlog dataset and 7.74% for LMVD dataset. The code is made available publicly at https://github.com/rezwanh001/Large-Scale-Multimodal-Depression-Detection. 
抑郁症是一种严重的心理健康疾病,显著影响个体的福祉和生活质量,因此早期检测对获得适当的护理和治疗至关重要。抑郁症的检测通常很困难,因为主要依赖于临床访谈中的主观评估。因此,基于社交网络内容的抑郁症早期诊断已成为一个重要的研究领域。用户生成信息的广泛性和多样性带来了重大挑战,限制了相关时序信息的准确提取以及跨多模态数据的有效融合。本文提出了 MMFformer,一种多模态抑郁检测网络,旨在从多模态社交媒体信息中检索抑郁的时空高级模式。具有残差连接的 Transformer 网络从视频中捕捉空间特征,Transformer 编码器被用于设计音频中的重要时间动态。此外,融合架构通过后期融合和中期融合策略融合提取的特征,以找出它们之间最相关的模态间关联。 最后,所提出的网络在两个大规模抑郁检测数据集上进行了评估,结果清楚地显示其优于现有的最先进方法,在 D-Vlog 数据集上将 F1 分数提高了 13.92%,在 LMVD 数据集上提高了 7.74%。代码已公开发布于 https://github.com/rezwanh001/Large-Scale-Multimodal-Depression-Detection 。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning, Sound 主题:计算机视觉与模式识别,人工智能,机器学习,声音
Publish: 2025-08-08 21:03:29 UTC 发布:2025-08-08 21:03:29 UTC
#277 Do Biased Models Have Biased Thoughts? #277 有偏模型会有有偏的“思维”吗?
Authors: [Swati Rajwal](https://arxiv.org/search/?searchtype=author&query=Swati Rajwal), [Shivank Garg](https://arxiv.org/search/?searchtype=author&query=Shivank Garg), [Reem Abdel-Salam](https://arxiv.org/search/?searchtype=author&query=Reem Abdel-Salam), [Abdelrahman Zayed](https://arxiv.org/search/?searchtype=author&query=Abdelrahman Zayed) 作者:Swati Rajwal、Shivank Garg、Reem Abdel-Salam、Abdelrahman Zayed
The impressive performance of language models is undeniable. However, the presence of biases based on gender, race, socio-economic status, physical appearance, and sexual orientation makes the deployment of language models challenging. This paper studies the effect of chain-of-thought prompting, a recent approach that studies the steps followed by the model before it responds, on fairness. More specifically, we ask the following question: Do biased models have biased thoughts? To answer our question, we conduct experiments on 5 popular large language models using fairness metrics to quantify 11 different biases in the model’s thoughts and output. Our results show that the bias in the thinking steps is not highly correlated with the output bias (less than 0.6 correlation with a p-value smaller than 0.001 in most cases). In other words, unlike human beings, the tested models with biased decisions do not always possess biased thoughts. 语言模型令人印象深刻的性能是毋庸置疑的。然而,基于性别、种族、社会经济地位、外貌和性取向的偏见使得语言模型的部署变得具有挑战性。本文研究了“思维链提示”(chain-of-thought prompting)这一近期方法对公平性的影响,该方法研究模型在回答前所遵循的步骤。更具体地,我们提出了这样一个问题:有偏见的模型是否会有偏见的“思维”?为回答这一问题,我们对 5 款流行的大型语言模型进行了实验,使用公平性度量来量化模型“思维”和输出中的 11 种不同偏见。我们的结果表明,思考步骤中的偏见与输出偏见并不高度相关(在大多数情况下相关性小于 0.6,且 p 值小于 0.001)。换句话说,与人类不同,测试中那些做出有偏见决策的模型并不总是拥有有偏见的思维。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-08 19:41:20 UTC 发布:2025-08-08 19:41:20 UTC
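To make the "less than 0.6 correlation" statement concrete, here is a minimal sketch of a Pearson correlation between per-category bias scores measured on the reasoning steps and on the final outputs. The abstract does not say which correlation coefficient or fairness metric was used, so Pearson's r and the toy numbers below are assumptions for illustration only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical bias scores for a handful of bias categories (not paper data).
thought_bias = [0.10, 0.40, 0.35, 0.80, 0.20]
output_bias = [0.30, 0.20, 0.60, 0.50, 0.45]
r = pearson_r(thought_bias, output_bias)
```

A value of `r` well below 1 on real measurements would indicate that bias in the reasoning trace only partly tracks bias in the final answer.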
#278 In-Context Reinforcement Learning via Communicative World Models #278 通过可交流的世界模型进行情境内强化学习
Authors: [Fernando Martinez-Lopez](https://arxiv.org/search/?searchtype=author&query=Fernando Martinez-Lopez), [Tao Li](https://arxiv.org/search/?searchtype=author&query=Tao Li), [Yingdong Lu](https://arxiv.org/search/?searchtype=author&query=Yingdong Lu), [Juntao Chen](https://arxiv.org/search/?searchtype=author&query=Juntao Chen) 作者:Fernando Martinez-Lopez、Tao Li、Yingdong Lu、Juntao Chen
Reinforcement learning (RL) agents often struggle to generalize to new tasks and contexts without updating their parameters, mainly because their learned representations and policies are overfit to the specifics of their training environments. To boost agents’ in-context RL (ICRL) ability, this work formulates ICRL as a two-agent emergent communication problem and introduces CORAL (Communicative Representation for Adaptive RL), a framework that learns a transferable communicative context by decoupling latent representation learning from control. In CORAL, an Information Agent (IA) is pre-trained as a world model on a diverse distribution of tasks. Its objective is not to maximize task reward, but to build a world model and distill its understanding into concise messages. The emergent communication protocol is shaped by a novel Causal Influence Loss, which measures the effect that the message has on the next action. During deployment, the previously trained IA serves as a fixed contextualizer for a new Control Agent (CA), which learns to solve tasks by interpreting the provided communicative context. Our experiments demonstrate that this approach enables the CA to achieve significant gains in sample efficiency and successfully perform zero-shot adaptation with the help of pre-trained IA in entirely unseen sparse-reward environments, validating the efficacy of learning a transferable communicative representation. 强化学习(RL)智能体在不更新其参数的情况下常常难以泛化到新的任务和情境,主要因为其学得的表征和策略过度拟合于训练环境的具体细节。为提升智能体的情境内强化学习(ICRL)能力,本工作将 ICRL 表述为一个双智能体的涌现通信问题,并提出了 CORAL(用于自适应强化学习的可通信表征)框架,通过将潜在表征学习与控制解耦,学习可迁移的可通信情境。在 CORAL 中,信息智能体(IA)作为世界模型在多样化任务分布上进行预训练。其目标不是最大化任务奖励,而是构建世界模型并将其理解蒸馏为简洁的消息。涌现的通信协议由一种新颖的因果影响损失塑造,该损失衡量消息对下一步动作的影响。在部署时,先前训练好的 IA 作为固定的情境器为新的控制智能体(CA)提供上下文,CA 通过解读所提供的可通信情境来学习解决任务。我们的实验表明,这种方法使 CA 在样本效率上获得显著提升,并在预训练 IA 的帮助下,能够在完全未见过的稀疏奖励环境中成功执行零样本适应,验证了学习可迁移通信表征的有效性。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-08 19:23:23 UTC 发布:2025-08-08 19:23:23 UTC
#279 Fractal Language Modelling by Universal Sequence Maps (USM) #279 通过通用序列映射(USM)的分形语言建模
Authors: [Jonas S Almeida](https://arxiv.org/search/?searchtype=author&query=Jonas S Almeida), [Daniel E Russ](https://arxiv.org/search/?searchtype=author&query=Daniel E Russ), [Susana Vinga](https://arxiv.org/search/?searchtype=author&query=Susana Vinga), [Ines Duarte](https://arxiv.org/search/?searchtype=author&query=Ines Duarte), [Lee Mason](https://arxiv.org/search/?searchtype=author&query=Lee Mason), [Praphulla Bhawsar](https://arxiv.org/search/?searchtype=author&query=Praphulla Bhawsar), [Aaron Ge](https://arxiv.org/search/?searchtype=author&query=Aaron Ge), [Arlindo Oliveira](https://arxiv.org/search/?searchtype=author&query=Arlindo Oliveira), [Jeya Balaji Balasubramanian](https://arxiv.org/search/?searchtype=author&query=Jeya Balaji Balasubramanian) 作者:Jonas S Almeida、Daniel E Russ、Susana Vinga、Ines Duarte、Lee Mason、Praphulla Bhawsar、Aaron Ge、Arlindo Oliveira、Jeya Balaji Balasubramanian
Motivation: With the advent of Language Models using Transformers, popularized by ChatGPT, there is a renewed interest in exploring encoding procedures that numerically represent symbolic sequences at multiple scales and embedding dimensions. The challenge that encoding addresses is the need for mechanisms that uniquely retain contextual information about the succession of individual symbols, which can then be modeled by nonlinear formulations such as neural networks. Context: Universal Sequence Maps (USM) are iterated functions that bijectively encode symbolic sequences onto embedded numerical spaces. USM is composed of two Chaos Game Representations (CGR), iterated forwardly and backwardly, that can be projected into the frequency domain (FCGR). The corresponding USM coordinates can be used to compute a Chebyshev distance metric as well as k-mer frequencies, without having to recompute the embedded numeric coordinates, and, paradoxically, allowing for non-integer values of k. Results: This report advances the bijective fractal encoding by Universal Sequence Maps (USM) by resolving seeding biases affecting the iterated process. The resolution had two results, the first expected, the second an intriguing outcome: 1) full reconciliation of numeric positioning with sequence identity; and 2) uncovering the nature of USM as an efficient numeric process converging towards a steady state sequence embedding solution. We illustrate these results for genomic sequences because of the convenience of a planar representation defined by an alphabet with only 4 tokens (the 4 nucleotides). Nevertheless, the application to alphabets of arbitrary cardinality was found to be straightforward.
动机:随着以 Transformer 为基础的语言模型的出现,并由 ChatGPT 推广,人们重新关注探索以多重尺度和嵌入维度对符号序列进行数值表示的编码方法。编码所要解决的挑战在于,需要有能够唯一保留各个符号先后次序的上下文信息的机制,这些信息随后可以用神经网络等非线性方法进行建模。背景:通用序列映射(Universal Sequence Maps, USM)是将符号序列双射地编码到嵌入数值空间的迭代函数。USM 由两个混沌游戏表示(Chaos Game Representations, CGR)组成,分别向前和向后迭代,可以被投影到频域(FCGR)。相应的 USM 坐标可用于计算切比雪夫距离度量以及 k-mer 频率,而无需重新计算嵌入的数值坐标,并且看似矛盾地允许 k 取非整数值。结果:本报告通过解决影响迭代过程的初始种子偏差,推进了由通用序列映射(USM)实现的双射分形编码。这一问题的解决带来了两个结果:第一个在预期之内,第二个则是一个有趣的发现:1)数值定位与序列一致性(sequence identity)的完全调和;以及 2)揭示了 USM 的本质,即作为一种高效的数值过程,收敛到一个稳态的序列嵌入解。我们以基因组序列来说明这些结果,因为由仅含 4 个符号(4 种核苷酸)的字母表定义的平面表示十分方便。尽管如此,将该方法应用于任意基数的字母表也被发现是直接可行的。
Subjects: Machine Learning, Artificial Intelligence, Numerical Analysis, Quantitative Methods 主题:机器学习、人工智能、数值分析、定量方法
Publish: 2025-08-08 18:41:13 UTC 发布:2025-08-08 18:41:13 UTC
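The forward/backward construction described in the abstract can be sketched with the classic Chaos Game Representation for DNA. This is a minimal illustration, not the paper's method: the corner assignment, the conventional midpoint seed `(0.5, 0.5)`, and the helper names `cgr`, `usm`, and `decode` are assumptions, and the seeding-bias fix the report proposes is not reproduced here.

```python
# Unit-square corners for the four nucleotides (a common CGR convention).
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}
SYMBOLS = {corner: base for base, corner in CORNERS.items()}


def cgr(sequence, seed=(0.5, 0.5)):
    """Iterate the chaos game: each point is halfway between the previous
    point and the corner of the current symbol."""
    x, y = seed
    points = []
    for base in sequence:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def usm(sequence):
    """Concatenate forward and backward CGR coordinates per position."""
    forward = cgr(sequence)
    backward = cgr(sequence[::-1])[::-1]
    return [f + b for f, b in zip(forward, backward)]


def decode(point, length):
    """Recover the last `length` symbols from a single forward coordinate.

    The quadrant of the point identifies the most recent symbol; doubling
    and subtracting the corner steps back one iteration. Exact only while
    the coordinates are exactly representable as floats (short sequences).
    """
    x, y = point
    bases = []
    for _ in range(length):
        cx, cy = int(x > 0.5), int(y > 0.5)
        bases.append(SYMBOLS[(cx, cy)])
        x, y = 2 * x - cx, 2 * y - cy
    return "".join(reversed(bases))


last_point = cgr("GATTACA")[-1]
recovered = decode(last_point, 7)  # the encoding is invertible
```

The `decode` helper shows why the map is bijective: a single coordinate retains the whole preceding context, with older symbols stored at finer scales.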
#280 Segmented Confidence Sequences and Multi-Scale Adaptive Confidence Segments for Anomaly Detection in Nonstationary Time Series #280 分段置信序列与用于非平稳时间序列异常检测的多尺度自适应置信区间
Authors: [Muyan Anna Li](https://arxiv.org/search/?searchtype=author&query=Muyan Anna Li), [Aditi Gautam](https://arxiv.org/search/?searchtype=author&query=Aditi Gautam) 作者:Muyan Anna Li、Aditi Gautam
As time series data become increasingly prevalent in domains such as manufacturing, IT, and infrastructure monitoring, anomaly detection must adapt to nonstationary environments where statistical properties shift over time. Traditional static thresholds are easily rendered obsolete by regime shifts, concept drift, or multi-scale changes. To address these challenges, we introduce and empirically evaluate two novel adaptive thresholding frameworks: Segmented Confidence Sequences (SCS) and Multi-Scale Adaptive Confidence Segments (MACS). Both leverage statistical online learning and segmentation principles for local, contextually sensitive adaptation, maintaining guarantees on false alarm rates even under evolving distributions. Our experiments across Wafer Manufacturing benchmark datasets show significant F1-score improvement compared to traditional percentile and rolling quantile approaches. This work demonstrates that robust, statistically principled adaptive thresholds enable reliable, interpretable, and timely detection of diverse real-world anomalies. 随着时间序列数据在制造业、信息技术和基础设施监控等领域变得日益普遍,异常检测必须适应统计特性随时间变化的非平稳环境。传统的静态阈值在遇到制度转变、概念漂移或多尺度变化时很容易失效。为应对这些挑战,我们提出并通过实证评估了两种新颖的自适应阈值框架:分段置信序列(Segmented Confidence Sequences,SCS)和多尺度自适应置信段(Multi-Scale Adaptive Confidence Segments,MACS)。两者均利用统计在线学习和分段原理进行局部、具上下文敏感性的自适应,同时在分布演变的情况下仍能维持对误报率的保证。我们在晶片制造(Wafer Manufacturing)基准数据集上的实验证明,与传统的百分位和滑动分位数方法相比,F1 得分显著提升。本研究表明,稳健且具有统计原则的自适应阈值能够实现对多样化真实世界异常的可靠、可解释且及时的检测。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-08 18:34:54 UTC 发布:2025-08-08 18:34:54 UTC
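For context, the rolling-quantile baseline that the abstract compares against can be sketched as follows. This is only the traditional baseline, not SCS or MACS, whose details are not given in the abstract; the function name, window size, and quantile level are illustrative assumptions.

```python
import math
from collections import deque


def rolling_quantile_flags(values, window=50, q=0.99):
    """Flag a point as anomalous when it exceeds the empirical q-quantile
    of the previous `window` observations.

    The first `window` points are never flagged because there is not yet
    enough history to form a threshold.
    """
    history = deque(maxlen=window)
    flags = []
    for value in values:
        if len(history) == window:
            ordered = sorted(history)
            # Index of the smallest order statistic covering a q fraction.
            index = min(window - 1, math.ceil(q * window) - 1)
            flags.append(value > ordered[index])
        else:
            flags.append(False)
        history.append(value)
    return flags


signal = [1.0] * 10 + [5.0] + [1.0] * 5
flags = rolling_quantile_flags(signal, window=5, q=0.9)
```

In this toy signal only the spike at index 10 is flagged. Note that the spike then enters the window and raises the threshold for the next few points, one reason such thresholds can lag behind regime shifts.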
#281 Using Imperfect Synthetic Data in Downstream Inference Tasks #281 在下游推断任务中使用不完美的合成数据
Authors: [Yewon Byun](https://arxiv.org/search/?searchtype=author&query=Yewon Byun), [Shantanu Gupta](https://arxiv.org/search/?searchtype=author&query=Shantanu Gupta), [Zachary C. Lipton](https://arxiv.org/search/?searchtype=author&query=Zachary C. Lipton), [Rachel Leah Childers](https://arxiv.org/search/?searchtype=author&query=Rachel Leah Childers), [Bryan Wilder](https://arxiv.org/search/?searchtype=author&query=Bryan Wilder) 作者:Yewon Byun、Shantanu Gupta、Zachary C. Lipton、Rachel Leah Childers、Bryan Wilder
Predictions and generations from large language models are increasingly being explored as an aid to computational social science and human subject research in limited data regimes. While previous technical work has explored the potential to use model-predicted labels for unlabeled data in a principled manner, there is increasing interest in using large language models to generate entirely new synthetic samples (also termed as synthetic simulations), such as in responses to surveys. However, it is not immediately clear by what means practitioners can combine such data with real data and yet produce statistically valid conclusions upon them. In this work, we introduce a new estimator based on generalized method of moments, providing a hyperparameter-free solution with strong theoretical guarantees to address the challenge at hand. Surprisingly, we find that interactions between the moment residuals of synthetic data and those of real data can improve estimates of the target parameter. We empirically validate the finite-sample performance of our estimator across different regression tasks in computational social science applications, demonstrating large empirical gains. 在有限数据情形下,大型语言模型的预测和生成正越来越多地被用作计算社会科学和受试者研究的辅助工具。尽管先前的技术工作以原则性方式探讨了将模型预测标签用于未标记数据的潜力,但人们越来越关注使用大型语言模型生成全新的合成样本(亦称为合成模拟),例如用于对问卷的回答。然而,实践者通过何种方式将此类数据与真实数据结合并在此基础上得出统计上有效的结论,尚无明确答案。在本工作中,我们基于广义矩估计引入了一种新的估计量,提供了一个无超参数、具有强理论保证的解决方案来应对这一挑战。令人惊讶的是,我们发现合成数据的矩残差与真实数据的矩残差之间的相互作用可以改善目标参数的估计。我们在计算社会科学应用的不同回归任务上对该估计量的有限样本性能进行了实证验证,显示出显著的实证收益。
Subjects: Machine Learning, Artificial Intelligence, Machine Learning 主题:机器学习,人工智能,机器学习
Publish: 2025-08-08 18:32:52 UTC 发布:2025-08-08 18:32:52 UTC
#282 CoDe-NeRF: Neural Rendering via Dynamic Coefficient Decomposition #282 CoDe-NeRF:通过动态系数分解进行神经渲染
Authors: [Wenpeng Xing](https://arxiv.org/search/?searchtype=author&query=Wenpeng Xing), [Jie Chen](https://arxiv.org/search/?searchtype=author&query=Jie Chen), [Zaifeng Yang](https://arxiv.org/search/?searchtype=author&query=Zaifeng Yang), [Tiancheng Zhao](https://arxiv.org/search/?searchtype=author&query=Tiancheng Zhao), [Gaolei Li](https://arxiv.org/search/?searchtype=author&query=Gaolei Li), [Changting Lin](https://arxiv.org/search/?searchtype=author&query=Changting Lin), [Yike Guo](https://arxiv.org/search/?searchtype=author&query=Yike Guo), [Meng Han](https://arxiv.org/search/?searchtype=author&query=Meng Han) 作者:Wenpeng Xing、Jie Chen、Zaifeng Yang、Tiancheng Zhao、Gaolei Li、Changting Lin、Yike Guo、Meng Han
Neural Radiance Fields (NeRF) have shown impressive performance in novel view synthesis, but challenges remain in rendering scenes with complex specular reflections and highlights. Existing approaches may produce blurry reflections due to entanglement between lighting and material properties, or encounter optimization instability when relying on physically-based inverse rendering. In this work, we present a neural rendering framework based on dynamic coefficient decomposition, aiming to improve the modeling of view-dependent appearance. Our approach decomposes complex appearance into a shared, static neural basis that encodes intrinsic material properties, and a set of dynamic coefficients generated by a Coefficient Network conditioned on view and illumination. A Dynamic Radiance Integrator then combines these components to synthesize the final radiance. Experimental results on several challenging benchmarks suggest that our method can produce sharper and more realistic specular highlights compared to existing techniques. We hope that this decomposition paradigm can provide a flexible and effective direction for modeling complex appearance in neural scene representations. 神经辐射场(NeRF)在新视角合成方面表现出色,但在渲染具有复杂镜面反射和高光的场景时仍面临挑战。现有方法可能因光照与材质属性的纠缠而产生模糊的反射,或在依赖物理基础逆向渲染时遇到优化不稳定的问题。在本工作中,我们提出了一个基于动态系数分解的神经渲染框架,旨在改进视依赖外观的建模。我们的方法将复杂外观分解为一个共享的、静态的神经基底来编码内在材质属性,以及由一个基于视角和光照条件的系数网络生成的一组动态系数。随后,动态辐射积分器将这些组件组合以合成最终的辐射。多个具有挑战性的基准实验结果表明,与现有技术相比,我们的方法能够生成更清晰、更真实的镜面高光。我们希望这一分解范式能为在神经场景表示中建模复杂外观提供一种灵活且有效的方向。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题:计算机视觉与模式识别,人工智能
Publish: 2025-08-08 18:30:02 UTC 发布:2025-08-08 18:30:02 UTC
#283 Early Detection of Pancreatic Cancer Using Multimodal Learning on Electronic Health Record #283 使用电子健康记录的多模态学习进行胰腺癌早期检测
Authors: [Mosbah Aouad](https://arxiv.org/search/?searchtype=author&query=Mosbah Aouad), [Anirudh Choudhary](https://arxiv.org/search/?searchtype=author&query=Anirudh Choudhary), [Awais Farooq](https://arxiv.org/search/?searchtype=author&query=Awais Farooq), [Steven Nevers](https://arxiv.org/search/?searchtype=author&query=Steven Nevers), [Lusine Demirkhanyan](https://arxiv.org/search/?searchtype=author&query=Lusine Demirkhanyan), [Bhrandon Harris](https://arxiv.org/search/?searchtype=author&query=Bhrandon Harris), [Suguna Pappu](https://arxiv.org/search/?searchtype=author&query=Suguna Pappu), [Christopher Gondi](https://arxiv.org/search/?searchtype=author&query=Christopher Gondi), [Ravishankar Iyer](https://arxiv.org/search/?searchtype=author&query=Ravishankar Iyer) 作者:Mosbah Aouad、Anirudh Choudhary、Awais Farooq、Steven Nevers、Lusine Demirkhanyan、Bhrandon Harris、Suguna Pappu、Christopher Gondi、Ravishankar Iyer
Pancreatic ductal adenocarcinoma (PDAC) is one of the deadliest cancers, and early detection remains a major clinical challenge due to the absence of specific symptoms and reliable biomarkers. In this work, we propose a new multimodal approach that integrates longitudinal diagnosis code histories and routinely collected laboratory measurements from electronic health records to detect PDAC up to one year prior to clinical diagnosis. Our method combines neural controlled differential equations to model irregular lab time series, pretrained language models and recurrent networks to learn diagnosis code trajectory representations, and cross-attention mechanisms to capture interactions between the two modalities. We develop and evaluate our approach on a real-world dataset of nearly 4,700 patients and achieve significant improvements in AUC ranging from 6.5% to 15.5% over state-of-the-art methods. Furthermore, our model identifies diagnosis codes and laboratory panels associated with elevated PDAC risk, including both established and new biomarkers. Our code is available at https://github.com/MosbahAouad/EarlyPDAC-MML. 胰腺导管腺癌(PDAC)是最致命的癌症之一,由于缺乏特异性症状和可靠的生物标志物,早期检测仍然是临床上的一大挑战。在这项工作中,我们提出了一种新的多模态方法,整合了电子病历中纵向的诊断编码历史和常规收集的实验室检测数据,以在临床诊断前最多一年检测出 PDAC。我们的方法结合了用于建模不规则化验时间序列的神经受控微分方程、用于学习诊断编码轨迹表示的预训练语言模型和循环网络,以及用于捕捉两种模态间交互的交叉注意力机制。我们在一项近 4,700 名患者的真实世界数据集上开发并评估了该方法,AUC 较最先进方法显著提升了 6.5%到 15.5%。此外,我们的模型识别出了与 PDAC 风险升高相关的诊断编码和化验组合,包括已知和新的生物标志物。我们的代码可在 https://github.com/MosbahAouad/EarlyPDAC-MML 获得。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-08 18:18:15 UTC 发布:2025-08-08 18:18:15 UTC
#284 Generalizing Scaling Laws for Dense and Sparse Large Language Models #284 将扩展规律推广到稠密与稀疏大型语言模型
Authors: [Md Arafat Hossain](https://arxiv.org/search/?searchtype=author&query=Md Arafat Hossain), [Xingfu Wu](https://arxiv.org/search/?searchtype=author&query=Xingfu Wu), [Valerie Taylor](https://arxiv.org/search/?searchtype=author&query=Valerie Taylor), [Ali Jannesari](https://arxiv.org/search/?searchtype=author&query=Ali Jannesari) 作者:Md Arafat Hossain、Xingfu Wu、Valerie Taylor、Ali Jannesari
Over the past few years, the size of language models has grown exponentially, as has the computational cost to train these large models. This rapid growth has motivated researchers to develop new techniques aimed at enhancing the efficiency of the training process. Despite these advancements, optimally predicting the model size or allocating optimal resources remains a challenge. Several efforts have addressed the challenge by proposing different scaling laws, but almost all of them are architecture-specific (dense or sparse). In this work we revisit existing scaling laws and propose a generalized scaling law to provide a unified framework that is applicable to both dense and sparse large language models. We evaluate and compare our proposed scaling law with existing scaling laws to demonstrate its effectiveness. 在过去几年里,语言模型的规模呈指数级增长,训练这些大型模型的计算成本也随之增加。这一快速增长促使研究人员开发新的技术以提升训练过程的效率。尽管取得了这些进展,最优地预测模型规模或分配最佳资源仍然是一项挑战。一些研究提出了不同的扩展规律来应对这一挑战,但几乎所有这些规律都是针对特定架构(稠密或稀疏)的。在本工作中,我们重新审视了现有的扩展规律,并提出了一种广义扩展规律,以提供一个适用于稠密和稀疏大型语言模型的统一框架。我们评估并将所提出的扩展规律与现有的扩展规律进行比较,以证明其有效性。
Subjects: Machine Learning, Artificial Intelligence, Performance 主题:机器学习、人工智能、性能
Publish: 2025-08-08 18:07:11 UTC 发布:2025-08-08 18:07:11 UTC
#285 Generative AI for Intent-Driven Network Management in 6G: A Case Study on Hierarchical Learning Approach #285 面向意图驱动的 6G 网络管理的生成式人工智能:基于分层学习方法的案例研究
Authors: [Md Arafat Habib](https://arxiv.org/search/?searchtype=author&query=Md Arafat Habib), [Medhat Elsayed](https://arxiv.org/search/?searchtype=author&query=Medhat Elsayed), [Yigit Ozcan](https://arxiv.org/search/?searchtype=author&query=Yigit Ozcan), [Pedro Enrique Iturria-Rivera](https://arxiv.org/search/?searchtype=author&query=Pedro Enrique Iturria-Rivera), [Majid Bavand](https://arxiv.org/search/?searchtype=author&query=Majid Bavand), [Melike Erol-Kantarci](https://arxiv.org/search/?searchtype=author&query=Melike Erol-Kantarci) 作者:Md Arafat Habib、Medhat Elsayed、Yigit Ozcan、Pedro Enrique Iturria-Rivera、Majid Bavand、Melike Erol-Kantarci
With the emergence of 6G, mobile networks are becoming increasingly heterogeneous and dynamic, necessitating advanced automation for efficient management. Intent-Driven Networks (IDNs) address this by translating high-level intents into optimization policies. Large Language Models (LLMs) can enhance this process by understanding complex human instructions to enable adaptive, intelligent automation. Given the rapid advancements in Generative AI (GenAI), a comprehensive survey of LLM-based IDN architectures in disaggregated Radio Access Network (RAN) environments is both timely and critical. This article provides such a survey, along with a case study on a hierarchical learning-enabled IDN architecture that integrates GenAI across three key stages: intent processing, intent validation, and intent execution. Unlike most existing approaches that apply GenAI in the form of LLMs for intent processing only, we propose a hierarchical framework that introduces GenAI across all three stages of IDN. To demonstrate the effectiveness of the proposed IDN management architecture, we present a case study based on the latest GenAI architecture named Mamba. The case study shows how the proposed GenAI-driven architecture enhances network performance through intelligent automation, surpassing the performance of the conventional IDN architectures. 随着 6G 的出现,移动网络变得愈发异构和动态化,因而需要先进的自动化手段以实现高效管理。意图驱动网络(IDN)通过将高层意图翻译为优化策略来应对这一挑战。大型语言模型(LLMs)可以增强这一过程,因其能够理解复杂的人类指令,从而实现自适应、智能的自动化。鉴于生成式人工智能(GenAI)的快速发展,针对在解构化无线接入网(RAN)环境中基于 LLM 的 IDN 架构进行全面综述既及时又至关重要。本文提供了这样的综述,并通过一个分层学习使能的 IDN 架构案例研究,介绍了在意图处理、意图验证和意图执行三个关键阶段集成 GenAI 的方法。不像大多数现有方法仅在意图处理阶段以 LLM 形式应用 GenAI,我们提出了一个在 IDN 所有三阶段引入 GenAI 的分层框架。为展示所提 IDN 管理架构的有效性,我们基于名为 Mamba 的最新 GenAI 架构进行了案例研究。 该案例研究展示了所提出的由生成式人工智能驱动的架构如何通过智能自动化提升网络性能,并超越传统 IDN 架构的性能。
Subjects: Networking and Internet Architecture, Artificial Intelligence 主题:网络与互联网体系结构,人工智能
Publish: 2025-08-08 18:06:52 UTC 发布:2025-08-08 18:06:52 UTC
#286 Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs #286 深度无知:过滤预训练数据为开权重 LLMs 构建防篡改保障
Authors: [Kyle O’Brien](https://arxiv.org/search/?searchtype=author&query=Kyle O’Brien), [Stephen Casper](https://arxiv.org/search/?searchtype=author&query=Stephen Casper), [Quentin Anthony](https://arxiv.org/search/?searchtype=author&query=Quentin Anthony), [Tomek Korbak](https://arxiv.org/search/?searchtype=author&query=Tomek Korbak), [Robert Kirk](https://arxiv.org/search/?searchtype=author&query=Robert Kirk), [Xander Davies](https://arxiv.org/search/?searchtype=author&query=Xander Davies), [Ishan Mishra](https://arxiv.org/search/?searchtype=author&query=Ishan Mishra), [Geoffrey Irving](https://arxiv.org/search/?searchtype=author&query=Geoffrey Irving), [Yarin Gal](https://arxiv.org/search/?searchtype=author&query=Yarin Gal), [Stella Biderman](https://arxiv.org/search/?searchtype=author&query=Stella Biderman) 作者:Kyle O’Brien、Stephen Casper、Quentin Anthony、Tomek Korbak、Robert Kirk、Xander Davies、Ishan Mishra、Geoffrey Irving、Yarin Gal、Stella Biderman
Open-weight AI systems offer unique benefits, including enhanced transparency, open research, and decentralized access. However, they are vulnerable to tampering attacks which can efficiently elicit harmful behaviors by modifying weights or activations. Currently, there is not yet a robust science of open-weight model risk management. Existing safety fine-tuning methods and other post-training techniques have struggled to make LLMs resistant to more than a few dozen steps of adversarial fine-tuning. In this paper, we investigate whether filtering text about dual-use topics from training data can prevent unwanted capabilities and serve as a more tamper-resistant safeguard. We introduce a multi-stage pipeline for scalable data filtering and show that it offers a tractable and effective method for minimizing biothreat proxy knowledge in LLMs. We pretrain multiple 6.9B-parameter models from scratch and find that they exhibit substantial resistance to adversarial fine-tuning attacks on up to 10,000 steps and 300M tokens of biothreat-related text – outperforming existing post-training baselines by over an order of magnitude – with no observed degradation to unrelated capabilities. However, while filtered models lack internalized dangerous knowledge, we find that they can still leverage such information when it is provided in context (e.g., via search tool augmentation), demonstrating a need for a defense-in-depth approach. Overall, these findings help to establish pretraining data curation as a promising layer of defense for open-weight AI systems. 
开放权重的人工智能系统提供了独特的优势,包括更高的透明度、开放研究和去中心化访问。然而,它们容易受到篡改攻击,这类攻击可以通过修改权重或激活来高效地引出有害行为。目前,针对开放权重模型的风险管理尚未形成稳健的科学体系。现有的安全微调方法和其他训练后技术难以使 LLMs 抵抗超过几十步的对抗性微调。在本文中,我们研究了从训练数据中过滤有关双重用途主题的文本是否能防止不希望出现的能力,并作为更抗篡改的防护手段。我们引入了一个可扩展的数据过滤多阶段流程,并表明它为最小化 LLMs 中的生物威胁代理知识提供了一种可行且有效的方法。我们从零开始预训练了多个 69 亿参数的模型,发现在面对多达 10000 步、3 亿 token 的生物威胁相关文本的对抗性微调攻击时,它们表现出显著的抗性,较现有的训练后基线方法高出一个数量级以上,且未观察到无关能力的退化。然而,尽管经过过滤的模型没有内化危险知识,我们发现当这些信息以上下文形式提供时(例如,通过搜索工具增强),模型仍然可以利用这些信息,这表明需要采用纵深防御的方法。总体而言,这些发现有助于确立预训练数据筛选作为面向开放权重 AI 系统的一种有前景的防御层。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-08 17:59:47 UTC 发布:2025-08-08 17:59:47 UTC
#287 LLM Unlearning Without an Expert Curated Dataset #287 LLM 无需专家策划数据集的遗忘方法
Authors: [Xiaoyuan Zhu](https://arxiv.org/search/?searchtype=author&query=Xiaoyuan Zhu), [Muru Zhang](https://arxiv.org/search/?searchtype=author&query=Muru Zhang), [Ollie Liu](https://arxiv.org/search/?searchtype=author&query=Ollie Liu), [Robin Jia](https://arxiv.org/search/?searchtype=author&query=Robin Jia), [Willie Neiswanger](https://arxiv.org/search/?searchtype=author&query=Willie Neiswanger) 作者:Xiaoyuan Zhu、Muru Zhang、Ollie Liu、Robin Jia、Willie Neiswanger
Modern large language models often encode sensitive, harmful, or copyrighted knowledge, raising the need for post-hoc unlearning, the ability to remove specific domains of knowledge from a model without full retraining. A major bottleneck in current unlearning pipelines is constructing effective forget sets, i.e., datasets that approximate the target domain and guide the model to forget it. In this work, we introduce a scalable, automated approach to generate high-quality forget sets using language models themselves. Our method synthesizes textbook-style data through a structured prompting pipeline, requiring only a domain name as input. Through experiments on unlearning biosecurity, cybersecurity, and Harry Potter novels, we show that our synthetic datasets consistently outperform the baseline synthetic alternatives and are comparable to the expert-curated ones. Additionally, ablation studies reveal that the multi-step generation pipeline significantly boosts data diversity, which in turn improves unlearning utility. Overall, our findings suggest that synthetic datasets offer a promising path toward practical, scalable unlearning for a wide range of emerging domains without the need for manual intervention. We release our code and dataset at https://github.com/xyzhu123/Synthetic_Textbook. 现代大型语言模型常常编码敏感、有害或受版权保护的知识,因此需要事后“遗忘”(unlearning)能力,即在不完全重训练模型的情况下移除特定领域知识。当前遗忘流程中的一大瓶颈是构建有效的遗忘集,即近似目标领域并引导模型遗忘该领域的数据集。在这项工作中,我们提出了一种可扩展的、自动化的方法,使用语言模型自身生成高质量的遗忘集。我们的方法通过结构化提示流水线合成教科书式的数据,输入仅需领域名称。通过对生物安全、网络安全和《哈利·波特》小说的遗忘实验,我们表明合成数据集始终优于基线合成替代品,并且与专家策划的数据集相当。此外,消融研究表明,多步生成流水线显著提升了数据多样性,从而提高了遗忘效用。总体而言,我们的研究结果表明,合成数据集为在广泛新兴领域实现实用且可扩展的遗忘提供了一条有前景的途径,而无需人工干预。我们在 https://github.com/xyzhu123/Synthetic_Textbook 上发布了我们的代码和数据集。
Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题:计算与语言,人工智能,机器学习
Publish: 2025-08-08 14:30:08 UTC 发布:2025-08-08 14:30:08 UTC
#288 Towards Integrated Alignment #288 走向综合对齐
Authors: [Ben Y. Reis](https://arxiv.org/search/?searchtype=author&query=Ben Y. Reis), [William La Cava](https://arxiv.org/search/?searchtype=author&query=William La Cava) 作者:Ben Y. Reis,William La Cava
As AI adoption expands across human society, the problem of aligning AI models to match human preferences remains a grand challenge. Currently, the AI alignment field is deeply divided between behavioral and representational approaches, resulting in narrowly aligned models that are more vulnerable to increasingly deceptive misalignment threats. In the face of this fragmentation, we propose an integrated vision for the future of the field. Drawing on related lessons from immunology and cybersecurity, we lay out a set of design principles for the development of Integrated Alignment frameworks that combine the complementary strengths of diverse alignment approaches through deep integration and adaptive coevolution. We highlight the importance of strategic diversity - deploying orthogonal alignment and misalignment detection approaches to avoid homogeneous pipelines that may be “doomed to success”. We also recommend steps for greater unification of the AI alignment research field itself, through cross-collaboration, open model weights and shared community resources. 随着人工智能在整个人类社会中的普及,将 AI 模型与人类偏好对齐的问题仍然是一个重大挑战。目前,AI 对齐领域在行为导向方法与表征导向方法之间严重分裂,导致模型只在狭窄方面实现对齐,从而更容易受到日益狡猾的错位威胁的影响。面对这种分化,我们提出了对该领域未来的综合愿景。借鉴免疫学和网络安全的相关经验教训,我们提出了一组设计原则,用于开发“综合对齐”框架,通过深度整合与自适应共演将多样化对齐方法的互补优势结合起来。我们强调战略性多样性的重要性——部署正交的对齐与错位检测方法,以避免可能“注定成功”的同质化流程。我们还建议通过跨领域协作、开放模型权重与共享社区资源等步骤,促进 AI 对齐研究领域本身的更大统一。
Subjects: Computers and Society, Artificial Intelligence 主题:计算机与社会,人工智能
Publish: 2025-08-08 11:16:56 UTC 发布:2025-08-08 11:16:56 UTC
#289 Generative Artificial Intelligence Extracts Structure-Function Relationships from Plants for New Materials #289 生成式人工智能从植物中提取结构-功能关系以开发新材料
Authors: [Rachel K. Luu](https://arxiv.org/search/?searchtype=author&query=Rachel K. Luu), [Jingyu Deng](https://arxiv.org/search/?searchtype=author&query=Jingyu Deng), [Mohammed Shahrudin Ibrahim](https://arxiv.org/search/?searchtype=author&query=Mohammed Shahrudin Ibrahim), [Nam-Joon Cho](https://arxiv.org/search/?searchtype=author&query=Nam-Joon Cho), [Ming Dao](https://arxiv.org/search/?searchtype=author&query=Ming Dao), [Subra Suresh](https://arxiv.org/search/?searchtype=author&query=Subra Suresh), [Markus J. Buehler](https://arxiv.org/search/?searchtype=author&query=Markus J. Buehler) 作者:Rachel K. Luu、Jingyu Deng、Mohammed Shahrudin Ibrahim、Nam-Joon Cho、Ming Dao、Subra Suresh、Markus J. Buehler
Large language models (LLMs) have reshaped the research landscape by enabling new approaches to knowledge retrieval and creative ideation. Yet their application in discipline-specific experimental science, particularly in highly multi-disciplinary domains like materials science, remains limited. We present a first-of-its-kind framework that integrates generative AI with literature from hitherto-unconnected fields such as plant science, biomimetics, and materials engineering to extract insights and design experiments for materials. We focus on humidity-responsive systems such as pollen-based materials and Rhapis excelsa (broadleaf lady palm) leaves, which exhibit self-actuation and adaptive performance. Using a suite of AI tools, including a fine-tuned model (BioinspiredLLM), Retrieval-Augmented Generation (RAG), agentic systems, and a Hierarchical Sampling strategy, we extract structure-property relationships and translate them into new classes of bioinspired materials. Structured inference protocols generate and evaluate hundreds of hypotheses from a single query, surfacing novel and experimentally tractable ideas. We validate our approach through real-world implementation: LLM-generated procedures, materials designs, and mechanical predictions were tested in the laboratory, culminating in the fabrication of a novel pollen-based adhesive with tunable morphology and measured shear strength, establishing a foundation for future plant-derived adhesive design. This work demonstrates how AI-assisted ideation can drive real-world materials design and enable effective human-AI collaboration. 
大型语言模型(LLMs)通过支持新的知识检索和创意构思方法,重塑了研究格局。然而,它们在学科特定的实验科学中的应用仍然有限,尤其是在像材料科学这样高度多学科的领域。我们提出了一个首创框架,将生成式人工智能与此前未曾关联的文献领域(如植物科学、仿生学和材料工程)整合,以提取见解并为材料设计实验。我们聚焦于湿度响应系统,例如基于花粉的材料和 Rhapis excelsa(宽叶棕竹)叶片,这些系统表现出自驱动和自适应性能。利用一套 AI 工具,包括一个微调模型(BioinspiredLLM)、检索增强生成(RAG)、代理系统和分层采样策略,我们提取结构—性能关系并将其转化为新类别的仿生材料。结构化推理协议从单一查询生成并评估数百个假设,挖掘出新颖且可实验验证的想法。 我们通过现实世界的实验验证了我们的方法:由 LLM 生成的流程、材料设计和力学预测在实验室中得到测试,最终制备出一种具有可调形态并测得剪切强度的新型花粉基粘合剂,为未来植物来源粘合剂的设计奠定了基础。这项工作展示了 AI 辅助的创意如何推动现实世界的材料设计并实现有效的人机协作。
Subjects: Machine Learning, Disordered Systems and Neural Networks, Materials Science, Other Condensed Matter, Artificial Intelligence, Computation and Language 主题:机器学习,紊乱系统与神经网络,材料科学,其他凝聚态物质,人工智能,计算与语言
Publish: 2025-08-08 10:41:03 UTC 发布时间:2025-08-08 10:41:03 UTC
#290 A Federated Learning Framework for Handling Subtype Confounding and Heterogeneity in Large-Scale Neuroimaging Diagnosis #290 一种用于在大规模神经影像诊断中处理亚型混杂和异质性的联邦学习框架
Authors: [Xinglin Zhao](https://arxiv.org/search/?searchtype=author&query=Xinglin Zhao), [Yanwen Wang](https://arxiv.org/search/?searchtype=author&query=Yanwen Wang), [Xiaobo Liu](https://arxiv.org/search/?searchtype=author&query=Xiaobo Liu), [Yanrong Hao](https://arxiv.org/search/?searchtype=author&query=Yanrong Hao), [Rui Cao](https://arxiv.org/search/?searchtype=author&query=Rui Cao), [Xin Wen](https://arxiv.org/search/?searchtype=author&query=Xin Wen) 作者:赵兴林,王燕文,刘晓博,郝延荣,曹睿,温鑫
Computer-aided diagnosis (CAD) systems play a crucial role in analyzing neuroimaging data for neurological and psychiatric disorders. However, small-sample studies suffer from low reproducibility, while large-scale datasets introduce confounding heterogeneity due to multiple disease subtypes being labeled under a single category. To address these challenges, we propose a novel federated learning framework tailored for neuroimaging CAD systems. Our approach includes a dynamic navigation module that routes samples to the most suitable local models based on latent subtype representations, and a meta-integration module that combines predictions from heterogeneous local models into a unified diagnostic output. We evaluated our framework using a comprehensive dataset comprising fMRI data from over 1300 MDD patients and 1100 healthy controls across multiple study cohorts. Experimental results demonstrate significant improvements in diagnostic accuracy and robustness compared to traditional methods. Specifically, our framework achieved an average accuracy of 74.06% across all tested sites, showcasing its effectiveness in handling subtype heterogeneity and enhancing model generalizability. Ablation studies further confirmed the importance of both the dynamic navigation and meta-integration modules in improving performance. By addressing data heterogeneity and subtype confounding, our framework advances reliable and reproducible neuroimaging CAD systems, offering significant potential for personalized medicine and clinical decision-making in neurology and psychiatry. 
计算机辅助诊断(CAD)系统在分析神经影像数据以识别神经和精神疾病方面发挥着关键作用。然而,小样本研究存在可重复性低的问题,而大规模数据集由于将多种疾病亚型归为单一类别,会引入混杂的异质性。为了解决这些挑战,我们提出了一种为神经影像 CAD 系统量身定制的新型联邦学习框架。我们的方法包括一个动态导航模块,该模块基于潜在亚型表示将样本路由到最合适的本地模型,以及一个元整合模块,该模块将来自异质本地模型的预测组合成统一的诊断输出。我们使用包含来自多个研究队列的超过 1300 名抑郁症(MDD)患者和 1100 名健康对照的 fMRI 数据的综合数据集对我们的框架进行了评估。实验结果表明,与传统方法相比,该框架在诊断准确性和鲁棒性方面均有显著提升。具体而言,我们的框架在所有测试站点上的平均准确率达到 74.06%,展示了其在处理亚型异质性和增强模型泛化能力方面的有效性。 消融研究进一步证实了动态导航和元整合模块在提升性能方面的重要性。通过解决数据异质性和亚型混杂问题,我们的框架推动了可靠且可重复的神经影像计算机辅助诊断系统的发展,为神经学和精神病学中的个性化医疗及临床决策提供了重要潜力。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-08 07:19:49 UTC
#291 Graph is a Natural Regularization: Revisiting Vector Quantization for Graph Representation Learning
Authors: [Zian Zhai](https://arxiv.org/search/?searchtype=author&query=Zian Zhai), [Fan Li](https://arxiv.org/search/?searchtype=author&query=Fan Li), [Xingyu Tan](https://arxiv.org/search/?searchtype=author&query=Xingyu Tan), [Xiaoyang Wang](https://arxiv.org/search/?searchtype=author&query=Xiaoyang Wang), [Wenjie Zhang](https://arxiv.org/search/?searchtype=author&query=Wenjie Zhang)
Vector Quantization (VQ) has recently emerged as a promising approach for learning discrete representations of graph-structured data. However, a fundamental challenge, i.e., codebook collapse, remains underexplored in the graph domain, significantly limiting the expressiveness and generalization of graph tokens. In this paper, we present the first empirical study showing that codebook collapse consistently occurs when applying VQ to graph data, even with mitigation strategies proposed in vision or language domains. To understand why graph VQ is particularly vulnerable to collapse, we provide a theoretical analysis and identify two key factors: early assignment imbalances caused by redundancy in graph features and structural patterns, and self-reinforcing optimization loops in deterministic VQ. To address these issues, we propose RGVQ, a novel framework that integrates graph topology and feature similarity as explicit regularization signals to enhance codebook utilization and promote token diversity. RGVQ introduces soft assignments via Gumbel-Softmax reparameterization, ensuring that all codewords receive gradient updates. In addition, RGVQ incorporates a structure-aware contrastive regularization to penalize the token co-assignments among similar node pairs. Extensive experiments demonstrate that RGVQ substantially improves codebook utilization and consistently boosts the performance of state-of-the-art graph VQ backbones across multiple downstream tasks, enabling more expressive and transferable graph token representations.
向量量化(VQ)最近成为一种在图结构数据上学习离散表示的有前景的方法。然而,一个基本挑战——码本崩溃(codebook collapse)——在图领域仍未得到充分研究,这极大地限制了图令牌的表现力和泛化能力。本文首次通过实证研究展示了在将 VQ 应用于图数据时,即便采用了在视觉或语言领域提出的缓解策略,码本崩溃仍会持续发生。为了解为何图 VQ 特别容易发生崩溃,我们进行了理论分析并识别出两个关键因素:由图特征和结构模式的冗余引起的早期分配不平衡,以及确定性 VQ 中自我强化的优化循环。为了解决这些问题,我们提出了 RGVQ,一个新颖的框架,它将图的拓扑结构和特征相似性作为显式正则化信号,旨在增强码本的利用率并促进令牌多样性。RGVQ 通过 Gumbel-Softmax 重参数化引入了软分配,确保所有码字都能收到梯度更新。 此外,RGVQ 引入了一种结构感知对比正则化,用以对相似节点对之间的令牌共同分配进行惩罚。大量实验表明,RGVQ 显著提升了码本的利用率,并在多项下游任务中持续提升最先进图形 VQ 骨干网络的性能,使得图令牌表示更具表现力和可迁移性。
Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能
Publish: 2025-08-08 06:33:45 UTC 发布:2025-08-08 06:33:45 协调世界时 (UTC)
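The abstract of #291 says RGVQ uses Gumbel-Softmax soft assignments so that every codeword receives gradient. The sketch below illustrates only that general idea in plain Python and is not the authors' implementation. The function names (`gumbel_softmax`, `soft_quantize`), the use of negative squared distance as the logit, and the toy codebook are all assumptions made for illustration.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Draw one relaxed (soft) one-hot sample from the given logits."""
    perturbed = []
    for logit in logits:
        u = max(rng.random(), 1e-12)        # keep u away from 0 to avoid log(0)
        gumbel = -math.log(-math.log(u))    # Gumbel(0, 1) noise
        perturbed.append((logit + gumbel) / tau)
    top = max(perturbed)                    # subtract the max for numerical stability
    exps = [math.exp(p - top) for p in perturbed]
    total = sum(exps)
    return [e / total for e in exps]


def soft_quantize(z, codebook, tau=1.0, rng=random):
    """Softly assign vector z to codewords; return (weights, weighted codeword mix)."""
    # Assumption: logits are negative squared Euclidean distances to each codeword.
    logits = [-sum((a - b) ** 2 for a, b in zip(z, c)) for c in codebook]
    weights = gumbel_softmax(logits, tau, rng)
    mixed = [sum(w * c[d] for w, c in zip(weights, codebook)) for d in range(len(z))]
    return weights, mixed


# Toy codebook with three 2-D codewords (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
weights, z_q = soft_quantize([0.9, 1.1], codebook, tau=0.5, rng=random.Random(0))
print(weights, z_q)
```

Because every weight is strictly positive, every codeword contributes to the output, which is the property a hard nearest-neighbour assignment lacks. Lowering `tau` pushes the weights toward a one-hot vector.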
#292 Omni Geometry Representation Learning vs Large Language Models for Geospatial Entity Resolution #292 全向几何表示学习 与 大型语言模型 在地理空间实体解析上的比较
Authors: [Kalana Wijegunarathna](https://arxiv.org/search/?searchtype=author&query=Kalana Wijegunarathna), [Kristin Stock](https://arxiv.org/search/?searchtype=author&query=Kristin Stock), [Christopher B. Jones](https://arxiv.org/search/?searchtype=author&query=Christopher B. Jones) 作者:Kalana Wijegunarathna、Kristin Stock、Christopher B. Jones
The development, integration, and maintenance of geospatial databases rely heavily on efficient and accurate matching procedures of Geospatial Entity Resolution (ER). While resolution of points-of-interest (POIs) has been widely addressed, resolution of entities with diverse geometries has been largely overlooked. This is partly due to the lack of a uniform technique for embedding heterogeneous geometries seamlessly into a neural network framework. Existing neural approaches simplify complex geometries to a single point, resulting in significant loss of spatial information. To address this limitation, we propose Omni, a geospatial ER model featuring an omni-geometry encoder. This encoder is capable of embedding point, line, polyline, polygon, and multi-polygon geometries, enabling the model to capture the complex geospatial intricacies of the places being compared. Furthermore, Omni leverages transformer-based pre-trained language models over individual textual attributes of place records in an Attribute Affinity mechanism. The model is rigorously tested on existing point-only datasets and a new diverse-geometry geospatial ER dataset. Omni produces up to 12% (F1) improvement over existing methods. Furthermore, we test the potential of Large Language Models (LLMs) to conduct geospatial ER, experimenting with prompting strategies and learning scenarios, comparing the results of pre-trained language model-based methods with LLMs. Results indicate that LLMs show competitive results. 
地理空间数据库的开发、集成与维护在很大程度上依赖于高效且准确的地理空间实体解析(Geospatial Entity Resolution,ER)匹配方法。尽管兴趣点(POI)的解析已被广泛研究,但对具有多样几何形状的实体的解析在很大程度上被忽视。这部分源于缺乏一种统一技术,能够将异构几何形状无缝嵌入神经网络框架。现有的神经方法将复杂几何简化为单个点,导致大量空间信息丢失。为了解决这一局限,我们提出了 Omni,一种具有全几何(omni-geometry)编码器的地理空间 ER 模型。该编码器能够嵌入点、线、折线、多边形和多重多边形几何,使模型能够捕捉被比较地点的复杂地理空间细节。此外,Omni 在属性相似度机制中利用基于 Transformer 的预训练语言模型对地点记录的各个文本属性进行建模。该模型在现有仅点数据集和一个新的多样几何地理空间 ER 数据集上进行了严格测试。Omni 相比现有方法最多带来 12%(F1)的提升。此外,我们测试了大型语言模型(LLMs)在进行地理空间实体解析(ER)方面的潜力,尝试了提示策略和学习场景,并将基于预训练语言模型的方法与 LLMs 的结果进行了比较。结果表明,LLMs 展现出具有竞争力的表现。
Subjects: Databases, Artificial Intelligence 主题:数据库,人工智能
Publish: 2025-08-08 03:37:11 UTC 发布:2025-08-08 03:37:11 UTC
#293 Discerning minds or generic tutors? Evaluating instructional guidance capabilities in Socratic LLMs #293 明辨的头脑还是通用导师?评估苏格拉底式 LLMs 的教学指导能力
Authors: [Ying Liu](https://arxiv.org/search/?searchtype=author&query=Ying Liu), [Can Li](https://arxiv.org/search/?searchtype=author&query=Can Li), [Ting Zhang](https://arxiv.org/search/?searchtype=author&query=Ting Zhang), [Mei Wang](https://arxiv.org/search/?searchtype=author&query=Mei Wang), [Qiannan Zhu](https://arxiv.org/search/?searchtype=author&query=Qiannan Zhu), [Jian Li](https://arxiv.org/search/?searchtype=author&query=Jian Li), [Hua Huang](https://arxiv.org/search/?searchtype=author&query=Hua Huang) 作者:刘英、李灿、张婷、王梅、朱芊楠、李健、黄华
The conversational capabilities of large language models hold significant promise for enabling scalable and interactive tutoring. While prior research has primarily examined their capacity for Socratic questioning, it often overlooks a critical dimension: adaptively guiding learners based on their cognitive states. This study shifts focus from mere question generation to the broader instructional guidance capability. We ask: Can LLMs emulate expert tutors who dynamically adjust strategies in response to learners’ understanding? To investigate this, we propose GuideEval, a benchmark grounded in authentic educational dialogues that evaluates pedagogical guidance through a three-phase behavioral framework: (1) Perception, inferring learner states; (2) Orchestration, adapting instructional strategies; and (3) Elicitation, stimulating proper reflections. Empirical findings reveal that existing LLMs frequently fail to provide effective adaptive scaffolding when learners exhibit confusion or require redirection. Furthermore, we introduce a behavior-guided finetuning strategy that leverages behavior-prompted instructional dialogues, significantly enhancing guidance performance. By shifting the focus from isolated content evaluation to learner-centered interaction, our work advocates a more dialogic paradigm for evaluating Socratic LLMs. 大型语言模型的对话能力在实现可扩展和互动式辅导方面具有重要潜力。尽管先前研究主要考察了它们进行苏格拉底式提问的能力,但常常忽视一个关键维度:基于学习者认知状态进行自适应引导。本研究将关注点从单纯的问题生成转向更广泛的教学引导能力。我们提出问题:LLMs 能否模仿专家导师,根据学习者的理解情况动态调整策略?为此,我们提出了 GuideEval,这是一个基于真实教育对话的基准,通过三阶段的行为框架评估教学引导能力:(1)感知,推断学习者状态;(2)编排,调整教学策略;(3)引导,激发适当的反思。实证结果表明,当学习者表现出困惑或需要重定向时,现有的 LLMs 经常无法提供有效的自适应支架。此外,我们引入了一种行为引导的微调策略,该策略利用行为提示的教学对话,显著提升了引导性能。 通过将关注点从孤立的内容评估转向以学习者为中心的互动,我们的工作倡导一种更具对话性的范式来评估 Socratic LLMs。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-08 01:02:44 UTC 发布日期:2025-08-08 01:02:44 UTC
#294 Leveraging LLMs for Privacy-Aware Predictions in Participatory Budgeting #294 在参与式预算中利用 LLMs 进行隐私感知预测
Authors: [Juan Zambrano](https://arxiv.org/search/?searchtype=author&query=Juan Zambrano), [Clément Contet](https://arxiv.org/search/?searchtype=author&query=Clément Contet), [Jairo Gudiño](https://arxiv.org/search/?searchtype=author&query=Jairo Gudiño), [Felipe Garrido-Lucero](https://arxiv.org/search/?searchtype=author&query=Felipe Garrido-Lucero), [Umberto Grandi](https://arxiv.org/search/?searchtype=author&query=Umberto Grandi), [Cesar A Hidalgo](https://arxiv.org/search/?searchtype=author&query=Cesar A Hidalgo) 作者:Juan Zambrano、Clément Contet、Jairo Gudiño、Felipe Garrido-Lucero、Umberto Grandi、Cesar A Hidalgo
Participatory Budgeting (PB) empowers citizens to propose and vote on public investment projects. Yet, despite its democratic potential, PB initiatives often suffer from low participation rates, limiting their visibility and perceived legitimacy. In this work, we aim to strengthen PB elections in two key ways: by supporting project proposers in crafting better proposals, and by helping PB organizers manage large volumes of submissions in a transparent manner. We propose a privacy-preserving approach to predict which PB proposals are likely to be funded, using only their textual descriptions and anonymous historical voting records – without relying on voter demographics or personally identifiable information. We evaluate the performance of GPT 4 Turbo in forecasting proposal outcomes across varying contextual scenarios, observing that the LLM’s prior knowledge needs to be complemented by past voting data to obtain predictions reflecting real-world PB voting behavior. Our findings highlight the potential of AI-driven tools to support PB processes by improving transparency, planning efficiency, and civic engagement. 参与式预算(PB)使公民能够提出并投票决定公共投资项目。然而,尽管其具备民主潜力,PB 项目往往参与率低,限制了其能见度和被感知的合法性。在本研究中,我们致力于通过两方面增强 PB 选举:支持项目提案者撰写更好的提案,以及帮助 PB 组织者以透明的方式管理大量提交。我们提出了一种隐私保护方法,用于预测哪些 PB 提案有可能获得资助,仅使用提案的文本描述和匿名的历史投票记录——不依赖选民人口统计信息或可识别个人身份的信息。我们评估了 GPT 4 Turbo 在不同情境下预测提案结果的表现,观察到 LLM 的先验知识需要与过去的投票数据相结合,才能得到反映真实世界 PB 投票行为的预测。我们的研究结果突显了以 AI 为驱动的工具在改善透明度、规划效率和公民参与方面支持 PB 流程的潜力。
Subjects: Computers and Society, Artificial Intelligence 主题:计算机与社会,人工智能
Publish: 2025-08-07 15:26:22 UTC 发布:2025-08-07 15:26:22 协调世界时 (UTC)
#295 Efficient Safety Testing of Autonomous Vehicles via Adaptive Search over Crash-Derived Scenarios #295 通过基于碰撞派生场景的自适应搜索实现自动驾驶车辆的高效安全测试
Author: [Rui Zhou](https://arxiv.org/search/?searchtype=author&query=Rui Zhou) 作者:周瑞
Ensuring the safety of autonomous vehicles (AVs) is paramount in their development and deployment. Safety-critical scenarios pose more severe challenges, necessitating efficient testing methods to validate AV safety. This study focuses on designing an accelerated testing algorithm for AVs in safety-critical scenarios, enabling swift recognition of their driving capabilities. First, typical logical scenarios were extracted from real-world crashes in the China In-depth Mobility Safety Study-Traffic Accident (CIMSS-TA) database, obtaining pre-crash features through reconstruction. Second, Baidu Apollo, an advanced black-box automated driving system (ADS), is integrated to control the behavior of the ego vehicle. Third, we proposed an adaptive large-variable neighborhood-simulated annealing algorithm (ALVNS-SA) to expedite the testing process. Experimental results demonstrate a significant enhancement in testing efficiency when utilizing ALVNS-SA. It achieves an 84.00% coverage of safety-critical scenarios, with crash scenario coverage of 96.83% and near-crash scenario coverage of 92.07%. Compared to a genetic algorithm (GA), an adaptive large neighborhood-simulated annealing algorithm (ALNS-SA), and random testing, ALVNS-SA exhibits substantially higher coverage in safety-critical scenarios. 确保自动驾驶汽车(AV)的安全性在其开发和部署中至关重要。安全关键场景带来更严峻的挑战,需要高效的测试方法来验证 AV 的安全性。本研究聚焦于为安全关键场景设计一种加速测试算法,使其能够迅速识别车辆的驾驶能力。首先,从中国深度出行安全研究-交通事故(CIMSS-TA)数据库中的真实撞车事件中提取典型的逻辑场景,并通过重构获得碰撞前特征。其次,集成了百度阿波罗(Baidu Apollo)这一先进的黑盒自动驾驶系统(ADS)来控制自车的行为。第三,我们提出了一种自适应大变量邻域模拟退火算法(ALVNS-SA)以加快测试过程。实验结果表明,使用 ALVNS-SA 时测试效率显著提高。它在安全关键场景上实现了 84.00% 的覆盖率,碰撞场景覆盖率为 96.83%,近碰撞场景覆盖率为 92.07%。与遗传算法(GA)、自适应大邻域-模拟退火算法(ALNS-SA)和随机测试相比,ALVNS-SA 在安全关键场景上的覆盖率显著更高。
Subjects: Robotics, Artificial Intelligence 主题:机器人学,人工智能
Publish: 2025-08-07 13:55:01 UTC 发布:2025-08-07 13:55:01 UTC
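Entry #295 describes searching scenario parameters with a simulated-annealing variant. The sketch below is plain simulated annealing over a two-parameter toy scenario, written only to show the accept/reject mechanics. It is not ALVNS-SA (no adaptive or variable neighbourhoods), and `toy_criticality`, `perturb`, the parameter ranges, and the cooling schedule are hypothetical stand-ins for a real simulator score.

```python
import math
import random


def simulated_annealing(score, start, neighbor, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Maximize score(x) by simulated annealing; return (best_x, best_score)."""
    rng = random.Random(seed)
    current, current_score = start, score(start)
    best, best_score = current, current_score
    temp = t0
    for _ in range(steps):
        candidate = neighbor(current, rng)
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse moves with probability exp(delta / temp).
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
        temp = max(temp * cooling, 1e-9)    # geometric cooling with a floor
    return best, best_score


def toy_criticality(scenario):
    """Hypothetical score: higher speed and smaller gap count as more critical."""
    speed, gap = scenario
    return speed / gap


def perturb(scenario, rng):
    """Gaussian step on both parameters, clamped to assumed ranges."""
    speed, gap = scenario
    speed = min(max(speed + rng.gauss(0.0, 1.0), 0.0), 40.0)   # m/s
    gap = min(max(gap + rng.gauss(0.0, 1.0), 1.0), 50.0)       # metres
    return (speed, gap)


start = (15.0, 30.0)
best, best_score = simulated_annealing(toy_criticality, start, perturb)
print(best, best_score)
```

In a real test pipeline the score would come from running the driving system in simulation, so each evaluation is expensive. That cost is the reason a guided search is preferred over random sampling.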
#296 Teaching Introduction to Programming in the times of AI: A case study of a course re-design #296 在人工智能时代教授编程入门课程:课程重设计的案例研究
Authors: [Nikolaos Avouris](https://arxiv.org/search/?searchtype=author&query=Nikolaos Avouris), [Kyriakos Sgarbas](https://arxiv.org/search/?searchtype=author&query=Kyriakos Sgarbas), [George Caridakis](https://arxiv.org/search/?searchtype=author&query=George Caridakis), [Christos Sintoris](https://arxiv.org/search/?searchtype=author&query=Christos Sintoris) 作者:Nikolaos Avouris、Kyriakos Sgarbas、George Caridakis、Christos Sintoris
The integration of AI tools into programming education has become increasingly prevalent in recent years, transforming the way programming is taught and learned. This paper provides a review of the state-of-the-art AI tools available for teaching and learning programming, particularly in the context of introductory courses. It highlights the challenges for course design, learning objectives, course delivery, and formative and summative assessment, as well as the misuse of such tools by students. We discuss ways of re-designing an existing course and re-shaping assignments and pedagogy to address the challenges posed by current AI technologies. This example can serve as a guideline for policies for institutions and teachers involved in teaching programming, aiming to maximize the benefits of AI tools while addressing the associated challenges and concerns. 将人工智能工具融入编程教育在近年来变得越来越普遍,改变了编程的教学与学习方式。本文回顾了用于教授与学习编程的最先进人工智能工具,尤其是在入门课程背景下的应用情况。文章强调了课程设计、学习目标、课程交付以及形成性和总结性评估方面的挑战,以及学生对这些工具的滥用问题。我们讨论了重新设计现有课程、重塑作业和教学法以应对当前人工智能技术挑战的方法。该示例可作为高校与教师制定相关政策的指南,旨在在最大化人工智能工具收益的同时应对相关挑战与问题。
Subjects: Computers and Society, Artificial Intelligence 主题:计算机与社会,人工智能
Publish: 2025-08-07 08:56:19 UTC 发布时间:2025-08-07 08:56:19 协调世界时 (UTC)
#297 Surformer v1: Transformer-Based Surface Classification Using Tactile and Vision Features #297 Surformer v1:基于 Transformer 的使用触觉和视觉特征的表面分类
Authors: [Manish Kansana](https://arxiv.org/search/?searchtype=author&query=Manish Kansana), [Elias Hossain](https://arxiv.org/search/?searchtype=author&query=Elias Hossain), [Shahram Rahimi](https://arxiv.org/search/?searchtype=author&query=Shahram Rahimi), [Noorbakhsh Amiri Golilarz](https://arxiv.org/search/?searchtype=author&query=Noorbakhsh Amiri Golilarz) 作者:Manish Kansana、Elias Hossain、Shahram Rahimi、Noorbakhsh Amiri Golilarz
Surface material recognition is a key component in robotic perception and physical interaction, particularly when leveraging both tactile and visual sensory inputs. In this work, we propose Surformer v1, a transformer-based architecture designed for surface classification using structured tactile features and PCA-reduced visual embeddings extracted via ResNet-50. The model integrates modality-specific encoders with cross-modal attention layers, enabling rich interactions between vision and touch. Currently, state-of-the-art deep learning models for vision tasks have achieved remarkable performance. With this in mind, our first set of experiments focused exclusively on tactile-only surface classification. Using feature engineering, we trained and evaluated multiple machine learning models, assessing their accuracy and inference time. We then implemented an encoder-only Transformer model tailored for tactile features. This model not only achieved the highest accuracy but also demonstrated significantly faster inference time compared to other evaluated models, highlighting its potential for real-time applications. To extend this investigation, we introduced a multimodal fusion setup by combining vision and tactile inputs. We trained both Surformer v1 (using structured features) and Multimodal CNN (using raw images) to examine the impact of feature-based versus image-based multimodal learning on classification accuracy and computational efficiency. The results showed that Surformer v1 achieved 99.4% accuracy with an inference time of 0.77 ms, while the Multimodal CNN achieved slightly higher accuracy but required significantly more inference time. These findings suggest Surformer v1 offers a compelling balance between accuracy, efficiency, and computational cost for surface material recognition. 
表面材料识别是机器人感知与物理交互中的关键组成部分,尤其是在同时利用触觉和视觉传感输入时。在本工作中,我们提出了 Surformer v1,一种基于 Transformer 的架构,旨在使用结构化触觉特征和通过 ResNet-50 提取并经 PCA 降维的视觉嵌入进行表面分类。该模型集成了针对各模态的编码器与跨模态注意力层,使视觉与触觉之间能够进行丰富的交互。目前,面向视觉任务的最先进深度学习模型已取得显著性能。基于此,我们的第一组实验专注于仅使用触觉的表面分类。通过特征工程,我们训练并评估了多种机器学习模型,考察了它们的准确率与推理时间。随后我们实现了一个仅含编码器的 Transformer 模型,专为触觉特征设计。该模型不仅达到了最高的准确率,还在推理时间上显著快于其他被评估的模型,突显了其在实时应用中的潜力。 为扩展此项研究,我们通过结合视觉和触觉输入引入了一个多模态融合方案。我们训练了使用结构化特征的 Surformer v1 和使用原始图像的多模态 CNN,以考察基于特征的多模态学习与基于图像的多模态学习在分类准确性和计算效率方面的影响。结果显示,Surformer v1 达到 99.4% 的准确率,推理时间为 0.77 毫秒,而多模态 CNN 虽然准确率略高,但所需的推理时间显著更长。这些发现表明,对于表面材料识别任务,Surformer v1 在准确性、效率和计算成本之间提供了有吸引力的平衡。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-07 00:59:33 UTC 发布日期:2025-08-07 00:59:33 协调世界时 (UTC)
#298 Factor Augmented Supervised Learning with Text Embeddings #298 因子增强的监督学习与文本嵌入
Authors: [Zhanye Luo](https://arxiv.org/search/?searchtype=author&query=Zhanye Luo), [Yuefeng Han](https://arxiv.org/search/?searchtype=author&query=Yuefeng Han), [Xiufan Yu](https://arxiv.org/search/?searchtype=author&query=Xiufan Yu) 作者:罗展烨,韩岳枫,于秀凡
Large language models (LLMs) generate text embeddings from text data, producing vector representations that capture the semantic meaning and contextual relationships of words. However, the high dimensionality of these embeddings often impedes efficiency and drives up computational cost in downstream tasks. To address this, we propose AutoEncoder-Augmented Learning with Text (AEALT), a supervised, factor-augmented framework that incorporates dimension reduction directly into pre-trained LLM workflows. First, we extract embeddings from text documents; next, we pass them through a supervised augmented autoencoder to learn low-dimensional, task-relevant latent factors. By modeling the nonlinear structure of complex embeddings, AEALT outperforms conventional deep-learning approaches that rely on raw embeddings. We validate its broad applicability with extensive experiments on classification, anomaly detection, and prediction tasks using multiple real-world public datasets. Numerical results demonstrate that AEALT yields substantial gains over both vanilla embeddings and several standard dimension reduction methods. 大型语言模型 (LLMs) 从文本数据中生成文本嵌入,产生捕捉词语语义含义和上下文关系的向量表示。然而,这些嵌入的高维性常常妨碍效率并推高下游任务的计算成本。为了解决这一问题,我们提出了带文本的自编码器增强学习(AutoEncoder-Augmented Learning with Text,AEALT),这是一种有监督的、因子增强的框架,将降维直接纳入预训练 LLM 工作流程。首先,我们从文本文档中提取嵌入;接着,将它们输入有监督增强自编码器,以学习低维、与任务相关的潜在因子。通过对复杂嵌入的非线性结构建模,AEALT 优于依赖原始嵌入的传统深度学习方法。我们通过在多个真实公开数据集上开展大量分类、异常检测和预测任务的实验证明了其广泛适用性。数值结果显示,AEALT 相较于原始嵌入和若干标准降维方法带来了显著提升。
Subjects: Computation and Language, Artificial Intelligence, Machine Learning 学科:计算与语言、人工智能、机器学习
Publish: 2025-08-06 01:44:47 UTC 发布:2025-08-06 01:44:47 UTC
#299 Symbolic Learning of Interpretable Reduced-Order Models for Jumping Quadruped Robots #299 用于跳跃四足机器人可解释降阶模型的符号学习
Authors: [Gioele Buriani](https://arxiv.org/search/?searchtype=author&query=Gioele Buriani), [Jingyue Liu](https://arxiv.org/search/?searchtype=author&query=Jingyue Liu), [Maximilian Stölzle](https://arxiv.org/search/?searchtype=author&query=Maximilian Stölzle), [Cosimo Della Santina](https://arxiv.org/search/?searchtype=author&query=Cosimo Della Santina), [Jiatao Ding](https://arxiv.org/search/?searchtype=author&query=Jiatao Ding) 作者:Gioele Buriani,Jingyue Liu,Maximilian Stölzle,Cosimo Della Santina,Jiatao Ding
Reduced-order models are essential for motion planning and control of quadruped robots, as they simplify complex dynamics while preserving critical behaviors. This paper introduces a novel methodology for deriving such interpretable dynamic models, specifically for jumping. We capture the high-dimensional, nonlinear jumping dynamics in a low-dimensional latent space by proposing a learning architecture combining Sparse Identification of Nonlinear Dynamics (SINDy) with physical structural priors on the jump dynamics. Our approach demonstrates superior accuracy to the traditional actuated Spring-loaded Inverted Pendulum (aSLIP) model and is validated through simulation and hardware experiments across different jumping strategies. 降阶模型对于四足机器人运动规划与控制至关重要,因为它们在简化复杂动力学的同时保留了关键行为。本文提出了一种用于推导此类可解释动力学模型的新方法,专门针对跳跃。我们通过提出一种将稀疏非线性动力学识别(SINDy)与关于跳跃动力学的物理结构先验相结合的学习架构,在低维潜在空间中捕捉高维、非线性的跳跃动力学。我们的方法在精度上优于传统的有驱动弹簧加载倒立摆(aSLIP)模型,并通过仿真和硬件实验在不同跳跃策略下进行了验证。
Subjects: Robotics, Artificial Intelligence, Systems and Control 学科:机器人学、人工智能、系统与控制
Publish: 2025-08-04 12:33:51 UTC 发布:2025-08-04 12:33:51 UTC
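As background for #299, SINDy in its generic form models the time derivative of a state as a sparse linear combination of candidate functions. The expression below is that standard formulation written for a latent state $z$. The symbols $\Theta$, $\Xi$, and $\lambda$ are notation chosen here, the $\ell_1$ penalty is one common way to promote sparsity, and the paper's physical priors and exact training loss are not reproduced.

$$
\dot{z} = \Theta(z)\,\Xi, \qquad \hat{\Xi} = \arg\min_{\Xi} \; \lVert \dot{Z} - \Theta(Z)\,\Xi \rVert_F^2 + \lambda \lVert \Xi \rVert_1
$$

Here $\Theta(z)$ is a library of candidate terms (for example constants, polynomials, or trigonometric functions of $z$), $Z$ and $\dot{Z}$ stack the sampled states and their derivatives row by row, and $\Xi$ holds the coefficients. Sparsity leaves only a few active terms per equation, which is what makes the resulting reduced-order model readable.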
#300 MetAdv: A Unified and Interactive Adversarial Testing Platform for Autonomous Driving #300 MetAdv:一个面向自动驾驶的统一且交互式对抗测试平台
Authors: [Aishan Liu](https://arxiv.org/search/?searchtype=author&query=Aishan Liu), [Jiakai Wang](https://arxiv.org/search/?searchtype=author&query=Jiakai Wang), [Tianyuan Zhang](https://arxiv.org/search/?searchtype=author&query=Tianyuan Zhang), [Hainan Li](https://arxiv.org/search/?searchtype=author&query=Hainan Li), [Jiangfan Liu](https://arxiv.org/search/?searchtype=author&query=Jiangfan Liu), [Siyuan Liang](https://arxiv.org/search/?searchtype=author&query=Siyuan Liang), [Yilong Ren](https://arxiv.org/search/?searchtype=author&query=Yilong Ren), [Xianglong Liu](https://arxiv.org/search/?searchtype=author&query=Xianglong Liu), [Dacheng Tao](https://arxiv.org/search/?searchtype=author&query=Dacheng Tao) 作者:Aishan Liu、Jiakai Wang、Tianyuan Zhang、Hainan Li、Jiangfan Liu、Siyuan Liang、Yilong Ren、Xianglong Liu、Dacheng Tao
Evaluating and ensuring the adversarial robustness of autonomous driving (AD) systems is a critical and unresolved challenge. This paper introduces MetAdv, a novel adversarial testing platform that enables realistic, dynamic, and interactive evaluation by tightly integrating virtual simulation with physical vehicle feedback. At its core, MetAdv establishes a hybrid virtual-physical sandbox, within which we design a three-layer closed-loop testing environment with dynamic adversarial test evolution. This architecture facilitates end-to-end adversarial evaluation, ranging from high-level unified adversarial generation, through mid-level simulation-based interaction, to low-level execution on physical vehicles. Additionally, MetAdv supports a broad spectrum of AD tasks and algorithmic paradigms (e.g., modular deep learning pipelines, end-to-end learning, vision-language models). It supports flexible 3D vehicle modeling and seamless transitions between simulated and physical environments, with built-in compatibility for commercial platforms such as Apollo and Tesla. A key feature of MetAdv is its human-in-the-loop capability: besides flexible environmental configuration for more customized evaluation, it enables real-time capture of physiological signals and behavioral feedback from drivers, offering new insights into human-machine trust under adversarial conditions. We believe MetAdv can offer a scalable and unified framework for adversarial assessment, paving the way for safer AD. 评估并确保自动驾驶(AD)系统的对抗鲁棒性是一个关键且尚未解决的挑战。本文提出了 MetAdv,一种新颖的对抗测试平台,通过将虚拟仿真与物理车辆反馈紧密结合,实现了逼真、动态和交互式的评估。MetAdv 的核心是建立了一个虚实混合沙盒,在其中我们设计了一个具有动态对抗测试演化的三层闭环测试环境。该架构促进了端到端的对抗评估,涵盖从高层统一的对抗生成、中层基于仿真的交互,到低层在物理车辆上的执行。此外,MetAdv 支持广泛的自动驾驶任务和算法范式(例如模块化深度学习管道、端到端学习、视觉-语言模型)。它支持灵活的 3D 车辆建模和仿真与物理环境之间的无缝切换,并内置兼容 Apollo 和 Tesla 等商用平台。MetAdv 的一个关键特性是其“人在回路”(human-in-the-loop)能力:除了可用于更定制化评估的灵活环境配置外,它还能够实时捕捉驾驶员的生理信号和行为反馈,从而在对抗性条件下提供关于人机信任的新见解。我们认为 MetAdv 可为对抗性评估提供一个可扩展且统一的框架,为更安全的自动驾驶铺平道路。
Subjects: Robotics, Artificial Intelligence 主题:机器人学,人工智能
Publish: 2025-08-04 03:07:54 UTC 发布:2025-08-04 03:07:54 协调世界时
#301 The Art of Breaking Words: Rethinking Multilingual Tokenizer Design #301 断词的艺术:重新思考多语言分词器设计
Authors: [Aamod Thakur](https://arxiv.org/search/?searchtype=author&query=Aamod Thakur), [Ajay Nagpal](https://arxiv.org/search/?searchtype=author&query=Ajay Nagpal), [Atharva Savarkar](https://arxiv.org/search/?searchtype=author&query=Atharva Savarkar), [Kundeshwar Pundalik](https://arxiv.org/search/?searchtype=author&query=Kundeshwar Pundalik), [Siddhesh Dosi](https://arxiv.org/search/?searchtype=author&query=Siddhesh Dosi), [Piyush Sawarkar](https://arxiv.org/search/?searchtype=author&query=Piyush Sawarkar), [Viraj Thakur](https://arxiv.org/search/?searchtype=author&query=Viraj Thakur), [Rohit Saluja](https://arxiv.org/search/?searchtype=author&query=Rohit Saluja), [Maunendra Sankar Desarkar](https://arxiv.org/search/?searchtype=author&query=Maunendra Sankar Desarkar), [Ganesh Ramakrishnan](https://arxiv.org/search/?searchtype=author&query=Ganesh Ramakrishnan) 作者:Aamod Thakur、Ajay Nagpal、Atharva Savarkar、Kundeshwar Pundalik、Siddhesh Dosi、Piyush Sawarkar、Viraj Thakur、Rohit Saluja、Maunendra Sankar Desarkar、Ganesh Ramakrishnan
While model architecture and training objectives are well-studied, tokenization, particularly in multilingual contexts, remains a relatively neglected aspect of Large Language Model (LLM) development. Existing tokenizers often exhibit high token-to-word ratios, inefficient use of context length, and slower inference. We present a systematic study that links vocabulary size, pre-tokenization rules, and training-corpus composition to both token-to-word efficiency and model quality. To ground our analysis in a linguistically diverse context, we conduct extensive experiments on Indic scripts, which present unique challenges due to their high script diversity and orthographic complexity. Drawing on the insights from these analyses, we propose a novel algorithm for data composition that balances multilingual data for tokenizer training. Our observations on pre-tokenization strategies significantly improve model performance, and our data composition algorithm reduces the average token-to-word ratio by approximately 6% with respect to the conventional data randomization approach. Our tokenizer achieves more than 40% improvement on average token-to-word ratio against state-of-the-art multilingual Indic models. This improvement yields measurable gains in both model performance and inference speed. This highlights tokenization alongside architecture and training objectives as a critical lever for building efficient, scalable multilingual LLMs. 尽管模型架构和训练目标已被广泛研究,分词(尤其是在多语言环境下)仍然是大型语言模型(LLM)开发中相对被忽视的方面。现有的分词器常表现出较高的词元与单词比、上下文长度利用效率低以及推理速度较慢。我们提出了一项系统性研究,将词表大小、预分词规则和训练语料构成与词元-单词效率及模型质量联系起来。为了在语言学多样的背景下开展分析,我们在印度系文字(Indic scripts)上进行了大量实验,这些文字由于其高度的文字体系多样性和正写法复杂性而带来独特挑战。基于这些分析所得的见解,我们提出了一种用于数据构成的新算法,用于在分词器训练中平衡多语言数据。我们在预分词策略上的观察显著提升了模型性能,并且我们的数据构成算法相比传统的数据随机化方法将平均词元与单词比约降低了 6%。我们的分词器在平均词元与单词比上比最先进的多语言印度语言模型改善了超过 40%。这一改进在模型性能和推理速度上都带来了可观的收益。这突显了分词与架构和训练目标一道,作为构建高效、可扩展多语言 LLMs 的关键杠杆。
Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能
Publish: 2025-08-03 15:31:10 UTC 发布:2025-08-03 15:31:10 UTC
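The token-to-word ratio that the abstract above reports can be illustrated with a small sketch. This is only an illustration under stated assumptions: words are taken to be whitespace-delimited, and `char_tokenizer` and `word_tokenizer` are hypothetical stand-ins, not the paper's tokenizer or its data composition algorithm.

```python
def token_to_word_ratio(texts, tokenize):
    """Average number of tokens per whitespace-delimited word over a corpus."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("corpus contains no words")
    return total_tokens / total_words


def char_tokenizer(text):
    # Hypothetical worst case: one token per non-space character.
    return [ch for ch in text if not ch.isspace()]


def word_tokenizer(text):
    # Hypothetical best case: one token per word.
    return text.split()


corpus = ["नमस्ते दुनिया", "hello world"]
ratio_chars = token_to_word_ratio(corpus, char_tokenizer)
ratio_words = token_to_word_ratio(corpus, word_tokenizer)

# Relative improvement of one tokenizer over another, the kind of
# percentage figure quoted in the abstract.
relative_reduction = (ratio_chars - ratio_words) / ratio_chars
```

A lower ratio means fewer tokens for the same text, which is why it translates into more usable context length and faster inference.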
#302 A Framework Combining 3D CNN and Transformer for Video-Based Behavior Recognition
Authors: [Xiuliang Zhang](https://arxiv.org/search/?searchtype=author&query=Xiuliang Zhang), [Tadiwa Elisha Nyamasvisva](https://arxiv.org/search/?searchtype=author&query=Tadiwa Elisha Nyamasvisva), [Chuntao Liu](https://arxiv.org/search/?searchtype=author&query=Chuntao Liu)
Video-based behavior recognition is essential in fields such as public safety, intelligent surveillance, and human-computer interaction. Traditional 3D Convolutional Neural Networks (3D CNNs) effectively capture local spatiotemporal features but struggle with modeling long-range dependencies. Conversely, Transformers excel at learning global contextual information but face challenges with high computational costs. To address these limitations, we propose a hybrid framework combining 3D CNN and Transformer architectures. The 3D CNN module extracts low-level spatiotemporal features, while the Transformer module captures long-range temporal dependencies, with a fusion mechanism integrating both representations. Evaluated on benchmark datasets, the proposed model outperforms traditional 3D CNNs and standalone Transformers, achieving higher recognition accuracy with manageable complexity. Ablation studies further validate the complementary strengths of the two modules. This hybrid framework offers an effective and scalable solution for video-based behavior recognition.
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-08-02 07:33:29 UTC
#303 PiKV: KV Cache Management System for Mixture of Experts
Authors: [Dong Liu](https://arxiv.org/search/?searchtype=author&query=Dong Liu), [Yanxuan Yu](https://arxiv.org/search/?searchtype=author&query=Yanxuan Yu), [Ben Lengerich](https://arxiv.org/search/?searchtype=author&query=Ben Lengerich), [Ying Nian Wu](https://arxiv.org/search/?searchtype=author&query=Ying Nian Wu), [Xuhong Wang](https://arxiv.org/search/?searchtype=author&query=Xuhong Wang)
As large language models continue to scale up in both size and context length, the memory and communication cost of key-value (KV) cache storage has become a major bottleneck in multi-GPU and multi-node inference. While MoE-based architectures sparsify computation across experts, the corresponding KV caches remain dense and globally synchronized, resulting in significant overhead. We introduce PiKV, a parallel and distributed KV cache serving framework tailored for MoE architectures. PiKV leverages expert-sharded KV storage to partition caches across GPUs, PiKV routing to reduce token-to-KV access, and PiKV scheduling to adaptively retain query-relevant entries. To further reduce memory usage, PiKV integrates PiKV compression modules into the caching pipeline for acceleration. PiKV is publicly available as an open-source software library: https://github.com/NoakLiu/PiKV. Experimental details are recorded at https://github.com/NoakLiu/PiKV/blob/main/downstream_tasks/README.md. We have also integrated PiKV with Nvidia kvpress for acceleration; for details, see https://github.com/NoakLiu/PiKVpress. PiKV is still a living project, aiming to become a comprehensive KV cache management system for MoE architectures.
Subjects: Distributed, Parallel, and Cluster Computing, Artificial Intelligence, Hardware Architecture
Publish: 2025-08-02 03:50:14 UTC
#304 CarbonScaling: Extending Neural Scaling Laws for Carbon Footprint in Large Language Models
Authors: [Lei Jiang](https://arxiv.org/search/?searchtype=author&query=Lei Jiang), [Fan Chen](https://arxiv.org/search/?searchtype=author&query=Fan Chen)
Neural scaling laws have driven the development of increasingly large language models (LLMs) by linking accuracy improvements to growth in parameter count, dataset size, and compute. However, these laws overlook the carbon emissions that scale exponentially with LLM size. This paper presents CarbonScaling, an analytical framework that extends neural scaling laws to incorporate both operational and embodied carbon in LLM training. By integrating models for neural scaling, GPU hardware evolution, parallelism optimization, and carbon estimation, CarbonScaling quantitatively connects model accuracy to carbon footprint. Results show that while a power-law relationship between accuracy and carbon holds, real-world inefficiencies significantly increase the scaling factor. Hardware technology scaling reduces carbon emissions for small to mid-sized models, but offers diminishing returns for extremely large LLMs due to communication overhead and underutilized GPUs. Training optimizations, especially aggressive critical batch size scaling, help alleviate this inefficiency. CarbonScaling offers key insights for training more sustainable and carbon-efficient LLMs.
Subjects: Computation and Language, Artificial Intelligence, Computers and Society, Distributed, Parallel, and Cluster Computing, Machine Learning
Publish: 2025-08-02 00:41:45 UTC
#305 Retrieval augmented generation based dynamic prompting for few-shot biomedical named entity recognition using large language models
Authors: [Yao Ge](https://arxiv.org/search/?searchtype=author&query=Yao Ge), [Sudeshna Das](https://arxiv.org/search/?searchtype=author&query=Sudeshna Das), [Yuting Guo](https://arxiv.org/search/?searchtype=author&query=Yuting Guo), [Abeed Sarker](https://arxiv.org/search/?searchtype=author&query=Abeed Sarker)
Biomedical named entity recognition (NER) is a high-utility natural language processing (NLP) task, and large language models (LLMs) show promise, particularly in few-shot settings (i.e., limited training data). In this article, we address the performance challenges of LLMs for few-shot biomedical NER by investigating a dynamic prompting strategy involving retrieval-augmented generation (RAG). In our approach, the annotated in-context learning examples are selected based on their similarities with the input texts, and the prompt is dynamically updated for each instance during inference. We implemented and optimized static and dynamic prompt engineering techniques and evaluated them on five biomedical NER datasets. Static prompting with structured components increased average F1-scores by 12% for GPT-4, and 11% for GPT-3.5 and LLaMA 3-70B, relative to basic static prompting. Dynamic prompting further improved performance, with TF-IDF and SBERT retrieval methods yielding the best results, improving average F1-scores by 7.3% and 5.6% in 5-shot and 10-shot settings, respectively. These findings highlight the utility of contextually adaptive prompts via RAG for biomedical NER.
Subjects: Computation and Language, Artificial Intelligence
Publish: 2025-07-25 20:57:16 UTC
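The dynamic prompting idea in the abstract above (pick the annotated examples most similar to each input, then rebuild the prompt per instance) can be sketched with a plain TF-IDF retriever. This is a minimal sketch under assumptions, not the authors' pipeline: the example pool, the prompt wording, and the helper names `select_examples` and `build_prompt` are all hypothetical, and the IDF uses the common smoothed form `ln((1 + n) / (1 + df)) + 1`.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            term: (count / len(toks))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(query, pool, k):
    """Pick the k annotated examples most similar to the query text."""
    texts = [ex["text"] for ex in pool]
    vectors = tfidf_vectors(texts + [query])
    query_vec = vectors[-1]
    ranked = sorted(range(len(pool)),
                    key=lambda i: cosine(vectors[i], query_vec),
                    reverse=True)
    return [pool[i] for i in ranked[:k]]


def build_prompt(query, examples):
    # Hypothetical prompt layout: instruction, k demonstrations, then the query.
    parts = ["Extract the biomedical entities from each sentence."]
    for ex in examples:
        parts.append(f"Sentence: {ex['text']}\nEntities: {ex['entities']}")
    parts.append(f"Sentence: {query}\nEntities:")
    return "\n\n".join(parts)


# Hypothetical annotated pool, for illustration only.
pool = [
    {"text": "aspirin reduces fever in adults",
     "entities": "aspirin (drug); fever (symptom)"},
    {"text": "the patient reported severe headache",
     "entities": "headache (symptom)"},
    {"text": "metformin is used for type 2 diabetes",
     "entities": "metformin (drug); type 2 diabetes (disease)"},
]
query = "ibuprofen reduces fever and headache"
chosen = select_examples(query, pool, k=2)
prompt = build_prompt(query, chosen)
```

Swapping `cosine` over TF-IDF vectors for cosine over sentence embeddings would give the SBERT-style variant the abstract mentions; the selection and prompt-building steps stay the same.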
#306 Understanding Human Limits in Pattern Recognition: A Computational Model of Sequential Reasoning in Rock, Paper, Scissors
Authors: [Logan Cross](https://arxiv.org/search/?searchtype=author&query=Logan Cross), [Erik Brockbank](https://arxiv.org/search/?searchtype=author&query=Erik Brockbank), [Tobias Gerstenberg](https://arxiv.org/search/?searchtype=author&query=Tobias Gerstenberg), [Judith E. Fan](https://arxiv.org/search/?searchtype=author&query=Judith E. Fan), [Daniel L. K. Yamins](https://arxiv.org/search/?searchtype=author&query=Daniel L. K. Yamins), [Nick Haber](https://arxiv.org/search/?searchtype=author&query=Nick Haber)
How do we predict others from patterns in their behavior, and what are the computational constraints that limit this ability? We investigate these questions by modeling human behavior over repeated games of rock, paper, scissors from Brockbank & Vul (2024). Against algorithmic opponents that varied in strategic sophistication, people readily exploit simple transition patterns (e.g., consistently playing rock after paper) but struggle to detect more complex sequential dependencies. To understand the cognitive mechanisms underlying these abilities and their limitations, we deploy Hypothetical Minds (HM), a large language model-based agent that generates and tests hypotheses about opponent strategies, as a cognitive model of this behavior (Cross et al., 2024). We show that when applied to the same experimental conditions, HM closely mirrors human performance patterns, succeeding and failing in similar ways. To better understand the source of HM’s failures and whether people might face similar cognitive bottlenecks in this context, we performed a series of ablations and augmentations targeting different components of the system. When provided with natural language descriptions of the opponents’ strategies, HM successfully exploited 6/7 bot opponents with win rates >80%, suggesting that accurate hypothesis generation is the primary cognitive bottleneck in this task. Further, by systematically manipulating the model’s hypotheses through pedagogically-inspired interventions, we find that the model substantially updates its causal understanding of opponent behavior, revealing how model-based analyses can produce testable hypotheses about human cognition.
Subjects: Neurons and Cognition, Artificial Intelligence
Publish: 2025-07-25 15:56:25 UTC
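The "simple transition pattern" mentioned in the abstract above (an opponent whose next move depends only on its previous move) can be exploited by a first-order transition counter. The sketch below is a plain baseline for illustration, not the Hypothetical Minds agent; the class name `TransitionExploiter` and the bot rule `NEXT_BOT_MOVE` are assumptions.

```python
from collections import Counter, defaultdict

# BEATS[x] is the move that beats x.
BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

# Hypothetical bot: its next move depends only on its own previous move
# (for example, it always plays rock after paper).
NEXT_BOT_MOVE = {"rock": "scissors", "scissors": "paper", "paper": "rock"}


class TransitionExploiter:
    """Predict the opponent's next move from its previous move."""

    def __init__(self):
        self.counts = defaultdict(Counter)
        self.last_opponent_move = None

    def choose(self):
        history = self.counts.get(self.last_opponent_move)
        if not history:
            return "rock"  # no evidence yet: arbitrary default
        predicted = history.most_common(1)[0][0]
        return BEATS[predicted]

    def observe(self, opponent_move):
        if self.last_opponent_move is not None:
            self.counts[self.last_opponent_move][opponent_move] += 1
        self.last_opponent_move = opponent_move


def simulate(rounds=30):
    agent = TransitionExploiter()
    bot_move = "rock"
    wins = 0
    for _ in range(rounds):
        agent_move = agent.choose()
        if BEATS[bot_move] == agent_move:
            wins += 1
        agent.observe(bot_move)
        bot_move = NEXT_BOT_MOVE[bot_move]
    return wins


wins = simulate()
```

Once each of the three transitions has been seen once, this counter wins every round against such a bot. Dependencies on longer histories or on the agent's own moves are invisible to it, which is the kind of more complex sequential structure the abstract says people struggle with.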
#307 Computing with Canonical Microcircuits
Author: [PK Douglas](https://arxiv.org/search/?searchtype=author&query=PK Douglas)
The human brain represents the only known example of general intelligence that naturally aligns with human values. On a mere 20-watt power budget, the brain achieves robust learning and adaptive decision-making in ways that continue to elude advanced AI systems. Inspired by the brain, we present a computational architecture based on canonical microcircuits (CMCs), stereotyped patterns of neurons found ubiquitously throughout the cortex. We implement these circuits as neural ODEs comprising spiny stellate, inhibitory, and pyramidal neurons, forming an 8-dimensional dynamical system with biologically plausible recurrent connections. Our experiments show that even a single CMC node achieves 97.8% accuracy on MNIST, while hierarchical configurations, with learnable inter-regional connectivity and recurrent connections, yield improved performance on more complex image benchmarks. Notably, our approach achieves competitive results using substantially fewer parameters than conventional deep learning models. Phase space analysis revealed distinct dynamical trajectories for different input classes, highlighting interpretable, emergent behaviors observed in biological systems. These findings suggest that neuromorphic computing approaches can improve both efficiency and interpretability in artificial neural networks, offering new directions for parameter-efficient architectures grounded in the computational principles of the human brain.
Subjects: Neurons and Cognition, Artificial Intelligence, Neural and Evolutionary Computing
Publish: 2025-07-25 11:10:13 UTC
#308 Network-Specific Models for Multimodal Brain Response Prediction
Authors: [Andrea Corsico](https://arxiv.org/search/?searchtype=author&query=Andrea Corsico), [Giorgia Rigamonti](https://arxiv.org/search/?searchtype=author&query=Giorgia Rigamonti), [Simone Zini](https://arxiv.org/search/?searchtype=author&query=Simone Zini), [Luigi Celona](https://arxiv.org/search/?searchtype=author&query=Luigi Celona), [Paolo Napoletano](https://arxiv.org/search/?searchtype=author&query=Paolo Napoletano)
In this work, we present a network-specific approach for predicting brain responses to complex multimodal movies, leveraging the Yeo 7-network parcellation of the Schaefer atlas. Rather than treating the brain as a homogeneous system, we grouped the seven functional networks into four clusters and trained separate multi-subject, multi-layer perceptron (MLP) models for each. This architecture supports cluster-specific optimization and adaptive memory modeling, allowing each model to adjust temporal dynamics and modality weighting based on the functional role of its target network. Our results demonstrate that this clustered strategy significantly enhances prediction accuracy across the 1,000 cortical regions of the Schaefer atlas. The final model achieved an eighth-place ranking in the Algonauts Project 2025 Challenge, with out-of-distribution (OOD) correlation scores nearly double those of the baseline model used in the selection phase. Code is available at https://github.com/Corsi01/algo2025.
Subjects: Neurons and Cognition, Artificial Intelligence
Publish: 2025-07-25 10:21:06 UTC
#309 Forecasting Commodity Price Shocks Using Temporal and Semantic Fusion of Prices Signals and Agentic Generative AI Extracted Economic News
Authors: [Mohammed-Khalil Ghali](https://arxiv.org/search/?searchtype=author&query=Mohammed-Khalil Ghali), [Cecil Pang](https://arxiv.org/search/?searchtype=author&query=Cecil Pang), [Oscar Molina](https://arxiv.org/search/?searchtype=author&query=Oscar Molina), [Carlos Gershenson-Garcia](https://arxiv.org/search/?searchtype=author&query=Carlos Gershenson-Garcia), [Daehan Won](https://arxiv.org/search/?searchtype=author&query=Daehan Won)
Accurate forecasting of commodity price spikes is vital for countries with limited economic buffers, where sudden increases can strain national budgets, disrupt import-reliant sectors, and undermine food and energy security. This paper introduces a hybrid forecasting framework that combines historical commodity price data with semantic signals derived from global economic news, using an agentic generative AI pipeline. The architecture integrates dual-stream Long Short-Term Memory (LSTM) networks with attention mechanisms to fuse structured time-series inputs with semantically embedded, fact-checked news summaries collected from 1960 to 2023. The model is evaluated on a 64-year dataset comprising normalized commodity price series and temporally aligned news embeddings. Results show that the proposed approach achieves a mean AUC of 0.94 and an overall accuracy of 0.91, substantially outperforming traditional baselines such as logistic regression (AUC = 0.34), random forest (AUC = 0.57), and support vector machines (AUC = 0.47). Additional ablation studies reveal that the removal of attention or dimensionality reduction leads to moderate declines in performance, while eliminating the news component causes a steep drop in AUC to 0.46, underscoring the critical value of incorporating real-world context through unstructured text. These findings demonstrate that integrating agentic generative AI with deep learning can meaningfully improve early detection of commodity price shocks, offering a practical tool for economic planning and risk mitigation in volatile market environments while saving the very high costs of operating a full generative AI agents pipeline.
Subjects: Computational Finance, Artificial Intelligence, Machine Learning
Publish: 2025-07-24 20:52:47 UTC
#310 Semi-automated Fact-checking in Portuguese: Corpora Enrichment using Retrieval with Claim extraction
Authors: [Juliana Resplande Sant’anna Gomes](https://arxiv.org/search/?searchtype=author&query=Juliana Resplande Sant’anna Gomes), [Arlindo Rodrigues Galvão Filho](https://arxiv.org/search/?searchtype=author&query=Arlindo Rodrigues Galvão Filho)
The accelerated dissemination of disinformation often outpaces the capacity for manual fact-checking, highlighting the urgent need for Semi-Automated Fact-Checking (SAFC) systems. Within the Portuguese language context, there is a noted scarcity of publicly available datasets that integrate external evidence, an essential component for developing robust AFC systems, as many existing resources focus solely on classification based on intrinsic text features. This dissertation addresses this gap by developing, applying, and analyzing a methodology to enrich Portuguese news corpora (Fake.Br, COVID19.BR, MuMiN-PT) with external evidence. The approach simulates a user’s verification process, employing Large Language Models (LLMs, specifically Gemini 1.5 Flash) to extract the main claim from texts and search engine APIs (Google Search API, Google FactCheck Claims Search API) to retrieve relevant external documents (evidence). Additionally, a data validation and preprocessing framework, including near-duplicate detection, is introduced to enhance the quality of the base corpora.
Subjects: Computation and Language, Artificial Intelligence, Information Retrieval
Publish: 2025-07-19 23:46:40 UTC
#311 AuthPrint: Fingerprinting Generative Models Against Malicious Model Providers
Authors: [Kai Yao](https://arxiv.org/search/?searchtype=author&query=Kai Yao), [Marc Juarez](https://arxiv.org/search/?searchtype=author&query=Marc Juarez)
Generative models are increasingly adopted in high-stakes domains, yet current deployments offer no mechanisms to verify the origin of model outputs. We address this gap by extending model fingerprinting techniques beyond the traditional collaborative setting to one where the model provider may act adversarially. To our knowledge, this is the first work to evaluate fingerprinting for provenance attribution under such a threat model. The methods rely on a trusted verifier that extracts secret fingerprints from the model’s output space, unknown to the provider, and trains a model to predict and verify them. Our empirical evaluation shows that our methods achieve near-zero FPR@95%TPR for instances of GAN and diffusion models, even when tested on small modifications to the original architecture and training data. Moreover, the methods remain robust against adversarial attacks that actively modify the outputs to bypass detection. Source code is available at https://github.com/PSMLab/authprint.
Subject: Cryptography and Security
Publish: 2025-08-06 12:17:38 UTC
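The FPR@95%TPR metric quoted in the abstract above is the false-positive rate measured at the threshold where 95% of true positives are still detected. A minimal sketch of that computation follows; the function name `fpr_at_tpr`, the convention that higher scores mean "more likely positive", and the score lists are all assumptions for illustration, not values or code from the paper.

```python
import math


def fpr_at_tpr(positive_scores, negative_scores, target_tpr=0.95):
    """False-positive rate at the strictest threshold reaching target_tpr.

    A sample is flagged positive when its score is >= the threshold.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(positive_scores, reverse=True)
    # Smallest number of positives that must be accepted to hit the target.
    needed = max(math.ceil(target_tpr * len(ranked)), 1)
    threshold = ranked[needed - 1]
    false_positives = sum(1 for s in negative_scores if s >= threshold)
    return false_positives / len(negative_scores)


# Hypothetical detector scores.
positives = [i / 100 for i in range(60, 100, 2)]   # 20 scores: 0.60 .. 0.98
negatives = [0.10, 0.20, 0.30, 0.40, 0.50, 0.55, 0.61, 0.65, 0.15, 0.25]
fpr = fpr_at_tpr(positives, negatives)
```

With these scores the threshold lands at 0.62 (19 of 20 positives accepted), and only one of the ten negatives scores at or above it. A "near-zero" value, as the abstract reports, means almost no negatives clear that threshold.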
#312 UPP: Unified Path Planner with Adaptive Safety and Optimality
Authors: [Jatin Kumar Arora](https://arxiv.org/search/?searchtype=author&query=Jatin Kumar Arora), [Shubhendu Bhasin](https://arxiv.org/search/?searchtype=author&query=Shubhendu Bhasin)
We are surrounded by robots helping us perform complex tasks. Robots have a wide range of applications, from industrial automation to personalized assistance. However, with great technological innovation come significant challenges. One of the major challenges in robotics is path planning. Despite advancements such as graph search, sampling, and potential field methods, most path planning algorithms focus either on optimality or on safety. Very little research addresses both simultaneously. We propose a Unified Path Planner (UPP) that uses modified heuristics and a dynamic safety cost function to balance safety and optimality. The level of safety can be adjusted via tunable parameters, trading off against computational complexity. We demonstrate the planner’s performance in simulations, showing how parameter variation affects results. UPP is compared with various traditional and safe-optimal planning algorithms across different scenarios. We also validate it on a TurtleBot, where the robot successfully finds safe and sub-optimal paths.
Subject: Robotics
Publish: 2025-05-29 07:34:56 UTC
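The trade-off between path length and safety described in the abstract above can be illustrated with ordinary A* plus an additive safety term. This is a generic sketch, not the UPP algorithm: the cost form `1 + safety_weight / distance_to_nearest_obstacle`, the 4-connected grid, and the names `plan` and `safety_weight` are all assumptions chosen for illustration.

```python
import heapq


def nearest_obstacle_distance(cell, obstacles):
    """Manhattan distance from cell to the closest obstacle (inf if none)."""
    if not obstacles:
        return float("inf")
    return min(abs(cell[0] - o[0]) + abs(cell[1] - o[1]) for o in obstacles)


def plan(start, goal, size, obstacles, safety_weight=0.0):
    """A* on a 4-connected size x size grid with an additive safety cost.

    Entering a cell costs 1 + safety_weight / distance_to_nearest_obstacle,
    so safety_weight = 0 gives the shortest path and larger values push
    the path away from obstacles.
    """
    obstacles = set(obstacles)

    def heuristic(cell):
        # Manhattan distance never overestimates: every step costs >= 1.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(heuristic(start), 0.0, start)]
    best_cost = {start: 0.0}
    parent = {start: None}
    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale queue entry
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cell[0] + dx, cell[1] + dy)
            if not (0 <= nxt[0] < size and 0 <= nxt[1] < size):
                continue
            if nxt in obstacles:
                continue
            step = 1.0 + safety_weight / nearest_obstacle_distance(nxt, obstacles)
            new_cost = cost + step
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                parent[nxt] = cell
                heapq.heappush(frontier, (new_cost + heuristic(nxt), new_cost, nxt))
    return None  # goal unreachable


# Hypothetical scenario: one obstacle next to the straight-line route.
shortest = plan((0, 2), (4, 2), 5, [(2, 1)], safety_weight=0.0)
safe = plan((0, 2), (4, 2), 5, [(2, 1)], safety_weight=20.0)
```

With `safety_weight=0.0` the planner takes the straight route past the obstacle; with a large weight it accepts a longer route that stays farther away, which mirrors the "safe and sub-optimal" outcome the abstract describes.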
1.3 Huggingface
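- GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models (74▲)
- Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off (31▲)
- InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization (17▲)
- Memp: Exploring Agent Procedural Memory (16▲)
- Pruning the Unsurprising: Efficient Code Reasoning via First-Token Surprisal (10▲)
- GENIE: Gaussian Encoding for Neural Radiance Fields Interactive Editing (6▲)
- Adapting Vision-Language Models Without Labels: A Survey (5▲)
- MELLA: Bridging Linguistic Capability and Cultural Groundedness for Low-Resource Language MLLMs (4▲)
- Hidden Dynamics of Massive Activations in Transformer Training (3▲)
- MeshLLM: Empowering Large Language Models to Progressively Understand and Generate 3D Mesh (3▲)
- OS Agents: A Survey on MLLM-based Agents for General Computing Devices (2▲)
- VLM4D: Towards Spatiotemporal Awareness in Vision Language Models (1▲)
- On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification (77▲)
- R-Zero: Self-Evolving Reasoning LLM from Zero Data (63▲)
- Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation (54▲)
- DeepPHY: Benchmarking Agentic VLMs on Physical Reasoning (51▲)
- Hi3DEval: Advancing 3D Generation Evaluation with Hierarchical Validity (21▲)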
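- Are We on the Right Way for Assessing Document Retrieval-Augmented Generation? (14▲)
- Are Today's LLMs Ready to Explain Well-Being Concepts? (8▲)
- Can Large Multimodal Models Actively Recognize Faulty Inputs? A Systematic Evaluation Framework of Their Input Scrutiny Ability (7▲)
- Marco-Voice Technical Report (6▲)
- act-1: Computer-Using Agents with Coding as Actions (5▲)
- Evaluating, Synthesizing, and Enhancing Customer Support Conversations (5▲)
- Don't Overthink It: A Survey of Efficient R1-Style Large Reasoning Models (5▲)
- Plus 14 more papers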