【论文速递】2025年09周 (Robotics/Embodied AI/LLM)
目录
- LLM-Microscope:揭示标点符号在Transformer上下文记忆中的隐藏作用
- 英文摘要
- 中文摘要
- SurveyX:通过大型语言模型实现学术综述自动化
- 英文摘要
- 中文摘要
- 数学推理的自我奖励校正
- 英文摘要
- 中文摘要
- VideoGrain:调节时空注意力实现多粒度视频编辑
- 英文摘要
- 中文摘要
- SWE-RL:通过开源软件演化数据上的强化学习提升LLM推理能力
- 英文摘要
- 中文摘要
- OmniAlign-V:增强MLLM与人类偏好的对齐
- 英文摘要
- 中文摘要
- 长语境大型语言模型研究
- 英文摘要
- 中文摘要
- Slamming:在一天内使用单个GPU训练语音语言模型
- 英文摘要
- 中文摘要
- GHOST 2.0:生成式高保真单次头部迁移
- 英文摘要
- 中文摘要
- Kanana:计算高效的双语语言模型
- 英文摘要
- 中文摘要
- MedVLM-R1:通过强化学习激励视觉语言模型(VLM)的医学推理能力
- 英文摘要
- 中文摘要
- SpargeAttn:加速任意模型推理的精确稀疏注意力
- 英文摘要
- 中文摘要
- DICEPTION:视觉感知任务的通用扩散模型
- 英文摘要
- 中文摘要
- 定理解释代理:面向大语言模型定理理解的多模态解释
- 英文摘要
- 中文摘要
- 迈向AI联合科学家
- 英文摘要
- 中文摘要
- R2-T2:多模态专家混合模型的测试时动态路由
- 英文摘要
- 中文摘要
- Mol-LLaMA:基于大模型的分子通用理解框架
- 英文摘要
- 中文摘要
- PhotoDoodle:从少数成对数据中学习艺术图像编辑
- 英文摘要
- 中文摘要
- MaskGWM:基于视频掩码重建的通用驾驶世界模型
- 英文摘要
- 中文摘要
- NeoBERT:下一代 BERT
- 英文摘要
- 中文摘要
- LongRoPE2:近无损的LLM上下文窗口扩展
- 英文摘要
- 中文摘要
- Audio-FLAN:初步版本
- 英文摘要
- 中文摘要
- ART:可变多层透明图像生成的匿名区域Transformer
- 英文摘要
- 中文摘要
- KV-Edit:用于精确背景保留的免训练图像编辑
- 英文摘要
- 中文摘要
- Plutus:低资源希腊金融中的大型语言模型的基准测试
- 英文摘要
- 中文摘要
- 语言模型的事实性取决于提问语言
- 英文摘要
- 中文摘要
- SIFT:通过情境贴纸夯实大语言模型的推理基础
- 英文摘要
- 中文摘要
LLM-Microscope:揭示标点符号在Transformer上下文记忆中的隐藏作用
- 标题: LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers
- 作者: Anton Razzhigaev, Matvey Mikhalchuk, Temurbek Rahmatullaev, Elizaveta Goncharova, Polina Druzhinina, Ivan Oseledets, Andrey Kuznetsov
- 日期: 2025-02-20
- ArXiv主页: https://arxiv.org/abs/2502.15007
- 论文链接: https://arxiv.org/pdf/2502.15007
- GitHub仓库: https://github.com/AIRI-Institute/LLM-Microscope
英文摘要
We introduce methods to quantify how Large Language Models (LLMs) encode and store contextual information, revealing that tokens often seen as minor (e.g., determiners, punctuation) carry surprisingly high context. Notably, removing these tokens – especially stopwords, articles, and commas – consistently degrades performance on MMLU and BABILong-4k, even if removing only irrelevant tokens. Our analysis also shows a strong correlation between contextualization and linearity, where linearity measures how closely the transformation from one layer’s embeddings to the next can be approximated by a single linear mapping. These findings underscore the hidden importance of filler tokens in maintaining context. For further exploration, we present LLM-Microscope, an open-source toolkit that assesses token-level nonlinearity, evaluates contextual memory, visualizes intermediate layer contributions (via an adapted Logit Lens), and measures the intrinsic dimensionality of representations. This toolkit illuminates how seemingly trivial tokens can be critical for long-range understanding.
中文摘要
我们提出了一套量化大型语言模型(LLM)如何编码和存储上下文信息的方法,发现通常被视为次要的token(例如限定词、标点符号)携带着出人意料的丰富上下文信息。值得注意的是,删除这些token(尤其是停用词、冠词和逗号)会持续降低模型在MMLU和BABILong-4k上的表现,即使删除的只是无关token也是如此。我们的分析还显示,上下文化程度与线性度之间存在很强的相关性,其中线性度衡量从某一层嵌入到下一层的变换能在多大程度上被单个线性映射近似。这些发现凸显了填充token在维持上下文方面的隐藏重要性。为便于进一步探索,我们发布了LLM-Microscope,这是一个开源工具包,可评估token级非线性度、评估上下文记忆、可视化中间层贡献(通过改造的Logit Lens),并测量表示的内在维度。该工具包揭示了看似微不足道的token对长程理解可能至关重要。
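下面用一个最小示意说明摘要中"线性度"的度量思路:用最小二乘拟合相邻两层嵌入之间的线性映射,以归一化残差衡量近似程度。这只是按摘要描述给出的概念性草图(函数名与随机数据均为示例),并非LLM-Microscope工具包的官方实现。

```python
import numpy as np

def linearity_score(h_prev: np.ndarray, h_next: np.ndarray) -> float:
    """估计从上一层嵌入 h_prev (n, d) 到下一层 h_next (n, d) 的最佳线性映射,
    返回 1 减去归一化残差; 越接近 1, 说明该层变换越接近单个线性映射。"""
    W, *_ = np.linalg.lstsq(h_prev, h_next, rcond=None)  # 最小二乘: h_prev @ W ≈ h_next
    residual = np.linalg.norm(h_prev @ W - h_next)
    return 1.0 - residual / np.linalg.norm(h_next)

# 示例: 构造一个近似线性的"层变换", 得分应接近 1
rng = np.random.default_rng(0)
h0 = rng.normal(size=(100, 64))
h1 = h0 @ rng.normal(size=(64, 64)) + 0.01 * rng.normal(size=(100, 64))
print(linearity_score(h0, h1))
```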
SurveyX:通过大型语言模型实现学术综述自动化
- 标题: SurveyX: Academic Survey Automation via Large Language Models
- 作者: Xun Liang, Jiawei Yang, Yezhaohui Wang, Chen Tang, Zifan Zheng, Simin Niu, Shichao Song, Hanyu Wang, Bo Tang, Feiyu Xiong, Keming Mao, Zhiyu li
- 日期: 2025-02-20
- ArXiv主页: https://arxiv.org/abs/2502.14776
- 论文链接: https://arxiv.org/pdf/2502.14776
- GitHub仓库: https://github.com/IAAR-Shanghai/SurveyX
英文摘要
Large Language Models (LLMs) have demonstrated exceptional comprehension capabilities and a vast knowledge base, suggesting that LLMs can serve as efficient tools for automated survey generation. However, recent research related to automated survey generation remains constrained by some critical limitations like finite context window, lack of in-depth content discussion, and absence of systematic evaluation frameworks. Inspired by human writing processes, we propose SurveyX, an efficient and organized system for automated survey generation that decomposes the survey composing process into two phases: the Preparation and Generation phases. By innovatively introducing online reference retrieval, a pre-processing method called Attribute Tree, and a re-polishing process, SurveyX significantly enhances the efficacy of survey composition. Experimental evaluation results show that SurveyX outperforms existing automated survey generation systems in content quality (0.259 improvement) and citation quality (1.76 enhancement), approaching human expert performance across multiple evaluation dimensions. Examples of surveys generated by SurveyX are available on www.surveyx.cn
中文摘要
大型语言模型(LLM)展现出卓越的理解能力和庞大的知识库,表明LLM可以作为自动化综述生成的高效工具。然而,近期与自动综述生成相关的研究仍受到一些关键限制,例如有限的上下文窗口、缺乏深入的内容讨论以及缺少系统的评估框架。受人类写作过程的启发,我们提出了SurveyX,这是一个高效且组织良好的自动综述生成系统,它将综述撰写过程分解为两个阶段:准备阶段和生成阶段。通过创新地引入在线文献检索、一种称为属性树(Attribute Tree)的预处理方法以及重新润色流程,SurveyX显著提升了综述撰写的效果。实验评估结果表明,SurveyX在内容质量(提升0.259)和引用质量(提升1.76)上均优于现有的自动综述生成系统,并在多个评估维度上接近人类专家水平。SurveyX生成的综述示例可在 www.surveyx.cn 查看。
数学推理的自我奖励校正
- 标题: Self-rewarding correction for mathematical reasoning
- 作者: Wei Xiong, Hanning Zhang, Chenlu Ye, Lichang Chen, Nan Jiang, Tong Zhang
- 日期: 2025-02-26
- ArXiv主页: https://arxiv.org/abs/2502.19613
- 论文链接: https://arxiv.org/pdf/2502.19613
英文摘要
We study self-rewarding reasoning large language models (LLMs), which can simultaneously generate step-by-step reasoning and evaluate the correctness of their outputs during the inference time-without external feedback. This integrated approach allows a single model to independently guide its reasoning process, offering computational advantages for model deployment. We particularly focus on the representative task of self-correction, where models autonomously detect errors in their responses, revise outputs, and decide when to terminate iterative refinement loops. To enable this, we propose a two-staged algorithmic framework for constructing self-rewarding reasoning models using only self-generated data. In the first stage, we employ sequential rejection sampling to synthesize long chain-of-thought trajectories that incorporate both self-rewarding and self-correction mechanisms. Fine-tuning models on these curated data allows them to learn the patterns of self-rewarding and self-correction. In the second stage, we further enhance the models’ ability to assess response accuracy and refine outputs through reinforcement learning with rule-based signals. Experiments with Llama-3 and Qwen-2.5 demonstrate that our approach surpasses intrinsic self-correction capabilities and achieves performance comparable to systems that rely on external reward models.
中文摘要
我们研究自我奖励推理的大型语言模型(LLM),它们能够在推理阶段同时生成逐步推理并评估自身输出的正确性,而无需外部反馈。这种一体化的方法允许单个模型独立指导其推理过程,为模型部署带来计算上的优势。我们特别关注自我纠正这一代表性任务:模型自主检测响应中的错误、修订输出,并决定何时终止迭代改进循环。为实现这一目标,我们提出了一个两阶段的算法框架,仅使用自生成数据构建自我奖励推理模型。在第一阶段,我们采用顺序拒绝采样来合成同时包含自我奖励与自我纠正机制的长思维链轨迹;在这些精选数据上微调模型,使其学会自我奖励和自我纠正的模式。在第二阶段,我们通过基于规则信号的强化学习,进一步增强模型评估响应准确性和改进输出的能力。在Llama-3和Qwen-2.5上的实验表明,我们的方法超越了模型固有的自我纠正能力,并取得了与依赖外部奖励模型的系统相当的性能。
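为直观起见,下面给出"自我奖励+自我纠正"推理循环的一个高度简化草图。其中 `model.generate` 是假设的文本生成接口,提示词格式也是示例;真实方法中的自评与终止信号由训练出的固定模式给出,细节以论文为准。

```python
def self_correct(model, question: str, max_rounds: int = 3) -> str:
    """自我奖励推理的简化示意: 生成 -> 自评 -> 视情况修订, 直到自评通过或达到轮数上限。"""
    answer = model.generate(f"问题: {question}\n请逐步推理并给出答案。")
    for _ in range(max_rounds):
        # 自我奖励: 模型自行评估答案正确性, 不依赖外部反馈
        verdict = model.generate(
            f"问题: {question}\n回答: {answer}\n该回答是否正确? 只输出 correct 或 wrong。")
        if "correct" in verdict:
            break  # 模型判定正确, 终止迭代改进循环
        # 自我纠正: 基于自评结果修订输出
        answer = model.generate(
            f"问题: {question}\n先前回答被判定有误: {answer}\n请重新逐步推理并给出修正后的答案。")
    return answer
```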
VideoGrain:调节时空注意力实现多粒度视频编辑
- 标题: VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing
- 作者: Xiangpeng Yang, Linchao Zhu, Hehe Fan, Yi Yang
- 日期: 2025-02-24
- ArXiv主页: https://arxiv.org/abs/2502.17258
- 论文链接: https://arxiv.org/pdf/2502.17258
- 项目链接: https://knightyxp.github.io/VideoGrain_project_page
- GitHub仓库: https://github.com/knightyxp/VideoGrain
英文摘要
Recent advancements in diffusion models have significantly improved video generation and editing capabilities. However, multi-grained video editing, which encompasses class-level, instance-level, and part-level modifications, remains a formidable challenge. The major difficulties in multi-grained editing include semantic misalignment of text-to-region control and feature coupling within the diffusion model. To address these difficulties, we present VideoGrain, a zero-shot approach that modulates space-time (cross- and self-) attention mechanisms to achieve fine-grained control over video content. We enhance text-to-region control by amplifying each local prompt’s attention to its corresponding spatial-disentangled region while minimizing interactions with irrelevant areas in cross-attention. Additionally, we improve feature separation by increasing intra-region awareness and reducing inter-region interference in self-attention. Extensive experiments demonstrate our method achieves state-of-the-art performance in real-world scenarios. Our code, data, and demos are available at https://knightyxp.github.io/VideoGrain_project_page/
中文摘要
扩散模型的最新进展显著提升了视频生成与编辑能力。然而,涵盖类别级、实例级和部件级修改的多粒度视频编辑仍是一项艰巨挑战。多粒度编辑的主要困难在于文本到区域控制的语义错位,以及扩散模型内部的特征耦合。为解决这些困难,我们提出了VideoGrain,一种零样本方法,通过调节时空(交叉与自)注意力机制实现对视频内容的细粒度控制。我们通过放大每个局部提示词对其对应空间解耦区域的注意力、同时最小化其在交叉注意力中与无关区域的交互,来增强文本到区域的控制;此外,我们通过提升区域内感知并减少自注意力中的区域间干扰来改进特征分离。大量实验表明,我们的方法在真实场景中达到了最先进的性能。代码、数据和演示见 https://knightyxp.github.io/VideoGrain_project_page/
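摘要中"放大局部提示词对其对应区域的注意力、抑制无关区域"的操作,可以用如下PyTorch小例子来直观理解(偏置强度与掩码形状均为示例,并非VideoGrain的官方实现):

```python
import torch

def modulate_cross_attention(scores: torch.Tensor,
                             region_mask: torch.Tensor,
                             boost: float = 2.0) -> torch.Tensor:
    """scores: (空间token数, 文本token数) 的交叉注意力 logits;
    region_mask: 同形状 0/1 掩码, 1 表示该空间位置属于该文本提示对应的区域。
    区域内加正偏置、区域外加负偏置, 再做 softmax, 即"文本到区域"的注意力调制。"""
    biased = scores + boost * region_mask - boost * (1.0 - region_mask)
    return torch.softmax(biased, dim=-1)

# 示例: 4 个空间位置、3 个局部文本提示
scores = torch.randn(4, 3)
mask = torch.tensor([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=torch.float)
print(modulate_cross_attention(scores, mask))
```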
SWE-RL:通过开源软件演化数据上的强化学习提升LLM推理能力
- 标题: SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
- 作者: Yuxiang Wei, Olivier Duchenne, Jade Copet, Quentin Carbonneaux, Lingming Zhang, Daniel Fried, Gabriel Synnaeve, Rishabh Singh, Sida I. Wang
- 日期: 2025-02-25
- ArXiv主页: https://arxiv.org/abs/2502.18449
- 论文链接: https://arxiv.org/pdf/2502.18449
- GitHub仓库: https://github.com/facebookresearch/swe-rl
英文摘要
The recent DeepSeek-R1 release has demonstrated the immense potential of reinforcement learning (RL) in enhancing the general reasoning capabilities of large language models (LLMs). While DeepSeek-R1 and other follow-up work primarily focus on applying RL to competitive coding and math problems, this paper introduces SWE-RL, the first approach to scale RL-based LLM reasoning for real-world software engineering. Leveraging a lightweight rule-based reward (e.g., the similarity score between ground-truth and LLM-generated solutions), SWE-RL enables LLMs to autonomously recover a developer’s reasoning processes and solutions by learning from extensive open-source software evolution data – the record of a software’s entire lifecycle, including its code snapshots, code changes, and events such as issues and pull requests. Trained on top of Llama 3, our resulting reasoning model, Llama3-SWE-RL-70B, achieves a 41.0% solve rate on SWE-bench Verified – a human-verified collection of real-world GitHub issues. To our knowledge, this is the best performance reported for medium-sized (<100B) LLMs to date, even comparable to leading proprietary LLMs like GPT-4o. Surprisingly, despite performing RL solely on software evolution data, Llama3-SWE-RL has even emerged with generalized reasoning skills. For example, it shows improved results on five out-of-domain tasks, namely, function coding, library use, code reasoning, mathematics, and general language understanding, whereas a supervised-finetuning baseline even leads to performance degradation on average. Overall, SWE-RL opens up a new direction to improve the reasoning capabilities of LLMs through reinforcement learning on massive software engineering data.
中文摘要
最近发布的 DeepSeek-R1 展示了强化学习(Reinforcement Learning, RL)在提升大语言模型(LLMs)通用推理能力方面的巨大潜力。虽然 DeepSeek-R1 及其后续工作主要集中在将强化学习应用于编程竞赛和数学问题,但本文提出了 SWE-RL —— 首个将基于强化学习的语言模型推理方法扩展到真实世界软件工程任务的框架。SWE-RL 利用一种轻量级的基于规则的奖励机制(例如,真实解法与模型生成解法之间的相似度评分),使大语言模型能够从大规模开源软件演化数据中自主学习开发者的推理过程与解决方案。这些演化数据记录了一个软件的完整生命周期,包括代码快照、代码变更,以及诸如 issue 和 pull request 等事件。我们在 Llama 3 的基础上进行训练,得到了最终的推理模型 Llama3-SWE-RL-70B,在 SWE-bench Verified 上达到了 41.0% 的解决率 —— 这是一个由人工验证的真实 GitHub 问题集合。据我们所知,这是迄今为止中等规模(<100B 参数)大语言模型中表现最好的结果,甚至可以媲美当前领先的闭源模型如 GPT-4o。令人惊讶的是,尽管 SWE-RL 仅在软件演化数据上进行了强化学习训练,Llama3-SWE-RL 却展现出了泛化的推理能力。例如,在五个跨领域的任务上,包括函数编程、库使用、代码推理、数学问题以及通用语言理解任务,该模型的表现均有提升;而相比之下,监督微调基线模型在平均表现上反而有所下降。总体而言,SWE-RL 开辟了一条新路径:即通过在海量软件工程数据上应用强化学习来持续提升大语言模型的推理能力。
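摘要中"轻量级的基于规则的奖励"是真实补丁与生成补丁之间的相似度评分。下面用Python标准库difflib给出该思路的最小示意(格式校验、负奖励等细节从略,具体实现以官方仓库为准):

```python
import difflib

def patch_similarity_reward(pred_patch: str, oracle_patch: str) -> float:
    """返回 [0, 1] 区间的奖励: 模型生成补丁与真实补丁的序列相似度。"""
    return difflib.SequenceMatcher(None, pred_patch, oracle_patch).ratio()

oracle = "-    return a+b\n+    return a + b\n"
pred = "-    return a+b\n+    return (a + b)\n"
print(patch_similarity_reward(pred, oracle))  # 接近 1 表示补丁高度相似
```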
OmniAlign-V:增强MLLM与人类偏好的对齐
- 标题: OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference
- 作者: Xiangyu Zhao, Shengyuan Ding, Zicheng Zhang, Haian Huang, Maosong Cao, Weiyun Wang, Jiaqi Wang, Xinyu Fang, Wenhai Wang, Guangtao Zhai, Haodong Duan, Hua Yang, Kai Chen
- 日期: 2025-02-25
- ArXiv主页: https://arxiv.org/abs/2502.18411
- 论文链接: https://arxiv.org/pdf/2502.18411
- 项目链接: https://phoenixz810.github.io/OmniAlign-V/
- GitHub仓库: https://github.com/open-compass/VLMEvalKit
英文摘要
Recent advancements in open-source multi-modal large language models (MLLMs) have primarily focused on enhancing foundational capabilities, leaving a significant gap in human preference alignment. This paper introduces OmniAlign-V, a comprehensive dataset of 200K high-quality training samples featuring diverse images, complex questions, and varied response formats to improve MLLMs’ alignment with human preferences. We also present MM-AlignBench, a human-annotated benchmark specifically designed to evaluate MLLMs’ alignment with human values. Experimental results show that finetuning MLLMs with OmniAlign-V, using Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO), significantly enhances human preference alignment while maintaining or enhancing performance on standard VQA benchmarks, preserving their fundamental capabilities. Our datasets, benchmark, code and checkpoints have been released at https://github.com/PhoenixZ810/OmniAlign-V.
中文摘要
开源多模态大型语言模型(MLLM)的最新进展主要集中在增强基础能力上,在与人类偏好的对齐方面仍存在显著差距。本文介绍了OmniAlign-V,一个包含20万条高质量训练样本的综合数据集,涵盖多样的图像、复杂的问题和多种响应格式,用于改善MLLM与人类偏好的对齐。我们还提出了MM-AlignBench,一个专门用于评估MLLM与人类价值观对齐程度的人工标注基准。实验结果表明,使用监督微调(SFT)或直接偏好优化(DPO)在OmniAlign-V上微调MLLM,可以显著增强与人类偏好的对齐,同时在标准VQA基准上保持或提升性能,保留其基础能力。我们的数据集、基准、代码和检查点已在 https://github.com/PhoenixZ810/OmniAlign-V 发布。
长语境大型语言模型研究
- 标题: Thus Spake Long-Context Large Language Model
- 作者: Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu
- 日期: 2025-02-24
- ArXiv主页: https://arxiv.org/abs/2502.17129
- 论文链接: https://arxiv.org/pdf/2502.17129
- GitHub仓库: https://github.com/OpenMOSS/Thus-Spake-Long-Context-LLM
英文摘要
Long context is an important topic in Natural Language Processing (NLP), running through the development of NLP architectures, and offers immense opportunities for Large Language Models (LLMs) giving LLMs the lifelong learning potential akin to humans. Unfortunately, the pursuit of a long context is accompanied by numerous obstacles. Nevertheless, long context remains a core competitive advantage for LLMs. In the past two years, the context length of LLMs has achieved a breakthrough extension to millions of tokens. Moreover, the research on long-context LLMs has expanded from length extrapolation to a comprehensive focus on architecture, infrastructure, training, and evaluation technologies. Inspired by the symphonic poem, Thus Spake Zarathustra, we draw an analogy between the journey of extending the context of LLM and the attempts of humans to transcend its mortality. In this survey, We will illustrate how LLM struggles between the tremendous need for a longer context and its equal need to accept the fact that it is ultimately finite. To achieve this, we give a global picture of the lifecycle of long-context LLMs from four perspectives: architecture, infrastructure, training, and evaluation, showcasing the full spectrum of long-context technologies. At the end of this survey, we will present 10 unanswered questions currently faced by long-context LLMs. We hope this survey can serve as a systematic introduction to the research on long-context LLMs.
中文摘要
长上下文是自然语言处理(NLP)中的重要主题,贯穿NLP架构的发展历程,并为大型语言模型(LLM)带来巨大机遇,赋予LLM类似人类的终身学习潜力。遗憾的是,对长上下文的追求伴随着诸多障碍;尽管如此,长上下文仍是LLM的核心竞争优势。在过去两年中,LLM的上下文长度实现了突破性扩展,达到数百万token。此外,长上下文LLM的研究已从长度外推扩展为对架构、基础设施、训练和评估技术的全面关注。受交响诗《查拉图斯特拉如是说》的启发,我们将扩展LLM上下文的历程类比为人类试图超越自身有限生命的努力。在这篇综述中,我们将阐述LLM如何在对更长上下文的巨大需求与接受其终究有限这一事实之间挣扎。为此,我们从架构、基础设施、训练和评估四个视角给出长上下文LLM生命周期的全景图,展示长上下文技术的完整谱系。在综述的最后,我们提出长上下文LLM当前面临的10个尚未解决的问题。我们希望这篇综述能够作为长上下文LLM研究的系统性导引。
Slamming:在一天内使用单个GPU训练语音语言模型
- 标题: Slamming: Training a Speech Language Model on One GPU in a Day
- 作者: Gallil Maimon, Avishai Elmakies, Yossi Adi
- 日期: 2025-02-19
- ArXiv主页: https://arxiv.org/abs/2502.15814
- 论文链接: https://arxiv.org/pdf/2502.15814
- 项目链接: https://pages.cs.huji.ac.il/adiyoss-lab/slamming/
- GitHub仓库: https://github.com/slp-rl/slamkit
英文摘要
We introduce Slam, a recipe for training high-quality Speech Language Models (SLMs) on a single academic GPU in 24 hours. We do so through empirical analysis of model initialisation and architecture, synthetic training data, preference optimisation with synthetic data and tweaking all other components. We empirically demonstrate that this training recipe also scales well with more compute getting results on par with leading SLMs in a fraction of the compute cost. We hope these insights will make SLM training and research more accessible. In the context of SLM scaling laws, our results far outperform predicted compute optimal performance, giving an optimistic view to SLM feasibility. See code, data, models, samples at - https://pages.cs.huji.ac.il/adiyoss-lab/slamming .
中文摘要
我们介绍了Slam,一套能在24小时内、单张学术级GPU上训练出高质量语音语言模型(SLM)的训练方案。我们通过对模型初始化与架构、合成训练数据、基于合成数据的偏好优化进行实证分析,并对所有其他组件进行细致调整来实现这一点。我们的实验表明,这一训练方案随算力增加同样扩展良好,只需一小部分计算成本即可取得与领先SLM相当的结果。我们希望这些见解能让SLM的训练和研究更易开展。在SLM扩展规律的背景下,我们的结果远超预测的计算最优性能,为SLM的可行性提供了乐观的前景。代码、数据、模型与样例见 https://pages.cs.huji.ac.il/adiyoss-lab/slamming 。
GHOST 2.0:生成式高保真单次头部迁移
- 标题: GHOST 2.0: generative high-fidelity one shot transfer of heads
- 作者: Alexander Groshev, Anastasiia Iashchenko, Pavel Paramonov, Denis Dimitrov, Andrey Kuznetsov
- 日期: 2025-02-25
- ArXiv主页: https://arxiv.org/abs/2502.18417
- 论文链接: https://arxiv.org/pdf/2502.18417
英文摘要
While the task of face swapping has recently gained attention in the research community, a related problem of head swapping remains largely unexplored. In addition to skin color transfer, head swap poses extra challenges, such as the need to preserve structural information of the whole head during synthesis and inpaint gaps between swapped head and background. In this paper, we address these concerns with GHOST 2.0, which consists of two problem-specific modules. First, we introduce enhanced Aligner model for head reenactment, which preserves identity information at multiple scales and is robust to extreme pose variations. Secondly, we use a Blender module that seamlessly integrates the reenacted head into the target background by transferring skin color and inpainting mismatched regions. Both modules outperform the baselines on the corresponding tasks, allowing to achieve state of the art results in head swapping. We also tackle complex cases, such as large difference in hair styles of source and target. Code is available at https://github.com/ai-forever/ghost-2.0
中文摘要
尽管换脸任务近来在研究界受到关注,但与之相关的换头问题在很大程度上仍未被探索。除了肤色迁移之外,换头还带来额外挑战,例如需要在合成过程中保留整个头部的结构信息,并修补(inpaint)换入头部与背景之间的空隙。在本文中,我们通过GHOST 2.0解决这些问题,它由两个针对具体问题的模块组成。首先,我们引入了增强的Aligner头部重演模型,它能在多个尺度上保留身份信息,并对极端姿态变化保持鲁棒。其次,我们使用Blender模块,通过迁移肤色并修补不匹配的区域,将重演后的头部无缝融入目标背景。两个模块在各自任务上均优于基线,使换头任务达到最先进的效果。我们还处理了复杂情形,例如源图像与目标图像发型差异很大的情况。代码见 https://github.com/ai-forever/ghost-2.0
Kanana:计算高效的双语语言模型
- 标题: Kanana: Compute-efficient Bilingual Language Models
- 作者: Kanana LLM Team, Yunju Bak, Hojin Lee, Minho Ryu, Jiyeon Ham, Seungjae Jung, Daniel Wontae Nam, Taegyeong Eo, Donghun Lee, Doohae Jung, Boseop Kim, Nayeon Kim, Jaesun Park, Hyunho Kim, Hyunwoong Ko, Changmin Lee, Kyoung-Woon On, Seulye Baeg, Junrae Cho, Sunghee Jung, Jieun Kang, EungGyun Kim, Eunhwa Kim, Byeongil Ko, Daniel Lee, Minchul Lee, Miok Lee, Shinbok Lee, Gaeun Seo
- 日期: 2025-02-26
- ArXiv主页: https://arxiv.org/abs/2502.18934
- 论文链接: https://arxiv.org/pdf/2502.18934
- GitHub仓库: https://github.com/kakao/kanana
英文摘要
We introduce Kanana, a series of bilingual language models that demonstrate exceeding performance in Korean and competitive performance in English. The computational cost of Kanana is significantly lower than that of state-of-the-art models of similar size. The report details the techniques employed during pre-training to achieve compute-efficient yet competitive models, including high quality data filtering, staged pre-training, depth up-scaling, and pruning and distillation. Furthermore, the report outlines the methodologies utilized during the post-training of the Kanana models, encompassing supervised fine-tuning and preference optimization, aimed at enhancing their capability for seamless interaction with users. Lastly, the report elaborates on plausible approaches used for language model adaptation to specific scenarios, such as embedding, retrieval augmented generation, and function calling. The Kanana model series spans from 2.1B to 32.5B parameters with 2.1B models (base, instruct, embedding) publicly released to promote research on Korean language models.
中文摘要
我们介绍了Kanana,一系列双语语言模型,它们在韩语上表现卓越、在英语上具有竞争力。Kanana的计算成本显著低于同等规模的最先进模型。报告详细介绍了预训练期间为获得计算高效且具竞争力的模型所采用的技术,包括高质量数据过滤、分阶段预训练、深度扩展(depth up-scaling)以及剪枝与蒸馏。此外,报告概述了Kanana模型后训练阶段所用的方法,包括监督微调和偏好优化,旨在增强其与用户无缝交互的能力。最后,报告阐述了将语言模型适配到特定场景的可行方法,例如嵌入、检索增强生成和函数调用。Kanana模型系列的参数规模从2.1B到32.5B不等,其中2.1B规模的模型(base、instruct、embedding)已公开发布,以促进韩语语言模型的研究。
MedVLM-R1:通过强化学习激励视觉语言模型(VLM)的医学推理能力
- 标题: MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning
- 作者: Jiazhen Pan, Che Liu, Junde Wu, Fenglin Liu, Jiayuan Zhu, Hongwei Bran Li, Chen Chen, Cheng Ouyang, Daniel Rueckert
- 日期: 2025-02-26
- ArXiv主页: https://arxiv.org/abs/2502.19634
- 论文链接: https://arxiv.org/pdf/2502.19634
英文摘要
Reasoning is a critical frontier for advancing medical image analysis, where transparency and trustworthiness play a central role in both clinician trust and regulatory approval. Although Medical Visual Language Models (VLMs) show promise for radiological tasks, most existing VLMs merely produce final answers without revealing the underlying reasoning. To address this gap, we introduce MedVLM-R1, a medical VLM that explicitly generates natural language reasoning to enhance transparency and trustworthiness. Instead of relying on supervised fine-tuning (SFT), which often suffers from overfitting to training distributions and fails to foster genuine reasoning, MedVLM-R1 employs a reinforcement learning framework that incentivizes the model to discover human-interpretable reasoning paths without using any reasoning references. Despite limited training data (600 visual question answering samples) and model parameters (2B), MedVLM-R1 boosts accuracy from 55.11% to 78.22% across MRI, CT, and X-ray benchmarks, outperforming larger models trained on over a million samples. It also demonstrates robust domain generalization under out-of-distribution tasks. By unifying medical image analysis with explicit reasoning, MedVLM-R1 marks a pivotal step toward trustworthy and interpretable AI in clinical practice.
中文摘要
推理是推进医学图像分析的关键前沿,其中透明度和可信度对于赢得临床医生信任和获得监管批准都至关重要。尽管医学视觉语言模型(VLM)在放射学任务上展现出潜力,但大多数现有VLM只给出最终答案而不揭示背后的推理过程。为弥补这一差距,我们提出了MedVLM-R1,一种显式生成自然语言推理以提升透明度和可信度的医学VLM。MedVLM-R1不依赖常常过拟合训练分布、难以培养真正推理能力的监督微调(SFT),而是采用强化学习框架,在不使用任何推理参考的情况下激励模型发现人类可解释的推理路径。尽管训练数据(600个视觉问答样本)和模型参数量(2B)都有限,MedVLM-R1仍将MRI、CT和X光基准上的准确率从55.11%提升至78.22%,超过了在超过一百万样本上训练的更大模型。它还在分布外任务上展现出稳健的领域泛化能力。通过将医学图像分析与显式推理相统一,MedVLM-R1标志着临床实践中迈向可信、可解释AI的关键一步。
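摘要所说"基于规则信号的强化学习"通常指可程序化验证的奖励,例如输出格式是否合规、最终选项是否正确。下面是一个假设性的奖励函数草图(标签格式与分值均为示例,并非论文的原始实现):

```python
import re

def rule_based_reward(response: str, gold_choice: str) -> float:
    """格式奖励 + 准确率奖励的简化示意:
    要求输出形如 <think>推理过程</think><answer>选项</answer>。"""
    reward = 0.0
    m = re.search(r"<think>.*</think>\s*<answer>\s*([A-D])\s*</answer>", response, re.S)
    if m:
        reward += 0.5              # 格式正确: 先给出显式推理再作答
        if m.group(1) == gold_choice:
            reward += 1.0          # 最终选项与标准答案一致
    return reward

print(rule_based_reward("<think>病灶边界清晰, 位于左肺下叶。</think><answer>B</answer>", "B"))
```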
SpargeAttn:加速任意模型推理的精确稀疏注意力
- 标题: SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference
- 作者: Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, Jianfei Chen
- 日期: 2025-02-25
- ArXiv主页: https://arxiv.org/abs/2502.18137
- 论文链接: https://arxiv.org/pdf/2502.18137
- GitHub仓库: https://github.com/thu-ml/SpargeAttn
英文摘要
An efficient attention implementation is essential for large models due to its quadratic time complexity. Fortunately, attention commonly exhibits sparsity, i.e., many values in the attention map are near zero, allowing for the omission of corresponding computations. Many studies have utilized the sparse pattern to accelerate attention. However, most existing works focus on optimizing attention within specific models by exploiting certain sparse patterns of the attention map. A universal sparse attention that guarantees both the speedup and end-to-end performance of diverse models remains elusive. In this paper, we propose SpargeAttn, a universal sparse and quantized attention for any model. Our method uses a two-stage online filter: in the first stage, we rapidly and accurately predict the attention map, enabling the skip of some matrix multiplications in attention. In the second stage, we design an online softmax-aware filter that incurs no extra overhead and further skips some matrix multiplications. Experiments show that our method significantly accelerates diverse models, including language, image, and video generation, without sacrificing end-to-end metrics. The codes are available at https://github.com/thu-ml/SpargeAttn.
中文摘要
由于注意力的时间复杂度是二次的,高效的注意力实现对大模型至关重要。幸运的是,注意力通常表现出稀疏性,即注意力图中的许多值接近于零,从而可以省略相应的计算。许多研究已利用这种稀疏模式来加速注意力。然而,大多数现有工作着眼于利用注意力图的特定稀疏模式来优化特定模型内的注意力;一种能同时保证各类模型的加速效果和端到端性能的通用稀疏注意力仍然难以实现。在本文中,我们提出了SpargeAttn,一种适用于任何模型的通用稀疏量化注意力。我们的方法使用两阶段在线过滤器:第一阶段快速且准确地预测注意力图,从而跳过注意力中的部分矩阵乘法;第二阶段设计了一个不引入额外开销的在线softmax感知过滤器,进一步跳过部分矩阵乘法。实验表明,我们的方法在不牺牲端到端指标的前提下显著加速了包括语言、图像和视频生成在内的多种模型。代码见 https://github.com/thu-ml/SpargeAttn 。
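稀疏注意力的收益来自跳过注意力图中接近零的部分。下面用NumPy给出一个按块跳过PV乘法的玩具示意:它只演示"低贡献块整体跳过"这一思想;SpargeAttn的两阶段在线过滤器是在计算完整注意力图之前就做出预测,此处为便于说明而简化(分块大小与阈值均为示例)。

```python
import numpy as np

def block_sparse_attention(Q, K, V, block=16, thresh=1e-3):
    """按块检查注意力权重, 权重几乎为零的 K/V 块整体跳过, 不参与加权求和。"""
    d = Q.shape[-1]
    out = np.zeros_like(Q)
    for i in range(0, Q.shape[0], block):
        scores = Q[i:i + block] @ K.T / np.sqrt(d)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        for j in range(0, K.shape[0], block):
            wj = w[:, j:j + block]
            if wj.max() < thresh:
                continue  # 该块对输出几乎无贡献, 跳过这次矩阵乘法
            out[i:i + block] += wj @ V[j:j + block]
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(64, 32)) for _ in range(3))
print(block_sparse_attention(Q, K, V).shape)  # (64, 32)
```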
DICEPTION:视觉感知任务的通用扩散模型
- 标题: DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks
- 作者: Canyu Zhao, Mingyu Liu, Huanyi Zheng, Muzhi Zhu, Zhiyue Zhao, Hao Chen, Tong He, Chunhua Shen
- 日期: 2025-02-24
- ArXiv主页: https://arxiv.org/abs/2502.17157
- 论文链接: https://arxiv.org/pdf/2502.17157
- 项目链接: https://aim-uofa.github.io/Diception/
英文摘要
Our primary goal here is to create a good, generalist perception model that can tackle multiple tasks, within limits on computational resources and training data. To achieve this, we resort to text-to-image diffusion models pre-trained on billions of images. Our exhaustive evaluation metrics demonstrate that DICEPTION effectively tackles multiple perception tasks, achieving performance on par with state-of-the-art models. We achieve results on par with SAM-vit-h using only 0.06% of their data (e.g., 600K vs. 1B pixel-level annotated images). Inspired by Wang et al., DICEPTION formulates the outputs of various perception tasks using color encoding; and we show that the strategy of assigning random colors to different instances is highly effective in both entity segmentation and semantic segmentation. Unifying various perception tasks as conditional image generation enables us to fully leverage pre-trained text-to-image models. Thus, DICEPTION can be efficiently trained at a cost of orders of magnitude lower, compared to conventional models that were trained from scratch. When adapting our model to other tasks, it only requires fine-tuning on as few as 50 images and 1% of its parameters. DICEPTION provides valuable insights and a more promising solution for visual generalist models.
中文摘要
我们的主要目标是在计算资源和训练数据受限的条件下,打造一个能处理多种任务的优秀通用感知模型。为此,我们利用在数十亿张图像上预训练的文本到图像扩散模型。详尽的评估指标表明,DICEPTION有效地解决了多种感知任务,取得了与最先进模型相当的性能。我们仅使用SAM-vit-h所用数据的0.06%(例如60万张对比10亿张像素级标注图像),就取得了与之相当的结果。受Wang等人的启发,DICEPTION使用颜色编码来统一表达各种感知任务的输出;我们表明,为不同实例分配随机颜色的策略在实体分割和语义分割中都非常有效。将各种感知任务统一为条件图像生成,使我们能够充分利用预训练的文本到图像模型。因此,与从头训练的传统模型相比,DICEPTION的训练成本低了几个数量级。在将我们的模型适配到其他任务时,只需在少至50张图像和1%的参数上进行微调。DICEPTION为视觉通用模型提供了宝贵的见解和更有前景的解决方案。
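摘要提到"为不同实例分配随机颜色",把分割结果统一表达成一张RGB图像。下面是该编码思路的一个直观小示意(纯说明用途,与DICEPTION的训练流程无关):

```python
import numpy as np

def encode_instances_as_colors(instance_map: np.ndarray, seed: int = 0) -> np.ndarray:
    """instance_map: (H, W) 整数数组, 每个实例一个 id, 0 为背景。
    为每个实例随机采样一种颜色, 返回 (H, W, 3) 的 uint8 图像。"""
    rng = np.random.default_rng(seed)
    h, w = instance_map.shape
    img = np.zeros((h, w, 3), dtype=np.uint8)  # 背景保持黑色
    for i in np.unique(instance_map):
        if i == 0:
            continue
        img[instance_map == i] = rng.integers(0, 256, size=3, dtype=np.uint8)
    return img

demo = np.array([[0, 1, 1], [2, 2, 0]])
print(encode_instances_as_colors(demo))
```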
定理解释代理:面向大语言模型定理理解的多模态解释
- 标题: TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding
- 作者: Max Ku, Thomas Chong, Jonathan Leung, Krish Shah, Alvin Yu, Wenhu Chen
- 日期: 2025-02-26
- ArXiv主页: https://arxiv.org/abs/2502.19400
- 论文链接: https://arxiv.org/pdf/2502.19400
- 项目链接: https://tiger-ai-lab.github.io/TheoremExplainAgent/
- GitHub仓库: https://github.com/TIGER-AI-Lab/TheoremExplainAgent
英文摘要
Understanding domain-specific theorems often requires more than just text-based reasoning; effective communication through structured visual explanations is crucial for deeper comprehension. While large language models (LLMs) demonstrate strong performance in text-based theorem reasoning, their ability to generate coherent and pedagogically meaningful visual explanations remains an open challenge. In this work, we introduce TheoremExplainAgent, an agentic approach for generating long-form theorem explanation videos (over 5 minutes) using Manim animations. To systematically evaluate multimodal theorem explanations, we propose TheoremExplainBench, a benchmark covering 240 theorems across multiple STEM disciplines, along with 5 automated evaluation metrics. Our results reveal that agentic planning is essential for generating detailed long-form videos, and the o3-mini agent achieves a success rate of 93.8% and an overall score of 0.77. However, our quantitative and qualitative studies show that most of the videos produced exhibit minor issues with visual element layout. Furthermore, multimodal explanations expose deeper reasoning flaws that text-based explanations fail to reveal, highlighting the importance of multimodal explanations.
中文摘要
理解特定领域的定理通常不仅需要基于文本的推理;通过结构化视觉解释进行有效沟通对于深入理解至关重要。虽然大型语言模型(LLM)在基于文本的定理推理中表现出色,但其生成连贯且具有教学意义的视觉解释的能力仍是一个待解决的挑战。在这项工作中,我们提出了定理解释代理(TheoremExplainAgent),一种基于代理的方法,使用Manim动画生成长篇定理解释视频(时长超过5分钟)。为系统评估多模态定理解释,我们提出了定理解释基准(TheoremExplainBench),该基准涵盖多个STEM学科的240个定理,并设计了5项自动化评估指标。实验结果表明,代理式规划对生成详细的长篇视频至关重要,其中o3-mini代理的成功率为93.8%,总体得分为0.77。然而,定量与定性分析表明,大多数生成的视频在视觉元素布局上存在小问题。此外,多模态解释暴露了文本解释未能揭示的深层推理缺陷,这进一步凸显了多模态解释的重要性。
迈向AI联合科学家
- 标题: Towards an AI co-scientist
- 作者: Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, Yossi Matias, Andrew Carroll, Kavita Kulkarni, Nenad Tomasev, Yuan Guan, Vikram Dhillon, Eeshit Dhaval Vaishnav, Byron Lee, Tiago R D Costa, José R Penadés, Gary Peltz, Yunhan Xu, Annalisa Pawlosky, Alan Karthikesalingam, Vivek Natarajan
- 日期: 2025-02-26
- ArXiv主页: https://arxiv.org/abs/2502.18864
- 论文链接: https://arxiv.org/pdf/2502.18864
英文摘要
Scientific discovery relies on scientists generating novel hypotheses that undergo rigorous experimental validation. To augment this process, we introduce an AI co-scientist, a multi-agent system built on Gemini 2.0. The AI co-scientist is intended to help uncover new, original knowledge and to formulate demonstrably novel research hypotheses and proposals, building upon prior evidence and aligned to scientist-provided research objectives and guidance. The system’s design incorporates a generate, debate, and evolve approach to hypothesis generation, inspired by the scientific method and accelerated by scaling test-time compute. Key contributions include: (1) a multi-agent architecture with an asynchronous task execution framework for flexible compute scaling; (2) a tournament evolution process for self-improving hypotheses generation. Automated evaluations show continued benefits of test-time compute, improving hypothesis quality. While general purpose, we focus development and validation in three biomedical areas: drug repurposing, novel target discovery, and explaining mechanisms of bacterial evolution and anti-microbial resistance. For drug repurposing, the system proposes candidates with promising validation findings, including candidates for acute myeloid leukemia that show tumor inhibition in vitro at clinically applicable concentrations. For novel target discovery, the AI co-scientist proposed new epigenetic targets for liver fibrosis, validated by anti-fibrotic activity and liver cell regeneration in human hepatic organoids. Finally, the AI co-scientist recapitulated unpublished experimental results via a parallel in silico discovery of a novel gene transfer mechanism in bacterial evolution. These results, detailed in separate, co-timed reports, demonstrate the potential to augment biomedical and scientific discovery and usher an era of AI empowered scientists.
中文摘要
科学发现依赖科学家提出能够经受严格实验验证的新颖假设。为了增强这一过程,我们引入了AI联合科学家,一个基于Gemini 2.0构建的多智能体系统。AI联合科学家旨在帮助发掘新的原创知识,并在已有证据的基础上、遵循科学家提供的研究目标和指导,提出可论证的新颖研究假设与方案。该系统的设计采用"生成、辩论、演化"的假设产生方式,其灵感来自科学方法,并通过扩展测试时计算加以加速。主要贡献包括:(1)带有异步任务执行框架的多智能体架构,可灵活扩展计算;(2)用于自我改进假设生成的锦标赛式演化过程。自动化评估显示,扩展测试时计算能持续带来收益,提升假设质量。虽然系统是通用的,我们将开发与验证聚焦在三个生物医学领域:药物再利用、新靶点发现,以及解释细菌进化与抗微生物耐药机制。在药物再利用方面,系统提出的候选药物取得了有前景的验证结果,包括在临床适用浓度下于体外显示出肿瘤抑制作用的急性髓系白血病候选药物。在新靶点发现方面,AI联合科学家提出了肝纤维化的新表观遗传学靶点,并通过人肝类器官中的抗纤维化活性和肝细胞再生得到验证。最后,AI联合科学家通过并行的计算机模拟(in silico)研究,发现了细菌进化中一种新的基因转移机制,重现了尚未发表的实验结果。这些结果在单独同步发布的报告中详述,展示了增强生物医学与科学发现的潜力,并有望开启AI赋能科学家的时代。
R2-T2:多模态专家混合模型的测试时动态路由
- 标题: R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts
- 作者: Zhongyang Li, Ziyue Li, Tianyi Zhou
- 日期: 2025-02-27
- ArXiv主页: https://arxiv.org/abs/2502.20395
- 论文链接: https://arxiv.org/pdf/2502.20395
- GitHub仓库: https://github.com/tianyi-lab/R2-T2
英文摘要
In large multimodal models (LMMs), the perception of non-language modalities (e.g., visual representations) is usually not on par with the large language models (LLMs)’ powerful reasoning capabilities, deterring LMMs’ performance on challenging downstream tasks. This weakness has been recently mitigated by replacing the vision encoder with a mixture-of-experts (MoE), which provides rich, multi-granularity, and diverse representations required by diverse downstream tasks. The performance of multimodal MoE largely depends on its router, which reweights and mixes the representations of different experts for each input. However, we find that the end-to-end trained router does not always produce the optimal routing weights for every test sample. To bridge the gap, we propose a novel and efficient method "Re-Routing in Test-Time" (R2-T2), which locally optimizes the vector of routing weights in test-time by moving it toward those vectors of the correctly predicted samples in a neighborhood of the test sample. We propose three R2-T2 strategies with different optimization objectives and neighbor-search spaces. R2-T2 consistently and greatly improves state-of-the-art LMMs’ performance on challenging benchmarks of diverse tasks, without training any base-model parameters.
中文摘要
在大型多模态模型(LMM)中,对非语言模态(例如视觉表征)的感知能力通常赶不上大型语言模型(LLM)强大的推理能力,拖累了LMM在具有挑战性的下游任务上的表现。这一弱点最近通过用专家混合(MoE)替换视觉编码器得到缓解,MoE能提供多样下游任务所需的丰富、多粒度且多样化的表征。多模态MoE的性能在很大程度上取决于其路由器,它为每个输入对不同专家的表征重新加权并加以混合。然而,我们发现端到端训练的路由器并不总能为每个测试样本产生最优的路由权重。为弥合这一差距,我们提出了一种新颖且高效的方法"测试时重路由"(R2-T2):在测试时,将测试样本的路由权重向量朝其邻域中被正确预测样本的路由向量移动,从而对其进行局部优化。我们提出了三种具有不同优化目标和邻域搜索空间的R2-T2策略。在多样任务的挑战性基准上,R2-T2在不训练任何基座模型参数的情况下,持续且大幅提升了最先进LMM的性能。
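R2-T2的核心操作是在测试时把路由权重向量朝邻域中"被正确预测的参考样本"的路由向量移动。下面给出与这一思路对应的极简数值示意(相似度加权与步长均为假设,三种具体策略见论文):

```python
import numpy as np

def rerouting(r_test, r_neighbors, sims, step=0.5):
    """r_test: (E,) 测试样本在 E 个专家上的路由权重;
    r_neighbors: (k, E) 邻域中被正确预测样本的路由权重; sims: (k,) 相似度。
    将路由向量朝邻居路由的相似度加权平均移动一步, 再归一化。"""
    w = np.exp(sims) / np.exp(sims).sum()   # 相似度 -> 归一化权重
    target = w @ r_neighbors                # 邻居路由向量的加权平均
    r_new = (1.0 - step) * r_test + step * target
    return r_new / r_new.sum()

r = np.array([0.7, 0.2, 0.1])
neighbors = np.array([[0.2, 0.6, 0.2], [0.3, 0.5, 0.2]])
print(rerouting(r, neighbors, np.array([0.9, 0.4])))
```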
Mol-LLaMA:基于大模型的分子通用理解框架
- 标题: Mol-LLaMA: Towards General Understanding of Molecules in Large Molecular Language Model
- 作者: Dongki Kim, Wonbin Lee, Sung Ju Hwang
- 日期: 2025-02-19
- ArXiv主页: https://arxiv.org/abs/2502.13449
- 论文链接: https://arxiv.org/pdf/2502.13449
- 项目链接: https://mol-llama.github.io/
英文摘要
Understanding molecules is key to understanding organisms and driving advances in drug discovery, requiring interdisciplinary knowledge across chemistry and biology. Although large molecular language models have achieved notable success in interpreting molecular structures, their instruction datasets are limited to the specific knowledge from task-oriented datasets and do not fully cover the fundamental characteristics of molecules, hindering their abilities as general-purpose molecular assistants. To address this issue, we propose Mol-LLaMA, a large molecular language model that grasps the general knowledge centered on molecules via multi-modal instruction tuning. To this end, we design key data types that encompass the fundamental features of molecules, incorporating essential knowledge from molecular structures. In addition, to improve understanding of molecular features, we introduce a module that integrates complementary information from different molecular encoders, leveraging the distinct advantages of different molecular representations. Our experimental results demonstrate that Mol-LLaMA is capable of comprehending the general features of molecules and generating relevant responses to users’ queries with detailed explanations, implying its potential as a general-purpose assistant for molecular analysis.
中文摘要
理解分子是理解生物体和推动药物发现进展的关键,这需要横跨化学与生物学的跨学科知识。尽管大型分子语言模型在解释分子结构方面取得了显著成功,但其指令数据集局限于面向特定任务的数据集中的特定知识,未能完整覆盖分子的基本特征,阻碍了它们成为通用分子助手的能力。为解决这一问题,我们提出了Mol-LLaMA,一个通过多模态指令微调掌握以分子为中心的通用知识的大型分子语言模型。为此,我们设计了涵盖分子基本特征的关键数据类型,融入了来自分子结构的基础知识。此外,为了增进对分子特征的理解,我们引入了一个整合不同分子编码器互补信息的模块,充分利用不同分子表征各自的优势。实验结果表明,Mol-LLaMA能够理解分子的一般特征,并对用户查询给出带有详细解释的相关回答,显示了其作为分子分析通用助手的潜力。
PhotoDoodle:从少数成对数据中学习艺术图像编辑
- 标题: PhotoDoodle: Learning Artistic Image Editing from Few-Shot Pairwise Data
- 作者: Shijie Huang, Yiren Song, Yuxuan Zhang, Hailong Guo, Xueyin Wang, Mike Zheng Shou, Jiaming Liu
- 日期: 2025-02-20
- ArXiv主页: https://arxiv.org/abs/2502.14397
- 论文链接: https://arxiv.org/pdf/2502.14397
- GitHub仓库: https://github.com/showlab/PhotoDoodle
英文摘要
We introduce PhotoDoodle, a novel image editing framework designed to facilitate photo doodling by enabling artists to overlay decorative elements onto photographs. Photo doodling is challenging because the inserted elements must appear seamlessly integrated with the background, requiring realistic blending, perspective alignment, and contextual coherence. Additionally, the background must be preserved without distortion, and the artist’s unique style must be captured efficiently from limited training data. These requirements are not addressed by previous methods that primarily focus on global style transfer or regional inpainting. The proposed method, PhotoDoodle, employs a two-stage training strategy. Initially, we train a general-purpose image editing model, OmniEditor, using large-scale data. Subsequently, we fine-tune this model with EditLoRA using a small, artist-curated dataset of before-and-after image pairs to capture distinct editing styles and techniques. To enhance consistency in the generated results, we introduce a positional encoding reuse mechanism. Additionally, we release a PhotoDoodle dataset featuring six high-quality styles. Extensive experiments demonstrate the advanced performance and robustness of our method in customized image editing, opening new possibilities for artistic creation.
中文摘要
我们介绍了PhotoDoodle,一个新颖的图像编辑框架,旨在让艺术家能够在照片上叠加装饰元素,从而支持照片涂鸦创作。照片涂鸦具有挑战性,因为插入的元素必须看起来与背景无缝融合,这要求逼真的混合、透视对齐和上下文连贯;同时,背景必须不失真地保留,并且需要从有限的训练数据中高效捕捉艺术家的独特风格。以往主要关注全局风格迁移或区域修补(inpainting)的方法无法满足这些要求。我们提出的PhotoDoodle采用两阶段训练策略:首先使用大规模数据训练一个通用图像编辑模型OmniEditor;随后使用EditLoRA,在艺术家整理的少量编辑前后图像对数据集上微调该模型,以捕捉独特的编辑风格与技法。为了提高生成结果的一致性,我们引入了位置编码复用机制。此外,我们发布了一个包含六种高质量风格的PhotoDoodle数据集。大量实验证明了我们的方法在定制化图像编辑中的先进性能与鲁棒性,为艺术创作开辟了新的可能。
MaskGWM:基于视频掩码重建的通用驾驶世界模型
- 标题: MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction
- 作者: Jingcheng Ni, Yuxin Guo, Yichen Liu, Rui Chen, Lewei Lu, Zehuan Wu
- 日期: 2025-02-17
- ArXiv主页: https://arxiv.org/abs/2502.11663
- 论文链接: https://arxiv.org/pdf/2502.11663
- 项目链接: https://sensetime-fvg.github.io/MaskGWM
- GitHub仓库: https://github.com/SenseTime-FVG/OpenDWM
英文摘要
World models that forecast environmental changes from actions are vital for autonomous driving models with strong generalization. The prevailing driving world model mainly build on video prediction model. Although these models can produce high-fidelity video sequences with advanced diffusion-based generator, they are constrained by their predictive duration and overall generalization capabilities. In this paper, we explore to solve this problem by combining generation loss with MAE-style feature-level context learning. In particular, we instantiate this target with three key design: (1) A more scalable Diffusion Transformer (DiT) structure trained with extra mask construction task. (2) we devise diffusion-related mask tokens to deal with the fuzzy relations between mask reconstruction and generative diffusion process. (3) we extend mask construction task to spatial-temporal domain by utilizing row-wise mask for shifted self-attention rather than masked self-attention in MAE. Then, we adopt a row-wise cross-view module to align with this mask design. Based on above improvement, we propose MaskGWM: a Generalizable driving World Model embodied with Video Mask reconstruction. Our model contains two variants: MaskGWM-long, focusing on long-horizon prediction, and MaskGWM-mview, dedicated to multi-view generation. Comprehensive experiments on standard benchmarks validate the effectiveness of the proposed method, which contain normal validation of Nuscene dataset, long-horizon rollout of OpenDV-2K dataset and zero-shot validation of Waymo dataset. Quantitative metrics on these datasets show our method notably improving state-of-the-art driving world model.
中文摘要
能够根据动作预测环境变化的世界模型,对于泛化能力强的自动驾驶模型至关重要。主流的驾驶世界模型主要构建在视频预测模型之上。尽管这些模型借助先进的基于扩散的生成器能够产生高保真的视频序列,但它们受限于预测时长和整体泛化能力。在本文中,我们探索通过将生成损失与MAE风格的特征级上下文学习相结合来解决这一问题。具体而言,我们用三个关键设计实现这一目标:(1)一种更具可扩展性的扩散Transformer(DiT)结构,并配合额外的掩码构造任务进行训练;(2)我们设计了与扩散相关的掩码token,以处理掩码重建与生成式扩散过程之间的模糊关系;(3)我们利用逐行掩码进行移位自注意力(而非MAE中的掩码自注意力),将掩码构造任务扩展到时空域。随后,我们采用逐行跨视图模块与该掩码设计相配合。基于上述改进,我们提出了MaskGWM:一个融合视频掩码重建的可泛化驾驶世界模型。我们的模型包含两个变体:专注长时程预测的MaskGWM-long,以及面向多视角生成的MaskGWM-mview。在标准基准上的综合实验验证了所提方法的有效性,包括nuScenes数据集的常规验证、OpenDV-2K数据集的长时程推演以及Waymo数据集的零样本验证。这些数据集上的量化指标表明,我们的方法显著改进了最先进的驾驶世界模型。
NeoBERT:下一代 BERT
- 标题: NeoBERT: A Next-Generation BERT
- 作者: Lola Le Breton, Quentin Fournier, Mariam El Mezouar, Sarath Chandar
- 日期: 2025-02-26
- ArXiv主页: https://arxiv.org/abs/2502.19587
- 论文链接: https://arxiv.org/pdf/2502.19587
英文摘要
Recent innovations in architecture, pre-training, and fine-tuning have led to the remarkable in-context learning and reasoning abilities of large auto-regressive language models such as LLaMA and DeepSeek. In contrast, encoders like BERT and RoBERTa have not seen the same level of progress despite being foundational for many downstream NLP applications. To bridge this gap, we introduce NeoBERT, a next-generation encoder that redefines the capabilities of bidirectional models by integrating state-of-the-art advancements in architecture, modern data, and optimized pre-training methodologies. NeoBERT is designed for seamless adoption: it serves as a plug-and-play replacement for existing base models, relies on an optimal depth-to-width ratio, and leverages an extended context length of 4,096 tokens. Despite its compact 250M parameter footprint, it achieves state-of-the-art results on the massive MTEB benchmark, outperforming BERT large, RoBERTa large, NomicBERT, and ModernBERT under identical fine-tuning conditions. In addition, we rigorously evaluate the impact of each modification on GLUE and design a uniform fine-tuning and evaluation framework for MTEB. We release all code, data, checkpoints, and training scripts to accelerate research and real-world adoption.
中文摘要
架构、预训练和微调方面的最新创新,造就了LLaMA和DeepSeek等大型自回归语言模型卓越的上下文学习与推理能力。相比之下,尽管BERT和RoBERTa等编码器是许多下游NLP应用的基石,它们却没有获得同等程度的进步。为弥合这一差距,我们提出了NeoBERT,一个下一代编码器,通过整合架构、现代数据和优化的预训练方法等方面的最新进展,重新定义双向模型的能力。NeoBERT为无缝替换而设计:它可以即插即用地替换现有基座模型,采用最优的深宽比,并支持4,096个token的扩展上下文长度。尽管参数量仅为紧凑的2.5亿(250M),它在大规模MTEB基准上取得了最先进的结果,在相同微调条件下超过了BERT large、RoBERTa large、NomicBERT和ModernBERT。此外,我们严格评估了每项改动对GLUE的影响,并为MTEB设计了统一的微调与评估框架。我们发布全部代码、数据、检查点和训练脚本,以加速研究和实际应用。
LongRoPE2:近无损的LLM上下文窗口扩展
- 标题: LongRoPE2: Near-Lossless LLM Context Window Scaling
- 作者: Ning Shang, Li Lyna Zhang, Siyuan Wang, Gaokai Zhang, Gilsinia Lopez, Fan Yang, Weizhu Chen, Mao Yang
- 日期: 2025-02-27
- ArXiv主页: https://arxiv.org/abs/2502.20082
- 论文链接: https://arxiv.org/pdf/2502.20082
英文摘要
LongRoPE2 is a novel approach that extends the effective context window of pre-trained large language models (LLMs) to the target length, while preserving the performance on the original shorter context window. This is achieved by three contributions: (1) a hypothesis that insufficient training in higher RoPE dimensions contributes to the persistent out-of-distribution (OOD) issues observed in existing methods; (2) an effective RoPE rescaling algorithm that adopts evolutionary search guided by “needle-driven” perplexity to address the insufficient training problem; (3) a mixed context window training approach that fine-tunes model weights to adopt rescaled RoPE for long-context sequences while preserving the short-context performance with the original RoPE. Extensive experiments on LLaMA3-8B and Phi3-mini-3.8B across various benchmarks validate the hypothesis and demonstrate the effectiveness of LongRoPE2. Remarkably, LongRoPE2 extends LLaMA3-8B to achieve a 128K effective context length while retaining over 98.5% of short-context performance, using only 10B tokens – 80x fewer than Meta’s approach, which fails to reach the target effective context length. Code will be available at https://github.com/microsoft/LongRoPE.
中文摘要
LongRoPE2是一种新方法,能将预训练大型语言模型(LLM)的有效上下文窗口扩展到目标长度,同时保持其在原有较短上下文窗口上的性能。这通过三项贡献实现:(1)一个假设:较高RoPE维度训练不足,是现有方法中持续出现分布外(OOD)问题的原因;(2)一种有效的RoPE重缩放算法,采用由"针驱动"(needle-driven)困惑度引导的进化搜索来解决训练不足问题;(3)一种混合上下文窗口训练方法:微调模型权重,使其对长上下文序列采用重缩放后的RoPE,同时用原始RoPE保持短上下文性能。在LLaMA3-8B和Phi3-mini-3.8B上跨多种基准的大量实验验证了该假设,并证明了LongRoPE2的有效性。值得注意的是,LongRoPE2仅用100亿token(比Meta的方法少80倍,而后者未能达到目标有效上下文长度)就将LLaMA3-8B扩展到128K有效上下文长度,同时保留了超过98.5%的短上下文性能。代码将在 https://github.com/microsoft/LongRoPE 发布。
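"RoPE重缩放"可以理解为:给RoPE每对维度的旋转频率除以一个缩放因子,等效拉长其波长,使训练不足的高维(低频)通道在超出原始训练长度时不落入分布外区间。下面是应用逐维缩放因子的草图(因子数值为占位,实际因子由论文中的进化搜索得到):

```python
import numpy as np

def rescaled_rope_inv_freq(dim: int, base: float = 10000.0, factors=None) -> np.ndarray:
    """标准 RoPE 的第 i 对维度频率为 base^(-2i/dim);
    重缩放即将其除以因子 lambda_i, 等效把该维波长拉长 lambda_i 倍。"""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)  # (dim/2,)
    if factors is not None:
        inv_freq = inv_freq / factors                 # 逐维应用搜索得到的缩放因子
    return inv_freq

dim = 8
factors = np.linspace(1.0, 4.0, dim // 2)  # 占位: 低频维度缩放得更多
print(rescaled_rope_inv_freq(dim, factors=factors))
```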
Audio-FLAN:初步版本
- 标题: Audio-FLAN: A Preliminary Release
- 作者: Liumeng Xue, Ziya Zhou, Jiahao Pan, Zixuan Li, Shuai Fan, Yinghao Ma, Sitong Cheng, Dongchao Yang, Haohan Guo, Yujia Xiao, Xinsheng Wang, Zixuan Shen, Chuanbo Zhu, Xinshen Zhang, Tianchi Liu, Ruibin Yuan, Zeyue Tian, Haohe Liu, Emmanouil Benetos, Ge Zhang, Yike Guo, Wei Xue
- 日期: 2025-02-23
- ArXiv主页: https://arxiv.org/abs/2502.16584
- 论文链接: https://arxiv.org/pdf/2502.16584
- GitHub仓库: https://github.com/lmxue/Audio-FLAN
英文摘要
Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learning across text and vision, its application to audio remains largely unexplored. A major obstacle is the lack of comprehensive datasets that unify audio understanding and generation. To address this, we introduce Audio-FLAN, a large-scale instruction-tuning dataset covering 80 diverse tasks across speech, music, and sound domains, with over 100 million instances. Audio-FLAN lays the foundation for unified audio-language models that can seamlessly handle both understanding (e.g., transcription, comprehension) and generation (e.g., speech, music, sound) tasks across a wide range of audio domains in a zero-shot manner. The Audio-FLAN dataset is available on HuggingFace and GitHub and will be continuously updated.
中文摘要
音频token化的最新进展显著增强了将音频能力整合进大型语言模型(LLM)的效果。然而,音频理解与音频生成往往被当作彼此独立的任务,阻碍了真正统一的音频-语言模型的发展。虽然指令微调在改进文本与视觉领域的泛化和零样本学习方面取得了显著成功,但其在音频上的应用在很大程度上仍未被探索。一个主要障碍是缺乏统一音频理解与生成的综合数据集。为此,我们推出了Audio-FLAN,一个大规模指令微调数据集,涵盖语音、音乐和声音三大领域的80个多样任务,实例数超过1亿。Audio-FLAN为统一的音频-语言模型奠定了基础,使其能够以零样本方式无缝处理各音频领域的理解(如转写、理解)与生成(如语音、音乐、声音)任务。Audio-FLAN数据集已在HuggingFace和GitHub上提供,并将持续更新。
ART:可变多层透明图像生成的匿名区域Transformer
- 标题: ART: Anonymous Region Transformer for Variable Multi-Layer Transparent Image Generation
- 作者: Yifan Pu, Yiming Zhao, Zhicong Tang, Ruihong Yin, Haoxing Ye, Yuhui Yuan, Dong Chen, Jianmin Bao, Sirui Zhang, Yanbin Wang, Lin Liang, Lijuan Wang, Ji Li, Xiu Li, Zhouhui Lian, Gao Huang, Baining Guo
- 日期: 2025-02-25
- ArXiv主页: https://arxiv.org/abs/2502.18364
- 论文链接: https://arxiv.org/pdf/2502.18364
英文摘要
Multi-layer image generation is a fundamental task that enables users to isolate, select, and edit specific image layers, thereby revolutionizing interactions with generative models. In this paper, we introduce the Anonymous Region Transformer (ART), which facilitates the direct generation of variable multi-layer transparent images based on a global text prompt and an anonymous region layout. Inspired by Schema theory (which suggests that knowledge is organized in frameworks, or schemas, that enable people to interpret and learn from new information by linking it to prior knowledge), this anonymous region layout allows the generative model to autonomously determine which set of visual tokens should align with which text tokens, in contrast to the previously dominant semantic layout for the image generation task. In addition, the layer-wise region crop mechanism, which only selects the visual tokens belonging to each anonymous region, significantly reduces attention computation costs and enables the efficient generation of images with numerous distinct layers (e.g., 50+). When compared to the full attention approach, our method is over 12 times faster and exhibits fewer layer conflicts. Furthermore, we propose a high-quality multi-layer transparent image autoencoder that supports the direct encoding and decoding of the transparency of variable multi-layer images in a joint manner. By enabling precise control and scalable layer generation, ART establishes a new paradigm for interactive content creation.
中文摘要
多层图像生成是一项基础任务,它使用户能够分离、选择和编辑特定的图像图层,从而彻底改变与生成模型的交互方式。在本文中,我们介绍了匿名区域Transformer(ART),它可以基于全局文本提示和匿名区域布局直接生成可变数量的多层透明图像。受图式理论(Schema theory)的启发(该理论认为知识以框架即图式的形式组织,人们通过将新信息与已有知识关联来理解和学习),这种匿名区域布局允许生成模型自主决定哪组视觉token应与哪些文本token对齐,这与此前图像生成任务中占主导地位的语义布局形成对比。此外,逐图层的区域裁剪机制只选取属于各匿名区域的视觉token,显著降低了注意力计算成本,使得高效生成包含大量不同图层(例如50层以上)的图像成为可能。与全注意力方法相比,我们的方法快12倍以上,且图层冲突更少。此外,我们提出了一个高质量的多层透明图像自编码器,支持以联合方式对可变多层图像的透明度直接进行编码与解码。通过实现精确控制和可扩展的图层生成,ART为交互式内容创作建立了新的范式。
KV-Edit:用于精确背景保留的免训练图像编辑
- 标题: KV-Edit: Training-Free Image Editing for Precise Background Preservation
- 作者: Tianrui Zhu, Shiyi Zhang, Jiawei Shao, Yansong Tang
- 日期: 2025-02-24
- ArXiv主页: https://arxiv.org/abs/2502.17363
- 论文链接: https://arxiv.org/pdf/2502.17363
- 项目链接: https://xilluill.github.io/projectpages/KV-Edit/
- GitHub仓库: https://github.com/Xilluill
英文摘要
Background consistency remains a significant challenge in image editing tasks. Despite extensive developments, existing works still face a trade-off between maintaining similarity to the original image and generating content that aligns with the target. Here, we propose KV-Edit, a training-free approach that uses KV cache in DiTs to maintain background consistency, where background tokens are preserved rather than regenerated, eliminating the need for complex mechanisms or expensive training, ultimately generating new content that seamlessly integrates with the background within user-provided regions. We further explore the memory consumption of the KV cache during editing and optimize the space complexity to O(1) using an inversion-free method. Our approach is compatible with any DiT-based generative model without additional training. Experiments demonstrate that KV-Edit significantly outperforms existing approaches in terms of both background and image quality, even surpassing training-based methods. Project webpage is available at https://xilluill.github.io/projectpages/KV-Edit
中文摘要
背景一致性仍然是图像编辑任务中的重大挑战。尽管已有大量进展,现有工作仍然面临在保持与原图相似和生成符合目标的内容之间的权衡。为此,我们提出了KV-Edit,一种免训练的方法,利用DiT中的KV缓存来保持背景一致性:背景token被保留而非重新生成,无需复杂机制或昂贵训练,最终在用户给定的区域内生成与背景无缝融合的新内容。我们进一步研究了编辑过程中KV缓存的内存消耗,并通过免反演(inversion-free)方法将空间复杂度优化到O(1)。我们的方法与任何基于DiT的生成模型兼容,无需额外训练。实验表明,KV-Edit在背景和图像质量两方面都显著优于现有方法,甚至超过了基于训练的方法。项目网页见 https://xilluill.github.io/projectpages/KV-Edit
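KV-Edit的背景保持可以概括为:反演阶段缓存背景token的K/V,编辑阶段这些位置直接复用缓存,只有前景token重新计算。下面是这一思想的概念性草图(张量形状与接口均为假设,并非官方实现):

```python
import torch

def attention_with_kv_edit(q, k_new, v_new, k_cache, v_cache, fg_mask):
    """q/k/v: (N, d); fg_mask: (N,) 布尔张量, True 表示用户指定的可编辑前景 token。
    背景 token 的 K/V 直接取反演时的缓存, 保证背景不被重新生成。"""
    k = torch.where(fg_mask[:, None], k_new, k_cache)  # 背景位置复用缓存的 K
    v = torch.where(fg_mask[:, None], v_new, v_cache)  # 背景位置复用缓存的 V
    attn = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v

N, d = 6, 8
q, k_new, v_new, k_cache, v_cache = (torch.randn(N, d) for _ in range(5))
fg_mask = torch.tensor([False, False, True, True, False, False])
print(attention_with_kv_edit(q, k_new, v_new, k_cache, v_cache, fg_mask).shape)
```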
Plutus:低资源希腊金融中的大型语言模型的基准测试
- 标题: Plutus: Benchmarking Large Language Models in Low-Resource Greek Finance
- 作者: Xueqing Peng, Triantafillos Papadopoulos, Efstathia Soufleri, Polydoros Giannouris, Ruoyu Xiang, Yan Wang, Lingfei Qian, Jimin Huang, Qianqian Xie, Sophia Ananiadou
- 日期: 2025-02-26
- ArXiv主页: https://arxiv.org/abs/2502.18772
- 论文链接: https://arxiv.org/pdf/2502.18772
英文摘要
Despite Greece’s pivotal role in the global economy, large language models (LLMs) remain underexplored for Greek financial context due to the linguistic complexity of Greek and the scarcity of domain-specific datasets. Previous efforts in multilingual financial natural language processing (NLP) have exposed considerable performance disparities, yet no dedicated Greek financial benchmarks or Greek-specific financial LLMs have been developed until now. To bridge this gap, we introduce Plutus-ben, the first Greek Financial Evaluation Benchmark, and Plutus-8B, the pioneering Greek Financial LLM, fine-tuned with Greek domain-specific data. Plutus-ben addresses five core financial NLP tasks in Greek: numeric and textual named entity recognition, question answering, abstractive summarization, and topic classification, thereby facilitating systematic and reproducible LLM assessments. To underpin these tasks, we present three novel, high-quality Greek financial datasets, thoroughly annotated by expert native Greek speakers, augmented by two existing resources. Our comprehensive evaluation of 22 LLMs on Plutus-ben reveals that Greek financial NLP remains challenging due to linguistic complexity, domain-specific terminology, and financial reasoning gaps. These findings underscore the limitations of cross-lingual transfer, the necessity for financial expertise in Greek-trained models, and the challenges of adapting financial LLMs to Greek text. We release Plutus-ben, Plutus-8B, and all associated datasets publicly to promote reproducible research and advance Greek financial NLP, fostering broader multilingual inclusivity in finance.
中文摘要
尽管希腊在全球经济中扮演着关键角色,但由于希腊语的语言复杂性和领域数据集的稀缺,大型语言模型(LLM)在希腊金融语境下的研究仍然不足。此前多语言金融自然语言处理(NLP)的工作已经揭示出相当大的性能差距,但迄今为止仍没有专门的希腊金融基准或面向希腊语的金融LLM。为弥合这一差距,我们推出了首个希腊金融评估基准Plutus-ben,以及首个希腊金融LLM Plutus-8B,后者使用希腊语领域数据微调而成。Plutus-ben覆盖希腊语中的五项核心金融NLP任务:数值与文本命名实体识别、问答、抽象式摘要和主题分类,从而支持系统化、可复现的LLM评估。为支撑这些任务,我们提出了三个由希腊语母语专家细致标注的全新高质量希腊金融数据集,并辅以两个现有资源。我们对22个LLM在Plutus-ben上的全面评估表明,由于语言复杂性、领域术语和金融推理缺口,希腊金融NLP仍然具有挑战性。这些发现凸显了跨语言迁移的局限、希腊语训练模型中金融专业知识的必要性,以及将金融LLM适配到希腊语文本的挑战。我们公开发布Plutus-ben、Plutus-8B及所有相关数据集,以促进可复现的研究、推进希腊金融NLP,并促进金融领域更广泛的多语言包容性。
语言模型的事实性取决于提问语言
- 标题: Language Models’ Factuality Depends on the Language of Inquiry
- 作者: Tushar Aggarwal, Kumar Tanmay, Ayush Agrawal, Kumar Ayush, Hamid Palangi, Paul Pu Liang
- 日期: 2025-02-25
- ArXiv主页: https://arxiv.org/abs/2502.17955
- 论文链接: https://arxiv.org/pdf/2502.17955
- GitHub仓库: https://github.com/kmrtanmay/X_FaKT
英文摘要
Multilingual language models (LMs) are expected to recall factual knowledge consistently across languages, yet they often fail to transfer knowledge between languages even when they possess the correct information in one of the languages. For example, we find that an LM may correctly identify Rashed Al Shashai as being from Saudi Arabia when asked in Arabic, but consistently fails to do so when asked in English or Swahili. To systematically investigate this limitation, we introduce a benchmark of 10,000 country-related facts across 13 languages and propose three novel metrics: Factual Recall Score, Knowledge Transferability Score, and Cross-Lingual Factual Knowledge Transferability Score-to quantify factual recall and knowledge transferability in LMs across different languages. Our results reveal fundamental weaknesses in today’s state-of-the-art LMs, particularly in cross-lingual generalization where models fail to transfer knowledge effectively across different languages, leading to inconsistent performance sensitive to the language used. Our findings emphasize the need for LMs to recognize language-specific factual reliability and leverage the most trustworthy information across languages. We release our benchmark and evaluation framework to drive future research in multilingual knowledge transfer.
中文摘要
人们期望多语言语言模型(LM)能够跨语言一致地回忆事实知识,然而即使模型在某一种语言中掌握了正确信息,它们也常常无法在语言之间迁移这些知识。例如,我们发现某个LM在用阿拉伯语提问时能正确识别Rashed Al Shashai来自沙特阿拉伯,但在用英语或斯瓦希里语提问时却持续失败。为系统研究这一局限,我们引入了一个覆盖13种语言、包含10,000条国家相关事实的基准,并提出三个新指标:事实召回得分(Factual Recall Score)、知识可迁移性得分(Knowledge Transferability Score)和跨语言事实知识可迁移性得分(Cross-Lingual Factual Knowledge Transferability Score),用以量化LM在不同语言间的事实召回与知识可迁移性。我们的结果揭示了当今最先进LM的根本弱点,尤其是在跨语言泛化方面:模型无法在不同语言之间有效迁移知识,导致性能随所用语言而波动、缺乏一致性。我们的发现强调,LM需要识别各语言事实可靠性的差异,并利用跨语言中最可信的信息。我们发布了基准和评估框架,以推动多语言知识迁移的未来研究。
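摘要中的"事实召回得分"可以理解为:某一语言下正确回忆事实的占比。下面是这一指标的直观草图(判定方式用简单的子串匹配代替,仅作说明,非论文的评分实现):

```python
def factual_recall_score(answers: list, golds: list) -> float:
    """某语言下的事实召回得分: 回答中包含标准答案的比例。"""
    correct = sum(gold.lower() in ans.lower() for ans, gold in zip(answers, golds))
    return correct / len(golds)

# 示例: 同一批事实用不同语言提问, 分别统计得分即可比较跨语言一致性
answers_en = ["Rashed Al Shashai is from Saudi Arabia."]
print(factual_recall_score(answers_en, ["Saudi Arabia"]))  # 1.0
```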
SIFT:通过情境贴纸夯实大语言模型的推理基础
- 标题: SIFT: Grounding LLM Reasoning in Contexts via Stickers
- 作者: Zihao Zeng, Xuyao Huang, Boxiu Li, Zhijie Deng
- 日期: 2025-02-19
- ArXiv主页: https://arxiv.org/abs/2502.14922
- 论文链接: https://arxiv.org/pdf/2502.14922
- GitHub仓库: https://github.com/zhijie-group/SIFT
英文摘要
This paper identifies the misinterpretation of the context can be a significant issue during the reasoning process of large language models, spanning from smaller models like Llama3.2-3B-Instruct to cutting-edge ones like DeepSeek-R1. For example, in the phrase “10 dollars per kilo,” LLMs might not recognize that “per” means “for each,” leading to calculation errors. We introduce a novel, post-training approach called Stick to the Facts (SIFT) to tackle this. SIFT leverages increasing inference-time compute to ground LLM reasoning in contexts. At the core of SIFT lies the Sticker, which is generated by the model itself to explicitly emphasize the key information within the context. Given the curated Sticker, SIFT generates two predictions – one from the original query and one from the query augmented with the Sticker. If they differ, the Sticker is sequentially refined via forward optimization (to better align the extracted facts with the query) and inverse generation (to conform with the model’s inherent tendencies) for more faithful reasoning outcomes. Studies across diverse models (from 3B to 100B+) and benchmarks (e.g., GSM8K, MATH-500) reveal consistent performance improvements. Notably, SIFT improves the pass@1 accuracy of DeepSeek-R1 on AIME2024 from 78.33% to 85.67%, establishing a new state-of-the-art in the open-source community. The code is available at https://github.com/zhijie-group/SIFT.
中文摘要
本文指出,上下文误读是大语言模型推理过程中的一个重要问题,从 Llama3.2-3B-Instruct 等较小模型到 DeepSeek-R1 等前沿模型均普遍存在。例如,对于"10 dollars per kilo"(每公斤10美元)这一表述,模型可能无法识别"per"表示"每",从而导致计算错误。为此,我们提出一种名为"事实锚定法"(SIFT, Stick to the Facts)的新型后训练方法。SIFT 通过增加推理时计算量,将大模型的推理锚定于上下文。其核心是由模型自身生成的"贴纸"(Sticker),用于显式强调上下文中的关键信息。给定生成的贴纸,SIFT 会产生两个预测:一个来自原始查询,另一个来自经贴纸增强的查询。若二者不一致,则通过前向优化(使提取的事实与查询更好对齐)和逆向生成(使其符合模型的固有倾向)对贴纸进行迭代精炼,以获得更可靠的推理结果。在从 3B 到 100B+ 的多种模型及 GSM8K、MATH-500 等基准上的实验均显示出一致的性能提升。值得注意的是,SIFT 将 DeepSeek-R1 在 AIME2024 上的 pass@1 准确率从 78.33% 提升至 85.67%,创下开源社区的新纪录。代码已开源:https://github.com/zhijie-group/SIFT。
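下面是 SIFT 推理流程的一个概念性草图(Python),仅根据摘要中的描述写成:生成贴纸、对比两种预测、在不一致时迭代精炼贴纸。其中 model.generate 是为说明而虚构的接口,提示词内容也仅为示意,并非官方实现(官方代码见上文仓库链接)。

```python
def sift_answer(model, query, max_rounds=3):
    """SIFT 流程的极简示意:贴纸生成 -> 双预测对比 -> 迭代精炼。"""
    # 1) 让模型自行生成"贴纸",显式提取查询/上下文中的关键事实
    sticker = model.generate(f"逐条提取下述问题中的关键事实:\n{query}")

    for _ in range(max_rounds):
        # 2) 两种预测:原始查询 vs. 贴纸增强查询
        pred_plain = model.generate(query)
        pred_sticker = model.generate(f"{query}\n\n关键事实:\n{sticker}")

        # 3) 两者一致则直接采纳
        if pred_plain == pred_sticker:
            return pred_plain

        # 4) 不一致时精炼贴纸:前向优化使事实更贴合查询,
        #    逆向生成使其表述符合模型自身的固有倾向
        sticker = model.generate(
            f"问题:{query}\n当前关键事实:\n{sticker}\n"
            f"两次回答不一致,请修正并重写关键事实。"
        )

    # 轮数用尽后,以贴纸增强的预测作为最终答案
    return pred_sticker
```

该草图省略了论文中前向优化与逆向生成的具体实现细节,仅体现"以贴纸为中介、以两次预测的一致性为信号"的整体思路。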