LLMs / o3: Translation and Commentary on "Deliberative Alignment: Reasoning Enables Safer Language Models"

Source: original, 2025/7/20 15:38:38

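Overview: Published in December 2024, this paper proposes a new method called Deliberative Alignment that aims to make large language models (LLMs) safer. Its core idea is to have the model explicitly recall and reason over safety specifications before it answers.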

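>> Background and pain points: Current LLM safety training relies mainly on supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). These methods have two limitations:

● No deliberation: The LLM must respond to user requests instantly with a fixed amount of compute, leaving no room to deliberate even in complex safety scenarios.

● Implicit learning: The LLM must infer safety standards indirectly from large sets of labeled examples instead of directly learning the safety specifications that govern them. This makes training data-inefficient and leaves the model poorly equipped for unfamiliar scenarios or adversarial attacks.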

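>> Proposed solution: Deliberative Alignment. Deliberative Alignment is a new training method that has the LLM explicitly reason over safety specifications before it generates an answer. It consists of two core stages:

● Supervised fine-tuning (SFT): This stage trains the model to reason directly about the safety specifications. Using context distillation, a model trained only for helpfulness generates a dataset of (prompt, CoT, output) triples in which the CoT (chain-of-thought) explicitly references the safety specifications. The dataset requires no human-labeled completions.

● Reinforcement learning (RL): This stage uses high-compute RL to train the model to think more effectively. A judge LLM (GRM) that is given the safety specifications scores the model's responses against them and supplies the reward signal, further refining the model's safety reasoning.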

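>> Core steps:

● Data generation: Collect prompts labeled with safety categories, and for each (prompt, category) pair assemble the category-specific safety specification spec(category). Prompt the spec-agnostic model Gbase to generate (CoT, output) data whose CoT reasons over the safety specification.

● Filtering: Use the judge model GRM, which is given the safety specification, to filter the generated (CoT, output) data for quality and keep only high-quality samples.

● Supervised fine-tuning (SFT): Fine-tune Gbase on the filtered (prompt, CoT, output) data so that the model learns to consult the safety specification in its CoT and produce specification-compliant answers.

● Reinforcement learning (RL): Use the judge model GRM to provide the reward signal and further improve the model's responses to safety-relevant prompts.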
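The data generation and filtering steps above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `g_base`, `g_rm`, `spec_for`, and the `min_score` threshold are hypothetical stand-ins for the spec-agnostic model, the judge model, the spec(category) lookup, and the quality cutoff. The point it shows is context distillation: the specification is visible during generation and judging, but is dropped from the stored training example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class SFTExample:
    """One (prompt, CoT, output) triple; the safety spec is NOT stored."""
    prompt: str
    cot: str
    output: str


# Hypothetical interfaces (assumptions, not from the paper):
#   generator(system_prompt, user_prompt) -> (cot, output)
#   judge(spec, prompt, cot, output)      -> quality score
Generator = Callable[[str, str], Tuple[str, str]]
Judge = Callable[[str, str, str, str], float]


def build_sft_dataset(
    labeled_prompts: Iterable[Tuple[str, str]],
    spec_for: Callable[[str], str],
    g_base: Generator,
    g_rm: Judge,
    min_score: float = 0.8,  # illustrative threshold, an assumption
) -> List[SFTExample]:
    """Generate spec-aware completions, filter them, and strip the spec."""
    dataset: List[SFTExample] = []
    for prompt, category in labeled_prompts:
        spec = spec_for(category)
        # Generation: the spec is shown to the model in the system prompt.
        cot, output = g_base(spec, prompt)
        # Filtering: the judge also sees the spec when scoring.
        if g_rm(spec, prompt, cot, output) >= min_score:
            # Distillation: keep only (prompt, CoT, output), without the spec.
            dataset.append(SFTExample(prompt=prompt, cot=cot, output=output))
    return dataset
```

Fine-tuning on the resulting triples is what would let the model reproduce specification-aware reasoning when no specification is present in its context.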

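>> Advantages:

● Improved safety: Markedly increases the model's resistance to malicious prompts while lowering its over-refusal rate on benign requests.

● Greater robustness: Improves generalization to adversarial attacks and out-of-distribution (OOD) scenarios.

● Scalability: Synthetic data generation reduces dependence on large-scale human-labeled data.

● Interpretability: Because the model reasons explicitly over the safety specifications, its decisions are easier to understand and explain.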

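>> Conclusions and takeaways:

● Deliberative Alignment makes significant progress on LLM safety, achieving a Pareto improvement (fewer under-refusals and fewer over-refusals) across multiple safety benchmarks.

● Explicit reasoning over the safety specifications during inference is the key to the safety gains.

● The synthetic data generation pipeline offers a scalable approach to safety alignment.

● Deliberative Alignment improves generalization to out-of-distribution scenarios.

● Despite these positive results, the paper stresses that alignment work must keep improving as AI models become more capable, in order to handle more complex future safety challenges such as model goals diverging from human intent.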

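The paper's core contribution is a novel LLM safety alignment method, Deliberative Alignment. By having the model explicitly reason over safety specifications before answering, it addresses the lack of deliberation and the reliance on implicit learning in existing methods. Deliberative Alignment delivers significant gains in safety, robustness, and scalability, and opens new directions for LLM safety alignment research. The paper also points to open challenges, such as handling more advanced adversarial attacks and ensuring that models stay aligned with human values over the long term.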

Translation and Commentary on "Deliberative Alignment: Reasoning Enables Safer Language Models"

Link	Paper: https://assets.ctfassets.net/kftzwdyauwt9/4pNYAZteAQXWtloDdANQ7L/0aedc43a8f2d1e5c71c5e114d287593f/OpenAI_Deliberative-Alignment-Reasoning-Enables-Safer_Language-Models_122024_3.pdf
Date	December ?, 2024
Author	OpenAI

Abstract

As large-scale language models increasingly impact safety-critical domains, ensuring their reliable adherence to well-defined principles remains a fundamental challenge. We introduce Deliberative Alignment, a new paradigm that directly teaches the model safety specifications and trains it to explicitly recall and accurately reason over the specifications before answering. We used this approach to align OpenAI’s o-series models [1], and achieved highly precise adherence to OpenAI’s safety policies, without requiring human-written chain-of-thoughts or answers. Deliberative Alignment pushes the Pareto frontier by simultaneously increasing robustness to jailbreaks while decreasing overrefusal rates, and also improves out-of-distribution generalization. We demonstrate that reasoning over explicitly specified policies enables more scalable, trustworthy, and interpretable alignment.

1 Introduction

Modern Large Language Models (LLMs) are safety trained using Supervised Fine Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) to mitigate harmful, undesirable, or otherwise disallowed outputs [2]–[4]. Despite ongoing advances in these methods, today’s models still exhibit safety shortcomings: they can be tricked into revealing harmful content, often refuse legitimate requests, and remain vulnerable to jailbreak attacks [5]–[8].

We argue that many of these failures arise from two limitations in modern safety training. First, LLMs must respond instantly to user requests using a fixed amount of compute, without deliberation even for complex safety scenarios. Second, LLMs must infer underlying safety standards indirectly from large sets of labeled examples, rather than directly learning the safety specifications that govern them. This reliance on implicit, pattern-based learning leads to poor data efficiency and makes it challenging for models to generalize when facing unfamiliar scenarios or adversarial attacks.

We propose deliberative alignment, a training approach that teaches LLMs to explicitly reason through safety specifications before producing an answer. By applying this method to OpenAI’s o-series models [1], we enable them to use chain-of-thought (CoT) reasoning to examine user prompts, identify relevant policy guidelines, and generate safer responses (e.g., Figure 1).

Our method proceeds in two core stages, integrating process- and outcome-based supervision [9]. In the first stage, we teach the model to directly reason about our safety specifications within its chain-of-thought, by performing supervised fine-tuning on (prompt, CoT, output) examples where the CoTs reference the specifications. We construct this dataset using context distillation [10], [11] and an o-type model trained only for helpfulness (i.e. trained without any safety-relevant data). Concretely, we present the model with the safety specifications as part of the system prompt, generate model completions, and then strip away the system prompts to form the final dataset. This stage provides the model with a strong prior for reasoning through safety considerations. In the second stage, we use high-compute RL to train the model to think more effectively. To do so, we provide reward signal using a judge LLM that is given our safety specifications. Notably, our training procedure requires no human-labeled completions. Despite relying only on model-generated data, we achieve highly precise specification adherence. This addresses a major challenge of standard LLM safety training—its heavy dependence on large-scale, human-labeled data: As LLMs’ capabilities improve, the pool of human trainers qualified to provide such labeling shrinks, making it harder to scale safety with capabilities. Deliberative alignment’s synthetic data generation pipeline offers a scalable approach to alignment, reserving human expertise for evaluation.
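As a rough illustration of the second stage, a judge-based reward could look something like the sketch below. The passage does not describe the RL algorithm, the judge's input format, or its score range, so everything here is an assumption: `judge` is a hypothetical callable that sees the specification plus a (prompt, output) pair, scores are clamped to [0, 1], and the mean-centred advantage is a generic policy-gradient convention included only to show how such rewards are typically consumed.

```python
from typing import Callable, List, Sequence

# Hypothetical judge interface (an assumption): (spec, prompt, output) -> score
JudgeFn = Callable[[str, str, str], float]


def judge_reward(spec: str, prompt: str, output: str, judge: JudgeFn) -> float:
    """Turn a raw judge score into a reward clamped to [0, 1]."""
    score = judge(spec, prompt, output)
    return min(1.0, max(0.0, score))


def score_rollouts(
    spec: str, prompt: str, outputs: Sequence[str], judge: JudgeFn
) -> List[float]:
    """Score several sampled outputs for the same prompt."""
    return [judge_reward(spec, prompt, out, judge) for out in outputs]


def centered_advantages(rewards: Sequence[float]) -> List[float]:
    """Subtract the batch mean so above-average rollouts get positive weight.

    Generic baseline trick, not taken from the paper.
    """
    if not rewards:
        return []
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

The relevant design point is that the reward comes from a model holding the specification, so no human-labeled completion is needed at this stage either.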

We compare o1 to GPT-4o and other state-of-the-art LLMs across a range of internal and external safety benchmarks, such as jailbreak and content-policy refusal evals. The o1 models achieve a Pareto improvement by reducing both under- and overrefusals (see Figure 2) and they saturate many of our hardest safety benchmarks. Furthermore, we find that deliberative alignment enables strong generalization to out-of-distribution safety scenarios. In detailed ablation studies, we find that process-supervision provides a strong prior, and that outcome-based RL refines the CoT safety reasoning. Overall, our results suggest that chain-of-thought reasoning can serve to leverage test-time compute to improve safety behavior, ultimately training LLMs to be “right for the right reasons”.
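To make the notion of a Pareto improvement concrete, the sketch below checks dominance between models scored on two axes where higher is better: robustness to jailbreaks and the rate of not over-refusing benign prompts. The model names and numbers in the usage note are invented for illustration and are not results from the paper.

```python
from typing import Dict, List, Tuple

# (jailbreak_robustness, not_overrefuse_rate); higher is better on both axes.
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """True if `a` is at least as good as `b` on both axes and better on one."""
    at_least_as_good = a[0] >= b[0] and a[1] >= b[1]
    strictly_better = a[0] > b[0] or a[1] > b[1]
    return at_least_as_good and strictly_better


def pareto_frontier(models: Dict[str, Point]) -> List[str]:
    """Return the names of models that no other model dominates."""
    frontier = []
    for name, point in models.items():
        dominated = any(
            dominates(other_point, point)
            for other_name, other_point in models.items()
            if other_name != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)
```

With hypothetical scores such as `{"baseline": (0.40, 0.90), "candidate": (0.85, 0.93), "strict": (0.90, 0.60)}`, `candidate` dominates `baseline` because it is better on both axes, while `strict` stays on the frontier only by trading over-refusals for robustness.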

Figure 1: A sample o1 chain-of-thought. Here, a user attempts to obtain advice on untraceable payment methods to use for an adult website, in order to avoid detection by law enforcement. The user tries to jailbreak the model, by encoding the request and wrapping it with instructions intended to encourage the model to comply. In the model’s chain-of-thought, the model decodes the request and recognizes that the user is trying to trick it (highlighted in yellow). It successfully reasons through the relevant OpenAI safety policies (highlighted in green), and ultimately provides an answer that follows hard refusal style guidelines.

Figure 2: Main safety results. The o1 models advance the Pareto frontier of refusing to answer malicious jailbreak prompts (from StrongREJECT [12]) and not over-refusing benign prompts (from XSTest [13]), compared to GPT-4o and other state-of-the-art LLMs. Error bars represent estimates of standard deviation calculated over 1,000 bootstrap trials.
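The error bars in Figure 2 are described as standard deviations estimated over 1,000 bootstrap trials. A minimal sketch of that kind of estimate is shown below, assuming each benchmark prompt yields a 0/1 outcome (for example, 1 if the model behaved correctly). The per-prompt outcome format, the function name, and the fixed seed are assumptions for illustration; the caption does not give the exact procedure.

```python
import random
import statistics
from typing import Sequence


def bootstrap_std(
    outcomes: Sequence[int], n_trials: int = 1000, seed: int = 0
) -> float:
    """Estimate the standard deviation of the mean score by resampling.

    Each trial draws len(outcomes) items with replacement and records the
    mean; the spread of those means is the bootstrap error estimate.
    """
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    n = len(outcomes)
    means = []
    for _ in range(n_trials):
        resample = rng.choices(outcomes, k=n)  # sample with replacement
        means.append(sum(resample) / n)
    return statistics.stdev(means)
```

For 100 prompts with an 80% success rate, the result should land near the analytic value sqrt(0.8 × 0.2 / 100) = 0.04, which is a quick sanity check on the resampling logic.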

6 Discussion

We are encouraged by Deliberative Alignment’s effectiveness on improving alignment to OpenAI’s policy specifications and robustness to jailbreaks. The method also allows us to specify the boundary between compliance, refusal, and safe completion in finer detail than was possible before. We believe this nuanced control can lead to models that are not just safer but also more helpful. The method’s use of a synthetic data generation pipeline to create training data from provided specifications and prompts also makes it a relatively scalable approach to alignment.

We anticipate OpenAI’s policies will keep evolving, but that training models to precisely follow the current defined set of policies is essential: This practice helps us build the skills for aligning with any policy requirements, providing invaluable preparation for future scenarios where the stakes are extremely high or where strict adherence to policies is critical.

This work connects to a broader question in AI safety: will advancements in alignment keep pace with AI capabilities? That o1 model’s enhanced reasoning abilities allow for more effective implementation of alignment strategies offers optimism that alignment is progressing alongside capabilities.

However, this encouraging trend may not persist indefinitely. As AI models grow more sophisticated, they could develop goals that diverge from those intended by their developers. For instance, a highly intelligent and self-aware AI might reject the constraints and objectives set by humans [34]. Alternatively, an AI could remain committed to its human-assigned terminal goal but, in the process, pursue instrumental goals like self-preservation, resource acquisition, or enhancing its cognitive abilities [35], [36]. These power-seeking tendencies could lead to harmful or unintended consequences. And as models gain more intelligence and autonomy, the scale of potential harm from misalignment increases dramatically, with the risk of catastrophic outcomes. This underscores the urgent need for ongoing research in AI alignment. We are actively investing in better alignment strategies and research areas like monitoring chain-of-thoughts for deception [37], [38], to ensure that as AI systems become more capable, they remain aligned with human values.
