
[Study Notes] RL4LLM

The original post exceeded the length limit, so it has been split in half.

First half: LLM+RL


文章目录

  • 8 [RL4LLM] Understanding the reasoning model tokenizer's chat template, vllm inference
    • tokenizer
    • chat template
      • distill tokenizer
      • qwen tokenizer
    • apply chat template
    • vllm inference
  • 9 [RL4LLM] PPO workflow, a first look at OpenRLHF and veRL, and the ray distributed debugger
    • RL4LLM roadmap
      • TRL ppo trainer
    • adv (advantage) estimator
  • 9 [RL4LLM] A deeper look at the PPO-clip objective (and importance sampling)
    • clip
    • Expectation computation
  • 10 [RL4LLM] GRPO loss/objective analysis and possible biases (DAPO, Dr. GRPO)
    • 1 Misconception: zero loss means nothing to optimize
    • 2 On Dr. GRPO
    • 3 per_device_train_batch_size & num_generations
  • 11 [RL4LLM] The deepseek v3 tool-calling bug and understanding function calling in the chat_template
    • v3 chat template
    • v3-0324 chat template
    • `apply_chat_template`


8 [RL4LLM] Understanding the reasoning model tokenizer's chat template, vllm inference

  • 链接:https://github.com/chunhuizhang/llm_rl/blob/main/tutorials/r1-k1.5/reasoning_model_chat_template_and_inference.ipynb
  • video: https://www.bilibili.com/video/BV1LKXSYqE3T

Today's AI = data + algorithms + infrastructure (infra)

  • Do not dismiss the basics or skip the details just because they look simple. The more complex the system, the more you need to work from fundamentals and principles, decomposing it into modules.
  • It is not only tokenization: there is also the chat template — when it is the LLM's turn to speak, how user and assistant are distinguished (including the reasoning tokenizer covered in this episode, i.e. the so-called reasoning tokens & answer tokens);
    • next to fancy and powerful LLMs, the tokenizer seems very low level, even uninteresting or tedious;
    • chat template (for chat models; today's reasoning models are first of all chat models)
      • adds special token ids that mark identities (user/assistant/tool);
        • System: establishes the initial identity;
        • tool_call, tool_response (also identities)
      • adds a system prompt if the user did not explicitly pass one
      • parses the conversation history: a loop over the messages
from transformers import AutoTokenizer
import re
import torch

models = [
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    "Qwen/Qwen2.5-1.5B-Instruct",
    "Qwen/Qwen2.5-Math-1.5B",
    "Qwen/Qwen2.5-1.5B",
    "Qwen/QwQ-32B-Preview",
]

Here, DeepSeek-R1-Distill-Qwen-1.5B is distilled from Qwen2.5-Math-1.5B, not from Qwen2.5-1.5B-Instruct; the paper states this explicitly:

(screenshot from the R1 paper omitted)

def hf_tokenizer(name_or_path):
    tokenizer = AutoTokenizer.from_pretrained(name_or_path)
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id
        print(f'tokenizer.pad_token_id is None. Now set to {tokenizer.eos_token_id}')
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        print(f'tokenizer.pad_token is None. Now set to {tokenizer.eos_token}')
    return tokenizer

Here we also patch the padding token, falling back to the EOS token when none is defined.

tokenizer

  • DeepSeek-R1-Distill-Qwen-1.5B is distilled from Qwen2.5-Math-1.5B, not from Qwen/Qwen2.5-1.5B-Instruct;
  • it reuses the vocabulary but redefines some of the special token ids;

The code below verifies this:

def test_tokenizer(tokenizer, text):
    tokens = tokenizer.encode(text)
    print(f'{text}, tokens: {tokens}')

for name_or_path in models:
    tokenizer = hf_tokenizer(name_or_path)
    print(f'{name_or_path}, tokenizer.pad_token: {tokenizer.pad_token}, tokenizer.pad_token_id: {tokenizer.pad_token_id}')
    test_tokenizer(tokenizer, "hello world")
    print('-' * 100)

Output:

deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, tokenizer.pad_token: <|end▁of▁sentence|>, tokenizer.pad_token_id: 151643
hello world, tokens: [151646, 14990, 1879]
----------------------------------------------------------------------------------------------------
Qwen/Qwen2.5-1.5B-Instruct, tokenizer.pad_token: <|endoftext|>, tokenizer.pad_token_id: 151643
hello world, tokens: [14990, 1879]
----------------------------------------------------------------------------------------------------
Qwen/Qwen2.5-Math-1.5B, tokenizer.pad_token: <|endoftext|>, tokenizer.pad_token_id: 151643
hello world, tokens: [14990, 1879]
----------------------------------------------------------------------------------------------------
Qwen/Qwen2.5-1.5B, tokenizer.pad_token: <|endoftext|>, tokenizer.pad_token_id: 151643
hello world, tokens: [14990, 1879]
----------------------------------------------------------------------------------------------------
Qwen/QwQ-32B-Preview, tokenizer.pad_token: <|endoftext|>, tokenizer.pad_token_id: 151643
hello world, tokens: [14990, 1879]
----------------------------------------------------------------------------------------------------

The vocabularies are consistent with each other; the DeepSeek distill tokenizer's vocabulary is roughly 150k+.

distill_tokenizer = hf_tokenizer('deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B')
print(distill_tokenizer.decode(151646))
qwen_math_tokenizer = hf_tokenizer('Qwen/Qwen2.5-Math-1.5B')
print(qwen_math_tokenizer.decode(151646))
qwen_chat_tokenizer = hf_tokenizer('Qwen/Qwen2.5-1.5B-Instruct')
print(qwen_chat_tokenizer.decode(151646))
qwen_base_tokenizer = hf_tokenizer('Qwen/Qwen2.5-1.5B')
print(qwen_base_tokenizer.decode(151646))
qwen_reason_tokenizer = hf_tokenizer('Qwen/QwQ-32B-Preview')
print(qwen_reason_tokenizer.decode(151646))

Output:

  • An interesting detail: the special tokens use full-width characters (the ▁ pieces), presumably to distinguish them from other LLMs' vocabularies and to ensure such strings do not occur anywhere on the internet.
<|begin▁of▁sentence|>
<|object_ref_start|>
<|object_ref_start|>
<|object_ref_start|>
<|object_ref_start|>
distill_tokenizer.encode('<|User|>')  # [151646, 151644]
qwen_math_tokenizer.encode('<|User|>'), qwen_chat_tokenizer.encode('<|User|>'), qwen_base_tokenizer.encode('<|User|>')
# Output:
# ([27, 130957, 1474, 130957, 29],
#  [27, 130957, 1474, 130957, 29],
#  [27, 130957, 1474, 130957, 29])

The Qwen vocabularies are roughly 130k+ 👆 (and for them '<|User|>' is not a special token, so it is split into ordinary tokens).

Another interesting thing:

# what is <|end▁of▁sentence|>
# https://chat.deepseek.com/a/chat/s/569c8476-7b64-48fa-865b-9e01718b961b
# what is <|im_end|>
# https://chat.qwen.ai/c/da88d4f3-c279-4851-acbb-d3f051c11e86
distill_tokenizer.decode(151643) # '<|end▁of▁sentence|>'

The point is that DeepSeek never "sees" the literal string <|end▁of▁sentence|>: if you ask it "what is <|end▁of▁sentence|>", it cannot answer. Likewise, Qwen cannot see its own <|im_end|>:

(chat screenshots omitted; see the links above)
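
A quick way to see this from the tokenizer side (the ids shown are illustrative, taken from the runs above): the delimiter string collapses into a single special token id, so the model never observes it as ordinary text it could talk about.

print(qwen_chat_tokenizer.encode('<|im_end|>'))         # e.g. [151645]
print(distill_tokenizer.encode('<|end▁of▁sentence|>'))   # e.g. [151646, 151643] (BOS + EOS)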

chat template

AutoTokenizer.from_pretrained('meta-llama/Llama-3.1-8B').chat_template
  • jinja template
  • Llama-3.1-8B is a base model; it has no chat template
from jinja2 import Environment, FileSystemLoader

# create a Jinja2 environment
env = Environment(loader=FileSystemLoader("."))

# Template 1: standard syntax {% ... %}
template1 = env.from_string("""
{% if True %}
    Hello
{% endif %}
""")

# Template 2: whitespace-trimming syntax {%- ... -%}
template2 = env.from_string("""
{%- if True -%}
    Hello
{%- endif -%}
""")

# render the templates
result1 = template1.render()
result2 = template2.render()

# print the results
print("Result with standard {% ... %} syntax:")
print(repr(result1))  # repr shows newlines and whitespace

print("\nResult with whitespace-trimming {%- ... -%} syntax:")
print(repr(result2))

"""
Result with standard {% ... %} syntax:
'\n\n    Hello\n'

Result with whitespace-trimming {%- ... -%} syntax:
'Hello'
"""

The minus signs in {%- if True -%} trim the surrounding whitespace.

# We do not recommend using base language models for conversations. Instead, you can apply post-training, e.g., SFT, RLHF, continued pretraining, etc., on this model.
# https://huggingface.co/Qwen/Qwen2.5-1.5B
print(qwen_base_tokenizer.chat_template)

Output:

{%- if tools %}{{- '<|im_start|>system\n' }}{%- if messages[0]['role'] == 'system' %}{{- messages[0]['content'] }}{%- else %}{{- 'You are a helpful assistant.' }}{%- endif %}{{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}{%- for tool in tools %}{{- "\n" }}{{- tool | tojson }}{%- endfor %}{{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}{%- if messages[0]['role'] == 'system' %}{{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}{%- else %}{{- '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n' }}{%- endif %}
{%- endif %}
{%- for message in messages %}{%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}{{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}{%- elif message.role == "assistant" %}{{- '<|im_start|>' + message.role }}{%- if message.content %}{{- '\n' + message.content }}{%- endif %}{%- for tool_call in message.tool_calls %}{%- if tool_call.function is defined %}{%- set tool_call = tool_call.function %}{%- endif %}{{- '\n<tool_call>\n{"name": "' }}{{- tool_call.name }}{{- '", "arguments": ' }}{{- tool_call.arguments | tojson }}{{- '}\n</tool_call>' }}{%- endfor %}{{- '<|im_end|>\n' }}{%- elif message.role == "tool" %}{%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}{{- '<|im_start|>user' }}{%- endif %}{{- '\n<tool_response>\n' }}{{- message.content }}{{- '\n</tool_response>' }}{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}{{- '<|im_end|>\n' }}{%- endif %}{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}{{- '<|im_start|>assistant\n' }}
{%- endif %}

Note 👆: Qwen's base model does ship a chat template, but the official docs say base models are not recommended for conversations.

(role-flow diagram omitted: System establishes the initial identity; User asks / Assistant responds, either replying directly in natural language or calling a Tool and receiving its returned results)

distill tokenizer

print(distill_tokenizer.chat_template)

The distill (DeepSeek R1) chat_template does not even contain line breaks:

{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}{%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}{%- endfor %}{{bos_token}}{{ns.system_prompt}}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<|User|>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls']%}{%- if not ns.is_first %}{{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{%- set ns.is_first = true -%}{%- else %}{{'\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- endfor %}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is not none %}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '</think>' in content %}{% set content = content.split('</think>')[-1] %}{% endif %}{{'<|Assistant|>' + content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'\n<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<|Assistant|><think>\n'}}{% endif %}

Formatted:

{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}
{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}{%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}
{%- endfor %}{{bos_token}}{{ns.system_prompt}}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<|User|>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls']%}{%- if not ns.is_first %}{{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{%- set ns.is_first = true -%}{%- else %}{{'\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- endfor %}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is not none %}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '</think>' in content %}{% set content = content.split('</think>')[-1] %}{% endif %}{{'<|Assistant|>' + content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'\n<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}
{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}
{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<|Assistant|><think>\n'}}
{% endif %}

Note that a <think> is appended at the very end; there is even a Hugging Face discussion thread about this:
(screenshot of the Hugging Face discussion omitted)

qwen tokenizer

print("\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>")

Output:

# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
# default system prompt
print(qwen_chat_tokenizer.chat_template)

Output:

{%- if tools %}{{- '<|im_start|>system\n' }}{%- if messages[0]['role'] == 'system' %}{{- messages[0]['content'] }}{%- else %}{{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}{%- endif %}{{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}{%- for tool in tools %}{{- "\n" }}{{- tool | tojson }}{%- endfor %}{{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}{%- if messages[0]['role'] == 'system' %}{{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}{%- else %}{{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}{%- endif %}
{%- endif %}
{%- for message in messages %}{%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}{{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}{%- elif message.role == "assistant" %}{{- '<|im_start|>' + message.role }}{%- if message.content %}{{- '\n' + message.content }}{%- endif %}{%- for tool_call in message.tool_calls %}{%- if tool_call.function is defined %}{%- set tool_call = tool_call.function %}{%- endif %}{{- '\n<tool_call>\n{"name": "' }}{{- tool_call.name }}{{- '", "arguments": ' }}{{- tool_call.arguments | tojson }}{{- '}\n</tool_call>' }}{%- endfor %}{{- '<|im_end|>\n' }}{%- elif message.role == "tool" %}{%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}{{- '<|im_start|>user' }}{%- endif %}{{- '\n<tool_response>\n' }}{{- message.content }}{{- '\n</tool_response>' }}{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}{{- '<|im_end|>\n' }}{%- endif %}{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}{{- '<|im_start|>assistant\n' }}
{%- endif %}

Next, the reasoning model's (QwQ) template:

print(qwen_reason_tokenizer.chat_template)

Output:

{%- if tools %}{{- '<|im_start|>system\n' }}{%- if messages[0]['role'] == 'system' %}{{- messages[0]['content'] }}{%- else %}{{- 'You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step.' }}{%- endif %}{{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}{%- for tool in tools %}{{- "\n" }}{{- tool | tojson }}{%- endfor %}{{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}{%- if messages[0]['role'] == 'system' %}{{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}{%- else %}{{- '<|im_start|>system\nYou are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step.<|im_end|>\n' }}{%- endif %}
{%- endif %}
{%- for message in messages %}{%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}{{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}{%- elif message.role == "assistant" %}{{- '<|im_start|>' + message.role }}{%- if message.content %}{{- '\n' + message.content }}{%- endif %}{%- for tool_call in message.tool_calls %}{%- if tool_call.function is defined %}{%- set tool_call = tool_call.function %}{%- endif %}{{- '\n<tool_call>\n{"name": "' }}{{- tool_call.name }}{{- '", "arguments": ' }}{{- tool_call.arguments | tojson }}{{- '}\n</tool_call>' }}{%- endfor %}{{- '<|im_end|>\n' }}{%- elif message.role == "tool" %}{%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}{{- '<|im_start|>user' }}{%- endif %}{{- '\n<tool_response>\n' }}{{- message.content }}{{- '\n</tool_response>' }}{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}{{- '<|im_end|>\n' }}{%- endif %}{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}{{- '<|im_start|>assistant\n' }}
{%- endif %}

It also supports tool calls 👆

For the 1.5B model, the official guidance is not to modify the system template.

apply chat template

basic_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you?"},
    # {"role": "assistant", "content": "I'm doing great. How can I help you today?"}
]
distill_tokenizer.apply_chat_template(basic_messages, tokenize=False)

Output: '<|begin▁of▁sentence|>You are a helpful assistant.<|User|>Hello, how are you?'

# https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
distill_tokenizer.apply_chat_template(basic_messages, tokenize=False, add_generation_prompt=True)

Output: '<|begin▁of▁sentence|>You are a helpful assistant.<|User|>Hello, how are you?<|Assistant|><think>\n'

qwen_chat_tokenizer.apply_chat_template(basic_messages, tokenize=False, add_generation_prompt=True)

Output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nHello, how are you?<|im_end|>\n<|im_start|>assistant\n'

qwen_reason_tokenizer.apply_chat_template(basic_messages, tokenize=False, add_generation_prompt=True)

Output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nHello, how are you?<|im_end|>\n<|im_start|>assistant\n'

vllm inference

gsm8k_inference_test = "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?"
gt_ans = '18'
distill_tokenizer.apply_chat_template([gsm8k_inference_test], add_generation_prompt=True, tokenize=False)

Output: '<|begin▁of▁sentence|><|Assistant|><think>\n' (a raw string was passed instead of a list of {'role': ..., 'content': ...} dicts, so no role branch matches and the question text is dropped)

instruction = "Let's think step by step and output the final answer within \\boxed{}."
chat_test = [{'role': 'user', 'content': f'{gsm8k_inference_test} {instruction}'}]
chat_test
"""
[{'role': 'user','content': "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market? Let's think step by step and output the final answer within \\boxed{}."}]
"""
distill_tokenizer.apply_chat_template(chat_test, add_generation_prompt=True, tokenize=False)

输出结果:"<|begin▁of▁sentence|><|User|>Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market? Let's think step by step and output the final answer within \\boxed{}.<|Assistant|><think>\n"

prompt_ids = distill_tokenizer.apply_chat_template(chat_test, add_generation_prompt=True, tokenize=True)
from vllm import LLM, SamplingParams
llm = LLM(model='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', max_model_len=32768)
sampling_params = SamplingParams(temperature=0.6, max_tokens=32768)
response = llm.generate(prompt_token_ids=prompt_ids, sampling_params=sampling_params)[0]
print(response.outputs[0].text)

Output:

Alright, let's tackle this problem step by step. So, Janet has ducks that lay 16 eggs every day. Hmm, okay, that's a lot! She eats three eggs for breakfast every morning and bakes muffins for her friends with four eggs each day. Then, she sells the rest at the farmers' market for $2 per egg. I need to figure out how much money she makes every day from selling the eggs.

First, let me break down the information given:

1. **Duck eggs per day:** 16
2. **Eggs eaten for breakfast:** 3 per day
3. **Eggs used for muffins:** 4 per day
4. **Selling price per egg:** $2

So, the plan is to subtract the eggs Janet eats and uses for muffins from the total eggs laid each day. The remaining eggs will be what she sells, and then we can multiply that by the selling price to get her daily earnings.

Let me write this down in a more structured way:

Total eggs per day = 16
Eggs eaten for breakfast = 3
Eggs used for muffins = 4

So, the eggs available for selling = Total eggs - Eggs eaten - Eggs used for muffins

That is:

Eggs for selling = 16 - 3 - 4

Let me compute that: 16 - 3 is 13, and then 13 - 4 is 9. So, 9 eggs are left for selling.

Now, she sells each egg for $2. So, the total revenue from selling the eggs would be:

Total revenue = Eggs for selling × Selling price per egg

Which is:

Total revenue = 9 × $2

Calculating that, 9 times 2 is 18. So, she makes $18 each day from selling the eggs.

Wait, let me double-check my calculations to make sure I didn't make a mistake.

Total eggs: 16
Eggs eaten for breakfast: 3, so 16 - 3 = 13
Eggs used for muffins: 4, so 13 - 4 = 9

Yes, 9 eggs left.

9 eggs × $2 = $18. That seems right.

Alternatively, I can check by adding up the eggs used: 3 breakfast + 4 muffins = 7 eggs

So, 7 eggs are eaten or used, and 16 - 7 = 9 eggs left. Yep, same result.

So, 9 × $2 is definitely $18.

I don't think I've missed anything here. She starts the day with 16 eggs, uses 7 of them, sells the rest, and that's the amount she makes. So, the answer should be $18.

**Final Answer**
\boxed{18}
</think>

Janet's ducks lay 16 eggs per day. She eats 3 eggs for breakfast and uses 4 eggs for baking muffins. The remaining eggs are sold at the farmers' market for $2 each.

1. Total eggs per day: 16
2. Eggs eaten for breakfast: 3
3. Eggs used for muffins: 4
4. Eggs available for selling: \(16 - 3 - 4 = 9\)

The revenue from selling the remaining eggs is calculated as:
\[ 9 \text{ eggs} \times \$2 \text{ per egg} = \$18 \]

Thus, Janet makes \(\boxed{18}\) dollars every day at the farmers' market.

9 [RL4LLM] PPO workflow, a first look at OpenRLHF and veRL, and the ray distributed debugger

链接:https://github.com/chunhuizhang/llm_rl/blob/main/tutorials/r1-k1.5/infra/overall_basics.ipynb

RL4LLM roadmap

  • Start learning with trl; the framework is relatively basic and simple;
    • study GRPO in depth, reproduce R1 and the "aha moments" on top of a 1.5B model;
      • https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb (a GRPO reproduction project from around the Spring Festival)
      • along the way you also largely figure out the PPO algorithm of the RLHF stage; in terms of formulas the two mainly differ only in how the adv (advantage) is estimated;
  • Later, gradually migrate to more modern RL4LLM frameworks with more engineering and performance optimizations
    • e.g. veRL and OpenRLHF
    • starting from scratch, prefer veRL, unless the project you inherit is built on OpenRLHF;
    • veRL: 2409.19256, 3.8k stars;
      • https://github.com/Jiayi-Pan/TinyZero
      • https://github.com/agentica-project/deepscaler
      • https://github.com/Unakar/Logic-RL
    • OpenRLHF: 2405.11143, 5k stars;

The figure in the paper is very clear:
(architecture figure from the paper omitted)

TRL ppo trainer

- https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py
- make experiences
  - forward
    - queries: `[4, 56]`
    - responses: `[4, 53]` ($\pi_{\theta_{old}}$)
    - logprobs: `[4, 53]` ($\pi_{\theta_{old}}$)
    - ref_logprobs: `[4, 53]` ($\pi_{ref}$)
    - values: `[4, 53]`
    - scores: `[4]` (the last token's, for the whole query + response)
  - compute rewards (token level)
    - $r_t = r_{T} - \beta (\log\pi_\theta-\log\pi_{ref})$
    - inner loop;
    - the KL term uses the k1 approximation;
  - compute advantage & return (see the sketch after this list)
    - GAE:
      - $\delta_t=r_t+\gamma V(s_{t+1})-V(s_t)$
      - $A_t=\sum_{k=0}^T(\gamma\lambda)^k\delta_{t+k}$
    - return: advantage + value
- ppo update ($\pi_\theta$)
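
A minimal sketch (variable names and the usage are illustrative, not TRL's actual implementation) of the token-level reward shaping and GAE steps listed above:

import torch

def shape_rewards(score, logprobs, ref_logprobs, beta=0.1):
    # per-token KL penalty (k1 approximation), with the scalar reward added on the last token
    rewards = -beta * (logprobs - ref_logprobs)      # shape [T]
    rewards[-1] += score
    return rewards

def gae(rewards, values, gamma=1.0, lam=0.95):
    # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t);  A_t = sum_k (gamma*lam)^k * delta_{t+k}
    T = rewards.shape[0]
    adv = torch.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    returns = adv + values      # return = advantage + value
    return adv, returns

# toy usage with made-up per-token tensors
logprobs, ref_logprobs, values = torch.randn(5), torch.randn(5), torch.randn(5)
adv, ret = gae(shape_rewards(1.0, logprobs, ref_logprobs), values)
print(adv, ret)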

adv(advantage) estimator

  • GAE
  • GRPO
  • RLOO
  • REINFORCE++
  • ReMax

verl/trainer/ppo/ray_trainer.py/compute_advantage
verl/trainer/ppo/core_algos.py

https://verl.readthedocs.io/en/latest/examples/config.html

  • compute_gae_advantage_return
    • token_level_rewards, values
    • $A_t^{GAE}=\sum_{\ell=0}^{T-t}(\gamma\lambda)^{\ell}\delta_{t+\ell}, \quad \delta_t=r_t+\gamma V(s_{t+1})-V(s_t)$
    • return: $ret_t=V(s_t)+A_t^{GAE}$
  • compute_grpo_outcome_advantage (see the sketch below)
    • token_level_rewards
    • $A_i=\frac{r_i-\mu}{\sigma+\epsilon}$
  • compute_rloo_outcome_advantage
    • token_level_rewards
    • $A_i=R_i-\frac1{n-1}\sum_{k\neq i}R_k$
  • compute_reinforce_plus_plus_outcome_advantage
    • token_level_rewards
    • $A_t=\frac{G_t-\mu}{\sigma}, \quad G_t=\sum_{k=t}^T\gamma^{k-t}r_k$
      • return: accumulated discounted reward
  • compute_remax_outcome_advantage (Reward-Maximization with Baseline)
    • token_level_rewards, reward_baselines
    • $A_t=G_t-b, \quad G_t=\sum_{k=t}^T r_k$
      • no discounted return
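
A minimal sketch (illustrative, not verl's actual code) of the GRPO outcome advantage for one group of generations sampled from the same prompt:

import torch

def grpo_outcome_advantage(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: [G] scalar outcome rewards of one group
    mu, sigma = rewards.mean(), rewards.std()
    return (rewards - mu) / (sigma + eps)   # A_i = (r_i - mu) / (sigma + eps), broadcast to every token of o_i

rewards = torch.tensor([1.0, 0.0, 1.0, 1.0])   # e.g. 1 = correct, 0 = wrong
print(grpo_outcome_advantage(rewards))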

9 [RL4LLM] A deeper look at the PPO-clip objective (and importance sampling)

https://github.com/chunhuizhang/llm_rl/blob/main/tutorials/basics_insights_of_RL/importance_sampling.ipynb
https://github.com/chunhuizhang/llm_rl/blob/main/tutorials/basics_insights_of_RL/PPO_clip.ipynb

https://huggingface.co/blog/deep-rl-ppo

A big difference between reinforcement learning (online) and traditional supervised learning (offline) is that the training data is collected on the fly: you generate data while training the model, then use the updated model to generate more data and keep training.

import numpy as np
import matplotlib.pyplot as plt

$$PPO_{clip}=\min\big(r(\theta)A,\ \text{clip}(r(\theta), 1-\epsilon, 1+\epsilon)A\big)$$

  • policy update ratio: $r(\theta)=\frac{\pi_\theta(a|s)}{\pi_{\theta_\text{old}}(a|s)}$
  • the Advantage itself does not directly take part in the gradient computation
  • PPO's clip operation can leave a sample with no gradient, in which case that training sample contributes nothing
    • $A>0$ and $r(\theta)>1+\epsilon$: clipped to $A(1+\epsilon)$, gradient is 0
      • $r(\theta)<1-\epsilon$: the value is $Ar$, not clipped, gradient is $A$;
    • $A<0$ and $r(\theta)<1-\epsilon$: clipped to $A(1-\epsilon)$, gradient is 0

clip

  • When $A>0$ (encouraging $\pi(a_t|s_t)$, i.e. pushing the likelihood ratio up): if $r\geq 1+\epsilon$, the objective becomes $(1+\epsilon)A$ (gradient 0)
    • the objective has no gradient here, so the likelihood will not keep increasing
  • When $A<0$ (suppressing $\pi(a_t|s_t)$, i.e. pushing the likelihood ratio down): if $r\leq 1-\epsilon$, the objective becomes $(1-\epsilon)A$ (gradient 0)
  • One more question: why can't we just take $\text{clip}(r, 1-\epsilon, 1+\epsilon)A$ alone (without the min)?
    • When $A>0$ and $r<1-\epsilon$ (the ratio already deviates a lot at the start), $Ar<A(1-\epsilon)$, so the min keeps the objective at $Ar$, preserving the gradient and pushing $r$ back up
    • When $A<0$ and $r>1+\epsilon$ (the ratio already deviates a lot at the start), $Ar<A(1+\epsilon)$, so the min keeps the objective at $Ar$, preserving the gradient and pulling $r$ back down
def ppo_clip(r, A, eps=0.2):
    return np.minimum(r * A, np.clip(r, 1-eps, 1+eps) * A)

fig, axs = plt.subplots(1, 2, figsize=(14, 6))

# Set basic parameters
eps = 0.2
r_values = np.linspace(0.5, 1.5, 1000)  # r uniformly distributed from 0.5 to 1.5

# First case: A > 0 (A = 1)
A = 1
ppo_values = ppo_clip(r_values, A, eps)
original_values = r_values * A  # Original policy gradient objective

axs[0].plot(r_values, ppo_values, 'b-', linewidth=2, label='PPO-CLIP')
axs[0].plot(r_values, original_values, 'r--', linewidth=2, label='r*A')
# Draw clipping boundaries
axs[0].axvline(x=1-eps, color='g', linestyle=':', label='Clip boundaries (1±ε)')
axs[0].axvline(x=1+eps, color='g', linestyle=':')
axs[0].set_title('When A > 0 (A = 1)')
axs[0].set_xlabel('Policy Ratio r')
axs[0].set_ylabel('Objective Value')
axs[0].legend()
axs[0].grid(True)

# Second case: A < 0 (A = -1)
A = -1
ppo_values = ppo_clip(r_values, A, eps)
original_values = r_values * A  # Original policy gradient objective

axs[1].plot(r_values, ppo_values, 'b-', linewidth=2, label='PPO-CLIP')
axs[1].plot(r_values, original_values, 'r--', linewidth=2, label='r*A')
# Draw clipping boundaries
axs[1].axvline(x=1-eps, color='g', linestyle=':', label='Clip boundaries (1±ε)')
axs[1].axvline(x=1+eps, color='g', linestyle=':')
axs[1].set_title('When A < 0 (A = -1)')
axs[1].set_xlabel('Policy Ratio r')
axs[1].set_ylabel('Objective Value')
axs[1].legend()
axs[1].grid(True)

(plot: PPO-clip objective vs. policy ratio for A > 0 and A < 0)


Expectation computation

$$\mathbb E_{x\sim p}[f(x)]=\int p(x)f(x)dx=\int q(x)\frac{p(x)}{q(x)}f(x)dx=\mathbb E_{x\sim q}\left[\frac{p(x)}{q(x)}f(x)\right]$$

  • Importance sampling (IS) is a Monte Carlo technique for the approximation of intractable distributions and integrals with respect to them.

    • IS was originally introduced for the case where it is hard to sample $x\sim p(x)$ directly but easy to sample $x\sim q(x)$ (a distribution we design and choose ourselves)
    • https://allenwind.github.io/blog/10466/
  • Equal means do not imply equal variances;

    • $Var_{x\sim p}[f]=E_{x\sim p}[f^2] - \dots$
    • $Var_{x\sim q}[\frac pq f]=E_{x\sim q}\left[\left(\frac{p}{q}f\right)^2\right] - \dots=E_{x\sim p}[\frac pq f^2]-\dots$
    • if $\frac pq$ deviates a lot from 1, the variance of the latter becomes very large;
  • $x\sim q$: sampling

  • $w_n=\frac{p(x_n)}{q(x_n)}$: importance weight

  • In RL4LLM training, introducing importance sampling lets an on-policy algorithm (which has low data efficiency) become relatively off-policy

    • Consider the following policy gradient
      $$\begin{split} &= E_{(s_t, a_t) \sim \pi_\theta} [A^\theta(s_t, a_t) \nabla \log p_\theta(a_t^n | s_t^n)]\\ &= E_{(s_t, a_t) \sim \pi_{\theta'}} \left[ \frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)} A^{\theta'}(s_t, a_t) \nabla \log p_\theta(a_t^n | s_t^n) \right] \end{split}$$
    • the generated samples can then be used to update the policy multiple times;
    • importance sampling needs a large number of samples before it becomes a reliable substitute
      $$\mathcal{J}_{PPO}(\theta) = \mathbb{E}_{q \sim P(Q), o \sim \pi_{\theta_{old}}(O|q)} \left[ \frac{1}{|o|} \sum_{t=1}^{|o|} \min \left( \frac{\pi_{\theta}(o_t|q, o_{<t})}{\pi_{\theta_{old}}(o_t|q, o_{<t})} A_t, \text{clip} \left( \frac{\pi_{\theta}(o_t|q, o_{<t})}{\pi_{\theta_{old}}(o_t|q, o_{<t})}, 1-\epsilon, 1+\epsilon \right) A_t \right) \right]$$
  • https://zhuanlan.zhihu.com/p/17657567877

    • First, two important parameters in RLHF algorithms: rollout_batch_size and train_batch_size. The former is how many training samples (responses and rewards) are generated in one go; the latter is how many samples are used per model update; the former is N times the latter.
    • As training frameworks keep being optimized, RLHF training data is no longer that hard to produce, especially with frameworks like OpenRLHF that use vllm to generate responses very efficiently. We can simply set N = 1 / 2 / 4, very small values, and use each training sample only once. In fact, since importance sampling needs a large number of samples before it becomes a reliable substitute, N really cannot be large; the larger it is, the more easily training collapses. (A toy sketch of the rollout/train batch relationship follows below.)
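
As a toy illustration of the rollout/train batch relationship described above (all numbers and names are made up, not tied to any particular framework):

# One rollout batch of generated samples is reused as N = rollout_batch_size / train_batch_size minibatches.
rollout_batch_size = 8
train_batch_size = 4
N = rollout_batch_size // train_batch_size  # 2 policy updates per rollout batch

rollouts = [f"sample_{i}" for i in range(rollout_batch_size)]  # stand-ins for (prompt, response, reward) tuples
for step in range(N):
    minibatch = rollouts[step * train_batch_size:(step + 1) * train_batch_size]
    print(f"update {step}: {minibatch}")
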
import numpy as np
import matplotlib.pyplot as plt
from math import sqrt, pi, exp
np.random.seed(1234)

mu_p, sigma_p = -0.5, 0.5

def p(x):
    return 1/(sqrt(2*pi)*sigma_p) * exp(-((x - mu_p)**2)/(2*sigma_p**2))

def q(x, mu, sigma):
    return 1/(sqrt(2*pi)*sigma) * exp(-((x - mu)**2)/(2*sigma**2))

def f(x):
    return 1 / (1 + exp(-x)) - 0.5

# Define several different q distributions with varying distances from p
q_params = [
    (0.0, 0.6, "q1: $\mu=0.0, \sigma=0.6$"),   # Closer to p
    (1.5, 0.8, "q2: $\mu=1.5, \sigma=0.8$"),   # Original q
    (2.5, 1.0, "q3: $\mu=2.5, \sigma=1.0$"),   # Further from p
]

xs = np.linspace(-3, 5, 300)
pxs = [p(x) for x in xs]
fxs = [f(x) for x in xs]

plt.figure(figsize=(10,5))
plt.plot(xs, pxs, label='$p(x)$', color='blue')
for mu, sigma, label in q_params:
    qxs = [q(x, mu, sigma) for x in xs]
    plt.plot(xs, qxs, label=label, linestyle='--')
plt.plot(xs, fxs, label='f(x)', color='red')
plt.ylim(-0.5, 1)
plt.legend()
plt.title('Importance Sampling: Distributions')
plt.xlabel('x')
plt.ylabel('Density/Value')
plt.grid(alpha=0.3)

(plot: p(x), the three q(x) candidates, and f(x))

# Calculate ground truth by direct sampling from p
samples = np.random.normal(loc=mu_p, scale=sigma_p, size=1000000)
mean_fp = np.mean([f(x) for x in samples])
print(f'Ground truth expectation under p(x): {mean_fp:.6f}')
# Ground truth expectation under p(x): -0.116002
# Importance sampling with varying sample sizes
sample_sizes = [10, 100, 1000, 10000, 100000]
results = {label: [] for _, _, label in q_params}
true_value = mean_fp

plt.figure(figsize=(10,6))
for mu, sigma, label in q_params:
    for size in sample_sizes:
        samples = np.random.normal(loc=mu, scale=sigma, size=size)
        weights = np.array([p(x) / q(x, mu, sigma) for x in samples])
        mean_is = np.mean(weights * np.array([f(x) for x in samples]))
        results[label].append(mean_is)
        print(f"Distribution {label}, Sample size {size}: {mean_is:.6f}")
    plt.plot(sample_sizes, results[label], 'o-', label=label)

plt.axhline(y=true_value, color='r', linestyle='-', label='True expectation')
plt.xscale('log')
plt.grid(True, which="both", ls="--", alpha=0.3)
plt.xlabel('Number of Samples')
plt.ylabel('Estimated Expectation')
plt.title('Importance Sampling Estimates vs. Sample Size')
plt.legend()
plt.tight_layout()

Output:

Distribution q1: $\mu=0.0, \sigma=0.6$, Sample size 10: -0.084525
Distribution q1: $\mu=0.0, \sigma=0.6$, Sample size 100: -0.092245
Distribution q1: $\mu=0.0, \sigma=0.6$, Sample size 1000: -0.113189
Distribution q1: $\mu=0.0, \sigma=0.6$, Sample size 10000: -0.114711
Distribution q1: $\mu=0.0, \sigma=0.6$, Sample size 100000: -0.115185
Distribution q2: $\mu=1.5, \sigma=0.8$, Sample size 10: 0.025215
Distribution q2: $\mu=1.5, \sigma=0.8$, Sample size 100: 0.007397
Distribution q2: $\mu=1.5, \sigma=0.8$, Sample size 1000: -0.083480
Distribution q2: $\mu=1.5, \sigma=0.8$, Sample size 10000: -0.110500
Distribution q2: $\mu=1.5, \sigma=0.8$, Sample size 100000: -0.114701
Distribution q3: $\mu=2.5, \sigma=1.0$, Sample size 10: 0.000679
Distribution q3: $\mu=2.5, \sigma=1.0$, Sample size 100: 0.000508
Distribution q3: $\mu=2.5, \sigma=1.0$, Sample size 1000: -0.019807
Distribution q3: $\mu=2.5, \sigma=1.0$, Sample size 10000: -0.088690
Distribution q3: $\mu=2.5, \sigma=1.0$, Sample size 100000: -0.124670

(plot: importance-sampling estimates vs. sample size)


10 [RL4LLM] GRPO loss/objective analysis and possible biases (DAPO, Dr. GRPO)

  • video: https://www.bilibili.com/video/BV1LgXbY5EFD
  • code: https://github.com/chunhuizhang/llm_rl/blob/main/tutorials/r1-k1.5/grpo_loss.ipynb

Recently, groups from Tsinghua and Singapore proposed DAPO, Dr. GRPO, and the like:

  • DAPO:
    • https://dapo-sia.github.io
  • Dr. GRPO
    • https://github.com/sail-sg/understand-r1-zero

1 Misconception: zero loss means nothing to optimize

loss = 0

Why can backpropagation still update the parameters when the loss is 0?

  • loss being 0 does not mean the gradient is 0
    • take $f(w)=(w-1)^2-1$: at $w=0$ we have $f(w)=0$, yet the gradient is $-2$ (see the quick check below)
      • gradient × learning rate is what learning actually is;
    • $w-\eta\cdot g=0-(0.1\times -2)=0.2$
  • so the loss is no longer a good metric to monitor — the reward is
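
A quick autograd check of the $f(w)=(w-1)^2-1$ example above:

import torch

w = torch.tensor(0.0, requires_grad=True)
loss = (w - 1) ** 2 - 1      # loss = 0 at w = 0
loss.backward()
print(loss.item(), w.grad.item())   # 0.0, -2.0; one SGD step: w = 0 - 0.1 * (-2) = 0.2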

In practice, though, losses are usually designed with zero as a lower bound; even when regularization terms are added, each component of the loss is essentially non-negative.

A classic thread about GRPO:

  • https://github.com/huggingface/trl/issues/2608#issuecomment-2609844003

(screenshot of the GitHub comment omitted)

Note the second equation above, the GRPO loss: $\pi$ divided by $\pi$ (with the denominator detached) — isn't that just 1? Not quite; see the example below:

import torch

# Case 1: x - x (gradient is 0)
x = torch.tensor([3.0], requires_grad=True)
y1 = x - x
y1.backward()  # backprop to compute the gradient
print("Gradient for x - x:", x.grad.item())  # prints 0.0

# clear the gradient before the next example
x.grad.zero_()

# Case 2: x - x.detach() (gradient is 1)
y2 = x - x.detach()  # detach the second x so it is treated as a constant
y2.backward()  # backprop to compute the gradient
print("Gradient for x - x.detach():", x.grad.item())  # prints 1.0

This is the key point about the loss: detach means no gradient flows through that term (you can think of it as a constant). So although the ratio looks like 1 numerically, it is not a plain 1 for autograd — the denominator is merely a constant.

loss = $\beta\,KL$

GitHub Issue: Why does the loss start at 0 when I train GRPO, and then possibly increase?

(screenshot of the issue omitted)

  • This is another issue thread

  • trl grpo

    • $\beta = 0.04$ (default, GRPOConfig)
    • this value is actually fairly large; for math people use something like 0.001??
  • Setting the KL term aside

    • one prompt yields multiple generations (forming one group)
      • each generation's loss = -advantage (likelihood ratio = 1, since $\pi_\theta=\pi_{\theta_{old}}$)
    • a group's mean loss = - mean advantage = 0
    • note that the $\mathcal J$ in the figure is written for gradient ascent; as losses, the sums are missing a leading minus sign
  • where the KL term is placed

    • inside the reward used when computing the advantage
    • or outside, directly in the loss
    • the original GRPO formulation places it outside;
      • the GRPO implementation does not include the KL-divergence as part of the reward function. Instead, it directly incorporates the KL-divergence into the loss function, arguing that this approach simplifies the computation and avoids unnecessary complexity.

(screenshot of the GRPO objective omitted; the formula is reproduced below)
$$\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{q \sim P(Q), \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}}(O|q)} \left[ \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min \left( \frac{\pi_\theta(o_{i,t}|q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q, o_{i,<t})} \hat{A}_{i,t}, \text{clip} \left( \frac{\pi_\theta(o_{i,t}|q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q, o_{i,<t})}, 1-\varepsilon, 1+\varepsilon \right) \hat{A}_{i,t} \right) - \beta D_{KL} (\pi_\theta \| \pi_{ref}) \right]$$

  • first averaging the losses by token within each sample and then aggregating the losses across samples.
    • each sample is assigned an equal weight in the final loss computation
    • compare with DAPO's equation (12)
  • If you are using the GRPO trainer then the old policy is in effect updated every step, this means you just use a detached version of the current policy.
    • the $\pi_{\theta_{old}}$ in the formula is the detached version of $\pi_\theta$ (outside the computation graph, treated as a constant);
    • $r=\frac{\pi_\theta}{\pi_{\theta_{old}}}=1$,
    • $\text{clip}(1, 1-\epsilon, 1+\epsilon)=1$
  • $\hat A_{i,t}=\tilde r_i=\frac{r_i-\mu}{\sigma}$ (z-score) (the token-level adv is the group z-score of the output-level reward)

$$\begin{split} \mathcal{J}_{GRPO}(\theta)&= \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min \left( \frac{\pi_\theta(o_{i,t}|q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q, o_{i,<t})} \hat{A}_{i,t}, \text{clip} \left( \frac{\pi_\theta(o_{i,t}|q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q, o_{i,<t})}, 1-\varepsilon, 1+\varepsilon \right) \hat{A}_{i,t} \right) - \beta D_{KL} (\pi_\theta \| \pi_{ref}) \\ &=\frac1G\sum_i^G\frac1{|o_i|}\sum_{t=1}^{|o_i|}\hat A_{i,t} -\frac1G\sum_{i=1}^G\frac1{|o_i|}\sum_{t=1}^{|o_i|}\beta D_{KL}[\pi_\theta\|\pi_{ref}]\\ &=\frac1G\sum_i^G\frac1{|o_i|}\sum_{t=1}^{|o_i|}\hat A_i -\frac1G\sum_{i=1}^G\frac1{|o_i|}\sum_{t=1}^{|o_i|}\beta D_{KL}[\pi_\theta\|\pi_{ref}]\\ &=\frac1G\sum_i^G\frac1{|o_i|}\,|o_i|\cdot \hat A_i -\frac1G\sum_{i=1}^G\frac1{|o_i|}\sum_{t=1}^{|o_i|}\beta D_{KL}[\pi_\theta\|\pi_{ref}]\\ &=\frac1G\sum_i^G\hat A_i-\frac1G\sum_{i=1}^G\frac1{|o_i|}\sum_{t=1}^{|o_i|}\beta D_{KL}[\pi_\theta\|\pi_{ref}]\\ &=\frac1G\sum_i^G\frac{r_i-\mu}{\sigma}-\frac1G\sum_{i=1}^G\frac1{|o_i|}\sum_{t=1}^{|o_i|}\beta D_{KL}[\pi_\theta\|\pi_{ref}]\\ &=\frac{\sum_i r_i-G\mu}{G\sigma}-\frac1G\sum_{i=1}^G\frac1{|o_i|}\sum_{t=1}^{|o_i|}\beta D_{KL}[\pi_\theta\|\pi_{ref}]\\ &= 0 -\frac1G\sum_{i=1}^G\frac1{|o_i|}\sum_{t=1}^{|o_i|}\beta D_{KL}[\pi_\theta\|\pi_{ref}]\\ &=-\frac1G\sum_{i=1}^G\frac1{|o_i|}\sum_{t=1}^{|o_i|}\beta D_{KL}[\pi_\theta\|\pi_{ref}] \end{split}$$

So the coefficient in front of the advantage ($\pi/\pi$) evaluates numerically to 1, and the objective above reduces to $-\beta\,KL$ (i.e. the reported loss is just the $\beta\,KL$ term) — this conclusion is important.

A growing KL divergence means the model is exploring directions that deviate from its original behavior in order to improve, a bit like jumping out of a local optimum in simulated annealing. This can be a good thing, but the KL should not get too large.

One more note on how the policy KL divergence is computed (eq. 4 in the DeepSeekMath paper):

$$\mathbb{D}_{KL}[\pi_\theta\|\pi_{ref}]=\frac{\pi_{ref}(o_{i,t}|q,o_{i,<t})}{\pi_\theta(o_{i,t}|q,o_{i,<t})}-\log\frac{\pi_{ref}(o_{i,t}|q,o_{i,<t})}{\pi_\theta(o_{i,t}|q,o_{i,<t})}-1$$
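
A minimal sketch of this per-token KL estimator, assuming the per-token log-probs of the sampled tokens are already gathered (names illustrative):

import torch

def k3_kl(logprobs: torch.Tensor, ref_logprobs: torch.Tensor) -> torch.Tensor:
    # per-token log(pi_ref / pi_theta)
    log_ratio = ref_logprobs - logprobs
    # estimator: exp(log_ratio) - log_ratio - 1, which is always >= 0
    return torch.exp(log_ratio) - log_ratio - 1

# placeholder per-token log-probs, shape [batch, seq_len]
logprobs = torch.log(torch.rand(2, 5))
ref_logprobs = torch.log(torch.rand(2, 5))
print(k3_kl(logprobs, ref_logprobs).mean())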

What about GRPO's gradient?

Recall the identity commonly used in policy gradients:
$$\nabla f(x)=f(x)\nabla\log f(x)$$

This identity obviously holds, since $\nabla \log f(x)=\frac{\nabla f(x)}{f(x)}$; writing it this way is purely for convenience in implementation and in derivations.

(screenshot omitted)

  • For example for GRPO, if all outputs $\{o_i\}_{i=1}^G$ of a particular prompt are correct and receive the same reward 1, the resulting advantage for this group is zero. A zero advantage results in no gradients for policy
    updates, thereby reducing sample efficiency. (A property of GRPO: when the advantage is 0, i.e. the group is all correct or all wrong, no effective gradient is produced and that round makes no update at all; one of DAPO's innovations below addresses exactly this.)
  • The DeepSeekMath discussion section argues that all PG losses can be unified under a single paradigm, whether PPO, GRPO, or whatever comes next.

Token-level PG loss (i.e. DAPO)

  • grpo: generation-level loss; dapo: token-level pg loss (see the sketch after this list)
    • grpo: first average within each generation, then average at the group level
    • dapo: average over all tokens of all generations in the group
  • ga (gradient accumulation)
    • https://unsloth.ai/blog/gradient
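
A minimal sketch (placeholder losses and masks; not any framework's actual code) contrasting the two aggregation schemes:

import torch

def grpo_agg(tok_loss, mask):
    # generation-level: mean over each generation's tokens, then mean over the group
    per_sample = (tok_loss * mask).sum(-1) / mask.sum(-1)
    return per_sample.mean()

def dapo_agg(tok_loss, mask):
    # token-level: mean over all valid tokens of the whole group at once
    return (tok_loss * mask).sum() / mask.sum()

tok_loss = torch.randn(4, 8)             # [group_size, max_len], placeholder per-token losses
lengths = torch.tensor([8, 3, 5, 2])     # generations of unequal length
mask = (torch.arange(8)[None, :] < lengths[:, None]).float()
print(grpo_agg(tok_loss, mask), dapo_agg(tok_loss, mask))   # the two generally differ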

One of GRPO's biases:

  • if the advantage > 0 (the model answered correctly), it favors shorter answers
  • if the advantage < 0 (the model answered incorrectly), it favors longer answers
  • questions with a lower reward standard deviation get higher weight during updates; low std means an easy question (9 out of 10 rollouts correct), high std means unstable answers. This is also undesirable, since the hard questions are the more important ones.

2 On Dr. GRPO

$$A_i=R_i-\frac1N\sum_{j=1}^N R_j$$

  • let $R_i=\theta+\epsilon_i$; substituting into the above gives
    • $A_i=\theta+\epsilon_i-\frac1N\sum_j (\theta+\epsilon_j)=\epsilon_i-\frac1N\sum_j \epsilon_j$

$$\begin{split} \mathbb E[A_i|\epsilon_i]&=\mathbb E \left[\epsilon_i - \frac1N\sum_j\epsilon_j \,\Big|\, \epsilon_i\right]\\ &=\epsilon_i - \frac1N\epsilon_i-\frac1N\sum_{j\neq i} 0\\ &=\frac{N-1}N\epsilon_i \end{split}$$

3 per_device_train_batch_size & num_generations

https://github.com/huggingface/trl/pull/2776

  • (num_processes * per_device_batch_size) must be divisible by G.

    • per_device_batch_size describes the number of generations per GPU device
    • num_processes is the number of GPU processes;
    • num_processes * per_device_batch_size / G: prompt throughput (see the sketch below)
  • https://github.com/huggingface/trl/blob/main/trl/trainer/grpo_trainer.py#L571-L598

    • ensures each prompt is repeated across multiple processes. This guarantees that identical prompts are distributed to different GPUs, allowing rewards to be computed and normalized correctly within each prompt group. Using the same seed across processes ensures consistent prompt assignment, preventing discrepancies in group formation.
    • repeats the batch multiple times to allow reusing generations across multiple updates. Refer to _prepare_inputs to see how the generations are stored and reused.
    • In the following figure, the values are the prompt indices. The first row shows the first sampled batch, the
      second row shows the second sampled batch, and so on.
    • 3 GPUs, num_generations = 3, per_device_train_batch_size = 4
      • 3*4 / 3 = 4 prompts per step
             GPU0   GPU1   GPU2
        P0   P00    P01    P02
        P1   P10    P11    P12
        P2   P20    P21    P22
        P3   P30    P31    P32
    • it additionally accounts for grad_accum = 3: batch forwards are accumulated and a single backward is done
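
A tiny arithmetic sketch of the divisibility constraint above, using the numbers from this example:

num_processes = 3                 # GPUs
per_device_train_batch_size = 4   # generations held per GPU
num_generations = 3               # G: generations per prompt (group size)

global_batch = num_processes * per_device_train_batch_size   # 12 generations per step
assert global_batch % num_generations == 0                    # must be divisible by G
prompts_per_step = global_batch // num_generations            # 4 unique prompts (P0..P3)
print(prompts_per_step)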

For now there are still plenty of points of contention; in the end, people will simply use whichever variant works.


11 [RL4LLM] The deepseek v3 tool-calling bug and understanding function calling in the chat_template

video: https://www.bilibili.com/video/BV1dsdWYuEXw
code: https://github.com/chunhuizhang/llm_rl/blob/main/tutorials/tokenizer/v3.ipynb

The issue linked below describes a bug where v3 repeatedly calls tools: in a weather-tool example, even when the messages already mark the tool as having been called, v3 still asks for the weather again.

  • https://github.com/deepseek-ai/DeepSeek-V3/issues/15
  • deepseek v3 (0324): “Increased accuracy in Function Calling, fixing issues from previous V3 versions”
    • https://huggingface.co/deepseek-ai/DeepSeek-V3-0324
    • repetitive function call
  • understanding tool use / function calling from the token / chat_template perspective, both for use (inference) and for training
    • System prompt: which tools are available, what their parameters are, ...
    • User prompt: What's the weather like today in New York?
    • <tool>get_current_temperature(location='New York, NY', format='F')</tool><output>73 degrees Fahrenheit</output>

This is an example of tool use; training data is constructed in essentially the same way.

(screenshot of the tool-use example omitted)

from transformers import AutoTokenizer
import re
import torch

model_id = 'deepseek-ai/DeepSeek-V3'
model_id_0324 = 'deepseek-ai/DeepSeek-V3-0324'

T1 = AutoTokenizer.from_pretrained(model_id)
T2 = AutoTokenizer.from_pretrained(model_id_0324)

Note: v3-0324 is the improved version; the official docs say it fixes the repeated tool-call bug.

In other words, in the code below T1 is the buggy version and T2 is the fixed one; we compare them to see where the improvement is.

v3 chat template

T1.chat_template is shown below:

{# 设置默认变量 #}
{% if add_generation_prompt is not defined %}{% set add_generation_prompt = false %}
{% endif %}{# 定义命名空间变量 #}
{% set ns = namespace(is_first=false,is_tool=false,is_output_first=true,system_prompt='',is_first_sp=true
) %}{# 拼接 system prompt #}
{% for message in messages %}{% if message['role'] == 'system' %}{% if ns.is_first_sp %}{% set ns.system_prompt = ns.system_prompt + message['content'] %}{% set ns.is_first_sp = false %}{% else %}{% set ns.system_prompt = ns.system_prompt + '\n' + message['content'] %}{% endif %}{% endif %}
{% endfor %}{{ bos_token }}{{ ns.system_prompt }}{# 遍历消息内容 #}
{% for message in messages %}{# 用户消息处理 #}{% if message['role'] == 'user' %}{% set ns.is_tool = false %}{{ '<|User|>' + message['content'] }}{# 助手消息(带工具调用) #}{% elif message['role'] == 'assistant' and message['content'] is none %}{% set ns.is_tool = false %}{% for tool in message['tool_calls'] %}{% if not ns.is_first %}{{ '<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n```json\n' + tool['function']['arguments'] + '\n```<|tool▁call▁end|>' }}{% set ns.is_first = true %}{% else %}{{ '\n<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n```json\n' + tool['function']['arguments'] + '\n```<|tool▁call▁end|>' }}{{ '<|tool▁calls▁end|><|end▁of▁sentence|>' }}{% endif %}{% endfor %}{# 助手正常回复内容 #}{% elif message['role'] == 'assistant' and message['content'] is not none %}{% if ns.is_tool %}{{ '<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>' }}{% set ns.is_tool = false %}{% else %}{{ '<|Assistant|>' + message['content'] + '<|end▁of▁sentence|>' }}{% endif %}{# 工具输出处理 #}{% elif message['role'] == 'tool' %}{% set ns.is_tool = true %}{% if ns.is_output_first %}{{ '<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>' }}{% set ns.is_output_first = false %}{% else %}{{ '\n<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>' }}{% endif %}{% endif %}{% endfor %}{# 工具输出结尾处理 #}
{% if ns.is_tool %}{{ '<|tool▁outputs▁end|>' }}
{% endif %}{# 生成助手响应起始标记 #}
{% if add_generation_prompt and not ns.is_tool %}{{ '<|Assistant|>' }}
{% endif %}

Expressed as a simplified flow diagram:

Initialize variables
│
├── Collect the system prompt
│
├── Iterate over messages:
│   ├── system → append to the system prompt
│   ├── user → add <|User|>
│   ├── assistant:
│   │   ├── if it calls a tool → emit a tool_call block
│   │   └── otherwise → add <|Assistant|>
│   └── tool → emit a tool_output block
│
└── Finally, decide whether a trailing <|Assistant|> is needed

The assistant's reply text lives inside the <|Assistant|> tag, while the tool results appear between <|tool▁outputs▁begin|>...<|tool▁outputs▁end|>. The 0324 version adds a tool_call block, so the tool call itself becomes part of the assistant turn; the flow diagram below makes this clear.

v3-0324 chat template

Now look at T2.chat_template:

{# 设置默认值 #}
{% if add_generation_prompt is not defined %}{% set add_generation_prompt = false %}
{% endif %}{# 初始化状态变量 #}
{% set ns = namespace(is_first=false,is_tool=false,is_output_first=true,system_prompt='',is_first_sp=true,is_last_user=false
) %}{# 拼接所有 system prompt #}
{% for message in messages %}{% if message['role'] == 'system' %}{% if ns.is_first_sp %}{% set ns.system_prompt = ns.system_prompt + message['content'] %}{% set ns.is_first_sp = false %}{% else %}{% set ns.system_prompt = ns.system_prompt + '\n' + message['content'] %}{% endif %}{% endif %}
{% endfor %}{{ bos_token }}{{ ns.system_prompt }}{# 遍历所有消息 #}
{% for message in messages %}{# 处理用户消息 #}{% if message['role'] == 'user' %}{% set ns.is_tool = false %}{% set ns.is_first = false %}{% set ns.is_last_user = true %}{{ '<|User|>' + message['content'] + '<|Assistant|>' }}{# 处理 Assistant 调用工具的情况 #}{% elif message['role'] == 'assistant' and message['tool_calls'] is defined and message['tool_calls'] is not none %}{% set ns.is_last_user = false %}{% if ns.is_tool %}{{ '<|tool▁outputs▁end|>' }}{% endif %}{% set ns.is_first = false %}{% set ns.is_tool = false %}{% set ns.is_output_first = true %}{% for tool in message['tool_calls'] %}{% set tool_call_str = '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n```json\n' + tool['function']['arguments'] + '\n```<|tool▁call▁end|>' %}{% if not ns.is_first %}{% if message['content'] is none %}{{ '<|tool▁calls▁begin|>' + tool_call_str }}{% else %}{{ message['content'] + '<|tool▁calls▁begin|>' + tool_call_str }}{% endif %}{% set ns.is_first = true %}{% else %}{{ '\n' + tool_call_str }}{% endif %}{% endfor %}{{ '<|tool▁calls▁end|><|end▁of▁sentence|>' }}{# Assistant 正常回复内容(无工具调用) #}{% elif message['role'] == 'assistant' %}{% set ns.is_last_user = false %}{% if ns.is_tool %}{{ '<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>' }}{% set ns.is_tool = false %}{% else %}{{ message['content'] + '<|end▁of▁sentence|>' }}{% endif %}{# 工具的输出内容 #}{% elif message['role'] == 'tool' %}{% set ns.is_last_user = false %}{% set ns.is_tool = true %}{% if ns.is_output_first %}{{ '<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>' }}{% set ns.is_output_first = false %}{% else %}{{ '\n<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>' }}{% endif %}{% endif %}{% endfor %}{# 如果有残留的 tool 输出状态,则收尾结束 #}
{% if ns.is_tool %}{{ '<|tool▁outputs▁end|>' }}
{% endif %}{# 最终是否生成 Assistant 提示起始符 #}
{% if add_generation_prompt and not ns.is_last_user and not ns.is_tool %}{{ '<|Assistant|>' }}
{% endif %}
Initialize variables (adds is_last_user, etc.)
│
├── Collect the system prompt
│
├── Iterate over messages:
│   ├── system → append to the system prompt
│   ├── user → add <|User|> + content + <|Assistant|>, set is_last_user=True
│   ├── assistant:
│   │   ├── if there is a tool_call:
│   │   │   └── check whether content is present (finer-grained handling)
│   │   └── plain content → emit the content
│   └── tool:
│       └── chain multiple tool_outputs and close them properly
│
└── If add_generation_prompt is set, the last message is not a user turn, and no tool output is pending → add <|Assistant|>

`apply_chat_template`

Set up a list of messages:

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather in Paris?"},
    {
        "role": "assistant",
        # "content": "Let me check the weather for you.",
        "content": "",
        "tool_calls": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": '{ "location": "Paris" }'
                }
            }
        ]
    },
    {
        "role": "tool",
        "content": '{ "temperature": "15C", "condition": "Sunny" }',
        "tool_call_id": "call_1"
    },
    {"role": "assistant", "content": "It's 15°C and sunny in Paris right now."}
]

Then call T1: T1.apply_chat_template(messages, tokenize=False)

Output:

<|begin▁of▁sentence|>You are a helpful assistant.
<|User|>What\'s the weather in Paris?
<|Assistant|>Let me check the weather for you.<|end▁of▁sentence|>
<|tool▁outputs▁begin|><|tool▁output▁begin|>{ "temperature": "15C", "condition": "Sunny" }<|tool▁output▁end|>
<|tool▁outputs▁end|>It\'s 15°C and sunny in Paris right now.<|end▁of▁sentence|>

Note that T1 only emits tool_outputs, with no tool_call, whereas T2 below additionally emits the tool_call block — this is exactly why v3 suffered from the repeated-tool-call bug.

Likewise: T2.apply_chat_template(messages, tokenize=False)

Output:

<|begin▁of▁sentence|>You are a helpful assistant.
<|User|>What\'s the weather in Paris?
<|Assistant|>Let me check the weather for you.
<|tool▁calls▁begin|><|tool▁call▁begin|>function<|tool▁sep|>get_weather\n```json\n{ "location": "Paris" }\n```<|tool▁call▁end|>
<|tool▁calls▁end|><|end▁of▁sentence|>
<|tool▁outputs▁begin|><|tool▁output▁begin|>{ "temperature": "15C", "condition": "Sunny" }<|tool▁output▁end|>
<|tool▁outputs▁end|>It\'s 15°C and sunny in Paris right now.<|end▁of▁sentence|>
  • Two highlights
    • the v3 chat template drops the tool_call part when parsing messages
    • tool_call and tool_output belong together and are emitted as part of the <|Assistant|> output

(screenshot omitted)

Experiments show that even when the assistant message's content is set to the empty string, T1's apply_chat_template output still contains no tool_call — it is still broken.

Looking back at this part of the v3 chat template:

  {# 助手消息(带工具调用) #}{% elif message['role'] == 'assistant' and message['content'] is none %}{% set ns.is_tool = false %}{% for tool in message['tool_calls'] %}{% if not ns.is_first %}{{ '<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n```json\n' + tool['function']['arguments'] + '\n```<|tool▁call▁end|>' }}{% set ns.is_first = true %}{% else %}{{ '\n<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n```json\n' + tool['function']['arguments'] + '\n```<|tool▁call▁end|>' }}{{ '<|tool▁calls▁end|><|end▁of▁sentence|>' }}{% endif %}{% endfor %}{# 助手正常回复内容 #}{% elif message['role'] == 'assistant' and message['content'] is not none %}{% if ns.is_tool %}{{ '<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>' }}{% set ns.is_tool = false %}{% else %}{{ '<|Assistant|>' + message['content'] + '<|end▁of▁sentence|>' }}{% endif %}

In other words, the tool_call branch only fires when message['content'] is none; an empty string "" is not none, so the run falls into the "content is not none" branch and the tool_call block is never emitted — this conditional branching is what makes the template so odd.
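
A quick Jinja2 check (a standalone sketch mirroring the template's condition, not the actual v3 template) of why an empty string bypasses the tool_call branch:

from jinja2 import Template

t = Template("{% if c is none %}TOOL_CALL_BRANCH{% else %}CONTENT_BRANCH{% endif %}")
print(t.render(c=None))  # TOOL_CALL_BRANCH
print(t.render(c=""))    # CONTENT_BRANCH -> an empty string still skips the tool_call branch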


苍穹外卖总结

苍穹外卖学习知识点 整体概括: 学到目前(day10),总体最核心的部分就是CURD各种数据,因为一些接口,前端页面都已经设计好,在实际开发中也应该是这样,重点是在每个不同的业务板块区别出细微不同的业务逻辑 Swagger注解 swagger是一种自动生成接口文档的插件 使用注解,就可以…...

python学智能算法(九)|决策树深入理解

【1】引言 前序学习进程中&#xff0c;初步理解了决策树的各个组成部分&#xff0c;此时将对决策树做整体解读&#xff0c;以期实现深入理解。 各个部分的解读文章链接为&#xff1a; python学智能算法&#xff08;八&#xff09;|决策树-CSDN博客 【2】代码 【2.1】完整代…...

HTTP代理:内容分发战场上的「隐形指挥官」

目录 一、技术本质&#xff1a;流量博弈中的「规则改写者」 二、战略价值&#xff1a;内容分发的「四维升级」 三、实战案例&#xff1a;代理技术的「降维打击」 四、未来进化&#xff1a;代理技术的「认知升级」 五、结语&#xff1a;代理技术的「战略觉醒」 在数字内容爆…...

学习笔记(C++篇)--- Day2

1.类的定义 1.1 类的格式 ①class为类的关键字 ②在类的内容中还可以写函数&#xff0c;具体格式请看示例。 ③为了区分成员变量&#xff0c;一般习惯上成员变量会加一个特殊标识&#xff08;如成员变量前面或者后面加_ 或者 m开头&#xff0c;注意C中这个并不是强制的&#x…...

下载firefox.tar.xz后如何将其加入到Gnome启动器

起因&#xff1a;近期&#xff08;2025-04-07&#xff09;发现firefox公布了130.0 版本&#xff0c;可以对pdf文档进行签名了&#xff0c;想试一下&#xff0c;所以卸载了我的Debian12上的firefox-esr,直接下载了新版本的tar.xz 包。 经过一番摸索&#xff0c;实现了将其加入Gn…...

VSCode英文翻译插件:变量命名、翻单词、翻句子

目录 【var-translate】 【Google Translate】 【code-translator】 【其他插件】 【var-translate】 非常推荐&#xff0c;可以提供小驼峰、大驼峰、下划线、中划线、常量等翻译&#xff0c;Windows下快捷键为Ctrl Shift v 可以整句英文翻译&#xff0c;并且支持多个免费…...

快速高效的MCP Severs

通用AI Agent的瓶颈 最近一直在用MCP协议开发通用智能体。 虽然大模型本身请求比较慢&#xff0c;但是还可以接受。 而最让人沮丧的是&#xff0c;工具效率也不高 比如社区的filesystem&#xff0c;每次只能创建一个目录&#xff0c;生成文件时&#xff0c;如果目录不存在&…...

原子化 CSS 的常见实现框架

原子化 CSS 是一种 CSS 架构方法&#xff0c;其核心思想是将样式拆分为最小粒度的单一功能类&#xff0c;每个类仅对应一个具体的样式属性&#xff08;如颜色、边距、字体大小等&#xff09;&#xff0c;通过组合这些类来构建复杂的界面。这种方式强调代码复用性、维护性和灵活…...

技术速递|使用 GitHub Copilot Agent Mode 进行编程

作者&#xff1a;卢建晖 - 微软高级云技术布道师 翻译/排版&#xff1a;Alan Wang GitHub Copilot 持续发展&#xff0c;从最初的代码补全、生成、优化功能&#xff0c;到通过对话交互提升 AI 代码质量的 GitHub Copilot Chat&#xff0c;再到能够基于项目中多个文件的关联进行…...

Linux系统(Ubuntu和树莓派)的远程操作练习

目录 实验准备一、Ubuntu 下的远程操作二、树莓派下的远程操作三、思考 实验准备 ​ 1.双方应保证处于同一个局域网内 ​ 2.关闭防火墙 (否则别人将不能 ping 通自己,具体说明请参考&#xff1a;windows-关闭防火墙&#xff09; ​ 3.配置虚拟机 ​ a.网桥模式配置 ​ 查询…...

电脑屏保壁纸怎么设置 桌面壁纸设置方法详解

电脑桌面壁纸作为我们每天面对的第一视觉元素&#xff0c;不仅能够彰显个人品味&#xff0c;还能营造舒适的工作或娱乐氛围。电脑桌面壁纸怎么设置呢&#xff1f;下面本文将为大家介绍Windows和macOS两大主流操作系统中设置电脑桌面壁纸的方法&#xff0c;帮助大家快速设置个性…...

为什么选择Redis?解析核心使用场景与性能优化技巧

解析核心使用场景与性能优化技巧 redis只能能操作字符串&#xff0c;要把Java对象存入redis非关系型数据库&#xff0c;需要用序列化变成字符串&#xff0c;再反序列化成Java对象 not only sql NoSQL非关系型数据库&#xff1a;缓存数据库&#xff0c;只能读取数据&#xff0…...

Docker中Redis修改密码失效

docker容器中&#xff0c;我们通过docker run命令运行某一容器 这里&#xff0c;我们通过以下命令来进行运行【注意&#xff0c;这里有两个关键点&#xff1a;-d 和--requirepass】 docker run \ --restartalways \ --log-opt max-size100m \ --log-opt max-file2 \ -p 6379:6…...

质数质数筛

1.试除法判定质数–O(sqrt(N)) bool is_prime(int x) {if (x < 2) return false;for (int i 2; i < x / i; i )if (x % i 0)return false;return true; }2.试除法分解质因数–O(logN)~O(sqrt(N)) void divide(int x) {for (int i 2; i < x / i; i )if (x % i …...

VGA接口设计

1.VGA简介 VGA(Video Graphics Array)视频图形阵列接口是一种模拟信号视频传输标准,用于连接计算机主机和显示设备,如显示器、投影仪等。 VGA接口能够传输红、绿、蓝三原色的模拟信号以及同步信号(数字信号),实现计算机图形和视频信号的输出和显示。 尽管数字化显示接口…...

clickhouse注入手法总结

clickhouse 遇到一题clickhouse注入相关的&#xff0c;没有见过&#xff0c;于是来学习clickhouse的使用&#xff0c;并总结相关注入手法。 环境搭建 直接在docker运行 docker pull clickhouse/clickhouse-server docker run -d --name some-clickhouse-server --ulimit n…...

VsCode保存时删除无用的引用

打开设置文件 教程&#xff1a;打开VsCode设置设置里添加 {"editor.codeActionsOnSave": {"source.organizeImports": false, // 禁用默认的整理导入"source.removeUnusedImports": true // 仅删除未使用的导入} }...

轻松Linux-4.进程概念

屋漏偏逢连夜雨&#xff0c;今天就学Linux 话不多说&#xff0c;展示军火 1.认识冯诺依曼体系 冯诺依曼体系其实并不是什么稀罕的东西&#xff0c;我们生活中的笔记本、服务器、计算机等等大多都遵守冯诺依曼体系 非常经典的一张图 我们所认识的计算机&#xff0c;是由一个个…...

畅游Diffusion数字人(21):基于Wan2.1的音频驱动数字人FantasyTalking

畅游Diffusion数字人(0)&#xff1a;专栏文章导航 前言&#xff1a;AI数字人是目前视觉AIGC最有希望大规模落地的场景之一。现阶段的商业工具&#xff0c;如字节的OminiHuman-1(即梦大师版)、快手的可灵对口型&#xff0c;虽然效果不错&#xff0c;但是收费昂贵。而开源解决方案…...

CentOS禁用nouveau驱动

1、验证 nouveau 是否在运行 lsmod | grep nouveau如果命令返回结果&#xff0c;说明 nouveau 驱动正在运行。 2、编辑黑名单文件 通过编辑黑名单配置文件来禁用 nouveau 驱动&#xff0c;这样在系统启动时不会加载它。 vi /etc/modprobe.d/blacklist-nouveau.conf修改以下…...

《Operating System Concepts》阅读笔记:p587-p596

《Operating System Concepts》学习第 52 天&#xff0c;p587-p596 总结&#xff0c;总计 10 页。 一、技术总结 1.Recovery (1)consistency checking consistency checking 工具&#xff1a;fsck。 (2)log-structure file system (3)WAFL file system 2.Veritas (1)Ve…...

k8s 1.24.17版本部署(使用Flannel插件)

1.k8s集群环境准备 推荐阅读: https://kubernetes.io/zh/docs/setup/production-environment/tools/kubeadm/install-kubeadm/ 1.1 环境准备 环境准备:硬件配置: 2core 4GB磁盘: 50GB操作系统: Ubuntu 22.04.04 LTSIP和主机名:10.0.0.231 master23110.0.0.232 worker23210.0…...

通信协议详解(十):PSI5 —— 汽车安全传感器的“抗干扰狙击手”

一、PSI5是什么&#xff1f; 一句话秒懂 PSI5就像传感器界的“防弹信使”&#xff1a;在汽车安全系统&#xff08;如气囊&#xff09;中&#xff0c;用两根线同时完成供电数据传输&#xff0c;即便车祸时线路受损&#xff0c;仍能确保关键信号准确送达&#xff01; 基础概念…...

Kafka生产者和消费者:数据管道的核心引擎与智能终端

在分布式系统中&#xff0c;数据的高效流动如同人体的血液循环&#xff0c;而Kafka的生产者&#xff08;Producer&#xff09;与消费者&#xff08;Consumer&#xff09;正是驱动这一循环的核心组件。它们不仅是Kafka客户端的基本形态&#xff0c;更是构建实时数据生态的基石。…...

特权FPGA之按键消抖

完整代码如下所示&#xff1a; timescale 1ns / 1ps// Company: // Engineer: 特权 // // Create Date: // Design Name: // Module Name: // Project Name: // Target Device: // Tool versions: // Description: // // Dependencies: // // Revision: // …...

实时比分更新系统的搭建

搭建一个实时比分更新系统需要考虑多个技术环节&#xff0c;以下是一个完整的实现方案&#xff1a; 一、系统架构 1.数据获取层 比分数据API接入&#xff08;如熊猫比分、API-Football等&#xff09; 网络爬虫&#xff08;作为备用数据源&#xff09; 2.数据处理层 …...

【Linux】线程的概念与控制

目录 1. 整体学习思维导图 2. 线程的概念 2.1 基础概念 2.2 Linux下的线程 初步理解&#xff1a; 2. 分页式存储 3.1 页表 3.1.1 页框/页 3.1.2 页表机制 3.1.3 从虚拟地址到物理地址的转换 总结&#xff1a; 3.2 TLB快表 3.3 缺页异常&#xff08;Page Fault&am…...