
[Study Notes] RL4LLM

The original post exceeded the length limit, so it has been split in half.

First half: LLM+RL


文章目录

  • 8 [RL4LLM] Understanding the reasoning model tokenizer's chat template, vllm inference
    • tokenizer
    • chat template
      • distill tokenizer
      • qwen tokenizer
    • apply chat template
    • vllm inference
  • 9 [RL4LLM] PPO workflow, a first look at OpenRLHF and veRL, and the ray distributed debugger
    • RL4LLM roadmap
      • TRL ppo trainer
    • adv (advantage) estimator
  • 9 [RL4LLM] A deeper look at the PPO-clip objective (and importance sampling)
    • clip
    • Expectation computation
  • 10 [RL4LLM] GRPO loss/objective analysis and possible biases (DAPO, Dr. GRPO)
    • 1 Misconception: zero loss means nothing to optimize
    • 2 On Dr. GRPO
    • 3 per_device_train_batch_size & num_generations
  • 11 [RL4LLM] The deepseek v3 tool-calling bug and understanding function calling in the chat_template
    • v3 chat template
    • v3-0324 chat template
    • `apply_chat_template`


8 [RL4LLM] Understanding the reasoning model tokenizer's chat template, vllm inference

  • 链接:https://github.com/chunhuizhang/llm_rl/blob/main/tutorials/r1-k1.5/reasoning_model_chat_template_and_inference.ipynb
  • video: https://www.bilibili.com/video/BV1LKXSYqE3T

Today's AI = data + algorithms + infrastructure (infra)

  • Do not dismiss the basics or skip the details just because they look simple. The more complex the system, the more you need to work from fundamentals and principles, decomposing it into modules.
  • It is not only tokenization: there is also the chat template — when it is the LLM's turn to speak, how user and assistant are distinguished (including the reasoning tokenizer covered in this episode, i.e. the so-called reasoning tokens & answer tokens);
    • next to fancy and powerful LLMs, the tokenizer seems very low level, even uninteresting or tedious;
    • chat template (for chat models; today's reasoning models are first of all chat models)
      • adds special token ids that mark identities (user/assistant/tool);
        • System: establishes the initial identity;
        • tool_call, tool_response (also identities)
      • adds a system prompt if the user did not explicitly pass one
      • parses the conversation history: a loop over the messages
from transformers import AutoTokenizer
import re
import torch

models = [
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    "Qwen/Qwen2.5-1.5B-Instruct",
    "Qwen/Qwen2.5-Math-1.5B",
    "Qwen/Qwen2.5-1.5B",
    "Qwen/QwQ-32B-Preview",
]

Here, DeepSeek-R1-Distill-Qwen-1.5B is distilled from Qwen2.5-Math-1.5B, not from Qwen2.5-1.5B-Instruct; the paper states this explicitly:

(screenshot from the R1 paper omitted)

def hf_tokenizer(name_or_path):
    tokenizer = AutoTokenizer.from_pretrained(name_or_path)
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id
        print(f'tokenizer.pad_token_id is None. Now set to {tokenizer.eos_token_id}')
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        print(f'tokenizer.pad_token is None. Now set to {tokenizer.eos_token}')
    return tokenizer

Here we also patch the padding token, falling back to the EOS token when none is defined.

tokenizer

  • DeepSeek-R1-Distill-Qwen-1.5B is distilled from Qwen2.5-Math-1.5B, not from Qwen/Qwen2.5-1.5B-Instruct;
  • it reuses the vocabulary but redefines some of the special token ids;

The code below verifies this:

def test_tokenizer(tokenizer, text):
    tokens = tokenizer.encode(text)
    print(f'{text}, tokens: {tokens}')

for name_or_path in models:
    tokenizer = hf_tokenizer(name_or_path)
    print(f'{name_or_path}, tokenizer.pad_token: {tokenizer.pad_token}, tokenizer.pad_token_id: {tokenizer.pad_token_id}')
    test_tokenizer(tokenizer, "hello world")
    print('-' * 100)

Output:

deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, tokenizer.pad_token: <|end▁of▁sentence|>, tokenizer.pad_token_id: 151643
hello world, tokens: [151646, 14990, 1879]
----------------------------------------------------------------------------------------------------
Qwen/Qwen2.5-1.5B-Instruct, tokenizer.pad_token: <|endoftext|>, tokenizer.pad_token_id: 151643
hello world, tokens: [14990, 1879]
----------------------------------------------------------------------------------------------------
Qwen/Qwen2.5-Math-1.5B, tokenizer.pad_token: <|endoftext|>, tokenizer.pad_token_id: 151643
hello world, tokens: [14990, 1879]
----------------------------------------------------------------------------------------------------
Qwen/Qwen2.5-1.5B, tokenizer.pad_token: <|endoftext|>, tokenizer.pad_token_id: 151643
hello world, tokens: [14990, 1879]
----------------------------------------------------------------------------------------------------
Qwen/QwQ-32B-Preview, tokenizer.pad_token: <|endoftext|>, tokenizer.pad_token_id: 151643
hello world, tokens: [14990, 1879]
----------------------------------------------------------------------------------------------------

The vocabularies are consistent with each other; the DeepSeek distill tokenizer's vocabulary is roughly 150k+.

distill_tokenizer = hf_tokenizer('deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B')
print(distill_tokenizer.decode(151646))
qwen_math_tokenizer = hf_tokenizer('Qwen/Qwen2.5-Math-1.5B')
print(qwen_math_tokenizer.decode(151646))
qwen_chat_tokenizer = hf_tokenizer('Qwen/Qwen2.5-1.5B-Instruct')
print(qwen_chat_tokenizer.decode(151646))
qwen_base_tokenizer = hf_tokenizer('Qwen/Qwen2.5-1.5B')
print(qwen_base_tokenizer.decode(151646))
qwen_reason_tokenizer = hf_tokenizer('Qwen/QwQ-32B-Preview')
print(qwen_reason_tokenizer.decode(151646))

Output:

  • An interesting detail: the special tokens use full-width characters (the ▁ pieces), presumably to distinguish them from other LLMs' vocabularies and to ensure such strings do not occur anywhere on the internet.
<|begin▁of▁sentence|>
<|object_ref_start|>
<|object_ref_start|>
<|object_ref_start|>
<|object_ref_start|>
distill_tokenizer.encode('<|User|>')  # [151646, 151644]
qwen_math_tokenizer.encode('<|User|>'), qwen_chat_tokenizer.encode('<|User|>'), qwen_base_tokenizer.encode('<|User|>')
# Output:
# ([27, 130957, 1474, 130957, 29],
#  [27, 130957, 1474, 130957, 29],
#  [27, 130957, 1474, 130957, 29])

The Qwen vocabularies are roughly 130k+ 👆 (and for them '<|User|>' is not a special token, so it is split into ordinary tokens).

Another interesting thing:

# what is <|end▁of▁sentence|>
# https://chat.deepseek.com/a/chat/s/569c8476-7b64-48fa-865b-9e01718b961b
# what is <|im_end|>
# https://chat.qwen.ai/c/da88d4f3-c279-4851-acbb-d3f051c11e86
distill_tokenizer.decode(151643) # '<|end▁of▁sentence|>'

The point is that DeepSeek never "sees" the literal string <|end▁of▁sentence|>: if you ask it "what is <|end▁of▁sentence|>", it cannot answer. Likewise, Qwen cannot see its own <|im_end|>:

(chat screenshots omitted; see the links above)
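
A quick way to see this from the tokenizer side (the ids shown are illustrative, taken from the runs above): the delimiter string collapses into a single special token id, so the model never observes it as ordinary text it could talk about.

print(qwen_chat_tokenizer.encode('<|im_end|>'))         # e.g. [151645]
print(distill_tokenizer.encode('<|end▁of▁sentence|>'))   # e.g. [151646, 151643] (BOS + EOS)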

chat template

AutoTokenizer.from_pretrained('meta-llama/Llama-3.1-8B').chat_template
  • jinja template
  • Llama-3.1-8B is a base model; it has no chat template
from jinja2 import Environment, FileSystemLoader

# create a Jinja2 environment
env = Environment(loader=FileSystemLoader("."))

# Template 1: standard syntax {% ... %}
template1 = env.from_string("""
{% if True %}
    Hello
{% endif %}
""")

# Template 2: whitespace-trimming syntax {%- ... -%}
template2 = env.from_string("""
{%- if True -%}
    Hello
{%- endif -%}
""")

# render the templates
result1 = template1.render()
result2 = template2.render()

# print the results
print("Result with standard {% ... %} syntax:")
print(repr(result1))  # repr shows newlines and whitespace

print("\nResult with whitespace-trimming {%- ... -%} syntax:")
print(repr(result2))

"""
Result with standard {% ... %} syntax:
'\n\n    Hello\n'

Result with whitespace-trimming {%- ... -%} syntax:
'Hello'
"""

The minus signs in {%- if True -%} trim the surrounding whitespace.

# We do not recommend using base language models for conversations. Instead, you can apply post-training, e.g., SFT, RLHF, continued pretraining, etc., on this model.
# https://huggingface.co/Qwen/Qwen2.5-1.5B
print(qwen_base_tokenizer.chat_template)

Output:

{%- if tools %}{{- '<|im_start|>system\n' }}{%- if messages[0]['role'] == 'system' %}{{- messages[0]['content'] }}{%- else %}{{- 'You are a helpful assistant.' }}{%- endif %}{{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}{%- for tool in tools %}{{- "\n" }}{{- tool | tojson }}{%- endfor %}{{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}{%- if messages[0]['role'] == 'system' %}{{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}{%- else %}{{- '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n' }}{%- endif %}
{%- endif %}
{%- for message in messages %}{%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}{{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}{%- elif message.role == "assistant" %}{{- '<|im_start|>' + message.role }}{%- if message.content %}{{- '\n' + message.content }}{%- endif %}{%- for tool_call in message.tool_calls %}{%- if tool_call.function is defined %}{%- set tool_call = tool_call.function %}{%- endif %}{{- '\n<tool_call>\n{"name": "' }}{{- tool_call.name }}{{- '", "arguments": ' }}{{- tool_call.arguments | tojson }}{{- '}\n</tool_call>' }}{%- endfor %}{{- '<|im_end|>\n' }}{%- elif message.role == "tool" %}{%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}{{- '<|im_start|>user' }}{%- endif %}{{- '\n<tool_response>\n' }}{{- message.content }}{{- '\n</tool_response>' }}{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}{{- '<|im_end|>\n' }}{%- endif %}{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}{{- '<|im_start|>assistant\n' }}
{%- endif %}

Note 👆: Qwen's base model does ship a chat template, but the official docs say base models are not recommended for conversations.

(role-flow diagram omitted: System establishes the initial identity; User asks / Assistant responds, either replying directly in natural language or calling a Tool and receiving its returned results)

distill tokenizer

print(distill_tokenizer.chat_template)

The distill (DeepSeek R1) chat_template does not even contain line breaks:

{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}{%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}{%- endfor %}{{bos_token}}{{ns.system_prompt}}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<|User|>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls']%}{%- if not ns.is_first %}{{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{%- set ns.is_first = true -%}{%- else %}{{'\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- endfor %}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is not none %}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '</think>' in content %}{% set content = content.split('</think>')[-1] %}{% endif %}{{'<|Assistant|>' + content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'\n<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<|Assistant|><think>\n'}}{% endif %}

Formatted:

{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}
{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}{%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}
{%- endfor %}{{bos_token}}{{ns.system_prompt}}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<|User|>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls']%}{%- if not ns.is_first %}{{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{%- set ns.is_first = true -%}{%- else %}{{'\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- endfor %}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is not none %}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '</think>' in content %}{% set content = content.split('</think>')[-1] %}{% endif %}{{'<|Assistant|>' + content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'\n<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}
{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}
{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<|Assistant|><think>\n'}}
{% endif %}

Note that a <think> is appended at the very end; there is even a Hugging Face discussion thread about this:
(screenshot of the Hugging Face discussion omitted)

qwen tokenizer

print("\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>")

Output:

# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
# default system prompt
print(qwen_chat_tokenizer.chat_template)

Output:

{%- if tools %}{{- '<|im_start|>system\n' }}{%- if messages[0]['role'] == 'system' %}{{- messages[0]['content'] }}{%- else %}{{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}{%- endif %}{{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}{%- for tool in tools %}{{- "\n" }}{{- tool | tojson }}{%- endfor %}{{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}{%- if messages[0]['role'] == 'system' %}{{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}{%- else %}{{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}{%- endif %}
{%- endif %}
{%- for message in messages %}{%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}{{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}{%- elif message.role == "assistant" %}{{- '<|im_start|>' + message.role }}{%- if message.content %}{{- '\n' + message.content }}{%- endif %}{%- for tool_call in message.tool_calls %}{%- if tool_call.function is defined %}{%- set tool_call = tool_call.function %}{%- endif %}{{- '\n<tool_call>\n{"name": "' }}{{- tool_call.name }}{{- '", "arguments": ' }}{{- tool_call.arguments | tojson }}{{- '}\n</tool_call>' }}{%- endfor %}{{- '<|im_end|>\n' }}{%- elif message.role == "tool" %}{%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}{{- '<|im_start|>user' }}{%- endif %}{{- '\n<tool_response>\n' }}{{- message.content }}{{- '\n</tool_response>' }}{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}{{- '<|im_end|>\n' }}{%- endif %}{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}{{- '<|im_start|>assistant\n' }}
{%- endif %}

Next, the reasoning model's (QwQ) template:

print(qwen_reason_tokenizer.chat_template)

Output:

{%- if tools %}{{- '<|im_start|>system\n' }}{%- if messages[0]['role'] == 'system' %}{{- messages[0]['content'] }}{%- else %}{{- 'You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step.' }}{%- endif %}{{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}{%- for tool in tools %}{{- "\n" }}{{- tool | tojson }}{%- endfor %}{{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}{%- if messages[0]['role'] == 'system' %}{{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}{%- else %}{{- '<|im_start|>system\nYou are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step.<|im_end|>\n' }}{%- endif %}
{%- endif %}
{%- for message in messages %}{%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}{{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}{%- elif message.role == "assistant" %}{{- '<|im_start|>' + message.role }}{%- if message.content %}{{- '\n' + message.content }}{%- endif %}{%- for tool_call in message.tool_calls %}{%- if tool_call.function is defined %}{%- set tool_call = tool_call.function %}{%- endif %}{{- '\n<tool_call>\n{"name": "' }}{{- tool_call.name }}{{- '", "arguments": ' }}{{- tool_call.arguments | tojson }}{{- '}\n</tool_call>' }}{%- endfor %}{{- '<|im_end|>\n' }}{%- elif message.role == "tool" %}{%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}{{- '<|im_start|>user' }}{%- endif %}{{- '\n<tool_response>\n' }}{{- message.content }}{{- '\n</tool_response>' }}{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}{{- '<|im_end|>\n' }}{%- endif %}{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}{{- '<|im_start|>assistant\n' }}
{%- endif %}

It also supports tool calls 👆

For the 1.5B model, the official guidance is not to modify the system template.

apply chat template

basic_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you?"},
    # {"role": "assistant", "content": "I'm doing great. How can I help you today?"}
]
distill_tokenizer.apply_chat_template(basic_messages, tokenize=False)

Output: '<|begin▁of▁sentence|>You are a helpful assistant.<|User|>Hello, how are you?'

# https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
distill_tokenizer.apply_chat_template(basic_messages, tokenize=False, add_generation_prompt=True)

Output: '<|begin▁of▁sentence|>You are a helpful assistant.<|User|>Hello, how are you?<|Assistant|><think>\n'

qwen_chat_tokenizer.apply_chat_template(basic_messages, tokenize=False, add_generation_prompt=True)

Output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nHello, how are you?<|im_end|>\n<|im_start|>assistant\n'

qwen_reason_tokenizer.apply_chat_template(basic_messages, tokenize=False, add_generation_prompt=True)

Output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nHello, how are you?<|im_end|>\n<|im_start|>assistant\n'

vllm inference

gsm8k_inference_test = "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?"
gt_ans = '18'
distill_tokenizer.apply_chat_template([gsm8k_inference_test], add_generation_prompt=True, tokenize=False)

Output: '<|begin▁of▁sentence|><|Assistant|><think>\n' (a raw string was passed instead of a list of {'role': ..., 'content': ...} dicts, so no role branch matches and the question text is dropped)

instruction = "Let's think step by step and output the final answer within \\boxed{}."
chat_test = [{'role': 'user', 'content': f'{gsm8k_inference_test} {instruction}'}]
chat_test
"""
[{'role': 'user','content': "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market? Let's think step by step and output the final answer within \\boxed{}."}]
"""
distill_tokenizer.apply_chat_template(chat_test, add_generation_prompt=True, tokenize=False)

输出结果:"<|begin▁of▁sentence|><|User|>Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market? Let's think step by step and output the final answer within \\boxed{}.<|Assistant|><think>\n"

prompt_ids = distill_tokenizer.apply_chat_template(chat_test, add_generation_prompt=True, tokenize=True)
from vllm import LLM, SamplingParams
llm = LLM(model='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', max_model_len=32768)
sampling_params = SamplingParams(temperature=0.6, max_tokens=32768)
response = llm.generate(prompt_token_ids=prompt_ids, sampling_params=sampling_params)[0]
print(response.outputs[0].text)

Output:

Alright, let's tackle this problem step by step. So, Janet has ducks that lay 16 eggs every day. Hmm, okay, that's a lot! She eats three eggs for breakfast every morning and bakes muffins for her friends with four eggs each day. Then, she sells the rest at the farmers' market for $2 per egg. I need to figure out how much money she makes every day from selling the eggs.

First, let me break down the information given:

1. **Duck eggs per day:** 16
2. **Eggs eaten for breakfast:** 3 per day
3. **Eggs used for muffins:** 4 per day
4. **Selling price per egg:** $2

So, the plan is to subtract the eggs Janet eats and uses for muffins from the total eggs laid each day. The remaining eggs will be what she sells, and then we can multiply that by the selling price to get her daily earnings.

Let me write this down in a more structured way:

Total eggs per day = 16
Eggs eaten for breakfast = 3
Eggs used for muffins = 4

So, the eggs available for selling = Total eggs - Eggs eaten - Eggs used for muffins

That is:

Eggs for selling = 16 - 3 - 4

Let me compute that: 16 - 3 is 13, and then 13 - 4 is 9. So, 9 eggs are left for selling.

Now, she sells each egg for $2. So, the total revenue from selling the eggs would be:

Total revenue = Eggs for selling × Selling price per egg

Which is:

Total revenue = 9 × $2

Calculating that, 9 times 2 is 18. So, she makes $18 each day from selling the eggs.

Wait, let me double-check my calculations to make sure I didn't make a mistake.

Total eggs: 16
Eggs eaten for breakfast: 3, so 16 - 3 = 13
Eggs used for muffins: 4, so 13 - 4 = 9

Yes, 9 eggs left.

9 eggs × $2 = $18. That seems right.

Alternatively, I can check by adding up the eggs used: 3 breakfast + 4 muffins = 7 eggs

So, 7 eggs are eaten or used, and 16 - 7 = 9 eggs left. Yep, same result.

So, 9 × $2 is definitely $18.

I don't think I've missed anything here. She starts the day with 16 eggs, uses 7 of them, sells the rest, and that's the amount she makes. So, the answer should be $18.

**Final Answer**
\boxed{18}
</think>

Janet's ducks lay 16 eggs per day. She eats 3 eggs for breakfast and uses 4 eggs for baking muffins. The remaining eggs are sold at the farmers' market for $2 each.

1. Total eggs per day: 16
2. Eggs eaten for breakfast: 3
3. Eggs used for muffins: 4
4. Eggs available for selling: \(16 - 3 - 4 = 9\)

The revenue from selling the remaining eggs is calculated as:
\[ 9 \text{ eggs} \times \$2 \text{ per egg} = \$18 \]

Thus, Janet makes \(\boxed{18}\) dollars every day at the farmers' market.

9 [RL4LLM] PPO workflow, a first look at OpenRLHF and veRL, and the ray distributed debugger

链接:https://github.com/chunhuizhang/llm_rl/blob/main/tutorials/r1-k1.5/infra/overall_basics.ipynb

RL4LLM roadmap

  • Start learning with trl; the framework is relatively basic and simple;
    • study GRPO in depth, reproduce R1 and the "aha moments" on top of a 1.5B model;
      • https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb (a GRPO reproduction project from around the Spring Festival)
      • along the way you also largely figure out the PPO algorithm of the RLHF stage; in terms of formulas the two mainly differ only in how the adv (advantage) is estimated;
  • Later, gradually migrate to more modern RL4LLM frameworks with more engineering and performance optimizations
    • e.g. veRL and OpenRLHF
    • starting from scratch, prefer veRL, unless the project you inherit is built on OpenRLHF;
    • veRL: 2409.19256, 3.8k stars;
      • https://github.com/Jiayi-Pan/TinyZero
      • https://github.com/agentica-project/deepscaler
      • https://github.com/Unakar/Logic-RL
    • OpenRLHF: 2405.11143, 5k stars;

The figure in the paper is very clear:
(architecture figure from the paper omitted)

TRL ppo trainer

- https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py
- make experiences
  - forward
    - queries: `[4, 56]`
    - responses: `[4, 53]` ($\pi_{\theta_{old}}$)
    - logprobs: `[4, 53]` ($\pi_{\theta_{old}}$)
    - ref_logprobs: `[4, 53]` ($\pi_{ref}$)
    - values: `[4, 53]`
    - scores: `[4]` (the last token's, for the whole query + response)
  - compute rewards (token level)
    - $r_t = r_{T} - \beta (\log\pi_\theta-\log\pi_{ref})$
    - inner loop;
    - the KL term uses the k1 approximation;
  - compute advantage & return (see the sketch after this list)
    - GAE:
      - $\delta_t=r_t+\gamma V(s_{t+1})-V(s_t)$
      - $A_t=\sum_{k=0}^T(\gamma\lambda)^k\delta_{t+k}$
    - return: advantage + value
- ppo update ($\pi_\theta$)
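
A minimal sketch (variable names and the usage are illustrative, not TRL's actual implementation) of the token-level reward shaping and GAE steps listed above:

import torch

def shape_rewards(score, logprobs, ref_logprobs, beta=0.1):
    # per-token KL penalty (k1 approximation), with the scalar reward added on the last token
    rewards = -beta * (logprobs - ref_logprobs)      # shape [T]
    rewards[-1] += score
    return rewards

def gae(rewards, values, gamma=1.0, lam=0.95):
    # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t);  A_t = sum_k (gamma*lam)^k * delta_{t+k}
    T = rewards.shape[0]
    adv = torch.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    returns = adv + values      # return = advantage + value
    return adv, returns

# toy usage with made-up per-token tensors
logprobs, ref_logprobs, values = torch.randn(5), torch.randn(5), torch.randn(5)
adv, ret = gae(shape_rewards(1.0, logprobs, ref_logprobs), values)
print(adv, ret)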

adv(advantage) estimator

  • GAE
  • GRPO
  • RLOO
  • REINFORCE++
  • ReMax

verl/trainer/ppo/ray_trainer.py/compute_advantage
verl/trainer/ppo/core_algos.py

https://verl.readthedocs.io/en/latest/examples/config.html

  • compute_gae_advantage_return
    • token_level_rewards, values
    • $A_t^{GAE}=\sum_{\ell=0}^{T-t}(\gamma\lambda)^{\ell}\delta_{t+\ell}, \quad \delta_t=r_t+\gamma V(s_{t+1})-V(s_t)$
    • return: $ret_t=V(s_t)+A_t^{GAE}$
  • compute_grpo_outcome_advantage (see the sketch below)
    • token_level_rewards
    • $A_i=\frac{r_i-\mu}{\sigma+\epsilon}$
  • compute_rloo_outcome_advantage
    • token_level_rewards
    • $A_i=R_i-\frac1{n-1}\sum_{k\neq i}R_k$
  • compute_reinforce_plus_plus_outcome_advantage
    • token_level_rewards
    • $A_t=\frac{G_t-\mu}{\sigma}, \quad G_t=\sum_{k=t}^T\gamma^{k-t}r_k$
      • return: accumulated discounted reward
  • compute_remax_outcome_advantage (Reward-Maximization with Baseline)
    • token_level_rewards, reward_baselines
    • $A_t=G_t-b, \quad G_t=\sum_{k=t}^T r_k$
      • no discounted return
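
A minimal sketch (illustrative, not verl's actual code) of the GRPO outcome advantage for one group of generations sampled from the same prompt:

import torch

def grpo_outcome_advantage(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: [G] scalar outcome rewards of one group
    mu, sigma = rewards.mean(), rewards.std()
    return (rewards - mu) / (sigma + eps)   # A_i = (r_i - mu) / (sigma + eps), broadcast to every token of o_i

rewards = torch.tensor([1.0, 0.0, 1.0, 1.0])   # e.g. 1 = correct, 0 = wrong
print(grpo_outcome_advantage(rewards))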

9 [RL4LLM] A deeper look at the PPO-clip objective (and importance sampling)

https://github.com/chunhuizhang/llm_rl/blob/main/tutorials/basics_insights_of_RL/importance_sampling.ipynb
https://github.com/chunhuizhang/llm_rl/blob/main/tutorials/basics_insights_of_RL/PPO_clip.ipynb

https://huggingface.co/blog/deep-rl-ppo

A big difference between reinforcement learning (online) and traditional supervised learning (offline) is that the training data is collected on the fly: you generate data while training the model, then use the updated model to generate more data and keep training.

import numpy as np
import matplotlib.pyplot as plt

$$PPO_{clip}=\min\big(r(\theta)A,\ \text{clip}(r(\theta), 1-\epsilon, 1+\epsilon)A\big)$$

  • policy update ratio: $r(\theta)=\frac{\pi_\theta(a|s)}{\pi_{\theta_\text{old}}(a|s)}$
  • the Advantage itself does not directly take part in the gradient computation
  • PPO's clip operation can leave a sample with no gradient, in which case that training sample contributes nothing
    • $A>0$ and $r(\theta)>1+\epsilon$: clipped to $A(1+\epsilon)$, gradient is 0
      • $r(\theta)<1-\epsilon$: the value is $Ar$, not clipped, gradient is $A$;
    • $A<0$ and $r(\theta)<1-\epsilon$: clipped to $A(1-\epsilon)$, gradient is 0

clip

  • When $A>0$ (encouraging $\pi(a_t|s_t)$, i.e. pushing the likelihood ratio up): if $r\geq 1+\epsilon$, the objective becomes $(1+\epsilon)A$ (gradient 0)
    • the objective has no gradient here, so the likelihood will not keep increasing
  • When $A<0$ (suppressing $\pi(a_t|s_t)$, i.e. pushing the likelihood ratio down): if $r\leq 1-\epsilon$, the objective becomes $(1-\epsilon)A$ (gradient 0)
  • One more question: why can't we just take $\text{clip}(r, 1-\epsilon, 1+\epsilon)A$ alone (without the min)?
    • When $A>0$ and $r<1-\epsilon$ (the ratio already deviates a lot at the start), $Ar<A(1-\epsilon)$, so the min keeps the objective at $Ar$, preserving the gradient and pushing $r$ back up
    • When $A<0$ and $r>1+\epsilon$ (the ratio already deviates a lot at the start), $Ar<A(1+\epsilon)$, so the min keeps the objective at $Ar$, preserving the gradient and pulling $r$ back down
def ppo_clip(r, A, eps=0.2):
    return np.minimum(r * A, np.clip(r, 1-eps, 1+eps) * A)

fig, axs = plt.subplots(1, 2, figsize=(14, 6))

# Set basic parameters
eps = 0.2
r_values = np.linspace(0.5, 1.5, 1000)  # r uniformly distributed from 0.5 to 1.5

# First case: A > 0 (A = 1)
A = 1
ppo_values = ppo_clip(r_values, A, eps)
original_values = r_values * A  # Original policy gradient objective

axs[0].plot(r_values, ppo_values, 'b-', linewidth=2, label='PPO-CLIP')
axs[0].plot(r_values, original_values, 'r--', linewidth=2, label='r*A')
# Draw clipping boundaries
axs[0].axvline(x=1-eps, color='g', linestyle=':', label='Clip boundaries (1±ε)')
axs[0].axvline(x=1+eps, color='g', linestyle=':')
axs[0].set_title('When A > 0 (A = 1)')
axs[0].set_xlabel('Policy Ratio r')
axs[0].set_ylabel('Objective Value')
axs[0].legend()
axs[0].grid(True)

# Second case: A < 0 (A = -1)
A = -1
ppo_values = ppo_clip(r_values, A, eps)
original_values = r_values * A  # Original policy gradient objective

axs[1].plot(r_values, ppo_values, 'b-', linewidth=2, label='PPO-CLIP')
axs[1].plot(r_values, original_values, 'r--', linewidth=2, label='r*A')
# Draw clipping boundaries
axs[1].axvline(x=1-eps, color='g', linestyle=':', label='Clip boundaries (1±ε)')
axs[1].axvline(x=1+eps, color='g', linestyle=':')
axs[1].set_title('When A < 0 (A = -1)')
axs[1].set_xlabel('Policy Ratio r')
axs[1].set_ylabel('Objective Value')
axs[1].legend()
axs[1].grid(True)

(plot: PPO-clip objective vs. policy ratio for A > 0 and A < 0)


Expectation computation

$$\mathbb E_{x\sim p}[f(x)]=\int p(x)f(x)dx=\int q(x)\frac{p(x)}{q(x)}f(x)dx=\mathbb E_{x\sim q}\left[\frac{p(x)}{q(x)}f(x)\right]$$

  • Importance sampling (IS) is a Monte Carlo technique for the approximation of intractable distributions and integrals with respect to them.

    • IS was originally introduced for the case where it is hard to sample $x\sim p(x)$ directly but easy to sample $x\sim q(x)$ (a distribution we design and choose ourselves)
    • https://allenwind.github.io/blog/10466/
  • Equal means do not imply equal variances;

    • $Var_{x\sim p}[f]=E_{x\sim p}[f^2] - \dots$
    • $Var_{x\sim q}[\frac pq f]=E_{x\sim q}\left[\left(\frac{p}{q}f\right)^2\right] - \dots=E_{x\sim p}[\frac pq f^2]-\dots$
    • if $\frac pq$ deviates a lot from 1, the variance of the latter becomes very large;
  • $x\sim q$: sampling

  • $w_n=\frac{p(x_n)}{q(x_n)}$: importance weight

  • In RL4LLM training, introducing importance sampling lets an on-policy algorithm (which has low data efficiency) become relatively off-policy

    • Consider the following policy gradient
      $$\begin{split} &= E_{(s_t, a_t) \sim \pi_\theta} [A^\theta(s_t, a_t) \nabla \log p_\theta(a_t^n | s_t^n)]\\ &= E_{(s_t, a_t) \sim \pi_{\theta'}} \left[ \frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)} A^{\theta'}(s_t, a_t) \nabla \log p_\theta(a_t^n | s_t^n) \right] \end{split}$$
    • the generated samples can then be used to update the policy multiple times;
    • importance sampling needs a large number of samples before it becomes a reliable substitute
      $$\mathcal{J}_{PPO}(\theta) = \mathbb{E}_{q \sim P(Q), o \sim \pi_{\theta_{old}}(O|q)} \left[ \frac{1}{|o|} \sum_{t=1}^{|o|} \min \left( \frac{\pi_{\theta}(o_t|q, o_{<t})}{\pi_{\theta_{old}}(o_t|q, o_{<t})} A_t, \text{clip} \left( \frac{\pi_{\theta}(o_t|q, o_{<t})}{\pi_{\theta_{old}}(o_t|q, o_{<t})}, 1-\epsilon, 1+\epsilon \right) A_t \right) \right]$$
  • https://zhuanlan.zhihu.com/p/17657567877

    • First, two important parameters in RLHF algorithms: rollout_batch_size and train_batch_size. The former is how many training samples (responses and rewards) are generated in one go; the latter is how many samples are used per model update; the former is N times the latter.
    • As training frameworks keep being optimized, RLHF training data is no longer that hard to produce, especially with frameworks like OpenRLHF that use vllm to generate responses very efficiently. We can simply set N = 1 / 2 / 4, very small values, and use each training sample only once. In fact, since importance sampling needs a large number of samples before it becomes a reliable substitute, N really cannot be large; the larger it is, the more easily training collapses. (A toy sketch of the rollout/train batch relationship follows below.)
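
As a toy illustration of the rollout/train batch relationship described above (all numbers and names are made up, not tied to any particular framework):

# One rollout batch of generated samples is reused as N = rollout_batch_size / train_batch_size minibatches.
rollout_batch_size = 8
train_batch_size = 4
N = rollout_batch_size // train_batch_size  # 2 policy updates per rollout batch

rollouts = [f"sample_{i}" for i in range(rollout_batch_size)]  # stand-ins for (prompt, response, reward) tuples
for step in range(N):
    minibatch = rollouts[step * train_batch_size:(step + 1) * train_batch_size]
    print(f"update {step}: {minibatch}")
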
import numpy as np
import matplotlib.pyplot as plt
from math import sqrt, pi, exp
np.random.seed(1234)

mu_p, sigma_p = -0.5, 0.5

def p(x):
    return 1/(sqrt(2*pi)*sigma_p) * exp(-((x - mu_p)**2)/(2*sigma_p**2))

def q(x, mu, sigma):
    return 1/(sqrt(2*pi)*sigma) * exp(-((x - mu)**2)/(2*sigma**2))

def f(x):
    return 1 / (1 + exp(-x)) - 0.5

# Define several different q distributions with varying distances from p
q_params = [
    (0.0, 0.6, "q1: $\mu=0.0, \sigma=0.6$"),   # Closer to p
    (1.5, 0.8, "q2: $\mu=1.5, \sigma=0.8$"),   # Original q
    (2.5, 1.0, "q3: $\mu=2.5, \sigma=1.0$"),   # Further from p
]

xs = np.linspace(-3, 5, 300)
pxs = [p(x) for x in xs]
fxs = [f(x) for x in xs]

plt.figure(figsize=(10,5))
plt.plot(xs, pxs, label='$p(x)$', color='blue')
for mu, sigma, label in q_params:
    qxs = [q(x, mu, sigma) for x in xs]
    plt.plot(xs, qxs, label=label, linestyle='--')
plt.plot(xs, fxs, label='f(x)', color='red')
plt.ylim(-0.5, 1)
plt.legend()
plt.title('Importance Sampling: Distributions')
plt.xlabel('x')
plt.ylabel('Density/Value')
plt.grid(alpha=0.3)

(plot: p(x), the three q(x) candidates, and f(x))

# Calculate ground truth by direct sampling from p
samples = np.random.normal(loc=mu_p, scale=sigma_p, size=1000000)
mean_fp = np.mean([f(x) for x in samples])
print(f'Ground truth expectation under p(x): {mean_fp:.6f}')
# Ground truth expectation under p(x): -0.116002
# Importance sampling with varying sample sizes
sample_sizes = [10, 100, 1000, 10000, 100000]
results = {label: [] for _, _, label in q_params}
true_value = mean_fp

plt.figure(figsize=(10,6))
for mu, sigma, label in q_params:
    for size in sample_sizes:
        samples = np.random.normal(loc=mu, scale=sigma, size=size)
        weights = np.array([p(x) / q(x, mu, sigma) for x in samples])
        mean_is = np.mean(weights * np.array([f(x) for x in samples]))
        results[label].append(mean_is)
        print(f"Distribution {label}, Sample size {size}: {mean_is:.6f}")
    plt.plot(sample_sizes, results[label], 'o-', label=label)

plt.axhline(y=true_value, color='r', linestyle='-', label='True expectation')
plt.xscale('log')
plt.grid(True, which="both", ls="--", alpha=0.3)
plt.xlabel('Number of Samples')
plt.ylabel('Estimated Expectation')
plt.title('Importance Sampling Estimates vs. Sample Size')
plt.legend()
plt.tight_layout()

Output:

Distribution q1: $\mu=0.0, \sigma=0.6$, Sample size 10: -0.084525
Distribution q1: $\mu=0.0, \sigma=0.6$, Sample size 100: -0.092245
Distribution q1: $\mu=0.0, \sigma=0.6$, Sample size 1000: -0.113189
Distribution q1: $\mu=0.0, \sigma=0.6$, Sample size 10000: -0.114711
Distribution q1: $\mu=0.0, \sigma=0.6$, Sample size 100000: -0.115185
Distribution q2: $\mu=1.5, \sigma=0.8$, Sample size 10: 0.025215
Distribution q2: $\mu=1.5, \sigma=0.8$, Sample size 100: 0.007397
Distribution q2: $\mu=1.5, \sigma=0.8$, Sample size 1000: -0.083480
Distribution q2: $\mu=1.5, \sigma=0.8$, Sample size 10000: -0.110500
Distribution q2: $\mu=1.5, \sigma=0.8$, Sample size 100000: -0.114701
Distribution q3: $\mu=2.5, \sigma=1.0$, Sample size 10: 0.000679
Distribution q3: $\mu=2.5, \sigma=1.0$, Sample size 100: 0.000508
Distribution q3: $\mu=2.5, \sigma=1.0$, Sample size 1000: -0.019807
Distribution q3: $\mu=2.5, \sigma=1.0$, Sample size 10000: -0.088690
Distribution q3: $\mu=2.5, \sigma=1.0$, Sample size 100000: -0.124670

(plot: importance-sampling estimates vs. sample size)


10 [RL4LLM] GRPO loss/objective analysis and possible biases (DAPO, Dr. GRPO)

  • video: https://www.bilibili.com/video/BV1LgXbY5EFD
  • code: https://github.com/chunhuizhang/llm_rl/blob/main/tutorials/r1-k1.5/grpo_loss.ipynb

Recently, groups from Tsinghua and Singapore proposed DAPO, Dr. GRPO, and the like:

  • DAPO:
    • https://dapo-sia.github.io
  • Dr. GRPO
    • https://github.com/sail-sg/understand-r1-zero

1 Misconception: zero loss means nothing to optimize

loss = 0

Why can backpropagation still update the parameters when the loss is 0?

  • loss being 0 does not mean the gradient is 0
    • take $f(w)=(w-1)^2-1$: at $w=0$ we have $f(w)=0$, yet the gradient is $-2$ (see the quick check below)
      • gradient × learning rate is what learning actually is;
    • $w-\eta\cdot g=0-(0.1\times -2)=0.2$
  • so the loss is no longer a good metric to monitor — the reward is
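
A quick autograd check of the $f(w)=(w-1)^2-1$ example above:

import torch

w = torch.tensor(0.0, requires_grad=True)
loss = (w - 1) ** 2 - 1      # loss = 0 at w = 0
loss.backward()
print(loss.item(), w.grad.item())   # 0.0, -2.0; one SGD step: w = 0 - 0.1 * (-2) = 0.2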

In practice, though, losses are usually designed with zero as a lower bound; even when regularization terms are added, each component of the loss is essentially non-negative.

A classic thread about GRPO:

  • https://github.com/huggingface/trl/issues/2608#issuecomment-2609844003

(screenshot of the GitHub comment omitted)

Note the second equation above, the GRPO loss: $\pi$ divided by $\pi$ (with the denominator detached) — isn't that just 1? Not quite; see the example below:

import torch

# Case 1: x - x (gradient is 0)
x = torch.tensor([3.0], requires_grad=True)
y1 = x - x
y1.backward()  # backprop to compute the gradient
print("Gradient for x - x:", x.grad.item())  # prints 0.0

# clear the gradient before the next example
x.grad.zero_()

# Case 2: x - x.detach() (gradient is 1)
y2 = x - x.detach()  # detach the second x so it is treated as a constant
y2.backward()  # backprop to compute the gradient
print("Gradient for x - x.detach():", x.grad.item())  # prints 1.0

This is the key point about the loss: detach means no gradient flows through that term (you can think of it as a constant). So although the ratio looks like 1 numerically, it is not a plain 1 for autograd — the denominator is merely a constant.

loss = $\beta\,KL$

GitHub Issue: Why does the loss start at 0 when I train GRPO, and then possibly increase?

(screenshot of the issue omitted)

  • This is another issue thread

  • trl grpo

    • $\beta = 0.04$ (default, GRPOConfig)
    • this value is actually fairly large; for math people use something like 0.001??
  • Setting the KL term aside

    • one prompt yields multiple generations (forming one group)
      • each generation's loss = -advantage (likelihood ratio = 1, since $\pi_\theta=\pi_{\theta_{old}}$)
    • a group's mean loss = - mean advantage = 0
    • note that the $\mathcal J$ in the figure is written for gradient ascent; as losses, the sums are missing a leading minus sign
  • where the KL term is placed

    • inside the reward used when computing the advantage
    • or outside, directly in the loss
    • the original GRPO formulation places it outside;
      • the GRPO implementation does not include the KL-divergence as part of the reward function. Instead, it directly incorporates the KL-divergence into the loss function, arguing that this approach simplifies the computation and avoids unnecessary complexity.

(screenshot of the GRPO objective omitted; the formula is reproduced below)
$$\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{q \sim P(Q), \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}}(O|q)} \left[ \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min \left( \frac{\pi_\theta(o_{i,t}|q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q, o_{i,<t})} \hat{A}_{i,t}, \text{clip} \left( \frac{\pi_\theta(o_{i,t}|q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q, o_{i,<t})}, 1-\varepsilon, 1+\varepsilon \right) \hat{A}_{i,t} \right) - \beta D_{KL} (\pi_\theta \| \pi_{ref}) \right]$$

  • first averaging the losses by token within each sample and then aggregating the losses across samples.
    • each sample is assigned an equal weight in the final loss computation
    • compare with DAPO's equation (12)
  • If you are using the GRPO trainer then the old policy is in effect updated every step, this means you just use a detached version of the current policy.
    • the $\pi_{\theta_{old}}$ in the formula is the detached version of $\pi_\theta$ (outside the computation graph, treated as a constant);
    • $r=\frac{\pi_\theta}{\pi_{\theta_{old}}}=1$,
    • $\text{clip}(1, 1-\epsilon, 1+\epsilon)=1$
  • $\hat A_{i,t}=\tilde r_i=\frac{r_i-\mu}{\sigma}$ (z-score) (the token-level adv is the group z-score of the output-level reward)

$$\begin{split} \mathcal{J}_{GRPO}(\theta)&= \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min \left( \frac{\pi_\theta(o_{i,t}|q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q, o_{i,<t})} \hat{A}_{i,t}, \text{clip} \left( \frac{\pi_\theta(o_{i,t}|q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q, o_{i,<t})}, 1-\varepsilon, 1+\varepsilon \right) \hat{A}_{i,t} \right) - \beta D_{KL} (\pi_\theta \| \pi_{ref}) \\ &=\frac1G\sum_i^G\frac1{|o_i|}\sum_{t=1}^{|o_i|}\hat A_{i,t} -\frac1G\sum_{i=1}^G\frac1{|o_i|}\sum_{t=1}^{|o_i|}\beta D_{KL}[\pi_\theta\|\pi_{ref}]\\ &=\frac1G\sum_i^G\frac1{|o_i|}\sum_{t=1}^{|o_i|}\hat A_i -\frac1G\sum_{i=1}^G\frac1{|o_i|}\sum_{t=1}^{|o_i|}\beta D_{KL}[\pi_\theta\|\pi_{ref}]\\ &=\frac1G\sum_i^G\frac1{|o_i|}\,|o_i|\cdot \hat A_i -\frac1G\sum_{i=1}^G\frac1{|o_i|}\sum_{t=1}^{|o_i|}\beta D_{KL}[\pi_\theta\|\pi_{ref}]\\ &=\frac1G\sum_i^G\hat A_i-\frac1G\sum_{i=1}^G\frac1{|o_i|}\sum_{t=1}^{|o_i|}\beta D_{KL}[\pi_\theta\|\pi_{ref}]\\ &=\frac1G\sum_i^G\frac{r_i-\mu}{\sigma}-\frac1G\sum_{i=1}^G\frac1{|o_i|}\sum_{t=1}^{|o_i|}\beta D_{KL}[\pi_\theta\|\pi_{ref}]\\ &=\frac{\sum_i r_i-G\mu}{G\sigma}-\frac1G\sum_{i=1}^G\frac1{|o_i|}\sum_{t=1}^{|o_i|}\beta D_{KL}[\pi_\theta\|\pi_{ref}]\\ &= 0 -\frac1G\sum_{i=1}^G\frac1{|o_i|}\sum_{t=1}^{|o_i|}\beta D_{KL}[\pi_\theta\|\pi_{ref}]\\ &=-\frac1G\sum_{i=1}^G\frac1{|o_i|}\sum_{t=1}^{|o_i|}\beta D_{KL}[\pi_\theta\|\pi_{ref}] \end{split}$$

So the coefficient in front of the advantage ($\pi/\pi$) evaluates numerically to 1, and the objective above reduces to $-\beta\,KL$ (i.e. the reported loss is just the $\beta\,KL$ term) — this conclusion is important.

A growing KL divergence means the model is exploring directions that deviate from its original behavior in order to improve, a bit like jumping out of a local optimum in simulated annealing. This can be a good thing, but the KL should not get too large.

One more note on how the policy KL divergence is computed (eq. 4 in the DeepSeekMath paper):

$$\mathbb{D}_{KL}[\pi_\theta\|\pi_{ref}]=\frac{\pi_{ref}(o_{i,t}|q,o_{i,<t})}{\pi_\theta(o_{i,t}|q,o_{i,<t})}-\log\frac{\pi_{ref}(o_{i,t}|q,o_{i,<t})}{\pi_\theta(o_{i,t}|q,o_{i,<t})}-1$$
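
A minimal sketch of this per-token KL estimator, assuming the per-token log-probs of the sampled tokens are already gathered (names illustrative):

import torch

def k3_kl(logprobs: torch.Tensor, ref_logprobs: torch.Tensor) -> torch.Tensor:
    # per-token log(pi_ref / pi_theta)
    log_ratio = ref_logprobs - logprobs
    # estimator: exp(log_ratio) - log_ratio - 1, which is always >= 0
    return torch.exp(log_ratio) - log_ratio - 1

# placeholder per-token log-probs, shape [batch, seq_len]
logprobs = torch.log(torch.rand(2, 5))
ref_logprobs = torch.log(torch.rand(2, 5))
print(k3_kl(logprobs, ref_logprobs).mean())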

What about GRPO's gradient?

Recall the identity commonly used in policy gradients:
$$\nabla f(x)=f(x)\nabla\log f(x)$$

This identity obviously holds, since $\nabla \log f(x)=\frac{\nabla f(x)}{f(x)}$; writing it this way is purely for convenience in implementation and in derivations.

(screenshot omitted)

  • For example for GRPO, if all outputs $\{o_i\}_{i=1}^G$ of a particular prompt are correct and receive the same reward 1, the resulting advantage for this group is zero. A zero advantage results in no gradients for policy
    updates, thereby reducing sample efficiency. (A property of GRPO: when the advantage is 0, i.e. the group is all correct or all wrong, no effective gradient is produced and that round makes no update at all; one of DAPO's innovations below addresses exactly this.)
  • The DeepSeekMath discussion section argues that all PG losses can be unified under a single paradigm, whether PPO, GRPO, or whatever comes next.

Token-level PG loss (i.e. DAPO)

  • grpo: generation-level loss; dapo: token-level pg loss (see the sketch after this list)
    • grpo: first average within each generation, then average at the group level
    • dapo: average over all tokens of all generations in the group
  • ga (gradient accumulation)
    • https://unsloth.ai/blog/gradient
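
A minimal sketch (placeholder losses and masks; not any framework's actual code) contrasting the two aggregation schemes:

import torch

def grpo_agg(tok_loss, mask):
    # generation-level: mean over each generation's tokens, then mean over the group
    per_sample = (tok_loss * mask).sum(-1) / mask.sum(-1)
    return per_sample.mean()

def dapo_agg(tok_loss, mask):
    # token-level: mean over all valid tokens of the whole group at once
    return (tok_loss * mask).sum() / mask.sum()

tok_loss = torch.randn(4, 8)             # [group_size, max_len], placeholder per-token losses
lengths = torch.tensor([8, 3, 5, 2])     # generations of unequal length
mask = (torch.arange(8)[None, :] < lengths[:, None]).float()
print(grpo_agg(tok_loss, mask), dapo_agg(tok_loss, mask))   # the two generally differ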

One of GRPO's biases:

  • if the advantage > 0 (the model answered correctly), it favors shorter answers
  • if the advantage < 0 (the model answered incorrectly), it favors longer answers
  • questions with a lower reward standard deviation get higher weight during updates; low std means an easy question (9 out of 10 rollouts correct), high std means unstable answers. This is also undesirable, since the hard questions are the more important ones.

2 On Dr. GRPO

$$A_i=R_i-\frac1N\sum_{j=1}^N R_j$$

  • let $R_i=\theta+\epsilon_i$; substituting into the above gives
    • $A_i=\theta+\epsilon_i-\frac1N\sum_j (\theta+\epsilon_j)=\epsilon_i-\frac1N\sum_j \epsilon_j$

$$\begin{split} \mathbb E[A_i|\epsilon_i]&=\mathbb E \left[\epsilon_i - \frac1N\sum_j\epsilon_j \,\Big|\, \epsilon_i\right]\\ &=\epsilon_i - \frac1N\epsilon_i-\frac1N\sum_{j\neq i} 0\\ &=\frac{N-1}N\epsilon_i \end{split}$$

3 per_device_train_batch_size & num_generations

https://github.com/huggingface/trl/pull/2776

  • (num_processes * per_device_batch_size) must be divisible by G.

    • per_device_batch_size describes the number of generations per GPU device
    • num_processes is the number of GPU processes;
    • num_processes * per_device_batch_size / G: prompt throughput (see the sketch below)
  • https://github.com/huggingface/trl/blob/main/trl/trainer/grpo_trainer.py#L571-L598

    • ensures each prompt is repeated across multiple processes. This guarantees that identical prompts are distributed to different GPUs, allowing rewards to be computed and normalized correctly within each prompt group. Using the same seed across processes ensures consistent prompt assignment, preventing discrepancies in group formation.
    • repeats the batch multiple times to allow reusing generations across multiple updates. Refer to _prepare_inputs to see how the generations are stored and reused.
    • In the following figure, the values are the prompt indices. The first row shows the first sampled batch, the
      second row shows the second sampled batch, and so on.
    • 3 GPUs, num_generations = 3, per_device_train_batch_size = 4
      • 3*4 / 3 = 4 prompts per step
             GPU0   GPU1   GPU2
        P0   P00    P01    P02
        P1   P10    P11    P12
        P2   P20    P21    P22
        P3   P30    P31    P32
    • it additionally accounts for grad_accum = 3: batch forwards are accumulated and a single backward is done
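
A tiny arithmetic sketch of the divisibility constraint above, using the numbers from this example:

num_processes = 3                 # GPUs
per_device_train_batch_size = 4   # generations held per GPU
num_generations = 3               # G: generations per prompt (group size)

global_batch = num_processes * per_device_train_batch_size   # 12 generations per step
assert global_batch % num_generations == 0                    # must be divisible by G
prompts_per_step = global_batch // num_generations            # 4 unique prompts (P0..P3)
print(prompts_per_step)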

For now there are still plenty of points of contention; in the end, people will simply use whichever variant works.


11 [RL4LLM] The deepseek v3 tool-calling bug and understanding function calling in the chat_template

video: https://www.bilibili.com/video/BV1dsdWYuEXw
code: https://github.com/chunhuizhang/llm_rl/blob/main/tutorials/tokenizer/v3.ipynb

The issue linked below describes a bug where v3 repeatedly calls tools: in a weather-tool example, even when the messages already mark the tool as having been called, v3 still asks for the weather again.

  • https://github.com/deepseek-ai/DeepSeek-V3/issues/15
  • deepseek v3 (0324): “Increased accuracy in Function Calling, fixing issues from previous V3 versions”
    • https://huggingface.co/deepseek-ai/DeepSeek-V3-0324
    • repetitive function call
  • understanding tool use / function calling from the token / chat_template perspective, both for use (inference) and for training
    • System prompt: which tools are available, what their parameters are, ...
    • User prompt: What's the weather like today in New York?
    • <tool>get_current_temperature(location='New York, NY', format='F')</tool><output>73 degrees Fahrenheit</output>

This is an example of tool use; training data is constructed in essentially the same way.

(screenshot of the tool-use example omitted)

from transformers import AutoTokenizer
import re
import torch

model_id = 'deepseek-ai/DeepSeek-V3'
model_id_0324 = 'deepseek-ai/DeepSeek-V3-0324'

T1 = AutoTokenizer.from_pretrained(model_id)
T2 = AutoTokenizer.from_pretrained(model_id_0324)

Note: v3-0324 is the improved version; the official docs say it fixes the repeated tool-call bug.

In other words, in the code below T1 is the buggy version and T2 is the fixed one; we compare them to see where the improvement is.

v3 chat template

T1.chat_template is shown below:

{# 设置默认变量 #}
{% if add_generation_prompt is not defined %}{% set add_generation_prompt = false %}
{% endif %}{# 定义命名空间变量 #}
{% set ns = namespace(is_first=false,is_tool=false,is_output_first=true,system_prompt='',is_first_sp=true
) %}{# 拼接 system prompt #}
{% for message in messages %}{% if message['role'] == 'system' %}{% if ns.is_first_sp %}{% set ns.system_prompt = ns.system_prompt + message['content'] %}{% set ns.is_first_sp = false %}{% else %}{% set ns.system_prompt = ns.system_prompt + '\n' + message['content'] %}{% endif %}{% endif %}
{% endfor %}{{ bos_token }}{{ ns.system_prompt }}{# 遍历消息内容 #}
{% for message in messages %}{# 用户消息处理 #}{% if message['role'] == 'user' %}{% set ns.is_tool = false %}{{ '<|User|>' + message['content'] }}{# 助手消息(带工具调用) #}{% elif message['role'] == 'assistant' and message['content'] is none %}{% set ns.is_tool = false %}{% for tool in message['tool_calls'] %}{% if not ns.is_first %}{{ '<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n```json\n' + tool['function']['arguments'] + '\n```<|tool▁call▁end|>' }}{% set ns.is_first = true %}{% else %}{{ '\n<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n```json\n' + tool['function']['arguments'] + '\n```<|tool▁call▁end|>' }}{{ '<|tool▁calls▁end|><|end▁of▁sentence|>' }}{% endif %}{% endfor %}{# 助手正常回复内容 #}{% elif message['role'] == 'assistant' and message['content'] is not none %}{% if ns.is_tool %}{{ '<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>' }}{% set ns.is_tool = false %}{% else %}{{ '<|Assistant|>' + message['content'] + '<|end▁of▁sentence|>' }}{% endif %}{# 工具输出处理 #}{% elif message['role'] == 'tool' %}{% set ns.is_tool = true %}{% if ns.is_output_first %}{{ '<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>' }}{% set ns.is_output_first = false %}{% else %}{{ '\n<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>' }}{% endif %}{% endif %}{% endfor %}{# 工具输出结尾处理 #}
{% if ns.is_tool %}{{ '<|tool▁outputs▁end|>' }}
{% endif %}{# 生成助手响应起始标记 #}
{% if add_generation_prompt and not ns.is_tool %}{{ '<|Assistant|>' }}
{% endif %}

Expressed as a simplified flow diagram:

Initialize variables
│
├── Collect the system prompt
│
├── Iterate over messages:
│   ├── system → append to the system prompt
│   ├── user → add <|User|>
│   ├── assistant:
│   │   ├── if it calls a tool → emit a tool_call block
│   │   └── otherwise → add <|Assistant|>
│   └── tool → emit a tool_output block
│
└── Finally, decide whether a trailing <|Assistant|> is needed

The assistant's reply text lives inside the <|Assistant|> tag, while the tool results appear between <|tool▁outputs▁begin|>...<|tool▁outputs▁end|>. The 0324 version adds a tool_call block, so the tool call itself becomes part of the assistant turn; the flow diagram below makes this clear.

v3-0324 chat template

Now look at T2.chat_template:

{# 设置默认值 #}
{% if add_generation_prompt is not defined %}{% set add_generation_prompt = false %}
{% endif %}{# 初始化状态变量 #}
{% set ns = namespace(is_first=false,is_tool=false,is_output_first=true,system_prompt='',is_first_sp=true,is_last_user=false
) %}{# 拼接所有 system prompt #}
{% for message in messages %}{% if message['role'] == 'system' %}{% if ns.is_first_sp %}{% set ns.system_prompt = ns.system_prompt + message['content'] %}{% set ns.is_first_sp = false %}{% else %}{% set ns.system_prompt = ns.system_prompt + '\n' + message['content'] %}{% endif %}{% endif %}
{% endfor %}{{ bos_token }}{{ ns.system_prompt }}{# 遍历所有消息 #}
{% for message in messages %}{# 处理用户消息 #}{% if message['role'] == 'user' %}{% set ns.is_tool = false %}{% set ns.is_first = false %}{% set ns.is_last_user = true %}{{ '<|User|>' + message['content'] + '<|Assistant|>' }}{# 处理 Assistant 调用工具的情况 #}{% elif message['role'] == 'assistant' and message['tool_calls'] is defined and message['tool_calls'] is not none %}{% set ns.is_last_user = false %}{% if ns.is_tool %}{{ '<|tool▁outputs▁end|>' }}{% endif %}{% set ns.is_first = false %}{% set ns.is_tool = false %}{% set ns.is_output_first = true %}{% for tool in message['tool_calls'] %}{% set tool_call_str = '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n```json\n' + tool['function']['arguments'] + '\n```<|tool▁call▁end|>' %}{% if not ns.is_first %}{% if message['content'] is none %}{{ '<|tool▁calls▁begin|>' + tool_call_str }}{% else %}{{ message['content'] + '<|tool▁calls▁begin|>' + tool_call_str }}{% endif %}{% set ns.is_first = true %}{% else %}{{ '\n' + tool_call_str }}{% endif %}{% endfor %}{{ '<|tool▁calls▁end|><|end▁of▁sentence|>' }}{# Assistant 正常回复内容(无工具调用) #}{% elif message['role'] == 'assistant' %}{% set ns.is_last_user = false %}{% if ns.is_tool %}{{ '<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>' }}{% set ns.is_tool = false %}{% else %}{{ message['content'] + '<|end▁of▁sentence|>' }}{% endif %}{# 工具的输出内容 #}{% elif message['role'] == 'tool' %}{% set ns.is_last_user = false %}{% set ns.is_tool = true %}{% if ns.is_output_first %}{{ '<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>' }}{% set ns.is_output_first = false %}{% else %}{{ '\n<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>' }}{% endif %}{% endif %}{% endfor %}{# 如果有残留的 tool 输出状态,则收尾结束 #}
{% if ns.is_tool %}{{ '<|tool▁outputs▁end|>' }}
{% endif %}{# 最终是否生成 Assistant 提示起始符 #}
{% if add_generation_prompt and not ns.is_last_user and not ns.is_tool %}{{ '<|Assistant|>' }}
{% endif %}
Initialize variables (adds is_last_user, etc.)
│
├── Collect the system prompt
│
├── Iterate over messages:
│   ├── system → append to the system prompt
│   ├── user → add <|User|> + content + <|Assistant|>, set is_last_user=True
│   ├── assistant:
│   │   ├── if there is a tool_call:
│   │   │   └── check whether content is present (finer-grained handling)
│   │   └── plain content → emit the content
│   └── tool:
│       └── chain multiple tool_outputs and close them properly
│
└── If add_generation_prompt is set, the last message is not a user turn, and no tool output is pending → add <|Assistant|>

`apply_chat_template`

Set up a list of messages:

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather in Paris?"},
    {
        "role": "assistant",
        # "content": "Let me check the weather for you.",
        "content": "",
        "tool_calls": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": '{ "location": "Paris" }'
                }
            }
        ]
    },
    {
        "role": "tool",
        "content": '{ "temperature": "15C", "condition": "Sunny" }',
        "tool_call_id": "call_1"
    },
    {"role": "assistant", "content": "It's 15°C and sunny in Paris right now."}
]

Then call T1: T1.apply_chat_template(messages, tokenize=False)

Output:

<|begin▁of▁sentence|>You are a helpful assistant.
<|User|>What\'s the weather in Paris?
<|Assistant|>Let me check the weather for you.<|end▁of▁sentence|>
<|tool▁outputs▁begin|><|tool▁output▁begin|>{ "temperature": "15C", "condition": "Sunny" }<|tool▁output▁end|>
<|tool▁outputs▁end|>It\'s 15°C and sunny in Paris right now.<|end▁of▁sentence|>

Note that T1 only emits tool_outputs, with no tool_call, whereas T2 below additionally emits the tool_call block — this is exactly why v3 suffered from the repeated-tool-call bug.

Likewise: T2.apply_chat_template(messages, tokenize=False)

Output:

<|begin▁of▁sentence|>You are a helpful assistant.
<|User|>What\'s the weather in Paris?
<|Assistant|>Let me check the weather for you.
<|tool▁calls▁begin|><|tool▁call▁begin|>function<|tool▁sep|>get_weather\n```json\n{ "location": "Paris" }\n```<|tool▁call▁end|>
<|tool▁calls▁end|><|end▁of▁sentence|>
<|tool▁outputs▁begin|><|tool▁output▁begin|>{ "temperature": "15C", "condition": "Sunny" }<|tool▁output▁end|>
<|tool▁outputs▁end|>It\'s 15°C and sunny in Paris right now.<|end▁of▁sentence|>
  • Two highlights
    • the v3 chat template drops the tool_call part when parsing messages
    • tool_call and tool_output belong together and are emitted as part of the <|Assistant|> output

(screenshot omitted)

Experiments show that even when the assistant message's content is set to the empty string, T1's apply_chat_template output still contains no tool_call — it is still broken.

Looking back at this part of the v3 chat template:

  {# 助手消息(带工具调用) #}{% elif message['role'] == 'assistant' and message['content'] is none %}{% set ns.is_tool = false %}{% for tool in message['tool_calls'] %}{% if not ns.is_first %}{{ '<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n```json\n' + tool['function']['arguments'] + '\n```<|tool▁call▁end|>' }}{% set ns.is_first = true %}{% else %}{{ '\n<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n```json\n' + tool['function']['arguments'] + '\n```<|tool▁call▁end|>' }}{{ '<|tool▁calls▁end|><|end▁of▁sentence|>' }}{% endif %}{% endfor %}{# 助手正常回复内容 #}{% elif message['role'] == 'assistant' and message['content'] is not none %}{% if ns.is_tool %}{{ '<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>' }}{% set ns.is_tool = false %}{% else %}{{ '<|Assistant|>' + message['content'] + '<|end▁of▁sentence|>' }}{% endif %}

In other words, the tool_call branch only fires when message['content'] is none; an empty string "" is not none, so the run falls into the "content is not none" branch and the tool_call block is never emitted — this conditional branching is what makes the template so odd.
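
A quick Jinja2 check (a standalone sketch mirroring the template's condition, not the actual v3 template) of why an empty string bypasses the tool_call branch:

from jinja2 import Template

t = Template("{% if c is none %}TOOL_CALL_BRANCH{% else %}CONTENT_BRANCH{% endif %}")
print(t.render(c=None))  # TOOL_CALL_BRANCH
print(t.render(c=""))    # CONTENT_BRANCH -> an empty string still skips the tool_call branch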


苍穹外卖总结

苍穹外卖学习知识点 整体概括: 学到目前(day10),总体最核心的部分就是CURD各种数据,因为一些接口,前端页面都已经设计好,在实际开发中也应该是这样,重点是在每个不同的业务板块区别出细微不同的业务逻辑 Swagger注解 swagger是一种自动生成接口文档的插件 使用注解,就可以…...

python学智能算法(九)|决策树深入理解

【1】引言 前序学习进程中&#xff0c;初步理解了决策树的各个组成部分&#xff0c;此时将对决策树做整体解读&#xff0c;以期实现深入理解。 各个部分的解读文章链接为&#xff1a; python学智能算法&#xff08;八&#xff09;|决策树-CSDN博客 【2】代码 【2.1】完整代…...

HTTP代理:内容分发战场上的「隐形指挥官」

目录 一、技术本质&#xff1a;流量博弈中的「规则改写者」 二、战略价值&#xff1a;内容分发的「四维升级」 三、实战案例&#xff1a;代理技术的「降维打击」 四、未来进化&#xff1a;代理技术的「认知升级」 五、结语&#xff1a;代理技术的「战略觉醒」 在数字内容爆…...

学习笔记(C++篇)--- Day2

1.类的定义 1.1 类的格式 ①class为类的关键字 ②在类的内容中还可以写函数&#xff0c;具体格式请看示例。 ③为了区分成员变量&#xff0c;一般习惯上成员变量会加一个特殊标识&#xff08;如成员变量前面或者后面加_ 或者 m开头&#xff0c;注意C中这个并不是强制的&#x…...

下载firefox.tar.xz后如何将其加入到Gnome启动器

起因&#xff1a;近期&#xff08;2025-04-07&#xff09;发现firefox公布了130.0 版本&#xff0c;可以对pdf文档进行签名了&#xff0c;想试一下&#xff0c;所以卸载了我的Debian12上的firefox-esr,直接下载了新版本的tar.xz 包。 经过一番摸索&#xff0c;实现了将其加入Gn…...

VSCode英文翻译插件:变量命名、翻单词、翻句子

目录 【var-translate】 【Google Translate】 【code-translator】 【其他插件】 【var-translate】 非常推荐&#xff0c;可以提供小驼峰、大驼峰、下划线、中划线、常量等翻译&#xff0c;Windows下快捷键为Ctrl Shift v 可以整句英文翻译&#xff0c;并且支持多个免费…...

快速高效的MCP Severs

通用AI Agent的瓶颈 最近一直在用MCP协议开发通用智能体。 虽然大模型本身请求比较慢&#xff0c;但是还可以接受。 而最让人沮丧的是&#xff0c;工具效率也不高 比如社区的filesystem&#xff0c;每次只能创建一个目录&#xff0c;生成文件时&#xff0c;如果目录不存在&…...

原子化 CSS 的常见实现框架

原子化 CSS 是一种 CSS 架构方法&#xff0c;其核心思想是将样式拆分为最小粒度的单一功能类&#xff0c;每个类仅对应一个具体的样式属性&#xff08;如颜色、边距、字体大小等&#xff09;&#xff0c;通过组合这些类来构建复杂的界面。这种方式强调代码复用性、维护性和灵活…...

技术速递|使用 GitHub Copilot Agent Mode 进行编程

作者&#xff1a;卢建晖 - 微软高级云技术布道师 翻译/排版&#xff1a;Alan Wang GitHub Copilot 持续发展&#xff0c;从最初的代码补全、生成、优化功能&#xff0c;到通过对话交互提升 AI 代码质量的 GitHub Copilot Chat&#xff0c;再到能够基于项目中多个文件的关联进行…...

Linux系统(Ubuntu和树莓派)的远程操作练习

目录 实验准备一、Ubuntu 下的远程操作二、树莓派下的远程操作三、思考 实验准备 ​ 1.双方应保证处于同一个局域网内 ​ 2.关闭防火墙 (否则别人将不能 ping 通自己,具体说明请参考&#xff1a;windows-关闭防火墙&#xff09; ​ 3.配置虚拟机 ​ a.网桥模式配置 ​ 查询…...

电脑屏保壁纸怎么设置 桌面壁纸设置方法详解

电脑桌面壁纸作为我们每天面对的第一视觉元素&#xff0c;不仅能够彰显个人品味&#xff0c;还能营造舒适的工作或娱乐氛围。电脑桌面壁纸怎么设置呢&#xff1f;下面本文将为大家介绍Windows和macOS两大主流操作系统中设置电脑桌面壁纸的方法&#xff0c;帮助大家快速设置个性…...

为什么选择Redis?解析核心使用场景与性能优化技巧

解析核心使用场景与性能优化技巧 redis只能能操作字符串&#xff0c;要把Java对象存入redis非关系型数据库&#xff0c;需要用序列化变成字符串&#xff0c;再反序列化成Java对象 not only sql NoSQL非关系型数据库&#xff1a;缓存数据库&#xff0c;只能读取数据&#xff0…...

Docker中Redis修改密码失效

docker容器中&#xff0c;我们通过docker run命令运行某一容器 这里&#xff0c;我们通过以下命令来进行运行【注意&#xff0c;这里有两个关键点&#xff1a;-d 和--requirepass】 docker run \ --restartalways \ --log-opt max-size100m \ --log-opt max-file2 \ -p 6379:6…...

质数质数筛

1.试除法判定质数–O(sqrt(N)) bool is_prime(int x) {if (x < 2) return false;for (int i 2; i < x / i; i )if (x % i 0)return false;return true; }2.试除法分解质因数–O(logN)~O(sqrt(N)) void divide(int x) {for (int i 2; i < x / i; i )if (x % i …...

VGA接口设计

1.VGA简介 VGA(Video Graphics Array)视频图形阵列接口是一种模拟信号视频传输标准,用于连接计算机主机和显示设备,如显示器、投影仪等。 VGA接口能够传输红、绿、蓝三原色的模拟信号以及同步信号(数字信号),实现计算机图形和视频信号的输出和显示。 尽管数字化显示接口…...

clickhouse注入手法总结

clickhouse 遇到一题clickhouse注入相关的&#xff0c;没有见过&#xff0c;于是来学习clickhouse的使用&#xff0c;并总结相关注入手法。 环境搭建 直接在docker运行 docker pull clickhouse/clickhouse-server docker run -d --name some-clickhouse-server --ulimit n…...

VsCode保存时删除无用的引用

打开设置文件 教程&#xff1a;打开VsCode设置设置里添加 {"editor.codeActionsOnSave": {"source.organizeImports": false, // 禁用默认的整理导入"source.removeUnusedImports": true // 仅删除未使用的导入} }...

轻松Linux-4.进程概念

屋漏偏逢连夜雨&#xff0c;今天就学Linux 话不多说&#xff0c;展示军火 1.认识冯诺依曼体系 冯诺依曼体系其实并不是什么稀罕的东西&#xff0c;我们生活中的笔记本、服务器、计算机等等大多都遵守冯诺依曼体系 非常经典的一张图 我们所认识的计算机&#xff0c;是由一个个…...

畅游Diffusion数字人(21):基于Wan2.1的音频驱动数字人FantasyTalking

畅游Diffusion数字人(0)&#xff1a;专栏文章导航 前言&#xff1a;AI数字人是目前视觉AIGC最有希望大规模落地的场景之一。现阶段的商业工具&#xff0c;如字节的OminiHuman-1(即梦大师版)、快手的可灵对口型&#xff0c;虽然效果不错&#xff0c;但是收费昂贵。而开源解决方案…...

CentOS禁用nouveau驱动

1、验证 nouveau 是否在运行 lsmod | grep nouveau如果命令返回结果&#xff0c;说明 nouveau 驱动正在运行。 2、编辑黑名单文件 通过编辑黑名单配置文件来禁用 nouveau 驱动&#xff0c;这样在系统启动时不会加载它。 vi /etc/modprobe.d/blacklist-nouveau.conf修改以下…...

《Operating System Concepts》阅读笔记:p587-p596

《Operating System Concepts》学习第 52 天&#xff0c;p587-p596 总结&#xff0c;总计 10 页。 一、技术总结 1.Recovery (1)consistency checking consistency checking 工具&#xff1a;fsck。 (2)log-structure file system (3)WAFL file system 2.Veritas (1)Ve…...

k8s 1.24.17版本部署(使用Flannel插件)

1.k8s集群环境准备 推荐阅读: https://kubernetes.io/zh/docs/setup/production-environment/tools/kubeadm/install-kubeadm/ 1.1 环境准备 环境准备:硬件配置: 2core 4GB磁盘: 50GB操作系统: Ubuntu 22.04.04 LTSIP和主机名:10.0.0.231 master23110.0.0.232 worker23210.0…...

通信协议详解(十):PSI5 —— 汽车安全传感器的“抗干扰狙击手”

一、PSI5是什么&#xff1f; 一句话秒懂 PSI5就像传感器界的“防弹信使”&#xff1a;在汽车安全系统&#xff08;如气囊&#xff09;中&#xff0c;用两根线同时完成供电数据传输&#xff0c;即便车祸时线路受损&#xff0c;仍能确保关键信号准确送达&#xff01; 基础概念…...

Kafka生产者和消费者:数据管道的核心引擎与智能终端

在分布式系统中&#xff0c;数据的高效流动如同人体的血液循环&#xff0c;而Kafka的生产者&#xff08;Producer&#xff09;与消费者&#xff08;Consumer&#xff09;正是驱动这一循环的核心组件。它们不仅是Kafka客户端的基本形态&#xff0c;更是构建实时数据生态的基石。…...

特权FPGA之按键消抖

完整代码如下所示&#xff1a; timescale 1ns / 1ps// Company: // Engineer: 特权 // // Create Date: // Design Name: // Module Name: // Project Name: // Target Device: // Tool versions: // Description: // // Dependencies: // // Revision: // …...

实时比分更新系统的搭建

搭建一个实时比分更新系统需要考虑多个技术环节&#xff0c;以下是一个完整的实现方案&#xff1a; 一、系统架构 1.数据获取层 比分数据API接入&#xff08;如熊猫比分、API-Football等&#xff09; 网络爬虫&#xff08;作为备用数据源&#xff09; 2.数据处理层 …...

【Linux】线程的概念与控制

目录 1. 整体学习思维导图 2. 线程的概念 2.1 基础概念 2.2 Linux下的线程 初步理解&#xff1a; 2. 分页式存储 3.1 页表 3.1.1 页框/页 3.1.2 页表机制 3.1.3 从虚拟地址到物理地址的转换 总结&#xff1a; 3.2 TLB快表 3.3 缺页异常&#xff08;Page Fault&am…...