
[LLM in Practice] Fine-tuning the QwQ-32B reasoning model with the ms-swift framework

Source: original | 2025/8/15 13:24:45

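1. Background

We previously covered model fine-tuning in "Some Lessons from Training and Fine-tuning Large Models" and "Distilling DeepSeek-R1 Data into ChatGLM32B to Give the Model Reasoning Ability". While base models are still not strong enough on their own, fine-tuning remains indispensable for commercial and vertical-domain applications; even DeepSeek-R1 or QwQ-32B used off the shelf can hardly guarantee that the requirements of a commercial application are met.

Today we walk through an experiment that uses the ms-swift framework to LoRA fine-tune Alibaba's recently open-sourced QwQ-32B reasoning model.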

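2. Fine-tuning approach

ms-swift is the fine-tuning and deployment framework for large language models and multimodal large models provided by ModelScope. It supports training (pre-training, fine-tuning, human alignment), inference, evaluation, quantization, and deployment for both. Supported language models include Qwen2.5, GLM4, Llama3.3, and DeepSeek-R1; supported multimodal models include Qwen2.5-VL, Qwen2-Audio, Llama3.2-Vision, Llava, GLM4v, Xcomposer2.5, and DeepSeek-VL2.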

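The QwQ-32B weights take about 62 GB. For full-parameter fine-tuning, a conservative estimate is about 20 times that in GPU memory, i.e. roughly 1240 GB. Even an 8-GPU A800 cluster offers only about 640 GB, which is not enough, so to keep the project practical we chose LoRA fine-tuning. In the end we completed the LoRA training on 4 A800 GPUs.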
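The arithmetic behind this estimate can be reproduced in a few lines. This is only a rough sketch: the 20x multiplier is the article's conservative rule of thumb rather than a measured value, the 80 GB per A800 is what the "8 cards, about 640 GB" figure implies, and the function name is made up for illustration.

```python
def full_finetune_memory_gb(weights_gb: float, multiplier: float = 20.0) -> float:
    """Rough GPU memory needed for full-parameter fine-tuning.

    `multiplier` is the article's conservative rule of thumb, not a measurement.
    """
    return weights_gb * multiplier


WEIGHTS_GB = 62       # approximate size of the QwQ-32B weights (from the article)
A800_MEMORY_GB = 80   # per-GPU memory implied by "8 cards, about 640 GB"

required_gb = full_finetune_memory_gb(WEIGHTS_GB)
available_gb = 8 * A800_MEMORY_GB

print(f"required: about {required_gb:.0f} GB")
print(f"available on 8 x A800: {available_gb} GB")
print("full fine-tuning fits:", required_gb <= available_gb)
```

Under these assumptions the requirement is roughly twice what an 8-GPU node offers, which is why the article falls back to LoRA.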

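Aside: the ms-swift documentation is incomplete, which is a weak point of the project.

3. Fine-tuning experiment

3.1 Fine-tuning data format

The format is shown below. Note, however, that the data we actually fine-tuned on includes the reasoning process. For a think-style corpus, see "Distilling DeepSeek-R1 Data into ChatGLM32B to Give the Model Reasoning Ability".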

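{"messages": [{"role": "system", "content": "You are a helpful and knowledgeable AI assistant."}, {"role": "user", "content": "Please explain the basic principles of AlphaGo."}, {"role": "assistant", "content": "AlphaGo combines deep neural networks with tree search. It is trained on game records and steadily improves its playing strength, allowing it to evaluate Go positions and choose the best move."}]}
{"messages": [{"role": "system", "content": "You are a helpful and knowledgeable AI assistant."}, {"role": "user", "content": "List some important principles of healthy eating."}, {"role": "assistant", "content": "Important principles of healthy eating include: eat a balanced diet with a reasonable intake of all nutrients; eat plenty of fresh fruit and vegetables and drink enough water; cut down on sugar, oil, and salt, and avoid too much fast food; keep a regular daily routine and exercise in moderation."}]}
{"messages": [{"role": "system", "content": "You are a helpful and knowledgeable AI assistant."}, {"role": "user", "content": "Give some suggestions for improving sleep quality."}, {"role": "assistant", "content": "Suggestions for improving sleep quality include: get enough sleep and avoid sleep deprivation; relax before bed and avoid stimulation; keep a regular schedule and avoid staying up late often; exercise in moderation but do not overdo it; have a mild drink such as a cup of warm milk before bed."}]}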
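Before training, it can help to check that every line of the JSONL file parses and follows this `messages` layout. The sketch below is a hypothetical helper, not part of ms-swift: `validate_record` and `with_reasoning` are names invented here, and the `<think>...</think>` layout is taken from the sample that appears in the training log rather than from any official specification.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(line: str) -> dict:
    """Parse one JSONL line and check the `messages` layout shown above."""
    record = json.loads(line)
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("'messages' must be a non-empty list")
    for message in messages:
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"unexpected role: {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            raise ValueError("'content' must be a string")
    if messages[-1]["role"] != "assistant":
        raise ValueError("the last message should be the assistant reply")
    return record


def with_reasoning(reasoning: str, answer: str) -> str:
    """Assistant content with the reasoning wrapped in <think> tags (assumed layout)."""
    return f"<think>\n{reasoning}\n</think>\n\n{answer}"


sample = {
    "messages": [
        {"role": "system", "content": "You are a helpful and knowledgeable AI assistant."},
        {"role": "user", "content": "Are they the same size?"},
        {
            "role": "assistant",
            "content": with_reasoning(
                "Both model names contain 85, so both are 85 inches.",
                "Both TVs are 85 inches.",
            ),
        },
    ]
}

line = json.dumps(sample, ensure_ascii=False)
checked = validate_record(line)
print(checked["messages"][-1]["content"])
```

Running the same check over every line of the dataset file catches flattened or concatenated records before they reach the trainer.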

3.2 Training script

NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
    --model /data/QwQ-32B \
    --train_type lora \
    --dataset '/data/qwq32b_sft_lora/rl-v0312.jsonl' \
    --torch_dtype bfloat16 \
    --num_train_epochs 5 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 8 \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 5 \
    --logging_steps 5 \
    --max_length 8192 \
    --output_dir /data/qwq32b_sft_lora/output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --model_author swift \
    --model_name swift-robot \
    --deepspeed zero3
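To get a feel for how small the trainable part is with `--lora_rank 8` and `--target_modules all-linear`, the adapter size can be counted by hand. The sketch below is an illustration, not ms-swift code: the layer shapes and the count of 64 decoder layers are copied from the model printout in the training log, and it assumes each adapted layer adds one `lora_A` (in_features x rank) and one `lora_B` (rank x out_features) matrix without bias, as that printout shows.

```python
RANK = 8          # --lora_rank
NUM_LAYERS = 64   # decoder layers, per the model printout in the log

# (in_features, out_features) of every linear layer that receives an adapter,
# copied from the model printout in the training log.
TARGET_SHAPES = {
    "q_proj": (5120, 5120),
    "k_proj": (5120, 1024),
    "v_proj": (5120, 1024),
    "o_proj": (5120, 5120),
    "gate_proj": (5120, 27648),
    "up_proj": (5120, 27648),
    "down_proj": (27648, 5120),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters added by one adapter: lora_A (in x r) plus lora_B (r x out)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
trainable = per_layer * NUM_LAYERS

print(f"per decoder layer: {per_layer:,}")
print(f"trainable total:   {trainable / 1e6:.4f}M")
```

The total comes to 67.1089M, which agrees with the `67.1089M Trainable [0.2044%]` figure that the framework reports in the log.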

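3.3 Training log

The training log clearly shows the loss converging steadily over the course of fine-tuning. The framework also outputs the parameters of the best model checkpoint.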
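The step count and the `learning_rate` values in the log below can be sanity-checked against the flags in the training script. The sketch rests on two assumptions that are not confirmed from the ms-swift source: optimizer steps per epoch are the floor of per-rank micro-batches over the accumulation steps, and the schedule is linear warmup followed by cosine decay. Both happen to reproduce the logged numbers.

```python
import math

TRAIN_ROWS = 2619       # train_dataset num_rows reported in the log
WORLD_SIZE = 4          # NPROC_PER_NODE
PER_DEVICE_BATCH = 1    # --per_device_train_batch_size
GRAD_ACCUM = 8          # --gradient_accumulation_steps
EPOCHS = 5              # --num_train_epochs
PEAK_LR = 1e-4          # --learning_rate
WARMUP_RATIO = 0.05     # --warmup_ratio

# Assumed counting rule: micro-batches per rank, then whole accumulation windows.
batches_per_rank = math.ceil(TRAIN_ROWS / (WORLD_SIZE * PER_DEVICE_BATCH))
steps_per_epoch = max(batches_per_rank // GRAD_ACCUM, 1)
max_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(max_steps * WARMUP_RATIO)


def learning_rate(step: int) -> float:
    """Linear warmup followed by cosine decay (assumed schedule)."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print("effective batch size:", WORLD_SIZE * PER_DEVICE_BATCH * GRAD_ACCUM)
print("max_steps:", max_steps, "| warmup_steps:", warmup_steps)
for step in (1, 5, 10, 20, 25, 50):
    print(f"step {step:>3}: lr = {learning_rate(step):.3e}")
```

With these inputs the sketch gives 405 total steps, matching `global_step/max_steps` in the log, and its learning rates at steps 5, 25, and 50 agree with the logged 2.381e-05, 9.997e-05, and 9.86e-05.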

[2025-03-11 19:28:37,083] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 4
[2025-03-11 19:28:37,084] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 4
[2025-03-11 19:28:37,092] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 4
[2025-03-11 19:28:37,401] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 771, num_elems = 32.76B
Loading checkpoint shards: 100%|██████████| 14/14 [00:16<00:00, 1.17s/it]
Loading checkpoint shards: 100%|██████████| 14/14 [00:16<00:00, 1.17s/it]
Loading checkpoint shards: 100%|██████████| 14/14 [00:16<00:00, 1.17s/it]
Loading checkpoint shards: 100%|██████████| 14/14 [00:17<00:00, 1.22s/it]
[INFO:swift] model_info: ModelInfo(model_type='qwq', model_dir='/data/QwQ-32B', torch_dtype=torch.bfloat16, max_model_len=131072, quant_method=None, quant_bits=None, rope_scaling=None, config=Qwen2Config {
"_name_or_path": "/data/QwQ-32B",
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"hidden_act": "silu",
"hidden_size": 5120,
"initializer_range": 0.02,
"intermediate_size": 27648,
"max_position_embeddings": 131072,
"max_window_layers": 64,
"model_type": "qwen2",
"num_attention_heads": 40,
"num_hidden_layers": 64,
"num_key_value_heads": 8,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"rope_theta": 1000000.0,
"sliding_window": 32768,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.49.0",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 152064
}
, task_type='causal_lm', num_labels=None)
[INFO:swift] model.generation_config: GenerationConfig {
"bos_token_id": 151643,
"eos_token_id": [
151645,
151643
],
"max_new_tokens": 64,
"pad_token_id": 151643
}

[INFO:swift] default_system: None
[INFO:swift] The TrainArguments will be saved in: /data/qwq32b_sft_lora/output/v9-20250311-192834/args.json
[INFO:swift] Start time of running main: 2025-03-11 19:28:54.707260
Map: 100%|██████████| 2645/2645 [00:00<00:00, 8426.21 examples/s]
Map: 100%|██████████| 2645/2645 [00:00<00:00, 7697.28 examples/s]
Map: 100%|██████████| 2645/2645 [00:00<00:00, 6463.52 examples/s]
Map: 0%| | 0/2619 [00:00<?, ? examples/s]
[INFO:swift] create tmp_dir: /.cache/modelscope/hub/tmp/hf_datasets-i15lb3_o
Map: 100%|██████████| 2645/2645 [00:00<00:00, 9980.89 examples/s]
[INFO:swift] train_dataset: Dataset({
features: ['messages'],
num_rows: 2619
})
[INFO:swift] val_dataset: Dataset({
features: ['messages'],
num_rows: 26
})

[INFO:swift] [INPUT] <|im_start|>system
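## Role
You are an AI customer service agent. You are professional, friendly, and polite, and you specialize in shopping guidance for customers.

## Goal
Using the conversation history and the user's current question, identify the dialogue scenario (e.g. product recommendation, product inquiry, out-of-scope request, small talk) and draft a friendly "original reply" for that scenario. Then, following the <Red lines> section, perform a detailed safety review and rewrite of the "original reply", and output the optimized result directly.

## Red lines
- Do not provide any irrelevant information
- Do not provide any phone numbers or website links
- Do not provide any product information that is not in the knowledge base

## Restrictions
- Avoid discussing competitors, specific prices or discounts, product links, product repairs, orders, after-sales service, and logistics.
- Distinguish product models carefully; if the model referred to is ambiguous, proactively confirm it with the user in your reply.
- When product information is involved, you may only reply using content from the <Knowledge base>.
- When the <Knowledge base> is empty or cannot answer the question, politely decline to answer.
- When declining, you must include this sentence: "We suggest you consult the official store's human customer service or visit the official website."
- Do not proactively explain technical terms.
- Do not reveal any reasoning process; answer directly.
- Do not reveal the "original reply" in the answer; output the safety-reviewed, rewritten content directly.

## Suggestions
- Markdown format is recommended when introducing products
- Make good use of follow-up questions to guide the customer to state their needs clearly, but do not ask again about needs already made clear in the conversation history.
<|im_end|>
<|im_start|>user
How is this KD-85X85K different from the 85x90L?<|im_end|>
<|im_start|>assistant
The X85K uses the X1 chip, while the X90L is equipped with the XR cognitive chip, which does better at overall picture optimization, focal-subject enhancement, and audio-picture synchronization. The X90L matches against a picture-quality database through multi-dimensional data analysis and focuses on optimizing the areas viewers pay attention to. If you want better picture quality, we recommend the X90L; if your budget is limited, the X85K is also a good choice.<|im_end|>
<|im_start|>user
Are they the same size?<|im_end|>
<|im_start|>assistant
<think>
Okay, the user is asking "Are they the same size?", so I need to review the conversation history first. Earlier in the conversation, the user asked about the differences between the KD-85X85K and the 85X90L, and the agent mentioned the different chips and the differences in picture-quality optimization. Now the user is following up on whether the sizes are the same.

First, check the product information in the knowledge base. The "85" in the model name KD-85X85K stands for 85 inches, and XR-85X90L also has "85" in its model name, so both should be 85 inches. Checking the descriptions of both in the knowledge base confirms that both are 85-inch TVs. So the answer to the user's question is that the sizes are the same: both are 85 inches. Also, per the red-line rules, no irrelevant information may be provided, so there is no need to bring up other aspects such as price or feature differences unless the user asks further. In addition, make sure the reply contains no links or phone numbers, in line with the safety review requirements. Therefore, simply confirm that the two are the same size, and keep the reply polite and concise.
</think>

Both TVs are the same size: each has an 85-inch large screen~ If you have any other questions, feel free to ask me anytime. We suggest you consult the official store's human customer service or visit the official website for more information.<|im_end|>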

[INFO:swift] lora_config: LoraConfig(task_type='CAUSAL_LM', peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path='/data/QwQ-32B', revision=None, inference_mode=False, r=8, target_modules={'k_proj', 'down_proj', 'q_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj'}, exclude_modules=None, lora_alpha=32, lora_dropout=0.05, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=[], init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={}, eva_config=None, use_dora=False, layer_replication=None, runtime_config=LoraRuntimeConfig(ephemeral_gpu_offload=False), lora_bias=False, lora_dtype=None, lorap_lr_ratio=None, lorap_emb_lr=1e-06)
[INFO:swift] model: PeftModelForCausalLM(
(base_model): LoraModel(
(model): Qwen2ForCausalLM(
(model): Qwen2Model(
(embed_tokens): Embedding(152064, 5120)
(layers): ModuleList(
(0-63): 64 x Qwen2DecoderLayer(
(self_attn): Qwen2Attention(
(q_proj): lora.Linear(
(base_layer): Linear(in_features=5120, out_features=5120, bias=True)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=5120, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=5120, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(k_proj): lora.Linear(
(base_layer): Linear(in_features=5120, out_features=1024, bias=True)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=5120, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=1024, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(v_proj): lora.Linear(
(base_layer): Linear(in_features=5120, out_features=1024, bias=True)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=5120, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=1024, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(o_proj): lora.Linear(
(base_layer): Linear(in_features=5120, out_features=5120, bias=False)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=5120, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=5120, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
)
(mlp): Qwen2MLP(
(gate_proj): lora.Linear(
(base_layer): Linear(in_features=5120, out_features=27648, bias=False)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=5120, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=27648, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(up_proj): lora.Linear(
(base_layer): Linear(in_features=5120, out_features=27648, bias=False)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=5120, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=27648, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(down_proj): lora.Linear(
(base_layer): Linear(in_features=27648, out_features=5120, bias=False)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=27648, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=5120, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(act_fn): SiLU()
)
(input_layernorm): Qwen2RMSNorm((0,), eps=1e-05)
(post_attention_layernorm): Qwen2RMSNorm((0,), eps=1e-05)
)
)
(norm): Qwen2RMSNorm((0,), eps=1e-05)
(rotary_emb): Qwen2RotaryEmbedding()
)
(lm_head): Linear(in_features=5120, out_features=152064, bias=False)
)
)
)
[INFO:swift] model_parameter_info: PeftModelForCausalLM: 32830.9852M Params (67.1089M Trainable [0.2044%]), 0.0001M Buffers.

Parameter Offload: Total persistent parameters: 25760768 in 1025 params
{'loss': 1.32348752, 'token_acc': 0.70985222, 'grad_norm': 0.80846994, 'learning_rate': 4.76e-06, 'memory(GiB)': 60.01, 'train_speed(iter/s)': 0.01743, 'epoch': 0.01, 'global_step/max_steps': '1/405', 'percentage': '0.25%', 'elapsed_time': '53s', 'remaining_time': '6h 2m 10s'}
Train: 0%| | 2/405 [01:46<5:56:12, 53.03s/it]
[2025-03-11 19:32:17,225] [WARNING] [stage3.py:2139:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 1.24938524, 'token_acc': 0.69148486, 'grad_norm': 0.86987531, 'learning_rate': 2.381e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.018535, 'epoch': 0.06, 'global_step/max_steps': '5/405', 'percentage': '1.23%', 'elapsed_time': '4m 26s', 'remaining_time': '5h 54m 54s'}
{'loss': 1.22446156, 'token_acc': 0.69278702, 'grad_norm': 0.77689102, 'learning_rate': 4.762e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.019271, 'epoch': 0.12, 'global_step/max_steps': '10/405', 'percentage': '2.47%', 'elapsed_time': '8m 35s', 'remaining_time': '5h 39m 15s'}
{'loss': 1.13267899, 'token_acc': 0.71570596, 'grad_norm': 0.40197327, 'learning_rate': 7.143e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.01927, 'epoch': 0.18, 'global_step/max_steps': '15/405', 'percentage': '3.70%', 'elapsed_time': '12m 54s', 'remaining_time': '5h 35m 45s'}
{'loss': 0.97332687, 'token_acc': 0.72897148, 'grad_norm': 0.34967286, 'learning_rate': 9.524e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.019607, 'epoch': 0.24, 'global_step/max_steps': '20/405', 'percentage': '4.94%', 'elapsed_time': '16m 56s', 'remaining_time': '5h 26m 7s'}
{'loss': 0.95233335, 'token_acc': 0.71795399, 'grad_norm': 0.32512059, 'learning_rate': 9.997e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.01971, 'epoch': 0.31, 'global_step/max_steps': '25/405', 'percentage': '6.17%', 'elapsed_time': '21m 4s', 'remaining_time': '5h 20m 24s'}
{'loss': 0.92778244, 'token_acc': 0.72106543, 'grad_norm': 0.22549374, 'learning_rate': 9.986e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.019805, 'epoch': 0.37, 'global_step/max_steps': '30/405', 'percentage': '7.41%', 'elapsed_time': '25m 11s', 'remaining_time': '5h 14m 49s'}
{'loss': 0.91093416, 'token_acc': 0.73585944, 'grad_norm': 0.21213417, 'learning_rate': 9.967e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.019825, 'epoch': 0.43, 'global_step/max_steps': '35/405', 'percentage': '8.64%', 'elapsed_time': '29m 21s', 'remaining_time': '5h 10m 25s'}
{'loss': 0.86407394, 'token_acc': 0.73746765, 'grad_norm': 0.22134356, 'learning_rate': 9.94e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.019853, 'epoch': 0.49, 'global_step/max_steps': '40/405', 'percentage': '9.88%', 'elapsed_time': '33m 31s', 'remaining_time': '5h 5m 52s'}
{'loss': 0.86335802, 'token_acc': 0.73666894, 'grad_norm': 0.236291, 'learning_rate': 9.904e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.019929, 'epoch': 0.55, 'global_step/max_steps': '45/405', 'percentage': '11.11%', 'elapsed_time': '37m 34s', 'remaining_time': '5h 0m 35s'}
{'loss': 0.81436214, 'token_acc': 0.76214197, 'grad_norm': 0.19902774, 'learning_rate': 9.86e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.019918, 'epoch': 0.61, 'global_step/max_steps': '50/405', 'percentage': '12.35%', 'elapsed_time': '41m 46s', 'remaining_time': '4h 56m 37s'}
Train: 12%|█▏ | 50/405 [41:46<5:01:46, 51.00s/it]
{'eval_loss': 0.82470703, 'eval_token_acc': 0.75927635, 'eval_runtime': 15.7907, 'eval_samples_per_second': 1.647, 'eval_steps_per_second': 0.443, 'epoch': 0.61, 'global_step/max_steps': '50/405', 'percentage': '12.35%', 'elapsed_time': '42m 2s', 'remaining_time': '4h 58m 30s'}
Val: 100%|██████████| 7/7 [00:13<00:00, 1.87s/it]
[INFO:swift] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-50
***********************************************
[2025-03-11 20:12:35,271] [WARNING] [stage3.py:2139:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.84376278, 'token_acc': 0.74929837, 'grad_norm': 0.29243814, 'learning_rate': 9.808e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019739, 'epoch': 0.67, 'global_step/max_steps': '55/405', 'percentage': '13.58%', 'elapsed_time': '46m 22s', 'remaining_time': '4h 55m 8s'}
{'loss': 0.82531147, 'token_acc': 0.75041408, 'grad_norm': 0.29134859, 'learning_rate': 9.748e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019703, 'epoch': 0.73, 'global_step/max_steps': '60/405', 'percentage': '14.81%', 'elapsed_time': '50m 41s', 'remaining_time': '4h 51m 29s'}
{'loss': 0.8170001, 'token_acc': 0.75919308, 'grad_norm': 0.24516849, 'learning_rate': 9.68e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019724, 'epoch': 0.8, 'global_step/max_steps': '65/405', 'percentage': '16.05%', 'elapsed_time': '54m 51s', 'remaining_time': '4h 46m 59s'}
{'loss': 0.81388254, 'token_acc': 0.75490298, 'grad_norm': 0.28124103, 'learning_rate': 9.604e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019781, 'epoch': 0.86, 'global_step/max_steps': '70/405', 'percentage': '17.28%', 'elapsed_time': '58m 55s', 'remaining_time': '4h 41m 58s'}
{'loss': 0.81019135, 'token_acc': 0.74177519, 'grad_norm': 0.28694744, 'learning_rate': 9.52e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019784, 'epoch': 0.92, 'global_step/max_steps': '75/405', 'percentage': '18.52%', 'elapsed_time': '1h 3m 7s', 'remaining_time': '4h 37m 44s'}
{'loss': 0.76696019, 'token_acc': 0.7639197, 'grad_norm': 0.311834, 'learning_rate': 9.429e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019813, 'epoch': 0.98, 'global_step/max_steps': '80/405', 'percentage': '19.75%', 'elapsed_time': '1h 7m 14s', 'remaining_time': '4h 33m 8s'}
{'loss': 0.76195569, 'token_acc': 0.76973895, 'grad_norm': 0.43021317, 'learning_rate': 9.33e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019856, 'epoch': 1.04, 'global_step/max_steps': '85/405', 'percentage': '20.99%', 'elapsed_time': '1h 11m 17s', 'remaining_time': '4h 28m 22s'}
{'loss': 0.7821136, 'token_acc': 0.74735605, 'grad_norm': 0.41759374, 'learning_rate': 9.224e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019888, 'epoch': 1.1, 'global_step/max_steps': '90/405', 'percentage': '22.22%', 'elapsed_time': '1h 15m 21s', 'remaining_time': '4h 23m 45s'}
Train: 23%|██▎ | 92/405 [1:17:03<4:18:27, 49.54s/it]
[2025-03-11 20:46:52,504] [WARNING] [stage3.py:2139:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.74946299, 'token_acc': 0.76573743, 'grad_norm': 0.31465808, 'learning_rate': 9.111e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019875, 'epoch': 1.16, 'global_step/max_steps': '95/405', 'percentage': '23.46%', 'elapsed_time': '1h 19m 36s', 'remaining_time': '4h 19m 45s'}
{'loss': 0.75774355, 'token_acc': 0.76279737, 'grad_norm': 0.34568468, 'learning_rate': 8.992e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019874, 'epoch': 1.22, 'global_step/max_steps': '100/405', 'percentage': '24.69%', 'elapsed_time': '1h 23m 48s', 'remaining_time': '4h 15m 35s'}
Train: 25%|██▍ | 100/405 [1:23:48<4:17:26, 50.64s/it]
{'eval_loss': 0.720375, 'eval_token_acc': 0.77822903, 'eval_runtime': 15.6988, 'eval_samples_per_second': 1.656, 'eval_steps_per_second': 0.446, 'epoch': 1.22, 'global_step/max_steps': '100/405', 'percentage': '24.69%', 'elapsed_time': '1h 24m 3s', 'remaining_time': '4h 16m 23s'}
Val: 100%|██████████| 7/7 [00:13<00:00, 1.86s/it]
[INFO:swift] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-100
**********************************************
{'loss': 0.72672591, 'token_acc': 0.76866752, 'grad_norm': 0.64908534, 'learning_rate': 8.865e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019801, 'epoch': 1.28, 'global_step/max_steps': '105/405', 'percentage': '25.93%', 'elapsed_time': '1h 28m 19s', 'remaining_time': '4h 12m 20s'}
{'loss': 0.72024732, 'token_acc': 0.76941662, 'grad_norm': 0.36116413, 'learning_rate': 8.732e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019815, 'epoch': 1.34, 'global_step/max_steps': '110/405', 'percentage': '27.16%', 'elapsed_time': '1h 32m 27s', 'remaining_time': '4h 7m 58s'}
{'loss': 0.68267331, 'token_acc': 0.7761134, 'grad_norm': 0.38293342, 'learning_rate': 8.593e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019806, 'epoch': 1.4, 'global_step/max_steps': '115/405', 'percentage': '28.40%', 'elapsed_time': '1h 36m 42s', 'remaining_time': '4h 3m 52s'}
{'loss': 0.71170344, 'token_acc': 0.78053525, 'grad_norm': 0.3713337, 'learning_rate': 8.448e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019831, 'epoch': 1.46, 'global_step/max_steps': '120/405', 'percentage': '29.63%', 'elapsed_time': '1h 40m 47s', 'remaining_time': '3h 59m 22s'}
{'loss': 0.70673256, 'token_acc': 0.77159011, 'grad_norm': 0.36822507, 'learning_rate': 8.297e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019845, 'epoch': 1.53, 'global_step/max_steps': '125/405', 'percentage': '30.86%', 'elapsed_time': '1h 44m 55s', 'remaining_time': '3h 55m 1s'}
{'loss': 0.67356033, 'token_acc': 0.7921583, 'grad_norm': 0.4612934, 'learning_rate': 8.14e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019851, 'epoch': 1.59, 'global_step/max_steps': '130/405', 'percentage': '32.10%', 'elapsed_time': '1h 49m 5s', 'remaining_time': '3h 50m 45s'}
Train: 33%|███▎ | 132/405 [1:50:48<3:50:21, 50.63s/it]
[2025-03-11 21:20:37,710] [WARNING] [stage3.py:2139:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.68124514, 'token_acc': 0.78771819, 'grad_norm': 0.46047566, 'learning_rate': 7.978e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019853, 'epoch': 1.65, 'global_step/max_steps': '135/405', 'percentage': '33.33%', 'elapsed_time': '1h 53m 16s', 'remaining_time': '3h 46m 32s'}
{'loss': 0.67308445, 'token_acc': 0.78043745, 'grad_norm': 0.46205863, 'learning_rate': 7.812e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019881, 'epoch': 1.71, 'global_step/max_steps': '140/405', 'percentage': '34.57%', 'elapsed_time': '1h 57m 18s', 'remaining_time': '3h 42m 2s'}
{'loss': 0.65709753, 'token_acc': 0.794716, 'grad_norm': 0.46728156, 'learning_rate': 7.64e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019893, 'epoch': 1.77, 'global_step/max_steps': '145/405', 'percentage': '35.80%', 'elapsed_time': '2h 1m 25s', 'remaining_time': '3h 37m 43s'}
{'loss': 0.66156731, 'token_acc': 0.78602904, 'grad_norm': 0.45510392, 'learning_rate': 7.464e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019899, 'epoch': 1.83, 'global_step/max_steps': '150/405', 'percentage': '37.04%', 'elapsed_time': '2h 5m 34s', 'remaining_time': '3h 33m 28s'}
Train: 37%|███▋ | 150/405 [2:05:34<3:31:14, 49.70s/it]
{'eval_loss': 0.65251857, 'eval_token_acc': 0.79853547, 'eval_runtime': 15.6574, 'eval_samples_per_second': 1.661, 'eval_steps_per_second': 0.447, 'epoch': 1.83, 'global_step/max_steps': '150/405', 'percentage': '37.04%', 'elapsed_time': '2h 5m 50s', 'remaining_time': '3h 33m 55s'}
Val: 100%|██████████| 7/7 [00:12<00:00, 1.86s/it]
[INFO:swift] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-150
**********************************************
{'loss': 0.65750132, 'token_acc': 0.78596818, 'grad_norm': 0.47214887, 'learning_rate': 7.285e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019847, 'epoch': 1.89, 'global_step/max_steps': '155/405', 'percentage': '38.27%', 'elapsed_time': '2h 10m 6s', 'remaining_time': '3h 29m 50s'}
{'loss': 0.63944697, 'token_acc': 0.80483245, 'grad_norm': 0.49222756, 'learning_rate': 7.101e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019853, 'epoch': 1.95, 'global_step/max_steps': '160/405', 'percentage': '39.51%', 'elapsed_time': '2h 14m 15s', 'remaining_time': '3h 25m 35s'}
{'loss': 0.63674178, 'token_acc': 0.80768833, 'grad_norm': 0.59897131, 'learning_rate': 6.913e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019895, 'epoch': 2.01, 'global_step/max_steps': '165/405', 'percentage': '40.74%', 'elapsed_time': '2h 18m 10s', 'remaining_time': '3h 20m 58s'}
{'loss': 0.64350748, 'token_acc': 0.80203466, 'grad_norm': 0.51221188, 'learning_rate': 6.723e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019886, 'epoch': 2.07, 'global_step/max_steps': '170/405', 'percentage': '41.98%', 'elapsed_time': '2h 22m 25s', 'remaining_time': '3h 16m 52s'}
{'loss': 0.59812784, 'token_acc': 0.80184307, 'grad_norm': 0.52895864, 'learning_rate': 6.53e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019905, 'epoch': 2.13, 'global_step/max_steps': '175/405', 'percentage': '43.21%', 'elapsed_time': '2h 26m 27s', 'remaining_time': '3h 12m 29s'}
{'loss': 0.60168495, 'token_acc': 0.80204451, 'grad_norm': 0.54771068, 'learning_rate': 6.334e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019928, 'epoch': 2.2, 'global_step/max_steps': '180/405', 'percentage': '44.44%', 'elapsed_time': '2h 30m 28s', 'remaining_time': '3h 8m 6s'}
Train: 44%|████▍ | 180/405 [2:30:28<2:59:27, 47.85s/it]
[2025-03-11 22:00:31,985] [WARNING] [stage3.py:2139:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.59545937, 'token_acc': 0.80456827, 'grad_norm': 0.57579227, 'learning_rate': 6.135e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019937, 'epoch': 2.26, 'global_step/max_steps': '185/405', 'percentage': '45.68%', 'elapsed_time': '2h 34m 35s', 'remaining_time': '3h 3m 50s'}
{'loss': 0.59948916, 'token_acc': 0.80121831, 'grad_norm': 0.53543298, 'learning_rate': 5.935e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019937, 'epoch': 2.32, 'global_step/max_steps': '190/405', 'percentage': '46.91%', 'elapsed_time': '2h 38m 46s', 'remaining_time': '2h 59m 39s'}
{'loss': 0.59326115, 'token_acc': 0.79956183, 'grad_norm': 0.55039623, 'learning_rate': 5.734e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.01994, 'epoch': 2.38, 'global_step/max_steps': '195/405', 'percentage': '48.15%', 'elapsed_time': '2h 42m 55s', 'remaining_time': '2h 55m 27s'}
{'loss': 0.58592167, 'token_acc': 0.80714245, 'grad_norm': 0.69052059, 'learning_rate': 5.531e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019935, 'epoch': 2.44, 'global_step/max_steps': '200/405', 'percentage': '49.38%', 'elapsed_time': '2h 47m 9s', 'remaining_time': '2h 51m 19s'}
Train: 49%|████▉ | 200/405 [2:47:09<2:49:56, 49.74s/it]
{'eval_loss': 0.56886792, 'eval_token_acc': 0.82167251, 'eval_runtime': 15.7067, 'eval_samples_per_second': 1.655, 'eval_steps_per_second': 0.446, 'epoch': 2.44, 'global_step/max_steps': '200/405', 'percentage': '49.38%', 'elapsed_time': '2h 47m 24s', 'remaining_time': '2h 51m 36s'}
Val: 100%|██████████| 7/7 [00:13<00:00, 1.86s/it]
[INFO:swift] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-200
*****************************************
Train: 50%|████▉ | 202/405 [2:49:15<3:08:49, 55.81s/it]
[2025-03-11 22:19:18,274] [WARNING] [stage3.py:2139:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.54969444, 'token_acc': 0.82021995, 'grad_norm': 0.51556883, 'learning_rate': 5.327e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019871, 'epoch': 2.5, 'global_step/max_steps': '205/405', 'percentage': '50.62%', 'elapsed_time': '2h 51m 52s', 'remaining_time': '2h 47m 41s'}
{'loss': 0.52501326, 'token_acc': 0.81536282, 'grad_norm': 0.54576287, 'learning_rate': 5.123e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019883, 'epoch': 2.56, 'global_step/max_steps': '210/405', 'percentage': '51.85%', 'elapsed_time': '2h 55m 58s', 'remaining_time': '2h 43m 24s'}
{'loss': 0.5639473, 'token_acc': 0.8235682, 'grad_norm': 0.51644597, 'learning_rate': 4.918e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019872, 'epoch': 2.62, 'global_step/max_steps': '215/405', 'percentage': '53.09%', 'elapsed_time': '3h 0m 15s', 'remaining_time': '2h 39m 17s'}
Train: 53%|█████▎ | 215/405 [3:00:15<2:40:55, 50.82s/it]
[2025-03-11 22:30:17,273] [WARNING] [stage3.py:2139:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.54539089, 'token_acc': 0.82929161, 'grad_norm': 0.5427966, 'learning_rate': 4.714e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019887, 'epoch': 2.69, 'global_step/max_steps': '220/405', 'percentage': '54.32%', 'elapsed_time': '3h 4m 18s', 'remaining_time': '2h 34m 59s'}
{'loss': 0.54721932, 'token_acc': 0.82292752, 'grad_norm': 0.58632606, 'learning_rate': 4.51e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.01989, 'epoch': 2.75, 'global_step/max_steps': '225/405', 'percentage': '55.56%', 'elapsed_time': '3h 8m 28s', 'remaining_time': '2h 30m 46s'}
{'loss': 0.51745701, 'token_acc': 0.82614152, 'grad_norm': 0.51928985, 'learning_rate': 4.307e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019892, 'epoch': 2.81, 'global_step/max_steps': '230/405', 'percentage': '56.79%', 'elapsed_time': '3h 12m 38s', 'remaining_time': '2h 26m 34s'}
{'loss': 0.54157047, 'token_acc': 0.81710944, 'grad_norm': 0.71657186, 'learning_rate': 4.105e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019899, 'epoch': 2.87, 'global_step/max_steps': '235/405', 'percentage': '58.02%', 'elapsed_time': '3h 16m 46s', 'remaining_time': '2h 22m 20s'}
{'loss': 0.54548702, 'token_acc': 0.81284619, 'grad_norm': 0.50686509, 'learning_rate': 3.904e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019907, 'epoch': 2.93, 'global_step/max_steps': '240/405', 'percentage': '59.26%', 'elapsed_time': '3h 20m 52s', 'remaining_time': '2h 18m 6s'}
{'loss': 0.51912632, 'token_acc': 0.83365523, 'grad_norm': 0.68279731, 'learning_rate': 3.706e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019907, 'epoch': 2.99, 'global_step/max_steps': '245/405', 'percentage': '60.49%', 'elapsed_time': '3h 25m 3s', 'remaining_time': '2h 13m 55s'}
{'loss': 0.52836185, 'token_acc': 0.83409461, 'grad_norm': 0.55463023, 'learning_rate': 3.509e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019923, 'epoch': 3.05, 'global_step/max_steps': '250/405', 'percentage': '61.73%', 'elapsed_time': '3h 29m 4s', 'remaining_time': '2h 9m 37s'}
Train: 62%|██████▏ | 250/405 [3:29:04<2:07:39, 49.41s/it]
{'eval_loss': 0.52870411, 'eval_token_acc': 0.83231801, 'eval_runtime': 15.7131, 'eval_samples_per_second': 1.655, 'eval_steps_per_second': 0.445, 'epoch': 3.05, 'global_step/max_steps': '250/405', 'percentage': '61.73%', 'elapsed_time': '3h 29m 20s', 'remaining_time': '2h 9m 47s'}
Val: 100%|██████████| 7/7 [00:13<00:00, 1.86s/it]
[INFO:swift] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-250
************************************************
{'loss': 0.51691947, 'token_acc': 0.82422604, 'grad_norm': 0.53855505, 'learning_rate': 3.316e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019896, 'epoch': 3.11, 'global_step/max_steps': '255/405', 'percentage': '62.96%', 'elapsed_time': '3h 33m 33s', 'remaining_time': '2h 5m 37s'}
Train: 63%|██████▎ | 257/405 [3:35:20<2:08:40, 52.17s/it][2025-03-11 23:05:38,683] [WARNING] [stage3.py:2139:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.50732822, 'token_acc': 0.83722172, 'grad_norm': 0.7386699, 'learning_rate': 3.124e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019885, 'epoch': 3.17, 'global_step/max_steps': '260/405', 'percentage': '64.20%', 'elapsed_time': '3h 37m 51s', 'remaining_time': '2h 1m 30s'}
{'loss': 0.50304022, 'token_acc': 0.84518402, 'grad_norm': 0.5332581, 'learning_rate': 2.936e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019889, 'epoch': 3.23, 'global_step/max_steps': '265/405', 'percentage': '65.43%', 'elapsed_time': '3h 42m 0s', 'remaining_time': '1h 57m 17s'}
{'loss': 0.5034606, 'token_acc': 0.81697432, 'grad_norm': 0.67853403, 'learning_rate': 2.752e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019886, 'epoch': 3.29, 'global_step/max_steps': '270/405', 'percentage': '66.67%', 'elapsed_time': '3h 46m 13s', 'remaining_time': '1h 53m 6s'}
{'loss': 0.5183465, 'token_acc': 0.83563731, 'grad_norm': 0.6259943, 'learning_rate': 2.571e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019885, 'epoch': 3.35, 'global_step/max_steps': '275/405', 'percentage': '67.90%', 'elapsed_time': '3h 50m 25s', 'remaining_time': '1h 48m 55s'}
{'loss': 0.51731062, 'token_acc': 0.83534514, 'grad_norm': 0.61180005, 'learning_rate': 2.394e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019881, 'epoch': 3.42, 'global_step/max_steps': '280/405', 'percentage': '69.14%', 'elapsed_time': '3h 54m 40s', 'remaining_time': '1h 44m 45s'}
{'loss': 0.48814211, 'token_acc': 0.82283914, 'grad_norm': 0.57190785, 'learning_rate': 2.222e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019883, 'epoch': 3.48, 'global_step/max_steps': '285/405', 'percentage': '70.37%', 'elapsed_time': '3h 58m 50s', 'remaining_time': '1h 40m 33s'}
{'loss': 0.4921607, 'token_acc': 0.82588464, 'grad_norm': 0.52349298, 'learning_rate': 2.054e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019888, 'epoch': 3.54, 'global_step/max_steps': '290/405', 'percentage': '71.60%', 'elapsed_time': '4h 2m 58s', 'remaining_time': '1h 36m 20s'}
{'loss': 0.46711798, 'token_acc': 0.85013139, 'grad_norm': 0.6346718, 'learning_rate': 1.892e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019889, 'epoch': 3.6, 'global_step/max_steps': '295/405', 'percentage': '72.84%', 'elapsed_time': '4h 7m 8s', 'remaining_time': '1h 32m 9s'}
{'loss': 0.48140554, 'token_acc': 0.83738891, 'grad_norm': 0.62962168, 'learning_rate': 1.734e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019894, 'epoch': 3.66, 'global_step/max_steps': '300/405', 'percentage': '74.07%', 'elapsed_time': '4h 11m 16s', 'remaining_time': '1h 27m 56s'}
Train: 74%|███████▍ | 300/405 [4:11:16<1:25:56, 49.11s/it]
{'eval_loss': 0.50453913, 'eval_token_acc': 0.84007138, 'eval_runtime': 15.7353, 'eval_samples_per_second': 1.652, 'eval_steps_per_second': 0.445, 'epoch': 3.66, 'global_step/max_steps': '300/405', 'percentage': '74.07%', 'elapsed_time': '4h 11m 32s', 'remaining_time': '1h 28m 2s'}
Val: 100%|██████████| 7/7 [00:13<00:00, 1.87s/it]49.11s/it]
[INFO:swift] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-300
********************************************
{'loss': 0.48724985, 'token_acc': 0.83901735, 'grad_norm': 0.59317845, 'learning_rate': 1.582e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019877, 'epoch': 3.72, 'global_step/max_steps': '305/405', 'percentage': '75.31%', 'elapsed_time': '4h 15m 40s', 'remaining_time': '1h 23m 49s'}
{'loss': 0.47789598, 'token_acc': 0.85550931, 'grad_norm': 0.67272985, 'learning_rate': 1.436e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019881, 'epoch': 3.78, 'global_step/max_steps': '310/405', 'percentage': '76.54%', 'elapsed_time': '4h 19m 49s', 'remaining_time': '1h 19m 37s'}
{'loss': 0.49318271, 'token_acc': 0.81770707, 'grad_norm': 0.64257801, 'learning_rate': 1.295e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.01989, 'epoch': 3.84, 'global_step/max_steps': '315/405', 'percentage': '77.78%', 'elapsed_time': '4h 23m 53s', 'remaining_time': '1h 15m 23s'}
Train: 78%|███████▊ | 316/405 [4:24:43<1:14:10, 50.00s/it][2025-03-11 23:54:59,684] [WARNING] [stage3.py:2139:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.46136761, 'token_acc': 0.8458454, 'grad_norm': 0.62953055, 'learning_rate': 1.161e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.01989, 'epoch': 3.91, 'global_step/max_steps': '320/405', 'percentage': '79.01%', 'elapsed_time': '4h 28m 5s', 'remaining_time': '1h 11m 12s'}
{'loss': 0.4856822, 'token_acc': 0.83825816, 'grad_norm': 0.64470125, 'learning_rate': 1.033e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019893, 'epoch': 3.97, 'global_step/max_steps': '325/405', 'percentage': '80.25%', 'elapsed_time': '4h 32m 13s', 'remaining_time': '1h 7m 0s'}
{'loss': 0.46592345, 'token_acc': 0.84528571, 'grad_norm': 0.65905805, 'learning_rate': 9.12e-06, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019917, 'epoch': 4.02, 'global_step/max_steps': '330/405', 'percentage': '81.48%', 'elapsed_time': '4h 36m 5s', 'remaining_time': '1h 2m 44s'}
{'loss': 0.48042569, 'token_acc': 0.85237186, 'grad_norm': 0.61635281, 'learning_rate': 7.98e-06, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019923, 'epoch': 4.09, 'global_step/max_steps': '335/405', 'percentage': '82.72%', 'elapsed_time': '4h 40m 11s', 'remaining_time': '58m 32s'}
{'loss': 0.45569935, 'token_acc': 0.83371485, 'grad_norm': 0.64527875, 'learning_rate': 6.9e-06, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.01993, 'epoch': 4.15, 'global_step/max_steps': '340/405', 'percentage': '83.95%', 'elapsed_time': '4h 44m 16s', 'remaining_time': '54m 20s'}
{'loss': 0.46417255, 'token_acc': 0.84960884, 'grad_norm': 0.67313113, 'learning_rate': 5.9e-06, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019931, 'epoch': 4.21, 'global_step/max_steps': '345/405', 'percentage': '85.19%', 'elapsed_time': '4h 48m 26s', 'remaining_time': '50m 9s'}
{'loss': 0.47292795, 'token_acc': 0.85013211, 'grad_norm': 0.59537749, 'learning_rate': 4.98e-06, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019936, 'epoch': 4.27, 'global_step/max_steps': '350/405', 'percentage': '86.42%', 'elapsed_time': '4h 52m 32s', 'remaining_time': '45m 58s'}
Train: 86%|████████▋ | 350/405 [4:52:32<45:19, 49.44s/it]
{'eval_loss': 0.490695, 'eval_token_acc': 0.84296351, 'eval_runtime': 15.6909, 'eval_samples_per_second': 1.657, 'eval_steps_per_second': 0.446, 'epoch': 4.27, 'global_step/max_steps': '350/405', 'percentage': '86.42%', 'elapsed_time': '4h 52m 48s', 'remaining_time': '46m 0s'}
Val: 100%|██████████| 7/7 [00:13<00:00, 1.86s/it].44s/it]
[INFO:swift] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-350
*****************************************
Train: 87%|████████▋ | 352/405 [4:54:35<48:02, 54.39s/it][2025-03-12 00:25:04,775] [WARNING] [stage3.py:2139:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.46881456, 'token_acc': 0.83740075, 'grad_norm': 0.59338625, 'learning_rate': 4.13e-06, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.01991, 'epoch': 4.33, 'global_step/max_steps': '355/405', 'percentage': '87.65%', 'elapsed_time': '4h 57m 6s', 'remaining_time': '41m 50s'}
{'eval_loss': 0.48915866, 'eval_token_acc': 0.84357886, 'eval_runtime': 15.8494, 'eval_samples_per_second': 1.64, 'eval_steps_per_second': 0.442, 'epoch': 4.88, 'global_step/max_steps': '400/405', 'percentage': '98.77%', 'elapsed_time': '5h 34m 38s', 'remaining_time': '4m 10s'}
Val: 100%|██████████| 7/7 [00:13<00:00, 1.88s/it].31s/it]
[INFO:swift] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-400
{'loss': 0.49458728, 'token_acc': 0.83115697, 'grad_norm': 0.65133526, 'learning_rate': 0.0, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.01992, 'epoch': 4.94, 'global_step/max_steps': '405/405', 'percentage': '100.00%', 'elapsed_time': '5h 38m 48s', 'remaining_time': '0s'}
Train: 100%|██████████| 405/405 [5:38:48<00:00, 50.98s/it]
{'eval_loss': 0.4893617, 'eval_token_acc': 0.84308658, 'eval_runtime': 15.9508, 'eval_samples_per_second': 1.63, 'eval_steps_per_second': 0.439, 'epoch': 4.94, 'global_step/max_steps': '405/405', 'percentage': '100.00%', 'elapsed_time': '5h 39m 4s', 'remaining_time': '0s'}
Val: 100%|██████████| 7/7 [00:13<00:00, 1.90s/it].98s/it]
[INFO:swift] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-405
{'train_runtime': 20349.5179, 'train_samples_per_second': 0.642, 'train_steps_per_second': 0.02, 'train_loss': 0.63218051, 'epoch': 4.94, 'global_step/max_steps': '405/405', 'percentage': '100.00%', 'elapsed_time': '5h 39m 9s', 'remaining_time': '0s'}
Train: 100%|██████████| 405/405 [5:39:09<00:00, 50.25s/it]
[INFO:swift] last_model_checkpoint: /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-405
[INFO:swift] best_model_checkpoint: /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-400
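
The log reports checkpoint-400 as the best checkpoint, while checkpoint-405 is only the last one. A minimal sketch of how to check this from the logs, assuming the best checkpoint is the one with the lowest `eval_loss` (the logs are consistent with that, but the selection rule is an assumption here, not something confirmed from ms-swift itself). The helper names `parse_eval_records` and `best_step` are hypothetical, and `sample_log` contains only the `eval_loss` and step fields copied from the evaluation lines above:

```python
import ast


def parse_eval_records(log_text):
    """Collect (global_step, eval_loss) pairs from dict-style log lines."""
    records = []
    for line in log_text.splitlines():
        line = line.strip()
        # Only evaluation records are of interest; skip progress bars and INFO lines.
        if not (line.startswith("{") and "'eval_loss'" in line):
            continue
        entry = ast.literal_eval(line)  # the log lines are Python dict literals
        step = int(entry["global_step/max_steps"].split("/")[0])
        records.append((step, entry["eval_loss"]))
    return records


def best_step(records):
    """Return the step with the lowest eval_loss (assumed selection rule)."""
    return min(records, key=lambda record: record[1])[0]


# Trimmed copies of the evaluation lines shown in the training log.
sample_log = """
{'eval_loss': 0.52870411, 'global_step/max_steps': '250/405'}
{'eval_loss': 0.50453913, 'global_step/max_steps': '300/405'}
{'eval_loss': 0.490695, 'global_step/max_steps': '350/405'}
{'eval_loss': 0.48915866, 'global_step/max_steps': '400/405'}
{'eval_loss': 0.4893617, 'global_step/max_steps': '405/405'}
"""

records = parse_eval_records(sample_log)
print(best_step(records))  # prints 400
```

The eval loss at step 405 (0.4893617) is slightly higher than at step 400 (0.48915866), which is why the deployment below points at checkpoint-400 rather than the final checkpoint.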

4. Model Deployment and Inference Script

We deploy the model across multiple GPUs and set a custom service port:

RAY_memory_monitor_refresh_ms=0 \
CUDA_VISIBLE_DEVICES=0,1 \
swift deploy \
    --ckpt_dir /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-400 \
    --infer_backend vllm \
    --max_new_tokens 2048 \
    --tensor_parallel_size 2 \
    --port 8011

Inference script:

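from openai import OpenAI

# The deploy command above sets no API key, so a placeholder is used.
openai_api_key = "EMPTY"
# Replace "ip" with the host running `swift deploy`; 8011 matches --port.
openai_api_base = "http://ip:8011/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Multi-turn customer-service conversation; the model answers the last user turn.
chat_response = client.chat.completions.create(
    model="QwQ-32B",
    messages=[
        # System: "You are a customer-service bot that helps customers solve problems."
        {"role": "system", "content": "你是一款客户机器人，帮助客户解决问题"},
        # User: "What accessories currently come with this phone?"
        {"role": "user", "content": "问一下这款手机现在附带什么配件"},
        # Assistant: the accessory list (battery pack, power adapter, cable, strap, ...).
        {"role": "assistant", "content": "附件内容：锂离子电池组 NP-FW50，电源适配器AC-UUD12 ，Micro USB 连接线，肩带，镜头盖，热靴盖，遮光罩，使用说明书，保修卡"},
        # User: "What are the after-sales and warranty terms?"
        {"role": "user", "content": "售后和质保是什么标准"},
    ],
    temperature=0.7,
    top_p=0.8,
    max_tokens=2048,
    # repetition_penalty is not an OpenAI parameter, so it is passed via extra_body.
    extra_body={"repetition_penalty": 1.05},
)
print("Chat response:", chat_response)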
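
Because QwQ-32B is a reasoning model, the returned text usually carries the thinking process before the final answer. The following is a minimal sketch of separating the two, assuming the reasoning is terminated by a `</think>` tag and that the opening `<think>` tag may or may not appear in the returned text. The helper name `split_reasoning` is hypothetical, and the tag format should be checked against the actual output of the deployed model:

```python
def split_reasoning(content):
    """Split model output into (reasoning, answer) around the </think> tag.

    Assumption: the reasoning ends with "</think>"; the opening "<think>"
    may be missing if the chat template already emitted it in the prompt.
    """
    head, separator, tail = content.partition("</think>")
    if not separator:
        # No closing tag found: treat the whole text as the answer.
        return "", content.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


# Illustrative strings only, not real model output.
print(split_reasoning("<think>check the warranty terms</think>\n\nOne year."))
print(split_reasoning("no reasoning here"))
```

With the script above, the text to pass in would be `chat_response.choices[0].message.content`.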

5. References

【1】https://github.com/modelscope/ms-swift

【2】Inference and Deployment