DoReMi Data Mixer


2025-11-27

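DoReMi (Domain Reweighting with Minimax Optimization) is an algorithm for optimizing the mixing proportions of multi-domain training data. It optimizes the domain weights on a small proxy model, and the resulting data mixture is then used to train a large-scale model.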

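Algorithm Overview

The DoReMi algorithm has three steps:

Step 1: Train a reference model with the reference weights (Reference Model)
Step 2: Dynamically optimize the domain weights on a small proxy model (Proxy Model)
Step 3: Train the large-scale target model with the optimized weights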

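Three-Step Training Workflow

Step 1: Reference Model Training

Train a reference model with the initial domain weights (usually a uniform distribution or empirically chosen weights). This model serves as the baseline for the weight optimization in Step 2.

Config file: doremi_step1_static_qwen_pt_full.yaml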

### dynamic_train - DoReMi Step 1: Reference Model Training
train_type: dynamic_mix
components_cfg_file: src/dataflex/configs/components.yaml
component_name: static  # use the static mixer
mixture_sample_rule: mixture
init_mixture_proportions: [0.5, 0.5]  # initial weights; uniform in this example
static_mix: true

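Key parameters:

component_name: static: uses the static mixer, which keeps the weights fixed for the entire training run
init_mixture_proportions: initial domain weights; the number of entries must match the number of datasets
static_mix: true: enables static mixing mode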
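
To make the effect of fixed proportions concrete, here is a minimal sketch of drawing domain indices under a static mixture. The function name sample_domains and its seed argument are hypothetical and are not part of DataFlex; the sketch only assumes that a static mixer samples each example's domain independently according to the configured proportions.

```python
import random


def sample_domains(proportions, n, seed=0):
    """Draw n domain indices with fixed probabilities (hypothetical helper)."""
    rng = random.Random(seed)  # local RNG so the draw is reproducible
    domain_ids = range(len(proportions))
    # random.choices samples with replacement using the given relative weights
    return rng.choices(domain_ids, weights=proportions, k=n)


if __name__ == "__main__":
    draws = sample_domains([0.3, 0.7], n=10_000)
    share_of_domain_1 = draws.count(1) / len(draws)
    print(f"share of domain 1: {share_of_domain_1:.3f}")
```

Because the proportions never change, the empirical share of each domain stays close to its configured value for the whole run.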

Configuration in components.yaml:

mixers:
  static:
    name: static
    params:
      proportions: [0.5, 0.5]  # proportion for each domain
      # proportions: null  # use a uniform distribution
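
The fallback from null to a uniform distribution, and the requirement that the number of weights match the number of datasets, can be sketched as follows. The helper resolve_proportions is hypothetical, and the renormalization step is an assumption: the source does not say whether DataFlex rescales proportions that do not sum to 1.

```python
def resolve_proportions(proportions, num_domains):
    """Return a valid probability vector over domains (hypothetical helper)."""
    if proportions is None:
        # Mirrors `proportions: null`: every domain gets the same share
        return [1.0 / num_domains] * num_domains
    if len(proportions) != num_domains:
        raise ValueError(
            f"expected {num_domains} proportions, got {len(proportions)}"
        )
    if any(p < 0 for p in proportions) or sum(proportions) <= 0:
        raise ValueError("proportions must be non-negative with a positive sum")
    total = sum(proportions)
    # Assumption: rescale so the weights sum to 1
    return [p / total for p in proportions]


if __name__ == "__main__":
    print(resolve_proportions(None, 2))
    print(resolve_proportions([1, 3], 2))
```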

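Step 2: Proxy Model Weight Optimization

Run the DoReMi algorithm on a small proxy model to optimize the domain weights dynamically. The algorithm adjusts the weights according to each domain's excess loss, that is, how much higher the proxy model's loss is than the reference model's loss on that domain. During training, data is selected by uniform sampling; the optimized domain weights are logged and used to weight the loss in each training step.

Config file: doremi_step2_dynamic_qwen_pt_full.yaml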

### dynamic_train - DoReMi Step 2: Proxy Model Training
train_type: dynamic_mix
components_cfg_file: src/dataflex/configs/components.yaml
component_name: doremi  # use the DoReMi mixer
mixture_sample_rule: mixture
init_mixture_proportions: [0.5, 0.5]  # initial weights
warmup_step: 100  # number of warmup steps
update_step: 200  # interval between weight updates
update_times: 3   # number of weight updates
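
As a rough illustration of how these three values could interact, the sketch below lists the training steps at which the weights would be updated. It assumes the first update happens at the end of warmup and later updates follow every update_step steps, which matches the sample log entries at steps 100 and 300 in the weight log section; the actual DataFlex schedule may differ, and the function update_steps is hypothetical.

```python
def update_steps(warmup_step, update_step, update_times):
    """List the steps at which domain weights are updated (assumed schedule)."""
    return [warmup_step + i * update_step for i in range(update_times)]


if __name__ == "__main__":
    # Values taken from the Step 2 config above
    print(update_steps(warmup_step=100, update_step=200, update_times=3))
```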

Configuration in components.yaml:

mixers:
  doremi:
    name: doremi
    params:
      # Path to the reference model trained in Step 1
      reference_model_path: /path/to/doremi_step1_result/checkpoint-xxx
      # Learning rate for the weight update (eta in the DoReMi paper)
      reweight_eta: 0.1
      # Weight smoothing parameter (epsilon in the DoReMi paper)
      reweight_eps: 0.01

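Key parameters:

reference_model_path: path to the reference model checkpoint trained in Step 1
reweight_eta: learning rate of the weight update; controls how strongly the weights change at each update
reweight_eps: smoothing parameter; prevents the weight of any domain from becoming too small
warmup_step: number of warmup training steps before weight optimization starts
update_step: number of steps between domain weight updates
update_times: number of domain weight updates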
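
The role of reweight_eta and reweight_eps can be sketched with the exponentiated-gradient update described in the DoReMi paper: scale each weight by the exponential of its excess loss, renormalize, then mix with the uniform distribution. This is a simplified sketch, not the DataFlex implementation: the function doremi_update is hypothetical, and it clips the excess loss once per domain, whereas the paper clips per token before averaging.

```python
import math


def doremi_update(weights, proxy_losses, ref_losses, eta=0.1, eps=0.01):
    """One domain-weight update step (simplified, hypothetical helper)."""
    k = len(weights)
    # Excess loss: how much worse the proxy is than the reference, floored at 0
    excess = [max(p - r, 0.0) for p, r in zip(proxy_losses, ref_losses)]
    # Multiplicative update: domains with larger excess loss gain weight
    scaled = [w * math.exp(eta * e) for w, e in zip(weights, excess)]
    total = sum(scaled)
    normalized = [s / total for s in scaled]
    # Smoothing: mix with the uniform distribution so no weight reaches zero
    return [(1.0 - eps) * n + eps / k for n in normalized]


if __name__ == "__main__":
    new_weights = doremi_update([0.5, 0.5], [2.5, 3.2], [2.4, 2.6])
    print([round(w, 4) for w in new_weights])
```

A larger eta makes each update more aggressive, while a larger eps pulls the result closer to uniform.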

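Algorithm behavior:

Data is selected by uniform sampling (every domain has the same sampling probability)
The optimized domain_weights are computed and used to weight the loss during training
This keeps sampling fair across domains while letting the loss focus on the harder domains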
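
One way to picture loss weighting under uniform sampling is the sketch below, which averages the loss within each domain and then combines the domain means using domain_weights. The exact normalization DataFlex uses is not stated in this guide, so the function weighted_loss and its handling of domains missing from a batch are assumptions.

```python
def weighted_loss(per_example_losses, domain_ids, domain_weights):
    """Combine per-domain mean losses using domain weights (assumed scheme)."""
    k = len(domain_weights)
    sums = [0.0] * k
    counts = [0] * k
    for loss, d in zip(per_example_losses, domain_ids):
        sums[d] += loss
        counts[d] += 1
    total = 0.0
    weight_total = 0.0
    for d in range(k):
        if counts[d] == 0:
            continue  # domain absent from this batch contributes nothing
        total += domain_weights[d] * (sums[d] / counts[d])
        weight_total += domain_weights[d]
    if weight_total == 0.0:
        raise ValueError("batch contains no examples with positive weight")
    # Renormalize over the domains actually present in the batch
    return total / weight_total


if __name__ == "__main__":
    print(weighted_loss([2.0, 4.0], [0, 1], [0.25, 0.75]))
```

With equal weights this reduces to the plain mean of the domain means; shifting weight toward a domain makes its loss count for more in the gradient.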

Weight log:

Training automatically writes a doremi_weights.jsonl file that records the details of every weight update:

{"step": 100, "timestamp": "2025-11-27 10:00:00", "domain_names": ["wiki", "c4"], "domain_weights": [0.3, 0.7], "perdomain_scores": [2.5, 3.2]}
{"step": 300, "timestamp": "2025-11-27 10:10:00", "domain_names": ["wiki", "c4"], "domain_weights": [0.25, 0.75], "perdomain_scores": [2.3, 3.5]}

Step 3: Target Model Training

Train the large-scale target model with the final weights obtained in Step 2.

Config file: doremi_step3_static_qwen_pt_full.yaml

### dynamic_train - DoReMi Step 3: Large Model Training with Optimized Weights
train_type: dynamic_mix
components_cfg_file: src/dataflex/configs/components.yaml
component_name: static  # use the static mixer
mixture_sample_rule: mixture
init_mixture_proportions: [0.3, 0.7]  # final weights from Step 2 (example values)
static_mix: true

Key steps:

Extract the final optimized weights from the doremi_weights.jsonl file produced in Step 2
Fill these weights into the init_mixture_proportions setting
Train with the static mixer
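
The first two steps can be sketched as a small helper that turns the last log entry into a config line ready to paste into the Step 3 YAML file. The function final_weights_line is hypothetical, and the rounding precision is an arbitrary choice; it assumes only the domain_weights field shown in the sample log.

```python
import json


def final_weights_line(jsonl_text, digits=4):
    """Format the last logged weights as a YAML config line (hypothetical)."""
    entries = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    if not entries:
        raise ValueError("weight log is empty")
    weights = entries[-1]["domain_weights"]
    rounded = [round(w, digits) for w in weights]
    # A JSON list is also a valid YAML flow sequence
    return "init_mixture_proportions: " + json.dumps(rounded)


if __name__ == "__main__":
    log = (
        '{"step": 100, "domain_weights": [0.3, 0.7]}\n'
        '{"step": 300, "domain_weights": [0.25, 0.75]}\n'
    )
    print(final_weights_line(log))
```

Rounding can leave the weights summing to slightly more or less than 1, so it is worth checking the sum before training.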

Complete Training Example

# Step 1: train the reference model
llamafactory-cli train examples/train_full/mixers/doremi_step1_static_qwen_pt_full.yaml

# Step 2: optimize the domain weights (on the small proxy model)
# Note: first set reference_model_path in components.yaml
llamafactory-cli train examples/train_full/mixers/doremi_step2_dynamic_qwen_pt_full.yaml

# Step 3: train the target model with the optimized weights
# Note: first fill the final weights from Step 2 into the config file
llamafactory-cli train examples/train_full/mixers/doremi_step3_static_qwen_pt_full.yaml

Weight Extraction and Analysis

Read the optimized weights from the Step 2 output directory:

import json

import matplotlib.pyplot as plt
import numpy as np

# Read the weight log, skipping blank lines
weights_history = []
with open('doremi_step2_result/doremi_weights.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        if line.strip():
            weights_history.append(json.loads(line))

# Take the final weights from the last logged update
final_weights = weights_history[-1]['domain_weights']
domain_names = weights_history[-1]['domain_names']

print("Optimized domain weights:")
for name, weight in zip(domain_names, final_weights):
    print(f"  {name}: {weight:.4f}")

# Plot how each domain weight evolves over training
steps = [entry['step'] for entry in weights_history]
weights_matrix = np.array([entry['domain_weights'] for entry in weights_history])

plt.figure(figsize=(10, 6))
for i, name in enumerate(domain_names):
    plt.plot(steps, weights_matrix[:, i], label=name, marker='o')
plt.xlabel('Training Step')
plt.ylabel('Domain Weight')
plt.title('DoReMi Domain Weight Evolution')
plt.legend()
plt.grid(True)
plt.savefig('doremi_weights_evolution.png')
plt.show()
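
The script above takes the last logged entry as the final weights. The DoReMi paper instead averages the domain weights over the proxy training steps, which can be less sensitive to noise in any single update. The sketch below averages over the logged updates only; the function average_weights is hypothetical, and whether this average or the last entry works better with DataFlex is not something this guide establishes.

```python
def average_weights(history):
    """Average domain_weights over all logged updates (hypothetical helper)."""
    if not history:
        raise ValueError("weight history is empty")
    k = len(history[0]["domain_weights"])
    sums = [0.0] * k
    for entry in history:
        for i, w in enumerate(entry["domain_weights"]):
            sums[i] += w
    return [s / len(history) for s in sums]


if __name__ == "__main__":
    history = [
        {"step": 100, "domain_weights": [0.3, 0.7]},
        {"step": 300, "domain_weights": [0.25, 0.75]},
    ]
    print([round(w, 4) for w in average_weights(history)])
```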