Offline-Near数据选择器

888 字约 3 分钟

2025-11-26

本文档介绍如何在 DataFlex 框架中使用 Offline NEAR Selector 实现训练数据的动态选择，以在监督微调（SFT）中聚焦于与目标集的相似度，进行邻近选择。

1. 方法概述

NEAR 的核心思想是：

先将已分词(tokenized)的样本进一步编码为句向量（例如 512 维）。
在嵌入空间中进行**近邻搜索 **，得到每个样本与目标集的“样本相似度”。

直观理解：选择与目标集最接近的训练数据，以最优化训练目标。

2. 环境与依赖

# DataFlex（建议源码安装）
git clone https://github.com/OpenDCAI/DataFlex.git
cd DataFlex
pip install -e .

# 训练与推理的常用依赖
pip install llamafactory==0.9.3

# NEAR 额外依赖（向量检索与进度条等）
pip install faiss-cpu vllm sentence-transformer

3. offline 数据选择

在DataFlex\src\dataflex\offline_selector\offline_near_selector.py文件中修改训练集、编码模型和参数

if __name__ == "__main__":
    near = offline_near_Selector(
        candidate_path="OpenDCAI/DataFlex-selector-openhermes-10w", # split = train
        query_path="OpenDCAI/DataFlex-selector-openhermes-10w", # split = vaildation
        # It automatically try vllm first, then sentence-transformers
        embed_model="Qwen/Qwen3-Embedding-0.6B",
        # support method:
        #auto(It automatically try vllm first, then sentence-transformers),
        #vllm,
        #sentence-transformer
        embed_method= "auto",
        batch_size=32,
        save_indices_path="top_indices.npy",
        max_K=1000,
        
    )
    near.selector()

注意：此处的 model_name 用于将tokenized后的文本进一步编码为句向量（例如 1024 维），支持vllm和sentence-transformer 推理。

最终保存为每个query的max_K个最邻近训练数据的索引矩阵（ N ,max_K ）

4. 关键超参数与建议

参数	典型范围	含义与建议
`max_K`	64–2000	近邻检索数量上限，越大越稳但开销更高；建议与数据规模/显存权衡
`model_name`	—	句向量编码模型路径或名称（如本地embeddingm模型）
`cache_dir`	—	中间结果缓存路径，便于断点续跑

5. 组件配置（components.yaml）

路径： DataFlex/src/dataflex/configs/components.yaml

预设参数

near:
    name: near
    params:
      indices_path: ./src/dataflex/offline_selector/top_indices.npy
      cache_dir: ../dataflex_saves/near_output

6. 动态训练配置（LoRA + NEAR）

示例文件： DataFlex/examples/train_lora/selectors/near.yaml

### model
model_name_or_path: #模型地址
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
lora_rank: 16
lora_alpha: 8
# deepspeed: examples/deepspeed/ds_z3_config.json  # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]

### dataset
dataset: #训练集
template: qwen （训练模型类型：qwen、llama...）
cutoff_len: 4096
# max_samples: 100000000
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 0
# disable_shuffling: true
seed: 42

### output
output_dir: ../dataflex_saves
logging_steps: 10
save_steps: 100
plot_loss: true
save_only_model: false
overwrite_output_dir: true

### swanlab
report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]
# use_swanlab: true
# swanlab_project: medical_dynamic_sft
# swanlab_run_name: qwen2_5_3b_lora_medical_50k_baseline
# swanlab_workspace: word2li
# swanlab_api_key: AnLWTMijcbd4cyEfundi3
# swanlab_lark_webhook_url: https://open.feishu.cn/open-apis/bot/v2/hook/ff10a391-4e51-4481-97ff-965760cae2a1
# swanlab_lark_secret: cySzwTbCJh08349FGAhBSf

### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 16
learning_rate: 1.0e-4
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: false

### Dataflex args
train_type: dynamic_select   # 选择训练器类型。可选值包括：
                          # "dynamic_select" - 动态选择训练器
                          # "dynamic_mix" - 动态混合训练器
                          # "dynamic_weight" - 动态加权训练器
                          # "static" - 默认静态训练器
components_cfg_file: src/dataflex/configs/components.yaml
component_name: near  # 选择组件名称，对应 components_cfg_file 中定义的组件
warmup_step: 400
update_step: 500
update_times: 2
# eval_dataset: alpaca_zh_demo

参数说明：

component_name: near：启用 NEAR 组件。
warmup_step / update_step / update_times：决定何时与多久进行一次动态选择；总步数 ≈ warmup_step + update_step × update_times。
总batch_size=device_number x per_device_train_batch_size x gradient_accumulation_steps

7. 运行训练

FORCE_TORCHRUN=1 DISABLE_VERSION_CHECK=1 dataflex-cli train examples/train_lora/selectors/near.yaml

采用分布式

训练过程中会在设定的步数触发 NEAR 动态选择：根据离线选择的样本索引，选出下一阶段训练子集。

8. 模型合并与导出

与 Less Selector 流程一致：

配置文件： DataFlex/examples/merge_lora/llama3_lora_sft.yaml

model_name_or_path: 原模型地址
adapter_name_or_path: 微调后adpter地址
template: qwen
trust_remote_code: true

export_dir: ../dataflex_saves
export_size: 5
export_device: cpu
export_legacy_format: false

导出命令：在llamafactory文件夹中运行

llamafactory-cli export llama3_lora_sft.yaml

9. 评估与对比

建议使用 DataFlow 的模型 QA 评估流水线，对 NEAR 与 Less、随机采样 等策略进行并列评测