Offline-Near-Selector
2025-11-27
This document describes how to use the Offline NEAR Selector for dynamic data selection during supervised fine-tuning (SFT) within the DataFlex framework. The selector finds the candidate data closest to a target dataset, with the aim of improving generalization performance.
1. Method Overview
The core idea of NEAR is:
- Further encode already tokenized samples into sentence embeddings (e.g., 512‑dim).
- Perform **nearest‑neighbor search** in the embedding space to obtain each sample’s representativeness score.
Intuition: select the candidate data that lies closest to the target dataset in embedding space.
Scoring Formulation
Let the sentence embedding of sample $i$ be $e_i$, and let the set of its max_K nearest neighbors be $N_K(i)$.
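The nearest-neighbor step above can be illustrated with a small NumPy sketch. It assumes cosine similarity as the distance measure and uses a brute-force search; the source does not state the metric, and the real selector relies on a vector-search library, so `top_k_neighbors` is a hypothetical helper for illustration only.

```python
import numpy as np

def top_k_neighbors(query_emb, cand_emb, k):
    """Return, for each query row, the indices of its k most similar candidates."""
    # L2-normalize so that the dot product equals cosine similarity (assumed metric).
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    sims = q @ c.T                    # shape: (num_queries, num_candidates)
    k = min(k, cand_emb.shape[0])     # cannot retrieve more candidates than exist
    # Sort each row by descending similarity and keep the first k columns.
    return np.argsort(-sims, axis=1)[:, :k]

# Toy data: four 2-dim candidate embeddings and one query embedding.
candidates = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
queries = np.array([[1.0, 0.05]])
top_indices = top_k_neighbors(queries, candidates, k=2)
print(top_indices)  # [[0 2]]
```

Each row of the result plays the role of $N_K(i)$ for one query: the indices of the candidates nearest to it.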
2. Environment & Dependencies
# DataFlex (recommended: editable install)
git clone https://github.com/OpenDCAI/DataFlex.git
cd DataFlex
pip install -e .
# Common training/inference dependencies (as needed)
pip install llamafactory
# NEAR extras (vector search & embedding backends)
pip install faiss-cpu vllm sentence-transformers
3. Offline Selection
Set the training set, embedding model, and parameters in DataFlex/src/dataflex/offline_selector/offline_near_selector.py:
if __name__ == "__main__":
    near = offline_near_Selector(
        candidate_path="OpenDCAI/DataFlex-selector-openhermes-10w",  # split = train
        query_path="OpenDCAI/DataFlex-selector-openhermes-10w",  # split = validation
        # Embedding model used to encode the candidate and query samples
        embed_model="Qwen/Qwen3-Embedding-0.6B",
        # Supported methods:
        #   auto (tries vllm first, then sentence-transformers)
        #   vllm
        #   sentence-transformer
        embed_method="auto",
        batch_size=32,
        save_indices_path="top_indices.npy",
        max_K=1000,
    )
    near.selector()
Note: embed_model is used to encode the already-tokenized text into sentence embeddings (e.g., 1024-dim); both vLLM and sentence-transformers inference are supported.
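The "auto" behavior (try vLLM first, then sentence-transformers) can be pictured with the sketch below. It is not DataFlex's implementation: `resolve_embed_method` is a hypothetical helper that only checks whether a backend package is importable, whereas the real selector may also fall back when a backend fails at load time.

```python
import importlib.util

def resolve_embed_method(embed_method="auto"):
    """Hypothetical helper: decide which embedding backend name to use."""
    if embed_method != "auto":
        return embed_method  # an explicit choice is returned unchanged
    # Preference order described in the text: vllm, then sentence-transformers.
    backends = (("vllm", "vllm"), ("sentence-transformer", "sentence_transformers"))
    for name, module in backends:
        if importlib.util.find_spec(module) is not None:
            return name
    raise RuntimeError("Neither vllm nor sentence-transformers is installed")

print(resolve_embed_method("vllm"))
```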
Output: an index matrix, saved to save_indices_path, that contains the indices of the max_K closest candidates for each query.
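A quick way to inspect such an output is sketched below. It assumes the file holds an integer array of shape (num_queries, max_K), which matches the description above but is not confirmed by the source; the demo array stands in for a real `top_indices.npy`.

```python
import os
import tempfile

import numpy as np

# Stand-in for the selector's output: 3 queries x max_K=2 candidate indices.
demo = np.array([[5, 2], [2, 7], [5, 9]])
path = os.path.join(tempfile.mkdtemp(), "top_indices.npy")
np.save(path, demo)

indices = np.load(path)          # assumed shape: (num_queries, max_K)
selected = np.unique(indices)    # sorted, de-duplicated candidate indices
print(indices.shape, selected)
```

Because several queries can share the same neighbors, the number of distinct selected candidates is usually smaller than num_queries × max_K.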
4. Key Hyperparameters & Tips
| Parameter | Typical Range | Meaning & Tips |
|---|---|---|
| max_K | 64–10000 | Upper bound on nearest-neighbor retrieval. Larger values give more stable results but cost more; balance against data size and VRAM. |
| embed_model | - | Path/name of the sentence encoder (local BERT/USE/SimCSE, etc.). |
| cache_dir | - | Cache directory for intermediate artifacts and resume‑from‑cache. |
5. Component Config (components.yaml)
Path: DataFlex/src/dataflex/configs/components.yaml
Preset example
near:
  name: near
  params:
    indices_path: ./src/dataflex/offline_selector/top_indices.npy
    cache_dir: ../dataflex_saves/near_output
6. Dynamic Training Config (LoRA + NEAR)
Example file: DataFlex/examples/train_lora/selectors/near.yaml
### model
model_name_or_path:
trust_remote_code: true
### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
lora_rank: 16
lora_alpha: 8
### dataset
dataset: # training dataset
template: qwen
cutoff_len: 4096
overwrite_cache: true
preprocessing_num_workers: 16
### output
output_dir: ../dataflex_saves
logging_steps: 10
save_steps: 100
plot_loss: true
overwrite_output_dir: true
### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 16
learning_rate: 1.0e-4
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
### Dataflex args
train_type: dynamic_select
components_cfg_file: src/dataflex/configs/components.yaml
component_name: near
warmup_step: 400
update_step: 500
update_times: 2
Notes:
- component_name: near enables the NEAR component.
- warmup_step / update_step / update_times decide when and how often to re‑select the training subset; total steps ≈ warmup_step + update_step × update_times.
- Total batch size = number of devices × per_device_train_batch_size × gradient_accumulation_steps.
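As a worked example of the two formulas in the notes, the sketch below plugs in the values from the config above. The device count of 4 is an assumption made for illustration, and `schedule_summary` is a hypothetical helper, not part of DataFlex.

```python
def schedule_summary(warmup_step, update_step, update_times,
                     num_devices, per_device_batch, grad_accum):
    """Return (approximate total steps, total batch size) for a dynamic-select run."""
    total_steps = warmup_step + update_step * update_times
    total_batch = num_devices * per_device_batch * grad_accum
    return total_steps, total_batch

# Values from near.yaml; num_devices=4 is assumed.
total_steps, total_batch = schedule_summary(
    warmup_step=400, update_step=500, update_times=2,
    num_devices=4, per_device_batch=2, grad_accum=16,
)
print(total_steps, total_batch)  # 1400 128
```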
7. Run Training
FORCE_TORCHRUN=1 DISABLE_VERSION_CHECK=1 dataflex-cli train examples/train_lora/selectors/near.yaml
Note: the example above runs with a distributed launch.
During training, NEAR is triggered at the scheduled steps: it reads the precomputed sample indices and selects the next training subset from them.
8. Merge & Export the Model
Same as the Less Selector pipeline.
Config file: DataFlex/examples/merge_lora/llama3_lora_sft.yaml
model_name_or_path: base model path
adapter_name_or_path: finetuned adapter path
template: qwen
trust_remote_code: true
export_dir: ../dataflex_saves
export_size: 5
export_device: cpu
export_legacy_format: false
Run the export command (inside the LLaMA‑Factory directory):
llamafactory-cli export llama3_lora_sft.yaml
9. Evaluation & Comparison
We recommend using the DataFlow QA evaluation pipeline to compare NEAR against Less and random sampling.

