General Generate Operators
2025-06-24
DataFlow currently integrates five text data generators, covering formats such as pretraining document data, SFT-format data, and multi-turn dialogues.
Name | Applicable Type | Description | Repository or Paper |
---|---|---|---|
PretrainGenerator | Pretrain | Synthesize Phi-4-style question-answer pairs from pretraining documents, retelling each document in QA format | Paper |
SFTGeneratorSeed | SFT | Synthesize SFT-format QA pairs from seed documents and return the original source information | - |
CondorGenerator | SFT | Two-stage synthesis of SFT-format data from scratch based on preset knowledge tree labels (increasing label variety is recommended when generating more than 5000 samples) | Paper |
PromptedGenerator | - | Generate data based on user-defined prompts | - |
ConsistentChatGenerator | Multi-turn Dialogue | Two-stage synthesis of multi-turn dialogue data from scratch based on preset topics and human intents (increasing label variety is recommended when generating more than 9000 samples) | Paper |
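
To make the idea behind a prompt-driven generator such as PromptedGenerator more concrete, the following is a minimal conceptual sketch of the pattern: a user-defined prompt is combined with each input record, and the model's reply is stored in a new field. It is not DataFlow's actual API. The function `prompted_generate`, the field names `raw_content` and `generated_content`, and the `fake_llm` stand-in are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for an LLM backend: any callable that maps a
# prompt string to a completion string. A real client would go here.
LLMFn = Callable[[str], str]


def prompted_generate(
    rows: List[Dict[str, str]],
    user_prompt: str,
    llm: LLMFn,
    input_key: str = "raw_content",          # assumed field name
    output_key: str = "generated_content",   # assumed field name
) -> List[Dict[str, str]]:
    """Apply one user-defined prompt to every row and store the reply."""
    results = []
    for row in rows:
        # Combine the user-defined prompt with the row's input text.
        prompt = f"{user_prompt}\n\n{row[input_key]}"
        new_row = dict(row)  # copy so the input rows are not mutated
        new_row[output_key] = llm(prompt)
        results.append(new_row)
    return results


def fake_llm(prompt: str) -> str:
    """Toy model used only for this sketch: echoes the last prompt line."""
    return "ECHO: " + prompt.splitlines()[-1]


rows = [{"raw_content": "hello"}, {"raw_content": "world"}]
generated = prompted_generate(rows, "Rewrite the text below.", fake_llm)
for item in generated:
    print(item)
```

Passing the model as a plain callable keeps the sketch independent of any particular serving backend; in practice the callable would wrap whichever LLM client the pipeline is configured to use.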