General Generate Operators

2025-06-24

Currently, Dataflow integrates five text data generators, covering various formats such as pretraining document data, SFT-format data, and multi-turn dialogues.

Name	Applicable Type	Description	Repository or Paper
PretrainGenerator	Pretrain	Synthesize phi-4-style question-answer pairs from pretraining document data, retelling each document in QA format	Paper
SFTGeneratorSeed	SFT	Synthesize SFT-format QA pairs from seed documents and return the original document content alongside them	-
CondorGenerator	SFT	Two-stage synthesis of SFT-format data from scratch based on preset knowledge tree labels (recommend increasing label variety if generating more than 5000 samples)	Paper
PromptedGenerator	-	Generate data based on user-defined prompts	-
ConsistentChatGenerator	Multi-turn Dialogue	Two-stage synthesis of multi-turn dialogue data from scratch based on preset topics and human intents (recommend increasing label variety if generating more than 9000 samples)	Paper

Operator Interface Usage Instructions

For operators that read or write storage or call models, we provide encapsulated model interfaces and storage object interfaces. You can predefine the model API parameters for operators as follows:

from dataflow.llmserving import APILLMServing_request

# Model interface shared by the operators below
api_llm_serving = APILLMServing_request(
    api_url="your_api_url",
    model_name="model_name",
    max_workers=5,
)

You can predefine storage parameters for operators in the following way:

from dataflow.utils.storage import FileStorage

# Usually assigned inside a pipeline class's __init__, hence self.storage
self.storage = FileStorage(
    first_entry_file_name="your_file_path",
    cache_path="./cache",
    file_name_prefix="dataflow_cache_step",
    cache_type="jsonl",  # jsonl, json, ...
)

The api_llm_serving and self.storage used in the following text are the interface objects defined here. Complete usage examples can be found in test/test_general_text.py.

As for parameters, an operator's constructor takes the operator's configuration, which is set once and reused across calls, while the X.run() function takes the IO-related information for each call. Details can be seen in the operator description examples below.
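The split between constructor configuration and run() arguments can be sketched with a toy operator. EchoOperator and its prefix parameter are hypothetical and exist only for illustration; real operators take an llm_serving object and a storage object instead of plain lists.

```python
class EchoOperator:
    """Hypothetical operator used only to illustrate the calling convention."""

    def __init__(self, prefix):
        # Configuration: set once when the operator is constructed.
        self.prefix = prefix

    def run(self, rows, input_key="raw_content", output_key="generated_content"):
        # IO details: which field to read and which field to write on this call.
        for row in rows:
            row[output_key] = f"{self.prefix}{row[input_key]}"
        return output_key


# Configure once, then call run() several times with different IO settings.
op = EchoOperator(prefix="echo: ")
batch_a = [{"raw_content": "first"}]
batch_b = [{"text": "second"}]
op.run(batch_a)
op.run(batch_b, input_key="text", output_key="out")
```

The same instance serves both batches; only the field names passed to run() change.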

Detailed Operator Descriptions

1. PretrainGenerator✨

Function Description: This operator is specifically designed to generate pretraining format multi-turn dialogue Q&A data based on given document content. It converts raw document content into dialogue format data suitable for language model pretraining by calling large language models to reorganize and express document content.

Input Parameters:

__init__()
- llm_serving: Large language model interface object to use (required, must implement LLMServingABC interface)
run()
- storage: Storage interface object (default: predefined value above)
- input_key: Input document content field name (default: "raw_content")
- output_key: Output generated content field name (default: "generated_content")

Key Features:

Supports content conversion for multiple document formats
Automatically generates dialogue format data suitable for pretraining
Maintains integrity of core information from original documents
Supports batch processing of large-scale document data
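The operator reads each document from the field named by input_key. Below is a minimal sketch of preparing such an input file, assuming the jsonl cache type shown earlier means one JSON object per line. The file name seed_documents.jsonl and the sample texts are made up for illustration.

```python
import json
import os
import tempfile

# Illustrative documents; in practice these come from your own corpus.
documents = ["First document text.", "Second document text."]

# Write one JSON object per line, with the text under the default input_key.
path = os.path.join(tempfile.mkdtemp(), "seed_documents.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for text in documents:
        f.write(json.dumps({"raw_content": text}, ensure_ascii=False) + "\n")

# Read the file back to confirm each line parses and carries the field.
with open(path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
```

The resulting path is what would be passed as first_entry_file_name when constructing the storage object.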

Usage Example:

# Import path assumed to match the ConsistentChatGenerator example; adjust for your version
from dataflow.operators.general_text import PretrainGenerator

pretrain_gen = PretrainGenerator(
    llm_serving=api_llm_serving,
)
result = pretrain_gen.run(
    storage=self.storage.step(),
    input_key="raw_content",
    output_key="generated_content",
)

2. SFTGeneratorSeed✨

Function Description: This operator generates supervised fine-tuning format Q&A data based on given document content and supports user-defined content generation requirements. It extracts information from raw documents to generate instruction-response pairs in SFT format, particularly suitable for building high-quality supervised fine-tuning datasets.

Input Parameters:

__init__()
- llm_serving: Large language model interface object to use (required, must implement LLMServingABC interface)
- custom_prompt: User-defined custom prompt (required, defines specific requirements for generated content)
run()
- storage: Storage interface object (default: predefined value above)
- input_key: Input document content field name (default: "raw_content")

Key Features:

Supports user-defined content generation requirements
Automatically extracts and parses JSON format instruction-response pairs
Preserves original document content for traceability
Intelligently filters invalid generation results
Supports long text generation up to 4096 tokens

Output Format:

DataFrame containing 'instruction', 'output', and 'raw_content' fields
Returns list containing 'instruction' and 'output' field names
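The parsing and filtering behavior described above can be sketched as follows. This is not the operator's actual implementation: the helper name parse_sft_response and the exact validity rules (non-empty string fields) are assumptions chosen for illustration.

```python
import json
from typing import Optional


def parse_sft_response(response: str, raw_content: str) -> Optional[dict]:
    """Return an instruction/output/raw_content row, or None if the response is invalid."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return None  # not valid JSON at all
    if not isinstance(data, dict):
        return None
    instruction = data.get("instruction")
    output = data.get("output")
    # Both fields must be non-empty strings to count as a usable pair.
    if not isinstance(instruction, str) or not isinstance(output, str):
        return None
    if not instruction.strip() or not output.strip():
        return None
    # Keep the source document next to the pair for traceability.
    return {"instruction": instruction, "output": output, "raw_content": raw_content}


responses = [
    ('{"instruction": "What is X?", "output": "X is a placeholder."}', "doc 1"),
    ("not json at all", "doc 2"),
    ('{"instruction": "", "output": "missing question"}', "doc 3"),
]
rows = []
for response, doc in responses:
    row = parse_sft_response(response, doc)
    if row is not None:
        rows.append(row)
```

Only the first response survives; the other two are dropped rather than written to the output.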

Usage Example:

# Import path assumed to match the ConsistentChatGenerator example; adjust for your version
from dataflow.operators.general_text import SFTGeneratorSeed

sft_gen = SFTGeneratorSeed(
    llm_serving=api_llm_serving,
    custom_prompt="Please generate educational Q&A pairs based on document content",
)
result_keys = sft_gen.run(
    storage=self.storage.step(),
    input_key="raw_content",
)

3. CondorGenerator✨🚀

Function Description: This operator generates SFT format data from scratch through a two-stage process based on predefined knowledge tree tags. The first stage generates questions of varying difficulty levels (Easy, Medium, Hard) based on randomly selected topics, domains, and theme tags, while the second stage generates corresponding detailed answers for each question.

Input Parameters:

__init__()
- llm_serving: Large language model interface object to use (required, must implement LLMServingABC interface)
- num_samples: Total number of samples to generate (default: 15, recommended to be less than 5000 to ensure data quality)
run()
- storage: Storage interface object (default: predefined value above)

Key Features:

Two-stage generation process ensures question-answer quality
Supports three difficulty levels of question generation
Ensures content diversity based on predefined knowledge tree tags
Automatically parses and formats generation results
Intelligent error handling and logging

Generation Process:

Question Generation Stage: Generates three difficulty levels of questions based on randomly selected topic, domain, and theme
Answer Generation Stage: Generates corresponding detailed answers for each valid question
Data Organization Stage: Organizes questions and answers into standard SFT format
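The three stages above can be sketched with a stand-in model. Everything here is illustrative: TAGS is a made-up tag tree (not the real CondorPrompt tags), fake_llm replaces the actual model call, and the real operator may batch its requests differently.

```python
import random

# Hypothetical tag tree: topic -> domain -> themes.
TAGS = {
    "Science": {"Physics": ["Optics", "Mechanics"]},
    "Daily Life": {"Cooking": ["Baking", "Food Safety"]},
}
DIFFICULTIES = ["Easy", "Medium", "Hard"]


def fake_llm(prompt):
    """Stand-in for a real model call."""
    return f"[model output for: {prompt}]"


def generate_samples(num_samples, seed=0):
    rng = random.Random(seed)
    rows = []
    # Each tag draw yields one question per difficulty level.
    num_draws = -(-num_samples // len(DIFFICULTIES))  # ceiling division
    for _ in range(num_draws):
        topic = rng.choice(list(TAGS))
        domain = rng.choice(list(TAGS[topic]))
        theme = rng.choice(TAGS[topic][domain])
        for difficulty in DIFFICULTIES:
            # Stage 1: question generation.
            question = fake_llm(f"{difficulty} question about {topic}/{domain}/{theme}")
            if not question.strip():
                continue  # skip invalid questions
            # Stage 2: answer generation.
            answer = fake_llm(f"Answer in detail: {question}")
            # Stage 3: organize into SFT-style rows.
            rows.append({"difficulty": difficulty, "instruction": question, "output": answer})
    return rows[:num_samples]


samples = generate_samples(6)
```

With a small tag tree, repeated draws quickly revisit the same topic, domain, and theme, which is why larger runs benefit from more tags.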

Output Format:

DataFrame containing 'difficulty', 'instruction', and 'output' fields
difficulty field identifies question difficulty level (Easy/Medium/Hard)

Usage Example:

# Import path assumed to match the ConsistentChatGenerator example; adjust for your version
from dataflow.operators.general_text import CondorGenerator

condor_gen = CondorGenerator(
    llm_serving=api_llm_serving,
    num_samples=150,  # Will generate approximately 150 Q&A pairs
)
result_df = condor_gen.run(
    storage=self.storage.step(),
)

Important Notes:

When generating more than 5000 samples, it is recommended to increase the number of tags in dataflow.prompts.general_text.CondorPrompt to improve data richness
The operator automatically handles failed parsing responses to ensure output data validity

4. PromptedGenerator✨

Function Description: This operator generates data based on user-provided prompts, combining system prompts and input content to generate desired output text. It provides maximum flexibility, allowing users to fully customize generation logic and output formats.

Input Parameters:

__init__()
- llm_serving: Large language model interface object to use (required, must implement LLMServingABC interface)
- system_prompt: System prompt defining model behavior (default: "You are a helpful agent.")
run()
- storage: Storage interface object (default: predefined value above)
- input_key: Input content field name (default: "raw_content")
- output_key: Output generated content field name (default: "generated_content")

Key Features:

Fully customizable prompt control
Flexible input-output field configuration
Supports arbitrary format text generation tasks
Simple and direct combination of system prompt and input content
Batch processing capability

Working Principle:

Directly concatenates system prompt with input content
Calls LLM to generate corresponding output content
Adds generation results to specified output field
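A minimal sketch of these three steps, using plain dictionaries in place of the storage object and a stand-in for the model. The names run_prompted and fake_llm_batch are hypothetical, and whether the real operator inserts a separator between the system prompt and the content is not specified here, so none is added.

```python
def fake_llm_batch(prompts):
    """Stand-in for a batched model call; returns one output per prompt."""
    return ["OUT:" + p for p in prompts]


def run_prompted(rows, system_prompt, input_key="raw_content", output_key="generated_content"):
    # Step 1: concatenate the system prompt with each row's input content.
    prompts = [system_prompt + str(row[input_key]) for row in rows]
    # Step 2: call the model on the whole batch.
    outputs = fake_llm_batch(prompts)
    # Step 3: write each result into the requested output field.
    for row, out in zip(rows, outputs):
        row[output_key] = out
    return output_key


rows = [{"raw_content": "hello"}, {"raw_content": "world"}]
key = run_prompted(rows, system_prompt="Summarize: ", output_key="summary")
```

Because the prompt is a plain concatenation, any instruction formatting (trailing colon, newline, delimiters) has to be part of the system prompt itself.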

Usage Example:

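# Import path assumed to match the ConsistentChatGenerator example; adjust for your version
from dataflow.operators.general_text import PromptedGenerator

prompted_gen = PromptedGenerator(
    llm_serving=api_llm_serving,
    system_prompt="You are a professional document summarizer. Please generate a concise summary for the following content:",
)
result_key = prompted_gen.run(
    storage=self.storage.step(),
    input_key="raw_content",
    output_key="summary",
)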

5. ConsistentChatGenerator✨

Function Description: This operator synthesizes multi-turn dialogue data from scratch using a two-stage process based on predefined topics and user intents. In the first stage, it generates user queries under a specific topic and intent; in the second stage, it produces assistant replies for each query. It is ideal for constructing large-scale dialogue datasets with strong consistency and clearly defined categories.

Input Parameters:

__init__()
- llm_serving: An instance of an LLM interface implementing the LLMServingABC protocol (required)
- num_dialogs_per_intent: Number of dialogues to generate per intent (default: 20, recommended ≤ 1000)
- num_turns_per_dialog: Number of turns per dialogue (default: 6)
- temperature: Sampling temperature controlling generation randomness (default: 0.9)
run()
- storage: The storage interface object (default: uses predefined context)

Key Features:

Predefined combinations of topics and intents, covering multiple domains
Two-stage generation: user queries first, assistant responses second
Auto-cleaning of malformed or invalid generations
Supports large-scale synthesis (recommended < 9000 dialogues; extend topic tags for more)
Generates standardized multi-turn dialogue format compatible with SFT training

Output Format:

A DataFrame with category and conversation fields

The conversation field is a list of multi-turn Q&A items. Each turn follows the structure:

[
  {"role": "user", "value": "question"},
  {"role": "assistant", "value": "answer"},
  ...
]
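When consuming this output downstream, a quick structural check can catch malformed rows. The helper below is a sketch under the assumption that messages strictly alternate user and assistant, starting with user; is_valid_conversation is a hypothetical name, not part of the library.

```python
def is_valid_conversation(conversation):
    """Check that messages alternate user/assistant and carry non-empty text."""
    if not isinstance(conversation, list) or not conversation:
        return False
    if len(conversation) % 2 != 0:
        return False  # every user message needs an assistant reply
    for index, message in enumerate(conversation):
        expected_role = "user" if index % 2 == 0 else "assistant"
        if not isinstance(message, dict) or message.get("role") != expected_role:
            return False
        value = message.get("value")
        if not isinstance(value, str) or not value.strip():
            return False
    return True


good = [
    {"role": "user", "value": "question"},
    {"role": "assistant", "value": "answer"},
]
bad = [
    {"role": "user", "value": "question"},
    {"role": "user", "value": "another question"},
]
good_ok = is_valid_conversation(good)
bad_ok = is_valid_conversation(bad)
```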

Usage Example:

from dataflow.operators.general_text import ConsistentChatGenerator

consistent_gen = ConsistentChatGenerator(
    llm_serving=api_llm_serving,
    num_dialogs_per_intent=30,
    num_turns_per_dialog=4,
    temperature=0.85
)

result_df = consistent_gen.run(
    storage=self.storage.step()
)

Notes:

When generating more than 9000 dialogues, it is recommended to expand the topic_dict in ConsistentChatPrompt to improve the diversity and coverage of the generated conversations
The operator automatically skips any malformed or unparseable generations, maintaining a consistent and reliable dialogue structure
During multi-turn conversation generation, the operator invokes the LLM API twice for each dialogue (once for user questions and once for assistant responses), so a stable and responsive LLM service is essential