Quick Start – dataflow init
Code-Generation-Based Workflow
DataFlow follows a code generation + customization + script execution workflow, similar to create-react-app / vue-cli.
By invoking a CLI command, DataFlow automatically generates default runtime scripts and entry Python files. After user customization (e.g., changing datasets, switching LLM APIs, or reordering operators), you simply run the Python script to execute the desired functionality.
You only need three steps to run the SoTA Pipelines we provide.
1. Initialize a Project
In an empty directory, run:
```bash
dataflow init
```

This will generate the following directories in your working path:
```bash
$ tree -L 1
.
|-- api_pipelines
|-- core_text
|-- cpu_pipelines
|-- example_data
|-- gpu_pipelines
|-- playground
`-- simple_text_pipelines
```

Directory overview:
- `cpu_pipelines`: Pipelines that run using CPU only
- `core_text`: Examples of the most fundamental DataFlow operators
- `api_pipelines`: Pipelines that use online LLM APIs (recommended for beginners)
- `gpu_pipelines`: Pipelines that use locally deployed GPU models
- `example_data`: Default input datasets for all example pipelines
- `playground`: Lightweight examples that do not form full pipelines
- `simple_text_pipelines`: Simple text-processing pipeline examples
2. Pipeline Categories (Choose One)
Pipelines with the same name in different directories form an inclusive hierarchy: each tier requires everything the previous tier does, plus additional resources:
| Directory | Required Resources |
|---|---|
| cpu_pipelines | CPU only |
| api_pipelines | CPU + LLM API |
| gpu_pipelines | CPU + API + Local GPU |
Recommendation for beginners: start directly with `api_pipelines`.
If you later have access to a GPU, you can simply replace the LLMServing with a local model.
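To make that swap concrete, here is a rough sketch of where the change happens. The values are illustrative and the local serving class is a placeholder, not the library's actual API; consult the scripts in `gpu_pipelines` for the real import and parameters.

```python
# API-backed serving, as used by the api_pipelines scripts (values are illustrative):
llm_serving = APILLMServing_request(
    api_url="https://api.openai.com/v1/chat/completions",
    key_name_of_api_key="DF_API_KEY",
    model_name="gpt-4o"
)

# With a GPU available, you would construct a local serving object here instead and
# leave the rest of the pipeline unchanged. The class below is a placeholder name;
# see the gpu_pipelines examples for the actual serving class and arguments.
# llm_serving = SomeLocalLLMServing(model_name_or_path="Qwen/Qwen2.5-7B-Instruct")
```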
3. Run Your First Prebuilt Pipeline
Enter any pipeline directory, for example:
```bash
cd api_pipelines
```

Open one of the Python files. In most cases, you only need to care about two configurations:
(1) Input Dataset Path
```python
self.storage = FileStorage(
    first_entry_file_name="<path_to_dataset>"
)
```

By default, this points to the example dataset we provide and can be run directly. You may change it to your own dataset path to process your own data.
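For example, to run on your own data you would typically change only this path. The file name and extension below are illustrative; keep your file in the same format as the bundled example under `example_data/`.

```python
self.storage = FileStorage(
    first_entry_file_name="./my_data/my_dataset.jsonl"  # illustrative path, not part of the generated script
)
```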
(2) LLM Serving
If you are using an API-based LLM, you need to set an environment variable first:
Linux / macOS:

```bash
export DF_API_KEY=sk-xxxxx
```

Windows CMD:

```cmd
set DF_API_KEY=sk-xxxxx
```

PowerShell:

```powershell
$env:DF_API_KEY="sk-xxxxx"
```
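Before launching a run, you can optionally confirm that the variable is visible from Python. This is a minimal sketch that assumes the generated script's serving instance reads the key via `key_name_of_api_key="DF_API_KEY"`; verify the name used in your own script.

```python
import os

# Optional sanity check; assumes the serving is configured with
# key_name_of_api_key="DF_API_KEY" (check the name in your script).
if not os.environ.get("DF_API_KEY"):
    raise SystemExit("DF_API_KEY is not set; export it before running the pipeline")
```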
Then simply run the script:

```bash
python xxx_pipeline.py
```

4. Multiple API Servings (Optional)
If you need to use multiple LLM APIs at the same time, you can assign a different environment variable name to each serving instance:
```python
llm_serving_openai = APILLMServing_request(
    api_url="https://api.openai.com/v1/chat/completions",
    key_name_of_api_key="OPENAI_API_KEY",
    model_name="gpt-4o"
)

llm_serving_deepseek = APILLMServing_request(
    api_url="https://api.deepseek.com/v1/chat/completions",
    key_name_of_api_key="DEEPSEEK_API_KEY",
    model_name="deepseek-chat"
)
```

Then define the corresponding environment variables (e.g., OPENAI_API_KEY=sk-xxxxx) to enable multiple API servings to coexist.
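On Linux / macOS, for example, setting both variables might look like this (the key values are placeholders):

```bash
export OPENAI_API_KEY=sk-xxxxx
export DEEPSEEK_API_KEY=sk-xxxxx
```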

