Quick Start
DataFlow employs a "code generation" paradigm similar to that of `create-react-app` or `vue-cli`. This means that, through a command-line invocation, it automatically generates the necessary scripts and entry Python files. After customizing these files (for example, by changing the dataset, using a different large model API, or re-tuning operators), you can run the Python file to execute the corresponding functions.
Specifically, after successfully installing DataFlow as described in the previous section, choose an empty working directory to get started. Navigate to that directory and execute:
```bash
dataflow init
```
This command will generate three folders named `cpu`, `api`, and `gpu` in your current working directory, as well as an `example_data` folder that stores the default sample data.
Each of our pre-configured Pipelines is provided in three modes, placed in these three folders respectively. They are categorized based on the resource types required by the operators in the Pipeline, as shown in the table below:
| User Category | Operators that only require CPU | Operators that require a large model API | Operators that require a locally deployed GPU |
| --- | --- | --- | --- |
| `cpu` | √ | | |
| `api` | √ | √ | |
| `gpu` | √ | √ | √ |
The same-named Pipelines in different folders have an inclusive relationship. Specifically, the Pipeline in the `gpu` folder is the most comprehensive, containing all the functions. Removing the operators that require a locally deployed GPU model yields the Pipeline in the `api` folder, and further removing the operators that require a large model backend yields the Pipeline in the `cpu` folder.
Notably, the `api` Pipeline can be switched to a locally deployed GPU model (such as Qwen-3, Llama, etc.) by changing the `LLMServing` object within it. Compared to the `gpu` Pipeline, the operators removed from the `api` Pipeline are mainly those that call unconventional LLM models which cannot be deployed with the `vllm` backend.
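For example, a minimal sketch of such a swap is shown below. The class name `LocalModelLLMServing_vllm`, the import path, and the constructor parameters are assumptions for illustration; check which serving classes the generated pipeline script in your DataFlow version actually imports.

```python
# Hedged sketch: replacing an API-backed serving object with a locally
# deployed, vllm-served model inside a pipeline's __init__.
# Class name, import path, and parameters below are illustrative only.
from dataflow.serving import LocalModelLLMServing_vllm  # assumed import path

# Instead of constructing an API-backed LLMServing object, point the
# pipeline at a local checkpoint (e.g. a Qwen-3 or Llama model):
self.llm_serving = LocalModelLLMServing_vllm(
    hf_model_name_or_path="Qwen/Qwen2.5-7B-Instruct",  # local path or HF model id (illustrative)
    vllm_max_tokens=2048,                              # generation length limit (illustrative)
)
```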
Subsequently, by navigating to the corresponding folder, you can find the Python files for our pre-configured Pipelines.
For these files, the default input dataset is the `json` file stored in the `example_data` folder. You can change the `first_entry_file_name` field of the `storage` object to point it to your own raw dataset.
```python
self.storage = FileStorage(
    first_entry_file_name="../example_data/AgenticRAGPipeline/pipeline_small_chunk.json",
    cache_path="./cache_local",              # Cache path
    file_name_prefix="dataflow_cache_step",  # Prefix for cache file names
    cache_type="json",                       # File type for intermediate cache files
)
```
Additionally, you may need to modify the `LLMServing` object according to your hardware or the `api_url` you have access to, in order to use a locally downloaded model or an online large model API.
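For the API route, a configuration roughly like the sketch below is typical. The class name `APILLMServing_request`, the import path, and the parameters are assumptions for illustration; adapt them to whatever serving class the generated script actually uses:

```python
# Hedged sketch: an API-backed serving object inside a pipeline's __init__.
# Class name, import path, and parameters are illustrative only.
from dataflow.serving import APILLMServing_request  # assumed import path

self.llm_serving = APILLMServing_request(
    api_url="https://api.openai.com/v1/chat/completions",  # your provider's endpoint
    model_name="gpt-4o",                                    # a model available at that endpoint
)
```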
If you are using the API method, you need to set the `api_key` environment variable. On Linux, this can be done with:
```bash
export api_key=sh-xxxxx
```
On Windows, you can set the environment variable using the following command:
```cmd
set api_key=sh-xxxxx
```
Or in PowerShell:
```powershell
$env:api_key = "sh-xxxxx"
```
After setting this, the program can read the API key from the environment when it calls the API. Be sure not to expose the key in public code.
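If you want to confirm the variable is visible before starting a long run, a quick standalone check like the following (plain Python, nothing DataFlow-specific) can save a failed launch:

```python
import os

# Fail fast if the api_key environment variable has not been set.
if not os.environ.get("api_key"):
    raise RuntimeError("api_key is not set; export it before running the pipeline.")
print("api_key found in environment.")
```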
Once you have modified the Python script, you can run it to experience DataFlow's data governance capabilities:
```bash
python reasoning_pipeline.py
```