PDF-to-Model Model Simulation Pipeline
2025-08-30
Quick Start
conda create -n dataflow python=3.10
conda activate dataflow
git clone https://github.com/OpenDCAI/DataFlow.git
cd DataFlow
# Prepare environment (the quotes keep shells such as zsh from expanding the brackets)
pip install -e ".[llamafactory]"
# MinerU 2.5 is supported. If you only need the pipeline backend, skip the flash-attn wheel download below and go straight to model preparation.
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu121torch2.4cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.8.3+cu121torch2.4cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
# Prepare models
mineru-models-download
cd ..
mkdir run_dataflow
cd run_dataflow
# Initialize
dataflow pdf2model init
# Train
dataflow pdf2model train
# Chat with the trained model, or chat with locally trained models in workspace directory
dataflow chat
Step 1: Install DataFlow Environment
conda create -n dataflow python=3.10
conda activate dataflow
git clone https://github.com/OpenDCAI/DataFlow.git
cd DataFlow
pip install -e ".[llamafactory]"
# Supports mineru2.5. If you only want to run the pipeline backend, you can skip downloading the whl file and proceed directly to model preparation
# Download the flash-attn wheel that matches your environment (Python, torch, and CUDA versions).
# For example, for Python 3.10, torch 2.4, and CUDA 12.1, use: https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu121torch2.4cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
# Version selection URL: https://github.com/Dao-AILab/flash-attention/releases
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu121torch2.4cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.8.3+cu121torch2.4cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
Step 2: Create New DataFlow Working Directory
# Exit project root directory
cd ..
mkdir run_dataflow
cd run_dataflow
Step 3: Setup Dataset
Place your dataset, as PDF files of a suitable size, in the working directory (run_dataflow).
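Before initializing, it can help to confirm that the PDFs are actually where the pipeline will look. The following is a minimal sketch, not part of DataFlow: `count_pdfs` is a hypothetical helper, and it only inspects the top level of the directory, since whether the pipeline also searches subdirectories is not stated here.

```shell
# Hypothetical helper: count the PDF files directly inside a directory
# (defaults to the current directory). Subdirectories are not searched.
count_pdfs() {
    dir="${1:-.}"
    find "$dir" -maxdepth 1 -type f -iname '*.pdf' | wc -l | tr -d ' '
}

# Report the count, then list each PDF with its size.
echo "PDF files found: $(count_pdfs .)"
find . -maxdepth 1 -type f -iname '*.pdf' -exec ls -lh {} +
```

Run it from inside run_dataflow; a count of 0 means the dataset has not been copied in yet.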
Step 4: Initialize dataflow-pdf2model
# Initialize
# --cache sets the location of the .cache directory (optional)
# Default: the current directory
dataflow pdf2model init
After initialization is complete, the project directory becomes:
Project Root/
├── pdf_to_qa_pipeline.py # pipeline execution file
└── .cache/ # cache directory
    └── train_config.yaml # default config file for llamafactory training
Step 5: One-Click Fine-tuning
# --lf_yaml sets the path to the llamafactory YAML config file used for training (optional)
# Default: .cache/train_config.yaml
dataflow pdf2model train
After fine-tuning is complete, the project directory becomes:
Project Root/
├── pdf_to_qa_pipeline.py # pipeline execution file
└── .cache/ # cache directory
    ├── train_config.yaml # default config file for llamafactory training
    ├── data/
    │   ├── dataset_info.json
    │   └── qa.json
    ├── gpu/
    │   ├── batch_cleaning_step_step1.json
    │   ├── batch_cleaning_step_step2.json
    │   ├── batch_cleaning_step_step3.json
    │   ├── batch_cleaning_step_step4.json
    │   └── pdf_list.jsonl
    ├── mineru/
    │   └── sample/auto/
    └── saves/
        └── pdf2model_cache_{timestamp}/
Step 6: Chat with Fine-tuned Model
# Method 1: Specify model path with --model flag (optional)
# Default path: .cache/saves/pdf2model_cache_{timestamp}
dataflow chat --model ./custom_model_path
# Method 2: Navigate to workspace directory and run dataflow chat
dataflow chat
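When several training runs have accumulated under .cache/saves, it may be convenient to look up the newest one and pass it to --model explicitly. The sketch below is an illustration only: `latest_model_dir` is a hypothetical helper, and it assumes the most recently modified pdf2model_cache_* directory is the run you want.

```shell
# Hypothetical helper: print the most recently modified run directory
# under a saves folder (defaults to .cache/saves). Prints nothing if none exist.
latest_model_dir() {
    saves="${1:-.cache/saves}"
    # ls -td lists the matching directories newest first; head keeps the first.
    ls -td "$saves"/pdf2model_cache_*/ 2>/dev/null | head -n 1
}

model_dir="$(latest_model_dir)"
if [ -n "$model_dir" ]; then
    echo "Latest fine-tuned model: $model_dir"
    # Then start a chat session with it:
    #   dataflow chat --model "$model_dir"
else
    echo "No pdf2model_cache_* directory found under .cache/saves"
fi
```

Sorting by modification time avoids relying on the exact format of the {timestamp} suffix, which is not specified here.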
