PDF-to-Model Model Simulation Pipeline
About 1498 wordsAbout 5 min
2025-08-30
This pipeline is designed to help beginners quickly get started with fine-tuning models using PDF documents.
1.Overview
The Pdf-to-Model Fine-tuning Pipeline is an end-to-end large language model training solution designed to provide fully automated services from raw documents to deployable domain-specific models. The pipeline transforms heterogeneous-format, high-noise PDF documents into high-quality training data and performs parameter-efficient fine-tuning of large models based on this data, enabling models to achieve precise question-answering capabilities for specific domain knowledge.
The pipeline integrates advanced document processing technologies (MinerU, trafilatura), intelligent knowledge cleaning methods, and efficient fine-tuning strategies. It significantly enhances model performance in vertical domains while maintaining the general capabilities of base models. According to MIRIAD experimental validation, models trained with Multi-Hop QA format demonstrate excellent performance in complex question-answering scenarios requiring multi-step reasoning.
Document Parsing Engine: MinerU1 and partial functionality of MinerU 2.5. This pipeline is now fully compatible with Flash-MinerU. Compared to the native MinerU engine, Flash-MinerU offers significant advantages in parsing speed and high-concurrency processing.
Automated Format Conversion and Fine-tuning: The pipeline supports two data extraction and preparation paradigms: KBC (Text-based Knowledge Base) and VQA (Multimodal Visual Question Answering).
- KBC Mode: Focused on plain-text data cleaning and QA synthesis. It utilizes the Alpaca format for model fine-tuning, making it ideal for text-centric knowledge base scenarios.
- VQA Mode: Focused on multimodal data cleaning and QA synthesis. It utilizes the ShareGPT format for model fine-tuning, specifically optimized for textbook-style PDFs (e.g., Mathematics, Physics, and Chemistry textbooks or exam papers).
Supported Input Formats: PDF, Markdown, HTML, URL webpages.
Output Model: Adapter (compatible with any Qwen/Llama series base model).
2.Quick Start
conda create -n dataflow python=3.10
conda activate dataflow
git clone https://github.com/OpenDCAI/DataFlow.git
cd DataFlow
# prepare environment
pip install -e .[llamafactory]
#prepare models
mineru-models-download
cd ..
mkdir run_dataflow
cd run_dataflow
# Initialize
# KBC Mode: Initialize KBC knowledge cleaning pipeline (Default)
dataflow pdf2model init --qa="kbc"
# VQA Mode: Initialize VQA multimodal extraction pipeline (Optimized for textbooks/exams)
dataflow pdf2model init --qa="vqa"
# Train
dataflow pdf2model train
# Chat with the trained model, or chat with locally trained models in workspace directory
dataflow chat3. Pipeline Design
Main Pipeline Workflow
The Pdf-to-Model pipeline consists of two phases: initialization and execution, with the execution phase comprising 5 core steps:
Initialization Phase (dataflow pdf2model init)
Automatically generates a training configuration file (train_config.yaml) and customizable data processing scripts (pdf_to_model_pipeline.py), configuring default LoRA fine-tuning parameters, dataset paths, and model output directories.
--qa="kbc"(Default): Generates a pipeline focused on knowledge cleaning. This workflow emphasizes long-range logical text cleaning, intelligent chunking, and the production of Alpaca format data.--qa="vqa":Generates a pipeline focused on Visual Question Answering. This workflow leverages multimodal capabilities to parse charts, diagrams, and formulas within PDFs, producing ShareGPT format data.
Execution Phase (dataflow pdf2model train)
- Document Discovery: Automatically scans specified directories to identify all PDF files and generate an index list.
- Knowledge Extraction and Cleaning: Extracts textual information from PDF/Markdown/HTML/URL using tools like MinerU and trafilatura.
- QA Data Generation and Cleaning:
- KBC Mode: Performs refined cleaning of raw text (removing redundant tags, fixing formatting errors, and protecting sensitive information). It then utilizes a three-sentence sliding window to transform knowledge into Multi-Hop QA pairs requiring multi-step reasoning.
- VQA Mode: Transforms complex raw data into LLM-understandable inputs. For high-value content like textbooks and exam papers, it uses multi-threaded API calls to extract high-quality QA pairs from multimodal page blocks.
- Data Format Conversion: Converts the extracted QA data into Llama-Factory standard training formats (Alpaca or ShareGPT).
- Fine-tuning: Based on the generated QA data, uses LoRA (Low-Rank Adaptation) method to perform parameter-efficient fine-tuning of the base model, training model parameters and outputting a domain-specific model adapter ready for deployment.
Testing Phase (dataflow chat)
Model Dialogue Testing: Automatically loads the latest trained adapter and corresponding base model, launches an interactive dialogue interface, and supports real-time testing of model performance on domain-specific knowledge Q&A. Users can also specify a particular model path for testing using the --model parameter.
Step 1: Install DataFlow Environment
conda create -n dataflow python=3.10
conda activate dataflow
cd DataFlow
pip install -e .[pdf2model]Step 2: Create New DataFlow Working Directory
# Exit project root directory
cd ..
mkdir run_dataflow
cd run_dataflowStep 3: Setup Dataset
Place appropriately sized datasets (data files in PDF format) into the working directory.
Step 4: Initialize dataflow-pdf2model
# Initialize
# --cache can specify the location of .cache directory (optional)
# Default value is current folder directory
# Initialize with KBC mode (Default)
dataflow pdf2model init --qa="kbc"
# Initialize with VQA mode (For textbooks/exams)
dataflow pdf2model init --qa="vqa"💡After initialization is complete, the project directory becomes:
Project Root/
├── pdf_to_qa_pipeline.py # pipeline execution file
└── .cache/ # cache directory
└── train_config.yaml # default config file for llamafactory trainingStep 5: Set Parameters
🌟 Display common and important parameters:
Mode A: KBC Knowledge Cleaning (Default Mode)
self.storage = FileStorage(
first_entry_file_name="./.cache/pdf_list.jsonl", # Set the path for the default generated pdf_list.json
cache_path=str(cache_path / ".cache" / "gpu"),
file_name_prefix="batch_cleaning_step", # Prefix for created files
cache_type="jsonl", # Format of created files
)
# Flash-MinerU Backend (Recommended)
self.mineru_executor = FileOrURLToMarkdownConverterFlash(
intermediate_dir="../example_data/PDF2VQAPipeline/flash/",
mineru_model_path="<your Model Path>/MinerU2.5-2509-1.2B", # !!! place your local model path here !!!
# https://huggingface.co/opendatalab/MinerU2.5-2509-1.2B.
batch_size=4, # batchsize per vllm worker
replicas=1, # num of vllm workers
num_gpus_per_replica=0.5, # for ray to schedule vllm workers to GPU, can be float, e.g. 0.5 means each worker uses half GPU, 1 means each worker uses whole GPU
engine_gpu_util_rate_to_ray_cap=0.9 # actuall GPU utilization for each worker; acturall memory per worker= num_gpus_per_replica * engine_gpu_util_rate_to_ray_cap; this is to avoid OOM, you can set it to 0.9 or 0.8 to leave some buffer for other processes on
)
self.knowledge_cleaning_step2 = KBCChunkGeneratorBatch(
split_method="token", # Specify the splitting method
chunk_size=512, # Specify the chunk size
tokenizer_name="./Qwen2.5-7B-Instruct", # Path to the tokenizer model
)
self.knowledge_cleaning_step3 = KBCTextCleaner(
llm_serving=self.llm_serving,
lang="en"
)
self.knowledge_cleaning_step4 = Text2MultiHopQAGenerator(
llm_serving=self.llm_serving,
lang="en",
num_q = 5
)
self.extract_format_qa_step5 = QAExtractor(
qa_key="qa_pairs",
output_json_file="./.cache/data/qa.json", # Path for the output dataset
)For a more comprehensive guide on parameter settings, please refer to the descriptions in Case 8. Converting Massive PDFs to QAs.
Mode B: VQA Data Extraction
self.storage = FileStorage(
first_entry_file_name="./.cache/pdf_list.jsonl", # Set the path for the default generated pdf_list.json
cache_path="./cache",
file_name_prefix="vqa", # Prefix for created files
cache_type="jsonl", # Format of created files
)
self.llm_serving = APILLMServing_request(
api_url="http://<YOUR_SERVER_IP>:3000/v1/chat/completions", # API endpoint path
key_name_of_api_key="DF_API_KEY",
model_name="gemini-2.5-pro", # Ensure the API node supports this model
max_workers=100,
)
self.vqa_extract_prompt = QAExtractPrompt()
self.pdf_merger = PDF_Merger(output_dir="./cache")
# Flash-MinerU Backend (Recommended)
self.mineru_executor = FileOrURLToMarkdownConverterFlash(
intermediate_dir="../example_data/PDF2VQAPipeline/flash/",
mineru_model_path="<your Model Path>/MinerU2.5-2509-1.2B", # !!! place your local model path here !!!
# https://huggingface.co/opendatalab/MinerU2.5-2509-1.2B.
batch_size=4, # batchsize per vllm worker
replicas=1, # num of vllm workers
num_gpus_per_replica=0.5, # for ray to schedule vllm workers to GPU, can be float, e.g. 0.5 means each worker uses half GPU, 1 means each worker uses whole GPU
engine_gpu_util_rate_to_ray_cap=0.9 # actuall GPU utilization for each worker; acturall memory per worker= num_gpus_per_replica * engine_gpu_util_rate_to_ray_cap; this is to avoid OOM, you can set it to 0.9 or 0.8 to leave some buffer for other processes on
)
self.input_formatter = MinerU2LLMInputOperator()
self.vqa_extractor = ChunkedPromptedGenerator(
llm_serving=self.llm_serving,
system_prompt = self.vqa_extract_prompt.build_prompt(),
max_chunk_len=128000,
)
self.llm_output_parser = LLMOutputParser(
output_dir="./cache", intermediate_dir="intermediate"
)
self.qa_merger = QA_Merger(
output_dir="./cache", strict_title_match=False
)
self.vqa_format_converter = VQAFormatter(
output_json_file="./.cache/data/qa.json", # Path for the output dataset
)For a more comprehensive guide on parameter settings, please refer to the descriptions in Case 7. PDF VQA Extraction Pipeline.
Note:You must configure API credentials to invoke the LLM API. These credentials can be obtained from your LLM provider (e.g., OpenAI, Google Gemini, etc.). Set them as environment variables:
export DF_API_KEY="sk-xxxxx"$env:DF_API_KEY = "sk-xxxxx"Step 6: One-Click Fine-tuning
# One-Click Fine-tuning: Directly launch the cleaning + fine-tuning functionality
dataflow pdf2model train💡After fine-tuning is complete, the project directory will reflect a structure similar to the following (based on the --qa="kbc" configuration):
Project Root/
├── pdf_to_qa_pipeline.py # pipeline execution file
└── .cache/ # cache directory
├── train_config.yaml # default config file for llamafactory training
├── data/
│ ├── dataset_info.json
│ └── qa.json
├── gpu/
│ ├── batch_cleaning_step_step1.json
│ ├── batch_cleaning_step_step2.json
│ ├── batch_cleaning_step_step3.json
│ ├── batch_cleaning_step_step4.json
│ ├── batch_cleaning_step_step5.json
│ └── pdf_list.jsonl
├── mineru/
│ └── sample/auto/
└── saves/
└── pdf2model_cache_{timestamp}/Step 7: Chat with Fine-tuned Model
# Method 1: Specify model path with --model flag (optional)
# Default path: .cache/saves/pdf2model_cache_{timestamp}
dataflow chat --model ./custom_model_path
# Method 2: Navigate to workspace directory and run dataflow chat
dataflow chat
