Case 7. PDF VQA Extraction Pipeline
About 1045 wordsAbout 3 min
2025-11-17
1. Overview
The PDF VQA Extraction Pipeline automatically extracts high-quality Q&A pairs from textbook-style PDFs. It supports both separated question/answer PDFs and interleaved PDFs, and chains together layout parsing (MinerU), subject-aware LLM extraction, and structured post-processing. Typical use cases:
- Building math/physics/chemistry QA corpora from scanned books
- Creating QA pairs' markdown/JSONL exports that preserve figure references
Major stages:
- Document layout extraction: call MinerU to dump structured JSON + rendered page images.
- LLM-based QA extraction: prompt the
VQAExtractoroperator with subject-specific rules. - Merging & filtering: consolidate question/answer streams, filter invalid entries, emit JSONL/Markdown plus copied images.
2. Quick Start
Step 1: Install Dataflow
Install Dataflow:
pip install "open-dataflow[pdf2vqa]"Or install Dataflow from source:
git clone https://github.com/OpenDCAI/DataFlow.git
cd Dataflow
pip install -e ".[pdf2vqa]"Step 2: Create a workspace
cd /your/working/directory
mkdir run_dataflow
cd run_dataflowStep 3: Initialize Dataflow
dataflow initYou can then add your pipeline script under pipelines/ or any custom path.
Step 4: Configure API credentials
DF_API_KEY is for calling LLM API, and MINERU_API_KEY is for calling MinerU for layout analysis. MINERU_API_KEY can be obtained from https://mineru.net/apiManage/token, and DF_API_KEY can be obtained from your LLM provider (e.g., OpenAI, Google Gemini, etc.). Set them as environment variables:
Linux / macOS:
export DF_API_KEY="sk-xxxxx"
export MINERU_API_KEY="sk2-xxxxx"Windows PowerShell:
$env:DF_API_KEY = "sk-xxxxx"
$env:MINERU_API_KEY = "sk2-xxxxx"In the pipeline script, set your API endpoint:
self.llm_serving = APILLMServing_request(
api_url="https://generativelanguage.googleapis.com/v1beta/openai/chat/completions",
key_name_of_api_key="DF_API_KEY",
model_name="gemini-2.5-pro",
max_workers=100,
)and set LLM max token length (recommended not to exceed 128000 to avoid LLM forgetting details).
self.vqa_extractor = ChunkedPromptedGenerator(
llm_serving=self.llm_serving,
system_prompt = self.vqa_extract_prompt.build_prompt(),
max_chunk_len=128000,
)Step 5: One-click run
python api_pipelines/pdf_vqa_extract_pipeline.pyYou can also import the operators into other workflows; the remainder of this doc explains the data flow in detail.
3. Data Flow and Pipeline Logic
1. Input data
Each job is defined by a JSONL row. input_pdf_paths can be a single PDF or a list of PDFs (questions appear before answers). name is an identifier for the job. Questions and answers can be interleaved or separated; they can come from the same PDF or different PDFs.
{"input_pdf_paths": "./example_data/PDF2VQAPipeline/questionextract_test.pdf", "name": "math1"}
{"input_pdf_paths": ["./example_data/PDF2VQAPipeline/math_question.pdf", "./example_data/PDF2VQAPipeline/math_answer.pdf"], "name": "math2"}FileStorage handles batching/cache management:
self.storage = FileStorage(
first_entry_file_name="../example_data/PDF2VQAPipeline/vqa_extract_test.jsonl",
cache_path="./cache",
file_name_prefix="vqa",
cache_type="jsonl",
)2. Document layout extraction (MinerU)
For each PDF (question, answer, or mixed), the pipeline calls _parse_file_with_mineru inside FileOrURLToMarkdownConverterAPI. MinerU outputs:
*_content_list.json: structured layout tokens (texts, figures, tables, IDs)images/: cropped page images
Note: If you want to use a locally deployed MinerU model, you can replace the operator with FileOrURLToMarkdownConverterLocal (original version from opendatalab) or FileOrURLToMarkdownConverterFlash (our accelerated version), and provide the corresponding model path and deployment parameters.
For example:
self.mineru_executor = FileOrURLToMarkdownConverterAPI(intermediate_dir = "intermediate")can be replaced with
self.mineru_executor = FileOrURLToMarkdownConverterLocal(
intermediate_dir = "intermediate",
mineru_model_path = "path/to/mineru/model",
)or
self.mineru_executor = FileOrURLToMarkdownConverterFlash(
intermediate_dir = "intermediate",
mineru_model_path = "path/to/mineru/model",
batch_size = 4,
replicas = 1,
num_gpus_per_replica = 1,
engine_gpu_util_rate_to_ray_cap = 0.9
)You can refer to https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/operators/knowledge_cleaning/generate/mineru_operators.py for specific parameters and usage.
Afterwards, the MinerU2LLMInputOperator flattens list items and re-indexes them to create LLM-friendly input.
3. QA extraction (VQAExtractor)
ChunkedPromptedGenerator chunks the layout JSON to respect token limits, builds prompts (QAExtractPrompt), and batches LLM calls via APILLMServing_request. Key behaviors:
- Grouping and pairing Q&A based, and inserting images to proper positions.
- Supports QA separated or interleaved PDFs.
- Copies rendered images into
cache_path/name/vqa_images. - Parses
<qa_pair>,<question>,<answer>,<solution>,<chapter>,<label>tags from the LLM response.
4. Post-processing and outputs
The QA_Merger operator is called for question-answer pair matching. Its behavior is as follows:
For mixed layouts (where questions and answers are already interleaved): It writes the complete QA pairs as they are, effectively performing no additional processing.
For separated layouts (where questions and answers are in different sections): It performs heuristic matching based on chapter titles and question sequence numbers.
This operator includes a strict_title_match parameter:
True: The operator performs an exact string match on chapter titles.
False: The operator attempts to extract Chinese or English sequence numbers from the titles for matching.
For each output_dir (under cache_path/name/), the pipeline writes:
extracted_vqa.jsonl(extracted questions and answers, could be separate or interleaved depending on input)merged_qa_pairs.jsonl(fully merged question-answer pairs)merged_qa_pairs.md(markdown version of the merged QA pairs)vqa_images/(containing all images extracted for the QA pairs)
Furthermore, the final step of the cache main file will contain all extracted qa pairs, making it easier to connect subsequent operators for downstream post-processing.
Each qa_item includes:
question: question text and imagesanswer: answer text and imagessolution: optional worked solution (if present)label: original qa numberingchapter_title: chapter/section header detected on the same page
Example:
{
"question": "Solve for x in x^2 - 1 = 0.",
"answer": "x = 1 or x = -1",
"solution": "Factor as (x-1)(x+1)=0.",
"label": 1,
"chapter_title": "Chapter 1 Quadratic Equations"
}5. Pipeline Example
from dataflow.operators.knowledge_cleaning import FileOrURLToMarkdownConverterAPI
from dataflow.serving import APILLMServing_request
from dataflow.utils.storage import FileStorage
from dataflow.operators.pdf2vqa import MinerU2LLMInputOperator, LLMOutputParser, QA_Merger, PDF_Merger
from dataflow.operators.core_text import ChunkedPromptedGenerator
from dataflow.pipeline import PipelineABC
from dataflow.prompts.pdf2vqa import QAExtractPrompt
from pypdf import PdfWriter
class PDF_VQA_extract_optimized_pipeline(PipelineABC):
def __init__(self):
super().__init__()
self.storage = FileStorage(
first_entry_file_name="./example_data/PDF2VQAPipeline/vqa_extract_test.jsonl",
cache_path="./cache",
file_name_prefix="vqa",
cache_type="jsonl",
)
self.llm_serving = APILLMServing_request(
api_url="http://123.129.219.111:3000/v1/chat/completions",
key_name_of_api_key="DF_API_KEY",
model_name="gemini-2.5-pro",
max_workers=100,
)
self.vqa_extract_prompt = QAExtractPrompt()
self.pdf_merger = PDF_Merger(output_dir="./cache")
self.mineru_executor = FileOrURLToMarkdownConverterAPI(intermediate_dir = "intermediate")
self.input_formatter = MinerU2LLMInputOperator()
self.vqa_extractor = ChunkedPromptedGenerator(
llm_serving=self.llm_serving,
system_prompt = self.vqa_extract_prompt.build_prompt(),
max_chunk_len=128000,
)
self.llm_output_parser = LLMOutputParser(output_dir="./cache", intermediate_dir="intermediate")
self.qa_merger = QA_Merger(output_dir="./cache", strict_title_match=False)
def forward(self):
self.pdf_merger.run(
storage=self.storage.step(),
input_pdf_list_key="input_pdf_paths",
input_name_key="name",
output_pdf_path_key="merged_pdf_path",
)
self.mineru_executor.run(
storage=self.storage.step(),
input_key="merged_pdf_path",
output_key="vqa_markdown_path",
)
self.input_formatter.run(
storage=self.storage.step(),
input_markdown_path_key="vqa_markdown_path",
output_converted_layout_key="converted_vqa_layout_path",
)
self.vqa_extractor.run(
storage=self.storage.step(),
input_path_key="converted_vqa_layout_path",
output_path_key="extracted_llm_vqa_path",
)
self.llm_output_parser.run(
storage=self.storage.step(),
input_response_path_key="extracted_llm_vqa_path",
input_converted_layout_path_key="converted_vqa_layout_path",
input_name_key="name",
output_qalist_path_key="extracted_vqa_path",
)
self.qa_merger.run(
storage=self.storage.step(),
input_qalist_path_key="extracted_vqa_path",
input_name_key="name",
output_merged_qalist_path_key="output_merged_vqalist_path",
output_merged_md_path_key="output_merged_md_path",
output_qa_item_key="vqa_pair",
)
if __name__ == "__main__":
# Each line in the jsonl contains input_pdf_paths, name (math1, math2, physics1, chemistry1, ...)
pipeline = PDF_VQA_extract_optimized_pipeline()
pipeline.compile()
pipeline.forward()Pipeline source: DataFlow/dataflow/statics/pipelines/api_pipelines/pdf_vqa_extract_pipeline.py
Use this pipeline whenever you need structured QA data distilled directly from PDF textbooks with figure references intact.

