Case 6. Math Problem Extraction
2025-07-16
This example demonstrates how to use the MathBookQuestionExtract
operator in Dataflow to automatically extract math problems from a textbook PDF and generate output in JSON/Markdown format.
1 Environment and Dependencies
Install Dataflow and the MinerU dependencies:

```shell
pip install "open-dataflow[mineru]"
```

Or install from source:

```shell
pip install -e ".[mineru]"
```

Download the MinerU model weights:

```shell
mineru-models-download
```

This operator uses MinerU for PDF content segmentation and image extraction; please ensure that both the installation and the model-weight download have succeeded.
2 Configure LLM Serving
This operator currently only supports API-based VLM Serving. Please configure the API URL and key before running.
- Linux / macOS:

```shell
export DF_API_KEY="sk-xxxxx"
```

- Windows PowerShell:

```powershell
$env:DF_API_KEY = "sk-xxxxx"
```
The API key will be read from the environment variable in the code, so there is no need to hard-code it in the script.
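For reference, a minimal sketch of reading the key from the environment in Python; the variable name matches the `key_name_of_api_key` default used later, while the actual lookup inside Dataflow's serving layer may differ:

```python
import os

# Read the API key from the environment; raises KeyError if the variable is not set.
api_key = os.environ["DF_API_KEY"]
```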
3 Prepare the Test PDF
The example repository includes a test PDF:
./dataflow/example/KBCleaningPipeline/questionextract_test.pdf
You can also replace it with any math textbook or exercise collection PDF.
4 Write the Execution Script
In the project’s root directory, create `generate_question_extract_api.py` with the following content as an example:
```python
from dataflow.operators.generate import MathBookQuestionExtract
from dataflow.serving.APIVLMServing_openai import APIVLMServing_openai


class QuestionExtractPipeline:
    def __init__(self, llm_serving: APIVLMServing_openai):
        self.extractor = MathBookQuestionExtract(llm_serving)
        self.test_pdf = "./dataflow/example/KBCleaningPipeline/questionextract_test.pdf"

    def forward(
        self,
        pdf_path: str,
        output_name: str,
        output_dir: str,
        api_url: str = "https://api.openai.com/v1/chat/completions",
        key_name_of_api_key: str = "DF_API_KEY",
        model_name: str = "o4-mini",
        max_workers: int = 20
    ):
        self.extractor.run(
            pdf_file_path=pdf_path,
            output_file_name=output_name,
            output_folder=output_dir,
            api_url=api_url,
            key_name_of_api_key=key_name_of_api_key,
            model_name=model_name,
            max_workers=max_workers
        )


if __name__ == "__main__":
    # 1. Initialize LLM Serving
    llm_serving = APIVLMServing_openai(
        api_url="https://api.openai.com/v1/chat/completions",
        model_name="o4-mini",  # Strong reasoning model recommended
        max_workers=20         # Number of concurrent requests
    )

    # 2. Build and run the pipeline
    pipeline = QuestionExtractPipeline(llm_serving)
    pipeline.forward(
        pdf_path=pipeline.test_pdf,
        output_name="test_question_extract",
        output_dir="./output"
    )
```
Key Parameter Explanation
- `api_url`: OpenAI VLM endpoint URL
- `key_name_of_api_key`: name of the environment variable that holds the API key
- `model_name`: model name (e.g., `o4-mini`; strong reasoning models are recommended)
- `max_workers`: number of concurrent requests
Operator Logic
The complete implementation of the operator is located at `dataflow/operators/generate/KnowledgeCleaning/mathbook_question_extract.py`.
Below, starting from the overall flow, we walk through each key stage in concise but detailed terms, to make the operator easier to use and to extend:
**PDF file splitting**

- Use `pymupdf` (fitz) to open the target PDF and render each page into a high-quality JPEG image at the specified DPI.
- Save the images, named by page number, to the specified output directory, and log the conversion progress of each page to ensure traceability; see the sketch below.
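A minimal sketch of this splitting stage, assuming `pymupdf` is installed; the helper name and page-naming scheme are illustrative rather than the operator's actual internals (JPEG output via `Pixmap.save` requires a reasonably recent PyMuPDF):

```python
import os
import fitz  # pymupdf

def render_pdf_pages(pdf_path: str, output_dir: str, dpi: int = 300) -> list[str]:
    """Render every page of a PDF to a JPEG named by its page number."""
    os.makedirs(output_dir, exist_ok=True)
    doc = fitz.open(pdf_path)
    matrix = fitz.Matrix(dpi / 72, dpi / 72)  # scale up from the PDF's native 72 DPI
    image_paths = []
    for page_index, page in enumerate(doc):
        pixmap = page.get_pixmap(matrix=matrix)
        out_path = os.path.join(output_dir, f"{page_index}.jpg")
        pixmap.save(out_path)  # output format inferred from the .jpg extension
        image_paths.append(out_path)
        print(f"Rendered page {page_index + 1}/{doc.page_count}")
    doc.close()
    return image_paths
```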
**Invoke MinerU for content recognition and image extraction**

- Dynamically import the `mineru` module; if it is not installed, raise a friendly error guiding the user to run `pip install mineru[pipeline]` and download the models.
- Load the models from the local source via the environment variable `MINERU_MODEL_SOURCE=local`; the backend options `"vlm-sglang-engine"` and `"pipeline"` are supported.
- Execute the command-line tool (a programmatic sketch follows this list):

```shell
mineru -p <pdf_file> -o <output_folder> -b <backend> --source local
```

- After execution, the tool generates `*_content_list.json` (a structured content inventory) and a folder of the original split images in the intermediate directory.
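A sketch of invoking the same command from Python, assuming the `mineru` CLI is on the PATH; it mirrors the command line above rather than reproducing the operator's exact code:

```python
import os
import subprocess

def run_mineru(pdf_file: str, output_folder: str, backend: str = "pipeline") -> None:
    """Run MinerU on one PDF, loading model weights from the local cache."""
    env = os.environ.copy()
    env["MINERU_MODEL_SOURCE"] = "local"  # use the locally downloaded models
    subprocess.run(
        ["mineru", "-p", pdf_file, "-o", output_folder, "-b", backend, "--source", "local"],
        env=env,
        check=True,  # raise CalledProcessError if MinerU exits with a non-zero status
    )
```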
**Organize and rename image resources**

- Read the `content_list.json` produced by MinerU and filter out all items where `type == 'image'`.
- Copy the corresponding images from MinerU's temporary directory to the final result folder, renaming them sequentially as `0.jpg, 1.jpg, ...`.
- Also generate a new JSON inventory recording each image's page number in the source PDF and its new file path; see the sketch after this list.
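A sketch of this reorganization step; the `img_path` and `page_idx` field names are assumptions about MinerU's content list, and the helper name is illustrative:

```python
import json
import shutil
from pathlib import Path

def collect_images(content_list_path: str, mineru_dir: str, result_dir: str) -> list[dict]:
    """Copy extracted figures into result_dir as 0.jpg, 1.jpg, ... and build a new inventory."""
    items = json.loads(Path(content_list_path).read_text(encoding="utf-8"))
    out_dir = Path(result_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    inventory = []
    image_items = [item for item in items if item.get("type") == "image"]
    for index, item in enumerate(image_items):
        src = Path(mineru_dir) / item["img_path"]        # field name assumed
        dst = out_dir / f"{index}.jpg"
        shutil.copy(src, dst)
        inventory.append({"page": item.get("page_idx"),  # field name assumed
                          "path": str(dst)})
    return inventory
```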
**Organize model invocation commands**

- Retrieve the predefined text prompt (`mathbook_question_extract_prompt`) from `dataflow.prompts.kbcleaning.KnowledgeCleanerPrompt`, which specifies the task requirements and format conventions.
- Package the rendered prompts together with the multiple input images (page snapshots and illustrations) to prepare for the subsequent concurrent LLM service calls, as sketched below.
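A sketch of how the prompt and images might be bundled into per-page tasks; the prompt lookup matches the import shown in section 6, while the `pages` variable and the task dictionary structure are hypothetical:

```python
from dataflow.prompts.kbcleaning import KnowledgeCleanerPrompt

# The prompt fixes the extraction requirements and the expected output format.
system_prompt = KnowledgeCleanerPrompt().mathbook_question_extract_prompt()

# Hypothetical input: one page snapshot plus the figures cropped from that page.
pages = [("pages/0.jpg", ["images/0.jpg", "images/1.jpg"])]

# One task per page, ready to be submitted concurrently in the next stage.
tasks = [{"prompt": system_prompt, "images": [snapshot, *figures]} for snapshot, figures in pages]
```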
**Concurrently obtain model responses**

- Use `APIVLMServing_openai` (or another `LLMServingABC` implementation) combined with `ThreadPoolExecutor` to submit the packaged lists of images and prompts to the model concurrently; a concurrency sketch follows this list.
- The model name, API endpoint, concurrency level, and timeout are all customizable, so performance and cost can be balanced flexibly.
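A minimal concurrency sketch with `ThreadPoolExecutor`; `call_model` is a hypothetical stand-in for whatever method of `APIVLMServing_openai` the operator actually calls, so its body is left unimplemented:

```python
from concurrent.futures import ThreadPoolExecutor

def call_model(task: dict) -> str:
    """Hypothetical single-request wrapper around the VLM serving call."""
    # In the real operator this delegates to APIVLMServing_openai; the exact
    # method name and signature are intentionally not reproduced here.
    raise NotImplementedError

def run_concurrently(tasks: list[dict], max_workers: int = 20) -> list[str]:
    """Submit all tasks at once; pool.map preserves the input order of the responses."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(call_model, tasks))
```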
**Parse and save the final output**

- In the `analyze_and_save` method, use regular expressions to precisely capture the `<image>index.jpg</image>` tags in the model's returned text.
- Copy the images referenced by the tags into the `images/` subfolder of the results directory.
- Output the results in two formats:
  a. JSON file: sequentially stores each question's plain text (with tags removed) and the corresponding list of image paths
  b. Markdown file: embeds the images in the question text using the `![](...)` image syntax for easy visualization
- All output files are saved in the user-specified result folder, facilitating subsequent verification and reuse. A sketch of this parsing-and-saving stage follows this list.
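A sketch of the parsing-and-saving behaviour described above (regex over the `<image>...</image>` tags, then JSON and Markdown output); the function name and file layout are illustrative rather than the operator's exact code:

```python
import json
import re
import shutil
from pathlib import Path

IMAGE_TAG = re.compile(r"<image>(.*?)</image>")

def analyze_and_save_sketch(responses: list[str], image_dir: str, result_dir: str, name: str) -> None:
    """Turn raw model responses into a JSON record list and a Markdown preview."""
    out_dir = Path(result_dir)
    (out_dir / "images").mkdir(parents=True, exist_ok=True)
    records, md_parts = [], []
    for text in responses:
        pics = []
        for img_name in IMAGE_TAG.findall(text):
            src, dst = Path(image_dir) / img_name, out_dir / "images" / img_name
            if src.exists():
                shutil.copy(src, dst)  # keep the referenced figures next to the results
            pics.append(str(dst))
        plain = IMAGE_TAG.sub("", text).strip()  # question text with the tags removed
        records.append({"text": plain, "pics": pics})
        md_parts.append(plain + "".join(f"\n\n![](images/{Path(p).name})" for p in pics))
    (out_dir / f"{name}.json").write_text(json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8")
    (out_dir / f"{name}.md").write_text("\n\n".join(md_parts), encoding="utf-8")
```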
5 Run the Script
```shell
python generate_question_extract_api.py
```

After it finishes, the `./output` directory will contain:

- `test_question_extract.json`: each record includes
  - `text`: the extracted problem text
  - `pics`: the list of image paths involved in the problem
- `test_question_extract.md`: displays the problems and their images in Markdown format
6 Optional Extensions
- Custom prompts: to adjust the system prompt, replace it inside the operator:

  ```python
  from dataflow.prompts.kbcleaning import KnowledgeCleanerPrompt

  system_prompt = KnowledgeCleanerPrompt().mathbook_question_extract_prompt()
  ```

- Parameter customization: supports switching the MinerU backend (`pipeline` | `vlm-sglang-engine`), adjusting the DPI, concurrency, etc. See the `run` method signature in the operator.