LLMOutputParser


2026-01-20

📘 Overview

LLMOutputParser is a structured-data parsing operator. It parses response text generated by Large Language Models (LLMs) when that text contains specific XML tags.

The core functionalities of this operator include:

Tag Parsing: Identifying and extracting content within tags such as <chapter>, <qa_pair>, <question>, <answer>, <solution>, and <label>.
ID Restoration: Mapping numerical IDs returned by the LLM back to original text content or image tags (based on the converted layout files generated by MinerU2LLMInputOperator).
Resource Synchronization: Automatically copying associated images from the intermediate directory to the final output directory and correcting the image reference paths.

`__init__` Function

def __init__(self,  
             output_dir: str, 
             intermediate_dir: str = "intermediate"
             )

Initialization Parameters

Parameter	Type	Default	Description
output_dir	str	Required	The final root directory for structured data and images.
intermediate_dir	str	"intermediate"	The intermediate directory where original image resources processed by MinerU are located.

XML Tag Protocol

The operator expects the LLM to return data according to the following structure:

<chapter>: A chapter block containing a title and multiple QA pairs.
<title>: The ID corresponding to the chapter title.
<qa_pair>: A block representing a single question-answer pair.
<question> / <solution>: A list of IDs (e.g., 1, 2, 5) corresponding to the source content.
<answer>: The answer extracted from the solution. This is actual text content, not an ID.
<label>: Question type or label information. This is the actual sequence number or label, not an ID.

`run` Function

def run(self, 
        storage: DataFlowStorage, 
        input_response_path_key: str, 
        input_converted_layout_path_key: str, 
        input_name_key: str, 
        output_qalist_path_key: str
        )

Executes the parsing logic: reads the LLM response, restores content using the layout JSON file, and saves the result in JSONL format.

Parameters

Name	Type	Default	Description
storage	DataFlowStorage	Required	DataFlow storage instance.
input_response_path_key	str	Required	Column name for the path to the original LLM response file.
input_converted_layout_path_key	str	Required	Column name for the path to the converted layout file (`_converted.json`).
input_name_key	str	Required	Column name for the task name, which determines the name of the output folder.
output_qalist_path_key	str	Required	Column name to store the path of the generated JSONL file.

🧠 Example Logic

1. ID Restoration Process

When the LLM returns <question>1, 3</question>, the operator looks up the entries with id 1 and id 3 in the layout JSON.

Suppose id: 1 is the text "What is AI?" and id: 3 is the image at path/to/img.png.
The restored content will be: What is AI?\n![image](vqa_images/img.png).
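
The lookup described above can be sketched as follows. The layout schema used here (a list of entries with `id`, `type`, and either `text` or `img_path`) is an assumption made for illustration; the real `_converted.json` produced by MinerU2LLMInputOperator may use different field names. The `restore_ids` helper is hypothetical as well.

```python
from pathlib import PurePosixPath

# Hypothetical layout entries; the real `_converted.json` schema may differ.
LAYOUT = [
    {"id": 1, "type": "text", "text": "What is AI?"},
    {"id": 2, "type": "text", "text": "Unrelated paragraph."},
    {"id": 3, "type": "image", "img_path": "path/to/img.png"},
]


def restore_ids(id_list: str, layout: list[dict], image_dir: str = "vqa_images") -> str:
    """Replace a comma-separated ID list with the content those IDs point to."""
    by_id = {entry["id"]: entry for entry in layout}
    parts = []
    for token in id_list.split(","):
        token = token.strip()
        if not token.isdigit():
            continue  # skip anything that is not a plain numeric ID
        entry = by_id.get(int(token))
        if entry is None:
            continue  # the LLM referenced an ID that is not in the layout
        if entry["type"] == "image":
            # Keep only the file name and point it at the synced image folder.
            name = PurePosixPath(entry["img_path"]).name
            parts.append(f"![image]({image_dir}/{name})")
        else:
            parts.append(entry["text"])
    return "\n".join(parts)


restored = restore_ids("1, 3", LAYOUT)
```

In this sketch, unknown or malformed IDs are skipped silently; whether the operator itself skips, logs, or raises in that case is not stated in this document.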

2. Output File Structure

After execution, the directory structure under output_dir (referenced as cache_path in some contexts) will be as follows:

output_dir/
└── {name}/
    ├── extracted_questions.jsonl  # Structured data
    └── vqa_images/           # Automatically synchronized images
        ├── img1.png
        └── ...
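
A minimal sketch of how the image synchronization behind this layout might work, under stated assumptions: image references look like `![image](relative/path)`, the paths are relative to the intermediate directory, and copied files keep their base names. The `sync_images` function and its parameters are hypothetical and are not the operator's API.

```python
import re
import shutil
from pathlib import Path

# Matches Markdown image references of the form ![image](some/path.png).
IMAGE_REF = re.compile(r"!\[image\]\(([^)]+)\)")


def sync_images(text: str, intermediate_dir: Path, task_dir: Path,
                image_subdir: str = "vqa_images") -> str:
    """Copy referenced images into task_dir/image_subdir and rewrite their paths."""
    target = task_dir / image_subdir
    target.mkdir(parents=True, exist_ok=True)

    def _rewrite(match: re.Match) -> str:
        src = intermediate_dir / match.group(1)
        if src.is_file():
            shutil.copy2(src, target / src.name)  # copy file data and metadata
        # The reference is rewritten even if the source file was missing.
        return f"![image]({image_subdir}/{src.name})"

    return IMAGE_REF.sub(_rewrite, text)
```

Calling `sync_images("See ![image](images/img1.png)", Path("intermediate"), Path("output_dir") / "demo")` would copy `intermediate/images/img1.png` to `output_dir/demo/vqa_images/img1.png` and return the text with the reference rewritten to `vqa_images/img1.png`. Two source images with the same base name would collide in this sketch.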

3. JSONL Output Example

{
  "question": "Please analyze the image below:\n![image](vqa_images/img1.png)",
  "answer": "This is the parsed answer text.",
  "solution": "Detailed step-by-step solution...",
  "label": "1",
  "chapter_title": "Chapter 1: Fundamentals"
}
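
The record above is shown pretty-printed for readability; in a JSONL file each record normally occupies a single line. The following sketch reads such a file with the standard library only. The `load_qa_records` helper is hypothetical, and the field names are taken from the example above.

```python
import json
from pathlib import Path


def load_qa_records(jsonl_path) -> list[dict]:
    """Read a JSONL file into a list of dicts, one per non-empty line."""
    records = []
    with Path(jsonl_path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines between records
                records.append(json.loads(line))
    return records
```

For example, `load_qa_records("output_dir/{name}/extracted_questions.jsonl")`, with `{name}` replaced by the task name, would return one dict per QA pair, each with keys such as `question`, `answer`, `solution`, `label`, and `chapter_title`.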
