LLMOutputParser
2026-01-20
📘 Overview
LLMOutputParser is a structured data parsing operator for response text that is generated by Large Language Models (LLMs) and contains specific XML tags.
The core functionalities of this operator include:
- Tag Parsing: Identifying and extracting content within tags such as <chapter>, <qa_pair>, <question>, <answer>, <solution>, and <label>.
- ID Restoration: Mapping numerical IDs returned by the LLM back to the original text content or image tags (based on the converted layout files generated by MinerU2LLMInputOperator).
- Resource Synchronization: Automatically copying associated images from the intermediate directory to the final output directory and correcting the image reference paths.
__init__ Function
def __init__(self,
             output_dir: str,
             intermediate_dir: str = "intermediate"
             )
Initialization Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| output_dir | str | Required | The final root directory for structured data and images. |
| intermediate_dir | str | "intermediate" | The intermediate directory where original image resources processed by MinerU are located. |
XML Tag Protocol
The operator expects the LLM to return data according to the following structure:
- <chapter>: A chapter block containing a title and multiple QA pairs.
  - <title>: The ID corresponding to the chapter title.
  - <qa_pair>: A block representing a single question-answer pair.
    - <question> / <solution>: A list of IDs (e.g., 1, 2, 5) corresponding to the source content.
    - <answer>: The answer extracted from the solution. This is actual text content, not an ID.
    - <label>: Question type or label information. This is a real sequence number/label, not an ID.
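The tag protocol above can be illustrated with a small regex-based extractor. This is only a sketch of the idea, not the operator's actual implementation: the function name `parse_response`, the output keys, and the sample response are all hypothetical, and it assumes the tags are well formed and not nested inside themselves.

```python
import re


def parse_response(response: str) -> list[dict]:
    """Extract QA records from a tagged LLM response (illustrative only)."""

    def tag(name: str, text: str) -> str:
        # Return the stripped content of the first <name>...</name>, or "" if absent.
        m = re.search(rf"<{name}>(.*?)</{name}>", text, flags=re.DOTALL)
        return m.group(1).strip() if m else ""

    records = []
    for chapter in re.findall(r"<chapter>(.*?)</chapter>", response, flags=re.DOTALL):
        title_id = tag("title", chapter)  # still an ID at this stage
        for pair in re.findall(r"<qa_pair>(.*?)</qa_pair>", chapter, flags=re.DOTALL):
            records.append({
                "chapter_title_id": title_id,
                "label": tag("label", pair),            # real label, not an ID
                "question_ids": tag("question", pair),  # comma-separated IDs
                "answer": tag("answer", pair),          # real text, not an ID
                "solution_ids": tag("solution", pair),  # comma-separated IDs
            })
    return records


# Hypothetical response following the protocol described above.
sample = """
<chapter><title>0</title>
<qa_pair><label>1</label><question>1, 3</question>
<answer>42</answer><solution>4, 5</solution></qa_pair>
</chapter>
"""
records = parse_response(sample)
```

At this point the title, question, and solution fields still hold IDs; they only become readable content after the ID restoration step.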
run Function
def run(self,
        storage: DataFlowStorage,
        input_response_path_key: str,
        input_converted_layout_path_key: str,
        input_name_key: str,
        output_qalist_path_key: str
        )
Executes the parsing logic: reads the LLM response, restores content using the layout JSON file, and saves the result in JSONL format.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| storage | DataFlowStorage | Required | DataFlow storage instance. |
| input_response_path_key | str | Required | Column name for the path to the original LLM response file. |
| input_converted_layout_path_key | str | Required | Column name for the path to the converted layout file (_converted.json). |
| input_name_key | str | Required | Column name for the task name, which determines the name of the output folder. |
| output_qalist_path_key | str | Required | Column name to store the path of the generated JSONL file. |
🧠 Example Logic
1. ID Restoration Process
Suppose the LLM returns: <question>1, 3</question>
The operator looks up the entries with id 1 and id 3 in the layout JSON:
- If id: 1 is the text "What is AI?" and id: 3 is the image path/to/img.png.
- The restored content will be: What is AI?\n followed by the image tag for that image.
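The lookup described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the operator's real code: the layout is assumed to be a list of entries with hypothetical keys `id`, `text`, and `img_path`, and the image tag is assumed to use Markdown image syntax. The function name `restore_ids` is also hypothetical.

```python
def restore_ids(id_list: str, layout: list[dict]) -> str:
    """Map a comma-separated ID string back to text and image tags."""
    # Assumed layout shape: each entry has "id" plus either "text" or "img_path".
    by_id = {item["id"]: item for item in layout}
    parts = []
    for token in id_list.split(","):
        token = token.strip()
        if not token.isdigit():
            continue  # skip anything that is not a plain numeric ID
        item = by_id.get(int(token))
        if item is None:
            continue  # the LLM returned an ID that is not in the layout
        if "text" in item:
            parts.append(item["text"])
        elif "img_path" in item:
            # Assumed image tag format (Markdown); the real format may differ.
            parts.append(f"![image]({item['img_path']})")
    return "\n".join(parts)


layout = [
    {"id": 1, "text": "What is AI?"},
    {"id": 2, "text": "Unused paragraph."},
    {"id": 3, "img_path": "path/to/img.png"},
]
restored = restore_ids("1, 3", layout)
```

Skipping unknown or malformed IDs rather than raising is a design choice of this sketch; it keeps a single bad ID from the LLM from discarding the whole record.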
2. Output File Structure
After execution, the directory structure under output_dir (referenced as cache_path in some contexts) will be as follows:
output_dir/
└── {name}/
    ├── extracted_questions.jsonl   # Structured data
    └── vqa_images/                 # Automatically synchronized images
        ├── img1.png
        └── ...
3. JSONL Output Example
{
  "question": "Please analyze the image below:\n",
  "answer": "This is the parsed answer text.",
  "solution": "Detailed step-by-step solution...",
  "label": "1",
  "chapter_title": "Chapter 1: Fundamentals"
}
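Downstream code can read the generated file one line at a time. The following is a minimal sketch that assumes the file holds one JSON object per line, as is standard for JSONL. The helper name `load_qa_records` is hypothetical, and the demo writes its own sample record to a temporary directory instead of relying on a real operator run.

```python
import json
import tempfile
from pathlib import Path


def load_qa_records(jsonl_path) -> list[dict]:
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Demo with a hand-written record shaped like the example above.
sample = {
    "question": "Please analyze the image below:\n",
    "answer": "This is the parsed answer text.",
    "solution": "Detailed step-by-step solution...",
    "label": "1",
    "chapter_title": "Chapter 1: Fundamentals",
}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "extracted_questions.jsonl"
    path.write_text(json.dumps(sample, ensure_ascii=False) + "\n", encoding="utf-8")
    loaded = load_qa_records(path)
```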
