MinerU2LLMInputOperator
2026-01-20
📘 Overview
MinerU2LLMInputOperator is a format conversion operator specifically designed for processing MinerU parsing results. It transforms the underlying _content_list.json files generated by MinerU into a flattened format that is more suitable for Large Language Model (LLM) understanding and processing.
Key Features:
- List Flattening: Breaks down complex `list` type items into individual `text` entries.
- Data Cleaning: Removes metadata that is typically unnecessary for LLMs, such as `bbox` (bounding box coordinates) and `page_idx` (page numbers).
- Re-indexing: Generates continuous and unique `id` values for all converted content items.
__init__ Function
`def __init__(self)`

This operator does not require any additional parameters during initialization.
run Function
`def run(self, storage: DataFlowStorage, input_markdown_path_key: str, output_converted_layout_key: str)`

Executes the conversion logic: locates the corresponding MinerU JSON file based on the Markdown file path, processes it, saves it as a new file, and records the new path.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| storage | DataFlowStorage | Required | DataFlow storage instance. |
| input_markdown_path_key | str | Required | Input column name containing the paths to MinerU .md files. The operator automatically searches for _content_list.json in the same directory. |
| output_converted_layout_key | str | Required | Output column name to store the path of the processed _converted.json file. |
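
The path handling described in the table can be illustrated with a small standalone sketch. It is not the operator's actual code: `derive_paths` is a hypothetical helper, and the exact output file name (here `<stem>_content_list_converted.json`) is an assumption, since the documentation only states that the result carries a `_converted.json` suffix and sits in the same directory.

```python
from pathlib import Path


def derive_paths(markdown_path: str) -> tuple[Path, Path]:
    """Hypothetical helper: map a MinerU .md path to its JSON neighbours."""
    md = Path(markdown_path)
    if md.suffix != ".md":
        raise ValueError(f"expected a .md file, got: {markdown_path}")

    # Replace the .md extension with _content_list.json (same directory).
    content_list = md.with_name(md.stem + "_content_list.json")

    # ASSUMPTION: the output name is built from the content-list name.
    # The docs only guarantee a "_converted.json" suffix.
    converted = content_list.with_name(content_list.stem + "_converted.json")
    return content_list, converted


if __name__ == "__main__":
    source, target = derive_paths("outputs/report.md")
    print(source)
    print(target)
```

Using `pathlib` rather than a plain string replace avoids accidentally rewriting a `.md` substring that appears elsewhere in the path.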
🧠 Conversion Logic Details
- Path Matching: The operator retrieves the file path from `input_markdown_path_key` and replaces the `.md` extension with `_content_list.json` to read the original layout data.
- Content Processing:
  - If an entry type is `list` and the sub-type is `text`, the operator iterates through `list_items` and promotes each sub-item to an independent `text` entry.
  - Entries that are already `text` or other types are preserved.
- Format Simplification: The `bbox` and `page_idx` fields are removed from all entries to reduce token interference and noise.
- File Output: The resulting file is saved with a `_converted.json` suffix in the same directory as the original file.
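
The in-memory part of the steps above (flattening, field removal, re-indexing) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the operator's source: `flatten_content_list` and `DROPPED_KEYS` are hypothetical names, and the sketch assumes that `id` values start at 0 and that any other fields a flattened list entry carried are discarded, which matches the documented example but is not otherwise confirmed.

```python
from typing import Any

# Fields the documentation says are removed from every entry.
DROPPED_KEYS = ("bbox", "page_idx")


def flatten_content_list(items: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Hypothetical re-implementation of the documented conversion steps."""
    converted: list[dict[str, Any]] = []

    for item in items:
        if item.get("type") == "list" and item.get("sub_type") == "text":
            # Promote each list item to an independent text entry.
            for sub_item in item.get("list_items", []):
                converted.append({"type": "text", "text": sub_item})
        else:
            # Keep text and other entry types, minus the layout metadata.
            converted.append(
                {k: v for k, v in item.items() if k not in DROPPED_KEYS}
            )

    # ASSUMPTION: ids are zero-based and assigned in output order.
    for new_id, entry in enumerate(converted):
        entry["id"] = new_id

    return converted


if __name__ == "__main__":
    sample = [
        {"type": "text", "text": "Intro", "bbox": [0, 0, 1, 1], "page_idx": 0},
        {"type": "list", "sub_type": "text", "list_items": ["A", "B"]},
    ]
    for entry in flatten_content_list(sample):
        print(entry)
```

Assigning `id` in a second pass keeps the numbering continuous regardless of how many entries each list expands into.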
🧠 Example Usage
🧾 Format Conversion Comparison
Input (Original MinerU _content_list.json):
[
  {
    "type": "list",
    "sub_type": "text",
    "list_items": ["Item One Content", "Item Two Content"],
    "bbox": [10, 20, 100, 200],
    "page_idx": 0
  }
]

Output (Processed _converted.json):

[
  {
    "type": "text",
    "text": "Item One Content",
    "id": 0
  },
  {
    "type": "text",
    "text": "Item Two Content",
    "id": 1
  }
]
