KBCTextCleanerBatch


2025-10-09

📘 Overview

KBCTextCleanerBatch is a batch knowledge-cleaning operator. It standardizes raw knowledge content by removing HTML tags, normalizing special characters, processing hyperlinks, and tidying the text structure, with the goal of improving the quality of RAG (Retrieval-Augmented Generation) knowledge bases.
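
To make the listed transformations concrete, the following is a rough rule-based sketch of tag removal and character normalization. It is only an illustration: the operator itself delegates cleaning to the LLM through a prompt, so its output will differ. The names `rough_clean` and `_CHAR_MAP` are hypothetical and are not part of the operator.

```python
import html
import re

# Hypothetical mapping for illustration only; the operator's prompt,
# not this table, decides which characters are actually normalized.
_CHAR_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # curly double quotes -> straight
    "\u2018": "'", "\u2019": "'",   # curly single quotes -> straight
    "\u2013": "-", "\u2014": "-",   # en/em dash -> hyphen
})


def rough_clean(raw: str) -> str:
    """Approximate the kind of cleaning described above with plain rules."""
    text = re.sub(r"<[^>]+>", "", raw)      # drop anything that looks like a tag
    text = html.unescape(text)              # decode entities such as &amp;
    text = text.translate(_CHAR_MAP)        # normalize quotes and dashes
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)  # remove blank lines


print(rough_clean("<p>\u201cHi\u201d \u2013 there</p>\n\n<b>x</b>"))
```

A rule-based pass like this cannot handle images, links, or code blocks sensibly, which is one reason an LLM-driven cleaner is used instead.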

__init__ Function

def __init__(self, llm_serving: LLMServingABC, lang="en", prompt_template=None)

Initialization Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| llm_serving | LLMServingABC | Required | The large language model service instance used for inference and text generation. |
| lang | str | "en" | Language of the prompt. Supports `"zh"` (Chinese) and `"en"` (English). |
| prompt_template | PromptABC | None | The prompt template object used to construct the cleaning instructions. If not specified, the built-in `KnowledgeCleanerPrompt` is used. |

Prompt Template Description

| Prompt Template | Purpose | Application Scenario | Key Features |
| --- | --- | --- | --- |
| KnowledgeCleanerPrompt | Multi-dimensional text cleaning | Private knowledge base cleaning | Removes sensitive information and noise; performs normalization. |

run Function

def run(self, storage: DataFlowStorage, input_key="chunk_path", output_key="cleaned_chunk_path")

Executes the main logic of the operator: it reads the input DataFrame from the storage, cleans the text file at each path in the `input_key` column, writes the cleaned results to new files, and records their paths in the `output_key` column.

Parameters

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| storage | DataFlowStorage | Required | The data flow storage instance responsible for reading and writing data. |
| input_key | str | "chunk_path" | The input column name that contains the file paths of the knowledge chunks to be cleaned. |
| output_key | str | "cleaned_chunk_path" | The output column name that stores the file paths of the cleaned knowledge chunks. |
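
As a sketch of what the input side might look like, the snippet below writes one chunk file and builds the rows that point to it through the `chunk_path` column. It assumes each chunk file is a JSON object with a `raw_chunk` field, following the example input shown on this page. The helpers `write_chunk_file` and `build_input_rows` are hypothetical, and how the rows reach the storage depends on the storage implementation in use.

```python
import json
import tempfile
from pathlib import Path


def write_chunk_file(directory: Path, name: str, raw_text: str) -> Path:
    """Write one chunk file as {"raw_chunk": ...} (layout assumed from this page's example)."""
    path = directory / name
    path.write_text(
        json.dumps({"raw_chunk": raw_text}, ensure_ascii=False),
        encoding="utf-8",
    )
    return path


def build_input_rows(chunk_paths, input_key="chunk_path"):
    """Build one row per chunk file, keyed by the operator's input column name."""
    return [{input_key: str(p)} for p in chunk_paths]


workdir = Path(tempfile.mkdtemp())
paths = [write_chunk_file(workdir, "chunk_0.json", "<p>Hello</p>")]
rows = build_input_rows(paths)
print(rows)
```

If a different `input_key` is passed to `run`, the same key would need to be used when building the rows.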

🧠 Example Usage

self.knowledge_cleaning_step3 = KBCTextCleanerBatch(
    llm_serving=self.llm_serving,
    lang="en"
)
self.knowledge_cleaning_step3.run(
    storage=self.storage.step(),
)

🧾 Default Output Format

| Field | Type | Description |
| --- | --- | --- |
| chunk_path | str | Path to the original raw knowledge text. |
| cleaned_chunk_path | str | Path to the cleaned knowledge text generated by the model. |

Example Input (file pointed to by `chunk_path`)

{
"raw_chunk":"<div class=\"container\">\n  <h1>标题文本</h1>\n  <p>正文段落，包括特殊符号，例如“弯引号”、–破折号等</p>\n  <img src=\"example.jpg\" alt=\"示意图\">\n  <a href=\"...\">链接文本</a>\n  <pre><code>代码片段</code></pre>\n</div>"
}

Example Output (file pointed to by `cleaned_chunk_path`)

{
"raw_chunk":"<div class=\"container\">\n  <h1>标题文本</h1>\n  <p>正文段落，包括特殊符号，例如“弯引号”、–破折号等</p>\n  <img src=\"example.jpg\" alt=\"示意图\">\n  <a href=\"...\">链接文本</a>\n  <pre><code>代码片段</code></pre>\n</div>",
"cleaned_chunk":"标题文本\n\n正文段落，包括特殊符号，例如\"直引号\"、-破折号等\n\n[Image: 示意图 example.jpg]\n\n链接文本\n\n<code>代码片段</code>"
}
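
A cleaned file in the layout above can be inspected with a few lines of standard-library Python. This is a minimal sketch that assumes the output file is a JSON object with the `raw_chunk` and `cleaned_chunk` fields shown in the example. The helper `load_chunk_pair` is hypothetical, and the demo file stands in for real operator output.

```python
import json
import tempfile
from pathlib import Path


def load_chunk_pair(cleaned_chunk_path):
    """Return (raw_chunk, cleaned_chunk) from a cleaned file in the layout shown above."""
    data = json.loads(Path(cleaned_chunk_path).read_text(encoding="utf-8"))
    return data["raw_chunk"], data["cleaned_chunk"]


# Stand-in file for demonstration; a real path would come from the
# cleaned_chunk_path column written by the operator.
demo_path = Path(tempfile.mkdtemp()) / "cleaned_0.json"
demo_path.write_text(
    json.dumps({"raw_chunk": "<p>Hello</p>", "cleaned_chunk": "Hello"}),
    encoding="utf-8",
)

raw, cleaned = load_chunk_pair(demo_path)
print(f"{len(raw)} -> {len(cleaned)} characters")
```

Comparing the two fields side by side is a quick way to spot-check that cleaning removed markup without dropping content.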
