KBCTextCleaner

2025-10-09

📘 Overview

KBCTextCleaner is a knowledge-cleaning operator designed to standardize raw knowledge content by removing HTML tags, normalizing special characters, handling hyperlinks, and optimizing text structure. Its goal is to improve the quality and reliability of RAG (Retrieval-Augmented Generation) knowledge bases.

__init__ Function

def __init__(self, llm_serving: LLMServingABC, lang="en", prompt_template=KnowledgeCleanerPrompt)

Initialization Parameters

Parameter	Type	Default	Description
llm_serving	LLMServingABC	Required	The LLM service instance used for inference and text generation.
lang	str	"en"	Language setting for selecting the prompt template. Supports `'zh'` and `'en'`.
prompt_template	PromptABC	`KnowledgeCleanerPrompt()`	The prompt template object. If not provided, the default `KnowledgeCleanerPrompt` will be used.

Prompt Template Description

Prompt Template	Purpose	Application Scenario	Key Features
KnowledgeCleanerPrompt	Multi-dimensional text cleaning	Private knowledge base cleaning	Removes sensitive information and noise, performs normalization

run Function

def run(self, storage: DataFlowStorage, input_key: str = "raw_chunk", output_key: str = "cleaned_chunk")

Executes the operator's main logic: it reads the input DataFrame from storage, generates cleaned text for each row in the input column, and writes the result back to storage.

Parameters

Name	Type	Default	Description
storage	DataFlowStorage	Required	DataFlow storage instance responsible for reading and writing data.
input_key	str	"raw_chunk"	Input column name corresponding to the raw knowledge chunk field.
output_key	str	"cleaned_chunk"	Output column name corresponding to the cleaned knowledge chunk field.
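The read, generate, write flow described above can be sketched without the real framework. This is only an illustration under stated assumptions: `StubCleaner`, its `generate` method, and `run_sketch` are hypothetical names that are not part of DataFlow, a list of dicts stands in for the DataFrame, and a regex stands in for the LLM call.

```python
import re


class StubCleaner:
    """Hypothetical stand-in for an LLM service; it strips tags instead of calling a model."""

    def generate(self, prompts):
        return [re.sub(r"<[^>]+>", "", p).strip() for p in prompts]


def run_sketch(rows, llm, input_key="raw_chunk", output_key="cleaned_chunk"):
    """Read input_key from every row, generate cleaned text, write it to output_key."""
    missing = [i for i, row in enumerate(rows) if input_key not in row]
    if missing:
        raise KeyError(f"rows {missing} lack the input column {input_key!r}")
    cleaned = llm.generate([row[input_key] for row in rows])
    # Return new rows so the caller's data is left untouched.
    return [{**row, output_key: text} for row, text in zip(rows, cleaned)]


rows = [{"raw_chunk": "<p>Hello</p>"}, {"raw_chunk": "<b>World</b> "}]
result = run_sketch(rows, StubCleaner())
```

Each output row keeps its original `raw_chunk` and gains a `cleaned_chunk` field, which mirrors the default output format of the operator.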

🧠 Example Usage

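# Inside a pipeline class that already defines self.llm_serving and self.storage
self.knowledge_cleaning_step3 = KBCTextCleaner(
    llm_serving=self.llm_serving,
    lang="en",
)
self.knowledge_cleaning_step3.run(
    storage=self.storage.step(),
    input_key="raw_chunk",       # default: column holding the raw text
    output_key="cleaned_chunk",  # default: column that receives the cleaned text
)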

🧾 Default Output Format

Field	Type	Description
raw_chunk	str	The input raw knowledge text.
cleaned_chunk	str	The cleaned and standardized text generated by the model.

Example Input

{
  "raw_chunk": "<div class=\"container\">\n  <h1>Title text</h1>\n  <p>Body paragraph, including special symbols such as “curly quotes”, – dashes, etc.</p>\n  <img src=\"example.jpg\" alt=\"Diagram\">\n  <a href=\"...\">Link text</a>\n  <pre><code>Code snippet</code></pre>\n</div>"
}

Example Output

{
  "raw_chunk": "<div class=\"container\">\n  <h1>Title text</h1>\n  <p>Body paragraph, including special symbols such as “curly quotes”, – dashes, etc.</p>\n  <img src=\"example.jpg\" alt=\"Diagram\">\n  <a href=\"...\">Link text</a>\n  <pre><code>Code snippet</code></pre>\n</div>",
  "cleaned_chunk": "Title text\n\nBody paragraph, including special symbols such as \"straight quotes\", - dashes, etc.\n\n[Image: Diagram example.jpg]\n\nLink text\n\n<code>Code snippet</code>"
}
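The operator itself delegates cleaning to the LLM through its prompt template, but the kind of transformation shown in the example can be approximated with plain rules, which may help when checking model output. The sketch below is an assumption-laden approximation, not the operator's implementation: `approximate_clean`, `CHAR_MAP`, and the regexes are illustrative, and real HTML would need a proper parser.

```python
import html
import re

# Typographic characters mapped to ASCII equivalents (an illustrative subset).
CHAR_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
})


def _img_placeholder(match):
    """Turn an <img> tag into an "[Image: alt src]" placeholder line."""
    tag = match.group(0)
    alt = re.search(r'alt="([^"]*)"', tag)
    src = re.search(r'src="([^"]*)"', tag)
    parts = [m.group(1) for m in (alt, src) if m]
    return "\n[Image: " + " ".join(parts) + "]\n"


def approximate_clean(raw):
    text = re.sub(r"<img\b[^>]*>", _img_placeholder, raw)
    # Drop every tag except <code>/</code>; each removed tag becomes a line
    # break, which is crude for inline tags but sufficient for this sketch.
    text = re.sub(r"</?(?!code\b)[a-zA-Z][^>]*>", "\n", text)
    text = html.unescape(text).translate(CHAR_MAP)
    lines = [line.strip() for line in text.splitlines()]
    return "\n\n".join(line for line in lines if line)


sample = (
    '<div><h1>Title</h1><p>A \u201cquoted\u201d word \u2013 here</p>'
    '<img src="a.jpg" alt="diagram"><pre><code>x = 1</code></pre></div>'
)
cleaned = approximate_clean(sample)
```

A rule-based pass like this cannot remove sensitive information or judge which text is noise, which is why the operator relies on an LLM for those dimensions.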
