KBCTextCleaner
2025-10-09
📘 Overview
KBCTextCleaner is a knowledge-cleaning operator designed to standardize raw knowledge content by removing HTML tags, normalizing special characters, handling hyperlinks, and optimizing text structure. Its goal is to improve the quality and reliability of RAG (Retrieval-Augmented Generation) knowledge bases.
__init__ Function
def __init__(self, llm_serving: LLMServingABC, lang="en", prompt_template = KnowledgeCleanerPrompt)
Initialization Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| llm_serving | LLMServingABC | Required | The LLM service instance used for inference and text generation. |
| lang | str | "en" | Language setting for selecting the prompt template. Supports 'zh' and 'en'. |
| prompt_template | PromptABC | KnowledgeCleanerPrompt() | The prompt template object. If not provided, the default KnowledgeCleanerPrompt will be used. |
Prompt Template Description
| Prompt Template | Purpose | Application Scenario | Key Features |
|---|---|---|---|
| KnowledgeCleanerPrompt | Multi-dimensional text cleaning | Private knowledge base cleaning | Removes sensitive information and noise, performs normalization |
run Function
def run(self, storage: DataFlowStorage, input_key: str = "raw_chunk", output_key: str = "cleaned_chunk")
Executes the main logic of the operator: it reads the input DataFrame from storage, generates cleaned text, and writes the result back to storage.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| storage | DataFlowStorage | Required | DataFlow storage instance responsible for reading and writing data. |
| input_key | str | "raw_chunk" | Input column name corresponding to the raw knowledge chunk field. |
| output_key | str | "cleaned_chunk" | Output column name corresponding to the cleaned knowledge chunk field. |
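The read, clean, and write-back flow described for `run` can be sketched roughly as follows. This is a minimal illustration, not the operator's actual implementation: it uses a plain list of dictionaries in place of the storage DataFrame, and `run_cleaning_step` and `fake_cleaner` are hypothetical names standing in for the operator logic and the LLM call.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def run_cleaning_step(
    records: List[Record],
    clean_batch: Callable[[List[str]], List[str]],
    input_key: str = "raw_chunk",
    output_key: str = "cleaned_chunk",
) -> List[Record]:
    """Read input_key from every record, clean it, and store it under output_key."""
    # Fail early if the expected input column is absent from any row.
    missing = [i for i, rec in enumerate(records) if input_key not in rec]
    if missing:
        raise KeyError(f"records {missing} have no '{input_key}' field")

    raw_texts = [rec[input_key] for rec in records]
    cleaned_texts = clean_batch(raw_texts)  # expected: one cleaned string per input
    if len(cleaned_texts) != len(raw_texts):
        raise ValueError("cleaner returned a different number of outputs")

    # Keep the original fields and add the cleaned text alongside them.
    return [{**rec, output_key: text} for rec, text in zip(records, cleaned_texts)]


def fake_cleaner(texts: List[str]) -> List[str]:
    # Hypothetical stand-in for the LLM call: it only trims whitespace.
    return [t.strip() for t in texts]


rows = [{"raw_chunk": "  hello  "}, {"raw_chunk": "world\n"}]
result = run_cleaning_step(rows, fake_cleaner)
print(result)
```

Keeping the raw column next to the cleaned one mirrors the default output format, where both `raw_chunk` and `cleaned_chunk` appear in each row.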
🧠 Example Usage
self.knowledge_cleaning_step3 = KBCTextCleaner(
    llm_serving=self.llm_serving,
    lang="en"
)
self.knowledge_cleaning_step3.run(
    storage=self.storage.step(),
    # input_key="raw_chunk",       # optional; default shown
    # output_key="cleaned_chunk",  # optional; default shown
)
🧾 Default Output Format
| Field | Type | Description |
|---|---|---|
| raw_chunk | str | The input raw knowledge text. |
| cleaned_chunk | str | The cleaned and standardized text generated by the model. |
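A downstream step may want to confirm that each output row has the two fields listed above. The check below is a small sketch, assuming each row is available as a JSON object with the default column names; `validate_record` is a hypothetical helper, not part of the operator.

```python
import json

REQUIRED_FIELDS = ("raw_chunk", "cleaned_chunk")


def validate_record(record: dict) -> list:
    """Return a list of problems found in one output row (empty if it looks fine)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], str):
            problems.append(f"{field} is not a string")
    return problems


# One well-formed row (round-tripped through JSON) and one row missing the cleaned text.
line = json.dumps({"raw_chunk": "<p>hi</p>", "cleaned_chunk": "hi"})
ok = validate_record(json.loads(line))
bad = validate_record({"raw_chunk": "<p>hi</p>"})
print(ok, bad)
```

If custom `input_key` or `output_key` values were passed to `run`, `REQUIRED_FIELDS` would need to be adjusted to match.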
Example Input
{
"raw_chunk": "<div class=\"container\">\n <h1>标题文本</h1>\n <p>正文段落,包括特殊符号,例如“弯引号”、–破折号等</p>\n <img src=\"example.jpg\" alt=\"示意图\">\n <a href=\"...\">链接文本</a>\n <pre><code>代码片段</code></pre>\n</div>"
}
Example Output
{
"raw_chunk": "<div class=\"container\">\n <h1>标题文本</h1>\n <p>正文段落,包括特殊符号,例如“弯引号”、–破折号等</p>\n <img src=\"example.jpg\" alt=\"示意图\">\n <a href=\"...\">链接文本</a>\n <pre><code>代码片段</code></pre>\n</div>",
"cleaned_chunk": "标题文本\n\n正文段落,包括特殊符号,例如\"直引号\"、-破折号等\n\n[Image: 示意图 example.jpg]\n\n链接文本\n\n<code>代码片段</code>"
}
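To make the transformations in this example concrete, here is a rule-based approximation using only the Python standard library. It is not how KBCTextCleaner works (the operator relies on an LLM and a prompt template); it only mimics three of the visible effects: dropping tags, rewriting images as `[Image: alt src]`, and normalizing curly quotes and dashes. Unlike the documented output, it does not preserve `<code>` tags. `rough_clean` and `PUNCT_MAP` are hypothetical names.

```python
from html.parser import HTMLParser

# Map curly quotes and en/em dashes to their plain ASCII counterparts.
PUNCT_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"',
    "\u2018": "'", "\u2019": "'",
    "\u2013": "-", "\u2014": "-",
})


class _TextExtractor(HTMLParser):
    """Collects visible text blocks and image placeholders, discarding all tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr_map = dict(attrs)
            alt = attr_map.get("alt") or ""
            src = attr_map.get("src") or ""
            self.blocks.append(f"[Image: {alt} {src}]")

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only runs between tags
            self.blocks.append(text)


def rough_clean(html_text: str) -> str:
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()
    # Separate blocks with blank lines, then normalize punctuation.
    return "\n\n".join(parser.blocks).translate(PUNCT_MAP)


sample = (
    "<div><h1>Title</h1>\n"
    "<p>Uses \u201ccurly quotes\u201d and \u2013 dashes</p>\n"
    '<img src="example.jpg" alt="Diagram"></div>'
)
cleaned = rough_clean(sample)
print(cleaned)
```

A fixed rule set like this cannot judge what counts as noise or sensitive information, which is the part the prompt-driven cleaning is meant to handle.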
