KBCMultiHopQAGeneratorBatch


2025-10-09

📘 Overview

KBCMultiHopQAGeneratorBatch is a batch-based multi-hop question-answer pair generation operator designed to automatically generate questions and answers that require multi-step reasoning from given textual data. By invoking a Large Language Model (LLM), this operator transforms raw text into structured QA data, suitable for constructing complex QA datasets or enhancing knowledge bases.

__init__ Function

def __init__(self,
  llm_serving: LLMServingABC,
  seed: int = 0,
  lang="en",
  prompt_template = None
):

Initialization Parameters

Parameter	Type	Default	Description
llm_serving	LLMServingABC	Required	The LLM service instance used for inference and generation.
seed	int	0	Random seed to ensure reproducibility of the generation process.
lang	str	"en"	Language setting that specifies the output language for QA pairs (e.g., "en" or "zh").
prompt_template	PromptABC	None	Prompt template object used to construct the input for multi-hop QA generation. The signature defaults to None; Text2MultiHopQAGeneratorPrompt is the documented built-in template.

Prompt Template Description

Template Name	Purpose	Applicable Scenario	Key Features
Text2MultiHopQAGeneratorPrompt	Generate multi-hop QA pairs from text	Scenarios that require constructing complex reasoning questions from long passages	Built-in template that guides the model to generate the question, reasoning steps, final answer, and supporting facts, ensuring structured and logical output.

run Function

def run(
    self,
    input_key: str = 'chunk_path',
    output_key: str = 'enhanced_chunk_path',
    storage: DataFlowStorage = None,
):

Parameters

Name	Type	Default	Description
storage	DataFlowStorage	Required	DataFlow storage instance responsible for reading and writing data. The signature defaults to None, but an instance must be passed.
input_key	str	"chunk_path"	The input column name that contains the path to JSON or JSONL files with text chunks to process.
output_key	str	"enhanced_chunk_path"	The output column name that will store the path to the enhanced files containing generated QA pairs.
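
The input column holds file paths, not the text itself. The following is a minimal sketch of preparing such input with only the standard library. It assumes each chunk file is JSONL with one object per line and that the text sits in a `cleaned_chunk` field, as in the example input later on this page. The helper name `write_chunk_file` and the file name `chunks.jsonl` are hypothetical and not part of the operator.

```python
import json
import tempfile
from pathlib import Path


def write_chunk_file(chunks, path):
    """Write one JSON object per line, each holding a single text chunk."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for chunk in chunks:
            # Assumption: "cleaned_chunk" mirrors the field in the example input.
            f.write(json.dumps({"cleaned_chunk": chunk}, ensure_ascii=False) + "\n")
    return str(path)


workdir = Path(tempfile.mkdtemp())
chunk_path = write_chunk_file(
    ["The Eiffel Tower is located in Paris, the capital of France."],
    workdir / "chunks.jsonl",
)

# One row per chunk file; the "chunk_path" key matches the default input_key.
rows = [{"chunk_path": chunk_path}]
```

Rows shaped like this would then be loaded into whatever storage backend the pipeline uses, so that the operator can find the `chunk_path` column.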

🧠 Example Usage

self.knowledge_cleaning_step4 = KBCMultiHopQAGeneratorBatch(
    llm_serving=self.llm_serving,
    lang="en"
)
self.knowledge_cleaning_step4.run(
    storage=self.storage.step(),
)

🧾 Default Output Format

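The operator reads the files whose paths are listed in the input_key column, generates QA pairs for each text chunk, and writes the enriched content back to the same file. The path to the enriched file is stored in the output_key column.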

Field	Type	Description
cleaned_chunk	str	The original context text.
qa_pairs	list	A list of generated multi-hop QA pairs, each containing the question, reasoning steps, answer, supporting facts, and a type label.

Example Input (`chunk_path` file content)

{
  "cleaned_chunk": "The Eiffel Tower is located in Paris, the capital of France. The Louvre Museum, also in Paris, is the world's largest art museum."
}

Example Output (after operator execution)

{
  "cleaned_chunk": "The Eiffel Tower is located in Paris, the capital of France. The Louvre Museum, also in Paris, is the world's largest art museum.",
  "qa_pairs": [
    {
      "question": "In which country is the world's largest art museum located?",
      "reasoning_steps": [
        {"step": "The text states the Louvre Museum is the world's largest art museum."},
        {"step": "The text also states the Louvre Museum is in Paris."},
        {"step": "Paris is identified as the capital of France."}
      ],
      "answer": "France",
      "supporting_facts": [
        "The Louvre Museum, also in Paris, is the world's largest art museum.",
        "Paris, the capital of France."
      ],
      "type": "Geography"
    }
  ]
}
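
Downstream code often needs the generated pairs as flat records. The following is a minimal sketch, assuming the enhanced file has the structure shown in the example output. The function name `flatten_qa_pairs` and the output keys `num_steps` and `reasoning` are hypothetical; only the standard library is used.

```python
import json


def flatten_qa_pairs(record):
    """Turn one enhanced record into a list of flat question/answer rows."""
    rows = []
    for pair in record.get("qa_pairs", []):
        # Each reasoning step is an object with a single "step" string.
        steps = [s["step"] for s in pair.get("reasoning_steps", [])]
        rows.append({
            "question": pair["question"],
            "answer": pair["answer"],
            "num_steps": len(steps),
            "reasoning": " ".join(steps),
        })
    return rows


record = json.loads("""
{
  "cleaned_chunk": "The Louvre Museum, also in Paris, is the world's largest art museum.",
  "qa_pairs": [
    {
      "question": "In which country is the world's largest art museum located?",
      "reasoning_steps": [
        {"step": "The Louvre Museum is the world's largest art museum."},
        {"step": "The Louvre Museum is in Paris."},
        {"step": "Paris is the capital of France."}
      ],
      "answer": "France"
    }
  ]
}
""")

rows = flatten_qa_pairs(record)
print(rows[0]["question"], "->", rows[0]["answer"])
```

Using `.get` with an empty-list default keeps the helper from failing on records for which no QA pairs were generated.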
