AgenticRAGAtomicTaskGenerator


2025-10-09

📘 Overview: AgenticRAGAtomicTaskGenerator

The AgenticRAGAtomicTaskGenerator is an operator designed to generate high-quality questions and verifiable answers from provided text content. It follows a multi-step process involving identifying key concepts, generating conclusions, creating questions, and refining answers to produce structured question-answer pairs suitable for RAG systems.

`__init__` function

__init__(
    self,
    llm_serving: LLMServingABC = None,
    data_num: int = 100,
    max_per_task: int = 10,
    max_question: int = 10,
)

Parameter	Type	Default	Description
llm_serving	LLMServingABC	None	An instance of a large language model serving class, used for executing inference and generation.
data_num	int	100	The number of data samples to process.
max_per_task	int	10	The maximum number of candidate tasks to generate per input document.
max_question	int	10	The maximum number of questions to generate for each document.

Prompt Template Descriptions

This operator does not use prompt templates as it generates and processes question-answer pairs directly based on input content, without requiring intermediate prompt templates.

`run`

run(
    self,
    storage: DataFlowStorage,
    input_key: str = "prompts",
    output_question_key: str = "question",
    output_answer_key: str = "answer",
    output_refined_answer_key: str = "refined_answer",
    output_optional_answer_key: str = "optional_answer",
    output_llm_answer_key: str = "llm_answer",
    output_golden_doc_answer_key: str = "golden_doc_answer",
)

Parameter	Type	Default	Description
storage	DataFlowStorage	Required	The DataFlow storage instance responsible for reading and writing data.
input_key	str	"prompts"	The column name for the input text content.
output_question_key	str	"question"	The column name for the generated questions.
output_answer_key	str	"answer"	The column name for the initial generated answers.
output_refined_answer_key	str	"refined_answer"	The column name for the refined answers after cleaning.
output_optional_answer_key	str	"optional_answer"	The column name for alternative refined answers.
output_llm_answer_key	str	"llm_answer"	The column name for answers generated by the LLM for verification.
output_golden_doc_answer_key	str	"golden_doc_answer"	The column name for answers generated based on the golden source document.

🧠 Example Usage

from dataflow.operators.agentic_rag.generate.agenticrag_atomic_task_generator import AgenticRAGAtomicTaskGenerator
from dataflow.utils.storage import DataFlowStorage

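# Initialize the operator.
# `your_llm_serving_instance` is a placeholder: create an LLMServingABC
# implementation beforehand and pass it in here.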
generator = AgenticRAGAtomicTaskGenerator(
    llm_serving=your_llm_serving_instance,
    data_num=50,
    max_per_task=5,
    max_question=5
)

# Run the operator.
# The storage must hold a dataset that contains the input column ("prompts" here).
storage = DataFlowStorage()
generator.run(
    storage=storage,
    input_key="prompts",
    output_question_key="question",
    output_answer_key="answer",
    output_refined_answer_key="refined_answer",
    output_optional_answer_key="optional_answer",
    output_llm_answer_key="llm_answer",
    output_golden_doc_answer_key="golden_doc_answer"
)

🧾 Output Format

The operator modifies the input DataFrame by adding several new columns.

Field	Type	Description
question	str	The generated question based on the input text.
answer	str	The initial answer extracted from the reasoning process.
refined_answer	str	The cleaned and improved version of the initial answer.
optional_answer	list[str]	A list of acceptable alternative answers.
llm_answer	str	The answer generated by the LLM for verification purposes.
golden_doc_answer	str	The answer generated directly from the source document for verification.
identifier	str	Content identifier extracted from the input text.
candidate_tasks_str	str	JSON string containing candidate tasks and conclusions.
llm_score	int	Quality score for LLM-generated answers.
golden_doc_score	int	Quality score for golden document answers.
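
As a quick sanity check after a run, the column list above can be turned into a small helper. The sketch below is hypothetical: `EXPECTED_COLUMNS` and `missing_columns` are not part of DataFlow, and the names assume the default output keys of `run`. If you pass custom key names to `run`, adjust the tuple to match.

```python
# Hypothetical helper, not part of DataFlow.
# Column names are copied from the output table and assume the default
# `output_*_key` values of `run`.
EXPECTED_COLUMNS = (
    "question",
    "answer",
    "refined_answer",
    "optional_answer",
    "llm_answer",
    "golden_doc_answer",
    "identifier",
    "candidate_tasks_str",
    "llm_score",
    "golden_doc_score",
)


def missing_columns(record: dict) -> list[str]:
    """Return the expected output columns that are absent from a record."""
    return [name for name in EXPECTED_COLUMNS if name not in record]


# A record that only carries two of the expected columns.
partial_record = {"question": "q", "answer": "a"}
print(missing_columns(partial_record))
```

An empty list means every documented column is present; anything else names the columns that still need to be produced or renamed.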

Example Input:

{
  "prompts": "Explain the core concepts of quantum mechanics."
}

Example Output:

{
  "prompts": "Explain the core concepts of quantum mechanics.",
  "question": "What is the uncertainty principle in quantum mechanics?",
  "answer": "The uncertainty principle states that certain pairs of physical properties, like position and momentum, cannot both be known to arbitrary precision at the same time.",
  "refined_answer": "The Heisenberg uncertainty principle in quantum mechanics asserts that it is fundamentally impossible to simultaneously determine with arbitrary precision both the position and the momentum of a particle.",
  "optional_answer": [
    "The more precisely the position is known, the less precisely the momentum is known and vice versa.",
    "A principle in quantum mechanics expressing the limits of measurement precision for certain properties."
  ],
  "llm_answer": "The uncertainty principle means the more accurately we know a particle's position, the less accurately we can know its momentum.",
  "golden_doc_answer": "The uncertainty principle, formulated by Heisenberg, is a fundamental theory in quantum mechanics describing the limitations in measuring certain pairs of variables.",
  "identifier": "quantum mechanics core concepts",
  "candidate_tasks_str": "[{\"question\": \"What is the uncertainty principle in quantum mechanics?\", \"conclusion\": \"Measurement limits of conjugate variables.\"}, {\"question\": \"What is quantum superposition?\", \"conclusion\": \"A particle exists in multiple states at once until measured.\"}]",
  "llm_score": 5,
  "golden_doc_score": 5
}
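
Because `candidate_tasks_str` is stored as a JSON string rather than a nested structure, it has to be decoded before the individual candidate tasks can be inspected. The following is a minimal sketch using only the standard library; `parse_candidate_tasks` is a hypothetical helper, and the `question` / `conclusion` keys are assumed from the example output above rather than from a documented schema.

```python
import json


def parse_candidate_tasks(candidate_tasks_str: str) -> list[dict]:
    """Decode the JSON string from `candidate_tasks_str` into a list of dicts.

    Hypothetical helper: assumes the string encodes a JSON array of objects,
    as in the example output.
    """
    tasks = json.loads(candidate_tasks_str)
    if not isinstance(tasks, list):
        raise ValueError("candidate_tasks_str should decode to a JSON array")
    return tasks


# A row shaped like the example output, reduced to the relevant column.
row = {
    "candidate_tasks_str": (
        '[{"question": "What is quantum superposition?", '
        '"conclusion": "A particle exists in multiple states at once until measured."}]'
    )
}

for task in parse_candidate_tasks(row["candidate_tasks_str"]):
    # .get() is used because the keys are an assumption, not a guaranteed schema.
    print(task.get("question"), "->", task.get("conclusion"))
```

Decoding once and working with the resulting list avoids repeated string handling when comparing candidate tasks against the final `question` column.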
