AgenticRAGAtomicTaskGenerator
2025-10-09
📘 Overview: AgenticRAGAtomicTaskGenerator
The AgenticRAGAtomicTaskGenerator is an operator designed to generate high-quality questions and verifiable answers from provided text content. It follows a multi-step process involving identifying key concepts, generating conclusions, creating questions, and refining answers to produce structured question-answer pairs suitable for RAG systems.
__init__ function
__init__(self,
    llm_serving: LLMServingABC = None,
    data_num: int = 100,
    max_per_task: int = 10,
    max_question: int = 10,
)

| Parameter | Type | Default | Description |
|---|---|---|---|
| llm_serving | LLMServingABC | None | An instance of a large language model serving class, used for executing inference and generation. |
| data_num | int | 100 | The number of data samples to process. |
| max_per_task | int | 10 | The maximum number of candidate tasks to generate per input document. |
| max_question | int | 10 | The maximum number of questions to generate for each document. |
Prompt Template Descriptions
This operator does not use prompt templates: it generates and processes question-answer pairs directly from the input content.
run
run(
    self,
    storage: DataFlowStorage,
    input_key: str = "prompts",
    output_question_key: str = "question",
    output_answer_key: str = "answer",
    output_refined_answer_key: str = "refined_answer",
    output_optional_answer_key: str = "optional_answer",
    output_llm_answer_key: str = "llm_answer",
    output_golden_doc_answer_key: str = "golden_doc_answer",
)

| Parameter | Type | Default | Description |
|---|---|---|---|
| storage | DataFlowStorage | Required | The DataFlow storage instance responsible for reading and writing data. |
| input_key | str | "prompts" | The column name for the input text content. |
| output_question_key | str | "question" | The column name for the generated questions. |
| output_answer_key | str | "answer" | The column name for the initial generated answers. |
| output_refined_answer_key | str | "refined_answer" | The column name for the refined answers after cleaning. |
| output_optional_answer_key | str | "optional_answer" | The column name for alternative refined answers. |
| output_llm_answer_key | str | "llm_answer" | The column name for answers generated by the LLM for verification. |
| output_golden_doc_answer_key | str | "golden_doc_answer" | The column name for answers generated based on the golden source document. |
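When several operators write into the same storage, the default output column names may collide. The following is a minimal sketch, assuming only the keyword names from the `run` signature above, of building a set of prefixed column names. `prefixed_run_kwargs` and `OUTPUT_KEY_DEFAULTS` are hypothetical names introduced for illustration and are not part of DataFlow.

```python
# Hypothetical helper, not part of DataFlow. It only builds keyword
# arguments that match the `run` signature documented above.
OUTPUT_KEY_DEFAULTS = {
    "output_question_key": "question",
    "output_answer_key": "answer",
    "output_refined_answer_key": "refined_answer",
    "output_optional_answer_key": "optional_answer",
    "output_llm_answer_key": "llm_answer",
    "output_golden_doc_answer_key": "golden_doc_answer",
}


def prefixed_run_kwargs(prefix: str, input_key: str = "prompts") -> dict:
    """Return run() keyword arguments whose output columns share a prefix."""
    kwargs = {"input_key": input_key}
    for param, column in OUTPUT_KEY_DEFAULTS.items():
        kwargs[param] = f"{prefix}{column}"
    return kwargs


kwargs = prefixed_run_kwargs("atomic_")
print(kwargs["output_question_key"])  # atomic_question
```

The resulting dictionary could then be passed as `generator.run(storage=storage, **kwargs)`. Only the columns that have a corresponding `run` parameter are renamed this way.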
🧠 Example Usage
from dataflow.operators.agentic_rag.generate.agenticrag_atomic_task_generator import AgenticRAGAtomicTaskGenerator
from dataflow.utils.storage import DataFlowStorage

# Initialize the operator
generator = AgenticRAGAtomicTaskGenerator(
    llm_serving=your_llm_serving_instance,
    data_num=50,
    max_per_task=5,
    max_question=5
)

# Run the operator
storage = DataFlowStorage()
generator.run(
    storage=storage,
    input_key="prompts",
    output_question_key="question",
    output_answer_key="answer",
    output_refined_answer_key="refined_answer",
    output_optional_answer_key="optional_answer",
    output_llm_answer_key="llm_answer",
    output_golden_doc_answer_key="golden_doc_answer"
)

🧾 Output Format
The operator modifies the input DataFrame by adding several new columns.
| Field | Type | Description |
|---|---|---|
| question | str | The generated question based on the input text. |
| answer | str | The initial answer extracted from the reasoning process. |
| refined_answer | str | The cleaned and improved version of the initial answer. |
| optional_answer | list[str] | A list of acceptable alternative answers. |
| llm_answer | str | The answer generated by the LLM for verification purposes. |
| golden_doc_answer | str | The answer generated directly from the source document for verification. |
| identifier | str | Content identifier extracted from the input text. |
| candidate_tasks_str | str | JSON string containing candidate tasks and conclusions. |
| llm_score | int | Quality score for LLM-generated answers. |
| golden_doc_score | int | Quality score for golden document answers. |
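Before passing the output to later steps, it can help to check that each record carries the documented columns. The sketch below is one possible check built only from the field names in the table above. `EXPECTED_FIELDS` and `find_field_problems` are hypothetical names, and accepting either a string or a list for `optional_answer` is an assumption made as a precaution in case a storage backend serializes lists.

```python
# Expected columns and Python types, taken from the Output Format table.
# Assumption: `optional_answer` may arrive as a list or as a serialized
# string, so both are accepted here.
EXPECTED_FIELDS = {
    "question": str,
    "answer": str,
    "refined_answer": str,
    "optional_answer": (str, list),
    "llm_answer": str,
    "golden_doc_answer": str,
    "identifier": str,
    "candidate_tasks_str": str,
    "llm_score": int,
    "golden_doc_score": int,
}


def find_field_problems(record: dict) -> list:
    """Return readable problems; an empty list means the record passes."""
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            actual = type(record[name]).__name__
            problems.append(f"unexpected type for {name}: {actual}")
    return problems


# Illustrative record with placeholder values, not real operator output.
sample = {name: "" for name in EXPECTED_FIELDS}
sample.update(optional_answer=[], llm_score=5, golden_doc_score=5)
print(find_field_problems(sample))  # []
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a record at once.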
Example Input:

{
  "prompts": "Explain the core concepts of quantum mechanics."
}

Example Output:

{
  "prompts": "Explain the core concepts of quantum mechanics.",
  "question": "What is the uncertainty principle in quantum mechanics?",
  "answer": "The uncertainty principle states that certain pairs of physical properties, like position and momentum, cannot both be known to arbitrary precision at the same time.",
  "refined_answer": "The Heisenberg uncertainty principle in quantum mechanics asserts that it is fundamentally impossible to simultaneously determine with arbitrary precision both the position and the momentum of a particle.",
  "optional_answer": [
    "The more precisely the position is known, the less precisely the momentum is known and vice versa.",
    "A principle in quantum mechanics expressing the limits of measurement precision for certain properties."
  ],
  "llm_answer": "The uncertainty principle means the more accurately we know a particle's position, the less accurately we can know its momentum.",
  "golden_doc_answer": "The uncertainty principle, formulated by Heisenberg, is a fundamental theory in quantum mechanics describing the limitations in measuring certain pairs of variables.",
  "identifier": "quantum mechanics core concepts",
  "candidate_tasks_str": "[{\"question\": \"What is the uncertainty principle in quantum mechanics?\", \"conclusion\": \"Measurement limits of conjugate variables.\"}, {\"question\": \"What is quantum superposition?\", \"conclusion\": \"A particle exists in multiple states at once until measured.\"}]",
  "llm_score": 5,
  "golden_doc_score": 5
}
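
Because `candidate_tasks_str` is stored as a JSON string, it has to be decoded before the individual candidate tasks can be inspected. The sketch below decodes the value from the example output with the standard `json` module. The `question` and `conclusion` keys are taken from that example only, so the actual schema may contain other fields.

```python
import json

# Value copied from the example output above.
candidate_tasks_str = (
    '[{"question": "What is the uncertainty principle in quantum mechanics?", '
    '"conclusion": "Measurement limits of conjugate variables."}, '
    '{"question": "What is quantum superposition?", '
    '"conclusion": "A particle exists in multiple states at once until measured."}]'
)

# Decode the JSON string into a list of dictionaries.
candidate_tasks = json.loads(candidate_tasks_str)

for task in candidate_tasks:
    print(f'{task["question"]} -> {task["conclusion"]}')
```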
