RARE Data Synthesis Pipeline
2025-07-04
1. Overview
The RARE (Retrieval-Augmented Reasoning Modeling) Data Synthesis Pipeline is an end-to-end framework designed to enhance the domain-specific intelligence of Large Language Models (LLMs) by decoupling knowledge storage and reasoning optimization. The core ideas of the RARE method are:
- Knowledge Externalization: Storing domain knowledge in a retrievable external source.
- Reasoning Internalization: During the training process, the model focuses on learning and internalizing domain-specific reasoning patterns.
This pipeline can generate high-quality, knowledge- and reasoning-intensive training data from a given set of documents, enabling even lightweight models to achieve top-tier performance, potentially surpassing large models like GPT-4 and DeepSeek-R1.
Dependency Installation
The `BM25HardNeg` operator in `RAREPipeline` depends on `pyserini`, `gensim`, and a JDK. On Linux, install them as follows:

```bash
sudo apt install openjdk-21-jdk
pip install pyserini gensim
```
2. Dataflow and Pipeline Logic
1. Input Data
The process begins with input data containing just one core field:
- text: The plain text content of a document from any domain.
This data is managed by a `FileStorage` object, allowing you to easily configure input file paths, cache paths, and file formats.

```python
self.storage = FileStorage(
    first_entry_file_name="../example_data/AgenticRAGPipeline/pipeline_small_chunk.json",
    cache_path="./cache_local",
    file_name_prefix="dataflow_cache_step",
    cache_type="json",
)
```
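Concretely, the first entry file is just a JSON list of records, each carrying a `text` field. Here is a minimal sketch with invented documents and a temporary file path (in practice, point `first_entry_file_name` at your own data):

```python
import json
import os
import tempfile

# Hypothetical input records: each entry only needs a "text" field.
records = [
    {"text": "Photosynthesis converts light energy into chemical energy."},
    {"text": "The 1648 Peace of Westphalia shaped the modern idea of state sovereignty."},
]

# Write the records in the JSON format FileStorage reads as its first entry.
path = os.path.join(tempfile.gettempdir(), "pipeline_small_chunk.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

# Reading the file back shows the single required field.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)
assert all("text" in record for record in loaded)
```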
2. Generate Knowledge and Reasoning-Intensive Questions (Doc2Query)
The first step in the pipeline is the `Doc2Query` operator. It uses an LLM to generate questions and scenarios based on the input documents that require complex reasoning to answer. These questions are designed to be independent of the original document, but the reasoning process required to answer them relies on the knowledge contained within the document.
Functionality:
- For each document, it generates a self-contained question and scenario that demand deep reasoning.
- The questions are designed to test higher-order thinking skills such as analysis, evaluation, and synthesis.
- Answering the questions requires leveraging knowledge from the source document.
Input: the original `text` content.
Output: adds new `question` and `scenario` fields.

```python
# Call within RAREPipeline
self.doc2query_step1.run(
    storage=self.storage.step(),
    input_key="text",
)
```
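Under the hood, `Doc2Query` prompts the LLM once per document. The operator's actual prompt template is internal to the library; the simplified, hypothetical version below only illustrates the shape of the task it poses:

```python
# A simplified, hypothetical Doc2Query-style prompt. The real template
# used by the operator lives inside the library and may differ.
DOC2QUERY_PROMPT = (
    "Read the document below, then write (1) a realistic scenario and "
    "(2) a self-contained question about that scenario. The question must "
    "not mention the document, but answering it correctly should require "
    "the knowledge and multi-step reasoning the document supports.\n\n"
    "Document:\n{document}\n\n"
    'Respond as JSON with keys "scenario" and "question".'
)

document = "Capillary action lets water rise in narrow tubes against gravity."
prompt = DOC2QUERY_PROMPT.format(document=document)
```

The key design constraint, stated in the prompt itself, is that the question stands alone while the document remains the evidence needed to reason toward an answer.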
3. Mine Hard Negative Samples (BM25HardNeg)
The second step uses the `BM25HardNeg` operator. After generating the questions, this step utilizes the BM25 algorithm to retrieve and filter "hard negative samples" for each question from the entire dataset. These negative samples are textually similar to the "correct" document (the positive sample) but cannot be logically used to answer the question, thus increasing the challenge for the model in the subsequent reasoning step.
Functionality:
- Uses the generated questions to retrieve relevant documents as distractors with the BM25 algorithm.
- Filters for hard negative samples that are similar to the positive sample but irrelevant, enhancing the model's discriminative ability.
Input: `question` (the query) and `text` (the positive sample).
Output: adds a `hard_negatives` field, which contains a set of hard negative samples.

```python
# Call within RAREPipeline
self.bm25hardneg_step2.run(
    storage=self.storage.step(),
    input_question_key="question",
    input_text_key="text",
    output_negatives_key="hard_negatives",
)
```
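To make the idea concrete, here is a self-contained sketch of BM25-based hard-negative mining over a toy corpus. This is a plain-Python BM25, not the `pyserini` implementation the operator depends on, and the corpus and query are invented; the real operator may score and filter differently:

```python
import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every document in `corpus` against `query` with plain BM25."""
    tokenized = [doc.lower().split() for doc in corpus]
    avgdl = sum(len(d) for d in tokenized) / len(tokenized)
    n = len(tokenized)
    # Document frequency of each term across the corpus.
    df = Counter()
    for doc in tokenized:
        for term in set(doc):
            df[term] += 1
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

corpus = [
    "bm25 ranks documents by combining term frequency with inverse document frequency",
    "bm25 was tuned on trec documents but this note only measures indexing speed",
    "gardening tips for growing tomatoes in raised beds",
]
question = "how does bm25 rank documents"
positive_idx = 0  # the document the question was generated from

scores = bm25_scores(question, corpus)
# The highest-scoring non-positive documents become hard negatives:
# lexically close to the query, but not usable to answer it.
ranked = sorted((i for i in range(len(corpus)) if i != positive_idx),
                key=lambda i: scores[i], reverse=True)
hard_negatives = [corpus[i] for i in ranked[:1]]
```

Note how the second document shares surface vocabulary (`bm25`, `documents`) with the query and so outranks the unrelated gardening text, which is exactly what makes it a useful distractor.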
4. Distill the Reasoning Process (ReasonDistill)
The final step is the `ReasonDistill` operator. It combines the question, scenario, one positive sample, and multiple hard negative samples to construct a complex prompt. It then leverages a powerful "teacher" LLM (like GPT-4o) to generate a detailed, step-by-step reasoning process (Chain-of-Thought) that demonstrates how to use the provided (mixed true and false) information to arrive at the final answer.
Functionality:
- Randomly shuffles the positive and hard negative samples to simulate the noisy environment of real-world information retrieval.
- Prompts a powerful "teacher" model to generate a detailed reasoning chain that a "student" model can learn from.
Input: `text` (the positive sample), `question`, `scenario`, and `hard_negatives`.
Output: adds a `reasoning` field containing the detailed reasoning process generated by the teacher model.

```python
# Call within RAREPipeline
self.reasondistill_step3.run(
    storage=self.storage.step(),
    input_text_key="text",
    input_question_key="question",
    input_scenario_key="scenario",
    input_hardneg_key="hard_negatives",
    output_key="reasoning",
)
```
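The core trick here, shuffling the positive document in among the hard negatives before prompting the teacher, can be sketched as follows. The prompt wording and function name are hypothetical; the operator's real template is internal to the library:

```python
import random

def build_distill_prompt(question, scenario, positive, hard_negatives, seed=0):
    """Mix the positive document into the hard negatives so the teacher
    model must identify the relevant evidence before reasoning from it."""
    docs = [positive] + list(hard_negatives)
    random.Random(seed).shuffle(docs)  # simulate noisy retrieval order
    context = "\n\n".join(f"[Doc {i + 1}] {d}" for i, d in enumerate(docs))
    return (
        f"Scenario: {scenario}\n"
        f"Question: {question}\n\n"
        f"Retrieved documents (some are irrelevant):\n{context}\n\n"
        "Think step by step: first decide which documents are relevant, "
        "then reason from them to a final answer."
    )

prompt = build_distill_prompt(
    question="Which evidence supports the claim?",
    scenario="A researcher is reviewing retrieved notes.",
    positive="The relevant note with the actual evidence.",
    hard_negatives=["A similar-looking but irrelevant note.",
                    "Another distractor note."],
)
```

Because the positive's position is randomized, the distilled reasoning chain has to demonstrate evidence selection as well as deduction, which is the behavior the student model is meant to internalize.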
3. Running the Complete Pipeline
Below is the sample code for running the complete `RAREPipeline`. It executes the three steps described above in sequence, progressively transforming the original documents into high-quality training data that includes a question, a scenario, hard negative samples, and a detailed reasoning process.
```python
from dataflow.operators.generate.RARE import (
    Doc2Query,
    BM25HardNeg,
    ReasonDistill,
)
from dataflow.utils.storage import FileStorage
from dataflow.llmserving import APILLMServing_request, LocalModelLLMServing

class RAREPipeline():
    def __init__(self):
        self.storage = FileStorage(
            first_entry_file_name="../example_data/AgenticRAGPipeline/pipeline_small_chunk.json",
            cache_path="./cache_local",
            file_name_prefix="dataflow_cache_step",
            cache_type="json",
        )
        # Use an API server as the LLM service
        llm_serving = APILLMServing_request(
            api_url="https://api.openai.com/v1/chat/completions",
            model_name="gpt-4o",
            max_workers=1,
        )
        self.doc2query_step1 = Doc2Query(llm_serving)
        self.bm25hardneg_step2 = BM25HardNeg()
        self.reasondistill_step3 = ReasonDistill(llm_serving)

    def forward(self):
        self.doc2query_step1.run(
            storage=self.storage.step(),
            input_key="text",
        )
        self.bm25hardneg_step2.run(
            storage=self.storage.step(),
            input_question_key="question",
            input_text_key="text",
            output_negatives_key="hard_negatives",
        )
        self.reasondistill_step3.run(
            storage=self.storage.step(),
            input_text_key="text",
            input_question_key="question",
            input_scenario_key="scenario",
            input_hardneg_key="hard_negatives",
            output_key="reasoning",
        )

if __name__ == "__main__":
    model = RAREPipeline()
    model.forward()
```