CiderSampleEvaluator
2025-10-09
📘 Overview
CiderSampleEvaluator is an evaluation operator that uses the CIDEr (Consensus-based Image Description Evaluation) metric to measure the similarity between generated text and reference text. The operator scores each sample by TF-IDF-weighted n-gram overlap and is commonly used to evaluate tasks such as image caption generation.
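To build intuition: the metric represents each sentence as a bag of 1- to n-grams, weights the counts by TF-IDF, and compares candidate and reference vectors by cosine similarity. Below is a minimal illustrative sketch of the n-gram counting step only; it is not the operator's internal code.

```python
from collections import Counter

def ngram_counts(text: str, n: int = 4) -> Counter:
    """Count all 1- to n-grams of a whitespace-tokenized sentence."""
    tokens = text.lower().split()
    return Counter(
        tuple(tokens[i:i + k])
        for k in range(1, n + 1)
        for i in range(len(tokens) - k + 1)
    )

# Overlapping n-grams between candidate and reference drive the score;
# CIDEr then down-weights frequent n-grams by inverse document frequency.
cand = ngram_counts("the quick brown fox")
ref = ngram_counts("a quick brown fox")
print(cand & ref)  # shared n-grams, e.g. ('quick',), ('quick', 'brown')
```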
init
```python
def __init__(self, n=4, sigma=6.0, df_mode="coco-val-df", idf_path="./dataflow/operators/general_text/eval/cider/coco-val-df.p")
```

init Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| n | int | 4 | Maximum n-gram length for overlap calculation. |
| sigma | float | 6.0 | Sigma parameter for Gaussian penalty calculation, adjusting the magnitude of difference penalty. |
| df_mode | str | "coco-val-df" | Document frequency (DF) calculation mode, options: "coco-val-df" or "corpus". |
| idf_path | str | "./dataflow/operators/general_text/eval/cider/coco-val-df.p" | Pre-calculated IDF (Inverse Document Frequency) file path. Required when df_mode is "coco-val-df". |
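For orientation, these parameters plug into the standard CIDEr-D formulation (Vedantam et al., 2015). This is a reference sketch of the metric, not code extracted from the operator:

$$
\text{CIDEr-D}_n(c, S) = \frac{10}{m} \sum_{j=1}^{m} e^{-\frac{(l(c)-l(s_j))^2}{2\sigma^2}} \cdot \frac{\min\!\big(g^n(c),\, g^n(s_j)\big) \cdot g^n(s_j)}{\lVert g^n(c)\rVert \, \lVert g^n(s_j)\rVert},
\qquad
\text{CIDEr-D}(c, S) = \frac{1}{N}\sum_{n=1}^{N} \text{CIDEr-D}_n(c, S)
$$

where $g^n(\cdot)$ is the vector of TF-IDF-weighted n-gram counts, $l(\cdot)$ is sentence length, $m$ is the number of references, $N$ is the maximum n-gram length (the n parameter), and $\sigma$ is the sigma parameter. The factor of 10 is why scores typically land in the 0-10 range, and the Gaussian term penalizes length mismatch between candidate and reference.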
IDF File Download Instructions
When using df_mode="coco-val-df" (the default mode), you need to download the pre-computed IDF file:
Download URL: https://github.com/ramavedantam/coco-caption/raw/master/data/coco-val-df.p
File Location: Place the downloaded file at ./dataflow/operators/general_text/eval/cider/coco-val-df.p
Alternatively, you can set df_mode="corpus" to dynamically calculate IDF from the current corpus without downloading the file.
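For convenience, here is a minimal Python sketch that fetches the file to the expected location, equivalent to a manual download (assumes network access; URL and target path are as documented above):

```python
import os
import urllib.request

url = "https://github.com/ramavedantam/coco-caption/raw/master/data/coco-val-df.p"
dst = "./dataflow/operators/general_text/eval/cider/coco-val-df.p"

# Create the target directory and download the file if it is not present yet.
os.makedirs(os.path.dirname(dst), exist_ok=True)
if not os.path.exists(dst):
    urllib.request.urlretrieve(url, dst)
```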
run
```python
def run(self, storage: DataFlowStorage, input_key: str, reference_key: str, output_key: str = 'CiderScore')
```

Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| storage | DataFlowStorage | Required | DataFlow storage instance for reading and writing data. |
| input_key | str | Required | Input column name corresponding to the generated text field to be evaluated. |
| reference_key | str | Required | Input column name corresponding to the reference text field for comparison. |
| output_key | str | "CiderScore" | Output column name corresponding to the generated CIDEr score field. |
🧠 Example Usage
```python
from dataflow.operators.general_text import CiderSampleEvaluator
from dataflow.utils.storage import FileStorage

class CiderSampleEvaluatorTest:
    def __init__(self):
        self.storage = FileStorage(
            first_entry_file_name="./dataflow/example/GeneralTextPipeline/gen_input.jsonl",
            cache_path="./cache",
            file_name_prefix="dataflow_cache_step",
            cache_type="jsonl",
        )
        self.evaluator = CiderSampleEvaluator(
            n=4,
            sigma=6.0,
            df_mode="coco-val-df",
            idf_path="./dataflow/operators/general_text/eval/cider/coco-val-df.p"
        )

    def forward(self):
        self.evaluator.run(
            storage=self.storage.step(),
            input_key='input_key',
            reference_key='reference_key',
            output_key='CiderScore'
        )

if __name__ == "__main__":
    test = CiderSampleEvaluatorTest()
    test.forward()
```

🧾 Default Output Format
| Field | Type | Description |
|---|---|---|
| input_key | str | The original generated text |
| reference_key | str | The original reference text |
| CiderScore | float | CIDEr score (typically 0-10, higher indicates better similarity) |
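The scores can be consumed directly from the cached JSONL for downstream filtering. A minimal sketch; the file name below is illustrative, since FileStorage derives the actual cache file name from cache_path and file_name_prefix:

```python
import json

# Illustrative path: the real name is generated by FileStorage from
# cache_path and file_name_prefix (see the example above).
with open("./cache/dataflow_cache_step_step1.jsonl") as f:
    rows = [json.loads(line) for line in f]

# Keep samples whose generated text closely matches the reference.
good = [r for r in rows if r["CiderScore"] > 1.0]
print(f"{len(good)}/{len(rows)} samples above threshold")
```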
📋 Example Input
{"input_key": "The quick brown fox jumps over the lazy dog.", "reference_key": "A fast brown fox leaps over a lazy dog."}
{"input_key": "She sells seashells by the seashore.", "reference_key": "She is selling shells by the beach."}
{"input_key": "To be or not to be, that is the question.", "reference_key": "The question is whether to be or not."}
{"input_key": "All that glitters is not gold.", "reference_key": "Not everything that shines is gold."}
{"input_key": "A picture is worth a thousand words.", "reference_key": "A single image can convey so much meaning."}📤 Example Output
{"input_key": "The quick brown fox jumps over the lazy dog.", "reference_key": "A fast brown fox leaps over a lazy dog.", "CiderScore": 2.1455700748}
{"input_key": "She sells seashells by the seashore.", "reference_key": "She is selling shells by the beach.", "CiderScore": 0.8880643249}
{"input_key": "To be or not to be, that is the question.", "reference_key": "The question is whether to be or not.", "CiderScore": 0.5822556198}
{"input_key": "All that glitters is not gold.", "reference_key": "Not everything that shines is gold.", "CiderScore": 0.8292019111}
{"input_key": "A picture is worth a thousand words.", "reference_key": "A single image can convey so much meaning.", "CiderScore": 0.6053086919}📊 Results Analysis
Sample 1:
- CiderScore: 2.146 (high)
- Highest TF-IDF-weighted n-gram overlap, with strong keyword matches
Sample 2:
- CiderScore: 0.888 (medium)
- Partial keyword overlap; moderate score after TF-IDF weighting
Sample 3:
- CiderScore: 0.582 (medium-low)
- Heavy sentence restructuring breaks up consecutive n-gram matches
Sample 4:
- CiderScore: 0.829 (medium)
- Core concept words are present, but the phrasing differs noticeably
Sample 5:
- CiderScore: 0.605 (medium-low)
- Large vocabulary differences, though TF-IDF weighting still captures some relevance
Application Scenarios:
- Image caption generation evaluation
- Video subtitle quality assessment
- Text generation tasks requiring consideration of word importance
- Evaluation scenarios balancing diversity and consistency

