SimHashDeduplicateFilter

About 508 wordsAbout 2 min

2025-10-09

📘 Overview

SimHashDeduplicateFilter is an approximate text deduplication operator based on the SimHash algorithm. It efficiently identifies similar content by converting text into fixed-length "fingerprints" and calculating the Hamming distance between fingerprints. This operator is faster than semantic deduplication, making it especially suitable for rapid preprocessing deduplication of character-level similar text when processing large-scale datasets.

init Function

def __init__(self, fingerprint_size: int = 64, bound: float = 0.1)

Parameter	Type	Default	Description
fingerprint_size	int	64	Length (in bits) of the SimHash fingerprint.
bound	float	0.1	Similarity distance threshold. When the ratio of Hamming distance between two text fingerprints to fingerprint length is less than this threshold, they are considered duplicates. For example, the default value of 0.1 means text with similarity higher than 90% will be considered duplicates.

run Function

def run(self, storage: DataFlowStorage, input_keys: list = None, input_key: str = None, output_key: str = 'minhash_deduplicated_label')

Name	Type	Default	Description
storage	DataFlowStorage	Required	DataFlow storage instance for reading and writing data.
input_keys	list	None	List of multiple input column names containing text for deduplication. Choose one of `input_key` or `input_keys`.
input_key	str	None	Single input column name containing text for deduplication. Choose one of `input_key` or `input_keys`.
output_key	str	'minhash_deduplicated_label'	Column name for output result labels, marking whether samples are duplicates.

🧠 Example Usage

from dataflow.operators.general_text import SimHashDeduplicateFilter
from dataflow.utils.storage import FileStorage

class SimHashDeduplicateFilterTest():
    def __init__(self):
        self.storage = FileStorage(
            first_entry_file_name="./dataflow/example/GeneralTextPipeline/simhash_test_input.jsonl",
            cache_path="./cache",
            file_name_prefix="dataflow_cache_step",
            cache_type="jsonl",
        )
        
        self.filter = SimHashDeduplicateFilter(
            fingerprint_size=64,
            bound=0.1
        )
        
    def forward(self):
        self.filter.run(
            storage=self.storage.step(),
            input_key='text',
            output_key='minhash_deduplicated_label'
        )

if __name__ == "__main__":
    test = SimHashDeduplicateFilterTest()
    test.forward()

🧾 Default Output Format

The operator adds a new label column (specified by output_key parameter) to the DataFrame and filters out duplicate rows.

Field	Type	Description
minhash_deduplicated_label	int	Deduplication label. Value of 1 indicates the sample is unique (retained), value of 0 indicates duplicate (removed in final output DataFrame).

📋 Example Input

{"text": "Hello world, this is a test message."}
{"text": "Hello world, this is a test message."}
{"text": "Completely different text goes here."}

📤 Example Output

{"text": "Hello world, this is a test message.", "minhash_deduplicated_label": 1}
{"text": "Completely different text goes here.", "minhash_deduplicated_label": 1}

📊 Result Analysis

Sample 1 (First message):

Generate 64-bit SimHash fingerprint
First occurrence, serves as baseline
Retained (unique sample)

Sample 2 (Duplicate message):

Generates identical SimHash fingerprint
Hamming distance = 0, similarity = 1.0
Similarity 1.0 > (1 - 0.1) = 0.9
Filtered out (duplicate)

Sample 3 (Different text):

Generates different SimHash fingerprint
Large Hamming distance, similarity < 0.9
Retained (unique sample)

How It Works:

Generate fixed-length SimHash fingerprint for each text
Calculate Hamming distance between fingerprints
Hamming distance / fingerprint length = difference degree
Similarity = 1 - difference degree
Similarity ≥ (1 - bound) considered duplicate

Use Cases:

Rapid deduplication for large-scale text
Approximate duplicate detection
Web content deduplication
Document similarity detection

Notes:

Larger fingerprint_size means higher precision but slower computation
bound=0.1 means similarity > 90% considered duplicate
Faster than MinHash but slightly lower precision
Suitable for character-level similarity detection
Sensitive to text order

eval

generate

eval

generate

eval

filter

generate

eval

filter

generate

generate

eval

filter

generate

refine

generate

generate

eval

filter

refine

generate

generate

eval

filter

generate

eval

filter

generate

eval

filter

generate

eval

filter

generate

refine

SimHashDeduplicateFilter