General Data Processing Operators
2025-06-09
Overview
DataFlow currently supports text data processing at the data point level, categorized into three types: refiners, deduplicators and filters.
Type | Count | Description |
---|---|---|
Refiners | 16 | Improve the content of data points through processing and augmentation without changing the number of data points. |
Deduplicators | 6 | Remove duplicate data points using methods such as hashing. |
Filters | 42 | Keep or discard data points based on thresholds and other criteria. |
Refiners
Name | Applicable Type | Description | Repository or Paper |
---|---|---|---|
CondorRefiner | SFT | Generates evaluations and rewrites of SFT responses using LLM APIs to improve QA quality. | paper |
LowercaseRefiner | NLP | Converts text fields to lowercase. | - |
PIIAnonymizeRefiner | Pre-training | Anonymizes Personally Identifiable Information (PII), such as names and locations, to protect privacy. | Code |
RemovePunctuationRefiner | NLP | Removes punctuation from text. | - |
RemoveNumberRefiner | NLP | Removes numeric characters from text. | - |
RemoveExtraSpacesRefiner | NLP, Pre-training | Replaces multiple consecutive spaces with a single space and trims leading/trailing spaces. | - |
RemoveRepetitionsPunctuationRefiner | NLP | Removes repeated punctuation, e.g., "!!!" becomes "!". | - |
RemoveEmojiRefiner | Pre-training | Removes emojis from text, e.g., "😀". | Code |
RemoveEmoticonsRefiner | Pre-training | Removes emoticons such as ":-)", using a predefined list. | Code |
RemoveContractionsRefiner | NLP | Expands contractions in text, e.g., "can't" becomes "cannot". | Code |
HtmlUrlRemoverRefiner | Pre-training | Removes URLs and HTML tags from text. | - |
TextNormalizationRefiner | NLP | Normalizes formats for dates, currencies, etc., in text. | - |
NERRefiner | NLP | Uses Named Entity Recognition (NER) to identify and mask specific entities in text. | Code |
StemmingLemmatizationRefiner | NLP | Performs stemming or lemmatization on text. | Code |
SpellingCorrectionRefiner | NLP, Pre-training | Corrects spelling errors in text using SymSpell. | Code |
RemoveStopwordsRefiner | NLP | Removes stopwords (e.g., "the", "is") from text. | Code |
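
To make the refiner contract concrete, here is a minimal sketch of three of the simpler refiners from the table, chained together. It is not DataFlow's implementation: the function names, the `text` key, and the punctuation set in the regular expression are assumptions chosen for illustration. The point it shows is that refiners rewrite each data point and never change how many data points there are.

```python
import re


def lowercase(text: str) -> str:
    # Same idea as LowercaseRefiner: convert the text to lowercase.
    return text.lower()


def remove_repeated_punctuation(text: str) -> str:
    # Collapse a run of the same punctuation mark, e.g. "!!!" -> "!".
    # The character set here is an assumption, not DataFlow's list.
    return re.sub(r"([!?.,;:])\1+", r"\1", text)


def remove_extra_spaces(text: str) -> str:
    # Replace consecutive spaces with one space and trim both ends.
    return re.sub(r" +", " ", text).strip()


def refine(records, key="text"):
    # Apply each step to every record; the output has one record per input.
    steps = (lowercase, remove_repeated_punctuation, remove_extra_spaces)
    refined = []
    for record in records:
        text = record[key]
        for step in steps:
            text = step(text)
        refined.append({**record, key: text})
    return refined


if __name__ == "__main__":
    print(refine([{"text": "  Hello   WORLD!!!  "}]))
```

Note that this naive punctuation rule also collapses an ellipsis ("..." becomes "."), which a production refiner may want to treat separately.
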
Deduplicators
Name | Type | Description | Repository or Paper |
---|---|---|---|
HashDeduplicator | Exact Deduplication | Uses various hash functions (e.g., MD5, SHA256, XXH3_128) to remove duplicate data based on exact hash value comparison. Suitable for small-scale simple deduplication. | - |
CCNetDeduplicator | Exact Deduplication | Compares the first 64 bits of the SHA-1 hash to identify duplicate text, balancing security and computational efficiency. | - |
NgramHashDeduplicator | Near Deduplication | Combines n-gram techniques with hashing to detect duplicates based on multiple hash comparisons of n-gram segments. Useful for identifying near-duplicates. | Paper |
SemDeduplicator | Near Deduplication | Uses semantic similarity based on BERT embeddings and cosine similarity to detect duplicates. Ideal for detecting semantically similar but differently phrased text. | Paper Code |
SimHashDeduplicator | Near Deduplication | Uses the SimHash algorithm to detect similar text based on Hamming distance of fingerprints. Efficient for large-scale data deduplication. | Paper |
MinHashDeduplicator | Near Deduplication | Combines MinHash and LSH to compare sets with minimal memory usage and computation cost, detecting similarity between sets. | Paper |
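
The ideas behind the exact and near deduplicators above can be sketched with the standard library alone. This is an illustrative sketch, not DataFlow's code: all function names are hypothetical, the SimHash variant uses unweighted whitespace tokens hashed with MD5 (an assumption), and XXH3_128 is omitted because it is not part of `hashlib`.

```python
import hashlib


def exact_key(text: str, algorithm: str = "md5") -> str:
    # Full digest of the text: identical texts give identical keys.
    return hashlib.new(algorithm, text.encode("utf-8")).hexdigest()


def sha1_prefix_key(text: str, bits: int = 64) -> int:
    # CCNet-style key: keep only the first `bits` bits of the SHA-1 digest.
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def deduplicate(texts, key_fn=exact_key):
    # Keep the first occurrence of each key and drop later ones.
    seen = set()
    kept = []
    for text in texts:
        key = key_fn(text)
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept


def simhash(text: str, bits: int = 64) -> int:
    # Each token votes +1 or -1 on every bit position; the sign of the
    # total decides the fingerprint bit. Assumes bits <= 128 (MD5 size).
    votes = [0] * bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            votes[i] += 1 if (value >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    # Number of bit positions where the two fingerprints differ.
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    print(deduplicate(["a", "b", "a"], key_fn=sha1_prefix_key))
    d = hamming_distance(simhash("the quick brown fox"),
                         simhash("the quick brown dog"))
    print("Hamming distance:", d)
```

A near deduplicator built on this would treat two texts as duplicates when the Hamming distance between their fingerprints falls below a chosen threshold; the threshold is a tuning parameter and is not specified here.
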
Filters
Name | Applicable Type | Description | Repository or Paper |
---|---|---|---|
LanguageFilter | Pre-training, SFT | Filters specific languages using the fasttext language identification model. | Huggingface |
BlocklistFilter | Pre-training, SFT | Filters data points using a blocklist (e.g., List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words). | Code |
Additionally, Open-DataFlow-Eval supports filtering data points based on scores from single data point scorers, with 18 supported scorers.
    DeitaQualityFilter:
      min_score: 1
      max_score: 5
      scorer_args:
        device: 'cuda:0'
        model_name: 'hkust-nlp/deita-quality-scorer'
        max_length: 512
You can set the min/max scores and the scorer parameters in scorer_args for filtering. For more information on supported scorers, refer to the evaluation algorithm documentation (excluding the Diversity part).
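
The score-based filtering described here amounts to keeping a data point only when its score lies inside the configured range. The sketch below illustrates that logic under stated assumptions: `filter_by_score` and `toy_scorer` are hypothetical names, the bounds are treated as inclusive (the source does not say), and the toy scorer merely stands in for a real model such as the Deita quality scorer.

```python
def toy_scorer(text: str) -> float:
    # Hypothetical stand-in for a model-based scorer: maps word count
    # to a score clamped to the range 1..5.
    return float(min(5, max(1, len(text.split()))))


def filter_by_score(records, scorer, min_score, max_score, key="text"):
    # Keep records whose score is within [min_score, max_score].
    # Inclusive bounds are an assumption made for this sketch.
    kept = []
    for record in records:
        score = scorer(record[key])
        if min_score <= score <= max_score:
            kept.append(record)
    return kept


if __name__ == "__main__":
    data = [{"text": "hi"}, {"text": "a b c"}, {"text": "a b c d e f"}]
    print(filter_by_score(data, toy_scorer, min_score=2, max_score=4))
```

In a real configuration the scorer would be constructed from the values under scorer_args (device, model name, maximum length) rather than passed in as a plain function.
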
In addition, heuristic rule filtering plays a significant role in screening pre-training data. The Dingo Data Quality Evaluation Tool greatly inspired our development in this area: we have integrated 22 of the rule-based filtering algorithms used in Dingo into dataflow/process/text/filters/heuristics.py. For details, refer to the Rules Documentation. The names of the filters can be found in the same heuristics.py file.
All 42 data filters mentioned above share the same YAML invocation method.