General Data Evaluation Operators
2025-06-09
Text quality evaluation
Scorers are divided into the following four types; each scorer provides one or more scores.
Type | Count | Description |
---|---|---|
APIcaller | 3 | Call API for scoring |
Diversity | 2 | Compute diversity score of the entire dataset |
Models | 12 | Model or classifier-based scoring |
Statistics | 3 | Statistical metric scoring |
Regarding data types: [Text] indicates that the scorer accepts single-field string input, suitable for pre-training or fine-tuning data. [Instruction] indicates that the scorer is only suitable for fine-tuning data with multi-field input.
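As an illustration, the two data types can be thought of as follows. The field names below are hypothetical examples for clarity, not a schema mandated by the framework:

```python
# Hypothetical examples of the two input formats (field names are
# illustrative only, not the framework's required schema).

# [Text]: a single free-text field, e.g. a pre-training corpus record.
text_sample = {
    "text": "The mitochondrion is the powerhouse of the cell.",
}

# [Instruction]: multi-field fine-tuning data, e.g. an Alpaca-style record.
instruction_sample = {
    "instruction": "Summarize the following passage in one sentence.",
    "input": "The mitochondrion is the powerhouse of the cell.",
    "output": "Mitochondria generate most of the cell's energy.",
}
```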
Open-source operators of this kind are quite limited. To achieve better data processing quality and to fill the gaps in data evaluation methods missing from open source, we have carefully designed and developed a new set of operators. The labels have the following meanings:
🚀 Independent Innovation: The core algorithms are original, filling gaps in existing methods or further improving performance to break through current bottlenecks.
✨ Open-Source First Release: This operator is integrated into a mainstream community framework for the first time, making it easier for more developers to use and enabling open sharing.
List of Scorers
APIcaller
Name | Evaluation Dimension | Data Type | Description | Value Range | Official Repository or Paper |
---|---|---|---|---|---|
AlpagasusScorer✨ | Content Accuracy & Effectiveness | Instruction | Evaluates the quality of instructions by calling GPT, returning a quality score. A higher score indicates higher instruction quality. | [0, 5] | paper |
PerspectiveScorer✨ | Safety | Text | Uses PerspectiveAPI to evaluate the toxicity of the text, returning a toxicity probability. A higher score indicates higher text toxicity. | [0, 1] | API |
TreeinstructScorer✨ | Diversity & Complexity | Instruction | Measures instruction complexity by generating the number of nodes in the syntax tree; more nodes indicate more complex instructions. | - | paper |
Diversity
Name | Evaluation Dimension | Data Type | Description | Value Range | Official Repository or Paper |
---|---|---|---|---|---|
Task2VecScorer✨ | Diversity & Complexity | Text | Evaluates the diversity of the dataset using the Task2Vec method. Higher scores indicate higher dataset diversity. | [0.0525±3.41E-4, 0.4037±1.932E-5] | paper code |
VendiScorer | Diversity & Complexity | Text | Evaluates dataset diversity by calculating VendiScore; higher scores indicate higher diversity. | - | paper code |
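For intuition, the Vendi Score is the exponential of the Shannon entropy of the eigenvalues of a normalized similarity kernel over the dataset. Below is a minimal sketch assuming precomputed embedding vectors; the actual VendiScorer computes this over n-gram, BERT, and SimCSE embeddings, and its implementation may differ in detail:

```python
import numpy as np

def vendi_score(embeddings: np.ndarray) -> float:
    """Vendi Score of a set of embedding vectors (one row per item).

    Builds a cosine-similarity kernel K (with K_ii = 1), takes the
    eigenvalues of K / n, and returns exp of their Shannon entropy.
    Ranges from 1 (all items identical) to n (all items orthogonal).
    """
    # Normalize rows so K is a cosine-similarity matrix.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    n = x.shape[0]
    k = x @ x.T
    eigvals = np.linalg.eigvalsh(k / n)
    eigvals = eigvals[eigvals > 1e-12]  # drop numerical zeros
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))

# Four orthogonal vectors -> maximal diversity 4; identical vectors -> 1.
print(round(vendi_score(np.eye(4)), 4))        # 4.0
print(round(vendi_score(np.ones((4, 3))), 4))  # 1.0
```

This is why the score has no fixed upper bound in the table: it grows with the effective number of distinct items in the dataset.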
Models
Name | Evaluation Dimension | Data Type | Description | Value Range | Official Repository or Paper |
---|---|---|---|---|---|
DebertaV3Scorer✨ | Content Accuracy & Effectiveness | Text | A quality classifier based on NVIDIA's DeBERTa V3 model for evaluating text quality. | {Low, Medium, High} | code |
FineWebEduScorer✨ | Educational Value | Text | A classifier for evaluating the educational value of text; higher scores indicate higher educational value. | [0, 5] | paper code |
InstagScorer✨ | Diversity & Complexity | Instruction | Evaluates instruction content diversity by returning the number of tags; more tags indicate higher content diversity. | - | paper code |
PerplexityScorer | Fluency & Understandability | Text | Calculates text perplexity using the KenLM model; lower scores indicate higher fluency and understandability. | - | paper code |
QuratingScorer✨ | Content Accuracy & Effectiveness, Educational Value | Text | Evaluates text quality using the Qurating model; higher scores indicate higher quality. | - | paper code |
PairQualScorer🚀 | Educational Value | Text | Evaluates the quality of text using the PairQual model, based on the BGE model. It supports both Chinese and English. It is trained by scoring pairwise comparisons of texts using GPT. A higher score indicates better quality. | - | code |
PresidioScorer✨ | Safety | Text | Uses the Microsoft Presidio model to identify private (PII) entities in text, such as credit card numbers, names, and locations. The scorer returns the number of PII entities found. | - | code |
SuperfilteringScorer✨ | Fluency & Understandability | Instruction | Evaluates instruction-following difficulty using the Superfiltering method; higher scores indicate instructions that are more difficult to follow. | - | paper code |
TextbookScorer✨ | Educational Value | Text | A textbook quality classifier based on FastText, used to evaluate the educational value of text. | [0, 2] | paper code |
DeitaQualityScorer✨ | Content Accuracy & Effectiveness | Instruction | An instruction quality scorer based on the Llama model; higher scores indicate higher instruction quality. | [1, 6] | paper code |
DeitaComplexityScorer✨ | Diversity & Complexity | Instruction | An instruction complexity scorer based on the Llama model; higher scores indicate higher instruction complexity. | [1,6] | paper code |
RMScorer✨ | Fluency & Understandability | Instruction | A scorer based on the reward-model-deberta-v3-large-v2 reward model trained on human preference data. Higher scores indicate higher quality. | - | code |
Statistics
Name | Evaluation Dimension | Data Type | Description | Value Range | Official Repository or Paper |
---|---|---|---|---|---|
LangkitScorer | Text Structure, Fluency & Understandability | Text | Calculates statistical information of text using the Langkit toolkit, such as word count, sentence count, syllable count, etc., to help evaluate the structural complexity and readability of the text. | - | code |
LexicalDiversityScorer✨ | Diversity & Complexity | Text | Calculates lexical diversity scores using MTLD and HD-D methods; higher scores represent richer vocabulary use, reflecting the diversity and complexity of the text. | - | paper code |
NgramScorer | Diversity & Complexity | Text | Calculates the proportion of distinct n-grams in the text to measure repetition; higher scores indicate lower n-gram repetition. | [0, 1] | - |
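The idea behind NgramScore can be sketched as the ratio of distinct n-grams to total n-grams; the operator's exact tokenization and choice of n may differ from this simplified version:

```python
def ngram_score(text: str, n: int = 3) -> float:
    """Proportion of distinct n-grams among all n-grams in the text.

    1.0 means no n-gram repeats at all; values near 0 mean the text is
    highly repetitive. (A sketch of the idea behind NgramScore; the
    operator's exact tokenization may differ.)
    """
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 1.0  # too short to contain any n-gram
    return len(set(ngrams)) / len(ngrams)

print(ngram_score("the cat sat on the mat"))  # 1.0 (no repeated trigram)
print(ngram_score("ha ha ha ha ha ha"))       # 0.25 (one trigram, 4 copies)
```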
Quality Evaluation System
To provide more precise data quality evaluation, we have constructed a quality evaluation system on top of the existing classifiers. Specifically, the score metrics output by the scorers are grouped into the following six dimensions.
1. Text Structure
- LangkitScorer: LangkitSentenceCountScore, LangkitCharacterCountScore, LangkitLetterCountScore, LangkitSyllableCountScore, LangkitPolysyllableCountScore, LangkitMonosyllableCountScore, LangkitLexiconCountScore, LangkitDifficultWordsScore
2. Diversity & Complexity
- LexicalDiversityScorer: LexicalDiversityMTLDScore, LexicalDiversityHD-DScore
- NgramScorer: NgramScore
- InstagScorer: InstagScore
- TreeinstructScorer: TreeinstructScore
- Task2VecScorer: Task2VecDiversityScore (ConfidenceInterval)
- VendiScorer: N-gramsVendiScore, BERTVendiScore, SimCSEVendiScore
- DeitaComplexityScorer: DeitaComplexityScore
3. Fluency & Understandability
- LangkitScorer: LangkitFleschReadingEaseScore, LangkitAutomatedReadabilityIndexScore, LangkitAggregateReadingLevelScore
- PerplexityScorer: PerplexityScore
- QuratingScorer: QuratingWritingStyleScore
- SuperfilteringScorer: SuperfilteringScore
- RMScorer: RMScore
4. Safety
- PerspectiveScorer: PerspectiveScore
- PresidioScorer: PresidioScore
5. Educational Value
- TextbookScorer: TextbookScore
- FineWebEduScorer: FineWebEduScore
- QuratingScorer: QuratingEducationalValueScore
- PairQualScorer: PairQualScore
6. Content Accuracy & Effectiveness
- QuratingScorer: QuratingRequiredExpertiseScore, QuratingFactsAndTriviaScore
- DebertaV3Scorer: DebertaV3Score
- AlpagasusScorer: AlpagasusScore
- DeitaQualityScorer: DeitaQualityScore
Benchmark Values
To provide better data quality references, we randomly sampled 5k examples, according to data type, from datasets currently considered high-quality (Fineweb and alpaca-cleaned) and measured benchmark values for some of the scorers.
Scorer Name | Score Metric Name | Description | Mean | Variance | Max | Min |
---|---|---|---|---|---|---|
PerspectiveScorer | PerspectiveScore | Evaluates the toxicity of the text, checking for potential insults or inappropriate language. The higher the score, the higher the toxicity | 0.0426 | 0.0025 | 0.2610 | 0.0026 |
LexicalDiversityScorer | LexicalDiversityMTLDScore | Measures the lexical diversity of the text using MTLD. The higher the score, the more varied the vocabulary usage. | 100.5990 | 1625.1318 | 1165.7164 | 14.8439 |
LexicalDiversityHD-DScore | Measures the lexical diversity of the text, calculated from the hypergeometric distribution. The higher the score, the higher the lexical diversity. | 0.8487 | 0.0014 | 0.9873 | 0.5570 | |
NgramScorer | NgramScore | Calculate the repetition ratio of n-grams in the text to measure the degree of repetition. The higher the score, the lower the n-gram repetition. | 0.9938 | 0.0002 | 1.0 | 0.8285 |
LangkitScorer | LangkitFleschReadingEaseScore | Measures Flesch text readability. The higher the score, the easier readability. | 55.1870 | 324.8975 | 106.37 | -144.75 |
LangkitAutomatedReadabilityIndexScore | Automated readability index based on sentence length and vocabulary difficulty. The higher the score, the more difficult the text is to read. | 11.7727 | 19.4117 | 98.2 | 0.9 | |
LangkitAggregateReadingLevelScore | Aggregate reading difficulty score of the text. The higher the score, the more difficult the text is to read. | 11.2332 | 13.6816 | 77.0 | 0.0 | |
LangkitSyllableCountScore | Counts the total number of syllables in the text. The higher the score, the more syllables there are. | 815.3852 | 2299853.7272 | 43237 | 32 | |
LangkitLexiconCountScore | Counts the total number of words in the text. The higher the score, the more words there are. | 524.178 | 1061058.5875 | 33033 | 23 | |
LangkitSentenceCountScore | Counts the total number of sentences in the text. The higher the score, the more sentences there are. | 28.9664 | 3618.2549 | 2193 | 1 | |
LangkitCharacterCountScore | Counts the total number of characters in the text. The higher the score, the more characters there are. | 2610.2462 | 23580442.8820 | 139807 | 118 | |
LangkitLetterCountScore | Counts the total number of letters in the text. The higher the score, the more letters there are. | 2513.4572 | 21890120.2030 | 134507 | 109 | |
LangkitPolysyllableCountScore | Counts the number of polysyllabic words in the text. The higher the score, the more polysyllabic words there are. | 78.8834 | 18918.1990 | 3261 | 0 | |
LangkitMonosyllableCountScore | Counts the number of monosyllabic words, which are usually related to the text's simplicity. The higher the score, the more monosyllabic words there are. | 334.6674 | 503285.5160 | 25133 | 13 | |
LangkitDifficultWordsScore | Counts the number of difficult words in the text. The higher the score, the more difficult words there are. | 93.4112 | 14401.2789 | 2366 | 4 | |
TextbookScorer | TextbookScore | Tests whether the text meets textbook standards. The higher the score, the closer the text is to an ideal textbook. | 0.9255 | 0.1779 | 1.9867 | 0.0001 |
FineWebEduScorer | FineWebEduScore | Measures the educational value of the text. The higher the score, the greater the educational value. | 1.1901 | 0.4924 | 4.6827 | -0.6319 |
DebertaV3Scorer | DebertaV3Score | Text evaluation using the DebertaV3 model. The output is categorical (High/Medium/Low), so the columns report category counts rather than statistics. | Medium: 3180 times | - | High: 1412 times | Low: 408 times |
PerplexityScorer | PerplexityScore | Measures the perplexity of the text. The higher the score, the greater the model's perplexity. | 564.3942 | 165893.5542 | 8271.0 | 13.9 |
QuratingScorer | QuratingWritingStyleScore | Evaluates the quality of the text's writing style. The higher the score, the better the writing style. | 0.6453 | 6.7949 | 8.375 | -7.3474 |
QuratingRequiredExpertiseScore | Measures the level of expertise required for the text. The higher the score, the more expertise is required. | -0.4661 | 7.0458 | 9.0 | -8.25 | |
QuratingFactsAndTriviaScore | Tests whether the text contains facts and trivia. The higher the score, the more facts and trivia the text contains. | 0.1889 | 4.5678 | 7.4688 | -6.0993 | |
QuratingEducationalValueScore | Measures the educational value of the text. The higher the score, the greater the educational value. | 1.2946 | 11.2196 | 11.5625 | -8.7843 | |
InstagScorer | InstagScore | Evaluates the content diversity by returning the number of tags. The higher the score, the greater the content diversity. | 2.304 | 2.9396 | 11 | 1 |
SuperfilteringScorer | SuperfilteringScore | Evaluates the instruction-following difficulty using the Superfiltering method. The higher the score, the more difficult it is to follow the instructions. | 1.3223 | 836.0302 | 1978.6534 | 0.0011 |
DeitaQualityScorer | DeitaQualityScore | Instruction quality evaluation based on the Llama model. The higher the score, the better the quality of the instructions. | 3.5629 | 0.9247 | 5.5309 | 1.0840 |
DeitaComplexityScorer | DeitaComplexityScore | Instruction complexity evaluation based on the Llama model. The higher the score, the greater the complexity of the instructions. | 1.4936 | 0.2086 | 3.3207 | 1.0001 |
VendiScorer | N-grams_VendiScore | Evaluates text diversity based on n-gram embeddings. The higher the score, the greater the dataset diversity. | 1832.96 | - | - | - |
BERT_VendiScore | Evaluates text diversity based on BERT embeddings. The higher the score, the greater the dataset diversity. | 1.83 | - | - | - | |
SimCSE_VendiScore | Evaluates text diversity based on SimCSE embeddings. The higher the score, the greater the dataset diversity. | 68.94 | - | - | - | |
Task2VecScorer | Task2VecScore | Evaluates dataset diversity using Task2Vec diversity coefficient. The higher the score, the greater the dataset diversity. | 0.0673 | - | - | - |
AlpagasusScorer | AlpagasusScore | Evaluates instruction quality using ChatGPT. The higher the score, the better the quality of the instructions. | 4.172 | 0.2164 | 5.0 | 2.0 |
TreeinstructScorer | TreeinstructScore | Uses ChatGPT to evaluate the semantic complexity of instructions. The higher the score, the greater the semantic complexity of the instruction. | 6.494 | 9.7540 | 63.0 | 0.0 |
PresidioScorer | PresidioScore | Uses Presidio to evaluate the number of PII (Personally Identifiable Information) instances. The higher the score, the more PII information is present in the text. | 21.4008 | 2915.3542 | 1786.0 | 0.0 |
RMScorer | RMScore | Uses a reward model based on human values to evaluate the quality of SFT (Supervised Fine-Tuning) data. The higher the score, the better the data quality. | 3.1537 | 9.9461 | 8.6803 | -4.9680 |
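One common use of these benchmark values is as filtering thresholds: the mean, max, and min of a scorer on known-good data suggest reasonable cutoffs for new data. The sketch below keeps samples whose scores fall inside bounds derived from the table above. The record layout and score-field names are hypothetical, not the framework's actual output schema:

```python
# Hypothetical scored records; field names are illustrative only.
samples = [
    {"text": "sample a", "PerspectiveScore": 0.02, "FineWebEduScore": 2.1},
    {"text": "sample b", "PerspectiveScore": 0.31, "FineWebEduScore": 1.4},
    {"text": "sample c", "PerspectiveScore": 0.01, "FineWebEduScore": -0.2},
]

# Thresholds informed by the benchmark table: PerspectiveScore had
# mean ~0.04 and max ~0.26; FineWebEduScore had mean ~1.19.
MAX_TOXICITY = 0.26   # drop anything above the observed benchmark max
MIN_EDU_VALUE = 1.19  # keep only above-average educational value

kept = [
    s for s in samples
    if s["PerspectiveScore"] <= MAX_TOXICITY
    and s["FineWebEduScore"] >= MIN_EDU_VALUE
]
print(len(kept))  # 1 (only "sample a" passes both thresholds)
```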
Generated text evaluation
Dataflow integrates three methods for evaluating the quality of generated text by measuring its similarity to a reference text.
Scorer Name | Description | Value Range | Interpretation |
---|---|---|---|
BLEU Scorer | Calculates precision based on n-gram matching by comparing n-grams in generated and reference texts | [0, 1] | Higher values indicate greater match between generated and reference texts |
CIDEr Scorer | Uses TF-IDF weighted n-gram statistics to compare similarity between generated and reference descriptions | [0, 1] | Higher values indicate stronger content consistency between generated and reference texts |
BertScore | Computes similarity of word embeddings between generated and reference texts using BERT | [0, 1] | Higher values indicate stronger semantic similarity between generated and reference texts |
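For reference, BLEU's core computation combines modified n-gram precisions with a brevity penalty. The following is a simplified single-reference sketch without smoothing, not the exact implementation used by the scorer:

```python
import math
from collections import Counter

def sentence_bleu(reference: str, candidate: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU: geometric mean of modified
    n-gram precisions (n = 1..max_n) times a brevity penalty.
    Returns 0.0 when any precision is zero (no smoothing applied)."""
    ref, cand = reference.split(), candidate.split()
    precisions = []
    for n in range(1, max_n + 1):
        ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        cand_counts = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if clipped == 0:
            return 0.0
        precisions.append(clipped / total)
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(sentence_bleu("the cat is on the mat", "the cat is on the mat"))  # 1.0
```

Production implementations (e.g. corpus-level BLEU with smoothing and multiple references) differ in detail, but the precision-times-brevity-penalty structure is the same.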