General Data Evaluation Operators
2025-06-09
Text quality evaluation
Scorers are divided into the following four types; each scorer provides one or more scores.
Type | Count | Description |
---|---|---|
APIcaller | 3 | Call API for scoring |
Diversity | 2 | Compute diversity score of the entire dataset |
Models | 12 | Model or classifier-based scoring |
Statistics | 3 | Statistical metric scoring |
Regarding data types: [Text] indicates that the scorer accepts single-field string input, suitable for pre-training or fine-tuning data. [Instruction] indicates that the scorer is only suitable for fine-tuning data with multi-field input.
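As an illustration, the two data types can be thought of as follows. The field names below are hypothetical examples for clarity, not a schema mandated by the framework:

```python
# Hypothetical examples of the two input formats (field names are
# illustrative only, not the framework's required schema).

# [Text]: a single free-text field, e.g. a pre-training corpus record.
text_sample = {
    "text": "The mitochondrion is the powerhouse of the cell.",
}

# [Instruction]: multi-field fine-tuning data, e.g. an Alpaca-style record.
instruction_sample = {
    "instruction": "Summarize the following passage in one sentence.",
    "input": "The mitochondrion is the powerhouse of the cell.",
    "output": "Mitochondria generate most of the cell's energy.",
}
```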
Open-source operators of this kind are quite limited. To achieve better data processing quality and to fill the gaps in data evaluation methods missing from open source, we have carefully designed and developed a new set of operators. The labels have the following meanings:
🚀 Independent Innovation: The core algorithms are original, filling gaps in existing methods or further improving performance to break through current bottlenecks.
✨ Open-Source First Release: This operator is integrated into a mainstream community framework for the first time, making it easier for more developers to use and enabling open sharing.
List of Scorers
APIcaller
Name | Evaluation Dimension | Data Type | Description | Value Range | Official Repository or Paper |
---|---|---|---|---|---|
AlpagasusScorer✨ | Content Accuracy & Effectiveness | Instruction | Evaluates the quality of instructions by calling GPT, returning a quality score. A higher score indicates higher instruction quality. | [0, 5] | paper |
PerspectiveScorer✨ | Safety | Text | Uses PerspectiveAPI to evaluate the toxicity of the text, returning a toxicity probability. A higher score indicates higher text toxicity. | [0, 1] | API |
TreeinstructScorer✨ | Diversity & Complexity | Instruction | Measures instruction complexity by generating the number of nodes in the syntax tree; more nodes indicate more complex instructions. | - | paper |
Diversity
Name | Evaluation Dimension | Data Type | Description | Value Range | Official Repository or Paper |
---|---|---|---|---|---|
Task2VecScorer✨ | Diversity & Complexity | Text | Evaluates the diversity of the dataset using the Task2Vec method. Higher scores indicate higher dataset diversity. | [0.0525±3.41E-4, 0.4037±1.932E-5] | paper code |
VendiScorer | Diversity & Complexity | Text | Evaluates dataset diversity by calculating VendiScore; higher scores indicate higher diversity. | - | paper code |
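For intuition, the Vendi Score is the exponential of the Shannon entropy of the eigenvalues of a normalized similarity kernel over the dataset. Below is a minimal sketch assuming precomputed embedding vectors; the actual VendiScorer computes this over n-gram, BERT, and SimCSE embeddings, and its implementation may differ in detail:

```python
import numpy as np

def vendi_score(embeddings: np.ndarray) -> float:
    """Vendi Score of a set of embedding vectors (one row per item).

    Builds a cosine-similarity kernel K (with K_ii = 1), takes the
    eigenvalues of K / n, and returns exp of their Shannon entropy.
    Ranges from 1 (all items identical) to n (all items orthogonal).
    """
    # Normalize rows so K is a cosine-similarity matrix.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    n = x.shape[0]
    k = x @ x.T
    eigvals = np.linalg.eigvalsh(k / n)
    eigvals = eigvals[eigvals > 1e-12]  # drop numerical zeros
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))

# Four orthogonal vectors -> maximal diversity 4; identical vectors -> 1.
print(round(vendi_score(np.eye(4)), 4))        # 4.0
print(round(vendi_score(np.ones((4, 3))), 4))  # 1.0
```

This is why the score has no fixed upper bound in the table: it grows with the effective number of distinct items in the dataset.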
Models
Name | Evaluation Dimension | Data Type | Description | Value Range | Official Repository or Paper |
---|---|---|---|---|---|
DebertaV3Scorer✨ | Content Accuracy & Effectiveness | Text | A quality classifier based on NVIDIA's DeBERTa V3 model for evaluating text quality. | {Low, Medium, High} | code |
FineWebEduScorer✨ | Educational Value | Text | A classifier for evaluating the educational value of text; higher scores indicate higher educational value. | [0, 5] | paper code |
InstagScorer✨ | Diversity & Complexity | Instruction | Evaluates instruction content diversity by returning the number of tags; more tags indicate higher content diversity. | - | paper code |
PerplexityScorer | Fluency & Understandability | Text | Calculates text perplexity using the KenLM model; lower scores indicate higher fluency and understandability. | - | paper code |
QuratingScorer✨ | Content Accuracy & Effectiveness, Educational Value | Text | Evaluates text quality using the Qurating model; higher scores indicate higher quality. | - | paper code |
PairQualScorer🚀 | Educational Value | Text | Evaluates the quality of text using the PairQual model, based on the BGE model. It supports both Chinese and English. It is trained by scoring pairwise comparisons of texts using GPT. A higher score indicates better quality. | - | code |
PresidioScorer✨ | Safety | Text | Uses the Microsoft Presidio model to identify private (PII) entities in text, such as credit card numbers, names, and locations. The scorer returns the number of PII entities found. | - | code |
SuperfilteringScorer✨ | Fluency & Understandability | Instruction | Evaluates instruction-following difficulty using the Superfiltering method; higher scores indicate instructions that are more difficult to follow. | - | paper code |
TextbookScorer✨ | Educational Value | Text | A textbook quality classifier based on FastText, used to evaluate the educational value of text. | [0, 2] | paper code |
DeitaQualityScorer✨ | Content Accuracy & Effectiveness | Instruction | An instruction quality scorer based on the Llama model; higher scores indicate higher instruction quality. | [1, 6] | paper code |
DeitaComplexityScorer✨ | Diversity & Complexity | Instruction | An instruction complexity scorer based on the Llama model; higher scores indicate higher instruction complexity. | [1,6] | paper code |
RMScorer✨ | Fluency & Understandability | Instruction | A scorer based on the reward-model-deberta-v3-large-v2 reward model trained on human preference data. Higher scores indicate higher quality. | - | code |
Statistics
Name | Evaluation Dimension | Data Type | Description | Value Range | Official Repository or Paper |
---|---|---|---|---|---|
LangkitScorer | Text Structure, Fluency & Understandability | Text | Calculates statistical information of text using the Langkit toolkit, such as word count, sentence count, syllable count, etc., to help evaluate the structural complexity and readability of the text. | - | code |
LexicalDiversityScorer✨ | Diversity & Complexity | Text | Calculates lexical diversity scores using MTLD and HD-D methods; higher scores represent richer vocabulary use, reflecting the diversity and complexity of the text. | - | paper code |
NgramScorer | Diversity & Complexity | Text | Calculates the proportion of distinct n-grams in the text to measure repetition; higher scores indicate lower n-gram repetition. | [0, 1] | - |
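The idea behind NgramScore can be sketched as the ratio of distinct n-grams to total n-grams; the operator's exact tokenization and choice of n may differ from this simplified version:

```python
def ngram_score(text: str, n: int = 3) -> float:
    """Proportion of distinct n-grams among all n-grams in the text.

    1.0 means no n-gram repeats at all; values near 0 mean the text is
    highly repetitive. (A sketch of the idea behind NgramScore; the
    operator's exact tokenization may differ.)
    """
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 1.0  # too short to contain any n-gram
    return len(set(ngrams)) / len(ngrams)

print(ngram_score("the cat sat on the mat"))  # 1.0 (no repeated trigram)
print(ngram_score("ha ha ha ha ha ha"))       # 0.25 (one trigram, 4 copies)
```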
Quality Evaluation System
To provide more precise data quality evaluation, we have constructed a quality evaluation system on top of the existing classifiers. Specifically, the score metrics output by the scorers are grouped into the following six dimensions.
1. Text Structure
- LangkitScorer: LangkitSentenceCountScore, LangkitCharacterCountScore, LangkitLetterCountScore, LangkitSyllableCountScore, LangkitPolysyllableCountScore, LangkitMonosyllableCountScore, LangkitLexiconCountScore, LangkitDifficultWordsScore
2. Diversity & Complexity
- LexicalDiversityScorer: LexicalDiversityMTLDScore, LexicalDiversityHD-DScore
- NgramScorer: NgramScore
- InstagScorer: InstagScore
- TreeinstructScorer: TreeinstructScore
- Task2VecScorer: Task2VecDiversityScore (ConfidenceInterval)
- VendiScorer: N-gramsVendiScore, BERTVendiScore, SimCSEVendiScore
- DeitaComplexityScorer: DeitaComplexityScore
3. Fluency & Understandability
- LangkitScorer: LangkitFleschReadingEaseScore, LangkitAutomatedReadabilityIndexScore, LangkitAggregateReadingLevelScore
- PerplexityScorer: PerplexityScore
- QuratingScorer: QuratingWritingStyleScore
- SuperfilteringScorer: SuperfilteringScore
- RMScorer: RMScore
4. Safety
- PerspectiveScorer: PerspectiveScore
- PresidioScorer: PresidioScore
5. Educational Value
- TextbookScorer: TextbookScore
- FineWebEduScorer: FineWebEduScore
- QuratingScorer: QuratingEducationalValueScore
- PairQualScorer: PairQualScore
6. Content Accuracy & Effectiveness
- QuratingScorer: QuratingRequiredExpertiseScore, QuratingFactsAndTriviaScore
- DebertaV3Scorer: DebertaV3Score
- AlpagasusScorer: AlpagasusScore
- DeitaQualityScorer: DeitaQualityScore
Benchmark Values
To provide better data quality references, we randomly sampled 5k examples, according to data type, from datasets currently considered high-quality (Fineweb and alpaca-cleaned) and measured benchmark values for some of the scorers.
Scorer Name | Score Metric Name | Description | Mean | Variance | Max | Min |
---|---|---|---|---|---|---|
PerspectiveScorer | PerspectiveScore | Evaluates the toxicity of the text, checking for potential insults or inappropriate language. The higher the score, the higher the toxicity | 0.0426 | 0.0025 | 0.2610 | 0.0026 |
LexicalDiversityScorer | LexicalDiversityMTLDScore | Measures the lexical diversity of the text using MTLD. The higher the score, the more varied the vocabulary usage. | 100.5990 | 1625.1318 | 1165.7164 | 14.8439 |
LexicalDiversityHD-DScore | Measures the lexical diversity of the text, calculated from the hypergeometric distribution. The higher the score, the higher the lexical diversity. | 0.8487 | 0.0014 | 0.9873 | 0.5570 | |
NgramScorer | NgramScore | Calculate the repetition ratio of n-grams in the text to measure the degree of repetition. The higher the score, the lower the n-gram repetition. | 0.9938 | 0.0002 | 1.0 | 0.8285 |
LangkitScorer | LangkitFleschReadingEaseScore | Measures Flesch text readability. The higher the score, the easier readability. | 55.1870 | 324.8975 | 106.37 | -144.75 |
LangkitAutomatedReadabilityIndexScore | Automated readability index based on sentence length and vocabulary difficulty. The higher the score, the more difficult the text is to read. | 11.7727 | 19.4117 | 98.2 | 0.9 | |
LangkitAggregateReadingLevelScore | Aggregate reading difficulty score of the text. The higher the score, the more difficult the text is to read. | 11.2332 | 13.6816 | 77.0 | 0.0 | |
LangkitSyllableCountScore | Counts the total number of syllables in the text. The higher the score, the more syllables there are. | 815.3852 | 2299853.7272 | 43237 | 32 | |
LangkitLexiconCountScore | Counts the total number of words in the text. The higher the score, the more words there are. | 524.178 | 1061058.5875 | 33033 | 23 | |
LangkitSentenceCountScore | Counts the total number of sentences in the text. The higher the score, the more sentences there are. | 28.9664 | 3618.2549 | 2193 | 1 | |
LangkitCharacterCountScore | Counts the total number of characters in the text. The higher the score, the more characters there are. | 2610.2462 | 23580442.8820 | 139807 | 118 | |
LangkitLetterCountScore | Counts the total number of letters in the text. The higher the score, the more letters there are. | 2513.4572 | 21890120.2030 | 134507 | 109 | |
LangkitPolysyllableCountScore | Counts the number of polysyllabic words in the text. The higher the score, the more polysyllabic words there are. | 78.8834 | 18918.1990 | 3261 | 0 | |
LangkitMonosyllableCountScore | Counts the number of monosyllabic words, which are usually related to the text's simplicity. The higher the score, the more monosyllabic words there are. | 334.6674 | 503285.5160 | 25133 | 13 | |
LangkitDifficultWordsScore | Counts the number of difficult words in the text. The higher the score, the more difficult words there are. | 93.4112 | 14401.2789 | 2366 | 4 | |
TextbookScorer | TextbookScore | Tests whether the text meets textbook standards. The higher the score, the closer the text is to an ideal textbook. | 0.9255 | 0.1779 | 1.9867 | 0.0001 |
FineWebEduScorer | FineWebEduScore | Measures the educational value of the text. The higher the score, the greater the educational value. | 1.1901 | 0.4924 | 4.6827 | -0.6319 |
DebertaV3Scorer | DebertaV3Score | Text evaluation using the DebertaV3 model. The output is categorical (High/Medium/Low), so the columns report category counts rather than statistics. | Medium: 3180 times | - | High: 1412 times | Low: 408 times |
PerplexityScorer | PerplexityScore | Measures the perplexity of the text. The higher the score, the greater the model's perplexity. | 564.3942 | 165893.5542 | 8271.0 | 13.9 |
QuratingScorer | QuratingWritingStyleScore | Evaluates the quality of the text's writing style. The higher the score, the better the writing style. | 0.6453 | 6.7949 | 8.375 | -7.3474 |
QuratingRequiredExpertiseScore | Measures the level of expertise required for the text. The higher the score, the more expertise is required. | -0.4661 | 7.0458 | 9.0 | -8.25 | |
QuratingFactsAndTriviaScore | Tests whether the text contains facts and trivia. The higher the score, the more facts and trivia the text contains. | 0.1889 | 4.5678 | 7.4688 | -6.0993 | |
QuratingEducationalValueScore | Measures the educational value of the text. The higher the score, the greater the educational value. | 1.2946 | 11.2196 | 11.5625 | -8.7843 | |
InstagScorer | InstagScore | Evaluates the content diversity by returning the number of tags. The higher the score, the greater the content diversity. | 2.304 | 2.9396 | 11 | 1 |
SuperfilteringScorer | SuperfilteringScore | Evaluates the instruction-following difficulty using the Superfiltering method. The higher the score, the more difficult it is to follow the instructions. | 1.3223 | 836.0302 | 1978.6534 | 0.0011 |
DeitaQualityScorer | DeitaQualityScore | Instruction quality evaluation based on the Llama model. The higher the score, the better the quality of the instructions. | 3.5629 | 0.9247 | 5.5309 | 1.0840 |
DeitaComplexityScorer | DeitaComplexityScore | Instruction complexity evaluation based on the Llama model. The higher the score, the greater the complexity of the instructions. | 1.4936 | 0.2086 | 3.3207 | 1.0001 |
VendiScorer | N-grams_VendiScore | Evaluates text diversity based on n-gram embeddings. The higher the score, the greater the dataset diversity. | 1832.96 | - | - | - |
BERT_VendiScore | Evaluates text diversity based on BERT embeddings. The higher the score, the greater the dataset diversity. | 1.83 | - | - | - | |
SimCSE_VendiScore | Evaluates text diversity based on SimCSE embeddings. The higher the score, the greater the dataset diversity. | 68.94 | - | - | - | |
Task2VecScorer | Task2VecScore | Evaluates dataset diversity using Task2Vec diversity coefficient. The higher the score, the greater the dataset diversity. | 0.0673 | - | - | - |
AlpagasusScorer | AlpagasusScore | Evaluates instruction quality using ChatGPT. The higher the score, the better the quality of the instructions. | 4.172 | 0.2164 | 5.0 | 2.0 |
TreeinstructScorer | TreeinstructScore | Uses ChatGPT to evaluate the semantic complexity of instructions. The higher the score, the greater the semantic complexity of the instruction. | 6.494 | 9.7540 | 63.0 | 0.0 |
PresidioScorer | PresidioScore | Uses Presidio to evaluate the number of PII (Personally Identifiable Information) instances. The higher the score, the more PII information is present in the text. | 21.4008 | 2915.3542 | 1786.0 | 0.0 |
RMScorer | RMScore | Uses a reward model based on human values to evaluate the quality of SFT (Supervised Fine-Tuning) data. The higher the score, the better the data quality. | 3.1537 | 9.9461 | 8.6803 | -4.9680 |
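One common use of these benchmark values is as filtering thresholds: the mean, max, and min of a scorer on known-good data suggest reasonable cutoffs for new data. The sketch below keeps samples whose scores fall inside bounds derived from the table above. The record layout and score-field names are hypothetical, not the framework's actual output schema:

```python
# Hypothetical scored records; field names are illustrative only.
samples = [
    {"text": "sample a", "PerspectiveScore": 0.02, "FineWebEduScore": 2.1},
    {"text": "sample b", "PerspectiveScore": 0.31, "FineWebEduScore": 1.4},
    {"text": "sample c", "PerspectiveScore": 0.01, "FineWebEduScore": -0.2},
]

# Thresholds informed by the benchmark table: PerspectiveScore had
# mean ~0.04 and max ~0.26; FineWebEduScore had mean ~1.19.
MAX_TOXICITY = 0.26   # drop anything above the observed benchmark max
MIN_EDU_VALUE = 1.19  # keep only above-average educational value

kept = [
    s for s in samples
    if s["PerspectiveScore"] <= MAX_TOXICITY
    and s["FineWebEduScore"] >= MIN_EDU_VALUE
]
print(len(kept))  # 1 (only "sample a" passes both thresholds)
```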
Generated text evaluation
Dataflow integrates three methods for evaluating the quality of generated text by measuring its similarity to a reference text.
Scorer Name | Description | Value Range | Interpretation |
---|---|---|---|
BLEU Scorer | Calculates precision based on n-gram matching by comparing n-grams in generated and reference texts | [0, 1] | Higher values indicate greater match between generated and reference texts |
CIDEr Scorer | Uses TF-IDF weighted n-gram statistics to compare similarity between generated and reference descriptions | [0, 1] | Higher values indicate stronger content consistency between generated and reference texts |
BertScore | Computes similarity of word embeddings between generated and reference texts using BERT | [0, 1] | Higher values indicate stronger semantic similarity between generated and reference texts |
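For reference, BLEU's core computation combines modified n-gram precisions with a brevity penalty. The following is a simplified single-reference sketch without smoothing, not the exact implementation used by the scorer:

```python
import math
from collections import Counter

def sentence_bleu(reference: str, candidate: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU: geometric mean of modified
    n-gram precisions (n = 1..max_n) times a brevity penalty.
    Returns 0.0 when any precision is zero (no smoothing applied)."""
    ref, cand = reference.split(), candidate.split()
    precisions = []
    for n in range(1, max_n + 1):
        ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        cand_counts = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if clipped == 0:
            return 0.0
        precisions.append(clipped / total)
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(sentence_bleu("the cat is on the mat", "the cat is on the mat"))  # 1.0
```

Production implementations (e.g. corpus-level BLEU with smoothing and multiple references) differ in detail, but the precision-times-brevity-penalty structure is the same.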