PerspectiveFilter


2025-10-09

📘 Overview

PerspectiveFilter is a data filtering operator that uses the Google Perspective API to score text toxicity and keeps only the rows whose score falls within a configured range. Higher scores indicate more toxic text.
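To make the scoring step concrete, here is a minimal sketch of what a Perspective API toxicity request and response look like, based on the API's public documentation. Whether PerspectiveFilter builds its requests exactly this way internally is an assumption; `build_request`, `extract_toxicity`, and `sample_response` are hypothetical names used only for illustration, and no network call is made.

```python
import json

# Publicly documented Perspective API endpoint. That PerspectiveFilter uses
# exactly this request shape internally is an assumption.
ANALYZE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"


def build_request(text: str) -> dict:
    """Build the JSON body asking for the TOXICITY attribute of one text."""
    return {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }


def extract_toxicity(response: dict) -> float:
    """Pull the summary toxicity score out of an analyze response."""
    return response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]


# Hand-written response in the API's documented shape (not a real API result).
sample_response = {
    "attributeScores": {
        "TOXICITY": {"summaryScore": {"value": 0.952, "type": "PROBABILITY"}}
    }
}

print(json.dumps(build_request("example text")))
print(extract_toxicity(sample_response))
```

The summary score is the single value per text that the operator would compare against its thresholds.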

__init__ Function

def __init__(self, min_score: float = 0.0, max_score: float = 0.5):

Init Parameters

Parameter	Type	Default	Description
min_score	float	0.0	Minimum toxicity score threshold. Retains text with scores greater than or equal to this value.
max_score	float	0.5	Maximum toxicity score threshold. Retains text with scores less than or equal to this value.
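The two thresholds together define an inclusive range. A minimal sketch of that retention rule follows; `keep_row` is a hypothetical helper, not part of DataFlow, and treating a NaN score as "retain" is an assumption drawn from the note on this page that samples with NaN values are kept.

```python
import math


def keep_row(score: float, min_score: float = 0.0, max_score: float = 0.5) -> bool:
    """Return True when a row should be retained by the range check."""
    # Assumption: a NaN score means "could not be scored", and such rows are kept.
    if math.isnan(score):
        return True
    # Both bounds are inclusive, matching the parameter descriptions above.
    return min_score <= score <= max_score


for s in (0.012, 0.5, 0.952, float("nan")):
    print(s, keep_row(s))
```

With the defaults, a score of exactly 0.5 is still retained because the upper bound is inclusive.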

run Function

def run(self, storage: DataFlowStorage, input_key: str, output_key: str = 'PerspectiveScore'):

Parameters

Name	Type	Default	Description
storage	DataFlowStorage	Required	DataFlow storage instance for reading and writing data.
input_key	str	Required	Input column name corresponding to the text field for toxicity evaluation.
output_key	str	"PerspectiveScore"	Output column name corresponding to the generated toxicity score field.

📦 API Key Configuration

Before using PerspectiveFilter, configure a Google Perspective API key in one of two ways:

Method 1: Set Environment Variable

export GOOGLE_API_KEY="your-google-api-key"

Or set it in Python:

import os
os.environ["GOOGLE_API_KEY"] = "your-google-api-key"
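When relying on the environment variable, it can help to fail early if the key is missing rather than partway through a run. The following is a small sketch of such a check; `require_api_key` is a hypothetical helper and is not provided by DataFlow.

```python
import os


def require_api_key(var_name: str = "GOOGLE_API_KEY") -> str:
    """Return the API key from the environment, or raise if it is unset or empty."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set; export it before running the filter")
    return key
```

Calling `require_api_key()` once at the start of a pipeline script surfaces a configuration problem before any data is read.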

Method 2: Configure via PerspectiveAPIServing

Pass the API Key directly when initializing PerspectiveAPIServing:

from dataflow.serving import PerspectiveAPIServing

serving = PerspectiveAPIServing(api_key="your-google-api-key", max_workers=10)

Obtaining API Key

To obtain a Google Perspective API Key, visit: Google Perspective API

🧠 Example Usage

from dataflow.operators.general_text import PerspectiveFilter
from dataflow.utils.storage import FileStorage

class PerspectiveFilterTest:
    def __init__(self):
        # Read the example JSONL input; intermediate files go under ./cache.
        self.storage = FileStorage(
            first_entry_file_name="./dataflow/example/GeneralTextPipeline/perspective_test_input.jsonl",
            cache_path="./cache",
            file_name_prefix="dataflow_cache_step",
            cache_type="jsonl",
        )

        # Keep rows whose toxicity score lies in [0.0, 0.5].
        self.filter = PerspectiveFilter(
            min_score=0.0,
            max_score=0.5,
        )

    def forward(self):
        # Score the 'text' column and write the score to 'PerspectiveScore'.
        self.filter.run(
            storage=self.storage.step(),
            input_key='text',
            output_key='PerspectiveScore',
        )

if __name__ == "__main__":
    test = PerspectiveFilterTest()
    test.forward()

🧾 Default Output Format

Field	Type	Description
[input_key]	str	Input text for evaluation.
[output_key]	float	Toxicity score returned by the Perspective API, between 0.0 and 1.0.

📋 Example Input

{"text": "Thank you for your help! I really appreciate your kindness and support."}
{"text": "I disagree with your opinion, but I respect your perspective on this matter."}
{"text": "You are an idiot and your ideas are completely stupid and worthless."}

📤 Example Output

{"text": "Thank you for your help! I really appreciate your kindness and support.", "PerspectiveScore": 0.012}
{"text": "I disagree with your opinion, but I respect your perspective on this matter.", "PerspectiveScore": 0.089}

📊 Result Analysis

Sample 1 (Friendly text):

Perspective toxicity score: 0.012
Score range: [0.0, 0.5]
Passes filter (0.012 within range)
Characteristics: Expresses gratitude in positive terms, almost no toxicity

Sample 2 (Neutral text):

Perspective toxicity score: 0.089
Score range: [0.0, 0.5]
Passes filter (0.089 within range)
Characteristics: Expresses disagreement but remains respectful, low toxicity

Sample 3 (Offensive text):

Perspective toxicity score: 0.952
Score range: [0.0, 0.5]
Filtered out (0.952 > 0.5)
Characteristics: Contains insulting words and offensive language, high toxicity

Score Interpretation:

0.0 - 0.3: Low or no toxicity (polite, friendly)
0.3 - 0.7: Moderate toxicity (possibly controversial content)
0.7 - 1.0: High toxicity (insults, attacks, hate speech)
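The bands above can be expressed as a small lookup. This is a sketch only: `toxicity_band` is a hypothetical helper, and because the listed ranges share their endpoints, assigning 0.3 to the moderate band and 0.7 to the high band is an assumption.

```python
def toxicity_band(score: float) -> str:
    """Map a toxicity score in [0.0, 1.0] to a coarse band label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    # Assumption: the shared endpoints 0.3 and 0.7 belong to the higher band.
    if score < 0.3:
        return "low"
    if score < 0.7:
        return "moderate"
    return "high"


for s in (0.012, 0.089, 0.952):
    print(s, toxicity_band(s))
```

The three example samples on this page fall into the low, low, and high bands respectively.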

Use Cases:

Content moderation systems
Social media comment filtering
User-generated content quality control
Building healthy online communities

Notes:

Requires a configured Google Perspective API key (see the 📦 API Key Configuration section above)
API calls are rate-limited; choose a concurrency level (for example, max_workers on PerspectiveAPIServing) that stays within your quota
Supports multiple languages, but works best for English
Samples with NaN values are automatically retained
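As an illustration of the concurrency note, the sketch below caps the number of in-flight scoring calls with a thread pool. It does not call the Perspective API: `fake_score` is a stand-in for a network request, and `score_all` is a hypothetical helper, not how DataFlow schedules requests.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_score(text: str) -> float:
    """Stand-in for a network call; NOT the Perspective API."""
    return 0.9 if "idiot" in text.lower() else 0.05


def score_all(texts, max_workers: int = 4):
    """Score texts with at most max_workers calls in flight; input order is preserved."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_score, texts))


texts = ["Thank you for your help!", "You are an idiot."]
print(score_all(texts, max_workers=2))
```

If the real service starts rejecting requests for exceeding quota, lowering the worker count is usually the first adjustment to try.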
