PresidioFilter
2025-10-09
📘 Overview
PresidioFilter is a data filtering operator based on a PII (Personally Identifiable Information) score. It uses Microsoft Presidio to identify and count private entities (such as names, email addresses, and phone numbers) in text, and retains only the samples whose count falls within the configured threshold range. This operator is mainly used for data privacy protection and compliance checking.
__init__ Function
def __init__(self, min_score: int = 0, max_score: int = 5, lang='en', device='cuda', model_cache_dir='./dataflow_cache'):
Init Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| min_score | int | 0 | Minimum PII count threshold for retaining samples. |
| max_score | int | 5 | Maximum PII count threshold for retaining samples. |
| lang | str | 'en' | Text language. |
| device | str | 'cuda' | Device for model execution. |
| model_cache_dir | str | './dataflow_cache' | Model cache directory. |
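The way `min_score` and `max_score` decide which samples are retained can be sketched in a few lines. This is a minimal illustration, not the operator's actual code: `within_range` is a hypothetical helper, and treating both bounds as inclusive is an assumption based on the `[0, 5]` range notation used on this page.

```python
def within_range(score: int, min_score: int = 0, max_score: int = 5) -> bool:
    """Return True when score lies in [min_score, max_score].

    Assumption: both bounds are inclusive, as the [0, 5] notation suggests.
    """
    return min_score <= score <= max_score


# Hypothetical PII counts for five samples.
scores = [0, 2, 4, 5, 6]

# With the default thresholds, only the sample with 6 entities is dropped.
kept = [s for s in scores if within_range(s)]
print(kept)
```

Under this assumption, setting `min_score=0, max_score=0` would retain only samples in which no PII was detected.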
run Function
def run(self, storage: DataFlowStorage, input_key: str, output_key: str = 'PresidioScore'):
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| storage | DataFlowStorage | Required | DataFlow storage instance for reading and writing data. |
| input_key | str | Required | Input column name corresponding to the text field for PII detection. |
| output_key | str | 'PresidioScore' | Output column name corresponding to the PII score field. |
🧠 Example Usage
from dataflow.operators.general_text import PresidioFilter
from dataflow.utils.storage import FileStorage

class PresidioFilterTest():
    def __init__(self):
        self.storage = FileStorage(
            first_entry_file_name="./dataflow/example/GeneralTextPipeline/presidio_test_input.jsonl",
            cache_path="./cache",
            file_name_prefix="dataflow_cache_step",
            cache_type="jsonl",
        )
        self.filter = PresidioFilter(
            min_score=0,
            max_score=5,
            lang='en',
            device='cuda',
            model_cache_dir='./dataflow_cache'
        )

    def forward(self):
        self.filter.run(
            storage=self.storage.step(),
            input_key='text',
            output_key='PresidioScore'
        )

if __name__ == "__main__":
    test = PresidioFilterTest()
    test.forward()
🧾 Default Output Format
| Field | Type | Description |
|---|---|---|
| PresidioScore | int | Number of PII entities the model detects in the text. |
📋 Example Input
{"text": "The weather is nice today. Let's go for a walk in the park."}
{"text": "My name is John Smith and I live in New York."}
{"text": "Please contact me at john.doe@example.com or call me at +1-555-123-4567. My credit card number is 4532-1234-5678-9010."}
📤 Example Output
{"text": "The weather is nice today. Let's go for a walk in the park.", "PresidioScore": 0}
{"text": "My name is John Smith and I live in New York.", "PresidioScore": 2}
{"text": "Please contact me at john.doe@example.com or call me at +1-555-123-4567. My credit card number is 4532-1234-5678-9010.", "PresidioScore": 4}
📊 Result Analysis
Sample 1 (Normal text):
- Detected PII count: 0
- Score range: [0, 5]
- Passes filter (0 within range)
- Characteristics: No personally identifiable information
Sample 2 (Contains name and location):
- Detected PII count: 2
- PERSON: "John Smith"
- LOCATION: "New York"
- Score range: [0, 5]
- Passes filter (2 within range)
Sample 3 (Sensitive information text):
- Detected PII count: 4
- EMAIL_ADDRESS: "john.doe@example.com"
- PHONE_NUMBER: "+1-555-123-4567"
- CREDIT_CARD: "4532-1234-5678-9010"
- Possibly other entities (the count of 4 exceeds the three listed above)
- Score range: [0, 5]
- Passes filter (4 within range)
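Since the score is a count of detected entities, the Sample 2 result can be reproduced by hand. The sketch below does not call Presidio. The `detections` list is written out manually to mirror the analysis above, and its dictionary layout is a hypothetical format chosen for illustration.

```python
from collections import Counter

# Hand-written detections for Sample 2 (hypothetical record layout,
# not the data structure the operator uses internally).
detections = [
    {"entity_type": "PERSON", "text": "John Smith"},
    {"entity_type": "LOCATION", "text": "New York"},
]

# The score is simply the number of detected entities.
presidio_score = len(detections)

# A per-type breakdown, useful when reviewing why a sample scored as it did.
by_type = Counter(d["entity_type"] for d in detections)

print(presidio_score)
print(dict(by_type))
```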
Supported PII Types:
- Name (PERSON)
- Location (LOCATION)
- Email address (EMAIL_ADDRESS)
- Phone number (PHONE_NUMBER)
- Credit card number (CREDIT_CARD)
- ID numbers, etc.
Use Cases:
- Data privacy protection
- Compliance checking (GDPR, CCPA)
- Sensitive information detection
- Pre-anonymization assessment
Notes:
- Uses the dslim/bert-base-NER model
- Supports multiple languages via the lang parameter
- min_score and max_score define the PII count range for retaining samples
- A smaller max_score can be set to filter out high-risk text
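Because the score is written to the output column, a tighter `max_score` can also be applied to already-scored records without rerunning the model. The following is a rough sketch using the example output from this page. `refilter` is a hypothetical helper, and the inclusive bounds are an assumption. Note that this only works for tightening the range, since samples outside the original range were already removed.

```python
import json

# Example output records from this page, in JSONL form.
jsonl_output = """\
{"text": "The weather is nice today. Let's go for a walk in the park.", "PresidioScore": 0}
{"text": "My name is John Smith and I live in New York.", "PresidioScore": 2}
{"text": "Please contact me at john.doe@example.com or call me at +1-555-123-4567. My credit card number is 4532-1234-5678-9010.", "PresidioScore": 4}
"""


def refilter(lines, min_score=0, max_score=2, key="PresidioScore"):
    """Keep records whose score lies in [min_score, max_score] (assumed inclusive)."""
    records = (json.loads(line) for line in lines if line.strip())
    return [r for r in records if min_score <= r[key] <= max_score]


# With max_score=2, the third sample (4 entities) is dropped as high-risk.
strict = refilter(jsonl_output.splitlines())
print([r["PresidioScore"] for r in strict])
```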

