Your LLM application is probably leaking personal data to your AI provider right now. Not because of a security breach. Because of normal use — users typing their name, their email, their employer, their medical condition, their financial situation into your product.
That text goes to OpenAI, Anthropic, or Mistral in the request body. Under GDPR, that is a data transfer to a third-party processor. Under the EU AI Act, if your system is high-risk, its operation must be logged, but not with raw personal data.
PII scrubbing is the technical solution: detecting and replacing personal identifiers before they leave your infrastructure, before they reach your LLM provider, and before they are written to any log. This guide covers every technique available, their trade-offs, and complete code implementations in Python and TypeScript.
What counts as PII in an LLM context
GDPR defines personal data broadly: any information that relates to an identified or identifiable natural person. In LLM prompts and responses, this includes far more than most developers expect.
Direct identifiers — always PII
- Full names
- Email addresses
- Phone numbers (all international formats)
- Physical addresses
- National identification numbers (passport, tax ID, national insurance)
- IBAN and bank account numbers
- Credit and debit card numbers
- IP addresses (personal data under GDPR)
- Device identifiers linked to individuals
Special category data (Article 9 — highest risk)
- Health information and diagnoses
- Racial or ethnic origin
- Political opinions
- Religious or philosophical beliefs
- Trade union membership
- Sexual orientation or sex life
- Criminal convictions and offences
The context problem: A combination of seemingly non-personal attributes can uniquely identify an individual. No scrubber catches everything. Layer your defences — do not rely on scrubbing alone.
The three layers of PII detection
Effective scrubbing requires three complementary approaches used together. Each catches what the others miss.
Layer 1: Regex. Fast and deterministic; catches well-structured PII reliably. Sub-millisecond latency, zero dependencies, and it never misses a string that fits one of its patterns. Cannot catch unstructured PII such as names.
Layer 2: NER. Statistical or neural models that identify named entities (people, organisations, locations) in free text. Catches names that regex cannot. 5–50ms per call. Accuracy varies by language.
Layer 3: Contextual LLM. Uses a small, fast LLM to detect PII that requires contextual understanding. Highest accuracy for complex cases. 50–200ms per call.
For most production pipelines, Layer 1 + Layer 2 is the right starting point. Add Layer 3 for high-risk systems where false negatives are costly.
Layer 1: Regex implementation
```python
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class PIIMatch:
    category: str
    placeholder: str
    start: int  # offsets into the text as it stood when this pattern ran
    end: int


class RegexScrubber:
    # Ordered by priority: more specific patterns first. IBAN runs before
    # credit_card so the generic 16-digit rule cannot consume part of an IBAN.
    PATTERNS: Dict[str, Tuple[str, str]] = {
        'iban': (
            r'\b[A-Z]{2}\d{2}'            # country code + check digits
            r'(?: ?[A-Z0-9]{4}){2,7}'     # groups of four, spaces optional
            r'(?: ?[A-Z0-9]{1,3})?\b',    # trailing partial group
            '[IBAN_REDACTED]'
        ),
        'credit_card': (
            r'\b(?:4[0-9]{12}(?:[0-9]{3})?|'   # Visa
            r'5[1-5][0-9]{14}|'                # Mastercard
            r'3[47][0-9]{13}|'                 # Amex
            r'(?:\d{4}[\s-]?){3}\d{4})\b',     # Generic 16-digit
            '[CARD_REDACTED]'
        ),
        'email': (
            r'[a-zA-Z0-9._%+-]+@'
            r'[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
            '[EMAIL_REDACTED]'
        ),
        'ip_v4': (
            r'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}'
            r'(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b',
            '[IP_REDACTED]'
        ),
        'phone_eu': (  # deliberately broad: expect false positives on digit runs
            r'(?:\+\d{1,3}[\s.-]?)?'
            r'(?:\(?\d{1,4}\)?[\s.-]?)?'
            r'\d{3,4}[\s.-]?\d{3,4}'
            r'(?:[\s.-]?\d{1,4})?',
            '[PHONE_REDACTED]'
        ),
        'nino_uk': (
            r'\b[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b',
            '[NINO_REDACTED]'
        ),
        'date_of_birth': (
            r'\b(?:0?[1-9]|[12][0-9]|3[01])'
            r'[\s/.-](?:0?[1-9]|1[0-2])'
            r'[\s/.-](?:19|20)\d{2}\b',
            '[DOB_REDACTED]'
        ),
    }

    def scrub(self, text: str) -> Tuple[str, List[PIIMatch]]:
        matches: List[PIIMatch] = []
        scrubbed = text
        for category, (pattern, placeholder) in self.PATTERNS.items():
            # No IGNORECASE: the IBAN and NINO patterns rely on upper case
            found = list(re.finditer(pattern, scrubbed))
            for match in reversed(found):  # right to left keeps offsets valid
                matches.append(PIIMatch(
                    category=category,
                    placeholder=placeholder,
                    start=match.start(),
                    end=match.end()
                ))
                scrubbed = (
                    scrubbed[:match.start()] +
                    placeholder +
                    scrubbed[match.end():]
                )
        return scrubbed, matches
```
Layer 2: NER with Microsoft Presidio
Presidio is an open-source PII detection library from Microsoft. It combines regex, NER, and rule-based approaches and supports multiple languages. For production use it is a better starting point than raw spaCy.
```python
# pip install presidio-analyzer presidio-anonymizer
# python -m spacy download en_core_web_lg
from typing import List, Tuple

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig


class PresidioScrubber:
    # Country-specific recognisers only fire when the analyser runs in a
    # language they are registered for; check each name against the
    # supported-entities list of the Presidio version you install.
    EU_ENTITIES = [
        "PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER",
        "IBAN_CODE", "CREDIT_CARD", "IP_ADDRESS",
        "DATE_TIME", "LOCATION", "NRP",
        "DE_TAX_ID", "DE_PASSPORT",   # German
        "ES_NIF", "ES_NIE",           # Spanish
        "IT_FISCAL_CODE",             # Italian
        "PL_PESEL",                   # Polish
    ]

    def __init__(self, language: str = "en"):
        # The default AnalyzerEngine loads an English model only; other
        # languages need an NLP engine configured for them.
        self.analyzer = AnalyzerEngine()
        self.anonymizer = AnonymizerEngine()
        self.language = language

    def scrub(self, text: str) -> Tuple[str, List[dict]]:
        results = self.analyzer.analyze(
            text=text,
            language=self.language,
            entities=self.EU_ENTITIES,
            score_threshold=0.6
        )
        if not results:
            return text, []
        anonymized = self.anonymizer.anonymize(
            text=text,
            analyzer_results=results,
            operators={
                "DEFAULT": OperatorConfig("replace", {"new_value": "<REDACTED>"}),
                "PERSON": OperatorConfig("replace", {"new_value": "[NAME_REDACTED]"}),
                "EMAIL_ADDRESS": OperatorConfig("replace", {"new_value": "[EMAIL_REDACTED]"}),
                "PHONE_NUMBER": OperatorConfig("replace", {"new_value": "[PHONE_REDACTED]"}),
                "IBAN_CODE": OperatorConfig("replace", {"new_value": "[IBAN_REDACTED]"}),
            }
        )
        # Log only the category and score, never the matched text
        detections = [
            {"category": r.entity_type.lower(), "confidence": r.score}
            for r in results
        ]
        return anonymized.text, detections
```
Layer 3: OpenAI Privacy Filter (contextual)
OpenAI released a lightweight privacy filter library under Apache 2.0. It uses constrained Viterbi decoding over BIOES tags to produce contextually accurate PII spans — catching cases that regex and standard NER miss.
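For readers unfamiliar with BIOES tagging: each token is labelled B (begin), I (inside), E (end), S (single-token entity) or O (outside). The sketch below is not the library's code. It only illustrates, under simplifying assumptions, how a tag sequence that a decoder has already produced maps back to entity spans. The function name `bioes_to_spans` and the `PERSON` and `EMAIL` labels are made up for the example.

```python
from typing import List, Optional, Tuple


def bioes_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert per-token BIOES tags into (first_token, last_token, label) spans.

    Malformed sequences (for example an I- tag with no preceding B-) are
    dropped rather than repaired; a constrained decoder is meant to rule
    those out before this step.
    """
    spans: List[Tuple[int, int, str]] = []
    start: Optional[int] = None   # token index where the open entity began
    label: Optional[str] = None   # label of the open entity
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")
        if prefix == "S":
            spans.append((i, i, name))
            start, label = None, None
        elif prefix == "B":
            start, label = i, name
        elif prefix == "I" and start is not None and name == label:
            continue  # still inside the open entity
        elif prefix == "E" and start is not None and name == label:
            spans.append((start, i, name))
            start, label = None, None
        else:
            # "O" or an inconsistent tag discards any open entity
            start, label = None, None
    return spans


# Hypothetical tokens and tags, hand-written for illustration
tokens = ["Dr", "Anna", "Smith", "emailed", "a.smith@example.org", "today"]
tags = ["O", "B-PERSON", "E-PERSON", "O", "S-EMAIL", "O"]
spans = bioes_to_spans(tags)
print(spans)
```

The constraint in "constrained decoding" is that only valid transitions are allowed (an I- or E- tag must follow a B- or I- tag with the same label), which is why the span step can stay this simple.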
```python
# pip install openai-privacy-filter
# Check the class and method names below against the release you install.
from typing import List, Tuple

from openai_privacy_filter import PrivacyFilter


class ContextualScrubber:
    """
    LLM-based contextual PII detection.
    Use for high-risk systems where false negatives are costly.
    Adds 50-200ms per call.
    """

    def __init__(self):
        self.filter = PrivacyFilter()

    def scrub(self, text: str) -> Tuple[str, List[dict]]:
        result = self.filter.scrub(text)
        return result.scrubbed_text, result.detections
```
The complete combined pipeline
```python
from typing import List, Tuple


class GDPRScrubbingPipeline:
    """
    Full three-layer PII scrubbing pipeline.
    Runs locally: no data leaves your infrastructure.
    """

    def __init__(
        self,
        use_ner: bool = True,
        use_contextual: bool = False,  # only for high-risk systems
        language: str = "en"
    ):
        self.regex = RegexScrubber()
        self.ner = PresidioScrubber(language=language) if use_ner else None
        self.contextual = ContextualScrubber() if use_contextual else None

    def scrub(self, text: str) -> Tuple[str, List[dict]]:
        if not text or not text.strip():
            return text, []
        all_detections = []

        # Layer 1: Regex (always runs first, fastest)
        text, regex_detections = self.regex.scrub(text)
        all_detections.extend([
            {"category": d.category, "layer": "regex"}
            for d in regex_detections
        ])

        # Layer 2: NER (catches names, locations, orgs)
        if self.ner:
            text, ner_detections = self.ner.scrub(text)
            all_detections.extend([
                {"category": d["category"],
                 "confidence": d["confidence"],
                 "layer": "ner"}
                for d in ner_detections
            ])

        # Layer 3: Contextual (high-risk systems only)
        if self.contextual:
            text, ctx_detections = self.contextual.scrub(text)
            all_detections.extend([
                {"category": d.get("category", "unknown"),
                 "layer": "contextual"}
                for d in ctx_detections
            ])
        return text, all_detections

    def scrub_messages(self, messages: list) -> Tuple[list, List[dict]]:
        """Scrub a list of OpenAI-format message objects."""
        scrubbed_messages = []
        all_detections = []
        for message in messages:
            if message.get("content") and isinstance(message["content"], str):
                scrubbed_content, detections = self.scrub(message["content"])
                scrubbed_messages.append({**message, "content": scrubbed_content})
                all_detections.extend(detections)
            else:
                # Non-string content (e.g. multi-part messages) passes through unscrubbed
                scrubbed_messages.append(message)
        return scrubbed_messages, all_detections
```
Consistent pseudonymisation: preserving context
Replacing every instance of a name with [NAME_REDACTED] loses important context. A medical summary referring to “Dr Smith” and “patient Chen” becomes confusing when both become the same placeholder. Consistent pseudonymisation assigns stable pseudonyms within a session — the same real name always gets the same placeholder.
```python
import hashlib
from typing import Dict


class ConsistentPseudonymiser:
    """
    Replaces PII with consistent session-scoped pseudonyms.
    Same input always gets the same pseudonym within a session.
    The hash is one-way; without the session salt, guessed values
    cannot be tested against a pseudonym.
    """

    PREFIXES = {
        "PERSON": "Person",
        "EMAIL_ADDRESS": "email",
        "ORG": "Organisation",
        "LOCATION": "Location",
    }

    def __init__(self, session_salt: str):
        self.salt = session_salt
        self.cache: Dict[str, str] = {}

    def pseudonymise(self, value: str, category: str) -> str:
        cache_key = f"{category}:{value}"
        if cache_key not in self.cache:
            # 8 hex characters keep placeholders short; lengthen if collisions matter
            h = hashlib.sha256(
                f"{self.salt}:{value}".encode()
            ).hexdigest()[:8]
            prefix = self.PREFIXES.get(category, "Entity")
            self.cache[cache_key] = f"[{prefix}_{h}]"
        return self.cache[cache_key]
```
With a single shared placeholder:
[NAME_REDACTED] reviewed the patient file. Patient [NAME_REDACTED] had previously seen [NAME_REDACTED] in March.
With consistent pseudonyms (hash values are illustrative):
[Person_a3f9c284] reviewed the patient file. Patient [Person_5d02e7b1] had previously seen [Person_a3f9c284] in March.
The identity is protected. The narrative is preserved.
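The pseudonymiser maps one value at a time; to use it on a prompt, the detected spans still have to be substituted into the text. The following is a minimal sketch of that step, assuming the detector returns character offsets as `(start, end, category)` tuples. The helper names `pseudonym_for` and `apply_pseudonyms` are hypothetical, and the regex in the example is only a stand-in for a real detector.

```python
import hashlib
import re
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end, category), end exclusive


def pseudonym_for(value: str, category: str, salt: str, cache: Dict[str, str]) -> str:
    """Return a stable placeholder for `value` within one session."""
    key = f"{category}:{value}"
    if key not in cache:
        digest = hashlib.sha256(f"{salt}:{value}".encode()).hexdigest()[:8]
        cache[key] = f"[{category.title()}_{digest}]"
    return cache[key]


def apply_pseudonyms(text: str, spans: List[Span], salt: str) -> str:
    """Replace each span with its pseudonym, working right to left so that
    offsets of spans earlier in the text stay valid as the string changes."""
    cache: Dict[str, str] = {}
    for start, end, category in sorted(spans, reverse=True):
        placeholder = pseudonym_for(text[start:end], category, salt, cache)
        text = text[:start] + placeholder + text[end:]
    return text


note = "Dr Smith reviewed the file. Patient Chen had previously seen Dr Smith."
# Stand-in for a detector: a real pipeline would take spans from NER results
spans = [(m.start(), m.end(), "PERSON") for m in re.finditer(r"Smith|Chen", note)]
result = apply_pseudonyms(note, spans, salt="session-1234")
print(result)
```

Both mentions of the doctor receive the same placeholder and the patient receives a different one, so the model can still follow who did what.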
TypeScript implementation
```typescript
// More specific patterns first: IBAN runs before the generic card rule
const REGEX_PATTERNS: Record<string, [RegExp, string]> = {
  email: [
    /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g,
    '[EMAIL_REDACTED]'
  ],
  iban: [
    /\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b/g,
    '[IBAN_REDACTED]'
  ],
  ipv4: [
    /\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b/g,
    '[IP_REDACTED]'
  ],
  creditCard: [
    /\b(?:\d{4}[\s-]?){3}\d{4}\b/g,
    '[CARD_REDACTED]'
  ],
  phone: [
    /(?:\+\d{1,3}[\s.-]?)?(?:\(?\d{1,4}\)?[\s.-]?)?\d{3,4}[\s.-]?\d{3,4}(?:[\s.-]?\d{1,4})?/g,
    '[PHONE_REDACTED]'
  ],
}

interface ScrubResult {
  text: string
  detections: Array<{ category: string; layer: 'regex' | 'ner' }>
}

// Fields used from each item returned by the Presidio analyzer endpoint
interface AnalyzerResult {
  entity_type: string
  start: number
  end: number
  score: number
}

function regexScrub(text: string): ScrubResult {
  const detections: ScrubResult['detections'] = []
  let scrubbed = text
  for (const [category, [pattern, placeholder]] of Object.entries(REGEX_PATTERNS)) {
    const matches = [...scrubbed.matchAll(pattern)]
    if (matches.length > 0) {
      detections.push(
        ...matches.map(() => ({ category, layer: 'regex' as const }))
      )
      scrubbed = scrubbed.replace(pattern, placeholder)
    }
  }
  return { text: scrubbed, detections }
}

// For NER in Node.js, run Presidio as a sidecar REST service
async function nerScrub(text: string): Promise<ScrubResult> {
  const response = await fetch('http://localhost:5002/analyze', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text, language: 'en' }),
  })
  // Fail closed: never forward text that could not be analysed
  if (!response.ok) throw new Error(`Presidio analyzer returned ${response.status}`)
  const results = (await response.json()) as AnalyzerResult[]

  // Replace spans right to left so earlier offsets stay valid; skip overlaps
  const detections: ScrubResult['detections'] = []
  let scrubbed = text
  let lastStart = text.length
  for (const r of [...results].sort((a, b) => b.end - a.end)) {
    if (r.end > lastStart) continue
    scrubbed = scrubbed.slice(0, r.start) + '<REDACTED>' + scrubbed.slice(r.end)
    lastStart = r.start
    detections.push({ category: r.entity_type.toLowerCase(), layer: 'ner' })
  }
  return { text: scrubbed, detections }
}

async function scrubMessages(
  messages: Array<{ role: string; content: string }>
): Promise<{ messages: typeof messages; detections: ScrubResult['detections'] }> {
  const allDetections: ScrubResult['detections'] = []
  const scrubbedMessages: typeof messages = []
  for (const message of messages) {
    const { text: regexScrubbed, detections: rd } = regexScrub(message.content)
    const { text: final, detections: nd } = await nerScrub(regexScrubbed)
    scrubbedMessages.push({ ...message, content: final })
    allDetections.push(...rd, ...nd)
  }
  return { messages: scrubbedMessages, detections: allDetections }
}
```
Handling multilingual text
EU companies serve users across 24 official languages. Your scrubber must handle all of them.
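```python
from functools import lru_cache

from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # langdetect is non-deterministic unless seeded

SUPPORTED_LANGUAGES = {'en', 'de', 'fr', 'nl', 'es', 'it', 'pl', 'pt'}


@lru_cache(maxsize=None)
def _scrubber_for_language(lang: str) -> PresidioScrubber:
    # Build each scrubber once; loading NER models on every call is slow.
    # Non-English languages need the AnalyzerEngine configured with a
    # matching NLP model.
    return PresidioScrubber(language=lang)


def get_scrubber_for_text(text: str) -> PresidioScrubber:
    try:
        detected = detect(text)
    except Exception:  # langdetect raises on empty or featureless input
        detected = 'en'
    lang = detected if detected in SUPPORTED_LANGUAGES else 'en'
    return _scrubber_for_language(lang)
```
Presidio can also be configured with several language models at once. For mixed-language input, run the analyser once per candidate language and take the union of the detections for maximum recall.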
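The "union of detections" step can be sketched without any Presidio dependency. The example assumes each per-language pass ran over the same original text and returned `(start, end, category, score)` tuples; that tuple shape and the name `union_detections` are assumptions for illustration, not Presidio's result type.

```python
from typing import Iterable, List, Tuple

Detection = Tuple[int, int, str, float]  # (start, end, category, score), end exclusive


def union_detections(per_language: Iterable[List[Detection]]) -> List[Detection]:
    """Merge detections from several passes over the same text.

    Overlapping spans are fused into one span covering both, keeping the
    category and score of the higher-scoring detection.
    """
    ordered = sorted(d for detections in per_language for d in detections)
    merged: List[Detection] = []
    for start, end, category, score in ordered:
        if merged and start < merged[-1][1]:  # overlaps the previous span
            p_start, p_end, p_cat, p_score = merged[-1]
            if score > p_score:
                p_cat, p_score = category, score
            merged[-1] = (p_start, max(p_end, end), p_cat, p_score)
        else:
            merged.append((start, end, category, score))
    return merged


# Hypothetical results from an English pass and a German pass
english = [(0, 10, "PERSON", 0.85)]
german = [(5, 12, "PERSON", 0.60), (30, 52, "EMAIL_ADDRESS", 1.0)]
combined = union_detections([english, german])
print(combined)
```

Fusing overlaps into the widest span favours recall over precision, which matches the goal stated above; the merged spans can then be replaced right to left in a single pass.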
Testing your scrubber
A scrubber you have not tested is a scrubber you cannot trust. Test with real-world examples from each domain your application serves.
```python
import pytest

# Assumes GDPRScrubbingPipeline from the combined pipeline section is importable


class TestGDPRScrubber:
    def setup_method(self):
        self.pipeline = GDPRScrubbingPipeline(use_ner=True, language="en")

    def test_email_detection(self):
        text = "Contact me at sarah.chen@acmecorp.com tomorrow."
        scrubbed, detections = self.pipeline.scrub(text)
        assert "sarah.chen@acmecorp.com" not in scrubbed
        assert "[EMAIL_REDACTED]" in scrubbed
        assert any(d["category"] == "email" for d in detections)

    def test_iban_detection(self):
        text = "Please transfer to DE89 3704 0044 0532 0130 00."
        scrubbed, _ = self.pipeline.scrub(text)
        assert "DE89" not in scrubbed
        assert "[IBAN_REDACTED]" in scrubbed

    def test_name_detection(self):
        text = "The patient John Smith presented with symptoms."
        scrubbed, _ = self.pipeline.scrub(text)
        assert "John Smith" not in scrubbed

    def test_false_positive_rate(self):
        text = "The EU AI Act was passed in 2024. Article 12 requires logging."
        scrubbed, _ = self.pipeline.scrub(text)
        assert "EU AI Act" in scrubbed
        assert "Article 12" in scrubbed

    def test_no_original_pii_in_detections(self):
        text = "Email: test@example.com, Name: Jane Doe"
        _, detections = self.pipeline.scrub(text)
        for d in detections:
            assert "test@example.com" not in str(d)
            assert "Jane Doe" not in str(d)

    def test_empty_text(self):
        scrubbed, detections = self.pipeline.scrub("")
        assert scrubbed == ""
        assert detections == []

    def test_multilingual(self):
        text = "Kontaktieren Sie mich unter max.mueller@beispiel.de"
        scrubbed, _ = self.pipeline.scrub(text)
        assert "max.mueller@beispiel.de" not in scrubbed
```
Performance at scale
At high volume, scrubbing latency accumulates. Profile your pipeline before deploying to production.
| Pipeline | Latency per call | Use case |
|---|---|---|
| Regex only | 0.1–0.5ms | High-throughput, structured data |
| Regex + Presidio NER | 5–20ms | Most production systems |
| Regex + NER + contextual LLM | 50–200ms | High-risk AI systems |
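The figures in the table are indicative; your own numbers depend on prompt length, hardware and model size. One way to measure them is a small timing harness like the sketch below. `stand_in_scrub` is a placeholder (regex only) so the example runs on its own; pass your pipeline's scrub function instead. The function name `profile` and the choice of statistics are assumptions, not part of any library.

```python
import re
import statistics
import time
from typing import Callable, Dict, List

EMAIL = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def stand_in_scrub(text: str) -> str:
    """Placeholder for your real scrub function (regex only here)."""
    return EMAIL.sub("[EMAIL_REDACTED]", text)


def profile(scrub: Callable[[str], str], samples: List[str], runs: int = 200) -> Dict[str, float]:
    """Time `scrub` over the samples and return latency stats in milliseconds."""
    timings: List[float] = []
    for _ in range(runs):
        for text in samples:
            started = time.perf_counter()
            scrub(text)
            timings.append((time.perf_counter() - started) * 1000)
    timings.sort()
    p95_index = min(len(timings) - 1, int(len(timings) * 0.95))
    return {
        "median_ms": statistics.median(timings),
        "p95_ms": timings[p95_index],
        "max_ms": timings[-1],
    }


stats = profile(stand_in_scrub, ["Contact sarah.chen@acmecorp.com tomorrow."] * 5)
print(stats)
```

Use samples that match real prompt lengths, and look at the p95 rather than the median: tail latency is what users notice once scrubbing sits on the request path.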
Run scrubbing asynchronously and cache results for identical inputs. For high-throughput systems, Presidio can be deployed as a sidecar service and called over a local socket to avoid per-process model loading.
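A minimal sketch of both ideas, assuming an asyncio-based service on Python 3.9 or later (for `asyncio.to_thread`). `slow_scrub` stands in for the real blocking pipeline call, and `CachedScrubber` is a hypothetical wrapper. The cache is keyed by a hash of the input so that raw prompts are not held as cache keys, and it is bounded so memory use stays predictable.

```python
import asyncio
import hashlib
import re
import threading
from collections import OrderedDict
from typing import List

EMAIL = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
CALLS = {"count": 0}  # counts real scrub runs, only to make cache hits visible


def slow_scrub(text: str) -> str:
    """Stand-in for a blocking scrub call (regex + NER in a real pipeline)."""
    CALLS["count"] += 1
    return EMAIL.sub("[EMAIL_REDACTED]", text)


class CachedScrubber:
    """Bounded LRU cache in front of a blocking scrub function."""

    def __init__(self, max_entries: int = 10_000):
        self._cache: "OrderedDict[str, str]" = OrderedDict()
        self._max_entries = max_entries
        self._lock = threading.Lock()  # scrub() may run in several worker threads

    def scrub(self, text: str) -> str:
        key = hashlib.sha256(text.encode()).hexdigest()
        with self._lock:
            if key in self._cache:
                self._cache.move_to_end(key)  # mark as recently used
                return self._cache[key]
        result = slow_scrub(text)
        with self._lock:
            self._cache[key] = result
            if len(self._cache) > self._max_entries:
                self._cache.popitem(last=False)  # evict least recently used
        return result

    async def scrub_async(self, text: str) -> str:
        # Run the blocking call in a worker thread so the event loop stays free
        return await asyncio.to_thread(self.scrub, text)


async def main() -> List[str]:
    scrubber = CachedScrubber()
    first = await scrubber.scrub_async("Mail a@example.com")
    second = await scrubber.scrub_async("Mail a@example.com")  # served from cache
    return [first, second]


results = asyncio.run(main())
print(results, CALLS)
```

Caching pays off mainly for repeated system prompts and templated messages; free-form user text rarely repeats, so measure the hit rate before relying on it.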
The limits of scrubbing
Scrubbing is not anonymisation
Pseudonymised data, in which identifiers are replaced with consistent aliases, is still personal data under GDPR. It remains possible, with the right key, to re-identify individuals. GDPR treats pseudonymisation as a risk-reduction measure, not a route out of compliance.
Context-dependent PII slips through
A reference to “the CEO of [ORG_REDACTED] who had the skiing accident in January” may still identify a real person even after org scrubbing. No scrubber catches every case. Layer your defences beyond scrubbing alone.
Scrubbed training data may not be fully safe
Research has demonstrated that LLM-based reconstruction attacks can recover PII from scrubbed training data with meaningful accuracy. This is a research-stage finding, not a current operational concern for most deployments — but relevant for long-term data strategy.
Document your scrubbing approach in your DPIA and ROPA. Scrubbing is a mitigation measure — document it as such, alongside its known limitations.
Summary
GDPR-compliant LLM applications require PII scrubbing before prompts reach your LLM provider and before any data is written to compliance logs.
Scrubbing is necessary but not sufficient. Combine it with a GDPR-compliant logging pipeline, EU-hosted infrastructure, defined retention periods, and Article 12-compliant, tamper-evident logs for high-risk AI systems.
This guide is maintained by Sovergate. We build Article 12 EU AI Act compliance logging infrastructure — with PII scrubbing built in, running locally in your infrastructure before any data reaches our servers in Germany. This guide is for informational purposes only and does not constitute legal advice.
Last updated June 2026.