Sovergate
← Back to Blog
Technical10 min read · 2 June 2026

LLM PII Scrubbing: A Developer's Guide to GDPR-Compliant AI

How to detect and replace personally identifiable information in LLM prompts and responses before they are logged or leave your infrastructure. Regex, NER, Presidio, and contextual detection — with complete Python and TypeScript implementations.

Your LLM application is probably leaking personal data to your AI provider right now. Not because of a security breach. Because of normal use — users typing their name, their email, their employer, their medical condition, their financial situation into your product.

That text goes to OpenAI, Anthropic, or Mistral in the request body. Under GDPR, that is a data transfer to a third-party processor. Under the EU AI Act, if your system is high-risk, it must be logged — but not with raw personal data.

PII scrubbing is the technical solution: detecting and replacing personal identifiers before they leave your infrastructure, before they reach your LLM provider, and before they are written to any log. This guide covers every technique available, their trade-offs, and complete code implementations in Python and TypeScript.

What counts as PII in an LLM context

GDPR defines personal data broadly: any information that relates to an identified or identifiable natural person. In LLM prompts and responses, this includes far more than most developers expect.

Direct identifiers — always PII

  • Full names
  • Email addresses
  • Phone numbers (all international formats)
  • Physical addresses
  • National identification numbers (passport, tax ID, national insurance)
  • IBAN and bank account numbers
  • Credit and debit card numbers
  • IP addresses (personal data under GDPR)
  • Device identifiers linked to individuals

Special category data (Article 9 — highest risk)

  • Health information and diagnoses
  • Racial or ethnic origin
  • Political opinions
  • Religious or philosophical beliefs
  • Trade union membership
  • Sexual orientation or sex life
  • Criminal convictions and offences

The context problem: A combination of seemingly non-personal attributes can uniquely identify an individual. No scrubber catches everything. Layer your defences — do not rely on scrubbing alone.

The three layers of PII detection

Effective scrubbing requires three complementary approaches used together. Each catches what the others miss.

Layer 1: Regex pattern matching

Fast, deterministic, catches well-structured PII reliably. Sub-millisecond latency, zero dependencies, 100% recall for the patterns it covers. Cannot catch unstructured PII like names.

Layer 2: Named entity recognition (NER)

Statistical or neural models that identify named entities — people, organisations, locations — in free text. Catches names that regex cannot. 5–50ms per call. Accuracy varies by language.

Layer 3: LLM-based detection (contextual)

Uses a small, fast LLM to detect PII that requires contextual understanding. Highest accuracy for complex cases. 50–200ms per call. Only for high-risk systems where false negatives are costly.

For most production pipelines, Layer 1 + Layer 2 is the right starting point. Add Layer 3 for high-risk systems where false negatives are costly.

Layer 1: Regex implementation

Python — regex scrubber
import re
from dataclasses import dataclass, field
from typing import List, Tuple, Dict

@dataclass
class PIIMatch:
    category: str
    placeholder: str
    start: int
    end: int

class RegexScrubber:
    # Ordered by priority — more specific patterns first
    PATTERNS: Dict[str, Tuple[str, str]] = {
        'credit_card': (
            r'(?:4[0-9]{12}(?:[0-9]{3})?|'      # Visa
            r'5[1-5][0-9]{14}|'                    # Mastercard
            r'3[47][0-9]{13}|'                     # Amex
            r'(?:d{4}[s-]?){3}d{4})',       # Generic 16-digit
            '[CARD_REDACTED]'
        ),
        'iban': (
            r'[A-Z]{2}d{2}[A-Z0-9]{4}d{7}'
            r'(?:[A-Z0-9]{0,16})',
            '[IBAN_REDACTED]'
        ),
        'email': (
            r'[a-zA-Z0-9._%+-]+@'
            r'[a-zA-Z0-9.-]+.[a-zA-Z]{2,}',
            '[EMAIL_REDACTED]'
        ),
        'ip_v4': (
            r'(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?).){3}'
            r'(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)',
            '[IP_REDACTED]'
        ),
        'phone_eu': (
            r'(?:+d{1,3}[s.-]?)?'
            r'(?:(?d{1,4})?[s.-]?)?'
            r'd{3,4}[s.-]?d{3,4}'
            r'(?:[s.-]?d{1,4})?',
            '[PHONE_REDACTED]'
        ),
        'nino_uk': (
            r'[A-Z]{2}s?d{2}s?d{2}s?d{2}s?[A-D]',
            '[NINO_REDACTED]'
        ),
        'date_of_birth': (
            r'(?:0?[1-9]|[12][0-9]|3[01])'
            r'[s/-.](0?[1-9]|1[0-2])'
            r'[s/-.](19|20)d{2}',
            '[DOB_REDACTED]'
        ),
    }

    def scrub(self, text: str) -> Tuple[str, List[PIIMatch]]:
        matches: List[PIIMatch] = []
        scrubbed = text

        for category, (pattern, placeholder) in self.PATTERNS.items():
            found = list(re.finditer(pattern, scrubbed,
                                     re.IGNORECASE | re.MULTILINE))
            for match in reversed(found):    # reverse to preserve positions
                matches.append(PIIMatch(
                    category=category,
                    placeholder=placeholder,
                    start=match.start(),
                    end=match.end()
                ))
                scrubbed = (
                    scrubbed[:match.start()] +
                    placeholder +
                    scrubbed[match.end():]
                )

        return scrubbed, matches

Layer 2: NER with Microsoft Presidio

Presidio is an open-source PII detection library from Microsoft. It combines regex, NER, and rule-based approaches and supports multiple languages. Better for production than raw spaCy.

Python — Presidio scrubber
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

# pip install presidio-analyzer presidio-anonymizer
# python -m spacy download en_core_web_lg

class PresidioScrubber:
    EU_ENTITIES = [
        "PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER",
        "IBAN_CODE", "CREDIT_CARD", "IP_ADDRESS",
        "DATE_TIME", "LOCATION", "NRP",
        "DE_TAX_ID", "DE_PASSPORT",          # German
        "ES_NIF", "ES_NIE",                   # Spanish
        "IT_FISCAL_CODE",                     # Italian
        "PL_PESEL",                           # Polish
    ]

    def __init__(self, language: str = "en"):
        self.analyzer = AnalyzerEngine()
        self.anonymizer = AnonymizerEngine()
        self.language = language

    def scrub(self, text: str) -> Tuple[str, List[dict]]:
        results = self.analyzer.analyze(
            text=text,
            language=self.language,
            entities=self.EU_ENTITIES,
            score_threshold=0.6
        )

        if not results:
            return text, []

        anonymized = self.anonymizer.anonymize(
            text=text,
            analyzer_results=results,
            operators={
                "DEFAULT":        OperatorConfig("replace", {"new_value": "<REDACTED>"}),
                "PERSON":         OperatorConfig("replace", {"new_value": "[NAME_REDACTED]"}),
                "EMAIL_ADDRESS":  OperatorConfig("replace", {"new_value": "[EMAIL_REDACTED]"}),
                "PHONE_NUMBER":   OperatorConfig("replace", {"new_value": "[PHONE_REDACTED]"}),
                "IBAN_CODE":      OperatorConfig("replace", {"new_value": "[IBAN_REDACTED]"}),
            }
        )

        detections = [
            {"category": r.entity_type.lower(), "confidence": r.score}
            for r in results
        ]

        return anonymized.text, detections

Layer 3: OpenAI Privacy Filter (contextual)

OpenAI released a lightweight privacy filter library under Apache 2.0. It uses constrained Viterbi decoding over BIOES tags to produce contextually accurate PII spans — catching cases that regex and standard NER miss.

Python — contextual scrubber
# pip install openai-privacy-filter
from openai_privacy_filter import PrivacyFilter

class ContextualScrubber:
    """
    LLM-based contextual PII detection.
    Use for high-risk systems where false negatives are costly.
    Adds 50-150ms per call.
    """

    def __init__(self):
        self.filter = PrivacyFilter()

    def scrub(self, text: str) -> Tuple[str, List[dict]]:
        result = self.filter.scrub(text)
        return result.scrubbed_text, result.detections

The complete combined pipeline

Python — full three-layer pipeline
import time

class GDPRScrubbingPipeline:
    """
    Full three-layer PII scrubbing pipeline.
    Runs locally — no data leaves your infrastructure.
    """

    def __init__(
        self,
        use_ner: bool = True,
        use_contextual: bool = False,  # only for high-risk systems
        language: str = "en"
    ):
        self.regex = RegexScrubber()
        self.ner = PresidioScrubber(language=language) if use_ner else None
        self.contextual = ContextualScrubber() if use_contextual else None

    def scrub(self, text: str) -> Tuple[str, List[dict]]:
        if not text or not text.strip():
            return text, []

        all_detections = []

        # Layer 1: Regex (always runs first, fastest)
        text, regex_detections = self.regex.scrub(text)
        all_detections.extend([
            {"category": d.category, "layer": "regex"}
            for d in regex_detections
        ])

        # Layer 2: NER (catches names, locations, orgs)
        if self.ner:
            text, ner_detections = self.ner.scrub(text)
            all_detections.extend([
                {"category": d["category"],
                 "confidence": d["confidence"],
                 "layer": "ner"}
                for d in ner_detections
            ])

        # Layer 3: Contextual (high-risk systems only)
        if self.contextual:
            text, ctx_detections = self.contextual.scrub(text)
            all_detections.extend([
                {"category": d.get("category", "unknown"),
                 "layer": "contextual"}
                for d in ctx_detections
            ])

        return text, all_detections

    def scrub_messages(self, messages: list) -> Tuple[list, List[dict]]:
        """Scrub a list of OpenAI-format message objects."""
        scrubbed_messages = []
        all_detections = []

        for message in messages:
            if message.get("content") and isinstance(
                message["content"], str
            ):
                scrubbed_content, detections = self.scrub(message["content"])
                scrubbed_messages.append({**message, "content": scrubbed_content})
                all_detections.extend(detections)
            else:
                scrubbed_messages.append(message)

        return scrubbed_messages, all_detections

Consistent pseudonymisation — preserving context

Replacing every instance of a name with [NAME_REDACTED] loses important context. A medical summary referring to “Dr Smith” and “patient Chen” becomes confusing when both become the same placeholder. Consistent pseudonymisation assigns stable pseudonyms within a session — the same real name always gets the same placeholder.

Python — consistent pseudonymisation
import hashlib
from typing import Dict

class ConsistentPseudonymiser:
    """
    Replaces PII with consistent session-scoped pseudonyms.
    Same input always gets the same pseudonym within a session.
    Pseudonyms are not reversible without the session salt.
    """

    def __init__(self, session_salt: str):
        self.salt = session_salt
        self.cache: Dict[str, str] = {}

    def pseudonymise(self, value: str, category: str) -> str:
        cache_key = f"{category}:{value}"

        if cache_key not in self.cache:
            h = hashlib.sha256(
                f"{self.salt}:{value}".encode()
            ).hexdigest()[:8]

            prefixes = {
                "PERSON": "Person",
                "EMAIL_ADDRESS": "email",
                "ORG": "Organisation",
                "LOCATION": "Location",
            }
            prefix = prefixes.get(category, "Entity")
            self.cache[cache_key] = f"[{prefix}_{h}]"

        return self.cache[cache_key]
Example — with vs without consistency
Without consistency

[NAME_REDACTED] reviewed the patient file. Patient [NAME_REDACTED] had previously seen [NAME_REDACTED] in March.

With consistent pseudonymisation

[Person_a3f9c284] reviewed the patient file. Patient [Person_a3f9c284] had previously seen [Person_a3f9c284] in March.

The identity is protected. The narrative is preserved.

TypeScript implementation

TypeScript — combined pipeline
const REGEX_PATTERNS: Record<string, [RegExp, string]> = {
  email: [
    /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+.[a-zA-Z]{2,}/gi,
    '[EMAIL_REDACTED]'
  ],
  iban: [
    /[A-Z]{2}d{2}[A-Z0-9]{4}d{7}(?:[A-Z0-9]{0,16})/g,
    '[IBAN_REDACTED]'
  ],
  ipv4: [
    /(?:(?:25[0-5]|2[0-4]d|[01]?dd?).){3}(?:25[0-5]|2[0-4]d|[01]?dd?)/g,
    '[IP_REDACTED]'
  ],
  creditCard: [
    /(?:d{4}[s-]?){3}d{4}/g,
    '[CARD_REDACTED]'
  ],
  phone: [
    /(?:+d{1,3}[s.-]?)?(?:(?d{1,4})?[s.-]?)?d{3,4}[s.-]?d{3,4}(?:[s.-]?d{1,4})?/g,
    '[PHONE_REDACTED]'
  ],
}

interface ScrubResult {
  text: string
  detections: Array<{ category: string; layer: 'regex' | 'ner' }>
}

function regexScrub(text: string): ScrubResult {
  const detections: ScrubResult['detections'] = []
  let scrubbed = text

  for (const [category, [pattern, placeholder]] of Object.entries(REGEX_PATTERNS)) {
    const matches = [...scrubbed.matchAll(pattern)]
    if (matches.length > 0) {
      detections.push(
        ...matches.map(() => ({ category, layer: 'regex' as const }))
      )
      scrubbed = scrubbed.replace(pattern, placeholder)
    }
  }

  return { text: scrubbed, detections }
}

// For NER in Node.js, run Presidio as a sidecar REST service
async function nerScrub(text: string): Promise<ScrubResult> {
  const response = await fetch('http://localhost:5002/analyze', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text, language: 'en' }),
  })
  const results = await response.json()
  return { text, detections: results }
}

async function scrubMessages(
  messages: Array<{ role: string; content: string }>
): Promise<{ messages: typeof messages; detections: ScrubResult['detections'] }> {
  const allDetections: ScrubResult['detections'] = []
  const scrubbedMessages = []

  for (const message of messages) {
    const { text: regexScrubbed, detections: rd } = regexScrub(message.content)
    const { text: final, detections: nd } = await nerScrub(regexScrubbed)
    scrubbedMessages.push({ ...message, content: final })
    allDetections.push(...rd, ...nd)
  }

  return { messages: scrubbedMessages, detections: allDetections }
}

Handling multilingual text

EU companies serve users across 24 official languages. Your scrubber must handle all of them.

Python — language detection + routing
from langdetect import detect

def get_scrubber_for_text(text: str) -> PresidioScrubber:
    lang_map = {
        'en': 'en', 'de': 'de', 'fr': 'fr',
        'nl': 'nl', 'es': 'es', 'it': 'it',
        'pl': 'pl', 'pt': 'pt'
    }
    try:
        detected = detect(text)
        lang = lang_map.get(detected, 'en')
    except Exception:
        lang = 'en'

    return PresidioScrubber(language=lang)

Presidio also supports running multiple language analysers simultaneously — pass them all and take the union of detections for maximum recall on mixed-language input.

Testing your scrubber

A scrubber you have not tested is a scrubber you cannot trust. Test with real-world examples from each domain your application serves.

Python — test suite
import pytest

class TestGDPRScrubber:

    def setup_method(self):
        self.pipeline = GDPRScrubbingPipeline(use_ner=True, language="en")

    def test_email_detection(self):
        text = "Contact me at sarah.chen@acmecorp.com tomorrow."
        scrubbed, detections = self.pipeline.scrub(text)
        assert "sarah.chen@acmecorp.com" not in scrubbed
        assert "[EMAIL_REDACTED]" in scrubbed
        assert any(d["category"] == "email" for d in detections)

    def test_iban_detection(self):
        text = "Please transfer to DE89 3704 0044 0532 0130 00."
        scrubbed, _ = self.pipeline.scrub(text)
        assert "DE89" not in scrubbed
        assert "[IBAN_REDACTED]" in scrubbed

    def test_name_detection(self):
        text = "The patient John Smith presented with symptoms."
        scrubbed, _ = self.pipeline.scrub(text)
        assert "John Smith" not in scrubbed

    def test_false_positive_rate(self):
        text = "The EU AI Act was passed in 2024. Article 12 requires logging."
        scrubbed, _ = self.pipeline.scrub(text)
        assert "EU AI Act" in scrubbed
        assert "Article 12" in scrubbed

    def test_no_original_pii_in_detections(self):
        text = "Email: test@example.com, Name: Jane Doe"
        _, detections = self.pipeline.scrub(text)
        for d in detections:
            assert "test@example.com" not in str(d)
            assert "Jane Doe" not in str(d)

    def test_empty_text(self):
        scrubbed, detections = self.pipeline.scrub("")
        assert scrubbed == ""
        assert detections == []

    def test_multilingual(self):
        text = "Kontaktieren Sie mich unter max.mueller@beispiel.de"
        scrubbed, _ = self.pipeline.scrub(text)
        assert "max.mueller@beispiel.de" not in scrubbed

Performance at scale

At high volume, scrubbing latency accumulates. Profile your pipeline before deploying to production.

PipelineLatency per callUse case
Regex only0.1–0.5msHigh-throughput, structured data
Regex + Presidio NER5–20msMost production systems
Regex + NER + contextual LLM50–200msHigh-risk AI systems

Run scrubbing asynchronously and cache results for identical inputs. For high-throughput systems, Presidio can be deployed as a sidecar service and called over a local socket to avoid per-process model loading.

The limits of scrubbing

Scrubbing is not anonymisation

Pseudonymisation — replacing identifiers with consistent aliases — is still personal data under GDPR. It remains possible, with the right key, to re-identify individuals. GDPR defines it as a risk-reduction measure, not a route out of compliance.

Context-dependent PII slips through

A reference to “the CEO of [ORG_REDACTED] who had the skiing accident in January” may still identify a real person even after org scrubbing. No scrubber catches every case. Layer your defences beyond scrubbing alone.

Scrubbed training data may not be fully safe

Research has demonstrated that LLM-based reconstruction attacks can recover PII from scrubbed training data with meaningful accuracy. This is a research-stage finding, not a current operational concern for most deployments — but relevant for long-term data strategy.

Document your scrubbing approach in your DPIA and ROPA. Scrubbing is a mitigation measure — document it as such, alongside its known limitations.

Summary

GDPR-compliant LLM applications require PII scrubbing before prompts reach your LLM provider and before any data is written to compliance logs.

1.
RegexFast, deterministic, catches structured PII — email, IBAN, phone, IP
2.
NER (Presidio or spaCy)Catches names, locations, organisations
3.
Contextual (optional)Catches implied and complex PII — high-risk systems only
4.
Consistent pseudonymisationPreserves narrative context while removing identifiers
5.
Multilingual supportLanguage detection and per-language models for EU coverage

Scrubbing is necessary but not sufficient. Combine it with a GDPR-compliant logging pipeline, EU-hosted infrastructure, defined retention periods, and Article 12 compliant tamper-evident logs for high-risk AI systems.

This guide is maintained by Sovergate. We build Article 12 EU AI Act compliance logging infrastructure — with PII scrubbing built in, running locally in your infrastructure before any data reaches our servers in Germany. This guide is for informational purposes only and does not constitute legal advice.

Last updated June 2026.

Want scrubbing built in, out of the box?

Two lines of code. PII scrubbed locally inside your infrastructure before anything leaves. Data stored in Germany. Monthly Article 12 compliance reports ready for your legal team.