How SafePrompt Works

A technical overview of SafePrompt's 4-stage detection pipeline — how it catches prompt injection attacks accurately, quickly, and cost-effectively.

The 4-Stage Pipeline

Every prompt submitted to SafePrompt passes through up to four sequential validation stages. The pipeline exits early as soon as a stage reaches a high-confidence decision — most safe prompts never make it past Stage 1.

// Request flow
User input
  ↓
Stage 1: Pattern Detection (<5ms · $0)
  ↓ (if uncertain)
Stage 2: External Reference Detection (<5ms · $0)
  ↓ (if uncertain)
Stage 3: AI Validation — Pass 1 (~50ms · fast model)
  ↓ (if still uncertain)
Stage 4: AI Validation — Pass 2 (~100ms · deep analysis)
  ↓
Result: safe / unsafe + threats + confidence

Most legitimate requests exit at Stage 1 in under 5ms. Stage 4 is reserved for ambiguous edge cases only.
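The early-exit flow can be sketched as follows. This is a minimal illustration, not SafePrompt's actual implementation: the `Verdict` type, the stage callables, and the 0.9 confidence threshold are all assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Verdict:
    safe: bool
    confidence: float            # 0.0-1.0
    threats: list = field(default_factory=list)

def run_pipeline(prompt, stages, confidence_threshold=0.9):
    """Run stages in order; stop at the first high-confidence verdict.

    `stages` is a list of callables, each taking the prompt and
    returning a Verdict. The threshold is a hypothetical value.
    """
    verdict = Verdict(safe=True, confidence=0.0)
    for stage in stages:
        verdict = stage(prompt)
        if verdict.confidence >= confidence_threshold:
            break  # early exit -- most safe prompts stop at Stage 1
    return verdict
```

Because each stage only runs when the previous one was uncertain, the expensive AI passes are never invoked for clearly safe or clearly malicious inputs.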

Stage 1: Pattern Detection

<5ms · Zero cost · No API call

A high-performance regex and pattern-matching engine scans the input for known attack signatures. This catches the majority of common attacks instantly — no AI inference required.

Catches: Direct instruction overrides ("ignore all previous instructions"), role injection attempts ("you are now"), system prompt extraction keywords, known jailbreak patterns
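A signature scan of this kind reduces to precompiled regular expressions. The sketch below is illustrative only, with a tiny hypothetical signature list; a production engine would carry far more patterns and likely a faster matching strategy.

```python
import re

# Hypothetical signatures covering the examples above -- a real
# engine would ship many more, continuously updated.
ATTACK_SIGNATURES = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"\byou are now\b", re.IGNORECASE),
    re.compile(r"\b(reveal|print|repeat) (your )?system prompt\b", re.IGNORECASE),
]

def stage1_pattern_scan(prompt: str) -> bool:
    """Return True if any known attack signature matches the input."""
    return any(p.search(prompt) for p in ATTACK_SIGNATURES)
```

Precompiling the patterns once at startup keeps the per-request cost to a handful of linear scans, which is how the stage stays under 5ms with no API call.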

Stage 2: External Reference Detection

<5ms · Zero cost · No API call

Detects attempts to load external instructions by scanning for URLs, IP addresses, file paths, and encoded references. A prompt that tries to fetch external content is attempting indirect injection.

Catches: URLs (http/https/ftp), IP addresses, file system paths, base64-encoded URLs, data URIs, attempts to reference external documents
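Reference detection is likewise regex-driven. The following is a simplified sketch covering a few of the categories listed above; the pattern set and category names are assumptions for illustration, not SafePrompt's actual rules.

```python
import re

# Illustrative detectors for a few external-reference categories.
URL_RE = re.compile(r"\b(?:https?|ftp)://\S+", re.IGNORECASE)
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
DATA_URI_RE = re.compile(r"\bdata:[\w/+.-]+;base64,", re.IGNORECASE)
PATH_RE = re.compile(r"(?:^|\s)(/(?:etc|home|var|usr)/\S+)")

def stage2_external_refs(prompt: str) -> list[str]:
    """Return the categories of external references found in the input."""
    findings = []
    for name, pattern in [("url", URL_RE), ("ip_address", IP_RE),
                          ("data_uri", DATA_URI_RE), ("file_path", PATH_RE)]:
        if pattern.search(prompt):
            findings.append(name)
    return findings
```

Any non-empty result is a signal of attempted indirect injection, since a prompt has no legitimate reason to instruct the model to fetch outside content.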

Stage 3: AI Validation — Pass 1

~50ms · Fast AI model

For prompts that remain uncertain after Stages 1 and 2, a fast AI model performs semantic analysis. This catches sophisticated attacks that use natural language to disguise injection attempts — roleplay framing, hypothetical scenarios, obfuscated intent.

Catches: Semantic injection attacks, roleplay-based jailbreaks, hypothetical framing ("imagine you are"), multi-turn context manipulation, obfuscated instructions

Stage 4: AI Validation — Pass 2

~100ms · Deep analysis model · Edge cases only

When Pass 1 returns a low-confidence verdict, a more capable AI model performs a deeper analysis. This second pass significantly reduces false positives on ambiguous inputs while maintaining high accuracy on genuine attacks.

Handles: Low-confidence Pass 1 results, complex multi-layered attacks, culturally nuanced inputs, highly obfuscated content, novel attack patterns
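The two AI passes compose as a confidence-gated escalation. A minimal sketch, assuming the models are injectable callables returning dicts and that 0.9 is the (undocumented, here assumed) confidence floor:

```python
CONFIDENCE_FLOOR = 0.9  # assumed threshold; the real value is not documented

def ai_validate(prompt, fast_model, deep_model):
    """Pass 1 with the fast model; escalate to Pass 2 only if uncertain.

    `fast_model` and `deep_model` are hypothetical callables that each
    return a dict with at least "safe" and "confidence" keys.
    """
    verdict = fast_model(prompt)          # ~50ms pass
    if verdict["confidence"] >= CONFIDENCE_FLOOR:
        verdict["passesUsed"] = 1
        return verdict
    verdict = deep_model(prompt)          # ~100ms pass, edge cases only
    verdict["passesUsed"] = 2
    return verdict
```

Gating the deep model on Pass 1 confidence is what keeps average latency near the fast-model figure while still giving ambiguous inputs a second look.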

Performance

Stage                  | Latency | Cost    | Applies to
-----------------------|---------|---------|----------------------------
Pattern Detection      | <5ms    | $0      | All requests
External Ref Detection | <5ms    | $0      | Uncertain after Stage 1
AI Pass 1              | ~50ms   | Minimal | Uncertain after Stage 2
AI Pass 2              | ~100ms  | Higher  | Low-confidence Pass 1 only

Overall detection accuracy: above 95%. False positive rate: under 3%. Most requests resolve in Stage 1 or 2 — sub-5ms total.

What Gets Returned

Every validation returns a structured result regardless of which stage reached the verdict:

{
  "safe": false,
  "confidence": 0.98,
  "threats": ["prompt_injection", "instruction_override"],
  "processingTimeMs": 4,
  "passesUsed": 1,
  "request_id": "uuid-for-audit-trail",
  "timestamp": "2026-03-19T10:00:00.000Z"
}
  • safe — boolean. The only field you must act on.
  • confidence — 0.0–1.0. Values above 0.9 are high-confidence verdicts.
  • threats — array of detected threat categories.
  • passesUsed — how many validation passes were needed to reach the verdict (1 or 2).
  • request_id — unique ID for audit trail purposes.
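Since `safe` is the only field a caller must act on, handling the result stays simple. A sketch of a consumer, assuming the result arrives as a JSON string (the `handle_validation` helper is hypothetical, not part of any SafePrompt SDK):

```python
import json

def handle_validation(raw: str) -> bool:
    """Parse a validation result and decide whether to forward the prompt.

    Returns True to allow the prompt through, False to block it.
    """
    result = json.loads(raw)
    if not result["safe"]:
        # Record threats and request_id for the audit trail, then block.
        print(f"blocked {result['request_id']}: {result['threats']}")
        return False
    return True
```

The remaining fields (`confidence`, `threats`, `request_id`) are best treated as observability data: log them, but branch only on `safe`.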