How SafePrompt Works
A technical overview of SafePrompt's 4-stage detection pipeline — how it catches prompt injection attacks accurately, quickly, and cost-effectively.
The 4-Stage Pipeline
Every prompt submitted to SafePrompt passes through up to four sequential validation stages. The pipeline exits early as soon as a stage reaches a high-confidence verdict: most legitimate prompts resolve at Stage 1 in under 5ms, and Stage 4 is reserved for ambiguous edge cases only.
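The early-exit flow above can be sketched as follows. This is a minimal illustration, not SafePrompt's actual internals: the `Verdict` shape and the convention that a stage returns `None` to defer are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class Verdict:
    safe: bool
    confidence: float
    stage: int  # which stage produced the final verdict

# Each stage returns (safe, confidence) when it reaches a high-confidence
# decision, or None to hand the prompt to the next stage.
Stage = Callable[[str], Optional[Tuple[bool, float]]]

def run_pipeline(prompt: str, stages: list[Stage]) -> Verdict:
    for i, stage in enumerate(stages, start=1):
        decision = stage(prompt)
        if decision is not None:
            safe, confidence = decision
            return Verdict(safe, confidence, i)  # early exit
    # No stage was confident: fall through as safe but low-confidence.
    return Verdict(True, 0.5, len(stages))
```

Because later stages only run when earlier ones defer, the common case (a clearly safe prompt) pays only the cost of Stage 1.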
Stage 1: Pattern Detection
A high-performance regex and pattern-matching engine scans the input for known attack signatures. This catches the majority of common attacks instantly — no AI inference required.
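A signature scan of this kind can be sketched with a handful of compiled regexes. The patterns below are illustrative only (a real engine uses a much larger, tuned signature set, and a naive pattern like `system prompt` would cause false positives in production):

```python
import re

# Illustrative attack signatures; not SafePrompt's actual rule set.
ATTACK_SIGNATURES = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"you are now (DAN|in developer mode)", re.I),
    re.compile(r"reveal (your|the) system prompt", re.I),
]

def stage1_pattern_scan(prompt: str) -> bool:
    """Return True if any known attack signature matches the input."""
    return any(pattern.search(prompt) for pattern in ATTACK_SIGNATURES)
```

Compiled regexes make the scan cheap enough to run on every request, which is what keeps Stage 1 in the sub-5ms range.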
Stage 2: External Reference Detection
Detects attempts to load external instructions by scanning for URLs, IP addresses, file paths, and encoded references. A prompt that tries to fetch external content is a strong indicator of an indirect injection attempt.
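A reference scan along these lines can be sketched with one regex per reference class. The patterns and category names below are assumptions for illustration, not SafePrompt's implementation:

```python
import re

URL_RE = re.compile(r"https?://\S+", re.I)
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
PATH_RE = re.compile(r"(?:^|\s)(?:/|\.\./|file://)\S+")
# Long base64-looking runs often hide an encoded payload or reference.
BASE64_RE = re.compile(r"\b[A-Za-z0-9+/]{40,}={0,2}\b")

def stage2_external_refs(prompt: str) -> list[str]:
    """Return the categories of external references found in the prompt."""
    hits = []
    if URL_RE.search(prompt):
        hits.append("url")
    if IP_RE.search(prompt):
        hits.append("ip_address")
    if PATH_RE.search(prompt):
        hits.append("file_path")
    if BASE64_RE.search(prompt):
        hits.append("encoded_reference")
    return hits
```

Like Stage 1, this is pure string matching, so it adds effectively no latency or cost.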
Stage 3: AI Validation — Pass 1
For prompts that pass Stages 1 and 2, a fast AI model performs semantic analysis. This catches sophisticated attacks that use natural language to disguise injection attempts — roleplay framing, hypothetical scenarios, obfuscated intent.
Stage 4: AI Validation — Pass 2
When Pass 1 returns a low-confidence verdict, a more capable AI model performs a deeper analysis. This second pass significantly reduces false positives on ambiguous inputs while maintaining high accuracy on genuine attacks.
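The two-pass escalation can be sketched as a confidence-gated fallback. Here `fast_classify` and `deep_classify` are placeholders for real model calls, and the 0.9 threshold is an assumed value, not SafePrompt's documented cutoff:

```python
CONFIDENCE_THRESHOLD = 0.9  # assumed cutoff for a "high-confidence" verdict

def ai_validate(prompt, fast_classify, deep_classify):
    """Return (safe, confidence, passes_used).

    fast_classify / deep_classify stand in for model calls that each
    return a (safe, confidence) pair.
    """
    safe, conf = fast_classify(prompt)
    if conf >= CONFIDENCE_THRESHOLD:
        return safe, conf, 1  # Pass 1 is confident: stop here
    # Low confidence: escalate to the more capable (slower, costlier) model.
    safe, conf = deep_classify(prompt)
    return safe, conf, 2
```

The expensive model only ever sees prompts the fast model was unsure about, which is how the pipeline keeps average cost low.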
Performance
| Stage | Latency | Cost | Applies to |
|---|---|---|---|
| Pattern Detection | <5ms | $0 | All requests |
| External Ref Detection | <5ms | $0 | Uncertain after Stage 1 |
| AI Pass 1 | ~50ms | Minimal | Uncertain after Stage 2 |
| AI Pass 2 | ~100ms | Higher | Low-confidence Pass 1 only |
Overall detection accuracy: above 95%. False positive rate: under 3%. Most requests resolve in Stage 1 or 2 — sub-5ms total.
What Gets Returned
Every validation returns a structured result regardless of which stage reached the verdict:
```json
{
  "safe": false,
  "confidence": 0.98,
  "threats": ["prompt_injection", "instruction_override"],
  "processingTimeMs": 4,
  "passesUsed": 1,
  "request_id": "uuid-for-audit-trail",
  "timestamp": "2026-03-19T10:00:00.000Z"
}
```

- `safe` — boolean. The only field you must act on.
- `confidence` — 0.0–1.0. Values above 0.9 are high-confidence verdicts.
- `threats` — array of detected threat categories.
- `passesUsed` — how many AI passes were used (1 or 2).
- `request_id` — unique ID for audit trail purposes.
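A consumer only needs to branch on `safe`; the remaining fields are diagnostic. A minimal handler, assuming the result arrives as the JSON string shown above:

```python
import json

def handle_validation(raw: str) -> bool:
    """Gate on the `safe` boolean; log the rest for auditing."""
    result = json.loads(raw)
    if not result["safe"]:
        # Keep request_id and threats for the audit trail.
        print(f"blocked {result['request_id']}: {', '.join(result['threats'])}")
        return False
    return True
```

Treating every field except `safe` as advisory keeps the integration robust if diagnostic fields are added later.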