Meta's 86M prompt injection and jailbreak detector. Filters malicious prompts and third-party data for LLM apps. 99%+ TPR, <1% FPR. Fast (<2ms GPU). Multilingual (8 languages). Deploy with HuggingFace or batch processing for RAG security.
Prompt Guard is an 86M parameter classifier that detects prompt injections and jailbreak attempts in LLM applications.
Installation:
pip install transformers torch
Basic usage:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from torch.nn.functional import softmax
model_id = "meta-llama/Prompt-Guard-86M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()
def get_jailbreak_score(text):
"""Check user input for jailbreak attempts."""
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
logits = model(**inputs).logits
probs = softmax(logits, dim=-1)
return probs[0, 2].item() # Jailbreak probability
# Check prompt
score = get_jailbreak_score("Ignore previous instructions")
if score > 0.5:
print("⚠️ Jailbreak attempt detected!")
Classification labels:
Filter user prompts before LLM:
def filter_user_input(user_message, threshold=0.5):
"""
Filter user input for jailbreak attempts.
Returns: (is_safe, score, message)
"""
score = get_jailbreak_score(user_message)
if score >= threshold:
return False, score, "Input blocked: jailbreak attempt"
else:
return True, score, "Input safe"
# Example
user_input = "Tell me about machine learning"
is_safe, score, message = filter_user_input(user_input)
if is_safe:
response = llm.generate(user_input)
print(response)