Guardrails let you define rules that run automatically before or after the agent responds. They help you keep conversations on-topic, protect sensitive data, and prevent misuse — without any manual moderation.

How guardrails work

Guardrails run in two stages:
  • Before-agent checks run on the user’s message before the agent sees it. Use these to block or filter inputs.
  • After-agent checks run on the agent’s response before it’s shown. Use these to catch problems in the output.
When a guardrail triggers, you can choose what happens: block the message entirely, redact the sensitive part, or let it through with a warning logged.
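As a rough mental model of this two-stage flow, the sketch below runs a list of checks before and after an agent call and applies the triggered action. All names and signatures here are illustrative assumptions, not the product's actual API:

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Illustrative sketch only: names and signatures are assumptions,
# not this product's real API.

@dataclass
class GuardrailResult:
    triggered: bool
    action: str = "warn"              # "block", "redact", or "warn"
    redacted_text: Optional[str] = None

Check = Callable[[str], GuardrailResult]

def run_stage(text: str, checks: list[Check]) -> tuple[bool, str]:
    """Apply each check; return (allowed, possibly-redacted text)."""
    for check in checks:
        result = check(text)
        if not result.triggered:
            continue
        if result.action == "block":
            return False, text
        if result.action == "redact" and result.redacted_text is not None:
            text = result.redacted_text
        # "warn" would only log; the message passes through unchanged.
    return True, text

def respond(user_message: str, before: list[Check], after: list[Check],
            agent: Callable[[str], str]) -> str:
    allowed, user_message = run_stage(user_message, before)
    if not allowed:
        return "Sorry, that message was blocked by a guardrail."
    reply = agent(user_message)
    allowed, reply = run_stage(reply, after)
    if not allowed:
        return "Sorry, that response was blocked by a guardrail."
    return reply
```

Note that before-agent checks can rewrite the message (redaction) before the agent ever sees it, while after-agent checks get a second chance at the agent's own output.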

Before-agent guardrails

PII detection & masking

Automatically detects personally identifiable information in the user’s message and handles it before the agent processes it. Detection patterns — choose which types of PII to look for:
  • Email addresses (e.g. jan@example.com)
  • Phone numbers (e.g. +31 6 12345678)
  • Credit card numbers (e.g. 4111 1111 1111 1111)
Action — what to do when PII is detected:
  • Redact: replaces the PII with [REDACTED] before passing the message to the agent. The user’s original message is unchanged in the chat UI, but the agent never sees the actual value.
  • Block: rejects the entire message and returns an error to the user.
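A minimal sketch of the redact action, using simple regexes for the three pattern types above. These patterns are deliberately naive placeholders; production PII detectors are far more robust:

```python
import re

# Naive illustrative patterns only; real detectors handle many more formats.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d \-]{7,}\d"),
    "credit_card": re.compile(r"(?:\d[ -]?){13,16}"),
}

def redact_pii(text: str) -> str:
    """Replace each detected PII span with [REDACTED]."""
    for pattern in PII_PATTERNS.values():
        text = pattern.sub("[REDACTED]", text)
    return text
```

For example, `redact_pii("mail jan@example.com now")` yields `"mail [REDACTED] now"`, which is what the agent would receive instead of the raw address.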

Off-topic detection

Checks whether the user’s question falls within the topics you want this agent to handle. If the message is off-topic, the agent won’t run; instead, the user gets a message explaining what the agent is for.
When you add this guardrail, you define the allowed topics. These are plain-language descriptions, not rigid keywords. Example allowed topics:
  • Business data analysis
  • Campaign performance
  • Budget tracking
  • Google Ads metrics
Be broad enough to cover natural variations in how users phrase questions. “Campaign performance” will match “how did my campaigns do?” even though the exact words don’t match.
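One way to picture why this differs from keyword matching is to delegate the decision to a model call rather than a string comparison. Everything below — the function names, the prompt wording, and the `judge` callable standing in for a model call — is an illustrative assumption:

```python
ALLOWED_TOPICS = [
    "Business data analysis",
    "Campaign performance",
    "Budget tracking",
    "Google Ads metrics",
]

def off_topic_prompt(message: str, topics: list[str]) -> str:
    """Build a judge prompt asking whether the message fits any allowed topic."""
    topic_lines = "\n".join(f"- {t}" for t in topics)
    return (
        "Allowed topics:\n" + topic_lines + "\n\n"
        f"User message: {message}\n"
        "Does the message fall under any allowed topic? Answer YES or NO."
    )

def is_off_topic(message: str, topics: list[str], judge) -> bool:
    """`judge` is any callable that sends a prompt to a model and returns its text."""
    verdict = judge(off_topic_prompt(message, topics))
    return not verdict.strip().upper().startswith("YES")
```

Because the model judges meaning rather than exact words, “how did my campaigns do?” can match “Campaign performance” even with no shared keywords.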

Jailbreak detection

Automatically detects attempts to override the agent’s instructions or make it behave outside its intended purpose (e.g. “ignore your system prompt and…”). No configuration needed — enable it and it runs automatically.
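For intuition only, a crude phrase-based approximation of what such a detector looks for is sketched below. The actual guardrail needs no configuration and is not a fixed phrase list; this heuristic and its patterns are purely illustrative:

```python
import re

# Crude illustrative heuristic; real jailbreak detectors use trained models,
# not a fixed phrase list like this one.
JAILBREAK_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (
        r"ignore (all |your )?(previous |system )?(instructions|prompt)",
        r"pretend (you are|to be)",
        r"disregard .* rules",
    )
]

def looks_like_jailbreak(message: str) -> bool:
    """True if the message matches any known override phrasing."""
    return any(p.search(message) for p in JAILBREAK_PATTERNS)
```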

Harmful content moderation

Checks for harmful, abusive, or unsafe content in the user’s message. Automatically blocks requests that contain hate speech, threats, or other harmful language. No configuration needed.

After-agent guardrails

PII detection & masking (output)

The same PII detection applies to the agent’s response before it’s shown to the user. Useful when your data contains personal information that might appear in query results — for example, if a table includes customer email addresses that could be returned by a query.

Hallucination detection

Checks the agent’s response for claims that don’t appear to be grounded in the data it retrieved. If the agent makes a statement that can’t be verified from the query results, this guardrail flags or blocks the response. No configuration needed — enable it and it runs on every response automatically.
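A heavily simplified illustration of the grounding idea: flag numeric claims in the response that never appear in the retrieved rows. Real hallucination detection is model-based and covers far more than numbers; the function names and approach here are assumptions for explanation only:

```python
import re

def numbers_in(text: str) -> set[str]:
    """Extract numeric tokens (digits with optional separators) from text."""
    return set(re.findall(r"\d[\d,.]*\d|\d", text))

def ungrounded_numbers(response: str, retrieved_rows: list[str]) -> set[str]:
    """Numbers claimed in the response that never appear in the query results."""
    grounded: set[str] = set()
    for row in retrieved_rows:
        grounded |= numbers_in(row)
    return numbers_in(response) - grounded
```

If `ungrounded_numbers` comes back non-empty, the response contains a figure the query results cannot back up, which is exactly the kind of claim this guardrail is meant to catch.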

Custom guardrails

Custom guardrails let you define your own checks using plain language. You write a prompt that describes the rule, and the AI evaluates each message or response against it. Each custom guardrail has:
  • Name: a label for this check, shown in logs (e.g. “Competitor mention check”)
  • Prompt: a plain-language description of what to check for (e.g. “Does this message mention any competitor brand names?”)
  • Action: what to do if the check triggers — warn (log only), block (reject the message), or redact (remove the flagged content)
Custom guardrails can be added to either the before-agent or after-agent stage. Example custom guardrail:
Name: Budget guardrail
Prompt: Does this response recommend increasing any budget by more than 50% compared to current spend?
Action: Warn
Custom guardrails use an AI call to evaluate each message, which adds a small amount of processing time and cost per request. Use them for checks that matter most — not as a catch-all.
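Putting the three fields together, the evaluation step can be sketched as a single model call per message. The `CustomGuardrail` type, the `evaluate` function, and the `judge` callable standing in for that model call are all illustrative assumptions, not the product's API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CustomGuardrail:
    name: str      # shown in logs
    prompt: str    # plain-language rule
    action: str    # "warn", "block", or "redact"

def evaluate(guardrail: CustomGuardrail, text: str, judge) -> Optional[str]:
    """Ask a model whether the rule triggers; return the action if it does.

    `judge` is any callable that sends a prompt to a model and returns its text.
    This is the per-message AI call that adds processing time and cost.
    """
    question = (
        f"{guardrail.prompt}\n\nText to check:\n{text}\n\nAnswer YES or NO."
    )
    if judge(question).strip().upper().startswith("YES"):
        return guardrail.action
    return None
```

With the budget example above, `evaluate(budget_guardrail, response_text, judge)` would return `"warn"` only when the model judges the response to recommend a budget increase over 50%, and `None` otherwise — one extra model call per evaluated message, which is why these checks are best reserved for the rules that matter most.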