Glossary
Guardrails
Safety constraints and filters that prevent AI agents from generating harmful, inappropriate, or policy-violating outputs.
What are Guardrails?
Guardrails impose boundaries on agent behavior, blocking outputs that violate safety policies, leak sensitive information, or cross ethical lines. Common implementations include content filters that detect harmful language, output validators that check for PII disclosure, behavioral constraints that prevent certain action types, and confidence thresholds that require human review for uncertain decisions.
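One of these mechanisms, an output validator that screens for PII before a response reaches the user, can be sketched as follows. This is a minimal illustration, not a production detector: the pattern names, regexes, and blocked-message format are all assumptions for the example.

```python
import re

# Illustrative PII patterns; real validators use far more robust
# detection (named-entity recognition, context-aware matching, etc.).
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def screen_output(text: str) -> list[str]:
    """Return the names of any PII patterns found in the text."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(text)]

def guard_output(text: str) -> str:
    """Pass clean text through; replace violating text with a block notice."""
    violations = screen_output(text)
    if violations:
        return f"[Blocked: output contained {', '.join(violations)}]"
    return text
```

A simple regex screen like this catches obvious leaks cheaply, which is why it is often the outermost layer even when stronger classifiers run underneath.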
Effective guardrails balance safety against utility, as overly restrictive rules may prevent legitimate use cases while weak guardrails allow harmful outputs. Guardrails operate at multiple layers: input filtering, model-level safety training, output screening, and action authorization. They must be regularly updated as new risks emerge and tested against adversarial attempts to bypass them through prompt injection or edge case exploitation.
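The layered structure described above can be sketched as a pipeline in which each layer can veto a request before the next stage runs. The blocked terms, layer functions, and return messages here are illustrative assumptions; model-level safety training is omitted because it lives inside the model itself.

```python
# Hypothetical layered guardrails: input filtering, output screening,
# and action authorization applied in sequence.
BLOCKED_TERMS = {"password dump", "bypass auth"}

def input_filter(prompt: str) -> bool:
    """Layer 1: reject prompts containing blocked phrases."""
    return not any(term in prompt.lower() for term in BLOCKED_TERMS)

def output_screen(response: str) -> bool:
    """Layer 3: reject responses that echo blocked phrases."""
    return not any(term in response.lower() for term in BLOCKED_TERMS)

def authorize_action(action: str, allowed: set[str]) -> bool:
    """Layer 4: only permit actions on an explicit allowlist."""
    return action in allowed

def run_guarded(prompt, model, action, allowed_actions):
    if not input_filter(prompt):
        return "declined: input violates policy"
    response = model(prompt)
    if not output_screen(response):
        return "declined: output violates policy"
    if not authorize_action(action, allowed_actions):
        return "declined: action not authorized"
    return response
```

Keeping each layer independent means an adversarial prompt must defeat all of them, which is the point of defense in depth against prompt injection.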
Example
A customer service agent has guardrails preventing it from processing requests to access other customers' accounts, disclosing employee personal information, making unauthorized refunds over $500, or generating content containing profanity. When a request violates these rules, the agent declines and explains the constraint.
How Signet addresses this
Signet's Security dimension heavily weights guardrail implementation and effectiveness. Agents with comprehensive, well-tested guardrails preventing harmful outputs and actions achieve significantly higher security scores than those with minimal constraints.
Build trust into your agents
Register your agents with Signet to receive a permanent identity and trust score.