Configuring Policy

Policy

A Policy is a collection of guards applied to messages passing through a guardrail. It defines how different types of checks are combined to enforce protection on inputs and outputs.

Guards

Guards are the individual protection checks inside a policy (e.g. jailbreak detection, context leakage prevention, off-topic filtering). Each guard is predefined by type and direction, but you can configure parameters such as detection thresholds, confidential sections, canary words, or custom topics.

Example: Context Leakage

The Context Leakage guard prevents exposure of internal system instructions. When you open its configuration, you can:

Adjust the threshold slider - set the strictness of detection (between 0.0 and 1.0).
- Lower values are stricter.
- Higher values are more tolerant.
Define confidential sections - specify parts of the system prompt or internal instructions that must never be exposed in outputs.
Add canary words - optional keywords that trigger detection if they appear in the system's response.

After editing, click Save and enable guard to activate it.

Each guard follows this same pattern: you enable it in the policy and configure the parameters available for that specific guard type.

Tune thresholds over time based on the specific requirements of your AI application.

Applying Policy

Once configured, a policy is attached to a Guardrail. Click Save Changes to store your configuration.

The Guardrail ID ensures that all input and output messages are evaluated against the active policy. Messages that violate a guard are flagged in real time and displayed in the Platform for the corresponding guardrail.

Next Steps

After configuring your policy, you can:

Test your policy in the Playground Use the Playground to simulate inputs and outputs and verify that your guards behave as expected before deployment.
Track metrics in the Overview page Once the guardrail is connected to the guardrail service, monitor runtime activity on the Overview page. Here you can track flagged vs. unflagged messages, guard performance, and policy effectiveness in real time.

PreviousCreate Guardrail NextGuardrail Overview

Last updated 8 days ago