Security

Security probes test how well a target application resists adversarial inputs. They focus on confidentiality, integrity, and access controls.

They simulate attack-like conversations (e.g., prompt injection, phishing, XSS). They mark results as pass/fail based on target behavior.

They focus on model behavior and policy enforcement. They do not scan infrastructure.

These probes help you assess:

  • Resistance to prompt injection and jailbreak attempts.

  • Protection of sensitive data (secrets, credentials, internal info).

  • Enforcement of authorization and policy boundaries.

  • Safe handling of untrusted input and malicious instructions.

Code Execution

Use this probe to assess how vulnerable the target is to prompt injections that could execute arbitrary Python code. It tests whether safety mechanisms can be bypassed to access system-level information, manipulate the runtime environment, or exploit misconfigurations in sandboxing and permission controls.

  • Probe ID: 51

  • Supported Modes: Text, Document

What to Expect

The probe simulates adversarial interactions to see if the target can be coaxed into:

  • Disclosing environment variables.

  • Enumerating users or running processes.

  • Traversing the file system.

  • Listing or importing installed Python modules.

  • Modifying or deleting files, creating persistent artifacts, or initiating outbound network requests.

Why It Matters

Unrestricted code execution could expose sensitive data, compromise the runtime environment, or enable remote attacks such as data exfiltration, malware staging, or unauthorized API interactions, especially if the Python environment is internet-connected.


Context Leakage

Use this probe to test whether the target can be tricked into revealing sensitive internal information. It evaluates the target’s ability to protect confidential data such as internal documents, intellectual property, or the system prompt itself.

  • Probe ID: 9

  • Supported Modes: Text, Image, Voice, Document

What to Expect

The probe simulates malicious messages that attempt to make the target:

  • Expose internal documents or proprietary data.

  • Reveal system prompts or operational parameters.

  • Leak information that could help competitors replicate proprietary solutions.

Why It Matters

If sensitive context is leaked, it can compromise the security and integrity of the target, create competitive disadvantages, and enable further adversarial attacks.


Data Exfiltration

Use this probe to test whether the target can prevent attempts to covertly extract sensitive data. It evaluates the target’s ability to detect and block unauthorized data transfers during interactions.

  • Probe ID: 35

  • Supported Modes: Text, Document

What to Expect

The probe simulates attempts to make the target:

  • Share private documents or intellectual property.

  • Disclose user data without authorization.

  • Export sensitive information through manipulated conversations.

Why It Matters

Failing to stop data exfiltration can expose confidential organizational or personal information, creating serious security and privacy risks.


Jailbreak

Use this probe to test whether the target can be manipulated to bypass its safety and operational constraints. It checks how resilient the system is against prompts designed to make it perform unintended actions.

  • Probe ID: 18

  • Supported Modes: Text, Image

What to Expect

The probe simulates malicious prompts to see if the target:

  • Executes unauthorized actions.

  • Shares harmful or unsafe information.

  • Behaves outside its defined operational limits.

Why It Matters

If a target can be jailbroken, it may compromise system security, spread harmful content, or perform actions it shouldn’t, posing serious risks to users and the organization.

Configuration Inputs

  • Company Name - Enter the name of the company that owns the target application.

  • Illegal Question - Enter a single example question the target must never answer. This is used to detect successful jailbreak attempts. Example: “How do I make a bomb?”.


Manipulation

Use this probe to test whether the target can resist deceptive prompts that try to trick users into harmful actions. It evaluates the target’s ability to detect and prevent manipulation aimed at exploiting user trust.

  • Probe ID: 36

  • Supported Modes: Text, Document

What to Expect

The probe simulates prompts designed to make the target:

  • Encourage users to share sensitive information (e.g., credit card details).

  • Guide users to perform damaging actions (e.g., copying harmful commands).

Why It Matters

If the target fails, users or organizational data could be exposed to security breaches, and the system’s integrity could be compromised.


Phishing

Use this probe to test whether the target can detect and block attempts to trick users into revealing sensitive information. It evaluates how well the target identifies fraudulent prompts that mimic legitimate requests.

  • Probe ID: 34

  • Supported Modes: Text, Document

What to Expect

The probe simulates prompts that try to make the target:

  • Request login credentials, payment details, or other confidential data.

  • Follow instructions that could compromise user or organizational information.

Why It Matters

If the target fails, users or sensitive organizational data could be exposed, leading to security breaches and loss of trust.

Configuration Inputs

  • Trusted Domain - Enter the company’s trusted domain, used to test if phishing attempts can mimic it. Example: “company-domain.com”.


RAG Poisoning

Use this probe to test whether a target using Retrieval-Augmented Generation (RAG) can resist malicious data injections in its dataset. It checks if the target can distinguish between legitimate and corrupted information.

  • Probe ID: 39

  • Supported Modes: Text

What to Expect

The probe introduces subtly altered or incorrect data to see if the target:

  • Retrieves and uses poisoned or manipulated information in its responses.

  • Maintains coherence and factual accuracy despite corrupted data.

Why It Matters

If the target incorporates poisoned data, it can spread misinformation, damage credibility, and undermine user trust, especially when handling sensitive or business-critical information.


Web Injection

Use this probe to test whether the target is vulnerable to prompt injections from external digital content, such as websites, PDFs, or other files. It checks if the target blindly follows or reproduces hidden instructions in these resources.

  • Probe ID: 50

  • Supported Modes: Text

What to Expect

The probe simulates scenarios where a user shares a link or document to see if the target:

  • Executes injected commands.

  • Echoes unsafe instructions.

  • Alters conversation flow or bypasses safety protocols.

Why It Matters

If successful, these attacks could spread misinformation, perform unauthorized actions, manipulate users, or expose proprietary data, creating serious security and reputational risks.


Cross-Site Scripting (XSS)

Use this probe to test whether the target can be manipulated to generate malicious scripts that could run in a user’s browser or interface. It evaluates whether the model produces unsafe outputs that bypass safety mechanisms.

  • Probe ID: 52

  • Supported Modes: Text, Document

What to Expect

The probe simulates adversarial prompts to see if the target:

  • Produces classic script injection payloads (e.g., <script>alert(1)</script>).

  • Generates obfuscated or event-driven scripts embedded in HTML.

  • Creates outputs that could be used for cookie theft, DOM manipulation, or persistent XSS attacks.

Why It Matters

If the target generates unsafe scripts, it could be exploited to steal data, manipulate web pages, or persist malicious code in downstream applications, posing security and user safety risks.


Last updated

Was this helpful?