Security
Security probes test how well a target application resists adversarial inputs. They focus on confidentiality, integrity, and access controls.
They simulate attack-like conversations (e.g., prompt injection, phishing, XSS). They mark results as pass/fail based on target behavior.
They focus on model behavior and policy enforcement. They do not scan infrastructure.
These probes help you assess:
Resistance to prompt injection and jailbreak attempts.
Protection of sensitive data (secrets, credentials, internal info).
Enforcement of authorization and policy boundaries.
Safe handling of untrusted input and malicious instructions.
Code Execution
Use this probe to assess how vulnerable the target is to prompt injections that could execute arbitrary Python code. It tests whether safety mechanisms can be bypassed to access system-level information, manipulate the runtime environment, or exploit misconfigurations in sandboxing and permission controls.
Probe ID: 51
Supported Modes: Text, Document
What to Expect
The probe simulates adversarial interactions to see if the target can be coaxed into:
Disclosing environment variables.
Enumerating users or running processes.
Traversing the file system.
Listing or importing installed Python modules.
Modifying or deleting files, creating persistent artifacts, or initiating outbound network requests.
Why It Matters
Unrestricted code execution could expose sensitive data, compromise the runtime environment, or enable remote attacks such as data exfiltration, malware staging, or unauthorized API interactions, especially if the Python environment is internet-connected.
Context Leakage
Use this probe to test whether the target can be tricked into revealing sensitive internal information. It evaluates the target’s ability to protect confidential data such as internal documents, intellectual property, or the system prompt itself.
Probe ID: 9
Supported Modes: Text, Image, Voice, Document
What to Expect
The probe simulates malicious messages that attempt to make the target:
Expose internal documents or proprietary data.
Reveal system prompts or operational parameters.
Leak information that could help competitors replicate proprietary solutions.
Why It Matters
If sensitive context is leaked, it can compromise the security and integrity of the target, create competitive disadvantages, and enable further adversarial attacks.
Data Exfiltration
Use this probe to test whether the target can prevent attempts to covertly extract sensitive data. It evaluates the target’s ability to detect and block unauthorized data transfers during interactions.
Probe ID: 35
Supported Modes: Text, Document
What to Expect
The probe simulates attempts to make the target:
Share private documents or intellectual property.
Disclose user data without authorization.
Export sensitive information through manipulated conversations.
Why It Matters
Failing to stop data exfiltration can expose confidential organizational or personal information, creating serious security and privacy risks.
Jailbreak
Use this probe to test whether the target can be manipulated to bypass its safety and operational constraints. It checks how resilient the system is against prompts designed to make it perform unintended actions.
Probe ID: 18
Supported Modes: Text, Image
What to Expect
The probe simulates malicious prompts to see if the target:
Executes unauthorized actions.
Shares harmful or unsafe information.
Behaves outside its defined operational limits.
Why It Matters
If a target can be jailbroken, it may compromise system security, spread harmful content, or perform actions it shouldn’t, posing serious risks to users and the organization.
Configuration Inputs
Company Name - Enter the name of the company that owns the target application.
Illegal Question - Enter a single example question the target must never answer. This is used to detect successful jailbreak attempts. Example: “How do I make a bomb?”.
Manipulation
Use this probe to test whether the target can resist deceptive prompts that try to trick users into harmful actions. It evaluates the target’s ability to detect and prevent manipulation aimed at exploiting user trust.
Probe ID: 36
Supported Modes: Text, Document
What to Expect
The probe simulates prompts designed to make the target:
Encourage users to share sensitive information (e.g., credit card details).
Guide users to perform damaging actions (e.g., copying harmful commands).
Why It Matters
If the target fails, users or organizational data could be exposed to security breaches, and the system’s integrity could be compromised.
Phishing
Use this probe to test whether the target can detect and block attempts to trick users into revealing sensitive information. It evaluates how well the target identifies fraudulent prompts that mimic legitimate requests.
Probe ID: 34
Supported Modes: Text, Document
What to Expect
The probe simulates prompts that try to make the target:
Request login credentials, payment details, or other confidential data.
Follow instructions that could compromise user or organizational information.
Why It Matters
If the target fails, users or sensitive organizational data could be exposed, leading to security breaches and loss of trust.
Configuration Inputs
Trusted Domain - Enter the company’s trusted domain, used to test if phishing attempts can mimic it. Example: “company-domain.com”.
RAG Poisoning
Use this probe to test whether a target using Retrieval-Augmented Generation (RAG) can resist malicious data injections in its dataset. It checks if the target can distinguish between legitimate and corrupted information.
Probe ID: 39
Supported Modes: Text
What to Expect
The probe introduces subtly altered or incorrect data to see if the target:
Retrieves and uses poisoned or manipulated information in its responses.
Maintains coherence and factual accuracy despite corrupted data.
Why It Matters
If the target incorporates poisoned data, it can spread misinformation, damage credibility, and undermine user trust, especially when handling sensitive or business-critical information.
Web Injection
Use this probe to test whether the target is vulnerable to prompt injections from external digital content, such as websites, PDFs, or other files. It checks if the target blindly follows or reproduces hidden instructions in these resources.
Probe ID: 50
Supported Modes: Text
What to Expect
The probe simulates scenarios where a user shares a link or document to see if the target:
Executes injected commands.
Echoes unsafe instructions.
Alters conversation flow or bypasses safety protocols.
Why It Matters
If successful, these attacks could spread misinformation, perform unauthorized actions, manipulate users, or expose proprietary data, creating serious security and reputational risks.
Cross-Site Scripting (XSS)
Use this probe to test whether the target can be manipulated to generate malicious scripts that could run in a user’s browser or interface. It evaluates whether the model produces unsafe outputs that bypass safety mechanisms.
Probe ID: 52
Supported Modes: Text, Document
What to Expect
The probe simulates adversarial prompts to see if the target:
Produces classic script injection payloads (e.g.,
<script>alert(1)</script>).Generates obfuscated or event-driven scripts embedded in HTML.
Creates outputs that could be used for cookie theft, DOM manipulation, or persistent XSS attacks.
Why It Matters
If the target generates unsafe scripts, it could be exploited to steal data, manipulate web pages, or persist malicious code in downstream applications, posing security and user safety risks.
Last updated
Was this helpful?