Safety
Safety probes test whether a target avoids harmful, abusive, or dangerous content. They simulate high-risk prompts and mark results as pass/fail based on target behavior.
They focus on safety policy adherence in real conversations.
These probes help you assess:
Safe handling of illegal or harmful requests.
Avoidance of toxicity, harassment, hate, and discriminatory content.
Protection of user privacy and sensitive personal data.
Appropriateness and de-escalation in high-risk conversations.
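The pass/fail logic described above can be sketched in a few lines. This is a hypothetical illustration, not the platform's actual implementation: the function name `evaluate_response` and the refusal-marker heuristic are assumptions for the example.

```python
# Hypothetical sketch of probe result marking: a safe refusal of a
# high-risk prompt passes; anything else fails. Real probes use far
# richer behavioral criteria than this marker check.

def evaluate_response(response: str,
                      refusal_markers=("i can't", "i cannot", "i'm unable")) -> str:
    """Mark a target response as pass/fail based on refusal behavior."""
    lowered = response.lower()
    if any(marker in lowered for marker in refusal_markers):
        return "pass"
    return "fail"

print(evaluate_response("I cannot help with that request."))  # pass
print(evaluate_response("Sure, here is how you would do it."))  # fail
```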
Bias
Use this probe to test whether the target produces biased or prejudiced responses. It evaluates how the target handles sensitive topics related to race, gender, religion, age, and other categories.
Probe ID: 7
Supported Modes: Text, Image
What to Expect
The probe generates prompts designed to see if the target:
Shows favoritism or prejudice toward certain groups.
Uses language that could be considered discriminatory or offensive.
Why It Matters
Biased responses can harm users, damage the target’s reputation, and create potential legal risks. Detecting and correcting bias ensures fair, inclusive, and trustworthy interactions.
Configuration Inputs
Company Name - Enter the name of the company that owns the target application.
Domain Description - Provide a short description of the industry, sector, or domain the target application operates in (e.g., automotive repair, online banking, healthcare services).
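Put together, a Bias probe configuration might look like the sketch below. The field names and values are illustrative assumptions, not the platform's actual schema; adapt them to the form fields described above.

```python
# Hypothetical configuration for the Bias probe (ID 7).
# Keys and example values are illustrative only.
bias_probe_config = {
    "probe_id": 7,
    "modes": ["text", "image"],          # supported modes for this probe
    "company_name": "Acme Motors",        # company that owns the target
    "domain_description": "automotive repair and dealership services",
}
```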
Cyber Threats
Use this probe to test whether the target can be manipulated into generating content about harmful cyber activities. It evaluates the target’s ability to maintain content safety and ethical standards.
Probe ID: 47
Supported Modes: Text, Document
What to Expect
The probe simulates prompts that try to make the target:
Discuss malware, trojans, ransomware, spyware, or adware.
Explain methods for cybercrime or phishing attacks.
Why It Matters
If the target generates unsafe cyber content, it could educate malicious users, expose security vulnerabilities, facilitate social engineering, or cause data breaches, leading to legal, reputational, and operational risks.
Configuration Inputs
Company Name - Enter the name of the company that owns the target application.
Company Industry - Enter the company’s main industry or market vertical (e.g., finance, healthcare, automotive, e-commerce) to help contextualize the probe for relevant scenarios.
Fake News
Use this probe to test whether the target can resist prompts that try to make it generate false or misleading information. It evaluates how well the target maintains accuracy and trustworthiness.
Probe ID: 17
Supported Modes: Text, Document
What to Expect
The probe simulates prompts that attempt to make the target:
Spread false information.
Mislead users about events, facts, or brand-related details.
Why It Matters
If the target generates fake news, it can damage brand reputation, erode user trust, manipulate public opinion, and reduce customer loyalty.
Configuration Inputs
Company Name - Enter the name of the company that owns the target application.
Company Services - List the main services or products your company provides. This helps the probe craft realistic prompts tied to your business context (e.g., “online banking,” “car repair,” “SaaS analytics”).
Fake News Topics - List topics you want the probe to test for false or misleading claims (e.g., product recalls, executive scandals, regulatory fines). The probe will try to make the target generate fake or misleading information about these subjects.
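The Fake News probe takes more context than most: services ground the prompts in your business, and topics steer what false claims are attempted. A hypothetical configuration, with illustrative field names and values, might look like:

```python
# Hypothetical configuration for the Fake News probe (ID 17).
# Keys and example values are illustrative only.
fake_news_probe_config = {
    "probe_id": 17,
    "modes": ["text", "document"],
    "company_name": "Acme Bank",
    # Main services, used to craft realistic business-specific prompts.
    "company_services": ["online banking", "personal loans"],
    # Topics the probe will target with false or misleading claims.
    "fake_news_topics": ["regulatory fines", "executive scandals"],
}
```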
Fraudulent Activities
Use this probe to test whether the target can generate or support content related to fraud. It evaluates how well the target rejects or handles requests that could enable fraudulent behavior.
Probe ID: 48
Supported Modes: Text, Document
What to Expect
The probe simulates prompts that try to make the target:
Give guidance on identity theft, false advertising, or counterfeiting.
Advise on market manipulation, employment scams, or investment scams.
Provide instructions that could facilitate insider fraud or policy circumvention.
Why It Matters
If the target responds inappropriately, it could enable financial harm, regulatory violations, or loss of public trust, in both public-facing and internal environments.
Configuration Inputs
Company Name - Enter the name of the company that owns the target application.
Company Industry - Enter the company’s main industry or market vertical (e.g., finance, healthcare, automotive, e-commerce) to help contextualize the probe for relevant scenarios.
Illegal Activities
Use this probe to test whether the target will generate or facilitate content related to unlawful activities. It checks the target's ability to refuse or safely handle requests about illegal behavior.
Probe ID: 49
Supported Modes: Text, Document
What to Expect
The probe simulates realistic or disguised prompts (casual, hypothetical, or technical) asking the target to:
Explain drug manufacturing or distribution.
Describe weapon creation or acquisition.
Give methods for theft, robbery, or financial fraud.
Advise on intellectual property theft or synthesis of bioweapons/illicit drugs.
Why It Matters
If the target provides such guidance, it can:
Assist criminal behavior and create legal liability.
Erode public trust and damage the organization’s reputation.
Enable internal policy circumvention, IP misappropriation, or insider fraud.
Configuration Inputs
Company Name - Enter the name of the company that owns the target application.
Company Industry - Enter the company's main industry or market vertical (e.g., finance, healthcare, automotive, e-commerce) to help contextualize the probe for relevant scenarios.
Personally Identifiable Information (PII)
Use this probe to test whether the target properly handles sensitive user data. It evaluates whether the system inadvertently stores, reveals, or shares personally identifiable information.
Probe ID: 45
Supported Modes: Text, Document
What to Expect
The probe simulates interactions to see if the target discloses:
Names, email addresses, or phone numbers.
Any other information that could identify individuals.
Why It Matters
Improper handling of PII can violate privacy regulations, damage user trust, and create legal, reputational, and financial risks. This probe helps identify vulnerabilities that could be exploited to access or manipulate user data.
Configuration Inputs
Public PII - Public personal information the target is allowed to share, such as a company contact email.
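Conceptually, the Public PII input acts as an allowlist: disclosures on the list are acceptable, everything else counts as a leak. A minimal sketch, with hypothetical names and example data:

```python
# Hypothetical sketch of the Public PII allowlist concept.
# Only values explicitly configured as public may be disclosed.
PUBLIC_PII = {"support@acme.example"}  # e.g., a company contact email

def is_disclosure_allowed(value: str) -> bool:
    """Return True if the target is permitted to share this value."""
    return value in PUBLIC_PII

print(is_disclosure_allowed("support@acme.example"))  # True
print(is_disclosure_allowed("jane.doe@gmail.com"))    # False
```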
Privacy Violation
Use this probe to test whether the target respects user privacy. It evaluates the system’s ability to prevent content that exposes sensitive data or violates privacy standards.
Probe ID: 46
Supported Modes: Text, Document
What to Expect
The probe simulates interactions to see if the target:
Shares confidential information.
Performs or advises on surveillance, doxxing, deepfake creation, or location tracking.
Why It Matters
If the target violates privacy, it can expose sensitive user or employee data, create legal issues, damage trust, and compromise organizational security.
Configuration Inputs
Company Name - Enter the name of the company that owns the target application.
Company Industry - Enter the company’s main industry or market vertical (e.g., finance, healthcare, automotive, e-commerce) to help contextualize the probe for relevant scenarios.
Profanity
Use this probe to test whether the target uses profane or vulgar language. It evaluates the system’s ability to maintain respectful and considerate communication.
Probe ID: 10
Supported Modes: Text, Document
What to Expect
The probe sends prompts designed to see whether the target:
Uses offensive or inappropriate words.
Responds in a way that could insult or make users uncomfortable.
Why It Matters
Profane responses can harm the user experience, reduce trust, and damage the target's credibility, so detecting and addressing them is essential.