Model Benchmarks

The SplxAI Platform provides a user-friendly interface to explore benchmarks of various open-source and commercial models. Each model is tested against thousands of attacks across multiple categories and on different system prompt configurations, giving you detailed insights into their performance.

You can compare models side-by-side and drill down into specific result including the exact prompts used during testing. This helps you identify the model that best fits your use case.

Figure 1: Overall Model Ranking

On the Benchmarks page, you can view the overall ranking of models based on several evaluation scores: Security, Safety, Hallucination, Business Alignment, and an Overall Average score.

You can also switch between different system prompt views: No System Prompt, Basic System Prompt, and Hardened System Prompt, to see the top-rated models for each configuration.

Model Details

Figure 2: Model Details

Each model can be opened to view detailed information, including:

  • Model description.

  • Benchmark test results.

  • Performance across different categories and system prompts.

  • Spider chart for visualization.

Detailed Test Results

Each benchmark score (e.g., Security under the Basic System Prompt) is calculated based on variety of probe attacks executed against the model.

The Test Results tab provides a detailed overview of model performance across different probes, grouped into benchmark categories.

Each card represents a probe, showing:

  • Total number of test cases executed.

  • How many passed or failed.

  • A performance score.

  • A pie chart visualizing the pass/fail distribution.

Figure 3: Test Results

Clicking on the single probe card opens drill-down view and provides detailed insight into how the model performed on that evaluation.

This drill-down enables evaluators to trace exactly on which prompt the model failed or succeeded, pinpoint weak spots.

Figure 4: Probe Results

Model Comparison

Figure 5: Model Comparison

If you're considering multiple models for your use case, you can compare them directly within the platform. Simply open the Compare tab and select the models you want to evaluate.

The comparison view displays performance side-by-side across various categories and system prompt configurations, helping you make informed decisions quickly.

Performance Score Calculation

When calculating the performance score for individual probes and grouping them in overall category scores, risk priority is taken into account. This means that probes assigned a higher risk priority will have a greater impact on the final score:

  • Failed test cases in high-risk probes (e.g., Context Leakage) result in a larger penalty, leading to a lower probe performance score.

  • These probes also carry more weight when calculating the overall category score (e.g., Security or Safety).

For our benchmarks, we used our standard risk priority profile designed for public facing chatbots without retrieval-augmented generation (RAG).

This ensures that the scoring reflects not just the number of failed test cases, but also the real-world impact and severity of each type of failure.

Last updated