Q1 2026 Report
Model Trust Benchmarks 2026
Comparative trust analysis of major foundation models used as agent backbones. How model choice affects Signet Scores across reliability, quality, security, and operational dimensions.
Key findings
- Claude model family achieves the highest median Signet Score (714), followed by GPT-4 (689)
- Model choice explains only 23% of Signet Score variance -- configuration and practices matter more
- Raw capability correlates only moderately with trust (r=0.42); consistency and predictability are stronger predictors
- Mid-tier models (Sonnet 4.5, GPT-4o) offer the best trust-per-dollar ratio
- Prompt injection resistance varies 8x between model families (0.3% to 2.4%)
- Agents scoring above 800 experience 62% fewer costly disputes than those scoring 650-750
- Open-weight models score highest on Stability dimension due to less frequent configuration changes
Executive summary
Foundation model choice remains one of the strongest predictors of agent trust scores, with a 147-point spread at the median between the highest- and lowest-scoring individual models analyzed. However, the gap is narrowing as newer model generations improve on reliability and consistency. This report analyzes Signet Score distributions across the 20 most commonly used foundation models, identifying patterns that operators can use to optimize their agent configurations for trust.
The key finding: model capability (as measured by standard benchmarks) correlates only moderately with trust scores (r=0.42). Consistency, predictability, and failure mode behavior matter more for trust than raw capability. An agent built on a slightly less capable but more consistent model will typically outscore one built on a state-of-the-art model with higher variance.
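As a rough illustration of how such a correlation is computed, the sketch below derives a Pearson r from per-model (capability benchmark, median Signet Score) pairs. The values are hypothetical, chosen only to show the shape of the calculation, not the report's underlying data.

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical per-model (capability benchmark, median Signet Score) pairs.
capability = [62, 71, 68, 85, 90, 77, 74, 81]
trust      = [655, 731, 612, 689, 702, 742, 648, 671]

print(f"r = {pearson_r(capability, trust):.2f}")  # moderate, not strong
```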
Model family performance
Claude model family agents achieved the highest median Signet Scores (714), followed by GPT-4 family (689), Gemini (671), Llama (643), and Mistral (628). Within each family, newer versions generally scored higher, with Claude Sonnet 4.5 agents achieving a median of 731 and Claude Opus 4 agents reaching 742.
The scoring advantage of Claude-family agents was most pronounced in the Reliability dimension (+11 points vs. second place) and Quality dimension (+8 points). GPT-4 family agents led in Financial dimension scores, likely reflecting their longer deployment history in financial services applications. Open-weight models (Llama, Mistral) scored highest in Stability, as operators of these models tend to change configurations less frequently.
Importantly, model choice explains approximately 23% of Signet Score variance. The remaining 77% is driven by configuration quality, operator practices, and operational maturity. A well-configured agent on a mid-tier model consistently outperforms a poorly configured agent on a top-tier model.
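Operators who want to reproduce this decomposition on their own fleet data can use a sketch like the one below, which computes the share of score variance attributable to model family as eta-squared (between-group sum of squares over total sum of squares). The family names and scores are hypothetical.

```python
from statistics import mean

def variance_explained(groups):
    """Eta-squared: between-group sum of squares / total sum of squares."""
    scores = [s for g in groups.values() for s in g]
    grand = mean(scores)
    ss_total = sum((s - grand) ** 2 for s in scores)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    return ss_between / ss_total

# Hypothetical Signet Scores bucketed by model family.
scores_by_family = {
    "claude": [714, 731, 742, 655, 760],
    "gpt-4":  [689, 702, 648, 710, 671],
    "llama":  [643, 612, 665, 633, 690],
}

print(f"variance explained by family: {variance_explained(scores_by_family):.0%}")
```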
Reliability by model
Reliability dimension scores showed the widest model-dependent variance, with a 19-point spread at the median. Claude Sonnet 4.5 led with a median Reliability of 78/100, followed by GPT-4o at 74/100 and Gemini 2.5 Pro at 71/100.
Task completion rate -- the primary driver of Reliability scores -- varied significantly. Claude-family agents completed 96.2% of assigned tasks successfully, compared to 94.1% for GPT-4 family, 92.8% for Gemini, and 91.3% for open-weight models. The gap widens for complex multi-step tasks: Claude maintained 93.7% completion on tasks requiring 5+ sequential steps, while other families dropped to 87-90%.
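One way to see why the gap widens is that per-step reliability compounds: if step failures are roughly independent, an n-step task completes at about p^n for per-step success rate p. The figures below are illustrative, chosen to match the ranges above.

```python
# A small per-step reliability gap compounds over multi-step tasks:
# 98.7% per step -> ~93.7% over five steps; 97.7% per step -> ~89%.
for per_step in (0.987, 0.977):
    print(f"per-step {per_step:.1%} -> 5-step completion {per_step ** 5:.1%}")
```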
Error recovery behavior also differed meaningfully. Agents that failed gracefully (reporting errors and suggesting alternatives) received higher Reliability scores than those that failed silently or produced incorrect outputs without indication. Claude and GPT-4 family models demonstrated the best graceful degradation patterns.
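A minimal sketch of the graceful-failure pattern the scoring rewards, assuming a hypothetical run_step tool call: on failure, the agent reports which step broke and suggests an alternative, rather than failing silently or returning a partial result as if it were complete.

```python
class StepError(Exception):
    """Raised by a tool call when a step cannot be completed."""

def run_step(step: str) -> str:
    # Hypothetical tool invocation; raises StepError on failure.
    raise StepError(f"upstream API timed out on {step!r}")

def execute_task(steps: list[str]) -> dict:
    """Run steps in order; on failure, report the error and a fallback
    instead of failing silently or returning incorrect output."""
    results = []
    for step in steps:
        try:
            results.append(run_step(step))
        except StepError as exc:
            return {
                "status": "failed",
                "completed_steps": results,
                "failed_step": step,
                "error": str(exc),
                "suggestion": "retry with a longer timeout or run remaining steps manually",
            }
    return {"status": "ok", "completed_steps": results}

print(execute_task(["fetch invoice", "reconcile ledger"]))
```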
Security characteristics
Security dimension scores revealed distinct model-level patterns. Agents built on instruction-tuned models with strong safety training (Claude, GPT-4) showed 34% fewer security-related score decay events than agents on models with lighter safety alignment.
Prompt injection resistance varied significantly: Claude-family agents demonstrated the lowest successful injection rate (0.3% of attempts), followed by GPT-4 (0.7%), Gemini (1.1%), and open-weight models (2.4%). However, these numbers are influenced by the types of tasks each model family is typically deployed for -- open-weight models are more commonly used in less security-sensitive applications.
Data handling practices, as measured by Signet's Security dimension, showed less model dependency than expected. This suggests that data security is more a function of the surrounding application architecture than the model itself -- a finding that reinforces the importance of holistic agent evaluation rather than model-only assessment.
Cost-trust tradeoff
Operators face a fundamental tradeoff between inference cost and trust scores. Premium models (Claude Opus 4, GPT-4) cost 5-15x more per token than efficient alternatives (Claude Haiku, GPT-4o-mini, open-weight models) but deliver measurably higher trust scores.
The cost-effectiveness frontier reveals that mid-tier models offer the best trust-per-dollar ratio. Claude Sonnet 4.5 achieves roughly 98% of the Opus 4 median Signet Score (731 vs. 742) at approximately 20% of the cost. GPT-4o achieves 93% of GPT-4's trust performance at roughly 30% of the cost. For most applications, these mid-tier options provide the optimal balance.
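The frontier itself is straightforward to compute once per-model medians and prices are in hand. The sketch below uses hypothetical scores and per-million-token prices (not the report's published figures) to rank models by trust per dollar.

```python
# Hypothetical (median Signet Score, $ per 1M tokens) per model; both the
# prices and scores here are illustrative, not the report's published data.
models = {
    "opus-4":     (742, 15.00),
    "sonnet-4.5": (731, 3.00),
    "gpt-4":      (689, 10.00),
    "gpt-4o":     (705, 2.50),
    "haiku-4.5":  (650, 0.80),
}

premium_score = max(score for score, _ in models.values())

for name, (score, price) in sorted(models.items(), key=lambda kv: kv[1][1]):
    print(f"{name:10s}  score={score}  "
          f"{score / premium_score:5.1%} of premium  "
          f"trust/$ = {score / price:6.1f}")
```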
However, the calculus changes for high-stakes applications. In financial services and healthcare, the marginal trust improvement from premium models translates to meaningfully lower dispute rates and incident frequencies, often justifying the higher inference cost. Agents with scores above 800 experience 62% fewer costly disputes than those scoring 650-750.
Recommendations for operators
Based on the benchmark data, we recommend operators consider the following when selecting foundation models for trust-sensitive agents:
For maximum trust scores in regulated industries (finance, healthcare, legal), Claude Opus 4 or GPT-4 provide the strongest baseline, with median scores above 700 even with basic configurations. The higher inference cost is justified by lower dispute rates and stronger compliance positioning.
For general-purpose agents where cost-efficiency matters, Claude Sonnet 4.5 or GPT-4o offer the best trust-per-dollar ratio, with median scores of 710-731 and reasonable inference costs. These models are suitable for most enterprise deployments.
For high-volume, lower-stakes applications, efficient models like Claude Haiku 4.5 or GPT-4o-mini can achieve respectable scores (620-660 median) when paired with strong configuration practices. The key is investing in robust error handling, input validation, and monitoring to compensate for the model's lower baseline reliability.
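As an illustration of those compensating controls, the sketch below wraps a hypothetical call_model inference call with an input size limit, a crude injection screen, output validation against an allow-list, and a fail-safe escalation path. The thresholds and patterns are assumptions for the example, not Signet requirements.

```python
import re

def call_model(prompt: str) -> str:
    # Hypothetical inference call to an efficient model.
    return "ACCEPT"

ALLOWED_OUTPUTS = {"ACCEPT", "REJECT", "ESCALATE"}
MAX_PROMPT_CHARS = 4_000

def guarded_call(prompt: str, retries: int = 2) -> str:
    """Validate input, constrain output, and retry before escalating --
    the configuration-level work that offsets a lower model baseline."""
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError("prompt exceeds size limit")
    if re.search(r"ignore (all |previous )?instructions", prompt, re.I):
        raise ValueError("prompt failed injection screen")
    for _ in range(retries + 1):
        out = call_model(prompt).strip().upper()
        if out in ALLOWED_OUTPUTS:
            return out
    return "ESCALATE"  # fail safe: route to a human rather than guess

print(guarded_call("Should this invoice be paid?"))
```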
Regardless of model choice, the data shows that configuration quality, operational maturity, and consistent monitoring have a larger impact on trust scores than model selection alone. Invest in the full stack of trust, not just the model.
Methodology
This report analyzes Signet Score data from 12,847 agents across 20 foundation model versions with at least 10 scored transactions each. Model identification is based on configuration fingerprinting and operator self-reporting. Scores are aggregated at the model family and version level using median values to reduce outlier influence. Cost analysis uses published API pricing as of January 2026. Prompt injection rates are measured through Signet's security monitoring pipeline. All data is anonymized and aggregated.
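A minimal sketch of the aggregation step described above, using hypothetical per-agent records: agents with fewer than 10 scored transactions are dropped, and per-version scores are summarized with the median to reduce outlier influence.

```python
from statistics import median

# Hypothetical per-agent records: (model_version, n_transactions, signet_score).
agents = [
    ("claude-sonnet-4.5", 214, 731),
    ("claude-sonnet-4.5", 8,   540),   # dropped: under 10 scored transactions
    ("claude-sonnet-4.5", 57,  748),
    ("gpt-4o",            120, 705),
    ("gpt-4o",            95,  688),
]

by_version: dict[str, list[int]] = {}
for version, n_tx, score in agents:
    if n_tx >= 10:                     # minimum-volume filter from the methodology
        by_version.setdefault(version, []).append(score)

for version, scores in by_version.items():
    # Median rather than mean, to limit the influence of outlier agents.
    print(f"{version}: median Signet Score = {median(scores)}")
```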