Search-Based Testing (SBT) and the Curse of Dimensionality
Validating autonomous vehicles is no longer a mileage game. The relevant question isn't "How many kilometers did we drive?" but "How systematically did we explore the scenarios where the AV is most likely to fail?"
📚 Series Context: In Part 1, we discussed why traditional mileage-based validation breaks at Level 3/4—requiring billions of equivalent kilometers to even approximate statistical confidence. This article introduces the first practical building block of the solution: Search-Based Testing (SBT). Where mileage-based validation tries to catch failures by accumulating exposure, SBT tries to hunt them by intelligently navigating huge logical scenario spaces.
TL;DR: Why This Matters
The Problem: A simple 4-car intersection spans 10¹¹ concrete scenarios, roughly 31.7 years of simulation to test exhaustively. The Solution: Search-Based Testing finds critical failures using intelligent sampling.
Business Value:
- Compresses simulation cost by 10×–1000×, with gains compounding in higher-dimensional spaces
- Finds critical edge cases faster, accelerating validation cycles
- Generates traceable safety evidence aligned with ISO 21448 / UL 4600
- Reduces cloud infrastructure costs when scaling large simulation clusters
- Supports continuous V&V loops instead of batch milestone validation
For organizations trying to ship Level 3/4 products under quarterly release pressure, SBT isn't an optional optimization—it's a lever that reduces both compute and calendar time.
Scenario Hierarchy and ODD Context
Within an Operational Design Domain (ODD), scenarios exist at different abstraction levels—from human-readable descriptions (abstract) to parameter ranges (logical) to executable simulations (concrete). The intersection example in this article is one logical scenario within its ODD.
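To make the logical level tangible, here is one way such a scenario could be encoded in Python. This is a minimal sketch: the parameter names mirror the intersection example below, and all ranges are illustrative assumptions rather than values from a real ODD.

```python
from dataclasses import dataclass

@dataclass
class ParameterRange:
    """One continuous dimension of a logical scenario."""
    name: str
    low: float
    high: float
    unit: str

# Hypothetical encoding: the abstract description becomes explicit
# parameter ranges, and every point inside them is one concrete,
# executable scenario. All bounds here are illustrative.
LOGICAL_SCENARIO = [
    ParameterRange("ego.v0", 0.0, 15.0, "m/s"),
    ParameterRange("ego.p0", -50.0, -10.0, "m"),
] + [
    pr
    for i in (1, 2, 3)
    for pr in (ParameterRange(f"veh{i}.v0", 0.0, 15.0, "m/s"),
               ParameterRange(f"veh{i}.p0", -50.0, -10.0, "m"),
               ParameterRange(f"veh{i}.a0", -3.0, 3.0, "m/s^2"))
]
assert len(LOGICAL_SCENARIO) == 11  # the 11D space discussed below
```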
The Curse of Dimensionality: 4-Car Intersection Example
Let’s look at a standard unsignalized intersection with four vehicles (ego + three traffic participants). The ego vehicle's acceleration is left to the planner under test, so it is not a search parameter, but we vary the ego's initial state. This gives us:
- Ego Vehicle: Initial velocity (v0) and Initial position (p0) [2 parameters]
- 3 Traffic Participants: Initial velocity (v0), Initial position (p0), and Acceleration (a0) each [3 × 3 = 9 parameters]
That gives us an 11-dimensional space to cover. With a coarse discretization of 10 steps per parameter, brute-force simulation would require:
10¹¹ = 100,000,000,000 concrete scenarios
And this toy setup is aggressively simplified; many real-world factors are not modeled at all. Yet even this minimal scenario already produces a continuous parameter space large enough that naive brute-force sweeps are computationally infeasible.
Assuming an optimistic simulation engine running 100 simulations per second, the runtime would be:
100,000,000,000 scenarios / 100 sims/sec = 1,000,000,000 sec ≈ 31.7 years!
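The arithmetic behind those numbers fits in a few lines of Python:

```python
# Back-of-the-envelope cost of a full grid sweep over the 11D space.
# Numbers mirror the example above: 10 steps per parameter, 100 sims/sec.

STEPS_PER_PARAM = 10
NUM_PARAMS = 11          # 2 ego + 3 x 3 traffic-participant parameters
SIMS_PER_SECOND = 100    # optimistic simulation throughput

total_scenarios = STEPS_PER_PARAM ** NUM_PARAMS
runtime_seconds = total_scenarios / SIMS_PER_SECOND
runtime_years = runtime_seconds / (60 * 60 * 24 * 365)

print(f"{total_scenarios:,} scenarios -> {runtime_years:.1f} years")
# 100,000,000,000 scenarios -> 31.7 years
```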
"Most scenario space is boring."
The interesting failures live in tiny subregions. This makes brute-force both expensive and ineffective.
The Stakes: Why This Decision Matters
Without SBT
- 31-year validation cycles for single scenarios
- Unpredictable cloud costs scaling exponentially
- Weak regulatory arguments based on mileage alone
- Missed critical failures in unsampled regions
- Batch milestone testing delaying release cycles
With SBT
- 10×–1000× cost compression in high dimensions
- Predictable compute budgets with measurable ROI
- ISO 21448 / UL 4600-ready evidence with traceability
- Targeted critical scenario discovery by design
- Continuous V&V integration in quarterly sprints
What SBT Does Differently
This is where most teams get stuck: they understand the problem but don't know how to escape brute-force thinking.
Running every single variation is a massive waste of resources on uninteresting scenarios where the AV performs well. The critical events—collisions, near-misses, and edge cases—occupy only small subregions of the parameter space. Search-Based Testing reframes scenario evaluation as an optimization problem:
Instead of evaluating everything, evaluate only what is likely to matter.
To make that work, SBT needs two ingredients:
- A KPI (what "interesting" means)
- A Search Strategy (how we navigate the space)
Example KPI for our intersection: minimum bounding-box distance between vehicles during the crossing. The optimization then tries to minimize that distance—surfacing near-misses and collisions.
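As a minimal sketch, such a KPI could look as follows, assuming trajectories arrive as arrays of vehicle center positions. Axis-aligned boxes keep the code short; production implementations would use oriented boxes that follow vehicle heading.

```python
import numpy as np

def box_distance(c1, c2, half1=(2.4, 1.0), half2=(2.4, 1.0)):
    """Distance between two axis-aligned bounding boxes given their centers.

    Simplification: real implementations use *oriented* boxes that
    rotate with vehicle heading; half-extents here are illustrative.
    """
    dx = max(0.0, abs(c1[0] - c2[0]) - (half1[0] + half2[0]))
    dy = max(0.0, abs(c1[1] - c2[1]) - (half1[1] + half2[1]))
    return float(np.hypot(dx, dy))

def min_distance_kpi(ego_traj, other_trajs):
    """KPI: minimum bounding-box distance over the whole crossing.

    ego_traj: (T, 2) array of ego center positions
    other_trajs: list of (T, 2) arrays, one per traffic participant
    Returns 0.0 on overlap (collision) and stays continuous elsewhere,
    which gives the search a gradient to follow toward near-misses.
    """
    return min(
        box_distance(ego_traj[t], traj[t])
        for traj in other_trajs
        for t in range(len(ego_traj))
    )
```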
Business context: Simulation is one of the most expensive components of AV validation infrastructure—both in compute and wall-clock time. Running 1,000+ cloud cores isn't cheap. Every 10× improvement in sampling efficiency translates directly to fewer GPU/CPU hours, reduced spot-instance burn, lower scheduling latency, and faster validation cycles. For teams measured in quarters, not decades, these differences are existential.
How Search-Based Testing Works
SBT uses genetic algorithms, Bayesian optimization, or surrogate models to intelligently explore the scenario space. Here's the typical workflow:
Iterative Process of SBT:
- Coarse Sampling: Sample initial points across the logical scenario space.
- KPI Evaluation: Run simulations and compute the KPI for each scenario.
- Surrogate Model Training: Build a fast approximation (e.g., Gaussian Process, neural network) of the KPI function based on evaluated samples. This model predicts KPI values without running expensive simulations.
- Region Refinement: Use the surrogate model to identify promising regions and focus subsequent samples where the KPI indicates potential critical events.
- Continuous Improvement: As new samples are evaluated, the surrogate model is continuously retrained and refined, improving prediction accuracy in critical regions while maintaining computational efficiency.
- Repeat: Iterate between surrogate updates and targeted sampling until convergence or the computational budget is exhausted (a minimal sketch of this loop follows below).
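Here is a minimal sketch of that loop, using scikit-learn's Gaussian Process as the surrogate. `run_simulation` is a hypothetical stand-in for the expensive simulator, and the toy KPI surface, budget, and exploration weight are illustrative assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def run_simulation(params):
    """Placeholder for the expensive simulator; returns the KPI
    (e.g., minimum bounding-box distance). Hypothetical stand-in."""
    return float(np.linalg.norm(params - 0.7))  # toy KPI surface

rng = np.random.default_rng(0)
dim, budget, n_init, n_candidates = 11, 200, 30, 5000

# Steps 1-2: coarse sampling + KPI evaluation
X = rng.uniform(0, 1, size=(n_init, dim))
y = np.array([run_simulation(x) for x in X])

for _ in range(budget - n_init):
    # Step 3: train/refresh the surrogate on everything evaluated so far
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)

    # Step 4: score cheap candidates; favor low predicted KPI (critical)
    # minus an exploration bonus from the model's own uncertainty
    cand = rng.uniform(0, 1, size=(n_candidates, dim))
    mean, std = gp.predict(cand, return_std=True)
    best = cand[np.argmin(mean - 1.0 * std)]

    # Steps 5-6: run the one promising scenario for real and iterate
    X = np.vstack([X, best])
    y = np.append(y, run_simulation(best))

print(f"Most critical scenario found: KPI = {y.min():.3f}")
```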
Surrogate models are the secret weapon of production SBT: instead of running expensive simulations for every candidate scenario, the search algorithm queries a fast approximation to eliminate obviously uninteresting regions. Only the most promising candidates get full simulation treatment, potentially reducing the number of expensive evaluations by multiple orders of magnitude.
That said, building a good surrogate is easier said than done—getting the training set right is half the battle.
Key Insight: Surrogate models are the efficiency multiplier—they eliminate obviously uninteresting regions without running expensive simulations. The better your surrogate, the fewer evaluations you need.
(Figure: visual comparison of SBT vs. full grid sampling.)
Try It Yourself: Interactive Demonstration
In the following simplified intersection with two vehicles, you can experiment with initial velocities using grid resolutions from 20–80 steps. The simulation uses realistic bounding box collision detection. Compare two strategies side-by-side:
- Full grid sweep (exhaustive brute-force sampling)
- SBT refinement (adaptive sampling guided by the KPI)
Interactive Lab: Brute Force vs SBT
Implementation Note: This demo uses adaptive binary refinement in a 2D space. Production implementations leverage Bayesian optimization, surrogate models, and genetic algorithms to achieve orders of magnitude better performance in high-dimensional spaces. The key insight: efficiency gains compound exponentially with dimensionality.
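The demo's exact code isn't reproduced here, but the following quadtree-style sketch illustrates adaptive refinement in 2D: cells whose KPI looks boring are abandoned after one evaluation, while interesting cells are subdivided. The KPI function, depth limit, and threshold are illustrative assumptions.

```python
import numpy as np

def adaptive_refine(kpi, lo, hi, depth=0, max_depth=6, threshold=1.5):
    """Adaptive refinement in 2D: evaluate the cell center, subdivide
    only cells whose KPI suggests a critical region nearby.

    kpi: function (x, y) -> float, lower = more critical
    lo, hi: opposite cell corners as (x, y) tuples
    Returns a list of (x, y, kpi_value) evaluations.
    """
    cx, cy = (lo[0] + hi[0]) / 2, (lo[1] + hi[1]) / 2
    value = kpi(cx, cy)
    samples = [(cx, cy, value)]
    if depth < max_depth and value < threshold:
        for sub_lo, sub_hi in [(lo, (cx, cy)),
                               ((cx, lo[1]), (hi[0], cy)),
                               ((lo[0], cy), (cx, hi[1])),
                               ((cx, cy), hi)]:
            samples += adaptive_refine(kpi, sub_lo, sub_hi,
                                       depth + 1, max_depth, threshold)
    return samples

# Toy KPI: distance to a critical point at (0.8, 0.3); the sweep
# concentrates evaluations there instead of covering a full grid.
samples = adaptive_refine(lambda x, y: 3 * np.hypot(x - 0.8, y - 0.3),
                          (0.0, 0.0), (1.0, 1.0))
print(f"{len(samples)} adaptive evaluations vs 4096 for a 64 x 64 grid")
```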
The KPI: The Compass That Guides the Search
SBT is only as good as the KPI driving it. Bad KPIs lead the search to optimize the wrong thing.
Common Failure Modes
- Binary crash flag (discontinuous → no gradient to follow)
- Final separation distance (ignores temporal dynamics)
- TTC-only metrics (can be gamed by avoidance trajectories)
Better KPIs blend multiple dimensions and remain continuous near safety boundaries:
- Min bounding-box distance + Time-to-Collision (TTC)
- Integrated risk over trajectory
- RSS-based gap compliance for merges
- Stopping margin for pedestrians
Multi-objective KPIs are common in production:
K = w₁ · safety + w₂ · comfort + w₃ · regulation
This ensures the search doesn't produce "critical but illegal or absurd" trajectories.
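A sketch of how such a weighted KPI might be assembled follows. The normalization bounds and weights are illustrative assumptions, and the sign conventions assume a search that minimizes K.

```python
def normalize(value, lo, hi):
    """Clamp-and-scale a raw metric to [0, 1] so objectives are comparable."""
    return min(max((value - lo) / (hi - lo), 0.0), 1.0)

def combined_kpi(min_distance_m, peak_jerk_ms3, rule_violations,
                 w=(0.6, 0.2, 0.2)):
    """K = w1*safety + w2*comfort + w3*regulation; the search minimizes K.

    The safety term rewards criticality (small distances lower K), while
    the comfort and regulation terms *penalize* absurd trajectories, so
    the minimizer cannot cheat its way to "critical" scenarios that no
    real traffic participant would produce. All bounds and weights here
    are illustrative assumptions.
    """
    safety = normalize(min_distance_m, 0.0, 10.0)      # closer -> lower K
    comfort = normalize(peak_jerk_ms3, 0.0, 10.0)      # wild jerk -> higher K
    regulation = normalize(rule_violations, 0.0, 5.0)  # violations -> higher K
    return w[0] * safety + w[1] * comfort + w[2] * regulation
```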
KPI Design Principles: What Works and What Fails
The most common KPI failure isn't mathematical—it's choosing metrics that optimize for something you don't actually care about.
Monotonicity and Gradients:
- ❌ Poor Choice: Binary crash/no-crash flag (discontinuous). The search algorithm has no gradient to follow near the safety boundary—it's blindly guessing.
- ✅ Better Choice: Minimum distance during the trajectory (continuous). Provides a smooth gradient that guides the search toward critical boundaries, even from safe initial conditions (see the toy comparison below).
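A toy illustration of the difference, evaluating both KPI choices along a 1D slice of the scenario space; the collision point and distance model are made up for the example.

```python
import numpy as np

# Two KPIs evaluated along a 1D slice of the scenario space: the ego's
# initial velocity. min_distance is a toy stand-in for the simulator.
v0 = np.linspace(5.0, 20.0, 16)
min_distance = np.abs(v0 - 13.7) * 0.8            # collision near v0 = 13.7 m/s

binary_kpi = (min_distance <= 0.5).astype(float)  # crash flag
continuous_kpi = min_distance                      # slopes toward the boundary

# The binary KPI is 0 for all but one sample: a search sees no signal
# until it lands on the failure by luck. The continuous KPI decreases
# monotonically toward the critical region from either side.
print(np.round(binary_kpi, 1))
print(np.round(continuous_kpi, 1))
```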
Key Insight: Bad KPIs lead to missed failures. Your KPI must be continuous (gradients to follow), resistant to exploitation (no gaming), and capture temporal dynamics (not just snapshots).
Other Critical Considerations:
Effective KPIs must be resistant to exploitation: a TTC-only metric can be gamed by trajectories that avoid the intersection entirely. Teams often don't discover this until weeks into a test campaign, when they realize their "critical scenarios" are just clever evasions rather than genuine failures. KPIs should also capture temporal dynamics over trajectories, not static snapshots, and can leverage formal frameworks like RSS for regulation-aligned metrics.
| Scenario Type | Weak KPI | Strong KPI |
|---|---|---|
| Intersection | Binary crash flag | Min bounding-box distance + TTC |
| Lane Change | Lateral offset only | Time-integrated TTC + jerk comfort |
| Pedestrian Crossing | Final separation distance | Max deceleration + stopping margin |
| Highway Merge | Closest approach | RSS gap compliance + merge completion |
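For the highway-merge row, an RSS-style gap KPI might look like the sketch below. It uses the standard RSS longitudinal safe-distance formula (Shalev-Shwartz et al., 2017); the response time and acceleration bounds are illustrative assumptions, and the constant-speed simplification would not hold in production code.

```python
def rss_min_gap(v_rear, v_front, rho=1.0,
                a_accel_max=3.0, a_brake_min=4.0, a_brake_max=8.0):
    """RSS minimum safe longitudinal gap between a rear and front vehicle.

    v_rear, v_front: speeds in m/s; rho: response time in s.
    Parameter values are illustrative, not calibrated.
    """
    d = (v_rear * rho
         + 0.5 * a_accel_max * rho ** 2
         + (v_rear + rho * a_accel_max) ** 2 / (2 * a_brake_min)
         - v_front ** 2 / (2 * a_brake_max))
    return max(0.0, d)

def rss_gap_compliance_kpi(gaps_m, v_rear, v_front):
    """KPI for a merge: worst-case ratio of actual gap to RSS-required gap.

    Values below 1.0 indicate an RSS violation during the maneuver.
    Simplification: assumes constant speeds over the maneuver.
    """
    required = rss_min_gap(v_rear, v_front)
    return min(g / required for g in gaps_m) if required > 0 else float("inf")
```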
Limitations and Trade-offs
SBT alone doesn't solve validation. It is a tool with real limitations that teams must understand before betting their safety case on it, and it introduces trade-offs:
- May converge to local optima → multiple runs needed
- Search coverage is KPI-dependent
- Surrogate quality limits sensitivity
- False negatives are possible
- No universal stopping criterion
The hardest part? Knowing when you've sampled enough.
"It's part science, part judgment, part organizational risk tolerance."
This is why SBT must sit inside a larger validation loop rather than acting as a standalone technique.
Regulatory note: Standards like ISO 21448 (SOTIF) and UL 4600 require demonstrating that relevant scenario space has been explored and that evidence is traceable. SBT provides auditable sampling logic, reproducible scenario selection, coverage arguments, KPI rationale, and failure mode documentation. This enables stronger safety cases than "we ran X million kilometers."
Implementation note: SBT isn't a replacement for existing simulation platforms—it's an orchestration layer that sits on top of your existing validation infrastructure, intelligently selecting which scenarios to test.
Position in the Validation Pipeline
SBT solves only one piece of the puzzle:
Efficient sampling within a single logical scenario
It does not solve:
- Scenario generation
- Scenario prioritization across the ODD
- ODD coverage reasoning
- Real-drive data integration
- Regulatory safety argumentation
Those are the topics of the next articles in this series.
Executive Summary
The Problem:
A simple 4-car intersection produces 10¹¹ scenarios (an 11D parameter space). Brute-force simulation at 100 sims/sec would take roughly 31.7 years. Most scenario space is "boring"; failures occupy tiny subregions.
The Solution — Search-Based Testing (SBT):
Reframes validation as an optimization problem: focus on critical scenarios using KPIs (e.g., min distance, TTC) and surrogate models to predict outcomes without full simulation. Achieves 10×–1000× cost compression, with larger gains in higher-dimensional spaces.
Business Value:
- Fewer compute hours → reduced cloud costs
- Faster iteration → quarterly validation sprints
- ISO 21448 / UL 4600-ready safety evidence
- Continuous integration in V&V pipelines
Critical Success Factors:
- KPI design: continuous, resistant to exploitation, captures temporal dynamics
- Surrogate model quality: drives sensitivity in critical regions
- Multi-objective balance: safety + comfort + regulation
- Judgment: knowing when sampling is sufficient
Limitations:
May converge to local optima, KPI-dependent, false negatives possible. SBT solves efficient sampling within a single logical scenario, not scenario generation, ODD coverage, or real-drive integration.
"If mileage-based validation was about exposure, scenario-based validation is about intelligent coverage — and SBT is how we get there."
Kaveh Rahnema
V&V Expert for ADAS & Autonomous Driving with 7+ years at Robert Bosch GmbH.