
Search-Based Testing (SBT) and the Curse of Dimensionality

Validating autonomous vehicles is no longer a mileage game. The relevant question isn't "How many kilometers did we drive?" but "How systematically did we explore the scenarios where the AV is most likely to fail?"

📚 Series Context: In Part 1, we discussed why traditional mileage-based validation breaks at Level 3/4—requiring billions of equivalent kilometers to even approximate statistical confidence. This article introduces the first practical building block of the solution: Search-Based Testing (SBT). Where mileage-based validation tries to catch failures by accumulating exposure, SBT tries to hunt them by intelligently navigating huge logical scenario spaces.

TL;DR: Why This Matters

The Problem: A simple 4-car intersection has 10¹¹ scenarios, or 31 years of exhaustive testing. The Solution: Search-Based Testing finds critical failures using intelligent sampling.

Business Value:

  • Compresses simulation cost by 10×–1000×, with gains compounding in higher-dimensional spaces
  • Finds critical edge cases faster, accelerating validation cycles
  • Generates traceable safety evidence aligned with ISO 21448 / UL 4600
  • Reduces cloud infrastructure costs when scaling large simulation clusters
  • Supports continuous V&V loops instead of batch milestone validation

For organizations trying to ship Level 3/4 products under quarterly release pressure, SBT isn't an optional optimization—it's a lever that reduces both compute and calendar time.

Scenario Hierarchy and ODD Context

Within an Operational Design Domain (ODD), scenarios exist at different abstraction levels—from human-readable descriptions (abstract) to parameter ranges (logical) to executable simulations (concrete). The intersection example in this article is one logical scenario within its ODD.

The Curse of Dimensionality: 4-Car Intersection Example

Let’s look at a standard unsignalized intersection with four vehicles (ego + three traffic participants). We leave the ego vehicle's acceleration out of the parameter set, since the planner controls it, but vary its initial state. This gives us:

  • Ego Vehicle: Initial velocity (v0) and Initial position (p0) [2 parameters]
  • 3 Traffic Participants: Initial velocity (v0), Initial position (p0), and Acceleration (a0) each [3 × 3 = 9 parameters]

That gives us an 11-dimensional space to cover. With a coarse discretization of 10 steps per parameter, brute-force simulation would require:

10¹¹ = 100,000,000,000 concrete scenarios

And this toy setup is aggressively simplified. We have not modeled:

  • steering
  • turn maneuvers
  • driver models
  • sensor perception
  • priority rules
  • traffic lights
  • occlusions
  • friction / weather
  • vehicle dynamics
  • pedestrians
  • and many more...

Yet even this minimal scenario already produces a continuous parameter space large enough that naive brute-force sweeps are computationally infeasible.

Assuming an optimistic simulation engine running 100 simulations per second, the runtime would be:

100,000,000,000 scenarios / 100 sims/sec = 1,000,000,000 sec ≈ 31.7 years!
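
The back-of-envelope arithmetic is easy to verify in a few lines of Python (the 100 sims/sec throughput is the optimistic assumption from above):

```python
# Back-of-envelope cost of the brute-force sweep
params = 11            # dimensions of the logical scenario space
steps = 10             # coarse discretization per parameter
sims_per_sec = 100     # optimistic simulator throughput (assumed)

scenarios = steps ** params                 # 100,000,000,000
seconds = scenarios / sims_per_sec          # 1,000,000,000 s
years = seconds / (60 * 60 * 24 * 365.25)   # ~31.7 years

print(f"{scenarios:,} scenarios -> {years:.1f} years")
```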

"Most scenario space is boring."

The interesting failures live in tiny subregions. This makes brute-force both expensive and ineffective.

The Stakes: Why This Decision Matters

Without SBT

  • 31-year validation cycles for single scenarios
  • Unpredictable cloud costs scaling exponentially
  • Weak regulatory arguments based on mileage alone
  • Missed critical failures in unsampled regions
  • Batch milestone testing delaying release cycles

With SBT

  • 10×–1000× cost compression, growing with dimensionality
  • Predictable compute budgets with measurable ROI
  • ISO 21448 / UL 4600-ready evidence with traceability
  • Targeted critical scenario discovery by design
  • Continuous V&V integration in quarterly sprints

What SBT Does Differently

This is where most teams get stuck: they understand the problem but don't know how to escape brute-force thinking.

Running every single variation is a massive waste of resources on uninteresting scenarios where the AV performs well. The critical events—collisions, near-misses, and edge cases—occupy only small subregions of the parameter space. Search-Based Testing reframes scenario evaluation as an optimization problem:

Instead of evaluating everything, evaluate only what is likely to matter.

To make that work, SBT needs two ingredients:

  1. A KPI (what "interesting" means)
  2. A Search Strategy (how we navigate the space)

Example KPI for our intersection: minimum bounding-box distance between vehicles during the crossing. The optimization then tries to minimize that distance—surfacing near-misses and collisions.
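
A minimal sketch of this KPI, assuming axis-aligned bounding boxes and time-aligned trajectories (the function names and the axis-aligned simplification are ours; a simulator would provide oriented-box geometry):

```python
import math

def aabb_distance(box_a, box_b):
    """Minimum distance between two axis-aligned bounding boxes.

    Boxes are (x_min, y_min, x_max, y_max); 0.0 means overlap, i.e. collision.
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    dx = max(bx0 - ax1, ax0 - bx1, 0.0)  # horizontal gap (0 if overlapping)
    dy = max(by0 - ay1, ay0 - by1, 0.0)  # vertical gap
    return math.hypot(dx, dy)

def min_distance_kpi(traj_ego, traj_other):
    """KPI: minimum bounding-box distance over the whole crossing.

    Trajectories are time-aligned sequences of boxes; lower is more critical.
    """
    return min(aabb_distance(a, b) for a, b in zip(traj_ego, traj_other))
```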

Business context: Simulation is one of the most expensive components of AV validation infrastructure—both in compute and wall-clock time. Running 1,000+ cloud cores isn't cheap. Every 10× improvement in sampling efficiency translates directly to fewer GPU/CPU hours, reduced spot-instance burn, lower scheduling latency, and faster validation cycles. For teams measured in quarters, not decades, these differences are existential.

How Search-Based Testing Works

SBT uses genetic algorithms, Bayesian optimization, or surrogate models to intelligently explore the scenario space. Here's the typical workflow:

Iterative Process of SBT (a code sketch follows the list):

  1. Coarse Sampling: Sample initial points across the logical scenario space.
  2. KPI Evaluation: Run simulations and compute the KPI for each scenario.
  3. Surrogate Model Training: Build a fast approximation (e.g., Gaussian Process, neural network) of the KPI function based on evaluated samples. This model predicts KPI values without running expensive simulations.
  4. Region Refinement: Use the surrogate model to identify promising regions and focus subsequent samples where the KPI indicates potential critical events.
  5. Continuous Improvement: As new samples are evaluated, the surrogate model is continuously retrained and refined, improving prediction accuracy in critical regions while maintaining computational efficiency.
  6. Repeat: Iterate between surrogate updates and targeted sampling until convergence or computational budget is exhausted.
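
In code, the loop is compact. Here is a minimal sketch using a Gaussian Process surrogate from scikit-learn; `run_simulation` is a stand-in for your expensive simulator, and the toy KPI, unit-cube bounds, and budgets are illustrative assumptions, not a production recipe:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(42)
DIM = 11  # parameters of the intersection scenario, normalized to [0, 1]

def run_simulation(x):
    # Stand-in for the expensive simulator: returns the KPI for
    # scenario parameters x (a toy function with its minimum near 0.7).
    return float(np.linalg.norm(x - 0.7))

# Step 1: coarse sampling of the logical scenario space
X = rng.uniform(0.0, 1.0, size=(20, DIM))
y = np.array([run_simulation(x) for x in X])  # step 2: KPI evaluation

surrogate = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(30):  # fixed simulation budget (step 6: repeat)
    surrogate.fit(X, y)  # steps 3/5: (re)train the surrogate
    # Step 4: score cheap candidates; a lower confidence bound favors
    # regions that look critical (low predicted KPI) or are still uncertain.
    cand = rng.uniform(0.0, 1.0, size=(5000, DIM))
    mu, sigma = surrogate.predict(cand, return_std=True)
    best = cand[np.argmin(mu - 1.96 * sigma)]
    X = np.vstack([X, best])                 # spend a real simulation only
    y = np.append(y, run_simulation(best))   # on the most promising point

print(f"most critical KPI found: {y.min():.4f}")
```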

Surrogate models are the secret weapon of production SBT: instead of running expensive simulations for every candidate scenario, the search algorithm queries a fast approximation to eliminate obviously uninteresting regions. Only the most promising candidates get full simulation treatment, potentially cutting the number of full simulations by multiple orders of magnitude.

That said, building a good surrogate is easier said than done—getting the training set right is half the battle.

Key Insight: Surrogate models are the efficiency multiplier—they eliminate obviously uninteresting regions without running expensive simulations. The better your surrogate, the fewer evaluations you need.

Visual Comparison: SBT vs Full Grid Sampling:

[Figure: brute force covers the space with 240 evaluations; SBT reaches the same critical event region with ~15 targeted evaluations.]

Try It Yourself: Interactive Demonstration

In the following simplified intersection with two vehicles, you can experiment with initial velocities using grid resolutions from 20–80 steps. The simulation uses realistic bounding box collision detection. Compare two strategies side-by-side:

  • Full grid sweep (exhaustive brute-force sampling)
  • SBT refinement (adaptive sampling guided by the KPI)

Interactive Lab: Brute Force vs SBT

[Interactive demo: replays each simulation and tracks simulations run, critical scenarios found, and their ratio, classifying every run as Safe, Near Miss, or Crash.]

Implementation Note: This demo uses adaptive binary refinement in a 2D space. Production implementations leverage Bayesian optimization, surrogate models, and genetic algorithms to achieve orders of magnitude better performance in high-dimensional spaces. The key insight: efficiency gains compound exponentially with dimensionality.
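
For intuition, here is a toy sketch in the same spirit: a full grid sweep versus adaptive cell refinement in 2D, where a Lipschitz-style bound prunes cells that provably cannot contain critical scenarios (the KPI, thresholds, and grid sizes are invented for illustration):

```python
import numpy as np

def kpi(v1, v2):
    # Toy stand-in for min bounding-box distance at the intersection:
    # near-critical only along a narrow band of velocity pairs.
    return abs(v1 - 0.8 * v2 - 2.0)

# Full grid sweep: 80 x 80 = 6,400 evaluations over [0, 20] x [0, 20]
v = np.linspace(0.0, 20.0, 80)
grid = np.array([[kpi(a, b) for b in v] for a in v])
print(f"grid:     {grid.size} evals, {(grid < 0.5).sum()} critical")

cache = {}  # memoize evaluations; adjacent cells share corner points
def eval_pt(p):
    if p not in cache:
        cache[p] = kpi(*p)
    return cache[p]

def refine(x0, x1, y0, y1, depth):
    best = min(eval_pt(c) for c in
               [(x0, y0), (x0, y1), (x1, y0), (x1, y1)])
    # The toy KPI changes by at most 1.0 per unit v1 and 0.8 per unit v2
    # (its Lipschitz constants), so a cell whose best corner exceeds this
    # bound plus the 0.5 criticality threshold cannot contain a critical
    # point and is pruned without further evaluations.
    if depth > 0 and best < (x1 - x0) * 1.0 + (y1 - y0) * 0.8 + 0.5:
        xm, ym = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        for a0, a1 in [(x0, xm), (xm, x1)]:
            for b0, b1 in [(y0, ym), (ym, y1)]:
                refine(a0, a1, b0, b1, depth - 1)

for i in range(4):  # 4 x 4 coarse cells, each refined up to 4 levels deep
    for j in range(4):
        refine(5.0 * i, 5.0 * (i + 1), 5.0 * j, 5.0 * (j + 1), 4)
print(f"adaptive: {len(cache)} evals, "
      f"{sum(val < 0.5 for val in cache.values())} critical")
```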

The KPI: The Compass That Guides the Search

SBT is only as good as the KPI driving it. Bad KPIs lead the search to optimize the wrong thing.

Common Failure Modes

  • Binary crash flag (discontinuous → no gradient to follow)
  • Final separation distance (ignores temporal dynamics)
  • TTC-only metrics (can be gamed by avoidance trajectories)

Better KPIs blend multiple dimensions and remain continuous near safety boundaries:

  • Min bounding-box distance + Time-to-Collision (TTC), sketched in code below
  • Integrated risk over trajectory
  • RSS-based gap compliance for merges
  • Stopping margin for pedestrians
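
As one such ingredient, a simplified longitudinal time-to-collision term might look like this (a 1D formulation with hypothetical names; production implementations evaluate predicted 2D trajectories):

```python
def time_to_collision(gap_m, v_follower, v_leader):
    """Simplified longitudinal TTC: gap divided by closing speed.

    Returns infinity when the vehicles are not closing in, which is
    exactly the discontinuity that makes TTC-only KPIs easy to game.
    """
    closing_speed = v_follower - v_leader  # m/s
    if closing_speed <= 0.0:
        return float("inf")
    return gap_m / closing_speed  # seconds
```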

Multi-objective KPIs are common in production:

K = w₁ × safety + w₂ × comfort + w₃ × regulation

This ensures the search doesn't produce "critical but illegal or absurd" trajectories.
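
In code this weighting is trivial; the detail that actually matters is normalizing each sub-metric to a comparable scale before weighting (the names and weights below are illustrative):

```python
def combined_kpi(safety, comfort, regulation, weights=(0.6, 0.2, 0.2)):
    """Weighted multi-objective KPI (illustrative weights).

    Each sub-score is assumed pre-normalized to [0, 1] so that no term
    dominates purely because of its units; the weights sum to 1.
    """
    w1, w2, w3 = weights
    return w1 * safety + w2 * comfort + w3 * regulation
```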

KPI Design Principles: What Works and What Fails

The most common KPI failure isn't mathematical—it's choosing metrics that optimize for something you don't actually care about.

Monotonicity and Gradients:

  • Poor Choice: Binary crash/no-crash flag (discontinuous). The search algorithm has no gradient to follow near the safety boundary—it's blindly guessing.
  • Better Choice: Minimum distance during trajectory (continuous). Provides smooth gradient that guides the search toward critical boundaries, even from safe initial conditions.

Key Insight: Bad KPIs lead to missed failures. Your KPI must be continuous (gradients to follow), resistant to exploitation (no gaming), and capture temporal dynamics (not just snapshots).

Other Critical Considerations:

Effective KPIs must be resistant to exploitation—a TTC-only metric can be gamed by scenarios that avoid the intersection entirely, and teams often don't discover this until weeks into a test campaign, when the "critical scenarios" turn out to be clever avoidance maneuvers. KPIs should capture temporal dynamics over trajectories, not static snapshots, and can leverage formal frameworks like RSS for regulation-aligned metrics.

| Scenario Type | Weak KPI | Strong KPI |
| --- | --- | --- |
| Intersection | Binary crash flag | Min bounding-box distance + TTC |
| Lane Change | Lateral offset only | Time-integrated TTC + jerk comfort |
| Pedestrian Crossing | Final separation distance | Max deceleration + stopping margin |
| Highway Merge | Closest approach | RSS gap compliance + merge completion |

Limitations and Trade-offs

But SBT alone doesn't solve validation—it's a tool with real limitations that teams must understand before betting their safety case on it.

SBT is not magic. It introduces trade-offs:

  • May converge to local optima → multiple runs needed
  • Search coverage is KPI-dependent
  • Surrogate quality limits sensitivity
  • False negatives are possible
  • No universal stopping criterion

The hardest part? Knowing when you've sampled enough.

"It's part science, part judgment, part organizational risk tolerance."

This is why SBT must sit inside a larger validation loop rather than acting as a standalone technique.

Regulatory note: Standards like ISO 21448 (SOTIF) and UL 4600 require demonstrating that relevant scenario space has been explored and that evidence is traceable. SBT provides auditable sampling logic, reproducible scenario selection, coverage arguments, KPI rationale, and failure mode documentation. This enables stronger safety cases than "we ran X million kilometers."

Implementation note: SBT isn't a replacement for existing simulation platforms—it's an orchestration layer that sits on top of your existing validation infrastructure, intelligently selecting which scenarios to test.

Position in the Validation Pipeline

SBT solves only one piece of the puzzle:

Efficient sampling within a single logical scenario

It does not solve:

  • Scenario generation
  • Scenario prioritization across the ODD
  • ODD coverage reasoning
  • Real-drive data integration
  • Regulatory safety argumentation

Those are the topics of the next articles in this series.

Executive Summary

The Problem:

A simple 4-car intersection produces 10¹¹ scenarios (an 11-dimensional parameter space). Brute-force simulation at 100 sims/sec would take 31 years. Most scenario space is "boring" — failures occupy tiny subregions.

The Solution — Search-Based Testing (SBT):

Reframes validation as an optimization problem: focus on critical scenarios using KPIs (e.g., min distance, TTC) and surrogate models to predict outcomes without full simulation. Achieves 10×–1000× cost compression, depending on dimensionality.

Business Value:

  • Fewer compute hours → reduced cloud costs
  • Faster iteration → quarterly validation sprints
  • ISO 21448 / UL 4600-ready safety evidence
  • Continuous integration in V&V pipelines

Critical Success Factors:

  • KPI design: continuous, resistant to exploitation, captures temporal dynamics
  • Surrogate model quality: drives sensitivity in critical regions
  • Multi-objective balance: safety + comfort + regulation
  • Judgment: knowing when sampling is sufficient

Limitations:

May converge to local optima, KPI-dependent, false negatives possible. SBT solves efficient sampling within a single logical scenario, not scenario generation, ODD coverage, or real-drive integration.

"If mileage-based validation was about exposure, scenario-based validation is about intelligent coverage — and SBT is how we get there."

References

  1. ISO 21448:2022 — Road vehicles — Safety of the intended functionality (SOTIF).
  2. Operational Design Domain (Wikipedia). Link
  3. Scenario abstraction levels (functional → logical → concrete). Link
  4. PEGASUS project: Scenario-based testing methodology. Link
Kaveh Rahnema

V&V Expert for ADAS & Autonomous Driving with 7+ years at Robert Bosch GmbH.
