Search-Based Testing (SBT) and the Curse of Dimensionality
Validating autonomous vehicles is no longer a mileage game. The relevant question isn't "How many kilometers did we drive?" but "How systematically did we explore the scenarios where the AV is most likely to fail?"
📚 Series Context: In Part 1, we discussed why traditional mileage-based validation breaks at Level 3/4—requiring billions of equivalent kilometers to even approximate statistical confidence. This article introduces the first practical building block of the solution: Search-Based Testing (SBT). Where mileage-based validation tries to catch failures by accumulating exposure, SBT tries to hunt them by intelligently navigating huge logical scenario spaces.
TL;DR: Why This Matters
The Problem: A simple 4-car intersection spans 10¹¹ concrete scenarios, roughly 31.7 years of simulation to test exhaustively. The Solution: Search-Based Testing finds critical failures using intelligent sampling.
Business Value:
- Compresses simulation cost by 10×–1000×, with gains compounding in higher-dimensional spaces
- Finds critical edge cases faster, accelerating validation cycles
- Generates traceable safety evidence aligned with ISO 21448 / UL 4600
- Reduces cloud infrastructure costs when scaling large simulation clusters
- Supports continuous V&V loops instead of batch milestone validation
For organizations trying to ship Level 3/4 products under quarterly release pressure, SBT isn't an optional optimization—it's a lever that reduces both compute and calendar time.
Scenario Hierarchy and ODD Context
Within an Operational Design Domain (ODD), scenarios exist at different abstraction levels—from human-readable descriptions (abstract) to parameter ranges (logical) to executable simulations (concrete). The intersection example in this article is one logical scenario within its ODD.
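To make the logical level tangible, here is one way such a scenario could be encoded in Python. This is a minimal sketch: the parameter names mirror the intersection example below, and all ranges are illustrative assumptions rather than values from a real ODD.

```python
from dataclasses import dataclass

@dataclass
class ParameterRange:
    """One continuous dimension of a logical scenario."""
    name: str
    low: float
    high: float
    unit: str

# Hypothetical encoding: the abstract description becomes explicit
# parameter ranges, and every point inside them is one concrete,
# executable scenario. All bounds here are illustrative.
LOGICAL_SCENARIO = [
    ParameterRange("ego.v0", 0.0, 15.0, "m/s"),
    ParameterRange("ego.p0", -50.0, -10.0, "m"),
] + [
    pr
    for i in (1, 2, 3)
    for pr in (ParameterRange(f"veh{i}.v0", 0.0, 15.0, "m/s"),
               ParameterRange(f"veh{i}.p0", -50.0, -10.0, "m"),
               ParameterRange(f"veh{i}.a0", -3.0, 3.0, "m/s^2"))
]
assert len(LOGICAL_SCENARIO) == 11  # the 11D space discussed below
```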
The Curse of Dimensionality: 4-Car Intersection Example
Let’s look at a standard unsignalized intersection with four vehicles (ego + three traffic participants). The ego vehicle's acceleration is left to the planner under test, so it is not a search parameter, but we vary the ego's initial state. This gives us:
- Ego Vehicle: Initial velocity (v0) and Initial position (p0) [2 parameters]
- 3 Traffic Participants: Initial velocity (v0), Initial position (p0), and Acceleration (a0) each [3 × 3 = 9 parameters]
That gives us an 11-dimensional space to cover. With a coarse discretization of 10 steps per parameter, brute-force simulation would require:
10¹¹ = 100,000,000,000 concrete scenarios
And this toy setup is aggressively simplified; many real-world factors are not modeled at all. Yet even this minimal scenario already produces a continuous parameter space large enough that naive brute-force sweeps are computationally infeasible.
Assuming an optimistic simulation engine running 100 simulations per second, the runtime would be:
100,000,000,000 scenarios / 100 sims/sec = 1,000,000,000 sec ≈ 31.7 years!
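The arithmetic behind those numbers fits in a few lines of Python:

```python
# Back-of-the-envelope cost of a full grid sweep over the 11D space.
# Numbers mirror the example above: 10 steps per parameter, 100 sims/sec.

STEPS_PER_PARAM = 10
NUM_PARAMS = 11          # 2 ego + 3 x 3 traffic-participant parameters
SIMS_PER_SECOND = 100    # optimistic simulation throughput

total_scenarios = STEPS_PER_PARAM ** NUM_PARAMS
runtime_seconds = total_scenarios / SIMS_PER_SECOND
runtime_years = runtime_seconds / (60 * 60 * 24 * 365)

print(f"{total_scenarios:,} scenarios -> {runtime_years:.1f} years")
# 100,000,000,000 scenarios -> 31.7 years
```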
"Most scenario space is boring."
The interesting failures live in tiny subregions. This makes brute-force both expensive and ineffective.
The Stakes: Why This Decision Matters
Without SBT
- 31-year validation cycles for single scenarios
- Unpredictable cloud costs scaling exponentially
- Weak regulatory arguments based on mileage alone
- Missed critical failures in unsampled regions
- Batch milestone testing delaying release cycles
With SBT
- 10×–1000× cost compression in high dimensions
- Predictable compute budgets with measurable ROI
- ISO 21448 / UL 4600-ready evidence with traceability
- Targeted critical scenario discovery by design
- Continuous V&V integration in quarterly sprints
What SBT Does Differently
This is where most teams get stuck: they understand the problem but don't know how to escape brute-force thinking.
Running every single variation is a massive waste of resources on uninteresting scenarios where the AV performs well. The critical events—collisions, near-misses, and edge cases—occupy only small subregions of the parameter space. Search-Based Testing reframes scenario evaluation as an optimization problem:
Instead of evaluating everything, evaluate only what is likely to matter.
To make that work, SBT needs two ingredients:
- A KPI (what "interesting" means)
- A Search Strategy (how we navigate the space)
Example KPI for our intersection: minimum bounding-box distance between vehicles during the crossing. The optimization then tries to minimize that distance—surfacing near-misses and collisions.
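As a minimal sketch, such a KPI could look as follows, assuming trajectories arrive as arrays of vehicle center positions. Axis-aligned boxes keep the code short; production implementations would use oriented boxes that follow vehicle heading.

```python
import numpy as np

def box_distance(c1, c2, half1=(2.4, 1.0), half2=(2.4, 1.0)):
    """Distance between two axis-aligned bounding boxes given their centers.

    Simplification: real implementations use *oriented* boxes that
    rotate with vehicle heading; half-extents here are illustrative.
    """
    dx = max(0.0, abs(c1[0] - c2[0]) - (half1[0] + half2[0]))
    dy = max(0.0, abs(c1[1] - c2[1]) - (half1[1] + half2[1]))
    return float(np.hypot(dx, dy))

def min_distance_kpi(ego_traj, other_trajs):
    """KPI: minimum bounding-box distance over the whole crossing.

    ego_traj: (T, 2) array of ego center positions
    other_trajs: list of (T, 2) arrays, one per traffic participant
    Returns 0.0 on overlap (collision) and stays continuous elsewhere,
    which gives the search a gradient to follow toward near-misses.
    """
    return min(
        box_distance(ego_traj[t], traj[t])
        for traj in other_trajs
        for t in range(len(ego_traj))
    )
```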
Business context: Simulation is one of the most expensive components of AV validation infrastructure—both in compute and wall-clock time. Running 1,000+ cloud cores isn't cheap. Every 10× improvement in sampling efficiency translates directly to fewer GPU/CPU hours, reduced spot-instance burn, lower scheduling latency, and faster validation cycles. For teams measured in quarters, not decades, these differences are existential.
How Search-Based Testing Works
SBT uses genetic algorithms, Bayesian optimization, or surrogate models to intelligently explore the scenario space. Here's the typical workflow:
Iterative Process of SBT:
- Coarse Sampling: Sample initial points across the logical scenario space.
- KPI Evaluation: Run simulations and compute the KPI for each scenario.
- Surrogate Model Training: Build a fast approximation (e.g., Gaussian Process, neural network) of the KPI function based on evaluated samples. This model predicts KPI values without running expensive simulations.
- Region Refinement: Use the surrogate model to identify promising regions and focus subsequent samples where the KPI indicates potential critical events.
- Continuous Improvement: As new samples are evaluated, the surrogate model is continuously retrained and refined, improving prediction accuracy in critical regions while maintaining computational efficiency.
- Repeat: Iterate between surrogate updates and targeted sampling until convergence or the computational budget is exhausted (a minimal sketch of this loop follows below).
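Here is a minimal sketch of that loop, using scikit-learn's Gaussian Process as the surrogate. `run_simulation` is a hypothetical stand-in for the expensive simulator, and the toy KPI surface, budget, and exploration weight are illustrative assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def run_simulation(params):
    """Placeholder for the expensive simulator; returns the KPI
    (e.g., minimum bounding-box distance). Hypothetical stand-in."""
    return float(np.linalg.norm(params - 0.7))  # toy KPI surface

rng = np.random.default_rng(0)
dim, budget, n_init, n_candidates = 11, 200, 30, 5000

# Steps 1-2: coarse sampling + KPI evaluation
X = rng.uniform(0, 1, size=(n_init, dim))
y = np.array([run_simulation(x) for x in X])

for _ in range(budget - n_init):
    # Step 3: train/refresh the surrogate on everything evaluated so far
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)

    # Step 4: score cheap candidates; favor low predicted KPI (critical)
    # minus an exploration bonus from the model's own uncertainty
    cand = rng.uniform(0, 1, size=(n_candidates, dim))
    mean, std = gp.predict(cand, return_std=True)
    best = cand[np.argmin(mean - 1.0 * std)]

    # Steps 5-6: run the one promising scenario for real and iterate
    X = np.vstack([X, best])
    y = np.append(y, run_simulation(best))

print(f"Most critical scenario found: KPI = {y.min():.3f}")
```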
Surrogate models are the secret weapon of production SBT: instead of running expensive simulations for every candidate scenario, the search algorithm queries a fast approximation to eliminate obviously uninteresting regions. Only the most promising candidates get full simulation treatment, potentially reducing the number of expensive evaluations by multiple orders of magnitude.
That said, building a good surrogate is easier said than done—getting the training set right is half the battle.
Key Insight: Surrogate models are the efficiency multiplier—they eliminate obviously uninteresting regions without running expensive simulations. The better your surrogate, the fewer evaluations you need.
(Figure: visual comparison of SBT vs. full grid sampling.)
Try It Yourself: Interactive Demonstration
In the following simplified intersection with two vehicles, you can experiment with initial velocities using grid resolutions from 20–80 steps. The simulation uses realistic bounding box collision detection. Compare two strategies side-by-side:
- Full grid sweep (exhaustive brute-force sampling)
- SBT refinement (adaptive sampling guided by the KPI)
Interactive Lab: Brute Force vs SBT
Implementation Note: This demo uses adaptive binary refinement in a 2D space. Production implementations leverage Bayesian optimization, surrogate models, and genetic algorithms to achieve orders of magnitude better performance in high-dimensional spaces. The key insight: efficiency gains compound exponentially with dimensionality.
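The demo's exact code isn't reproduced here, but the following quadtree-style sketch illustrates adaptive refinement in 2D: cells whose KPI looks boring are abandoned after one evaluation, while interesting cells are subdivided. The KPI function, depth limit, and threshold are illustrative assumptions.

```python
import numpy as np

def adaptive_refine(kpi, lo, hi, depth=0, max_depth=6, threshold=1.5):
    """Adaptive refinement in 2D: evaluate the cell center, subdivide
    only cells whose KPI suggests a critical region nearby.

    kpi: function (x, y) -> float, lower = more critical
    lo, hi: opposite cell corners as (x, y) tuples
    Returns a list of (x, y, kpi_value) evaluations.
    """
    cx, cy = (lo[0] + hi[0]) / 2, (lo[1] + hi[1]) / 2
    value = kpi(cx, cy)
    samples = [(cx, cy, value)]
    if depth < max_depth and value < threshold:
        for sub_lo, sub_hi in [(lo, (cx, cy)),
                               ((cx, lo[1]), (hi[0], cy)),
                               ((lo[0], cy), (cx, hi[1])),
                               ((cx, cy), hi)]:
            samples += adaptive_refine(kpi, sub_lo, sub_hi,
                                       depth + 1, max_depth, threshold)
    return samples

# Toy KPI: distance to a critical point at (0.8, 0.3); the sweep
# concentrates evaluations there instead of covering a full grid.
samples = adaptive_refine(lambda x, y: 3 * np.hypot(x - 0.8, y - 0.3),
                          (0.0, 0.0), (1.0, 1.0))
print(f"{len(samples)} adaptive evaluations vs 4096 for a 64 x 64 grid")
```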
The KPI: The Compass That Guides the Search
SBT is only as good as the KPI driving it. Bad KPIs lead the search to optimize the wrong thing.
Common Failure Modes
- Binary crash flag (discontinuous → no gradient to follow)
- Final separation distance (ignores temporal dynamics)
- TTC-only metrics (can be gamed by avoidance trajectories)
Better KPIs blend multiple dimensions and remain continuous near safety boundaries:
- Min bounding-box distance + Time-to-Collision (TTC)
- Integrated risk over trajectory
- RSS-based gap compliance for merges
- Stopping margin for pedestrians
Multi-objective KPIs are common in production:
K = w₁ · safety + w₂ · comfort + w₃ · regulation
This ensures the search doesn't produce "critical but illegal or absurd" trajectories.
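A sketch of how such a weighted KPI might be assembled follows. The normalization bounds and weights are illustrative assumptions, and the sign conventions assume a search that minimizes K.

```python
def normalize(value, lo, hi):
    """Clamp-and-scale a raw metric to [0, 1] so objectives are comparable."""
    return min(max((value - lo) / (hi - lo), 0.0), 1.0)

def combined_kpi(min_distance_m, peak_jerk_ms3, rule_violations,
                 w=(0.6, 0.2, 0.2)):
    """K = w1*safety + w2*comfort + w3*regulation; the search minimizes K.

    The safety term rewards criticality (small distances lower K), while
    the comfort and regulation terms *penalize* absurd trajectories, so
    the minimizer cannot cheat its way to "critical" scenarios that no
    real traffic participant would produce. All bounds and weights here
    are illustrative assumptions.
    """
    safety = normalize(min_distance_m, 0.0, 10.0)      # closer -> lower K
    comfort = normalize(peak_jerk_ms3, 0.0, 10.0)      # wild jerk -> higher K
    regulation = normalize(rule_violations, 0.0, 5.0)  # violations -> higher K
    return w[0] * safety + w[1] * comfort + w[2] * regulation
```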
KPI Design Principles: What Works and What Fails
The most common KPI failure isn't mathematical—it's choosing metrics that optimize for something you don't actually care about.
Monotonicity and Gradients:
- ❌ Poor Choice: Binary crash/no-crash flag (discontinuous). The search algorithm has no gradient to follow near the safety boundary—it's blindly guessing.
- ✅ Better Choice: Minimum distance during the trajectory (continuous). Provides a smooth gradient that guides the search toward critical boundaries, even from safe initial conditions (see the toy comparison below).
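A toy illustration of the difference, evaluating both KPI choices along a 1D slice of the scenario space; the collision point and distance model are made up for the example.

```python
import numpy as np

# Two KPIs evaluated along a 1D slice of the scenario space: the ego's
# initial velocity. min_distance is a toy stand-in for the simulator.
v0 = np.linspace(5.0, 20.0, 16)
min_distance = np.abs(v0 - 13.7) * 0.8            # collision near v0 = 13.7 m/s

binary_kpi = (min_distance <= 0.5).astype(float)  # crash flag
continuous_kpi = min_distance                      # slopes toward the boundary

# The binary KPI is 0 for all but one sample: a search sees no signal
# until it lands on the failure by luck. The continuous KPI decreases
# monotonically toward the critical region from either side.
print(np.round(binary_kpi, 1))
print(np.round(continuous_kpi, 1))
```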
Key Insight: Bad KPIs lead to missed failures. Your KPI must be continuous (gradients to follow), resistant to exploitation (no gaming), and capture temporal dynamics (not just snapshots).
Other Critical Considerations:
Effective KPIs must be resistant to exploitation: a TTC-only metric can be gamed by trajectories that avoid the intersection entirely. Teams often don't discover this until weeks into a test campaign, when they realize their "critical scenarios" are just clever evasions rather than genuine failures. KPIs should also capture temporal dynamics over trajectories, not static snapshots, and can leverage formal frameworks like RSS for regulation-aligned metrics.
| Scenario Type | Weak KPI | Strong KPI |
|---|---|---|
| Intersection | Binary crash flag | Min bounding-box distance + TTC |
| Lane Change | Lateral offset only | Time-integrated TTC + jerk comfort |
| Pedestrian Crossing | Final separation distance | Max deceleration + stopping margin |
| Highway Merge | Closest approach | RSS gap compliance + merge completion |
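For the highway-merge row, an RSS-style gap KPI might look like the sketch below. It uses the standard RSS longitudinal safe-distance formula (Shalev-Shwartz et al., 2017); the response time and acceleration bounds are illustrative assumptions, and the constant-speed simplification would not hold in production code.

```python
def rss_min_gap(v_rear, v_front, rho=1.0,
                a_accel_max=3.0, a_brake_min=4.0, a_brake_max=8.0):
    """RSS minimum safe longitudinal gap between a rear and front vehicle.

    v_rear, v_front: speeds in m/s; rho: response time in s.
    Parameter values are illustrative, not calibrated.
    """
    d = (v_rear * rho
         + 0.5 * a_accel_max * rho ** 2
         + (v_rear + rho * a_accel_max) ** 2 / (2 * a_brake_min)
         - v_front ** 2 / (2 * a_brake_max))
    return max(0.0, d)

def rss_gap_compliance_kpi(gaps_m, v_rear, v_front):
    """KPI for a merge: worst-case ratio of actual gap to RSS-required gap.

    Values below 1.0 indicate an RSS violation during the maneuver.
    Simplification: assumes constant speeds over the maneuver.
    """
    required = rss_min_gap(v_rear, v_front)
    return min(g / required for g in gaps_m) if required > 0 else float("inf")
```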
Limitations and Trade-offs
SBT alone doesn't solve validation. It is a tool with real limitations that teams must understand before betting their safety case on it, and it introduces trade-offs:
- May converge to local optima → multiple runs needed
- Search coverage is KPI-dependent
- Surrogate quality limits sensitivity
- False negatives are possible
- No universal stopping criterion
The hardest part? Knowing when you've sampled enough.
"It's part science, part judgment, part organizational risk tolerance."
This is why SBT must sit inside a larger validation loop rather than acting as a standalone technique.
Regulatory note: Standards like ISO 21448 (SOTIF) and UL 4600 require demonstrating that relevant scenario space has been explored and that evidence is traceable. SBT provides auditable sampling logic, reproducible scenario selection, coverage arguments, KPI rationale, and failure mode documentation. This enables stronger safety cases than "we ran X million kilometers."
Implementation note: SBT isn't a replacement for existing simulation platforms—it's an orchestration layer that sits on top of your existing validation infrastructure, intelligently selecting which scenarios to test.
Position in the Validation Pipeline
SBT solves only one piece of the puzzle:
Efficient sampling within a single logical scenario
It does not solve:
- Scenario generation
- Scenario prioritization across the ODD
- ODD coverage reasoning
- Real-drive data integration
- Regulatory safety argumentation
Those are the topics of the next articles in this series.
Executive Summary
The Problem:
A simple 4-car intersection produces 10¹¹ scenarios (an 11D parameter space). Brute-force simulation at 100 sims/sec would take roughly 31.7 years. Most scenario space is "boring"; failures occupy tiny subregions.
The Solution — Search-Based Testing (SBT):
Reframes validation as an optimization problem: focus on critical scenarios using KPIs (e.g., min distance, TTC) and surrogate models to predict outcomes without full simulation. Achieves 10×–1000× cost compression, with larger gains in higher-dimensional spaces.
Business Value:
- Fewer compute hours → reduced cloud costs
- Faster iteration → quarterly validation sprints
- ISO 21448 / UL 4600-ready safety evidence
- Continuous integration in V&V pipelines
Critical Success Factors:
- KPI design: continuous, resistant to exploitation, captures temporal dynamics
- Surrogate model quality: drives sensitivity in critical regions
- Multi-objective balance: safety + comfort + regulation
- Judgment: knowing when sampling is sufficient
Limitations:
May converge to local optima, KPI-dependent, false negatives possible. SBT solves efficient sampling within a single logical scenario, not scenario generation, ODD coverage, or real-drive integration.
"If mileage-based validation was about exposure, scenario-based validation is about intelligent coverage — and SBT is how we get there."
Kaveh Rahnema
V&V Expert for ADAS & Autonomous Driving with 7+ years at Robert Bosch GmbH.