We benchmarked 6 search APIs on 170 AI-agent queries
AI agents fan out 80–240 web searches per task, and many setups default everything to one premium engine. We benchmarked six engines (Serper, Brave, Exa, Tavily, Perplexity, Firecrawl) on 170 class-balanced agent queries. The robust finding is cost: a naive all-Exa default runs ~$7/1k, while routing each query class to the cheapest engine that clears its quality bar blends to ~$0.85/1k, a difference of up to 8.24×. Quality differences were small and we treat them as indicative (judged on a 62-query gold subset, small per-class n). Here's everything, caveats included.
Methodology — honest, up front
We lead with the caveats so you can weigh each number before you reach the findings.
- Queries: 170 synthetic class-discriminating probes across 5 classes (Academic 35, News 33, Page lookup 33, Direct answer 34, Web 35). Treat rankings as directional; this is not real partner traffic.
- Cost axis: measured on all 170 queries (the robust axis). List prices as of 2026.
- Quality axis: judged on a 62-query gold subset by an LLM judge validated against human gold at quadratic-weighted κ=0.78 (target 0.6, passes). Per-class judged-n is 10–17, so per-class quality is indicative. We don't rank on quality.
- Cache economics: exact-only simulation. Baseline blended hit-rate is ~46%, the top of a 9–46% band across traffic-shape scenarios. Caching compounds the routing savings but does not carry margin alone.
- Reproducible: spikes/rankings/build_rankings.py
Price caveat: List prices as of 2026; actual cost varies by plan tier and volume. Notably: Serper drops to ~$0.30/1k at scale; Firecrawl 'Enhanced Mode' is ~5x for bot-protected sites. Comparisons use published list prices for a like-for-like baseline.
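For readers unfamiliar with the agreement statistic behind the quality axis, here is a minimal sketch of quadratic-weighted Cohen's κ on the 0–3 judge scale. It is an illustration, not the code in build_rankings.py; the function name and the toy label lists are assumptions, not benchmark data.

```python
def quadratic_weighted_kappa(human, judge, num_levels=4):
    """Quadratic-weighted Cohen's kappa for integer labels in 0..num_levels-1."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(human)

    # Observed confusion matrix: rows = human label, columns = judge label.
    observed = [[0] * num_levels for _ in range(num_levels)]
    for h, j in zip(human, judge):
        observed[h][j] += 1

    human_hist = [sum(row) for row in observed]
    judge_hist = [sum(observed[i][j] for i in range(num_levels))
                  for j in range(num_levels)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            # Disagreement penalty grows with the squared label distance.
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Counts expected if the two raters were independent.
            weighted_expected += weight * human_hist[i] * judge_hist[j] / n

    return 1.0 - weighted_observed / weighted_expected


# Toy labels (hypothetical, not from the benchmark): identical ratings give 1.0.
print(quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3]))
```

Because the penalty is quadratic, a judge that is off by one level costs far less agreement than one that is off by three, which suits an ordinal relevance scale.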
Finding 1 — Cost is where the money is (the robust result)
Cost and latency are measured on all 170 queries, so this is the finding we're confident in. All-Exa ($7/1k) vs route-to-cheapest-that-clears-the-bar (~$0.85/1k) = 8.24×. Honest framing: this is the ceiling a naive default leaves on the table, not a claim that Exa is overpriced. Exa stays the right call where it wins; the savings come from not defaulting everything to a premium engine.
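The arithmetic behind the headline can be checked in a few lines. This sketch uses only figures stated in this post (list prices of $7/1k and ~$0.85/1k, and the 80–240 searches per task fan-out); the function names are illustrative.

```python
EXA_PER_1K = 7.00      # all-Exa default, list price per 1,000 queries
ROUTED_PER_1K = 0.85   # blended price when routing per class


def cost_per_task(searches, price_per_1k):
    """Dollar cost of one agent task that issues `searches` queries."""
    return searches * price_per_1k / 1000.0


def savings_multiple(default_per_1k, routed_per_1k):
    """How many times more the default costs than the routed setup."""
    return default_per_1k / routed_per_1k


print(f"multiple: {savings_multiple(EXA_PER_1K, ROUTED_PER_1K):.2f}x")
for searches in (80, 240):
    default = cost_per_task(searches, EXA_PER_1K)
    routed = cost_per_task(searches, ROUTED_PER_1K)
    print(f"{searches} searches/task: ${default:.3f} vs ${routed:.3f}")
```

At list prices a 240-search task costs about $1.68 on the all-Exa default and about $0.20 routed; plan tier and volume discounts will shift both numbers.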
Finding 2 — Quality is tightly clustered (and we won't overclaim it)
Overall quality scores sat in a narrow band (judge scale 0–3; metric = mean fraction of results judged relevant). Margins are within roughly one judge increment, so we don't crown a definitive quality winner. Perplexity is “not measured,” not zero — its key was non-functional that run.
Finding 3 — No single engine wins every query class
Per-class quality leaders (indicative, small n): academic → Firecrawl · news → Firecrawl · page → Brave · answer → Brave · web → Brave. Notably, the two priciest engines (Exa $7, Tavily $8) didn't lead any class outright in this set. The takeaway isn't “engine X is best” — it's that the best engine changes by query type, so routing per class beats a single default.
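A minimal sketch of the "cheapest engine that clears the bar" rule for one query class, assuming the bar is the class leader's score minus the 0.05 tolerance described in the caveats. The prices are the list prices quoted in this post; the quality scores are hypothetical placeholders, not benchmark results, and `best_value` here is an illustration rather than the code in build_rankings.py.

```python
TOLERANCE = 0.05  # band around the class leader that still "clears the bar"


def best_value(quality, price_per_1k, tolerance=TOLERANCE):
    """Return the cheapest engine whose quality is within `tolerance` of the leader.

    Engines with quality None are treated as "not measured" and skipped,
    rather than being scored as zero.
    """
    measured = {eng: q for eng, q in quality.items() if q is not None}
    if not measured:
        raise ValueError("no engine has a measured quality score")
    bar = max(measured.values()) - tolerance
    eligible = [eng for eng, q in measured.items() if q >= bar]
    return min(eligible, key=lambda eng: price_per_1k[eng])


# List prices from this post ($/1k queries).
prices = {"exa": 7.00, "firecrawl": 0.85, "serper": 1.00, "perplexity": 5.00}
# HYPOTHETICAL scores for a single class, for illustration only.
scores = {"exa": 0.70, "firecrawl": 0.68, "serper": 0.60, "perplexity": None}

print(best_value(scores, prices))  # firecrawl
```

In this toy class the leader and one cheaper engine both sit inside the band, so the router picks the cheaper one. With a tolerance of 0 the rule collapses to "always pick the quality leader".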
Finding 4 — Caching compounds the savings
On an exact-only cache simulation, repeated agent queries hit ~46% baseline (band 9–46% across traffic shapes; go-line is 33%, kill-floor 20%). Caching compounds the routing savings but doesn't carry margin alone.
- Method: exact-only cache simulation (cache_sensitivity_report.json)
- Baseline hit-rate: 46% blended across traffic shapes
- Go / kill lines: go-line 33% · kill-floor 20%
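To make "exact-only" concrete, here is a minimal sketch of such a simulation: a query is a hit only if the identical string has been seen before, with no normalization or semantic matching. The function names, the unbounded cache, and the "review" label for the zone between the two lines are assumptions for illustration; the thresholds are the go-line and kill-floor stated above.

```python
GO_LINE = 0.33      # hit-rate at or above this counts as "go"
KILL_FLOOR = 0.20   # hit-rate below this counts as "kill"


def exact_hit_rate(queries):
    """Fraction of queries that exactly repeat an earlier query in the stream."""
    if not queries:
        return 0.0
    seen = set()
    hits = 0
    for query in queries:
        if query in seen:
            hits += 1          # identical string already cached
        else:
            seen.add(query)    # first occurrence is a miss; cache it
    return hits / len(queries)


def verdict(hit_rate):
    """Classify a hit-rate against the go-line and kill-floor."""
    if hit_rate >= GO_LINE:
        return "go"
    if hit_rate < KILL_FLOOR:
        return "kill"
    return "review"  # hypothetical label for the zone between the two lines


stream = ["python asyncio docs", "rust borrow checker", "python asyncio docs",
          "python asyncio docs"]
rate = exact_hit_rate(stream)
print(rate, verdict(rate))  # 0.5 go
```

On these thresholds the 46% baseline lands above the go-line, while the 9% end of the band falls below the kill-floor, which is why the traffic shape matters.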
What this means — and where GroundRoute fits
Route each query class to the cheapest engine that clears its quality bar, cache the repeats, and keep a premium engine for the queries that need it. That's what GroundRoute does behind one API.
On pricing: you keep ~half the cache savings we generate, we keep the other half — so you're never worse off than going direct. BYOK supported. You can try the playground on your own queries, or read the per-engine breakdowns to see where each engine wins.
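One way to read the split, as a rough sketch rather than the actual billing formula: assume every cache hit avoids the full engine price and the customer keeps half of that saving. The function name, the 50/50 default, and the assumption that a hit saves the whole per-query price are illustrative.

```python
def bill_per_1k(direct_cost_per_1k, cache_hit_rate, customer_share=0.5):
    """Estimated customer bill per 1,000 queries under a shared-savings split.

    ASSUMPTION: each cache hit saves the full engine price, and the customer
    keeps `customer_share` of that saving.
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    cache_savings = direct_cost_per_1k * cache_hit_rate
    return direct_cost_per_1k - customer_share * cache_savings


# Routed blend (~$0.85/1k) at the 46% baseline hit-rate from this post.
print(f"${bill_per_1k(0.85, 0.46):.4f} per 1k")
```

Under these assumptions the bill can never exceed the direct cost, because the discount term is never negative.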
Caveats (kept prominent)
- Perplexity quality is shown as 'not measured' (null), NOT 0: its API key was non-functional during this run so it returned no parseable results. We publish its cost/latency but make NO quality claim — a 0 would be a data artifact, not a verdict. Re-run with a working key to measure it.
- Quality is judged on the 62-query gold subset; per-class judged-n is small (10-17), so per-class quality is INDICATIVE — the headline rests on COST (covered on all 170 queries), not on a quality ranking.
- Quality scores are tightly clustered (judge granularity ~0.083 = one result of three). Within-class quality_leader margins are small; best_value uses a 0.05 tolerance band around the leader to define 'clears the bar'.
- Queries are synthetic class-discriminating probes (see bench_queries_v2.README.md), not real partner traffic; treat rankings as directional. A real partner-traffic re-run is the planned v2.
Per-engine & head-to-head breakdowns
Full pricing, limits, and benchmark details for each engine:
Compare pairs directly:
Stop defaulting everything to one engine.
GroundRoute routes each query to the cheapest engine that clears your bar — with caching on top. You pay 50% of what the cache saves you, never more than going direct.
© 2026 GroundRoute, Inc. · Benchmark generated 2026-06-14T07:40:03Z · N=170 queries · numbers traceable to the published dataset (bench_v2_raw.jsonl, bench_queries_v2.jsonl).
FAQ
- Which search API is cheapest for AI agents?
- Firecrawl ($0.85/1k) and Serper ($1/1k, ~$0.30 at scale) are the cheapest. The optimal choice depends on your query class; see the per-class leaders in Finding 3.
- Is Exa the best search API for AI agents?
- Exa is strong for academic and semantic retrieval, but it's the second-priciest engine in the benchmark ($7/1k). On web/news/page queries, cheaper engines matched quality. Routing per class — not defaulting everything to Exa — is where the savings come from.
- How was quality measured?
- Quality was judged on a 62-query gold subset by an LLM judge validated against human labels (κ=0.78). Per-class n is 10–17, so per-class quality is indicative. The headline finding rests on cost, which was measured on all 170 queries.
- What about Perplexity?
- Perplexity's API key was non-functional during the benchmark run. We publish its cost ($5/1k) and latency (3,538ms) but make no quality claim — a zero would be an artifact, not a verdict.
- What is GroundRoute?
- GroundRoute is a search control plane for AI agents. You point your agent at one API; GroundRoute routes each query to the cheapest engine that clears your quality bar and caches repeats. You keep ~half the cache savings, never pay more than going direct.