#Introduction
Search agents have demonstrated dramatic improvements in performance when trained with reinforcement learning (RL). However, existing work often uses Google as the backend. An understudied question is how training on different web search tools impacts RL outcomes.
We compare Exa with a Google baseline (we use SERP as a proxy for Google results, as it is widely used for post-training language models today [1,2,3]) and find that agents trained with Exa as the search backend during RL reach higher performance with less training compute.
Prior research on training search agents with reinforcement learning has typically kept the retrieval method fixed.
- Early research taught models to retrieve over offline corpora [4,5,6], learning to reason over multi-hop questions.
- Later work added live web search to the action space [7,8], but still fixed the retriever, usually to a SERP-based backend.
Issuing queries across the live web during training can be prohibitively expensive and unpredictable [9], so ablations of the retrieval method for RL usually remain limited to offline indexes [10,11]. Because Exa has its own search engine and web index, we can study this interaction at scale.
#Experimental Design
#Model and Action Space
To isolate the impact of live search during RL, we train two models, one with Exa and one with SERP, and hold all other variables constant.
We train Qwen3-4B-Instruct-2507 [12] with LoRA adapters using Tinker. Prior work has found that LoRA can match full fine-tuning performance, even with small ranks [13].
Both agents use the same system prompt, adapted from Search-R1 [4,14], to avoid prompt-induced behavioral differences. Each search tool call returns 5 live web results, each truncated to a snippet of at most 2,000 characters. The agent is not told which search variant it has access to. This tool configuration was chosen with model context limits in mind, as we initially observed long multi-hop trajectories quickly saturating the context window. Though other harness optimizations are possible, here we focus primarily on the comparison of search providers.
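As a minimal sketch of this fixed tool interface (names are illustrative; `backend` stands in for either the Exa or the SERP client, neither of whose real APIs is shown here):

```python
MAX_RESULTS = 5        # live web results returned per tool call
SNIPPET_CHARS = 2_000  # each result truncated to this many characters

def search_tool(query: str, backend) -> list[dict]:
    """Backend-agnostic search tool: the agent sees the same result schema
    regardless of which engine is behind it."""
    results = backend(query, num_results=MAX_RESULTS)  # hypothetical call
    return [
        {
            "title": r["title"],
            "url": r["url"],
            "snippet": r["text"][:SNIPPET_CHARS],  # enforce the snippet cap
        }
        for r in results
    ]
```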
[Figure: example agent trajectory. The agent retrieves that the crossing of the Rhine by a mixed group of barbarians (including Vandals, Alans, and Suebi) is traditionally dated 31 December 406, initiating a wave of destruction in northern Gaul, then gives 378 as its final answer, the year of the major battle and defeat marking the end of the initial phase of the invasion.]
#Training and Reward Design
Beyond the agent and its action space, each rollout is shaped by the training data, optimizer, and reward signal applied at the end of the trajectory.
We train on two multi-hop QA datasets, MuSiQue and HotpotQA [15,16], using Dr. GRPO [17].
The reward is a single binary signal at the end of each trajectory. We initially graded with exact substring matching, but found that the agent received large gains in reward by formatting its responses differently rather than by improving accuracy (for example, the agent learned to output increasingly verbose responses that could match the correct answer by chance, or to intentionally list alternative answers; see the Appendix for an example). We switched to the LLM grader from SimpleQA [18], and apply a -0.25 penalty when a trajectory exceeds context limits.
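A minimal sketch of this reward, assuming a hypothetical `grade_with_llm` wrapper around a SimpleQA-style grader:

```python
CONTEXT_PENALTY = -0.25  # applied when a trajectory overruns the context window

def grade_with_llm(question: str, predicted: str, gold: str) -> bool:
    """Hypothetical wrapper: prompts a grader model in the style of SimpleQA
    and returns True iff the prediction is judged correct."""
    raise NotImplementedError

def final_reward(question: str, predicted: str, gold: str, truncated: bool) -> float:
    """Single binary signal assigned at the end of each trajectory."""
    if truncated:
        # Trajectory exceeded context limits before producing an answer.
        return CONTEXT_PENALTY
    return 1.0 if grade_with_llm(question, predicted, gold) else 0.0
```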
After training the models, we evaluate using pass@k [19] (research has shown RL can concentrate probability mass on correct trajectories, reducing diversity and degrading pass@k at higher k; evaluating across multiple values of k therefore helps distinguish genuine improvements in solution-finding from probability-mass concentration on a narrower set of trajectories [20,21,22]) on MuSiQue and HotpotQA test splits and on the OOD benchmarks 2WikiMultihopQA, FRAMES, BrowseComp, and SimpleQA [15,16,18,23,24,25]. As a reference point, we compare against Qwen3-235B-A22B-Instruct-2507 [26] with the same toolset. Every variable besides the search backend is fixed.
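For reference, pass@k is commonly computed with the standard unbiased estimator; a short sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n rollouts (c of them correct) succeeds."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```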
#Results
#Impact of Retrieval Backend
Across evaluations, Exa-trained agents outperform SERP-trained agents. Three patterns stand out:
(i) Post-RL Exa models outperform post-RL SERP models on pass@k across all values of k.
(ii) Training with Exa is more compute efficient than training with SERP, taking fewer tokens, turns, and search calls to reach the same performance. This also results in lower token costs at inference.
(iii) Exa-trained 4B agents outperform SERP-trained agents on every benchmark and often exceed the larger untrained 235B model.
| Benchmark | Exa 4B Base | Exa 4B+RL | Exa 235B | SERP 4B Base | SERP 4B+RL | SERP 235B |
|---|---|---|---|---|---|---|
| SimpleQA | 0.651 | 0.767 | 0.730 | 0.579 | 0.692 | 0.689 |
| 2WikiMultihopQA | 0.500 | 0.839 | 0.774 | 0.461 | 0.798 | 0.747 |
| HotpotQA | 0.521 | 0.694 | 0.632 | 0.491 | 0.684 | 0.620 |
| FRAMES | 0.316 | 0.566 | 0.604 | 0.272 | 0.521 | 0.578 |
| MuSiQue | 0.151 | 0.311 | 0.273 | 0.146 | 0.307 | 0.259 |
| BrowseComp | 0.020 | 0.043 | 0.039 | 0.008 | 0.039 | 0.055 |
#Token efficiency
We find that training with Exa requires fewer tokens for the same number of steps. This gap is largely due to the difference in turn count per rollout: during training, the SERP agent makes more search calls per rollout than the Exa agent. Because both agents start from the same base policy, they initially use similar numbers of turns and search calls, but the gap widens over training. In our training setup:
- Training with SERP to step 100 took 20% more total tokens than training with Exa (1.89B vs. 1.58B).
- The Exa agent needed 69% fewer tokens (prefill and decode) to match the SERP-trained agent's performance (0.58B vs. 1.89B).
- The Exa agent needed 62% fewer search calls and 58% fewer turns to reach that same performance (93k vs. 248k search calls; 146k vs. 350k turns).
#Search engine at training vs. inference
To separate training-time and inference-time effects, we evaluate each trained agent with both search backends.
- Agents trained with Exa outperform agents trained with SERP, regardless of which engine they use at inference time.
- Using Exa at inference time improves performance, regardless of which engine the agent was trained with.
We hypothesize that this difference is attributable to learning with Exa being more sample efficient: Exa retrieves results containing the information needed to arrive at the correct answer in fewer actions than SERP.
In multi-turn agentic RL, higher per-step success makes rewards less sparse and helps the model learn from fewer rollouts. This may be due to the semantic nature of Exa search, which is designed specifically to handle non-keyword natural-language queries.
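For intuition, a small per-call advantage compounds over a multi-turn rollout; a toy calculation under a simplistic independence assumption:

```python
def rollout_sees_answer(p_call: float, n_calls: int) -> float:
    """Probability a rollout surfaces the answer at least once, assuming
    each search call succeeds independently with probability p_call."""
    return 1.0 - (1.0 - p_call) ** n_calls

# Illustrative only: with 4 calls per rollout, per-call hit rates of
# 36% vs. 33% (roughly the rates measured below) become ~83% vs. ~80%
# chances of surfacing the answer somewhere in the rollout.
```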
To measure per-action sample efficiency, we run the same base policy on the same prompts for roughly 240,000 rollouts per engine (480k total) across six benchmarks.
We see that for a search call from the same policy, Exa returns sites that contain the answer 10.7% more often than SERP (36.1±2.4% vs. 32.6±2.3%; Δ = +3.50±1.55pp), and for the first search call in a rollout, 11.5% more often (34.0±2.5% vs. 30.5±2.5%; Δ = +3.51±1.68pp). We believe this indicates that Exa provides a denser signal per action: the agent sees the correct answer more often within a rollout, giving it more opportunities to reason over it, which may also explain the lower overall search count. Retrieval quality appears to matter as much at training time as at inference time: a denser per-step signal compounds over many rollouts.
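A sketch of one way to estimate these per-call hit rates and their difference (the exact interval construction used here isn't specified; this shows a simple normal-approximation version):

```python
import math

def hit_rate(hits: int, total: int) -> tuple[float, float]:
    """Fraction of search calls whose results contain the gold answer,
    with a normal-approximation standard error."""
    p = hits / total
    return p, math.sqrt(p * (1.0 - p) / total)

def rate_delta(p_a: float, se_a: float, p_b: float, se_b: float) -> tuple[float, float]:
    """Difference between two hit rates with combined standard error."""
    return p_a - p_b, math.sqrt(se_a ** 2 + se_b ** 2)
```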
#Conclusion
In this work, we studied how the search backend shapes RL for large language models.
Holding all else fixed, we compared one agent trained with Exa against one trained with a SERP backend. Across our benchmarks, Exa-trained agents achieve higher pass@k at lower training and inference cost. These gains persist when the engine is swapped at inference time, suggesting the model learns transferable search skills when training with Exa. Exa also improves inference performance even for agents not trained on it, though the strongest results come from using Exa during both training and inference.
#Future Work
Here is a list of open questions we hope to address in upcoming research.
- Scaling up training with bigger models, longer runs, and larger batch sizes.
- Interpreting which learned behaviors drive improved search performance, particularly after an initial learning phase.
- Training with a more advanced harness, including elements like content fetching and context pruning.
Our results suggest that the retrieval engine is a crucial component for agents searching over the web, especially when paired with the same tool at inference time. A better search backend can both improve the final performance of a search agent and improve the learning signal during RL.