A big step toward perfect search
Today we’re excited to announce that our new search product, “Websets”, retrieves over 20x more correct search results than Google on our first benchmark of complex queries.
Turns out Google is great for simple searches, but quite bad at complex ones, like "find me software engineers in the Bay Area who have a blog" or "startups in NYC with more than 10 employees working on futuristic hardware."
We built Websets to enable these extremely valuable queries. And we do remarkably well in our early testing.
Basically we use an army of agents for each search to get you a precise and nearly comprehensive list of people, companies, news articles, or really any entity on the web.
Because Websets is designed specifically to search for entities that match multiple criteria, and no such benchmark existed online, we had to create one ourselves.
Our goal was a minimally biased benchmark of queries. Writing the queries by hand would certainly have introduced human bias, so we decided that one-shot sampling from o1 would work best.
We came up with a single prompt and asked o1 for 200 queries. We were careful to write this prompt once and run it once, to avoid any potential bias from prompt tuning.
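A minimal sketch of what that one-shot sampling step could look like. The prompt text and the `parse_queries` helper here are our own illustrative assumptions, not the actual prompt or code used:

```python
# Hypothetical sketch of the one-shot query generation step.
# PROMPT and parse_queries are illustrative assumptions, not the real prompt.
PROMPT = (
    "Generate 200 complex entity-search queries, one per line. "
    "Each should name an entity type plus several verifiable criteria."
)

def parse_queries(raw: str) -> list[str]:
    """Split a newline-separated model response into clean query strings."""
    return [line.lstrip("-* ").strip() for line in raw.splitlines() if line.strip()]

# Stubbed model response, standing in for a single o1 completion:
raw = (
    "- software engineers in the Bay Area who have a blog\n"
    "- startups in NYC with more than 10 employees working on futuristic hardware\n"
)
queries = parse_queries(raw)
print(len(queries))  # → 2
```

The key design constraint is that the prompt is written once and run once, so no query-level tuning can leak bias into the benchmark.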
The queries look like this:
We then fed these queries to both Exa Websets and a Google Search API (serper.dev).
We asked GPT-4o to grade both sets of results. For each query, GPT-4o first generated the set of criteria a result must meet to pass, then evaluated whether each result met all of them. For example, for the query "nutritional advice that experts have reversed their stance on in the last 5 years," GPT-4o created these criteria and graded each result against them:
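The grading procedure reduces to an all-criteria check per result. Here is a sketch under our own assumptions: the `judge` callable stands in for a GPT-4o pass/fail call, and `substring_judge` is a toy stand-in used only for illustration:

```python
from typing import Callable

def grade_result(result: str, criteria: list[str],
                 judge: Callable[[str, str], bool]) -> bool:
    """A result counts as correct only if it passes every criterion."""
    return all(judge(result, criterion) for criterion in criteria)

def pass_rate(results: list[str], criteria: list[str],
              judge: Callable[[str, str], bool]) -> float:
    """Fraction of results that meet all criteria for a query."""
    if not results:
        return 0.0
    return sum(grade_result(r, criteria, judge) for r in results) / len(results)

# Toy judge: passes if the criterion text appears in the result.
# The real judge would be a GPT-4o call evaluating each criterion.
def substring_judge(result: str, criterion: str) -> bool:
    return criterion.lower() in result.lower()
```

Requiring *all* criteria to pass (rather than a majority) is what makes the metric strict enough to separate keyword matching from genuine multi-criteria search.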
The graph below compares matching results (i.e. results that met all criteria above) for Google vs Websets (low compute) vs Websets (high compute).
Google APIs like serper.dev return at most 100 results. Websets, by contrast, can return as many results as you want, depending on how much compute you're willing to spend.
In the low compute version, we requested 100 results per query to match Google’s API. In the high compute version, we requested 1000 results. The y-axis shows the average number of results across all the queries that met the query criteria, according to gpt4o.
Google obtained only 16 correct results on average. That's because Google relies on keyword matching, which handles complex multi-criteria queries poorly. In contrast, Websets (low compute) gets 66 correct results on average, because we leverage a more powerful algorithm. And Websets (high compute) gets 320 correct results per query on average, because we leverage way more compute.
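The headline multiples follow directly from those averages (all numbers taken from the text above):

```python
google_avg = 16      # average correct results per query, Google (serper.dev)
websets_low = 66     # Websets, low compute (100 results requested)
websets_high = 320   # Websets, high compute (1000 results requested)

print(websets_low / google_avg)   # → 4.125 (low compute vs Google)
print(websets_high / google_avg)  # → 20.0 (the "20x" headline claim)
```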
Of course, we could run Websets with 10x or even 100x more compute, but we think 20 times better than Google is good enough for this eval :)