Curated datasets for your model training

Specify any type of web data you want, and Exa will curate a custom dataset for your needs.

Get any slice of the web

Up to billions of URLs

Up to trillions of tokens

Example datasets

Research papers about genetics

Technical blog posts

2023 news

Financial tweets

Curate your own

Why use Exa

High-quality data

Instead of relying on low-quality data dumps like Common Crawl, Exa's crawling infrastructure is optimized to find data from high-quality sources.

Powerful Filters

Exa can filter the web based on natural language. You can filter by topic, idea, entity type, length, or pretty much anything else you want.

Comprehensiveness

We can customize our crawling to any dataset size you need, from a few million tokens to trillions.

Find financial articles that mention S&P 500 companies in the last year

Use case

Training a large context window model

Challenge

A leading AI company is trying to fine-tune its own LLM to handle long context windows for technical topics like programming, but cannot find high-quality, long-form technical content.

Solution

With a few natural-language queries, Exa finds high-quality, long-form technical content in math, science, and programming that exactly matches the company's fine-tuning goal.

Impact

>10B tokens delivered

The Exa dataset was delivered quickly, saving the company from dedicating a team of engineers for months to build out a crawling system and filtering algorithm from scratch.

Exa gives you real web data and control over your dataset

Features compared: Exa vs. Common Crawl

Types of web data: high-quality sources (Exa) vs. random webpages (Common Crawl)

Ability to filter data

Customized crawling

Update interval: every day (Exa) vs. every month (Common Crawl)

Customer support

Trusted by thousands of developers and companies

OpenAI, Databricks, Imbue, LangChain, LlamaIndex

“Exa feeds our deep research AI, which helps sales people research their prospects. Without Exa's speed and quality over the web, this would be hard to pull off!”

Rabi Gupta, CEO, EvaBot

“Models are only as good as the data they're trained on, and Exa's search allowed us to get high quality data we couldn't find any other way”

Jonathan Frankle, Chief Scientist, Databricks

“Exa is good, really good. We went from multiple API calls and scraping into a single <1s fast call. The results are way different than traditional search, and way better. Our users love it!”

Alex Johnson, Founder, JotBot