Overview

Endpoint: POST https://api.exa.ai/contents
Auth: pass your API key via the x-api-key header. Get one at https://dashboard.exa.ai/api-keys.

The Contents API extracts clean, LLM-ready content from any URL. It handles JavaScript-rendered pages, PDFs, and complex layouts, and returns full text, highlights, summaries, or any combination of the three.

Installation

```bash
pip install exa-py    # Python
npm install exa-js    # JavaScript
```

Minimal Working Example

```bash
curl -X POST "https://api.exa.ai/contents" \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_API_KEY" \
  -d '{"urls": ["https://example.com"], "text": true}'
```

```python
from exa_py import Exa

exa = Exa(api_key="YOUR_API_KEY")
result = exa.get_contents(["https://example.com"], text=True)
```

```javascript
import Exa from "exa-js";

const exa = new Exa("YOUR_API_KEY");
const result = await exa.getContents(["https://example.com"], { text: true });
```

Request Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| urls | string[] | (required) | Array of URLs to extract content from. Also accepts ids (document IDs from search results). |
| text | boolean or object | | Return full page text as markdown. Object form: {maxCharacters, includeHtmlTags, verbosity, includeSections, excludeSections}. |
| highlights | boolean or object | | Return key excerpts relevant to a query. Object form: {maxCharacters, query}. |
| summary | boolean or object | | Return an LLM-generated summary. Object form: {query, schema}. |
| maxAgeHours | integer | | Max age of cached content in hours. 0 = always livecrawl. -1 = never livecrawl. Omit for the default (livecrawl as fallback). |
| livecrawlTimeout | integer | 10000 | Timeout for livecrawling in milliseconds. Recommended: 10000-15000. |
| subpages | integer | 0 | Number of subpages to crawl from each URL. |
| subpageTarget | string or string[] | | Keywords to prioritize when selecting subpages. |
| extras.links | integer | 0 | Number of URLs to extract from each page. |
| extras.imageLinks | integer | 0 | Number of image URLs to extract from each page. |
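Putting several of these parameters together, here is a minimal sketch of assembling a request body in Python. The build_contents_body helper is hypothetical (not part of any SDK); it simply drops parameters you leave unset, so the body only contains what you asked for.

```python
def build_contents_body(urls, *, text=None, highlights=None, summary=None,
                        max_age_hours=None, livecrawl_timeout=None, subpages=None):
    """Assemble a JSON body for POST /contents, omitting unset parameters."""
    body = {"urls": list(urls)}
    optional = {
        "text": text,
        "highlights": highlights,
        "summary": summary,
        "maxAgeHours": max_age_hours,
        "livecrawlTimeout": livecrawl_timeout,
        "subpages": subpages,
    }
    body.update({k: v for k, v in optional.items() if v is not None})
    return body

body = build_contents_body(
    ["https://example.com"],
    text={"maxCharacters": 8000},
    max_age_hours=24,
    livecrawl_timeout=12000,
)
# POST this as JSON to https://api.exa.ai/contents with the x-api-key header.
```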

Text Object Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| maxCharacters | integer | | Character limit for returned text. |
| includeHtmlTags | boolean | false | Preserve HTML tags in output. |
| verbosity | string | "compact" | One of compact, standard, or full. Pair with maxAgeHours: 0 for fresh content. |
| includeSections | string[] | | Only include these page sections: header, navigation, banner, body, sidebar, footer, metadata. Pair with maxAgeHours: 0 for fresh content. |
| excludeSections | string[] | | Exclude these page sections (same options as includeSections). Pair with maxAgeHours: 0 for fresh content. |
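As an illustrative sketch (the URL is a placeholder), a text object that strips page chrome while capping length, paired with maxAgeHours: 0 as recommended above for section filtering:

```python
body = {
    "urls": ["https://example.com/article"],
    "maxAgeHours": 0,           # section filtering wants freshly crawled content
    "livecrawlTimeout": 12000,
    "text": {
        "maxCharacters": 8000,
        "verbosity": "standard",
        "excludeSections": ["navigation", "sidebar", "footer"],
    },
}
```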

Highlights Object Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| maxCharacters | integer | | Maximum characters for all highlights combined per URL. |
| query | string | | Custom query to direct the LLM's selection of relevant excerpts. |

Summary Object Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| query | string | | Custom query for the summary. |
| schema | object | | JSON Schema (Draft 7) for structured summary output. |

Content Modes

Text — Full page content as clean markdown. Best for deep analysis.
{"urls": ["https://example.com"], "text": {"maxCharacters": 8000}}
Highlights — Extractive key excerpts from the page. Best for agent workflows (10x fewer tokens). These are pulled directly from the source, not generated.
{"urls": ["https://example.com"], "highlights": {"query": "key findings", "maxCharacters": 2000}}
Summary — LLM-generated abstract. Supports JSON schema for structured extraction.
```json
{
  "urls": ["https://example.com"],
  "summary": {
    "query": "Extract company information",
    "schema": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "industry": {"type": "string"}
      },
      "required": ["name", "industry"]
    }
  }
}
```
You can combine all three in a single request.
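Sketched as a Python dict (the query strings are illustrative), a request for all three modes at once looks like this; each mode adds its own field to every result object:

```python
body = {
    "urls": ["https://example.com"],
    "text": {"maxCharacters": 8000},
    "highlights": {"query": "key findings", "maxCharacters": 2000},
    "summary": {"query": "One-paragraph overview of the page"},
}
```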

Content Freshness

| maxAgeHours value | Behavior |
| --- | --- |
| Omit (default) | Livecrawl only when no cached content exists. Recommended. |
| Positive (e.g. 24) | Use cache if less than N hours old; otherwise livecrawl. |
| 0 | Always livecrawl; never use cache. Increases latency. |
| -1 | Never livecrawl; cache only. Maximum speed. |
When using maxAgeHours, pair with livecrawlTimeout (10000-15000ms recommended).
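These four behaviors can be captured as request fragments. The preset names and the with_freshness helper below are illustrative only, not part of the API; they just overlay a cache policy onto a request body:

```python
# Illustrative presets pairing maxAgeHours with a livecrawlTimeout.
FRESHNESS = {
    "default":    {},                                              # livecrawl only on cache miss
    "daily":      {"maxAgeHours": 24, "livecrawlTimeout": 12000},  # cache up to 24h old
    "always":     {"maxAgeHours": 0,  "livecrawlTimeout": 15000},  # freshest, slowest
    "cache_only": {"maxAgeHours": -1},                             # fastest, may be stale
}

def with_freshness(body, policy):
    """Overlay a freshness preset onto a /contents request body."""
    return {**body, **FRESHNESS[policy]}
```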

Subpage Crawling

Automatically discover and extract content from linked pages within a site.
```json
{
  "urls": ["https://docs.example.com"],
  "subpages": 10,
  "subpageTarget": ["api", "reference", "guide"],
  "text": {"maxCharacters": 5000}
}
```
  • subpages: Max subpages to crawl per URL.
  • subpageTarget: Keywords to prioritize when selecting which subpages to crawl.
  • Start small (5-10) and increase if needed.

Response Schema

```json
{
  "requestId": "e492118ccdedcba5088bfc4357a8a125",
  "results": [
    {
      "title": "Page Title",
      "url": "https://example.com/page",
      "id": "https://example.com/page",
      "publishedDate": "2024-01-15T00:00:00.000Z",
      "author": "Author Name",
      "image": "https://example.com/image.png",
      "favicon": "https://example.com/favicon.ico",
      "text": "Full page content as markdown...",
      "highlights": ["Key excerpt from the page..."],
      "highlightScores": [0.46],
      "summary": "LLM-generated summary...",
      "subpages": [],
      "extras": {
        "links": ["https://example.com/related"]
      }
    }
  ],
  "statuses": [
    {
      "id": "https://example.com/page",
      "status": "success"
    }
  ],
  "costDollars": {
    "total": 0.003
  }
}
```

Response Fields

| Field | Type | Description |
| --- | --- | --- |
| requestId | string | Unique request identifier. |
| results | array | List of result objects with extracted content. |
| results[].title | string | Page title. |
| results[].url | string | Page URL. |
| results[].id | string | Document ID (same as the URL). |
| results[].publishedDate | string or null | Estimated publication date. |
| results[].author | string or null | Author, if available. |
| results[].text | string | Full page text (if text requested). |
| results[].highlights | string[] | Key excerpts (if highlights requested). |
| results[].highlightScores | float[] | Cosine similarity scores for each highlight. |
| results[].summary | string | LLM summary (if summary requested). |
| results[].subpages | array | Nested results from subpage crawling; same shape as results. |
| results[].extras.links | string[] | Extracted links from the page. |
| statuses | array | Per-URL status information. Always check this for errors. |
| statuses[].id | string | The URL that was requested. |
| statuses[].status | string | "success" or "error". |
| statuses[].error.tag | string | Error type (see Error Handling). |
| statuses[].error.httpStatusCode | integer or null | Corresponding HTTP status code. |
| costDollars.total | float | Total dollar cost for the request. |
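Because each subpages entry has the same shape as a top-level result, a short recursive walk covers both when post-processing a response. The iter_results helper is hypothetical and the sample response below is abbreviated:

```python
def iter_results(results):
    """Yield every result object, including nested subpage results."""
    for r in results:
        yield r
        yield from iter_results(r.get("subpages", []))

response = {  # abbreviated sample response
    "results": [
        {"url": "https://docs.example.com", "text": "Top-level page...",
         "subpages": [
             {"url": "https://docs.example.com/api", "text": "API reference...",
              "subpages": []},
         ]},
    ],
}
texts = {r["url"]: r["text"] for r in iter_results(response["results"])}
```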

Error Handling

The endpoint returns HTTP 200 even when individual URLs fail. Per-URL errors appear in the statuses array.

Per-URL Error Tags

| Tag | HTTP Code | Meaning |
| --- | --- | --- |
| CRAWL_NOT_FOUND | 404 | Content not found. |
| CRAWL_TIMEOUT | 504 | Crawl timed out fetching content. |
| CRAWL_LIVECRAWL_TIMEOUT | 504 | Livecrawl exceeded livecrawlTimeout. |
| SOURCE_NOT_AVAILABLE | 403 | Access forbidden. |
| UNSUPPORTED_URL | | URL type not supported. |
| CRAWL_UNKNOWN_ERROR | 500+ | Other errors. |
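One way to act on these tags is to partition failed URLs before retrying, e.g. with a higher livecrawlTimeout. Treating only the timeout tags as retryable is an assumption here; adjust the set to your needs:

```python
RETRYABLE_TAGS = {"CRAWL_TIMEOUT", "CRAWL_LIVECRAWL_TIMEOUT"}

def partition_failures(statuses):
    """Split failed URLs into (retryable, fatal) lists based on their error tag."""
    retryable, fatal = [], []
    for s in statuses:
        if s["status"] == "error":
            bucket = retryable if s["error"]["tag"] in RETRYABLE_TAGS else fatal
            bucket.append(s["id"])
    return retryable, fatal

statuses = [  # illustrative statuses array
    {"id": "https://a.example", "status": "success"},
    {"id": "https://b.example", "status": "error",
     "error": {"tag": "CRAWL_LIVECRAWL_TIMEOUT", "httpStatusCode": 504}},
    {"id": "https://c.example", "status": "error",
     "error": {"tag": "CRAWL_NOT_FOUND", "httpStatusCode": 404}},
]
retry, fatal = partition_failures(statuses)
```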

Request-Level Errors

| HTTP Status | Meaning |
| --- | --- |
| 400 | Bad request: invalid parameters. |
| 401 | Invalid or missing API key. |
| 422 | Validation error. |
| 429 | Rate limit exceeded. |
Always check statuses to handle per-URL failures:
```python
result = exa.get_contents(["https://example.com", "https://example.com/maybe-broken"])
for status in result.statuses:
    if status.status == "error":
        print(f"Failed: {status.id} - {status.error.tag}")
```

Common Mistakes

LLMs frequently generate these incorrect parameters. Do NOT use any of the following:
| Wrong | Correct |
| --- | --- |
| useAutoprompt: true | Remove it. useAutoprompt does not exist on the /contents endpoint. |
| numSentences | Remove it. This highlights parameter is deprecated; use maxCharacters instead. |
| highlightsPerUrl | Remove it. This highlights parameter is deprecated; use maxCharacters instead. |
| livecrawl: "always" | Use maxAgeHours: 0 instead. The livecrawl parameter is deprecated. |
| tokensNum | Remove it. This parameter does not exist; use text.maxCharacters to limit text length. |
| stream: true | Remove it. The /contents endpoint does not support streaming. |
| contents: { text: ... } | On /contents, text, highlights, and summary are top-level; do NOT wrap them in a contents object. This differs from /search. |
Remember: On the /contents endpoint, text, highlights, and summary are top-level parameters. Do NOT nest them inside a contents object (that nesting is only for the /search endpoint).

Patterns and Gotchas

  • Always check statuses. The endpoint returns 200 even when individual URLs fail. Unchecked, you’ll silently miss failed URLs.
  • Use highlights over text for agent workflows. Highlights are 10x more token-efficient and return the most relevant excerpts.
  • Set livecrawlTimeout when using maxAgeHours. Default is 10000ms. For slow sites, use 12000-15000ms.
  • subpageTarget focuses crawling. Without it, subpage selection is best-effort. Use specific terms like ["api", "docs"].
  • Python SDK uses snake_case. maxCharacters → max_characters, subpageTarget → subpage_target, maxAgeHours → max_age_hours.
  • urls and ids are interchangeable. Both accept URL strings. ids exists for backward compatibility with document IDs from search results.
  • Combine modes freely. Request text, highlights, and summary in the same call for different views of the same content.
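The snake_case rule above is mechanical, so a small converter can double as documentation. This is illustrative only; the SDK defines its own parameter names:

```python
import re

def to_snake(name):
    """Convert a REST parameter name (camelCase) to the Python SDK spelling."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()
```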

Complete Examples

Basic text extraction

```json
{
  "urls": ["https://arxiv.org/abs/2301.07041"],
  "text": true
}
```

Highlights with custom query

```json
{
  "urls": ["https://example.com/research-paper"],
  "highlights": {
    "query": "methodology and results",
    "maxCharacters": 2000
  }
}
```

Documentation crawling

```json
{
  "urls": ["https://platform.openai.com/docs"],
  "subpages": 15,
  "subpageTarget": ["api", "models", "embeddings"],
  "maxAgeHours": 24,
  "livecrawlTimeout": 15000,
  "text": {"maxCharacters": 5000}
}
```

Structured company extraction

```json
{
  "urls": ["https://stripe.com"],
  "subpages": 8,
  "subpageTarget": ["about", "careers", "press", "blog"],
  "summary": {
    "query": "Company overview, culture, and recent news",
    "schema": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "industry": {"type": "string"},
        "employee_count": {"type": "string"},
        "recent_news": {"type": "array", "items": {"type": "string"}}
      },
      "required": ["name", "industry"]
    }
  }
}
```