Key Benefits
- Clean markdown extraction: Automatically filters out navigation, ads, and boilerplate to return only the main content, formatted as clean markdown.
- Flexible content modes: Choose between full text, query-relevant highlights, or LLM-generated summaries—or combine them in one request.
- Subpage crawling: Automatically discover and extract content from linked pages within a site, with targeted filtering to focus on specific sections.
Request Fields
The `ids` parameter (list of URLs) is required. All other fields are optional. See the API Reference for complete parameter specifications.
| Field | Type | Notes | Example |
|---|---|---|---|
| ids | string[] | List of URLs to extract content from. | ["https://example.com/article"] |
| text | bool/obj | Return full page text as markdown. Can specify maxCharacters and includeHtmlTags. | true or {"maxCharacters": 5000} |
| highlights | bool/obj | Return key excerpts most relevant to a query. Can specify numSentences, highlightsPerUrl, and custom query. | {"query": "main findings", "numSentences": 3} |
| maxAgeHours | int | Maximum age of indexed content in hours. If older, fetches with livecrawl. 0 = always livecrawl, -1 = never livecrawl (cache only). | 24 |
| livecrawlTimeout | int | Timeout in milliseconds for live crawling. Recommended: 10000-15000. | 12000 |
| subpages | int | Maximum number of subpages to crawl from each URL. | 5 |
| subpageTarget | string[] | Keywords to prioritize when selecting subpages. | ["docs", "about", "pricing"] |
| summary | bool/obj | Return LLM-generated summary. Can specify custom query and JSON schema for structured extraction. | {"query": "Key takeaways"} |
| context | bool/obj | Return all results combined into a single string for RAG. Can specify maxCharacters. | true or {"maxCharacters": 10000} |
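To make the field shapes concrete, here is a minimal request sketch in Python. The JSON fields come straight from the table above; the endpoint URL and bearer-token header are placeholders, since the base URL and auth scheme are not specified in this section.

```python
import requests

API_URL = "https://api.example.com/contents"  # placeholder endpoint
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # placeholder auth scheme

# Request capped full text plus query-relevant highlights for one URL.
payload = {
    "ids": ["https://example.com/article"],
    "text": {"maxCharacters": 5000},
    "highlights": {"query": "main findings", "numSentences": 3},
}

resp = requests.post(API_URL, headers=HEADERS, json=payload)
resp.raise_for_status()
data = resp.json()  # see the API Reference for the full response shape
```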
Content Extraction Options
Text
Returns the full page content as clean markdown.
Highlights
Returns key excerpts from the page that are most relevant to your query. These are extractive (pulled directly from the source), not generated.
Summary
Returns an LLM-generated abstract tailored to your specific query. Supports JSON schema for structured extraction.
Token Efficiency
Choosing the right content mode can significantly reduce token usage while maintaining answer quality.
| Mode | Best For |
|---|---|
| text | Deep analysis, when you need full context, comprehensive research |
| highlights | Factual questions, specific lookups, multi-step agent workflows |
| summary | Quick overviews, structured extraction, when you control the output size |
- Use `numSentences` and `highlightsPerUrl` to control highlight output size.
- Use `maxCharacters` to cap token usage for text and context.
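For instance, a highlights-only request suits a factual lookup; the sketch below (same placeholder endpoint and auth as above) keeps the excerpt count and length small to minimize tokens.

```python
import requests

# Highlights only: extractive excerpts instead of full text, tuned small.
resp = requests.post(
    "https://api.example.com/contents",  # placeholder endpoint
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "ids": ["https://example.com/article"],
        "highlights": {
            "query": "What pricing tiers are offered?",
            "numSentences": 2,      # shorter excerpts
            "highlightsPerUrl": 3,  # fewer excerpts per page
        },
    },
)
```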
Content Freshness
Control whether to return cached content (faster) or fetch fresh content from the source using `maxAgeHours`.
| Value | Behavior | Best For |
|---|---|---|
| 24 | Use cache if less than 24 hours old, otherwise livecrawl | Daily-fresh content |
| 1 | Use cache if less than 1 hour old, otherwise livecrawl | Near real-time data |
| 0 | Always livecrawl (ignore cache entirely) | Real-time data where cached content is unusable |
| -1 | Never livecrawl (cache only) | Maximum speed, historical/static content |
| (omit) | Default behavior (livecrawl as fallback if no cache exists) | Recommended — balanced speed and freshness |
In most cases you can omit `maxAgeHours` (the default behavior is recommended). Only set it when you have specific freshness requirements. If you do, pair it with an explicit `livecrawlTimeout` (10000-15000 ms), as in the sketch below.
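A minimal sketch of that pairing, again assuming the placeholder endpoint and auth from the earlier example:

```python
import requests

# Accept cached content up to 1 hour old; otherwise livecrawl with a
# 12-second budget (within the recommended 10000-15000 ms range).
resp = requests.post(
    "https://api.example.com/contents",  # placeholder endpoint
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "ids": ["https://example.com/news"],
        "text": True,
        "maxAgeHours": 1,
        "livecrawlTimeout": 12000,
    },
)
```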
Subpage Crawling
Automatically discover and extract content from linked pages within a website.
- `subpages`: Maximum number of subpages to crawl per URL
- `subpageTarget`: Keywords to prioritize when selecting which subpages to crawl
- Start with a smaller `subpages` value (5-10) and increase if needed
- Use specific `subpageTarget` terms to focus on relevant sections
- Combine with `maxAgeHours` for fresh results
Example: Documentation Crawling
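A sketch of crawling a documentation site, using the placeholder endpoint and auth from earlier; the docs URL and target keywords are illustrative.

```python
import requests

# Start at the docs root, fetch up to 10 subpages, and prefer pages
# whose links match the target keywords.
resp = requests.post(
    "https://api.example.com/contents",  # placeholder endpoint
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "ids": ["https://example.com/docs"],
        "subpages": 10,
        "subpageTarget": ["api", "reference", "guides"],
        "text": {"maxCharacters": 8000},
    },
)
```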
Example: Company Research
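A sketch of company research under the same placeholder assumptions: crawl the homepage plus a few key sections, and request per-page summaries instead of full text to keep output compact.

```python
import requests

# Crawl the homepage and a handful of key sections, summarizing each.
resp = requests.post(
    "https://api.example.com/contents",  # placeholder endpoint
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "ids": ["https://example.com"],
        "subpages": 5,
        "subpageTarget": ["about", "pricing", "blog"],
        "summary": {"query": "What does this company do, and how is it priced?"},
    },
)
```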
Error Handling
The Contents API returns detailed status information for each URL in the `statuses` field. The endpoint only returns a top-level error for internal issues; individual URL failures are reported per URL.
- `CRAWL_NOT_FOUND`: Content not found (404)
- `CRAWL_TIMEOUT`: Target page timed out (408)
- `CRAWL_LIVECRAWL_TIMEOUT`: `livecrawlTimeout` limit reached
- `SOURCE_NOT_AVAILABLE`: Access forbidden (403)
- `CRAWL_UNKNOWN_ERROR`: Other errors (500+)
Check the `statuses` array to handle failures gracefully:
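A hedged sketch, assuming each `statuses` entry carries an `id`, a `status`, and an `error` object with a `tag` matching the codes above (the exact response shape is in the API Reference):

```python
import requests

# Timeouts are worth retrying with a longer budget; other failures are not.
RETRYABLE = {"CRAWL_TIMEOUT", "CRAWL_LIVECRAWL_TIMEOUT"}

resp = requests.post(
    "https://api.example.com/contents",  # placeholder endpoint
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"ids": ["https://example.com/article"], "text": True},
)
resp.raise_for_status()

# Assumed per-URL shape: {"id": ..., "status": ..., "error": {"tag": ...}}.
for status in resp.json().get("statuses", []):
    if status.get("status") == "success":
        continue
    tag = (status.get("error") or {}).get("tag", "CRAWL_UNKNOWN_ERROR")
    if tag in RETRYABLE:
        print(f"{status.get('id')}: timed out; retry with a longer livecrawlTimeout")
    else:
        print(f"{status.get('id')}: failed with {tag}; skipping")
```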

