Exa recently purchased a 5 million dollar GPU cluster to train retrieval models over the web.
This cluster is a beast:
It arrived at our datacenter in crates of bare-metal GPU servers. Now it's running nearly 24/7, powering our ML team's concurrent workflows.
As I write this, we're simultaneously embedding billions of pages, training a reranker, and uploading checkpoints to a cache with S3 backups, all on a batch-job framework that ensures reliability and reproducibility.
In this post, we'll explain in detail how we turned 18 hunks of bare metal into the streamlined infrastructure that powers our ML research.
Since the beginning of Exa in 2021, we've believed that neural approaches to web retrieval will win over traditional approaches (this is now almost certainly correct). That's why we spent half our 2021 seed money on our first GPU cluster — 10 nodes with 8xA100 GPUs each.
By late 2024, we had 100x larger inference requirements and 5x more ML engineers running lots of training experiments in parallel. We needed more compute.
So we stood up a second, larger 18-node fleet — the Exacluster 😉
| | Old Cluster (A100) | Exa Cluster (H200) | Combined |
| --- | --- | --- | --- |
| Nodes | 10 | 18 | 28 |
| GPUs | 80 | 144 | 224 |
| Total GPU RAM | 6.4 TB | 20 TB | 26.4 TB |
| CPU Cores | 960 | 3,456 | 4,416 |
| System RAM | 10 TB | 36 TB | 46 TB |
| Local NVMe | 80 TB | 270 TB | 350 TB |
| FP16 PFLOPs* | ~25 | ~143 | ~168 |

*Ballpark aggregate theoretical FLOPs, without sparsity.
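As a rough sanity check, these ballpark figures line up with the public dense FP16 tensor-core numbers per GPU (~312 TFLOPS for an A100, ~990 TFLOPS for an H200); the quick arithmetic below is our own back-of-the-envelope math, not a vendor quote for this cluster.

```python
# Back-of-the-envelope check of the FP16 rows above, using public
# dense (no-sparsity) tensor-core figures: ~312 TFLOPS per A100,
# ~990 TFLOPS per H200.
a100_pflops = 80 * 0.312   # old cluster -> ~25 PFLOPs
h200_pflops = 144 * 0.990  # Exacluster  -> ~143 PFLOPs

print(round(a100_pflops), round(h200_pflops), round(a100_pflops + h200_pflops))
# -> 25 143 168
```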
In practice, a run that used to take a week can now finish in about a day. We're no longer compute-bound (well, not as much).
Here's the five-layer stack that turns these bare servers into a compute beast we can control.
Managing 28 bare-metal nodes, high-speed networks, and core services requires a repeatable process—not ad-hoc scripts. Pulumi provides that consistency.
Our infrastructure is defined as Python code, so type checking with mypy flags incorrect parameters before any provisioning begins. Pulumi's plan → apply workflow produces a clear diff of intended actions for every change, letting us review and confirm before it reaches production. If an update causes issues, the environment can be rolled back to its previous state with the same command.
Using Pulumi keeps the system consistent, auditable, and straightforward to evolve as hardware and software requirements grow.
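To make the type-checking point concrete, here's a minimal sketch of the kind of typed Pulumi program we mean. The node names, IPs, and bootstrap command are illustrative placeholders, not our actual configuration.

```python
from dataclasses import dataclass

import pulumi
from pulumi_command import remote


@dataclass
class GpuNode:
    hostname: str
    ip: str
    gpus: int  # mypy catches a wrong type or missing field here, long before `pulumi up`


# Hypothetical inventory; a real one would live in config, not source.
NODES = [GpuNode(hostname=f"h200-{i:02d}", ip=f"10.0.0.{10 + i}", gpus=8) for i in range(18)]

for node in NODES:
    # Each bootstrap step is a tracked resource, so `pulumi preview`
    # shows a per-node diff before anything touches the machines.
    remote.Command(
        f"bootstrap-{node.hostname}",
        connection=remote.ConnectionArgs(host=node.ip, user="ubuntu"),
        create="sudo apt-get update -y && sudo apt-get install -y python3",
    )

pulumi.export("node_count", len(NODES))
```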
Pulumi orchestrates the full setup, including a set of Ansible playbooks, most of which come directly from Kubespray, a maintained collection of Ansible roles for Kubernetes clusters. This turns each powered-on server into a ready Kubernetes node.
Running Ansible/Kubespray under Pulumi gives us a single, repeatable path from bare metal to a consistent Kubernetes fleet, with all changes tracked and reproducible.
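As a sketch of how that chaining might look (the inventory path and flags below are hypothetical), the Kubespray playbook can itself be a Pulumi-managed command, so the bare-metal-to-Kubernetes step shows up in the same preview/apply flow:

```python
from pulumi_command import local

# Run the upstream Kubespray playbook as a Pulumi resource; the inventory
# path and playbook flags are illustrative placeholders.
kubespray = local.Command(
    "kubespray-cluster",
    create="ansible-playbook -i inventory/exacluster/hosts.yaml --become cluster.yml",
    dir="kubespray",
)
```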
NVIDIA's Kubernetes operators give us a clean, uniform way to manage every accelerator and NIC in the fleet—no manual SSH sessions, no node-specific scripts.
- **GPU Operator**: installs and manages the NVIDIA driver, container toolkit, and device plugin on every node, so a workload gets accelerators just by requesting `nvidia.com/gpu: 1`.
- **Network Operator**: does the same for the high-speed NICs and RDMA stack, exposing the fabric as a schedulable resource that pods request with `"rdma/rdma_network": "1000"`.
Because everything is containerized, each node runs exactly the same stack and survives reboots or upgrades without drift. A new CUDA release becomes a version bump: nodes are cordoned and drained one at a time, upgraded and tested, then put back into service before the rollout advances.
This seamless hardware-to-software integration is a key reason NVIDIA dominates the datacenter AI market. Their operators close the gap between "I have GPUs" and "my workloads are running."
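For illustration, here's roughly what "my workloads are running" means once the operators have advertised those resources: a pod simply asks for them. This sketch uses the official Kubernetes Python client; the image and pod names are placeholders.

```python
from kubernetes import client, config

# Container that asks the scheduler for GPUs and RDMA capacity; both
# resource names are the ones the NVIDIA operators advertise on each node.
container = client.V1Container(
    name="trainer",
    image="registry.example.com/trainer:latest",  # placeholder image
    resources=client.V1ResourceRequirements(
        limits={
            "nvidia.com/gpu": "8",
            "rdma/rdma_network": "1000",
        }
    ),
)

pod = client.V1Pod(
    api_version="v1",
    kind="Pod",
    metadata=client.V1ObjectMeta(name="gpu-rdma-smoke-test"),
    spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
)

# Submit it like any other pod; the device plugins handle the rest.
config.load_kube_config()
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```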
Each new node ships with high-end Gen 4 NVMe drives; across 28 servers that's ~350 TB of low-latency storage connected by 400 Gb/s Ethernet links. We use Alluxio to fuse those drives into one distributed cache while keeping S3 as the authoritative store.
- **Write path**: new checkpoints and processed datasets land on the local NVMe cache first and are persisted back to S3, which stays the authoritative copy.
- **Read path**: hot training data is served straight from NVMe across the cluster; a cache miss falls through to S3 and the object is cached for every subsequent reader.
- **Transparent interface**: jobs keep using ordinary S3-style paths and clients; Alluxio sits in between, so training code doesn't change.
- **Cost control**: most reads never leave the datacenter, which cuts S3 request and transfer costs while S3 remains the durable source of truth.
The result is a unified, high-bandwidth data layer that lets GPUs stream training data at aggregate local‑disk speeds while preserving the durability and simplicity of S3.
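As a sketch of what that transparency looks like in practice, a job can keep using a plain S3 client and simply point it at Alluxio's S3-compatible proxy. The endpoint, bucket, and key below are made-up placeholders, and credentials are assumed to come from the environment.

```python
import boto3

# Standard S3 client, pointed at an Alluxio S3-compatible proxy instead of AWS.
# Endpoint, bucket, and key are illustrative placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="http://alluxio-proxy.internal:39999/api/v1/s3",
)

# The first read of an object pulls it from S3 and parks it on local NVMe;
# every later read across the cluster is served from the cache.
obj = s3.get_object(Bucket="training-data", Key="shards/shard-00042.parquet")
payload = obj["Body"].read()
```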
When we listed what the scheduling layer had to deliver, the list was long: code-first workflows, versioned and reproducible runs, task caching and checkpointing, distributed training, large batch jobs, cloud burst, dataset lineage, priority queues and quotas, and a sane local developer workflow.
We evaluated several frameworks (Kubeflow, Airflow, Prefect, Dagster, and others), but Flyte was the only one that covered the full list without extensive custom work.
| Requirement | How Flyte addresses it |
| --- | --- |
| Code-first | A Python decorator turns a function into a task; DAGs compose naturally in code, no special DSL needed. |
| Versioned, reproducible runs | Each registration hashes the Docker image, requirements, and source code; every run is traceable and re-runnable. |
| Task caching & checkpointing | Built-in task cache and partial-run resume; failed jobs can restart from the last saved state. |
| Distributed training | Native plugins for PyTorchJob and MPIJob schedule multi-node GPU pods with topology-aware placement. |
| Large batch jobs | Using the Ray plugin + KubeRay we scale data jobs to hundreds of nodes, then shut down cleanly. |
| Cloud burst | A single project can target on-prem or cloud execution queues; scaling to thousands of spot GPUs is a configuration flag, not a rewrite. |
| Dataset lineage | Automatic tracking links inputs, outputs, and the exact code version that produced them. |
| Priority queues & quotas | Easily configurable Kubernetes priority classes and resource quotas; urgent inference fine-tunes preempt batch preprocessing. |
| Developer workflows | Flyte tasks are ordinary Python functions: run them locally, set breakpoints, and write unit tests exactly as you would for any other Python code. |
Flyte orchestrates node affinity, GPU topology, log routing to Loki, metrics to Prometheus, and artifact storage on S3 (with Alluxio caching) while honoring queue priorities. Engineers focus on writing code; the platform handles placement, scaling, and bookkeeping.
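To ground the "code-first" point, here's a minimal sketch of what a cached, GPU-backed Flyte task and a fan-out workflow look like. The task names, resource sizes, and body are placeholders rather than our production pipelines.

```python
from typing import List

from flytekit import Resources, map_task, task, workflow


@task(
    cache=True,
    cache_version="1.0",
    requests=Resources(cpu="8", mem="64Gi", gpu="1"),
)
def embed_shard(shard_uri: str) -> str:
    """Placeholder task: embed one shard of pages and return the output URI."""
    # ... load the shard, run the model, write embeddings ...
    return shard_uri.replace("raw/", "embeddings/")


@workflow
def embed_corpus(shards: List[str]) -> List[str]:
    # Fan out one cached task per shard; Flyte schedules the pods,
    # reuses cached results, and records lineage for every output.
    return map_task(embed_shard)(shard_uri=shards)
```

Because the tasks are plain Python, calling `embed_corpus(shards=[...])` locally just runs the functions, which is what makes the developer-workflow row in the table cheap in practice.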
Pulumi provisions everything → Ansible/Kubespray lays down the Kubernetes foundation → NVIDIA operators light up the accelerators → Alluxio feeds them at warp speed → Flyte keeps the assembly line humming.
The payoff? Engineers spend their time on models and data instead of babysitting servers.
That's why we named this infrastructure Hephaestus, after the Greek god of blacksmithing. Hephaestus is the roaring forge where we shape our ML models.
If you're an engineer/researcher burning with ideas for semantic search models that need serious compute, Hephaestus is waiting. https://exa.ai/careers