Scaling PostgreSQL to Millions of Queries Per Second: Lessons from OpenAI
How OpenAI scaled PostgreSQL to handle 800 million ChatGPT users with a single primary and 50 read replicas. Practical insights for database engineers.
When OpenAI's ChatGPT launched, traffic exploded at an unprecedented rate. Their PostgreSQL database faced a challenge that would make most database engineers break into a cold sweat: scaling a single-primary PostgreSQL setup to handle 800 million users and millions of queries per second.
The conventional wisdom says you can't scale PostgreSQL this way. You need sharding, you need distributed systems, you need to abandon the single-primary architecture. But OpenAI proved that wrong.
They scaled PostgreSQL to millions of QPS with a single primary and nearly 50 read replicas, maintaining low double-digit millisecond p99 latency and five-nines availability. Over the past year, their PostgreSQL load grew by more than 10x, and they've had only one SEV-0 incident.
This isn't just impressive—it's a blueprint for what's possible when you optimize relentlessly and understand your database at a deep level.
The Reality Check: PostgreSQL at Scale
PostgreSQL has been powering OpenAI's core products like ChatGPT and their API platform for years. But as user growth accelerated, the database infrastructure had to evolve rapidly.
The initial architecture was straightforward: a primary-replica setup with one writer and multiple readers. This worked well initially, but as traffic grew, cracks began to show.
The most critical insight? The bottleneck isn't read scalability—it's write requests. PostgreSQL's multiversion concurrency control (MVCC) implementation, while excellent for consistency, creates significant overhead for write-heavy workloads.
When an upstream issue causes a sudden spike in database load—whether from cache misses, expensive multi-way joins, or write storms—query latency rises. Requests begin to time out. Retries amplify the load, creating a vicious cycle that can degrade entire services.
MVCC's write amplification problem is real. When you update a tuple or even a single field, PostgreSQL copies the entire row to create a new version. Under heavy write loads, this results in significant write amplification and increased read amplification as queries must scan through multiple tuple versions (dead tuples) to retrieve the latest one.
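To see this in practice, here's a minimal monitoring sketch (not OpenAI's tooling) that flags tables where dead tuples are piling up faster than autovacuum reclaims them. It assumes the psycopg2 driver and a connection string in an environment variable; the 20% threshold is purely illustrative.

```python
# Flag tables where MVCC row versioning has left a large share of dead tuples.
# Hedged sketch: assumes psycopg2 and a DSN in POSTGRES_DSN; threshold is illustrative.
import os
import psycopg2

BLOAT_RATIO_THRESHOLD = 0.2  # flag tables where >20% of tuples are dead

def report_dead_tuple_bloat(dsn: str) -> None:
    query = """
        SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
        FROM pg_stat_user_tables
        WHERE n_dead_tup > 0
        ORDER BY n_dead_tup DESC
        LIMIT 20;
    """
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(query)
        for relname, live, dead, last_autovacuum in cur.fetchall():
            ratio = dead / max(live + dead, 1)
            if ratio > BLOAT_RATIO_THRESHOLD:
                print(f"{relname}: {dead} dead tuples "
                      f"({ratio:.0%} of total), last autovacuum: {last_autovacuum}")

if __name__ == "__main__":
    report_dead_tuple_bloat(os.environ["POSTGRES_DSN"])
```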
The Strategy: Minimize Writes, Maximize Reads
OpenAI's approach was methodical: reduce write pressure on the primary at all costs, and optimize read distribution across replicas.
Offloading Write-Heavy Workloads
The first major decision: migrate shardable, write-heavy workloads to sharded systems like Azure Cosmos DB. This wasn't just about capacity—it was about protecting the primary from write storms.
New workloads default to sharded systems. They no longer allow adding new tables to the current PostgreSQL deployment. The key insight: Not all workloads need to be in PostgreSQL. Identify what can be sharded, migrate it, and keep only what truly benefits from PostgreSQL's ACID guarantees.
Reducing Primary Load
With only one writer, every write matters. OpenAI minimized load on the primary by:
- Offloading all possible reads to replicas, keeping reads on the primary only when they're part of write transactions
- Fixing application bugs that caused redundant writes
- Introducing lazy writes to smooth out traffic spikes
- Enforcing strict rate limits when backfilling table fields
The goal: Ensure the primary has sufficient capacity to handle write spikes without degrading service.
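To make the last two ideas concrete, here's a hedged sketch of a rate-limited backfill: update rows in small batches and pause between them so the backfill never competes with foreground writes on the primary. The table and column names are hypothetical, and the batch size and pause are illustrative.

```python
# Rate-limited backfill sketch (not OpenAI's tooling): bounded batches plus a
# pause between them smooth the write load instead of creating a write storm.
import time
import psycopg2

BATCH_SIZE = 500          # rows per batch (illustrative)
PAUSE_SECONDS = 0.5       # throttle between batches (illustrative)

def backfill_display_name(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn:
        while True:
            with conn.cursor() as cur:
                # Backfill only a bounded slice of rows per transaction.
                cur.execute(
                    """
                    UPDATE users
                    SET display_name = email
                    WHERE display_name IS NULL
                      AND id IN (
                          SELECT id FROM users
                          WHERE display_name IS NULL
                          LIMIT %s
                      )
                    """,
                    (BATCH_SIZE,),
                )
                updated = cur.rowcount
            conn.commit()
            if updated == 0:
                break                  # nothing left to backfill
            time.sleep(PAUSE_SECONDS)  # smooth the write load on the primary
```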
Query Optimization: The Hidden Killer
One expensive query can bring down an entire service. OpenAI identified several problematic patterns:
Complex multi-table joins are particularly dangerous. They once found a query joining 12 tables that was responsible for high-severity SEVs during traffic spikes. Break down complex joins and move join logic to the application layer when possible.
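As a rough illustration of that decomposition (using hypothetical conversations and users tables, not OpenAI's schema), the sketch below replaces one wide join with two narrow queries and stitches the results together in application code, where the second lookup could just as well be served from a replica or a cache.

```python
# Application-side join sketch: two simple queries instead of one wide join.
# Table and column names are hypothetical.
import psycopg2

def recent_conversations_with_authors(dsn: str, limit: int = 50) -> list[dict]:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT id, user_id, title FROM conversations "
            "ORDER BY created_at DESC LIMIT %s",
            (limit,),
        )
        conversations = cur.fetchall()

        # Second, narrow lookup; psycopg2 expands a Python tuple for IN (...).
        user_ids = tuple({row[1] for row in conversations}) or (None,)
        cur.execute("SELECT id, name FROM users WHERE id IN %s", (user_ids,))
        users_by_id = dict(cur.fetchall())

    # Combine the two result sets in memory rather than in the database.
    return [
        {"id": cid, "title": title, "author": users_by_id.get(uid)}
        for cid, uid, title in conversations
    ]
```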
ORM-generated queries are another common culprit. Review the SQL that ORMs produce and ensure it behaves as expected. Many problematic queries come from ORMs generating inefficient joins or missing indexes.
Long-running queries and sessions left idle inside an open transaction can block autovacuum from cleaning up dead tuples. Configuring timeouts such as idle_in_transaction_session_timeout is essential.
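Here's a hedged example of applying those timeouts at the role level so application connections inherit them; the role name and values are illustrative, not OpenAI's configuration.

```python
# Apply timeouts to a role so every new session for that role inherits them.
# Role name and values are illustrative.
import psycopg2

def apply_timeouts(dsn: str, role: str = "app_rw") -> None:
    statements = [
        # Abort statements that run longer than 30 seconds.
        f"ALTER ROLE {role} SET statement_timeout = '30s'",
        # Kill sessions that sit idle inside an open transaction,
        # which would otherwise hold back autovacuum.
        f"ALTER ROLE {role} SET idle_in_transaction_session_timeout = '60s'",
    ]
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for stmt in statements:
            cur.execute(stmt)
    # Settings take effect for new sessions started by that role.
```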
High Availability and Workload Isolation
A single-primary architecture means a single point of failure. OpenAI mitigated this by offloading critical reads to replicas (so reads continue even if the primary fails) and running the primary in High-Availability (HA) mode with a hot standby for quick failover.
Workload isolation prevents noisy neighbors. OpenAI splits requests into low-priority and high-priority tiers, routing them to separate instances. This ensures resource-intensive workloads don't degrade performance for critical requests.
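One simple way to picture this routing, assuming two hypothetical instance pools ("critical" and "batch") configured out of band: low-priority work draws from its own small pool and can never exhaust the connections reserved for high-priority requests.

```python
# Priority-based routing sketch; hostnames and pool sizes are hypothetical.
import psycopg2.pool

POOLS = {
    "high": psycopg2.pool.ThreadedConnectionPool(
        minconn=5, maxconn=50, dsn="postgresql://app@replica-critical/db"
    ),
    "low": psycopg2.pool.ThreadedConnectionPool(
        minconn=1, maxconn=10, dsn="postgresql://app@replica-batch/db"
    ),
}

def run_read_query(sql: str, params: tuple, priority: str = "low"):
    # Route the query to the instance pool that matches its priority tier.
    pool = POOLS["high"] if priority == "high" else POOLS["low"]
    conn = pool.getconn()
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()
    finally:
        pool.putconn(conn)
```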
Connection Pooling: The 5,000 Connection Limit
Each PostgreSQL instance has a maximum connection limit (5,000 in Azure PostgreSQL). It's easy to run out of connections or accumulate too many idle ones. OpenAI has had incidents caused by connection storms that exhausted all available connections.
The solution: PgBouncer as a proxy layer to pool database connections. Running it in statement or transaction pooling mode allows efficient connection reuse, greatly reducing the number of active client connections.
The performance impact: Average connection time dropped from 50 ms to 5 ms. Regional co-location is critical: co-locate the proxy, clients, and replicas in the same region to minimize network overhead.
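As a minimal sketch of what this looks like from the application side (host, port, and credentials are illustrative), the client simply points at a co-located PgBouncer, conventionally on port 6432, assumed here to run with pool_mode = transaction.

```python
# Route application connections through PgBouncer rather than directly at
# PostgreSQL. DSN values are illustrative; PgBouncer multiplexes these cheap
# client connections onto a small, reused pool of server connections.
import psycopg2

PGBOUNCER_DSN = "host=127.0.0.1 port=6432 dbname=app user=app_rw"

def fetch_one(sql: str, params: tuple):
    with psycopg2.connect(PGBOUNCER_DSN) as conn, conn.cursor() as cur:
        cur.execute(sql, params)
        return cur.fetchone()
```

One design caveat with transaction pooling: avoid per-session state such as session-level prepared statements or SET commands, because the underlying server connection can change between transactions.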
Caching: Preventing Cache-Miss Storms
A sudden spike in cache misses can trigger a surge of reads on the PostgreSQL database, saturating CPU and slowing user requests.
The solution: Cache locking (and leasing) mechanism. When multiple requests miss on the same cache key, only one request acquires the lock and proceeds to retrieve the data and repopulate the cache. All other requests wait for the cache to be updated rather than all hitting PostgreSQL at once.
This significantly reduces redundant database reads and protects the system from cascading load spikes.
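Here's a hedged sketch of the idea using Redis as the cache; the key names, TTLs, and fetch function are hypothetical. Only the request that wins the lock queries PostgreSQL, and everyone else briefly polls the cache instead.

```python
# Cache-locking sketch: one winner repopulates the cache, the rest wait.
# Key names, TTLs, and the fetch function are hypothetical.
import json
import time
import redis

r = redis.Redis()

def get_with_lock(key: str, fetch_from_postgres, ttl: int = 300):
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)

    # Try to acquire a short-lived lock; nx=True means only one caller wins.
    if r.set(f"lock:{key}", "1", nx=True, ex=10):
        value = fetch_from_postgres()          # single database read
        r.set(key, json.dumps(value), ex=ttl)  # repopulate the cache
        r.delete(f"lock:{key}")
        return value

    # Losers wait for the winner to repopulate instead of hitting PostgreSQL.
    for _ in range(50):
        time.sleep(0.1)
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)
    return fetch_from_postgres()  # fallback if the winner failed
```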
Scaling Read Replicas: The WAL Challenge
OpenAI operates nearly 50 read replicas across multiple geographic regions to minimize latency. But here's the challenge: the primary must stream Write-Ahead Log (WAL) data to every replica.
As the number of replicas increases, the primary must ship WAL to more instances, increasing pressure on both network bandwidth and CPU. This causes higher and more unstable replica lag.
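A small sketch for keeping an eye on per-replica lag from the primary, assuming PostgreSQL 10 or later (for the *_lag columns) and the psycopg2 driver:

```python
# Report per-replica lag as seen from the primary via pg_stat_replication.
import psycopg2

def show_replica_lag(primary_dsn: str) -> None:
    query = """
        SELECT application_name, state, write_lag, flush_lag, replay_lag
        FROM pg_stat_replication
        ORDER BY replay_lag DESC NULLS LAST;
    """
    with psycopg2.connect(primary_dsn) as conn, conn.cursor() as cur:
        cur.execute(query)
        for name, state, write_lag, flush_lag, replay_lag in cur.fetchall():
            print(f"{name}: state={state} write={write_lag} "
                  f"flush={flush_lag} replay={replay_lag}")
```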
The solution: Cascading replication, where intermediate replicas relay WAL to downstream replicas. This approach allows scaling to potentially over a hundred replicas without overwhelming the primary.
However, cascading replication introduces operational complexity around failover management. OpenAI is working with Azure to ensure it's robust before rolling it out to production.
Rate Limiting and Schema Management
A sudden traffic spike, surge of expensive queries, or retry storm can quickly exhaust critical resources. The solution: Rate-limiting across multiple layers—application, connection pooler, proxy, and query. Avoid overly short retry intervals that can trigger retry storms.
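On the retry side, here's a hedged sketch of capped, jittered exponential backoff, which keeps a burst of failures from turning into synchronized waves of retries; the error class and limits are illustrative.

```python
# Retries with capped, full-jitter exponential backoff to avoid retry storms.
import random
import time
import psycopg2

MAX_ATTEMPTS = 4
BASE_DELAY = 0.2   # seconds (illustrative)
MAX_DELAY = 5.0

def run_with_backoff(dsn: str, sql: str, params: tuple):
    for attempt in range(MAX_ATTEMPTS):
        try:
            with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
                cur.execute(sql, params)
                return cur.fetchall()
        except psycopg2.OperationalError:
            if attempt == MAX_ATTEMPTS - 1:
                raise
            # Full jitter: sleep a random amount up to the exponential cap.
            cap = min(MAX_DELAY, BASE_DELAY * (2 ** attempt))
            time.sleep(random.uniform(0, cap))
```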
Schema changes require extreme caution. Even a small schema change, such as altering a column type, can trigger a full table rewrite. OpenAI only permits lightweight schema changes that don't trigger table rewrites, enforces a strict 5-second timeout on schema changes, and applies strict rate limits when backfilling table fields.
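An illustrative sketch of what a "lightweight only" change might look like: short lock and statement timeouts so the DDL gives up quickly instead of queueing behind live traffic, plus a metadata-only operation such as adding a nullable column with no default. Table and column names are hypothetical.

```python
# Lightweight schema change sketch: fail fast on locks, never rewrite the table.
import psycopg2

def add_nullable_column(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        # Give up fast if the lock isn't available; never block live traffic.
        cur.execute("SET lock_timeout = '2s'")
        cur.execute("SET statement_timeout = '5s'")
        # Adding a nullable column with no default is metadata-only;
        # backfilling values happens later, in rate-limited batches.
        cur.execute("ALTER TABLE conversations ADD COLUMN archived_at timestamptz")
```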
The Results: What's Actually Possible
This effort demonstrates that with the right design and optimizations, Azure PostgreSQL can be scaled to handle the largest production workloads.
PostgreSQL handles millions of QPS for read-heavy workloads, powering OpenAI's most critical products. They added nearly 50 read replicas while keeping replication lag near zero, maintained low-latency reads across geo-distributed regions, and built sufficient capacity headroom to support future growth.
The performance metrics: Low double-digit millisecond p99 client-side latency, five-nines availability, and only one SEV-0 incident in the past 12 months (during ChatGPT ImageGen's viral launch when write traffic surged by more than 10x).
Key Takeaways for Your Infrastructure
1. Single-primary PostgreSQL can scale much further than conventional wisdom suggests—if you optimize relentlessly and understand your workload patterns.
2. Write load is the real bottleneck, not read scalability. Minimize writes on the primary through workload migration, application optimization, and lazy writes.
3. Query optimization is non-negotiable. One expensive query can bring down your service. Review ORM-generated SQL, break down complex joins, and configure proper timeouts.
4. Connection pooling is essential. PgBouncer isn't optional at scale—it's a requirement. Co-locate it with your application to minimize latency.
5. Cache-miss storms are real. Implement cache locking/leasing to prevent multiple requests from hitting the database for the same missing key.
6. Workload isolation prevents noisy neighbors. Separate high-priority and low-priority workloads onto different instances.
7. Rate limiting at multiple layers prevents cascading failures from traffic spikes or retry storms.
8. Schema changes require extreme caution. Only lightweight operations, strict timeouts, and rate-limited backfills.
9. Cascading replication is the path to scaling beyond 50 replicas, but it requires careful operational planning.
10. Partnership with your database provider matters for achieving reliable failover and exploring new features.
What This Means for You
If you're running PostgreSQL at scale, OpenAI's experience proves that you don't necessarily need to abandon your single-primary architecture—at least not immediately. With careful optimization, you can scale much further than you might think.
Start with the low-hanging fruit: connection pooling, query optimization, and workload isolation. Then tackle write load reduction, caching strategies, and replica scaling.
The most important takeaway: Understand your workload—what's read-heavy vs. write-heavy, which queries are expensive, where your bottlenecks actually are. Then optimize systematically, measure everything, and iterate.
OpenAI didn't achieve this overnight. It took rigorous optimization and a deep understanding of PostgreSQL's internals. But the results prove it's possible—and the lessons are applicable to any team scaling PostgreSQL in production.
Have you faced similar scaling challenges with PostgreSQL? What strategies have worked (or failed) in your infrastructure? The database engineering community benefits when we share these hard-won lessons.