May 29, 2026

Upgrading Postgres Clusters With Minimal Downtime

Learn how we upgraded Amazon Aurora PostgreSQL from 14 to 17 with zero downtime using a custom Blue/Green deployment, logical replication, and fully automated cutover.

Tony Li / Software Engineer

Topics

Engineering

Upgrading a database will never be exciting. But with the right design and automation, it doesn't have to be scary either.

We needed to upgrade our Aurora Postgres clusters from version 14 to 17 across multiple production environments powering real-time payment systems. Aurora had just added support for version 17 with features we wanted to adopt. But the constraints were strict: no downtime visible to the users, no data loss, and no manual, one-off processes. The entire upgrade had to be automated and repeatable, so future upgrades could follow the same process rather than being treated as one-off efforts.

The Constraints

A cell in our system is an independent, self-contained production environment that serves a subset of traffic and can be operated or upgraded in isolation. Each of our production cells runs its own Aurora Postgres cluster. Applications connect through multiple paths:

Most workloads through PgBouncer in transaction-pooling mode
Some containers directly to the writer endpoint (these workloads require features that PgBouncer in transaction-pooling mode cannot provide, such as WITH HOLD cursors)

ParadeDB is a separate Postgres-based system used for search, which continuously replicates data from our primary database via logical replication.

ParadeDB was a key constraint because it’s an external Postgres system continuously replicating from our Aurora primary via replication slots. It connects through an NLB whose target group is kept pointed at the current primary’s IP by a Lambda that runs every 2 minutes and resolves the DNS of the Aurora cluster’s writer endpoint.
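The core decision such a Lambda has to make can be sketched in a few lines of Python. This is an illustrative sketch, not the actual Lambda: the function name, the endpoint hostname, and the injected resolver are assumptions, and a real handler would pass the result to the ELBv2 register/deregister target calls.

```python
import socket
from typing import Callable, Iterable


def plan_target_changes(
    writer_endpoint: str,
    registered_ips: Iterable[str],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> tuple[list[str], list[str]]:
    """Return (ips_to_register, ips_to_deregister) for the NLB target group.

    writer_endpoint is the Aurora writer DNS name; registered_ips are the
    IPs currently in the target group. Both are supplied by the caller.
    """
    # Resolve the writer endpoint to whichever instance is primary right now.
    primary_ip = resolve(writer_endpoint)
    registered = set(registered_ips)

    # Register the primary only if it is not already a target.
    to_register = [] if primary_ip in registered else [primary_ip]
    # Anything else in the target group is stale and should be removed.
    to_deregister = sorted(registered - {primary_ip})
    return to_register, to_deregister
```

Injecting the resolver keeps the decision logic testable without DNS or AWS access, and it makes the function a no-op when the target group already matches the primary.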

The presence of active replication slots for ParadeDB meant we couldn't use Aurora's native Blue/Green upgrade feature (which creates a fully synchronized copy of the database and allows traffic to be switched over). AWS doesn't allow creating a green deployment when replication slots exist. Deleting them would cause ParadeDB to fall behind and require manual backfills—not an acceptable tradeoff.

All this meant we needed our own Blue/Green upgrade: one that could maintain replication, preserve users and roles, and switch traffic seamlessly.

At a high level, our approach still follows the familiar Blue/Green pattern: replicate, then cut over. However, because Aurora’s native implementation doesn’t support active logical replication slots, we had to build a custom workflow to preserve replication and orchestrate the transition safely.

Design Constraints

We built the solution around a few non-negotiables:

No data loss: All writes replicated and confirmed before switching.
Zero downtime: No lost connections or paused systems visible to users.
Full automation: Every step repeatable through infrastructure-as-code and AWS Step Functions.

The high-level approach: stand up a new Aurora cluster (the "green" database) running Postgres 17, replicate data from the existing Postgres 14 cluster (the "blue" database) using logical replication, and switch connections once both are in sync.

Testing in a Demo Cell

Before rolling this out to production, we ran the full upgrade process end-to-end in a demo cell. This let us validate the replication setup, switchover sequence, and automation under realistic conditions.

Testing in isolation helped catch issues early—such as configuration assumptions and credential handling—and gave us confidence that the process would behave as expected in production.

The Phases

The upgrade process unfolds in two main phases: setup and switchover.

Setup Phase

Provision the new cluster: Terraform brings up the green database alongside the existing blue one. The green instance is monitored by its own Datadog agent so replication lag can be tracked independently.

Establish logical replication: Publications and subscriptions mirror every table from blue to green. In larger clusters, multiple publications and subscriptions are created, with high throughput tables (tables with sustained high write rates, where a single replication slot cannot keep up) getting their own dedicated publications and subscriptions.
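To make the publication layout concrete, here is a minimal Python sketch that groups tables into publications and emits the corresponding Postgres statements. The publication naming scheme, the example table names, and the helper functions are hypothetical; only the `CREATE PUBLICATION` and `CREATE SUBSCRIPTION` syntax comes from Postgres itself. It assumes unqualified table names and `standard_conforming_strings = on`.

```python
def quote_ident(name: str) -> str:
    """Quote a single (unqualified) SQL identifier."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value: str) -> str:
    """Quote a SQL string literal (assumes standard_conforming_strings = on)."""
    return "'" + value.replace("'", "''") + "'"


def plan_publications(tables: list[str], high_throughput: set[str]) -> dict[str, list[str]]:
    """Map publication name -> tables.

    Ordinary tables share one publication; each high-throughput table gets
    a dedicated publication so it also gets its own replication slot.
    """
    plan: dict[str, list[str]] = {}
    shared = [t for t in tables if t not in high_throughput]
    if shared:
        plan["upgrade_pub_main"] = shared  # hypothetical name
    for table in tables:
        if table in high_throughput:
            plan[f"upgrade_pub_{table}"] = [table]  # hypothetical naming scheme
    return plan


def publication_sql(name: str, tables: list[str]) -> str:
    """Statement to run on the blue (source) cluster."""
    table_list = ", ".join(quote_ident(t) for t in tables)
    return f"CREATE PUBLICATION {quote_ident(name)} FOR TABLE {table_list};"


def subscription_sql(name: str, conninfo: str, publication: str) -> str:
    """Statement to run on the green (target) cluster."""
    return (
        f"CREATE SUBSCRIPTION {quote_ident(name)} "
        f"CONNECTION {quote_literal(conninfo)} "
        f"PUBLICATION {quote_ident(publication)};"
    )
```

Each subscription consumes from its own slot on the publisher, which is why splitting hot tables into separate publication/subscription pairs spreads the apply work.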

Monitor replication lag: Custom queries in our observability platform continuously report lag. Once all tables are copied over and lag is near-zero, the system is ready to begin switchover.
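One way to express that readiness check is sketched below. The two queries use standard Postgres catalog views and functions, but whether the monitoring here uses these exact queries is an assumption, as are the function name and the `max_lag_bytes` threshold (the article only says "near-zero").

```python
# Run on the blue (publisher) cluster: bytes of WAL each logical slot has
# not yet confirmed as flushed by its subscriber.
LAG_QUERY = """
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE slot_type = 'logical';
"""

# Run on the green (subscriber) cluster: per-table sync state.
# srsubstate = 'r' means the initial copy is done and the table is streaming.
TABLE_STATE_QUERY = """
SELECT srrelid::regclass AS table_name, srsubstate
FROM pg_subscription_rel;
"""


def ready_for_switchover(
    lag_bytes_by_slot: dict[str, int],
    table_states: dict[str, str],
    max_lag_bytes: int,
) -> bool:
    """True when every table has finished copying and every slot is caught up.

    The inputs are the results of the two queries above, already fetched by
    the caller. max_lag_bytes is a caller-chosen threshold for "near-zero".
    """
    if not lag_bytes_by_slot:
        # No slots reported: treat as not ready rather than vacuously ready.
        return False
    all_copied = all(state == "r" for state in table_states.values())
    caught_up = all(
        lag is not None and lag <= max_lag_bytes
        for lag in lag_bytes_by_slot.values()
    )
    return all_copied and caught_up
```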

Switchover Phase

During the switchover window, we saw a brief latency increase, as expected, due to connection draining and traffic redirection, but it remained within our SLOs.

We still notified customers ahead of time to set expectations, but the transition completed without any user-visible disruption or other issues.

The switchover sequence is carefully ordered to ensure no stray writes and no missed replication events:

Scale down the Sidekiq queues that connect directly to Aurora. This gives job queues 30 seconds to gracefully finish their in-flight jobs.
Pause PgBouncer for writes, allowing in-flight writes to complete while blocking new ones. Revoke write privileges on the blue database and kill active connections, ensuring no further writes.
Delete ParadeDB subscriptions, so ParadeDB stops reading from the old cluster.
Rename clusters: the green database takes the old identifier, replacing blue. This results in a DNS switchover, where the new cluster effectively takes over the old cluster’s endpoints. Before renaming, the operation waits for replication lag to reach 0 (writes have been paused by this point). If lag doesn’t reach 0 within 2 minutes, the rename is aborted and the rest of the pipeline continues as is.
Refresh ParadeDB's primary endpoint through its Lambda, updating the NLB target group to the new cluster.
Recreate ParadeDB subscriptions with copy_data=false, allowing it to resume instantly. This is critical to do before resuming writes to ensure no missed data in ParadeDB.
Resume PgBouncer.
Scale up the previously scaled-down Sidekiq queues.

At this point, every application is reading and writing against the Postgres 17 database, with ParadeDB fully resynchronized, all without downtime.
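The lag gate in the rename step is the part of the sequence with real control flow, so it is worth sketching. The Python below is an illustration under assumptions: `get_lag_bytes` and `rename_clusters` are hypothetical callables standing in for the lag query and the cluster rename, and the 1-second poll interval is invented. Only the 2-minute timeout and the abort-on-timeout behavior come from the steps above.

```python
import time
from typing import Callable


def wait_for_zero_lag(
    get_lag_bytes: Callable[[], int],
    timeout_s: float = 120.0,  # the 2-minute limit from the rename step
    poll_s: float = 1.0,       # hypothetical poll interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until replication lag is exactly 0; return False on timeout."""
    deadline = clock() + timeout_s
    while True:
        if get_lag_bytes() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


def rename_step(
    get_lag_bytes: Callable[[], int],
    rename_clusters: Callable[[], None],
    **gate_options,
) -> str:
    """Rename only once lag has reached 0; otherwise skip the rename."""
    if wait_for_zero_lag(get_lag_bytes, **gate_options):
        rename_clusters()
        return "renamed"
    # Lag never reached 0 in time: the rename is aborted and the caller
    # carries on with the remaining steps.
    return "rename_aborted"
```

Because writes are already paused when this runs, lag should only ever decrease, so a strict `== 0` check is a reasonable gate here rather than a threshold.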

Automating the Orchestration

Every step runs as an independent operation orchestrated through a state machine. This design gives us visibility, retries, and structured failure handling. Each Lambda performs a single operation—pausing PgBouncer, checking replication lag, or renaming databases—and passes outputs to the next state.

Choosing Step Functions over manual workflows or ad hoc task runners made the process auditable and repeatable. The state machine can be triggered per cell, and the same workflow runs start to finish with minimal supervision.
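As a rough illustration of what such a workflow looks like, the Python sketch below builds an Amazon States Language definition in which each step is a Lambda task with retries and a shared failure state. The state names, ARNs, retry values, and the `NotifyFailure` state are all hypothetical; the field names (`StartAt`, `States`, `Retry`, `Catch`, and so on) are standard Amazon States Language.

```python
import json
from typing import Optional


def task_state(lambda_arn: str, next_state: Optional[str], on_failure: str) -> dict:
    """One Lambda-backed task with retry and a catch-all failure route."""
    state = {
        "Type": "Task",
        "Resource": lambda_arn,
        # Retry transient task failures with exponential backoff
        # (interval, attempts, and backoff are hypothetical values).
        "Retry": [{
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 5,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }],
        # Anything still failing after retries goes to the failure state.
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": on_failure}],
    }
    if next_state is None:
        state["End"] = True
    else:
        state["Next"] = next_state
    return state


def build_state_machine(steps: list[tuple[str, str]]) -> dict:
    """steps is an ordered list of (state_name, lambda_arn) pairs."""
    failure = "NotifyFailure"  # hypothetical terminal failure state
    states = {}
    for i, (name, arn) in enumerate(steps):
        next_name = steps[i + 1][0] if i + 1 < len(steps) else None
        states[name] = task_state(arn, next_name, failure)
    states[failure] = {
        "Type": "Fail",
        "Error": "SwitchoverFailed",
        "Cause": "A switchover step failed after retries",
    }
    return {
        "Comment": "Sketch of a per-cell switchover workflow",
        "StartAt": steps[0][0],
        "States": states,
    }


if __name__ == "__main__":
    arn = "arn:aws:lambda:us-east-1:123456789012:function:"  # placeholder account
    definition = build_state_machine([
        ("PausePgBouncer", arn + "pause-pgbouncer"),
        ("CheckReplicationLag", arn + "check-replication-lag"),
        ("RenameClusters", arn + "rename-clusters"),
    ])
    print(json.dumps(definition, indent=2))
```

Generating the definition from an ordered list keeps the step order in one place, which fits the one-operation-per-Lambda design described above.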

Users, Roles, and SSM Integration

User management was another subtle challenge. Each cell stores database credentials in SSM parameters—writers, read-only users, and PgBouncer users all have separate paths. To maintain zero downtime, we recreated users in the green database with the same passwords pulled from SSM. That allowed PgBouncer and application connections to continue without reconfiguration.

We automated the process of reading SSM parameters and applying corresponding SQL grants, preserving roles, privileges, and memberships. During testing, we discovered SSM rate-limit issues—a reminder to batch operations and increase API quotas before scaling to all cells.
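The two mechanical pieces here, batching SSM reads and turning credentials into SQL, can be sketched as follows. This is a hedged example: the helper names, role names, and parameter paths are hypothetical, and it assumes passwords are fetched with the SSM `GetParameters` API, which accepts at most 10 names per call.

```python
from typing import Iterable, Iterator

# GetParameters accepts at most 10 parameter names per request.
SSM_GET_PARAMETERS_MAX = 10


def chunk_names(names: Iterable[str], size: int = SSM_GET_PARAMETERS_MAX) -> Iterator[list[str]]:
    """Split parameter names into batches so each batch is one API call."""
    names = list(names)
    for start in range(0, len(names), size):
        yield names[start:start + size]


def quote_ident(name: str) -> str:
    """Quote a single SQL identifier."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value: str) -> str:
    """Quote a SQL string literal (assumes standard_conforming_strings = on)."""
    return "'" + value.replace("'", "''") + "'"


def recreate_role_sql(role: str, password: str, member_of: Iterable[str] = ()) -> list[str]:
    """Statements to recreate a login role on green with its existing password.

    member_of lists the roles whose membership should be granted to the user.
    """
    statements = [
        f"CREATE ROLE {quote_ident(role)} LOGIN PASSWORD {quote_literal(password)};"
    ]
    for parent in member_of:
        statements.append(f"GRANT {quote_ident(parent)} TO {quote_ident(role)};")
    return statements
```

Batching cuts the number of SSM calls by up to a factor of ten compared with one call per parameter, which helps with the rate limits mentioned above. In practice the generated statements contain passwords and should never be logged.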

Results and Lessons

Reliable systems come down to eliminating uncertainty before making irreversible decisions.

The cutover itself isn’t the hard part; everything leading up to it is. You want to get the system into a state where the final step is boring and predictable. In our case, that meant treating replication lag as a hard gate, not just something to watch. We don’t move forward until lag is zero, no writers remain, and everything has caught up.

Another lesson is that your system isn’t just your primary database. Everything around it — connection pools, background workers, direct writer connections, downstream consumers — all contribute to whether the system is actually “safe.” If even one of those is still writing or reading from the wrong place, the whole assumption breaks.

It also helps to design workflows as a sequence of small, explicit steps. Each step should do one thing and be easy to verify. That’s what makes it possible to reason about the system, retry safely, and avoid getting stuck in partial states.

At a high level, the pattern is straightforward: run the new system in parallel, keep it in sync, stop all writes in a controlled way, verify you’re in a truly safe state, and only then switch over. Being strict about those boundaries is what turns a risky migration into something routine.

Looking Ahead

This work has become our blueprint for future Aurora major version upgrades. The entire sequence—provisioning, replication, cutover, and validation—is now codified and can be triggered through infrastructure automation.

Our next step is to extend the same framework to other shared databases.

Authors

Tony Li, Software Engineer

Tony Li is a software engineer at Modern Treasury, focused on infrastructure and database systems. He designs and operates distributed systems built to stay reliable at scale.