
How we replaced shared staging with ephemeral deployment preview

At Plane, one slash command provisions a Kubernetes namespace, a Postgres database, Cloudflare DNS, and a signed TLS certificate per pull request, all of it torn down automatically after merge.

Goutham P. and Devanshu Arora
19 Feb, 2026

Before we had this system, validating a pull request at Plane meant one of two things: merge it to staging and hope nothing collides, or spin up a test environment manually and spend 30 minutes on infra before you could even look at the change.

An ephemeral deployment preview is a short-lived, fully isolated deployment that mirrors production, created automatically for a pull request and torn down when the PR closes. Unlike a shared staging environment, it exists only for the duration of that PR's review cycle, and each one is self-contained, with its own database, services, and ingress configuration.

We built one for every pull request at Plane. Each PR now gets its own deployment triggered by a single slash command. This is how it works.

The problem we were actually solving

Staging environments break down in a specific way as systems grow. It starts with small things accumulating. One team merges a migration that blocks another team's feature. Two PRs touch the same service and produce an unpredictable combined state. A flaky test passes on staging because the environment is stale. By the time you realize staging has drifted from production, you've already shipped the problem.

At Plane's scale of complexity, with seven services, two Helm chart variants (plane-enterprise and plane-cloud), Postgres, Redis, and separate ingress configurations, shared staging became a liability. We needed environments that were:

- isolated per PR, so changes in one couldn't affect another
- production-like, so what you validate is what you ship
- disposable by design, so they don't accumulate into infrastructure debt
- zero manual work, because manual infra steps are steps that get skipped

The last constraint mattered most. A system engineers won't use might as well not have shipped.

The architecture

The pipeline looks like this:

[Infographic: the preview pipeline, from a /deploy comment through parallel image builds, database provisioning, and Helm deployment to a live, TLS-secured preview URL]

Every stage is automated. The engineer's only job is to type /deploy in a PR comment.

The GitHub Actions workflow picks up the slash command via an issue_comment event trigger, parses the options, and kicks off the pipeline. Build steps run in parallel where possible. A full cycle covering seven Docker image builds, database provisioning, Helm release deployment, and DNS and TLS configuration completes in 10-15 minutes.
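As a sketch, the trigger wiring might look like this. The job names and parsing step are assumptions, not the exact workflow; the `issue_comment` event and the `github.event.issue.pull_request` guard are standard GitHub Actions semantics.

```yaml
# Hypothetical trigger: fire only on PR comments that start with /deploy
on:
  issue_comment:
    types: [created]

jobs:
  deploy-preview:
    # issue_comment fires for plain issues too, so gate on PRs and the command
    if: >-
      github.event.issue.pull_request &&
      startsWith(github.event.comment.body, '/deploy')
    runs-on: ubuntu-latest
    steps:
      - name: Capture the comment for option parsing
        run: echo "COMMENT=${{ github.event.comment.body }}" >> "$GITHUB_ENV"
```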

Building from the merge ref

The first design decision was what to build.

We do not build from the PR branch. We build from:

```plaintext
refs/pull/<PR_NUMBER>/merge
```

This is GitHub's synthetic merge ref, the result of merging the PR branch into the base branch at the time of the workflow run. Building from here means the environment reflects what the code will actually look like post-merge, not what it looks like in isolation on the feature branch. Conflicts surface as build failures, which is the right time to surface them.
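In a GitHub Actions checkout step, targeting the synthetic merge ref looks roughly like this (a sketch using the standard actions/checkout action; the step name is illustrative):

```yaml
- name: Check out the post-merge state of the PR
  uses: actions/checkout@v4
  with:
    # The synthetic ref GitHub maintains for open PRs. If the PR has
    # merge conflicts, this ref is absent and the step fails loudly.
    ref: refs/pull/${{ github.event.issue.number }}/merge
```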

Images are tagged with two identifiers:

```plaintext
ee-<PR_NUMBER>
<commit-short-hash>
```

The PR number tag lets the deploy workflow find and upgrade an existing release when the same PR re-deploys. The commit hash provides traceability: you can always map a running image back to the exact commit that produced it.
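Deriving the two tags is a one-liner each. A minimal sketch, with illustrative values standing in for what the Actions context would supply:

```shell
# Illustrative values; in the real workflow these come from the Actions context
PR_NUMBER=5657
COMMIT_SHA=9fceb02a4d1e8c73b5f60d2a91c4e7f3a8b5d6c0

PR_TAG="ee-${PR_NUMBER}"                            # stable per PR; lets re-deploys find the release
HASH_TAG=$(printf '%s' "$COMMIT_SHA" | cut -c1-7)   # short hash for traceability

echo "${PR_TAG}:${HASH_TAG}"   # ee-5657:9fceb02
```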

Namespace isolation

Each deployment lives in its own Kubernetes namespace:

```plaintext
ee-<PR>-cloud   # Cloud variant
ee-<PR>-ent     # Enterprise variant
```

Namespace isolation is doing meaningful work here. Services, secrets, ingress rules, TLS certificates, and configuration are all scoped to the namespace. Nothing leaks between PR environments. A bad migration in PR 5701 cannot affect PR 5657's database. A secret rotation in one environment doesn't touch another.

This is the part that seems obvious in retrospect but matters enormously in practice. Without hard namespace boundaries, "isolated environments" is more aspiration than reality.
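As a sketch, the namespace itself can carry the PR metadata as labels, which makes later enumeration and cleanup a label query rather than name parsing. The labels here are assumptions for illustration, not necessarily the ones we use:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ee-5657-cloud
  labels:
    # Hypothetical labels; enables e.g. `kubectl get ns -l preview=true`
    preview: "true"
    pr-number: "5657"
    variant: "cloud"
```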

Database provisioning via Postgres operator

Stateful dependencies are where ephemeral environment systems usually break down. You can containerize services easily. Databases are harder because you have to provision them, seed them, and most critically, clean them up reliably.

We use the MoveToKube External PostgreSQL Server Operator, which manages PostgreSQL resources declaratively via Kubernetes CRDs.

Each PR environment gets two CRs applied: a Postgres resource that creates the database, and a PostgresUser resource that creates the user and stores credentials in a Kubernetes Secret.

```yaml
apiVersion: db.movetokube.com/v1alpha1
kind: Postgres
metadata:
  name: ee5657cloud
  namespace: ee-5657-cloud
spec:
  database: ee5657cloud
  dropOnDelete: true
```

```yaml
apiVersion: db.movetokube.com/v1alpha1
kind: PostgresUser
metadata:
  name: ee5657cloud-user
  namespace: ee-5657-cloud
spec:
  database: ee5657cloud
  role: ee5657cloud
  secretName: ee5657cloud-credentials
  privileges: OWNER
```

The critical line is dropOnDelete: true. When the CR is deleted as part of cleanup, the operator automatically drops the database, removes the user, and cleans up the credentials secret without any manual intervention. Each PR gets a predictably named database (ee<PR_NUMBER>cloud or ee<PR_NUMBER>ent), making the naming convention itself a source of traceability.
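Services in the namespace consume the operator-created credentials by referencing that Secret, so no password ever appears in the chart values. A sketch of the wiring in a container spec; the env var name and the key inside the Secret are assumptions here, and the key should be checked against the operator's actual secret format:

```yaml
# Fragment of a container spec in the preview namespace
env:
  - name: POSTGRES_PASSWORD
    valueFrom:
      secretKeyRef:
        name: ee5657cloud-credentials   # as set by secretName in the PostgresUser CR
        key: POSTGRES_PASSWORD          # hypothetical key; verify the operator's convention
```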

TLS and DNS automation

Engineers stop using preview environments that require manual DNS entries or self-signed certificates. We automated the full chain.

Each PR environment gets a public URL with this pattern:

```plaintext
https://feat-<PR>-ee.feat.plane.town      # Enterprise
https://feat-<PR>-cloud.feat.plane.town   # Cloud
```

So PR 5657 gets https://feat-5657-ee.feat.plane.town automatically. The workflow configures Cloudflare DNS, creates a cert-manager TLS issuer, and generates a signed certificate. By the time the workflow posts the URL to the PR comment, the URL is live and HTTPS is working.
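A sketch of the cert-manager side of that chain, assuming a DNS-01 issuer backed by Cloudflare; the issuer name and secret name are illustrative:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: ee-5657-cloud-tls
  namespace: ee-5657-cloud
spec:
  secretName: ee-5657-cloud-tls     # TLS secret consumed by the ingress
  dnsNames:
    - feat-5657-cloud.feat.plane.town
  issuerRef:
    name: letsencrypt-dns01         # hypothetical issuer name
    kind: Issuer
```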

The TLS certificate generation adds a few minutes to the total pipeline time, but it means the preview URL is always shareable with a PM, a designer, or a customer, with no certificate warnings, no VPN requirements, and no "just ignore the browser error."

Security model

Preview infrastructure follows the same security practices as production.

We use AWS OIDC role assumption rather than storing long-lived AWS credentials in GitHub. The workflow assumes a short-lived role at runtime. Separate IAM roles handle secrets access and EKS access, following least-privilege. Secrets are pulled from AWS Secrets Manager and masked immediately in workflow logs.

```yaml
# The job also needs `permissions: id-token: write` for OIDC federation
- name: Configure AWS credentials
  uses: aws-actions/configure-aws-credentials@v2
  with:
    role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
    aws-region: us-east-1
```

No static credentials in GitHub Actions secrets. No credentials visible in logs. The preview infrastructure is disposable, but it handles real data patterns, so it needs a real security posture.

The slash command interface

The developer interface is a slash command in any PR comment:

```plaintext
/deploy type=enterprise
/deploy type=cloud
/deploy
```

The type parameter selects which Helm chart to use, either plane-enterprise or plane-cloud. If omitted, it defaults to cloud. Additional options:

| Option | Values | What it does |
| --- | --- | --- |
| `type` | `enterprise`, `cloud` | Selects the Helm chart variant |
| `airgap` | `true`, `false` | Enables air-gapped deployment mode |
| `aio` | `true`, `false` | Builds and deploys the All-in-One image |
| `helm_repo` | URL | Overrides the default Helm repo |
The airgap option came directly from customer need. The same customers who run air-gapped self-hosted Plane instances needed us to validate air-gapped configurations in preview environments, not just in production. Re-running /deploy with the same type on the same PR upgrades the existing Helm release rather than creating a new one, so there's one environment per PR per type at any given time.
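Parsing the command is simple key=value tokenization. A hypothetical sketch of that step (the real workflow does this inside GitHub Actions; defaults mirror the options table above):

```shell
# Hypothetical parser for the /deploy comment body
COMMENT="/deploy type=enterprise airgap=true"

TYPE=cloud     # default when type= is omitted
AIRGAP=false
AIO=false

for token in ${COMMENT#/deploy}; do
  case "$token" in
    type=*)   TYPE="${token#type=}" ;;
    airgap=*) AIRGAP="${token#airgap=}" ;;
    aio=*)    AIO="${token#aio=}" ;;
  esac
done

echo "type=$TYPE airgap=$AIRGAP aio=$AIO"
```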

Cleanup: the part that makes it sustainable

An ephemeral environment system without reliable cleanup is just a slower way to accumulate infrastructure debt.

A scheduled GitHub Actions workflow runs daily at midnight:

```yaml
on:
  schedule:
    - cron: "0 0 * * *"   # GitHub Actions cron runs in UTC
```

It enumerates all Helm releases matching the naming pattern ee-<PR>-ent or ee-<PR>-cloud, checks each PR's state against the GitHub API, and for any PR that is MERGED or CLOSED, runs the full teardown sequence:

```shell
helm uninstall ee-${PR}-ent --namespace ee-${PR}-ent
kubectl delete namespace ee-${PR}-ent
kubectl delete pv -l release=ee-${PR}-ent
kubectl delete postgres ee${PR}ent -n ee-${PR}-ent
kubectl delete postgresuser ee${PR}ent-user -n ee-${PR}-ent
```

Because dropOnDelete: true is set on the Postgres CRs, deleting the CR triggers the operator to drop the database and user automatically. The teardown sequence handles the Kubernetes resources; the operator handles the database-layer cleanup.
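The enumeration step reduces to parsing release names back into PR numbers. A minimal sketch of that parsing, with a hardcoded list standing in for `helm list` output; the real job also checks each PR's state via the GitHub API before tearing anything down:

```shell
# Name-parsing step of the daily sweep (release list hardcoded for illustration)
RELEASES="ee-5657-cloud ee-5701-ent"

for release in $RELEASES; do
  rest="${release#ee-}"    # e.g. 5657-cloud
  pr="${rest%%-*}"         # e.g. 5657
  variant="${rest#*-}"     # e.g. cloud
  echo "$release -> PR $pr ($variant)"
done
```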

Without this, every merged PR leaves behind a namespace, a database, persistent volumes, TLS certificates, and Cloudflare DNS records. That accumulates fast on an active team. With the daily cleanup job, the infrastructure stays clean without anyone thinking about it.

What we learned

A few things that weren't obvious until we ran this in practice.

Merge conflicts fail loudly, which is the right behavior. Building from refs/pull/<PR>/merge means a PR with merge conflicts fails at the build step, not silently at runtime. Engineers learned quickly to resolve conflicts before deploying. This turned out to be a better forcing function than we expected.

New environment variables are a coordination point. If a PR introduces a new required env var, the preview Helm chart needs to be updated to include it before the deployment will succeed. This sounds like friction, but it's actually correct behavior. It forces the team to think about configuration as a first-class concern before merge, not after.

Parallel builds matter more than you'd expect. Seven services built sequentially adds up. Running image builds in parallel cut the average pipeline time from 25+ minutes to 10-15 minutes. That's the difference between "I'll check back after lunch" and "I'll check back after this meeting."

Where it's headed

The current system handles the core loop well. We're working on a few extensions.

Automatic deployment on PR open, without requiring the /deploy command for certain PR labels or paths. For changes that always need preview validation, removing the manual trigger removes a step where engineers forget.

Preview environment diffing, surfacing what changed between two deployments of the same PR as the branch evolves, not just whether the current deployment works.

Extending the airgap option to support full network policy enforcement in the preview namespace, not just configuration flags, so air-gapped behavior is validated at the network layer, not just the application layer.

Custom domain support for PR environments, so teams can share previews under a different domain rather than the auto-generated URL. This matters most for customer-facing validation, where a branded URL removes friction from getting external feedback early.

The distance that matters

The metric we think about isn't deployment time. It's the distance between opening a PR and validating it in real infrastructure.

Before this system, that distance was measured in coordination overhead, staging queue time, and manual setup steps. It was long enough that engineers regularly skipped preview validation entirely and relied on code review alone.

Now it's 15 minutes and one slash command. That's a different kind of confidence in what ships.

If you're running a multi-service platform and your staging story is "one shared environment that everyone writes to," this architecture is worth building. The Postgres operator pattern for database lifecycle management is the part most teams skip, and it's the part that makes cleanup actually work.
