Graceful Shutdown and Restart Patterns in TypeScript Services
2026-02-27

Practical TypeScript patterns—signal handling, transactional shutdowns, health checks—to keep Node.js services resilient against random process terminations.

Survive the random kills: why graceful shutdowns matter for TypeScript backends in 2026

If you've lost work, leaked connections, or seen half-committed transactions after a deployment or a flaky node restart, you're not alone. Production environments in 2026 are more ephemeral than ever: containers, edge functions, and host autoscalers will kill and restart processes at unpredictable times. The result? Partial requests, stuck jobs, and corrupt state. This article gives you practical, TypeScript-first patterns—signal handling, transactional shutdowns, health checks, and restart strategies—to make your services resilient when processes die at random.

Quick summary (inverted pyramid): what to do first

  • Catch signals (SIGTERM, SIGINT, SIGQUIT) and route them to a shutdown manager.
  • Flip readiness immediately so load balancers stop sending new traffic.
  • Stop accepting work and drain in-flight requests and sockets gracefully.
  • Finish or rollback transactions and finish background jobs deterministically.
  • Enforce timeouts and exit cleanly, letting the orchestrator restart you.

Context: what's changed in 2024–2026 (and why that matters)

By 2026 we see three platform trends that change how graceful shutdowns are implemented:

  • Node.js and the ecosystem have broadly adopted AbortController-first APIs (core fs, timers, fetch) and many libraries accept an AbortSignal for cancellation—use it to cancel pending work.
  • Containers, Kubernetes, and service meshes are default infrastructure; readiness and liveness probes, preStop hooks, and terminationGracePeriodSeconds are table stakes for shutdown logic.
  • Chaos engineering is mainstream—teams regularly run automated killers (Chaos Mesh, Pumba) and expect apps to survive unexpected SIGKILLs as much as planned redeploys.
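To make the AbortController point concrete, here is a minimal sketch of a cancellable wait built on Node's signal-aware `timers/promises` API. Passing a shutdown manager's shared signal into helpers like this is how cancellation propagates; `cancellableWait` is an illustrative name, not a library function.

```typescript
import { setTimeout as sleep } from 'node:timers/promises';

// A cancellable wait: resolves 'completed' if the delay elapses, or
// 'aborted' if the shutdown signal fires first.
async function cancellableWait(
  ms: number,
  signal: AbortSignal,
): Promise<'completed' | 'aborted'> {
  try {
    await sleep(ms, undefined, { signal });
    return 'completed';
  } catch (err) {
    if ((err as Error).name === 'AbortError') return 'aborted';
    throw err; // not a cancellation — rethrow
  }
}
```

The same `{ signal }` option is accepted by `fetch`, `fs/promises`, and a growing number of third-party clients, so one AbortController can cancel an entire tree of pending work.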

Pattern 1 — Centralized graceful shutdown manager (TypeScript)

Instead of scattering signal handlers across files, create a single orchestrator that coordinates shutdown steps and exposes an AbortSignal for the rest of the app to listen to.

Why a manager?

  • Clear lifecycle: readiness flip, stepwise disposal, timeout enforcement.
  • Easy testing: replace timers and probes in unit tests.
  • Consistent logging and observability of shutdown duration and reasons.

Example: graceful-shutdown.ts

/* TypeScript: graceful-shutdown.ts */
import { EventEmitter } from 'events';

export type CleanupFn = (signal?: AbortSignal) => Promise<void> | void;

export class GracefulShutdown extends EventEmitter {
  private cleaners: CleanupFn[] = [];
  private shuttingDown = false;
  private controller = new AbortController();

  constructor(private readonly timeoutMs = 30_000) {
    super();
    this.installSignalHandlers();
  }

  public get signal() { return this.controller.signal; }

  public register(fn: CleanupFn) {
    this.cleaners.push(fn);
  }

  private installSignalHandlers() {
    const handler = (sig: NodeJS.Signals) => {
      // prevent reentrant shutdown
      if (this.shuttingDown) return;
      this.shuttingDown = true;
      this.emit('shutdown', sig);
      void this.shutdown(sig).catch((err) => {
        console.error('shutdown error', err);
        process.exit(1);
      });
    };

    process.on('SIGTERM', () => handler('SIGTERM'));
    process.on('SIGINT', () => handler('SIGINT'));
    process.on('SIGQUIT', () => handler('SIGQUIT'));
  }

  private async shutdown(reason: string | NodeJS.Signals) {
    console.log('Graceful shutdown started:', reason);
    // stop new work by aborting shared signal
    this.controller.abort();

    const timeout = new Promise((_, reject) => {
      setTimeout(() => reject(new Error('shutdown timeout')), this.timeoutMs);
    });

    const runAll = (async () => {
      for (const fn of this.cleaners) {
        try {
          await fn(this.signal);
        } catch (err) {
          console.error('cleanup failed', err);
        }
      }
    })();

    try {
      await Promise.race([runAll, timeout]);
      console.log('Graceful shutdown complete. Exiting.');
      process.exit(0);
    } catch (err) {
      console.error('Graceful shutdown timed out or failed:', err);
      process.exit(1);
    }
  }
}

How to use it

/* app.ts */
import http from 'http';
import express from 'express';
import { GracefulShutdown } from './graceful-shutdown';

const shutdown = new GracefulShutdown(20_000);
const app = express();

let ready = true;
app.get('/health/liveness', (_req, res) => res.sendStatus(200));
app.get('/health/readiness', (_req, res) => res.sendStatus(ready ? 200 : 503));

const server = http.createServer(app);
server.listen(3000, () => console.log('listening'));

// register a cleanup that flips readiness and closes server
shutdown.register(async (signal) => {
  ready = false; // immediate readiness flip
  console.log('stopping accepting new connections');

  await new Promise<void>((resolve) => {
    server.close(() => resolve());
  });
});

Pattern 2 — Drain sockets and in-flight HTTP requests

server.close() stops accepting new connections but doesn't forcibly close existing keep-alive sockets. Track sockets and destroy idle ones once the drain window expires.

/* socket-drain.ts (continued) */
import type { Socket } from 'net';

const sockets = new Set<Socket>();

server.on('connection', (socket) => {
  sockets.add(socket);
  socket.on('close', () => sockets.delete(socket));
});

shutdown.register(async () => {
  // after server.close we have only active sockets
  const drainTimeout = 10_000;
  const killAfter = Date.now() + drainTimeout;

  for (const s of sockets) {
    // destroy truly idle keep-alive sockets after a short inactivity window
    s.setTimeout(5_000, () => s.destroy());
  }

  while (sockets.size > 0 && Date.now() < killAfter) {
    await new Promise((r) => setTimeout(r, 100));
  }

  for (const s of sockets) s.destroy();
});
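If you run Node 18.2 or newer, `http.Server` ships built-in helpers that replace most of this manual socket bookkeeping. A minimal sketch, assuming those APIs are available on your runtime:

```typescript
import http from 'node:http';

// Node >= 18.2: closeIdleConnections()/closeAllConnections() handle keep-alives.
async function closeServer(server: http.Server, drainMs = 10_000): Promise<void> {
  const closed = new Promise<void>((resolve, reject) =>
    server.close((err) => (err ? reject(err) : resolve())),
  );
  server.closeIdleConnections(); // drop idle keep-alive sockets immediately
  const forceTimer = setTimeout(() => server.closeAllConnections(), drainMs); // force the rest
  await closed;
  clearTimeout(forceTimer);
}
```

Register `() => closeServer(server)` as a cleaner and you get the same drain-then-force behavior without tracking a socket set yourself.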

Pattern 3 — Transactional shutdowns: commit or rollback deterministically

Long-running transactions are fragile during shutdown. The pattern is:

  1. Stop accepting new requests that start transactions (readiness flip).
  2. Wait for active request handlers to finish or signal cancellation via an AbortSignal.
  3. Commit or rollback in a finally block; prefer idempotent commits or compensating transactions for safety.

Example with a generic DB client that supports transactions:

/* tx-worker.ts — `db` and `RequestContext` stand in for your DB client and request types */
async function handleRequest(req: RequestContext, signal?: AbortSignal) {
  const client = await db.connect();
  try {
    await client.beginTransaction();

    // pass the signal down to cancel long DB queries if controller aborts
    const result = await client.query('UPDATE ...', { signal });

    await client.commit();
    return result;
  } catch (err) {
    await client.rollback();
    throw err;
  } finally {
    client.release();
  }
}

// During shutdown register a handler that waits until in-flight handlers finish
shutdown.register(async (signal) => {
  // wait for all active handlers (tracked via a semaphore) to finish
  await activeRequests.waitForZero({ signal });
});
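The `activeRequests` helper above is a placeholder. A minimal in-flight counter that supports `waitForZero` with early release on abort might look like this:

```typescript
// Minimal sketch of the hypothetical activeRequests helper: an in-flight
// counter whose waitForZero() resolves when the count drops to zero, or
// early if the shutdown signal aborts first.
export class InFlightCounter {
  private count = 0;
  private waiters: Array<() => void> = [];

  enter(): void {
    this.count++;
  }

  exit(): void {
    this.count--;
    if (this.count <= 0) {
      for (const w of this.waiters) w();
      this.waiters = [];
    }
  }

  waitForZero(opts: { signal?: AbortSignal } = {}): Promise<void> {
    if (this.count === 0) return Promise.resolve();
    return new Promise((resolve) => {
      this.waiters.push(resolve);
      opts.signal?.addEventListener('abort', () => resolve(), { once: true });
    });
  }
}
```

Wrap each handler in `enter()`/`exit()` (an Express middleware with a `finally` works well) so the shutdown path can wait for the count to reach zero.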

Tips for safer transactions

  • Keep transactions short. In 2026, prefer single-statement transactions when possible and use compensating transactions for complex workflows.
  • Use optimistic concurrency and idempotency keys so retries during restarts are safe.
  • If using ORMs (Prisma, TypeORM), ensure the client supports AbortSignal for cancelable queries.
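The idempotency-key tip above can be sketched in a few lines. This in-memory version is only illustrative; a real service persists keys transactionally (for example, a unique-keyed table written in the same transaction as the work):

```typescript
// In-memory sketch of idempotency keys: a request retried after a restart
// re-uses its key and replays the stored result instead of re-executing.
const completed = new Map<string, unknown>();

async function runIdempotent<T>(key: string, work: () => Promise<T>): Promise<T> {
  if (completed.has(key)) return completed.get(key) as T; // safe replay
  const result = await work();
  completed.set(key, result);
  return result;
}
```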

Pattern 4 — Background workers and job queues

For queue processors (BullMQ, RabbitMQ consumers), handle a two-phase shutdown:

  1. Stop fetching new jobs.
  2. Allow current job(s) to finish or abort them safely.
  3. Persist state so a new worker can resume if needed.

/* worker.ts — `queue`, `processJob`, `sleep`, and `activeJobs` are placeholders */
let pulling = true;

async function pullLoop(signal?: AbortSignal) {
  while (pulling && !signal?.aborted) {
    const job = await queue.getNext();
    if (!job) { await sleep(100); continue; }
    try {
      await processJob(job, signal);
      await job.ack();
    } catch (err) {
      await job.nack();
    }
  }
}

shutdown.register(async (signal) => {
  pulling = false; // stop getting new jobs
  // wait for currently processing jobs to finish or be canceled by signal
  await activeJobs.waitForZero({ signal });
});

Pattern 5 — Readiness & liveness: flip early, verify often

Health endpoints are your control plane during shutdown. The canonical approach:

  • /health/liveness returns 200 if the process is alive (used by kubelet to restart crashed pods).
  • /health/readiness returns 200 only if the instance can accept new work (used by load balancers and service meshes).

On receiving a signal, immediately make /health/readiness return 503. This prevents new traffic while you drain.

Flipping readiness first makes shutdowns predictable to orchestrators—without it, load balancers may keep sending new requests into a dying process.
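One detail worth encoding: after flipping readiness, pause roughly one probe interval before closing the listener, so the load balancer actually observes the 503 before connections start dropping. A sketch (the 5-second default is an assumption; match it to your probe's `periodSeconds`):

```typescript
import { setTimeout as sleep } from 'node:timers/promises';

let ready = true;

// Status code for GET /health/readiness
export function readinessStatus(): number {
  return ready ? 200 : 503;
}

// First shutdown step: flip readiness, then wait ~one probe interval so
// traffic stops arriving before the server stops accepting connections.
export async function flipReadinessAndWait(probeMs = 5_000): Promise<void> {
  ready = false;
  await sleep(probeMs);
}
```

Register `flipReadinessAndWait` as the very first cleaner so every later step runs against a drained instance.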

Pattern 6 — Restart strategies (zero downtime and rolling restarts)

There are multiple ways to restart a TypeScript service without dropping traffic:

  • Orchestrator-managed rolling updates (Kubernetes Deployments, ECS): prefer rolling deployments with proper readiness probes and terminationGracePeriodSeconds tuned to your drain + commit time.
  • Signal-based graceful restart (process manager): PM2 or systemd can send signals for reload (SIGUSR2, SIGTERM) and manage rolling restarts across instances.
  • Cluster manager: use a small master that spawns workers and performs rolling restarts by spawning a new worker and killing the old one after it signals ready.

Example snippet for a simple rolling restart using Node's cluster:

/* master.ts */
import cluster from 'cluster';
import os from 'os';

if (cluster.isPrimary) {
  for (let i = 0; i < os.cpus().length; i++) cluster.fork();

  // rolling restart helper: call from an admin endpoint or signal handler
  async function rollingRestart() {
    for (const id in cluster.workers) {
      const w = cluster.workers[id]!;
      w.send({ cmd: 'shutdown' });
      await new Promise((res) => w.once('exit', res));
      cluster.fork();
    }
  }
} else {
  // worker starts its HTTP server and listens for the shutdown message
  process.on('message', (m: { cmd?: string }) => {
    if (m?.cmd === 'shutdown') process.kill(process.pid, 'SIGTERM');
  });
}

Tooling, build configs & editor integrations (practical tips)

Make sure your development and CI tooling don't get in the way of debugging shutdowns:

  • tsconfig: target the Node version you run in prod. For ESM apps set "module": "NodeNext" and enable "sourceMap": true to keep stack traces readable in TypeScript.
  • Bundlers: using esbuild or swc reduces start-up time (less time spent in transient states during restarts). Keep your shutdown logic robust to faster restarts.
  • VS Code: configure a launch task that sends SIGTERM to the debuggee to reproduce real shutdown behavior; use the Node debugger's 'Restart' to see how your app behaves under immediate restarts.
  • Testing: run chaos tests in pre-prod—simulate SIGTERM, SIGKILL, and container stop timeouts. Use ephemeral environments (kind, k3d) to validate Kubernetes probes and lifecycle hooks.

Observability and SLOs for shutdowns

Track metrics that indicate shutdown health:

  • Shutdown duration histogram (time from signal to process exit).
  • Number of requests dropped during shutdown.
  • Transaction abort/rollback counts.

Capture shutdown reasons in logs (SIGTERM vs. SIGINT vs. healthcheck failure) and export them to your APM. Use these signals to set SLOs: e.g., 99% of graceful shutdowns complete within the configured termination window.
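A dependency-free sketch of the duration measurement; in production you would likely feed the number into a metrics histogram (e.g. prom-client) rather than just logging it:

```typescript
// Measure elapsed time around an async step. Generic, so it can time the
// whole shutdown sequence or an individual cleaner.
async function timed<T>(run: () => Promise<T>): Promise<{ result: T; durationMs: number }> {
  const start = process.hrtime.bigint();
  const result = await run();
  const durationMs = Number(process.hrtime.bigint() - start) / 1e6;
  return { result, durationMs };
}
```

Log `durationMs` alongside the shutdown reason (the signal name) so the histogram can be sliced by cause.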

Common pitfalls and how to avoid them

  • Ignoring keep-alive sockets: causes requests to hang past terminationGracePeriodSeconds. Track and destroy sockets after a short drain window.
  • Not flipping readiness: LB continues to send requests to a process that's about to die.
  • Long transactions: avoid multi-second transactions that force long termination windows.
  • Mixing signals: ensure your orchestration (systemd, Docker, Kubernetes) sends the expected signals—Docker sends SIGTERM then SIGKILL after the stop timeout, Kubernetes respects terminationGracePeriodSeconds.

Test plan: how to validate your shutdown strategy

  1. Unit test: call your GracefulShutdown.shutdown manually and assert registered cleaners run and the AbortSignal is triggered.
  2. Integration: run the container locally and send SIGTERM; watch readiness flip, server.close, and socket cleanup logs.
  3. Chaos test: use a chaos tool (Chaos Mesh, Litmus, Pumba) in a staging cluster to randomly kill pods and verify no user-visible errors occur beyond expected retries.
  4. Performance test: measure how often shutdowns exceed timeout and adjust terminationGracePeriodSeconds and timeouts accordingly.

2026 recommendations & future-proofing

  • Prefer AbortController-driven cancellation for all async primitives; by 2026 most major libs support it and it simplifies cancellation propagation.
  • Design systems for idempotency and short-lived transactions so restarts are cheap and safe.
  • Automate chaos tests in CI for every major release—treat random kills as a first-class test case.
  • Make health endpoints and metrics part of your deployment checklist; don't rely solely on logs to diagnose shutdown issues.

Actionable checklist (apply this in 30 minutes)

  1. Add a single GracefulShutdown manager to your app and replace ad-hoc signal handlers.
  2. Expose /health/readiness and /health/liveness and flip readiness on signal.
  3. Track sockets and ensure idle keep-alives are destroyed during drain.
  4. Stop pulling new jobs on shutdown and wait for active jobs to finish.
  5. Run a SIGTERM test in a staging container with Kubernetes/ECS to validate terminationGracePeriodSeconds.

Closing thoughts

Random process terminations are inevitable. In 2026, with faster infra and ubiquitous chaos testing, applications that treat graceful shutdown as a feature—not an afterthought—will be more reliable and easier to operate. The patterns above are practical, TypeScript-first, and compatible with modern Node and container ecosystems.

Try this now: add the GracefulShutdown manager to a small service, simulate SIGTERM locally, and verify readiness flips and sockets drain. If you have CI/CD, add a chaos step in staging that sends random SIGKILLs—your users will thank you.

Call to action

Want a ready-to-run template? Clone our minimal TypeScript graceful-shutdown starter repo (includes tsconfig, Dockerfile, and Kubernetes manifest) and run chaos tests in 10 minutes. Share feedback or your own patterns in the comments—let's make TypeScript services that actually survive real-world chaos.
