Chaos-Testing Node Apps: Simulating 'Process Roulette' with TypeScript

2026-02-26

Build a controlled TypeScript tool to randomly kill Node processes safely and harden resilience. Practical code, safety rules, and 2026 best practices.

If you maintain Node.js services, you know the pain of intermittent failures that appear only under unexpected process crashes. Building resilience means testing the worst case: random process terminations. This article walks through designing and implementing a controlled chaos-testing tool in TypeScript, "process-roulette", that deliberately kills processes in safe environments to harden Node services and prove their fault tolerance.

Why process-roulette matters in 2026

Chaos engineering has matured from niche practice into mainstream reliability work. In late 2025 and early 2026 we saw continued adoption of "chaos as code", broader integration with GitOps flows, and stronger observability tooling driven by OpenTelemetry becoming a de facto standard. That means teams are expected to validate how services behave when unexpected processes die, not only when pods are evicted or network partitions occur.

Process-level failures are especially important for Node.js apps because many deployments still run single-process servers, sidecars, or local daemons which, when killed, silently degrade availability. A focused tool that simulates random process termination helps uncover issues like:

  • Improper signal handling and shutdown hooks
  • Uncaught exceptions that crash the process without graceful draining
  • State loss in in-memory caches or local queues
  • Bad assumptions around restart behavior from process managers (pm2, systemd, Kubernetes)

Important safety rules before you run anything

Simulating process death is powerful and dangerous. Follow these strict rules:

  • Always run in non-production or in a production-like staging cluster with explicit approval.
  • Use dry-run first to verify which processes would be targeted.
  • Restrict scope to namespaces, container IDs, PID ranges, user-owned processes, or processes matching exact names.
  • Use rate limits and cooldown so the tool cannot spin up destructive loops.
  • Enable observability and tracing (OpenTelemetry, metrics, logs) during the experiment and tag events for correlation.
  • Have a rollback and recovery plan, including a kill-switch or a controller that can immediately stop the experiment.

Run process-roulette only when you can recover services quickly and when stakeholders are aware. Misuse can cause real outages and data loss.
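One concrete guardrail for the rate-limit and cooldown rule above is a sliding-window kill budget. A minimal sketch (the class name and limits here are illustrative, not part of the tool built later):

```typescript
// Sketch of a kill budget with a sliding time window.
// maxKills and windowMs would come from the experiment config.
class KillBudget {
  private timestamps: number[] = []

  constructor(private maxKills: number, private windowMs: number) {}

  // Returns true if another kill is allowed inside the sliding window,
  // and records it; returns false if the budget is exhausted.
  tryConsume(now: number = Date.now()): boolean {
    this.timestamps = this.timestamps.filter(t => now - t < this.windowMs)
    if (this.timestamps.length >= this.maxKills) return false
    this.timestamps.push(now)
    return true
  }
}
```

The execution component would call `tryConsume()` before every kill and skip (or halt the experiment) when it returns false.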

Design goals for a controlled TypeScript tool

Before writing code, set clear goals and constraints. A good process-roulette tool should:

  • Be explicit about environment: container, host, or pod level.
  • Support multiple targeting strategies: by PID, process name, user, or container ID.
  • Allow probabilistic killing (for realistic random failures).
  • Support signal choice and grace periods (SIGTERM then SIGKILL if needed).
  • Provide dry-run and audit logging to track actions.
  • Integrate with observability for correlation and metrics export.

Architecture overview

We implement a small CLI written in TypeScript that can run on Linux/macOS hosts or inside containers. High-level components:

  • Discovery — list candidate processes on the node using ps listing or container runtime APIs.
  • Selector — filter candidates based on rules: allowlist, denylist, name patterns, owner.
  • Chooser — pick a target randomly according to probability and weights.
  • Execution — send signals with configurable grace period and fallback to SIGKILL.
  • Observability — emit events to stdout, metrics endpoint, or OpenTelemetry exporter.
  • Safety — dry-run, rate limiting, and emergency stop.

Implementation: a minimal process-roulette CLI in TypeScript

The following example implements a pragmatic, extensible starting point. It targets Unix-like systems and assumes you run it with appropriate permissions (root or same user as target processes). Keep evolving it to meet your environment and safety policies.

Step 1: package and config

Create a project and install dependencies. The example uses ps-list for process discovery and yargs for CLI parsing. Use a recent TypeScript 5.x compiler. Note that recent versions of ps-list are ESM-only, so set "module": "nodenext" in tsconfig.json (or pin ps-list@7 for CommonJS); ps-list ships its own type definitions, so no separate @types package is needed.

npm init -y
npm install ps-list yargs
npm install -D typescript ts-node @types/node @types/yargs
npx tsc --init

Step 2: core TypeScript source

Save this as src/index.ts. It is intentionally compact while showing key behaviors: discovery, selection, kill with grace period, and dry-run.

import psList, { type ProcessDescriptor } from 'ps-list'
import yargs from 'yargs'
import { hideBin } from 'yargs/helpers'

type Options = {
  namePattern?: string
  probability: number
  intervalMs: number
  dryRun: boolean
  graceMs: number
  whitelist?: string[]
  blacklist?: string[]
  maxKillsPerRun: number
}

const argv = yargs(hideBin(process.argv))
  .option('name', { type: 'string', alias: 'n' })
  .option('probability', { type: 'number', default: 0.1 })
  .option('interval', { type: 'number', default: 60000 })
  .option('dry-run', { type: 'boolean', default: true })
  .option('grace', { type: 'number', default: 5000 })
  .option('whitelist', { type: 'array' })
  .option('blacklist', { type: 'array' })
  .option('max-kills', { type: 'number', default: 1 })
  .parseSync()

function matchesFilters(proc: ProcessDescriptor, opts: Options) {
  if (opts.whitelist && opts.whitelist.length && !opts.whitelist.includes(proc.name)) return false
  if (opts.blacklist && opts.blacklist.length && opts.blacklist.includes(proc.name)) return false
  if (opts.namePattern) {
    try {
      const re = new RegExp(opts.namePattern)
      if (!re.test(proc.name)) return false
    } catch (e) {
      console.warn('Invalid --name regex, ignoring pattern:', opts.namePattern)
    }
  }
  return true
}

async function runOnce(opts: Options) {
  const procs = await psList()
  const candidates = procs.filter(p => matchesFilters(p, opts))
  if (!candidates.length) {
    console.log('No matching processes found')
    return
  }

  // Decide how many to kill this iteration
  const toKill: ProcessDescriptor[] = []
  for (const p of candidates) {
    if (Math.random() < opts.probability) toKill.push(p)
    if (toKill.length >= opts.maxKillsPerRun) break
  }

  if (!toKill.length) {
    console.log('No processes chosen this run')
    return
  }

  for (const p of toKill) {
    console.log(opts.dryRun ? '[dry-run] would kill' : 'killing', p.pid, p.name)
    if (opts.dryRun) continue

    try {
      process.kill(p.pid, 'SIGTERM')
    } catch (e) {
      console.warn('Failed to send SIGTERM', e)
      continue
    }

    // wait grace period
    await new Promise(r => setTimeout(r, opts.graceMs))

    // check if still alive
    try {
      process.kill(p.pid, 0)
      // still alive, escalate
      console.log('Escalating to SIGKILL for', p.pid)
      process.kill(p.pid, 'SIGKILL')
    } catch (e) {
      // process already gone
      console.log('Process', p.pid, 'terminated gracefully')
    }
  }
}

async function main() {
  const opts: Options = {
    namePattern: argv.name,
    probability: argv.probability,
    intervalMs: argv.interval,
    dryRun: argv['dry-run'],
    graceMs: argv.grace,
    whitelist: (argv.whitelist ?? []).map(String),
    blacklist: (argv.blacklist ?? []).map(String),
    maxKillsPerRun: argv['max-kills']
  }

  console.log('process-roulette starting with opts', opts)

  // basic loop, could be enhanced with signal handling
  while (true) {
    try {
      await runOnce(opts)
    } catch (e) {
      console.error('Error during run', e)
    }
    await new Promise(r => setTimeout(r, opts.intervalMs))
  }
}

main().catch(err => {
  console.error(err)
  process.exit(1)
})

Notes on the example

  • It performs discovery via ps-list, which works well on Unix-like systems (it can also list processes on Windows), but signal-based killing is Unix-specific; add tasklist/taskkill support if you need Windows.
  • It uses a probability model; tweak probability and interval to shape failure injection intensity.
  • dry-run is default to encourage safety. Turn it off only after verification.
  • It performs graceful termination first, then escalates to SIGKILL if the process does not exit.
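As the first note suggests, Windows has no Unix-style signals, so a fallback must shell out to the real taskkill CLI. A hedged sketch (the wrapper function names here are illustrative, not part of the tool above):

```typescript
import { execFile } from 'node:child_process'

// Build taskkill arguments: /PID targets a process id, /F forces
// termination (Windows has no graceful SIGTERM equivalent here).
function taskkillArgs(pid: number, force: boolean): string[] {
  return force ? ['/PID', String(pid), '/F'] : ['/PID', String(pid)]
}

// Sketch of a Windows kill; assumes taskkill is on PATH (it is on
// stock Windows installs).
function killWindowsProcess(pid: number, force = false): Promise<void> {
  return new Promise((resolve, reject) => {
    execFile('taskkill', taskkillArgs(pid, force), err =>
      err ? reject(err) : resolve()
    )
  })
}
```

Keeping the argument builder pure makes it easy to unit-test without actually killing anything.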

Advanced extensions for real-world usage

The minimal tool is useful for local experiments. For production-quality chaos testing in 2026, extend the tool with these capabilities:

1. Container-aware targeting

When running in Kubernetes, you usually want to kill processes inside specific pods or containers, or you may prefer to delete pods instead. You can:

  • Run process-roulette as a privileged DaemonSet sidecar that targets only the main container PID namespace.
  • Use the Kubernetes API to cordon or delete pods to simulate controller-level failures instead of killing PIDs.
  • Integrate with container runtimes (crictl/docker) to identify processes by container ID.
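For the pod-deletion approach, a minimal sketch that shells out to kubectl (assuming kubectl is on PATH with cluster credentials; the helper names are illustrative):

```typescript
import { execFile } from 'node:child_process'

// Build the kubectl arguments; keeping this pure makes it testable
// without touching a real cluster.
function deletePodArgs(namespace: string, pod: string, graceSeconds: number): string[] {
  return ['delete', 'pod', pod, '-n', namespace, `--grace-period=${graceSeconds}`]
}

// Delete a pod to simulate a controller-level failure instead of
// killing an individual PID inside it.
function deletePod(namespace: string, pod: string, graceSeconds = 30): Promise<void> {
  return new Promise((resolve, reject) => {
    execFile('kubectl', deletePodArgs(namespace, pod, graceSeconds), err =>
      err ? reject(err) : resolve()
    )
  })
}
```

In a production tool you would likely swap the shell-out for a Kubernetes client library and add RBAC-scoped service accounts.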

2. Chaos policies and experiments as code

Adopt experiment definitions stored in Git and triggered by GitOps. Each experiment defines scope, risk level, blast radius, and rollbacks. This fits trends in 2025/2026 where teams treat chaos experiments like feature releases.
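A minimal sketch of such an experiment definition, expressed as a TypeScript type rather than a full CRD (all field names here are illustrative):

```typescript
// Sketch of an experiment-as-code definition stored in Git.
type ChaosExperiment = {
  id: string
  scope: { namespace: string; namePattern: string }
  riskLevel: 'low' | 'medium' | 'high'
  maxKillsPerRun: number
  dryRun: boolean
  rollback: string // human-readable recovery procedure
}

const example: ChaosExperiment = {
  id: 'exp-2026-001',
  scope: { namespace: 'staging', namePattern: '^worker-' },
  riskLevel: 'low',
  maxKillsPerRun: 1,
  dryRun: true,
  rollback: 'Restart affected deployments via kubectl rollout restart'
}
```

A GitOps controller would validate definitions like this on merge and refuse to schedule anything whose blast radius exceeds policy.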

3. Observability and experiment correlation

Emit structured events to OpenTelemetry or a metrics endpoint. Add experiment IDs and timestamps so SREs can filter logs and traces for the window when kills occurred. Example event fields:

  • experiment_id
  • target_pid
  • signal_sent
  • result (terminated | escalated | error)
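A small sketch of emitting such an event as structured JSON on stdout; the KillEvent shape mirrors the fields above, and in practice you would swap the emitter for an OpenTelemetry exporter:

```typescript
// Event shape matching the fields listed above, plus a timestamp.
type KillEvent = {
  experiment_id: string
  target_pid: number
  signal_sent: 'SIGTERM' | 'SIGKILL'
  result: 'terminated' | 'escalated' | 'error'
  timestamp: string
}

// Serialize an event; `now` is injectable so tests are deterministic.
function formatEvent(e: Omit<KillEvent, 'timestamp'>, now = new Date()): string {
  return JSON.stringify({ ...e, timestamp: now.toISOString() })
}

// Usage: console.log(formatEvent({ experiment_id: 'exp-1',
//   target_pid: 4321, signal_sent: 'SIGTERM', result: 'terminated' }))
```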

4. Safety enforcement and approvals

Require preflight approvals: a webhook that checks if the cluster is in an allowed time window and no high-severity incidents are open. Implement rate limiting and a global kill-switch endpoint.
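A minimal sketch of the global kill-switch (port and route are illustrative): an HTTP endpoint flips a flag that the main loop checks before every iteration.

```typescript
import http from 'node:http'

let stopped = false

function isStopped(): boolean {
  return stopped
}

// Pure decision logic, kept separate from the server so it is testable.
function handleControl(method?: string, url?: string): number {
  if (method === 'POST' && url === '/stop') {
    stopped = true
    return 200
  }
  return 404
}

const server = http.createServer((req, res) => {
  res.writeHead(handleControl(req.method, req.url)).end()
})

// The main loop should bail out when the flag is set:
//   if (isStopped()) break
// server.listen(8686)
```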

5. Integrate with chaos platforms

Popular chaos frameworks in recent years include Gremlin, LitmusChaos, and Chaos Mesh. Use process-roulette as a custom experiment in these platforms and leverage their scheduling, RBAC, and safety guardrails.

Testing and validating resilience

Chaos experiments are only useful if you measure outcomes. Key validation steps:

  • Define success criteria: zero SLO breaches, acceptable error rate increase, bounded request retries, no data loss.
  • Run experiments under load to capture real behavior: use load generators to simulate traffic during kills.
  • Track automation safety: ensure CI pipelines only run chaos on ephemeral test environments.
  • Use readiness and liveness probe behaviors to verify orchestrator restarts correctly.

Common pitfalls and how to avoid them

Here are practical warnings from real-world practice:

  • Messy assumptions about restarts — some teams assume the process manager will preserve in-memory state; test for it and store critical state externally.
  • Missing signal handlers — Node apps should listen for SIGTERM and perform graceful shutdown: stop accepting new connections, flush queues, and close DB connections.
  • Blindly killing databases or sidecars — use denylists (and tight allowlists) so critical processes are never targeted.
  • Lack of observability — if you cannot see traces or metrics during the experiment, you cannot debug the root cause.

Signal handling best practices for Node.js services

Make your Node.js apps resilient to process termination by implementing robust shutdown paths. Minimal pattern:

process.on('SIGTERM', () => {
  console.log('SIGTERM received: closing server')
  // `server`, `drainWorkloads`, and `closeDbConnections` are your
  // application's own handles and shutdown helpers.
  server.close(async () => {
    await drainWorkloads()
    await closeDbConnections()
    process.exit(0)
  })

  // Force exit if not closed in time; unref() so this timer alone
  // does not keep the event loop alive.
  setTimeout(() => {
    console.warn('Force exit after grace period')
    process.exit(1)
  }, 10000).unref()
})

Use a process manager (systemd, pm2, or Kubernetes) configured with appropriate restart policies and probe intervals so restarts are reliable but not abusive.

Ethics, compliance, and governance

Simulating failures can impact data and compliance controls. In 2026, organizations increasingly require documented experiments, audits, and approvals. Follow these guidelines:

  • Keep an experiment registry with dates, scope, and results.
  • Log experiment actions for compliance audits.
  • Ensure experiments do not violate customer SLAs or data residency rules.

Case study: Hardening a Node API after process-roulette

At a mid-size fintech in 2025, SREs ran a process-roulette campaign on staging. Findings and outcomes:

  • Problem: A background in-memory job processor lost state when the process was killed during batch window. Retry logic was insufficient.
  • Fixes: Jobs persisted state to Redis; shutdown handlers drained in-flight jobs; circuit breakers were added for external calls.
  • Result: When the same experiment was replayed, zero jobs were lost and error rates remained within SLOs.

Lessons: process-level failures uncover both code and operational assumptions. The fix was code plus configuration changes to restart behavior.

Where process-roulette fits in your reliability program in 2026

Process-roulette is one tool in your chaos toolbox. Use it alongside:

  • Network chaos (packet loss, latency)
  • Resource pressure (CPU, memory stress)
  • API rate limiting and dependency failure simulations
  • Orchestrator-level actions (pod eviction, node termination)

These experiments, combined with GitOps and observability, give teams confidence that when real failures occur they will be manageable and recoverable.

Next steps and practical checklist

  1. Fork the example code and run in dry-run mode against a staging host.
  2. Implement signal handlers in your Node services and add unit tests for shutdown logic.
  3. Integrate process-roulette events with your tracing and metrics.
  4. Define an experiment policy and schedule the first test during a low-risk window with stakeholders informed.
  5. Iterate: analyze results, fix issues, and rerun until acceptance criteria are met.

Chaos engineering continues to evolve into structured programs integrated with CI/CD and observability platforms. Expect to see:

  • Greater automation of safety checks and approvals via policy-as-code
  • Tighter integration between chaos platforms and GitOps pipelines
  • More default support for experimental auditing in cloud providers and orchestrators

Process-roulette is a focused technique that surfaces classically hard-to-hit issues at the process level. When used responsibly, it accelerates hardening and gives teams measurable confidence in their Node.js services.

Call to action

Ready to try this at your org? Start by cloning the example, enabling dry-run, and running a scoped experiment in a staging environment. Share your findings with your SRE and platform teams, and consider contributing improvements back to the example so others can benefit. If you want a checklist or a template GitOps experiment CRD for Kubernetes, request it and I will provide a ready-to-use repo.


