Chaos-Testing Node Apps: Simulating 'Process Roulette' with TypeScript
Build a controlled TypeScript tool to randomly kill Node processes safely and harden resilience. Practical code, safety rules, and 2026 best practices.
If you maintain Node.js services, you know the pain of intermittent failures that appear only when a process crashes unexpectedly. Building resilience means testing the worst case: random process terminations. This article walks you through designing and implementing a controlled chaos-testing tool in TypeScript, "process-roulette", that deliberately kills processes in safe environments to harden Node services and verify their fault tolerance.
Why process-roulette matters in 2026
Chaos engineering has matured from niche practice into mainstream reliability work. In late 2025 and early 2026 we saw continued adoption of "chaos as code", broader integration with GitOps flows, and stronger observability tooling driven by OpenTelemetry becoming a de facto standard. That means teams are expected to validate how services behave when unexpected processes die, not only when pods are evicted or network partitions occur.
Process-level failures are especially important for Node.js apps because many deployments still run single-process servers, sidecars, or local daemons which, when killed, silently degrade availability. A focused tool that simulates random process termination helps uncover issues like:
- Improper signal handling and shutdown hooks
- Uncaught exceptions that crash the process without graceful draining
- State loss in in-memory caches or local queues
- Bad assumptions around restart behavior from process managers (pm2, systemd, Kubernetes)
Important safety rules before you run anything
Simulating process death is powerful and dangerous. Follow these strict rules:
- Always run in non-production or in a production-like staging cluster with explicit approval.
- Use dry-run first to verify which processes would be targeted.
- Restrict scope to namespaces, container IDs, PID ranges, user-owned processes, or processes matching exact names.
- Use rate limits and cooldown so the tool cannot spin up destructive loops.
- Enable observability and tracing (OpenTelemetry, metrics, logs) during the experiment and tag events for correlation.
- Have a rollback and recovery plan, including a kill-switch or a controller that can immediately stop the experiment.
Run process-roulette only when you can recover services quickly and when stakeholders are aware. Misuse can cause real outages and data loss.
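The rate-limit, cooldown, and kill-switch rules can be enforced in code rather than by convention. The following is a minimal sketch, not part of the example tool later in this article: createKillGuard, its configuration fields, and the idea of using a file on disk as the kill-switch are all illustrative assumptions.

```typescript
import { existsSync } from 'node:fs'

// Hypothetical configuration; names and semantics are assumptions
type GuardConfig = {
  maxKillsPerWindow: number // cap on kills inside one sliding window
  windowMs: number          // length of the sliding window
  cooldownMs: number        // minimum gap between two kills
  killSwitchPath: string    // if this file exists, all kills stop
}

function createKillGuard(cfg: GuardConfig) {
  let timestamps: number[] = []
  return {
    // True only when every safety condition allows another kill
    allow(now: number = Date.now()): boolean {
      if (existsSync(cfg.killSwitchPath)) return false
      // Forget kills that have aged out of the sliding window
      timestamps = timestamps.filter(t => now - t < cfg.windowMs)
      if (timestamps.length >= cfg.maxKillsPerWindow) return false
      const last = timestamps[timestamps.length - 1]
      if (last !== undefined && now - last < cfg.cooldownMs) return false
      return true
    },
    // Call after a signal has actually been sent
    record(now: number = Date.now()): void {
      timestamps.push(now)
    }
  }
}
```

A caller would check allow() before each kill and call record() afterwards; creating the kill-switch file (for example with touch) then halts the experiment without needing to find and stop the tool itself.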
Design goals for a controlled TypeScript tool
Before writing code, set clear goals and constraints. A good process-roulette tool should:
- Be explicit about environment: container, host, or pod level.
- Support multiple targeting strategies: by PID, process name, user, or container ID.
- Allow probabilistic killing (for realistic random failures).
- Support signal choice and grace periods (SIGTERM then SIGKILL if needed).
- Provide dry-run and audit logging to track actions.
- Integrate with observability for correlation and metrics export.
Architecture overview
We implement a small CLI written in TypeScript that can run on Linux/macOS hosts or inside containers. High-level components:
- Discovery — list candidate processes on the node using ps listing or container runtime APIs.
- Selector — filter candidates based on rules: allowlist, denylist, name patterns, owner.
- Chooser — pick a target randomly according to probability and weights.
- Execution — send signals with configurable grace period and fallback to SIGKILL.
- Observability — emit events to stdout, metrics endpoint, or OpenTelemetry exporter.
- Safety — dry-run, rate limiting, and emergency stop.
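The Chooser step mentions weights, which the minimal CLI below does not implement. One way to pick a target in proportion to a weight is sketched here; the Candidate type, the weight field, and the chooseWeighted function are hypothetical names introduced only for illustration.

```typescript
// Hypothetical candidate shape; weight expresses how likely a process is to be picked
type Candidate = { pid: number; name: string; weight: number }

// rand is injectable so that tests can make the choice deterministic
function chooseWeighted(
  candidates: Candidate[],
  rand: () => number = Math.random
): Candidate | undefined {
  const eligible = candidates.filter(c => c.weight > 0)
  const total = eligible.reduce((sum, c) => sum + c.weight, 0)
  if (total <= 0) return undefined
  // Walk the cumulative weights until the random threshold is crossed
  let threshold = rand() * total
  let last: Candidate | undefined
  for (const c of eligible) {
    last = c
    threshold -= c.weight
    if (threshold < 0) return c
  }
  // Floating-point rounding can leave the threshold at exactly zero
  return last
}
```

Giving a process a weight of 0 removes it from selection entirely, which doubles as a soft denylist.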
Implementation: a minimal process-roulette CLI in TypeScript
The following example implements a pragmatic, extensible starting point. It targets Unix-like systems and assumes you run it with appropriate permissions (root or same user as target processes). Keep evolving it to meet your environment and safety policies.
Step 1: package and config
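Create a project and install dependencies. The example uses ps-list for process discovery and yargs for CLI parsing. Use a recent TypeScript 5.x compiler. The source below uses the ps-list 7 typings (psList.ProcessDescriptor) and a CommonJS build; ps-list 8 and later are published as ES modules only, so pin version 7 or adapt the imports. ps-list ships its own type definitions, so no @types package is needed for it.
npm init -y
npm install ps-list@7 yargs
npm install -D typescript ts-node @types/node @types/yargs
npx tsc --init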
Step 2: core TypeScript source
Save this as src/index.ts. It is intentionally compact while showing key behaviors: discovery, selection, kill with grace period, and dry-run.
import psList from 'ps-list'
import yargs from 'yargs'
import { hideBin } from 'yargs/helpers'

type Options = {
  namePattern?: RegExp
  probability: number
  intervalMs: number
  dryRun: boolean
  graceMs: number
  whitelist: string[]
  blacklist: string[]
  maxKillsPerRun: number
}

// No cast to Options here: letting yargs infer the types keeps
// argv.interval, argv['dry-run'], and the rest type-checked.
const argv = yargs(hideBin(process.argv))
  .option('name', { type: 'string', alias: 'n' })
  .option('probability', { type: 'number', default: 0.1 })
  .option('interval', { type: 'number', default: 60000 })
  .option('dry-run', { type: 'boolean', default: true })
  .option('grace', { type: 'number', default: 5000 })
  .option('whitelist', { type: 'array' })
  .option('blacklist', { type: 'array' })
  .option('max-kills', { type: 'number', default: 1 })
  .parseSync()

function matchesFilters(proc: psList.ProcessDescriptor, opts: Options) {
  // Never target the tool itself or its parent (for example the invoking shell)
  if (proc.pid === process.pid || proc.pid === process.ppid) return false
  if (opts.whitelist.length && !opts.whitelist.includes(proc.name)) return false
  if (opts.blacklist.includes(proc.name)) return false
  if (opts.namePattern && !opts.namePattern.test(proc.name)) return false
  return true
}

async function runOnce(opts: Options) {
  const procs = await psList()
  const candidates = procs.filter(p => matchesFilters(p, opts))
  if (!candidates.length) {
    console.log('No matching processes found')
    return
  }

  // Shuffle (Fisher-Yates) so the max-kills cap does not favor
  // whichever processes happen to be listed first
  for (let i = candidates.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1))
    const tmp = candidates[i]
    candidates[i] = candidates[j]
    candidates[j] = tmp
  }

  // Decide which processes to kill this iteration, up to the cap
  const toKill: psList.ProcessDescriptor[] = []
  for (const p of candidates) {
    if (toKill.length >= opts.maxKillsPerRun) break
    if (Math.random() < opts.probability) toKill.push(p)
  }
  if (!toKill.length) {
    console.log('No processes chosen this run')
    return
  }

  for (const p of toKill) {
    console.log(opts.dryRun ? '[dry-run] would kill' : 'killing', p.pid, p.name)
    if (opts.dryRun) continue
    try {
      process.kill(p.pid, 'SIGTERM')
    } catch (e) {
      console.warn('Failed to send SIGTERM', e)
      continue
    }
    // Wait for the grace period
    await new Promise(r => setTimeout(r, opts.graceMs))
    // Signal 0 sends nothing; it only tests whether the PID still exists
    try {
      process.kill(p.pid, 0)
      // Still alive, escalate
      console.log('Escalating to SIGKILL for', p.pid)
      process.kill(p.pid, 'SIGKILL')
    } catch (e) {
      // Process already gone (or it exited between the check and SIGKILL)
      console.log('Process', p.pid, 'exited within the grace period')
    }
  }
}

async function main() {
  const opts: Options = {
    // An invalid pattern throws here, before anything can be killed
    namePattern: argv.name ? new RegExp(argv.name) : undefined,
    probability: argv.probability,
    intervalMs: argv.interval,
    dryRun: argv['dry-run'],
    graceMs: argv.grace,
    whitelist: (argv.whitelist ?? []).map(String),
    blacklist: (argv.blacklist ?? []).map(String),
    maxKillsPerRun: argv['max-kills']
  }
  // Refuse to run unscoped: with no rule, every process on the host would match
  if (!opts.namePattern && !opts.whitelist.length) {
    throw new Error('Refusing to run without --name or --whitelist')
  }
  console.log('process-roulette starting with opts', opts)

  // Basic loop; could be enhanced with signal handling
  while (true) {
    try {
      await runOnce(opts)
    } catch (e) {
      console.error('Error during run', e)
    }
    await new Promise(r => setTimeout(r, opts.intervalMs))
  }
}

main().catch(err => {
  console.error(err)
  process.exit(1)
})
Notes on the example
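- Discovery uses ps-list, which supports Linux, macOS, and Windows. Signal delivery is the Unix-specific part: on Windows, process.kill terminates the target unconditionally with no graceful SIGTERM phase, so add a separate path (for example taskkill) if you need Windows support.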
- It uses a probability model; tweak probability and interval to shape failure injection intensity.
- Dry-run is on by default to encourage safety. Pass --no-dry-run only after you have verified the target list.
- It performs graceful termination first, then escalates to SIGKILL if the process does not exit.
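To reason about how probability and interval shape intensity, a little arithmetic helps. The sketch below assumes max-kills is 1 and that each run takes roughly one interval; the function names are illustrative. Under those assumptions a run kills a process when at least one of n independent draws succeeds, so the per-run chance is 1 - (1 - p)^n.

```typescript
// Chance that a single run kills a process when max-kills is 1:
// at least one of n candidates is drawn with probability p
function killChancePerRun(p: number, n: number): number {
  return 1 - Math.pow(1 - p, n)
}

// Rough upper bound on kills per hour; grace-period waits and
// discovery time make real cycles longer than intervalMs
function expectedKillsPerHour(p: number, n: number, intervalMs: number): number {
  const runsPerHour = 3600000 / intervalMs
  return killChancePerRun(p, n) * runsPerHour
}
```

With the defaults (probability 0.1, interval 60000 ms) and five matching processes, the per-run chance is about 0.41, or roughly 25 kills per hour at most. That is likely more aggressive than the default of 0.1 suggests, which is a good reason to check the candidate count in dry-run first.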
Advanced extensions for real-world usage
The minimal tool is useful for local experiments. For production-quality chaos testing in 2026, extend the tool with these capabilities:
1. Container-aware targeting
When running in Kubernetes, you usually want to kill processes inside specific pods or containers, or you may prefer to delete pods instead. You can:
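- Run process-roulette as a sidecar container in a pod with shareProcessNamespace enabled, so it sees only that pod's processes, or as a privileged DaemonSet when you need host-level visibility.
- Use the Kubernetes API to delete or evict pods (or cordon and drain nodes) to simulate controller-level failures instead of killing PIDs.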
- Integrate with container runtimes (crictl/docker) to identify processes by container ID.
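As a lighter-weight alternative to calling the runtime CLI, a host-level tool on Linux can often map a PID to a container by reading /proc/<pid>/cgroup. The sketch below assumes the container ID appears in the cgroup path as a 64-character hex string, which is common for Docker and containerd but depends on the runtime and cgroup configuration; both function names are hypothetical.

```typescript
import { readFile } from 'node:fs/promises'

// Extracts the first 64-character hex ID found in cgroup file content.
// Assumption: the runtime embeds the container ID in the cgroup path.
function parseContainerId(cgroupContent: string): string | undefined {
  for (const line of cgroupContent.split('\n')) {
    const match = line.match(/([0-9a-f]{64})/)
    if (match) return match[1]
  }
  return undefined
}

// Linux only: returns undefined when the file cannot be read
// (process gone, permission denied, or no /proc filesystem)
async function containerIdForPid(pid: number): Promise<string | undefined> {
  try {
    return parseContainerId(await readFile(`/proc/${pid}/cgroup`, 'utf8'))
  } catch {
    return undefined
  }
}
```

Processes for which this returns undefined should be treated as out of scope rather than as host processes that are safe to kill.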
2. Chaos policies and experiments as code
Adopt experiment definitions stored in Git and triggered by GitOps. Each experiment defines scope, risk level, blast radius, and rollbacks. This fits trends in 2025/2026 where teams treat chaos experiments like feature releases.
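An experiment definition can be as simple as a typed object checked in CI before it is merged. The shape below is a hypothetical schema for illustration, not a format used by any particular chaos platform; the field names and validation rules are assumptions.

```typescript
type RiskLevel = 'low' | 'medium' | 'high'

// Hypothetical experiment definition stored in Git
type Experiment = {
  id: string
  environment: string   // for example 'staging'
  riskLevel: RiskLevel
  targetNames: string[] // blast radius: exact process names
  maxKills: number
  rollback: string      // pointer to a runbook or recovery command
}

// Returns a list of problems; an empty list means the definition is acceptable
function validateExperiment(e: Experiment): string[] {
  const errors: string[] = []
  if (!e.id.trim()) errors.push('id is required')
  if (!e.environment.trim()) errors.push('environment is required')
  if (e.targetNames.length === 0) errors.push('at least one target name is required')
  if (!Number.isInteger(e.maxKills) || e.maxKills < 1) errors.push('maxKills must be a positive integer')
  if (!e.rollback.trim()) errors.push('rollback is required')
  return errors
}
```

Running the validator as a required check on pull requests gives reviewers a single place to see scope and rollback before anything executes.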
3. Observability and experiment correlation
Emit structured events to OpenTelemetry or a metrics endpoint. Add experiment IDs and timestamps so SREs can filter logs and traces for the window when kills occurred. Example event fields:
- experiment_id
- target_pid
- signal_sent
- result (terminated | escalated | error)
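The fields above map naturally onto one JSON object per kill, written as a single log line that any collector can pick up. This is a sketch only: buildKillEvent and the added timestamp field name are assumptions, and a real setup might send the same data through an OpenTelemetry exporter instead of stdout.

```typescript
type KillResult = 'terminated' | 'escalated' | 'error'

type KillEvent = {
  experiment_id: string
  target_pid: number
  signal_sent: string
  result: KillResult
  timestamp: string // ISO 8601, for correlating with traces and logs
}

// now is injectable so that tests produce stable timestamps
function buildKillEvent(
  experimentId: string,
  pid: number,
  signal: string,
  result: KillResult,
  now: Date = new Date()
): KillEvent {
  return {
    experiment_id: experimentId,
    target_pid: pid,
    signal_sent: signal,
    result,
    timestamp: now.toISOString()
  }
}

// One JSON object per line keeps the output easy to parse
function emitKillEvent(event: KillEvent): void {
  console.log(JSON.stringify(event))
}
```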
4. Safety enforcement and approvals
Require preflight approvals: a webhook that checks if the cluster is in an allowed time window and no high-severity incidents are open. Implement rate limiting and a global kill-switch endpoint.
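The decision logic behind such a preflight check is small enough to sketch. The example assumes the caller has already fetched the open-incident count from its incident tracker and that the allowed window is expressed in UTC hours; the Preflight type and preflightAllows function are hypothetical.

```typescript
// Hypothetical preflight inputs
type Preflight = {
  allowedStartHourUtc: number        // inclusive, 0 to 23
  allowedEndHourUtc: number          // exclusive, 0 to 23
  openHighSeverityIncidents: number  // fetched by the caller
}

function preflightAllows(p: Preflight, now: Date = new Date()): boolean {
  if (p.openHighSeverityIncidents > 0) return false
  const hour = now.getUTCHours()
  // A start later than the end means the window wraps past midnight (for example 22 to 4)
  return p.allowedStartHourUtc <= p.allowedEndHourUtc
    ? hour >= p.allowedStartHourUtc && hour < p.allowedEndHourUtc
    : hour >= p.allowedStartHourUtc || hour < p.allowedEndHourUtc
}
```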
5. Integrate with chaos platforms
Popular chaos frameworks in recent years include Gremlin, LitmusChaos, and Chaos Mesh. Use process-roulette as a custom experiment in these platforms and leverage their scheduling, RBAC, and safety guardrails.
Testing and validating resilience
Chaos experiments are only useful if you measure outcomes. Key validation steps:
- Define success criteria: zero SLO breaches, acceptable error rate increase, bounded request retries, no data loss.
- Run experiments under load to capture real behavior: use load generators to simulate traffic during kills.
- Track automation safety: ensure CI pipelines only run chaos on ephemeral test environments.
- Use readiness and liveness probe behaviors to verify orchestrator restarts correctly.
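Success criteria are easier to enforce when they are evaluated by code rather than by eye. The sketch below assumes you can export a handful of numbers from your metrics system after each run; the Observed and Criteria shapes and the thresholds are illustrative assumptions.

```typescript
// Hypothetical measurements gathered for one experiment run
type Observed = {
  sloBreaches: number
  baselineErrorRate: number   // fraction of failed requests before the run
  experimentErrorRate: number // fraction of failed requests during the run
  maxRetriesSeen: number      // highest retry count for any single request
  lostRecords: number
}

type Criteria = { maxErrorRateIncrease: number; maxRetries: number }

// Returns the list of violated criteria; an empty list means the run passed
function evaluateRun(o: Observed, c: Criteria): string[] {
  const failures: string[] = []
  if (o.sloBreaches > 0) failures.push('SLO breached')
  if (o.experimentErrorRate - o.baselineErrorRate > c.maxErrorRateIncrease) {
    failures.push('error rate increase above threshold')
  }
  if (o.maxRetriesSeen > c.maxRetries) failures.push('retries exceeded bound')
  if (o.lostRecords > 0) failures.push('data loss detected')
  return failures
}
```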
Common pitfalls and how to avoid them
Here are practical warnings from real-world practice:
- Messy assumptions about restarts — some teams assume the process manager will preserve in-memory state; test for it and store critical state externally.
- Missing signal handlers — Node apps should listen for SIGTERM and perform graceful shutdown: stop accepting new connections, flush queues, and close DB connections.
- Blindly killing databases or sidecars — use whitelists to prevent hitting critical processes.
- Lack of observability — if you cannot see traces or metrics during the experiment, you cannot debug the root cause.
Signal handling best practices for Node.js services
Make your Node.js apps resilient to process termination by implementing robust shutdown paths. Minimal pattern:
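// server, drainWorkloads, and closeDbConnections are defined by your application
process.on('SIGTERM', () => {
  console.log('SIGTERM received: closing server')
  // Force exit if shutdown does not finish in time; unref() keeps this
  // timer from holding the event loop open on its own
  setTimeout(() => {
    console.warn('Force exit after grace period')
    process.exit(1)
  }, 10000).unref()
  // Stop accepting new connections; the callback runs once open ones have ended
  server.close(async () => {
    try {
      await drainWorkloads()
      await closeDbConnections()
      process.exit(0)
    } catch (err) {
      console.error('Shutdown failed', err)
      process.exit(1)
    }
  })
})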
Use a process manager (systemd, pm2, or Kubernetes) configured with appropriate restart policies and probe intervals so restarts are reliable but not abusive.
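Shutdown logic is easier to unit test when it is separated from process.on and process.exit. One possible shape is sketched below: the steps are injected as functions and the result is an exit code, so a test can pass in fakes. The ShutdownDeps type and gracefulShutdown function are hypothetical names, not part of Node.js.

```typescript
// Hypothetical dependencies; your application supplies the real ones
type ShutdownDeps = {
  closeServer: () => Promise<void>
  drainWorkloads: () => Promise<void>
  closeDbConnections: () => Promise<void>
}

// Resolves to the exit code the process should use:
// 0 on a clean shutdown, 1 on error or when graceMs elapses first
async function gracefulShutdown(deps: ShutdownDeps, graceMs: number): Promise<number> {
  let timer: ReturnType<typeof setTimeout> | undefined
  const timeout = new Promise<number>(resolve => {
    timer = setTimeout(() => resolve(1), graceMs)
  })
  const work = (async () => {
    await deps.closeServer()
    await deps.drainWorkloads()
    await deps.closeDbConnections()
    return 0
  })().catch(() => 1)
  const code = await Promise.race([work, timeout])
  if (timer) clearTimeout(timer)
  return code
}
```

In the service itself, the SIGTERM handler then shrinks to calling gracefulShutdown with the real dependencies and passing the resulting code to process.exit.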
Ethics, compliance, and governance
Simulating failures can impact data and compliance controls. In 2026, organizations increasingly require documented experiments, audits, and approvals. Follow these guidelines:
- Keep an experiment registry with dates, scope, and results.
- Log experiment actions for compliance audits.
- Ensure experiments do not violate customer SLAs or data residency rules.
Case study: Hardening a Node API after process-roulette
At a mid-size fintech in 2025, SREs ran a process-roulette campaign on staging. Findings and outcomes:
- Problem: A background in-memory job processor lost state when the process was killed during a batch window. Retry logic was insufficient.
- Fixes: Jobs persisted state to Redis; shutdown handlers drained in-flight jobs; circuit breakers were added for external calls.
- Result: When the same experiment was replayed, zero jobs were lost and error rates remained within SLOs.
Lessons: process-level failures uncover both code and operational assumptions. The fix was code plus configuration changes to restart behavior.
Where process-roulette fits in your reliability program in 2026
Process-roulette is one tool in your chaos toolbox. Use it alongside:
- Network chaos (packet loss, latency)
- Resource pressure (CPU, memory stress)
- API rate limiting and dependency failure simulations
- Orchestrator-level actions (pod eviction, node termination)
Next steps and practical checklist
- Fork the example code and run in dry-run mode against a staging host.
- Implement signal handlers in your Node services and add unit tests for shutdown logic.
- Integrate process-roulette events with your tracing and metrics.
- Define an experiment policy and schedule the first test during a low-risk window with stakeholders informed.
- Iterate: analyze results, fix issues, and rerun until acceptance criteria are met.
Final thoughts and 2026 trends to watch
Chaos engineering continues to evolve into structured programs integrated with CI/CD and observability platforms. Expect to see:
- Greater automation of safety checks and approvals via policy-as-code
- Tighter integration between chaos platforms and GitOps pipelines
- More default support for experimental auditing in cloud providers and orchestrators
Process-roulette is a focused technique that surfaces classically hard-to-hit issues at the process level. When used responsibly, it accelerates hardening and gives teams measurable confidence in their Node.js services.
Call to action
Ready to try this at your org? Start by cloning the example, enabling dry-run, and running a scoped experiment in a staging environment. Share your findings with your SRE and platform teams, and consider contributing improvements back to the example so others can benefit. If you want a checklist or a template GitOps experiment CRD for Kubernetes, request it and I will provide a ready-to-use repo.