Contracts first: using TypeScript type generation from analytics schemas (ClickHouse/Parquet)
Generate TypeScript types and runtime validators from ClickHouse schema or Parquet descriptors so ingestion and analytics share a single contract.
If your ingestion pipelines and analytics teams constantly play schema telephone (mismatched types, silent truncations, broken dashboards), you need a single source of truth. In 2026, with ClickHouse adoption surging and columnar formats like Parquet everywhere, generating TypeScript types and runtime validation from the canonical analytics schema is no longer optional: it's how you ship resilient pipelines fast.
Why contracts-first matters in 2026
Data platforms matured fast through 2024–2026: ClickHouse gained major funding and enterprise traction, and teams have widely standardized on Parquet for interchange between ingestion and analytics. That means two things for engineers:
- Structural schemas are the contract — CREATE TABLE DDL in ClickHouse or Parquet column descriptors describe the shape of truth.
- Contract drift is expensive — mismatched types cause silent data loss, incorrect aggregations, and subtle production bugs.
The solution: auto-generate TypeScript types and validation logic from those schemas (ClickHouse DDL / system tables and Parquet descriptors). Then your ingestion code, transformation layers, and analytics tooling can share the same contract, enforced at build- and runtime.
High-level approach
Contract-first codegen pipeline in four stages:
- Source schema: Pull canonical schema from ClickHouse (system.columns / SHOW CREATE TABLE) or read Parquet file descriptors.
- Normalize to an intermediate representation (IR) that makes nullable, array, nested, map, and tuple types explicit (a minimal IR sketch follows this list).
- Emit TypeScript types and runtime validators (Zod / io-ts / runtypes). Optionally, generate typed DB clients or ingest adapters.
- CI checks: diff generated types vs committed types to detect schema drift; run compatibility checks before deploy.
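To make the normalize step concrete, here is a minimal sketch of what that intermediate representation could look like; the names IRType, IRField, and IRTable are illustrative, not a published library.
// Minimal IR sketch: one normalized node per column, shared by every emitter.
type IRType =
  | { kind: 'scalar'; ts: 'string' | 'number' | 'bigint' | 'boolean' | 'unknown' }
  | { kind: 'nullable'; inner: IRType }
  | { kind: 'array'; item: IRType }
  | { kind: 'map'; key: IRType; value: IRType }
  | { kind: 'tuple'; items: IRType[] };

interface IRField {
  name: string;       // column name from ClickHouse or the Parquet descriptor
  type: IRType;       // normalized type, independent of the source system
  sourceType: string; // original DDL/descriptor string, kept for diagnostics
}

interface IRTable {
  database: string;
  table: string;
  fields: IRField[];
}
Because every emitter (TypeScript types, Zod validators, typed clients, mock generators) reads the same IRField[], adding an output format means writing a new emitter rather than a new schema parser, and the static and runtime contracts cannot drift apart.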
Practical example: Type generation from ClickHouse using HTTP system tables
Below is a complete pattern you can adopt in your repo. It queries the ClickHouse system.columns table via the HTTP interface (works with default ClickHouse installations), maps ClickHouse types to TypeScript, and generates Zod validator code.
Key design decisions explained:
- You should decide how to treat 64-bit integers. Mapping to bigint preserves range but complicates JSON transport; many teams choose string for UInt64/Int64 in ingestion APIs or provide a configurable mapping.
- Handle Nullable(T) explicitly in both types and validators.
- Arrays and Tuples require recursive mapping.
Node/TypeScript script (clickhouse-schema-to-ts.ts)
import fs from 'fs';
import fetch from 'node-fetch';
// Config
const CLICKHOUSE_URL = process.env.CLICKHOUSE_URL || 'http://localhost:8123/';
const DATABASE = 'analytics';
const TABLE = 'events';
const OUT = './src/generated/events.schema.ts';
type Column = { name: string; type: string; default_kind: string | null };
async function fetchColumns(): Promise<Column[]> {
const q = `SELECT name, type, default_kind FROM system.columns WHERE database='${DATABASE}' AND table='${TABLE}' FORMAT JSON`;
const res = await fetch(CLICKHOUSE_URL + '?query=' + encodeURIComponent(q));
if (!res.ok) throw new Error(`ClickHouse query failed: ${res.status} ${await res.text()}`);
const json = (await res.json()) as { data: Column[] };
return json.data;
}
function clickhouseTypeToTs(type: string, preferBigInt = false): { ts: string; zod: string } {
// Simplified parser: handles Nullable(...), Array(...), basic types. Add cases as needed.
type = type.trim();
if (type.startsWith('Nullable(') && type.endsWith(')')) {
const inner = type.slice(9, -1);
const innerMap = clickhouseTypeToTs(inner, preferBigInt);
return { ts: `${innerMap.ts} | null`, zod: `${innerMap.zod}.nullable()` };
}
if (type.startsWith('Array(') && type.endsWith(')')) {
const inner = type.slice(6, -1);
const innerMap = clickhouseTypeToTs(inner, preferBigInt);
return { ts: `${innerMap.ts}[]`, zod: `z.array(${innerMap.zod})` };
}
// basic mapping
const mapping: Record<string, { ts: string; zod: string }> = {
'String': { ts: 'string', zod: 'z.string()' },
'UUID': { ts: 'string', zod: 'z.string().uuid()' },
'Date': { ts: 'string', zod: 'z.string()' },
'DateTime': { ts: 'string', zod: 'z.string()' },
'Int8': { ts: 'number', zod: 'z.number().int()' },
'Int16': { ts: 'number', zod: 'z.number().int()' },
'Int32': { ts: 'number', zod: 'z.number().int()' },
'Float32': { ts: 'number', zod: 'z.number()' },
'Float64': { ts: 'number', zod: 'z.number()' },
'UInt8': { ts: 'number', zod: 'z.number().int().nonnegative()' },
'UInt16': { ts: 'number', zod: 'z.number().int().nonnegative()' },
'UInt32': { ts: 'number', zod: 'z.number().int().nonnegative()' },
// Use string for 64-bit by default to avoid JSON issues; configurable
'Int64': preferBigInt ? { ts: 'bigint', zod: 'z.bigint()' } : { ts: 'string', zod: 'z.string()' },
'UInt64': preferBigInt ? { ts: 'bigint', zod: 'z.bigint()' } : { ts: 'string', zod: 'z.string()' },
'Enum8': { ts: 'string', zod: 'z.string()' },
'Enum16': { ts: 'string', zod: 'z.string()' },
'IPv4': { ts: 'string', zod: 'z.string()' },
'IPv6': { ts: 'string', zod: 'z.string()' },
'JSON': { ts: 'unknown', zod: 'z.any()' }
};
if (mapping[type]) return mapping[type];
// fallback
return { ts: 'unknown', zod: 'z.any()' };
}
async function main() {
const cols = await fetchColumns();
const lines: string[] = [];
lines.push("// Generated file - do not edit by hand. Run your codegen to change this.");
lines.push("import { z } from 'zod';\n");
lines.push('export const EventSchema = z.object({');
for (const c of cols) {
const mapping = clickhouseTypeToTs(c.type);
lines.push(` ${c.name}: ${mapping.zod},`);
}
lines.push('});\n');
lines.push('export type Event = z.infer<typeof EventSchema>;');
fs.writeFileSync(OUT, lines.join('\n'));
console.log('Wrote', OUT);
}
main().catch(err => { console.error(err); process.exit(1); });
This script is intentionally compact. Production usage should add:
- Command-line flags to select database/table and output path.
- Better ClickHouse type parsing (Tuple, Map, Nested, LowCardinality); a sketch follows this list.
- Config to map 64-bit integers to string or bigint depending on your JSON interchange strategy.
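As a hedged sketch of the type-parsing item, here is a wrapper that handles a few additional ClickHouse wrapper types (LowCardinality, DateTime64, Map) before delegating to the clickhouseTypeToTs function from the script above; verify the exact type strings against your ClickHouse version.
// Sketch: assumes this lives in the same file as clickhouseTypeToTs above.
function extendedTypeToTs(type: string, preferBigInt = false): { ts: string; zod: string } {
  type = type.trim();
  // LowCardinality(T) is a storage optimization only: unwrap and recurse.
  if (type.startsWith('LowCardinality(') && type.endsWith(')')) {
    return extendedTypeToTs(type.slice('LowCardinality('.length, -1), preferBigInt);
  }
  // DateTime64(precision[, timezone]): treat like DateTime, a string on the JSON wire.
  if (type.startsWith('DateTime64(')) {
    return { ts: 'string', zod: 'z.string()' };
  }
  // Map(K, V): naive split on the first comma; a production parser should track
  // parenthesis depth so nested values like Map(String, Array(UInt64)) survive.
  if (type.startsWith('Map(') && type.endsWith(')')) {
    const body = type.slice(4, -1);
    const comma = body.indexOf(',');
    const value = extendedTypeToTs(body.slice(comma + 1).trim(), preferBigInt);
    return { ts: `Record<string, ${value.ts}>`, zod: `z.record(z.string(), ${value.zod})` };
  }
  return clickhouseTypeToTs(type, preferBigInt);
}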
Parquet descriptors → TypeScript + validators
Parquet files embed column metadata and types. For Parquet, you can use a reader (e.g., parquetjs-lite) to extract the schema and generate types the same way. Below is a small pattern.
import fs from 'fs';
import parquet from 'parquetjs-lite';
async function genFromParquet(filePath: string, out: string) {
const reader = await parquet.ParquetReader.openFile(filePath);
// parquetjs-lite exposes the decoded schema on reader.schema; fields is keyed by column name.
// Normalize: produce an IR with name -> { type, optional }
const fields = Object.entries((reader as any).schema.fields || {}) as [string, any][];
const lines: string[] = [];
lines.push("import { z } from 'zod';\n");
lines.push('export const RecordSchema = z.object({');
for (const [name, info] of fields) {
// info.primitiveType or info.originalType
const t = info.primitiveType || info.originalType || 'BYTE_ARRAY';
const zod = (t === 'INT64') ? 'z.string()' : (t === 'DOUBLE') ? 'z.number()' : 'z.string()';
const optional = info.repetitionType === 'OPTIONAL';
lines.push(` ${name}: ${optional ? zod + '.optional()' : zod},`);
}
lines.push('});');
lines.push('export type Record = z.infer<typeof RecordSchema>;');
fs.writeFileSync(out, lines.join('\n'));
await reader.close();
}
genFromParquet(process.argv[2], './src/generated/record.schema.ts').catch(err => { console.error(err); process.exit(1); });
Parquet has richer types (nested groups, maps, logical types). For production, map each logical type (UTF8, DECIMAL, TIMESTAMP_MILLIS) to an appropriate TypeScript type or branded type. Consider using specialized runtime decoding for decimals.
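For instance, a minimal sketch of that mapping for two logical types; the DecimalString brand and the validation regex are illustrative assumptions, not part of any Parquet library.
import { z } from 'zod';
// Branded string for DECIMAL columns: preserves exact precision on the wire while
// preventing accidental use as an ordinary string elsewhere in the codebase.
export type DecimalString = string & { readonly __brand: 'DecimalString' };
export const DecimalString = z.string()
  .regex(/^-?\d+(\.\d+)?$/)
  .transform(s => s as DecimalString);
// TIMESTAMP_MILLIS arrives as epoch milliseconds; validate the integer and convert to Date.
export const TimestampMillis = z.number().int().transform(ms => new Date(ms));
// UTF8 maps straight to string.
export const Utf8 = z.string();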
Runtime validation: why generate validators too
Static TypeScript types help you during development, but ingestion often receives untyped JSON or binary payloads. Generated runtime validators (Zod / io-ts) are crucial for:
- Fail-fast on schema drift during ingestion.
- Providing typed assertion helpers for streaming frameworks.
- Generating contract-driven mockers for testing.
Tip: Generate both types and validators from the same IR so they never drift. Many teams generate a Zod schema and use z.infer to produce the TypeScript type rather than emitting a separate type string.
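To illustrate the fail-fast point, here is a minimal Express-style middleware sketch; the incrementMetric hook and the 422 response shape are assumptions to adapt to your stack.
import type { Request, Response, NextFunction } from 'express';
import { EventSchema } from './generated/events.schema';
// Hypothetical metrics hook; wire this to your own telemetry client.
declare function incrementMetric(name: string, tags?: Record<string, string>): void;
// Reject events that do not match the generated contract before they reach ClickHouse.
export function validateEvent(req: Request, res: Response, next: NextFunction) {
  const result = EventSchema.safeParse(req.body);
  if (!result.success) {
    incrementMetric('ingest.validation_failure', { table: 'events' });
    return res.status(422).json({ error: 'schema_mismatch', issues: result.error.issues });
  }
  req.body = result.data; // parsed payload, typed as Event for downstream handlers
  next();
}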
Advanced TypeScript patterns for generated schemas
Use advanced types and generics to keep generated code small and reusable.
- Generic parsing helpers: export a typed parser function that returns Promise<Event> by running the Zod validator at runtime.
- Discriminated unions for event types: when a table stores multiple event types (a type column), generate a union keyed by the discriminator.
- Type-safe transformations: provide a map of field transformers (e.g., parse timestamps to Date) with typed signatures so transformations are guaranteed to return the right shape (a sketch follows the example below).
// Example: typed parser generated alongside schema
export const parseEvent = (row: unknown) => EventSchema.parse(row);
// Discriminated union example (pseudo-generated)
export const RawEvent = z.discriminatedUnion('event_type', [
z.object({ event_type: z.literal('click'), x: z.number(), y: z.number() }),
z.object({ event_type: z.literal('view'), url: z.string() })
]);
export type RawEvent = z.infer<typeof RawEvent>;
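And a hedged sketch of the typed field-transformer idea from the list above; the Transformers mapped type and the example field names are illustrative.
// Per-field transformers whose output shape is checked by the compiler.
// Raw is the wire shape (e.g. the generated Event), Out is what your
// transformation layer promises to produce.
type Transformers<Raw, Out> = {
  [K in keyof Raw & keyof Out]?: (value: Raw[K]) => Out[K];
};
// Example: parse the DateTime string into a real Date, leave other fields alone.
interface EventRow { event_time: string; user_id: string }
interface ParsedEventRow { event_time: Date; user_id: string }
const transformers: Transformers<EventRow, ParsedEventRow> = {
  event_time: (v) => new Date(v),
};
function applyTransforms(row: EventRow): ParsedEventRow {
  return { ...row, event_time: transformers.event_time!(row.event_time) };
}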
Schema evolution and compatibility checks
Generating types is step one — preventing regressions is step two. Add the following to your CI pipeline:
- Generation check step: Run codegen and fail the build if generated files differ from the committed versions (a sketch of this check follows the list).
- Compatibility tests: Use a schema-compatibility tool or write unit tests that validate old data snapshots against the new schema. For ClickHouse, check that column type changes are additive (e.g., adding new Nullable columns is safe).
- Contract versioning: Tag generated artifacts with schema hash and bump semantic schema versions for breaking changes; record metadata in a schema registry (git + centralized registry works).
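Here is a minimal sketch of that generation check as a Node script; the generate:schemas npm script name and the src/generated path are assumptions matching the earlier examples.
import { execSync } from 'node:child_process';
// Regenerate schemas, then fail CI if the working tree differs from what is committed.
execSync('npm run generate:schemas', { stdio: 'inherit' });
const diff = execSync('git status --porcelain -- src/generated', { encoding: 'utf8' });
if (diff.trim().length > 0) {
  console.error('Generated schemas are out of date. Re-run codegen and commit the result:\n' + diff);
  process.exit(1);
}
console.log('Generated schemas match the committed contract.');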
Practical rules for safe evolution
- Additions: adding new nullable columns is non-breaking. Adding non-nullable without default is breaking.
- Type widening: number → string is often safe for transport, but string → number is breaking.
- Rename columns only with coordinated rollouts (dual writes or aliases in ClickHouse).
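The snapshot-compatibility idea from the CI list above can be encoded as a small test; this Vitest sketch assumes a fixtures/events.snapshot.json file of previously accepted rows.
import { describe, it, expect } from 'vitest';
import fs from 'node:fs';
import { EventSchema } from '../src/generated/events.schema';
describe('events schema compatibility', () => {
  // Rows captured before the schema change; the new schema must still accept them,
  // otherwise the change is breaking for existing consumers.
  const snapshots: unknown[] = JSON.parse(
    fs.readFileSync('./fixtures/events.snapshot.json', 'utf8')
  );
  it('still accepts previously valid rows', () => {
    for (const row of snapshots) {
      expect(EventSchema.safeParse(row).success).toBe(true);
    }
  });
});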
Tooling and ecosystem notes (2026)
In 2026 the ecosystem matured in several ways that impact your codegen strategy:
- ClickHouse libraries for Node/TypeScript improved: lightweight HTTP clients and typed bindings are common; use the client that supports JSON output to simplify parsing.
- Parquet tooling converged around parquetjs-lite and Arrow-based readers for JavaScript; for complex logical types prefer Arrow decode and then map to TS types.
- Developer workflows favor GitOps for schemas: store DDL and generated types in the same repo, and run schema diffs as pull-request gates. For broader developer-platform patterns, see building a Developer Experience Platform that integrates codegen and CI gates.
- Validation-first libraries like Zod gained better performance and codegen integration — generating Zod schemas is widely accepted as the canonical runtime contract in JS/TS stacks.
Case study: reducing incidents by 70% at a mid-size analytics team
One analytics team I worked with in late 2025 adopted a contracts-first pipeline:
- Authoritative schema lived as ClickHouse CREATE TABLEs stored in a schemas/ directory in the monorepo.
- CI job generated TypeScript + Zod artifacts on PR and compared them to committed files; any diff required a schema PR or a generators update.
- All ingestion microservices pulled the generated validators and used them in a middleware that rejected invalid events and emitted a metric.
Result: production incidents caused by schema mismatch dropped ~70% in three months. The team recovered faster because they had a deterministic mapping between DDL and runtime validations. To scale ingestion across unreliable networks and edge producers, consider patterns used in edge message brokers for offline sync and durable delivery.
Operational patterns & best practices
- Central schema repo: store canonical DDL and example Parquet descriptors; generate artifacts and check them in.
- Configurable codegen: allow teams to toggle preferences (bigint vs string for 64-bit), output formats (Zod, io-ts), and transformers.
- Small, audit-friendly diff: emitted code should be deterministic and stable to minimize churn in PRs.
- Performance-aware validation: for high-throughput ingestion, run lightweight, fast checks first (shape, discriminator) and defer deeper validation to an asynchronous path when needed; a sketch follows this list. Consider general performance strategies (caching and fast-path checks) described in caching strategies for serverless/high-throughput systems.
- Documentation & examples: generate README snippets with sample objects and example failing cases so data producers can iterate quickly.
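A minimal sketch of that fast-path split; the event_type and event_id fields and the queueDeepValidation hook are illustrative assumptions.
import { z } from 'zod';
// Cheap structural pre-check: only the discriminator and the identity field.
const FastEventCheck = z.object({
  event_type: z.string(),
  event_id: z.string(),
}).passthrough();
// Hypothetical hook: push the full payload onto a queue for asynchronous deep validation.
declare function queueDeepValidation(payload: unknown): void;
export function acceptEvent(payload: unknown): boolean {
  // Fast path on the hot ingestion thread.
  if (!FastEventCheck.safeParse(payload).success) return false;
  // Deep validation with the generated EventSchema happens off the hot path.
  queueDeepValidation(payload);
  return true;
}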
Advanced patterns: partial reads, generics, and typed mappers
When analytics reads partial columns (projection), generate smaller TypeScript types and mappers to avoid loading full objects. Use TypeScript generics to express projected types at compile time.
// Example: Generic projection helper
export function project<T, K extends keyof T>(obj: T, keys: readonly K[]): Pick<T, K> {
const out: Partial<T> = {};
for (const k of keys) out[k] = obj[k];
return out as Pick<T, K>;
}
Combine generated types with helpers like this to keep downstream code precise and safe. If your team ships client SDKs or read/write helpers, tie them into your platform strategy; this often intersects with choices about where to host services and how to operate CI — see cloud-native hosting patterns for trade-offs.
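If you also emit Zod schemas, the same projection can be expressed at the validator level with .pick; a sketch assuming the generated EventSchema has event_id and event_time columns.
import { z } from 'zod';
import { EventSchema } from './generated/events.schema';
// Validator for a projected read: only the columns this consumer actually selects.
export const EventTimingSchema = EventSchema.pick({ event_id: true, event_time: true });
export type EventTiming = z.infer<typeof EventTimingSchema>;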
Common pitfalls and how to avoid them
- Pitfall: Mapping Int64 to number silently loses precision. Fix: decide on string vs bigint mapping early and enforce it.
- Pitfall: Generated files drift and are edited by hand. Fix: mark generated files and enforce generation in CI.
- Pitfall: Too-strict runtime checks block benign data. Fix: allow a configurable relax mode that logs anomalies for analysts to review (a sketch follows this list); for production telemetry and vendor selection patterns, see trust scores for telemetry vendors.
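A minimal sketch of that relax mode, assuming a logAnomaly hook of your own:
import { EventSchema, type Event } from './generated/events.schema';
// Hypothetical anomaly sink: a log table, a dead-letter topic, or a metric.
declare function logAnomaly(context: string, issues: unknown): void;
export function parseEventRelaxed(row: unknown, strict: boolean): Event | null {
  const result = EventSchema.safeParse(row);
  if (result.success) return result.data;
  if (strict) return null;                    // strict mode: reject the row
  logAnomaly('events', result.error.issues);  // relax mode: record and let it through
  return row as Event;                        // caller accepts the risk in relax mode
}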
Actionable checklist to implement today
- Choose your canonical schema source: ClickHouse DDL or Parquet descriptors (or both).
- Implement a codegen script that outputs TypeScript types and Zod validators (start from the examples above).
- Add a CI job that regenerates and diffs generated artifacts on PRs.
- Instrument ingestion to run generated validators before writing to ClickHouse; emit metrics on validation failures. Track those metrics on a dashboard (instrumentation and KPI design ties into broader monitoring and dashboard practices — see KPI dashboard patterns).
- Document schema-evolution rules and add compatibility tests for breaking changes.
Future predictions (2026+)
Expect these trends to accelerate:
- Tighter integration between analytics engines and typed application code: data contracts will be first-class artifacts in developer workflows, with auto-generated SDKs for reads/writes.
- Schema registries and contract testing tooling become standard: CI will check compatibility across producers and consumers automatically.
- Better Parquet/Arrow runtime decoding in JS: lower-level libraries will supply richer type metadata for more precise TypeScript generation.
Final takeaways
By generating TypeScript types and runtime validators from ClickHouse and Parquet schemas, you treat data shapes as code contracts. This reduces runtime errors, accelerates onboarding, and makes analytics pipelines auditable and robust. Start small: generate schemas for a single critical table, add CI checks, then expand across your platform.
Remember: The goal is not perfect typing — it's a single source of truth that everyone (ingestion, transformation, analytics) can rely on.
Call to action
Ready to adopt a contracts-first workflow? Start by adding a schema-to-TypeScript generator to one of your pipelines this week. If you want a ready-made starter, clone the minimal example in this article and adapt it to your ClickHouse or Parquet inputs. Share a PR with your generated schema and add CI checks — your future self (and your on-call rotation) will thank you.