Build an Automated TypeScript SEO Auditor CLI

2026-03-01

Build a TypeScript CLI that audits crawling, meta tags, structured data, and Lighthouse metrics—automate prioritized SEO fixes for engineering teams.

Stop guessing—automate your technical SEO audits with a TypeScript CLI

If you maintain a medium-to-large web property, you know the pain: broken meta tags, flaky structured data, pages that render perfectly in dev but vanish for crawlers. Manual audits are slow, inconsistent, and hard to scale. In 2026 the stakes are higher—search engines rely more on structured entity signals and automated quality assessments. This guide shows how to design and build a TypeScript CLI that programmatically audits crawling health, meta tags, and structured data, runs Lighthouse checks, and outputs prioritized fixes developers can act on.

What this CLI will deliver (executive summary)

  • Fast, configurable crawler that respects robots.txt and sitemaps
  • Page-level audits for meta tags, canonicalization, and hreflang
  • Structured data extraction and lightweight validation (JSON‑LD + microdata)
  • Programmatic Lighthouse runs for Core Web Vitals and SEO metrics
  • Prioritization engine that scores issues by impact, effort, and confidence
  • Outputs: JSON report, human-friendly HTML, CSV, and optional GitHub issues or Slack notifications

Why build this yourself in 2026?

Recent trends (late 2025 → early 2026) pushed search engines to weigh structured, entity-based signals and automated quality metrics more heavily. Off-the-shelf SaaS tools are excellent, but they can be expensive and opaque. A TypeScript CLI gives engineering teams:

  • Full control over crawl scope and heuristics
  • Integration into CI and monorepos (fast feedback loops)
  • Custom prioritization modeled on business KPIs

Architecture overview

Keep it modular. At a high level, the CLI is built from five components:

  1. Input: seed URLs, sitemap(s), or a list of routes
  2. Crawler: respects robots.txt, fetches HTML/JS-rendered pages
  3. Auditors: Meta tags, structured data, Lighthouse runner
  4. Scoring: compute priority for each issue
  5. Reporter: JSON/HTML/CSV/GitHub
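
The five components above can be sketched as typed interfaces. The names (`CrawlResult`, `Auditor`, `Reporter`) are illustrative, not a fixed API:

```typescript
// Illustrative pipeline types; names are assumptions, not a fixed API.
interface Finding {
  url: string
  message: string
  severity: 'info' | 'warning' | 'critical'
}

interface CrawlResult {
  url: string
  status: number
  html: string
}

interface Auditor {
  name: string
  run(page: CrawlResult): Finding[]
}

interface Reporter {
  write(findings: Finding[]): Promise<void> | void
}

// The CLI wires the stages together: crawl -> audit -> score -> report
function runAuditors(pages: CrawlResult[], auditors: Auditor[]): Finding[] {
  return pages.flatMap(p => auditors.flatMap(a => a.run(p)))
}
```

Keeping each auditor behind a shared interface is what makes the later steps (scoring, reporting, CI integration) pluggable.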

Tech stack & rationale

  • TypeScript (Node 20+)—safety, DX, and easy build output
  • Playwright—for reliable JS rendering and fast page fetches
  • Lighthouse—programmatic audits for CWV, SEO, and best practices
  • Cheerio—for quick server-side DOM parsing of static HTML
  • robots-txt-parser & sitemap-parser—respect crawl rules and extract seeds
  • Commander or oclif—for CLI ergonomics
  • p-queue—for concurrency control and politeness

Step 1 — Bootstrap the CLI

Start with a minimal package.json and a tsconfig set up for Node ESM. Use esbuild or tsc to produce a single JS file. Example CLI skeleton with Commander:

import { Command } from 'commander'

const program = new Command()
program
  .name('seo-audit')
  .description('Automated SEO auditor CLI')
  .version('0.1.0')
  .option('-s, --seed <url>', 'seed URL or file with URLs')
  .option('-c, --concurrency <n>', 'concurrency', '4')
  .option('-o, --output <file>', 'output file', 'audit-report.json')

program.parse(process.argv)

const opts = program.opts()
console.log('Running with', opts)

TypeScript config tips

  • Use strict mode and enable esModuleInterop
  • Emit to a dist folder and keep types for CI checks
  • Consider esbuild for fast bundling into a single executable

Step 2 — Implement a respectful crawler

Key rules: honor robots.txt, parse sitemaps, limit concurrency, and avoid hammering servers. Use a queue for URLs and store visited URLs in a Set or persistent DB for large sites.

Robots and sitemaps

Fetch robots.txt and use a parser to decide if a path is allowed. If sitemaps are referenced, expand the seed set from them.

import fetch from 'node-fetch'
import robotsParser from 'robots-parser'

async function isAllowed(baseUrl: string, path: string) {
  const robotsUrl = new URL('/robots.txt', baseUrl).toString()
  const res = await fetch(robotsUrl)
  if (!res.ok) return true // no robots.txt: be permissive by default
  const txt = await res.text()
  const robots = robotsParser(robotsUrl, txt)
  // robots-parser expects a full URL, not a bare path
  return robots.isAllowed(new URL(path, baseUrl).toString(), 'seo-audit-bot') ?? true
}
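
Sitemap expansion can be sketched in a few lines. The `<loc>` extractor below handles plain sitemaps only; for sitemap index files and edge cases, use a real parser such as sitemap-parser. `fetch` here is Node's built-in global (Node 18+):

```typescript
// Extract <loc> entries from sitemap XML (pure function, easy to unit test)
function extractSitemapUrls(xml: string): string[] {
  return [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/gi)].map(m => m[1])
}

// Expand the crawl seed set from /sitemap.xml; returns [] when absent
async function expandSeeds(baseUrl: string): Promise<string[]> {
  const res = await fetch(new URL('/sitemap.xml', baseUrl).toString())
  if (!res.ok) return []
  return extractSitemapUrls(await res.text())
}
```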

Polite crawling with p-queue

import PQueue from 'p-queue'

const queue = new PQueue({ concurrency: 4 })
const visited = new Set<string>() // shared so visit() can check it

async function crawl(urls: string[]) {
  await Promise.all(urls.map(u => queue.add(() => visit(u))))
}

async function visit(url: string) {
  if (visited.has(url)) return
  visited.add(url)
  // decide whether to use Playwright or a simple fetch
}

Step 3 — Fetching and rendering strategy

Not every URL needs a full headless browser. Use a tiered approach:

  • Quick fetch: fetch raw HTML for static pages using node-fetch
  • Render fetch: use Playwright for client-rendered pages or when checking Lighthouse
import { chromium } from 'playwright'

async function renderPage(url: string) {
  const browser = await chromium.launch({ args: ['--no-sandbox'] })
  try {
    const page = await browser.newPage()
    await page.goto(url, { waitUntil: 'networkidle' })
    return await page.content()
  } finally {
    await browser.close() // release the browser even if navigation fails
  }
}
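
To decide which tier a URL needs, a cheap heuristic on the statically fetched HTML is often enough. The signals below (little visible text, an empty SPA root element) are assumptions to tune for your own stack:

```typescript
// Heuristic sketch: should this page go to Playwright instead of plain fetch?
// Signals used here are assumptions; tune them for your framework.
function needsRendering(staticHtml: string): boolean {
  const bodyMatch = staticHtml.match(/<body[^>]*>([\s\S]*)<\/body>/i)
  const body = bodyMatch ? bodyMatch[1] : ''
  // Strip tags and measure the visible text that survives without JS
  const textLength = body.replace(/<[^>]+>/g, '').trim().length
  // An empty #root / #app div is the classic client-rendered SPA shell
  const looksLikeSpaShell = /<div[^>]+id=["'](?:root|app)["'][^>]*>\s*<\/div>/i.test(body)
  return textLength < 200 || looksLikeSpaShell
}
```

Pages that fail the heuristic go through `renderPage`; everything else stays on the cheap fetch path.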

Step 4 — Auditors (meta tags, canonical, hreflang)

Write small, composable auditors that accept HTML and return typed findings. This keeps the code testable and friendly to CI.

import * as cheerio from 'cheerio'

type AuditFinding = {
  url: string
  type: 'meta' | 'canonical' | 'hreflang' | 'structured-data' | 'lighthouse'
  message: string
  severity: 'info' | 'warning' | 'critical'
}

function auditMeta(url: string, html: string): AuditFinding[] {
  const $ = cheerio.load(html)
  const findings: AuditFinding[] = []
  const title = $('head > title').text().trim()
  if (!title || title.length <= 10) findings.push({ url, type: 'meta', message: 'Missing or short title', severity: 'warning' })
  const desc = $('meta[name="description"]').attr('content')
  if (!desc) findings.push({ url, type: 'meta', message: 'Missing meta description', severity: 'warning' })
  const canonical = $('link[rel="canonical"]').attr('href')
  if (!canonical) findings.push({ url, type: 'canonical', message: 'Missing canonical tag', severity: 'info' })
  return findings
}
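
The section title also promises hreflang checks; here is a minimal sketch. It extracts `<link rel="alternate" hreflang="…">` tags with a regex for brevity — in the real auditor, reuse the cheerio document from `auditMeta`:

```typescript
// Minimal hreflang auditor sketch; checks are assumptions about common
// template bugs, not an exhaustive validation of the hreflang spec.
type HreflangFinding = { url: string; message: string; severity: 'info' | 'warning' | 'critical' }

function auditHreflang(url: string, html: string): HreflangFinding[] {
  const findings: HreflangFinding[] = []
  const links = [...html.matchAll(/<link\b[^>]*rel=["']alternate["'][^>]*>/gi)].map(m => m[0])
  const langs = links
    .map(tag => tag.match(/hreflang=["']([^"']+)["']/i)?.[1])
    .filter((l): l is string => Boolean(l))
  if (langs.length === 0) return findings // hreflang is optional; no finding
  // Duplicate language codes usually indicate a template bug
  const seen = new Set<string>()
  for (const lang of langs) {
    if (seen.has(lang)) findings.push({ url, message: `Duplicate hreflang value: ${lang}`, severity: 'warning' })
    seen.add(lang)
  }
  // A cluster without x-default often mishandles unmatched locales
  if (!langs.includes('x-default')) {
    findings.push({ url, message: 'hreflang set has no x-default', severity: 'info' })
  }
  return findings
}
```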

Step 5 — Structured data extraction and validation

Structured data in 2026 remains critical for entity signals and rich results. Extract JSON‑LD and validate at least the presence of @context and @type. For deeper validation, call a remote validation API (mind the rate limits) or implement SHACL/JSON‑Schema validation locally.

function extractJsonLd(html: string) {
  const re = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi
  return [...html.matchAll(re)].map(m => {
    try { return JSON.parse(m[1]) }
    catch { return null }
  }).filter(Boolean)
}

function auditStructuredData(url: string, html: string): AuditFinding[] {
  const jsonlds = extractJsonLd(html)
  const findings: AuditFinding[] = []
  if (jsonlds.length === 0) findings.push({ url, type: 'structured-data', message: 'No JSON-LD found', severity: 'info' })
  for (const j of jsonlds) {
    if (!j['@type']) findings.push({ url, type: 'structured-data', message: 'JSON-LD missing @type', severity: 'warning' })
    if (!j['@context']) findings.push({ url, type: 'structured-data', message: 'JSON-LD missing @context', severity: 'warning' })
  }
  return findings
}

Step 6 — Programmatic Lighthouse audits

Use Lighthouse to capture Core Web Vitals and SEO audits. In 2026 Lighthouse remains the standard for automated site quality checks; it's also resource-intensive. Run it only for a prioritized subset (high-traffic pages, templates, or sampled pages).

import lighthouse from 'lighthouse'
import * as chromeLauncher from 'chrome-launcher'

async function runLighthouse(url: string) {
  // Playwright's chromium.launch() does not expose a CDP port, so pair
  // Lighthouse with chrome-launcher, which does
  const chrome = await chromeLauncher.launch({ chromeFlags: ['--headless'] })
  try {
    const result = await lighthouse(url, { port: chrome.port, output: 'json' })
    return result?.lhr // Lighthouse result object
  } finally {
    await chrome.kill()
  }
}

Step 7 — Prioritization engine (impact × effort × confidence)

Raw findings are noise. A small scoring model separates blockers from low-value suggestions. Use three axes:

  • Impact (1–10): potential traffic or ranking uplift
  • Effort (1–10): estimated dev effort
  • Confidence (0–1): how sure the auditor is

Compute priority = impact × confidence / effort. Optionally weight Lighthouse metrics more for Core Web Vitals failures.

type PrioritizedFinding = AuditFinding & { impact: number; effort: number; confidence: number; priority: number }

function scoreFinding(f: AuditFinding): PrioritizedFinding {
  const impact = f.severity === 'critical' ? 9 : f.severity === 'warning' ? 6 : 2
  const effort = f.type === 'lighthouse' ? 5 : f.type === 'structured-data' ? 3 : 2
  const confidence = f.type === 'meta' ? 0.9 : 0.7
  const priority = (impact * confidence) / effort
  return { ...f, impact, effort, confidence, priority }
}

Step 8 — Reports and developer workflows

Offer multiple reporter formats so different teams can adopt the tool:

  • JSON for ingestion into analytics and dashboards
  • HTML for stakeholders (ranked list of fixes, charts for CWV)
  • CSV for Excel or backlog imports
  • GitHub Issues or GitLab tickets for automatic triage
import fs from 'fs'

function writeJsonReport(path: string, data: any) {
  fs.writeFileSync(path, JSON.stringify(data, null, 2))
}
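
The CSV reporter mentioned above is a few more lines; the only subtlety is escaping commas, quotes, and newlines in messages:

```typescript
import fs from 'fs'

// CSV reporter sketch; Row mirrors the finding shape used elsewhere
type Row = { url: string; type: string; message: string; severity: string }

function toCsv(rows: Row[]): string {
  // Quote any field containing a comma, quote, or newline; double inner quotes
  const escape = (v: string) => (/[",\n]/.test(v) ? `"${v.replace(/"/g, '""')}"` : v)
  const header = 'url,type,message,severity'
  const lines = rows.map(r => [r.url, r.type, r.message, r.severity].map(escape).join(','))
  return [header, ...lines].join('\n')
}

function writeCsvReport(path: string, rows: Row[]) {
  fs.writeFileSync(path, toCsv(rows))
}
```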

Auto-create GitHub issues (example pattern)

For high-priority, high-confidence issues, create issues with a standard template including location, failure details, and a suggested fix. Use repository tokens and respect rate limits.
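
One way to structure this: keep the payload builder pure (and testable) and isolate the API call. The Octokit call at the bottom is illustrative only — it assumes @octokit/rest and a GITHUB_TOKEN with repo scope; owner/repo names are placeholders:

```typescript
// Pure issue-payload builder for high-priority findings
type Prioritized = { url: string; message: string; severity: string; priority: number }

function buildIssuePayload(f: Prioritized) {
  return {
    title: `[seo-audit] ${f.message} (${f.url})`,
    body: [
      `**URL:** ${f.url}`,
      `**Severity:** ${f.severity}`,
      `**Priority score:** ${f.priority.toFixed(2)}`,
      '',
      'Suggested fix: see the attached audit report for details.',
    ].join('\n'),
    labels: ['seo-audit', f.severity],
  }
}

// Illustrative only (assumes @octokit/rest and a token in the environment):
// import { Octokit } from '@octokit/rest'
// const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN })
// await octokit.rest.issues.create({ owner: 'your-org', repo: 'your-repo', ...buildIssuePayload(finding) })
```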

Step 9 — CI, baselines, and regression monitoring

Integrate the CLI into nightly CI runs or scheduled GitHub Actions. Store baseline runs and only surface regressions to reduce noise. Example CI flow:

  1. Run audit on deployment branch
  2. Compare results to baseline stored in artifacts or a DB
  3. Fail build when regressions cross a threshold (e.g., CWV dropped by >10%)
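
The comparison in steps 2–3 can be a pure function over metric maps. The key format (`page:metric`) and the 10% default threshold are assumptions — pick whatever granularity matches your baseline storage:

```typescript
// Regression gate sketch: compare the current run against a stored baseline
type Metrics = Record<string, number> // e.g. { 'home:LCP': 2400, 'home:CLS': 0.05 }

function findRegressions(baseline: Metrics, current: Metrics, thresholdPct = 10): string[] {
  const regressions: string[] = []
  for (const [key, base] of Object.entries(baseline)) {
    const now = current[key]
    if (now === undefined || base === 0) continue // new or unmeasured page: skip
    const deltaPct = ((now - base) / base) * 100
    if (deltaPct > thresholdPct) {
      regressions.push(`${key}: ${base} -> ${now} (+${deltaPct.toFixed(1)}%)`)
    }
  }
  return regressions
}

// In CI: exit non-zero when anything regressed past the threshold
// if (findRegressions(baseline, current).length > 0) process.exit(1)
```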

Advanced ideas for 2026

Think beyond simple checks. Recent industry shifts mean teams can gain an edge by:

  • Entity-first auditing: correlate structured data across pages to detect inconsistent entity representations
  • LLM-assisted fix suggestions: use small, fast models to generate code snippets (JSON-LD templates, meta tag suggestions), always with human review
  • Template-aware crawling: map pages to templates and triage template-level issues automatically
  • Incremental audits: only audit changed routes in PRs
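
Incremental audits need a mapping from changed files to audited routes. The sketch below assumes a Next.js-style pages/ directory convention purely for illustration — adapt the mapping to your router:

```typescript
// Map files changed in a PR to the routes they affect.
// Assumption: Next.js-style pages/ convention (pages/blog/post.tsx -> /blog/post)
function changedRoutes(changedFiles: string[]): string[] {
  return changedFiles
    .filter(f => f.startsWith('pages/') && /\.(tsx?|jsx?|mdx?)$/.test(f))
    .map(f => '/' + f.replace(/^pages\//, '').replace(/\.(tsx?|jsx?|mdx?)$/, ''))
    .map(r => (r.endsWith('/index') ? r.slice(0, -'/index'.length) || '/' : r))
}
```

Feed the result into the crawler as the seed set for PR runs, and keep full-site crawls for the nightly schedule.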

Performance and cost control

Lighthouse runs and Playwright browsers are expensive. Strategies to reduce costs:

  • Run Lighthouse only for a small set of representative pages
  • Use cached HTML snapshots for unchanged pages
  • Batch Playwright instances and reuse contexts
  • Throttle runs during peak business hours

Testing, types, and developer ergonomics

Keep auditors small and typed. Unit test extractors (JSON-LD parser, meta audits) with fixtures. Use interfaces to define findings and reporters so new outputs are easy to add.

export interface AuditFinding {
  url: string
  type: string
  message: string
  severity: 'info' | 'warning' | 'critical'
}

// sample test using Jest
it('detects missing title', () => {
  const html = '<html><head></head><body></body></html>'
  const findings = auditMeta('https://example.com', html)
  expect(findings.some(f => f.message.includes('Missing'))).toBe(true)
})

Security, ethics, and scraping rules

Always respect robots.txt, rate limits, and site terms. Avoid sensitive endpoints. Provide an opt-out for site owners. Log actions for auditability in enterprise environments.

Starter project layout

src/
  cli.ts           # Commander entry
  crawl/
    crawler.ts
    robots.ts
    sitemap.ts
  audits/
    meta.ts
    structuredData.ts
    lighthouse.ts
  scoring/
    prioritize.ts
  reporters/
    json.ts
    html.ts
  bin/
    seo-audit (built binary)

Real-world example — prioritizing a structured-data fix

Scenario: product pages missing price in JSON‑LD. The CLI detects 2,000 pages with missing price and outputs a prioritized fix:

  • Impact: 9 (affects SERP rich result eligibility)
  • Effort: 3 (template takes 2–4 hours to update)
  • Confidence: 0.95 (JSON-LD scanning is precise)

Priority score = (9 × 0.95) / 3 ≈ 2.85 (high). The reporter groups by template and auto-creates a GitHub issue with a suggested JSON‑LD snippet and a code location hint—actionable and ready for an engineer to fix.

Actionable takeaways — immediate next steps

  1. Bootstrap a TypeScript CLI with Commander and a strict tsconfig
  2. Implement a polite crawler that honors robots.txt and sitemaps
  3. Write small auditors for meta tags and structured data first—these are cheap wins
  4. Run Lighthouse selectively—save cost by sampling representative pages
  5. Implement a simple priority formula and integrate with your issue tracker

Closing thoughts & 2026 perspective

Automated SEO auditing is no longer only for SEO specialists—engineering teams must own the quality signals that affect discoverability. In 2026, entity-based SEO and automated scoring are standard practice. A TypeScript CLI gives your team control, transparency, and the ability to tie technical fixes to business outcomes. Build it modularly, start small, and iterate toward richer signals and smarter prioritization.

Pro tip: Start by detecting the top 20 pages by traffic, run a focused Lighthouse and structured data audit on them, then expand incrementally. Quick wins build trust.

Try the starter repo

I’ve outlined the core components and patterns you need. If you want a ready-made starter repo, clone a lightweight template that wires Commander, Playwright, Lighthouse, and basic reporters—run it in CI, adapt the scoring to your product, and gradually add LLM-assisted fix suggestions where it reduces manual work (with human review).

Call to action

Ready to build a tailored SEO auditor? Clone the starter, run it against your staging site, and open a PR with the first batch of prioritized fixes. If you want, share the repo and I’ll review the prioritization rules and Lighthouse sampling strategy for your site—let’s make technical SEO part of the engineering lifecycle.

