Build an Automated TypeScript SEO Auditor CLI
Stop guessing—automate your technical SEO audits with a TypeScript CLI
If you maintain a medium-to-large web property, you know the pain: broken meta tags, flaky structured data, pages that render perfectly in dev but vanish for crawlers. Manual audits are slow, inconsistent, and hard to scale. In 2026 the stakes are higher—search engines rely more on structured entity signals and automated quality assessments. This guide shows how to design and build a TypeScript CLI that programmatically audits crawling health, meta tags, and structured data, runs Lighthouse checks, and outputs prioritized fixes developers can act on.
What this CLI will deliver (executive summary)
- Fast, configurable crawler that respects robots.txt and sitemaps
- Page-level audits for meta tags, canonicalization, and hreflang
- Structured data extraction and lightweight validation (JSON‑LD + microdata)
- Programmatic Lighthouse runs for Core Web Vitals and SEO metrics
- Prioritization engine that scores issues by impact, effort, and confidence
- Outputs: JSON report, human-friendly HTML, CSV, and optional GitHub issues or Slack notifications
Why build this yourself in 2026?
Recent trends (late 2025 → early 2026) pushed search engines to weigh structured, entity-based signals and automated quality metrics more heavily. Off-the-shelf SaaS tools are excellent, but they can be expensive and opaque. A TypeScript CLI gives engineering teams:
- Full control over crawl scope and heuristics
- Integration into CI and monorepos (fast feedback loops)
- Custom prioritization modeled on business KPIs
Architecture overview
Keep it modular. At a high level, the CLI is built from these components:
- Input: seed URLs, sitemap(s), or a list of routes
- Crawler: respects robots.txt, fetches HTML/JS-rendered pages
- Auditors: Meta tags, structured data, Lighthouse runner
- Scoring: compute priority for each issue
- Reporter: JSON/HTML/CSV/GitHub
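The scoring component above can be sketched as a small pure function. The names (`Issue`, `priorityScore`) and the exact formula are illustrative assumptions, not a fixed API — the idea is simply that impact and confidence raise priority while estimated effort lowers it, so cheap high-impact fixes float to the top.

```typescript
// Minimal sketch of the scoring component. Names and formula are
// illustrative assumptions — adapt the weights to your business KPIs.
type Issue = {
  impact: number      // 0..1, estimated traffic/ranking impact
  effort: number      // 1..5, rough engineering effort
  confidence: number  // 0..1, how sure the auditor is about the finding
}

// Cheap, high-impact, high-confidence issues score highest.
function priorityScore(issue: Issue): number {
  return (issue.impact * issue.confidence) / issue.effort
}
```

Sorting findings by this score descending gives the "prioritized fixes" list the reporter emits.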
Tech stack & rationale
- TypeScript (Node 20+)—safety, DX, and easy build output
- Playwright—for reliable JS rendering and fast page fetches
- Lighthouse—programmatic audits for CWV, SEO, and best practices
- Cheerio—for quick server-side DOM parsing of static HTML
- robots-parser & a sitemap parser—respect crawl rules and extract seed URLs
- Commander or oclif—for CLI ergonomics
- p-queue—for concurrency control and politeness
Step 1 — Bootstrap the CLI
Start with a minimal package.json and a tsconfig set up for Node ESM. Use esbuild or plain tsc to produce a single JS file. Example CLI skeleton with Commander:
import { Command } from 'commander'

const program = new Command()

program
  .name('seo-audit')
  .description('Automated SEO auditor CLI')
  .version('0.1.0')
  .option('-s, --seed <url>', 'seed URL or file with URLs')
  .option('-c, --concurrency <n>', 'concurrency', '4')
  .option('-o, --output <file>', 'output file', 'audit-report.json')

program.parse(process.argv)
const opts = program.opts()
console.log('Running with', opts)
TypeScript config tips
- Use strict mode and enable esModuleInterop
- Emit to a dist folder and keep types for CI checks
- Consider esbuild for fast bundling into a single executable
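A starting tsconfig reflecting these tips might look like this (adjust `target` and `include` to your layout):

```json
{
  "compilerOptions": {
    "target": "ES2022",
    "module": "NodeNext",
    "moduleResolution": "NodeNext",
    "strict": true,
    "esModuleInterop": true,
    "outDir": "dist",
    "declaration": true,
    "skipLibCheck": true
  },
  "include": ["src"]
}
```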
Step 2 — Implement a respectful crawler
Key rules: honor robots.txt, parse sitemaps, limit concurrency, and avoid hammering servers. Use a queue for URLs and store visited URLs in a Set or persistent DB for large sites.
Robots and sitemaps
Fetch robots.txt and use a parser to decide if a path is allowed. If sitemaps are referenced, expand the seed set from them.
import robotsParser from 'robots-parser'

async function isAllowed(baseUrl: string, path: string) {
  const robotsUrl = new URL('/robots.txt', baseUrl).toString()
  const res = await fetch(robotsUrl) // Node 20+ ships a global fetch
  if (!res.ok) return true // no robots.txt — be permissive by default
  const txt = await res.text()
  const parser = robotsParser(robotsUrl, txt)
  // robots-parser expects a full URL, not a bare path
  return parser.isAllowed(new URL(path, baseUrl).toString(), 'seo-audit-bot') ?? true
}
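Expanding the seed set from sitemaps can be done with a real sitemap parser, but as a dependency-free sketch, pulling the `<loc>` entries with a regex works for simple sitemaps. Note this does not handle sitemap index files, gzip, or namespaces — a proper parser does.

```typescript
// Lightweight sketch: pull <loc> URLs out of a sitemap XML string.
// Does NOT handle sitemap index files, gzipped sitemaps, or namespaces.
function extractSitemapUrls(xml: string): string[] {
  return [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/gi)].map(m => m[1])
}

async function expandSeeds(sitemapUrl: string): Promise<string[]> {
  const res = await fetch(sitemapUrl) // global fetch on Node 20+
  if (!res.ok) return []
  return extractSitemapUrls(await res.text())
}
```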
Polite crawling with p-queue
import PQueue from 'p-queue'

const queue = new PQueue({ concurrency: 4 })
const visited = new Set<string>() // module scope so visit() can see it

async function crawl(urls: string[]) {
  await Promise.all(urls.map(u => queue.add(() => visit(u))))
}

async function visit(url: string) {
  if (visited.has(url)) return
  visited.add(url)
  // decide whether to use Playwright or a simple fetch
}
Step 3 — Fetching and rendering strategy
Not every URL needs a full headless browser. Use a tiered approach:
- Quick fetch: fetch raw HTML for static pages using node-fetch
- Render fetch: use Playwright for client-rendered pages or when checking Lighthouse
import { chromium } from 'playwright'

async function renderPage(url: string) {
  // for large crawls, launch one browser and reuse pages instead of
  // launching per URL — launching is the expensive part
  const browser = await chromium.launch({ args: ['--no-sandbox'] })
  const page = await browser.newPage()
  await page.goto(url, { waitUntil: 'networkidle' })
  const html = await page.content()
  await browser.close()
  return html
}
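One way to pick the tier is a cheap heuristic on the raw HTML: if the page looks like an empty SPA shell, escalate to a Playwright render. The function name, the container-id markers, and the 200-character threshold below are all assumptions to tune for your stack.

```typescript
// Hypothetical heuristic for the tiered strategy: if the quick-fetched
// HTML has almost no text and a bare SPA root container, render it.
// Markers and threshold are assumptions — tune them for your stack.
function needsRendering(rawHtml: string): boolean {
  const bodyMatch = rawHtml.match(/<body[^>]*>([\s\S]*)<\/body>/i)
  const body = bodyMatch ? bodyMatch[1] : ''
  const textLength = body.replace(/<[^>]+>/g, '').trim().length
  const looksLikeSpaShell = /<div[^>]+id=["'](root|app|__next)["']/i.test(body)
  return textLength < 200 && looksLikeSpaShell
}
```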
Step 4 — Auditors (meta tags, canonical, hreflang)
Write small, composable auditors that accept HTML and return typed findings. This keeps the code testable and friendly to CI.
import * as cheerio from 'cheerio' // cheerio 1.x has no default export

type AuditFinding = {
  url: string
  type: 'meta' | 'canonical' | 'hreflang' | 'structured-data' | 'lighthouse'
  message: string
  severity: 'info' | 'warning' | 'critical'
}

function auditMeta(url: string, html: string): AuditFinding[] {
  const $ = cheerio.load(html)
  const findings: AuditFinding[] = []
  const title = $('head > title').text().trim()
  if (!title || title.length <= 10) findings.push({ url, type: 'meta', message: 'Missing or very short title', severity: 'warning' })
  const desc = $('meta[name="description"]').attr('content')
  if (!desc) findings.push({ url, type: 'meta', message: 'Missing meta description', severity: 'warning' })
  const canonical = $('link[rel="canonical"]').attr('href')
  if (!canonical) findings.push({ url, type: 'canonical', message: 'Missing canonical tag', severity: 'info' })
  return findings
}
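An hreflang auditor in the same spirit could look like the sketch below. It uses a regex instead of cheerio so it stands alone, which means it assumes the rel/hreflang/href attribute order; the two checks (an `x-default` entry, and the set referencing the page itself) mirror common hreflang pitfalls, but the exact rules you enforce are up to you.

```typescript
// Sketch of an hreflang auditor. The regex assumes attribute order
// rel → hreflang → href; a cheerio-based version is more robust.
type Finding = { url: string; type: string; message: string; severity: string }

function auditHreflang(url: string, html: string): Finding[] {
  const findings: Finding[] = []
  const links = [...html.matchAll(/<link[^>]+rel=["']alternate["'][^>]*hreflang=["']([^"']+)["'][^>]*href=["']([^"']+)["'][^>]*>/gi)]
  if (links.length === 0) return findings // hreflang is optional on single-locale sites
  const langs = links.map(m => m[1])
  if (!langs.includes('x-default')) {
    findings.push({ url, type: 'hreflang', message: 'hreflang set has no x-default entry', severity: 'info' })
  }
  // each page in an hreflang cluster should reference itself
  if (!links.some(m => m[2] === url)) {
    findings.push({ url, type: 'hreflang', message: 'hreflang set does not reference the page itself', severity: 'warning' })
  }
  return findings
}
```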
Step 5 — Structured data extraction and validation
Structured data in 2026 remains critical for entity signals and rich results. Extract JSON‑LD and do a basic validation that @context and @type are present. For deeper validation you can call a remote API (be mindful of rate limits) or implement SHACL/JSON‑Schema validation.
function extractJsonLd(html: string) {
  // match <script type="application/ld+json"> blocks; a DOM parse via cheerio is more robust
  const matches = [...html.matchAll(/<script[^>]+type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi)]
  return matches.map(m => { try { return JSON.parse(m[1]) } catch { return null } }).filter(Boolean)
}