Build a model-agnostic TypeScript code-review agent inspired by Kodus
Learn how to build a self-hosted, model-agnostic TypeScript code review agent with adapters, secure keys, rules, and CLI pre-commit checks.
If you want a practical code review agent that is lightweight, self-hosted, and flexible enough to swap between OpenAI, Anthropic, and any OpenAI-compatible endpoint, TypeScript is an excellent foundation. The real trick is not just calling an LLM API; it is designing the system so the provider is interchangeable, the review rules are understandable by humans, the secrets are handled safely, and the developer experience is fast enough to run locally before a commit. That is exactly where a model-agnostic agent architecture shines, because it turns the reviewer into a composable service rather than a vendor-specific feature.
This guide takes inspiration from the open-source approach popularized by tools like Kodus, but it focuses on a hands-on implementation you can actually build in TypeScript/Node. You will see how to structure adapters for provider swapping, define review behavior in natural language, protect API keys, and add a local CLI pre-commit check that catches obvious issues before they reach CI. For teams that care about cost control and portability, this is the difference between a brittle demo and a sustainable internal developer tool, much like the tradeoffs covered in our guide to navigating AI supply-chain risks and building trust-first systems such as a trust-first deployment checklist for regulated industries.
By the end, you will have a blueprint for a self-hosted review bot that can read diffs, apply rules like a senior reviewer, and produce structured comments with a consistent schema. You will also understand the operational choices that matter most: when to use streaming, how to budget tokens, where to cache prompts, and why the adapter pattern is the simplest way to avoid hard-coding a model into your product. If your team has ever wanted AI code review without locking into a single provider, this is the guide to bookmark.
1) What a model-agnostic code review agent actually does
It reads the diff, not the whole universe
A practical review agent should focus on the changed lines, relevant surrounding context, and a compact set of repository signals. That keeps token usage reasonable and makes the system faster in local and CI workflows. The agent should accept a pull request diff, optional file context, and project-specific rules, then return actionable review comments with severity, rationale, and suggested fixes. This is very different from sending an entire codebase to an LLM and hoping for the best.
In a well-designed system, the agent is a pipeline, not a monolith. One stage gathers the diff, another enriches it with metadata, a third builds the prompt, and the final stage sends the request through a provider adapter. That separation makes testing much easier, because you can verify prompt generation without hitting a model and verify model output parsing without a Git provider. It also makes it easier to extend the tool later with additional checks like security heuristics or style rules.
Model-agnostic means provider-swappable by design
Model-agnostic does not mean model-indifferent. It means your app speaks to a small internal interface while providers remain interchangeable behind the scenes. Today that interface may point to OpenAI, Anthropic, or a self-hosted model gateway; tomorrow it may point to a cheaper or more capable endpoint without forcing a rewrite. This is one of the main lessons from modern agent systems and from the architecture patterns described in orchestrating specialized AI agents.
That flexibility matters because code review workloads vary. A startup may prefer a low-cost model for routine checks, while a security-sensitive team may route suspicious diffs to a stronger reasoning model. A model-agnostic design lets you choose by policy, not by refactoring pressure. It also prevents a single vendor’s API changes from becoming a product outage.
Self-hosted review bots are attractive for cost and privacy
Self-hosted systems appeal to teams that want more control over where code is sent, how long prompts are retained, and what gets logged. This is especially important for regulated industries or organizations with strict source-control policies. There is also a financial angle: when you pay provider pricing directly and avoid markup, the economics of reviewing every PR become much more predictable. For broader context on deployment discipline, our trust-first deployment checklist is a useful companion piece.
In practice, self-hosting does not need to mean heavy infrastructure. A lightweight Node service plus a CLI wrapper can support local checks, Git hooks, and CI jobs. You only need to persist the minimal state required for retries, configuration, and audit logs. That is the sweet spot for teams that want practical automation without building a full internal platform.
2) A reference architecture in TypeScript
Keep the core small and the edges swappable
The cleanest architecture is a core review engine wrapped by thin adapters. Your core knows how to parse diffs, assemble prompts, normalize outputs, and rank findings. Your adapters know how to talk to specific providers such as OpenAI or Anthropic. This aligns with a classic boundary-driven design: your business logic should not depend on network-specific SDK shapes, which makes the code easier to test and future-proof.
A simple project layout might look like this:
```
src/
  core/
    reviewEngine.ts
    promptBuilder.ts
    types.ts
  adapters/
    openaiAdapter.ts
    anthropicAdapter.ts
    openAiCompatibleAdapter.ts
  cli/
    precommit.ts
  config/
    rules.ts
    secrets.ts
  utils/
    diff.ts
    redaction.ts
```
That structure keeps your model-specific code isolated. It also makes it straightforward to add another adapter later, such as a local inference server or a different hosted provider. If your team already works with modular architectures, the approach will feel familiar, similar to how modern systems separate orchestration from execution in patterns discussed in SLO-aware automation.
Use strong types for requests and findings
TypeScript gives you an immediate advantage: you can define strict types for review requests, provider responses, and normalized findings. This reduces runtime ambiguity and makes your CLI output consistent. A reviewer should not have to guess whether a finding is a warning or an error, or whether it includes a file path and line number. Well-chosen types are especially useful when multiple adapters produce slightly different response formats.
For example, a normalized finding might include severity, title, message, file, lineStart, lineEnd, and confidence. Once you normalize output into this shape, every downstream consumer becomes simpler: CLI rendering, GitHub comments, JSON export, and even dashboards. This is one of those design choices that feels tedious at first but pays off every time you add a new provider.
Support async streaming only where it helps
Streaming output can improve the perceived speed of a review, but it is not mandatory for a local pre-commit workflow. In CI, a completed response is often easier to aggregate and log. In interactive tools, streaming can be useful to show partial reasoning or progressive summaries, but be careful not to leak chain-of-thought-style content into logs. The review agent should emit concise, user-facing progress messages and keep internal reasoning private.
As a rule, optimize for developer trust rather than novelty. You want the bot to feel fast, predictable, and helpful, not clever for its own sake. The best agents are boring in the right ways: same inputs, same structure, same output schema. That reliability is what earns adoption.
3) Implementing the adapter pattern for OpenAI, Anthropic, and beyond
Define a provider-agnostic interface first
The interface is the heart of model agnosticism. Start with a contract like review(prompt: ReviewPrompt): Promise<ReviewResult>. Then create a provider adapter for each vendor. Each adapter should translate your internal request shape into the vendor’s SDK format and translate the returned text or JSON back into your normalized result. When the interface is stable, the provider becomes an implementation detail instead of a core dependency.
Here is a simplified example:
```typescript
export interface LlmAdapter {
  review(input: {
    systemPrompt: string;
    userPrompt: string;
    temperature?: number;
    maxTokens?: number;
  }): Promise<string>;
}
```
Once you have this, your review engine can do provider selection by config, environment variable, or repository policy. That keeps your application logic free from provider branching, which is crucial when you later add fallback routing or cost-aware model selection.
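As a concrete illustration, here is a minimal factory sketch that selects an adapter from configuration. The LLM_PROVIDER variable name, the adapter class names, and the import paths are assumptions based on the project layout above, not a fixed convention.

```typescript
import type { LlmAdapter } from '../core/types';
import { OpenAiAdapter } from '../adapters/openaiAdapter';
import { AnthropicAdapter } from '../adapters/anthropicAdapter';

// Hypothetical factory: choose the provider from an environment variable
// so the review engine never needs provider-specific branching.
export function createAdapter(): LlmAdapter {
  const provider = process.env.LLM_PROVIDER ?? 'openai';
  switch (provider) {
    case 'anthropic':
      return new AnthropicAdapter(requireEnv('ANTHROPIC_API_KEY'));
    case 'openai':
      return new OpenAiAdapter(requireEnv('OPENAI_API_KEY'));
    default:
      throw new Error(`Unknown LLM provider: ${provider}`);
  }
}

function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) throw new Error(`Missing required environment variable: ${name}`);
  return value;
}
```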
OpenAI and Anthropic adapters should only differ at the edge
In the OpenAI adapter, you map your internal request to the OpenAI SDK and normalize the text response. In the Anthropic adapter, you do the same with Claude’s message format. The important part is that the core engine never knows which SDK is being used. You should not have provider-specific conditionals scattered through your business logic, because that is how codebases become impossible to maintain.
If you support an OpenAI-compatible endpoint, the adapter gets even more valuable. Many hosted gateways and self-hosted inference servers expose an OpenAI-style API, so a single adapter can support a wide variety of models. That means your self-hosted review bot can grow from a single-provider demo into a portable internal tool without changing its interface, a practical advantage echoed in the discussion of AI supply-chain risk and vendor dependency management.
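To make the edge concrete, here is a hedged sketch of an OpenAI adapter built on the official openai Node SDK; an Anthropic adapter would follow the same shape using the @anthropic-ai/sdk messages API. The default model name is only an example, and the LlmAdapter import path assumes the layout shown earlier.

```typescript
import OpenAI from 'openai';
import type { LlmAdapter } from '../core/types';

// Minimal OpenAI adapter sketch. The model default and temperature are
// assumptions; swap them for whatever your routing policy selects.
export class OpenAiAdapter implements LlmAdapter {
  private client: OpenAI;

  constructor(apiKey: string, private model = 'gpt-4o-mini') {
    this.client = new OpenAI({ apiKey });
  }

  async review(input: {
    systemPrompt: string;
    userPrompt: string;
    temperature?: number;
    maxTokens?: number;
  }): Promise<string> {
    const response = await this.client.chat.completions.create({
      model: this.model,
      temperature: input.temperature ?? 0,
      max_tokens: input.maxTokens,
      messages: [
        { role: 'system', content: input.systemPrompt },
        { role: 'user', content: input.userPrompt },
      ],
    });
    // Normalize to plain text; the core engine parses it into findings.
    return response.choices[0]?.message?.content ?? '';
  }
}
```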
Build adapter tests around contract behavior
Test the adapter interface like a contract. For each provider, verify that a sample prompt yields a valid normalized response, that retry logic behaves correctly, and that malformed outputs are rejected. These tests should not overfit to a specific model’s wording, because providers vary in phrasing. Instead, assert on structure: does the output contain valid JSON, sensible severity values, and line references that match the diff?
This approach makes switching providers less risky. You can run the same test suite against a mock adapter, a staging API key, and a live provider account. When you treat provider selection as a replaceable concern, you get the operational freedom that many AI products promise but few actually deliver. That freedom is the practical meaning of model agnosticism.
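A minimal contract test might look like the following, using Node's built-in test runner and a mock adapter. It assumes the ReviewEngine and Finding types sketched later in the blueprint, and that normalizeFindings parses a JSON array; the assertions target structure, never wording.

```typescript
import { test } from 'node:test';
import assert from 'node:assert/strict';
import type { LlmAdapter } from '../core/types';
import { ReviewEngine } from '../core/reviewEngine';

// Mock adapter returning canned JSON, so the test exercises the contract
// without hitting a live provider.
const mockAdapter: LlmAdapter = {
  async review() {
    return JSON.stringify([
      {
        title: 'Swallowed error',
        severity: 'warning',
        message: 'catch block ignores the error without logging',
        confidence: 0.8,
      },
    ]);
  },
};

test('engine returns normalized findings with valid severities', async () => {
  const engine = new ReviewEngine(mockAdapter);
  const findings = await engine.review({ diff: 'example diff', rules: [], repoName: 'demo' });

  for (const finding of findings) {
    assert.ok(['critical', 'warning', 'info'].includes(finding.severity));
    assert.ok(finding.confidence >= 0 && finding.confidence <= 1);
  }
});
```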
4) Natural-language rule definitions that developers will actually use
Prefer human-readable policies over brittle config files
Review rules are only useful if the team can maintain them. A common mistake is burying behavior in JSON or YAML knobs that no one wants to edit. Instead, allow natural-language rules like: “Flag any async function that swallows errors without logging” or “Warn when public functions return mutable collections.” These rules can be stored in markdown, repository comments, or a rules.md file that the prompt builder reads at runtime.
Natural-language rules work well because they map cleanly to how senior engineers think during review. They are also easier for non-specialists to edit, which means product teams or platform engineers can adapt the bot without touching TypeScript. Just make sure the rules are specific, testable, and bounded. Vague instructions like “write better code” produce vague output.
Use a prompt template with explicit priorities
Your prompt should clearly separate universal standards from repo-specific rules. For example, you might tell the model to prioritize correctness, security, and maintainability in that order, and then apply local conventions below that. This helps the model resolve conflicts when a style preference clashes with a bug risk. Strong prompt structure is especially important when you want your code review agent to behave consistently across different providers.
A good template also tells the model what not to do. Ask it to avoid repeating the diff, avoid broad architectural advice unless the evidence supports it, and return only findings that are grounded in the changed lines. This keeps output useful and concise. It also reduces the risk of hallucinated problems that waste developer time.
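Here is one possible shape for that template, expressed as a prompt builder. The exact wording is illustrative; the important parts are the ordering of priorities, the "what not to do" constraints, and the explicit output schema.

```typescript
import type { ReviewInput } from './types';

// Sketch of a prompt builder that layers universal priorities above
// repo-specific rules and pins down the expected output format.
export function buildPrompt(input: ReviewInput) {
  const systemPrompt = [
    'You are a senior code reviewer.',
    'Prioritize, in order: correctness, security, maintainability, then style.',
    'Only report findings grounded in the changed lines of the diff.',
    'Do not repeat the diff or give broad architectural advice without evidence.',
    'Return a JSON array of findings with: title, severity (critical|warning|info), message, file, lineStart, lineEnd, confidence.',
  ].join('\n');

  const userPrompt = [
    `Repository: ${input.repoName}`,
    'Repository-specific rules:',
    ...input.rules.map((rule, i) => `${i + 1}. ${rule}`),
    '',
    'Diff to review:',
    input.diff,
  ].join('\n');

  return { systemPrompt, userPrompt };
}
```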
Keep a rule library for team patterns
Over time, you can build a small rule library from actual team pain points: unsafe null handling, missing tests, accidental mutation, insecure string interpolation, and confusing abstractions. This is where the bot becomes more than a generic AI assistant. It starts to reflect your team’s engineering culture, much like how specialized tooling adapts to domain workflows in guides such as building a secure AI incident-triage assistant.
One useful pattern is to group rules by severity and scope. For example, critical rules may cover security and data loss, warning rules may cover maintainability, and info rules may cover style or micro-optimizations. That hierarchy makes the review bot’s behavior predictable and easier to tune as adoption grows.
5) Secure key handling and self-hosted safety
Never hard-code provider credentials
API keys should come from environment variables, secret managers, or injected CI credentials, never from source code. Your CLI can read process.env, but your deployment should prefer a managed secret store wherever possible. Even for local development, use a .env file that is ignored by Git and validated on startup. If the key is missing, fail early with a clear message rather than making a confusing API call later.
It is also important to minimize how long secrets stay in memory and never print them in error messages. Redact headers, request bodies, and debug logs before they reach your terminal or log sink. This is standard hygiene, but it becomes even more important when your tool handles proprietary source code and external model requests. For adjacent thinking on secure workflows, see secure AI incident triage and trust-first deployments.
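A small startup check covers both points: fail early when a key is missing, and never print more of a secret than you need to identify it. The LLM_API_KEY and LLM_PROVIDER names below are assumptions, and dotenv is just one common way to load a local .env file.

```typescript
import 'dotenv/config'; // one common choice for loading a git-ignored .env in development

export interface BotConfig {
  provider: 'openai' | 'anthropic' | 'openai-compatible';
  apiKey: string;
}

// Validate configuration once at startup and fail with a clear message,
// instead of letting a missing key surface as a confusing API error later.
export function loadConfig(): BotConfig {
  const provider = (process.env.LLM_PROVIDER ?? 'openai') as BotConfig['provider'];
  const apiKey = process.env.LLM_API_KEY;

  if (!apiKey) {
    throw new Error('LLM_API_KEY is not set. Add it to your environment or secret manager.');
  }
  return { provider, apiKey };
}

// Never log the full key; keep only enough to identify which key was used.
export function redactSecret(value: string): string {
  return value.length <= 8 ? '***' : `${value.slice(0, 4)}…***`;
}
```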
Design for least privilege and provider isolation
Use separate API keys for development, staging, and production. If a provider supports project-level keys, do not reuse a global root credential. This reduces blast radius and makes usage attribution easier. It also allows you to shut off a single environment without affecting every workflow across the company.
For self-hosted deployments, isolate the review service from the rest of your infrastructure with narrow network access. The bot should only reach the model endpoint, Git provider APIs, and whatever storage you explicitly need. That constraint lowers risk and simplifies auditability, which matters when the system can see sensitive diffs before they merge.
Redact sensitive content before sending prompts
A good code-review agent should avoid sending secrets, tokens, and obvious credentials to the model. Build a redaction step that scans for key-like strings, private URLs, and common secret patterns. If a diff contains sensitive material, either mask it or skip that section entirely. This matters because even if a provider does not train on your data, you still should not send secrets unless there is a compelling, approved reason.
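A minimal redaction pass might look like the sketch below. The patterns are deliberately illustrative; treat them as a starting point, not a substitute for a dedicated secret scanner.

```typescript
// Redact obvious secret-like strings from a diff before it reaches any model.
const SECRET_PATTERNS: RegExp[] = [
  /AKIA[0-9A-Z]{16}/g,                                           // AWS access key id shape
  /-----BEGIN [A-Z ]*PRIVATE KEY-----/g,                         // PEM private key headers
  /\b(api[_-]?key|secret|token|password)\s*[:=]\s*['"][^'"\n]+['"]/gi, // key = "value" assignments
];

export function redactDiff(diff: string): { redacted: string; hits: number } {
  let hits = 0;
  let redacted = diff;
  for (const pattern of SECRET_PATTERNS) {
    redacted = redacted.replace(pattern, () => {
      hits += 1;
      return '[REDACTED]';
    });
  }
  return { redacted, hits };
}
```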
Consider a preflight policy that blocks reviews on files known to contain secrets, such as .env or credential manifests, and routes them to a separate security check. This separation of duties helps you keep the review agent focused on code quality while dedicated tools handle secret scanning. It is a practical way to combine AI assistance with deterministic safeguards.
6) The local CLI pre-commit workflow
Make the bot available before the push
Developer adoption improves dramatically when the review agent runs locally as a CLI. A pre-commit check can catch obvious issues before code ever reaches CI, saving time and reducing reviewer noise. The CLI should accept file paths, staged diff output, or a target branch, then print findings in a readable format. This is the fastest way to turn AI review from a novelty into a habit.
For example, a command like npx review-bot precommit can inspect staged changes, send them through the engine, and fail only on high-confidence critical issues. Warnings can be shown without blocking the commit. That balance preserves developer flow while still providing value. It is also a good place to surface configuration mistakes, such as missing keys or invalid rules.
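A sketch of that entry point, assuming the engine from the blueprint later in this guide, can read the staged diff directly from Git. The 0.8 confidence threshold and the exit-code policy are assumptions you would tune per team.

```typescript
import { execFileSync } from 'node:child_process';
import { ReviewEngine } from '../core/reviewEngine';

// Pre-commit entry point: review the staged diff, print all findings,
// but block the commit only on high-confidence critical issues.
export async function precommit(engine: ReviewEngine): Promise<number> {
  const diff = execFileSync('git', ['diff', '--cached', '--unified=3'], {
    encoding: 'utf8',
  });
  if (!diff.trim()) return 0; // nothing staged, nothing to review

  const findings = await engine.review({ diff, rules: [], repoName: 'local' });
  const blocking = findings.filter(
    (f) => f.severity === 'critical' && f.confidence >= 0.8,
  );

  for (const f of findings) {
    console.log(`[${f.severity}] ${f.title} (${f.file ?? 'unknown'}:${f.lineStart ?? '?'})`);
  }
  return blocking.length > 0 ? 1 : 0; // non-zero exit code fails the hook
}
```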
Support JSON output for automation
CLI tools should be pleasant for humans and machines. Human-readable output helps during interactive use, but JSON is essential for CI, dashboards, and GitHub checks. Provide a --json flag that emits the same normalized findings your core engine uses internally. This allows other tools to consume the output without parsing terminal decorations.
When you have consistent JSON, you can build more around it: trend analysis, flaky-rule detection, or summary stats about how often the agent finds issues. That data helps justify the tool’s value to the team. It also lets you refine prompts and rules based on what the bot actually catches in the wild.
Use Git hooks carefully
Git hooks are powerful, but they should not create a fragile developer experience. Keep the hook lightweight, make it opt-in, and document how to bypass it for emergencies. The hook should invoke the same CLI path that CI uses, so the behavior remains consistent across environments. That consistency is what makes pre-commit checks trustworthy.
One useful pattern is to let the hook run fast local heuristics first, then call the LLM only if the diff crosses a threshold of complexity or risk. That way simple formatting changes do not pay the full model cost. Teams that care about efficiency can use this pattern to keep the tool responsive while still benefiting from richer review on important changes.
7) Prompting, context selection, and output normalization
Send just enough context to be useful
Prompt quality often depends more on context selection than on model choice. Include the diff, nearby lines, file path, repository rules, and a brief project summary if available. Avoid dumping unrelated files into the prompt, because excess context increases cost and can confuse the model. Smart context selection is one of the easiest ways to improve output quality without changing providers.
You can build a simple context strategy: include exact changed lines, plus a few surrounding lines on both sides, plus function-level context when needed. For TypeScript projects, include type definitions when a diff changes interfaces or generics. That helps the model reason about compile-time implications instead of only syntax-level changes. In more advanced setups, you could enrich the prompt with repository metadata or dependency relationships, similar to how data pipelines transform raw inputs into actionable insight.
Ask for structured output, not prose essays
Your review agent should return machine-readable findings, even if the user-facing presentation is nice. Ask the model for JSON with fields like title, severity, explanation, evidence, and suggestedFix. This is crucial for determinism, deduplication, and later rendering. If the model cannot provide valid JSON, your adapter should retry with a stricter instruction or fall back gracefully.
Structured output also makes it easier to aggregate results from multiple providers. For example, you could route a diff through a fast model first, then send only higher-risk findings to a more capable model for verification. That layered approach keeps costs manageable while preserving confidence. It is a practical compromise between speed and quality.
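A tolerant parser is one way to implement that fallback: accept a fenced or unfenced JSON array, and report a parse failure so the caller can retry with a stricter instruction. The function name and return shape are illustrative.

```typescript
import type { Finding } from './types';

// Parse the model's reply into findings; tolerate code fences and
// signal failure instead of throwing, so the engine can retry.
export function parseFindings(raw: string): { findings: Finding[]; parseFailed: boolean } {
  // Strip an optional ```json fence that some models wrap around output.
  const cleaned = raw
    .replace(/^\s*`{3}(?:json)?\s*/i, '')
    .replace(/`{3}\s*$/, '')
    .trim();
  try {
    const parsed = JSON.parse(cleaned);
    if (!Array.isArray(parsed)) return { findings: [], parseFailed: true };
    return { findings: parsed as Finding[], parseFailed: false };
  } catch {
    return { findings: [], parseFailed: true };
  }
}
```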
Validate and score findings before surfacing them
Do not trust the model blindly. Validate that file paths exist, line numbers fall within the diff, and severity values are permitted. Then score findings by confidence and relevance so you do not overwhelm the developer with low-value noise. The best review agents act like filter systems: they amplify the important things and suppress the rest.
This is where the core engine earns its keep. If the agent merely forwards raw model output, your users will eventually lose trust. If it normalizes, validates, and prioritizes findings well, it becomes something engineers rely on daily. That reliability is more important than flashy prompt engineering tricks.
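A simplified validation pass might look like this. For brevity it checks line numbers against a single set of changed lines; a real implementation would key that set by file path, and the 0.5 confidence cutoff is an assumption to tune.

```typescript
import type { Finding } from './types';

// Keep only findings with a permitted severity that reference lines
// actually present in the diff, then rank by confidence.
export function filterFindings(findings: Finding[], changedLines: Set<number>): Finding[] {
  const allowed = new Set(['critical', 'warning', 'info']);
  return findings
    .filter((f) => allowed.has(f.severity))
    .filter((f) => f.lineStart === undefined || changedLines.has(f.lineStart))
    .filter((f) => f.confidence >= 0.5) // threshold is an assumption; tune per team
    .sort((a, b) => b.confidence - a.confidence);
}
```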
8) Cost, performance, and operational tradeoffs
Measure token usage per review
Cost control starts with measurement. Track prompt tokens, completion tokens, retry counts, and average response time per repository or branch type. This lets you estimate monthly spend and identify expensive patterns. In many teams, a small number of large diffs account for a disproportionate share of LLM usage.
You can reduce cost by trimming context, caching repository summaries, and only analyzing files that matter. A review agent should not be a black box of unpredictable spend. It should produce observable, explainable costs the same way any other infrastructure service does. That discipline is part of the broader trend toward transparent AI operations discussed in AI supply-chain risk management.
Choose the right model for the task
Not every diff needs the most powerful model. Routine style checks or obvious anti-pattern detection can run on cheaper models, while deeper logic reviews can use a stronger one. A model-agnostic design makes this selection policy-based rather than code-based. You can route by file type, diff size, team preference, or risk level.
For example, a docs-only change might use a compact model, while a payment or auth module change could use a higher-capability model. That kind of policy keeps your review agent economically viable at scale. It also makes the system feel smarter, because expensive reasoning is reserved for the cases that deserve it.
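A routing policy can be as small as a pure function over diff metadata. The path patterns, size threshold, and model names below are placeholders, not recommendations.

```typescript
// Illustrative routing policy: cheap model by default, stronger model
// for risky paths or large diffs.
interface RoutingInput {
  files: string[];
  diffLines: number;
}

export function selectModel(input: RoutingInput): string {
  const riskyPath = input.files.some((f) => /auth|payment|crypto|secrets/i.test(f));
  if (riskyPath || input.diffLines > 400) {
    return 'strong-reasoning-model'; // placeholder for your higher-capability model
  }
  return 'fast-cheap-model'; // placeholder for the routine-check model
}
```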
Cache what does not need to be recomputed
Repository summaries, rule embeddings, and branch metadata can often be cached between runs. If a reviewer is re-evaluating the same pull request after a minor fix, there is no reason to regenerate all the static context. Caching lowers latency and cuts costs, especially in busy repos with frequent rebases.
Just be careful with cache invalidation. Cache only non-sensitive, versioned inputs and key them by commit hash or diff signature. This is one of those operational details that turns a good prototype into a reliable developer tool. It is the same kind of practical robustness teams need in any automation stack, whether they are dealing with orchestration, infrastructure, or AI assistants.
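One simple keying scheme hashes the inputs that actually determine the result. The in-memory map is only a sketch; in CI you would likely back it with Redis or on-disk storage.

```typescript
import { createHash } from 'node:crypto';

// Key cached context by a content hash of the diff plus rules, so a
// rebased-but-identical change reuses the previous result.
export function cacheKey(repoName: string, diff: string, rules: string[]): string {
  return createHash('sha256')
    .update(repoName)
    .update(diff)
    .update(rules.join('\n'))
    .digest('hex');
}

// In-memory cache sketch; keep only non-sensitive, versioned inputs here.
const cache = new Map<string, unknown>();

export function getOrCompute<T>(key: string, compute: () => T): T {
  if (cache.has(key)) return cache.get(key) as T;
  const value = compute();
  cache.set(key, value);
  return value;
}
```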
9) Example implementation blueprint in TypeScript
Core types and engine flow
Below is a compact sketch of the core flow. The engine receives a diff and rules, picks an adapter, builds the prompt, executes the model call, and normalizes the output. This separation keeps the review path testable and easy to extend. It also creates clear seams for adding metrics and logging.
```typescript
type Severity = 'critical' | 'warning' | 'info';

type Finding = {
  title: string;
  severity: Severity;
  message: string;
  file?: string;
  lineStart?: number;
  lineEnd?: number;
  confidence: number;
};

type ReviewInput = {
  diff: string;
  rules: string[];
  repoName: string;
};

class ReviewEngine {
  constructor(private adapter: LlmAdapter) {}

  async review(input: ReviewInput): Promise<Finding[]> {
    // Build the prompt, send it through whichever adapter was injected,
    // then normalize the raw model output into typed findings.
    const prompt = buildPrompt(input);
    const raw = await this.adapter.review(prompt);
    return normalizeFindings(raw);
  }
}
```
The design looks simple because it should be simple. Most of the complexity belongs in the prompt builder, redaction pipeline, and parser, not in the engine signature. That discipline keeps the code review agent understandable to future maintainers.
CLI entry point with staged diff support
Your CLI can read the staged Git diff, pass it into the engine, and render a concise summary. On high-severity findings, return a non-zero exit code. On informational findings, print a report but allow the commit to proceed. This mirrors how classic linters work while adding semantic review value.
Because the CLI is a first-class interface, it should be documented as carefully as the API. Include examples for staged files, a branch comparison, and JSON output. The simpler you make the local workflow, the more likely engineers are to use it before the CI gate. That is where the biggest win often lives.
Repository configuration and onboarding
Good onboarding matters. A repository-level config file should define provider preference, rule packs, severity thresholds, and redaction settings. New teams should be able to add the bot in minutes, not days. The faster the setup, the easier it is to prove value.
In larger organizations, you can offer sane defaults and allow overrides per repo. That creates a balance between standardization and local autonomy. It also makes the tool more likely to spread organically across teams, which is often how successful internal developer tools scale.
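As a sketch, the repo-level config could be a small JSON file validated against a type like the one below. The file name .reviewbot.json and the field names are assumptions; the point is that provider, rules, thresholds, and redaction all live in one place.

```typescript
// Shape of a hypothetical repository-level config file
// (for example, .reviewbot.json at the repo root).
export interface RepoConfig {
  provider: 'openai' | 'anthropic' | 'openai-compatible';
  model?: string;                  // optional override of the policy default
  rulePacks: string[];             // paths to markdown rule files, e.g. "rules/security.md"
  blockOn: ('critical' | 'warning')[];
  redaction: {
    enabled: boolean;
    skipFiles: string[];           // e.g. [".env", "**/credentials*.yaml"]
  };
}
```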
10) A comparison table for architecture choices
Choosing a review bot architecture is really a series of tradeoffs. The table below compares the most common options so you can decide what fits your team’s needs. Notice how model agnosticism, self-hosting, and local CLI support usually improve control, while fully managed systems can reduce setup effort. The right choice depends on whether your priority is speed of adoption, cost control, or security.
| Approach | Model Agnostic | Self-Hosted | Local CLI | Best For |
|---|---|---|---|---|
| Vendor-specific SaaS reviewer | No | No | Sometimes | Fast setup, minimal ops |
| OpenAI-only internal bot | No | Yes | Yes | Teams standardized on one provider |
| Anthropic-only internal bot | No | Yes | Yes | Teams optimizing for Claude workflows |
| OpenAI-compatible adapter layer | Yes | Yes | Yes | Provider swapping, cost control |
| Full self-hosted review service with CLI | Yes | Yes | Yes | Security-sensitive, scalable teams |
The table makes one thing obvious: the adapter layer is the deciding factor. Once you add it, you gain portability across providers and deployment styles. That is why the adapter pattern should be treated as a foundation, not an optimization.
Pro tip: Start with one provider and one review mode, but design the interface as if you will support three providers and two execution modes later. That mindset prevents painful rewrites once usage grows.
11) Implementation checklist and rollout strategy
Start with a narrow pilot repo
Choose a repository with active development, moderate diff sizes, and a team that is willing to give feedback. Avoid starting in the most mission-critical codebase, because early noise can undermine confidence. Your first goal is not perfect coverage; it is proving that the bot saves time without creating friction. That may mean reviewing only high-risk files or only certain types of changes at first.
Collect examples of true positives, false positives, and missed issues. Those examples are gold for prompt tuning and rule refinement. They also help you explain the system to stakeholders in concrete terms rather than abstract promises. A small pilot with honest metrics will always outperform a vague “AI initiative.”
Instrument the review loop
Capture response times, average findings per review, acceptance rates, and developer overrides. If your bot finds many issues that engineers ignore, the prompt or rule set needs work. If it is too quiet, it may be missing important classes of defects. Metrics turn subjective opinions into actionable product feedback.
You should also log provider selection and token cost by repository. That data helps you decide when a cheaper model is enough and when a stronger one is worth the spend. Without instrumentation, model-agnostic architecture becomes an abstract virtue instead of an operational advantage.
Iterate on policy, not just prompt wording
Prompt tuning matters, but policy matters more. Decide which findings should block commits, which should just warn, and which should be suppressed by default. Establish a process for updating rule packs and documenting changes. The most successful review agents feel opinionated but not arbitrary.
Once the workflow is stable, you can expand to PR comments, CI annotations, and repository-wide quality reports. You might also add branch-level policy routing, so critical branches get stronger analysis. That progression turns your code review agent into a durable internal platform rather than a one-off bot.
12) When to build versus buy
Build when control and portability matter
If you need provider flexibility, strong data boundaries, or the ability to self-host, building makes a lot of sense. You get to decide how prompts are structured, where secrets live, and how the bot behaves in your workflow. For many engineering teams, those details are not optional. They are the product.
Building also pays off when your review needs are specialized. A generic review SaaS may not understand your domain rules, monorepo conventions, or release process. A custom TypeScript implementation can reflect those realities directly. That is especially valuable when your team already has internal platform expertise and wants a tool that evolves with it.
Buy when you need instant coverage and minimal maintenance
Managed tools still have a place, especially for small teams that want value immediately. If you do not have time to operate a self-hosted service or maintain adapters, buying can be the pragmatic choice. Just be clear about the tradeoffs: cost, privacy, and lock-in. Those factors become more visible as usage scales.
A good compromise is to start with a managed proof of value, then migrate to a model-agnostic internal bot once you understand your needs. That approach reduces risk while preserving an exit path. It is the same philosophy behind many resilient platform decisions: try, measure, then own the critical pieces.
Use the build path as a strategic capability
For teams that want long-term leverage, a model-agnostic code review agent is a strategic capability. It improves code quality, creates a reusable automation pattern, and gives you negotiating power with vendors. It also helps your engineers learn how to operationalize LLMs responsibly, which is becoming an important skill across developer tooling. In that sense, the project is both practical and career-building.
If you want to go deeper on agent design patterns, the next natural step is studying specialized orchestration systems like specialized AI agents, and if you are hardening the rollout, pair this guide with a trust-first deployment checklist. Together, they provide the architectural and operational guardrails needed to deploy confidently.
FAQ
1) What makes a code review agent model-agnostic?
A code review agent is model-agnostic when its core logic does not depend on one vendor’s SDK or response shape. It talks to a small adapter interface, and each provider-specific adapter translates requests and responses at the edge. That lets you swap OpenAI, Anthropic, or an OpenAI-compatible endpoint without rewriting the review engine.
2) Why use TypeScript instead of plain Node JavaScript?
TypeScript is especially useful here because you can define strong contracts for diffs, findings, adapters, and config. Those types reduce bugs in parsing and normalization, which is important when multiple providers return different formats. In a tool that coordinates prompts, secrets, and CLI output, type safety pays off quickly.
3) How do I keep API keys safe in a self-hosted review bot?
Keep keys in environment variables or a secret manager, never in source code. Redact logs, avoid printing request payloads, and use separate credentials for development and production. Also limit network access so the review service can only reach the model endpoint and approved APIs.
4) Should the bot block commits in a pre-commit hook?
Only for high-confidence, high-severity findings. Most teams get better adoption when warnings are informative but non-blocking, while critical issues like security mistakes or obvious data-loss risks can fail the hook. A balanced policy keeps the tool helpful without making local development miserable.
5) How do I prevent the model from producing noisy or hallucinated feedback?
Use tight context, explicit rules, and structured output requirements. Ask the model to ground findings in the diff, validate returned file and line references, and normalize results before surfacing them. Also measure false positives and tune the prompt or rules based on real developer feedback.
6) What is the best first milestone for this project?
The best first milestone is a CLI that reviews staged TypeScript diffs using one provider and returns JSON findings. Once that works, add the adapter layer, natural-language rules, and secret redaction. That sequence gives you a useful tool early while preserving the path to full model agnosticism.
Related Reading
- How to Build a Secure AI Incident-Triage Assistant for IT and Security Teams - A strong companion guide for secret handling, validation, and safe automation.
- Orchestrating Specialized AI Agents: A Developer's Guide to Super Agents - Useful for understanding multi-agent boundaries and orchestration patterns.
- Navigating the AI Supply Chain Risks in 2026 - A broader look at dependency, vendor, and operational risk in AI systems.
- Closing the Kubernetes Automation Trust Gap: SLO-Aware Right-Sizing That Teams Will Delegate - A practical lens on trust, reliability, and automation adoption.
- Trust-First Deployment Checklist for Regulated Industries - A deployment-focused guide for secure, auditable rollout practices.