AI Reviewers

Max Prilutskiy · Updated 15 days ago · 7 min read

AI reviews are automated quality checks that evaluate translations produced by your localization engine. After each translation request, Lingo.dev runs independent LLM evaluations to verify the output - checking glossary compliance, instruction adherence, and any custom criteria you define. Reviews run asynchronously and never block the translation response.

How it works#

When the localization engine completes a translation request, it queues the applicable reviews for asynchronous evaluation. Each review runs an independent LLM that receives the source text, translated output, context, and evaluation criteria. It returns a structured result - pass/fail or a percentage score - with reasoning for imperfect results.
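The structured result described above can be sketched as follows. This is a minimal illustration only; the field names are assumptions for this example, not the actual Lingo.dev API schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical shape of a single review result, for illustration only.
@dataclass
class ReviewResult:
    reviewer: str            # which review produced this result
    verdict: Optional[bool]  # pass/fail for boolean reviews, None for percentage
    score: Optional[int]     # 0-100 for percentage reviews, None for boolean
    reasoning: Optional[str] # explanation, present only for imperfect results

# A failed glossary check carries reasoning explaining the violation:
glossary_check = ReviewResult(
    reviewer="glossary-items",
    verdict=False,
    score=None,
    reasoning='The glossary maps "Deploy" to "Bereitstellen", '
              'but the translation used "Einsetzen".',
)
```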

The engine's Reviews tab controls which reviews run for that engine. There are three categories:

| Category | What it checks | Result type | Configuration |
| --- | --- | --- | --- |
| Glossary items AI review | Whether translations follow the engine's glossary rules | Pass / Fail | Built-in toggle per engine |
| Instructions AI review | Whether translations follow each of the engine's instructions | Pass / Fail per instruction | Built-in toggle per engine |
| Custom AI reviewers | Your own evaluation criteria, defined at the organization level | Pass / Fail or 0–100% | Select per engine from org-level reviewers |

Built-in AI reviews#

Every localization engine includes two built-in review types that verify translations against the engine's own configuration. Enable or disable them in the engine's Reviews tab.

Glossary items AI review#

Checks whether the translation adheres to all applicable glossary rules. If the engine has custom translations (e.g., "Deploy" → "Bereitstellen") or non-translatable terms (e.g., "OAuth"), the review verifies that the translation respects them.

The review accounts for grammatical variations - a glossary rule for a term in one grammatical case applies to all forms of that term. If conflicting glossary rules exist, the translation is considered compliant as long as one of them was followed.

The result is a single pass/fail verdict for the entire translation request, with reasoning when the result is a fail.

Instructions AI review#

Evaluates each instruction independently. If the engine has three instructions, the review produces three separate pass/fail verdicts - each with its own reasoning when the result is a fail.

An instruction can return N/A when its criteria don't apply to the content being translated. For example, an instruction about formal address returns N/A when the translation contains only a product name or a technical term where formality is irrelevant. N/A results are excluded from aggregate scores.

Both built-in reviews only trigger when the engine has relevant configuration - if no glossary items match the locale pair, no glossary items AI review runs.

Configuring reviews per engine#

Open the engine's Reviews tab to control which reviews run for that engine. The tab has two sections:

Built-in toggles at the top control the glossary items AI review and instructions AI review. These are independent - you can enable one without the other, depending on what the engine has configured.

Custom AI reviewers below the toggles list all AI reviewers defined at the organization level. Toggle each one on or off for that specific engine. This lets you maintain a shared library of quality checks and apply them selectively.

A single engine can have both built-in reviews and multiple custom AI reviewers running simultaneously. All reviews run asynchronously after each translation request, and results appear in the translation log and in Reports.

AI reviewer types#

Boolean AI reviewers#

Return a binary verdict: pass or fail. Use these for rules that are either met or not.

Examples:

  • "Does the translation preserve all HTML tags and attributes?"
  • "Are pluralization rules applied correctly for the target language?"
  • "Does the translation use formal address (Sie) in German?"

Results are aggregated as pass rates - 75% means 3 out of 4 evaluated translations passed.

Percentage AI reviewers#

Return a score from 0 to 100. Use these for quality dimensions that exist on a spectrum.

Examples:

  • "Rate the naturalness of the translation for a native speaker (0–100)"
  • "Score how well the translation preserves the original tone and intent (0–100)"
  • "Evaluate grammatical correctness on a scale of 0–100"

Results are aggregated as averages across the evaluation period.

AI reviewer configuration#

| Field | Description |
| --- | --- |
| Name | A label identifying the AI reviewer (e.g., "Pluralization check") |
| Instruction | The evaluation criteria, written in natural language |
| Type | boolean (pass/fail) or percentage (0–100) |
| Source locale | The source locale to match, or * for any |
| Target locale | The target locale to match, or * for any |
| Provider / Model | The LLM used for evaluation (independent of the translation model) |
| Sampling | Percentage of requests to evaluate (0–100%) |
| Allow N/A | Whether the AI reviewer can return "not applicable" for irrelevant pairs |
| Enabled | Toggle review on or off without deleting the configuration |
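Put together, a reviewer configuration mirroring these fields might look like the sketch below. The keys and values are illustrative assumptions, not the actual Lingo.dev API payload.

```python
# Hypothetical AI reviewer configuration; key names are assumptions
# chosen to mirror the fields in the table above.
fluency_reviewer = {
    "name": "Fluency check",
    "instruction": "Rate the fluency of the translation on a scale of 0-100.",
    "type": "percentage",      # "boolean" or "percentage"
    "sourceLocale": "en",
    "targetLocale": "*",       # wildcard: matches any target locale
    "provider": "anthropic",   # independent of the translation model
    "model": "claude-sonnet",
    "sampling": 50,            # evaluate roughly half of matching requests
    "allowsNA": True,
    "enabled": True,
}
```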

Writing AI reviewer instructions#

The instruction field is the core of an AI reviewer. It tells the evaluation LLM exactly what to check. Write it as a specific, testable criterion.

Good instructions#

Boolean:

```text
Check whether all HTML tags in the source text are preserved
exactly in the translation. Tags must not be added, removed,
modified, or reordered. Pass if all tags are preserved, fail
if any tag is missing or altered.
```

Percentage:

```text
Rate the fluency of the translation on a scale of 0-100.
100 means a native speaker would find it completely natural.
0 means it reads like machine output. Deduct points for
awkward phrasing, unnatural word order, or overly literal
constructions.
```

What makes a good instruction#

  • Specific criteria - define exactly what pass/fail means, or what 0 and 100 represent
  • Observable outcomes - the LLM should be able to evaluate by reading the text, not guessing intent
  • One concern per AI reviewer - split multi-dimensional quality checks into separate AI reviewers

Locale matching#

AI reviewers match translation requests by source and target locale. Wildcard * matches any locale.

| Source locale | Target locale | Matches |
| --- | --- | --- |
| en | de | Only English → German translations |
| en | * | Any translation from English |
| * | ja | Any translation into Japanese |
| * | * | All translations |

A single translation request can trigger multiple AI reviewers if several match its locale pair.
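The matching rule described above amounts to a simple per-side comparison. This is an illustrative sketch of the logic, not the actual implementation:

```python
def locale_pattern_matches(pattern: str, locale: str) -> bool:
    """'*' matches any locale; otherwise require an exact match."""
    return pattern == "*" or pattern == locale

def reviewer_matches(reviewer_source: str, reviewer_target: str,
                     request_source: str, request_target: str) -> bool:
    # A reviewer applies only when both sides of the locale pair match.
    return (locale_pattern_matches(reviewer_source, request_source)
            and locale_pattern_matches(reviewer_target, request_target))

reviewer_matches("en", "de", "en", "de")  # True: exact pair
reviewer_matches("en", "*", "en", "ja")   # True: any target from English
reviewer_matches("*", "ja", "de", "ja")   # True: any source into Japanese
reviewer_matches("en", "de", "en", "fr")  # False: target does not match
```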

Sampling#

Not every translation needs to be reviewed. The sampling rate controls what percentage of matching requests get evaluated.

| Sampling | Behavior |
| --- | --- |
| 100% | Every matching request is reviewed (thorough but higher cost) |
| 50% | Roughly half of matching requests are reviewed |
| 10% | One in ten - useful for high-volume engines where trends matter more than individual scores |
| 0% | AI reviewer is effectively paused without disabling it |

Sampling is applied at request time using a random check. Over a sufficient volume of requests, the actual evaluation rate converges to the configured percentage.
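The per-request random check can be sketched as a simple coin flip against the configured rate. This is an assumption about the mechanism based on the description above, not the actual implementation:

```python
import random

def should_review(sampling_percent: float) -> bool:
    # A uniform draw in [0, 100) falls below the sampling rate with
    # probability sampling_percent / 100, so 100% always reviews and
    # 0% never does.
    return random.random() * 100 < sampling_percent

# Over many requests, the observed rate converges to the configured one:
observed = sum(should_review(10) for _ in range(100_000)) / 100_000
```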

N/A support#

When allowsNA is enabled, the review LLM can return "not applicable" instead of a score. This is useful for AI reviewers whose criteria don't apply to every locale pair.

Example: An AI reviewer checking formal address conventions returns N/A for English → English translations (English has no formal/informal distinction), but returns a score for English → German.

N/A results are excluded from averages and pass rates in reporting - they don't pull scores down or inflate them.
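The exclusion rule can be illustrated with a small aggregation sketch (an assumption about how reporting works, using `None` to stand for N/A):

```python
def pass_rate(results):
    """Aggregate boolean review verdicts into a pass rate (percent).

    None represents an N/A verdict and is excluded from the denominator,
    so N/A results neither pull scores down nor inflate them.
    """
    scored = [r for r in results if r is not None]
    if not scored:
        return None  # nothing applicable was evaluated
    return 100 * sum(scored) / len(scored)

# Three passes, one fail, two N/A: the N/A results are ignored,
# so the pass rate is 3 out of 4.
pass_rate([True, True, True, False, None, None])  # 75.0
```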

Reasoning#

AI reviewers provide reasoning for imperfect results to help you understand what went wrong:

  • Perfect score (pass or 100%) - reasoning is null (nothing to explain)
  • N/A - reasoning is null
  • Imperfect score - a brief one-sentence explanation

This keeps the review results actionable: when a translation fails a check, the reasoning tells you why without manual investigation.

Review model#

Each AI reviewer has its own LLM provider and model configuration, independent of the translation model. This separation is intentional - the model that produces the translation should not be the same model that evaluates it.

Model independence

Using a different model for review than for translation provides an independent assessment. If GPT-4o produces the translation, evaluating with Claude Sonnet gives you a second opinion rather than self-assessment.

AI reviewer reports#

Review results are visualized in the dashboard under the AI reviewer reports section, showing:

  • Pass rates over time - for boolean AI reviewers, plotted as daily percentages
  • Average scores over time - for percentage AI reviewers, plotted as daily averages
  • Per-locale-pair breakdown - see how each source → target pair performs independently
  • Aggregate view - combine all locale pairs into a single trend line

AI reviewer reports complement the volume-focused Reports - together they give you a complete picture of both throughput and quality.

Managing AI reviewers via MCP#

If you use the Lingo.dev MCP server, your AI coding assistant can create and configure AI reviewers directly:

```text
"Create a boolean AI reviewer for all locale pairs that checks
whether HTML tags are preserved in translations."
```

```text
"Add a percentage AI reviewer for English to German that rates
translation fluency on a 0-100 scale, sampling 50% of requests."
```

Next Steps#

Reports
Monitor translation volume, token usage, and locale coverage
LLM Models
Configure the translation models that AI reviewers evaluate
Glossaries
Set up terms that glossary compliance AI reviewers can check against
API Reference
Integrate the localization API into your workflow
