BLEU Score
The BLEU (Bilingual Evaluation Understudy) score is a metric developed at IBM in 2002 to evaluate the quality of machine-translated text by comparing it against one or more human-produced reference translations. It measures how closely the machine output matches the reference at the word and phrase level, producing a score between 0 and 1 (often expressed as a percentage).
BLEU works by analysing n-gram overlap — counting how many sequences of words (unigrams, bigrams, trigrams, and four-grams) in the machine translation also appear in the reference translation, with each count clipped so that repeating a matching word cannot inflate the score. A brevity penalty then discounts translations that achieve high precision simply by being shorter than the reference. A higher BLEU score indicates closer alignment with the human reference. Scores above 0.5 are generally considered good, though the threshold varies by language pair and content type.
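The mechanics above can be sketched in a few lines of Python. This is a simplified sentence-level illustration with a single reference, not a production implementation (real-world BLEU is usually computed at corpus level with smoothing, via tools such as sacreBLEU); the function names here are our own:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n), multiplied by a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clip each candidate n-gram count by its count in the reference,
        # so repeating a matching word cannot inflate the score.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or overlap == 0:
            return 0.0  # any zero precision collapses the geometric mean
        precisions.append(overlap / total)
    # Brevity penalty: candidates shorter than the reference are discounted.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

For example, a candidate identical to the reference scores 1.0, while a candidate that matches perfectly but is shorter than the reference is pulled below 1.0 by the brevity penalty. Note that a single unmatched four-gram range drives short sentences to 0, which is one reason real implementations apply smoothing.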
While widely used in machine translation research and development, BLEU has significant limitations as a quality measure. It focuses on surface-level word matching and cannot assess meaning, fluency, cultural appropriateness, or whether a translation actually communicates the intended message. Two translations can have similar BLEU scores while differing dramatically in quality as perceived by a human reader.
Other automated metrics such as METEOR, TER (Translation Edit Rate), and more recent neural evaluation metrics like COMET and BERTScore have been developed to address some of BLEU's limitations — for example, by crediting synonyms or comparing meaning rather than surface wording — but no automated metric fully replaces human quality assessment.
LEXIGO uses automated metrics as one input alongside human review in our quality assurance process, recognising that numbers alone cannot capture the nuance that professional translation demands.
Understanding BLEU scores helps clients evaluate machine translation vendors and make informed decisions about when machine translation is fit for purpose versus when human translation is required. A high BLEU score does not necessarily mean a translation is ready for publication — it simply means the output statistically resembles a reference translation.
For content where accuracy, brand voice, and cultural sensitivity matter, human evaluation remains essential regardless of automated scores. BLEU is most useful as a development and benchmarking tool rather than a definitive quality measure.