A Gentle Introduction to Statistical Machine Translation

AbstractThis is some abstract text.

Word-Based Models

It is very intuitive to think of translation as a word-to-word process. Think of a Chinese person who does not know any English traveling in New York (or you can think about Tom Hanks in the movie The Terminal). When he looks at the sentence: .

The idea behind Word-Based Models lies in the common situation described above. However, the reality is always a lot crueler. If you take this verbatim approach every time you come across a foreign sentence, you'll easily end up with translations that are completely not understandable. There are several reasons for this:

  1. Word Order Issues: The word order in one language may be different from that in another. For example, in Japanese the predicate is always at the end, while English has predicates in the middle of sentences between the subject and the object. If translates Japanese word by word with a Japanese-English dictionary, you'll probably hear sentences like these: "I Today apple eat.", "I the beach to want to go.", etc.

  2. Multiple Translation Possibilities: It is common for a foreign word to have many possible translations. The Chinese word "智" could both mean intelligence or the country Chile. Choosing the correct translation that fits the context and reflects the meaning of the foreign sentence is no easy task for Tom Hanks in that movie.

  3. Missing Words: This is commonly seen when you translate from English to Chinese. One kind of the determiners, definite articles (the word the) is usually ignored, because there is no equivalence to such words in Chinese. Examples are ubiquitous.

A Generative Story

IBM Models tell generative stories

The probability of translating one sentence to another sentence is:

\[ p({\bf{t}}|{\bf{s}}) \]

where \( {\bf{t}} = ({t_1},{t_2},...,{t_{{l_t}}}) \) is the word sequence (represented by a vector of words) of the target language, and \( {\bf{s}} = ({s_1},{s_2},...,{s_{{l_s}}}) \) is the word sequence of the source language. In a Chinese to English translation task, the source language is Chinese, and the target language is English.

Now that we have this model, we want to estimate all these probabilities, which indicate how probable a sentence in the target language is seen given the sentence in source language. On a parallel corpus where correspondent sentences from source and target languages are paired together, we can perform a standard Maximum Likelihood Estimation and get some estimates of the probabilities. You might already sensed that I'm merely joking here. This model will be incredibly sparse, because chances are that the sentence in your test set is never seen in the training set. Probabilities like

\[p({\rm{write code in lab}}|asdf)\]