Franz Josef Och, Google's translation uber-scientist, talks about Google Translate
This week we wrote about Google's Translate application and how it could eventually change the way people communicate, overcoming the language barriers that have long separated human populations. Franz Josef Och leads the machine translation (MT) team at Google, and has been the driving force behind much of the company's progress on the technology. The following is an edited transcript of a recent interview with Och.
How often do you add new languages to Google Translate?
The last language we added was Haitian Creole. I myself am quite surprised that we can build MT technologies for very small languages. If you'd asked me three years ago, when would you have Haitian Creole, or Yiddish, or Icelandic, I would've said that with statistical machine translation (SMT), the challenge is how much data you have, so probably quite some time -- if it ever works. But now, thanks to the Internet and the availability of data there -- along with the improvement in algorithms -- we can build MT systems for those small languages and make them work reasonably well.
How is it possible to make the system work for a language like Yiddish, where there's not much text out there to train the machine with?
What made it possible is that Yiddish is very similar to German, and has a lot of loan-words from Hebrew and Polish. For those languages, we have large amounts of training data. So what we do is learn a lot of stuff from those other languages and then apply that to Yiddish.
How did Google figure out so early that it was going to be important to be able to translate the Web?
The language barrier is really a very big problem for communication. That's especially true for someone who speaks a language where just a small percentage of the information out there is available in that language. A language like Arabic -- where 1% of the information on the Web is in Arabic -- those people would have very limited access to information out there. The idea is, can we with the help of technology and machine translation -- can we break down the language barrier? So that anyone can access any information -- any text out there -- independent of the language.
When I joined Google, I actually talked to Larry [Page] about that on the phone, because I was concerned about why Google would do MT -- it's a search engine company. He emphasized that it's really core to the mission of Google, and not just a side thing where if times get hard, then MT will [fall by the wayside]. But people are very serious at Google about the mission and trying to achieve it.
It's now important in areas like search, where we now have the idea of cross-lingual translated search. If you have a question about something, you should be able to type a query in, and if the answer is in a Web page in a completely different language, you should be able to find that and understand the information there.
How close are you to making that a reality?
It's a hard question. In some sense, I believe we've made progress, and this is an exciting time for MT in the research community at large, but also here at Google. MT gets a lot more traction, more people are using it and it gets integrated into many different products. But on the other hand, there's obviously still a lot of work ahead of us. What we're doing is working on the core quality of machine translation.
So I feel my job is relatively safe. For quite a few years, there will be things still to be improved. Now, while the MT is pretty good for some of the big languages, like Portuguese and Spanish, for the small languages there's still a lot to be done so we can get similar translation quality. It will be a never-ending kind of improvement.
When you train the translator, you've got to get so-called parallel data sets, where every document occurs in at least two languages. Where do you get all of those translations from?
When we started, there were standard test sets provided by the Linguistic Data Consortium, which provides data for research and academic institutes. Then there are places like the United Nations, which have all their documents translated into the six official languages of the United Nations. And there's a vast pool of documents available there in the database, which has been very useful because the translation quality has been very good.
But then otherwise, it's kind of 'the Web.' Where all the documents that are on the Web that are translated contribute to learning translation for our algorithms. On the Web, the quality of the translation might not always be so good, so it's a very interesting and challenging research problem in itself to find all the translations and learn from the potentially noisy translations out there.
Our algorithms basically mine everything that's out there.
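To make the "noisy translations" problem concrete, here is a minimal sketch of one crude filter that could be applied to candidate sentence pairs mined from the Web: drop pairs whose lengths differ too much to plausibly be translations of each other. This is an illustration only, not a description of Google's method; the function names, the 2.0 ratio threshold and the sample sentences are all assumptions made up for the example.

```python
def plausible_pair(source, target, max_ratio=2.0):
    """Return True if two sentences have roughly comparable token counts.

    Hypothetical heuristic: a real translation rarely has more than
    `max_ratio` times as many words as its source (or vice versa).
    """
    src_len = len(source.split())
    tgt_len = len(target.split())
    if src_len == 0 or tgt_len == 0:
        return False  # an empty side cannot be a translation
    ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
    return ratio <= max_ratio


def filter_pairs(candidates, max_ratio=2.0):
    """Keep only the candidate (source, target) pairs that pass the length check."""
    return [(s, t) for s, t in candidates if plausible_pair(s, t, max_ratio)]


# Made-up candidates: one plausible pair, one mismatch, one empty source.
candidates = [
    ("das haus ist klein", "the house is small"),
    ("hallo", "click here to subscribe to our newsletter today"),
    ("", "orphaned text"),
]
clean = filter_pairs(candidates)
```

A length check like this only removes the most obvious mismatches; it says nothing about whether the surviving pairs are good translations.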
So it's sort of analogous to the way Google's Web crawler spiders Web pages?
It's similar. While the Web crawler is mining the whole Web and indexing it, the translation crawler is mining the subset of documents that include translations. The challenge is to find which texts are translated into another language -- and where to find the corresponding translation.
Do you use the data from Google Books as a source of translated data?
That's obviously a very interesting data source because a lot of books have been translated into many different languages. And especially for small languages where there's not that much Web content out there, there are actually books. But that area has its own interesting challenges, with OCR quality being an issue -- especially if you want small unusual languages. But we've started adding books too into the mix of data.
The Android version of Google Translate allows the user to speak to the application, and have his or her words translated. How does that part work?
The ways we are doing speech recognition and MT are conceptually rather similar. Both of them learn from large amounts of data. For MT, we need to mine those translations, but for speech recognition, what you need is a speech signal that you tape somehow, and then the transcription. The more of the transcribed speech you have, the better the speech recognition quality.
You have similar learning algorithms. In translation we learn the correlation in how words relate from source to target language. In speech recognition, they would learn how certain phonemes would get pronounced.
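As a rough illustration of what learning "how words relate from source to target language" can look like, the sketch below estimates word translation probabilities from a few sentence pairs using expectation-maximization in the style of IBM Model 1, a classic statistical MT technique. It is not Google's system: the three German-English pairs are toy data invented for the example, the function names are hypothetical, and the usual NULL source word is left out for brevity.

```python
from collections import defaultdict


def train_word_translation(pairs, iterations=20):
    """Estimate t[(target_word, source_word)] from tokenized sentence pairs."""
    target_vocab = {w for _, tgt in pairs for w in tgt}
    uniform = 1.0 / len(target_vocab)

    # Start with the same probability for every co-occurring word pair.
    t = {}
    for src, tgt in pairs:
        for tw in tgt:
            for sw in src:
                t[(tw, sw)] = uniform

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (target, source) links
        total = defaultdict(float)  # expected count of each source word
        # E-step: share each target word among the source words in its sentence.
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: renormalize the expected counts into probabilities.
        t = {(tw, sw): c / total[sw] for (tw, sw), c in count.items()}
    return t


def best_translation(t, source_word):
    """Return the target word with the highest probability for source_word."""
    options = {tw: p for (tw, sw), p in t.items() if sw == source_word}
    return max(options, key=options.get)


# Toy parallel data (made up): each pair is (source tokens, target tokens).
pairs = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]
t = train_word_translation(pairs)
```

With this data, `best_translation(t, "buch")` picks "book": the word appears with "buch" in two sentences, and the competing explanations ("the", "a") are better accounted for by "das" and "ein". More parallel text gives the same procedure more evidence to separate such candidates.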
Is it just a short step from here to real-time, speech-to-speech translation, a la "Star Trek's" universal translator?
To really do the integrated speech-to-speech translation, where you can have a phone call with someone and it would be interpreted live? I believe that based on the technology that we have, and the improvement rate we have in the core quality of MT and speech recognition, that it should be possible to do that in the not-too-distant future.
-- David Sarno