Sunday, April 06, 2008

Language services on the web

Applications like Microsoft Word have had built-in spelling and grammar checkers for years. So Google's recent release of a web-based API for language translation made me think - just how far could these automated services go?

There are huge benefits to hosting language services on the web, rather than installing them locally on each PC. The clearest is the availability of enormous data sets. For example, it turns out that Google's spell check service is totally automated - there is no manually maintained database of words; it simply scans the web for the most common character sequences. The top 10,000 sequences must surely be correctly spelt words!
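To make the idea concrete, here is a minimal sketch of a frequency-based spell check. It is a toy illustration, not a description of how Google's service actually works. The function names, the tiny corpus and the cutoff of 4 are all assumptions chosen for the example; a real service would count sequences across billions of pages.

```python
import re
from collections import Counter


def count_words(corpus: str) -> Counter:
    # Split the text into lowercase runs of letters and count each one.
    return Counter(re.findall(r"[a-z]+", corpus.lower()))


def build_vocabulary(counts: Counter, top_n: int = 10_000) -> set:
    # Assumption: the top_n most frequent sequences are correctly spelt words.
    return {word for word, _ in counts.most_common(top_n)}


def misspelt(text: str, vocabulary: set) -> list:
    # Flag every word in the text that is absent from the vocabulary.
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in vocabulary]


# Hypothetical miniature "web": the typo "teh" occurs only once.
corpus = "the cat sat on the mat. the cat saw the dog. teh dog sat."
vocab = build_vocabulary(count_words(corpus), top_n=4)
flagged = misspelt("teh cat sat", vocab)
print(flagged)
```

With so little data the cutoff also excludes rare but correct words such as "mat", which is exactly why the approach only becomes useful with an enormous data set.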

The same brute-force, data-driven approach could surely also provide a grammar check service. For automatic translation, you just need to analyse enough Rosetta Stones, where the same text is written in multiple languages. And Google has been operating a free telephone 411 service in the US, supposedly so that it can gather enough data (by recording people's voices) to eventually deliver good speech recognition.
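The "Rosetta Stone" idea can be sketched in a few lines: given the same sentences in two languages, pair each source word with the target word it most consistently appears alongside. This is only a toy co-occurrence heuristic (scored with the Dice coefficient), not the method used by any real translation service, and the function name and the four sentence pairs are made up for illustration.

```python
from collections import Counter
from itertools import product


def cooccurrence_lexicon(pairs):
    # Count how often each word appears, and how often each
    # (source word, target word) pair shares a sentence pair.
    src_counts, tgt_counts, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_counts.update(src_words)
        tgt_counts.update(tgt_words)
        joint.update(product(src_words, tgt_words))

    lexicon = {}
    for s in src_counts:
        # Dice coefficient: 2 * joint count / (source count + target count).
        lexicon[s] = max(
            tgt_counts,
            key=lambda t: 2 * joint[(s, t)] / (src_counts[s] + tgt_counts[t]),
        )
    return lexicon


# Hypothetical parallel text: the same phrases in English and French.
parallel = [
    ("the cat", "le chat"),
    ("the dog", "le chien"),
    ("a cat", "un chat"),
    ("a dog", "un chien"),
]
lexicon = cooccurrence_lexicon(parallel)
print(lexicon["cat"], lexicon["dog"])
```

No dictionary or grammar is supplied; the word pairings fall out of the counts alone, and more parallel text would make them more reliable.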

It also matters whether a service is descriptive (merely the result of observing how language is used) or prescriptive (defining rules for people to follow). The writers of the Oxford English Dictionary claim their work reflects the usage patterns of different words; entries in the dictionary are not meant as prescriptive rules, though following them clearly helps if you want to be understood! This is important because a descriptive service could in theory be automated simply by analysing literary data, whereas a prescriptive service can't.

Finally, services that require a semantic understanding of language are clearly some way off.

Service             Authority      Enough data?   Needs semantics?
Spelling            Descriptive    Yes            No
Grammar             Descriptive    Yes            No
Speech recognition  Descriptive    Not yet        No
Thesaurus           Descriptive    Not yet        No
Translation         Descriptive    Not yet        Maybe
Dictionary          Descriptive    Yes            Yes
Encyclopedia        Prescriptive   Not yet        Yes

The table suggests that pretty much every service could be generated automatically, simply by analysing huge amounts of data, without the need for understanding. The only exceptions are translations, dictionaries, and encyclopaedias - and for translations, as Google has shown, you can still get a useful part of the way there.

The main takeaway is that search engines have one massively important side benefit that has yet to be fully appreciated: they revolutionise linguistics. In fact, they turn it from a mainly qualitative field into a quantitative science.

We now have the tools to analyse language variations as they spread through time and geography, or to discover the common elements in every language, or to watch how language style depends on context, using as a data set the entire internet!

If there's one thing that makes us human, it's language. Computers will help us to understand ourselves!
