Ladies May Have a Fit Upstairs

By Charlene Crabb | Wednesday, December 1, 1993
That sign in a Hong Kong tailor shop was not translated by CITAC, a computer that converts written Chinese into usable English.

Chinese is one of the most beautiful of written languages, and also one of the most complex. Its intricate characters, developed centuries before the first alphabets were invented, bear no resemblance to Greek or Roman letters. And unlike the picture symbols of ancient hieroglyphics, they do not usually resemble objects either. Instead, each boxy character represents a word or part of one. That means there have to be a lot of characters--around 50,000 of them, of which a Chinese person must recognize at least 2,000 just to read the morning paper. To complicate things even more, the characters are strung together without spaces to signal where one word ends and another begins. The strings usually run across the page from left to right, but sometimes, especially in older texts, they run from top to bottom or right to left.

Thus the existence of human beings who can effortlessly read Chinese and even translate it into English is evidence of the tremendous sophistication of the human brain. It is not the sort of task one might think of entrusting to a computer. And yet Julius Tou, a native of Shanghai and, until his recent retirement, an electrical engineer at the University of Florida in Gainesville, says he has done it: after decades of work, he has developed a computer system that translates written Chinese into grammatically correct and idiomatic English. Tou says his system, called CITAC (for Chinese Translation Computer), is faster than, albeit not quite as accurate as, an expert human translator. The system, which Tou is now selling for $15,000, consists of a personal computer equipped with specialized hardware and software.

The roots of the 68-year-old Tou’s recent success go back 40 years. In the 1950s computer scientists and the military alike were hot on developing programs that could translate major languages into English. But the complexities soon proved insuperable. A Chinese-to-English program developed at that time, for example, translated individual characters. It was almost useless. A ten-character sentence that should have read “He drives recklessly” was translated by the computer as “He open car not manage three seven two ten one.” By the early 1970s the military had stopped funding the research, and almost everyone, including Tou, had given up on computer translation.

What was needed, as it turns out, was the sort of cheap computer power that is now taken for granted. Also needed were ways of teaching computers to look at a language as a human does and to recognize verbs, adjectives, nouns, and so on. In the 1980s these prerequisites began to fall into place. Researchers started building machines that could translate many of the major languages, including Spanish, German, Hebrew, and Greek, into English.

About ten years ago Tou decided to have a go at Chinese again--and this time he has succeeded. “One of the most difficult parts was teaching the computer how to recognize words from a Chinese character stream,” he says. The solution was a clever search strategy.

Since very few Chinese words are more than six characters long, CITAC begins with the first six characters in a sentence. It uses the first three as a key to scan its built-in dictionary of 40,000 words and idioms. If those three characters are a word unto themselves, it puts that word into a word buffer. If they could be the beginning of a longer word or phrase, it adds the next three characters one by one to see if the longer string matches something in the dictionary. If CITAC can’t find a match, it checks the first two characters to see if they are a word. If they aren’t, it assumes the first character is a word unto itself. It then begins the process anew using characters 2, 3, and 4 as the search trio.
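For readers who think in code, the search can be approximated by a greedy longest-match loop. This is only a sketch under stated assumptions, not CITAC's actual program: the function name `segment`, the toy dictionary, and the three sample characters (a guess at the "He open car" opening of the example sentence) are all invented for illustration. It also simplifies the order of the search by trying the longest candidate first and shrinking, and it ignores the case where a string could be either a whole word or the start of a longer one.

```python
MAX_WORD_LEN = 6  # the article says very few Chinese words exceed six characters


def segment(sentence, dictionary, max_len=MAX_WORD_LEN):
    """Split an unspaced string of characters into words by greedy longest match."""
    words = []
    pos = 0
    while pos < len(sentence):
        remaining = len(sentence) - pos
        # Try the longest candidate first and shrink until something matches.
        for length in range(min(max_len, remaining), 0, -1):
            candidate = sentence[pos:pos + length]
            # A lone character is always accepted as a word unto itself.
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                pos += length
                break
    return words


# Hypothetical three-entry dictionary; the real one is said to hold 40,000 entries.
toy_dictionary = {"他", "开", "车", "开车"}
print(segment("他开车", toy_dictionary))  # ['他', '开车']
```

With the two-character entry for "drive" in the dictionary, "open" and "car" are no longer split apart, which is the whole point of matching on words rather than on individual characters.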

When CITAC encounters an ambiguity--a set of characters that could be either a whole word or the beginning of a word or phrase--it puts the longer string into a temporary word buffer. Then, after it has finished dividing up the entire sentence, it goes back and determines which of the two possibilities makes more sense in context. This helps it avoid overly literal translations like “He open car not manage three seven two ten one.” In that ten-character sentence, each character corresponds to an English word; but the last seven taken together (“not manage three seven two ten one”) mean “recklessly.” (It is not that they are a strange Chinese idiom; the sum of the characters is a single word that has nothing to do with the individual parts.)

Once the computer has worked its way through the sentence and spewed the words and phrases it found into the buffer, it defines them and then replaces them with English equivalents. Then, using the relative position of the characters in the sentence, it identifies the words that correspond to subject, verb, object, and so on. Next it shuffles them around, transforming the Chinese word order to English word order. The sentence “He Japanese speak very fluent” becomes “He speak Japanese very fluent.”
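The reshuffling step can be pictured as a sort on grammatical roles. The sketch below assumes the roles have already been worked out and are supplied by hand; the role labels, the `ENGLISH_ORDER` list, and the function name `reorder` are illustrative inventions, not anything the article attributes to CITAC.

```python
# Assumed target order for a simple English clause.
ENGLISH_ORDER = ["subject", "verb", "object", "adverbial"]


def reorder(tagged_words):
    """Stable-sort (word, role) pairs into the assumed English role order."""
    rank = {role: i for i, role in enumerate(ENGLISH_ORDER)}
    # sorted() is stable, so words sharing a role keep their original order.
    return sorted(tagged_words, key=lambda pair: rank[pair[1]])


chinese_order = [
    ("He", "subject"),
    ("Japanese", "object"),
    ("speak", "verb"),
    ("very", "adverbial"),
    ("fluent", "adverbial"),
]
english_order = reorder(chinese_order)
print(" ".join(word for word, _ in english_order))  # He speak Japanese very fluent
```

Because the sort is stable, "very" stays in front of "fluent" while the verb and object trade places.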

Finally CITAC polishes the prose. It adds prefixes, suffixes, articles, verb endings, and plural endings (all of which Chinese generally dispenses with). “He speak Japanese very fluent” becomes “He speaks Japanese very fluently.” CITAC even knows to insert verbs when there are none. “She beautiful” is perfectly acceptable Chinese, but CITAC adds “is” to make the sentence acceptable English.
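A toy version of this polishing pass might look like the following. It is a sketch that handles only the two examples above, using naive suffix rules; the role labels, the pronoun set, and the function name `polish` are assumptions made for illustration, and real English morphology needs far more than this.

```python
THIRD_PERSON_SINGULAR = {"He", "She", "It"}


def polish(tagged_words):
    """Add a verb ending and an adverb ending, and insert 'is' if no verb exists."""
    subject = next((w for w, role in tagged_words if role == "subject"), None)
    has_verb = any(role == "verb" for _, role in tagged_words)
    out = []
    for word, role in tagged_words:
        if role == "verb" and subject in THIRD_PERSON_SINGULAR:
            word += "s"   # speak -> speaks (naive rule)
        elif role == "manner":
            word += "ly"  # fluent -> fluently (naive rule)
        out.append(word)
        if role == "subject" and not has_verb:
            out.append("is")  # supply the verb that Chinese leaves implied
    return " ".join(out) + "."


print(polish([("He", "subject"), ("speak", "verb"), ("Japanese", "object"),
              ("very", "degree"), ("fluent", "manner")]))
# He speaks Japanese very fluently.
print(polish([("She", "subject"), ("beautiful", "complement")]))
# She is beautiful.
```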

How acceptable CITAC English making? “It’s still not possible for CITAC to do good translations of Mao Ze-dong’s essays,” admits Tou. But it can, he says, do a creditable job on a newspaper. And Tou thinks it will be even more effective in business correspondence because the prose could be composed with CITAC in mind. A Chinese person writing to an American associate could remove land mines--such as unusual proper names that CITAC doesn’t recognize--that would otherwise turn a sentence to gibberish. He could also add verbs, modifiers, and pronouns that are only implied in Chinese, to make sure the sentences have all the elements of their English counterparts. He could even customize the basic dictionary by adding characters, names, words, and phrases frequently used in his business.

Tou is now at work on a companion to CITAC that will translate English into Chinese, thereby allowing the American to answer the letter from China. He also hopes to tackle the more difficult problem of translating spoken Chinese. “I feel I’ve made a contribution to the world already,” he says. “But I want people to be able to talk to each other in their own native language by the year 2000.”