Microsoft turns spoken English into spoken Mandarin – in the same voice (thenextweb.com)
583 points by evo_9 on Nov 8, 2012 | 121 comments


As someone who spent years learning Chinese as a second language, and then made my living for years as a Chinese-English interpreter, I found that pretty impressive.

The economics of the issue is that a machine interpreter just has to be as good as a human interpreter at the same cost. That's a reachable target with today's computer technology. EVERY time I've heard someone else interpreting English or Chinese into the other language, I have heard mistakes, and I am chagrined to remember mistakes that I made over the years. We can't count on error-free machine interpretation between any pair of languages (human language is too ambiguous in many daily life cases for that), but if companies develop tested, validated software solutions for consecutive interpreting (what I usually did, and what is shown in the video) or simultaneous interpreting (the harder kind of interpreting in demand at the United Nations, where even in the best case it is not always done well), then those companies will be able to displace a lot of human professionals who rely on their language ability to make a living.

Right now a lot of interpreters in the United States make a lot of part-time income from gigs that involve suddenly getting telephone calls and joining in to interpret a telephone conversation in two languages. This is often necessary, for example, for physician interviews of patients in emergency rooms or pharmacist consultations with patients buying prescribed drugs (where I last saw a posted notice on how to access such an interpretation service). The IBM Watson project is already targeted at becoming an expert system for medical diagnosis, and patient care markets will surely provide a lot of income for further development of software interpretation between human languages.

It's still good for human beings to spend the time and effort to learn another human language (as so many HN participants have by learning English as a second language). That's a broadening experience and an intellectual delight. But just as riding horses is more a form of recreation these days than a basis for being employed, so too speaking another language will be a declining factor in seeking employment in the next decade.


I don't think there will be much of an impact on the interpreter industry until the machine translations are significantly better than human translations.

Human translators are so expensive today that they are only used in situations where the translation has to be correct -- diplomacy, courtrooms, books, etc. Until a machine is much better than a human, these use cases won't switch to machine translation (similarly, self-driving cars won't be allowed until they are proven to be much safer than human drivers).

On the other hand, there's a large casual market for machine translations today for situations like reading foreign Web sites, chatting with people in different countries, reading Tweets in a different language, etc.


Translators and interpreters are still used in a fairly wide range of situations. I've worked both as a document translator and as a spoken interpreter in a number of manufacturing plants.

After watching this video, I'm fairly confident that a large part of the interpreting that I did could already be handled by this technology.


Luis von Ahn, one of the creators of reCAPTCHA, has a fascinating project going:

http://duolingo.com/

The idea is to teach people a language while at the same time providing a real-time translation service. Apparently if you multiplex novices (and not at a bad rate) you get expert-level translation at similar accuracy. The translators benefit by learning the language, and the service is self-supporting by providing translation.

He did an excellent TED talk on the subject:

http://www.ted.com/talks/luis_von_ahn_massive_scale_online_c...


Sounds like taking on interns or co-ops. Pretty nice idea.


While machine translation is not ready yet, there are some incremental innovations in the interpreter industry.

A local entrepreneur is having some success with a remote system for interpreters that tries to replace the interpreter console and related expensive interpretation equipment:

http://www.glotweb.com/

Not sure how he has handled the strict no-lag requirements, but they were doing trial runs in Washington and San Diego.


>Human translators are so expensive today that they are only used in situations where the translation has to be correct -- diplomacy, courtrooms, books, etc.

Professional translators and trained interpreters, yes.

But thousands of people work solely (or assume the role periodically) as interpreters and translators for many more situations, mainly revolving around business.

Now, signing a business deal will still involve a translator and a trained lawyer, but those other everyday cases, including showing a western partner around the Chinese offices, could switch to machine translation.


Or, you know... many of the middle-class Chinese who work in those offices already speak English. Honestly, when I've walked into a Chinese office I've never had the "no one speaks English" problem, while the big boss who doesn't speak English would prefer to have one of the younger guys/gals around anyway as a sign of status.


> That's a broadening experience and an intellectual delight.

I disagree. It's only broadening in the additional people it allows you to commune with. Other than that, it's a waste of time.

Having to convert between languages (I'm a native speaker of English who lives in Germany) all the time is huge overhead, sort of like if every country had its own system of measurement, except that the overhead is incurred much more often, not just for measuring things.


It's broadening in many, many ways. nathannecro in his comment below highlights an intellectual aspect. There are others. Knowing -- I chose that word specifically, as opposed to "speaking" -- a language means knowing a culture. This broadens one's mind, and that in turn leads to acceptance. Acceptance of religions, of gays, and of the Japanese seemingly chasing Americans around meeting rooms. Amongst others. A notable other is that you realize just how insanely silly, myopic and stupid nationalism is.

There are subtler benefits - I speak three languages natively, and the part I love most about that is also the part that frustrates me most. Afrikaans is a modern language and has a relatively small vocabulary. Insulting someone therefore consists of creatively stringing together colourful combinations of everyday words, mixing in the odd English, Malaysian or Zulu word, and then spoken with a religious fervour that cracks you up. It's side-splittingly funny. What frustrates me is that no other language can do that, so I can't share that with my girlfriend, who is French.

German has similar mannerisms about it. It's naturally dry in a way that English, for all its adjectives, can never hope to be.

To give you another perspective, I've always had a few deal-breakers I look for when meeting a girl I might otherwise be interested in. First, she needs to speak more than one language. Second, she needs to have spent a significant amount of time outside her own country. Third, she must be able to either ski or snowboard. Last, she needs to be able to play a musical instrument. Beyond that I couldn't care less about race, job, creed, colour or any other bigoted perspective.

To know another language is to know another world.


Ever read 1984? A lot of linguists argue that our language shapes the way we think and interact with the world, since at a young age we stop thinking visually and start thinking in spoken words.

In 1984, Newspeak was all about removing ways to describe things the Party didn't want the public thinking about, so that within several generations of the language nobody would be able to comprehend dissent. And it actually has some real-life foundations.

Just think of all the times a foreign language speaker says "I can't accurately represent this in whatever other tongue".

Note, I only know English, and after 6 years of Spanish throughout high school and college I have forgotten every bit of it. You really need to immerse yourself in a language, and constantly use it, to truly dedicate it to memory. Which is a ton of overhead that isn't really necessary, I agree.


I am aware of the Sapir-Whorf hypothesis.

German, for example, has no word for "silly", while English has no word for "Gemütlichkeit".

Eliminating the overhead is worth any of these sacrifices, though.


Your example misses the mark as "silly" can be explained in German and Gemütlichkeit can be explained in English.

Language structures our thoughts. It is claimed that Chinese speakers are better at math because of their language's structure. How you express things in a different language helps you see the concept in your own language more clearly. I learned more about English by learning German than I did in all my schooling. It improved my spelling as well, because it made me understand the words I was using better.


It's not about explanatory power, it's about patterns of thinking. Americans call each other "silly" all the time, but I'm guessing Germans don't. Even though it's possible, it takes more effort, and doesn't come to mind as readily.


They could call each other "lustig" or "lächerlich", which often covers what is meant by silly.


If you're interested in how language can shape the way you think, I highly recommend watching Lera Boroditsky's talk here:

http://fora.tv/2010/10/26/Lera_Boroditsky_How_Language_Shape...

It's VERY good. But skip the first 3 minutes of boring intro.


That it can be cumbersome for an expat doesn't mean it's not a broadening experience and an intellectual delight.

For broadening experience, see: http://en.wikipedia.org/wiki/Linguistic_relativity http://www.wired.com/wiredscience/2012/04/language-and-bias/

And for intellectual delight, see: http://www.amazon.com/Le-Ton-Beau-De-Marot/dp/0465086454 http://books.google.gr/books/about/Experiences_in_Translatio...


This is the second time Deep Neural Network research from the University of Toronto has made the front page, the first being when it won first-place in a Kaggle competition http://news.ycombinator.com/item?id=4733335


Geoffrey Hinton (who leads this research at the University of Toronto) is teaching a coursera course right now about machine learning with deep neural networks:

https://class.coursera.org/neuralnets-2012-001

He talks about a surprising number of cutting-edge achievements made by deep neural networks just over the last few months.


There was also a video presentation by Peter Norvig posted a few days ago explaining research at Google done in collaboration with Geoffrey Hinton from the University of Toronto on deep learning at Google: http://news.ycombinator.com/item?id=4733387


Here is a GREAT talk by Geoffrey Hinton (the Prof running said lab) http://www.youtube.com/watch?v=DleXA5ADG78&hd=1 where he explains the method.

Unfortunately, even though it was posted three times to HN http://www.hnsearch.com/search#request/all&q=sex+machine... it never made the front page.

Here is my summary and comment: "Great talk. I don't know much about artificial neural networks (ANNs) and even less about natural ones, but I have the feeling that I learnt a lot from this video.

If I understand correctly, Hinton uses so many artificial neurons relative to the amount of training data that you would usually see an overfitting effect. However, his ANNs randomly shut off a substantial part (~50%) of the neurons during each learning iteration. He calls this "dropout". Therefore, a single ANN represents many different models. Most models never get trained, but they exist in the ANN because they share their weights with the trained models. This learning method avoids over-specializing and therefore improves robustness with respect to new data, but it also allows for arbitrary combinations of different models, which tremendously enlarges the pool of testable models.

When using or testing these ANNs you also "drop out" neurons during every prediction. Practically, every rerun predicts a different result by using a different model. Afterwards, these results are averaged. The more results, the higher the chance that the classification is correct. (There's a toy code sketch of this at the end of this comment.)

Hinton argues that our brains work in a similar way. Among other things, this explains: a) Why do neurons fire in a random manner? It's an equivalent implementation of his "dropout", where only a part of the neurons is used at any given time. b) Why does spending more time on a decision improve the likelihood of success? Even though there might be more at work, his theory alone is able to explain the effect. The longer you think, the more models you test, simply by rerunning the prediction. The more such predictions, the higher the chance that the average prediction is correct.

To me, the latter also explains in an intuitive way why the "wisdom of the crowds" works well when predicting events that many people have a halfway sophisticated understanding of. Examples are betting on sporting events or movies' box-office success. As far as I know, no single expert beats the "wisdom of the crowd" in such cases.

What I would like to know is: how many random, model-based predictions do you need until the improvement rate becomes insignificant? In other words, would humans act much smarter if they could afford more time to think about decisions? Put another way, does the "wisdom of the crowd" effect stem from the larger number of combined neurons and the resulting diversity of available models, or from the larger number of predictions that are used to compute the average? How much less effective would the crowd be if fewer people made more (e.g. "top 5") predictions, or if the crowd was made up of a few cloned individuals?

If the limiting factor for humans is the time to predict based on many different models, and not the number of neurons we have, this would have interesting implications. Once a single computer had sufficient complexity to compete with the human brain, you could merely build more of these computers and average their opinions to arrive at better conclusions than any human could [1]. Computers wouldn't just be faster than humans, they would be much smarter, too.

[1] I'm talking about brain-like ANN implementations here. Obviously, we already use specialized software to predict complex events like weather better than any single human could. But these are not general-purpose machines."
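
To make the dropout idea in that summary concrete, here's roughly what it looks like in code. This is only a toy numpy sketch with made-up layer sizes and random, untrained weights - not Hinton's actual setup:

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy one-hidden-layer net; the weights are random stand-ins, not trained.
    W1 = rng.normal(size=(20, 50))    # 20 inputs -> 50 hidden units
    W2 = rng.normal(size=(50, 3))     # 50 hidden units -> 3 classes
    DROP = 0.5                        # fraction of hidden units shut off

    def forward(x, drop=True):
        h = np.maximum(x @ W1, 0.0)                # hidden activations (ReLU)
        if drop:
            h = h * (rng.random(h.shape) > DROP)   # randomly shut off ~50% of units
        else:
            h = h * (1.0 - DROP)                   # deterministic option: keep all, scale down
        return h @ W2                              # class scores

    x = rng.normal(size=20)

    # "Thinking longer": rerun the dropped-out net many times and average.
    # Each rerun samples a different thinned sub-model sharing the same weights.
    preds = np.stack([forward(x) for _ in range(100)])
    print(preds.mean(axis=0))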


We're too busy debating the merits of the new iDevice, or bashing people who debate the merits of the new iDevice, or wasting our upvotes on personality cults, upvoting every little thing said personality has written about.

Did I about cover all the problems with groupthink and hivemind?

It's a shame, though. I hang out in "new" a lot, and I urge other HN'ers to hang out there too.


Where might this new place you hang out be?


http://news.ycombinator.com/newest

Upvote only interesting things




My current client is specialising in speech recognition, speech synthesis and automatic translation. They have something similar, focused on enterprise customers. I find this subject very interesting.

I am a Ruby guy and I only marginally get in contact with their C++ code, but from what I learned so far this stuff is extremely memory and CPU hungry. It also depends on having been fed the right amounts of input. That's why Google Translate is so good. They have tons and tons of data from all the websites they parse, and in many cases the content can be obtained in different languages. Corporate pages are often translated paragraph by paragraph by humans which results in perfect raw data to train these algorithms. Also for example all documents that the European Parliament produces are translated into the languages of all member states.

Everything that has to do with translation has to do with context. I think the software right now is as smart as a six year old kid, except that it has a much bigger vocabulary. But if you say "The process has stalled. Let's kill it." it probably only makes sense if you know you are talking about computers.

It's hard to imagine that computers one day might really understand everything we say. But just by using Google Translate I think they really might. Это является удивительным. (I don't speak Russian. I hope I didn't insult anyone now. ;))


> Corporate pages are often translated paragraph by paragraph by humans which results in perfect raw data to train these algorithms.

Actually this may be one of the reasons why Google's Japanese translations are so terrible. The why isn't really relevant here[0] (perhaps you already know anyway), but there are times when that raw data becomes the most misleading.

[0] Obviously I still mean those actually translating by hand, not the companies which just throw all of their material into Google Translate and consider it a finely proofed document. There are plenty of the latter, which makes for an amusing feedback loop in the system.


Google Translate's English to Chinese is plain awful as well. At my last job, many of the employees only spoke Mandarin. If I had to communicate with them, I would use Google Translate and copy/paste both the Chinese and the English for CYA. There were two people who spoke good English and Mandarin, so they were able to correctly translate the meaning. From what they told me, the translations were about 60 to 70% accurate and often comical. I guess the translations never asked my coworkers on a date or cursed them out, but ultimately, beyond the fact that my correspondence only happened for important issues and thus alerted the recipients that there was a pressing matter, the translations didn't serve much purpose. It was better than zero communication. There is just a lot of work still to be done in the field.

I also disagree that having someone do the translations on a webpage makes for a good source. One only needs to find Asian companies with direct translations to see how bad the human translations can be. Just as those companies generally don't have good English writers, I think it is likely that American companies don't have outstanding Japanese/Chinese/etc. writers either.


I speak bad Japanese after having worked there for a while. But Japanese is extremely context dependent. They leave out a lot of words and you just deduce the meaning. (Don't hit me if that's wrong, but that is what I remember thinking back when I was learning it.)


The way you phrase that is a bit misleading. It's not that Japanese speech has any less info than most languages, it's just that you can set topics (add state to a stack, to put it in geek terms) that carry over into subsequent phrases. The correct translation of an individual sentence may thus depend on that previous context.

A simple but famous case is "Watashi wa hamburger desu". The sentence has no subject, so with no context that would get translated as "I am a hamburger", but if you fill in a previously defined subject, it could be "I [order] a hamburger", "My [favorite food] is a hamburger", etc.


To clarify further for those unfamiliar with the language, a super-literal translation of "watashi wa hanbaga desu" is something like:

    (concerning/as for) myself, (it's) hamburger.
On its own, if Bob says this, it basically comes out as "I (am) (a) hamburger".

However, if Sally has just said something like "I'll have a salad...what about you Bob?" then it makes sense as Bob's order is the implied subject and it becomes "My order is hamburger." or "I'll have hamburger."

I know very little about linguistics but I think there are a bunch of other things that make Japanese-English difficult to translate via software as well.

There is the whole aspect of culture embedded in it. あなた could mean "you" or something like "dear/sweetie" depending on the context. There's also the question of how to translate "you" (etc.) in English text to Japanese, as you have to consider politeness and so on. If you are just translating a business web page it's probably safe to stick with polite forms, but if you are translating, say, the dialogue in a TV show, you want to preserve the tone of the characters.

In terms of voice recognition, Japanese seems to have a lot of homophones to me when compared to English. It may just be my imagination, but here are some I ran into recently:

舶,錘, 頭, 摘む, 積む, 詰む, and 紡錘 are all pronounced つむ and mean completely different things. Or 六, 碌, and 録 are pronounced ろく. 上, 神, 紙, 髪, and 加味 are all pronounced かみ.

I seem to run into things like that regularly; when just hearing it spoken you need the context to figure out what they mean.


> 上, 神, 紙, 髪, and 加味 are all pronounced かみ.

This can be mostly solved by context. There are very few situations in normal speech where you'd hear "kami" and not know whether they're talking about 神 (god) or 髪 (hair). Also, it's not particularly hard to code that knowledge. E.g. try かみにいのる (pray to god) and かみのけをきった (cut hair) on Google Translate. It will suggest the correct kanji in both cases.
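
A first approximation of that knowledge really is only a few lines. A toy sketch (the context-word lists here are invented and nowhere near complete):

    # Toy disambiguator for spoken かみ: pick the kanji whose typical
    # context words appear nearby. Real systems learn this statistically.
    KAMI_CONTEXTS = {
        "神": ["いのる", "まつる", "さま"],    # pray, enshrine, -sama
        "髪": ["のけ", "きる", "きった"],      # の毛 (hair), cut / cut (past)
        "紙": ["おる", "やぶる", "かく"],      # fold, tear, write
    }

    def guess_kami(sentence: str) -> str:
        for kanji, cues in KAMI_CONTEXTS.items():
            if any(cue in sentence for cue in cues):
                return kanji
        return "かみ"  # give up and leave it in kana

    print(guess_kami("かみにいのる"))      # -> 神
    print(guess_kami("かみのけをきった"))  # -> 髪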

Anyway, I'm not a native Japanese speaker, but I find the whole homophone thing a bit overrated. As far as I can recall, the only pairs of homophones that cause trouble in normal speech are 科学/化学 (both pronounced kagaku, meaning science/chemistry) and 私立/市立 (shiritsu, private/municipal).


Thanks for the reply.

> This can be mostly solved by context.

Right, as I said. It's not too bad, but it's easier when you can just translate word for word.

かみにいのる gives me "pray to bite" on Google translate; as you say, it suggests the right kanji...but that's precisely my point. It needs you to disambiguate for it to be sure.

I'm not saying this is an insurmountable problem, I'm contrasting the difficulty.

> There are very few situations in normal speech where you'd hear "kami" and not know if they're talking about 神 (god) or 髪 (hair).

I ran into it recently in music. Babymetal has a song that starts:

伝説の黒髪を華麗に乱し

When you listen to the song, it'd be easy to momentarily think she might be saying "black god" or "black paper" since while the pronunciation wouldn't be identical, it's pretty close. Since I'm human, I figured out pretty quickly what she is saying...but in the equivalent English phrase there's no issue there...it's "black hair" or "black paper".

This is admittedly not "normal speech", but I could see it popping up there too.

I've seen confusion over 神/髪 in other situations too, though those were deliberate puns so probably don't count, but they demonstrate it's possible to have situations where it's at least somewhat ambiguous.

> I find the whole homophone thing a bit overrated

I'm sure it's exaggerated to me because my Japanese is pretty atrocious, but I think my point is valid: any time you have homophones in a language it makes things more difficult to set up a system that listens to speech and translates. Japanese seems to have more homophones than English, and if that's true it is proportionally more difficult to translate in that regard.


Also from what I understand, certain homophones are differentiated in practice by differing accenting (raising/lowering) in speech. This is however region specific.



> I'm not saying this is an insurmountable problem, I'm contrasting the difficulty.

Fair enough. I'm not claiming there's no homophone ambiguity either, just that it's a relatively easier problem compared to, say, the stuff Microsoft is doing.

Yeah, when I say "normal speech" I don't include pop music lyrics.


As someone already pointed out, you should have written the translation as "I - hamburger", not "I hamburger." This implies a pause in speech. I am a native Russian speaker, and Russian also has this case where the meaning of a sentence has to be deduced from the context of the previous sentence. But in written Russian you could look at that sentence alone and understand that someone is likely replying to something. I don't practice my Russian daily, as my day-to-day communication is only in English. I tend to forget words when I communicate with my Russian friends via email, so I use Google Translate a lot to translate from English to Russian. I actually find that Google is pretty good at translating the formal sentence structure you'd see in literature and absolutely abysmal at everything else.


Japanese has noticeably fewer phonemes than most languages (IIRC something like 21 compared to a "normal" 24-28) so it makes sense that there are more homophones. One interesting effect is that puns and innuendo are easier in Japanese. Of course it's easy enough to disambiguate in normal conversation.


This is not a unique feature of Japanese. You can do exactly the same in Russian using a dash (which translates to a short pause in speech): "я — гамбургер" (I — hamburger). In general "X — Y" means "X is Y" or another relationship between X and Y as indicated by preceding context.


Yep I have tried the Google English-Japanese translation before and it was just plain terrible.


That's why Google Translate is so good. They have tons and tons of data from all the websites they parse, and in many cases the content can be obtained in different languages. … Also for example all documents that the European Parliament produces are translated into the languages of all member states.

This can backfire. I remember hearing that, back in the day, "Baile Átha Cliath" (the Irish for "Dublin", the capital city of Ireland) would sometimes get translated as "London", the capital of the UK. This was due to Google Translate trying to match up laws in Ireland (in the Irish language) with UK laws (which would be very similar or potentially based on the same original law). However, where the Irish law said "Baile Átha Cliath", the corresponding UK law would have "London".

Here's an example of it: http://translate.google.com/#ga/en/L%C3%A1%20alainn%20inniu%...


You did not insult me, but you sound like a robot because of preserved English sentence structure.


Funnily, translated to Polish it has perfectly natural structure. Это удивительным would be correct for Russian, yes?


Это удивительно.


oh, right, forgot to fix the suffix.


> and in many cases the content can be obtained in different languages

I wonder if there's some feedback loop caused by websites that used google translate itself to offer the alternative versions :)


If I recall correctly, that is one of the reasons Google made the translation API a paid service. Regardless, in those cases, I suspect Google would be able to flag their own translations and avoid using them. It would be much more interesting if there were a couple of companies doing translations on many websites without coordination, or if one of them tried to sabotage the others with subtly bad translations.


Speech recognition would probably be best fed in the same way - find neutral-sounding speeches for which transcripts exist.

Best would be parliamentary speeches with transcripts, and the closed captioning for national news programs. The main constraint is storage space/computational power.


Translation is as much of an art as it is a science, so I wonder where this project is headed. Le Ton beau de Marot is a great book for illustrating this point.

In college I had studied Japanese and a friend introduced me to the anime cartoon Initial D. His copy had the original Japanese with English subtitles, and so I could assess the translation to some degree -- it was very good. On Netflix you can watch Initial D, but after 2 minutes I had to turn it off because the English dubbing really failed to capture the characters.

As someone noted in this thread, the presenter's synthesized voice in the linked video doesn't seem to reflect his own. If he could have said something like "Wo hui shuo putonghua" and had the machine output say the same, it might have been more convincing.


I was just pondering today why PCs have adopted spell checking as a standard feature but don't appear to use context techniques for word checking or grammar checking yet. Perhaps I'm just using the wrong apps?

The speaker says "to take in much more data" but it gets parsed by the speech-to-text as "to take it much more data" which is such an unlikely phrase I can't really work out why it's not auto-corrected.

The phrase provided doesn't appear to be in either Google's or Bing's web index. Typing "to take i" into either Google's or Bing's search box produces "to take in" as the most likely match, within milliseconds.

Similarly (and ironically) with "about one error out of" being parsed as "about one air out of".

That he goes on to say they use statistical techniques and phrase analysis for the translation makes this sort of error all the more intriguing: why isn't that same statistical approach weeding out these sorts of errors?
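
Even a crude language-model rescoring of the recognizer's candidate phrases ought to catch it. A toy sketch (the bigram counts are invented for illustration; a real system would use a large n-gram table built from web or speech-corpus text):

    import math

    # Invented counts of adjacent word pairs, standing in for a real n-gram table.
    bigram_counts = {
        ("to", "take"): 700_000, ("take", "in"): 900_000, ("take", "it"): 600_000,
        ("in", "much"): 120_000, ("it", "much"): 2_000,
        ("much", "more"): 800_000, ("more", "data"): 300_000,
    }

    def score(sentence):
        words = sentence.lower().split()
        # Sum log-counts of adjacent word pairs; unseen pairs get a tiny floor count.
        return sum(math.log(bigram_counts.get(p, 1)) for p in zip(words, words[1:]))

    candidates = ["to take in much more data", "to take it much more data"]
    print(max(candidates, key=score))   # -> "to take in much more data"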

Nonetheless an impressive demonstration.


Green squiggly excluded, I can think of two reasons off the top of my head why even the most advanced general purpose grammar checker would be a bit of a controversial feature:

- Because grammar is typically more expressive, and dependent upon concepts that may not otherwise exist in words. Thus statistical grammar models and context checkers would be much more prone to generating nonsense from user input (along the lines of the Sokal hoax) or restricting output to a range of acceptable models (giving the machine its own voice, in a sense). That leads to the second thing...

- It kills freedom and creativity (or at least, how we receive it). Imagine comedy routines in stoic deadpan. Perfunctory exchanges in formal constructions (and vice versa). Obviously you can avoid all of these situations if you wanted to, but in that case it should probably be saved for those special occasions. It could probably help a lot of businessmen wanting to write their statements and messages in shorthand without spewing boilerplate text. But it's potentially damaging to every child or student who is still finding out how they want to express themselves in the given context.

Note: I think it is fair to assume that grammar checking would include the ability to reformulate or generate text that obeys the relevant models. Spell checkers suggest spellings, grammar checkers have to suggest fixes and changes as well, and if we want to get any further than Win98 era Word it will probably have to have a plain old fix-it generator as well.


FWIW: There's a Chilean comedian whose weapon of choice is to deliver mostly black humor in stoic deadpan. The result is hilarious. Probably because you know he's a comedian.


Fun thing to do: you can turn on transcribe audio on Youtube and directly compare how Google's speech recognition tech stacks up against Microsoft's.


I bet the audio directly from his mic is of enormously better quality than whatever YouTube has recorded. Plus, Google can hardly afford to dedicate gigantic amounts of CPU to the transcription - they'll be going for a crude but useful job, whereas for this demo he probably has a whole lot of CPU grunt dedicated just to it.


I really don't think YouTube transcribes audio for every single video. You can see it's not available in many videos. I'd guess they run some test on the audio to see if it's worth transcribing, and only then run a background task to do the job... doesn't really matter how fast.

You are right about the source quality though.


See below for some translation links, but Google Translate is pretty bad compared to Bing Translator for Chinese.


The implications of this kind of technology reaching consumers in the next decade or so are really interesting.

If we can get to the point of having handheld devices that can accomplish live translation of spoken word, what exactly is the point of different languages anymore?


I don't follow your logic... Without those different languages, you wouldn't have anything to translate in the first place. If anything, this type of technology promotes independent and different languages, as it makes it so much easier to communicate with others regardless of your native tongue.

Also, bravo to Microsoft; I'll remove my jaw from the floor after I watch your video a second time.


Different languages emerged because of physical separation. As barriers have been reduced by technology (both physical and communication barriers), there really aren't "borders" anymore. I can call someone in China right now if I wanted to, something impossible even just 100 years ago.

So if we can all talk to each other across the world in real time, and we can all understand each other because of this technology, what exactly is the point of different languages anymore?


As a polyglot (English, several dialects of Chinese, Spanish, German, Russian), languages are still one of the most important tenets of culture and individualism.

Trying to understand or think like an American is a completely different experience from trying to think like a Chinese which, respectively, is completely different from thinking like a German. These modes of thinking not only make each culture/country/peoples unique, it actually facilitates various strengths.

Primarily, the Chinese language actually allows humans to remember and store more information in short-term memory when compared to Western/Romance languages. Most humans can store 7 (+- 2) bits of information (a bit being defined as one contextual idea: 25 + 23 are 3 bits; the Mona Lisa is 1 bit).

The Chinese language actually facilitates math/short-term memory because many things are spoken/read/written as one contextual idea. When remembering a large number (602-112-5097, for example), English speakers tend to remember it as the area code (1 contextual bit) plus each of the remaining digits (7 contextual bits). This happens because the English language separates unrelated digits into individual ideas.

In Chinese however, masses of digits are written and spoken as one long contextual idea. Similarly, in memory, these long numbers are more easily stored as one contextual bit and take up "less" space.

Conversely, English (and to some degree the Romance languages) happens to have lots of descriptors (what we call adjectives and adverbs). This is one of the reasons why English speakers tend to be more creative in how they express themselves.

Does this mean that we'll eventually tend towards a universal language that implements all the good points of current languages? Perhaps.

Personally, I enjoy the uniqueness of each language by itself.



> "So if we can all talk to each other across the world in real time, and we can all understand each other because of this technology, what exactly is the point of different languages anymore?"

A bunch of things. First and foremost: cultural representation. There are many things that are simple words in Chinese that have no English equivalent. The values and traditions of a culture are subtly communicated via its language, and even if we can instantaneously translate the literal meaning of what is being spoken, subtext will be lost.

This is why at a high level (beyond "where is the restroom") translation is a highly involved field.

Secondly is the usability of this system. Even if it is 100% accurate and immediate you'll just end up with the UN problem: communication becomes asynchronous because you need to wait for the translator. You'd say one thing, the other person would listen to your translator. He'll say something back, and you get to listen to his translator. It's a hell of a lot better than nothing, but there is still a tremendous advantage to being able to converse in real-time.


> what exactly is the point of different languages anymore?

True. This is why I will be teaching my children Basque and Basque only.

New languages might have originally emerged due to separation, but that doesn't mean lack of separation will cause already existing languages to go away.


Not Basque and Spanish? Or Basque an English? Or Basque and some Chinese language?


Nope! If we don't need different languages anymore, Basque only.


Following a little of what nathannecro said, I think that the real consequence of this technology is that it makes language more useful, more important as a conceptual tool than ever before. Studying language within an environmental model is an amazing tool for understanding how languages were formed, but if the previous barriers to communication are shattered, then it seems to me language itself won't be at such a loss; rather, the linguists of the future will have to expand their models to see what happened once everyone got sms/twitter/skype with PerfectTranslator or whatever it becomes. Language allows you to do more than just communicate with others; it allows you to conceptualize. Cue obvious comparisons to programming, where the product is often bound to the process of its creation. We identify the paradigms and then work them to our advantage. If we continue to promote language learning and develop newer, better systems for facilitating that task, I think it would be greatly beneficial, basically as long as we humans still exist and aren't delegating everything beyond our beating hearts to a digital circuit.



I have no doubt a few more long years will have to pass until those solutions reach the mass market, but this is extraordinary, especially for someone like me who is passionate about travelling. Our generation witnessed the shift towards cheaper flights, easier accommodation booking, and web/mobile tools that grow more helpful year by year in organizing our visits and seeking information about places and cultures we don't know. We, or the next generation, will probably witness the fall of the language barrier. It's truly amazing and one of the most important shifts in our global experiences.


Skip to 8 minutes to hear the actual translation.

I'd love some comparison - that doesn't sound like the same voice to me (awfully close to the 'standard' computer voice, IMO), but some of it is crummy recording quality, and showing the flexibility would go a long way toward convincing me.


I agree that it doesn't really sound like him, but the voice is far better than most Chinese computer voices that I've heard and is totally understandable.

Seems like my years of learning Chinese and living in China are about to become useless...


That was my opinion as well.

Actually, I stopped learning Chinese and living in China because I discovered the following: they were learning English faster than I could learn Chinese, and I only needed to know enough to let them know I wasn't culturally insensitive.

Not sure where the quote was but it went along the lines of "Don't try to talk in their language, because you will make a hash of it and they will have the advantage."


Thanks! The quality of the speaking is definitely not something I can judge, not knowing Chinese.

Are Chinese computer voices just flat-out worse than English ones, or is there something specific?


Now imagine this on Skype as a premium feature.


Near real-time speech-to-speech translation is awesome[1], but the voice sounded more like how I would picture ASIMO speaking (i.e. 1980s speech synthesis) than 'his' voice.

1. Is anyone here fluent in Mandarin who can assess the quality of the output?


Mandarin speaker here. I'm more impressed that it was able to reconstruct the sentences properly (where traditional translation tools typically fail). The output was fine. Obviously doesn't sound like a human speaker, but the tones are correct.


Yeah, it sounds a bit mechanical. With that said, based on the priming he's given with the talk on waveforms, I'm guessing they are simply breaking the English speech into waveforms with corresponding frequencies and mapping those over to the Chinese counterparts.

It would be even cooler if they created a distribution of possible sound frequencies for each syllable in both English and Chinese, determined where in the distribution his speech pattern lies, and transferred that "ranking" across the distributions. Then you'd get a subjective transformation instead of an objective one. :)
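
That "ranking transfer" is basically quantile mapping between two distributions. A rough numpy sketch with made-up pitch distributions (in reality you'd estimate them from recorded English and Chinese speech):

    import numpy as np

    rng = np.random.default_rng(1)

    # Made-up pitch distributions (Hz) for English and Chinese syllables.
    english_pitches = np.sort(rng.normal(120, 20, size=10_000))
    chinese_pitches = np.sort(rng.normal(160, 30, size=10_000))

    def transfer(pitch_hz):
        # Where does this pitch rank within the English distribution?
        rank = np.searchsorted(english_pitches, pitch_hz) / len(english_pitches)
        # Map the same rank (quantile) onto the Chinese distribution.
        return np.quantile(chinese_pitches, rank)

    print(transfer(110.0))   # a slightly-low English pitch maps to a slightly-low Chinese pitch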


Mark my words; some good changes are happening at MSFT. There are some indicators that suggest this may be a comeback. Surface seems to be gaining momentum, while the future of Windows will be freemium plus ads (as you can imagine with hundreds of millions of "screens" plugged in).



It's pretty incredible how far language-to-language technologies have come and how far they still have to go.

Very cool stuff.


Putting on my tinfoil hat here: if all it takes to build a speech model that impersonates someone's voice is an hour's worth of them talking... what happens when the wrong person gets that? For example, a government or a corporation (an internet phone service, maybe) could use it to fabricate evidence of conversations that never really happened; it could also be used to aid in identity theft.


To indulge you just a little bit, I think it would most likely result in a rapid expansion of forensic industries. While I have no legitimate experience in signal processing, I imagine there would be ways to deduce, to some degree, whether or not such impersonations were credible. Whether or not that would stop tech-savvy marketers and con artists from scamming grandma, I don't know. We'll have to wait and see what the 21st century holds for future firewalls. Of course, if someone with any knowledge on the subject would like to step in and point out how stupid my response sounds to them, I'd be glad to become more informed!


Did you watch the video where the computer was talking?

There is no possible way to confuse that robot voice with the speaker. Technology like you suggest is a long way away from the consumer market.



My Android phone already does voice-to-text better than the system demoed in that video. Looks like Microsoft's research needs a bit more tuning before it can be declared 'amazing'.


Damnit. This is pretty much the very idea I had in college around 12 years ago. At the time, there was nowhere near the required technology to pull this off. Over the last few months, I'd begun rethinking through the idea again, feeling the time was right to pull this off as a killer idea. Even began trying to investigate how to pitch this to create a startup focused solely on this problem.

Now it seems the time may be too late. Rats.



Naive in what regard exactly?

EDIT: Oh, I see. You think I think I had this idea first and out of nowhere. Wrong.


In the sense that this is an age-old dream of mankind (see Tower of Babel), not an idea someone grabbed away from you before you could create a startup.


You erroneously conclude I think I am the first person to think of this. I do not. I know my sci-fi and biblical and human history (20 years of studying intellectual and cultural history, thank you). I've read about the tower of babel and grew up on wanting universal translators.

My comment was a bit of bittersweet admiration at a very specific success in implementation. Can we just assume that the average person on HN knows wtf a universal translator is, and not every time someone says, "I had this very idea" that they are making accusations that company/person X is stealing the idea away? Christ. Not a single word was spoken that this idea was "stolen" or "grabbed away". Nor were words employed signifying that "I thought of this first and completely independently".

I was saying it looks like it's too late to be the person who gets to be first to demonstrate it. And by "very idea" I meant, quite specifically, how cool it would be to be able to translate my voice into another language in near-real-time. And by "Rats", I meant "Oh snap, looks like others have been working on doing that same thing and impressively pulled the damn thing off."

If you spent a moment looking at my comment history, you'd see I am usually quite clear in stating what I mean directly. Had I meant to imply an idea was "grabbed away from [me] before I could create a startup", I'd have used those exact words. Instead, I said that I'd had this very idea years ago and only recently thought about how the tech might be there today to make it happen, and maybe make for an interesting startup. I didn't even try to imply it would be my startup.

Nonetheless, point taken. I'll ensure I more specifically couch any future statements about having ideas with the qualifier that I am, in fact, not implying that I had it first or independently in the whole of human history.


Sorry to offend, I was just clarifying why the original guy said you sounded naive - just clarifying what specific implementation you had in mind would have helped in your original comment. I can see how it sounded as if you meant the idea "universal translator", can't you?


Honestly, it'd never even cross my mind that anyone on HN would possibly comment that s/he had come up with the idea of a universal translator 12 years ago.

Anyway, my apologies either way. Your comment read a tad snarky with the "grabbed away before you could create a startup" bit. I was left thinking, "Oh great. Relegated to the company of some kid who thinks his startup idea was stolen." My reply could have been tempered a bit more.


Whoah, this discussion turned hostile pretty quickly.


This is really impressive, especially the speech recognition part. I can't really judge anything else, since I don't speak a word of mandarin. The speech recognition though is easily the best I've ever seen. This is almost the kind of recognition rate needed for voice controlled interfaces to finally work. Exciting stuff.


It looks like a translation we can hear occurs around 8:10. Is anyone able to verify the correctness of the speech? (Also, remembering it's a demo and it has probably been tested multiple times for that phrase).


It's amazing that they can do this, yet there is still no high-quality solution for changing a male voice into a believable female voice. (For example, making a 40 year old man sound like a 16 year old girl)


Pretty cool stuff! Jump here to listen to the demo: http://www.youtube.com/watch?v=Nu-nlQqFCKg&t=7m55s


I'd like to see them demo this using a few different voices. The voice still sounded very computerized, but maybe I'm just not used to hearing this speaker's voice.


I think the Linux hacker, MetalX1000 on youtube did this a while back using various [ speech to text -> google translation api -> text to speech ] tools.
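
For the curious, the glue is roughly this shape. A skeleton sketch only: the three helpers are placeholders for whatever speech-to-text, translation, and text-to-speech tools or services you actually wire up (none of the names below refer to real APIs):

    def speech_to_text(wav_path: str) -> str:
        raise NotImplementedError("plug in a recognizer, e.g. a local tool or hosted STT")

    def translate(text: str, source: str, target: str) -> str:
        raise NotImplementedError("plug in a translation service")

    def text_to_speech(text: str, lang: str, out_path: str) -> None:
        raise NotImplementedError("plug in a synthesizer, e.g. espeak or a hosted voice")

    def translate_speech(wav_in: str, wav_out: str, source="en", target="zh-CN") -> str:
        text = speech_to_text(wav_in)                  # 1. recognize the spoken English
        translated = translate(text, source, target)   # 2. translate the transcript
        text_to_speech(translated, target, wav_out)    # 3. synthesize the Mandarin audio
        return translated

Of course, the hard parts - recognition accuracy, translation quality, and making the synthesized voice sound like the speaker - all live inside those three placeholders, which is exactly what the Microsoft demo is about.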


I seem to remember on Scientific Frontiers with Alan Alda, back in the mid-late 1990s, him demonstrating something similar but not as performant. Neat.


As good as this is, it doesn't seem too much of a quantum leap from where Google Translate is with conversation mode, installable on every Android phone.

I didn't hear the inflections in his voice superimposed on the Chinese voice, so it is just modeling his voice characteristics and reflecting them in the output voice. From what I understand, the voice modeling happened offline, not in real time, so it is not nearly as sexy as if it were happening as he spoke. In other words, a nice touch, but I don't see it as revolutionary (unless I'm missing something).


I would gladly have an hour long conversation with my computer once to be able to speak in another language.


Very impressed! The translated Chinese is far better than those translated by any other translation tools I've seen on earth in my entire life!


Every day, in every way, we get better and better.


You can jump to 6 minutes and read the text to save time. The Chinese voice starts at about 7:30.

It is pretty neat.


Couldn't this be easily hacked together using google translate and a text to voice program?


I had the same idea, but it looks like Google translate needs a TON of work. For example:

English: "I hope you enjoy the rest of the presentations today" http://translate.google.com/#en/zh-TW/I%20hope%20you%20enjoy... Translated into Chinese: "I hope you enjoy rest (as in take a break)/resting introduction/presentation"

Bing Translator is much better with the same English input (http://www.bing.com/translator); the Chinese that comes out is grammatically correct. The reverse translation comes out correct as well.

But yes, seems that if the translator is good enough, it would be simple to reproduce the process from the video.


Google has already done something similar. It's called Google Translate Conversation Mode.

http://www.youtube.com/watch?v=oyRQnflIv6Y


Which makes me wonder: is there a text to voice application based on convolution?


Of course, they cannot do Mandarin to English yet - that's 2.0 :-)


It's really unfortunate though, because these languages are some of the hardest for someone like me to navigate. For instance, I can look at "poulet et riz" and figure out it's roughly something chicken. But "雞肉和米飯" means nothing to me.


"poulet et riz" meant nothing to me until I read your post in chinese... [touche! =P]

雞肉和米飯 is a really odd way of saying chicken and rice... if you google the phrase, the search results don't come up as good as the way native speakers would expect:

google 雞肉和米飯: https://www.google.com/search?q=%E9%9B%9E%E8%82%89%E5%92%8C%...

google 雞肉飯 (the more normal way in chinese "chicken meat, rice"): https://www.google.com/search?q=%E9%9B%9E%E8%82%89%E9%A3%AF&...

google "poulet et riz" https://www.google.com/search?hl=en&tbo=d&rlz=1C1CHF...

Hmm... they are all chicken, but still pretty different from each other, especially if talking to a foodie!

Maybe give M$ another 15 years, but for now, the best way to learn a language is to do it through the social method.


But does "kurczak z kaszą" mean anything?


"kaszą" looks like "kasha", but in the context of a whole arbitrary sentence (rather than in the context of talking about guessing at "chicken and rice" in other languages) I'm not sure I would have thought of that.


Triggers a hunger response somewhere in my brain. "雞肉和米飯" definitely does not.


Looking forward to the Android app. Five years maybe?


The big question is when we will see this in a real product. Many interesting innovations have come out of Microsoft Research. It would be nice if they reached end users as well.


Got a Kinect for your XBox yet?


True, but the Kinect is poorly implemented. I find the Kinect incredibly frustrating and usually have it unplugged. When you're watching a show that has a lot of dialogue it will interpret the talking from the show as commands. These commands include things like "Fast forward 40x" or "Rewind 20x".


Interesting -- I've never had that happen. You may want to recalibrate the microphones on your Kinect; it does some crazy analysis to filter audio from the speakers out from the input stream.


I can see this going into Windows 8 phones... I'd buy one.



