A Blog by Jonathan Low


May 6, 2013

Speech Recognition Advances Help Google Best Apple's Siri In Mobile Search

We tend to take technological advances for granted.

We appreciate what they can do for us. We are sometimes awed by the virtuosity displayed in helping us realize yearnings we were not quite capable of articulating. And we do love the coolness factor.

What we often fail to appreciate, however, is the science behind many of these marvels. Voice-activated mobile search is a typical example.

Layers of research, based on science only remotely connected in the public's mind to its practical applications, are often necessary to deliver the products consumers use with limited appreciation. But as the following article explains, applying those disparate collections of intelligence can provide crucial advantages that add up to mega-success...or irrelevance. The ability to identify, coordinate, manage and then deliver the results of that process is a skill too often unappreciated and undervalued by investors and corporate executives who mostly want it done fast and cheap. JL

Robert Hof reports in Forbes:

The success of Google’s mobile search stems at least as much from a big improvement over the past year in Google’s speech recognition efforts. That’s the result of research by legendary Google Fellow Jeff Dean and others in applying a fast-emerging branch of artificial intelligence called deep learning to recognizing speech in all its ambiguity and in noisy environments.
For all the attention lavished on Siri, the often-clever voice-driven virtual assistant on Apple’s iPhone, Google’s mobile search app lately has impressed a lot more people. That’s partly thanks to Google Now, its own virtual assistant that’s part of that app, which some observers think is more useful than Siri.
Replacing part of Google’s speech recognition system last July with one based on deep learning cut error rates by 25% in one fell swoop. As I wrote in a recent article on deep learning neural networks, the technology tries to emulate the way layers of neurons in the human neocortex recognize patterns and ultimately engage in what we call thinking. Improvements in mathematical formulas coupled with the rise of powerful networks of computers are helping machines get noticeably closer to humans in their ability to recognize speech and images.
Making the most of Google’s vast network of computers has been Dean’s specialty since he joined Google an almost inconceivable 14 years ago, when the company employed only 20 people. He helped create a programming tool called MapReduce that allowed software developers to process massive amounts of data across many computers, as well as BigTable, a distributed storage system that can handle millions of gigabytes of data (known in technical terms as “bazillions”). Although conceptual breakthroughs in neural networks have a huge role in deep learning’s success, sheer computer power is what has made deep learning practical in a Big Data world.
Dean’s extreme geekitude showed in a recent interview, when he gamely tried to help me understand how deep learning works, in much more detail than most of you will ever want to know. Nonetheless, I’ll warn you that some of this edited interview still gets pretty deep, as it were. Even more than the work of Ray Kurzweil, who joined Google recently to improve the ability of computers to understand natural language, Dean’s work is focused on more basic advances in how to use smart computer and network design to make AI more effective, not on the application to advertising.
Still, Google voice search seems certain to change the way most people find things, including products. So it won’t hurt for marketers and users alike to understand a bit more about how this technology will transform marketing, which after all boils down to how to connect people with products and services they’re looking for. Here’s a deeply edited version of our conversation:
Q: What is deep learning?
A: It’s a form of machine learning where the system automatically learns which features are important in deciding, for example, if an image contains a cat or not. Before, a human might sit down and write code to generate a feature that they think is important, like if I’m trying to detect cats in images, I might write a whisker detector. So I write some code that I think will characterize whether there are whiskers in this picture. Then I might write an ear detector or things like that. These things take a lot of time to develop.
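To make that contrast concrete, here is a minimal sketch in Python with NumPy (illustrative only, not Google’s code): the hand-engineered route fixes a filter a human designed, while the learned route starts the same-shaped filter at random values and lets gradient descent discover the weights from labeled examples. The data and the hidden labeling rule are made up for the example.

```python
import numpy as np

# Hand-engineered feature: a fixed filter a human designed on the
# hunch that horizontal strokes signal whiskers in a 3x3 patch.
def handmade_whisker_feature(patch):
    fixed_filter = np.array([[-1.0, -1.0, -1.0],
                             [ 2.0,  2.0,  2.0],
                             [-1.0, -1.0, -1.0]])
    return float(np.sum(patch * fixed_filter))

# Learned feature: the same-shaped filter, but its weights start
# random and are adjusted by gradient descent from labeled patches.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(3, 3))
b = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: the (hidden) labeling rule happens to be "middle row sums
# positive", i.e. exactly what the hand-made filter was guessing at.
patches = rng.normal(size=(200, 3, 3))
labels = (patches[:, 1, :].sum(axis=1) > 0).astype(float)

for _ in range(200):                      # plain gradient descent
    for patch, y in zip(patches, labels):
        p = sigmoid(float(np.sum(patch * w)) + b)
        grad = p - y                      # d(cross-entropy)/d(logit)
        w -= 0.01 * grad * patch          # the filter is *learned*
        b -= 0.01 * grad
```

The payoff is the last few lines: no one wrote a whisker detector, yet the trained weights end up playing the same role the hand-coded filter did.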
Q: What’s “deep” about deep learning?
A: “Deep” typically refers to the fact that you have many layers of neurons in neural networks. It’s been very hard to train networks with many layers. In the last five years, people have come up with techniques that allow training of networks with more layers than, say, three. So in a sense it’s trying to model how human neurons respond to stimuli.
We’re trying to model not at the detailed molecular level, but abstractly we understand there are these lower-level neurons that construct very primitive features, and as you go higher up in the network, it’s learning more and more complicated features.
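As a hedged illustration of that layering, here is a tiny NumPy forward pass; the sizes are arbitrary and the weights are random placeholders that training would set. Each layer can only combine what the layer below it produced, which is what lets higher layers express more complicated features.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(0.0, x)

# Three stacked layers: raw pixels -> primitive features ->
# combinations of those features -> class scores. The weights are
# random placeholders here; training would set them.
x = rng.normal(size=784)                      # a flattened 28x28 image
W1 = rng.normal(scale=0.05, size=(256, 784))  # low-level features
W2 = rng.normal(scale=0.05, size=(64, 256))   # mid-level combinations
W3 = rng.normal(scale=0.05, size=(10, 64))    # class scores

h1 = relu(W1 @ x)    # e.g. edge-like responses
h2 = relu(W2 @ h1)   # e.g. corners and textures built from edges
scores = W3 @ h2     # "deep" just means several such layers stacked
```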
Q: What has happened in the last five years to make deep learning a more widely used technique?
A: In the last few years, people have figured out how to do layer-by-layer pre-training [of the neural network]. So you can train much deeper networks than was possible before. The second thing is the use of unsupervised training, so you can actually feed it any image you have, even if you don’t know what’s in it. That really expands the set of data you can consider because now, it’s any image you get your hands on, not just one where you have a true label of what that image is [such as an image you know is a cheetah].
The third thing is just more computational power. These techniques work best when you can give them lots of data and you can train a big enough model that it has enough representational capacity to learn a really broad set of features from all the data you’re giving it. If you pass a lot of data through a teeny network, like 20 neurons, it’ll do what it can, but it’s not going to be very good. You really need tens or hundreds of thousands of neurons in your model for a general image classification system before it can capture all the subtleties in all the different kinds of images.
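A common recipe for the layer-by-layer, unsupervised pre-training Dean mentions (a generic sketch of the idea, not necessarily the exact variant Google used) is to train each layer as an autoencoder on unlabeled data, then feed its codes to the next layer. The gradient below is deliberately simplified and all sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_layer(data, n_hidden, lr=0.01, epochs=5):
    """Train one layer as a tied-weight autoencoder: learn W so that
    the code sigmoid(W @ x) can be decoded back to approximate x.
    No labels are needed, which is the "unsupervised" part."""
    W = rng.normal(scale=0.1, size=(n_hidden, data.shape[1]))
    for _ in range(epochs):
        for x in data:
            h = sigmoid(W @ x)        # encode
            x_hat = W.T @ h           # decode with tied weights
            err = x_hat - x           # reconstruction error
            # Simplified gradient step (ignores the path through the
            # encoder nonlinearity, which a full derivation would keep):
            W -= lr * np.outer(h, err)
    return W

unlabeled = rng.normal(size=(500, 32))   # any images you can get
W1 = pretrain_layer(unlabeled, 16)       # layer 1 on raw inputs
codes1 = sigmoid(unlabeled @ W1.T)       # layer-1 features
W2 = pretrain_layer(codes1, 8)           # layer 2 on layer-1 codes
# After stacking, the whole network is typically fine-tuned on labels.
```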
Q: How does this approach fit into the pantheon of artificial intelligence techniques? Is it an alternative to other methods or an addition?
A: The key thing that it does that a lot of other methods don’t is that it automatically builds higher-level features from very raw inputs. You don’t necessarily have to figure out as a human what features are going to be most important. It’s hard as a human, for example, to figure out what components would make a “psss” sound. And it’s very hard to do that in all kinds of different noise conditions. Like if I’m in a subway stop and I go “psss,” there’s all kinds of background noise going on, depending on whether I’m holding the phone here or here.
With enough training data, you get enough people making that noise in a wide variety of conditions that the system will learn good detectors for all those different kinds of conditions it’s been exposed to. If you have someone with a Southern accent or a non-English speaker, and if you have enough information in your training data, the system can learn to pick up on that without any programmers having to be involved in writing an accent changer or writing code to deal with that.
Q: How much of an improvement can deep learning provide ultimately? For instance, your deep learning network doubled the rate at which images could be identified in Google’s experiment last year, in which the system learned to recognize cat images, but the overall identification rate was pretty low. What’s the significance of that?
A: One thing is that that’s a very large number of categories. If you have a much smaller number of categories, the accuracy goes up a lot. So if you go to a thousand categories, you get to above 50% accuracy. I can give you a sense of how difficult this is. This [image database that Google used in its experiment] has 21,000 categories in it, and, for example, you have to distinguish a thorny skate [fish] from a little skate from a gray skate from a devil ray. That’s a pretty hard task. So the accuracy is low, but I don’t think human accuracy on that test would be anywhere near 100%.
Q: What advances has Google in particular made in this area?
A: We’ve really pushed on being able to scale these systems to train much larger models (more neurons, more connections) and also to train on much larger data sets. The combination of those two things is important. If you train on much more data but you have a small model, that extra data is not all that useful, because the model doesn’t have enough neurons to take advantage of it.
We are willing to deal with models where different pieces of the model reside on different machines. We essentially can partition these models so this might be on one machine, this might be on another machine.
The second thing that we do is, in addition to having a single copy of the model spread out over a lot of machines, we’ll actually stamp out multiple copies of that, and they all sort of collaborate and process different input data and exchange their parameters every so often. That scale is pretty important because the more data you can expose your model to and the more neurons you have in it, the better off you’ll be.
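Here is a toy, single-process sketch of the two ideas in this answer: the model’s parameters are split into pieces, as if partitioned across machines, and several replicas train on different shards of data, periodically pushing updates to and pulling fresh values from a shared parameter store. This illustrates the general parameter-server pattern; the names, sizes, and the `fake_gradient` stand-in are all invented for the example, and none of this is Google’s implementation.

```python
import numpy as np

rng = np.random.default_rng(3)

# "Parameter server": the authoritative copy of the model, itself
# split into pieces as if partitioned across machines.
server = {"piece_a": rng.normal(size=100),   # e.g. lower layers
          "piece_b": rng.normal(size=100)}   # e.g. upper layers

def fake_gradient(params, shard):
    # Stand-in for backprop on one mini-batch from this replica's
    # shard of the training data.
    return {k: 0.01 * rng.normal(size=v.shape) for k, v in params.items()}

n_replicas, steps, sync_every = 4, 20, 5
shards = [rng.normal(size=(50, 10)) for _ in range(n_replicas)]
replicas = [{k: v.copy() for k, v in server.items()}
            for _ in range(n_replicas)]

for step in range(steps):
    for replica, shard in zip(replicas, shards):
        grads = fake_gradient(replica, shard)
        for k in replica:                 # local update on local copy
            replica[k] -= grads[k]
        if step % sync_every == 0:        # "every so often": exchange
            for k in server:              # push update, pull fresh value
                server[k] -= grads[k]
                replica[k] = server[k].copy()
```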
Q: How do you get these advances out into products?
A: Sometimes we build general pieces of functionality that teams can then apply to their own problems. So for example, for this image processing that we’ve been doing for general images, we have a group doing optical character recognition, which is a more specialized problem, but they were able to take some of this stuff and apply it to their particular problem. They had one or two people build a prototype that seemed to work.
Often a few people on the product team see the stuff we’ve done and say we should try it on our problem, and they collaborate with us. All the speech work we’ve been doing is a collaboration between our group, which is working on general training infrastructure for these kinds of models, and the speech group, which is applying these kinds of models.
Q: How did the speech group apply your work to their products?
A: They replaced the acoustic model of speech, going from the raw audio waveform to the sequence of phonemes. They essentially divide speech into two parts, that part and the part that goes from a sequence of phonemes into what are actually the likely words that were uttered. So the part that got replaced was the acoustic model, the “Gaussian Mixture Model.” That got replaced with a deep learning model.
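As a rough sketch of that division of labor, with made-up dimensions and an untrained stand-in for the network: the acoustic model maps each short frame of audio features to a probability distribution over phonemes, and a separate decoding stage (not shown) turns those probabilities into likely word sequences. The phoneme set and feature size here are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

PHONEMES = ["sil", "k", "ae", "t", "s"]   # tiny illustrative set

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Untrained stand-in for the deep acoustic model: a frame of audio
# features in, a probability for each phoneme out. This is the piece
# that, per the interview, replaced the Gaussian Mixture Model.
W = rng.normal(scale=0.1, size=(len(PHONEMES), 39))

def acoustic_model(frame_features):
    return softmax(W @ frame_features)

# 100 frames of 39-dimensional features (e.g. MFCC-style vectors).
frames = rng.normal(size=(100, 39))
posteriors = np.array([acoustic_model(f) for f in frames])

# The second stage (not shown) would search for the word sequence
# that best explains these phoneme probabilities, using a language
# model; here we just take the most likely phoneme per frame.
best_path = [PHONEMES[i] for i in posteriors.argmax(axis=1)]
```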
Q: So that was a big change in how speech recognition is done?
A: GMMs had been what state-of-the-art speech recognition systems used for years and years. And it’s only in the last three or four years that this neural net-based acoustic model has been able to get enough computational power and large data sets to train it where it’s now significantly better than the Gaussian Mixture Model.
Q: How long had your team in research been working on what got into the new speech recognition?
A: We started working on it about nine months before it got released.
Q: When might the image research work its way into actual products?
A: I don’t think we have anything concrete to announce at this point, but we’re definitely looking at a bunch of different places within different products where we can use it. My sense is that it will eventually be deployed in lots of different places. It’s not that it will be a new product; it will be integrated into a lot of other image-related products.
Q: Beyond specific products, why is all this important?
A: Traditionally, computers have not been very good at interacting with people in ways that feel natural to people. Speech recognition, for example, has only recently gotten significantly better. Doing general large-vocabulary speech recognition, without controlled scenarios like saying “one” to do this and so on, really opens up possibilities for new kinds of user interfaces. The speech recognition is now good enough that I dictate emails on my phone rather than type them in. It’s not perfect, but it’s good enough that it changes how I interact with my phone. Doing queries by voice is very natural now because it’s not frustrating. It usually gets it right.
The same is true of image recognition. Computers don’t usually have any sense of what is in a picture. If we can do a good job of understanding what is in an image, that can enable a lot of new things in applications. If you can give computers models that do a good job of perception, like speech processing and image processing, that will be a pretty dramatic shift in a lot of areas.
Q: Any other applications?
A: Translation is another example where these kinds of models can help. It’s not hard to envision a mode where you look at something in an image and it shows you the English language equivalent as you’re looking at it. Maybe it shows you the French equivalent too.
