Where’s Speech Recognition?

Almost immediately following the market success of handheld computers, customers began to yearn for voice recognition in little devices. “If only I could enter an appointment by talking, or hear an address from my contact list read out loud,” they cried. And there is no doubt that such a capability, married to those handy gadgets, would increase their value exponentially.

Unfortunately, an entire decade has passed, including one of the most technologically innovative periods in computing history, and we still have no voice recognition. Even as voice and data are merged, and handheld devices morph into powerful cellular telephones, we still lack such an obvious input capacity.

Why the Stutter?

With many expected but undelivered innovations, the hurdle is market development or self-serving corporate strategy. In this case, no one is to blame. The delay is caused by the continuing limitations of technology.

First, the hardware within a handheld device is not up to the task of voice recognition. This feature requires large amounts of memory to store words in both computer and spoken languages. It also requires enormous processing power to identify the spoken word and know what to do with it.

A small handheld device cannot contain large chunks of memory without running into space constraints. And it cannot incorporate a zippy processor without running into power problems, which can be resolved only by accepting a shorter battery life or fitting a larger battery. If you cherish the tiny size of your handheld device, then you must forgo these “speeds and feeds.”

The second issue is demand. Speech recognition software has existed for PCs for years, and yet few people use it. Initially, the quality of the recognition was underwhelming; with all of the error corrections necessary to create a document using speech software, typing was simply a faster input method. More recently, quality has improved considerably, and yet few people have purchased speech software.

For example, Dragon Systems, a pioneer in speech recognition and market leader in the consumer software space, was gobbled up by L&H in 2000. A year and a half later, L&H declared bankruptcy and was purchased by ScanSoft. Clearly, this market is difficult.

One of the few places where demand remains even lukewarm is interactive voice response (IVR). Nuance Communications, while suffering recently like all other technology companies, has remained afloat due to its focus on this market segment.

Say It Ain’t So!

With technology improvements (and miniaturization) on the horizon, we have some promising intermediate steps in the foreground. Already, many handheld devices incorporate speech, either to record it or, in the case of a telephone, to transmit it. While not able to understand or act upon these spoken words, at least the devices contain the hardware to record, to store, and to play back sounds. This is clearly a necessary first step.

A few software companies have accepted the limitations of handheld hardware and reduced their goals accordingly. The result is “Command-and-Control.” The user trains the device to recognize certain words, which are saved as a .wav file on the device. The user then selects an action to accompany that sound. The handheld device cannot understand the words “Say Time” per se, but it can cross-reference these sounds with its pre-recorded library, and then perform the prescribed action.
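The cross-referencing step can be illustrated with a minimal sketch. It assumes each recording has already been reduced to a short list of numbers (a hypothetical "feature vector"); a real product would derive these from the stored .wav audio, and nothing here reflects any particular vendor's method. The names CommandLibrary, train, and recognize, and the threshold value, are invented for illustration.

```python
import math
from typing import Callable, List, Optional, Tuple

Features = List[float]  # hypothetical stand-in for a processed recording


def distance(a: Features, b: Features) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class CommandLibrary:
    """Pre-recorded sounds, each paired with the action the user chose."""

    def __init__(self, threshold: float) -> None:
        # Sounds farther than this from every template are rejected.
        self.threshold = threshold
        self.templates: List[Tuple[str, Features, Callable[[], str]]] = []

    def train(self, label: str, features: Features,
              action: Callable[[], str]) -> None:
        """Store one recording and the action to run when it is heard."""
        self.templates.append((label, features, action))

    def recognize(self, features: Features) -> Optional[str]:
        """Run the action of the closest template, or return None."""
        best_action: Optional[Callable[[], str]] = None
        best_dist = float("inf")
        for _label, template, action in self.templates:
            d = distance(features, template)
            if d < best_dist:
                best_dist, best_action = d, action
        if best_action is None or best_dist > self.threshold:
            return None  # nothing in the library sounds close enough
        return best_action()


library = CommandLibrary(threshold=1.0)
library.train("say time", [0.9, 0.1, 0.4], lambda: "speaking the time")
library.train("open address book", [0.1, 0.8, 0.7],
              lambda: "launching address book")
print(library.recognize([0.85, 0.15, 0.45]))
```

The device never interprets the words themselves; it only finds the stored sound nearest to what it just heard, which is why every command must be recorded in advance.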

Voice commands are slowly invading cell phones. However, in most cases, the phone requires that the entire name be recorded; it cannot parse different words within a recording to initiate different actions; nor can it currently perform any action other than dialing the pre-typed number.

Nonetheless, this is a significant step. And a handheld computer with more features and functions could perform even more actions using command-and-control. While driving on a busy highway, the user could say “Open Address Book. Find L-A-D-D. Read home number.” The computer would understand that open means launch. It would map the spoken “L” to the typed “L”, and find the number. It could then say the number out loud.
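To make the address book example concrete, here is a rough sketch of how a device might act on commands once the individual words and letters have been matched. It is an illustration only: the HandheldSession class, the three supported verbs, and the sample phone number are all invented, and the sketch takes typed text as a stand-in for recognized speech.

```python
from typing import Dict, Optional


class HandheldSession:
    """Tiny interpreter for commands such as 'Find L-A-D-D.'"""

    def __init__(self, address_book: Dict[str, Dict[str, str]]) -> None:
        self.address_book = address_book
        self.open_app: Optional[str] = None
        self.current: Optional[Dict[str, str]] = None

    def run(self, command: str) -> str:
        words = command.strip().rstrip(".").split()
        if not words:
            return "no command"
        verb, rest = words[0].lower(), words[1:]
        if verb == "open":
            # "Open" simply means "launch the named application".
            self.open_app = " ".join(rest).lower()
            return f"launched {self.open_app}"
        if verb == "find":
            # Letters are spoken one at a time: "L-A-D-D" becomes "LADD".
            name = "".join(rest).replace("-", "").upper()
            self.current = self.address_book.get(name)
            if self.current is None:
                return f"no entry for {name}"
            return f"found {name}"
        if verb == "read":
            if self.current is None:
                return "nothing selected"
            field = rest[0].lower() if rest else ""
            # The returned string is what the device would speak aloud.
            return self.current.get(field, f"no {field} number on file")
        return f"unknown command: {verb}"


# Hypothetical contact data for the demonstration.
session = HandheldSession({"LADD": {"home": "555-0100"}})
for spoken in ("Open Address Book.", "Find L-A-D-D.", "Read home number."):
    print(session.run(spoken))
```

Because the vocabulary is limited to a few verbs, the alphabet, and some field names, the device needs only a small trained library to cover a useful range of actions.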

Input can also be streamlined with this simplified approach, such as “Open Date Book; make appointment for 3 pm on Tuesday.” After performing these simple steps, it could then save the details of the meeting to a recorded .wav file for subsequent interpretation, if necessary. At least, in this case, the time is booked on the calendar.

Command-and-control requires that the user pre-record all of the words that the computer should recognize, like “Open” and “Address Book”, as well as single numbers and letters. This might take as long as an hour, but it is (in my opinion) an acceptable price to pay for this wonderful functionality. Moreover, these commands can be saved and carried from computer to computer, eliminating the dreaded re-training now necessary with voice command in most cell phones.

The other intermediate solution to full voice recognition that has recently entered the market uses the massive computing power of a server. As we mentioned, true recognition requires fast processors and large amounts of memory, both of which can be found (with much to spare) in a computing center full of large, expensive servers. But how do we get the user’s command from the handheld device (while driving in a moving car, no less!) to the server “farm” for recognition? And then how do we push the resulting text or action back to the handheld?

The answer can be found in the newest generation of combo devices, which incorporate a handheld computer and cell phone all in one package. With a PocketPC-based Siemens SX56 device, one would dial the server farm over a voice connection, speak the commands or dictate a letter, and hang up. The server would perform the heavy lifting and send the response back to the smart phone either via wireless email, streaming packet data, or the next synchronization for incorporation into the handheld computer’s files.
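The round trip just described can be sketched as a small simulation. This is only a model of the flow, not any vendor's service: the RecognitionServer class and its method names are hypothetical, and the transcribe function passed in is a placeholder for the real recognition engine running on the server farm.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List

CHANNELS = ("email", "packet", "sync")  # the three return paths named above


@dataclass
class Result:
    user: str
    text: str
    channel: str


class RecognitionServer:
    """Receives audio over a call and queues the recognized text."""

    def __init__(self, transcribe: Callable[[bytes], str]) -> None:
        self.transcribe = transcribe  # placeholder for the real engine
        self.outbox: Deque[Result] = deque()

    def receive_call(self, user: str, audio: bytes,
                     channel: str = "sync") -> None:
        """After the caller hangs up, recognize and queue the result."""
        if channel not in CHANNELS:
            raise ValueError(f"unknown channel: {channel}")
        self.outbox.append(Result(user, self.transcribe(audio), channel))

    def deliver(self, user: str) -> List[Result]:
        """Hand over everything waiting for this user, e.g. at next sync."""
        ready = [r for r in self.outbox if r.user == user]
        self.outbox = deque(r for r in self.outbox if r.user != user)
        return ready


# Stand-in engine: treats the "audio" bytes as already-recognized text.
server = RecognitionServer(lambda audio: audio.decode("ascii"))
server.receive_call("ted", b"Make appointment for 3 pm tomorrow", "email")
for result in server.deliver("ted"):
    print(result.channel, result.text)
```

The queue makes the trade-off visible: the handheld gets accurate text without doing any recognition itself, but only after the call has ended and the result has been delivered.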

This method would allow for full dictation with a high degree of accuracy. It can even perform time-insensitive tasks, like “Make appointment for 3 pm tomorrow with T-E-D-space-L-A-D-D”. However, this approach is not conducive to instantaneous navigation or text-to-speech playback.

More Lethal Than Killer

Whenever the market for handheld devices from Palm or Microsoft stumbles, smug industry pundits affirm that data-centric devices will never amount to much since “voice is the killer app.” While I will not air my reservations about this particular attribution of causation, I will agree that voice is, indeed, the most intuitive and effective input and output method. Thus has it been for hundreds of thousands of years of animal and more recently human evolution.

The combination of the simplicity of the spoken word with the power of the written word is an even further advancement, which begins to match the capabilities of the human mind. Until that glorious day arrives in a small affordable package, we have these two alternatives to get us most of the way there.

