ITscout Blog: Limits of Speech Recognition

Long, long ago, in another era of my life, when I was still a doctoral candidate at Indiana University majoring in cognitive psychology with a minor in artificial intelligence, I first met Professor Ben Shneiderman. That was long before he had published Software Psychology: Human Factors in Computer and Information Systems or Designing the User Interface: Strategies for Effective Human-Computer Interaction.

Nowadays, Ben is a professor in the Department of Computer Science at the University of Maryland in College Park, and also founding director of the school's Human-Computer Interaction Laboratory.

A while back, in September 2000, Ben wrote an intriguing article published in Communications of the ACM, the highly respected journal of the computer industry's most highly regarded professional society. That article, entitled The Limits of Speech Recognition, brought me back to those bygone years when I was studying to become an expert in speech perception.

Ben makes a number of observations and shares some insights I'd like to pass along. Edited excerpts appear below:

There are key differences between human-human interaction and human-computer interaction. Spoken language is effective for human-human interaction but often has severe limitations when applied to human-computer interaction.
Speech:
is slow for presenting information

is transient and therefore difficult to review or edit

interferes significantly with other cognitive tasks
Cognitive processes surrounding "acoustic memory" interfere with the "short-term memory" required to solve problems effectively. The part of the human brain that transiently holds chunks of information and solves problems also supports speaking and listening. That's why it's best, when working on tough problems, to be in a quiet environment devoid of speaking or listening. Humans speak and walk easily but find it more difficult to speak and think at the same time, because speaking consumes precious cognitive resources.

More cognitive resources are available for problem solving and recall when hand-eye coordination is used for pointing and clicking. Because physical activity is handled by another part of the brain, problem solving is compatible with routine physical activities (like walking and driving). Likewise, because hand-eye coordination is accomplished in different brain structures, typing or mouse movement can be performed in parallel with problem solving.
Cognitive resources for problem-solving and recall expand when hand-eye coordination is used for pointing and clicking.

Cognitive resources for problem-solving and recall are limited when speech shares short-term memory.

is filled with rich emotional content
The emotive aspects of spoken language, known as prosody, are conveyed by pacing, intonation, and amplitude (such as a rising tone at the end of a phrase to denote a question), and they may be disruptive for human-computer interaction.

The human voice has evolved remarkably well to support human-human interaction. We admire and are inspired by passionate speeches. We are moved by grief-choked eulogies and touched by a child’s calls as we leave for work. A military commander may bark commands at troops, but there is as much motivational force in the tone as there is information in the words. Loudly barking commands at a computer is not likely to force it to shorten its response time or retract a dialogue box.

Accurate simulation or recognition of emotional states is usually impractical since human emotional expression is so:

varied -- across individuals

nuanced -- subtly combining anger, frustration, impatience, and more, and

situated -- contextually influenced in uncountable ways

Speech recognition and generation are helpful for environments that are:
hands-busy

eyes-busy

mobility-required

hostile

Speech recognition also shows promise for telephone-based services.

Physical problems associated with speech include:
fatigue from speaking continuously

the disruption in an office filled with people speaking

After 30 years of ambitious attempts to provide military pilots with speech recognition in cockpits, aircraft designers persist in using hand-input devices and visual displays. Complex functionality is built into the pilot’s joystick, which has up to 17 functions, including pitch-roll-yaw controls, plus a rich set of buttons and triggers. Similarly, automobile controls may have turn signals, wiper settings, and washer buttons all built onto a single stick, and typical video camera controls may have dozens of settings that are adjustable through knobs and switches. Rich designs for hand input can inform users and free their minds for status monitoring and problem solving.

I've long worried that a natural language interface to your computer is a lot like a natural language interface to your car:

Speed up. Slow down. Turn left.

Sounds like a backseat driver. Who needs that? I much prefer interface controls provided by a steering wheel, gear shift, and foot pedals controlling gas, brakes, and clutch.

The idea that human-computer interaction ought to be based on natural language speech may well prove to be as misguided as some old predictions about a paperless office -- which created about as much value as a paperless bathroom.

Sunday, June 05, 2005
