As computers become ubiquitous, it is increasingly desirable to communicate with them in the same way that we communicate with one another: using human speech. Voice or Speech Recognition technology aims to do just this. Personally, I fell in love with the concept of voice recognition when I first saw the “Star Trek: The Next Generation” series. Unfortunately, my first attempt at making productive use of speech recognition in Microsoft Windows 3.1 was rather disappointing.
Today our ability to use voice recognition is limited to issuing system commands to speed up familiar functions. So what prevents us from talking to our personal computers and phone systems (which are quickly converging into one)? What you may not realize is that speech recognition is a rather complicated and resource-intensive task.
Humans easily and efficiently relay information via speech despite many complications, including background noise, slips related to spontaneous speech (stammers, filled pauses, false starts, etc.) and the inherent variability of human speech.
Challenge 1: Interaction between Humans and Computer Systems requires perceptual intelligence:
- Identify audio environment
- Identify each voice element comprising the audio environment
- Identify voice to be followed
- Determine what that voice says
- Recognize this voice’s intonation
- Decide and issue what would be considered an appropriate response
- Summarize, index and store information for future retrieval
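To make the steps above more concrete, here is a minimal Python sketch that walks a toy "audio scene" through them. Everything in it is an assumption made for illustration: the `Source` class, the `perceive` function, the loudness values and the rule of following the loudest voice are hypothetical, and a real recognizer would have to work all of this out from raw audio.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """One sound source in a toy audio scene. All fields are hypothetical."""
    kind: str                   # "voice" or "noise"
    speaker: str                # who or what produced the sound
    loudness: float             # relative level between 0.0 and 1.0
    text: str = ""              # what was said; a real system must decode this
    rising_pitch: bool = False  # crude stand-in for intonation


def perceive(scene):
    """Walk a toy scene through the seven steps listed above."""
    # Steps 1-2: describe the environment and pick out the voices in it.
    voices = [s for s in scene if s.kind == "voice"]
    environment = "noisy" if len(voices) < len(scene) else "quiet"
    if not voices:
        return {"environment": environment, "response": None}

    # Step 3: choose the voice to follow (here, simply the loudest one).
    target = max(voices, key=lambda s: s.loudness)

    # Step 4: determine what that voice says (given directly in this toy).
    words = target.text

    # Step 5: recognize intonation (rising pitch is treated as a question).
    intonation = "question" if target.rising_pitch else "statement"

    # Step 6: decide on a response.
    if intonation == "question":
        response = "Answering: " + words
    else:
        response = "Executing: " + words

    # Step 7: keep a record that could be indexed for later retrieval.
    return {
        "environment": environment,
        "speaker": target.speaker,
        "words": words,
        "intonation": intonation,
        "response": response,
    }


scene = [
    Source("noise", "air conditioner", 0.4),
    Source("voice", "colleague", 0.3, "see you at lunch"),
    Source("voice", "user", 0.9, "call Bob"),
]
print(perceive(scene)["response"])
```

Note that following the loudest voice is only a placeholder. It is exactly the kind of shortcut that breaks down when several people are equally close to a sensitive microphone.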
Doing voice search on a desktop computer over a speakerphone is more difficult than doing it on a mobile device. Mobile phones and headsets are designed for voice input from one individual. Desktop microphones, speakerphones and conference phones are designed with the exact opposite purpose in mind: to be sensitive enough to pick up conversations as far away as possible. Unless you are wearing a headset, you are likely to pick up background noise and the conversations of other people in the office. If you’re talking to the microphone of your softphone or desk video phone, you’re sitting farther away instead of talking right next to it as you would with a mobile phone. This creates a big difference in noise level and ambient sound. Now you are expecting your speech recognition software to interpret multiple voices, identify and authenticate the primary voice issuing a command, and execute that command. Not only does this create a significant isolation challenge, but it also presents a number of security risks.
The effectiveness of language as a means of communicating information depends on the robust nature of human speech perception. Computers lack any such built-in auditory capability, and the field of automatic speech recognition seeks to overcome this deficiency. The primary challenge in doing so is the overwhelming variability of human speech: no two people speak exactly alike; in fact, every utterance is unique. This irregularity and its implications for automatic speech recognition can be examined from the perspective of three major areas of linguistic study: phonetics, phonology and prosody.
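To put a rough number on the speakerphone problem, the short Python sketch below estimates how much quieter your voice is at the microphone when you sit farther away. It assumes an idealized point source in a free field, where level falls by about 6 dB per doubling of distance. The function name and the two distances are hypothetical, and a real office with reflecting walls will behave differently.

```python
import math


def level_drop_db(near_m, far_m):
    """Drop in speech level (dB) when moving from near_m to far_m metres.

    Assumes a point source in a free field (inverse-square law); real rooms
    with reflections will behave differently.
    """
    return 20 * math.log10(far_m / near_m)


# Hypothetical distances: a handset at 5 cm versus a desk phone at 50 cm.
drop = level_drop_db(0.05, 0.5)
print(f"Speech level drops by about {drop:.0f} dB")
```

If the background noise reaching the microphone stays roughly the same, the signal-to-noise ratio falls by about the same amount, which is why a command that works on a handset can fail on a desk phone.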
Challenge 2: Human speech perception is bimodal.
We subconsciously read lips in complex audio environments to improve intelligibility, integrating audio and visual stimuli. Audio-visual automatic speech recognition (often referred to by its acronym, AV-ASR) relies on both the audio signal and the video of a speaker’s face to transcribe spoken utterances. An AV-ASR system could outperform traditional audio-only ASR.
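One simple way to picture how the two signals can be combined is a weighted "late fusion" of word probabilities, sketched below in Python. This is an illustration under stated assumptions, not a description of how any particular AV-ASR product works: the `fuse` function, the weight and the probabilities are all made up. The example uses "mail" and "nail", which are easy to confuse by ear in noise but look different on the lips, because the lips close for "m" and not for "n".

```python
import math


def fuse(audio_probs, visual_probs, audio_weight=0.5):
    """Pick the word with the best weighted log-probability across both inputs.

    audio_weight is a hypothetical tuning knob: 1.0 trusts audio only,
    0.0 trusts the video only.
    """
    scores = {}
    for word in audio_probs:
        scores[word] = (audio_weight * math.log(audio_probs[word])
                        + (1 - audio_weight) * math.log(visual_probs[word]))
    return max(scores, key=scores.get)


# Made-up probabilities for a noisy room: the audio model slightly prefers
# the wrong word, while the lip-reading model is fairly sure of the right one.
audio = {"mail": 0.45, "nail": 0.55}
visual = {"mail": 0.90, "nail": 0.10}

print("audio only:", max(audio, key=audio.get))
print("audio + video:", fuse(audio, visual))
```

With these numbers the audio-only guess is "nail", while the fused guess is "mail". In practice the weight would need to change with conditions, for example leaning more on video as the room gets noisier.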
Challenge 3: For voice recognition users, vocal endurance becomes important.
Improper use of the voice and unhealthy voice behaviors may result in vocal problems for end users. Other concerns affecting voice behavior include emotional anxiety and the natural limitations of an individual’s voice.
As you can see, Voice Recognition technology deals with significant challenges, and its use in the interactive applications of today’s business phone systems is somewhat limited. Because of this, many clients also name it as a primary source of frustration. Nevertheless, in its current form, it can still be useful in a number of Interactive Voice Response applications where security and speakerphone detection issues can be mitigated.