Voice user interfaces seem to have been almost tailor-made for the car, as they allow the driver to focus on traffic while interacting with applications to write messages or play music. Despite this, voice control is by no means common in cars today, and even major companies with years of experience in developing such interfaces have struggled to create an acceptable product for motorists.
The two most prominent building blocks for extracting the semantics of a user utterance from a speech signal are Speech Recognition and Natural Language Understanding. Designing both so that they work reliably in a car remains a challenge.
Converting an acoustic signal to a sequence of words
The journey of a speech signal begins with its recording by a microphone. Speech recognition translates a user's spoken utterance into written text (hence it is sometimes also called “speech to text”). Several web services are available for this task, such as Bing Speech, the Google Speech API and Amazon Lex, and all major mobile platforms provide an API to one of these services. While cloud services are easy to integrate into an application, they have two major drawbacks when used in a car:
First, the internet connection can be weak or lost completely while driving. Second, even when connected, these services often fail to recognize names of contacts, artists, or places that are uncommon or have a foreign-language background.
Speech recognition on the device is always available and allows the definition of user-specific vocabulary. However, language models for such recognizers must be carefully designed so that recognition is fast and accurate enough given the limited computational resources of a mobile device.
Advantages of the hybrid approach
Findings from linguistic research come in handy in restricting the various ways a user could respond to a system question to a much smaller set. The best known finding is that the vocabulary as well as the syntax of a dialog contribution is primed by context: a response to a question is likely to use a similar syntactical pattern and wording as the question itself. Moreover, if the system question leaves only a few options for the user (e.g. “Did you mean Patrick Weissert or Holger G. Weiss?”), the language model can be defined to understand these better than other names.
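When the system question offers a closed set of options, this priming effect can be exploited by fuzzy-matching the recognizer's hypothesis against the offered names. The sketch below is an illustrative assumption, not part of any real recognizer API; it uses Python's standard `difflib` for the matching.

```python
import difflib

def resolve_option(hypothesis, options, cutoff=0.6):
    """Return the offered option closest to the recognized hypothesis.

    A slightly misrecognized name is mapped back to one of the few
    options the system question left open; None means no option was
    close enough and the user likely said something else.
    """
    lowered = [o.lower() for o in options]
    matches = difflib.get_close_matches(
        hypothesis.lower(), lowered, n=1, cutoff=cutoff)
    if not matches:
        return None
    # map the lower-cased match back to the original spelling
    return options[lowered.index(matches[0])]
```

For example, a hypothesis like “patrick wiesert” would still resolve to “Patrick Weissert” when the question offered only two names.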
A hybrid approach combines the benefits of both. For example, free text input such as a dictated message can be recognized by the cloud recognizer, while fixed commands like “send a message” are captured on the device. If network coverage is poor, the application can then tell the user that dictation is currently not possible.
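The routing logic of such a hybrid setup can be sketched as follows; the `cloud_recognize` and `device_recognize` callables are hypothetical stand-ins for the actual recognizer integrations.

```python
def recognize(audio, mode, network_available,
              cloud_recognize, device_recognize):
    """Route audio to the cloud or the on-device recognizer.

    Free text (dictation) needs the large cloud model; fixed commands
    such as "send a message" are handled on the device. If dictation
    is requested while offline, None is returned so the application
    can tell the user that dictation is currently unavailable.
    """
    if mode == "dictation":
        return cloud_recognize(audio) if network_available else None
    return device_recognize(audio)
```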
Understanding the user’s goals
Natural Language Understanding (NLU) maps all possible ways a user could speak a command to a unique “intent” representing that command. In addition, it extracts information that can be used to define the command further (often called “entities” in popular NLU frameworks). It receives the recognized text from the speech recognizer and forwards the intent and entities to the application.
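A minimal, rule-based version of this mapping can be sketched with regular expressions, where each intent is a pattern and its named groups become the entities. The intents and patterns below are made up for illustration; production NLU frameworks use statistical models instead.

```python
import re

# Each intent is a regex; named groups become the extracted entities.
INTENT_PATTERNS = {
    "send_message": re.compile(r"send (?:a )?message to (?P<recipient>.+)"),
    "play_music":   re.compile(r"play (?P<artist>.+)"),
}

def understand(text):
    """Map recognized text to (intent, entities); None if nothing matches."""
    for intent, pattern in INTENT_PATTERNS.items():
        match = pattern.fullmatch(text.lower().strip())
        if match:
            return intent, match.groupdict()
    return None
```

An utterance like “Send a message to Anna” would thus yield the intent `send_message` with the entity `recipient = "anna"`, which is forwarded to the application.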
Popular bot frameworks like Microsoft LUIS or API.ai (now: Dialogflow) offer simple interfaces to define the possible intents and entities for an application and to store a set of user commands annotated with intent and entity information. Because they run on a server, they can use powerful and data-intensive NLP methods. For example, they can make use of word vectors, which encode the meaning of a word as a point in semantic space, so that the similarity of two words can be computed from the angle between their word vectors. As with cloud-based speech recognition, the main disadvantage of cloud-based NLU is its unavailability in areas with poor network coverage.
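The angle-based similarity is usually computed as the cosine between the two vectors. The toy three-dimensional vectors below are invented for illustration; real embeddings such as word2vec or GloVe have hundreds of dimensions.

```python
import math

# Made-up toy vectors; real word embeddings are learned from large corpora.
VECTORS = {
    "song":  [0.9, 0.1, 0.0],
    "track": [0.8, 0.2, 0.1],
    "road":  [0.1, 0.9, 0.2],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

With such vectors, “song” and “track” come out as far more similar than “song” and “road”, which lets the NLU system generalize beyond the exact words seen in its training commands.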
General purpose vs. custom model
Current AI frameworks define general algorithms to recognize arbitrary sets of substrings as entities in a user command. They are therefore well suited for simple models and are easy to set up. However, entities with a large set of possible values, like names or message texts, are hard to recognize accurately with these general-purpose algorithms. For such entities, a custom model fitted to the particular entity should be implemented.
An NLU system running on a mobile device needs to be compatible with much smaller models and less powerful, but inexpensive to execute, machine learning algorithms. As with speech recognition, NLU on the phone can make use of the personal data of the user to improve recognition of user-specific entities. For example, contact names in the contact book application are useful to identify tokens that correspond to the recipient name in a message sending command.
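A simple form of such a user-specific matcher can be sketched as below; the hard-coded contact list is an illustrative stand-in for the device's actual address-book API.

```python
CONTACTS = ["Patrick Weissert", "Holger G. Weiss", "Anna Schmidt"]

def find_recipient(command, contacts=CONTACTS):
    """Return the contact whose name appears in the command, if any."""
    lowered = command.lower()
    for name in contacts:
        # full-name match anywhere in the command
        if name.lower() in lowered:
            return name
        # also accept a first-name-only match, a common shorthand
        first = name.split()[0].lower()
        if first in lowered.split():
            return name
    return None
```

Matching against the user's own contact book makes names resolvable that a general-purpose cloud model would likely misrecognize.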
While popular and easy to integrate into mobile apps, cloud-based speech recognition and natural language understanding can become unavailable on the road. On-device speech processing is harder to implement, but offers more than constant availability: user data can be integrated into the model without compromising privacy.