Why (most) bots and voice assistants are dumb…

September 18, 2017
Nadav Gur

...and how to build a smart one

This article was first published on medium.com by Nadav Gur, our mastermind of AI.

Dumb messenger chatbots are fueling the question of whether conversational AI is more hype than substance. Dialog Management is the real brains behind conversational AI – and if your Dialog Management isn’t smart, it’s like having a user interface with no intelligence behind it.

AI is going through the quintessential new technology debate, “Is it the next big thing? Or is it just a hype wave?” Conversational AI is part of that story  –  “are bots really smart? Is voice really the next interface?” A lot of the early evidence seems to point more to hype than substance. The word “dumb” is often used in the context of Facebook Messenger chatbots (e.g. “Chatbots are dumb, but wait until they learn how to negotiate for you”), and pursuit of “general intelligence” does not seem to improve the state of consumer-facing products (Tougher Turing Test Exposes Chatbots’ Stupidity).

Previously, I explained the technology stack behind personal-assistant AI – whether voice assistants or messaging bots. That article covered the AI pipeline behind them: Speech Recognition (ASR) – Natural Language Understanding (NLU) – Dialog Management (DM) – Text To Speech (TTS).

The speech interaction pipeline


But once you look at the products developers are using to build bots and voice skills these days, you quickly realize that while ASR, NLU and TTS are implemented using off-the-shelf products, dialog management is mostly still being hard-coded – and “hard” in this context is literal.

Unfortunately, Dialog Management is the Brains of the Bot. So if your Artificial Intelligence is not particularly intelligent … 

Dialog Management is the real brains behind conversational AI.

Why intelligent Dialog Management makes the difference

For the sake of the discussion, let’s compare a voice bot to a mobile app. When you need to solicit user input in an app, you present a screen with a UI form  –  input fields, menus, buttons etc. The user is expected to tap, type, swipe  – and provide a response. The voice equivalent is simply asking a question using speech (TTS), listening to a voice response which is converted to text (ASR), and then parsed from natural language to an intent (NLU). The end result is structured data that represents the user’s response. 

So, while there have been great strides forward in speech recognition (e.g. this latest achievement by Microsoft), and near-commoditization of NLU and TTS, all that gets us is the equivalent of a working form-based user interface for an app. But we all know that the UI is just the UI. The “brains” of the app is the code that does whatever the app is supposed to do. In other words, if your Dialog Management isn’t smart, it’s like having a user interface with no intelligence behind it.

If your Dialog Management isn’t smart – your bot is stupid.

What makes a Dialog Manager smart?

A Dialog Manager basically answers one question  –  “based on what just happened + what happened before –  what to do next?”. For the simplest of services, this may not require a lot of intelligence. But in many cases, it gets complicated really quickly for several reasons:

  • Context  –  situations where what you do next, or simply how you understand the input, depends on something that happened before.
  • Pivoting  –  where the user’s response surprises you despite being within the capabilities of your bot.
  • Multi-modality  –  where more than just voice/natural language are used for input/output.
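That core question – “given what just happened plus what happened before, what next?” – can be sketched as a tiny decision loop. This is a minimal illustration, not a real DM: the intent names, the `REQUIRED` slot table, and the string-encoded actions are all invented for the example.

```python
from dataclasses import dataclass, field
from typing import Optional

# Slots each intent needs before it can be executed (illustrative).
REQUIRED = {"call": ["callee"], "transfer_funds": ["amount", "to_account"]}

@dataclass
class DialogState:
    history: list = field(default_factory=list)  # prior (intent, slots) turns
    pending: Optional[str] = None                # intent we asked a question for

def next_action(state: DialogState, intent: str, slots: dict) -> str:
    """The core DM question: given what just happened plus what
    happened before, what do we do next?"""
    state.history.append((intent, slots))
    if state.pending and intent != state.pending:
        return f"pivot:{intent}"          # user answered something else entirely
    missing = [s for s in REQUIRED.get(intent, []) if s not in slots]
    if missing:
        state.pending = intent
        return f"ask:{missing[0]}"        # elicit the first missing slot
    state.pending = None
    return f"execute:{intent}"
```

Even this toy version shows why the three bullets above matter: `history` and `pending` are context, the first branch is a pivot, and a multi-modal front end would feed the same `(intent, slots)` stream.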

Here are some examples:


Context

Let’s look at a product like Chris, a driver’s assistant that helps you control your phone while driving. You may say “Chris, call Luke”. Assuming ASR and NLU do a perfect job, Chris knows you want to place a phone call to someone named Luke. But what if you have more than one Luke in your address book? This is called disambiguation.

Simple dialog management will ask “which Luke did you mean  –  Luke Cage, Luke Bizzy or Luke Warm?”, and you will have to respond with the full name. But consider these cases:

  • What if you have a missed call from Luke Cage from 20 minutes ago? Chances are, that’s exactly who you want to call. Smart DM will realize ‘Luke Cage’ is in the short-term context and respond with “Calling Luke Cage” (and allow you to cancel in the unlikely event that’s not what you meant).
  • What if you call Luke Cage every day, and the other guys no more than once a year? Again  –  Luke Cage is our man, and DM should be able to figure that out, because Luke Cage is stored in the long-term context.
  • What if you’re actually presented with these 3 options, and respond “Luke Bizzy”? Speech recognition is quite likely to understand this as “look busy” which will throw things off  –  unless the speech recognition model was cued ahead of time that one of these names is probably the next thing you’re going to say. In technical terms  –  smart DM provides context for ASR and/or NLU through mechanisms like dynamic grammars, slot-filling, ontology management etc.
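The first two cases boil down to scoring candidates with short-term context (the recent missed call) and long-term context (call frequency), and the third to feeding the surviving names back to ASR. Here is one way that scoring could look – the names, the 30-minute window, and the confidence threshold are all assumptions made up for the sketch.

```python
import time

def disambiguate(candidates, missed_calls, call_counts, now=None):
    """Rank address-book matches using short- and long-term context.
    `missed_calls` maps name -> unix time of the last missed call;
    `call_counts` maps name -> calls over the last year (illustrative)."""
    now = now or time.time()

    def score(name):
        # Short-term context: a missed call in the last 30 minutes.
        recency = 1.0 if now - missed_calls.get(name, 0) < 30 * 60 else 0.0
        # Long-term context: rough calls-per-day over the last year.
        frequency = call_counts.get(name, 0) / 365
        return recency + frequency

    ranked = sorted(candidates, key=score, reverse=True)
    best = ranked[0]
    if score(best) >= 1.0:   # strong contextual evidence: just act
        return ("call", best, ranked)
    # Otherwise ask – and hand the ranked names to ASR as a dynamic
    # grammar, so "Luke Bizzy" is not misheard as "look busy".
    return ("ask", None, ranked)
```

Whichever branch it takes, the function returns the ranked list, which is exactly the material a smart DM would push into the ASR/NLU context for the next turn.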


Pivoting

Take for example a voice assistant for online banking like Kasisto. As a customer, you may need to transfer some funds into your checking account to cover some outstanding payments. Maybe you say “MegaBank, transfer $2,500 from my savings account to checking”. NLU does its job and realizes your intent is TransferFunds from ‘Savings Account’ to ‘Checking Account’, allowing DM to respond: “Please confirm transfer of $2,500 from savings account 123 to checking account 789; after the transfer you will have $x in this account and $y in that account”.

At this point, you realize there may be a better option. You’ve been itching to sell that Netflix stock which hit an all-time high you don’t think will hold. Or maybe you remember you have some credit card payment coming up and you actually need more than $2,500 transferred. Either way, you say “What’s the current quote on my Netflix stock?”.

Now  –  your DM was waiting for a confirm/reject on that transfer. Instead, it’s hit by something totally unexpected. A good DM won’t get hung up on “please confirm or cancel” but rather pivot to answering that query  –  without the developer having to directly code/design the bot that way. A great DM will also allow you to get back from this “side-track” to your original request  –  and change to a transfer from your securities account, if needed.
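One generic way to get this behavior is a suspend/resume stack: when input doesn’t match what the pending task expects, park the task, answer the new query, and offer to come back. A rough sketch under those assumptions – the task dicts, intent names, and string-encoded actions are invented for illustration:

```python
class DialogStack:
    """Suspend an in-flight task when the user pivots, answer the new
    query, then offer to resume the original request."""

    def __init__(self):
        self._suspended = []

    def handle(self, pending_task, new_intent):
        expected = pending_task["expects"] if pending_task else set()
        if pending_task and new_intent["name"] not in expected:
            # The user pivoted: park the transfer instead of nagging
            # "please confirm or cancel".
            self._suspended.append(pending_task)
            return f"answer:{new_intent['name']}"
        if pending_task:
            return f"continue:{pending_task['name']}"
        return f"answer:{new_intent['name']}"

    def resume(self):
        # After the side-track, return to the suspended request (if any).
        return self._suspended.pop() if self._suspended else None
```

The key design point is that the pivot handling lives in the platform, not in per-bot flow logic – the developer never has to wire a “stock quote” arrow out of the “confirm transfer” state.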

Multi Modality

Let’s go back to a device like Chris, which has speech interaction, but also a screen and a gesture sensor. These three UI elements need to be orchestrated. For instance, when Chris asks “Which Luke did you mean?”, the image of each of them can be shown on the screen. Then you can pick the right one with a “swipe” gesture. In this case, a gesture is equivalent to an utterance like “next” or “back”. So you want DM to support gesture input alongside voice input and output.
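One simple way to orchestrate modalities is to normalize every input – gesture or utterance – into the same intent stream before it reaches the DM, so a swipe and the word “next” are handled identically. A sketch, with a made-up event shape and gesture-to-intent table:

```python
# Illustrative mapping from raw gestures to the equivalent utterances.
GESTURE_TO_INTENT = {
    "swipe_left": "next",
    "swipe_right": "back",
    "tap": "select",
}

def normalize(event):
    """Turn a raw input event into an (intent, slots) pair for the DM."""
    if event["modality"] == "gesture":
        return (GESTURE_TO_INTENT.get(event["kind"], "unknown"), None)
    if event["modality"] == "voice":
        # Assume NLU already ran; pass its intent straight through.
        return (event["intent"], event.get("slots"))
    return ("unknown", None)
```

With this in place, the DM logic from the earlier examples never needs to know which modality produced a turn – only the output side (screen vs. TTS) stays modality-aware.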


Great Dialog Management is Rare

Many of the issues people have with current voice assistants are focused here. The reason is simple: once speech recognition and natural language understanding are good enough, the shortcomings of the next stage in the pipeline become apparent. Some examples:

  • Alexa and unusual song names  –  playing a song by name (“Alexa, play Plants vs. Zombies a-capela”). It’s an example of hit-or-miss speech recognition (because the name is so unusual). If Alexa doesn’t understand the name the first time, it’s forgiven. But when it does understand the name and plays the song, and then your kids try to play it again immediately afterwards and it fails  –  that’s when “Alexa is stupid!” becomes the prevailing opinion. Short-term memory is something we take for granted.
  • Try “what’s the closest gas station” with Siri. Then try “show me the next one”. Now try this while driving on the highway with an empty gas tank.

Great dialog management is not a problem that’s easily solved by the state-machine or flowchart models offered by off-the-shelf bot platforms, because the pivoting issue makes a state machine look like a bowl of spaghetti. It’s also not simply tractable through “deeper neural network + more compute power” (the recipe à-la-mode for tough AI challenges) because there’s not a lot of sample data you can train on. But it’s still a generic problem  –  that’s solvable with the right platform.

Shameless plug

Much of the work we’ve done at Servo Labs over the last year was focused on applying a better AI model to solve dialog management holistically and elegantly. If you need to build a truly smart user-facing voice or bot experience, drop me a line.
