Confidence Score in Conversational AI
January 21, 2021 • 5 minute read

The Role of a Confidence Score in Conversational AI

This blog is written courtesy of the Interactions R&D team.

When should AI say: “I don’t know”?

No one is perfect! Humans make mistakes, and so do machines. An admirable quality in humans is to accept one's mistakes. It turns out that it is also a valuable quality for AI. In recent years, the role of AI has become prominent in every aspect of our lives. We rely on AI to tell us the weather, find the nearest restaurants, give us directions, and play our favorite songs, among many other things. We trust AI responses when making decisions, but do we really know how reliable those decisions are?

What is “confidence”?

AI relies on data to train models that automatically make decisions. However, since we are not living in a perfect world, errors are inevitable for several reasons, such as noisy or inadequate data, or a mismatch between training and test data. Different applications have different thresholds for error. Most of us have experienced problems with speech recognition on our phones, which can be annoying. For other applications, such as medical or security, high accuracy is even more crucial. Since AI models cannot be accurate all the time, the best alternative is for such models to know when they could be wrong. That's why it is important to quantify the confidence of automatic predictions in different fields of AI such as speech, text, image, and video. A confidence score is a scalar quantity that measures the reliability of an automatic system.
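As a concrete illustration, a hypothetical intent classifier might report its confidence as the largest probability in its output distribution. The sketch below is a minimal example assuming a generic softmax output; the intents and scores are made up for illustration.

```python
import numpy as np

def softmax(logits):
    """Convert raw model scores (logits) into a probability distribution."""
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

# Hypothetical raw scores for three intents: ["weather", "directions", "music"]
logits = np.array([2.3, 0.4, -1.1])
probs = softmax(logits)

prediction = int(np.argmax(probs))     # index of the most likely intent
confidence = float(probs[prediction])  # scalar confidence score in [0, 1]

print(f"predicted intent index: {prediction}, confidence: {confidence:.2f}")
```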

Don’t be overconfident! 

It’s good to be confident, but no one likes an overconfident person. Well, some people do, but for the sake of argument let’s assume that’s not a good quality. Which one is worse: a student who is very confident they will score 100 on an exam but actually scores 45, or a student who is not so confident about getting a high score yet scores 85? While both situations are undesirable and show that these students do not have a reasonable evaluation of themselves, the first scenario is worse. If someone is giving you directions and they’re not sure, you would prefer to know that, and it is your decision whether to take their advice or not. The same goes for AI models. Ideally, we would want an AI system to associate a number with each automatic prediction that shows the certainty of that prediction. If this number is too low, the AI system is basically saying: “I’m not sure!”

The role of confidence in Conversational AI 

Conversational AI makes it possible for humans to converse with machines using text or speech, and is used in a variety of applications including chatbots and voice-based intelligent virtual assistants (IVA). The majority of voice-based IVA systems use automatic models to convert speech to text and text to meaning. A confidence score can be associated with the output of each model, and the lowest confidence determines the overall confidence of the system. If you are speaking in a noisy train station, the speech-to-text confidence is likely to be low; if you speak vaguely, the text-to-meaning confidence will be low. Either scenario impacts the final output of the system and therefore the conversational AI response. If the system is not confident that it really understands what you said, an alternative approach such as human-in-the-loop can be applied, so you don’t have to repeat yourself. Combining AI with human intelligence allows the Conversational AI system to use human understanding for utterances that are difficult for machines to understand. Confidence scores play an important role in this combination. If the meaning of utterances with low confidence scores is annotated by humans, the system can continue a flawless conversation with you instead of telling you: “I don’t know!”
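A minimal sketch of that routing logic might look like the following. The component functions and the 0.6 threshold are illustrative assumptions, not an actual Interactions API; real systems tune the threshold per application.

```python
def speech_to_text(audio):
    """Stand-in ASR component: returns a transcript and a confidence score."""
    return "play my favorite song", 0.92

def text_to_meaning(text):
    """Stand-in NLU component: returns an intent and a confidence score."""
    return {"intent": "play_music"}, 0.55

def ask_human_to_annotate(text):
    """Stand-in human-in-the-loop step: a human agent supplies the meaning."""
    return {"intent": "play_music", "source": "human"}

CONFIDENCE_THRESHOLD = 0.6  # illustrative cut-off

def handle_utterance(audio):
    """Route an utterance automatically, or fall back to a human agent."""
    text, asr_confidence = speech_to_text(audio)
    meaning, nlu_confidence = text_to_meaning(text)

    # The least confident stage bounds the confidence of the whole pipeline.
    overall_confidence = min(asr_confidence, nlu_confidence)

    if overall_confidence >= CONFIDENCE_THRESHOLD:
        return meaning
    return ask_human_to_annotate(text)

print(handle_utterance(audio=None))
```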

How to measure confidence?

Most AI applications use statistical models to make predictions. A typical approach to determine the confidence of a statistical model is to interpret the posterior probability of the model’s prediction as a measure of certainty: a higher probability for a prediction indicates more confidence. While this approach requires no additional data to determine the confidence score, it depends entirely on the complexity of the model, the input features, and the training data. In some applications, it is beneficial to build a separate confidence model to estimate the certainty of predictions; this model is trained by comparing automatic predictions to human-verified ones. Another approach to measuring confidence is to train multiple models for the same prediction task and compare their results. The quality of a confidence measure can be evaluated by calculating the proportion of accurate predictions when the confidence score is higher than a threshold. Ideally, a calibrated confidence score of 80% shows that the system’s prediction is accurate 80% of the time.
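To make the evaluation idea concrete, the sketch below checks how accurate predictions are when their confidence exceeds a threshold. The confidences and correctness labels are hypothetical held-out results, invented purely for illustration.

```python
import numpy as np

# Hypothetical held-out results: model confidences and whether each prediction was correct.
confidences = np.array([0.95, 0.82, 0.78, 0.66, 0.91, 0.55, 0.88, 0.73])
correct     = np.array([True, True, False, False, True, False, True, True])

def accuracy_above_threshold(confidences, correct, threshold):
    """Accuracy restricted to predictions whose confidence meets the threshold."""
    kept = confidences >= threshold
    if not kept.any():
        return None  # no predictions pass the threshold
    return correct[kept].mean()

for threshold in (0.5, 0.7, 0.9):
    acc = accuracy_above_threshold(confidences, correct, threshold)
    coverage = (confidences >= threshold).mean()
    print(f"threshold {threshold:.1f}: accuracy {acc:.2f} on {coverage:.0%} of predictions")
```

A well-calibrated system shows accuracy close to the confidence values themselves; a large gap between the two signals over- or under-confidence.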