This part gives an overview of the built-in NLP (Natural Language Processing) features and defines guidelines for creating the best possible data set of intent training phrases.
Data Set Size
Because Answers is a bot-building platform whose clients define their own intents and the sample phrases for those intents, we expect data sets to be limited in size.
Because of that, we use algorithms and NLP techniques specifically tailored to maximize information gain on small data sets.
With respect to data size, the following principles should be observed:
- make sure that the training dataset for important intents is larger than that of less important intents. Example: For a bank, the Mortgage intent is more important than the Welcome and Goodbye intents.
- if two (or more) intents are similar, you can make one more probable by providing more samples for that intent, or combine them into one and then branch them through the use of attributes (for example, activation or deactivation of a subscription)
- for maximum benefit, intents should not be similar
- there should be at least 10 complete sentences defining an intent, for example:
how much cash i spent today; DAILY_TRAFFIC
show recent transactions; DAILY_TRAFFIC
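To make the sample format above concrete, here is a minimal sketch of how an intent training set could be represented, together with a helper that flags intents falling below a minimum sample count. The names (`training_data`, `undertrained_intents`, `MIN_SAMPLES`) are illustrative, not part of the Answers API.

```python
# Hypothetical representation: each intent name maps to its list of
# sample sentences, mirroring the "phrase; INTENT" lines above.
MIN_SAMPLES = 10

training_data = {
    "DAILY_TRAFFIC": [
        "how much cash i spent today",
        "show recent transactions",
    ],
    "ACCOUNT": [
        "how much money is in my account",
        "show me the money",
    ],
}

def undertrained_intents(data, minimum=MIN_SAMPLES):
    """Return the intents that have fewer than `minimum` training phrases."""
    return [intent for intent, samples in data.items() if len(samples) < minimum]

print(undertrained_intents(training_data))  # -> ['DAILY_TRAFFIC', 'ACCOUNT']
```

A check like this is easy to run before publishing a bot, so that no intent ships with too few samples.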
Infobip recommends that you add at least 100 training phrases to an intent. For the best performance, add 400 training phrases. The better the bot is trained with the data, the better it will perform in resolving intents.
Words that appear frequently in an intent's samples are more important than rarely mentioned words. For example, the word money might appear in 7 out of 10 messages for ACCOUNT, making it very important for that intent, while the word current might appear only once, making it less important.
If a given word is expected to be commonly used in end-user input for the intent, make sure that there are plenty of samples with that word:
how much money is in my account; ACCOUNT
show me the money; ACCOUNT
how much money i have; ACCOUNT
For the model in use, the word importance weighting schema does not compare the importance of each word across all intents; importance is assigned with respect to the intent in which the word appears.
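As a rough illustration of per-intent weighting, the sketch below computes word frequencies within a single intent's samples. The actual weighting schema used by the model is not public; the function and numbers here are only a plausible analogue.

```python
from collections import Counter

def word_weights(samples):
    """Frequency of each word relative to the number of samples in one intent."""
    counts = Counter(word for s in samples for word in s.lower().split())
    total = len(samples)
    return {word: n / total for word, n in counts.items()}

# The ACCOUNT samples from the section above.
account_samples = [
    "how much money is in my account",
    "show me the money",
    "how much money i have",
]

weights = word_weights(account_samples)
print(weights["money"])    # occurs in all 3 samples -> 1.0
print(weights["account"])  # occurs in 1 of 3 samples -> ~0.33
```

Note that the weights are computed from one intent's samples only, which mirrors the point above: importance is relative to the intent, not to the whole data set.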
Keywords, or Very Important Words
When defining samples for an intent, there is also the notion of a keyword. Keywords are words integral to the intent, and as such, it is expected that there are plenty of samples containing them (as discussed in the previous paragraph).
These are not the same keywords that appear in the Answers platform as a separate configuration tab. Think of these keywords first and foremost as intent keywords, the most meaningful words in intents.
The most important words for an intent, those which could uniquely identify it among all other intents, should be defined both on their own (without other words) and within full training phrases. These words are keywords in the sense that even if the end user enters only that single word, the bot's underlying AI still recognizes the intent.
For important keywords, there should be plenty of samples, including a keyword-only sample. It should be possible to use keywords to uniquely distinguish between intents. Make sure that there are also samples with complete sentences for every keyword.
Do not use the same keywords or their exact synonyms in two different intents.
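The rule above can be checked automatically. The sketch below finds keywords shared between intents; the keyword lists and function are hypothetical, not pulled from a real Answers configuration.

```python
# Illustrative keyword configuration: intent name -> set of keywords.
intent_keywords = {
    "MORTGAGE": {"mortgage", "loan"},
    "ACCOUNT": {"account", "balance"},
    "CARD": {"card", "balance"},  # "balance" collides with ACCOUNT
}

def keyword_collisions(keywords_by_intent):
    """Return (intent_a, intent_b, shared_keywords) for every colliding pair."""
    intents = list(keywords_by_intent)
    collisions = []
    for i, a in enumerate(intents):
        for b in intents[i + 1:]:
            shared = keywords_by_intent[a] & keywords_by_intent[b]
            if shared:
                collisions.append((a, b, shared))
    return collisions

print(keyword_collisions(intent_keywords))
# -> [('ACCOUNT', 'CARD', {'balance'})]
```

An exact-match check like this will not catch synonyms (for example, balance vs. amount), so those still need to be reviewed by hand.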
Single Word Prefix Matching
Keywords have another neat property: they are all prefix-indexed with a specialized data structure, which lets the platform resolve user inputs that are not complete words.
If the end user types a single-word input such as bal or ag, Answers tries to auto-complete that input to the closest matching keyword. The algorithm matches the user input to the keyword with the minimal length difference from the user query.
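The matching rule above can be sketched as follows. Among keywords that start with the partial input, the one closest in length wins. The real prefix index inside Answers is a specialized data structure; this linear scan and the keyword list are only illustrative.

```python
def complete_prefix(partial, keywords):
    """Return the prefix-matching keyword with minimal length difference, or None."""
    candidates = [k for k in keywords if k.startswith(partial)]
    if not candidates:
        return None
    # Every candidate is at least as long as the input, so the smallest
    # length difference is simply the shortest candidate.
    return min(candidates, key=lambda k: len(k) - len(partial))

keywords = ["balance", "baggage", "agent", "agreement"]
print(complete_prefix("bal", keywords))  # -> "balance"
print(complete_prefix("ag", keywords))   # -> "agent" (closer in length than "agreement")
```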
Spelling Correction
The Answers platform can perform spelling correction on end-user input; however, this feature should not be overused. The chatbot vocabulary consists of the unique words in the machine learning samples, so whenever the end user enters a word that is not present in that vocabulary, it is considered a candidate for spelling correction.
There is no need to define multiple samples for each important keyword in both plural and singular form, such as offer and offers, since Answers will auto-correct many of them into one or the other.
It is recommended that you stick with either the singular or the plural form, such as offers or offer, but not both.
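A minimal sketch of vocabulary-based spelling correction as described above: unknown words are mapped to the closest known word. Python's `difflib` stands in here for whatever algorithm Answers actually uses, which is not public, and the vocabulary is made up.

```python
import difflib

# Hypothetical chatbot vocabulary: the unique words of the training samples.
vocabulary = {"money", "account", "balance", "offers", "transactions"}

def correct(word, vocab=vocabulary, cutoff=0.75):
    """Return `word` if known; otherwise the closest vocabulary word, if any."""
    if word in vocab:
        return word
    matches = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(correct("acount"))  # -> "account"
print(correct("offer"))   # -> "offers" (singular auto-corrected to the known plural)
```

The second example shows why a single form per keyword is enough: the unseen form is close enough to be corrected to the one in the vocabulary.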
Synonyms
The end user might enter a word that was not seen in the training phrases but is a synonym of one of the words the platform knows about. For example, we train the chatbot with the word baggage, but the end user enters luggage.
Here is an example for the following training data:
I lost baggage; LOST_BAGGAGE
Where is my baggage; LOST_BAGGAGE
If the end user enters "lost luggage", Answers will internally treat the input as "lost baggage" and correctly classify the intent. Answers supports an extensive set of synonym sets, although some that you expect might be missing.
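The behavior above can be sketched as synonym normalization before classification: user words are mapped to the canonical form seen in training. The synonym table and helper are illustrative; Answers' internal synonym sets are not public.

```python
# Hypothetical synonym table: variant word -> canonical training word.
synonyms = {"luggage": "baggage", "suitcase": "baggage"}

def normalize(text, table=synonyms):
    """Replace known synonyms with the word used in the training data."""
    return " ".join(table.get(word, word) for word in text.lower().split())

print(normalize("lost luggage"))  # -> "lost baggage"
```

After normalization, the input matches the LOST_BAGGAGE samples exactly, which is why the intent resolves correctly even though "luggage" never appeared in training.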