BERT for Joint Intent Classification and Slot Filling
Table of Contents
- Business Use Case
- Business Application
- Business Constraint
- Mapping to ML/DL Problem
- Existing Approaches and Limitations
- Dataset
- Exploratory Data Analysis
- Modelling
- Future Work
- GitHub Repository
- References
1. Business Use Case
Given a user query, we need to identify its intent and its slots/entities. This raises two questions: what are intents, and what are slots/entities?
Intent refers to the intention behind a user query or user input.
Slots/entities refer to the key elements in a user query or user input.
Intent Detection and Slot Filling is the task of interpreting user commands/queries by extracting the intent and the relevant slots.
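To make the two terms concrete, here is a small sketch. The query, the intent name and the slot names are a made-up illustration written in the BIO tagging style (B- starts a slot, I- continues it, O means no slot); they are not a row copied from the dataset, and `extract_slots` is a hypothetical helper.

```python
# Hypothetical example: one query, its intent, and one BIO slot tag per word.
query = "play jazz music on spotify".split()
intent = "PlayMusic"
tags = ["O", "B-genre", "O", "O", "B-service"]


def extract_slots(words, tags):
    """Group BIO-tagged words into (slot_name, text) pairs."""
    slots = []
    current_name, current_words = None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A new slot starts; close the previous one if it is still open.
            if current_name is not None:
                slots.append((current_name, " ".join(current_words)))
            current_name, current_words = tag[2:], [word]
        elif tag.startswith("I-") and current_name == tag[2:]:
            # Continuation of the slot that is currently open.
            current_words.append(word)
        else:
            # "O" (or an inconsistent I- tag) closes any open slot.
            if current_name is not None:
                slots.append((current_name, " ".join(current_words)))
            current_name, current_words = None, []
    if current_name is not None:
        slots.append((current_name, " ".join(current_words)))
    return slots


slots = extract_slots(query, tags)
print(intent, slots)
```

Intent detection predicts the single `intent` label for the whole query, while slot filling predicts the `tags` list, one label per word.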
2. Business Application
Intent identification and slot filling find their major application in Spoken Language Understanding (SLU) and Spoken Language Systems (SLS). Wherever a human would otherwise have to interpret a user query, this technique can be adopted to identify the user's intentions and entities, e.g. chatbots and voice assistants (Google Assistant, Siri).
3. Business Constraint
Low latency requirement: given a user query, we need to find its intents and entities quickly.
4. Mapping to ML/DL Problem
Intent detection/classification can be formulated as a classification problem. Popular classifiers like Support Vector Classifier (SVC), Logistic Regression (LR), Naive Bayes, etc. can be applied.
Slot filling can be formulated as a sequence labelling task. Popular approaches to solve sequence labelling problems include Maximum Entropy Markov Models (MEMMs), Conditional Random Fields (CRFs), and Recurrent Neural Networks (RNNs).
The title of this blog is “BERT for Joint Intent Classification and Slot Filling”, so we need a model that performs both intent classification and slot filling. Such a joint model simplifies the SLU/NLP system, as only one model needs to be trained and fine-tuned for the two tasks. For this purpose, the language representation model BERT (Bidirectional Encoder Representations from Transformers) is used.
5. Existing Approaches and Limitations
Identifying intents and entities finds major application in chatbots. There are already existing frameworks for building chatbots, such as:
- Dialogflow from Google
- Azure Bot Service from Microsoft
- IBM Watson from IBM
Pros:
1. All the above are cloud-based services, and hence all advantages of a typical cloud-based service are applicable.
2. Easy to build, deploy and maintain. Model training is automatic.
3. Multiple language options built-in
Cons:
1. Although the service is free for low traffic volumes, users might need to move to a paid plan as volumes increase.
2. The programmer does not get fine control over dialogue processing.
Chatbot using Rasa:
Rasa is an open-source library for building chatbots in Python.
Pros:
1. On-premise solution
2. Highly customisable — various pipelines can be employed to process user dialogues. spaCy + sklearn is the default backend.
3. The Rasa framework can be run as a simple HTTP server or used from Python through its APIs.
Cons:
1. Server requirements — Although spaCy is a very fast NLP platform, it seems to be very memory hungry.
2. Learning curve — Installation, configuration and training phases require machine learning expertise (at least basic level).
Custom Chatbots:
With a custom chatbot, there is virtually no limit to the level of functionality and creativity you can produce.
6. Dataset
For the current project, data is obtained from here
Here we have data from two domains, ATIS (Airline Travel Information System) and Snips; both are benchmark datasets for NLU/NLP tasks. Each dataset folder is divided into train, valid, and test, and each of those contains three files: label, seq.in and seq.out.
label holds the intent of each query, seq.in holds the user queries, and seq.out holds the slot tags for the corresponding query in seq.in.
(Further analysis and modelling have been done on the Snips dataset.)
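The folder layout described above could be read with a helper like the one below. This is a minimal sketch, not the repository's loader: `load_split` is a hypothetical name, and it assumes that each file holds one example per line and that words and tags are separated by whitespace.

```python
from pathlib import Path


def load_split(split_dir):
    """Read one split (train, valid or test) into three parallel lists."""
    split_dir = Path(split_dir)

    def read_lines(name):
        text = (split_dir / name).read_text(encoding="utf-8")
        return [line.strip() for line in text.splitlines()]

    intents = read_lines("label")
    queries = [line.split() for line in read_lines("seq.in")]
    tags = [line.split() for line in read_lines("seq.out")]

    # Sanity checks: one intent per query and one tag per word.
    if not (len(intents) == len(queries) == len(tags)):
        raise ValueError("label, seq.in and seq.out differ in line count")
    for i, (words, word_tags) in enumerate(zip(queries, tags)):
        if len(words) != len(word_tags):
            raise ValueError(f"line {i + 1}: {len(words)} words, {len(word_tags)} tags")
    return intents, queries, tags
```

A call would look like `intents, queries, tags = load_split("snips/train")`, where the path is a placeholder for wherever the data was unpacked.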
7. Exploratory Data Analysis
Here we analyse the dataset; the findings guide the pre-processing, featurization and modelling that follow.
Let’s ask the following questions for a better EDA.
Q1. How many intents do we have?
Train:
We have 7 unique intents, so this is a multi-class classification problem. PlayMusic is the majority class with 1914 occurrences. As we can see, the train data is not highly imbalanced; the other classes have almost equal numbers of data points.
Validation:
All 7 intent classes are present in the validation set. Data points in the validation set are equally distributed with 100 data points in each class.
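Counts like the ones above can be produced with a few lines of standard-library Python. The sketch below uses a toy label list for illustration; with the real data, the list would be the contents of the label file, and `intent_distribution` is a hypothetical helper name.

```python
from collections import Counter


def intent_distribution(intents):
    """Return (intent, count, share) rows sorted by descending count."""
    counts = Counter(intents)
    total = sum(counts.values())
    return [(name, n, n / total) for name, n in counts.most_common()]


# Toy labels for illustration only.
toy_intents = ["PlayMusic", "GetWeather", "PlayMusic", "BookRestaurant"]
rows = intent_distribution(toy_intents)
for name, n, share in rows:
    print(f"{name:<15} {n:>4} {share:.0%}")
```

Looking at the share of each class, rather than only the raw count, makes it easier to judge whether the imbalance is large enough to need re-sampling or class weights.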
Q2. What is the maximum length of user input? [Analysis on User input]
Train
Below is an output of the above code snippet.
Here we can see that the mean and median are both close to 9. The mode is 8, and 1816 data points have length 8. Since the mean, median and mode are close to each other, the train sequence lengths are roughly symmetric and approximately normally (Gaussian) distributed. A few data points have a length less than 5 or greater than 25; these points look like outliers.
Validation
Below is an output of the above code snippet
In the validation set, the mean and median are almost equal. The mode is 8, and 110 data points have length 8. Since the mean, median and mode are almost the same, the validation sequence lengths are also approximately normally (Gaussian) distributed. Here too we can see a few points with a length less than 2 or greater than 20.
Investigating outlier points:
Inputs with a length greater than or equal to 20 or less than 2 look like outliers. When we actually inspect these points, however, they turn out to be genuine queries rather than random noise. Hence we set the maximum sequence length to 35.
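The length statistics discussed in this question can be computed as sketched below. The helper name `length_summary` and the toy queries are assumptions for illustration; length is measured in whitespace-separated words.

```python
import statistics
from collections import Counter


def length_summary(queries):
    """Summarise query lengths, measured in whitespace-separated words."""
    lengths = [len(q.split()) for q in queries]
    mode_value, mode_count = Counter(lengths).most_common(1)[0]
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "mode": mode_value,
        "mode_count": mode_count,  # how many queries have the modal length
        "min": min(lengths),
        "max": max(lengths),
    }


# Toy queries for illustration only.
toy_queries = ["play some jazz", "weather tomorrow", "book a table", "hello"]
summary = length_summary(toy_queries)
print(summary)
```

Note that these are word counts. A subword tokenizer such as the one BERT uses can split one word into several tokens, so a maximum sequence length should leave some headroom above the longest query measured in words.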
Q3. Frequency of words in input sequence?
Train
Validation
We can see that in both the train and validation sets, stop words are the most frequent words.
Q4. How are the tags distributed?
Train
Validation
“O” occurs more frequently than any other tag.
8. Modelling
Initially, the problem is approached with traditional models, which serve as baselines for identifying intents and slots/entities. Later in this section, a joint model is developed using BERT. The joint model simplifies the SLU system, as only one model needs to be trained and fine-tuned for the two tasks.
Pre-processing:
- Encoding the labels using LabelEncoder() from sklearn module. We have 7 unique classes, hence our encoded labels will be in the range 0–6
- Vectorizing user input using TfidfVectorizer() from sklearn module.
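The two pre-processing steps above can be sketched as follows. The queries and intents are toy data, and the default settings of both scikit-learn classes are used; the settings in the actual project may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Toy data for illustration; the real inputs come from seq.in and label.
train_queries = ["play some jazz music", "will it rain tomorrow", "book a table for two"]
train_intents = ["PlayMusic", "GetWeather", "BookRestaurant"]

# Map intent names to integer ids (classes are sorted alphabetically).
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(train_intents)

# Turn each query into a sparse TF-IDF vector of shape (1, vocabulary_size).
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_queries)

# New data is only transformed with the fitted objects, never refitted,
# so validation and test features share the training vocabulary.
X_new = vectorizer.transform(["play rock music"])
print(list(label_encoder.classes_), y_train, X_train.shape, X_new.shape)
```

`label_encoder.inverse_transform` maps predicted ids back to intent names when the model's output has to be shown to a user.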
1. Intent Classification
The following models have been trained for intent classification.
1.1. Multinomial Naive Bayes
Naive Bayes is applied along with grid search to select the best alpha for the dataset.
It is found that alpha = 0.1 gives 97% accuracy.
1.2. SGD Classifier
We have experimented with the SGD classifier as logistic regression and as a linear SVC by setting loss = “log” and loss = “hinge” respectively. With alpha = 0.0001, loss = “hinge” and penalty = “l2”, it gives the best score of 98% accuracy.
1.3. Decision Tree
With criterion = “entropy”, max_depth = 30 and min_samples_split = 100, the decision tree has given the best score of 93% accuracy.
The table below shows the result of each model, where each model is fitted with the best parameters obtained from its grid search above.
We can see that SGD with hinge loss is the winner, as its train accuracy and validation (dev) accuracy are higher than those of the other models.
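A grid search over the SGD classifier, as described in this section, might look like the sketch below. The toy data, the alpha grid and cv=2 are assumptions chosen so the example runs on six sentences; they are not the settings of the original experiment. The sketch only searches with the hinge loss, because the name of the logistic loss option differs between scikit-learn versions (“log” in older releases, “log_loss” in newer ones).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data: two queries per intent so that 2-fold stratified CV is possible.
queries = [
    "play some jazz music", "play the latest rock song",
    "will it rain tomorrow", "what is the weather in paris",
    "book a table for two", "book a restaurant for dinner",
]
intents = ["PlayMusic", "PlayMusic", "GetWeather", "GetWeather",
           "BookRestaurant", "BookRestaurant"]

# Vectorizer and classifier in one pipeline, so TF-IDF is refit per CV fold.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SGDClassifier(loss="hinge", penalty="l2", random_state=0)),
])

param_grid = {"clf__alpha": [1e-4, 1e-3, 1e-2]}
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(queries, intents)

print(search.best_params_)
print(search.predict(["play a rock song"]))
```

Keeping the vectorizer inside the pipeline avoids leaking vocabulary statistics from the held-out fold into training.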
2. Slot Filling or Entity Recognition
Here we have tried Conditional Random Field and a simple deep learning model to identify slots/entities. Since some of the “O” tagged words fall under stop words, we have not removed stop words while doing pre-processing.
2.1. Conditional Random Field
Here each input sequence is vectorized by computing the following features for each of its words.
- length of the word
- isupper
- islower
- istitle
- isdigit
- whether the word is a stop word or not
(for the detailed featurization please refer to GitHub repo)
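The features listed above can be sketched as one dictionary per word. This is an illustration only: the function names are hypothetical, the stop-word set is a tiny stand-in for a full list, and the repository's featurization may contain more features than these.

```python
# Tiny illustrative stop-word set; a real run would use a full list.
STOP_WORDS = {"a", "an", "the", "in", "on", "to", "for", "of", "me", "i"}


def word2features(words, i):
    """Feature dict for the word at position i of a tokenized query."""
    word = words[i]
    return {
        "length": len(word),
        "isupper": word.isupper(),
        "islower": word.islower(),
        "istitle": word.istitle(),
        "isdigit": word.isdigit(),
        "is_stop_word": word.lower() in STOP_WORDS,
    }


def sent2features(words):
    """One feature dict per word, in order."""
    return [word2features(words, i) for i in range(len(words))]


features = sent2features("book a table for 2".split())
print(features[1])
```

A list of feature dicts per sentence is the input shape that CRF libraries such as sklearn-crfsuite accept; whether the original code uses that library is an assumption here.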
classification report for CRF.
2.2. Deep Learning model
Here we have used a Bidirectional LSTM to process the sequence information. Pre-processing of the sequences and tags is done using Tokenizer from the TensorFlow package.
The model is trained for 8 epochs, with “adam” as the optimizer, “categorical_crossentropy” as the loss function, and “accuracy” as the metric. At the 8th epoch the model achieved:
loss: 0.0643 — accuracy: 0.9855 — val_loss: 0.1055 — val_accuracy: 0.9727
If we compare the F1 scores of the CRF and the deep learning model, the CRF is the winner, with higher F1 scores of 0.88, 0.82 and 0.88 for the micro, macro and weighted averages respectively.
3. Joint Model using BERT
BERT has been observed to perform well on a variety of NLU tasks, as seen on the SuperGLUE benchmark leaderboard. BERT takes the entire input into account, which enables it to understand queries better.
Pre-Processing:
Labels are encoded using LabelEncoder from the sklearn package.
Input sequences are tokenized using BertTokenizer.
tokenizer = BertTokenizer.from_pretrained(BERT_MODEL)
Output sequences are tokenized by the below code snippet.
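One common way to prepare slot labels for a subword tokenizer is sketched below; it is an assumption about the general technique, not necessarily what the original snippet does. Each word's tag is assigned to its first subword and the remaining subwords receive a padding id. `align_slot_labels`, `toy_tokenize` and the `tag2id` mapping are hypothetical; `toy_tokenize` only stands in for a real WordPiece tokenizer such as `tokenizer.tokenize`.

```python
def align_slot_labels(words, tags, tokenize, tag2id, pad_id=0):
    """Expand word-level tags to subword level.

    The first subword of each word keeps the word's tag id; the remaining
    subwords get pad_id so they can be masked or ignored later.
    """
    tokens, label_ids = [], []
    for word, tag in zip(words, tags):
        pieces = tokenize(word)
        if not pieces:  # skip words the tokenizer drops entirely
            continue
        tokens.extend(pieces)
        label_ids.extend([tag2id[tag]] + [pad_id] * (len(pieces) - 1))
    return tokens, label_ids


def toy_tokenize(word):
    # Stand-in for a WordPiece tokenizer: split long words into two pieces.
    return [word] if len(word) <= 4 else [word[:4], "##" + word[4:]]


tag2id = {"[PAD]": 0, "O": 1, "B-artist": 2}
tokens, label_ids = align_slot_labels(
    ["play", "beyonce"], ["O", "B-artist"], toy_tokenize, tag2id
)
print(tokens, label_ids)
```

When the tokenizer also adds the special tokens [CLS] and [SEP], the label sequence needs a padding id at those positions as well, so that tokens and labels stay the same length.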
Model:
JointIntentAndSlotFillingModel takes total_intent_no, total_slot_no, the pre-trained BERT model name, and dropout_prob as input.
TFBertModel gives 2 outputs:
- Sequence output of the shape (batch_size, sequence_length, 768)
- The pooled output of the shape (batch_size, 768)
The sequence output is a representation of each token, whereas the pooled output is an encoded representation of the entire sentence. Hence slot filling is done using the sequence output, and intent classification is done using the pooled output.
The model is trained using “Adam” as the optimizer, and it minimizes SparseCategoricalCrossentropy for both the intent classification and slot filling tasks. SparseCategoricalAccuracy is used as the metric to measure model performance.
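The shapes involved in the two heads, and the way the two losses combine, can be illustrated with NumPy. This is a sketch under stated assumptions, not the model code: random arrays stand in for the BERT outputs, each head is a single randomly initialised linear layer, the number of slot labels is arbitrary, and the cross-entropy is computed from logits. The total is taken as the plain sum of the two task losses, which is consistent with the loss values logged for this model.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, hidden = 2, 5, 768
n_intents, n_slots = 7, 10  # 7 intents as in the data; n_slots is arbitrary

# Random stand-ins for the two BERT outputs.
sequence_output = rng.normal(size=(batch_size, seq_len, hidden))
pooled_output = rng.normal(size=(batch_size, hidden))

# One linear head per task.
w_slot = rng.normal(scale=0.02, size=(hidden, n_slots))
w_intent = rng.normal(scale=0.02, size=(hidden, n_intents))
slot_logits = sequence_output @ w_slot    # (batch, seq_len, n_slots)
intent_logits = pooled_output @ w_intent  # (batch, n_intents)


def sparse_categorical_crossentropy(logits, labels):
    """Mean cross-entropy between logits and integer class labels."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, labels[..., None], axis=-1)
    return float(-picked.mean())


# Random integer labels: one per token for slots, one per query for intents.
slot_labels = rng.integers(0, n_slots, size=(batch_size, seq_len))
intent_labels = rng.integers(0, n_intents, size=(batch_size,))

slot_loss = sparse_categorical_crossentropy(slot_logits, slot_labels)
intent_loss = sparse_categorical_crossentropy(intent_logits, intent_labels)
total_loss = slot_loss + intent_loss
print(slot_logits.shape, intent_logits.shape, total_loss)
```

Because both heads share the same encoder, the gradient of the summed loss updates the shared representation for both tasks at once, which is what makes the model joint.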
The model is trained for 9 epochs (an early stopping callback is added to monitor progress), and at the end the model achieved:
loss: 0.0226 — output_1_loss: 0.0171 — output_2_loss: 0.0055 — output_1_accuracy: 0.9954 — output_2_accuracy: 0.9984 — val_loss: 0.0855 — val_output_1_loss: 0.0248 — val_output_2_loss: 0.0606 — val_output_1_accuracy: 0.9939 — val_output_2_accuracy: 0.9886
Below are training plots for the Joint BERT model:
Classification report:
The joint model has performed better than the baseline models, with:
0.96 micro average
0.86 macro average
0.96 weighted average
98% accuracy on intent classification
Reason for good performance:
- BERT is a language model pre-trained on huge amounts of unlabeled text (English Wikipedia and BooksCorpus for the original English models).
- BERT gives good representation for input.
- It takes the entire sentence into account, hence it gives good context information.
The downside of using BERT:
- Training or fine-tuning BERT in practice requires a GPU.
9. Future Work
- Training the same model on a different dataset.
- Fine-tuning the BERT layers.
10. GitHub Repository
Code is available in my GitHub Repository. Kindly have a glance.
11. References