Grammatical Error Correction using Deep Learning
Table of contents
- Introduction
- Why Grammatical Error Correction (GEC)
- Formulating as Business Problem
- Formulating as ML/DL Problem
- Literature Survey
- Dataset Overview
- Data Analysis/ Exploratory Data Analysis
- Data Preparation
- Baseline models
- Attention Mechanism
- Model
- Training statistics
- Model output
- Results
- Inference
- Future work
- Github repository
1. Introduction
Grammatical error correction (GEC), as the name suggests, is the process of detecting and correcting different kinds of errors in a sentence, such as spelling, punctuation, grammatical, and word-choice errors.
The problem seems easy, but it is hard because of the diverse vocabulary and set of rules in a language.
Natural Language Processing (NLP) refers to processing human languages using computational algorithms, and GEC has gained popularity as an NLP task.
2. Why Grammatical Error Correction (GEC)
What do you think when you read this sentence: “i started to study spanish recently .”?
If you know English, you will say the sentence is grammatically wrong, and that the corrected sentence is “I recently started studying Spanish .”
It’s effortless for a human with considerable knowledge of the English language to correct an incorrect sentence, but can we make computers replicate the same behavior?
Now you might be wondering: why do we need to do that? In our day-to-day life we use computers to draft emails to clients, post messages on social media, and write cover letters or resumes for our dream jobs. Any grammatical errors or mistakes make our writing look unprofessional. We want to avoid this and make a good impression.
Now the question is how to do this?
In NLP, there is a wide range of techniques for this, from classical rule-based methods to SOTA deep learning models.
As part of the literature survey, classical rule-based and syntax-based methods will be explained. The major focus of this blog will be on building an end-to-end Grammatical Error Correction system using SOTA deep learning techniques.
3. Formulating as Business Problem
3.1. Problem Definition:
Grammatical error correction is the process of detecting and correcting erroneous words in a given sentence.
A GEC system takes erroneous sentences as input and is expected to transform them into grammatically and syntactically correct sentences.
Example 1:
Input: “i still pain my calf muscles .”
Output: “I still feel pain in my calf muscles”
Example 2:
Input: “finally, i had done my exam”
Output: “finally, I finished my exam”
3.2. Use cases:
- A GEC system can be used as a post-editor for many applications such as machine translation, word processors, Outlook, Gmail, etc.
- Native speakers as well as second-language learners can use a GEC system as a writing tool to help them improve their writing.
3.3. Business Constraints:
- Strict latency constraint: Since this is an interactive application, correction of incorrect sentences should be fast.
- Interpretability: The user is not going to ask questions about model prediction, but it is good to have an interpretable model, as it helps to understand model behavior.
4. Formulating as ML/DL Problem:
Grammatical Error Correction falls under Natural Language Processing, as we deal with human languages.
Grammatical Error Correction can be framed as a sequence-to-sequence translation problem: the model takes an erroneous sentence (containing misspelled or syntactically incorrect words) as input and produces a corrected sentence (a grammatically and syntactically correct sequence) as output.
Here we are trying to minimize the sparse categorical cross-entropy loss.
Model performance is measured using the GLEU score (Generalized Language Evaluation Understanding).
GLEU is a string-matching (n-gram overlap) metric, in the spirit of basic machine translation metrics. To compute the GLEU score, the following are required:
- one or more human reference corrections that were not used in building the system, and the predictions from the model.
- Since it is a string-matching metric, it is recommended to evaluate on a reasonably large number of samples (on the order of 1,000 or more) to get reliable results.
The GLEU value lies between 0 and 1; the closer to 1, the more overlap there is between the human reference and the predicted output.
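The post reports GLEU scores but does not show how the metric is computed. Below is a minimal sketch using NLTK's GLEU implementation; the example sentences are illustrative only.

```python
# Minimal sketch of GLEU scoring with NLTK (illustrative sentences, not from the post).
from nltk.translate.gleu_score import sentence_gleu, corpus_gleu

reference = "I recently started studying Spanish .".split()
hypothesis = "I started studying Spanish recently .".split()

# sentence_gleu takes a list of reference token lists and one hypothesis.
print(sentence_gleu([reference], hypothesis))

# corpus_gleu aggregates n-gram match counts over a whole test set.
print(corpus_gleu([[reference]], [hypothesis]))
```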
5. Literature Survey
There are multiple approaches to GEC in NLP.
Rule-Based Approach: It has a predefined set of rules, i.e., error patterns to match against the text. Text is considered erroneous if it matches one of these rules.
- Advantage:
1. Simple to implement
2. Very precise and requires minimal data.
3. Easy to understand and extend
4. Rules can be added incrementally
5. Works well for simple languages with minimal grammar rules.
- Disadvantage:
1. Requires extensive manual work and expertise.
2. As grammar becomes complex, rules become complex.
3. All or many errors cannot be handled by rules.
4. As rules increase, it becomes difficult to maintain and interpret.
Syntax-Based Approach: Here the morphology and syntax of the text are fully analyzed. It requires a lexical database, a morphological analyzer, and a parser. Depending on a language’s grammar, the parser assigns a syntactic tree structure to each sentence. If complete parsing is not successful, the text is considered erroneous.
- Disadvantages:
- It cannot tell the user what the exact error is. To overcome this problem, extra rules are needed.
Machine Learning Approach:
There are two types of machine learning approaches for GEC, they are:
1. Statistical models
2. Classification models
Statistical Models: These models require a very large amount of data to learn error patterns. They utilize a part-of-speech (POS) annotated corpus, tokens, and lemmas to build patterns. Common, more probable sequences that occur often in the corpus can be considered correct, whereas uncommon sequences could contain errors. (They work like: what is the probability of seeing a noun when the previous word is a determiner, i.e. P(Noun | Determiner)? A toy sketch of this idea follows the lists below.)
- Advantage:
1. Grammar rules can be built automatically
2. Deep knowledge of grammar is not needed
3. Can be used to develop language-independent systems
- Disadvantage:
1. Requires a very large amount of data.
2. It fails to address/correct unseen error patterns.
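To make the statistical idea concrete, here is a toy sketch (not from the original post) that estimates POS tag-transition probabilities from a tiny hand-tagged corpus; low-probability transitions are the ones a statistical checker would flag.

```python
# Toy sketch: estimate P(tag_i | tag_{i-1}) by counting tag bigrams in a
# hand-tagged, illustrative corpus.
from collections import Counter, defaultdict

tagged_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("the", "DET"), ("dog", "NOUN")],  # erroneous repeated determiner
]

bigram_counts = defaultdict(Counter)
for sentence in tagged_corpus:
    tags = [tag for _, tag in sentence]
    for prev, curr in zip(tags, tags[1:]):
        bigram_counts[prev][curr] += 1

def transition_prob(prev_tag, curr_tag):
    total = sum(bigram_counts[prev_tag].values())
    return bigram_counts[prev_tag][curr_tag] / total if total else 0.0

# P(NOUN | DET) is high, P(DET | DET) is low, so "the the" looks suspicious.
print(transition_prob("DET", "NOUN"))  # 0.75
print(transition_prob("DET", "DET"))   # 0.25
```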
Classification Model: Here a classifier is trained on well-formed text to learn the usage of specific word classes, in order to handle specific error types. Along with the actual text, these models require features for better performance, such as POS tags, n-grams, and grammatical relations. These features are embedded into vectors, and classifiers such as maximum entropy, Naive Bayes, and support vector machines are trained on them.
- Disadvantage:
1. Classifiers cannot handle dependent errors.
2. Requires handcrafting features, which is time-consuming and needs expert knowledge.
Deep Learning Approach:
Recurrent Neural Network: RNNs are deep neural networks. Unlike classical machine learning models, deep learning models do not require manual feature engineering, because neural networks can learn features automatically; this overcomes the disadvantages of the methods mentioned above, since feature engineering requires domain knowledge and is time-consuming. However, these models require a large amount of data. Grammatical error detection and correction can be treated as a sequence-to-sequence translation problem, where an erroneous sentence is translated into the correct version of the sentence in the same language.
Approach: This approach uses an encoder-decoder architecture with an attention mechanism. The encoder maps the input into a high-level representation using RNNs. The decoder, which is also an RNN, uses a content-based attention mechanism and produces the output sentence by generating one word at a time.
This blog mainly focuses on solving Grammatical Error Correction using a Deep Neural Network.
6. Dataset Overview
The Lang-8 dataset has been used here (source).
The data is in .m2 file format. In a .m2 file, a line starting with S is an original/incorrect sentence, and the lines starting with A are the corrections made by an annotator (0, 1, 2, 3, 4); applying these annotations gives the corrected sentence. There are a total of 1,037,562 sentences.
7. Data Analysis/ Exploratory Data Analysis
7.1. Extract Data into a pandas data frame
To convert the .m2 file into a pandas DataFrame, and for the sake of simplicity (considering computational resources), we keep only the annotations from annotator 0. The code is referred from here; a rough sketch of the idea is shown below.
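This is only a hedged sketch of parsing an .m2 file into a DataFrame (the original post uses the referenced implementation). The file path and column names are assumptions.

```python
# Hedged sketch: read a .m2 file and apply annotator 0's edits to build
# (incorrect, correct, total_edits) rows. Path and column names are assumptions.
import pandas as pd

def parse_m2(path, annotator_id="0"):
    rows = []
    with open(path, encoding="utf-8") as f:
        blocks = f.read().strip().split("\n\n")
    for block in blocks:
        lines = block.split("\n")
        source_tokens = lines[0][2:].split()        # "S ..." line: original sentence
        corrected, offset, n_edits = list(source_tokens), 0, 0
        for line in lines[1:]:                       # "A ..." lines: annotated edits
            fields = line[2:].split("|||")
            span, error_type, replacement, annotator = fields[0], fields[1], fields[2], fields[-1]
            if annotator != annotator_id or error_type == "noop":
                continue
            start, end = map(int, span.split())
            repl_tokens = replacement.split() if replacement not in ("", "-NONE-") else []
            corrected[start + offset:end + offset] = repl_tokens
            offset += len(repl_tokens) - (end - start)
            n_edits += 1
        rows.append({"incorrect": " ".join(source_tokens),
                     "correct": " ".join(corrected),
                     "total_edits": n_edits})
    return pd.DataFrame(rows)

# df = parse_m2("lang8.train.m2")   # the path is an assumption
```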
7.2. Data Preprocessing
Since we are dealing with text data, basic preprocessing such as removing special characters and digits, stripping whitespace, lowercasing, and expanding contractions (decontraction) is carried out. Duplicates, null values, and single-letter sentences (which do not add much value) are removed.
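A minimal sketch of these preprocessing steps, assuming the DataFrame from the parsing sketch above; the contraction map is deliberately tiny and illustrative.

```python
# Basic text cleaning: lowercase, decontract, drop digits/special characters.
import re

CONTRACTIONS = {"won't": "will not", "can't": "cannot", "n't": " not",
                "'re": " are", "'ve": " have", "'ll": " will", "'m": " am"}

def preprocess(sentence):
    sentence = sentence.lower().strip()
    for short, full in CONTRACTIONS.items():         # decontraction
        sentence = sentence.replace(short, full)
    sentence = re.sub(r"[^a-z\s]", " ", sentence)    # drop digits / special characters
    return re.sub(r"\s+", " ", sentence).strip()

# df = df.drop_duplicates().dropna()
# df = df[df["incorrect"].str.len() > 1]             # drop single-letter sentences
# df["incorrect"] = df["incorrect"].apply(preprocess)
# df["correct"] = df["correct"].apply(preprocess)
print(preprocess("I can't wait 2 start!!"))          # -> "i cannot wait start"
```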
7.3. Univariate Analysis
7.3.1. Analysis of sentence length
- Incorrect Sentences
Incorrect sentence lengths do not follow a Gaussian distribution. Most data points have lengths between 0 and 15, and the majority have a length less than or equal to 50; in the CDF we can see that 99% of data points have a length less than or equal to 50.
- Correct Sentences
Similar to the incorrect sentences, correct sentence lengths also do not follow a Gaussian distribution; the majority of data points have a length between 0 and 15, and 99% have a length less than or equal to 50.
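The plots in this section were produced with standard tooling; here is a small assumed helper (not the post's exact code) that draws the PDF/CDF of a length column and prints percentiles.

```python
# Sketch of the length analysis: PDF / CDF plots plus percentile values.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def plot_length_stats(lengths, title):
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    sns.histplot(lengths, stat="density", ax=axes[0]).set_title(f"{title} PDF")
    sns.ecdfplot(lengths, ax=axes[1]).set_title(f"{title} CDF")
    plt.show()
    print(np.percentile(lengths, [50, 90, 99, 99.9, 100]))

# Example with the DataFrame from the parsing sketch above:
# plot_length_stats(df["incorrect"].str.split().str.len(), "incorrect sentence length")
```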
7.3.2. Basic Statistics
- Incorrect Sentence
- Correct Sentence
The output for both categories (incorrect and correct) is self-explanatory. In both categories, the most common sentence length is 8.
7.3.3. Percentile values
- Incorrect Sentences
We can see that the minimum length of the data points is 1 and the maximum is 310. Only 0.1% of data points have a length greater than or equal to 310, and 50% have a length less than or equal to 10.
- Correct Sentences
We can see that the minimum length of the data points is 1 and the maximum is 487. Only 0.1% of data points have a length greater than or equal to 487, and 50% have a length less than or equal to 11.
In both categories, after investigating the points whose length is above the 99th percentile, it was found that they are genuine English sentences, not anomalies or outliers.
7.3.4. Analysis of total edits
“total_edits” represents the number of edits made to transform incorrect sentences into correct sentences.
From the above graphs, we can conclude that the majority of the data points have total_edits equal to 0, i.e., most of the “incorrect” sentences are already correct. The entire dataset has total edits less than or equal to 15.
7.4. Bivariate Analysis
7.4.1. How are correct and incorrect sentences(word) lengths distributed?
Here we can see that the PDFs overlap, so we can conclude that incorrect_word_len and correct_word_len follow the same distribution.
In the right figure, they are almost linearly correlated: as the incorrect sentence length increases, the correct sentence length also increases.
7.4.2. incorrect sentence length v/s total edits
Incorrect sentence length and total_edits are not perfectly linearly correlated. When the sentence length is between 0 and 50, total_edits and sentence length are only slightly linearly related; when the length is between 250 and 310, the relationship is closer to linear.
8. Data Preparation
In the above analysis we saw that incorrect sentence lengths range from 1 to 310; considering computational power, we limit the maximum sequence length to 10.
Below is the “total edit” analysis of the filtered dataset. (filtered based on maximum sequence length)
About half of the data points in our dataset have total edits equal to 0, i.e., the incorrect sentence is identical to the correct sentence. This would bias the model towards returning the input unchanged, so we need to reduce this count.
Here we drop 98% of the data points whose total_edits is 0. After this, 22.37% of the data points have total_edits = 1, 12.43% have total_edits = 2, and 1.13% have total_edits = 0. We can also observe that only very few data points have more than 6 edits.
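A hedged pandas sketch of this filtering step; the column names follow the earlier sketches, and the drop fraction (98%) comes from the post.

```python
# Cap the sequence length at 10 tokens and drop 98% of the zero-edit rows.
def filter_dataset(df, max_len=10, drop_frac=0.98, seed=42):
    lengths_ok = (df["incorrect"].str.split().str.len() <= max_len) & \
                 (df["correct"].str.split().str.len() <= max_len)
    df = df[lengths_ok]
    zero_edit = df[df["total_edits"] == 0]
    return df.drop(zero_edit.sample(frac=drop_frac, random_state=seed).index)

# df = filter_dataset(df)
# print(df["total_edits"].value_counts(normalize=True))
```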
As the next step, the data is divided into train, validation, and test sets.
For our current sequence to sequence transformation problem, encoder-decoder architecture seems to be a suitable one. Hence data is prepared as follows:
- Each correct sentence is prefixed with the “<start>” token; this version is used as the input to the decoder while training (teacher forcing technique).
- Each correct sentence is suffixed with the “<end>” token; this version is used as the ground truth.
Example: correct sentence: “how are you”
decoder input: “<start> how are you”
ground truth: “how are you <end>”
Now we need to tokenize the dataset; this is done using the Tokenizer API from the TensorFlow (Keras) library, where every unique word gets a unique integer representation.
We use two tokenizers, one for the encoder and another for the decoder. The encoder tokenizer takes the incorrect sentences as input, which contain words in different forms, misspelled words, erroneous words, etc. The decoder tokenizer takes the correct sentences as input. Hence the encoder vocabulary will be larger than the decoder vocabulary.
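A sketch of the decoder-input / ground-truth construction and the two Keras tokenizers; the sentences here are illustrative and the exact options are assumptions.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

incorrect = ["i started to study spanish recently"]         # encoder input
correct = ["i recently started studying spanish"]

decoder_input = ["<start> " + s for s in correct]            # teacher forcing input
decoder_target = [s + " <end>" for s in correct]             # ground truth

# filters="" keeps "<" and ">" so the special tokens survive tokenization
encoder_tok = Tokenizer(filters="", oov_token="<unk>")
decoder_tok = Tokenizer(filters="", oov_token="<unk>")
encoder_tok.fit_on_texts(incorrect)
decoder_tok.fit_on_texts(decoder_input + decoder_target)

MAX_LEN = 10
enc_seq = pad_sequences(encoder_tok.texts_to_sequences(incorrect),
                        maxlen=MAX_LEN, padding="post")
dec_in = pad_sequences(decoder_tok.texts_to_sequences(decoder_input),
                       maxlen=MAX_LEN, padding="post")
dec_out = pad_sequences(decoder_tok.texts_to_sequences(decoder_target),
                        maxlen=MAX_LEN, padding="post")
```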
Word Embeddings: FastText is used to obtain word embeddings. FastText works at the subword level (character n-grams), so it can provide a representation even for out-of-vocabulary, unseen, misspelled, or otherwise erroneous words.
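A hedged sketch of building embedding matrices from pre-trained FastText vectors; the model file "cc.en.300.bin" is an assumption (any FastText .bin model works), and the tokenizers are the ones from the previous sketch.

```python
# Build an embedding matrix for a Keras tokenizer from FastText vectors.
import numpy as np
import fasttext

ft = fasttext.load_model("cc.en.300.bin")            # assumed 300-d FastText model
embedding_dim = ft.get_dimension()

def build_embedding_matrix(tokenizer, dim):
    vocab_size = len(tokenizer.word_index) + 1        # +1 for the padding index 0
    matrix = np.zeros((vocab_size, dim))
    for word, idx in tokenizer.word_index.items():
        # subword n-grams give a vector even for misspelled / unseen words
        matrix[idx] = ft.get_word_vector(word)
    return matrix

# encoder_embeddings = build_embedding_matrix(encoder_tok, embedding_dim)
# decoder_embeddings = build_embedding_matrix(decoder_tok, embedding_dim)
```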
9. Baseline models
An encoder-decoder model consists of either LSTM or GRU units.
The encoder takes the input sequence and produces a high-level representation (a summary) of it in the form of its final hidden and cell states. These states are used to initialize the decoder state; the decoder then takes the “start of sentence” token as its initial input and predicts the next word in the sequence at each step, using its prior predictions.
For training, the teacher forcing technique is used. Teacher forcing is a strategy for training recurrent neural networks that feeds the ground truth as input at each time step, instead of the model’s prediction from the prior time step.
Teacher forcing is a fast and efficient way to train RNNs.
Here I have experimented with the character level Encoder-Decoder model and Vanilla Encoder-Decoder model.
Character-level Encoder-Decoder: Here the input and output tokens are characters. That is, the encoder takes one character at a time as input, and the decoder predicts one character at a time.
Vanilla Encoder-Decoder model: Here the input and output tokens are words. The encoder takes a sequence of words as input, and the decoder predicts a sequence of words (one word at a time).
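For orientation, here is a minimal word-level encoder-decoder sketch in Keras trained with teacher forcing. The vocabulary sizes and dimensions are placeholders, not the exact settings of the baseline; the full variants are in the repository linked below.

```python
# Minimal vanilla (no attention) encoder-decoder with teacher forcing.
import tensorflow as tf

enc_vocab, dec_vocab, emb_dim, units, max_len = 5000, 4000, 300, 64, 10

# Encoder: summarize the incorrect sentence into final hidden / cell states.
enc_inputs = tf.keras.Input(shape=(max_len,))
enc_emb = tf.keras.layers.Embedding(enc_vocab, emb_dim)(enc_inputs)
_, state_h, state_c = tf.keras.layers.LSTM(units, return_state=True)(enc_emb)

# Decoder: starts from the encoder states and sees the ground-truth previous
# word at every step (teacher forcing).
dec_inputs = tf.keras.Input(shape=(max_len,))
dec_emb = tf.keras.layers.Embedding(dec_vocab, emb_dim)(dec_inputs)
dec_out, _, _ = tf.keras.layers.LSTM(
    units, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
outputs = tf.keras.layers.Dense(dec_vocab, activation="softmax")(dec_out)

model = tf.keras.Model([enc_inputs, dec_inputs], outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```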
The codebase for the above variants can be checked in my GitHub profile.
10. Attention Mechanism
This raises a question: what is an attention mechanism, and why do we need one?
There are a few flaws in the vanilla encoder-decoder model that an attention model overcomes, and there are a few drawbacks of the attention model that the Transformer model overcomes. Here we limit ourselves to an attention-based model.
Word-to-word translation doesn’t work in most cases, as most languages don’t share a common grammatical structure. Hence NLP models like RNNs usually capture all the information in the input sentence (the objects and how they relate to each other) in an intermediate state, and this intermediate representation is used by the decoder for translation.
For very long input sentences, this single intermediate state fails because it is not sufficient to capture all the information.
Even for a human translator,
India, officially the Republic of India is a country in South Asia. It is the seventh-largest country by area, the second-most populous country, and the most populous democracy in the world. Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west; China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east. In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand, Myanmar and Indonesia.
translating the above paragraph at a single glance is very hard.
A human translator can go through the whole paragraph and get the gist out of it. And then the translator will divide a paragraph into chunks (based on full stop, comma, etc in any convenient way) and start to translate.
Neural networks are loosely inspired by the human brain, so we want to replicate this human-translator behavior in our neural model while translating a sentence, and this can be achieved using attention-based neural models.
Now let's dive deeper into attention-based models. Here the encoder behaves much as it does in a vanilla encoder-decoder model, except that it exposes its hidden state at every time step in addition to the final hidden and cell states, which are used to initialize the decoder. In the decoder, a context vector is computed at every time step; it holds the relevant information from the encoder (which words need more focus/attention) for predicting the current word.
Now let’s see how the context vector is computed.
1. At any time step, we have the hidden states from the encoder (ht) and the previous decoder hidden state (hs). For the decoder's initial time step, the encoder's last hidden state is used as the previous decoder hidden state.
2. We compute a similarity score between the encoder hidden states (ht) and the previous decoder hidden state (hs), for example by taking a dot product between ht and hs.
3. The general scoring function and the concat scoring function correspond to Luong's attention mechanism and Bahdanau's attention mechanism, respectively. In my case study, I have used Bahdanau's attention mechanism from the TensorFlow library.
4. The output of the scoring function is passed through a softmax layer to obtain a proper, more interpretable probability distribution. These are the alpha values (attention weights).
5. The alpha values are used to compute a weighted sum of the encoder hidden states, which gives the context vector for that time step.
The above steps are repeated until the <EOS> token is predicted or the maximum sentence length is reached.
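To make steps 1-5 concrete, here is a sketch of an additive (Bahdanau-style) attention layer in the spirit of the standard TensorFlow tutorial code. The post itself uses a library implementation, so treat this as illustrative.

```python
# Bahdanau (additive) attention: score -> softmax -> weighted sum of encoder
# states = context vector for the current decoder step.
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)   # projects encoder states (ht)
        self.W2 = tf.keras.layers.Dense(units)   # projects decoder state (hs)
        self.V = tf.keras.layers.Dense(1)

    def call(self, decoder_state, encoder_outputs):
        # decoder_state: (batch, units); encoder_outputs: (batch, src_len, units)
        hs = tf.expand_dims(decoder_state, 1)
        score = self.V(tf.nn.tanh(self.W1(encoder_outputs) + self.W2(hs)))
        alphas = tf.nn.softmax(score, axis=1)                  # attention weights
        context = tf.reduce_sum(alphas * encoder_outputs, axis=1)
        return context, alphas
```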
Types of Attention:
- Global Attention: pays attention to all source words, i.e., to all hidden states of the encoder, while decoding/translating.
- Local Attention: pays attention to only a few words in the source sentence, i.e., to a few hidden states of the encoder, while decoding/translating.
11. Model
Here we have used the TensorFlow library to implement the attention-based encoder-decoder model.
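The full implementation is in the linked repository; the sketch below only outlines one way to wire such a model with the Keras functional API, using AdditiveAttention (Keras' Bahdanau-style layer) for brevity. The sizes mirror the statistics in the next section, and in the actual experiments the embedding layers are initialized from the FastText matrices described earlier.

```python
# Attention-based encoder-decoder: attend over all encoder states at every
# decoder step, then predict the next word from [decoder state; context].
import tensorflow as tf

enc_vocab, dec_vocab, emb_dim, units, max_len = 37462, 29022, 300, 64, 10

enc_in = tf.keras.Input(shape=(max_len,), name="encoder_input")
enc_emb = tf.keras.layers.Embedding(enc_vocab, emb_dim)(enc_in)
enc_seq, state_h, state_c = tf.keras.layers.LSTM(
    units, return_sequences=True, return_state=True)(enc_emb)

dec_in = tf.keras.Input(shape=(max_len,), name="decoder_input")
dec_emb = tf.keras.layers.Embedding(dec_vocab, emb_dim)(dec_in)
dec_seq, _, _ = tf.keras.layers.LSTM(
    units, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])

# For every decoder step, compute a context vector over all encoder states.
context = tf.keras.layers.AdditiveAttention()([dec_seq, enc_seq])
concat = tf.keras.layers.Concatenate()([dec_seq, context])
outputs = tf.keras.layers.Dense(dec_vocab, activation="softmax")(concat)

model = tf.keras.Model([enc_in, dec_in], outputs)
model.summary()
```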
12. Training Statistics
encoder_vocab_size = 37462
decoder_vocab_size = 29022
embedding_dim = 300 (from FastText)
input_length = 10
lstm_size=64
batch_size=512
optimizer = Adam (from Tensorflow)
loss function = Sparse Categorical Crossentropy (from Tensorflow)
Here we have experimented with the embedding layer (whose weights are derived from FastText) in two settings: with trainable weights and with frozen (non-trainable) weights.
The models were trained for up to 80 epochs with EarlyStopping, ModelCheckpoint, ReduceLROnPlateau, and TensorBoard as callbacks from TensorFlow.
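A sketch of a training setup matching these statistics; it assumes the `model` and padded arrays from the earlier sketches, and the checkpoint/log paths and callback settings are placeholders.

```python
# Compile and fit with the callbacks listed above.
import tensorflow as tf

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True),
    tf.keras.callbacks.ModelCheckpoint("best_model.h5", save_best_only=True),
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2),
    tf.keras.callbacks.TensorBoard(log_dir="logs"),
]

model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss=tf.keras.losses.SparseCategoricalCrossentropy())

# history = model.fit([train_enc_seq, train_dec_in], train_dec_out,
#                     validation_data=([val_enc_seq, val_dec_in], val_dec_out),
#                     epochs=80, batch_size=512, callbacks=callbacks)
```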
13. Model Output
The left figure is from model1, where the embedding layer is trainable, and the right figure is from model2, where the embedding layer is non-trainable.
14. Results
We can see that the GLEU scores of both models are comparable.
15. Inference
We need to make certain changes to the model to predict output sequences at inference time. This is because the “teacher forcing” technique was used while training the model: at inference time the ground-truth previous word is not available, so the decoder has to consume its own previous prediction.
There are basically two ways to decode the output:
- Greedy Search
- Beam Search
Greedy Search: Here for every timestep, we consider only one token with maximum probability and ignore the remaining tokens, even though they have comparable probability scores after the softmax layer.
This is a fast, inexpensive algorithm, but the output is not always accurate or optimal, because if the prediction at any time step goes wrong, the entire remaining output is affected.
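A hedged greedy-decoding sketch that reuses the training-time model by feeding its own predictions back in; `model`, `encoder_tok`, and `decoder_tok` are the objects from the earlier sketches, and the helper name is an assumption.

```python
# Greedy decoding: at each step keep only the single most probable token.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def greedy_decode(sentence, model, encoder_tok, decoder_tok, max_len=10):
    enc_seq = pad_sequences(encoder_tok.texts_to_sequences([sentence]),
                            maxlen=max_len, padding="post")
    dec_seq = np.zeros((1, max_len), dtype="int32")
    dec_seq[0, 0] = decoder_tok.word_index["<start>"]
    words = []
    for t in range(max_len - 1):
        probs = model.predict([enc_seq, dec_seq], verbose=0)[0, t]
        next_id = int(np.argmax(probs))              # keep only the best token
        word = decoder_tok.index_word.get(next_id, "")
        if next_id == 0 or word == "<end>":
            break
        words.append(word)
        dec_seq[0, t + 1] = next_id                  # feed the prediction back in
    return " ".join(words)

# print(greedy_decode("i started to study spanish recently", model,
#                     encoder_tok, decoder_tok))
```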
Beam Search: Here, at every time step, we keep the k most probable tokens rather than just one. The number kept, k, is called the beam width; so at every time step the k most probable candidates after the softmax layer are retained.
At inference time, k candidate sequences (hypotheses) are tracked in parallel. If k = 1, beam search is equivalent to greedy search.
This consumes more time and memory, but the results are generally better than with greedy search.
In my case study, I have used Beam search with beam width (k) = 3.
16. Future work
- Limited computational resources were the main problem I faced during this case study. In the future, I would like to run the same experiments on higher-end systems with a larger amount of data.
- The model can be trained for more epochs and tuned further, e.g., by increasing the number of LSTM units, the embedding dimension, or the batch size, or by trying the Luong attention mechanism.
- GRUs could also be tried, as they have fewer gates and fewer trainable parameters than LSTMs.
- GloVe and word2vec embeddings could be experimented with.
- Here I have used plain LSTM units; bidirectional LSTMs could be used instead.
- Train with more high-quality data.
- More complex models such as Transformer-based models can be used, which have already proven to be state-of-the-art in other sequence-to-sequence learning tasks like machine translation.
17. Github Repository
I have tried to summarize my work in this blog. For the full code, please check out my GitHub repository.
If you have read this far, please share your feedback; I will definitely consider it and learn from it. Thank you :)
I would like to thank the AppliedAI team for mentoring me during this case study.
Feel free to connect on LinkedIn.