Test German BERT Named Entity Recognition with Matlab (Updated)
Table of Contents
1 Overview
2 Using a Pretrained German BERT Model
  Load German BERT NER Model
  Load Data
  Set Parameters
  Get all Classes
  Setup Datastore
  BERT Predict
  Quality of BERT NER-Model
3 New Sentences
  BERT
  Compare to Matlab "addEntityDetails"
4 Backtest Matlab GermEval 2014
  Transform Matlab Labels
  Quality Matlab
5 Testing Quality of the New hmmEntityModel Function
  Train hmmEntity Model
  Test Quality of the hmmEntity Model
Supporting Functions
  Mini-batch Preprocessing Function
  Model Function
  Model Gradients Function
  Model Predictions Function
This script illustrates how to use a German BERT model for named entity recognition (NER) and compares the predictive performance of such a transformer model to that of the addEntityDetails function from MathWorks' Text Analytics Toolbox. As an update to the previous version of this blog, I also conduct a quality test of the new hmmEntityModel function, which was added to the Text Analytics Toolbox in Matlab R2023a.
Quite generally, as transformer models have revolutionized NLP in recent years, it is no surprise that the BERT NER model is superior to the Matlab functions: the addEntityDetails function is not based on a neural network, and the hmmEntityModel function is based on a purely statistical framework rather than on semantics. This script serves only to illustrate some NER uses of Matlab, as well as the power of transformer models for NLP, which can now be used in Matlab with ease.
MathWorks provides a BERT-base implementation, which we rely upon. However, as MathWorks currently does not provide a means to import pre-trained models from Python into Matlab, we use the toolbox "exportBERTtoMatlab", designed by Moritz Scherrmann (Institute of Finance & Banking, LMU Munich) for exactly this purpose. Since this toolbox allows importing essentially all TensorFlow and PyTorch BERT-base models that rely on the WordPiece tokenizer, the resulting expansion of powerful NLP methods directly available in Matlab is, in my view, just fantastic – I have wished for this for a long time.
2 Using a Pretrained German BERT Model
We use the GermanBERT model trained by deepset.ai and fine-tune it ourselves on the GermEval 2014 data. Alternatively, one could have used a model already fine-tuned on these data, but I wanted to see how the training works in Matlab (though this will not be discussed in this article).
Load German BERT NER Model
Load a pretrained BERT model. The model consists of a tokenizer that encodes text as sequences of integers, and a structure of parameters.
Load the example data. The file "dataTokenized.mat" contains the test, train, and dev splits of the GermEval 2014 competition, the most popular annotated NER data set for the German language.
Get all Classes
Show the NER labels, following the IOB coding for the categories "person", "location", "organization", and "other" ("O" marks tokens that are not part of any entity).
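To make the IOB scheme concrete, here is a small Python illustration (the sentence and tags are invented for demonstration): "B-" opens an entity, "I-" continues the entity of the same type, and "O" marks everything else.

```python
# IOB (inside-outside-beginning) coding for the GermEval entity categories.
tokens = ["Angela", "Merkel", "besuchte", "die", "Deutsche", "Bank", "in", "Berlin", "."]
labels = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "O", "B-LOC", "O"]

def is_valid_iob(seq):
    """A sequence is well-formed if every 'I-' tag continues an entity
    of the same type started immediately before it."""
    prev = "O"
    for tag in seq:
        if tag.startswith("I-") and prev[2:] != tag[2:]:
            return False
        prev = tag
    return True

print(is_valid_iob(labels))
```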
Create an array datastore containing the encoded tokens.
Make predictions using the modelPredictions function, listed at the end of the example. The data comprise 5,100 test sentences.
Quality of BERT NER-Model
To test model quality, we calculate the unweighted average F1 score across all classes, taking each token separately into account (i.e., Macro-F1).
Unsurprisingly, the transformer model achieves a Macro-F1 of 82.5% – a decent result.
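The metric itself is straightforward; a Python sketch of token-level Macro-F1 (the article computes the equivalent in Matlab, and the toy label sequences below are illustrative only):

```python
# Token-level Macro-F1: compute precision, recall, and F1 per class,
# then average the F1 scores with equal weight per class.
from collections import defaultdict

def macro_f1(y_true, y_pred):
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    classes = set(y_true) | set(y_pred)
    f1s = []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

y_true = ["B-PER", "I-PER", "O", "O", "B-LOC"]
y_pred = ["B-PER", "O",     "O", "O", "B-LOC"]
print(round(macro_f1(y_true, y_pred), 3))
```

Note that Macro-F1 weights rare classes as heavily as frequent ones, which is why a single missed "I-PER" token drags the average down noticeably in this toy example.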
3 New Sentences
An illustration of how to test unrelated sentences using BERT and the Matlab built-in function.
Create a mini-batch queue for the new data. Preprocess the mini-batches using the preprocessPredictors function, listed at the end of the example.
Make predictions using the modelPredictions function, listed at the end of the example.
Compare to Matlab "addEntityDetails"
4 Backtest Matlab GermEval 2014
For a more systematic analysis of the improvement due to the transformer model, let's analyze the performance of the built-in Matlab function on the test data from GermEval 2014. This requires translating the Matlab labels (such as "person" instead of "B-PER" or "I-PER") into the IOB coding. We first translate the Matlab labels into the "B-" categories and determine the true labels.
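The first translation step is a plain lookup from the coarse category names to "B-" tags. A Python sketch (the category strings below are illustrative; the exact label names produced by addEntityDetails may differ):

```python
# Map coarse entity categories, as returned by a generic entity detector,
# onto the GermEval "B-" tags. The key names here are assumptions for
# illustration, not the verified Matlab output strings.
COARSE_TO_IOB = {
    "person": "B-PER",
    "location": "B-LOC",
    "organization": "B-ORG",
    "other": "B-OTH",
    "non-entity": "O",
}

def to_b_tags(coarse_labels):
    # unknown categories fall back to "O"
    return [COARSE_TO_IOB.get(c, "O") for c in coarse_labels]

print(to_b_tags(["person", "non-entity", "location"]))
```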
Transform Matlab Labels
Here, we add "I-" labels to the Matlab results in a logically consistent sequence. For example, a sequence error like ["O", "I-PER"] will be changed to ["O", "B-PER"], and a sequence ["B-OTH", "B-OTH", "B-OTH", "B-ORG"] will be changed to ["B-OTH", "I-OTH", "I-OTH", "B-ORG"].
Note that this is a general correction that can be applied to any NER model. Zöllner et al. (2021) show that even for transformer models it can lead to an improvement of about 4 percentage points.
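The repair rule described above can be sketched in a few lines of Python (the article implements the equivalent in Matlab). One caveat worth noting: turning a repeated "B-" tag into "I-" merges adjacent same-type entities into one, which matches the examples above but is an approximation when two distinct entities of the same type genuinely touch.

```python
# Repair an IOB sequence in a single left-to-right pass:
#  - an "I-" tag that does not continue a same-type entity becomes "B-"
#  - a "B-" tag that immediately repeats the preceding type becomes "I-"
def repair_iob(tags):
    fixed, prev = [], "O"
    for tag in tags:
        if tag.startswith("I-") and prev[2:] != tag[2:]:
            tag = "B-" + tag[2:]      # e.g. ["O","I-PER"] -> ["O","B-PER"]
        elif tag.startswith("B-") and prev != "O" and prev[2:] == tag[2:]:
            tag = "I-" + tag[2:]      # e.g. ["B-OTH","B-OTH"] -> ["B-OTH","I-OTH"]
        fixed.append(tag)
        prev = tag
    return fixed

print(repair_iob(["O", "I-PER"]))
print(repair_iob(["B-OTH", "B-OTH", "B-OTH", "B-ORG"]))
```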
Macro-F1 measures the quality of the Text Analytics Toolbox function addEntityDetails. For the GermEval 2014 test data it is 54.26%. Using the BERT model thus leads to an improvement of about 28 percentage points, or roughly 52% in relative terms!
5 Testing Quality of the New hmmEntityModel Function
The new hmmEntityModel function was added in Matlab R2023a. HMM stands for Hidden Markov Model: the model assigns joint probabilities to paired observation and label sequences, and its parameters are trained to maximize the joint likelihood. The model is thus purely statistical. This has the benefit that it is language- and context-free, i.e., you can use it for all languages supported by the Text Analytics Toolbox tokenizer. Since Matlab also provides the training function trainHMMEntityModel, the new NER model also introduces the possibility of training your own model on arbitrary entities, which is not possible with the addEntityDetails function.
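To see why such a model is "purely statistical", consider its core: the joint probability factorizes as p(tokens, tags) = ∏ p(tag_i | tag_{i-1}) · p(token_i | tag_i), and the best tag sequence is recovered with Viterbi decoding. The Python sketch below is a minimal two-state illustration with made-up probabilities, not the trained Matlab model:

```python
# Minimal HMM tagger: transition probs p(tag|prev_tag), emission probs
# p(token|tag), decoded with Viterbi in log-space. All numbers are toy
# values chosen for illustration only.
import math

states = ["O", "B-PER"]
start = {"O": 0.8, "B-PER": 0.2}
trans = {"O": {"O": 0.9, "B-PER": 0.1},
         "B-PER": {"O": 0.7, "B-PER": 0.3}}
emit = {"O": {"besuchte": 0.6, "Merkel": 0.05},
        "B-PER": {"besuchte": 0.01, "Merkel": 0.5}}

def viterbi(tokens):
    # dp[s] = (best log-prob of any path ending in state s, that path)
    dp = {s: (math.log(start[s] * emit[s][tokens[0]]), [s]) for s in states}
    for tok in tokens[1:]:
        dp = {s: max(
                  (lp + math.log(trans[prev][s] * emit[s][tok]), path + [s])
                  for prev, (lp, path) in dp.items())
              for s in dp}
    return max(dp.values())[1]

print(viterbi(["Merkel", "besuchte"]))
```

Note that the emission table knows nothing about word meaning, only about co-occurrence counts, which is exactly why the approach transfers across languages but cannot exploit semantics the way BERT does.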
Train hmmEntity Model
Test Quality of the hmmEntity Model
We now test the quality of the trained German HMM model, using the same GermEval 2014 test set as in the previous analyses.
One tricky challenge is that the hmmEntityModel retokenizes the input text automatically, without the user being able to deactivate this behavior. As the retokenized tokens differ from the test data, the ground-truth labels can no longer be matched to the model's predictions. For the GermEval test set, for example, this leads to about 96,200 predicted labels versus 95,542 labels in the test data.
We thus have to align the predicted tokens to the test-data tokens while keeping the predicted labels. This is a somewhat painful task, but it can be achieved by noting that most (but not all) tokenization differences arise from non-word characters (anything other than alphabetic, numeric, or underscore characters), which can be identified by a regular expression using the "\W" class. Having identified the token differences per sentence, one can simply delete as many "\W" tokens as there are length differences between the test and the prediction sentence ("\W" tokens almost always carry an "O" label). This requires a sentence-by-sentence cleaning procedure, however (not documented here).
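The per-sentence cleaning step can be sketched as follows (a Python approximation of the idea, not the author's undocumented Matlab code). It only restores the label count per sentence, which is all the scoring needs; it does not guarantee a perfect token-by-token match, consistent with the "most (but not all)" caveat above.

```python
# Align a retokenized prediction back to the reference tokens by dropping
# surplus non-word tokens (matched by \W), which almost always carry "O".
import re

def align(ref_tokens, pred_tokens, pred_labels):
    surplus = len(pred_tokens) - len(ref_tokens)
    out_toks, out_labs = [], []
    i = 0  # current position in the reference sentence
    for tok, lab in zip(pred_tokens, pred_labels):
        # drop a surplus punctuation token the reference does not have here
        if (surplus > 0 and re.fullmatch(r"\W+", tok)
                and (i >= len(ref_tokens) or ref_tokens[i] != tok)):
            surplus -= 1
            continue
        out_toks.append(tok)
        out_labs.append(lab)
        i += 1
    return out_toks, out_labs

# example: the predictor split "1.000" into three tokens
ref = ["Die", "Bank", "zahlt", "1.000", "Euro", "."]
pred = ["Die", "Bank", "zahlt", "1", ".", "000", "Euro", "."]
toks, labs = align(ref, pred, ["O"] * len(pred))
print(len(labs))
```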
As can be seen from the performance measures, the predictive quality of hmmEntityModel, with a Macro-F1 of 39.2%, is actually worse than the addEntityDetails result. In return, however, one gains the flexibility to train the model on arbitrary entities.
Overall, pre-trained BERT models are far more powerful for NLP applications such as this, and we have seen in this blog that the performance difference for NER is substantial.
Supporting Functions
Mini-batch Preprocessing Function
The preprocessMiniBatch function preprocesses the predictors using the preprocessPredictors function and then one-hot encodes the labels. Use this preprocessing function to preprocess both predictors and labels.
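The essential operations here are padding the token-id sequences to a common length and one-hot encoding the label sequences. A language-agnostic Python sketch of that logic (the article's actual function is written in Matlab; the padding id 0 is an assumption for illustration):

```python
# Sketch of mini-batch preprocessing: right-pad token-id sequences to the
# batch maximum and one-hot encode the per-token class labels.
def preprocess_mini_batch(token_seqs, label_seqs, num_classes, pad_id=0):
    max_len = max(len(s) for s in token_seqs)
    # pad every token sequence to the longest sequence in the batch
    padded = [list(seq) + [pad_id] * (max_len - len(seq)) for seq in token_seqs]
    onehot = []
    for labs in label_seqs:
        # one row of length num_classes per real token ...
        rows = [[1.0 if j == c else 0.0 for j in range(num_classes)] for c in labs]
        # ... and all-zero rows for the padding positions
        rows += [[0.0] * num_classes for _ in range(max_len - len(labs))]
        onehot.append(rows)
    return padded, onehot

tokens, onehot = preprocess_mini_batch([[5, 7, 9], [4, 2]], [[1, 0, 2], [0, 0]], num_classes=3)
print(tokens)
```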
Model Function
The model function performs a forward pass of the classification model.
Model Gradients Function
The modelGradients function performs a forward pass of the classification model and returns the model loss and gradients of the loss with respect to the learnable parameters.
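At the heart of this function sits the classification loss and its gradient. For softmax cross-entropy, the gradient with respect to the logits has the well-known closed form softmax(logits) minus the one-hot target; a single-token Python sketch (the article's Matlab version instead uses automatic differentiation over the whole network):

```python
# Softmax cross-entropy for one token: loss = -log p(target), and the
# gradient w.r.t. the logits is softmax(logits) - one_hot(target).
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # shift by max for stability
    s = sum(exps)
    return [e / s for e in exps]

def ce_loss_and_grad(logits, target):
    probs = softmax(logits)
    loss = -math.log(probs[target])
    grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return loss, grad

loss, grad = ce_loss_and_grad([2.0, 0.0, 0.0], target=0)
print(round(loss, 4))
```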
Model Predictions Function
The modelPredictions function makes predictions by iterating over mini-batches of data.
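The mini-batch iteration itself is a simple pattern: slice the data into batches, score each batch, and concatenate the results. A Python sketch of the loop (the scoring function below is a toy stand-in for the actual BERT forward pass):

```python
# Prediction by iterating over mini-batches: split, score, concatenate.
def model_predictions(predict_fn, data, batch_size=32):
    preds = []
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        preds.extend(predict_fn(batch))
    return preds

# toy scorer: "predict" the length of each input sequence
out = model_predictions(lambda batch: [len(x) for x in batch],
                        [[1], [2, 3], [4, 5, 6]], batch_size=2)
print(out)
```

Batching like this keeps memory bounded regardless of how many test sentences (here, 5,100) need to be scored.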