Laiba Mehnaz


SMM4H at COLING 2020 Shared Task: Automatic classification of tweets that mention a medication


[Paper]

In this blog post, I write about my thought process for working on the SMM4H 2020 Shared Task. Before getting started, let me define the task and the dataset in depth.
The shared task in Social Media Mining for Health Applications (SMM4H) 2020 is defined as follows: this binary classification task involves distinguishing tweets that mention a medication or dietary supplement (annotated as “1”) from those that do not (annotated as “0”). The dataset given to the participants represents the natural, highly imbalanced distribution of the two classes among tweets posted by 112 women during pregnancy, with only approximately 0.2% of the tweets mentioning a medication.

The dataset distribution is as follows:
- Training data: 69,272 tweets (181 “positive”; 69,091 “negative”)
- Test data: 29,687 tweets
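
With the positive class making up only about 0.26% of the training tweets, the imbalance has to be dealt with somehow. The paper does not prescribe a single recipe, but here is a minimal sketch of one common option, a class-weighted cross-entropy loss in PyTorch (the weighting heuristic is my own illustration, not necessarily what was used for the submitted runs):

```python
import torch
import torch.nn as nn

# Training class counts from the shared task data (181 positive, 69,091 negative).
n_pos, n_neg = 181, 69_091

# Weight each class inversely to its frequency so the rare "medication" class
# is not drowned out by the negatives. This is one common heuristic, not the
# only (or necessarily the best) choice.
weights = torch.tensor([(n_pos + n_neg) / (2 * n_neg),   # class 0: no medication
                        (n_pos + n_neg) / (2 * n_pos)])  # class 1: medication

# Cross-entropy over the two labels, with the per-class weights applied.
loss_fn = nn.CrossEntropyLoss(weight=weights)

# `logits` would come from the classification head of any of the encoders
# compared below; here we just fake a batch to show the call.
logits = torch.randn(8, 2)           # (batch_size, num_labels)
labels = torch.randint(0, 2, (8,))   # gold labels for the batch
print(loss_fn(logits, labels).item())
```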

I chose to go ahead with this task by analysing the performance of several of the pretrained encoders that we have in NLP as of now, with their extremely cute names, such as BERT, BioBERT, ClinicalBERT, SciBERT, RoBERTa, BioMed-RoBERTa, ELECTRA and ERNIE 2.0. They all differ in so many aspects that I decided it would be fun to work with them on a particular task and build my own understanding. There are still lots of questions that I have, and this is not a comprehensive list in any manner. :)
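
Since all of these models are exposed through the same interface in the Hugging Face transformers library, fine-tuning each of them mostly comes down to swapping a checkpoint name. A rough sketch is below; the checkpoint identifiers are the public ones that I believe correspond to these models, and they may differ from the exact weights used for the actual submissions:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Public Hugging Face checkpoints that (to the best of my knowledge) match the
# models compared in this post; treat the exact names as assumptions.
CHECKPOINTS = {
    "BERT base":             "bert-base-uncased",
    "BioBERT base":          "dmis-lab/biobert-base-cased-v1.1",
    "Clinical BioBERT base": "emilyalsentzer/Bio_ClinicalBERT",
    "SciBERT base":          "allenai/scibert_scivocab_uncased",
    "RoBERTa base":          "roberta-base",
    "BioMed-RoBERTa base":   "allenai/biomed_roberta_base",
    "ELECTRA base":          "google/electra-base-discriminator",
    "ERNIE 2.0":             "nghuyong/ernie-2.0-base-en",
}

def load(name: str):
    """Load a tokenizer and a 2-label classification head for one checkpoint."""
    ckpt = CHECKPOINTS[name]
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=2)
    return tokenizer, model

tokenizer, model = load("BioMed-RoBERTa base")
batch = tokenizer(["took my prenatal vitamins this morning"],
                  return_tensors="pt", truncation=True)
print(model(**batch).logits.shape)   # -> torch.Size([1, 2])
```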

First let’s see the final performance of the models on our shared task dataset:


| Models | F1 score |
| --- | --- |
| BERT base | 0.78 |
| BioBERT base | 0.80 |
| Clinical BioBERT base | 0.81 |
| SciBERT base | 0.83 |
| RoBERTa base | 0.81 |
| BioMed-RoBERTa base | 0.85 |
| ELECTRA base | 0.79 |
| ERNIE 2.0 | 0.83 |

As we can see, the best performing model is BioMed-RoBERTa. Now let’s look into how and why these models differ, and whether those differences translate into performance.
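
For reference, the scores above are F1 scores for the positive (“mentions a medication”) class; with this much imbalance, plain accuracy would be misleadingly high. A toy computation with scikit-learn (my tooling choice here, not necessarily what the official evaluation script uses) looks like this:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy gold labels and predictions; 1 = "mentions a medication", 0 = "does not".
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 0, 1, 0, 0, 1, 1, 0]

# F1 for the positive class only (pos_label=1).
print("precision:", precision_score(y_true, y_pred, pos_label=1))
print("recall:   ", recall_score(y_true, y_pred, pos_label=1))
print("F1:       ", f1_score(y_true, y_pred, pos_label=1))
```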

1. In terms of pretraining objective:


| Models | Pretraining objective: token prediction | Pretraining objective: non-token prediction |
| --- | --- | --- |
| BERT base | Yes (MLM) | Yes (NSP) |
| BioBERT base | Yes (MLM) | Yes (NSP) |
| Clinical BioBERT base | Yes (MLM) | Yes (NSP) |
| SciBERT base | Yes (MLM) | Yes (NSP) |
| RoBERTa base | Yes (MLM) | No |
| BioMed-RoBERTa base | Yes (MLM) | No |
| ELECTRA base | Yes (replaced token detection) | No |
| ERNIE 2.0 | Yes (knowledge masking) | Yes (word-aware, structure-aware and semantic-aware tasks) |

All the BERT variants except RoBERTa, BioMed-RoBERTa, and ELECTRA have both a token prediction and a non-token prediction task, i.e. masked language modeling (MLM) and next sentence prediction (NSP). RoBERTa and BioMed-RoBERTa have only MLM as their pretraining objective. ELECTRA also relies only on replaced token detection, which again operates at the token level. ERNIE 2.0 has only one token prediction task, i.e. knowledge masking; the rest of its tasks are all non-token prediction tasks, such as Token-Document Relation, Capitalization Prediction, Sentence Reordering, Sentence Distance, Discourse Relation, and IR Relevance.
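
As a concrete illustration of the token-level objective most of these models share, the fill-mask pipeline below shows what masked language modeling looks like at inference time (my own toy example, using the public bert-base-uncased checkpoint; during pretraining, roughly 15% of input tokens are masked and the model learns to recover them):

```python
from transformers import pipeline

# MLM in action: the model predicts the token hidden behind [MASK].
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The doctor told me to take my [MASK] every morning."):
    print(f"{prediction['token_str']:>12}  (score={prediction['score']:.3f})")
```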

2. In terms of pretraining data size and data domain:


| Models | Pretraining data size | Pretraining data domain |
| --- | --- | --- |
| BERT base | 13GB / 3.3B words | BooksCorpus + Wikipedia |
| BioBERT base | 21.3B words | BooksCorpus + Wikipedia + PubMed + PMC |
| Clinical BioBERT base | 21.3B words + 2 million notes from the MIMIC-III db | BooksCorpus + Wikipedia + PubMed + PMC + 2 million notes from the MIMIC-III db |
| SciBERT base | 3.17B | 1.14M papers from Semantic Scholar |
| RoBERTa base | 160GB | BooksCorpus + Wikipedia + CC-News + OpenWebText + Stories |
| BioMed-RoBERTa base | 207GB | 2.68 million scientific papers from Semantic Scholar (7.55B) + RoBERTa’s pretraining data |
| ELECTRA base | 13GB / 3.3B words | BooksCorpus + Wikipedia |
| ERNIE 2.0 | 7.9B | BooksCorpus + Wikipedia + Reddit + Discovery Data |

Their relative order in terms of pretraining data size is: BioMed-RoBERTa base > RoBERTa base > Clinical BioBERT > BioBERT > ERNIE 2.0 > BERT = ELECTRA > SciBERT.
As we can see, BioMed-RoBERTa uses the largest amount of pretraining data and is also the best performing model in our list. What is interesting is that the second best performing models, SciBERT and ERNIE 2.0, use much less data than the rest of the models without compromising performance.
On the one hand, SciBERT uses even less data than BERT and gets an F1 score of 0.83, owing entirely to its domain-specific pretraining data. On the other hand, ERNIE 2.0 uses only about twice as much data as BERT, with no domain-specific pretraining data at all, and performs equally well with an F1 score of 0.83. We definitely need a lot more experiments to conclude anything, but it looks like ERNIE 2.0’s well crafted pretraining tasks, aimed at learning lexical, syntactic, and semantic features, learn the language as well as or better than a model that only does language modeling on a huge domain-specific dataset.

Apart from this, ClinicalBERT, BioBERT and BioMed-RoBERTa all continue pretraining on more and more domain-specific data, but BioMed-RoBERTa leads the other two by a good margin with respect to the F1 score, which can be attributed, more or less, to the huge amount of data it has been trained on, both general and domain-specific.
The last comparison is in terms of the size of the models (i.e., number of parameters):

3. In terms of model size:

| Models | Number of parameters |
| --- | --- |
| BERT base | 110M |
| BioBERT base | 110M |
| Clinical BioBERT base | 110M |
| SciBERT base | 110M |
| RoBERTa base | 125M |
| BioMed-RoBERTa base | 125M |
| ELECTRA base | 110M |
| ERNIE 2.0 | - |

With respect to parameter count as well, BioMed-RoBERTa (together with RoBERTa) is the biggest model in the list.
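
These counts can be reproduced (approximately) directly from the checkpoints; a quick sketch, again assuming the public Hugging Face checkpoint names from earlier:

```python
from transformers import AutoModel

# Count parameters for a couple of the compared encoders.
# Checkpoint names are my assumption of the public weights for these models.
for name in ["bert-base-uncased", "allenai/biomed_roberta_base"]:
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```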

Conclusion

Overall, BioMed-RoBERTa is the best performing model in the list, as well as the most computationally expensive, in terms of both model size and the amount of pretraining data used. What is also interesting is what we observed about SciBERT and ERNIE 2.0: given their relatively small model sizes and the small amount of data used in pretraining, do we need to make our models highly domain-specific when working with different domains, or is there a silver lining in universal pretrained models like ERNIE 2.0 that can perform extremely well on a totally unseen domain?

Future work

I would like to extend this work to more models, including multilingual models and others that I have missed in this list. I am also interested in looking into the examples where each model fails, seeing what patterns emerge if any, and analysing them further. :)