In this blog post, I write about my thought process while working on the SMM4H 2020 Shared Task. Before getting started, let me define the task and the dataset in depth.
The Shared Task in Social Media Mining for Health Applications 2020 is defined as: This binary classification task involves distinguishing tweets that mention a medication or dietary supplement (annotated as “1”) from those that do not (annotated as “0”).
The dataset given to the participants represents the natural, highly imbalanced distribution of the two classes among tweets posted by 112 women during pregnancy, with only approximately 0.2% of the tweets mentioning a medication.
The dataset distribution is as follows:
Training data: 69,272 tweets (181 “positive”; 69,091 “negative”)
Test data: 29,687 tweets
I chose to approach this task by analysing the performance of several of the pretrained encoders we have in NLP as of now, with their extremely cute names: BERT, BioBERT, ClinicalBERT, SciBERT, RoBERTa, Biomed-RoBERTa, ELECTRA and ERNIE 2.0. They differ in so many aspects that I decided it would be fun to fine-tune each of them on the same task and understand the differences for myself. There are still lots of questions I have, and this is not a comprehensive list in any manner. :)
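Before the results, here is a minimal sketch of the fine-tuning setup I mean, using the Hugging Face transformers library. This is not my exact training code: the hyperparameters, the hub model id and the toy data are placeholders, and any encoder from the list can be swapped in.

```python
# Minimal fine-tuning sketch (not my exact setup): the hub model id,
# hyperparameters and toy data below are placeholders.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "allenai/biomed_roberta_base"  # swap in any encoder from the list

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Toy placeholder data; the real run uses the SMM4H tweet files.
train_texts = ["took my prenatal vitamins this morning", "so tired of this rain"]
train_labels = [1, 0]


class TweetDataset(torch.utils.data.Dataset):
    """Tokenized tweets plus their 0/1 medication-mention labels."""

    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item


args = TrainingArguments(output_dir="out", num_train_epochs=3,
                         per_device_train_batch_size=32, learning_rate=2e-5)
Trainer(model=model, args=args,
        train_dataset=TweetDataset(train_texts, train_labels)).train()
```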
Models | F1 score |
---|---|
BERT base | 0.78 |
BioBERT base | 0.80 |
Clinical BioBERT base | 0.81 |
SciBERT base | 0.83 |
RoBERTa base | 0.81 |
BioMed-RoBERTa base | 0.85 |
ELECTRA base | 0.79 |
ERNIE 2.0 | 0.83 |
As we can see, the best performing model is Biomed-RoBERTa.
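A quick note on the metric: with roughly 0.2% positive tweets, accuracy is almost meaningless (always predicting “negative” already scores about 99.8%), so F1 on the positive class is what matters here. A tiny sketch of how it is computed, with made-up labels:

```python
from sklearn.metrics import f1_score

y_true = [1, 0, 0, 1, 0, 0]  # gold labels (toy example)
y_pred = [1, 0, 1, 1, 0, 0]  # a model's predictions
# F1 = harmonic mean of precision and recall on the positive class
print(f1_score(y_true, y_pred, pos_label=1))  # -> 0.8
```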
Now let’s look into how and why these models differ, and if the differences have translated to their performance.
Models | Pretraining objective: Token prediction | Pretraining objective: Non-token prediction |
---|---|---|
BERT base | Yes (MLM) | Yes (NSP) |
BioBERT base | Yes (MLM) | Yes (NSP) |
Clinical BioBERT base | Yes (MLM) | Yes (NSP) |
SciBERT base | Yes (MLM) | Yes (NSP) |
RoBERTa base | Yes (MLM) | No |
BioMed-RoBERTa base | Yes (MLM) | No |
ELECTRA base | Yes (replaced token detection) | No |
ERNIE 2.0 | Yes (Knowledge masking) | Yes (Word-aware, Structure-aware and Semantic-aware Tasks) |
All the BERT variants except RoBERTa and ELECTRA have both a token-prediction and a non-token-prediction task, i.e. masked language modeling (MLM) and next sentence prediction (NSP). RoBERTa and Biomed-RoBERTa have only MLM as their pretraining objective. ELECTRA likewise relies only on replaced token detection, which also operates at the token level. ERNIE 2.0 has a single token-prediction task, knowledge masking; the rest of its tasks are non-token-prediction tasks such as Token-Document Relation, Capitalization Prediction, Sentence Reordering, Sentence Distance, Discourse Relation, and IR Relevance.
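To make the token-prediction side concrete, here is a quick way to poke at MLM with the transformers fill-mask pipeline (the example sentence is my own; RoBERTa expects `<mask>`, while BERT-style models use `[MASK]`):

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="roberta-base")
# The model ranks the most likely tokens for the masked position.
for pred in fill("She took her <mask> before breakfast."):
    print(pred["token_str"], round(pred["score"], 3))
```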
Models | Pretraining data size | Pretraining data domain |
---|---|---|
BERT base | 13GB/3.3B words | BooksCorpus + Wikipedia |
BioBERT base | 21.3B words | BooksCorpus + Wikipedia + PubMed + PMC |
Clinical BioBERT base | 21.3B words + 2M MIMIC-III notes | BooksCorpus + Wikipedia + PubMed + PMC + 2M clinical notes from the MIMIC-III database
SciBERT base | 3.17B words | 1.14M papers from Semantic Scholar
RoBERTa base | 160GB | BooksCorpus + Wikipedia + CC-News + OpenWebText + Stories
BioMed-RoBERTa base | 207GB | 2.68M scientific papers from Semantic Scholar (7.55B tokens) + RoBERTa’s pretraining data
ELECTRA base | 13GB/3.3B words | BooksCorpus + Wikipedia
ERNIE 2.0 | 7.9B words | BooksCorpus + Wikipedia + Reddit + Discovery data
In terms of pretraining data size, the relative order is: BioMed-RoBERTa base > RoBERTa base > Clinical BioBERT > BioBERT > ERNIE 2.0 > BERT = ELECTRA > SciBERT
As we can see, Biomed-RoBERTa uses the largest amount of pretraining data and is also the best-performing model in our list. What is interesting is that the second-best performing models, SciBERT and ERNIE 2.0 (both at 0.83), use much less data than most of the other models without compromising performance.
On the one hand, SciBERT uses even less data than BERT and gets an F1 score of 0.83, which can largely be attributed to its in-domain pretraining. On the other hand, ERNIE 2.0 uses only about twice as much data as BERT, with no domain-specific pretraining data at all, and performs equally well with an F1 score of 0.83. We definitely need a lot more experiments to conclude anything, but it looks like ERNIE 2.0’s well-crafted pretraining tasks, which aim to learn lexical, syntactic and semantic features, learn the language as well as or better than a model doing plain language modeling on a huge domain-specific dataset.
Apart from this, ClinicalBERT, BioBERT and Biomed-RoBERTa all continue pretraining on increasingly domain-specific data, but Biomed-RoBERTa leads the other two by a good margin in F1 score, which can more or less be attributed to the huge amount of data it has been trained on, both general and domain-specific.
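For reference, this kind of continued (domain-adaptive) pretraining looks roughly like the sketch below. It is not the actual BioBERT/ClinicalBERT/Biomed-RoBERTa recipe; the two-sentence “corpus”, the base checkpoint and the hyperparameters are placeholders standing in for PubMed/PMC abstracts, MIMIC-III notes or Semantic Scholar papers.

```python
# Rough sketch of continued MLM pretraining on domain text; the corpus,
# base model and hyperparameters here are placeholders.
import torch
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

domain_texts = ["Metformin is commonly prescribed for type 2 diabetes.",
                "The patient was started on 5 mg of prednisone daily."]
enc = tokenizer(domain_texts, truncation=True, padding=True, max_length=128)


class DomainDataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(enc["input_ids"])

    def __getitem__(self, idx):
        return {k: torch.tensor(v[idx]) for k, v in enc.items()}


# The collator randomly masks 15% of tokens and builds the MLM labels.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="dapt_out", num_train_epochs=1,
                         per_device_train_batch_size=8)
Trainer(model=model, args=args, train_dataset=DomainDataset(),
        data_collator=collator).train()
```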
The last comparison is in terms of model size (i.e., number of parameters):
Models | Number of parameters |
---|---|
BERT base | 110M |
BioBERT base | 110M |
Clinical BioBERT base | 110M |
SciBERT base | 110M |
RoBERTa base | 125M |
BioMed-RoBERTa base | 125M |
ELECTRA base | 110M |
ERNIE 2.0 | - |
With respect to parameter count as well, Biomed-RoBERTa (together with RoBERTa base, at 125M) is the biggest model in the list.
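These counts are easy to check by loading a checkpoint and summing its parameter tensors (the hub model id below is an assumption on my part):

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("allenai/biomed_roberta_base")
print(sum(p.numel() for p in model.parameters()))  # roughly 125M
```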
Overall, Biomed-RoBERTa is the best-performing model in the list, as well as the most computationally expensive, in terms of both model size and the amount of pretraining data used. What is also interesting is what we observed about SciBERT and ERNIE 2.0: given their relatively small model sizes and pretraining corpora, should we make our models highly domain-specific when working with different domains, or is there a silver lining in universal pretrained models like ERNIE 2.0 that can perform extremely well on a totally unseen domain?
I would like to extend this work to more models, including multilingual models and others I have missed in this list. I am also interested in looking into the examples where each model fails, seeing what patterns emerge, and analysing them further. :)