Experiments
We have performed distillation experiments on several typical English and Chinese NLP datasets. The setups and configurations are listed below.
Models
For English tasks, the teacher model is BERT-base-cased.
For Chinese tasks, the teacher models are RoBERTa-wwm-ext and Electra-base released by the Joint Laboratory of HIT and iFLYTEK Research.
We have tested different student models. To compare with public results, the student models are built with standard transformer blocks, except for BiGRU, which is a single-layer bidirectional GRU. The architectures are listed below. Note that the number of parameters includes the embedding layer but does not include the task-specific output layer.
English models
Model | #Layers | Hidden size | Feed-forward size | #Params | Relative size |
---|---|---|---|---|---|
BERT-base-cased (teacher) | 12 | 768 | 3072 | 108M | 100% |
T6 (student) | 6 | 768 | 3072 | 65M | 60% |
T3 (student) | 3 | 768 | 3072 | 44M | 41% |
T3-small (student) | 3 | 384 | 1536 | 17M | 16% |
T4-Tiny (student) | 4 | 312 | 1200 | 14M | 13% |
T12-nano (student) | 12 | 256 | 1024 | 17M | 16% |
BiGRU (student) | - | 768 | - | 31M | 29% |
Chinese models
Model | #Layers | Hidden size | Feed-forward size | #Params | Relative size |
---|---|---|---|---|---|
RoBERTa-wwm-ext (teacher) | 12 | 768 | 3072 | 102M | 100% |
Electra-base (teacher) | 12 | 768 | 3072 | 102M | 100% |
T3 (student) | 3 | 768 | 3072 | 38M | 37% |
T3-small (student) | 3 | 384 | 1536 | 14M | 14% |
T4-Tiny (student) | 4 | 312 | 1200 | 11M | 11% |
Electra-small (student) | 12 | 256 | 1024 | 12M | 12% |
The T6 architecture is the same as that of DistilBERT[1], BERT6-PKD[2], and BERT-of-Theseus[3].
The T4-tiny architecture is the same as that of TinyBERT[4].
The T3 architecture is the same as that of BERT3-PKD[2].
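For illustration only, a student of T4-Tiny's size could be built from a standard transformer configuration. The sketch below uses HuggingFace Transformers' BertConfig/BertModel, which is not necessarily how the students are constructed in this repo, and the number of attention heads is an assumption not taken from the tables above.

```python
from transformers import BertConfig, BertModel

# Hypothetical sketch of a T4-Tiny-sized student: 4 layers, hidden size 312, FFN size 1200.
# num_attention_heads=12 is an assumption, not taken from the tables above.
t4_tiny_config = BertConfig(
    num_hidden_layers=4,
    hidden_size=312,
    intermediate_size=1200,
    num_attention_heads=12,
)
student = BertModel(t4_tiny_config)

# Parameter count as reported in the tables: embedding layer included,
# task-specific output layer excluded.
print(f"{sum(p.numel() for p in student.parameters()) / 1e6:.0f}M parameters")
```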
Configurations
Distillation Configurations
```python
from textbrewer import DistillationConfig

distill_config = DistillationConfig(temperature=8, intermediate_matches=matches)
# Other arguments take their default values
```

The `matches` are different for different models:
Model | matches |
---|---|
BiGRU | None |
T6 | L6_hidden_mse + L6_hidden_smmd |
T3 | L3_hidden_mse + L3_hidden_smmd |
T3-small | L3n_hidden_mse + L3_hidden_smmd |
T4-Tiny | L4t_hidden_mse + L4_hidden_smmd |
T12-nano | small_hidden_mse + small_hidden_smmd |
Electra-small | small_hidden_mse + small_hidden_smmd |
The definitions of `matches` are at examples/matches/matches.py; a simplified sketch of the format is shown below.
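The entries in a match list follow TextBrewer's intermediate-match dictionary format. The layer pairs below are an illustrative sketch only, not the exact contents of matches.py:

```python
# Illustrative only: hidden-state MSE matches pairing teacher layers 0/4/8/12
# with student layers 0/1/2/3 (the exact pairs live in examples/matches/matches.py).
L3_hidden_mse = [
    {"layer_T": 0,  "layer_S": 0, "feature": "hidden", "loss": "hidden_mse", "weight": 1},
    {"layer_T": 4,  "layer_S": 1, "feature": "hidden", "loss": "hidden_mse", "weight": 1},
    {"layer_T": 8,  "layer_S": 2, "feature": "hidden", "loss": "hidden_mse", "weight": 1},
    {"layer_T": 12, "layer_S": 3, "feature": "hidden", "loss": "hidden_mse", "weight": 1},
]

# The *_smmd lists have the same structure but use a similarity-based loss.
# Combined matches in the table are simply concatenated lists:
matches = L3_hidden_mse  # + L3_hidden_smmd in the actual experiments
```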
We use `GeneralDistiller` in all the distillation experiments.
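As a rough, hedged sketch of how these pieces fit together (`teacher_model`, `student_model`, `train_dataloader`, and `optimizer` are assumed to be defined elsewhere, and the exact `train()` signature may differ across TextBrewer versions; see the repo's example scripts for complete code):

```python
from textbrewer import GeneralDistiller, TrainingConfig, DistillationConfig

def simple_adaptor(batch, model_outputs):
    # Map model outputs to the fields TextBrewer expects; 'hidden' is required here
    # because intermediate hidden-state matches are used (run the models with
    # output_hidden_states=True).
    return {"logits": model_outputs.logits, "hidden": model_outputs.hidden_states}

train_config = TrainingConfig(device="cuda")
distill_config = DistillationConfig(temperature=8, intermediate_matches=matches)

distiller = GeneralDistiller(
    train_config=train_config, distill_config=distill_config,
    model_T=teacher_model, model_S=student_model,
    adaptor_T=simple_adaptor, adaptor_S=simple_adaptor)

with distiller:
    distiller.train(optimizer=optimizer, dataloader=train_dataloader, num_epochs=30)
```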
Training Configurations
The learning rate is 1e-4 (unless otherwise specified).
We train all the models for 30 to 60 epochs.
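A minimal sketch of the optimizer setup implied by these hyperparameters; the optimizer class is our assumption, and the original training scripts may use a different optimizer or add a learning-rate scheduler:

```python
from torch.optim import AdamW

# Learning rate 1e-4 as stated above; train for 30 to 60 epochs depending on the task.
optimizer = AdamW(student_model.parameters(), lr=1e-4)
num_epochs = 30
```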
Results on English Datasets
We experiment on the following typical English datasets:
Dataset | Task type | Metrics | #Train | #Dev | Note |
---|---|---|---|---|---|
MNLI | text classification | m/mm Acc | 393K | 20K | sentence-pair 3-class classification |
SQuAD 1.1 | reading comprehension | EM/F1 | 88K | 11K | span-extraction machine reading comprehension |
CoNLL-2003 | sequence labeling | F1 | 23K | 6K | named entity recognition |
We list the public results from DistilBERT, BERT-PKD, BERT-of-Theseus, and TinyBERT, together with our results, below for comparison.
Public results:
Model (public) | MNLI | SQuAD | CoNLL-2003 |
---|---|---|---|
DistilBERT (T6) | 81.6 / 81.1 | 78.1 / 86.2 | - |
BERT6-PKD (T6) | 81.5 / 81.0 | 77.1 / 85.3 | - |
BERT-of-Theseus (T6) | 82.4 / 82.1 | - | - |
BERT3-PKD (T3) | 76.7 / 76.3 | - | - |
TinyBERT (T4-tiny) | 82.8 / 82.9 | 72.7 / 82.1 | - |
Our results (see Experimental Results for details):
Model (ours) | MNLI | SQuAD | CoNLL-2003 |
---|---|---|---|
BERT-base-cased (teacher) | 83.7 / 84.0 | 81.5 / 88.6 | 91.1 |
BiGRU | - | - | 85.3 |
T6 | 83.5 / 84.0 | 80.8 / 88.1 | 90.7 |
T3 | 81.8 / 82.7 | 76.4 / 84.9 | 87.5 |
T3-small | 81.3 / 81.7 | 72.3 / 81.4 | 78.6 |
T4-tiny | 82.0 / 82.6 | 75.2 / 84.0 | 89.1 |
T12-nano | 83.2 / 83.9 | 79.0 / 86.6 | 89.6 |
Note:

- The equivalent model structures of the public models are shown in brackets after their names.
- When distilling to T4-tiny, NewsQA is used for data augmentation on SQuAD, and HotpotQA is used for data augmentation on CoNLL-2003.
- When distilling to T12-nano, HotpotQA is used for data augmentation on CoNLL-2003.
Results on Chinese Datasets
We experiment on the following typical Chinese datasets:
Dataset | Task type | Metrics | #Train | #Dev | Note |
---|---|---|---|---|---|
XNLI | text classification | Acc | 393K | 2.5K | Chinese translation version of MNLI |
LCQMC | text classification | Acc | 239K | 8.8K | sentence-pair matching, binary classification |
CMRC 2018 | reading comprehension | EM/F1 | 10K | 3.4K | span-extraction machine reading comprehension |
DRCD | reading comprehension | EM/F1 | 27K | 3.5K | span-extraction machine reading comprehension (Traditional Chinese) |
MSRA NER | sequence labeling | F1 | 45K | 3.4K (test) | Chinese named entity recognition |
The results are listed below (see Experimental Results for details).
Model | XNLI | LCQMC | CMRC 2018 | DRCD |
---|---|---|---|---|
RoBERTa-wwm-ext (teacher) | 79.9 | 89.4 | 68.8 / 86.4 | 86.5 / 92.5 |
T3 | 78.4 | 89.0 | 66.4 / 84.2 | 78.2 / 86.4 |
T3-small | 76.0 | 88.1 | 58.0 / 79.3 | 75.8 / 84.8 |
T4-tiny | 76.2 | 88.4 | 61.8 / 81.8 | 77.3 / 86.1 |
Model | XNLI | LCQMC | CMRC 2018 | DRCD | MSRA NER |
---|---|---|---|---|---|
Electra-base (teacher) | 77.8 | 89.8 | 65.6 / 84.7 | 86.9 / 92.3 | 95.14 |
Electra-small | 77.7 | 89.3 | 66.5 / 84.9 | 85.5 / 91.3 | 93.48 |
Note:

- Learning rate decay is not used in distillation on CMRC 2018 and DRCD.
- CMRC 2018 and DRCD take each other as the augmentation dataset during distillation.
- The settings for training the Electra-base teacher model can be found at Chinese-ELECTRA.
- The Electra-small student model is initialized with the pretrained weights.