Table 1. State-of-the-art summary of de-identification studies in non-English languages

From: De-identifying Spanish medical texts - named entity recognition applied to radiology reports

| Study | Methodology | Recall | F1-score | Corpus size | Identifying tokens |
|---|---|---|---|---|---|
| Dalianis et al. [5] | CRF | 0.715 | 0.810 | 100 clinical records, train set; 4-fold cross-validation | 6170 |
| Menger et al. [12] | Regular expression rules and tree-based hashing | 0.916 | 0.862 | 2000 medical texts, development; 400 medical texts, test set | 542, test set |
| Jian et al. [13] | Rule-based and CRF | 0.851 | 0.848 | 201 sentences, train set; 1000 clinical records, test set | 1259, train set |
| Lange et al. [28] | BiLSTM with CRF | 0.974 | 0.974 | 500 clinical records, train set; 250 clinical records, development; 250 clinical records, test set | 11333, train set; 5801, development; 5661, test set |
| Jiang et al. [29] | BERT and flair system | 0.968 | 0.962 | 500 clinical records, train set; 250 clinical records, development; 250 clinical records, test set | 11333, train set; 5801, development; 5661, test set |
| Pérez et al. [30] | spaCy | 0.953 | 0.960 | 500 clinical records, train set; 250 clinical records, development; 250 clinical records, test set | 11333, train set; 5801, development; 5661, test set |

The table lists, for each study, the methodology used by the authors, the performance of the approach, and the corpus size in number of documents and number of identifying tokens. For MEDDOCAN, only the three best-performing models are included.
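
The table reports recall and F1-score but not precision, so a reader comparing the systems may want to recover an approximate precision. The sketch below is not from the paper. It assumes each F1-score is the usual harmonic mean of precision and recall, computed on the same evaluation as the reported recall, and the function name `precision_from_recall_f1` is hypothetical. Because the table values are rounded to three decimals, any recovered precision is only approximate.

```python
def precision_from_recall_f1(recall: float, f1: float) -> float:
    """Solve F1 = 2*P*R / (P + R) for P, given R and F1.

    Rearranging gives P = F1 * R / (2*R - F1).
    Assumption: F1 is the harmonic mean of precision and recall.
    """
    denominator = 2 * recall - f1
    if denominator <= 0:
        # No valid precision exists when F1 >= 2 * recall.
        raise ValueError("F1 must be smaller than 2 * recall")
    return f1 * recall / denominator


# (recall, F1-score) pairs copied from the table above.
reported = {
    "Dalianis et al. [5]": (0.715, 0.810),
    "Menger et al. [12]": (0.916, 0.862),
    "Jian et al. [13]": (0.851, 0.848),
    "Lange et al. [28]": (0.974, 0.974),
    "Jiang et al. [29]": (0.968, 0.962),
    "Pérez et al. [30]": (0.953, 0.960),
}

for study, (recall, f1) in reported.items():
    # Approximate values only: the inputs are rounded to three decimals.
    print(f"{study}: implied precision ~ {precision_from_recall_f1(recall, f1):.3f}")
```

These implied values are a reading aid only. If a study reports an averaged F1-score (for example across folds or entity classes), the harmonic-mean relation need not hold exactly and the recovered figure should not be cited as that study's precision.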