Table 1 State of the art summary for de-identification studies in non-English languages

From: De-identifying Spanish medical texts - named entity recognition applied to radiology reports

| Study | Methodology | Recall | F1-score | Corpus size | Identifying tokens |
|---|---|---|---|---|---|
| Dalianis et al. [5] | CRF | 0.715 | 0.810 | 100 clinical records, train set | 6170 |
| | | | | 4-fold cross-validation | |
| Menger et al. [12] | Regular expression rules and tree-based hashing | 0.916 | 0.862 | 2000 medical texts, development | 542, test set |
| | | | | 400 medical texts, test set | |
| Jian et al. [13] | Rule-based and CRF | 0.851 | 0.848 | 201 sentences, train set | 1259, train set |
| | | | | 1000 clinical records, test set | |
| Lange et al. [28] | BiLSTM with CRF | 0.974 | 0.974 | 500 clinical records, train set | 11333, train set |
| | | | | 250 clinical records, development | 5801, development |
| | | | | 250 clinical records, test set | 5661, test set |
| Jiang et al. [29] | BERT and flair system | 0.968 | 0.962 | 500 clinical records, train set | 11333, train set |
| | | | | 250 clinical records, development | 5801, development |
| | | | | 250 clinical records, test set | 5661, test set |
| Pérez et al. [30] | spaCy | 0.953 | 0.960 | 500 clinical records, train set | 11333, train set |
| | | | | 250 clinical records, development | 5801, development |
| | | | | 250 clinical records, test set | 5661, test set |
  1. The table lists each study's methodology, the performance of the approach, and the corpus size in number of documents and number of identifying tokens. From MEDDOCAN, only the three best-performing models are included.
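
Table 1 reports recall and F1-score but not precision. Since F1 = 2PR / (P + R), the implied precision can be recovered as P = F1 · R / (2R − F1). The following is a minimal sketch of that calculation, assuming that each row's recall and F1-score come from the same evaluation run and the same averaging scheme, which the table does not confirm. The function name `implied_precision` and the dictionary `TABLE_1` are illustrative and not part of the original article.

```python
def implied_precision(recall: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, given recall R and F1."""
    denom = 2 * recall - f1
    if denom <= 0:
        # A valid (recall, F1) pair always satisfies F1 < 2 * recall.
        raise ValueError("F1 must be smaller than 2 * recall")
    return f1 * recall / denom


# (recall, F1-score) pairs copied from Table 1.
TABLE_1 = {
    "Dalianis et al. [5]": (0.715, 0.810),
    "Menger et al. [12]": (0.916, 0.862),
    "Jian et al. [13]": (0.851, 0.848),
    "Lange et al. [28]": (0.974, 0.974),
    "Jiang et al. [29]": (0.968, 0.962),
    "Pérez et al. [30]": (0.953, 0.960),
}

for study, (recall, f1) in TABLE_1.items():
    p = implied_precision(recall, f1)
    print(f"{study}: implied precision = {p:.3f}")
```

The derived values are only as reliable as the assumption above. If a study reports its best recall and best F1-score from different configurations, the implied precision does not correspond to any single system.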