Fig. 2 | Journal of Biomedical Semantics

Fig. 2

From: De-identifying Spanish medical texts - named entity recognition applied to radiology reports

Data curation process and corpus preparation workflow. a A total of 7848 radiology reports were retrieved from the BIMCV database. b A custom Python script automatically annotated the names, surnames and hospital names in the radiology reports. c A subset was formed of the records in which more than one ‘name’ tag was present, leaving 2214 reports. d A second subset was formed by randomly selecting one-third of the reports to be manually annotated and corrected by three annotators. After the manual revision, 692 reports remained. e The Ground Truth dataset was divided into 3 subsets: the training set included 447 reports, the validation set 213 reports, and the test set 32 reports from healthcare department number 7.
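
Steps c to e can be illustrated with a minimal Python sketch. This is not the authors' script: the record layout (a dict with hypothetical `tags` and `department` fields), the function names, and the seeded shuffle are assumptions made for illustration. It also assumes that the test set is held out by department and that the remaining reports are split at random, which the caption does not state explicitly.

```python
import random


def select_for_annotation(reports, seed=0):
    """Steps c and d: filter on 'name' tags, then sample one third at random.

    `reports` is a list of dicts with a hypothetical "tags" list.
    """
    # Step c: keep reports carrying more than one 'name' tag.
    multi_name = [r for r in reports if r["tags"].count("name") > 1]
    # Step d: randomly pick one third of them for manual annotation.
    rng = random.Random(seed)
    sampled = rng.sample(multi_name, len(multi_name) // 3)
    return multi_name, sampled


def split_ground_truth(reports, n_val, test_department=7, seed=0):
    """Step e: hold out one department as test, split the rest at random.

    Holding out by department is an assumption about how the split was made.
    """
    test = [r for r in reports if r["department"] == test_department]
    rest = [r for r in reports if r["department"] != test_department]
    rng = random.Random(seed)
    rng.shuffle(rest)
    validation, training = rest[:n_val], rest[n_val:]
    return training, validation, test
```

With 2214 filtered reports, the one-third sample contains 738 reports, of which 692 remained after manual revision according to the caption. Splitting 692 reports with 32 held out for testing and `n_val=213` leaves 447 for training, matching the reported subset sizes (447 + 213 + 32 = 692).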
