Text mining-based measurement of precision of polysomnographic reports as basis for intervention

Background: Text mining can be applied to automate knowledge extraction from unstructured data included in medical reports and generate quality indicators applicable for medical documentation. The primary objective of this study was to apply text mining methodology for the analysis of polysomnographic medical reports in order to quantify sources of variation (here the diagnostic precision vs. the inter-rater variability) in the work-up of sleep-disordered breathing. The secondary objective was to assess the impact of a text block standardization on the diagnostic precision of polysomnography reports in an independent test set.
Results: Polysomnography reports of 243 laboratory-based overnight sleep investigations scored by 9 trained sleep specialists of the Sleep Center St. Gallen were analyzed using a text-mining methodology. Patterns in the usage of discriminating terms allowed for the characterization of type and severity of disease and inter-rater homogeneity. The variation introduced by the inter-rater (technician/physician) heterogeneity was found to be twice as high as the variation introduced by effective diagnostic information. A simple text block standardization could significantly reduce the inter-rater variability by 44%, enhance the predictive value and ultimately improve the diagnostic accuracy of polysomnography reports.
Conclusions: Text mining was successfully used to assess and optimize the quality, as well as the precision and homogeneity, of medical reporting of diagnostic procedures, here exemplified with sleep studies. Text mining methodology could lay the ground for objective and systematic qualitative assessment of medical reports.
Supplementary Information: The online version contains supplementary material available at (10.1186/s13326-022-00259-3).

of indexing journal articles in life science. International Classification of Disease (ICD) is another classification system of diseases. Methodological approaches have been described in the literature aiming to facilitate the exploration of narrative texts included in electronic health records (EHR) (see e.g. [5]). These works typically stress the difficulty of extracting insightful information from EHR due to the complexity of the information (codified text, use of jargon, jerky terminology, etc.).
Text mining (TM) refers to the process of deriving meaningful insights from textual sources. This process encompasses several analytical challenges including retrieving, annotating, exploring and interpreting valuable information from text corpora. TM can be applied to automate knowledge extraction from unstructured data included in medical reports and generate quality indicators applicable for medical documentation [6][7][8][9]. Free text descriptions of complex diseases reported in health records can be subject to various sources of variation. It is of interest to keep the text as accurate and standardized as possible in order to minimize errors, miscoding and loss of information likely to have a negative impact on patient management.
Sleep apnea (SA) is a prevalent sleep disorder characterized by a reduction or cessation of airflow to the lungs caused by obstructive or central events. SA is diagnosed by polysomnography (PSG) based on the number of apnea-hypopnea events per hour of sleep. PSG is technically complex. This procedure generates elaborate reports whose interpretation requires the expertise of sleep technicians under the supervision of trained physicians.
Applications of TM in the field of sleep disorders exist but are scarce. For example, TM methodology was applied to identify trending sleep disorder terminology in recent sleep-related journal articles [10]. Moreover, the sleep domain ontology available on the NCBO BioPortal provides a controlled vocabulary (English language) with specific application to sleep medicine [11].
The aim of the current study was to apply TM to PSG medical reports for quality purposes. More specifically, the aim was to assess the inter-rater variability in the diagnostic evaluation of sleep-disordered breathing by quantifying the part of variation associated with the patient's objective diagnosis (type of disease, disease severity) and comparing it with the part of variation explained by the rater's subjective interpretation. In a second step, we sought to reduce the inter-rater variability in an independent test set by text standardization.

Text mining of PSG reports
Overall, 695 unique terms were extracted from the corpus of PSG medical reports, of which 52 keywords were retained based on their usage frequency (all terms whose sparsity was greater than 90% were removed). The list of discriminating terms is provided in the Additional file 1 (Additional Table 1). A term-document matrix (243 documents × 52 terms) was created and analyzed using CA (data and source codes are provided in the Additional files 2 and 3). Figure 1a displays the term usage ordinated by CA. The first 2 CA axes summarized 11% and 8% of the overall variation, respectively. The percentage of variance explained by the disease characteristics (diagnosis and severity) was 6% and 7%, respectively. On the other hand, the percentage of variance explained by the raters (technicians and physicians) was 18% and 7%, respectively (Fig. 1b). Notably, clustering among technicians (1, 3, 6 and 7) and among physicians (1, 6, 7 and 8) could be observed, showing some similarities in the semantics of polysomnographic reports among technicians/physicians.

Effect of text block standardization
After text block standardization, the total variance, measured by the total inertia of the correspondence analysis, decreased from 2.73 to 1.13. The percentage of variance explained by the raters (technicians / physicians) dropped from 25% to 15%, whereas the percentage of variance explained by the disease characteristics (type of apnea / disease severity) increased from 13% to 17% (Fig. 2a). The fractions of variation between the explanatory variables are shown in Fig. 2b. Before standardization, the combined percentage of the explanatory variables associated with the patient's objective diagnosis (type of apnea / severity) represented 8% of the overall variation, whereas 14% of the total variation was associated with the subjective interpretation of sleep technicians and physicians. After standardization, the percentage of explained variance associated with the disease increased to 11%, whereas the percentage of explained variance associated with the raters decreased to 4%. The ratio of disease to rater explained variance favorably increased from 0.5 to 2.75.
The predictive accuracy of the final SA diagnosis was assessed using a linear support vector machine classifier with repeated 10-fold cross-validation. Patients were classified into the following 6 diagnostic categories: obstructive sleep apnea (OSAS) light (n = 13), OSAS mild (n = 18), OSAS severe (n = 45), central SA (n = 4), mixed SA (n = 4) and undetected SA (n = 16). Overall, an accuracy of 88% (95% CI: 83 to 91) was obtained when using the standardized text compared with 86% (95% CI: 83 to 88) without standardization. The confusion matrix of the cross-validation procedure is provided in Table 1. The prediction accuracy was particularly high with regard to the three subclasses of obstructive sleep apnea (light, mild and severe) and for the prediction of cases without detected apnea events. On the other hand, SA patients

Discussion
Electronic health reports contain information about the patient's condition, which can be retrieved in an automatic manner [12]. However, unstructured text included in medical reports is often hampered by a series of pitfalls related, among other things, to the raters' narrative style [13], the ambiguity or the redundancy of the reported information [14], the customization of the texts and the clinical experience of the rater. This inter-rater language heterogeneity is a potential source of confusion when extracting objective medical information from a health report. It is in the interest of quality assurance to maximize the diagnostic precision, i.e. the proportion of objective (disease / severity) over subjective (rater) information content included in health reports. TM can lay the ground for the evaluation of measures to efficiently standardize the information present in medical reports (e.g. using text blocks combined with the unified medical language system [15]), and minimize the risk of imprecision.
With TM methodology it is possible to quantify the importance of several sources of variation present in medical reports. In the current study, the variation introduced by inter-rater (technician/physician) heterogeneity was found to be twice as high as the variation introduced by effective diagnostic information. In order to improve the consistency of the PSG medical reports, we found that further standardization of the reporting in the form of a semi-structured documentation could improve the homogeneity and objectivity of generated reports, with a high predictive value, while maintaining the possibility of adding free text comments when needed.
There are several limitations to the current study. Discriminating terms were extracted from the corpus of documents using automated procedures, without further meticulous manual inspection. Although this basic methodological approach was deemed sufficient within the scope of the current study, future developments could include more advanced data curation such as stemming and other refined text transformations. Future work on structured medical reports could also benefit from the use of controlled medical vocabulary.

Conclusion
The analysis of electronic health reports with text mining techniques combined with correspondence analysis and variance partitioning provides a unique and powerful way to assess and optimize the quality of medical reporting. To the best of our knowledge, this is the first time that such an approach has been applied in the field of sleep medicine. Generalization of text analytics strategies in healthcare should be encouraged, as they can trigger quality improvements in most health systems with a direct benefit for clinicians and patients.

Polysomnography reports
In a retrospective quality survey, 243 PSG medical reports were retrieved from the Sleep Center of the Cantonal Hospital St. Gallen. These reports were taken from consecutive patients with suspected SA referred for a whole-night PSG. All patients were included in a prior study investigating the clinical validity of a novel wearable electrocardiogram (ECG) device [16][17][18]. The study was performed in accordance with the Declaration of Helsinki, following the principles of Good Clinical Practice. The study was approved by the local institutional review board (EKSG 15/140) and patients gave written informed consent to participate. Patient data were analyzed in a fully anonymized manner.
Altogether, the PSG medical reports were assessed by 7 sleep technicians and validated by 9 sleep physicians. Diagnoses included obstructive, central and mixed sleep apnea with various levels of severity. Data from PSG records are evaluated by sleep technicians based on information presented in the form of tables and graphics. Technicians typically provide a provisional interpretation of the sleep record, highlighting the main features and characteristics. This initial interpretation is thereafter validated by a pulmonologist who adapts and corrects the report if necessary. A snapshot of an example PSG report is provided in the Additional file 4 (Snapshot of a PSG medical report). The narrative interpretation is highlighted in the bottom inset.

Text block standardization
A standardization of the PSG reports was implemented using predefined blocks of text sequentially assessing sleep features in a systematic manner. The resulting standardized approach (hereafter called text block standardization) increases the uniformity of the diagnostic information contained in these reports. This standardization automates the generation of PSG reports with a systematic sequential description of the following items: sleep latency (normal, shortened, lengthened), sleep efficiency (normal, reduced), sleep architecture (fragmented, shortened, with lack of rapid eye movement [REM] phase), sleep stages and position in which the patient slept (lateral position, on the back, on the abdomen). Thereafter, it is described whether the patient had an obstructive, mixed or central sleep apnea, together with indications on the sleep apnea severity (mild, moderate, severe) and whether sleep apnea was associated with the patient's position and/or REM phase. Furthermore, the following items are highlighted: oxygen saturation, hypoxemia and hypercapnia, presence of snoring, arousal index and presence of periodic movements of the lower limbs. The specialized pulmonologist finally checks (and possibly adapts/corrects) the automatically generated report. For the purpose of the current analysis, one hundred consecutive reports from independent patients were extracted.

Statistical approaches

Text mining approach
The narrative section of PSG electronic reports was extracted and analyzed using TM. TM summarizes the usage of key terms throughout a corpus of textual documents by generating a term-document matrix. More specifically, TM requires several pre-processing steps of data cleansing [19]. The TM procedure used in the current study follows the guidelines provided in the vignette of the R package tm [20]. The procedure includes the elimination of extra white spaces, stop words, common words in the German language, punctuation, numbers and sparse terms, as well as transformation to lower case. The filtered terms were cross-tabulated in a term-document matrix. The term-document matrix tends to be very large and, as suggested in the introductory guidelines of the R package tm, sparse terms occurring in only a few documents can be removed to reduce the matrix without losing significant relations inherent to the matrix.
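The pre-processing steps listed above can be illustrated with a minimal Python sketch using only the standard library. It is not the authors' R code: the function names, the three toy report sentences, the very short stop-word list and the sparsity threshold of 0.5 are all assumptions made for illustration, and the tokenization rule (runs of letters) only approximates what the tm package does.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption); a real
# analysis would rely on a complete German stop-word list.
STOP_WORDS = {"der", "die", "das", "und", "ist", "auf", "kein", "mit"}

# Invented toy sentences standing in for the narrative part of PSG reports.
REPORTS = [
    "Schweres obstruktives Schlafapnoe-Syndrom mit AHI 45.",
    "Leichtes obstruktives Schlafapnoe-Syndrom und Schnarchen.",
    "Kein Hinweis auf Schlafapnoe, die Schlafeffizienz ist normal.",
]


def tokenize(text):
    """Lower-case the text, keep runs of letters only, drop stop words."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # no digits, no punctuation
    return [w for w in words if w not in STOP_WORDS]


def term_document_matrix(docs, max_sparsity=0.9):
    """Return (terms, matrix): one row per term, one column per document.

    A term's sparsity is the share of documents in which it does not occur;
    terms whose sparsity exceeds max_sparsity are removed.
    """
    counts = [Counter(tokenize(d)) for d in docs]
    vocabulary = sorted(set().union(*counts))
    terms, matrix = [], []
    for term in vocabulary:
        row = [c[term] for c in counts]  # a Counter returns 0 for absent terms
        sparsity = row.count(0) / len(docs)
        if sparsity <= max_sparsity:
            terms.append(term)
            matrix.append(row)
    return terms, matrix


terms, tdm = term_document_matrix(REPORTS, max_sparsity=0.5)
for term, row in zip(terms, tdm):
    print(term, row)
```

With only three toy documents a 90% threshold would keep every term, which is why the example call uses 0.5; on a corpus of several hundred reports the threshold reported in the study becomes meaningful.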

(Constrained-)correspondence analysis and variation partitioning
The term-document matrix was analyzed using correspondence analysis (CA), a multivariate dimension reduction method appropriate for the analysis of contingency tables. Theoretical aspects underlying CA can be summarized by defining the following: The contingency table was partitioned with respect to explanatory variables using variation partitioning techniques [21]. The following four explanatory variables were considered: type of apnea, apnea severity, physician, technician. The partitioning was based on constrained correspondence analysis (CCA), a supervised counterpart of CA (e.g., [22]). In CCA, linear constraints are applied observation-wise. Each categorical explanatory variable is used to define row blocks. If we define $M$ as the $n \times g$ matrix of dummy variables defining $g$ blocks among observations, the observation-wise constraint is given by the projection operator $O_r$. The projection on $O_r$ computes the means per block of observations for each variable. CCA consists in performing the following singular value decomposition: $Z^{*} = U^{*} \Lambda^{*} V^{*\top}$, with $\Lambda^{*}$ the $k^{*} \times k^{*}$ ($k^{*} = \mathrm{rank}(Z^{*})$) diagonal matrix of singular values associated with $Z^{*}$, with $\lambda^{*}_{1} \ge \dots \ge \lambda^{*}_{k^{*}} > 0$, $U^{*}$ the $n \times k^{*}$ matrix of left singular vectors and $V^{*}$ the $m \times k^{*}$ matrix of right singular vectors.
The percentage of explained variance associated with a specific explanatory variable is given by the ratio of the total inertia of constrained over unconstrained CA. In a first step, the total inertia of CA was partitioned according to each explanatory variable using univariate analyses and the reported percentage of explained variance corresponded to the unadjusted R-squared, i.e. the fraction of variance explained by each individual explanatory variable independently of the other variables. In a second step, adjusted R-squared values were calculated where the joint effect among variables was taken into account. For each explanatory variable, the percentage of explained variance and its significance were assessed using permutation tests. The inter-rater variability was defined by the percentage of explained variance associated with both physicians and technicians.
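For orientation, one common textbook formulation of CA and of its constrained counterpart is sketched below. The notation is an assumption and may differ from the authors' own definitions: $X$ denotes the $n \times m$ table of term counts (documents in rows, terms in columns), $x_{++}$ its grand total, and the explicit weighted form given for the block projector is one standard choice rather than a formula taken from the source.

```latex
% X: n x m table of term counts (n documents, m terms); x_{++}: grand total
P = \frac{1}{x_{++}} X, \qquad r = P\,\mathbf{1}_m, \qquad c = P^{\top}\mathbf{1}_n, \qquad
D_r = \operatorname{diag}(r), \qquad D_c = \operatorname{diag}(c)

% Standardized residuals and their singular value decomposition (unconstrained CA)
Z = D_r^{-1/2}\,(P - r c^{\top})\,D_c^{-1/2} = U \Lambda V^{\top}, \qquad
\text{total inertia} = \sum_{k} \lambda_k^{2} = \sum_{i,j} \frac{(p_{ij} - r_i c_j)^{2}}{r_i c_j}

% Weighted projection onto the g blocks coded by the n x g dummy matrix M (constrained CA)
O_r = D_r^{1/2} M \,(M^{\top} D_r M)^{-1} M^{\top} D_r^{1/2}, \qquad
Z^{*} = O_r Z = U^{*} \Lambda^{*} V^{*\top}

% Share of inertia explained by the blocking variable
R^{2} = \frac{\sum_{k} (\lambda^{*}_{k})^{2}}{\sum_{k} \lambda_{k}^{2}}
```

In this form, row $i$ of $Z^{*}$ is the mass-weighted mean of the standardized profiles in the block containing document $i$, scaled by $\sqrt{r_i}$, so the projection replaces each document by its block mean.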
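The ratio of constrained to unconstrained inertia can be illustrated with a short Python sketch for a single categorical explanatory variable. It is a simplified illustration, not the authors' implementation (which relied on R packages): all function names and the toy table are assumptions, only the unadjusted fraction is computed, and the permutation scheme shown (shuffling block labels and comparing explained fractions) is one plausible choice that the source does not specify. The sketch relies on the property that, with documents weighted by their masses, the inertia explained by one categorical variable equals the inertia of the table obtained by summing the documents within each block.

```python
import random
from collections import defaultdict


def total_inertia(table):
    """Total inertia of CA: Pearson chi-square statistic divided by the grand total."""
    grand = sum(sum(row) for row in table)
    row_mass = [sum(row) / grand for row in table]
    col_mass = [sum(col) / grand for col in zip(*table)]
    inertia = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            expected = row_mass[i] * col_mass[j]
            if expected > 0:
                inertia += (count / grand - expected) ** 2 / expected
    return inertia


def collapse_by_group(table, groups):
    """Sum the rows (documents) that share the same block label."""
    sums = defaultdict(lambda: [0] * len(table[0]))
    for row, label in zip(table, groups):
        sums[label] = [a + b for a, b in zip(sums[label], row)]
    return [sums[label] for label in sorted(sums)]


def explained_fraction(table, groups):
    """Unadjusted share of the total inertia explained by one categorical variable."""
    return total_inertia(collapse_by_group(table, groups)) / total_inertia(table)


def permutation_p_value(table, groups, n_perm=999, seed=0):
    """Share of label permutations explaining at least as much inertia as observed."""
    rng = random.Random(seed)
    observed = explained_fraction(table, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if explained_fraction(table, shuffled) >= observed - 1e-12:
            hits += 1
    return (1 + hits) / (1 + n_perm)


# Invented toy table: 4 documents (rows) x 3 terms (columns), 2 hypothetical raters.
TOY_TABLE = [[4, 1, 0], [3, 2, 0], [0, 1, 4], [0, 2, 3]]
TOY_RATERS = ["A", "A", "B", "B"]

print("explained fraction:", explained_fraction(TOY_TABLE, TOY_RATERS))
print("permutation p-value:", permutation_p_value(TOY_TABLE, TOY_RATERS, n_perm=199))
```

The adjusted R-squared and the joint partitioning across several explanatory variables described above are not covered by this sketch.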

Predictive accuracy of the final diagnosis
The predictive value of the text standardization was assessed using a linear support vector machine (SVM) classifier and the prediction accuracy of the classifier was estimated using repeated 10-fold cross-validation. In 10-fold cross-validation, the original sample is randomly partitioned into 10 subsamples of equal size. Of the 10 subsamples, a single subsample is retained as test data and the remaining 9 subsamples are used as training data. The process is repeated 10 times, each subsample being used exactly once as test data, so that all observations are used for both training and validation. Furthermore, the cross-validation procedure was repeated 3 times. The SVM classifier and its cross-validation were implemented using the function train of the R package caret with the following control parameters: resampling method was set to "repeatedcv", number of folds was set to 10 and number of repetitions of k-fold was set to 3. The following diagnostic classes were considered: OSAS severe, OSAS mild, OSAS light, central SA, mixed SA, undetected SA. The class distribution and detailed class-wise performance are provided.
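The resampling scheme described above can be sketched in a few lines of Python. The sketch covers only the partitioning into folds, not the SVM classifier; the function name, the seed and the simple unstratified shuffling are assumptions, so the folds it produces are not the ones generated by caret in the study.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_repeats=3, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random partition for every repetition
        # Fold f takes every n_folds-th shuffled index, so fold sizes differ by at most 1.
        folds = [indices[f::n_folds] for f in range(n_folds)]
        for f, test_idx in enumerate(folds):
            train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield repeat, f, sorted(train_idx), sorted(test_idx)


# 100 reports, 10 folds, 3 repetitions: 30 train/test splits in total.
splits = list(repeated_kfold(100, n_folds=10, n_repeats=3, seed=42))
print(len(splits), "splits; first test fold:", splits[0][3])
```

Within each repetition every report appears in exactly one test fold, and the accuracy estimate is then averaged over all splits. With classes as small as n = 4, stratifying the folds by diagnostic class would be a natural refinement of this sketch.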

Statistical software implementations
Source codes can be provided upon request to the corresponding authors. All analyses were done using the R statistical software (v. 4.0.3) including the following extension packages: tm [23], ade4 [24], vegan [25] and caret [26]. CA was performed using the function dudi.coa implemented in ade4, and CCA using the function cca implemented in vegan. Variation partitioning was performed using the function varipart implemented in ade4.