The languages of health in general practice electronic patient records: a Zipf’s law analysis
© Kalankesh et al.; licensee BioMed Central Ltd. 2014
Received: 3 September 2012
Accepted: 26 November 2013
Published: 10 January 2014
Natural human languages show a power law behaviour in which word frequency (in any large enough corpus) is inversely proportional to word rank - Zipf’s law. We have therefore asked whether similar power law behaviours could be seen in data from electronic patient records.
In order to examine this question, anonymised data were obtained from all general practices in Salford covering a seven year period and captured in the form of Read codes. It was found that data for patient diagnoses and procedures followed Zipf’s law. However, the medication data behaved very differently, looking much more like a referential index. We also observed differences in the statistical behaviour of the language used to describe patient diagnosis as a function of an anonymised GP practice identifier.
This work demonstrates that data from electronic patient records do follow Zipf’s law. We also found significant differences in Zipf’s law behaviour in data from different GP practices. This suggests that computational linguistic techniques could become a useful additional tool to help understand and monitor the data quality of health records.
A recent survey has shown that 90% of patient contact with the National Health Service (NHS) in the UK is through General Practices and General Practitioners (GPs). Over 98% of the UK population is registered with a general practitioner and almost all GPs use computerised patient record systems, providing a unique and valuable resource of data. About 259 million GP consultations are undertaken every year in the UK. However, capturing structured clinical data is not straightforward. Clinical terminologies are required by electronic patient record systems to capture, process, use, transfer and share data in a standard form by providing a mechanism to encode patient data in a structured and common language. This standard language helps improve sharing and communication of information throughout the health system and beyond [6, 7]. Codes assigned to patient encounters with the health system can be used for many purposes such as automated medical decision support, disease surveillance, payment and reimbursement of services rendered to the patients. In this work we focus specifically on the coding system used predominantly by UK GPs, the Read codes.
Read codes provide a comprehensive controlled vocabulary that has been structured hierarchically to provide a mechanism for recording data in computerised patient records for UK GPs. They combine the characteristics of both classification and coding systems. Most data required for an effective electronic patient record (demographic data, lifestyle, history, symptoms, signs, process of care, diagnostic procedures, administrative procedures, therapeutic procedures, diagnosis data, and medication prescribed for the patient) can be coded in terms of Read codes. Each Read code is a string of five alphanumeric characters, and each character represents one level in the hierarchical structure of the Read code tree. These codes are organised into chapters and sections. For example, Read codes beginning with 0–9 are processes of care, those beginning with A–Z (uppercase) are diagnoses, and those beginning with a–z (lowercase) represent drugs (described further in the Methods section). Of some concern, however, is the quality of the data captured in this way.
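The link between code length and hierarchy level can be illustrated with a minimal sketch. It assumes that unused trailing positions of a 5-character code are filled with a padding character (taken here to be "."); the function name `read_code_depth` and the example strings are illustrative only and are not drawn from the Salford data.

```python
def read_code_depth(code: str, pad: str = ".") -> int:
    """Return the hierarchy level (1 to 5) of a 5-character Read code.

    Assumption: positions below the term's own level are filled with `pad`,
    so the depth is the number of characters left after stripping the padding.
    """
    if len(code) != 5:
        raise ValueError(f"expected a 5-character code, got {code!r}")
    stripped = code.rstrip(pad)
    if not stripped:
        raise ValueError("code consists of padding only")
    return len(stripped)


# Illustrative strings: a chapter-level term, a mid-level term, a leaf term.
for example in ("G....", "G30..", "G3011"):
    print(example, read_code_depth(example))
```

A deeper code corresponds to a more specific term, so a histogram of these depths gives a rough picture of how far into the hierarchy coders typically go.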
At its heart, medical coding is a process of communication, with clinical terminologies bridging the gap between language, medicine and software. Read codes can be thought of as a vocabulary for primary care medicine, providing words (terms) used to describe encounters between GPs and patients. The GPs (annotators) are attempting to encode information regarding the consultation; information that the wider community then needs to decode. The bag of codes associated with a consultation can therefore be thought of as a sentence made up of words from Read, a sentence written by a GP to convey information to a range of different listeners.
One of the best known and most universal statistical behaviours of language is Zipf’s law. This law states that for any sufficiently large corpus, word frequency is approximately inversely proportional to word rank. In fact, Zipf’s law is considered a universal characteristic of human language and a wider property of many different complex systems. Zipf suggested that this universal regularity in languages emerges as a consequence of the competing requirements of the person or system coding the information (speaker) and the person or system trying to decode the information (listener). From the perspective of the speaker, it would be most straightforward to code the signal using high level, non-specific terms as these are easy to retrieve. It is more difficult to code the signal using very specific terms as this requires hunting through long lists and navigating deep into the terminology. The problem is very different for the listener. For them the problem is one of resolving ambiguity. If the data are coded using very specific terms then ambiguity is minimal and interpreting the message is straightforward. If only high level general terms are used, then it is much harder to discern the meaning of the message. In any communication system there is therefore a tension between the work being done by the speaker and the listener. Indeed, some controversial recent papers have attempted to show that Zipf’s law emerges automatically in systems that simultaneously attempt to minimise the combined cost of coding and decoding information [16–18].
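The inverse relationship between frequency and rank can be made concrete with a small sketch. Under an idealised Zipf’s law the frequency of the term at rank r is roughly C / r, so the product of rank and frequency stays roughly constant. The function name `rank_frequency` and the toy tokens are hypothetical and chosen only so that the pattern is easy to see.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, term, frequency) triples, most frequent term first.

    Ties in frequency are broken alphabetically so the output is deterministic.
    """
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, term, freq) for rank, (term, freq) in enumerate(ordered, start=1)]


# Toy corpus built so that frequency is exactly proportional to 1 / rank.
toy_corpus = ["a"] * 6 + ["b"] * 3 + ["c"] * 2

for rank, term, freq in rank_frequency(toy_corpus):
    print(rank, term, freq, rank * freq)
```

In real corpora the product is only approximately constant, which is why the fit is usually assessed on a log-log plot or with a formal power law test rather than by inspection.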
Similar issues clearly arise in medical coding, in which there needs to be a balance between the effort required of the coder and that required of the person interpreting and using the data. Reaching a proper balance between comprehensiveness and usability of clinical vocabularies is regarded as one of the challenges in the medical informatics domain.
The hypothesis we are therefore exploring in this paper is whether a Zipfian analysis of medical coding data can provide useful insights into the nature and quality of data. For example, we can ask where this balance lies across different aspects of the medically coded data captured in GP records (information about diagnosis, the medical procedures applied and the medication prescribed), and whether this balance differs across general practices. We have therefore performed a computational linguistics analysis of a large corpus of anonymised Read code data from GPs in Salford to see whether such analyses might have value in understanding and characterising coding behaviour and data quality in electronic patient records. Salford is a city in the North West of England with an estimated population of 221,300. The health of people in Salford is generally worse than the English average, including the estimated percentage of binge drinking adults, the rate of hospital stays for alcohol-related harm, and the rate of people claiming incapacity benefit for mental illness. However, the percentage of physically active adults is similar to the English average and the rate of road injuries and deaths is lower.
The data set
An example of the 5-byte Read code that shows how the specificity of a term increases as a function of depth
Level 1: Circulatory system diseases
Level 2: Ischaemic heart disease
Level 3: Acute myocardial infarction
Level 4: Other specified anterior myocardial infarction
Level 5: Acute anteroseptal infarction
Zipf’s law analysis
Pareto plots and parameter estimations were calculated using the Matlab packages plfit, plplot and plpva developed by Clauset and Shalizi. These packages attempt to fit a power law model to the empirical data and then determine the extent to which the data really can be effectively modelled using a power law. These tools provide two statistics describing the data. The first is a p-value that is used to determine the extent to which the power law model is appropriate. If the p-value is greater than 0.1, we can regard the power law as a plausible model of our data. The second statistic produced is β, the exponent of the power law.
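To give a sense of what estimating a power law exponent involves, the sketch below applies the standard continuous maximum-likelihood estimator, exponent = 1 + n / Σ ln(x / xmin), to synthetic data. It is not the plfit implementation: the lower cut-off `xmin` is supplied by hand rather than estimated, code frequencies are treated as continuous rather than discrete, and no goodness-of-fit p-value is computed. The function names `fit_exponent` and `sample_power_law` are hypothetical.

```python
import math
import random


def fit_exponent(values, xmin):
    """Continuous maximum-likelihood estimate of a power law exponent.

    Only observations >= xmin are used. Returns (exponent, standard_error),
    where the standard error is the large-sample value (exponent - 1) / sqrt(n).
    """
    tail = [x for x in values if x >= xmin]
    n = len(tail)
    log_sum = sum(math.log(x / xmin) for x in tail)
    if n == 0 or log_sum == 0.0:
        raise ValueError("need observations strictly above xmin")
    exponent = 1.0 + n / log_sum
    return exponent, (exponent - 1.0) / math.sqrt(n)


def sample_power_law(exponent, xmin, size, seed=0):
    """Draw synthetic power law samples by inverse-transform sampling."""
    rng = random.Random(seed)
    return [
        xmin * (1.0 - rng.random()) ** (-1.0 / (exponent - 1.0))
        for _ in range(size)
    ]


# Synthetic check: the estimate should land close to the exponent used to sample.
samples = sample_power_law(exponent=2.5, xmin=1.0, size=20000, seed=1)
estimate, std_err = fit_exponent(samples, xmin=1.0)
print(f"estimated exponent {estimate:.3f} +/- {std_err:.3f}")
```

A full analysis would also need to choose the cut-off from the data and test whether the power law is plausible at all, which is the role the p-value plays in the text above.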
A number of Zipfian analyses were then performed on different subsets of the Read code data within the Salford corpus. In particular, we looked separately at the subsets of Read codes for diagnosis, procedure and medication (Read codes used for diagnosis start with an upper case character (A–Z), Read codes for procedures begin with a number (0–9), and those for medication begin with a lower case character (a–z)). We were able to further subdivide the data into chapters based on the first character of the Read code for more detailed analysis.
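The subsetting rule described above depends only on the first character of each code, so it can be sketched in a few lines. The function names and the sample codes are hypothetical; the category labels follow the text (diagnosis, procedure, medication).

```python
from collections import Counter, defaultdict


def read_code_category(code):
    """Classify a Read code by its first character, following the rule in the text."""
    first = code[:1]
    if "0" <= first <= "9":
        return "procedure"
    if "A" <= first <= "Z":
        return "diagnosis"
    if "a" <= first <= "z":
        return "medication"
    return "other"  # empty strings or unexpected leading characters


def split_by_category(codes):
    """Group codes into the three subsets (plus 'other' for anything unexpected)."""
    subsets = defaultdict(list)
    for code in codes:
        subsets[read_code_category(code)].append(code)
    return dict(subsets)


def chapter_counts(codes):
    """Count code usage per chapter, where a chapter is the first character."""
    return Counter(code[:1] for code in codes if code)


# Hypothetical codes, used only to exercise the functions.
sample = ["G30..", "G3011", "H33..", "246..", "bd3j."]
print(split_by_category(sample))
print(chapter_counts(sample))
```

Each subset, or each chapter within a subset, can then be passed to the same rank-frequency analysis independently.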
We also performed a number of other simple analyses to characterise the Salford corpus. We first measured the type-token ratio (TTR). The TTR is calculated by dividing the number of types (distinct Read codes) by the number of tokens (total number of Read codes used), expressed as a percentage. A low TTR is a signal that there is a lot of repetition in the terms used; a high TTR is a signal that the “vocabulary” (distinct terms) used is rich. A second analysis examined the typical depth of the terms used from the Read codes in each of the subsets of data. In a final analysis we characterised the Read code terminology itself, to determine how many terms were available to GPs at each level in each chapter. We then repeated this analysis on the Salford data, looking at the set of codes that were actually used from this full set. From this we were able to determine the extent to which GPs did, or did not, take advantage of the structure inherent in the terminology.
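As a worked illustration of the type-token ratio, the sketch below computes it for a toy list of codes. The function name `type_token_ratio` and the sample codes are hypothetical.

```python
def type_token_ratio(tokens):
    """Return the type-token ratio as a percentage.

    Types are the distinct codes; tokens are all code occurrences.
    """
    tokens = list(tokens)
    if not tokens:
        raise ValueError("cannot compute a type-token ratio for an empty corpus")
    return 100.0 * len(set(tokens)) / len(tokens)


# Four tokens drawn from three distinct codes: 3 / 4 = 75%.
print(type_token_ratio(["G30..", "G30..", "H33..", "246.."]))
```

Because the number of types grows more slowly than the number of tokens, the TTR tends to fall as a corpus gets larger, so comparisons are most meaningful between subsets of similar size.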
Discussion and conclusions
Within the Salford corpus, the usage of Read codes for diagnosis and process shows a power law behaviour with exponents typical of those seen in natural languages. This supports the hypothesis of this paper that there are overlaps between the processes involved in describing medical data (terms chosen from a thesaurus to describe an encounter between a patient and a GP) and human communication (words chosen to describe a concept to a listener). This was true not only of the complete data sets but also of the data from the specific chapters.
However, the story is not completely straightforward. One section of the data captured by Read codes, namely the medication data, showed a very different behaviour. These data showed no evidence of Zipf’s law behaviour, and it would appear that the principle of reaching a balance between the encoding and decoding costs has broken down. The pattern of code use from the hierarchy of Read codes is very different for the medication data compared with the process or diagnosis codes. All Read codes used by GPs for encoding drug information are from the highest (most specific) level provided by the hierarchy of the Read code system. This would suggest that, in the case of medication information, doctors attribute very high value to minimising ambiguity in the message, to the maximum extent the coding system allows. This is perhaps unsurprising, as the prescription data are an input for another health care professional in the continuum of care (the pharmacist), and any ambiguity in these sensitive data could be harmful or fatal to a patient. An exact match between the expression and the meaning understood by someone other than the encoder is critical. From this perspective, medication data seem to behave as an indexical reference, in which an indexical expression “e” refers to an object “o” only if “e” can be understood as referring to “o” by someone other than the speaker as a result of the communicative act.
It is also the case that not all GPs use language in the same way. It is known that capture of diagnosis information is very variable between different GP practices. At this stage, it is difficult to provide a detailed explanation for this. It could be that this reflects a difference in the populations being served by each GP; however, we do not have the information available to us in this study to address this. Nevertheless, it suggests that this form of computational linguistic analysis could provide useful information on the quality of data being captured from different GP surgeries. There is a significant body of work in language processing looking at power law exponents and how they change with different qualities of language, an analysis that could well have useful analogies for these data. At this stage we do not have the information to determine the extent to which the signal mirrors the quality of the data captured by the GPs, but this is clearly something that would warrant further study.
Therefore, there are aspects of GP records that behave very much like a language and for which it would be appropriate to apply the methodologies of computational linguistics. Our hope is that the development of such methods could provide important new tools to help assess and improve the quality of data in the health service.
Abbreviations
BNC: British National Corpus
CDF: Cumulative Distribution Function
NHS: National Health Service
This article has been published as part of the thematic series “Semantic Mining of Languages in Biology and Medicine” of the Journal of Biomedical Semantics. An early version of this paper was presented at the Fourth International Symposium on Languages in Biology and Medicine (LBM 2011), held in Singapore in 2011.
- Mant D: R & D in primary care: an NHS Priority. Br J Gen Pract. 1998, 48: 871.
- Agarwal G, Grooks V: The nature of informational continuity of care in general practice. Br J Gen Pract. 2008, 58: e17-e24. 10.3399/bjgp08X342624.
- Park H, Hardiker N: Clinical terminologies: a solution for semantic interoperability. J Korean Soc Med Inform. 2009, 15: 1-11. 10.4258/jksmi.2009.15.1.1.
- Qamar R: Semantic mapping of clinical model data to biomedical terminologies to facilitate interoperability. PhD thesis. 2008, University of Manchester
- Cimino J: Review paper: coding systems in health care. Methods Inf Med. 1996, 35: 273-284.
- Thiru K, de Lusignan S, Sullivan F, Brew S, Cooper A: Three steps to data quality. Inform Prim Care. 2003, 11: 95-102.
- Lewis A: Health Informatics: information and communication. Adv Psychiatr Treat. 2002, 8: 165-171. 10.1192/apt.8.3.165.
- Yan Y, Fung G, Dy J: Medical coding classification by leveraging inter-code relationships. 2010, Washington DC, USA: International Conference on Knowledge Discovery and Data Mining, 193-202.
- Robinson D, Comp D, Schulz E, Brown P: Updating the Read codes: user-interactive maintenance of a dynamic clinical vocabulary. J Am Med Inform Assoc. 1997, 4: 465-472. 10.1136/jamia.1997.0040465.
- Benson T: Clinical terminology. Principles of Health Interoperability HL7 and SNOMED. 2010, London: Springer-Verlag
- Booth N: What are Read codes?. Health Libr Rev. 1994, 11: 177-182. 10.1046/j.1365-2532.1994.1130177.x.
- Zheng H, Wang H, Black N: Data structures, coding and classification. Technol Health Care. 2010, 18: 71-87.
- Rector A: Thesauri and formal classification: terminologies for people and machines. Methods Inf Med. 1998, 37: 501-50914.
- Grzybek P, Kohler R: Exact methods in the study of language and text. 2007, Berlin: Walter de Gruyter Gmb & Co
- Manin D: Zipf’s law and avoidance of excessive synonymy. Cogn Sci. 2008, 32: 1075-1098. 10.1080/03640210802020003.
- Zipf G: Human behaviour and the principle of least effort. 1949, Massachusetts: Addison-Wesley
- Ferrer-i-Cancho R: Decoding least effort and scaling in signal frequency distributions. Physica A: Stat Mech Appl. 2005, 345: 275-284. 10.1016/j.physa.2004.06.158.
- Ferrer-i-Cancho R, Sole R: Least effort and the origins of scaling in human language. Proc Natl Acad Sci U S A. 2003, 100: 788-791. 10.1073/pnas.0335980100.
- Botsis T, Bassoe C, Hartvigsen G: Sixteen years of ICPC use in Norwegian primary care. BMC Med Inform Decis Mak. 2010, 10: 11. 10.1186/1472-6947-10-11.
- Newman M: Power laws, Pareto distribution and Zipf’s law. Contemp Phys. 2005, 46: 323-351. 10.1080/00107510500052444.
- Clauset A, Shalizi C, Newman M: Power law distribution in empirical data. SIAM Rev. 2009, 51: 661-703. 10.1137/070710111.
- Bentley T, Price C, Brown J: Structural and lexical features of successive versions of the Read Codes. 1996, Cambridge: UK: The Proceeding of the 1996 Annual Conference of the Primary Health Care Specialist Group of the British Computer Society
- Baker P: Using corpora in discourse analysis. 2006, Continuum International Publishing Group
- Akerman J: Communication and indexical reference. Philos Stud. 2010, 149: 355-366. 10.1007/s11098-009-9347-0.
- Ferrer-i-Cancho R, Sole R: Two regimes in the frequency of words and the origins of complex lexicons: Zipf’s law revisited. J Quant Linguist. 2001, 8: 165-173. 10.1076/jqul.126.96.36.19901.
- Tai TW, Anandarajah S, Dhoul N, de Lusignan S: Variation in clinical coding lists in UK general practice: a barrier to consistent data entry?. Inform Prim Care. 2007, 15: 143-150.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.