RESEARCH – October 2014

Corpus-based analysis of abbreviations and abbreviating in Estonian radiology reports

Authors: Eola Valdre, Peeter Ross, Katrin Tsepelina, Kaarel Veskis, Tarmo Vaino, Heiki-Jaan Kaalep


Abstract

Background and aim. Electronic patient records place new demands on medical language. The Estonian professional medical language is relatively young: it emerged in the 19th century and is still influenced by its predecessors and contemporaries, Latin, German and Russian, as well as by the modern lingua franca of science, English. Medical texts in Estonian are written in one of two sublanguages: academic language or health data recording language. The latter reflects everyday work and is affected by fatigue, stress and tight schedules. It does not fully conform to any particular standard and is neither revised nor edited. Data standardisation is an absolute prerequisite for any E-health application meant to enable data mining, analysis and/or customisation. However, sustaining adequate data quality is a real challenge, because quality is impaired by the inconsistency and ambiguity of free text. Hence, it is important to improve the language of free text to ensure correctness and consistency of the content, and to reduce ambiguity and vagueness. In Estonia, the professional medical language has not been systematically studied, partly because no suitable language resources exist. The aim of this study was to compile a relevant text corpus and to assess the overall suitability of a linguistic approach for studying the language of radiology reports. We confined the study to the analysis of abbreviations.

Methods. We compiled a corpus of depersonalised radiology reports. The corpus was converted to XML, annotated and validated against the TEI P5 encoding scheme. We established a specific set of rules and applied them, using scripts based on UNIX commands, to retrieve abbreviations from the corpus. Because of their inherent ambiguity, all one-letter abbreviations were analysed as trigrams consisting of the abbreviation and its neighbouring tokens. The frequencies and meanings of the abbreviations were reviewed separately by two doctors.
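The trigram step described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation, which used scripts based on UNIX commands and a rule set not given in this abstract. The sketch assumes whitespace tokenisation and a hypothetical rule that a one-letter abbreviation is a single letter followed by a full stop. The function names and the boundary markers `<s>` and `</s>` are likewise illustrative.

```python
import re
from collections import Counter

# Hypothetical rule (an assumption, not the study's rule set):
# a one-letter abbreviation is exactly one letter followed by a full stop.
ONE_LETTER = re.compile(r"[^\W\d_]\.")


def tokenize(text):
    """Naive whitespace tokenisation; the study's tokenisation may differ."""
    return text.split()


def one_letter_trigrams(tokens):
    """Yield (previous, abbreviation, next) for each one-letter abbreviation.

    Report boundaries are padded with the illustrative markers <s> and </s>.
    """
    for i, tok in enumerate(tokens):
        if ONE_LETTER.fullmatch(tok):
            prev_tok = tokens[i - 1] if i > 0 else "<s>"
            next_tok = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
            yield (prev_tok, tok, next_tok)


def trigram_frequencies(reports):
    """Count trigrams around one-letter abbreviations across all reports."""
    counts = Counter()
    for report in reports:
        counts.update(one_letter_trigrams(tokenize(report)))
    return counts


if __name__ == "__main__":
    # Invented sample strings for illustration only, not taken from the corpus.
    sample = ["kopsudes p. pool muutusteta", "p. pool", "kopsudes p. pool"]
    for trigram, freq in trigram_frequencies(sample).most_common():
        print(freq, trigram)
```

Counting whole trigrams rather than the bare abbreviation keeps the surrounding context attached to each frequency, which is what a reviewer would need in order to decide between competing meanings of the same letter.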

Results. We compiled a corpus consisting of 207,534 depersonalised radiology reports with more than 11.8 million tokens. We retrieved 10,606 abbreviations (446,158 tokens, 3.8% of the corpus). Abbreviating appeared to be rather arbitrary and inconsistent. Mistyping was not an issue compared with ambiguity and inconsistent use of punctuation, spacing and numbering, or with the unconventional merging or breaking of sentence structures and word boundaries, particularly when case endings were added. Abbreviations of Estonian, Latin and English origin often overlapped in use. This synonymy revealed an emerging shift from Latin-based to English-based abbreviations.

Conclusions. The study of abbreviations in Estonian radiology reports showed an urgent need for standardisation of the medical language and confirmed the enormous potential of the linguistic approach for analysing the free text of health data. The approach is a feasible and resource-effective tool for analysing huge data sets and provides much-needed insight into the existing problems of the medical language. Medical linguistics is an interdisciplinary field in which input from linguists and from medical professionals is equally important.