Using Systematized Nomenclature of Medicine clinical term codes to assign histological findings for prostate biopsies in the Gauteng province, South Africa: Lessons learnt

Background: Prostate cancer (PCa) is a leading male neoplasm in South Africa. Objective: The aim of our study was to describe PCa using Systematized Nomenclature of Medicine (SNOMED) clinical term codes, which have the potential to generate more timely data. Methods: A retrospective study design was used to analyse prostate biopsy data from our laboratories using SNOMED morphology (M) and topography (T) codes where the term 'prostate' was captured in the narrative report. Using M code descriptions, the diagnosis, sub-diagnosis, sub-result and International Classification of Diseases for Oncology (ICD-O-3) codes were assigned using a lookup table. Topography code descriptions identified biopsies of prostatic origin. Lookup tables were prepared using Microsoft Excel and combined with the data extracts using Microsoft Access. Contingency tables reported M and T codes, diagnosis and sub-diagnosis frequencies. Results: An M and T code was reported for 88% (n = 22 009) of biopsies. Of these, 20 551 (93.37%) were of prostatic origin. A benign diagnosis (ICD-O-3: 8000/0) was reported for 10 441 biopsies (50.81%) and 45.26% had a malignant diagnosis (n = 9302). An adenocarcinoma (8140/3) sub-diagnosis was reported for 88.16% of malignant biopsies (n = 8201). An atypia diagnosis was reported for 760 biopsies (3.7%). Inflammation (39.03%) and hyperplasia (20.82%) were the predominant benign sub-diagnoses. Conclusion: Our study demonstrated the feasibility of generating PCa data using SNOMED codes from national laboratory data. This highlights the need to extend the results of our study to a national level to deliver timeous monitoring of PCa trends.


Introduction
Prostate cancer (PCa) was a leading male cancer in South Africa in 2012, while globally it is the second most frequently diagnosed neoplasm. 1 A 2012 global cancer study reported an estimated age-specific incidence rate of 67.9 per 100 000 for South Africa, with an associated mortality rate of 26.4 per 100 000. 1,2 Presentation with highly aggressive PCa in African men in South Africa has been described. 3,4

[...] importance of both surveillance and disease registries that should be integrated into existing health information systems to improve the availability of high-quality data. 7 Current cancer registry reporting is not integrated into any hospital information system. The purpose of a cancer registry is to establish and maintain a cancer incidence reporting system that informs the planning of cancer control programmes. 8 Cancer registries should typically publish annual data within 28 months after the close of the year in which the incident case was diagnosed. 9 A delay in cancer registry reporting is a major limitation for understanding PCa trends. 9 The National Cancer Registry (NCR) of South Africa is a passive reporting registry that uses the International Classification of Diseases for Oncology (ICD-O-3), the recommended international methodology, to manually code pathology reports. 10,11,12 The ICD-O-3 describes the anatomical site, the cell type and the behaviour (malignant or benign) of a neoplasm. 13 In 2018, the most recent data reported by the NCR were for 2014. 14 Cases are manually coded, and this results in a reporting delay. Based on the Surveillance, Epidemiology, and End Results (SEER) programme standard, the 2015 report should have been published by 2018. 9 The programme itself, on which the South African reporting standard is based, comprises 11 registries in five states and six metropolitan areas in the United States, which generate annual cancer data approximately 28 months after diagnosis. 8

Both internationally and nationally, cancer surveillance is defined as the ongoing, timely and systematic collection and analysis of cancer data to assess risk factors, screening, diagnosis, and cancer incidence and deaths. 9 The aim of surveillance is to analyse and disseminate cancer data to identify challenges and opportunities in the delivery of timeous cancer control programmes. 9 Cancer registry reporting is a time-intensive process requiring trained coders to review each narrative biopsy report individually and to manually add the applicable ICD-O-3 codes to the reporting system. 15 Reporting PCa data in this way can take hundreds of hours of manual database building. The coders have to add both topography and morphology ICD-O-3 codes for each narrative biopsy report reviewed. 15 This manually intensive process prolongs the time between diagnosis and the reporting of identified PCa cases for surveillance. PCa reporting is required within 28 months of diagnosis (~2 years) to understand changes in incidence. 8 Without timely data this would not be possible. Therefore, new approaches are required to reduce or automate the coding process and provide more timely cancer data.
Some studies have used approaches such as natural language processing and text mining. 16,17,18 For our study, we decided to use the Systematized Nomenclature of Medicine (SNOMED) clinical terms. 19 SNOMED is a comprehensive and precise health terminology used globally that incorporates a structured list of health terms or concepts. 19,20 One of the benefits of using SNOMED is that codes can be mapped to other coding systems to facilitate interoperability. 19 All references to SNOMED in this article relate to SNOMED CT. 21 In an anatomical pathology setting, SNOMED CT is used to capture the histological finding in the form of morphology (M) and topography (T) codes. The M and T codes are captured directly in the laboratory information system (LIS) by the pathologist after examining prostate cores. The same histological findings are also reported as a narrative pathology report. 19 The M and T codes are captured separately in defined test items in the LIS, which holds a dictionary of all the SNOMED codes that may be reported; the anatomical pathologist selects the appropriate codes based on the histological findings. For each biopsy, more than one SNOMED M or T code may be captured. Studies outside South Africa have used SNOMED codes to transform laboratory and other reported data for cancer registry reporting. 22 A good example is the Danish PCa registry, which analysed SNOMED data for biopsies with histologically verified PCa. 22 That study confirmed that the SNOMED codes generated clinically useful data. 22 Our study is the first attempt in South Africa to investigate reporting PCa using SNOMED codes by collating the information contained in the national laboratory data repository. It is anticipated that this could, in future, strengthen surveillance activities. Additionally, using the SNOMED codes to report on patients without PCa could provide important presentation information that is currently poorly understood.
The majority of local studies have manually coded biopsy reports to extract PCa information. The development of SNOMED CT lookup tables has the potential to automate this process and improve timely PCa reporting. The objective of this study was to describe the methodology used to report PCa data using SNOMED CT lookup tables.

Ethical considerations
Ethics clearance for this study was obtained from the University of the Witwatersrand (M170419). This study used national laboratory data that does not contain any patient identifiers. No patient recruitment was required.

Study design
This was a retrospective descriptive study that analysed prostate biopsy data collected between 2006 and 2016 for men aged ≥ 30 years in the Gauteng province.

Data extraction and preparation
Retrospective prostate biopsy data for the period 2006-2016 were extracted from the national health laboratory repository of patient-related data where the term 'prostate' was captured in the narrative pathology report. Simple text mining approaches were applied in the Netezza Aginity (Marlborough, Massachusetts, United States) query tool, which employed pattern matching through a fuzzy string search function. 23 Two data extracts were received: (1) prostate biopsy, and (2) chained M and T code(s) captured for each biopsy.
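The selection of reports by approximate matching on the term 'prostate' can be illustrated with a minimal Python sketch. This is not the Netezza Aginity function used in the study; the function name `mentions_prostate`, the 0.8 similarity threshold and the sample reports are assumptions made purely for illustration, and the standard-library `difflib` ratio stands in for whatever fuzzy measure the query tool applied.

```python
import re
from difflib import SequenceMatcher


def mentions_prostate(report: str, threshold: float = 0.8) -> bool:
    """Return True if any word in the report approximately matches 'prostate'.

    The threshold is a hypothetical value, not one taken from the study.
    """
    target = "prostate"
    for word in re.findall(r"[a-z]+", report.lower()):
        if target in word:
            return True  # exact mention, possibly inside a longer word
        if SequenceMatcher(None, word, target).ratio() >= threshold:
            return True  # near miss such as a spelling variant
    return False


# Invented narrative reports, for illustration only.
reports = [
    "Core biopsy of prostate: adenocarcinoma.",
    "Prostrate needle cores, benign.",
    "Skin punch biopsy, no malignancy.",
]
selected = [r for r in reports if mentions_prostate(r)]
```

A fuzzy match of this kind tolerates misspellings in free text, at the cost of occasional false positives that a later topography check can remove.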
The prostate biopsy extract included the following variables: (1) episode number, (2) unique patient identifier (generated using a probabilistic matching algorithm 24), [...]

From the prostate biopsy data extract, the unique SNOMED CT code combinations were extracted, and a lookup table was developed (Figure 1). The lookup tables were developed in a two-step process: (1) code manipulation to combine descriptions, and (2) coding the lookup tables. The blue circles in the figure indicate which figures provide additional details for each step, that is, data manipulation (Figure 2) and the query that combines the data extracts and lookup tables in a single database query (Figure 3).

Combining Systematized Nomenclature of Medicine descriptions for the lookup table
The chained SNOMED codes were provided in a format that could not readily be analysed and had to be separated into individual columns, with the applicable descriptions added from the LIS code table. Data were prepared using Microsoft Excel (Microsoft Corporation, Redmond, Washington, United States) 25 (Figure 2). The unique code combinations and descriptions were extracted for the development of the lookup table.
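The separation of a chained code string into individual codes with their descriptions, done in Excel in the study, could be sketched in Python as follows. The comma separator is assumed from the data manipulation step described for the chains; the codes `M-EX01`, `M-EX02` and `T-EX01`, the `CODE_TABLE` contents and the function names are hypothetical placeholders, not real SNOMED codes or LIS entries.

```python
# Hypothetical stand-in for the LIS code table (code -> description).
CODE_TABLE = {
    "M-EX01": "Inflammation",
    "M-EX02": "Hyperplasia",
    "T-EX01": "Prostate",
}


def split_chain(chain: str) -> list[str]:
    """Split a comma-separated chain of codes, dropping empty entries."""
    return [code.strip() for code in chain.split(",") if code.strip()]


def describe(chain: str, code_table: dict[str, str]) -> list[tuple[str, str]]:
    """Pair each code in the chain with its description from the code table."""
    return [(code, code_table.get(code, "UNKNOWN")) for code in split_chain(chain)]


def unique_combinations(chains: list[str], code_table: dict[str, str]) -> set[tuple[str, ...]]:
    """Collect the distinct description combinations seen across all chains."""
    return {tuple(desc for _, desc in describe(c, code_table)) for c in chains}


combos = unique_combinations(["M-EX01, M-EX02", "M-EX02", "M-EX01,M-EX02"], CODE_TABLE)
```

Each distinct tuple in `combos` corresponds to one row that would need to be coded in the lookup table.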

Coding the Systematized Nomenclature of Medicine M and T lookup tables
We used the prepared M and T code descriptions to start populating the lookup tables. For the M lookup table, we used each unique code description combination to populate the following new variables: [...] (5) sub-result, with guidance from an anatomical pathologist and a urologist. The team reviewed each code combination and assigned the values to be captured in the lookup table. An example of the coding is provided for four biopsies in Table 1. We captured the matching ICD-O-3 codes for predominantly malignant findings. The diagnosis reports the overall biopsy finding, whereas the sub-diagnosis was used to differentiate the diagnosis, for example 'Benign, negative for malignancy' (ICD-O-3: 8000/0) and 'Hyperplasia', respectively (Episode A). Assigning the malignant code descriptions was straightforward, but benign findings were harder to code given the number of findings reported and their order. To clarify coding, we defined the reporting order to assign a benign sub-diagnosis as follows: (1) inflammation, (2) hypertrophy, (3) hyperplasia, (4) edema, [...]

[Figure 2, recoverable step text: Separate the chain of T code combinations (comma separated) and add the laboratory information system code [...]]

[...] only the matching values from the other tables reported using referential integrity (Table 1). 26,27 This query contained all the variables for the data analysis.
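The two ideas in this section, a fixed reporting order for benign sub-diagnoses and a query that keeps only biopsies with a matching lookup row, can be sketched in Python. This is an illustration under stated assumptions, not the Microsoft Access query used in the study: the priority list holds only the four entries that are legible in the text above (the original list continues), the lookup rows reuse the diagnosis, sub-diagnosis and ICD-O-3 values quoted in the article, and the episode labels and function names are hypothetical.

```python
# Reporting order for benign findings; the source list continues past 'Edema'.
BENIGN_PRIORITY = ["Inflammation", "Hypertrophy", "Hyperplasia", "Edema"]

# Hypothetical M lookup table keyed by the combination of code descriptions.
M_LOOKUP = {
    ("Hyperplasia",): {
        "diagnosis": "Benign, negative for malignancy",
        "sub_diagnosis": "Hyperplasia",
        "icd_o_3": "8000/0",
    },
    ("Adenocarcinoma",): {
        "diagnosis": "Neoplasm, malignant",
        "sub_diagnosis": "Adenocarcinoma",
        "icd_o_3": "8140/3",
    },
}


def benign_sub_diagnosis(findings):
    """Return the highest-priority benign finding present, or None."""
    present = {f.strip().lower() for f in findings}
    for name in BENIGN_PRIORITY:
        if name.lower() in present:
            return name
    return None


def join_biopsies(biopsies, lookup):
    """Inner-join style merge: keep only biopsies with a matching lookup row."""
    rows = []
    for episode, descriptions in biopsies:
        match = lookup.get(tuple(descriptions))
        if match is not None:
            rows.append({"episode": episode, **match})
    return rows


biopsies = [("A", ["Hyperplasia"]), ("B", ["Adenocarcinoma"]), ("C", ["Unmapped finding"])]
coded = join_biopsies(biopsies, M_LOOKUP)
```

In this sketch episode C is dropped because its combination has no lookup row, which mirrors how unmatched values are excluded from the combined query.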

Systematized Nomenclature of Medicine M code descriptive analysis
The number of prostate biopsies with an M and T code populated was assessed as a contingency table.

Descriptive analysis of diagnosis and sub-diagnosis
The diagnosis and sub-diagnosis volumes were reported for biopsies of prostatic origin with an M code populated. For each diagnosis, the sub-diagnoses were then reported. Where more than 10 sub-diagnoses were reported, the first 10 were reported and the remainder were grouped as 'Other'.
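The rule of reporting the first 10 sub-diagnoses and grouping the rest as 'Other' can be expressed as a short Python sketch. The function name `top_n_with_other` and the sample labels are assumptions for illustration, and ranking by descending frequency is assumed to be what 'first 10' means here.

```python
from collections import Counter


def top_n_with_other(values, n=10):
    """Return the n most frequent values with counts, pooling the rest as 'Other'."""
    counts = Counter(values).most_common()
    head = counts[:n]
    other = sum(count for _, count in counts[n:])
    if other:
        head.append(("Other", other))
    return head


# Invented sub-diagnosis labels: label s0 occurs 12 times, s1 11 times, and so on.
sample = [f"s{i}" for i in range(12) for _ in range(12 - i)]
summary = top_n_with_other(sample)
```

The total count is preserved by construction, so percentages calculated from the summary still sum to 100%.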
Inflammation was reported as a sub-diagnosis for 4075 benign biopsies (39.03%). This was followed by no pathologic diagnosis and hyperplasia at 26.90% (n = 2809) and 20.82% (n = 2174) respectively. Inflammation and hyperplasia were

Discussion
We showed that it was possible to automate prostate biopsy reporting in the Gauteng province using a commonly available relational database (Microsoft Access) and SNOMED lookup tables. The use of ICD-O-3 codes for malignant findings facilitates PCa reporting similar to that of cancer registries. 13 To routinely automate the registration and surveillance of PCa in South Africa, the lookup tables developed for this study would need to be introduced to the Corporate Data Warehouse (CDW). Lookup tables are routinely used by the CDW to transform laboratory data for reporting; for example, HIV serology results reported as 'NEG', 'N' or 'NEGATIVE' are transformed to a single value ('NEG') for uniform reporting. 28 The benefit of this mechanism for PCa reporting is that, as biopsies are reported in the LIS, the data replicated to the CDW will be conformed to report the biopsy diagnosis and sub-diagnosis within three months of diagnosis.
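The HIV serology example of conforming several reported values to one can be written as a small lookup in Python. The three source values and the target 'NEG' come from the text above; the function name `conform`, the case and whitespace handling, and the 'UNMAPPED' fallback are assumptions for illustration and do not describe the CDW implementation.

```python
# Lookup table mapping each reported variant to a single conformed value.
SEROLOGY_LOOKUP = {"NEG": "NEG", "N": "NEG", "NEGATIVE": "NEG"}


def conform(raw, lookup=SEROLOGY_LOOKUP, default="UNMAPPED"):
    """Map a raw result to its conformed value; flag anything not in the table."""
    key = raw.strip().upper()
    return lookup.get(key, default)


conformed = [conform(value) for value in ["NEG", "N", "negative ", "POS"]]
```

Values that fall through to the fallback are the ones that would need a new row in the lookup table, which is the same maintenance step the SNOMED lookup tables would require for new code combinations.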
Lookup tables would facilitate a constant feed of analysed PCa data to the South African NCR to ensure timeous reporting. Over time, any new SNOMED code combinations identified would have to be added to the lookup table. By providing these data at shorter intervals, it would be possible to triangulate against other local data sources (NCR and other). Triangulation is the process used in public health to review and interpret data from multiple sources that answer the same question for decision making. 30 Unpublished data from this study revealed that between 2012 and 2016, PCa incidence increased from 44.92 to 57.31 per 100 000, compared to 46.53 reported by the NCR in 2012. 31 Another advantage of this approach is that PCa data from other African countries using an LIS could also be analysed using the developed lookup tables to dramatically improve PCa reporting across Africa. Antoni et al. assessed the methods used for reporting the 2018 Global Cancer Statistics estimates. 32 For 14/51 African countries (27%), PCa incidence estimates were based on simple average rates from neighbouring countries. 32 For South Africa, projections of national incidence were sourced from the NCR. 32 The SNOMED lookup tables have the potential to improve both national and regional PCa incidence reporting across the African continent by providing more accurate data. With better data, cancer control initiatives could be better mobilised. The principles applied in our study could also be implemented for other cancers. The SNOMED codes are routinely captured for all cancers of public health importance. Similar lookup tables could be developed to report on lung, breast and cervical cancers, with incidence rates of 17.3, 49.0 and 13.5 per 100 000 respectively in 2018 in South Africa. 33 The approach described in our study is not unique.
Similar approaches have been employed in the Danish cancer registry, where data for 161 525 biopsies for the period 1995-2011 were reported using SNOMED codes. 22 The Danish cancer registry predominantly reported data for PCa, whereas our study reported data for negative biopsies as well. Combining the lookup tables reported in our study with text mining to extract the Gleason score, as reported in an unpublished study, could yield data similar to those of the Danish cancer registry; for example, a diagnosis of 'Neoplasm, malignant', a sub-diagnosis of 'Adenocarcinoma' and a Gleason score of 3 + 3 = 6 would be coded as 'bGS3+3'. 22

It is important to provide up-to-date PCa data at the national, provincial, district and health facility levels to identify hotspots where programmatic interventions are required. SNOMED lookup tables could be used to focus programmatic interventions on geographic areas with a higher PCa burden. Similar initiatives using laboratory data have been undertaken locally for HIV and tuberculosis services. 24,34,35 An example is the World Bank report that described spatial clustering analysis of HIV viral load suppression at the national, provincial, district, sub-district and health facility levels. 24 That study indicated that national laboratory data stored in the CDW have the potential to provide important strategic information on the quality and reach of the antiretroviral therapy programme by highlighting the geographical variation in the proportion of patients virally suppressed. 24 This information can be accessed by healthcare workers using the epidemiological dashboard developed to identify health facilities that are performing poorly. 24,34 Similar work has also been published using cluster of differentiation 4 (CD4) data to highlight areas where HIV-positive patients presenting for care have a higher burden of advanced disease. 34

The assimilation of health data into workable and user-friendly dashboards has had a big impact on how health data are used locally. 36,37,38 In the medium term, it would be possible to develop a PCa epidemiological dashboard, similar to the HIV example mentioned, to routinely report PCa trends by age, race group and geographical boundaries. The PCa dashboard could facilitate the reporting of the number of incident cases. The dashboard could also provide insights into how and where health care services are being accessed, to inform both guideline changes and programmatic improvements that facilitate equitable access to care across the country. It could also provide loss to follow-up and waiting time data for patients who had presented with an elevated prostate-specific antigen and who were confirmed with PCa histologically, using the CDW probabilistic matching algorithm. 29

Data reported in our study additionally demonstrated functionality by providing data for benign histological findings, which are potentially useful to identify trends for patients diagnosed with chronic inflammation who eventually progress to PCa. While the ICD-O-3 codes are particularly important for PCa cases, the additional benign findings are especially important for inflammation and hypertrophy, which have been shown to be linked with many other cancers. 39,40 Several studies have indicated that chronic inflammation has a potential role in prostatic carcinogenesis and tumour progression. 39,41,42 Nelson et al. reported that chronic or recurrent inflammation may play a role in the development of PCa. 43 Using the data generated with the lookup tables, patients with chronic inflammation could be followed up to identify whether they progress to PCa in an African setting.
Finally, to improve SNOMED reporting, it is recommended that the M and T codes be defined as mandatory fields. This will ensure that 100% of biopsies include these codes. This will address the 12% of biopsies without this information. Future research includes extending the lessons learnt with PCa in the use of lookup tables to other common cancers, as patient-level data is already available in the CDW and could be easily unlocked for national cancer reporting.

Limitations
One of the limitations is that the data for our study were limited to biopsies sent to the National Health Laboratory Service (NHLS) and did not include data from private sector laboratories, which limits the generalisability of our study. As private sector laboratories also use SNOMED codes, discussions will be initiated with them to share PCa data for national reporting.
An additional limitation of our data is that 12% of biopsies lacked a SNOMED M or T code. Unfortunately, these fields are not mandatory and can be left uncaptured. By amending rules and making these fields mandatory, all biopsies would prospectively be reviewed with at least one T and M code captured. The excluded data would also affect the generalisability of our study, as we are not able to determine the findings for these biopsies. To address this gap, text mining and machine learning approaches are being investigated. A sample of the biopsy data will be used to train the machine learning models to classify biopsies as malignant (1) or benign (0). The big data tools will be validated against well-populated SNOMED data using grid search, k-fold cross-validation, precision, recall and F-score. The combination of SNOMED, text mining and machine learning will hopefully address the missing data. With minor changes to the laboratory information system, this has the potential to report PCa histological findings for all biopsies.
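The validation metrics named above can be made concrete with a minimal Python sketch, using the same labelling of malignant (1) and benign (0). It covers only precision, recall and the F-score (taken here as the balanced F1 variant, which is an assumption); grid search and k-fold cross-validation are not shown. The function name and the example labels are invented for illustration and are not study data.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (malignant = 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # malignant, predicted malignant
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign, predicted malignant
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malignant, predicted benign
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Invented labels: SNOMED-derived reference versus model prediction.
reference = [1, 1, 1, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 1]
precision, recall, f1 = precision_recall_f1(reference, predicted)
```

In a cancer surveillance setting recall is usually the metric to watch most closely, since a missed malignant biopsy is an uncounted incident case.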

Key messages
Existing national laboratory data has been used for the first time in South Africa to report PCa diagnosis across a province using SNOMED lookup tables. This could be implemented across South Africa to provide timely PCa trends.

Conclusion
Our study has demonstrated that it is possible to automate PCa reporting using SNOMED codes for 88% of biopsies. The value of national laboratory data as shown in our study can easily be extended to deliver the timely monitoring of PCa trends across South Africa and other African countries. This could also be applied to report data for other cancers of public health interest.