Towards de-identification of general practitioners’ electronic medical records for secondary research

Johannes Hauswaldt, Roland Groh, Alireza Zarei, Knut Kaulke, Falk Schlegelmilch

Keywords: electronic medical record; secondary research; privacy protection; de-identification; health services research

Secondary use of GPs’ routine data in a legal way is technically and organisationally feasible. However, potentially identifying field content (PIF), especially free text entries, obstructs ‘factual anonymisation’ of a secondary data set (SDS) for scientific use.

Research questions:
Stepwise and systematic recognition of PIF in an exemplary SDS obtained from structured routine data via a mandatory software interface in a general practice management system. A data protection impact assessment (Art. 35 GDPR) was used for evaluation.

Studies were performed at four levels: (a) single field identifiers (variables, attributes), (b) their combinations, (c) their field content (expressions, values), and (d) the data set as a whole.
Instruments for (a) and (b) were field type, relative frequencies, categories and the GP’s expertise; for (c), TextCrawler; and for (d), ARX. Results were evaluated as the coincidence of a possible harm’s severity with its probability of occurrence.
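
Evaluating a harm by its severity together with its probability of occurrence can be pictured as a simple risk matrix. The following is a minimal sketch only: the three-level ordinal scales, the multiplicative score and the thresholds are illustrative assumptions and are not taken from the study.

```python
# Hypothetical three-level ordinal scale (an assumption, not from the study).
LEVELS = {"low": 1, "moderate": 2, "high": 3}


def risk_class(severity: str, probability: str) -> str:
    """Combine severity and probability of a harm into a coarse risk class.

    The product of the two ordinal levels and the cut-offs below are
    illustrative choices for this sketch.
    """
    score = LEVELS[severity] * LEVELS[probability]
    if score >= 6:
        return "high"
    if score >= 3:
        return "moderate"
    return "low"


# A harm of high severity and moderate probability lands in the top class
# under these assumed thresholds.
print(risk_class("high", "moderate"))
```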

An SDS from one German general practice, covering 14,2885 patients from 1993 until 2017, was studied as a CSV data file with 5,918,321 data lines and three variables (order, field identifier, field content).
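
Relative frequencies per field identifier can be computed directly on a three-column file of this shape. The sketch below assumes a semicolon delimiter, no header row, and invented field identifiers and contents; none of these details are given in the abstract.

```python
import csv
import io
from collections import Counter


def field_frequencies(csv_file, delimiter=";"):
    """Return the relative frequency of each field identifier.

    Expects rows of the form (order, field identifier, field content).
    The delimiter is an assumption; rows with fewer than three columns
    are skipped.
    """
    counts = Counter()
    for row in csv.reader(csv_file, delimiter=delimiter):
        if len(row) < 3:
            continue
        counts[row[1]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {field: n / total for field, n in counts.items()}


# Invented sample rows, for illustration only.
sample = io.StringIO(
    "1;findings;text a\n"
    "2;findings;text b\n"
    "3;date_of_death;17.05.2001\n"
    "4;permanent_remarks;text c\n"
)
frequencies = field_frequencies(sample)
print(frequencies)
```

Rare field identifiers found this way are candidates for closer manual review, since the abstract names the GP’s expertise as a further instrument at this level.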
PIF were discovered predominantly in ‘permanent remarks’ (‘doctor’s notes’) and ‘findings’, and were categorised as ‘names’, ‘toponyms’, ‘phone numbers’, ‘functional descriptors’ and ‘professions’, but semantic text qualifiers were not implicit. ‘Date of death’ was considered a harm of high impact to privacy protection with moderate occurrence probability; the remedy was replacement by ‘Year of death’. The combination of temporal order, patient pseudonym and certain field contents increased the risk of re-identification within this SDS as a whole.
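
Two of these findings lend themselves to a short illustration: generalising a date of death to its year, and flagging free text that looks like it contains a phone number. This is a rough sketch under stated assumptions, not the procedure used in the study: the accepted date formats, the phone pattern and the example strings are all hypothetical, and a pattern of this kind will miss many real cases.

```python
import re
from datetime import datetime

# Assumed date formats (German and ISO style); the abstract does not specify them.
DATE_FORMATS = ("%d.%m.%Y", "%Y-%m-%d")

# Assumed pattern: a number starting with +49 or 0, not preceded by a digit,
# followed by at least seven further digits/separators.
PHONE_LIKE = re.compile(r"(?<!\d)(?:\+49|0)[\d\s/\-]{6,}\d")


def year_of_death(value):
    """Generalise a date string to its year; return None if unparseable."""
    for fmt in DATE_FORMATS:
        try:
            return str(datetime.strptime(value.strip(), fmt).year)
        except ValueError:
            continue
    return None


def has_phone_like(text):
    """Flag free text that contains a phone-number-like sequence."""
    return PHONE_LIKE.search(text) is not None


print(year_of_death("17.05.2001"))
print(has_phone_like("daughter reachable at 0551 123456"))
print(has_phone_like("RR 120/80"))
```

Unparseable dates return `None` rather than being passed through, so that an unexpected format cannot silently leave a full date in the data set.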

Studies for PIF have to be done on a defined and completed SDS. They require appropriate professional expertise concerning data generation and its context in general practice, as well as meta-information about the primary data set.
With reasonable effort, PIFs can be identified only to a certain, imperfect extent. Recognising and assessing PIFs is a prerequisite for any de-identifying intervention.

Points for discussion:
GPs' EMR for secondary research

Potentially identifying content of GPs' EMR and breach of privacy

Appropriate de-identification of GPs' routine data and privacy protection