Provenance for data-driven healthcare

It is often said that we are amidst a ‘data revolution’. With more data being generated than ever before, there is great potential for data analytics to transform clinical care and health research, and to enable next-generation health services, including telemedicine and precision medicine.

Opportunities to transform healthcare come from access to data. Data analysts often seek to combine data from a number of sources to create a richer, more holistic base from which to derive insights. In practice, this requires data to be shared (or at least made accessible to analytical or computational processes), often across administrative boundaries. For instance, data from a general practitioner may be combined with hospital-managed electronic health records, augmented by data feeds from a patient’s wearable technologies and sensors in their home. As such, data is becoming increasingly federated, providing opportunities to use interconnected but decentralised data sources and stores to answer research questions. However, managing data in such an environment raises a number of challenges, particularly where collaboration is required. Here we briefly consider two aspects regarding transparency.

Data security

In the healthcare context, there is (rightly) a considerable focus on data confidentiality. At a technical level, this is typically realised through access controls that define who may access which information. For instance, there may be a rule that a doctor can only access a patient’s record if there is a treating relationship. Such controls are crucial to any data governance model, and work well where a common regime can apply, e.g. within a single organisation (such as a hospital) or platform.
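The treating-relationship rule above can be sketched in a few lines. This is a minimal illustration only: the class name AccessPolicy, its methods and the identifiers used are hypothetical, and a real access control system would be considerably richer.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AccessPolicy:
    """Hypothetical policy: a clinician may read a record only if treating that patient."""

    # Known (clinician_id, patient_id) treating relationships (illustrative data model).
    treating: set[tuple[str, str]] = field(default_factory=set)

    def add_treating_relationship(self, clinician_id: str, patient_id: str) -> None:
        """Register that the clinician is currently treating the patient."""
        self.treating.add((clinician_id, patient_id))

    def may_access(self, clinician_id: str, patient_id: str) -> bool:
        """Allow access only where a treating relationship is on record."""
        return (clinician_id, patient_id) in self.treating


policy = AccessPolicy()
policy.add_treating_relationship("dr_adams", "patient_17")
```

Such a check is only enforceable where every access passes through the same policy, which is why it suits a single organisation or platform.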

However, more controls are required where data is federated and shared across systems. Consider a hospital that provides data to a research organisation. The hospital has control over the data released to the researchers, but after the transfer, the hospital effectively loses visibility and thus control over (the researchers’ copy of) that data. In practical terms, this means that sharing and collaboration agreements in such environments are largely based on trust.

Given the sensitivity of health data, and the overarching responsibilities and obligations of those who handle personal information, this lack of transparency disincentivises sharing and collaboration. This runs directly against the vision of the ‘data revolution’, where data can be used and re-used, when and where appropriate, to bring value and innovation to health services.

Data quality

Another consideration is data quality, which directly affects the value of any analysis. However, assessing data quality in the federated environment described above can be difficult. Data will originate from a variety of sources, leading to variance even within the same dataset. Such concerns can be mitigated in controlled environments where equipment and procedures can be standardised, such as in a clinical unit or research project. However, where data is collected in a more ad hoc manner and comes from a range of sources, including consumer devices such as a patient’s wearable technology, contextual information surrounding the data becomes increasingly important: how the data was generated, processed and transformed, etc.

Provenance: Tracking data flow

Data provenance is an emerging area of research that aims to address these challenges. Provenance can be described as ‘data about data’, providing details of the data life cycle: where, when and by what or whom the data was produced, where it was transferred, and how it was processed.
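As a rough sketch of what such ‘data about data’ might look like, the record below captures the what, who, where and when of each step in a data item’s life cycle. The field names, identifiers and timestamps are assumptions made for illustration, not a standard or an existing system’s schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceEvent:
    """One hypothetical life-cycle event for a data item."""

    data_id: str         # which data item the event concerns
    action: str          # e.g. "produced", "transferred", "processed"
    agent: str           # what or who performed the action
    location: str        # where the action took place
    timestamp: datetime  # when the action took place


def lifecycle(events, data_id):
    """Return the events for one data item in chronological order."""
    return sorted(
        (e for e in events if e.data_id == data_id),
        key=lambda e: e.timestamp,
    )


# Illustrative log, deliberately out of order.
log = [
    ProvenanceEvent("hr-001", "transferred", "hospital_export_job", "research_org",
                    datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc)),
    ProvenanceEvent("hr-001", "produced", "wearable_sensor_A", "patient_home",
                    datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)),
    ProvenanceEvent("hr-001", "processed", "anonymisation_routine", "research_org",
                    datetime(2024, 1, 3, 10, 0, tzinfo=timezone.utc)),
]

history = lifecycle(log, "hr-001")
```

Reading history in order gives the life cycle of item hr-001: produced, then transferred, then processed.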

Provenance techniques can complement general access control regimes by improving levels of transparency. By making visible the flow of information, it is possible to track what is happening to data, even after it moves “out of one’s hands”. In line with the previous example, a strong provenance infrastructure could allow the hospital to ‘see’ that the research organisation is using and handling the data appropriately. Raising levels of transparency facilitates accountability, and therefore works to encourage data sharing and collaboration.
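To make the hospital example concrete, the sketch below compares the actions reported for a shared dataset against those permitted under a sharing agreement. The action names, the PERMITTED_ACTIONS set and the violations function are hypothetical; how such events would be captured and reported reliably is a research question in its own right.

```python
# Hypothetical terms of a sharing agreement: actions the recipient may perform.
PERMITTED_ACTIONS = {"stored", "anonymised", "analysed"}


def violations(events, permitted=PERMITTED_ACTIONS):
    """Return the (agent, action) pairs that fall outside the agreement."""
    return [(agent, action) for agent, action in events if action not in permitted]


# Illustrative provenance events reported for the researchers' copy of the data.
reported = [
    ("research_org", "stored"),
    ("research_org", "anonymised"),
    ("research_org", "shared_with_third_party"),
]

flagged = violations(reported)
```

Here flagged contains only the third event, which the hospital could then follow up with the research organisation.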

Provenance information also assists in assessing data quality. Recording where and how data is created, and the processes to which it was subject (e.g. whether it passed through a sanitisation, validation or anonymisation routine), can influence its interpretation and handling. For instance, provenance details might highlight that a particular reading came from a faulty or inaccurate sensor, or an untrusted source, in which case those readings might be ignored or transformed before use. Such information also helps in identifying procedural issues, for instance, where a member of staff consistently generates data that does not accord with that of other staff, which may indicate a training issue.
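A minimal sketch of this kind of filtering follows, assuming each reading carries its source and the processing steps it has passed through. The Reading type, the source names and the required step are invented for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Reading:
    """A measurement plus the provenance details needed to judge it (illustrative)."""

    value: float
    source: str                  # device or system that produced the reading
    steps: tuple[str, ...] = ()  # processing routines the reading has passed through


# Hypothetical quality rules derived from provenance information.
UNTRUSTED_SOURCES = {"sensor_B"}   # e.g. a sensor known to be faulty
REQUIRED_STEPS = {"validation"}    # routines a reading must have passed


def usable(reading: Reading) -> bool:
    """Keep a reading only if its source is trusted and it passed the required steps."""
    if reading.source in UNTRUSTED_SOURCES:
        return False
    return REQUIRED_STEPS.issubset(reading.steps)


readings = [
    Reading(72.0, "sensor_A", ("validation",)),
    Reading(180.0, "sensor_B", ("validation",)),  # untrusted source
    Reading(70.0, "sensor_A"),                    # never validated
]

kept = [r.value for r in readings if usable(r)]
```

Only the first reading survives. Dropping readings is just one option; as noted above, they might instead be transformed or flagged for review.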

Information management is key to realising the potential of data-driven healthcare. Though work in provenance is in its early stages, the techniques described show great promise for improving transparency in federated data environments, which will help to mitigate certain governance risks and to incentivise data sharing and collaboration.

Global Health, Epidemiology and Genomics
