What is Synthetic Data?

Synthetic data consists of observations that are generated by a computational model to mimic real data. It is linked to the real data only through its statistical properties. Synthetic data aims to be an easy and secure drop-in replacement for real observations, with similar details, distributions, and even occasional errors. Most notably, the computational model is designed not to reveal any sensitive details of the original data it simulates.

Synthetic data may be one way the healthcare sector can fulfill the great potential of data sharing.

Denmark and the other Nordic countries have some of the best and most complete health data in the world. These data have considerable potential to enable the healthcare sector to detect diseases early, improve diagnosis, and create individually tailored treatments. However, this potential cannot be realized easily because privacy is strictly protected (for good reason, of course). This makes it very difficult to share health data, even in anonymized form, and thus to use it for research. Further, the inability to share data makes it harder for the healthcare sector to find new treatment options by analyzing large quantities of health data, for example, in collaboration with the other Nordic countries.

We will explore a possible solution to this problem by developing and refining a method that can use original data to generate synthetic data sets. The Novo Nordisk Foundation is supporting the project with a grant of DKK 7.5 million.

Why Synthetic Data?

Especially in the health sector, real data can be unavailable or hard to utilize due to legal restrictions. These include limitations on the intended use, prohibitions on merging the data with other data sets, and bans on disclosing any portion of it for software testing. Even if real data is available, the security of personal information is a concern: in academic research, confidential data can be protected to some degree with pseudonymization or anonymization, but the EU General Data Protection Regulation (GDPR) makes anonymizing multidimensional health data very challenging. The advent of privacy regulation has made it both unavoidable and generally understood that confidential data needs to be protected. For these reasons, generating high-quality synthetic data can boost innovation in the health sector and, at the same time, help keep public opinion favorable toward the responsible use of national health registers.

Open-source access will ensure quality

Our project, SHARED, is a research project for the real-world use of synthetic data in healthcare. Synthetic data are created by running an original data set through a mathematical program that generates a new dataset with similar statistical properties while ensuring that the synthetic data cannot be attributed to specific individuals. This enables data to be shared – without compromising data security.
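To make the idea of "similar statistical properties" concrete, here is a minimal sketch in Python. It is not the SHARED project's method: it assumes a purely numeric data set and fits a simple multivariate Gaussian, and the column meanings ("age", "blood pressure") and all numbers are made up for illustration. A model this simple also carries no formal privacy guarantee; it only shows the general pattern of fitting a model to original data and then sampling new rows from that model.

```python
import numpy as np


def fit_gaussian(real):
    """Estimate the mean vector and covariance matrix of the original data."""
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)
    return mean, cov


def sample_synthetic(mean, cov, n_rows, rng):
    """Draw new rows from the fitted model instead of copying original rows."""
    return rng.multivariate_normal(mean, cov, size=n_rows)


rng = np.random.default_rng(seed=0)

# Stand-in for an original data set (illustrative values, not real health data):
# column 0 plays the role of age, column 1 of systolic blood pressure.
age = rng.normal(55, 12, size=5000)
blood_pressure = 90 + 0.6 * age + rng.normal(0, 8, size=5000)
real = np.column_stack([age, blood_pressure])

# Fit the model on the original data, then generate a synthetic data set from it.
mean, cov = fit_gaussian(real)
synthetic = sample_synthetic(mean, cov, n_rows=5000, rng=rng)

# Compare one statistical property: the correlation between the two columns.
real_corr = np.corrcoef(real, rowvar=False)[0, 1]
synth_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(f"correlation in original data:  {real_corr:.3f}")
print(f"correlation in synthetic data: {synth_corr:.3f}")
```

The two correlations should come out close to each other even though no synthetic row is taken from the original data. Real health data mixes categorical, temporal, and missing values, so practical generators need considerably richer models and an explicit assessment of re-identification risk.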

“An elaborate and secure model capable of generating synthetic data can help to harness the great potential inherent in deriving new contexts from our common health data in a safe and secure way. The results of the project can influence both disease prevention and treatment, not only in Denmark’s healthcare sector but throughout the Nordic countries,” says Niels-Henrik von Holstein-Rathlou, Head of Biomed, Novo Nordisk Foundation.

Together with partners in Finland, Henning Langberg will work on developing methods, exploring real-world applications, and raising awareness of the use of synthetic data in healthcare.

“Our major challenge is to include as many parameters as possible in the synthetic dataset without losing the context between data points. In addition, it is important for us to have an open-source approach to developing methods so that the academic community can ask relevant questions about the method during the project. This is essential when working in such a sensitive and regulated area as health.”

Henning Langberg - Chief Innovation Officer, Rigshospitalet, DK