AI Master's thesis

Utilising large language models to generate synthetic training data for machine learning applications in healthcare

An overview of my master's thesis on how large language models such as GPT-4 can generate synthetic training data for machine-learning applications in healthcare. This thesis remains one of the most interesting intersections for me: the point where AI, data quality, and practical usefulness meet. What mattered to me was not whether language models could quickly produce something flashy. The more relevant question was whether they could help under realistic healthcare conditions, where data is scarce and privacy constraints are strict.

Important note

The healthcare data and medical transcripts used in this project come from mtsamples.com. They are not linked in any way to real patient data from my personal or professional environment.

This project and this master’s thesis also have no relationship to my employer. The context was communicated transparently, but my employer was neither involved in the work nor connected to it in any institutional or organisational way.

Abstract

Utilising large language models to generate synthetic training data for machine learning applications in healthcare

In light of the current overload of the healthcare sector, the search for solutions to efficiently ease the burden on healthcare systems is becoming increasingly urgent. The integration of machine learning, a key component of artificial intelligence, offers a promising approach through the automation of routine tasks, specifically in the processing of unstructured medical data. Given the critical importance of high-quality data for effectively training machine learning models, and the stringent data protection regulations in the medical field, this thesis investigates the possibility of using large language models, in particular GPT-4, to generate synthetic training data. The aim is to improve the performance and effectiveness of machine learning applications in healthcare by utilising large language models to generate synthetic data, specifically for the text classification of medical transcripts. The experimental approach of this thesis takes real-world healthcare conditions into account by addressing the challenges of sub-optimal data availability, and performs a realistic comparison between models trained on real versus synthetic data. The results show that a combination of real and synthetic data, particularly in domains with small datasets, improves text classification significantly, with experiments on synthetically augmented training data achieving an F1 score of over 80%. This illustrates the potential of synthetic data to expand and improve the data basis for machine learning applications in the healthcare sector. Whilst the analysis suggests minor differences between the real and synthetic data, no noticeable bias appears to arise. However, it also becomes clear that a purely synthetic dataset is not ideal, as it may not fully capture all the nuances of real-world data.
The research highlights the importance of striking a balance between real and synthetic data in order to leverage the advantages of both approaches whilst minimising their limitations. At the same time, the boundaries and challenges are also highlighted, such as transferring the results to larger datasets and the issues surrounding data privacy. The reliance on real-world data to generate synthetic datasets emphasises the need for innovative approaches to ensure data privacy compliance. In summary, the work reinforces the suitability of synthetic data for improving the performance of machine learning models in healthcare, but highlights the importance of further research to overcome the identified challenges.
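To make the evaluation setup described in the abstract concrete, here is a minimal sketch in plain Python. It is not code from the thesis: the example transcripts, labels, and dataset names are hypothetical, and the macro-averaged F1 below is only one common variant of the metric. The core idea it illustrates is pooling real and synthetic labelled transcripts into one training set and scoring a classifier's predictions with an F1 score.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class, then average unweighted."""
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append(f1)
    return sum(scores) / len(scores)

# Hypothetical illustration: a small real dataset augmented with
# LLM-generated synthetic transcripts before classifier training.
real = [("chest pain radiating to the left arm", "cardiology"),
        ("persistent cough and wheezing", "pulmonology")]
synthetic = [("episodes of palpitations on exertion", "cardiology"),
             ("shortness of breath when lying flat", "pulmonology")]
training_data = real + synthetic  # the combined set a classifier trains on
```

The design point is the `training_data = real + synthetic` line: in the experiments described above, the synthetic examples do not replace the real ones but extend a small real dataset, and the resulting models are compared via F1 against models trained on real or purely synthetic data alone.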

Thanks

My thanks go to my first reviewer and supervisor, Prof. Dr. Bernd Ulmann.

My thanks also go to Tushaar Bhatt for the collegial and professional exchange.

And to my father for proofreading. Today, AI catches a lot of that; back then it was not yet that capable, and spelling has never been my strong suit. That is exactly where having a former German teacher in the family helps.

Work in progress