Applying informal health reports and search queries for public health monitoring: An evaluation of characteristics, potentials, and requirements of online self-reporting, discussions, and search behaviour

Marques Barros, Joana Carina
The introduction of digital data sources has positively impacted public health surveillance and has paved the way for novel approaches. Internet-based sources provide large volumes of data which can be analysed in near real-time and directly address limitations of traditional sources for disease surveillance. For example, the timeliness of these sources can be exploited for infectious disease outbreak detection, while their descriptive content provides informal health reports useful for monitoring noncommunicable illnesses such as diabetes. In this context, this thesis aims to provide a deeper understanding of how Internet-based sources are, and can be, used for public health monitoring. Hence, it falls within the discipline of health informatics and applies infodemiology and digital health, with potential applications in infoveillance. Our focus is on three sources of informal health reports: microblogs (Twitter), discussion forums (Reddit), and search queries (Google Trends). The reasoning behind this choice is threefold: 1) Twitter and Reddit are social media platforms; hence they directly capture users’ input; 2) these social media sources are complementary in nature: while Twitter is used to share spontaneous thoughts, reports from Reddit tend to be lengthy and more contextualised; 3) Google search queries offer a channel to study the search behaviour of potential patients worldwide. With Twitter, we targeted its potential use for global disease monitoring of infectious and noncommunicable illnesses while also exploring the effect of disease transmission hot-spots. Our findings showed that Twitter is not suitable for global disease monitoring, as it suffers from low recall when standard terminological resources are applied, and that transmission hot-spots do not present an increased mention of diseases.
Reddit forum posts encourage discussion and facilitate the exchange of information; hence, the contextual richness of this source is superior to that of the short and mostly isolated messages on Twitter. Given this, the research focused on exploring Reddit's capabilities for the classification of disease mentions. Using contextualised representation models and a hierarchical neural text classification architecture, we achieved F1-scores of 0.992 and 0.674 for the classification of 6 infectious and 17 non-infectious diseases. For Google Trends, we developed a suicide occurrences forecasting system for the Republic of Ireland, in which search volumes are used alongside official suicide statistics. We further explored whether relevant search queries generalise to suicide occurrence prediction in a distinct country with a shared language. Using a neural autoregression model, we achieved a mean absolute error of 4.14 for Ireland when using the search query “feeling down”, and 6.09 for the United Kingdom when using 34 search queries and unemployment data. The contributions of this thesis are fourfold: first, a comprehensive systematic literature review with a focus on Internet-based sources and their limitations, diseases targeted, and standard methods for disease surveillance; second, an assessment of Twitter’s capacity to provide health-related reports from disease transmission hot-spots for infectious and noncommunicable illnesses; third, a demonstration of the potential of Reddit for informal health report classification; fourth, an improvement in the prediction of suicide occurrences in Ireland, together with evidence that search queries selected for Ireland also contribute positively to the modelling of suicide occurrences in the United Kingdom.
NUI Galway
Attribution-NonCommercial-NoDerivs 3.0 Ireland