Embedded Reddit - An investigation into improved information extraction from Reddit through word embeddings
Bradshaw, Stephen
Publication Date
2020-12-31
Type
Thesis
Citation
Abstract
Social media has increasingly become a de facto source of information for people seeking to inform themselves on current events. Traditionally, there has always been a centralised source, such as newspapers or news broadcasts, curated by editors to ensure information integrity; social media sourced information may not have been created with the same investigative rigour that one would find in an official source. Natural Language Processing (NLP) is used for the automated extraction of information from text. However, it has been developed on standardised text and so does not perform as well in noisy environments such as social media sourced text. This thesis aims to address this by exploring pre-processing approaches to improve the representation of the data.
In addition to being a noisy source of information, social media contains many comments that agree, many that disagree, and some that are not applicable to a given information need.
The first experiment aims to address this by linking comments with information found in an official source. A comparison is made between Reddit and Twitter as potential sources of auxiliary information; in this thesis Reddit is used as the social media source of information. This thesis documents how one can connect comments made online with official sources. Issues that arise from this are related to intended meaning: how does one know in what context a term is used when comments have been extracted from a thread of similar comments? Through the use of graphs it is shown how term embeddings can be created from social media content to extrapolate meaning in comments. Utilising principles of distributional semantics, one can increase the cohesion of a number of related comments by defining their intended use. This is commonly referred to as Word Sense Disambiguation (WSD).
There are a number of issues that one must address when relying on social media data, such as an abundance of diverse opinion expressed in an ad hoc manner. It is challenging to determine which comments contain pertinent information for a user. Additionally, one is faced with other NLP issues such as sense disambiguation, accurate topic extraction and text processing, in addition to social media related issues such as poor grammar, misspelling and the use of colloquial language. The second experiment addresses this by creating more cohesion in clusters of comments. This can facilitate knowledge extraction by identifying highly related comments, which can be aggregated with the aim of identifying some consensus.
The third experiment builds on the second by performing a deeper analysis of the created clusters. Comment length is considered, and a more in-depth analysis of the results of the proposed pre-processing approach is introduced. To create greater cluster cohesion, a graph-based approach to representing the data is employed. This differs from many alternative approaches that look to create word embeddings to capture context, which would ordinarily be done through a vector representation of information. Through the use of a graph representation, it is possible to retain much of the information in the original text, which can later be queried for further clarification. In the final experiment I look at employing a graph-based method, analysing the impact of altering various parameters such as stopword removal, reference expansion, acronym expansion, stemming and the use of sentence delimiters to inform graph construction.
Taken together, this thesis looks to model information found in social media sources in order to extract pertinent information. Such an approach could be used to rapidly inform oneself of an event in real time.
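As a rough illustration of the kind of graph construction described above (not the code used in the thesis), the sketch below builds a weighted term co-occurrence graph from a list of comments, with switches for stopword removal, stemming and the use of sentence delimiters as co-occurrence boundaries. The stopword list, the naive stemmer and all function names are illustrative assumptions, and the graph is held in networkx.

import itertools
import re

import networkx as nx

# Illustrative subset only; a real system would use a fuller stopword list.
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "on", "was", "at"}

def naive_stem(token):
    # Crude suffix stripping standing in for a proper stemmer (e.g. Porter).
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def build_cooccurrence_graph(comments, remove_stopwords=True, stem=True, split_on_sentences=True):
    graph = nx.Graph()
    for comment in comments:
        # Optionally treat sentence delimiters as co-occurrence window boundaries.
        units = re.split(r"[.!?]+", comment) if split_on_sentences else [comment]
        for unit in units:
            tokens = re.findall(r"[a-z']+", unit.lower())
            if remove_stopwords:
                tokens = [t for t in tokens if t not in STOPWORDS]
            if stem:
                tokens = [naive_stem(t) for t in tokens]
            # Every pair of distinct terms in the same unit becomes a weighted edge.
            for u, v in itertools.combinations(sorted(set(tokens)), 2):
                weight = graph[u][v]["weight"] + 1 if graph.has_edge(u, v) else 1
                graph.add_edge(u, v, weight=weight)
    return graph

if __name__ == "__main__":
    comments = ["Great turnout at the rally today!", "The rally was cancelled. Any sources?"]
    g = build_cooccurrence_graph(comments)
    print(sorted(g.edges(data="weight"), key=lambda e: -e[2])[:5])

Toggling the switches changes which term pairs end up connected and how strongly, which is the kind of effect the final experiment measures when varying stopword removal, expansion, stemming and sentence delimiters.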
Funder
Publisher
NUI Galway
Publisher DOI
Rights
Attribution-NonCommercial-NoDerivs 3.0 Ireland