Utilizing passage-based evidences in information retrieval tasks
Sarwar, Ghulam
Identifiers
http://hdl.handle.net/10379/17683
https://doi.org/10.13025/16803
Repository DOI
Publication Date
2023-03-09
Type
Thesis
Abstract
In the age of information overload, information is only a few clicks away. This wealth of information certainly has its advantages. However, for a search engine, identifying the relevant information in such Big Data is still a challenging task. In particular, finding pertinent information in lengthy documents is difficult because queries are expressed in natural language and documents are topically diverse. Rather than considering the document as a whole, one viable method is to measure the relevance of a query to the concise units (passages) of a given document and to use those measurements to evaluate query-document relevance. This thesis aims to utilize these smaller units of a document, known as passages, in different information retrieval tasks. A passage is defined as a sequence of sentences or words that starts and ends at any place within a given document. Passage retrieval deals with identifying and retrieving small but explanatory portions of a document that answer a user's query. In this thesis, we first present a novel approach to improving document ranking by using different passage-based evidence. We evaluated our approach against existing passage retrieval methods, and a more in-depth analysis was undertaken into the effect of varying specific. We also explored the notion of query difficulty to understand whether or not the best-performing passage-based approach helps to improve performance on certain queries. Secondly, we present a novel graph approach that uses the similarity of passages within their parent document to form a cohesion structure. We discuss how relevant documents tend to be more cohesive than non-relevant documents. Furthermore, we re-ranked the documents by combining the cohesion score with a document similarity score to inspect its impact on the system's performance.
Moreover, we carried out experiments using different sliding windows around words in each passage to determine context and semantic relatedness. We then compared the state-of-the-art pseudo relevance feedback (PRF) technique with our proposed passage-based sliding window approach for query expansion. The use of top-ranked passages for query expansion was motivated by the expectation that relevant passages would remove the noise found in a text document that covers a number of topics. We extended our approach by including a popular word embedding (WE) approach, i.e., word2vec, and demonstrated that the passage-based PRF and WE approaches outperform their document-based equivalents. Lastly, we represent the passage answer set for each query as a graph and apply different graph-based measures to identify a correlation between the relevance of a document and those graph measures. Our approach was inspired by the cluster hypothesis, which states that similar entities are more likely to be closer to each other. We also discuss an application of our answer-set graph approach to Query Performance Prediction (QPP) tasks and a future avenue for applying it to topic visualization. We have shown that our passage-based graph features outperform the existing state-of-the-art QPP approaches and yield a positive correlation in distinguishing easy from hard queries.
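The two core ideas in the abstract, scoring a document through its passages and measuring how cohesive those passages are, can be illustrated with a minimal sketch. This is not the thesis's implementation: the fixed-size overlapping word windows, the bag-of-words cosine similarity, the maximum-passage scoring rule, and the mean pairwise similarity used as a cohesion value are all assumptions chosen for brevity, and every function name below is hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def split_into_passages(text, window=50, stride=25):
    """Split text into overlapping word windows (an assumed passage definition)."""
    words = text.lower().split()
    passages = []
    for start in range(0, len(words), stride):
        passages.append(words[start:start + window])
        if start + window >= len(words):
            break  # the last window already reaches the end of the text
    return passages


def cosine(a, b):
    """Cosine similarity between two bags of words (lists of tokens)."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def max_passage_score(query, text, window=50, stride=25):
    """Score a document by its best-matching passage (one possible passage-based evidence)."""
    query_terms = query.lower().split()
    passages = split_into_passages(text, window, stride)
    return max((cosine(query_terms, p) for p in passages), default=0.0)


def cohesion(text, window=50, stride=25):
    """Mean pairwise passage similarity, used here as an assumed stand-in for a cohesion score."""
    pairs = list(combinations(split_into_passages(text, window, stride), 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```

For a document whose only relevant material sits in one short span, `max_passage_score` returns a higher value than the cosine similarity computed over the whole document, which is the intuition behind ranking with passage-level evidence. A re-ranking step in the spirit of the abstract could combine `cohesion` with a document similarity score, though how the two are weighted is not specified here.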
Publisher
NUI Galway