Visual Exploration of Text Collections

Thai, VinhTuan
Thumbnail Image
Repository DOI
Publication Date
Despite many technological advances, the information overload problem still prevails in many application areas. It is challenging for users who are inundated with data to explore different facets of a complex information space to extract and put several pieces of facts together into a big picture that allows them to see various aspects of the data. Nevertheless, the availability of data should be embraced, not considered a threat for individuals and businesses alike. As a substantial amount of invaluable information to be explored resides within unstructured text data, there is a need to support users in visual exploration of text collections to obtain useful understandings that can be turned into worthwhile results. In this dissertation, we present our contributions in this area. We propose an approach to support users in exploring collections of text documents based on their interests and knowledge, which are represented by entities within an ontology. This ontology is used to drive the exploration and can be enriched with newly discovered entities matching users' interests in the process. Coordinated multiple views are used to visualize various aspects of text collections in relation to the set of entities of interest to users. To support faceted filtering of a large number of documents, we show how a multi-dimensional visualization can be employed as an alternative to the traditional linear listing of focus items. In this visualization, visual abstraction based on a combination of a conceptual structure and the structural equivalence of documents can be simultaneously used to deal with a large number of items. Furthermore, the approach also enables visual ordering based on the importance of facet values to support prioritized, cross-facet comparisons of focus items. We also report on an approach to support users' comprehension of the distribution of entities within a document based on the classic TileBars paradigm. Our approach employs a simplified version of a matrix reordering technique, which is based on the barycenter heuristic for bigraph edge crossing minimization, to reorder elements of TileBars-based Entities Distribution Views to tackle the visual complexity problem. The resulting reordered views enable users to quickly and easily identify which entities appear in the beginning, the end, or throughout a document. Lastly, our work is also concerned with visual concordance analysis, which supports users in understanding how terms are used within a document by investigating their usage contexts. To abstract away the textual details and yet retain the core facets of a term's contexts for visualization, we employ a statistical topic modeling method to group together words that are thematically related. These groups are used to visualize the gist of a term's usage contexts in a visualization called Context Stamp.
Publisher DOI
Attribution-NonCommercial-NoDerivs 3.0 Ireland