Investigating the use of distributional semantic models for co-hyponym identification in special corpora

QasemiZadeh, Behrang
Knowledge is assumed by cognitive science to consist of concepts that are organised and maintained by complex processes taking place in human minds. These processes are not yet directly accessible. Language is still the primary medium for communicating knowledge, and linguistic objects and structures are presumably expressions of knowledge and of its organisation in the mind. Collecting terms (i.e., creating a specialised vocabulary) and capturing their relationships are thus important mechanisms for distilling knowledge from specialised texts and for formalising it for machines. The approach taken in this thesis is to analyse the co-hyponymy relationships between terms as an organisational mechanism. Co-hyponyms are sets of lexical units sharing a common hypernym; bank and building society, for example, are co-hyponyms of the hypernym financial organisation. Analysing the co-hyponymy relationships between terms is important because it bridges the semantic gap between (a) specialised lexical knowledge, (b) the quantitative interpretation of meanings in specialised discourse, and (c) machine-accessible conceptualisation of knowledge. This thesis proposes the use of a vector-based distributional representation of terms in order to construct a quantitative conceptual model of kinds-sorts in a given field of knowledge. Among empirical methods for analysing linguistic structures, distributional approaches to semantics encode language data into models that should correspond to the meanings of linguistic entities. The meaning of an entity, such as a word or a phrase, is assumed to be a function of its statistical distribution in contexts. To use these methods, we thus need to define (a) the contexts, that is, which statistical information must be collected; and (b) the functions, that is, how this information must be used to correlate with a meaning. This thesis is a study of corpus-based distributional methods for characterising co-hyponymy between terms.
Terms are represented as vectors to form a so-called term-space model. To obviate the curse of dimensionality and to facilitate the construction of models, novel methods employing sparse random projections are proposed. Random Manhattan indexing is used to construct L1-normed spaces, and random indexing is used to construct L2-normed spaces. Following these steps, a memory-based classifier exploits the distance between vectors to identify the presence of targeted co-hyponymy relationships. An evaluation is also performed to assess how the method's parameters interact in influencing its performance. The advantages of this method are user-friendliness, flexibility in updating and maintenance, and an innate capacity to resemble the conceptual structures of a knowledge domain.
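As a rough illustration of the pipeline described above, the following Python sketch builds a small term space with random indexing and labels a term by its nearest labelled neighbour. It is a minimal sketch under stated assumptions, not the thesis's implementation: the toy corpus, the function names, the window size, the dimensionality, and the use of a 1-nearest-neighbour rule with Euclidean distance are all hypothetical choices made for illustration. Random Manhattan indexing for L1-normed spaces is not shown.

```python
import math
import random
from collections import defaultdict


def index_vector(dim, nonzeros, rng):
    """Sparse ternary index vector: `nonzeros` entries are +1 or -1, the rest 0."""
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzeros):
        vec[pos] = rng.choice((-1, 1))
    return vec


def build_term_space(sentences, dim=64, nonzeros=4, window=2, seed=0):
    """Random indexing: a term's vector is the sum of the index vectors
    of the words that co-occur with it inside a symmetric window."""
    rng = random.Random(seed)
    index = {}                              # context word -> fixed index vector
    space = defaultdict(lambda: [0] * dim)  # term -> accumulated context vector
    for tokens in sentences:
        for i, term in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                ctx = tokens[j]
                if ctx not in index:
                    index[ctx] = index_vector(dim, nonzeros, rng)
                row = space[term]
                for k, value in enumerate(index[ctx]):
                    row[k] += value
    return dict(space)


def euclidean(u, v):
    """L2 distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_label(term, space, labelled):
    """Memory-based (1-nearest-neighbour) decision: return the label of the
    labelled term whose vector lies closest to the vector of `term`."""
    best = min(labelled, key=lambda t: euclidean(space[term], space[t]))
    return labelled[best]


# Hypothetical toy corpus; "lender" shares all of its contexts with "bank".
corpus = [
    ["the", "bank", "lends", "money"],
    ["the", "lender", "lends", "money"],
    ["a", "river", "flows", "downhill"],
]
space = build_term_space(corpus)
labelled = {"bank": "financial organisation", "river": "landform"}
print(nearest_label("lender", space, labelled))
```

Because the index vectors are fixed once assigned, new text can be folded into the model by adding further index vectors to the existing term vectors, without recomputing the space. The dimensionality and the number of non-zero entries are the kind of parameters whose effect on performance would need to be evaluated.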
Attribution-NonCommercial-NoDerivs 3.0 Ireland