Deep learning for automatic term extraction
Banerjee, Shubhanker
Publication Date
2026-01-05
Type
doctoral thesis
Abstract
Automatic term extraction faces a fundamental challenge in specialized domains and low-resource languages: the scarcity of annotated training data needed to develop effective extraction systems. This thesis addresses this challenge through three complementary research directions that investigate how different methodological approaches can be strategically employed to maximize extraction performance across varying resource availability contexts.
This research investigates three core questions: (1) whether framing term extraction as a generation task using large language models with in-context learning improves performance in few-shot scenarios, (2) whether data augmentation through LLM-generated synthetic examples enhances domain-specific term extraction in few-shot settings, and (3) how annotator agreement and variability affect the quality of fine-grained semantic annotations in low-resource language datasets, and how annotation consistency can be improved.
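The generation framing in question (1) can be illustrated with a minimal few-shot prompt builder. The demonstration sentences, terms, and prompt wording below are illustrative assumptions, not the prompts actually used in the thesis; this is only a sketch of the general technique.

```python
# Sketch of framing term extraction as text generation with in-context
# learning: demonstrations pair a sentence with its terms, and the model
# is asked to complete the "Terms:" line for a new sentence.

def build_fewshot_prompt(demonstrations, query_sentence):
    """Assemble a few-shot prompt from (sentence, terms) demonstrations,
    leaving the final 'Terms:' field open for the model to generate."""
    parts = ["Extract the domain-specific terms from each sentence."]
    for sentence, terms in demonstrations:
        parts.append(f"Sentence: {sentence}\nTerms: {', '.join(terms)}")
    parts.append(f"Sentence: {query_sentence}\nTerms:")
    return "\n\n".join(parts)

# Hypothetical Heart Failure demonstrations (not from the ACTER corpus).
demos = [
    ("The patient showed reduced ejection fraction.",
     ["ejection fraction"]),
    ("Diuretics relieve congestion in heart failure.",
     ["diuretics", "congestion", "heart failure"]),
]
prompt = build_fewshot_prompt(demos, "Beta blockers lower the heart rate.")
print(prompt)
```

The prompt string would then be sent to a language model, whose completion is parsed as a comma-separated term list.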
The first investigation demonstrates that large language models significantly outperform traditional baselines for extremely small datasets (fewer than 30 examples) across diverse domains, including Heart Failure, Equitation, Wind Energy, and Corruption from the ACTER corpus, while revealing a critical transition point where fine-tuning smaller models becomes more effective as more labeled data becomes available. Building upon these insights, the second study introduces three novel data augmentation strategies (TermDA, ContextDA, and CombinedDA) that bridge the performance gap between extreme few-shot and higher-resource scenarios through synthetic data generation using both LLM-based approaches and Wikipedia-derived methods. The third investigation addresses the foundational challenge of creating high-quality annotated datasets for low-resource languages through the development of HTEC 2.0, a systematically annotated Hindi educational terminology corpus that demonstrates how iterative annotation refinement can improve inter-annotator agreement from 25.5% to 66.2% while implementing fine-grained semantic classification across seven distinct categories.
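The reported rise in inter-annotator agreement (25.5% to 66.2%) can be made concrete with a simple pairwise percentage-agreement computation. The thesis may well use a different agreement measure (e.g. a chance-corrected coefficient such as kappa), and the labels below are hypothetical, so treat this only as an illustrative sketch.

```python
# Illustrative pairwise percentage agreement between two annotators over
# item-level category labels. HTEC 2.0 may report a different metric;
# this only sketches the general idea of measuring agreement.

def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators assign the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("annotations must cover the same items")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Hypothetical fine-grained category labels for six candidate terms
# (category names invented for illustration).
ann1 = ["CONCEPT", "O", "PROCESS", "CONCEPT", "O", "ENTITY"]
ann2 = ["CONCEPT", "O", "CONCEPT", "CONCEPT", "O", "ENTITY"]
print(percent_agreement(ann1, ann2))  # 5/6, about 0.83
```

Iterative refinement, as described for HTEC 2.0, would repeat annotation rounds with updated guidelines until this figure stabilizes at an acceptable level.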
The convergence of these findings reveals a comprehensive framework for automatic term extraction that matches optimal strategies to specific data availability contexts: in-context learning for extreme few-shot scenarios, data augmentation for intermediate resource contexts, and systematic dataset creation for sustainable long-term progress. This framework provides practical guidelines for researchers and practitioners working across diverse domains and languages, while contributing methodological insights that extend beyond term extraction to the broader challenges of few-shot learning in natural language processing. The research demonstrates how a strategic combination of these approaches can create more robust and adaptable extraction systems, ultimately reducing barriers to developing terminological resources in specialized domains and low-resource languages.
Publisher
University of Galway
Rights
CC BY-NC-ND