GlossLM: Multilingual Pretraining for Low-Resource Interlinear Glossing

Authors: Michael Ginn (University of Colorado), Lindia Tjuatja (Carnegie Mellon University), Taiqi He (Carnegie Mellon University), Enora Rice (University of Colorado), Graham Neubig (Carnegie Mellon University), Alexis Palmer (University of Colorado), Lori Levin (Carnegie Mellon University)

arXiv: 2403.06399v1 (cs.CL)

18 pages, 3 figures. Submitted to ACL ARR Feb 2024. The first two authors contributed equally.

License: CC BY 4.0

Abstract: A key aspect of language documentation is the creation of annotated text in a format such as interlinear glossed text (IGT), which captures fine-grained morphosyntactic analyses in a morpheme-by-morpheme format. Prior work has explored methods to automatically generate IGT in order to reduce the time cost of language analysis. However, many languages (particularly those requiring preservation) lack sufficient IGT data to train effective models, and crosslingual transfer has been proposed as a method to overcome this limitation. We compile the largest existing corpus of IGT data from a variety of sources, covering over 450k examples across 1.8k languages, to enable research on crosslingual transfer and IGT generation. Then, we pretrain a large multilingual model on a portion of this corpus, and further finetune it to specific languages. Our model is competitive with state-of-the-art methods for segmented data and large monolingual datasets. Meanwhile, our model outperforms SOTA models on unsegmented text and small corpora by up to 6.6% morpheme accuracy, demonstrating the effectiveness of crosslingual transfer for low-resource languages.
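The abstract reports results as morpheme accuracy. As a rough illustration only, the sketch below computes one plausible version of that metric: the fraction of gold morpheme glosses matched by the prediction at the same position. The function names, the whitespace and hyphen tokenization, and the position-wise matching rule are all assumptions made for this example; the paper's actual evaluation may differ.

```python
import re


def split_glosses(gloss_line):
    """Split a gloss line into morpheme glosses.

    Assumption: words are separated by whitespace and morphemes by hyphens.
    """
    return [m for m in re.split(r"[\s\-]+", gloss_line.strip()) if m]


def morpheme_accuracy(predicted, gold):
    """Fraction of gold morpheme glosses matched at the same position.

    predicted, gold: parallel lists of gloss lines (one string per sentence).
    Gold morphemes with no predicted counterpart count as errors.
    """
    correct = 0
    total = 0
    for pred_line, gold_line in zip(predicted, gold):
        pred_morphs = split_glosses(pred_line)
        gold_morphs = split_glosses(gold_line)
        total += len(gold_morphs)
        for i, gold_morph in enumerate(gold_morphs):
            if i < len(pred_morphs) and pred_morphs[i] == gold_morph:
                correct += 1
    return correct / total if total else 0.0


# Hypothetical gloss lines, not taken from the paper's corpus.
score = morpheme_accuracy(["dog-PL run-PRS"], ["dog-PL run-PST"])
print(score)
```

In this made-up example, three of the four gold morpheme glosses are matched, so the score is 0.75.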

Submitted to arXiv on 11 Mar. 2024
