Preserving Endangered Languages with Morphological Analysis
Sydney DeFilippo
This project integrates a set of morphological rules into an automatic speech recognition (ASR) and AI language model system to efficiently transcribe speech into Interlinear Glossed Text (IGT). IGT is a format of linguistic annotation that segments the words of a language into their morphological units. This allows linguists to document languages without translating them into their own language, so the syntax and unique semantic meaning of every component are preserved. My work focuses on the Zongozotla dialect of Totonac, a language indigenous to the Sierra Norte de Puebla region of Mexico, with only 5,000 speakers. I am working with a native speaker to develop full documentation of the vocabulary and rules of the language, and I am constructing a program that will parse any given text and segment it along its morpheme boundaries. My experiment compares this method of segmentation to traditional algorithms such as byte-pair encoding (BPE) and Morfessor, to test whether segmenting along explicit morphological rules improves the results of the ASR model. Generating glosses directly from speech, rather than transcribing and glossing by hand, can significantly reduce the time and resources necessary for documenting these under-resourced languages. The main goal of my project is to improve current computational tools that can be used to preserve not only Totonac, but any under-resourced language.
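As a rough illustration of what segmenting along explicit morphological rules can look like, the Python sketch below strips known prefixes and suffixes from a word and pairs each piece with a gloss label. It is not the project's actual program: the affix tables, the gloss labels, and the function names `segment` and `to_igt` are all hypothetical, and the affixes are invented placeholders rather than Totonac data.

```python
# Toy rule tables: invented placeholder affixes, NOT real Totonac data.
PREFIXES = {"zu": "PST", "mi": "2SG"}
SUFFIXES = {"ot": "PL", "ek": "OBJ"}


def segment(word):
    """Split a word into (morpheme, gloss) pairs by greedy affix stripping."""
    prefixes = []
    suffixes = []
    stem = word

    # Peel prefixes off the left edge until none of the rules match.
    changed = True
    while changed:
        changed = False
        for affix, gloss in PREFIXES.items():
            # Always leave at least one character for the stem.
            if stem.startswith(affix) and len(stem) > len(affix):
                prefixes.append((affix, gloss))
                stem = stem[len(affix):]
                changed = True
                break

    # Peel suffixes off the right edge, keeping them in surface order.
    changed = True
    while changed:
        changed = False
        for affix, gloss in SUFFIXES.items():
            if stem.endswith(affix) and len(stem) > len(affix):
                suffixes.insert(0, (affix, gloss))
                stem = stem[:-len(affix)]
                changed = True
                break

    return prefixes + [(stem, "STEM")] + suffixes


def to_igt(word):
    """Return a (segmented form, gloss line) pair in a simple IGT-like layout."""
    pairs = segment(word)
    morphemes = "-".join(m for m, _ in pairs)
    glosses = "-".join(g for _, g in pairs)
    return morphemes, glosses


if __name__ == "__main__":
    form, gloss = to_igt("zumitalaot")
    print(form)
    print(gloss)
```

Unlike BPE or Morfessor, which learn subword units from corpus statistics, a rule-based segmenter of this kind only splits where a documented affix occurs, so every boundary it produces corresponds to an entry in the rule tables. A real system would also need to handle ambiguity and sound changes at morpheme boundaries, which this sketch ignores.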
Lori Levin