Contract Awarded to Develop Accurate, Cost-Effective Translation System for Low-Resource Languages (ACE-Translation)

The team will develop methods to automate the collection of training data, explore optimal structures for training datasets under constrained resources, implement machine translation systems for representative languages, develop novel quality assurance methods for those systems, and quantify their performance using standard benchmarks. The proposed approach builds on proven technologies for translating low-resource languages previously developed by team members, as well as on the team's experience with automated data collection and assurance of critical software systems.

The team will build on open-source pre-trained models such as mBART, DeltaLM, or NLLB, adapting them to new languages. The team will employ hierarchical adapter units, which are injected between the layers of the pre-trained model and can be trained on a new language, domain, or task. The proposed work will improve existing processes for using these adapter units by organizing them according to phylogenetic linguistic information, drawing on prior linguistic studies that provide this information in the form of phylogeny trees.

The team will develop innovative automated data collection methods that make it inexpensive to gather the data needed to implement translation systems for nearly any low-resource language. The proposed work will also include strategies for optimally selecting input data for system training. The system will leverage the team’s extensive experience with run-time assurance (RTA) to achieve high reliability.
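
One way to picture the phylogeny-based organization of adapter units described above is the minimal sketch below. It assumes one adapter per node of a phylogeny tree, so that a language uses the stack of adapters on the path from the root to its own leaf, and related languages share the adapters of their common ancestors. The tree, the `PARENT` table, and the function names `adapter_path` and `shared_adapters` are hypothetical and chosen only for illustration; they are not taken from the project.

```python
from typing import Dict, List, Optional

# Toy phylogeny stored as child -> parent. Illustrative only (assumption):
# this is not the tree used by the project.
PARENT: Dict[str, Optional[str]] = {
    "Indo-European": None,
    "Germanic": "Indo-European",
    "Romance": "Indo-European",
    "Faroese": "Germanic",
    "Icelandic": "Germanic",
    "Occitan": "Romance",
}


def adapter_path(language: str) -> List[str]:
    """Return the tree nodes from the root down to `language`.

    Under the assumption of one adapter per node, this is the ordered
    stack of adapters that would be applied for that language.
    """
    if language not in PARENT:
        raise KeyError(f"unknown language or family: {language}")
    path: List[str] = []
    node: Optional[str] = language
    while node is not None:
        path.append(node)
        node = PARENT[node]
    path.reverse()  # root first, language-specific adapter last
    return path


def shared_adapters(lang_a: str, lang_b: str) -> List[str]:
    """Return the adapters two languages have in common (shared ancestors)."""
    shared: List[str] = []
    for a, b in zip(adapter_path(lang_a), adapter_path(lang_b)):
        if a != b:
            break
        shared.append(a)
    return shared


if __name__ == "__main__":
    print(adapter_path("Faroese"))
    print(shared_adapters("Faroese", "Icelandic"))
    print(shared_adapters("Faroese", "Occitan"))
```

In this sketch, a newly added language would only need a new leaf adapter, while the family-level adapters it inherits could already have been trained on data from its better-resourced relatives.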