– a source language A and a target language B (goal: translate from A to B).
– an authentic bilingual parallel dataset (i.e., pairs of matching sentences in both languages).
– some additional monolingual data in the target language B.
We can use the bilingual dataset to train a system to “back-translate” from the target B to the source A. Running this trained system on the monolingual data in language B produces synthetic text in the source language A. Pairing each synthetic sentence in A with the authentic sentence in B it was generated from creates an artificial bilingual dataset, which can be added to the authentic one to train a system to translate from A to B. Note that the target side of the artificial pairs is still authentic text; only the source side is synthetic. It has been shown that this added synthetic data can lead to performance improvements.
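The back-translation steps described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `back_translate_augment` and `train_toy_word_model` are hypothetical names, and the word-by-word lookup "model" is a toy stand-in for a real translation system, used purely so that the example runs on its own.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (sentence in source language A, sentence in target language B)
Translator = Callable[[str], str]


def back_translate_augment(
    authentic_pairs: List[Pair],
    monolingual_b: List[str],
    train_fn: Callable[[List[Pair]], Translator],
) -> List[Pair]:
    """Return the authentic pairs plus synthetic (A, B) pairs built from monolingual B text."""
    # Step 1: flip the pairs so the system is trained in the B -> A direction.
    reversed_pairs = [(b, a) for a, b in authentic_pairs]
    translate_b_to_a = train_fn(reversed_pairs)
    # Step 2: back-translate each monolingual B sentence into synthetic A text,
    # keeping the authentic B sentence as the target side of the new pair.
    synthetic_pairs = [(translate_b_to_a(b), b) for b in monolingual_b]
    # Step 3: the combined dataset is what the final A -> B system is trained on.
    return authentic_pairs + synthetic_pairs


def train_toy_word_model(pairs: List[Pair]) -> Translator:
    """Toy stand-in for a real model: learns a word-by-word lookup table (assumption)."""
    table: Dict[str, str] = {}
    for src, tgt in pairs:
        for s, t in zip(src.split(), tgt.split()):
            table.setdefault(s, t)
    return lambda sentence: " ".join(table.get(w, w) for w in sentence.split())


# Here A is French and B is English, chosen only for illustration.
authentic = [("le chat dort", "the cat sleeps"), ("le chien mange", "the dog eats")]
monolingual_b = ["the dog sleeps"]
augmented = back_translate_augment(authentic, monolingual_b, train_toy_word_model)
```

In a real setup, `train_fn` would train a full translation model, and the synthetic source sentences would be noisier than in this toy case.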
Pre-trained Language Models
In NLP, we can often get better performance by taking a model that has already been trained more generally to “understand” the language and its patterns (the “pre-trained model”) and fine-tuning it for our specific task (e.g., answering questions about a given text). This often outperforms training a model from scratch directly on the target task, particularly because the data available for the target task is often limited. Essentially, we’re allowing the fine-tuning process to build on the knowledge already acquired by the pre-trained model.
Usually, pre-trained models are trained via unsupervised learning (more precisely, self-supervised learning, where the training labels come from the text itself) on large amounts of unlabeled text. For example, the popular model BERT is pre-trained on a large corpus of text to perform two tasks:
– Masked Language Modelling (MLM): some words in the text are masked (hidden) and the model needs to predict what those words are based on the rest of the sentence.
– Next Sentence Prediction (NSP): given two sentences, the model needs to predict whether the second one actually follows the first in the original text, or whether it was picked at random.
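To make the two pre-training tasks more concrete, here is a minimal sketch of how training examples for each could be built from raw text. It is a simplification rather than BERT's actual procedure: the helper names `mask_tokens` and `make_nsp_pairs` are hypothetical, tokens are just whitespace-separated words, every selected word is replaced by the mask token, and random sentences are drawn from the same short text.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"


def mask_tokens(
    tokens: List[str], mask_prob: float, rng: random.Random
) -> Tuple[List[str], List[Optional[str]]]:
    """MLM: hide some tokens; labels hold the hidden token, or None where nothing is hidden."""
    masked: List[str] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # the model sees the mask...
            labels.append(tok)    # ...and has to predict the original token
        else:
            masked.append(tok)
            labels.append(None)   # no prediction needed at this position
    return masked, labels


def make_nsp_pairs(
    sentences: List[str], rng: random.Random
) -> List[Tuple[str, str, bool]]:
    """NSP: build (first, second, is_next) examples; needs at least 3 sentences."""
    examples = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: the sentence that really comes next.
            examples.append((sentences[i], sentences[i + 1], True))
        else:
            # Negative example: a random sentence that is not the true next one.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            examples.append((sentences[i], rng.choice(candidates), False))
    return examples


rng = random.Random(0)
tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens, mask_prob=0.15, rng=rng)

sentences = ["I went out.", "It was raining.", "I took an umbrella.", "Then I came back."]
nsp_examples = make_nsp_pairs(sentences, rng)
```

The model is then trained to recover the non-`None` entries of `labels` from `masked`, and to predict the `is_next` flag from each sentence pair.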