This problem has raised interest in the field of parallel data filtering to identify and correct the most problematic issues for NMT, e.g., segments where source and target are identical, and misaligned sentences. This presentation by eBay provides an overview of the importance of parallel data filtering and its best practices, and adds to the useful points made by Doctor-sahib in this post. Data cleaning and preparation have always been necessary for developing superior MT engines, and most of us agree that they are even more critical now with neural network-based models.
This guest post is by Raymond Doctor, an old and wise acquaintance of mine who has spent over a decade at the Centre for Development of Advanced Computing (C-DAC) in Pune, India. He is a pioneer in digital Indic language work and was involved in several Indic language NLP initiatives, conducting research on Indic language parsers, segmentation, tokenization, stemming, lemmatization, NER, chunking, machine translation, and opinion mining.
He and I also share two Indian languages (Hindi and Gujarati). Over the years, he has shown me many examples of output from MT systems he has developed in his research that were the best I had seen for these two languages, going into and out of English. The success of his MT experiments is yet more proof that the best MT systems come from those who have a deep understanding of both the underlying linguistics and the MT system development methodology.
Overview of the SMT data alignment processes
"True inaccuracy and errors in
data are at least relatively straightforward to address, because they are
generally all logical in nature. Bias, on the other hand, involves changing how
humans look at data, and we all know how hard it is to change human
behavior."
Some other wisdom about data from Michiko:
Truth #1: Data are stupid and lazy.
Data are not intelligent. Even artificial intelligence must be taught before it learns to learn on its own (even that is debatable). Data have no ability on their own. It is often said that insights must be teased out of data.
Truth #2: Data are rarely an objective representation of reality (on their own).
I want to clarify this statement: it does not say that data are rarely accurate or error-free. Accuracy and correctness are dimensions of the quality of what is in the data themselves.
The text below is written by the guest author.
**************
Over the years, I have been studying the various recommendations given to prepare training data before submitting it to an NMT learning engine. I feel these recommended practices mainly emerged as best practices at the time of SMT, and have been carried over to NMT with less beneficial results.
I have identified six major pitfalls that data analysts fall into when preparing training data for NMT models. These data cleaning and preparation practices originated as best practices with SMT, where they were of benefit, and many of them are still being followed today. It is my opinion that they should now be avoided; doing so is likely to result in better outcomes.
While I have listed only a few practices that I feel should be avoided, many other SMT-based data prepping practices also make it likely that the training data will produce a sub-optimal NMT system. But the factors listed below are the most common practices resulting in lower output quality than would be possible by ignoring them. In my research with Indic language MT systems, I disregarded the advice given regarding punctuation, deduping, removing truncations, and MWEs, and found that the quality of NMT output considerably improves.
As far as possible, examples have been provided from a Gujarati <> English NMT system I have developed, but the same principles apply to any other parallel corpus.
1. PUNCTUATION
Quite a few sites tell you to remove punctuation before submitting the data for learning. It is my observation that this is not an optimal practice.
Punctuation marks are cues that allow the meaning to be understood. In a majority of languages, word order does not necessarily signal interrogation:
Tu viens? =You are coming?
Removing the interrogation marker creates confusion and dupes [see my remark on deduping below].
See what happens when a comma is removed:
Anne Marie, va manger mon enfant =Anne Marie, come have your lunch
Anne Marie va manger mon enfant=Anne Marie is going to eat my child
The mayor says, the commissioner is a fool.
The mayor, says the commissioner is a fool.
I feel that in preparing a corpus the punctuation markers should be retained.
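To make the recommendation concrete, here is a minimal sketch in Python (my illustration, not the author's actual pipeline) of a cleaning step that normalizes Unicode and whitespace while deliberately leaving punctuation intact:

```python
import re
import unicodedata

def clean_segment(text: str) -> str:
    """Normalize a segment without touching punctuation.

    Unicode normalization and whitespace collapsing are safe; stripping
    marks like "?" and "," is not, because punctuation carries meaning
    ("Anne Marie, va manger..." vs. "Anne Marie va manger...").
    """
    text = unicodedata.normalize("NFC", text)
    # Drop control characters that sometimes survive scraping.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cc")
    # Collapse runs of whitespace; punctuation is deliberately left alone.
    return re.sub(r"\s+", " ", text).strip()

print(clean_segment("Tu  viens?\t"))  # -> "Tu viens?" (question mark kept)
```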
2. TRUNCATIONS AND SHORT SENTENCES
Quite a few sites advise you to remove short sentences. Doing this, in my opinion, is a serious error. Short sentences are crucial for translating headlines, one of the stumbling blocks of NMT. Some have no verbs and are pure nominal structures.
Curfew declared: Noun + Verb
Sweep of Covid19 over the continent: Nominal Phrase
Google does not handle nominal structures well; here is an example, where the English word "sweep" is simply transliterated (સ્વીપ) rather than translated:
Sweep of Covid over India =ભારત ઉપર કોવિડનો સ્વીપ
I have found that retaining such structures strengthens and improves the quality of NMT output.
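The same idea can be expressed as a filter. A hedged sketch (the token thresholds are my own assumptions): drop only empty or absurdly long segments, and notably omit the minimum-length cutoff that many guides recommend, so headline-style nominal structures survive:

```python
def keep_pair(src: str, tgt: str, max_tokens: int = 200) -> bool:
    """Keep a parallel pair unless one side is empty or absurdly long.

    Note what is missing: the usual minimum-length cutoff. Short,
    headline-style segments ("Curfew declared") stay in the corpus.
    """
    src_len, tgt_len = len(src.split()), len(tgt.split())
    if src_len == 0 or tgt_len == 0:
        return False  # nothing to learn from an empty side
    return src_len <= max_tokens and tgt_len <= max_tokens

pairs = [
    ("The similarity between Modi and Mamata", "મોદી અને મમતા વચ્ચેનું સામ્ય"),
    ("", ""),
]
print([keep_pair(s, t) for s, t in pairs])  # [True, False]
```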
3. MULTIWORD EXPRESSIONS
Multiword expressions (MWEs) are expressions that are made up of at least two words, and which can be syntactically and/or semantically idiosyncratic in nature. Moreover, they act as a single unit at some level of linguistic analysis.
Like short sentences, MWEs are often ignored and removed from the training corpus. MWEs are very often fixed patterns found in a given language: short expressions, titles, or phrasal constructs, to name just a few of the possibilities. They cannot be translated literally and need to be glossed accurately. My experience has been that the higher the volume of MWEs provided, the better the quality of learning. A few MWEs in Gujarati are provided below:
agreement in absence =અભાવાન્વય
agreement in presence =ભવાન્વય
agriculture parity =કૃષિમૂલ્ય સમાનતા
aid and advice =સહાય અને સલાહ
aider and abettor =સહાયક અને મદદગાર
aim fire =નિશાન લગાવી ગોળી ચલાવવી
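One simple way to act on this, sketched below under my own assumptions about file layout (a tab-separated glossary like the list above, plus separate source and target training files), is to append the MWE glossary to the corpus so the model sees each expression as an ordinary training segment:

```python
from pathlib import Path

def add_mwe_glossary(glossary_path: str, src_out: str, tgt_out: str) -> int:
    """Append tab-separated MWE pairs (e.g. "aid and advice<TAB>સહાય અને સલાહ")
    to the source and target training files. Returns the number added."""
    added = 0
    with open(src_out, "a", encoding="utf-8") as src_f, \
         open(tgt_out, "a", encoding="utf-8") as tgt_f:
        for line in Path(glossary_path).read_text(encoding="utf-8").splitlines():
            if "\t" not in line:
                continue  # skip malformed rows
            en, gu = line.split("\t", 1)
            src_f.write(en.strip() + "\n")
            tgt_f.write(gu.strip() + "\n")
            added += 1
    return added
```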
4. DUPLICATES
A large number of sites providing recommendations on NMT training data preparation tell you to remove duplicates in both the source and target texts, an action popularly termed deduping. The argument is that deduping the corpus makes for greater accuracy. However, it is common for one English sentence to map to two or more strings in the target language, whether because of synonyms used in the target language or because of the flexible word order that is especially common in Indic languages.
Change of verbal expression and word order:
How are the trade talks between China and the US moving forward now. =ચીન તથા અમેરિકા વચ્ચે વેપાર વ્યવહાર વિષયક વાતચીત હવે કેવી આગળ વધે છે.
How are the trade talks between China and the US moving forward now. =ચીન તથા અમેરિકા વચ્ચે હવે વેપાર વિષયક વાતચીત કેવી આગળ વધે છે.
Synonyms:
Experts believe. =એક્સપર્ટ્સ માને છે.
Experts believe. =જાણકારોનું માનવું છે.
Experts believe. =નિષ્ણાતોનું માનવું છે.
Deduping the data in such cases reduces the quality of output.
The only case where deduping needs to be done is where we have two identical strings in both the source and target language; in other words, an exact duplicate. High-end NMT engines do not dedupe beyond this, since doing so deprives the MT system of the ability to provide variants, which can be seen by clicking on all or part of the gloss.
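The distinction is easy to honour in code. A minimal sketch (my illustration): key the deduplication on the (source, target) pair rather than on the source alone, so legitimate variant translations like the "Experts believe." examples above are preserved:

```python
def dedupe_exact_pairs(pairs):
    """Drop only exact (source, target) duplicates.

    A source sentence mapping to several different targets is kept:
    that variation (synonyms, flexible Indic word order) is signal,
    not noise.
    """
    seen, kept = set(), []
    for src, tgt in pairs:
        if (src, tgt) in seen:
            continue  # an exact duplicate in BOTH languages -> drop
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept

corpus = [
    ("Experts believe.", "એક્સપર્ટ્સ માને છે."),
    ("Experts believe.", "નિષ્ણાતોનું માનવું છે."),  # kept: new target variant
    ("Experts believe.", "એક્સપર્ટ્સ માને છે."),      # dropped: exact duplicate
]
print(len(dedupe_exact_pairs(corpus)))  # 2
```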
5. VERBAL PATTERNS
The inability to handle verbal patterns is the Achilles' heel of a majority of NMT engines, including Google's, insofar as English to Indic languages is concerned. Attention to this area is neglected because it is assumed that the corpus will cover all verbal patterns in both the source and target language. Even the best of corpora do not.
Providing a set of sentences covering the verbal patterns of both the source and target languages goes a long way (a coverage-check sketch follows the paradigm below).
Gujarati admits more than 40 verbal patterns, and NMT fails on quite a few:
They ought to have been listening to the PM's speech =તેઓએ વડા પ્રધાનનું ભાષણ સાંભળ્યું હોવું જોઈએ
Shown below is a sample of Gujarati verbal patterns with "to eat" as the paradigm:
You are eating =તમે ખાઓ છો
You are not eating =તમે ખાતા નથી
You ate =તમે ખાધું
You can eat =તમે ખાઈ શકો છો
You cannot eat =તમે નહીં ખાઈ શકો
You could not eat =તમે ખાઈ શક્યા નહીં
You did not eat =તમે ખાધું નહીં
You do not eat =તમે ખાતા નથી
You eat =તમે ખાધું
You had been eating =તમે ખાતા હતા
You had eaten =તમે ખાધું હતું
You have eaten =તમે ખાધું છે
You may be eating =તમે ખાતા હોઈ શકો છો
You may eat =તમે ખાઈ શકો છો
You might eat =તમે કદાચ ખાશો
You might not eat =તમે કદાચ ખાશો નહીં
You must eat =તમારે ખાવું જ જોઇએ
You must not eat =તમારે ખાવું ન જોઈએ
You ought not to eat =તમારે ખાવું ન જોઈએ
You ought to eat =તમારે ખાવું જોઈએ
You shall eat =તમે ખાશો
Similarly, habitual markers need the same treatment when glossed into French by a high-quality NMT system.
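To decide which paradigm sentences to add, it helps to measure how well the existing corpus already covers such patterns. A hedged sketch (the marker list and threshold are mine, drawn from the paradigm above) that counts English modal and aspect markers on the source side and flags the under-represented ones:

```python
import re
from collections import Counter

# A few markers drawn from the paradigm above; a real list would be longer.
PATTERNS = {
    "ought (not) to": r"\bought (not )?to\b",
    "might (not)": r"\bmight( not)?\b",
    "must (not)": r"\bmust( not)?\b",
    "had been V-ing": r"\bhad been \w+ing\b",
    "cannot / could not": r"\b(cannot|could not)\b",
}

def thin_patterns(source_lines, min_count=500):
    """Return the patterns occurring fewer than min_count times; these
    are the candidates for which paradigm sentences should be added."""
    counts = Counter()
    for line in source_lines:
        for name, rx in PATTERNS.items():
            if re.search(rx, line, flags=re.IGNORECASE):
                counts[name] += 1
    return {name: counts[name] for name in PATTERNS if counts[name] < min_count}

sample = ["They ought to have been listening to the PM's speech"]
print(thin_patterns(sample, min_count=1))  # every pattern except "ought (not) to"
```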
6. POLE AND VECTOR VERBS
This construct is very common in Indic languages and often leads to mistranslation.
Thus, Gujarati uses verbs such as જવું and કરવું as adjuncts to the main verb. The combination of the pole (main) verb and a vector verb such as જવું creates a new meaning:
મરી જવું is not translated as "die go" but simply as "die".
Gujarati admits around 15-20 such verbs, as do Hindi and other Indic languages, and once again, a corpus needs to be fed this type of data in the shape of sentences to produce better output.
In the case of English, it is phrasal verbs that often create similar issues:
Pick up, pick someone up, pick up the tab
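A related sanity check (a sketch; the tiny phrasal-verb watch-list is my own, not the author's) is to pull out the parallel pairs whose English side contains such constructions and inspect their glosses by hand before training:

```python
import re

# Small illustrative watch-list of non-compositional English phrasal verbs,
# mirroring the Indic pole + vector compounds discussed above.
WATCHLIST = [
    r"\bpick(?:ed|s|ing)?\b.{0,20}\bup\b",  # pick up, pick someone up
    r"\bgive(?:s|n|ing)?\b.{0,10}\bup\b",   # give up
]

def sample_constructions(pairs, limit=5):
    """Return up to `limit` pairs whose source side matches the watch-list,
    so their glosses can be checked manually."""
    found = []
    for src, tgt in pairs:
        if any(re.search(p, src, re.IGNORECASE) for p in WATCHLIST):
            found.append((src, tgt))
            if len(found) >= limit:
                break
    return found
```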
Conclusion
We noticed that when training data that ignores some of these frequent data preparation recommendations is sent in for training, the quality of MT output markedly improves. However, there is a caveat: if the training data falls below a threshold of 100,000 segments, following or not following the above recommendations makes little or no difference. Superior NMT systems require a sizeable corpus, and generally we see that at least a million-plus segments are needed.
A small set of sentences from various domains is provided below as evidence of the quality of output achieved using these techniques:
Now sieve this mixture.=હવે આ મિશ્રણને ગરણીથી ગાળી લો.
It is violence and violence is sin.=હિંસા કહેવાય અને હિંસા પાપ છે.
The youth were frustrated and angry.=યુવાનો નિરાશ અને ક્રોધિત હતા.
Give a double advantage.=ચાંલ્લો કરીને ખીર પણ ખવડાવી.
The similarity between Modi and Mamata=મોદી અને મમતા વચ્ચેનું સામ્ય
I'm a big fan of Bumrah.=હું બુમરાહનો મોટો પ્રશંસક છું.
38 people were killed.=તેમાં 38 લોકોના મોત થયા હતા.
The stranger came and asked.=અજાણ્યા યુવકે આવીને પૂછ્યું.
Jet now has 1,300 pilots.=હવે જેટની પાસે 1,300 પાયલટ છે.