How to Prepare a French-to-English Dataset for Machine Translation
Machine translation is the challenging task of converting text from a source language into coherent and matching text in a target language.
Neural machine translation systems, such as encoder-decoder recurrent neural networks, are achieving state-of-the-art results for machine translation with a single end-to-end system trained directly on the source and target languages.
Standard datasets are required to develop, explore, and become familiar with neural machine translation systems.
In this tutorial, you will discover the Europarl standard machine translation dataset and how to prepare the data for modeling.
After completing this tutorial, you will know:
The Europarl dataset comprises the proceedings of the European Parliament in 11 languages.
How to load and clean the parallel French and English transcripts ready for modeling in a neural machine translation system.
How to reduce the vocabulary size of both French and English data in order to reduce the complexity of the translation task.
Europarl Machine Translation Dataset
Europarl is a standard dataset used for statistical machine translation and, more recently, neural machine translation.
It comprises the proceedings of the European Parliament, hence the name, a contraction of European Parliament.
The proceedings are the transcriptions of speakers at the European Parliament, which are translated into 11 different languages.
It is a collection of the proceedings of the European Parliament, dating back to 1996. Altogether, the corpus comprises about 30 million words for each of the 11 official languages of the European Union.
Once downloaded, you should have the file “fr-en.tgz” in your current working directory.
You can unzip this archive file using the tar command, as follows:
tar zxvf fr-en.tgz
You will now have two files, as follows:
English: europarl-v7.fr-en.en (288M)
French: europarl-v7.fr-en.fr (331M)
Below is a sample of the English file.
Resumption of the session
I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.
Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.
You have requested a debate on this subject in the course of the next few days, during this part-session.
In the meantime, I should like to observe a minute' s silence, as a number of Members have requested, on behalf of all the victims concerned, particularly those of the terrible storms, in the various countries of the European Union.
Below is a sample of the French file.
Reprise de la session
Je déclare reprise la session du Parlement européen qui avait été interrompue le vendredi 17 décembre dernier et je vous renouvelle tous mes vœux en espérant que vous avez passé de bonnes vacances.
Comme vous avez pu le constater, le grand "bogue de l'an 2000" ne s'est pas produit. En revanche, les citoyens d'un certain nombre de nos pays ont été victimes de catastrophes naturelles qui ont vraiment été terribles.
Vous avez souhaité un débat à ce sujet dans les prochains jours, au cours de cette période de session.
En attendant, je souhaiterais, comme un certain nombre de collègues me l'ont demandé, que nous observions une minute de silence pour toutes les victimes, des tempêtes notamment, dans les différents pays de l'Union européenne qui ont été touchés.
Load Dataset
Let’s start off by loading the data files.
We can load each file as a string. Because the files contain Unicode characters, we must specify an encoding when loading the files as text. In this case, we will use UTF-8, which handles the Unicode characters in both files.
The function below, named load_doc(), will load a given file and return it as a blob of text.
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, mode='rt', encoding='utf-8')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
Next, we can split the file into sentences.
Generally, one utterance is stored on each line. We can treat these as sentences and split the file by new line characters. The function to_sentences() below will split a loaded document.
# split a loaded document into sentences
def to_sentences(doc):
    return doc.strip().split('\n')
When preparing our model later, we will need to know the length of sentences in the dataset. We can write a short function to calculate the shortest and longest sentences.
# shortest and longest sentence lengths
def sentence_lengths(sentences):
    lengths = [len(s.split()) for s in sentences]
    return min(lengths), max(lengths)
We can tie all of this together to load and summarize the English and French data files. The complete example is listed below.
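A minimal sketch of such a script is shown below. For self-containment it writes a small sample file to disk first; swapping the sample filename for 'europarl-v7.fr-en.en' (and 'europarl-v7.fr-en.fr') summarizes the real data.

```python
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, mode='rt', encoding='utf-8')
    text = file.read()
    file.close()
    return text

# split a loaded document into sentences
def to_sentences(doc):
    return doc.strip().split('\n')

# shortest and longest sentence lengths
def sentence_lengths(sentences):
    lengths = [len(s.split()) for s in sentences]
    return min(lengths), max(lengths)

# write a small sample file so the example runs anywhere
sample = 'Resumption of the session\nI declare resumed the session of the European Parliament\n'
with open('sample.en', 'wt', encoding='utf-8') as f:
    f.write(sample)

# load, split, and summarize
doc = load_doc('sample.en')
sentences = to_sentences(doc)
minlen, maxlen = sentence_lengths(sentences)
print('Sentences: %d, min=%d, max=%d' % (len(sentences), minlen, maxlen))
```

Running the same three calls against each of the two Europarl files reports the sentence count and the shortest and longest sentence lengths for that language.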
Once normalized, we save the lists of clean lines directly in binary format using the pickle API. This will speed up loading for further operations later and in the future.
Reusing the loading and splitting functions developed in the previous sections, the complete example is listed below.
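The normalization step itself is not reproduced here; below is a plausible clean_lines() sketch, assuming the operations implied by the sample output further down: transliterating accented characters to their closest ASCII form, lowercasing, stripping punctuation and non-printable characters, and dropping tokens that contain digits. The save_clean_sentences() helper pickles the result.

```python
import re
import string
import unicodedata
from pickle import dump

# clean a list of lines
def clean_lines(lines):
    cleaned = list()
    # regex for matching non-printable characters
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    # translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for line in lines:
        # normalize unicode characters to their closest ASCII equivalents
        line = unicodedata.normalize('NFD', line)
        line = line.encode('ascii', 'ignore').decode('utf-8')
        # tokenize on white space
        tokens = line.split()
        # convert to lowercase
        tokens = [t.lower() for t in tokens]
        # remove punctuation from each token
        tokens = [t.translate(table) for t in tokens]
        # remove non-printable characters from each token
        tokens = [re_print.sub('', t) for t in tokens]
        # drop tokens containing numbers
        tokens = [t for t in tokens if t.isalpha()]
        cleaned.append(' '.join(tokens))
    return cleaned

# save a list of clean sentences to file
def save_clean_sentences(sentences, filename):
    dump(sentences, open(filename, 'wb'))
    print('Saved: %s' % filename)

# demonstrate on a couple of lines from the samples above
sample = ['Resumption of the session',
          'Je déclare reprise la session du Parlement européen.']
print(clean_lines(sample))
```

Loading each language file with load_doc(), splitting with to_sentences(), cleaning with clean_lines(), and saving with save_clean_sentences() to 'english.pkl' and 'french.pkl' produces the files used below.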
After running, the clean sentences are saved in english.pkl and french.pkl files respectively.
As part of the run, we also print the first few lines of each list of clean sentences, reproduced below.
English:
resumption of the session
i declare resumed the session of the european parliament adjourned on friday december and i would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period
although as you will have seen the dreaded millennium bug failed to materialise still the people in a number of countries suffered a series of natural disasters that truly were dreadful
you have requested a debate on this subject in the course of the next few days during this partsession
in the meantime i should like to observe a minute s silence as a number of members have requested on behalf of all the victims concerned particularly those of the terrible storms in the various countries of the european union
please rise then for this minute s silence
the house rose and observed a minute s silence
madam president on a point of order
you will be aware from the press and television that there have been a number of bomb explosions and killings in sri lanka
one of the people assassinated very recently in sri lanka was mr kumar ponnambalam who had visited the european parliament just a few months ago
French:
reprise de la session
je declare reprise la session du parlement europeen qui avait ete interrompue le vendredi decembre dernier et je vous renouvelle tous mes vux en esperant que vous avez passe de bonnes vacances
comme vous avez pu le constater le grand bogue de lan ne sest pas produit en revanche les citoyens dun certain nombre de nos pays ont ete victimes de catastrophes naturelles qui ont vraiment ete terribles
vous avez souhaite un debat a ce sujet dans les prochains jours au cours de cette periode de session
en attendant je souhaiterais comme un certain nombre de collegues me lont demande que nous observions une minute de silence pour toutes les victimes des tempetes notamment dans les differents pays de lunion europeenne qui ont ete touches
je vous invite a vous lever pour cette minute de silence
le parlement debout observe une minute de silence
madame la presidente cest une motion de procedure
vous avez probablement appris par la presse et par la television que plusieurs attentats a la bombe et crimes ont ete perpetres au sri lanka
lune des personnes qui vient detre assassinee au sri lanka est m kumar ponnambalam qui avait rendu visite au parlement europeen il y a quelques mois a peine
My reading of French is very limited, but at least as far as the English is concerned, further improvements could be made, such as dropping or concatenating the hanging 's' tokens left behind by possessives.
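As a sketch of that suggestion, a small hypothetical helper (not part of the pipeline above) that simply drops the hanging 's' tokens, as in "a minute s silence":

```python
# drop hanging single 's' tokens left over from possessives
def drop_hanging_s(line):
    return ' '.join(token for token in line.split() if token != 's')

print(drop_hanging_s('please rise then for this minute s silence'))
```

The alternative mentioned above, concatenating the 's' back onto the preceding token, would be a similarly small change.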
Reduce Vocabulary
As part of the data cleaning, it is important to constrain the vocabulary of both the source and target languages.
The difficulty of the translation task is proportional to the size of the vocabularies, which in turn impacts model training time and the size of a dataset required to make the model viable.
In this section, we will reduce the vocabulary of both the English and French text and mark all out of vocabulary (OOV) words with a special token.
We can start by loading the pickled clean lines saved from the previous section. The load_clean_sentences() function below will load and return a list for a given filename.
# load a clean dataset
def load_clean_sentences(filename):
    return load(open(filename, 'rb'))
Next, we can count the occurrence of each word in the dataset. For this we can use a Counter object, a Python dictionary subclass keyed on words that updates a count each time a new occurrence of a word is added.
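As a quick illustration of how Counter accumulates counts across repeated update() calls:

```python
from collections import Counter

vocab = Counter()
vocab.update('the cat sat on the mat'.split())
vocab.update('the dog sat'.split())
# counts accumulate across updates
print(vocab['the'])
print(vocab['sat'])
```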
The to_vocab() function below creates a vocabulary for a given list of sentences.
# create a frequency table for all words
def to_vocab(lines):
    vocab = Counter()
    for line in lines:
        tokens = line.split()
        vocab.update(tokens)
    return vocab
We can then process the created vocabulary and remove all words from the Counter that have an occurrence below a specific threshold.
The trim_vocab() function below does this and accepts a minimum occurrence count as a parameter and returns an updated vocabulary.
# remove all words with a frequency below a threshold
def trim_vocab(vocab, min_occurrence):
    tokens = [k for k, c in vocab.items() if c >= min_occurrence]
    return set(tokens)
Finally, we can update the sentences, removing all words not in the trimmed vocabulary and marking their removal with a special token, in this case the string "unk".
The update_dataset() function below performs this operation and returns a list of updated lines that can then be saved to a new file.
# mark all OOV with "unk" for all lines
def update_dataset(lines, vocab):
    new_lines = list()
    for line in lines:
        new_tokens = list()
        for token in line.split():
            if token in vocab:
                new_tokens.append(token)
            else:
                new_tokens.append('unk')
        new_line = ' '.join(new_tokens)
        new_lines.append(new_line)
    return new_lines
We can tie all of this together and reduce the vocabulary for both the English and French dataset and save the results to new data files.
We will use a minimum occurrence count of 5, but you are free to explore other thresholds suitable for your application.
The complete code example is listed below.
from pickle import load
from pickle import dump
from collections import Counter

# load a clean dataset
def load_clean_sentences(filename):
    return load(open(filename, 'rb'))

# save a list of clean sentences to file
def save_clean_sentences(sentences, filename):
    dump(sentences, open(filename, 'wb'))
    print('Saved: %s' % filename)

# create a frequency table for all words
def to_vocab(lines):
    vocab = Counter()
    for line in lines:
        tokens = line.split()
        vocab.update(tokens)
    return vocab

# remove all words with a frequency below a threshold
def trim_vocab(vocab, min_occurrence):
    tokens = [k for k, c in vocab.items() if c >= min_occurrence]
    return set(tokens)

# mark all OOV with "unk" for all lines
def update_dataset(lines, vocab):
    new_lines = list()
    for line in lines:
        new_tokens = list()
        for token in line.split():
            if token in vocab:
                new_tokens.append(token)
            else:
                new_tokens.append('unk')
        new_line = ' '.join(new_tokens)
        new_lines.append(new_line)
    return new_lines

# load English dataset
filename = 'english.pkl'
lines = load_clean_sentences(filename)
# calculate vocabulary
vocab = to_vocab(lines)
print('English Vocabulary: %d' % len(vocab))
# reduce vocabulary
vocab = trim_vocab(vocab, 5)
print('New English Vocabulary: %d' % len(vocab))
# mark out of vocabulary words
lines = update_dataset(lines, vocab)
# save updated dataset
filename = 'english_vocab.pkl'
save_clean_sentences(lines, filename)
# spot check
for i in range(10):
    print(lines[i])

# load French dataset
filename = 'french.pkl'
lines = load_clean_sentences(filename)
# calculate vocabulary
vocab = to_vocab(lines)
print('French Vocabulary: %d' % len(vocab))
# reduce vocabulary
vocab = trim_vocab(vocab, 5)
print('New French Vocabulary: %d' % len(vocab))
# mark out of vocabulary words
lines = update_dataset(lines, vocab)
# save updated dataset
filename = 'french_vocab.pkl'
save_clean_sentences(lines, filename)
# spot check
for i in range(10):
    print(lines[i])
First, the size of the English vocabulary is reported, followed by the updated size. The updated dataset is saved to the file 'english_vocab.pkl', and a spot check of some updated examples with out-of-vocabulary words replaced with "unk" is printed.
English Vocabulary: 105357
New English Vocabulary: 41746
Saved: english_vocab.pkl
We can see that the size of the vocabulary was shrunk by about half to a little over 40,000 words.
resumption of the session
i declare resumed the session of the european parliament adjourned on friday december and i would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period
although as you will have seen the dreaded millennium bug failed to materialise still the people in a number of countries suffered a series of natural disasters that truly were dreadful
you have requested a debate on this subject in the course of the next few days during this partsession
in the meantime i should like to observe a minute s silence as a number of members have requested on behalf of all the victims concerned particularly those of the terrible storms in the various countries of the european union
please rise then for this minute s silence
the house rose and observed a minute s silence
madam president on a point of order
you will be aware from the press and television that there have been a number of bomb explosions and killings in sri lanka
one of the people assassinated very recently in sri lanka was mr unk unk who had visited the european parliament just a few months ago
The same procedure is then performed on the French dataset, saving the result to the file ‘french_vocab.pkl‘.
French Vocabulary: 141642
New French Vocabulary: 58800
Saved: french_vocab.pkl
We see a similar shrinking of the size of the French vocabulary.
reprise de la session
je declare reprise la session du parlement europeen qui avait ete interrompue le vendredi decembre dernier et je vous renouvelle tous mes vux en esperant que vous avez passe de bonnes vacances
comme vous avez pu le constater le grand bogue de lan ne sest pas produit en revanche les citoyens dun certain nombre de nos pays ont ete victimes de catastrophes naturelles qui ont vraiment ete terribles
vous avez souhaite un debat a ce sujet dans les prochains jours au cours de cette periode de session
en attendant je souhaiterais comme un certain nombre de collegues me lont demande que nous observions une minute de silence pour toutes les victimes des tempetes notamment dans les differents pays de lunion europeenne qui ont ete touches
je vous invite a vous lever pour cette minute de silence
le parlement debout observe une minute de silence
madame la presidente cest une motion de procedure
vous avez probablement appris par la presse et par la television que plusieurs attentats a la bombe et crimes ont ete perpetres au sri lanka
lune des personnes qui vient detre assassinee au sri lanka est m unk unk qui avait rendu visite au parlement europeen il y a quelques mois a peine
Further Reading
This section provides more resources on the topic if you are looking to go deeper.