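Engineer Column

Our machine learning engineers, who cover a broad range of technical fields,
introduce and explain AI and machine learning technologies
that originate in academia.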

A New Japanese-English Parallel Corpus

− 新日英対訳コーパス −

2021.11.9
Laboro.AI Inc.　Machine Learning Engineer　Zhao Xinyi

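(Note: Each section of this column is followed by a Japanese translation produced by the machine translation model we developed. So that readers can judge its performance for themselves, the translations have not been edited by hand apart from the replacement of a few terms. Please be aware that some passages may therefore read unnaturally.)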

INTRODUCTION

Parallel corpora are essential to Natural Language Processing (NLP) research, especially for translation. However, such research sometimes suffers from a lack of high-quality corpora. Hoping to make NLP researchers' lives easier, we are sharing a Japanese-English parallel corpus. To assess its quality, we trained NMT models on the corpus and evaluated them on several datasets. The models reach good BLEU scores and give decent translations of text from various fields and sources. We are now making the corpus publicly available.

Beyond that, with this article we also want to share a methodology for building a parallel corpus efficiently and at low cost. Traditionally, collecting a parallel corpus requires a considerable amount of linguistic resources (corpora, dictionaries, etc.), which is difficult for those with limited budgets. We found some workarounds that make the whole process more practical. In this article, we first explain how our corpus was collected, including where the data come from and how parallel sentence pairs were mined. We then introduce the NMT models we trained, followed by the evaluation results and conclusions.

対訳コーパスは、特に翻訳に関しては、自然言語処理(NLP)の研究に不可欠です。しかし、そのような研究は、質の高いコーパスの欠如に苦しむことがあります。NLP研究者の生活をより楽にするために、日英対訳コーパスを共有します。コーパスの品質を評価するため、NMTモデルをコーパスで学習し、いくつかのデータセットで評価しました。モデルは、非常に優れたBLEUスコアに達し、さまざまな分野や情報源からのテキストを含む適切な翻訳を提供できます。

また、本稿を執筆することで、効率的かつ経済的な対訳コーパス構築の方法論を共有したい。従来、対訳コーパスの収集には相当な言語資源(コーパス、辞書など)が必要であり、予算が限られている人には難しい。いくつかの回避策を見つけて、プロセス全体をより実用的なものにすることができました。本記事では、まず、コーパスの収集方法や、データの出所、並列文章ペアの採掘方法などについて解説します。次に、訓練されたNMTモデルを紹介し、評価結果と結論を述べる。

CONTENTS

・Corpus Collecting
　・Related works
　・Crawling & Preprocessing
　・Alignment
　・Filtering
・Training NMT
・Evaluation
・Download & Source Code
・Acknowledgements

Corpus Collecting

Related works

One of the best-known projects for building a parallel corpus is ParaCrawl, which aims to mine sentence pairs from the web for European languages. ParaCrawl is regarded as one of the highest-quality large parallel corpora. To achieve that, the ParaCrawl team developed a set of open-source tools, including Extractor for processing the data in Common Crawl, Bitextor for crawling and aligning data, and Bicleaner for filtering bilingual text pairs. Using the tools developed by ParaCrawl, NTT created the JParaCrawl corpus as the Japanese version of ParaCrawl. Our project was inspired by these two projects.

コーパス集め

関連作品

対訳コーパス構築で最もよく知られているプロジェクトのひとつが、ヨーロッパの言語向けにウェブから文章ペアを掘り起こすParaCrawlです。ParaCrawlは証明され、最も高品質の大型対訳コーパスの一つです。これを実現するために、ParaCrawlチームはコモンクロールでデータを処理するためのExtractor、クロールと整列用のBitextor、バイリンガルテキストペアをフィルタリングするためのBicleanerなどのオープンソースツールのセットを開発しました。JParaCrawlは、ParaCrawlが開発したツールを用いて、NTTが日本版のParaCrawlとして作成したコーパスです。この2つのプロジェクトから着想を得ました。

Crawling & Preprocessing

Our parallel corpus is built from web-crawled data. Before crawling, we first have to decide on the candidate domains. Common Crawl, a large web archive database, is a good place to start selecting them. An ideal source domain for our purpose should include parallel webpages with the same content in two languages. To simplify the selection of candidate domains, however, we only require a desirable ratio between Japanese and English at this step. With the help of Extractor, we extracted Japanese and English text from the Common Crawl database and calculated the ratio of Japanese bytes to English bytes for each domain. We finally selected the 50,000 domains whose Japanese-to-English byte ratio was closest to 1.22.
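The domain selection described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `select_domains`, the input layout (a mapping from domain to Japanese and English byte counts), and the choice to skip domains that lack one of the two languages are all assumptions made for the example.

```python
TARGET_RATIO = 1.22  # Japanese-to-English byte ratio stated in the article


def select_domains(byte_counts, top_k=50000, target=TARGET_RATIO):
    """Return up to top_k domains whose ja/en byte ratio is closest to target.

    byte_counts maps a domain name to a (ja_bytes, en_bytes) tuple.
    This layout is hypothetical and chosen only for illustration.
    """
    scored = []
    for domain, (ja_bytes, en_bytes) in byte_counts.items():
        # Assumption: a domain with no text in one of the languages
        # cannot contain parallel pages, so it is skipped.
        if ja_bytes == 0 or en_bytes == 0:
            continue
        ratio = ja_bytes / en_bytes
        scored.append((abs(ratio - target), domain))
    scored.sort()  # smallest distance from the target ratio first
    return [domain for _, domain in scored[:top_k]]
```

For example, with counts for `a.example` (1220, 1000), `b.example` (5000, 1000) and `c.example` (1000, 1000), asking for the top 2 returns `a.example` first and `c.example` second.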

The main limitation of using Common Crawl as the source data, however, is that it usually does not hold a complete copy of a website. This leads to two problems. One is that the language statistics we collected in the previous step may deviate from the actual situation. The other is that, from a potentially ideal domain, we want to obtain as much data as possible, in other words, the entire website. This means that instead of using the incomplete copy in Common Crawl, it is better to crawl the websites again to get more data.

From this step on, we used another tool called Bitextor. It integrates crawling, alignment, filtering and other functions, together with the tools that implement them. By modifying the configuration files, we can control the pipeline and select the tools. Detailed instructions can be found on its GitHub homepage.

For crawling, we chose Creepy from among the crawlers supported by Bitextor. Creepy is very straightforward to use, and Bitextor has some Creepy-specific variables that help us control the crawling process. Specifically, we set crawlTimeLimit to 24 hours, crawlSizeLimit to 1GB and crawlTLD to False. This kept the crawl from consuming too many resources, and we obtained around 1TB of gzip-compressed data.

The crawled data have to be preprocessed before further use. This includes extracting plain text, splitting sentences and tokenizing. To extract text better and to handle Japanese characters and punctuation, we modified and replaced the source files bitextor-warc2preprocess.py and split-sentences.perl. For tokenization, we used the original tokenizer.perl for English and the MeCab tokenizer with the NEologd dictionary for Japanese. After preprocessing, we had about 29GB of English plain text and 20GB of Japanese plain text.
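To give a feel for why Japanese needs its own sentence splitting, here is a small sketch that splits on Japanese sentence-final punctuation. It is a hypothetical stand-in written for this illustration, not the modified split-sentences.perl, and the set of terminators it uses is an assumption.

```python
import re

# Assumed sentence terminators: full-width and ASCII marks commonly
# found at the end of Japanese sentences.
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]*")


def split_ja_sentences(text):
    """Split Japanese text into sentences, keeping the final punctuation."""
    pieces = _SENTENCE.findall(text)
    # Drop surrounding whitespace and any empty fragments.
    return [piece.strip() for piece in pieces if piece.strip()]
```

Unlike English, Japanese has no space after the sentence-final mark, so a splitter that looks for a period followed by whitespace would leave most Japanese paragraphs as a single segment.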

クローリングと前処理

弊社の対訳コーパスは、ウェブクロールデータに基づいて構築されています。クロールを始めるには、まずは候補ドメインを決定する必要があります。大規模なWebアーカイブデータベースとしてのCommon Crawlという大規模なWebアーカイブデータベースは、候補ドメインを選択するのに最適な場所です。理想のソースドメインは、2つの言語で同じ内容のパラレルウェブページを含むべきです。ただし、候補ドメインの選択を簡略化するため、この段階では、日本語と英語の望ましい言語比率のみをリクエストします。Extractorの助けを借りて、Common Crawlデータベースから日本語と英語のテキストを抽出し、各ドメインの言語バイト数比を計算することができました。最終的に、日本語と英語のバイト数の割合が1.22と最も近い上位50,000ドメインが選ばれました。

しかし、Common Crawlをソースデータとして使用する主な制限は、通常、ウェブサイトの完全なコピーを持っていないことです。これは2つの問題につながります。1つは、前のステップで収集した言語統計には、実際の状況からバイアスが生じる可能性があるということです。もう一つの問題は、潜在的に理想的なドメインから、可能な限り多くのデータ、すなわちウェブサイト全体を取得したいということです。つまり、Common Crawlで不完全なコピーを使用する代わりに、より多くのデータのためにウェブサイトを再度クロールすることをお勧めします。

このステップから、Bitextorという別のツールを使い始めました。クロール、アライメント、フィルタリングなどの機能と、それらのツールを統合します。設定ファイルを変更することで、パイプラインを制御し、ツールを選択することができます。詳細な手順はGitHubホームページにあります。

クローリングに関しては、Bitextorがサポートするクローラの中からCreepyを選びました。Creepyは非常に使いやすく、Bitextor にはCreepy特有の変数があり、クローリングプロセスを制御できます。具体的には、「crawlTimeLimit」を24時間、「crawlSizeLimit」を1GB、「crawlTLD」をFalseに設定します。これにより、クロールによる資源の取り込みを抑制し、gzip圧縮で1TB前後のデータを取得しました。

クロールされたデータは、後で使用するために前処理する必要があります。これには、プレーンテキストの抽出、文章の分割、トークナイゼーションが含まれます。テキストを抽出して日本語の文字や句読点に合うように、ソースコード「bitextor-warc2preprocess.py」と「split-sentences.perl」を変更して置き換えました。トークナイズには、オリジナルソースコードの「tokenizer.perl」を英語に使い、NEOlogd辞書付きのMeCabトークナイザーを日本語に使いました。前処理後、約29GBの英語プレーンテキストと20GBの日本語プレーンテキストを手に入れました。

Alignment

Once the crawled data is ready, we can start document alignment and segment (sentence) alignment to extract parallel sentence pairs. Bitextor supports two methods for both alignments: using a dictionary, or plugging an external machine translation (MT) system into the pipeline. The big obstacle we faced was that we had neither a dictionary nor an MT system available. Collecting a dictionary is clearly the more practical option here, since training a reliable MT system is, after all, our ultimate goal. We set out to crawl an English-Japanese dictionary from several dictionary websites and ended up collecting 82,711 entries. It is important to crawl the dictionary from multiple sources to balance the language style, because we want our final corpus to contain a little bit of everything, both academic and casual text.

Fortunately, the detour ends here. With the collected dictionary, alignment is straightforward if you follow the Bitextor instructions. Sentence alignment produced about 329 million sentence pairs, but many of them are not correct pairs and need to be filtered out.
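As a rough intuition for how a bilingual dictionary can support alignment, the toy function below scores a candidate sentence pair by how many English words have a dictionary translation that appears on the Japanese side. This is only an illustration under simplifying assumptions: it is not the algorithm Bitextor uses, and the function name and the dictionary layout (English word mapped to a set of Japanese words) are invented for the example.

```python
def dictionary_overlap(en_tokens, ja_tokens, en_to_ja):
    """Fraction of dictionary-covered English tokens translated in ja_tokens.

    en_to_ja maps a lowercase English word to a set of Japanese words.
    Returns 0.0 when no English token is found in the dictionary.
    """
    ja_vocab = set(ja_tokens)
    known = 0    # English tokens that have a dictionary entry
    matched = 0  # of those, tokens with a translation on the Japanese side
    for token in en_tokens:
        translations = en_to_ja.get(token.lower())
        if not translations:
            continue
        known += 1
        if ja_vocab & translations:
            matched += 1
    return matched / known if known else 0.0
```

A pair such as "The cat drinks water" and 猫 は 水 を 飲む scores high with even a tiny dictionary, while an unrelated pair scores zero. A score like this also shows the weakness discussed in the filtering section: with a dictionary alone, some pairing always comes out as "most likely", even when no true counterpart exists.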

アライメント

クロールされたデータの準備ができたら、ドキュメントとセグメント(センテンス)のアライメントを開始して、平行な文のペアを抽出します。Bitextorは、辞書を使用するか、外部機械翻訳(MT)をシステムに導入するかの2つの方法をサポートしています。私たちが直面していた大きな課題は、辞書もMTもないということでした。辞書の収集は、明らかにここではより実用的なオプションです,なぜなら、信頼できるMTをトレーニングすることが究極の目的だからです。日英辞書をいくつかの辞書サイトからクロールし、最終的に82,711のエントリを集めることにしました。言語スタイルのバランスを取るために、複数のソースを選択して辞書をクロールすることが重要です。

幸いなことに、迂回はここで終わります。収集された辞書を使用すると、Bitextorの指示に従って簡単に位置合わせを行うことができます。アライメントが完了すると約3億2900万個の文章ペアを収集しましたが、それらの多くは正しいペアではなく、除外する必要があります。

Filtering

In the Bitextor pipeline, Bicleaner is used to filter the sentence pairs and output the final parallel corpus. It scores each pair and eliminates those whose scores fall below a threshold we set. However, training a Bicleaner model takes some time and effort. A very detailed explanation is given in the Bicleaner training instructions, according to which we still need some extra data for training. In general, two parallel corpora are needed: a big corpus from which to extract a probabilistic dictionary and word frequency information, and a small but high-quality corpus to serve as the training corpus. Note that the dictionary used in the previous alignment step cannot be used here, because it does not contain the probability and word frequency information required by the training process.

We crawled the big corpus from a number of dictionary websites that provide bilingual example sentences. It contains more than 1 million sentence pairs, as suggested in the training instructions. For the small but clean training corpus, we selected about 600K sentence pairs from the Reijiro corpus. We also tested several types of classifier used in Bicleaner, and finally adopted the "random forest" classifier with a threshold of 0.5 as the best fit for our needs. Bicleaner reduced our corpus from 329 million sentence pairs to 23 million.
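The thresholding step itself is simple once every pair carries a score. The sketch below assumes the scores are already available as `(english, japanese, score)` tuples; that layout and the function name are assumptions for illustration, and the scoring done by the trained classifier is not reproduced here.

```python
def filter_by_score(scored_pairs, threshold=0.5):
    """Keep sentence pairs whose score is not lower than the threshold.

    scored_pairs is an iterable of (english, japanese, score) tuples.
    The default threshold of 0.5 is the value reported in the article.
    """
    return [(en, ja) for en, ja, score in scored_pairs if score >= threshold]
```

Raising the threshold trades corpus size for cleanliness, which is why the value is worth checking against a sample of the output before committing to it.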

Browsing the resulting corpus, we found that it could still be cleaned further. Some of the wrong sentence pairs are easy to spot because their source URLs obviously do not match, which is caused by mistakes in document alignment. When a dictionary is used instead of an external MT system, document alignment forcibly pairs each URL with its most likely counterpart, even if the two pages contain totally different content. To deal with this problem, we appended a strict rule-based filter to the end of the pipeline to identify correct URL pairs. The rules are:

1. the URL pair must contain at least one language identifier such as "ja", "en", "=j", etc.;
2. the numbers in the URLs, if any, are usually a date or post ID and must be identical across the URL pair.
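The two rules above can be sketched as follows. This is one possible reading, not the authors' filter: it treats "ja" and "en" as whole URL tokens (to avoid matching words such as "content"), accepts "=j" as a substring, and requires an identifier in at least one URL of the pair. The identifier list is limited to the three examples named in the article, whereas the real list is longer.

```python
import re

LANG_TOKENS = {"ja", "en"}   # matched as whole URL tokens (assumption)
LANG_SUBSTRINGS = ("=j",)    # matched anywhere in the URL (assumption)


def has_language_identifier(url):
    """Rule 1 helper: does the URL contain a known language identifier?"""
    lowered = url.lower()
    tokens = re.split(r"[^a-z0-9]+", lowered)
    if any(token in LANG_TOKENS for token in tokens):
        return True
    return any(marker in lowered for marker in LANG_SUBSTRINGS)


def keep_url_pair(url_a, url_b):
    """Return True if the URL pair passes both rules."""
    # Rule 1: at least one language identifier in the pair.
    if not (has_language_identifier(url_a) or has_language_identifier(url_b)):
        return False
    # Rule 2: digit sequences (dates, post IDs) must be identical.
    return re.findall(r"\d+", url_a) == re.findall(r"\d+", url_b)
```

With these rules, `/ja/news/20210101/post.html` and `/en/news/20210101/post.html` on the same site are kept, while a pair whose dates differ, or a pair with no language marker at all, is dropped.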

After this cleaning, our final parallel corpus was reduced to 14 million sentence pairs. On second thought, if we run the alignment again in the future, applying this rule-based filter right after document alignment would save us some time in the later steps.

フィルタリング

Bitextorパイプラインでは、Bicleanerを使用して文章ペアをフィルタリングし、最終的な対訳コーパスを出力します。何をするかは、各ペアをスコア化し、設定したしきい値よりも低いスコアを排除することです。しかし、Bicleanerモデルのトレーニングには時間と労力がかかります。Bicleanerの訓練の非常に詳細な説明は、ここにある、これに従って、トレーニングのためにはさらにいくつかのデータが必要です。一般的に、2つの対訳コーパス、確率的辞書と単語頻度情報を抽出する大きなコーパス、および訓練コーパスとして小さくて高品質なコーパスが必要です。前のアライメントステップで使用する辞書は学習過程に必要な確率や単語頻度情報が含まれていないためここでは使えません。

大量の辞書ウェブサイトから、バイリンガルの例文をクロールしてみました。訓練の指示に示されているように、100万以上の文のペアが含まれています。また、小型でクリーンな学習コーパスは、Reijiroコーパスから約600Kの文章を選びました。また、Bicleanerで使用されている分類器をいくつか試し、最終的にはニーズに合わせて「random forest」分類器と0.5を閾値として採用することを決めました。Bicleanerを用いることで、コーパスは3億2,900万文対から2,300万対へと減少しました。

私達が得たコーパスを閲覧することによって、私達は、コーパスをさらにきれいにする可能性があることに気がつきました。ソースURLのペアが明らかに間違っているため、間違った文のペアのいくつかは見分けがつきやすいです。これは、ドキュメントのアラインメントの間違いが原因です。ドキュメントのアラインメントに外部MTではなく辞書を使用すると、たとえ完全に異なる内容が含まれていても、あるURLを他の最も可能なURLに強制的に整列させることができます。この問題に対処するために、パイプラインの最後に厳密なルールベースのフィルターを追加し、正しいURLペアを特定しました。

1. URLペアには、「ja」、「en」、「=j」など、少なくとも1つの言語識別子を含める必要があります。
2. URL内の番号は、通常、日付または投稿IDであり、URLのペアで同じであるように求められます。

最終的な対訳コーパスのサイズは、クリーニング後に1400万文ペアに縮小しました。将来、アラインメントを再び行う場合、ドキュメントのアラインメント直後にこのルールベースのフィルタを使用すると、後のステップで時間を節約できます。

Training NMT

To evaluate and compare the quality of the parallel corpora, we trained several sets of NMT models. The first set is trained on Laboro-ParaCorpus. To explore how much performance is affected by adding an extra corpus, especially a small one, we tested Laboro-ParaCorpus+, a combination of Laboro-ParaCorpus and an NHK daily conversation corpus. The NHK corpus was also crawled from online resources and contains only around 60K sentence pairs. The third set is trained on the combination of Laboro-ParaCorpus and the NTT-JParaCrawl corpus. Each set includes 4 models:

1. base model, from English to Japanese
2. base model, from Japanese to English
3. big model, from English to Japanese
4. big model, from Japanese to English
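Taken together, the three training sets and the four models per set give twelve models. A small sketch that enumerates them, using illustrative labels rather than the names of any released files:

```python
from itertools import product

# Labels are illustrative; they mirror the three training sets in the text.
TRAINING_SETS = [
    "Laboro-ParaCorpus",
    "Laboro-ParaCorpus+",
    "Laboro-ParaCorpus + NTT-JParaCrawl",
]
SIZES = ["base", "big"]
DIRECTIONS = ["en-ja", "ja-en"]


def model_grid():
    """List every (training set, model size, direction) combination."""
    return list(product(TRAINING_SETS, SIZES, DIRECTIONS))
```

Each training set contributes its four models in the same order as the list above: base en-ja, base ja-en, big en-ja, big ja-en.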

All the pre-trained models are later evaluated on 7 datasets. Four of those, namely ASPEC, JESC, KFTT and IWSLT, are also used to fine-tune each model for an extra 2000 steps. We list and briefly introduce the datasets as follows.

・ASPEC, Asian Scientific Paper Excerpt Corpus
・JESC, Japanese-English Subtitle Corpus containing casual language, colloquialisms, expository writing, and narrative discourse
・KFTT, Kyoto Free Translation Task that focuses on Wikipedia articles related to Kyoto
・IWSLT 2017 TED.tst2015 used in IWSLT 2017 Evaluation Campaign, including TED talks scripts in both languages
・Duolingo STAPLE for the 2020 Duolingo Shared Task on Simultaneous Translation And Paraphrase for Language Education
・Tatoeba corpus, a large collection of multilingual sentences and translations that keeps being updated by voluntary contributors; release v20190709 is used in our experiment
・BSD, Business Scene Dialogue corpus containing Japanese-English business conversations

トレーニングNMT

対訳コーパスの品質を評価し比較するために、いくつかのNMTモデルのセットを訓練しました。モデルの最初のセットはLaboro-ParaCorpusで訓練されます。追加コーパスを追加することによってパフォーマンスがどの程度影響されるかを調べるには、特に、小さなコーパスを追加する場合、私たちは、Laboro-ParaCorpusとHNK日常会話コーパスの組み合わせであるLabolo-ParaCorpus+をテストしました。NHKコーパスもオンラインリソースからクロールされ、約60Kの文章ペアしか収録されていません。さらに、第3セットは、Laboro-ParaCorpusとNTT-JParaCrawlコーパスの組み合わせで訓練されます。各セットに4モデル、

1. ベースモデル(英語から日本語)
2. ベースモデル(日本語→英語)
3. 大きなモデル(英語から日本語)
4. 大きなモデル(日本語から英語)

事前学習済みのすべてのモデルは、その後7つのデータセットで評価されます。ASPEC、JESC、KFTT、IWSLTの4つは、さらに2000ステップ追加で各モデルのファインチューニングに使用されます。以下にリストアップして簡単に紹介します。

・ASPEC、アジア学術論文抜粋
・JESC、日本語・英語字幕コーパス、口語・解説、ナラティブ・ディスコース
・KFTT、京都に関するWikipedia記事を中心にした京都フリー翻訳タスク
・IWSLT 2017 TED.tst2015、両言語のTEDトークスクリプトを含むIWSLT 2017評価キャンペーンで使用
・Duolinguo STAPLE、2020 Duolingo Shared Task on Simultaneous Translation and Paraphrase for Language Education
・Tatoebaコーパス、自発的な貢献者によって更新され続ける多言語文章と翻訳の膨大なコレクション; release v20190709が我々の実験で使用されています
・BSD、日英ビジネス会話を含むビジネスシーン対話コーパス

Evaluation

The NMT models are evaluated by BLEU score on the test datasets. In addition to the three sets of models mentioned above, we used models trained on NTT's JParaCrawl as the baseline and the results from Google Cloud Translate as a reference.
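For readers unfamiliar with BLEU, the following is a simplified, single-reference corpus-level version written in plain Python: clipped n-gram precision up to 4-grams, a geometric mean, and a brevity penalty. It is a teaching sketch only. It omits smoothing and tokenization details, it is not the scoring tool used for the results in this article, and its numbers are not comparable with them.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU (0 to 100) with one reference per sentence.

    hypotheses and references are parallel lists of token lists.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngram_counts(hyp, n)
            ref_ngrams = ngram_counts(ref, n)
            # Counter intersection clips each n-gram at its reference count.
            matches[n - 1] += sum((hyp_ngrams & ref_ngrams).values())
            totals[n - 1] += sum(hyp_ngrams.values())
    if min(matches) == 0:
        return 0.0  # no smoothing in this sketch
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output shorter than the reference.
    penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * penalty * math.exp(log_precision)
```

An output identical to its reference scores 100, an output sharing no words scores 0, and partial overlap lands in between. Because the score depends on n-gram overlap with one particular reference, it is sensitive to the style of the test set, which is relevant when comparing models across datasets as different as subtitles and scientific abstracts.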

In the table below, we show a simplified comparison based only on average BLEU scores, so that it is easy to understand. The pre-trained (PT) models are represented by their average score over the 7 datasets, and the fine-tuned (FT) models by their average over the 4 datasets used for fine-tuning. For the detailed results of each model on each dataset, please refer to our GitHub document.

Big models consistently score higher than base models. Among the pre-trained models, Google Cloud Translate scores highest; apart from it, the models trained on the combination of two corpora perform best. Among models trained on a single corpus, those trained on NTT-JParaCrawl perform slightly better overall than those trained on Laboro-ParaCorpus, but adding a small, high-quality corpus to our corpus raised performance to the same level as NTT's models. Which model is better in a specific case depends on the type and content of the evaluation dataset. The results show that Laboro-ParaCorpus+ is comparable in quality to the NTT-JParaCrawl corpus. Even with limited resources, it is still possible to create a decent parallel corpus of your own.

In this article, we set out to share a self-built parallel corpus and the NMT models pre-trained on it. We also discussed a methodology for creating a parallel corpus when only limited linguistic resources are available. This experience is valuable for our future work, and we hope that what we have shared is helpful for you too.

評価

NMTモデルは、テストデータセット上のBLEUスコアによって評価されます。上記の3つのモデル以外は、NTTのJParaCrawlで学習したモデルをベースラインとして使用し、Google Cloud Translateの結果を参考として使用しました。

下表では、平均してBLEUスコアのみを基準に簡略化した比較を示しており、わかりやすいようにしています。事前学習(PT)モデルは7つのデータセットの平均スコアで、4つのデータセットで微調整(FT)モデルで表されます。各データセットにおける各モデルの詳細結果については、Githubドキュメントを参照してください。

大きなモデルがベースモデルよりも高いスコアを常に得ることは明らかです。Google Cloud Translateが他の事前訓練済みのモデルを超える場合を除き、2コーパスの組み合わせで訓練されたモデルは最高のパフォーマンスを発揮します。1つのコーパスで学習したモデルは、総じてLaboro-ParaCorpusで学習したモデルよりもNTT-JParaCrawlで学習したモデルの方が性能はやや向上しますが、コーパスに小型で高品質なコーパスを追加することにより、NTTのモデルと同等の性能が向上しました。具体的には、どのモデルがより優れているかは、評価データセットの種類と内容によって異なります。その結果、Laboro-ParaCorpus+はNTT-JParaCrawlコーパスと同等の品質であることが分かりました。限られたリソースで、独自の適切な対訳コーパスを作成することは可能です。

本記事では、自前の対訳コーパスと事前学習済みのNMTモデルを共有することにした。また、限られた言語資源しか得られない対訳コーパスの作成方法についても議論しました。この経験は将来の仕事に役立ち、私たちが共有したものがあなたにとっても役立つことを願っています。

Download & Source Code

Please refer to our GitHub Homepage.

ダウンロード&ソースコード

詳しくはGitHubのホームページをご参照ください。

Acknowledgements

We sincerely thank the ParaCrawl project for developing a great set of software. This project would not have been possible without Extractor, Bitextor, and Bicleaner. Special thanks go to the NTT JParaCrawl project for its methodology for adapting ParaCrawl to Japanese.

謝辞

たくさんのソフトウェアを開発してくださったParaCrawlプロジェクトに心より感謝申し上げます。このプロジェクトは、Extractor、Bitextor、Bicleanerがなければ実現できなかった。NTT JParaCrawlプロジェクトによるParaCrawlの日本語適応方法に特に感謝します。