Europarl parallel corpus

Publication

Authors: Koehn, P

Abstract

This paper describes the acquisition, preparation, and properties of a corpus extracted from the Proceedings of the European Parliament. This corpus is available in 11 languages, consists of over 200 million words per language, and is preprocessed for use in statistical machine translation. We describe the methods we used for crawling, document alignment, and sentence alignment. We also present a common test set for machine translation and report the results of a number of basic statistical machine translation experiments.
