350,000 aligned Welsh and English sentences
from the Proceedings of the Third National Assembly for Wales (2007-2011)

Kynulliad3 is a substantial corpus of nearly 360,000 aligned sentences in Welsh and English (around 8.8m words in each language) drawn from the Record of Proceedings of the Third Assembly (2007-2011) of the National Assembly for Wales.

The corpus is intended for use in natural language processing research. It contains only the longer sentences from the plenary sessions of the Proceedings, without attribution or any indication of the context, the date, or which language the speaker used, so it is not a record of the activity of the Assembly.

The Record of Proceedings of the National Assembly for Wales is Crown copyright. Material from the Record is reproduced under the terms of Crown copyright policy guidance issued by HMSO and the National Assembly for Wales.

When you enter a word in the search box, 20 sentences in the corpus containing that word will be shown. Each time you press the Search button, a different set of 20 sentences will be shown.

The data is drawn from the HTML versions of the Proceedings. Markup was discarded to leave blocks of aligned text. These blocks were cleaned to remove low-value items like "I move that". Alberto Simões' wonderful Lingua::Identify was used to swap around the blocks where necessary so that Welsh was always first (it got the language wrong for fewer than 0.03% of the blocks!). The blocks were then split into individual sentences, which were also cleaned to remove duplicates and sentences of fewer than 20 characters.
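The last two steps (sentence splitting, then removing duplicates and short sentences) can be sketched roughly as follows. This is only an illustration, not the code used to build Kynulliad3: the boundary rule, the function names, the positional pairing, and the choice to drop a pair when either side is shorter than 20 characters are all assumptions, and the language-identification step done with Lingua::Identify is left out entirely.

```python
import re

MIN_CHARS = 20  # threshold mentioned in the text

# Assumed, deliberately naive boundary rule: ., ! or ? followed by
# whitespace and an ASCII capital. Accented Welsh capitals (e.g. Ŵ, Ŷ)
# would need a wider character class.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(block):
    """Split one block of text into sentences using the naive rule above."""
    return [s.strip() for s in _BOUNDARY.split(block) if s.strip()]


def align_block(welsh_block, english_block):
    """Pair sentences by position; give up (return []) if the counts differ."""
    welsh = split_sentences(welsh_block)
    english = split_sentences(english_block)
    if len(welsh) != len(english):
        return []
    return list(zip(welsh, english))


def clean_pairs(pairs):
    """Drop exact duplicate pairs and pairs where either side is too short."""
    seen = set()
    kept = []
    for welsh, english in pairs:
        if len(welsh) < MIN_CHARS or len(english) < MIN_CHARS:
            continue
        if (welsh, english) in seen:
            continue
        seen.add((welsh, english))
        kept.append((welsh, english))
    return kept


# Made-up example blocks, not taken from the corpus.
pairs = align_block(
    "Diolch yn fawr iawn i chi i gyd. Mae'r pwyllgor wedi cyhoeddi adroddiad.",
    "Thank you all very much. The committee has published a report.",
)
for welsh, english in clean_pairs(pairs):
    print(welsh, "|", english)
```

Pairing by position only works when both blocks split into the same number of sentences, which is why the sketch simply skips blocks where they do not.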

The database table contains the following fields:
id: unique identifier for the sentence
source: year, month and document number of the Proceedings document
welsh: the sentence in Welsh
english: the equivalent sentence in English
source_id: the block of the source the sentence occurred in
word_w: the number of words in the Welsh sentence
word_e: the number of words in the English sentence
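As a rough illustration of how these fields fit together with the search behaviour described earlier (20 sentences per search, a different set each time), here is a small sketch. It uses an in-memory SQLite database as a stand-in so that it runs anywhere, whereas the real corpus is distributed as a PostgreSQL dump. The table name kynulliad3, the column types, and the function sample_matches are assumptions made for the example; only the field names come from the list above.

```python
import sqlite3

# Hypothetical table name and column types; the field names are from the text.
SCHEMA = """
CREATE TABLE kynulliad3 (
    id        INTEGER PRIMARY KEY,
    source    TEXT,
    welsh     TEXT,
    english   TEXT,
    source_id INTEGER,
    word_w    INTEGER,
    word_e    INTEGER
)
"""


def sample_matches(conn, word, limit=20):
    """Return up to `limit` randomly ordered rows containing `word`.

    LIKE does a substring match, so "pwyllgor" would also match
    "pwyllgorau"; a real search may well match whole words instead.
    """
    pattern = f"%{word}%"
    query = (
        "SELECT id, welsh, english FROM kynulliad3 "
        "WHERE welsh LIKE ? OR english LIKE ? "
        "ORDER BY RANDOM() LIMIT ?"
    )
    return conn.execute(query, (pattern, pattern, limit)).fetchall()


# Made-up demo rows, not taken from the corpus; "demo" is a placeholder
# for the real source value.
demo = [
    ("Diolch yn fawr iawn i chi i gyd.", "Thank you all very much."),
    ("Mae'r pwyllgor wedi cyhoeddi adroddiad.", "The committee has published a report."),
]

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO kynulliad3 (source, welsh, english, source_id, word_w, word_e) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    [("demo", w, e, 1, len(w.split()), len(e.split())) for w, e in demo],
)

for row in sample_matches(conn, "pwyllgor"):
    print(row)
```

Ordering by a random value and taking the first 20 rows is one simple way to get a different sample on each search; on a table of this size it is not necessarily the fastest.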

Some work remains to be done. In 1.3% of cases (4,715), two sentences were not split because the first one ended in capitals or numbers, so both sentences appear in the same record. In 0.8% of cases (2,838), one sentence in one language has been translated by two sentences in the other, leaving a "hole" in the alignment. I will try to fix these if I get time, using Thomas Kellerer's excellent SQLWorkbench. This is the only SQL GUI I have seen that allows you to edit the results of a SELECT query. In the latest version you can word-wrap text fields, which makes it easy to correct the above issues. If you'd like to help with this, contact me via my main page.
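For anyone wanting to help, the two kinds of problem record could be shortlisted for manual checking with simple heuristics like the ones below. This is a sketch under assumptions, not the method used to produce the counts above: the regular expressions and function names are invented for illustration, the text does not say exactly what a "hole" looks like in the table (here it is assumed to show up as a different number of sentence endings on each side), and both checks will produce false positives.

```python
import re

# A capital letter or digit, then ., ! or ?, then whitespace and a capital:
# the kind of boundary the text says was left unsplit.
_MISSED_BOUNDARY = re.compile(r"[A-Z0-9][.!?]\s+[A-Z]")

# Sentence-ending punctuation followed by whitespace or the end of the text.
_SENTENCE_END = re.compile(r"[.!?](?=\s|$)")


def looks_unsplit(sentence):
    """True if the text seems to contain a missed sentence boundary."""
    return _MISSED_BOUNDARY.search(sentence) is not None


def count_sentences(text):
    """Count apparent sentence endings in the text."""
    return len(_SENTENCE_END.findall(text))


def looks_like_hole(welsh, english):
    """True if the two sides appear to hold different numbers of sentences."""
    return count_sentences(welsh) != count_sentences(english)


# Made-up examples, not taken from the corpus.
print(looks_unsplit("The budget was agreed in 2009. We will now move on."))
print(looks_like_hole("Un frawddeg yn unig sydd yma.",
                      "One sentence here. And a second one."))
```

Records flagged this way would still need to be checked by eye before being edited.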

I may at some point use the Autoglosser to tag each sentence, and add that to the corpus. I need to add the next batch of words to Eurfa before that can be done, though.

If using Kynulliad3 in research, the following citation can be used:

Kevin Donnelly (2013). "Kynulliad3: a corpus of 350,000 aligned Welsh and English sentences from the Third Assembly (2007-2011) of the National Assembly for Wales." http://cymraeg.org.uk/kynulliad3. (BibTeX)

Kynulliad3, including a copy of the Crown copyright policy guidance, is available for download: just click the button below. Because of the size of the corpus, the download is a PostgreSQL dump of 40MB, which decompresses to 126MB.


You can also download a frequency list of the almost 48,400 words in Kynulliad3.