A 4m-word Welsh Wikipedia corpus

Kwici is a 4m-word corpus drawn from the Welsh Wikipedia as it was on 30 December 2013. When you enter a word in the search box above, 20 sentences in the corpus containing that word will be shown.

If using Kwici in research, the following citation can be used:

Kevin Donnelly (2014). "Kwici: a 4m-word corpus drawn from the Welsh Wikipedia." http://cymraeg.org.uk/kwici. (BibTeX)

The final pages-and-articles dump for 2013 was downloaded from the Wikimedia dump page. The excellent WikiExtractor tool written by Giuseppe Attardi and Antonio Fuschetto was then used to extract plain text (discarding markup etc.) from the 165 MB dump, resulting in a 33 MB output file. This was tidied by removing remaining XML, blank lines, and blocks of English text.
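As a rough illustration of the tidying step, the following Python sketch strips leftover XML-style tags and drops blank lines. It is not the script actually used for Kwici: the function name `tidy_lines` and the tag pattern are assumptions made for this example, and it does not attempt to detect or remove blocks of English text.

```python
import re

# Assumption: any remaining markup looks like <...>, for example the
# <doc ...> and </doc> wrappers that WikiExtractor places around each article.
TAG_RE = re.compile(r"<[^>]+>")


def tidy_lines(lines):
    """Strip XML-style tags from each line and drop lines left empty."""
    cleaned = []
    for line in lines:
        text = TAG_RE.sub("", line).strip()
        if text:  # skip blank lines and lines that held only markup
            cleaned.append(text)
    return cleaned


if __name__ == "__main__":
    raw = [
        '<doc id="1" url="x" title="Cymru">',
        "Cymru",
        "",
        "Gwlad yw Cymru.",
        "</doc>",
    ]
    for line in tidy_lines(raw):
        print(line)
```

A plain regular expression is enough here only because the input is assumed to be nearly clean text already; it would not be a safe way to process the original dump.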

The text was then split into a total of 360,477 sentences, and these were imported into a PostgreSQL database table. The sentences were pruned by removing all items shorter than 50 characters, all items containing numbers only (e.g. timelines), and all duplicates, to give a final total of 204,789 sentences in the corpus.
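The pruning rules above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the method used for Kwici, where the pruning was carried out on the PostgreSQL table. The function name `prune_sentences` is hypothetical, and "numbers only" is interpreted here as "contains no alphabetic character", which is a guess at the original criterion.

```python
def prune_sentences(sentences, min_length=50):
    """Drop short items, items without any letters, and duplicates.

    The first occurrence of each sentence is kept and input order is preserved.
    """
    seen = set()
    kept = []
    for sentence in sentences:
        text = sentence.strip()
        if len(text) < min_length:
            continue  # shorter than the 50-character threshold
        if not any(ch.isalpha() for ch in text):
            continue  # assumption: "numbers only" means no letters at all
        if text in seen:
            continue  # exact duplicate of an earlier sentence
        seen.add(text)
        kept.append(text)
    return kept


if __name__ == "__main__":
    long_sentence = (
        "Mae Cymru yn wlad sydd yn rhan o'r Deyrnas Unedig, "
        "gyda'i phrifddinas yng Nghaerdydd."
    )
    sample = [
        "Byr.",
        "1900 1910 1920 1930 1940 1950 1960 1970 1980 1990 2000 2010",
        long_sentence,
        long_sentence,
    ]
    print(len(prune_sentences(sample)))
```

Duplicates are matched on the exact stripped string; a real pipeline might also want to normalise whitespace or case before comparing.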

The Kwici corpus, which is licensed under CC-BY-SA, can be downloaded below in CSV format.

Download Kwici