
Kig

A web interface to the 300k-word CIG1 and 570k-word CIG2 corpora
Non-child:


Child:



CIG1 + CIG2

The CIG1 and CIG2 corpora focus on child language acquisition in Welsh. They were assembled by Bob Morris Jones and colleagues from the University of Wales Aberystwyth and the University of Wales Bangor. The Economic and Social Research Council funded the creation of the corpora.

The search boxes above allow you to search for a word across all files in the CIG1 and CIG2 corpora. When you enter a word, 20 utterances from the corpus containing that word are shown. For readability, most of the transcription marking is removed, but it can be restored by unticking "Do not show full marking".

You can search for words used by a child, or for words used by a non-child. A "non-child" is defined here as any speaker who is not identified as a child, target child, or playmate.
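The search behaviour described above can be sketched roughly as follows. This is only an illustration, not the code behind this site: the `Utterance` class, the role strings in `CHILD_ROLES`, the whitespace tokenisation, and the sample utterances are all assumptions made for the example. Only the child/non-child definition and the limit of 20 utterances come from the description above.

```python
from dataclasses import dataclass

# Assumed role labels. The page says only that a "non-child" is any speaker
# not identified as a child, target child, or playmate.
CHILD_ROLES = {"child", "target child", "playmate"}
MAX_RESULTS = 20  # the page shows 20 utterances per search


@dataclass
class Utterance:
    file: str          # hypothetical file identifier
    speaker_role: str  # role of the speaker, e.g. "Target_Child" or "Mother"
    text: str          # utterance text with transcription marking removed


def is_child(role: str) -> bool:
    """Return True if the role counts as a child speaker."""
    return role.strip().lower().replace("_", " ") in CHILD_ROLES


def search(utterances, word, child, limit=MAX_RESULTS):
    """Return up to `limit` utterances containing `word` as a whole token,
    spoken by child speakers (child=True) or non-child speakers (child=False)."""
    word = word.lower()
    hits = []
    for utt in utterances:
        if is_child(utt.speaker_role) != child:
            continue
        if word in utt.text.lower().split():
            hits.append(utt)
            if len(hits) == limit:
                break
    return hits


if __name__ == "__main__":
    # Invented sample utterances, not taken from the corpora.
    sample = [
        Utterance("sample1", "Target_Child", "ci mawr"),
        Utterance("sample1", "Mother", "lle mae y ci"),
        Utterance("sample1", "Playmate", "cath"),
    ]
    for utt in search(sample, "ci", child=False):
        print(utt.file, utt.speaker_role, utt.text)
```

Matching on whole tokens rather than substrings is a design choice for the sketch; the actual search may match differently.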

CIG1, created in 1996, consists of 84 hours of transcribed recordings from seven children aged 18-30 months: four from North Wales (Alaw, Dewi, Elin and Rhys) and three from Mid Wales (Bethan, Melisa and Rhian).

CIG2 consists of 120 hours of transcribed recordings from 469 children from across Wales aged 3-7. The recordings were collected in 1974-7, and transcribed in 1999-2000.

Other key parameters of the corpora are set out in the following table. As is to be expected, non-child speech accounts for a much larger share of the tokens and types in CIG1 (73% and 89%) than in CIG2 (18% and 33%).

| Corpus | Files | Total utterances | Total tokens | Total types | Non-child utterances | Non-child utterances % | Non-child tokens | Non-child tokens % | Non-child types | Non-child types % |
|---|---|---|---|---|---|---|---|---|---|---|
| CIG1 | 168 | 78766 | 304846 | 5498 | 25286 | 32% | 222390 | 73% | 4869 | 89% |
| CIG2 | 239 | 151422 | 566140 | 12206 | 40237 | 27% | 103755 | 18% | 4043 | 33% |
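The percentage columns can be checked against the raw counts. The short sketch below assumes each percentage is the non-child count divided by the corresponding total, rounded to the nearest whole number. The figures are copied from the table above; the names `CORPORA` and `non_child_share` are introduced here purely for illustration.

```python
# (total, non-child) counts per measure, copied from the table above.
CORPORA = {
    "CIG1": {
        "utterances": (78766, 25286),
        "tokens": (304846, 222390),
        "types": (5498, 4869),
    },
    "CIG2": {
        "utterances": (151422, 40237),
        "tokens": (566140, 103755),
        "types": (12206, 4043),
    },
}


def non_child_share(total: int, non_child: int) -> int:
    """Non-child material as a whole-number percentage of the total."""
    return round(100 * non_child / total)


if __name__ == "__main__":
    for corpus, measures in CORPORA.items():
        for measure, (total, non_child) in measures.items():
            print(f"{corpus} {measure}: {non_child_share(total, non_child)}%")
```

Computed this way, the results match the percentages given in the table for both corpora.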

Detailed information about CIG1 and CIG2 is available at the Child Language Databases website, and the transcriptions are available from the CHILDES website. For ease of access, however, I have taken the liberty of replicating everything here: some of the links on the CLD website (e.g. those to the lexicon files) are already dead, and it would be a pity if the information about the corpora were lost. The following links provide information on: documentation for CIG1, documentation for CIG2, the transcription conventions used, and the structure of the lexicon files.

Each of the corpus files can be examined in detail (e.g. transcription, comments, etc.) from the CIG1 listing or the CIG2 listing pages.

The files on which the search is based can be downloaded below.

Download CIG1 Download CIG2