The CIG1 and CIG2 corpora focus on child language acquisition in Welsh. They were assembled by Bob Morris Jones and colleagues from the University of Wales Aberystwyth and the University of Wales Bangor. The Economic and Social Research Council funded the creation of the corpora.
The search boxes above allow you to search for a word across all files in the CIG1 and CIG2 corpora. When you enter a word, 20 utterances in the corpus containing that word will be shown. For readability, most of the transcription marking is removed, but it can be restored by unticking "Do not show full marking".
You can search for words used by a child, or for words used by a non-child, "non-child" being defined here as any speaker who is not identified as a child, target child, or playmate.
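The child/non-child split described above can be sketched in a few lines of Python. This is only an illustration, not the code behind the search boxes: the flat list of (role, text) pairs, the function name `search_utterances`, the whole-word matching, and the sample utterances are all assumptions made for the example.

```python
import re

# Roles counted as "child", following the definition in the text above.
CHILD_ROLES = {"child", "target child", "playmate"}


def search_utterances(utterances, word, speaker_group="child", limit=20):
    """Return up to `limit` utterances that contain `word` as a whole token.

    `utterances` is an iterable of (role, text) pairs. This flat layout is
    an assumption for illustration, not the format of the corpus files.
    """
    pattern = re.compile(r"(?<!\w)" + re.escape(word) + r"(?!\w)", re.IGNORECASE)
    hits = []
    for role, text in utterances:
        is_child = role.lower() in CHILD_ROLES
        # Skip speakers outside the requested group.
        if (speaker_group == "child") != is_child:
            continue
        if pattern.search(text):
            hits.append((role, text))
            if len(hits) == limit:
                break
    return hits


# Invented sample utterances, not taken from CIG1 or CIG2.
sample = [
    ("Target child", "dw i isio llefrith"),
    ("Mother", "wyt ti isio llefrith ?"),
    ("Playmate", "llefrith !"),
    ("Investigator", "be ydy hwn ?"),
]

child_hits = search_utterances(sample, "llefrith", speaker_group="child")
non_child_hits = search_utterances(sample, "llefrith", speaker_group="non-child")
```

Here `child_hits` holds the target child's and the playmate's utterances, while `non_child_hits` holds only the mother's, since every role outside the three child roles falls into the non-child group.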
CIG1, created in 1996, consists of 84 hours of transcribed recordings from seven children aged 18-30 months: four from North Wales (Alaw, Dewi, Elin and Rhys) and three from Mid Wales (Bethan, Melisa and Rhian).
CIG2 consists of 120 hours of transcribed recordings from 469 children from across Wales aged 3-7. The recordings were collected in 1974-7, and transcribed in 1999-2000.
Other key parameters of the corpora are set out in the following table. As is to be expected, non-child speech accounts for a much larger share of the tokens and types in CIG1 than in CIG2.
Corpus | Files | Total utterances | Total tokens | Total types | Non-child utterances | Non-child utterances % | Non-child tokens | Non-child tokens % | Non-child types | Non-child types %
---|---|---|---|---|---|---|---|---|---|---
CIG1 | 168 | 78766 | 304846 | 5498 | 25286 | 32% | 222390 | 73% | 4869 | 89%
CIG2 | 239 | 151422 | 566140 | 12206 | 40237 | 27% | 103755 | 18% | 4043 | 33%
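
The percentage columns can be recomputed from the raw counts as a quick consistency check. The sketch below assumes each row reads as corpus, files, total utterances, total tokens, total types, and then the non-child counts with their percentages; the names `counts` and `non_child_share` are chosen for this example only.

```python
# Counts copied from the table: (total, non-child) for each measure.
counts = {
    "CIG1": {
        "utterances": (78766, 25286),
        "tokens": (304846, 222390),
        "types": (5498, 4869),
    },
    "CIG2": {
        "utterances": (151422, 40237),
        "tokens": (566140, 103755),
        "types": (12206, 4043),
    },
}


def non_child_share(total, non_child):
    """Non-child share of a measure, as a whole-number percentage."""
    return round(100 * non_child / total)


shares = {
    corpus: {measure: non_child_share(*pair) for measure, pair in measures.items()}
    for corpus, measures in counts.items()
}

for corpus, measures in shares.items():
    print(corpus, measures)
```

Under that reading of the columns, the recomputed shares agree with the percentages given in the table.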
Detailed information about CIG1 and CIG2 is available at the Child Language Databases website, and the transcriptions are available from the CHILDES website. For ease of access, however, I have taken the liberty of replicating everything here - some of the links on the CLD website (e.g. those to the lexicon files) are already dead, and it would be a pity if the information about the corpora got lost. The following links provide information on: documentation for CIG1, documentation for CIG2, the transcription conventions used, and the structure of the lexicon files.
Each of the corpus files can be examined in detail (e.g. transcription, comments) from the CIG1 listing or the CIG2 listing pages.
The files on which the search is based can be downloaded below.