Kig: CIG1 and CIG2 child language acquisition corpora

CIG1 + CIG2

The CIG1 and CIG2 corpora focus on child language acquisition in Welsh. They were assembled by Bob Morris Jones and colleagues from the University of Wales Aberystwyth and the University of Wales Bangor. The Economic and Social Research Council funded the creation of the corpora.

The search boxes above allow you to search for a word across all files in the CIG1 and CIG2 corpora. When you enter a word, 20 utterances in the corpus containing that word will be shown. For readability, most of the transcription marking is removed, but it can be restored by unticking "Do not show full marking".

You can search for words used by a child or for words used by a non-child. A "non-child" is defined here as any speaker who is not identified as a child, target child, or playmate.
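The child/non-child split described above can be sketched in a few lines of Python. This is only an illustration, not the code behind this page: the role labels in `CHILD_ROLES`, the `(role, text)` tuple layout, the whitespace tokenisation and the sample utterances are all assumptions made for the example.

```python
from typing import Iterable, List, Tuple

# Assumed role labels; the corpus files may spell or code these differently.
CHILD_ROLES = {"child", "target_child", "playmate"}
MAX_RESULTS = 20  # the page shows 20 matching utterances

Utterance = Tuple[str, str]  # (speaker role, utterance text), a hypothetical layout


def is_child(role: str) -> bool:
    """True for child, target child or playmate; every other speaker is non-child."""
    return role.strip().lower() in CHILD_ROLES


def search(utterances: Iterable[Utterance], word: str, child: bool,
           limit: int = MAX_RESULTS) -> List[Utterance]:
    """Return up to `limit` utterances by the chosen speaker group containing `word`."""
    target = word.lower()
    hits: List[Utterance] = []
    for role, text in utterances:
        if is_child(role) != child:
            continue  # wrong speaker group
        if target in text.lower().split():  # naive whitespace tokenisation
            hits.append((role, text))
            if len(hits) == limit:
                break
    return hits


# Made-up sample utterances, for illustration only (not taken from the corpora).
SAMPLE = [
    ("Target_Child", "dw i isio ci"),
    ("Mother", "lle mae 'r ci"),
    ("Playmate", "ci mawr"),
    ("Investigator", "be ydi hwn"),
]

child_hits = search(SAMPLE, "ci", child=True)
non_child_hits = search(SAMPLE, "ci", child=False)
```

Treating "non-child" as simply the complement of the three child roles mirrors the definition given above, so any speaker label not in that set falls into the non-child group.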

CIG1, created in 1996, consists of 84 hours of transcribed recordings from children aged 18-30 months, 4 from North Wales (Alaw, Dewi, Elin and Rhys) and 3 from Mid Wales (Bethan, Melisa and Rhian).

CIG2 consists of 120 hours of transcribed recordings from 469 children from across Wales aged 3-7. The recordings were collected in 1974-7, and transcribed in 1999-2000.

Other key parameters of the corpora are set out in the following table. As is to be expected, non-child material accounts for a much larger proportion of the tokens and types in CIG1 than in CIG2.

	Files	Total utterances	Total tokens	Total types	Non-child utterances	Non-child utterances %	Non-child tokens	Non-child tokens %	Non-child types	Non-child types %
CIG1	168	78766	304846	5498	25286	32%	222390	73%	4869	89%
CIG2	239	151422	566140	12206	40237	27%	103755	18%	4043	33%
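The percentage columns can be recomputed from the counts in the table. The following is a minimal sketch of that arithmetic; the dictionary layout and the `share` helper are names introduced for this example, and rounding to the nearest whole percent is assumed.

```python
# Counts copied from the table: (non-child count, total count) per measure.
COUNTS = {
    "CIG1": {
        "utterances": (25286, 78766),
        "tokens": (222390, 304846),
        "types": (4869, 5498),
    },
    "CIG2": {
        "utterances": (40237, 151422),
        "tokens": (103755, 566140),
        "types": (4043, 12206),
    },
}


def share(part: int, total: int) -> int:
    """Non-child share of the total, as a whole-number percentage."""
    return round(100 * part / total)


# Recompute every percentage column of the table.
SHARES = {
    corpus: {measure: share(part, total) for measure, (part, total) in measures.items()}
    for corpus, measures in COUNTS.items()
}

for corpus, measures in SHARES.items():
    for measure, pct in measures.items():
        print(f"{corpus} non-child {measure}: {pct}%")
```

The recomputed values agree with the table: the non-child share of utterances is similar in the two corpora, while the shares of tokens and types differ sharply.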

Detailed information about CIG1 and CIG2 is available at the Child Language Databases website, and the transcriptions are available from the CHILDES website. For ease of access, however, I have taken the liberty of replicating everything here: some of the links on the CLD website (e.g. those to the lexicon files) are already dead, and it would be a pity if the information about the corpora were lost. The following links provide information on: documentation for CIG1, documentation for CIG2, the transcription conventions used, and the structure of the lexicon files.

Each of the corpus files can be examined in detail (transcription, comments, etc.) from the CIG1 listing or the CIG2 listing pages.

The files on which the search is based can be downloaded below.

Download CIG1 Download CIG2
