A web interface to the 300k-word CIG1 and 570k-word CIG2 corpora
Transcription conventions

(Replicated from the Child Language Databases website.)


CHILDES (Child Language Data Exchange System) offers various resources that support the study of language acquisition, language development, and language learning.

The database uses CHAT (Codes for the Human Analysis of Transcripts), the CHILDES transcription system, to provide an internationally recognised written version of the sound recordings of young children's spontaneous speech. A manual is available in PDF format on the CHILDES website. A summary of the conventions used in the database is given below.

Intonational features are not coded except for exceptional stress on individual words. The traditional sentence types of declarative, interrogative and exclamative are indicated by the standard orthographic conventions of using a full stop (period), question mark, or exclamation mark.

The format of a data file

Opening headers convey details about the speakers and the recordings:

@Participants: TRY [speaker 1's reference] Trystan [speaker 1's name] Target_child [speaker 1's role], HEL [speaker 2's reference] Heledd [speaker 2's name] Target_child [speaker 2's role], BMJ [speaker 3's reference] Bob+morris+jones [speaker 3's name] Investigator [speaker 3's role]
@Filename: c3004.cha [the name of the electronic file with CHILDES extension]

@End is placed at the end of each file.
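
As a minimal sketch, the @Participants header can be split into one record per speaker. This assumes that the bracketed glosses shown above are explanatory only and do not appear in the files, and that every entry supplies all three fields (reference, name, role). The names `Participant` and `parse_participants` are hypothetical and are not part of any CHILDES tool.

```python
from typing import NamedTuple


class Participant(NamedTuple):
    code: str  # speaker reference, e.g. "TRY"
    name: str  # speaker name, e.g. "Trystan"
    role: str  # speaker role, e.g. "Target_child"


def parse_participants(header: str) -> list[Participant]:
    """Split an @Participants header into (code, name, role) entries."""
    prefix = "@Participants:"
    if not header.startswith(prefix):
        raise ValueError("not an @Participants header")
    participants = []
    for entry in header[len(prefix):].split(","):
        fields = entry.split()
        # Assumption: each entry has exactly three whitespace-separated fields.
        if len(fields) != 3:
            raise ValueError(f"expected code, name and role, got: {entry!r}")
        participants.append(Participant(*fields))
    return participants


header = ("@Participants: TRY Trystan Target_child, "
          "HEL Heledd Target_child, BMJ Bob+morris+jones Investigator")
speakers = parse_participants(header)
```

Keeping the role alongside the reference makes it straightforward to separate child from adult speakers later on.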

The transcript lies between the opening headers and @End.

Conventions used

The following summarises the transcriptional conventions in the files. More general observations follow the summary.

Symbol | Meaning | Example
[...] | Contain codes or comments on immediately preceding data. | tractor [?]
< ... > | Enclose the words that comments/codes refer to. | <un cloc> [?]
 | Without angled brackets the comments/codes refer to one word. | un cloc [?]
. ? ! | Indicate the end of a line of data: declarative, interrogative, exclamation. |
+... | Unfinished declarative. |
+..? | Unfinished interrogative. |
,,, | Left-peripheral material, i.e. on the left periphery of core syntax. | ie,,, heddiw.
,, | Right-peripheral material, i.e. on the right periphery of core syntax. | dim heddiw,, na.
Initial capital letter | Personal names, place names, brand names. | Steve-austin
 | The names of the children and adults (except investigators), place names, and workplace names have been made anonymous by using nonsense alphabetic strings. A final '0' on the anonymous versions indicates the names of places and workplaces. | Lmno0
[!!] | Contrastive word stress. |
[!] | Strong word stress. | na [!]
["] | Quoting another speaker's words. |
[% Saesneg] | An English phrase or sentence. | welish i <big christmas tree> [% Saesneg]
[% ca:n] | Words from a song or nursery rhyme. | <dau gi bach yn mynd i 'r coed> [% ca:n]
[/] | Repetition. | fi [/] fi sy 'n mynd
[//] | Repetition with change. | fi [//] ti sy 'n mynd
[>] and [<] | Overlapping speech. Numbers can indicate successive pairings. |
[= explanation] | Indicates an explanation about the immediately preceding data. | tlacdol [= tractor]
[=? explanation] | Indicates a tentative explanation of the immediately preceding data. | [=? 'di marw]
[=! description] | Indicates how utterances are delivered. | [=! prolonged 'r']
[?] | 'Best guess' transcription. | arian [?]
xxx | Indecipherable data. The number of syllable beats is indicated thus: [% 2 sill]. | xxx [% 2 sill]
& | Unfinished word (not shortening). | &bre
: | The colon is placed after a vowel instead of the circumflex diacritic ^. | ta:n (in place of tân)
,, | Precedes question tag. | yn fanna mae 'o,, ynde?
# | Pause in mid-utterance. | rho hwnna # yn1 fanna
@sn suffix | Noises and onomatopoeic forms. | br+rr@sn
@gl suffix | Nonsense words. | nwci+nwcs@gl
@l suffix | Letter from the alphabet. | s@l
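
The utterance terminators summarised above can be recognised mechanically. The following is a rough sketch rather than a definitive classifier: the function name `utterance_type` and the label strings are hypothetical, and it assumes that the terminator is the last thing on the line.

```python
# The unfinished markers are tested first so that "+..." is not
# mistaken for an ordinary full stop, nor "+..?" for a question mark.
TERMINATORS = [
    ("+...", "unfinished declarative"),
    ("+..?", "unfinished interrogative"),
    (".", "declarative"),
    ("?", "interrogative"),
    ("!", "exclamation"),
]


def utterance_type(utterance):
    """Return a label for the utterance terminator, or None if there is none."""
    text = utterance.rstrip()
    for marker, label in TERMINATORS:
        if text.endswith(marker):
            return label
    return None
```

A line that ends in a code, such as `<un cloc> [?]`, carries no terminator and therefore yields `None`.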

Personal names, local place-names, and local places of work have been anonymised by using random nonsense strings of letters: all begin with an initial capital, and the place names have a final 0. The names of public figures, fictional characters, and more distant places have been retained. Making names anonymous loses some information about word-forms, especially about mutations - where they occur - and word-play.

The children produced many noises while playing, and some attempt has been made to transcribe these, although they are not intended to capture the phonetic details. They have the suffix @sn. Nonsense forms, in word-play for instance, have the suffix @gl. Both are declared in the 00depadd.cut file.
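
The suffixes can be separated from the forms they mark with a simple split on `@`. This is an illustrative sketch only: `special_form` and the description strings are hypothetical names, and the @l suffix (letter of the alphabet) is taken from the summary of conventions.

```python
SPECIAL_FORMS = {
    "sn": "noise or onomatopoeic form",
    "gl": "nonsense word",
    "l": "letter of the alphabet",
}


def special_form(word):
    """Return (form, description); description is None for ordinary words."""
    stem, sep, suffix = word.rpartition("@")
    if sep and suffix in SPECIAL_FORMS:
        return stem, SPECIAL_FORMS[suffix]
    return word, None
```

For example, `special_form("br+rr@sn")` separates the transcribed noise `br+rr` from its marker, while an ordinary word such as `mwy` is returned unchanged.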

Some children in the database also speak English, to varying degrees. Single English words - either by themselves or within a Welsh utterance - are not marked. But phrases or sentences of English words are enclosed in scope symbols < ... >, and are followed by the comment [% Saesneg] - 'Saesneg' being the Welsh word for 'English'.

Similarly, phrases and sentences which are from songs, nursery rhymes, and similar material are enclosed within < ... > and are followed by the comment [% ca:n] - 'ca:n' (or 'cân' to use the circumflex - see below) is the Welsh for 'song'.
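
Because both kinds of material share the same shape (a scoped phrase followed by a % comment), they can be pulled out with one regular expression. The sketch below is an assumption-laden illustration: `scoped_phrases` is a hypothetical helper, and it does not handle codes nested inside the angle brackets.

```python
import re


def scoped_phrases(utterance, comment):
    """Return the phrases in < ... > that are followed by [% comment]."""
    pattern = re.compile(r"<([^<>]*)>\s*\[%\s*" + re.escape(comment) + r"\]")
    return [match.group(1).strip() for match in pattern.finditer(utterance)]


english = scoped_phrases("welish i <big christmas tree> [% Saesneg]", "Saesneg")
song = scoped_phrases("<dau gi bach yn mynd i 'r coed> [% ca:n]", "ca:n")
```

Passing the comment text as a parameter lets the same function serve for [% Saesneg], [% ca:n], or any other scoped comment.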

Unfinished words (that is, fragments and not shortened words) are indicated by an initial &.

There are many homonyms, many of which come about through phonological processes of elision and assimilation in spontaneous speech. Digits and the apostrophe are used to distinguish different word-forms which otherwise have the same spelling. The lexicon gives the lexeme to which they belong. The apostrophe is declared in the 00depadd.cut file to cater for word-initial occurrences.

In spontaneous speech, patterns of a Welsh copula followed by a personal subject pronoun occur as a pronoun only. Such pronouns are indicated by a final apostrophe. There are instances, mainly of directive-like utterances within the context of a game, where it is not entirely clear what the pattern is. But these instances have likewise been given a final apostrophe.

Welsh orthography contains circumflexed letters: â ê î ô and also ŵ and ŷ, for which there is no ASCII provision. Circumflexed letters are not stable across applications, as is well known. Consequently, they are represented as a: e: i: o:, a convention that extends conveniently to w: and y:. This convention is mainly used where ambiguity would otherwise occur. Welsh also makes limited use of the diaeresis and the acute accent, but it has not been necessary to cater for these separately.
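
The colon convention can be reversed to recover conventional spellings. This is a minimal sketch under two assumptions: it is applied to individual lower-case words rather than to whole lines (where colons also follow speaker references and header names), and it covers only the six vowels listed above. The name `restore_circumflex` is hypothetical.

```python
import re

CIRCUMFLEX = {"a": "â", "e": "ê", "i": "î", "o": "ô", "w": "ŵ", "y": "ŷ"}


def restore_circumflex(word):
    """Replace each vowel + colon pair with the circumflexed vowel."""
    return re.sub(r"([aeiowy]):", lambda m: CIRCUMFLEX[m.group(1)], word)
```

For instance, `restore_circumflex("ta:n")` gives `tân`. Words transcribed without the colon pass through unchanged, so any circumflex that was not marked in the transcript cannot be recovered this way.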

The data files contain utterances by children and adults. The former are identified as Target_child or Child on the @Participants header line in the data files; the latter are identified as Investigator or Teacher. The utterances of the adults have been transcribed in full, but not as painstakingly as those of the children; in particular, homonyms have not all been disambiguated through transcription.

Example of a transcription

*HEL: mwy.
*HEL: mwy.
*HEL: 'na2 ni!
*HEL: 'ei [= chwerthin].
*HEL: 'anna.
*TRY: nagi.
*HEL: na.
*TRY: Heledd, na' i gal yr un melyn, 'de.
*TRY: gei di gal yr un glas.
@Comment: sw:n chwarae.
*TRY: gei di 'm+ond rhyi [: rhoi] dwy [?].
*TRY: ymm, nei di ryid, ymm +...
*HEL: heina?
*HEL: hwn.
*TRY: ia.
*TRY: &n [/] na.
*TRY: na, ryid tywod i+mewn # efo fi.
*HEL: naf.
*TRY: xxx [% 2 sill].
*TRY: <un cloc> [?]
*TRY: <xxx [% 3 sill]> [>].
*HEL: <dw i> [<] +...
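
A transcript in this layout can be read into (speaker, utterance) pairs with a few lines of code. The sketch below assumes that each utterance occupies one line beginning with `*`, a three-character upper-case speaker reference and a colon, and that header lines such as @Comment are skipped. `parse_utterances` is a hypothetical name, and the sample is a short excerpt from the lines above.

```python
import re
from collections import Counter

# Assumption: speaker references are three upper-case letters or digits.
LINE_RE = re.compile(r"^\*([A-Z0-9]{3}):\s*(.*)$")


def parse_utterances(text):
    """Return a list of (speaker, utterance) pairs, ignoring header lines."""
    utterances = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            utterances.append((match.group(1), match.group(2)))
    return utterances


sample = """\
*HEL: mwy.
*TRY: nagi.
*HEL: na.
@Comment: sw:n chwarae.
*TRY: ia.
"""
turns = parse_utterances(sample)
counts = Counter(speaker for speaker, _ in turns)
```

Counting turns per speaker, as `counts` does here, is a simple first check that a file has been read as intended.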