Some Information about the Data in the Current DEMO Query Package
1. Components of the Corpus
For this demo query package, we have selected transcripts from different subjects, levels, and streams. The breakdown of the transcripts of the 31 English-mediated lessons in the demo is as follows:
Table 1. English-mediated Lessons in the Demo Query Package
Subject | Level | Stream | Units | Lessons | Schools | Duration (hours) | Tokens
English | Pri 5 | EM1 | 0 | 0 | 0 | 0 | 0
English | Pri 5 | EM2 | 1 | 5 | 1 | 3.9 | 18308
English | Pri 5 | EM3 | 0 | 0 | 0 | 0 | 0
English | Sec 3 | EXP | 0 | 0 | 0 | 0 | 0
English | Sec 3 | NA | 1 | 3 | 1 | 3.4 | 19567
English | Sec 3 | NT | 0 | 0 | 0 | 0 | 0
English | Sec 3 | SPE | 0 | 0 | 0 | 0 | 0
Maths | Pri 5 | EM1 | 1 | 2 | 1 | 2.1 | 10868
Maths | Pri 5 | EM2 | 1 | 1 | 1 | 1.0 | 5457
Maths | Pri 5 | EM3 | 0 | 0 | 0 | 0 | 0
Maths | Sec 3 | EXP | 0 | 0 | 0 | 0 | 0
Maths | Sec 3 | NA | 1 | 2 | 1 | 1.2 | 12087
Maths | Sec 3 | NT | 1 | 1 | 1 | 1.0 | 7779
Maths | Sec 3 | SPE | 0 | 0 | 0 | 0 | 0
Science | Pri 5 | EM1 | 1 | 4 | 1 | 3.2 | 16541
Science | Pri 5 | EM2 | 0 | 0 | 0 | 0 | 0
Science | Pri 5 | EM3 | 0 | 0 | 0 | 0 | 0
Science | Sec 3 | EXP | 0 | 0 | 0 | 0 | 0
Science | Sec 3 | NA | 1 | 6 | 1 | 2.3 | 21850
Science | Sec 3 | NT | 0 | 0 | 0 | 0 | 0
Science | Sec 3 | SPE | 0 | 0 | 0 | 0 | 0
Social | Pri 5 | EM1 | 1 | 3 | 1 | 1.4 | 10633
Social | Pri 5 | EM2 | 0 | 0 | 0 | 0 | 0
Social | Pri 5 | EM3 | 0 | 0 | 0 | 0 | 0
Social | Sec 3 | EXP | 1 | 4 | 1 | 1.8 | 11731
Social | Sec 3 | NA | 0 | 0 | 0 | 0 | 0
Social | Sec 3 | NT | 0 | 0 | 0 | 0 | 0
Social | Sec 3 | SPE | 0 | 0 | 0 | 0 | 0
Total | | | 10 | 31 | 10 | 21.3 | 134821
As shown in the table above, the demo package contains about 21 hours of recordings and 134,821 tokens (words) in total. On average, each of the 31 transcripts/lessons runs for about 41 minutes (21.3 hours divided by 31 lessons) and contains about 4,350 tokens (words).
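If you want to check these figures yourself, the short Python sketch below recomputes the totals and per-lesson averages from the non-empty rows of Table 1. The names `ROWS` and `summarize` are illustrative only and are not part of the query package. Because the durations in the table are rounded to one decimal place, the resulting averages are approximate.

```python
# Non-empty rows of Table 1: (subject, level, stream, lessons, hours, tokens).
ROWS = [
    ("English", "Pri 5", "EM2", 5, 3.9, 18308),
    ("English", "Sec 3", "NA", 3, 3.4, 19567),
    ("Maths", "Pri 5", "EM1", 2, 2.1, 10868),
    ("Maths", "Pri 5", "EM2", 1, 1.0, 5457),
    ("Maths", "Sec 3", "NA", 2, 1.2, 12087),
    ("Maths", "Sec 3", "NT", 1, 1.0, 7779),
    ("Science", "Pri 5", "EM1", 4, 3.2, 16541),
    ("Science", "Sec 3", "NA", 6, 2.3, 21850),
    ("Social", "Pri 5", "EM1", 3, 1.4, 10633),
    ("Social", "Sec 3", "EXP", 4, 1.8, 11731),
]


def summarize(rows):
    """Return totals and per-lesson averages for the given table rows."""
    lessons = sum(row[3] for row in rows)
    # Round the summed hours to one decimal place, matching the table's precision.
    hours = round(sum(row[4] for row in rows), 1)
    tokens = sum(row[5] for row in rows)
    return {
        "lessons": lessons,
        "hours": hours,
        "tokens": tokens,
        "avg_minutes": hours * 60 / lessons,
        "avg_tokens": tokens / lessons,
    }


summary = summarize(ROWS)
print(f"Lessons: {summary['lessons']}, hours: {summary['hours']}, tokens: {summary['tokens']}")
print(f"Average duration per lesson: {summary['avg_minutes']:.1f} minutes")
print(f"Average tokens per lesson: {summary['avg_tokens']:.0f}")
```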
To play the audio and video clips in your search results, you must check the option "Show results in turns" in the lower left corner of the query window when submitting your queries. Please note that a few audio clips are misaligned because the audio files were resampled; we are currently fixing this problem. For testing purposes, only some of the audio/video clips in the result list can be played. The audio and video icons displayed in your query results indicate which media are available.
For the best audio and video playback in your browser, we recommend the latest version of Internet Explorer. If you prefer Mozilla Firefox, some media files may not play automatically, because Firefox has limited support for Windows Media files, and you may need to change your system configuration. We recommend asking your IT administrator to help correct the problem (the proper Windows Media and ActiveX plugins need to be installed). Go to http://forums.mozillazine.org/viewtopic.php?t=206213 and follow the steps exactly. After doing so, you should be able to view the Windows Media files in Firefox.
2. Event List Indexed in the Transcripts
When you browse the search results from a query, you will find that we have indexed all the classroom events in the transcripts:
Table 2. Event Index in the Transcripts
Event | Symbol | Explanation
ENT001 | %% | Background conversation that is inaudible
ENT002 | ## | Background noise
ENT003 | *CHORUS* | Choral voices
ENT004 | $ | Laughter
ENT005 | $$ | Extended laughter
ENT006 | [$] | Laughter quality
ENT007 | [V] | Verbatim reading
ENT008 | (O) | May or may not be talk
ENT009 | ( ) | Ungotten talk
ENT010 | ( ) | Extended ungotten talk
ENT010 | ( ) | Extended ungotten talk ??
ENT011 | <MS: | Generic male student, background voice initiated
ENT012 | <FS: | Generic female student, background voice initiated
ENT013 | <US: | Generic unknown student, background voice initiated
ENT014 | <TV: | Generic teacher, background voice initiated
ENT015 | > | End of background voice
ENT016 | [ | Overlap onset
ENT017 | ] | Overlap termination
ENT018 | == | Latching: indicates end of word and start of new word
ENT019 | = | No break in utterance
ENT020 | (.) | Gap between utterances
ENT021 | (..) | Extended gap between utterances
ENT022 | : | Prolongation of immediate prior sound
ENT023 | :: | Longer prolongation of immediate prior sound
ENT023 | ::: | Longer prolongation of immediate prior sound ??
ENT024 | - | Cut off
ENT025 | ? | Rising intonation
ENT026 | . | Falling intonation
ENT027 | \ | Falling intonation contour
ENT028 | / | Rising intonation contour
ENT029 | | Continuing intonation
ENT030 | ^ | Stress
ENT030 | ^^ | Stress
ENT031 | [E] | Code-switching: English spoken
ENT032 | [C] | Code-switching: Chinese spoken
ENT033 | [M] | Code-switching: Malay spoken
ENT034 | [H] | Code-switching: Hindi spoken
ENT035 | [T] | Code-switching: Tamil spoken
ENT036 | [U] | Code-switching: unknown language spoken
ENT999 | | All others
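As an illustration of how the bracketed symbols in Table 2 could be used once you have exported some result text, the Python sketch below counts them in a single transcript line and reports them by event code. It assumes that the symbols appear inline as literal strings such as `[M]`; the sample line, the name `BRACKET_EVENTS`, and the function `count_bracket_events` are hypothetical and are not part of the query package.

```python
import re
from collections import Counter

# Bracketed symbols from Table 2, mapped to their event codes.
BRACKET_EVENTS = {
    "[$]": "ENT006",  # Laughter quality
    "[V]": "ENT007",  # Verbatim reading
    "[E]": "ENT031",  # Code-switching: English spoken
    "[C]": "ENT032",  # Code-switching: Chinese spoken
    "[M]": "ENT033",  # Code-switching: Malay spoken
    "[H]": "ENT034",  # Code-switching: Hindi spoken
    "[T]": "ENT035",  # Code-switching: Tamil spoken
    "[U]": "ENT036",  # Code-switching: unknown language spoken
}

# Matches exactly one of the single-character bracketed symbols above.
BRACKET_PATTERN = re.compile(r"\[[$VECMHTU]\]")


def count_bracket_events(line):
    """Count bracketed event symbols in one transcript line, keyed by event code."""
    return Counter(BRACKET_EVENTS[symbol] for symbol in BRACKET_PATTERN.findall(line))


# Hypothetical transcript line, invented for this example.
sample = "[M] apa khabar [E] good morning class [$]"
for event, count in sorted(count_bracket_events(sample).items()):
    print(event, count)
```

Note that the overlap markers `[` and `]` (ENT016 and ENT017) use the same bracket characters, so a pattern like this one could in principle misread an overlapped single letter as a tag. Treat the counts as a rough guide only.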
3. Speaker Index in the Transcripts
We have created a scheme to index the speakers in the transcripts. This makes corpus processing and database queries easier, and it also helps us generate speaker profiles across levels, streams, subjects, etc. in corpus queries. In addition, the index enhances the anonymity of speakers when the corpus data is released to the public. Here is the list of speakers you can find in the transcripts:
Table 3. Index of Speakers in the Transcripts
Speaker Type | Speakers | Index | Remarks
Researcher | Researcher (RA, Coder, Interviewer...) | RA | Each transcript has one researcher, and some may have more than one.
Teacher | Teacher (Teacher 1, 2...) | TR | Each transcript has one teacher, and some may have more than one.
Student | Student (Student 1, 2...) | ST | Each transcript has more than one participating student.
Class | Class (Group, Some students...) | CS |
Audio System | VCD, DVD, CD, MP3 player... | AO | Audio played in the class.
Video System | VCD/DVD player, TV... | VO | Video played in the class.
PA System | Public announcements | PA | Public announcements while class is in progress.
Others | All the others | OT | All the others in the class.
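The index codes in Table 3 can be turned into a simple lookup if you want to group exported results by speaker type. The Python sketch below assumes that a speaker label consists of a two-letter index optionally followed by a number (for example `TR1` for the first teacher). That label format is an assumption made for illustration, not something this page specifies, and the names `SPEAKER_TYPES` and `parse_speaker_label` are hypothetical.

```python
import re

# Index codes and speaker types taken from Table 3.
SPEAKER_TYPES = {
    "RA": "Researcher",
    "TR": "Teacher",
    "ST": "Student",
    "CS": "Class",
    "AO": "Audio System",
    "VO": "Video System",
    "PA": "PA System",
    "OT": "Others",
}

# ASSUMPTION: a label is a two-letter index plus an optional speaker number.
LABEL_PATTERN = re.compile(r"^([A-Z]{2})(\d*)$")


def parse_speaker_label(label):
    """Split a label such as 'TR1' into (speaker type, number or None)."""
    match = LABEL_PATTERN.match(label.strip())
    if match is None or match.group(1) not in SPEAKER_TYPES:
        raise ValueError(f"Unrecognized speaker label: {label!r}")
    index, number = match.groups()
    return SPEAKER_TYPES[index], (int(number) if number else None)


for label in ["TR1", "ST12", "CS"]:
    print(label, parse_speaker_label(label))
```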
4. Linguistic Features Annotated
The linguistic features that you can find in the demo data are listed in the table below.
Table 4. Linguistic Features Annotated
LEVELS | TAGGING | TOOLS & METHODS | SAMPLES
Token/Word/Phrase | POS Tagging | Wmatrix (English), Autotags (Chinese) | A sample output from Wmatrix
 | Semantic | |
Clause/Sentence | Theme, Mood & Process | SFG & Dialogue act |
 | Sentence Types, Nominalizations | SFG and Speech-Act Annotation |
 | Speech Acts | |
Discourse | Interclausal Relations | SFG |
 | Turn-taking, IRFs | Dialogue analysis |
 | Phase/episode TRS | Dialogue analysis |
Others | Localized language, teaching/learning strategies, code-mixing/switching, etc. | |