Some Information about the Data in the Current DEMO Query Package

1. Components of the Corpus

This demo query package contains a selection of transcripts from different subjects, levels, and streams. The 31 English-mediated lessons in the demo break down as follows:

Table 1. English-mediated Lessons in the Demo Query Package
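Subject  Level  Stream  Units  Lessons  Schools  Duration (hours)  Tokens
-------  -----  ------  -----  -------  -------  ----------------  ------
English  Pri 5  EM1         0        0        0                 0       0
English  Pri 5  EM2         1        5        1               3.9   18308
English  Pri 5  EM3         0        0        0                 0       0
English  Sec 3  EXP         0        0        0                 0       0
English  Sec 3  NA          1        3        1               3.4   19567
English  Sec 3  NT          0        0        0                 0       0
English  Sec 3  SPE         0        0        0                 0       0
Maths    Pri 5  EM1         1        2        1               2.1   10868
Maths    Pri 5  EM2         1        1        1               1.0    5457
Maths    Pri 5  EM3         0        0        0                 0       0
Maths    Sec 3  EXP         0        0        0                 0       0
Maths    Sec 3  NA          1        2        1               1.2   12087
Maths    Sec 3  NT          1        1        1               1.0    7779
Maths    Sec 3  SPE         0        0        0                 0       0
Science  Pri 5  EM1         1        4        1               3.2   16541
Science  Pri 5  EM2         0        0        0                 0       0
Science  Pri 5  EM3         0        0        0                 0       0
Science  Sec 3  EXP         0        0        0                 0       0
Science  Sec 3  NA          1        6        1               2.3   21850
Science  Sec 3  NT          0        0        0                 0       0
Science  Sec 3  SPE         0        0        0                 0       0
Social   Pri 5  EM1         1        3        1               1.4   10633
Social   Pri 5  EM2         0        0        0                 0       0
Social   Pri 5  EM3         0        0        0                 0       0
Social   Sec 3  EXP         1        4        1               1.8   11731
Social   Sec 3  NA          0        0        0                 0       0
Social   Sec 3  NT          0        0        0                 0       0
Social   Sec 3  SPE         0        0        0                 0       0
Total                      10       31       10              21.3  134821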

As shown in the table above, the demo package contains about 21 hours of recordings and 134,821 tokens (words) in total. On average, each of the 31 transcripts/lessons runs about 41 minutes and contains about 4,350 tokens (words).
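
The per-lesson averages can be re-derived from the Total row of Table 1. The short Python sketch below is only an illustration: the function name is made up for this example, and because the table rounds durations to 0.1 hour, the result is approximate.

```python
# Totals copied from the "Total" row of Table 1.
TOTAL_LESSONS = 31
TOTAL_HOURS = 21.3
TOTAL_TOKENS = 134821


def per_lesson_averages(lessons, hours, tokens):
    """Return (average minutes, average tokens) per lesson."""
    return hours * 60 / lessons, tokens / lessons


avg_minutes, avg_tokens = per_lesson_averages(
    TOTAL_LESSONS, TOTAL_HOURS, TOTAL_TOKENS
)
print(f"{avg_minutes:.1f} minutes per lesson")
print(f"{avg_tokens:.0f} tokens per lesson")
```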

To play the audio and video clips of your search results, you must check the option "Show results in turns" in the lower left corner of the query window when submitting your query. Please note that a few audio clips are misaligned because the audio files were resampled; we are currently fixing this problem. For testing purposes, only some of the audio/video clips in the result list can be played. The audio and video icons displayed in your query results indicate that the media is available.

For the best audio and video playback in your browser, we recommend the latest version of Internet Explorer. If you use Mozilla Firefox, some media files may not play automatically, because Firefox has limited support for Windows Media files and you may need to change your system configuration. We recommend asking your IT administrator to help (the proper Windows Media and ActiveX plugins need to be installed). Go to http://forums.mozillazine.org/viewtopic.php?t=206213 and follow the steps exactly. Afterwards, you should be able to view the Windows Media files in Firefox.

2. Event List Indexed in the Transcripts

When you browse the search results of a query, you will find that all classroom events are indexed in the transcripts with the following symbols:

Table 2. Event Index in the Transcripts

Event   Symbol    Explanation
------  --------  -----------
ENT001  %%        Background conversation that is inaudible
ENT002  ##        Background noise
ENT003  *CHORUS*  Choral voices
ENT004  \$        Laughter
ENT005  \$\$      Extended Laughter
ENT006  [$]       Laughter Quality
ENT007  [V]       Verbatim Reading
ENT008  (O)       May or may not be talk
ENT009  ( )       Ungotten talk
ENT010  (  )      Extended ungotten talk
ENT010  (   )     Extended ungotten talk
ENT010  (    )    Extended ungotten talk ??
ENT011  <MS:      Generic Male Student Background voice initiated
ENT011  &lt;MS:   Generic Male Student Background voice initiated
ENT012  <FS:      Generic Female Student Background voice initiated
ENT012  &lt;FS:   Generic Female Student Background voice initiated
ENT013  <US:      Generic Unknown Student Background voice initiated
ENT013  &lt;US:   Generic Unknown Student Background voice initiated
ENT014  <TV:      Generic Teacher Background voice initiated
ENT014  &lt;TV:   Generic Teacher Background voice initiated
ENT015  >         End of background voice
ENT016  [         Overlap Onset
ENT017  ]         Overlap Termination
ENT018  ==        Latching - indicate end of word and start of new word
ENT019  =         No break in utterance
ENT020  (.)       Gap between utterances
ENT021  (..)      Extended gap between utterances
ENT022  :         Prolongation of immediate prior sound
ENT023  ::        Longer prolongation of immediate prior sound
ENT023  :::       Longer prolongation of immediate prior sound ??
ENT024  -         Cut off
ENT025  ?         Rising intonation
ENT026  .         Falling intonation
ENT027  \\        Falling intonation contour
ENT028  /         Rising intonation contour
ENT029            Continuing intonation
ENT030  ^         Stress
ENT030  ^^        Stress
ENT031  [E]       Code-Switching - English spoken
ENT032  [C]       Code-Switching - Chinese spoken
ENT033  [M]       Code-Switching - Malay spoken
ENT034  [H]       Code-Switching - Hindi spoken
ENT035  [T]       Code-Switching - Tamil spoken
ENT036  [U]       Code-Switching - Unknown language spoken
ENT999            All others
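
To illustrate how symbols like these can be mapped to event codes, here is a minimal Python sketch. It is not the indexer used to build the corpus: the function name, the chosen subset of symbols, and the sample line are all assumptions made for this example. Longer symbols are matched first so that "==" is not read as two "=" events.

```python
import re

# A small subset of Table 2: transcript symbol -> event code.
EVENT_SYMBOLS = {
    "%%": "ENT001",
    "##": "ENT002",
    "*CHORUS*": "ENT003",
    "==": "ENT018",
    "=": "ENT019",
    "(.)": "ENT020",
    "(..)": "ENT021",
    "[C]": "ENT032",
    "[M]": "ENT033",
    "[T]": "ENT035",
}

# Sort by length, longest first, so "==" wins over "=".
_PATTERN = re.compile(
    "|".join(
        re.escape(symbol)
        for symbol in sorted(EVENT_SYMBOLS, key=len, reverse=True)
    )
)


def find_events(line):
    """Return (symbol, event code, character offset) for each symbol found."""
    return [
        (match.group(), EVENT_SYMBOLS[match.group()], match.start())
        for match in _PATTERN.finditer(line)
    ]


# Hypothetical transcript line, invented for illustration only.
sample = "okay (.) sit down ## [M] duduk == now"
for symbol, code, offset in find_events(sample):
    print(offset, symbol, code)
```

Single-character symbols such as ":" or "." are left out of this sketch on purpose, since they would also match ordinary punctuation and need context to interpret.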


3. Speaker Index in the Transcripts

We have created a scheme to index the speakers in the transcripts. This makes corpus processing and database queries easier, and it helps us generate speaker profiles across levels, streams, subjects, etc. in corpus queries. In addition, the index enhances the anonymity of speakers when the corpus data is released to the public. Here is the list of speakers you can find in the transcripts:

Table 3. List of Index of Speakers in the Transcripts

Speaker Type  Speakers                                Index  Remarks
------------  --------------------------------------  -----  -------
Researcher    Researcher (RA, Coder, Interviewer...)  RA     Each transcript has one researcher, and some may have more than one.
Teacher       Teacher (Teacher 1, 2...)               TR     Each transcript has one teacher, and some may have more than one.
Student       Student (Student 1, 2...)               ST     Each transcript has more than one participating student.
Class         Class (Group, Some students...)         CS
Audio System  VCD, DVD, CD, MP3 player...             AO     Audio played in the class.
Video System  VCD/DVD player, TV...                   VO     Video played in the class.
PA System     Public announcements                    PA     Public announcements while class is in progress.
Others        All the others                          OT     All the others in the class.


4. Linguistic Features Annotated

The linguistic features that you can find in the demo data are listed in the table below.

Table 4. Linguistic Features Annotated

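LEVELS             TAGGING                          TOOLS & METHODS                        SAMPLES
-----------------  -------------------------------  -------------------------------------  ----------------------------
Token/Word/Phrase  POS Tagging                      Wmatrix (English), Autotags (Chinese)  A sample output from Wmatrix
Token/Word/Phrase  Semantic                         Wmatrix (English), Autotags (Chinese)  A sample output from Wmatrix
Clause/Sentence    Theme, Mood & Process            SFG & Dialogue act
Clause/Sentence    Sentence Types, Nominalizations  SFG and Speech-Act Annotation
Clause/Sentence    Speech Acts                      SFG and Speech-Act Annotation
Discourse          Interclausal Relations           SFG
Discourse          Turn-taking, IRFs                Dialogue analysis
Discourse          Phase/episode TRS                Dialogue analysis
Others             Localized language, teaching/learning strategies, code-mixing/switching, etc.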

Copyright@ 2006 SCoRE. All rights reserved.