Pro(so)Praat - A prosodic annotation project
An orthographic transcription of spoken texts is considered a preliminary and essential operation for the constitution, usability and reusability of a corpus (Gibbon et al., 1997: 79). It is the tool which endows the corpus with an organized structure, and it is based on "a set of conventions to represent different kinds of information which are present in the spoken form but cannot be conveyed by means of the normal spelling convention" (Llisterri, 1997: 1).
The transcription makes the oral text permanent, manageable and analysable for multiple purposes.
Representation is followed by interpretation of the text, which consists in adding information to the text itself, to different degrees according to the aims of the work.
The entire coding operation of a corpus consists in making explicit the various types of interpretation of the text.
For the reasons outlined above, a conventional orthographic transcription of the recorded texts is common to all spoken corpora, whatever their objective, field of application or intended audience.
For spontaneous or semi-spontaneous texts, this preliminary operation takes the form of a simple transcript of the recording; for read texts, it amounts to a simple match with the written text handed to the speaker.
The codification normally includes an annotation of the text itself which enriches the transcription with a number of descriptive and interpretative details.
For both operations, the CLIPS (Corpora and Lexicon of Spoken and Written Italian) and C-ORAL-ROM projects have proposed and defined specific norms and standardized protocols, in order to facilitate the use and interchange of the numerous corpora collected and developed around the world.
The norms proposed and adopted in the various projects are based on widespread and common general principles but, at the same time, they differ in response to specific criteria which vary according to the aims and final goals for which the corpus is collected and codified.
The main goal of the annotation is to have a written text which can be used 'autonomously' from the vocal production, so that it is possible to extract linguistic (related to the different levels of analysis), paralinguistic and also extra-linguistic (for example, situational) information.
The annotation of the text has different degrees of complexity depending, as noted above, on the type of corpus and its aims. In theory, a corpus could be annotated at all possible levels; in practice, corpora are much more frequently annotated only in some respects.
Annotation = enrichment of the transcript through details related to the production, symbols referring to phonic phenomena and extratextual comments referring to linguistic units.
Mark-up = operations aimed at defining, identifying and classifying the constitutive linguistic units of a text at various levels (phonetic, phonological, prosodic, lexical-morphological, morpho-syntactic, coreferential, of unity of speech, of communicative functions, etc.).
In this subdivision, annotation is the stage with the lowest degree of complexity. The two operations can coexist within the same text, but they usually respond to different principles and pursue different aims.
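The distinction between the two operations can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the class name Token, the two helper functions, the sample utterance and all the labels are hypothetical and do not come from the CLIPS or C-ORAL-ROM protocols. Annotation is stored as free comments about the production, while mark-up is stored as a classification of each unit at a named level.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Token:
    """One orthographic unit of the transcript (hypothetical structure)."""
    text: str
    # Annotation: details about how the unit was produced.
    annotations: List[str] = field(default_factory=list)
    # Mark-up: level name -> label classifying the unit at that level.
    markup: Dict[str, str] = field(default_factory=dict)


def plain_text(tokens: List[Token]) -> str:
    """Return the bare orthographic transcription, without any enrichment."""
    return " ".join(t.text for t in tokens)


def labels_at_level(tokens: List[Token], level: str) -> List[Tuple[str, str]]:
    """Return (text, label) pairs for the tokens marked up at the given level."""
    return [(t.text, t.markup[level]) for t in tokens if level in t.markup]


# Invented sample utterance with invented comments and labels.
utterance = [
    Token("well", annotations=["lengthened vowel"]),
    Token("I", markup={"morpho-syntactic": "PRON"}),
    Token("agree", annotations=["overlapping laughter"],
          markup={"morpho-syntactic": "VERB"}),
]

print(plain_text(utterance))
print(labels_at_level(utterance, "morpho-syntactic"))
```

Keeping the two kinds of information in separate fields reflects the point made above: they can coexist on the same text while answering to different principles.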
Minimum requirements for an annotated transcript:
We will address all these minimum requirements in the first Pro(so)Praat tutorial, using the PRAAT program.
PRAAT is a powerful tool for analysing, synthesizing, visualizing and manipulating the phonic signal. It is free software created by the Dutch linguists Paul Boersma and David Weenink and can be downloaded from this page.
It is well suited to segmentation, annotation and mark-up because it lets you work on multiple levels; moreover, it supports scripts that automate analysis procedures, extract statistics, produce graphs and much more.
We will use this software to develop first the orthographic annotation and then the prosodic one.
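To give an idea of what working "on multiple levels" means, here is a minimal Python sketch of time-aligned tiers, loosely modelled on the interval tiers that PRAAT stores in its TextGrid objects. It is not PRAAT's scripting language and does not use any PRAAT API: the class Interval, the two functions, the tier names and all the times and labels are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Interval:
    """A labelled stretch of the recording (times in seconds)."""
    start: float
    end: float
    label: str


def labels_overlapping(tier: List[Interval], start: float, end: float) -> List[str]:
    """Return the labels of the intervals that overlap the span [start, end)."""
    return [iv.label for iv in tier if iv.start < end and iv.end > start]


def total_duration(tier: List[Interval], label: str) -> float:
    """Sum the durations of all intervals carrying the given label."""
    return sum(iv.end - iv.start for iv in tier if iv.label == label)


# Two hypothetical tiers aligned to the same invented two-second recording.
tiers = {
    "orthographic": [
        Interval(0.0, 0.5, "good"),
        Interval(0.5, 1.0, "morning"),
        Interval(1.0, 1.5, ""),          # silence, left unlabelled
        Interval(1.5, 2.0, "everyone"),
    ],
    "prosodic": [
        Interval(0.0, 1.0, "unit-1"),
        Interval(1.0, 1.5, "pause"),
        Interval(1.5, 2.0, "unit-2"),
    ],
}

# Words that fall inside the first prosodic unit.
first_unit = tiers["prosodic"][0]
print(labels_overlapping(tiers["orthographic"], first_unit.start, first_unit.end))

# A simple statistic: total pause time on the prosodic tier.
print(total_duration(tiers["prosodic"], "pause"))
```

Because every tier refers to the same time axis, a query on one level (a prosodic unit) can be answered with material from another (the words it contains), which is the kind of task a script can automate.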