The construction of a standard speech database is an important prerequisite for contemporary research in speech recognition and understanding. Institutes around the world are seeing continuous growth of interest among researchers in corpus-based speech and natural language processing techniques. Since corpus-based methods have proved very efficient in most language and speech systems, the reliability of these methods is also increasing with the progressive development of language technology. In recent decades, the development of systems using language technology has been promoted by various laboratories, industries and governments in almost all influential languages. However, the history of corpus generation and corpus-based Bangla speech recognition is short, spanning only a few years, even though Bangla is one of the influential languages, spoken by about 260 million people around the world, and is the 8th most popular language.
To date, among the few Bangla speech corpora created, probably the first step was taken by the Center for Development of Advanced Computing (CDAC) of India, which created Bangla Katha Bhandar, released in 2005 [1]. It was a collection of annotated speech corpora for Bangla. A similar step was taken by the Center for Research on Bangla Language Processing at BRAC University, Bangladesh, in 2010 [2]. In between these two, a research project financed by the MOSICT of Bangladesh was completed in June 2008. Under this project, large-scale speech corpora were recorded at the SIPL of Islamic University [3]. The SIPL speech corpora differ from the other two in that they were designed especially for Bangla speech recognition. As a continuation of the project, organizing, labeling and similar processing of the results are still ongoing.
This paper describes the design and development of a connected word speech corpus. After the basics of speech corpora, the BdNC01 text corpus is briefly described to explain the selection of words for the speech database design. The subsequent subsections discuss speech recording, the editing processes and the final outcome. The paper concludes with the usability of the corpus.
2. SPEECH CORPUS FUNDAMENTALS
A corpus is a collection of written text or recorded speech of a language, used to discover the units of the language and the relations among them. Modern corpora are collected and stored in electronic form for efficient statistical analysis using software tools. Corpora are collected according to some external criteria to represent a language or language variety so that they can be used as a source of data for linguistic research [4]. A speech corpus consists of audio files and text transcriptions in a structured database that can be used to train an automated system, which can then be used as part of a speech recognition engine [5]. Speech corpora may be classified into two types, as below:
1. Read Speech – This includes parts of books, newspaper content, broadcast news, lists of words and numbers, etc.
2. Spontaneous Speech – This includes naturally occurring dialogs between two or more people, narratives such as a person telling a story, class lectures, and discussions such as two people trying to find a common meeting time based on their individual schedules.
There are also some special kinds of speech corpora, such as non-native speech databases that contain speech with a foreign accent, and dialect databases.
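The description of a speech corpus as audio files paired with text transcriptions in a structured database can be illustrated with a minimal sketch. The class names, field names and file paths below are hypothetical and chosen only for illustration; they do not describe the layout of any corpus mentioned in this paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class SpeechType(Enum):
    """The two corpus types distinguished in the text."""
    READ = "read"
    SPONTANEOUS = "spontaneous"


@dataclass(frozen=True)
class Utterance:
    audio_path: str          # hypothetical path to the recorded audio file
    transcription: str       # text that was spoken in the recording
    speaker_id: str          # hypothetical speaker label
    speech_type: SpeechType


class SpeechCorpus:
    """In-memory index of utterances, grouped by speaker."""

    def __init__(self):
        self._by_speaker = defaultdict(list)

    def add(self, utterance):
        self._by_speaker[utterance.speaker_id].append(utterance)

    def utterances(self, speaker_id=None):
        """Return all utterances, or only those of one speaker."""
        if speaker_id is not None:
            return list(self._by_speaker.get(speaker_id, []))
        return [u for items in self._by_speaker.values() for u in items]

    def training_pairs(self):
        """Return (audio_path, transcription) pairs for recognizer training."""
        return [(u.audio_path, u.transcription) for u in self.utterances()]
```

Keeping the transcription next to the audio reference is what allows such a collection to serve as training data for a recognizer; a real corpus would normally store this index in files or a database rather than in memory.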
A speech corpus is frequently used as the basis for analyzing the characteristics of the speech signal, and the results of the analysis then become useful for developing speech generation and recognition systems. Speech corpora are growing larger and more complicated day by day, because computation power is increasing and various robust methods are being developed in speech technology.
One method of selecting the speech content of a corpus is to use the analytical results from a text corpus. For example, WSJCAM0, a speech corpus of British English, was recorded at Cambridge University from the Wall Street Journal text corpus [6]. An important step before recording a speech corpus is to select popular words that form a representative vocabulary of the language under consideration. Each unknown word usually causes between 1.5 and 2 recognition errors on average [7]. Therefore the recognizer vocabulary is usually designed with the goal of maximizing lexical coverage for the expected input. A popular approach is to choose the most frequent words from a text corpus, which means that the reliability of the vocabulary is highly dependent upon the representativeness of the training data [8].
The influential parameters for categorizing a speech recognition system are speech type, speaker dependency, vocabulary size, etc.
The importance of these parameters is context sensitive: it depends on the design considerations of a recognition system to be used for a specific application or task [9]. Three types of speech are usually fed to speech recognition systems: isolated, connected and continuous speech. Isolated speech requires a significant pause between words, perhaps 250 milliseconds.
In an isolated speech system, one speech file may contain an utterance of a single word or a short string of several isolated words. In continuous speech recognition systems, speech flows continuously with a rhythm and the words overlap each other, making recognition harder. In between these two, connected speech recognizers do not require an intermediate pause between inputs, but are able to detect word boundaries within a string of connected speech. However, connected speech requires careful utterance of each word, as in dictation. Although much of the relevant literature treats connected words and continuous speech as alternative terms, the vast diversity of applications makes it necessary to define connected words separately. In fact, the way to classify “connected words” and “continuous speech” is somewhat technical.
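The role of the inter-word pause in isolated speech can be illustrated with a simple energy-based segmentation sketch. It assumes the signal has already been reduced to one energy value per frame; the function name, the 10 ms frame length and the energy threshold are illustrative assumptions, and only the 250 ms pause comes from the text above. This is not the segmentation procedure used for the corpus described in this paper.

```python
def split_on_pauses(energies, frame_ms=10, threshold=0.01, min_pause_ms=250):
    """Return (start_frame, end_frame) spans of speech separated by long pauses.

    energies: one energy value per frame; frames below `threshold` are silence.
    A silent run shorter than `min_pause_ms` is kept inside the current span.
    end_frame is exclusive.
    """
    # Number of consecutive silent frames that counts as a pause (rounded up).
    min_pause_frames = -(-min_pause_ms // frame_ms)
    segments = []
    start = None         # first frame of the span being built
    last_voiced = None   # most recent frame at or above the threshold
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i
            last_voiced = i
        elif start is not None and i - last_voiced >= min_pause_frames:
            # The silence is long enough: close the current span.
            segments.append((start, last_voiced + 1))
            start = None
    if start is not None:
        segments.append((start, last_voiced + 1))
    return segments
```

With 10 ms frames, a 300 ms silence separates two spans, whereas a 100 ms silence does not. A connected word recognizer cannot rely on such pauses and must detect word boundaries by other means.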
A connected word recognizer uses words as recognition units, which can be trained in an isolated word mode. Specific and efficient applications of connected word recognizers are found in dictation and voice command recognition. Speech recognition systems can be further classified as either speaker-dependent or speaker-independent. In speaker-dependent systems, each speaker enters several samples of each word of the expected vocabulary to form the reference templates [10]. Another important parameter in designing a speech corpus is its vocabulary size.
The terms small, medium and large are usually applied to vocabulary sizes of the order of 100, 1000 and (over) 5000 words, respectively. However, a typical small vocabulary recognizer can recognize only the ten digits, and a typical large vocabulary recognition system can recognize 20000 words [9]. Gould, Conti, and Hovanyecz [10] proposed a limited-capability automatic dictation machine in 1983, named the listening typewriter. The machine was simulated in a letter writing task with isolated and connected speech databases using various vocabulary sizes. In their experiment the performance of the voice recognizer was estimated for a 1000 word vocabulary and for unlimited vocabularies. The 1000 word vocabulary was composed of the 1000 most frequent English words.
The work concluded that roughly 75% of the words used in the letter writing task were available in the 1000 word vocabulary. Therefore, for dictation and voice command recognition, a medium size vocabulary may be considered sufficient for satisfactory performance.
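The frequency-based vocabulary selection and the coverage figure discussed in this section can be sketched as follows. The function names and the toy word lists are assumptions made for illustration; the sketch is not the selection procedure applied to the BdNC01 text corpus.

```python
from collections import Counter


def select_vocabulary(tokens, size):
    """Return the `size` most frequent words of a tokenized text corpus."""
    return {word for word, _ in Counter(tokens).most_common(size)}


def lexical_coverage(vocabulary, tokens):
    """Fraction of the running words in `tokens` that belong to `vocabulary`."""
    if not tokens:
        return 0.0
    covered = sum(1 for word in tokens if word in vocabulary)
    return covered / len(tokens)


# Toy data: the vocabulary is chosen from one text and evaluated on another.
training_tokens = "the cat sat on the mat the cat ran".split()
task_tokens = "the dog sat on the cat".split()

vocabulary = select_vocabulary(training_tokens, size=2)
coverage = lexical_coverage(vocabulary, task_tokens)
out_of_vocabulary_rate = 1.0 - coverage
```

Measuring coverage on text other than the text used for selection is what exposes the dependence on the representativeness of the training data: words that are frequent in the source corpus but rare in the target task add little coverage.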