COSC6328 Course Project 2 (W08)

 

Due: 12:00 noon, April 9th, 2006

 

In this project, you can choose to do either option A or option B, but I would like roughly half of the class to do each option. If too many people choose one option, the rest may be asked to do the other. Once you have decided which option to do, please let me know as soon as possible. Note that this is individual work: do not exchange or discuss any details of your experiments with your classmates. At the end, everybody will have to prepare a short (10-15 minute) talk to present the major results and findings from this project.

 

 

Option A: Building a large-vocabulary speech recognizer with HTK

 

In this option, you are asked to use the HMM Toolkit (HTK) to build a large-vocabulary speech recognition system on a speech database called SPINE, on a Linux machine.  A detailed user manual for HTK is available on the course Web site (see reading list [F1]; do not print the entire document, which is almost 300 pages). All binary executable HTK tools (for Linux) can be found in the CS department network directory /cs/course/6328/Project2A/HTK34.  Some other complementary materials, such as tutorial examples and some script tools, can be found in /cs/course/6328/Project2A/HTK-samples.

 

The SPINE database was collected from pairs of speakers engaged in a collaborative war game that requires players to locate and destroy targets on a pre-defined grid. Each speaker sits in a sound booth in which a background noise environment is reproduced. The original speech data is sampled at 16 kHz with a resolution of 16 bits. The objective of collecting this database is to build a speech recognition system that can be used by military personnel on a real battlefield. A small set of SPINE speech data has been copied to the directory /cs/course/6328/Project2A/SPINE/speech for your reference. You can view these waveform files (*.sph) with the HTK tool HSLab in Linux using the command:

 

> ???/HSLab -C  /cs/course/6328/Project2A/config_view   ???.sph

 

If an audio device is installed on the Linux system, you will be able to play these recorded voice files as well. Since all SPINE data was collected in a noisy environment and most speakers spoke in a fairly casual way, it is normally a huge challenge to recognize this kind of spontaneous and noisy speech.

 

The speech material in spine1_train is used for training and that in spine1_eval is used for testing, giving a total of 11974 utterances available for training and 12079 utterances for testing. For your convenience, I have already completed feature extraction (converting the speech waveforms into 13-D MFCC feature vectors).  All MFCC feature files (*.mfc) for training are stored in the directory /cs/course/6328/Project2A/SPINE/spine1_train and all MFCC feature files for testing are stored in the directory /cs/course/6328/Project2A/SPINE/spine1_eval. All of these MFCC feature files follow the standard HTK format (see section 5.7.1 on pages 69-70 of the HTK manual). The feature files can be listed using the HTK tool HList:

 

>   ???/HList   ???.mfc

 

The feature extraction was conducted with the HTK tool HCopy using the configuration file /cs/course/6328/Project2A/config_fe.  IMPORTANT: when you use these MFCC feature files for training and testing, you need to use the configuration file /cs/course/6328/Project2A/config to convert them into 39-D features on the fly. Just pass it with the '-C' option, which most HTK tools accept.
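
For example, with ??? again standing for the directory holding the HTK binaries, listing a feature file through this configuration should show 39 components per frame (13 static MFCCs plus delta and acceleration coefficients) instead of 13:

> ???/HList -C /cs/course/6328/Project2A/config ???.mfc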

 

In this project, you are required to build a speech recognition system that recognizes the SPINE test data using the HTK tools.  You are allowed to use all MFCC feature files in the above directories directly.  For your convenience, I have compiled a lexicon, located at /cs/course/6328/Project2A/lexicon_train.dic, which includes all words in the training set along with their pronunciations, for you to train your system. In addition, I have prepared a tri-gram language model for SPINE at /cs/course/6328/Project2A/lm.arpa and its corresponding lexicon /cs/course/6328/Project2A/lexicon_test.dic, which you should use for testing.
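
If the provided HTK34 directory includes HDecode, HTK's large-vocabulary decoder, a test decoding run with this tri-gram LM and lexicon might look roughly like the sketch below. Here hmmdefs and tiedlist are stand-ins for your trained model file and tied HMM list, the numeric values are placeholder tuning parameters, and the exact options should be verified against the HDecode documentation:

> ???/HDecode -C /cs/course/6328/Project2A/config -H hmmdefs -S /cs/course/6328/Project2A/spine1_dev.scp -i rec.mlf -w /cs/course/6328/Project2A/lm.arpa -s 15.0 -p 0.0 -t 220.0 /cs/course/6328/Project2A/lexicon_test.dic tiedlist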

 

You must STRICTLY follow the project requirements:

 

·        Use Gaussian mixture continuous density HMMs (CDHMMs) as your acoustic models.

 

·        Use only the files in the directory spine1_train as your total training data. Do not use any other data at any training stage. A full list of all training feature files is provided at /cs/course/6328/Project2A/spine1_train.scp. The transcriptions of all the training data are given at /cs/course/6328/Project2A/spine1_train.mlf in the HTK MLF format (see section 6.3 on pages 89-93 of the HTK manual for the MLF format). A sketch of how these files plug into HTK training is given after this list.

 

·        Always evaluate your system or models by testing on all data in the directory spine1_eval. A full list of all test feature files is provided at /cs/course/6328/Project2A/spine1_eval.scp. Since there are over 12000 utterances in spine1_eval, it may be too time-consuming to test the entire set during every step of your development process. Thus, I have prepared a subset of the test data, listed in /cs/course/6328/Project2A/spine1_dev.scp (2010 utterances in total), for you to test your models quickly during the development phase. However, you should report the performance of your best model on the entire test set spine1_eval.scp. The transcriptions of all test data are given at /cs/course/6328/Project2A/spine1_eval.mlf in the HTK MLF format. You may use these test data only to evaluate your models and systems, NOT for model training. Accessing any information about the test data at any training stage will be considered cheating.
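
As a minimal sketch of how the provided files feed into a flat-start training run (all names other than the provided paths, i.e. proto, monophones, phones.mlf, hmm0 and hmm1, are hypothetical files and directories you create yourself; see the HTK tutorial for the full recipe):

> ???/HCompV -C /cs/course/6328/Project2A/config -f 0.01 -m -S /cs/course/6328/Project2A/spine1_train.scp -M hmm0 proto

computes global means and variances from the training data. After cloning the re-estimated prototype into an initial model set (hmm0/hmmdefs and hmm0/macros) covering every monophone, and expanding the word-level transcriptions in spine1_train.mlf into phone-level labels (e.g. with HLEd and lexicon_train.dic, giving a hypothetical phones.mlf), each embedded re-estimation pass is a single HERest call:

> ???/HERest -C /cs/course/6328/Project2A/config -I phones.mlf -t 250.0 150.0 1000.0 -S /cs/course/6328/Project2A/spine1_train.scp -H hmm0/macros -H hmm0/hmmdefs -M hmm1 monophones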

 

Apart from these requirements, you are completely free in all other aspects of the design. Your goal is to achieve recognition accuracy as high as possible on the test set spine1_dev.scp. You are free to explore any ideas you have, or have learned in class, to improve the recognition performance. However, you need to show a systematic strategy for initializing/building your system and improving its performance gradually, not just a collection of random experiments like shooting in the dark.

 

I have a few suggestions regarding what you should focus on in this project:

 

·        Build a good monophone system first.

 

·        Extend it to a state-tied triphone system based on the phonetic decision tree method, using the HTK tool HHEd. A sample edit script is given for your convenience at /cs/course/6328/Project2A/tree.hed. This file is provided only as a template; you may need to adjust many parameters in it to achieve the best performance. (A command sketch for the tying step appears after this list.)

 

·        Refine the state-tied triphone models for the best performance.

 

·        Always evaluate your systems in terms of word accuracy (in %), e.g. with the HTK tool HResults (a scoring sketch is also given after this list).
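
As a rough command sketch for the last two suggestions (hmm_tri, hmm_tied, triphones_all, wordlist and rec.mlf are hypothetical names for files you will create during training and decoding):

> ???/HHEd -H hmm_tri/macros -H hmm_tri/hmmdefs -M hmm_tied /cs/course/6328/Project2A/tree.hed triphones_all

applies the tree-based clustering and tying commands in tree.hed to your untied triphone models, and

> ???/HResults -I /cs/course/6328/Project2A/spine1_eval.mlf wordlist rec.mlf

aligns your recognition output rec.mlf against the reference transcriptions and prints the word accuracy (wordlist here is simply a list of the words in the test lexicon).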

 

What to submit?

 

You need to submit (email me at hj@cse.yorku.ca)  the following files before the deadline:

 

1.      report_X.pdf: a project report (maximum 8 pages) summarizing all experiments you have done and describing the whole experimental path from scratch to your best system; report the settings of your best system and its performance on the entire test set.

 

2.      train_X.script: a training script that automatically builds (from scratch) the best system described in your report. Please use absolute paths in your script so that I can run it anywhere.

 

3.      test_X.script: a test script that tests your best system (built by train_X.script) on the entire test set spine1_eval.scp.

 

4.      readme_X.txt: how to run train_X.script and test_X.script.

 

where X stands for your user name @cse.yorku.ca. In addition, you must prepare a 15-minute presentation introducing what you have done in the project and what your findings and conclusions are.

 

Further Reading for option A:

[F1] Steve Young et al., “The HTK Book (for HTK version 3.4)”, Microsoft Corporation. (See particularly chapters 1, 2, and 3.)

 

Option B: Optimizing large-scale language model with WFST

 

As we know, n-gram language models play an important role in many applications, such as speech recognition, machine translation, data mining, information retrieval and so on. In most practical applications, we need to handle very large-scale n-gram language models, which may contain up to several hundred million n-grams along with their conditional probabilities. It is therefore very important to represent a large n-gram language model compactly, to save the computing resources needed to handle it. In many cases, an n-gram language model is given in the readable ARPA format. In this project, you are asked to write a program that converts an ARPA back-off n-gram language model into an equivalent weighted finite-state transducer (WFST) in the FSM format.  Then, you use the provided lexicon to convert it into a tri-phone search network for speech recognition. Next, you expand it into a big HMM network using the provided lists of physical HMMs and logical HMMs, and optimize it to achieve the most compact representation in terms of the number of WFST states and transitions. During your optimization process, you may need to use various FSM operations, such as sum, product, composition, determinization, minimization, etc. Your goal is to create the smallest WFST network that is equivalent to the original language model and lexicon. You are allowed to explore any kind of WFST operation, and you may want to refer to the various techniques in [F2] to optimize your WFST.
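
In the FSM text format read by fsmcompile, each transition is one line of the form "source destination input-label [output-label] [weight]", and a final state is a line containing just the state number and an optional weight. A common (though not the only valid) construction for a back-off LM uses one state per n-gram history, word arcs weighted by negative log probabilities, and epsilon back-off arcs weighted by the back-off weights. A toy acceptor fragment, with words, state numbers and weights invented purely for illustration (the remarks in parentheses are not part of the format):

0 1 the 1.204        (unigram arc into the history state for "the")
1 2 cat 0.693        (bigram arc for p(cat | the))
1 0 <eps> 0.511      (back-off arc from history "the" to the null history)
2                    (final state)

Your lm2wfst program would emit a file of this form from lm_3gram.arpa, mapping words to integers via a symbol table (with <eps> at index 0), and you would then compile it with fsmcompile.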

 

In this project, you use a 5000-word 3-gram ARPA language model and a 5000-word lexicon which are stored as follows:

 

/cs/course/6328/Project2B/lm_3gram.arpa

/cs/course/6328/Project2B/lexicon_5k

/cs/course/6328/Project2B/physical_hmms

/cs/course/6328/Project2B/logical_hmms

 

Each line of /cs/course/6328/Project2B/physical_hmms specifies a physical tri-phone HMM, denoted ‘a-b+c’, together with its three distinct HMM states listed on the same line. Each line of /cs/course/6328/Project2B/logical_hmms gives a logical tri-phone HMM and the physical HMM it is equivalent to, shown on the same line. You can use these two HMM lists to create the required HMM transducer. (A sketch of the compile-and-optimize pipeline is given below.)
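
A purely illustrative pipeline with the provided FSM tools might look like the sketch below, where grammar.txt is the LM acceptor produced by lm2wfst, lexicon.txt and hmm.txt are text-format transducers you build from lexicon_5k and the two HMM lists, and the *.syms files are the corresponding symbol tables (all of these names are hypothetical; check the exact options against the man pages shipped with FSM4.0):

> fsmcompile -i words.syms grammar.txt > G.fsm

> fsmcompile -t -i phones.syms -o words.syms lexicon.txt > L.fsm

> fsmcompile -t -i hmms.syms -o phones.syms hmm.txt > H.fsm

> fsmcompose L.fsm G.fsm | fsmrmepsilon | fsmdeterminize | fsmminimize > LG.fsm

> fsmcompose H.fsm LG.fsm | fsmrmepsilon | fsmdeterminize | fsmminimize > HLG.fsm

The order and placement of determinization and minimization, the handling of disambiguation symbols, and weight pushing are exactly the design choices discussed in [F2], and they largely determine the final network size.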

 

All FSM tools are provided in the following directory:

 

/cs/course/6328/Project2B/FSM4.0
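
Since your report must give the size of the best WFST in numbers of states and transitions, the FSM inspection utilities are handy; for example (HLG.fsm standing for whatever your final optimized network is called):

> /cs/course/6328/Project2B/FSM4.0/fsminfo HLG.fsm

> /cs/course/6328/Project2B/FSM4.0/fsmprint -i hmms.syms -o words.syms HLG.fsm

fsminfo prints summary statistics, including the numbers of states and arcs, and fsmprint dumps the network back to the text format for spot-checking.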

 

What to submit?

 

You need to submit (email me at hj@cse.yorku.ca) the following files before the deadline:

 

1. report_X.pdf: a project report (maximum 8 pages) summarizing all experiments you have done and describing the whole experimental path from the original LM to your best WFST; report all of your settings and the size of the best WFST in numbers of states and transitions.

 

2. lm2wfst.[c|java]: a program which converts the original LM into an equivalent WFST in FSM format.

 

3. experiment.script: a running script which uses your lm2wfst tool and other FSM tools to automatically generate your best WFST as mentioned in your report. Please use absolute paths in your script so that I can run it anywhere.

 

4. readme_X.txt: how to run experiment.script to repeat your experiments.

 

 

Further Reading for option B:

 

[F2] M. Mohri, F. Pereira and M. Riley, “Weighted Finite-State Transducers in Speech Recognition”, Computer Speech and Language, 2002.


[F3]  M. Mohri, F. Pereira and M. Riley, tutorial slides on Finite-State-Machine (FSM) Toolkit.