Feature-Based Models of Spoken Pronunciations
Spoken language, especially conversational speech, is characterized by a great deal of variability in pronunciation, including many variants that differ grossly from dictionary prototypes. This is an interesting phenomenon in its own right, and it has been cited as a factor in the poor performance of automatic speech recognizers on conversational speech. One approach to representing this variation is to expand the recognition dictionary, in which each word is represented as a string of phones, or basic speech sounds. This expansion is often done by applying phonetic substitution, insertion, and deletion rules to an existing dictionary. This is a powerful method, but it has some drawbacks: many pronunciation variants typically remain unaccounted for; word confusability increases; and some common speech phenomena are awkward to represent.
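To make the rule-based expansion concrete, here is a minimal sketch in Python. The dictionary entry, the rules (a schwa deletion and a flapping rule), and the phone labels are all invented for illustration; real systems use much richer, context-dependent rules.

```python
# Toy phone-based dictionary: each word maps to a canonical phone string.
# (Pronunciation and rules below are invented for illustration.)
dictionary = {
    "probably": ["p", "r", "aa", "b", "ah", "b", "l", "iy"],
}

# Context-free rewrite rules: (target phone, replacement), where a
# replacement of None means deletion.
rules = [
    ("ah", None),   # schwa deletion
    ("t", "dx"),    # flapping: /t/ becomes a flap
]

def expand(pron, rules):
    """Apply each rule at every matching position, accumulating variants."""
    variants = {tuple(pron)}
    for target, repl in rules:
        new = set()
        for v in variants:
            for i, ph in enumerate(v):
                if ph == target:
                    if repl is None:
                        new.add(v[:i] + v[i + 1:])
                    else:
                        new.add(v[:i] + (repl,) + v[i + 1:])
        variants |= new
    return sorted(variants)

for v in expand(dictionary["probably"], rules):
    print(" ".join(v))
```

Even this tiny example shows the drawbacks named above: each added variant enlarges the dictionary and can collide with other words, and gradual phenomena (e.g. partial nasalization) have no natural phone-string representation.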
In this talk, I will present an alternative approach, in which speech is represented in terms of multiple streams of linguistic features rather than a single stream of phones. In this work, features are defined as corresponding loosely to the states of the speech articulators, such as the lips and tongue. By modeling (1) the asynchrony between articulators and (2) the possibility of articulators missing their target positions, many pronunciation changes that are difficult to account for with phone-based models become quite natural. Although it is well known that many phenomena can be attributed to the "semi-independent trajectories" of the articulators, previous models of pronunciation variation have typically not taken advantage of this. Another contribution of this work is a probabilistic framework that is expressive, learnable from data, and fits seamlessly into a statistical speech recognition approach.
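The asynchrony idea can be sketched as follows. In this toy example (the word, feature streams, and target values are all invented, not the talk's actual inventory), each feature stream has its own sequence of targets aligned to the phone string; letting one stream advance a step early yields a surface variant, here a nasalized vowel when the velum opens before the final nasal.

```python
# Invented phone targets for "seven" and two articulatory feature streams.
phones = ["s", "eh", "v", "ah", "n"]

streams = {
    "tongue": ["fric", "mid", "lab", "mid", "alv"],
    "velum":  ["closed", "closed", "closed", "closed", "open"],
}

def surface(lead):
    """Feature values per phone position when a stream leads by `lead` steps."""
    out = []
    for i in range(len(phones)):
        frame = {}
        for name, targets in streams.items():
            j = min(i + lead.get(name, 0), len(targets) - 1)
            frame[name] = targets[j]
        out.append(frame)
    return out

# Synchronous realization: the velum opens only on /n/.
sync = surface({})
# Velum leads by one target: /ah/ is produced with an open velum,
# i.e. a nasalized vowel -- a change that is awkward to write as a
# phone substitution but falls out of stream asynchrony.
early = surface({"velum": 1})
print(sync[3]["velum"], early[3]["velum"])  # closed open
```

A phone-based rule system would need a dedicated "nasalized /ah/" symbol and a rule to insert it; here the variant arises from a single, reusable mechanism.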
In particular, I will present a class of feature-based pronunciation models implemented using dynamic Bayesian networks (DBNs). DBNs are a probabilistic modeling technique that lets us naturally represent the factorization of the large state space of feature combinations into feature-specific factors, and that provides standard algorithms for inference and learning from data. I will describe a set of experiments testing these models "in isolation", using manually transcribed speech data, as well as on a realistic visual speech recognition task.
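The factorization that makes this tractable can be illustrated with a minimal sketch (not the talk's actual model; the feature values and probabilities are invented): with independent per-feature transition factors, the joint transition over all feature combinations is a product of small tables rather than one table over the full product state space.

```python
# Two toy feature streams with per-feature transition factors
# (each row sums to 1). The joint transition factorizes as
#   P(lips', tongue' | lips, tongue) = P(lips' | lips) * P(tongue' | tongue)
import itertools

LIPS = ["closed", "open"]
TONGUE = ["high", "low"]

p_lips = {
    "closed": {"closed": 0.7, "open": 0.3},
    "open":   {"closed": 0.2, "open": 0.8},
}
p_tongue = {
    "high": {"high": 0.6, "low": 0.4},
    "low":  {"high": 0.1, "low": 0.9},
}

def joint_transition(prev, nxt):
    """Joint transition probability over the factored state space."""
    (l0, t0), (l1, t1) = prev, nxt
    return p_lips[l0][l1] * p_tongue[t0][t1]

# The factored form still defines a proper distribution over all
# 2 x 2 = 4 joint successor states: the probabilities sum to 1.
state = ("closed", "high")
total = sum(joint_transition(state, s)
            for s in itertools.product(LIPS, TONGUE))
print(total)  # approximately 1.0
```

With F features of V values each, the full joint table has (V**F)**2 entries, while the factored form stores only F tables of V**2 entries; in a DBN, dependencies between features (such as limited asynchrony) are added back as explicit edges rather than by re-expanding the full table.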