Abstract
This paper describes a system that classifies Indonesian speech into voiced, unvoiced, and silence (VUS) segments.
In this system, speech sampled at 16 kHz is segmented into 10-millisecond frames with 20% overlap. Each frame is
then characterized by three time-domain features: frame energy (E), level crossing rate (LCR), and differential
level crossing rate (DLCR). Each frame is classified using an Evolving Feedforward Neural Network (EFNN), that is,
a Feedforward Neural Network (FNN) trained using evolutionary algorithms (EAs). Finally, the classified frames are
concatenated to produce the VUS classification of the utterance. The training data consist of combinations of
18 consonants and 7 vowels spoken by a single speaker, whereas the validation and test sets are built from
25 spoken words that represent all the consonant-vowel combinations. Computer simulation shows that the best
FNN architecture is 3-10-3 (3 inputs, 10 hidden units, and 3 output units) and that the appropriate training
set size is 150. This configuration gives a total accuracy of 0.7366, with accuracies of 0.6206, 0.6428, and
0.9626 for voiced, unvoiced, and silence, respectively. Because the accuracies for voiced and unvoiced frames
are very low, the overall VUS system performs poorly, even after a filtering procedure is applied.
Keywords: Indonesian speech, voiced-unvoiced-silence classification, evolving feedforward neural network
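
The framing and feature-extraction steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the choice to drop a trailing partial frame, the `level` threshold of 0.0, and the normalization of LCR by the number of adjacent sample pairs are all assumptions made for illustration. DLCR is omitted because the abstract does not define it.

```python
# Illustrative sketch only: names and defaults below are assumptions,
# not taken from the paper.

SAMPLE_RATE_HZ = 16_000   # sampling rate stated in the abstract
FRAME_MS = 10             # frame length stated in the abstract
OVERLAP = 0.20            # frame overlap stated in the abstract


def split_frames(signal, sample_rate=SAMPLE_RATE_HZ,
                 frame_ms=FRAME_MS, overlap=OVERLAP):
    """Split a sequence of samples into fixed-length overlapping frames.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    frame_len = sample_rate * frame_ms // 1000   # 160 samples at 16 kHz
    hop = round(frame_len * (1 - overlap))       # 128 samples for 20% overlap
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]


def frame_energy(frame):
    """Frame energy E: sum of squared sample values."""
    return sum(x * x for x in frame)


def level_crossing_rate(frame, level=0.0):
    """Fraction of adjacent sample pairs that cross `level`.

    `level` is a hypothetical parameter; with level=0.0 this reduces to
    the familiar zero crossing rate.
    """
    if len(frame) < 2:
        return 0.0
    above = [x > level for x in frame]
    crossings = sum(a != b for a, b in zip(above, above[1:]))
    return crossings / (len(frame) - 1)
```

With these settings each frame holds 160 samples and consecutive frames start 128 samples apart, so one second of speech yields 124 full frames. A per-frame feature list could then be built as `[(frame_energy(f), level_crossing_rate(f)) for f in split_frames(signal)]`, with the third feature added once its definition is known.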