Chapter 9: Speech recognition

Speech recognition and synthesis are probably the most iconic aspects of speech research. While speech compression (e.g. in mobile phones) is by far the most important speech technology in terms of daily use, it is automatic speech recognition (ASR) and speech synthesis, or text-to-speech (TTS), that capture the imagination of the public.

When experimenting with speech recognition, the diagram below, taken from the book, shows the typical process that we use:


9.2 Voice activity detection

A voice activity detector (VAD) is one of those background pieces of technology that are essential to the operation of real-life systems, yet are less commonly considered. It is the VAD that tells your mobile phone when you are talking (and hence when to consume precious battery power and mobile bandwidth to encode and transmit your speech). In quiet locations and with a strong voice, VAD is very easy, but it becomes much more difficult in noise - especially multi-speaker babble - and with quiet, speech-like sounds such as whispers.

The example VAD code given in the text is:

%the noisy speech is in array nspeech
%fs is the sample rate
L=length(nspeech);
frame=0.1; %frame size in seconds
Ws=floor(fs*frame); %frame length in samples
Nf=floor(L/Ws); %no. of frames
energy=zeros(1,Nf); %preallocate per-frame energy
%plot the noisy speech waveform
subplot(2,1,1)
plot([0:L-1]/fs,nspeech);axis tight
xlabel('Time,s'); ylabel('Amplitude');
%divide into frames, get energy
for n=1:Nf
  seg=nspeech(1+(n-1)*Ws:n*Ws);
  energy(n)=sum(seg.^2);
end
%plot the energy
subplot(2,1,2)
bar([1:Nf]*frame,energy,'y');
A=axis; A(2)=(Nf-1)*frame; axis(A)
xlabel('Time,s'); ylabel('Energy');
%find the energy range, set threshold 10% above the minimum
emax=max(energy);
emin=min(energy);
e10=emin+0.1*(emax-emin);
%draw the threshold on the graph
line([0 Nf-1]*frame,[e10 e10])
%plot the decision (frames > 10%)
hold on
plot([1:Nf]*frame,(energy>e10)*(emax),'ro')
hold off
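The listing assumes that a noisy speech vector nspeech and its sample rate fs already exist in the workspace. A minimal sketch of one way to prepare them, assuming a recording in a file called speech.wav (a placeholder filename) with white Gaussian noise added at 10dB SNR:

%load a speech recording (speech.wav is a placeholder filename)
[speech,fs]=audioread('speech.wav');
speech=speech(:,1); %keep the first channel if stereo
%add white Gaussian noise at a target SNR of 10dB
SNR=10;
Ps=mean(speech.^2); %mean signal power
noise=randn(size(speech));
noise=noise*sqrt(Ps/(10^(SNR/10))/mean(noise.^2)); %scale to SNR
nspeech=speech+noise;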

The result of this code will be something like the plot below (in this case the speech is from Winston Churchill - as explained in the book - and the noise is something rather more modern):
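Note that the raw frame-by-frame decision can flicker at speech onsets and offsets. One common refinement, not part of the book listing, is a hangover scheme that holds the decision active for a few frames after the energy drops back below the threshold. A minimal sketch, reusing the energy, e10 and Nf variables from the code above:

%raw per-frame decision
vad=energy>e10;
%hangover: hold the decision for 2 extra frames after speech
hang=2;
svad=vad;
for n=1:Nf
  if vad(n)
    svad(n:min(n+hang,Nf))=1;
  end
end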


9.4 Hidden Markov Models

We will follow the examples given in the book from page 313 onwards.

First, the setup phase:

Pi=[0.7, 0.1, 0.2]; %initial state probabilities
B=[0.1, 0.02, 0.6]; %observation probabilities for each state
A=[0.5 0.2 0.3
  0.15 0.6 0.25
  0.1 0.4 0.5]; %state transition matrix
N=length(Pi); %number of states

X=[0 0 0 0 1 0 0]; %observation sequence
T=length(X); %number of time steps
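These definitions drive the forward algorithm: the initial probabilities are alpha(1,s)=B(s)*Pi(s), and each subsequent step computes alpha(t+1,s)=B(s)*sum over j of A(s,j)*alpha(t,j), so that alpha(t,s) accumulates the probability of the observations up to time t with the model in state s.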

%forward algorithm: alpha(t,s) is the probability of the
%observations up to time t, with the model in state s at time t
alpha=zeros(T,N);
%initial state
alpha(1,1:N)=B(:).*Pi(:);
  

Then iterate through the observations:

for t=1:T-1
  for s=1:N %state index (avoid reusing Pi, which holds the priors)
    %the inner product A(s,:)*alpha(t,:)' sums over previous states
    alpha(t+1,s)=B(s)*(A(s,:)*alpha(t,:)');
  end
end
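To check one step by hand: alpha(2,1)=B(1)*(A(1,1)*alpha(1,1)+A(1,2)*alpha(1,2)+A(1,3)*alpha(1,3))=0.1*(0.5*0.07+0.2*0.002+0.3*0.12)=0.00714, which appears as 0.0071 in the second row of the matrix below.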

This gives an alpha matrix as follows:

>> alpha

alpha =

    0.0700    0.0020    0.1200
    0.0071    0.0008    0.0407
    0.0016    0.0002    0.0128
    0.0005    0.0001    0.0040
    0.0001    0.0000    0.0012
    0.0000    0.0000    0.0004
    0.0000    0.0000    0.0001
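Each row of alpha corresponds to one time step, so the likelihood of the whole observation sequence under this model is the sum of the final row - the standard termination step of the forward algorithm, which the listing above stops short of. A minimal sketch:

%terminate: total likelihood of the observation sequence
Pseq=sum(alpha(T,:));
%in practice the log-likelihood is used, since alpha values
%underflow towards zero for long observation sequences
logPseq=log(Pseq);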

Useful links: