Most speech -- at least for the near future -- is generated by the human voice production system, comprising lungs, diaphragm, glottis, vocal tract, tongue, teeth, lips, nasal cavity and so on. Plus the brain. The brain is important at the motor level in providing the fine and fast control of the vocal actuators to form the phonemes of speech. At a higher level, the brain encodes meaning into those strings of phonemes.
Similarly, most speech -- at least for the near future -- is produced for the understanding of human ears, and the comprehension of the human brain, and thus we will look much more closely at the human hearing system in Chapter 4.
If readers wonder why I used the phrase "at least for the near future" twice in the paragraphs above, this is because of the ongoing growth in text-to-speech systems, where computers generate speech for humans to hear, and in ASR (automatic speech recognition) systems, by which computers listen to our speech. With the aid of these two rapidly improving technologies, we can envision a future where most speech communication is computer-human rather than human-human.
As described in the book, if we are to understand the human voice, the best place to start is by understanding how it is produced. The figure below, reproduced from the book, shows the main articulators located inside a cut-away diagram of the head:
Shouting, speaking and whispering are the three main types of everyday speech. Note: turn down the volume before you play the sample below!
Let's take things further by doing a bit of simple plotting and visualisation of this recording. Assume you have downloaded the sample to your MATLAB directory (Ctrl-click or Alt-click the sample player to download it).
[s,fs]=audioread('shout_speak_whisper.wav');
plot([1:length(s)]/fs,s)
axis tight % to make it fill the plot window
xlabel('Time, s')
ylabel('Amplitude')
Here is the plotted output:
Let's try a spectrogram instead:
spectrogram(s,512,374,512,fs,'yaxis')
colormap jet
colorbar off
Remember you need to use the US spelling of the word colour throughout the MATLAB command set. It does not take long to get used to this, but it can be confusing for new users. The command "colorbar off" turns off the default stripe showing how the colour scheme of the spectrogram maps to energy or power. In this case, because the recording amplitudes are relative (and possibly not even linear), the stripe is meaningless and is better removed. There are many types of colour map possible in MATLAB and you can even define your own, but jet is one of the common choices for spectrograms, and 'gray' (again note the US spelling) is the one to use for greyscale plots that will be reproduced in black and white.
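As a small illustration of defining your own colour map, here is a minimal sketch. A colour map is simply an N-by-3 matrix of red, green and blue values between 0 and 1. The name mymap, the 64 levels and the blue-to-yellow ramp are arbitrary choices made for this example, and peaks() is used only as convenient demo data.

```matlab
% Sketch: build a custom colour map and try it on some demo data.
% A colour map is an N-by-3 matrix of RGB rows, each value in [0,1].
n = 64;                            % number of colour levels (arbitrary choice)
ramp = linspace(0,1,n)';           % column vector rising from 0 to 1
mymap = [ramp, ramp.^2, 1-ramp];   % hypothetical map: blue at the bottom, yellow at the top
graymap = gray(n);                 % built-in greyscale map (note the US spelling)

figure
imagesc(peaks(50))                 % peaks() is just demo data, not speech
colormap(mymap)                    % use colormap(graymap) to preview a greyscale print
colorbar
```

The same colormap(mymap) call can be issued after a spectrogram command in place of "colormap jet".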
It should be clear from both the waveform plot and the spectrogram which parts of the recording correspond to shouting (more power), speaking (medium power) and whispering (lower power, and missing the fundamental pitch). The fundamental pitch source, when present, is always visible as a high energy region along the bottom of the plot. In the whisper part of the recording the formants are all perfectly visible, but the pitch is clearly lacking, which is very characteristic of whispers.
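The power differences can also be measured rather than judged by eye. The following is a minimal sketch that computes frame-by-frame power in dB relative to the loudest frame. So that it runs without the recording, it uses a synthetic tone at three amplitudes as a stand-in for the shout, speak and whisper sections; the sample rate, frame length and amplitudes are all assumptions made for illustration.

```matlab
% Sketch: frame-by-frame power of a signal, in dB relative to the loudest frame.
% A synthetic three-level tone stands in for the shout/speak/whisper recording.
fs = 8000;                             % assumed sample rate, Hz
t = (0:fs-1)'/fs;                      % one second of time samples
tone = sin(2*pi*200*t);                % 200 Hz test tone
x = [1.0*tone; 0.3*tone; 0.05*tone];   % loud, medium and quiet sections

w = 256;                               % frame length in samples (assumed)
nf = floor(length(x)/w);               % number of whole frames
frames = reshape(x(1:nf*w), w, nf);    % one frame per column, no overlap
p = mean(frames.^2, 1);                % mean-square power of each frame
pdb = 10*log10(p/max(p));              % dB relative to the loudest frame

tf = ((0:nf-1)+0.5)*w/fs;              % frame-centre times, s
plot(tf, pdb)
xlabel('Time, s')
ylabel('Relative power, dB')
```

To try it on the real recording, replace x with s from the audioread() call (and fs with the file's own sample rate). Bear in mind that the recording amplitudes are only relative, so the dB figures are useful for comparing sections with each other rather than as absolute levels.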
Let's combine some of these with a pitch analysis. It's not possible to visualise this over such a long recording, so we begin by selecting just the "shout" part. Next we slice the recording into a sequence of overlapping segments and analyse these individually:
s=s(5000:100000); %cut down the array
w=2048; %window size
d=512; %step between frames (the overlap is w-d)
nf=floor((length(s)-w)/d); %no. of frames
pf=[]; pa=[]; %pitch frequency and pitch multiplier for each frame
for l=1:nf
    lhs=max(1,1+(((l-1)*d)-w/2));
    rhs=min(length(s),(((l-1)*d)+w/2));
    seg=s(lhs:rhs);
    [B,M]=ltp(seg); %ltp() can be found in the "listings" section of the website
    pf=[pf;M];
    pa=[pa;B];
end
subplot(2,1,1)
spectrogram(s,w,w-d,w,fs,'yaxis')
colorbar off
subplot(2,1,2)
x=[1:nf]*d/fs;
area(x,abs(pa)*100/max(pa));axis tight
hold on
plot(x,pf,'r-+')
hold off
legend('pitch multiplier','pitch frequency')
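The ltp() function is not reproduced on this page. For readers who want to experiment before fetching it, here is a minimal sketch of a generic normalised-autocorrelation pitch estimate for a single frame. This is not the ltp() routine and may not match its output: the 80 Hz to 400 Hz search range, the sample rate and the synthetic test frame are all assumptions made for illustration.

```matlab
% Sketch: autocorrelation-based pitch estimate for one analysis frame.
% This is a generic stand-in, NOT the ltp() routine from the listings.
fs = 8000;                     % assumed sample rate, Hz
f0 = 125;                      % pitch of the synthetic test frame, Hz
n = (0:2047)';                 % a 2048-sample frame
seg = sin(2*pi*f0*n/fs) + 0.5*sin(2*pi*2*f0*n/fs);   % fundamental plus one harmonic

minlag = round(fs/400);        % shortest period searched (400 Hz, assumed)
maxlag = round(fs/80);         % longest period searched (80 Hz, assumed)
r = zeros(maxlag,1);           % lags below minlag are left at zero
for lag = minlag:maxlag
    a = seg(1:end-lag);        % start of the frame
    b = seg(1+lag:end);        % the same frame delayed by lag samples
    r(lag) = (a'*b)/sqrt((a'*a)*(b'*b));   % normalised correlation at this lag
end
[peak, period] = max(r);       % best-matching lag is taken as the pitch period
pitch_hz = fs/period;          % convert period in samples to Hz
fprintf('Estimated pitch %.1f Hz (correlation %.2f)\n', pitch_hz, peak);
```

A simple estimator like this is prone to picking a multiple of the true period if the search range is too wide, which is why the range here is kept narrow. The plots discussed next come from the ltp()-based listing, not from this sketch.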
This is what we end up with:
Chapter 3 also discusses various objective measures of speech quality and intelligibility. Short listings are given for many of those, but some are best provided as part of the excellent COLEA package by the late Philip Loizou. Although this package is old (and unmaintained), it is still a valuable and useful resource. You don't need to download everything (in particular, we don't make use of the visualisation aspects, just the distance and quality aspects), but it is all available from MATLAB Central: http://uk.mathworks.com/matlabcentral/fileexchange/108-colea