Sinewave speech

Introduction

Sinewave speech (SWS) is one or several sinewaves that are modulated in frequency (and possibly in amplitude too) to sound speech-like. A simple example utilising three separate sinewaves will help to illustrate what this sounds like in practice:

Listen to this a few times. Can you understand what is being said in the sentence?
If not, try listening to the original speech that it was generated from:

Now listen to the sinewave sample again. Can you understand it? The text is from TIMIT: 'she had your dark suit in greasy washwater all year'.

Most people cannot understand sinewave speech (SWS) the first time they hear it, but after listening to the corresponding natural speech and then returning to the SWS version, they can usually understand it almost perfectly. One interpretation is that the brain did not initially recognise the SWS recording as speech, so it was 'sent' to a non-speech part of the brain for processing. By listening to the corresponding natural speech, you train the auditory part of your brain to recognise this kind of recording as a type of speech, and to route it to the speech-processing centres of the brain. With enough practice, your brain will always interpret such recordings as speech-like, and you can become very competent at understanding sinewave speech.

A further interpretation is that the sinewaves we are listening to trace the formant frequencies directly - which is one reason we know that formants are very important to speech intelligibility! To form sinewave speech we thus just need to extract formant tracks and then synthesise sinewaves at those frequencies.
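As a first taste of the synthesis step, here is a minimal standalone sketch (not part of the code developed below) that generates a single sinewave whose frequency glides over a formant-like range, by integrating instantaneous frequency into phase:

```matlab
% Minimal sketch: a single frequency-modulated sinewave.
% A per-sample frequency track (in Hz) is integrated into a phase with
% cumsum(), then passed through sin() - the same principle is used for
% each formant track later on.
fs = 8000;                         % sample rate, Hz
f = linspace(300, 800, fs);        % 1 s frequency track, gliding 300 -> 800 Hz
y = sin(2*pi*cumsum(f)/fs);        % cumulative sum integrates frequency into phase
% soundsc(y, fs);                  % uncomment to listen to the glide
```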

Interestingly, readers from much of the Western world and Commonwealth may have grown up being exposed to a kind of non-technical sinewave speech (from swanee whistles) if they watched the BBC Clangers as a child (you can view the original first episode from November 1969 here).

There is quite a lot of material on the internet relating to sinewave speech. Much of the material presented here has been gathered from others, most notably the following:
  1. Sussex University SWS demonstrations, generated using the phonetics software Praat.
  2. In my opinion, by far the best practical resources available for SWS experimentation are from Dan Ellis of Columbia University. His MATLAB code for SWS creation and analysis is excellent, and his explanations can't be bettered.
  3. A set of different SWS demos from Tim Prebble, NZ.
  4. Further explanation from Al Bregman, McGill University (actually an excellent website all round for various auditory phenomena).
  5. Another good resource is from Matt Davis of the University of Cambridge Cognition and Brain Sciences Unit.

Acknowledgement

The code presented below combines parts from many sources, but most notably from the MATLAB code produced by Professor Dan Ellis (see link 2 above). I therefore do not make any claim to originality, and strongly urge readers who find this useful to cite Professor Ellis of Columbia rather than me:

@misc{Ellis2003_sws,
   Author = {Daniel P. W. Ellis},
   Year = {2003},
   Title = {Sinewave and Sinusoid+Noise Analysis/Synthesis in {M}atlab},
   Url = {http://www.ee.columbia.edu/~dpwe/resources/matlab/sinemodel/},
   Note = {online web resource}}
   
D. P. W. Ellis (2003) "Sinewave and Sinusoid+Noise Analysis/Synthesis in Matlab",
web resource, available: http://www.ee.columbia.edu/~dpwe/resources/matlab/sinemodel	

Making it work in MATLAB

First we will load some speech from a file, and resample it to an 8 kHz sample rate:

[s,ofs] = audioread('some_speech.wav'); % ofs is the file's original sample rate
fs = 8000;
speech = resample(s, fs, ofs);          % resample to 8 kHz

Check that you can hear the speech loud and clear:

soundsc(speech,fs);

Now we will make direct use of some of Dan Ellis's code to extract formant tracks, and later we will synthesise sinewaves at those frequencies. First of all let's include some helper functions that Dan has written (these are unchanged from the code downloadable from his webpage).

The first function extracts LPC representations of a speech recording, from which we can extract LPC peaks:

function [a,g,e] = lpcfit(x,p,h,w,ov)
% [a,g,e] = lpcfit(x,p,h,w,ov)  Fit LPC to short-time segments
%    x is a stretch of signal.  Using w point (2*h) windows every 
%    h points (128), fit order p LPC models.  Return the successive 
%    all-pole coefficients as rows of a, the per-frame gains in g 
%    and the residual excitation in e.
%    ov nonzero selects overlap-add of window-length
%    residuals, otherwise successive hop-sized residuals are concatenated
%    for independent near-perfect reconstruction with lpcsynth.
%    (default is 1)
% 2001-02-25 dpwe@ee.columbia.edu $Header: $

if nargin < 2
  p = 12;
end
if nargin < 3
  h = 128;
end
if nargin < 4
  w = 2*h;
end
if nargin < 5
  ov = 1; 
end

if (size(x,2) == 1)
  x = x';  % Convert X from column to row
end

npts = length(x);

nhops = floor(npts/h);

% Pad x with zeros so that we can extract complete w-length windows
% from it
%x = [zeros(1,(w-h)/2),x,zeros(1,(w-h/2))]; %Jingjie comment

a = zeros(nhops, p+1);
g = zeros(nhops, 1);
if ov == 1 %jingjie changed
  e = zeros(1, npts);
else
  e = zeros(1, (nhops-1)*h+w);
end

% Pre-emphasis
pre = [1 -0.9];
x = filter(pre,1,x);

for hop = 1:nhops-1

  % Extract segment of signal
  xx = x((hop - 1)*h + [1:w]);
  % Apply hanning window
  wxx = xx .* hanning(w)';
  % Form autocorrelation (calculates *way* too many points)
  rxx = xcorr(wxx);
  % extract just the points we need (middle p+1 points)
  rxx = rxx(w+[0:p]);
  % Setup the normal equations
  R = toeplitz(rxx(1:p));
  % Solve for a with the backslash operator (more efficient and better
  % conditioned than forming inv(R) explicitly)
  an = R\rxx(2:(p+1))';
  % Calculate residual by filtering windowed xx
  aa = [1 -an'];
  if ov == 0
    rs = filter(aa, 1, xx((w-h)/2 + [1:h]));
  else
    rs = filter(aa,1,wxx);
  end
  G = sqrt(mean(rs.^2));
  % Save filter, gain and residual
  a(hop,:) = aa;
  g(hop) = G;
  if ov == 0
    e((hop - 1)*h + [1:h]) = rs'/G;
  else
    e((hop - 1)*h + [1:w]) =  e((hop - 1)*h + [1:w]) + rs/G;
  end
end

% Throw away first (win-hop)/2 pts if in overlap mode
% for proper synchronization of resynth
if ov ~= 1 %jingjie changed
  e = e((1+((w-h)/2)):end);
end

The second function takes the stack of LPC frames and extracts the resonances, i.e. the formant frequencies. It outputs arrays of formant frequencies and magnitudes:

function [f,m] = lpca2frq(a,g)
% [f,m] = lpca2frq(a,g)  Convert LPC analysis frames into resonant frequencies
%    Each row of a defines an LPC analysis e.g. from lpcfit.  Convert
%    this into poles, and return the frequencies in rows of f.
% 2001-02-25 dpwe@ee.columbia.edu

[nhops,p] = size(a);

f = zeros(nhops, floor((p - 1)/2));
m = f;

for hop = 1:nhops
  aa = a(hop,:);
  G = g(hop);
  an = aa(2:p);
  rts = roots([1 an]);
  frqs = angle(rts');
  mags = G./(1 - abs(rts'));
  [dummy, ix] = sort(frqs);
  keep = frqs(ix) > 0;
  ix = ix(keep);
  f(hop,1:length(ix)) = frqs(ix);
  m(hop,1:length(ix)) = mags(ix);
end	
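
To see what lpca2frq is doing, consider this standalone sanity check (a sketch, not part of the pipeline): place a single pole pair at a known resonance and recover its frequency from the root angle, exactly as the function does:

```matlab
% Sanity check: a resonance at 500 Hz with fs = 8000 Hz corresponds to a
% pole pair at angles +/- 2*pi*500/8000 on the z-plane unit circle.
fs = 8000; f0 = 500; r = 0.95;       % r < 1 keeps the pole stable
pole = r*exp(1i*2*pi*f0/fs);
a = real(poly([pole, conj(pole)]));  % LPC-style polynomial [1 a1 a2]
rts = roots(a);                      % recover the poles
frq = angle(rts(imag(rts) > 0));     % keep the positive-frequency root
f_hz = frq*fs/(2*pi);                % converts angle back to Hz (here, 500)
```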

We now make use of these two functions in swsmodel.m, which together model a given recording of speech as an array of formant tracks:

function [F,M] = swsmodel(D,R,H)
% [F,M] = swsmodel(D,R,H)  Sine wave speech analysis
%       D is a speech example sampled at R samples per second.
%       Return a sinusoid model of up to 4 components with each sinusoid
%       defined by a row of F (frequencies in Hz) and M (linear magnitude). 
%       Each column of F and M corresponds to H samples at R.
%       Rows of F are sorted with lowest frequency first; 
%       sinusoids cannot cross.
%
%       Relies on lpcfit.m and lpca2frq.m to form LPC model and convert it 
%       into frequencies.
% 2001-03-12 dpwe@ee.columbia.edu $Header: $

if nargin < 2
  R = 8000;
end
if nargin < 3
  H = 128;
end

% Target sampling rate
MyR = 8000;

% Resample to 8 kHz, so LPC only picks main formants
if R ~= MyR
  D = resample(D, round(MyR/1000) ,round(R/1000));
end

% Step size in units of my sampling rate (force to be even integer)
HH = 2*round(H/R * MyR / 2);

% Form 8th-order LPC model (3 or 4 pole pairs)
[lpca,g] = lpcfit(D,8,HH,2*HH);

% Convert poles to sorted freqs and magnitudes
% If only 3 nonzero freqs are found, 4th row will have mag/frq zero
[fa, ma] = lpca2frq(lpca,g);

% Convert frqs into Hz
F = fa'*MyR/(2*pi);

M = ma';	

So now let's make use of this with the speech recording we loaded earlier, and start to examine what the code has done:

[F, M] = swsmodel(speech, fs);	

% Let's look at the formant tracks:
plot(F')

% If that is a bit confusing, we can look at the first 3 formants only:
plot(F(1:3,:)')

At this point we could look at all the signals by plotting everything together;

subplot(4,1,1)
plot(speech); axis tight

subplot(4,1,2)
spectrogram(speech, 256, 128, 256, fs, 'yaxis')

subplot(4,1,3)
plot(F(1:3,:)')

subplot(4,1,4)
plot(M(1:3,:)')

It should result in a plot with four time-aligned panels: the waveform, its spectrogram, and the formant frequency and magnitude tracks.

In general there should be some correspondence in time between the features visible in the speech waveform and spectrogram, and in the formant frequency (F) and magnitude (M) arrays.

Now we are going to reconstruct sound from those arrays. The most efficient way to accomplish this is to use the synthtrax.m code on Dan Ellis's website. However, for the sake of keeping things simple and very clear, we will make use of freqgen.m to construct sinewaves whose frequencies vary to match the formant tracks.
Please refer to the freqgen() code in the book (and the explanation of how it works). The freqgen.m function can be downloaded from here.
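
If you do not have freqgen.m to hand, the following stand-in captures its assumed behaviour: taking a per-sample frequency track in Hz plus a sample rate, and returning a unit-amplitude row-vector sinewave that follows the track. This interface is my assumption, inferred from how freqgen is called below; it is not the book's code:

```matlab
function y = freqgen(f, fs)
% Stand-in for the book's freqgen(): synthesise a unit-amplitude sinewave
% whose instantaneous frequency follows the per-sample track f (in Hz).
% Returned as a row vector to match how it is used below.
% NOTE: an assumed interface, not the book's implementation.
y = sin(2*pi*cumsum(f(:))/fs)';
end
```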

SF=round(length(speech)/length(F));

%Construct sample-by-sample arrays that vary in frequency and magnitude to match the first three formants
%Warning - this code is VERY slow:
for n=1:length(speech)
 idx=min(length(F),1+floor(n/SF));
 formmod(n,1)=F(1, idx);
 formmod(n,2)=F(2, idx);
 formmod(n,3)=F(3, idx);
 formstrength(n,1)=M(1, idx);
 formstrength(n,2)=M(2, idx);
 formstrength(n,3)=M(3, idx);
end
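
As an aside, the expansion loop above can be vectorised; this sketch builds the same formmod and formstrength arrays in one step (same indexing, much faster):

```matlab
% Vectorised equivalent of the sample-by-sample loop above.
n = (1:length(speech))';
idx = min(length(F), 1 + floor(n/SF));  % map each output sample to a frame
formmod = F(1:3, idx)';                 % one row per sample, one column per formant
formstrength = M(1:3, idx)';
```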

%Now use these to construct audio arrays for each formant
%Warning - this is also VERY slow:
form1=freqgen(formmod(:,1),fs);
form2=freqgen(formmod(:,2),fs);
form3=freqgen(formmod(:,3),fs);

Now that we have three formant frequency sounds, we can listen to these separately to reacquaint ourselves with the Clangers speech, before reconstructing a final output.

soundsc(form1,fs) %formant 1
soundsc(form1.*formstrength(:,1)',fs) %formant 1 with magnitude variation

%Combined output
rspeech=form1.*formstrength(:,1)'+form2.*formstrength(:,2)'+form3.*formstrength(:,3)';
soundsc(rspeech,fs)

Given that the reconstruction processing is very slow, the reader is encouraged to try this code only with short samples a few seconds in length. Try plotting the reconstructed spectrograms and comparing them to the original speech spectrograms (and waveforms).

As an aside, the reconstruction using Dan Ellis's method is performed with the synthtrax.m function from his website, as follows:

rspeech2=synthtrax(F, M, fs, 256, 128);
soundsc(rspeech2, fs);

We will leave this topic with one final example.
This one is a very famous quote from world history. Can you recognise what is being said?

Can't understand it? Try again... (and try the example above again). If you really can't follow this, try listening to the original recording below, then retry the SWS example.