Machine hearing

Introduction

Machine hearing is the name given to the ability of computers to detect and recognise sounds. The term was popularised by a 2010 IEEE Signal Processing Magazine article by Richard F. Lyon of Google Research entitled Machine hearing: an emerging field, which surveyed the state of research at the time and related this relatively new field to the more established research field of machine vision.

Since then, many researchers have worked in the machine hearing field, leading to a number of alternative approaches. Prominent research includes

  1. ways of computationally modelling the human hearing process, giving rise to the stabilised auditory image (SAI), a two-dimensional, brain-like representation of sound,
  2. very large-scale sound recognition, which coupled the SAI to an image classifier, and
  3. application of machine learning to sound classification in noise.

The approach presented in this section is inspired by the work of Richard Lyon discussed above. It uses the evaluation methods, and a similar image representation, to those of Jonathan Dennis, but the actual Matlab code included here was written by Zhang Haomin, a research student in the National Engineering Laboratory of Speech and Language Information Processing at the University of Science and Technology of China.

Specifically, sounds are converted into high-resolution spectrograms (using a 2048-point analysis window advanced by 16 samples per frame, at a sample rate of 16 kHz). These are then down-sampled into a sequence of smaller 24x30 windows, which are de-noised before being stacked to form feature vectors ready for classification, as described here and in this paper.
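To make this concrete before we meet the full training loop, here is a minimal sketch of the feature extraction for a single recording, mirroring the loops below (the file name example.wav is a placeholder, and the per-window energy normalisation used later is omitted for brevity):

%-----MATLAB code------
% Sketch: spectrogram image features for one file
[wave,fs]=audioread('example.wav');  %placeholder name; 16kHz mono audio
winlen=2048; overlap=2048-16;        %2048-point window, 16-sample hop
data0=abs(spectrogram(wave,winlen,overlap,winlen,16000));
ny=24; nx=30;
nchannel=size(data0,1);              %1025 frequency bins
data=zeros(ny,size(data0,2));
for y=1:ny                           %average bins into 24 frequency bands
   data(y,:)=mean(data0(ceil(nchannel/ny*(y-1))+1:ceil(nchannel/ny*y),:));
end
for y=1:ny                           %de-noise: subtract each band's minimum
   data(y,:)=data(y,:)-min(data(y,:));
end
nFr=floor(size(data,2)/nx*2)-1;      %no. of 50%-overlapped 24x30 windows
features=zeros(nFr,nx*ny);
for frame=1:nFr                      %stack each window as a 720-element row
   features(frame,:)=reshape(data(:,(frame-1)*nx/2+1:(frame+1)*nx/2),1,nx*ny);
end
%-----------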

Setup - obtain the required data and software

The sounds and noises used for training and testing are chosen to exactly replicate the experimental conditions of Dennis (by using 'standard' training and test conditions, it is easy to compare the performance of our system with that of others). The database of sounds to recognise is the Real World Computing Partnership (RWCP) Sound Scene Database (SSD) in Real Acoustical Environments (RWCP-SSD can be obtained for free by non-commercial academic researchers by mail order from here). The database contains many short sound recordings, one sound per audio file, arranged by class in subdirectories. Following the methodology of Dennis, we need to select 50 classes of sounds from RWCP, taking 80 files from each class: 50 for training and the remaining 30 for testing.

We place these files into the clean sound database, in a directory named data_wav. This directory contains 50 subdirectories (one for each class, using the class name as label, e.g. 'ring'), and each subdirectory contains 80 audio files (which have numerical names).
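For illustration, the resulting layout is as follows (the individual file names are placeholders, since the RWCP files simply have numerical names):

   data_wav/
      ring/
         000.wav  001.wav  ...  079.wav
      ... (49 further class subdirectories)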

From this we will now create noisy sound databases, where each sound has noise added at a predetermined signal-to-noise ratio (SNR). We could use additive white Gaussian noise (AWGN), but Dennis made use of a noise database called NOISEX-92 (a database of various standard environmental noise recordings which can be obtained online - but this could be replaced by other background noise recordings if NOISEX-92 is unavailable).
In particular, we make use of four noise files for evaluation: "Speech Babble", "Destroyer Control Room", "Factory Floor 1" and "Jet Cockpit 1".

We then create noise-corrupted versions of the database in further subdirectories (identified by SNR level). To corrupt the sounds, for each sound file a random choice is made of the type of corrupting noise (from the 4 choices of noise), and a random noise starting point is selected (i.e. so the mix does not always start from the beginning of the noise file). In practice, noise is added at 0, 10 and 20 dB SNR.

Thus, for example, directory data_wav_mix0 contains 50 class subdirectories of 80 sound files each of which have random NOISEX-92 noise added at 0dB SNR.
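The noise mixing itself is not part of the code that follows, so here is a minimal sketch of how one file might be corrupted at a chosen SNR. The noise file names and paths are assumptions (and the NOISEX-92 recordings would first need resampling to 16 kHz):

%-----MATLAB code------
% Sketch: corrupt one clean file with random noise at a chosen SNR
snr_db=0;                                   %target SNR in dB
noises={'babble.wav','destroyer.wav','factory1.wav','cockpit1.wav'}; %assumed names
[clean,fs]=audioread('data_wav/ring/000.wav');    %placeholder file name
noise=audioread(noises{ceil(rand()*4)});          %random choice of noise type
start=ceil(rand()*(length(noise)-length(clean))); %random starting point
noise=noise(start:start+length(clean)-1);         %noise must outlast the sound
%scale the noise so that the mixture has the requested SNR
gain=sqrt(sum(clean.^2)/(sum(noise.^2)*10^(snr_db/10)));
mix=clean+gain*noise;      %note: may need limiting to [-1,1] before writing
audiowrite('data_wav_mix0/ring/000.wav',mix,fs);
%-----------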

Each of these clean and noise-corrupted sound directories is to be used for subsequent training and testing.
While it is common to have separate directories for testing and training material, we will simply use the 4th to 8th file of every group of eight for training and the 1st to 3rd for testing (i.e. when counting from zero, files 3...7 and 11...15 are used for training, and files 0...2 and 8...10 for testing), using the selection rule shown below.
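In code, this split is a one-line test on the (1-indexed) file number, and the same test appears in the training and testing loops later:

%-----MATLAB code------
istrain=@(file) mod(file-1,8)>2; %file counts from 1 within each class
istrain(4)  %returns 1: a training file
istrain(1)  %returns 0: a test file
%-----------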

As mentioned, while there are a number of ready-made deep neural network software packages available, we will make use of the DeepLearnToolbox. Download the software (using the link given here), unpack the toolbox and then add this to your Matlab path as shown in the DeepLearnToolbox documentation (don't worry, this is easy to do).
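Adding the toolbox to the path amounts to something like the following (assuming it was unpacked into a directory named DeepLearnToolbox in the current working directory):

%-----MATLAB code------
addpath(genpath('DeepLearnToolbox')); %add the toolbox and all subdirectories
%-----------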

Note: the actual code will require a reasonably fast computer, at least 4GB of memory and at least 2GB of hard disc space. Training and testing for one condition (i.e. one level of noise) may require more than 1 hour of processing time.

The instructions that follow are mainly copied from here.

PASS 1: training a DNN

First we set up the variables used in the system:
%-----MATLAB code------
% Set up initial variables
clear all;
data_dir='data_wav/';     %where clean sounds are stored
directory=dir(data_dir);  %class subdirectories (plus '.' and '..')
nclass=50;                %no. of classes
nfile=80;                 %no. of files per class
ny=24;                    %frequency resolution of feature vector
nx=30;                    %time resolution of feature vector
winlen=2048;              %spectrogram window length
overlap=2048-16;          %spectrogram overlap (16-sample frame advance)
ntrain=0;                 %running count of training feature vectors
%-----------

The next step is to run through all sounds in the database, form feature vectors, and load these into memory:

%-----MATLAB code------
for class=1:nclass
   sub_d=dir([data_dir,directory(class+2).name]); %+2 skips '.' and '..'
   for file=1:nfile
      if mod(file-1,8)>2 %select files for training
         [wave,fs]=audioread([data_dir,directory(class+2).name,'/',sub_d(file+2).name]);
         data0=abs(spectrogram(wave,winlen,overlap,winlen,16000));
         clear data;
         nchannel=size(data0,1);
         for y=1:ny %average the spectrogram bins into ny frequency bands
            data(y,:)=mean(data0(ceil(nchannel/ny*(y-1))+1:ceil(nchannel/ny*y),:));
         end
         for y=1:ny %de-noise by subtracting the minimum of each band
            data(y,:)=data(y,:)-min(data(y,:));
         end

         %divide into nx-frame windows with 50% overlap
         nFrames(80*(class-1)+file)=floor(size(data,2)/nx*2)-1;
         for frame=1:nFrames(80*(class-1)+file)
            ntrain=ntrain+1;
            train_data(ntrain,1:nx*ny)=reshape(data(:,(frame-1)*nx/2+1:(frame+1)*nx/2),1,nx*ny);
            energy=sum(train_data(ntrain,1:nx*ny));
            if energy~=0 %normalise each window to unit energy
               train_data(ntrain,1:nx*ny)=train_data(ntrain,1:nx*ny)/energy;
            end
            train_data(ntrain,nx*ny+1)=energy; %store energy as an extra feature
            train_label(ntrain,:)=class;
         end

         %normalise the energy feature across the frames of this file
         idx=ntrain-nFrames(80*(class-1)+file)+1:ntrain;
         train_data(idx,end)=train_data(idx,end)/sum(train_data(idx,end));
      end
   end
   fprintf('Finished %d class training files\n',class);
end
%-----------

The array train_data now contains all of the training data feature vectors, created as per the spectrogram image feature (SIF) method described above: each row holds one 24x30=720-element window plus one appended energy element (721 columns in total).
Before continuing further, we must condition the data to ensure it is scaled appropriately, and free up some memory by clearing unwanted arrays:

%-----MATLAB code------
%scale the energy column to the same range as the spectrogram features
train_data(:,end)=train_data(:,end)/max(train_data(:,end))*max(max(train_data(:,1:end-1)));
%normalise the data to the range [0,1]
mi=min(min(train_data));
train_x=train_data-mi;
ma=max(max(train_x));
train_x=train_x/ma;
clear train_data;
%convert class labels to one-hot output vectors
train_y=zeros(length(train_label),50);
for i=1:length(train_label)
   train_y(i,train_label(i))=1;
end
clear train_label;
%-----------

The next step is to set up the neural network parameters, using the settings recommended for the DeepLearnToolbox, with 210 units in each hidden layer:

%-----MATLAB code------
nnsize=210;     %no. of units per hidden layer
dropout=0.10;
nt=size(train_x,1);
rand('state',0) %seed random number generator for repeatability
%pad the training set with randomly duplicated examples so that its
%size is an exact multiple of the batch size of 100, as nntrain requires
for i=1:(ceil(nt/100)*100-nt)
   np=ceil(rand()*nt);
   train_x=[train_x;train_x(np,:)];
   train_y=[train_y;train_y(np,:)];
end
%-----------

Now we start to create and stack RBM layers - as many as are required for the current task - to create a deep structure (this example is not particularly deep):

%-----MATLAB code------
%train a single RBM layer of nnsize (210) hidden units
rand('state',0)
dbn.sizes = [nnsize];
opts.numepochs = 1;
opts.batchsize = 100;
opts.momentum = 0;
opts.alpha = 1;
dbn = dbnsetup(dbn, train_x, opts);
dbn = dbntrain(dbn, train_x, opts);

%stack a second RBM layer to form a DBN
dbn.sizes = [nnsize nnsize];
opts.numepochs = 1;
opts.batchsize = 100;
opts.momentum = 0;
opts.alpha = 1;
dbn = dbnsetup(dbn, train_x, opts);
dbn = dbntrain(dbn, train_x, opts);

%unfold the DBN into a neural network with 50 outputs (one per class)
nn = dbnunfoldtonn(dbn, 50);
%use the sigmoid activation function
nn.activation_function = 'sigm';
%-----------

Now, with this newly-initialised structure, we can treat the network as a standard NN. It is ready for fine-tuning using back-propagation:

%-----MATLAB code------
%fine-tune the neural network with back-propagation
opts.numepochs = 1;
opts.batchsize = 100;
nn.dropoutFraction = dropout;
nn.learningRate = 10;
for i=1:1000
   fprintf('Epoch=%d\n',i);
   nn = nntrain(nn, train_x, train_y, opts);
   %reduce the learning rate as training continues
   if i==100
      nn.learningRate = 5;
   end
   if i==400
      nn.learningRate = 2;
   end
   if i==800
      nn.learningRate = 1;
   end
end
%-----------

The outcome of this process is a fairly large structure in Matlab's memory called nn, which defines the DNN architecture as well as containing all weights and connection definitions.
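Since training can take over an hour per condition, it is worth saving the trained network to disk at this point (the file name here is just an example):

%-----MATLAB code------
save('dnn_trained.mat','nn'); %restore later with: load('dnn_trained.mat')
%-----------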

PASS 2: using the DNN for testing

The newly trained DNN (called nn) is now to be used for classification. Again, we first set up the system similarly to the way we did it before.

%-----MATLAB code------
clear test_x; 
clear test_y; 
data_dir='data_wav/'; 
noise_dir='data_wav_mix0/'; %the 0dB SNR noise mixture
%change to other directories to test different SNRs
directory=dir(data_dir); 
nclass=50; 
nfile=80; 
ny=24; 
nx=30; 
winlen=2048; 
overlap=2048-16;
%-----------

And again, read in the data for testing in the same way we did for training previously:

%-----MATLAB code------
ntest=0;
rand('state',0)
for class=1:nclass
   sub_d=dir([data_dir,directory(class+2).name]); %+2 skips '.' and '..'
   for file=1:nfile
      if mod(file-1,8)<3 %select the specific files used for testing
         %file names are listed from the clean directory, but the audio
         %itself is read from the noise-corrupted directory
         [wave,fs]=audioread([noise_dir,directory(class+2).name,'/',sub_d(file+2).name]);
         data0=abs(spectrogram(wave,winlen,overlap,winlen,16000));
         clear data;
         nchannel=size(data0,1);
         for y=1:ny %average the spectrogram bins into ny frequency bands
            data(y,:)=mean(data0(ceil(nchannel/ny*(y-1))+1:ceil(nchannel/ny*y),:));
         end
         for y=1:ny %de-noise by subtracting the minimum of each band
            data(y,:)=data(y,:)-min(data(y,:));
         end

         %divide into nx-frame windows with 50% overlap
         nFrames(80*(class-1)+file)=floor(size(data,2)/nx*2)-1;
         for frame=1:nFrames(80*(class-1)+file)
            ntest=ntest+1;
            test_data(ntest,1:nx*ny)=reshape(data(:,(frame-1)*nx/2+1:(frame+1)*nx/2),1,nx*ny);
            energy=sum(test_data(ntest,1:nx*ny));
            if energy~=0 %normalise each window to unit energy
               test_data(ntest,1:nx*ny)=test_data(ntest,1:nx*ny)/energy;
            end
            test_data(ntest,nx*ny+1)=energy; %store energy as an extra feature
            test_label(ntest,:)=class;
         end

         %normalise the energy feature across the frames of this file
         idx=ntest-nFrames(80*(class-1)+file)+1:ntest;
         test_data(idx,end)=test_data(idx,end)/sum(test_data(idx,end));
      end
   end
   fprintf('Finished %d class test files\n',class);
end
%-----------

Next - as before - we condition the data and ensure it is normalised and scaled appropriately:

%-----MATLAB code------
%scale the energy column to the same range as the spectrogram features
test_data(:,end)=test_data(:,end)/max(test_data(:,end))*max(max(test_data(:,1:end-1)));
%normalise the data to the range [0,1]
mi=min(min(test_data));
test_x=test_data-mi;
ma=max(max(test_x));
test_x=test_x/ma;
clear test_data;
%create one-hot class output vectors from the test labels
test_y=zeros(length(test_label),50);
for i=1:length(test_label)
   test_y(i,test_label(i))=1;
end
clear test_label;
%-----------

Finally, we can execute the actual test:

%-----MATLAB code------
correct=0;
test_now=0;
nfile=4000; %50 classes x 80 files
for file=1:nfile
   if mod(file-1,8)<3 %the testing files
      %classify all frames of this file in one call
      [label, prob] = nnpredict_p(nn,test_x(test_now+1:test_now+nFrames(file),:));
      if label==ceil(file/80) %correct if the label matches the class index
         correct=correct+1;
         fprintf('correct\n');
      else
         fprintf('NOT correct\n');
      end
      test_now=test_now+nFrames(file);
   end
end
fprintf('Accuracy = %f\n',correct/1500);
%-----------

This makes use of a function called nnpredict_p, which performs a forward pass through the network, sums the output probabilities over all frames of the file, and returns the class with the highest total, as shown below:

%-----MATLAB code------
function [label, maxp] = nnpredict_p(nn, x)
nn.testing = 1;
nn = nnff(nn, x, zeros(size(x,1), nn.size(end))); %forward pass for all frames
nn.testing = 0;
prob = sum(nn.a{end});           %sum output activations over frames
label = find(prob==max(prob));   %class with the highest summed probability
maxp = max(prob);
end
%-----------

The above code will output either "correct" or "NOT correct" for each of the 30x50=1500 files in the test set, so the final performance score for that particular test condition is simply the proportion 'number of correct'/1500, which the test loop above also prints.