Machine hearing is the name given to the ability of computers to detect and recognise sounds. The term was popularised by a 2010 magazine article by Richard F. Lyon of Google Research entitled Machine Hearing: An Emerging Field, which surveyed the state of research at the time and related this relatively new field to the more established research field of machine vision.
Since then, many researchers have worked in the machine hearing field, leading to a number of alternative approaches. Prominent research includes
The approach presented in this section is inspired by the work of Richard Lyon discussed above, and uses evaluation methods and an image-like representation similar to those of Jonathan Dennis. The actual Matlab code included here was written by Zhang Haomin, a research student in the National Engineering Laboratory of Speech and Language Information Processing at the University of Science and Technology of China.
Specifically, sounds are converted into high resolution spectrograms (using a 2048-point window advancing by 16 samples per frame, at a sample rate of 16 kHz). These are then downsampled into a sequence of smaller 24x30 windows, which are de-noised before being stacked to form feature vectors ready for classification, as described here and in this paper.
The sounds and noises used for training and testing are chosen to exactly replicate the experimental conditions of Dennis (by using 'standard' training and test conditions, it is easy to compare our system's performance with that of others). The database of sounds to recognise is the Real World Computing Partnership (RWCP) Sound Scene Database (SSD) in Real Acoustical Environments (RWCP-SSD can be obtained for free by non-commercial academic researchers by mail order from here). The database contains many short sound recordings, one sound per audio file, arranged by class in subdirectories. Following the methodology of Dennis, we need to select 50 classes of sounds from RWCP, taking 80 files from each class: 50 for training and the remaining 30 for testing.
We place these files into the clean sound database, with directory name data_wav (as used in the code below).
From this we will now create noisy sound databases, where each sound has noise added at a predetermined signal-to-noise ratio (SNR).
We could use additive white Gaussian noise (AWGN), but Dennis made use of a noise database called NOISEX-92 (a database of various standard environmental noise recordings which can be obtained online - but this could be replaced by other background noise recordings if NOISEX-92 is unavailable).
In particular, we make use of four noise files for evaluation: "Speech Babble", "Destroyer Control Room", "Factory Floor 1" and "Jet Cockpit 1".
We then create noise-corrupted versions of the database in further subdirectories (identified by SNR level). To corrupt the sounds, for each sound file a random choice is made of the corrupting noise type (from the four noises above), and a random starting point within the noise recording is selected (so the mix does not always begin from the start of the noise file). In practice, noise is added at 0, 10 and 20 dB SNR.
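As an illustration, the corruption step for a single file could be sketched as follows. Note that this sketch is not part of the original scripts: the file paths and noise file name are hypothetical, and the scaling simply matches average powers to achieve the target SNR;

%-----MATLAB code------
% Illustrative sketch: mix one clean recording with noise at a target SNR
% (file paths here are hypothetical examples)
snr_db=0; %target SNR in dB
[clean,fs]=audioread('data_wav/class01/sound01.wav');
[noise,fsn]=audioread('noisex92/babble.wav');
clean=clean(:,1); noise=noise(:,1);
%choose a random starting point within the noise recording
start=ceil(rand()*(length(noise)-length(clean)));
nseg=noise(start:start+length(clean)-1);
%scale the noise segment so that clean/noise power ratio equals the SNR
pc=sum(clean.^2)/length(clean); %clean signal power
pn=sum(nseg.^2)/length(nseg); %noise segment power
nseg=nseg*sqrt(pc/(pn*10^(snr_db/10)));
%mix and save into the noisy database directory
audiowrite('data_wav_mix0/class01/sound01.wav',clean+nseg,fs);
%-----------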
Thus, for example, directory data_wav_mix0 holds the sounds corrupted with noise at 0 dB SNR.
Each of these clean and noise-corrupted sound directories is to be used for subsequent training and testing.
While it is common to have separate directories for testing and training material, we will simply use every 4th to 8th file for training and every 1st to 3rd file for testing (e.g. counting from zero: files 3...7 and 11...15 for training; files 0...2 and 8...10 for testing).
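This split can be expressed compactly with a modulo test, which is the form used in the full code later;

%-----MATLAB code------
% Demonstrate the train/test split rule for the first 16 files
for file=1:16 %Matlab counts files from 1
if mod(file-1,8)>2
fprintf('file %d: training\n',file);
else
fprintf('file %d: testing\n',file);
end
end
%-----------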
As mentioned, while there are a number of ready-made deep neural network software packages available, we will make use of the DeepLearnToolbox. Download the software (using the link given here), unpack the toolbox and add it to your Matlab path as shown in the DeepLearnToolbox documentation (don't worry, this is easy to do).
Note: the actual code will require a reasonably fast computer, at least 4GB of memory and at least 2GB of hard disc space. Training and testing for one condition (i.e. one level of noise) may require more than 1 hour of processing time.
The instructions that follow are mainly copied from here.
%-----MATLAB code------
% Set up initial variables
clear all;
data_dir='data_wav/'; %where clean sounds are stored
directory=dir(data_dir);
nclass=50; %no. of classes
nfile=80; %no. of files per class
ny=24; %frequency resolution of feature vector
nx=30; %time resolution of feature vector
winlen=2048; %spectrogram window
overlap=2048-16; %spectrogram overlap
ntrain=0; %running count of training frames
%-----------
Next step will be to run through all sounds in the database, form feature vectors, and load these into memory:
%-----MATLAB code------
for class=1:nclass
  sub_d=dir([data_dir,directory(class+2).name]); %'+2' skips the '.' and '..' entries
  for file=1:nfile
    if mod(file-1,8)>2 %select files for training
      [wave,fs]=audioread([data_dir,directory(class+2).name,'/',sub_d(file+2).name]);
      data0=abs(spectrogram(wave,winlen,overlap,winlen,16000));
      clear data;
      nchannel=size(data0,1);
      for y=1:ny %average spectrogram rows down to ny frequency bands
        data(y,:)=mean(data0(ceil(nchannel/ny*(y-1))+1:ceil(nchannel/ny*y),:));
      end
      for y=1:ny %simple de-noising: subtract the minimum of each band
        data(y,:)=data(y,:)-min(data(y,:));
      end
      nFrames(80*(class-1)+file)=floor(size(data,2)/nx*2)-1;
      for frame=1:nFrames(80*(class-1)+file)
        ntrain=ntrain+1;
        train_data(ntrain,1:nx*ny)=reshape(data(:,(frame-1)*nx/2+1:(frame+1)*nx/2),1,nx*ny);
        energy=sum(train_data(ntrain,1:nx*ny));
        if energy~=0
          train_data(ntrain,1:nx*ny)=train_data(ntrain,1:nx*ny)/energy;
        end
        train_data(ntrain,nx*ny+1)=energy;
        train_label(ntrain,:)=class;
      end
      idx=ntrain-nFrames(80*(class-1)+file)+1:ntrain; %frames belonging to this file
      train_data(idx,end)=train_data(idx,end)/sum(train_data(idx,end));
    end
  end
  fprintf('Finished %d class training files\n',class);
end
%-----------
The array train_data now holds one normalised feature vector per analysis frame (with the frame energy appended as a final element), while train_label records the class of each frame.
Before continuing further, we must condition the data to ensure it is scaled appropriately, and free up some memory by clearing unwanted arrays;
%-----MATLAB code------
train_data(:,end)=train_data(:,end)/max(train_data(:,end))*max(max(train_data(:,1:end-1)));
mi=min(min(train_data));
train_x=train_data-mi;
ma=max(max(train_x));
train_x=train_x/ma;
clear train_data;
train_y=zeros(length(train_label),50);
for i=1:length(train_label)
  train_y(i,train_label(i))=1;
end
clear train_label;
%-----------
Next we set up the neural network parameters, using the settings recommended for the DeepLearnToolbox, with 210 units in each hidden layer;
%-----MATLAB code------
nnsize=210; %no. of units in each hidden layer
dropout=0.10; %fraction of units dropped during training
nt=size(train_x,1);
rand('state',0) %seed random number generator
%pad the training set by duplicating random rows so its
%length becomes an exact multiple of the batch size (100)
for i=1:(ceil(nt/100)*100-nt)
  np=ceil(rand()*nt);
  train_x=[train_x;train_x(np,:)];
  train_y=[train_y;train_y(np,:)];
end
%-----------
Now we start to create and stack RBM layers - as many as are required for the current task, to create a deep structure (this example is not particularly deep);
%-----MATLAB code------
% train a single RBM layer with nnsize (210) hidden units
rand('state',0)
% train RBM
dbn.sizes = [nnsize];
opts.numepochs = 1;
opts.batchsize = 100;
opts.momentum = 0;
opts.alpha = 1;
dbn = dbnsetup(dbn, train_x, opts);
dbn = dbntrain(dbn, train_x, opts);
%train DBN
dbn.sizes = [nnsize nnsize];
opts.numepochs = 1;
opts.batchsize = 100;
opts.momentum = 0;
opts.alpha = 1;
dbn = dbnsetup(dbn, train_x, opts);
dbn = dbntrain(dbn, train_x, opts);
%unfold DBN to neural network
nn = dbnunfoldtonn(dbn, 50);
%use sigmoid activation function
nn.activation_function = 'sigm';
%-----------
Now, with this newly-initialised structure, we can treat the network as a NN. It is ready for fine-tuning using back-propagation;
%-----MATLAB code------
%train neural network
opts.numepochs = 1;
opts.batchsize = 100;
nn.dropoutFraction = dropout;
nn.learningRate = 10;
for i=1:1000
fprintf('Epoch=%d\n',i);
nn = nntrain(nn, train_x, train_y, opts);
%reduce the learning rate as training continues
if i==100
  nn.learningRate = 5;
end
if i==400
  nn.learningRate = 2;
end
if i==800
  nn.learningRate = 1;
end
end
%-----------
The outcome of this process is a fairly large trained network structure in Matlab's memory called nn.
The newly learned DNN (called nn in the code above) can now be evaluated. First we set up the variables for testing;
%-----MATLAB code------
clear test_x;
clear test_y;
data_dir='data_wav/';
noise_dir='data_wav_mix0/'; %the 0dB SNR noise mixture
%change to other directories to test different SNRs
directory=dir(data_dir);
nclass=50;
nfile=80;
ny=24;
nx=30;
winlen=2048;
overlap=2048-16;
%-----------
And again, read in the data for testing in the same way we did for training previously:
%-----MATLAB code------
ntest=0;
rand('state',0)
for class=1:nclass
  sub_d=dir([data_dir,directory(class+2).name]);
  for file=1:nfile
    if mod(file-1,8)<3 %select the specific files used for testing
      [wave,fs]=audioread([noise_dir,directory(class+2).name,'/',sub_d(file+2).name]);
      data0=abs(spectrogram(wave,winlen,overlap,winlen,16000));
      clear data;
      nchannel=size(data0,1);
      for y=1:ny
        data(y,:)=mean(data0(ceil(nchannel/ny*(y-1))+1:ceil(nchannel/ny*y),:));
      end
      for y=1:ny
        data(y,:)=data(y,:)-min(data(y,:));
      end
      nFrames(80*(class-1)+file)=floor(size(data,2)/nx*2)-1;
      for frame=1:nFrames(80*(class-1)+file)
        ntest=ntest+1;
        test_data(ntest,1:nx*ny)=reshape(data(:,(frame-1)*nx/2+1:(frame+1)*nx/2),1,nx*ny);
        energy=sum(test_data(ntest,1:nx*ny));
        if energy~=0
          test_data(ntest,1:nx*ny)=test_data(ntest,1:nx*ny)/energy;
        end
        test_data(ntest,nx*ny+1)=energy;
        test_label(ntest,:)=class;
      end
      idx=ntest-nFrames(80*(class-1)+file)+1:ntest; %frames belonging to this file
      test_data(idx,end)=test_data(idx,end)/sum(test_data(idx,end));
    end
  end
  fprintf('Finished %d class test files\n',class);
end
%-----------
Next - as before - we also condition the files and ensure they are normalised and scaled appropriately:
%-----MATLAB code------
test_data(:,end)=test_data(:,end)/max(test_data(:,end))*max(max(test_data(:,1:end-1))); %normalise the data
mi=min(min(test_data));
test_x=test_data-mi;
ma=max(max(test_x));
test_x=test_x/ma;
clear test_data;
%create class output vector from test data files
test_y=zeros(length(test_label),50);
for i=1:length(test_label)
  test_y(i,test_label(i))=1;
end
clear test_label;
%-----------
Finally, it is possible to execute the actual test;
%-----MATLAB code------
correct=0;
test_now=0;
nfile=4000; %50 classes x 80 files
for file=1:nfile
  if mod(file-1,8)<3 %the testing files
    [label, prob] = nnpredict_p(nn,test_x(test_now+1:test_now+nFrames(file),:));
    if label==ceil(file/80) %does the prediction match the true class?
      correct=correct+1;
      fprintf('correct\n');
    else
      fprintf('NOT correct\n');
    end
    test_now=test_now+nFrames(file);
  end
end
fprintf('Final score: %d/1500 = %.2f%% correct\n',correct,100*correct/1500);
%-----------
This makes use of a function called nnpredict_p, which pools the frame-level network outputs over an entire file and returns the most likely class;
%-----MATLAB code------
function [label, maxp] = nnpredict_p(nn, x)
  %feed all frames of one file through the network, then sum the
  %output activations to give a single file-level score per class
  nn.testing = 1;
  nn = nnff(nn, x, zeros(size(x,1), nn.size(end)));
  nn.testing = 0;
  prob = sum(nn.a{end});
  [maxp, label] = max(prob); %most likely class and its score
end
%-----------
The above code outputs either "correct" or "NOT correct" for each of the 30x50=1500 files in the test set, so the final performance score for that particular test condition is simply the proportion 'number of correct'/1500.