import numpy as np
import pandas as pd
import librosa
import librosa.display
import IPython.display as ipd
from matplotlib import pyplot as plt
from tqdm.auto import tqdm
%matplotlib inline
Ch. 11 – Genre classification
Figure 11.2. Illustration of genre classification with acoustic descriptors. Panels A and B: Distributions of RMSE and spectral centroid values across the 10 genres for all excerpts. Panel C: Confusion matrix of the random forest classification model showing the proportion of cases in each actual and predicted genre.
Summary
This notebook looks at a classic genre categorization study and dataset by Tzanetakis & Cook (2002) and conducts a simple classification of genre based on acoustic features extracted from the audio files. The full data contains 100 audio excerpts from each of 10 genres (1000 clips in total), but we are going to start with a smaller set to keep this light to run. It should be noted that the selection of excerpts for this dataset was not particularly rigorous and represented the collection of music that George Tzanetakis had at his disposal at the time. The dataset has some quirks and imperfections, but I think it is still a fun, classic and illustrative example to explore.
Setup
Load libraries
Load dataset
We first install the mirdata library.
import sys
!{sys.executable} -m pip install mirdata
Then we initialise the library and download the audio excerpts needed. I only take 100 excerpts here but you can take all 1000 excerpts by altering the script below.
import mirdata
#print(mirdata.list_datasets())
gtzan = mirdata.initialize('gtzan_genre', version='mini') # This is 100 excerpts
#gtzan = mirdata.initialize('gtzan_genre')                # This is 1000 excerpts (uncomment if you want to analyse the full data)
gtzan.download()
print('Downloaded', len(gtzan.track_ids), 'tracks')
INFO: Downloading ['mini', 'tempo_beat_annotations'] to /Users/tuomaseerola/mir_datasets/gtzan_genre
INFO: [mini] downloading main.zip
INFO: /Users/tuomaseerola/mir_datasets/gtzan_genre/main.zip already exists and will not be downloaded. Rerun with force_overwrite=True to delete this file and force the download.
INFO: [tempo_beat_annotations] downloading annot.zip
INFO: /Users/tuomaseerola/mir_datasets/gtzan_genre/annot.zip already exists and will not be downloaded. Rerun with force_overwrite=True to delete this file and force the download.
Downloaded 100 tracks
Let’s look at an example (track ID 88).
ID = 88
tracks = gtzan.load_tracks()
ex = tracks[gtzan.track_ids[ID]]
print(["Genre:", ex.genre, "Name:", ex.track_id, "Tempo:", ex.tempo])
print(ex.audio[1])

plt.figure(figsize=(8, 2))
librosa.display.waveshow(y=ex.audio[0], sr=ex.audio[1])
ipd.display(ipd.Audio(data=ex.audio[0], rate=ex.audio[1]))
['Genre:', 'pop', 'Name:', 'pop.00008', 'Tempo:', 84.1]
22050
Extract features
Let’s extract some features and use them to predict genres. We take some rhythm features, some timbral features, MFCCs (19 in total), and chroma features (12 in total). In the feature extraction, we calculate each feature across the 30-second excerpt and then simply take the mean to represent it.
The summary of each track could be more sophisticated: one could take the median and the standard deviation, for instance, to avoid extreme values, or take multiple measures from within each excerpt and use bagging or voting across the separate clips to decide the most likely genre (the latter comes close to ensemble machine-learning techniques; a sketch of a richer summary follows the extraction code below).
df = pd.DataFrame(columns = ['genre','bpm','rmse','spec_cent','spec_bw','rolloff','zcr','spec_ctr','mfcc1','mfcc2','mfcc3','mfcc4','mfcc5','mfcc6','mfcc7','mfcc8','mfcc9','mfcc10','mfcc11','mfcc12','mfcc13','mfcc14','mfcc15','mfcc16','mfcc17','mfcc18','mfcc19','chroma1','chroma2','chroma3','chroma4','chroma5','chroma6','chroma7','chroma8','chroma9','chroma10','chroma11','chroma12'])
for track in tqdm(tracks):
    ex = tracks[track]
    y, sr = librosa.load(ex.audio_path)
    chroma_stft = librosa.feature.chroma_stft(y=y, sr=sr)
    rmse = librosa.feature.rms(y=y)
    spec_cent = librosa.feature.spectral_centroid(y=y, sr=sr)
    spec_bw = librosa.feature.spectral_bandwidth(y=y, sr=sr)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)
    zcr = librosa.feature.zero_crossing_rate(y)
    spec_ctr = librosa.feature.spectral_contrast(y=y, sr=sr, hop_length=512)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=512)
    mfcc = librosa.feature.mfcc(y=y, sr=sr)
    df.loc[len(df)] = [ex.genre, ex.tempo, np.mean(rmse), np.mean(spec_cent), np.mean(spec_bw), np.mean(rolloff), np.mean(zcr), np.mean(spec_ctr), np.mean(mfcc[1]), np.mean(mfcc[2]), np.mean(mfcc[3]), np.mean(mfcc[4]), np.mean(mfcc[5]), np.mean(mfcc[6]), np.mean(mfcc[7]), np.mean(mfcc[8]), np.mean(mfcc[9]), np.mean(mfcc[10]), np.mean(mfcc[11]), np.mean(mfcc[12]), np.mean(mfcc[13]), np.mean(mfcc[14]), np.mean(mfcc[15]), np.mean(mfcc[16]), np.mean(mfcc[17]), np.mean(mfcc[18]), np.mean(mfcc[19]), np.mean(chroma[0]), np.mean(chroma[1]), np.mean(chroma[2]), np.mean(chroma[3]), np.mean(chroma[4]), np.mean(chroma[5]), np.mean(chroma[6]), np.mean(chroma[7]), np.mean(chroma[8]), np.mean(chroma[9]), np.mean(chroma[10]), np.mean(chroma[11])]
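As noted above, taking only the mean is a crude summary. The following is a minimal sketch of a richer summary (an addition to the notebook, not part of the analysis that follows): it keeps both the mean and the standard deviation of each frame-wise coefficient, illustrated with the MFCCs of the last track processed in the loop above.

def summarise(feature_matrix):
    # Per-coefficient means and standard deviations of a frame-wise feature matrix
    return np.concatenate([feature_matrix.mean(axis=1), feature_matrix.std(axis=1)])

print(summarise(mfcc).shape)  # 20 means + 20 standard deviations = 40 values per track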
Explore features
Let’s look at some of the features across genres. Are there differences in dynamics or brightness?
rmse spec_cent spec_bw rolloff zcr spec_ctr
0 0.036233 1505.357461 1559.228895 2717.238764 0.098223 23.372866
1 0.030610 1361.006486 1441.739951 2389.011463 0.087766 25.186866
2 0.043828 1490.274810 1600.005082 2785.418914 0.090046 22.894315
3 0.029426 1526.628932 1499.462050 2916.150271 0.108946 25.663545
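The table above shows the first few rows of the feature data. One way to compare these descriptors across genres (a sketch added here, not the code that produced the table) is to draw per-genre boxplots of RMSE and spectral centroid, echoing panels A and B of Figure 11.2.

import seaborn as sns
plot_data = df[['genre', 'rmse', 'spec_cent']].copy()
plot_data[['rmse', 'spec_cent']] = plot_data[['rmse', 'spec_cent']].astype(float)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.boxplot(x='genre', y='rmse', data=plot_data, ax=axes[0])       # Dynamics (RMS energy) per genre
sns.boxplot(x='genre', y='spec_cent', data=plot_data, ax=axes[1])  # Brightness (spectral centroid) per genre
for ax in axes:
    ax.tick_params(axis='x', rotation=90)
plt.tight_layout()
plt.show()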
Feature space
How large is our feature space, and do we have features that are redundant, that is, highly similar to each other? This can easily be explored by visualising the correlations between all features.
import seaborn as sns

corr = df.iloc[0:99, 1:27].corr()                    # Compute the correlation matrix
mask = np.triu(np.ones_like(corr, dtype=bool))       # Generate a mask for the upper triangle
f, ax = plt.subplots(figsize=(8, 8))                 # Define matplotlib figure
cmap = sns.diverging_palette(230, 20, as_cmap=True)  # Custom colormap
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1.00, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}) # Draw the heatmap with the mask and correct aspect ratio
plt.show()
Classification and machine-learning algorithms typically deal well with numerous features, but here we have only 100 observations and 39 variables, which is not a healthy proportion (too many variables compared to observations). Usually it is a good idea to have a ratio of 10:1, 15:1, or even 20:1 of observations to predictors. Based on the correlation matrix, what would you eliminate?
For instance, all chroma features have high positive correlations, and some of the timbral features seem to be related. Let’s trim the selection, as we have rather little data when using the mini dataset.
df_trimmed = df.iloc[:, 0:22]
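A data-driven alternative to trimming by column position would be to drop any feature that correlates very highly with one already kept. A minimal sketch (an addition; the 0.9 threshold and the name df_alternative are illustrative choices, not part of the original notebook):

features = df.iloc[:, 1:].astype(float)                               # All predictors as numeric
corr_abs = features.corr().abs()
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]  # Redundant features
print('Dropping', len(to_drop), 'highly correlated features:', to_drop)
df_alternative = df.drop(columns=to_drop)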
Classify with the features
We use all the features in a relatively simple hierarchical classification-tree model called a random forest. It creates a set of decision nodes from the data using a subset of the features, and bootstraps this process many times over. It is a robust technique and does not really care whether the distributions are normal or not.
Before running the model, there are three operations to introduce that are part of good practice in model construction.
Cross-validation of the model
We cross-validate the model, which means that we split the data into training and testing sets. We first train the model on the training set, which here is a randomly selected 70% of the data. Once we have trained the model, we test it against the unseen data (the test set, 30% of the data) to assess how the model performs. This could be done by varying the selection of the training and testing sets, say 10 times, and averaging the results (this is called k-fold cross-validation; we sketch this after the first model's results below).
Stratifying the sample
When we randomly split the data into training and testing sets, we might want to stratify the split according to genre, which makes sure that we have a similar proportion of examples from each genre in both sets.
Normalize variables
We also want to normalize the variables. This is not so crucial for the random forest model that we are going to use, but usually it is a good idea to eliminate the effect that differences in feature ranges have on the model. To normalize the variables, we turn them into z-scores, where the mean is 0 and the standard deviation is 1.
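As a side note, the pipeline below uses preprocessing.normalize, which rescales each observation (row) to unit length rather than z-scoring each feature. A z-score version, as described in this paragraph, could look like this sketch (Xz is an illustrative name and is not used later):

from sklearn.preprocessing import StandardScaler
Xz = StandardScaler().fit_transform(df_trimmed.drop('genre', axis=1).astype(float))  # Each feature: mean 0, sd 1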
import pandas as pd
import sklearn as sk
from sklearn.metrics import confusion_matrix
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier

X = df_trimmed.drop('genre', axis = 1)
Xn = preprocessing.normalize(X)   # Rescale each observation to unit norm
y = df_trimmed['genre']

test_size = 0.30                  # Taking a 70:30 training and test set split
seed = 9                          # Random number seed for repeatability of the code
X_train, X_test, y_train, y_test = train_test_split(Xn, y, test_size=test_size, random_state=seed, stratify=y)

RF = RandomForestClassifier(n_estimators=1000, max_depth=10, random_state=0).fit(X_train, y_train)
#RF.predict(X_test)
#print(round(RF.score(X_test, y_test), 4))
y_pred_test = RF.predict(X_test)
And we have the results:
print('Correct classification rate:',round(RF.score(X_test, y_test), 4))
Correct classification rate: 0.7
Is this a good result? To answer this question, think about what a model that predicts at random would achieve by chance. You could also check how this model compares to the results published by Tzanetakis and Cook. Finally, you might want to consider what level of accuracy we would expect from human listeners; there may even be studies on this to give you a benchmark.
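A single random split gives a somewhat noisy estimate of the accuracy. The k-fold cross-validation mentioned earlier could be sketched like this (an addition that reuses Xn and y from above; it is not part of the original notebook):

from sklearn.model_selection import cross_val_score, StratifiedKFold
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
scores = cross_val_score(RandomForestClassifier(n_estimators=1000, max_depth=10, random_state=0),
                         Xn, y, cv=cv)
print('Mean accuracy over 10 folds:', round(scores.mean(), 4))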
Analyse the model
Let’s look at this model in more detail: which features are doing most of the heavy lifting, could we simplify the model, and what kind of mistakes does the classification model make?
Visualise confusion matrix
Let’s explore what kind of mistakes the model makes. A confusion matrix is a useful way to visualise this.
import seaborn as sns
# Reshape
matrix = confusion_matrix(y_test, y_pred_test)
matrix = matrix.astype('float') / matrix.sum(axis=1)[:, np.newaxis]

# Build the plot
plt.figure(figsize=(8, 5))
sns.set(font_scale=1.4)
sns.heatmap(matrix, annot=True, annot_kws={'size': 10},
            cmap=plt.cm.Blues, linewidths=0.2)

# Add labels to the plot
class_names = RF.classes_ #np.unique(y_test)
tick_marks = np.arange(len(class_names))
tick_marks2 = tick_marks + 0.5
plt.xticks(tick_marks + 0.5, class_names, rotation=90)
plt.yticks(tick_marks2, class_names, rotation=0)
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.title('Confusion Matrix for Random Forest Model')
plt.show()
Feature importance
Let’s plot the feature importance from random forest classification.
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
importance = RF.feature_importances_
n = df_trimmed.columns[1:len(df_trimmed.columns)]
im = pd.DataFrame({'data': importance, 'names': n})
im2 = im.sort_values(by='data', ascending=False)

# Plot feature importance
fig, ax = plt.subplots(figsize=(8, 5))
#figure(figsize=(10, 5))
plt.scatter(im2.names[0:10], im2.data[0:10], color='red')
plt.plot(im2.names[0:10], im2.data[0:10])
ax.set_title('10 strongest features')
plt.show()
The plot shows the 10 strongest features; the first four seem to bring the greatest benefit to the model.
Simplify model
What happens if we take the four best features and try building a simpler model with these features?
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
X2 = df_trimmed.filter(['mfcc8', 'rmse', 'spec_ctr', 'mfcc7'])

test_size = 0.30                  # Taking a 70:30 training and test set split
seed = 2022                       # Random number seed for repeatability of the code
X_train, X_test, y_train, y_test = train_test_split(X2, y, test_size=test_size, random_state=seed, stratify=y)

RF = RandomForestClassifier(n_estimators=1000, max_depth=10, random_state=seed).fit(X_train, y_train)

# Make predictions for the test set
y_pred_test = RF.predict(X_test)
print(round(RF.score(X_test, y_test), 4))
0.6333
What do you think about the simplified model with four features? Is the model still good? You could look at the confusion matrix to see what kind of mistakes the slimmer model starts to make.
There is a concept called the principle of parsimony, stemming from Occam’s razor: the idea that simpler models are preferable to more complex ones. There are several statistical measures that assess model fit and parsimony (the Akaike Information Criterion, for instance). We will not enter into those calculations here, but usually it is better to have a simple model and compromise accuracy a little than to gain a few points of accuracy with a complex model.
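To get a feel for this trade-off with the current data, here is a sketch (an addition, not part of the original analysis) that re-fits the model with an increasing number of the strongest features from im2 and reports the test-set accuracy; the list of feature counts and the split seed are arbitrary choices:

for k in [1, 2, 4, 8, 16, 21]:
    Xk = df_trimmed.filter(list(im2.names[0:k]))                     # Top-k features by importance
    Xk_train, Xk_test, yk_train, yk_test = train_test_split(
        Xk, y, test_size=0.30, random_state=2022, stratify=y)
    model = RandomForestClassifier(n_estimators=1000, max_depth=10,
                                   random_state=2022).fit(Xk_train, yk_train)
    print(k, 'features:', round(model.score(Xk_test, yk_test), 4))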
Summary
There are numerous other algorithms that could classify these materials: support vector machines (SVMs), k-nearest neighbour models (KNNs), neural networks, and many others.
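Swapping in one of these classifiers is only a couple of lines with scikit-learn. A quick sketch (using the X_train/X_test split of the simplified model above; the specific hyperparameters are arbitrary):

from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

for name, model in [('SVM', SVC(kernel='rbf')), ('KNN', KNeighborsClassifier(n_neighbors=5))]:
    model.fit(X_train, y_train)
    print(name, 'accuracy:', round(model.score(X_test, y_test), 4))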
We could have focussed more on the features, their calculation, the summary measures, and subsets, but overall we achieved good success with a small set of features. We have to remember that this is a mini-version of the original dataset; you are welcome to try whether the full dataset would improve the results.
This process is fairly generic for all kinds of classification tasks, so the same procedure could be applied to predicting emotion categories, meter, instrumentation, and other properties of music.
References
- Tzanetakis, G. & Cook, P. (2002). Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 10(5), 293-302. doi:10.1109/TSA.2002.800560.