Objectives

Build an algorithm which can be used to automatically label a song’s genre based on its properties and artist information.

Part 1: Develop an algorithm that Spotify can use to determine the genre of a song.

spotify <- read.csv("https://www.macalester.edu/~ajohns24/data/music_1_exam.csv")

library(ggplot2)
library(gridExtra)
library(dplyr)
library(caret)
library(rpart)        # for building trees
library(rpart.plot)   # for plotting trees
library(class)        # for the instructor's plotting functions
library(randomForest) # for bagging & forests
library(infer)        # for resampling

Before running an algorithm, I cleaned the data set. I converted the song name from a column to row names and deleted the column of artist names. Since each artist appears only once in the data set, the artist name carries no information the algorithm could use.

Below are the first few rows of the cleaned data. The variables capture characteristics of each song, such as tempo and danceability.

library(tibble)
# Move the song names into the row names
spotify_clean <- spotify %>% 
  column_to_rownames("song")
# Drop the first remaining column (the artist name)
spotify_clean <- spotify_clean[-1]
head(spotify_clean)
##                  old young solo key   energy liveness   tempo speechiness
## Burn            1986  1986    Y  10 0.744093 0.101070 116.158    0.056171
## Harlem Shake    1989  1989    Y   9 0.750070 0.402280 137.371    0.048298
## Demons          1984  1987    N   3 0.508804 0.950387  90.050    0.050202
## Say Something   1980  1985    N   2 0.146009 0.082441 109.509    0.037671
## Survival        1972  1972    Y   7 0.898239 0.082388 176.113    0.147401
## Beauty & A Beat 1982  1994    N   0 0.772988 0.083122 128.384    0.083079
##                 acousticness instrumentalness. mode time_signature
## Burn                0.310308          0.000000    0              5
## Harlem Shake        0.011560          0.005793    1              4
## Demons              0.289520          0.001030    1              4
## Say Something       0.852270          0.000004    1              4
## Survival            0.003843          0.000000    1              4
## Beauty & A Beat     0.126042          0.000158    1              4
##                 duration loudness  valence danceability    genre
## Burn            233.3462   -5.904 0.337765     0.395337      Pop
## Harlem Shake    127.9729   -7.105 0.347030     0.431644 RBHipHop
## Demons          236.2560  -14.380 0.225547     0.442005      Pop
## Say Something   231.0129   -8.770 0.135200     0.442882      Pop
## Survival        272.2795   -3.004 0.474467     0.455424 RBHipHop
## Beauty & A Beat 207.4795   -6.099 0.273818     0.457297      Pop
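
The cleaning step drops the artist column on the grounds that no artist repeats. A minimal sketch of how that could be checked before dropping it, using base R only. The `toy` data frame and its values are hypothetical stand-ins for the real file, not rows from it.

```r
# Hypothetical toy data standing in for the raw Spotify data
toy <- data.frame(
  song   = c("Song A", "Song B", "Song C"),
  artist = c("Artist 1", "Artist 2", "Artist 3"),
  tempo  = c(116.2, 137.4, 90.1),
  stringsAsFactors = FALSE
)

# TRUE if no artist name is repeated
artists_unique <- anyDuplicated(toy$artist) == 0

# Base R equivalent of column_to_rownames("song"), then dropping artist
toy_clean <- toy
rownames(toy_clean) <- toy_clean$song
toy_clean$song <- NULL
toy_clean$artist <- NULL

artists_unique
toy_clean
```

If `artists_unique` came back `FALSE` on the real data, the artist column might be worth keeping as a predictor.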

Classification Approach

In choosing this algorithm, I prioritized accuracy and stability over simplicity. Thus I used an algorithm which is known to enjoy low variability (in the bias-variance tradeoff).

I used a random forest for this supervised classification problem. Since we are predicting more than two categories, ordinary (binary) logistic regression does not apply directly, so I chose a nonparametric classification tool.

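The advantage of the random forest approach is that, relative to a single tree, bagging and forests usually reduce variance and classify more accurately because they aggregate the predictions of many trees. Forests are also more computationally efficient than bagging, since each split considers only a random subset of the predictors.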
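
The variance argument can be made concrete with a standard result: the average of B predictions, each with variance sigma^2 and pairwise correlation rho, has variance rho * sigma^2 + (1 - rho) * sigma^2 / B. The sketch below only evaluates that formula. The correlation values are illustrative assumptions, not estimates from the Spotify data.

```r
# Variance of the average of B equally correlated predictions
avg_variance <- function(B, rho, sigma2 = 1) {
  rho * sigma2 + (1 - rho) * sigma2 / B
}

single_tree <- avg_variance(B = 1,   rho = 0.8)  # one tree: no averaging
bagging     <- avg_variance(B = 500, rho = 0.8)  # many highly correlated trees
forest      <- avg_variance(B = 500, rho = 0.3)  # many less correlated trees

c(single_tree = single_tree, bagging = bagging, forest = forest)
```

Under these assumed correlations, adding trees helps only up to the floor set by rho. Limiting the predictors tried at each split is what lowers the correlation between trees, and with it that floor.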

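set.seed(253)

# Tune mtry (the number of predictors tried at each split) by OOB accuracy.
# Only 16 predictors are available, so randomForest resets mtry values above 16 to 16.
forest_model <- train(
  genre ~ .,
  data = spotify_clean,
  method = "rf",
  tuneGrid = data.frame(mtry = c(1,2,6, 8, 10,14,16,17,18,19)),
  trControl = trainControl(method = "oob"),
  metric = "Accuracy",
  na.action = na.omit
)
plot(forest_model)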

forest_model$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##         OOB estimate of  error rate: 54.72%
## Confusion matrix:
##             ElectroClub Pop RBHipHop class.error
## ElectroClub           0   6        6   1.0000000
## Pop                   3  20        3   0.2307692
## RBHipHop              2   9        4   0.7333333

Studying the confusion matrix, we see that the OOB error rate is high at 54.72%. The model classified all 12 ElectroClub songs incorrectly, for a class error of 1. It did relatively well in classifying Pop songs, with a class error of 0.23. Performance on RBHipHop was poor, with a class error of 0.73.
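
The error rates in the output can be reproduced by hand from the confusion matrix. The sketch below retypes the counts from the output above, with rows as true genres and columns as predicted genres. It also computes the error of always guessing the most common genre, a comparison the original analysis does not include.

```r
# Confusion matrix copied from the random forest output
# (rows = true genre, columns = predicted genre)
genres <- c("ElectroClub", "Pop", "RBHipHop")
conf_mat <- matrix(
  c(0,  6, 6,
    3, 20, 3,
    2,  9, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(truth = genres, predicted = genres)
)

# Share of each true genre that was misclassified
class_error <- 1 - diag(conf_mat) / rowSums(conf_mat)

# Overall OOB error: share of all songs that were misclassified
oob_error <- 1 - sum(diag(conf_mat)) / sum(conf_mat)

# Error from always predicting the most common genre (Pop)
baseline_error <- 1 - max(rowSums(conf_mat)) / sum(conf_mat)

round(class_error, 4)
round(c(oob = oob_error, baseline = baseline_error), 4)
```

The overall error is 29 of 53 songs (54.72%). Always guessing Pop would be wrong for 27 of 53 songs (50.94%), so on this data the forest does not beat the majority-class baseline.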

2: Determine which predictor is the most useful.

variable_importance <- data.frame(randomForest::importance(forest_model$finalModel)) %>% 
  mutate(predictor = rownames(.))

# Arrange predictors by importance (most to least)
variable_importance %>% 
  arrange(desc(MeanDecreaseGini)) %>% 
  head()
##   MeanDecreaseGini    predictor
## 1         2.328484      valence
## 2         2.292037 danceability
## 3         2.015615        tempo
## 4         2.007310          key
## 5         1.958765          old
## 6         1.958282     duration
# Arrange predictors by importance (least to most)
variable_importance %>% 
  arrange(MeanDecreaseGini) %>% 
  head()
##   MeanDecreaseGini         predictor
## 1        0.1060712    time_signature
## 2        0.5058777              mode
## 3        0.9184808             soloY
## 4        1.5382888 instrumentalness.
## 5        1.6413484      acousticness
## 6        1.6589468             young
ggplot(spotify_clean, aes(x = valence, fill = genre)) + 
  geom_density(alpha = 0.5)
## Warning: Removed 1 rows containing non-finite values (stat_density).

The most important predictor seems to be valence. From our variable importance analysis, we see that valence has the highest mean decrease in Gini, 2.328, though danceability is a close second at 2.292. This may be because the genres differ in their typical valence. Looking at the density plots, Pop seems to have the lowest valence, RBHipHop middle valence, and ElectroClub high valence.
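
To make the importance measure concrete, the sketch below computes the Gini impurity decrease for a single split by hand. MeanDecreaseGini is built from such decreases: it totals them over every split on a predictor and averages across trees. The helper functions and the eight-song node are hypothetical illustrations, not part of the randomForest package or the Spotify data.

```r
# Gini impurity of a set of class labels: 1 - sum of squared class proportions
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Impurity decrease from splitting `labels` by a logical condition
gini_decrease <- function(labels, go_left) {
  n <- length(labels)
  left  <- labels[go_left]
  right <- labels[!go_left]
  gini(labels) -
    (length(left) / n) * gini(left) -
    (length(right) / n) * gini(right)
}

# Hypothetical node: 4 Pop songs with low valence, 4 ElectroClub with high valence
genre   <- c(rep("Pop", 4), rep("ElectroClub", 4))
valence <- c(0.10, 0.20, 0.25, 0.30, 0.60, 0.70, 0.80, 0.90)

perfect_split <- gini_decrease(genre, valence < 0.5)           # separates the genres
useless_split <- gini_decrease(genre, rep(c(TRUE, FALSE), 4))  # ignores genre

c(perfect = perfect_split, useless = useless_split)
```

A predictor whose splits separate the genres cleanly accumulates large decreases, which is why a visible difference between genres in the density plot is consistent with a high importance score.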

3: Limitations of the algorithm.

Overall, the random forest does a mediocre job of distinguishing between songs of different genres. Looking at the confusion matrix, we see that the OOB estimate of the error rate is 54.72%, which is not very good. The algorithm did the best job of classifying songs in the Pop genre, but did a bad job of classifying ElectroClub and RBHipHop: it classified ElectroClub correctly 0 times out of 12 and RBHipHop correctly only 4 times out of 15. Looking at the plot of the forest model, we see that the OOB accuracy rate drops noticeably from around 46% once more than 2 predictors are tried at each split (mtry > 2) and then fluctuates a little after. These results point to limitations of the random forest technique for this task.

Part 2

4: Classifying the genre of songs using data that is missing genre information.

This data set comprises songs from a morning playlist. It has all of the same variables as the previous data set, except that the genre variable is missing. Since there are no genre labels to learn from, we use hierarchical clustering to group similar songs and then interpret the groups as genres.

spotify_new <- read.csv("https://www.macalester.edu/~ajohns24/data/music_2_exam.csv")
library(tibble)
# Move the track names into the row names, then drop the artist column
spotify_newcluster <- spotify_new %>% 
  column_to_rownames("track_name")
spotify_newcluster <- spotify_newcluster[-1]

# Standardize the features, then cluster with complete linkage
hier_model <- hclust(dist(scale(spotify_newcluster)), method = "complete")
# Plot the dendrogram of the cleaned, scaled data
plot(hier_model)

# Visualization: heatmaps (w/ and w/out dendrogram)
heatmap(data.matrix(scale(spotify_newcluster)), Colv = NA)

heatmap(data.matrix(scale(spotify_newcluster)), Colv = NA, Rowv = NA)

Studying the cluster dendrogram, we can see which songs are closely related to each other. For example, classical pieces such as “Piano Sonata No. 1 in F Minor”, “Concerto grosso in D Minor”, and “Symphony No. 2 in C Minor” are all clustered in the same branch of the dendrogram.
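
The dendrogram is built from distances between standardized features, via `scale()`. A minimal sketch of why that step matters: without it, a variable measured on a large scale, such as tempo, dominates the distance. The four songs and their values are hypothetical.

```r
# Hypothetical songs: tempo is in beats per minute, energy is between 0 and 1
songs <- data.frame(
  tempo  = c(100, 104, 112, 160),
  energy = c(0.10, 0.90, 0.15, 0.85),
  row.names = c("A", "B", "C", "D")
)

raw_dist    <- as.matrix(dist(songs))         # distances on the original scales
scaled_dist <- as.matrix(dist(scale(songs)))  # distances after standardizing

# Which song is closest to A under each distance?
others <- c("B", "C", "D")
closest_raw    <- names(which.min(raw_dist["A", others]))
closest_scaled <- names(which.min(scaled_dist["A", others]))

c(raw = closest_raw, scaled = closest_scaled)
```

On the raw scale, A is matched with B because their tempos are close, even though their energy levels are far apart. After standardizing, A is matched with C, which is similar on both features.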

plot(hier_model, cex = .3)

Here, we assign each sample case to a cluster.

# Assign each sample case to a cluster (you can add to dataset using mutate())
# You specify the number of clusters, k
as.factor(cutree(hier_model, k = 4))
## Concerto grosso in D Minor, Op. 3 No. 11, RV 565: I. Allegro - Adagio e spiccato - Allegro (Live) 
##                                                                                                 1 
##                                                                           Delicate - Instrumental 
##                                                                                                 2 
##                                                                              Ghosts On The Stereo 
##                                                                                                 3 
##                                                 Don't Sleep (feat. French Montana & Stefflon Don) 
##                                                                                                 2 
##                                                       Fêtes Galantes I, CD 86: III. Clair de lune 
##                                                                                                 1 
##                                                                                      Worth Living 
##                                                                                                 4 
##                                                                               The Long Way Around 
##                                                                                                 3 
##                                                                                             Beast 
##                                                                                                 4 
##                                                             Pièces froides: II. Danses de travers 
##                                                                                                 1 
##                One More Saturday Night - Live at Portland Memorial Coliseum, Portland, OR 5/19/74 
##                                                                                                 3 
##                                                                                      Mexican Home 
##                                                                                                 3 
##                            Piano Sonata No. 1 in F Minor, Op. 2, No. 1: III. Menuetto: Allegretto 
##                                                                                                 1 
##                                                                                             Picky 
##                                                                                                 4 
##                                                                                           Oh Fuck 
##                                                                                                 4 
##        Symphony No. 2 in C Minor, Op. 17 "Little Russian": II. Andantino marziale, quasi moderato 
##                                                                                                 1 
##                                                              Change the World (feat. John P. Kee) 
##                                                                                                 2 
## Levels: 1 2 3 4

As in Part 1, the first step was to clean the data set: I converted the track name column into row names and deleted the artist column.

Looking at the dendrogram we built, there seem to be 4 clusters that capture the genres of the songs. The first cluster can be categorized as classical music. The second is more religious or instrumental. The third could be pop, and the fourth is RBHipHop.
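
Genre labels like these come from reading the track names in each cluster. One way to support them is to attach the cluster assignment to the data and compare average feature values across clusters. The sketch below does this on a hypothetical six-track data set with two clusters. The feature values are made up for illustration, and the same steps would apply to the playlist data with k = 4.

```r
# Hypothetical toy features for six tracks
tracks <- data.frame(
  acousticness = c(0.95, 0.90, 0.92, 0.05, 0.10, 0.08),
  energy       = c(0.10, 0.15, 0.12, 0.90, 0.85, 0.88),
  row.names    = paste("Track", 1:6)
)

# Same recipe as in the analysis: standardize, then complete linkage
toy_hier <- hclust(dist(scale(tracks)), method = "complete")

# Attach the cluster label to each track (k is chosen by the analyst)
tracks$cluster <- as.factor(cutree(toy_hier, k = 2))

# Cluster sizes and average feature values per cluster
cluster_sizes <- table(tracks$cluster)
cluster_means <- aggregate(cbind(acousticness, energy) ~ cluster,
                           data = tracks, FUN = mean)

cluster_sizes
cluster_means
```

In this toy example, one cluster has high acousticness and low energy and the other the reverse. A profile like that gives a label such as "classical" some quantitative backing.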

The morning playlist we analysed can be described by 4 main genres: classical, religious/instrumental, pop, and RBHipHop.