Build an algorithm which can be used to automatically label a song’s genre based on its properties and artist information.
spotify <- read.csv("https://www.macalester.edu/~ajohns24/data/music_1_exam.csv")
library(ggplot2)
library(gridExtra)
library(dplyr)
library(caret)
library(rpart) # for building trees
library(rpart.plot) # for plotting trees
library(class) # for the instructor's plotting functions
library(randomForest) # for bagging & forests
library(infer) # for resampling
The first thing I did before running an algorithm was to clean the data set. I converted the song name from a column into the row names and deleted the column of artist names. Since each artist appears only once in the data set, the artist name carries no information the algorithm could generalize from, so there was no point in leaving it in.
A few sample rows of the cleaned data are shown below. The variables capture characteristics of each song, such as tempo and danceability.
library(tibble)
# Move the song title into the row names
spotify_clean <- spotify %>%
  column_to_rownames("song")
# Drop the first remaining column (artist name)
spotify_clean <- spotify_clean[-1]
head(spotify_clean)
## old young solo key energy liveness tempo speechiness
## Burn 1986 1986 Y 10 0.744093 0.101070 116.158 0.056171
## Harlem Shake 1989 1989 Y 9 0.750070 0.402280 137.371 0.048298
## Demons 1984 1987 N 3 0.508804 0.950387 90.050 0.050202
## Say Something 1980 1985 N 2 0.146009 0.082441 109.509 0.037671
## Survival 1972 1972 Y 7 0.898239 0.082388 176.113 0.147401
## Beauty & A Beat 1982 1994 N 0 0.772988 0.083122 128.384 0.083079
## acousticness instrumentalness. mode time_signature
## Burn 0.310308 0.000000 0 5
## Harlem Shake 0.011560 0.005793 1 4
## Demons 0.289520 0.001030 1 4
## Say Something 0.852270 0.000004 1 4
## Survival 0.003843 0.000000 1 4
## Beauty & A Beat 0.126042 0.000158 1 4
## duration loudness valence danceability genre
## Burn 233.3462 -5.904 0.337765 0.395337 Pop
## Harlem Shake 127.9729 -7.105 0.347030 0.431644 RBHipHop
## Demons 236.2560 -14.380 0.225547 0.442005 Pop
## Say Something 231.0129 -8.770 0.135200 0.442882 Pop
## Survival 272.2795 -3.004 0.474467 0.455424 RBHipHop
## Beauty & A Beat 207.4795 -6.099 0.273818 0.457297 Pop
In choosing this algorithm, I prioritized accuracy and stability over simplicity. Thus, I chose an algorithm that is known for its low variance (in the bias-variance tradeoff).
I used a random forest for this supervised classification problem. Since there are more than two genres to predict, ordinary (binary) logistic regression does not apply directly, so I used a nonparametric classification tool instead.
The advantage of the random forest approach is that, relative to a single tree, bagging and forests usually reduce variance and classify better because they average over many trees. Forests are also more computationally efficient than bagging, since each split considers only a random subset of the predictors.
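To make the bagging idea concrete, here is a minimal base-R sketch of the bootstrap resampling that bagging and forests rely on. It is an illustration only, not part of the analysis: the sample size of 53 is the number of songs counted in the confusion matrix reported later, while `n_trees`, `oob_fraction`, and the other object names are assumptions chosen for the example. Each tree is fit to a resample drawn with replacement, and the songs it never sees (the out-of-bag, or OOB, cases) can be used to estimate its error.

```r
# Illustrative sketch (not the report's code): bootstrap resampling behind bagging
set.seed(253)
n <- 53          # number of songs counted in the confusion matrix
n_trees <- 500   # assumed number of resamples, one per tree

# For each tree, draw n row indices with replacement and record
# the fraction of rows that were never drawn (the out-of-bag rows)
oob_fraction <- replicate(n_trees, {
  in_bag <- sample(n, size = n, replace = TRUE)
  mean(!(seq_len(n) %in% in_bag))
})

# Average out-of-bag fraction across resamples
mean(oob_fraction)

# Theoretical probability that a given row is left out of one resample
(1 - 1 / n)^n
```

Both numbers should come out a little over one third, which is why every song has a sizable set of trees that never saw it and can therefore be used to score it.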
set.seed(253)
forest_model <- train(
  genre ~ .,
  data = spotify_clean,
  method = "rf",
  tuneGrid = data.frame(mtry = c(1, 2, 6, 8, 10, 14, 16, 17, 18, 19)),
  trControl = trainControl(method = "oob"),
  metric = "Accuracy",
  na.action = na.omit
)
plot(forest_model)
forest_model$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 54.72%
## Confusion matrix:
## ElectroClub Pop RBHipHop class.error
## ElectroClub 0 6 6 1.0000000
## Pop 3 20 3 0.2307692
## RBHipHop 2 9 4 0.7333333
Studying the confusion matrix, we see that the OOB error rate is quite high, at about 55% (54.72%). Our model classified all 12 of the ElectroClub songs incorrectly, giving a class error of 1 for that genre. The model did relatively well in classifying Pop music, with a class error of 0.23. RBHipHop was still classified poorly, with a class error of 0.73.
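The error rates quoted above can be recomputed directly from the confusion matrix. The following base-R sketch copies the counts from the output by hand; the object names (`conf_mat`, `class_error`, `oob_error`, `baseline_error`) are my own, and the majority-class baseline is an added point of comparison rather than something the model output reports.

```r
# Confusion matrix copied from the random forest output (rows = true genre)
conf_mat <- matrix(
  c(0,  6, 6,
    3, 20, 3,
    2,  9, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    actual    = c("ElectroClub", "Pop", "RBHipHop"),
    predicted = c("ElectroClub", "Pop", "RBHipHop")
  )
)

# Per-genre error: share of each row that falls off the diagonal
class_error <- 1 - diag(conf_mat) / rowSums(conf_mat)
round(class_error, 3)

# Overall OOB error: share of all songs that were misclassified
oob_error <- 1 - sum(diag(conf_mat)) / sum(conf_mat)
round(oob_error, 4)

# Baseline for comparison: error from always predicting the most common genre
baseline_error <- 1 - max(rowSums(conf_mat)) / sum(conf_mat)
round(baseline_error, 4)
```

The overall error reproduces the 54.72% in the output. The baseline shows that always guessing Pop would misclassify about 51% of the songs, so on this sample the forest does not beat that naive rule.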
variable_importance <- data.frame(randomForest::importance(forest_model$finalModel)) %>%
  mutate(predictor = rownames(.))
# Arrange predictors by importance (most to least)
variable_importance %>%
  arrange(desc(MeanDecreaseGini)) %>%
  head()
## MeanDecreaseGini predictor
## 1 2.328484 valence
## 2 2.292037 danceability
## 3 2.015615 tempo
## 4 2.007310 key
## 5 1.958765 old
## 6 1.958282 duration
# Arrange predictors by importance (least to most)
variable_importance %>%
  arrange(MeanDecreaseGini) %>%
  head()
## MeanDecreaseGini predictor
## 1 0.1060712 time_signature
## 2 0.5058777 mode
## 3 0.9184808 soloY
## 4 1.5382888 instrumentalness.
## 5 1.6413484 acousticness
## 6 1.6589468 young
ggplot(spotify_clean, aes(x = valence, fill = genre)) +
  geom_density(alpha = 0.5)
## Warning: Removed 1 rows containing non-finite values (stat_density).
The most important predictor seems to be valence. From our variable importance analysis, we see that valence has the highest mean decrease in Gini, 2.328, followed closely by danceability at 2.292. This may be because there seems to be a distinct separation in valence between the genres. Looking at the density plots, we see that Pop seems to have the lowest valence, RBHipHop has middle valence, and ElectroClub has high valence.
Overall, the random forest does only a mediocre job of distinguishing between songs of different genres. The confusion matrix shows an OOB error rate of about 55% (54.72%), which is not very good. The algorithm did the best job of classifying songs in the Pop genre (20 of 26 correct), but did a bad job with ElectroClub and RBHipHop: it classified none of the 12 ElectroClub songs correctly and only 4 of the 15 RBHipHop songs. Looking at the plot of the forest model, we see that the OOB accuracy is highest, at around 45%, when only 1 predictor is tried at each split; it drops once more than 2 predictors are tried and then fluctuates a little. Some limitations of the random forest technique.
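For readers unfamiliar with the mean decrease in Gini, the sketch below shows the quantity behind it for a single split. It is a simplified illustration in base R: the parent counts are the genre totals from the confusion matrix, but the split itself (the `left` and `right` counts) is made up for the example, and the exact scaling that `randomForest` applies when it accumulates these decreases over splits and trees may differ from this calculation.

```r
# Gini impurity of a node, given the number of songs of each genre in it
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Root node: genre counts taken from the confusion matrix row totals
parent <- c(ElectroClub = 12, Pop = 26, RBHipHop = 15)

# Hypothetical split (e.g. on a valence threshold); these counts are made up
left  <- c(ElectroClub = 2, Pop = 20, RBHipHop = 4)
right <- parent - left

# Impurity decrease: parent impurity minus the size-weighted child impurities
n <- sum(parent)
weighted_children <- sum(left) / n * gini(left) + sum(right) / n * gini(right)
gini_decrease <- gini(parent) - weighted_children

round(c(parent = gini(parent), decrease = gini_decrease), 3)
```

A predictor whose splits tend to produce purer child nodes collects larger decreases, which is why a variable that separates the genres well, as valence appears to, ranks near the top.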
This data set consists of songs from a morning playlist. It has all of the same variables as the previous data set, except that the genre variable is missing. Because there are no genre labels to learn from, we use hierarchical clustering to group similar songs and then interpret the groups as genres.
spotify_new <- read.csv("https://www.macalester.edu/~ajohns24/data/music_2_exam.csv")
library(tibble)
# Move the track title into the row names
spotify_newcluster <- spotify_new %>%
  column_to_rownames("track_name")
# Drop the first remaining column (artist name)
spotify_newcluster <- spotify_newcluster[-1]
hier_model <- hclust(dist(scale(spotify_newcluster)), method = "complete")
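# Cluster on the cleaned, standardized features rather than the raw data frame,
# whose text columns (track name, artist) cannot be used to compute distances
spotify_cluster <- hclust(dist(scale(spotify_newcluster)), method = "complete")
plot(spotify_cluster)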
# Visualization: heatmaps (w/ and w/out dendrogram)
heatmap(data.matrix(scale(spotify_newcluster)), Colv = NA)
heatmap(data.matrix(scale(spotify_newcluster)), Colv = NA, Rowv = NA)
Studying the cluster dendrogram, we can see which songs are most similar to each other. For example, classical pieces such as “Piano Sonata No. 1 in F Minor”, “Concerto grosso in D Minor”, and “Symphony No. 2 in C Minor” all fall in the same cluster.
plot(hier_model, cex = .3)
Here, we assign each sample case to a cluster.
# Assign each sample case to a cluster (you can add to dataset using mutate())
# You specify the number of clusters, k
as.factor(cutree(hier_model, k = 4))
## Concerto grosso in D Minor, Op. 3 No. 11, RV 565: I. Allegro - Adagio e spiccato - Allegro (Live)
## 1
## Delicate - Instrumental
## 2
## Ghosts On The Stereo
## 3
## Don't Sleep (feat. French Montana & Stefflon Don)
## 2
## Fêtes Galantes I, CD 86: III. Clair de lune
## 1
## Worth Living
## 4
## The Long Way Around
## 3
## Beast
## 4
## Pièces froides: II. Danses de travers
## 1
## One More Saturday Night - Live at Portland Memorial Coliseum, Portland, OR 5/19/74
## 3
## Mexican Home
## 3
## Piano Sonata No. 1 in F Minor, Op. 2, No. 1: III. Menuetto: Allegretto
## 1
## Picky
## 4
## Oh Fuck
## 4
## Symphony No. 2 in C Minor, Op. 17 "Little Russian": II. Andantino marziale, quasi moderato
## 1
## Change the World (feat. John P. Kee)
## 2
## Levels: 1 2 3 4
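To interpret clusters like these, it helps to attach the labels to the data and compare the average features of each cluster. The sketch below does this in base R on a small made-up data set; `toy`, its values, and the choice of `k = 2` are assumptions for illustration only, and the same pattern would apply to `spotify_newcluster` and `hier_model`.

```r
# Made-up data: three quiet, slow tracks and three loud, fast ones
toy <- data.frame(
  energy = c(0.10, 0.15, 0.20, 0.80, 0.85, 0.90),
  tempo  = c(70, 75, 72, 128, 130, 125),
  row.names = c("a", "b", "c", "d", "e", "f")
)

# Standardize, compute distances, and cluster with complete linkage
toy_hc <- hclust(dist(scale(toy)), method = "complete")

# Cut the tree into k clusters and store the labels alongside the features
toy$cluster <- as.factor(cutree(toy_hc, k = 2))

# Number of tracks per cluster
table(toy$cluster)

# Average feature values per cluster, used to give each cluster a name
aggregate(cbind(energy, tempo) ~ cluster, data = toy, FUN = mean)
```

Comparing the cluster means in this way gives a more systematic basis for naming clusters than reading song titles alone.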
Once again, the first thing we did was clean the data set: I converted the track name column into the row names and deleted the artist column from the data set.
Looking at the dendrogram we built, there seem to be 4 clusters that capture the genre of the songs. The first cluster can be categorized as classical music. The second is more religious or instrumental. The third could be pop, and the fourth is RBHipHop.
The morning playlist we analysed can be described by 4 main genres: classical, religious/instrumental, pop, and RBHipHop.