What makes a song popular? Can we predict a song’s popularity based only on its acoustic properties (e.g., how fast it is, how loud it is)? To answer these questions, we’ll analyze a sample of 10,000 songs streamed on Spotify:
These data are a mere sample of the more than 100,000 Spotify songs posted on Kaggle.
Along with the popularity of each song, the music data set contains lots of acoustic variables. To learn more about the acoustic variables, check out the Spotify API page.
library(ggplot2) # for plots
library(GGally) # for pairs plots
library(ggridges) # for joy plots
library(dplyr) # for wrangling
library(caret) # for machine learning algorithms
music <- read.csv("https://www.macalester.edu/~ajohns24/data/spotify_18.csv")
Let’s look at the 6 most and the 6 least popular songs in the data set.
music %>%
arrange(desc(popularity)) %>%
head(6)
## artist_name track_id
## 1 Ariana Grande 2rPE9A1vEgShuZxxzR2tZH
## 2 Halsey 5p7ujcrUXASCNwRaWNHR1C
## 3 XXXTENTACION 3ee8Jmje8o58CHK66QrVC2
## 4 6ix9ine 2E124GmJRnBJuXbTb4cPUB
## 5 Little Mix 6rrTr2HEAzlpC4KWZxF3S1
## 6 Rita Ora 6xtcFXSo8H9BZN637BMVKS
## track_name acousticness danceability
## 1 thank u, next 0.280 0.724
## 2 Without Me 0.297 0.752
## 3 SAD! 0.258 0.740
## 4 FEFE (feat. Nicki Minaj & Murda Beatz) 0.088 0.931
## 5 Woman Like Me (feat. Nicki Minaj) 0.173 0.757
## 6 Let You Love Me 0.288 0.531
## duration_ms energy instrumentalness liveness loudness mode speechiness
## 1 207333 0.647 0.00e+00 0.1020 -5.642 1 0.0658
## 2 201661 0.488 9.11e-06 0.0936 -7.050 1 0.0705
## 3 166606 0.613 3.72e-03 0.1230 -4.880 1 0.1450
## 4 179405 0.387 0.00e+00 0.1360 -9.127 1 0.4120
## 5 228207 0.849 0.00e+00 0.0878 -3.424 0 0.0536
## 6 190000 0.854 0.00e+00 0.0773 -3.149 1 0.1980
## tempo valence popularity time_signature_4
## 1 106.960 0.435 100 TRUE
## 2 136.041 0.533 95 TRUE
## 3 75.023 0.473 92 TRUE
## 4 125.978 0.376 90 TRUE
## 5 150.036 0.826 90 TRUE
## 6 94.419 0.456 90 TRUE
music %>%
arrange(popularity) %>%
head(6)
## artist_name track_id track_name acousticness
## 1 7j2music 4JZkJVBZH2PE0typUqzN9F Welcome to Loud City 0.26200
## 2 7j2music 59JeX6N0kfnzKi016LC9LW So Easy 0.11600
## 3 7liwa 40kCuwvYw5SCX7njHnKfhe Nari 0.30600
## 4 Ace Amin 71k1FxiK1PmzCsRd8bV7Nk Punctuation 0.02950
## 5 Adam Jones 4HfJTAR2lu5k15W6DOfe7t Forever 0.00137
## 6 Alisa Turner 5j6CvdHg6ddueHPYqQwMEa Only My Jesus 0.49900
## danceability duration_ms energy instrumentalness liveness loudness mode
## 1 0.825 230191 0.510 0.000651 0.0575 -5.838 1
## 2 0.843 191060 0.217 0.010900 0.0854 -10.062 0
## 3 0.798 202710 0.565 0.000000 0.0641 -9.814 0
## 4 0.708 311458 0.678 0.000000 0.0974 -7.459 0
## 5 0.726 183432 0.307 0.476000 0.1190 -12.301 1
## 6 0.450 241347 0.465 0.000000 0.0788 -9.419 1
## speechiness tempo valence popularity time_signature_4
## 1 0.0557 117.074 0.3180 1 TRUE
## 2 0.2000 93.031 0.0396 1 TRUE
## 3 0.3590 148.953 0.6120 1 TRUE
## 4 0.2110 119.835 0.1860 1 TRUE
## 5 0.1620 130.964 0.3230 1 TRUE
## 6 0.0359 152.059 0.3020 1 FALSE
Joy Division is one of the only bands, if not the only band, to have a data visualization technique named after it. The “joy plot” is inspired by the band’s iconic album cover.
In honor of their album cover, I construct a joy plot for the popularity of a handful of artists who are among those with the most songs in the data set.
# Filter to the four artists of interest
classical <- music %>%
filter(artist_name %in% c("Johann Sebastian Bach", "Wolfgang Amadeus Mozart", "Waka Flocka Flame", "Eagles"))
ggplot(classical, aes(x = popularity, y = artist_name)) +
geom_density_ridges() +
theme_ridges()
## Picking joint bandwidth of 2.63
Here we create a new data set music_sub with the following features: it re-defines time_signature_4 and mode as factor variables, and it removes 3 variables (artist_name, track_id, and track_name) which should not be used in our predictive model of popularity.
music$time_signature_4 <- as.factor(music$time_signature_4)
music$mode <- as.factor(music$mode)
music_sub <- music %>%
  select(acousticness, danceability, duration_ms, energy, instrumentalness,
         liveness, loudness, mode, speechiness, tempo, valence, popularity,
         time_signature_4) %>%
  droplevels()
Our goal in this section is to build a predictive model of song popularity using the available set of predictors in music_sub. To this end, we’ll perform least squares, backward stepwise selection, and LASSO.
Here we construct a least squares regression model of popularity by all predictors in music_sub.
# Set the seed
set.seed(253)
# Run the least squares regression model
ls_model <- train(
popularity ~ .,
data = music_sub,
method = "lm",
trControl = trainControl(method = "cv", number = 10),
na.action = na.omit
)
# Summarize the model
summary(ls_model)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.711 -11.872 -1.863 9.923 67.537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.286e+01 1.615e+00 20.343 < 2e-16 ***
## acousticness 1.269e+00 7.036e-01 1.804 0.071312 .
## danceability 1.092e+01 1.133e+00 9.637 < 2e-16 ***
## duration_ms -3.649e-06 1.299e-06 -2.809 0.004984 **
## energy -4.214e+00 1.219e+00 -3.458 0.000547 ***
## instrumentalness -3.532e+00 5.491e-01 -6.433 1.31e-10 ***
## liveness -2.406e+00 9.951e-01 -2.418 0.015621 *
## loudness 6.017e-01 4.592e-02 13.105 < 2e-16 ***
## mode1 -5.080e-01 3.274e-01 -1.551 0.120816
## speechiness -5.813e+00 1.453e+00 -4.002 6.32e-05 ***
## tempo 5.604e-03 5.497e-03 1.019 0.308042
## valence -5.054e+00 7.394e-01 -6.835 8.69e-12 ***
## time_signature_4TRUE 9.606e-01 5.091e-01 1.887 0.059199 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.87 on 9987 degrees of freedom
## Multiple R-squared: 0.0797, Adjusted R-squared: 0.07859
## F-statistic: 72.07 on 12 and 9987 DF, p-value: < 2.2e-16
# Calculate the CV MAE (2 approaches)
ls_model$resample %>%
summarize(mean(MAE))
## mean(MAE)
## 1 12.7675
The 10-fold CV MAE is about 12.77, meaning that, on average, our model’s predictions of popularity (measured on a 0–100 scale) are off by roughly 12.77 points.
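The code comment above mentions two approaches to calculating the CV MAE; a second way to pull the same cross-validated metric (a sketch, assuming caret’s standard results table) is:

```r
# Approach 2: caret stores fold-averaged metrics in the results table
ls_model$results %>%
  select(MAE, MAESD)
```

This reports the same CV MAE along with its standard deviation across folds.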
set.seed(253)
backstep_model <- train(
popularity ~ .,
data = music_sub,
method = "leapBackward",
tuneGrid = data.frame(nvmax = 1:13),
trControl = trainControl(method = "cv", number = 10, selectionFunction = "best"),
metric = "MAE",
na.action = na.omit
)
summary(backstep_model)
## Subset selection object
## 12 Variables (and intercept)
## Forced in Forced out
## acousticness FALSE FALSE
## danceability FALSE FALSE
## duration_ms FALSE FALSE
## energy FALSE FALSE
## instrumentalness FALSE FALSE
## liveness FALSE FALSE
## loudness FALSE FALSE
## mode1 FALSE FALSE
## speechiness FALSE FALSE
## tempo FALSE FALSE
## valence FALSE FALSE
## time_signature_4TRUE FALSE FALSE
## 1 subsets of each size up to 12
## Selection Algorithm: backward
## acousticness danceability duration_ms energy instrumentalness
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " "
## 3 ( 1 ) " " "*" " " "*" " "
## 4 ( 1 ) " " "*" " " "*" " "
## 5 ( 1 ) " " "*" " " "*" "*"
## 6 ( 1 ) " " "*" " " "*" "*"
## 7 ( 1 ) " " "*" "*" "*" "*"
## 8 ( 1 ) " " "*" "*" "*" "*"
## 9 ( 1 ) " " "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*"
## liveness loudness mode1 speechiness tempo valence
## 1 ( 1 ) " " "*" " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " "*"
## 5 ( 1 ) " " "*" " " " " " " "*"
## 6 ( 1 ) " " "*" " " "*" " " "*"
## 7 ( 1 ) " " "*" " " "*" " " "*"
## 8 ( 1 ) "*" "*" " " "*" " " "*"
## 9 ( 1 ) "*" "*" " " "*" " " "*"
## 10 ( 1 ) "*" "*" " " "*" " " "*"
## 11 ( 1 ) "*" "*" "*" "*" " " "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" "*"
## time_signature_4TRUE
## 1 ( 1 ) " "
## 2 ( 1 ) " "
## 3 ( 1 ) " "
## 4 ( 1 ) " "
## 5 ( 1 ) " "
## 6 ( 1 ) " "
## 7 ( 1 ) " "
## 8 ( 1 ) " "
## 9 ( 1 ) "*"
## 10 ( 1 ) "*"
## 11 ( 1 ) "*"
## 12 ( 1 ) "*"
The predictor in the 1-predictor model is loudness. Looking at the summary of our model, we see that loudness is included in every subset, indicating that it is the single strongest predictor of popularity.
plot(backstep_model)
Looking at the plot, I see that the MAE decreases significantly with each additional predictor up until the model has 5 predictors. After that, the MAE is roughly flat.
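If we preferred the simpler 5-predictor model suggested by the plot, we could inspect its coefficients directly (a sketch using the `id` argument of `coef()` on the leaps object, just as we do for the full model below):

```r
# Coefficients of the 5-predictor model from the backward stepwise search
coef(backstep_model$finalModel, id = 5)
```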
backstep_model$bestTune$nvmax
## [1] 12
coef(backstep_model$finalModel, id = backstep_model$bestTune$nvmax)
## (Intercept) acousticness danceability
## 3.285533e+01 1.269143e+00 1.091794e+01
## duration_ms energy instrumentalness
## -3.649070e-06 -4.214388e+00 -3.532106e+00
## liveness loudness mode1
## -2.406279e+00 6.017404e-01 -5.079746e-01
## speechiness tempo valence
## -5.813490e+00 5.603849e-03 -5.053561e+00
## time_signature_4TRUE
## 9.606084e-01
backstep_model$results
## nvmax RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 16.06074 0.05606267 12.91315 0.3204538 0.01010604 0.2098705
## 2 2 16.00292 0.06313439 12.87448 0.3432967 0.01242319 0.2341260
## 3 3 15.95057 0.06914477 12.82616 0.3542820 0.01289069 0.2469815
## 4 4 15.93250 0.07153704 12.81453 0.3642571 0.01542394 0.2575261
## 5 5 15.89390 0.07599532 12.77347 0.3620911 0.01606566 0.2490064
## 6 6 15.88231 0.07734374 12.76765 0.3705848 0.01668448 0.2598045
## 7 7 15.88142 0.07740947 12.77148 0.3724037 0.01690682 0.2653889
## 8 8 15.88000 0.07749397 12.77376 0.3710002 0.01595113 0.2639313
## 9 9 15.88028 0.07749334 12.77298 0.3699003 0.01638701 0.2633252
## 10 10 15.87793 0.07776160 12.77353 0.3713698 0.01644213 0.2685246
## 11 11 15.87263 0.07838279 12.76940 0.3716256 0.01668481 0.2693489
## 12 12 15.87059 0.07861446 12.76750 0.3740550 0.01669409 0.2705485
## 13 13 15.87059 0.07861446 12.76750 0.3740550 0.01669409 0.2705485
The 10-fold CV MAE for the “best” model (nvmax = 12) is 12.7675.
lambda_grid <- 10^seq(-3, 0.5, length = 100)
lasso_model <- train(
popularity ~ .,
data = music_sub,
method = "glmnet",
tuneGrid = data.frame(alpha = 1, lambda = lambda_grid),
trControl = trainControl(method = "cv", number = 10, selectionFunction = "oneSE"),
metric = "MAE",
na.action = na.omit
)
plot(lasso_model$finalModel, xvar = "lambda", label = TRUE, col = rainbow(15))
coef(lasso_model$finalModel, 1)
## 13 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 28.6239105
## acousticness .
## danceability 4.8684752
## duration_ms .
## energy .
## instrumentalness -1.4110102
## liveness .
## loudness 0.3544421
## mode1 .
## speechiness .
## tempo .
## valence .
## time_signature_4TRUE .
coef(lasso_model$finalModel, 0.2)
## 13 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 3.124019e+01
## acousticness 2.087322e-01
## danceability 9.455856e+00
## duration_ms -1.805861e-06
## energy -2.175641e+00
## instrumentalness -3.091832e+00
## liveness -2.060220e+00
## loudness 4.965490e-01
## mode1 -4.025813e-02
## speechiness -3.695737e+00
## tempo .
## valence -3.560391e+00
## time_signature_4TRUE 4.624382e-01
If lambda equals 1, then 3 variables remain in the model: danceability, instrumentalness, and loudness.
One of the most persistent variables (among the last to be shrunk to zero) is loudness.
One of the least persistent variables (among the first to be shrunk to zero) is tempo.
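To see how the penalty shrinks the model, we can count the nonzero coefficients at a few lambda values (a sketch; the exact counts depend on the fitted glmnet path):

```r
# Number of nonzero coefficients (excluding the intercept) at several lambdas
sapply(c(0.2, 0.5, 1), function(l) {
  beta <- as.matrix(coef(lasso_model$finalModel, s = l))
  sum(beta[-1, 1] != 0)
})
```

Larger lambdas zero out more coefficients, matching the coefficient printouts above.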
plot(lasso_model)
plot(lasso_model, xlim = c(0,0.6), ylim = c(12.75,12.8))
lasso_model$bestTune
## alpha lambda
## 78 1 0.5274997
# Coefficients near the best tuned lambda
coef(lasso_model$finalModel, 0.5094)
## 13 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 28.6745152
## acousticness .
## danceability 7.2176889
## duration_ms .
## energy .
## instrumentalness -2.3463701
## liveness -1.0493408
## loudness 0.3956689
## mode1 .
## speechiness -0.3998064
## tempo .
## valence -1.3125838
## time_signature_4TRUE .
The remaining variables and their coefficients are:
danceability: 7.2177; instrumentalness: -2.3464; liveness: -1.0493; loudness: 0.3957; speechiness: -0.3998; valence: -1.3126
lasso_model$resample %>%
summarize(mean(MAE))
## mean(MAE)
## 1 12.82685
LASSO model MAE: 12.83. Backward stepwise model MAE: 12.77. Least squares model MAE: 12.77.
Based on the average MAEs, I’d say that the backward stepwise model is the best: its CV MAE is (marginally) the lowest, and the MAE plot above suggests that a model with as few as 5 predictors performs nearly as well as the full 12-predictor model.
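The model comparison can be assembled into one table (a sketch, pulling each model’s CV MAE from the objects fit earlier; the backstep entry indexes the results row matching the selected nvmax):

```r
# Side-by-side CV MAE for the three models
data.frame(
  model = c("least squares", "backward stepwise", "LASSO"),
  cv_mae = c(
    mean(ls_model$resample$MAE),
    backstep_model$results$MAE[backstep_model$bestTune$nvmax],
    mean(lasso_model$resample$MAE)
  )
)
```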
lasso_model$results %>%
filter(lambda == lasso_model$bestTune$lambda)
## alpha lambda RMSE Rsquared MAE RMSESD RsquaredSD
## 1 1 0.5274997 15.95739 0.07070021 12.82685 0.3003048 0.01467597
## MAESD
## 1 0.1904266
result_df <- data.frame(resid = resid(lasso_model), fitted = fitted(lasso_model))
ggplot(result_df, aes(x = fitted, y = resid)) +
geom_point() +
geom_hline(yintercept = 0)
Is it right? Based on the residual plot, the LASSO model doesn’t seem to be right: the residuals aren’t random and show clear patterns.
Is the model strong? The R-squared is weak at roughly 0.07, so I would say the model is weak.
Does the model produce accurate predictions? The CV MAE is 12.83, which is relatively large compared to the popularity levels we are predicting, so the model does not produce accurate predictions.
Although our model isn’t great, I would recommend that music executives focus on increasing danceability: it is statistically significant and has the largest coefficient, suggesting a sizeable association with popularity.
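To illustrate, we could predict the popularity of a hypothetical high-danceability song (every feature value below is a made-up assumption, not a row from the data; factor levels must match those in music_sub):

```r
# A hypothetical song with high danceability
new_song <- data.frame(
  acousticness = 0.2, danceability = 0.95, duration_ms = 200000,
  energy = 0.6, instrumentalness = 0, liveness = 0.1, loudness = -5,
  mode = factor(1, levels = c(0, 1)), speechiness = 0.07, tempo = 120,
  valence = 0.5, time_signature_4 = factor(TRUE, levels = c(FALSE, TRUE))
)
predict(lasso_model, newdata = new_song)
```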