The ultimate goal of our analysis will be to build a predictive model of price and to understand the listing features with which this might be associated.
Specifically:
How accurately can we predict the price of an Airbnb listing by its features? Not only might this information help Airbnb, it might help new listers with guidance on how much they might charge.
What makes some listings more expensive than others?
Airbnb has publicly available data on its listings. We’ll focus on Airbnb listings in New York City. To this end, you can find data on nearly 45,000 listings.
The data above indicate the specific neighborhood in which each listing resides. There are 217 such neighborhoods. We’ll simplify our analysis by lumping these 217 neighborhoods into the corresponding 5 boroughs (e.g., Brooklyn, Queens). The following data set summarizes the correspondence between neighborhood and borough.
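As a minimal sketch of this lumping step, the toy tables below stand in for the listings and for the neighborhood-to-borough lookup. The column names mirror the ones used in this analysis, but the rows are made up for illustration, and the check for unmatched neighborhoods is our own addition rather than part of the original workflow.

```r
library(dplyr)

# Toy stand-ins (hypothetical rows) for the listings and the lookup table
toy_listings <- data.frame(
  id = 1:4,
  neighbourhood_cleansed = c("Williamsburg", "Astoria", "Harlem", "Nowhere"),
  stringsAsFactors = FALSE
)
toy_lookup <- data.frame(
  neighbourhood = c("Williamsburg", "Astoria", "Harlem"),
  neighbourhood_group = c("Brooklyn", "Queens", "Manhattan"),
  stringsAsFactors = FALSE
)

# A left join keeps every listing and attaches its borough when one is found
toy_joined <- left_join(
  toy_listings, toy_lookup,
  by = c("neighbourhood_cleansed" = "neighbourhood")
)

# Listings whose neighborhood has no match end up with a missing borough
unmatched <- toy_joined %>% filter(is.na(neighbourhood_group))
print(toy_joined)
print(unmatched$neighbourhood_cleansed)
```

Checking for unmatched rows after the join is a cheap way to catch spelling differences between the two files before modeling.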
library(ggplot2)
library(dplyr)
library(caret)
library(ggridges)
airbnb <- read.csv("NYC_airbnb_kaggle_copy.csv")
NYC_nbd <- read.csv("NYC_nbhd_kaggle_copy.csv")
#head(airbnb)
#head(NYC_nbd)
dim(airbnb)
## [1] 44317 31
dim(NYC_nbd)
## [1] 230 2
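# Attach the borough (neighbourhood_group) to each listing via its neighborhood
airbnb_new <- left_join(airbnb, NYC_nbd, by = c("neighbourhood_cleansed" = "neighbourhood"))
# Impute missing numeric values with KNN; note that caret's knnImpute also
# centers and scales every numeric column, including price
impute_info <- airbnb_new %>%
  preProcess(method = "knnImpute")
airbnb_new <- predict(impute_info, newdata = airbnb_new)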
sum(complete.cases(airbnb_new))
## [1] 44317
#head(airbnb_new, 2)
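# Drop identifier and location columns, then subsample 5,000 listings.
# Caution: price is already standardized at this point, so price < 1000 is in
# standard-deviation units and does not act as a $1,000 cap
airbnb_slim <- airbnb_new %>%
  select(-c(id, latitude, longitude, is_location_exact, neighbourhood_cleansed, amenities)) %>%
  filter(price < 1000) %>%
  sample_n(5000)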
dim(airbnb_slim)
## [1] 5000 26
head(airbnb_slim, 2)
## host_response_time host_response_rate host_is_superhost
## 1 N/A N/A f
## 2 within an hour 90% f
## host_has_profile_pic property_type room_type accommodates
## 1 t Apartment Private room -0.4388584
## 2 t Apartment Entire home/apt -0.4388584
## bathrooms bedrooms beds bed_type square_feet price
## 1 -0.3220221 -0.22034 -0.5186317 Real Bed -0.8807408 -0.4192236
## 2 -0.3220221 -0.22034 -0.5186317 Real Bed -0.5333286 0.4844821
## guests_included minimum_nights maximum_nights calendar_updated
## 1 -0.4423159 0.07883499 -0.006119401 14 months ago
## 2 -0.4423159 0.47645527 -0.006117400 2 weeks ago
## availability_30 number_of_reviews review_scores_rating instant_bookable
## 1 -0.6489067 -0.4277980 -0.05938067 f
## 2 -0.6489067 -0.5185239 0.79327893 f
## is_business_travel_ready cancellation_policy
## 1 f moderate
## 2 f flexible
## require_guest_profile_picture reviews_per_month neighbourhood_group
## 1 f -0.7239659 Manhattan
## 2 f -0.2723903 Manhattan
We removed variables that we judged not relevant for predicting price. We did not think that ID, latitude, longitude, the exact-location indicator, or amenities were important in determining price, and we dropped neighbourhood_cleansed because the borough now stands in for it. We believe that removing these variables could also have potentially saved the model from homogeneity, and that it gives us a cleaner data set.
We used a KNN model and a lasso model to predict price from the 25 predictors in the data set.
We dealt with missing data by imputing values in the original data. This may lead to overly optimistic model assessments. Some missing values are recorded as the string N/A, which may be treated as a separate category during the modeling process.
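One possible way to handle the N/A strings, sketched here on made-up values, is to recode them as true missing values and to parse the percentage strings in host_response_rate into numbers before imputing. This was not done in the analysis above, and the helper name `parse_percent` is hypothetical.

```r
# Hypothetical helper: turn strings such as "90%" into 90 and "N/A" into NA
parse_percent <- function(x) {
  x <- as.character(x)
  x[x == "N/A"] <- NA_character_
  as.numeric(sub("%$", "", x))
}

# Made-up values in the same format as host_response_rate
toy_rate <- c("N/A", "90%", "100%", "6%")
toy_rate_num <- parse_percent(toy_rate)
print(toy_rate_num)

# The same recoding for a categorical column, so that "N/A" is no longer a level
toy_time <- c("N/A", "within an hour", "within a day")
toy_time[toy_time == "N/A"] <- NA
toy_time <- factor(toy_time)
print(levels(toy_time))
```

With a numeric host_response_rate, the lasso would see a single predictor instead of one indicator per distinct percentage.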
Due to the nature of the data, a nonparametric model like GAM would not be suitable. We found that our best models were KNN and lasso based. Both models have decent residual plots and similar \(R^2\) and MAE values. KNN has a slightly lower MAE, and lasso has a slightly higher \(R^2\). Which model is “best” depends on which of these metrics you value more in your models.
#### Lasso
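# Candidate penalty values, evenly spaced on the log10 scale from 0.001 to 10
lambda_grid <- 10^seq(-3, 1, length = 100)
set.seed(253)
# Lasso (alpha = 1) tuned by 10-fold CV, choosing lambda by the one-standard-error rule on MAE
lassoairbnb <- train(
  price ~ .,
  data = airbnb_slim,
  method = "glmnet",
  trControl = trainControl(method = "cv", number = 10, selectionFunction = "oneSE"),
  tuneGrid = data.frame(alpha = 1, lambda = lambda_grid),
  metric = "MAE",
  na.action = na.omit
)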
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
plot(lassoairbnb)
plot(lassoairbnb$finalModel, xvar = "lambda", label = TRUE, col = rainbow(20))
lassoairbnb$results %>% filter(lambda == lassoairbnb$bestTune$lambda)
## alpha lambda RMSE Rsquared MAE RMSESD RsquaredSD
## 1 1 0.05462277 0.631103 0.3809836 0.2762339 0.2075848 0.1010985
## MAESD
## 1 0.01384297
# Names of the predictors with nonzero coefficients at the selected lambda
model_coef <- coef(lassoairbnb$finalModel, lassoairbnb$bestTune$lambda)
predictors <- model_coef@Dimnames[[1]][model_coef@i + 1][-1]
# Residual plot
result_df <- data.frame(resid = resid(lassoairbnb), fitted = fitted(lassoairbnb))
ggplot(result_df, aes(x = fitted, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0)
# Full coefficient table at the selected lambda
coef(lassoairbnb$finalModel, lassoairbnb$bestTune$lambda)
## 217 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 0.103161388
## host_response_timea few days or more .
## host_response_timeN/A .
## host_response_timewithin a day .
## host_response_timewithin a few hours .
## host_response_timewithin an hour .
## host_response_rate0% .
## host_response_rate10% .
## host_response_rate100% .
## host_response_rate14% .
## host_response_rate15% .
## host_response_rate17% .
## host_response_rate20% .
## host_response_rate21% .
## host_response_rate22% .
## host_response_rate25% .
## host_response_rate26% .
## host_response_rate27% .
## host_response_rate29% .
## host_response_rate30% .
## host_response_rate33% .
## host_response_rate35% .
## host_response_rate36% .
## host_response_rate38% .
## host_response_rate40% .
## host_response_rate41% .
## host_response_rate43% .
## host_response_rate44% .
## host_response_rate46% .
## host_response_rate47% .
## host_response_rate50% .
## host_response_rate52% .
## host_response_rate53% .
## host_response_rate54% .
## host_response_rate55% .
## host_response_rate56% .
## host_response_rate57% .
## host_response_rate58% .
## host_response_rate59% .
## host_response_rate6% .
## host_response_rate60% .
## host_response_rate61% .
## host_response_rate62% .
## host_response_rate63% .
## host_response_rate64% .
## host_response_rate65% .
## host_response_rate66% .
## host_response_rate67% .
## host_response_rate68% .
## host_response_rate69% .
## host_response_rate70% .
## host_response_rate71% .
## host_response_rate72% .
## host_response_rate73% .
## host_response_rate74% .
## host_response_rate75% .
## host_response_rate76% .
## host_response_rate77% .
## host_response_rate78% .
## host_response_rate79% .
## host_response_rate80% .
## host_response_rate81% .
## host_response_rate82% .
## host_response_rate83% .
## host_response_rate84% .
## host_response_rate85% .
## host_response_rate86% .
## host_response_rate87% .
## host_response_rate88% .
## host_response_rate89% .
## host_response_rate90% .
## host_response_rate91% .
## host_response_rate92% .
## host_response_rate93% .
## host_response_rate94% .
## host_response_rate95% .
## host_response_rate96% .
## host_response_rate97% .
## host_response_rate98% .
## host_response_rate99% .
## host_response_rateN/A .
## host_is_superhostf .
## host_is_superhostt .
## host_has_profile_picf .
## host_has_profile_pict .
## property_typeBed & Breakfast .
## property_typeBoat .
## property_typeBoutique hotel .
## property_typeBungalow .
## property_typeCabin .
## property_typeCastle .
## property_typeCave .
## property_typeChalet .
## property_typeCondominium .
## property_typeDorm .
## property_typeEarth House .
## property_typeGuest suite .
## property_typeGuesthouse .
## property_typeHostel .
## property_typeHouse .
## property_typeIn-law .
## property_typeLoft .
## property_typeOther .
## property_typeServiced apartment .
## property_typeTent .
## property_typeTimeshare .
## property_typeTownhouse .
## property_typeTrain .
## property_typeTreehouse .
## property_typeVacation home .
## property_typeVilla .
## property_typeYurt .
## room_typePrivate room -0.174665293
## room_typeShared room -0.039501781
## accommodates 0.099920331
## bathrooms 0.096589508
## bedrooms .
## beds .
## bed_typeCouch .
## bed_typeFuton .
## bed_typePull-out Sofa .
## bed_typeReal Bed .
## square_feet 0.296050118
## guests_included .
## minimum_nights .
## maximum_nights .
## calendar_updated10 months ago .
## calendar_updated11 months ago .
## calendar_updated12 months ago .
## calendar_updated13 months ago .
## calendar_updated14 months ago .
## calendar_updated15 months ago .
## calendar_updated16 months ago .
## calendar_updated17 months ago .
## calendar_updated18 months ago .
## calendar_updated19 months ago .
## calendar_updated2 days ago .
## calendar_updated2 months ago .
## calendar_updated2 weeks ago .
## calendar_updated20 months ago .
## calendar_updated21 months ago .
## calendar_updated22 months ago .
## calendar_updated23 months ago .
## calendar_updated24 months ago .
## calendar_updated25 months ago .
## calendar_updated26 months ago .
## calendar_updated27 months ago .
## calendar_updated28 months ago .
## calendar_updated29 months ago .
## calendar_updated3 days ago .
## calendar_updated3 months ago .
## calendar_updated3 weeks ago .
## calendar_updated30 months ago .
## calendar_updated31 months ago .
## calendar_updated32 months ago .
## calendar_updated33 months ago .
## calendar_updated34 months ago .
## calendar_updated35 months ago .
## calendar_updated36 months ago .
## calendar_updated37 months ago .
## calendar_updated38 months ago .
## calendar_updated39 months ago .
## calendar_updated4 days ago .
## calendar_updated4 months ago .
## calendar_updated4 weeks ago .
## calendar_updated40 months ago .
## calendar_updated41 months ago .
## calendar_updated42 months ago .
## calendar_updated43 months ago .
## calendar_updated44 months ago .
## calendar_updated45 months ago 5.237992738
## calendar_updated46 months ago .
## calendar_updated5 days ago .
## calendar_updated5 months ago .
## calendar_updated5 weeks ago .
## calendar_updated50 months ago .
## calendar_updated51 months ago .
## calendar_updated52 months ago .
## calendar_updated53 months ago .
## calendar_updated54 months ago .
## calendar_updated55 months ago .
## calendar_updated56 months ago .
## calendar_updated58 months ago .
## calendar_updated59 months ago .
## calendar_updated6 days ago .
## calendar_updated6 months ago .
## calendar_updated6 weeks ago .
## calendar_updated60 months ago .
## calendar_updated62 months ago .
## calendar_updated63 months ago .
## calendar_updated66 months ago .
## calendar_updated67 months ago .
## calendar_updated69 months ago .
## calendar_updated7 months ago .
## calendar_updated7 weeks ago .
## calendar_updated8 months ago .
## calendar_updated9 months ago .
## calendar_updateda week ago .
## calendar_updatednever .
## calendar_updatedtoday .
## calendar_updatedyesterday .
## availability_30 0.001839595
## number_of_reviews .
## review_scores_rating .
## instant_bookablet .
## is_business_travel_readyt .
## cancellation_policylong_term .
## cancellation_policymoderate .
## cancellation_policystrict .
## cancellation_policysuper_strict_30 .
## cancellation_policysuper_strict_60 .
## require_guest_profile_picturet .
## reviews_per_month .
## neighbourhood_groupBrooklyn .
## neighbourhood_groupManhattan 0.206651351
## neighbourhood_groupQueens .
## neighbourhood_groupStaten Island .
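Because most rows in this table are zeroed out (shown as `.`), it can help to list only the nonzero entries. The sketch below builds a small sparse matrix by hand from five of the rows printed above and filters it. Applying the same filter to the real `coef()` result assumes that it behaves like any other `dgCMatrix`, which we have not verified here.

```r
library(Matrix)

# Hand-built stand-in for the lasso coefficient matrix (five rows from the table above)
toy_coef <- sparseMatrix(
  i = c(1, 3, 5), j = c(1, 1, 1),
  x = c(0.103161388, -0.174665293, 0.296050118),
  dims = c(5, 1),
  dimnames = list(
    c("(Intercept)", "bedrooms", "room_typePrivate room", "beds", "square_feet"),
    "1"
  )
)

# Convert to a plain named vector and keep only the nonzero, non-intercept entries
coef_vec <- as.matrix(toy_coef)[, 1]
nonzero <- coef_vec[coef_vec != 0 & names(coef_vec) != "(Intercept)"]

# Sort by absolute size so the largest effects come first
nonzero <- nonzero[order(abs(nonzero), decreasing = TRUE)]
print(nonzero)
```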
We took out some categorical variables with many categories because our KNN model was crashing.
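To see which categorical variables have the most categories, one can count the distinct values in each non-numeric column. The sketch below does this on a made-up data frame; the cutoff of 3 levels is an arbitrary choice for illustration only and is not the rule we used.

```r
# Made-up data frame with one high-cardinality and one low-cardinality column
toy <- data.frame(
  host_response_rate = c("90%", "100%", "N/A", "75%", "50%"),
  room_type = c("Private room", "Entire home/apt", "Private room",
                "Shared room", "Entire home/apt"),
  price = c(0.48, -0.42, 0.10, -0.90, 1.20),
  stringsAsFactors = FALSE
)

# Count distinct values in each non-numeric column
is_categorical <- !sapply(toy, is.numeric)
n_levels <- sapply(toy[is_categorical], function(x) length(unique(x)))
print(n_levels)

# Flag columns above an arbitrary cutoff (3 here, purely for illustration)
too_many <- names(n_levels)[n_levels > 3]
print(too_many)
```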
# Drop the high-cardinality categorical predictors
airbnb_slim <- airbnb_slim %>%
  select(-c(host_response_time, host_response_rate, calendar_updated))
# KNN tuned by 10-fold CV on MAE; only k = 1 and k = 5 are tried here,
# with the wider grid left commented out
knn_model_3 <- train(
  price ~ .,
  data = airbnb_slim,
  preProcess = c("center", "scale"),
  method = "knn",
  # tuneGrid = data.frame(k = c(1, 5, 10, 11:39, seq(40, 200, by = 20))),
  tuneGrid = data.frame(k = c(1, 5)),
  trControl = trainControl(method = "cv", number = 10, selectionFunction = "best"),
  metric = "MAE",
  na.action = na.omit
)
# Examine results
plot(knn_model_3)
knn_model_3$bestTune
## k
## 2 5
knn_model_3$results %>%
  filter(k == knn_model_3$bestTune$k)
## k RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 5 0.6387762 0.3526001 0.2614734 0.2064368 0.1320186 0.02394247
knn_model_3
## k-Nearest Neighbors
##
## 5000 samples
## 22 predictor
##
## Pre-processing: centered (61), scaled (61)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 4500, 4500, 4500, 4501, 4498, 4501, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 0.7416528 0.2535878 0.3193247
## 5 0.6387762 0.3526001 0.2614734
##
## MAE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
result_knn <- data.frame(resid = resid(knn_model_3), fitted = fitted(knn_model_3))
# Residual plot
ggplot(result_knn, aes(x = fitted, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0)
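For reference, the two metrics used to compare the models can be computed by hand from observed and predicted values. The sketch below uses made-up numbers and hypothetical helper names. It computes \(R^2\) as the squared correlation between observed and predicted values, which, as far as we know, is how caret reports it by default.

```r
# Hypothetical helpers for the two comparison metrics
mae <- function(obs, pred) mean(abs(obs - pred))
r_squared <- function(obs, pred) cor(obs, pred)^2

# Made-up observed and predicted (standardized) prices
obs  <- c(-0.4, 0.5, 0.1, 1.2, -0.9)
pred <- c(-0.2, 0.3, 0.0, 0.8, -0.6)

print(mae(obs, pred))
print(r_squared(obs, pred))
```

MAE rewards being close on typical listings, while \(R^2\) rewards tracking the overall variation in price, which is why the two metrics can rank the models differently.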
Studying the relationship between price and square feet:
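Looking at the predictor coefficients in our lasso model, we noticed that square feet is the main driver/indicator of price. The coefficient for square footage is 0.296 in the output above. Because the KNN imputation step centered and scaled the numeric variables, including price, this means that a one-standard-deviation increase in square feet is associated with an increase of roughly 0.296 standard deviations in price, holding the other predictors fixed, rather than a fixed dollar amount per square foot. Relative to the other predictors, it seems that square feet drives the price of an Airbnb listing.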
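The link between a coefficient on standardized variables and one in original units can be checked on simulated data. The sketch below uses made-up square footage and prices (none of these numbers come from the Airbnb data) and a simple one-predictor linear model rather than the lasso, so it only illustrates the rescaling: multiplying the standardized slope by sd(price) / sd(square feet) recovers the slope in dollars per square foot.

```r
set.seed(1)

# Simulated data (hypothetical values, not from the Airbnb data set)
sqft  <- runif(200, min = 300, max = 2000)
price <- 50 + 0.12 * sqft + rnorm(200, sd = 30)

# Helper that centers and scales a vector to mean 0 and standard deviation 1
z <- function(x) (x - mean(x)) / sd(x)

# Slope in original units (dollars per square foot) and in standardized units
b_raw <- coef(lm(price ~ sqft))[["sqft"]]
b_std <- coef(lm(z(price) ~ z(sqft)))[[2]]

# Rescaling the standardized slope recovers the original-unit slope
b_back <- b_std * sd(price) / sd(sqft)
print(c(raw = b_raw, standardized = b_std, back_transformed = b_back))
```

Doing the same for our model would require the standard deviations of price and square feet from before the imputation step, which are not shown in this report.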
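Many of the predictors left in our model, like “neighbourhood_groupManhattan”, are indicators for categorical variables and have smaller coefficients than square feet, which means their effect is small compared to that of square feet. The one exception in the output above is “calendar_updated45 months ago”, whose coefficient (5.238) is much larger.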
It makes sense that square feet and price have a strong, positive relationship. Especially in New York City, the main factor that determines the value of a property is the amount of space. Since we are only comparing listings that are all within New York, location doesn’t have as large an effect as square feet, which is why the only location-oriented variable not omitted from our model is “neighbourhood_groupManhattan”. If we had listings from both rural Minnesota and downtown New York, we can imagine location being a much stronger predictor of price.
We can focus on two listings in the data set (id # 738588 & 259946) to see how square feet relates to price. The first has a price of $625 with a square footage of 3700, and the second is listed at $125 with a square footage of 700. Comparing these two, the larger listing is also the more expensive one, which is consistent with a strong, positive relationship between price and square footage.
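As a quick worked check using only the two prices and square footages quoted above, the price per square foot of each listing and the implied change in price per additional square foot between them can be computed directly:

```r
# Prices (in dollars) and square footage of the two listings quoted above
price <- c(625, 125)
sqft  <- c(3700, 700)

# Price per square foot for each listing
print(price / sqft)

# Implied change in price per additional square foot between the two listings
slope <- diff(price) / diff(sqft)
print(slope)
```

Two listings are of course not enough to estimate a relationship; this only illustrates the direction of the comparison.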