Statistical Machine Learning Final Regression Project

1. The dataset is available at http://insideairbnb.com/get-the-data.html by searching for "Hawaii" and clicking on the listings.csv dataset.
2. This dataset measures Airbnb prices in different neighborhoods in Hawaii, along with host and neighborhood information, reviews, and room type.
3. The data was compiled on March 6th, 2019; the source does not specifically state when the data was collected.

Variables used:

The quantitative variables I will be using are price, number of reviews, and availability out of 365 days.

price – the price of the Airbnb listing in Hawaii

number of reviews – the number of reviews since the Airbnb was listed

availability out of 365 days – the number of days out of the year that the Airbnb is available

The qualitative variables I will be using are room type and neighborhood.

neighborhood – the name of the neighborhood where the Airbnb is located

room type – the type of room for the Airbnb, either Entire home/apt or Private room

Code for each method:

importing the Hawaii Airbnb listings dataset (listings.csv is the file named on the Inside Airbnb page; the local path may differ)

Hawaii_data <- read.csv("listings.csv")

View(Hawaii_data)

library(rsample)
library(caret)
library(dplyr)
library(randomForest)

using select() to keep only the variables of interest

Hawaii_data2 <- select(Hawaii_data, price, neighbourhood, room_type,
number_of_reviews, availability_365)

View(Hawaii_data2)

Hawaii_noNA <- na.omit(Hawaii_data2)

is.factor(Hawaii_noNA$neighbourhood)
is.factor(Hawaii_noNA$room_type)

Splitting the data into a 75/25 training/test split

set.seed(1)

indices <- sample(nrow(Hawaii_noNA), 1/4*nrow(Hawaii_noNA), replace = FALSE)

indices

Hawaii_test <- Hawaii_noNA[indices, ]

Hawaii_training <- Hawaii_noNA[-indices, ]
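The index-sampling split above can be sanity-checked on toy data (toy_df and its columns are made up for illustration): sampling a quarter of the row indices and subsetting with idx and -idx yields disjoint 25%/75% test and training sets.

```r
# 75/25 split by sampling row indices, mirroring the approach above (toy data)
set.seed(1)
toy_df <- data.frame(x = 1:100, y = rnorm(100))
idx <- sample(nrow(toy_df), 1/4 * nrow(toy_df), replace = FALSE)
toy_test <- toy_df[idx, ]    # the sampled quarter becomes the test set
toy_train <- toy_df[-idx, ]  # the remaining rows become the training set
nrow(toy_test)   # 25
nrow(toy_train)  # 75
```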

Converting the training and test sets into the matrix form that glmnet requires

all_y <- Hawaii_noNA$price
all_x <- model.matrix(price ~ ., data = Hawaii_noNA)[, -1]

test_x <- all_x[indices, ]
test_y <- all_y[indices]

training_x <- all_x[-indices, ]
training_y <- all_y[-indices]
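model.matrix() is what lets glmnet handle the factor columns (neighbourhood, room_type): it expands each factor into 0/1 dummy columns. A minimal sketch on made-up toy data shaped like the dataset:

```r
# Toy data frame mimicking the dataset's columns (values are made up)
toy <- data.frame(price = c(100, 200, 150),
                  room_type = c("Entire home/apt", "Private room", "Entire home/apt"),
                  availability_365 = c(300, 120, 45))
X <- model.matrix(price ~ ., data = toy)[, -1]  # [, -1] drops the intercept column
colnames(X)                   # "room_typePrivate room" "availability_365"
X[, "room_typePrivate room"]  # 0 1 0 — a dummy column for that factor level
```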

Random forests method to predict the price of an Airbnb

set.seed(1)

Hawaii_fit_rf <- train(price ~ ., data = Hawaii_training,
  method = "rf",
  trControl = trainControl(method = "none"),
  importance = TRUE,
  ntree = 1000
)

Hawaii_fit_rf$finalModel

Making a prediction for the price of the Airbnb

rf_prediction <- predict(object = Hawaii_fit_rf, newdata = Hawaii_test)

calculate RMSE

sqrt(mean((Hawaii_test$price - rf_prediction)^2))

mean(Hawaii_noNA$price)
median(Hawaii_noNA$price)
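The RMSE computed above is sqrt(mean((actual - predicted)^2)); a small helper makes that explicit (the numbers below are made-up toy values, not from the dataset):

```r
# Root mean squared error: typical size of the prediction error, in dollars
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(c(100, 200, 300), c(110, 190, 300))  # sqrt((100 + 100 + 0) / 3) ≈ 8.165
```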

Bagging method to predict the price of an Airbnb in Hawaii

set.seed(1)

Hawaii_bag_fit <- train(price ~ ., data = Hawaii_training,
  method = "rf",
  trControl = trainControl(method = "none"),
  importance = TRUE,
  ntree = 1000,
  tuneGrid = data.frame(mtry = ncol(Hawaii_training) - 1)
)

Hawaii_bag_fit$finalModel

Making prediction for the price of the Air BnB

Bagging_prediction <- predict(object = Hawaii_bag_fit, newdata = Hawaii_test)

calculating RMSE

sqrt(mean((Hawaii_test$price - Bagging_prediction)^2))

Least Squares Regression Method

library(glmnet)

set.seed(1)

Hawaii_OLS <- cv.glmnet(x = training_x, y = training_y, alpha = 0)

Making a prediction with s = 0 (no penalty), which approximates ordinary least squares

prediction_OLS <- predict(object = Hawaii_OLS, s = 0, newx = test_x)

rmse_OLS <- sqrt(mean((prediction_OLS - test_y)^2))
rmse_OLS
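Using cv.glmnet() with s = 0 is one way to get an (approximately) unpenalized fit; the direct route to ordinary least squares in R is lm(). A minimal sketch on made-up toy data, where the linear relationship is exact so lm() recovers the coefficients exactly:

```r
# OLS via lm() on toy data generated as y = 2x + 1 with no noise
toy <- data.frame(x = c(1, 2, 3, 4), y = c(3, 5, 7, 9))
fit <- lm(y ~ x, data = toy)
coef(fit)  # (Intercept) = 1, x = 2
```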

set.seed(1)

cv.glmnet() with alpha = 0 performs ridge regression

Hawaii_ridge <- cv.glmnet(x = training_x, y = training_y, alpha = 0)

extracting the optimal lambda

lambda_ridge <- Hawaii_ridge$lambda.min

Making a prediction on the test set

prediction_ridge <- predict(object = Hawaii_ridge, s = lambda_ridge, newx = test_x)

Calculating RMSE

rmse_ridge <- sqrt(mean((prediction_ridge - test_y)^2))
rmse_ridge
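Under the hood, ridge regression shrinks the coefficients by adding lambda to the diagonal of X'X. A closed-form sketch on made-up toy data (glmnet also standardizes predictors and fits an intercept, so this only illustrates the shrinkage idea, not glmnet's exact computation):

```r
# Closed-form ridge estimate: beta = (X'X + lambda * I)^{-1} X'y
ridge_beta <- function(X, y, lambda) {
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
}
set.seed(1)
X <- matrix(rnorm(20), ncol = 2)
y <- X %*% c(3, -2) + rnorm(10, sd = 0.1)  # true coefficients 3 and -2
ridge_beta(X, y, 0)   # lambda = 0: close to the true coefficients
ridge_beta(X, y, 10)  # lambda = 10: same coefficients shrunk toward zero
```

Larger lambda always produces a coefficient vector with smaller overall magnitude, which is the bias-variance trade-off ridge exploits.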

LASSO method

set.seed(1)

Hawaii_lasso <- cv.glmnet(x = training_x, y = training_y, alpha = 1)

Hawaii_lambda_lasso <- Hawaii_lasso$lambda.min

prediction_lasso <- predict(object = Hawaii_lasso, s = Hawaii_lambda_lasso,
newx = test_x)

rmse_lasso <- sqrt(mean((prediction_lasso - test_y)^2))
rmse_lasso
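What distinguishes the LASSO from ridge is that its L1 penalty can set coefficients exactly to zero, performing variable selection. The operation behind this is soft-thresholding; a minimal sketch (illustration only, not part of the analysis above):

```r
# Soft-thresholding: S(z, lambda) = sign(z) * max(|z| - lambda, 0)
# Inputs smaller than lambda in magnitude are zeroed out entirely
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)
soft_threshold(c(-3, -0.5, 0.2, 2), lambda = 1)  # -2  0  0  1
```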

Single regression tree method

library(rpart.plot)

set.seed(1)

Hawaii_tree_fit <- train(price ~ ., data = Hawaii_training,
  method = "rpart",
  trControl = trainControl(method = "cv", number = 5),
  parms = list(split = "information"),
  tuneLength = 10
)

Hawaii_tree_fit

storing the optimal complexity parameter (cp)

optimal_alpha2 <- Hawaii_tree_fit$bestTune

prp(Hawaii_tree_fit$finalModel, box.palette = "Blues")

tree_prediction <- predict(object = Hawaii_tree_fit, newdata = Hawaii_test)

sqrt(mean((Hawaii_test$price - tree_prediction)^2))

Results:

Random Forests: The RMSE is 579.297 dollars when using the random forests method to predict the price of an Airbnb in Hawaii.

Bagging: The RMSE is 580.5365 dollars when using the bagging method to predict the price of an Airbnb in Hawaii.

Least Squares Regression: The RMSE is 592.8501 dollars when using the least squares regression model to predict the price of an Airbnb in Hawaii.

Ridge Regression: The RMSE is 592.8483 dollars when using the ridge regression model to predict the price of an Airbnb in Hawaii.

LASSO: The RMSE is 592.8449 dollars when using the LASSO method to predict the price of an Airbnb in Hawaii.

Single Regression Tree: The RMSE is 595.4788 dollars when using a single regression tree to predict the price of an Airbnb in Hawaii.

None of the methods substantially outperformed the others: all of the RMSE values fell between roughly 579 and 596 dollars. None of the methods predicts the price of an Airbnb in Hawaii accurately either, since a typical prediction error of nearly $600 is very large relative to the dataset's mean price of 299.9061 dollars.

The results from running these methods suggest that we did not have the right predictor variables for predicting the price of an Airbnb in Hawaii. There might be other variables not included in the dataset that would have helped us make more realistic predictions for the price of an Airbnb in Hawaii.