Airbnb Dataset NAS (Model Optimisation)
Objective
This notebook starts with a business question, chooses machine learning models to answer the question, prepares the cleaned dataset for those machine learning models, then compares their effectiveness.
Business Question
Is a proposed price for a given room type in a given neighbourhood underpriced, in price range, or overpriced?
Machine Learning Models
We are working with labeled data and a predefined output (from our business question), so we will use supervised learning. Our output is a classification ("underpriced", "in price range", "overpriced") so we'll use a classification model.
We reviewed the typical use cases of several kinds of classification model and decided not to use the following:
- Naive Bayes (best for text)
- Logistic Regression (best for binary)
- Support Vector Machines (best for more complex data)
- Random Forests (best for large datasets with complex patterns)
- Linear/Quadratic Discriminant Analysis (requires well-separated classes)
- Gradient Boosting (best for very complex data)
We chose these two models to answer our business question:
- K-Nearest Neighbors
- Decision Tree
1. Import Libraries and Data
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
df_import = pd.read_csv("airbnb_cleaned.csv")
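2. Select Columns for Models
# .copy() makes df an independent DataFrame, so the later in-place edits do not act on a slice of df_import
df = df_import[["neighbourhood_group", "room_type", "number_of_reviews", "availability_365", "price"]].copy()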
df.head()
| | neighbourhood_group | room_type | number_of_reviews | availability_365 | price |
|---|---|---|---|---|---|
| 0 | Brooklyn | Private room | 9 | 365 | 149 |
| 1 | Manhattan | Entire home/apt | 45 | 355 | 225 |
| 2 | Manhattan | Private room | 0 | 365 | 150 |
| 3 | Brooklyn | Entire home/apt | 270 | 194 | 89 |
| 4 | Manhattan | Entire home/apt | 9 | 0 | 80 |
3. Create Price Category Column
df.insert(5, "price_category", pd.NA)
df.head()
| | neighbourhood_group | room_type | number_of_reviews | availability_365 | price | price_category |
|---|---|---|---|---|---|---|
| 0 | Brooklyn | Private room | 9 | 365 | 149 | <NA> |
| 1 | Manhattan | Entire home/apt | 45 | 355 | 225 | <NA> |
| 2 | Manhattan | Private room | 0 | 365 | 150 | <NA> |
| 3 | Brooklyn | Entire home/apt | 270 | 194 | 89 | <NA> |
| 4 | Manhattan | Entire home/apt | 9 | 0 | 80 | <NA> |
4. Import Quartile Data
quartiles = pd.read_csv("5_quartiles.csv")
quartiles
| | neighbourhood_group | room_type | Q1 | Q3 |
|---|---|---|---|---|
| 0 | Bronx | Entire home/apt | 80.0 | 140.0 |
| 1 | Bronx | Private room | 40.0 | 70.0 |
| 2 | Bronx | Shared room | 28.0 | 55.5 |
| 3 | Brooklyn | Entire home/apt | 104.0 | 198.2 |
| 4 | Brooklyn | Private room | 50.0 | 80.0 |
| 5 | Brooklyn | Shared room | 30.0 | 50.0 |
| 6 | Manhattan | Entire home/apt | 140.0 | 250.0 |
| 7 | Manhattan | Private room | 67.0 | 120.0 |
| 8 | Manhattan | Shared room | 49.0 | 88.0 |
| 9 | Queens | Entire home/apt | 90.0 | 165.0 |
| 10 | Queens | Private room | 47.0 | 75.0 |
| 11 | Queens | Shared room | 30.0 | 50.5 |
| 12 | Staten Island | Entire home/apt | 75.0 | 150.0 |
| 13 | Staten Island | Private room | 40.0 | 75.0 |
| 14 | Staten Island | Shared room | 29.0 | 75.0 |
5. Put Values in Price Category Column
0: underpriced
1: in price range
2: overpriced
for index, row in quartiles.iterrows():
    # Listings that belong to this neighbourhood group and room type
    in_group = (df["neighbourhood_group"] == row["neighbourhood_group"]) & (df["room_type"] == row["room_type"])
    # Below Q1: underpriced (0); between Q1 and Q3 inclusive: in price range (1); above Q3: overpriced (2)
    df.loc[in_group & (df["price"] < row["Q1"]), ["price_category"]] = 0
    df.loc[in_group & (df["price"] >= row["Q1"]) & (df["price"] <= row["Q3"]), ["price_category"]] = 1
    df.loc[in_group & (df["price"] > row["Q3"]), ["price_category"]] = 2
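The same labelling can also be written without a Python loop. The sketch below is one possible alternative, not the notebook's method: it joins the quartile table onto the listings and applies the thresholds with `np.select`. The `listings` frame is a small stand-in for `df`, built from the first four rows shown in step 2, and only the matching quartile rows from step 4 are included.

```python
import numpy as np
import pandas as pd

# Small stand-in for the notebook's `df` (first four rows shown in step 2).
listings = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Manhattan", "Brooklyn"],
    "room_type": ["Private room", "Entire home/apt", "Private room", "Entire home/apt"],
    "price": [149, 225, 150, 89],
})

# Matching rows copied from the quartile table in step 4.
quartiles = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Brooklyn", "Manhattan", "Manhattan"],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt", "Private room"],
    "Q1": [104.0, 50.0, 140.0, 67.0],
    "Q3": [198.2, 80.0, 250.0, 120.0],
})

# A left join attaches each listing's Q1 and Q3 and keeps the listing order.
merged = listings.merge(quartiles, on=["neighbourhood_group", "room_type"], how="left")

# 0: underpriced, 1: in price range, 2: overpriced (same thresholds as the loop).
conditions = [merged["price"] < merged["Q1"], merged["price"] > merged["Q3"]]
merged["price_category"] = np.select(conditions, [0, 2], default=1)

print(merged[["price", "Q1", "Q3", "price_category"]])
```

One caveat of this sketch: a listing with no matching quartile row would get missing Q1 and Q3 values and fall through to the default category 1, so a check for missing Q1 values after the merge would be a sensible addition.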
6. Check for NA Values
pd.isna(df["price_category"]).sum()
np.int64(0)
df.head()
| | neighbourhood_group | room_type | number_of_reviews | availability_365 | price | price_category |
|---|---|---|---|---|---|---|
| 0 | Brooklyn | Private room | 9 | 365 | 149 | 2 |
| 1 | Manhattan | Entire home/apt | 45 | 355 | 225 | 1 |
| 2 | Manhattan | Private room | 0 | 365 | 150 | 2 |
| 3 | Brooklyn | Entire home/apt | 270 | 194 | 89 | 0 |
| 4 | Manhattan | Entire home/apt | 9 | 0 | 80 | 0 |
7. Convert Categories to Integers
neighbourhoods
1: Bronx
2: Brooklyn
3: Manhattan
4: Queens
5: Staten Island
room types
0: Entire home/apt
1: Private room
2: Shared room
# The empty string is a placeholder so that enumerate() numbers the neighbourhoods from 1
neighbourhoods = ["", "Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"]
for index, n in enumerate(neighbourhoods):
    df.loc[df["neighbourhood_group"] == n, ["neighbourhood_group"]] = index
# Room types are numbered from 0
room_types = ["Entire home/apt", "Private room", "Shared room"]
for index, r in enumerate(room_types):
    df.loc[df["room_type"] == r, ["room_type"]] = index
df
| | neighbourhood_group | room_type | number_of_reviews | availability_365 | price | price_category |
|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 9 | 365 | 149 | 2 |
| 1 | 3 | 0 | 45 | 355 | 225 | 1 |
| 2 | 3 | 1 | 0 | 365 | 150 | 2 |
| 3 | 2 | 0 | 270 | 194 | 89 | 0 |
| 4 | 3 | 0 | 9 | 0 | 80 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 48567 | 2 | 1 | 0 | 9 | 70 | 1 |
| 48568 | 2 | 1 | 0 | 36 | 40 | 0 |
| 48569 | 3 | 0 | 0 | 27 | 115 | 0 |
| 48570 | 3 | 2 | 0 | 2 | 55 | 1 |
| 48571 | 3 | 1 | 0 | 23 | 90 | 1 |
48572 rows × 6 columns
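As an alternative to the loops above, the same codes can be expressed as dictionaries and applied with `Series.map`. This is only a sketch on a three-row `sample` frame invented for illustration. The dictionary names `neighbourhood_codes` and `room_type_codes` are hypothetical, while the codes themselves follow the legend in this step.

```python
import pandas as pd

# Tiny illustrative frame standing in for the notebook's `df`.
sample = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Queens"],
    "room_type": ["Private room", "Entire home/apt", "Shared room"],
})

# Same codes as the legend in step 7.
neighbourhood_codes = {"Bronx": 1, "Brooklyn": 2, "Manhattan": 3, "Queens": 4, "Staten Island": 5}
room_type_codes = {"Entire home/apt": 0, "Private room": 1, "Shared room": 2}

# map() looks up every value in the dictionary; values missing from it become NaN.
sample["neighbourhood_group"] = sample["neighbourhood_group"].map(neighbourhood_codes)
sample["room_type"] = sample["room_type"].map(room_type_codes)

print(sample)
```

Because unmapped values become NaN rather than staying as strings, an unexpected category is easy to spot with `isna()` before the data type conversion in the next step.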
8. Change Data Types to Integer
df = df.astype("int64")
df.dtypes
neighbourhood_group    int64
room_type              int64
number_of_reviews      int64
availability_365       int64
price                  int64
price_category         int64
dtype: object
9. Split Data into Training and Testing
x_train, x_test, y_train, y_test = train_test_split(df[["neighbourhood_group", "room_type", "price"]], df.price_category, train_size = 0.8)
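The split above is random, so the scores that follow can change from run to run. The sketch below shows one way to make a split repeatable and to keep the class proportions equal in both parts, using the `random_state` and `stratify` parameters of `train_test_split`. The `toy` frame is synthetic data made up for this example, and the seed value 42 and the helper `split` are arbitrary choices, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's `df`: 30 rows, 10 per price category.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "neighbourhood_group": rng.integers(1, 6, size=30),
    "room_type": rng.integers(0, 3, size=30),
    "price": rng.integers(20, 400, size=30),
    "price_category": np.repeat([0, 1, 2], 10),
})

features = ["neighbourhood_group", "room_type", "price"]

def split(frame):
    """Hypothetical helper: an 80/20 split with a fixed seed and stratified labels."""
    return train_test_split(
        frame[features],
        frame["price_category"],
        train_size=0.8,
        random_state=42,                    # fixed seed makes the split repeatable
        stratify=frame["price_category"],   # keep class proportions in both parts
    )

x_train, x_test, y_train, y_test = split(toy)
x_train_again, _, _, _ = split(toy)

print(y_test.value_counts().sort_index())
print(x_train.index.equals(x_train_again.index))
```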
10. Test K-Nearest Neighbors Models
# K = 3
knn_3 = KNeighborsClassifier(n_neighbors = 3)
knn_3.fit(x_train, y_train)
knn_3.score(x_test, y_test)
0.9947503860010294
# K = 5
knn_5 = KNeighborsClassifier(n_neighbors = 5)
knn_5.fit(x_train, y_train)
knn_5.score(x_test, y_test)
0.9904271744724653
# K = 10
knn_10 = KNeighborsClassifier(n_neighbors = 10)
knn_10.fit(x_train, y_train)
knn_10.score(x_test, y_test)
0.9834276891405044
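The three cells above repeat the same steps for each value of K. A loop is one way to compare values of K more compactly, and cross-validation reduces the dependence on a single random split. The sketch below is not the notebook's procedure: it runs on synthetic data from `make_classification`, so its scores say nothing about the Airbnb dataset, and the five-fold setting is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class data with three features (a stand-in, not the Airbnb data).
X, y = make_classification(
    n_samples=300,
    n_features=3,
    n_informative=3,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)

# Mean accuracy over 5 cross-validation folds for each K tried in the notebook.
scores = {}
for k in (3, 5, 10):
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

for k, score in scores.items():
    print(f"K = {k}: mean accuracy {score:.3f}")
```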
11. Test Decision Tree Model
dtree = tree.DecisionTreeClassifier()
dtree.fit(x_train, y_train)
dtree.score(x_test, y_test)
1.0
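A perfect score is plausible here because `price_category` was derived from fixed price thresholds, which is exactly the kind of rule a decision tree can represent. The sketch below illustrates this on made-up data for a single group, using the Brooklyn private room thresholds (Q1 = 50, Q3 = 80) from step 4. It then prints the learned rules with `tree.export_text`. The price range 20 to 140 is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd
from sklearn import tree

# Made-up prices for one group; labels follow the Q1 = 50 and Q3 = 80 thresholds.
toy = pd.DataFrame({"price": np.arange(20, 141)})
toy["price_category"] = np.select(
    [toy["price"] < 50, toy["price"] > 80], [0, 2], default=1
)

toy_tree = tree.DecisionTreeClassifier(random_state=0)
toy_tree.fit(toy[["price"]], toy["price_category"])

# The printed rules show the split points the tree found between the categories.
print(tree.export_text(toy_tree, feature_names=["price"]))
print(toy_tree.score(toy[["price"]], toy["price_category"]))
```

Running the same `export_text` call on the notebook's `dtree`, with its three feature names, would show whether the tree has found splits close to the quartile values.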
12. Test Decision Tree Model with More Columns
x_train, x_test, y_train, y_test = train_test_split(df[["neighbourhood_group", "room_type", "number_of_reviews", "availability_365", "price"]], df.price_category, train_size = 0.8)
dtree.fit(x_train, y_train)
dtree.score(x_test, y_test)
0.9994853319608852
Conclusion
We compared two kinds of classification model, K-Nearest Neighbors and Decision Tree, on our cleaned dataset to answer our business question:
Is a proposed price for a given room type in a given neighbourhood underpriced, in price range, or overpriced?
We tested three K-Nearest Neighbors models with different values for K. The accuracy was very high in every case and decreased slightly as K increased.
- K = 3 - accuracy 99.5%
- K = 5 - accuracy 99.0%
- K = 10 - accuracy 98.3%
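We tested a Decision Tree model with default settings, and it scored 100% on the test set. (The split is random, so a different train/test split might produce an accuracy slightly below 100%.)
We tested what would happen if we added seemingly less relevant columns to the model, "number of reviews" and "availability 365": the accuracy dropped slightly, to 99.95%. This run also used a new random train/test split, so part of the difference may come from the split rather than from the extra columns.
The Decision Tree model using "neighbourhood group", "room type" and "price" was the most accurate model we found to answer our business question. This result is expected, because the price category was computed directly from these three columns using the quartile thresholds, so the tree only has to recover those thresholds.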
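To answer the business question for a single proposed price, a fitted model can be passed one row of features. The sketch below assumes the encoding from step 7 (Manhattan = 3, private room = 1) and the Manhattan quartiles from step 4. The training grid of prices from 10 to 400 is synthetic and stands in for the real listings, and the `LABELS` dictionary and the proposed price of 95 are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn import tree

LABELS = {0: "underpriced", 1: "in price range", 2: "overpriced"}

# Manhattan (code 3) quartiles copied from step 4, with room types encoded as in step 7.
quartiles = pd.DataFrame({
    "neighbourhood_group": [3, 3, 3],
    "room_type": [0, 1, 2],
    "Q1": [140.0, 67.0, 49.0],
    "Q3": [250.0, 120.0, 88.0],
})

# Synthetic training grid: every integer price from 10 to 400 for each room type.
prices = pd.DataFrame({"price": np.arange(10, 401)})
grid = quartiles.merge(prices, how="cross")
grid["price_category"] = np.select(
    [grid["price"] < grid["Q1"], grid["price"] > grid["Q3"]], [0, 2], default=1
)

features = ["neighbourhood_group", "room_type", "price"]
model = tree.DecisionTreeClassifier(random_state=0)
model.fit(grid[features], grid["price_category"])

# Proposed listing: a private room in Manhattan at a price of 95.
proposed = pd.DataFrame([{"neighbourhood_group": 3, "room_type": 1, "price": 95}])
prediction = int(model.predict(proposed[features])[0])
print(LABELS[prediction])
```

With the notebook's own `dtree` fitted on the three-column split, the equivalent call would be `dtree.predict(proposed[features])`.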