Header image

Introduction

Problem Statement

Zillow is a real estate company that has a niche in purchasing properties to rent out on a short-term basis. They want to target New York City to make investments in. In particular, the company wants to focus on 2 bedroom apartments as they have concluded that renting out 2 bedroom properties are the most profitable. We aim to build a data product to provide them with insights on which zip codes are the most profitable to invest in.

Data

The data we are going to using to perform the analysis comes from 2 different sources.

  1. Cost data - This dataset from Zillow, contains information about the median prices of apartments across America. For our particular analysis, we are focusing on the properties with 2 bedrooms and median prices up to June 2017.

  2. Revenue data - To decide which zip codes are profitable to invest in, we are looking at listings in Airbnb, a platform to rent our apartments/rooms from. This is publicly available data.

Assumptions

The following assumptions are made:

  1. The occupancy rate for any listing on Airbnb is 75%.

  2. The investor will pay for the property in cash (i.e. no mortgage/interest rate will need to be accounted for).

  3. The time value of money discount rate is 0% (i.e. $1 today is worth the same 100 years from now).

  4. All properties and all square feet within each locale can be assumed to be homogeneous (i.e. a 1000 square foot property in a locale such as Bronx or Manhattan generates twice the revenue and costs twice as much as any other 500 square foot property within that same locale.)

  5. The availability for 30, 60, 90 and 365 days are always updated. If a property is not available, we assumed it is prebooked for that period.

  6. One in every 5 bookings has an additional guest, and each booking lasts an average of 3 nights.

  7. We are assuming 0 value for tax, interest and maintenance cost for the properties.

  8. The properties are not being considered to be sold. So investment based on resale value is not considered.

  9. The median price is available for atleast 10 year period.

Proposed outcome

We would investigate the return on investment for properties in New York City, particularly for 2 bedroom apartments. The analysis would provide the projected property cost, yearly return and break even period by zip codes. We would also conduct exploratory analysis to see how the prices of properties for different zip codes have changed over the years, how the prices are affected by the neighbourhood and what zipcodes actually achieve our desired break even time. The results of our analysis could be used as a guide to making decisions on which areas to invest in.

Technical Set-up

Prerequisites:

  1. The config file - Configuration.R, Data Dictionary file - DataDictionary.csv must be placed in the current working directory. The current working directory can be accessed by executing getwd() or set to the desired location by setwd(“desired-file-path”)

  2. The output file zipcode_analysis.csv would be written to the current working directory as well.

  3. All configurable variables such as file names; state or city in focus; how many years in the future do we want to make projections for are present in the Configuration.R file

Global Variable declaration

 state_in_focus : NY 
 city_in_focus : New York 
 threshold : 85 
 no_years_projection : 3 
 occupancy_rate : 0.75 
 break_even_period : 20 
 freq_extra_guest : 5 
 avg_stay_duration : 3 
 no_of_rooms : 2 
 airbnb_file_name: listings.csv 
 zillow_file_name : Zip_Zhvi_2bedroom.csv 

Packages:

library(tidyverse) #used for data manipulation 
library(dplyr) # used for data manipulation
library(DT) #used for DataTable
library(zoo) #used for YearMon
library(ggplot2) #used for plotting graphs
library(reshape2) #used for reshaping dataframes
library(plotly) #used for plotting graphs
library(ggcorrplot) #used for plotting correlation matrix
library(readxl) #used for reading excel
library(knitr) #used to display an aligned table on the screen
library(kableExtra)#used to build with straightforward formatting options
library(scales) #used for formatting plot axis

Data Preparation

Data Source and Import

For our analysis, we are going to looking at data from 2 different sources, Airbnb and Zillow The dataset is spread across 2 files listings.csv, Zip_Zhvi_2bedroom.csv. The two datasets are first imported into data frames named property_estimate and airbnb_listing.

Dataset 1 - Zillow cost data

For the cost data, we are going to be looking at data available from Zillow. This dataset contains the estimate of the prices of properties across various zip codes(also called Region) with details about which county, city, and state the property belongs to, along with the median price per month for properties in that region through the years starting from April 1996 to June 2017.

Dimensions of the dataset

No of variables : 262

Number of observations : 8946

Apart from the property median price (in dollars) through the years, the dataset has the following variables:

## [1] "RegionID"   "RegionName" "City"       "State"      "Metro"     
## [6] "CountyName" "SizeRank"

Dataset 2 - Airbnb revenue data

For the revenue data we are using Airbnb data. This dataset contains the listing of properties rented out using the airbnb platform. The data is for listings in all of America, which geographical details such as zipcode, latitude, longitude, neighbourhood, city, state, etc. Details about the property such as description, number of bedrooms, number of bathrooms are also available in this dataset. Apart from these details, host details such as host id, host’s location, host’s response time along with various information about the host is also present in the dataset. The review details, availability of the property and the rental price of the properties are present. A detailed description of the variables is present in the metadata details file.

Dimensions of the dataset

No of variables : 106

Number of observations : 48895

Data Cleaning

As we saw from the data import part, we have a large dataset with a lot of variables. Before we start analyzing what factors affect if a property is a good investment or not, we need to ensure that we have good clean data to work with.

There are a few characteristics we are going to focus on while cleaning the data.

  1. Completeness - As is the nature of real-world data, we could have incomplete data. Through data cleaning, we aim to impute the null values or remove them as required.

  2. Validity - Through validity checks, we aim to ensure that each variable is of the datatype that is appropriate for the value it holds; ensure data is in the range that is meaningful and enforce uniqueness when appropriate.

  3. Consistency and Uniformity - Since we are going to be combining data from 2 different data sources, we need to ensure that the units are uniform across both the sources.

Property_estimate

The dataset contains information for all cities across the US. Before we filter for observations just in New York City, we do a few data quality checks.

  1. Null values in the non-price variables.
  2. If the datatypes are all that we expect them to be
  3. All the zip codes are unique
  4. All prices values are non-negative
  5. Column names are appropriate

We observe that the above checks reveal that the values are as we expect them to be. However, there are null values in the median price value columns. We want to see what percentage of values at each time period are null values. we do this comparison for all values vs just New York City values.

We observe from the graph that for earlier time periods, there are a lot of missing values, especially for New York City. Given how large the percentage of missing values are, it wouldn’t be prudent to impute them from the available values, so we move to discard those values and consider values for the last 10 years from around 2007 to 2017.

For our analysis, we need a single value of the median cost at the current time. We look at the pattern of the median prices through the last 10 year time period to make a call on how to calculate the median value at the current time period. We shift our focus to only New York properties now. After the filtering, the dataset now has 25 zipcodes’ from New York City’s observations.

Visualizing trend of median price over the years per zipcode

From the graph we observe that the prices have gone up through the years. Since one of our assumptions is that the time value difference of money is zero, this change must be caused by other factors. The change in rate is not uniform, so we need to consider this while predicting the median cost at the time of investment.

We finally create a dataset from the property_estimate data frame that has a subset of its’ columns. We’d be using this dataset, called cost_data to combine with the cost to calculate which investment would yield the best profit. The following columns will be pruned from the final dataset.

  1. RegionId - An internal Zillow assigned number. This has no impact on the price or cost
  2. 1996-4 to 2017-06 - The median price through the years. Since we have already analyzed these columns and derived a price column, these columns would not be in the final dataset.

Calculating median price at current time

To calculate the median cost at the current time, we calculate the delta between the price during 2017 and price during 2007, to see what pattern the median price has been following and calculate the median price at current time based on the delta.

price_delta= (2017 median price - 2007 median price)/10
median_projection= 2017 median price + (price_delta * no_years_projection)

Airbnb_listing

The airbnb_listing dataset has information regarding all the listings of properties listed in the Airbnb platform. The dataset has listing details of 48895 properties spread across 106 attributes.

We want to focus on attributes and observations that would benefit our analysis. So we first perform a sanity check, similar to that of the property_estimate dataset.

  1. There are few columns with a large number of null values. We remove columns that have over 85% of missing values.
  2. The only column with negative values is the longitudes. But this is acceptable since it is within the permissible range for longitudes.
  3. All the datatypes of the columns are appropriate for the data they hold.
  4. There are null values in cleaning fees and security deposit. Hosts have an option to either have these payments or not. A null value in these columns implies that the host isn’t charging the guests. Therefore these values can be imputed to zero.

Missing zipcode values

One important observation from the description table is that there are null values present in the zipcode value. Although it is a negligible percent 1.06% there still is a risk of missing integral data. To evaluate how much data we’d lose, we visualize the number of missing values in zip code per state.

We see that out of 517 zipcodes, most of the missing values are from New york city. For now, we remove these values, because imputing might lead to erroneous data.

Another interesting observation from the pie chart is that we see there are multiple entries for the same city. This indicates that the data is not standardized.

Attribute reduction

We now shift our focus on reducing the number of attributes. We consider the following aspects:

  1. Variance - If all the values in a column are the same or completely different (excluding textual columns like amenities, description), they contribute very little to our analysis.

  2. Internal purpose variables - Some of the variables from the dataset are meant just for internal housekeeping purposes. These variables would not have an effect on the analysis, hence these values are removed as well.

  3. Geographical data - There are over 13 columns that are descriptive of the geographical location of the property. Some of these columns have messy data - multiple representations of the same value. Considering that this dataset would be combined with the Zillow’s dataset where the entries are some standardized, removing some of these columns would help the data be less messy while still retaining the information they provide.

Atttribute correlation

Various factors could affect the rent of a property, other than just the location. Before we filter our revenue data to 2 bedroom properties in New York, we want to analyze the entire dataset to see what factors could affect the price. We also want to see if there are any other factors that have an effect on each other.

We see from the graph that the rental price of the apartment has little or no correlation with the number of reviews it garners. However it does have a positive correlation with the cleaning fees, the number of bedrooms and how people can be accommodated in the house.

Also there seems to be a correlation between the number of reviews/review ratings and the availability of an apartment for 30 or 365 days. An inference could be that higher the number of reviews a property has, the higher is the chance of it being pre-booked.

Filtering for New York City 2 bedroom properties

We now want to narrow our revenue data only for New York City. We use the zipcodes from cost data to do this. It has also been established that 2 bedroom apartments are the most profitable for investment. After filtering we see that there are 48378 2 bedroom properties listed in New York City.

Data Merge

We now have clean revenue and cost data. In order to analyze which zipcodes to invest in, we need to merge these 2 datasets together. We use the zipcode column in both datasets, RegionName in cost_data and zipcodes in revenue data. The merged dataset is called cost_revenue_analysis and has attributes 65 for observations : 1565 properties.

In order to analyze which zip codes are the best, we first look at the income from the properties. There are 3 variables that affect revenue.

  1. The price per night
  2. Cleaning fees per stay
  3. Cost of additional guests

But these 3 values are in different units. So we first standardize them to yearly incomes.

Price per night

We have assumed 75% occupancy rate. So the income per year would be

price_yearly= price per night * occupancy rate * 365

Cleaning fees

Cleaning fee is the fee a host can opt to charge a guest to clean up the house after each visit. Some guests opt to include separate cleaning fee while some have it included in their price per night. Assuming that each stay in a property lasts for 3 night, the cleaning fees per year for a property is given by

cleaning_fees_yearly= ((occupancy rate * 365)/average stay duration)*cleaning fees

Cost of additional guests

The price per night stated by each host is for a number of guests that the host specifies. But if there were more guests renting the property, hots can specify the cost per additional guest. We have assumed that on average, one in every 5 times a property is rented, there would be one additional guest.

additional_guest_cost_yearly= ((occupancy rate * 365)/frequency of additional guests * cost for additional guest per night

Finally we create a column for the yearly revenue that is the sum of the 3 sources of revenue

Break Even Period:

We want to calculate how long it would take to start making a profit on the investment made on a property, ie the break-even period. To calculate this we take a ratio of the median price of a property with the yearly revenue from a property. We round this up to the nearest integer to get the break-even period in years.

break_even_years= ceiling(median property cost/ yearly revenue)

Data Dictionary

Variable Description
zipcode Zip code where the property is located.
name Header created by host to entice potential occupants to stay at their property
summary Description provided by the host of their property
space Description provided by host describing the space (bedrooms, bathrooms, sqft, etc) of their property
description Detailed information about the property, neighborhood, amenities, location, and anything else the host feels would be beneficial to share
neighborhood_overview Additional detail about the neighborhood including sites to see, restaurants, noise, landmarks around, etc.
notes Any other additional notes the host would like to disclose, which was not provided in the other description information.
transit Host can provide the kind of transportation available nearby including bus, trains, airport, etc..
access Description on how to enter the home, what area of the home is available for use, and what you are able items you are able to use during your stay.
interaction The availability of the host; could be in person, phone, email, text, etc. The host provides guidelines of what the occupant can expect for communication and if they expect to meet.
house_rules The host provides rules of the home that they expect the occupant to adhere to. Examples include clean up, laundry, dishes, etc.
street Street address where the property is located
neighbourhood_group_cleansed Name of the area where the property is located.
latitude The angular distance of a place north or south of the earth’s equator, expressed in degrees and minutes.
longitude The angular distance of a place east or west of the meridian at Greenwich, England, expressed in degrees and minutes.
is_location_exact Indicator to validate if the latitude and longitude is exact or not. t=yes f = no
property_type Indicates the type of house is this is considered. Examples are house, apartment, townhomes, etc…
room_type Indicates specific rooms available for rent and/or the entire home.
accommodates Indicates the amount of people the host is allowing to occupy the property.
bathrooms Indicates the number of bathrooms within the property.
bedrooms Indicates the number of bedrooms within the property.
beds Indicates the number of beds within the property for use.
bed_type Indicates the type of bed the property has. ie. futon, real bed, airbed, etc.
amenities Provides the potential occupant the type of convenience items included as part of rent. Items include cable television, wifi, toiletries, food, etc..
price Price the host is charging to stay per night.
security_deposit Amount the host would like upfront for any incidentals (should you damage anything during your stay).
cleaning_fee Price the host is charging to clean up after your stay.
guests_included Indicates the number of guests the host is allowing you to bring along.
extra_people Additional charge per additional guests you bring.
minimum_nights Minimum amount of nights the host is willing to rent out the property.
maximum_nights Maximum amount of nights the host is willing to rent out the property.
minimum_minimum_nights Minimum amount of nights the host is willing to rent out the property.
maximum_minimum_nights Maximum amount of nights the host is willing to rent out the property.
minimum_maximum_nights Minimum amount of nights the host is willing to rent out the property.
maximum_maximum_nights Maximum amount of nights the host is willing to rent out the property.
minimum_nights_avg_ntm Minimum nights in average for the last 12 months
maximum_nights_avg_ntm Minimum nights in average for the last 12 months
calendar_updated Indicates when the host has updated the availability of when the property is available to rent.
availability_30 Indicates the number of days the property is available for rent within 30 days.
availability_60 Indicates the number of days the property is available for rent within 60 days.
availability_90 Indicates the number of days the property is available for rent within 90 days.
availability_365 Indicates the number of days the property is available for rent within 365 days.
number_of_reviews Number of reviews received for the property for its entire existence within AirBnB.
number_of_reviews_ltm Number of reviews received for the property in the last 12 months
first_review Date of when the first review was provided.
last_review Date of when the last review was provided.
review_scores_rating Overall score given based on accuracy, cleanliness, check-in, communication, location, and value.
review_scores_accuracy Score based on the accuracy of the description and pictures provided in the listing of the host.
review_scores_cleanliness Score based on the cleanliness of the property when occupant arrived property.
review_scores_checkin Score based on the ease of checkin when the occupant arrived to the property.
review_scores_communication Score based on the communication between host and occupant before and during their stay.
review_scores_location Score based on the location of the property.
review_scores_value Score based on the price charged for the property based on everything that is offered.
instant_bookable Indicates if the property can be booked immediately or is currently unavailable to book.
cancellation_policy Indicates how flexible the host is for cancelling the reservation. Values range from flexible to superstrict.
require_guest_profile_picture Requires the occupant to provide a picture in order to reserve the property.
require_guest_phone_verification Requires the occupant to provide a valid phone number in order to reserve the property.
reviews_per_month Number of reviews the host has received on average per month.
City City of where the property is located.
State State of where the property is located.
Metro General name of the surrounding area where the property is located.
CountyName Political and administrative division of a state, referred to as a particular part of the state.
SizeRank Population of the area; the lower the number the greater the population.
price_delta Change in median price over the last 10 years
median_projection The projected median price for the desired year
price_yearly The nightly price for a property calculated for a year
cleaning_fees_yearly The average cleaning_fee for a property calculated for a year
extra_guest_yearly The average extra guest charge for a property calculated for a year
revenue_yearly The average revenue from a property calculated for a year
break_even_years The number of years by when the investment is obtained back

Data Analysis & Insights

The cost_revenue_analysis now holds clean and standardized data. This dataset now contains properties details spread across various zip codes and neighbourhoods, with the revenue data, along with the median price values as well. To highlight what zipcodes would make for good investments we focus our analysis on 3 aspects

  1. Analysis based on neighbourhood
  2. Analysis of Investment vs Return
  3. Analysis based on Break-even period

Analysis by neighbourhood

Before we dive into zip code analysis, a rudimentary analysis of how the properties are faring based on the neighbourhood is done. We want to check how many properties are there in each neighbourhood, how are they generally priced, are the reviews affected by the neighbourhood and the demand for renting.

Observations :

  1. We see that most of the properties listed are in either Manhattan or Brooklyn.

  2. Surprisingly all the neighbourhoods on an average have high ratings. This could either be biased data or people are really enjoying their stay in New York City rentals.

  3. The average price is relatively higher in Manhattan and Brooklyn, but this surprisingly hasn’t affected their popularity. Brooklyn and Manhattan properties on average are being booked much in advance.

From an initial analysis, it seems like Brooklyn and Manhattan are good neighbourhoods to have an Airbnb property in.

Analysis of investment vs return

Just because a property yields high revenue per year doesn’t make it a good investment. Similarly, a property that is low cost but yields very low revenue is a poor investment as well. So we now focus our analysis on visualizing the return of investment vs the cost of properties. The zipcodes are split into 4 quadrants according to whether the revenue and cost fall above or below the mean value of all zip codes.

Analysis of Break-even point

Break-even analysis is essential for any good business plan since it helps businesses plan their income and expenditures. Properties will extremely high break even point even if the revenue it generates is quite high might not be a good investment. So we rank the zipcodes based on their break-even point to assess what zip codes are good choices for investing in. Here we are assuming 20 years is a good break-even period measurement.

Observations :

The graph lists the zip codes in increasing order of its break-even year. The zipcodes towards the top are considered good investments, whereas the properties towards the bottom might do poorly.

Future actions

To predict with even higher accuracy which zip codes will be good investments, we recommend the following future actions, to add on to our data product:

  1. Reverse engineer the missing zip code values from available data or from an external reliable source:

    We are given the latitude and longitude values for the missing values of the zip codes. We could reverse engineer the zipcode values. Or if we need 10% accuracy for the zip codes, we could use external APIs available or external sources to populate the missing zip codes

  2. Analyze what factors actually affect the occupancy rate:

    We have assumed the occupancy rate here to be 75% but in reality, the occupancy rate is affected by a lot of factors such as amenities provided, distance to public transport, view from the room, etc. From the airbnb_listing data we have, we could do a market survey and predict occupancy rate

  3. Natural Language Processing

    There are a lot of columns with textual content. Mining these fields could provide meaningful insights. Performing natural language processing on fields like description, or amenities provided, for example, could help identify what features attract renters.

  4. Monetary assumptions

    We have made assumptions such as zero time value of money, no taxes, no maintenance cost. But in reality, these factors would affect the revenue generated. Diving deeper into these values could help our prediction be more accurate.

  5. Resale value

    Most real estate companies aim to sell the properties they hold to reinvest in other properties. As an addendum to our product, we could consider the resale value of properties while ranking the zip codes. Zip codes, where median prices are increasing might make for good investments where zip codes, where the property values have been falling might yield poor returns.