geektimes

Using Data Science for house hunting in Montreal

  • понедельник, 2 марта 2020 г. в 00:12:28
https://habr.com/en/post/490546/
  • Data Mining
  • R
  • DIY


Introduction


I happen to live in Montreal, in my condo on the edge of McGill Ghetto. Close to Saint Laurent Boulevard or the Maine as locals call it, with all it's attractions — bars, restaurants, night clubs, drunken students. And once upon a time, on a particular lively night, listening to the sounds of McGill frosh students drunkenly heading home after hard night of studying. I thought, that it might be a good idea to move into my own house, a little bit further away from the action.


Image


It was not my first rodeo, buying a real estate in Montreal, but first time buying a house. So, I decided to do a little bit of research, before trusting my money to a real estate agent. I quickly realized that I can't afford a house anywhere close to the subway station on the Island, but I could possible afford a duplex or a triplex, where tenants would be covering part of my mortgage.
The solution to this problem depends not only on the price of the house, but also on the rent or potential rent that the tenants could be paying.


So, being a visual person with background in research, I wanted to see a visual map of how much things cost around the island, and how much revenue I could get. In the States, and even in Ontario there are services like Zillow that can show some of the information, but for Montreal I couldn't find anything, apart from the realtor association APCIQ. Maybe my preference of using English language is to blame.


So, after a few weeks of studying realtor.ca and kijiji, I wrote a python script to scrape information from them, using some resources I found on github: https://github.com/Froren/realtorca. Also, city of Montreal have an open data web site, that helps to fill-out some blanks.


After the data is collected by webscrappers it is processed in R, using tidy-verse, Simple Features for R. I found excellent resources on how to process geospatial information in R: Geocomputation with R, I used ggplot2 to make graphs and thematic maps for map making.


Now I have more then a year worth of data to study.


Data pre-processing


I preprocess the data by converting it into simple-features format first, and then changing the geographic coordinate reference system (longitude and latitude) to North American projection for Quebec and Ontario


library(tidyverse)
library(sf)

property<-read_csv("....") %>% 
 st_as_sf(coords=c("lng","lat"), crs=4326) %>% 
 st_transform(crs=32188)

Condo price


First I wanted to evaluate how much I could get for my condo. I need to define my neighborhood and find all the condos for sale around me.


Neighborhood map


neighbourhood<-geojson_sf("quartierreferencehabitation.geojson") %>%
 st_transform(32188) %>% 
 filter(nom_qr %in% c("Saint-Louis", "Milton-Parc")) %>% 
 summarize() %>% 
 st_buffer(dist=0)

Selecting condos for sale.


neighbors <- st_join(property, neighbourhood, left=F)

Using a basemap from openstreetmap.


osm_neighbourhood<-read_osm(st_bbox(neighbourhood%>%st_transform(4326)), ext=1.5, type="esri")

Drawing results using tmap package.


library(tmap)
library(tmaptools)

tm_shape(osm_neighbourhood) + tm_rgb(alpha=0.7)+
  tm_shape(neighbourhood) + tm_borders(col='red',alpha=0.8)  + 
  tm_shape(neighbors) + tm_symbols(shape=3,size=0.2,alpha=0.8) +
  tm_shape(ref_home) + tm_symbols(col='red',shape=4,size=0.5,alpha=0.8)+
  tm_compass(position=c("right", "bottom"))+
  tm_scale_bar(position=c("right", "bottom"))

image


Neighbourhood condo prices


Now I can show the prices, and see how the depend on condo surface area and if there is a parking lot. And If i use a simple linear regression I can get the first approximation of what my condo might be worth.


image


Linear model


More formally I can use linear model to predict price and confidence intervals


model_price_lm <- lm(mprice ~ parking:area_interior , data=neighbors_)

## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                41861.30   22421.28   1.867   0.0628 .  
## parkingFALSE:area_interior   436.65      23.56  18.530   <2e-16 ***
## parkingTRUE:area_interior    511.95      19.40  26.393   <2e-16 ***

So, in my neighborhood every square foot in a condo without parking adds 437$ to the base price of 42k$, and with parking it is 512$ per square foot. And now I can make a prediction of the price: 443k$ with confidence interval [422k$, 465k$]


However, if I look at the difference between what my model predicts for all the condos in the neighborhood and the prices, I can see that error depends on the predicted value:


image


Therefore violating one of the conditions where simple linear regression can be used. This kind of behaviour is called overdispersion, and there are several ways of dealing with it. In particular, I found in the literature that I should be using a generalized linear model with inverse Gaussian distribution for errors and logarithmic link function.


Generalized linear model


image


The estimate using generalized linear model is following:


model_price_glm <- glm(mprice ~ parking:area_interior , data=neighbors_, 
                       family=inverse.gaussian(link="log"))

Which gives prediction 436k$ [422k$, 452k$]


Note that I am ignoring number of rooms, floor of the building and the location of the condo for simplicity. It is possible to plug them all in into the regression, but it will increase number of parameters and make modelling results more difficult to interpret. Also, many parameters are correlated, for example bigger apartments tend to have more rooms and there a more of them with parking.


Now, for the sake of simplicity of comparing different properties, I could estimate price per square foot, and how it is affected by different factors.


Again, using generalized linear model with inverse Gaussian distribution and log link:


price per square foot


image


It's easy to make sense of the regression results:


print(exp(model_psqft$coeff))

## (Intercept) parkingTRUE   bedrooms2   bedrooms3   bedrooms4 
## 501.7826165   1.1215192   0.9769839   0.9818974   0.8349424

So, the square foot is worth 501$, parking adds 12%, two bedrooms reduce price by 2.4%, three bedrooms by 1.2%, four bedrooms 17% (given the same total price).


The predicted price of my condo is: 431k$ [414k$, 449k$]


Longitudinal condo price model


All my previous models are showing results based on the condos on the market during the last year, without
trying to account for the price change. It would have been interesting, how the price change with time.
I have no idea how prices should behave, there is no reason to think that there is a steady linear trend, considering
seasonal rise and fall in prices, so first, I could just smooth the data using loess function.


Loess smoothing


If I pile all the data together:
image


But if I try to separate by number of bedrooms, the results are kind of random, since the data
might be too sparse.
image


So, it seems that I would rather want to have an overall smooth variation in price, while taking into account some features of the condos: i.e there is actually no reason to think that two bedroom condos are gaining in value slower then three bedroom ones. But there is variation of the proportion of different appartments with time, which would bias the results.


So, I am going to use generalized additive models where I can model overall change of price using a smooth function, while taking into account difference between different kinds of condos.


Longitudinal condo price model:GAM model


# price model with time
model_psqft_t <- gam(price_sqft ~ bedrooms + parking + s(start_date, k=24) ,
          data=neighbors_, bs="cr",method='REML',
          family=inverse.gaussian(link="log"))

It still looks like the prices are going up.
image


Using this model, the prediction of the price is 468k$ [435k$, 503k$]


How long would it take to sell


Another important question — how long would it take to sell? For this one can use
survival analysis
Technically, it looks like some types of condos sell faster then others, but the difference is not big.
It looks like half of the condos disappear from the market within 60 days :


image


Plex price estimate


Similarly, when I am looking at the potential plex I would like to know how much houses cost in the neighborhood. Let's say within 2km radius of the plex I was interested at some point:


image


The price distribution is


image


Here i can see that the seller is asking slightly more then what is the average for neighborhood, but
at the same time the variability is quite high. For plexes many more parameters are important then
for condos, like the size of the backyard, which year the building was built and how much
existing tennants are paying.


Using similar GLM model as for condos, the estimate for the price is the following: 567к$ [522k$, 616k$]


To estimate the rentals prices in the neighborhood I can find all the appartments listed on Kijiji during last year close by.
image


The price distribution gives me idea how much I could be potentially getting from the tennants. Of course there might be existing tenants already, so it would show me if what they are paying is close to what's currently on the market.
image


Spatial prices


Average over neighborhood


Remember, my original question was to see the map of the prices in Montreal. The simplest would be
to calculate median rental prices per neighborhood and show it on the map, like following:


rent_by_quartier<-aggregate( kijiji_geo_p%>%filter(bedrooms==2) %>% 
 dplyr::select(price), mtl_p,median, join = st_contains) 

image


Since I am not actually looking everywhere on the island, here is the central part. Blue cross is where I go for work.


image


This map looks interesting, but it seem unrealistic to ussume that there are going to be sharp borders on the edges of neighborhoods. So, I would prefer to use a method that allows for smooth spatial change in prices. I can actually again use generalized additive models, as for the time course estimate, but with spatial coordinates.


Rental prices spatial gam model


model_rent_geo_whole<-gam(price~bedrooms+s(x,y,k=100),
        data=rent,bs="cr",method='REML',
        family=inverse.gaussian(link="log"))

image


Rental prices in the central area, which is more interesting for me.


image


Plexes price spatial model


In a same fashion, I can model distribution of the prices per square foot for triplexes with 3br main apartment and parking.
image


Surface area for a triplex with 3br and parking


Now that I have spatial price distribution, I can also model surface area distribution. This,
technically can be done using data from the city website.
But for this example I am using only property that was on the market


image


Triplex Profitability (rent per year/triplex total price)


This way I can roughly estimate profitability of triplexes in different parts of town.
By calculating a total price and dividing by the potential income of two two-bedroom
apartments rented for the year. Of course this is very rough estimate, since
I am assuming that all triplexes will have two 4 1/2 apartments for rent.


image


Plex Longitudinal price model: Plateau, Ahuntsic, Rosemont, Villeray


Finally, using the same idea that was used for tracking condo price during the year,
I can track plexes prices in the boroughs that were interesting for me


image


Conclusions


I did this research to study the distribution of prices in Montreal and to familiarize myself with geospatial modelling in R. I didn't have access to the actual sale prices, so the results should be taken with a grain of salt.


Source code and data


The complete source of scripts used for this publication is publicly available on github: (https://github.com/vfonov/re_mtl), version of this article rendered using rmarkdown is available at http://www.ilmarin.info/re_mtl/stats_eng.html


Interactive map of prices distribution


Results are also shown in an interactive dashboard on (http://www.ilmarin.info/re_mtl/)