Predicting House Prices in Lagos State with Python

House prices in Lagos State are a major problem, especially when you don't know what to expect. I, for one, would like to know what houses cost in specific areas, so I can decide whether they are worth my time, given my budget.

This article analyzes the cost of rent in specific areas of Lagos State and presents a machine learning model that predicts rent prices based on a handful of variables.

Getting the Data

For this analysis, the data was scraped from a real estate company in Nigeria, project.pro.ng. You can download the data here.

To avoid making this article unnecessarily long, we will not explain what each column represents. However, you can look up the data on Kaggle for more details. Here is what our data looks like:

# Importing pandas and reading the data file
import pandas as pd

df = pd.read_csv('lagos-rent_renewed.csv')
df.head(5)

Cleaning the Data:

Certain issues need to be handled to ensure the data is clean. Here are some of the irregularities we will be handling:

Dropping Non-Residential locations

The data included non-residential locations such as schools, churches, warehouses, office apartments, etc. Our first step in data cleaning was to filter out these locations.

# Locating non-residential buildings in the dataset
commercial_locations = df[df['More Info'].str.contains(
    'COMMERCIAL|WAREHOUSE|CONFERENCE|LAND|CHURCH|EVENT CENTRE|WORKING SPACE|SCHOOL|OFFICE',
    case=False)]
df = df.drop(commercial_locations.index, axis=0)
df.head()

Dropping Unwanted Columns

We do not need the ‘Title’ and ‘More Info’ columns; they repeat details already captured in the City and Neighborhood columns.

# Dropping the Title and More Info columns, as they repeat other columns
df = df.drop(['Title', 'More Info'], axis=1)
df.head(5)

Cleaning Specific Columns

The Price, Bedrooms, Bathrooms, and Toilets columns all contain strings and are stored as object data types. To make these columns compatible with our machine learning model, we extracted just the numeric values from each column, removed the comma separators in the Price column, dropped all rows with null values, and converted the data types to integers.

# Extracting the numeric values in the columns
df['Bedrooms'] = df['Bedrooms'].str.extract(r'(\d+) beds')
df['Toilets'] = df['Toilets'].str.extract(r'(\d+) Toilets')
df['Bathrooms'] = df['Bathrooms'].str.extract(r'(\d+) baths')
df['Price'] = df['Price'].str.extract(r'([\d,]+)')

# Removing the comma separators from the Price column
df['Price'] = df['Price'].str.replace(',', '')

# Dropping rows with null values (dropna returns None when inplace=True,
# so we reassign instead of using both; Price is included so the integer
# conversion below cannot fail on missing values)
df = df.dropna(subset=['Price', 'Bedrooms', 'Toilets', 'Bathrooms'])

# Changing the datatypes of the columns to integers
data_types = {
    'Price': int,
    'Bedrooms': int,
    'Bathrooms': int,
    'Toilets': int
}
df = df.astype(data_types)

df.head()

Handling outliers

It is generally recommended that exploratory data analysis (EDA) be done before removing outliers, so that you get a complete picture of the data before modeling. However, our data has some extreme cases, with the cheapest houses listed at 0 naira and the most expensive at 7 billion naira. These prices are unrealistic for rent in Lagos, so we removed the outliers using the interquartile range (IQR) method to get a better grasp of the data.

# Defining a function to cap outliers
def remove_outlier(data):
    '''
    Caps the outliers in a dataset column, replacing values beyond the
    upper and lower limits computed with the Interquartile Range (IQR) method.
    data = df['column']
    '''
    # Calculating the first quartile of the column
    Q1 = data.quantile(0.25)

    # Calculating the third quartile of the column
    Q3 = data.quantile(0.75)

    # Calculating the interquartile range
    IQR = Q3 - Q1

    # Computing the lower limit
    lower_limit = round(Q1 - 1.5 * IQR)

    # Computing the upper limit
    upper_limit = round(Q3 + 1.5 * IQR)

    # Replacing the upper outliers with the upper limit
    data.loc[data > upper_limit] = upper_limit

    # Replacing the lower outliers with the lower limit
    data.loc[data < lower_limit] = lower_limit

    return data

# Capping the outliers in each column
df['Price'] = remove_outlier(df['Price'])
df['Bedrooms'] = remove_outlier(df['Bedrooms'])
df['Bathrooms'] = remove_outlier(df['Bathrooms'])
df['Toilets'] = remove_outlier(df['Toilets'])

Handling the Price Column

The Price column had a minimum value of N1, which is unrealistic in the Lagos context. To handle this, we removed every row with a price below N100,000. This gives us a better representation of actual rents in Lagos State.

# Taking out rows with unrealistic prices
too_cheap = df[df['Price'] < 100000].index
df = df.drop(too_cheap, axis=0)
df.head()

EDA (Analysing Rents in Lagos)

To better understand the data, we split the rents into two groups: cheap locations and expensive locations. To achieve this, we computed the average cost of rent for each city in Lagos. Here is what the visualization looks like:

import matplotlib.pyplot as plt
import seaborn as sns

# Grouping the cities by their average prices
cities = df.groupby('City')['Price'].mean().round(2).sort_values(ascending=False).reset_index()

# Visualizing the cities
plt.figure(figsize=(10, 6))
sns.barplot(x='City', y='Price', data=cities, color='orange')

# Customize the plot
plt.xlabel('Cities')
plt.ylabel('Prices')
plt.title('Average cost of rent in various cities in Lagos')

# Show the plot
plt.show()

We can see that the data itself is somewhat split in two. Thus, our Cheap locations are Ojodu, Gbagada, Ajah, Surulere, and Yaba, and our Expensive locations are Ikoyi, Island, Lekki, and Ikeja.

The cheapest locations in Lagos:

Here, we identify the cheap locations in the dataset, sort them by price, and create a chart showing the bottom 10. Here is what the code looks like:

# Isolating the cheap cities in the dataset
cheap_houses = df[df['City'].str.contains('Ojodu|Gbagada|Ajah|Surulere|Yaba', case=False)]

#Ranking the locations by rent prices
cheap_locations = cheap_houses.sort_values(by='Price', ascending=False)[['Neighborhood', 'Price']]


# Getting the locations with the cheapest rent prices
top_10_cheapest = cheap_locations.tail(10)

# Plotting a bar chart to visualize the data
plt.figure(figsize=(10, 6))
sns.barplot(x='Price', y='Neighborhood', data=top_10_cheapest, color='lightblue')

# Customize the plot
plt.xlabel('Prices')
plt.ylabel('Locations')
plt.title('Cheapest Locations in Lagos')

# Show the plot
plt.show()

Most expensive locations in Lagos:

Similar to what we did for the cheap locations, we identify and sort the expensive locations by price, then create a chart showing the top 10 most expensive.

# Isolating expensive locations in the dataset
expensive_houses = df[df['City'].str.contains('Ikoyi|Island|Lekki|Ikeja', case=False)]

# Ranking the locations by rent prices
expensive_locations = expensive_houses.sort_values(by='Price', ascending=False)[['Neighborhood', 'Price']]

# Getting the locations with the most expensive rent prices
top_10_expensive = expensive_locations.head(10)

# Plotting a bar chart to visualise the data
plt.figure(figsize=(10, 6))
sns.barplot(x='Price', y='Neighborhood', data=top_10_expensive, color='lightblue')

# Customize the plot
plt.xlabel('Prices')
plt.ylabel('Locations')
plt.title('Most Expensive Locations in Lagos')

# Show the plot
plt.show()

Other findings

  • The average cost of rent in the cheap locations in Lagos state is about 1.3 million naira, and expensive locations cost about 4.6 million naira.
  • You can get a one-bedroom apartment for an average of about 500,000 naira, and a two-bedroom for about 1.1 million naira, in cheap locations. In expensive locations, one-bedroom apartments average about 1.3 million naira and two-bedroom flats about 3.3 million naira.
  • Furnished apartments in cheap locations cost about 1.3 million naira on average; in expensive locations, the figure is 4.8 million naira.
  • The average rent for serviced apartments is about 1.5 million naira in cheap locations and 5.3 million naira in expensive locations. Non-serviced houses cost about 1.3 million and 4.2 million naira, respectively.
  • Newly built houses are generally more expensive in both groups, averaging 1.4 million naira in cheap locations and 4.9 million naira in expensive ones. Older apartments cost about 1.3 million and 4.4 million naira, respectively.

To keep this article from becoming too long, we will not show visualizations of these findings here. See the project notebook on GitHub for a fuller picture of the findings.
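As a rough sketch of how these group averages could be reproduced (assuming the dataset has 0/1-valued 'Serviced', 'Furnished', and 'Newly Built' columns; the exact names may differ), the groupby pattern looks like this:

# A sketch of the groupby pattern behind the findings above.
# Assumes a 'Serviced' column (and similarly 'Furnished', 'Newly Built')
# exists with 0/1 values; adjust the names to match the actual dataset.
cheap_cities = 'Ojodu|Gbagada|Ajah|Surulere|Yaba'
df['Group'] = df['City'].str.contains(cheap_cities, case=False).map(
    {True: 'Cheap', False: 'Expensive'})

# Average rent per group
print(df.groupby('Group')['Price'].mean().round(2))

# Average rent per group, split by number of bedrooms
print(df.groupby(['Group', 'Bedrooms'])['Price'].mean().round(2))

# Average rent per group for serviced vs non-serviced houses
print(df.groupby(['Group', 'Serviced'])['Price'].mean().round(2))

# Dropping the helper column so it does not leak into the model later
df = df.drop(columns='Group')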

Relationship between the Variables

To explore this, we will plot a heatmap that shows the correlation between each pair of variables in our dataset. With this, we can tell which variables in the dataset may influence our target variable (Price).

Correlation values range from -1 to 1: values close to 1 indicate a strong positive correlation, values close to -1 a strong negative correlation, and values near 0 little or no correlation.

# Plotting a correlation heatmap (numeric_only avoids errors on text columns)
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()

Pay attention to the Price column: Bedrooms, Bathrooms, and Toilets show a stronger correlation with Price, while the Furnished and Newly Built variables show very little or even negative correlation with prices.
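If you prefer reading the numbers directly rather than off the heatmap, here is a small sketch that sorts each numeric column's correlation with Price:

# Listing each numeric column's correlation with Price, strongest first
print(df.corr(numeric_only=True)['Price'].sort_values(ascending=False))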

Let’s visualize the relationship between the predictor features (the columns with positive correlation) and the target feature (Price):

# Identifying the positively correlated variables
columns = ['Bedrooms', 'Bathrooms', 'Toilets']

# Creating a pair plot (pairplot creates its own figure,
# so a separate plt.figure call is not needed)
sns.pairplot(df,
             x_vars=columns,
             y_vars='Price',
             diag_kind='hist',
             kind='reg')

# Displaying the chart
plt.show()

You can see that all the predictors have a positive correlation with prices. Note: columns with negative or weak correlation would normally be dropped, as they do not influence our target variable (Price). However, they are needed for our prediction webpage, so we will not remove them.

Splitting the Data:

We are now at the data preprocessing stage of our machine learning workflow. We will split the dataset into independent and dependent variables: our dependent (target) variable is Price (y), and our independent variables (x) are every other column in the dataset.

# Separating the data into independent (x) and dependent (y) variables
x = df.drop(columns= ['Price'])
y = df['Price']

Now that we have both variables defined, we need to split the data into a training set and a test set. This is important because we need to test our model on data it has not seen before, to make sure it generalizes well. We will use 20% of the data as our test set and train the model on the remaining 80%.

from sklearn.model_selection import train_test_split
# Splitting data in train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

One-Hot Encoding and Standardizing:

Our Neighborhood column has an object data type and thus cannot be interpreted by our model. To make it model-readable, we need to convert the column into binary features. However, on our prediction webpage, we want users to be able to select locations by name, not as binary numbers.
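As a quick illustration of what one-hot encoding does (a toy example, not part of our pipeline), each distinct name becomes its own binary column:

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Toy example: three neighbourhood names become three binary columns
toy = np.array([['Lekki'], ['Yaba'], ['Ikoyi'], ['Yaba']])
enc = OneHotEncoder(sparse_output=False)  # use sparse=False on scikit-learn < 1.2
print(enc.fit_transform(toy))
# [[0. 1. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]
#  [0. 0. 1.]]
print(enc.categories_)  # the original names are preserved here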

To achieve this, we will define a column transformer using the make_column_transformer function, which lets us encode just the Neighborhood column while passing every other column through untouched. The OneHotEncoder converts the Neighborhood column into binary-encoded features but keeps track of the original category names, so the model can interpret the data while users of the prediction webpage still interact with familiar location names rather than binary numbers.

Finally, we use StandardScaler to standardize (scale) the numerical features, ensuring they are all on the same scale.

Here is what our code looks like:

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

# One-hot encoding the Neighborhood column (all other columns pass through);
# on scikit-learn >= 1.2, use sparse_output=False instead of sparse=False
column_trans = make_column_transformer(
    (OneHotEncoder(sparse=False), ['Neighborhood']), remainder='passthrough')
scaler = StandardScaler()

These steps will be added to the model's pipeline when fitting the dataset, so the preprocessing runs automatically as the model trains.

Modeling

For our prediction, we will be using regression models from Scikit-Learn. We will try three regression models to see which one produces the best result. Our metric for measuring the models' performance is the R-squared score.
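For reference, R-squared measures the fraction of the variance in the target that the model explains. Here is a minimal sketch of the formula behind scikit-learn's r2_score:

import numpy as np

def r_squared(y_true, y_pred):
    '''R^2 = 1 - SS_res / SS_tot, the quantity r2_score reports.'''
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1 - ss_res / ss_tot

A score of 1 means perfect predictions; a score of 0 means the model does no better than always predicting the mean price.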

LinearRegression Model:

This is one of the most popular regression models. To train the first model, LinearRegression, we pass the training data to the fit method and the test data to predict. Here is what our code looks like:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Create the LinearRegression model
lr = LinearRegression(n_jobs=-1)

# Create the pipeline with the column transformer, scaler, and LinearRegression
pipe = make_pipeline(column_trans, scaler, lr)

# Fit the pipeline to the training data
pipe.fit(x_train, y_train)

# Make predictions on the test set
lr_y_pred = pipe.predict(x_test)

# Calculate the evaluation
r2 = r2_score(y_test, lr_y_pred)

# Print the evaluation metrics
print("R-squared score:", r2) #0.6665830021644317

Notice that we used the make_pipeline function to create a pipeline that chains the column transformer (the one-hot encoder for our Neighborhood column), the StandardScaler, and the linear regression model, and then fitted it on the training data. The R-squared score for this prediction is 0.66.
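As an aside, once the pipeline is fitted you can inspect how the Neighborhood column was expanded. A small sketch (on recent scikit-learn versions; make_pipeline names each step after its lowercased class name):

# Listing the feature names the column transformer produced
encoded_names = pipe.named_steps['columntransformer'].get_feature_names_out()
print(encoded_names[:5])  # the first few one-hot encoded neighbourhoods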

RandomForestRegressor Model

Next, we train our data using the Random forest regressor model. The code and syntax for this is similar to what we did with the linear regression. Here is what our code looks like:

from sklearn.ensemble import RandomForestRegressor
# Create the RandomForestRegressor
rf = RandomForestRegressor()

# Create the pipeline with column transformer, scaler, and the RandomForestRegressor
pipe = make_pipeline(column_trans, scaler, rf)

# Fit the pipeline to the training data
pipe.fit(x_train, y_train)

# Make predictions on the test set
rf_y_pred = pipe.predict(x_test)

# Calculate the R-squared score
r2 = r2_score(y_test, rf_y_pred)

# Print the R-squared score
print("R-squared score:", r2) #0.7445464506210426

The R-squared score for this model is 0.74.

XGBoost Regressor

Finally, we will train our data using the XGBoost regressor model. Again, the code is similar to what we've done with the previous models:

import xgboost as xgb

# Create the XGBRegressor model (avoiding reuse of the module name 'xgb')
xgb_model = xgb.XGBRegressor()

# Create the pipeline with the column transformer, scaler, and XGBRegressor
pipe = make_pipeline(column_trans, scaler, xgb_model)

# Fit the pipeline to the training data
pipe.fit(x_train, y_train)

# Make predictions on the test set
xgb_y_pred = pipe.predict(x_test)

# Calculate the evaluation
r2 = r2_score(y_test, xgb_y_pred)

# Print the evaluation metrics
print("R-squared score:", r2) #0.7525683915108

The R-squared score for this model is 0.75.

After testing our models, the XGBoost Regressor has the best R-squared score, so we will use it to predict the prices of rent in Lagos. We will save the model using pickle so that we can use it later to make predictions.

import pickle
# Saving the whole pipeline (preprocessing + model) to disk
with open('xgb_model.pkl', 'wb') as f:
    pickle.dump(pipe, f)
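Later, the saved pipeline can be loaded back and used to price a listing. Here is a sketch; the input must be a DataFrame with exactly the same columns as x, and the column values shown are illustrative assumptions:

import pickle
import pandas as pd

# Load the saved pipeline (encoder + scaler + model in one object)
with open('xgb_model.pkl', 'rb') as f:
    model = pickle.load(f)

# A hypothetical listing; the columns must match the training data exactly
listing = pd.DataFrame([{
    'Neighborhood': 'Lekki Phase 1',  # must be a value seen during training
    'Bedrooms': 2,
    'Bathrooms': 2,
    'Toilets': 3,
    # ...plus whatever other columns remain in x after cleaning
}])

print(f"Predicted annual rent: N{model.predict(listing)[0]:,.0f}")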

That’s it! We have now come to the end of this article. For a more in-depth look at the concepts we covered, please see the accompanying notebook on GitHub.

In the next article, we will discuss how to create a webpage for our prediction using Python Flask.

Conclusion:

Prediction models are valuable tools for making decisions. This article discussed the steps involved in building a successful regression model, from data cleaning to model selection and evaluation.