House Price Prediction: A Comprehensive Guide

House price prediction is a classic machine learning problem that involves estimating the price of a house based on various features such as location, size, number of bedrooms, and more. This guide provides a detailed breakdown of every step involved in building an accurate house price prediction model.


1. Understanding the Problem

House price prediction is a regression problem, where the goal is to predict a continuous value (house price). The price of a house depends on multiple factors, including:

  • Location (city, neighborhood)
  • Property Size (square footage, lot size)
  • Number of Bedrooms & Bathrooms
  • Amenities (pool, garage, garden)
  • Market Conditions (interest rates, demand-supply)
  • Historical Prices (previous selling price trends)

Objective: Use machine learning models to accurately predict house prices based on historical data.


2. Data Collection

To build a predictive model, we need a dataset that contains:

  • House Features (square footage, number of rooms, etc.)
  • Target Variable (house price)

Where to Get Data?

  • Kaggle datasets (e.g., Boston Housing, Zillow datasets)
  • Real estate websites (Zillow, Redfin, Realtor)
  • Government databases (property tax records)
  • APIs (Zillow API, OpenStreetMap API)

Example of a dataset:

ID | Location   | Sq. Footage | Bedrooms | Bathrooms | Price ($)
1  | New York   | 1500        | 3        | 2         | 500,000
2  | California | 2000        | 4        | 3         | 750,000
3  | Texas      | 1800        | 3        | 2         | 400,000
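
If you want to follow along without downloading anything, the toy table above can be recreated directly as a DataFrame (illustrative values only; the rest of the guide assumes a larger dataset loaded from house_data.csv):

import pandas as pd

# Small illustrative dataset mirroring the example table above
df = pd.DataFrame({
    'ID': [1, 2, 3],
    'Location': ['New York', 'California', 'Texas'],
    'Sq. Footage': [1500, 2000, 1800],
    'Bedrooms': [3, 4, 3],
    'Bathrooms': [2, 3, 2],
    'Price': [500000, 750000, 400000],
})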

3. Data Preprocessing

a) Handling Missing Values

Missing data can affect model performance. We can:

  • Fill missing numerical values with mean/median.
  • Fill categorical missing values with the most frequent category.
  • Drop rows/columns with too many missing values.
import pandas as pd

df = pd.read_csv("house_data.csv")
df.fillna(df.median(numeric_only=True), inplace=True)  # Fill numeric gaps with each column's median
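
The categorical-fill and column-drop strategies from the list above can be handled in the same spirit; a minimal sketch, assuming a categorical 'Location' column and a 50% missing-value threshold:

# Fill categorical gaps with the most frequent value (mode)
df['Location'] = df['Location'].fillna(df['Location'].mode()[0])

# Keep only columns with at least 50% non-missing values
df = df.dropna(axis=1, thresh=int(0.5 * len(df)))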

b) Encoding Categorical Variables

Since machine learning models work with numbers, we need to convert categorical variables (e.g., city names) into numeric format:

  1. Label Encoding (for ordinal categories)
  2. One-Hot Encoding (for non-ordinal categories)
df = pd.get_dummies(df, columns=['Location'], drop_first=True)
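
Label encoding is only appropriate when the categories have a natural order. A sketch, assuming a hypothetical 'Condition' column ranked poor < fair < good < excellent:

# Map an ordinal category to integers that preserve its order ('Condition' is a hypothetical column)
condition_order = {'poor': 0, 'fair': 1, 'good': 2, 'excellent': 3}
df['Condition'] = df['Condition'].map(condition_order)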

c) Feature Engineering

Creating new meaningful features to improve model performance:

  • Price per square foot: price_per_sqft = price / sqft (derived from the target, so use it for analysis rather than as a model input)
  • House Age: current_year - built_year
  • Nearby Amenities: count of nearby schools, hospitals, and parks.
df['Price_per_sqft'] = df['Price'] / df['Sq. Footage']
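
The house-age feature above can be derived the same way, assuming the dataset includes a year-built column (the name 'Year_Built' below is hypothetical):

from datetime import datetime

current_year = datetime.now().year
df['House_Age'] = current_year - df['Year_Built']  # 'Year_Built' is a hypothetical column name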

d) Feature Scaling

Since features like square footage and price have different scales, we apply:

  • Normalization (Min-Max Scaling): x' = (x - min) / (max - min)
  • Standardization (Z-score Scaling): x' = (x - mean) / std
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Note: in a real pipeline, fit the scaler on the training split only to avoid test-set leakage
df[['Sq. Footage', 'Price']] = scaler.fit_transform(df[['Sq. Footage', 'Price']])
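
Min-max scaling follows the same pattern; a sketch applying it to the feature column only, as an alternative to the standardization above:

from sklearn.preprocessing import MinMaxScaler

# Rescale the feature to the [0, 1] range
min_max = MinMaxScaler()
df[['Sq. Footage']] = min_max.fit_transform(df[['Sq. Footage']])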

4. Exploratory Data Analysis (EDA)

a) Data Visualization

Visualizing trends and relationships between features.

  • Price Distribution
import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(df['Price'], bins=50, kde=True)
plt.show()
  • Price vs. Square Footage
sns.scatterplot(x=df['Sq. Footage'], y=df['Price'])
plt.show()
  • Correlation Heatmap
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()

5. Splitting the Dataset

We split the data into training (80%) and testing (20%) datasets.

from sklearn.model_selection import train_test_split

X = df.drop(columns=['Price', 'Price_per_sqft'])  # exclude the target and price-derived features
y = df['Price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

6. Model Selection and Training

We try multiple models and evaluate their performance.

a) Linear Regression

A simple model assuming a linear relationship between price and features.

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

b) Decision Tree

Captures non-linear relationships.

from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor(max_depth=5)
model.fit(X_train, y_train)

c) Random Forest

An ensemble model that reduces overfitting.

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100)
model.fit(X_train, y_train)

d) Gradient Boosting (XGBoost)

Performs well on structured data.

from xgboost import XGBRegressor

model = XGBRegressor(n_estimators=100, learning_rate=0.1)
model.fit(X_train, y_train)
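
To compare the four models above on the same split in a single pass, they can be trained and scored in a loop (a sketch; MAE is introduced in the next section):

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=5),
    "Random Forest": RandomForestRegressor(n_estimators=100),
    "XGBoost": XGBRegressor(n_estimators=100, learning_rate=0.1),
}

# Fit each model on the training set and report its test-set error
for name, m in models.items():
    m.fit(X_train, y_train)
    mae = mean_absolute_error(y_test, m.predict(X_test))
    print(f"{name}: MAE = {mae:.2f}")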

7. Model Evaluation

We use Mean Absolute Error (MAE) and R² score to evaluate models.

from sklearn.metrics import mean_absolute_error, r2_score

y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MAE: {mae}, R² Score: {r2}")

8. Hyperparameter Tuning

Fine-tuning models using Grid Search.

from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': [3, 5, 10], 'n_estimators': [50, 100, 200]}
grid_search = GridSearchCV(RandomForestRegressor(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
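
The refit best model is exposed as best_estimator_, so it can be scored on the held-out test set directly, for example:

from sklearn.metrics import mean_absolute_error

# Evaluate the tuned model found by the grid search
best_model = grid_search.best_estimator_
print(mean_absolute_error(y_test, best_model.predict(X_test)))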

9. Deploying the Model

a) Using Flask for API Deployment

We create an API that takes user input and predicts house prices.

from flask import Flask, request, jsonify
import pickle

app = Flask(__name__)
model = pickle.load(open('house_price_model.pkl', 'rb'))

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    # Feature values must arrive in the same order as the model's training columns
    prediction = model.predict([list(data.values())])
    return jsonify({'predicted_price': float(prediction[0])})

if __name__ == '__main__':
    app.run()
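
The API above assumes the trained model was already serialized to house_price_model.pkl; a minimal sketch of that step:

import pickle

# Save the trained model so the Flask app can load it at startup
with open('house_price_model.pkl', 'wb') as f:
    pickle.dump(model, f)

Once the server is running, a request might look like the following. The feature names and values here are placeholders; they must match the columns (and column order) the model was trained on, including any one-hot encoded location columns:

import requests

payload = {'Sq. Footage': 1600, 'Bedrooms': 3, 'Bathrooms': 2}  # placeholder features
response = requests.post('http://127.0.0.1:5000/predict', json=payload)
print(response.json())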
