House Price Prediction: A Comprehensive Guide
House price prediction is a classic machine learning problem that involves estimating the price of a house based on various features such as location, size, number of bedrooms, and more. This guide provides a detailed breakdown of every step involved in building an accurate house price prediction model.
1. Understanding the Problem
House price prediction is a regression problem, where the goal is to predict a continuous value (house price). The price of a house depends on multiple factors, including:
- Location (city, neighborhood)
- Property Size (square footage, lot size)
- Number of Bedrooms & Bathrooms
- Amenities (pool, garage, garden)
- Market Conditions (interest rates, demand-supply)
- Historical Prices (previous selling price trends)
Objective: Use machine learning models to accurately predict house prices based on historical data.
2. Data Collection
To build a predictive model, we need a dataset that contains:
- House Features (square footage, number of rooms, etc.)
- Target Variable (house price)
Where to Get Data?
- Kaggle datasets (e.g., Boston Housing, Zillow datasets)
- Real estate websites (Zillow, Redfin, Realtor)
- Government databases (property tax records)
- APIs (Zillow API, OpenStreetMap API)
Example of a dataset:
| ID | Location   | Sq. Footage | Bedrooms | Bathrooms | Price ($) |
|----|------------|-------------|----------|-----------|-----------|
| 1  | New York   | 1500        | 3        | 2         | 500,000   |
| 2  | California | 2000        | 4        | 3         | 750,000   |
| 3  | Texas      | 1800        | 3        | 2         | 400,000   |
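If you don't yet have a real dataset, scikit-learn ships a California housing sample that works as a stand-in for experimentation (a minimal sketch; its column names differ from the example table above):
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing(as_frame=True)  # downloads the data on first use
df = data.frame  # features plus the median house value target ('MedHouseVal')
print(df.head())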
3. Data Preprocessing
a) Handling Missing Values
Missing data can affect model performance. We can:
- Fill missing numerical values with mean/median.
- Fill categorical missing values with the most frequent category.
- Drop rows/columns with too many missing values.
import pandas as pd
df = pd.read_csv("house_data.csv")
df.fillna(df.median(numeric_only=True), inplace=True)  # fill missing numeric values with the column median
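Categorical columns can be filled with their most frequent value in the same way (a sketch, assuming the 'Location' column from the example table has gaps):
df['Location'] = df['Location'].fillna(df['Location'].mode()[0])  # fill with the most frequent category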
b) Encoding Categorical Variables
Since machine learning models work with numbers, we need to convert categorical variables (e.g., city names) into numeric format:
- Label Encoding (for ordinal categories)
- One-Hot Encoding (for non-ordinal categories)
df = pd.get_dummies(df, columns=['Location'], drop_first=True)
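For ordinal categories, a simple mapping keeps the order intact. A sketch assuming a hypothetical 'Condition' column with ordered levels:
condition_order = {'Poor': 0, 'Fair': 1, 'Good': 2, 'Excellent': 3}  # hypothetical ordered levels
df['Condition'] = df['Condition'].map(condition_order)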
c) Feature Engineering
Creating new meaningful features to improve model performance:
- Price per square foot: price_per_sqft = price / sqft
- House Age: house_age = current_year - built_year
- Nearby Amenities: count of schools, hospitals, parks nearby.
df['Price_per_sqft'] = df['Price'] / df['Sq. Footage']
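House age can be added the same way, assuming the dataset includes a hypothetical 'Year_Built' column:
from datetime import date
df['House_Age'] = date.today().year - df['Year_Built']  # 'Year_Built' is an assumed column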
d) Feature Scaling
Since features like square footage and price have different scales, we apply:
- Normalization (Min-Max Scaling): x' = (x - min) / (max - min)
- Standardization (Z-score Scaling): x' = (x - mean) / std
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[['Sq. Footage', 'Price']] = scaler.fit_transform(df[['Sq. Footage', 'Price']])
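Min-Max scaling works the same way via MinMaxScaler (an alternative sketch; apply only one scaler to a given column):
from sklearn.preprocessing import MinMaxScaler
minmax = MinMaxScaler()
df[['Sq. Footage']] = minmax.fit_transform(df[['Sq. Footage']])  # rescales values into [0, 1]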
4. Exploratory Data Analysis (EDA)
a) Data Visualization
Visualizing trends and relationships between features.
- Price Distribution
import seaborn as sns
import matplotlib.pyplot as plt
sns.histplot(df['Price'], bins=50, kde=True)
plt.show()
- Price vs. Square Footage
sns.scatterplot(x=df['Sq. Footage'], y=df['Price'])
- Correlation Heatmap
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()
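To read the heatmap numerically, the features can be ranked by their correlation with the target (a quick sketch):
print(df.corr(numeric_only=True)['Price'].sort_values(ascending=False))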
5. Splitting the Dataset
We split the data into training (80%) and testing (20%) datasets.
from sklearn.model_selection import train_test_split
X = df.drop(columns=['Price'])
y = df['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
6. Model Selection and Training
We try multiple models and evaluate their performance.
a) Linear Regression
A simple model assuming a linear relationship between price and features.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
b) Decision Tree
Captures non-linear relationships.
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor(max_depth=5)
model.fit(X_train, y_train)
c) Random Forest
An ensemble model that reduces overfitting.
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100)
model.fit(X_train, y_train)
d) Gradient Boosting (XGBoost)
Performs well on structured data.
from xgboost import XGBRegressor
model = XGBRegressor(n_estimators=100, learning_rate=0.1)
model.fit(X_train, y_train)
7. Model Evaluation
We use Mean Absolute Error (MAE) and R² score to evaluate models.
from sklearn.metrics import mean_absolute_error, r2_score
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae}, R² Score: {r2}")
8. Hyperparameter Tuning
Fine-tuning models using Grid Search.
from sklearn.model_selection import GridSearchCV
param_grid = {'max_depth': [3, 5, 10], 'n_estimators': [50, 100, 200]}
grid_search = GridSearchCV(RandomForestRegressor(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
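The tuned model is available as best_estimator_ and can then be evaluated on the held-out test set (sketch):
best_model = grid_search.best_estimator_  # refit on the full training set by GridSearchCV
y_pred = best_model.predict(X_test)
print(f"Tuned MAE: {mean_absolute_error(y_test, y_pred):.3f}")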
9. Deploying the Model
a) Using Flask for API Deployment
We create an API that takes user input and predicts house prices.
from flask import Flask, request, jsonify
import pickle

app = Flask(__name__)
model = pickle.load(open('house_price_model.pkl', 'rb'))

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json  # expects feature values in the same order as the training columns
    prediction = model.predict([list(data.values())])
    return jsonify({'predicted_price': float(prediction[0])})  # cast to a plain float for JSON

if __name__ == '__main__':
    app.run()
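The house_price_model.pkl file loaded above has to be created beforehand by serializing a trained model, and the running API can then be called with a POST request. A minimal sketch (the payload keys are illustrative and must match the columns the model was trained on):
# Save the trained model once, before starting the API
import pickle
with open('house_price_model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Call the running API (illustrative payload; keys must match the training columns)
import requests
payload = {'Sq. Footage': 1500, 'Bedrooms': 3, 'Bathrooms': 2}
response = requests.post('http://127.0.0.1:5000/predict', json=payload)
print(response.json())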