What is Linear Regression?
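Linear Regression is a supervised learning algorithm used for predicting continuous values. It finds the best-fitting line (also called a regression line) to model the relationship between the dependent variable (Y) and the independent variable(s) (X). "Best-fitting" here means the line that minimizes the sum of squared differences between the actual and predicted values of Y (ordinary least squares).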
Equation of a Simple Linear Regression Model
Y = mX + b
- Y → Predicted value
- X → Independent variable
- m → Slope (coefficient)
- b → Intercept
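To make the equation concrete, here is a minimal sketch of how m and b can be computed by hand with the standard least-squares formulas (slope = covariance of X and Y divided by variance of X). It uses the same experience and salary values as the dataset later in this tutorial; the function name `fit_line` is made up for illustration and is not part of any library.

```python
# Closed-form least-squares fit for Y = mX + b, in plain Python.
# fit_line is a hypothetical helper written for this illustration.
years = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
salary = [40000, 45000, 50000, 55000, 60000, 70000, 75000, 80000, 85000, 95000]


def fit_line(xs, ys):
    """Return (m, b) minimizing the sum of squared errors of ys - (m*xs + b)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx              # slope
    b = y_mean - m * x_mean    # intercept: the line passes through the means
    return m, b


m, b = fit_line(years, salary)
print(f"slope m = {m:.2f}, intercept b = {b:.2f}")
```

This fits all ten points at once, so the numbers will differ slightly from a model trained on only part of the data.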
For multiple linear regression, the equation is:
Y = b_0 + b_1 X_1 + b_2 X_2 + … + b_n X_n
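As a rough illustration of the multiple-regression form, the sketch below fits b_0, b_1 and b_2 with NumPy's least-squares solver. The data is invented for this example: the second feature (a certification count) and the coefficients 30000, 5000 and 2000 are assumptions, chosen so that the recovered values are easy to check.

```python
import numpy as np

# Hypothetical features: column 0 = years of experience,
# column 1 = number of certifications (invented for this sketch).
X = np.array([[1, 0], [2, 1], [3, 0], [4, 2], [5, 1], [6, 3]], dtype=float)

# Build a noise-free target from known coefficients so the fit can be verified.
y = 30000 + 5000 * X[:, 0] + 2000 * X[:, 1]

# Prepend a column of ones so the solver also estimates the intercept b_0.
A = np.column_stack([np.ones(len(X)), X])

# Solve the least-squares problem A @ coef ≈ y.
coef, residuals, rank, singular_values = np.linalg.lstsq(A, y, rcond=None)
b0, b1, b2 = coef
print(f"b0 = {b0:.1f}, b1 = {b1:.1f}, b2 = {b2:.1f}")
```

Because the target was generated without noise, the solver should recover the chosen coefficients up to floating-point error. Real data will not fit this cleanly.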
Step 1: Install Required Libraries
pip install numpy pandas matplotlib scikit-learn
Step 2: Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
Step 3: Load the Dataset
For this example, we will use a simple dataset with Years of Experience as the independent variable (X) and Salary as the dependent variable (Y).
# Create a dataset
data = {
'YearsExperience': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Salary': [40000, 45000, 50000, 55000, 60000, 70000, 75000, 80000, 85000, 95000]
}
df = pd.DataFrame(data)
print(df.head()) # Display first 5 rows
Step 4: Split Data into Training and Testing Sets
# Define X (independent variable) and Y (dependent variable)
X = df[['YearsExperience']] # Feature (must be in 2D format)
Y = df['Salary'] # Target variable
# Split dataset (80% training, 20% testing)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
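A quick way to see why X needs double brackets, and what the split actually produces, is to print the shapes. This is a small check that rebuilds the same dataset so it can run on its own.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'YearsExperience': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Salary': [40000, 45000, 50000, 55000, 60000, 70000, 75000, 80000, 85000, 95000]
})

X = df[['YearsExperience']]  # double brackets -> DataFrame (2D)
Y = df['Salary']             # single brackets -> Series (1D)
print(X.shape)  # (10, 1)
print(Y.shape)  # (10,)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)
print(X_train.shape)  # (8, 1)
print(X_test.shape)   # (2, 1)
```

With only ten rows, a 20% test split leaves just two rows for evaluation, so the scores computed later are rough estimates.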
Step 5: Train the Linear Regression Model
# Create and train the model
model = LinearRegression()
model.fit(X_train, Y_train)
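Once the model is fitted, scikit-learn stores the learned slope and intercept in `model.coef_` and `model.intercept_`, which map directly onto m and b in the equation above. The sketch below repeats the earlier steps so it runs on its own, then checks that m * X + b reproduces what `predict` returns.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'YearsExperience': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Salary': [40000, 45000, 50000, 55000, 60000, 70000, 75000, 80000, 85000, 95000]
})
X = df[['YearsExperience']]
Y = df['Salary']
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, Y_train)

m = model.coef_[0]    # slope: one entry per feature, so index 0 here
b = model.intercept_  # intercept
print(f"Slope (m): {m:.2f}")
print(f"Intercept (b): {b:.2f}")

# Applying Y = mX + b by hand should match model.predict
manual = m * X_test['YearsExperience'].to_numpy() + b
print(manual)
print(model.predict(X_test))
```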
Step 6: Make Predictions
# Predict on test data
Y_pred = model.predict(X_test)
# Compare actual vs predicted values
comparison = pd.DataFrame({'Actual': Y_test, 'Predicted': Y_pred})
print(comparison)
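The same `predict` call works for values that are not in the dataset. As a sketch, the snippet below estimates the salary for 12 years of experience; the value 12 is just an example, and since it lies outside the 1 to 10 range the model was trained on, the result is an extrapolation and should be treated with caution.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'YearsExperience': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Salary': [40000, 45000, 50000, 55000, 60000, 70000, 75000, 80000, 85000, 95000]
})
X = df[['YearsExperience']]
Y = df['Salary']
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, Y_train)

# New input must be 2D and use the same column name as the training data
years_to_check = 12  # hypothetical value chosen for this example
new_data = pd.DataFrame({'YearsExperience': [years_to_check]})
predicted_salary = model.predict(new_data)[0]
print(f"Predicted salary for {years_to_check} years: {predicted_salary:.2f}")
```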
Step 7: Evaluate the Model
# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(Y_test, Y_pred)
# Calculate R-squared score (higher is better)
r2 = r2_score(Y_test, Y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared Score: {r2}")
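To show what these two metrics compute, here is a sketch that works them out by hand with NumPy and compares the results with scikit-learn. The actual and predicted values below are made up for illustration and are not the output of the model above. The snippet also computes RMSE (the square root of MSE), which is easier to read because it is in the same units as the salary.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative values only (not produced by the tutorial's model)
y_true = np.array([45000, 60000, 75000, 85000], dtype=float)
y_pred = np.array([46000, 58000, 76000, 88000], dtype=float)

# MSE: average of the squared prediction errors
mse = np.mean((y_true - y_pred) ** 2)

# RMSE: square root of MSE, in the same units as the target
rmse = np.sqrt(mse)

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MSE: {mse:.2f}  (sklearn: {mean_squared_error(y_true, y_pred):.2f})")
print(f"RMSE: {rmse:.2f}")
print(f"R-squared: {r2:.4f}  (sklearn: {r2_score(y_true, y_pred):.4f})")
```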
Step 8: Visualize the Results
# Plot training data points
plt.scatter(X_train, Y_train, color='blue', label='Training Data')
# Plot test data points
plt.scatter(X_test, Y_test, color='red', label='Test Data')
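# Plot the regression line over the full range of X
# (the test set has only two points, so plotting X_test alone draws a short segment)
plt.plot(X, model.predict(X), color='black', linewidth=2, label='Regression Line')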
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.title("Linear Regression: Salary vs Experience")
plt.legend()
plt.show()
Output Results
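- The scatter plot shows the training points (blue), the test points (red), and the fitted regression line (black).
- Mean Squared Error (MSE) is the average of the squared differences between actual and predicted values. Lower is better. Its units are the target's units squared (here, salary squared), so it is mainly useful for comparing models on the same data.
- R-squared Score (R²) is the proportion of the variance in the target that the model explains. A value of 1 is a perfect fit, and the score can be negative on test data if the model does worse than always predicting the mean.
- With only two test samples (20% of 10 rows), both metrics are rough estimates and can change noticeably with a different split.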