Implementing Linear Regression


What is Linear Regression?

Linear regression is a supervised learning algorithm for predicting continuous values. It fits a line (the regression line) that models the relationship between a dependent variable (Y) and one or more independent variables (X). The "best-fitting" line is usually the one that minimizes the sum of squared differences between the observed and predicted values of Y.

Equation of a Simple Linear Regression Model

$Y = mX + b$

  • Y → Predicted value
  • X → Independent variable
  • m → Slope (coefficient)
  • b → Intercept
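
To make the roles of m and b concrete, here is a minimal sketch that computes them directly with the ordinary least squares formulas for a single feature. It assumes the same toy experience and salary values used in Step 3; the variable names are illustrative only.

```python
import numpy as np

# Same toy values as the dataset in Step 3
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([40000, 45000, 50000, 55000, 60000,
              70000, 75000, 80000, 85000, 95000], dtype=float)

# Ordinary least squares for one feature:
#   m = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   b = mean_y - m * mean_x
x_mean, y_mean = x.mean(), y.mean()
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - m * x_mean

print(f"slope m = {m:.2f}, intercept b = {b:.2f}")
```

These values are fitted on all ten rows, so they will differ slightly from a model trained on only the 80% training split later in the walkthrough.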

For multiple linear regression, the equation is: $Y = b_0 + b_1X_1 + b_2X_2 + \dots + b_nX_n$
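
As a rough illustration of the multiple-feature case, the sketch below recovers b_0, b_1, and b_2 from synthetic data with NumPy's least squares solver. The data and the "true" coefficients are hypothetical, chosen only so the result is easy to check; this is not part of the salary example.

```python
import numpy as np

# Hypothetical data: two features generated from known coefficients
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
true_b = np.array([3.0, 2.0, -1.5])   # b_0, b_1, b_2 (illustrative values)
Y = true_b[0] + X @ true_b[1:]        # no noise added, so the fit should be exact

# Prepend a column of ones so that b_0 is estimated like any other coefficient
A = np.column_stack([np.ones(len(X)), X])
coeffs, *_ = np.linalg.lstsq(A, Y, rcond=None)

print(coeffs)  # estimated [b_0, b_1, b_2]
```

scikit-learn's LinearRegression handles the intercept column internally, so the same fit there needs only the two feature columns.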


Step 1: Install Required Libraries

pip install numpy pandas matplotlib scikit-learn

Step 2: Import Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

Step 3: Load the Dataset

For this example, we will use a simple dataset with Years of Experience as the independent variable (X) and Salary as the dependent variable (Y).

# Create a dataset
data = {
    'YearsExperience': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Salary': [40000, 45000, 50000, 55000, 60000, 70000, 75000, 80000, 85000, 95000]
}

df = pd.DataFrame(data)
print(df.head()) # Display first 5 rows
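
Before fitting a line, it can help to check that the relationship looks roughly linear. The sketch below is one possible sanity check on the same dataset: it confirms the shape, looks for missing values, and computes the Pearson correlation between the two columns.

```python
import pandas as pd

df = pd.DataFrame({
    'YearsExperience': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Salary': [40000, 45000, 50000, 55000, 60000, 70000, 75000, 80000, 85000, 95000]
})

print(df.shape)                    # (rows, columns)
missing = df.isna().sum().sum()    # total number of missing cells
print(f"Missing values: {missing}")

# Pearson correlation: values near 1 or -1 suggest a strong linear relationship
r = df['YearsExperience'].corr(df['Salary'])
print(f"Correlation: {r:.4f}")
```

A correlation close to 1 supports using a straight line here; a value near 0 would suggest linear regression is a poor fit for the data.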

Step 4: Split Data into Training and Testing Sets

# Define X (independent variable) and Y (dependent variable)
X = df[['YearsExperience']] # Feature (must be in 2D format)
Y = df['Salary'] # Target variable

# Split dataset (80% training, 20% testing)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
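
A quick way to confirm the split behaves as intended is to print the resulting shapes. This sketch repeats the setup from the previous steps so it runs on its own.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'YearsExperience': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Salary': [40000, 45000, 50000, 55000, 60000, 70000, 75000, 80000, 85000, 95000]
})
X = df[['YearsExperience']]
Y = df['Salary']

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

# With 10 rows and test_size=0.2, 8 rows go to training and 2 to testing
print(X_train.shape, X_test.shape)
print(Y_train.shape, Y_test.shape)

# The same rows must be selected for features and target
print(X_test.index.equals(Y_test.index))
```

Fixing random_state makes the split reproducible, so repeated runs train and test on the same rows.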

Step 5: Train the Linear Regression Model

# Create and train the model
model = LinearRegression()
model.fit(X_train, Y_train)
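
After fitting, the learned slope and intercept are available as model.coef_ and model.intercept_. The sketch below prints them and reproduces one prediction by hand, to show that the trained model is just Y = mX + b. The value 5 for years is an arbitrary example input.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'YearsExperience': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Salary': [40000, 45000, 50000, 55000, 60000, 70000, 75000, 80000, 85000, 95000]
})
X = df[['YearsExperience']]
Y = df['Salary']
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, Y_train)

m = model.coef_[0]      # one coefficient per feature; there is only one here
b = model.intercept_
print(f"Salary = {m:.2f} * YearsExperience + {b:.2f}")

# Reproduce a prediction manually and compare it with model.predict
years = 5  # arbitrary example input
manual = m * years + b
from_model = model.predict(pd.DataFrame({'YearsExperience': [years]}))[0]
print(manual, from_model)
```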

Step 6: Make Predictions

# Predict on test data
Y_pred = model.predict(X_test)

# Compare actual vs predicted values
comparison = pd.DataFrame({'Actual': Y_test, 'Predicted': Y_pred})
print(comparison)
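
The trained model can also score inputs that are not in the dataset. As a sketch, the following predicts salaries for 11 and 12 years of experience; these inputs are hypothetical and lie outside the observed range, so the results are extrapolations and should be read with caution.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'YearsExperience': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Salary': [40000, 45000, 50000, 55000, 60000, 70000, 75000, 80000, 85000, 95000]
})
X = df[['YearsExperience']]
Y = df['Salary']
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, Y_train)

# New inputs must be 2D and use the same column name as the training data
new_X = pd.DataFrame({'YearsExperience': [11, 12]})
new_pred = model.predict(new_X)

for years, salary in zip(new_X['YearsExperience'], new_pred):
    print(f"{years} years -> predicted salary {salary:.2f}")
```

Because the model is a straight line, each additional year changes the prediction by exactly the slope stored in model.coef_[0].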

Step 7: Evaluate the Model

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(Y_test, Y_pred)

# Calculate R-squared score (higher is better)
r2 = r2_score(Y_test, Y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared Score: {r2}")
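
To show what these two metrics measure, the sketch below computes them by hand with NumPy and checks the results against scikit-learn. The actual and predicted values are made up for illustration and are not the output of the model above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values, for illustration only
y_true = np.array([45000.0, 85000.0, 60000.0])
y_pred = np.array([44000.0, 87000.0, 61000.0])

residuals = y_true - y_pred

# MSE: the mean of the squared residuals
mse = np.mean(residuals ** 2)

# RMSE: square root of MSE, in the same units as the target
rmse = np.sqrt(mse)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MSE: {mse}, RMSE: {rmse:.2f}, R-squared: {r2:.4f}")
print(np.isclose(mse, mean_squared_error(y_true, y_pred)))
print(np.isclose(r2, r2_score(y_true, y_pred)))
```

RMSE is often easier to interpret than MSE because it is expressed in salary units rather than squared salary units.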

Step 8: Visualize the Results

# Plot training data points
plt.scatter(X_train, Y_train, color='blue', label='Training Data')

# Plot test data points
plt.scatter(X_test, Y_test, color='red', label='Test Data')

# Plot the regression line across the full range of X
# (the two test points alone would only draw a short segment)
plt.plot(X, model.predict(X), color='black', linewidth=2, label='Regression Line')

plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.title("Linear Regression: Salary vs Experience")
plt.legend()
plt.show()

Output Results

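  • The scatter plot shows the training points (blue), the test points (red), and the fitted regression line (black).
  • Mean Squared Error (MSE) is the average of the squared differences between actual and predicted values. Lower is better.
  • R-squared Score (R²) is the proportion of variance in the target that the model explains (closer to 1 is better). With only two test samples in this example, both metrics are rough estimates.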
