Implementing Logistic Regression


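Logistic Regression is a supervised learning algorithm used for binary classification problems (e.g., spam detection, fraud detection). It passes a linear combination of the input features through the sigmoid function to estimate the probability of the positive class, then maps that probability to a label of 0 or 1 (scikit-learn's predict() uses a 0.5 threshold).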
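To make the sigmoid step concrete, here is a minimal sketch in plain NumPy. The helper name sigmoid and the sample scores in z are illustrative assumptions, not part of scikit-learn or of the dataset used later.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores z = w*x + b (values chosen only for illustration)
z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probabilities = sigmoid(z)

# Turn probabilities into class labels with a 0.5 threshold
labels = (probabilities >= 0.5).astype(int)

print(probabilities)
print(labels)
```

Negative scores give probabilities below 0.5 and positive scores give probabilities above it, which is why the threshold splits the classes at z = 0.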


Step 1: Install Required Libraries

pip install numpy pandas matplotlib scikit-learn

Step 2: Import Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Step 3: Create or Load Dataset

For this example, we’ll use a synthetic dataset where ‘Hours Studied’ is the independent variable (X) and ‘Pass Exam’ (0 = Fail, 1 = Pass) is the dependent variable (Y).

# Create a dataset
data = {
'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Pass_Exam': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1] # Binary classification
}

df = pd.DataFrame(data)
print(df.head()) # Display first 5 rows

Step 4: Split Data into Training and Testing Sets

# Define X (independent variable) and Y (dependent variable)
X = df[['Hours_Studied']] # Feature (must be in 2D format)
Y = df['Pass_Exam'] # Target variable

# Split dataset (80% training, 20% testing)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
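
With only ten rows, a 20% test split holds just two samples, so it can be worth checking what ended up in each set. The sketch below rebuilds the same dataset and prints the split sizes. The stratify=Y argument is an optional variation assumed here for illustration; the split above does not use it. It asks train_test_split to keep the class proportions roughly equal in both sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Pass_Exam': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
})
X = df[['Hours_Studied']]
Y = df['Pass_Exam']

# stratify=Y is an optional addition (an assumption, not used in the tutorial's split)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42, stratify=Y
)

print("Training rows:", len(X_train))
print("Test rows:", len(X_test))
print("Test labels:", sorted(Y_test.tolist()))
```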

Step 5: Train the Logistic Regression Model

# Create and train the model
model = LogisticRegression()
model.fit(X_train, Y_train)
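
After fitting, the learned slope and intercept are available as model.coef_ and model.intercept_. The sketch below is one way to check that, for this binary problem, predict_proba matches the sigmoid of the linear score. It fits on the full dataset for brevity, which is a simplification of the train/test workflow above, and the names w, b, and manual_prob are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Pass_Exam': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
})
X = df[['Hours_Studied']]
Y = df['Pass_Exam']

# Fit on all rows for simplicity (the tutorial fits on X_train only)
model = LogisticRegression()
model.fit(X, Y)

w = model.coef_[0][0]      # Slope for Hours_Studied
b = model.intercept_[0]    # Intercept
print(f"Coefficient: {w:.3f}, Intercept: {b:.3f}")

# Recompute the probability of passing by hand: sigmoid(w*x + b)
z = w * X['Hours_Studied'].to_numpy() + b
manual_prob = 1.0 / (1.0 + np.exp(-z))
sklearn_prob = model.predict_proba(X)[:, 1]

print("Matches predict_proba:", np.allclose(manual_prob, sklearn_prob))
```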

Step 6: Make Predictions

# Predict on test data
Y_pred = model.predict(X_test)

# Compare actual vs predicted values
comparison = pd.DataFrame({'Actual': Y_test, 'Predicted': Y_pred})
print(comparison)
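
predict() returns hard labels, but the underlying probabilities are often more informative. The sketch below, again fitted on the full dataset for brevity, shows that the default labels agree with a 0.5 cut on predict_proba and how a stricter cut could be applied. The 0.8 threshold is an arbitrary value chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Pass_Exam': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
})
X = df[['Hours_Studied']]
Y = df['Pass_Exam']

model = LogisticRegression()
model.fit(X, Y)

# Column 1 of predict_proba is the probability of class 1 (Pass)
proba_pass = model.predict_proba(X)[:, 1]

default_pred = model.predict(X)                  # Labels from scikit-learn
manual_pred = (proba_pass >= 0.5).astype(int)    # Same rule applied by hand
strict_pred = (proba_pass >= 0.8).astype(int)    # Hypothetical stricter threshold

print(pd.DataFrame({
    'Hours_Studied': X['Hours_Studied'],
    'P(Pass)': proba_pass.round(3),
    'Default': default_pred,
    'Strict': strict_pred,
}))
```

Raising the threshold can only turn predicted passes into predicted fails, never the reverse.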

Step 7: Evaluate the Model

# Calculate Accuracy
accuracy = accuracy_score(Y_test, Y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Confusion Matrix
conf_matrix = confusion_matrix(Y_test, Y_pred)
print("Confusion Matrix:\n", conf_matrix)

# Classification Report
print("Classification Report:\n", classification_report(Y_test, Y_pred))
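
To see how these metrics relate to each other, the sketch below derives accuracy, precision, and recall directly from the four cells of a confusion matrix. The y_true and y_pred lists are hypothetical labels made up for illustration; they are not the output of the model above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, for illustration only
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()    # Share of all predictions that are correct
precision = tp / (tp + fp)         # Share of predicted passes that are real passes
recall = tp / (tp + fn)            # Share of real passes that were found

print(f"Accuracy: {accuracy:.2f}, Precision: {precision:.2f}, Recall: {recall:.2f}")
print("Same as accuracy_score:", abs(accuracy - accuracy_score(y_true, y_pred)) < 1e-12)
```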

Step 8: Visualizing the Results

# Plot data points
plt.scatter(X_train, Y_train, color='blue', label='Training Data')
plt.scatter(X_test, Y_test, color='red', label='Test Data')

# Generate sigmoid curve
hours = np.linspace(0, 12, 100) # Range of hours for plotting
X_range = pd.DataFrame({'Hours_Studied': hours}) # Same column name as the training data, avoids a feature-name warning
Y_prob = model.predict_proba(X_range)[:, 1] # Probability of class 1 (Pass)

# Plot Logistic Regression curve
plt.plot(hours, Y_prob, color='black', linewidth=2, label='Logistic Regression')

plt.xlabel("Hours Studied")
plt.ylabel("Probability of Passing")
plt.title("Logistic Regression: Probability vs Hours Studied")
plt.legend()
plt.show()

Output Results

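  • The accuracy score is the fraction of test samples the model classifies correctly. With only two test samples here, it can only be 0.00, 0.50, or 1.00.
  • The confusion matrix counts correct and incorrect predictions for each class.
  • The sigmoid curve shows how the predicted probability of passing rises from near 0 to near 1 as hours studied increase.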
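
The point where the curve crosses a probability of 0.5 is the decision boundary. For a single feature it can be computed as -intercept / coefficient. The sketch below fits on the full dataset for brevity, and the name boundary is illustrative.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Pass_Exam': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
})
X = df[['Hours_Studied']]
Y = df['Pass_Exam']

model = LogisticRegression()
model.fit(X, Y)

w = model.coef_[0][0]
b = model.intercept_[0]

# sigmoid(w*x + b) = 0.5 exactly when w*x + b = 0, so x = -b / w
boundary = -b / w
print(f"Decision boundary: {boundary:.2f} hours")

# Sanity check: the predicted probability at the boundary should be 0.5
at_boundary = pd.DataFrame({'Hours_Studied': [boundary]})
p_boundary = model.predict_proba(at_boundary)[0, 1]
print(f"P(Pass) at boundary: {p_boundary:.3f}")
```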
