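Logistic regression is a supervised learning algorithm used for binary classification problems (e.g., spam detection, fraud detection). It passes a linear combination of the input features through the sigmoid function to estimate the probability of the positive class, then assigns class 1 when that probability is above a threshold (0.5 by default in scikit-learn) and class 0 otherwise.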
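To make the sigmoid step concrete, here is a minimal sketch in plain NumPy. The scores in z are hypothetical values chosen only for illustration; in a fitted model they would come from the learned weight and intercept.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability in the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (z = w * x + b), not taken from a trained model
z = np.array([-3.0, -1.0, 1.0, 3.0])
probabilities = sigmoid(z)

# Turn probabilities into class labels with a 0.5 threshold
labels = (probabilities > 0.5).astype(int)

print("Probabilities:", np.round(probabilities, 3))
print("Labels:", labels)
```

Negative scores land below 0.5 and are labelled 0, positive scores land above 0.5 and are labelled 1, and a score of exactly 0 corresponds to a probability of 0.5.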
Step 1: Install Required Libraries
pip install numpy pandas matplotlib scikit-learn
Step 2: Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Step 3: Create or Load Dataset
For this example, we’ll use a synthetic dataset where ‘Hours Studied’ is the independent variable (X) and ‘Pass Exam’ (0 = Fail, 1 = Pass) is the dependent variable (Y).
# Create a dataset
data = {
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Pass_Exam': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],  # Binary target: 0 = Fail, 1 = Pass
}
df = pd.DataFrame(data)
print(df.head()) # Display first 5 rows
Step 4: Split Data into Training and Testing Sets
# Define X (independent variable) and Y (dependent variable)
X = df[['Hours_Studied']] # Feature (must be in 2D format)
Y = df['Pass_Exam'] # Target variable
# Split dataset (80% training, 20% testing; with 10 rows that is 8 for training and 2 for testing)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
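With only 10 rows, the test set holds just 2 samples, so it is worth checking what the split actually contains. The sketch below rebuilds the same dataset and adds stratify=Y, which is an assumption on my part and not part of the original split; it asks scikit-learn to keep the class proportions similar in both subsets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Same synthetic dataset as in the tutorial
df = pd.DataFrame({
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Pass_Exam': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
})
X = df[['Hours_Studied']]
Y = df['Pass_Exam']

# stratify=Y is an addition for illustration, not used in the original split
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42, stratify=Y
)

print("Training rows:", len(X_train))
print("Test rows:", len(X_test))
print("Test class counts:")
print(Y_test.value_counts())
```

Stratifying is optional here, but on a dataset this small it helps avoid a test set that contains only one class.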
Step 5: Train the Logistic Regression Model
# Create and train the model
model = LogisticRegression()
model.fit(X_train, Y_train)
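Once the model is fitted, its learned parameters can be inspected directly. The following sketch repeats the setup so it runs on its own, then reads the weight and intercept from coef_ and intercept_ and derives the number of study hours at which the predicted probability crosses 0.5. The variable names w, b, and boundary are illustrative choices.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same synthetic dataset and split as in the tutorial
df = pd.DataFrame({
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Pass_Exam': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
})
X = df[['Hours_Studied']]
Y = df['Pass_Exam']
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

model = LogisticRegression()
model.fit(X_train, Y_train)

w = model.coef_[0][0]      # weight for Hours_Studied
b = model.intercept_[0]    # intercept (bias) term

# The probability equals 0.5 where w * x + b = 0, i.e. x = -b / w
boundary = -b / w

print(f"Weight: {w:.3f}")
print(f"Intercept: {b:.3f}")
print(f"Decision boundary: about {boundary:.2f} hours")
```

A positive weight means more study hours raise the predicted probability of passing. Note that LogisticRegression applies L2 regularization by default, so the fitted weight is smaller than an unregularized fit would give.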
Step 6: Make Predictions
# Predict on test data
Y_pred = model.predict(X_test)
# Compare actual vs predicted values
comparison = pd.DataFrame({'Actual': Y_test, 'Predicted': Y_pred})
print(comparison)
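The predict method returns only class labels. To see the probabilities behind them, predict_proba can be used instead, as in this self-contained sketch. The X_new values and the stricter 0.8 cut-off are hypothetical, chosen only to show how a custom threshold changes the labels.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same synthetic dataset and split as in the tutorial
df = pd.DataFrame({
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Pass_Exam': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
})
X = df[['Hours_Studied']]
Y = df['Pass_Exam']
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)
model = LogisticRegression().fit(X_train, Y_train)

# Hypothetical new students (values chosen for illustration)
X_new = pd.DataFrame({'Hours_Studied': [2.0, 4.5, 9.0]})

proba = model.predict_proba(X_new)   # columns: P(class 0), P(class 1)
pass_probability = proba[:, 1]

# Default behaviour: class 1 when the probability is above 0.5
default_labels = np.where(pass_probability > 0.5, 1, 0)

# Stricter, hypothetical threshold: require at least 0.8 to predict a pass
strict_labels = np.where(pass_probability >= 0.8, 1, 0)

print("P(pass):", np.round(pass_probability, 3))
print("Labels at 0.5:", default_labels)
print("Labels at 0.8:", strict_labels)
```

Raising the threshold trades missed passes for fewer false passes; which side to favour depends on the application.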
Step 7: Evaluate the Model
# Calculate Accuracy
accuracy = accuracy_score(Y_test, Y_pred)
print(f"Accuracy: {accuracy:.2f}")
# Confusion Matrix
conf_matrix = confusion_matrix(Y_test, Y_pred)
print("Confusion Matrix:\n", conf_matrix)
# Classification Report
print("Classification Report:\n", classification_report(Y_test, Y_pred))
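To show how these metrics relate to each other, the sketch below unpacks a confusion matrix into its four counts and recomputes accuracy, precision, and recall by hand. The y_true and y_pred lists are hypothetical labels, not output from the model above. Passing labels=[0, 1] is a precaution I am adding: it keeps the matrix 2x2 even when a very small test set happens to contain only one class.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for illustration only
y_true = [0, 0, 1, 1, 1, 0, 1, 1]
y_pred = [0, 1, 1, 1, 0, 0, 1, 1]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted passes, how many really passed
recall = tp / (tp + fn)      # of actual passes, how many were found

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
```

The hand-computed accuracy matches accuracy_score on the same labels, which is a quick way to confirm the matrix is being read in the right orientation.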
Step 8: Visualize the Results
# Plot data points
plt.scatter(X_train, Y_train, color='blue', label='Training Data')
plt.scatter(X_test, Y_test, color='red', label='Test Data')
# Generate sigmoid curve
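X_range = np.linspace(0, 12, 100).reshape(-1, 1) # Generate range for plotting
X_range_df = pd.DataFrame(X_range, columns=['Hours_Studied']) # Reuse the training feature name to avoid a scikit-learn feature-name warning
Y_prob = model.predict_proba(X_range_df)[:, 1] # Probability of class 1 (Pass)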
# Plot Logistic Regression curve
plt.plot(X_range, Y_prob, color='black', linewidth=2, label='Logistic Regression')
plt.xlabel("Hours Studied")
plt.ylabel("Probability of Passing")
plt.title("Logistic Regression: Probability vs Hours Studied")
plt.legend()
plt.show()
Output Results
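- The accuracy score is the fraction of test samples classified correctly. With only 2 test samples here, it can only be 0.00, 0.50, or 1.00, so treat it as a sanity check rather than a reliable estimate.
- The confusion matrix counts predictions by actual class (rows) and predicted class (columns); the diagonal holds correct predictions and the off-diagonal entries hold errors.
- The sigmoid curve shows the predicted probability of passing rising from close to 0 to close to 1 as hours studied increase; the model predicts a pass wherever the curve is above 0.5.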