Chi-Square Test: A Comprehensive Guide

Introduction

The Chi-Square Test is a non-parametric statistical test for categorical data. It is most often used to determine whether there is a significant association between two categorical variables, and it can also check whether observed frequencies match an expected distribution. It is widely applied in fields like data science, business analytics, medical research, marketing, and social sciences to analyze frequency data.

✅ Why Use the Chi-Square Test?

Helps determine if two categorical variables are independent or related.
Useful in analyzing survey results, market research, and customer behavior.
Works on counts (frequencies) of observations in each category, not on continuous measurements.

📌 Key Questions Chi-Square Can Answer:
✔ Does gender affect product preference?
✔ Is there a relationship between smoking habits and lung disease?
✔ Does education level influence voting patterns?

1. Types of Chi-Square Tests

Chi-Square tests are divided into two main types:

A. Chi-Square Goodness-of-Fit Test

✅ Purpose: Determines whether the observed frequency distribution fits an expected distribution.
✅ Used when comparing one categorical variable against a known expected distribution.

📌 Example:

A company claims that their customer complaints are evenly distributed across four quarters of the year.
We collect real data on complaints per quarter and compare it to the expected uniform distribution.

✔ Hypothesis:

Null Hypothesis ($H_0$): The observed distribution matches the expected distribution.
Alternative Hypothesis ($H_A$): The observed distribution is different from the expected distribution.
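
The quarterly complaints example can be sketched in Python with `scipy.stats.chisquare`. The counts below are made up purely for illustration, and the uniform expectation (a quarter of the total in each quarter) is the assumption taken from the company's claim.

```python
from scipy import stats

# Hypothetical complaint counts for Q1..Q4 (illustrative data only)
observed = [45, 55, 60, 40]

# Under H0 (complaints evenly distributed), each quarter expects total / 4
total = sum(observed)
expected = [total / len(observed)] * len(observed)

# Goodness-of-fit test; degrees of freedom default to (number of categories - 1)
chi2_stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)

print(f"Chi-Square Statistic: {chi2_stat:.3f}")
print(f"P-Value: {p_value:.4f}")
```

With these made-up counts the statistic is 5.0 on 3 degrees of freedom, which is below the 5% critical value of about 7.815, so the uniform claim would not be rejected.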

B. Chi-Square Test for Independence

✅ Purpose: Determines if two categorical variables are independent or associated.
✅ Used when we want to analyze relationships between two categorical factors.

📌 Example:

Examining if gender (Male, Female) affects voting preference (Party A, Party B, Party C).
Analyzing if customer age group is related to product purchase preference.

✔ Hypothesis:

Null Hypothesis ($H_0$): The two variables are independent (no association).
Alternative Hypothesis ($H_A$): The two variables are dependent (have an association).
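
In practice the data often arrives as one record per respondent rather than as ready-made counts. A minimal sketch of turning such records into a table of observed counts with `pandas.crosstab` follows; the column names `gender` and `party` and the eight records are hypothetical, and a sample this small is far too small for an actual test.

```python
import pandas as pd

# Hypothetical survey records (illustrative only): one row per respondent
df = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Female", "Male", "Female", "Female", "Male"],
    "party":  ["A",    "B",    "A",      "C",      "A",    "B",      "C",      "C"],
})

# Cross-tabulate the raw records into observed counts:
# rows are gender categories, columns are party categories
table = pd.crosstab(df["gender"], df["party"])
print(table)
```

The resulting table (or `table.to_numpy()`) is the kind of input a test for independence works on.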

2. Assumptions of Chi-Square Test

For a valid Chi-Square test, these conditions must be met:

✅ 1. Data Must Be Categorical – Both variables should be categorical, not numerical.
✅ 2. Observations Must Be Independent – Each observation should be unique and unrelated to others.
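✅ 3. Expected Frequency ≥ 5 – As a common rule of thumb, each cell should have an expected count of at least 5 so that the chi-square approximation is accurate.
✅ 4. Sufficient Sample Size – The test relies on a large-sample approximation, so a larger sample size increases the reliability of results.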

📌 What if assumptions are violated?

Use Fisher’s Exact Test for small sample sizes.
Combine low-frequency categories to increase expected counts.
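
One possible way to automate this check is sketched below: compute the expected counts from the table margins and fall back to Fisher's Exact Test when any of them is below 5. The 2x2 counts are hypothetical, and the threshold of 5 is the rule of thumb from the list above, not a hard limit.

```python
import numpy as np
from scipy import stats
from scipy.stats.contingency import expected_freq

# Hypothetical 2x2 table with small counts (illustrative data only)
observed = np.array([[3, 7],
                     [8, 2]])

# Expected counts under independence, derived from row and column totals
expected = expected_freq(observed)

if (expected < 5).any():
    # At least one expected count is below 5: use Fisher's exact test (2x2 table)
    odds_ratio, p_value = stats.fisher_exact(observed)
    method = "Fisher's exact test"
else:
    # All expected counts are at least 5: the chi-square approximation is reasonable
    chi2_stat, p_value, dof, _ = stats.chi2_contingency(observed)
    method = "Chi-square test"

print(f"Smallest expected count: {expected.min():.1f}")
print(f"Method used: {method}, p-value: {p_value:.4f}")
```

Here the smallest expected count is 4.5, so the sketch takes the Fisher branch.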

3. Chi-Square Formula and Calculation

A. Understanding the Chi-Square Formula

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Where:
✔ $O$ = Observed frequency (actual counts from data)
✔ $E$ = Expected frequency (theoretical counts under $H_0$)

📌 Higher $\chi^2$ Value → Stronger Evidence Against $H_0$ (the observed counts deviate more from the expected counts).

B. Steps in Conducting a Chi-Square Test

✅ Step 1: Formulate Hypotheses

Null Hypothesis ($H_0$): No relationship exists between the variables.
Alternative Hypothesis ($H_A$): The variables are related.

✅ Step 2: Create a Contingency Table

Organize observed frequencies into a table format.

✅ Step 3: Compute Expected Frequencies

$$E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}$$

✅ Step 4: Calculate Chi-Square Statistic

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

✅ Step 5: Compare with Critical Value

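Find the critical $\chi^2$ value based on the degrees of freedom and the significance level ($\alpha$). For a contingency table with $r$ rows and $c$ columns, the degrees of freedom are $(r - 1)(c - 1)$.
If $\chi^2$ > critical value, reject $H_0$ (there is evidence that the variables are related).
If $\chi^2$ ≤ critical value, fail to reject $H_0$ (there is no significant evidence of an association; this does not prove the variables are independent).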
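
Steps 3 to 5 can be traced by hand with NumPy, as in the sketch below. It uses a small 2x2 table of counts and an assumed significance level of 0.05, and it computes the plain statistic without any continuity correction, so the value can differ from what SciPy's `chi2_contingency` reports by default for 2x2 tables.

```python
import numpy as np
from scipy import stats

# 2x2 table of observed counts (rows and columns are the two categorical variables)
observed = np.array([[30, 10],
                     [20, 40]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
grand_total = observed.sum()

# Step 3: expected count per cell = row total * column total / grand total
expected = row_totals * col_totals / grand_total

# Step 4: chi-square statistic = sum over cells of (O - E)^2 / E
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Step 5: compare with the critical value at (r - 1)(c - 1) degrees of freedom
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
alpha = 0.05  # assumed significance level
critical_value = stats.chi2.ppf(1 - alpha, dof)
reject_h0 = chi2_stat > critical_value

print(f"Chi-square: {chi2_stat:.3f}, critical value: {critical_value:.3f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

For this table the expected counts are 20, 20, 30, and 30, and the statistic is about 16.667, well above the critical value of about 3.841 at 1 degree of freedom.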

4. Python Implementation of Chi-Square Test

Example 1: Chi-Square Test for Independence

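import scipy.stats as stats

# Contingency table of observed counts for Gender vs Product Preference
# Rows: gender; columns: the two product options
data = [[30, 10],   # Male
        [20, 40]]   # Female

# Perform Chi-Square Test of independence
# Note: for 2x2 tables (dof = 1), chi2_contingency applies Yates' continuity
# correction by default; pass correction=False for the uncorrected statistic
chi2_stat, p_value, dof, expected = stats.chi2_contingency(data)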

# Print results
print(f"Chi-Square Statistic: {chi2_stat}")
print(f"P-Value: {p_value}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies:\n", expected)

# Decision
alpha = 0.05
if p_value < alpha:
    print("Reject Null Hypothesis: The variables are related.")
else:
    print("Fail to Reject Null Hypothesis: No significant association.")

📌 Interpretation:

If p-value < 0.05, there is a statistically significant association between gender and product preference.
If p-value ≥ 0.05, there is not enough evidence to conclude that the two variables are associated (this does not prove they are independent).

5. Post-Hoc Analysis for Chi-Square

If a significant association is found, we use post-hoc tests to analyze which categories differ.

📌 Common Methods:

Standardized Residuals – Identify which cells contribute most to significance.
Pairwise Chi-Square Tests – Compare specific group pairs.
Bonferroni Correction – Adjusts the significance level for multiple comparisons, typically by dividing $\alpha$ by the number of comparisons.
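
A minimal sketch of the residual-based approach combined with a Bonferroni correction is shown below. The 2x3 counts are hypothetical, and the choice of adjusted standardized residuals (which are approximately standard normal under the null hypothesis) with one comparison per cell is an assumption made for this illustration, not the only way to run a post-hoc analysis.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x3 table: rows = gender, columns = Party A / B / C (illustrative counts)
observed = np.array([[40, 30, 30],
                     [20, 30, 50]])

chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)

n = observed.sum()
row_prop = observed.sum(axis=1, keepdims=True) / n
col_prop = observed.sum(axis=0, keepdims=True) / n

# Adjusted standardized residual per cell:
# (O - E) / sqrt(E * (1 - row proportion) * (1 - column proportion))
adj_resid = (observed - expected) / np.sqrt(expected * (1 - row_prop) * (1 - col_prop))

# Bonferroni correction: split alpha across all cells (two-sided test per cell)
alpha = 0.05
n_cells = observed.size
z_crit = stats.norm.ppf(1 - alpha / (2 * n_cells))

# Cells whose residual exceeds the corrected threshold drive the association
flagged = np.abs(adj_resid) > z_crit

print(np.round(adj_resid, 2))
print(flagged)
```

With these counts the Party A and Party C columns are flagged for both rows, while Party B is not, which suggests the association comes from those two columns.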

6. Real-World Applications of Chi-Square Test

📌 Healthcare – Examining the relationship between smoking and lung disease.
📌 Marketing – Analyzing customer purchase behavior based on demographics.
📌 Education – Studying whether gender influences subject choice.
📌 Politics – Investigating voting preferences across different age groups.