Unit 3: Inference for Categorical Data: Proportions

What is this unit about?

In Unit 3, we move from simply describing data to drawing formal conclusions about a population based on a sample. This process is called statistical inference. Specifically, we focus on categorical data, where results are typically expressed as counts or proportions (percentages), such as "What percentage of voters support a candidate?" or "Is there a difference in success rates between two groups?"

The Three Pillars of Unit 3

Estimating with Confidence: We learn how to calculate a "plausible range" of values for an unknown population proportion, known as a Confidence Interval.
Testing a Claim: We use Hypothesis Tests to determine if a specific claim about a population is supported by the evidence, or if the results we saw were likely just due to random chance.
Comparing Groups: We extend these tools to compare proportions from two populations, and we use Chi-Square tests to compare distributions across several populations or to look for an association between two categorical variables.

Why does it matter?

Statistical inference is the engine of modern science and business. Whether it is a medical trial testing the effectiveness of a new treatment or a marketing firm analyzing customer preferences, these methods allow us to quantify our uncertainty and make data-driven decisions.

Unit 3 Overview: Inference for Categorical Data: Proportions

This unit focuses on using sample data to make "inferences" (educated conclusions) about larger populations using categorical data.

3.1 & 3.2: Estimators and Sampling Distributions

To understand the population proportion (p), we use a sample statistic called the sample proportion (p-hat).

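Unbiased Estimators: An estimator is unbiased if the mean of its sampling distribution equals the true population parameter. Individual samples may land above or below the parameter, but the estimator does not systematically overestimate or underestimate it.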
Mean of the Sampling Distribution: The mean of all possible sample proportions (p-hat) is equal to the population proportion (p).
Standard Deviation: The variability of the sample proportion is calculated as: SD(p-hat) = sqrt( p(1-p) / n )
10% Condition: When sampling without replacement, the sample size should be no more than 10% of the population size (n <= 0.10N), so that the population is at least 10 times as large as the sample.
Normality Condition: The sampling distribution is approximately Normal if there are at least 10 expected successes and 10 expected failures (np >= 10 and n(1-p) >= 10).
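The formulas and conditions above can be sketched in a few lines of Python. This is a minimal illustration, not part of the course materials: the function name `sampling_distribution` and the example values (p = 0.30, n = 200, N = 5000) are assumptions chosen for demonstration.

```python
import math

def sampling_distribution(p, n, population_size=None):
    """Summarize the sampling distribution of p-hat for samples of size n."""
    mean = p                                   # p-hat is unbiased, so its mean is p
    sd = math.sqrt(p * (1 - p) / n)            # SD(p-hat) = sqrt(p(1-p)/n)
    # Normality condition: at least 10 expected successes and 10 expected failures
    normal_ok = n * p >= 10 and n * (1 - p) >= 10
    # 10% condition is only checked when a population size is supplied
    ten_percent_ok = None if population_size is None else n <= 0.10 * population_size
    return {"mean": mean, "sd": sd, "normal_ok": normal_ok, "ten_percent_ok": ten_percent_ok}

# Hypothetical example: 30% of a population of 5000 has the trait; sample 200
result = sampling_distribution(p=0.30, n=200, population_size=5000)
print(result)
```

With these example values the standard deviation is about 0.032, and both conditions are met (60 expected successes, 140 expected failures, and 200 is less than 10% of 5000).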

3.3 & 3.4: Confidence Intervals for a Population Proportion

A confidence interval provides a range of plausible values for the population proportion.

The Procedure: A one-sample z-interval for a population proportion.
The Formula: p-hat +/- (z*) * sqrt( p-hat(1 - p-hat) / n )
Standard Error (SE): This estimates the standard deviation of the sampling distribution using sample data: SE = sqrt( p-hat(1 - p-hat) / n ).
Margin of Error (MOE): This is half the width of the interval: MOE = (z*) * (SE).
Confidence Level Interpretation: In repeated random sampling, approximately C% of the intervals created will capture the true population proportion.
Relationships: Increasing the sample size (n) decreases the width of the interval, while increasing the confidence level increases the width.
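As a rough sketch of the interval procedure above, the following Python code computes a one-sample z-interval using only the standard library. The function name `one_prop_z_interval` and the example counts (120 successes out of 400) are hypothetical, chosen only to illustrate the arithmetic.

```python
import math
from statistics import NormalDist

def one_prop_z_interval(successes, n, confidence=0.95):
    """Return (lower, upper, margin_of_error) for a one-sample z-interval."""
    p_hat = successes / n
    # Critical value z* leaves (1 - confidence)/2 in each tail of the standard Normal
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = math.sqrt(p_hat * (1 - p_hat) / n)    # standard error uses p-hat, not p
    moe = z_star * se                          # margin of error = z* x SE
    return p_hat - moe, p_hat + moe, moe

# Hypothetical sample: 120 successes in 400 trials
low, high, moe = one_prop_z_interval(successes=120, n=400, confidence=0.95)
print(f"95% CI: ({low:.4f}, {high:.4f}), MOE = {moe:.4f}")

# The relationships described above: larger n narrows, higher confidence widens
_, _, moe_bigger_n = one_prop_z_interval(successes=1200, n=4000, confidence=0.95)
_, _, moe_99 = one_prop_z_interval(successes=120, n=400, confidence=0.99)
```

Here the margin of error is roughly 0.045. Multiplying the sample size by 10 while keeping p-hat at 0.30 shrinks the margin, and moving to 99% confidence enlarges it.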

3.5 - 3.8: Hypothesis Testing for a Population Proportion

A hypothesis test determines if a claim about a population is supported by the data.

Null Hypothesis (H0): The "status quo" or the statement assumed to be true (H0: p = p0).
Alternative Hypothesis (Ha): The claim you are testing for (Ha: p > p0, p < p0, or p != p0).
Test Statistic (z): Measures how many standard deviations the observed p-hat is from the null p0: z = (p-hat - p0) / sqrt( p0(1 - p0) / n )
P-value: The probability of getting a result at least as extreme as the one observed, in the direction of Ha, assuming H0 is true.
- If p-value <= alpha: Reject H0; there is convincing evidence for Ha.
- If p-value > alpha: Fail to reject H0; there is not enough evidence for Ha.
Errors:
- Type I Error: Rejecting H0 when it was actually true.
- Type II Error: Failing to reject H0 when Ha was actually true.
- Power: The probability of correctly rejecting a false H0.
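The test statistic and P-value described above can be sketched as follows. This is an illustrative example under assumed values: the function name `one_prop_z_test`, the `alternative` labels, and the data (230 successes in 400 trials, testing H0: p = 0.5 against Ha: p > 0.5) are not from the course materials.

```python
import math
from statistics import NormalDist

def one_prop_z_test(successes, n, p0, alternative="two-sided"):
    """Return (z, p_value) for a one-sample z-test of H0: p = p0."""
    p_hat = successes / n
    # The standard deviation uses the null value p0, because H0 is assumed true
    sd0 = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / sd0
    std_normal = NormalDist()
    if alternative == "greater":               # Ha: p > p0
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "less":                # Ha: p < p0
        p_value = std_normal.cdf(z)
    else:                                      # Ha: p != p0
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    return z, p_value

# Hypothetical data: 230 successes in 400 trials, Ha: p > 0.5
z, p_value = one_prop_z_test(successes=230, n=400, p0=0.5, alternative="greater")
alpha = 0.05
decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(f"z = {z:.2f}, p-value = {p_value:.5f}, decision: {decision}")
```

In this example p-hat is 0.575, which is 3 standard deviations above 0.5, so the P-value is about 0.0013 and H0 is rejected at alpha = 0.05.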

3.9 - 3.13: Comparing Two Proportions

These methods are used when you have two independent groups and want to compare their proportions (p1 - p2).

Two-Sample z-Interval: Used to estimate the difference between two population proportions. Formula: (p-hat1 - p-hat2) +/- (z*) * sqrt( [p-hat1(1-p-hat1)/n1] + [p-hat2(1-p-hat2)/n2] )
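Two-Sample z-Test: Used to test whether two population proportions differ (H0: p1 = p2, or equivalently p1 - p2 = 0).
Pooled Proportion (p-hat-c): Because H0 assumes the two proportions are equal, the hypothesis test combines the successes from both groups: p-hat-c = (successes1 + successes2) / (n1 + n2). The test statistic is z = (p-hat1 - p-hat2) / sqrt( p-hat-c(1 - p-hat-c)(1/n1 + 1/n2) ).
Normality Condition: Both samples must have at least 10 successes and 10 failures. For the interval, check the observed counts in each sample; for the test, check the expected counts computed with p-hat-c.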
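The contrast between the interval (unpooled standard error) and the test (pooled standard error) can be sketched in Python. The function name `two_prop_z` and the counts are hypothetical. The pooled test statistic used here, z = (p-hat1 - p-hat2) / sqrt( p-hat-c(1 - p-hat-c)(1/n1 + 1/n2) ), is the standard form for a two-sample z-test with H0: p1 = p2.

```python
import math
from statistics import NormalDist

def two_prop_z(x1, n1, x2, n2, confidence=0.95):
    """Return the z-interval for p1 - p2 and the two-sided pooled z-test."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    std_normal = NormalDist()

    # Interval: unpooled standard error, each sample uses its own p-hat
    se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_star = std_normal.inv_cdf(1 - (1 - confidence) / 2)
    interval = (diff - z_star * se_unpooled, diff + z_star * se_unpooled)

    # Test of H0: p1 = p2 uses the pooled (combined) proportion
    p_c = (x1 + x2) / (n1 + n2)
    se_pooled = math.sqrt(p_c * (1 - p_c) * (1 / n1 + 1 / n2))
    z = diff / se_pooled
    p_value = 2 * (1 - std_normal.cdf(abs(z)))   # two-sided alternative
    return {"interval": interval, "p_c": p_c, "z": z, "p_value": p_value}

# Hypothetical data: 60 of 100 successes in group 1, 45 of 100 in group 2
out = two_prop_z(x1=60, n1=100, x2=45, n2=100)
print(out)
```

With these example counts the pooled proportion is 0.525 and z is about 2.12. The 95% interval for p1 - p2 lies entirely above 0, which agrees with the two-sided test rejecting H0 at alpha = 0.05.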

3.14 & 3.15: Chi-Square Tests

Chi-square tests compare observed counts to expected counts in a two-way table.

Test for Homogeneity: Used to see if the distribution of a single categorical variable is the same across multiple populations.
Test for Independence: Used to see if there is an association between two categorical variables in a single population.
Expected Values: Calculated as (Row Total * Column Total) / Table Total.
Chi-Square Statistic (chi^2): Sum of (Observed - Expected)^2 / Expected.
Degrees of Freedom (df): (Number of Rows - 1) * (Number of Columns - 1).
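Conditions: Data must come from random samples or a randomized experiment, the 10% condition must hold when sampling without replacement, and all expected counts must be at least 5.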
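The expected counts, chi-square statistic, and degrees of freedom above can be computed directly from a two-way table. The sketch below uses only the standard library, so it stops at the statistic and does not compute a P-value, which would require a chi-square distribution function from a table, calculator, or statistics package. The function name `chi_square_two_way` and the 2 x 2 table are hypothetical.

```python
def chi_square_two_way(observed):
    """Return (expected, chi2, df, counts_ok) for a two-way table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    table_total = sum(row_totals)

    # Expected count = (row total * column total) / table total
    expected = [[r * c / table_total for c in col_totals] for r in row_totals]

    # chi^2 = sum of (Observed - Expected)^2 / Expected over every cell
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    counts_ok = all(e >= 5 for row in expected for e in row)   # large counts condition
    return expected, chi2, df, counts_ok

# Hypothetical 2 x 2 table: rows are groups, columns are outcomes
observed = [[30, 20],
            [20, 30]]
expected, chi2, df, counts_ok = chi_square_two_way(observed)
print(expected, chi2, df, counts_ok)
```

Every expected count in this table is 25, so each cell contributes (5^2)/25 = 1 and the statistic is 4 with 1 degree of freedom.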

Keys to Success: Mastering Unit 3 Vocabulary

To succeed in this unit, you must be precise with your language. Statistical inference relies on specific definitions that are often confused in open-ended questions.

Parameter vs. Statistic: A parameter (p) describes the entire population, while a statistic (p-hat) is calculated from your sample to estimate that parameter.
Standard Deviation vs. Standard Error: Use "Standard Deviation" when you know the true population proportion p; use "Standard Error" (SE) when you are using the sample proportion p-hat to estimate the variability.
Confidence Level vs. Confidence Interval: The "level" is the success rate of the method over many repeated samples; the "interval" is the specific range of values (a, b) calculated from a single sample.
Statistically Significant: This occurs when the p-value is less than or equal to the significance level (alpha), providing convincing evidence to support the alternative hypothesis.
P-value Interpretation: Always start with the phrase "Assuming the null hypothesis is true..." when interpreting a p-value.
Non-definitive Language: Never say you have "proven" a hypothesis is true. Use non-definitive language such as "there is convincing statistical evidence to support" the alternative hypothesis.

Common Pitfalls to Avoid

The "Probability" Trap: Avoid saying there is a "95% probability" that the true proportion is in your specific interval. Once the interval is calculated, the proportion is either in it or it isn't. Instead, say you are "95% confident" that the interval contains the true value.
H0 is about Parameters: Always write your null and alternative hypotheses using population parameters (p), never sample statistics (p-hat).
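Expected vs. Observed: For Chi-Square tests and for the Normality condition in a hypothesis test, always check your "expected" counts (calculated assuming the null hypothesis is true), not just what you "observed" in your data. A confidence interval has no null value, so there you check the observed counts of successes and failures instead.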
10% Condition: Remember that this condition is only necessary when sampling without replacement; it is not required for randomized experiments.
