In Unit 3, we move from simply describing data to making formal conclusions about a population based on a sample. This process is called Statistical Inference. Specifically, we focus on categorical data, where results are typically expressed as counts or proportions (percents)—such as "what percentage of voters support a candidate?" or "is there a difference in success rates between two groups?".
Estimating with Confidence: We learn how to calculate a "plausible range" of values for an unknown population proportion, known as a Confidence Interval.
Testing a Claim: We use Hypothesis Tests to determine whether sample data provide convincing evidence for a specific claim about a population, or whether the observed results could plausibly be explained by random chance alone.
Comparing Groups: We expand these tools to compare two different populations or to look for associations between multiple categorical variables using Chi-Square tests.
Statistical inference is the engine of modern science and business. Whether it is a medical trial testing the effectiveness of a new treatment or a marketing firm analyzing customer preferences, these methods allow us to quantify our uncertainty and make data-driven decisions.
This unit focuses on using sample data to make "inferences"—educated conclusions—about larger populations using categorical data.
To understand the population proportion (p), we use a sample statistic called the sample proportion (p-hat).
Unbiased Estimators: An estimator is unbiased if, on average, its value does not overestimate or underestimate the true population parameter.
Mean of the Sampling Distribution: The mean of all possible sample proportions (p-hat) is equal to the population proportion (p).
Standard Deviation: The variability of the sample proportion is calculated as: SD(p-hat) = sqrt( p(1-p) / n )
10% Condition: When sampling without replacement, the sample size must be no more than 10% of the population size (n <= 0.10N). In other words, the population must be at least 10 times as large as the sample.
Normality Condition: The sampling distribution is approximately Normal if there are at least 10 expected successes and 10 expected failures (np >= 10 and n(1-p) >= 10).
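The sampling-distribution facts and conditions above can be sketched in a few lines of Python. This is only an illustration: the function name `sampling_distribution`, its parameter names, and the example values (p = 0.30, n = 100, N = 5000) are assumptions made up for this sketch, not part of the unit.

```python
import math

def sampling_distribution(p, n, population_size=None):
    """Summarize the sampling distribution of p-hat for a known proportion p."""
    mean = p                            # p-hat is an unbiased estimator of p
    sd = math.sqrt(p * (1 - p) / n)     # SD(p-hat) = sqrt( p(1-p) / n )
    # Normality condition: at least 10 expected successes and 10 expected failures
    normal_ok = n * p >= 10 and n * (1 - p) >= 10
    # 10% condition: only checked when a finite population size is supplied
    if population_size is None:
        ten_percent_ok = None
    else:
        ten_percent_ok = n <= 0.10 * population_size
    return {"mean": mean, "sd": sd,
            "normal_ok": normal_ok, "ten_percent_ok": ten_percent_ok}

# Hypothetical example: p = 0.30, sample of 100 from a population of 5000
result = sampling_distribution(p=0.30, n=100, population_size=5000)
print(result)
```

With these made-up values the standard deviation comes out to about 0.046, and both conditions are met.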
A confidence interval provides a range of plausible values for the population proportion.
The Procedure: A one-sample z-interval for a population proportion.
The Formula: p-hat +/- (z*) * sqrt( p-hat(1 - p-hat) / n )
Standard Error (SE): This estimates the standard deviation of the sampling distribution using sample data: SE = sqrt( p-hat(1 - p-hat) / n ).
Margin of Error (MOE): This is half the width of the interval: MOE = (z*) * (SE).
Confidence Level Interpretation: In repeated random sampling, approximately C% of the intervals created will capture the true population proportion.
Relationships: Increasing the sample size (n) decreases the width of the interval, while increasing the confidence level increases the width.
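As a rough sketch of the one-sample z-interval described above, the following Python computes the interval and its margin of error. The function name `one_prop_z_interval` and the sample data (220 successes out of 400) are hypothetical; z* is taken from the standard Normal distribution in Python's `statistics` module.

```python
import math
from statistics import NormalDist

def one_prop_z_interval(successes, n, confidence=0.95):
    """Return (lower, upper, margin_of_error) for a one-sample z-interval."""
    p_hat = successes / n
    # z* leaves (1 - C)/2 of the area in each tail of the standard Normal curve
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = math.sqrt(p_hat * (1 - p_hat) / n)   # standard error uses p-hat
    moe = z_star * se                         # margin of error = z* * SE
    return p_hat - moe, p_hat + moe, moe

# Hypothetical sample: 220 successes in 400 trials
low, high, moe = one_prop_z_interval(successes=220, n=400)
print(f"95% CI: ({low:.3f}, {high:.3f}), margin of error {moe:.3f}")
```

Calling the function again with `confidence=0.99` gives a wider interval, and quadrupling the sample size at the same p-hat roughly halves the margin of error, which matches the relationships listed above.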
A hypothesis test determines if a claim about a population is supported by the data.
Null Hypothesis (H0): The "status quo" or the statement assumed to be true (H0: p = p0).
Alternative Hypothesis (Ha): The claim you are testing for (Ha: p > p0, p < p0, or p != p0).
Test Statistic (z): Measures how many standard deviations the observed p-hat is from the null p0: z = (p-hat - p0) / sqrt( p0(1 - p0) / n )
P-value: The probability of getting a result at least as extreme as the one observed, in the direction of Ha, assuming H0 is true.
If p-value <= alpha: Reject H0; there is convincing evidence for Ha.
If p-value > alpha: Fail to reject H0; there is not enough evidence for Ha.
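The test statistic, P-value, and decision rule above can be sketched as follows. The function name `one_prop_z_test`, the `alternative` labels, and the sample data (58 successes out of 100, tested against p0 = 0.50) are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

def one_prop_z_test(successes, n, p0, alternative="two-sided"):
    """Return (z, p_value) for a one-sample z-test of H0: p = p0."""
    p_hat = successes / n
    # The null value p0, not p-hat, goes into the standard deviation
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    cdf = NormalDist().cdf
    if alternative == "greater":        # Ha: p > p0
        p_value = 1 - cdf(z)
    elif alternative == "less":         # Ha: p < p0
        p_value = cdf(z)
    else:                               # Ha: p != p0
        p_value = 2 * (1 - cdf(abs(z)))
    return z, p_value

# Hypothetical sample: 58 successes in 100 trials, Ha: p > 0.50
z, p_value = one_prop_z_test(successes=58, n=100, p0=0.50, alternative="greater")
alpha = 0.05
decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(f"z = {z:.2f}, P-value = {p_value:.4f}, decision: {decision}")
```

For this made-up sample, z is about 1.6 and the P-value is slightly above 0.05, so the sketch fails to reject H0 at alpha = 0.05.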
Errors:
Type I Error: Rejecting H0 when it was actually true.
Type II Error: Failing to reject H0 when Ha was actually true.
Power: The probability of correctly rejecting a false H0. Power = 1 - P(Type II Error).
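To make the link between power and Type II error concrete, here is a minimal sketch that approximates power for a one-sided test using the Normal model. The unit itself does not give a power formula, so this calculation is an added illustration; the function name `power_one_sided` and the values (p0 = 0.50, a supposed true proportion of 0.60, n = 100) are assumptions.

```python
import math
from statistics import NormalDist

def power_one_sided(p0, p_true, n, alpha=0.05):
    """Approximate power for H0: p = p0 versus Ha: p > p0 (Normal model)."""
    # Smallest p-hat that leads to rejecting H0 at significance level alpha
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    cutoff = p0 + z_alpha * math.sqrt(p0 * (1 - p0) / n)
    # Sampling distribution of p-hat if the true proportion is p_true
    sd_true = math.sqrt(p_true * (1 - p_true) / n)
    return 1 - NormalDist(mu=p_true, sigma=sd_true).cdf(cutoff)

# Hypothetical scenario: H0 says p = 0.50, but the truth is p = 0.60
power = power_one_sided(p0=0.50, p_true=0.60, n=100)
type_ii = 1 - power
print(f"Power = {power:.2f}, P(Type II error) = {type_ii:.2f}")
```

Under these assumptions the power is roughly 0.64. Increasing n or alpha raises it, and when `p_true` equals `p0` the result is simply alpha, the Type I error rate.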
These methods are used when you have two independent groups and want to compare their proportions (p1 - p2).
Two-Sample z-Interval: Used to estimate the difference between two population proportions. Formula: (p-hat1 - p-hat2) +/- (z*) * sqrt( [p-hat1(1-p-hat1)/n1] + [p-hat2(1-p-hat2)/n2] )
Two-Sample z-Test: Used to test whether the two population proportions differ (H0: p1 = p2, or equivalently p1 - p2 = 0).
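Pooled Proportion (p-hat-c): For the hypothesis test, H0 assumes the two proportions are equal, so we combine the successes from both groups: p-hat-c = (successes1 + successes2) / (n1 + n2). The test statistic is then z = (p-hat1 - p-hat2) / sqrt( p-hat-c(1 - p-hat-c) * (1/n1 + 1/n2) )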
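Normality Condition: Both samples must have at least 10 successes and 10 failures. For the interval, check the observed counts in each sample; for the test, check the expected counts computed with the pooled proportion (n1 * p-hat-c, n1(1 - p-hat-c), n2 * p-hat-c, n2(1 - p-hat-c)).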
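The two-sample procedures above might be sketched like this. The interval uses the unpooled standard error from the formula given, while the test uses the pooled proportion. The function names and the sample counts (60 of 100 versus 45 of 100) are hypothetical, and the test shown is the two-sided version only.

```python
import math
from statistics import NormalDist

def two_prop_z_interval(x1, n1, x2, n2, confidence=0.95):
    """Confidence interval for p1 - p2 (unpooled standard error)."""
    p1, p2 = x1 / n1, x2 / n2
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z_star * se, diff + z_star * se

def two_prop_z_test(x1, n1, x2, n2):
    """Two-sided z-test of H0: p1 = p2 using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p_c = (x1 + x2) / (n1 + n2)          # pooled proportion p-hat-c
    se_pooled = math.sqrt(p_c * (1 - p_c) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical data: 60 of 100 successes in group 1, 45 of 100 in group 2
ci_low, ci_high = two_prop_z_interval(60, 100, 45, 100)
z, p_value = two_prop_z_test(60, 100, 45, 100)
print(f"95% CI for p1 - p2: ({ci_low:.3f}, {ci_high:.3f})")
print(f"z = {z:.2f}, P-value = {p_value:.4f}")
```

With these made-up counts the interval lies entirely above 0 and the P-value is below 0.05, so the interval and the test point to the same conclusion.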
Chi-square tests compare observed counts to expected counts in a two-way table.
Test for Homogeneity: Used to see if the distribution of a single categorical variable is the same across multiple populations.
Test for Independence: Used to see if there is an association between two categorical variables in a single population.
Expected Values: Calculated as (Row Total * Column Total) / Table Total.
Chi-Square Statistic (chi^2): Sum of (Observed - Expected)^2 / Expected.
Degrees of Freedom (df): (Number of Rows - 1) * (Number of Columns - 1).
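Conditions: Data must come from a random sample or a randomized experiment, the 10% condition must hold when sampling without replacement, and all expected counts must be at least 5.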
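The expected counts, chi-square statistic, and degrees of freedom above can be computed directly from a two-way table. The sketch below uses only the Python standard library, which has no chi-square distribution, so it stops short of the P-value. The function name `chi_square_two_way` and the 2x2 table of counts are invented for illustration.

```python
def chi_square_two_way(observed):
    """Return (expected, chi2, df, counts_ok) for a two-way table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Expected = (row total * column total) / table total
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    # Chi-square = sum of (Observed - Expected)^2 / Expected over all cells
    chi2 = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    counts_ok = all(e >= 5 for row in expected for e in row)
    return expected, chi2, df, counts_ok

# Hypothetical 2x2 table: rows are groups, columns are outcomes
observed = [[30, 20],
            [20, 30]]
expected, chi2, df, counts_ok = chi_square_two_way(observed)
print(f"expected = {expected}, chi-square = {chi2:.2f}, df = {df}, ok = {counts_ok}")
```

For this table every expected count is 25, the statistic is 4.0, and df = 1. If SciPy happens to be installed, `scipy.stats.chi2.sf(chi2, df)` would give the corresponding P-value.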
To succeed in this unit, you must be precise with your language. Statistical inference relies on specific definitions that are often confused in open-ended questions.
Parameter vs. Statistic: A parameter (p) describes the entire population, while a statistic (p-hat) is calculated from your sample to estimate that parameter.
Standard Deviation vs. Standard Error: Use "Standard Deviation" when you know the true population proportion p; use "Standard Error" (SE) when you are using the sample proportion p-hat to estimate the variability.
Confidence Level vs. Confidence Interval: The "level" is the success rate of the method over many repeated samples; the "interval" is the specific range of values (a, b) calculated from a single sample.
Statistically Significant: This occurs when the p-value is less than or equal to the significance level (alpha), providing convincing evidence to support the alternative hypothesis.
P-value Interpretation: Always start with the phrase "Assuming the null hypothesis is true..." when interpreting a p-value.
Non-definitive Language: Never say you have "proven" a hypothesis is true. Use non-definitive language such as "there is convincing statistical evidence to support" the alternative hypothesis.
The "Probability" Trap: Avoid saying there is a "95% probability" that the true proportion is in your specific interval. Once the interval is calculated, the proportion is either in it or it isn't. Instead, say you are "95% confident" that the interval contains the true value.
H0 is about Parameters: Always write your null and alternative hypotheses using population parameters (p), never sample statistics (p-hat).
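Expected vs. Observed: For Chi-Square tests and for the Normality condition in a hypothesis test, always check your "expected" counts (calculated based on the null hypothesis being true), not just what you "observed" in your data. For a confidence interval there is no null value, so check the observed counts of successes and failures instead.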
10% Condition: Remember that this condition is only necessary when sampling without replacement; it is not required for randomized experiments.
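The distinction between a confidence level and a single confidence interval, noted in the terms above, can be checked with a small simulation: draw many samples from a population with a known proportion, build a 95% interval from each, and count how often the interval captures the true value. Everything here is an illustrative assumption, including the function name `coverage_rate`, the true proportion 0.40, the sample size, the number of trials, and the fixed random seed.

```python
import math
import random
from statistics import NormalDist

def coverage_rate(p_true, n, confidence=0.95, trials=2000, seed=1):
    """Share of simulated z-intervals that capture the true proportion."""
    rng = random.Random(seed)            # fixed seed so the run is repeatable
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    hits = 0
    for _ in range(trials):
        # Simulate one random sample of size n and count the successes
        successes = sum(rng.random() < p_true for _ in range(n))
        p_hat = successes / n
        moe = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
        # Each individual interval either contains p_true or it does not
        if p_hat - moe <= p_true <= p_hat + moe:
            hits += 1
    return hits / trials

rate = coverage_rate(p_true=0.40, n=100)
print(f"Share of intervals capturing p: {rate:.3f}")
```

The printed share should land near 0.95. That long-run capture rate is the confidence level, while any one interval from the loop simply hits or misses.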