Back to the Future: How Do Statisticians Make Predictions?
Have you ever wondered how statisticians are able to make predictions about the future?
In our previous piece, When Science Gets Involved in Politics, we discussed the importance of adhering to scientific sampling techniques as a solid first step. Now that we have our well-defined sample, the natural next question is: how do we find answers about the population?
Welcome to inferential statistics!
Estimation in statistics refers to the process by which statisticians are able to make relatively accurate inferences about a population based on information obtained from a sample.
In order to understand how we make that move, it’s important to differentiate three distributions:
The sample distribution of the variable of interest is empirical and known, because it is the data we actually collected from our sample.
The population distribution of the variable of interest (be it customer satisfaction or product popularity), while empirical, is unknown because it is extremely difficult to survey your entire population.
The sampling distribution is a theoretical, probabilistic distribution of a statistic (such as the mean) across all possible samples of a given size.
It’s important to understand that the sampling distribution is theoretical, meaning that the researcher never observes it in reality, but it is critical for estimation.
Thanks to the laws of probability, a great deal is known about the sampling distribution, such as its shape, central tendency, and dispersion. For reasonably large samples, the shape of the sampling distribution of the mean is approximately a normal curve, which you may know as a “bell curve”: a theoretical distribution of scores that is symmetrical and bell-shaped.
The standard normal curve always has a mean of 0 and a standard deviation of 1. Because we can assume that the shape of the sampling distribution is approximately normal, we can calculate the probabilities of various outcomes. We also know that the mean of the sampling distribution is the same value as the mean of the population.
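To make "calculating the probabilities of various outcomes" a little more concrete, here is a minimal Python sketch that uses the standard library's NormalDist class. The helper name prob_within is our own, introduced only for illustration; it asks how much of the standard normal curve lies within a given number of standard deviations of the mean.

```python
from statistics import NormalDist

# The standard normal curve: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

def prob_within(z):
    """Probability that a standard normal value falls between -z and +z."""
    return standard_normal.cdf(z) - standard_normal.cdf(-z)

within_one = prob_within(1.0)    # share of the curve within 1 standard deviation
within_two = prob_within(1.96)   # share within 1.96 standard deviations

print(f"Within 1.00 SD of the mean: {within_one:.3f}")
print(f"Within 1.96 SD of the mean: {within_two:.3f}")
```

Roughly 68% of the curve sits within one standard deviation of the mean and roughly 95% within 1.96, which is where the familiar "95%" in survey reporting comes from.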
Building on this is the Central Limit Theorem, a result from probability theory that says if random samples of size N are drawn from any population with a finite mean and standard deviation, then as N grows, the sampling distribution of the sample means will approach normality, whatever the shape of the population itself.
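A small simulation can make the Central Limit Theorem easier to believe. The sketch below is only an illustration under our own assumptions: the population is a deliberately lopsided (exponential) one, and the helpers sample_means and skewness are hypothetical names we made up. Skewness measures how lopsided a distribution is, and a normal curve has a skewness of 0.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

def sample_means(n, num_samples=5000):
    """Means of many random samples of size n from a skewed (exponential) population."""
    return [
        statistics.fmean([random.expovariate(1.0) for _ in range(n)])
        for _ in range(num_samples)
    ]

def skewness(values):
    """Average cubed z-score: 0 for a symmetric curve, positive for a right-hand tail."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])

# Larger samples should give sample means that look more and more symmetric.
skew_by_n = {n: skewness(sample_means(n)) for n in (2, 10, 50)}
for n, skew in skew_by_n.items():
    print(f"N = {n:>2}: skewness of the sample means = {skew:.2f}")
```

The population never changes and stays heavily skewed throughout. Only the sample size grows, and the lopsidedness of the sample means should shrink toward 0 as it does.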
Whatever the sample size, the mean of the sampling distribution equals the population mean. What changes with a larger sample is the spread: the standard error of the mean (the population standard deviation divided by the square root of N) shrinks, so the estimates vary less from sample to sample. So now you can start to see how researchers can have more and more confidence in their results.
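Here is a rough sketch of what happens to the sample estimates as the sample grows. Everything in it is an assumption made for illustration: the population is simulated as normal with a mean of 50 and a standard deviation of 10, and simulate is a hypothetical helper. For each sample size it compares the observed spread of the sample means with the textbook standard error, the population standard deviation divided by the square root of N.

```python
import math
import random
import statistics

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical population values, chosen only for illustration.
POP_MEAN, POP_SD = 50.0, 10.0

def simulate(n, num_samples=4000):
    """Return (center, spread) of the means of many random samples of size n."""
    means = [
        statistics.fmean([random.gauss(POP_MEAN, POP_SD) for _ in range(n)])
        for _ in range(num_samples)
    ]
    return statistics.fmean(means), statistics.stdev(means)

results = {}
for n in (4, 25, 100):
    center, spread = simulate(n)
    expected = POP_SD / math.sqrt(n)  # theoretical standard error of the mean
    results[n] = (center, spread, expected)
    print(f"N = {n:>3}: center = {center:.2f}, "
          f"spread = {spread:.2f}, sigma/sqrt(N) = {expected:.2f}")
```

The center of the sample means should sit close to the population mean at every sample size, while the spread should track the standard error and get smaller as N grows.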
But with estimation, there is always a chance of error.
This is where confidence intervals come in. The width of a confidence interval is a function of the risk we are willing to take of being wrong (the confidence level) and the sample size. For a given confidence level, the larger the sample, the narrower the interval.
The confidence level refers to the probability that an interval constructed this way will contain the population parameter. A 95% confidence level means that if we drew many samples and built an interval from each, about 95 out of 100 of those intervals would contain the population mean; accordingly, about 5 out of 100 would not.
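The sketch below shows one common way to build a confidence interval for a mean, and then checks how often such intervals capture the true value. It rests on assumptions of our own: a simulated population with a mean of 70 and a standard deviation of 12, the large-sample normal approximation (for small samples a t-based interval would be the usual choice), and helper names (draw_sample, confidence_interval, width, coverage) that are hypothetical.

```python
import math
import random
import statistics
from statistics import NormalDist

random.seed(2024)  # fixed seed so the sketch is repeatable

# Hypothetical population; we only "know" it because we are simulating.
TRUE_MEAN, TRUE_SD = 70.0, 12.0

def draw_sample(n):
    """One random sample of size n from the simulated population."""
    return [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n)]

def confidence_interval(sample, confidence=0.95):
    """Large-sample interval: mean +/- z * (sample sd / sqrt(N))."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    mean = statistics.fmean(sample)
    margin = z * statistics.stdev(sample) / math.sqrt(len(sample))
    return mean - margin, mean + margin

def width(interval):
    return interval[1] - interval[0]

# Width depends on the sample size and on the confidence level.
small, large = draw_sample(36), draw_sample(400)
width_small_n = width(confidence_interval(small))
width_large_n = width(confidence_interval(large))
width_99 = width(confidence_interval(large, confidence=0.99))

def coverage(n, trials=2000, confidence=0.95):
    """Share of simulated intervals that contain the true population mean."""
    hits = 0
    for _ in range(trials):
        low, high = confidence_interval(draw_sample(n), confidence)
        if low <= TRUE_MEAN <= high:
            hits += 1
    return hits / trials

coverage_95 = coverage(n=100)
print(f"95% width, N = 36:  {width_small_n:.2f}")
print(f"95% width, N = 400: {width_large_n:.2f}")
print(f"99% width, N = 400: {width_99:.2f}")
print(f"Share of 95% intervals containing the true mean: {coverage_95:.3f}")
```

The interval should narrow as the sample grows and widen when we ask for more confidence, and the share of intervals that contain the true mean should land near 0.95.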
When the purpose of the statistical inference is to draw a conclusion about a population, the significance level measures how frequently we will reject a claim about the population that is actually true. For example, a 5% significance level means that, when the claim being tested is true, we will wrongly reject it about 5% of the time. It is always the case that Confidence Level + Significance Level = 1.
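One way to see what a 5% significance level means in practice is to test a claim that we know is true, over and over, and count how often we wrongly reject it. The sketch below does that with a simple large-sample z-test. The population values, the sample size of 100, and the helper two_sided_p_value are all assumptions made for illustration.

```python
import math
import random
import statistics
from statistics import NormalDist

random.seed(11)  # fixed seed so the sketch is repeatable

ALPHA = 0.05                   # significance level
NULL_MEAN, POP_SD = 0.0, 1.0   # hypothetical; the tested claim is true by construction

def two_sided_p_value(sample, null_mean):
    """Large-sample z-test p-value for the claim 'the population mean is null_mean'."""
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    z = (statistics.fmean(sample) - null_mean) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

trials = 2000
false_alarms = sum(
    two_sided_p_value([random.gauss(NULL_MEAN, POP_SD) for _ in range(100)], NULL_MEAN) < ALPHA
    for _ in range(trials)
)
false_alarm_rate = false_alarms / trials
print(f"Share of true claims wrongly rejected: {false_alarm_rate:.3f}")
```

The rejection rate should come out close to 0.05, the complement of the 0.95 confidence level.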
It is possible to make inferences about a population from a sample that is carefully selected. The sampling distribution, a theoretical one, links the known sample to the larger population through estimation. Because of the properties of the sampling distribution, we are able to state how close a sample statistic is likely to be to the population value, with a chosen level of confidence.
Whether you realize it or not, this is under our noses every day in the news!
Keep your eye out, and the next time someone at a cocktail party talks about who is ahead in the polls, you’ll be armed with a healthy dose of skepticism.