Bias in Statistics
Let’s Talk About Statistics

Let’s talk about statistics: what they are, how we use them in health and in our lives, the basic principles behind them, and how to use them correctly.

What Are Statistics and Statistical Analyses?

Statistics is the area of math that deals with calculating risk, evaluating differences between groups, and estimating probability, for starters. Statistics has a wide range of uses, but every use case relies on certain principles. So let’s talk about one of the main principles of statistics, the Central Limit Theorem.

Central Limit Theorem

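The Central Limit Theorem basically states that if we take a random sample of sufficient size (a common rule of thumb is at least 30 randomly chosen members of the population, each chosen only once), the average (mean) of that sample will tend to land close to the average of the larger population. If we repeat the sampling many times, the sample means themselves form a bell-shaped curve centered on the population mean, and that curve gets narrower as the sample size grows. A random sample of this size will also tend to show a variance (a measure of how far the values typically sit from the mean) close to the variance of the population.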
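For readers who want to see this on a computer, here is a minimal sketch in Python. The population is made up (100,000 skewed values generated at random), and the sample size of 30 and the 2,000 repeated samples are assumptions chosen only for illustration. The point is that the means of the samples cluster around the population mean, with a spread close to the population standard deviation divided by the square root of the sample size.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical, deliberately skewed population (exponential values, mean near 1.0).
population = [random.expovariate(1.0) for _ in range(100_000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

def sample_means(pop, sample_size, n_samples):
    """Draw n_samples random samples (no repeats within a sample) and return their means."""
    return [statistics.fmean(random.sample(pop, sample_size)) for _ in range(n_samples)]

means = sample_means(population, sample_size=30, n_samples=2_000)

print(f"population mean:          {pop_mean:.3f}")
print(f"mean of the sample means: {statistics.fmean(means):.3f}")
print(f"SD of the sample means:   {statistics.stdev(means):.3f}")
print(f"population SD / sqrt(30): {pop_sd / 30 ** 0.5:.3f}")
```

Even though the population itself is far from bell-shaped, the first two printed numbers should nearly match, and so should the last two.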

So let’s put this into a practical example you can repeat at home. Let’s use a coin toss, which is what we would consider a binary (only two outcomes) variable. By this we mean that when we toss the coin, there are two possible outcomes, heads and tails. Suppose we flip a coin 10 times, record the number of heads and tails, and repeat this set of coin tosses 100 times. We would find that the majority of the sets come out as 6 heads/4 tails, 5 heads/5 tails, or 4 heads/6 tails. Very few sets would be 1 head/9 tails or 9 heads/1 tail, and it’s possible we would never see a single straight ten of heads or tails.
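If you would rather not flip a coin 1,000 times, the experiment above can be sketched in Python. This is only an illustration: the simulated coin is assumed to be perfectly fair, and the function name heads_in_set is made up for this example. The second half computes the exact probabilities from the binomial formula, which puts the three central outcomes (4, 5, or 6 heads) at about 66% of sets and a straight ten of either side at about 0.2%.

```python
import random
from collections import Counter
from math import comb

random.seed(1)  # fixed seed so the sketch is repeatable

def heads_in_set(n_flips=10):
    """Flip a simulated fair coin n_flips times and return the number of heads."""
    return sum(random.random() < 0.5 for _ in range(n_flips))

# 100 sets of 10 flips, tallied by the number of heads in each set.
counts = Counter(heads_in_set() for _ in range(100))
for heads in range(11):
    print(f"{heads:2d} heads / {10 - heads:2d} tails: {'#' * counts[heads]}")

# Exact probability of each outcome for a fair coin (binomial formula).
exact = {k: comb(10, k) / 2 ** 10 for k in range(11)}
central = exact[4] + exact[5] + exact[6]
print(f"P(4, 5, or 6 heads)     = {central:.3f}")
print(f"P(10 heads or 10 tails) = {exact[0] + exact[10]:.4f}")
```

The rows of # marks form a rough text version of the bell-shaped graph, tallest in the middle and nearly empty at the ends.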

Figure: Central Limit Theorem. Source: https://croor.wordpress.com/2010/11/03/central-limit-theorem/

This bell-shaped pattern is called a normal distribution (or Gaussian distribution) curve, and the way the results cluster around the middle is the central tendency of the data. It is the pattern that the Central Limit Theorem describes: the majority of instances occur toward the center (4/6, 5/5, 6/4). In nature, many measurements follow similar curves. This allows us to draw conclusions about a large number of people (the population) based on a smaller selected group (the sample).

You can read more on the basics of Central Limit Theorem here:

https://croor.wordpress.com/2010/11/03/central-limit-theorem/

or here: https://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Probability/BS704_Probability12.html

or this Khan Academy video here: https://www.khanacademy.org/math/ap-statistics/sampling-distribution-ap/sampling-distribution-mean/v/central-limit-theorem

or Khan Academy’s probability material here: https://www.khanacademy.org/math/probability

Caution:

Caution is needed here. If we aren’t careful about who is chosen for the sample, or we don’t use a mechanism to ensure that the sample is truly random, we run the risk of a biased sample. A biased sample won’t let us accurately understand the population, because the people in it differ systematically from the population as a whole. When we make sure the sample is truly random, we are working to ensure that it is “representative” of the population.
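To make the risk concrete, here is a small Python sketch that compares a random sample with a biased one. Everything in it is hypothetical: the population, the two groups "A" and "B", and their values are invented for illustration. The biased sample only ever reaches group B, so its average lands far from the true population average, while the random sample lands close to it.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is repeatable

# Hypothetical population: 70% in group A (values near 50), 30% in group B (values near 70).
population = (
    [("A", random.gauss(50, 5)) for _ in range(7_000)]
    + [("B", random.gauss(70, 5)) for _ in range(3_000)]
)
true_mean = statistics.fmean(value for _, value in population)

# Random sample: every member of the population has the same chance of being chosen.
random_sample = random.sample(population, 200)
random_mean = statistics.fmean(value for _, value in random_sample)

# Biased sample: only members of group B can be chosen.
group_b = [member for member in population if member[0] == "B"]
biased_sample = random.sample(group_b, 200)
biased_mean = statistics.fmean(value for _, value in biased_sample)

print(f"true population mean: {true_mean:.1f}")
print(f"random sample mean:   {random_mean:.1f}")
print(f"biased sample mean:   {biased_mean:.1f}")
```

Both samples contain 200 members, so the difference comes from how the members were chosen, not from how many were chosen.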

A good example of flawed samples can often be seen in political polls. The sample chosen may not be representative of the people who actually vote, which skews the pollsters’ conclusions. The bias may be intentional or unintentional, but either way, ensuring that the sample is truly random and representative of the population is one of the most important factors in being able to use the conclusions drawn from the data.

So what does this mean for health statistics?

If we know the average (the mean, or center/peak of the curve) and we know how varied the data is around that mean (the variance, or its square root, the standard deviation), we can calculate whether an observed data point falls within the expected spread or outside of it. This can be measured in a number of different ways, each with its own applications. The values that describe where the data congregates, such as the mean, are called measures of central tendency. The values that describe how far the data spreads from that center, such as the variance and standard deviation, are called measures of spread.
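One common way to do this calculation is to count how many standard deviations a value sits from the mean, often called a z-score. The sketch below uses ten made-up measurements, and the cutoff of 2 standard deviations is an assumption (a widely used convention, not a rule): values beyond it are flagged as outside what is expected.

```python
import statistics

# Hypothetical measurements, invented for illustration only.
values = [98, 102, 95, 101, 99, 104, 97, 100, 103, 101]
mean = statistics.fmean(values)
sd = statistics.stdev(values)  # sample standard deviation

def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

for observed in (101, 115):
    z = z_score(observed, mean, sd)
    verdict = "within" if abs(z) <= 2 else "outside"
    print(f"value {observed}: z = {z:+.2f}, {verdict} the expected range")
```

With these numbers, 101 sits well inside the expected range and 115 sits far outside it.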

So let’s look back at our coin toss example. If someone flipping a coin got all heads or all tails multiple times within a short period, we can look at our graph of the coin tosses and see that several all-heads or all-tails sets is extremely unlikely. Not impossible, just extremely unlikely. We would say that the probability (the likelihood of the event occurring) is extremely low if the coin is fair. The assumption that the coin is fair, so that the results are due to chance alone, is what we call the null hypothesis. When the observed results fall this far outside the limits of central tendency, it is more plausible that they are not due to chance but to a non-random coin (one weighted to land consistently on heads or tails). In that case we reject the null hypothesis and conclude that the measured effect is “probably” not due to chance.
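We can put a number on "extremely unlikely" with a short Python sketch. It assumes a fair coin (the null hypothesis) and 100 sets of 10 flips, as in the earlier example. The function name prob_at_least is made up for this illustration. It uses the binomial formula to ask how likely it is to see one, two, or three straight-ten sets by chance alone.

```python
from math import comb

# Under the null hypothesis (fair coin), the chance that one set of 10 flips
# comes out all heads or all tails is 2 out of 2**10 = 1,024 possible sequences.
p_streak = 2 / 2 ** 10

def prob_at_least(k, n, p):
    """Probability of k or more successes in n independent trials (binomial tail)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

for k in (1, 2, 3):
    chance = prob_at_least(k, 100, p_streak)
    print(f"P(at least {k} straight-ten sets in 100 sets) = {chance:.6f}")
```

The probability drops sharply with each additional straight-ten set, which is why seeing several of them is a reason to doubt that the coin is fair.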

We can apply this same reasoning when we compare two or more groups of things. Health statistics, or biomedical statistics, is based on looking at the likelihood of an illness or event given an intervention (or factor). These comparisons are often adjusted for age, race, sex, gender, ethnicity, etc., as these factors can change the expected (mean) value.

In our next post, we will talk about how we can utilize statistics to look at vaccine safety, specifically some of the recent meeting notes on COVIDvaxx data.
