Normal Distribution - Statistics

Weird, right, a "normal" distribution? It sounds like there is nothing normal about it, rather that it is abnormal before we even start. But hey, let's not get demotivated. There is a lot that is normal about the normal distribution, with just a little abnormality.

Before we delve into the normal distribution, let us look at some of the terms we need to understand.

Mean: It is the average of all the numbers. It can be calculated by taking the sum of all the values and dividing the sum by the total number of values.

Standard Deviation: It is a measure of how spread out the values are. If the values are not too far from the mean, they are said to be concentrated near the mean; if they are far away from the mean, they are said to be widely spread.
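To make these two definitions concrete, here is a minimal sketch in Python. The height values are made up purely for illustration:

```python
import numpy as np

# Made-up height measurements in cm, just for illustration
heights = np.array([165.2, 167.8, 168.1, 166.5, 169.0, 170.3, 167.2])

# Mean: sum of all the values divided by how many there are
mean = heights.sum() / len(heights)

# Standard deviation: root of the average squared distance from the mean
std = np.sqrt(((heights - mean) ** 2).mean())

print(f"mean = {mean:.2f} cm, standard deviation = {std:.2f} cm")
```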

So, when we were kids we always had this fight amongst us about who would stand first in the line. All of us used to measure each other's heights every single day just to be sure no one was cheating. But it did not turn out to be very effective because no one was sure what their height had been the previous day. So, the brilliant me (I am being a little boastful here) came up with the idea of noting everyone's heights once a week so we wouldn't waste time every day. After a couple of weeks I happened to notice that these values had a pattern. They were all almost similar, differing only by minute decimal values. Interestingly, they were also concentrated near one value (silly me, being an overthinker, thought I had not taken the measurements correctly).

It was only later, when I grew up and took up statistics, that I understood the concept of continuous values. A continuous random variable is one that can take any real value within a specified range. Our heights are continuous variables, meaning that the probability of a height taking any one exact value is zero. The value of one's height changes slightly with every additional digit considered after the decimal point. These kinds of distributions are called continuous probability distributions.
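A quick way to see what "the probability of an exact value is zero" means in practice: with a continuous model we can only assign probability to intervals. The sketch below assumes heights roughly follow a normal distribution with a made-up mean of 168 cm and standard deviation of 3 cm:

```python
from scipy.stats import norm

# Assumed (made-up) parameters for illustration
mu, sigma = 168.0, 3.0

# The probability of one exact point is zero for a continuous variable
p_exact = norm.cdf(168.0, mu, sigma) - norm.cdf(168.0, mu, sigma)

# The probability of falling in an interval is an area under the density curve
p_interval = norm.cdf(168.5, mu, sigma) - norm.cdf(167.5, mu, sigma)

print(p_exact)     # 0.0
print(p_interval)  # roughly 0.13
```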

Normal distributions are important because they are useful for representing real-valued random variables whose distributions are not known. So, in my case, the heights of my classmates are random variables whose distribution I certainly do not know. By the Central Limit Theorem, the means of samples drawn from any such distribution with finite mean and variance converge to a normal distribution. Now, the abnormality I was talking about has suddenly jumped in, so let's break it down a bit and understand the concept.

As I already mentioned, I was collecting samples of my classmates' heights, and these values are random, with no relation between any of them. If I instead go to the extent of measuring the heights of everyone in my class every day, I have a huge number of sample values. The distribution of the averages, or to be statistically correct, the means, of these daily samples collected over a period of time will approach a normal distribution. That is why physical quantities that are the sum of many independent processes are often assumed to be normal.
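To see the Central Limit Theorem in action, here is a small simulation sketch. It deliberately assumes nothing normal about the raw heights: the daily "measurements" are drawn from a uniform distribution (a made-up 160 cm to 176 cm range), and yet the daily sample means come out bell shaped and tightly concentrated:

```python
import numpy as np

rng = np.random.default_rng(0)

# Deliberately non-normal raw heights: uniform between 160 cm and 176 cm
n_days, class_size = 5000, 30
daily_samples = rng.uniform(160, 176, size=(n_days, class_size))

# One mean per day; by the CLT these means are approximately normal
daily_means = daily_samples.mean(axis=1)

print("mean of sample means:", daily_means.mean())   # close to 168, the midpoint
print("spread of sample means:", daily_means.std())  # much smaller than the raw spread
```

Plotting a histogram of `daily_means` would show the familiar bell shape even though the underlying heights were uniform.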

Here, in my case, I have taken just a small group of children to calculate the mean value. This set of my classmates can be treated as a sample. The set of all the students of the same age can be considered the population. So, to find out the mean value for all the students of that age I need not take the value for each and every student; instead, I can take these sample values and estimate some of the characteristics of the population. When I take the means of all the samples that I collected over the period of time and average them, that average will be approximately equal to the population mean. Likewise, the standard deviations of the samples give an estimate of the population standard deviation.
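Here is a rough sketch of that sample-versus-population idea, with all numbers made up: build a large "population" of heights, draw repeated small samples from it, and compare the average of the sample means with the true population mean (the sample standard deviations only estimate the population standard deviation):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up "population": heights of every student of that age
population = rng.normal(168, 3, size=100_000)

sample_means, sample_sds = [], []
for _ in range(2000):
    sample = rng.choice(population, size=30, replace=False)
    sample_means.append(sample.mean())
    sample_sds.append(sample.std(ddof=1))  # ddof=1 gives the usual sample standard deviation

print("population mean:", population.mean())
print("average of sample means:", np.mean(sample_means))  # essentially the same
print("population sd:", population.std())
print("average of sample sds:", np.mean(sample_sds))       # a close estimate, not exact
```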

Here is a set of conditions that are satisfied by variables which are said to be normally distributed:

  1. The mean, mode and median values are all equal.
  2. The curve is symmetric about the mean.
  3. Exactly half of the values lie to the left of the mean and the other half to the right. The total area under the curve is 1.

So, the data on my classmates' heights that I collected would have the same mean, mode, and median values, and the number of values on either side of the mean would be exactly the same.
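These properties can be checked numerically with scipy's normal distribution. A quick sketch, reusing the same made-up mean of 168 cm and standard deviation of 3 cm:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

mu, sigma = 168.0, 3.0  # assumed, made-up parameters

# Symmetric: the density at mu + d equals the density at mu - d
print(norm.pdf(mu + 2, mu, sigma), norm.pdf(mu - 2, mu, sigma))

# Exactly half the probability lies to the left of the mean
print(norm.cdf(mu, mu, sigma))  # 0.5

# The total area under the curve is 1
area, _ = quad(lambda x: norm.pdf(x, mu, sigma), -np.inf, np.inf)
print(area)                     # 1.0
```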

There are some more interesting facts that can be inferred if we know that our data is normally distributed. These are summed up by the Empirical Rule. According to that rule, 68% of the values lie within one standard deviation of the mean, 95% lie within two standard deviations, and 99.7% lie within three standard deviations.

In my classmates' case, if the mean value is 168cm and the standard deviation is, say, 3cm, then 68% of my values are between 165cm and 171cm, 95% of my values lie between 162cm and 174cm, and 99.7% of the values lie between 159cm and 177cm. So, normally distributed data whose mean and standard deviation are known can help us understand a lot about the data.
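Here is a small check of those 68-95-99.7 percentages using the same assumed mean of 168 cm and standard deviation of 3 cm:

```python
from scipy.stats import norm

mu, sigma = 168.0, 3.0  # assumed, made-up parameters

for k in (1, 2, 3):
    low, high = mu - k * sigma, mu + k * sigma
    coverage = norm.cdf(high, mu, sigma) - norm.cdf(low, mu, sigma)
    print(f"within {k} SD: {low:.0f} cm to {high:.0f} cm -> {coverage:.1%}")
```

This prints roughly 68.3%, 95.4%, and 99.7%, matching the rule.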

Standard Normal Distribution

This is a special case of the normal distribution where the mean value is zero and the standard deviation is 1. This distribution is also known as the Z-distribution, and a value expressed on this scale is known as a Z-score. When we want to compare things that are difficult to compare on their original scales, we can standardize the values; standardization would even let us compare apples and oranges. Z-scores are a great way to understand where a specific observation falls with respect to the entire distribution.

To calculate the standard score of an observation, take the measured value, subtract the mean from it, and divide the result by the standard deviation.
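As a sketch, here is that calculation for one made-up height, again assuming a mean of 168 cm and a standard deviation of 3 cm:

```python
mu, sigma = 168.0, 3.0      # assumed population mean and standard deviation
height = 172.5              # one made-up observation

z = (height - mu) / sigma   # subtract the mean, then divide by the standard deviation
print(z)                    # 1.5, i.e. 1.5 standard deviations above the mean
```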


The normal (or Gaussian) distribution is one particular kind of bell-shaped curve. It is unimodal (that is, there is one peak), symmetric (that is, you can flip it around its midpoint), and its mean, median and mode are all equal. However, it is only one such distribution - other distributions meet all those conditions and are not normal.

Many things are approximately normally distributed, for example the heights of adult human females or males, IQ, etc.

In addition, parametric statistics often assumes that something is normally distributed. E.g. ordinary least squares regression assumes the errors are normal.
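As a rough illustration of that assumption (with made-up data, not a rigorous diagnostic), one can fit an ordinary least squares line with scipy and run a normality test on the residuals:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Made-up data: a linear relationship plus normally distributed noise
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(0, 1.5, size=200)

# Ordinary least squares fit
slope, intercept, r, p, se = stats.linregress(x, y)
residuals = y - (slope * x + intercept)

# Shapiro-Wilk test: a large p-value is consistent with normally distributed errors
stat, p_value = stats.shapiro(residuals)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, Shapiro p={p_value:.2f}")
```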