Woah! The title looks very mathematical. Have you already lost interest in reading on? Maybe, or maybe not. But I guess there are some exciting measures that can, after all, save us at some point in life. My story went like this. I was a very bad student in school, so I ended up getting average marks. Oh wait! What does average mean? That is where I got lucky: I knew how to use statistical measures effectively and escape my parents’ tantrums. So, let’s dive in and learn a few tricks.
We are all familiar with the concept of the mean. It is the sum of all the numbers divided by the count of numbers. Let us consider a scenario where you want to show that you performed well in a test. The mean gives us a central value for the class, and it always falls between the lowest and the highest score. If most of your classmates’ scores are near the mean and your score is well above it, towards the right of the distribution, that can be used to argue that you performed well compared to your peers. (This saved me from my mom more than anything else I have ever thought of.)
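To make the definition concrete, here is a minimal sketch in Python. The scores and the names `class_scores` and `my_score` are made up purely for illustration.

```python
# Hypothetical scores of my classmates (made-up numbers for illustration).
class_scores = [62, 65, 68, 70, 71, 73, 75]
my_score = 88  # hypothetical score I want to compare against the class

# Mean = sum of all the numbers divided by the count of numbers.
mean_score = sum(class_scores) / len(class_scores)

# The mean always lies between the lowest and the highest score.
lowest, highest = min(class_scores), max(class_scores)

print(f"Class mean: {mean_score:.2f} (between {lowest} and {highest})")
print(f"My score is {my_score - mean_score:.2f} marks above the mean")
```

With numbers like these, most classmates sit within a few marks of the mean, so a score far above it stands out.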
Now, let us also see a case where we want to know how the whole class performed. How far apart are the lowest and highest scores? What is the spread, or dispersion, of the scores? The concept called standard deviation can be used to answer these questions. The standard deviation is a measure of the amount of variation or dispersion of a set of values. Roughly speaking, it tells you how far a typical value lies from the mean. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.
Now, I was in a class where there were students who performed extremely well and students who were more into extracurricular activities. So, our mark sheet had both very low and very high scores. In this case, the scores are spread over a wide range and the standard deviation is high. But if you are in a class where all the students managed to obtain a fairly good score, then the scores would be concentrated near the mean. The scores are then close to each other, and the standard deviation is low.
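The two kinds of class can be compared with a small sketch. The mark sheets below are invented, and `std_dev` is a hypothetical helper that computes the population standard deviation (dividing by the number of scores), which is one common convention.

```python
import math

def std_dev(values):
    """Population standard deviation: the square root of the
    average squared distance from the mean."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

# Hypothetical mark sheets (made-up numbers).
mixed_class = [20, 35, 50, 65, 80, 95]    # very low and very high scores
steady_class = [68, 70, 72, 74, 76, 78]   # everyone fairly close together

print(f"Mixed class standard deviation:  {std_dev(mixed_class):.2f}")
print(f"Steady class standard deviation: {std_dev(steady_class):.2f}")
```

The mixed class should come out with a much larger standard deviation than the steady one, which matches the intuition about spread.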
Variance is another measure of how far the data is spread from the mean value. It is mathematically defined as the average of the squared differences from the mean. Equivalently, it is the expectation of the squared deviation of a random variable from its mean. The standard deviation is simply the square root of the variance.
When I calculate variance, I start by subtracting the mean from each and every score. But the scores below the mean give negative differences, and these exactly cancel the positive differences from the scores above the mean, so a plain average of the differences is always zero. To overcome this, we square the differences to make them positive and then find the average of those squared values. This gives the variance, which says how far my scores are from the mean. If the variance is low, the scores are concentrated near the mean; otherwise they are dispersed.
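The cancelling and squaring steps can be followed one at a time in a short sketch. The scores are made up, and the variance here is the population variance (the plain average of the squared differences).

```python
# Hypothetical scores (made-up numbers for illustration).
scores = [40, 55, 60, 70, 75]

mean = sum(scores) / len(scores)

# Step 1: difference of each score from the mean.
deviations = [s - mean for s in scores]

# The negative and positive differences cancel out.
print("Sum of differences:", sum(deviations))

# Step 2: square the differences so they are all non-negative.
squared = [d ** 2 for d in deviations]

# Step 3: the average of the squared differences is the variance.
variance = sum(squared) / len(scores)
print("Variance:", variance)
```

For these numbers the differences add up to zero, while the variance works out to 150.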
Now, my best friend and I were interested in knowing if there was any relation between the marks we scored. So, we each calculated our own mean, subtracted it from each of our scores, multiplied the two differences for each test, and averaged those products. The value we obtained was positive. My statistics teacher taught us that a positive value indicates that our scores tend to move in the same direction, although the number by itself does not say how strong the relationship is. But there was a doubt lingering with us: what if the tests had been marked out of a smaller total, so that all the scores were lower? The covariance we obtained would have been different.
Covariance is defined as a measure of the joint variability of two random variables. The magnitude of the covariance is not easy to interpret because it is not normalized and hence depends on the magnitudes of the variables. In that case we normalize this value to get a new parameter called Correlation.
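Here is a minimal sketch of that scale problem. The two score lists are invented, and `covariance` is a hypothetical helper computing the population covariance. The same marks are expressed once out of 100 and once out of 10.

```python
def covariance(xs, ys):
    """Population covariance: the average product of the paired
    differences from each variable's own mean."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical scores out of 100 on the same five tests.
my_scores = [50, 60, 70, 80, 90]
friend_scores = [55, 60, 75, 85, 95]

cov_out_of_100 = covariance(my_scores, friend_scores)

# The very same performance, but marked out of 10 instead of 100.
cov_out_of_10 = covariance([s / 10 for s in my_scores],
                           [s / 10 for s in friend_scores])

print(f"Covariance (out of 100): {cov_out_of_100:.2f}")
print(f"Covariance (out of 10):  {cov_out_of_10:.2f}")
```

Both values are positive, but one is a hundred times the other even though the relationship between the two sets of marks has not changed. That is why the raw magnitude is hard to interpret.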
In the calculation of correlation, we basically standardize the covariance to better understand its magnitude. It is done by dividing the covariance by the product of the standard deviations of the two variables. Now that we have the covariance of my best friend’s scores and mine, we have to find the standard deviation of each of our sets of scores and divide the covariance by their product.
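The division can be sketched as follows. The scores and the helper names (`covariance`, `std_dev`, `correlation`) are illustrative assumptions, and population formulas are used throughout.

```python
import math

def covariance(xs, ys):
    """Population covariance of two equally long lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)

def std_dev(values):
    """Population standard deviation."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / (std_dev(xs) * std_dev(ys))

# Hypothetical scores out of 100 (made-up numbers).
my_scores = [50, 60, 70, 80, 90]
friend_scores = [55, 60, 75, 85, 95]

r_out_of_100 = correlation(my_scores, friend_scores)

# Rescaling the marks to be out of 10 leaves the correlation unchanged.
r_out_of_10 = correlation([s / 10 for s in my_scores],
                          [s / 10 for s in friend_scores])

print(f"Correlation (out of 100): {r_out_of_100:.4f}")
print(f"Correlation (out of 10):  {r_out_of_10:.4f}")
```

Unlike covariance, the result does not depend on the marking scale, because the scale appears in both the numerator and the denominator and cancels out.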
The concept of correlation sure sounded exciting, but to actually measure it quantitatively we use an index known as the coefficient of correlation. It is a numerical index that tells us to what extent the two variables are related, and how a change in one variable goes along with a change in the other. A positive correlation indicates that an increase in one variable is accompanied by an increase in the other. The correlation values range from -1 to +1, including 0. A value of +1 indicates a perfect positive linear relationship. A value of -1 indicates a perfect negative linear relationship: every increase in one variable is matched by a proportional decrease in the other. Zero indicates that there is no linear relationship between the two variables.
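The three landmark values can be reproduced with tiny made-up data sets. The helper `pearson` and the lists below are assumptions for illustration only.

```python
import math

def pearson(xs, ys):
    """Correlation coefficient: covariance over the product of the
    standard deviations (the 1/n factors cancel, so sums are enough)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

x = [1, 2, 3, 4, 5]
rising = [2 * v + 10 for v in x]    # goes up by 2 for every unit of x
falling = [100 - 5 * v for v in x]  # goes down by 5 for every unit of x
unrelated = [2, 1, 3, 1, 2]         # no overall upward or downward trend

print(f"Rising:    {pearson(x, rising):+.2f}")
print(f"Falling:   {pearson(x, falling):+.2f}")
print(f"Unrelated: {pearson(x, unrelated):+.2f}")
```

The straight-line cases land at the two ends of the range, and the list with no linear trend lands at zero.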
Apart from its sign and strength, a correlation can also be classified as linear or curvilinear. In a linear correlation, the two variables change at a constant rate relative to each other, so the points fall along a straight line. In a curvilinear correlation, the second variable at first increases as the first variable increases. But after a certain point, a further increase in the first variable makes the second one start decreasing. Since the graphical representation of these two variables forms a curved line, it has got the name curvilinear correlation.
There are different types of correlation coefficients. They are:
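A rise-then-fall pattern can be imitated with an invented curve. The sketch below uses a hypothetical `pearson` helper and made-up numbers to show how a single correlation coefficient computed over the whole range can hide such a relationship.

```python
import math

def pearson(xs, ys):
    """Correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up curve: y rises until x = 5 and falls afterwards.
x = list(range(11))
y = [25 - (v - 5) ** 2 for v in x]

whole = pearson(x, y)                 # rise and fall cancel each other
rising_part = pearson(x[:6], y[:6])   # x from 0 to 5
falling_part = pearson(x[5:], y[5:])  # x from 5 to 10

print(f"Whole range:  {whole:+.2f}")
print(f"Rising part:  {rising_part:+.2f}")
print(f"Falling part: {falling_part:+.2f}")
```

Over the whole range the coefficient comes out at zero even though the two variables are clearly related, which is why it helps to plot the data before trusting the number.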
1.) Pearson Correlation Coefficient: It is also called the Pearson product-moment correlation coefficient, and it is used to measure the statistical relationship between two variables. It is used for data measured on an interval or ratio scale. It shows the linear relationship between two variables. The major drawback is that it treats the two variables symmetrically, so it cannot tell which one is the dependent and which the independent variable. Pearson correlation also cannot determine the slope of the line. It captures only linear relationships and is sensitive to outliers, and for those situations we have another correlation coefficient.
2.) Spearman’s Rank Correlation Coefficient: As the name suggests, it is a correlation that deals with the ranks assigned to the values of the variables. It can be used with ordinal, interval and ratio scale data. The measure works well if the data is not normally distributed or when there are outliers in the data. It is a nonparametric version of the Pearson correlation coefficient which assesses how well the relationship between two variables can be described using a monotonic function.
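The difference between the two coefficients shows up on data that always increases but not along a straight line. The sketch below is a simplified illustration: `pearson`, `ranks` and `spearman` are hypothetical helpers, the data is made up, and the ranking assumes there are no tied values.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def ranks(values):
    """Rank 1 for the smallest value, 2 for the next, and so on.
    Assumes all values are distinct (no tie handling)."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman(xs, ys):
    """Spearman's coefficient: Pearson applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Made-up data: y always increases with x, but along a curve.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

print(f"Pearson:  {pearson(x, y):.3f}")
print(f"Spearman: {spearman(x, y):.3f}")
```

Because the ranks of `x` and `y` agree perfectly, Spearman's coefficient reaches 1, while Pearson's stays below 1 since the points do not lie on a straight line.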
Causation: When we say two variables are correlated, it does not necessarily imply that one variable causes the other. This is where causation comes into play. It is a human tendency to read relationships between variables as cause and effect. But doing so without performing proper tests leads to wrong conclusions. A causal relationship between two variables can usually only be established through controlled experiments such as A/B testing, evaluated with hypothesis testing.
It’s a wrap here. But let’s go on and explore more about the testing methods in the next post.