Chapter 3 Simple Statistical Evaluation
The biggest enemy of your findings is randomness. To convince your audience that you have found something real, you need to address the question: “How do you know your result is not just sheer luck, i.e., random?”
This is where you need statistical tests for use in hypothesis testing.
3.0.0.1 Two Important Formulas:
Mean \[\begin{equation} \bar{X}=\frac{\sum{X}}{N} \ \text{where $X$ is the set of numbers and $N$ is the size of the set.} \end{equation}\]
Standard Deviation
\[\begin{equation} \sigma = \sqrt{\frac{\sum{(X - \mu)^2}}{N}}\\ \text{where $X$ is the set of numbers, $\mu$ is the mean of the set,}\\ \text{$N$ is the size of the set, and $\sigma$ is the standard deviation.} \end{equation}\]
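As a quick illustration, here is how these two formulas can be computed in R on a small made-up set of numbers. Note that R's built-in sd() divides by N - 1 (the sample standard deviation), so the population formula above is computed manually:

```r
# A small made-up set of numbers
x <- c(37, 61, 50, 77, 57)

# Mean: sum of the values divided by the size of the set (same as mean(x))
mean_x <- sum(x) / length(x)

# Population standard deviation as defined above (divides by N).
# R's built-in sd() divides by N - 1 instead.
sigma_x <- sqrt(sum((x - mean_x)^2) / length(x))

mean_x   # 56.4
sigma_x  # ~13.14
```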
3.1 Z-test
A z-test is any statistical test used in hypothesis testing in which the test statistic has an underlying normal distribution.
In other words, when the distribution of the test statistic under the null hypothesis can be approximated by a normal distribution, a z-test can be used.
The outcome of the z-test is the z-score, a numerical measure used to test the mean of a distribution. The z-score is measured in terms of standard deviations from the mean.
3.1.1 Steps for hypothesis testing using the Z-test
Running a Z-test requires 5 steps:
- State the null hypothesis and the alternate hypothesis
- Select a null hypothesis and an alternate hypothesis which will be tested using the z-test.
- Choose an Alpha \(\alpha\) level.
- Usually \(\alpha\) is selected to be small: it is the tail area of the normal distribution curve that counts as evidence against the null hypothesis, so most of the area under the curve lies outside the rejection region.
- Thus, in most statistical testing, \(\alpha = 0.05\) is selected.
- Calculate the z-test statistic.
- The z-test statistic is calculated using the z-score formula. \[\begin{equation} z = \frac{x-\mu}{\sigma}\text{ where, $z$ = z-score, $x$ = raw score, $\mu$ = mean and $\sigma$ = standard deviation } \end{equation}\]
- Calculate the p-value using the z-score
- Once we have the z-score we want to calculate the p-value from it.
- To do this, there are two ways:
- First, use the z-table available online at z-table.com.
- Second, use the pnorm() function in R to find the p-value.
- Compare the p-value with \(\alpha\)
- After getting the p-value from step 4, compare it with the \(\alpha\) level we selected in step 2.
- This decides if we can reject the null hypothesis or not.
- If the p-value obtained is lower than \(\alpha\), then we can reject the null hypothesis.
- If the p-value is more than \(\alpha\), we fail to reject the null hypothesis due to lack of significant evidence.
Some important relations between one-sided and two-sided tests in hypothesis testing are as follows:
- First, estimate the expected value \(\mu\) of T (the test statistic) under the null hypothesis, and obtain an estimate \(\sigma\) of the standard deviation of T.
- Second, determine the properties of T: one-tailed or two-tailed.
- For null hypothesis H0: \(\mu \geq \mu_0\) vs alternative hypothesis H1: \(\mu < \mu_0\), the test is lower/left-tailed (one-tailed).
- For null hypothesis H0: \(\mu \leq \mu_0\) vs alternative hypothesis H1: \(\mu > \mu_0\), the test is upper/right-tailed (one-tailed).
- For null hypothesis H0: \(\mu = \mu_0\) vs alternative hypothesis H1: \(\mu \neq \mu_0\), the test is two-tailed.
- Once you calculate the p-value with pnorm() in step 4, depending on the properties of T as described above:
- use pnorm(-Z) for right-tailed tests,
- use 2*pnorm(-Z) for two-tailed tests (or 2*pnorm(-abs(Z)), which works regardless of the sign of Z), and
- use pnorm(Z) for left-tailed tests.
- Note: here Z = the z-score. The method mentioned above works like the one studied in class/recitations, but is simpler to understand and does not require subtracting the pnorm() output from 1.
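As a minimal sketch of steps 3 through 5 in R (all numbers here are made up purely for illustration):

```r
# Hypothetical numbers, for illustration only
x     <- 52   # observed raw score
mu    <- 45   # mean under the null hypothesis
sigma <- 4    # standard deviation

# Step 3: the z-test statistic
Z <- (x - mu) / sigma

# Step 4: the p-value, following the conventions above
p_right <- pnorm(-Z)           # right-tailed test
p_left  <- pnorm(Z)            # left-tailed test
p_two   <- 2 * pnorm(-abs(Z))  # two-tailed test, safe for either sign of Z

# Step 5: compare the p-value with alpha
alpha <- 0.05
p_right < alpha  # TRUE means we can reject the null hypothesis
```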
3.1.2 Z-test Example 1 (Right Sided)
Now let's look at an example of using the z-test for hypothesis testing.
We will study an example to statistically find the relation of the traffic volume per minute between two tunnels, namely Holland and Lincoln.
|      | TUNNEL  | DAY     | VOLUME_PER_MINUTE |
|------|---------|---------|-------------------|
| 832  | Holland | weekday | 37 |
| 1290 | Holland | weekend | 61 |
| 166  | Holland | weekday | 50 |
| 1713 | Lincoln | weekday | 77 |
| 1058 | Holland | weekend | 57 |
| 2278 | Lincoln | weekday | 61 |
| 234  | Holland | weekday | 41 |
| 988  | Holland | weekday | 49 |
| 139  | Holland | weekday | 58 |
| 675  | Holland | weekday | 42 |
Thus, stating our null hypothesis and alternate hypothesis:
- Null hypothesis H0: Traffic in the Lincoln tunnel is the same as traffic in the Holland tunnel.
- Alternate hypothesis H1: Traffic in the Lincoln tunnel is higher than traffic in the Holland tunnel.
Once we have stated our hypotheses, let's see the z-test in practice.
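A minimal sketch of how this right-tailed test might be run in R, assuming Traffic.csv is loaded into a data frame named traffic with the column names shown in the table above (the file and variable names are assumptions), using one common form of the two-sample z statistic:

```r
# Assumed file and column names, based on the table above
traffic <- read.csv("Traffic.csv")

holland <- traffic$VOLUME_PER_MINUTE[traffic$TUNNEL == "Holland"]
lincoln <- traffic$VOLUME_PER_MINUTE[traffic$TUNNEL == "Lincoln"]

# Two-sample z statistic for the difference of means
Z <- (mean(lincoln) - mean(holland)) /
  sqrt(var(lincoln) / length(lincoln) + var(holland) / length(holland))

# H1 says Lincoln traffic is HIGHER, so this is a right-tailed test
p_value <- pnorm(-Z)
p_value
```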
We can see that the p-value obtained is near 0, which is less than 0.05.
Hence, we reject the null hypothesis and conclude with a high degree of certainty that traffic in the Lincoln tunnel is higher than traffic in the Holland tunnel.
3.1.3 Z-test Example 2 (Left Sided)
Now let's look at another example of using the z-test for hypothesis testing.
We will study an example to statistically find the relation between the capital gains of people with two zodiac signs, namely Aquarius and Libra.
|       | AGE | STATUS | EDUCATION | YEARS | PROFESSION | CAPITALGAINS | CAPITALLOSS | NATIVE | ZODIAK |
|-------|-----|--------|-----------|-------|------------|--------------|-------------|--------|--------|
| 10906 | 34 | Private | 10th | 6 | Craft-repair | 0 | 0 | United-States | Capricorn |
| 24608 | 52 | Self-emp-inc | Some-college | 10 | Sales | 0 | 0 | United-States | Taurus |
| 11463 | 34 | Private | Bachelors | 13 | Prof-specialty | 0 | 0 | United-States | Taurus |
| 24993 | 53 | Private | HS-grad | 9 | Craft-repair | 0 | 0 | United-States | Sagittarius |
| 19318 | 44 | Private | Bachelors | 13 | Craft-repair | 0 | 0 | United-States | Leo |
| 2714 | 24 | Private | Some-college | 10 | Exec-managerial | 0 | 0 | United-States | Leo |
| 8041 | 30 | Private | Some-college | 10 | Craft-repair | 249330 | 0 | United-States | Taurus |
| 11112 | 34 | State-gov | Bachelors | 13 | Prof-specialty | 0 | 11276 | United-States | Leo |
| 2455 | 24 | Private | Bachelors | 13 | Sales | 0 | 0 | United-States | Cancer |
| 7293 | 30 | Private | HS-grad | 9 | Machine-op-inspct | 5013 | 0 | United-States | Aquarius |
Now stating our null hypothesis and alternate hypothesis:
- Null hypothesis H0: Capital gains of people with the Aquarius zodiac sign are the same as those of people with the Libra zodiac sign.
- Alternate hypothesis H1: Capital gains of people with the Aquarius zodiac sign are lower than those of people with the Libra zodiac sign.
Once we have stated our hypotheses, let's see the z-test in practice.
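A sketch analogous to Example 1, assuming the data is loaded into a data frame named census with the column names shown above (the name census is an assumption); since H1 says Aquarius gains are lower, this is a left-tailed test:

```r
# Assumed data frame and column names, based on the table above
aquarius <- census$CAPITALGAINS[census$ZODIAK == "Aquarius"]
libra    <- census$CAPITALGAINS[census$ZODIAK == "Libra"]

Z <- (mean(aquarius) - mean(libra)) /
  sqrt(var(aquarius) / length(aquarius) + var(libra) / length(libra))

# Left-tailed p-value, per the convention in Section 3.1.1
p_value <- pnorm(Z)
p_value
```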
We can see that the p-value obtained is less than 0.05.
Hence, we reject the null hypothesis and conclude with a high degree of certainty that capital gains of people with the Aquarius zodiac sign are lower than those of people with the Libra zodiac sign.
3.1.4 Z-test Example 3 (Two Tailed)
We will study an example to statistically find the relation between the capital gains of people from two countries, namely the United States and Colombia.
Now stating our null hypothesis and alternate hypothesis:
- Null hypothesis H0: Capital gains of people of the United States are the same as those of people of Colombia.
- Alternate hypothesis H1: Capital gains of people of the United States are not equal to those of people of Colombia.
Once we have stated our hypotheses, let's see the z-test in practice.
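A sketch of the two-tailed version, reusing the assumed census data frame from Example 2; here both tails count as evidence against H0:

```r
# Assumed data frame and column names, as in Example 2
us       <- census$CAPITALGAINS[census$NATIVE == "United-States"]
colombia <- census$CAPITALGAINS[census$NATIVE == "Colombia"]

Z <- (mean(us) - mean(colombia)) /
  sqrt(var(us) / length(us) + var(colombia) / length(colombia))

# Two-tailed p-value, per the convention in Section 3.1.1
p_value <- 2 * pnorm(-abs(Z))
p_value
```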
We can see that the p-value obtained is less than 0.05.
Hence, we reject the null hypothesis and conclude with a high degree of certainty that capital gains of people of the United States are not equal to those of people of Colombia.
3.2 Permutation Test
The permutation test allows us to observe randomness directly, with the naked eye, without the lenses of statistical tests such as the z-test. We shuffle the data randomly, like a deck of cards. There may be many such shuffles: 10,000, 100,000, etc. The goal is always to see how often we can obtain the observed difference of means (since we are testing either a one-sided or a two-sided hypothesis) by purely random shuffles of our data. These permutations (shuffles) destroy all relationships which may pre-exist in our data. We are hoping to show that our observed difference of means can be obtained only very rarely in a completely random fashion. We then have “experimentally” shown that our result is unlikely to occur randomly under the null hypothesis, and we can reject the null hypothesis.
The less often our result appears in the histogram of permutation test results, the better the news for our alternative hypothesis.
What is surprising to many newcomers is that the permutation test will give different p-values (not dramatically different, but still different) on each run. This is because the permutation test is itself random. It is not like the z-test, which gives the same result when run again on the same hypothesis and the same data set. Also, the p-value computed by a permutation test will in general differ from the p-value computed by a z-test. Not very different, but different. Again, this is because the permutation test provides only an approximation of the p-value. The great advantage of the permutation test is that it is universal and robust. One can test relationships between two variables other than just the difference of means. For example, we can use a permutation test to validate whether traffic in the Lincoln tunnel is more than twice the traffic in the Holland tunnel, or even assign different weights to different days of the week.
3.2.1 Permutation Test One Step
The one-step permutation test is the most direct way to see randomness up close. The one-step permutation function performs one single data shuffle. Shuffling the data destroys the associations which exist between values of the data frame. This makes the data frame random.
You can execute the one-step permutation multiple times. This shows how the data frame varies and how that affects the observed difference of means.
Apply the one-step permutation function multiple times first, before you move to the proper permutation test function. One of the parameters of the permutation test function specifies the number of “shuffles” which will be performed. This can be a very large number: 10,000 or even 100,000. The purpose of making so many random permutations is to test how often the observed difference of means can arise in purely random data. The more often this happens, the more likely your observation is just random. To reject the null hypothesis you need to show that the observed difference of means comes up very infrequently in the permutation test; less than 5% of the time, to be exact.
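A minimal sketch of a one-step permutation in R, assuming the traffic data frame and column names from Section 3.1.2 (all names are assumptions):

```r
# One single shuffle of the traffic data
one_step_permutation <- function(df) {
  # Shuffling VOLUME_PER_MINUTE destroys any association
  # between tunnel and traffic volume
  df$VOLUME_PER_MINUTE <- sample(df$VOLUME_PER_MINUTE)

  # Difference of means after the shuffle
  mean(df$VOLUME_PER_MINUTE[df$TUNNEL == "Lincoln"]) -
    mean(df$VOLUME_PER_MINUTE[df$TUNNEL == "Holland"])
}

# Run it several times: each call returns a different, purely random difference
one_step_permutation(traffic)
one_step_permutation(traffic)
```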
3.2.2 Permutation Function
The permutation function runs multiple iterations of the one-step permutation studied above, to get a complete picture of the relationship between the components involved in a hypothesis.
Here you can run the permutation test on the Traffic.csv dataset, on the volume of traffic in the Holland and Lincoln tunnels.
Note: You can find the permutation function code here: Permutation()
NOTE: The red line in the output plots of the permutation test function is not the p-value; it is just the difference of the means of the two categories under test.
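The linked Permutation() function is the authoritative version; the following is only a minimal sketch of the same idea, with hypothetical names, to make the mechanics concrete:

```r
# Sketch of a permutation test (see the linked Permutation() for the real one)
permutation_test <- function(values, groups, g1, g2, n = 10000) {
  observed <- mean(values[groups == g1]) - mean(values[groups == g2])

  # n random shuffles; each one destroys the value/group association
  shuffled <- replicate(n, {
    v <- sample(values)
    mean(v[groups == g1]) - mean(v[groups == g2])
  })

  # Plot the shuffled differences; the red line marks the observed difference
  hist(shuffled, main = "Permutation test results")
  abline(v = observed, col = "red")

  # Approximate right-tailed p-value: how often does pure chance produce
  # a difference at least as large as the observed one?
  mean(shuffled >= observed)
}

# Example call, with the assumed traffic data frame from Section 3.1.2
permutation_test(traffic$VOLUME_PER_MINUTE, traffic$TUNNEL,
                 "Lincoln", "Holland")
```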
3.2.3 Exercise - How p-value is affected by difference of means and standard deviations
Here, you can generate your own data by changing the parameters of the rnorm() function.
See how changing the mean and sd in the rnorm() distributions affects the p-value! You can do it directly in the code and observe the results immediately; it is very revealing. Think of Val1 and Val2 as traffic volumes in the Holland and Lincoln tunnels, respectively. The larger the difference between the means passed to rnorm(), the smaller the p-value, since it becomes less and less likely that the observed difference of means would come up frequently under the random shuffles of the permutation function.
Now keep the same means and change the variances. See how changing the variances in rnorm() affects the p-value, and try to explain the effect that standard deviations have on it. In general, the higher the standard deviation, the more widely the data is spread around the mean. Thus, even for the same two means, a larger standard deviation leads to a higher p-value: we are less certain of the role of the “mean” when the standard deviation is higher, so the chance of randomly obtaining the observed result is higher.
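A sketch of this experiment, reusing the permutation_test() sketch from Section 3.2.2 (Val1, Val2, and all parameter values are just illustrative choices to play with):

```r
set.seed(1)  # for a reproducible run

# Two synthetic "traffic volume" samples: tweak the means and sds and re-run
Val1 <- rnorm(100, mean = 45, sd = 5)  # think: Holland
Val2 <- rnorm(100, mean = 50, sd = 5)  # think: Lincoln

# Widen the gap between the means, or raise the sds, and watch the p-value move
permutation_test(c(Val1, Val2),
                 rep(c("Holland", "Lincoln"), each = 100),
                 "Lincoln", "Holland")
```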
3.3 Multiple Hypothesis - Bonferroni Correction.
When dealing with a dataset with a large number of dimensions, it is possible to get a lot of amazing and interesting insights and conclusions from it.
But, unfortunately, some of the data included in such a large dataset might be junk.
We can make multiple assumptions from such data. But in doing so, we may pick up useless data/patterns that hamper our results and lead us into the pitfall of believing hypotheses that are not actually true.
This is common when performing multiple hypothesis testing.
Multiple hypothesis testing refers to any instance that involves the simultaneous testing of more than one hypothesis.
Let’s consider the example of Traffic dataset.
- We were given two tunnels, “Holland” and “Lincoln”, but what if we were given all the tunnels in the US?
- We could make a lot of hypotheses in that case.
- And for each set of hypotheses, would you still consider \(\alpha = 0.05\) as the cut-off for the p-value?
It may seem to be a good idea to just go and check the p-value for any set of hypotheses with the cut-off value of \(\alpha\) as 0.05.
But this might not give you the correct answer always.
If you have 100 different hypotheses to consider in the data, then the probability of getting at least one significant result with \(\alpha = 0.05\) will be \[P(\text{at least one significant result}) = 1- (1-0.05)^{100} \approx 0.99\]
This means that if we keep 0.05 as our cut-off value, the probability of getting at least one significant result is about 99%: we are almost guaranteed some “significant” findings that are due purely to chance, which clearly does not give us a proper idea about our hypotheses.
Methods for dealing with multiple testing frequently call for adjusting \(\alpha\) in some way, so that the probability of observing at least one significant result due to chance remains below your desired significance level.
One such method for adjusting \(\alpha\) is BONFERRONI CORRECTION!
The Bonferroni correction sets the significance cut-off at \(\alpha / N\) where N is the number of possible hypotheses.
For example, in the example above, with 100 tests and \(\alpha = 0.05\), you’d only reject a null hypothesis if the p-value is less than \(\alpha/N = 0.05/100 = 0.0005\)
Thus, the value of \(\alpha\) after Bonferroni correction would be \(0.0005\).
Again, let’s calculate the probability of observing at least one significant result when using the correction just described:
\[P(\text{at least one significant result}) = 1 - P(\text{no significant results}) \\ = 1 - (1 - 0.0005)^{100} \approx 0.048\]
This gives us 4.8% probability of getting at least one significant result.
As we can see this value of probability using Bonferroni correction is much better than the 99% which we saw before when we did not use correction for performing multiple hypothesis testing.
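Both probabilities are easy to verify directly in R:

```r
# 100 independent tests at the uncorrected alpha = 0.05
1 - (1 - 0.05)^100     # ~0.994: at least one "significant" result is near certain

# 100 independent tests at the Bonferroni-corrected alpha = 0.05/100 = 0.0005
1 - (1 - 0.0005)^100   # ~0.048: back below the 5% level
```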
But there are some downsides to using the Bonferroni correction too. (Although for the scope of this course the Bonferroni correction works fine.)
- The Bonferroni correction tends to be a bit too conservative.
- Also, we benefit here from assuming that all tests are independent of each other. In practical applications, that is often not the case.
- Depending on the correlation structure of the tests, the Bonferroni correction can be extremely conservative, leading to a high rate of false negatives.
3.3.1 Examples for Multiple hypothesis testing.
Let's consider the Happiness dataset as an example.

|      | IDN   | AGE | COUNTRY             | GENDER | IMMIGRANT | INCOME | HAPPINESS |
|------|-------|-----|---------------------|--------|-----------|--------|-----------|
| 3308 | 72704 | 37 | Kazakhstan | Female | 0 | 88201 | 7.42 |
| 4600 | 76777 | 56 | Trinidad and Tobago | Male | 1 | 93598 | 4.95 |
| 1282 | 31315 | 53 | Sweden | Male | 1 | 49034 | 7.83 |
| 5122 | 84009 | 20 | Colombia | Male | 0 | 100414 | 8.26 |
| 2433 | 42920 | 33 | Iraq | Female | 0 | 57674 | 4.36 |
| 4308 | 13675 | 54 | Zambia | Male | 1 | 28754 | 3.28 |
| 4071 | 51005 | 43 | Thailand | Male | 0 | 67206 | 5.47 |
| 3927 | 38442 | 30 | Morocco | Male | 0 | 53284 | 4.33 |
| 6382 | 20492 | 50 | Turkey | Female | 0 | 35729 | 2.51 |
| 287 | 55146 | 63 | Suriname | Female | 0 | 71402 | 5.67 |
There are 156 unique countries in the dataset.
This can be checked using the unique() function: unique(indiv_happiness$country)
Since there are 156 distinct countries, we have \({{n}\choose{2}} = {156\choose2}=(156 * 155)/2 = 12090\) different hypotheses. Let’s call this value N.
Using this N, the p-value cutoff after Bonferroni correction will be \(\alpha = 0.05 / 12090 \approx 4.14 \times 10^{-6}\).
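This cutoff is straightforward to compute in R:

```r
N <- choose(156, 2)  # 12090 pairwise country comparisons
0.05 / N             # Bonferroni-corrected alpha, ~4.14e-06
```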
3.3.1.1 Example 1
Let’s calculate the P-value for the following hypotheses from the dataset.
- Our hypothesis: People from Canada are happier than people from Iceland.
- Null hypothesis: There is no difference in happiness levels of people from Canada and people from Iceland.
In this case, after applying the Bonferroni correction we get \(\alpha = 0.05/12090 \approx 4.14 \times 10^{-6}\). Here, we get a p-value of 0.25, which is much higher than our \(\alpha\). Based on this, we fail to reject our null hypothesis.
3.3.1.2 Example 2
Let’s consider the following hypotheses from the dataset.
- Our hypothesis: People from Italy are happier than people from Afghanistan.
- Null hypothesis: There is no difference in happiness levels of people from Italy and people from Afghanistan.
In this case, after applying the Bonferroni correction we get \(\alpha = 0.05/12090 \approx 4.14 \times 10^{-6}\).
Here, we get a p-value of 0.00364, which is lower than the default p-value cutoff of \(\alpha = 0.05\), but higher than our Bonferroni-corrected cutoff.
So, based on the results, we fail to reject our null hypothesis even though the obtained p-value is less than 0.05.