**Monkey or Human: Hypothesis Testing and Power Analysis**

Welcome to my game show, called “Monkey or Human”. The rule is simple: one hundred four-option multiple-choice questions (MCQs) on general knowledge have been answered by either a monkey or a human (assuming both of them know how to do MCQs), and the number of questions answered correctly is shown on a screen. Your task is to guess, based on that number, whether the MCQs were done by the monkey or the human.

**The Basics: Hypothesis Testing**

Sounds easy, doesn’t it? But how can you keep the chance of incorrectly guessing “human” below 5%? Fortunately, you are a quantitative researcher of some sort (that’s why you are reading this, right?). You do not know the probability distribution of the human’s score, but you DO know the distribution of the monkey’s. Why? Because the monkey knows NOTHING and will simply answer at random, so it has a 25% chance of getting each question right, and its score follows a binomial distribution with n = 100 and p = .25. The monkey is very unlikely to score high when it knows nothing; after some calculation, you find there is a less than 5% chance that it correctly answers more than 32 questions (if you want to know where the number comes from, search “binomial n=100 p=.25 p-value=.05” in WolframAlpha). Therefore, if the number on the screen is more than 32, you reject the idea that the monkey did the MCQs and conclude it was the human.
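For readers who would rather compute the cutoff than look it up, here is a minimal sketch in plain Python. It assumes the monkey’s score is Binomial(100, .25), as described above; the function names `binom_pmf` and `critical_value` are made up for this illustration and do not come from any library.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) when X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def critical_value(n=100, p_monkey=0.25, alpha=0.05):
    """Return the cutoff c such that P(score > c | monkey) < alpha.

    We add up the monkey's upper tail from the top score downwards and stop
    at the first score k where P(score >= k) reaches alpha. Guessing "human"
    only for scores strictly above that k keeps the false-alarm rate under alpha.
    """
    tail = 0.0
    for k in range(n, -1, -1):
        tail += binom_pmf(k, n, p_monkey)  # tail is now P(score >= k)
        if tail >= alpha:
            return k
    return -1  # only reachable if alpha is larger than 1

cutoff = critical_value()
# Chance that the monkey scores above the cutoff (the actual false-alarm rate)
false_alarm = sum(binom_pmf(k, 100, 0.25) for k in range(cutoff + 1, 101))
print("guess 'human' if the score is more than", cutoff)
print("chance the monkey scores that high:", round(false_alarm, 4))
```

With the default settings this should report a cutoff of 32, matching the number in the text. Because scores are whole numbers, the actual false-alarm rate ends up somewhat below 5% rather than exactly at it.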

Is it still possible that the monkey did it when the number is high? Yes, but a monkey would score that high less than 5% of the time. You may ask, why 5%? Isn’t lower better? Can the criterion be 1% or less? Well, the number is arbitrary. It was created by Fisher, one of the producers of the game, and many use it because he said so (Lehmann, 1993)^{1}. However, 5% is good enough, because the lower the criterion, the more likely we are to guess “monkey” when it was actually the human. Likewise, can the human score lower than the cutoff? That is still possible, but we do not know the exact probability. We can, however, estimate it from his or her intelligence: the higher the IQ, the lower the probability of scoring below the cutoff.

OK, let’s recap what I said and apply it to statistical testing. Each MCQ represents a participant in a study, and we use them to test hypotheses. To reduce error, the questions (or the participants) should not be biased, which is what randomization is for. Monkey and human are the two types of hypothesis: null and alternative. If you look at the graph in G*Power, the red line illustrates the distribution under the null hypothesis (monkey), and the blue line shows the distribution under the alternative hypothesis (human). The probability of incorrectly guessing “human” is called the type I error rate (alpha, or false alarm), while the probability of incorrectly guessing “monkey” is called the type II error rate (beta, or miss). We test a hypothesis by setting a significance criterion, usually alpha = .05, and reject the null hypothesis when the result is higher or lower than the cutoff point, depending on the direction of the test. The probability of CORRECTLY rejecting the null hypothesis (guessing “human” when in fact it was the human) is the statistical power, and the formula is 1 minus the miss rate (i.e. 1 − beta, where beta is the chance of missing the opportunity to correctly guess “human”). With the sample size held fixed, power mostly depends on effect size, which is analogous to the human’s intelligence. Therefore, the higher the IQ, the higher the power.
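To make “power = 1 − beta” concrete, the sketch below keeps the game at 100 questions with the cutoff of 32 and asks how often different humans would beat that cutoff. The per-question accuracies (.30, .40, .50) are invented stand-ins for effect size, not values from any study, and `power` is a hypothetical helper name.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) when X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def power(p_human, n=100, cutoff=32):
    """Chance that a contestant with per-question accuracy p_human
    scores more than `cutoff`, i.e. that we correctly guess "human"."""
    return sum(binom_pmf(k, n, p_human) for k in range(cutoff + 1, n + 1))

# Assumed accuracies for illustration only; 0.25 is the monkey itself.
for p in (0.25, 0.30, 0.40, 0.50):
    pw = power(p)
    print(f"accuracy {p:.2f}: power = {pw:.3f}, miss rate (beta) = {1 - pw:.3f}")
```

At an accuracy of .25 the “human” is indistinguishable from the monkey, so the power is just the false-alarm rate. As the assumed accuracy rises, the blue curve moves away from the red one and the power climbs towards 1.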

**Calculating Sample Size: Power Analysis**

Because you are a regular contestant, you now gain an advantage: deciding the number of questions. Before the game starts, you can decide how many MCQs there are, and the score will be displayed after they have been done, as usual. But how can you use this to your advantage? As I mentioned earlier, you do not know the probability distribution for the human, BUT you can estimate their IQ and construct a plausible distribution. After deciding on the power you want (see above), you can choose the number of MCQs based on your expectation of the human’s intelligence. If you think she is a genius (large effect size), then three questions are enough (small sample size), because she will answer them all correctly, while the monkey will almost certainly get some wrong. But if you expect a 5-year-old (small effect size), then you need tons of questions (large sample size), in the hope that he will get enough of them right to score slightly higher than the monkey. If that is the case, though, you should not play the game anymore, because it may be a fraud (or, in research terms, the result has no practical use even if it is significant).
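The same idea can be sketched as a small search: for each possible number of questions, find the cutoff that keeps false alarms under alpha, compute the power against an assumed human accuracy, and stop at the first number of questions that reaches the target power. The 80% target, the accuracies, and the names `questions_needed` and `max_n` are assumptions made for this illustration; G*Power uses its own routines, and this is not a reproduction of them.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) when X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def critical_value(n, p_monkey, alpha):
    """Cutoff c with P(score > c | monkey) < alpha (may equal n, meaning
    that no score is high enough to guess "human")."""
    tail = 0.0
    for k in range(n, -1, -1):
        tail += binom_pmf(k, n, p_monkey)  # tail is now P(score >= k)
        if tail >= alpha:
            return k
    return -1

def power(n, p_human, p_monkey=0.25, alpha=0.05):
    """Chance that the human scores above the cutoff with n questions."""
    cutoff = critical_value(n, p_monkey, alpha)
    return sum(binom_pmf(k, n, p_human) for k in range(cutoff + 1, n + 1))

def questions_needed(p_human, target_power=0.80, max_n=400):
    """First n whose power reaches the target, or None if max_n is too small.
    max_n is an arbitrary cap that keeps this simple formula within float range."""
    for n in range(1, max_n + 1):
        if power(n, p_human) >= target_power:
            return n
    return None

# Assumed accuracies: a genius (1.00), an average adult (0.50), a weak contestant (0.35)
for p in (1.00, 0.50, 0.35):
    print(f"accuracy {p:.2f}: questions needed = {questions_needed(p)}")
```

For the genius the search stops at three questions, because a monkey gets three out of three right only once in 64 games. One caveat: since scores are whole numbers, power does not rise perfectly smoothly with the number of questions, so “the first n that reaches the target” is a convenient rule of thumb, not the only defensible answer.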

But still, how do you estimate their intelligence? You can do some research on similar games played before and use those as an anchor. Maybe the human seemed smart in those games, so you use fewer questions. Alternatively, you can make the estimate yourself. Generally, you do not expect a genius, because geniuses are rare in real life, and planning for one means you can miss people with lower but normal intelligence. But you do not expect a child either, because it is impractical to have that many questions, and children are not much different from monkeys anyway (I mean, have you seen children in daycare?). Hence, we often assume average intelligence and choose a medium effect size.

**Conclusion**

The point of the game: sample size, effect size, and power are interrelated. If you hold any one of them fixed, changing a second will raise or lower the third. The analogy aims to clarify the process behind most statistical analysis and hypothesis testing. Hopefully, students and amateur quantitative researchers will understand more about the meaning of the p-value and the operation of G*Power, instead of rigidly following the procedures they learned.

Note:

- Actually, Fisher only created the tables for the calculation of p-values, which influenced Pearson, another producer of the game, to adopt and propose fixed-level testing. Interestingly, Fisher in fact criticized the use of an absolute cutoff, as he stated “no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypothesis” (Fisher, 1973, as cited in Lehmann, 1993).

**References**

Lehmann, E. L. (1993). The Fisher, Neyman-Pearson theories of testing hypotheses: One theory or two? *Journal of the American Statistical Association*, *88*(424), 1242-1249. doi: 10.2307/2291263

Author information:

*Jordan Oh (Veng Thang) is a third-year psychology student at HELP. He studied and has experience in Education (Teaching Chinese as a Second Language), and is now a member of Peer Mentors and a PAL (Peer Assisted Learning) tutor in quantitative research and cognitive psychology. His interest is in soft sciences like statistics and psychology, especially how people acquire knowledge and anxiety issues in academic settings, which is why he loves the course. Also, he is gay.*