McNemar test for deciding on the significance level when comparing two classification systems

Consider the following situation: you have implemented two classification or tagging systems, S1 and S2, and you want to compare them to decide whether one is better than the other. I assume that you do so by testing both systems on the same dataset U.
I further assume that the errors produced by a single system are independent of one another. In other words, the fact that S1 has made an error on item u_i does not influence the probability that S1 makes an error on item u_{i+1}. [Gillick89] develops another solution for the case where this assumption does not hold.

However, the errors produced by S1 and S2 on U are not independent of each other, since both systems are tested on the same items. This prevents us from using the classical Gaussian approximation to compute the confidence intervals of the two classification accuracies.

Let us consider the null hypothesis H0: S1 and S2 have the same accuracy.

The main idea introduced in [Gillick89] is to transform H0 into: "Given that u_i is incorrectly classified by one and only one system, the probability that S1 (or S2) classifies u_i correctly is 1/2."
Intuitively, one cannot compare S1 and S2 based on the items that both systems classify correctly (or both classify incorrectly), but only on the items where exactly one of the two systems is correct. Let K denote the number of such items.
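
As an illustration of how the items that matter can be counted, here is a minimal Python sketch. The function name discordant_counts, the names n10 and n01, and the toy labels are assumptions made for this example; they do not come from [Gillick89].

```python
def discordant_counts(gold, pred1, pred2):
    """Count the items on which exactly one of the two systems is correct.

    Returns (n10, n01), where
      n10 = items classified correctly by S1 and incorrectly by S2,
      n01 = items classified correctly by S2 and incorrectly by S1.
    (These names are hypothetical, chosen for this sketch.)
    """
    if not (len(gold) == len(pred1) == len(pred2)):
        raise ValueError("all three sequences must have the same length")
    n10 = n01 = 0
    for g, a, b in zip(gold, pred1, pred2):
        ok1, ok2 = (a == g), (b == g)
        if ok1 and not ok2:
            n10 += 1
        elif ok2 and not ok1:
            n01 += 1
        # Items where both systems are right, or both are wrong, are ignored.
    return n10, n01


# Toy data: reference labels and the outputs of S1 and S2 on the same items.
gold  = ["N", "V", "N", "A", "V", "N"]
pred1 = ["N", "V", "A", "A", "V", "V"]
pred2 = ["N", "N", "N", "A", "N", "V"]

n10, n01 = discordant_counts(gold, pred1, pred2)
K = n10 + n01
print(n10, n01, K)  # prints: 2 1 3
```

Note that the last item, which both systems get wrong, does not contribute to K.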

This hypothesis is tested with McNemar's test, using either the binomial distribution when K is small or the chi-squared approximation when K is large. In the following, we assume K is small.
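
For reference, the two variants can be written as follows. The notation n_{10} and n_{01} is an assumption introduced here, and the exact p-value is given in its usual two-sided form; some references also use a continuity-corrected chi-squared statistic, with (|n_{10} - n_{01}| - 1)^2 in the numerator.

```latex
% Assumed notation (not in the original text):
%   n_{10} = number of items classified correctly by S1 and incorrectly by S2
%   n_{01} = number of items classified correctly by S2 and incorrectly by S1
%   K      = n_{10} + n_{01}
% Under H0, n_{10} follows a Binomial(K, 1/2) distribution.

% Exact two-sided p-value (small K):
\[
  p = \min\!\left(1,\; 2 \sum_{i=0}^{\min(n_{10},\, n_{01})}
        \binom{K}{i} \left(\tfrac{1}{2}\right)^{K}\right)
\]

% Chi-squared approximation (large K), one degree of freedom:
\[
  \chi^2 = \frac{(n_{10} - n_{01})^2}{n_{10} + n_{01}}
\]
```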








McNemar significance test (requires Java 1.6)


The following applet computes the p-value of the exact McNemar test, based on the binomial distribution. If the p-value is smaller than 0.05, you may conclude that the two systems perform differently. The computation is done in Java with BigDecimal and BigInteger, so there are no precision errors due to the limited size of integers.

To use it:
(1) enter the number of items correctly classified by S1 and incorrectly classified by S2 in the left text field
(2) enter the number of items correctly classified by S2 and incorrectly classified by S1 in the right text field
(3) push the "compute" button

The p-value will be shown at the bottom.
Note: the left and right fields must not be equal; this is a special case where, by definition, p = 1.0.
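
For readers who cannot run the applet, the same kind of computation can be sketched in a few lines of Python. This is not the applet's Java code: the function name mcnemar_exact_p and the two-sided convention are assumptions made for this sketch. Python integers and Fraction are exact, which plays the role that BigInteger and BigDecimal play in Java.

```python
from fractions import Fraction
from math import comb  # available in Python 3.8 and later


def mcnemar_exact_p(n10, n01):
    """Two-sided exact McNemar p-value, returned as an exact Fraction.

    n10: items classified correctly by S1 and incorrectly by S2 (left field)
    n01: items classified correctly by S2 and incorrectly by S1 (right field)
    (Hypothetical helper; the two-sided convention is an assumption.)
    """
    if n10 < 0 or n01 < 0:
        raise ValueError("counts must be non-negative")
    k = n10 + n01
    # Under H0 the smaller count is the lower tail of a Binomial(k, 1/2).
    tail = sum(comb(k, i) for i in range(min(n10, n01) + 1))
    # Double the tail for a two-sided test and cap the result at 1.
    return min(Fraction(1), Fraction(2 * tail, 2 ** k))


print(float(mcnemar_exact_p(2, 8)))  # prints: 0.109375
print(float(mcnemar_exact_p(5, 5)))  # prints: 1.0
```

With equal counts, doubling the tail exceeds 1, so the cap returns p = 1, which matches the special case mentioned in the note above.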




References:
[Gillick89]: L. Gillick and S. Cox, "Some statistical issues in the comparison of speech recognition algorithms", Proc. ICASSP, 1989
Contact

Christophe Cerisara