Suppose we ran an A/B test with two different versions of a web page, $a$ and $b$, for which we count the number of visitors and whether they convert or not. We can summarize this in a contingency table showing the frequency distribution of the events:

| | not converted ($f$) | converted ($t$) |
|---|---|---|
| $a$ | 4514 | 486 |
| $b$ | 4473 | 527 |

It is trivial to compute the conversion rate of each version: $486/5000 = 9.72\%$ for $a$ and $527/5000 = 10.54\%$ for $b$. With such a relatively small difference, however, can we convincingly say that version $b$ converts better? To test the statistical significance of a result like this, a hypothesis test can be used.
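The conversion rates can be checked with a few lines of NumPy. This is a minimal sketch; the variable names (`observed`, `visitors`, `conversion_rate`) are illustrative and not taken from the original article.

```python
import numpy as np

# Observed counts from the contingency table.
# Rows: version a, version b. Columns: not converted, converted.
observed = np.array([[4514, 486],
                     [4473, 527]])

visitors = observed.sum(axis=1)              # visitors per version: 5000 each
conversion_rate = observed[:, 1] / visitors  # converted / visitors

for version, rate in zip("ab", conversion_rate):
    print(f"version {version}: {rate:.2%}")
# version a: 9.72%
# version b: 10.54%
```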
Background
An appropriate hypothesis test here is Pearson’s chi-squared test. There are two types of chi-squared test, the goodness-of-fit test and the test of independence, and it is the latter that is useful for the case in question. The reason a test of “independence” is applicable becomes clear when we convert the contingency table into a probability matrix by dividing each element by the grand total of frequencies:

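| | not converted ($f$) | converted ($t$) |
|---|---|---|
| $a$ | $4514/10000 = 0.4514$ | $486/10000 = 0.0486$ |
| $b$ | $4473/10000 = 0.4473$ | $527/10000 = 0.0527$ |
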
A table like this is sometimes called a correspondence matrix. Here, the table consists of joint probabilities $P(V = i, C = j)$, where $V$ is the version of the web page ($a$ or $b$) and $C$ is the conversion result ($f$ for not converted or $t$ for converted).
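As a quick illustration, the probability matrix can be obtained by dividing the counts by their grand total. The names `observed`, `N` and `joint` are assumptions made for this sketch.

```python
import numpy as np

# Rows: version a, version b. Columns: not converted, converted.
observed = np.array([[4514, 486],
                     [4473, 527]])

N = observed.sum()    # grand total of all frequencies
joint = observed / N  # joint probabilities of (version, conversion result)

print(N)      # 10000
print(joint)  # entries 0.4514, 0.0486, 0.4473, 0.0527
```

The four entries sum to one, as joint probabilities over all outcomes should.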
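Now, our interest is whether the conversion result $C$ depends on the page version $V$, and if it does, to learn which version converts better. In probability theory, the events $V = i$ and $C = j$ are said to be independent if the joint probability can be computed by $P(V = i, C = j) = P(V = i)\,P(C = j)$, where $P(V = i)$ and $P(C = j)$ are the marginal probabilities of the page version and the conversion result, respectively. It is straightforward to compute the marginal probabilities from the row and column marginals:

$$P(V = i) = \frac{\sum_j O_{ij}}{N}, \qquad P(C = j) = \frac{\sum_i O_{ij}}{N},$$

where $O_{ij}$ is the observed frequency in row $i$ and column $j$ of the contingency table and $N$ is the grand total of all the elements. The null hypothesis (i.e., a neutral hypothesis in which the effect we seek is absent) is that $V$ and $C$ are independent, in which case the elements of the matrix are equivalent to

| | not converted ($f$) | converted ($t$) |
|---|---|---|
| $a$ | $P(V = a)\,P(C = f)$ | $P(V = a)\,P(C = t)$ |
| $b$ | $P(V = b)\,P(C = f)$ | $P(V = b)\,P(C = t)$ |
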
The conversion is said to be dependent on the version of the web page if this null hypothesis is rejected. Hence rejecting the null hypothesis means that one version is better at converting than the other. This is why the test is a test of independence.
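The matrix expected under the null hypothesis can be sketched numerically as the outer product of the two marginal distributions. The variable names here are illustrative assumptions, not the author's code.

```python
import numpy as np

# Rows: version a, version b. Columns: not converted, converted.
observed = np.array([[4514, 486],
                     [4473, 527]])
N = observed.sum()

p_version = observed.sum(axis=1) / N     # row marginals: [0.5, 0.5]
p_conversion = observed.sum(axis=0) / N  # column marginals: [0.8987, 0.1013]

# Joint probabilities if version and conversion were independent.
independent = np.outer(p_version, p_conversion)

# Expected frequencies under the null hypothesis.
expected = N * independent
print(expected)  # both rows are [4493.5, 506.5]
```

Because both versions received the same number of visitors, the two rows of the expected matrix are identical.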
The chi-squared test compares an observed distribution $O_{ij}$ to an expected distribution $E_{ij}$:

$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \tag{1}$$

where $i$ and $j$ are the row and column indices of the matrix (*). The values of $E_{ij}$ are computed as $E_{ij} = N\,P(V = i)\,P(C = j)$, where $P(V = i)$ and $P(C = j)$ are the marginal probabilities of the page version and the conversion result and $N$ is the grand total. The statistic thus obtained is now compared to the distribution assumed in the null hypothesis, and to do this we need to find the degrees of freedom (dof), which is the shape parameter of the chi-squared distribution. For the test of independence using an $r \times c$ contingency matrix, the dof is the total number of matrix entries ($r \times c$) minus the reduction in dof, which is given by $r + c - 1$. The reductions come from the row and column sum constraints, decreased by one because the last entry in the matrix is determined by either its row sum or its column sum, so one of the constraints is redundant. Hence the dof for the test of independence comes out to be $rc - (r + c - 1) = (r - 1)(c - 1)$, which is 1 for a $2 \times 2$ table.
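Equation (1) and the dof formula can be evaluated by hand for the example table. The sketch below uses NumPy and the survival function of `scipy.stats.chi2` to turn the statistic into a p-value; it applies equation (1) as written, without the correction mentioned in the footnote (*), and all variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2

# Rows: version a, version b. Columns: not converted, converted.
observed = np.array([[4514, 486],
                     [4473, 527]])
N = observed.sum()

# Expected frequencies under independence: N * P(row) * P(column).
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / N

# Equation (1): sum of (O - E)^2 / E over all cells.
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1).
rows, cols = observed.shape
dof = (rows - 1) * (cols - 1)

# p-value: probability of a statistic at least this large under the null.
p_value = chi2.sf(chi2_stat, dof)

print(chi2_stat, dof, p_value)  # roughly 1.85, 1, 0.17
```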
Python Implementation
Fortunately, it is very straightforward to carry out this hypothesis test using scipy.stats.chi2_contingency. All we need is to supply the function with a contingency matrix, and it will return the $\chi^2$ statistic and the corresponding p-value (along with the dof and the expected frequencies).
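A minimal usage sketch is given below. It is not the article's original listing; it only assumes the documented behaviour of `scipy.stats.chi2_contingency`, whose result unpacks into the statistic, the p-value, the dof and the expected frequencies, and which applies a continuity correction by default when the dof is 1.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: version a, version b. Columns: not converted, converted.
observed = np.array([[4514, 486],
                     [4473, 527]])

# Default settings: the continuity correction is applied because dof == 1.
chi2_stat, p_value, dof, expected = chi2_contingency(observed)

print(f"chi2 = {chi2_stat:.2f}, p = {p_value:.3f}, dof = {dof}")
# chi2 = 1.76, p = 0.185, dof = 1
```

Passing `correction=False` would instead reproduce the uncorrected statistic of equation (1).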
The result for the original table (of $n = 10000$) is $\chi^2 \approx 1.76$ and $p \approx 0.185$, using the continuity correction that chi2_contingency applies by default to $2 \times 2$ tables. Since the p-value is greater than the standard threshold of $0.05$, we cannot reject the null hypothesis that the page version and the conversion are independent. Therefore the difference in the conversion rates cited at the beginning of this article is not statistically significant.
What if we keep running the same A/B test a bit longer, until we accumulate $n = 40000$ visitors? Using example data (n40000.csv), we again compute the conversion rates for version a and version b and run the same test on the new data. Since the resulting p-value is below the threshold ($p < 0.05$), the difference we see in the conversion rates is statistically significant this time. This is a demonstration of how a bigger sample helps to see a tiny difference. (The example data used in this article are generated assuming slightly different true conversion rates for a and b.)
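The effect of sample size can be illustrated without the n40000.csv file. The sketch below uses a purely hypothetical table, obtained by multiplying the original counts by four so that the observed conversion rates stay the same while the total grows to 40000; it is not the article's example data, and its numbers will differ from the article's.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Original table. Rows: version a, b. Columns: not converted, converted.
observed = np.array([[4514, 486],
                     [4473, 527]])

# Hypothetical larger experiment: same rates, four times the visitors.
observed_4x = observed * 4

_, p_small, _, _ = chi2_contingency(observed)
_, p_large, _, _ = chi2_contingency(observed_4x)

print(f"n = {observed.sum()}: p = {p_small:.4f}")
print(f"n = {observed_4x.sum()}: p = {p_large:.4f}")

alpha = 0.05  # conventional significance threshold
print(p_small < alpha, p_large < alpha)  # False True
```

With identical observed rates, the larger sample pushes the p-value below the threshold, because the expected sampling noise shrinks as the number of visitors grows.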
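(*) For a $2 \times 2$ contingency table, Yates’s chi-squared test is commonly used. This applies a correction of the form

$$\chi^2_{\text{Yates}} = \sum_{i,j} \frac{\left(\left|O_{ij} - E_{ij}\right| - 0.5\right)^2}{E_{ij}}$$

to account for an error between the observed discrete distribution and the continuous chi-squared distribution.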
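The correction can be checked against SciPy. This sketch computes the corrected statistic by hand for the example table and compares it with `chi2_contingency(..., correction=True)`; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: version a, version b. Columns: not converted, converted.
observed = np.array([[4514, 486],
                     [4473, 527]])
N = observed.sum()
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / N

# Uncorrected Pearson statistic and the Yates-corrected version.
plain = ((observed - expected) ** 2 / expected).sum()
yates = ((np.abs(observed - expected) - 0.5) ** 2 / expected).sum()

# SciPy applies the same correction for 2 x 2 tables when correction=True.
scipy_yates = chi2_contingency(observed, correction=True)[0]

print(plain, yates, scipy_yates)  # roughly 1.85, 1.76, 1.76
```

The corrected statistic is always slightly smaller than the uncorrected one, which makes the test a little more conservative.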