Interpreting an A/B Test Using Python
Suppose we ran an A/B test with two different versions of a web page, \(a\) and \(b\), for which we count the number of visitors and whether they convert or not. We can summarize this in a contingency table showing the frequency distribution of the events:
|  | Not converted (\(f\)) | Converted (\(t\)) |
|---|---|---|
| \(a\) | \(4,514\) | \(486\) |
| \(b\) | \(4,473\) | \(527\) |
It is trivial to compute the conversion rate of each version: \(486 / (486 + 4514) = 9.72 \%\) for \(a\) and \(527 / (527 + 4473) = 10.5 \%\) for \(b\). With such a small difference, however, can we convincingly say that version \(b\) converts better? To test the statistical significance of a result like this, we can use hypothesis testing.
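As a quick check of the arithmetic above, the conversion rates can be computed directly from the counts in the table. This is only a minimal sketch; the names `counts` and `rates` are chosen here for illustration.

```python
# Visitor counts from the contingency table: (not converted, converted).
counts = {"a": (4514, 486), "b": (4473, 527)}

# Conversion rate = converted / (not converted + converted).
rates = {
    version: converted / (not_converted + converted)
    for version, (not_converted, converted) in counts.items()
}

for version, rate in rates.items():
    print("{}: {:.2%}".format(version, rate))
```

This prints `a: 9.72%` and `b: 10.54%`, the latter rounding to the \(10.5 \%\) quoted above.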
Background
An appropriate hypothesis test here is Pearson’s chi-squared test. There are two types of chi-squared test, the test of goodness of fit and the test of independence, and it is the latter that is useful for the case in question. The reason why a test of “independence” is applicable becomes clear when we convert the contingency table into a probability matrix by dividing each element by the grand total of frequencies:
|  | Not converted (\(f\)) | Converted (\(t\)) |
|---|---|---|
| \(a\) | \(P(V=a, C=f) = 0.4514\) | \(P(V=a, C=t) = 0.0486\) |
| \(b\) | \(P(V=b, C=f) = 0.4473\) | \(P(V=b, C=t) = 0.0527\) |
A table like this is sometimes called a correspondence matrix. Here, the table consists of joint probabilities where \(V\) is the version of the web page (\(a\) or \(b\)) and \(C\) is the conversion result (\(f\) or \(t\)).
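The conversion from counts to joint probabilities is a single division by the grand total. A minimal NumPy sketch, with the array names `observed` and `joint` chosen here for illustration:

```python
import numpy as np

# Rows: versions a, b. Columns: not converted (f), converted (t).
observed = np.array([[4514, 486],
                     [4473, 527]])

# Dividing by the grand total (10,000) gives the correspondence matrix
# of joint probabilities P(V, C).
joint = observed / observed.sum()
print(joint)
```

The four entries of `joint` sum to one, as joint probabilities must.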
Now, our interest is whether the conversion \(C\) depends on the page version \(V\), and if it does, to learn which version converts better. In probability theory, the random variables \(C\) and \(V\) are said to be independent if the joint probability factorizes as \(P(V, C) = P(V)P(C)\), where \(P(V)\) and \(P(C)\) are the marginal probabilities of \(V\) and \(C\), respectively. It is straightforward to compute the marginal probabilities from the row and column sums:
\begin{eqnarray} P(V=a) &=& \frac{4514 + 486}{10000} \nonumber \\ P(V=b) &=& \frac{4473 + 527}{10000} \nonumber \\ P(C=f) &=& \frac{4514 + 4473}{10000} \nonumber \\ P(C=t) &=& \frac{486 + 527}{10000} \nonumber \end{eqnarray}
where \(10,000\) is the grand total of all the elements. The null hypothesis (i.e., a neutral hypothesis in which the effect we seek is absent) is that \(V\) and \(C\) are independent, in which case the elements of the matrix are equivalent to
|  | Not converted (\(f\)) | Converted (\(t\)) |
|---|---|---|
| \(a\) | \(P(V=a) P(C=f)\) | \(P(V=a) P(C=t)\) |
| \(b\) | \(P(V=b) P(C=f)\) | \(P(V=b) P(C=t)\) |
The conversion \(C\) is said to be dependent on the version \(V\) of the website if this null hypothesis is rejected. Hence rejecting the null hypothesis means that one version is better at converting than the other. This is the reason why the test is on independence.
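The marginal probabilities and the matrix expected under the null hypothesis can be sketched in a few lines of NumPy. The variable names are illustrative; the outer product of the two marginals gives exactly the table of \(P(V)P(C)\) products shown above.

```python
import numpy as np

observed = np.array([[4514, 486],
                     [4473, 527]])
n = observed.sum()  # grand total, 10,000

# Marginal probabilities from the row and column sums.
p_version = observed.sum(axis=1) / n      # [P(V=a), P(V=b)]
p_conversion = observed.sum(axis=0) / n   # [P(C=f), P(C=t)]

# Under independence, P(V, C) = P(V) P(C): an outer product of the marginals.
expected_joint = np.outer(p_version, p_conversion)

# Scaling back by the grand total gives the expected frequencies.
expected_counts = expected_joint * n
print(p_version, p_conversion)
print(expected_counts)
```

Here both versions received the same number of visitors, so each row of `expected_counts` is identical: 4493.5 non-conversions and 506.5 conversions.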
The chi-squared test compares an observed distribution \(O_{ij}\) to an expected distribution \(E_{ij}\)
\begin{equation*} \chi^2 = \sum_{i, j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \ , \end{equation*}
where \(i\) and \(j\) are the row and column indices of the matrix.^{1} The values of \(E_{ij}\) are computed from \(P(V=i)\) and \(P(C=j)\). The \(\chi^2\) statistic thus obtained is then compared to the distribution assumed under the null hypothesis, and to do this we need the number of degrees of freedom (dof), which is the shape parameter of the chi-squared distribution. For the test of independence using an \(r \times c\) contingency matrix, the dof is the total number of matrix entries (\(r \times c\)) minus the reduction in dof, which is \(r + c - 1\). The reduction comes from the \(r\) row-sum and \(c\) column-sum constraints, less one because the last entry in the matrix is determined by either its row sum or its column sum, so one of those constraints is redundant. Hence the dof for the test of independence comes out to be \(rc - (r + c - 1) = (r - 1)(c - 1)\).
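To make the formula concrete, here is a minimal sketch that evaluates the statistic, the dof and the resulting \(p\)-value by hand for the table from the beginning of the article. It uses `scipy.stats.chi2.sf` (the survival function of the chi-squared distribution) for the \(p\)-value. Note that this is the plain statistic without the continuity correction described in the footnote, so it comes out slightly larger than a corrected value would.

```python
import numpy as np
from scipy.stats import chi2

observed = np.array([[4514, 486],
                     [4473, 527]])

# Expected frequencies under independence: (row sum * column sum) / grand total.
row_sums = observed.sum(axis=1, keepdims=True)
col_sums = observed.sum(axis=0, keepdims=True)
expected = row_sums * col_sums / observed.sum()

# Pearson's chi-squared statistic, without any continuity correction.
chisq = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom for an r x c table.
r, c = observed.shape
dof = (r - 1) * (c - 1)

# p-value: probability of a statistic at least this large under the null.
p = chi2.sf(chisq, dof)
print("chisq = {:.3f}, dof = {}, p = {:.3f}".format(chisq, dof, p))
```

For this table the uncorrected statistic is about \(1.85\) with one degree of freedom.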
Python Implementation
Fortunately, it is very straightforward to carry out this hypothesis test using scipy.stats.chi2_contingency. All we need to do is supply the function with a contingency matrix, and it will return the \(\chi^2\) statistic and the corresponding \(p\)-value:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""An example of A/B test using the chi-squared test for independence."""
import pandas as pd
from scipy.stats import chi2_contingency


def run_test(path):
    """Read a contingency table from a CSV file and test for independence."""
    # Index by page version so that only the counts remain as values.
    data = pd.read_csv(path)
    data = data.set_index('version')
    observed = data.values
    print(observed)
    # The first two items of the result are the chi-squared statistic and
    # the p-value. For a 2 x 2 table, chi2_contingency applies Yates's
    # continuity correction by default.
    result = chi2_contingency(observed)
    chisq, p = result[:2]
    print('chisq = {}, p = {}'.format(chisq, p))


def main():
    run_test('n10000.csv')
    print()
    run_test('n40000.csv')


if __name__ == '__main__':
    main()
n10000.csv:
| version | not converted | converted |
|---|---|---|
| A | 4514 | 486 |
| B | 4473 | 527 |
n40000.csv:
| version | not converted | converted |
|---|---|---|
| A | 17998 | 2002 |
| B | 17742 | 2258 |
(The code and data are available in Gist.)
The result for the original table (of \(n = 10,000\)) is \(\chi^2 = 1.76\) and \(p = 0.185\). Since the \(p\)-value is greater than the conventional threshold of \(0.05\), we cannot reject the null hypothesis that the page version and the conversion are independent. Therefore the difference in the conversion rates cited at the beginning of this article is not statistically significant.
What if we keep running the same A/B test a bit longer until we accumulate \(n = 40,000\) visitors? Using example data (n40000.csv), we have conversion rates of \(2002 / 20000 = 10.0 \%\) for version \(a\) and \(2258 / 20000 = 11.3 \%\) for version \(b\). Running the same test on the new data yields \(\chi^2 = 17.1\) and \(p = 3.58 \times 10^{-5}\). Since \(p \ll 0.05\), the difference we see in the conversion rates is statistically significant this time. This demonstrates how a bigger sample helps to detect a tiny difference. (The example data used in this article were generated assuming true conversion rates of \(10 \%\) for \(a\) and \(11 \%\) for \(b\).)
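The effect of sample size can also be explored by simulation. The sketch below is not the procedure used to produce the example files (the article does not say how they were generated); it simply assumes each visitor converts independently with the stated true rates of \(10 \%\) and \(11 \%\), draws binomial counts with an arbitrary seed, and runs the same test at several hypothetical sample sizes.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Assumption: arbitrary seed, used only so that the run is repeatable.
rng = np.random.default_rng(0)

# True conversion rates stated in the article for versions a and b.
true_rates = {"a": 0.10, "b": 0.11}


def simulate_table(visitors_per_version):
    """Draw a 2 x 2 table of (not converted, converted) counts."""
    rows = []
    for rate in true_rates.values():
        converted = rng.binomial(visitors_per_version, rate)
        rows.append([visitors_per_version - converted, converted])
    return np.array(rows)


# Total sample sizes to try; 160,000 is a hypothetical extra value.
p_values = {}
for n in (10000, 40000, 160000):
    table = simulate_table(n // 2)
    chisq, p = chi2_contingency(table)[:2]
    p_values[n] = p
    print("n = {}: chisq = {:.2f}, p = {:.3g}".format(n, chisq, p))
```

Individual runs vary with the seed, but larger samples tend to give smaller \(p\)-values for the same underlying difference in rates.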

^{1} For a \(2 \times 2\) contingency table, Yates’s chi-squared test is commonly used. This applies a correction of the form
\begin{equation*} \chi^2_{\rm Yates} = \sum_{i, j} \frac{(|O_{ij} - E_{ij}| - 0.5)^2}{E_{ij}} \end{equation*}
to account for the error that arises from approximating the discrete distribution of the observed counts with the continuous chi-squared distribution.
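The corrected statistic can be evaluated by hand and compared with SciPy. In the sketch below, `yates_chisq` is a helper written for this illustration (it is not a SciPy function); `chi2_contingency` is called with `correction=True`, which is also its default and which applies Yates's correction when the table has one degree of freedom.

```python
import numpy as np
from scipy.stats import chi2_contingency


def yates_chisq(observed):
    """Chi-squared statistic with Yates's continuity correction."""
    observed = np.asarray(observed, dtype=float)
    expected = (observed.sum(axis=1, keepdims=True)
                * observed.sum(axis=0, keepdims=True) / observed.sum())
    # Shrink each absolute deviation by 0.5 before squaring.
    return ((np.abs(observed - expected) - 0.5) ** 2 / expected).sum()


# The two tables used in the article, keyed by total sample size.
tables = {
    10000: [[4514, 486], [4473, 527]],
    40000: [[17998, 2002], [17742, 2258]],
}

for n, table in tables.items():
    by_hand = yates_chisq(table)
    from_scipy = chi2_contingency(table, correction=True)[0]
    print("n = {}: by hand = {:.2f}, scipy = {:.2f}".format(
        n, by_hand, from_scipy))
```

Both approaches give about \(1.76\) for the \(n = 10,000\) table and about \(17.08\) for the \(n = 40,000\) table, consistent with the values reported in the article.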