
Interpreting A/B Test using Python

Written by Taro Sato. Tagged: Python, stats, visualization

Suppose we ran an A/B test with two different versions of a web page, \(a\) and \(b\), for which we count the number of visitors and whether they convert or not. We can summarize this in a contingency table showing the frequency distribution of the events:

Not converted (\(f\)) Converted (\(t\))
\(a\) \(4,514\) \(486\)
\(b\) \(4,473\) \(527\)

It is trivial to compute the conversion rate of each version: \(486 / (486 + 4514) = 9.72 \%\) for \(a\) and \(527 / (527 + 4473) = 10.5 \%\) for \(b\). With such a relatively small difference, however, can we convincingly say that version \(b\) converts better? To test the statistical significance of a result like this, we can use hypothesis testing.
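
For reference, these rates follow directly from the table counts; a minimal Python sketch:

# Counts from the contingency table above.
counts = {'a': {'not_converted': 4514, 'converted': 486},
          'b': {'not_converted': 4473, 'converted': 527}}

for version, c in sorted(counts.items()):
    rate = c['converted'] / (c['converted'] + c['not_converted'])
    print('version {}: {:.2%}'.format(version, rate))
# version a: 9.72%
# version b: 10.54%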

Background

An appropriate hypothesis test here is Pearson’s chi-squared test. There are two types of chi-squared test, the goodness-of-fit test and the test of independence, and it is the latter that is useful in this case. The reason why a test of “independence” is applicable becomes clear by converting the contingency table into a probability matrix, dividing each element by the grand total of frequencies:

Not converted (\(f\)) Converted (\(t\))
\(a\) \(P(V=a, C=f) = 0.4514\) \(P(V=a, C=t) = 0.0486\)
\(b\) \(P(V=b, C=f) = 0.4473\) \(P(V=b, C=t) = 0.0527\)

A table like this is sometimes called a correspondence matrix. Here, the table consists of joint probabilities where \(V\) is the version of the web page (\(a\) or \(b\)) and \(C\) is the conversion result (\(f\) or \(t\)).
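
For instance, this correspondence matrix can be reproduced by normalizing the table of counts by the grand total; a minimal NumPy sketch:

import numpy as np

# Observed frequencies: rows are versions (a, b), columns are (not converted, converted).
observed = np.array([[4514, 486],
                     [4473, 527]])

# Joint probabilities P(V, C): divide each count by the grand total of 10,000.
print(observed / observed.sum())
# [[0.4514 0.0486]
#  [0.4473 0.0527]]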

Now, our interest is whether the conversion \(C\) depends on the page version \(V\), and if it does, which version converts better. In probability theory, the events \(C\) and \(V\) are said to be independent if the joint probability can be computed by \(P(V, C) = P(V)P(C)\), where \(P(V)\) and \(P(C)\) are the marginal probabilities of \(V\) and \(C\), respectively. It is straightforward to compute these marginal probabilities from the row and column totals:

\begin{eqnarray} P(V=a) &=& \frac{4514 + 486}{10000} = 0.5 \nonumber \\ P(V=b) &=& \frac{4473 + 527}{10000} = 0.5 \nonumber \\ P(C=f) &=& \frac{4514 + 4473}{10000} = 0.8987 \nonumber \\ P(C=t) &=& \frac{486 + 527}{10000} = 0.1013 \nonumber \end{eqnarray}

where \(10,000\) is the grand total of all the elements. The null hypothesis (i.e., a neutral hypothesis in which the effect we seek is absent) is that \(V\) and \(C\) are independent, in which case the elements of the matrix are equivalent to

Not converted (\(f\)) Converted (\(t\))
\(a\) \(P(V=a) P(C=f)\) \(P(V=a) P(C=t)\)
\(b\) \(P(V=b) P(C=f)\) \(P(V=b) P(C=t)\)

The conversion \(C\) is said to depend on the version \(V\) of the website if this null hypothesis is rejected; rejecting it therefore means that one version converts better than the other. This is why the relevant test is the test of independence.
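
To make this concrete, here is a short NumPy sketch computing the marginal probabilities and the expected joint probabilities under the null hypothesis:

import numpy as np

# Same observed table as in the sketch above.
observed = np.array([[4514, 486],
                     [4473, 527]])
joint = observed / observed.sum()

# Marginal probabilities P(V) and P(C) from the row and column sums.
p_v = joint.sum(axis=1)  # [P(V=a), P(V=b)] = [0.5, 0.5]
p_c = joint.sum(axis=0)  # [P(C=f), P(C=t)] = [0.8987, 0.1013]

# Expected joint probabilities if V and C were independent: P(V) * P(C).
print(np.outer(p_v, p_c))
# [[0.44935 0.05065]
#  [0.44935 0.05065]]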

The chi-squared test compares the observed frequencies \(O_{ij}\) to the expected frequencies \(E_{ij}\)

\begin{equation*} \chi^2 = \sum_{i, j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \ , \end{equation*}

where \(i\) and \(j\) are the row and column indices of the matrix.1 The values of \(E_{ij}\) are computed from \(P(V=i)\), \(P(C=j)\), and the grand total. The \(\chi^2\) statistic thus obtained is then compared against the chi-squared distribution assumed under the null hypothesis, and to do this we need the degrees of freedom (dof), which is the shape parameter of the chi-squared distribution. For a test of independence on an \(r \times c\) contingency table, the dof is the total number of matrix entries (\(r \times c\)) minus the reduction in dof, which is \(r + c - 1\): there is one constraint for each row sum and each column sum, minus one because these constraints are not all independent (the row sums and the column sums both add up to the same grand total). Hence the dof for the test of independence comes out to be \(rc - (r + c - 1) = (r - 1)(c - 1)\).
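
As a sanity check, the statistic and the dof can be computed by hand from the observed counts; a short sketch (without the Yates continuity correction discussed in the footnote):

import numpy as np

observed = np.array([[4514, 486],
                     [4473, 527]])

# Expected frequencies under independence: row total * column total / grand total.
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Chi-squared statistic and degrees of freedom for an r x c table.
chisq = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print(chisq, dof)  # about 1.85, with dof = 1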

Python Implementation

Fortunately, it is very straightforward to carry out this hypothesis test using scipy.stats.chi2_contingency. All we need to do is supply the function with a contingency matrix, and it returns the \(\chi^2\) statistic and the corresponding \(p\)-value:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""An example of A/B test using the chi-squared test for independence."""
import pandas as pd
from scipy.stats import chi2_contingency


def main():
    # Observed frequencies for the n = 10,000 experiment.
    data = pd.read_csv('n10000.csv').set_index('version')
    observed = data.values
    print(observed)

    # chi2_contingency returns (chi2, p, dof, expected); keep the first two.
    chisq, p = chi2_contingency(observed)[:2]
    print('chisq = {}, p = {}'.format(chisq, p))

    print()

    # Repeat with the larger n = 40,000 sample.
    data = pd.read_csv('n40000.csv').set_index('version')
    observed = data.values
    print(observed)

    chisq, p = chi2_contingency(observed)[:2]
    print('chisq = {}, p = {}'.format(chisq, p))


if __name__ == '__main__':
    main()

n10000.csv:

version,not converted,converted
A,4514,486
B,4473,527

n40000.csv:

version,not converted,converted
A,17998,2002
B,17742,2258

(The code and data are available as a Gist.)

The result for the original table (of \(n = 10,000\)) is \(\chi^2 = 1.76\) with \(p = 0.185\). Since the \(p\)-value is greater than the conventional threshold of \(0.05\), we cannot reject the null hypothesis that the page version and the conversion are independent. Therefore the difference in conversion rates cited at the beginning of this article is not statistically significant.
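
Incidentally, scipy.stats.chi2_contingency applies the Yates continuity correction (see the footnote) by default when the table is \(2 \times 2\); the uncorrected statistic from the plain formula above can be obtained by passing correction=False:

from scipy.stats import chi2_contingency

observed = [[4514, 486],
            [4473, 527]]

# Default: the Yates continuity correction is applied to a 2 x 2 table.
chisq, p = chi2_contingency(observed)[:2]                    # chisq ~ 1.76, p ~ 0.185

# Without the correction, the statistic matches the plain chi-squared formula.
chisq, p = chi2_contingency(observed, correction=False)[:2]  # chisq ~ 1.85, p ~ 0.17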

What if we keep running the same A/B test until we accumulate \(n = 40,000\) visitors? Using the example data (n40000.csv), we have conversion rates of \(2002 / 20000 = 10.0 \%\) for version \(a\) and \(2258 / 20000 = 11.3 \%\) for version \(b\). Running the same test on the new data yields \(\chi^2 = 17.1\) and \(p = 3.58 \times 10^{-5}\). Since \(p \ll 0.05\), the difference in the conversion rates is statistically significant this time. This demonstrates how a larger sample helps to detect a small difference. (The example data used in this article were generated assuming true conversion rates of \(10 \%\) for \(a\) and \(11 \%\) for \(b\).)
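
One way to simulate data like these, under the assumption of fixed true conversion rates, is to draw the number of conversions for each version from a binomial distribution; a rough sketch (not necessarily how the actual example data were produced):

import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n = 20000  # visitors per version, as in n40000.csv

# Number of conversions for each version, drawn with the assumed true rates.
conv_a = rng.binomial(n, 0.10)
conv_b = rng.binomial(n, 0.11)

observed = [[n - conv_a, conv_a],
            [n - conv_b, conv_b]]
chisq, p = chi2_contingency(observed)[:2]
print('chisq = {}, p = {}'.format(chisq, p))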


  1. For a \(2 \times 2\) contingency table, Yates’s chi-squared test is commonly used. This applies a correction of the form

    \begin{equation*} \chi^2_{\text{Yates}} = \sum_{i, j} \frac{(|O_{ij} - E_{ij}| - 0.5)^2}{E_{ij}} \end{equation*}

    to account for the error arising from approximating the discrete observed distribution with the continuous chi-squared distribution. ↩︎
