# Interpreting an A/B Test Using Python

Suppose we ran an A/B test with two different versions of a web page, \( a \) and \( b \), for which we count the number of visitors and whether they convert or not. We can summarize this in a contingency table showing the frequency distribution of the events:

| | not converted (\( f \)) | converted (\( t \)) |
|---|---|---|
| \( a \) | 4514 | 486 |
| \( b \) | 4473 | 527 |

It is trivial to compute the conversion rate of each version: \( 486 / (486 + 4514) = 9.72\,\% \) for \( a \) and \( 527 / (527 + 4473) = 10.54\,\% \) for \( b \). With such a relatively small difference, however, can we convincingly say that version \( b \) converts better? To test the *statistical significance* of a result like this, a hypothesis test can be used.
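The arithmetic above can be checked in a couple of lines of Python:

```python
# Conversion counts taken from the contingency table above.
converted_a, not_converted_a = 486, 4514
converted_b, not_converted_b = 527, 4473

rate_a = converted_a / (converted_a + not_converted_a)
rate_b = converted_b / (converted_b + not_converted_b)

print(f"a: {rate_a:.2%}  b: {rate_b:.2%}")
# a: 9.72%  b: 10.54%
```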

## Background

An appropriate hypothesis test here is *Pearson’s chi-squared test*. There are two types of chi-squared test, the goodness-of-fit test and the test of independence, but it is the latter that is useful for the case in question. The reason why a test of “independence” is applicable becomes clear by converting the contingency table into a probability matrix, dividing each element by the grand total of frequencies:

| | not converted (\( f \)) | converted (\( t \)) |
|---|---|---|
| \( a \) | \( P(V = a, C = f) = 0.4514 \) | \( P(V = a, C = t) = 0.0486 \) |
| \( b \) | \( P(V = b, C = f) = 0.4473 \) | \( P(V = b, C = t) = 0.0527 \) |

A table like this is sometimes called a *correspondence matrix*. Here, the table consists of joint probabilities, where \( V \) is the version of the web page (\( a \) or \( b \)) and \( C \) is the conversion result (\( f \) or \( t \)).

Now, our interest is whether the conversion \( C \) depends on the page version \( V \), and if it does, to learn which version converts better. In probability theory, the events \( C \) and \( V \) are said to be independent if the joint probability can be computed by \( P(V, C) = P(V)P(C) \), where \( P(V) \) and \( P(C) \) are the marginal probabilities of \( V \) and \( C \), respectively. It is straightforward to compute the marginal probabilities from the row and column sums:

```
\begin{eqnarray}
P(V=a) &=& \frac{4514 + 486}{10000} = 0.5 \nonumber \\
P(V=b) &=& \frac{4473 + 527}{10000} = 0.5 \nonumber \\
P(C=f) &=& \frac{4514 + 4473}{10000} = 0.8987 \nonumber \\
P(C=t) &=& \frac{486 + 527}{10000} = 0.1013 \nonumber
\end{eqnarray}
```

where \( 10{,}000 \) is the grand total of all the elements. The null hypothesis (i.e., a neutral hypothesis in which the effect we seek is absent) is that \( V \) and \( C \) are independent, in which case the elements of the matrix are equivalent to

| | not converted (\( f \)) | converted (\( t \)) |
|---|---|---|
| \( a \) | \( P(V=a) P(C=f) \) | \( P(V=a) P(C=t) \) |
| \( b \) | \( P(V=b) P(C=f) \) | \( P(V=b) P(C=t) \) |
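As a quick numerical check, the marginals and the products \( P(V=i)P(C=j) \) can be computed with NumPy (a sketch; the variable names are mine):

```python
import numpy as np

# Contingency table: rows are versions (a, b), columns are (f, t).
observed = np.array([[4514, 486],
                     [4473, 527]])
n = observed.sum()       # grand total, 10000
joint = observed / n     # joint probability matrix

# Marginal probabilities from row and column sums.
p_v = joint.sum(axis=1)  # P(V=a), P(V=b)
p_c = joint.sum(axis=0)  # P(C=f), P(C=t)

# Under the null hypothesis of independence, the joint probabilities
# would equal the outer product P(V=i) * P(C=j).
independent = np.outer(p_v, p_c)
print(independent)
```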

The conversion \( C \) is said to be dependent on the version \( V \) of the web site if this null hypothesis is rejected. Hence rejecting the null hypothesis means that one version is better at converting than the other. This is why the test is one of independence.

The *chi-squared test* compares an observed distribution \( O_{ij} \) to an expected distribution \( E_{ij} \),

```
\[
\chi^2 = \sum_{i, j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \ ,
\]
```

where \( i \) and \( j \) are the row and column indices of the matrix.^{1} The values of \( E_{ij} \) are computed from \( P(V=i) \) and \( P(C=j) \). The \( \chi^2 \) statistic thus obtained is now compared to the distribution assumed in the null hypothesis, and to do this we need to find the degrees of freedom (dof), which is the shape parameter of the chi-squared distribution. For the test of independence using an \( r \times c \) contingency matrix, the dof is computed from the total number of matrix entries (\( r \times c \)) minus the reduction in dof, which is given by \( r + c - 1 \). The reductions come from the row and column sum constraints, but decreased by one because the last entry in the matrix is determined by either the row or column sum on that row/column and is therefore degenerate. Hence the dof for the test of independence comes out to be \( (r - 1)(c - 1) \); for our \( 2 \times 2 \) table, the dof is \( 1 \).
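Putting the formula and the dof together, the statistic can be computed by hand (a sketch without the continuity correction described in the footnote, so the value comes out slightly larger than the corrected figure reported in the next section):

```python
import numpy as np
from scipy.stats import chi2

observed = np.array([[4514, 486],
                     [4473, 527]])
n = observed.sum()

# Expected frequencies under independence: n * P(V=i) * P(C=j).
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n

# Chi-squared statistic and degrees of freedom (r - 1)(c - 1).
stat = ((observed - expected) ** 2 / expected).sum()
r, c = observed.shape
dof = (r - 1) * (c - 1)

# p-value: survival function of the chi-squared distribution.
p = chi2.sf(stat, dof)
print(f"chi2 = {stat:.3f}, dof = {dof}, p = {p:.3f}")
```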

## Python Implementation

Fortunately it is very straightforward to carry out this hypothesis test using `scipy.stats.chi2_contingency`. All we need is to supply the function with a contingency matrix, and it will return the \( \chi^2 \) statistic and the corresponding \( p \)-value:
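A minimal sketch of that call might look like this (the table is the one from the introduction; note that `chi2_contingency` applies Yates' continuity correction by default for a \( 2 \times 2 \) table, which is what produces the figures quoted below):

```python
from scipy.stats import chi2_contingency

# Contingency table: rows are versions (a, b), columns are (f, t).
observed = [[4514, 486],
            [4473, 527]]

# Returns the chi-squared statistic, the p-value, the degrees of
# freedom, and the expected frequencies under independence.
chi2, p, dof, expected = chi2_contingency(observed)

print(f"chi2 = {chi2:.2f}, p = {p:.3f}, dof = {dof}")
# chi2 = 1.76, p = 0.185, dof = 1
```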

The result for the original table (of \( n = 10{,}000 \)) is \( \chi^2 = 1.76 \) and \( p = 0.185 \). Since the \( p \)-value is greater than the standard threshold of \( 0.05 \), we cannot reject the null hypothesis that the page version and the conversion are independent. Therefore the difference in the conversion rates cited at the beginning of this article is not statistically significant.

What if we keep running the same A/B test a bit longer, until we accumulate \( n = 40{,}000 \) visitors? Using example data (*n40000.csv*), we have conversion rates of \( 2002 / 20000 = 10.0\,\% \) for version \( a \) and \( 2258 / 20000 = 11.3\,\% \) for version \( b \). Running the same test on the new data yields \( \chi^2 = 17.1 \) and \( p = 3.58 \times 10^{-5} \). Since \( p \ll 0.05 \), the difference we see in the conversion rates is statistically significant this time. This is a demonstration of how a bigger sample helps to detect a tiny difference. (The example data used in this article are generated assuming true conversion rates of \( 10\,\% \) for \( a \) and \( 11\,\% \) for \( b \).)
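The larger test can be reproduced without the CSV file by reconstructing the counts from the rates quoted above (a sketch; the table below is mine, derived from \( 2002/20000 \) and \( 2258/20000 \)):

```python
from scipy.stats import chi2_contingency

# Counts reconstructed from the quoted conversion rates:
# version a: 2002 conversions out of 20000 visitors,
# version b: 2258 conversions out of 20000 visitors.
observed = [[20000 - 2002, 2002],
            [20000 - 2258, 2258]]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, p = {p:.2e}")
```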

^{1} For a \( 2 \times 2 \) contingency table, *Yates's chi-squared test* is commonly used. This applies a correction of the form

```
\[
\chi^2_{\rm Yates} = \sum_{i, j} \frac{(|O_{ij} - E_{ij}| - 0.5)^2}{E_{ij}}
\]
```

to account for the error between the observed discrete distribution and the continuous chi-squared distribution.
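The effect of this correction can be seen by toggling the `correction` flag of `chi2_contingency` (a sketch using the \( n = 10{,}000 \) table from the introduction; `correction=True` is the default for a \( 2 \times 2 \) table):

```python
from scipy.stats import chi2_contingency

observed = [[4514, 486],
            [4473, 527]]

# With Yates' continuity correction (the default for a 2x2 table) ...
chi2_yates, p_yates, _, _ = chi2_contingency(observed, correction=True)
# ... and with the plain Pearson statistic.
chi2_plain, p_plain, _, _ = chi2_contingency(observed, correction=False)

print(f"Yates:   chi2 = {chi2_yates:.2f}, p = {p_yates:.3f}")
print(f"Pearson: chi2 = {chi2_plain:.2f}, p = {p_plain:.3f}")
# Yates:   chi2 = 1.76, p = 0.185
# Pearson: chi2 = 1.85, p = 0.174
```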