Installing HipChat on Debian Jessie

I simply download HipChat from the official repository and follow the installation instructions there, but when I launch the application, I get the following error:

$ hipchat
libGL error: dlopen /usr/lib/x86_64-linux-gnu/dri/ failed (/usr/local/opt/HipChat/bin/..//lib/ version `GLIBCXX_3.4.20' not found (required by /usr/lib/x86_64-linux-gnu/
libGL error: dlopen ${ORIGIN}/dri/ failed (/usr/local/opt/HipChat/bin/..//lib/ version `GLIBCXX_3.4.20' not found (required by /usr/lib/x86_64-linux-gnu/
libGL error: dlopen /usr/lib/dri/ failed (/usr/lib/dri/ cannot open shared object file: No such file or directory)
libGL error: unable to load driver:
libGL error: driver pointer missing
libGL error: failed to load driver: radeonsi
libGL error: dlopen /usr/lib/x86_64-linux-gnu/dri/ failed (/usr/local/opt/HipChat/bin/..//lib/ version `GLIBCXX_3.4.20' not found (required by /usr/lib/x86_64-linux-gnu/
libGL error: dlopen ${ORIGIN}/dri/ failed (/usr/local/opt/HipChat/bin/..//lib/ version `GLIBCXX_3.4.20' not found (required by /usr/lib/x86_64-linux-gnu/
libGL error: dlopen /usr/lib/dri/ failed (/usr/lib/dri/ cannot open shared object file: No such file or directory)
libGL error: unable to load driver:
libGL error: failed to load driver: swrast

The program gets stuck there until it is killed.

It is not elegant, but HipChat isn't an essential application for me, so I simply bandage the problem by replacing the libstdc++ that comes with HipChat with the one from Debian:

$ cd /opt/HipChat/lib
$ sudo mv
$ sudo mv
$ sudo mv
$ sudo cp /usr/lib/x86_64-linux-gnu/ .

Now HipChat should launch normally.


Building Hadoop-LZO on Debian Jessie

JDK 1.6 or later needs to be installed for this to work.

Install a few packages:

$ aptitude install liblzo2-dev maven git

To see where the LZO library is installed, get a list of files installed:

$ dpkg-query -L liblzo2-dev
... list of paths ...

On my box, the include and library paths are /usr/include/lzo and /usr/lib/x86_64-linux-gnu, respectively. (These paths should be picked up without doing anything, but if you need to point to them explicitly when building with mvn later, try:

C_INCLUDE_PATH=/usr/include/lzo \
LIBRARY_PATH=/usr/lib/x86_64-linux-gnu \
  mvn clean test

for example.)

Get the source from the Hadoop-LZO github repo:

$ git clone
$ cd hadoop-lzo
$ mvn clean test package

If the build is successful, the JAR should be found under target:

$ ls target/

Installing Adobe Reader on Debian/Jessie

Once in a while you need to deal with fancier PDF files, such as those that let you type into forms, and Linux applications like Okular might not be fully capable of handling Adobe's proprietary features. In such an unfortunate event, you might need to use Adobe Reader.

Go to Adobe's FTP download site and download the version you wish to install. Here version 9.x is assumed:

$ wget
$ sudo dpkg --add-architecture i386
$ sudo apt-get update
$ sudo apt-get install libgtk2.0-0:i386 libxml2:i386 libstdc++6:i386
$ sudo dpkg -i AdbeRdr9*.deb

If the installation doesn’t fully complete, you might need to do

$ sudo apt-get -f install

to install the rest of the missing packages.


Using Tor on Debian Jessie

Install the package and start the service:

$ sudo apt-get install tor
$ sudo /etc/init.d/tor start

Check to see if Tor is running:

$ sudo netstat -plant
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0*               LISTEN      7766/tor

You should see an entry for Tor.

To simply use this, manually configure your application with a SOCKS proxy at localhost:9050. Visit a site that reports your IP address to check what the remote host is seeing; it should be a Tor exit node, not the IP address of your own host.
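The same proxy can also be used programmatically. Here is a minimal sketch in Python, assuming Tor is listening on the default localhost:9050 and that the requests library is installed with SOCKS support (pip install requests[socks]); the function name is illustrative:

```python
# Sketch: routing a Python HTTP request through the local Tor SOCKS proxy.
# Assumes Tor is listening on localhost:9050 (the default) and that
# requests is installed with SOCKS support (pip install requests[socks]).
import requests

TOR_PROXIES = {
    "http": "socks5h://localhost:9050",   # socks5h: resolve DNS through Tor too
    "https": "socks5h://localhost:9050",
}

def fetch_via_tor(url):
    """Fetch a URL with its TCP traffic (and DNS lookups) routed through Tor."""
    return requests.get(url, proxies=TOR_PROXIES, timeout=30).text
```

The socks5h scheme (rather than plain socks5) makes DNS resolution happen through the proxy as well, which avoids leaking lookups outside Tor.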

For a command-line program, you may use torsocks, which wraps a command so that its TCP connections go through Tor:

$ torsocks <command>

Note that Tor carries only TCP streams, so ICMP-based tools such as plain ping cannot be routed through it.

Using NTP Server to Synchronize Time on Debian Jessie

For a one-time synchronization, it is easy:

$ sudo aptitude install ntpdate
$ sudo ntpdate

The server to use can be specified from the list given here.


Installing MongoDB on Debian Jessie

On Debian, it is of course easy to install:

$ sudo aptitude install mongodb

My main use of local database servers is for testing, however, and I don’t want MongoDB to take up more than a few GB under /var/lib for journal files. I can instruct it to use smaller files by adding the following line to /etc/mongodb.conf:

smallfiles=true
Then stop the MongoDB service, remove the existing journal files, and restart:

$ sudo /etc/init.d/mongodb stop
$ sudo rm /var/lib/mongodb/journal/*
$ sudo /etc/init.d/mongodb start

Installing PyData Stack on Debian Jessie

Installing NumPy, SciPy, and Matplotlib has gotten so much easier with PIP, but there are some dependencies that are not taken care of automatically.

NumPy:
$ sudo pip install numpy

SciPy:
$ sudo aptitude install libblas-dev liblapack-dev gfortran
$ sudo pip install scipy

Matplotlib:
$ sudo aptitude install libfreetype6-dev
$ su
# pip install matplotlib

I had an issue with sudo where the installer could not find X. Running the command as root worked.


Installing PostgreSQL on Debian Jessie

$ sudo aptitude install postgresql postgresql-contrib
$ sudo su - postgres
postgres$ createuser -s yourusername

where yourusername is the username of your account. Note that with the -s switch the user will be created as a superuser. For assigning a more restricted role, see the official documentation.

Go back to your normal user account, and do

$ createdb
$ psql

Now you should be on the PostgreSQL shell.


Interpreting A/B Test using Python

Suppose we ran an A/B test with two different versions of a web page, a and b, for which we count the number of visitors and whether they convert or not. We can summarize this in a contingency table showing the frequency distribution of the events:

        not converted (f)   converted (t)
a       4514                486
b       4473                527

It is trivial to compute the conversion rate of each version: 486/(486+4514) = 9.72\% for a and 527/(527+4473) = 10.5\% for b. With such a relatively small difference, however, can we convincingly say that version b converts better? To test the statistical significance of a result like this, hypothesis testing can be used.
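These rates can be verified with a couple of lines of Python:

```python
# Conversion counts from the contingency table above.
visitors_a, conversions_a = 4514 + 486, 486
visitors_b, conversions_b = 4473 + 527, 527

rate_a = conversions_a / visitors_a  # 486 / 5000
rate_b = conversions_b / visitors_b  # 527 / 5000

print(f"a: {rate_a:.2%}, b: {rate_b:.2%}")  # a: 9.72%, b: 10.54%
```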


An appropriate hypothesis test here is Pearson's chi-squared test. There are two types of chi-squared test, the goodness-of-fit test and the test of independence, and it is the latter that is useful for the case in question. The reason why a test of "independence" is applicable becomes clear by converting the contingency table into a probability matrix, dividing each element by the grand total of frequencies:

        not converted (f)       converted (t)
a       P(V=a,C=f)=0.4514       P(V=a,C=t)=0.0486
b       P(V=b,C=f)=0.4473       P(V=b,C=t)=0.0527

A table like this is sometimes called a correspondence matrix. Here, the table consists of joint probabilities, where V is the version of the web page (a or b) and C is the conversion result (f or t).

Now, our interest is whether the conversion C depends on the page version V, and if it does, to learn which version converts better. In probability theory, the events C and V are said to be independent if the joint probability can be computed by P(V, C) = P(V)P(C), where P(V) and P(C) are marginal probabilities of V and C, respectively. It is straightforward to compute the marginal probabilities from row and column marginals:

    \begin{eqnarray*}   P(V=a) = \frac{4514 + 486}{10000} \quad , \quad P(V=b) = \frac{4473 + 527}{10000} \\   P(C=f) = \frac{4514 + 4473}{10000} \quad , \quad P(C=t) = \frac{486 + 527}{10000}  \end{eqnarray*}

where 10000 is the grand total of all the elements. The null hypothesis (i.e., a neutral hypothesis in which the effect we seek is absent) is that V and C are independent, in which case the elements of the matrix are equivalent to

        not converted (f)   converted (t)
a       P(V=a)P(C=f)        P(V=a)P(C=t)
b       P(V=b)P(C=f)        P(V=b)P(C=t)

The conversion C is said to be dependent on the version V of the web site if this null hypothesis is rejected. Hence rejecting the null hypothesis means that one version converts better than the other. This is why the relevant test is one of independence.

The chi-squared test compares an observed distribution O_{ij} to an expected distribution E_{ij}

(1)   \begin{equation*}   \chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \ , \end{equation*}

where i and j are the row and column indices of the matrix (*). The values of E_{ij} are computed from P(V=i) and P(C=j). The \chi^2 statistic thus obtained is then compared to the distribution assumed under the null hypothesis, and to do this we need the degrees of freedom (dof), the shape parameter of the chi-squared distribution. For a test of independence on an r \times c contingency matrix, the dof is the total number of matrix entries (r \times c) minus the reduction in dof, which is r + c - 1. The reductions come from the row and column sum constraints, decreased by one because the last constraint is redundant: the final entry of the matrix is already determined by the other row and column sums. Hence the dof for the test of independence comes out to be rc - (r + c - 1) = (r - 1)(c - 1).
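As a sketch of this computation (not the article's original code), the expected frequencies and the statistic of Eq. (1) can be evaluated directly with NumPy:

```python
import numpy as np

# Observed frequencies: rows = version (a, b), columns = (f, t).
O = np.array([[4514, 486],
              [4473, 527]], dtype=float)
n = O.sum()

# Marginal probabilities P(V=i) and P(C=j) from the row and column sums.
p_row = O.sum(axis=1) / n
p_col = O.sum(axis=0) / n

# Expected frequencies under independence: E_ij = n * P(V=i) * P(C=j).
E = n * np.outer(p_row, p_col)

chi2 = ((O - E) ** 2 / E).sum()            # Eq. (1), no continuity correction
dof = (O.shape[0] - 1) * (O.shape[1] - 1)  # (r - 1)(c - 1) = 1

print(chi2, dof)  # ~1.85 with 1 degree of freedom
```

Without a continuity correction, Eq. (1) gives \chi^2 \approx 1.85 here; applying the correction from the footnote (subtracting 0.5 from each |O_{ij} - E_{ij}|) brings it to about 1.76, which is what the scipy test reports for 2 x 2 tables.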

Python Implementation

Fortunately it is very straightforward to carry out this hypothesis testing using scipy.stats.chi2_contingency. All we need is to supply the function with a contingency matrix and it will return the \chi^2 statistic and the corresponding p-value:
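A minimal sketch of that call, using the n = 10000 table from the beginning of the article:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table: rows = version (a, b), columns = (not converted, converted).
observed = np.array([[4514, 486],
                     [4473, 527]])

# For a 2x2 table, chi2_contingency applies the Yates continuity
# correction by default (correction=True).
chi2, p, dof, expected = chi2_contingency(observed)

print(f"chi2 = {chi2:.2f}, p = {p:.3f}, dof = {dof}")  # chi2 = 1.76, p = 0.185, dof = 1
```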

The result for the original table (of n = 10000) is \chi^2 = 1.76 and p = 0.185. Since the p-value is greater than the standard threshold of 0.05, we cannot reject the null hypothesis that the page version and the conversion are independent. Therefore the difference in the conversion rates cited at the beginning of this article is not statistically significant.

What if we keep running the same A/B test a bit longer, until we accumulate n = 40000 visitors? Using example data (n40000.csv), we have conversion rates of 2002/20000 = 10.0\% for version a and 2258/20000 = 11.3\% for version b. Running the same test on the new data yields \chi^2 = 17.1 and p = 3.58 \times 10^{-5}. Since p \ll 0.05, the difference we see in the conversion rates is statistically significant this time. This is a demonstration of how a bigger sample helps to detect a tiny difference. (The example data used in this article were generated assuming true conversion rates of 10\% for a and 11\% for b.)
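The larger sample can be checked the same way; here the table is reconstructed from the counts quoted above rather than read from n40000.csv:

```python
import numpy as np
from scipy.stats import chi2_contingency

# n = 40000 visitors: 2002/20000 conversions for a, 2258/20000 for b.
observed = np.array([[20000 - 2002, 2002],
                     [20000 - 2258, 2258]])

chi2, p, dof, _ = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, p = {p:.2e}")  # chi2 ~ 17.1, p well below 0.05
```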

(*) For a 2 x 2 contingency table, the chi-squared test with Yates's continuity correction is commonly used. This applies a correction of the form

    \[   \chi^2_{\rm Yates} = \sum_{i,j} \frac{(|O_{ij} - E_{ij}| - 0.5)^2}{E_{ij}}  \]

to account for an error between the observed discrete distribution and the continuous chi-squared distribution.


Brand Positioning by Correspondence Analysis

I was reading an article about visualization techniques using multidimensional scaling (MDS), correspondence analysis in particular. The example used R, but as usual I want to find ways to do it in Python, so here goes.

The correspondence analysis is useful when you have a two-way contingency table for which relative values of ratio-scaled data are of interest. For example, I here use a table where the rows are fashion brands (Chanel, Louis Vuitton, etc.) and the columns are the number of people who answered that the particular brand has the particular attribute expressed by the adjective (luxurious, youthful, energetic, etc.). (I borrowed the data from this article.)

The correspondence analysis (or MDS in general) is a method of reducing dimensions to make the data more sensible for interpretation. In this case, I get a scatter plot of brands and adjectives in two-dimensional space, in which brands/adjectives more closely associated with each other are placed near each other.


As you can see, brands like GAP, H&M, and Uniqlo are associated with youth, friendliness, and energy, while old-school brands like Chanel and Tiffany are associated with luxury and brilliance. This visualization is useful because the high-dimensional information (11 brands and 9 attributes) is reduced onto a two-dimensional plane, where distance is meaningful.

Here’s the code and data:
