Installing MongoDB on Debian Jessie

On Debian, it is of course easy to install:

$ sudo aptitude install mongodb

My main use of local database servers is for testing, however, and I don’t want MongoDB to take up more than a few GB under /var/lib for journal files. I can instruct it to use smaller files by adding the following line to /etc/mongodb.conf:

smallfiles=true

Then stop the MongoDB service, remove the existing journal files, and restart:

$ sudo /etc/init.d/mongodb stop
$ sudo rm /var/lib/mongodb/journal/*
$ sudo /etc/init.d/mongodb start

Installing PyData Stack on Debian Jessie

Installing NumPy, SciPy, and Matplotlib has gotten much easier with pip, but a few system dependencies are not taken care of automatically.

NumPy:

$ sudo pip install numpy

SciPy:

$ sudo aptitude install libblas-dev liblapack-dev gfortran
$ sudo pip install scipy

Matplotlib:

$ sudo aptitude install libfreetype6-dev
$ su
# pip install matplotlib

I had some issues with sudo when the installer could not find X; running the command as root worked.


Installing PostgreSQL on Debian Jessie

$ sudo aptitude install postgresql postgresql-contrib
$ sudo su - postgres
postgres$ createuser -s yourusername

where yourusername is the username of your account. Note that with the -s switch the user will be created as a superuser. For assigning a more restricted role, see the official documentation.

Go back to your normal shell, and do

$ createdb
$ psql

Now you should be at the PostgreSQL prompt.


Interpreting A/B Test using Python

Suppose we ran an A/B test with two different versions of a web page, a and b, for which we count the number of visitors and whether they convert or not. We can summarize this in a contingency table showing the frequency distribution of the events:

    not converted (f)   converted (t)
a   4514                486
b   4473                527

It is trivial to compute the conversion rate of each version: 486/(486+4514) = 9.72% for a and 527/(527+4473) = 10.5% for b. With such a relatively small difference, however, can we convincingly say that version b converts better? To test the statistical significance of a result like this, a hypothesis test can be used.

Background

An appropriate hypothesis test here is Pearson's chi-squared test. There are two types of chi-squared test, the goodness-of-fit test and the test of independence, but it is the latter that is useful in this case. The reason why a test of "independence" is applicable becomes clear by converting the contingency table into a probability matrix, dividing each element by the grand total of frequencies:

    not converted (f)       converted (t)
a   P(V=a,C=f) = 0.4514     P(V=a,C=t) = 0.0486
b   P(V=b,C=f) = 0.4473     P(V=b,C=t) = 0.0527

A table like this is sometimes called a correspondence matrix. Here the table consists of joint probabilities, where V is the version of the web page (a or b) and C is the conversion result (f or t).

Now, our interest is whether the conversion C depends on the page version V, and if it does, to learn which version converts better. In probability theory, the events C and V are said to be independent if the joint probability can be computed by P(V, C) = P(V)P(C), where P(V) and P(C) are marginal probabilities of V and C, respectively. It is straightforward to compute the marginal probabilities from row and column marginals:

    \begin{eqnarray*}   P(V=a) = \frac{4514 + 486}{10000} \quad , \quad P(V=b) = \frac{4473 + 527}{10000} \\   P(C=f) = \frac{4514 + 4473}{10000} \quad , \quad P(C=t) = \frac{486 + 527}{10000}  \end{eqnarray*}

where 10000 is the grand total of all the elements. The null hypothesis (i.e., a neutral hypothesis in which the effect we seek is absent) is that V and C are independent, in which case the elements of the matrix are equivalent to

    not converted (f)   converted (t)
a   P(V=a)P(C=f)        P(V=a)P(C=t)
b   P(V=b)P(C=f)        P(V=b)P(C=t)

The conversion C is said to depend on the version V of the web site if this null hypothesis is rejected, so rejecting the null hypothesis means that one version converts better than the other. This is why the test is one of independence.

The chi-squared test compares an observed distribution O_{ij} to an expected distribution E_{ij}

(1)   \begin{equation*}   \chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \ , \end{equation*}

where i and j are the row and column indices of the matrix (*). The values of E_{ij} are computed from P(V=i) and P(C=j). The \chi^2 statistic thus obtained is then compared to the chi-squared distribution under the null hypothesis, for which we need the degrees of freedom (dof), the shape parameter of the chi-squared distribution. For the test of independence using an r \times c contingency matrix, the dof is the total number of matrix entries (r \times c) minus the reduction in dof, which is r + c - 1: one constraint for each row sum and each column sum, minus one because the last entry is already determined by either its row or column sum and is therefore degenerate. Hence the dof for the test of independence comes out to (r - 1)(c - 1).

Python Implementation

Fortunately, it is very straightforward to carry out this hypothesis test using scipy.stats.chi2_contingency. All we need to do is supply the function with a contingency matrix, and it returns the \chi^2 statistic and the corresponding p-value:
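
(A minimal sketch of that call, with the contingency table entered inline; the original snippet is not reproduced here and presumably read the counts from a CSV file. Note that chi2_contingency applies Yates' continuity correction to a 2 x 2 table by default, consistent with the footnote at the end of this post.)

import numpy as np
from scipy.stats import chi2_contingency

#                     not converted (f)  converted (t)
observed = np.array([[4514, 486],    # version a
                     [4473, 527]])   # version b

chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p)   # roughly 1.76 and 0.185, as discussed below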

The result for the original table (of n = 10000) is \chi^2 = 1.76 and p = 0.185. Since the p-value is greater than the standard threshold of 0.05, we cannot reject the null hypothesis that the page version and the conversion are independent. Therefore the difference in the conversion rates cited at the beginning of this article is not statistically significant.

What if we keep running the same A/B test a bit longer, until we accumulate n = 40000 visitors? Using example data (n40000.csv), we have conversion rates of 2002/20000 = 10.0% for version a and 2258/20000 = 11.3% for version b. Running the same test on the new data yields \chi^2 = 17.1 and p = 3.58 \times 10^{-5}. Since p \ll 0.05, the difference we see in the conversion rates is statistically significant this time. This demonstrates how a bigger sample helps us detect a small difference. (The example data used in this article were generated assuming true conversion rates of 10% for a and 11% for b.)
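
(Continuing the sketch above, the same call with the larger table reconstructed from the quoted counts:)

from scipy.stats import chi2_contingency

#                                not converted  converted
chi2, p, dof, expected = chi2_contingency([[17998, 2002],   # version a
                                           [17742, 2258]])  # version b
print(chi2, p)   # roughly 17.1 and 3.6e-05, as quoted above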




(*) For a 2 x 2 contingency table, Yates' chi-squared test (i.e., the test with Yates' continuity correction) is commonly used. This applies a correction of the form

    \[   \chi^2_{\rm Yates} = \sum_{i,j} \frac{(|O_{ij} - E_{ij}| - 0.5)^2}{E_{ij}}  \]

to account for the error introduced by approximating the discrete observed distribution with the continuous chi-squared distribution.


Brand Positioning by Correspondence Analysis

I was reading an article about visualization techniques using multidimensional scaling (MDS), correspondence analysis in particular. The example used R, but as usual I wanted to find a way to do it in Python, so here goes.

Correspondence analysis is useful when you have a two-way contingency table for which relative values of ratio-scaled data are of interest. For example, I use a table whose rows are fashion brands (Chanel, Louis Vuitton, etc.) and whose columns are attributes expressed by adjectives (luxurious, youthful, energetic, etc.); each cell is the number of people who answered that the brand has that attribute. (I borrowed the data from this article.)

Correspondence analysis (or MDS in general) is a method of reducing dimensions to make the data easier to interpret. In this case, I get a scatter plot of brands and adjectives in a two-dimensional space, in which brands and adjectives that are more closely associated are placed near each other.

[Figure: fashion_brands, the two-dimensional map of brands and attributes]

As you can see, brands like GAP, H&M, and Uniqlo are associated with youth, friendliness, and energy, while old-school brands like Chanel and Tiffany are associated with luxury and brilliance. This kind of visualization is useful because the high-dimensional information (11 brands and 9 attributes) is reduced to a two-dimensional plane on which distance is meaningful.

Here’s the code and data:
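
(The original code and data are not reproduced here. Below is a rough sketch of the core computation, the SVD of the standardized residuals, with a small made-up brand-by-attribute table standing in for the survey data; all names and numbers in it are placeholders.)

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical counts: rows are brands, columns are attributes.
brands = ["Chanel", "GAP", "Uniqlo"]
attributes = ["luxurious", "youthful", "friendly"]
N = np.array([[80, 10,  5],
              [10, 60, 55],
              [ 5, 70, 65]], dtype=float)

P = N / N.sum()                                      # correspondence matrix
r = P.sum(axis=1)                                    # row masses
c = P.sum(axis=0)                                    # column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

F = (U * sv) / np.sqrt(r)[:, None]      # principal coordinates of rows (brands)
G = (Vt.T * sv) / np.sqrt(c)[:, None]   # principal coordinates of columns (attributes)

fig, ax = plt.subplots()
ax.scatter(F[:, 0], F[:, 1], color="b")
ax.scatter(G[:, 0], G[:, 1], color="r")
for (x, y), name in zip(F[:, :2], brands):
    ax.annotate(name, (x, y), color="b")
for (x, y), name in zip(G[:, :2], attributes):
    ax.annotate(name, (x, y), color="r")
plt.show()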


Using Custom Theme with SyntaxHighlighter Evolved

I have been using SyntaxHighlighter Evolved for displaying code snippets on this site. While the WordPress plugin has been working very well, I seem to lose my custom CSS styles every time the updated plugin gets installed. I want to keep a custom CSS that suits the style of my site and keep using it whenever the plugin gets updated.

This is a procedure to create a custom theme and use it with SyntaxHighlighter Evolved (version 3.1.x).

First, download the source from the GitHub repository. I will also create a branch and work on that branch.

$ git clone https://github.com/Viper007Bond/syntaxhighlighter.git
$ cd syntaxhighlighter
$ git branch MyTheme
$ git checkout MyTheme

Next, create a custom CSS file by copying the default style sheet.

$ cp syntaxhighlighter3/styles/shCoreDefault.css syntaxhighlighter3/styles/shCoreMyTheme.css
$ cp syntaxhighlighter3/styles/shThemeDefault.css syntaxhighlighter3/styles/shThemeMyTheme.css

I can make whatever edits I like to the newly created CSS style sheets. If all I want to change is colors, editing shThemeMyTheme.css is enough. To change the style in more detail, I may need to edit shCoreMyTheme.css; for example, if the font size needs adjustment, look for lines specifying the "font-size" property.

Then I register the new theme so that the plugin recognizes it. In syntaxhighlighter.php, look for the parts of the code that look like the following, and append a line for the new theme:

		// Create list of themes and their human readable names
		// Plugins can add to this list: http://www.viper007bond.com/wordpress-plugins/syntaxhighlighter/adding-a-new-theme/
		$this->themes = (array) apply_filters( 'syntaxhighlighter_themes', array(
			'default'    => __( 'Default',      'syntaxhighlighter' ),
			'django'     => __( 'Django',       'syntaxhighlighter' ),
			'eclipse'    => __( 'Eclipse',      'syntaxhighlighter' ),
			'emacs'      => __( 'Emacs',        'syntaxhighlighter' ),
			'fadetogrey' => __( 'Fade to Grey', 'syntaxhighlighter' ),
			'midnight'   => __( 'Midnight',     'syntaxhighlighter' ),
			'rdark'      => __( 'RDark',        'syntaxhighlighter' ),
			'none'       => __( '[None]',       'syntaxhighlighter' ),
			'mytheme'    => __( 'MyTheme',      'syntaxhighlighter' ),
		) );
		// Register theme stylesheets
		wp_register_style(  'syntaxhighlighter-core',             plugins_url( $this->shfolder . '/styles/shCore.css',            __FILE__ ), array(),                         $this->agshver );
		wp_register_style(  'syntaxhighlighter-theme-default',    plugins_url( $this->shfolder . '/styles/shThemeDefault.css',    __FILE__ ), array('syntaxhighlighter-core'), $this->agshver );
		wp_register_style(  'syntaxhighlighter-theme-django',     plugins_url( $this->shfolder . '/styles/shThemeDjango.css',     __FILE__ ), array('syntaxhighlighter-core'), $this->agshver );
		wp_register_style(  'syntaxhighlighter-theme-eclipse',    plugins_url( $this->shfolder . '/styles/shThemeEclipse.css',    __FILE__ ), array('syntaxhighlighter-core'), $this->agshver );
		wp_register_style(  'syntaxhighlighter-theme-emacs',      plugins_url( $this->shfolder . '/styles/shThemeEmacs.css',      __FILE__ ), array('syntaxhighlighter-core'), $this->agshver );
		wp_register_style(  'syntaxhighlighter-theme-fadetogrey', plugins_url( $this->shfolder . '/styles/shThemeFadeToGrey.css', __FILE__ ), array('syntaxhighlighter-core'), $this->agshver );
		wp_register_style(  'syntaxhighlighter-theme-midnight',   plugins_url( $this->shfolder . '/styles/shThemeMidnight.css',   __FILE__ ), array('syntaxhighlighter-core'), $this->agshver );
		wp_register_style(  'syntaxhighlighter-theme-rdark',      plugins_url( $this->shfolder . '/styles/shThemeRDark.css',      __FILE__ ), array('syntaxhighlighter-core'), $this->agshver );
		wp_register_style(  'syntaxhighlighter-theme-mytheme',    plugins_url( $this->shfolder . '/styles/shThemeMyTheme.css',    __FILE__ ), array('syntaxhighlighter-core'), $this->agshver );

Note that in each code snippet I added a line for MyTheme at the end.

Now I upload these files to the WordPress plugin directory on my web host. Before I do that, however, I need to overwrite shCore.css with shCoreMyTheme.css; it looks like the custom core style sheet is not automatically used unless I manually copy it like this:

$ cp syntaxhighlighter3/styles/shCoreMyTheme.css syntaxhighlighter3/styles/shCore.css

Finally, I upload all the files:

$ scp syntaxhighlighter3/styles/shCore.css $MYWEB/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/styles
$ scp syntaxhighlighter3/styles/shCoreMyTheme.css $MYWEB/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/styles
$ scp syntaxhighlighter3/styles/shThemeMyTheme.css $MYWEB/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/styles
$ scp syntaxhighlighter.php $MYWEB/wp-content/plugins/syntaxhighlighter/

where $MYWEB points to the public directory on my web server.

Once this is done properly, the new theme will appear under "Color Theme" on the SyntaxHighlighter plugin settings page in my WordPress admin area.


PCA and Biplot using Python

There are several ways to run principal component analysis (PCA) using various packages (scikit-learn, statsmodels, etc.), or even by rolling your own through singular value decomposition and the like. The PCA result can be visualized with a biplot. I was looking at an example of using prcomp and biplot in R, but there does not seem to be a comparable plug-and-play way of generating a biplot in Python.

As it turns out, generating a biplot from the result of PCA by pcasvd of StatsModels is fairly straightforward, using the rotation matrix supplied by the function. Here is a code snippet:
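
(The original snippet is not reproduced here. The following is a rough sketch of the same idea using a plain NumPy SVD in place of statsmodels' pcasvd, with random data standing in for states.x77 and the k-means coloring omitted; the variable names are mine.)

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))      # hypothetical data: 50 observations, 4 variables
labels = ["var%d" % i for i in range(X.shape[1])]

Xc = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize columns
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U[:, :2] * s[:2]      # observation coordinates on PC1 and PC2
arrows = Vt[:2].T * s[:2]      # variable directions scaled by the singular values

fig, ax = plt.subplots()
ax.scatter(scores[:, 0], scores[:, 1], s=10)
for (x, y), name in zip(arrows, labels):
    ax.arrow(0, 0, x, y, color="r", head_width=0.05)
    ax.text(x * 1.1, y * 1.1, name, color="r")
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
plt.show()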

In addition to PCA, k-means clustering (with three clusters) was run on the data to color the observations by cluster. The resulting biplot for states.x77 (which I exported from R) looks like this:

[Figure: biplot of states.x77, with observations colored by k-means cluster]


Near-duplicate Detection using MinHash: Background

There are numerous pieces of duplicate information served by multiple sources on the web. Many news stories that we receive from the media tend to originate from the same source, such as the Associated Press. When such content is scraped off the web for archiving, a need may arise to categorize documents by their similarity (not in the sense of the meaning of the text, but in the sense of character-level or lexical matching).

Here, we build a prototype near-duplicate document detection system. This article presents the background material on an algorithm called MinHash and on a method for probabilistic dimension reduction through locality-sensitive hashing. A future article presents their implementation in Python and CouchDB.

(Note that all the numbers generated for the tables in this article are totally made up for illustration purposes; they may not be mathematically consistent with any hashing algorithm.)

Jaccard Similarity Index

Similarity is measured by the Jaccard index:

    \[   J(D_i, D_j) = \frac{|D_i \cap D_j|}{|D_i \cup D_j|} \]

where D_i and D_j are sets representing the two documents in our context.

Shingling

A useful way to construct a set representing a document is shingling. To construct the set of k-shingles from a text, a sliding window of k characters is applied over the text. For example, if the text is "abcdabd," the resulting set of 2-shingles is {ab, bc, cd, da, bd} (note that "ab" appears twice in the text but only once in the set).
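
(As a tiny illustration, not taken from the original post, a character shingling helper might look like this:)

def shingles(text, k=2):
    """Return the set of k-character shingles of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

print(shingles("abcdabd"))   # {'ab', 'bc', 'cd', 'da', 'bd'} in some order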

The value of k is arbitrary, but it should be large enough that the probability of any given shingle appearing in a random document is low. That is, if the number of available characters is c and the character length of a typical document is l, we should at least ensure c^k \gg l - k + 1. Since each character appears with a different frequency in typical text, a suitable value of k depends on the nature of the documents and should be tuned accordingly. A good rule of thumb for an order-of-magnitude estimate is to assume c = 20 for English text.

Instead of individual characters, shingles can also be constructed from words. For example, in a math textbook we may often see a sentence beginning with the terse expression "it is trivial to show," whose 3-shingle set is {"it is trivial", "is trivial to", "trivial to show"}. This has the advantage that shingles built this way are more sensitive to the style of writing, which may aid in identifying similarities between domain-specific texts buried in other types of documents.

Hashing Shingles

Typically, shingles are hashed to integers, which groups shingles carrying the same information into a single bucket while giving each a compact representation. The integer representation is a huge advantage in that the data can be stored much more efficiently in terms of bytes. For example, a 4-shingle (of characters) typically uses 4 bytes, one byte per character, which is enough to represent 160,000 distinct 4-shingles (i.e., 20^4). The same 4 bytes interpreted as an integer, however, can represent about 4 billion (2^{32}) values, and therefore that many shingles, which is large enough for k = 7 (i.e., 20^7). If a small probability of collision into the same bucket can be tolerated, k can be chosen even larger. From here on, we assume a random hash function does not produce collisions between any pair of randomly chosen shingles, i.e., the mapping s_i \rightarrow h(s_i) yields a unique integer.
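
(As one possible illustration, again not from the original post, a shingle can be mapped to a 32-bit integer by keeping the first four bytes of a cryptographic hash:)

import hashlib

def shingle_to_int(shingle):
    """Map a shingle to a 32-bit integer using the first 4 bytes of its MD5 digest."""
    digest = hashlib.md5(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

print(shingle_to_int("it is trivial"))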

Characteristic Matrix

Suppose we have a random hash function h(s) and all possible shingles s_1, s_2, \dots, s_m from D_1, D_2, \dots, D_l for a total of l documents. We can summarize this in a characteristic matrix:

          D_1   D_2   ...   D_l
h(s_1)    1     0     ...   1
h(s_2)    1     1     ...   0
...       ...   ...   ...   ...
h(s_m)    1     1     ...   0

where an entry of 1 indicates that document D_j contains a shingle s_i for which the hash value h(s_i) exists. (An entry of 0 means the shingle does not appear in that document.) It is trivial to compute the Jaccard index for any pair of documents from this matrix. In practice, however, having to compare all the hash values across a large number of documents makes the process prohibitively expensive.

MinHash as a Jaccard Index Estimator

Let us focus on a pair of documents, D_1 and D_2, for which the shingles s_1, s_2, \dots, s_7 have been hashed by a function h. The relevant entries from the characteristic matrix look as follows:

          D_1   D_2
h(s_1)    0     0
h(s_2)    1     0
h(s_3)    1     1
h(s_4)    0     1
h(s_5)    1     0
h(s_6)    0     0
h(s_7)    1     1

There are three types of rows: (a) both columns have 1, (b) only one of the columns has 1, and (c) both columns have 0. Let X, Y, and Z denote the numbers of rows of each type, respectively. For D_1 = \left\{h(s_2), h(s_3), h(s_5), h(s_7)\right\} and D_2 = \left\{h(s_3), h(s_4), h(s_7)\right\}, X is the cardinality of their intersection and X + Y is that of their union. Hence the Jaccard index is X / (X + Y) = 2/5 = 0.4.

Now, consider an experiment in which the rows of the matrix are randomly permuted. Remove the rows of type (c), since they contribute nothing to the union of the two sets. Look at the first row of the matrix thus constructed and note its type as defined above, either (a) or (b), and repeat the process n times. What is the chance that the first row found this way is of type (a)? The probability is X / (X + Y), which is exactly how the Jaccard index is computed. This is the property we use to build a Jaccard index estimator.

In practice, randomly permuting a huge number of rows is very inefficient. Instead, we prepare a set of random hash functions h_i(s) (for i = 1, 2, \dots, n, giving a set of n measurements) that effectively permute the row order for the same set of shingles, and sort the rows in ascending order of hash value. (For this to hold, the hash functions need to be well chosen and produce few collisions.) Taking the row with the minimum hash value then corresponds to picking the first row in the experiment above.

What we have shown is that, for estimating Jaccard indices, we only need to keep the minimum hash values generated from n different hash functions. Therefore the very sparse characteristic matrix can be condensed into a signature matrix of minimum hash values with entries given by

    \[   h_i \equiv \min \left\{h_i(s_1), h_i(s_2), \dots, h_i(s_m)\right\} \ , \]

where

    \[   D_j = \left\{ s_1, s_2, \dots, s_m \right\} \]

is the set of shingles from the document D_j.

          D_1     ...   D_j      D_{j+1}   ...   D_l
h_1       98273   ...   23       23        ...   63483
h_2       2763    ...   524      524       ...   2345
...       ...     ...   ...      ...       ...   ...
h_{n-1}   325     ...   567849   567849    ...   987
h_n       876     ...   7849     32        ...   897347

For near-duplicate documents such as D_j and D_{j+1} in the table, most MinHash values agree, and the fraction of entries that agree is an estimate of the Jaccard index. This is the gist of the MinHash algorithm. In other words, the probability p that a pair of MinHash values from two documents D_i and D_j match is equal to their Jaccard index:

(1)   \begin{equation*}   p = J(D_i, D_j) \ . \end{equation*}
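
(Here is a compact sketch of the estimator, assuming the shingles have already been hashed to integers and using a simple random affine hash family; all names and parameters below are mine, not from the original post.)

import random

PRIME = (1 << 61) - 1   # a large Mersenne prime for the hash family

def make_hash_funcs(n, seed=42):
    """n random hash functions of the form h(x) = (a*x + b) mod PRIME."""
    rnd = random.Random(seed)
    return [(rnd.randrange(1, PRIME), rnd.randrange(0, PRIME)) for _ in range(n)]

def minhash_signature(shingle_ints, hash_funcs):
    """One column of the signature matrix: the minimum of each h_i over the shingle set."""
    return [min((a * s + b) % PRIME for s in shingle_ints) for a, b in hash_funcs]

def estimate_jaccard(sig1, sig2):
    """The fraction of matching signature entries estimates the Jaccard index."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

funcs = make_hash_funcs(100)
d1 = {1, 2, 3, 5, 7}   # integer-hashed shingles of document 1
d2 = {1, 3, 7, 8}      # integer-hashed shingles of document 2
print(estimate_jaccard(minhash_signature(d1, funcs), minhash_signature(d2, funcs)))
# should come out near the true Jaccard index, 3/6 = 0.5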

Locality-sensitive Hashing

While the information necessary to compute document similarity has been compressed quite nicely into a signature matrix, examining all documents still requires l(l-1)/2 pairwise comparisons, each involving n signature entries. The vast majority of document pairs may not be near-duplicate, however, in which case we do not need to compare every pair. Locality-sensitive hashing (LSH) offers a method of reducing the number of dimensions in high-dimensional MinHash signatures.

The idea is to partition the MinHash signature matrix into b bands of r rows each (such that n = br is the total number of rows) and to hash the sequence of MinHash integers within each band. For example, if we have n = 12 MinHash values, we could partition them into b = 3 bands of r = 4 rows:

Band            D_1     ...   D_j      D_{j+1}   ...   D_l
1     h_1       98273   ...   23       23        ...   63483
1     h_2       2763    ...   524      524       ...   2345
1     h_3       49368   ...   7207     7207      ...   59542
1     h_4       9559    ...   34784    34784     ...   6095
2     h_5       37153   ...   14927    14927     ...   581
2     h_6       8671    ...   17492    17492     ...   6664
2     h_7       2763    ...   43306    43306     ...   10916
2     h_8       2600    ...   38712    38712     ...   45472
3     h_9       14352   ...   25862    25862     ...   14812
3     h_{10}    263     ...   52       52        ...   11973
3     h_{11}    325     ...   567849   567849    ...   987
3     h_{12}    876     ...   7849     32        ...   897347

Then we use some good hash function H that takes r MinHash values and summarizes them into another hash: H(h_1, h_2, h_3, h_4) \equiv H_1 for band 1, H(h_5, h_6, h_7, h_8) \equiv H_2 for band 2, and so on. This reduces the n \times l signature matrix to a b \times l matrix:

       D_1     ...   D_j     D_{j+1}   ...   D_l
H_1    390     ...   57232   57232     ...   33719
H_2    62509   ...   453     453       ...   51954
H_3    453     ...   13009   23905     ...   12174

Near-duplicate documents will be hashed into the same bucket within each band. In this example, D_j and D_{j+1} share a bucket in bands 1 and 2. (Note that D_1 in band 3 has the same hash value as D_j and D_{j+1} in band 2, but they are not considered to be in the same bucket since the bands are different.) Documents that share a bucket within a band are considered candidates for further examination. The advantage is that, since b \ll n in general, the number of required comparisons is much smaller. LSH thus provides a way to select candidates for near-duplicate detection before full signature comparisons are carried out.
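
(A sketch of the banding step, with helper names of my own choosing: each document's signature is split into b bands of r values, and each band's tuple of values is hashed into a bucket.)

from collections import defaultdict

def lsh_candidate_pairs(signatures, b, r):
    """signatures maps doc_id -> MinHash signature of length n = b * r.
    Returns pairs of doc_ids sharing a bucket in at least one band."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            key = hash(tuple(sig[band * r:(band + 1) * r]))   # the band hash H
            buckets[key].append(doc_id)
        for docs in buckets.values():
            for i in range(len(docs)):
                for j in range(i + 1, len(docs)):
                    candidates.add((docs[i], docs[j]))
    return candidates

# e.g., lsh_candidate_pairs({"d1": sig1, "d2": sig2}, b=25, r=4) with the
# signatures from the previous snippet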

The assumption is that a near-duplicate pair of documents has a total of b chances, one per band, to be hashed into a common bucket. Recall from Eq. (1) that the probability p that a pair of MinHash values from two documents match equals their Jaccard index. The probability that a pair of documents share a bucket in a band of r rows is then p^r, and its complement, 1 - p^r, is the probability that the pair does not get hashed into the same bucket in that band. The probability that the two documents become candidates in at least one band is therefore 1 - (1 - p^r)^b. Plotted for varying b and r, this function forms a series of S-curves:

[Figure: probability of becoming a candidate pair vs. Jaccard similarity, for varying b and r]

The figure provides intuition for how the values of b and r should be chosen for a target Jaccard similarity threshold t (above which two documents are considered near-duplicate). Let f(p) = 1 - (1 - p^r)^b. The value of p at the steepest slope is obtained from the second derivative, d^2 f / dp^2 |_{p = p_t} = 0, which gives

    \[   p_t = \left(\frac{r-1}{br-1}\right)^{1/r}    \approx \left(\frac{1}{b}\right)^{1/r}  \]

for b, r \gg 1. As a rule of thumb, we want p_t \approx t, but the exact value of p_t can be adjusted based on the rejection criteria: choosing p_t > t reduces false positives, whereas p_t < t reduces false negatives at the candidate selection step.
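
(As a quick check, for a hypothetical choice of b = 25 bands of r = 4 rows, i.e., n = 100:)

b, r = 25, 4
p_exact = ((r - 1) / (b * r - 1)) ** (1 / r)   # exact location of the steepest slope
p_approx = (1 / b) ** (1 / r)                  # the rule-of-thumb approximation
print(p_exact, p_approx)                       # roughly 0.42 and 0.45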

References

  • Anand Rajaraman and Jeffrey David Ullman (2011). Mining of Massive Datasets. Cambridge University Press. ISBN 978-1-107-01535-7

A Trick for Computing the Sum of Geometric Series

Say I need to compute the sum of a series like this one:

(1)   \begin{equation*}   y = 1 + 2x + 3x^2 + 4x^3 + \dots \ , \end{equation*}

where |x| < 1. This series looks like a geometric series, in which case the sum could be computed from

    \[ \sum_{k = 0}^{\infty} a x^k = \frac{a}{1 - x} \ . \]

The coefficients vary, however, so the relation cannot be used directly.

There is a trick to transform a series like this into a geometric series. Multiplying Eq. (1) by x, I have

(2)   \begin{equation*}   xy = x + 2x^2 + 3x^3 + \dots \ . \end{equation*}

Subtracting Eq. (2) from Eq. (1), I get

    \[   y(1 - x) = 1 + x + x^2 + x^3 + \dots  \]

which is a geometric series with a = 1. Hence

    \[   y = \frac{1}{(1 - x)^2} \ . \]
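
(A quick numerical sanity check of the result, truncating the series after many terms:)

x = 0.3
partial = sum((k + 1) * x**k for k in range(100))
print(partial, 1 / (1 - x) ** 2)   # both come out to about 2.0408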


Building a Custom Debian/Wheezy Box

Even though I've been using computers for a long time (my first was an NEC PC-8801 back in 1988), I had never actually built a custom desktop myself. I was in need of a powerful Linux machine and wanted one cheaply, so I decided to try building one myself.

With help from this Lifehacker article and PCPartPicker, which is great since it can find good deals around the web and also checks for hardware compatibility, I decided on the following components:

[Screenshot: custompc, the part list from PCPartPicker]

Note that the prices get updated often, so they do not necessarily reflect those when I placed orders.

Intel Core i5-4670K 3.4GHz Quad-Core (BX80646I54670K)                   $228.99
Gigabyte GA-Z87X-UD3H ATX LGA1150 (GA-Z87X-UD3H)                        $159.99
Crucial Ballistix Tactical 8GB (1 x 8GB) DDR3-1600 (BLT8G3D1608ET3LX0)   $72.99
Kingston SSDNow V300 Series 120GB 2.5″ SSD (SV300S37A/120G)              $65.99
Western Digital Caviar Black 1TB 3.5″ 7200RPM (WD1002FAEX)              $104.99
Corsair 300R ATX Mid Tower (CC-9011014-WW)                               $69.99
Corsair 600W ATX12V (CP-9020060-NA)                                      $64.99
SilverStone 3.5″ to 2.5″ Bay Converter (SST-SDP08B)                       $9.99

The parts came out fairly cheap for the power, though that is not surprising given that I skipped the graphics card, since I won't be gaming on this machine for now. (I also got the SilverStone 3.5″ to 2.5″ bay converter (SST-SDP08B) for the SSD, for about $10.) Anyway, this is pretty much the cheapest desktop I have ever had, and probably among the most powerful, in the $700 – $800 range. Not bad.

[Photo: the components]

Here are some pictures during the assembling process…

[Photos: assembly in progress]

This actually powers on…

[Photo: first power-on]

and getting to the POST message.

[Photo: the POST screen]

There was nothing technically difficult about the whole process, but it still took me a couple of hours, since I wanted to make sure all the connections were done properly and to avoid frying anything. :P

At least it powered on the first time, which is great.

Updating the BIOS

The motherboard is a Gigabyte GA-Z87X-UD3H. It's pretty impressive how much customization can be done in the BIOS.

It is not absolutely necessary, but updating the BIOS tends to be less risky before a lot of different pieces of hardware and software are set up, so I decided to update it now. I downloaded the BIOS update from the Gigabyte web site; there was an upgrade (F8) to the installed version (F7).

Unfortunately, only the .exe files seem available, which normally would only work on Windows. Fortunately there is a Linux utility to extract the BIOS image from the archive:

$ sudo aptitude install p7zip
$ 7za x mb_bios_ga-z87x-ud3h_f8.exe

One of the files extracted is Z87XUD3H.F8, which is the BIOS image for the version I want to update to. This file is saved to a USB stick.

Under Save & Exit in the UEFI DualBIOS menu, select Q-Flash -> Run. Choose the BIOS image on the USB stick, and within a couple of minutes (during which the power must stay on) the BIOS will be updated.

Installing Debian/Wheezy

I have done Debian/Ubuntu installations so many times now that I expect things to go smoothly for the most part. However, it turned out that the Gigabyte GA-Z87X-UD3H comes with a newer Ethernet controller (Intel I-217V Gigabit Ethernet Controller) that the Debian/Wheezy installer does not currently support. As a result, I could not install via the standard netinst medium.

I opted to put a full installer DVD ISO onto a USB stick and do a bare-minimum installation. The quickest way to create an install USB stick is like this, but in this specific instance I found it easier to use UNetbootin to create a bootable USB stick for Debian/Wheezy (stable) and simply copy the installer DVD ISO file onto the same stick (just the first DVD medium is sufficient). This is because I will need to mount the ISO image as a CD-ROM source so that apt-get can install a couple of packages needed to build the updated e1000e network driver downloaded from Intel, which gets the network working.

Boot off the install USB stick and go through as follows. On the screen that says “Debian GNU/Linux 7.4.0 (Wheezy)”, choose Install (or Graphical Install).

  • Language: English
  • Country, territory, or area: United States
  • Keymap to use: American English
  • Driver needed by your Ethernet card: no ethernet card (Note e1000e is the one that should work, if the driver is up-to-date, which it is not on Wheezy)
  • No network interfaces detected: Continue
  • Hostname: (your choice)
  • Root password: (your choice)
  • Full name for the new user: (your choice)
  • Username for your account: (your choice)
  • Choose a password for the new user: (your choice)
  • Select your time zone: (your choice)

I usually manually partition disks.

SCSI1 (0,0,0) (sda) - 120.0 GB ATA KINGSTON SV300S3
                     1.0 MB        FREE SPACE
      #1           199.2 MB  B  f  EFIboot
      #2             1.0 GB     f  ext4    /
      #3           200.3 MB     f  ext4    /boot
      #4            10.0 GB     f  ext4    /tmp
      #5            10.0 GB     f  ext4    /usr
      #6            10.0 GB     f  ext4    /usr/local
      #7            10.0 GB     f  ext4    /var
      #8            78.6 GB     f  ext4    /home
                   466.4 kB        FREE SPACE
SCSI2 (0,0,0) (sdb) - 1.0 TB ATA WDC WD1002FAEX-0
      #5  logical   33.0 GB     F  swap    swap
      #6  logical  967.2 GB     F  ext4    /scr1
  • Use a network mirror: No (I will add this back once the network is configured)
  • Participate in the package usage survey: No
  • Choose software to install: SSH server (I use the box remotely often) and Standard system utilities

Reboot once the installation finishes.

Fixing the Network Interface

Log in as root.

Download the up-to-date e1000e source from here. Transfer the file to the new Debian box somehow (using the same USB stick used for installation works and is assumed here), then extract and compile as follows. Assuming the USB stick is connected as /dev/sdc,

# cd /root
# mkdir usb
# mount -t vfat /dev/sdc1 usb

This just mounts the USB stick at /root/usb. Mount the installer DVD ISO at /media/cdrom, and install a couple packages necessary to build an updated module e1000e:

# mount -o loop usb/debian-7.4.0-amd64-DVD-1.iso /media/cdrom
# apt-get install build-essential linux-headers-amd64

Build and install the module:

# cp usb/e1000e-3.0.4.1.tar.gz .
# tar -xvzf e1000e-3.0.4.1.tar.gz
# cd e1000e-3.0.4.1/src
# make
# make install
# rmmod e1000e
# modprobe e1000e

Check and see if the interface appears:

# ip addr
... should be an entry for eth0 ...

Since this is a desktop machine that is always connected to the network, I may add the following entry to /etc/network/interfaces:

auto eth0
iface eth0 inet dhcp

so that the interface comes up on each boot. Reboot, and the network should be working.

Now that the on-board NIC is working, make sure the remote sources are added to /etc/apt/sources.list:

deb http://ftp.us.debian.org/debian/ wheezy main non-free contrib
deb http://security.debian.org/ wheezy/updates main contrib non-free
deb-src http://security.debian.org/ wheezy/updates main contrib non-free

It is a good idea to comment out the lines for CD-ROM sources at this point.

# apt-get update
# apt-get upgrade

Finally, I have a working Debian/Wheezy box!

I prefer to update the kernel at this point, partly because a newer kernel has an updated e1000e that works right out of the box, so when Debian updates the kernel in stable (for whatever reason) I don't have to worry about recompiling the driver from source. Plus, version 3.2 is quite old by now.

Wake on LAN (WoL)

Unfortunately, the GA-Z87X-UD3H motherboard seems to have a problem shutting down properly when Wake on LAN is enabled in the BIOS (note that this could be a driver issue). I turned off WoL in the BIOS.
