Near-duplicate Detection using MinHash: Background

There are numerous pieces of duplicate information served by multiple sources on the web. Many news stories that we receive from the media tend to originate from the same source, such as the Associated Press. When such contents are scraped off the web for archiving, a need may arise to categorize documents by their similarity (not in the sense of meaning of the text but the character-level or lexical matching).

Here, we build a prototype for near-duplicate document detection system. This article presents the background material on an algorithm called MinHash and a method for probabilistic dimension reduction through the locality-sensitive hashing. A future article presents their implementation on Python and CouchDB.

(Note that all the numbers generated for the tables in this article are totally made up for illustration purposes; they may not be mathematically consistent with any hashing algorithm.)

Jaccard Similarity Index

A similarity is represented by the Jaccard index:

    \[   J(D_i, D_j) = \frac{|D_i \cap D_j|}{|D_i \cup D_j|} \]

where D_i and D_j are sets representing the two documents in our context.

Shingling

A useful way to construct a set representing a document is by shingling. To construct a set of k-singles from a text, a sliding window of k characters is applied over the text. For example, if the text is “abcdabd,” the resulting set of 2-shingles is {ab, bc, cd, da, bd} (note that “ab” appears only once and not repeated in the set).

The value of k is arbitrary, but it should be large enough that the probability of any given shingle appearing in any random document is low. That is, if the available number of characters is c and the character length of typical documents is l, we should at least ensure c^k \gg l-k+1. Since each character has a different frequency of appearance in a typical text, an suitable value of k depends on the nature of the documents and should be tuned accordingly. A good rule of thumb for an order of magnitude estimate is to assume c = 20 for English texts.

Instead of using individual characters, shingles can also be constructed from words. For example, in a math text book we may often see a sentence beginning with a terse expression “it is trivial to show,” whose 3-shingle set is {“it is trivial”, “is trivial to”, “trivial to show”}. This has advantage in that shingles built this way are more sensitive to the styles of writing. The style sensitivity may aid in identifying similarities between domain-specific texts buried in other types of documents.

Hashing Shingles

Typically, shingles are hashed so that those who contain similar information are grouped into a single bucket while being represented as an integer. The latter is a huge advantage in that the data can be compressed more efficiently in terms of bytes. For example, a 4-shingle (of characters) typically uses 4 bytes, each byte used for character representation, and this is good for representing 160,000 4-shingles (i.e., 20^4). With a 4-byte, however, about 4 million (2^{32}) integers and therefore shingles could be represented, which is good enough size for k = 7 (i.e., 20^7). If a small probability for collision into the same bucket can be tolerated, k can be chosen even larger. From here on, we assume a random hash function does not produce any collision between any pair of randomly chosen shingles, i.e., the mapping s_i \rightarrow h(s_i) yields a unique integer.

Characteristic Matrix

Suppose we have a random hash function h(s) and all possible singles s_1, s_2, \dots, s_m from D_1, D_2, \dots, D_l for a total of l documents. We can summarize this in a characteristic matrix:

D_1 D_2 \dots D_l
h(s_1) 1 0 \dots 1
h(s_2) 1 1 \dots 0
\vdots \vdots \vdots \dots \vdots
h(s_m) 1 1 \dots 0

where the entry of 1 indicates that the document D_j contains a shingle s_i for which a hash value h(s_i) exists. (The entry of 0 means the shingle itself does not appear in that document.) It is trivial to compute Jaccard indices using any pair of documents from this matrix. In practice, however, the requirement for comparing all the hash values for a large number of documents makes the process prohibitive.

MinHash as a Jaccard Index Estimator

Let us focus on a pair of documents, D_1 and D_2, for which the shingles s_1, s_2, \dots, s_7 have been hashed by a function h. The relevant entries from the characteristic matrix look as follows:

D_1 D_2
h(s_1) 0 0
h(s_2) 1 0
h(s_3) 1 1
h(s_4) 0 1
h(s_5) 1 0
h(s_6) 0 0
h(s_7) 1 1

There are three types of rows: (a) both columns have 1, (b) one of the columns have 1, and (c) both columns have 0. We let X, Y, and Z denote the numbers of rows categorized this way, respectively. For D_1 = \left\{h(s_2), h(s_3), h(s_5), h(s_7)\right\} and D_2 = \left\{h(s_3), h(s_4), h(s_7)\right\}, X is the cardinality of their joint set and Y is that for their union. Hence the Jaccard index is X / (X + Y) = 2/5 = 0.4.

Now, consider an experiment in which the rows in the matrix are randomly permutated. Remove the rows of type (c), since they do not contribute at all to the union of two sets. We look at the first row of the matrix thus constructed and note its type defined above, either (a) or (b). Repeat the process n times. What is the chance that the first row found this way to be of type (a) above? The probability is given by X / (X + Y), which is similar to the way Jaccard index is computed. This is the property that we use as a Jaccard index estimator.

In practice, randomly permuting a huge number of rows is very inefficient. Instead, we prepare a set of random hash functions h_i(s) (for i = \left\{1, 2, \dots, n\right\} for a set of n measurements) that effectively permute the row order given the same set of shingles and sort rows in ascending order by hash values. (In order for this to be true, the hash functions need to be well-chosen and produce few collisions.) The row of the minimum hash value corresponds to picking the first row in the example above.

What we have shown is that, for estimating Jaccard indices, we only need to keep the minimum hash values generated from n different hash functions. Therefore the very sparse characteristic matrix can be condensed into a signature matrix of minimum hash values with entries given by

    \[   h_i \equiv \min \left\{h_i(s_1), h_i(s_2), \dots, h_i(s_m)\right\} \ , \]

where

    \[   D_j = \left\{ s_1, s_2, \dots, s_m \right\} \]

is the set of shingles from the document D_j.

D_1 \dots D_j D_{j+1} \dots D_l
h_1 98273 \dots 23 23 \dots 63483
h_2 2763 \dots 524 524 \dots 2345
\vdots \vdots \dots \vdots \vdots \dots \vdo&</td>
</tr>

<tr><td >#116;s
h_{n-1} 325 \dots 567849 567849 \dots 987
h_n 876 \dots 7849 32 \dots 897347

For supposedly near-duplicate documents such as D_j and D_{j+1} in the table, most MinHash values are similar, and the fraction of values that are similar is an estimate of the Jaccard index. This is the gist of the MinHash algorithm. In other words, the probability p that a pair of MinHash values from two documents D_i and D_j match is equivalent to their Jaccard index:

(1)   \begin{equation*}   p = J(D_i, D_j) \ . \end{equation*}

Locality-sensitive Hashing

While the information necessary to compute document similarity have been compressed quite nicely into a signature matrix, examining all documents would take l(l-1)/2 pairs, each involving n comparisons from signature entries. The vast majority of documents may not be near-duplicate, however, and in such a case we do not need every pair to be compared. Locality-sensitive hashing (LSH) offers a method of reducing the number of dimensions in high-dimensional MinHash signatures.

The idea is to partition a MinHash signature matrix into b bands, each with r rows (such that n = b r is the total number of rows), and hashing MinHash integer sequences grouped by band. For example, if we have n = 12 MinHash values, we could partition them into b = 3 bands of r = 4 rows:

Band D_1 \dots D_j D_{j+1} \dots D_l
1 h_1 98273 \dots 23 23 \dots 63483
1 h_2 2763 \dots 524 524 \dots 2345
1 h_3 49368 \dots 7207 7207 \dots 59542
1 h_4 9559 \dots 34784 34784 \dots 6095
2 h_5 37153 \dots 14927 14927 \dots 581
2 h_6 8671 \dots 17492 17492 \dots 6664
2 h_7 2763 \dots 43306 43306 \dots 10916
2 h_8 2600 \dots 38712 38712 \dots 45472
3 h_9 14352 \dots 25862 25862 \dots 14812
3 h_{10} 263 \dots 52 52 \dots 11973
3 h_{11} 325 \dots 567849 567849 \dots 987
3 h_{12} 876 \dots 7849 32 \dots 897347

Then we use some good hash function H which takes r MinHash values and summarizes into another hash, H(h_1, h_2, h_3, h_4) \equiv H_1 for band 1, H(h_5, h_6, h_7, h_8) \equiv H_2 for band 2, and so on. This reduces the n \times l signature matrix into b \times l matrix:

D_1 \dots D_j D_{j+1} \dots D_l
H_1 390 \dots 57232 57232 \dots 33719
H_2 62509 \dots 453 453 \dots 51954
H_3 453 \dots 13009 23905 \dots 12174

Near-duplicate documents will be hashed into the same bucket within each band. In this example, D_j and D_{j+1} are in the same bucket for bands 1 and 2. (Note that D_1 in band 3 has the same hash value as D_j and D_{j+1} in band 2, but they are not considered to be in the same bucket since the bands are different.) The documents that share a bucket within a band is considered candidates for further examination. The advantage is that, since b \ll n in general, the number of required comparisons is much smaller. The LSH thus provides a way to select out candidates for near-duplicate detection, before full signature comparisons are carried out.

The assumption is that a pair of documents, if near-duplicate, has a total of b chances to be hashed into a common bucket in at least one of the available bands. Recall from Eq.~(1) that the probability that a pair of MinHash values from two documents match is equivalent to their Jaccard index. The probability that a pair of documents share a bucket in a band of r rows is given by p^r. Its complement, 1 - p^r, is the probability that a document pair does not get hashed into the same bucket for a band. Then the probability that two documents become candidates in at least one band is given by 1 - (1 - p^r)^b. Plotting for varying b and r, the function forms a series of S-curves:

prob_vs_jaccard

The figure provides an intuition as to how the value of b and r should be chosen for a target Jaccard similarity threshold t (above which two documents are considered near-duplicate). Let f(p) = 1 - (1-p^r)^b. The value of p for the steepest slope is obtained from the second derivative, d^2f/dp^2 |_{p=p_t} = 0, which is

    \[   p_t = \left(\frac{r-1}{br-1}\right)^{1/r}    \approx \left(\frac{1}{b}\right)^{1/r}  \]

for b, r \gg 1. As a rule of thumb, we want want p_t \approx t, but the exact value of p_t can be adjusted based on rejection criteria. Choosing p_t > t reduces false positives, whereas p_t < t reduces false negatives at the candidate selection step.

References

  • Anand Rajaraman and Jeffrey David Ullman (2011). Mining of Massive Datasets. Cambridge University Press. ISBN 978-1-107-01535-7
Posted in Uncategorized | Tagged , | Leave a comment

A Trick for Computing the Sum of Geometric Series

Say if I need to compute the sum of a series like this one:

(1)   \begin{equation*}   y = 1 + 2x + 3x^2 + 4x^3 + \dots \ , \end{equation*}

where |x| < 1. This series looks like a geometric series in which case the sum can be computed from

    \[ \sum_{k = 0}^{\infty} a x^k = \frac{a}{1 - x} \ . \]

The coefficients vary, so the relation cannot be directly used.

There is a trick transform a series like this into a geometric series. Multiplying Eq. (1) by x, I have

(2)   \begin{equation*}   xy = x + 2x^2 + 3x^3 + \dots \ , \end{equation*}

Subtracting Eq. (2) from Eq. (1), we write

    \[   y(1 - x) = 1 + x + x^2 + x^3 + \dots  \]

which is a geometric series with a = 1. Hence

    \[   y = \frac{1}{(1 - x)^2} \ . \]

Posted in Uncategorized | Tagged | Leave a comment

Building a Custom Debian/Wheezy Box

Even though I’ve been using computers for so long (my first was NEC PC-8801 back in 1988), I have never actually build a custom desktop myself. I am in need of a powerful Linux machine and wanted to have one cheaply, so I decided to try building one myself.

With a help from this Lifehacker article and PCPartPicker, which is great since it can find good deals around the web and also check for hardware compatibility, I decided on the following components:

custompc

Note that the prices get updated often, so they do not necessarily reflect those when I placed orders.

Intel Core i5-4670K 3.4GHz Quad-Core (BX80646I54670K) 228.99
Gigabyte GA-Z87X-UD3H ATX LGA1150 (GA-Z87X-UD3H) 159.99
Crucial Ballistix Tactical 8GB (1 x 8GB) DDR3-1600 (BLT8G3D1608ET3LX0) 72.99
Kingston SSDNow V300 Series 120GB 2.5″ SSD (SV300S37A/120G) 65.99
Western Digital Caviar Black 1TB 3.5″ 7200RPM (WD1002FAEX) 104.99
Corsair 300R ATX Mid Tower (CC-9011014-WW) 69.99
Corsair 600W ATX12V (CP-9020060-NA) 64.99
SilverStone 3.5″ to 2.5″ Bay Converter (SST-SDP08B) 9.99

They came out fairly cheap for the power, though it is not surprising given that I skipped on graphic cards, since I won’t be gaming on this for now. (I also got SilverStone 3.5″ to 2.5″ Bay Converter (SST-SDP08B) for SSD for about $10.) Anyways, this is pretty much the cheapest desktop that I have had, and it is probably among the most powerful, in $700 – $800 range. Not bad.

components

Here are some pictures during the assembling process…

assemble01

assemble02

assemble03

assemble04

assemble05

assemble06

assemble07

assemble08

assemble09

assemble10

assemble11

This actually powers on…

assemble12

and getting to the POST message.

assemble13

There was nothing really technically difficult about the whole process, but it still took a couple hours for me since I needed to make sure all the connections are done properly, and wanted to avoid frying anything. :P

At least it powered on the first time, which is great.

Updating the BIOS

The motherboad is Gigabyte GA-Z87X-UD3H. It’s pretty impressive much much customization can be done in BIOS.

It is not absolutely necessary, but when updating BIOS tends to be less dangerous before dealing with a lot of different pieces of hardware, so I decided to update it. Download the BIOS update from the Gigabyte web site. I find an upgrade (F8) to the version installed (F7).

Unfortunately, only the .exe files seem available, which normally would only work on Windows. Fortunately there is a Linux utility to extract the BIOS image from the archive:

$ sudo aptitude install p7zip
$ 7za x mb_bios_ga-z87x-ud3h_f8.exe

One of the files extracted is Z87XUD3H.F8, which is the BIOS image for the version I want to update to. This file is saved to a USB stick.

Under Save & Exit in the UEFI DualBIOS menu, select Q-Flash -> Run. Choose the BIOS image on the USB stick and within a couple minutes (during which the power must be on), the BIOS will be updated.

Installing Debian/Wheezy

I have done Debian/Ubuntu installation so many times now that I expect things to go smoothly for the most part. However, it turned out that the Gigabyte GA-Z87X-UD3H came with a more recent Ethernet controller (Intel I-217V Gigabit Ethernet Controller) which Debian/Wheezy Installer does not support currently. As a result, I could not install via the standard netinst medium.

I opted to burn a full installer DVD ISO onto a USB stick and do the bare minimum installation. The quickest way to create an install USB stick is like this, but I find it easier in this specific instance to use UNetbooting to create a bootable USB stick for Debian/Wheezy (stable), and simply copy the installer DVD ISO file onto the same stick (just the first DVD medium is sufficient). This is because I will need to mount the ISO image as a CD-ROM partition to make use of

apt-get

a couple packages that I need to install when I update the e1000e network driver downloaded from Intel to get the network working.

Boot off the install USB stick and go through as follows. On the screen that says “Debian GNU/Linux 7.4.0 (Wheezy)”, choose Install (or Graphical Install).

  • Language: English
  • Country, territory, or area: United States
  • Keymap to use: American English
  • Driver needed by your Ethernet card: no ethernet card (Note e1000e is the one that should work, if the driver is up-to-date, which it is not on Wheezy)
  • No network interfaces detected: Continue
  • Hostname: (your choice)
  • Root password: (your choice)
  • Full name for the new user: (your choice)
  • Username for your account: (your choice)
  • Choose a password for the new user: (your choice)
  • Select your time zone: (your choice)

I usually manually partition disks.

SCSI1 (0,0,0) (sda) - 120.0 GB ATA KINGSTON SV300S3
                     1.0 MB        FREE SPACE
      #1           199.2 MB  B  f  EFIboot
      #2             1.0 GB     f  ext4    /
      #3           200.3 MB     f  ext4    /boot
      #4            10.0 GB     f  ext4    /tmp
      #5            10.0 GB     f  ext4    /usr
      #6            10.0 GB     f  ext4    /usr/local
      #7            10.0 GB     f  ext4    /var
      #8            78.6 GB     f  ext4    /home
                   466.4 kB        FREE SPACE
SCSI2 (0,0,0) (sdb) - 1.0 TB ATA WDC WD1002FAEX-0
      #5  logical   33.0 GB     F  swap    swap
      #6  logical  967.2 GB     F  ext4    /scr1
  • Use a network mirror: No (I will add this back once the network is configured)
  • Participate in the package usage survey: No
  • Choose software to install: SSH server (I use the box remotely often) and Standard system utilities

Reboot once the installation finishes.

Fixing the Network Interface

Log in as root.

Download the up-to-date version of e1000e source from here. Transfer the file somehow to the new Debian box (using the same USB stick used for installation works and assumed here), extract, and compile as follows. Assuming the USB stick is connected at /dev/sdc,

# cd /root
# mkdir usb
# mount -t vfat /dev/sdc1 usb

This just mounts the USB stick at /root/usb. Mount the installer DVD ISO at /media/cdrom, and install a couple packages necessary to build an updated module e1000e:

# mount -o loop usb/debian-7.4.0-amd64-DVD-1.iso /media/cdrom
# apt-get install build-essential linux-headers-amd64

Build and install the module:

# cp usb/e1000e-3.0.4.1.tar.gz .
# tar -xvzf e1000e-3.0.4.1.tar.gz
# cd e1000e-3.0.4.1/src
# make
# make install
# rmmod e1000e
# modprobe e1000e

Check and see if the interface appears:

# ip addr
... should be an entry for eth0 ...

Since this is a desktop machine always connected to the network, I may add the following entry in /etc/network/interfaces:

auto eth0
iface eth0 inet dhcp

to look for the internet connection on each boot. Reboot and the network should be working.

Now that the on-board NIC is working, make sure the remote sources are added to /etc/apt/sources.list

deb http://ftp.us.debian.org/debian/ wheezy main non-free contrib
deb http://security.debian.org/ wheezy/updates main contrib non-free
deb-src http://security.debian.org/ wheezy/updates main contrib non-free

It is a good idea to comment out the lines for CD-ROM sources at this point.

# apt-get update
# apt-get upgrade

Finally, this is the working Debian/Wheezy box!

I prefer to update the kernel at this point. Partly that is because the newer kernel has the updated e1000e which works right out of the box, so when Debian updates kernel in stable (for whatever reason) I don’t have to worry about recompiling the driver from source. Plus, version 3.2 is quite old by now.

Wake on LAN (WoL)

Unfortunately, GA-Z87X-UD3H the motherboard seems to have problem shutting down properly when Wake on LAN is enabled on BIOS (note that this could be a driver issue). I turned off WoL on BIOS.

Posted in Uncategorized | Tagged , , , | Leave a comment

Creating Debian Wheezy Installation USB Stick

The simplest way to create a bootable USB stick for Debian installation is to use command dd to copy an ISO image to an empty USB stick.

First, connect an empty USB stick to a working Linux box, and see which device represents the stick:

$ sudo fdisk -l
...
Disk /dev/sdb: 16.0 GB, 16008609792 bytes
255 heads, 63 sectors/track, 1946 cylinders, total 31266816 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xb745c02d

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1              32    31266815    15633392    c  W95 FAT32 (LBA)

So, on my computer, /dev/sdb is the USB stick. Whatever is on it will be deleted and overwritten in this procedure, so make sure that I use a stick with large enough disk capacity specifically for this purpose.

When I use netinst installer, about 300 MB would be enough. When the access to the network is not guaranteed during the installation (e.g., the network adapter is not properly recognized by the Debian Installer), it is desired to at least use the full CD/DVD media, which would take about 4 GB (just the first medium should be enough for bare minimal installation, after which more full configuration can be resumed).

Download an ISO image for the Debian installer:

$ wget http://cdimage.debian.org/debian-cd/7.4.0/amd64/iso-cd/debian-7.4.0-amd64-netinst.iso

Then copy the ISO image to the USB stick:

$ sudo dd if=debian-7.4.0-amd64-netinst.iso of=/dev/sdb bs=4M
$ sudo sync
Posted in Uncategorized | Tagged , , | Leave a comment

f.lux for Linux

I am fond of using f.lux to help myself get ready for sleep earlier, which I always end up failing.

Anyways to install xflux,

$ wget https://justgetflux.com/linux/xflux64.tgz
$ tar -xvzf xflux64.tgz
$ sudo mv xflux /usr/local/bin

I use KDE and want xflux started automatically on boot. I can do this by System Settings -> System Administration -> Startup and Shutdown -> Autostart -> Add Program. Specify the command /usr/local/bin/xflux, and once the entry is created under Desktop File, you add a command line option by clicking on the entry -> Properties -> Application -> Command, which should look like

/usr/local/bin/xflux -z YOURZIP

Simply specifying your zip code (in US) should work.

Posted in Uncategorized | Tagged , , , | Leave a comment

Setting up MySQL Database from .sql Dump on DotCloud

Here I assume that MySQL service is running as db for my application named myapp. If I don’t remember, the password can be obtained as follows:

$ dotcloud info myapp.db
... a bunch of info, one of which is passwd ...

If I have a MySQL dump file named somedb.sql, do

$ dotcloud run myapp.db -- mysql -u root -pMYMYSQLROOTPASSWOD < somedb.sql

This is it. It is probably a good idea to create a regular MySQL user.

Posted in Uncategorized | Tagged , | Leave a comment

Customizing & Installing Linux Kernel on Debian Wheezy

Here is a quickie for customizing and install Linux kernel 3.5.x on Wheezy.

Add yourself (with account username) to sudoer group:

# adduser username sudo

You need to logout and login for this change to take effect. You also need to be able to use sudo or su to install the new kernel in the end.

Install some packages and kernel source:

$ sudo aptitude update
$ sudo aptitude install kernel-package pkg-config bzip2 g++ libncurses5-dev fakeroot
$ sudo aptitude install libqt4-dev

(libqt4-dev is only necessary if you wish to use make xconfig instead of make menuconfig below.)

The kernel packaged for Wheezy is at version 3.2. Source code for a more up-to-date kernel version can be obtained from kernel.org and saved somewhere. You just need to expanding the compressed source somewhere (even under your home directory, which is the case in this post).

Either way, the customization procedure is similar; when you expand the source tree (e.g., the tar statement below), you just need to specify the source tarball you wish to use. Here I just follow the Debian way:

$ sudo aptitude install linux-source-3.2

Extract the source tree:

$ cd ~/src
$ tar -jxf /usr/src/linux-source-3.2.tar.bz2
... or use tar -Jxf for .tar.xz file ... 
$ ln -s linux-source-3.2 linux
$ cd linux

Edit the EXTRAVERSION entry in Makefile, as in:

EXTRAVERSION = .20120929.1

for example to add .20120929.1 to the kernel version number. This is convenient for keeping the existing, working kernels around when you need to recompile with different options.

Use xconfig or menuconfig to customize the kernel options. Before the make-kpkg lines, setting concurrency (most likely to the number of cores of my processor) is optinal but having a higher number typically reduces the compilation time.

$ make mrproper
$ make xconfig    # or make menuconfig
$ export CONCURRENCY_LEVEL=2    # for dual-core; this is optional
$ fakeroot make-kpkg clean
$ fakeroot make-kpkg --initrd kernel_image
$ sudo dpkg -i ../linux-image-3.2.20120929.*_amd64.deb

Upon reboot in the GRUB menu you will find the newly installed kernel:

$ sudo reboot

Purging Old Kernel Image from System

For example, if the kernel to be uninstalled is of version 2.6.26 and the extra version that I used was 20091112.1, do:

$ sudo dpkg -P linux-image-2.6.26.20091112.1

That’s it. However it is often a good idea to keep at least one kernel image that I know for sure to work so that when a custom kernel fails, I have something to fall back on. On the other hand, it is also a good idea to purge very old kernel images to save space in /boot.

Posted in Uncategorized | Tagged , , | Leave a comment

Power Management on Lenovo T430s with Debian Wheezy

Since tp_smapi does not work properly on T430s, I am using TLP this time. I still want some traditional tools

$ sudo aptitude install powertop

but not others:

$ sudo aptitude remove --purge laptop-mode-tools

To install, follow the instruction. In short, first update the APT source list by adding the following line to /etc/apt/sources.list:

deb http://ppa.launchpad.net/linrunner/tlp/ubuntu lucid main

and run

$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 02D65EFF
$ sudo aptitude update
$ sudo aptitude install tlp tlp-rdw tp-smapi-dkms smartmontools ethtool acpi-call-tools

Upon reboot, TLP starts running.

I am impressed by the ease of installation and how effective TLP is on configuring T430s to star saving from the get go. According to powertop, T430s is already using just 7 - 8 W on battery. That's several watts below I had been using on other ThankPads!!

Controlling the behavior of TPL is done by modifying /etc/default/tlp. I mostly use the default except the following changes:

...

# Main battery (values in %)
START_CHARGE_THRESH_BAT0=75
STOP_CHARGE_THRESH_BAT0=90
# Ultrabay battery (values in %)
START_CHARGE_THRESH_BAT1=75
STOP_CHARGE_THRESH_BAT1=90

...

# Radio devices to disable on connect
DEVICES_TO_DISABLE_ON_LAN_CONNECT="wifi wwan"
DEVICES_TO_DISABLE_ON_WIFI_CONNECT="wwan"
DEVICES_TO_DISABLE_ON_WWAN_CONNECT="wifi"

# Radio devices to enable on disconnect
DEVICES_TO_ENABLE_ON_LAN_DISCONNECT="wifi"
DEVICES_TO_ENABLE_ON_WIFI_DISCONNECT=""
DEVICES_TO_ENABLE_ON_WWAN_DISCONNECT=""

Note that in this configuration the battery charge thresholds are intentionally set lower than the factory default to prolong the health of batteries. When I wish to fully charge them, do:

$ sudo tlp fullcharge [ BAT0 | BAT1 ]

and the thresholds will be set back to the factory default. The setting reverts to the custom configuration upon next reboot.

Enables RC6 Power Saving

Edit /etc/default/grub and add the option i915.i915_enable_rc6=1 to GRUB:

GRUB_CMDLINE_LINUX_DEFAULT="quiet i915.i915_enable_rc6=1"

After editing, update Grub:

$ sudo update-grub

This can save about 0.5 W of power!

Fan Control

I have not fine tuned this, but the fan can be controlled manually via thinkfan:

$ sudo aptitude install thinkfan

Create (or edit) /etc/modprobe.d/thinkpad_acpi.conf and add the line:

options thinkpad_acpi fan_control=1

Reboot. The fan can be controlled manually following this Wiki article.

Configuring System on Switch between AC & Battery

For regular daemons (like ssh), TLP provides a way to start/stop based on AC status.

For other personal processes, KDE power management can also do some work by running a script upon dis/connecting with AC. In particular, the sound for the KDE system notifications should be disabled entirely, because they cause quite a number of unwanted interruptions according to powertop. Also, I wish to start/stop programs like flux and Dropbox when on battery just to squeeze more juice. I create a script like this:

#!/usr/bin/env bash
##############################################################################
# This is a script to run when AC/battery switches.  Ideally used
# within the power management configuration setting on KDE.  Requires
# tp_smapi to be loaded for the detection of AC/battery.
  
USERNAME=johndoe
LATITUDE=37.78
LONGITUDE=-122.42
  
  
##############################################################################
# DO NOT MODIFY BELOW
 
statuspath="/sys/class/power_supply/AC/online"
  
if [ ! -f "$statuspath" ]; then
    exit 0
fi
  
tag=pmpersonal
status=`cat "$statuspath"`
 
if [ "$status" = "1" ]; then
    logger -t $tag "on ac"
  
    logger -t $tag "start xflux"
    for i in `pgrep ^xflux$` ; do kill -9 $i ; done
    xflux -l $LATITUDE -g $LONGITUDE
  
    logger -t $tag "start dropbox"
    dropbox start
  
    logger -t $tag "enable system notification"
    kwriteconfig --file="/home/$USERNAME/.kde/share/config/knotifyrc" --group=Sounds --key="No sound" false
else
    logger -t $tag "on battery"
  
    logger -t $tag "stop xflux"
    for i in `pgrep ^xflux$` ; do kill -9 $i ; done
  
    logger -t $tag "stop dropbox"
    dropbox stop
  
    logger -t $tag "disable system notification"
    kwriteconfig --file="/home/$USERNAME/.kde/share/config/knotifyrc" --group=Sounds --key="No sound" true
fi

Save this to ~/bin/pmpersonal.sh. Go to System Settings -> Power Management -> Energy Saving Settings. There, for each of “On AC,” “On Battery,” and “On Low Battery,” I can set different configurations. I can also specify “Run Script” option for each profile to enable/disable a few desktop related applications and settings. The script should be run on profile load.

Preload Daemon

$ sudo aptitude install preload

In /etc/default/preload, uncomment the line reading

OPTIONS="-l /dev/null"

This should reduce disk activity.

Disabling Nepomuk, Strigi and Akonadi on KDE

I am not fond of these daemon. Disable them by going to System Settings -> Desktop Search and unchecking “Enable Nepomuk Semantic Desktop.” In ~/.config/akonadi/akonadiserverrc, set:

StartServer=false

so that it doesn’t start at all. On powertop, this shows up as a service running mysqld constantly.

Posted in Uncategorized | Tagged , , | 4 Comments

LaTeX on Debian Wheezy

Install basic packages for LaTeX:

$ sudo aptitude install texlive texlive-latex-extra

Use pdflatex to compile TeX documents.

dvipng may be needed to render LaTeX on Matplotlib:

$ sudo aptitude install dvipng
Posted in Uncategorized | Tagged , | Leave a comment

Installing NixNote on Debian Wheezy

After downloading the .deb for 64-bit Wheezy, do:

$ sudo aptitude install default-jre
$ sudo dpkg -i nixnote-1.2_amd64.deb
Posted in Uncategorized | Tagged , | Leave a comment