My files are number one. Mostly.
I’m neither a mathematician nor a statistician, but these things fascinate me. For anyone still awake after that statement, I want to share the (insanely great) RadioLab podcast titled ‘Numbers’, where I first heard of a phenomenon called Benford’s Law... and maybe a wallpaper, too.
Benford’s Law says:
in lists of numbers from many (but not all) real-life sources of data, the leading digit is distributed in a specific, non-uniform way. According to this law, the first digit is 1 almost one third of the time, and larger digits occur as the leading digit with lower and lower frequency, to the point where 9 as a first digit occurs less than one time in twenty.
What this means is that numbers starting with 1 are more likely to crop up in large datasets than numbers starting with any of the other eight digits.
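The quote doesn’t spell out the formula, but Benford’s Law is usually written as P(d) = log10(1 + 1/d) for a leading digit d. Here is a minimal Python sketch of it (the function name is just one picked for illustration):

```python
import math


def benford_expected(digit):
    """Expected share, in percent, of numbers whose first digit is `digit` (1-9)."""
    return 100 * math.log10(1 + 1 / digit)


if __name__ == "__main__":
    # Print the expected percentage for each possible leading digit.
    for d in range(1, 10):
        print(d, round(benford_expected(d), 1))
```

For 1 this gives roughly 30.1 and for 9 roughly 4.6, which matches the "almost one third" and "less than one time in twenty" from the quote.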
As I am a sceptical geek I had to test this out for myself.
Large sets of numbers, eh?
I could go out and find data from data.gov, or statistics from web-servers, but why leave the comfort of my own chair, when a simple bash-command would gather large numbers for me?
sudo ls -A -o -R / > benford.txt
That command runs through my entire hard drive (-R recurses into every folder, -A includes hidden files, and -o gives the long listing without the group column), spitting out lines and lines of filenames and sizes into the aptly named file ‘benford.txt’. Actually it generated one million, thirty one thousand, three hundred and twenty two lines, equal to a whopping 68.2 MB of text and almost as large as an empty canvas in Illustrator :p
What now? The lines look like this:
-rw-r--r-- 1 root 150 Jul 3 2009 InfoPlist.strings
I’m guessing, not being a unix professional, that there are ways to make the ls command print only the filesize and the name, which would save me the time of editing one million, thirty one thousand, three hundred and twenty two lines down to lines like this one:
150 InfoPlist.strings
Here’s where regular expressions make their appearance. And the excellent RegExr Desktop is a great tool for it! Seeing as I’m no RegEx pro, either, I want to test my expressions before they run amok on my dataset :)

After testing my regular expressions in RegExr, I went into the great TextWrangler and started to find and replace. It went very well, as you can see :)
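The actual expressions aren’t shown in this post. As a rough stand-in, here is what the same find-and-replace could look like in Python, assuming every file line has the same shape as the sample above (permissions, link count, owner, size, month, day, year or time, name). The pattern and the function names are made up for this sketch; directory headers and "total" lines simply don’t match and are skipped.

```python
import re
from typing import Iterable, Iterator, Optional, Tuple

# Assumed layout of one `ls -o` line, based on the sample line in the post.
LS_LINE = re.compile(
    r"^[-dlbcps]\S*\s+"      # permissions, e.g. -rw-r--r--
    r"\d+\s+"                # link count
    r"\S+\s+"                # owner
    r"(\d+)\s+"              # size in bytes (captured)
    r"\S+\s+\d+\s+\S+\s+"    # month, day, year or time
    r"(.+)$"                 # file name, may contain spaces (captured)
)


def parse_ls_line(line: str) -> Optional[Tuple[int, str]]:
    """Return (size, name) for a file line, or None for anything else."""
    match = LS_LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def to_tab_delimited(lines: Iterable[str]) -> Iterator[str]:
    """Yield 'size<TAB>name' for every line that looks like a file entry."""
    for line in lines:
        parsed = parse_ls_line(line)
        if parsed is not None:
            size, name = parsed
            yield f"{size}\t{name}"
```

Feeding it the lines of benford.txt, for example with `to_tab_delimited(open("benford.txt", errors="replace"))`, would give a tab-delimited list of sizes and names. Symlinks keep their "name -> target" text in the name column.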


My filelist is now complete, and it’s also a tab-delimited textfile, which is great for loading into a MySQL database. The great thing about having this data inside a database is the ease with which you can pull data out, and I’m looking for the number of lines starting with 1, 2, 3 etc. So I made a database table with three columns (id, filesize and name) and started to import my filelist into it. After all the find/replace in TextWrangler, the file had shrunk to 25.1 MB.
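If you don’t have a MySQL server handy, the same three-column table can be tried out with Python’s built-in sqlite3 module. To be clear, this is SQLite standing in for MySQL, not the setup used in this post, and the function names are invented for the sketch:

```python
import sqlite3


def rows_from_tsv(lines):
    """Yield (filesize, name) from 'size<TAB>name' lines, skipping anything else."""
    for line in lines:
        size, _, name = line.rstrip("\n").partition("\t")
        if size.isdigit():
            yield int(size), name


def load_filelist(rows):
    """Create an in-memory table (id, filesize, name) and fill it with rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE benford ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "filesize INTEGER, "
        "name TEXT)"
    )
    conn.executemany("INSERT INTO benford (filesize, name) VALUES (?, ?)", rows)
    conn.commit()
    return conn
```

Something like `load_filelist(rows_from_tsv(open("filelist.txt")))` (the filename is hypothetical) returns a connection you can run SELECT statements against.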

Now I can start datamining, to see if this was all for nothing, or if my computer’s files also adhere to Benford’s Law.
Squeal or Sequel?
I like to use Sequel Pro for all MySQL needs, and it’s open-source and free. It lets you manage the database easily in a GUI, and also makes it easy to type your own SQL-statements and run them against the server. Sequel Pro is a fork of the discontinued CocoaMySQL.
Running select filesize from benford where filesize rlike '^[1-9]' on the database yields nine hundred sixty thousand two hundred seventy three rows. This is every row whose filesize starts with a non-zero digit, and it is my 100%. Now the calculations begin.
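The per-digit queries aren’t listed here. As a rough sketch of the same calculation outside the database, this plain Python (not the SQL used above; the function name is hypothetical) counts leading digits in a list of filesizes and turns them into percentages. Like the rlike '^[1-9]' filter, it ignores sizes that start with 0.

```python
from collections import Counter


def leading_digit_shares(sizes):
    """Map each digit 1-9 to (count, percent of all sizes with a non-zero first digit)."""
    counts = Counter()
    for size in sizes:
        first = str(size)[0]
        if first in "123456789":  # skip sizes starting with 0, e.g. empty files
            counts[int(first)] += 1
    total = sum(counts.values())
    return {
        digit: (counts[digit], 100 * counts[digit] / total if total else 0.0)
        for digit in range(1, 10)
    }


if __name__ == "__main__":
    # Tiny made-up sample, only to show the shape of the result.
    for digit, (count, percent) in leading_digit_shares([150, 12, 23, 900, 0]).items():
        print(digit, count, round(percent, 2))
```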

So what do the numbers look like? It was fun to see that the numbers were distributed mostly as Benford’s Law dictates. Check it out, one line per leading digit from 1 to 9: Benford percentage on the left, my percentage in the middle, and the number of files on the right :)
- 30,10 / 35,14 / 337 445
- 17,60 / 17,04 / 163 654
- 12,50 / 12,03 / 115 551
- 9,70 / 10,47 / 100 493
- 7,90 / 7,29 / 70 044
- 6,70 / 5,92 / 56 815
- 5,80 / 4,70 / 45 114
- 5,10 / 4,10 / 39 363
- 4,60 / 3,31 / 31 795
Mathematics and statistics can be fun, especially when they are this close to /home :p