My files are number one. Mostly.

I’m neither a mathematician nor a statistician, but these things fascinate me. For the people still awake after that statement, I want to share the (insanely great) RadioLab podcast titled ‘Numbers’, where I first heard of a phenomenon called Benford’s Law... and maybe a wallpaper, too.

Benford’s Law says:

in lists of numbers from many (but not all) real-life sources of data, the leading digit is distributed in a specific, non-uniform way. According to this law, the first digit is 1 almost one third of the time, and larger digits occur as the leading digit with lower and lower frequency, to the point where 9 as a first digit occurs less than one time in twenty.

What this means is that, in large datasets, numbers starting with 1 are more likely to crop up than numbers starting with any of the other eight digits.
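The percentages in that quote come from a simple formula: the chance that the leading digit is d equals log10(1 + 1/d). As a quick sanity check, here is a minimal awk sketch that prints the expected share for each digit. The function name benford_expected is made up for this example.

```shell
# Print the expected Benford percentage for each leading digit 1-9.
# benford_expected is a hypothetical helper name, not part of any tool.
benford_expected() {
  awk 'BEGIN {
    for (d = 1; d <= 9; d++) {
      # awk only has the natural log, so divide by log(10) to get log10
      printf "%d\t%.2f\n", d, 100 * log(1 + 1 / d) / log(10)
    }
  }'
}

benford_expected
```

Each output line is a digit, a tab, and a percentage with two decimals.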

As I am a sceptical geek, I had to test this out for myself.

Large sets of numbers, eh?

I could go out and find data from data.gov, or statistics from web-servers, but why leave the comfort of my own chair, when a simple bash-command would gather large numbers for me?

sudo ls -A -o -R / > benford.txt

That command runs through my entire hard drive, searching through each and every folder it finds and spitting out lines and lines of filenames and sizes into the aptly named file ‘benford.txt’. Actually, it generated one million, thirty-one thousand, three hundred and twenty-two lines, equal to a whopping 68.2 MB of text and almost as large as an empty canvas in Illustrator :p

What now? The lines look like this:

-rw-r--r-- 1 root 150 Jul 3 2009 InfoPlist.strings

I’m guessing, not being a Unix professional, that there are ways to make the ls command print only the file size and the name, which would save me the time of editing one million, thirty-one thousand, three hundred and twenty-two lines down to lines like this:

150 InfoPlist.strings
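For the curious, one way to skip the hand-editing is to let awk pick the fields out of the listing. This is only a sketch under assumptions: it expects the `ls -o` layout from the sample line above (size in the fourth field, name starting at the eighth), it keeps regular files only, and the function name ls_to_tsv is invented for this example. Runs of spaces inside a filename collapse to a single space.

```shell
# Turn `ls -o` style lines into "size<TAB>name" lines.
# ls_to_tsv is a hypothetical helper; the field positions are an assumption
# based on the sample line (permissions, links, owner, size, month, day, year, name).
ls_to_tsv() {
  awk 'NF >= 8 && $1 ~ /^-/ {          # regular files only: permissions start with "-"
    name = $8
    for (i = 9; i <= NF; i++)          # re-join names that contain spaces
      name = name " " $i
    print $4 "\t" name
  }'
}

# Possible usage (not run here, it walks the whole disk):
#   sudo ls -A -o -R / | ls_to_tsv > benford.tsv
```

Directory headers, "total" lines, directories and symlinks are skipped because they either have too few fields or do not start with a dash.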

Here’s where regular expressions make their appearance. And the excellent RegExr Desktop is a great tool for it! Seeing as I’m no RegEx pro, either, I want to test my expressions before they run amok on my dataset :)

RegExr Desktop App

After testing my regular expressions in RegExr, I went into the great TextWrangler and started to find and replace. It went very well, as you can see :)

Find and replace FTW!

My file list is now complete, and it’s also a tab-delimited text file, which is great for loading into a MySQL database. The great thing about having this data inside a database is the ease with which you can pull data out, and I’m looking for the number of lines starting with 1, 2, 3, etc. So I made a database table with three columns (Id, filesize, name) and started to import my file list into it. After all the find/replace in TextWrangler, the file had shrunk to 25.1 MB.

mysql database is populated

Now I can start datamining, to see if this was all for nothing, or if my computer’s files also adhere to Benford’s Law.

Squeal or Sequel?

I like to use Sequel Pro for all MySQL needs, and it’s open-source and free. It lets you manage the database easily in a GUI, and also makes it easy to type your own SQL-statements and run them against the server. Sequel Pro is a fork of the discontinued CocoaMySQL.

Running select filesize from benford where filesize rlike '^[1-9]' on the database yields nine hundred sixty thousand, two hundred and seventy-three rows. That is every file size with a leading digit from 1 to 9, and it is my 100%. Now the calculations begin.
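The same tally can be sketched without a database, straight from the tab-delimited file. This is an alternative illustration, not what was run for this post: it assumes the size sits in the first tab-separated column, it mirrors the '^[1-9]' filter from the query, and leading_digit_counts is a made-up name.

```shell
# Count how many sizes start with each digit 1-9.
# Input: tab-delimited lines with the file size in column 1 (an assumption).
# leading_digit_counts is a hypothetical helper name.
leading_digit_counts() {
  awk -F '\t' '
    $1 ~ /^[1-9]/ { count[substr($1, 1, 1)]++ }   # same filter as rlike "^[1-9]"
    END {
      for (d = 1; d <= 9; d++)
        printf "%d\t%d\n", d, count[d]            # digits never seen print 0
    }'
}

# Possible usage, with a hypothetical file name:
#   leading_digit_counts < benford.tsv
```

Sizes of 0 are ignored, just as they are by the SQL filter.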

It's all in my database

So what do the numbers look like? It was fun to see that the numbers were distributed mostly like Benford’s Law dictated. Check it out. Benford percentage on the left, my numbers in the middle, and the total count to the right :)

  1. 30,10 / 35,14 / 337 445
  2. 17,60 / 17,04 / 163 654
  3. 12,50 / 12,03 / 115 551
  4. 9,70 / 10,47 / 100 493
  5. 7,90 / 7,29 / 70 044
  6. 6,70 / 5,92 / 56 815
  7. 5,80 / 4,70 / 45 114
  8. 5,10 / 4,10 / 39 363
  9. 4,60 / 3,31 / 31 795
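For anyone who wants to redo the percentages without a calculator, here is a small sketch that takes "digit count" pairs and prints the Benford percentage next to the observed one. The function name compare_to_benford is invented for the example, and the counts fed in below are simply the ones from the list above.

```shell
# Read "digit count" pairs and print: digit, expected %, observed %.
# compare_to_benford is a hypothetical helper name.
compare_to_benford() {
  awk '
    { count[$1] = $2; total += $2 }
    END {
      if (total == 0) exit                         # nothing to compare
      for (d = 1; d <= 9; d++) {
        expected = 100 * log(1 + 1 / d) / log(10)  # Benford: log10(1 + 1/d)
        observed = 100 * count[d] / total
        printf "%d\t%.2f\t%.2f\n", d, expected, observed
      }
    }'
}

# Counts copied from the list above
compare_to_benford <<'EOF'
1 337445
2 163654
3 115551
4 100493
5 70044
6 56815
7 45114
8 39363
9 31795
EOF
```

The output uses a decimal point rather than the decimal comma used in the list.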

Mathematics and statistics can be fun, especially when they are this close to /home :p

  • http://bash.org Kevin Brubeck Unhammer

    Here is a slightly quicker method, for file size:

    $ find / -type f -print0 | xargs -0 ls -o | gawk '{f=gensub(/^(.).*/,"\\1","g",$4);print f}' | sort | uniq -c | sort -nr

    That is: find all files, run ls on each of them (which gives slightly cleaner data than ls -R), print the first digit of the fourth field (the file size), sort, count, and sort again.

    For my /tmp folder this gives the following frequency list:
    74  1
    29  3
    13  0
    9  2
    6  7
    4  4
    3  9
    2  6
    1  8
    1  5

    (So: mostly 1s, then 3s, then 0s; but it does look like it is heading towards Benford.)

    For other fields, just replace $4 with e.g. $7 (the time of day, in my ls).

  • http://bash.org Kevin Brubeck Unhammer

    Bah, your blog removed the line breaks; that was 74 ones, 29 threes, and so on …

  • chrleon

    Cool :) thanks for the script-fu.

    I tried to come up with a clever terminal command before I ended up thinking: hey, this is a nice opportunity to brush up on my regex skills too :)

    I ended up with a million files, which is hopefully a somewhat bigger set than your /tmp folder, so that is probably where the difference comes from :)

    I’ll fix your comment so the line breaks aren’t removed.
