Main | April 2006 »

March 27, 2006

block.pl

The weird first step of the Burrows & Wheeler compression algorithm (aka the bzip algorithm) has played on my mind from time to time since I first heard about it. So I thought I'd try coding it up.

[Reference: Michael Burrows and D. J. Wheeler, 1994, "A block-sorting lossless data compression algorithm", http://gatekeeper.dec.com/pub/DEC/SRC/research-reports/SRC-124.pdf]

#!/usr/bin/perl

# Given an input text, find all of its rotations and sort
# them into alphabetical order. If this is imagined as
# forming a grid, which one rotation on each row, then the
# code consists of the characters in the last column,
# together with the index of the row containing the original
# input.
sub encode
{
    my ($s) = @_;
    my @rotations = ();
    my $length = length($s);
    foreach my $i (0 .. $length-1)
    {
        my $rot = substr($s, $length-$i, $i) . substr($s, 0, $length-$i);
        push @rotations, [$rot, $i == 0];
    }

    @rotations = sort { $a->[0] cmp $b->[0] } @rotations;

    my $code = "";
    my $index = 0;

    foreach my $i (0 .. $#rotations)
    {
        $code .= substr($rotations[$i]->[0], $length-1);
        $index = $i if $rotations[$i]->[1];
    }

    return $code, $index;
}


# This is the stupid way to decode, as it involves
# reconstructing the entire grid, column by column. However
# it's only this I could honestly say I understand.
#
# Each column contains the same characters, but in a different
# order. We are given the content of the last column, so if
# we sort it into alphabetical order this will give us the
# content of the first column. If we rotate the grid one
# space to the right - so bringing the last column to the
# front - and sort again, then this will give us the content
# of the first two columns. If we rotate one space to the
# right and sort again this will give the first three
# columns. And so on.
sub decode1
{
    my ($code, $index) = @_;
    my @code = split(//, $code);
    my $length = length($code);
    my @rotations = ('') x $length;

    for (0 .. $length-1)
    {
        foreach my $i (0 .. $length-1)
        {
            $rotations[$i] =  $code[$i] . $rotations[$i];
        }
        @rotations = sort @rotations;
    }

    return $rotations[$index];
}


# The smart way is only to reconstruct the row of the grid
# that contained the original input. We are given the last
# column, and can easily construct the first column as
# before. We are also told which row of the full grid
# contains the original input, and this tells us the entry
# in the first column containing the first character. Because
# all the rows are rotations, if we knew for which row this
# entry appeared in the last column, the the entry for the
# same row in the first column would give us the second
# character. And if we knew where *this* entry appeared in
# the last column we could look to the first column again to
# give us the third character. Etc.
#
# The obvious problem is working out the correspondence
# between entries in the first and last columns. Any given
# character may occur many times, and you've got to pick the
# right occurrence, otherwise it all just goes horribly
# wrong.
#
# According to the original Burrows & Wheeler paper this
# is actually easy: for a given character the entries in the
# first column appear in the same order as the corresponding
# entries in the last column, e.g. the tenth letter 'e' in
# the first column matches up with the tenth letter 'e' in
# the last column.  This I don't understand. There's an
# explanation in the paper, but it didn't sink in. Look:
# I've been a bit stressed recently and I haven't been
# sleeping properly, and now I've lost an hour in bed due to
# British Summer Time. So I'm happy just to take it on trust
# for now.
#
# In the code l has the content of the last column, and f
# has the content of the first.
sub decode2
{
    my ($code, $index) = @_;
    my @l = split(//, $code);
    my @f = sort(@l);
    
    my $which_occurrence = sub
    {
        my ($index) = @_;
        my $ch = $f[$index];
        my $n = 0;
        foreach my $i (0 .. $index-1)
        {
            if ($f[$i] eq $ch)
            {
                $n += 1;
            }
        }
        return ($ch, $n);
    };

    my $nth_occurrence = sub
    {
        my ($ch, $n) = @_;
        my $count = 0;
        foreach my $i (0 .. $#l)
        {
            if ($l[$i] eq $ch)
            {
                if ($count == $n)
                {
                    return $i;
                }
                else
                {
                    $count += 1;
                }
            }
        }

        die "This can't happen\n";
    };

    my $decoded = '';
    for (0 .. $#f)
    {
        my ($ch, $n) = $which_occurrence->($index);
        $decoded .= $ch;
        $index = $nth_occurrence->($ch, $n);
    }
    return $decoded;
}


# We need to loop over the first and last columns in order
# to match up corresponding entries, but this can happen
# once rather than for every character.
sub decode3
{
    my ($code, $index) = @_;
    my @l = split(//, $code);
    my @f = sort(@l);
    my @next_index;

    my %occurrences;
    foreach my $i (0 .. $#l)
    {
        my $ch = $l[$i];
        if (!defined($occurrences{$ch}))
        {
            $occurrences{$ch} = [];
        }
        push @{$occurrences{$ch}}, $i;
    }

    foreach my $i (0 .. $#f)
    {
        my $ch = $f[$i];
        $count = 0 if $ch ne $last_ch;

        push @next_index, $occurrences{$ch}->[$count];

        $count += 1;
        $last_ch = $ch;
    }

    my $decoded = '';
    for (0 .. $#f)
    {
        $decoded .= $f[$index];
        $index = $next_index[$index];
    }
    return $decoded;
}


my @inputs = (
'Moses supposes his toses are roses',

'Whether the weather be cold, or whether the weather be hot;
whether the weather be fine, or whether the weather be not:
we whether the weather, whatever the weather, weather we like it or not.',
     
'Zermelo-Fraenkel set theory, which forms the main topic of the book, is a
rigorous theory, based on a precise set of axioms. However, it is
possible to develop the theory of sets considerably without any knowledge
of those axioms. Indeed, the axioms can only be fully understood after
the theory has been investigated to some extent. This state of affairs is
to be expected. The concept of a "set of objects" is a very intuitive
one, and, with care, considerable sound progress may be made on the basis
of this intuition alone. Then, by analyzing the nature of the "set"
concept on the basis of that initial progress, the axioms may be
"discovered" in a perfectly natural manner.');


sub time_repeated_runs
{
    my ($sub) = @_;
    my $iterations = 0;
    my $start = time;
    while ((time - $start) < 5)
    {
        $sub->();
        $iterations += 1;
    }
    return (time - $start) / $iterations;
}


foreach my $input (@inputs)
{
    my ($code, $index) = encode($input);
    print "Encoding time: ", time_repeated_runs(sub { encode($input) }), "\n";
    my $nicely_formatted_code = $code;
    $nicely_formatted_code =~ s/\n/ /g;
    $nicely_formatted_code =~ s/(.{60})/$1\n/g;
    print "Index: $index\nCode:\n$nicely_formatted_code\n\n";

    foreach my $decoder (\&decode1, \&decode2, \&decode3)
    {
        my $decoded = $decoder->($code, $index);

        die "Decoder error\n" if $decoded ne $input;
        print "Decoding time: ", time_repeated_runs(sub { $decoder->($code, $index) }), "\n";
    }
    print "=" x 60, "\n\n";
}

Sample output:

Encoding time: 0.00162919517758227
Index: 5
Code:
ssesss rssss htpMrpua eeeieoooo  s

Decoding time: 0.0132275132275132
Decoding time: 0.006426735218509
Decoding time: 0.00294985250737463
============================================================

Encoding time: 0.0190839694656489
Index: 41
Code:
:;rrrreeeeeret,,rrrrrrreeee,ee,rrederrttt.heeeeeee     lbbbk
wbhhhhhhwnwwwwwwwhhhhhhhhvhhhhhhhhht wttttttttttttttttttWwww
w lf io i  c   nnheeeeoeeeeeeeooeeioooa      aaaaeeeeeaaae  
            

Decoding time: 0.857142857142857
Decoding time: 0.185185185185185
Decoding time: 0.0125628140703518
============================================================

Encoding time: 0.0602409638554217
Index: 146
Code:
e"seesarsae.....sfnnsfdny,teefe,eeyoyyse,she,soeehey"tsynts,
",yeelssyeftettysecdetnaadlelfoesff,,nnfsgpeestffednya,,ytds
   eynekrdsydossdetr-     .     rrmr  mfri nc  m chbbbhtgnn 
   mm        oaia  i  nnie    seeeoeneeeneaniin e"bgvhhhshhh
hbmbbhhtrdslhhlnrnrpfjstretldbkmvehathhhhcctvnddvpZdvrrvssss
sdw  oooooooooofra a nidiootc tttttttTttttTttttwtTtttsphsstr
 az    xxxxtas    hshdc wwnuutbon aaebbwueaentlba     orooro
oiooeaooiea  oouaIuooniieoakooeiiiatttl o          rrosiiii 
i  cc l cctbltfgeeeehpsrhcHnox o    eeeeeeeFuuapegge eoppoie
oeoooiiiiiamtirmsiimutsmmioia " " snnaa  eeo eruaipeepeensaa
cxfi                 iisiic   s ecnnaattfo ttooieeo no   oaa
aaeebalarrnlrllrrly

Decoding time: 11
Decoding time: 1.2
Decoding time: 0.0224215246636771
============================================================

Posted by robin2 at 12:32 AM | Comments (2)

March 20, 2006

Neat! Economy of thought

Why favour a simple theory over a more complicated one? Isn't this ultimately a religious prejudice? That there should be an underlying order in all things that is beautiful in its own right. Or that things must be made so as to be intelligible to us?

With empirical knowledge there is the danger of over-fitting data that is itself subject to error. So a tendency to prefer the stark beauty of simplicity—to suspect that truth cannot needlessly complicated—can be of pragmatic benefit despite not being absolutely justifiable.

[I had a piece ready which related this to Bayesian learning. But there was too much detail given the rather slight point being made. Short version: belief networks… Bayesian learning… prior probability… arbitrary fudge factor… bias towards simpler hypotheses.]

None of which has any bearing on maths.

It is elegant that a theory of sets can be developed without without assuming the existence of any non-set elements. But I can't see this as an awe-inspiring genesis of all things out of the void that is the empty set. I see it as a cute trick. This is similar to untyped lambda calculus, which provides a formalism for computable functions without requiring any non-functions with which to compute. A minimalist approach do doubt makes for simpler proofs. But I suspect the real appeal is that it is just plain neat.

Which isn't to say that this a bad thing. It is possible to get caught up in foundations (all very well if you like that sort of thing) whereas the history of maths might be better viewed through its gems and its monstrosities.

Posted by robin2 at 11:29 PM | Comments (3)

March 12, 2006

Look on my googlejuice, ye mighty, and despair

I've been reading ‘The Search: How Google and its rivals rewrote the rules of business and transformed our culture’ by John Battelle. (Next on my ‘to read’ list: ‘Laziness: why books these days need to have their content summarized in their sub-titles as otherwise we might not be sure whether they were what we really wanted’.) It is a rather gushing account of the business histories of Google, Alta Vista, Google, Yahoo, Overture and Google. That the author was a founder of ‘Wired’, and that he considered humankind's greatest ever artifact to be the Apple Macintosh (prior to its being trumped by Google Zeitgeist) probably tells you all you need to know about where he's coming from.

The book has a coda that, while it might seem rather at odds with the rest of the book in terms of content, makes sense of the author's underlying interest. It describes a spate of googling that begins with his searching on the term ‘immortality’. Realising that he is more interested in the concept of immortality than in signing up for some cryogenic scam, he follows up on one of his search hits, the ancient epic ‘Gilgamesh’. He finds an on-line copy, together with an account of how that particular version of the story came from stone tablets found at a ruined library at Ninevah; tablets that contained the name of the author.

In my search for immortality, I had found the oldest known named author in the history of Western civilization. […] Through his writings, with an assist from Google and a university professor, he had, in a sense, become immortal.

Battelle needn't have looked quite so far back for this concept. Consider:

One day I wrote her name upon the strand,
But came the waves and washed it away:
Again I wrote it with a second hand,
But came the tide and made my pains his prey.
“Vain man”, said she, “that dost in vain assay
A mortal thing so to immortalise,
For I myself shall like to this decay,
And eke my name be wiped out likewise.”
“Not so,” quoth I, “let baser things devise
To die in dust, but you shall live by fame:
My verse your virtues rare shall eternise,
And in the heavens write your glorious name.
Where whenas Death shall all the world subdue,
Our love shall live, and later life renew.”

While the sentiment in Spenser's poem is very sweet, the poet's lady is quite right. (No change there, then.) Her name isn't written in the heavens. Nor is anyone's. It it vanity to think otherwise. (At least her good sense has been eternised thus far.) But, as with the poet, Battelle cannot resist vanity's call:

What does it mean, I wondered, to become immortal through words pressed in clay—or […] through words formed in bits and transferred over the Web? It that not what every person longs for […] to die, but to be known forever? And does not search offer the same immortal imprint: is not existing forever in the indexes of Google and others the modern-day equivalent of carving our stories into stone? For anyone who has ever written his own name into a search box and anxiously awaited the results, I believe the answer is yes.

And so the story ends. Let it be known, now and forever: John woz ere.

Posted by robin2 at 10:05 PM | Comments (1)

March 03, 2006

Grumbling 16 years too late

Ta, Robin.
Testing, testing.

F(\bigsqcup_{i=0}^{\infty} x_{i}) = \bigsqcup_{i=0}^{\infty} F(x_{i})

OK. So I have to think of something to say now.

Well, there was one thing. More of a niggle than anything substantial.

I've been reading ‘The Philosophy of Artificial Intelligence’, a collection of articles edited by Margaret Boden. I bought it as an undergraduate (a while back, then) but was never particularly taken with it. Some of it I found more interesting to read now than then. I only just realised that the articles by David Marr and Drew McDermott dove-tail quite nicely with the anti-AI argument in the Dreyfus & Dreyfus piece. I doubt that was deliberate…

But I digress.

The book contains John Searle's article ‘Minds, Brains, and Programs’ where he introduced his famous ‘Chinese Room’ argument. I won't go into it. My opinion then was that that the argument and the whole debate it spawned were all most desperately dull: worrying about whether computers could ever really understand what they were doing, like, really really. So I avoided paying much attention.

Boden selected one of her own pieces as a reply to Searle. I'd not noticed before, but it seems she forgot to check what she was replying to.

From Searle's opening paragraphs:

I find it useful to distinguish what I will call ‘strong’ AI from ‘weak’ or ‘cautious’ AI (Artificial Intelligence). According to weak AI, the principal value of the computer in the study of the mind is that it gives us a very powerful tool. For example, it enables us to formulate and test hypotheses in a more rigorous and precise fashion. […] I have no objection to the claims of weak AI, at least as far as this article is concerned.

Boden's first sentence:

John Searle, in his paper on ‘Minds, Brains, and Programs’ (1980), argues that computational theories in psychology are essentially worthless.

I never thought Boden was the sharpest pencil in the box (I once read her rather unfortunate book on art from a CogSci standpoint), but that's not even trying.

Posted by robin2 at 12:03 AM | Comments (0)