« Neat! Economy of thought | Main | block.pl (again) »

March 27, 2006

block.pl

The weird first step of the Burrows & Wheeler compression algorithm (aka the bzip algorithm) has played on my mind from time to time since I first heard about it. So I thought I'd try coding it up.

[Reference: Michael Burrows and D. J. Wheeler, 1994, "A block-sorting lossless data compression algorithm", http://gatekeeper.dec.com/pub/DEC/SRC/research-reports/SRC-124.pdf]

#!/usr/bin/perl

# Given an input text, find all of its rotations and sort
# them into alphabetical order. If this is imagined as
# forming a grid, which one rotation on each row, then the
# code consists of the characters in the last column,
# together with the index of the row containing the original
# input.
sub encode
{
    my ($s) = @_;
    my @rotations = ();
    my $length = length($s);
    foreach my $i (0 .. $length-1)
    {
        my $rot = substr($s, $length-$i, $i) . substr($s, 0, $length-$i);
        push @rotations, [$rot, $i == 0];
    }

    @rotations = sort { $a->[0] cmp $b->[0] } @rotations;

    my $code = "";
    my $index = 0;

    foreach my $i (0 .. $#rotations)
    {
        $code .= substr($rotations[$i]->[0], $length-1);
        $index = $i if $rotations[$i]->[1];
    }

    return $code, $index;
}


# This is the stupid way to decode, as it involves
# reconstructing the entire grid, column by column. However
# it's only this I could honestly say I understand.
#
# Each column contains the same characters, but in a different
# order. We are given the content of the last column, so if
# we sort it into alphabetical order this will give us the
# content of the first column. If we rotate the grid one
# space to the right - so bringing the last column to the
# front - and sort again, then this will give us the content
# of the first two columns. If we rotate one space to the
# right and sort again this will give the first three
# columns. And so on.
sub decode1
{
    my ($code, $index) = @_;
    my @code = split(//, $code);
    my $length = length($code);
    my @rotations = ('') x $length;

    for (0 .. $length-1)
    {
        foreach my $i (0 .. $length-1)
        {
            $rotations[$i] =  $code[$i] . $rotations[$i];
        }
        @rotations = sort @rotations;
    }

    return $rotations[$index];
}


# The smart way is only to reconstruct the row of the grid
# that contained the original input. We are given the last
# column, and can easily construct the first column as
# before. We are also told which row of the full grid
# contains the original input, and this tells us the entry
# in the first column containing the first character. Because
# all the rows are rotations, if we knew for which row this
# entry appeared in the last column, the the entry for the
# same row in the first column would give us the second
# character. And if we knew where *this* entry appeared in
# the last column we could look to the first column again to
# give us the third character. Etc.
#
# The obvious problem is working out the correspondence
# between entries in the first and last columns. Any given
# character may occur many times, and you've got to pick the
# right occurrence, otherwise it all just goes horribly
# wrong.
#
# According to the original Burrows & Wheeler paper this
# is actually easy: for a given character the entries in the
# first column appear in the same order as the corresponding
# entries in the last column, e.g. the tenth letter 'e' in
# the first column matches up with the tenth letter 'e' in
# the last column.  This I don't understand. There's an
# explanation in the paper, but it didn't sink in. Look:
# I've been a bit stressed recently and I haven't been
# sleeping properly, and now I've lost an hour in bed due to
# British Summer Time. So I'm happy just to take it on trust
# for now.
#
# In the code l has the content of the last column, and f
# has the content of the first.
sub decode2
{
    my ($code, $index) = @_;
    my @l = split(//, $code);
    my @f = sort(@l);
    
    my $which_occurrence = sub
    {
        my ($index) = @_;
        my $ch = $f[$index];
        my $n = 0;
        foreach my $i (0 .. $index-1)
        {
            if ($f[$i] eq $ch)
            {
                $n += 1;
            }
        }
        return ($ch, $n);
    };

    my $nth_occurrence = sub
    {
        my ($ch, $n) = @_;
        my $count = 0;
        foreach my $i (0 .. $#l)
        {
            if ($l[$i] eq $ch)
            {
                if ($count == $n)
                {
                    return $i;
                }
                else
                {
                    $count += 1;
                }
            }
        }

        die "This can't happen\n";
    };

    my $decoded = '';
    for (0 .. $#f)
    {
        my ($ch, $n) = $which_occurrence->($index);
        $decoded .= $ch;
        $index = $nth_occurrence->($ch, $n);
    }
    return $decoded;
}


# We need to loop over the first and last columns in order
# to match up corresponding entries, but this can happen
# once rather than for every character.
sub decode3
{
    my ($code, $index) = @_;
    my @l = split(//, $code);
    my @f = sort(@l);
    my @next_index;

    my %occurrences;
    foreach my $i (0 .. $#l)
    {
        my $ch = $l[$i];
        if (!defined($occurrences{$ch}))
        {
            $occurrences{$ch} = [];
        }
        push @{$occurrences{$ch}}, $i;
    }

    foreach my $i (0 .. $#f)
    {
        my $ch = $f[$i];
        $count = 0 if $ch ne $last_ch;

        push @next_index, $occurrences{$ch}->[$count];

        $count += 1;
        $last_ch = $ch;
    }

    my $decoded = '';
    for (0 .. $#f)
    {
        $decoded .= $f[$index];
        $index = $next_index[$index];
    }
    return $decoded;
}


my @inputs = (
'Moses supposes his toses are roses',

'Whether the weather be cold, or whether the weather be hot;
whether the weather be fine, or whether the weather be not:
we whether the weather, whatever the weather, weather we like it or not.',
     
'Zermelo-Fraenkel set theory, which forms the main topic of the book, is a
rigorous theory, based on a precise set of axioms. However, it is
possible to develop the theory of sets considerably without any knowledge
of those axioms. Indeed, the axioms can only be fully understood after
the theory has been investigated to some extent. This state of affairs is
to be expected. The concept of a "set of objects" is a very intuitive
one, and, with care, considerable sound progress may be made on the basis
of this intuition alone. Then, by analyzing the nature of the "set"
concept on the basis of that initial progress, the axioms may be
"discovered" in a perfectly natural manner.');


sub time_repeated_runs
{
    my ($sub) = @_;
    my $iterations = 0;
    my $start = time;
    while ((time - $start) < 5)
    {
        $sub->();
        $iterations += 1;
    }
    return (time - $start) / $iterations;
}


foreach my $input (@inputs)
{
    my ($code, $index) = encode($input);
    print "Encoding time: ", time_repeated_runs(sub { encode($input) }), "\n";
    my $nicely_formatted_code = $code;
    $nicely_formatted_code =~ s/\n/ /g;
    $nicely_formatted_code =~ s/(.{60})/$1\n/g;
    print "Index: $index\nCode:\n$nicely_formatted_code\n\n";

    foreach my $decoder (\&decode1, \&decode2, \&decode3)
    {
        my $decoded = $decoder->($code, $index);

        die "Decoder error\n" if $decoded ne $input;
        print "Decoding time: ", time_repeated_runs(sub { $decoder->($code, $index) }), "\n";
    }
    print "=" x 60, "\n\n";
}

Sample output:

Encoding time: 0.00162919517758227
Index: 5
Code:
ssesss rssss htpMrpua eeeieoooo  s

Decoding time: 0.0132275132275132
Decoding time: 0.006426735218509
Decoding time: 0.00294985250737463
============================================================

Encoding time: 0.0190839694656489
Index: 41
Code:
:;rrrreeeeeret,,rrrrrrreeee,ee,rrederrttt.heeeeeee     lbbbk
wbhhhhhhwnwwwwwwwhhhhhhhhvhhhhhhhhht wttttttttttttttttttWwww
w lf io i  c   nnheeeeoeeeeeeeooeeioooa      aaaaeeeeeaaae  
            

Decoding time: 0.857142857142857
Decoding time: 0.185185185185185
Decoding time: 0.0125628140703518
============================================================

Encoding time: 0.0602409638554217
Index: 146
Code:
e"seesarsae.....sfnnsfdny,teefe,eeyoyyse,she,soeehey"tsynts,
",yeelssyeftettysecdetnaadlelfoesff,,nnfsgpeestffednya,,ytds
   eynekrdsydossdetr-     .     rrmr  mfri nc  m chbbbhtgnn 
   mm        oaia  i  nnie    seeeoeneeeneaniin e"bgvhhhshhh
hbmbbhhtrdslhhlnrnrpfjstretldbkmvehathhhhcctvnddvpZdvrrvssss
sdw  oooooooooofra a nidiootc tttttttTttttTttttwtTtttsphsstr
 az    xxxxtas    hshdc wwnuutbon aaebbwueaentlba     orooro
oiooeaooiea  oouaIuooniieoakooeiiiatttl o          rrosiiii 
i  cc l cctbltfgeeeehpsrhcHnox o    eeeeeeeFuuapegge eoppoie
oeoooiiiiiamtirmsiimutsmmioia " " snnaa  eeo eruaipeepeensaa
cxfi                 iisiic   s ecnnaattfo ttooieeo no   oaa
aaeebalarrnlrllrrly

Decoding time: 11
Decoding time: 1.2
Decoding time: 0.0224215246636771
============================================================

Posted by robin2 at March 27, 2006 12:32 AM

Comments

After a night's sleep, my confusion over decode2 dispersed.

Consider all the occurrences of a given character in the first column. As the rows are sorted alphabetically, these occurrences can be considered to be ordered according to the subsequent characters in the rows.

Now consider the occurrences of this character in the last column. As the rows are all rotations, these same "subsequent" characters appear at the beginnings of the rows. So (as the rows are sorted) the order of the occurrences will be the same as that in the first column.

Posted by: robin2 at March 28, 2006 10:21 PM

you are a crazy sick individual

Posted by: robin[theotherone] at March 31, 2006 03:06 AM