« March 2006 | Main | May 2006 »

April 02, 2006

block.pl (again)

Given the other Robin's comment on my posting of a Perl implementation of the non-compressing step in bzip, there could be no stronger evidence that we are not the same person than for me to post a follow-up.

First off, I didn't give any indication before as to how the algorithm implemented relates to data compression. The encoded text is just a permutation of the input text; it isn't any shorter. In fact there is an extra piece of information that goes with this permutation, so the complete encoded form is actually a little bit longer that the original.

However, the nature of the permutation will generally make it easy to compress subsequently. Recall that the encoding algorithm works by sorting all the rotations of the input text into alphabetical order, and then taking the last character from each as the output. Suppose you have an input text where the phrase "gentle reader" crops up a lot. This means there would be a lot of rotations that started "entle reader" and ended in "g". These rotations would appear bunched up, due to the ordering of the rotations, and so the output would have a long sequence of ‘g’s. Similarly, there output would have a long sequence of ‘e’ due to the rotations starting "ntle reader". Although the test cases I had in my code don't show this up very well, in general the encoded text will contain a great many sequences of repeated characters, and so will provide very good input for something like run-length encoding. It would be this next step that performs actual compression.

Back to the Perl implementation. I had a number of variations for decoding, but they all worked by reconstructing the original message starting with the first character, and moving forwards. This follows very easily from the dumb implementation in decode1. But it turns out that Burrows & Wheeler suggest starting from the end and moving backwards. Well, you can imagine my embarrassment. Here are a couple of variations on doing it this way:

# Reconstruct the
# original message starting from the end and working
# backwards. Conceptually, this makes no difference: the
# roles of the first and last columns are swapped in the
# algorithm. The following version is therefore very similar
# to decode2. However, there is less work to do in finding
# the nth occurrence of a character in the first column: as
# the rows are sorted it suffices to find the first
# occurrence and then go n spaces further.
sub decode4
{
    my ($code, $index) = @_;
    my @l = split(//, $code);
    my @f = sort(@l);

    my $which_occurrence = sub
    {
        my ($index) = @_;
        my $ch = $l[$index];
        my $n = 0;
        foreach my $i (0 .. $index-1)
        {
            if ($l[$i] eq $ch)
            {
                $n += 1;
            }
        }
        return ($ch, $n);
    };

    my $nth_occurrence = sub
    {
        my ($ch, $n) = @_;
        foreach my $i (0 .. $#l)
        {
            if ($f[$i] eq $ch)
            {
                return $i + $n;
            }
        }

        die "This can't happen\n";
    };

    my $rev_decoded = '';
    for (0 .. $#f)
    {
        my ($ch, $n) = $which_occurrence->($index);
        $rev_decoded .= $ch;
        $index = $nth_occurrence->($ch, $n);
    }
    my $decoded = reverse($rev_decoded);
    return $decoded;
}


# Here, as in decode3, we avoid looping over the columns for
# each character. For the first column, all we need to find
# out is where the first occurrence of each character is.
# But we don't need to explicitly know have the first
# column: knowing how many of each character there are and
# the order of the distinct characters is enough.
sub decode5
{
    my ($code, $index) = @_;
    my @l = split(//, $code);

    my @which_occurrence;
    my %firsts;
    my %counts = ();
    foreach my $ch (@l)
    {
        if (!defined($counts{$ch}))
        {
            $counts{$ch} = 0;
        }

        push @which_occurrence, $counts{$ch};
        $counts{$ch} += 1;
    }

    my $count = 0;
    foreach my $ch (sort keys %counts)
    {
        $firsts{$ch} = $count;
        $count += $counts{$ch};
    }

    my $rev_decoded = '';
    for (0 .. $#l)
    {
        my $ch = $l[$index];
        $rev_decoded .= $ch;
        $index = $firsts{$ch} + $which_occurrence[$index];
    }
    my $decoded = reverse($rev_decoded);
    return $decoded;
}

I could now discuss relative performance, but I think I've made my point well enough already.

Posted by robin2 at 01:24 PM | Comments (0)