February 24, 2009
Strongly connected components
A directly graph consists of vertices joined by directed edges. The edges can be thought of as a binary relation, so if there is an edge from vertex v to vertex w we say that w is directly reachable from w, and hence we can think of reachability in general as being the transitive closure of this relation.
Two vertices are “strongly connected” if they are mutually reachable. Strong connectedness is an equivalence relation on vertices, and the resulting equivalence classes are called the strongly connected components of the graph. Within a strongly connected component, any vertex can be reached from any other.
Suppose you had data about a road network, and you wanted to check that it was possible to get from anywhere in the network to anywhere else, so that you could pick up on things like an incompatible set of one-way and turn restrictions that made some bit of the network impossible to leave. You could do this by representing the network as a directed graph, and then finding its strongly connected components. If you get one component covering the whole graph then everything is fine. Otherwise there chances are you'll get a component covering most of the graph, and some smaller components covering problem areas.
The components of a graph can be found using Tarjan's algorithm:
#!/usr/bin/perl
sub strong_components
{
# Graph represented as a hash mapping vertices
# to arrays of directly reachable vertices
my ($graph) = @_;
my %times = ();
my %components = ();
my $current_time = 0;
my $current_component = 0;
my %roots = ();
my @stack = ();
foreach my $v (keys %$graph)
{
if (!defined($times{$v}))
{
visit($v);
}
}
sub visit
{
my ($v) = @_;
$times{$v} = ++$current_time;
$roots{$v} = $v;
push @stack, $v;
foreach my $w (@{$graph->{$v}})
{
if (!defined($times{$w}))
{
visit($w);
}
if (!defined($components{$w}) &&
$times{$roots{$w}} < $times{$roots{$v}})
{
$roots{$v} = $roots{$w};
}
}
if ($roots{$v} eq $v)
{
print "Component $current_component:";
my $w = undef;
while ($w ne $v)
{
$w = pop @stack;
print " $w";
$components{$w} = $current_component;
}
print "\n";
$current_component++;
}
}
}
# Example
strong_components({'a' => ['b'],
'b' => ['c', 'e', 'f'],
'c' => ['d', 'g'],
'd' => [],
'e' => ['a', 'f'],
'f' => ['g'],
'g' => ['f', 'h'],
'h' => ['d']});
In principle this has a running time that is linear in the size of the graph. In practice you can use an implementation such as that in the Boost Graph Library, and things are OK whilst your data fits comfortably into your machine's main memory. For really big graphs things start to get a bit sticky, as you start thrashing virtual memory.
It would be good to be able to split the data up into chunks that can be processed separately, both to stop the limit on physical memory being an issue, and to speed up executing by processing different chunks in parallel. Unfortunately, strong connectedness is a global property of a graph. To take a pathological example, suppose all the vertices were strung out in a ring that could only be travelled in one direction: it would be impossible to know that any of it was strongly connected without establishing that it was all strongly connected.
The Parallel Boost Graph Library uses an algorithm due to Fleischer, Hendrickson and Pinar. This works by having a local step that examines a graph to find a single strongly connected component and up to three subgraphs that can then be examined independently. So although the algorithm starts of by executing a single local step, by recursively splitting the problem up it aims to find a high degree of parallelism. However, this doesn't help in coping with a very large graph, since the first step executing must have to access all of it. It also doesn't help very much in the road network example given above: if there are no problems with the data then the graph will consist of a single component, and so the algorithm will terminate after the first local step.
Many things can be represented as directed graphs. For example, I believe that the authors of the Parallel Boost Graph Library are physicists, and that their interest is in analysing dependency relations amongst large numbers of equations used in their models of physical systems. Abstracting away the specifics makes it possible use a single algorithm to solve many disparate problems. However, this process of abstraction had made the road network problem given above more difficult to solve efficiently than might otherwise have been the case.
It isn't difficult to see that it would be possible to divide up a large road network spatially into discrete chunks. Within each chunk you could expect most roads to be mutually reachable purely using roads within that chunk. (For example, although I don't know exactly how to get between Harley Street and Holloway Road, it's a safe bet that it can be done without leaving London.) The graph for each chunk could be analysed with Tarjan's algorithm, and the vertices in each strongly connected component could be coalesced into a single vertex. Having done this for the whole road network it would be possible to create a graph using the coalesced vertices that would cover the whole network despite being relatively small, and this could itself be analysed with Tarjan's algorithm to give the overall result.
All of this got me to thinking what other sorts of problem might be expressible in terms of the strongly connected components of a graph, and whether they might have characteristics that made them easier to solve in parallel. The one I came up with was of a pipe network (e.g. water pipes), where the pipes are ultimately connecting up sources and sinks of whatever it is that the pipes are carrying. With such a network you might want to check that every pipe is part of at least one path that goes from some source to some sink, and that there are no loops in the network (i.e. no pipes that end up upstream of where they started).
Both of these conditions can be checked by constructing a graph corresponding to the pipe network and finding its strongly connected components—although the graphs will be slightly different.
Detecting loops is easier of the two, as a loop in the network corresponds to a cycle in the graph. The graph used will correspond fairly straightforwardly to the pipe network. In an acyclic graph no two vertices are mutually reachable, so each vertex will form a component by itself. If a strongly connected component is found containing two or more vertices then this shows a cycle.
Checking that each pipe part of a source-to-sink path is a bit trickier, and sort of works the other way around from the previous test. Imagine taking the graph used above, and adding two extra vertices: a "super source" and a "super sink". Add edges going from the super source to all of the original source vertices, and from all the original sink vertices to the super sink. Now add an edge going from the super sink to the super source. This means that every path from a source to a sink in the original graph have will become part of a cycle that goes from the super source to the source, then to the sink, then to the super sink, then back to the source. All the vertices involved in this cycle will mutually reachable, and hence belong to the same component as (say) the super source. Consequently, the vertices in any other component can not be part of a suitable path in the original graph.
Again, it would be good to be able to divide the full graph up spatially, and do some preliminary work on each chunk so as to be able to produce another graph representing the whole network but that is nonetheless relatively small. But I don't see that the same trick used with the road network would be useful here.
I think something is possible, but I've not yet worked out the details of it. For each chunk of the graph, consider all the vertices at its periphery—the ones that can directly reach or are directly reachable from vertices in other chunks. The internals of the graph in a chunk can be elided by working out which peripheral vertices are reachable from which others, and putting in edges that directly connect these. I think you'd then need to analyse the abbreviated graph to figure out what was going on overall, then use the result of this as context for more detailed analysis of each chunk of the unabbreviated graph.
Posted by robin2 at 10:18 PM | Comments (0)