Jan 14, 2008 11:30:52 PM

Tags: PHP Graphviz Sitemap

Codecaine.co.za sitemap with Graphviz

Today I discovered an interesting tool for drawing graphs (as in nodes and edges – not the pie-chart kind). Although I have other uses for it, I thought it would be an interesting exercise to use it to map out the links on my site. The result: it's amazing how complex navigation can get on even a simple site. Personally, it really made me see spiders and commercial crawling bots in a new light.

Graphviz is a really amazing program. It comprises a set of tools for drawing arbitrary networks (not necessarily directed graphs), two of which I used in this exercise: dot and neato. Both read a graph definition file and lay the graph out. Graphs are defined using the DOT language, and the Graphviz website has extensive documentation on using it. dot produces layered, hierarchical layouts and suits directed graphs, while neato uses a spring model that tends to suit undirected graphs. When no output format is given, dot writes the graph back out as DOT annotated with layout information, which is why its output can be piped into neato. Both support multiple output formats, and the PostScript output can easily be fed to ghostscript to produce PDFs.

Consider example.dot:

graph G {
	"A" -- "B";
	"B" -- "C";
	"C" -- "D";
	"D" -- "A";
}

The code above represents an undirected graph with four nodes. Each of the lines between the braces defines a graph edge. Defining a directed graph is just as easy:

digraph D {
	"A" -> "B";
	"B" -> "C";
	"C" -> "D";
	"D" -> "A";
}

Turning this graph into a GIF image is easy:

dot example.dot | neato -Tgif -o example.gif
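The same digraph can also be generated from code instead of being typed by hand. Here's a minimal sketch in PHP, assuming the edges are available as a list of pairs; edgesToDot is a hypothetical helper name, not part of Graphviz.

```php
<?php
// Hypothetical helper: turns a list of [from, to] pairs into DOT source
// for a directed graph.
function edgesToDot(array $edges, string $name = 'D'): string
{
    // Wrap a node name in double quotes, escaping any quotes inside it
    $quote = function ($s) {
        return '"' . str_replace('"', '\\"', $s) . '"';
    };

    $lines = array('digraph ' . $name . ' {');
    foreach ($edges as $edge) {
        list($from, $to) = $edge;
        $lines[] = "\t" . $quote($from) . ' -> ' . $quote($to) . ';';
    }
    $lines[] = '}';

    return implode("\n", $lines) . "\n";
}

// Reproduces the four-node cycle from the example above
echo edgesToDot(array(
    array('A', 'B'),
    array('B', 'C'),
    array('C', 'D'),
    array('D', 'A'),
));
```

Saved as a script, its output can be piped into neato in the same way as the DOT file.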

So producing a sitemap directed graph is really a matter of building a DOT file using a simple PHP script, a rudimentary version of which I've included below. PLEASE NOTE: this code is very dumb. It'll crawl a website until it runs out of stack space, so please execute it only on a website that you manage. This is for educational purposes only. I'm not responsible if you try to spider microsoft.com – although it would be an interesting exercise to see if they shut you out before PHP runs out of memory.

<?php

// $base is the website you want to map

$base = 'http://www.codecaine.co.za';

// $url is the starting URL.
// You can change this to any relative URL.

$url = '/';

$stack = array();

// A random selection of DOT-compatible colours.
// The full list of colour names is in the
// Graphviz documentation.

$colours = array(
	'salmon2', 'gold1', 'burlywood2',
	'yellow', 'deepskyblue', 'goldenrod2',
	'navy', 'coral3', 'coral', 'steelblue3'
);

srand(time());

echo 'digraph G {'."\n";
// size is the maximum drawing size, in inches
echo "\t".'size="600,600";'."\n";

buildSitemap($url);

echo '}';

function buildSitemap ($url, $depth = 0)
{
	global $stack;
	global $base;
	global $colours;

	// we don't want to recurse indefinitely

	if (in_array($url, $stack))
	{
		return;
	}

	array_push($stack, $url);

	// pick a random colour

	$colour = $colours[rand(0, sizeof($colours) - 1)];

	// fetch the page; give up on this URL if the request fails

	$page = @file_get_contents($base.$url);

	if ($page === false)
	{
		return;
	}

	// find all links on that page

	preg_match_all('/<a(.*?)href="(.*?)"(.*?)>(.*?)<\/a>/smi', $page, $matches);

	foreach ($matches[2] as $match)
	{
		// ignore external links, other URL schemes (mailto:, javascript:)
		// and in-page anchors

		if (preg_match('/^(?:[a-z][a-z0-9+.-]*:|\/\/|#)/i', $match))
		{
			continue;
		}

		// ignore links to the current page.
		// you can remove this if you'd like to find these
		// usability gremlins.

		if ($match == $url)
		{
			continue;
		}

		// output the graph edge with the random colour

		echo "\t".'"'.$url.'" -> "'.$match.'" [color='.$colour.'];'."\n";

		// recurse...

		buildSitemap($match, $depth + 1);
	}
}
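Since the script above has no upper bound on how many pages it fetches, here is a rough sketch of a capped, breadth-first variant. It is not the code that produced my graph: crawlEdges, $fetch and $maxPages are made-up names for illustration, and the link regex is a simplified assumption. The page fetcher is passed in as a callable so the function can be tried against canned HTML before being pointed at a real site.

```php
<?php
// Hypothetical variant of the crawler: breadth-first, capped at $maxPages
// fetches. $fetch is any callable that takes a relative URL and returns
// the page's HTML, or false on failure.
function crawlEdges(callable $fetch, string $start = '/', int $maxPages = 50): array
{
    $queue   = array($start);
    $seen    = array($start => true);
    $edges   = array();
    $fetched = 0;

    while ($queue && $fetched < $maxPages) {
        $url  = array_shift($queue);
        $page = $fetch($url);
        $fetched++;

        if ($page === false) {
            continue; // skip pages that could not be fetched
        }

        preg_match_all('/<a\s[^>]*?href="([^"]*)"/i', $page, $matches);

        foreach ($matches[1] as $link) {
            // same filters as the recursive version: no http links, no self-links
            if (preg_match('/^http/', $link) || $link === $url) {
                continue;
            }

            $edges[] = array($url, $link);

            // queue each newly discovered page exactly once
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = $link;
            }
        }
    }

    return $edges;
}

// Usage against a live site, using the same base URL as the script above:
// $base  = 'http://www.codecaine.co.za';
// $edges = crawlEdges(function ($url) use ($base) {
//     return @file_get_contents($base . $url);
// });
```

Returning the edges instead of echoing them keeps the crawl separate from the DOT output, so the cap can be tuned without touching the drawing side.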

Now that we have the sitemap generator, we can spit out the graph. The extra parameters to neato tell it to prevent nodes from overlapping, and to use splines where applicable instead of straight lines.

php sitemap.php | neato -Goverlap=false -Gsplines=true -Tgif -o sitemap.gif

Et voilà! You can see the result of this script on Codecaine.co.za here. If you'd like your nodes to be labelled with the page's title, you'll have to preprocess the website before outputting the DOT. DOT lets you give each node a label attribute in its own node statement, and it's tidiest to emit these statements before your edges.
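To make the labelling idea a little more concrete, here's a minimal sketch. Both pageTitle and nodeStatement are hypothetical helpers, and the title regex is a simplification that assumes a single plain title element per page.

```php
<?php
// Hypothetical helper: pull the <title> text out of a page's HTML,
// falling back to the given string (e.g. the URL) when none is found.
function pageTitle(string $html, string $fallback): string
{
    if (preg_match('/<title[^>]*>(.*?)<\/title>/si', $html, $m)) {
        $title = trim(html_entity_decode($m[1], ENT_QUOTES, 'UTF-8'));
        if ($title !== '') {
            return $title;
        }
    }

    return $fallback;
}

// Hypothetical helper: one DOT node statement whose label is the page title.
function nodeStatement(string $url, string $title): string
{
    // Wrap a string in double quotes, escaping any quotes inside it
    $quote = function ($s) {
        return '"' . str_replace('"', '\\"', $s) . '"';
    };

    return "\t" . $quote($url) . ' [label=' . $quote($title) . '];' . "\n";
}

// Emits a node statement for "/" labelled with the title "Home"
echo nodeStatement('/', pageTitle('<html><title>Home</title></html>', '/'));
```

The node is still identified by its URL, so the edge lines the sitemap script prints don't need to change; only the text drawn inside each node does.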

Feel free to post links to your sitemaps as comments.


Discussion


Luke

Jan 15, 2008 7:52:26 AM

wow your graph is immense, i need a cinema display to view that mofo
