
Hack 14 Handling Relative and Absolute URLs


Glean the full URL of any relative reference, such as "sample/index.html" or "../../images/flowers.gif", by using the helper functions of URI.

Occasionally, when you're parsing HTML or accepting command-line input, you'll receive a relative URL, something that looks like images/bob.jpg instead of the more specific http://www.example.com/images/bob.jpg. The longer version, called the absolute URL, is more desirable for parsing and display, as it ensures that no confusion can arise over where a resource is located.

The URI class provides methods for accessing and modifying the parts of a URL: $url->scheme reports what sort of URL it is, $url->host reports which host it refers to, and so on, as described in the docs for the URI class. The methods of most immediate interest, however, are query_form [Hack #12] and new_abs, which takes a URL string that is most likely relative and returns an absolute URL, as shown here:

use URI;
my $abs = URI->new_abs($maybe_relative, $base);
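To make the behavior of new_abs concrete, here is a minimal sketch that resolves three kinds of input against one base. The base URL and the candidate strings are made up for illustration; only URI->new_abs and the scheme, host, and path accessors come from the URI module itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Hypothetical base: the page on which the links were found.
my $base = 'http://www.example.com/docs/guide/index.html';

# A plain relative path, a path that climbs two directories,
# and a URL that is already absolute.
my @candidates = (
    'images/bob.jpg',
    '../../images/flowers.gif',
    'http://www.example.org/sample/index.html',
);

my @absolute;
for my $candidate (@candidates) {
    push @absolute, URI->new_abs($candidate, $base);
}
print "$_\n" for @absolute;

# new_abs returns a URI object, so the usual accessors work on it.
my $first = $absolute[0];
printf "scheme=%s host=%s path=%s\n",
    $first->scheme, $first->host, $first->path;
```

The first two inputs come back as http://www.example.com/docs/guide/images/bob.jpg and http://www.example.com/images/flowers.gif, while the third is returned unchanged because it was absolute to begin with. That is why it is safe to pass new_abs a string that is only "most likely" relative.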

For example, consider the following simple program, which scrapes for URLs in the HTML list of new modules available at your local CPAN mirror:

#!/usr/bin/perl -w
use strict;
use LWP 5.64;

my $browser = LWP::UserAgent->new;
my $url = 'http://www.cpan.org/RECENT.html';
my $response = $browser->get($url);

die "Can't get $url -- ", $response->status_line
  unless $response->is_success;

my $html = $response->content;
# Print the value of every <A href="..."> attribute, exactly as written in the page
while( $html =~ m/<A href=\"(.*?)\"/g ) { 
    print "$1\n"; 
}

It returns a list of relative URLs for Perl modules and other assorted files:

% perl get_relative.pl
MIRRORING.FROM
RECENT
RECENT.html
authors/00whois.html
authors/01mailrc.txt.gz
authors/id/A/AA/AASSAD/CHECKSUMS
...

However, if you actually want to retrieve those URLs, you'll need to convert them from relative (e.g., authors/00whois.html) to absolute (e.g., http://www.cpan.org/authors/00whois.html). The URI module's new_abs method is just the ticket and requires only that you change that while loop at the end of the script, like so:

# Resolve each matched href against the response's base URL before printing
while( $html =~ m/<A href=\"(.*?)\"/g ) {
    print URI->new_abs( $1, $response->base ) ,"\n";
}

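The $response->base method of the HTTP::Response class (a subclass of HTTP::Message) returns the base URL: the URL against which relative URLs in the document should be resolved. The base URL is usually just the URL you requested (here, http://www.cpan.org/RECENT.html), unless the response itself specifies a different base. Resolving a relative URL against it supplies the missing pieces of an absolute URL.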
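Resolution against a base is more than string concatenation: the directory part of the base matters. The following sketch, which needs no network access, resolves the same relative URL against different bases. The http://www.cpan.org/modules/index.html base is hypothetical and chosen only to show the effect of a base one directory down.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

my $relative = 'authors/00whois.html';

# Base is a page at the top level of the site, as in the script above.
my $top = URI->new_abs($relative, 'http://www.cpan.org/RECENT.html');

# Hypothetical base one directory down: the result keeps that directory.
my $nested = URI->new_abs($relative, 'http://www.cpan.org/modules/index.html');

# A leading slash makes the reference relative to the site root,
# so the directory of the base no longer matters.
my $rooted = URI->new_abs("/$relative", 'http://www.cpan.org/modules/index.html');

print "$top\n$nested\n$rooted\n";
```

The first and third results are http://www.cpan.org/authors/00whois.html; the second is http://www.cpan.org/modules/authors/00whois.html. Passing the real $response->base, rather than a hand-built host name, is what keeps links found on pages in subdirectories pointing at the right place.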

That minor adjustment in place, the code now returns absolute URLs:

http://www.cpan.org/MIRRORING.FROM
http://www.cpan.org/RECENT
http://www.cpan.org/RECENT.html
http://www.cpan.org/authors/00whois.html
http://www.cpan.org/authors/01mailrc.txt.gz
http://www.cpan.org/authors/id/A/AA/AASSAD/CHECKSUMS
...

Of course, using a regular expression to match link references is a bit simplistic, and for more robust programs you'll probably want to use an HTML-parsing module like HTML::LinkExtor, HTML::TokeParser [Hack #20], or HTML::TreeBuilder.
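As one possible shape for the more robust approach, here is a minimal sketch using HTML::LinkExtor, which accepts a base URL as the second argument to its constructor and hands the callback link values that are already absolute. The HTML fragment and the base URL are invented for the example; in the CPAN script you would presumably pass $response->content and $response->base instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical page content; note the mixed case and quoting styles
# that would defeat the simple regular expression.
my $html = <<'END_HTML';
<html><body>
<A href="authors/00whois.html">whois</A>
<a HREF='RECENT.html'>recent</a>
<img src="../images/logo.gif">
</body></html>
END_HTML

# Hypothetical base URL for resolving the links above.
my $base = 'http://www.cpan.org/modules/index.html';

my @links;
my $parser = HTML::LinkExtor->new(
    sub {
        my ($tag, %attr) = @_;
        # Keep only anchor links; tag and attribute names arrive lowercased.
        return unless $tag eq 'a' && defined $attr{href};
        push @links, "$attr{href}";    # stringify the URI object
    },
    $base,
);
$parser->parse($html);
$parser->eof;

print "$_\n" for @links;
```

Because the base is supplied up front, there is no separate call to new_abs, and the callback is free to filter on tag type, here skipping the img element.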

—Sean Burke
