Hack 14 Handling Relative and Absolute URLs
Glean the full URL of any relative reference, such as "sample/index.html" or "../../images/flowers.gif", by using the helper functions of URI.

Occasionally, when you're parsing HTML or accepting command-line input, you'll receive a relative URL, something that looks like images/bob.jpg instead of the more specific http://www.example.com/images/bob.jpg. The longer version, called the absolute URL, is more desirable for parsing and display, as it ensures that no confusion can arise over where a resource is located.

The URI class provides all sorts of methods for accessing and modifying parts of URLs (such as asking what sort of URL it is with $url->scheme, asking which host it refers to with $url->host, and so on, as described in the docs for the URI class). However, the methods of most immediate interest are the query_form method [Hack #12] and the new_abs method for taking a URL string that is most likely relative and getting back an absolute URL, as shown here:

use URI;
my $abs = URI->new_abs($maybe_relative, $base);

For example, consider the following simple program, which scrapes for URLs in the HTML list of new modules available at your local CPAN mirror:

#!/usr/bin/perl -w
use strict;
use LWP 5.64;
my $browser = LWP::UserAgent->new;
my $url = 'http://www.cpan.org/RECENT.html';
my $response = $browser->get($url);
die "Can't get $url -- ", $response->status_line
unless $response->is_success;
my $html = $response->content;
while( $html =~ m/<A href=\"(.*?)\"/g ) {
print "$1\n";
}

It returns a list of relative URLs for Perl modules and other assorted files:

% perl get_relative.pl
MIRRORING.FROM
RECENT
RECENT.html
authors/00whois.html
authors/01mailrc.txt.gz
authors/id/A/AA/AASSAD/CHECKSUMS
...

However, if you actually want to retrieve those URLs, you'll need to convert them from relative (e.g., authors/00whois.html) to absolute (e.g., http://www.cpan.org/authors/00whois.html). The URI module's new_abs method is just the ticket and requires only that you change that while loop at the end of the script, like so:

while( $html =~ m/<A href=\"(.*?)\"/g ) {
print URI->new_abs( $1, $response->base ) ,"\n";
}
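
To see how new_abs treats different kinds of references without touching the network, here is a minimal sketch. The base URL and the absolutize helper are hypothetical, chosen only for illustration. In the scraper above, the base would come from $response->base instead.

```perl
#!/usr/bin/perl -w
use strict;
use URI;

# Hypothetical base URL, used only for illustration.
my $base = 'http://www.example.com/docs/guide/index.html';

# Hypothetical helper: resolve one reference against a base URL
# and return the result as a plain string.
sub absolutize {
    my ($maybe_relative, $base_url) = @_;
    return URI->new_abs( $maybe_relative, $base_url )->as_string;
}

for my $ref ( 'sample/index.html',                 # relative to the base's directory
              '../../images/flowers.gif',          # climbs two directories
              '/images/bob.jpg',                   # relative to the host root
              'http://www.cpan.org/RECENT.html' )  # already absolute
{
    print "$ref => ", absolutize( $ref, $base ), "\n";
}
```

With that base, the four references should resolve to http://www.example.com/docs/guide/sample/index.html, http://www.example.com/images/flowers.gif, http://www.example.com/images/bob.jpg, and http://www.cpan.org/RECENT.html. The base's filename (index.html) is dropped before resolution, and a reference that is already absolute passes through unchanged.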

The base method of the HTTP::Response object in $response returns the base URL against which relative URLs in the document should be resolved. The base is usually the URL you requested (here, http://www.cpan.org/RECENT.html), although a redirect, a Content-Base or Content-Location header, or a <base href> tag in the page can change it. Note that new_abs does more than prepend the base: it drops the base's final filename, follows any ../ segments, and leaves already-absolute URLs untouched.

That minor adjustment in place, the code now returns absolute URLs:

http://www.cpan.org/MIRRORING.FROM
http://www.cpan.org/RECENT
http://www.cpan.org/RECENT.html
http://www.cpan.org/authors/00whois.html
http://www.cpan.org/authors/01mailrc.txt.gz
http://www.cpan.org/authors/id/A/AA/AASSAD/CHECKSUMS
...

Of course, using a regular expression to match link references is a bit simplistic, and for more robust programs you'll probably want to use an HTML-parsing module like HTML::LinkExtor, HTML::TokeParser [Hack #20], or HTML::TreeBuilder.

-- Sean Burke
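
As a rough sketch of the parser-based approach mentioned above, HTML::LinkExtor accepts a base URL as the second argument to its constructor and then hands the callback links that are already absolute. The absolute_links helper and the inline sample HTML are assumptions made for illustration. In a real script, you would pass $response->content and $response->base.

```perl
#!/usr/bin/perl -w
use strict;
use HTML::LinkExtor;

# Hypothetical helper: return every <a href> in $html as an absolute URL string.
sub absolute_links {
    my ($html, $base) = @_;
    my @found;
    my $parser = HTML::LinkExtor->new(
        sub {
            # Tag and attribute names arrive lowercased by the parser.
            my ($tag, %attr) = @_;
            push @found, "$attr{href}" if $tag eq 'a' && defined $attr{href};
        },
        $base,    # with a base supplied, link values are absolutized
    );
    $parser->parse($html);
    $parser->eof;
    return @found;
}

# Inline sample HTML standing in for a fetched page.
my $html = <<'END_HTML';
<html><body>
<A href="authors/00whois.html">who</A>
<a HREF='RECENT.html'>recent</a>
<img src="logo.gif">
</body></html>
END_HTML

print "$_\n" for absolute_links( $html, 'http://www.cpan.org/' );
```

Unlike the regular expression, the parser does not care about the case of the tag or attribute name or about which quote style the page uses, so both anchors in the sample are found while the image is skipped.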