Hack 21 WWW::Mechanize 101
 
While LWP::UserAgent and the rest of the LWP
suite provide powerful tools for accessing and downloading web
content, WWW::Mechanize can automate many of the
tasks you'd normally have to code.
Perl has great tools for handling web
protocols, and LWP::UserAgent makes it easy,
encapsulating the nitty-gritty details of creating
HTTP::Requests, sending the requests, parsing
the HTTP::Responses, and providing the results.
Simple fetching of web pages is, as it should be, simple. For example:
#!/usr/bin/perl -w
use strict;
use LWP::UserAgent;
my $ua = LWP::UserAgent->new( );
my $response = $ua->get( "http://search.cpan.org" );
die $response->status_line unless $response->is_success;
print $response->title;
my $html = $response->content;
Behind the scenes of the get method, all the
details of the HTTP protocol are hidden from view, leaving me free to
think about the code itself. POSTing requests is
almost as simple. To search CPAN by author for my last name, I use
this:
my %fields = (
    query => 'lester',
    mode  => 'author',
);
my $response = $ua->post( "http://search.cpan.org", \%fields );
Although LWP::UserAgent makes things pretty
simple when it comes to grabbing individual pages, it
doesn't do much with the page itself. Once I have
the results, I need to parse the page myself to handle the content.
For example, let's say I want to go through the
search interface to find the CPAN home page for Andy Lester. The
POST example does the searching and returns the
results page, but that's not where I want to wind
up. I still need to find out the address pointed to by the
"Andy Lester" link. Once I have the
search results, how do I know which Lester author I want? I need to
extract the links from the HTML, find the text that matches
"Andy Lester" and then find the
next page. Maybe I don't know what fields will be on
the page and I want to fill them in dynamically. All of this drudgery
is taken care of by WWW::Mechanize.
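To make that drudgery concrete, here is a minimal sketch of doing the link hunt by hand with HTML::TokeParser. The HTML string stands in for $response->content, and the markup, the author IDs in it, and the find_link_by_text helper are all invented for illustration; the real search results page will look different.

```perl
#!/usr/bin/perl -w
use strict;
use HTML::TokeParser;

# Stand-in for $response->content; this markup is invented for illustration.
my $html = <<'HTML';
<html><body>
<a href="/author/EXAMPLE1/">Another Lester</a>
<a href="/author/EXAMPLE2/">Andy Lester</a>
</body></html>
HTML

# Hypothetical helper: walk every <a> tag and return the href of the
# first link whose text matches the pattern, or nothing if none does.
sub find_link_by_text {
    my ( $html, $pattern ) = @_;
    my $parser = HTML::TokeParser->new( \$html ) or die "Can't parse HTML";
    while ( my $tag = $parser->get_tag( "a" ) ) {
        my $href = $tag->[1]{href};
        next unless defined $href;
        my $text = $parser->get_trimmed_text( "/a" );
        return $href if $text =~ $pattern;
    }
    return;
}

my $href = find_link_by_text( $html, qr/Andy Lester/ );
print defined $href ? "$href\n" : "no match\n";
```

Even this small helper still leaves the relative URL to be resolved and the next page to be fetched, which is the kind of bookkeeping Mech absorbs.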
Introducing WWW::Mechanize
WWW::Mechanize, or
Mech for short, is a module that builds on the
base of LWP::UserAgent and provides an easy
interface for your most common web automation tasks (in fact, the first
version of Mech was called WWW::Automate).
LWP::UserAgent is a pure component that makes no
assumptions about how you're going to use it. Mech, by contrast,
is meant to be a miniature web browser in a single object,
so it takes some liberties in the name of simplicity.
For example, a Mech object keeps in its memory a history of the pages
it's visited and automatically supplies an HTTP
Referer header.
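The browser-like memory can be tried without a network connection. The following is a minimal sketch that assumes two small local pages (one.html and two.html, both invented here) served through file:// URLs; the write_page helper is hypothetical. It follows a link, looks at the Referer header on the request Mech built, and then steps back through the history.

```perl
#!/usr/bin/perl -w
use strict;
use File::Temp qw( tempdir );
use File::Spec;
use URI::file;
use WWW::Mechanize;

my $dir = tempdir( CLEANUP => 1 );

# Hypothetical helper: write a page into $dir and return its file:// URL.
sub write_page {
    my ( $name, $body ) = @_;
    my $path = File::Spec->catfile( $dir, $name );
    open my $fh, ">", $path or die "$path: $!";
    print $fh $body;
    close $fh or die "$path: $!";
    return URI::file->new_abs( $path )->as_string;
}

my $start = write_page( "one.html",
    '<html><head><title>Page One</title></head>'
  . '<body><a href="two.html">next</a></body></html>' );
write_page( "two.html",
    '<html><head><title>Page Two</title></head><body>done</body></html>' );

my $mech = WWW::Mechanize->new( );
$mech->get( $start );
$mech->follow_link( text => "next" );
my $second_uri = $mech->uri;

# The request for page two should carry the first page's address as Referer.
my $referer = $mech->response->request->header( "Referer" );
print "now at:  $second_uri\n";
print "referer: ", ( defined $referer ? $referer : "(none)" ), "\n";

# back( ) pops the history, much like a browser's Back button.
$mech->back( );
print "back at: ", $mech->uri, "\n";
```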
My previous example of fetching is even simpler with Mech:
#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new( );
$mech->get( "http://search.cpan.org" );
die $mech->response->status_line unless $mech->success;
print $mech->title;
my $html = $mech->content; # Big text string of HTML
Now that Mech is working for me, I don't even have
to deal with any HTTP::Response objects unless I
specifically want to. The success method checks
that the response, carried around by the $mech
object, indicates a successful action. The content
method returns whatever the content from the page is, and the
title method returns the title for the page, if
the page is HTML (which we can check with the
is_html method).
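As a small illustration of guarding title with is_html, the sketch below fetches one HTML file and one plain-text file from a temporary directory. The file names, their contents, and the local_url and describe_page helpers are assumptions made up for the example; on a live site the content type comes from the server rather than from the file extension.

```perl
#!/usr/bin/perl -w
use strict;
use File::Temp qw( tempdir );
use File::Spec;
use URI::file;
use WWW::Mechanize;

# Hypothetical helper: write a local file and return its file:// URL.
sub local_url {
    my ( $dir, $name, $body ) = @_;
    my $path = File::Spec->catfile( $dir, $name );
    open my $fh, ">", $path or die "$path: $!";
    print $fh $body;
    close $fh or die "$path: $!";
    return URI::file->new_abs( $path )->as_string;
}

# Hypothetical helper: fetch a URL and summarize it, asking for the
# title only when Mech reports that the page is HTML.
sub describe_page {
    my ( $mech, $url ) = @_;
    $mech->get( $url );
    return "not HTML" unless $mech->is_html;
    my $title = $mech->title;
    return defined $title ? "HTML titled '$title'" : "HTML without a title";
}

my $dir  = tempdir( CLEANUP => 1 );
my $mech = WWW::Mechanize->new( );
my $page = local_url( $dir, "page.html",
    "<html><head><title>Hello</title></head><body>hi</body></html>" );
my $note = local_url( $dir, "note.txt", "just text\n" );

print describe_page( $mech, $page ), "\n";
print describe_page( $mech, $note ), "\n";
```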
Using Mech's Navigation Tools
So far, Mech is just a couple of convenience
methods. Mech really shines when it's pressed into
action as a web client, extracting and following links and filling
out and posting forms. Once you've successfully
loaded a page, through either a GET or
POST, Mech goes to work on the HTML content. It
finds all the links on the page, whether they're in
an A tag as a link, or in any
FRAME or IFRAME tags as page
source. Mech also finds and parses the forms on the page.
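To see what Mech collects from a page, the following sketch loads a small local file and lists the links and forms it found. The page markup is invented for the example and is not taken from search.cpan.org; the calls used (links, find_all_links, forms, and the tag and url accessors on each link object) assume a reasonably recent WWW::Mechanize.

```perl
#!/usr/bin/perl -w
use strict;
use File::Temp qw( tempdir );
use File::Spec;
use URI::file;
use WWW::Mechanize;

# An invented page with two A links, one IFRAME, and one form.
my $dir  = tempdir( CLEANUP => 1 );
my $path = File::Spec->catfile( $dir, "demo.html" );
open my $fh, ">", $path or die "$path: $!";
print $fh <<'HTML';
<html><head><title>Demo</title></head><body>
<a href="dist/Foo-0.01.tar.gz">Foo</a>
<a href="about.html">About</a>
<iframe src="banner.html"></iframe>
<form action="search" method="get">
<input type="text" name="query">
<input type="submit" value="Search">
</form>
</body></html>
HTML
close $fh or die "$path: $!";

my $mech = WWW::Mechanize->new( );
$mech->get( URI::file->new_abs( $path )->as_string );

# Every link Mech found, with the tag it came from.
my @links = $mech->links;
printf "%-7s %s\n", $_->tag, $_->url for @links;

# Only the links whose URL looks like a tarball.
my @tarballs = $mech->find_all_links( url_regex => qr/\.tar\.gz$/ );
print scalar @tarballs, " tarball link(s)\n";

# The forms are HTML::Form objects; list the named fields of each.
my @forms = $mech->forms;
for my $form ( @forms ) {
    my @names = grep { defined } map { $_->name } $form->inputs;
    print "form fields: @names\n";
}
```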
I'll put together all of Mech's
talents into one little program that downloads all of my modules from
CPAN. It will have to search for me by name, find my module listings,
and then download each tarball to my current directory. (I could have had
it go directly to my module listing, since I know my own CPAN ID, but
that wouldn't show off form submission!)
The Code
Save the following code to a file called
mechmod.pl:
#!/usr/bin/perl -w
use strict;
$|++;
use File::Basename;
use WWW::Mechanize 0.48;
my $mech = WWW::Mechanize->new( );
# Get the starting search page
$mech->get( "http://search.cpan.org" );
$mech->success or die $mech->response->status_line;
# Select the form, fill the fields, and submit
$mech->form_number( 1 );
$mech->field( query => "Lester" );
$mech->field( mode => "author" );
$mech->submit( );
$mech->success or die "post failed: ",
    $mech->response->status_line;
# Find the link for "Andy"
$mech->follow_link( text_regex => qr/Andy/ )
    or die "no link matching Andy found";
$mech->success or die "follow_link failed: ", $mech->response->status_line;
# Get all the tarballs
my @links = $mech->find_all_links( url_regex => qr/\.tar\.gz$/ );
my @urls = map { $_->[0] } @links;
print "Found ", scalar @urls, " tarballs to download\n";
for my $url ( @urls ) {
    my $filename = basename( $url );
    print "$filename --> ";
    $mech->get( $url, ':content_file'=>$filename );
    print -s $filename, " bytes\n";
}
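To check what the form-filling step would send before pointing the script at a live site, one option is to build the request without submitting it. This is a minimal sketch, assuming a local copy of a search form: the markup and the www.example.com action are invented, and only the field names query and mode are borrowed from the script. It relies on the click method of the underlying HTML::Form object, which returns an HTTP::Request instead of sending one.

```perl
#!/usr/bin/perl -w
use strict;
use File::Temp qw( tempdir );
use File::Spec;
use URI::file;
use WWW::Mechanize;

# An invented search form; this is not the real search.cpan.org markup.
my $dir  = tempdir( CLEANUP => 1 );
my $path = File::Spec->catfile( $dir, "search.html" );
open my $fh, ">", $path or die "$path: $!";
print $fh <<'HTML';
<html><head><title>Search</title></head><body>
<form action="http://www.example.com/search" method="get">
<input type="text" name="query">
<select name="mode">
<option value="all">All</option>
<option value="author">Author</option>
</select>
<input type="submit" value="Search">
</form>
</body></html>
HTML
close $fh or die "$path: $!";

my $mech = WWW::Mechanize->new( );
$mech->get( URI::file->new_abs( $path )->as_string );

# Same three steps as in mechmod.pl, minus the submit.
$mech->form_number( 1 );
$mech->field( query => "Lester" );
$mech->field( mode  => "author" );

# Build the request the form would produce, but do not send it.
my $request = $mech->current_form->click;
print $request->method, " ", $request->uri, "\n";
```

Printing the request this way makes it easier to spot a misspelled field name or a wrong form number before any traffic reaches the server.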
Running the Hack
Invoke mechmod.pl on the command line, like so:
% perl mechmod.pl
Found 14 tarballs to download
Acme-Device-Plot-0.01.tar.gz --> 2025 bytes
Apache-Lint-0.02.tar.gz --> 2131 bytes
Apache-Pod-0.02.tar.gz --> 3148 bytes
Carp-Assert-More-0.04.tar.gz --> 4126 bytes
ConfigReader-Simple-1.16.tar.gz --> 7313 bytes
HTML-Lint-1.22.tar.gz --> 58005 bytes
...
This short introduction to the world of
WWW::Mechanize should give you an idea of how
simple it is to write spiders and other mechanized robots that
extract content from the Web.
—Andy Lester