
Hack 54 Scraping Amazon.com Customer Advice


Screen scraping can give you access to Amazon.com community features not yet implemented through Amazon.com's public Web Services API. In this hack, we'll implement a script to scrape customer buying advice.

Because the Web Services API doesn't expose customer buying advice, the only way to include it on a remote site is to scrape it from Amazon.com's own pages. The first step is knowing where to find all the customer advice on one page. The following URL links directly to the advice page for a given ASIN (the unique ID Amazon.com displays for each product [Hack #52]):

http://amazon.com/o/tg/detail/-/insert ASIN/?vi=advice

For example, here is the advice page for Mac OS X Hacks:

http://amazon.com/o/tg/detail/-/0596004605/?vi=advice

The Code

This Perl script splits the advice page into two variables, based on the headings "in addition to" and "instead of." It then loops through those sections, using regular expressions to match the products' information. The script then formats and prints the information.

Save the following script to a file called get_advice.pl:

#!/usr/bin/perl -w
# get_advice.pl
#
# A script to scrape Amazon to retrieve customer buying advice
# Usage: perl get_advice.pl <asin>
use strict;
use LWP::Simple;

# Take the ASIN from the command line.
my $asin = shift @ARGV or die "Usage: perl get_advice.pl <asin>\n";

# Assemble the URL from the passed ASIN.
my $url = "http://amazon.com/o/tg/detail/-/$asin/?vi=advice";

# Set up unescape-HTML rules. Lighter than loading HTML::Entities,
# but it handles only the three entities listed here.
my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' ');
my $unescape_re = join '|' => keys %unescape;

# Request the URL.
my $content = get($url);
die "Could not retrieve $url" unless $content;

# Get our matching data.
my ($inAddition) = $content
    =~ m!in addition to(.*?)(instead of)?</td></tr>!mis;
my ($instead)    = $content
    =~ m!recommendations instead of(.*?)</table>!mis;

# Look for "in addition to" advice.
if ($inAddition) { print "-- In Addition To --\n\n";
   while ($inAddition =~ m!<td width=10>(.*?)</td>\n<td width=90%>.*?ASIN/(.*?)/.*?">(.*?)</a>.*?</td>.*?<td width=10% align=center>(.*?)</td>!mgis) {
       my ($place,$thisAsin,$title,$number) = ($1||'',$2||'',$3||'',$4||'');
       $title =~ s/($unescape_re)/$unescape{$1}/migs; #unescape HTML 
       print "$place $title ($thisAsin)\n(Recommendations: $number)\n\n";
   }
}

# Look for "instead of" advice.
if ($instead) { print "-- Instead Of --\n\n";
    while ($instead =~ m!<td width=10>(.*?)</td>\n<td width=90%>.*?ASIN/(.*?)/.*?">(.*?)</a>.*?</td>.*?<td width=10% align=center>(.*?)</td>!mgis) {
        my ($place,$thisAsin,$title,$number) = ($1||'',$2||'',$3||'',$4||'');
        $title =~ s/($unescape_re)/$unescape{$1}/migs; #unescape HTML 
        print "$place $title ($thisAsin)\n(Recommendations: $number)\n\n";
    }
}
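
The two loops in the script apply the same row-matching expression to different sections of the page. As a rough sketch of how that shared step works, the following standalone script moves it into one subroutine and runs it against a small HTML fragment, so you can try the pattern without a network connection. The subroutine name parse_advice, the file name advice_rows.pl, and the sample markup are all assumptions made for illustration; the markup is inferred from what the regular expression expects, not copied from a live Amazon.com page.

```perl
#!/usr/bin/perl
# advice_rows.pl (hypothetical file name; not part of the original hack)
#
# Demonstrates the row-matching step of get_advice.pl on sample HTML.
use strict;
use warnings;

# Same unescape table as get_advice.pl.
my %unescape = ('&quot;' => '"', '&amp;' => '&', '&nbsp;' => ' ');
my $unescape_re = join '|', keys %unescape;

# parse_advice is an illustrative helper: it takes one section of the
# advice page and returns a list of hash references, one per product row.
sub parse_advice {
    my ($html) = @_;
    my @rows;
    while ($html =~ m!<td width=10>(.*?)</td>\n<td width=90%>.*?ASIN/(.*?)/.*?">(.*?)</a>.*?</td>.*?<td width=10% align=center>(.*?)</td>!gis) {
        # Copy the captures before the substitution below resets them.
        my ($place, $asin, $title, $number) = ($1, $2, $3, $4);
        $title =~ s/($unescape_re)/$unescape{$1}/g;   # unescape HTML
        push @rows, { place => $place, asin => $asin,
                      title => $title, number => $number };
    }
    return @rows;
}

# ASSUMPTION: this fragment only mimics the table layout that the
# pattern expects; the link paths are made up.
my $sample = <<'HTML';
<tr><td width=10>1.</td>
<td width=90%><a href="/exec/obidos/ASIN/0596004508/ref=x">Mac OS X: The Missing Manual, Second Edition</a></td>
<td width=10% align=center>1</td></tr>
<tr><td width=10>2.</td>
<td width=90%><a href="/exec/obidos/ASIN/0764525948/ref=x">Mac&nbsp;Upgrade and Repair Bible, Third Edition</a></td>
<td width=10% align=center>1</td></tr>
HTML

# Print each row in the same format get_advice.pl uses.
for my $row (parse_advice($sample)) {
    print "$row->{place} $row->{title} ($row->{asin})\n";
    print "(Recommendations: $row->{number})\n\n";
}
```

With a helper like this, get_advice.pl could call the same subroutine once for $inAddition and once for $instead rather than repeating the loop. Keep in mind that a pattern tied this closely to the page layout will stop matching whenever that layout changes.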

Running the Hack

You can run this script from the command line, passing in any ASIN. Here is the one for Mac OS X Hacks:

% perl get_advice.pl 0596004605
-- In Addition To --

1. Mac OS X: The Missing Manual, Second Edition (0596004508)
(Recommendations: 1)

2. Mac Upgrade and Repair Bible, Third Edition (0764525948)
(Recommendations: 1)

If the book has long lists of alternate products, send the output to a text file. This example sends all alternate product recommendations for Google Hacks to a file called advice.txt:

% perl get_advice.pl 0596004478 > advice.txt


—Paul Bausch
