Hack 1 A Crash Course in Spidering and Scraping
 
A few of the whys and wherefores of spidering
and scraping.
There is a
wide and ever-increasing variety of computer programs gathering and
sifting information, aggregating resources, and comparing data.
Humans are just one part of a much larger, automated equation. But
despite the variety of programs out there, they all have some basic
characteristics in common.
Spiders are programs that traverse the Web,
gathering information. If you've ever taken a gander
at your own web site's logs, you'll
see them peppered with User-Agent names like
Googlebot, Scooter, and
MSNbot. These are all spiders—or
bots,
as some prefer to call them.
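If you'd like to see which spiders are visiting your own site, a few lines of code will tally them up. The following is a minimal sketch, not a definitive log analyzer: it assumes your server writes the common "combined" log format (with the User-Agent as the last quoted field), the sample log lines are invented for illustration, and the `BOT_HINTS` keyword list is a rough guess at what bot names tend to contain.

```python
import re
from collections import Counter

# Assumption: "combined" log format, where each line ends with
# "referer" "user-agent". Adjust the pattern if your logs differ.
UA_PATTERN = re.compile(r'"[^"]*" "([^"]*)"\s*$')

# Hypothetical keyword list; real bots do not always identify themselves.
BOT_HINTS = ("bot", "spider", "crawl", "scooter")


def count_bot_agents(log_lines):
    """Count hits per User-Agent for agents that look like spiders."""
    counts = Counter()
    for line in log_lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        agent = match.group(1)
        if any(hint in agent.lower() for hint in BOT_HINTS):
            counts[agent] += 1
    return counts


# Invented sample data so the sketch runs without a real log file.
SAMPLE_LOG = [
    '66.249.66.1 - - [10/Oct/2003:13:55:36 -0700] "GET /index.html HTTP/1.0" '
    '200 2326 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '10.0.0.5 - - [10/Oct/2003:13:56:01 -0700] "GET /about.html HTTP/1.1" '
    '200 1045 "http://example.com/" "Mozilla/4.0 (compatible; MSIE 6.0)"',
    '64.4.8.1 - - [10/Oct/2003:13:57:12 -0700] "GET /robots.txt HTTP/1.0" '
    '404 210 "-" "msnbot/0.11 (+http://search.msn.com/msnbot.htm)"',
    '66.249.66.1 - - [10/Oct/2003:13:58:40 -0700] "GET /news.html HTTP/1.0" '
    '200 5120 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
]

if __name__ == "__main__":
    for agent, hits in count_bot_agents(SAMPLE_LOG).most_common():
        print(hits, agent)
```

To run it against a real log, pass an open file object instead of `SAMPLE_LOG`; the function only needs something it can iterate over line by line.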
Throughout this book, you'll hear us referring to
spiders and scrapers.
What's the difference? Broadly speaking,
they're both programs that go out on the Internet
and grab things. For the purposes of this book, however,
it's probably best for you to think of
spiders as programs that grab entire pages,
files, or sets of either, while scrapers grab
very specific bits of information within these files. For example,
one of the spiders
[Hack #44] in this book
grabs entire collections of Yahoo! Group messages to turn into
mailbox files for use by your email application, while one of the
scrapers [Hack #76] grabs train schedule
information. Spiders follow links, gathering up content, while
scrapers pull data from web pages. Spiders and scrapers usually work
in concert; you might have a program that uses a spider to follow
links but then uses a scraper to gather particular information.
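To make the division of labor concrete, here is a small sketch of a spider and a scraper working together. It is an illustration only, not code from any hack in this book: the two-page site in `PAGES`, the `departure` class name, and the function names are all made up, and the pages live in memory so the example runs without touching the network. The spider follows links and collects whole pages; the scraper then picks one specific kind of data out of them.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical site, held in memory so the sketch runs offline.
PAGES = {
    "http://example.com/": (
        '<a href="/schedule.html">Schedule</a> <a href="/">Home</a>'
    ),
    "http://example.com/schedule.html": (
        '<table><tr><td class="departure">09:15</td></tr>'
        '<tr><td class="departure">11:40</td></tr></table>'
        '<a href="/">Home</a>'
    ),
}


class LinkCollector(HTMLParser):
    """Spider helper: collect absolute URLs from every <a href>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


class DepartureScraper(HTMLParser):
    """Scraper helper: keep only the text inside <td class="departure">."""

    def __init__(self):
        super().__init__()
        self.in_departure = False
        self.times = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" and dict(attrs).get("class") == "departure":
            self.in_departure = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_departure = False

    def handle_data(self, data):
        if self.in_departure and data.strip():
            self.times.append(data.strip())


def spider(start_url, fetch):
    """Follow links breadth-first, returning {url: html} for each page found.

    `fetch` is any callable that returns a page's HTML, or None if missing.
    """
    seen, queue, pages = set(), [start_url], {}
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        collector = LinkCollector(url)
        collector.feed(html)
        queue.extend(collector.links)
    return pages


def scrape_departures(html):
    """Pull just the departure times out of one page."""
    scraper = DepartureScraper()
    scraper.feed(html)
    return scraper.times


pages = spider("http://example.com/", PAGES.get)
departures = [t for html in pages.values() for t in scrape_departures(html)]

if __name__ == "__main__":
    print(sorted(pages))
    print(departures)
```

Passing `fetch` in as a parameter is a deliberate choice: swap `PAGES.get` for a function that performs a real HTTP request and the spider works against a live site, ideally one whose owner has given you permission.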
Why Spider?
When learning about a
technology or way of using
technology, it's always good to ask the big
question: why? Why bother to spider? Why take the time to write a
spider, make sure it works as expected, get permission from the
appropriate site's owner to use it, make it
available to others, and spend time maintaining it? Trust us; once
you've started using spiders,
you'll find no end to the ways and places they can
be used to make your online life easier:
- Gain automated access to resources
Sure, you can visit every site you want to keep up with in your web
browser every day, but wouldn't it be easier to have
a program do it for you, passing on only content that should be of
interest to you? Having a spider bring you the results of a favorite
Google search can save you a lot of time, energy, and repetitive
effort. The more you automate, the more time you can spend having fun
with and making use of the data.
- Gather information and present it in an alternate format
Gather marketing research in the form of search engine results and
import them into Microsoft Excel for use in presentations or tracking
over time [Hack #93]. Grab a copy of your favorite
Yahoo! Groups archive in a form your mail program can read just like
the contents of any other mailbox
[Hack #43]. Keep up with
the latest on your favorite sites without actually having to pay them
a visit one after another [Hack #81]. Once you have raw data at
your disposal, it can be repurposed, repackaged, and reformatted to
your heart's content.
- Aggregate otherwise disparate data sources
No web site is an island, but you wouldn't know it,
given the difficulty of manually integrating data across various
sites. Spidering automates this drudgery, providing a 15,000-foot
view of otherwise disparate data. Watch Google results change over
time [Hack #93] or combine syndicated content
[Hack #69] from multiple weblogs into one RSS
feed. Spiders can be trained to aggregate data, both across sources and
over time.
- Combine the functionalities of sites
There might be a search engine that you love
but that doesn't do everything you want. Another
fills in some of those gaps, but doesn't fill the
need on its own. A spider can bridge the gap between two such
resources [Hack #48],
querying one and providing that information to another.
- Find and gather specific kinds of information
Perhaps what you're after has to be searched for before it can be
gathered. A spider can run web queries on your behalf, filling out
forms and sifting through the results [Hack #51].
- Perform regular webmaster functions
Let a spider take care of the drudgery of daily webmastering. Have it
check that your HTML is standards-compliant and tidy
(http://tidy.sourceforge.net/), that your links aren't broken, and
that you're not linking to any prurient content.
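As a taste of the aggregation idea in the list above, here is a rough sketch that combines items from two weblog feeds into a single RSS feed, newest first. Treat it as an illustration under stated assumptions rather than a finished tool: both feeds are invented and held in memory, every item is assumed to carry a `pubDate` in the usual RSS date format, and the function names `read_items` and `merge_feeds` are made up for this example.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Invented sample feeds so the sketch runs without fetching anything.
FEED_A = """<rss version="2.0"><channel><title>Weblog A</title>
<item><title>A1</title><link>http://a.example.com/1</link>
<pubDate>Mon, 06 Oct 2003 09:00:00 GMT</pubDate></item>
<item><title>A2</title><link>http://a.example.com/2</link>
<pubDate>Wed, 08 Oct 2003 09:00:00 GMT</pubDate></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel><title>Weblog B</title>
<item><title>B1</title><link>http://b.example.com/1</link>
<pubDate>Tue, 07 Oct 2003 09:00:00 GMT</pubDate></item>
</channel></rss>"""


def read_items(feed_xml):
    """Yield one dict per <item>, tagged with the feed it came from."""
    channel = ET.fromstring(feed_xml).find("channel")
    source = channel.findtext("title", default="")
    for item in channel.findall("item"):
        date_text = item.findtext("pubDate")  # assumed present on every item
        yield {
            "source": source,
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "date_text": date_text,
            "published": parsedate_to_datetime(date_text),
        }


def merge_feeds(feeds, title="Combined feed"):
    """Combine several RSS documents into one, sorted newest first."""
    entries = sorted(
        (entry for feed in feeds for entry in read_items(feed)),
        key=lambda entry: entry["published"],
        reverse=True,
    )
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    for entry in entries:
        item = ET.SubElement(channel, "item")
        # Prefix each title with its source so readers can tell feeds apart.
        ET.SubElement(item, "title").text = f'{entry["source"]}: {entry["title"]}'
        ET.SubElement(item, "link").text = entry["link"]
        ET.SubElement(item, "pubDate").text = entry["date_text"]
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    print(merge_feeds([FEED_A, FEED_B]))
```

Run the same merge on a schedule and you also get the "over time" half of the picture: each run captures what the sources looked like at that moment.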