Collecting Data from the Modern Web
by Ryan Mitchell
Book Details

| Price | 2.50 |
|---|---|
| Pages | 340 p |
| File Size | 4,076 KB |
| File Type | PDF format |
| ISBN | 978-1-491-91027-6 |
| Copyright© | 2015 Ryan Mitchell |

To those who have not developed the skill, computer programming can seem like a kind of
magic. If programming is magic, then web scraping is wizardry; that is, the application of
magic for particularly impressive and useful — yet surprisingly effortless — feats.
In fact, in my years as a software engineer, I’ve found that very few programming
practices capture the excitement of both programmers and laymen alike quite like web
scraping. The ability to write a simple bot that collects data and streams it down a terminal
or stores it in a database, while not difficult, never fails to provide a certain thrill and sense
of possibility, no matter how many times you might have done it before.
It’s unfortunate that when I speak to other programmers about web scraping, there’s a lot
of misunderstanding and confusion about the practice. Some people aren’t sure if it’s legal
(it is), or how to handle the modern Web, with all its JavaScript, multimedia, and cookies.
Some get confused about the distinction between APIs and web scrapers.
This book seeks to put an end to many of these common questions and misconceptions
about web scraping, while providing a comprehensive guide to the most common web-scraping tasks.
Beginning in Chapter 1, I’ll provide code samples periodically to demonstrate concepts.
These code samples are in the public domain, and can be used with or without attribution
(although acknowledgment is always appreciated). All code samples also will be available
on the website for viewing and downloading.
What Is Web Scraping?
The automated gathering of data from the Internet is nearly as old as the Internet itself.
Although web scraping is not a new term, in years past the practice has been more
commonly known as screen scraping, data mining, web harvesting, or similar variations.
General consensus today seems to favor web scraping, so that is the term I’ll use
throughout the book, although I will occasionally refer to the web-scraping programs
themselves as bots.
In theory, web scraping is the practice of gathering data through any means other than a
program interacting with an API (or, obviously, through a human using a web browser).
This is most commonly accomplished by writing an automated program that queries a web
server, requests data (usually in the form of the HTML and other files that comprise web
pages), and then parses that data to extract needed information.
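The query, request, and parse cycle described above can be sketched with nothing but the Python 3 standard library. This is only a minimal illustration, not the approach used later in the book: the names `TitleParser`, `fetch`, and `extract_title` are hypothetical, and the sample HTML string stands in for a real server response so the example runs without a network connection.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is recorded; later ones are ignored.
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


def fetch(url):
    """Query a web server and return the response body as text."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_title(html):
    """Parse HTML and return the page title, or None if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip() if parser.title is not None else None


# Stand-in for the text that fetch("http://example.com") might return.
sample_html = (
    "<html><head><title> Example Page </title></head>"
    "<body><h1>Hi</h1></body></html>"
)
print(extract_title(sample_html))  # prints: Example Page
```

In a real scraper, the string passed to `extract_title` would come from `fetch(url)` rather than from a literal.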
In practice, web scraping encompasses a wide variety of programming techniques and
technologies, such as data analysis and information security. This book will cover the
basics of web scraping and crawling (Part I), and delve into some of the advanced topics in Part II.
Why Web Scraping?
If the only way you access the Internet is through a browser, you’re missing out on a huge
range of possibilities. Although browsers are handy for executing JavaScript, displaying
images, and arranging objects in a more human-readable format (among other things),
web scrapers are excellent at gathering and processing large amounts of data (among other
things). Rather than viewing one page at a time through the narrow window of a monitor,
you can view databases spanning thousands or even millions of pages at once.
In addition, web scrapers can go places that traditional search engines cannot. A Google
search for “cheapest flights to Boston” will result in a slew of advertisements and popular
flight search sites. Google only knows what these websites say on their content pages, not
the exact results of various queries entered into a flight search application. However, a
well-developed web scraper can chart the cost of a flight to Boston over time, across a
variety of websites, and tell you the best time to buy your ticket.
You might be asking: “Isn’t data gathering what APIs are for?” (If you’re unfamiliar with
APIs, see Chapter 4.) Well, APIs can be fantastic, if you find one that suits your purposes.
They can provide a convenient stream of well-formatted data from one server to another.
You can find an API for many different types of data you might want to use such as
Twitter posts or Wikipedia pages. In general, it is preferable to use an API (if one exists),
rather than build a bot to get the same data. However, there are several reasons why an
API might not exist:
You are gathering data across a collection of sites that do not have a cohesive API.
The data you want is a fairly small, finite set that the webmaster did not think warranted an API.
The source does not have the infrastructure or technical ability to create an API.
Even when an API does exist, request volume and rate limits, the types of data, or the
format of data that it provides might be insufficient for your purposes.
This is where web scraping steps in. With few exceptions, if you can view it in your
browser, you can access it via a Python script. If you can access it in a script, you can store
it in a database. And if you can store it in a database, you can do virtually anything with that data.
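As a rough sketch of the "store it in a database" step, the following uses the `sqlite3` module from the Python standard library with an in-memory database. The table layout, the `store_pages` function, and the sample rows are assumptions made up for illustration; they are not taken from the book.

```python
import sqlite3


def store_pages(conn, pages):
    """Save (url, title) pairs, replacing any row with the same URL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)", pages
    )
    conn.commit()


# An in-memory database keeps the example self-contained;
# a file path would make the data persistent.
conn = sqlite3.connect(":memory:")

# Hypothetical results from an earlier scraping step.
scraped = [
    ("http://example.com/a", "Page A"),
    ("http://example.com/b", "Page B"),
]
store_pages(conn, scraped)

rows = conn.execute("SELECT url, title FROM pages ORDER BY url").fetchall()
for url, title in rows:
    print(url, title)
```

Once the data is in a table like this, it can be queried, joined, and aggregated like any other dataset.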
There are obviously many extremely practical applications of having access to nearly
unlimited data: market forecasting, machine language translation, and even medical
diagnostics have benefited tremendously from the ability to retrieve and analyze data from
news sites, translated texts, and health forums, respectively.
Even in the art world, web scraping has opened up new frontiers for creation. The 2006
project “We Feel Fine” by Jonathan Harris and Sep Kamvar scraped a variety of English-language
blog sites for phrases starting with “I feel” or “I am feeling.” This led to a
popular data visualization, describing how the world was feeling day by day and minute by minute.
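To give a feel for the kind of text matching such a project involves, here is a small sketch that pulls out phrases beginning with "I feel" or "I am feeling" from a block of text. It is not the method used by "We Feel Fine"; the regular expression, the `find_feelings` function, and the sample text are all assumptions made for illustration.

```python
import re

# Match "I feel" or "I am feeling" and everything up to the end of the sentence.
FEELING_PATTERN = re.compile(
    r"\bI (?:feel|am feeling)\b[^.!?]*[.!?]?", re.IGNORECASE
)


def find_feelings(text):
    """Return every phrase in text that starts with 'I feel' or 'I am feeling'."""
    return [match.group(0).strip() for match in FEELING_PATTERN.finditer(text)]


# Made-up sample standing in for scraped blog text.
sample = (
    "Today was long. I feel tired but happy. "
    "Later, I am feeling better! We went home."
)
for phrase in find_feelings(sample):
    print(phrase)
```

A real pipeline would apply a function like this to the text extracted from each scraped blog post, then store the matches along with a timestamp.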
Regardless of your field, there is almost always a way web scraping can guide business
practices more effectively, improve productivity, or even branch off into a brand-new field entirely.
About This Book
This book is designed to serve not only as an introduction to web scraping, but as a
comprehensive guide to scraping almost every type of data from the modern Web.
Although it uses the Python programming language, and covers many Python basics, it
should not be used as an introduction to the language.
If you are not an expert programmer and don’t know any Python at all, this book might be
a bit of a challenge. If, however, you are an experienced programmer, you should find the
material easy to pick up. Appendix A covers installing and working with Python 3.x,
which is used throughout this book. If you have only used Python 2.x, or do not have 3.x
installed, you might want to review Appendix A.
If you’re looking for a more comprehensive Python resource, the book Introducing Python
by Bill Lubanovic is a very good, if lengthy, guide. For those with shorter attention spans,
the video series Introduction to Python by Jessica McKellar is an excellent resource.
Appendix C includes case studies, as well as a breakdown of key issues that might affect
how you can legally run scrapers in the United States and use the data that they produce.
Technical books are often able to focus on a single language or technology, but web
scraping is a relatively disparate subject, with practices that require the use of databases,
web servers, HTTP, HTML, Internet security, image processing, data science, and other
tools. This book attempts to cover all of these to an extent for the purpose of gathering
data from remote sources across the Internet.
Part I covers the subject of web scraping and web crawling in depth, with a strong focus
on a small handful of libraries used throughout the book. Part I can easily be used as a
comprehensive reference for these libraries and techniques (with certain exceptions, where
additional references will be provided).
Part II covers additional subjects that the reader might find useful when writing web
scrapers. These subjects are, unfortunately, too broad to be neatly wrapped up in a single
chapter. Because of this, frequent references will be made to other resources for
additional information.
This book is structured so that you can easily jump among chapters to find
only the web-scraping technique or information that you are looking for. When a concept
or piece of code builds on another mentioned in a previous chapter, I will explicitly
reference the section in which it was addressed.
O’Reilly books may be purchased for educational, business, or sales promotional use.
Online editions are also available for most titles (http://safaribooksonline.com). For more
information, contact our corporate/institutional sales department: 800-998-9938 or
corporate@oreilly.com.
Editors: Simon St. Laurent and Allyson MacDonald
Production Editor: Shiny Kalapurakkel
Copyeditor: Jasmine Kwityn
Proofreader: Carla Thornton
Indexer: Lucie Haskins
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
June 2015: First Edition