Web Scraping with Python

Collecting Data from the Modern Web

by Ryan Mitchell





Book Details
 Pages: 340
 File Size: 4,076 KB
 File Type: PDF format
 ISBN: 978-1-491-91027-6
 Copyright © 2015 Ryan Mitchell

Preface
To those who have not developed the skill, computer programming can seem like a kind of
magic. If programming is magic, then web scraping is wizardry; that is, the application of
magic for particularly impressive and useful — yet surprisingly effortless — feats.

In fact, in my years as a software engineer, I’ve found that very few programming
practices capture the excitement of programmers and laymen alike quite like web
scraping. The ability to write a simple bot that collects data and streams it down a terminal
or stores it in a database, while not difficult, never fails to provide a certain thrill and sense
of possibility, no matter how many times you might have done it before.

It’s unfortunate that when I speak to other programmers about web scraping, there’s a lot
of misunderstanding and confusion about the practice. Some people aren’t sure if it’s legal
(it is), or how to handle the modern Web, with all its JavaScript, multimedia, and cookies.
Some get confused about the distinction between APIs and web scrapers.

This book seeks to put an end to many of these common questions and misconceptions
about web scraping, while providing a comprehensive guide to the most common web-scraping tasks.

Beginning in Chapter 1, I’ll provide code samples periodically to demonstrate concepts.
These code samples are in the public domain and can be used with or without attribution
(although acknowledgment is always appreciated). All code samples will also be available
on the website for viewing and downloading.

What Is Web Scraping?
The automated gathering of data from the Internet is nearly as old as the Internet itself.
Although web scraping is not a new term, in years past the practice has been more
commonly known as screen scraping, data mining, web harvesting, or similar variations.
General consensus today seems to favor web scraping, so that is the term I’ll use
throughout the book, although I will occasionally refer to the web-scraping programs
themselves as bots.

In theory, web scraping is the practice of gathering data through any means other than a
program interacting with an API (or, obviously, through a human using a web browser).
This is most commonly accomplished by writing an automated program that queries a web
server, requests data (usually in the form of the HTML and other files that comprise web
pages), and then parses that data to extract needed information.
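As a minimal sketch of that query-and-parse cycle, the following uses only the Python standard library: `html.parser` extracts the `<title>` text from a page. The HTML string is hardcoded here for illustration; in a live scraper you would obtain it with `urllib.request.urlopen(url).read().decode("utf-8")` (the URL and page content below are hypothetical).

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the <title> element of an HTML page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only accumulate text while we are inside <title>...</title>
        if self.in_title:
            self.title += data


# In practice this string would come from a web server, e.g.:
#   html = urlopen("http://example.com").read().decode("utf-8")
html = "<html><head><title>Sample Page</title></head><body><p>Hello</p></body></html>"

parser = TitleParser()
parser.feed(html)
print(parser.title)  # prints: Sample Page
```

Later chapters use dedicated parsing libraries for this job; the point here is only that a scraper is, at bottom, a request followed by a parse.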

In practice, web scraping encompasses a wide variety of programming techniques and
technologies, such as data analysis and information security. This book will cover the
basics of web scraping and crawling (Part I), and delve into some of the advanced topics in Part II.

Why Web Scraping?
If the only way you access the Internet is through a browser, you’re missing out on a huge
range of possibilities. Although browsers are handy for executing JavaScript, displaying
images, and arranging objects in a more human-readable format (among other things),
web scrapers are excellent at gathering and processing large amounts of data (among other
things). Rather than viewing one page at a time through the narrow window of a monitor,
you can view databases spanning thousands or even millions of pages at once.

In addition, web scrapers can go places that traditional search engines cannot. A Google
search for “cheapest flights to Boston” will result in a slew of advertisements and popular
flight search sites. Google only knows what these websites say on their content pages, not
the exact results of various queries entered into a flight search application. However, a
well-developed web scraper can chart the cost of a flight to Boston over time, across a
variety of websites, and tell you the best time to buy your ticket.

You might be asking: “Isn’t data gathering what APIs are for?” (If you’re unfamiliar with
APIs, see Chapter 4.) Well, APIs can be fantastic, if you find one that suits your purposes.
They can provide a convenient stream of well-formatted data from one server to another.
You can find an API for many different types of data you might want to use, such as
Twitter posts or Wikipedia pages. In general, it is preferable to use an API (if one exists),
rather than build a bot to get the same data. However, there are several reasons why an
API might not exist:
- You are gathering data across a collection of sites that do not have a cohesive API.
- The data you want is a fairly small, finite set that the webmaster did not think warranted an API.
- The source does not have the infrastructure or technical ability to create an API.
- Even when an API does exist, the request volume and rate limits, the types of data, or the
format of data that it provides might be insufficient for your purposes.

This is where web scraping steps in. With few exceptions, if you can view it in your
browser, you can access it via a Python script. If you can access it in a script, you can store
it in a database. And if you can store it in a database, you can do virtually anything with that data.
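The script-to-database step of that chain can be sketched with the standard library’s `sqlite3` module. The records below are hypothetical stand-ins for data a scraper might have extracted; an in-memory database is used so the example is self-contained (a file path would make it persistent).

```python
import sqlite3

# Hypothetical scraped records: (page URL, extracted title)
records = [
    ("http://example.com/a", "Page A"),
    ("http://example.com/b", "Page B"),
]

# ":memory:" keeps the database in RAM; pass a filename to persist it
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO pages VALUES (?, ?)", records)
conn.commit()

# Once stored, the data can be queried, joined, and analyzed like any other
rows = list(conn.execute("SELECT url, title FROM pages ORDER BY url"))
for url, title in rows:
    print(url, title)
```

The same pattern scales from a two-row example to millions of scraped pages; only the storage backend changes.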
There are obviously many extremely practical applications of having access to nearly
unlimited data: market forecasting, machine language translation, and even medical
diagnostics have benefited tremendously from the ability to retrieve and analyze data from
news sites, translated texts, and health forums, respectively.

Even in the art world, web scraping has opened up new frontiers for creation. The 2006
project “We Feel Fine” by Jonathan Harris and Sep Kamvar scraped a variety of English-language
blog sites for phrases starting with “I feel” or “I am feeling.” This led to a
popular data visualization describing how the world was feeling day by day and minute by minute.

Regardless of your field, there is almost always a way web scraping can guide business
practices more effectively, improve productivity, or even branch off into a brand-new field entirely.

About This Book
This book is designed to serve not only as an introduction to web scraping, but as a
comprehensive guide to scraping almost every type of data from the modern Web.
Although it uses the Python programming language and covers many Python basics, it
should not be used as an introduction to the language.

If you are not an expert programmer and don’t know any Python at all, this book might be
a bit of a challenge. If, however, you are an experienced programmer, you should find the
material easy to pick up. Appendix A covers installing and working with Python 3.x,
which is used throughout this book. If you have only used Python 2.x, or do not have 3.x
installed, you might want to review Appendix A.

If you’re looking for a more comprehensive Python resource, the book Introducing Python
by Bill Lubanovic is a very good, if lengthy, guide. For those with shorter attention spans,
the video series Introduction to Python by Jessica McKellar is an excellent resource.
Appendix C includes case studies, as well as a breakdown of key issues that might affect
how you can legally run scrapers in the United States and use the data that they produce.
Technical books are often able to focus on a single language or technology, but web
scraping is a relatively disparate subject, with practices that require the use of databases,
web servers, HTTP, HTML, Internet security, image processing, data science, and other
tools. This book attempts to cover all of these to an extent for the purpose of gathering
data from remote sources across the Internet.

Part I covers the subject of web scraping and web crawling in depth, with a strong focus
on a small handful of libraries used throughout the book. Part I can easily be used as a
comprehensive reference for these libraries and techniques (with certain exceptions, where
additional references will be provided).

Part II covers additional subjects that the reader might find useful when writing web
scrapers. These subjects are, unfortunately, too broad to be neatly wrapped up in a single
chapter. Because of this, frequent references will be made to other resources for
additional information.

This book is arranged to make it easy to jump around among chapters and find only the
web-scraping technique or information you are looking for. When a concept or piece of
code builds on one mentioned in a previous chapter, I will explicitly reference the
section where it was addressed.



O’Reilly books may be purchased for educational, business, or sales promotional use.
Online editions are also available for most titles (http://safaribooksonline.com). For more
information, contact our corporate/institutional sales department: 800-998-9938 or
corporate@oreilly.com.
Editors: Simon St. Laurent and Allyson MacDonald
Production Editor: Shiny Kalapurakkel
Copyeditor: Jasmine Kwityn
Proofreader: Carla Thornton
Indexer: Lucie Haskins
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
June 2015: First Edition