SECOND EDITION
By Matthew A. Russell
Data Mining FACEBOOK, TWITTER, LINKEDIN, GOOGLE+, GITHUB, AND MORE
If the ax is dull and its edge unsharpened, more strength is needed,
but skill will bring success.
—Ecclesiastes 10:10
Preface
The Web is more a social creation
than a technical one.
I designed it for a social effect—to
help people work together—and not as a
technical toy. The ultimate goal of the Web
is to support and improve our weblike existence
in the world. We clump into families, associations,
and companies. We develop trust across the miles
and distrust around the corner.
—Tim Berners-Lee, Weaving the Web (Harper)
README.1st
This book has been carefully designed to provide an incredible learning experience for
a particular target audience, and in order to avoid any unnecessary confusion about its
scope or purpose by way of disgruntled emails, bad book reviews, or other misunderstandings
that can come up, the remainder of this preface tries to help you determine
whether you are part of that target audience. As a very busy professional, I consider my
time my most valuable asset, and I want you to know right from the beginning that I
believe that the same is true of you. Although I often fail, I really do try to honor my
neighbor above myself as I walk out this life, and this preface is my attempt to honor
you, the reader, by making it clear whether or not this book can meet your expectations.
Managing Your Expectations
Some of the most basic assumptions this book makes about you as a reader are that you
want to learn how to mine data from popular social web properties, avoid technology
hassles when running sample code, and have lots of fun along the way. Although you
could read this book solely for the purpose of learning what is possible, you should know
up front that it has been written in such a way that you really could follow along with
the many exercises and become a data miner once you’ve completed the few simple steps
to set up a development environment. If you’ve done some programming before, you
should find that it’s relatively painless to get up and running with the code examples.
Even if you’ve never programmed before but consider yourself the least bit tech-savvy,
I daresay that you could use this book as a starting point to a remarkable journey that
will stretch your mind in ways that you probably haven’t even imagined yet.
To fully enjoy this book and all that it has to offer, you need to be interested in the vast
possibilities for mining the rich data tucked away in popular social websites such as
Twitter, Facebook, LinkedIn, and Google+, and you need to be motivated enough to
download a virtual machine and follow along with the book’s example code in IPython
Notebook, a fantastic web-based tool that features all of the examples for every chapter.
Executing the examples is usually as easy as pressing a few keys, since all of the code is
presented to you in a friendly user interface. This book will teach you a few things that
you’ll be thankful to learn and will add a few indispensable tools to your toolbox, but
perhaps even more importantly, it will tell you a story and entertain you along the way.
It’s a story about data science involving social websites, the data that’s tucked away inside
of them, and some of the intriguing possibilities of what you (or anyone else) could do
with this data.
If you were to read this book from cover to cover, you’d notice that this story unfolds
on a chapter-by-chapter basis. While each chapter roughly follows a predictable template
that introduces a social website, teaches you how to use its API to fetch data, and
introduces some techniques for data analysis, the broader story the book tells crescendos
in complexity. Earlier chapters in the book take a little more time to introduce fundamental
concepts, while later chapters systematically build upon the foundation from
earlier chapters and gradually introduce a broad array of tools and techniques for mining
the social web that you can take with you into other aspects of your life as a data scientist,
analyst, visionary thinker, or curious reader.
Some of the most popular social websites have transitioned from fad to mainstream to
household names over recent years, changing the way we live our lives on and off the
Web and enabling technology to bring out the best (and sometimes the worst) in us.
Generally speaking, each chapter of this book interlaces slivers of the social web along
with data mining, analysis, and visualization techniques to explore data and answer the
following representative questions:
• Who knows whom, and which people are common to their social networks?
• How frequently are particular people communicating with one another?
• Which social network connections generate the most value for a particular niche?
• How does geography affect your social connections in an online world?
• Who are the most influential/popular people in a social network?
• What are people chatting about (and is it valuable)?
• What are people interested in based upon the human language that they use in a
digital world?
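As a small taste of the first question, here is a minimal sketch in Python of finding the people common to two social networks. The names and the `connections` dictionary are hypothetical sample data, not anything fetched from a real API, and `common_connections` is an illustrative helper rather than code from the book's notebooks.

```python
# Hypothetical sample data: each person maps to the set of people they know.
connections = {
    "alice": {"bob", "carol", "dave"},
    "bob": {"alice", "carol", "erin"},
    "carol": {"alice", "bob"},
}


def common_connections(network, a, b):
    """Return the people who appear in both a's and b's networks."""
    # Set intersection keeps only the names present in both sets.
    return network.get(a, set()) & network.get(b, set())


print(sorted(common_connections(connections, "alice", "bob")))
```

With real data, the sets would be populated from a social website's API instead of being typed in by hand, but the intersection step would look much the same.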
The answers to these basic kinds of questions often yield valuable insight and present
lucrative opportunities for entrepreneurs, social scientists, and other curious practitioners
who are trying to understand a problem space and find solutions. Activities such
as building a turnkey killer app from scratch to answer these questions, venturing far
beyond the typical usage of visualization libraries, and constructing just about anything
state-of-the-art are not within the scope of this book. You’ll be really disappointed if
you purchase this book because you want to do one of those things. However, this book
does provide the fundamental building blocks to answer these questions and provide a
springboard that might be exactly what you need to build that killer app or conduct that
research study. Skim a few chapters and see for yourself. This book covers a lot of ground.
Improvements Specific to the Second Edition
When I began working on this second edition of Mining the Social Web, I don’t think I
quite realized what I was getting myself into. What started out as a “substantial update”
is now what I’d consider almost a rewrite of the first edition. I’ve extensively updated
each chapter, I’ve strategically added new content, and I really do believe that this second
edition is superior to the first in almost every way. My earnest hope is that it’s going to
be able to reach a much wider audience than the first edition and invigorate a broad
community of interest with tools, techniques, and practical advice to implement ideas
that depend on munging and analyzing data from social websites. If I am successful in
this endeavor, we’ll see a broader awareness of what it is possible to do with data from
social websites and more budding entrepreneurs and enthusiastic hobbyists putting
social web data to work.
A book is a product, and first editions of any product can be vastly improved upon,
aren’t always what customers ideally would have wanted, and can have great potential
if appropriate feedback is humbly accepted and adjustments are made. This book is no
exception, and the feedback and learning experience from interacting with readers and
consumers of this book’s sample code over the past few years have been incredibly
important in shaping this book to be far beyond anything I could have designed if left
to my own devices. I’ve incorporated as much of that feedback as possible, and it mostly
boils down to the theme of simplifying the learning experience for readers.
Simplification presents itself in this second edition in a variety of ways. Perhaps most
notably, one of the biggest differences between this book and the previous edition is
that the technology toolchain is vastly simplified, and I’ve employed configuration
management by way of an amazing virtualization technology called Vagrant. The previous
edition involved a variety of databases for storage, various visualization toolkits,
and assumed that readers could just figure out most of the installation and configuration
by reading the online instructions.
This edition, on the other hand, goes to great lengths to introduce as few disparate
technology dependencies as possible and presents them all with a virtual machine experience
that abstracts away the complexities of software installation and configuration,
which are sometimes considerably more challenging than they might initially seem.
From a certain vantage point, the core toolbox is just IPython Notebook and some third-party
package dependencies (all of which are versioned so that updates to open source
software don’t cause code breakage) that come preinstalled on a virtual machine. Inline
visualizations are even baked into the IPython Notebooks, rendering from within IPython
Notebook itself, and are consolidated down to a single JavaScript toolkit (D3.js)
that maintains visually consistent aesthetics across the chapters.
Spending less time introducing disparate technology in the book affords the opportunity to spend more time engaging in fundamental
exercises in analysis. One of the recurring critiques from readers of the first
edition’s content was that more time should have been spent analyzing and discussing
the implications of the exercises (a fair criticism indeed). My hope is that this second
edition delivers on that wonderful suggestion by augmenting existing content with additional
explanations in some of the void that was left behind. In a sense, this second
edition does “more with less,” and it delivers significantly more value to you as the reader
because of it.
In terms of structural reorganization, you may notice that a chapter on GitHub has been
added to this second edition. GitHub is interesting for a variety of reasons, and as you’ll
observe from reviewing the chapter, it’s not all just about “social coding” (although that’s
a big part of it). GitHub is a very social website that spans international boundaries, is
rapidly becoming a general purpose collaboration hub that extends beyond coding, and
can fairly be interpreted as an interest graph—a graph that connects people and the
things that interest them. Interest graphs, whether derived from GitHub or elsewhere,
are a very important concept in the unfolding saga that is the Web, and as someone
interested in the social web, you won’t want to overlook them.
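To make the idea of an interest graph a little more concrete, here is a minimal sketch in Python that connects people to the things that interest them. The people, the interests, and the helper names (`build_interest_graph`, `shared_interest_peers`) are all hypothetical and chosen only for illustration; this is one simple way to represent such a graph, not the approach the GitHub chapter necessarily takes.

```python
from collections import defaultdict

# Hypothetical sample data: (person, thing that interests them) pairs.
interests = [
    ("alice", "python"),
    ("alice", "d3"),
    ("bob", "python"),
    ("carol", "d3"),
    ("carol", "mongodb"),
]


def build_interest_graph(edges):
    """Return two adjacency maps: person -> interests and interest -> people."""
    by_person = defaultdict(set)
    by_interest = defaultdict(set)
    for person, thing in edges:
        by_person[person].add(thing)
        by_interest[thing].add(person)
    return by_person, by_interest


def shared_interest_peers(person, by_person, by_interest):
    """Return everyone who shares at least one interest with `person`."""
    peers = set()
    for thing in by_person[person]:
        peers |= by_interest[thing]
    peers.discard(person)  # a person is not their own peer
    return peers


by_person, by_interest = build_interest_graph(interests)
print(sorted(shared_interest_peers("alice", by_person, by_interest)))
```

Keeping both maps means you can pivot in either direction: from a person to their interests, or from an interest to the people who share it.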
In addition to a new chapter on GitHub, the two “advanced” chapters on Twitter from
the first edition have been refactored and expanded into a collection of more easily
adaptable Twitter recipes that are organized into Chapter 9. Whereas the opening chapter
of the book starts off slowly and warms you up to the notion of social web APIs and
data mining, the final chapter of the book comes back full circle with a battery of diverse
building blocks that you can adapt and assemble in various ways to achieve a truly
enormous set of possibilities. Finally, the chapter that was previously dedicated to microformats
has been folded into what is now Chapter 8, which is designed to be more
of a forward-looking kind of cocktail discussion about the “semantically marked-up
web” than an extensive collection of programming exercises, like the chapters before it.
Table of Contents
Preface
Part I. A Guided Tour of the Social Web
Prelude
1. Mining Twitter: Exploring Trending Topics, Discovering What People Are Talking About, and More
1.1. Overview
1.2. Why Is Twitter All the Rage?
1.3. Exploring Twitter’s API
1.3.1. Fundamental Twitter Terminology
1.3.2. Creating a Twitter API Connection
1.3.3. Exploring Trending Topics
1.3.4. Searching for Tweets
1.4. Analyzing the 140 Characters
1.4.1. Extracting Tweet Entities
1.4.2. Analyzing Tweets and Tweet Entities with Frequency Analysis
1.4.3. Computing the Lexical Diversity of Tweets
1.4.4. Examining Patterns in Retweets
1.4.5. Visualizing Frequency Data with Histograms
1.5. Closing Remarks
1.6. Recommended Exercises
1.7. Online Resources
2. Mining Facebook: Analyzing Fan Pages, Examining Friendships, and More
2.1. Overview
2.2. Exploring Facebook’s Social Graph API
2.2.1. Understanding the Social Graph API
2.2.2. Understanding the Open Graph Protocol
2.3. Analyzing Social Graph Connections
2.3.1. Analyzing Facebook Pages
2.3.2. Examining Friendships
2.4. Closing Remarks
2.5. Recommended Exercises
2.6. Online Resources
3. Mining LinkedIn: Faceting Job Titles, Clustering Colleagues, and More
3.1. Overview
3.2. Exploring the LinkedIn API
3.2.1. Making LinkedIn API Requests
3.2.2. Downloading LinkedIn Connections as a CSV File
3.3. Crash Course on Clustering Data
3.3.1. Clustering Enhances User Experiences
3.3.2. Normalizing Data to Enable Analysis
3.3.3. Measuring Similarity
3.3.4. Clustering Algorithms
3.4. Closing Remarks
3.5. Recommended Exercises
3.6. Online Resources
4. Mining Google+: Computing Document Similarity, Extracting Collocations, and More
4.1. Overview
4.2. Exploring the Google+ API
4.2.1. Making Google+ API Requests
4.3. A Whiz-Bang Introduction to TF-IDF
4.3.1. Term Frequency
4.3.2. Inverse Document Frequency
4.3.3. TF-IDF
4.4. Querying Human Language Data with TF-IDF
4.4.1. Introducing the Natural Language Toolkit
4.4.2. Applying TF-IDF to Human Language
4.4.3. Finding Similar Documents
4.4.4. Analyzing Bigrams in Human Language
4.4.5. Reflections on Analyzing Human Language Data
4.5. Closing Remarks
4.6. Recommended Exercises
4.7. Online Resources
5. Mining Web Pages: Using Natural Language Processing to Understand Human Language, Summarize Blog Posts, and More
5.1. Overview
5.2. Scraping, Parsing, and Crawling the Web
5.2.1. Breadth-First Search in Web Crawling
5.3. Discovering Semantics by Decoding Syntax
5.3.1. Natural Language Processing Illustrated Step-by-Step
5.3.2. Sentence Detection in Human Language Data
5.3.3. Document Summarization
5.4. Entity-Centric Analysis: A Paradigm Shift
5.4.1. Gisting Human Language Data
5.5. Quality of Analytics for Processing Human Language Data
5.6. Closing Remarks
5.7. Recommended Exercises
5.8. Online Resources
6. Mining Mailboxes: Analyzing Who’s Talking to Whom About What, How Often, and More
6.1. Overview
6.2. Obtaining and Processing a Mail Corpus
6.2.1. A Primer on Unix Mailboxes
6.2.2. Getting the Enron Data
6.2.3. Converting a Mail Corpus to a Unix Mailbox
6.2.4. Converting Unix Mailboxes to JSON
6.2.5. Importing a JSONified Mail Corpus into MongoDB
6.2.6. Programmatically Accessing MongoDB with Python
6.3. Analyzing the Enron Corpus
6.3.1. Querying by Date/Time Range
6.3.2. Analyzing Patterns in Sender/Recipient Communications
6.3.3. Writing Advanced Queries
6.3.4. Searching Emails by Keywords
6.4. Discovering and Visualizing Time-Series Trends
6.5. Analyzing Your Own Mail Data
6.5.1. Accessing Your Gmail with OAuth
6.5.2. Fetching and Parsing Email Messages with IMAP
6.5.3. Visualizing Patterns in GMail with the “Graph Your Inbox” Chrome Extension
6.6. Closing Remarks
6.7. Recommended Exercises
6.8. Online Resources
7. Mining GitHub: Inspecting Software Collaboration Habits, Building Interest Graphs, and More
7.1. Overview
7.2. Exploring GitHub’s API
7.2.1. Creating a GitHub API Connection
7.2.2. Making GitHub API Requests
7.3. Modeling Data with Property Graphs
7.4. Analyzing GitHub Interest Graphs
7.4.1. Seeding an Interest Graph
7.4.2. Computing Graph Centrality Measures
7.4.3. Extending the Interest Graph with “Follows” Edges for Users
7.4.4. Using Nodes as Pivots for More Efficient Queries
7.4.5. Visualizing Interest Graphs
7.5. Closing Remarks
7.6. Recommended Exercises
7.7. Online Resources
8. Mining the Semantically Marked-Up Web: Extracting Microformats, Inferencing over RDF, and More
8.1. Overview
8.2. Microformats: Easy-to-Implement Metadata
8.2.1. Geocoordinates: A Common Thread for Just About Anything
8.2.2. Using Recipe Data to Improve Online Matchmaking
8.2.3. Accessing LinkedIn’s 200 Million Online Résumés
8.3. From Semantic Markup to Semantic Web: A Brief Interlude
8.4. The Semantic Web: An Evolutionary Revolution
8.4.1. Man Cannot Live on Facts Alone
8.4.2. Inferencing About an Open World
8.5. Closing Remarks
8.6. Recommended Exercises
8.7. Online Resources
Part II. Twitter Cookbook
9. Twitter Cookbook
9.1. Accessing Twitter’s API for Development Purposes
9.2. Doing the OAuth Dance to Access Twitter’s API for Production Purposes
9.3. Discovering the Trending Topics
9.4. Searching for Tweets
9.5. Constructing Convenient Function Calls
9.6. Saving and Restoring JSON Data with Text Files
9.7. Saving and Accessing JSON Data with MongoDB
9.8. Sampling the Twitter Firehose with the Streaming API
9.9. Collecting Time-Series Data
9.10. Extracting Tweet Entities
9.11. Finding the Most Popular Tweets in a Collection of Tweets
9.12. Finding the Most Popular Tweet Entities in a Collection of Tweets
9.13. Tabulating Frequency Analysis
9.14. Finding Users Who Have Retweeted a Status
9.15. Extracting a Retweet’s Attribution
9.16. Making Robust Twitter Requests
9.17. Resolving User Profile Information
9.18. Extracting Tweet Entities from Arbitrary Text
9.19. Getting All Friends or Followers for a User
9.20. Analyzing a User’s Friends and Followers
9.21. Harvesting a User’s Tweets
9.22. Crawling a Friendship Graph
9.23. Analyzing Tweet Content
9.24. Summarizing Link Targets
9.25. Analyzing a User’s Favorite Tweets
9.26. Closing Remarks
9.27. Recommended Exercises
9.28. Online Resources
Part III. Appendixes
A. Information About This Book’s Virtual Machine Experience
B. OAuth Primer
C. Python and IPython Notebook Tips & Tricks
Index