Mining The Social Web 2nd

SECOND EDITION

By Matthew A. Russell

Data Mining FACEBOOK, TWITTER, LINKEDIN, GOOGLE+, GITHUB, AND MORE

Mining the Social Web, Second Edition

If the ax is dull and its edge unsharpened, more strength is needed,
but skill will bring success.
—Ecclesiastes 10:10

Preface
The Web is more a social creation
than a technical one.
I designed it for a social effect—to
help people work together—and not as a
technical toy. The ultimate goal of the Web
is to support and improve our weblike existence
in the world. We clump into families, associations,
and companies. We develop trust across the miles
and distrust around the corner.
—Tim Berners-Lee, Weaving the Web (Harper)

README.1st

This book has been carefully designed to provide an incredible learning experience for
a particular target audience, and in order to avoid any unnecessary confusion about its
scope or purpose by way of disgruntled emails, bad book reviews, or other misunderstandings
that can come up, the remainder of this preface tries to help you determine
whether you are part of that target audience. As a very busy professional, I consider my
time my most valuable asset, and I want you to know right from the beginning that I
believe that the same is true of you. Although I often fail, I really do try to honor my
neighbor above myself as I walk out this life, and this preface is my attempt to honor
you, the reader, by making it clear whether or not this book can meet your expectations.

Managing Your Expectations

Some of the most basic assumptions this book makes about you as a reader are that you
want to learn how to mine data from popular social web properties, avoid technology
hassles when running sample code, and have lots of fun along the way. Although you
could read this book solely for the purpose of learning what is possible, you should know
up front that it has been written in such a way that you really could follow along with
the many exercises and become a data miner once you’ve completed the few simple steps
to set up a development environment. If you’ve done some programming before, you
should find that it’s relatively painless to get up and running with the code examples.
Even if you’ve never programmed before but consider yourself the least bit tech-savvy,
I daresay that you could use this book as a starting point to a remarkable journey that
will stretch your mind in ways that you probably haven’t even imagined yet.

To fully enjoy this book and all that it has to offer, you need to be interested in the vast
possibilities for mining the rich data tucked away in popular social websites such as
Twitter, Facebook, LinkedIn, and Google+, and you need to be motivated enough to
download a virtual machine and follow along with the book’s example code in IPython
Notebook, a fantastic web-based tool that features all of the examples for every chapter.
Executing the examples is usually as easy as pressing a few keys, since all of the code is
presented to you in a friendly user interface. This book will teach you a few things that
you’ll be thankful to learn and will add a few indispensable tools to your toolbox, but
perhaps even more importantly, it will tell you a story and entertain you along the way.
It’s a story about data science involving social websites, the data that’s tucked away inside
of them, and some of the intriguing possibilities of what you (or anyone else) could do
with this data.

If you were to read this book from cover to cover, you’d notice that this story unfolds
on a chapter-by-chapter basis. While each chapter roughly follows a predictable template
that introduces a social website, teaches you how to use its API to fetch data, and
introduces some techniques for data analysis, the broader story the book tells crescendos
in complexity. Earlier chapters in the book take a little more time to introduce fundamental
concepts, while later chapters systematically build upon the foundation from
earlier chapters and gradually introduce a broad array of tools and techniques for mining
the social web that you can take with you into other aspects of your life as a data scientist,
analyst, visionary thinker, or curious reader.

Some of the most popular social websites have transitioned from fad to mainstream to
household names over recent years, changing the way we live our lives on and off the
Web and enabling technology to bring out the best (and sometimes the worst) in us.
Generally speaking, each chapter of this book interlaces slivers of the social web along
with data mining, analysis, and visualization techniques to explore data and answer the
following representative questions:

• Who knows whom, and which people are common to their social networks?
• How frequently are particular people communicating with one another?
• Which social network connections generate the most value for a particular niche?
• How does geography affect your social connections in an online world?
• Who are the most influential/popular people in a social network?
• What are people chatting about (and is it valuable)?
• What are people interested in based upon the human language that they use in a
  digital world?
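
As a rough illustration of the first two questions (this sketch is not taken from the book; the names, the sample data, and the helper functions `mutual_connections` and `communication_counts` are all hypothetical), mutual connections reduce to a set intersection and communication frequency to a simple tally:

```python
from collections import Counter

# Hypothetical sample data (not from the book): each person's connections.
connections = {
    "alice": {"bob", "carol", "dave"},
    "bob": {"alice", "carol", "erin"},
    "carol": {"alice", "bob"},
}

# Hypothetical message log as (sender, recipient) pairs.
messages = [
    ("alice", "bob"),
    ("bob", "alice"),
    ("alice", "carol"),
    ("alice", "bob"),
]


def mutual_connections(network, a, b):
    """Return the people who appear in both a's and b's networks."""
    return network.get(a, set()) & network.get(b, set())


def communication_counts(log):
    """Count messages per unordered pair of people, ignoring direction."""
    return Counter(frozenset(pair) for pair in log)


if __name__ == "__main__":
    print(sorted(mutual_connections(connections, "alice", "bob")))
    for pair, count in communication_counts(messages).most_common():
        print(sorted(pair), count)
```

Real data fetched from a social website's API would need cleaning and normalization before a tally like this becomes meaningful.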

The answers to these basic kinds of questions often yield valuable insight and present
lucrative opportunities for entrepreneurs, social scientists, and other curious practitioners
who are trying to understand a problem space and find solutions. Activities such
as building a turnkey killer app from scratch to answer these questions, venturing far
beyond the typical usage of visualization libraries, and constructing just about anything
state-of-the-art are not within the scope of this book. You’ll be really disappointed if
you purchase this book because you want to do one of those things. However, this book
does provide the fundamental building blocks to answer these questions and provide a
springboard that might be exactly what you need to build that killer app or conduct that
research study. Skim a few chapters and see for yourself. This book covers a lot of ground.

Improvements Specific to the Second Edition

When I began working on this second edition of Mining the Social Web, I don’t think I
quite realized what I was getting myself into. What started out as a “substantial update”
is now what I’d consider almost a rewrite of the first edition. I’ve extensively updated
each chapter, I’ve strategically added new content, and I really do believe that this second
edition is superior to the first in almost every way. My earnest hope is that it’s going to
be able to reach a much wider audience than the first edition and invigorate a broad
community of interest with tools, techniques, and practical advice to implement ideas
that depend on munging and analyzing data from social websites. If I am successful in
this endeavor, we’ll see a broader awareness of what it is possible to do with data from
social websites and more budding entrepreneurs and enthusiastic hobbyists putting
social web data to work.

A book is a product, and first editions of any product can be vastly improved upon,
aren’t always what customers ideally would have wanted, and can have great potential
if appropriate feedback is humbly accepted and adjustments are made. This book is no
exception, and the feedback and learning experience from interacting with readers and
consumers of this book’s sample code over the past few years have been incredibly
important in shaping this book to be far beyond anything I could have designed if left
to my own devices. I’ve incorporated as much of that feedback as possible, and it mostly
boils down to the theme of simplifying the learning experience for readers.

Simplification presents itself in this second edition in a variety of ways. Perhaps most
notably, one of the biggest differences between this book and the previous edition is
that the technology toolchain is vastly simplified, and I’ve employed configuration
management by way of an amazing virtualization technology called Vagrant. The previous
edition involved a variety of databases for storage, various visualization toolkits,
and assumed that readers could just figure out most of the installation and configuration
by reading the online instructions.

This edition, on the other hand, goes to great lengths to introduce as few disparate
technology dependencies as possible and presents them all with a virtual machine experience
that abstracts away the complexities of software installation and configuration,
which are sometimes considerably more challenging than they might initially seem.
From a certain vantage point, the core toolbox is just IPython Notebook and some third-party
package dependencies (all of which are versioned so that updates to open source
software don’t cause code breakage) that come preinstalled on a virtual machine. Inline
visualizations are even baked into the IPython Notebooks, rendering from within IPython
Notebook itself, and are consolidated down to a single JavaScript toolkit (D3.js)
that maintains visually consistent aesthetics across the chapters.

Less technology in the book affords the opportunity to spend more time engaging in fundamental
exercises in analysis. One of the recurring critiques from readers of the first
edition’s content was that more time should have been spent analyzing and discussing
the implications of the exercises (a fair criticism indeed). My hope is that this second
edition delivers on that wonderful suggestion by augmenting existing content with additional
explanations in some of the void that was left behind. In a sense, this second
edition does “more with less,” and it delivers significantly more value to you as the reader
because of it.

In terms of structural reorganization, you may notice that a chapter on GitHub has been
added to this second edition. GitHub is interesting for a variety of reasons, and as you’ll
observe from reviewing the chapter, it’s not all just about “social coding” (although that’s
a big part of it). GitHub is a very social website that spans international boundaries, is
rapidly becoming a general purpose collaboration hub that extends beyond coding, and
can fairly be interpreted as an interest graph—a graph that connects people and the
things that interest them. Interest graphs, whether derived from GitHub or elsewhere,
are a very important concept in the unfolding saga that is the Web, and as someone
interested in the social web, you won’t want to overlook them.
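
To make the idea of an interest graph concrete, here is a minimal sketch in plain Python. It is an illustration only, not the approach the GitHub chapter takes: the people, the `repo:` labels, and the helpers `build_interest_graph` and `shared_interest_peers` are hypothetical. Edges connect a person to a thing that interests them, and indexing those edges in both directions makes it easy to find people with overlapping interests:

```python
from collections import defaultdict

# Hypothetical "interest" edges (person, thing of interest); not real data.
edges = [
    ("ana", "repo:networkx"),
    ("ben", "repo:networkx"),
    ("ben", "repo:d3"),
    ("cat", "repo:d3"),
]


def build_interest_graph(pairs):
    """Index edges both ways: person -> interests and interest -> people."""
    interests_of = defaultdict(set)
    people_interested_in = defaultdict(set)
    for person, thing in pairs:
        interests_of[person].add(thing)
        people_interested_in[thing].add(person)
    return interests_of, people_interested_in


def shared_interest_peers(person, interests_of, people_interested_in):
    """Return everyone else who shares at least one interest with person."""
    peers = set()
    for thing in interests_of.get(person, set()):
        peers |= people_interested_in[thing]
    peers.discard(person)
    return peers


if __name__ == "__main__":
    interests_of, people_interested_in = build_interest_graph(edges)
    for name in ("ana", "ben", "cat"):
        peers = shared_interest_peers(name, interests_of, people_interested_in)
        print(name, sorted(peers))
```

A dedicated graph library would add centrality measures and traversal on top of this, but the two dictionaries are enough to show the structure: people on one side, interests on the other, and edges between them.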

In addition to a new chapter on GitHub, the two “advanced” chapters on Twitter from
the first edition have been refactored and expanded into a collection of more easily
adaptable Twitter recipes that are organized into Chapter 9. Whereas the opening chapter
of the book starts off slowly and warms you up to the notion of social web APIs and
data mining, the final chapter of the book comes back full circle with a battery of diverse
building blocks that you can adapt and assemble in various ways to achieve a truly
enormous set of possibilities. Finally, the chapter that was previously dedicated to microformats
has been folded into what is now Chapter 8, which is designed to be more
of a forward-looking kind of cocktail discussion about the “semantically marked-up
web” than an extensive collection of programming exercises, like the chapters before it.

Table of Contents

Preface

Part I. A Guided Tour of the Social Web

Prelude

1. Mining Twitter: Exploring Trending Topics, Discovering What People Are Talking About, and More
1.1. Overview
1.2. Why Is Twitter All the Rage?
1.3. Exploring Twitter’s API
1.3.1. Fundamental Twitter Terminology
1.3.2. Creating a Twitter API Connection
1.3.3. Exploring Trending Topics
1.3.4. Searching for Tweets
1.4. Analyzing the 140 Characters
1.4.1. Extracting Tweet Entities
1.4.2. Analyzing Tweets and Tweet Entities with Frequency Analysis
1.4.3. Computing the Lexical Diversity of Tweets
1.4.4. Examining Patterns in Retweets
1.4.5. Visualizing Frequency Data with Histograms
1.5. Closing Remarks
1.6. Recommended Exercises
1.7. Online Resources

2. Mining Facebook: Analyzing Fan Pages, Examining Friendships, and More
2.1. Overview
2.2. Exploring Facebook’s Social Graph API
2.2.1. Understanding the Social Graph API
2.2.2. Understanding the Open Graph Protocol
2.3. Analyzing Social Graph Connections
2.3.1. Analyzing Facebook Pages
2.3.2. Examining Friendships
2.4. Closing Remarks
2.5. Recommended Exercises
2.6. Online Resources

3. Mining LinkedIn: Faceting Job Titles, Clustering Colleagues, and More
3.1. Overview
3.2. Exploring the LinkedIn API
3.2.1. Making LinkedIn API Requests
3.2.2. Downloading LinkedIn Connections as a CSV File
3.3. Crash Course on Clustering Data
3.3.1. Clustering Enhances User Experiences
3.3.2. Normalizing Data to Enable Analysis
3.3.3. Measuring Similarity
3.3.4. Clustering Algorithms
3.4. Closing Remarks
3.5. Recommended Exercises
3.6. Online Resources

4. Mining Google+: Computing Document Similarity, Extracting Collocations, and More
4.1. Overview
4.2. Exploring the Google+ API
4.2.1. Making Google+ API Requests
4.3. A Whiz-Bang Introduction to TF-IDF
4.3.1. Term Frequency
4.3.2. Inverse Document Frequency
4.3.3. TF-IDF
4.4. Querying Human Language Data with TF-IDF
4.4.1. Introducing the Natural Language Toolkit
4.4.2. Applying TF-IDF to Human Language
4.4.3. Finding Similar Documents
4.4.4. Analyzing Bigrams in Human Language
4.4.5. Reflections on Analyzing Human Language Data
4.5. Closing Remarks
4.6. Recommended Exercises
4.7. Online Resources

5. Mining Web Pages: Using Natural Language Processing to Understand Human Language, Summarize Blog Posts, and More
5.1. Overview
5.2. Scraping, Parsing, and Crawling the Web
5.2.1. Breadth-First Search in Web Crawling
5.3. Discovering Semantics by Decoding Syntax
5.3.1. Natural Language Processing Illustrated Step-by-Step
5.3.2. Sentence Detection in Human Language Data
5.3.3. Document Summarization
5.4. Entity-Centric Analysis: A Paradigm Shift
5.4.1. Gisting Human Language Data
5.5. Quality of Analytics for Processing Human Language Data
5.6. Closing Remarks
5.7. Recommended Exercises
5.8. Online Resources

6. Mining Mailboxes: Analyzing Who’s Talking to Whom About What, How Often, and More
6.1. Overview
6.2. Obtaining and Processing a Mail Corpus
6.2.1. A Primer on Unix Mailboxes
6.2.2. Getting the Enron Data
6.2.3. Converting a Mail Corpus to a Unix Mailbox
6.2.4. Converting Unix Mailboxes to JSON
6.2.5. Importing a JSONified Mail Corpus into MongoDB
6.2.6. Programmatically Accessing MongoDB with Python
6.3. Analyzing the Enron Corpus
6.3.1. Querying by Date/Time Range
6.3.2. Analyzing Patterns in Sender/Recipient Communications
6.3.3. Writing Advanced Queries
6.3.4. Searching Emails by Keywords
6.4. Discovering and Visualizing Time-Series Trends
6.5. Analyzing Your Own Mail Data
6.5.1. Accessing Your Gmail with OAuth
6.5.2. Fetching and Parsing Email Messages with IMAP
6.5.3. Visualizing Patterns in GMail with the “Graph Your Inbox” Chrome Extension
6.6. Closing Remarks
6.7. Recommended Exercises
6.8. Online Resources

7. Mining GitHub: Inspecting Software Collaboration Habits, Building Interest Graphs, and More
7.1. Overview
7.2. Exploring GitHub’s API
7.2.1. Creating a GitHub API Connection
7.2.2. Making GitHub API Requests
7.3. Modeling Data with Property Graphs
7.4. Analyzing GitHub Interest Graphs
7.4.1. Seeding an Interest Graph
7.4.2. Computing Graph Centrality Measures
7.4.3. Extending the Interest Graph with “Follows” Edges for Users
7.4.4. Using Nodes as Pivots for More Efficient Queries
7.4.5. Visualizing Interest Graphs
7.5. Closing Remarks
7.6. Recommended Exercises
7.7. Online Resources

8. Mining the Semantically Marked-Up Web: Extracting Microformats, Inferencing over RDF, and More
8.1. Overview
8.2. Microformats: Easy-to-Implement Metadata
8.2.1. Geocoordinates: A Common Thread for Just About Anything
8.2.2. Using Recipe Data to Improve Online Matchmaking
8.2.3. Accessing LinkedIn’s 200 Million Online Résumés
8.3. From Semantic Markup to Semantic Web: A Brief Interlude
8.4. The Semantic Web: An Evolutionary Revolution
8.4.1. Man Cannot Live on Facts Alone
8.4.2. Inferencing About an Open World
8.5. Closing Remarks
8.6. Recommended Exercises
8.7. Online Resources

Part II. Twitter Cookbook

9. Twitter Cookbook
9.1. Accessing Twitter’s API for Development Purposes
9.2. Doing the OAuth Dance to Access Twitter’s API for Production Purposes
9.3. Discovering the Trending Topics
9.4. Searching for Tweets
9.5. Constructing Convenient Function Calls
9.6. Saving and Restoring JSON Data with Text Files
9.7. Saving and Accessing JSON Data with MongoDB
9.8. Sampling the Twitter Firehose with the Streaming API
9.9. Collecting Time-Series Data
9.10. Extracting Tweet Entities
9.11. Finding the Most Popular Tweets in a Collection of Tweets
9.12. Finding the Most Popular Tweet Entities in a Collection of Tweets
9.13. Tabulating Frequency Analysis
9.14. Finding Users Who Have Retweeted a Status
9.15. Extracting a Retweet’s Attribution
9.16. Making Robust Twitter Requests
9.17. Resolving User Profile Information
9.18. Extracting Tweet Entities from Arbitrary Text
9.19. Getting All Friends or Followers for a User
9.20. Analyzing a User’s Friends and Followers
9.21. Harvesting a User’s Tweets
9.22. Crawling a Friendship Graph
9.23. Analyzing Tweet Content
9.24. Summarizing Link Targets
9.25. Analyzing a User’s Favorite Tweets
9.26. Closing Remarks
9.27. Recommended Exercises
9.28. Online Resources

Part III. Appendixes

A. Information About This Book’s Virtual Machine Experience
B. OAuth Primer
C. Python and IPython Notebook Tips & Tricks

Index


Product details

Pages	448 p
File Size	21,552 KB
File Type	PDF format
ISBN	978-1-449-36761-9
Copyright	2014 Matthew A. Russell
