Mining the Social Web. O'Reilly

Analyzing Data from Facebook, Twitter, LinkedIn, and Other Social Media Sites

by Matthew A. Russell





Book Details
Price: 2.00
Pages: 344
File Size: 4,929 KB
File Type: PDF
ISBN: 978-1-449-38834-8
Copyright: © 2011 Matthew Russell

Preface
The Web is more a social creation than a technical one.
I designed it for a social effect—to help people work
together—and not as a technical toy. The ultimate goal
of the Web is to support and improve our weblike existence
in the world. We clump into families, associations,
and companies. We develop trust across the miles and
distrust around the corner.
—Tim Berners-Lee, Weaving the Web (Harper)

To Read This Book?
If you have a basic programming background and are interested in insight surrounding
the opportunities that arise from mining and analyzing data from the social web, you’ve
come to the right place. We’ll begin getting our hands dirty after just a few more pages
of frontmatter. I’ll be forthright, however, and say upfront that one of the chief complaints
you’re likely to have about this book is that all of the chapters are far too short.

Unfortunately, that’s always the case when trying to capture a space that’s evolving
daily and is so rich and abundant with opportunities. That said, I’m a fan of the “80-20
rule”, and I sincerely believe that this book is a reasonable attempt at presenting the
most interesting 20 percent of the space that you’d want to explore with 80 percent of your available time.

This book is short, but it does cover a lot of ground. Generally speaking, there’s a little
more breadth than depth, although where the situation lends itself and the subject
matter is complex enough to warrant a more detailed discussion, there are a few deep
dives into interesting mining and analysis techniques. The book was written so that
you could have the option of either reading it from cover to cover to get a broad primer
on working with social web data, or pick and choose chapters that are of particular
interest to you. In other words, each chapter is designed to be bite-sized and fairly
standalone, but special care was taken to introduce material in a particular order so
that the book as a whole is an enjoyable read.

Social networking websites such as Facebook, Twitter, and LinkedIn have transitioned
from fad to mainstream to global phenomena over the last few years. In the first quarter
of 2010, the popular social networking site Facebook surpassed Google for the most
page visits, confirming a definite shift in how people are spending their time online.

Asserting that this event indicates that the Web has now become more a social milieu
than a tool for research and information might be somewhat indefensible; however,
this data point undeniably indicates that social networking websites are satisfying some
very basic human desires on a massive scale in ways that search engines were never
designed to fulfill. Social networks really are changing the way we live our lives on and
off the Web, and they are enabling technology to bring out the best (and sometimes
the worst) in us. The explosion of social networks is just one of the ways that the gap
between the real world and cyberspace is continuing to narrow.

Generally speaking, each chapter of this book interlaces slivers of the social web along
with data mining, analysis, and visualization techniques to answer the following kinds
of questions:
• Who knows whom, and what friends do they have in common?
• How frequently are certain people communicating with one another?
• How symmetrical is the communication between people?
• Who are the quietest/chattiest people in a network?
• Who are the most influential/popular people in a network?
• What are people chatting about (and is it interesting)?
The answers to these types of questions generally connect two or more people together
and point back to a context indicating why the connection exists. The work involved
in answering these kinds of questions is only the beginning of more complex analytic
processes, but you have to start somewhere, and the low-hanging fruit is surprisingly
easy to grasp, thanks to well-engineered social networking APIs and open source toolkits.
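
As a rough illustration of how approachable these questions are, the sketch below answers three of them (frequency, symmetry, and chattiness) for a tiny message log. The log, the names in it, and the `symmetry` helper are all invented for this example and do not come from the book; Python 3 is assumed.

```python
from collections import Counter

# Hypothetical message log of (sender, recipient) pairs, invented for illustration.
messages = [
    ("alice", "bob"),
    ("alice", "bob"),
    ("bob", "alice"),
    ("alice", "carol"),
    ("carol", "alice"),
    ("alice", "bob"),
]

# How frequently are certain people communicating with one another?
pair_counts = Counter(messages)

# Who are the chattiest people in the network? Count messages sent per person.
sent_counts = Counter(sender for sender, _ in messages)
chattiest = sent_counts.most_common(1)[0][0]


def symmetry(a, b):
    """Ratio of the smaller to the larger directed message count between a and b.

    1.0 means perfectly balanced; 0.0 means one-sided or no communication.
    This is one simple definition chosen for the sketch, not a standard metric.
    """
    ab, ba = pair_counts[(a, b)], pair_counts[(b, a)]
    if max(ab, ba) == 0:
        return 0.0
    return min(ab, ba) / max(ab, ba)
```

Calling `symmetry("alice", "carol")` on this toy log compares one message in each direction, while `symmetry("alice", "bob")` compares three messages against one. Real data would come from an API or a mailbox rather than a hardcoded list.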

Loosely speaking, this book treats the social web as a graph of people, activities, events,
concepts, etc. Industry leaders such as Google and Facebook have begun to increasingly
push graph-centric terminology rather than web-centric terminology as they simultaneously
promote graph-based APIs. In fact, Tim Berners-Lee has suggested that perhaps
he should have used the term Giant Global Graph (GGG) instead of World Wide Web
(WWW), because the terms “web” and “graph” can be so freely interchanged in the
context of defining a topology for the Internet. Whether the fullness of Tim
Berners-Lee’s original vision will ever be realized remains to be seen, but the Web as we know
it is getting richer and richer with social data all the time. When we look back years
from now, it may well seem obvious that the second- and third-level effects created by
an inherently social web were necessary enablers for the realization of a truly semantic
web. The gap between the two seems to be closing.
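
To make the graph view concrete, here is a minimal sketch that stores a small friendship graph as a dictionary of adjacency sets and answers "what friends do two people have in common?" with a set intersection. The people and the helper names `mutual_friends` and `most_connected` are hypothetical, chosen only for this illustration.

```python
# Hypothetical undirected friendship graph: each person maps to a set of friends.
friends = {
    "alice": {"bob", "carol", "dave"},
    "bob": {"alice", "carol"},
    "carol": {"alice", "bob"},
    "dave": {"alice"},
}


def mutual_friends(graph, a, b):
    """Return the people connected to both a and b."""
    return graph[a] & graph[b]


def most_connected(graph):
    """Return the person with the most direct connections (highest degree)."""
    return max(graph, key=lambda person: len(graph[person]))
```

Degree is only a crude stand-in for popularity or influence, but it shows how directly graph terminology maps onto ordinary Python data structures.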

Table of Contents
Preface
1. Introduction: Hacking on Twitter Data
Installing Python Development Tools
Collecting and Manipulating Twitter Data
Tinkering with Twitter’s API
Frequency Analysis and Lexical Diversity
Visualizing Tweet Graphs
Synthesis: Visualizing Retweets with Protovis
Closing Remarks
2. Microformats: Semantic Markup and Common Sense Collide
XFN and Friends
Exploring Social Connections with XFN
A Breadth-First Crawl of XFN Data
Geocoordinates: A Common Thread for Just About Anything
Wikipedia Articles + Google Maps = Road Trip?
Slicing and Dicing Recipes (for the Health of It)
Collecting Restaurant Reviews
Summary
3. Mailboxes: Oldies but Goodies
mbox: The Quick and Dirty on Unix Mailboxes
mbox + CouchDB = Relaxed Email Analysis
Bulk Loading Documents into CouchDB
Sensible Sorting
Map/Reduce-Inspired Frequency Analysis
Sorting Documents by Value
couchdb-lucene: Full-Text Indexing and More
Threading Together Conversations
Look Who’s Talking
Visualizing Mail “Events” with SIMILE Timeline
Analyzing Your Own Mail Data
The Graph Your (Gmail) Inbox Chrome Extension
Closing Remarks
4. Twitter: Friends, Followers, and Setwise Operations
RESTful and OAuth-Cladded APIs
No, You Can’t Have My Password
A Lean, Mean Data-Collecting Machine
A Very Brief Refactor Interlude
Redis: A Data Structures Server
Elementary Set Operations
Souping Up the Machine with Basic Friend/Follower Metrics
Calculating Similarity by Computing Common Friends and Followers
Measuring Influence
Constructing Friendship Graphs
Clique Detection and Analysis
The Infochimps “Strong Links” API
Interactive 3D Graph Visualization
Summary
5. Twitter: The Tweet, the Whole Tweet, and Nothing but the Tweet
Pen : Sword :: Tweet : Machine Gun (?!?)
Analyzing Tweets (One Entity at a Time)
Tapping (Tim’s) Tweets
Who Does Tim Retweet Most Often?
What’s Tim’s Influence?
How Many of Tim’s Tweets Contain Hashtags?
Juxtaposing Latent Social Networks (or #JustinBieber Versus #TeaParty)
What Entities Co-Occur Most Often with #JustinBieber and #TeaParty Tweets?
On Average, Do #JustinBieber or #TeaParty Tweets Have More Hashtags?
Which Gets Retweeted More Often: #JustinBieber or #TeaParty?
How Much Overlap Exists Between the Entities of #TeaParty and #JustinBieber Tweets?
Visualizing Tons of Tweets
Visualizing Tweets with Tricked-Out Tag Clouds
Visualizing Community Structures in Twitter Search Results
Closing Remarks
6. LinkedIn: Clustering Your Professional Network for Fun (and Profit?)
Motivation for Clustering
Clustering Contacts by Job Title
Standardizing and Counting Job Titles
Common Similarity Metrics for Clustering
A Greedy Approach to Clustering
Hierarchical and k-Means Clustering
Fetching Extended Profile Information
Geographically Clustering Your Network
Mapping Your Professional Network with Google Earth
Mapping Your Professional Network with Dorling Cartograms
Closing Remarks
7. Google Buzz: TF-IDF, Cosine Similarity, and Collocations
Buzz = Twitter + Blogs (???)
Data Hacking with NLTK
Text Mining Fundamentals
A Whiz-Bang Introduction to TF-IDF
Querying Buzz Data with TF-IDF
Finding Similar Documents
The Theory Behind Vector Space Models and Cosine Similarity
Clustering Posts with Cosine Similarity
Visualizing Similarity with Graph Visualizations
Buzzing on Bigrams
How the Collocation Sausage Is Made: Contingency Tables and Scoring Functions
Tapping into Your Gmail
Accessing Gmail with OAuth
Fetching and Parsing Email Messages
Before You Go Off and Try to Build a Search Engine…
Closing Remarks
8. Blogs et al.: Natural Language Processing (and Beyond)
NLP: A Pareto-Like Introduction
Syntax and Semantics
A Brief Thought Exercise
A Typical NLP Pipeline with NLTK
Sentence Detection in Blogs with NLTK
Summarizing Documents
Analysis of Luhn’s Summarization Algorithm
Entity-Centric Analysis: A Deeper Understanding of the Data
Quality of Analytics
Closing Remarks
9. Facebook: The All-in-One Wonder
Tapping into Your Social Network Data
From Zero to Access Token in Under 10 Minutes
Facebook’s Query APIs
Visualizing Facebook Data
Visualizing Your Entire Social Network
Visualizing Mutual Friendships Within Groups
Where Have My Friends All Gone? (A Data-Driven Game)
Visualizing Wall Data As a (Rotating) Tag Cloud
Closing Remarks
10. The Semantic Web: A Cocktail Discussion
An Evolutionary Revolution?
Man Cannot Live on Facts Alone
Open-World Versus Closed-World Assumptions
Inferencing About an Open World with FuXi
Hope
Index



Tools and Prerequisites
The only real prerequisites for this book are that you need to be motivated enough to
learn some Python and have the desire to get your hands (really) dirty with social data.
None of the techniques or examples in this book require significant background knowledge
of data analysis, high performance computing, distributed systems, machine
learning, or anything else in particular. Some examples involve constructs you may not
have used before, such as thread pools, but don’t fret—we’re programming in Python.

Python’s intuitive syntax, amazing ecosystem of packages for data manipulation, and
core data structures that are practically JSON make it an excellent teaching tool that’s
powerful yet also very easy to get up and running. On other occasions we use some
packages that do pretty advanced things, such as processing natural language, but we’ll
approach these from the standpoint of using the technology as an application programmer.
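
The remark that Python's core data structures are "practically JSON" can be seen in a few lines. This is a small sketch using only the standard library `json` module; the `profile` record is made up for the example and does not correspond to any real API response.

```python
import json

# Hypothetical record built from plain Python dicts, lists, and scalars.
profile = {
    "name": "Example User",
    "followers": 42,
    "tags": ["python", "data mining"],
    "verified": False,
    "location": None,
}

# Serialize to a JSON string; sort_keys makes the output order predictable.
encoded = json.dumps(profile, sort_keys=True)

# Parse the string back into Python objects.
decoded = json.loads(encoded)
```

The round trip returns an equal dictionary, with `False` and `None` written as `false` and `null` on the JSON side. That close correspondence is why API responses from social web services are so easy to work with from Python.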

Given the high likelihood that very similar bindings exist for other programming
languages, it should be a fairly rote exercise to port the code examples should you
so desire. (Hopefully, that’s exactly the kind of thing that will happen on GitHub!)
Beyond the previous explanation, this book makes no attempt to justify the selection
of Python or apologize for using it, because it’s a very suitable tool for the job. If you’re
new to programming or have never seen Python syntax, skimming ahead a few pages
should hopefully be all the confirmation that you need. Excellent documentation is
available online, and the official Python tutorial is a good place to start if you’re looking
for a solid introduction.

This book attempts to introduce a broad array of useful visualizations across a variety
of visualization tools and toolkits, ranging from consumer staples like spreadsheets to
industry staples like Graphviz, to bleeding-edge HTML5 technologies such as Protovis.
A reasonable attempt has been made to introduce a couple of new visualizations
in each chapter, but in a way that follows naturally and makes sense. You’ll need to be
comfortable with the idea of building lightweight prototypes from these tools. That
said, most of the visualizations in this book are little more than small mutations on
out-of-the-box examples or projects that minimally exercise the APIs, so as long as you’re
willing to learn, you should be in good shape.