Python for Data Science For Dummies

by Luca Massaron and John Paul Mueller


Book Details
Price: 3.00
Pages: 435
File size: 9,952 KB
File type: PDF
ISBN: 978-1-118-84418-2; 978-1-118-84398-7 (ebk); 978-1-118-84414-4 (ePDF)
Copyright © 2015 by John Wiley & Sons, Inc.

Introduction
You rely on data science absolutely every day to perform an amazing
array of tasks or to obtain services from someone else. In fact, you’ve
probably used data science in ways that you never expected. For example,
when you used your favorite search engine this morning to look for something,
it made suggestions on alternative search terms. Those terms are
supplied by data science. When you went to the doctor last week and
discovered the lump you found wasn’t cancer, it’s likely the doctor made the
diagnosis with the help of data science. In fact, you might work with data
science every day and not even know it. Python for Data Science For Dummies
not only gets you started using data science to perform a wealth of practical
tasks but also helps you realize just how many places data science is used.
By knowing how to answer data science problems and where to employ data
science, you gain a significant advantage over everyone else, increasing your
chances at promotion or that new job you really want.

About This Book
The main purpose of Python for Data Science For Dummies is to take the scare
factor out of data science by showing you that data science is not only really
interesting but also quite doable using Python. You might assume that you
need to be a computer science genius to perform the complex tasks normally
associated with data science, but that’s far from the truth. Python comes
with a host of useful libraries that do all the heavy lifting for you in the background.
You don’t even realize how much is going on, and you don’t need to
care. All you really need to know is that you want to perform specific tasks
and that Python makes these tasks quite accessible.

Part of the emphasis of this book is on using the right tools. You start with
Anaconda, a product that includes IPython and IPython Notebook — two
tools that take the sting out of working with Python. You experiment with
IPython in a fully interactive environment. The code you place in IPython
Notebook is presentation quality, and you can mix a number of presentation
elements right there in your document. It’s not really like using a development
environment at all.

You also discover some interesting techniques in this book. For example,
you can create plots of all your data science experiments using matplotlib,
for which this book provides you with all the details. This book also spends
considerable time showing you just what is available and how you can use
it to perform some really interesting calculations. Many people would like to
know how to perform handwriting recognition — and if you’re one of them,
you can use this book to get a leg up on the process.

Of course, you might still be worried about the whole programming environment
issue, and this book doesn’t leave you in the dark there, either. At the
beginning, you find complete installation instructions for Anaconda and a
quick primer (with references) to the basic Python programming you need
to perform. The emphasis is on getting you up and running as quickly as
possible and on keeping the examples straightforward and simple so that the code
doesn’t become a stumbling block to learning.

To make absorbing the concepts even easier, this book uses the following
conventions:
✓ Text that you’re meant to type just as it appears in the book is in bold.
The exception is when you’re working through a step list: Because each
step is bold, the text to type is not bold.
✓ When you see words in italics as part of a typing sequence, you need to
replace that value with something that works for you. For example, if
you see “Type Your Name and press Enter,” you need to replace Your
Name with your actual name.
✓ Web addresses and programming code appear in monofont. If you’re
reading a digital version of this book on a device connected to the
Internet, note that you can click the web address to visit that website.
✓ When you need to type command sequences, you see them separated by
a special arrow, like this: File➪New File. In this case, you go to the File
menu first and then select the New File entry on that menu. The result is
that you see a new file created.

Table of Contents
Introduction
About This Book
Foolish Assumptions
Icons Used in This Book
Beyond the Book
Where to Go from Here
Part I: Getting Started with Python for Data Science
Chapter 1: Discovering the Match between Data Science and Python
Defining the Sexiest Job of the 21st Century
Considering the emergence of data science
Outlining the core competencies of a data scientist
Linking data science and big data
Understanding the role of programming
Creating the Data Science Pipeline
Preparing the data
Performing exploratory data analysis
Learning from data
Visualizing
Obtaining insights and data products
Understanding Python’s Role in Data Science
Considering the shifting profile of data scientists
Working with a multipurpose, simple, and efficient language
Learning to Use Python Fast
Loading data
Training a model
Viewing a result
Chapter 2: Introducing Python’s Capabilities and Wonders
Why Python?
Grasping Python’s core philosophy
Discovering present and future development goals
Working with Python
Getting a taste of the language
Understanding the need for indentation
Working at the command line or in the IDE
Performing Rapid Prototyping and Experimentation
Considering Speed of Execution
Visualizing Power
Using the Python Ecosystem for Data Science
Accessing scientific tools using SciPy
Performing fundamental scientific computing using NumPy
Performing data analysis using pandas
Implementing machine learning using Scikit‐learn
Plotting the data using matplotlib
Parsing HTML documents using Beautiful Soup
Chapter 3: Setting Up Python for Data Science
Considering the Off‐the‐Shelf Cross‐Platform Scientific Distributions
Getting Continuum Analytics Anaconda
Getting Enthought Canopy Express
Getting pythonxy
Getting WinPython
Installing Anaconda on Windows
Installing Anaconda on Linux
Installing Anaconda on Mac OS X
Downloading the Datasets and Example Code
Using IPython Notebook
Defining the code repository
Understanding the datasets used in this book
Chapter 4: Reviewing Basic Python
Working with Numbers and Logic
Performing variable assignments
Doing arithmetic
Comparing data using Boolean expressions
Creating and Using Strings
Interacting with Dates
Creating and Using Functions
Creating reusable functions
Calling functions in a variety of ways
Using Conditional and Loop Statements
Making decisions using the if statement
Choosing between multiple options using nested decisions
Performing repetitive tasks using for
Using the while statement
Storing Data Using Sets, Lists, and Tuples
Performing operations on sets
Working with lists
Creating and using Tuples
Defining Useful Iterators
Indexing Data Using Dictionaries
Part II: Getting Your Hands Dirty with Data
Chapter 5: Working with Real Data
Uploading, Streaming, and Sampling Data
Uploading small amounts of data into memory
Streaming large amounts of data into memory
Sampling data
Accessing Data in Structured Flat‐File Form
Reading from a text file
Reading CSV delimited format
Reading Excel and other Microsoft Office files
Sending Data in Unstructured File Form
Managing Data from Relational Databases
Interacting with Data from NoSQL Databases
Accessing Data from the Web
Chapter 6: Conditioning Your Data
Juggling between NumPy and pandas
Knowing when to use NumPy
Knowing when to use pandas
Validating Your Data
Figuring out what’s in your data
Removing duplicates
Creating a data map and data plan
Manipulating Categorical Variables
Creating categorical variables
Renaming levels
Combining levels
Dealing with Dates in Your Data
Formatting date and time values
Using the right time transformation
Dealing with Missing Data
Finding the missing data
Encoding missingness
Imputing missing data
Slicing and Dicing: Filtering and Selecting Data
Slicing rows
Slicing columns
Dicing
Concatenating and Transforming
Adding new cases and variables
Removing data
Sorting and shuffling
Aggregating Data at Any Level
Chapter 7: Shaping Data
Working with HTML Pages
Parsing XML and HTML
Using XPath for data extraction
Working with Raw Text
Dealing with Unicode
Stemming and removing stop words
Introducing regular expressions
Using the Bag of Words Model and Beyond
Understanding the bag of words model
Working with n‐grams
Implementing TF‐IDF transformations
Working with Graph Data
Understanding the adjacency matrix
Using NetworkX basics
Chapter 8: Putting What You Know in Action
Contextualizing Problems and Data
Evaluating a data science problem
Researching solutions
Formulating a hypothesis
Preparing your data
Considering the Art of Feature Creation
Defining feature creation
Combining variables
Understanding binning and discretization
Using indicator variables
Transforming distributions
Performing Operations on Arrays
Using vectorization
Performing simple arithmetic on vectors and matrices
Performing matrix vector multiplication
Performing matrix multiplication
Part III: Visualizing the Invisible
Chapter 9: Getting a Crash Course in MatPlotLib
Starting with a Graph
Defining the plot
Drawing multiple lines and plots
Saving your work
Setting the Axis, Ticks, Grids
Getting the axes
Formatting the axes
Adding grids
Defining the Line Appearance
Working with line styles
Using colors
Adding markers
Using Labels, Annotations, and Legends
Adding labels
Annotating the chart
Creating a legend
Chapter 10: Visualizing the Data
Choosing the Right Graph
Showing parts of a whole with pie charts
Creating comparisons with bar charts
Showing distributions using histograms
Depicting groups using box plots
Seeing data patterns using scatterplots
Creating Advanced Scatterplots
Depicting groups
Showing correlations
Plotting Time Series
Representing time on axes
Plotting trends over time
Plotting Geographical Data
Visualizing Graphs
Developing undirected graphs
Developing directed graphs
Chapter 11: Understanding the Tools
Using the IPython Console
Interacting with screen text
Changing the window appearance
Getting Python help
Getting IPython help
Using magic functions
Discovering objects
Using IPython Notebook
Working with styles
Restarting the kernel
Restoring a checkpoint
Performing Multimedia and Graphic Integration
Embedding plots and other images
Loading examples from online sites
Obtaining online graphics and multimedia
Part IV: Wrangling Data
Chapter 12: Stretching Python’s Capabilities
Playing with Scikit‐learn
Understanding classes in Scikit‐learn
Defining applications for data science
Performing the Hashing Trick
Using hash functions
Demonstrating the hashing trick
Working with deterministic selection
Considering Timing and Performance
Benchmarking with timeit
Working with the memory profiler
Running in Parallel
Performing multicore parallelism
Demonstrating multiprocessing
Chapter 13: Exploring Data Analysis
The EDA Approach
Defining Descriptive Statistics for Numeric Data
Measuring central tendency
Measuring variance and range
Working with percentiles
Defining measures of normality
Counting for Categorical Data
Understanding frequencies
Creating contingency tables
Creating Applied Visualization for EDA
Inspecting boxplots
Performing t‐tests after boxplots
Observing parallel coordinates
Graphing distributions
Plotting scatterplots
Understanding Correlation
Using covariance and correlation
Using nonparametric correlation
Considering chi‐square for tables
Modifying Data Distributions
Using the normal distribution
Creating a Z‐score standardization
Transforming other notable distributions
Chapter 14: Reducing Dimensionality
Understanding SVD
Looking for dimensionality reduction
Using SVD to measure the invisible
Performing Factor and Principal Component Analysis
Considering the psychometric model
Looking for hidden factors
Using components, not factors
Achieving dimensionality reduction
Understanding Some Applications
Recognizing faces with PCA
Extracting Topics with NMF
Recommending movies
Chapter 15: Clustering
Clustering with K‐means
Understanding centroid‐based algorithms
Creating an example with image data
Looking for optimal solutions
Clustering big data
Performing Hierarchical Clustering
Moving Beyond the Round-Shaped Clusters: DBScan
Chapter 16: Detecting Outliers in Data
Considering Detection of Outliers
Finding more things that can go wrong
Understanding anomalies and novel data
Examining a Simple Univariate Method
Leveraging on the Gaussian distribution
Making assumptions and checking out
Developing a Multivariate Approach
Using principal component analysis
Using cluster analysis
Automating outliers detection with SVM
Part V: Learning from Data........................................ 301
Chapter 17: Exploring Four Simple and Effective Algorithms . 303
Guessing the Number: Linear Regression.................................................304
Defining the family of linear models.................................................304
Using more variables..........................................................................305
Understanding limitations and problems........................................307
Moving to Logistic Regression....................................................................307
Applying logistic regression..............................................................308
Considering when classes are more.................................................309
Making Things as Simple as Naïve Bayes..................................................310
Finding out that Naïve Bayes isn’t so naïve.....................................312
Predicting text classifications...........................................................313
Learning Lazily with Nearest Neighbors....................................................315
Predicting after observing neighbors..............................................316
Choosing your k parameter wisely...................................................317
Chapter 18: Performing Cross‐Validation, Selection,
and Optimization 319
Pondering the Problem of Fitting a Model................................................320
Understanding bias and variance.....................................................321
Defining a strategy for picking models.............................................322
Dividing between training and test sets..........................................325
Cross‐Validating............................................................................................328
Using cross‐validation on k folds......................................................329
Sampling stratifications for complex data.......................................329
Selecting Variables Like a Pro.....................................................................331
Selecting by univariate measures.....................................................331
Using a greedy search........................................................................333
Pumping Up Your Hyperparameters..........................................................334
Implementing a grid search...............................................................335
Trying a randomized search.............................................................339
Chapter 19: Increasing Complexity with Linear
and Nonlinear Tricks 341
Using Nonlinear Transformations..............................................................341
Doing variable transformations........................................................342
Creating interactions between variables.........................................344
Regularizing Linear Models.........................................................................348
Relying on Ridge regression (L2)......................................................349
Using the Lasso (L1)...........................................................................349
Leveraging regularization..................................................................350
Combining L1 & L2: Elasticnet..........................................................350
Fighting with Big Data Chunk by Chunk....................................................351
Determining when there is too much data......................................351
Implementing Stochastic Gradient Descent....................................351
Understanding Support Vector Machines.................................................354
Relying on a computational method................................................355
Fixing many new parameters............................................................358
Classifying with SVC...........................................................................360
Going nonlinear is easy......................................................................365
Performing regression with SVR.......................................................366
Creating a stochastic solution with SVM.........................................368
Chapter 20: Understanding the Power of the Many 373
Starting with a Plain Decision Tree............................................................374
Understanding a decision tree..........................................................374
Creating classification and regression trees...................................376
Making Machine Learning Accessible........................................................379
Working with a Random Forest classifier........................................381
Working with a Random Forest regressor.......................................382
Optimizing a Random Forest.............................................................383
Boosting Predictions....................................................................................384
Knowing that many weak predictors win........................................384
Creating a gradient boosting classifier............................................385
Creating a gradient boosting regressor...........................................386
Using GBM hyper‐parameters...........................................................387
Part VI: The Part of Tens............................................ 389
Chapter 21: Ten Essential Data Science
Resource Collections 391
Gaining Insights with Data Science Weekly...............................................392
Obtaining a Resource List at U Climb Higher...........................................392
Getting a Good Start with KDnuggets........................................................392
Accessing the Huge List of Resources on Data Science Central.............393
Obtaining the Facts of Open Source Data Science from Masters...........394
Locating Free Learning Resources with Quora.........................................394
Receiving Help with Advanced Topics at Conductrics............................394
Learning New Tricks from the Aspirational Data Scientist.....................395
Finding Data Intelligence and Analytics Resources
at AnalyticBridge......................................................................................396
Zeroing In on Developer Resources with Jonathan Bower.....................396
Chapter 22: Ten Data Challenges You Should Take 397
Meeting the Data Science London + Scikit‐learn Challenge....................398
Predicting Survival on the Titanic..............................................................399
Finding a Kaggle Competition that Suits Your Needs..............................399
Honing Your Overfit Strategies...................................................................400
Trudging Through the MovieLens Dataset...............................................401
Getting Rid of Spam Emails.........................................................................401
Working with Handwritten Information.....................................................402
Working with Pictures..................................................................................403
Analyzing Amazon.com Reviews................................................................404
Interacting with a Huge Graph....................................................................405
Index........................................................................ 407




Beyond the Book
This book isn’t the end of your Python or data science experience — it’s
really just the beginning. We provide online content to make this book more
flexible and better able to meet your needs. That way, as we receive email
from you, we can address questions and tell you how updates to either
Python or its associated add‐ons affect book content. In fact, you gain access
to all these cool additions:
✓ Cheat sheet: You remember using crib notes in school to make a better
mark on a test, don’t you? You do? Well, a cheat sheet is sort of like
that. It provides you with some special notes about tasks that you can
do with Python, IPython, IPython Notebook, and data science that not
every other person knows. You can find the cheat sheet for this book at
It contains really neat information such as the most common programming
mistakes that cause people woe when using Python.
✓ Dummies.com online articles: A lot of readers were skipping past the
parts pages in For Dummies books, so the publisher decided to remedy
that. You now have a really good reason to read the parts pages —
online content. Every parts page has an article associated with it that
provides additional interesting information that wouldn’t fit in the book.
You can find the articles for this book at http://www.dummies.com/
extras/pythonfordatascience.
✓ Updates: Sometimes changes happen. For example, we might not
have seen an upcoming change when we looked into our crystal ball
during the writing of this book. In the past, this possibility simply
meant that the book became outdated and less useful, but you can now
In addition to these updates, check out the blog posts with answers to
reader questions and demonstrations of useful book‐related techniques
✓ Companion files: Hey! Who really wants to type all the code in the book
and reconstruct all those plots manually? Most readers would prefer
to spend their time actually working with Python, performing data science
tasks, and seeing the interesting things they can do, rather than
typing. Fortunately for you, the examples used in the book are available
for download, so all you need to do is read the book to learn how to use
Python for data science.
You can find these files at http://www.dummies.com/extras/pythonfordatascience.