Data Mining Methods And Models, Wiley

Data Mining Methods And Models

DANIEL T. LAROSE
Department of Mathematical Sciences
Central Connecticut State University

PREFACE
WHAT IS DATA MINING?
Data mining is the analysis of (often large) observational data sets to find
unsuspected relationships and to summarize the data in novel ways that are both
understandable and useful to the data owner.
—David Hand, Heikki Mannila, and Padhraic Smyth, Principles of Data Mining,
MIT Press, Cambridge, MA, 2001

Data mining is predicted to be “one of the most revolutionary developments of the
next decade,” according to the online technology magazine ZDNET News (February
8, 2001). In fact, the MIT Technology Review chose data mining as one of 10 emerging
technologies that will change the world.

Because data mining represents such an important field, Wiley-Interscience
and I have teamed up to publish a new series on data mining, initially consisting of
three volumes. The first volume in this series, Discovering Knowledge in Data: An
Introduction to Data Mining, appeared in 2005 and introduced the reader to this rapidly
growing field. The second volume in the series, Data Mining Methods and Models,
explores the process of data mining from the point of view of model building: the
development of complex and powerful predictive models that can deliver actionable
results for a wide range of business and research problems.

WHY IS THIS BOOK NEEDED?
Data Mining Methods and Models continues the thrust of Discovering Knowledge in
Data, providing the reader with:

Models and techniques to uncover hidden nuggets of information

Insight into how the data mining algorithms really work

Experience of actually performing data mining on large data sets

Contents

PREFACE

1 DIMENSION REDUCTION METHODS

Need for Dimension Reduction in Data Mining

Principal Components Analysis

Applying Principal Components Analysis to the Houses Data Set

How Many Components Should We Extract?

Profiling the Principal Components

Communalities

Validation of the Principal Components

Factor Analysis

Applying Factor Analysis to the Adult Data Set

Factor Rotation

User-Defined Composites

Example of a User-Defined Composite

Summary

References

Exercises

2 REGRESSION MODELING

Example of Simple Linear Regression

Least-Squares Estimates

Coefficient of Determination

Standard Error of the Estimate

Correlation Coefficient

ANOVA Table

Outliers, High Leverage Points, and Influential Observations

Regression Model

Inference in Regression

t-Test for the Relationship Between x and y

Confidence Interval for the Slope of the Regression Line

Confidence Interval for the Mean Value of y Given x

Prediction Interval for a Randomly Chosen Value of y Given x

Verifying the Regression Assumptions

Example: Baseball Data Set

Example: California Data Set

Transformations to Achieve Linearity

Box–Cox Transformations

Summary

References

Exercises

3 MULTIPLE REGRESSION AND MODEL BUILDING

Example of Multiple Regression

Multiple Regression Model

Inference in Multiple Regression

t-Test for the Relationship Between y and xi

F-Test for the Significance of the Overall Regression Model

Confidence Interval for a Particular Coefficient

Confidence Interval for the Mean Value of y Given x1, x2, ..., xm

Prediction Interval for a Randomly Chosen Value of y Given x1, x2, ..., xm

Regression with Categorical Predictors

Adjusting R²: Penalizing Models for Including Predictors That Are Not Useful

Sequential Sums of Squares

Multicollinearity

Variable Selection Methods

Partial F-Test

Forward Selection Procedure

Backward Elimination Procedure

Stepwise Procedure

Best Subsets Procedure

All-Possible-Subsets Procedure

Application of the Variable Selection Methods

Forward Selection Procedure Applied to the Cereals Data Set

Backward Elimination Procedure Applied to the Cereals Data Set

Stepwise Selection Procedure Applied to the Cereals Data Set

Best Subsets Procedure Applied to the Cereals Data Set

Mallows’ Cp Statistic

Variable Selection Criteria

Using the Principal Components as Predictors

Summary

References

Exercises

4 LOGISTIC REGRESSION

Simple Example of Logistic Regression

Maximum Likelihood Estimation

Interpreting Logistic Regression Output

Inference: Are the Predictors Significant?

Interpreting a Logistic Regression Model

Interpreting a Model for a Dichotomous Predictor

Interpreting a Model for a Polychotomous Predictor

Interpreting a Model for a Continuous Predictor

Assumption of Linearity

Zero-Cell Problem

Multiple Logistic Regression

Introducing Higher-Order Terms to Handle Nonlinearity

Validating the Logistic Regression Model

WEKA: Hands-on Analysis Using Logistic Regression

Summary

References

Exercises

5 NAIVE BAYES ESTIMATION AND BAYESIAN NETWORKS

Bayesian Approach

Maximum a Posteriori Classification

Posterior Odds Ratio

Balancing the Data

Naïve Bayes Classification

Numeric Predictors

WEKA: Hands-on Analysis Using Naive Bayes

Bayesian Belief Networks

Clothing Purchase Example

Using the Bayesian Network to Find Probabilities

WEKA: Hands-on Analysis Using the Bayes Net Classifier

Summary

References

Exercises

6 GENETIC ALGORITHMS

Introduction to Genetic Algorithms

Basic Framework of a Genetic Algorithm

Simple Example of a Genetic Algorithm at Work

Modifications and Enhancements: Selection

Modifications and Enhancements: Crossover

Multipoint Crossover

Uniform Crossover

Genetic Algorithms for Real-Valued Variables

Single Arithmetic Crossover

Simple Arithmetic Crossover

Whole Arithmetic Crossover

Discrete Crossover

Normally Distributed Mutation

Using Genetic Algorithms to Train a Neural Network

WEKA: Hands-on Analysis Using Genetic Algorithms

Summary

References

Exercises

7 CASE STUDY: MODELING RESPONSE TO DIRECT MAIL MARKETING

Cross-Industry Standard Process for Data Mining

Business Understanding Phase

Direct Mail Marketing Response Problem

Building the Cost/Benefit Table

Data Understanding and Data Preparation Phases

Clothing Store Data Set

Transformations to Achieve Normality or Symmetry

Standardization and Flag Variables

Deriving New Variables

Exploring the Relationships Between the Predictors and the Response

Investigating the Correlation Structure Among the Predictors

Modeling and Evaluation Phases

Principal Components Analysis

Cluster Analysis: BIRCH Clustering Algorithm

Balancing the Training Data Set

Establishing the Baseline Model Performance

Model Collection A: Using the Principal Components

Overbalancing as a Surrogate for Misclassification Costs

Combining Models: Voting

Model Collection B: Non-PCA Models

Combining Models Using the Mean Response Probabilities

Summary

References

INDEX


Product details

File Size	6,369 KB
Pages	340 p
File Type	PDF format
ISBN-13	978-0-471-66656-1 (cloth)
ISBN-10	0-471-66656-4 (cloth)
Copyright	2006 by John Wiley & Sons, Inc.
