1 Introduction to Data Mining Principles ..................... 1
1.1 Data Mining and Knowledge Discovery ........................ 2
1.2 Data Warehousing and Data Mining - Overview ................ 5
1.2.1 Data Warehousing Overview ........................... 7
1.2.2 Concept of Data Mining .............................. 8
1.3 Summary ................................................... 20
1.4 Review Questions .......................................... 20
2 Data Warehousing, Data Mining, and OLAP ................... 21
2.1 Data Mining Research Opportunities and Challenges ......... 23
2.1.1 Recent Research Achievements ....................... 25
2.1.2 Data Mining Application Areas ...................... 27
2.1.3 Success Stories .................................... 29
2.1.4 Trends that Affect Data Mining ..................... 30
2.1.5 Research Challenges ................................ 31
2.1.6 Test Beds and Infrastructure ....................... 33
2.1.7 Findings and Recommendations ....................... 33
2.2 Evolving Data Mining into Solutions for Insights .......... 35
2.2.1 Trends and Challenges .............................. 36
2.3 Knowledge Extraction Through Data Mining .................. 37
2.3.1 Data Mining Process ................................ 39
2.3.2 Operational Aspects ................................ 50
2.3.3 The Need and Opportunity for Data Mining ........... 51
2.3.4 Data Mining Tools and Techniques ................... 52
2.3.5 Common Applications of Data Mining ................. 55
2.3.6 What about Data Mining in Power Systems? ........... 56
2.4 Data Warehousing and OLAP ................................. 57
2.4.1 Data Warehousing for Actuaries ..................... 57
2.4.2 Data Warehouse Components .......................... 58
2.4.3 Management Information ............................. 59
2.4.4 Profit Analysis .................................... 60
2.4.5 Asset Liability Management ......................... 60
2.5 Data Mining and OLAP ...................................... 61
2.5.1 Research ........................................... 61
2.5.2 Data Mining ........................................ 68
2.6 Summary ................................................... 72
2.7 Review Questions .......................................... 72
3 Data Marts and Data Warehouse ............................. 75
3.1 Data Marts, Data Warehouse, and OLAP ...................... 77
3.1.1 Business Process Re-engineering .................... 77
3.1.2 Real-World Usage ................................... 78
3.1.3 Business Intelligence .............................. 78
3.1.4 Different Data Structures .......................... 82
3.1.5 Different Users .................................... 84
3.1.6 Technological Foundation ........................... 86
3.1.7 Data Warehouse ..................................... 87
3.1.8 Informix Architecture .............................. 87
3.1.9 Building the Data Warehouse/Data Mart Environment .. 88
3.1.10 History ............................................ 91
3.1.11 Nondetailed Data in the Enterprise Data Warehouse .. 92
3.1.12 Sharing Data Among Data Marts ...................... 93
3.1.13 The Manufacturing Process .......................... 93
3.1.14 Subdata Marts ...................................... 95
3.1.15 Refreshment Cycles ................................. 95
3.1.16 External Data ...................................... 96
3.1.17 Operational Data Stores (ODS) and Data Marts ....... 97
3.1.18 Distributed Metadata ............................... 98
3.1.19 Managing the Warehouse Environment ................ 100
3.1.20 OLAP .............................................. 102
3.2 Data Warehousing for Healthcare .......................... 107
3.2.1 A Data Warehousing Perspective for Healthcare ..... 107
3.2.2 Adding Value to your Current Data ................. 107
3.2.3 Enhance Customer Relationship Management .......... 108
3.2.4 Improve Provider Management ....................... 109
3.2.5 Reduce Fraud ...................................... 109
3.2.6 Prepare for HEDIS Reporting ....................... 110
3.2.7 Disease Management ................................ 110
3.2.8 What to Expect When Beginning a Data Warehouse
Implementation .................................... 110
3.2.9 Definitions ....................................... 111
3.3 Data Warehousing in the Telecommunications Industry ...... 112
3.3.1 Implementing One View ............................. 118
3.3.2 Business Benefit .................................. 120
3.3.3 A Holistic Approach ............................... 121
3.4 The Telecommunications Lifecycle ......................... 122
3.4.1 Current Enterprise Environment .................... 122
3.4.2 Getting to the Root of the Problem ................ 123
3.4.3 The Telecommunications Lifecycle .................. 125
3.4.4 Telecom Administrative Outsourcing ................ 127
3.4.5 Choose your Outsourcing Partner Wisely ............ 127
3.4.6 Security in Web-Enabled Data Warehouse ............ 128
3.5 Security Issues in Data Warehouse ........................ 129
3.5.1 Performance vs Security ........................... 130
3.5.2 An Ideal Security Model ........................... 131
3.5.3 Real-World Implementation ......................... 131
3.5.4 Proposed Security Model ........................... 136
3.6 Data Warehousing: To Buy or To Build a Fundamental
Choice for Insurers ...................................... 140
3.6.1 Executive Overview ................................ 140
3.6.2 The Fundamental Choice ............................ 140
3.6.3 Analyzing the Strategic Value of Data
Warehousing ....................................... 141
3.6.4 Addressing your Concerns .......................... 142
3.6.5 Introducing FellowDSS ............................. 146
3.7 Summary .................................................. 148
3.8 Review Questions ......................................... 149
4 Evolution and Scaling of Data Mining Algorithms .......... 151
4.1 Data-Driven Evolution of Data Mining Algorithms .......... 152
4.1.1 Transaction Data .................................. 153
4.1.2 Data Streams ...................................... 154
4.1.3 Graph and Text-Based Data ......................... 155
4.1.4 Scientific Data ................................... 156
4.2 Scaling Mining Algorithms to Large Databases ............. 157
4.2.1 Prediction Methods ................................ 157
4.2.2 Clustering ........................................ 160
4.2.3 Association Rules ................................. 161
4.2.4 From Incremental Model Maintenance to Streaming
Data .............................................. 162
4.3 Summary .................................................. 163
4.4 Review Questions ......................................... 164
5 Emerging Trends and Applications of Data Mining .......... 165
5.1 Emerging Trends in Business Analytics .................... 166
5.1.1 Business Users .................................... 166
5.1.2 The Driving Force ................................. 167
5.2 Business Applications of Data Mining ..................... 170
5.3 Emerging Scientific Applications in Data Mining .......... 177
5.3.1 Biomedical Engineering ............................ 177
5.3.2 Telecommunications ................................ 178
5.3.3 Geospatial Data ................................... 180
5.3.4 Climate Data and the Earth's Ecosystems ........... 181
5.4 Summary .................................................. 182
5.5 Review Questions ......................................... 183
6 Data Mining Trends and Knowledge Discovery ............... 185
6.1 Getting a Handle on the Problem .......................... 186
6.2 KDD and Data Mining: Background .......................... 187
6.3 Related Fields ........................................... 191
6.4 Summary .................................................. 194
6.5 Review Questions ......................................... 194
7 Data Mining Tasks, Techniques, and Applications .......... 195
7.1 Reality Check for Data Mining ............................ 196
7.1.1 Data Mining Basics ................................ 196
7.1.2 The Data Mining Process ........................... 197
7.1.3 Data Mining Operations ............................ 199
7.1.4 Discovery-Driven Data Mining Techniques ........... 201
7.2 Data Mining: Tasks, Techniques, and Applications ......... 204
7.2.1 Data Mining Tasks ................................. 204
7.2.2 Data Mining Techniques ............................ 206
7.2.3 Applications ...................................... 209
7.2.4 Data Mining Applications - Survey ................. 210
7.3 Summary .................................................. 215
7.4 Review Questions ......................................... 216
8 Data Mining: an Introduction - Case Study ................ 217
8.1 The Data Flood ........................................... 218
8.2 Data Holds Knowledge ..................................... 218
8.2.1 Decisions From the Data ........................... 219
8.3 Data Mining: A New Approach to Information Overload ...... 219
8.3.1 Finding Patterns in Data, which we can use to
Better Conduct the Business ....................... 219
8.3.2 Data Mining can be Breakthrough Technology ........ 220
8.3.3 Data Mining Process in an Information System ...... 221
8.3.4 Characteristics of Data Mining .................... 222
8.3.5 Data Mining Technology ............................ 223
8.3.6 Technology Limitations ............................ 224
8.3.7 BBC Case Study: The Importance of Business
Knowledge ......................................... 225
8.3.8 Some Medical and Pharmaceutical Applications of
Data Mining ....................................... 228
8.3.9 Why Does Data Mining Work? ........................ 228
8.4 Summary .................................................. 229
8.5 Review Questions ......................................... 229
9 Data Mining & KDD ........................................ 231
9.1 Data Mining and KDD - Overview ........................... 232
9.1.1 The Idea of Knowledge Discovery in Databases
(KDD) ............................................. 234
9.1.2 How Data Mining Relates to KDD .................... 235
9.1.3 The Data Mining Future ............................ 237
9.2 Data Mining: The Two Cultures ............................ 238
9.2.1 The Central Issue ................................. 238
9.2.2 What are Data Mining and the Data Mining
Process? .......................................... 239
9.2.3 Machine Learning .................................. 239
9.2.4 Impact of Implementation .......................... 240
9.3 Summary .................................................. 241
9.4 Review Questions ......................................... 241
10 Statistical Themes and Lessons for Data Mining ........... 243
10.1 Data Mining and Official Statistics ...................... 244
10.1.1 What is New in Data Mining ........................ 244
10.1.2 Goals and Tools of Data Mining .................... 244
10.1.3 New Mines: Texts, Web, Symbolic Data? ............. 245
10.1.4 Applications in Official Statistics ............... 246
10.2 Statistical Themes and Lessons for Data Mining ........... 246
10.2.1 An Overview of Statistical Science ................ 248
10.2.2 Is Data Mining "Statistical Deja Vu" (All Over
Again)? ........................................... 252
10.2.3 Characterizing Uncertainty ........................ 254
10.2.4 What Can Go Wrong, Will Go Wrong .................. 256
10.2.5 Symbiosis in Statistics ........................... 261
10.3 Summary .................................................. 262
10.4 Review Questions ......................................... 263
11 Theoretical Frameworks for Data Mining ................... 265
11.1 Two Simple Approaches .................................... 266
11.1.1 Probabilistic Approach ............................ 267
11.1.2 Data Compression Approach ......................... 268
11.2 Microeconomic View of Data Mining ........................ 268
11.3 Inductive Databases ...................................... 269
11.4 Summary .................................................. 270
11.5 Review Questions ......................................... 270
12 Major and Privacy Issues in Data Mining and Knowledge
Discovery ................................................ 271
12.1 Major Issues in Data Mining .............................. 272
12.2 Privacy Issues in Knowledge Discovery and Data Mining .... 275
12.2.1 Revitalized Privacy Threats ....................... 277
12.2.2 New Privacy Threats ............................... 279
12.2.3 Possible Solutions ............................... 281
12.3 The OECD Personal Privacy Guidelines ..................... 283
12.3.1 Risks Privacy and the Principles of Data
Protection ........................................ 284
12.3.2 The OECD Guidelines and Knowledge Discovery ....... 286
12.3.3 Knowledge Discovery about Groups .................. 288
12.3.4 Legal Systems and other Guidelines ................ 289
12.4 Summary .................................................. 290
12.5 Review Questions ......................................... 291
13 Active Data Mining ....................................... 293
13.1 Shape Definitions ........................................ 295
13.2 Queries .................................................. 297
13.3 Triggers ................................................. 299
13.3.1 Wave Execution Semantics .......................... 300
13.4 Summary .................................................. 302
13.5 Review Questions ......................................... 302
14 Decomposition in Data Mining - A Case Study .............. 303
14.1 Decomposition in the Literature .......................... 304
14.1.1 Machine Learning .................................. 304
14.2 Typology of Decomposition in Data Mining ................. 305
14.3 Hybrid Models ............................................ 306
14.4 Knowledge Structuring .................................... 309
14.5 Rule-Structuring Model ................................... 310
14.6 Decision Tables, Maps, and Atlases ....................... 311
14.7 Summary .................................................. 312
14.8 Review Questions ......................................... 313
15 Data Mining System Products and Research Prototypes ...... 315
15.1 How to Choose a Data Mining System ....................... 316
15.2 Examples of Commercial Data Mining Systems ............... 318
15.3 Summary .................................................. 319
15.4 Review Questions ......................................... 320
16 Data Mining in Customer Value and Customer Relationship
Management ............................................... 321
16.1 Data Mining: A Concept of Customer Relationship
Marketing ................................................ 322
16.1.1 Traditional Marketing Research .................... 322
16.1.2 Relationship Marketing - the Modern View .......... 323
16.1.3 Understanding the Background of Data Mining ....... 324
16.1.4 Continuous Relationship Marketing ................. 326
16.1.5 Developing the Data Mining Project ................ 327
16.1.6 Further Research .................................. 328
16.2 Introduction to Customer Acquisition ..................... 328
16.2.1 How Data Mining and Statistical Modeling Change
Things ............................................ 329
16.2.2 Defining Some Key Acquisition Concepts ............ 329
16.2.3 It all Begins with the Data ....................... 331
16.2.4 Test Campaigns .................................... 332
16.2.5 Evaluating Test Campaign Responses ................ 333
16.2.6 Building Data Mining Models Using Response
Behaviors ......................................... 333
16.3 Customer Relationship Management (CRM) ................... 335
16.3.1 Defining CRM ...................................... 335
16.3.2 Integrating Customer Data into CRM Strategy ....... 335
16.3.3 Strategic Data Analysis for CRM ................... 335
16.3.4 Data Warehousing and Data Mining .................. 337
16.3.5 Sharing Customer Data Within the Value Chain ...... 338
16.3.6 CVM - Customer Value Management ................... 339
16.3.7 Issues in Global Customer Management .............. 340
16.3.8 Changing Systems .................................. 341
16.3.9 Changing Customer Management - A Strategic View ... 342
16.4 Data Mining and Customer Value and Relationships ......... 348
16.4.1 What is Data Mining? .............................. 349
16.4.2 Relevance to a Business Process ................... 351
16.4.3 Data Mining and Customer Relationship
Management ........................................ 352
16.4.4 How Data Mining Helps Database Marketing .......... 353
16.5 CRM: Technologies and Applications ....................... 356
16.5.1 What is CRM ....................................... 357
16.5.2 What is CRM Used for? ............................. 357
16.5.3 Consequences of Implementation of CRM ............. 359
16.5.4 Which Technologies are Used in CRM? ............... 360
16.5.5 Business Rules .................................... 360
16.5.6 Data Warehousing .................................. 360
16.5.7 Data Mining ....................................... 361
16.5.8 Real-Time Information Analysis .................... 362
16.5.9 Reporting ......................................... 363
16.5.10 Web Self-Service ................................. 363
16.5.11 Market Overview .................................. 364
16.5.12 Connection between ERP and CRM ................... 365
16.5.13 Benefits of CRM to the Enterprise ................ 367
16.5.14 Future of CRM .................................... 367
16.6 Data Management in Analytical Customer Relationship
Management ............................................... 369
16.6.1 The CRM Process Model ............................. 370
16.6.2 Data Sources for Analytical CRM ................... 374
16.6.3 Data Integration in Analytical CRM ................ 376
16.6.4 Further Research .................................. 384
16.7 Summary .................................................. 385
16.8 Review Questions ......................................... 385
17 Data Mining in Business .................................. 387
17.1 Business Focus on Data Engineering ....................... 388
17.2 Data Mining for Business Problems ........................ 390
17.3 Data Mining and Business Intelligence .................... 396
17.4 Data Mining in Business - Case Studies ................... 399
18 Data Mining in Sales Marketing and Finance ............... 411
18.1 Data Mining can Bring Pinpoint Accuracy to Sales ......... 413
18.2 From Data Mining to Database Marketing ................... 414
18.2.1 Data Mining vs. Database Marketing ................ 414
18.2.2 What Exactly is Data Mining? ...................... 415
18.2.3 Who is Developing the Technology? ................. 416
18.2.4 Turning Business Problems into Business
Solutions ......................................... 417
18.2.5 A Possible Scenario for the Future of Data
Mining ............................................ 419
18.3 Data Mining for Marketing Decisions ...................... 419
18.3.1 Agent-Based Information Retrieval Systems ......... 421
18.3.2 Applications of Data Mining in Marketing .......... 424
18.4 Increasing Customer Value by Integrating Data Mining ..... 425
18.4.1 Some Definitions .................................. 425
18.4.2 Data Mining Defined ............................... 426
18.4.3 The Purpose of Data Mining ........................ 427
18.4.4 Scoring the Model ................................. 427
18.4.5 The Role of Campaign Management Software .......... 427
18.4.6 The Integrated Data Mining and Campaign
Management Process ................................ 429
18.4.7 Data Mining and Campaign Management in the
Real World ........................................ 430
18.4.8 The Benefits of Integrating Data Mining and
Campaign Management ............................... 431
18.5 Completing a Solution for Market-Basket Analysis - Case
Study .................................................... 431
18.5.1 Business Problem .................................. 432
18.5.2 Case Studies ...................................... 432
18.5.3 Data Mining Solutions ............................. 433
18.5.4 Recommendations ................................... 434
18.6 Data Mining in Finance ................................... 435
18.7 Data Mining for Financial Data Analysis .................. 436
18.8 Summary .................................................. 437
18.9 Review Questions ......................................... 438
19 Banking and Commercial Applications ...................... 439
19.1 Bringing Data Mining to the Forefront of Business
Intelligence ............................................. 441
19.2 Distributed Data Mining Through a Centralized Solution ... 441
A Case Study ............................................. 442
19.2.1 Background ........................................ 442
19.3 Data Mining in Commercial Applications ................... 444
19.3.1 Data Cleaning and Data Preparation ................ 444
19.3.2 Involving Business Users in the KDD Process ....... 445
19.3.3 Business Challenges for the KDD Process ........... 446
19.4 Decision Support Systems - Case Study .................... 446
19.4.1 A Functional Perspective .......................... 447
19.4.2 Decisions ......................................... 450
19.5 Keys to the Commercial Success of Data Mining - Case
Studies .................................................. 452
19.5.1 Case Study 1: Commercial Success Criteria ......... 452
19.5.2 Case Study 2: A Service Provider's View ........... 454
19.6 Data Mining Supports E-Commerce .......................... 458
19.6.1 Data Mining Application Possibilities in Web
Stores ........................................... 459
19.7 Data Mining for the Retail Industry ...................... 462
19.8 Business Intelligence and Retailing ...................... 463
19.8.1 Applications of Data Warehousing and Data
Mining in the Retail Industry ..................... 463
19.8.2 Key Trends in the Retail Industry ................. 464
19.8.3 Business Intelligence Solutions for the Retail
Industry .......................................... 465
19.9 Summary .................................................. 471
19.10 Review Questions ........................................ 472
20 Data Mining for Insurance ................................ 473
20.1 Insurance Underwriting ................................... 474
20.1.1 Data Mining and Insurance: Improving the
Underwriting Decision-Making Process .............. 475
20.1.2 What does an Insurance Underwriter Do? ............ 479
20.1.3 How is the Underwriting Function Changing? ........ 485
20.1.4 How can Data Mining Help Underwriters Make
Better Business Decisions ......................... 485
20.2 Business Intelligence and Insurance ...................... 487
20.2.1 Insurance Industry Overview and Major Trends ...... 487
20.2.2 Business Intelligence and the Insurance Value
Chain ............................................. 488
20.2.3 Customer Relationship Management .................. 489
20.2.4 Channel Management ................................ 491
20.2.5 Actuarial ......................................... 493
20.2.6 Underwriting and Policy Management ................ 493
20.2.7 Claims Management ................................. 494
20.2.8 Finance and Asset Management ...................... 495
20.2.9 Human Resources ................................... 496
20.2.10 Corporate Management ............................. 497
20.3 Summary .................................................. 497
20.4 Review Questions ......................................... 498
21 Data Mining in Biomedicine and Science ................... 499
21.1 Applications in Medicine ................................. 501
21.1.1 Healthcare ........................................ 501
21.1.2 Data Mining in Clinical Domains ................... 501
21.1.3 Data Mining in Medical Diagnosis Problem .......... 502
21.2 Data Mining for Biomedical and DNA Data Analysis ......... 502
21.2.1 Semantic Integration of Heterogeneous,
Distributed Genome Databases ...................... 503
21.2.2 Similarity Search and Comparison Among DNA
Sequences ......................................... 503
21.2.3 Association Analysis: Identification of
Co-occurring Gene Sequences ....................... 504
21.2.4 Path Analysis: Linking Genes to Different Stages
of Disease Development ............................ 504
21.2.5 Visualization Tools and Genetic Data Analysis ..... 504
21.3 An Unsupervised Neural Network Approach .................. 504
21.3.1 Knowledge Extraction Through Data Mining .......... 505
21.3.2 Traditional Difficulties in Handling Medical
Data .............................................. 505
21.3.3 An Illustrative Case Study ........................ 506
21.3.4 Organizing Medical Data ........................... 506
21.3.5 Building the Neural Network Tool .................. 508
21.3.6 Applying Data Mining and Data Visualization
Techniques ........................................ 509
21.4 Data Mining - Assisted Decision Support for Fever
Diagnosis - Case Study ................................... 515
21.4.1 Architecture for Fever Diagnosis .................. 516
21.4.2 Medical Data Definition Component ................. 516
21.4.3 Physician-System Interface ........................ 517
21.4.4 Diagnostic Question Banque ........................ 517
21.4.5 Pattern Extractor ................................. 519
21.4.6 Rule Constructor .................................. 519
21.5 Data Mining and Science .................................. 520
21.6 Knowledge Discovery in Science as Opposed to Business -
Case Study ............................................... 522
21.6.1 Why is Data Mining Different? ..................... 522
21.6.2 The Data Management Context ....................... 522
21.6.3 Business Data Analysis ............................ 523
21.6.4 Scientific Data Analysis .......................... 523
21.6.5 Scientific Applications ........................... 524
21.6.6 Example of Predicting Air Quality ................. 524
21.7 Data Mining in a Scientific Environment .................. 529
21.7.1 What is Data Mining? .............................. 529
21.7.2 Traditional Uses of Data Mining ................... 531
21.7.3 Data Mining in a Scientific Environment ........... 532
21.7.4 Examples of Scientific Data Mining ................ 533
21.7.5 Concluding Remarks ................................ 533
21.8 Flexible Earth Science Data Mining System Architecture ... 534
21.8.1 Design Issues ..................................... 534
21.8.2 ADaM System Features .............................. 535
21.8.3 ADaM Plan Builder Client .......................... 540
21.8.4 Research Directions ............................... 541
21.9 Summary .................................................. 542
21.10 Review Questions ........................................ 543
22 Text and Web Mining ...................................... 545
22.1 Data Mining and the Web .................................. 547
22.1.1 Resource Discovery ................................ 548
22.1.2 Information Extraction ............................ 548
22.1.3 Generalization .................................... 548
22.2 An Overview on Web Mining ................................ 549
22.2.1 Taxonomy of Web Mining ............................ 550
22.2.2 Database Approach ................................. 550
22.2.3 Web Mining Tasks .................................. 552
22.2.4 Mining Interested Content from Web Document ....... 553
22.2.5 Mining Pattern from Web Transactions/Logs ......... 554
22.2.6 Web Access Pattern Tree (WAP tree) ................ 557
22.3 Text Mining .............................................. 558
22.3.1 Definition ........................................ 558
22.3.2 S&T Text Mining Applications ...................... 559
22.3.3 Text Mining Tools ................................. 560
22.3.4 Text Data Mining .................................. 561
22.4 Discovering Web Access Patterns and Trends ............... 563
22.4.1 Design of a Web Log Miner ......................... 565
22.4.2 Database Construction from Server Log Files ....... 567
22.4.3 Multidimensional Web Log Data Cube ................ 568
22.4.4 Data Mining on Web Log Data Cube and Web Log
Database .......................................... 569
22.5 Web Usage Mining on Proxy Servers: A Case Study .......... 572
22.5.1 Aspects of Web Usage Mining ....................... 573
22.5.2 Data Collection ................................... 573
22.5.3 Preprocessing ..................................... 574
22.5.4 Data Cleaning ..................................... 574
22.5.5 User and Session Identification ................... 575
22.5.6 Data Mining Techniques ............................ 575
22.5.7 E-metrics ......................................... 577
22.5.8 The Data .......................................... 579
22.6 Text Data Mining in Biomedical Literature ................ 581
22.6.1 Information Retrieval Task - Retrieve Relevant
Documents by Making use of Existing Database ...... 582
22.6.2 Naive Bayes Classifier ............................ 582
22.6.3 Experimental Results of Information Retrieval
Task .............................................. 583
22.6.4 Text Mining Task - Mining MEDLINE by Combining
Term Extraction and Association Rule Mining ....... 583
22.6.5 Finding the Relations Between MeSH Terms and
Substances ........................................ 584
22.6.6 Finding the Relations Between Other Terms ......... 584
22.7 Related Work ............................................. 585
22.7.1 Future Work: For the Information Retrieval Task ... 586
22.7.2 For the Text Mining Task .......................... 587
22.7.3 Mutual Benefits between Two Tasks ................. 587
22.8 Summary .................................................. 588
22.9 Review Questions ......................................... 589
23 Data Mining in Information Analysis and Delivery ......... 591
23.1 Information Analysis: Overview ........................... 592
23.1.1 Data Acquisition .................................. 592
23.1.2 Extraction and Representation ..................... 593
23.1.3 Information Analysis .............................. 593
23.2 Intelligent Information Delivery - Case Study ............ 595
23.2.1 Alerts Run Rampant ................................ 595
23.2.2 What an Intelligent Information Delivery System
is ................................................ 596
23.2.3 Simple Example of an Intelligent Information
Delivery Mechanism ................................ 597
23.3 A Characterization of Data Mining Technologies and
Processes - Case Study ................................... 599
23.3.1 Data Mining Processes ............................. 600
23.3.2 Data Mining Users and Activities .................. 601
23.3.3 The Technology Tree ............................... 602
23.3.4 Cross-Tabulation .................................. 609
23.3.5 Neural Nets ....................................... 610
23.4 Summary .................................................. 612
23.5 Review Questions ......................................... 613
24 Data Mining in Telecommunications and Control ............ 615
24.1 Data Mining for the Telecommunication Industry ........... 616
24.1.1 Multidimensional Analysis of Telecommunication
Data .............................................. 617
24.1.2 Fraudulent Pattern Analysis and the
Identification of Unusual Patterns ................ 617
24.1.3 Multidimensional Association and Sequential
Pattern Analysis .................................. 617
24.1.4 Use of Visualization Tools in Telecommunication
Data Analysis ..................................... 618
24.2 Data Mining Focus Areas in Telecommunication ............. 618
24.2.1 Systematic Error .................................. 618
24.2.2 Data Mining in Churn Analysis ..................... 620
24.3 A Learning System for Decision Support in
Telecommunications ....................................... 621
24.4 Knowledge Processing in Control Systems .................. 623
24.4.1 Preliminaries and General Definitions ............. 624
24.5 Data Mining for Maintenance of Complex Systems - A Case
Study .................................................... 626
24.6 Summary .................................................. 627
24.7 Review Questions ......................................... 627
25 Data Mining in Security .................................. 629
25.1 Data Mining in Security Systems .......................... 630
25.2 Real Time Data Mining-Based Intrusion Detection Systems
- Case Study ............................................. 631
25.2.1 Accuracy .......................................... 632
25.2.2 Feature Extraction for IDS ........................ 633
25.2.3 Artificial Anomaly Generation ..................... 634
25.2.4 Combined Misuse and Anomaly Detection ............. 635
25.2.5 Efficiency ........................................ 636
25.2.6 Cost-Sensitive Modeling ........................... 637
25.2.7 Distributed Feature Computation ................... 639
25.2.8 System Architecture ............................... 643
25.3 Summary .................................................. 646
Data Mining Research Projects ................................. 649
A.1 National University of Singapore: Data Mining Research
Projects ................................................. 649
A.1.1 Cleaning Data for Warehousing and Mining .......... 649
A.1.2 Data Mining in Multiple Databases ................. 650
A.1.3 Intelligent WEB Document Management Using Data
Mining Techniques ................................. 650
A.1.4 Data Mining with Neural Networks .................. 650
A.1.5 Data Mining in Semistructured Data ................ 651
A.1.6 A Data Mining Application - Customer Retention
in the Port of Singapore Authority (PSA) .......... 651
A.1.7 A Belief-Based Approach to Data Mining ............ 651
A.1.8 Discovering Interesting Knowledge in Database ..... 652
A.1.9 Data Mining for Market Research ................... 652
A.1.10 Data Mining in Electronic Commerce ................ 652
A.1.11 Multidimensional Data Visualization Tool .......... 653
A.1.12 Clustering Algorithms for Data Mining ............. 653
A.1.13 Web Page Design for Electronic Commerce ........... 653
A.1.14 Data Mining Application on Web Information
Sources ........................................... 654
A.1.15 Data Mining in Finance ............................ 654
A.1.16 Document Summarization ............................ 654
A.1.17 Data Mining and Intelligent Data Analysis ......... 655
A.2 HP Labs Research: Software Technology Laboratory ......... 658
A.2.1 Data Mining Research .............................. 658
A.3 CRISP-DM: An Overview .................................... 661
A.3.1 Moving from Technology to Business ................ 661
A.3.2 Process Model ..................................... 662
A.4 Data Mining Suite™ ....................................... 663
A.4.1 Rule-based Influence Discovery .................... 665
A.4.2 Dimensional Affinity Discovery .................... 665
A.4.3 The OLAP Discovery System ......................... 665
A.4.4 Incremental Pattern Discovery ..................... 665
A.4.5 Trend Discovery ................................... 666
A.4.6 Forensic Discovery ................................ 666
A.4.7 Predictive Modeler ................................ 666
A.5 The Quest Data Mining System, IBM Almaden Research
Center, CA, USA .......................................... 669
A.5.1 Introduction ...................................... 669
A.5.2 Association Rules ................................. 670
A.5.3 Apriori Algorithm ................................. 670
A.5.4 Sequential Patterns ............................... 672
A.5.5 Time-series Clustering ............................ 673
A.5.6 Incremental Mining ................................ 675
A.5.7 Parallelism ....................................... 676
A.5.8 System Architecture ............................... 676
A.5.9 Future Directions ................................. 676
A.6 The Australian National University Research Projects ..... 676
A.6.1 Applications of Inductive Learning ................ 676
A.6.2 Logic in Machine Learning ......................... 677
A.6.3 Machine-learning Summer Research Projects
in Data Mining and Reinforcement Learning ......... 678
A.6.4 Computational Aspects of Data Mining
(3 Projects) ...................................... 678
A.6.5 Data Mining the MACHO Database .................... 679
A.6.6 Artificial Stereophonic Processing ................ 680
A.6.7 Real-time Active Vision ........................... 680
A.6.8 Web Teleoperation of a Mobile Robot ............... 680
A.6.9 Autonomous Submersible Robot ...................... 681
A.6.10 The SIT Project ................................... 682
A.7 Data Mining Research Group, Monash University Australia .. 682
A.7.1 Current Projects .................................. 682
A.7.2 ADELFI - A Model for the Deployment of High-
Performance Solutions on the Internet and
Intranets ......................................... 683
A.8 Current Projects, University of Alabama in Huntsville,
AL ....................................................... 688
A.8.1 Direct Mailing System ............................. 688
A.8.2 A Vibration Sensor ................................ 688
A.8.3 Current Status .................................... 689
A.8.4 Data Mining Using Classification .................. 689
A.8.5 Email Classification, Mining ...................... 690
A.8.6 Data-based Decision Making ........................ 690
A.8.7 Data Mining in Relational Databases ............... 691
A.8.8 Environmental Applications and Machine Learning ... 691
A.8.9 Current Research Projects ......................... 692
A.8.10 Web Mining ........................................ 693
A.8.11 Neural Networks Applications to ATM Networks
Control ........................................... 693
A.8.12 Scientific Topics ................................. 694
A.8.13 Application Areas ................................. 695
A.9 Kensington Approach Toward Enterprise Data Mining Group .. 696
A.9.1 Distributed Database Support ...................... 696
A.9.2 Distributed Object Management ..................... 696
A.9.3 Groupware, Security, and Persistent Objects ....... 697
A.9.4 Universal Clients - User-friendly Data Mining ..... 697
A.9.5 High-Performance Server ........................... 697
Data Mining Standards ......................................... 699
II.1 Data Mining Standards .................................... 700
II.1.1 Process Standards ................................. 700
II.1.2 XML Standards/OR Model Defining Standards ......... 704
II.1.3 Web Standards ..................................... 707
II.1.4 Application Programming Interfaces (APIs) ......... 711
II.1.5 Grid Services ..................................... 716
II.2 Developing Data Mining Application Using Data Mining
Standards ................................................ 719
II.2.1 Application Requirement Specification ............. 719
II.2.2 Design and Deployment ............................. 720
II.3 Analysis ................................................. 722
II.4 Application Examples ..................................... 723
II.4.1 PMML Example ...................................... 723
II.4.2 XMLA Example ...................................... 724
II.4.3 OLEDB ............................................. 725
II.4.4 OLEDB-DM Example .................................. 726
II.4.5 SQL/MM Example .................................... 728
II.4.6 Java Data Mining Model Example .................... 728
II.4.7 Web Services ...................................... 730
II.5 Conclusion ............................................... 730
Intelligent Miner ............................................. 731
3A.1 Data Mining Process ...................................... 731
3A.1.1 Selecting the Input Data .......................... 732
3A.1.2 Exploring the Data ................................ 732
3A.1.3 Transforming the Data ............................. 732
3A.1.4 Mining the Data ................................... 733
3A.2 Interpreting the Results ................................. 733
3A.3 Overview of the Intelligent Miner Components ............. 734
3A.3.1 User Interface .................................... 734
3A.3.2 Environment Layer API ............................. 734
3A.3.3 Visualizer ........................................ 734
3A.3.4 Data Access ....................................... 734
3A.4 Running Intelligent Miner Servers ........................ 734
3A.5 How the Intelligent Miner Creates Output Data ............ 736
3A.5.1 Partitioned Output Tables ......................... 736
3A.5.2 How the Partitioning Key is Created ............... 737
3A.6 Performing Common Tasks .................................. 737
3A.7 Understanding Basic Concepts ............................. 738
3A.7.1 Getting Familiar with the Intelligent Miner Main
Window ............................................ 738
3A.8 Main Window Areas ........................................ 738
3A.8.1 Mining Base Container ............................. 738
3A.8.2 Contents Container ................................ 739
3A.8.3 Work Area ......................................... 739
3A.8.4 Creating and Using Mining Bases ................... 739
3A.9 Conclusion ............................................... 740
Clementine .................................................... 741
3B.1 Key Findings ............................................. 741
3B.2 Background Information ................................... 742
3B.3 Product Availability ..................................... 743
3B.4 Software Description ..................................... 744
3B.5 Architecture ............................................. 745
3B.6 Methodology .............................................. 746
3B.6.1 Business Understanding ............................ 746
3B.6.2 Data Understanding ................................ 748
3B.6.3 Data Preparation .................................. 749
3B.6.4 Modeling .......................................... 750
3B.6.5 Evaluation ........................................ 752
3B.6.6 Deployment ........................................ 753
3B.7 Clementine Server ........................................ 753
3B.8 How Clementine Server Improves Performance on Large
Datasets ................................................. 754
3B.8.1 Benchmark Testing Results: Data Processing ........ 755
3B.8.2 Benchmark Testing Results: Modeling ............... 755
3B.8.3 Benchmark Testing Results: Scoring ................ 757
3B.9 Conclusion ............................................... 758
CRISP ......................................................... 761
3C.1 Hierarchical Breakdown ................................... 761
3C.2 Mapping Generic Models to Specialized Models ............. 762
3C.2.1 Data Mining Context ............................... 762
3C.2.2 Mappings with Contexts ............................ 763
3C.3 The CRISP-DM Reference Model ............................. 763
3C.3.1 Business Understanding ............................ 765
3C.4 Data Understanding ....................................... 769
3C.4.1 Collect Initial Data .............................. 769
3C.4.2 Output Initial Data Collection Report ............. 770
3C.4.3 Describe Data ..................................... 770
3C.4.4 Explore Data ...................................... 771
3C.4.5 Output Data Exploration Report .................... 771
3C.4.6 Verify Data Quality ............................... 771
3C.5 Data Preparation ......................................... 771
3C.5.1 Select Data ....................................... 771
3C.5.2 Clean Data ........................................ 772
3C.5.3 Construct Data .................................... 773
3C.5.4 Generated Records ................................. 773
3C.5.5 Integrate Data .................................... 773
3C.5.6 Output Merged Data ................................ 773
3C.5.7 Format Data ....................................... 773
3C.5.8 Reformatted Data .................................. 774
3C.6 Modeling ................................................. 774
3C.6.1 Select Modeling Technique ......................... 774
3C.6.2 Outputs Modeling Technique ........................ 774
3C.6.3 Modeling Assumptions .............................. 774
3C.6.4 Generate Test Design .............................. 774
3C.6.5 Output Test Design ................................ 775
3C.6.6 Build Model ....................................... 775
3C.6.7 Outputs Parameter Settings ........................ 775
3C.6.8 Assess Model ...................................... 776
3C.6.9 Outputs Model Assessment .......................... 776
3C.6.10 Revised Parameter Settings ....................... 776
3C.7 Evaluation ............................................... 776
3C.7.1 Evaluate Results .................................. 776
3C.8 Conclusion ............................................... 777
MineSet ....................................................... 779
3D.1 Introduction ............................................. 779
3D.2 Architecture ............................................. 779
3D.3 MineSet Tools for Data Mining Tasks ...................... 780
3D.4 About the Raw Data ....................................... 781
3D.5 Analytical Algorithms .................................... 781
3D.6 Visualization ............................................ 782
3D.7 KDD Process Management ................................... 783
3D.8 History .................................................. 784
3D.9 Commercial Uses .......................................... 785
3D.10 Conclusion .............................................. 786
Enterprise Miner .............................................. 787
3E.1 Tools for Data Mining Process ............................ 787
3E.2 Why Enterprise Miner ..................................... 788
3E.3 Product Overview ......................................... 789
3E.4 SAS Enterprise Miner 5.2 Key Features .................... 790
3E.4.1 Multiple Interfaces ............................... 790
3E.4.2 Scalable Processing ............................... 791
3E.4.3 Accessing Data .................................... 791
3E.4.4 Sampling .......................................... 791
3E.4.5 Data Partitioning ................................. 792
3E.4.6 Filtering Outliers ................................ 792
3E.4.7 Transformations ................................... 792
3E.4.8 Data Replacement .................................. 792
3E.4.9 Descriptive Statistics ............................ 792
3E.4.10 Graphs/Visualization ............................. 793
3E.5 Enterprise Miner Software ................................ 793
3E.5.1 The Graphical User Interface ...................... 794
3E.5.2 The GUI Components ................................ 794
3E.6 Enterprise Miner Process for Data Mining ................. 796
3E.7 Client/Server Capabilities ............................... 796
3E.8 Client/Server Requirements ............................... 796
3E.9 Conclusion ............................................... 797
References .................................................... 799