Big Data
Computing
A Guide for Business
and Technology Managers
Chapman & Hall/CRC
Big Data Series
SERIES EDITOR
Sanjay Ranka
AIMS AND SCOPE
This series aims to present new research and applications in Big Data, along with the computational tools and techniques currently in development. The inclusion of concrete examples and
applications is highly encouraged. The scope of the series includes, but is not limited to, titles in the
areas of social networks, sensor networks, data-centric computing, astronomy, genomics, medical
data analytics, large-scale e-commerce, and other relevant topics that may be proposed by potential contributors.
PUBLISHED TITLES
BIG DATA COMPUTING: A GUIDE FOR BUSINESS AND TECHNOLOGY MANAGERS
Vivek Kale
BIG DATA OF COMPLEX NETWORKS
Matthias Dehmer, Frank Emmert-Streib, Stefan Pickl, and Andreas Holzinger
BIG DATA: ALGORITHMS, ANALYTICS, AND APPLICATIONS
Kuan-Ching Li, Hai Jiang, Laurence T. Yang, and Alfredo Cuzzocrea
NETWORKING FOR BIG DATA
Shui Yu, Xiaodong Lin, Jelena Mišić, and Xuemin (Sherman) Shen
Big Data
Computing
A Guide for Business
and Technology Managers
Vivek Kale
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2017 by Vivek Kale
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Printed on acid-free paper
Version Date: 20160426
International Standard Book Number-13: 978-1-4987-1533-1 (Hardback)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been
made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright
holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this
form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may
rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the
publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/)
or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For
organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
Library of Congress Cataloging-in-Publication Data
Names: Kale, Vivek, author.
Title: Big data computing : a guide for business and technology managers /
author, Vivek Kale.
Description: Boca Raton : Taylor & Francis, CRC Press, 2016. | Series:
Chapman & Hall/CRC big data series | Includes bibliographical references
and index.
Identifiers: LCCN 2016005989 | ISBN 9781498715331
Subjects: LCSH: Big data.
Classification: LCC QA76.9.B45 K35 2016 | DDC 005.7--dc23
LC record available at
Visit the Taylor & Francis Web site at
and the CRC Press Web site at
To
Nilesh Acharya and family
for unstinted support on
references and research
for my numerous book projects.
Contents
List of Figures .............................................................................................................................. xxi
List of Tables .............................................................................................................................. xxiii
Preface .......................................................................................................................................... xxv
Acknowledgments .................................................................................................................... xxxi
Author .......................................................................................................................................xxxiii
1. Computing Beyond the Moore’s Law Barrier While Being More Tolerant of
Faults and Failures..................................................................................................................1
1.1 Moore’s Law Barrier .....................................................................................................2
1.2 Types of Computer Systems ........................................................................................4
1.2.1 Microcomputers ...............................................................................................4
1.2.2 Midrange Computers ......................................................................................4
1.2.3 Mainframe Computers ....................................................................................5
1.2.4 Supercomputers ...............................................................................................5
1.3 Parallel Computing .......................................................................................................6
1.3.1 Von Neumann Architectures ................................................................. 8
1.3.2 Non-Neumann Architectures ........................................................................9
1.4 Parallel Processing ........................................................................................................9
1.4.1 Multiprogramming........................................................................................ 10
1.4.2 Vector Processing ........................................................................................... 10
1.4.3 Symmetric Multiprocessing Systems .......................................................... 11
1.4.4 Massively Parallel Processing ...................................................................... 11
1.5 Fault Tolerance ............................................................................................................. 12
1.6 Reliability Conundrum .............................................................................................. 14
1.7 Brewer’s CAP Theorem .............................................................................................. 15
1.8 Summary ...................................................................................................................... 18
Section I
Genesis of Big Data Computing
2. Database Basics ..................................................................................................................... 21
2.1 Database Management System .................................................................................21
2.1.1 DBMS Benefits ................................................................................................22
2.1.2 Defining a Database Management System.................................................23
2.1.2.1 Data Models alias Database Models ............................................ 26
2.2 Database Models .........................................................................................................27
2.2.1 Relational Database Model ...........................................................................28
2.2.2 Hierarchical Database Model .......................................................................30
2.2.3 Network Database Model .............................................................................32
2.2.4 Object-Oriented Database Models...............................................................32
2.2.5 Comparison of Models ..................................................................................33
2.2.5.1 Similarities ...................................................................................... 33
2.2.5.2 Dissimilarities ................................................................................. 35
2.3 Database Components................................................................................................36
2.3.1 External Level ................................................................................................. 37
2.3.2 Conceptual Level ........................................................................................... 37
2.3.3 Physical Level ................................................................................................. 38
2.3.4 The Three-Schema Architecture ................................................................. 38
2.3.4.1 Data Independence ........................................................................ 39
2.4 Database Languages and Interfaces ......................................................................... 40
2.5 Categories of Database Management Systems .......................................................42
2.6 Other Databases ..........................................................................................................44
2.6.1 Text Databases ................................................................................................44
2.6.2 Multimedia Databases ..................................................................................44
2.6.3 Temporal Databases.......................................................................................44
2.6.4 Spatial Databases ........................................................................................... 45
2.6.5 Multiple or Heterogeneous Databases ........................................................ 45
2.6.6 Stream Databases ........................................................................................... 45
2.6.7 Web Databases ............................................................................................... 46
2.7 Evolution of Database Technology .......................................................................... 46
2.7.1 Distribution .....................................................................................................47
2.7.2 Performance ....................................................................................................47
2.7.2.1 Database Design for Multicore Processors .................................48
2.7.3 Functionality .................................................................................................. 49
2.8 Summary ...................................................................................................................... 50
Section II
Road to Big Data Computing
3. Analytics Basics .................................................................................................................... 53
3.1 Intelligent Analysis ..................................................................................................... 53
3.1.1 Intelligence Maturity Model.........................................................................55
3.1.1.1 Data ..................................................................................................55
3.1.1.2 Communication ..............................................................................55
3.1.1.3 Information .....................................................................................56
3.1.1.4 Concept ............................................................................................56
3.1.1.5 Knowledge ......................................................................................57
3.1.1.6 Intelligence ......................................................................................58
3.1.1.7 Wisdom............................................................................................58
3.2 Decisions ...................................................................................................................... 59
3.2.1 Types of Decisions .........................................................................................59
3.2.2 Scope of Decisions .........................................................................................61
3.3 Decision-Making Process........................................................................................... 61
3.4 Decision-Making Techniques ....................................................................................63
3.4.1 Mathematical Programming ........................................................................63
3.4.2 Multicriteria Decision Making .....................................................................64
3.4.3 Case-Based Reasoning...................................................................................64
3.4.4 Data Warehouse and Data Mining ..............................................................64
3.4.5 Decision Tree...................................................................................................64
3.4.6 Fuzzy Sets and Systems ................................................................................65
3.5 Analytics.......................................................................................................................65
3.5.1 Descriptive Analytics .................................................................................... 66
3.5.2 Predictive Analytics ...................................................................................... 66
3.5.3 Prescriptive Analytics ................................................................................... 67
3.6 Data Science Techniques ............................................................................................ 68
3.6.1 Database Systems...........................................................................................68
3.6.2 Statistical Inference ........................................................................................68
3.6.3 Regression and Classification.......................................................................69
3.6.4 Data Mining and Machine Learning ...........................................................70
3.6.5 Data Visualization ..........................................................................................70
3.6.6 Text Analytics .................................................................................................71
3.6.7 Time Series and Market Research Models..................................................72
3.7 Snapshot of Data Analysis Techniques and Tasks ................................................. 74
3.8 Summary ......................................................................................................................77
4. Data Warehousing Basics .................................................................................................... 79
4.1 Relevant Database Concepts...................................................................................... 79
4.1.1 Physical Database Design .............................................................................80
4.2 Data Warehouse .......................................................................................................... 81
4.2.1 Multidimensional Model ..............................................................................83
4.2.1.1 Data Cube ........................................................................................84
4.2.1.2 Online Analytical Processing ........................................................84
4.2.1.3 Relational Schemas ........................................................................87
4.2.1.4 Multidimensional Cube.................................................................88
4.3 Data Warehouse Architecture ................................................................................... 91
4.3.1 Architecture Tiers ...........................................................................................91
4.3.1.1 Back-End Tier .................................................................................. 91
4.3.1.2 Data Warehouse Tier ..................................................................... 91
4.3.1.3 OLAP Tier ........................................................................................ 93
4.3.1.4 Front-End Tier ................................................................................. 93
4.4 Data Warehouse 1.0..................................................................................................... 93
4.4.1 Inmon’s Information Factory .......................................................................93
4.4.2 Kimball’s Bus Architecture ...........................................................................94
4.5 Data Warehouse 2.0 .................................................................................................... 95
4.5.1 Inmon’s DW 2.0 ..............................................................................................95
4.5.2 Claudia Imhoff and Colin White’s DSS 2.0 ................................................96
4.6 Data Warehouse Architecture Challenges .............................................................. 96
4.6.1 Performance ....................................................................................................98
4.6.2 Scalability ........................................................................................................98
4.7 Summary .................................................................................................................... 100
5. Data Mining Basics ............................................................................................................ 101
5.1 Data Mining ............................................................................................................... 101
5.1.1 Benefits .......................................................................................................... 103
5.2 Data Mining Applications ....................................................................................... 104
5.3 Data Mining Analysis .............................................................................................. 106
5.3.1 Supervised Analysis .................................................................................... 106
5.3.1.1 Exploratory Analysis ................................................................... 106
5.3.1.2 Classification ................................................................................. 107
5.3.1.3 Regression ..................................................................................... 107
5.3.1.4 Time Series .................................................................................... 108
5.3.2 Un-Supervised Analysis ............................................................................. 108
5.3.2.1 Association Rules ......................................................................... 108
5.3.2.2 Clustering ...................................................................................... 108
5.3.2.3 Description and Visualization.................................................... 109
5.4 CRISP-DM Methodology ......................................................................................... 109
5.4.1 Business Understanding ............................................................................. 110
5.4.2 Data Understanding .................................................................................... 111
5.4.3 Data Preparation .......................................................................................... 111
5.4.4 Modeling ....................................................................................................... 112
5.4.5 Model Evaluation ......................................................................................... 113
5.4.6 Model Deployment ...................................................................................... 113
5.5 Machine Learning ..................................................................................................... 114
5.5.1 Cybersecurity Systems ................................................................................ 116
5.5.1.1 Data Mining for Cybersecurity .................................................. 117
5.6 Soft Computing ......................................................................................................... 118
5.6.1 Artificial Neural Networks ........................................................................ 119
5.6.2 Fuzzy Systems .............................................................................................. 120
5.6.3 Evolutionary Algorithms ............................................................................ 120
5.6.4 Rough Sets .................................................................................................... 121
5.7 Summary .................................................................................................................... 122
6. Distributed Systems Basics .............................................................................................. 123
6.1 Distributed Systems .................................................................................................. 123
6.1.1 Parallel Computing...................................................................................... 125
6.1.2 Distributed Computing............................................................................... 128
6.1.2.1 System Architectural Styles ........................................................ 129
6.1.2.2 Software Architectural Styles..................................................... 130
6.1.2.3 Technologies for Distributed Computing ................................. 135
6.2 Distributed Databases .............................................................................................. 138
6.2.1 Characteristics of Distributed Databases ................................................. 140
6.2.1.1 Transparency ................................................................................ 140
6.2.1.2 Availability and Reliability ......................................................... 140
6.2.1.3 Scalability and Partition Tolerance ............................................ 141
6.2.1.4 Autonomy ...................................................................................... 141
6.2.2 Advantages and Disadvantages of Distributed Databases.................... 142
6.2.3 Data Replication and Allocation ................................................................ 146
6.2.4 Concurrency Control and Recovery ......................................................... 146
6.2.4.1 Distributed Recovery ................................................................... 147
6.2.5 Query Processing and Optimization ........................................................ 148
6.2.6 Transaction Management ........................................................................... 149
6.2.6.1 Two-Phase Commit Protocol ...................................................... 149
6.2.6.2 Three-Phase Commit Protocol ................................................... 150
6.2.7 Rules for Distributed Databases ................................................................ 151
6.3 Summary .................................................................................................................... 152
7. Service-Oriented Architecture Basics ............................................................................153
7.1 Service-Oriented Architecture ................................................................................ 153
7.1.1 Defining SOA................................................................................................ 155
7.1.1.1 Services .......................................................................................... 155
7.2 SOA Benefits .............................................................................................................. 156
7.3 Characteristics of SOA.............................................................................................. 157
7.3.1 Dynamic, Discoverable, Metadata Driven ............................................... 157
7.3.2 Designed for Multiple Invocation Styles .................................................. 158
7.3.3 Loosely Coupled .......................................................................................... 158
7.3.4 Well-Defined Service Contracts ................................................................. 158
7.3.5 Standard Based ............................................................................................ 158
7.3.6 Granularity of Services and Service Contracts........................................ 158
7.3.7 Stateless ......................................................................................................... 159
7.3.8 Predictable Service-Level Agreements (SLAs) ........................................ 159
7.3.9 Design Services with Performance in Mind ............................................ 159
7.4 SOA Applications ...................................................................................................... 159
7.4.1 Rapid Application Integration ................................................................... 160
7.4.2 Multichannel Access.................................................................................... 160
7.4.3 Business Process Management .................................................................. 160
7.5 SOA Ingredients ........................................................................................................ 161
7.5.1 Objects, Services, and Resources ............................................................... 161
7.5.1.1 Objects............................................................................................ 161
7.5.1.2 Services .......................................................................................... 161
7.5.1.3 Resources ....................................................................................... 162
7.5.2 SOA and Web Services ................................................................................ 163
7.5.2.1 Describing Web Services: Web Services Description
Language ...................................................................................... 165
7.5.2.2 Accessing Web Services: Simple Object Access Protocol ...... 165
7.5.2.3 Finding Web Services: Universal Description, Discovery,
and Integration ............................................................................. 165
7.5.3 SOA and RESTful Services ......................................................................... 166
7.6 Enterprise Service Bus .............................................................................................. 167
7.6.1 Characteristics of an ESB Solution ............................................................ 170
7.6.1.1 Key Capabilities of an ESB .......................................................... 171
7.6.1.2 ESB Scalability .............................................................................. 174
7.6.1.3 Event-Driven Nature of ESB ....................................................... 174
7.7 Summary .................................................................................................................... 175
8. Cloud Computing Basics ...................................................................................................177
8.1 Cloud Definition........................................................................................................ 177
8.2 Cloud Characteristics ............................................................................................... 179
8.2.1 Cloud Storage Infrastructure Requirements ........................................... 180
8.3 Cloud Delivery Models ............................................................................................ 181
8.3.1 Infrastructure as a Service (IaaS)............................................................... 182
8.3.2 Platform as a Service (PaaS) ....................................................................... 182
8.3.3 Software as a Service (SaaS) ....................................................................... 183
8.4 Cloud Deployment Models .....................................................................................185
8.4.1 Private Clouds ..............................................................................................185
8.4.2 Public Clouds ...............................................................................................185
8.4.3 Hybrid Clouds..............................................................................................186
8.4.4 Community Clouds .....................................................................................186
8.5 Cloud Benefits ...........................................................................................................186
8.6 Cloud Challenges ......................................................................................................190
8.6.1 Scalability ......................................................................................................191
8.6.2 Multitenancy.................................................................................................192
8.6.3 Availability ....................................................................................................193
8.6.3.1 Failure Detection .......................................................................... 194
8.6.3.2 Application Recovery................................................................... 195
8.7 Cloud Technologies...................................................................................................195
8.7.1 Virtualization ................................................................................................196
8.7.1.1 Characteristics of Virtualized Environment ............................ 197
8.7.2 Service-Oriented Computing .....................................................................200
8.7.2.1 Advantages of SOA ...................................................................... 201
8.7.2.2 Layers in SOA ............................................................................... 202
8.8 Summary ....................................................................................................................203
Section III
Big Data Computing
9. Introducing Big Data Computing ...................................................................................207
9.1 Big Data ......................................................................................................................207
9.1.1 What Is Big Data?.........................................................................................208
9.1.1.1 Data Volume ..................................................................................208
9.1.1.2 Data Velocity .................................................................................210
9.1.1.3 Data Variety................................................................................... 211
9.1.1.4 Data Veracity .................................................................................212
9.1.2 Common Characteristics of Big Data Computing Systems ...................213
9.1.3 Big Data Appliances ....................................................................................214
9.2 Tools and Techniques of Big Data ...........................................................................215
9.2.1 Processing Approach ...................................................................................215
9.2.2 Big Data System Architecture ....................................................................216
9.2.2.1 BASE (Basically Available, Soft State, Eventual
Consistency) .............................................................................. 217
9.2.2.2 Functional Decomposition ..........................................................218
9.2.2.3 Master–Slave Replication ............................................................218
9.2.3 Row Partitioning or Sharding ....................................................................218
9.2.4 Row versus Column-Oriented Data Layouts ..........................................219
9.2.5 NoSQL Data Management..........................................................................220
9.2.6 In-Memory Computing ...............................................................................221
9.2.7 Developing Big Data Applications ............................................................222
9.3 Aadhaar Project .........................................................................................................223
9.4 Summary ....................................................................................................................226
10. Big Data Technologies .......................................................................................................227
10.1 Functional Programming Paradigm .....................................................................227
10.1.1 Parallel Architectures and Computing Models .................................. 228
10.1.2 Data Parallelism versus Task Parallelism............................................. 228
10.2 Google MapReduce.................................................................................................229
10.2.1 Google File System ..................................................................................231
10.2.2 Google Bigtable ........................................................................................ 232
10.3 Yahoo!’s Vision of Big Data Computing ..............................................................233
10.3.1 Apache Hadoop........................................................................................234
10.3.1.1 Components of Hadoop Ecosystem.....................................235
10.3.1.2 Principles and Patterns Underlying the Hadoop
Ecosystem ................................................................................ 236
10.3.1.3 Storage and Processing Strategies ........................................237
10.3.2 Hadoop 2 alias YARN ............................................................................. 238
10.3.2.1 HDFS Storage .......................................................................... 239
10.3.2.2 MapReduce Processing .......................................................... 239
10.4 Hadoop Distribution ..............................................................................................240
10.4.1 Cloudera Distribution of Hadoop (CDH) ............................................243
10.4.2 MapR..........................................................................................................243
10.4.3 Hortonworks Data Platform (HDP) ......................................................243
10.4.4 Pivotal HD................................................................................................. 243
10.5 Storage and Processing Strategies ........................................................................244
10.5.1 Characteristics of Big Data Storage Methods.......................................244
10.5.2 Characteristics of Big Data Processing Methods................................. 244
10.6 NoSQL Databases ....................................................................................................245
10.6.1 Column-Oriented Stores or Databases .................................................246
10.6.2 Key-Value Stores (K-V Stores) or Databases .........................................246
10.6.3 Document-Oriented Databases ..............................................................247
10.6.4 Graph Stores or Databases ...................................................................... 248
10.6.5 Comparison of NoSQL Databases ......................................................... 248
10.7 Summary .................................................................................................................. 249
11. Big Data NoSQL Databases ..............................................................................................251
11.1 Characteristics of NoSQL Systems .......................................................................254
11.1.1 NoSQL Characteristics Related to Distributed Systems and
Distributed Databases .............................................................................254
11.1.2 NoSQL Characteristics Related to Data Models and Query
Languages ................................................................................................. 256
11.2 Column Databases ..................................................................................................256
11.2.1 Cassandra.................................................................................................. 258
11.2.1.1 Cassandra Features.................................................................258
11.2.2 Google BigTable .......................................................................................260
11.2.3 HBase ......................................................................................................... 260
11.2.3.1 HBase Data Model and Versioning ......................................260
11.2.3.2 HBase CRUD Operations ...................................................... 262
11.2.3.3 HBase Storage and Distributed System Concepts ............. 263
11.3 Key-Value Databases............................................................................................. 263
11.3.1 Riak .......................................................................................................... 264
11.3.1.1 Riak Features ........................................................................ 264
11.3.2 Amazon Dynamo .................................................................................. 265
11.3.2.1 DynamoDB Data Model ...................................................... 266
11.4 Document Databases ............................................................................................ 266
11.4.1 CouchDB ................................................................................................. 268
11.4.2 MongoDB ................................................................................................ 268
11.4.2.1 MongoDB Features ..............................................................269
11.4.2.2 MongoDB Data Model ........................................................270
11.4.2.3 MongoDB CRUD Operations .............................................272
11.4.2.4 MongoDB Distributed Systems Characteristics .............. 272
11.5 Graph Databases ................................................................................................... 274
11.5.1 OrientDB ................................................................................................. 274
11.5.2 Neo4j ........................................................................................................ 275
11.5.2.1 Neo4j Features ......................................................................275
11.5.2.2 Neo4j Data Model ................................................................ 276
11.6 Summary ................................................................................................................ 277
12. Big Data Development with Hadoop..............................................................................279
12.1 Hadoop MapReduce .............................................................................................284
12.1.1 MapReduce Processing .........................................................................284
12.1.1.1 JobTracker..............................................................................284
12.1.1.2 TaskTracker ........................................................................... 286
12.1.2 MapReduce Enhancements and Extensions ...................................... 286
12.1.2.1 Supporting Iterative Processing .........................................286
12.1.2.2 Join Operations .....................................................................288
12.1.2.3 Data Indices ..........................................................................289
12.1.2.4 Column Storage .................................................................... 290
12.2 YARN ......................................................................................................................291
12.3 Hadoop Distributed File System (HDFS)........................................................... 293
12.3.1 Characteristics of HDFS ........................................................................ 293
12.4 HBase ...................................................................................................................... 295
12.4.1 HBase Architecture ............................................................................... 296
12.5 ZooKeeper .............................................................................................................. 297
12.6 Hive .........................................................................................................................297
12.7 Pig ............................................................................................................................ 298
12.8 Kafka .......................................................................................................................299
12.9 Flume ......................................................................................................................300
12.10 Sqoop ......................................................................................................................300
12.11 Impala ..................................................................................................................... 301
12.12 Drill .........................................................................................................................302
12.13 Whirr .......................................................................................................................302
12.14 Summary ................................................................................................................ 302
13. Big Data Analysis Languages, Tools, and Environments ..........................................303
13.1 Spark ....................................................................................................................... 303
13.1.1 Spark Components ................................................................................305
13.1.2 Spark Concepts ......................................................................................306
13.1.2.1 Shared Variables ..................................................................306
13.1.2.2 SparkContext .......................................................................306
13.1.2.3 Resilient Distributed Datasets ...........................................306
13.1.2.4 Transformations...................................................................306
13.1.2.5 Action .................................................................................... 307
13.1.3 Benefits of Spark .................................................................................... 307
13.2 Functional Programming ......................................................................................308
13.3 Clojure....................................................................................................................... 312
13.4 Python ....................................................................................................................... 313
13.4.1 NumPy.................................................................................................... 313
13.4.2 SciPy ........................................................................................................ 313
13.4.3 Pandas ..................................................................................................... 313
13.4.4 Scikit-Learn ............................................................................................ 313
13.4.5 IPython ................................................................................................... 314
13.4.6 Matplotlib ............................................................................................... 314
13.4.7 Stats Models ...........................................................................................314
13.4.8 Beautiful Soup ....................................................................................... 314
13.4.9 NetworkX ............................................................................................... 314
13.4.10 NLTK....................................................................................................... 314
13.4.11 Gensim .................................................................................................... 314
13.4.12 PyPy ........................................................................................................ 315
13.5 Scala .......................................................................................................................... 315
13.5.1 Scala Advantages .................................................................................. 316
13.5.1.1 Interoperability with Java .................................................. 316
13.5.1.2 Parallelism............................................................................ 316
13.5.1.3 Static Typing and Type Inference ......................................316
13.5.1.4 Immutability ........................................................................316
13.5.1.5 Scala and Functional Programs .........................................317
13.5.1.6 Null Pointer Uncertainty.................................................... 317
13.5.2 Scala Benefits ......................................................................................... 318
13.5.2.1 Increased Productivity .......................................................318
13.5.2.2 Natural Evolution from Java .............................................318
13.5.2.3 Better Fit for Asynchronous and Concurrent Code ....... 318
13.6 R ................................................................................................................................. 319
13.6.1 Analytical Features of R ....................................................................... 319
13.6.1.1 General..................................................................................319
13.6.1.2 Business Dashboard and Reporting .................................320
13.6.1.3 Data Mining .........................................................................320
13.6.1.4 Business Analytics .............................................................. 320
13.7 SAS ............................................................................................................................ 321
13.7.1 SAS DATA Step...................................................................................... 321
13.7.2 Base SAS Procedures ............................................................................ 322
13.8 Summary .................................................................................................................. 323
14. Big Data DevOps Management .......................................................................................325
14.1 Big Data Systems Development Management .................................................... 326
14.1.1 Big Data Systems Architecture............................................................ 326
14.1.2 Big Data Systems Lifecycle ..................................................................... 326
14.1.2.1 Data Sourcing ..........................................................................326
14.1.2.2 Data Collection and Registration in a Standard Format .......326
14.1.2.3 Data Filter, Enrich, and Classification..................................327
14.1.2.4 Data Analytics, Modeling, and Prediction ..........................327
14.1.2.5 Data Delivery and Visualization ..........................................328
14.1.2.6 Data Supply to Consumer Analytics Applications ............ 328
14.2 Big Data Systems Operations Management ........................................................ 328
14.2.1 Core Portfolio of Functionalities ............................................................ 328
14.2.1.1 Metrics for Interfacing to Cloud Service Providers ........... 330
14.2.2 Characteristics of Big Data and Cloud Operations ............................. 332
14.2.3 Core Services ............................................................................................ 332
14.2.3.1 Discovery and Replication ....................................................332
14.2.3.2 Load Balancing........................................................................333
14.2.3.3 Resource Management ...........................................................333
14.2.3.4 Data Governance .................................................................... 333
14.2.4 Management Services .............................................................................334
14.2.4.1 Deployment and Configuration ...........................................334
14.2.4.2 Monitoring and Reporting ....................................................334
14.2.4.3 Service-Level Agreements (SLAs) Management ................334
14.2.4.4 Metering and Billing ..............................................................335
14.2.4.5 Authorization and Authentication .......................................335
14.2.4.6 Fault Tolerance ........................................................................ 335
14.2.5 Governance Services ............................................................................... 336
14.2.5.1 Governance ..............................................................................336
14.2.5.2 Security.....................................................................................337
14.2.5.3 Privacy ......................................................................................338
14.2.5.4 Trust ..........................................................................................339
14.2.5.5 Security Risks ..........................................................................340
14.2.6 Cloud Governance, Risk, and Compliance .......................................... 341
14.2.6.1 Cloud Security Solutions .......................................................344
14.3 Migrating to Big Data Technologies .....................................................................346
14.3.1 Lambda Architecture ..............................................................................348
14.3.1.1 Batch Processing .....................................................................348
14.3.1.2 Real Time Analytics ............................................................... 349
14.4 Summary .................................................................................................................. 349
Section IV
Big Data Computing Applications
15. Web Applications................................................................................................................353
15.1 Web-Based Applications ........................................................................................ 353
15.2 Reference Architecture ...........................................................................................354
15.2.1 User Interaction Architecture ................................................................ 355
15.2.2 Service-Based Architecture .................................................................... 355
15.2.3 Business Object Architecture ................................................................. 356
15.3 Realization of the Reference Architecture in J2EE ............................................. 356
15.3.1 JavaServer Pages and Java Servlets as the User Interaction
Components .............................................................................................. 356
15.3.2 Session Bean EJBs as Service-Based Components ...............................356
15.3.3 Entity Bean EJBs as the Business Object Components .......................357
15.3.4 Distributed Java Components ................................................................357
15.3.5 J2EE Access to the EIS (Enterprise Information Systems) Tier .......... 357
15.4 Model–View–Controller Architecture ................................................................. 357
15.5 Evolution of the Web ............................................................................................... 359
15.5.1 Web 1.0.......................................................................................................359
15.5.2 Web 2.0....................................................................................................... 359
15.5.2.1 Weblogs or Blogs.....................................................................359
15.5.2.2 Wikis .........................................................................................360
15.5.2.3 RSS Technologies ....................................................................360
15.5.2.4 Social Tagging .........................................................................361
15.5.2.5 Mashups: Integrating Information .......................................361
15.5.2.6 User Contributed Content ..................................................... 361
15.5.3 Web 3.0....................................................................................................... 362
15.5.4 Mobile Web ...............................................................................................363
15.5.5 The Semantic Web .................................................................................... 363
15.5.6 Rich Internet Applications ......................................................................364
15.6 Web Applications ....................................................................................................364
15.6.1 Web Applications Dimensions............................................................... 365
15.6.1.1 Presentation .............................................................................365
15.6.1.2 Dialogue ...................................................................................366
15.6.1.3 Navigation ...............................................................................366
15.6.1.4 Process ......................................................................................366
15.6.1.5 Data ........................................................................................... 367
15.7 Search Analysis ....................................................................................................... 367
15.7.1 SLA Process .............................................................................................. 368
15.8 Web Analysis ........................................................................................................... 371
15.8.1 Veracity of Log Files Data ....................................................................... 374
15.8.1.1 Unique Visitors .......................................................................374
15.8.1.2 Visitor Count ...........................................................................374
15.8.1.3 Visit Duration .......................................................................... 375
15.8.2 Web Analysis Tools.................................................................................. 375
15.9 Summary .................................................................................................................. 376
16. Social Network Applications ...........................................................................................377
16.1 Networks .................................................................................................................. 378
16.1.1 Concept of Networks...............................................................................378
16.1.2 Principles of Networks ............................................................................ 379
16.1.2.1 Metcalfe’s Law ........................................................................379
16.1.2.2 Power Law ...............................................................................379
16.1.2.3 Small Worlds Networks ......................................................... 379
16.2 Computer Networks ............................................................................................... 380
16.2.1 Internet ......................................................................................................381
16.2.2 World Wide Web (WWW) ...................................................................... 381
16.3 Social Networks....................................................................................................... 382
16.3.1 Popular Social Networks ........................................................................ 386
16.3.1.1 LinkedIn................................................................................... 386
16.3.1.2 Facebook................................................................................... 386
16.3.1.3 Twitter ................................................................................... 387
16.3.1.4 Google+ ................................................................................ 388
16.3.1.5 Other Social Networks........................................................... 389
16.4 Social Networks Analysis (SNA) .......................................................................... 389
16.5 Text Analysis ............................................................................................................ 391
16.5.1 Defining Text Analysis ............................................................................ 392
16.5.1.1 Document Collection .......................................................... 392
16.5.1.2 Document ............................................................................. 393
16.5.1.3 Document Features ............................................................. 393
16.5.1.4 Domain Knowledge ............................................................ 395
16.5.1.5 Search for Patterns and Trends .......................................... 396
16.5.1.6 Results Presentation ............................................................... 396
16.6 Sentiment Analysis ................................................................................................. 397
16.6.1 Sentiment Analysis and Natural Language Processing (NLP) ......... 398
16.6.2 Applications ..............................................................................................400
16.7 Summary ..................................................................................................................400
17. Mobile Applications...........................................................................................................401
17.1 Mobile Computing Applications .......................................................................... 401
17.1.1 Generations of Communication Systems ............................................. 402
17.1.1.1 1st Generation: Analog ....................................................... 402
17.1.1.2 2nd Generation: CDMA, TDMA, and GSM ..................... 402
17.1.1.3 2.5 Generation: GPRS, EDGE, and CDMA 2000 .............. 405
17.1.1.4 3rd Generation: wCDMA, UMTS, and iMode ................. 406
17.1.1.5 4th Generation ......................................................................... 406
17.1.2 Mobile Operating Systems ..................................................................... 406
17.1.2.1 Symbian ................................................................................ 406
17.1.2.2 BlackBerry OS ...................................................................... 407
17.1.2.3 Google Android ................................................................... 407
17.1.2.4 Apple iOS ............................................................................. 408
17.1.2.5 Windows Phone ......................................................................408
17.2 Mobile Web Services ...............................................................................................408
17.2.1 Mobile Field Cloud Services ................................................................... 412
17.3 Context-Aware Mobile Applications .................................................................... 414
17.3.1 Ontology-Based Context Model ............................................................. 415
17.3.2 Context Support for User Interaction .................................................... 415
17.4 Mobile Web 2.0 ........................................................................................................ 416
17.5 Mobile Analytics ..................................................................................................... 418
17.5.1 Mobile Site Analytics ...............................................................................418
17.5.2 Mobile Clustering Analysis ....................................................................418
17.5.3 Mobile Text Analysis ...............................................................................419
17.5.4 Mobile Classification Analysis ...............................................................420
17.5.5 Mobile Streaming Analysis .................................................................... 421
17.6 Summary .................................................................................................................. 421
18. Location-Based Systems Applications ............................................................................423
18.1 Location-Based Systems .........................................................................................423
18.1.1 Sources of Location Data ........................................................................ 424
18.1.1.1 Cellular Systems ..................................................................... 424
18.1.1.2 Multireference Point Systems ............................................ 426
18.1.1.3 Tagging..................................................................................... 427
18.1.2 Mobility Data ............................................................................................ 429
18.1.2.1 Mobility Data Mining ............................................................430
18.2 Location-Based Services ......................................................................................... 432
18.2.1 LBS Characteristics ..................................................................................435
18.2.2 LBS Positioning Technologies ................................................................436
18.2.3 LBS System Architecture .........................................................................437
18.2.4 LBS System Components ........................................................................439
18.2.5 LBS System Challenges ........................................................................... 439
18.3 Location-Based Social Networks .......................................................................... 441
18.4 Summary ..................................................................................................................443
19. Context-Aware Applications.............................................................................................445
19.1 Context-Aware Applications..................................................................................446
19.1.1 Types of Context-Awareness ..................................................................448
19.1.2 Types of Contexts .....................................................................................449
19.1.3 Context Acquisition .................................................................................450
19.1.4 Context Models ........................................................................................450
19.1.5 Generic Context-Aware Application Architecture ..............................452
19.1.6 Illustrative Context-Aware Applications .............................................. 452
19.2 Decision Pattern as Context ................................................................................... 453
19.2.1 Concept of Patterns ..................................................................................454
19.2.1.1 Patterns in Information Technology (IT) Solutions ........... 455
19.2.2 Domain-Specific Decision Patterns ....................................................... 455
19.2.2.1 Financial Decision Patterns ................................................ 455
19.2.2.2 CRM Decision Patterns .......................................................... 457
19.3 Context-Aware Mobile Services ............................................................................ 460
19.3.1 Limitations of Existing Infrastructure.................................................. 460
19.3.1.1 Limited Capability of Mobile Devices .............................. 460
19.3.1.2 Limited Sensor Capability .................................................. 461
19.3.1.3 Restrictive Network Bandwidth ........................................ 461
19.3.1.4 Trust and Security Requirements ...................................... 461
19.3.1.5 Rapidly Changing Context .................................................... 461
19.3.2 Types of Sensors .......................................................................................462
19.3.3 Context-Aware Mobile Applications ..................................................... 462
19.3.3.1 Context-Awareness Management Framework ...................464
19.4 Summary .................................................................................................................. 467
Epilogue: Internet of Things ................................................................................................... 469
References ................................................................................................................................... 473
Index ............................................................................................................................................. 475
List of Figures
Figure 1.1 Increase in the number of transistors on an Intel chip ......................................... 2
Figure 1.2 Hardware trends in the 1990s and the first decade ............................................. 3
Figure 1.3 Von Neumann computer architecture ................................................................... 9
Figure 2.1 A hierarchical organization ................................................................................... 30
Figure 2.2 The three-schema architecture ............................................................................ 37
Figure 2.3 Evolution of database technology ......................................................................... 47
Figure 3.1 Characteristics of enterprise intelligence in terms of the scope of the decisions ................................................................... 61
Figure 4.1 Cube for sales data having dimensions store, time, and product and a measure amount ...................................................................................................... 84
Figure 4.2 OLAP Operations. (a) Original Cube (b) Roll-up to the Country level (c) Drill down to the month level (d) Pivot (e) Slice on Store.City = ‘Mumbai’ (f) Dice on Store.Country = ‘US’ and Time.Quarter = ‘Q1’ or ‘Q2’ ............................................................................................. 85
Figure 4.3 Example of a star schema ....................................................................................... 87
Figure 4.4 Example of a snowflake schema ........................................................................... 88
Figure 4.5 Example of a constellation schema ....................................................................... 89
Figure 4.6 Lattice of cuboids derived from a four-dimensional cube................................90
Figure 4.7
Reference data warehouse architecture ............................................................... 92
Figure 5.1
Schematic of CRISP-DM methodology .............................................................. 110
Figure 5.2 Architecture of a machine-learning system...................................................... 115
Figure 5.3 Architecture of a fuzzy inference system .......................................................... 120
Figure 5.4 Architecture of a rough sets system ................................................................... 122
Figure 6.1 Parallel computing architectures: (a) Flynn’s taxonomy, (b) shared memory system, and (c) distributed system ...................................................... 127
Figure 7.1 Web Services usage model ................................................................................... 164
Figure 7.2 ESB reducing connection complexity (a) Direct point-to-point connections (n*n) and (b) Connecting through the bus (n) ............................ 168
Figure 7.3 Enterprise service bus (ESB) linking disparate systems and computing environments ......................................................................................................... 169
Figure 8.2 Portfolio of services for the three cloud delivery models ............................... 184
Figure 9.1 4V characteristics of big data ................................................................................ 208
Figure 9.2 Use cases for big data computing ........................................................................ 209
Figure 9.3 Parallel architectures .......................................................................................... 215
Figure 9.4 The solution architecture for the Aadhaar project ......................................... 225
Figure 10.1 Execution phases in a generic MapReduce application ................................. 230
Figure 10.2 Comparing the architecture of Hadoop 1 and Hadoop 2 ............................. 240
Figure 12.1 Hadoop ecosystem .............................................................................................. 282
Figure 12.2 Hadoop MapReduce architecture..................................................................... 285
Figure 12.3 YARN architecture ............................................................................................. 292
Figure 12.4 HDFS architecture .............................................................................................. 295
Figure 14.1 Big data systems architecture ............................................................................ 327
Figure 14.2 Big data systems lifecycle (BDSL) ..................................................................... 328
Figure 14.3 Lambda architecture .......................................................................................... 348
Figure 15.1 Enterprise application in J2EE ........................................................................... 355
Figure 15.2 MVC and enterprise application architecture ................................................ 358
Figure 18.1 Principle of lateration .......................................................................................... 427
Figure 18.2 Trajectory mapping ............................................................................................. 432
Figure 19.1 Context definition ................................................................................................ 465
Figure 19.2 Conceptual metamodel of a context ................................................................. 466
List of Tables
Table 2.1 Characteristics of the Four Database Models ..................................................... 36
Table 2.2 Levels of Data Abstraction .................................................................................... 38
Table 3.1 Intelligence Maturity Model (IMM) ..................................................................... 55
Table 3.2 Analysis Techniques versus Tasks ....................................................................... 74
Table 4.1 Comparison between OLTP and OLAP Systems ............................................... 82
Table 4.2 Comparison between Operational Databases and Data Warehouses ............ 83
Table 4.3 The DSS 2.0 Spectrum ............................................................................................ 96
Table 5.1 Data Mining Application Areas .......................................................................... 105
Table 5.2 Characteristics of Soft Computing Compared with Traditional Hard Computing ............................................................................................................. 118
Table 8.1 Key Attributes of Cloud Computing .................................................................. 178
Table 8.2 Key Attributes of Cloud Services ....................................................................... 179
Table 8.3 Comparison of Cloud Delivery Models ............................................................. 185
Table 8.4 Comparison of Cloud Benefits for Small and Medium Enterprises (SMEs) and Large Enterprises ............................................................................. 189
Table 9.1 Scale of Data ........................................................................................................... 210
Table 9.2 Value of Big Data across Industries .................................................................... 211
Table 9.3 Industry Use Cases for Big Data ......................................................................... 212
Table 10.1 MapReduce Cloud Implementations .................................................................. 233
Table 10.2 Comparison of MapReduce Implementations .................................................. 233
Table 12.1 Hadoop Ecosystem Classification by Timescale and General Purpose of Usage .................................................................................................. 283
Table 15.1 Comparison between Web 1.0 and Web 2.0 ...................................................... 362
Table 17.1 Evolution of Wireless Networks .......................................................................... 404
Table 17.2 Comparison of Mobile Operating Systems ....................................................... 407
Table 18.1 Location-Based Services (LBS) Classification ................................................... 424
Table 18.2 LBS Quality of Service (QOS) Requirements ....................................................434
Table 18.3 Location Enablement Technologies .................................................................... 437
Table 18.4 Accuracy and TTFF for Several Location Techniques ...................................... 438