
Python data analytics and visualization



Python: Data Analytics and Visualization


Table of Contents
Python: Data Analytics and Visualization
Credits
Preface
What this learning path covers
What you need for this learning path
Who this learning path is for
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Module 1
1. Introducing Data Analysis and Libraries
Data analysis and processing
An overview of the libraries in data analysis
Python libraries in data analysis
NumPy
Pandas
Matplotlib
PyMongo
The scikit-learn library
Summary
2. NumPy Arrays and Vectorized Computation
NumPy arrays


Data types
Array creation
Indexing and slicing
Fancy indexing
Numerical operations on arrays
Array functions
Data processing using arrays


Loading and saving data
Saving an array
Loading an array
Linear algebra with NumPy
NumPy random numbers
Summary
3. Data Analysis with Pandas
An overview of the Pandas package
The Pandas data structure
Series
The DataFrame
The essential basic functionality
Reindexing and altering labels
Head and tail
Binary operations
Functional statistics
Function application
Sorting
Indexing and selecting data
Computational tools
Working with missing data

Advanced uses of Pandas for data analysis
Hierarchical indexing
The Panel data
Summary
4. Data Visualization
The matplotlib API primer
Line properties
Figures and subplots
Exploring plot types
Scatter plots
Bar plots
Contour plots
Histogram plots
Legends and annotations
Plotting functions with Pandas


Additional Python data visualization tools
Bokeh
MayaVi
Summary
5. Time Series
Time series primer
Working with date and time objects
Resampling time series
Downsampling time series data
Upsampling time series data
Time zone handling
Timedeltas
Time series plotting

Summary
6. Interacting with Databases
Interacting with data in text format
Reading data from text format
Writing data to text format
Interacting with data in binary format
HDF5
Interacting with data in MongoDB
Interacting with data in Redis
The simple value
List
Set
Ordered set
Summary
7. Data Analysis Application Examples
Data munging
Cleaning data
Filtering
Merging data
Reshaping data
Data aggregation
Grouping data
Summary


8. Machine Learning Models with scikit-learn
An overview of machine learning models
The scikit-learn modules for different models
Data representation in scikit-learn
Supervised learning – classification and regression

Unsupervised learning – clustering and dimensionality reduction
Measuring prediction performance
Summary
2. Module 2
1. Getting Started with Predictive Modelling
Introducing predictive modelling
Scope of predictive modelling
Ensemble of statistical algorithms
Statistical tools
Historical data
Mathematical function
Business context
Knowledge matrix for predictive modelling
Task matrix for predictive modelling
Applications and examples of predictive modelling
LinkedIn's "People also viewed" feature
What does it do?
How is it done?
Correct targeting of online ads
How is it done?
Santa Cruz predictive policing
How is it done?
Determining the activity of a smartphone user using accelerometer data
How is it done?
Sport and fantasy leagues
How was it done?
Python and its packages – download and installation
Anaconda
Standalone Python

Installing a Python package


Installing pip
Installing Python packages with pip
Python and its packages for predictive modelling
IDEs for Python
Summary
2. Data Cleaning
Reading the data – variations and examples
Data frames
Delimiters
Various methods of importing data in Python
Case 1 – reading a dataset using the read_csv method
The read_csv method
Use cases of the read_csv method
Passing the directory address and filename as variables
Reading a .txt dataset with a comma delimiter
Specifying the column names of a dataset from a list
Case 2 – reading a dataset using the open method of Python
Reading a dataset line by line
Changing the delimiter of a dataset
Case 3 – reading data from a URL
Case 4 – miscellaneous cases
Reading from an .xls or .xlsx file
Writing to a CSV or Excel file
Basics – summary, dimensions, and structure
Handling missing values
Checking for missing values
What constitutes missing data?

How missing values are generated and propagated
Treating missing values
Deletion
Imputation
Creating dummy variables
Visualizing a dataset by basic plotting
Scatter plots
Histograms
Boxplots


Summary
3. Data Wrangling
Subsetting a dataset
Selecting columns
Selecting rows
Selecting a combination of rows and columns
Creating new columns
Generating random numbers and their usage
Various methods for generating random numbers
Seeding a random number
Generating random numbers following probability distributions
Probability density function
Cumulative distribution function
Uniform distribution
Normal distribution
Using the Monte-Carlo simulation to find the value of pi
Geometry and mathematics behind the calculation of pi
Generating a dummy data frame
Grouping the data – aggregation, filtering, and transformation

Aggregation
Filtering
Transformation
Miscellaneous operations
Random sampling – splitting a dataset in training and testing datasets
Method 1 – using the Customer Churn Model
Method 2 – using sklearn
Method 3 – using the shuffle function
Concatenating and appending data
Merging/joining datasets
Inner Join
Left Join
Right Join
An example of the Inner Join
An example of the Left Join
An example of the Right Join
Summary of Joins in terms of their length


Summary
4. Statistical Concepts for Predictive Modelling
Random sampling and the central limit theorem
Hypothesis testing
Null versus alternate hypothesis
Z-statistic and t-statistic
Confidence intervals, significance levels, and p-values
Different kinds of hypothesis test
A step-by-step guide to do a hypothesis test
An example of a hypothesis test
Chi-square tests

Correlation
Summary
5. Linear Regression with Python
Understanding the maths behind linear regression
Linear regression using simulated data
Fitting a linear regression model and checking its efficacy
Finding the optimum value of variable coefficients
Making sense of result parameters
p-values
F-statistics
Residual Standard Error
Implementing linear regression with Python
Linear regression using the statsmodels library
Multiple linear regression
Multi-collinearity
Variance Inflation Factor
Model validation
Training and testing data split
Summary of models
Linear regression with scikit-learn
Feature selection with scikit-learn
Handling other issues in linear regression
Handling categorical variables
Transforming a variable to fit non-linear relations
Handling outliers


Other considerations and assumptions for linear regression
Summary
6. Logistic Regression with Python

Linear regression versus logistic regression
Understanding the math behind logistic regression
Contingency tables
Conditional probability
Odds ratio
Moving on to logistic regression from linear regression
Estimation using the Maximum Likelihood Method
Likelihood function:
Log likelihood function:
Building the logistic regression model from scratch
Making sense of logistic regression parameters
Wald test
Likelihood Ratio Test statistic
Chi-square test
Implementing logistic regression with Python
Processing the data
Data exploration
Data visualization
Creating dummy variables for categorical variables
Feature selection
Implementing the model
Model validation and evaluation
Cross validation
Model validation
The ROC curve
Confusion matrix
Summary
7. Clustering with Python
Introduction to clustering – what, why, and how?
What is clustering?

How is clustering used?
Why do we do clustering?
Mathematics behind clustering


Distances between two observations
Euclidean distance
Manhattan distance
Minkowski distance
The distance matrix
Normalizing the distances
Linkage methods
Single linkage
Complete linkage
Average linkage
Centroid linkage
Ward's method
Hierarchical clustering
K-means clustering
Implementing clustering using Python
Importing and exploring the dataset
Normalizing the values in the dataset
Hierarchical clustering using scikit-learn
K-Means clustering using scikit-learn
Interpreting the cluster
Fine-tuning the clustering
The elbow method
Silhouette Coefficient
Summary
8. Trees and Random Forests with Python

Introducing decision trees
A decision tree
Understanding the mathematics behind decision trees
Homogeneity
Entropy
Information gain
ID3 algorithm to create a decision tree
Gini index
Reduction in Variance
Pruning a tree
Handling a continuous numerical variable


Handling a missing value of an attribute
Implementing a decision tree with scikit-learn
Visualizing the tree
Cross-validating and pruning the decision tree
Understanding and implementing regression trees
Regression tree algorithm
Implementing a regression tree using Python
Understanding and implementing random forests
The random forest algorithm
Implementing a random forest using Python
Why do random forests work?
Important parameters for random forests
Summary
9. Best Practices for Predictive Modelling
Best practices for coding
Commenting the codes
Defining functions for substantial individual tasks

Example 1
Example 2
Example 3
Avoid hard-coding of variables as much as possible
Version control
Using standard libraries, methods, and formulas
Best practices for data handling
Best practices for algorithms
Best practices for statistics
Best practices for business contexts
Summary
A. A List of Links
3. Module 3
1. A Conceptual Framework for Data Visualization
Data, information, knowledge, and insight
Data
Information
Knowledge
Data analysis and insight


The transformation of data
Transforming data into information
Data collection
Data preprocessing
Data processing
Organizing data
Getting datasets
Transforming information into knowledge
Transforming knowledge into insight

Data visualization history
Visualization before computers
Minard's Russian campaign (1812)
The Cholera epidemics in London (1831-1855)
Statistical graphics (1850-1915)
Later developments in data visualization
How does visualization help decision-making?
Where does visualization fit in?
Data visualization today
What is a good visualization?
Visualization plots
Bar graphs and pie charts
Bar graphs
Pie charts
Box plots
Scatter plots and bubble charts
Scatter plots
Bubble charts
KDE plots
Summary
2. Data Analysis and Visualization
Why does visualization require planning?
The Ebola example
A sports example
Visually representing the results
Creating interesting stories with data
Why are stories so important?


Reader-driven narratives

Gapminder
The State of the Union address
Mortality rate in the USA
A few other example narratives
Author-driven narratives
Perception and presentation methods
The Gestalt principles of perception
Some best practices for visualization
Comparison and ranking
Correlation
Distribution
Location-specific or geodata
Part-to-whole relationships
Trends over time
Visualization tools in Python
Development tools
Canopy from Enthought
Anaconda from Continuum Analytics
Interactive visualization
Event listeners
Layouts
Circular layout
Radial layout
Balloon layout
Summary
3. Getting Started with the Python IDE
The IDE tools in Python
Python 3.x versus Python 2.7
Types of interactive tools
IPython

Plotly
Types of Python IDE
PyCharm
PyDev
Interactive Editor for Python (IEP)


Canopy from Enthought
Anaconda from Continuum Analytics
An overview of Spyder
An overview of conda
Visualization plots with Anaconda
The surface-3D plot
The square map plot
Interactive visualization packages
Bokeh
VisPy
Summary
4. Numerical Computing and Interactive Plotting
NumPy, SciPy, and MKL functions
NumPy
NumPy universal functions
Shape and reshape manipulation
An example of interpolation
Vectorizing functions
Summary of NumPy linear algebra
SciPy
An example of linear equations
The vectorized numerical derivative
MKL functions

The performance of Python
Scalar selection
Slicing
Slice using flat
Array indexing
Numerical indexing
Logical indexing
Other data structures
Stacks
Tuples
Sets
Queues
Dictionaries


Dictionaries for matrix representation
Sparse matrices
Visualizing sparseness
Dictionaries for memoization
Tries
Visualization using matplotlib
Word clouds
Installing word clouds
Input for word clouds
Web feeds
The Twitter text
Plotting the stock price chart
Obtaining data
The visualization example in sports
Summary

5. Financial and Statistical Models
The deterministic model
Gross returns
The stochastic model
Monte Carlo simulation
What exactly is Monte Carlo simulation?
An inventory problem in Monte Carlo simulation
Monte Carlo simulation in basketball
The volatility plot
Implied volatilities
The portfolio valuation
The simulation model
Geometric Brownian simulation
The diffusion-based simulation
The threshold model
Schelling's Segregation Model
An overview of statistical and machine learning
K-nearest neighbors
Generalized linear models
Bayesian linear regression
Creating animated and interactive plots


Summary
6. Statistical and Machine Learning
Classification methods
Understanding linear regression
Linear regression
Decision tree
An example

The Bayes theorem
The Naïve Bayes classifier
The Naïve Bayes classifier using TextBlob
Installing TextBlob
Downloading corpora
The Naïve Bayes classifier using TextBlob
Viewing positive sentiments using word clouds
k-nearest neighbors
Logistic regression
Support vector machines
Principal component analysis
Installing scikit-learn
k-means clustering
Summary
7. Bioinformatics, Genetics, and Network Models
Directed graphs and multigraphs
Storing graph data
Displaying graphs
igraph
NetworkX
Graph-tool
PageRank
The clustering coefficient of graphs
Analysis of social networks
The planar graph test
The directed acyclic graph test
Maximum flow and minimum cut
A genetic programming example
Stochastic block models



Summary
8. Advanced Visualization
Computer simulation
Python's random package
SciPy's random functions
Simulation examples
Signal processing
Animation
Visualization methods using HTML5
How is Julia different from Python?
D3.js for visualization
Dashboards
Summary
B. Go Forth and Explore Visualization
An overview of conda
Packages installed with Anaconda
Packages websites
About matplotlib
Bibliography
Index


Python: Data Analytics and Visualization
Copyright © 2017 Packt Publishing

All rights reserved. No part of this course may be reproduced, stored in a
retrieval system, or transmitted in any form or by any means, without the
prior written permission of the publisher, except in the case of brief
quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this course to ensure the
accuracy of the information presented. However, the information contained in
this course is sold without warranty, either express or implied. Neither the
authors, nor Packt Publishing, and its dealers and distributors will be held
liable for any damages caused or alleged to be caused directly or indirectly by
this course.
Packt Publishing has endeavored to provide trademark information about all
of the companies and products mentioned in this course by the appropriate
use of capitals. However, Packt Publishing cannot guarantee the accuracy of
this information.
Published on: March 2017
Production reference: 1220317
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN: 978-1-78829-009-8


www.packtpub.com


Credits
Authors
Phuong Vo.T.H
Martin Czygan

Ashish Kumar
Kirthi Raman
Reviewers
Dong Chao
Hai Minh Nguyen
Kenneth Emeka Odoh
Matt Hollingsworth
Julian Quick
Hang (Harvey) Yu
Content Development Editor
Deepti Thore
Graphics
Tania Dutta
Production Coordinator
Aparna Bhagat


Preface
The world generates data at an increasing pace. Consumers, sensors, and
scientific experiments emit data points every day. In finance, business,
administration, and the natural or social sciences, working with data can make
up a significant part of the job. Being able to work efficiently with small or
large datasets has become a valuable skill. Python started as a
general-purpose language. Around ten years ago, in 2006, the first version of
NumPy was released. It made Python a first-class language for numerical
computing and laid the foundation for what we today call the PyData
ecosystem: a growing set of high-performance libraries for the sciences,
finance, business, or anywhere else you want to work efficiently with
datasets. Python is not only about data analysis. The list of
industrial-strength libraries for many general computing tasks is long, which
makes working with data in Python even more compelling.
Social media and the Internet of Things have resulted in an avalanche of data.
The data is powerful, but not in its raw form; it needs to be processed and
modeled, and Python is one of the most robust tools available for doing
so. It has an array of packages for predictive modeling and a suite of IDEs to
choose from. Learning to predict who would win, lose, buy, lie, or die with
Python is an indispensable skill set to have in this data age. This course is
your guide to getting started with predictive analytics using Python as the tool.
Data visualization is intended to present information clearly and help the
viewer understand it qualitatively. The well-known expression that a
picture is worth a thousand words may be rephrased as “a picture tells a story
as well as a large collection of words”. Visualization is, therefore, a very
valuable tool that helps the viewer understand a concept quickly. We are
currently faced with a plethora of data containing many insights that hold the
key to success in the modern day. It is important to find the data, clean it, and
use the right tool to visualize it. This course explains several different ways
to visualize data using Python packages, along with very useful examples in
many different areas such as numerical computing, financial models,
statistical and machine learning, and genetics and networks.


What this learning path covers
Module 1, Getting Started with Python Data Analysis, starts with an
introduction to data analysis and processing and an overview of the relevant
libraries and their uses. You will then dive right into the core of the PyData
ecosystem with the NumPy package for high-performance computing. We will
also deal with Pandas, a prominent and popular data analysis library for
Python, and understand data through graphical representation. Moving
further, you will see how to work with time-oriented data in Pandas. You will
then learn to interact with three main categories of data: text formats, binary
formats, and databases, and work on some application examples. In the end,
you will see how different scikit-learn modules work.
Module 2, Learning Predictive Analytics with Python, talks about the aspects,
scope, and applications of predictive modelling. Data cleaning takes about
80% of the modelling time, so we will look at its importance and
methods. You will see how to subset, aggregate, sample, merge, append, and
concatenate a dataset. Further, you will get acquainted with the basic statistics
needed to make sense of the model parameters resulting from the predictive
models. You will also understand the mathematics behind linear and logistic
regression along with clustering. You will also deal with decision trees and
related classification algorithms. In the end, you will learn about the
best practices adopted in the field of predictive modelling to get the optimum
results.
Module 3, Mastering Python Data Visualization, expounds that data
visualization should actually be referred to as “the visualization of
information for knowledge inference”. You will see how to use Anaconda
from Continuum Analytics and learn interactive plotting methods. You will
deal with stock quotes, regression analysis, the Monte Carlo algorithm, and
simulation methods with examples. Further you will get acquainted with
statistical methods such as linear and nonlinear regression and clustering and
classification methods using numpy, scipy, matplotlib, and scikit-learn. You
will use specific libraries such as graph-tool, NetworkX, matplotlib, scipy,
and numpy. In the end we will see simulation methods and examples of

