Libraries for Visualisation

There are many Python visualisation libraries for plotting scatter plots, bar charts, heatmaps and histograms. Some are more sophisticated and interactive than others, but the choice really comes down to what's appropriate for the project and your visualisation requirements: a static chart may suffice, or something more interactive may be required.

In this notebook I illustrate just a few examples which I use quite often in my EDA projects. There are of course many other options, like Bokeh, Pygal and Altair, to name just a few.

Matplotlib is the daddy of data visualisation libraries. It is an excellent 2D and 3D graphics library for generating scientific figures programmatically. Then there's pandas, which is extremely popular and powerful and plots directly from dataframes; its built-in plotting is built on matplotlib. Seaborn is a higher-level interface on top of matplotlib that usually needs less code. Plotly (with cufflinks) allows you to create interactive plots that you can deploy to dashboards or websites. Note that cufflinks connects plotly with pandas dataframes.

The code presented here covers the bare bones of popular plots, but you can of course dress them up a little more and make them more presentable by changing the dimensions, adding titles and axis labels, changing colours, etc.
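As a rough illustration of that kind of dressing up, the sketch below takes a bare line plot and sets the figure size, title, axis labels, colour and legend. The data and the names `sales`, `month` and `units` are made up for the example and do not come from the datasets used later.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; in a notebook, drop this and use %matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data, purely for illustration
sales = pd.DataFrame({"month": [1, 2, 3, 4, 5, 6],
                      "units": [12, 15, 11, 18, 21, 19]})

fig, ax = plt.subplots(figsize=(10, 4))               # change the dimensions
ax.plot(sales["month"], sales["units"],
        color="darkred", marker="o", label="units")   # change the colour
ax.set_title("Units sold per month")                   # add a header
ax.set_xlabel("Month")                                 # label the axes
ax.set_ylabel("Units")
ax.legend()
fig.tight_layout()
```

The same `set_title`, `set_xlabel` and `set_ylabel` calls work on the axes object returned by the pandas and seaborn plotting functions used below.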

To take your Python data visualisation skills to the next level and create fully customised, interactive dashboards that run in the browser, the open source libraries Plotly and Dash are your next port of call. Dash is a Python framework for building web applications. It's built on top of Flask, Plotly.js and React.js, and it enables you to build dashboards using pure Python.

Check out my Manufacturing Quality Dashboard, built entirely with Python, Plotly and Dash, right here: https://ecapp-111.herokuapp.com/. The GitHub repository can be found here: https://github.com/Eamoned/Interactive-Dashboard

Imports

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objs as go
import plotly.offline as py
%matplotlib inline
from sklearn.datasets import load_breast_cancer

# Use cufflinks (with .plotly) to make plots interactive

import cufflinks as cf
cf.go_offline()  # operate cufflinks off-line

Visualisations will assist in the analysis of datasets to identify:

Missing values
Numerical variables
Distribution of the numerical variables
Outliers
Categorical variables if any
Cardinality of the categorical variables - not applicable here
Potential relationship between the variables and the target
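
Before plotting anything, a few of the items above can be checked numerically. The sketch below uses a small made-up DataFrame (the column names `age`, `fare`, `sex` and `embarked` are illustrative, not read from the files loaded below) to count missing values, split the columns into numerical and categorical, and report the cardinality of the categorical ones.

```python
import numpy as np
import pandas as pd

# Made-up data with a few gaps, purely for illustration
toy = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05],
    "sex": ["male", "female", "female", "female", "male"],
    "embarked": ["S", "C", "S", None, "Q"],
})

# Missing values per column
missing = toy.isnull().sum()

# Numerical vs categorical variables
num_vars = toy.select_dtypes(include="number").columns.tolist()
cat_vars = [col for col in toy.columns if col not in num_vars]

# Cardinality: number of distinct labels in each categorical variable
cardinality = toy[cat_vars].nunique()

print(missing)
print("numerical:", num_vars)
print("categorical:", cat_vars)
print(cardinality)
```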

Datasets

customers = pd.read_csv('EDA_code_files/Ecommerce Customers')
startups = pd.read_csv('EDA_code_files/50_Startups.csv')
df = pd.read_csv('EDA_code_files/USA_Housing.csv')
train = pd.read_csv('EDA_code_files/titanic_train.csv')
df2 = pd.read_csv("EDA_code_files/Classified Data",index_col=0)
df3 = pd.read_csv("EDA_code_files/loan_data.csv")
df4 = pd.read_csv('EDA_code_files/kyphosis.csv')
df5 = pd.read_csv('EDA_code_files/College_Data')
iris = sns.load_dataset('iris')
cancer = load_breast_cancer()
df_feat = pd.DataFrame(cancer['data'],columns=cancer['feature_names'])

# create fake data:

df_fake = pd.DataFrame(np.random.randn(100,4),columns='A B C D'.split())

# this is the line plot that pandas makes automatically using matplotlib 

df_fake.plot(alpha=0.6, figsize=(10,4))
plt.show()

# area plot

df2[['WTT', 'PTI', 'EQW']].plot.area(alpha=0.4, figsize=(10,4))

<AxesSubplot:>

# Plotly is a library that allows you to create interactive plots that you can use in dashboards or websites
# Cufflinks connects plotly with pandas dataframes

from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

# Now the plot will be converted to a plotly interactive image

layout = cf.Layout(height=450,width=700)
df_fake.iplot(layout=layout)

For simple numeric distributions

# Age distribution of people on the titanic using pandas

train['Age'].hist(bins=30,color='darkred',alpha=0.7, figsize=(10,4))

<AxesSubplot:>

plt.figure(figsize=(10,4))
df2['WTT'].hist(bins=30,alpha=0.7, label='WTT')
df2['PTI'].hist(bins=30,alpha=0.7, label='PTI')
df2['EQW'].hist(bins=30,alpha=0.7, label='EQW')
plt.title('Data Distributions')
plt.legend()
plt.show()

# We can make it interactive with plotly

layout = cf.Layout(height=400,width=700)
df2[['WTT', 'PTI', 'EQW']].iplot(kind='hist', bins=40, layout=layout)

plt.figure(figsize=(10,6)) # change the size of the plot
sns.histplot(df['Price'], kde=True, stat='density')  # histplot replaces the deprecated distplot

<AxesSubplot:xlabel='Price'>

layout = cf.Layout(height=400,width=700)
train['Age'].iplot(kind='hist',bins=30,color='green', layout=layout)

# Plotting more than one distribution using pandas

train.hist(bins=30, figsize=(10,10))
plt.show()

Visualisation of Numerical Distributions & dependent variable category relationships

We can use pandas (via matplotlib) or Seaborn.

# Plot the distribution of a numerical variable, separated by a categorical variable
# Matplotlib

num_vars = [var for var in iris.columns if iris[var].dtypes != 'object']


def analyse_numeric(df, var):
    df = df.copy()
    plt.figure(figsize=(10,4))
    df[df['species']=='setosa'][var].hist(alpha=0.5,color='blue',bins=30,label='species=setosa')
    df[df['species']=='virginica'][var].hist(alpha=0.5,color='red',bins=30,label='species=virginica')
    df[df['species']=='versicolor'][var].hist(alpha=0.5,color='yellow',bins=30,label='species=versicolor')
    plt.title('Species distributions')
    plt.legend()
    plt.xlabel(var)
    plt.show()
    
for var in num_vars:
    analyse_numeric(iris, var)

# Seaborn

num_vars = [var for var in iris.columns if iris[var].dtypes != 'object']


def analyse_numeric(df, var):
    df = df.copy()
    
    sns.set_style('darkgrid')
    g = sns.FacetGrid(df,hue="species",palette='coolwarm',height=5,aspect=2)
    g = g.map(plt.hist,var,bins=30,alpha=0.7)
    plt.title('Species distributions')
    plt.legend()
    plt.xlabel(var)
    plt.show()
    
for var in num_vars:
    analyse_numeric(iris, var)

# Seaborn's pair plot will do something similar and can include scatter plots 

sns.pairplot(iris, hue='species', palette='Dark2')

<seaborn.axisgrid.PairGrid at 0x17c27b9ce48>

# Or you could use a histogram on the diagonal rather than a density estimate

sns.pairplot(df4,hue='Kyphosis',palette='Set1', diag_kind = 'hist')

<seaborn.axisgrid.PairGrid at 0x17c274b8780>

# Seaborn flexibility allows mapping of different types of charts using map
# Map to upper,lower, and diagonal

g = sns.PairGrid(iris)
g.map_diag(sns.histplot, kde=True)  # histplot replaces the deprecated distplot
g.map_upper(plt.scatter)
g.map_lower(sns.kdeplot)

<seaborn.axisgrid.PairGrid at 0x17c27c6b7b8>

Visualizing different categorical distributions

# Visualise different categories against the target using a Boxplot

startups.boxplot(column='Profit',by='State',rot = 0,figsize=(10,6))

<AxesSubplot:title={'center':'Profit'}, xlabel='State'>

df2[['WTT', 'PTI', 'EQW', 'SBI']].plot.box(figsize=(10,6))

<AxesSubplot:>

# A boxplot using seaborn

plt.figure(figsize=(10, 6))
sns.boxplot(x="State", y="Profit", data=startups, palette='rainbow')
plt.title('Profit Distribution by State ')
plt.xlabel('States', fontsize=15)

Text(0.5, 0, 'States')
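
A boxplot draws as outliers the points beyond the whiskers, which by default in matplotlib and seaborn reach at most 1.5 times the interquartile range past the quartiles. The sketch below applies that same rule numerically to a made-up series, which can help when you want the actual outlying values rather than just the picture.

```python
import pandas as pd

# Made-up profits, with one unusually low and one unusually high value
profit = pd.Series([96, 101, 99, 105, 110, 97, 103, 100, 14, 192], name="Profit")

q1, q3 = profit.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are what the boxplot draws as individual points
outliers = profit[(profit < lower) | (profit > upper)]
print(outliers)
```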

# and we can convert it into a plotly interactive image with Cufflinks

layout = cf.Layout(height=400,width=700)
df2[['WTT', 'PTI', 'EQW', 'SBI']].iplot(kind='box', layout=layout)

# Or use Plotly Express

import plotly.express as px
px.box(df2, y='WTT', width=700, height=400)

# Density Plot

df2[['WTT', 'PTI', 'EQW', 'SBI']].plot.density(figsize=(10,6))

<AxesSubplot:ylabel='Density'>

# Mean Profit per State using Seaborn barplot (Profit is the dependent variable)

plt.figure(figsize=(10, 6))
sns.barplot(x='State',y='Profit',data=startups) # y will be numerical
plt.title('Profit Distribution ')
plt.ylabel('Profit', fontsize=12)
plt.xlabel('State', fontsize=12)

Text(0.5, 0, 'State')

# change the estimator to another aggregate function, e.g. np.std (the default is the mean)

plt.figure(figsize=(10, 6))
sns.barplot(x='State',y='Profit',data=startups, estimator=np.std)

<AxesSubplot:xlabel='State', ylabel='Profit'>
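
The bars in a barplot are a per-category aggregate, so the underlying numbers can also be read off with a pandas `groupby`. The sketch below uses a made-up frame with `State` and `Profit` columns (the values are invented). One detail worth knowing: `np.std` divides by n by default, whereas the pandas `std` divides by n - 1, so it is worth checking which convention a bar height reflects before comparing it with a table.

```python
import numpy as np
import pandas as pd

# Made-up data, purely for illustration
toy = pd.DataFrame({
    "State": ["New York", "New York", "Florida", "Florida", "Florida"],
    "Profit": [100.0, 120.0, 90.0, 110.0, 130.0],
})

grouped = toy.groupby("State")["Profit"]

mean_profit = grouped.mean()            # per-state mean
std_population = grouped.std(ddof=0)    # same convention as np.std's default
std_sample = grouped.std()              # pandas default, ddof=1

print(pd.DataFrame({"mean": mean_profit,
                    "std (ddof=0)": std_population,
                    "std (ddof=1)": std_sample}))
```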

# Frequency Distribution using Seaborn countplot

plt.figure(figsize=(10, 6))
sns.countplot(x='State',data=startups)
plt.title('Frequency Distribution of States')
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('State', fontsize=12)

Text(0.5, 0, 'State')

# Categorical
# Survival v Sex distribution

plt.figure(figsize=(10, 6))
sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Sex',data=train,palette='RdBu_r')

<AxesSubplot:xlabel='Survived', ylabel='count'>

# How about a Pie Chart using plotly

x = train['Survived'].value_counts()
colors = ['#800080', '#0000A0']
trace = go.Pie(labels=x.index, values=x, textinfo='value', marker=dict(colors=colors, line=dict(color='#001000',width=2)))
layout = go.Layout(title='Survived', width=500, height=600)
fig = go.Figure(data=[trace], layout=layout)
py.iplot(fig, filename='pie_chart_subplots')

# Or plot correlation as a simple bar chart

plt.figure(figsize=(10,6))
df.select_dtypes('number').corr()['Price'].drop('Price').sort_values().plot(kind='bar') # numeric columns only; drop Price's correlation with itself

<AxesSubplot:>

# Bar Chart with value labels

states = pd.Series(startups['State']).value_counts()
ax = states.plot(kind='bar', figsize=(10,6), title='States')
for i in ax.patches:
    ax.annotate(str(i.get_height()), (i.get_x() * 1.005, i.get_height() * 1.005))
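
On matplotlib 3.4 or later, `Axes.bar_label` offers a shorter way to label the bars and centres each label over its bar. A sketch with made-up counts (the category names and numbers are invented):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up category counts, standing in for a value_counts() result
counts = pd.Series({"North": 5, "South": 3, "West": 2})

fig, ax = plt.subplots(figsize=(10, 6))
bars = ax.bar(counts.index, counts.values)
labels = ax.bar_label(bars, padding=2)  # one text label per bar, placed just above it
ax.set_title("Regions")
```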

Scatter Plots

# Scatter plot with regression line

sns.lmplot(data=iris[iris['species'] == 'setosa'], x='sepal_length', y='sepal_width')

<seaborn.axisgrid.FacetGrid at 0x17c2a594128>

# colouring the points by a third variable (Expend) gives us 3 levels of information

df5.plot.scatter(x='Room.Board',y='Grad.Rate',c='Expend',cmap='coolwarm', figsize=(10,6))

<AxesSubplot:xlabel='Room.Board', ylabel='Grad.Rate'>

# and we can convert it into a plotly interactive image

layout = cf.Layout(height=400,width=700)
df5.iplot(kind='scatter',x='Room.Board',y="Grad.Rate",mode='markers',size=8, layout=layout)

# Seaborn and the Scatter Plot

plt.figure(figsize=(10, 6))
sns.scatterplot(x='Room.Board', y="Grad.Rate", hue="Private", data=df5)

<AxesSubplot:xlabel='Room.Board', ylabel='Grad.Rate'>

# Using lmplot (fit_reg=False turns off the regression line)

sns.set_style('whitegrid')
sns.lmplot(x='Room.Board',y='Grad.Rate',data=df5, hue='Private',
           palette='coolwarm',height=6,aspect=1,fit_reg=False)

<seaborn.axisgrid.FacetGrid at 0x17c2c7ddcf8>

# With seaborn you can visualise the scatter plot and histogram with jointplot

sns.jointplot(y='compactness error', x='concave points error', data=df_feat, kind="scatter")

<seaborn.axisgrid.JointGrid at 0x17c2c832a58>

# or how about a 2D hex jointplot 

sns.jointplot(data=customers, x='Time on App', y='Length of Membership', kind='hex')

<seaborn.axisgrid.JointGrid at 0x17c2bf79908>

# Use seaborn to visualise linear relationships between variables

sns.lmplot(data=iris[iris['species'] == 'setosa'], x='sepal_length', y='sepal_width')
sns.lmplot(data=customers, x='Length of Membership', y='Yearly Amount Spent')

<seaborn.axisgrid.FacetGrid at 0x17c2aeac6d8>

# Separate categorical data into columns (plots) and color using hue

sns.lmplot(x='fico',y='int.rate',data=df3,col='not.fully.paid', hue='credit.policy', palette='Set1', markers='.')

<seaborn.axisgrid.FacetGrid at 0x17c2bf2f588>

# Seaborn Heatmap for Correlation

plt.figure(figsize=(15, 8))
sns.heatmap(df_feat.corr(),annot=True, cmap='viridis')
plt.ylim(30,0)  # 30 features; resets the y limits so the top and bottom rows are not cut off (an issue with some matplotlib versions)

(30.0, 0.0)
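
A 30 x 30 annotated heatmap can be hard to read. As a complementary sketch, the strongest pairwise correlations can be listed directly by keeping only the upper triangle of the correlation matrix and sorting it. The frame below is random made-up data with columns `a` to `d`, not the breast cancer features.

```python
import numpy as np
import pandas as pd

# Made-up data: b is built from a, so that pair should come out on top
rng = np.random.default_rng(42)
a = rng.standard_normal(200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(0, 0.1, 200),
    "c": rng.standard_normal(200),
    "d": rng.standard_normal(200),
})

corr = toy.corr()

# Keep the strict upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Long format: one row per pair, sorted by absolute correlation
pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)
print(pairs.head())
```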

Plotly & Dash

Remember to check out my Manufacturing Quality Dashboard, built entirely with Python, Plotly and Dash, right here: https://ecapp-111.herokuapp.com/. The GitHub repository can be found here: https://github.com/Eamoned/Interactive-Dashboard