There are many Python visualisation libraries available for plotting scatter plots, bar charts, heatmaps and histograms. Some are more sophisticated and interactive than others, but the choice really comes down to what's appropriate for the project and your visualisation requirements: a static chart may suffice, or something more interactive may be required.
In this notebook I illustrate just a few examples which I use quite often for my EDA projects. There are of course many other options, like Bokeh, Pygal and Altair, to name just a few.
Matplotlib is the daddy of data visualisation libraries. It is an excellent 2D and 3D graphics library for generating scientific figures programmatically. Then there's pandas, which is extremely popular and powerful and plots directly from dataframes; its built-in plotting capabilities are built on matplotlib. Seaborn is a higher-level interface on top of matplotlib and usually needs less code for statistical plots. Plotly (with Cufflinks) allows you to create interactive plots that you can deploy to dashboards or websites. Note that Cufflinks connects Plotly with pandas dataframes.
The code presented here covers the bare bones of popular plots, but you can of course dress them up a little more and make them more presentable by changing the dimensions, adding titles and axis labels, changing colors, etc.
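As a rough illustration of that kind of dressing up, here is a minimal matplotlib sketch. The data is synthetic and the title, labels and colour are arbitrary choices made up for this example; they are not taken from the datasets used in this notebook.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
x = np.arange(50)
y = rng.normal(size=50).cumsum()

fig, ax = plt.subplots(figsize=(10, 4))                      # change the dimensions
ax.plot(x, y, color='darkred', alpha=0.7, label='series A')  # change the colour
ax.set_title('A dressed-up line plot', fontsize=14)          # add a title
ax.set_xlabel('Step', fontsize=12)                           # label the axes
ax.set_ylabel('Value', fontsize=12)
ax.legend()
```

Working with the `fig, ax` pair rather than the `plt.` shortcuts makes it easier to keep track of which plot you are customising once a notebook holds many figures.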
To take your Python data visualization skills to the next level and create fully customised, interactive dashboards that run in the browser, the open source libraries Plotly and Dash are your next port of call. Dash is a Python framework for building web applications. It's built on top of Flask, Plotly.js and React.js, and it enables you to build dashboards using pure Python.
Check out my Manufacturing Quality Dashboard, built entirely with Python, Plotly and Dash, right here: https://ecapp-111.herokuapp.com/. The link to the GitHub page can be found here: https://github.com/Eamoned/Interactive-Dashboard
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objs as go
import plotly.offline as py
%matplotlib inline
from sklearn.datasets import load_breast_cancer
# Use cufflinks (with .plotly) to make plots interactive
import cufflinks as cf
cf.go_offline() # operate cufflinks off-line
Visualisations will assist in the analysis of datasets to identify:
customers = pd.read_csv('EDA_code_files/Ecommerce Customers')
startups = pd.read_csv('EDA_code_files/50_Startups.csv')
df = pd.read_csv('EDA_code_files/USA_Housing.csv')
train = pd.read_csv('EDA_code_files/titanic_train.csv')
df2 = pd.read_csv("EDA_code_files/Classified Data",index_col=0)
df3 = pd.read_csv("EDA_code_files/loan_data.csv")
df4 = pd.read_csv('EDA_code_files/kyphosis.csv')
df5 = pd.read_csv('EDA_code_files/College_Data')
iris = sns.load_dataset('iris')
cancer = load_breast_cancer()
df_feat = pd.DataFrame(cancer['data'],columns=cancer['feature_names'])
# create fake data:
df_fake = pd.DataFrame(np.random.randn(100,4),columns='A B C D'.split())
# this is the line plot that pandas makes automatically using matplotlib
df_fake.plot(alpha=0.6, figsize=(10,4))
plt.show()
# area plot
df2[['WTT', 'PTI', 'EQW']].plot.area(alpha=0.4, figsize=(10,4))
# Plotly is a library that allows you to create interactive plots that you can use in dashboards or websites
# Cufflinks connects plotly with pandas dataframes
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
# Now the plot will be converted to a plotly interactive image
layout = cf.Layout(height=450,width=700)
df_fake.iplot(layout=layout)
For simple numeric distributions
# Age distribution of people on the titanic using pandas
train['Age'].hist(bins=30,color='darkred',alpha=0.7, figsize=(10,4))
plt.figure(figsize=(10,4))
df2['WTT'].hist(bins=30,alpha=0.7, label='WTT')
df2['PTI'].hist(bins=30,alpha=0.7, label='PTI')
df2['EQW'].hist(bins=30,alpha=0.7, label='EQW')
plt.title('Data Distributions')
plt.legend()
plt.show()
# We can make it interactive with plotly
layout = cf.Layout(height=400,width=700)
df2[['WTT', 'PTI', 'EQW']].iplot(kind='hist', bins=40, layout=layout)
plt.figure(figsize=(10,6)) # change the size of the plot
sns.histplot(df['Price'], kde=True) # histogram with a KDE curve; replaces distplot, which is deprecated since seaborn 0.11
layout = cf.Layout(height=400,width=700)
train['Age'].iplot(kind='hist',bins=30,color='green', layout=layout)
# Plotting more than one distribution using pandas
train.hist(bins=30, figsize=(10,10))
plt.show()
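To see the numbers behind a histogram, the same binning can be done without drawing anything. This is a small sketch using `np.histogram` on made-up ages (not the Titanic data); the column name and the sample size are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Made-up ages standing in for a real column
rng = np.random.default_rng(1)
ages = pd.Series(rng.integers(1, 80, size=200), name='Age')

# Drop missing values before binning, then count values per bin
counts, edges = np.histogram(ages.dropna(), bins=30)

# One row per bin: left edge, right edge and number of values
summary = pd.DataFrame({'left': edges[:-1], 'right': edges[1:], 'count': counts})
print(summary.head())
```

With 30 bins there are 31 edges, and the counts add up to the number of non-missing values.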
Visualisation of Numerical Distributions & dependent variable category relationships
We can use pandas, matplotlib or Seaborn.
# plot the distribution of each numerical variable, separated by a categorical variable
# Matplotlib
num_vars = [var for var in iris.columns if iris[var].dtypes != 'object']
def analyse_numeric(df, var):
    df = df.copy()
    plt.figure(figsize=(10,4))
    df[df['species']=='setosa'][var].hist(alpha=0.5,color='blue',bins=30,label='species=setosa')
    df[df['species']=='virginica'][var].hist(alpha=0.5,color='red',bins=30,label='species=virginica')
    df[df['species']=='versicolor'][var].hist(alpha=0.5,color='yellow',bins=30,label='species=versicolor')
    plt.title('Species distributions')
    plt.legend()
    plt.xlabel(var)
    plt.show()
for var in num_vars:
    analyse_numeric(iris, var)
# Seaborn
num_vars = [var for var in iris.columns if iris[var].dtypes != 'object']
def analyse_numeric(df, var):
    df = df.copy()
    sns.set_style('darkgrid')
    g = sns.FacetGrid(df,hue="species",palette='coolwarm',height=5,aspect=2)
    g = g.map(plt.hist,var,bins=30,alpha=0.7)
    plt.title('Species distributions')
    plt.legend()
    plt.xlabel(var)
    plt.show()
for var in num_vars:
    analyse_numeric(iris, var)
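The per-category overlay can also be written so that it does not hard-code the category names. The following is a sketch only: `plot_hist_by_category` is a hypothetical helper name, and the `demo` dataframe is made-up data rather than one of the datasets loaded above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def plot_hist_by_category(data, var, by, bins=30):
    """Overlay one histogram of `var` for each level of the `by` column."""
    fig, ax = plt.subplots(figsize=(10, 4))
    # groupby yields (level, sub-dataframe) pairs, sorted by level
    for level, group in data.groupby(by):
        ax.hist(group[var], bins=bins, alpha=0.5, label=f'{by}={level}')
    ax.set_title(f'{var} by {by}')
    ax.set_xlabel(var)
    ax.legend()
    return ax

# Made-up data with two groups, for illustration only
rng = np.random.default_rng(2)
demo = pd.DataFrame({
    'measure': np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)]),
    'group': ['a'] * 100 + ['b'] * 100,
})
ax = plot_hist_by_category(demo, 'measure', 'group')
```

Because the loop runs over whatever levels the column contains, the same function would work for a column with two categories or ten.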
# Seaborn's pair plot will do something similar and can include scatter plots
sns.pairplot(iris, hue='species', palette='Dark2')
# Or you could use a histogram rather than a distribution
sns.pairplot(df4,hue='Kyphosis',palette='Set1', diag_kind = 'hist')
# Seaborn flexibility allows mapping of different types of charts using map
# Map to upper,lower, and diagonal
g = sns.PairGrid(iris)
g.map_diag(sns.histplot) # histplot replaces the deprecated distplot
g.map_upper(plt.scatter)
g.map_lower(sns.kdeplot)
Visualizing different categorical distributions
# Visualise different categories against the target using a Boxplot
startups.boxplot(column='Profit',by='State',rot = 0,figsize=(10,6))
df2[['WTT', 'PTI', 'EQW', 'SBI']].plot.box(figsize=(10,6))
# A boxplot using seaborn
plt.figure(figsize=(10, 6))
sns.boxplot(x="State", y="Profit", data=startups, palette='rainbow')
plt.title('Profit Distribution by State ')
plt.xlabel('States', fontsize=15)
# and we can convert it into a plotly interactive image with Cufflinks
layout = cf.Layout(height=400,width=700)
df2[['WTT', 'PTI', 'EQW', 'SBI']].iplot(kind='box', layout=layout)
# Or use Plotly Express
import plotly.express as px
px.box(df2, y='WTT', width=700, height=400)
# Density Plot
df2[['WTT', 'PTI', 'EQW', 'SBI']].plot.density(figsize=(10,6))
# Bar plot of the dependent variable Profit by State using Seaborn (bar height is the mean by default)
plt.figure(figsize=(10, 6))
sns.barplot(x='State',y='Profit',data=startups) # y will be numerical
plt.title('Profit Distribution ')
plt.ylabel('Profit', fontsize=12)
plt.xlabel('State', fontsize=12)
# change the estimator object to a function like std or mean
plt.figure(figsize=(10, 6))
sns.barplot(x='State',y='Profit',data=startups, estimator=np.std)
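If you want the summary numbers as a table rather than as bars, a pandas group-by gives a tabular counterpart. This is a minimal sketch on a made-up dataframe; the state labels and profit values are invented for the example.

```python
import pandas as pd

# Made-up data, for illustration only
demo = pd.DataFrame({
    'State': ['X', 'X', 'Y', 'Y', 'Y'],
    'Profit': [10.0, 20.0, 30.0, 50.0, 70.0],
})

# Mean and standard deviation of Profit for each State
summary = demo.groupby('State')['Profit'].agg(['mean', 'std'])
print(summary)
```

One thing to keep in mind when comparing numbers: pandas `.std()` defaults to the sample standard deviation (`ddof=1`), whereas `np.std` called directly on an array defaults to the population version (`ddof=0`), so the two can differ slightly on small groups.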
# Frequency Distribution using Seaborn countplot
plt.figure(figsize=(10, 6))
sns.countplot(x='State',data=startups)
plt.title('Frequency Distribution of States')
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('State', fontsize=12)
# Categorical
# Survival v Sex distribution
plt.figure(figsize=(10, 6))
sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Sex',data=train,palette='RdBu_r')
# How about a Pie Chart using plotly
x = train['Survived'].value_counts()
colors = ['#800080', '#0000A0']
trace = go.Pie(labels=x.index, values=x, textinfo='value', marker=dict(colors=colors, line=dict(color='#001000',width=2)))
layout =go.Layout(title='Survived', width=500, height=600)
fig = go.Figure(data=[trace], layout=layout)
py.iplot(fig, filename='pie_chart_subplots')
# Or plot correlation as a simple bar chart
plt.figure(figsize=(10,6))
df.corr(numeric_only=True)['Price'].drop('Price').sort_values().plot(kind='bar') # numeric_only (pandas >= 1.5) skips non-numeric columns; drop Price's correlation with itself
# Bar Chart with value labels
states = pd.Series(startups['State']).value_counts()
ax = states.plot(kind='bar', figsize=(10,6), title='States');
for i in ax.patches:
    ax.annotate(str(i.get_height()), (i.get_x() * 1.005, i.get_height() * 1.005))
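On matplotlib 3.4 or later, `Axes.bar_label` is an alternative to looping over the patches by hand. The sketch below uses made-up category names and counts, not the startups data.

```python
import matplotlib.pyplot as plt

# Made-up categories and counts, for illustration only
labels = ['X', 'Y', 'Z']
counts = [17, 16, 17]

fig, ax = plt.subplots(figsize=(10, 6))
bars = ax.bar(labels, counts)
# bar_label writes each bar's height just above it (requires matplotlib >= 3.4)
texts = ax.bar_label(bars, padding=3)
ax.set_title('States')
```

The labels are centred on each bar automatically, which avoids having to tune offsets such as the `1.005` multipliers.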
# Scatter plot with regression line
sns.lmplot(data=iris[iris['species'] == 'setosa'], x='sepal_length', y='sepal_width')
# this gives us 3 levels of information
df5.plot.scatter(x='Room.Board',y='Grad.Rate',c='Expend',cmap='coolwarm', figsize=(10,6))
# and we can convert it into a plotly interactive image
layout = cf.Layout(height=400,width=700)
df5.iplot(kind='scatter',x='Room.Board',y="Grad.Rate",mode='markers',size=8, layout=layout)
# Seaborn and the Scatter Plot
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Room.Board', y="Grad.Rate", hue="Private", data=df5)
# Using lmplot
sns.set_style('whitegrid')
sns.lmplot(x='Room.Board',y='Grad.Rate',data=df5, hue='Private',
           palette='coolwarm',height=6,aspect=1,fit_reg=False) # x and y must be keyword arguments in recent seaborn versions
# With seaborn you can visualise the scatter plot and histogram with jointplot
sns.jointplot(y='compactness error', x='concave points error', data=df_feat, kind="scatter")
# or how about a 2D hex jointplot
sns.jointplot(data=customers, x='Time on App', y='Length of Membership', kind='hex')
# Use seaborn to visualise linear relationships between variables
sns.lmplot(data=iris[iris['species'] == 'setosa'], x='sepal_length', y='sepal_width')
sns.lmplot(data=customers, x='Length of Membership', y='Yearly Amount Spent')
# Separate categorical data into columns (plots) and color using hue
sns.lmplot(x='fico',y='int.rate',data=df3,col='not.fully.paid', hue='credit.policy', palette='Set1', markers='.')
# Seaborn Heatmap for Correlation
plt.figure(figsize=(15, 8))
sns.heatmap(df_feat.corr(),annot=True, cmap='viridis')
plt.ylim(30,0) # play with this until you get it right
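A correlation matrix is symmetric, so with many features it can help to hide the upper triangle and read only one copy of each pair. This is a sketch on random made-up data, using the `mask` argument of `sns.heatmap`; the column names are placeholders.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Random made-up data with five columns, for illustration only
rng = np.random.default_rng(3)
demo = pd.DataFrame(rng.normal(size=(100, 5)), columns=list('ABCDE'))
corr = demo.corr()

# True marks cells to hide: everything strictly above the diagonal
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', cmap='viridis', ax=ax)
```

For five columns this hides 10 of the 25 cells, leaving the diagonal and the lower triangle visible.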
Remember to check out my Manufacturing Quality Dashboard, built entirely with Python, Plotly and Dash, right here: https://ecapp-111.herokuapp.com/. The link to the GitHub page can be found here: https://github.com/Eamoned/Interactive-Dashboard