For many companies to remain competitive, it is critical that they leverage data analysis and AI/ML to improve their forecasting and future sales models. This usually involves the analysis of historical sales data, exploring customer demand, the current competition, seasonality patterns, the effect of holidays, promotions and much more.
The first objective of this project is to analyse the dataset itself in order to determine the characteristics of its features, the reliability of the data, the level of detail, and any weaknesses and/or missing data. The second objective is to explore the data and extract new information and insights wherever possible. Of course, data analysis would only be the first step for the business. It is usually followed by building predictive models for forecasting future sales, with additional steps including feature engineering, feature selection, model building and finally deployment.
The sales data is drawn from 1115 stores and includes two files:
Transaction Data
Store Data
# Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import datetime
import warnings
warnings.filterwarnings("ignore")
# datasets
sales = pd.read_csv('train.csv') # all transaction data
store = pd.read_csv('store.csv') # store specific information
Sales Data
sales.head()
sales.info()
sales.describe().transpose()
Stores Data
store.head()
store.info()
store.describe().transpose()
print('Average competition distance is:', round(store['CompetitionDistance'].mean(), 2), "meters")
print('Average competition distance is:', round(store['CompetitionDistance'].mean() / 1000, 2), "km")
# check for any missing data -- luckily there is none
sales.isnull().sum().any()
Distributions
# Sales
plt.figure(figsize=(16,6))
sns.histplot(sales['Sales'], kde=True) # distplot is deprecated in recent seaborn versions
plt.title("Sales Distribution")
plt.show()
# Plot histogram of numerical sales data
sales.hist(bins=30, color='red', alpha=0.7, figsize=(20,20))
plt.show()
Which Store has the largest Sales?
# row with the maximum sales value
sales[sales['Sales']==sales['Sales'].max()]
Which Store has the largest number of Customers?
# row with the maximum number of customers
sales[sales['Customers']==sales['Customers'].max()]
How many records have stores open and how many closed?
# Count the number of records where stores are open and closed
print("Records with store closed: ", len(sales[sales['Open'] == 0]))
print("Records with store open: ", len(sales[sales['Open'] == 1]))
print('Total records: ', len(sales))
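Note that each row in the sales file is a store-day record, so the counts above are counts of daily records rather than distinct stores. A minimal sketch on a hypothetical mini dataset shows how `nunique` separates the two views:

```python
import pandas as pd

# Hypothetical store-day records standing in for train.csv
sales = pd.DataFrame({
    'Store': [1, 1, 2, 2, 3, 3],
    'Open':  [1, 0, 1, 1, 0, 1],
})

closed_records = len(sales[sales['Open'] == 0])  # store-days closed
open_records = len(sales[sales['Open'] == 1])    # store-days open
stores_ever_closed = sales.loc[sales['Open'] == 0, 'Store'].nunique()

print(closed_records, open_records, stores_ever_closed)  # 2 4 2
```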
Obviously there will be no sales or promos if a store isn't open, so let's remove records for closed stores from the dataset
sales = sales[sales["Open"]==1]
Drop the 'Open' column since it no longer carries any information
sales.drop('Open', axis=1, inplace=True)
sales.head()
Let's see the impact on the summary statistics after removing closed stores
sales.describe().transpose()
Average sales = 6,955 Euros; average number of customers = 762 (both increased once closed-store days were removed)
Distributions
# Plot histograms of the numerical store data
store.hist(bins=30, color='red', alpha=0.7, figsize=(20,20))
plt.show()
print('Average competition distance is:', round(store['CompetitionDistance'].mean() / 1000, 2), "km")
Missing Data
Let's see if we have any missing data in the stores dataset!
# missing data in the stores dataset!
store.isnull().sum()
The dataset has some missing values; let's explore further!
plt.figure(figsize=(16,6))
sns.heatmap(store.isnull(), yticklabels=False, cbar=False, cmap='viridis')
plt.show()
# make a list of the features that contain missing values (using a list comprehension)
vars_with_na = [var for var in store.columns if store[var].isnull().sum() > 0]
# print each variable name and its percentage of missing values
for var in vars_with_na:
    print(var, "= Percentage of data missing: ", np.round((store[var].isnull().sum()/len(store))*100, 2), '%')
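The loop above can also be written as a single vectorized expression, since `isnull().mean()` gives the fraction of NaNs per column. A sketch on a toy frame (the column names mirror the store file, the values are made up):

```python
import numpy as np
import pandas as pd

store = pd.DataFrame({
    'CompetitionDistance': [100.0, np.nan, 300.0, 400.0],
    'Promo2SinceWeek':     [np.nan, np.nan, 14.0, 40.0],
    'StoreType':           ['a', 'b', 'c', 'a'],
})

# fraction of missing values per column, expressed as a percentage
missing_pct = (store.isnull().mean() * 100).round(2)
print(missing_pct[missing_pct > 0])  # CompetitionDistance 25.0, Promo2SinceWeek 50.0
```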
For 'CompetitionDistance':
# show missing data rows
store[store['CompetitionDistance'].isnull()]
Fill the missing values with the mean of the 'CompetitionDistance' column and recheck for null values
# replace NaN in 'CompetitionDistance' with the column mean
store.loc[store['CompetitionDistance'].isnull(), 'CompetitionDistance'] = store['CompetitionDistance'].mean()
print("Number of null values: ",store['CompetitionDistance'].isnull().sum())
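Mean imputation is sensitive to outliers, and distances to the nearest competitor are typically right-skewed, so the median is often a more robust fill value. A small sketch with made-up distances:

```python
import numpy as np
import pandas as pd

# Made-up distances with one large outlier, mimicking a skewed distribution
dist = pd.Series([50.0, 100.0, 200.0, np.nan, 20000.0])

mean_filled = dist.fillna(dist.mean())      # pulled far up by the 20 km outlier
median_filled = dist.fillna(dist.median())  # robust to the outlier

print(dist.mean(), dist.median())  # 5087.5 150.0
```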
For 'CompetitionOpenSinceMonth' & 'CompetitionOpenSinceYear':
# check that if 'CompetitionOpenSinceMonth' is NaN then 'CompetitionOpenSinceYear' will also be NaN
store[store['CompetitionOpenSinceMonth'].isnull()].head()
For 'Promo2SinceWeek', 'Promo2SinceYear', and 'PromoInterval':
store[store['Promo2SinceWeek'].isnull()].head()
Let's set the missing values for 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2SinceWeek', 'Promo2SinceYear' and 'PromoInterval' to zero
nan_cols = ['CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval']
for col in nan_cols:
    store[col] = store[col].fillna(0)
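The same fill can be done in one call by passing a dictionary to `fillna`, which avoids per-column in-place mutation (chained `inplace=True` fills can trigger copy warnings in recent pandas). A sketch on a toy frame:

```python
import numpy as np
import pandas as pd

store = pd.DataFrame({
    'Promo2SinceWeek': [np.nan, 14.0],
    'Promo2SinceYear': [np.nan, 2011.0],
    'PromoInterval':   [np.nan, 'Jan,Apr,Jul,Oct'],
})

# one call instead of a loop; the dict keys name the columns to fill
store = store.fillna({'Promo2SinceWeek': 0, 'Promo2SinceYear': 0, 'PromoInterval': 0})
print(store.isnull().sum().sum())  # 0
```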
Check if we still have missing data
plt.figure(figsize=(15,6))
sns.heatmap(store.isnull(), yticklabels=False, cbar=False, cmap='viridis')
plt.show()
Let's merge both data frames on the 'Store' column
# Both files have store ids in common
print(len(store['Store'].unique()))
print(len(sales['Store'].unique()))
store.columns
sales.columns
Merge datasets
# Merge the two datasets
df_merge = pd.merge(sales, store, on='Store') # how='inner' by default
df_merge.columns
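`pd.merge` will silently duplicate or drop rows if the keys are not what we expect, so it can be worth passing `validate` and `indicator` to assert the merge shape. A sketch with tiny made-up frames:

```python
import pandas as pd

sales = pd.DataFrame({'Store': [1, 1, 2], 'Sales': [500, 600, 700]})
store = pd.DataFrame({'Store': [1, 2], 'StoreType': ['a', 'c']})

# validate='many_to_one' raises if store ids repeat in the right frame;
# indicator=True adds a '_merge' column flagging unmatched rows
df_merge = pd.merge(sales, store, on='Store',
                    validate='many_to_one', indicator=True)
print(len(df_merge), (df_merge['_merge'] == 'both').all())  # 3 True
```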
Feature Distributions for Sales
Explore Store Type
df_merge.groupby(['Date', 'StoreType'])['Sales'].mean().unstack().plot(figsize=(14,10))
plt.title('Average Sales per Store Type')
plt.show()
num_vars = ['Sales', 'Customers', 'CompetitionDistance']
def analyse_numeric(df_merge, var):
    df = df_merge.copy()
    sns.set_style('darkgrid')
    g = sns.FacetGrid(df, hue="StoreType", palette='Dark2', height=5, aspect=3)
    g = g.map(plt.hist, var, bins=30, alpha=0.7)
    plt.title('Store Type distributions')
    plt.legend()
    plt.xlabel(var)
    plt.show()

for var in num_vars:
    analyse_numeric(df_merge, var)
plt.figure(figsize=(14,6))
sns.countplot(x='StoreType',data=df_merge)
plt.title('Count of StoreTypes')
plt.show()
plt.figure(figsize=(14,6))
sns.barplot(x='StoreType', y='Sales', data = df_merge, estimator=sum)
plt.title('Total Sales by StoreType')
plt.show()
Explore Assortment
We will assign descriptions to the Assortment labels to improve analysis
# Change Assortment labels
labels = {'a':'Basic','b':'Extra','c':'Extended'}
df_merge['AssortName'] = df_merge['Assortment'].map(labels)
df_merge['AssortName'].unique()
plt.figure(figsize=(14,6))
sns.countplot(x='AssortName',data=df_merge)
plt.title('Count of Assortment')
plt.show()
plt.figure(figsize=(14,6))
sns.barplot(x='AssortName', y='Sales', data = df_merge, estimator=sum)
plt.title('Total Sales by Assortment')
plt.show()
Promo Analysis
plt.figure(figsize=(14,7))
df_merge[df_merge['Promo']==0]['Sales'].hist(alpha=0.5,color='red',bins=30,label='Promo=0')
df_merge[df_merge['Promo']==1]['Sales'].hist(alpha=0.5,color='blue',bins=30,label='Promo=1')
df_merge[df_merge['Promo2']==0]['Sales'].hist(alpha=0.5,color='yellow',bins=30,label='Promo2=0')
df_merge[df_merge['Promo2']==1]['Sales'].hist(alpha=0.5,color='green',bins=30,label='Promo2=1')
plt.legend()
plt.show()
plt.figure(figsize=(14, 6))
sns.boxplot(x="Promo", y="Sales", data=df_merge, palette='rainbow')
plt.title('Sales Distribution by Promo')
plt.xlabel('Promo', fontsize=15)
plt.show()
plt.figure(figsize=(14, 6))
sns.boxplot(x="Promo2", y="Sales", data=df_merge, palette='rainbow')
plt.title('Sales Distribution by Promo2')
plt.xlabel('Promo2', fontsize=15)
plt.show()
plt.figure(figsize=(14,6))
corr = df_merge.corr(numeric_only=True)['Sales'].sort_values() # numeric_only avoids errors on string columns in recent pandas
corr[:-1].plot(kind='bar') # drop Sales' self-correlation of 1.0
plt.title('Feature Correlation with Sales')
plt.show()
Not surprisingly, there's a high correlation between Customers and Sales
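The strength of that relationship can also be quantified directly with the Pearson coefficient, without plotting. A sketch on made-up, roughly linear data:

```python
import pandas as pd

df = pd.DataFrame({'Customers': [100, 200, 300, 400],
                   'Sales': [1000, 2100, 2900, 4100]})

# Series.corr computes the Pearson correlation by default
r = df['Customers'].corr(df['Sales'])
print(round(r, 3))
```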
sns.jointplot(y='Sales', x='Customers', data=df_merge, kind="reg", height=8, ratio=3)
plt.figure(figsize=(18,10))
sns.heatmap(df_merge.corr(numeric_only=True), annot=True, cmap='viridis')
plt.ylim(12,0)
plt.title('Feature Correlation')
plt.show()
Let's break the date up into day, month and year and explore possible patterns
Separate the year, month and day and place in separate columns
df_merge.head(2)
df_merge['Year'] = pd.DatetimeIndex(df_merge['Date']).year
df_merge['Month'] = pd.DatetimeIndex(df_merge['Date']).month
df_merge['Day'] = pd.DatetimeIndex(df_merge['Date']).day
#df_merge['DayName'] = pd.DatetimeIndex(df_merge['Date']).strftime('%A')
#df_merge['DayOfWeek'] = pd.DatetimeIndex(df_merge['Date']).dayofweek
df_merge.head(3)
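An alternative to repeated `pd.DatetimeIndex` calls is to convert the column once with `pd.to_datetime` and then use the `.dt` accessor, which also gives readable day names. A sketch on two made-up dates:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2015-07-31', '2015-12-24']})

# convert once, then derive every component from the .dt accessor
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['DayName'] = df['Date'].dt.day_name()

print(df[['Year', 'Month', 'Day', 'DayName']].values.tolist())
```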
Let's take a look at the average sales and number of customers per month
# 'groupby' groups all rows that share the same month, then we take the mean of the sales column
plt.figure(figsize=(14,6))
df_merge.groupby(by='Month')['Sales'].mean().plot(kind='bar')
plt.title('Average Sales by Month')
plt.show()
df_merge.groupby(by='Month')['Sales'].mean().plot(figsize=(14,5), marker='o', color='g')
plt.title('Average Sales per Month')
plt.legend()
plt.show()
df_merge.groupby(by='Month')['Customers'].mean().plot(figsize=(14,5), marker='o', color='r')
plt.title('Average Customers per Month')
plt.legend()
plt.show()
Let's explore Sales and number of customers per day of the month
df_merge.groupby(by='Day')['Sales'].mean().plot(figsize=(14,5), marker='o', color='g')
plt.title('Average Sales per Day of the Month')
plt.legend()
plt.show()
df_merge.groupby(by='Day')['Customers'].mean().plot(figsize=(14,5), marker='o', color='r')
plt.title('Average Customers per Day of the Month')
plt.legend()
plt.show()
The smallest numbers of customers and sales generally occur around the 24th of the month
The largest numbers of customers and sales occur around the 30th and the 1st of the month
This is probably a reflection of pay cycles, when people have more disposable income, and/or price reductions at the end of the month. There's potentially a lot more to explore here, and business knowledge can help explain these patterns further.
Let's do the same for the day of the week
# list the days in DayOfWeek
df_merge['DayOfWeek'].unique()
Note: Monday = 1, Tuesday = 2, ......, Sunday = 7
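For more readable plots, the numeric codes can be mapped to day names with a small dictionary (the short names here are our own choice):

```python
import pandas as pd

dow = pd.Series([1, 2, 7])
day_names = {1: 'Mon', 2: 'Tue', 3: 'Wed', 4: 'Thu',
             5: 'Fri', 6: 'Sat', 7: 'Sun'}

print(dow.map(day_names).tolist())  # ['Mon', 'Tue', 'Sun']
```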
df_merge.groupby(by='DayOfWeek')['Sales'].mean().plot(figsize=(14,5), marker='o', color='g')
plt.title('Average Sales per Day of the Week')
plt.legend()
plt.show()
df_merge.groupby(by='DayOfWeek')['Customers'].mean().plot(figsize=(14,5), marker='o', color='r')
plt.title('Average Customers per Day of the Week')
plt.legend()
plt.show()
Sunday and Monday appear to be the busiest days of the week for sales
In terms of customer numbers, Sunday is again the busiest day, while Monday is less so.