Probability and Statistics for Data Enthusiasts: Part 3 Seaborn Scatterplots
For the next few parts, I will cover a few different types of plots and how to code them in Python. The list includes scatter plots, stem-and-leaf plots, histograms, pie charts, and box-and-whisker plots. I will mostly use the seaborn package for visualization and add documentation links wherever necessary. In this article, we will look at scatter plots.
Scatter Plot: A scatter plot (sometimes loosely called a dot plot) represents data points in a two-dimensional plane, with one numeric attribute on each axis. It is like watching our data from a bird’s eye view, as if every data point were a person standing on the ground at a position given by two dimensions. Here is the syntax of the scatterplot function. I am listing only the arguments that are used most frequently (at least by me); you can explore the other arguments through the documentation, here.
seaborn.scatterplot(data=None, x=None, y=None, hue=None, style=None)
data: The input dataframe that holds the data you want to visualize.
x, y: The variables to be represented on the x-axis and y-axis respectively.
hue: The variable by which the points are grouped, with a different color for each group.
style: Works like hue, but distinguishes the groups by marker style instead of color.
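To see these four arguments working together before we touch the real data, here is a minimal sketch. The small dataframe below is made up for illustration (it is not the Pokemon Dataset), and the Agg backend is only there so the snippet also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; an assumption so this runs without a display
import pandas as pd
import seaborn as sns

# Made-up toy data, not taken from the real Pokemon Dataset
toy = pd.DataFrame({
    "HP": [45, 60, 39, 58, 44, 59],
    "Total": [318, 405, 309, 405, 314, 405],
    "Type 1": ["Grass", "Grass", "Fire", "Fire", "Water", "Water"],
    "Generation": [1, 1, 1, 2, 2, 2],
})

# data: the dataframe, x/y: numeric columns,
# hue: color per category, style: marker shape per category
ax = sns.scatterplot(data=toy, x="HP", y="Total", hue="Type 1", style="Generation")

# seaborn labels the axes with the column names
print(ax.get_xlabel(), ax.get_ylabel())
```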
I will be using the Pokemon Dataset to demonstrate the visualizations. You can find the code here.
First of all, let us import the required libraries. For visualization, I will import the seaborn library, and for working with dataframes I will use pandas. Behind the scenes, seaborn draws its plots with matplotlib, so some functions and methods are easier to call through matplotlib.pyplot than through seaborn. Hence I will also import matplotlib.pyplot with the alias plt.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Now let us import the dataset. You can download the dataset (Pokemon Dataset) and read the CSV file into a dataframe using pandas.
pokemons = pd.read_csv('pokemon.csv')
pokemons
This will print the dataframe which looks like this.
P.S.: A notebook (such as Jupyter or Google Colab) displays the value of the last expression in a cell, which is why typing the dataframe's name at the end of your block prints it.
For illustration purposes, I will use only a few columns. Remember, a scatterplot is always a comparison between two numeric variables, which means that the variables on both axes are numeric. The points can additionally be styled by different categories, which we will get to shortly. Here I am selecting the following attributes.
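If you do not have the CSV file at hand yet, here is a small sketch of the same reading step on an in-memory CSV. The rows and names below are invented for illustration; only the column names are chosen to resemble the ones used later in this article.

```python
import io
import pandas as pd

# A tiny made-up CSV standing in for pokemon.csv (hypothetical rows)
csv_text = """Name,Type 1,Total,HP,Generation
Leafy,Grass,318,45,1
Ember,Fire,309,39,1
Splash,Water,314,44,1
"""

# read_csv accepts a file path or any file-like object
toy_pokemons = pd.read_csv(io.StringIO(csv_text))

# Outside a notebook, print explicitly instead of relying on the cell output
print(toy_pokemons.head())
print(toy_pokemons.shape)
```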
scatter_pokemons = pokemons[['Type 1','Total','HP','Generation']]
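Since both axes need numeric variables, it can help to check which of the selected columns are numeric before plotting. This is a minimal sketch on a made-up dataframe with the same column names; the values are hypothetical.

```python
import pandas as pd

# Made-up rows with the same column names as the selection above
toy = pd.DataFrame({
    "Type 1": ["Grass", "Fire", "Water"],
    "Total": [318, 309, 314],
    "HP": [45, 39, 44],
    "Generation": [1, 1, 1],
})

# Keep only the columns with a numeric dtype; these are candidates for x and y
numeric_columns = toy.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```

Columns that do not show up in this list, such as Type 1, are the ones to use for grouping rather than for the axes.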
Since scatterplots are usually made up of a lot of small dots, I am setting the figure size slightly bigger than the default. You can do this with one line of code, placed just above the call to the scatterplot function, because the setting only applies to figures created after it runs. To show the plot, I use matplotlib.pyplot’s show method. Usually, if you draw one plot per code block, you do not have to call plt.show() explicitly; Google Colab renders the plot for you automatically.
sns.set(rc={'figure.figsize':(20,8)})
ax = sns.scatterplot(data = scatter_pokemons, x='HP', y='Total')
plt.show()
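If you would rather not change a global setting, one alternative is to create the figure yourself and hand its axes to seaborn through the ax argument. This is a sketch of that approach on made-up data, not what the rest of this article uses.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; an assumption so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up toy data, not the real Pokemon Dataset
toy = pd.DataFrame({"HP": [45, 60, 39, 58], "Total": [318, 405, 309, 405]})

# Size this one figure explicitly instead of changing the global default
fig, ax = plt.subplots(figsize=(20, 8))
sns.scatterplot(data=toy, x="HP", y="Total", ax=ax)

width, height = fig.get_size_inches()
print(width, height)
```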
I can infer that, apart from a few outliers, a small change in HP goes along with a comparatively large change in the Total attribute of a pokemon. Here I have taken the Health Points (HP) of a pokemon on the x-axis and the Total attribute on the y-axis. Suppose I swap the axes; let us see what information we get.
sns.scatterplot(data = scatter_pokemons, y='HP', x='Total')
Here I can infer that even a large increase in the Total attribute of a pokemon goes along with only a small change in HP. This is the same conclusion as before, just in a different format. Go with whichever orientation feels more natural to you!
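One way to convince yourself that swapping the axes does not change the relationship is to put a number on it. The sketch below uses the Pearson correlation, which is my addition and not something the plots themselves compute, on made-up values.

```python
import pandas as pd

# Made-up toy data, not the real Pokemon Dataset
toy = pd.DataFrame({
    "HP": [45, 60, 39, 58, 44, 59],
    "Total": [318, 405, 309, 405, 314, 405],
})

# Pearson correlation is symmetric: the order of the two variables does not matter
hp_vs_total = toy["HP"].corr(toy["Total"])
total_vs_hp = toy["Total"].corr(toy["HP"])
print(hp_vs_total, total_vs_hp)
```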
Now I want to see the variation between HP and Total for the different types of pokemon. I can use the hue argument to specify this. By default, seaborn might place the legend on top of your graph. To avoid that, we use the move_legend function (available in seaborn 0.11.2 and later).
sns.set(rc={'figure.figsize':(20,8)})
ax = sns.scatterplot(data = scatter_pokemons, x='HP', y='Total', hue='Type 1')
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))
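Here is a self-contained sketch of the legend placement on made-up data. With bbox_to_anchor=(1, 1) and the location "upper left", the upper-left corner of the legend is pinned to the upper-right corner of the axes, so the legend sits outside the plotting area.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; an assumption so this runs without a display
import pandas as pd
import seaborn as sns

# Made-up toy data, not the real Pokemon Dataset
toy = pd.DataFrame({
    "HP": [45, 60, 39, 58, 44, 59],
    "Total": [318, 405, 309, 405, 314, 405],
    "Type 1": ["Grass", "Grass", "Fire", "Fire", "Water", "Water"],
})

ax = sns.scatterplot(data=toy, x="HP", y="Total", hue="Type 1")

# Pin the legend's upper-left corner to the axes' upper-right corner
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

# The legend still lists every hue level after being moved
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)
```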
Argh! The dots are too small, and your client will be squinting just to find the variation for one type. But do not worry, we can increase the marker size by setting the argument ‘s’, which seaborn passes on to matplotlib, to a desired value.
sns.scatterplot(data = scatter_pokemons, x='HP', y='Total', hue='Type 1', s=100)
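To check that the s value actually reaches the markers, you can read the sizes back from the plotted collection. This is a small sketch on made-up data; reading ax.collections[0] assumes the scatter points are the first collection drawn on the axes.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; an assumption so this runs without a display
import pandas as pd
import seaborn as sns

# Made-up toy data, not the real Pokemon Dataset
toy = pd.DataFrame({
    "HP": [45, 60, 39, 58],
    "Total": [318, 405, 309, 405],
    "Type 1": ["Grass", "Grass", "Fire", "Fire"],
})

# s sets one marker area (in points squared) for every point
ax = sns.scatterplot(data=toy, x="HP", y="Total", hue="Type 1", s=100)

# Read the marker sizes back from the scatter collection
marker_sizes = ax.collections[0].get_sizes()
print(marker_sizes)
```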
Much better! Now suppose you wanted to add a second category on top of this one. We can do that by specifying the style argument. Here I choose the generation the pokemon comes from (in layman's terms, each generation represents a new series. There is very little chance that you have not watched pokemon. However, no one should feel left out, hence the info xD).
sns.scatterplot(data = scatter_pokemons, x='HP', y='Total', hue='Type 1', s=100, style = 'Generation')
With such a large number of data points, it is hard to distinguish between the different types of pokemon. Maybe if we reduce the number of data points, we can see a clearer picture. We will use the head method to extract only the first 100 observations.
sns.scatterplot(data = scatter_pokemons.head(100), x='HP', y='Total', hue='Type 1', s=100)
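Taking the first rows is only one way to thin out the plot. As an alternative sketch (not what this article does), you could keep every row but restrict the plot to a few types of interest. The data and the chosen types below are made up.

```python
import pandas as pd

# Made-up toy data, not the real Pokemon Dataset
toy = pd.DataFrame({
    "HP": [45, 60, 39, 58, 44, 59],
    "Total": [318, 405, 309, 405, 314, 405],
    "Type 1": ["Grass", "Grass", "Fire", "Fire", "Water", "Water"],
})

# head(n) keeps the first n rows, whatever types they happen to contain
first_rows = toy.head(2)

# isin keeps all rows belonging to the types we care about
few_types = toy[toy["Type 1"].isin(["Fire", "Water"])]

print(len(first_rows), len(few_types))
```

The filtered dataframe can be passed to the data argument of scatterplot exactly like the head(100) result.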
We can also vary the size of each data point by category using the size argument; the sizes argument sets the smallest and largest marker size to use.
sns.scatterplot(data = scatter_pokemons.head(100), x='HP', y='Total', size = 'Type 1', sizes=(20,200), hue = 'Type 1')
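As a rough check of how sizes=(20, 200) is applied, the sketch below reads the marker sizes back from a plot of made-up data. Reading ax.collections[0] assumes the scatter points are the first collection drawn on the axes.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; an assumption so this runs without a display
import pandas as pd
import seaborn as sns

# Made-up toy data, not the real Pokemon Dataset
toy = pd.DataFrame({
    "HP": [45, 60, 39, 58, 44, 59],
    "Total": [318, 405, 309, 405, 314, 405],
    "Type 1": ["Grass", "Grass", "Fire", "Fire", "Water", "Water"],
})

# Each category gets its own marker size between the two bounds in sizes
ax = sns.scatterplot(data=toy, x="HP", y="Total",
                     size="Type 1", sizes=(20, 200), hue="Type 1")

marker_sizes = ax.collections[0].get_sizes()
print(sorted(set(float(size) for size in marker_sizes)))
```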
These are some of the ways we can use the seaborn scatterplot function and customize it according to our needs. There are more parameters that can make your scatterplot more understandable and insightful; you can find them in the seaborn documentation. The goal is to make your plot as self-explanatory as possible, and how you do that will vary depending on the areas you wish to highlight.
Hope you enjoyed this take on explaining the seaborn scatterplot function. In the next part of this series, I will attempt to explain stem and leaf plots.
Until the next blog!