Note of data science training EP 3: Matplotlib & Seaborn – Luxury visualization
prev: Note of data science training EP 2: Pandas & Matplotlib – from a thousand mile above
Hi! all guys
Continued from EP 2, we gonna find out the advance visualizations. We now have some basic stuff of matplotlib
, so we will create its graphs in more complex way.
I would say the advance here is how the graph can deliver insight in seconds. For some reasons, we might need to know which graph is proper to the purposes.
Multiple trends
Line graph would be a good choice to display trends over time or changes. We can separate each trend or called “breakdown“. Look here.
We are using data of titanic. Now we pick the column “sex”, “age”, and “fare” where “age” has values only.
selector = titanic[titanic.Age.isna() == False][[titanic.Sex.name, titanic.Age.name, titanic.Fare.name]]
selector
We already have DataFrameGroupBy
object with the average value of “fare”. After transforming this to DataFrame
, we remove the index with .reset_index()
to allow to refer the column name instead of .index
grouper = pd.DataFrame(selector.groupby([titanic.Sex.name, titanic.Age.name]).Fare.mean()).reset_index()
grouper
Now we are using plt.subplots()
as assembled graphs. This function return fig
which is the background of graph and ax
which is the foreground.
(fig, ax) = plt.subplots()
The first ax
is planned for female data with x-axis is “age” and y-axis is “fare”. It’s red.
The second one is male data with blue.
ax.plot(grouper[grouper.Sex == 'female'].Age, grouper[grouper.Sex == 'female'].Fare, c='red', label='female')
ax.plot(grouper[grouper.Sex == 'male'].Age, grouper[grouper.Sex == 'male'].Fare, c='blue', label='male')
Using .legend()
displays a box with labels
plt.legend()
Names x-axis and y-axis with .xlabel()
and .ylabel()
respectively.
plt.xlabel('Age')
plt.ylabel('Fare')
Just find out that there is no women older than around 60 years there.
Spectrums graphs
Scatter graph is graph with dots representing two dimension of data at the basic. This’s the time we include a third dimension.
selector = titanic[titanic.Age.isna() == False][[titanic.Pclass.name, titanic.Age.name, titanic.Fare.name]]
selector
We select “Pclass”, “Age”, and “Fare” from titanic data where “age” has values.
selector.plot.scatter(x = selector.Age.name, y = selector.Fare.name, c = selector.Pclass, cmap = plt.cm.rainbow)
Run .plot.scatter()
and assign the columns as x-axis and y-axis. And we add more two parameters there.
c
column name to color each breakdowncmap
define color scale using class ofmatplotlib.cm
As we assign “Pclass” value as c
which has 3 different values, the graph show 3 different dot colors.
Stacking piles
Guess you already know stacked bar graph. It is quite good for visualizing data in accumulative with breakdowns. We are going to do that.
selector = titanic[[titanic.Pclass.name, titanic.Sex.name]]
selector
Starting with selecting column “Pclass” and “Sex”.
We want to view the number of each “Pclass” and “Sex”, so .groupby()
will be used here.
grouper = pd.DataFrame(selector.groupby([selector.Pclass.name, selector.Sex.name]).Sex.count()) \
:
Right now we got DataFrame
transformed from DataFrameGroupBy
and its column is only “Sex” from .count()
while its indices are “Pclass” and “Sex”.
:
.add_suffix("_count") \
.reset_index() \
.set_index(selector.Pclass.name)
We need “Pclass” as x-axis but we can’t apply .reset_index()
plus .set_index("Pclass")
at this time, because we gonna got the error as duplicated column name (“Sex”).
.add_suffix()
is the solution to add a text after each columns. We add the word “_count” and get “Sex_count”. No duplicated columns now then we do .reset_index()
.
More about .add_suffix()
is here.
After this process, we got the result.
It’s time for graph.
(fig, ax) = plt.subplots()
bar_width = 0.5
_class = grouper[grouper.Sex == 'male'].index
_male_count = grouper[grouper.Sex == 'male'].Sex_count
_female_count = grouper[grouper.Sex == 'female'].Sex_count
We use .subplots()
and declare more 4 variables:
bar_width
Width of each bar_class
It is Index that is “Pclass”_male_count
It is Sex_count of male_female_count
It is Sex_count of female
# add male
ax.bar(_class, _male_count, bar_width, label = 'male')
Put male data first by ax.bar()
with 4 parameters:
- x-axis value is “_class”
- y-axis value is “_male_count”
- width of bar is “bar_width”
- optional parameters: label as “male”
# add female
ax.bar(_class, _female_count, bar_width, bottom = list(_male_count), label = 'female')
After male is female, we have to put female boxes on top of male boxes. Here are the parameters:
- x-axis is “_class”
- y-axis is “_female_count”
- width of bar
- optional parameter: bottom which indicates how high the boxes are floating over. It requires a constant or an array. We want their floating height equals to “_male_count” but it currently is
Series
so we uselist()
to cast it for the case. - optional parameter: label as “female”
and .set_xticks()
is for setting x-axis values as “_class”.
ax.set_xticks(_class)
.legend()
is for label box.
ax.legend()
This code is to put values of each box of the bar.
for p in ax.patches:
x, y = p.get_xy()
h = p.get_height()
ax.annotate(str(h), ((x + p.get_width()/2), (h+y)), ha='center', va='top', color='white')
.patches
returns a list of graph components. At this case, they are rectangles in the graph. We are looking into each of them and put the value with .annotate()
following by these parameters:
- First is
str(h)
for transform value of “h” to text. “h” equals top.get_height()
that is the height of the component. It is “_male_count” or “_female_count” in this case - Second is the coordinate of the text consisting of
- x is
x + p.get_width()/2
. It means value on x-axis at the bottom-left corner added with width of the component. We will get value on x-axis at the middle of the component. - y is
h + y
. It means height added with value on y-axis at the bottom-left of the component. We will get value on y-axis at the top of the component. ha
stands for horizontal alignment. We use “center” to align the text at the center of the component.va
stands for vertical alignment. We use “top” to align the text at the top of the component.
plt.xlabel('Ticket class')
plt.ylabel('crew count')
Use .xlabel()
and .ylabel()
to put the label of x-axis and y-axis respectively.
Here are the reference links:
- https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.bar.html
- https://stackoverflow.com/questions/51610707/two-stacked-bar-charts-as-sub-plots-any-references
- https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.annotate.html
- https://stackoverflow.com/questions/25447700/annotate-bars-with-values-on-pandas-bar-plots
Seaborn
seaborn
is another popular Python library to create beautiful graph. We usually import this as a name “sns” which is related to a fictional charactor’s name “Samuel Norman Seaborn” (source)
Let’s try some cool graphs from this.
Joint plot
sns.jointplot()
create a graph from 2 dimensions of data to find out relationship. We can generate different views by parameter “kind”.
kde (kernel density estimate) shows data density and line graph
scatter shows data distribution as scatter chart with histogram
reg (regression) shows regression trend with histogram and line graph
resid (residual) shows trends and histogram of errors
hex (hexagon) shows hexagonal figure of scatter chart with heat map colorization and histogram
Reference link: http://alanpryorjr.com/visualizations/seaborn/jointplot/jointplot/
Pair plot
sns.pairplot()
displays all combinations of numeric columns of the data.
Relational plot
sns.relplot()
is the advance scatter plot allowing multiple breakdowns. Here are “hue” for different colors and “size” for different size.
Violin plot
sns.violinplot()
shows density of each breakdown.
Heat map
sns.heatmap()
is a heat map to visualize numbers in scale.
This function requires numeric values only. Here is the example:
Let’s say we want a heat map of number of survivors in each “Pclass” and “Sex”. We have to .groupby()
then .sum()
over column “Survived”.
We apply .pivot()
to create a pivot table.
Finally, run sns.heatmap()
with annot=True
to display numbers.
Reference link: https://seaborn.pydata.org/api.html
From my side, seaborn
is a great tool allowing us create pretty graphs with ease but sometimes we need matplotlib
for more complex or customizable ones.
Hope this is useful for ya
See ya next time.
Bye~
next: Note of data science training EP 4: Scikit-learn & Linear Regression – Linear trending