Note of data science training EP 3: Matplotlib & Seaborn – Luxury visualization
prev: Note of data science training EP 2: Pandas & Matplotlib – from a thousand mile above
Hi, everyone!
Continuing from EP 2, we are going to explore more advanced visualizations. We already know the basics of matplotlib, so now we will build its graphs in more complex ways.
By “advanced” I mean how quickly a graph can deliver insight, ideally within seconds. To get there, we need to know which graph suits which purpose.
A line graph is a good choice for displaying trends or changes over time. We can also split the data into separate lines, one per category, which is called a “breakdown”. Look here.
We are using the titanic data. We pick the columns “Sex”, “Age”, and “Fare”, keeping only the rows where “Age” has a value.
selector = titanic[titanic.Age.isna() == False][[titanic.Sex.name, titanic.Age.name, titanic.Fare.name]]
selector
Next, we group by “Sex” and “Age” and take the average value of “Fare”. The result is indexed by the group keys, so after converting it to a DataFrame we call .reset_index() to turn “Sex” and “Age” back into regular columns that we can refer to by name instead of through the index.
grouper = pd.DataFrame(selector.groupby([titanic.Sex.name, titanic.Age.name]).Fare.mean()).reset_index()
grouper
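To see what the grouping and .reset_index() steps do, here is a minimal sketch on a tiny made-up table. The `toy` frame and its values are invented for illustration and are not the real titanic data.

```python
import pandas as pd

# Made-up rows standing in for the titanic data (not real passengers).
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male"],
    "Age": [30.0, 30.0, 22.0, 22.0, None],
    "Fare": [10.0, 20.0, 8.0, 12.0, 50.0],
})

# Keep only rows where Age has a value, then the three columns we need.
selector = toy[toy.Age.notna()][["Sex", "Age", "Fare"]]

# Mean fare per (Sex, Age); the result is a Series indexed by Sex and Age.
mean_fare = selector.groupby(["Sex", "Age"]).Fare.mean()

# reset_index() turns the index levels back into ordinary columns.
grouper = mean_fare.reset_index()
print(grouper)
```

After .reset_index(), “Sex” and “Age” are plain columns again, so filters such as grouper[grouper.Sex == 'female'] work as expected.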
Now we use plt.subplots() to set up the graph. This function returns fig, the Figure that holds the whole drawing, and ax, the Axes where the data is actually plotted.
(fig, ax) = plt.subplots()
The first ax.plot() call draws the female data in red, with “Age” on the x-axis and “Fare” on the y-axis.
The second one draws the male data in blue.
ax.plot(grouper[grouper.Sex == 'female'].Age, grouper[grouper.Sex == 'female'].Fare, c='red', label='female')
ax.plot(grouper[grouper.Sex == 'male'].Age, grouper[grouper.Sex == 'male'].Fare, c='blue', label='male')
.legend() displays a box with the labels.
Finally, we name the x-axis and y-axis.
From the graph, we find that there are no women older than around 60 in this data.
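Putting the line graph steps together, a small self-contained sketch could look like this. The numbers are made up, and the axis names are set with ax.set_xlabel() and ax.set_ylabel(), which is my own choice here and may differ from the call used in the original notebook.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up average fares per sex and age (not real titanic figures).
grouper = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [20, 30, 40, 20, 30, 40],
    "Fare": [30.0, 45.0, 60.0, 20.0, 25.0, 35.0],
})

fig, ax = plt.subplots()  # fig is the Figure, ax is the Axes we draw on

# One line per breakdown; filter once per sex and reuse the rows.
for sex, color in [("female", "red"), ("male", "blue")]:
    rows = grouper[grouper.Sex == sex]
    ax.plot(rows.Age, rows.Fare, c=color, label=sex)

ax.legend()                    # box listing the labels given above
ax.set_xlabel("Age")           # axis names chosen for this sketch
ax.set_ylabel("Average fare")
```

In a notebook the figure shows up on its own. In a script, fig.savefig() or plt.show() (with an interactive backend) would display it.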
A scatter graph uses dots to represent two dimensions of data at its most basic. This time we include a third dimension.
selector = titanic[titanic.Age.isna() == False][[titanic.Pclass.name, titanic.Age.name, titanic.Fare.name]]
selector
We select “Pclass”, “Age”, and “Fare” from the titanic data where “Age” has a value.
selector.plot.scatter(x = selector.Age.name, y = selector.Fare.name, c = selector.Pclass, cmap = plt.cm.rainbow)
We call .plot.scatter() and assign the columns for the x-axis and y-axis. We also add two more parameters:
- c: the column used to color each breakdown
- cmap: the color scale, here plt.cm.rainbow
Since the “Pclass” column assigned to c has 3 different values, the graph shows 3 different dot colors.
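Here is a minimal sketch of the same idea on made-up data, so the effect of c and cmap can be tried in isolation. The `toy` frame is invented and the column names are passed as plain strings.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up passengers: two per ticket class (not real titanic rows).
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [40, 55, 30, 35, 20, 25],
    "Fare": [80.0, 90.0, 30.0, 25.0, 8.0, 7.5],
})

# x and y place each dot; c picks the column that drives the color,
# and cmap maps those values onto a color scale.
ax = toy.plot.scatter(x="Age", y="Fare", c="Pclass", cmap=plt.cm.rainbow)
```

Because “Pclass” is numeric, pandas treats it as a continuous scale and adds a color bar next to the graph.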
You probably already know the stacked bar graph. It is quite good at visualizing cumulative totals with breakdowns. We are going to build one.
selector = titanic[[titanic.Pclass.name, titanic.Sex.name]]
selector
We start by selecting the columns “Pclass” and “Sex”.
We want the number of rows for each combination of “Pclass” and “Sex”, so .groupby() will be used here.
grouper = pd.DataFrame(selector.groupby([selector.Pclass.name, selector.Sex.name]).Sex.count()) \
    :
At this point we have a DataFrame built from the grouped counts. Its only column is “Sex”, which holds the result of .count(), while its index levels are “Pclass” and “Sex”.
    :
    .add_suffix("_count") \
    .reset_index() \
    .set_index(selector.Pclass.name)
We need “Pclass” as the x-axis, but we can’t go straight to .set_index("Pclass") at this point. The index levels have to become columns first, and .reset_index() would raise an error because the column name “Sex” would be duplicated.
.add_suffix() is the solution: it appends a text to every column name. We add the word “_count” and get “Sex_count”. With no duplicated columns left, we can call .reset_index() and then .set_index().
After this process, we get the result.
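The name clash and the .add_suffix() fix can be reproduced on a tiny made-up table. The `toy` frame and the `clash` flag are only for illustration.

```python
import pandas as pd

# Made-up rows standing in for the titanic data (not real passengers).
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Sex": ["male", "female", "male", "male", "male", "female"],
})

# Count rows per (Pclass, Sex). The single column is named "Sex",
# and the index levels are "Pclass" and "Sex".
counts = pd.DataFrame(toy.groupby(["Pclass", "Sex"]).Sex.count())

# Moving the index into columns now fails: "Sex" would exist twice.
try:
    counts.reset_index()
    clash = False
except ValueError:
    clash = True

# Renaming the column first removes the clash.
grouper = counts.add_suffix("_count").reset_index().set_index("Pclass")
print(clash)
print(grouper)
```

The final frame has “Pclass” as its index and the columns “Sex” and “Sex_count”, which is the shape the bar graph code expects.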
It’s time for the graph.
(fig, ax) = plt.subplots()
bar_width = 0.5
_class = grouper[grouper.Sex == 'male'].index
_male_count = grouper[grouper.Sex == 'male'].Sex_count
_female_count = grouper[grouper.Sex == 'female'].Sex_count
We call .subplots() again and declare 4 more variables:
- bar_width: the width of each bar
- _class: the index, which is “Pclass”
- _male_count: Sex_count of male
- _female_count: Sex_count of female
# add male
ax.bar(_class, _male_count, bar_width, label = 'male')
We put the male data first using ax.bar() with 4 parameters:
- x-axis value is “_class”
- y-axis value is “_male_count”
- width of bar is “bar_width”
- optional parameter: label as “male”
# add female
ax.bar(_class, _female_count, bar_width, bottom = list(_male_count), label = 'female')
After male comes female. We have to put the female boxes on top of the male boxes. Here are the parameters:
- x-axis value is “_class”
- y-axis value is “_female_count”
- width of bar is “bar_width”
- optional parameter: bottom, which indicates how high above the axis the boxes start. It takes a constant or an array. We want this height to equal “_male_count”, but that is currently a Series, so we use list() to convert it.
- optional parameter: label as “female”
.set_xticks() is for setting x-axis values as “_class”.
.legend() is for label box.
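The stacking itself comes down to the bottom parameter. Here is a minimal sketch with made-up counts held in plain lists instead of the grouped DataFrame.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up counts per ticket class (not real titanic numbers).
classes = [1, 2, 3]
male_count = [5, 4, 9]
female_count = [3, 2, 4]
bar_width = 0.5

fig, ax = plt.subplots()

# First layer: male boxes start at y = 0.
ax.bar(classes, male_count, bar_width, label="male")

# Second layer: each female box starts where the male box ends.
ax.bar(classes, female_count, bar_width, bottom=male_count, label="female")

ax.set_xticks(classes)  # show only 1, 2, 3 on the x-axis
ax.legend()
```

With more than two breakdowns, the bottom of each new layer would be the running total of all layers below it.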
This code writes the value of each box onto the bars.
for p in ax.patches:
    x, y = p.get_xy()
    h = p.get_height()
    ax.annotate(str(h), ((x + p.get_width()/2), (h+y)), ha='center', va='top', color='white')
.patches returns a list of graph components. In this case, they are the rectangles in the graph. We loop over each of them and write its value with .annotate(), using these parameters:
- First is str(h), which converts the value of “h” to text. “h” equals p.get_height(), the height of the component. It is “_male_count” or “_female_count” in this case.
- Second is the coordinate of the text, consisting of:
  - x is x + p.get_width()/2. It means the x-axis value at the bottom-left corner plus half the width of the component, which gives the x-axis value at the middle of the component.
  - y is h + y. It means the height plus the y-axis value at the bottom-left corner of the component, which gives the y-axis value at the top of the component.
- ha stands for horizontal alignment. We use “center” to center the text horizontally on that point.
- va stands for vertical alignment. We use “top” so that the top of the text sits at the top of the component, which places the text just inside the box.
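To check the coordinate arithmetic, here is a small sketch that stacks two made-up layers and keeps every annotation in a list. The `labels` list and the `top_center` name are my own additions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# Two stacked layers with made-up heights; bars are centered on x = 1 and x = 2.
ax.bar([1, 2], [5, 9], 0.5)
ax.bar([1, 2], [3, 4], 0.5, bottom=[5, 9])

labels = []
for p in ax.patches:
    x, y = p.get_xy()      # bottom-left corner of the rectangle
    h = p.get_height()
    # Middle of the rectangle horizontally, top edge vertically.
    top_center = (x + p.get_width() / 2, y + h)
    labels.append(
        ax.annotate(str(h), top_center, ha="center", va="top", color="white")
    )
```

For the bar centered on x = 1 with width 0.5, the bottom-left corner is at x = 0.75, so adding half the width lands back on 1.0. The second layer’s labels sit at the stacked heights 8 and 13.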
plt.xlabel('Ticket class')
plt.ylabel('crew count')
We use plt.xlabel() and plt.ylabel() to put the labels on the x-axis and y-axis respectively.
seaborn is another popular Python library for creating beautiful graphs. We usually import it under the name “sns”, which refers to the fictional character “Samuel Norman Seaborn”.
Let’s try some cool graphs from this.
sns.jointplot() creates a graph from 2 dimensions of data to reveal their relationship. We can generate different views with the parameter “kind”:
- kde (kernel density estimate) shows data density with line graphs
- scatter shows the data distribution as a scatter chart with histograms
- reg (regression) shows the regression trend with histograms and line graphs
- resid (residual) shows the regression residuals (errors) with histograms
- hex (hexagon) shows a hexagonal version of the scatter chart with heat map coloring and histograms
Reference link: http://alanpryorjr.com/visualizations/seaborn/jointplot/jointplot/
sns.pairplot() displays all combinations of numeric columns of the data.
sns.relplot() is an advanced scatter plot that allows multiple breakdowns. Here we use “hue” for different colors and “size” for different sizes.
sns.violinplot() shows the density of each breakdown.
sns.heatmap() draws a heat map that visualizes numbers on a color scale.
This function requires numeric values only. Here is an example:
Let’s say we want a heat map of the number of survivors for each “Pclass” and “Sex”. We have to use:
- .sum() over the column “Survived”
- .pivot() to create a pivot table
- annot=True to display the numbers
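A minimal sketch of these steps on made-up rows could look like this. I am assuming the .sum() is applied after a .groupby() on “Pclass” and “Sex”, and the `toy`, `survivors`, and `table` names are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Made-up passengers (not real titanic rows); Survived is 1 or 0.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Sex": ["female", "male", "female", "female", "male", "male", "female", "male"],
    "Survived": [1, 0, 1, 1, 0, 0, 1, 1],
})

# Number of survivors per (Pclass, Sex), back in regular columns.
survivors = toy.groupby(["Pclass", "Sex"]).Survived.sum().reset_index()

# Pivot so that rows are Pclass, columns are Sex, and cells are counts.
table = survivors.pivot(index="Pclass", columns="Sex", values="Survived")

# annot=True writes each number into its cell.
ax = sns.heatmap(table, annot=True)
```

The pivot table contains only numbers, which is what sns.heatmap() needs.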
Reference link: https://seaborn.pydata.org/api.html
In my opinion, seaborn is a great tool that lets us create pretty graphs with ease, but sometimes we need matplotlib for more complex or customizable ones.
Hope this is useful for ya
See ya next time.
next: Note of data science training EP 4: Scikit-learn & Linear Regression – Linear trending