prev: Note of data science training EP 2: Pandas & Matplotlib – from a thousand mile above

Hi, everyone!

Continuing from EP 2, we are going to look at more advanced visualizations. We already know the basics of matplotlib, so now we will build more complex graphs with it.

By “advanced” I mean a graph that can deliver insight in seconds. To get there, we need to know which kind of graph suits which purpose.

Multiple trends

A line graph is a good choice for displaying trends or changes over time. We can also draw a separate line for each group, which is called a “breakdown”. Look here.

We are using the titanic data. First we pick the columns “Sex”, “Age”, and “Fare”, keeping only the rows where “Age” has a value.

selector = titanic[titanic.Age.isna() == False][[titanic.Sex.name, titanic.Age.name, titanic.Fare.name]]
selector

Next we group by “Sex” and “Age” and take the average of “Fare”. The result is a Series whose index holds the group keys. After transforming it to a DataFrame, we call .reset_index() so that “Sex” and “Age” become ordinary columns that we can refer to by name instead of through .index.

grouper = pd.DataFrame(selector.groupby([titanic.Sex.name, titanic.Age.name]).Fare.mean()).reset_index()
grouper
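To see what .reset_index() does here, this is a minimal sketch on a few made-up rows (the `toy` frame and its values are my own assumption, not the real titanic data). It calls .reset_index() directly on the Series, which also returns a DataFrame.

```python
import pandas as pd

# Hypothetical toy rows standing in for the titanic data
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male"],
    "Age": [30.0, 30.0, 22.0, 22.0, 40.0],
    "Fare": [10.0, 20.0, 8.0, 12.0, 50.0],
})

# The mean of "Fare" per group is a Series; the group keys live in its index
mean_fare = toy.groupby(["Sex", "Age"]).Fare.mean()

# reset_index() moves "Sex" and "Age" out of the index into ordinary columns
toy_grouper = mean_fare.reset_index()
print(toy_grouper)
```

After this, `toy_grouper.Sex` and `toy_grouper.Age` can be used like any other column.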

Now we use plt.subplots() to assemble the graph. This function returns fig, the figure that acts as the background canvas, and ax, the axes that we draw on.

(fig, ax) = plt.subplots()

The first ax.plot() call draws the female data in red, with “Age” on the x-axis and “Fare” on the y-axis.

The second call draws the male data in blue.

ax.plot(grouper[grouper.Sex == 'female'].Age, grouper[grouper.Sex == 'female'].Fare, c='red', label='female')
ax.plot(grouper[grouper.Sex == 'male'].Age, grouper[grouper.Sex == 'male'].Fare, c='blue', label='male')

Calling .legend() displays a box with the labels.

plt.legend()

Name the x-axis and y-axis with .xlabel() and .ylabel() respectively.

plt.xlabel('Age')
plt.ylabel('Fare')

From the graph, we find that there are no women older than around 60 in the data.
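Putting the steps together, here is a self-contained sketch of a two-line graph. The numbers in `toy_grouper` are made up for illustration, and the loop over a `colors` dict is just my way to avoid repeating the filter; it draws the same two lines as the two ax.plot() calls above.

```python
import matplotlib
matplotlib.use("Agg")  # draws off-screen; skip this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical average fares per sex and age (not real titanic values)
toy_grouper = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [20.0, 30.0, 40.0, 20.0, 30.0, 40.0],
    "Fare": [15.0, 25.0, 30.0, 10.0, 12.0, 20.0],
})

fig, ax = plt.subplots()
colors = {"female": "red", "male": "blue"}
for sex, color in colors.items():
    rows = toy_grouper[toy_grouper.Sex == sex]  # filter once per breakdown
    ax.plot(rows.Age, rows.Fare, c=color, label=sex)

ax.legend()            # box with the labels
ax.set_xlabel("Age")   # same effect as plt.xlabel() on the current axes
ax.set_ylabel("Fare")
```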

Spectrum graphs

A scatter graph uses dots to represent, at its most basic, two dimensions of data. This time we include a third dimension.

selector = titanic[titanic.Age.isna() == False][[titanic.Pclass.name, titanic.Age.name, titanic.Fare.name]]
selector

We select “Pclass”, “Age”, and “Fare” from the titanic data, again keeping only the rows where “Age” has a value.

selector.plot.scatter(x = selector.Age.name, y = selector.Fare.name, c = selector.Pclass, cmap = plt.cm.rainbow)

Run .plot.scatter() and assign the columns for the x-axis and y-axis. We also add two more parameters.

c
the values (or a column name) that decide the color of each dot, one color per breakdown
cmap
the color scale, taken here from matplotlib.cm

Since we pass “Pclass” as c and it has 3 distinct values, the graph shows dots in 3 different colors.
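The same idea as a runnable sketch on made-up rows. Here I pass the column name "Pclass" and the colormap name "rainbow" as strings, which is an alternative spelling of the call above; the `toy` values are an assumption for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # draws off-screen; skip this line in a notebook
import pandas as pd

# Hypothetical passengers: two per ticket class
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [38.0, 54.0, 27.0, 35.0, 22.0, 4.0],
    "Fare": [71.3, 51.9, 13.0, 26.0, 7.3, 16.7],
})

# x and y place each dot; c picks its color from the "rainbow" scale
ax = toy.plot.scatter(x="Age", y="Fare", c="Pclass", cmap="rainbow")
```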

Stacking piles

I guess you already know the stacked bar graph. It is quite good for visualizing cumulative data with breakdowns. We are going to build one.

selector = titanic[[titanic.Pclass.name, titanic.Sex.name]]
selector

We start by selecting the columns “Pclass” and “Sex”.

We want to see the number of passengers for each combination of “Pclass” and “Sex”, so we use .groupby() here.

grouper = pd.DataFrame(selector.groupby([selector.Pclass.name, selector.Sex.name]).Sex.count()) \
  :

At this point we have a DataFrame built from the result of .count(). Its only column is “Sex”, which holds the counts, while its index levels are “Pclass” and “Sex”.

  :
    .add_suffix("_count") \
    .reset_index() \
    .set_index(selector.Pclass.name)

We need “Pclass” as the x-axis, but we can’t apply .reset_index() followed by .set_index("Pclass") right away, because .reset_index() would raise an error about a duplicated column name (“Sex”).

.add_suffix() solves this by appending a text to every column name. We append “_count” and get “Sex_count”. Now there are no duplicated columns, so we can call .reset_index().

More about .add_suffix() can be found in the pandas documentation.

After this process, we get the result.

It’s time for the graph.
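Here is the whole chain in one piece as a sketch on made-up rows (the `toy` frame is my own assumption). It first shows the name clash and then the .add_suffix() fix.

```python
import pandas as pd

# Hypothetical passengers
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Sex": ["female", "male", "male", "female", "male", "male"],
})

# One column named "Sex" (the counts), index levels "Pclass" and "Sex"
counts = pd.DataFrame(toy.groupby(["Pclass", "Sex"]).Sex.count())

# reset_index() fails here: the index level "Sex" clashes with the column "Sex"
try:
    counts.reset_index()
    clash = None
except ValueError as err:
    clash = str(err)
print(clash)

# Renaming the column first removes the clash
toy_grouper = counts.add_suffix("_count").reset_index().set_index("Pclass")
print(toy_grouper)
```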

(fig, ax) = plt.subplots()
bar_width = 0.5
_class = grouper[grouper.Sex == 'male'].index
_male_count = grouper[grouper.Sex == 'male'].Sex_count
_female_count = grouper[grouper.Sex == 'female'].Sex_count

We call plt.subplots() and declare 4 more variables:

bar_width
Width of each bar
_class
The index, which is “Pclass”
_male_count
Sex_count of the male rows
_female_count
Sex_count of the female rows

# add male
ax.bar(_class, _male_count, bar_width, label = 'male')

Draw the male data first with ax.bar(), using 4 parameters:

x-axis value is “_class”
y-axis value is “_male_count”
width of bar is “bar_width”
optional parameter: label, set to “male”

# add female
ax.bar(_class, _female_count, bar_width, bottom = list(_male_count), label = 'female')

Female comes after male, and the female boxes have to sit on top of the male boxes. Here are the parameters:

x-axis is “_class”
y-axis is “_female_count”
width of bar
optional parameter: bottom, which sets how high above the x-axis each box starts. It takes a single number or an array. We want each box to start at the height of “_male_count”, which is a Series, so we convert it with list().
optional parameter: label, set to “female”

.set_xticks() sets the tick values on the x-axis to “_class”.

ax.set_xticks(_class)

.legend() displays the label box.

ax.legend()
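To check how bottom stacks the boxes, here is a small sketch with made-up counts (the lists `classes`, `male`, and `female` are assumptions, not the real titanic numbers). It reads back where each female box starts.

```python
import matplotlib
matplotlib.use("Agg")  # draws off-screen; skip this line in a notebook
import matplotlib.pyplot as plt

classes = [1, 2, 3]
male = [5, 4, 9]    # hypothetical counts per ticket class
female = [3, 2, 4]

fig, ax = plt.subplots()
ax.bar(classes, male, 0.5, label="male")
# bottom=male lifts every female box to the top of the matching male box
ax.bar(classes, female, 0.5, bottom=male, label="female")
ax.set_xticks(classes)
ax.legend()

# ax.patches holds the 3 male rectangles first, then the 3 female ones
female_bottoms = [p.get_y() for p in ax.patches[3:]]
print(female_bottoms)
```

Each value in `female_bottoms` matches the male count of the same class, which is what makes the bars stack instead of overlap.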

This code writes the value on each box of the bars.

for p in ax.patches:
    x, y = p.get_xy()
    h = p.get_height()
    ax.annotate(str(h), ((x + p.get_width()/2), (h+y)), ha='center', va='top', color='white')

.patches returns a list of graph components. In this case, they are the rectangles in the graph. We loop over each of them and write its value with .annotate(), using these parameters:

First is str(h), which converts the value of “h” to text. “h” equals p.get_height(), the height of the component, which is a value from “_male_count” or “_female_count” in this case.
Second is the coordinate of the text, consisting of:
x is x + p.get_width()/2. It is the x-axis value at the bottom-left corner plus half the width of the component, which gives the x-axis value at the middle of the component.
y is h + y. It is the height plus the y-axis value at the bottom-left corner of the component, which gives the y-axis value at the top of the component.
ha stands for horizontal alignment. We use “center” to center the text horizontally on that coordinate.
va stands for vertical alignment. We use “top” to place the top of the text at that coordinate, so the text sits just inside the top edge of the component.
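The coordinate arithmetic is easier to follow on a single box. This sketch draws one made-up bar (centered at x = 1, width 0.5, starting at height 5, with height 4) and computes where its label goes.

```python
import matplotlib
matplotlib.use("Agg")  # draws off-screen; skip this line in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.bar([1], [4], 0.5, bottom=[5])  # one hypothetical floating box

p = ax.patches[0]
x, y = p.get_xy()    # bottom-left corner of the box: (0.75, 5)
h = p.get_height()   # height of the box: 4

label_x = x + p.get_width() / 2  # 0.75 + 0.25, the middle of the box
label_y = y + h                  # 5 + 4, the top edge of the box

ax.annotate(str(h), (label_x, label_y), ha="center", va="top", color="white")
```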

plt.xlabel('Ticket class')
plt.ylabel('passenger count')

Use .xlabel() and .ylabel() to label the x-axis and y-axis respectively.

Seaborn

seaborn is another popular Python library for creating beautiful graphs. We usually import it under the name “sns”, which refers to the fictional character “Samuel Norman Seaborn” (source).

Let’s try some of its cool graphs.

Joint plot

sns.jointplot() creates a graph from 2 dimensions of data to reveal their relationship. We can generate different views with the parameter “kind”.

kde (kernel density estimate) shows the data density, with density curves on the margins

scatter shows the data distribution as a scatter chart with histograms

reg (regression) shows the regression trend with histograms and density curves

resid (residual) shows the errors of the regression trend with histograms

hex (hexagon) shows a scatter chart binned into hexagons with heat map colorization and histograms

Reference link: http://alanpryorjr.com/visualizations/seaborn/jointplot/jointplot/

Pair plot

sns.pairplot() plots every pairwise combination of the numeric columns in the data.

Relational plot

sns.relplot() is an advanced scatter plot that allows multiple breakdowns. Here we use “hue” for different colors and “size” for different dot sizes.
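A short sketch with both breakdowns at once, on made-up rows (choosing “Sex” for hue and “Pclass” for size is my own pick for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # draws off-screen; skip this line in a notebook
import pandas as pd
import seaborn as sns

# Hypothetical passengers
toy = pd.DataFrame({
    "Sex": ["female", "male", "female", "male", "female", "male"],
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [38.0, 54.0, 27.0, 35.0, 22.0, 4.0],
    "Fare": [71.3, 51.9, 13.0, 26.0, 7.3, 16.7],
})

# hue colors the dots by "Sex"; size scales the dots by "Pclass"
grid = sns.relplot(data=toy, x="Age", y="Fare", hue="Sex", size="Pclass")
```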

Violin plot

sns.violinplot() shows the density of each breakdown.
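For example, this sketch draws one violin of “Age” per ticket class, again on made-up rows:

```python
import matplotlib
matplotlib.use("Agg")  # draws off-screen; skip this line in a notebook
import pandas as pd
import seaborn as sns

# Hypothetical passengers: three per ticket class
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Age": [38.0, 54.0, 45.0, 27.0, 35.0, 30.0, 22.0, 4.0, 18.0],
})

# Each violin is wider where more ages of that class fall
ax = sns.violinplot(data=toy, x="Pclass", y="Age")
```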

Heat map

sns.heatmap() draws a heat map that visualizes numbers on a color scale.

This function requires numeric values only. Here is an example:

Let’s say we want a heat map of the number of survivors for each “Pclass” and “Sex”. We have to .groupby() and then .sum() over the column “Survived”.

We apply .pivot() to create a pivot table.

Finally, run sns.heatmap() with annot=True to display the numbers.
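These three steps can be sketched as follows. The rows in `toy` are made up, and the .reset_index() between the sum and the pivot is my assumption about how the group keys get back into columns.

```python
import matplotlib
matplotlib.use("Agg")  # draws off-screen; skip this line in a notebook
import pandas as pd
import seaborn as sns

# Hypothetical passengers; Survived is 1 for a survivor, 0 otherwise
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Sex": ["female", "male", "male", "female", "male", "female", "male", "male"],
    "Survived": [1, 0, 1, 1, 0, 1, 0, 0],
})

# Step 1: number of survivors per "Pclass" and "Sex"
survivors = toy.groupby(["Pclass", "Sex"]).Survived.sum().reset_index()

# Step 2: pivot so that rows are "Pclass" and columns are "Sex"
table = survivors.pivot(index="Pclass", columns="Sex", values="Survived")

# Step 3: draw the heat map and print the number in every cell
ax = sns.heatmap(table, annot=True)
```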

Reference link: https://seaborn.pydata.org/api.html

From my side, seaborn is a great tool that lets us create pretty graphs with ease, but sometimes we need matplotlib for more complex or customizable ones.

Hope this is useful for ya

See ya next time.

Bye~

next: Note of data science training EP 4: Scikit-learn & Linear Regression – Linear trending
