prev: Note of data science training EP 1: Intro – unboxing

Hi everyone~

As I mentioned at the end of EP 1, our next step is browsing the data.

When we first obtain data, most of it is raw and cannot be used until we cleanse it. Therefore, this is a good time to introduce the library that I would call the root of data science work. It is ~

Pandas

Pandas is involved in almost everything in our work. Let's look at the data structure we need when working with Pandas: the Dataframe.

A Dataframe is like a simple table. It holds our data in columns, and a single column is called a Series.

On top of that, there is an Index to identify a cell's location in the table. An Index can be one of these:

Row index is the list of labels that identify each row.
An index column must be assigned first; otherwise the rows are labelled by row number.
Column index is the list of labels that identify each column, that is, the column names.

Let's say that, in the sample table above, we assign column A as the index column. Then the value at row index "A3" and column index "C" is "C3". That's it.
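To make the lookup concrete, here is a minimal sketch. The 3x3 table is made up for illustration (each cell is named after its column and row); it is an assumption about how the sample table looks, not the original one.

```python
import pandas as pd

# Hypothetical 3x3 table: each cell is named after its column and row number.
table = pd.DataFrame({
    "A": ["A1", "A2", "A3"],
    "B": ["B1", "B2", "B3"],
    "C": ["C1", "C2", "C3"],
})

# Assign column A as the index column, so its values become the row index.
table = table.set_index("A")

# Look up the cell at row index "A3" and column index "C".
value = table.loc["A3", "C"]
print(value)
```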

Deal with the dataset

Let's play with the Titanic dataset. I got it from Kaggle.com/hesh97.

Read

We write this to read the CSV file downloaded from the link.

import pandas as pd
titanic = pd.read_csv("./titanicdataset-traincsv.csv")
titanic

Display more

By default the output is truncated, so we can set the option below to control how many rows are shown, or to show them all.

# show 15 rows
pd.set_option('display.max_rows', 15)
# show ALL rows
pd.set_option('display.max_rows', None)

Ref:
https://dev.to/chanduthedev/how-to-display-all-rows-from-data-frame-using-pandas-dha

Using brackets

Now let's go through the main functionalities of the Dataframe.

print(type(titanic[['Sex']]))
titanic[['Sex']]

Two pairs of square brackets return a Dataframe of the selected columns.

while...

print(type(titanic['Sex']))
titanic['Sex']

A single pair of square brackets returns a Series of the selected column.

Head and tail

titanic.head()
titanic.tail()

head() and tail() show the first 5 rows and the last 5 rows of the Dataframe respectively. Pass a number to change how many rows are displayed.
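A small sketch of passing a number to head() and tail(). The ten-row Dataframe here is a made-up stand-in so the snippet runs without the CSV file.

```python
import pandas as pd

# Made-up stand-in for the Titanic data: ten passengers with only an id column.
sample = pd.DataFrame({"PassengerId": range(1, 11)})

first_five = sample.head()    # no argument: the first 5 rows
first_three = sample.head(3)  # the first 3 rows
last_two = sample.tail(2)     # the last 2 rows

print(first_three)
print(last_two)
```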

Columns

dataframe.columns

Show all column names of the Dataframe. We can rename the columns by assigning a list of new names to it with the = symbol.
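Renaming by assignment could look like this. The two-column Dataframe and the new names are made up for illustration; the list must have one name per column, in the same order.

```python
import pandas as pd

# Made-up two-column Dataframe.
sample = pd.DataFrame({"Sex": ["male", "female"], "Age": [22.0, 38.0]})
print(list(sample.columns))

# Assign a list with one new name per column, in the same order.
sample.columns = ["gender", "age_years"]
print(list(sample.columns))
```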

Indexes

dataframe.index

Show the Row Index. It is the row number by default.

Shape

dataframe.shape

Display the number of rows and columns of the Dataframe as a tuple (rows, columns).
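Here is a quick sketch of the index and shape attributes on a tiny made-up Dataframe (the names and ages are not from the Titanic data).

```python
import pandas as pd

# Made-up Dataframe with 3 rows and 2 columns.
sample = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cat"],
    "Age": [22.0, 38.0, 26.0],
})

print(sample.index)  # default index: row numbers 0, 1, 2
print(sample.shape)  # (number of rows, number of columns)

# The tuple can be unpacked directly.
n_rows, n_cols = sample.shape
```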

Info

dataframe.info()

Show a summary of the Dataframe: each column's name, non-null count, and data type, plus the memory usage.

Statistics

dataframe.describe()

Display basic statistics (count, mean, standard deviation, minimum, quartiles, and maximum) of the numeric columns of the Dataframe.
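A minimal sketch of info() and describe() together, on made-up values. Note how the text column 'Name' is left out of the statistics and how the missing 'Age' is not counted.

```python
import pandas as pd

# Made-up data: one text column, two numeric columns, one missing age.
sample = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cat", "Dan"],
    "Age": [20.0, 30.0, 40.0, float("nan")],
    "Fare": [10.0, 20.0, 30.0, 40.0],
})

sample.info()              # prints column names, non-null counts and dtypes

stats = sample.describe()  # statistics of the numeric columns only
print(stats)
```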

Location

dataframe.loc[a,b]

Find the values at a specific label of the Row Index and Column Index.

dataframe.iloc[a,b]

Find the values at a specific integer position (counting from 0) of the row and column.
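The difference between the two is easier to see when the row labels are not the row numbers. In this sketch the labels 101 to 103 are made up on purpose.

```python
import pandas as pd

# Made-up Dataframe whose row labels differ from the row positions.
sample = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cat"], "Age": [22.0, 38.0, 26.0]},
    index=[101, 102, 103],
)

by_label = sample.loc[102, "Age"]  # row labelled 102, column labelled "Age"
by_position = sample.iloc[1, 1]    # second row, second column (0-based)

print(by_label, by_position)
```

Both lookups land on the same cell here, one by label and one by position.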

Sorting

dataframe.sort_values()

Sort the rows of the Dataframe. For example, we can sort the Dataframe by 'Age' and put the rows where 'Age' is NaN at the end (na_position='last').
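A sketch of that sorting on a few made-up rows (the real call would be on the titanic Dataframe).

```python
import pandas as pd

# Made-up rows; Bob's age is missing.
sample = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cat", "Dan"],
    "Age": [38.0, float("nan"), 22.0, 26.0],
})

# Sort ascending by 'Age' and put rows with NaN ages at the end.
sorted_by_age = sample.sort_values(by="Age", na_position="last")
print(sorted_by_age)
```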

Handling missing values

dataframe.dropna()

Remove NaN from the Dataframe. For example, with axis=0 and how='any' it removes a whole row when any NaN is found in that row.

dataframe.fillna()

Fill in a given value wherever a cell is NaN. For example, fillna(0) fills in 0.
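Both calls side by side, as a minimal sketch on made-up numbers. Neither call changes the original Dataframe; each returns a new one.

```python
import pandas as pd

# Made-up data with one missing value in each column.
sample = pd.DataFrame({
    "Age": [22.0, float("nan"), 26.0],
    "Fare": [7.25, 71.28, float("nan")],
})

# Drop every row (axis=0) that contains at least one NaN (how='any').
dropped = sample.dropna(axis=0, how="any")

# Replace every NaN with 0 instead of dropping anything.
filled = sample.fillna(0)

print(dropped)
print(filled)
```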

Grouping

dataframe.groupby()

Group the Dataframe by the given columns. For example, dataframe.groupby('Pclass') groups the rows by the values of 'Pclass'.
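groupby() on its own only prepares the groups, so we usually follow it with an aggregation. This sketch counts the rows per class on made-up data; size() is just one possible aggregation, chosen here for illustration.

```python
import pandas as pd

# Made-up passengers: two in class 1, one in class 2, three in class 3.
sample = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Fare": [71.28, 53.10, 13.00, 7.25, 7.93, 8.05],
})

grouped = sample.groupby("Pclass")  # a grouped object, not yet a result
counts = grouped.size()             # number of rows in each group
print(counts)
```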

Managing indexes

dataframe.set_index()

Set the Row Index from the values of a specific column, for example 'PassengerId'.

dataframe.reset_index()

Reset the Row Index back to the default row numbers. The old index becomes a column again; if it had no name, that column is named "index".
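A round trip with both functions, sketched on made-up rows. The last call shows where the column named "index" comes from.

```python
import pandas as pd

# Made-up rows.
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Name": ["Ann", "Bob", "Cat"],
})

# 'PassengerId' moves out of the columns and becomes the Row Index.
indexed = sample.set_index("PassengerId")

# The named index turns back into a regular column.
restored = indexed.reset_index()

# Resetting the default (unnamed) index adds a column called "index".
with_index_col = sample.reset_index()

print(indexed)
print(list(with_index_col.columns))
```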

Example 1

Fill a value into a specific column where it is NaN. For example, fill a text into 'Cabin' where it is NaN, modifying the Dataframe itself (inplace=True).

dataframe.single_column.fillna()
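Here is one way this could be written. It is a sketch with a swapped-in variant: instead of calling fillna() on the column, it passes a dict to the Dataframe's own fillna(), which still modifies the Dataframe itself with inplace=True and avoids the chained-assignment warnings that newer pandas versions raise for the column form. The rows and the filler text "Unknown" are made up.

```python
import pandas as pd

# Made-up rows; Bob has no cabin recorded.
sample = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cat"],
    "Cabin": ["C85", None, "E46"],
})

# Fill missing values in 'Cabin' only, directly on the Dataframe itself.
sample.fillna({"Cabin": "Unknown"}, inplace=True)
print(sample)
```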

Example 2

Group a Dataframe and find the average of a column. For example, group by 'Pclass' and 'Sex', then find the average 'Age'.

dataframe.groupby().single_column.mean()
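The pattern above, sketched on made-up rows. The column is selected with square brackets here, which is equivalent to the dot form; mean() skips NaN by default.

```python
import pandas as pd

# Made-up passengers; one age in class 3 is missing.
sample = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "female", "male", "male", "male", "male"],
    "Age": [30.0, 40.0, 50.0, 20.0, 30.0, float("nan")],
})

# Group by class and sex, then average 'Age' within each group.
mean_age = sample.groupby(["Pclass", "Sex"])["Age"].mean()
print(mean_age)
```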

Full document

Here is the full documentation of Dataframe. We can find many more functionalities there.

matplotlib

matplotlib is a powerful library for plotting graphs quickly. It can be used on its own, but now that we know pandas, we will mostly call it through the plotting functions of pandas.

We import its module named pyplot and rename it as plt.

import matplotlib.pyplot as plt
%matplotlib inline

%matplotlib inline makes the graphs display inside the Jupyter notebook. Without it, we have to call the function show().

We will create graphs from the titanic Dataframe.

1. How about the age of the passengers?

titanic[['Age']].plot.hist()

A histogram is a frequency graph: it counts how many values fall into each range. From it we can see that the most common age range among the passengers is about 20-30 years.

2. Between male and female, which is more?

titanic['Sex'].value_counts().plot(kind='bar')

value_counts() counts the occurrences of each unique value in a Series. We use it to get the number of passengers of each sex, then plot the bar graph.

3. Ratio of sex and survivability

We now use a pie chart as it suits ratio visualization. We group the data by 'Sex' and 'Survived', then count the number of rows in each group before plotting.
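One possible way to write this, as a sketch rather than the original code: group with groupby(), count with size(), and plot with the pie function of pandas. The six rows are made up, and the autopct format is my own choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up passengers: 0 means not survived, 1 means survived.
sample = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Survived": [0, 0, 1, 1, 1, 0],
})

# Count the number of passengers in each (Sex, Survived) group.
group_sizes = sample.groupby(["Sex", "Survived"]).size()

# Draw one slice per group and label each slice with its percentage.
ax = group_sizes.plot.pie(autopct="%1.0f%%")
ax.set_ylabel("")  # hide the default y-axis label
```

Outside Jupyter, call plt.show() afterwards to display the chart.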

4. Relationship of age and fare

We build a scatter graph by assigning 'Age' to the X-axis and 'Fare' to the Y-axis.
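A minimal sketch of such a scatter graph using the scatter function of pandas. The four data points are made up; on the real data the call would be on the titanic Dataframe.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up ages and fares.
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# 'Age' goes on the X-axis and 'Fare' on the Y-axis.
ax = sample.plot.scatter(x="Age", y="Fare")
```

Outside Jupyter, call plt.show() afterwards to display the graph.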

These are just teasers. For fancier graphs, the full documentation is at https://matplotlib.org/contents.html.

See you next time.

Bye~

next: Note of data science training EP 3: Matplotlib & Seaborn – Luxury visualization

Note of data science training EP 2: Pandas & Matplotlib – from a thousand mile above