Note of data science training EP 2: Pandas & Matplotlib – from a thousand miles above
prev: Note of data science training EP 1: Intro – unboxing
Hi all guys~
As I mentioned at the end of EP 1, next we are going to browse the data.
When we first obtain data, most of it is raw and cannot be used until we cleanse it. Therefore, this is a good time to introduce the library that I would call the root of data science work. It is ~
Pandas
Pandas is behind almost everything we do. Let's look at the data structure we need in order to work with Pandas: DataFrame.
A DataFrame is like a simple table. It contains columns that hold our data, and a single column is called a Series.
On top of that, there is an Index to identify a cell's location in the table. An Index can be one of these:
Row index is the list of unique values that label the rows. An index column must be assigned first.
Column index is the list of unique values that label the columns.
Let's say that in the sample table we assign column A as the index column. The value at row index “A3” and column index “C” is then “C3”. That's it.
Deal with the dataset
Let's play with the Titanic dataset. I got it from Kaggle.com/hesh97.
Read
We write this to open the CSV file downloaded from the link.
import pandas as pd
titanic = pd.read_csv("./titanicdataset-traincsv.csv")
titanic
Display more
The displayed result is truncated, so we can write the code below to show all rows.
# show 15 rows
pd.set_option('display.max_rows', 15)
# show ALL rows
pd.set_option('display.max_rows', None)
Ref:
https://dev.to/chanduthedev/how-to-display-all-rows-from-data-frame-using-pandas-dha
Using brackets
Now let's go through the main functionality of DataFrame.
print(type(titanic[['Sex']]))
titanic[['Sex']]
Using two pairs of square brackets returns a DataFrame of the selected columns.
while...
print(type(titanic['Sex']))
titanic['Sex']
A single pair of square brackets, on the other hand, returns a Series of the selected column.
Head and tail
titanic.head()
titanic.tail()
head() and tail() show the first 5 rows and the last 5 rows of the DataFrame, respectively. Pass a number to change how many rows are displayed.
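A tiny sketch of passing a number, using a made-up ten-row frame instead of the real dataset:

```python
import pandas as pd

# Made-up frame with ten rows, for illustration only.
df = pd.DataFrame({"PassengerId": range(1, 11)})

first_three = df.head(3)  # first 3 rows instead of the default 5
last_two = df.tail(2)     # last 2 rows instead of the default 5

print(first_three)
print(last_two)
```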
Columns
dataframe.columns
Shows all the column names of the DataFrame. We can rename the columns by assigning new names with the = symbol.
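A minimal sketch of renaming, on a made-up two-column frame. The new names (`ticket_class`, `sex`, `gender`) are only placeholders I picked for the example. It also shows `rename()`, which is an alternative when only some columns need a new name.

```python
import pandas as pd

# Made-up rows with Titanic-like column names.
df = pd.DataFrame({"Pclass": [3, 1], "Sex": ["male", "female"]})
print(list(df.columns))

# Assigning a list of the same length replaces every column name at once.
df.columns = ["ticket_class", "sex"]

# rename() changes only the columns in the mapping and returns a new DataFrame.
renamed = df.rename(columns={"sex": "gender"})
print(list(renamed.columns))
```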
Indexes
dataframe.index
Shows the index. By default, it is the row number.
Shape
dataframe.shape
Displays the number of rows and columns of the DataFrame.
Info
dataframe.info()
Shows details of the DataFrame, such as the column names, non-null counts and data types.
Statistics
dataframe.describe()
Displays basic statistics of the numeric columns of the DataFrame.
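To see these inspection tools side by side, here is a small sketch on a made-up frame with one text column and one numeric column:

```python
import pandas as pd

# Made-up rows, for illustration only.
df = pd.DataFrame({
    "Name": ["a", "b", "c", "d"],
    "Age": [20.0, 30.0, 40.0, 50.0],
})

print(df.index)   # default index: row numbers 0..3
print(df.shape)   # tuple of (number of rows, number of columns)
df.info()         # prints column names, non-null counts and dtypes

# By default only the numeric column 'Age' is summarized.
stats = df.describe()
print(stats)
```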
Location
dataframe.loc[a,b]
Finds the values at the given labels of the row index and column index.
dataframe.iloc[a,b]
Finds the values at the given integer positions of the row index and column index.
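The difference is easier to see when the row labels are not the same as the row numbers. A minimal sketch, with made-up row labels 101 to 103:

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cat"], "Age": [22, 38, 26]},
    index=[101, 102, 103],  # hypothetical row labels
)

by_label = df.loc[102, "Age"]  # row labelled 102, column labelled 'Age'
by_position = df.iloc[1, 1]    # second row, second column (positions start at 0)

print(by_label, by_position)
```

Both lookups land on the same cell here, one by label and one by position.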
Sorting
dataframe.sort_values()
Sorts the rows of the DataFrame. For example, sort the DataFrame by 'Age' and put the rows where 'Age' is NaN at the end (na_position='last').
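A sketch of that sort on three made-up rows, one of which has no age:

```python
import pandas as pd

# Made-up rows; Bob's age is missing.
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cat"],
    "Age": [38.0, float("nan"), 22.0],
})

# Sort ascending by 'Age' and keep rows with a NaN age at the end.
sorted_df = df.sort_values(by="Age", na_position="last")
print(sorted_df)
```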
Handling missing values
dataframe.dropna()
Removes NaN from the DataFrame. For example, remove a whole row (axis=0) when any NaN is found in that row (how='any').
dataframe.fillna()
Fills in a value wherever there is NaN. For example, fill with 0.
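Both calls can be sketched on a small made-up frame with a couple of gaps. Note that each one returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

nan = float("nan")

# Made-up numeric rows with missing values.
df = pd.DataFrame({
    "Age": [22.0, nan, 26.0],
    "Fare": [7.25, 71.28, nan],
})

# Drop every row (axis=0) that contains at least one NaN (how='any').
dropped = df.dropna(axis=0, how="any")

# Replace every NaN with 0.
filled = df.fillna(0)

print(dropped)
print(filled)
```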
Grouping
dataframe.groupby()
Groups the DataFrame by the given columns. For example, the result can be grouped by the values of 'Pclass'.
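Grouping by itself only prepares the groups; we usually follow it with an aggregation. A minimal sketch on made-up rows, where counting the rows and averaging 'Fare' are just two aggregations I picked as illustrations:

```python
import pandas as pd

# Made-up rows with Titanic-like column names.
df = pd.DataFrame({
    "Pclass": [1, 3, 3, 1, 2],
    "Fare": [71.0, 7.0, 8.0, 53.0, 13.0],
})

grouped = df.groupby("Pclass")

sizes = grouped.size()               # number of rows in each class
mean_fare = grouped["Fare"].mean()   # average fare in each class

print(sizes)
print(mean_fare)
```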
Managing indexes
dataframe.set_index()
Sets the row index from the values of a specific column, for example 'PassengerId'.
dataframe.reset_index()
Resets the row index back to the default row numbers. The old index becomes a regular column, which is named “index” when the old index had no name.
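A small sketch of going back and forth, on made-up rows. It also shows where a column named “index” comes from: resetting an index that has no name.

```python
import pandas as pd

# Made-up rows with Titanic-like column names.
df = pd.DataFrame({"PassengerId": [1, 2, 3], "Name": ["Ann", "Bob", "Cat"]})

# Use 'PassengerId' as the row index.
indexed = df.set_index("PassengerId")

# The named index comes back as a column called 'PassengerId'.
restored = indexed.reset_index()

# The default index has no name, so resetting it adds a column called 'index'.
unnamed = df.reset_index()

print(list(restored.columns))
print(list(unnamed.columns))
```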
Example 1
Fill in a value in a specific column where it is NaN. For example, fill a text into 'Cabin' in the DataFrame itself (inplace=True) where it is NaN.
dataframe.single_column.fillna()
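Here is a hedged sketch of Example 1 on made-up rows. The filler text "Unknown" is my own placeholder. As far as I know, recent pandas versions warn when inplace=True is used on a selected column, because the selection may be a copy, so this sketch assigns the result back to the column instead.

```python
import pandas as pd

# Made-up rows; two passengers have no cabin recorded.
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cat"],
    "Cabin": ["C85", None, None],
})

# Fill the missing cabins and write the result back into the DataFrame.
df["Cabin"] = df["Cabin"].fillna("Unknown")
print(df)
```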
Example 2
Group a DataFrame and find the average of a column. For example, group by 'Pclass' and 'Sex', then find the average 'Age'.
dataframe.groupby().single_column.mean()
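A sketch of Example 2 on made-up rows. I select the column with `["Age"]`, which does the same thing as the attribute style `.Age` in the pattern above.

```python
import pandas as pd

# Made-up rows; one age is missing and is skipped by mean().
df = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3, 3],
    "Sex": ["female", "male", "male", "male", "female", "female"],
    "Age": [38.0, 54.0, 22.0, 30.0, float("nan"), 26.0],
})

# Group by class and sex, then average the ages inside each group.
avg_age = df.groupby(["Pclass", "Sex"])["Age"].mean()
print(avg_age)
```

The result is a Series whose index has two levels, one for each grouping column.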
Full documentation
Here is the full documentation of DataFrame. We can find much more functionality there.
matplotlib
matplotlib is a powerful library for plotting graphs quickly. It can be used on its own, but we will not do that this time, now that we know pandas.
We import its pyplot module and alias it as plt.
import matplotlib.pyplot as plt
%matplotlib inline
%matplotlib inline displays the graphs inline in Jupyter; otherwise we have to call the function show().
We will create graphs from the titanic DataFrame.
1. How about the age of the passengers?
titanic[['Age']].plot.hist()
A histogram is a frequency graph. From it, we can conclude that most passengers are 20-30 years old.
2. Between male and female, which has more passengers?
titanic['Sex'].value_counts().plot(kind='bar')
value_counts() counts the occurrences of each unique value in a Series. We use it to get the number of passengers of each sex, then plot the bar graph.
3. Ratio of sex and survivability
We now use a pie chart, as it suits ratio visualization. We group the data by 'Sex' and 'Survived', then count the size of each group before plotting.
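One way this could look, sketched on six made-up rows rather than the real dataset. Using `size()` for the count and `autopct` for the percentage labels are my own choices here, not necessarily what the original notebook used.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up rows with Titanic-like column names.
df = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Survived": [0, 0, 1, 1, 1, 0],
})

# Count the passengers in each (Sex, Survived) group.
counts = df.groupby(["Sex", "Survived"]).size()

# Draw one wedge per group and label each wedge with its percentage.
ax = counts.plot(kind="pie", autopct="%1.0f%%")
ax.set_ylabel("")  # hide the axis label for a cleaner pie
```

Outside Jupyter, add `plt.show()` at the end to display the chart.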
4. Relationship of age and fare
We build a scatter graph by assigning 'Age' to the X-axis and 'Fare' to the Y-axis.
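A minimal sketch of that scatter graph, again on a few made-up rows in place of the real dataset:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up rows with Titanic-like column names.
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# One point per passenger: age on the X-axis, fare on the Y-axis.
ax = df.plot.scatter(x="Age", y="Fare")
```

Outside Jupyter, add `plt.show()` at the end to display the graph.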
These are just teasers. For fancier plots, the full documentation is here: https://matplotlib.org/contents.html.
See you next time.
Bye~
next: Note of data science training EP 3: Matplotlib & Seaborn – Luxury visualization