Hi everyone~
As promised at the end of EP 1, next we are going to browse the data.
When we first obtain data, most of it is raw and cannot be used until we cleanse it. So this is a good time to introduce the library that I would call the root of data science work. It is ~
Pandas is at the heart of almost all of our work. Let's look at the data structure we need when working with Pandas: the Dataframe.
A Dataframe is like a simple table. It contains columns as our data, and we call a single column a Series.
On top of that, there is an Index to identify a cell location in a table. An Index can be one of these:
Row index is a list of unique values, one for each row. An index column must be assigned first.
Column index is a list of unique values, one for each column.
Let’s say that in the sample table above, we assigned column A as the index column. Then the value at row index “A3” and column index “C” is “C3”. That’s it.
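To make the lookup concrete, here is a minimal sketch. The 3x3 table and its cell labels are made up to mirror the sample (the original picture is not reproduced here), and `table` and `value` are just illustrative names.

```python
import pandas as pd

# Hypothetical 3x3 table: each cell value is simply its own label.
table = pd.DataFrame({
    "A": ["A1", "A2", "A3"],
    "B": ["B1", "B2", "B3"],
    "C": ["C1", "C2", "C3"],
})

# Assign column A as the index column, so its values become the row index.
table = table.set_index("A")

# Look up the cell at row index "A3" and column index "C".
value = table.loc["A3", "C"]
print(value)  # C3
```

Once column A becomes the row index, it is no longer a regular column, so only B and C remain as column index values.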
Deal with the dataset
Let's play with the Titanic dataset. I got it from Kaggle.com/hesh97.
We write this to open the CSV file downloaded from the link.
import pandas as pd
titanic = pd.read_csv("./titanicdataset-traincsv.csv")
titanic
The displayed result is truncated, so we can write the lines below to show more rows, or all of them.
# show 15 rows
pd.set_option('display.max_rows', 15)
# show ALL rows
pd.set_option('display.max_rows', None)
Now let's go ahead and look at the main functionalities of the Dataframe.
Using two pairs of square brackets gives a Dataframe of the selected columns, while a single pair of square brackets gives a Series of the selected column.
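A small sketch of the two bracket styles. The rows are made up and only mimic a few Titanic columns, so that the snippet runs without the CSV file.

```python
import pandas as pd

# Made-up rows that only mimic a few Titanic columns.
titanic = pd.DataFrame({
    "Name": ["Anna", "Ben", "Cara"],
    "Sex": ["female", "male", "female"],
    "Age": [29.0, 35.0, 4.0],
})

ages = titanic["Age"]                # one pair of brackets -> Series
name_age = titanic[["Name", "Age"]]  # two pairs of brackets -> Dataframe

print(type(ages).__name__)      # Series
print(type(name_age).__name__)  # DataFrame
```

The inner brackets in the second form are just a Python list of column names, which is why it works for one column as well as for many.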
Head and tail
head() and tail() show the first 5 rows and the last 5 rows of the Dataframe respectively. Pass a number to change how many rows are displayed.
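As a quick sketch, assuming a made-up Dataframe of 10 passengers (the ids and ages are invented for illustration):

```python
import pandas as pd

# Made-up data: 10 passengers with ids 1..10.
titanic = pd.DataFrame({
    "PassengerId": range(1, 11),
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, 54.0, 2.0, 27.0, 14.0, 4.0],
})

first_five = titanic.head()   # default: first 5 rows
last_three = titanic.tail(3)  # pass a number: last 3 rows

print(first_five)
print(last_three)
```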
Show all of the column names of the Dataframe. We can also rename columns by assigning new names.
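The original snippet for this step is not shown, so the sketch below is an assumption about what is meant. It reads the names through the `columns` attribute and shows two common ways to rename; the new names ("id", "age", "age_years") are hypothetical.

```python
import pandas as pd

# Made-up two-column Dataframe.
titanic = pd.DataFrame({"PassengerId": [1, 2], "Age": [22.0, 38.0]})

# Show all column names.
print(list(titanic.columns))  # ['PassengerId', 'Age']

# Option 1: assign a full list of new names (one name per column).
renamed_all = titanic.copy()
renamed_all.columns = ["id", "age"]

# Option 2: rename only selected columns with a mapping.
renamed_one = titanic.rename(columns={"Age": "age_years"})
```

Option 1 replaces every name at once, while option 2 leaves the untouched columns as they are and returns a new Dataframe.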
Show the index. By default, it is the row number.
Display the number of rows and columns of the Dataframe.
Show the details of the Dataframe.
Display basic statistics of the measurable (numeric) columns of the Dataframe.
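The four inspections above can be sketched as follows. I am assuming they correspond to `index`, `shape`, `info()` and `describe()`, since the original snippets are not shown; the three rows are made up.

```python
import pandas as pd

# Made-up rows; None becomes NaN in the numeric Age column.
titanic = pd.DataFrame({
    "Name": ["Anna", "Ben", "Cara"],
    "Age": [22.0, None, 26.0],
    "Fare": [7.25, 71.28, 7.92],
})

print(titanic.index)  # RangeIndex(start=0, stop=3, step=1)
print(titanic.shape)  # (3, 3)

# Column names, non-null counts and data types.
titanic.info()

# Basic statistics, computed for the numeric columns only.
stats = titanic.describe()
print(stats)
```

Note that describe() skips the text column "Name" and that the count for "Age" ignores the missing value.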
Find the values at a specific value of the Row Index and Column Index.
Find the values at a specific position (order) of the Row Index and Column Index.
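A minimal sketch of both lookups, assuming they refer to `loc` (by value) and `iloc` (by position). The row index values 101 to 103 are hypothetical, chosen so that values and positions clearly differ.

```python
import pandas as pd

titanic = pd.DataFrame(
    {"Name": ["Anna", "Ben", "Cara"], "Age": [29.0, 35.0, 4.0]},
    index=[101, 102, 103],  # hypothetical row index values
)

# By value: row index 102, column index "Age".
by_value = titanic.loc[102, "Age"]

# By position: second row, second column (positions start at 0).
by_position = titanic.iloc[1, 1]

print(by_value, by_position)  # 35.0 35.0
```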
Sort the rows in the Dataframe. In the example, we sort by 'Age' and put a row last when its 'Age' is NaN.
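Here is a sketch of that sort on made-up rows, using `sort_values` with `na_position="last"` (which I assume is what the original example used):

```python
import pandas as pd

# Made-up rows; Ben has no known age.
titanic = pd.DataFrame({
    "Name": ["Anna", "Ben", "Cara", "Dan"],
    "Age": [29.0, None, 4.0, 35.0],
})

# Sort ascending by Age; rows whose Age is NaN go to the end.
sorted_by_age = titanic.sort_values(by="Age", na_position="last")
print(sorted_by_age)
```

The result is a new Dataframe; the original `titanic` keeps its order.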
Remove NaN from the Dataframe. The example removes the whole row (axis=0) when any NaN is found in that row (how='any').
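A small sketch of row-wise removal with `dropna`. The rows are made up, and `how="any"` is my assumption for the cut-off parameter in the text.

```python
import pandas as pd

# Made-up rows: Ben lacks an age, Cara lacks a cabin.
titanic = pd.DataFrame({
    "Name": ["Anna", "Ben", "Cara"],
    "Age": [29.0, None, 4.0],
    "Cabin": ["C85", "E46", None],
})

# Drop a whole row (axis=0) as soon as it contains any NaN.
cleaned = titanic.dropna(axis=0, how="any")
print(cleaned)
```

Only Anna's row survives here, which shows how aggressive this option can be on data with many gaps.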
Fill in a value wherever a cell is NaN. The example fills every NaN with a given value.
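The fill value of the original example is not shown, so this sketch simply assumes 0 as the replacement, on made-up numbers:

```python
import pandas as pd

# Made-up rows with missing values in both columns.
titanic = pd.DataFrame({
    "Age": [29.0, None, 4.0],
    "Fare": [7.25, None, 8.05],
})

# Replace every NaN with 0 (a hypothetical fill value).
filled = titanic.fillna(0)
print(filled)
```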
Group the Dataframe by the given columns. In the example, the result is grouped by the values of those columns.
Set the Row Index from the values of a specific column, 'PassengerId' in the example.
Reset the Row Index and get a new column named “index”.
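A sketch of both operations, assuming they refer to `set_index` and `reset_index`. The rows are made up. Note that the new column is only called "index" when the old Row Index has no name, as with the default row numbers.

```python
import pandas as pd

# Made-up rows with the default row-number index.
titanic = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Name": ["Anna", "Ben", "Cara"],
})

# Use the values of PassengerId as the Row Index.
indexed = titanic.set_index("PassengerId")

# Reset the default (unnamed) Row Index: it moves into a column named "index".
reset = titanic.reset_index()

print(indexed)
print(reset)
```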
Fill a value into a specific column wherever it is NaN. For example, fill a text into 'Cabin' on the Dataframe itself (inplace=True).
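One way to sketch this is to pass a column-to-value mapping to `fillna` together with `inplace=True`. This may differ from the original snippet, and the text "Unknown" is a hypothetical fill value.

```python
import pandas as pd

# Made-up rows: two passengers have no cabin recorded.
titanic = pd.DataFrame({
    "Age": [29.0, None, 4.0],
    "Cabin": ["C85", None, None],
})

# Fill NaN only in the Cabin column, modifying the Dataframe itself.
titanic.fillna({"Cabin": "Unknown"}, inplace=True)
print(titanic)
```

Because only 'Cabin' appears in the mapping, the missing 'Age' stays NaN.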
Group the Dataframe and find the average of a column. In the example, we group by 'Sex' and then find the average.
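The averaged column of the original example is cut off, so this sketch assumes 'Age'. The numbers are made up and chosen so the averages are easy to check by hand.

```python
import pandas as pd

# Made-up rows: two females and two males.
titanic = pd.DataFrame({
    "Sex": ["female", "male", "female", "male"],
    "Age": [30.0, 40.0, 10.0, 20.0],
})

# Group by Sex, then average the Age within each group.
mean_age = titanic.groupby("Sex")["Age"].mean()
print(mean_age)
# female: (30 + 10) / 2 = 20.0, male: (40 + 20) / 2 = 30.0
```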
The full documentation of the Dataframe has many more functionalities to explore.
matplotlib is a powerful library for plotting a graph quickly. It can be used on its own, but not this time, since we now know Pandas.
We don't import the whole library but only a module named pyplot, and rename it as plt.
import matplotlib.pyplot as plt
%matplotlib inline
%matplotlib inline displays the graphs inside the Jupyter notebook; otherwise we have to call the function plt.show().
We will create graphs from the Dataframe.
1. How old are the passengers?
A histogram is a frequency graph. We can conclude that most passengers are 20-30 years old.
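A minimal sketch of such a histogram. The ages are made up (so the shape will not match the real dataset), and the number of bins is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up ages, including one missing value.
titanic = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 28.0, 2.0, 54.0, 24.0, None],
})

fig, ax = plt.subplots()
# Drop NaN first, then count how many ages fall into each of 6 bins.
counts, bin_edges, _ = ax.hist(titanic["Age"].dropna(), bins=6)
ax.set_xlabel("Age")
ax.set_ylabel("Number of passengers")
# In a plain script, call plt.show() to display the figure.
```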
2. Are there more male or female passengers?
value_counts() counts the number of occurrences of each unique value in a Series. We use it to get the number of passengers of each sex, then plot a bar graph.
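This can be sketched as below, on made-up rows (three males and two females, not the real proportions):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows.
titanic = pd.DataFrame({
    "Sex": ["male", "female", "male", "male", "female"],
})

# Count how many times each unique value appears in the Series.
sex_counts = titanic["Sex"].value_counts()
print(sex_counts)

fig, ax = plt.subplots()
# One bar per sex; the bar height is the count.
ax.bar(list(sex_counts.index), sex_counts.values)
ax.set_ylabel("Number of passengers")
# In a plain script, call plt.show() to display the figure.
```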
3. Ratio of sex and survivability
We now use a pie chart, as it suits visualizing ratios. Our data is grouped by 'Sex' and 'Survived', then we count the number in each group before plotting.
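A sketch under the assumption that the grouping is by 'Sex' and 'Survived' together. The rows and the label format are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows; Survived is 1 for yes and 0 for no.
titanic = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "male", "female"],
    "Survived": [0, 1, 0, 1, 1, 0],
})

# Number of passengers in each (Sex, Survived) group.
group_sizes = titanic.groupby(["Sex", "Survived"]).size()
print(group_sizes)

# Build one readable label per group (hypothetical label format).
labels = [f"{sex}, survived={survived}" for sex, survived in group_sizes.index]

fig, ax = plt.subplots()
ax.pie(group_sizes.values, labels=labels, autopct="%1.0f%%")
# In a plain script, call plt.show() to display the figure.
```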
4. Relationship of age and fare
We build a scatter graph by assigning 'Age' to the X-axis and 'Fare' to the Y-axis.
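A minimal sketch, assuming the Y-axis column is 'Fare' as the question title suggests. The four rows are made up.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows with an age and a ticket fare for each passenger.
titanic = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 51.86],
})

fig, ax = plt.subplots()
# One point per passenger: X is Age, Y is Fare.
ax.scatter(titanic["Age"], titanic["Fare"])
ax.set_xlabel("Age")
ax.set_ylabel("Fare")
# In a plain script, call plt.show() to display the figure.
```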
These are just teasers. For fancier graphs, the full documentation is here: https://matplotlib.org/contents.html.
See you next time.