prev: Note of data science training EP 1: Intro – unboxing

Hi everyone~

As I mentioned at the end of EP 1, our next step is browsing the data.

When we first obtain data, most of it is raw and cannot be used until we cleanse it. So this is a good time to introduce the library that I would call the root of data science work. It is ~

# Pandas

Pandas is almost everything for our work. Let’s look at the data structure we need when working with Pandas, the `DataFrame`.

A `DataFrame` is like a simple table. It contains columns as our data, and a single column is called a `Series`.

On top of that, there is the `Index`, which identifies a cell location in a table. An `Index` can be one of these:

- `Row index` is a list of labels, usually unique, that identify each row. An index column must be assigned first.
- `Column index` is a list of labels, usually unique, that identify each column.

Let’s say that, in the sample table above, we assign column A as the index column. The value at `row index` “A3” and `column index` “C” is then “C3”. That’s it.
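To make the lookup above concrete, here is a minimal sketch. The table is made up to mirror the A/B/C naming in the text; it is not the original sample table.

```python
import pandas as pd

# Made-up table that mirrors the A/B/C naming used in the text.
table = pd.DataFrame({
    "A": ["A1", "A2", "A3"],
    "B": ["B1", "B2", "B3"],
    "C": ["C1", "C2", "C3"],
})

# Assign column A as the index column (it becomes the row index).
table = table.set_index("A")

# Look up the cell at row index "A3" and column index "C".
value = table.loc["A3", "C"]
print(value)  # C3
```

After `set_index("A")`, column A is no longer a regular column, so only B and C remain in the column index.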

# Deal with the dataset

Let’s play with the Titanic dataset. I got it from Kaggle.com/hesh97.

## Read

We write this to open the CSV file from the link.

```
import pandas as pd
titanic = pd.read_csv("./titanicdataset-traincsv.csv")
titanic
```

## Display more

The displayed result is truncated, so we can write the lines below to show more rows, or all of them.

```
# show 15 rows
pd.set_option('display.max_rows', 15)
# show ALL rows
pd.set_option('display.max_rows', None)
```

Ref:

https://dev.to/chanduthedev/how-to-display-all-rows-from-data-frame-using-pandas-dha

## Using brackets

Now let’s look at the main functionalities of `DataFrame`.

```
print(type(titanic[['Sex']]))
titanic[['Sex']]
```

Using two pairs of square brackets returns a `DataFrame` of the selected columns,

while...

```
print(type(titanic['Sex']))
titanic['Sex']
```

A single pair of square brackets returns a `Series` of the selected column.

## Head and tail

```
titanic.head()
titanic.tail()
```

`head()` and `tail()` show the first 5 rows and the last 5 rows of the `DataFrame`, respectively. Pass a number to change how many rows are displayed.
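Passing a number could look like the sketch below. It uses a made-up 10-row table standing in for the Titanic data, so the row counts are easy to check.

```python
import pandas as pd

# Made-up stand-in for the Titanic table: 10 rows with ids 1..10.
passengers = pd.DataFrame({"PassengerId": range(1, 11)})

first_three = passengers.head(3)  # first 3 rows instead of the default 5
last_two = passengers.tail(2)     # last 2 rows instead of the default 5

print(first_three["PassengerId"].tolist())  # [1, 2, 3]
print(last_two["PassengerId"].tolist())     # [9, 10]
```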

## Columns

`dataframe.columns`

Shows all the column names of the `DataFrame`. We can rename columns by assigning new names with the `=` symbol.
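Here is a small sketch of reading and renaming columns. The table and the new names (`gender`, `age_years`) are made up for illustration.

```python
import pandas as pd

# Made-up two-column table; the values are illustrative only.
df = pd.DataFrame({"Sex": ["male", "female"], "Age": [22.0, 38.0]})

print(list(df.columns))  # ['Sex', 'Age']

# Rename every column at once by assigning a list of the same length.
df.columns = ["gender", "age_years"]
print(list(df.columns))  # ['gender', 'age_years']
```

The assigned list must have exactly one name per existing column, otherwise pandas raises an error.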

## Indexes

`dataframe.index`

Shows the index. By default it is the row number, starting from 0.

## Shape

`dataframe.shape`

Displays the number of rows and columns of the `DataFrame`.
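A quick sketch of the default index and the shape, assuming a made-up table with 3 rows and 2 columns:

```python
import pandas as pd

# Made-up table with 3 rows and 2 columns.
df = pd.DataFrame({"Sex": ["male", "female", "female"], "Age": [22.0, 38.0, 26.0]})

# No index column was assigned, so the index is the row number from 0.
print(df.index)  # RangeIndex(start=0, stop=3, step=1)

# shape is a (rows, columns) tuple, so it can be unpacked.
n_rows, n_cols = df.shape
print(n_rows, n_cols)  # 3 2
```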

## Info

`dataframe.info()`

Shows details of the `DataFrame`, such as the data type and the non-null count of each column.

## Statistics

`dataframe.describe()`

Displays basic statistics of the numeric columns of the `DataFrame`.
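As a rough sketch of `info()` and `describe()`, here is a made-up table with one text column and one numeric column that has a missing value:

```python
import pandas as pd

# Made-up table: one text column, one numeric column with a missing value.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [20.0, 30.0, 40.0, None],
})

# Prints column names, non-null counts and data types.
df.info()

# By default describe() keeps only the numeric column, 'Age' here.
stats = df.describe()
print(stats.loc["count", "Age"])  # 3.0
print(stats.loc["mean", "Age"])   # 30.0
```

The count is 3 rather than 4 because `NaN` is skipped in the statistics.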

## Location

`dataframe.loc[a,b]`

Finds the values at a specific label of the `Row Index` and `Column Index`.

`dataframe.iloc[a,b]`

Finds the values at a specific integer position (order) of the row and column.
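The difference between label and position is easier to see when the row labels are not 0, 1, 2. A minimal sketch, assuming made-up row labels 101 to 103:

```python
import pandas as pd

# Made-up table whose row labels (101..103) differ from row positions (0..2).
df = pd.DataFrame(
    {"Sex": ["male", "female", "female"], "Age": [22.0, 38.0, 26.0]},
    index=[101, 102, 103],
)

by_label = df.loc[102, "Age"]  # row labelled 102, column labelled 'Age'
by_position = df.iloc[1, 1]    # second row, second column (positions start at 0)

print(by_label, by_position)  # 38.0 38.0
```

Both calls reach the same cell: one by its names, the other by its order.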

## Sorting

`dataframe.sort_values()`

Sorts the rows of the `DataFrame`. For example, we can sort the `DataFrame` by `'Age'` and put the rows last when `'Age'` is `NaN` (`na_position='last'`).
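A minimal sketch of that sort, using a made-up table with one missing age:

```python
import pandas as pd

# Made-up table; passenger 'b' has no recorded age.
df = pd.DataFrame({"Name": ["a", "b", "c", "d"], "Age": [38.0, None, 22.0, 26.0]})

# Sort ascending by 'Age' and push rows with NaN age to the end.
sorted_df = df.sort_values(by="Age", na_position="last")

print(sorted_df["Name"].tolist())  # ['c', 'd', 'a', 'b']
```

`sort_values()` returns a new `DataFrame`, so `df` itself keeps its original order.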

## Handling missing values

`dataframe.dropna()`

Removes `NaN` from the `DataFrame`. For example, we can remove a whole row (`axis=0`) when any `NaN` is found in that row (`how='any'`).

`dataframe.fillna()`

Fills in a value wherever there is a `NaN`. For example, we can fill in `0`.
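Both calls could be sketched as below, assuming a made-up numeric table with a couple of gaps:

```python
import pandas as pd

# Made-up numeric table with missing values in different rows.
df = pd.DataFrame({
    "Age": [22.0, None, 26.0],
    "Fare": [7.25, 71.28, None],
})

# Drop every row (axis=0) that contains at least one NaN (how='any').
dropped = df.dropna(axis=0, how="any")
print(len(dropped))  # 1

# Replace every NaN with 0 instead of dropping anything.
filled = df.fillna(0)
print(filled["Age"].tolist())  # [22.0, 0.0, 26.0]
```

Neither call changes `df`; each returns a new `DataFrame`.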

## Grouping

`dataframe.groupby()`

Groups the `DataFrame` by the given columns. For example, we can group the result by the values of `'Pclass'`.
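`groupby()` on its own only prepares the groups; we still need an aggregation to see a result. A small sketch with made-up fares, where `size()` and `sum()` are my choice of aggregations:

```python
import pandas as pd

# Made-up table of ticket classes and fares.
df = pd.DataFrame({
    "Pclass": [1, 3, 3, 2, 1],
    "Fare": [70.0, 8.0, 7.0, 13.0, 50.0],
})

grouped = df.groupby("Pclass")

# Number of rows in each class, ordered by class 1, 2, 3.
counts = grouped.size()
print(counts.tolist())  # [2, 1, 2]

# Total fare paid in each class.
total_fare = grouped["Fare"].sum()
print(total_fare.tolist())  # [120.0, 13.0, 15.0]
```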

## Managing indexes

`dataframe.set_index()`

Sets the `Row Index` from the values of a specific column, for example `'PassengerId'`.

`dataframe.reset_index()`

Resets the `Row Index` back to the default row numbers. The old index becomes a regular column, named “index” if the old index had no name.
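Here is a minimal sketch of setting and resetting the index on a made-up three-passenger table:

```python
import pandas as pd

# Made-up table of three passengers.
df = pd.DataFrame({"PassengerId": [1, 2, 3], "Sex": ["male", "female", "female"]})

# Use 'PassengerId' as the row index, then look a row up by its id.
indexed = df.set_index("PassengerId")
print(indexed.loc[2, "Sex"])  # female

# Resetting a named index turns it back into a column with that name.
restored = indexed.reset_index()
print(list(restored.columns))  # ['PassengerId', 'Sex']

# Resetting the default, unnamed index adds a column called 'index'.
with_index_column = df.reset_index()
print(list(with_index_column.columns))  # ['index', 'PassengerId', 'Sex']
```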

## Example 1

Fill a value into a specific column where it is `NaN`. For example, fill a text into `'Cabin'` where it is `NaN`, modifying the `DataFrame` itself (`inplace=True`).

`dataframe.single_column.fillna()`
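A sketch of this idea on a made-up table, with `"Unknown"` as a hypothetical filler text. Note one swap: instead of `inplace=True` on the selected column, which newer pandas versions may warn about, this sketch assigns the filled column back to the `DataFrame`.

```python
import pandas as pd

# Made-up table; two passengers have no recorded cabin.
df = pd.DataFrame({"Name": ["a", "b", "c"], "Cabin": ["C85", None, None]})

# Fill the missing cabins and write the result back into the same column.
# "Unknown" is a hypothetical placeholder text, not from the original post.
df["Cabin"] = df["Cabin"].fillna("Unknown")

print(df["Cabin"].tolist())  # ['C85', 'Unknown', 'Unknown']
```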

## Example 2

Group a `DataFrame` and find the average of a column. For example, group by `'Pclass'` and `'Sex'`, then find the average `'Age'`.

`dataframe.groupby().single_column.mean()`
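The pattern above could look like this on a made-up table (the real Titanic averages will of course differ):

```python
import pandas as pd

# Made-up table; the last passenger has no recorded age.
df = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Sex": ["female", "male", "male", "male", "female"],
    "Age": [30.0, 40.0, 20.0, 30.0, None],
})

# Group by class and sex, then average 'Age' within each group.
mean_age = df.groupby(["Pclass", "Sex"])["Age"].mean()

# The result is a Series indexed by (Pclass, Sex) pairs.
print(mean_age.loc[(3, "male")])  # 25.0
```

A group whose ages are all `NaN`, like third-class females here, gets `NaN` as its average.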

## Full document

The full documentation of `DataFrame` has many more functionalities to explore.

# matplotlib

`matplotlib` is a powerful library for plotting graphs quickly. It can be used on its own, but we won’t do that this time now that we know `pandas`.

We `import` its module named `pyplot` and rename it as `plt`.

```
import matplotlib.pyplot as plt
%matplotlib inline
```

`%matplotlib inline` displays the graphs inline in Jupyter; without it we have to call the function `show()`.

We will create graphs from the Titanic `DataFrame`.

## 1. How about the age of the passengers?

`titanic[['Age']].plot.hist()`

A histogram is a frequency graph. From it we can conclude that most passengers are 20-30 years old.

## 2. Between male and female, which is more?

`titanic['Sex'].value_counts().plot(kind='bar')`

`value_counts()` counts the number of occurrences of each unique value in a `Series`. We use it to get the number of passengers of each sex, then plot the bar graph.

## 3. Ratio of sex and survivability

We now use a pie chart as it suits ratio visualization. Our data is grouped by `'Sex'` and `'Survived'`, and we count the number in each group before plotting.
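One way to sketch this, assuming a made-up six-passenger table in place of the real data. The `Agg` backend line is only there so the sketch runs outside Jupyter, and `autopct` is my own addition to print percentages on the slices.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this also runs outside Jupyter
import pandas as pd

# Made-up table: 1 means survived, 0 means did not survive.
df = pd.DataFrame({
    "Sex": ["male", "male", "female", "female", "female", "male"],
    "Survived": [0, 0, 1, 1, 0, 1],
})

# Count the passengers in each (Sex, Survived) group.
group_sizes = df.groupby(["Sex", "Survived"]).size()

# One slice per group; autopct prints each slice's share as a percentage.
ax = group_sizes.plot(kind="pie", autopct="%1.0f%%")
ax.set_ylabel("")  # hide the default y-axis label
```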

## 4. Relationship of age and fare

We build a scatter graph by assigning `'Age'` to the X-axis and `'Fare'` to the Y-axis.
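A minimal sketch of that scatter graph with made-up ages and fares. As before, the `Agg` backend is only an assumption so the code runs outside Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this also runs outside Jupyter
import pandas as pd

# Made-up ages and fares standing in for the Titanic columns.
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.1],
})

# One point per passenger: 'Age' on the X-axis, 'Fare' on the Y-axis.
ax = df.plot.scatter(x="Age", y="Fare")
```

pandas labels the axes with the column names automatically.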

These are just teasers. For fancier plots, the full documentation is at https://matplotlib.org/contents.html.

See you next time.

Bye~

next: Note of data science training EP 3: Matplotlib & Seaborn – Luxury visualization