prev: Note of data science training EP 3: Matplotlib & Seaborn – Luxury visualization

In EP 3, we learned how to create some graphs. In this episode, we are going to analyze data in a more serious way.

One piece of basic data science knowledge is linear regression, which fits a straight line \(Y = aX + b\) to the data. The best line is the one that passes through, or as close as possible to, most of the dots on the plane.

The line in the figure above is not a very good fit. Dots that lie far from the line are called “outliers”, and the distance between a dot and the line is its error. When we find too many outliers, either our line does not fit well enough or the data has so many errors that no line can be drawn through it.
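To see how much a single outlier can pull an ordinary least-squares line, here is a minimal sketch in plain Python. The data points and the helper name `fit_line` are made up for illustration; they are not from the training material.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


xs = [0, 1, 2, 3, 4, 5]
clean_ys = [1, 3, 5, 7, 9, 11]   # exactly y = 2x + 1
noisy_ys = [1, 3, 5, 7, 9, 41]   # same data, but the last dot is an outlier

clean_a, clean_b = fit_line(xs, clean_ys)
noisy_a, noisy_b = fit_line(xs, noisy_ys)

print(clean_a, clean_b)  # slope 2, intercept 1
print(noisy_a, noisy_b)  # slope jumps above 6 because of one bad dot
```

This sensitivity is the reason the robust methods below exist.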

Here are sample ways to create a good line.

# Theil-Sen estimator

The Theil-Sen estimator draws a line through each pair of dots (or through randomly chosen subsets when there are many dots). In the end, it takes the **median** of the slopes of those lines. It is simple and tolerates a moderate share of outliers, but it produces inaccurate results when there are too many.
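The idea can be sketched for a single feature in a few lines of plain Python. This is a rough illustration, not scikit-learn's implementation: it uses every pair instead of random subsets, and taking the median of \(y - ax\) as the intercept is one common convention that I am assuming here. The data is made up.

```python
from itertools import combinations
from statistics import median


def theil_sen(xs, ys):
    """Single-feature Theil-Sen sketch: median of all pairwise slopes."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1  # skip vertical pairs, whose slope is undefined
    ]
    a = median(slopes)
    # Assumed convention: intercept is the median of the leftover offsets.
    b = median(y - a * x for x, y in zip(xs, ys))
    return a, b


xs = [0, 1, 2, 3, 4, 5]
ys = [1, 3, 5, 7, 9, 41]  # y = 2x + 1 with one outlier at the end

a, b = theil_sen(xs, ys)
print(a, b)  # 2.0 1.0
```

Only 5 of the 15 pairwise slopes involve the outlier, so the median ignores them.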

# RANSAC algorithm

**RANSAC** stands for **RA**ndom **SA**mple **C**onsensus. It looks for the best line, the one that passes through (or close to) the maximum number of dots.

This algorithm depends on the slope, given by \(slope=\frac{y_1-y_2}{x_1-x_2}\). This means it resists outliers along the Y-axis but not ones along the X-axis.

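According to the scikit-learn documentation, RANSAC is generally faster than Theil-Sen and scales better with the number of samples.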
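Here is a minimal sketch of the consensus idea for a single feature, in plain Python. The function name `ransac_line`, the parameters `n_trials`, `threshold`, and `seed`, and the data are all hypothetical; a real implementation would also refit the line on the inliers at the end, which this sketch skips.

```python
import random


def ransac_line(points, n_trials=200, threshold=1.0, seed=0):
    """Try random two-point lines and keep the one with the most inliers."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(n_trials):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # vertical line, slope undefined
        a = (y1 - y2) / (x1 - x2)
        b = y1 - a * x1
        # A dot agrees with the line if its vertical distance is small.
        inliers = [(x, y) for x, y in points if abs(y - (a * x + b)) <= threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (a, b), inliers
    return best_model, best_inliers


points = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 9), (5, 41), (6, 13)]
(a, b), inliers = ransac_line(points)
print(a, b)          # 2.0 1.0
print(len(inliers))  # 6, the outlier (5, 41) is left out
```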

# Huber regression

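**Huber** regression uses a threshold \(\epsilon\) (epsilon), which must be at least 1.0 in scikit-learn. Dots whose error is within epsilon are treated as normal, while dots whose error exceeds epsilon are treated as outliers and get less influence when finding the linear formula.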

Huber is faster than the first two.
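The effect of epsilon is easiest to see in the loss function itself. Below is a sketch of the textbook Huber loss in plain Python; scikit-learn's `HuberRegressor` uses a scaled variant and also estimates the scale of the errors, so treat this as an illustration of the idea only. The default of 1.35 matches scikit-learn's default epsilon.

```python
def huber_loss(residual, epsilon=1.35):
    """Textbook Huber loss: quadratic near zero, linear beyond epsilon."""
    r = abs(residual)
    if r <= epsilon:
        return 0.5 * r ** 2
    return epsilon * (r - 0.5 * epsilon)


small = huber_loss(1.0)   # inside epsilon: same as 0.5 * error^2
large = huber_loss(10.0)  # outside epsilon: grows linearly, not quadratically

print(small)                  # 0.5
print(large, 0.5 * 10.0 ** 2) # about 12.59 versus 50.0 for squared loss
```

Because a big error is charged far less than under squared loss, a few outliers cannot drag the line as much.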

Reference link: https://scikit-learn.org/stable/auto_examples/linear_model/plot_robust_fit.html

That was just the lecture. Now let’s go code in Jupyter.

# Scikit-learn

Introducing `sklearn`, or the scikit-learn library. This is a great tool for data analysis and prediction.

We shall `import sklearn.linear_model`, which is a collection of linear regression models. We also `import sklearn.model_selection` for splitting the data in this case.

We try the Titanic data, using the columns “Pclass”, “Age”, and “Fare”.

Now we want to predict “Fare” from “Pclass” and “Age”. Therefore, we assign `x` as “Pclass” and “Age”, and `y` as “Fare”.

Run `sklearn.model_selection.train_test_split()` to split both `x` and `y` into two groups each, a training group and a testing group, with the testing group taking 10% of the data (`test_size=0.1`).

Reference link: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
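As a small sketch of the split, here are ten made-up rows standing in for the Titanic columns (these are not the real data). The `random_state` argument is my own addition so the split is repeatable.

```python
from sklearn.model_selection import train_test_split

# Made-up rows of [Pclass, Age] and matching made-up fares.
x = [[3, 22.0], [1, 38.0], [3, 26.0], [1, 35.0], [3, 35.0],
     [2, 54.0], [3, 2.0], [2, 27.0], [2, 14.0], [1, 58.0]]
y = [7.0, 71.0, 8.0, 53.0, 8.0, 26.0, 21.0, 13.0, 30.0, 27.0]

# test_size=0.1 keeps 10% of the rows aside for testing.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.1, random_state=42
)

print(len(x_train), len(x_test))  # 9 1
```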

## Scikit-learn with Theil-Sen

Now that the data is prepared, let’s go for **Theil-Sen** first.

We create a `TheilSenRegressor` object, run `fit()` with the training group of `x` and `y`, then `predict()` with the testing group of `x`, and… Gotcha! We get the predicted result of Theil-Sen.
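A rough sketch of these steps, assuming synthetic data in place of the Titanic columns: the "true" coefficients (-10, -0.1, 60), the noise level, and every `random_state` value are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import TheilSenRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for Pclass, Age and Fare (not the real Titanic data).
rng = np.random.default_rng(0)
n = 100
pclass = rng.integers(1, 4, size=n)   # classes 1, 2, 3
age = rng.uniform(1, 80, size=n)
x = np.column_stack([pclass, age])
y = -10.0 * pclass - 0.1 * age + 60.0 + rng.normal(0, 2, size=n)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.1, random_state=0
)

model = TheilSenRegressor(random_state=0)
model.fit(x_train, y_train)        # learn from the training group
y_pred = model.predict(x_test)     # predict for the testing group

print(model.coef_, model.intercept_)
print(y_pred[:3])
```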

We can show the results as below:

- `coef_` is the slope, or \(a\) from \(Y = aX + b\)
- `intercept_` is \(b\)

The formula from Theil-Sen estimator is \(fare=-13.19\times Pclass - 0.04\times age + 51.49\).
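To make the formula concrete, here is a tiny sketch that plugs numbers into it. The coefficients are copied from the Theil-Sen formula above; the passengers (class 1 or 3, age 30) and the helper name `predict_fare` are hypothetical.

```python
def predict_fare(pclass, age):
    """Apply the Theil-Sen formula quoted in the text."""
    return -13.19 * pclass - 0.04 * age + 51.49


# A hypothetical 30-year-old in first class versus third class.
first_class = predict_fare(1, 30)
third_class = predict_fare(3, 30)

print(round(first_class, 2))  # 37.1
print(round(third_class, 2))  # 10.72
```

Each step down in class lowers the predicted fare by 13.19, while age barely moves it.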

We use `y_train.values.ravel()` to flatten `y_train` into a one-dimensional array, which fixes a data type issue when calling `fit()` (ref).
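A small sketch of what `ravel()` does, assuming `y_train` was a one-column DataFrame so that `.values` is a column-shaped array. The NumPy array below is a made-up stand-in for `y_train.values`.

```python
import numpy as np

# Stand-in for `y_train.values` of a one-column DataFrame: shape (3, 1).
y_values = np.array([[7.0], [71.0], [8.0]])

flat = y_values.ravel()  # flatten to one dimension: shape (3,)

print(y_values.shape, flat.shape)  # (3, 1) (3,)
```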

## Scikit-learn with RANSAC

Second, **RANSAC**. Create a `RANSACRegressor()` and repeat the method. Now we get this formula: \(fare=-10.86\times Pclass + 0.02\times age + 40.80\).

## Scikit-learn with Huber

Last one, **Huber**, as `HuberRegressor()`. We get \(fare=-21.23\times Pclass - 0.25\times age + 79.99\).
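A sketch of the same steps for RANSAC and Huber, again assuming synthetic data rather than the Titanic columns (the coefficients -10, -0.1, 60 are made up, as are `random_state=0` and `max_iter=1000`). One detail worth knowing: `RANSACRegressor` wraps an inner estimator, so its slope and intercept are read from `estimator_`.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, RANSACRegressor

# Synthetic stand-in for Pclass, Age and Fare (not the real Titanic data).
rng = np.random.default_rng(0)
n = 100
pclass = rng.integers(1, 4, size=n)
age = rng.uniform(1, 80, size=n)
x = np.column_stack([pclass, age])
y = -10.0 * pclass - 0.1 * age + 60.0 + rng.normal(0, 2, size=n)

ransac = RANSACRegressor(random_state=0).fit(x, y)
huber = HuberRegressor(max_iter=1000).fit(x, y)

# RANSAC keeps the fitted line on its inner estimator.
print(ransac.estimator_.coef_, ransac.estimator_.intercept_)
# Huber exposes coef_ and intercept_ directly.
print(huber.coef_, huber.intercept_)
```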

## Comparison

With all three done, it is time to plot. The x-axis is the real value, that is, the testing group of `y`, and the y-axis is the predicted result.

Just by eyeballing, Huber looks better, but we need some tools to determine this precisely.
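A plot of this kind could be sketched with matplotlib as follows. All numbers are made up purely to show the layout; they are not the Titanic results. The dashed diagonal is my own addition: dots on it would be perfect predictions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside Jupyter
import matplotlib.pyplot as plt

# Made-up actual fares and predictions, for illustration only.
y_test = [7.0, 13.0, 26.0, 52.0, 80.0]
predictions = {
    "Theil-Sen": [11.0, 12.5, 24.0, 38.0, 40.0],
    "RANSAC": [10.0, 14.0, 22.0, 30.0, 31.0],
    "Huber": [9.0, 15.0, 30.0, 55.0, 60.0],
}

fig, ax = plt.subplots()
for name, y_pred in predictions.items():
    ax.scatter(y_test, y_pred, label=name)  # x = actual, y = predicted

# Reference diagonal where predicted equals actual.
lo, hi = min(y_test), max(y_test)
ax.plot([lo, hi], [lo, hi], linestyle="--", color="gray", label="perfect prediction")

ax.set_xlabel("actual Fare")
ax.set_ylabel("predicted Fare")
ax.legend()
```

The closer a model's dots sit to the diagonal, the better its predictions.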

# Metrics

scikit-learn provides `sklearn.metrics` for evaluating predictions. This time we use these 3:

- `r2_score()`: \(r^2\) is the coefficient of determination. Higher is better. (wikipedia)
- `median_absolute_error()`: \(MedAE\) is the median of the absolute errors between prediction and actual. Lower is better. (O’Reilly)
- `mean_absolute_error()`: \(MAE\) is the mean of the absolute errors between prediction and actual. Lower is better. (wikipedia)

From the results above, we can conclude that Huber gives the most precise linear regression formula.
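For a feel of how the three metrics behave, here is a small sketch on four made-up values (not the Titanic results), small enough to check by hand.

```python
from sklearn.metrics import mean_absolute_error, median_absolute_error, r2_score

# Made-up actual and predicted values; absolute errors are 2, 2, 3, 10.
y_true = [10, 20, 30, 40]
y_pred = [12, 18, 33, 50]

r2 = r2_score(y_true, y_pred)
medae = median_absolute_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)

print(r2)     # about 0.766  (1 - 117/500)
print(medae)  # 2.5          (median of 2, 2, 3, 10)
print(mae)    # 4.25         (mean of 2, 2, 3, 10)
```

Notice that the single large error of 10 raises the mean but hardly touches the median, which is why it helps to look at both.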

This episode suddenly attacked us with lots of mathematics. LOL.

Let’s see what’s next.

See ya. Bye~

next: Note of data science training EP 5: Logistic Regression & Dummy Classifier – Divide and Predict