
From EP 3 we now understand how to create some graphs. This episode we are going to analyze data in a serious way.

One piece of basic data science knowledge is linear regression: calculating a straight line \(Y = aX + b\) that is best when it passes through, or as close as possible to, the dots in the plane.

The figure above is not a very good line. We call the dots that sit far from the line “outliers”. They are errors. When we find too many outliers, it means either our line does not fit well enough, or there are so many errors that no sensible line can be drawn.

Here are some example methods for creating a good line.

Theil-Sen estimator

The Theil-Sen estimator randomly picks pairs of dots and creates a line through each pair. In the end, it takes the median of those lines’ slopes. Its benefit is speed, but it is not good when there are too many outliers: it produces inaccurate results.
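
To make the idea concrete, here is a tiny NumPy sketch of the median-of-slopes trick. This is my own toy version, not scikit-learn’s actual implementation; the function name and the sampling scheme are made up:

```python
import numpy as np

def theil_sen_line(x, y, n_pairs=1000, seed=0):
    """Toy Theil-Sen: median slope over random pairs of dots."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(x), size=n_pairs)
    j = rng.integers(0, len(x), size=n_pairs)
    keep = x[i] != x[j]                                  # avoid dividing by zero
    slopes = (y[i] - y[j])[keep] / (x[i] - x[j])[keep]   # slope of each pair
    a = np.median(slopes)                                # the median resists outliers
    b = np.median(y - a * x)                             # median residual as intercept
    return a, b
```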

RANSAC algorithm

RANSAC stands for RAndom SAmple Consensus. It repeatedly fits a line to small random samples of dots and keeps the line that passes close to the maximum number of dots.

The algorithm depends on the slope, \(slope=\frac{y_1-y_2}{x_1-x_2}\), which means it resists outliers on the Y-axis but not ones on the X-axis.

According to the scikit-learn docs, RANSAC is faster than Theil-Sen and scales better with the number of samples.
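
For intuition, a toy RANSAC loop could look like this. Again this is my own sketch: the threshold and iteration count are arbitrary, and real implementations also refit the line on the inliers at the end, which I skip:

```python
import numpy as np

def ransac_line(x, y, n_iter=100, threshold=5.0, seed=0):
    """Toy RANSAC: keep the 2-dot line with the most nearby dots."""
    rng = np.random.default_rng(seed)
    best, best_count = None, -1
    for _ in range(n_iter):
        i, j = rng.choice(len(x), size=2, replace=False)
        if x[i] == x[j]:
            continue                                   # vertical line, skip it
        a = (y[i] - y[j]) / (x[i] - x[j])              # slope from the sampled pair
        b = y[i] - a * x[i]
        inliers = np.abs(y - (a * x + b)) < threshold  # dots close to this line
        if inliers.sum() > best_count:
            best, best_count = (a, b), inliers.sum()
    return best
```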

Huber regression

Huber regression uses a parameter \(\epsilon\) (epsilon), greater than 1.0. Dots whose error is within \(\epsilon\) get the usual squared loss, while dots beyond \(\epsilon\) only get a linear loss, so they influence the fitted line less.
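
To see why big errors pull the line less, here is the Huber loss on a residual \(r\) in simplified form (scikit-learn additionally rescales \(r\) by an estimated scale \(\sigma\), which I leave out):

\[
L_\epsilon(r) =
\begin{cases}
r^2 & \text{if } \lvert r\rvert \le \epsilon \\
2\epsilon\lvert r\rvert - \epsilon^2 & \text{otherwise}
\end{cases}
\]

Within \(\epsilon\) the loss is squared, like ordinary least squares; beyond \(\epsilon\) it grows only linearly, so outliers matter much less.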

Huber is faster than the first two (unless the dataset is very large, per the scikit-learn docs).

Reference link: https://scikit-learn.org/stable/auto_examples/linear_model/plot_robust_fit.html


That was the lecture part. Now let’s write some code in Jupyter.

Scikit-learn

Let me introduce sklearn, a.k.a. the scikit-learn library. It is a great tool for data analysis and prediction.

We shall import sklearn.linear_model, which is a collection of linear regression models, and sklearn.model_selection, which we use here for splitting the data.

We try the Titanic data, using the columns “Pclass”, “Age”, and “Fare”.

Now we want to predict “Fare” from “Pclass” and “Age”. Therefore, we assign x as the latter two and y as “Fare”.

Run sklearn.model_selection.train_test_split() to split both x and y into a training group and a testing group, with the testing group taking 10% of the data (test_size = 0.1).
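
Putting the data preparation together, a minimal sketch could look like this. Assumptions of mine: the Kaggle Titanic CSV saved as “titanic.csv”, and a dropna() step to remove rows with missing ages:

```python
import pandas as pd
import sklearn.linear_model
import sklearn.model_selection

# Assumption: Kaggle Titanic CSV with columns "Pclass", "Age", "Fare".
df = pd.read_csv("titanic.csv")[["Pclass", "Age", "Fare"]].dropna()

x = df[["Pclass", "Age"]]   # predictors
y = df[["Fare"]]            # target

# Hold out 10% of the rows as the testing group.
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(
    x, y, test_size=0.1)
```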

Reference link: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Scikit-learn with Theil-Sen

Now that the data is prepared, let’s go for Theil-Sen first.

We create a TheilSenRegressor object, run fit() with the training groups of x and y, then predict() with the testing group of x and… Gotcha! We get the predicted result of Theil-Sen.
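
In code, that step looks roughly like this, continuing from the split above (variable names are mine):

```python
theil = sklearn.linear_model.TheilSenRegressor()
theil.fit(x_train, y_train.values.ravel())   # .ravel() is explained below
y_pred_theil = theil.predict(x_test)

print(theil.coef_, theil.intercept_)         # fitted slope(s) and intercept
```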

We can read the results from these attributes:

  • coef_ is the slope \(a\) from \(Y = aX + b\) (one value per predictor)
  • intercept_ is \(b\)

The formula from the Theil-Sen estimator is \(fare = -13.19\times Pclass - 0.04\times age + 51.49\).

We use y_train.values.ravel() to flatten y into a 1-D array, which fixes a data type warning (ref).

Scikit-learn with RANSAC

Second, RANSAC. Create a RANSACRegressor().
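
Roughly like so; note that RANSAC wraps an inner linear model, so the coefficients live on its estimator_ attribute:

```python
ransac = sklearn.linear_model.RANSACRegressor()
ransac.fit(x_train, y_train.values.ravel())
y_pred_ransac = ransac.predict(x_test)

# The best fitted inner model holds the slope and intercept.
print(ransac.estimator_.coef_, ransac.estimator_.intercept_)
```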

Running the same steps, we get this formula: \(fare = -10.86\times Pclass + 0.02\times age + 40.80\).

Scikit-learn with Huber

Last one, Huber as HuberRegressor().
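
Same pattern again, roughly:

```python
huber = sklearn.linear_model.HuberRegressor()
huber.fit(x_train, y_train.values.ravel())
y_pred_huber = huber.predict(x_test)

print(huber.coef_, huber.intercept_)
```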

We got \(fare=-21.23\times Pclass - 0.25\times age + 79.99\).

Comparison

We have all three results and it’s time to plot. Put the real values (the testing group of y) on the x-axis and the predicted results on the y-axis.
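
A sketch of such a plot with matplotlib (the styling choices and the dashed reference line are mine):

```python
import matplotlib.pyplot as plt

actual = y_test.values.ravel()                 # real fares, flattened
plt.scatter(actual, y_pred_theil, label="Theil-Sen")
plt.scatter(actual, y_pred_ransac, label="RANSAC")
plt.scatter(actual, y_pred_huber, label="Huber")
plt.plot([actual.min(), actual.max()],
         [actual.min(), actual.max()], "k--", label="perfect prediction")
plt.xlabel("actual Fare")
plt.ylabel("predicted Fare")
plt.legend()
plt.show()
```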

Just eyeballing it, Huber seems to run better, but we need some tools to determine that precisely.

Metrics

scikit-learn provides sklearn.metrics for evaluating predictions. This time we use these 3 (a quick sketch of computing them follows the list):

  • r2_score()
    \(R^2\) is the coefficient of determination. Higher is better. (wikipedia)
  • median_absolute_error()
    \(MedAE\) is the median of the absolute errors between predictions and actual values. Lower is better. (O’Reilly)
  • mean_absolute_error()
    \(MAE\) is the mean of the absolute errors between predictions and actual values. Lower is better. (wikipedia)
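
Here is that sketch, computing all three metrics for each model (the loop structure and names are my own):

```python
import sklearn.metrics

for name, y_pred in [("Theil-Sen", y_pred_theil),
                     ("RANSAC", y_pred_ransac),
                     ("Huber", y_pred_huber)]:
    print(name,
          sklearn.metrics.r2_score(y_test, y_pred),
          sklearn.metrics.median_absolute_error(y_test, y_pred),
          sklearn.metrics.mean_absolute_error(y_test, y_pred))
```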

From the results above, we can conclude that Huber gives the most precise linear regression formula.


This episode suddenly attacked us with lots of mathematics stuff. LOL.

Let’s see what’s next.

See ya. Bye~
