Note of data science training EP 4: Scikit-learn & Linear Regression – Linear trending
prev: Note of data science training EP 3: Matplotlib & Seaborn – Luxury visualization
In EP 3 we learned how to create some graphs. In this episode we are going to analyze data in a more serious way.
One piece of basic data science knowledge is linear regression, which draws a straight line \(Y = aX + b\) through the data. The best line is the one that passes through, or as close as possible to, the most dots on the plane.
The line in the figure above is not a very good one. The dots that do not lie on the line are called “outliers”; they are the errors. When there are too many outliers, either our line does not fit well enough, or the data is so noisy that no straight line can describe it.
Here are some approaches for fitting a good line.
Theil-Sen estimator
The Theil-Sen estimator draws lines through random pairs of dots and, in the end, takes the median of those lines' slopes. Its benefit is speed, but when there are too many outliers it produces inaccurate results.
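To make the idea concrete, here is a tiny sketch of the median-of-pairwise-slopes version for a single feature. This is an illustration only, not scikit-learn's actual implementation (which generalizes the idea using spatial medians):

```python
import numpy as np

def theil_sen_1d(x, y):
    """Median-of-pairwise-slopes Theil-Sen for one feature (illustrative sketch)."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i in range(len(x)) for j in range(i + 1, len(x))
              if x[j] != x[i]]
    a = np.median(slopes)       # slope: median of all pairwise slopes
    b = np.median(y - a * x)    # intercept: median of the residuals
    return a, b

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.0, 6.2, 8.1, 30.0])  # last dot is an outlier
print(theil_sen_1d(x, y))                 # slope stays close to 2 despite the outlier
```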
RANSAC algorithm
RANSAC stands for RAndom SAmple Consensus. It searches for the best line, the one that passes through (or near) the maximum number of dots.
The algorithm depends on the slope, computed as \(slope=\frac{y_1-y_2}{x_1-x_2}\). This means it is resistant to outliers along the Y-axis but not to ones along the X-axis.
According to the scikit-learn documentation, RANSAC is faster than Theil-Sen and scales much better with the number of samples.
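For intuition, a bare-bones sketch of the RANSAC loop for one feature could look like this (simplified; the real RANSACRegressor also refits on the inlier set and exposes many options):

```python
import numpy as np

def ransac_line(x, y, n_iter=100, threshold=1.0, seed=0):
    """Keep the line through a random pair of dots that has the most inliers."""
    rng = np.random.default_rng(seed)
    best_inliers, best_ab = -1, (0.0, 0.0)
    for _ in range(n_iter):
        i, j = rng.choice(len(x), size=2, replace=False)
        if x[i] == x[j]:
            continue                        # slope undefined for a vertical pair
        a = (y[j] - y[i]) / (x[j] - x[i])   # slope from the sampled pair
        b = y[i] - a * x[i]
        inliers = np.sum(np.abs(y - (a * x + b)) < threshold)
        if inliers > best_inliers:
            best_inliers, best_ab = inliers, (a, b)
    return best_ab
```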
Huber regression
Huber regression uses \(\epsilon\) (epsilon), a threshold greater than 1.0: residuals larger than \(\epsilon\) are treated as outliers and penalized linearly instead of quadratically, which reduces their pull on the fitted line.
Huber is usually faster than the first two, unless the number of samples is very large.
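For reference, the per-residual loss used by HuberRegressor (from the scikit-learn user guide) is squared while \(|z| < \epsilon\) and linear beyond that, which is what softens the pull of outliers:

\[
H_\epsilon(z) = \begin{cases} z^2 & \text{if } |z| < \epsilon \\ 2\epsilon|z| - \epsilon^2 & \text{otherwise} \end{cases}
\]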
Reference link: https://scikit-learn.org/stable/auto_examples/linear_model/plot_robust_fit.html
That was the lecture part. Now let's write some code in Jupyter.
Scikit-learn
Let me introduce sklearn, or the scikit-learn library. It is a great tool for data analysis and prediction.
We import sklearn.linear_model, which is a collection of linear regression models, and we also import sklearn.model_selection, which we use here to split the data.
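The imports would look something like this (pandas is an extra assumption here, used for loading the data):

```python
import pandas as pd
import sklearn.linear_model     # TheilSenRegressor, RANSACRegressor, HuberRegressor
import sklearn.model_selection  # train_test_split
```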
We try the Titanic data, using the columns “Pclass”, “Age”, and “Fare”.
Now we want to predict “Fare” from “Pclass” and “Age”. Therefore, we assign x to the latter two columns and y to “Fare”.
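A sketch of this preparation step, assuming a Kaggle-style titanic.csv with those column names (the file name is hypothetical); rows with missing values are dropped:

```python
# Hypothetical file name; any Titanic CSV with these columns works.
df = pd.read_csv("titanic.csv")[["Pclass", "Age", "Fare"]].dropna()

x = df[["Pclass", "Age"]]  # features
y = df[["Fare"]]           # target
```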
Run sklearn.model_selection.train_test_split() to split both x and y into a training group and a testing group each, with the testing group holding 10% of the data (test_size = 0.1).
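For example (random_state is my own addition, only to make the split reproducible):

```python
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(
    x, y, test_size=0.1, random_state=0)  # 90% training / 10% testing
```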
Reference link: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
Scikit-learn with Theil-Sen
Now that the data is prepared, let's go with Theil-Sen first.
We create a TheilSenRegressor object, run fit() with the training groups of x and y, then run predict() with the testing group of x and… Gotcha! We got the predicted result of Theil-Sen.
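In code, that sequence might look like this (the variable names are my own):

```python
theil_sen = sklearn.linear_model.TheilSenRegressor()
theil_sen.fit(x_train, y_train.values.ravel())  # ravel() explained below
fare_theil_sen = theil_sen.predict(x_test)
```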
We can show the results as below: coef_ is the slope, or \(a\) from \(Y = aX + b\), and intercept_ is the intercept \(b\).
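Printing them could look like this:

```python
print(theil_sen.coef_)       # [a for Pclass, a for Age]
print(theil_sen.intercept_)  # b
```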
The formula from the Theil-Sen estimator is \(fare=-13.19\times Pclass - 0.04\times age + 51.49\).
We use y_train.values.ravel() to fix a data type issue (ref): fit() expects a one-dimensional target, while y_train here is a single-column DataFrame.
Scikit-learn with RANSAC
Second, RANSAC. Create a RANSACRegressor() object.
Repeat the same steps and we get the formula \(fare=-10.86\times Pclass + 0.02\times age + 40.80\).
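One detail worth noting: RANSACRegressor wraps an inner estimator (LinearRegression by default), so the fitted coefficients live on its estimator_ attribute. A sketch, with random_state again assumed only for reproducibility:

```python
ransac = sklearn.linear_model.RANSACRegressor(random_state=0)
ransac.fit(x_train, y_train.values.ravel())
fare_ransac = ransac.predict(x_test)
print(ransac.estimator_.coef_, ransac.estimator_.intercept_)  # the inner fitted model
```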
Scikit-learn with Huber
Last one, Huber, as a HuberRegressor() object.
We get \(fare=-21.23\times Pclass - 0.25\times age + 79.99\).
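The same pattern works for Huber:

```python
huber = sklearn.linear_model.HuberRegressor()  # default epsilon is 1.35
huber.fit(x_train, y_train.values.ravel())
fare_huber = huber.predict(x_test)
print(huber.coef_, huber.intercept_)
```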
Comparison
We have got all three, so it is time to plot. Put the actual values, i.e. the testing group of y, on the x-axis and the predicted results on the y-axis.
Eyeballing the plot, Huber looks better, but we need some tools to determine this precisely.
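A plotting sketch with matplotlib (from EP 3); fare_theil_sen, fare_ransac, and fare_huber are the prediction variables assumed in the sketches above:

```python
import matplotlib.pyplot as plt

actual = y_test.values.ravel()
plt.scatter(actual, fare_theil_sen, label="Theil-Sen")
plt.scatter(actual, fare_ransac, label="RANSAC")
plt.scatter(actual, fare_huber, label="Huber")
plt.plot([actual.min(), actual.max()], [actual.min(), actual.max()], "k--")  # perfect prediction
plt.xlabel("Actual fare")
plt.ylabel("Predicted fare")
plt.legend()
plt.show()
```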
Metrics
scikit-learn provides sklearn.metrics for evaluating predictions. This time we use these three:
r2_score(): \(r^2\) is the coefficient of determination. Higher is better. (wikipedia)
median_absolute_error(): \(MedAE\) is the median of absolute errors between prediction and actual. Lower is better. (O’Reilly)
mean_absolute_error(): \(MAE\) is the mean of absolute errors between prediction and actual. Lower is better. (wikipedia)
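Computing all three metrics for each model might look like this:

```python
import sklearn.metrics

actual = y_test.values.ravel()
for name, pred in [("Theil-Sen", fare_theil_sen),
                   ("RANSAC", fare_ransac),
                   ("Huber", fare_huber)]:
    print(name,
          sklearn.metrics.r2_score(actual, pred),
          sklearn.metrics.median_absolute_error(actual, pred),
          sklearn.metrics.mean_absolute_error(actual, pred))
```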
From the results above, we can conclude that Huber gives the most precise linear regression formula.
This episode suddenly attacked us with lots of mathematics. LOL.
Let’s see what’s next.
See ya. Bye~
next: Note of data science training EP 5: Logistic Regression & Dummy Classifier – Divide and Predict