Note of data science training EP 13: Regularization – make it regular with Regularization
prev: Note of data science training EP 12: skimage – Look out carefully
There are always outliers in the data. How can we deal with an overfitting model?
In EP 7, we talked about two terms related to outliers that can ruin our models.
Bias is the state of inaccuracy. To reduce bias, we have to do proper data exploration and preparation.
Variance is the state of dissimilarity. Data with high variance makes it hard to find patterns, and we can measure the variance with MSE. To reduce it, we can gather more data, optimize features, change models, or apply regularization.
Regularization
\(L_0\) counts the number of non-zero weights; it is rarely used as a penalty in practice.
\(L_1\), also called the Lasso penalty, removes some unnecessary features by shrinking their weights to exactly 0. As a result, the model becomes shorter and its performance improves.
$$L_1 = \|w\|_1 = |w_0|+|w_1|+\cdots+|w_n| = \sum_{i=0}^n|w_i|$$
\(L_2\), a.k.a. the Ridge penalty, shrinks the weights of features (without zeroing them out) to reduce variance.
$$L_2 = \|w\|_2 = \left(w_0^2+w_1^2+\cdots+w_n^2\right)^{1/2} = \left(\sum_{i=0}^n w_i^2\right)^{1/2}$$
Another one, not shown here, is Elastic Net, which is a mixture of Lasso and Ridge.
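Putting the pieces together, a regularized regression minimizes the usual squared-error loss plus one of these penalties scaled by a coefficient \(\alpha\) (scikit-learn scales the loss term slightly differently for Lasso and Ridge, but the idea is the same):
$$\min_w \sum_i\left(y_i-\hat{y}_i\right)^2 + \alpha\cdot\mathrm{penalty}(w), \qquad \mathrm{penalty}(w)=\|w\|_1 \text{ for Lasso},\quad \|w\|_2^2 \text{ for Ridge}$$
This \(\alpha\) is the alpha parameter we will pass to scikit-learn below.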
Let’s begin
1. Prepare the data
Let’s say we have already loaded the wine data into a DataFrame (called df here) and utilize train_test_split to split the data into a train set and a test set. Assign the “quality” column as y.
from sklearn.model_selection import train_test_split
x = df.drop(columns=["quality"])  # feature columns (assuming the wine data is in df)
y = df["quality"]                 # the "quality" column as the target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.75)
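As a quick check of the split (note that test_size=0.75 sends 75% of the rows to the test set, leaving only 25% for training):
print(x_train.shape, x_test.shape)  # the test set should be about three times larger than the train set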
2. Standard Scaler
Regularization is computed on the size, or magnitude, of the values, so we need to scale the features first. This job can be done with StandardScaler.
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
ss_train = ss.fit_transform(x_train)
ss_test = ss.transform(x_test)
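As an optional sanity check, the scaled training features should now have roughly zero mean and unit standard deviation:
import numpy as np
print(np.round(ss_train.mean(axis=0), 3))  # each column's mean should be ~0
print(np.round(ss_train.std(axis=0), 3))   # each column's standard deviation should be ~1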
3. Linear Regression
Now we can create a Linear Regression model and fit it on the scaled data from the standard scaler.
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(ss_train, y_train)
4. Metrics of Linear Regression
After running .fit(), we notice that the MSE and R2 score on the test set are worse than on the train set. The model is overfitting.
from sklearn.metrics import mean_squared_error, r2_score
print(mean_squared_error(y_train, lr.predict(ss_train)), mean_squared_error(y_test, lr.predict(ss_test)))
print(r2_score(y_train, lr.predict(ss_train)), r2_score(y_test, lr.predict(ss_test)))
5. Lasso
OK, let's move on to Lasso. Alpha is the coefficient that scales the penalty term in the formula above. Here we set alpha to 0.1 (the default alpha value is 1.0).
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.1)
lasso.fit(ss_train, y_train)
print(lasso.score(ss_train, y_train))
The printed value is the R2 score returned by lasso.score() on the train set.
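To compare with the plain linear regression from step 4, we can also score the same Lasso model on the test set; a minimal sketch reusing ss_test and y_test from step 2:
print(lasso.score(ss_train, y_train))  # R2 on the train set
print(lasso.score(ss_test, y_test))    # R2 on the test set; closer to the train score means less overfitting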
6. LassoCV
We can run LassoCV to find the best alpha.
import numpy as np
from sklearn.linear_model import LassoCV
lasso_cv = LassoCV(alphas=np.logspace(-1, 1, 100), cv=5, max_iter=5000)
lasso_cv = lasso_cv.fit(ss_train, y_train.values.ravel())
print(lasso_cv.alpha_)
print(lasso_cv.coef_)
print(lasso_cv.score(ss_train, y_train))
np.logspace(-1, 1, 100) generates an array of 100 elements from \(10^{-1}\) to \(10^1\); this becomes the list of candidate alphas. cv sets the number of cross-validation folds and max_iter caps the number of iterations of the solver. Finally, we got an R2 score on the test set equal to 0.2368.
Look at .coef_: there are many zeros. Yes, Lasso gets rid of unnecessary features.
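The code above scores the train set; to reproduce the test-set R2 mentioned here, we can score the fitted LassoCV model on the scaled test data, as in this small sketch:
print(lasso_cv.score(ss_test, y_test))  # R2 on the held-out test set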
7. Ridge
Move to Ridge.
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1)
ridge.fit(ss_train, y_train)
print(ridge.score(ss_train, y_train))
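To see the weight shrinkage described in the \(L_2\) section, here is a rough sketch (not from the original post) comparing the total coefficient magnitude of the plain linear regression with that of the Ridge model:
import numpy as np
print(np.abs(lr.coef_).sum())     # total weight magnitude without regularization
print(np.abs(ridge.coef_).sum())  # Ridge shrinks the weights, so this should be smaller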
8. RidgeCV
We can run RidgeCV to find the best alpha and define scoring as the R2 score.
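The original snippet is not shown here, so below is a minimal sketch of what the RidgeCV call could look like, mirroring the LassoCV example above (the alphas grid and cv=5 are assumptions, not the original settings):
import numpy as np
from sklearn.linear_model import RidgeCV
ridge_cv = RidgeCV(alphas=np.logspace(-1, 1, 100), scoring="r2", cv=5)
ridge_cv.fit(ss_train, y_train)
print(ridge_cv.alpha_)                    # best alpha found by cross-validation
print(ridge_cv.score(ss_train, y_train))  # R2 on the train set
print(ridge_cv.score(ss_test, y_test))    # R2 on the test set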
With that, we are done with regularization. Our regularized models produce a lower MSE on test and real-world data, and that's the point.
References
- https://towardsdatascience.com/intuitions-on-l1-and-l2-regularisation-235f2db4c261
- https://en.wikipedia.org/wiki/Regularization_(mathematics)
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html#sklearn.linear_model.LassoCV
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html#sklearn.linear_model.RidgeCV
Next episode is the epilogue of this series.
See ya there.
next: Note of data science training EP 14 END – Data scientists did their mistakes