prev: Note of data science training EP 4: Scikit-learn & Linear Regression – Linear trending

In EP 4, we got to know 3 types of linear regression. Regression is just one of the 4 main types of Machine Learning (ML) algorithms.

**Supervised Learning** is ML on labeled output, where we use the labels to train our models. For example, we have data of age and income, and we want to predict income at any age. The labeled output here is income.

On the other hand, **Unsupervised Learning** works on unlabeled output, which means we have no initial result data to train models. We will talk about this later.

And **Continuous** means quantitative output, while **Discrete** means qualitative output.

Reference link: https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d

Putting these together, there are 4 main types of ML algorithms:

**Classification**

Discrete Supervised Learning. This is for predicting a group out of a limited set of values. For example, we want to predict which interest group a customer belongs to when we have customer data such as ages, occupations, genders, and interests.

**Regression**

Continuous Supervised Learning. This is to predict an output that can be any value. The example is the previous episode.

**Clustering**

Discrete Unsupervised Learning. It is to group our data. For example, we want to group customers into 8 types based on customer data, so that our advertisements hit the right people.

**Dimensionality reduction**

Continuous Unsupervised Learning. This works on data optimization. One day we will have tons of data in thousands of columns, and dimensionality reduction helps us find the most important columns to deal with in terms of data processing.

And this time, I am proud to present…

# Logistic Regression

**Logistic Regression** predicts the **probability** of each label. Despite its name, the algorithm is a classifier, because the result is one of the labels in the output.

Now it’s the time.
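To make "predicts the probability" a bit more concrete, here is a minimal sketch of the logistic (sigmoid) function that logistic regression uses to turn a linear score into a probability. The function names `sigmoid` and `classify` and the 0.5 threshold are my own choices for illustration, not something taken from the training.

```python
import math


def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(z, threshold=0.5):
    """Turn the probability into a label: 1 if it reaches the threshold, else 0."""
    return int(sigmoid(z) >= threshold)


# A negative score gives a probability below 0.5, a positive one above 0.5.
for score in (-2.0, 0.0, 3.0):
    print(score, round(sigmoid(score), 3), classify(score))
```

This is why the output is a class even though the model works with probabilities underneath.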

We reuse the Titanic data again. Assign `x` as “Pclass”, “is_male” (calculated from “Sex”), “Age”, “SibSp”, and “Fare”.

Assign `y` as “Survived”. We are going to predict the survivability.
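The preparation step could look roughly like this. It is only a sketch: the six rows below are made up and just borrow the Titanic column names, and encoding “is_male” as 1/0 is my assumption about how it is calculated from “Sex”.

```python
import pandas as pd

# Tiny made-up sample that only mimics the Titanic column names.
# With the real file this would be loaded from disk instead.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 3, 2, 1],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
    "Age":      [22.0, 38.0, 26.0, 35.0, 27.0, 54.0],
    "SibSp":    [1, 1, 0, 0, 0, 0],
    "Fare":     [7.25, 71.28, 7.92, 8.05, 11.13, 51.86],
})

# Encode "Sex" as a number: 1 for male, 0 for female.
df["is_male"] = (df["Sex"] == "male").astype(int)

x = df[["Pclass", "is_male", "Age", "SibSp", "Fare"]]  # features
y = df["Survived"]                                     # label to predict

print(x.head())
```

If your copy of the data has missing values in “Age”, they need to be filled or dropped before fitting, because `LogisticRegression` does not accept NaN.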

Run `train_test_split()` and utilize the `sklearn.linear_model.LogisticRegression()` class.

We put `solver='lbfgs'` to avoid a warning.

And `fit()`. Just peeking at `coef_`, we find that the second column, “is_male”, has a negative impact on “Survived”.
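Here is a sketch of the split, fit, and `coef_` steps. Since the real Titanic file is not bundled here, the helper `fake_titanic()` is a hypothetical generator that invents data with the same column names and a built-in rule that men survive less often, so the numbers it prints are not the ones from the training. `test_size`, `random_state`, and `max_iter=1000` are also my own choices. In recent scikit-learn versions `'lbfgs'` is already the default solver, so passing it mainly matters on older versions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def fake_titanic(n=500, seed=0):
    """Made-up stand-in for the Titanic data, NOT the real passengers."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "Pclass": rng.integers(1, 4, n),
        "is_male": rng.integers(0, 2, n),
        "Age": rng.uniform(1, 70, n).round(),
        "SibSp": rng.integers(0, 3, n),
        "Fare": rng.uniform(5, 100, n).round(2),
    })
    # Invented rule: men and lower classes survive less often.
    score = 3.0 - 4.0 * df["is_male"] - 0.5 * (df["Pclass"] - 1)
    prob = 1 / (1 + np.exp(-score.to_numpy()))
    df["Survived"] = (rng.random(n) < prob).astype(int)
    return df


df = fake_titanic()
features = ["Pclass", "is_male", "Age", "SibSp", "Fare"]
x, y = df[features], df["Survived"]

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

model = LogisticRegression(solver="lbfgs", max_iter=1000)
model.fit(x_train, y_train)

# One coefficient per feature, in the same order as the columns of x.
for name, coef in zip(features, model.coef_[0]):
    print(f"{name:8s} {coef:+.3f}")
```

A negative coefficient means that a larger value of that feature pushes the prediction toward “Survived” = 0.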

Once we get the model, we go on to test it. We create a passenger with “Pclass” as 30, male, no “SibSp”, and 25 paid as “Fare”.

Our model predicts that he did not survive (“Survived” is 0). Please mourn for him 😢.

After running `.predict_proba()`, we realize his survivability is just 10%.
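The prediction step can be sketched like this. Again the data comes from the hypothetical `fake_titanic()` generator rather than the real file, and the passenger values (third class, age 30) are my assumption because the text does not spell out all five features, so the probability printed here will not match the 10% above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def fake_titanic(n=500, seed=0):
    """Made-up stand-in for the Titanic data, NOT the real passengers."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "Pclass": rng.integers(1, 4, n),
        "is_male": rng.integers(0, 2, n),
        "Age": rng.uniform(1, 70, n).round(),
        "SibSp": rng.integers(0, 3, n),
        "Fare": rng.uniform(5, 100, n).round(2),
    })
    # Invented rule: men and lower classes survive less often.
    score = 3.0 - 4.0 * df["is_male"] - 0.5 * (df["Pclass"] - 1)
    prob = 1 / (1 + np.exp(-score.to_numpy()))
    df["Survived"] = (rng.random(n) < prob).astype(int)
    return df


df = fake_titanic()
features = ["Pclass", "is_male", "Age", "SibSp", "Fare"]
x_train, x_test, y_train, y_test = train_test_split(
    df[features], df["Survived"], test_size=0.25, random_state=0
)
model = LogisticRegression(solver="lbfgs", max_iter=1000).fit(x_train, y_train)

# One new passenger, using the same column names and order as the training data.
passenger = pd.DataFrame(
    [{"Pclass": 3, "is_male": 1, "Age": 30, "SibSp": 0, "Fare": 25}]
)[features]

prediction = model.predict(passenger)[0]   # the predicted label, 0 or 1
proba = model.predict_proba(passenger)[0]  # [P(Survived=0), P(Survived=1)]

print("Survived:", prediction)
print("P(not survived), P(survived):", proba.round(3))
```

`predict()` gives the label only, while `predict_proba()` shows how confident the model is about each label.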

Ok. Let’s see the correctness of the model. We create a `DataFrame` to combine “Survived”, the predictions, and a correctness flag named “is_correct”.

The average correctness of our model is 83%. Quite great.
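A sketch of that correctness check follows. It assumes the check is done on the test split, uses “predict” as my own name for the prediction column, and runs on the made-up `fake_titanic()` data, so the resulting figure is not the 83% from the real data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def fake_titanic(n=500, seed=0):
    """Made-up stand-in for the Titanic data, NOT the real passengers."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "Pclass": rng.integers(1, 4, n),
        "is_male": rng.integers(0, 2, n),
        "Age": rng.uniform(1, 70, n).round(),
        "SibSp": rng.integers(0, 3, n),
        "Fare": rng.uniform(5, 100, n).round(2),
    })
    # Invented rule: men and lower classes survive less often.
    score = 3.0 - 4.0 * df["is_male"] - 0.5 * (df["Pclass"] - 1)
    prob = 1 / (1 + np.exp(-score.to_numpy()))
    df["Survived"] = (rng.random(n) < prob).astype(int)
    return df


df = fake_titanic()
features = ["Pclass", "is_male", "Age", "SibSp", "Fare"]
x_train, x_test, y_train, y_test = train_test_split(
    df[features], df["Survived"], test_size=0.25, random_state=0
)
model = LogisticRegression(solver="lbfgs", max_iter=1000).fit(x_train, y_train)

# Put the true labels and the predictions side by side.
result = pd.DataFrame({
    "Survived": y_test.to_numpy(),
    "predict": model.predict(x_test),
})
result["is_correct"] = result["Survived"] == result["predict"]

# The mean of a True/False column is the share of correct predictions.
accuracy = result["is_correct"].mean()
print(result.head())
print(f"average correctness: {accuracy:.3f}")
```

The same number is available directly from `model.score(x_test, y_test)`, but the `DataFrame` makes it easy to inspect which rows went wrong.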

# Dummy Classifier

Besides, we shall meet the Dummy Classifier. This one provides a “baseline” classification model, which means we can use it as a standard to evaluate our model against.

The Dummy Classifier runs the model with simple rules, and we can define the rule with the parameter `strategy`. This time we use “most_frequent”, which always predicts the most frequent label in the training set.

Now we check the baseline correctness, and it is just 52.7%. Our Logistic Regression model above clearly beats it, so it works.
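The baseline comparison can be sketched as below. As before, `fake_titanic()` is a hypothetical generator of made-up data with Titanic-like columns, so the two scores it prints are illustrative and will differ from the 52.7% and 83% measured on the real data.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def fake_titanic(n=500, seed=0):
    """Made-up stand-in for the Titanic data, NOT the real passengers."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "Pclass": rng.integers(1, 4, n),
        "is_male": rng.integers(0, 2, n),
        "Age": rng.uniform(1, 70, n).round(),
        "SibSp": rng.integers(0, 3, n),
        "Fare": rng.uniform(5, 100, n).round(2),
    })
    # Invented rule: men and lower classes survive less often.
    score = 3.0 - 4.0 * df["is_male"] - 0.5 * (df["Pclass"] - 1)
    prob = 1 / (1 + np.exp(-score.to_numpy()))
    df["Survived"] = (rng.random(n) < prob).astype(int)
    return df


df = fake_titanic()
features = ["Pclass", "is_male", "Age", "SibSp", "Fare"]
x_train, x_test, y_train, y_test = train_test_split(
    df[features], df["Survived"], test_size=0.25, random_state=0
)

# Baseline: ignore the features and always answer the most frequent training label.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(x_train, y_train)

model = LogisticRegression(solver="lbfgs", max_iter=1000).fit(x_train, y_train)

baseline_acc = baseline.score(x_test, y_test)
model_acc = model.score(x_test, y_test)
print(f"baseline: {baseline_acc:.3f}")
print(f"logistic: {model_acc:.3f}")
```

If the real model does not score clearly above the baseline, it has not learned much more than "always guess the majority".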

Reference link: https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html

Classifiers are useful for data categorizing work, and I hope this blog helps you understand the ML overview.

Let’s see what’s next.

Bye~

next: Note of data science training EP 6: Decision Tree – At a point of distraction