Note of data science training EP 5: Logistic Regression & Dummy Classifier – Divide and Predict

prev: Note of data science training EP 4: Scikit-learn & Linear Regression – Linear trending

In EP 4, we covered all 3 types of linear regression. Linear regression belongs to just one of the 4 main types of Machine Learning (ML) algorithms.

4 main types of Machine Learning Algorithms

Supervised Learning is ML on labeled output: we use the labels to train our models. For example, we have data on age and income and want to predict income at any age, so the labeled output is income.

On the other hand, Unsupervised Learning works on unlabeled output, which means we have no known result data to train the models with. We will talk about this later.

Continuous means quantitative output, while Discrete means qualitative output.

Reference link: https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d

As the figure above shows, there are 4 main types of ML algorithms:

  • Classification
    Discrete Supervised Learning. This predicts a group from a limited set of values. For example, we want to predict which interest group a customer belongs to when we have customer data such as ages, occupations, genders, and interests.
  • Regression
    Continuous Supervised Learning. This predicts an output that can take any value. The example is the previous episode.
  • Clustering
    Discrete Unsupervised Learning. This groups our data. For example, we want to group customers into 8 types from customer data so that advertisements hit the right audience.
  • Dimensionality reduction
    Continuous Unsupervised Learning. This works on data optimization. One day we will have tons of data in thousands of columns, and Dimensionality reduction helps us reduce them to the most important ones so that data processing stays manageable.

And this time, I am proud to present…

Logistic Regression

Logistic Regression predicts a probability. Despite its name, the algorithm is a classifier, because the result is one of the labeled output classes.

Now it’s time.
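To make the "probability first, class second" idea concrete, here is a minimal sketch in plain Python. The function names `sigmoid` and `classify` and the 0.5 threshold are my own illustrative choices, not something taken from the training material.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real-valued score z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(z: float, threshold: float = 0.5) -> int:
    """Turn the probability into a class label: 1 if it reaches the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0


# A negative score leans towards class 0, a positive score towards class 1.
for score in (-2.0, 0.0, 2.0):
    print(score, round(sigmoid(score), 3), classify(score))
```

The model learns how to compute the score from the input columns; the logistic function then squeezes that score into a probability, and the threshold turns it into a label.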

We reuse the Titanic data again. Assign x as “Pclass”, “is_male” (calculated from “Sex”), “Age”, “SibSp”, and “Fare”.

Assign y as “Survived”. We are going to predict survival.
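A rough sketch of this preparation step with pandas could look like the following. The four rows are hand-made sample values that only borrow the Titanic column names, so treat them as an assumption and not as the real data set.

```python
import pandas as pd

# A tiny hand-made sample with Titanic-style column names (not the real data).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "SibSp": [1, 1, 0, 0],
    "Fare": [7.25, 71.28, 7.92, 13.0],
})

# Encode "Sex" as a number: 1 for male, 0 for female.
df["is_male"] = (df["Sex"] == "male").astype(int)

# Features (x) and the label we want to predict (y).
x = df[["Pclass", "is_male", "Age", "SibSp", "Fare"]]
y = df["Survived"]
print(x)
```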

Use the class sklearn.linear_model.LogisticRegression and split the data with train_test_split().

We set solver='lbfgs' to avoid a warning.

Then call fit(). Peeking at coef_, we find that the second column, “is_male”, has a negative impact on “Survived”.
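The split, fit, and coefficient check can be sketched roughly like this. The data is synthetic and generated so that males survive less often, purely to mimic the finding above; `random_state=0` and `max_iter=1000` are my own additions so the run is repeatable and the solver has room to converge.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic data (assumption: not the real file).
rng = np.random.default_rng(0)
n = 400
x = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "is_male": rng.integers(0, 2, n),
    "Age": rng.uniform(1, 70, n),
    "SibSp": rng.integers(0, 3, n),
    "Fare": rng.uniform(5, 100, n),
})
# Survival is made less likely for males on purpose.
p_survive = 1 / (1 + np.exp(-(1.5 - 3.0 * x["is_male"])))
y = (rng.uniform(0, 1, n) < p_survive).astype(int)

# Hold back part of the data for testing (25% by default).
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

model = LogisticRegression(solver="lbfgs", max_iter=1000)
model.fit(x_train, y_train)

# One coefficient per feature, in the same order as the columns of x.
coef = dict(zip(x.columns, model.coef_[0]))
print(coef)
```

A negative coefficient means that a larger value in that column pushes the prediction towards class 0, which here is "did not survive".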

Once we have the model, we go on to test it. We create a passenger with “Pclass” 30, male, no “SibSp”, who paid 25 as “Fare”.

Our model predicts that he died (“Survived” is 0). Please mourn for him 😢.

After running .predict_proba(), we see that his probability of survival is just 10%.
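Here is a hedged sketch of predicting for a single passenger. The training data is synthetic again, and the five feature values of the passenger (Pclass 3, male, age 30, no SibSp, fare 25) are illustrative assumptions, so the printed probability will not match the 10% from the real data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 300
# Columns: Pclass, is_male, Age, SibSp, Fare (synthetic, not the real data).
x = np.column_stack([
    rng.integers(1, 4, n),
    rng.integers(0, 2, n),
    rng.uniform(1, 70, n),
    rng.integers(0, 3, n),
    rng.uniform(5, 100, n),
])
# Males are made less likely to survive, as in the earlier finding.
p_survive = 1 / (1 + np.exp(-(1.5 - 3.0 * x[:, 1])))
y = (rng.uniform(0, 1, n) < p_survive).astype(int)

model = LogisticRegression(solver="lbfgs", max_iter=1000).fit(x, y)

# Hypothetical passenger: Pclass 3, male, age 30, no SibSp, fare 25.
passenger = np.array([[3, 1, 30, 0, 25]])

label = model.predict(passenger)[0]        # predicted class, 0 or 1
proba = model.predict_proba(passenger)[0]  # [P(Survived=0), P(Survived=1)]
print(label, proba)
```

predict() returns only the label, while predict_proba() returns one probability per class in the order of model.classes_, so the second number is the chance of survival.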

OK, let’s check the correctness of the model. We create a DataFrame that combines “Survived”, the predictions, and a correctness flag named “is_correct”.

The average correctness (accuracy) of our model is 83%. Quite great.
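The correctness check can be sketched as below. The eight label and prediction values are made up for illustration, and the column name "predict" is my assumption for what the post calls "predicts".

```python
import pandas as pd

# Hypothetical true labels and model predictions for eight test passengers.
result = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1, 0, 1],
    "predict":  [0, 1, 0, 0, 0, 1, 1, 1],
})

# Flag each row where the prediction matches the true label.
result["is_correct"] = result["Survived"] == result["predict"]

# The mean of a boolean column is the fraction of True values, i.e. the accuracy.
accuracy = result["is_correct"].mean()
print(accuracy)
```

Six of the eight made-up rows match, so this prints 0.75. scikit-learn's accuracy_score() from sklearn.metrics computes the same number.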

Dummy Classifier

Next, we meet the Dummy Classifier. It provides a “baseline” classification model, which means we can use it as a standard to evaluate our model against.

The Dummy Classifier makes predictions with simple rules, and we choose the rule with the parameter strategy. This time it is “most_frequent”, which always predicts the most frequent label in the training set.

Now we check the baseline correctness, and it is just 52.7%. Our Logistic Regression model above clearly beats the baseline.
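A small sketch of the baseline idea follows. The labels are hypothetical and chosen so the result is easy to verify by hand; the features are all zeros because this strategy ignores them anyway.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical data: this strategy ignores the features, only the labels matter.
x_train = np.zeros((10, 1))
y_train = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])  # class 0 is the most frequent
x_test = np.zeros((5, 1))
y_test = np.array([0, 1, 0, 1, 0])

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(x_train, y_train)

pred = baseline.predict(x_test)  # always the majority class of the training set
baseline_accuracy = baseline.score(x_test, y_test)
print(pred, baseline_accuracy)
```

The baseline answers 0 for everyone, so it is right for three of the five test labels (accuracy 0.6). A real model is only worth keeping if it clearly beats such a number.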

Reference link: https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html


Classifiers are useful for data categorization work, and I hope this blog helps you understand the ML overview.

Let’s see what’s next.

Bye~

next: Note of data science training EP 6: Decision Tree – At a point of distraction
