Note of data science training EP 6: Decision Tree – At a point of distraction
prev: Note of data science training EP 5: Logistic Regression & Dummy Classifier – Divide and Predict
Continuing from EP 5, here we go with another classification model. It is …
Decision tree
A decision tree is a classification model in the same group as logistic regression, but it works by drawing boxes around groups of data. The boxes can be split into smaller and smaller ones to cover each group as tightly as possible, while logistic regression simply draws a straight line to separate the data into two groups.
That is why, when there are outliers, a decision tree struggles to cover them and can output quite different results. Logistic regression, on the other hand, deals with them in a better way.
Reference links:
- https://blog.bigml.com/2016/09/28/logistic-regression-versus-decision-trees/
- https://www.geeksforgeeks.org/ml-logistic-regression-v-s-decision-tree-classification/
Let’s prepare the data
Assign x as a DataFrame of “Pclass”, “is_male”, and “Age”. Then assign y as “Fare” classified into labels through our own function using .map().
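Here is a minimal sketch of this step. The file name, the dropna() handling of missing ages, and the fare cut-offs in fare_label() are all assumptions for illustration, not the post's actual values.

```python
import pandas as pd

# Load the Titanic data (file name is an assumption)
df = pd.read_csv("titanic.csv")

# Assumption: rows with missing "Age" are dropped so the tree can be fitted
df = df.dropna(subset=["Age"])

# Encode "Sex" as the numeric "is_male" column
df["is_male"] = (df["Sex"] == "male").astype(int)

x = df[["Pclass", "is_male", "Age"]]

# Hypothetical binning function standing in for the post's own function
def fare_label(fare):
    if fare < 10:
        return "Cheap"
    elif fare < 50:
        return "Medium"
    elif fare < 100:
        return "Quite rich"
    return "Luxury"

y = df["Fare"].map(fare_label)
```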
Grow the tree
First, create a DecisionTreeClassifier with two main parameters:
criterion: we can choose one of the following two:
- “gini”, short for “Gini impurity”. It measures how often a randomly chosen sample would be incorrectly labeled. This is for reducing misclassification.
- “entropy” measures how uncertain the results are. This is for exploratory analysis.
max_depth: this defines the number of layers of the tree. This time we set it to 3.
Reference links:
- https://medium.com/@jason9389/gini-impurity-and-entropy-16116e754b27
- https://www.quora.com/What-is-difference-between-Gini-Impurity-and-Entropy-in-Decision-Tree
- https://dzone.com/articles/logistic-regression-vs-decision-tree
- https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
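A minimal sketch of this step (the choice of “gini” here is an assumption; the post does not say which criterion it picked):

```python
from sklearn.tree import DecisionTreeClassifier

# criterion is "gini" or "entropy"; max_depth limits the tree to 3 layers
tree = DecisionTreeClassifier(criterion="gini", max_depth=3)
```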
Then .fit() them. Eh, what are all the labeled results? We can check with .classes_.
Done.
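Putting those two steps together, assuming the x and y we prepared earlier:

```python
# Train the tree on the prepared features and labels
tree.fit(x, y)

# List every label the tree has learned, in sorted order
print(tree.classes_)
# e.g. ['Cheap' 'Luxury' 'Medium' 'Quite rich']
```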
Show the tree
Use sklearn.tree.export_graphviz() to view our tree. Its raw text output is hard to read, so we run the following to generate an intuitive diagram.
```python
try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3

import sklearn.tree
import IPython.display
import pydot

# Export the fitted tree as DOT text into an in-memory buffer
file_obj = StringIO()
sklearn.tree.export_graphviz(tree, out_file=file_obj)

# Parse the DOT text and render it as a PNG image in the notebook
graphs = pydot.graph_from_dot_data(file_obj.getvalue())
IPython.display.Image(graphs[0].create_png())
```
OK, let’s trace a case inside the diagram. Just try the left-most branch.
- If x[0], or “Pclass”, is less than or equal to 1.5 (in practice, Pclass is 1)
- And if x[1], or “is_male”, is less than or equal to 0.5 (in practice, 0 or female)
- And if x[2], or “Age”, is less than or equal to 43.5
- Then we get [2, 11, 23, 26]
- Comparing against all the labeled results, this person is most likely “Quite rich” with 26 samples.
| Cheap | Luxury | Medium | Quite rich |
|---|---|---|---|
| 2 | 11 | 23 | 26 |
We can find how effective each column is with .feature_importances_, where “feature” means the column we are interested in.
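For example, a quick sketch that pairs each column with its score, assuming the fitted tree and the x from before:

```python
# Importance scores sum to 1.0; a higher score means the column
# contributes more to the tree's splits
for name, score in zip(x.columns, tree.feature_importances_):
    print(f"{name}: {score:.3f}")
```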
Let’s try predicting one. Say this person is in “Pclass” 1, is female (“is_male” = 0), and is 40 years old. As a result, she is predicted as “Quite rich” in “Fare”.
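A sketch of that prediction, reusing the column names from our x:

```python
import pandas as pd

# Pclass 1, female (is_male = 0), 40 years old
passenger = pd.DataFrame([[1, 0, 40]], columns=["Pclass", "is_male", "Age"])
print(tree.predict(passenger))  # expected: ['Quite rich']
```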
Here we reach the end, and we may have a question: when should we apply which classifier, a decision tree or logistic regression? As far as I know, if we are sure the data can be separated into two groups by a straight line, we can use logistic regression. Otherwise, use a decision tree.
Let’s see next time.
Bye~
next: Note of data science training EP 7: Metrics – It is qualified