prev: Note of data science training EP 5: Logistic Regression & Dummy Classifier – Divide and Predict

Continuing from EP 5, here we go with another classification model. It is …

# Decision tree

Decision tree is a classification model in the same group as Logistic regression, but it tends to **create boxes** over groups of data. The boxes can be resized smaller and smaller to cover each group of data as tightly as possible, while Logistic regression **draws a line** **to separate the data into two groups straightforwardly.**

Because of this, when there are outliers, Decision tree struggles to cover them and can output quite a different result. On the other hand, Logistic regression is able to deal with them in a better way.

Reference link:

- https://blog.bigml.com/2016/09/28/logistic-regression-versus-decision-trees/
- https://www.geeksforgeeks.org/ml-logistic-regression-v-s-decision-tree-classification/

# Let’s prepare the data

Assign `x` as a DataFrame of “Pclass”, “is_male”, and “Age”. Then assign `y` as “Fare” classified through our function using `.map()`.
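For example, a minimal sketch, assuming the Titanic data sits in a local `titanic.csv` and that the fare cut points below are made up (the original note only shows the four labels our function produces):

```
import pandas as pd

# assumption: the Titanic data is in a local CSV file
df = pd.read_csv("titanic.csv")

# derive "is_male" from "Sex" and drop rows with missing values,
# since DecisionTreeClassifier cannot handle NaN
df["is_male"] = (df["Sex"] == "male").astype(int)
df = df.dropna(subset=["Age", "Fare"])

x = df[["Pclass", "is_male", "Age"]]

# hypothetical cut points -- the real ones in "our function" are not shown
def classify_fare(fare):
    if fare < 10:
        return "Cheap"
    elif fare < 50:
        return "Medium"
    elif fare < 150:
        return "Quite rich"
    return "Luxury"

y = df["Fare"].map(classify_fare)
```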

# Grow the tree

First, create a `DecisionTreeClassifier` with two main parameters:

`criterion`: we can choose one of the following two:

- “gini”, from “Gini impurity”. This measures how often a randomly chosen result would be incorrectly labeled. It aims at reducing misclassification.
- “entropy” measures how uncertain the results are. This suits exploratory analysis.

`max_depth`: defines the number of layers of the tree. This time we define 3.
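As a small sketch with the parameters above (the variable name `tree` is reused by the diagram code later):

```
import sklearn.tree

# "gini" to aim at fewer misclassifications, with at most 3 layers
tree = sklearn.tree.DecisionTreeClassifier(criterion="gini", max_depth=3)
```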

Reference link:

- https://medium.com/@jason9389/gini-impurity-and-entropy-16116e754b27
- https://www.quora.com/What-is-difference-between-Gini-Impurity-and-Entropy-in-Decision-Tree
- https://dzone.com/articles/logistic-regression-vs-decision-tree
- https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

Then `.fit()` them.

Eh, what are all the labeled results? We can use `.classes_`.
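A minimal sketch, assuming `x`, `y`, and `tree` from the steps above:

```
# learn the tree from our prepared data
tree.fit(x, y)

# list every label the tree has seen, in sorted order
print(tree.classes_)
# e.g. ['Cheap' 'Luxury' 'Medium' 'Quite rich']
```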

Done.

# Show the tree

Use `sklearn.tree.export_graphviz()` to view our tree. Its raw output is hard to understand, so we run this to generate an intuitive diagram instead.

```
try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3
import sklearn.tree
import IPython.display
import pydot

# export the fitted tree as DOT text into an in-memory buffer
file_obj = StringIO()
sklearn.tree.export_graphviz(tree, out_file=file_obj)

# parse the DOT text and render it as a PNG inside the notebook
graph = pydot.graph_from_dot_data(file_obj.getvalue())
IPython.display.Image(graph[0].create_png())
```

Ok. Let’s find a case inside. Just try the leftmost branch.

- If `x[0]`, or “Pclass”, is less than or equal to 1.5 (actually Pclass is 1)
- And if `x[1]`, or “is_male”, is less than or equal to 0.5 (actually 0, or female)
- And if `x[2]`, or “Age”, is less than 43.5
- Then we get the class counts [2, 11, 23, 26]
- Comparing against all the labeled results, this person can be “Quite rich”, the class with the largest count of 26.

| Cheap | Luxury | Medium | Quite rich |
|---|---|---|---|
| 2 | 11 | 23 | 26 |

We can find how effective each column is with `.feature_importances_`, where “feature” means the column we are interested in.
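For instance, a quick sketch pairing each column with its learned importance:

```
# match each feature name with its importance score
for name, importance in zip(x.columns, tree.feature_importances_):
    print(name, importance)
```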

Let’s try predicting one. Say this person is in “Pclass” 1, is female (“is_male” = 0), and is 40 years old. As a result, she is predicted as “Quite rich” in “Fare”.
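A minimal sketch of that prediction, assuming the fitted `tree` from above:

```
# one passenger: Pclass 1, female, 40 years old
sample = pd.DataFrame([[1, 0, 40]], columns=["Pclass", "is_male", "Age"])
print(tree.predict(sample))
# e.g. ['Quite rich']
```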

Here we come to the end, and we may have a question: **when do we apply which** classifier, Decision tree or Logistic regression?

As far as I know, when we are **sure the data can be separated into two groups** by a straight line, we can use Logistic regression. Otherwise, use Decision tree.

Let’s see next time.

Bye~

next: Note of data science training EP 7: Metrics – It is qualified