Decision Trees
A decision tree is a tree-like graph in which each internal node represents the point where we pick an attribute and ask a question, each edge represents an answer to that question, and each leaf represents the actual output or class label. Decision trees are used for non-linear decision making with simple, piecewise-linear decision surfaces. They classify examples by sorting them down the tree from the root to some leaf node, and that leaf node provides the classification for the example. Each node in the tree acts as a test case for some attribute, and each edge descending from that node corresponds to one of the possible answers to the test. This process is recursive and is repeated for every subtree rooted at the new nodes.
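To make the structure concrete, here is a minimal sketch using scikit-learn (the Iris dataset and the choice of petal features are assumptions, purely for illustration) that trains a small tree and prints its nodes, branches, and leaves:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Load a small example dataset and keep only petal length and width
iris = load_iris()
X, y = iris.data[:, 2:], iris.target

# Fit a shallow tree so the printed structure stays readable
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

# Each internal node asks a question about one feature, each branch is an
# answer, and each leaf corresponds to a class label
print(export_text(tree_clf, feature_names=iris.feature_names[2:]))
```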
Making Predictions
After the decision tree model is fit, prediction starts at the root node. Each node asks a question about one feature, and the instance moves to the left or right child depending on the answer, until it reaches a leaf node. The leaf's value is the prediction.
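Continuing the sketch above, a prediction for a single made-up instance traverses the tree from the root to a leaf:

```python
# A flower with petal length 5 cm and petal width 1.5 cm (illustrative values)
new_instance = [[5.0, 1.5]]
print(tree_clf.predict(new_instance))  # e.g. array([1]) -> predicted class index
```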
Gini Impurity
A node's Gini impurity measures how mixed the training instances that reach it are, i.e. how pure the node is. If all the instances at a node belong to the same class, the node is pure and its Gini score is 0. For node i the score is G_i = 1 - sum_k (p_i,k)^2, where p_i,k is the ratio of class k instances among the training instances at node i.
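As a small illustration of the formula (this is not sklearn's internal code, just a sketch):

```python
def gini_impurity(class_counts):
    """Gini impurity of a node, given the count of training instances per class."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

print(gini_impurity([50, 0, 0]))   # pure node -> 0.0
print(gini_impurity([0, 49, 5]))   # mixed node -> roughly 0.168
```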
Estimating class probabilities: Decision trees can also estimate the probability that a particular instance belongs to a class k. The tree traverses down to the leaf node for that instance and returns the ratio of training instances of class k in that leaf.
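In scikit-learn this is exposed through predict_proba. Continuing the earlier sketch:

```python
# Class probabilities for the same illustrative instance, taken from the
# training-class ratios of the leaf it falls into
print(tree_clf.predict_proba([[5.0, 1.5]]))  # e.g. [[0.    0.907 0.093]]
```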
CART Training algorithm
Sklearn uses the Classification And Regression Tree (CART) algorithm to make the decision trees. At each node, CART picks a feature k and a threshold t_k and splits the training set into two subsets, choosing the pair (k, t_k) that minimizes the cost function

J(k, t_k) = (m_left / m) * G_left + (m_right / m) * G_right

where G_left and G_right measure the impurity of the left and right subsets, m_left and m_right are the number of instances in each subset, and m is the number of instances at the node being split.
The CART algorithm is greedy in nature: it searches for the best split at the current node only, without checking whether that split leads to the lowest impurity several levels further down. A greedy search usually produces a reasonably good tree, but it is not guaranteed to be the optimal one.
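A hedged sketch of one greedy split step (a simplification for illustration, not sklearn's actual implementation; the exhaustive threshold search and function names are assumptions):

```python
import numpy as np

def gini(y):
    """Gini impurity of a set of labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Greedily pick the (feature, threshold) pair minimizing J(k, t_k)."""
    m = len(y)
    best = (None, None, float("inf"))  # (feature index, threshold, cost)
    for k in range(X.shape[1]):
        for t in np.unique(X[:, k]):
            left, right = y[X[:, k] <= t], y[X[:, k] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            cost = len(left) / m * gini(left) + len(right) / m * gini(right)
            if cost < best[2]:
                best = (k, t, cost)
    return best
```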
Gini Impurity and Entropy
Decision Trees in sklearn use Gini impurity by default, but you can choose entropy instead by setting the criterion hyperparameter appropriately. Most of the time the two criteria do not make much of a difference; Gini impurity is slightly faster to compute. Gini impurity tends to isolate the most frequent class in its own branch of the tree, while entropy tends to produce slightly more balanced trees.
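A minimal sketch of switching the criterion:

```python
from sklearn.tree import DecisionTreeClassifier

# criterion defaults to "gini"; entropy is selected like this
entropy_tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=42)
```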
Regularization Parameters
If left unconstrained, the tree structure will adapt itself to the training data, fitting it very closely and most likely overfitting it. Such models are called nonparametric models, not because they have no parameters but because the number of parameters is not determined prior to training. A parametric model, such as a linear model, has a predefined number of parameters, thus limiting its degrees of freedom and reducing its risk of overfitting.
The DecisionTreeClassifier in sklearn has a few other parameters that similarly restrict the shape of the tree, such as the following (a short usage sketch appears after the list):
min_samples_split: The minimum number of samples a node must have before it can be split.
min_samples_leaf: The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
min_weight_fraction_leaf: The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.
max_leaf_nodes: Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None, the number of leaf nodes is unlimited.
max_features: The number of features to consider when looking for the best split.
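For illustration, here is a minimal sketch of constraining a tree with a few of these hyperparameters (the particular values are arbitrary assumptions):

```python
from sklearn.tree import DecisionTreeClassifier

# Restrict the tree's shape to reduce overfitting; the exact values are illustrative
regularized_tree = DecisionTreeClassifier(
    min_samples_split=10,
    min_samples_leaf=4,
    max_leaf_nodes=16,
    max_features="sqrt",
    random_state=42,
)
```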
Regression
Decision trees are also capable of performing regression tasks. This again uses the CART algorithm, but instead of minimizing the Gini impurity, it tries to minimize the MSE when splitting each node.
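A minimal sketch of a regression tree (the noisy quadratic toy data is an assumption, purely for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy quadratic toy dataset (illustrative assumption)
rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(200, 1))
y = X[:, 0] ** 2 + 0.05 * rng.normal(size=200)

# CART splits each node so as to minimize the MSE instead of the Gini impurity
tree_reg = DecisionTreeRegressor(max_depth=3, random_state=42)
tree_reg.fit(X, y)
print(tree_reg.predict([[0.5]]))  # roughly 0.25 for this toy data
```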