# An introduction to decision trees

Decision trees are one of the major data structures in machine learning. Their inner workings rely on heuristics which, while satisfying to our intuitions, give remarkably good results in practice (notably when they are aggregated into "random forests"). Their tree structure also makes them readable by human beings, contrary to other "black box" approaches such as neural networks.

In this tutorial, we will see how decision trees work and provide a few theoretical justifications along the way. We will also (briefly) talk about their extension to random forests. To get started, you should already be familiar with the general setting of supervised learning.

## Training of a decision tree

A decision tree models a hierarchy of tests on the values of its input, which are called attributes or features. At the end of these tests, the predictor outputs a numerical value or chooses a label in a pre-defined set of options. The former is called regression, the latter classification. For instance, the following tree could be used to decide if a French student finds a lecture interesting or not, i.e. classification in the binary set {oui, non} (French booleans ;p), based on the boolean attributes {motivation, surprenant, difficile} and a discrete attribute durée $$\in$$ {1 h, 2 h, 3 h}. We could also have made the last one a continuous attribute, but we'll come back to that. The input to our decision is called an instance and consists of values for each attribute. It is usually denoted by $$(\bfx, y)$$ where $$y$$ is the attribute the tree should learn to predict and $$\bfx = (x_1, \ldots, x_m)$$ are the $$m$$ other attributes. A decision tree is learned on a collection of instances $$T = \{(\bfx, y)\}$$ called the training set.

Example of training set for the decision tree above
date motivation durée (h) difficile surprenant température (K)
11/03 oui 1 non oui 271
14/04 oui 4 oui oui 293
03/01 non 3 oui non 297
25/12 oui 2 oui non 291
... ... ... ... ... ...

Given our observations $$T = \{ (\bfx, y) \}$$, we now want to build a decision tree that predicts $$y$$ given new instances $$\bfx$$. As of today (2011) there are essentially two families of algorithms to do this: Quinlan trees and CART trees. Both follow a divide-and-conquer strategy that we can sum up, for discrete attributes, by the following pseudocode:

1:  DecisionTree(T: training set):
2:      if "stop condition":
3:          return leaf(T)
4:      choose the "best" attribute i between 1 and m
5:      for each value v of i:
6:          T[v] = {(x, y) from T such that x_i = v}
7:          t[v] = DecisionTree(T[v])
8:      return node(i, {v: t[v]})


In what follows, we use uppercase $$T$$ for training sets and lowercase $$t$$ for trees. A tree is either a leaf leaf(T), where we save a subset $$T$$ of the original training set, or a node node(i, {v: t[v]}) that tests attribute $$i$$ and has a child tree $$t_v$$ for each potential value $$v$$ of the attribute. The two statements with quotation marks correspond to heuristics specific to each algorithm:

Stop condition:
The condition used to decide when it is time to produce an answer rather than grow the tree to better fit the data. For instance, the condition $$|T| = 1$$ will produce detailed trees with excellent score on the training set (for each instance $$(\bfx, y) \in T$$ the tree will predict exactly $$y$$ given $$\bfx$$) but quite deep and not likely to generalize well outside of this set due to overfitting.
Best attribute:
The goal of this heuristic is to evaluate locally what attribute yields the "most information" on (or is the "most correlated" to) the result to predict. We will see examples of such criteria below.

When learning a continuous attribute $$x_i \in \mathbb{R}$$, we adapt the above pseudocode by selecting a split value $$v$$ corresponding to a test $$x_i \leq v$$. We will denote by node(i, v, t[≤], t[>]) the corresponding node constructor. In the particular case where all attributes are continuous, the resulting decision tree is a binary tree.

## Regression

Let us consider first the case of regression, that is, when the value to predict is a real number. Once our tree is built, regressing a new instance $$\bfx$$ amounts to traversing a branch as follows:

 1:  Regress(x, t)
2:      if t is a leaf(Tf):
3:          return the average of y's in Tf
4:      else if t is a node(i, v, t_left, t_right):
5:          if x[i] <= v:
6:              return Regress(x, t_left)
7:          else (x[i] > v):
8:              return Regress(x, t_right)
9:      else (t is a node(i, {v -> t[v]})):
10:          return Regress(x, t[x[i]])


This exploration ends up in a leaf $$f$$ built on a training set $$T_f$$, which we denoted in pseudocode as leaf(Tf). Let us define $$Y_f := \{ y | \exists (\bfx, y) \in T_f \}$$. The value predicted at the end of the regression process is the average $$\overline{y}$$ of $$Y_f$$. The more the distribution of $$y \in Y_f$$ is concentrated around $$\overline{y}$$, the better the prediction will be. Equivalently, the higher the dispersion of these values, the worse the prediction. CART trees thus use the variance $$\overline{\sigma}^2$$ of $$Y_f$$ to measure this dispersion and evaluate the "quality" of the leaf $$f$$.

### Choosing a split attribute

CART also uses this variance criterion to choose the best split attribute at line 4 of our DecisionTree(T) constructor. A given attribute $$i$$ yields a partition $$T = \cup_{v_i} T_{v_i}$$ of the training set $$T$$, where each subset $$T_{v_i}$$ has its own variance $$\mathbb{V}(T_{v_i})$$. Variance after selecting $$i$$ as split attribute is then, for an instance $$(\bfx, y)$$ drawn uniformly at random from $$T$$:

\begin{equation*} V_i = \sum_{v_i} \frac{|T_{v_i}|}{|T|} \mathbb{V}(T_{v_i}) \end{equation*}

The attribute $$i^*$$ that minimizes this value is (heuristically) considered as the best attribute. We will see in the next section how this choice of minimizing variance is not only heuristic: it is also intrinsically linked to minimizing a mean squared error.

### Why minimize variance?

One of the main assumptions in machine learning is that instances $$(\bfx, y)$$ are drawn independently from the same unknown probability distribution $$P$$. We can therefore see our $$(\bfx, y) \in T$$ as realizations of i.i.d. random variables $$\bfX$$ and $$Y$$ (the two of them possibly correlated). Similarly, leaf values $$y \in Y_f$$ are i.i.d. realizations of the random variable $$Y_f$$ obtained by conditioning $$Y$$ upon the events $$\{X_i \leq \ \mathrm{or} \ > v\}$$ tested along the branch from the root of our decision tree to the leaf $$f$$. With this in mind, we can interpret $$\overline{y}$$ as an estimator for $$\mathbb{E}(Y_f)$$. Its mean squared error is then:

\begin{align} \mathbb{E}\left[(Y_f - \overline{y})^2\right] & = \mathbb{E}\left[ (Y_f - \mathbb{E}(Y_f) + \mathbb{E}(Y_f) - \overline{y})^2\right] \\ & = (\overline{y} - \mathbb{E}(Y_f))^2 + \mathbb{E}\left[ (Y_f - \mathbb{E}(Y_f))^2 \right] \end{align}

In this calculation, $$\overline{y}$$ and $$\mathbb{E}(Y_f)$$ are constants. The second term in the final sum is exactly the variance $$\sigma^2$$ of $$Y_f$$, which CART trees estimate via $$\overline{\sigma}^2$$. Minimizing $$\overline{\sigma}$$ for a leaf $$f$$ thus aims at minimizing this second term, but also the first one! Since $$\overline{y}$$ is an arithmetic average

\begin{equation*} \overline{y} = \frac{1}{|Y_f|} \sum_{y \in Y_f} y = \frac{1}{k} \sum_{i=1}^k Y_f[i] \end{equation*}

of realizations of $$Y_f$$ (the random variable, not the set of values—sorry for the overlapping notations), we have $$\mathbb{E}_T(\overline{y}) = \mathbb{E}(Y_f)$$ and $$\mathbb{V}_T(\overline{y}) = \sigma^2 / n$$. Here the index $$T$$ means that expectations are taken over all possible training sets with fixed size $$|T| = n$$, with the size of the leaf set also fixed $$|Y_f| = k$$. Consequently, Markov's inequality yields:

\begin{equation*} \mathbb{P}\left[ (\overline{y} - \mathbb{E}(Y_f))^2 > x \right ] < \frac{\sigma^2}{k x}, \end{equation*}

which shows handwavingly how minimizing the variance $$\sigma$$ will also concentrate the distribution of $$\overline{y}$$ around $$\mathbb{E}(Y_f)$$, and thus reduce the first term in the mean squared error of our estimator.

## Classification

The logic for classification is similar to regression:

 1:  Classify(x, t)
2:      if t is a leaf(Tf):
3:          return the majority class in Tf
4:      else if t is a node(i, v, t_left, t_right):
5:          if x[i] <= v:
6:              return Classify(x, t_left)
7:          else (x[i] > v):
8:              return Classify(x, t_right)
9:      else (t is a node(i, {v -> t[v]})):
10:          return Classify(x, t[x[i]])


If a majority of instances in $$T_f$$ belong to the same class $$c \in \{ 1, \ldots, C \}$$, then $$c$$ is the best prediction we can make, and the fraction $$|\{ (\bfx, y) \in T_f | y = c \}| / |T_f|$$ of instances in $$T_f$$ that belong to class $$c$$ is an indicator of the "sureness" of our prediction. When this ratio is roughly the same for all classes, predicting the most frequent class in $$T_f$$ still sounds reasonable, yet keeping in mind that the quality of the prediction is low.

### Measuring leaf homogeneity

These observations highlight how in classification a decision tree is all the better when (the instances in) its leaves are homogeneous. We can then measure the heterogeneity of a leaf the same way we measured variance for regression. Let $$p_1, \ldots, p_C$$ denote the relative frequencies of classes $$c \in \{1, \ldots, C\}$$ in $$T_f$$, and let $$c^* = \arg\max_i p_i$$ denote the most frequent class. There are three main leaf heterogeneity measures in the literature:

Error rate:
$$e(T_f) := 1 - p_{c^*}$$, the classification error rate of our algorithm over the training set.
Gini criterion:
$$e(T_f) := \sum_c p_c (1 - p_c)$$, which corresponds to the classification error rate over the training set of a randomized algorithm that would return class $$c$$ with probability $$p_c$$ (rather than deterministically returning $$c^*$$). This measure is used in CART trees.
Entropy:
$$e(T_f) := - \sum_c p_c \log p_c$$, which estimates the entropy of the class $$c$$ from an instance $$(\bfx, c)$$ drawn uniformly at random from $$T_f$$. This measure is used in ID3 and C4.5 trees.

### Choosing a split attribute

The leaf heterogeneity measure is also used to select the split attribute: if an attribute $$i$$ partitions $$T$$ into $$T = \cup_{v_i} T_{v_i}$$, then every sub-tree $$T_{v_i}$$ has its own heterogeneity $$e(T_{v_i})$$. The error we expect from our prediction after branching to the next sub-tree, for an instance drawn uniformly at random from $$T$$, is accordingly:

\begin{equation*} e_i = \sum_{v_i} \frac{|T_{v_i}|}{|T|} e(T_{v_i}) \end{equation*}

The attribute $$i^* = \arg\min_i e_i$$ that minimizes this error is (heuristically) considered as the best attribute, to be used at line 4 of our DecisionTree(T) constructor.

## Extensions

Decision trees as we described them have a number of limitations:

• The overall optimization problem is NP-complete for several optimality criteria, which is why we turned to heuristics.
• The learning procedure is static: trees are not designed to learn incrementally from new instances that would continuously extend the training set.
• They are sensitive to noise and have a tendency to overfit, that is, to learn relationships between both data and noise present in the training set.

There are solutions to these problems that work directly on trees, notably pruning (more details here and here). In practice, it is often more convenient to go for one of the techniques that follow, bagging and random forests, to alleviate overfitting while keeping the implementation simple.

### Bagging

The word bagging is an acronym for "bootstrap aggregating". It is a meta-algorithm that combines two techniques, guess which:

Bootstrapping:
a bootstrap of set $$T$$ is another set obtained by drawing $$|T|$$ times elements from $$T$$ uniformly at random and putting them back. Bootstrapping from a training set $$T$$ yields a new set $$T'$$ that has on average $$1 - e^{-1} \approx 63\%$$ unique instances from $$T$$, when $$T$$ is a big set.
Aggregating:
suppose we produce a sequence of bootstraps $$T'_1, \ldots, T'_m$$, and we use each bootstrap $$T'_i$$ to train a predictor $$t_i$$ (for us it will be a decision tree, although the technique could apply to any family of predictors). Given an instance $$(\bfx, y)$$, we regress each tree to obtain a set of predictions $$y_1, \ldots, y_m$$, and aggregate them into an arithmetic average $$\overline{y} = \sum_i y_i/m$$.

Bagging improves decision trees on several points: it smoothens out their instability (where small perturbations in the training set can yield a completely different tree due to the selection of a different early splitting attribute), and makes predictions that are less sensitive to overfitting. The price to pay is a loss in readability: bagged predictions are not the product of a "reasoning", as was the case with a singular decision tree, rather a consensus over several "reasonings" that potentially look at radically different sets of attributes.

As of 2011, our theoretical understanding of bagging is still fairly limited: a few arguments help us understand its impact and suggest conditions in which they improve predictions, but we don't have a clear model in which we can express and measure this impact.

### Random forests

Proposed in 2002 by Leo Breiman and Adele Cutler, random forests modify the DecisionTree(T) constructor to build trees that are better suited to bagging. There are two main changes:

Attribute sampling:
Line 4 of our algorithm, rather than choosing $$i$$ in the set of available attributes $$A = \{1, \ldots, m\}$$, we draw a subset $$A' \subset A$$ of size $$|A'| = m' \leq m$$ uniformly at random (for regression, the authors suggest $$m' \approx m/3$$) and choose our heuristically best attribute $$i^*$$ in $$A'$$. The goal of this step is to reduce the correlation between the several trees we train from the same training set, e.g. avoiding that they all choose the same split attribute at the root if one attribute turns out to seem consistenly better than the others, allowing for "independent thought" in a fraction of the trees that choose to ignore it at first. Intuitively, correlation means consensus, and too strong of a consensus reaps off the benefits of bagging.
Stop condition:
Trees are grown as deep as possible, i.e., our stop condition happens as late as possible (for instance $$|T_f| < 10$$ or $$|Y_f| = 1$$) within the limits of our computation power. Thus, our predictors try to learn as much as possible from their data, their individual overfitting and sensitivity to noise in the training set being compensated for later by bagging.

For further details on random forests, see the manuscript by Breiman and Cutler, or the The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Hastie et al.

## To go further

You can find the algorithms described above implemented in Python with SQLite in the PyDTL library, which is open source and distributed under the LGPLv3 license. It draws from both CART and C4.5 trees and is geared toward growing random forests. Decision trees are also implemented in other software such as milk (Python), Orange and Weka (Java).

Here comes the end of our introduction to decision tree learning. There are likely many points to improve in it: feel free to write me if you have any question or suggestion. For a deeper study of these techniques, and of supervised learning in general, I highly recommend The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Hastie et al.

This tutorial is a translation of Une introduction aux arbres de décision, a note I wrote in 2011 after my internship at INRIA with Marc Lelarge. Warm thanks to Marc for supporting this work, for our discussions, and the cool topics we worked on :-)

## Discussion

You can use Markdown with $\LaTeX$ formulas in your comment.