An introduction to decision trees

Decision trees are one of the major data structures in machine learning. Their inner workings rely on heuristics which, besides being intuitively satisfying, give remarkably good results in practice (notably when they are aggregated into "random forests"). Their tree structure also makes them readable by humans, unlike "black box" approaches such as neural networks.

In this introduction, we will see how decision trees work and provide a few theoretical justifications along the way. We will also (briefly) talk about their extension to random forests. To get started, you should already be familiar with the general setting of supervised learning.

Training of a decision tree

A decision tree models a hierarchy of tests on the values of its inputs, which are called attributes or features. At the end of these tests, the predictor outputs either a numerical value or a label chosen from a pre-defined set of options. The former task is called regression, the latter classification. For instance, the following tree could be used to decide whether a French student finds a lecture interesting, i.e. classification in the binary set {oui, non} (French booleans ;p), based on the boolean attributes {motivation, surprenant, difficile} and a discrete attribute durée \(\in\) {1 h, 2 h, 3 h}. We could also have made the last one a continuous attribute, but we'll come back to that.

Sample decision tree for the question "Is this presentation interesting?"
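To see how such a tree reads as a simple hierarchy of tests, here is a minimal Python sketch. The tests and their order are hypothetical, chosen only for illustration; they do not necessarily match the tree in the figure.

def interessant(motivation: bool, surprenant: bool, difficile: bool, duree: int) -> str:
    """Hypothetical hierarchy of tests; the tree in the figure may differ."""
    if not motivation:
        return "non"  # an unmotivated student is never interested
    if duree >= 3:
        return "oui" if surprenant else "non"
    return "non" if difficile else "oui"

# e.g. interessant(motivation=True, surprenant=False, difficile=True, duree=2) == "non"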

The input to our decision is called an instance and consists of values for each attribute. It is usually denoted by \((\bfx, y)\) where \(y\) is the attribute the tree should learn to predict and \(\bfx = (x_1, \ldots, x_m)\) are the \(m\) other attributes. A decision tree is learned on a collection of instances \(T = \{(\bfx, y)\}\) called the training set.

Example of training set for the decision tree above
date   motivation  durée (h)  difficile  surprenant  température (K)
11/03  oui         1          non        oui         271
14/04  oui         4          oui        oui         293
03/01  non         3          oui        non         297
25/12  oui         2          oui        non         291
...    ...         ...        ...        ...         ...
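In code, such a training set could be stored as a plain list of \((\bfx, y)\) pairs, for instance with each \(\bfx\) as a dictionary of attribute values. This is only one possible convention, and the labels below are placeholders (the table does not show the attribute to predict):

# Each instance is a pair (x, y): x maps attribute names to observed values,
# y is the label to predict. The labels here are placeholders, not real data.
T = [
    ({"date": "11/03", "motivation": "oui", "durée": 1, "difficile": "non",
      "surprenant": "oui", "température": 271}, "oui"),
    ({"date": "14/04", "motivation": "oui", "durée": 4, "difficile": "oui",
      "surprenant": "oui", "température": 293}, "non"),
]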

Given our observations \(T = \{ (\bfx, y) \}\), we now want to build a decision tree that predicts \(y\) given new instances \(\bfx\). As of today (TN: 2011) there are essentially two families of algorithms to do this: Quinlan trees and CART trees. Both follow a divide-and-conquer strategy that we can sum up, for discrete attributes, by the following pseudocode:

DecisionTree(T: training set):
    if "stop condition":
        return leaf(T)
    choose the "best" attribute i between 1 and m
    for each value v of i:
        T[v] = {(x, y) from T such that x_i = v}
        t[v] = DecisionTree(T[v])
    return node(i, {v: t[v]})

In what follows, we use uppercase \(T\) for training sets and lowercase \(t\) for trees. A tree is either a leaf leaf(T), where we save a subset \(T\) of the original training set, or a node node(i, {v: t[v]}) that tests attribute \(i\) and has a child tree \(t_v\) for each potential value \(v\) of the attribute. The two statements with quotation marks correspond to heuristics specific to each algorithm:

Stop condition:
The condition used to decide when it is time to produce an answer rather than grow the tree to better fit the data. For instance, the condition \(|T| = 1\) will produce detailed trees with an excellent score on the training set (for each instance \((\bfx, y) \in T\), the tree will predict exactly \(y\) given \(\bfx\)), but such trees are quite deep and unlikely to generalize well outside of this set due to overfitting.
Best attribute:
The goal of this heuristic is to evaluate locally what attribute yields the "most information" on (or is the "most correlated" to) the result to predict. We will see examples of such criteria below.
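Putting the pieces together, here is a rough Python sketch of the pseudocode above for discrete attributes. The two heuristics are passed in as functions, since the concrete criteria are precisely what the following sections discuss; the majority-vote leaf and the removal of the attribute already used are implementation choices of this sketch, not requirements of the algorithm.

from collections import Counter

def leaf(T):
    """Leaf storing the training subset T (plus, for convenience, its majority label)."""
    majority = Counter(y for _, y in T).most_common(1)[0][0]
    return ("leaf", majority, T)

def node(i, children):
    """Internal node testing attribute i, with one child tree per value of i."""
    return ("node", i, children)

def decision_tree(T, attributes, best_attribute, stop_condition):
    """Divide-and-conquer construction of a decision tree on discrete attributes."""
    if stop_condition(T) or not attributes:
        return leaf(T)
    i = best_attribute(T, attributes)
    children = {}
    for v in {x[i] for x, _ in T}:  # each value of attribute i observed in T
        T_v = [(x, y) for x, y in T if x[i] == v]
        remaining = [a for a in attributes if a != i]  # avoid re-splitting on i
        children[v] = decision_tree(T_v, remaining, best_attribute, stop_condition)
    return node(i, children)

With the toy T from the previous sketch, a call could look like the following, where the stop condition fires as soon as all labels in T agree and the best-attribute placeholder stands in for a real split criterion:

# Deliberately naive heuristics: stop when labels are pure, and (as a placeholder
# for a real criterion) pick the first remaining attribute.
tree = decision_tree(T, ["motivation", "durée", "difficile", "surprenant"],
                     best_attribute=lambda T, attrs: attrs[0],
                     stop_condition=lambda T: len({y for _, y in T}) <= 1)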

When an attribute \(x_i \in \mathbb{R}\) is continuous, we adapt the above pseudocode by selecting a split value \(v\) and testing \(x_i \leq v\). We will denote by node(i, v, t[≤], t[>]) the corresponding node constructor. In the particular case where all attributes are continuous, the resulting decision tree is a binary tree.
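For a continuous attribute, the loop over values becomes a search for a threshold. A simple and common choice, assumed here since the article has not yet specified one, is to try the midpoints between consecutive distinct observed values and keep the one that scores best under the split criterion:

def candidate_thresholds(T, i):
    """Candidate split values for a continuous attribute i: midpoints between
    consecutive distinct values observed in T (one common convention)."""
    values = sorted({x[i] for x, _ in T})
    return [(a + b) / 2.0 for a, b in zip(values, values[1:])]

def split(T, i, v):
    """Partition T according to the test x_i <= v."""
    left = [(x, y) for x, y in T if x[i] <= v]
    right = [(x, y) for x, y in T if x[i] > v]
    return left, right

The construction then builds node(i, v, t[≤], t[>]) from the two halves returned by split, choosing among all attributes and candidate thresholds the pair (i, v) that the split criterion ranks best.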

Regression

... translation in progress...

Choosing a split criterion

... translation in progress...

Why minimize variance?

... translation in progress...

Classification

... translation in progress...

Measuring leaf homogeneity

... translation in progress...

Choosing a split criterion

... translation in progress...

Extensions

... translation in progress...

Bagging

... translation in progress...

Random forests

... translation in progress...

To go further

This article is a translation of Une introduction aux arbres de décision, a report I wrote in 2011 after my internship at INRIA with Marc Lelarge. Warm thanks to Marc for supporting this work, for our discussions, and the cool topics we worked on :-)
