Decision Trees

Decision Trees are a statistical/machine learning technique for classification and regression. They are capable of discovering complex interactions between variables and making accurate predictions on new data.

Most popular decision tree algorithms (ID3, C4.5, CART) work by repeatedly partitioning the input space along the dimensions containing the most information. Different algorithms use different metrics to determine both which variable to partition on and where to create the partition. Some of the more popular metrics are Gini Impurity and Information Gain. Newer decision tree algorithms, such as conditional inference trees, use permutation tests to assess variable importance.
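To make the split criteria concrete, here is a minimal sketch in plain Python of how Gini Impurity and Information Gain score a candidate binary split. The helper functions and toy "survived"/"died" labels are illustrative assumptions, not part of any particular algorithm's implementation:

    from collections import Counter
    import math

    def gini(labels):
        """Gini impurity of a set of class labels: 1 - sum(p_k^2)."""
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def entropy(labels):
        """Shannon entropy of a set of class labels: -sum(p_k * log2(p_k))."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(parent, left, right):
        """Reduction in entropy from splitting `parent` into `left` and `right`."""
        n = len(parent)
        weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        return entropy(parent) - weighted

    # Toy example: does the candidate partition separate the classes well?
    parent = ["survived"] * 5 + ["died"] * 5
    left = ["survived"] * 4 + ["died"]
    right = ["survived"] + ["died"] * 4
    print(gini(parent))                           # 0.5 (maximally impure parent)
    print(information_gain(parent, left, right))  # ~0.28, so the split is informative

A splitting algorithm evaluates many candidate partitions this way and keeps the one with the lowest weighted impurity (or, equivalently, the highest gain).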

Decision Tree Visualization

A simple decision tree for determining whether a passenger on the Titanic survived, conditional on different characteristics.

Most of these algorithms create binary decision trees: each node in the tree has either zero or two children. A notable exception is CHAID, a decision-tree technique that allows multi-way splits and is often used for customer segmentation analysis.

We typically use decision trees when the problem involves non-linear relationships with complex interactions and requires both good predictive accuracy and relatively simple interpretability and attribution. If more predictive accuracy is required, we usually switch to ensemble methods such as Random Forests or Boosted Trees.
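As a rough illustration of that trade-off, the sketch below fits a single shallow tree and a Random Forest and compares their cross-validated accuracy. It assumes scikit-learn and its bundled iris dataset (not the Titanic data from the figure), chosen only to keep the example self-contained:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)

    # A single, interpretable tree (CART-style, Gini impurity by default).
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    print("decision tree:", cross_val_score(tree, X, y, cv=5).mean())

    # If accuracy matters more than interpretability, an ensemble usually wins.
    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    print("random forest:", cross_val_score(forest, X, y, cv=5).mean())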

Benefits:
  • Accept both numeric and categorical data and can handle missing values
  • Straightforward interpretation of the fitted model
  • Algorithms are fast and can handle large datasets
Drawbacks:
  • Usually need careful adjustment through pruning to avoid overfitting (see the pruning sketch after this list)
  • Can be unstable (high variance) so small changes in the underlying dataset may produce meaningfully different models
  • Cannot easily model certain types of relationships, e.g. a simple linear relationship between two variables
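
To illustrate the pruning drawback noted above, here is a minimal sketch of cost-complexity pruning. It assumes scikit-learn and its bundled breast-cancer dataset; the midpoint choice of alpha is arbitrary and for illustration only (in practice you would select it by cross-validation):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # An unpruned tree typically overfits: near-perfect on training data, worse on test data.
    full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    print("unpruned:", full.score(X_train, y_train), full.score(X_test, y_test))

    # Cost-complexity pruning: pick an alpha from the pruning path and refit.
    path = full.cost_complexity_pruning_path(X_train, y_train)
    alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # crude midpoint choice, illustration only
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    print("pruned:", pruned.score(X_train, y_train), pruned.score(X_test, y_test))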