Introduction
Boosting is a numerical optimization technique for minimizing a loss function by adding, at each step, a new tree that best reduces (steps down the gradient of) that loss. For Boosted Regression Trees (BRT), the first regression tree is the one that, for the selected tree size, maximally reduces the loss function; each subsequent tree is then fit to what the current ensemble still gets wrong.
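To make this concrete, here is a minimal sketch of the boosting loop for squared-error loss. All names, the learning rate, and the choice of 50 small trees are just for illustration, and the trees are fit with scikit-learn's DecisionTreeRegressor: each new tree is fit to the current residuals (the negative gradient of the squared-error loss) and its predictions are added to the ensemble with a small learning rate.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, noise=10, random_state=0)

learning_rate = 0.1
trees = []
# Start from a constant prediction (here, the mean of y)
pred = np.full(y.shape, y.mean())

for _ in range(50):
    residuals = y - pred                       # negative gradient of squared-error loss
    tree = DecisionTreeRegressor(max_depth=3)  # a small "weak" tree
    tree.fit(X, residuals)                     # fit the new tree to the residuals
    pred += learning_rate * tree.predict(X)    # step down the gradient
    trees.append(tree)

# The final prediction is the initial constant plus the sum of the scaled trees
print(np.mean((y - pred) ** 2))  # in-sample mean squared error after 50 trees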
Keep in Mind
The Boosted Trees Model is a type of additive model that makes predictions by combining decisions from a sequence of base models. More formally, we can write this class of models as:
\[g(x) = f_0(x) + f_1(x) + f_2(x) + \dots\]
where the final classifier \(g\) is the sum of simple base classifiers \(f_i\). For the boosted trees model, each base classifier is a simple decision tree. This broad technique of using multiple models to obtain better predictive performance is called model ensembling.
Random forests improve upon bagged trees by decorrelating the trees: each split (in each tree) considers only a random subset of predictors. Boosted trees, by contrast, pass information from one tree to the next. Each new tree is fit to the current residuals and added to the model, and the residuals are then updated. The trees are typically small, so the ensemble improves slowly in exactly the places where it is still struggling; the short sketch at the end of this section illustrates this sequential improvement.
- Check here for more help.
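Both the additive structure above and the tree-by-tree improvement can be seen directly in a fitted model. A minimal sketch, assuming scikit-learn and an arbitrary synthetic dataset: staged_predict returns the ensemble's prediction after each tree \(f_i\) is added, so you can track the test error as trees accumulate.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = GradientBoostingRegressor(n_estimators=200, max_depth=3, learning_rate=0.1)
reg.fit(X_train, y_train)

# Test-set mean squared error of the ensemble after each tree f_i is added
errors = [np.mean((y_test - pred) ** 2) for pred in reg.staged_predict(X_test)]
print(errors[0], errors[49], errors[-1])  # the error typically falls as trees accumulate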
Also Consider
- There are non-boosted approaches to decision trees, which can be found at Decision Trees and Random Forest.
Implementations
Python
There are several packages that can be used to estimate boosted regression trees, but sklearn provides an estimator, GradientBoostingRegressor, that is perhaps the most user-friendly.
# Install scikit-learn using conda or pip if you don't already have it installed
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
# Generate some synthetic data
X, y = make_regression()
# Split the synthetic data into train and test arrays
X_train, X_test, y_train, y_test = train_test_split(X, y)
# The number of trees is set by n_estimators; there are many other options that
# you should experiment with. Typically the defaults will be sensible but are
# unlikely to be perfect for your use case. Let's create the empty model:
reg = GradientBoostingRegressor(n_estimators=100,
                                max_depth=3,
                                learning_rate=0.1,
                                min_samples_split=3)
# Fit the model
reg.fit(X_train, y_train)
# Predict the value of the first test case
reg.predict(X_test[:1])
# R^2 score for the model (on the test data)
reg.score(X_test, y_test)
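As the comments above note, the defaults are rarely perfect for a given problem. One common way to experiment, sketched here with an arbitrary parameter grid, is to cross-validate over the main tuning parameters with GridSearchCV, continuing from the objects created above.
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.1, 0.01],
    "max_depth": [2, 3],
}
search = GridSearchCV(GradientBoostingRegressor(), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)           # the best combination found by cross-validation
print(search.score(X_test, y_test))  # R^2 of the refit best model on the test data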
R
Boosted trees can be produced using the gbm package; in the example below, gbm is tuned and fit through caret's train() interface.
Boosting has three tuning parameters:
- The number of trees, \(B\) (important to prevent overfitting)
- The shrinkage parameter, \(\lambda\) (which controls boosting's learning rate; often 0.01 or 0.001)
- The number of splits in each tree (the tree's complexity)
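Before turning to the worked example below, here is a minimal, standalone sketch of how these three parameters appear in a direct call to gbm(). The data frame df and the 0/1 outcome y are placeholders, not objects from the example that follows.
library(gbm)

# `df` is a placeholder data frame with a numeric 0/1 outcome `y` and numeric predictors
boost_fit <- gbm(
  y ~ .,
  data = df,
  distribution = "bernoulli", # use "gaussian" for a continuous outcome
  n.trees = 1000,             # B, the number of trees
  shrinkage = 0.01,           # lambda, the learning rate
  interaction.depth = 2,      # the number of splits in each tree
  cv.folds = 5                # cross-validate to choose the number of trees
)
best_B <- gbm.perf(boost_fit, method = "cv")  # CV-chosen number of trees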
# Data from: https://www.kaggle.com/kondla/carinsurance
library(tidyverse)
library(janitor)
library(caret)
library(glmnet)
library(magrittr)
library(dummies)
library(rpart.plot)
library(e1071)
library(caTools)
library(naniar)
library(forcats)
library(ggplot2)
library(MASS)
library(pROC)
library(ROCR)
library(readr)
library(gbm)
set.seed(101)
# Load in data
carInsurance_train <- read_csv("https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github.io/master/Machine_Learning/Data/boosted_regression_trees/carInsurance_train.csv")
summary(carInsurance_train)
# Produce a training and a testing subset of the data
sample = sample.split(carInsurance_train$Id, SplitRatio = .8)
train = subset(carInsurance_train, sample == TRUE)
test = subset(carInsurance_train, sample == FALSE)
# Stack the two subsets (training rows first, testing rows second) so they can be
# preprocessed together; they are re-split by position in Step 3
total <- rbind(train, test)
# Visualize the pattern of missing values across variables
gg_miss_upset(total)
Step 1: Construct features and produce dummies as appropriate
# Parse the call start and end times, then compute each call's duration in seconds
total$CallStart <- as.character(total$CallStart)
total$CallStart <- strptime(total$CallStart, format=" %H:%M:%S")
total$CallEnd <- as.character(total$CallEnd)
total$CallEnd <- strptime(total$CallEnd, format=" %H:%M:%S")
total$averagetimecall <- as.numeric(as.POSIXct(total$CallEnd) - as.POSIXct(total$CallStart), units = "secs")
# Average call duration across the data
time <- mean(total$averagetimecall, na.rm = TRUE)
Fill in missing values in the categorical variables (this must happen before dummies are created from them)
total$Job[is.na(total$Job)] <- "management"
total$Education[is.na(total$Education)] <- "secondary"
total$Marital[is.na(total$Marital)] <- "married"
total$Communication[is.na(total$Communication)] <- "cellular"
total$LastContactMonth[is.na(total$LastContactMonth)] <- "may"
Produce dummy variables as appropriate
total_df <- dummy.data.frame(total %>%
  dplyr::select(-CallStart, -CallEnd, -Id, -Outcome) %>%
  as.data.frame())
summary(total_df)
Step 2: Preprocess the data with median imputation (to fill any remaining missing numeric values)
clean_new <- preProcess(
x = total_df %>% dplyr::select(-CarInsurance) %>% as.matrix(),
method = c('medianImpute')
) %>% predict(total_df)
Step 3: Divide the data back into training and testing data (the first 3,200 rows are the training subset stacked above; the last 800 are the testing subset)
trainclean <- head(clean_new, 3200) %>% as.data.frame()
testclean <- tail(clean_new, 800) %>% as.data.frame()
summary(trainclean)
Step 4: Parameters
gbm needs the three standard parameters of boosted trees, plus one more:
- n.trees, the number of trees
- interaction.depth, the depth of each tree (the maximum number of splits from the top)
- shrinkage, the learning rate
- n.minobsinnode, the minimum number of observations in a terminal node
Step 5: Train the boosted regression tree
Notice that trControl is being set to select parameters using five-fold cross-validation ("cv").
carinsurance_boost = train(
  factor(CarInsurance) ~ .,
  data = trainclean,
  method = "gbm",
  trControl = trainControl(
    method = "cv",
    number = 5
  ),
  tuneGrid = expand.grid(
    "n.trees" = seq(25, 200, by = 25),
    "interaction.depth" = 1:3,
    "shrinkage" = c(0.1, 0.01, 0.001),
    "n.minobsinnode" = 5
  )
)
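The testclean data created in Step 3 is not used by train() above. As a follow-up sketch, you could inspect the cross-validation results and then evaluate the selected model on that held-out data with predict() and caret's confusionMatrix().
# Inspect the cross-validated results and the selected tuning parameters
carinsurance_boost
carinsurance_boost$bestTune

# Evaluate the selected model on the held-out test data
test_preds <- predict(carinsurance_boost, newdata = testclean)
confusionMatrix(test_preds, factor(testclean$CarInsurance))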