Random Forest Regression Pt. 4
Training using One Feature with Grid Search and Randomised Grid Search
This is Pt. 4 in the series covering Random Forest Regression to predict the price of RY stock. It follows from Pt. 3 on Feature Engineering. In this post, training is done using the estimators and tools provided by Scikit-Learn.
We begin with a discussion of the theoretical concepts needed for this process: hyperparameters, Grid Search, and pickling. Then, the models are trained using one feature with both Grid Search and Randomised Grid Search. The hypothesis that Randomised Grid Search is generally the better option when given the choice will be tested and validated.
Series on Random Forest Regression for Predicting the Price of RY Stock
 Pt. 1: Algorithms, Importing, Exploring and Preprocessing the Data
 Pt. 2: Visualizing the Data with Plotly and Matplotlib
 Pt. 3: Feature Engineering using Domain Knowledge and Feature Interactions
Related Post: Daily Returns
 Pt. 4: Training the Model: Introduction to Theoretical Concepts and Training using One Feature with Grid Search and Randomised Grid Search
 Pt. 5: Training the Model Using Multiple Features with a Pipeline, Feature Selection, and Randomised Grid Search
Concepts
This section describes the relevant concepts that will be used to train the Random Forest models, with abbreviated demonstrations. These concepts include model hyperparameters, Grid Search Cross-Validation and Randomised Grid Search Cross-Validation, as well as pickling.
A hyperparameter is a parameter that configures the training process, rather than a parameter belonging to the trained model itself. There are quite a few of these to consider when using Random Forests, some of which are critical to performance.
Grid Search is a technique that can help in choosing the best hyperparameters, or in fine-tuning a model. It adds more sophistication to the methods of train/test splits and cross-validation. Both types of Grid Search are discussed, along with when to choose each, their performance, and the parameters they learn.
Finally, pickling is used for model persistence, that is, to save a trained model so it can be reused later without having to go through the time-consuming process of retraining. Full implementations of all these concepts follow in the section on Training Using One Feature.
Random Forest Hyperparameters
As mentioned before, Random Forests have several hyperparameters to consider when choosing a model. This is in stark contrast to Linear Regression, which was covered earlier and has none. Some of these, like `max_features`, can significantly affect the performance of the system. `max_features` indicates the number of features to look at when attempting a split. An example output of a trained model is reproduced below and shows the hyperparameters which were used to train the model. If values are not specified, the defaults are used instead.
`RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=10, max_features='auto', max_leaf_nodes=None, min_impurity_split=1e-07, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False)`
`n_estimators` specifies the number of Decision Trees that the forest should consist of. Since these models tend to overfit, it is useful to know which hyperparameters can regularise the model. These include `max_depth`, `max_leaf_nodes`, and `min_samples_leaf`. Changing one has an effect on the other two, so it may only be necessary to change one of them. `n_jobs` sets the number of cores used during training; -1 will use all the cores. `random_state` will make the results reproducible if set to any number.
Ultimately, since these hyperparameters can make modeling very time-consuming, it is best to choose values that suit the system resources and compute time available. See here for more on the hyperparameters.
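To make the above concrete, these hyperparameters can be set when instantiating the regressor. The following is a minimal sketch on synthetic data; the specific values are arbitrary illustrations, not recommendations:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# tiny synthetic stand-in for the stock data
X = np.arange(100).reshape(-1, 1)
y = np.arange(100, dtype=float)

rfr = RandomForestRegressor(
    n_estimators=50,     # number of Decision Trees in the forest
    max_depth=10,        # regularises by limiting how deep each tree grows
    min_samples_leaf=2,  # regularises by requiring larger leaves
    n_jobs=-1,           # -1 uses all available cores
    random_state=0,      # fixes the randomness so results are reproducible
)
rfr.fit(X, y)
print(len(rfr.estimators_))  # the fitted forest holds 50 trees
```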
Hyperparameter Searching: GridSearchCV and RandomizedSearchCV
Choosing values for all these hyperparameters may seem daunting. Grid Search simplifies and automates this process by trying different combinations of hyperparameters to choose the best model. This effectively deals with the bias/variance tradeoff, where both have to be balanced to get a model that fits the data, but only so well that it can generalise to unseen data and still perform well. The CV in the object names refers to cross-validation, which is built into the function as a parameter. The number of folds can be set as shown below; in this case it is set to 10 using `cv=10`.
grid = GridSearchCV(rfr, param_grid, cv=10)
Types of Grid Search
Exhaustive Grid Search
In Scikit-Learn, Grid Search can be performed in two ways. The first way involves exhaustively searching over a finite set of specific parameters, meaning every possible combination of hyperparameters is evaluated. The hyperparameters to search over are specified in the `param_grid`, which is then passed to the `GridSearchCV()` function along with the instantiated model object and some other parameters. See the full example here.
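An exhaustive search over a small `param_grid` can be sketched as follows on synthetic data (the parameter values here are illustrative only, not the ones used later in this post):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# synthetic noisy linear data standing in for the stock series
rng = np.random.RandomState(0)
X = np.arange(200).reshape(-1, 1)
y = 0.5 * X.ravel() + rng.normal(scale=2.0, size=200)

# every combination is tried: 2 * 2 * 2 = 8 candidates
param_grid = dict(n_estimators=[10, 25],
                  max_depth=[5, 10],
                  min_samples_leaf=[1, 2])

rfr = RandomForestRegressor(random_state=0)
grid = GridSearchCV(rfr, param_grid, cv=3,
                    scoring='neg_mean_squared_error')
grid.fit(X, y)
print(grid.best_params_)  # the combination with the best CV score
```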
Randomised Grid Search
The second way, called Randomised Grid Search, randomly samples from a range of hyperparameters and then gives the best model found. It is better suited to wide, near-continuous ranges of hyperparameters that would be impractical to search exhaustively with Grid Search. Note the wide ranges in the `param_dist` below, compared to the few specific values in the `param_grid` of the previous sample. The number of parameter combinations sampled can be limited using `n_iter`. If there are fewer hyperparameter combinations available than the number of iterations chosen, the function will not run and will raise an appropriate error message. See the full example here.
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor(criterion='mse', bootstrap=True, random_state=0, n_jobs=2)
param_dist = dict(n_estimators=list(range(1, 100)),
                  max_depth=list(range(1, 100)),
                  min_samples_leaf=list(range(1, 10)))
rand = RandomizedSearchCV(rfr, param_dist, cv=10,
                          scoring='neg_mean_squared_error', n_iter=30)
rand.fit(X, y)
Which Type of Grid Search is Better?
Bergstra and Bengio compare these concepts diagrammatically. They show that, for the same number of parameter evaluations, Random Search tries more diverse values of the important hyperparameters than Grid Search. It can also find better models in the same time as Grid Search or less. Thus, in general, Randomised Grid Search is more effective because it searches over a larger hyperparameter space while its time complexity is independent of the range of that space. Hence, it is recommended to use an unrestricted search space and let the search process find the best values. See here also for more on this concept.
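This can be illustrated by giving both searches the same budget of nine fits on synthetic data; the Randomised Search draws from a far wider range at no extra cost. This is a sketch of the idea, not a benchmark:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

rng = np.random.RandomState(0)
X = np.arange(200).reshape(-1, 1)
y = 0.5 * X.ravel() + rng.normal(scale=2.0, size=200)

rfr = RandomForestRegressor(random_state=0)

# Grid Search: 3 x 3 = 9 hand-picked combinations
grid = GridSearchCV(rfr,
                    dict(n_estimators=[10, 25, 50],
                         max_depth=[3, 5, 10]),
                    cv=3, scoring='neg_mean_squared_error')
grid.fit(X, y)

# Randomised Search: the same budget of 9 fits, but drawn from wide ranges
rand = RandomizedSearchCV(rfr,
                          dict(n_estimators=list(range(1, 100)),
                               max_depth=list(range(1, 50))),
                          cv=3, scoring='neg_mean_squared_error',
                          n_iter=9, random_state=0)
rand.fit(X, y)

print(grid.best_score_, rand.best_score_)
```

Both searches cost nine cross-validated fits, yet the randomised one explores nearly 5,000 possible combinations instead of 9.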
Performance
One caveat is that both incarnations of Grid Search can be slow, depending on the number of cross-validation folds and hyperparameters involved. Higher numbers for both mean a longer search time with Grid Search, since the search is exhaustive. For example, with 10 folds and 50 possible combinations of hyperparameters, 500 distinct fits will be evaluated: each of the 50 hyperparameter combinations is trained and evaluated 10 different times.
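The fit count follows directly from multiplying the number of candidate combinations by the number of folds. For instance, with a hypothetical 4 × 4 × 3 grid of the kind used later in this post:

```python
from sklearn.model_selection import ParameterGrid

param_grid = dict(n_estimators=[10, 25, 50, 100],
                  max_depth=[5, 10, 20, 30],
                  min_samples_leaf=[1, 2, 4])

n_combinations = len(ParameterGrid(param_grid))  # 4 * 4 * 3 = 48
n_folds = 10
print(n_combinations * n_folds)  # 480 fits before the final refit
```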
When using Randomised Grid Search, the number of hyperparameter combinations searched over can be limited using the `n_iter` parameter of the `RandomizedSearchCV` function, as already mentioned. The default is 10. An increase in the number of cross-validation folds will still increase search time even if `n_iter` is held constant. In both cases the process can be cancelled if it proves too time-consuming.
Learned Parameters with Grid Search
In addition to the hyperparameters used to train the model, a fitted search object in Scikit-Learn exposes learned attributes of its own. These all end in an underscore, like `cv_results_`, `best_score_`, `best_params_` and `best_estimator_`. Each is discussed below.
`cv_results_` shows the results from cross-validation. Though the scores are negative, higher numbers (approaching 0) represent better scores. A negative sign can be used to invert this relationship so that lower numbers mean a lower error and thus a better score. The Mean Squared Error (MSE) is used as the default `criterion` for Random Forest Regression. `cv_results_` also shows all the hyperparameter combinations that were used when searching for the best ones. Note that the output can be quite lengthy, so the demonstration is truncated for brevity. In the extract below, `cv_results_` is called on the `grid`, which takes in the instantiated model as a parameter.
grid.cv_results_ {'split0_test_score': array([-9.91385846, -9.9655185 , -9.93174032, -9.90547184, -9.91385846,... 'params': ({'max_depth': 5, 'min_samples_leaf': 1, 'n_estimators': 10}, {'max_depth': 5, 'min_samples_leaf': 1, 'n_estimators': 25}, {'max_depth': 5, 'min_samples_leaf': 1, 'n_estimators': 50}, {'max_depth': 5, 'min_samples_leaf': 1, 'n_estimators': 100}, {'max_depth': 5, 'min_samples_leaf': 2, 'n_estimators': 10}, {'max_depth': 5, 'min_samples_leaf': 2, 'n_estimators': 25}, {'max_depth': 5, 'min_samples_leaf': 2, 'n_estimators': 50}, {'max_depth': 5, 'min_samples_leaf': 2, 'n_estimators': 100},...
Moving along, `best_score_` gives the best cross-validation score, corresponding to `best_params_` and `best_estimator_`. The interpretation of the MSE in `best_score_` is the same as for `cv_results_`, so a negative sign can be added to the front to invert the relationship. For the intuition behind the best cross-validation score, see this. `best_params_` shows the best parameters determined by the search; these values come from the `param_grid`. Finally, `best_estimator_` gives the best estimator object with all its hyperparameters, including both the default values and those determined by the Grid Search.
grid.best_score_ -23.278293092650244
grid.best_params_ {'max_depth': 20, 'min_samples_leaf': 4, 'n_estimators': 50}
grid.best_estimator_ RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=20, max_features='auto', max_leaf_nodes=None, min_impurity_split=1e-07, min_samples_leaf=4, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=2, oob_score=False, random_state=0, verbose=0, warm_start=False)
Model Persistence
Scikit-Learn has built-in support for pickling (creating `.pkl` files) and model persistence. This is very useful when a model takes a long time to train: the trained model is saved, then reloaded later so predictions can be made directly on it without repeating the time-consuming retraining. The learned attributes of the model discussed above can also be accessed from the pickled model.
Pickling is used in both samples in the training section, since even the less complex one takes more than a minute to train. An excerpt of the relevant code is shown below: `joblib` is imported, then the model from a Grid Search, `grid`, is saved as a pickle file called `model.pkl`. The file can then be reloaded with `joblib.load()`, and a prediction run as usual by calling `predict()` on the reloaded model.
# importing joblib and saving the model
from sklearn.externals import joblib
joblib.dump(rand, "model.pkl")

# reloading the .pkl file to make a prediction
grid = joblib.load("model.pkl")
grid.predict([[564]])  # predict() expects a 2D array
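Note that `sklearn.externals.joblib` was removed in later Scikit-Learn releases, where `joblib` is imported directly. A self-contained round trip on synthetic data looks like this (the file name and data are illustrative):

```python
import joblib  # installed as a Scikit-Learn dependency
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.arange(50).reshape(-1, 1)
y = np.arange(50, dtype=float)

model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)
joblib.dump(model, "model.pkl")      # persist the trained model to disk

reloaded = joblib.load("model.pkl")  # reload without retraining
print(reloaded.predict([[25]]))      # predict() expects a 2D array
```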
Follow along to the next section to see a full contextual example of its use. Additionally, more information is available in the Scikit-Learn documentation.
Training Using One Feature with Grid Search and Randomised Grid Search
X and y need to be set to train the model, where X contains the training data and y is the output vector. Grid Search and Randomised Grid Search are both carried out on the data and the results are compared.
Setting X and y
The input matrix X is set to the Days Elapsed column from the RY_df. It must be reshaped since it is only one feature. y is set to the target values, which correspond to the Adj Close column from the RY_df. Note that this data was already imported, preprocessed and explored here. The importing and preprocessing steps need to be followed to get the data in the form needed to follow along from this point.
Both shapes are verified to have the same number of rows (5494). Otherwise, the model cannot be trained and an error will result.
X = RY_df["Days Elapsed"].values.reshape(-1, 1)
X.shape
(5494, 1)
y = RY_df["Adj Close"]
y.shape
(5494,)
Grid Search Cross-Validation
Now that the data has been prepared for modeling, the model can be trained.
Fitting the Model
Grid Search Cross-Validation is first used to fit the model. This is an exhaustive search, as previously explained, with the specific values to try given in the `param_grid`.
First, `joblib` is imported to enable persistence. A reusable function `rfr_fit_gscv()` (`rfr` abbreviates Random Forest Regression and `gscv` Grid Search Cross-Validation) is created to take the DataFrame, parameter grid, and pickle filename. `GridSearchCV` and `RandomForestRegressor` are imported. The typical modeling pattern in Scikit-Learn is followed with some modifications: after the model is imported, it is instantiated and fitted with the data, with an intermediate step of passing the model to the `GridSearchCV()` function and setting its parameters. The model is then pickled, and the learned attributes and results are printed. Using `%time` when calling the function makes it easy to track how long the Grid Search takes.
# this is needed for pickling
from sklearn.externals import joblib

def rfr_fit_gscv(df, param_grid, filename):
    from sklearn.model_selection import GridSearchCV
    from sklearn.ensemble import RandomForestRegressor
    # setting the static parameters
    rfr = RandomForestRegressor(bootstrap=True, random_state=0, n_jobs=2)
    grid = GridSearchCV(rfr, param_grid, cv=10,
                        scoring='neg_mean_squared_error')
    grid.fit(X, y)
    # this creates the pickled file
    joblib.dump(grid, filename)
    # These are all attributes of the learned model.
    # Notice the underscore at the end of each name.
    print("grid.cv_results_ {}".format(grid.cv_results_))
    print("")
    # best_score_ is reported as a negative value
    # since the scorer negates the MSE
    print("grid.best_score_ {}".format(grid.best_score_))
    print("grid.best_params_ {}".format(grid.best_params_))
    print("grid.best_estimator_ {}".format(grid.best_estimator_))
    print("grid.n_splits_ {}".format(grid.n_splits_))

param_grid = dict(n_estimators=[10, 25, 50, 100],
                  max_depth=[5, 10, 20, 30],
                  min_samples_leaf=[1, 2, 4])

%time rfr_fit_gscv(RY_df, param_grid, 'rfr_gscv_one_features20aug1404.pkl')
... grid.best_score_ -23.278293092650244
grid.best_params_ {'max_depth': 20, 'min_samples_leaf': 4, 'n_estimators': 50}
grid.best_estimator_ RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=20, max_features='auto', max_leaf_nodes=None, min_impurity_split=1e-07, min_samples_leaf=4, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=2, oob_score=False, random_state=0, verbose=0, warm_start=False)
grid.n_splits_ 10
CPU times: user 3min 21s, sys: 13.4 s, total: 3min 34s
Wall time: 4min
The `grid.best_score_` of -23.28 (an MSE of 23.28 after negation) is excellent, but could hint at overfitting.
A look at the chart below shows that the orange learned line is indeed following the data quite closely. Zooming in shows the effects of regularisation, as the data is not actually fit as tightly as it initially appears. Nevertheless, expanding the search space when doing a Randomised Grid Search, to account for more regularisation hyperparameters, may prove useful. `grid.best_params_`, with output `{'max_depth': 20, 'min_samples_leaf': 4, 'n_estimators': 50}`, shows the specific hyperparameters from the `param_grid` that created the best performing model.
Visualising the Learned Model
def rfr_viz(X, y, label):
    # could make this an input
    grid = joblib.load('rfr_gscv_one_features20aug1404.pkl')
    # Create traces
    trace0 = go.Scatter(
        x=RY_df.index,
        y=y,
        mode='markers',
        name='markers'
    )
    trace2 = go.Scatter(
        x=RY_df.index,
        y=grid.predict(X),
        mode='lines',
        name='lines'
    )
    data = [trace0, trace2]
    layout = go.Layout(
        title=label,
        hovermode='closest',
        xaxis=dict(
            title='Date',
            ticklen=5,
            zeroline=False,
            gridwidth=2,
        ),
        yaxis=dict(
            title='Stock Price (US$)',
            ticklen=5,
            gridwidth=2,
        ),
        showlegend=False
    )
    fig = go.Figure(data=data, layout=layout)
    return py.iplot(fig, filename='rfr_viz')

rfr_viz(X, y, "RY Stock")
Matplotlib version
def rfr_viz(X, y, label):
    plt.figure()
    plt.scatter(RY_df.index, y, s=20, alpha=0.7)
    plt.plot(RY_df.index, grid.predict(X), c='r', linewidth=1)
    plt.xlabel("Date")
    plt.ylabel("Stock Price US$")
    plt.title(label)
    plt.xticks(rotation=45)
    # X_train, grid.predict(X)

rfr_viz(X, y, "RY Stock")
Running a Prediction with the .pkl file
The pickled file is reloaded inside the predict function so predictions can be made for a given date. A helper function that converts a date to Days Elapsed is included.
The prediction for Oct 19, 1997, of `$5.82` seems reasonable given the training data. The prediction for Jun 6, 2020, of `$74.08` is not convincing, given that there seems to be an underlying linear trend: it is essentially an approximation of the last value the model was trained on. Thus, for out-of-sample data, a linear or polynomial model may be more effective.
def convert_date_to_days_elapsed(df, date):
    dates = df.index
    elapsed = date - dates[0]
    return elapsed.days

def predict(df, date, filename):
    """
    This function reloads the pickled file so predictions
    can be made without retraining the model.
    This runs very quickly compared to training the model.
    """
    grid = joblib.load(filename)
    days = convert_date_to_days_elapsed(RY_df, date)
    return grid.predict([[days]])  # predict() expects a 2D array
predict(RY_df, datetime(1997, 10, 19), 'rfr_gscv_one_features20aug1404.pkl')[0]
5.8209268466868709
predict(RY_df, datetime(2020, 6, 6), 'rfr_gscv_one_features20aug1404.pkl')[0]
74.083444176200032
Randomised Grid Search Cross-Validation
A Randomised Grid Search is now used with a wider range of hyperparameters. The structure here is very similar to the one used above; the main differences are a `param_dist` with a near-continuous range of parameters to search over and the use of `RandomizedSearchCV()` in place of the `GridSearchCV()` function. There is also the addition of the `n_iter` parameter for `RandomizedSearchCV()`, which specifies how many parameter combinations to try.
def rfr_fit_rgscv(df, filename):
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.ensemble import RandomForestRegressor
    rfr = RandomForestRegressor(criterion='mse', bootstrap=True,
                                random_state=0, n_jobs=2)
    param_dist = dict(n_estimators=list(range(1, 100)),
                      max_depth=list(range(1, 100)),
                      min_samples_leaf=list(range(1, 10)))
    rand = RandomizedSearchCV(rfr, param_dist, cv=10,
                              scoring='neg_mean_squared_error',
                              n_iter=30)
    rand.fit(X, y)
    # pickling the file
    joblib.dump(rand, filename)
    # print("rand.cv_results_ {}".format(rand.cv_results_))
    print("")
    print(rand.best_score_)
    print(rand.best_params_)
    print(rand.best_estimator_)

%time rfr_fit_rgscv(RY_df, 'rfr_rgscv_one_feature_20Aug1441.pkl')
-22.4846067475
{'n_estimators': 53, 'min_samples_leaf': 8, 'max_depth': 11}
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=11, max_features='auto', max_leaf_nodes=None, min_impurity_split=1e-07, min_samples_leaf=8, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=53, n_jobs=2, oob_score=False, random_state=0, verbose=0, warm_start=False)
CPU times: user 1min 55s, sys: 6.79 s, total: 2min 2s
Wall time: 2min 22s
predict(RY_df, datetime(1997, 10, 19), 'rfr_rgscv_one_feature_20Aug1441.pkl')[0]
5.8338202274120938
predict(RY_df, datetime(2020, 6, 6), 'rfr_rgscv_one_feature_20Aug1441.pkl')[0]
74.451878668880155
The score here is comparable to that from Grid Search alone, and slightly better at 22.48 versus 23.28. The predicted values for the two dates chosen are also similar. A few of the model's hyperparameters are noticeably different, though: `min_samples_leaf` increased from 4 to 8 and `max_depth` dropped from 20 to 11. Adding more options for regularisation didn't have much effect.
The training time is reduced to roughly 60% of the previous run (2min 22s versus 4min of wall time). This may be because 30 iterations were used here, whereas the exhaustive search evaluated 48 combinations. Note that Randomised Grid Search still managed to find better parameters in less time, as was found by Bergstra and Bengio.
Conclusion
This part of the series started with an examination of the key concepts used during model training: hyperparameters, Grid Search, and pickling. The models were then trained using Grid Search and Randomised Grid Search, where the latter was shown to give better performance by searching over a wider range of parameters with time complexity independent of that range. In the next post, we continue from here by running Randomised Grid Search with multiple features using a Pipeline and Feature Selection. We also examine the important concept of Data Leakage.
References
Bergstra, James, and Yoshua Bengio. “Random search for hyper-parameter optimization.” Journal of Machine Learning Research 13.Feb (2012): 281-305.
Géron, Aurélien. “Hands-On Machine Learning with Scikit-Learn and TensorFlow.” (2017).
Hunter, John D. “Matplotlib: A 2D graphics environment.” Computing in Science & Engineering 9.3 (2007): 90-95.
McKinney, Wes. “Data structures for statistical computing in Python.” Proceedings of the 9th Python in Science Conference. Vol. 445. Austin, TX: SciPy, 2010.
Pedregosa, Fabian, et al. “Scikit-learn: Machine learning in Python.” Journal of Machine Learning Research 12.Oct (2011): 2825-2830.
Pérez, Fernando, and Brian E. Granger. “IPython: a system for interactive scientific computing.” Computing in Science & Engineering 9.3 (2007).
Sievert, C., et al. “plotly: Create interactive web graphics via Plotly’s JavaScript graphing library [Software].” (2016).
Walt, Stéfan van der, S. Chris Colbert, and Gael Varoquaux. “The NumPy array: a structure for efficient numerical computation.” Computing in Science & Engineering 13.2 (2011): 22-30.
Appendix
The code to generate the figures illustrated in Pt. 1 of this series is shown below. It is placed here because the processed data and concepts from this section were needed to create those figures.