This is Pt. 4 in the series covering Random Forest Regression to predict the price of RY stock. It follows from Pt. 3 on Feature Engineering. In this post, training is done using the estimators and tools provided by Scikit-Learn.

We begin with a discussion of the theoretical concepts needed for this process: hyperparameters, Grid Search, and pickling. Then, models are trained using one feature with both Grid Search and Randomised Grid Search. The hypothesis that Randomised Grid Search is generally the better option, when given the choice, is tested and validated.

Series on Random Forest Regression for Predicting the Price of RY Stock

## Concepts

This section describes the relevant concepts that will be used to train the Random Forest models with abbreviated demonstrations. These concepts include model hyperparameters, Grid Search Cross-Validation and Randomised Grid Search Cross-Validation, as well as Pickling.

A hyperparameter is a parameter used to train the model rather than a parameter belonging to the trained model itself. There are quite a few of these to consider when using Random Forests, some of which are critical to performance.

Grid Search is a technique that helps choose the best hyperparameters and fine-tune a model, adding more sophistication to the methods of train/test splits and cross-validation. Both types of Grid Search are discussed: which to choose in which scenario, their performance, and the parameters of the models they learn.

Finally, Pickling is used for model persistence, that is, to save a trained model so it can be reused later without having to go through the time-consuming process of retraining. Full implementations of all these concepts follow in the section on Training Using One Feature.

### Random Forest Hyperparameters

As mentioned before, Random Forests have several parameters to consider when choosing a model. This is in stark contrast to Linear Regression, which was covered earlier and has none. Some of these, like max_features, can significantly affect the performance of the system. max_features sets the number of features to consider when looking for the best split. An example output of a trained model is reproduced below, showing the hyperparameters used to train it. If values are not specified, the defaults are used instead.

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=10,
max_features='auto', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
verbose=0, warm_start=False)

n_estimators specifies the number of Decision Trees the forest consists of. Since these models tend to overfit, it is useful to know which hyperparameters can regularise the model: max_depth, max_leaf_nodes and min_samples_leaf. These interact with one another, so it may only be necessary to tune one of them. n_jobs sets the number of cores used during training (-1 uses all of them). random_state makes the results reproducible if set to a fixed number.
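As a quick illustration of random_state (on toy data standing in for the RY series, not the actual stock DataFrame), fitting the same forest twice with the same seed produces identical predictions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# toy data standing in for the stock series (illustrative only)
rng = np.random.RandomState(0)
X_toy = rng.rand(100, 1) * 10
y_toy = 3 * X_toy.ravel() + rng.randn(100)

# two forests trained with the same random_state give identical predictions
rfr_a = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_toy, y_toy)
rfr_b = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_toy, y_toy)
same = bool(np.array_equal(rfr_a.predict(X_toy), rfr_b.predict(X_toy)))
```

Dropping random_state (or varying it) would generally make the two sets of predictions differ, since the bootstrap samples and feature subsets would no longer match.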

Ultimately, since these parameters can make the modeling very time-consuming, it is best to choose values that suit the system resources and compute time available. See here for more on the hyperparameters.

### Hyperparameter Searching: GridSearchCV and RandomizedSearchCV

Choosing values for all these hyperparameters may seem daunting. Grid Search simplifies and automates the process by trying different combinations of hyperparameters and choosing the best model. This effectively deals with the bias/variance tradeoff: the model should fit the data, but only so well that it can still generalise to unseen data with good performance. The CV in the object names refers to cross-validation, which is built in; the number of folds is set with the cv parameter, as shown below with cv=10.

grid = GridSearchCV(rfr, param_grid, cv=10)

In Scikit-Learn, Grid Search can be performed in two ways. The first uses a finite set of specific parameter values to search over exhaustively, meaning every possible combination of hyperparameters is evaluated. The hyperparameters to search over are specified in a param_grid, which is then passed to the GridSearchCV() function along with the instantiated model object and some other parameters. See the full example here.
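Since the full example is linked rather than shown, here is a minimal sketch of this exhaustive variant on toy data; the param_grid values here are illustrative, not the ones used later in this post:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# toy data in place of the RY series
rng = np.random.RandomState(0)
X_toy = rng.rand(60, 1) * 10
y_toy = 3 * X_toy.ravel() + rng.randn(60)

# a small, finite set of values to search over exhaustively
param_grid = dict(n_estimators=[10, 25],
                  max_depth=[5, 10])

grid = GridSearchCV(RandomForestRegressor(random_state=0),
                    param_grid, cv=3,
                    scoring='neg_mean_squared_error')
grid.fit(X_toy, y_toy)
best = grid.best_params_  # the winning combination out of the 4 evaluated
```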

The second way, called Randomised Grid Search, takes a range of hyperparameters from which it randomly samples and then returns the best model. It is the better choice when there is a wide, near-continuous range of hyperparameters that would be impractical to search exhaustively with Grid Search. Note the wide ranges in the param_dist below compared to the few specific values in the param_grid of the previous sample. The number of parameter combinations tried can be limited using n_iter. If there are fewer hyperparameter combinations available than the number of iterations chosen, the function will not run and will instead raise an appropriate error message. See the full example here.

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor(criterion='mse', bootstrap=True,
                            random_state=0, n_jobs=2)

param_dist = dict(n_estimators=list(range(1, 100)),
                  max_depth=list(range(1, 100)),
                  min_samples_leaf=list(range(1, 10)))

rand = RandomizedSearchCV(rfr, param_dist, cv=10,
                          scoring='neg_mean_squared_error', n_iter=30)
rand.fit(X, y)

#### Which Type of Grid Search is Better?

Bergstra and Bengio diagrammatically compare these concepts. They show that for the same number of parameter searches, a Random Search tries more diverse values of the important hyperparameters than a Grid Search. It can also find better models in equivalent or less time than Grid Search. Thus, in general, Randomised Grid Search is more effective because it searches over a larger hyperparameter space, with a time complexity independent of the range of that space. Hence, it is recommended to use an unrestricted search space and let the search process find the best values. See here also for more on this concept.
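The intuition can be sketched in plain Python: nine trials spent on a 3x3 grid only ever try three distinct values of each parameter, while nine random draws over the same space typically try more (a toy illustration, not Bergstra and Bengio's actual experiment):

```python
import random

# nine trials either way over a two-parameter space with values 1..9
grid_points = [(a, b) for a in (1, 5, 9) for b in (1, 5, 9)]
distinct_grid_values = len({a for a, _ in grid_points})  # first parameter: only 3 values

random.seed(0)
random_points = [(random.randint(1, 9), random.randint(1, 9)) for _ in range(9)]
distinct_random_values = len({a for a, _ in random_points})
```

If one parameter matters far more than the other, the random trials probe many more of its values for the same budget, which is exactly the argument for Randomised Grid Search.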


#### Performance

One caveat is that both incarnations of Grid Search can be slow depending on the number of cross-validation folds and hyperparameters that are involved. Higher numbers for both mean a longer search time with Grid Search since the search is exhaustive. For example, if there are 10 folds, and 50 possible combinations of hyperparameters, that means 500 distinct iterations will be evaluated. This is because, for each hyperparameter combination of which there are 50, the model is trained and evaluated 10 different times.
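This arithmetic can be checked with Scikit-Learn's ParameterGrid helper; applying it to the 4 x 4 x 3 param_grid used in the Grid Search training example in this post gives 48 combinations, or 480 fits with 10-fold cross-validation:

```python
from sklearn.model_selection import ParameterGrid

# the param_grid from the Grid Search training example in this post
param_grid = dict(n_estimators=[10, 25, 50, 100],
                  max_depth=[5, 10, 20, 30],
                  min_samples_leaf=[1, 2, 4])

n_combinations = len(ParameterGrid(param_grid))  # 4 * 4 * 3 = 48
total_fits = n_combinations * 10                 # each combination is fitted once per fold
```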

When using Randomised Grid Search, the number of hyperparameter combinations searched over can be limited using the n_iter parameter of the RandomizedSearchCV() function, as already mentioned. The default is 10. An increase in the number of cross-validation folds will still increase search time even if n_iter is held constant. In both cases the process can be cancelled if it is too time-consuming, and the best hyperparameters found thus far will be given.

In addition to the hyperparameters used to train the model, Scikit-Learn models also have attributes that belong to the trained model itself. These all end in an underscore, like cv_results_, best_score_, best_params_ and best_estimator_, and each is discussed below.

cv_results_ shows the results from cross-validation. Though the scores are negative, higher numbers (approaching 0) represent better scores; a negative sign can be used to invert this relationship so that lower numbers mean a lower error and thus a better score. The Mean Squared Error (MSE) is used as the default criterion for Random Forest Regression. cv_results_ also lists all the hyperparameter combinations that were tried during the search. The output can be quite lengthy, so the demonstration below is truncated for brevity. In the extract below from this demonstration, cv_results_ is called on the grid, which takes the instantiated model as a parameter.

grid.cv_results_ {'split0_test_score': array([-9.91385846, -9.9655185 , -9.93174032, -9.90547184, -9.91385846,...

'params': ({'max_depth': 5, 'min_samples_leaf': 1, 'n_estimators': 10}, {'max_depth': 5, 'min_samples_leaf': 1, 'n_estimators': 25}, {'max_depth': 5, 'min_samples_leaf': 1, 'n_estimators': 50}, {'max_depth': 5, 'min_samples_leaf': 1, 'n_estimators': 100}, {'max_depth': 5, 'min_samples_leaf': 2, 'n_estimators': 10}, {'max_depth': 5, 'min_samples_leaf': 2, 'n_estimators': 25}, {'max_depth': 5, 'min_samples_leaf': 2, 'n_estimators': 50}, {'max_depth': 5, 'min_samples_leaf': 2, 'n_estimators': 100},...

Moving along, best_score_ gives the best cross-validation score, corresponding to best_params_ and best_estimator_. The interpretation of the MSE in best_score_ is the same as for cv_results_, so a negative sign can be added to the front to invert the relationship. For the intuition behind the best cross-validation score see this. best_params_ shows the best parameters determined from the search; these values come from the param_grid. Finally, best_estimator_ gives the best estimator object with all its hyperparameters, including both the default values and those determined from the Grid Search.

-grid.best_score_ 23.278293092650244
grid.best_params_ {'max_depth': 20, 'min_samples_leaf': 4, 'n_estimators': 50}
grid.best_estimator_ RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=20,
max_features='auto', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=4,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=50, n_jobs=2, oob_score=False, random_state=0,
verbose=0, warm_start=False)
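As a sanity check (again on toy data, with an illustrative two-value param_grid), cv_results_ loads cleanly into a pandas DataFrame, and negating mean_test_score recovers the same value as -best_score_:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# a tiny fitted search so the attributes below exist
rng = np.random.RandomState(0)
X_toy = rng.rand(50, 1) * 10
y_toy = 3 * X_toy.ravel() + rng.randn(50)
grid = GridSearchCV(RandomForestRegressor(random_state=0),
                    dict(n_estimators=[5, 10]), cv=3,
                    scoring='neg_mean_squared_error').fit(X_toy, y_toy)

results = pd.DataFrame(grid.cv_results_)
results["mse"] = -results["mean_test_score"]  # flip the sign: lower now means better
agrees = bool(np.isclose(-grid.best_score_, results["mse"].min()))
```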

### Model Persistence

Scikit-Learn has built-in support for pickling (creating .pkl files) and model persistence. This is very useful when a model takes a long time to train. Persistence can be used to save the trained model. It can then be reloaded and predictions can be done directly on that trained model without having to repeat the time-consuming process of retraining the model when needed. The learned parameters of the model discussed above can also be accessed from the pickled model.

Pickling is used in both samples in the section for training the model, since even the less complex one takes more than a minute to train. An excerpt of the relevant code is shown below, where joblib is imported and the model from a Grid Search is saved as a pickle file called model.pkl. The file can then be reloaded with joblib.load() and a prediction run as usual by calling predict() on the reloaded model.

# importing joblib and saving the model
from sklearn.externals import joblib
joblib.dump(grid, "model.pkl")

# reloading the model and running a prediction
model = joblib.load("model.pkl")
model.predict([[564]])

Follow along to the next section to see a full contextual example of its use. Additionally, more information is available in the Scikit-Learn documentation.

## Training Using One Feature

X and y need to be set to train the model, where X contains the training data and y is the output vector. Grid Search and Randomised Grid Search are both carried out on the data and the results are compared.

### Setting X and y

The input matrix X is set to the Days Elapsed column from the RY_df. It must be reshaped since it is only one feature. y is set to the target values, which correspond to the Adj Close column from the RY_df. Note that this data was already imported, preprocessed and explored here. The importing and preprocessing steps need to be followed to get the data in the form needed to follow along from this point.

Both shapes are verified to be the same number of rows (5494). Otherwise, the model cannot be trained and an error will result.

X = RY_df["Days Elapsed"].values.reshape(-1,1)
y = RY_df["Adj Close"].values
X.shape
(5494, 1)
y.shape
(5494,)

### Grid Search Cross-Validation

Now that the data has been prepared for modeling, the model can be trained.

#### Fitting the Model

Grid Search Cross-Validation is first used to fit the model. As previously explained, this is an exhaustive search over the specific values given in the param_grid.

First, joblib is imported to enable persistence. A reusable function rfr_fit_gscv() (rfr abbreviates Random Forest Regression and gscv Grid Search Cross-Validation) is created to take the DataFrame, the parameter grid, and the pickle filename. GridSearchCV and RandomForestRegressor are imported, and the typical Scikit-Learn modeling pattern is followed with some modifications: after the model is instantiated, an intermediate step passes it to the GridSearchCV() function and sets its parameters before fitting it with the data. The model is then pickled, and its attributes and results are printed. Using %time when calling the function makes it easy to track how long the Grid Search process takes.

# this is needed for pickling
from sklearn.externals import joblib

def rfr_fit_gscv(df, param_grid, filename):
    from sklearn.model_selection import GridSearchCV
    from sklearn.ensemble import RandomForestRegressor

    # setting the static parameters
    rfr = RandomForestRegressor(bootstrap=True, random_state=0, n_jobs=2)

    grid = GridSearchCV(rfr, param_grid, cv=10,
                        scoring='neg_mean_squared_error')
    grid.fit(X, y)

    # this creates the pickled file
    joblib.dump(grid, filename)

    # These are all attributes of the learned model.
    # Notice the underscore at the end of each name.
    print("grid.cv_results_ {}".format(grid.cv_results_))
    print("--------------------------------------------")
    # The negative of the best_score_ value is taken
    # since the MSE is given as a negative value
    print("-grid.best_score_ {}".format(-grid.best_score_))
    print("grid.best_params_ {}".format(grid.best_params_))
    print("grid.best_estimator_ {}".format(grid.best_estimator_))
    print("grid.n_splits_ {}".format(grid.n_splits_))

param_grid = dict(n_estimators=[10, 25, 50, 100],
                  max_depth=[5, 10, 20, 30],
                  min_samples_leaf=[1, 2, 4])

%time rfr_fit_gscv(RY_df, param_grid, 'rfr_gscv_one_features20aug1404.pkl')
...
-grid.best_score_ 23.278293092650244
grid.best_params_ {'max_depth': 20, 'min_samples_leaf': 4, 'n_estimators': 50}
grid.best_estimator_ RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=20,
max_features='auto', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=4,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=50, n_jobs=2, oob_score=False, random_state=0,
verbose=0, warm_start=False)
grid.n_splits_ 10
CPU times: user 3min 21s, sys: 13.4 s, total: 3min 34s
Wall time: 4min

The -grid.best_score_ of 23.28 is excellent but could possibly hint at overfitting. A look at the chart below shows that the orange learned line is indeed following the data quite closely. Zooming in shows the effects of regularisation: the data is not actually fit as tightly as it initially appears. Nevertheless, expanding the hyperparameter ranges in the Randomised Grid Search to account for more regularisation may prove useful. grid.best_params_ with output {'max_depth': 20, 'min_samples_leaf': 4, 'n_estimators': 50} shows the specific hyperparameters from the param_grid that produced the best-performing model.

#### Visualising the Learned Model

def rfr_viz(X, y, label):

    # Create traces
    trace0 = go.Scatter(
        x = RY_df.index,
        y = y,
        mode = 'markers',
        name = 'markers'
    )
    trace2 = go.Scatter(
        x = RY_df.index,
        y = grid.predict(X),
        mode = 'lines',
        name = 'lines'
    )

    data = [trace0, trace2]

    layout = go.Layout(
        title = label,
        hovermode = 'closest',
        xaxis = dict(
            title = 'Date',
            ticklen = 5,
            zeroline = False,
            gridwidth = 2,
        ),
        yaxis = dict(
            title = 'Stock Price (US$)',
            ticklen = 5,
            gridwidth = 2,
        ),
        showlegend = False
    )

    fig = go.Figure(data=data, layout=layout)
    return py.iplot(fig, filename='rfr_viz')

rfr_viz(X, y, "RY Stock")

##### Matplotlib version

def rfr_viz(X, y, label):
    plt.figure()
    plt.scatter(RY_df.index, y, s=20, alpha=0.7)
    plt.plot(RY_df.index, grid.predict(X), c='r', linewidth=1)
    plt.xlabel("Date")
    plt.ylabel("Stock Price US$")
    plt.title(label)
    plt.xticks(rotation=45)

rfr_viz(X, y, "RY Stock")

#### Running a Prediction with the .pkl file

The pickled file is reloaded into the predict function to perform predictions given a certain date. A helper function to convert a date to Days Elapsed to run the prediction is included.

The prediction for Oct 19, 1997, of $5.82 seems reasonable given the training data. The prediction for Jun 6, 2020, of $74.08 is not convincing given that there seems to be an underlying linear trend; it is essentially an approximation of the last value that the model was trained on. Thus, for out-of-sample data, a linear or polynomial model may be more effective.

def convert_date_to_days_elapsed(df, date):
    dates = df.index
    elapsed = date - dates[0]
    return elapsed.days

def predict(df, date, filename):
    """
    This function reloads the pickled file so predictions
    can be made without retraining the model.
    This runs very quickly compared to training the model.
    """
    model = joblib.load(filename)
    days = convert_date_to_days_elapsed(df, date)
    return model.predict([[days]])

predict(RY_df, datetime(1997, 10, 19), 'rfr_gscv_one_features20aug1404.pkl')[0]
5.8209268466868709
predict(RY_df, datetime(2020, 6, 6), 'rfr_gscv_one_features20aug1404.pkl')[0]
74.083444176200032

### Randomised Grid Search Cross-Validation

A Randomised Grid Search is now used with a wider range of hyperparameters. The structure here is very similar to the one used above with the main differences of a param_dist with a continuous range of parameters to search over and the use of the RandomizedSearchCV() in place of the GridSearchCV() function. There is also the addition of the n_iter parameter for RandomizedSearchCV() to specify how many parameter combinations to try.

def rfr_fit_rgscv(df, filename):
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.ensemble import RandomForestRegressor

    rfr = RandomForestRegressor(criterion='mse', bootstrap=True,
                                random_state=0, n_jobs=2)

    param_dist = dict(n_estimators=list(range(1, 100)),
                      max_depth=list(range(1, 100)),
                      min_samples_leaf=list(range(1, 10)))

    rand = RandomizedSearchCV(rfr, param_dist, cv=10,
                              scoring='neg_mean_squared_error',
                              n_iter=30)
    rand.fit(X, y)

    # pickling the file
    joblib.dump(rand, filename)

    # print("rand.cv_results_ {}".format(rand.cv_results_))
    print("---------------")
    print(-rand.best_score_)
    print(rand.best_params_)
    print(rand.best_estimator_)

%time rfr_fit_rgscv(RY_df, 'rfr_rgscv_one_feature_20Aug1441.pkl')
---------------
22.4846067475
{'n_estimators': 53, 'min_samples_leaf': 8, 'max_depth': 11}
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=11,
max_features='auto', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=8,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=53, n_jobs=2, oob_score=False, random_state=0,
verbose=0, warm_start=False)
CPU times: user 1min 55s, sys: 6.79 s, total: 2min 2s
Wall time: 2min 22s
5.8338202274120938
74.451878668880155

The score here is comparable to when only using Grid Search. It is slightly better at 22.48 versus 23.28. The predicted values for the two dates chosen are also similar. A few of the hyperparameters of the model are noticeably different though. min_samples_leaf increased from 4 to 8 and max_depth dropped from 20 to 11. Adding more options for regularisation didn’t have much effect.

The training time here is reduced to roughly 0.6 of the previous run. This may be because only 30 iterations were used, whereas the earlier exhaustive search evaluated all 48 combinations in the param_grid. Note that Randomised Grid Search still managed to find better parameters in less time, as Bergstra and Bengio found.

## Conclusion

This part of the series started with an examination of the key concepts used during model training: hyperparameters, Grid Search, and pickling. The models were then trained using Grid Search and Randomised Grid Search, where the latter was shown to give better performance by searching over a wider range of parameters with a time complexity independent of that range. In the next post, we continue by running Randomised Grid Search with multiple features using a Pipeline and Feature Selection. We also examine the important concept of Data Leakage.

## References

1. Bergstra, James, and Yoshua Bengio. “Random search for hyper-parameter optimization.” Journal of Machine Learning Research 13.Feb (2012): 281-305.

2. Géron, Aurélien. “Hands-on Machine Learning with Scikit-Learn and Tensorflow.” (2017).

3. Hunter, John D. “Matplotlib: A 2D graphics environment.” Computing In Science & Engineering 9.3 (2007): 90-95.

4. McKinney, Wes. “Data structures for statistical computing in python.” Proceedings of the 9th Python in Science Conference. Vol. 445. Austin, TX: SciPy, 2010.

5. Pedregosa, Fabian, et al. “Scikit-learn: Machine learning in Python.” Journal of Machine Learning Research 12.Oct (2011): 2825-2830.

6. Pérez, Fernando, and Brian E. Granger. “IPython: a system for interactive scientific computing.” Computing in Science & Engineering 9.3 (2007).

7. Sievert, C., et al. “plotly: Create interactive web graphics via Plotly’s JavaScript graphing library [Software].” (2016).

8. Walt, Stéfan van der, S. Chris Colbert, and Gael Varoquaux. “The NumPy array: a structure for efficient numerical computation.” Computing in Science & Engineering 13.2 (2011): 22-30.

## Appendix

The code to generate the figures illustrated in Pt. 1 of this series is shown below. They were placed here because the processed data and concepts from this section were used to create them.