This is the first post in a series on Random Forest regression for predicting the price of Royal Bank of Canada stock (ticker RY). The full technology stack includes Python, Pandas, NumPy, Matplotlib/Plotly, and Scikit-Learn.
Firstly, we discuss the algorithms of Decision Trees and Random Forests. Next, the data is imported from Yahoo! Finance, with demonstrations for local CSV files as well as sourcing via the `pandas_datareader`. Afterwards, preliminary explorations are done with Pandas and its DataFrame. Finally, the data is preprocessed in preparation for visualisation and modeling.
Series on Random Forest Regression for Predicting the Price of RY Stock
- Pt. 1: Algorithms, Importing, Exploring and Preprocessing the Data
- Pt. 2: Visualizing the Data with Plotly and Matplotlib
- Pt. 3: Feature Engineering using Domain Knowledge and Feature Interactions
Related Post: Daily Returns
- Pt. 4: Training the Model: Introduction to Theoretical Concepts and Training using One Feature with Grid Search and Randomised Grid Search
- Pt. 5: Training the Model Using Multiple Features with a Pipeline, Feature Selection, and Randomised Grid Search
Decision Trees
An understanding of Decision Trees aids a discussion of Random Forests, since the forests are composed of individual trees and many of the ideas carry over. Thus, a brief introduction is given here. Decision Trees are a supervised learning method which can be relatively straightforward to interpret. There are models for both classification and regression in Scikit-Learn; regression is the focus here. One benefit of using them is that scaling the features is not necessary.
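As a minimal sketch of the Scikit-Learn API (the data here is synthetic and purely illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# synthetic, illustrative data: one unscaled feature and a noisy target
X = np.arange(100).reshape(-1, 1)
y = np.sin(X.ravel() / 10.0) + np.random.randn(100) * 0.1

# no feature scaling is needed for tree-based models
tree_reg = DecisionTreeRegressor(max_depth=3)
tree_reg.fit(X, y)
print(tree_reg.predict([[42]]))
```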
The trees are made up of a set of nodes, split so that the splits are more general at the top and more specific at the bottom. This eliminates more of the wrong answers early on, so the answer space is reduced efficiently. Like K-Nearest Neighbours, the training data is memorised.
The structure uses if-then rules to traverse the tree: if a node's condition is true, follow the true branch, otherwise follow the false branch, and so on until a leaf node is reached, whose value is used as the prediction. The leaf nodes are the nodes at the bottom of the tree which have no children.
Consider the visualisation below created using Graphviz for a Decision Tree trained on RY stock (The code to reproduce this diagram can be found in the Part 4 Appendix). The depth of the tree was capped at 2 to facilitate explanation.
If the value for Days Elapsed input into the prediction function is 3000, we traverse the tree by following the left branch from the top, then the next right branch, arriving at a prediction of $18.71. How this model is built will be explained in the sections that follow. Days Elapsed refers to the time passed since the start of the time slice of the data under consideration.
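A sketch of how a tree like the one above might be trained and queried; it assumes the `RY_df` DataFrame with the Days Elapsed and Adj Close columns prepared later in this post, so treat the names as illustrative:

```python
from sklearn.tree import DecisionTreeRegressor, export_graphviz

# cap the depth at 2, as in the diagram above
tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(RY_df[['Days Elapsed']], RY_df['Adj Close'])

# traversing the tree for Days Elapsed = 3000 yields the leaf's value
print(tree_reg.predict([[3000]]))

# export a Graphviz .dot file to draw a diagram like the one above
export_graphviz(tree_reg, out_file='ry_tree.dot',
                feature_names=['Days Elapsed'], filled=True, rounded=True)
```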
Overfitting is one of the major problems that arise with Decision Trees. This tendency comes from the model making no assumptions about the data, unlike a model such as Linear Regression or an SVM with a linear kernel. Incidentally, this also weakens the ability to generalise to points outside the data set. Pruning, for example by reducing `max_leaf_nodes`, can help to reduce overfitting, but the possibility to overfit remains.
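For instance, a pruned tree can be compared against an unconstrained one (the value 16 below is an arbitrary illustration, not a recommendation):

```python
from sklearn.tree import DecisionTreeRegressor

# an unconstrained tree can grow one leaf per training sample and overfit
unpruned_reg = DecisionTreeRegressor()

# capping max_leaf_nodes prunes the tree, trading a little bias for lower variance
pruned_reg = DecisionTreeRegressor(max_leaf_nodes=16)
```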
The characteristic model learned is boxy or orthogonal in shape, as shown in the diagram immediately below. Using Random Forests with bagging, also known as bootstrap aggregation (see the next subsection), can smooth out this line and reduce variance.
Random Forests
Random Forests are a popular and highly effective model used in both industry and competitions (like those hosted by Kaggle). They are an ensemble method where the estimator consists of a group of Decision Trees. There can be hundreds of these trees in the ensemble, depending on the compute available.
Benefits over Decision Trees
As mentioned before, a drawback of the constituent Decision Trees is that they tend to overfit. The Decision Trees in a Random Forest are weak learners: their depth is capped so that each tree stays shallow. Aggregating the predictions of these weak learners averages their errors out of the system, reducing overfitting and variance.
This also makes the forests more stable than individual Decision Trees, which are very sensitive to changes in the data. Note that in Scikit-Learn the forests can be used for both regression and classification, as is the case with Decision Trees.
Another benefit of this composite structure is that the work is easily split across multiple processing nodes, since each tree can be grown independently and in parallel.
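In Scikit-Learn, this parallelism is available through the `n_jobs` hyperparameter; a minimal sketch:

```python
from sklearn.ensemble import RandomForestRegressor

# n_estimators sets the number of trees in the ensemble;
# n_jobs=-1 grows the trees in parallel on all available cores
forest_reg = RandomForestRegressor(n_estimators=100, n_jobs=-1)
```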
Growing the Trees
Each of these trees is grown slightly differently from the other trees in the forest, so they all overfit on different parts of the data. This can be achieved in the following two ways:
1. Randomly Selecting Features
The way the trees are grown can be altered by randomly selecting the features the trees split over. The `max_features` parameter sets how many of these features are randomly selected at each split. The smaller the number of features, the less similar the trees will be to each other (a combined sketch follows these two points).
2. Altering the Way Samples Are Chosen
Growth is also altered by changing the way the samples are chosen. A popular method is to use a bootstrap sample, or bagging (bootstrap aggregation), where the same randomly chosen observation can appear more than once in the sample for a given predictor. The bootstrap sample has the same number of rows as the original data set, but some of the original rows are missing and some are repeated.
Another method that can be used is pasting, where the same sample cannot be selected multiple times for the same predictor; it can, however, be selected across different predictors. In practice, bagging generally produces better results. For bagging in Scikit-Learn, set `bootstrap=True` in the model hyperparameters; set `bootstrap=False` for pasting.
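Putting the two together, a sketch of how these hyperparameters appear on the Scikit-Learn estimator (the values shown are illustrative starting points to tune, not recommendations):

```python
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(
    n_estimators=100,  # number of trees in the forest
    max_features=3,    # features randomly considered at each split
    bootstrap=True,    # True for bagging; False for pasting
)
```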
The figure below demonstrates a typical model learned from a Random Forest Regressor. Again notice that the model winds its way around the data. The major difference that can be seen is that it is much smoother than the Decision Tree model.
Importing the Data
Choosing the Data Source
The daily-resolution data being analysed was downloaded from Yahoo! Finance using the RY ticker. Data for RY can also be retrieved using `pandas_datareader` and Google Finance, but the Close column before adjustments will not be available.
Initially, the CAD prices were used from Yahoo! Finance Canada with the Toronto Stock Exchange `RY.TO` ticker, but a large section of this data was missing as null or 0 values. The row of null values (29 June 2016) also meant that an extra preprocessing step was necessary to handle that line, since having one row of non-numerical data caused all the data, including the numbers, to import as strings.
Additionally, the CAD prices cannot currently be extracted using the `pandas_datareader` for Google (or Yahoo! Finance, as per the update below), though they are available on the websites. Thus, the US$ NYSE prices downloaded as CSV files from Yahoo! Finance were used instead.
Importing with Pandas
The libraries that will be used throughout are imported and the options for Matplotlib are set.
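The exact import cell isn't reproduced here; a representative sketch, assuming the standard aliases, might be:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Matplotlib options assumed for the figures in this series
plt.rcParams['figure.figsize'] = (12, 6)
plt.style.use('ggplot')
```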
The data is then imported using the Pandas `read_csv()` function, with the index set to the Date column:

```python
RY_df = pd.read_csv('data/RY.csv', index_col=0)
```
As of 16 August 2017, the `pandas_datareader` is working with Yahoo! Finance and the following code can also import the data. This is useful because the data can be updated automatically, and the preprocessing step where the Date index must be converted to a DateTimeIndex is no longer necessary (see Preprocessing).
```python
import pandas_datareader.data as web
from datetime import datetime

RY_dfpdr = web.DataReader("ry", 'yahoo', datetime(1995, 10, 16), datetime(2017, 8, 11))
```
```
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 5495 entries, 1995-10-16 to 2017-08-11
Data columns (total 6 columns):
Open         5494 non-null float64
High         5494 non-null float64
Low          5494 non-null float64
Close        5494 non-null float64
Adj Close    5494 non-null float64
Volume       5494 non-null float64
dtypes: float64(6)
memory usage: 300.5 KB
```
Exploring the Data
The `head()`, `tail()` and `info()` functions as well as the `shape` and `index` attributes of the DataFrame object describe the data. `head()` and `tail()` show that the data starts on 16/10/95 and ends on 11/08/17. They also show that the Open, High, Low, and Close values are all quite similar. This is verified by the stats in the `describe()` output, where all those rows have very similar values. The Adjusted Close prices tend to be slightly lower. All the prices are of type float64. `shape` shows that there are 5494 rows and 6 columns. `index` shows that the dtype is an object, which means it must be converted to be recognised and manipulated as a DateTimeIndex (see Preprocessing).
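A sketch of the calls behind these observations (the `describe()` call and output follow below):

```python
RY_df.info()   # dtypes, non-null counts and memory usage
RY_df.tail()   # the last rows: data ends on 11/08/17
RY_df.shape    # (5494, 6)
RY_df.index    # dtype is object, not yet a DateTimeIndex
```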
```python
# suppressing scientific notation for Pandas
pd.set_option('display.float_format', lambda x: '%.5f' % x)

RY_df.describe()
```
```
<class 'pandas.core.frame.DataFrame'>
Index: 5494 entries, 16/10/95 to 11/08/17
Data columns (total 6 columns):
Open         5494 non-null float64
High         5494 non-null float64
Low          5494 non-null float64
Close        5494 non-null float64
Adj Close    5494 non-null float64
Volume       5494 non-null int64
dtypes: float64(5), int64(1)
memory usage: 300.5+ KB
```
```
Index(['16/10/95', '17/10/95', '18/10/95', '19/10/95', '20/10/95',
       '23/10/95', '24/10/95', '25/10/95', '26/10/95', '27/10/95',
       ...
       '31/07/17', '01/08/17', '02/08/17', '03/08/17', '04/08/17',
       '07/08/17', '08/08/17', '09/08/17', '10/08/17', '11/08/17'],
      dtype='object', name='Date', length=5494)
```
Two major preprocessing steps need to be handled.
First, the index needs to be converted to a DateTimeIndex. `dayfirst=True` should be set as a parameter of the `to_datetime()` function; otherwise, Pandas mixes up the month and day and the converted dates become incorrect.
Second, the Date is converted to a numerical value to show the time elapsed since the start of the data set. The Date needs to be of type `DateTimeIndex` in order to do this calculation. This conversion is one method that can be used to train the model, since models cannot take the actual dates as input.
Converting the indexes to DateTimeIndexes
```python
def convert_index_to_datetimeindex(df):
    # converting the dates to a DateTimeIndex;
    # dayfirst=True because the dates are formatted dd/mm/yy
    df.index = pd.to_datetime(df.index, dayfirst=True)

convert_index_to_datetimeindex(RY_df)
RY_df.index
```
```
DatetimeIndex(['1995-10-16', '1995-10-17', '1995-10-18', '1995-10-19',
               '1995-10-20', '1995-10-23', '1995-10-24', '1995-10-25',
               '1995-10-26', '1995-10-27',
               ...
               '2017-07-31', '2017-08-01', '2017-08-02', '2017-08-03',
               '2017-08-04', '2017-08-07', '2017-08-08', '2017-08-09',
               '2017-08-10', '2017-08-11'],
              dtype='datetime64[ns]', name='Date', length=5494, freq=None)
```
Converting Date to Time Elapsed
```python
from datetime import timedelta, datetime, date

def convert_date_to_time_elapsed(df):
    # days elapsed since the first date in the data set
    dates = df.index
    elapsed = dates - dates[0]
    df['Days Elapsed'] = elapsed.days

convert_date_to_time_elapsed(RY_df)
```
The head of `RY_df` now shows the columns Open, High, Low, Close, Adj Close, Volume, and Days Elapsed.
Now that we’ve covered a high-level overview of how the algorithms work, imported, explored and preprocessed the data, the next step is visualization for further exploration. This is covered using Plotly and Matplotlib.
References
Decision Trees. Coursera, https://www.coursera.org/learn/python-machine-learning/lecture/Zj96A/decision-trees.
Géron, Aurélien. Hands-On Machine Learning with Scikit-Learn and TensorFlow. O'Reilly Media, 2017.
McKinney, Wes. "Data Structures for Statistical Computing in Python." Proceedings of the 9th Python in Science Conference. Vol. 445. Austin, TX: SciPy, 2010.
Pedregosa, Fabian, et al. "Scikit-learn: Machine Learning in Python." Journal of Machine Learning Research 12.Oct (2011): 2825-2830.
Pérez, Fernando, and Brian E. Granger. "IPython: A System for Interactive Scientific Computing." Computing in Science & Engineering 9.3 (2007): 21-29.
Random Forests. Coursera, https://www.coursera.org/learn/python-machine-learning/lecture/lF9QN/random-forests.
Walt, Stéfan van der, S. Chris Colbert, and Gaël Varoquaux. "The NumPy Array: A Structure for Efficient Numerical Computation." Computing in Science & Engineering 13.2 (2011): 22-30.