This is the first post in a series on using Random Forest regression to predict the price of Royal Bank of Canada stock (ticker RY). The full technology stack includes Python, Pandas, NumPy, Matplotlib/Plotly, and Scikit-Learn.

Firstly, we discuss the Decision Tree and Random Forest algorithms. Next, the data is imported from Yahoo! Finance, with demonstrations both for local CSV files and for sourcing via pandas_datareader. Afterwards, preliminary exploration is done with Pandas and its DataFrame. Finally, the data is preprocessed in preparation for visualisation and modelling.

Learning Algorithms

Decision Trees

An understanding of Decision Trees aids a discussion of Random Forests, since the forests are composed of individual trees and many of the ideas carry over. Thus, a brief introduction is given here. Decision Trees are a supervised learning method that is relatively straightforward to interpret. Scikit-Learn provides models for both classification and regression; regression is the focus here. One benefit of using them is that feature scaling is not necessary.

Composition

The trees are made up of a set of nodes, split so that the splits are more general at the top and more specific at the bottom. This eliminates large parts of the answer space early on, so the search is efficient. Like K-Nearest Neighbours, the model effectively memorises the training data.

The structure uses if-then rules to traverse the tree: if a condition is true, follow the true branch; otherwise follow the false branch, and so on until a leaf node is reached, whose value is used as the prediction. The leaf nodes are the nodes at the bottom of the tree which have no children.

Visualisation

Consider the visualisation below, created using Graphviz, for a Decision Tree trained on RY stock (the code to reproduce this diagram can be found in the Part 4 Appendix). The depth of the tree was capped at 2 to facilitate explanation.

If the value of Days Elapsed input into the prediction function is 3000, we traverse the tree by following the left branch from the top, then the next right branch, to arrive at a prediction of $18.71 (see the sketch after the figure). Days Elapsed refers to the time passed since the start of the time slice of the data under consideration. The ideas behind how this model is built are explained in the following sections.

Decision Tree Node Visualisation
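The traversal can be written out as plain if-then rules. The sketch below mirrors the shape of the depth-2 tree above; the split thresholds and the other leaf values are illustrative placeholders (only the $18.71 leaf is taken from the figure).

# Hypothetical depth-2 tree mapping Days Elapsed to a price.
# Thresholds are placeholders, not values read off the fitted tree.
def predict_price(days_elapsed):
    if days_elapsed <= 3652.5:          # root split: left for earlier dates
        if days_elapsed <= 2556.5:      # left child's split
            return 10.52                # leaf: mean target of the rows that land here
        return 18.71                    # reached via left branch, then right branch
    if days_elapsed <= 5478.5:          # right child's split
        return 35.04
    return 62.30

print(predict_price(3000))  # left, then right -> 18.71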

Overfitting

Overfitting is one of the major problems with Decision Trees. The tendency arises because the model makes almost no assumptions about the data, unlike a model such as Linear Regression or an SVM with a linear kernel. This flexibility also weakens the ability to generalise to points outside the training set. Pruning, by reducing max_depth or max_leaf_nodes, can help reduce overfitting, but the possibility remains.
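In Scikit-Learn this pre-pruning is a matter of setting hyperparameters. A minimal sketch, assuming the Days Elapsed feature built later in Preprocessing and Adj Close as the target (an assumption, not stated above):

from sklearn.tree import DecisionTreeRegressor

X = RY_df[['Days Elapsed']]  # 2-D feature matrix; column built in Preprocessing below
y = RY_df['Adj Close']       # regression target (assumed here)

unpruned = DecisionTreeRegressor()           # grows until the leaves are (near) pure
pruned = DecisionTreeRegressor(max_depth=2)  # pre-pruned, like the tree shown above
# max_leaf_nodes=4 would have a similar pruning effect
pruned.fit(X, y)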

The characteristic model learned is boxy or orthogonal in shape, as shown in the diagram immediately below. Using Random Forests and bagging, also known as bootstrap aggregation (see the next subsection), can smooth out this line and reduce variance.

Random Forests

Random Forests are a popular and highly effective model used in both industry and competitions (like those hosted by Kaggle). They are an ensemble method where the estimator consists of a group of Decision Trees. There can be hundreds of these trees in the ensemble (depending on compute availability).

Benefits over Decision Trees

As mentioned before, a drawback of the constituent Decision Trees is that they tend to overfit. The Decision Trees in Random Forests are treated as weak learners: their depth is capped at a certain number of nodes. Combining these weak learners and averaging their predictions cancels out much of the individual error, reducing overfitting and variance; if the trees' errors were independent with variance σ², the average of n trees would have variance σ²/n.

This process also means the forests are more stable than individual Decision Trees which are very sensitive to changes in the data. Note that in Scikit-Learn the forests can be used for both regression and classification as is the case with Decision Trees.

Another benefit of this composite structure is that training is easy to parallelise: the trees can be grown independently across multiple cores or machines (via n_jobs in Scikit-Learn).

Growing the Trees

Each tree is grown slightly differently from the others in the forest, so they all overfit on different parts of the data. This can be achieved in the following two ways:

1. Randomly Selecting Features

Altering how the trees are grown can be done by randomly selecting the features the trees split over. The max_features parameter sets how many features are randomly selected at each split; the smaller the number of features, the less similar the trees will be to each other (see the sketch after the Pasting subsection).

2. Alter the way Samples are Chosen
Bagging

Growth is also altered by changing the way the samples are chosen. A popular method is to use a bootstrap sample, or bagging (bootstrap aggregation), where the rows are drawn at random with replacement for each predictor. Each bootstrap sample has the same number of rows as the original data set, but some original rows are omitted and some are repeated.
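The effect of sampling with replacement is easy to see with NumPy (a toy illustration, not code from the modelling pipeline):

import numpy as np

np.random.seed(0)
rows = np.arange(10)  # stand-in for the row indices of a data set
bootstrap = np.random.choice(rows, size=rows.size, replace=True)
print(np.sort(bootstrap))  # same length as the original, but with repeats and gaps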

Pasting

Another method is pasting, where the same sample cannot be selected multiple times for the same predictor; it can, however, be selected across different predictors. In practice, bagging generally produces better results. For bagging in Scikit-Learn, set bootstrap=True in the model hyperparameters; set bootstrap=False for pasting.
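Both knobs appear directly as hyperparameters on Scikit-Learn's forest estimators. A minimal sketch (the values are illustrative, not tuned):

from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(n_estimators=100,  # number of trees in the ensemble
                               max_features=1,    # features tried per split (only Days Elapsed here)
                               bootstrap=True,    # True = bagging; False = pasting
                               n_jobs=-1,         # grow the trees in parallel on all cores
                               random_state=42)
# forest.fit(X, y) with the same X and y as the Decision Tree sketch above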

Visualisation

The figure below demonstrates a typical model learned by a Random Forest Regressor. Again, notice that the model winds its way around the data. The major difference is that the fit is much smoother than the Decision Tree model's.

Importing the Data

Choosing the Data Source

The daily-resolution data being analysed was downloaded from Yahoo! Finance using the RY ticker. Data for RY can also be retrieved using pandas_datareader with Google Finance, but the unadjusted Close column is not available there.

Initially, the CAD prices from Yahoo! Finance Canada with the Toronto Stock Exchange ticker RY.TO were used, but a large section of this data was missing as null or 0 values. The row of null values (29 June 2016) also meant that an extra preprocessing step was necessary, since having one row of non-numerical data caused the entire data set, numbers included, to be imported as strings.

Additionally, the CAD prices cannot currently be extracted using pandas_datareader for Google (or Yahoo! Finance, as per the update below), though they are available on the websites. Thus, the US$ NYSE prices, downloaded as CSV files from Yahoo! Finance, were used instead.

Importing with Pandas

The libraries that will be used throughout are imported and the options for Matplotlib are set.

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
# Jupyter magic: render interactive Matplotlib figures inside the notebook
%matplotlib notebook
plt.style.use('seaborn-white')

from plotly import tools
import plotly.plotly as py
import plotly.graph_objs as go

The data is then imported using the Pandas read_csv() function, with the index set to the Date column (the first column) via index_col=0.

RY_df = pd.read_csv('data/RY.csv', index_col=0)

pandas_datareader update

As of 16 August 2017, pandas_datareader is working with Yahoo! Finance again, and the following code can also import the data. This is useful because the data can be updated automatically, and the preprocessing step where the Date index must be converted to a DatetimeIndex is no longer necessary (see Preprocessing).

import pandas_datareader.data as web

from datetime import datetime
RY_dfpdr = web.DataReader("ry", 'yahoo',
                          datetime(1995, 10, 16),
                          datetime(2017, 8, 11))
RY_dfpdr.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 5495 entries, 1995-10-16 to 2017-08-11
Data columns (total 6 columns):
Open         5494 non-null float64
High         5494 non-null float64
Low          5494 non-null float64
Close        5494 non-null float64
Adj Close    5494 non-null float64
Volume       5494 non-null float64
dtypes: float64(6)
memory usage: 300.5 KB

Exploring the Data

The Pandas head(), tail(), describe(), and info() functions as well as the shape and index attributes of the DataFrame object describe the data.

head() and tail() show that the data starts on 16/10/95 and ends on 11/08/17. They also show that the Open, High, Low, and Close values are all quite similar. This is confirmed by the stats in the describe() output, where all those rows have very similar values. The Adjusted Close prices tend to be slightly lower, as they account for dividends and splits. All the prices are of type float64.

info() and shape show that there are 5494 rows and 6 columns. index shows that the dtype is object, which means the dates must be converted before they can be recognised and manipulated as datetime values.

RY_df.head()
Open High Low Close Adj Close Volume
Date
16/10/95 5.75000 5.81250 5.75000 5.81250 2.498525 62000
17/10/95 5.81250 5.81250 5.81250 5.81250 2.498525 53200
18/10/95 5.78125 5.84375 5.78125 5.81250 2.498525 72000
19/10/95 5.84375 5.84375 5.81250 5.81250 2.498525 5200
20/10/95 5.71875 5.71875 5.65625 5.65625 2.431361 16400


RY_df.tail()
Open High Low Close Adj Close Volume
Date
07/08/17 74.709999 74.849998 74.410004 74.510002 74.510002 415100
08/08/17 74.379997 74.870003 74.379997 74.739998 74.739998 738300
09/08/17 74.339996 74.559998 74.000000 74.230003 74.230003 684000
10/08/17 73.910004 74.110001 72.750000 72.889999 72.889999 1434400
11/08/17 73.059998 73.410004 72.610001 72.889999 72.889999 754200


# suppressing scientific notation for Pandas
pd.set_option('display.float_format', lambda x: '%.5f' % x)

RY_df.describe()
Open High Low Close Adj Close Volume
count 5494.00000 5494.00000 5494.00000 5494.00000 5494.00000 5494.00000
mean 36.81800 37.12529 36.50281 36.83127 27.95704 524808.90062
std 21.24614 21.38186 21.09694 21.24431 20.60131 633388.75713
min 5.12500 5.28125 5.12500 5.18750 2.22987 0.00000
25% 15.36328 15.50766 15.25000 15.39297 8.03175 97200.00000
50% 39.25750 39.72750 38.92500 39.29000 26.19961 265700.00000
75% 55.75500 56.10750 55.21750 55.59750 44.76020 778425.00000
max 76.08000 76.08000 75.42000 75.90000 75.10000 9830200.00000


RY_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 5494 entries, 16/10/95 to 11/08/17
Data columns (total 6 columns):
Open         5494 non-null float64
High         5494 non-null float64
Low          5494 non-null float64
Close        5494 non-null float64
Adj Close    5494 non-null float64
Volume       5494 non-null int64
dtypes: float64(5), int64(1)
memory usage: 300.5+ KB


RY_df.shape
(5494, 6)


RY_df.index
Index(['16/10/95', '17/10/95', '18/10/95', '19/10/95', '20/10/95', '23/10/95',
       '24/10/95', '25/10/95', '26/10/95', '27/10/95',
       ...
       '31/07/17', '01/08/17', '02/08/17', '03/08/17', '04/08/17', '07/08/17',
       '08/08/17', '09/08/17', '10/08/17', '11/08/17'],
      dtype='object', name='Date', length=5494)

Preprocessing

Two major preprocessing steps need to be handled.

First, the index needs to be converted to a DatetimeIndex. dayfirst=True should be passed to the to_datetime() function; otherwise, because the dates are in day/month/year format, Pandas swaps the month and day and the converted dates become incorrect.
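A single ambiguous date shows the problem:

pd.to_datetime('03/04/96')                 # Timestamp('1996-03-04'): month first, wrong
pd.to_datetime('03/04/96', dayfirst=True)  # Timestamp('1996-04-03'): 3 April 1996, correct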

Second, the Date is converted to a numerical value giving the time elapsed since the start of the data set. The index needs to be a DatetimeIndex in order to do this calculation. This conversion is one way to make the dates usable for training, since Scikit-Learn models require numerical input rather than dates.

Converting the Index to a DatetimeIndex

def convert_index_to_datetimeindex(df):
    # parse the DD/MM/YY date strings into a DatetimeIndex
    df.index = pd.to_datetime(df.index, dayfirst=True)

convert_index_to_datetimeindex(RY_df)
RY_df.index
RY_df.index
DatetimeIndex(['1995-10-16', '1995-10-17', '1995-10-18', '1995-10-19',
               '1995-10-20', '1995-10-23', '1995-10-24', '1995-10-25',
               '1995-10-26', '1995-10-27',
               ...
               '2017-07-31', '2017-08-01', '2017-08-02', '2017-08-03',
               '2017-08-04', '2017-08-07', '2017-08-08', '2017-08-09',
               '2017-08-10', '2017-08-11'],
              dtype='datetime64[ns]', name='Date', length=5494, freq=None)

Converting Date to Time Elapsed

def convert_date_to_time_elapsed(df):
    dates = df.index

    # subtracting the first date gives a TimedeltaIndex; .days extracts integer days
    elapsed = dates - dates[0]
    df['Days Elapsed'] = elapsed.days

convert_date_to_time_elapsed(RY_df)
RY_df.head()
Open High Low Close Adj Close Volume Days Elapsed
Date
1995-10-16 5.75000 5.81250 5.75000 5.81250 2.49852 62000 0
1995-10-17 5.81250 5.81250 5.81250 5.81250 2.49852 53200 1
1995-10-18 5.78125 5.84375 5.78125 5.81250 2.49852 72000 2
1995-10-19 5.84375 5.84375 5.81250 5.81250 2.49852 5200 3
1995-10-20 5.71875 5.71875 5.65625 5.65625 2.43136 16400 4

Conclusion

Now that we’ve covered a high-level overview of how the algorithms work and have imported, explored, and preprocessed the data, the next step is visualisation for further exploration. This is covered in the next post using Plotly and Matplotlib.


References

  1. Decision Trees. Coursera, https://www.coursera.org/learn/python-machine-learning/lecture/Zj96A/decision-trees.

  2. Géron, Aurélien. "Hands-On Machine Learning with Scikit-Learn and TensorFlow." O'Reilly Media, 2017.

  3. McKinney, Wes. “Data structures for statistical computing in python.” Proceedings of the 9th Python in Science Conference. Vol. 445. Austin, TX: SciPy, 2010.

  4. Pedregosa, Fabian, et al. “Scikit-learn: Machine learning in Python.” Journal of Machine Learning Research 12.Oct (2011): 2825-2830.

  5. Pérez, Fernando, and Brian E. Granger. “IPython: a system for interactive scientific computing.” Computing in Science & Engineering 9.3 (2007).

  6. Random Forests. Coursera, https://www.coursera.org/learn/python-machine-learning/lecture/lF9QN/random-forests.

  7. Walt, Stéfan van der, S. Chris Colbert, and Gael Varoquaux. “The NumPy array: a structure for efficient numerical computation.” Computing in Science & Engineering 13.2 (2011): 22-30.