This is the first post in a series on using Random Forest regression to predict the stock price of the Royal Bank of Canada (ticker RY). The full technology stack includes Python, Pandas, NumPy, Matplotlib/Plotly, and Scikit-Learn.

First, we discuss the Decision Tree and Random Forest algorithms. Next, the data is imported from Yahoo! Finance, with demonstrations for both a local CSV file and the pandas_datareader. Afterwards, preliminary explorations are done with Pandas and its DataFrame. Finally, the data is preprocessed in preparation for visualisation and modelling.


## Learning Algorithms

### Decision Trees

Random Forests are composed of individual Decision Trees, so many of the ideas carry over and a brief introduction to trees is given here first. Decision Trees are a supervised learning method which can be relatively straightforward to interpret. Scikit-Learn provides models for both classification and regression; regression is the focus here. One benefit of using them is that scaling the features is not necessary, because each split only compares a single feature against a threshold.
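To illustrate why feature scaling does not matter, here is a minimal pure-Python sketch of a single regression split. It is not Scikit-Learn's implementation; the helper `best_split` and the small `days` and `prices` lists are hypothetical and exist only for this example. The split is chosen by minimising the summed squared error of the two groups, which depends only on the ordering of the feature values, so multiplying the feature by 1000 leaves the partition unchanged.

```python
def best_split(xs, ys):
    """Return the indices sent to the left group by the best single split.

    The best split minimises the summed squared error of the two groups.
    This is a hypothetical helper for illustration only.
    """
    def sse(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    # Only the ordering of the feature values matters, not their magnitude
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_cost, best_left = float("inf"), None
    for cut in range(1, len(order)):
        left, right = order[:cut], order[cut:]
        cost = sse([ys[i] for i in left]) + sse([ys[i] for i in right])
        if cost < best_cost:
            best_cost, best_left = cost, frozenset(left)
    return best_left


days = [0, 1, 2, 3, 4, 5]                  # hypothetical feature
prices = [5.0, 5.1, 5.2, 9.0, 9.1, 9.2]    # hypothetical target
scaled_days = [d * 1000.0 for d in days]   # same feature on a different scale

same_partition = best_split(days, prices) == best_split(scaled_days, prices)
print(same_partition)
```

Both calls separate the first three samples from the last three, which is the same grouping regardless of the units used for the feature.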

#### Composition

The trees are made up of a set of nodes. The nodes are split so that the conditions are more general near the top and more specific near the bottom. This rules out more of the wrong answers early on, so the answer space is reduced efficiently. Like K-Nearest Neighbours, a tree that is grown without limits can effectively memorise the training data.

The structure uses if-then rules to traverse the tree. If a node's condition is true, follow the true branch; if not, follow the other branch. This repeats until a leaf node is reached, and the value stored there is used as the prediction. Leaf nodes are the nodes at the end of the tree which have no children.
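The traversal can be sketched in a few lines of plain Python. The tree below is hypothetical: its feature name, thresholds, and leaf values are made up for illustration and are not taken from a model trained on RY data.

```python
# A hypothetical depth-2 regression tree written as nested dictionaries.
# Internal nodes hold a condition (feature <= threshold); leaves hold a value.
tree = {
    "feature": "days_elapsed", "threshold": 4000,
    "true": {
        "feature": "days_elapsed", "threshold": 2000,
        "true": {"value": 10.0},
        "false": {"value": 25.0},
    },
    "false": {
        "feature": "days_elapsed", "threshold": 6000,
        "true": {"value": 45.0},
        "false": {"value": 65.0},
    },
}


def predict(node, sample):
    """Follow the if-then rules from the root until a leaf is reached."""
    while "value" not in node:
        condition_holds = sample[node["feature"]] <= node["threshold"]
        node = node["true"] if condition_holds else node["false"]
    return node["value"]


print(predict(tree, {"days_elapsed": 1500}))
print(predict(tree, {"days_elapsed": 7000}))
```

Each prediction visits at most two conditions here because the depth is 2, which mirrors how a depth cap keeps a tree easy to read.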

#### Visualisation

Consider the visualisation below, created using Graphviz for a Decision Tree trained on RY stock (the code to reproduce this diagram can be found in the Part 4 Appendix). The depth of the tree was capped at 2 to facilitate explanation.

### Importing with Pandas

The libraries that will be used throughout are imported and the options for Matplotlib are set.

The data is then imported using the Pandas read_csv() function where the index is set to the Date column with index_col=0.

import pandas as pd

RY_df = pd.read_csv('data/RY.csv', index_col=0)
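As a small self-contained sketch of what index_col=0 does, the snippet below reads a two-row CSV from memory instead of from data/RY.csv. The variable names `csv_text` and `sample_df` are hypothetical, and the two rows are copied from the head() output shown later in this post.

```python
import io

import pandas as pd

# Two rows in the same layout as the Yahoo! Finance export (day-first dates)
csv_text = """Date,Open,High,Low,Close,Adj Close,Volume
16/10/95,5.75,5.8125,5.75,5.8125,2.498525,62000
17/10/95,5.8125,5.8125,5.8125,5.8125,2.498525,53200
"""

# index_col=0 turns the first column (Date) into the index
sample_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(sample_df.index.name)
print(sample_df.shape)
print(type(sample_df.index[0]))
```

The index labels are still plain strings at this point, which is why a conversion step is needed before they can be treated as dates.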


### pandas_datareader update

As of 16 August 2017, the pandas_datareader is working with Yahoo! Finance, and the following code can also import the data. This is useful because the data can be updated automatically, and the preprocessing step that converts the Date index to a DatetimeIndex is no longer necessary (see Preprocessing).

from datetime import datetime

import pandas_datareader.data as web

# download the daily RY prices from Yahoo! Finance for the given date range
RY_dfpdr = web.DataReader("ry", 'yahoo',
                          datetime(1995, 10, 16),
                          datetime(2017, 8, 11))

RY_dfpdr.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 5495 entries, 1995-10-16 to 2017-08-11
Data columns (total 6 columns):
Open         5494 non-null float64
High         5494 non-null float64
Low          5494 non-null float64
Close        5494 non-null float64
Adj Close    5494 non-null float64
Volume       5494 non-null float64
dtypes: float64(6)
memory usage: 300.5 KB


## Exploring the Data

The Pandas head(), tail(), describe(), and info() methods, together with the shape and index attributes of the DataFrame object, describe the data.

head() and tail() show that the data starts on 16/10/95 and ends on 11/08/17. They also show that the Open, High, Low, and Close values are all quite similar. This is verified by the statistics in the describe() output, where those four columns have very similar values. The Adjusted Close prices tend to be lower. All the prices are of type float64.

info() and shape show that there are 5494 rows and 6 columns. index shows that the dtype is object, meaning the dates are stored as strings; the index must be converted before it can be recognised and manipulated as datetime values.

RY_df.head()

| Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| 16/10/95 | 5.75000 | 5.81250 | 5.75000 | 5.81250 | 2.498525 | 62000 |
| 17/10/95 | 5.81250 | 5.81250 | 5.81250 | 5.81250 | 2.498525 | 53200 |
| 18/10/95 | 5.78125 | 5.84375 | 5.78125 | 5.81250 | 2.498525 | 72000 |
| 19/10/95 | 5.84375 | 5.84375 | 5.81250 | 5.81250 | 2.498525 | 5200 |
| 20/10/95 | 5.71875 | 5.71875 | 5.65625 | 5.65625 | 2.431361 | 16400 |

RY_df.tail()

| Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| 07/08/17 | 74.709999 | 74.849998 | 74.410004 | 74.510002 | 74.510002 | 415100 |
| 08/08/17 | 74.379997 | 74.870003 | 74.379997 | 74.739998 | 74.739998 | 738300 |
| 09/08/17 | 74.339996 | 74.559998 | 74.000000 | 74.230003 | 74.230003 | 684000 |
| 10/08/17 | 73.910004 | 74.110001 | 72.750000 | 72.889999 | 72.889999 | 1434400 |
| 11/08/17 | 73.059998 | 73.410004 | 72.610001 | 72.889999 | 72.889999 | 754200 |

# suppressing scientific notation for Pandas
pd.set_option('display.float_format', lambda x: '%.5f' % x)

RY_df.describe()

| | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| count | 5494.00000 | 5494.00000 | 5494.00000 | 5494.00000 | 5494.00000 | 5494.00000 |
| mean | 36.81800 | 37.12529 | 36.50281 | 36.83127 | 27.95704 | 524808.90062 |
| std | 21.24614 | 21.38186 | 21.09694 | 21.24431 | 20.60131 | 633388.75713 |
| min | 5.12500 | 5.28125 | 5.12500 | 5.18750 | 2.22987 | 0.00000 |
| 25% | 15.36328 | 15.50766 | 15.25000 | 15.39297 | 8.03175 | 97200.00000 |
| 50% | 39.25750 | 39.72750 | 38.92500 | 39.29000 | 26.19961 | 265700.00000 |
| 75% | 55.75500 | 56.10750 | 55.21750 | 55.59750 | 44.76020 | 778425.00000 |
| max | 76.08000 | 76.08000 | 75.42000 | 75.90000 | 75.10000 | 9830200.00000 |

RY_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5494 entries, 16/10/95 to 11/08/17
Data columns (total 6 columns):
Open         5494 non-null float64
High         5494 non-null float64
Low          5494 non-null float64
Close        5494 non-null float64
Adj Close    5494 non-null float64
Volume       5494 non-null int64
dtypes: float64(5), int64(1)
memory usage: 300.5+ KB


RY_df.shape

(5494, 6)


RY_df.index

Index(['16/10/95', '17/10/95', '18/10/95', '19/10/95', '20/10/95', '23/10/95',
'24/10/95', '25/10/95', '26/10/95', '27/10/95',
...
'31/07/17', '01/08/17', '02/08/17', '03/08/17', '04/08/17', '07/08/17',
'08/08/17', '09/08/17', '10/08/17', '11/08/17'],
dtype='object', name='Date', length=5494)


## Preprocessing

Two major preprocessing steps need to be handled.

First, the index needs to be converted to a DatetimeIndex. dayfirst=True should be set as a parameter of the to_datetime() function. Otherwise, Pandas can read an ambiguous date such as 01/08/17 as month first, and the converted dates become incorrect.

Second, the Date is converted to a numerical value that gives the time elapsed since the start of the data set. The index needs to be a DatetimeIndex in order to do this calculation. This conversion is one way to prepare the dates for training, since the model needs numerical input and cannot take the actual dates.

### Converting the Index to a DatetimeIndex

def convert_index_to_datetimeindex(df):
    # converting the day-first date strings in the index to a DatetimeIndex (in place)
    index = df.index
    df.index = pd.to_datetime(index, dayfirst=True)

convert_index_to_datetimeindex(RY_df)

RY_df.index

DatetimeIndex(['1995-10-16', '1995-10-17', '1995-10-18', '1995-10-19',
'1995-10-20', '1995-10-23', '1995-10-24', '1995-10-25',
'1995-10-26', '1995-10-27',
...
'2017-07-31', '2017-08-01', '2017-08-02', '2017-08-03',
'2017-08-04', '2017-08-07', '2017-08-08', '2017-08-09',
'2017-08-10', '2017-08-11'],
dtype='datetime64[ns]', name='Date', length=5494, freq=None)
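To see why dayfirst=True matters, consider a single ambiguous date from the data set. This is a minimal sketch; the variable names are hypothetical, and the exact warnings Pandas emits may vary between versions.

```python
import pandas as pd

# 1 August 2017 written in the day-first format used by RY.csv
ambiguous = "01/08/17"

month_first = pd.to_datetime(ambiguous)                # default: month first
day_first = pd.to_datetime(ambiguous, dayfirst=True)   # intended reading

print(month_first.date())
print(day_first.date())
```

Without dayfirst=True the string is read as 8 January 2017 rather than 1 August 2017, which would silently shift rows to the wrong dates.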


### Converting Date to Time Elapsed

from datetime import timedelta, datetime, date

def convert_date_to_time_elapsed(df):
    # adding a column with the number of calendar days since the first date
    dates = df.index

    elapsed = dates - dates[0]
    df['Days Elapsed'] = elapsed.days

convert_date_to_time_elapsed(RY_df)

RY_df.head()

| Date | Open | High | Low | Close | Adj Close | Volume | Days Elapsed |
|---|---|---|---|---|---|---|---|
| 1995-10-16 | 5.75000 | 5.81250 | 5.75000 | 5.81250 | 2.49852 | 62000 | 0 |
| 1995-10-17 | 5.81250 | 5.81250 | 5.81250 | 5.81250 | 2.49852 | 53200 | 1 |
| 1995-10-18 | 5.78125 | 5.84375 | 5.78125 | 5.81250 | 2.49852 | 72000 | 2 |
| 1995-10-19 | 5.84375 | 5.84375 | 5.81250 | 5.81250 | 2.49852 | 5200 | 3 |
| 1995-10-20 | 5.71875 | 5.71875 | 5.65625 | 5.65625 | 2.43136 | 16400 | 4 |
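One detail worth noting is that the elapsed value counts calendar days, not trading days. The short sketch below uses three dates taken from the index shown earlier (a Thursday, a Friday, and the following Monday); the variable names are hypothetical.

```python
import pandas as pd

# Thursday, Friday, and the following Monday from the RY index
dates = pd.to_datetime(["1995-10-19", "1995-10-20", "1995-10-23"])

# Subtracting the first date gives a TimedeltaIndex; .days extracts whole days
elapsed_days = (dates - dates[0]).days

print(list(elapsed_days))
```

The weekend shows up as a jump from 1 to 4, so the feature is not evenly spaced across rows even though the rows are consecutive trading days.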

## Conclusion

Now that we’ve covered a high-level overview of how the algorithms work and have imported, explored, and preprocessed the data, the next step is visualisation for further exploration. This is covered using Plotly and Matplotlib.
