In this post, we explore univariate Linear Regression with Amazon stock (AMZN ticker) data using the Python data science ecosystem. The libraries used include Pandas, NumPy, Matplotlib and Scikit-Learn.

We start with a brief introduction to univariate linear regression and how it works. The data is imported, explored, and preprocessed using Pandas and Matplotlib. The model is then fitted with the data using both a train/test split and cross-validation with Scikit-Learn. The results for both scenarios are then discussed and compared.

Disclaimer: Any stock choices and techniques set out here are not recommendations in any way. They are only meant for informational purposes. It is imperative to conduct your own due diligence involving proper and thorough research before making investment decisions.

## Linear Regression Overview

Linear regression is a fundamental and relatively straightforward learning algorithm. It is a type of supervised learning which works backwards from the data to find a line (or hyperplane in the multivariate case) that best fits that data.

It generally only works well where the underlying data already has a linear shape and a low level of fluctuations/volatility. If either of these two items is not present the error of the model will increase.

It does, however, scale well to out of sample data beyond the test set and so in certain cases may be a very effective choice. It can be used in predicting the prices of items (eg. stocks).

### Univariate Linear Regression

In its most simplified form, univariate linear regression, there is only one feature or variable used to train the model such as an adjusted stock price. This is similar to having a feature only aligned to the x-axis with the output vector on the y-axis.

This model has a parallel to the equation of a line where $y$ is the value output when the gradient is $m$, $x$ is the only input variable and $c$ is the y-intercept (equation 1). The hypothesis or predicted value $h_\theta(x)$ corresponds to $y$, the learned parameters $\theta_0$ and $\theta_1$ correspond to $c$ and $m$ respectively, and the feature $x_1$ corresponds to $x$ (equation 2).

Note: The code to create this chart is shown below in the Appendix.

In the chart above the linear trend in the data is clear even without the line of best fit. The y-intercept is $\theta_0$ = - 74.79 and the gradient $\theta_1$ = 0.44. Thus, the hypothesis function is:

Thus if the hypothetical input matrix is of the form:

then substituting this 1D matrix in equation 3 gives:

The first output value is calculated with $-74.79 + 0.44(-25)$, the second with $-74.79 + 0.44(250)$ and the third with $-74.79 + 0.44(315)$.

To calculate this using NumPy:

array([[-85.79],
[ 35.21],
[ 63.81]])


### Mean Squared Error

The most popular metric for performance evaluation with linear regression is the mean squared error or MSE. MSE is used by default by Scikit-Learn. Using this method, the squared vertical distance from each point to the line of best fit is determined (see the vertical red lines in the image below).

Since the squares of the distances are taken, any negative distances are removed and only the deviation from the line is measured. Note that the distances are not taken perpendicular to the best fit line as is done in principal component analysis.

All these squared distances are summed and the average is found by dividing the sum by the total number of points in the data set. Additionally, when training the model, this is the value that must be minimised to find the line of best fit. Multiple versions of these will be tried until the error is minimised. This is how least squares linear regression works.

### Hyperparameters and Linear Regression

The hyperparameters of the model refer to parameters used while training rather than the parameters of the line to be fitted. For example, in random forests, the number of trees in the forest and the max depth of the trees can be specified. These are the parameters altered when fine-tuning the model. There are no hyperparameters to be tuned with linear regression so more advanced techniques like GridSearchCV will not be necessary.

## Importing the Data

The data was downloaded from Yahoo! Finance as CSV files. The dates range from 1997-05-15 to 2017-07-03.

It was then imported into a Pandas DataFrame. A DataFrame is a structure that can hold and be used to manipulate tabular data in a similar manner to tables in SQL. The Pandas read_csv() function was made to import data into DataFrames. The index was set to the first column in the imported file which contained the date using the parameter index_col=0. Without this parameter the DataFrame imports with a numerical index starting from 0.

This newly set date index is a string object and cannot be parsed as DateTimeIndexes can by Pandas. Being able to parse the dates is necessary for plotting as well as calculating the time elapsed since the start of the data set to train the model. Models cannot be trained with date or string objects, but only numerical data.

## Exploring the Data

df.head() shows the first 5 rows of the DataFrame.

This shows that the imported data has 6 columns. The one of interest here is the Adj Close which shows the adjusted close prices. These are adjusted for stock splits and dividends. It is important to use these when considering trading strategies so that the sharp drops due to these events do not mislead the algorithms into thinking there is a sell off in the stock.

Open High Low Close Adj Close Volume
Date
1997-05-15 2.437500 2.500000 1.927083 1.958333 1.958333 72156000
1997-05-16 1.968750 1.979167 1.708333 1.729167 1.729167 14700000
1997-05-19 1.760417 1.770833 1.625000 1.708333 1.708333 6106800
1997-05-20 1.729167 1.750000 1.635417 1.635417 1.635417 5467200
1997-05-21 1.635417 1.645833 1.375000 1.427083 1.427083 18853200

df.tail() shows the last 5 rows of the DataFrame.

Open High Low Close Adj Close Volume
Date
2017-06-27 990.690002 998.799988 976.000000 976.780029 976.780029 3782400
2017-06-28 978.549988 990.679993 969.210022 990.330017 990.330017 3737600
2017-06-29 979.000000 987.559998 965.250000 975.929993 975.929993 4303000
2017-06-30 980.119995 983.469971 967.609985 968.000000 968.000000 3357200
2017-07-03 972.789978 974.489990 951.000000 953.659973 953.659973 2909100

df.shape shows the number of rows and the number of columns in the DataFrame. This one has 5067 rows ranging in date from 1997-05-15 to 2017-07-03, and 6 columns as previously discussed.

(5067, 6)


df.index shows the index of the DataFrame plus additional information like the type of the index. The object type here means the index is made up of strings.

Index(['1997-05-15', '1997-05-16', '1997-05-19', '1997-05-20', '1997-05-21',
'1997-05-22', '1997-05-23', '1997-05-27', '1997-05-28', '1997-05-29',
...
'2017-06-20', '2017-06-21', '2017-06-22', '2017-06-23', '2017-06-26',
'2017-06-27', '2017-06-28', '2017-06-29', '2017-06-30', '2017-07-03'],
dtype='object', name='Date', length=5067)


### Plotting with Pandas and Matplotlib

#### df.plot()

Pandas comes with many convenient functions built on top of Matplotlib for plotting data. One is shown below. Only the adjusted close is shown because there is a high correlation and thus overlap between the open, high, low and close columns. The volume is shown separately below because its scale dwarfs the prices. The chart shows that AMZN prices oscillated about the \$100 mark from 1997 to about 2009 and then it gradually skyrocketed to the \$1000 mark in 2017.

#### Grid Spec

A common way to view stock charts is with the volume bar charts underneath. GridSpec is used here to put the two charts on the same plot. The Pandas plotting helpers are not used here.

#### Scatter Matrix

The scatter matrix is another useful plotting function that comes with Pandas. Here the correlation between the open, high, low, close and adjusted close become clear. The points roughly forming 45-degree lines indicate a correlation of 1.

## Preprocessing

A column with the number of days since the start of the data will be created and used to train the model. Models cannot consume non-numerical data as mentioned before. The index needs to be converted to a DateTimeIndex first.

### Convert the Index to a DateTimeIndex

The index is now converted to a DateTimeIndex. Note the dtype of datetime64[ns].

DatetimeIndex(['1997-05-15', '1997-05-16', '1997-05-19', '1997-05-20',
'1997-05-21', '1997-05-22', '1997-05-23', '1997-05-27',
'1997-05-28', '1997-05-29',
...
'2017-06-20', '2017-06-21', '2017-06-22', '2017-06-23',
'2017-06-26', '2017-06-27', '2017-06-28', '2017-06-29',
'2017-06-30', '2017-07-03'],
dtype='datetime64[ns]', name='Date', length=5067, freq=None)


### Convert the Dates to the Days Elapsed

A column with the days elapsed since the start date has been added.

Open High Low Close Adj Close Volume Days Elapsed
Date
1997-05-15 2.437500 2.500000 1.927083 1.958333 1.958333 72156000 0
1997-05-16 1.968750 1.979167 1.708333 1.729167 1.729167 14700000 1
1997-05-19 1.760417 1.770833 1.625000 1.708333 1.708333 6106800 4
1997-05-20 1.729167 1.750000 1.635417 1.635417 1.635417 5467200 5
1997-05-21 1.635417 1.645833 1.375000 1.427083 1.427083 18853200 6

## Fitting the Model using a Train/Test Split

Scikit-Learn comes with many tools and functions to facilitate machine learning techniques. One such function enables creating a train/test split for training models. This helps reduce overfitting and gives more realistic results for unseen data by testing the model on different data than it was trained on. A common split proportion is 60% for training and 40% for testing. The default for Scikit-Learn is 75% for training and 25% for testing.

### Setting X and y

To train the model a training matrix X and target vector y are needed. X corresponds to the column of the DataFrame for days elapsed. Here X is one dimensional since it is only one feature. y corresponds to the adjusted close column.

Date
1997-05-15    0
1997-05-16    1
1997-05-19    4
1997-05-20    5
1997-05-21    6
Name: Days Elapsed, dtype: int64


Date
1997-05-15    1.958333
1997-05-16    1.729167
1997-05-19    1.708333
1997-05-20    1.635417
1997-05-21    1.427083


X has to be reshaped to be accepted for fitting the model because it is only one feature. Otherwise, both X and y would need to have the same number of rows.

(5067,)

(5067, 1)

(5067,)


### Fitting the Model

The procedure for fitting models in Scikit-Learn is similar across all models. They are as follows:

1. Import the model
2. Instantiate the model
3. Fit the model
4. Perform predictions

The metrics module is also imported to get access to the mean squared error. Setting the random_state parameter to 0 makes the results reproducible.

### Scores

The score function shows how well the trained model performs with certain inputs and outputs. It ranges in value from 0 to 1 where 0 means awful performance and 1 means perfect performance. If the training score is very high though, this could be an indicator of overfitting.

Another metric which is popular with linear regression is the mean squared error. The bigger this number is the worse the performance of the model because it means the error is higher. These scores are interpreted after the model is plotted below.

0.625136498184789

16430.991558811442


### Properties of the Learned Model

These properties can be used to plot a chart of the learned model. The underscore at the end means these are properties of the learned model. The array of coefficients in linreg.coef_ corresponds to the $\theta_1…\theta_n$ in the description above. Since there is just one coefficient this refers to $\theta_1$ or the gradient in $h_\theta(x) = \theta_0 + \theta_1x_1$. The intercept linreg.intercept_ is the y intercept $\theta_0$ and cuts the y-axis at -127.8.

array([ 0.08017787])

-127.80625275029112


### Plotting the Learned Model

#### Converting Days Back to Dates

Days elapsed needs to be converted back to dates to be plotted with the original data since two different types of indices cannot be plotted on the same x-axis.

DatetimeIndex(['2013-10-14', '2014-11-20', '2015-03-05', '2005-07-25',
'2012-10-02', '2015-02-24', '1999-11-24', '1998-02-09',
'2014-05-15', '2008-09-02',
...
'2000-03-02', '2011-02-25', '2014-10-01', '2001-06-20',
'2016-09-06', '2016-12-16', '2010-05-06', '2003-12-10',
'2007-09-26', '2008-03-27'],
dtype='datetime64[ns]', length=3800, freq=None)


### Interpretation of the Results

The scores of 0.625 on the training set and 0.635 on the test set in this case are not fantastic. The mean squared error is also very high at 16430.99. A plot of the prediction function shows how the line fits the data. The plot demonstrates that the fit is poor because the shape of the chart is not linear. A polynomial function would probably provide a better fit given the shape of the data.

## Fitting the Model using Cross-Validation

The problem with the train/test split approach is that it produces different metrics depending on the samples that are randomly drawn for training and testing. Cross-validation amends this problem by taking the average of the metrics from different train/test splits. It can thus produce more reliable results on predictions.

Cross-validation splits the data set into folds and uses each fold as the test set and the other folds as the training set for the number of folds in the data set. For instance, if there are 10 folds, the data will be split into 10 folds where the first 9 are for training and the last 1 is for testing. Then on the second pass, the second-to-last fold will be used for testing and the rest for training. This process is repeated until all 10 folds are used for testing. Then the average of all the scores is used as the measure of performance.

The process is similar to the one outlined above but uses cross_val_score and cross_val_predict for the scoring and predictions rather than train_test_split. cv=10 indicates that 10 cross validation folds are being used.

cross_val_score returns an array of the scores from each run of cross-validation. Since there were 10 folds, there are 10 scores. np.set_printoptions(suppress=True) suppresses the printing of the scores in scientific notation. Note that the values in the cross_val_score array are by default all negative. The logic behind this has to do with the fact that it is an error rather than a measure of correctness as with linreg.score() above. Taking the negative of these values gives the MSE as usual.

array([  33595.63913836,   14914.81619377,     166.46110124,
2131.75940237,   12018.10906378,   19454.65479697,
18002.87953032,   11182.19566593,    2882.68116282,
206077.8339844 ])


std() produces the standard deviation of the scores array above. It is quite high in this case.

58768.009904245591


mean() gets the mean of all the scores in the NumPy ndarray above which will be used as the final score to compare performance with other models.

32042.703003995477

### Interpretation of the Results

The array of scores shows all the scores for all 10 runs of cross-validation. The scores are all quite varied and this shows the importance of taking this approach rather than using the train/test split approach. Indeed, the scores have a standard deviation of 58768.0. The average error of 32042.7 though higher than the previous one of 6430.99 when the train/test split was used is more robust and reliable.

## Conclusion

An introduction to modeling with linear regression in Scikit-Learn with Pandas, NumPy and Matplotlib was covered. The results show that linear regression generally performs better when the data is itself linearly shaped and has low volatility thus minimising the mean squared error. They also show that cross-validation is a better, more reliable approach than the train/test split for training the models and getting performance scores.

Trying multivariate linear regression is also worth considering for comparison. In more complicated models with hyperparameters, additional complexity can be added using GridSearch for parameter tuning.

## Appendix

This is the code used to create the diagram in the first section of this post “Linear Regression Overview.”