This is the second installment in the series on Random Forest Regression. The previous post introduced the Random Forest Regression algorithm, imported the data for RY stock, explored it briefly and then preprocessed it.

In this post, data exploration is taken a step further with visualizations. Some of these require the data to be preprocessed before they can be interpreted. The visualizations are built with Plotly, and alternative Matplotlib code is given in many cases. The Matplotlib charts are not rendered since they display the same information.

It may be helpful to first browse an introduction to and overview of Plotly, which is covered here. Several visualizations are considered below, including line charts with a bar chart, a candlestick chart, two box plots and a scatter plot matrix.

Line and Bar Plots for Prices and Volume

Plotly Version

This plot uses the typical layout in which volume is aligned with the stock price along the same time axis. At some points in 2009 and 2015, increases in volume indicated sell-offs, as during the 2009 recession.

Note that when using a notebook, the chart will appear below the code. It was placed above in this post for clarity.


Code

The code is fairly self-explanatory. Each feature from the table (Open, High, etc.) has its own trace and parameters to control its appearance. The lines are of the Scatter type while volume is of type Bar.

The traces are appended to the figure and then some labelling parameters are set.

# Relies on go, tools, py and RY_df being defined earlier in the notebook.
def ry_stock_with_volume():

    trace0 = go.Scatter(
        x=RY_df.index,
        y=RY_df['Adj Close'],
        name='Adj Close',
        opacity = 0.8
    )
    trace1 = go.Bar(
        x=RY_df.index,
        y=RY_df['Volume'],
        name='Volume'
    )

    trace2 = go.Scatter(
        x=RY_df.index,
        y=RY_df['Close'],
        opacity = 0.8,
        name='Close'
    )

    trace3 = go.Scatter(
        x=RY_df.index,
        y=RY_df['Open'],
        opacity = 0.8,
        name='Open'
    )

    trace4 = go.Scatter(
        x=RY_df.index,
        y=RY_df['High'],
        opacity = 0.8,
        name='High'
    )

    trace5 = go.Scatter(
        x=RY_df.index,
        y=RY_df['Low'],
        opacity = 0.8,
        name='Low'
    )

    fig = tools.make_subplots(rows=3, cols=1, specs=[[{'rowspan':2}], [None], [{}]])

    fig.append_trace(trace0, 1, 1)
    fig.append_trace(trace2, 1, 1)
    fig.append_trace(trace3, 1, 1)
    fig.append_trace(trace4, 1, 1)
    fig.append_trace(trace5, 1, 1)
    fig.append_trace(trace1, 3, 1)

    fig['layout'].update(showlegend=True, title='RY Stock')
    fig['layout']['xaxis2'].update(title='Date')
    fig['layout']['yaxis1'].update(title='Stock Price (US$)')
    fig['layout']['yaxis2'].update(title='Volume')

    return py.iplot(fig, filename='ry-stock-with-volume')

ry_stock_with_volume()

Output

This is the format of your plot grid:
[ (1,1) x1,y1 ]
       |       
[ (3,1) x2,y2 ]
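
Since the five price traces differ only in the column they read, the same figure can be described more compactly by building the traces in a loop. The sketch below is an illustration, not the code used for the chart above. It uses plain dictionaries in the figure format Plotly accepts, takes an ordinary mapping of column names to lists in place of RY_df, and stacks the two panels with axis domains instead of make_subplots. The names build_price_volume_figure and PRICE_COLUMNS, and the domain values, are made up for this example.

```python
# Hypothetical helper: describes the figure as plain dicts, so Plotly is not imported here.
PRICE_COLUMNS = ['Adj Close', 'Close', 'Open', 'High', 'Low']


def build_price_volume_figure(dates, columns):
    """Return a figure dict with price lines on top and volume bars below.

    dates   -- list of x values (for example date strings)
    columns -- mapping of column name to a list of values, standing in for RY_df
    """
    # One line trace per price column, all drawn against the upper y axis.
    traces = [
        {'type': 'scatter', 'x': dates, 'y': columns[name],
         'name': name, 'opacity': 0.8}
        for name in PRICE_COLUMNS
    ]
    # Volume is assigned to a second y axis that occupies the lower part of the figure.
    traces.append({'type': 'bar', 'x': dates, 'y': columns['Volume'],
                   'name': 'Volume', 'yaxis': 'y2'})

    layout = {
        'title': 'RY Stock',
        'showlegend': True,
        'xaxis': {'title': 'Date'},
        # The domains split the vertical space roughly 2:1 (assumed values).
        'yaxis': {'title': 'Stock Price (US$)', 'domain': [0.35, 1.0]},
        'yaxis2': {'title': 'Volume', 'domain': [0.0, 0.3]},
    }
    return {'data': traces, 'layout': layout}
```

The returned dictionary has the same data/layout shape as the dict-based figures used for the candlestick and box plots in this post, so it could presumably be handed to py.iplot in the same way.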

Matplotlib Version (Code Only)

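def line_and_volume_plot(df):
    import matplotlib.pyplot as plt
    from matplotlib import gridspec

    fig = plt.figure()
    # Two stacked panels; the price panel is twice the height of the volume panel
    gs = gridspec.GridSpec(2, 1, height_ratios=[2, 1])

    # Upper panel: adjusted close, matching the axis label and title
    ax0 = fig.add_subplot(gs[0])
    ax0.plot(df.index, df['Adj Close'])
    ax0.set_ylabel("Adjusted Close (US$)")
    ax0.set_title("Subplots of Adjusted Close Prices and Volume Against Date")

    # Lower panel: traded volume as bars
    ax1 = fig.add_subplot(gs[1])
    ax1.bar(df.index, df['Volume'], width=1)
    ax1.set_xlabel("Date")
    ax1.set_ylabel("Volume")

line_and_volume_plot(RY_df)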

Candlestick Charts

Candlestick charts are a clearer way of plotting the Open, High, Low and Close than drawing each as a line, as done above. They clearly show the days when prices rose and the days when they fell. They are a favorite of technical analysts looking for patterns, trend changes and price breakouts, for instance in head-and-shoulders patterns.

For both increasing (green) and decreasing (red) candlesticks, the lowest price for a time period is indicated by the bottom of the lower wick. Similarly, the highest price is indicated by the top of the upper wick. For the green candlestick, the open is at the bottom of the body and the close is at the top. For the red candlestick, the open is at the top of the body and the close is at the bottom.
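
The reading rules above can be written down as a small function. This is only a sketch to make the geometry concrete. The name describe_candle is made up for this example, and treating a day where the close equals the open as increasing is an assumption, not something taken from Plotly.

```python
def describe_candle(open_, high, low, close):
    """Return the color and the body/wick extents of a single candlestick."""
    # Assumption: a close equal to the open is counted as increasing.
    increasing = close >= open_
    # The body always spans open to close; only its color depends on direction.
    body_bottom = min(open_, close)
    body_top = max(open_, close)
    return {
        'color': 'green' if increasing else 'red',
        'body': (body_bottom, body_top),
        'lower_wick': (low, body_bottom),   # from the period low up to the body
        'upper_wick': (body_top, high),     # from the body up to the period high
    }


# Made-up prices: the close is above the open, so the candle is green.
up_day = describe_candle(open_=10.0, high=12.0, low=9.5, close=11.0)
# Same range, but open and close swapped, so the candle is red.
down_day = describe_candle(open_=11.0, high=12.0, low=9.5, close=10.0)
```

Both candles have the same body and wicks. Only the color changes, which is why the color is needed to tell whether the open sits at the top or the bottom of the body.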

Only a slice of the data is shown for clarity as the image gets muddled with all the data points included. df = RY_df[-100:] takes the last 100 data points.

Plotly Version

Again, the trace is created with go.Candlestick, using parameters to define the open, high, low and close data. Then the labels are set and the figure is returned.

def candlestick():

    # taking the last 100 datapoints
    df = RY_df[-100:]

    trace = go.Candlestick(x=df.index,
                           open=df.Open,
                           high=df.High,
                           low=df.Low,
                           close=df.Close)
    data = [trace]
    layout = {
        'title': 'RY Stock',
        'yaxis': {'title': 'RY Stock Price (US$)'},
        'xaxis': {'title': 'Date'},
    }
    fig = dict(data=data, layout=layout)
    return py.iplot(fig, filename='ry-candlestick')

candlestick()

Box and Whisker Plots

Plotly’s button controls are used here. They are a useful feature, especially in this case where very similar data is being shown. The Volume values are on a much larger scale than the price values, so they obscure the price plots: the maximum volume is 9.8302m, while the maximum price is just $76.08. For many other models feature scaling would be required.
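
To illustrate what feature scaling would do to columns on such different scales, here is a minimal min-max scaling sketch in plain Python. The function name min_max_scale and the sample numbers are made up for this example. They are not values from RY_df, and this is not a step used in this series.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so the smallest is 0 and the largest is 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# Hypothetical values on very different scales.
prices = [70.0, 73.0, 76.0]
volumes = [2000000, 6000000, 10000000]

scaled_prices = min_max_scale(prices)    # [0.0, 0.5, 1.0]
scaled_volumes = min_max_scale(volumes)  # [0.0, 0.5, 1.0]
```

After scaling, both columns lie between 0 and 1 and could share one axis without the volume dwarfing the prices.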

The box and whisker plot shows the minimum, first quartile, median, third quartile and maximum values of the data. Thus it gives a summary overview of the range of values we are dealing with and also makes it easy to compare these values for different features.
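
The five values a box plot summarizes can also be computed directly. The sketch below uses Python's statistics module on a made-up list of prices. The function name five_number_summary is hypothetical, and since quartile conventions differ between tools, the numbers Plotly draws may not match this calculation exactly.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    # method='inclusive' treats the data as the whole population, so the
    # quartiles are interpolated between the observed minimum and maximum.
    q1, median, q3 = statistics.quantiles(values, n=4, method='inclusive')
    return (min(values), q1, median, q3, max(values))


# Hypothetical closing prices, not taken from RY_df.
summary = five_number_summary([1, 2, 3, 4, 5])
# summary is (1, 2.0, 3.0, 4.0, 5)
```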

def box_whisker():

    trace0 = go.Box(
        y = RY_df['Open'],
        name = "Open",
        marker = dict(
            color = 'rgb(9,56,125)'),
        line = dict(
            color = 'rgb(9,56,125)')
    )

    trace1 = go.Box(
        y = RY_df['High'],
        name = "High",
        boxpoints = False,
        marker = dict(
            color = 'rgb(9,56,125)'),
        line = dict(
            color = 'rgb(9,56,125)')
    )

    trace2 = go.Box(
        y = RY_df['Low'],
        name = "Low",
        boxpoints = False,
        marker = dict(
            color = 'rgb(9,56,125)'),
        line = dict(
            color = 'rgb(9,56,125)')
    )

    trace3 = go.Box(
        y = RY_df['Close'],
        name = "Close",
        boxpoints = False,
        marker = dict(
            color = 'rgb(9,56,125)'),
        line = dict(
            color = 'rgb(9,56,125)')
    )

    trace4 = go.Box(
        y = RY_df['Adj Close'],
        name = "Adj Close",
        boxpoints = False,
        marker = dict(
            color = 'rgb(9,56,125)'),
        line = dict(
            color = 'rgb(9,56,125)')
    )

    trace5 = go.Box(
        y = RY_df['Volume'],
        name = "Volume",
        boxpoints = False,
        marker = dict(
            color = 'rgb(9,56,125)'),
        line = dict(
            color = 'rgb(9,56,125)')
    )

    data = [trace0,trace1,trace2,trace3, trace4, trace5]

    updatemenus = list([
        dict(type="buttons",
             active=0,
             buttons=list([   
                dict(label = 'Prices and Volume',
                     method = 'update',
                     args = [{'visible': [True, True, True, True, True, True]},
                             {'title': 'Box Plots for All Raw Data'}]),
                dict(label = 'Prices',
                     method = 'update',
                     args = [{'visible': [True, True, True, True, True, False]},
                             {'title': 'Box Plots for Price Data'}])

            ]),
        )
    ])

    layout = dict(title='Box Plots', showlegend=False,
                  updatemenus=updatemenus)

    fig = dict(data=data, layout=layout)
    return py.iplot(fig, filename = "Box Plots")

box_whisker()
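
Each button pairs a list of visibility flags, one per trace in the order of the data list, with a new title. Writing the flags out by hand is easy to get wrong as traces are added, so one option is to derive them from the trace names. The helper below is a sketch in plain dictionaries. make_button and TRACE_NAMES are names made up for this example, not part of Plotly.

```python
# Trace names in the same order as the data list passed to the figure.
TRACE_NAMES = ['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']


def make_button(label, title, visible_names, all_names=TRACE_NAMES):
    """Build one 'update' button dict that shows only the named traces."""
    # One boolean per trace, aligned with the order of the figure's data list.
    visible = [name in visible_names for name in all_names]
    return {
        'label': label,
        'method': 'update',
        'args': [{'visible': visible}, {'title': title}],
    }


buttons = [
    make_button('Prices and Volume', 'Box Plots for All Raw Data', TRACE_NAMES),
    make_button('Prices', 'Box Plots for Price Data',
                ['Open', 'High', 'Low', 'Close', 'Adj Close']),
]
```

The resulting list has the same structure as the buttons entry in updatemenus, so it could be dropped in there in place of the hand-written dictionaries.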

Pandas/Matplotlib Version

RY_df[['Open', 'High', 'Low', 'Close']].plot.box();
RY_df.plot.box();

Scatterplot Matrix

Plotly Version

WARNING: This is very slow to run and may cause the browser to hang, which is why a PNG version is shown instead. Plotly has to render every data point individually rather than as a single image, which makes this a very expensive operation. Matplotlib tends to be more efficient here.

A scatterplot matrix is a very useful tool. It shows the relationship between each pair of features and can reveal correlation very clearly. In this image the open, high, low, close and adjusted close are all highly correlated, as shown by the data points falling along a 45-degree line.

[Image: scatterplot matrix of the RY stock data]

def scatterplot_matrix():
    import plotly.figure_factory as ff

    fig = ff.create_scatterplotmatrix(RY_df, diag='histogram',
                                      height=800, width=800)
    return py.iplot(fig, filename='Histograms along Diagonal Subplots')
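
The correlation that the scatterplot matrix shows visually can also be checked numerically. The sketch below computes the Pearson correlation coefficient for two lists in plain Python. The function name pearson and the sample values are made up for this example. With a DataFrame, pandas' own corr method would be the more usual route.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical values: the second series moves in lockstep with the first,
# so the points lie on a straight line and the coefficient is 1.
r_up = pearson([1, 2, 3, 4], [2, 4, 6, 8])
# A series that falls as the other rises gives a coefficient of -1.
r_down = pearson([1, 2, 3], [3, 2, 1])
```

A coefficient close to 1 corresponds to the tight diagonal bands seen between the price columns.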

Pandas/Matplotlib Version

from pandas.plotting import scatter_matrix
attributes = RY_df.columns
scatter_matrix(RY_df[attributes], figsize=(6, 6));

Conclusion

In this post we looked at generating charts to describe data and find insights using Plotly and Matplotlib. We looked at the patterns used to code out the charts. We also considered how to read them in order to better understand the data.

References

  1. Hunter, John D. “Matplotlib: A 2D graphics environment.” Computing in science & engineering 9.3 (2007): 90-95.

  2. McKinney, Wes. “Data structures for statistical computing in python.” Proceedings of the 9th Python in Science Conference. Vol. 445. Austin, TX: SciPy, 2010.

  3. Plotly Technologies Inc. Collaborative data science. Montréal, QC, 2015. https://plot.ly.