Visualisation packages are a critical component of any data scientist’s toolkit. They help in understanding the data and in finding patterns and outliers that are not immediately obvious from tabular data. They are also integral to evaluating the performance of learning algorithms. This is why it is worth trying new offerings like Plotly and considering whether to incorporate them into a workflow.

Plotly Overview

Plotly offers a great solution for creating interactive visualisations, which are built on D3.js. There are APIs for Python, R, MATLAB and JavaScript. A wide variety of charts is available, including histograms, candlestick charts and maps, my personal favorite.

The company is headquartered in Montreal, Quebec and has a distributed employee structure. They have several product offerings. The one evaluated here, Plotly for Python, is open source with both free and paid versions.

The Python API is used in this post. The benefits and drawbacks found are outlined, with comparisons to Matplotlib where appropriate. Finally, the steps for getting set up are discussed, followed by a simple first plot in 2D and a second, slightly more complex one in 3D.

Below is a demo to give an idea of the types of plots and functionality available in Plotly. Rotate the plot by clicking, holding and dragging on it. Hovering over the chart shows a toolbar with other options in the top right corner. The render is based on this example. Note that I had to use the instructions here to get the 3D plot to render in Google Chrome. Click here for a non-interactive version if the demo does not show.

3D Scatter Plot Demo

Pros

First Impressions
Plotly can make some impressive and engaging interactive charts. The 3D charts are especially stunning. Dynamic elements like hover tooltips show data values right near the plotted element. Customised buttons can show views with different layers of information on the same chart. Interactive sliders can adjust the view on the chart shown.
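
As a rough sketch of how such buttons are declared: a Plotly figure is ultimately a JSON-like structure, and buttons live under the layout's updatemenus key. The trace names, values and button labels below are made up for illustration, and the figure is built from plain dictionaries rather than graph_objs so that the sketch runs without Plotly installed.

```python
import json

# Two hypothetical traces; the values are made up for illustration.
traces = [
    {"type": "scatter", "name": "Series A", "y": [1, 3, 2], "visible": True},
    {"type": "scatter", "name": "Series B", "y": [2, 1, 4], "visible": True},
]


def visibility_button(label, visible):
    """Build one button that shows or hides each trace via 'restyle'."""
    return {"label": label, "method": "restyle", "args": ["visible", visible]}


layout = {
    "title": "Button sketch",
    "updatemenus": [
        {
            "type": "buttons",
            "buttons": [
                visibility_button("Both", [True, True]),
                visibility_button("A only", [True, False]),
                visibility_button("B only", [False, True]),
            ],
        }
    ],
}

figure = {"data": traces, "layout": layout}
figure_json = json.dumps(figure)  # the serialised form of the figure
print(len(figure["layout"]["updatemenus"][0]["buttons"]))
```

Each button carries one visibility flag per trace, so "A only" keeps the first trace and hides the second. Sliders are declared in a similar way under the layout's sliders key.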


Data Granularity
Determining the specific values of a point is much easier than with static Matplotlib charts because of the interactive tooltips, which make the plots more informative. Where very small values are difficult to see on the chart, the tooltips still show them clearly. See the tails of the sample chart presented below.

Histograms of the Frequency of Daily Returns values for RY and SPY

Learning Curve
Plotly’s Python API works differently from Matplotlib but is fairly easy to learn (assuming some familiarity with Python and Matplotlib). The examples are informative and straightforward to adapt. Plotly also provides helpful error messages for object attributes: typing an incorrect attribute name gives a list of valid options. The help function is also useful, as in <surface-object>.help('attribute'), where <surface-object> is replaced with the name given to an object such as trace0 (often used in the examples) and 'attribute' is replaced by an attribute such as colorbar.

Additionally, there is the mpl_to_plotly() function, which can (partially) convert a Matplotlib chart to an interactive version. This is a great starting point or quick conversion option. As shown in the example below, most of the chart is converted properly, including the title and axis labels, but not the legend.

# Creating the Matplotlib plot
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook
plt.style.use('seaborn')

fig = plt.figure()
np.random.seed(0)
plt.plot(np.random.rand(10)*20, label="line one")
plt.plot(np.random.rand(10), label="line two")
plt.title("Matplotlib to Plotly Demo")
plt.xlabel("x-axis")
plt.ylabel("y-axis")
plt.legend()

Matplotlib to Plotly Demo

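# Converting to Plotly
# (in Plotly 4 and later, plotly.plotly lives in the separate
# chart_studio package as chart_studio.plotly)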
import plotly.plotly as py
import plotly.tools as tls

plotly_fig = tls.mpl_to_plotly(fig)
py.iplot(plotly_fig, filename='matplotlib-to-plotly')


Editing Options
A web interface can be used to alter the plot’s features, so changes do not have to be made in code. Having this option of editing visually can be quite convenient. It can be accessed by clicking Edit Chart on the bottom right of the plot or by clicking Edit on the chart in the online My Files menu.

Cost
There is a free version for public plots. Private plots require a fee, and features in the free version are limited (see the Cons section). There are 3 price tiers for paid plans on Plotly Cloud: Student, Personal and Professional.

Sharing
It is easy to share the interactive plots online because they are hosted on a server.

Fallbacks
PNGs can be used as a fallback if the plots need too much compute or for users who have JavaScript turned off. Using large data sets may cause performance issues (see Performance).

Cons

Documentation and Tutorials
Plotly is a relatively new library, so there is not much information available on how to use it in the form of courses, introductions, conference talks and other resources. This is especially true compared to the plethora of information available for the mature and well-covered Matplotlib. Additionally, the examples in the official docs do not come with in-depth explanations, so it is not always clear what is happening and some trial and error can be required. This can make it particularly challenging for a beginner.

Performance
Speed can sometimes be an issue because the charts are interactive and require more power than a static chart. For example, the scatterplot matrix generated in the post Pt. 2: Visualising the Data with Plotly and Matplotlib took a long time to process and made the entire browser unresponsive. Even getting the hover tooltips to show was very slow and eventually, a PNG was used instead. An error message also appeared saying performance would be bad in all browsers. Thus, the amount of data visualised is limited to a certain maximum.

Additionally, since the plots typically load from a server, sending and getting responses also adds to the waiting time. This is especially noticeable when loading multiple charts that have already been generated in a web page.
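
One way to stay under the data ceiling described above is to thin the data before it is sent to the chart. The helper below is a minimal sketch and was not used in this post: the name downsample and the 1,000 point cap are assumptions chosen for illustration. Plain striding like this can drop outliers, so it suits dense, smooth series best.

```python
import math


def downsample(points, max_points=1000):
    """Return at most max_points items by keeping every k-th point."""
    if max_points < 1:
        raise ValueError("max_points must be at least 1")
    if len(points) <= max_points:
        return list(points)
    # Smallest stride that brings the length down to max_points or fewer.
    step = math.ceil(len(points) / max_points)
    return list(points[::step])


series = list(range(10_000))
thinned = downsample(series, max_points=1000)
print(len(thinned))
```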

Conciseness
More verbose code (than Matplotlib) is likely another consequence of more complex charting capabilities when using the Python API. Particularly when compared to the very succinct wrappers Pandas builds on top of Matplotlib, the difference in length is stark. One line of code using Pandas can effortlessly turn into thirty lines with Plotly. This is a fair trade-off considering the benefits available, but the trade-off has to be considered where time is limited and efficiency is paramount.

Browser Requirements
Some of the charts that rely on WebGL did not show up, so static backup versions may be a good idea when publishing to users whose browsers do not meet the requirements or who do not have the time to change their settings. Mobile support can also be unpredictable.

Free Version Limitations
These are not necessarily negatives of the product itself, as I do think it is fair to limit the functionality of the free version. It is a business after all and it needs to be profitable to succeed. Multi-tier pricing is a common strategy that lets users choose the best plan for their needs. I mention these limits because it is irksome to use the tool without knowing which functionality is accessible, since this can lead to disappointment.

  1. When editing plots created using the Python API in the web interface (after hitting Edit Chart on the rendered chart) a PRO account is required. The distinction is not always obvious because the paid for options are available for selection and preview.

  2. There is a maximum limit of 100 image exports and chart saves per day. I hit this limit when I was trying out different plots for this article.

User Experience on Load
The different screens that flash by when the embedded charts load in a web page interrupt what would otherwise be a seamless user experience. At first sight, they can be confusing and distracting.

Getting Started with Plotly

  1. Sign up for an account.
  2. I initially had to sign in with an API Key when using a Jupyter Notebook, but this was not needed with the methods I used afterwards. From what I saw there are different ways it can be run. I did not try all of them.
  3. Search for and copy an example similar to the plot that is desired.
  4. Run the code and tweak as needed.
  5. The plot files go to the online account and are stored there (in the My Files tab). If using a Jupyter Notebook the chart appears in the notebook output as there is a direct integration. Since the plots are hosted online they can be embedded in a web page or in a Jupyter Notebook using IPython.core.display.
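
For the embedding route mentioned in the last step, a hosted chart can be wrapped in an iframe and passed to IPython's HTML display class. The sketch below only builds the markup with the standard library. The URL is a placeholder rather than a real chart, the .embed suffix is an assumption about how hosted chart links are shaped, and iframe_markup is a hypothetical helper name.

```python
from html import escape


def iframe_markup(chart_url, width=800, height=600):
    """Return an HTML iframe string pointing at a hosted chart."""
    return (
        '<iframe src="{src}" width="{w}" height="{h}" '
        'frameborder="0" scrolling="no"></iframe>'
    ).format(src=escape(chart_url, quote=True), w=int(width), h=int(height))


# Placeholder URL, not a real chart.
markup = iframe_markup("https://example.com/~username/1.embed",
                       width=700, height=500)
print(markup)

# In a notebook the string could then be rendered with:
#     from IPython.core.display import HTML
#     HTML(markup)
```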

Example 1: Scatter and Line in 2D

This is a simple beginner example. It uses dummy data generated with Scikit-learn’s make_regression() function. The gradient and intercept of the line of best fit are computed with NumPy’s polyfit() function. Then the output is fed into a Plotly chart.

Note that this example and the next one were created in a Jupyter Notebook with Python 3.6 and Anaconda.

Generate Dummy Data

The number of points n_samples is set to 100 for clarity since too many points will obscure each other in the plot. n_features is set to 1 since only a 2-dimensional plot is desired. This corresponds to X or the values that will be plotted on the x-axis. y are the values that will be plotted on the y-axis. Adjusting noise upwards increases the deviation from the line of best fit. Setting random_state=0 makes the sample reproducible.

def generate_dummy_data():
    from sklearn import datasets
    n_samples = 100

    # coef=True also returns the true coefficient, which is not needed here
    X, y, _ = datasets.make_regression(n_samples=n_samples, n_features=1,
                                       n_informative=1, noise=30,
                                       coef=True, random_state=0)
    return X,y

X,y = generate_dummy_data()
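
To make the roles of n_samples, noise and the random seed concrete, here is a dependency-free sketch of the same idea: draw x values, pick a slope, and add Gaussian noise to a straight line. It is an illustration only and does not reproduce Scikit-learn's internals or its exact numbers. The function name make_noisy_line and the fixed slope are assumptions.

```python
import random


def make_noisy_line(n_samples=100, slope=40.0, noise=30.0, seed=0):
    """Return (xs, ys) where ys = slope * xs + Gaussian noise."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    xs = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    ys = [slope * x + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


xs, ys = make_noisy_line()
xs_again, ys_again = make_noisy_line()

# Same seed, same sample: the second value printed is True.
print(len(xs), xs == xs_again and ys == ys_again)
```

Raising noise spreads the points further from the underlying line, which is the effect described above for make_regression().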

Fit a Line to the Dummy Data

NumPy is imported to get access to its polyfit() function, which fits a line to the data. The fit gives $m = 43.09$, where m is the gradient, and $c = -2.44$, where c is the y-intercept. These can be used to plot the line of best fit using the equation of the line $y=mx+c$. See here for a detailed explanation of how the equation of a line works and a method of fitting the line using the Mean Squared Error. An alternate way of fitting the line using Scikit-learn’s Linear Regression implementation is also covered.

def fit_line_to_data(X,y):
    import numpy as np

    # reshape X to a 1D array
    X = X.reshape(1,-1)[0]

    # m is the gradient and c is the intercept in y=mx+c
    # the equation of a line.
    m,c = np.polyfit(X, y, 1)

    print("m = {}".format(m))
    print("c = {}".format(c))

    return m,c

m,c = fit_line_to_data(X,y)
m = 43.08728116246489
c = -2.442545481092182
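
For a straight line, the least-squares fit that polyfit() performs with degree 1 has a closed form: m is the covariance of x and y divided by the variance of x, and c is the mean of y minus m times the mean of x. The sketch below implements those two formulas in plain Python as a cross-check. The function name fit_line and the tiny data set are illustrative and are not the data from this post.

```python
def fit_line(xs, ys):
    """Least-squares gradient m and intercept c for y = m*x + c."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equal-length sequences with at least 2 points")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sums of products of deviations from the means.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the gradient is undefined")
    m = sxy / sxx
    c = y_mean - m * x_mean
    return m, c


# Points that lie exactly on y = 2x + 1, so the fit recovers that line.
m_demo, c_demo = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(m_demo, c_demo)
```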

Plot the Data with Plotly

The Plotly modules are imported, then the dummy data points, the line of best fit and their respective properties are defined. The layout properties are set and the chart is generated. In a Jupyter Notebook returning the plot from the function will render it directly in the Notebook.

Hovering over the data points “grabs” the cursor and shows the specific values at that point for both the point and the line of best fit in Bootstrap-style tooltips. This is what I find so awesome about Plotly. Seeing the exact data point value and the corresponding prediction on the chart is very informative. Ordinarily, with Matplotlib this is not possible and the values have to be estimated by eye, even when using the interactive notebook version with %matplotlib notebook.

def plot_data_and_best_fit():
    import plotly.plotly as py
    import plotly.graph_objs as go

    data = go.Scatter(
        x = X,
        y = y,
        opacity = 0.75,
        mode = 'markers',
        name = 'Dummy Data'
    )

    fit = go.Scatter(
        x = X,
        y = m*X+c,  # equation of a line
        opacity = 0.75,
        name = 'Line of Best Fit'
    )

    data = [data, fit]

    layout = go.Layout(
        title='Line of Best Fit Through Generated Dummy Data',
        xaxis=dict(
            title='X'
        ),
        yaxis=dict(
            title='y'
        )
    )

    fig = go.Figure(data=data, layout=layout)
    return py.iplot(fig, filename='dummy-data-demo')

plot_data_and_best_fit()
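
The two numbers the tooltip shows at a given x are the observed y and the fitted value m*x + c. As a small worked sketch using the m and c printed earlier, the helpers below compute the fitted value and the residual for one point. The sample point is made up and the helper names are illustrative.

```python
M = 43.08728116246489   # gradient printed earlier in this post
C = -2.442545481092182  # intercept printed earlier in this post


def fitted_value(x, m=M, c=C):
    """Value of the line of best fit at x, i.e. y = m*x + c."""
    return m * x + c


def residual(x, y_observed, m=M, c=C):
    """Observed value minus fitted value at x."""
    return y_observed - fitted_value(x, m, c)


# A made-up point for illustration, not one from the generated data.
x_hover, y_hover = 1.0, 60.0
print(round(fitted_value(x_hover), 2), round(residual(x_hover, y_hover), 2))
```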

Example 2: Scatter and Plane in 3D

I chart a 3D regression problem to demonstrate a higher level of complexity. Again, Scikit-learn’s make_regression() function is used to generate dummy data, this time with 2 features, and NumPy is used to add outlier data. The dummy data is then used to train a model with Scikit-learn, and the intercept and coefficients are used to plot the plane which best fits the points alongside the data points.

Generate Dummy Data

This is similar to what was done before, with outlier data added in (a single point in this case). The random seed is set for reproducibility.

def generate_data():
    import numpy as np
    from sklearn import linear_model, datasets

    n_samples = 100
    n_outliers = 1


    X, y, coef = datasets.make_regression(n_samples=n_samples, n_features=2,
                                          n_informative=1, noise=100,
                                          coef=True, random_state=0)

    #Add outlier data
    np.random.seed(0)
    X[:n_outliers] = 3 + 0.5 * np.random.normal(size=(n_outliers, 1))
    y[:n_outliers] = -3 + 9 * np.random.normal(size=n_outliers)

    return X,y

X,y = generate_data()

Train the Model with Scikit-learn

A model is trained using a train/test split. The linreg object has two properties intercept_ and coef_ which give the y-intercept and model coefficients. These values are shown in the output. The first coefficient in the array [-41.31872872 93.14420684] is for feature 1 and the second for feature 2.

See here for more on how training the model works in Scikit-learn using both a train/test split and cross-validation.

def train_model():
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=0)

    linreg = LinearRegression()
    linreg.fit(X_train, y_train)

    print("y-intercept: {}".format(linreg.intercept_))
    print("coefficients: {}".format(linreg.coef_))

    return linreg

linreg = train_model()
y-intercept: -6.399996972084856
coefficients: [-41.31872872  93.14420684]
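
With the intercept and coefficients in hand, a prediction is the intercept plus each feature multiplied by its coefficient. The sketch below applies the values printed above to one made-up input. The input point is hypothetical and the helper name predict is illustrative, not part of Scikit-learn's API as used here.

```python
INTERCEPT = -6.399996972084856
COEF = [-41.31872872, 93.14420684]  # feature 1, feature 2


def predict(features, intercept=INTERCEPT, coef=COEF):
    """Linear model prediction: intercept + sum(coef_i * feature_i)."""
    if len(features) != len(coef):
        raise ValueError("expected {} features".format(len(coef)))
    return intercept + sum(w * x for w, x in zip(coef, features))


# Hypothetical input: feature 1 = 1.0, feature 2 = 1.0
print(round(predict([1.0, 1.0]), 4))
```

The negative first coefficient pulls the prediction down as feature 1 grows, while the larger positive second coefficient pushes it up as feature 2 grows.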

Plot the Data with Plotly

A NumPy meshgrid is set up to create the plane, spanning the lowest and highest values of the two features in X. The hypothesis function is set up and the plot is generated. More on the intuition behind this function can be found here.

The colors in the scatter plot spheres have no meaning and are only there to add interest to the plot and to demonstrate the use of a color scale. Many more of these scales exist like Viridis, Portland and Picnic. In practice, they can be used to represent another dimension of data. The size of the spheres can be used similarly.
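
To make the meshgrid step concrete, the sketch below builds the same kind of grid with nested lists and evaluates a plane on it. np.meshgrid with its default indexing returns arrays whose columns follow the first range and whose rows follow the second, which is what the list comprehensions mimic. The ranges and coefficients are small made-up values, not those of the model above.

```python
def meshgrid(xs, ys):
    """Pure-Python analogue of np.meshgrid(xs, ys) with default indexing."""
    xs, ys = list(xs), list(ys)
    grid_x = [[x for x in xs] for _ in ys]  # each row repeats the x values
    grid_y = [[y for _ in xs] for y in ys]  # each row holds a single y value
    return grid_x, grid_y


def plane(grid_x, grid_y, intercept, coef):
    """Evaluate intercept + coef[0]*x + coef[1]*y at every grid point."""
    return [
        [intercept + coef[0] * x + coef[1] * y for x, y in zip(row_x, row_y)]
        for row_x, row_y in zip(grid_x, grid_y)
    ]


# Made-up ranges and coefficients for illustration.
gx, gy = meshgrid(range(-1, 2), range(0, 2))
z = plane(gx, gy, intercept=1.0, coef=[2.0, 3.0])
print(gx)
print(gy)
print(z)
```

The three nested lists share one shape, which is what a surface trace needs for its x, y and z values.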

def plot_3d_chart():

    import math
    import numpy as np

    # set up the meshgrid
    # floors and ceilings are taken because range() needs whole numbers
    # the ranges are taken from the data to make sure the grid covers
    # the values of the data; range() excludes its stop value, so 1 is
    # added to include the ceiling itself
    x_min = math.floor(X[:,0].min())
    x_max = math.ceil(X[:,0].max())
    y_min = math.floor(X[:,1].min())
    y_max = math.ceil(X[:,1].max())
    x_,y_ = np.meshgrid(range(x_min,x_max + 1), range(y_min,y_max + 1))


    import plotly.plotly as py
    import plotly.graph_objs as go

    # for random color generation
    np.random.seed(10)
    color = np.random.randn(len(y))

    # the hypothesis function to generate the plane
    hypothesis = linreg.intercept_ + x_*linreg.coef_[0]  + y_*linreg.coef_[1]

    trace1 = go.Scatter3d(
        x=X[:,0],
        y=X[:,1],
        z=y,
        mode='markers',
        name='Dummy Data',
        marker=dict(
            size=20,
            color=color,               
            colorscale='Rainbow',   
            opacity=0.9
        )
    )

    trace2 = go.Surface(
            x = x_,
            y = y_,
            z = hypothesis,
            name='Plane of Best Fit',
            opacity = 0.7,
            colorscale='Greys',
            showscale= False
        )

    data = [trace1, trace2]
    layout = go.Layout(
            title='3D Plane of Best Fit Through Generated Dummy Data',
            margin=dict(
                l=0,
                r=0,
                b=10,
                t=100  # the title is obscured if the top margin is not adjusted
        )
    )
    fig = go.Figure(data=data, layout=layout)
    return py.iplot(fig, filename='3d-scatter-with-plane')

plot_3d_chart()


Click here for a non-interactive version if the one above does not show.

Conclusion

Overall I enjoy using Plotly and would highly recommend it. Matplotlib is still my go-to choice for charting, but I will keep Plotly in mind for some added oomph. It is straightforward to use and successfully combines functional charting with aesthetic properties. A comparison with another tool called Bokeh could be instructive.

More complex examples of how I used Plotly charts can be found in the series below and its related posts. Altogether, there is quite a wide variety including examples of line charts, a bar chart, scatter plots, a candlestick chart, a scatter matrix, box plots, histograms and several combinations. I include Matplotlib versions in many cases for comparison.

References

  • Hunter, John D. “Matplotlib: A 2D graphics environment.” Computing In Science & Engineering 9.3 (2007): 90-95.

  • Pedregosa, Fabian, et al. “Scikit-learn: Machine learning in Python.” Journal of Machine Learning Research 12.Oct (2011): 2825-2830.

  • Pérez, Fernando, and Brian E. Granger. “IPython: a system for interactive scientific computing.” Computing in Science & Engineering 9.3 (2007): 21-29.

  • Sievert, C., et al. “plotly: Create interactive web graphics via Plotly’s JavaScript graphing library [Software].” (2016).

  • Walt, Stéfan van der, S. Chris Colbert, and Gael Varoquaux. “The NumPy array: a structure for efficient numerical computation.” Computing in Science & Engineering 13.2 (2011): 22-30.