  • What is NumPy?
    BiteSize Series


    NumPy is a high-performance package for numerical computing in Python that is well suited to vector and matrix operations. In this brief post we look at the origins of NumPy, ndarrays and their benefits, use cases and limitations, as well as some useful NumPy functions.

    Read More
  • What is Pandas?
    BiteSize Series


    Pandas is a scientific computing library for Python, created for data analysis on structured data. It is a core package of the SciPy ecosystem, along with Matplotlib and IPython. We will look at the history of Pandas, the Pandas DataFrame, how I use Pandas, and its strengths and limitations.

    Read More
  • Mentoring for Sequence Models with deeplearning.ai


    I was invited to be a volunteer mentor on the Sequence Models course, which is part of the deeplearning.ai Deep Learning Specialization on Coursera. This is a course associated with Stanford University. The course covers Recurrent Neural Networks for Natural Language Processing. I received the invitation by email a few weeks after completing the five courses in the Specialization in May 2018. I took the Specialization as a follow-up to the ever-popular Machine Learning course by the same instructor.

    Read More
  • Apache Spark with a Recommender System


    Apache Spark is a popular framework for distributed computing and big data. It can be used with Java, Scala, R and Python via its high-level APIs. The techniques and patterns of the Python API (PySpark) are quite similar to those of Pandas and Scikit-Learn as previously explored. In this post, we use PySpark to build a recommender system.

    Read More
  • Random Forest Regression Pt. 4
    Training using One Feature with Grid Search and Randomised Grid Search


    This is Pt. 4 in the series covering Random Forest Regression to predict the price of RY stock. It follows from Pt. 3 on Feature Engineering. In this post, training is done using the estimators and tools provided by Scikit-Learn.

    We begin with a discussion of the theoretical concepts needed to carry out this process. These include hyperparameters, Grid Search, and pickling. Then, the models are trained using one feature with both Grid Search and Randomised Grid Search. The hypothesis that Randomised Grid Search is generally the...

    Read More
  • Random Forest Regression Pt. 1
    Algorithms, Importing, Exploring and Preprocessing the Data


    This is the first post in a series which considers Regression using Random Forests to predict the price of the stock of the Royal Bank of Canada (ticker RY). The full technology stack includes Python, Pandas, NumPy, Matplotlib/Plotly, and Scikit-Learn.

    Firstly, we discuss the Decision Tree and Random Forest algorithms. Next, the data is imported from Yahoo! Finance, with demonstrations for local CSV files as well as sourcing via pandas_datareader. Afterwards, preliminary explorations are done with Pandas and its DataFrame. Finally, the data is preprocessed in preparation for visualisation and modelling.

    Read More
  • Univariate Linear Regression with AMZN and Scikit-Learn


    In this post, we explore univariate Linear Regression with Amazon stock (ticker AMZN) data using the Python data science ecosystem. The libraries used include Pandas, NumPy, Matplotlib and Scikit-Learn.

    We start with a brief introduction to univariate linear regression and how it works. The data is imported, explored, and preprocessed using Pandas and Matplotlib. The model is then fitted to the data using both a train/test split and cross-validation with Scikit-Learn. The results for the two scenarios are then discussed and compared.

    Read More
  • Forecasting Stock Prices and Generating Buy Sell Signals


    This is the first project I did with the Python data science stack. It is in the form of a Jupyter Notebook hosted on GitHub, which can be found here. It covers a range of concepts and techniques, including tools, data sources, data exploration and visualization, handling missing data, domain-specific considerations, and modeling.

    Read More