Pandas is a Python library for scientific computing, created for data analysis on structured data. It is a core package of the SciPy ecosystem, along with Matplotlib and IPython. We will look at the history of Pandas, the Pandas DataFrame, how I use Pandas, and its strengths and limitations.
History and Introduction
Pandas was created by Wes McKinney in 2008 to fill a gap in the Python scientific computing stack, which lacked a library for data analysis. Prior to this, practitioners had to leave the Python ecosystem and switch to a language like R for the data analysis step.
Its name is derived from "panel data" (McKinney, 2011), a term for multidimensional data, although the Panel object itself has since been deprecated and removed from Pandas.
Pandas is used for the manipulation and preprocessing of small to medium-scale data.
The workhorse of Pandas is a data structure called a DataFrame, which implements many of the features of the data.frame in R with some enhancements (McKinney, 2011). It holds tabular data and, when displayed in a notebook environment like a Jupyter Notebook or DataLab on GCP, looks similar to a spreadsheet or a SQL table. The rows and columns carry labels as metadata. As with a spreadsheet or SQL table, operations such as aggregation and filtering can be performed on it.
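As a minimal sketch of the filtering and aggregation described above, the example below builds a small DataFrame and queries it. The column names and values are invented for illustration and do not come from any real data set.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
df = pd.DataFrame(
    {
        "city": ["Oslo", "Oslo", "Lima", "Lima"],
        "year": [2019, 2020, 2019, 2020],
        "sales": [10, 15, 7, 12],
    }
)

# Filtering: keep the rows where sales exceed 9 (similar to a SQL WHERE clause).
high_sales = df[df["sales"] > 9]

# Aggregation: total sales per city (similar to a SQL GROUP BY with SUM).
sales_by_city = df.groupby("city")["sales"].sum()

print(high_sales)
print(sales_by_city)
```

Run in a notebook, evaluating df on its own would render the table with its row and column labels, much like a spreadsheet.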
How I Use It
I use Pandas to preprocess data for machine learning models, and for exploratory data analysis with Matplotlib inside Jupyter Notebooks to get a better understanding of the data.
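A typical preprocessing pass might look something like the sketch below. The data, the column names and the choice of median imputation and one-hot encoding are assumptions made for illustration, not a fixed recipe.

```python
import pandas as pd

# Hypothetical raw data with a missing value and a categorical column.
raw = pd.DataFrame(
    {
        "age": [25.0, None, 40.0],
        "colour": ["red", "blue", "red"],
        "label": [0, 1, 1],
    }
)

clean = raw.copy()

# Fill the missing numeric value with the column median.
clean["age"] = clean["age"].fillna(clean["age"].median())

# One-hot encode the categorical column into indicator columns.
clean = pd.get_dummies(clean, columns=["colour"])

# Split into a feature matrix and a target vector as NumPy arrays,
# the form most machine learning libraries accept.
X = clean.drop(columns=["label"]).astype(float).to_numpy()
y = clean["label"].to_numpy()

# Quick summary statistics as a first step of exploratory analysis.
print(clean.describe())
```

For the exploratory side, calling clean["age"].plot(kind="hist") in a notebook draws a histogram through Matplotlib, provided Matplotlib is installed.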
Strengths and Limitations
Strengths: Pandas is built on NumPy. It is efficient and fast because DataFrame operations are carried out in the low-level compiled C code that NumPy operations wrap.
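To illustrate the point, the sketch below computes the same result twice: once as a vectorised column operation, which is dispatched to compiled NumPy code, and once with an element-by-element Python loop. The column name and the arithmetic are arbitrary choices for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(100_000, dtype=np.float64)})

# Vectorised: the arithmetic runs over the whole column in compiled code.
vectorised = df["x"] * 2.0 + 1.0

# Python-level loop: each element is handled by the interpreter in turn.
looped = pd.Series([value * 2.0 + 1.0 for value in df["x"]])

# Both approaches give the same values.
assert np.allclose(vectorised.to_numpy(), looped.to_numpy())
```

Timing the two approaches, for example with %timeit in a notebook, should show the vectorised version running considerably faster, although the exact ratio depends on the machine and the library versions.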
Limitations: Pandas is only suitable for data that can be held in memory on one node. For big data, a library like Spark, which provides Resilient Distributed Datasets (RDDs) and distributed DataFrames, could be used instead, as it can scale across nodes in a fault-tolerant way.
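One rough way to judge whether a data set fits comfortably in memory is to measure how much a DataFrame occupies. The sketch below uses invented data and only reports the footprint; what counts as too large depends on the machine.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1,000 rows with an integer column and a string column.
df = pd.DataFrame(
    {
        "id": np.arange(1_000, dtype=np.int64),
        "name": ["row"] * 1_000,
    }
)

# Bytes used per column (and by the index); deep=True also counts
# the memory held by Python string objects.
bytes_per_column = df.memory_usage(deep=True)

total_mb = bytes_per_column.sum() / 1024**2
print(bytes_per_column)
print(f"Total: {total_mb:.3f} MB")
```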
- McKinney, Wes. pandas: a Foundational Python Library for Data Analysis and Statistics.
- Oliphant, Travis. Guide to NumPy.