Data Engineering

Data Engineering with Pandas

I learned best practices in data engineering with Pandas from Matt Harrison's Effective Pandas. In general, I leave the original dataset untouched, making no destructive changes, and express each transformation as a chain of operations, which keeps the work repeatable and explainable. I applied this approach to my Master's degree capstone project.

Effective Pandas

Code Sample on GitHub

In this sample, I create a repeatable function that applies the correct data type to columns of interest, re-engineers other problematic columns, and reduces a 19.6MB dataset to 8.7MB, a memory saving of about 56 percent.
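As a rough illustration of the idea, and not the code from the linked sample, the sketch below applies leaner data types to a small made-up frame and compares memory use before and after. The column names (`city`, `rooms`, `price`), the toy data, and the function name `tweak_dtypes` are all assumptions invented for this example.

```python
import pandas as pd


def tweak_dtypes(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a memory-lean copy of `raw`; the input frame is left untouched."""
    return raw.assign(
        # Repeated strings are stored once as categories
        city=raw["city"].astype("category"),
        # Small whole numbers do not need 64 bits
        rooms=raw["rooms"].astype("int8"),
        # Re-engineer a problematic column: strip the currency symbol,
        # then store the values as 32-bit floats
        price=raw["price"].str.replace("$", "", regex=False).astype("float32"),
    )


# Hypothetical toy data standing in for the real dataset
raw = pd.DataFrame(
    {
        "city": ["Lisbon", "Porto", "Lisbon", "Faro"] * 250,
        "rooms": [1, 2, 3, 4] * 250,
        "price": ["$100.5", "$80.0", "$120.25", "$95.75"] * 250,
    }
)

before = raw.memory_usage(deep=True).sum()
after = tweak_dtypes(raw).memory_usage(deep=True).sum()
print(f"{before:,} bytes -> {after:,} bytes ({1 - after / before:.0%} saved)")
```

Because the function returns a new frame via `assign`, it can be re-run on the raw data at any time and always gives the same result. The exact saving depends on the data; low-cardinality string columns usually benefit most from the `category` type.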

Flow Programming for ETLs

Code Sample on GitHub

Method chaining is also called "flow programming." The result is a prepared dataset, plus a function that can be called to repeat the ETL and output a dataset ready for easy and accurate aggregations and visualizations.
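A minimal sketch of what such a chained ETL function can look like follows. It is not the code from the linked sample: the `prepare_orders` name, the columns, and the toy data are hypothetical, chosen only to show the pattern of one chain wrapped in one reusable function.

```python
import pandas as pd


def prepare_orders(raw: pd.DataFrame) -> pd.DataFrame:
    """Run the whole transform as one chain; `raw` itself is never modified."""
    return (
        raw
        .rename(columns=str.lower)            # consistent column names
        .dropna(subset=["amount"])            # drop rows with no amount
        .assign(
            order_date=lambda df: pd.to_datetime(df["order_date"]),
            region=lambda df: df["region"].str.strip().astype("category"),
        )
        .query("amount > 0")                  # keep valid orders only
        .reset_index(drop=True)
    )


# Hypothetical raw extract with typical problems: mixed-case headers,
# stray whitespace, a missing value, and a negative amount
raw = pd.DataFrame(
    {
        "Order_Date": ["2024-01-05", "2024-01-06", "2024-02-01", "2024-02-03"],
        "Region": [" north", "south ", "north", "south"],
        "Amount": [120.0, None, 80.0, -5.0],
    }
)

orders = prepare_orders(raw)

# The prepared frame is ready for aggregation
monthly = orders.groupby(orders["order_date"].dt.month)["amount"].sum()
print(monthly)
```

Each step in the chain reads top to bottom like a recipe, and any step can be commented out to inspect the intermediate result.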

Data Transformations for Machine Learning

With a background in Instructional Design, I find it natural to document code for learnability and repeatability. In this series of Google Colaboratory notebooks, I recreate or redesign the excellent tutorials from Dr. Bright Kyeremeh's Udemy course, The Full Stack Data Scientist BootCamp®, often adding a Problem-Based Learning approach of my own.

Code Samples on GitHub include:

  1. Dealing with missing values

  2. Ethical data cleansing

  3. Data analysis with uni/bi/multi-variate data

  4. Dealing with imbalanced datasets

  5. Feature hashing and feature encoding

  6. Feature scaling and normalisation

  7. Auto feature encoding

  8. Auto feature selection

  9. Preventing data leakage
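To give a flavour of two of these topics together (feature scaling and preventing data leakage), here is a small sketch using only Pandas. It is not taken from the notebooks or the course; the function names, columns, and values are made up for illustration. The point it shows is that scaling statistics are learned from the training rows only and then reused on the held-out rows.

```python
import pandas as pd


def fit_standardiser(train: pd.DataFrame) -> tuple[pd.Series, pd.Series]:
    """Learn column means and standard deviations from the training rows only."""
    return train.mean(), train.std(ddof=0)


def apply_standardiser(df: pd.DataFrame, mean: pd.Series, std: pd.Series) -> pd.DataFrame:
    """Scale any frame with statistics that were learned elsewhere."""
    return (df - mean) / std


# Hypothetical feature table; the last row plays the role of unseen data
features = pd.DataFrame(
    {
        "age": [20.0, 30.0, 40.0, 50.0, 90.0],
        "income": [10.0, 20.0, 30.0, 40.0, 200.0],
    }
)
train, test = features.iloc[:4], features.iloc[4:]

# Fit on train only, so nothing about the test row leaks into the scaler
mean, std = fit_standardiser(train)
train_scaled = apply_standardiser(train, mean, std)
test_scaled = apply_standardiser(test, mean, std)

print(train_scaled.round(2))
print(test_scaled.round(2))
```

Had the mean and standard deviation been computed on the full table, the outlying final row would have shifted the scaling of the training data, which is exactly the kind of leakage the split is meant to prevent.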