Data Engineering
Data Engineering with Pandas
I learned best practices for data engineering with Pandas from Matt Harrison's Effective Pandas. In general, I leave the original dataset untouched and chain non-destructive operations, which keeps the work repeatable and explainable. I applied this approach to my Master's degree capstone project.
Effective Pandas
In this sample, I create a repeatable function that applies the correct data type to columns of interest, re-engineers other problematic columns, and reduces a 19.6 MB dataset to 8.7 MB, a memory saving of about 56 percent.
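A minimal sketch of that kind of dtype-tightening function is shown here. It uses synthetic data and hypothetical column names (`city`, `units`, `price`) and a hypothetical function name (`tweak`), not the capstone dataset, so the saving it produces will differ from the figures quoted above.

```python
import numpy as np
import pandas as pd


def make_raw(n_rows: int = 10_000, seed: int = 0) -> pd.DataFrame:
    """Build a small synthetic frame with pandas' default (wide) dtypes."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "city": rng.choice(["Lagos", "Accra", "Nairobi"], size=n_rows),  # strings
        "units": rng.integers(0, 100, size=n_rows),                      # 64-bit ints
        "price": rng.random(n_rows) * 100,                               # 64-bit floats
    })


def tweak(df: pd.DataFrame) -> pd.DataFrame:
    """Return a new frame with tighter dtypes; the input is not modified."""
    return df.astype({
        "city": "category",   # few distinct values, so store codes instead of strings
        "units": "uint8",     # values are known to fit in 0..255
        "price": "float32",   # assumes single precision is enough for this column
    })


raw = make_raw()
clean = tweak(raw)

# deep=True counts the actual string payload, not just the pointers
before = raw.memory_usage(deep=True).sum()
after = clean.memory_usage(deep=True).sum()
print(f"{before / 1e6:.2f} MB -> {after / 1e6:.2f} MB "
      f"({1 - after / before:.0%} saved)")
```

Because `tweak` returns a new DataFrame, `raw` stays available for comparison, and the same function can be rerun whenever the source file is refreshed.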
Flow Programming for ETLs
Method chaining is also called "flow programming." The result is a prepared dataset and a function that can be called to repeat the ETL and to output a dataset ready for easy and accurate aggregations and visualizations.
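As an illustration of the chaining style, here is a small sketch of an ETL function written as a single chain. The function name `prepare`, the column names, and the sample rows are all hypothetical, chosen only to show the pattern.

```python
import pandas as pd


def prepare(raw: pd.DataFrame) -> pd.DataFrame:
    """Run the whole ETL as one chain; `raw` itself is never mutated."""
    return (
        raw
        .rename(columns=str.lower)                      # normalise column names
        .assign(
            order_date=lambda d: pd.to_datetime(d["order_date"]),
            revenue=lambda d: d["units"] * d["unit_price"],
        )
        .query("units > 0")                             # drop empty orders
        .astype({"region": "category"})
        .reset_index(drop=True)
    )


# Hypothetical sample data
raw = pd.DataFrame({
    "Order_Date": ["2024-01-05", "2024-01-06", "2024-02-01"],
    "Region": ["North", "South", "North"],
    "Units": [3, 0, 5],
    "Unit_Price": [10.0, 12.5, 8.0],
})

orders = prepare(raw)

# The prepared frame is ready for aggregation
revenue_by_region = orders.groupby("region", observed=True)["revenue"].sum()
print(revenue_by_region)
```

Each step in the chain reads top to bottom like a recipe, and calling `prepare(raw)` again on a fresh extract repeats exactly the same transformations.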
Data Transformations for Machine Learning
With my background in Instructional Design, documenting code for learnability and repeatability comes naturally. In this series of Google Colaboratory notebooks, I recreate or redesign the excellent tutorials from Dr. Bright Kyeremeh's Udemy course, The Full Stack Data Scientist BootCamp®, often adding a Problem-Based Learning approach of my own.
Code samples on GitHub include:
Dealing with missing values
Ethical data cleansing
Data analysis with uni/bi/multi-variate data
Dealing with imbalanced datasets
Feature hashing and feature encoding
Feature scaling and normalisation
Auto feature encoding
Auto feature selection
Preventing data leakage
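To give a flavour of two of these topics together, the sketch below standardises features while avoiding data leakage: the split happens first, and the scaling statistics come from the training rows only. It uses plain pandas on synthetic data with hypothetical column names (`age`, `income`); it is not taken from the notebooks themselves.

```python
import numpy as np
import pandas as pd

# Synthetic, hypothetical data
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "age": rng.normal(40, 10, 200),
    "income": rng.normal(50_000, 15_000, 200),
})

# 1. Split BEFORE computing any statistics
train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)

# 2. Learn the scaling parameters from the training rows only
mu = train.mean()
sigma = train.std(ddof=0)

# 3. Apply the same parameters to both sets
train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma

print(train_scaled.mean().round(6))
print(test_scaled.mean().round(3))
```

The training columns end up with mean 0 and standard deviation 1 by construction. The test columns will be close to that but not exact, which is expected: the test set was never consulted when the parameters were computed.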