Data Engineer
-
Medisolv
Jul 2023 - present
Oversee the ingestion and transformation of patient data (10B+ records/week, 150TB+ data lake) for hundreds of hospital clients
- Created a Python CLI wrapper around Databricks Asset Bundles to programmatically generate Databricks Workflows, enabling a 20x runtime improvement and $100k+ yearly savings through task-specific cluster tuning.
- Designed our central Git repository to store Databricks jobs, tests, and utilities. Trained 12 analysts and engineers on software development best practices, enabling faster iteration, regression testing, and improved process visibility for management.
- Led the migration of 1,500+ Azure Data Factory (ADF) pipelines to a source-controlled IaC implementation. Developed tests to validate our pipelines before they were published to production, reducing our error rate by 34%.
- Developed a Spark metrics parser using Plotly Dash to guide data-driven transformation rewrites, saving thousands of dollars in wasted compute time and storage access costs.
Worked for NewYork Quality Care - the ACO of NewYork-Presbyterian, Weill Cornell, and Columbia
- Managed the calculation, tracking, and reporting of quality metrics, leading to $20M+ in savings
- Built a weekly analytics ELT pipeline (100M+ encounter records for 35k Medicare patients)
- Led the adoption of geographic analysis by designing a custom address-cleaning and geolocation workflow
- Developed data cleaning helper functions in Python and R used by 9 other analysts
- Reduced Tableau loading times from minutes to seconds by designing composable data models for our team
Data Engineer (contract)
-
UTHealth
May 2020 - Nov 2023
Built and solely maintained the database powering the UTHealth COVID-19 dashboard
- Created and maintained a daily Texas COVID-19 data pipeline from state and third-party sources
- Scraped web data with Python (REST APIs, Beautiful Soup, Selenium)
- Developed a monitoring Slack bot and unit tests to ensure consistent data quality
- Authored thesis (100+ citations) on biomarkers of traumatic brain injury (TBI) and provided data support for other research efforts
- Applied variable selection on hundreds of biomarker combinations to identify TBI predictors
- Built analysis pipelines and created publication-ready data visualizations in R