Predictive Maintenance is the strategy of using Data Analytics to predict when equipment will need maintenance before it fails, saving on repair costs and downtime. I explored the field through a synthetic dataset, the AI4I 2020 Predictive Maintenance Dataset (real datasets for this problem are rare due to corporate secrecy), and built ML Models to predict failure.
The project covers the standard Data Analysis workflow, Exploratory Data Analysis, Decision Trees, AdaBoost Classifiers, Working with Unbalanced Data, ML Model Tuning, presenting findings to a large audience, and more.
From my work, I achieved a 0.76 F1 Score for Machine Failure and a 0.77 AUC-PR from a Tuned Random Forest Model, a strong result relative to other work on the Dataset. If I were to revisit this exploration, I'd try more computationally expensive models and look for real-world datasets, especially ones with time series data.
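For readers unfamiliar with why F1 and AUC-PR are reported here instead of accuracy, the following is a minimal sketch in plain Python. The labels and scores are invented toy values, not taken from the AI4I dataset or from the project's notebook, and the function names are hypothetical. It shows two predictors with identical accuracy on unbalanced labels but very different F1, along with average precision, one common step-wise summary of the area under the precision-recall curve.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (1 = failure)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def average_precision(y_true, scores):
    """Mean of the precision measured at the rank of each true positive.

    Samples are ranked by descending score; ties are broken arbitrarily.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Invented toy labels: 18 healthy machines (0) and 2 failures (1).
y_true = [0] * 18 + [1, 1]

# Predictor A never flags a failure.
never_fails = [0] * 20

# Predictor B raises one false alarm, catches one failure, misses the other.
flags_some = [1] + [0] * 17 + [1, 0]

print("accuracy A:", accuracy(y_true, never_fails))
print("accuracy B:", accuracy(y_true, flags_some))
print("P/R/F1 A:", precision_recall_f1(y_true, never_fails))
print("P/R/F1 B:", precision_recall_f1(y_true, flags_some))
```

Both toy predictors are right on 18 of 20 machines, yet only the second one ever detects a failure, which is the difference F1 is designed to expose.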
My Jupyter Notebook and a Slide Deck of my Results are viewable below.
Major League Baseball has become a leader in professional sports analytics, offering detailed datasets from advanced tracking systems. Very recently, Bat Tracking and Batting Stance data were made publicly available, opening new ways to analyze how hitters perform.
This project explores the use of this new data to answer three major questions: (1) whether a batter's stance meaningfully affects performance, (2) whether we can predict offensive outcomes using stance, swing, and decision-making characteristics, and (3) which of these factors are most impactful toward offensive success.
A wide range of Batting Stance and Swing Characteristics were used alongside offensive outcome metrics to build predictive models. Exploratory Data Analysis found stance variables (where one stands in the batter's box, how wide their stance is, etc.) to have little direct correlation with performance, while swing characteristics (swing speed, whiff percent, etc.) proved much more predictive. Using Random Forests, XGBoost, Ridge, and Lasso Regression, models were trained to predict offensive outcomes like OPS, Strikeout Rate, and Home Run Rate, with Lasso Regression achieving R² scores as high as 0.87 for some targets.
Key findings showed that Barrel Rate and Whiff Percentage are among the strongest predictors of offensive success, highlighting the critical roles of swing power and contact precision. Meanwhile, stance characteristics alone were largely ineffective predictors, suggesting hitter training should prioritize swing quality over pre-pitch positioning.
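One reason Lasso Regression suits the question of which factors matter is that its L1 penalty can shrink weak coefficients to exactly zero. The sketch below is a from-scratch coordinate-descent Lasso in plain Python, written only to illustrate that behavior; it is not the project's code, and the two features and all numbers are invented toy values (the names `strong_feature` and `weak_feature` are hypothetical). It assumes the features and target are already centered, so no intercept is fitted.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_fit(X, y, lam, n_iter=100):
    """Coordinate descent for (1/2n) * ||y - Xw||^2 + lam * ||w||_1.

    X is a list of rows; columns and y are assumed to be centered.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        for j in range(d):
            # Correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for row, target in zip(X, y):
                pred_without_j = sum(w[k] * row[k] for k in range(d) if k != j)
                rho += row[j] * (target - pred_without_j)
            rho /= n
            z = sum(row[j] ** 2 for row in X) / n
            w[j] = soft_threshold(rho, lam) / z if z else 0.0
    return w


def r_squared(y, preds):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, preds))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1 - ss_res / ss_tot


# Invented, centered toy features (not real batting data).
strong_feature = [-2.0, -1.0, 0.0, 1.0, 2.0]
weak_feature = [1.0, -1.0, 0.0, -1.0, 1.0]
X = [list(row) for row in zip(strong_feature, weak_feature)]

# Toy target: mostly driven by the strong feature, barely by the weak one.
y = [3.0 * s + 0.2 * k for s, k in zip(strong_feature, weak_feature)]

weights = lasso_fit(X, y, lam=0.5)
preds = [sum(w * x for w, x in zip(weights, row)) for row in X]
print("weights:", weights)
print("R^2:", r_squared(y, preds))
```

In this toy setup the weak feature's coefficient is driven to exactly zero while the fit remains close, which is the kind of behavior that makes Lasso coefficients readable as a rough feature-selection signal.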
The project includes full Machine Learning pipelines, Cross Validation, Hyperparameter Tuning, and Feature Importance analysis to back these conclusions. The project's Jupyter Notebook is found below.
Bat Tracking Data, Visualized by MLB.com
Batting Stance Data, Visualized on Baseball Savant
xQc's Twitch Chat
Northernlion's Twitch Chat
Twitch.tv is a live-streaming platform perhaps best known for its Twitch Chat. Shaped by 3rd party plugins and years of culture found nowhere else, streamers' Twitch Chats are distinctive centers for community and interaction.
This project explores the potential of Classification Models to predict which streamer's Twitch Chat a given message, no longer than 255 characters, belongs to. The project gradually increases the complexity of the problem and tackles the individual challenges along the way. Naive Bayes Classifiers, Network Visualizations, Data Mining and Scraping, Text Vectorization, and many more techniques are used.
With a Simple Naive Bayes Binary Classifier as a Proof of Concept, 95% Accuracy was obtained when classifying between chats of Low Cosine Similarity, and 70% Accuracy between chats of High Similarity. For the larger problem of Multi-class Classification across 50 Streamers, Naive Bayes gave 30% Accuracy when considering a single message, and 90% when considering 13 messages or more. Using Similarity Metrics, a Network Diagram was made to Visualize the Similarities and Communities that exist between chats on the platform.
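The jump in accuracy when several messages are considered together can be illustrated with a small multinomial Naive Bayes written in plain Python. This is a sketch under stated assumptions, not the project's implementation: the streamer labels and messages are invented, the class name `TinyNaiveBayes` is hypothetical, and pooling messages by summing their token log-likelihoods is one plausible way to combine them (the summary above does not say how the project did it).

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.message_counts = Counter()          # label -> number of messages
        self.vocab = set()

    def fit(self, messages, labels):
        for text, label in zip(messages, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.message_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def log_scores(self, text):
        """Unnormalized log posterior for each label; unseen tokens are skipped."""
        total_messages = sum(self.message_counts.values())
        tokens = [t for t in text.lower().split() if t in self.vocab]
        scores = {}
        for label, counts in self.word_counts.items():
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(self.message_counts[label] / total_messages)
            for token in tokens:
                score += math.log((counts[token] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)

    def predict_batch(self, texts):
        """Classify several messages assumed to come from the same chat.

        Joining the messages applies the prior once and sums every token's
        log-likelihood, which matches the Naive Bayes independence assumption.
        """
        return self.predict(" ".join(texts))


# Invented toy training data for two hypothetical chats.
train_messages = [
    "KEKW that was insane", "KEKW KEKW", "insane clip KEKW",
    "LUL nice pun", "that pun LUL", "LUL LUL nice",
]
train_labels = ["streamer_a"] * 3 + ["streamer_b"] * 3

model = TinyNaiveBayes().fit(train_messages, train_labels)

# A single generic message carries little evidence on its own...
print(model.predict("that was"))
# ...while pooling it with two more messages from the same chat adds evidence.
print(model.predict_batch(["that was", "nice", "LUL"]))
```

Each extra message contributes more tokens to the same sum of log-likelihoods, so chat-specific vocabulary has more chances to outweigh generic words.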
There is significant room for improvement following this proof of concept. Random sampling across multiple VODs would decrease bias, a larger similarity graph visualization would be very interesting, and more complex, computationally expensive ML models would likely improve accuracy.
The project's Jupyter Notebook is found below.
I have working experience with Bokeh, Matplotlib, and Tableau, and familiarity with most other data visualization packages. The following Data Visualization Examples come from Personal Projects and my work in Northeastern's IE6600 Computation and Visualization. Function has been prioritized over form in these examples, though more uniquely attractive visualization techniques are something I'm interested in exploring further. More examples are available upon request.
I graduated from Northeastern University with a Master's Degree in Data Analytics Engineering in May 2025, with a GPA of 3.71. In this graduate program, I took the following classes:
IE6400 Foundations of Data Analytics
EMGT5220 Engineering Project Management
IE6600 Computation and Visualization
IE6700 Data Management for Analytics
IE7275 Data Mining in Engineering
CS5800 Algorithms
DS5220 Supervised Machine Learning
DS5230 Unsupervised Machine Learning
My Undergraduate Degree from Northeastern University was in Mechanical Engineering, with a minor in Computer Science. Classes from Undergrad relevant to Data and Computer Science include:
CS2500/2510 Fundamentals of Computer Science 1 and 2
CS3200 Database Design
CS3520 Programming in C++
My classwork has given me both a strong theoretical background and thorough, real-world project experience. Example Classwork is available upon request.