Python Machine Learning Capstone
Choose your own structured dataset (e.g., housing prices, customer churn, or loan default) and build a machine learning pipeline from scratch, covering data cleaning, feature engineering, model selection, and performance evaluation. Then put together a presentation highlighting your process, tools, and insights.
Learn more about the Python machine learning capstone project deliverables.
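As a rough illustration of what such a pipeline can look like, here is a minimal sketch using scikit-learn. It assumes a churn-style dataset; the column names (`tenure_months`, `monthly_charge`, `plan`, `churned`) and the synthetic data are hypothetical stand-ins, and logistic regression is only one of many models you could choose.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for a churn dataset; column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "tenure_months": rng.integers(1, 72, size=n).astype(float),
    "monthly_charge": rng.normal(70, 20, size=n),
    "plan": rng.choice(["basic", "plus", "premium"], size=n),
})
# Make churn more likely for short-tenure customers, with some noise.
df["churned"] = ((df["tenure_months"] + rng.normal(0, 10, size=n)) < 24).astype(int)
# Blank out a few values so the cleaning step has something to do.
df.loc[df.index[::25], "monthly_charge"] = np.nan

numeric_cols = ["tenure_months", "monthly_charge"]
categorical_cols = ["plan"]

# Data cleaning and feature engineering: impute and scale numeric columns,
# one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Preprocessing and model are chained so both are refit inside each CV fold.
model = Pipeline([
    ("preprocess", preprocess),
    ("classify", LogisticRegression(max_iter=1000)),
])

X = df[numeric_cols + categorical_cols]
y = df["churned"]

# Performance evaluation: 5-fold cross-validated accuracy.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"Mean CV accuracy: {scores.mean():.2f}")
```

Keeping preprocessing inside the pipeline means imputation and scaling statistics are learned only from each training fold, which avoids leaking information from the held-out data.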
Python for AI Capstone
Capstone Project I: Build an AI chat assistant for a live website that answers users' questions about the site's products and services.
Capstone Project II: Create a web app that allows users to upload images of personal collections — such as vintage books, vinyl records, rare sneakers, collectible cards, or antiques — and uses AI to identify the item, generate descriptive metadata, and log it in a searchable session history.
Learn more about the Python for AI capstone project deliverables.
Python Data Visualization Capstone
Analyze global CO₂ emissions alongside GDP and population data. You’ll clean, explore, and visualize the data, then build an interactive dashboard in Dash. Your final presentation should highlight key patterns, tools used, and insights discovered.
Learn more about the Python data visualization capstone project deliverables.
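The cleaning and merging step might look something like the sketch below, written with pandas. The three small tables, their column names, and every value in them are made-up placeholders rather than real emissions, GDP, or population figures; the units are assumptions chosen so the derived columns come out in convenient terms.

```python
import pandas as pd

# Tiny made-up tables standing in for the real sources; all values are placeholders.
co2 = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2000, 2010, 2000, 2010],
    "co2_mt": [100.0, 120.0, 50.0, None],  # megatonnes; one value missing
})
gdp = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2000, 2010, 2000, 2010],
    "gdp_usd_bn": [500.0, 800.0, 200.0, 300.0],  # billions of USD
})
pop = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2000, 2010, 2000, 2010],
    "population_m": [10.0, 12.0, 5.0, 6.0],  # millions of people
})

# Join the three sources on country and year.
keys = ["country", "year"]
merged = co2.merge(gdp, on=keys, how="inner").merge(pop, on=keys, how="inner")

# Drop rows without an emissions figure rather than guessing a value.
merged = merged.dropna(subset=["co2_mt"]).copy()

# Megatonnes divided by millions of people gives tonnes per person.
merged["co2_t_per_capita"] = merged["co2_mt"] / merged["population_m"]
# Megatonnes divided by billions of USD gives kilograms of CO2 per USD.
merged["co2_kg_per_usd"] = merged["co2_mt"] / merged["gdp_usd_bn"]

print(merged)
```

A tidy frame like `merged`, with one row per country and year, is a convenient shape to filter and plot from inside a dashboard.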
Student Project: Pratik B.
LeBron’s Scoring
This project aimed to predict how many points LeBron James would score in a game from a combination of his personal stats and his opponents’ defensive statistics. Game data was scraped from the NBA website and Basketball Reference, cleaned, and merged into a single dataset. The student explored multiple regression models from scikit-learn and found that the RANSAC regressor performed best by filtering out outliers. The project identified a negative correlation between LeBron’s scoring and opposing-team defensive metrics such as DEFRTG (defensive rating) and DREB% (defensive rebound percentage).
Concepts Covered:
- Web scraping with Selenium and CSV import
- Feature engineering (e.g., transforming minutes played)
- Exploratory analysis and correlation thinking
- Model selection using regression models from sklearn
- Train/test split for evaluation
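To make a few of these concepts concrete, here is a small standard-library sketch. It assumes that "transforming minutes played" means converting an `MM:SS` string into decimal minutes, which is a guess at the student's approach; the game rows are invented for illustration and are not LeBron's actual statistics.

```python
import random


def minutes_to_float(mp: str) -> float:
    """Convert an 'MM:SS' minutes-played string into decimal minutes."""
    minutes, seconds = mp.split(":")
    return int(minutes) + int(seconds) / 60


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Invented rows of (minutes played, points); not real game data.
games = [
    ("38:30", 31), ("35:15", 27), ("40:00", 35),
    ("29:45", 18), ("36:00", 28), ("33:30", 24),
]

# Feature engineering: turn the string column into a numeric feature.
minutes = [minutes_to_float(mp) for mp, _ in games]
points = [pts for _, pts in games]

# Correlation thinking: how strongly do minutes and points move together?
r = pearson(minutes, points)
print(f"Correlation between minutes and points: {r:.2f}")

# Train/test split: shuffle with a fixed seed, then hold out the last 20%.
rows = list(zip(minutes, points))
random.Random(0).shuffle(rows)
cut = int(len(rows) * 0.8)
train, test = rows[:cut], rows[cut:]
print(f"{len(train)} training rows, {len(test)} test rows")
```

In practice, pandas and scikit-learn's `train_test_split` would handle the same steps with less code; the manual version is shown only to expose what each step does.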
Output & Findings:
- The RANSAC regressor yielded the best results, and the analysis showed a strong negative correlation between LeBron’s points and the opposing team’s defensive stats.
- Recognized that combining features that push the prediction in opposite directions (e.g., minutes played vs. opponent defense) may have confused some models.
- Suggested future improvements, such as selecting more logically consistent features.
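The sketch below shows why a RANSAC regressor can beat ordinary least squares when a dataset contains outliers. It uses scikit-learn's `RANSACRegressor` on synthetic data with a known slope of 2; the data, the `residual_threshold` value, and the seed are assumptions for the demo and have nothing to do with the student's actual dataset or settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

# Synthetic data: y = 2x + 1 plus small noise, with a known true slope of 2.
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(0, 0.5, size=100)

# Corrupt the last ten points to simulate outlier games.
y[-10:] += 30.0

# Ordinary least squares uses every point, so the outliers drag the line.
ols = LinearRegression().fit(X, y)

# RANSAC repeatedly fits on random subsets and keeps the model with the most
# inliers (points within residual_threshold of the fitted line).
ransac = RANSACRegressor(residual_threshold=3.0, random_state=0).fit(X, y)

ols_slope = ols.coef_[0]
ransac_slope = ransac.estimator_.coef_[0]
n_inliers = int(ransac.inlier_mask_.sum())

print(f"OLS slope:    {ols_slope:.2f}")
print(f"RANSAC slope: {ransac_slope:.2f}")
print(f"Points kept as inliers: {n_inliers} of {len(y)}")
```

Because the final RANSAC model is refit only on the points it flags as inliers, its slope should stay close to the true value while the plain linear fit is pulled toward the corrupted points.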
Student Project: Mariely D.
Diabetes EDA, Hypothesis, and Prediction
Using a publicly available CDC survey dataset from Kaggle, this project explored factors associated with diabetes and attempted to predict diabetes status. The analysis was driven by clear hypotheses regarding age, BMI, cholesterol, physical activity, and gender. Visual comparisons between diabetic and non-diabetic individuals highlighted meaningful differences in health outcomes and lifestyle. A logistic regression model achieved 74% accuracy in predicting diabetes, and the project concluded that BMI and other health metrics are strong indicators of diabetic risk.
Concepts Covered:
- Exploratory Data Analysis (EDA) and hypothesis framing
- Correlation analysis between features (e.g., BMI, cholesterol) and diabetes status
- Visualizations: bar charts comparing diabetics vs. non-diabetics
- Predictive modeling using logistic regression
- Confusion matrix and accuracy assessment
Output & Findings:
- Key correlates of diabetes included high blood pressure, BMI, poor general health, and difficulty walking.
- The logistic regression model achieved 74% accuracy.
- Provided strong visual and narrative support for health-related insights, reinforcing the role of lifestyle and physical health in diabetes risk.
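For readers unfamiliar with the modeling and evaluation steps, here is a minimal sketch of logistic regression with a confusion matrix and accuracy score in scikit-learn. The data is synthetic: the two features (`bmi`, `high_bp`) echo the project's themes, but the coefficients used to generate labels are invented and are not estimates from the CDC survey data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic features; the label-generating coefficients below are invented.
rng = np.random.default_rng(42)
n = 500
bmi = rng.normal(28, 5, size=n)
high_bp = rng.integers(0, 2, size=n)
logit = 0.15 * (bmi - 28) + 1.0 * high_bp - 0.5
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
X = np.column_stack([bmi, high_bp])

# Hold out 25% of rows, keeping the class balance similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes, in the order [0, 1].
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
acc = accuracy_score(y_test, y_pred)

print(cm)
print(f"Accuracy: {acc:.2f}")
```

Accuracy is simply the diagonal of the confusion matrix divided by the total, `(tn + tp) / cm.sum()`. Looking at `fp` and `fn` separately shows which kind of mistake the model makes more often, which a single accuracy figure hides.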
Student Project: Amit J. & Valentina P.
MTA Subway Data Analysis
This project analyzed subway turnstile data from 10 stations across New York City to examine commuter flow patterns throughout the week. The primary objective was to distinguish weekday vs. weekend traffic and understand how passenger volume varied by station and time. A regression-based classification model was applied and reportedly achieved 83% accuracy in identifying weekday patterns. The study aimed to support strategic planning for workforce allocation in the MTA system, particularly around rush hour and location-based trends.
Concepts Covered:
- Real-world time series data analysis
- Classification based on temporal and geographic features
- Regional sampling (Uptown, Midtown, Downtown)
- Application of a regression-based model for weekday/weekend classification
- Interpretation of accuracy and use of confusion matrix
Output & Findings:
- The model (type unspecified) reportedly achieved 83% accuracy in distinguishing weekdays from weekends.
- Noted significant differences in commuter flow between weekdays and weekends.
- Proposed that this analysis could inform MTA scheduling and workforce planning, especially around rush hour.
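The first step in an analysis like this is labeling each day as a weekday or weekend and comparing traffic between the two. The standard-library sketch below illustrates that step only. The station names, base volumes, and day-of-week multipliers are all made up, and it does not reproduce the students' classification model, whose type is unspecified.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean

# Made-up traffic pattern: a multiplier for each day, Monday (0) to Sunday (6).
day_multiplier = [1.0, 1.05, 1.1, 1.05, 0.95, 0.5, 0.4]
# Hypothetical stations with invented base daily entry counts.
stations = {"Uptown station": 8000, "Midtown station": 20000}

# Build two weeks of synthetic daily records starting on a Monday.
start = date(2024, 1, 1)
records = []
for offset in range(14):
    day = start + timedelta(days=offset)
    for station, base in stations.items():
        entries = round(base * day_multiplier[day.weekday()])
        records.append({"station": station, "date": day, "entries": entries})


def day_type(d: date) -> str:
    """Label a date as 'weekend' (Saturday or Sunday) or 'weekday'."""
    return "weekend" if d.weekday() >= 5 else "weekday"


# Group daily entries by station and day type, then average each group.
grouped = defaultdict(list)
for rec in records:
    grouped[(rec["station"], day_type(rec["date"]))].append(rec["entries"])
averages = {key: mean(values) for key, values in grouped.items()}

for (station, kind), avg in sorted(averages.items()):
    print(f"{station:16s} {kind:8s} {avg:>8.0f}")
```

The `day_type` label is what a classifier would be trained to predict from features such as station and entry volume; the per-group averages give a quick check of whether the two classes are separable at all.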