Data Analytics Capstone Projects

Machine Learning Capstone Project Description:

Choose your own structured dataset (e.g., housing prices, customer churn, or loan default) and build a machine learning pipeline from scratch, covering data cleaning, feature engineering, model selection, and performance evaluation. Put together a presentation highlighting your process, tools, and insights.

Deliverables:

  1. Select and Explore a Structured Dataset
    • Choose a publicly available dataset (e.g., from Kaggle or UCI) relevant to a classification or regression problem; perform initial exploration to understand data structure and context.
  2. Clean and Engineer Features
    • Handle missing values, encode categorical data, and create meaningful new features that may improve model performance.
  3. Train and Evaluate Machine Learning Models
    • Split the data into training and test sets, fit at least one appropriate model (e.g., logistic regression, decision tree, random forest), and evaluate it on the held-out data using a metric suited to the task, such as accuracy for classification or RMSE for regression.
  4. Visualize Patterns and Results
    • Create clear visualizations (e.g., correlation heatmaps, prediction vs actual plots) that illustrate relationships in the data and support your findings.
  5. Presentation
    • Deliver a final presentation explaining your problem statement, approach, tools used (e.g., pandas, scikit-learn, matplotlib), patterns discovered, model results, and key takeaways.
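
The deliverables above can be sketched end to end in a few lines. The example below is only an illustration, not a required approach: the churn-style dataset is generated on the spot, and the column names (tenure_months, monthly_charge, plan, churned) and the engineered feature are hypothetical. It assumes pandas, NumPy, and scikit-learn are installed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical churn-style data, generated here so the sketch is self-contained.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "tenure_months": rng.integers(1, 72, size=n).astype(float),
    "monthly_charge": rng.normal(70, 20, size=n),
    "plan": rng.choice(["basic", "plus", "premium"], size=n),
})
# Synthetic target: short-tenure, high-charge customers churn more often.
score = 0.03 * df["monthly_charge"] - 0.05 * df["tenure_months"] + rng.normal(0, 1, n)
df["churned"] = (score > score.median()).astype(int)
# Blank out some values to mimic real-world missing data.
df.loc[rng.choice(n, 30, replace=False), "monthly_charge"] = np.nan

# Feature engineering: one new (hypothetical) ratio feature.
df["charge_per_tenure_month"] = df["monthly_charge"] / df["tenure_months"]

numeric = ["tenure_months", "monthly_charge", "charge_per_tenure_month"]
categorical = ["plan"]
X = df[numeric + categorical]
y = df["churned"]

# Cleaning and encoding live inside the pipeline so they are fit on training data only.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hold out 25% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Held-out accuracy: {accuracy:.2f}")
```

Keeping imputation, scaling, and encoding inside the pipeline is one way to avoid leaking information from the test set into training. For a regression problem, the classifier and metric would be swapped for a regressor and an error measure such as RMSE.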

Python Data Visualization Capstone Project Description:

Analyze global CO₂ emissions alongside GDP and population data. You’ll clean, explore, and visualize the data, then build an interactive dashboard in Dash. Your final presentation should highlight key patterns, tools used, and insights discovered.

Deliverables:

  1. Dataset Used: Use a publicly available dataset (e.g., from Our World in Data) on global CO₂ emissions by country and year, along with GDP and population data for context.
  2. Exploratory Data Analysis (EDA): Clean and preprocess the data to handle missing values and inconsistent formats. Use correlation analysis and group-by techniques to understand trends and relationships between emissions, GDP, and population over time.
  3. Visualizations: Create insightful charts such as time series plots of emissions by continent, correlation heatmaps between GDP and emissions, and a bar chart of the top-emitting countries. Use matplotlib, seaborn, and plotly so that you practice both static and interactive charts.
  4. Dashboard Implementation: Build a responsive, interactive dashboard using Dash, where users can filter data by region, select time ranges, and visualize the top emitters or GDP/emission ratios over time.
  5. Findings and Patterns: Present observations such as which countries have decoupled GDP growth from emissions, regional emission trends, or anomalies. Emphasize the tools and techniques used (pandas, Plotly, Dash, correlation analysis, apply() and lambda functions, etc.) and how they were used to derive insights.
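
As a rough illustration of the EDA techniques named above (cleaning, group-by, correlation analysis, apply() with a lambda), here is a minimal pandas sketch. The countries, continents, and numbers are made-up toy values, and the column names (country, continent, year, co2, gdp, population) are assumptions; a real dataset may use different names and units.

```python
import pandas as pd

# Toy data with made-up values; one emissions value is deliberately missing.
data = pd.DataFrame({
    "country": ["Alpha"] * 3 + ["Beta"] * 3 + ["Gamma"] * 3,
    "continent": ["North"] * 6 + ["South"] * 3,
    "year": [2000, 2010, 2020] * 3,
    "co2": [100.0, 110.0, 90.0, 50.0, 80.0, 120.0, 20.0, None, 30.0],
    "gdp": [1000.0, 1300.0, 1600.0, 400.0, 700.0, 1100.0, 200.0, 250.0, 320.0],
    "population": [10.0, 11.0, 12.0, 20.0, 22.0, 25.0, 5.0, 5.5, 6.0],
})

# Cleaning: fill gaps by linear interpolation within each country's own series.
data = data.sort_values(["country", "year"])
data["co2"] = data.groupby("country")["co2"].transform(lambda s: s.interpolate())

# Group-by: total emissions per continent and year (input for a time series plot).
by_continent = data.groupby(["continent", "year"], as_index=False)["co2"].sum()

# Correlation analysis: pairwise correlations (input for a heatmap).
corr = data[["co2", "gdp", "population"]].corr()

# apply() with a lambda: emissions per unit of GDP for each row.
data["co2_per_gdp"] = data.apply(lambda row: row["co2"] / row["gdp"], axis=1)

# Decoupling check: GDP rose while emissions fell between first and last year.
first_last = data.groupby("country").agg(
    co2_first=("co2", "first"), co2_last=("co2", "last"),
    gdp_first=("gdp", "first"), gdp_last=("gdp", "last"),
)
first_last["decoupled"] = (
    (first_last["gdp_last"] > first_last["gdp_first"])
    & (first_last["co2_last"] < first_last["co2_first"])
)

# Top emitters in the most recent year (input for a bar chart).
latest = data[data["year"] == data["year"].max()]
top_emitters = latest.nlargest(2, "co2")[["country", "co2"]]

print(by_continent)
print(first_last[["decoupled"]])
print(top_emitters)
```

The resulting frames (by_continent, corr, top_emitters) are in a shape that matplotlib, seaborn, or plotly can chart directly, and the same frames could back the filters of a Dash dashboard.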

Data Analytics Cumulative Capstone Project Description (Optional):

Use historical Citi Bike trip data* to identify patterns in urban transportation across NYC. Explore how usage varies by time, location, and weather, and develop insights to inform operational or policy decisions.

*Citi Bike is only an example; you may choose your own dataset.

Deliverables: 

  1. Gather & Prepare Data
    • Download Citi Bike data for the most recent 12-month period. Join it with relevant external data sources (e.g., weather, borough population) using SQL or Python, and clean and normalize the dataset for analysis.
  2. Perform Exploratory Data Analysis 
    • Analyze trends in trip duration, popular start and end stations, usage by time of day and day of week, and user demographics. Use Excel, Python (pandas, seaborn), or SQL queries to generate summary statistics and initial visuals.
  3. Build Geospatial and Time-Based Visuals
    • Use Tableau or Python to map high-traffic areas, station-to-station flows, and time series of ride volume. Focus on identifying usage peaks, gaps, and imbalances across neighborhoods.
  4. Create a Predictive or Insight Model
    • Use Python (scikit-learn) to build a simple model (e.g., regression to estimate demand from time, location, and weather, or clustering to group similar usage patterns). Explain your feature selection and how well the model performs.
  5. Develop a Final Presentation
    • Prepare a presentation (slides or dashboard) explaining your findings, visualizations, and methodology. Clearly communicate patterns, anomalies, and recommendations, and describe the tools you used at each step. You should discuss the strengths and limitations of your analysis, along with potential steps to enhance it further.
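
For the data preparation and exploration steps above, the join with an external source might look roughly like the pandas sketch below. The trips and weather values are made up, and the column names (started_at, ended_at, start_station, date, precip_mm) are assumptions for illustration; the real files may be laid out differently.

```python
import pandas as pd

# Toy trip records with made-up values.
trips = pd.DataFrame({
    "started_at": pd.to_datetime([
        "2024-06-01 08:05", "2024-06-01 08:40", "2024-06-01 17:15",
        "2024-06-02 08:20", "2024-06-02 17:50", "2024-06-02 17:55",
    ]),
    "ended_at": pd.to_datetime([
        "2024-06-01 08:20", "2024-06-01 09:05", "2024-06-01 17:45",
        "2024-06-02 08:30", "2024-06-02 18:10", "2024-06-02 18:25",
    ]),
    "start_station": ["A", "A", "B", "A", "B", "A"],
})

# Toy daily weather table (the external data source in this sketch).
weather = pd.DataFrame({
    "date": pd.to_datetime(["2024-06-01", "2024-06-02"]),
    "precip_mm": [0.0, 12.5],
})

# Derive analysis columns: trip duration, calendar date, and hour of day.
trips["duration_min"] = (trips["ended_at"] - trips["started_at"]).dt.total_seconds() / 60
trips["date"] = trips["started_at"].dt.normalize()
trips["hour"] = trips["started_at"].dt.hour

# Left join keeps every trip; validate raises if a date has duplicate weather rows.
joined = trips.merge(weather, on="date", how="left", validate="many_to_one")
joined["rainy"] = joined["precip_mm"] > 0

# Usage by date and hour: ride counts and mean duration.
hourly = joined.groupby(["date", "hour"], as_index=False).agg(
    rides=("started_at", "count"),
    mean_duration_min=("duration_min", "mean"),
)

# Most popular start station and ride counts on dry vs. rainy days.
busiest_station = joined["start_station"].value_counts().idxmax()
rides_by_rain = joined.groupby("rainy").size()

print(hourly)
print(busiest_station)
print(rides_by_rain)
```

A table like hourly, with weather columns attached, is also a reasonable starting point for the demand model in step 4.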