Choose a structured dataset (e.g., housing prices, customer churn, or loan default) and build a machine learning pipeline from scratch, covering data cleaning, feature engineering, model selection, and performance evaluation. Then prepare a presentation that highlights your process, tools, and insights.
Deliverables:
- Select and Explore a Structured Dataset
- Choose a publicly available dataset (e.g., from Kaggle or UCI) relevant to a classification or regression problem; perform initial exploration to understand data structure and context.
- Clean and Engineer Features
- Handle missing values, encode categorical data, and create meaningful new features that may improve model performance.
- Train and Evaluate Machine Learning Models
- Split the data into training and test sets, fit at least one appropriate model (e.g., logistic regression, decision tree, random forest), and evaluate it on the held-out data using a metric suited to the task, such as accuracy for classification or RMSE for regression.
- Visualize Patterns and Results
- Create clear visualizations (e.g., correlation heatmaps, predicted vs. actual plots) that illustrate relationships in the data and support your findings.
- Presentation
- Deliver a final presentation explaining your problem statement, approach, tools used (e.g., pandas, scikit-learn, matplotlib), patterns discovered, model results, and key takeaways.
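
As a starting point, the cleaning, feature engineering, splitting, training, and evaluation steps above could be sketched roughly as follows. This is a minimal illustration, not a required solution: the dataset is a small synthetic churn table generated in code, and the column names (`tenure_months`, `monthly_charges`, `plan`, `churned`), the engineered feature `est_total_charges`, and the helper functions `make_toy_churn_data` and `build_and_evaluate` are all hypothetical. A real submission would load a public dataset instead and would likely compare several models.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def make_toy_churn_data(n_rows=200, seed=0):
    """Build a synthetic churn table (hypothetical columns, illustration only)."""
    rng = np.random.default_rng(seed)
    tenure = rng.integers(1, 60, size=n_rows).astype(float)
    monthly = rng.uniform(20, 120, size=n_rows)
    plan = rng.choice(["basic", "plus", "premium"], size=n_rows)
    # Make churn more likely for short tenure and high monthly charges.
    score = 0.03 * monthly - 0.08 * tenure + rng.normal(0, 1, size=n_rows)
    churned = (score > np.median(score)).astype(int)
    df = pd.DataFrame(
        {
            "tenure_months": tenure,
            "monthly_charges": monthly,
            "plan": plan,
            "churned": churned,
        }
    )
    # Blank out some values so the imputation step has something to do.
    missing_rows = rng.choice(n_rows, size=15, replace=False)
    df.loc[missing_rows, "monthly_charges"] = np.nan
    return df


def build_and_evaluate(df, target="churned", seed=0):
    """Engineer one feature, then fit and score a preprocessing + model pipeline."""
    df = df.copy()
    # Feature engineering: a rough lifetime-spend estimate (assumed to be useful).
    df["est_total_charges"] = df["tenure_months"] * df["monthly_charges"]

    X = df.drop(columns=[target])
    y = df[target]
    numeric_cols = X.select_dtypes(include="number").columns.tolist()
    categorical_cols = [c for c in X.columns if c not in numeric_cols]

    # Numeric columns: fill missing values with the median, then standardize.
    numeric_steps = Pipeline(
        [("impute", SimpleImputer(strategy="median")), ("scale", StandardScaler())]
    )
    # Categorical columns: one-hot encode (the toy data has no missing categories).
    preprocess = ColumnTransformer(
        [
            ("num", numeric_steps, numeric_cols),
            ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ]
    )
    model = Pipeline(
        [("preprocess", preprocess), ("clf", LogisticRegression(max_iter=1000))]
    )

    # Hold out 25% of the rows; preprocessing is fit on the training rows only.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    return model, accuracy


if __name__ == "__main__":
    data = make_toy_churn_data()
    fitted_model, acc = build_and_evaluate(data)
    print(f"Held-out accuracy: {acc:.3f}")
```

Wrapping preprocessing and the estimator in a single `Pipeline` keeps the imputation and scaling statistics from being computed on the test rows, which is one common way to avoid leakage. For a regression dataset, the same structure would apply with a regressor in place of `LogisticRegression` and RMSE in place of accuracy.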