Output: print the optimal model parameters and the corresponding results.
Background:
Kaggle hosted an open data science competition in 2019 titled “2019 Kaggle ML & DS Survey Challenge.” The purpose of this challenge was to “tell a data story about a subset of the data science community represented in this survey, through a combination of both narrative text and data exploration.” More information on the survey, the data, and the prizes can be found on:
Classification is a supervised machine learning approach used to assign a discrete value of one variable given the values of others. Many types of machine learning models can be trained for classification problems, such as logistic regression, decision trees, kNN, SVM, random forests, gradient-boosted decision trees and neural networks. In this assignment you are required to use the logistic regression algorithm, but feel free to experiment with other algorithms.
For the purposes of this assignment, any subset of data can be used for data exploration and for classification purposes. For example, you may focus only on one country, exclude features, or engineer new features. If a subset of data is chosen, it must contain at least 5000 training points. You must justify and explain why you are selecting a subset of the data, and how it may affect the model.
1) Produce a report in the form of an IPython Notebook detailing the analysis you performed to determine the best classifier (ordinary multi-class classification model) for the given data set. Your analysis must include the following steps: data cleaning, exploratory data analysis, feature selection (or model preparation), model implementation, model validation, model tuning, and discussion. When writing the report, make sure to explain, for each step, what it does, why it is important, and the pros and cons of that approach.
2) Create 5 slides in PowerPoint and PDF describing the findings from exploratory analysis, model feature importance, model results and visualizations.
5. Understand how to improve the performance of your model.
6. Improve on the skills and competencies required to collate and present domain-specific, evidence-based insights.
Most of the features in the survey data are categorical. Note that some values that appear “null” indicate that a survey respondent did not select that given option from a multiple-choice list. For example – “Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Twitter (data science influencers)”
For the data cleaning step, handle missing values however you see fit and justify your approach. Provide some insight on why you think the values are missing and how your approach might impact the overall analysis. Suggestions include filling the missing values with a certain value (e.g. mode for categorical data) and completely removing the features with missing values. Secondly, convert categorical features into numerical form (e.g. by one-hot encoding) so they can be used by the model.
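The cleaning step described above could be sketched as follows. This is a minimal illustration using invented column names (not the actual Kaggle survey columns): mostly-empty columns are dropped, remaining categorical gaps are filled with the mode, and categories are one-hot encoded.

```python
import pandas as pd

# Toy stand-in for the survey data; column names are hypothetical.
df = pd.DataFrame({
    "Q5_JobTitle": ["Data Scientist", None, "Analyst", None, "Analyst"],
    "Q24_Salary": ["0-10k", "10-20k", None, "10-20k", "0-10k"],
})

for col in list(df.columns):
    if df[col].isna().mean() > 0.9:          # mostly missing: drop the feature
        df = df.drop(columns=col)
    else:                                     # otherwise impute with the mode
        df[col] = df[col].fillna(df[col].mode()[0])

print(df.isna().sum().sum())                  # 0 missing values remain

# Convert a categorical feature to numerical columns via one-hot encoding.
encoded = pd.get_dummies(df, columns=["Q5_JobTitle"])
print(encoded.shape)                          # (5, 3)
```

Whether to impute or drop depends on *why* the values are missing, which is exactly the justification the assignment asks for.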
a. Present 3 graphical figures that represent trends in the data. How could these trends be used to help with the task of predicting yearly compensation or understanding the data? All graphs should be readable and presented in the notebook. All axes must be appropriately labelled.
b. Visualize the order of feature importance. Some possible methods include a correlation plot, or a similar method. Given the data, which of the original attributes in the data are most related to a survey respondent’s yearly compensation?
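A correlation-based feature ranking like the one suggested in part b could look like the sketch below. The data here is synthetic and the feature names are invented, purely to show the mechanics of sorting features by their absolute correlation with the target.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: salary driven mostly by years of experience.
rng = np.random.default_rng(0)
n = 200
years_exp = rng.integers(0, 20, n)
team_size = rng.integers(1, 50, n)            # unrelated noise feature
salary = 30_000 + 4_000 * years_exp + rng.normal(0, 5_000, n)

df = pd.DataFrame({"years_exp": years_exp,
                   "team_size": team_size,
                   "salary": salary})

# Absolute correlation of each feature with the target, sorted descending.
importance = df.corr()["salary"].drop("salary").abs().sort_values(ascending=False)
print(importance.index[0])                    # years_exp
```

For the real survey, the encoded categorical columns would be ranked the same way; a bar chart of `importance` makes a readable figure.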
The steps specified before are not in a set order.
Implement the logistic regression algorithm on the training data using 10-fold cross-validation. How does your model accuracy compare across the folds? What are the average and variance of accuracy across the folds? Treating each value of hyperparameter(s) as a new model, which model performed best? Give the reason based on the bias-variance trade-off. An output of your algorithm should be a probability of belonging to each of the salary buckets. Apply scaling/normalization of features, if necessary.
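The requirements above (10-fold cross-validation, per-fold accuracy statistics, per-class probabilities, and feature scaling) could be sketched with scikit-learn as follows. Synthetic data stands in for the encoded survey features and salary buckets.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: 3 classes playing the role of salary buckets.
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, n_classes=3, random_state=0)

# Scaling inside the pipeline keeps each CV fold free of leakage.
model = make_pipeline(StandardScaler(),
                      LogisticRegression(max_iter=1000))

scores = cross_val_score(model, X, y, cv=10)
print(f"mean accuracy: {scores.mean():.3f}, variance: {scores.var():.6f}")

# Per-class probabilities: one column per salary bucket.
model.fit(X, y)
proba = model.predict_proba(X[:5])
print(proba.shape)                            # (5, 3)
```

Each row of `proba` sums to 1, giving the required probability of belonging to each bucket.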
5. Model tuning (20 marks):
Insufficient discussion will lead to a deduction of marks.
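One common way to carry out the tuning step is a grid search over the regularization strength, treating each value of C as a separate model (as the implementation section requires). A minimal sketch, assuming scikit-learn and synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=6, n_classes=3, random_state=0)

pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])

# Each value of C is effectively a different model; GridSearchCV
# cross-validates all of them and retains the best.
grid = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10]}, cv=10)
grid.fit(X, y)
print(grid.best_params_)                      # the optimal model parameters
```

Small C means strong regularization (higher bias, lower variance); large C the reverse, which is the bias-variance framing the discussion should use.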
MIE 1624 Introduction to Data Science and Analytics – Assignment 1
Only Python within the notebook may be used to process the data files. For instance, using Microsoft Excel to clean the data is not allowed.
○Read the required data file from the same directory as your notebook on the CognitiveClass Virtual Lab – for example pd.read_csv("Kaggle_Salary.csv").
1. Submit via Quercus a Jupyter (IPython) notebook containing your implementation and motivation for all the steps of the analysis with the following naming convention:
lastname_studentnumber_assignment1.ipynb
1. A large portion of the marks is allocated to analysis and justification. Full marks will not be given for code alone.
2. Output must be shown and readable in the notebook. The only files that can be read into the notebook are the files posted in the assignment without modification. All work must be done within the notebook.
Tips:
1. You have a lot of freedom in how you approach each step and in which libraries or functions you use. As open-ended as the problem seems, the emphasis of the assignment is on your ability to explain the reasoning behind every step.