Supplementary Assignment ENN543, Data Analytics and Optimisation, Semester 2, 2019 Queensland University of Technology
Problem 1. Clustering. Bike share systems are becoming increasingly common in cities across the world, but their usage is highly variable and depends on factors such as local weather.
You have been provided with two months of data from the New York Bike Share system, covering one month in summer (Q1/JC-201707-citibike-tripdata.csv) and one month in winter (Q1/JC-201801-citibike-tripdata.csv). From the size of the files alone it is evident that there are substantially fewer trips in winter than in summer; however, it is unclear whether the actual pattern of use (i.e. the typical types of trips) is different.
Using this data and the clustering method of your choice, you are to attempt to answer the question: ‘aside from the overall number of trips, do usage patterns change from summer to winter?’. In doing this you should cluster the data using the following five dimensions:
Note that this means that clusters will contain 5 dimensions, and visualisation of clusters in a single 2D plot will not be possible.
Your answer should demonstrate and discuss how usage patterns are similar or dissimilar (depending on what you find), and should also consider different time periods (morning, afternoon, etc.) to better explore how the service is used.
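One way to make the seasonal comparison independent of the overall trip count is to compare the share of trips that falls into each cluster, optionally within each time period. The following is a minimal sketch in Python (not MATLAB), under stated assumptions: the period boundaries in `PERIODS`, the timestamp format, and the helper names `time_period`, `cluster_shares` and `share_difference` are all hypothetical and are not prescribed by the assignment. It assumes cluster labels have already been produced by a clustering method fitted on data from both months.

```python
from collections import Counter
from datetime import datetime

# Assumed period boundaries (start hour inclusive, end hour exclusive).
# These are illustrative only; choose and justify your own.
PERIODS = [("night", 0, 6), ("morning", 6, 12),
           ("afternoon", 12, 18), ("evening", 18, 24)]


def time_period(timestamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Map a timestamp string to one of the assumed period labels.

    The default format is an assumption about how start times are stored.
    """
    hour = datetime.strptime(timestamp, fmt).hour
    for name, start, end in PERIODS:
        if start <= hour < end:
            return name
    raise ValueError("hour outside 0-23: %r" % hour)


def cluster_shares(labels):
    """Fraction of trips assigned to each cluster.

    Using fractions rather than counts keeps a busy month and a quiet
    month comparable.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return {cluster: n / total for cluster, n in counts.items()}


def share_difference(shares_a, shares_b):
    """Total variation distance between two share distributions.

    0 means identical cluster usage; 1 means no cluster is shared.
    """
    clusters = set(shares_a) | set(shares_b)
    return 0.5 * sum(abs(shares_a.get(c, 0.0) - shares_b.get(c, 0.0))
                     for c in clusters)
```

Calling `share_difference(cluster_shares(summer_labels), cluster_shares(winter_labels))` for all trips, and again for the trips in each period, gives a single number per comparison that can support (but not replace) a discussion of what the clusters represent.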
Your answer should explain all decisions made when conducting the analysis, including details such as:
Problem 2. Classification. Software systems are complex, and errors in deployed software can be very costly and difficult to correct. In an effort to help detect faulty software, a number of metrics have been proposed that measure software complexity.
You have been provided with data (Q2/pc1.csv) which contains various code metrics for a number of software examples, as well as a flag to indicate if the software contains a fault or not. For clarity:
Using this data, you are to train a support vector machine (SVM) to separate defective software from error-free software. You are to report on the accuracy of the developed model, and on any problems or challenges that you encounter in developing the model. In doing this you should:
Please note that allowing MATLAB to optimise hyper-parameters in place of properly investigating parameter settings is not acceptable as a justification for hyper-parameter selection, though a grid search (which is a more systematic approach) will be accepted.
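A grid search in this sense evaluates every combination of candidate hyper-parameter values with cross-validation and records all the scores, so the final choice can be justified from the full results rather than from a single optimiser output. The following is a minimal sketch in Python (not MATLAB), under stated assumptions: the function names and the `evaluate` callback are hypothetical, and the callback stands in for whatever code trains an SVM on the training indices and returns a validation score.

```python
import itertools
import random


def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out across the folds in turn.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def grid_search(param_grid, labels, evaluate, k=5, seed=0):
    """Score every parameter combination with k-fold cross-validation.

    `evaluate(params, train_idx, val_idx)` is a hypothetical callback:
    it should train a model on train_idx with `params` and return a
    validation score on val_idx, where higher is better.
    Returns (best_params, best_score, results); `results` holds the
    mean score for every combination tried.
    """
    folds = stratified_folds(labels, k, seed)
    names = sorted(param_grid)
    results = []
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        scores = []
        for i, val_idx in enumerate(folds):
            train_idx = [idx for j, fold in enumerate(folds)
                         if j != i for idx in fold]
            scores.append(evaluate(params, train_idx, val_idx))
        results.append((params, sum(scores) / len(scores)))
    best_params, best_score = max(results, key=lambda r: r[1])
    return best_params, best_score, results
```

A call might look like `grid_search({"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, labels, evaluate)`, where `C` and `gamma` are example parameter names only. Keeping the folds stratified matters when one class is much rarer than the other, and plain accuracy may not be the most informative score in that case.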
Your answer should explain the choice of parameters in the final model, and discuss its performance.
Problem 3. Dimension Reduction and Classification. Recognising content in images can be a challenging problem due to the high dimensional nature of the input data. As such, dimension reduction methods can be used to reduce a problem space and make tasks more computationally feasible.
You have been provided with data (Q3/shvn test.mat) that shows images of single digits (0, 1, 2, 3, 4, 5, 6, 7, 8 and 9) of house numbers, extracted from Google street view data. Using this data you are to train classifiers (the type of classifier is up to you) to classify the observed digit in the image. Prior to classification, you are to reduce the data using:
That is, you should train two classifiers: one on data reduced using PCA, and one on data reduced using LDA. You are then to evaluate the two classifiers and compare their performance.
In completing this question you should:
Also note that due to memory constraints, it may not be possible to train the PCA or LDA space on all samples, and you may need to use only a subset of the data to compute the PCA and LDA transforms.
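If a subset is used, one option is to draw the same number of samples from each digit class, so that every class is represented when the transforms are computed (LDA in particular needs examples of every class). The following is a minimal sketch in Python (not MATLAB); the function name `balanced_subset` and the `per_class` parameter are hypothetical, and a plain random subset would be an equally valid choice if justified.

```python
import random


def balanced_subset(labels, per_class, seed=0):
    """Pick up to `per_class` sample indices for each class label.

    Returns a sorted list of indices into the full data set. A fixed
    seed keeps the subset reproducible between runs.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        # Take every sample if a class has fewer than per_class.
        chosen.extend(rng.sample(indices, min(per_class, len(indices))))
    return sorted(chosen)
```

The returned indices would select the rows used to compute the PCA and LDA transforms; the resulting projections can then be applied to all samples, including those left out of the subset.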
Your answer should explain any parameters chosen and decisions made (type of classifier, number of dimensions retained, etc.) in arriving at your solution, and discuss the performance of the two methods, relating this to what the two transforms (PCA and LDA) are seeking to achieve.