Stats 255 Optimisation And Assessment Answer

1. Breast Cancer Diagnosis

For this question we will be using the BreastCancer data set from the mlbench library. To load the data into R, simply load the library, and then type: data("BreastCancer", package = "mlbench"). Once the data is loaded, we recommend removing the Id attribute, since this has no value for classification: BreastCancer$Id = NULL. Information about the data set can be found by typing

?BreastCancer into the console.

(a) Set the seed of the random number generator to 100 (set.seed(100)), and then generate a training data set of 400 data points using the sample function (the remaining 299 data points will be the test set).

(b) Create pairs plots for both the training and test data sets, colouring each point based on the

Class of the data point. Comment on any observations.

(c) Using the rpart.control(...) arguments for rpart, set the termination criteria for generating the classification tree to be the max depth of the tree. (Disable any other termination criteria;

Generate three classification trees to classify the Class as each benign or malignant, using all the other attributes as predictor variables, varying the maxdepth termination criterion from 3 to 9 in steps of 3.
Visualise the tree with a maximum depth of 3.
For each tree, give the in-sample and out-of-sample confusion matrices.
Create a table specifying the accuracy of each model, both in-sample and out-of-sample.
Comment on and explain what you notice about the in-sample vs. out-of-sample accuracy.

(d) Set the termination criteria to be a max depth of 3 for the following question (i.e. set maxdepth=3, minsplit=1 and cp=0.)

By modifying the loss matrix, generate four classification trees (using all of the independent attributes), which range from having no false positives to no false negatives in the training data.
For each tree, give the in-sample and out-of-sample confusion matrices.
On a 2D scatterplot show the sensitivity vs. specificity of each classification model, include both the in-sample and out-of-sample values in different colours. (This plot can be generated in R, Matlab or Excel.)
Comment on and explain any observations about the in-sample vs out-of-sample perfor- mance seen in the plot.
Suppose that we are much more concerned about false negatives than false positives. Explain what this means, and then recommend which of the classification models (from above) that we should choose.

2. Character Recognition

Download the file letters.csv from Canvas and read it into R; this file contains a data set, in which each of the 20,000 data points corresponds to a digitised capital letter that is known (this is given by the lettr attribute in column 1). There are also 17 independent attributes (columns 2–18), that have been computed from the digital image of each letter (e.g. onpix, is a count of the number of pixels that are black in the image). If you are interested, further details can be found here: https://archive.ics.uci.edu/ml/datasets/Letter+Recognition.

(a) Set the seed of the random number generator to 50, and then generate a random training data set of 18000 data points using the sample function (the remaining 2000 data points will be the test set).

(b) Filter the data set to create a training set called letters.train and a test set called letters.test

(c) The question uses geom jitter() for ggplot(); this plot is just like a scatter plot, but ‘jitters’ the data so discrete values don’t overlap. Plot two jitter plots for the letters.test data set:

xbox vs. onpix with the points coloured by lettr; and
x2bar vs. y2bar with the points coloured by lettr.

Which of these pairs of attributes would be better for classifying the data? Explain why.

(d) Using the letters.train data set create three random forests, with ntree set to 10, 100 and 1000, to predict the lettr attribute, given the other 16 attributes. Note the following:

set the random seed to 100 immediately before each tree is constructed, and
report the OOB (out-of-bag / out-of-sample) estimate of error rate, for each tree.

(e) For the random forest with 1000 trees, apply the predict function to the test set.

Create a 26 × 26 confusion matrix based on these predictions.
Which letter is the letter P occasionally misclassified as?
What is the accuracy of this classifier on this test set?

We now wish to see if the Na¨ıve Bayes classifier can be also be used to predict the correct letters for the same data set.

(f) Apply the Na¨ıve Bayes method to the training data set to determine the class lettr using all the other attributes.

(g) Using the predict() function determine the in-sample and out-of-sample accuracy for this method. (R will report some warnings, but you can ignore them.)

(h) Show the confusion matrix for the out-of-sample predictions, above, and discuss this in compar- ison to the corresponding confusion matrix for the random forest.

Buy Stats 255 Optimisation And Assessment Answers Online

Talk to our expert to get the help with Stats 255 Optimisation And Assessment Answers to complete your assessment on time and boost your grades now

The main aim/motive of the management assignment help services is to get connect with a greater number of students, and effectively help, and support them in getting completing their assignments the students also get find this a wonderful opportunity where they could effectively learn more about their topics, as the experts also have the best team members with them in which all the members effectively support each other to get complete their diploma assignments. They complete the assessments of the students in an appropriate manner and deliver them back to the students before the due date of the assignment so that the students could timely submit this, and can score higher marks.Â The experts of the assignment help services at urgenthomework.com are so much skilled, capable, talented, and experienced in their field of programming homework help writing assignments, so, for this, they can effectively write the best economics assignment help services.

Get Online Support for Stats 255 Optimisation And Assessment Answer Assignment Help Online

Resources

24 x 7 Availability.
Trained and Certified Experts.
Deadline Guaranteed.
Plagiarism Free.
Privacy Guaranteed.
Free download.
Online help for all project.
Homework Help Services

); }

Not the Exact Question you were looking for ? Post your question for assignment help and get instant help on your homework and assignment questions from our experts