The spreadsheet titled ‘sports_and_recreation.xlsx’ details the location, types of sport played as well as the condition, age and other details about sports and recreational facilities in Victoria. You will see that it is far from being ready for analysis and needs to be ‘wrangled’. Additionally a few errors have been deliberately introduced so these will need to be corrected by initial analysis.
1. Explain why the data in its current form is not considered to be in ‘tidy’ format.
2. Write R code to read in the data, manipulate it and output the result to a single csv file having the following header row. Each row should provide the details for a single sport within a sports and recreation facility (ignore cases where the sport is unspecified). facility_ID,facility_name,street_no,street_name,street_type,suburb_town,postcode,LGA,latitude,long itude,sports_played,number_field_courts,field_surface_type,facility_age,facility_condition,facility_up grade_age
Your code will have the following sections (not necessarily in the order given and the process may be iterative as you find more things to do). Please include comments in the code to separate each segment and explain your steps.
a) Read the data into a dataframe (or tibble).
b) Observe the layout of the data and describe any issues you encounter in terms of missing or duplicated data. Make the necessary modification of rows for consistency.
c) Write a function that takes in a facility ID and outputs a dataframe with one or more rows, where each row is specific to the details of a single sport within a facility.
d) Apply the function to each facility name and then combine the rows to form a single dataframe.
e) Split the (latitude, longitude) attribute into two separate columns.
f) Split the address attribute into its components as listed above (note that not all addresses are complete so there may be more than one case to consider).
i. Note that the street number may have two words such as “Lot 1A” but a street number is assumed to have at least one digit character.
ii. Note that the street type may have two words such as “Street North”.
iii. Assume the suburb/town attribute is the part of the address in capitals before the postcode.
g) Include the above header row for the dataframe.
h) Do a summary of the dataframe to look for unusual values, then correct them until satisfied with the result.
i) Sort the dataframe by facility_ID, then sports_played.
j) Write the result to a csv file.
3. How many facilities offer softball with at least 5 fields?
The online hospitality company Airbnb has made publicly available a number of datasets
Write R code to answer the following.
2. How many listings contain the following combinations of words (upper or lower case or mixed, either order) in the name column?
a. Spacious, cosy
b. Vibrant, bright
3. How many listings had last review date in the past 14 days up to and including January 6 2018?
4. How many host names (not host ids) are mentioned between 10 and 20 times (inclusive) in the listings?
5. List the five top neighbourhoods sorted in decreasing order by average number of reviews. In a third column show the average value of availability_365 for that neighbourhood. (1 mark)
6. Write a function that has listing id as input and calculates the (direct) distance in kilometres between the location of the listing and the Melbourne CBD with latitude/longitude coordinates (-37.8136°, 144.9631°). Look up and use the haversine formula to perform this using an earth radius of 6370km (you may use an R package). Hence determine the number of listings that are between 5 and 10km from the Melbourne CBD.
7. Suppose somebody wants to choose a listing based on the following criteria. Write one or more functions that inputs a listing id and calculates a score that is the sum of points as below:
a. Points for the location: 50 × (10 minus haversine formula distance in kilometres to the nearest table tennis facility in km) but not less than zero (this uses the sports_played and location attributes in the cleaned dataset from Part 1)
b. Points for price: (300 minus price) but not less than zero
c. Points for the room type: 100 for Entire home/apt, 50 for Private room, 0 for Shared room
d. Points for availability: (availability_365) divided by 5
e. Points by popularity based on the number of reviews: 100 if at least in the top quartile, 0 if less than the median value, 50 otherwise Which id has the highest score according to the above system?
Write a short report describing the two (processed) datasets from parts 1 and 2 through tables and plots with R including the following:
• A summary of the variables using the summarytools package. Include at least two observations you feel are worth noting.
• A histogram showing the distribution of the distance of listings from the Melbourne CBD (using the result from Part 2). Use a bin width of 1km.
• A single map combining a subset of the data from each of the two datasets (e.g. it could be in a single LGA). You will need to use an R package that can map geospatial data.
• Choose one sport and do a visualisation of your choice including one or more of the following variables from the dataset of part 2.