Key skills to be assessed
This assignment aims to assess your skills in:
– Loading the data
– Cleansing the data
– Visualisation / Reporting
You will be given a dataset and a set of problem statements.
Where possible, you are required to implement the solution in both SQL (using Impala) and Spark (using PySpark and resilient distributed datasets, RDDs). If you cannot supply both solutions for a given problem, you must carefully explain your reasons.
You will follow a typical data analysis process:
The data for this assignment will be provided as a MySQL dump, which you will need to load into a MySQL server and then transfer into Hadoop in text format using Sqoop.
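The load step above can be sketched as follows. This is only an illustration: the host, database, table names, credentials, and HDFS paths are all assumptions, and the exact Sqoop flags depend on your cluster setup.

```python
# Sketch of the load step: restore the MySQL dump (done in a shell),
# then build the Sqoop import command that pulls the table into HDFS
# as plain text. All names and paths below are placeholders.
import shlex

DB_HOST = "localhost"          # assumed MySQL host
DB_NAME = "twitter_football"   # assumed database name from the dump
TABLE = "tweets"               # assumed table name

# Step 1 (run in a shell on the MySQL host):
#   mysql -u <user> -p twitter_football < dump.sql

# Step 2: Sqoop import into HDFS in text format, as the brief requires.
sqoop_cmd = [
    "sqoop", "import",
    "--connect", f"jdbc:mysql://{DB_HOST}/{DB_NAME}",
    "--username", "student",                 # assumed credentials
    "--password-file", "/user/student/.mysql.pw",
    "--table", TABLE,
    "--as-textfile",                         # plain-text output
    "--target-dir", f"/user/student/{TABLE}",
]
print(shlex.join(sqoop_cmd))
# On an edge node you would run this command directly in a shell, or
# via subprocess.run(sqoop_cmd, check=True).
```

Printing the command rather than executing it lets you verify the flags before touching the cluster.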
For the cleansing, preparation and analysis you will implement the solution twice (where possible): first in SQL using Impala, and then in Spark using PySpark.
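As a rough illustration of what the PySpark half of the cleansing step might look like, the sketch below defines parsing and filtering helpers as plain Python functions and shows (in comments) how they would plug into an RDD pipeline. The tab-separated field layout (timestamp, user, text) and the hashtag are assumptions; adjust them to the columns Sqoop actually exported.

```python
# Sketch of tweet cleansing, assuming each exported row is
# tab-separated as: timestamp, user, text.
from datetime import datetime

def parse_line(line):
    """Split one exported row into (timestamp, user, text); None if malformed."""
    parts = line.split("\t")
    if len(parts) != 3:
        return None
    try:
        ts = datetime.strptime(parts[0], "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None          # drop rows with unparseable timestamps
    return (ts, parts[1], parts[2])

def mentions_hashtag(record, hashtag):
    """True if the tweet text contains the game's official hashtag."""
    return hashtag.lower() in record[2].lower()

# On the cluster, the same functions drop straight into an RDD pipeline:
#   rdd = sc.textFile("/user/student/tweets")
#   clean = (rdd.map(parse_line)
#               .filter(lambda r: r is not None)
#               .filter(lambda r: mentions_hashtag(r, "#MUFC")))
```

Keeping the lambda bodies as named functions makes them unit-testable without a running Spark cluster.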
For the visualisation of the results you are to use Python’s matplotlib.
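A minimal matplotlib sketch of the reporting step is shown below. The tweets-per-minute figures are made up purely for illustration; in the real report they would come out of your Impala or Spark analysis.

```python
# Sketch of a tweets-per-minute bar chart with matplotlib.
# The data below is dummy data for illustration only.
import matplotlib
matplotlib.use("Agg")          # headless backend, e.g. on a cluster edge node
import matplotlib.pyplot as plt

minutes = list(range(10))                           # minutes since kick-off
counts = [12, 15, 14, 40, 90, 35, 20, 18, 17, 16]   # dummy tweet volumes

fig, ax = plt.subplots()
ax.bar(minutes, counts)
ax.set_xlabel("Minutes since kick-off")
ax.set_ylabel("Tweets per minute")
ax.set_title("Tweet volume around a goal (illustrative data)")
fig.savefig("tweet_volume.png")
```

The `Agg` backend writes the chart straight to a PNG, which is convenient when working over SSH without a display.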
You will be given a dataset containing simplified Twitter data pertaining to football games. The dataset will be supplied in compressed format and will be made available online for download or can be supplied by USB memory stick. Further information regarding each game, including the teams playing and their official hashtags, start and end times, as well as the times of any goals, will also be provided.
You are a data analyst / data scientist working for an event security company that monitors real-time events to assess the level of potential disturbance. To assess commotion at an event, the company monitors the Twitter feeds pertaining to it. They would like answers to the following questions (throughout, you should treat half time and overtime as ‘during-game’).
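The ‘during-game’ rule above can be captured in one small helper: a tweet counts as during-game if its timestamp falls between kick-off and the final whistle, so half time and any overtime are automatically included. This is a sketch under the assumption that all timestamps are `datetime` objects in the same timezone.

```python
# Sketch of the pre/during/post-game classification rule.
# Assumes kick-off and final-whistle times come from the supplied
# game metadata and share a timezone with the tweet timestamps.
from datetime import datetime

def classify_tweet(ts, kickoff, final_whistle):
    """Label a tweet as 'pre-game', 'during-game', or 'post-game'."""
    if ts < kickoff:
        return "pre-game"
    if ts <= final_whistle:      # spans half time and overtime too
        return "during-game"
    return "post-game"
```

The same predicate can be reused both as a Spark `map` function and, rewritten as a `CASE WHEN`, in the Impala SQL solution.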
Questions / problem statements:
A 4000–5000-word report that documents your solution.