Analysis of Event Data

Key skills to be assessed

This assignment aims at assessing your skills in:

  • The usage of common big data tools and techniques
  • Your ability to implement a standard data analysis process:
      – Loading the data
      – Cleansing the data
      – Analysis
      – Visualisation / Reporting
  • Use of Python, SQL and Linux terminal commands

Task

You will be given a dataset and a set of problem statements.

Where possible, you are required to implement the solution in both SQL (using Impala) and Spark (using pyspark and resilient distributed datasets (RDDs)). If you do not supply both solutions for a problem, you will need to carefully explain why.

General instructions

You will follow a typical data analysis process:

  1. Load / ingest the data to be analysed
  2. Prepare / clean the data
  3. Analyse the data
  4. Visualise results / generate report

The data necessary for this assignment will be provided in MySQL dump format. You will need to load the dump into a MySQL server and then use Sqoop to import the data into Hadoop in text format.
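As a rough illustration of the ingest step, the sketch below assembles a Sqoop import command as a Python argument list. The JDBC URL, user name, table name and HDFS target directory are all hypothetical placeholders, not values from this brief, and the tab delimiter is an assumption that should be checked against the actual tweet text.

```python
def build_sqoop_import(jdbc_url, username, table, target_dir):
    """Return the argument list for a Sqoop text-format import of one table.

    All four arguments are placeholders; substitute the values for your own
    MySQL server and HDFS layout.
    """
    return [
        "sqoop", "import",
        "--connect", jdbc_url,           # JDBC URL of the MySQL database
        "--username", username,
        "-P",                            # prompt for the password at run time
        "--table", table,                # MySQL table to import
        "--target-dir", target_dir,      # HDFS directory for the output files
        "--as-textfile",                 # plain text output, as the brief requires
        "--fields-terminated-by", "\t",  # assumed delimiter: tab
        "-m", "1",                       # a single mapper, so no split column is needed
    ]


# Hypothetical example values, for illustration only.
cmd = build_sqoop_import(
    "jdbc:mysql://localhost/football",
    "analyst",
    "tweets",
    "/user/analyst/tweets",
)
print(" ".join(cmd))
```

On a machine where Sqoop is installed, the list could be passed to `subprocess.run(cmd, check=True)`. The same arguments can equally be typed directly at the Linux terminal.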

For the cleansing, preparation and analysis, you will implement the solution twice (where possible): first in SQL using Impala, and then in Spark using pyspark.

For the visualisation of the results you are to use Python’s matplotlib.

The data

You will be given a dataset containing simplified Twitter data pertaining to football games. The dataset will be supplied in compressed format and will be made available online for download or can be supplied by USB memory stick. Further information regarding each game, including the teams playing and their official hashtags, start and end times, as well as the times of any goals, will also be provided.

Problem statements

You are a data analyst / data scientist working for an event security company that monitors real-time events to assess the level of potential disturbance. To assess commotion at an event, the company monitors the Twitter feeds pertaining to that event. The company would like answers to the following questions. In all of them, you should treat half time and overtime as ‘during-game’.
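One possible reading of ‘during-game’ is sketched below in plain Python. It assumes that the supplied start and end times of a game already span half time and any overtime, so a tweet counts as ‘during-game’ whenever its timestamp lies between them. The function name and the sample times are hypothetical.

```python
from datetime import datetime, timedelta


def is_during_game(tweet_time, game_start, game_end):
    """Return True if the tweet falls between kick-off and the final whistle.

    Assumes game_start and game_end cover the whole match, including
    half time and overtime, so no interval inside the match is excluded.
    """
    return game_start <= tweet_time <= game_end


# Made-up times for illustration only.
kick_off = datetime(2000, 1, 1, 15, 0)
final_whistle = kick_off + timedelta(minutes=110)

half_time_tweet = kick_off + timedelta(minutes=50)
pre_match_tweet = kick_off - timedelta(minutes=5)

print(is_during_game(half_time_tweet, kick_off, final_whistle))  # True
print(is_during_game(pre_match_tweet, kick_off, final_whistle))  # False
```

The same predicate maps onto a `WHERE ... BETWEEN` condition in Impala or a `filter` on an RDD.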

Questions / problem statements:

  1. Extract and present the average number of tweets per ‘during-game’ minute for the top 10 (i.e. most tweeted about during the event) games.
  2. Rank the games according to number of distinct users tweeting ‘during-game’ and present the information for the top 10 games, including the number of distinct users for each.
  3. Find the top 3 teams that played in the most games. Rank their games in order of highest number of ‘during-game’ tweets (include the frequency in your output).
  4. Find the top 10 (ordered by number of tweets) games which have the highest ‘during-game’ tweeting spike in the last 10 minutes of the game.
  5. As well as the official hashtags, each tweet may be labelled with other hashtags. Restricting the data to ‘during-game’ tweets, list the top 10 most common non-official hashtags over the whole dataset with their frequencies.
  6. Find the top 10 games with the highest ‘during-game’ tweeting spikes that are not within 10 minutes (+/-) of any goal.
  7. Draw the graph of the progress of one of the games (the game you choose should have a complete set of tweets for the entire duration of the game). It may be useful to summarize the tweet frequencies in 1-minute intervals.
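Several of the questions above rest on the same building block: counting tweets in 1-minute intervals for a single game. The sketch below shows that logic in plain Python rather than Impala or pyspark. The function names are hypothetical, and treating a ‘spike’ as the busiest single minute is an assumption, since the brief does not define the term.

```python
from collections import Counter
from datetime import datetime, timedelta


def tweets_per_minute(tweet_times, game_start, game_end):
    """Count tweets in each 1-minute interval of the game.

    Bins are half-open, [start, start + 1 min), so each tweet falls in
    exactly one bin; a tweet stamped exactly at game_end is left out.
    """
    n_minutes = int((game_end - game_start).total_seconds() // 60)
    counts = Counter(
        int((t - game_start).total_seconds() // 60)
        for t in tweet_times
        if game_start <= t < game_end
    )
    return [counts.get(minute, 0) for minute in range(n_minutes)]


def average_per_minute(per_minute):
    """Average number of tweets per during-game minute (question 1)."""
    return sum(per_minute) / len(per_minute) if per_minute else 0.0


def last_ten_minute_peak(per_minute):
    """Busiest single minute within the last 10 minutes (assumed 'spike')."""
    return max(per_minute[-10:], default=0)


# Made-up 12-minute "game" with tweet offsets in seconds, for illustration.
game_start = datetime(2000, 1, 1, 15, 0)
game_end = game_start + timedelta(minutes=12)
offsets = [-30, 10, 20, 65, 130, 665, 690, 719, 720]
tweet_times = [game_start + timedelta(seconds=s) for s in offsets]

per_minute = tweets_per_minute(tweet_times, game_start, game_end)
print(per_minute)
print(average_per_minute(per_minute))
print(last_ten_minute_peak(per_minute))
```

The per-minute list is also the natural input for the graph in question 7, for example as the y-values of a matplotlib line plot with the minute index on the x-axis.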

Report

A 4000-5000-word report that documents your solution.