Big Data Management Assignment 1: Batch Layer

Description

In this assignment, your task is to prepare the batch layer (the off-line processing pipeline) of a lambda architecture that will enable us to perform social media analyses. You will use the MovieTweetings dataset (https://github.com/sidooms/MovieTweetings), a snapshot of tweets that report the IMDb activity of users of the IMDb mobile app. Each tweet contains fields describing the tweet itself, the user who tweeted, hashtags, URLs, the reactions of other users to the tweet, and other metadata.

Files

Below are files containing subsets of the MovieTweetings dataset:

Small - https://s3-eu-west-1.amazonaws.com/jwasilewski/twitter.small
Medium - https://s3-eu-west-1.amazonaws.com/jwasilewski/twitter.medium

You can download them using the wget command from the terminal. The small file can be used for prototyping, but the analysis should be done on the medium file.
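Once a file is downloaded, it has to be parsed before any flow can run. The sketch below assumes that each file holds one JSON-encoded tweet per line; the assignment does not state the file format, so check this against the actual data. The helper name parse_tweets is hypothetical.

```python
import json
from typing import Iterable, Iterator


def parse_tweets(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per line, skipping blank lines and invalid JSON."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate truncated or corrupt lines
        if isinstance(record, dict):
            yield record


# Made-up sample lines standing in for the contents of twitter.small.
sample = [
    '{"id": 1, "lang": "en"}',
    '',
    '{"id": 2, "lang": "es"',  # truncated line, will be skipped
    '{"id": 3, "lang": "es"}',
]
tweets = list(parse_tweets(sample))
```

If the one-tweet-per-line assumption holds, Spark can read the file directly with spark.read.json(path), and a function like this is only needed when working from raw text lines in an RDD.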

Assignment

Your task is to prepare code that could be run in the batch layer of a system and that will help perform the following analyses:

Tweets-oriented analyses (20%):
- number of tweets per day, per month, per year,
- number of tweets with interactions (favourites + retweets) per day, per month, per year.

For these, you should create at most two processing flows, choosing the granularity that best accommodates all of the requirements above.
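One way to read the granularity requirement is to aggregate once at day level and derive the monthly and yearly figures from the daily ones. The sketch below illustrates that idea in plain Python. It assumes the standard Twitter field names created_at, favorite_count and retweet_count and the usual Twitter timestamp layout, and it interprets "tweets with interactions" as tweets that have at least one favourite or retweet; none of this is confirmed by the assignment text.

```python
from collections import Counter
from datetime import datetime

# Assumed Twitter timestamp layout, e.g. "Mon Feb 25 18:11:01 +0000 2013".
TWITTER_TIME = "%a %b %d %H:%M:%S %z %Y"


def day_key(tweet):
    """Return (year, month, day) for a tweet."""
    ts = datetime.strptime(tweet["created_at"], TWITTER_TIME)
    return (ts.year, ts.month, ts.day)


def has_interactions(tweet):
    """True if the tweet has at least one favourite or retweet."""
    favourites = tweet.get("favorite_count") or 0
    retweets = tweet.get("retweet_count") or 0
    return favourites + retweets > 0


def daily_counts(tweets):
    """Count all tweets and tweets with interactions per day."""
    all_tweets, interacted = Counter(), Counter()
    for tweet in tweets:
        key = day_key(tweet)
        all_tweets[key] += 1
        if has_interactions(tweet):
            interacted[key] += 1
    return all_tweets, interacted


def roll_up(daily, depth):
    """Sum daily counts by key prefix: depth=2 gives months, depth=1 years."""
    rolled = Counter()
    for key, n in daily.items():
        rolled[key[:depth]] += n
    return rolled


# Made-up sample tweets.
tweets = [
    {"created_at": "Mon Feb 25 18:11:01 +0000 2013", "favorite_count": 0, "retweet_count": 2},
    {"created_at": "Mon Feb 25 19:00:00 +0000 2013", "favorite_count": 0, "retweet_count": 0},
    {"created_at": "Fri Mar 01 08:30:00 +0000 2013", "favorite_count": 1, "retweet_count": 0},
]
per_day, per_day_interacted = daily_counts(tweets)
per_month = roll_up(per_day, 2)
per_year = roll_up(per_day, 1)
```

In Spark, day_key could serve as the key function of an RDD flow (for example, mapping each tweet to (day_key(tweet), 1) and calling reduceByKey), with the month and year tables computed from the stored daily result rather than from the raw tweets.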

Movies-oriented analyses (30%):

- number of tweets about every single movie, per day and per month,
- amount of interactions a movie receives (favourites + retweets),
- popularity of each movie among different language-speakers.

For these, you should create three processing flows.
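The three movie flows can be sketched as three keyed aggregations over the same stream of tweets. The example below is an illustration only: it assumes that a movie can be identified by the IMDb title id found in the tweet's entities.urls[].expanded_url field, that interactions live in favorite_count and retweet_count, and that the language is in lang. The assignment does not say how movies are identified, so verify this against the data.

```python
import re
from collections import Counter
from datetime import datetime

TWITTER_TIME = "%a %b %d %H:%M:%S %z %Y"  # assumed timestamp layout
IMDB_ID = re.compile(r"imdb\.com/title/(tt\d+)")  # assumed URL pattern


def movie_id(tweet):
    """Return the first IMDb title id found in the tweet's URLs, or None."""
    for url in tweet.get("entities", {}).get("urls", []):
        match = IMDB_ID.search(url.get("expanded_url") or "")
        if match:
            return match.group(1)
    return None


def movie_flows(tweets):
    """Build the three aggregations: per movie and day, interactions, per language."""
    per_movie_day, interactions, per_movie_lang = Counter(), Counter(), Counter()
    for tweet in tweets:
        movie = movie_id(tweet)
        if movie is None:
            continue  # tweet does not reference a movie
        ts = datetime.strptime(tweet["created_at"], TWITTER_TIME)
        per_movie_day[(movie, ts.year, ts.month, ts.day)] += 1
        interactions[movie] += (tweet.get("favorite_count") or 0) + (tweet.get("retweet_count") or 0)
        per_movie_lang[(movie, tweet.get("lang"))] += 1
    return per_movie_day, interactions, per_movie_lang


def make(created, url, lang, favs, rts):
    """Hypothetical builder for made-up sample tweets."""
    return {"created_at": created, "lang": lang, "favorite_count": favs,
            "retweet_count": rts, "entities": {"urls": [{"expanded_url": url}]}}


tweets = [
    make("Mon Feb 25 18:11:01 +0000 2013", "http://www.imdb.com/title/tt0133093", "en", 1, 2),
    make("Mon Feb 25 20:00:00 +0000 2013", "http://www.imdb.com/title/tt0133093", "es", 0, 0),
    make("Fri Mar 01 08:30:00 +0000 2013", "http://www.imdb.com/title/tt0111161", "es", 0, 1),
    make("Fri Mar 01 09:00:00 +0000 2013", "http://example.com/no-movie", "en", 5, 5),
]
per_movie_day, interactions, per_movie_lang = movie_flows(tweets)

# The monthly table is derived from the daily one by dropping the day.
per_movie_month = Counter()
for (movie, year, month, _day), n in per_movie_day.items():
    per_movie_month[(movie, year, month)] += n
```

Each Counter corresponds to one flow; in Spark the same keys could be used with groupBy on a DataFrame or reduceByKey on an RDD.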

Users-oriented analyses (25%):
- number of followers, favourites, statuses and listings of all users.

Each tweet stores the user's information as it was when the tweet was posted, so the values may differ between tweets of the same user. You should retrieve both the oldest and the newest information: for a user with 10 tweets, return the number of followers, favourites, statuses and listings from the oldest tweet and from the newest tweet, in separate columns. Users should be identified by their screen name.
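The oldest/newest requirement can be expressed as a merge function that keeps the earliest and the latest snapshot for each screen name. The following is a minimal sketch, assuming the standard Twitter user fields followers_count, favourites_count, statuses_count and listed_count and the usual created_at layout; the function names are hypothetical.

```python
from datetime import datetime

TWITTER_TIME = "%a %b %d %H:%M:%S %z %Y"  # assumed timestamp layout
# Assumed field names inside the tweet's "user" object.
STATS = ("followers_count", "favourites_count", "statuses_count", "listed_count")


def user_record(tweet):
    """Map a tweet to (screen_name, (oldest, newest)); both start as this tweet."""
    user = tweet["user"]
    ts = datetime.strptime(tweet["created_at"], TWITTER_TIME)
    snapshot = (ts, tuple(user.get(name) for name in STATS))
    return user["screen_name"], (snapshot, snapshot)


def merge(a, b):
    """Keep the earliest 'oldest' and the latest 'newest' of two pairs."""
    oldest = min(a[0], b[0], key=lambda s: s[0])
    newest = max(a[1], b[1], key=lambda s: s[0])
    return oldest, newest


def oldest_and_newest(tweets):
    """Reduce all tweets to one (oldest, newest) pair per screen name."""
    by_user = {}
    for name, pair in map(user_record, tweets):
        by_user[name] = merge(by_user[name], pair) if name in by_user else pair
    return by_user


def make(created, name, followers, favourites, statuses, listed):
    """Hypothetical builder for made-up sample tweets."""
    return {"created_at": created,
            "user": {"screen_name": name, "followers_count": followers,
                     "favourites_count": favourites, "statuses_count": statuses,
                     "listed_count": listed}}


tweets = [
    make("Mon Feb 25 18:11:01 +0000 2013", "alice", 18, 2, 120, 0),
    make("Tue Jan 01 10:00:00 +0000 2013", "alice", 10, 1, 100, 0),
    make("Fri Mar 01 08:30:00 +0000 2013", "alice", 25, 4, 130, 1),
    make("Fri Mar 01 09:00:00 +0000 2013", "bob", 7, 0, 3, 0),
]
users = oldest_and_newest(tweets)
```

Because merge is associative and commutative, the same pair of functions could be used in an RDD flow as rdd.map(user_record).reduceByKey(merge), after which the two snapshots are unpacked into separate columns.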

You should show processing using all of the following:

- Spark SQL,
- Spark DataFrame,
- Spark RDDs.

Not meeting this requirement will result in a 20% penalty for each one that is missing.

Using the results of your off-line processing, answer the following questions:

- What are the top 20 most popular movies? (5%)
- In which month did we collect the most interactions? (5%)
- What is the most popular movie in the group of Spanish-speaking users (check for "es" in the lang field)? (5%)
- Which users have the largest change in the number of followers between their first and last tweets? (10%)
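Questions of this kind reduce to ranking the output of a flow. The sketch below shows two such rankings on made-up data. It assumes that "most popular" means the highest tweet count and that the follower change is measured as an absolute difference; the assignment does not define either, so state your own interpretation in the notebook.

```python
import heapq


def top_n(counts, n=20):
    """Return the n (key, value) pairs with the largest values, largest first."""
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])


def follower_changes(users):
    """Map screen_name -> absolute change between oldest and newest follower counts."""
    return {name: abs(newest - oldest) for name, (oldest, newest) in users.items()}


# Made-up flow outputs: tweets per movie, and (oldest, newest) followers per user.
tweets_per_movie = {"tt0133093": 42, "tt0111161": 57, "tt0068646": 13}
followers = {"alice": (10, 25), "bob": (7, 7), "carol": (300, 120)}

most_popular = top_n(tweets_per_movie, n=2)
biggest_changes = top_n(follower_changes(followers), n=2)
```

The Spanish-language question follows the same pattern after filtering the per-language flow to rows whose lang value is "es".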

Submission Format

Submit a package containing a Jupyter Notebook with all the code needed to create the batch layer and the code that uses its results to answer the questions. Every code cell should be followed by a description of the code. The notebook should be exported as an IPython Notebook with the *.ipynb extension. If the code in your notebook does not run, it might result in a 20% penalty.

Additionally, create a PDF document showing the top 20 records of every data flow that you created. Do not include code in this document. Also, place the answers to the questions in it.
