In this assignment, your task is to prepare the batch layer (off-line processing pipeline) of the lambda architecture that will enable us to perform social media analyses. You will be using the MovieTweetings dataset (https://github.com/sidooms/MovieTweetings), a snapshot of tweets about the IMDb activity of users who use its mobile app. Each tweet contains a number of fields that store information about the tweet itself, the user who tweeted, hashtags, URLs, the reactions of other users to the tweet, and other metadata.
Below are files containing a subset of the MovieTweetings dataset:
You can download them with the wget command from the terminal. The smaller datasets can be used for prototyping, but the analysis should be done on the medium file.
Your task is to prepare code that could be run in the batch layer of a system and that will help perform the following analyses:
For these, you should create at most two processing flows, with the best granularity to accommodate the requirements of the flows.
For these, you should create three processing flows.
Each tweet stores the user's information as it was when the tweet was posted, so this information may differ from tweet to tweet. You should retrieve both the oldest and the newest information. For example, if a user has 10 tweets, you should return the number of followers, favourites, statuses and listings from the oldest tweet and from the newest tweet, in separate columns. Users should be identified by their screen name.
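The oldest/newest requirement above can be prototyped in plain Python before moving it into the batch layer. The sketch below is only an illustration under assumptions: it assumes each tweet is a dictionary that follows the Twitter API v1.1 layout (`created_at`, and a nested `user` object with `screen_name`, `followers_count`, `favourites_count`, `statuses_count` and `listed_count`). The function name `oldest_newest_user_info` and the output column names are hypothetical, so adjust them to the fields actually present in the files.

```python
from datetime import datetime

# Assumed field names (Twitter API v1.1 style); check them against the real data.
USER_FIELDS = ("followers_count", "favourites_count", "statuses_count", "listed_count")
# Assumed timestamp layout, e.g. "Wed Oct 10 20:19:24 +0000 2018".
TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def oldest_newest_user_info(tweets):
    """Return {screen_name: row}, where each row holds the user counters
    taken from that user's oldest tweet and newest tweet in separate columns."""
    oldest, newest = {}, {}  # screen_name -> (timestamp, user dict)
    for tweet in tweets:
        ts = datetime.strptime(tweet["created_at"], TIME_FORMAT)
        user = tweet["user"]
        name = user["screen_name"]
        # Keep the earliest and the latest snapshot seen so far for this user.
        if name not in oldest or ts < oldest[name][0]:
            oldest[name] = (ts, user)
        if name not in newest or ts > newest[name][0]:
            newest[name] = (ts, user)

    rows = {}
    for name in oldest:
        row = {}
        for field in USER_FIELDS:
            row["oldest_" + field] = oldest[name][1].get(field)
            row["newest_" + field] = newest[name][1].get(field)
        rows[name] = row
    return rows
```

A user with a single tweet gets the same values in the `oldest_` and `newest_` columns. In a distributed framework the same idea would typically be expressed as a per-user aggregation keyed by screen name rather than a Python loop.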
You should show processing using all of the following:
Not meeting this requirement will result in a 20% penalty for each missing item.
Using the results of your off-line processing, answer the following questions:
Submit a package containing a Jupyter Notebook with all the code needed to create the batch layer, together with the code that uses its results to answer the questions. Every code cell should be followed by a description of the code. The notebook should be exported as an IPython Notebook with the *.ipynb extension. If the code in your notebook does not run, this may result in a 20% penalty.
Additionally, create a PDF document showing the top 20 records of every data flow that you created; do not include code. Also place the answers to the questions in that document.