Start with only worker nodes and a subset of the data, then upgrade to test the full part.

Overview

Part I. Hello World from Cluster (10 points)

Trial review.json Data (just for getting familiar with format)

Full review.json Data

Example: spark-submit a3_p2_lastname_id.py 'hdfs:/data/review.json'

Output: Place checkpoint output into: a3_p2_<lastname>_<id>_OUTPUT.txt

From there, filter to users associated with at least 5 distinct items.
Extract the following target users along with their business_id ratings into a broadcast variable named "target_users".

qOdmye8UQdqloVNE059PkQ

^ target_users subject to change until 4/18 (updated on 4/18)
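
The filtering and extraction step above can be sketched in plain Python (no Spark) so the logic is easy to check locally. This is only an illustration under assumptions: the field names user_id and stars are assumed to match the review.json records, the helper names are hypothetical, and a repeated review of the same business simply overwrites the earlier one. On the cluster, the resulting dictionary would be the value placed in the "target_users" broadcast variable.

```python
import json
from collections import defaultdict


def ratings_by_user(lines):
    """Parse JSON review lines into {user_id: {business_id: stars}}."""
    by_user = defaultdict(dict)
    for line in lines:
        review = json.loads(line)
        # Assumed field names; a later review of the same business
        # overwrites an earlier one (an assumption, not a requirement).
        by_user[review["user_id"]][review["business_id"]] = review["stars"]
    return dict(by_user)


def filter_active_users(by_user, min_items=5):
    """Keep only users who rated at least min_items distinct businesses."""
    return {u: r for u, r in by_user.items() if len(r) >= min_items}


def extract_targets(by_user, target_ids):
    """Collect the ratings of the target users that survived filtering."""
    return {u: by_user[u] for u in target_ids if u in by_user}
```

In a Spark job the same grouping would typically be expressed with RDD or DataFrame operations, with the dictionary returned by extract_targets passed to the broadcast call.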

For each of the target users, find up to 50 most similar neighbors (i.e. other users).
Neighbors must:
(a) have rated at least two of the same businesses as the target user
(b) have a positive, non-zero similarity with the target user
Use the cosine similarity of mean-centered ratings as the similarity metric.
Make predictions of how the target user would rate other restaurants.
For each target user, make predictions for all restaurants for which at least three neighbors have ratings.
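
A minimal plain-Python sketch of the user-user steps above follows. Several choices here are assumptions rather than part of the assignment text: each vector norm is taken over all of a user's centered ratings (not only the co-rated businesses), and the prediction is the target user's mean plus the similarity-weighted average of the neighbors' mean-centered ratings. All function names are hypothetical.

```python
from math import sqrt


def mean_center(ratings):
    """Subtract the user's own mean rating from each of their ratings."""
    mean = sum(ratings.values()) / len(ratings)
    return {item: r - mean for item, r in ratings.items()}


def centered_cosine(a, b, min_shared=2):
    """Cosine similarity of two mean-centered rating dicts.

    Returns 0.0 when fewer than min_shared businesses are co-rated
    or when either vector has zero norm.
    """
    shared = set(a) & set(b)
    if len(shared) < min_shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_neighbors(target, others, k=50):
    """Return up to k (user, similarity) pairs with similarity > 0."""
    centered_target = mean_center(target)
    scored = []
    for user, ratings in others.items():
        sim = centered_cosine(centered_target, mean_center(ratings))
        if sim > 0:
            scored.append((user, sim))
    return sorted(scored, key=lambda pair: -pair[1])[:k]


def predict_for_target(target, others, k=50, min_neighbors=3):
    """Predict ratings for businesses rated by at least min_neighbors neighbors."""
    target_mean = sum(target.values()) / len(target)
    candidates = {}
    for user, sim in top_neighbors(target, others, k):
        for item, deviation in mean_center(others[user]).items():
            if item not in target:
                candidates.setdefault(item, []).append((sim, deviation))
    predictions = {}
    for item, pairs in candidates.items():
        if len(pairs) >= min_neighbors:
            total = sum(s for s, _ in pairs)  # positive, since every sim > 0
            weighted = sum(s * d for s, d in pairs) / total
            predictions[item] = target_mean + weighted
    return predictions
```

Here target is one target user's {business_id: stars} dictionary and others maps every other user to the same kind of dictionary.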

For each item in the entire dataset, find up to 50 most similar neighbors (i.e. other items).
Neighbors must:
        (a) have at least one rating from one of the target users (if not, do not consider the item as a potential neighbor).
        (b) have a positive, non-zero similarity with the target item.
        Use the cosine similarity of mean-centered ratings as the similarity metric.
Make predictions of how the user would rate other restaurants based on how the user rated similar restaurants. Only make predictions for restaurants with at least 3 neighbors.
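
The item-item variant can be sketched the same way in plain Python. The choices below are assumptions, not requirements stated above: each item's ratings are centered on that item's own mean, "at least 3 neighbors" is read as at least 3 neighbor items that the user has actually rated, and the prediction is the similarity-weighted average of the user's ratings on those neighbors. All function names are hypothetical.

```python
from math import sqrt


def transpose(by_user):
    """Turn {user: {item: stars}} into {item: {user: stars}}."""
    by_item = {}
    for user, ratings in by_user.items():
        for item, stars in ratings.items():
            by_item.setdefault(item, {})[user] = stars
    return by_item


def center(vec):
    """Subtract the vector's own mean from every entry."""
    mean = sum(vec.values()) / len(vec)
    return {k: v - mean for k, v in vec.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_item_based(by_user, user, item, target_ids, k=50, min_neighbors=3):
    """Predict user's rating of item from the user's ratings of similar items."""
    by_item = transpose(by_user)
    if item not in by_item:
        return None
    item_vec = center(by_item[item])
    scored = []
    for other, raters in by_item.items():
        # (a) a neighbor item needs at least one rating from a target user
        if other == item or not (set(raters) & set(target_ids)):
            continue
        sim = cosine(item_vec, center(raters))
        if sim > 0:  # (b) positive, non-zero similarity only
            scored.append((sim, other))
    neighbors = sorted(scored, reverse=True)[:k]
    rated = [(s, by_user[user][o]) for s, o in neighbors if o in by_user[user]]
    if len(rated) < min_neighbors:
        return None
    return sum(s * r for s, r in rated) / sum(s for s, _ in rated)
```

The function returns None whenever the neighbor requirement is not met, so callers can skip that (user, item) pair.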

Submit the following 5 files containing the output of your code as well as your code itself. Please use Blackboard to submit two files, each with your last name and student ID:

a3_p1_lastname_id_OUTPUT.pdf

Please do not upload a zip file. After uploading, double-check that your files are there and correct, and make sure to submit. Uploading zip files, or any file type other than Python code or txt files, will result in the submission being considered invalid. Partially uploaded or non-submitted files will count as unsubmitted.

Runtimes (added 4/14): Your code for each part should run in under 10 minutes on the test data, in cluster mode, on the specified gcloud cluster.