Tip: start with only a few worker nodes and a small subset of the data; once your code works, scale up to the full test run.
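The subsetting step above can be done locally before touching the cluster. A minimal sketch, assuming review.json is a JSON-lines file (one JSON record per line); the function name and paths are illustrative, not part of the assignment:

```python
import json
import random

def sample_jsonl(in_path, out_path, fraction=0.01, seed=42):
    """Write a random ~fraction sample of a JSON-lines file.

    Handy for producing a small local subset of review.json
    to develop against before scaling up to the cluster.
    """
    rng = random.Random(seed)
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if rng.random() < fraction:
                json.loads(line)  # validate that each kept record parses
                dst.write(line)
                kept += 1
    return kept
```

A fixed seed keeps the sample reproducible across runs, so intermediate results stay comparable while you debug.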
Part I. Hello World from Cluster (10 points)
Trial data: review.json (just for getting familiar with the format).
Example: spark-submit a3_p2_lastname_id.py 'hdfs:/data/review.json'
Output: place the checkpoint output into a3_p2_<lastname>_<id>_OUTPUT.txt
Note: target_users was subject to change until 4/18 (updated on 4/18).
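A minimal skeleton for the spark-submit driver script described above. This is a sketch under assumptions: the counting logic is a placeholder (replace it with what the assignment actually asks for), and the pyspark import is deferred into main() so the helper can be tested without a cluster:

```python
import sys

def format_output(label, count):
    """Render one checkpoint line for the OUTPUT.txt file."""
    return f"{label}\t{count}"

def main(input_path):
    # Deferred import: pyspark is only needed when run via spark-submit.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("a3_p2").getOrCreate()
    # review.json is JSON-lines, which spark.read.json handles natively.
    reviews = spark.read.json(input_path)

    # Placeholder checkpoint: total number of review records.
    total = reviews.count()

    # Substitute your own lastname and id in the output filename.
    with open("a3_p2_<lastname>_<id>_OUTPUT.txt", "w") as out:
        out.write(format_output("total_reviews", total) + "\n")

    spark.stop()

if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Invoked as in the example above: spark-submit a3_p2_lastname_id.py 'hdfs:/data/review.json'.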
Submit the following 5 files: the output of your code as well as the code itself. Please use Blackboard to submit your files, each named with your lastname and student id:
Please do not upload a zip file. After uploading, double-check that your files are present and correct, and make sure to click submit. Files uploaded as zips, or as any type other than Python code or txt files, will render the submission invalid. Partially uploaded or unsubmitted files count as unsubmitted.

Runtimes (added 4/14): your code for each part should run in under 10 minutes on the test data, in cluster mode on the specified gcloud cluster.