

3. Experience processing three different types of real data:
   a. Standard multi-attribute data (Bank data)
   b. Time series data (Twitter feed data)
   c. Bag of words data

4. Practice using programming APIs to find the best API calls to solve your problem. See the API documentation for Hive and Spark (for Spark, look under RDD; there are a lot of really useful API calls).

This is an individual assignment. You are not permitted to work as a part of a group when writing this assignment.

Submission checklist

Expected quality of solutions

a) In general, writing more efficient code (less reading/writing from/into HDFS and less data shuffles) will be rewarded with more marks.

• [Spark SQL]
o Means the question must be done using Spark SQL, so you are not allowed to use RDDs. In addition, answer these questions using the Spark DataFrame or Dataset API; do not use SQL syntax.

Assignment structure:

• For each Spark question, a skeleton project is provided for you. Write your solution in the .scala file in the src directory. Build and run your Spark code using the provided script:

$ bash build_and_run.sh

You should also use the large input files that we provide to test the scalability of your solutions.

4. It can take some time to build and run Spark applications from .scala files, so for the Spark questions it is best to experiment in spark-shell first to figure out a working solution, and then move your code into the .scala files afterwards.

Attribute index   Attribute name
0                 age
2                 marital
4                 default
6                 housing
8                 contact
10                month
12                campaign
14                previous
16                termdeposit (has the client subscribed a term deposit? binary: "yes","no")

Please note we specify whether you should use [Hive] or [Spark RDD] for each subtask at the beginning of each subtask.

"services" 1

"technician" 3

[8 marks]

d) [Spark RDD] Sort all people in ascending order of education. For people with the same education, sort them in descending order by balance. This means that all people with the same education should appear grouped together in the output. For each person report the following attribute values: education, balance, job, marital, loan.
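The two-level ordering in this subtask (primary key ascending, secondary key descending) can be expressed with a single composite sort key. Below is a minimal sketch using plain Scala collections with made-up sample values; with Spark you would pass the same key function to `rdd.sortBy`:

```scala
// Plain-Scala sketch of the two-level sort; the data values are illustrative.
// In Spark, the same composite key works with rdd.sortBy(...).
case class Person(education: String, balance: Double, job: String,
                  marital: String, loan: String)

val people = Seq(
  Person("tertiary", 1200.0, "technician", "single",  "no"),
  Person("primary",  300.0,  "services",   "married", "yes"),
  Person("tertiary", 4500.0, "manager",    "married", "no")
)

// Sort ascending by education; within equal education, descending by balance.
// Negating the balance turns the secondary sort into descending order.
val sorted = people.sortBy(p => (p.education, -p.balance))

sorted.foreach(p =>
  println(s"${p.education}, ${p.balance}, ${p.job}, ${p.marital}, ${p.loan}"))
```

Because the key is a tuple, people compare first on education, and only ties fall through to the (negated) balance, which is exactly the grouping behaviour the question asks for.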

The data is supplied with the assignment at the following locations:

Attribute index   Attribute name
1                 month
3                 hashtagName

[6 marks]

Input: x = 200910, y = 200912

Output: hashtagName: mycoolwife, countX: 1, countY: 500

In this task you are asked to create a partitioned index of words to documents that contain the words. Using this index you can search for all the documents that contain a particular word efficiently.
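Conceptually, an inverted index maps each word to the list of documents containing it, so a lookup becomes a single key access rather than a scan of every document. A toy in-memory sketch with invented data (in the assignment the same structure is built with Spark SQL and written out as partitioned files):

```scala
// Toy inverted index; the (docId, word) pairs are illustrative.
val docWords = Seq(
  (1, "plane"), (2, "plane"), (2, "boat"), (3, "boat"), (3, "motorbike")
)

// word -> sorted list of docIds containing that word
val index: Map[String, Seq[Int]] =
  docWords.groupBy(_._2).map { case (w, rows) => w -> rows.map(_._1).sorted }

// Searching for all documents containing a word is now a single map access.
println(index("boat"))   // List(2, 3)
```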

The data is supplied with the assignment at the following locations:

Attribute index   Attribute name
0                 docId
2                 count

Note: Spark SQL will write the output as multiple files (parts). You should ensure that the data is sorted globally across all of the files, so that all words in part 0 are alphabetically before the words in part 1.
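The property being asked for is range partitioning: the sorted data is split into parts such that every key in part i sorts before every key in part i+1 (this is what Spark's global sort does before writing). The invariant can be sketched on plain collections:

```scala
// Sketch of range partitioning: split a globally sorted word list into
// "parts" so every word in part i is alphabetically before part i+1.
// Spark's global sort (orderBy) range-partitions the data this way
// before each partition is written out as a separate file.
val words = Seq("truck", "boat", "plane", "apple", "zebra").sorted
val numParts = 2
val partSize = math.ceil(words.size.toDouble / numParts).toInt
val parts = words.grouped(partSize).toSeq

// Invariant: the last word of each part precedes the first word of the next.
parts.zipWithIndex.foreach { case (p, i) => println(s"part $i: $p") }
```

A sort done only within each partition would not give this guarantee; the split points must come from the globally sorted order.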

[5 marks]

[2, motorbike, 702]
[3, boat, 2000]

For this subtask specify the document ids as arguments to the script. For example:

$ bash build_and_run.sh computer environment power

[3 marks]

[5 marks]

b) [Spark SQL] Load the Parquet file from “../frequent_docwords.parquet”, which you created in the previous subtask. Find all pairs of frequent words that occur in the same document and report the number of documents each pair occurs in. Report the pairs in decreasing order of frequency. The solution may take a few minutes to run.

(2,(plane, motorbike))
(2,(motorbike, boat))
(1,(plane, boat))
[15 marks]
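The pair counting above can be sketched with plain Scala collections: for each document, form every unordered pair of its distinct words, then count how many documents each pair appears in. The data below is invented; with Spark SQL, one common way to get the same shape is a self-join of the frequent-words table on docId with a word1 < word2 condition:

```scala
// Illustrative (docId, word) pairs of "frequent" words; the data is made up.
val docWords = Seq(
  (1, "plane"), (1, "motorbike"),
  (2, "plane"), (2, "motorbike"), (2, "boat"),
  (3, "motorbike"), (3, "boat")
)

// For each document, form every unordered pair of distinct words.
// Sorting first ensures (a, b) and (b, a) are counted as one pair.
val pairCounts: Map[(String, String), Int] = docWords
  .groupBy(_._1)
  .values
  .flatMap(ws => ws.map(_._2).sorted.combinations(2))
  .map { case Seq(a, b) => (a, b) }
  .toSeq
  .groupBy(identity)
  .map { case (pair, occs) => pair -> occs.size }

// Report pairs in decreasing order of document frequency.
pairCounts.toSeq.sortBy(-_._2).foreach {
  case ((a, b), n) => println(s"($n,($a, $b))")
}
```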

Bonus Marks:

Hash tag name: mycoolwife
count of month 200812: 100
count of month 200901: 201

[10 marks]
