ISIT312 Big Data Management Sample Assignment


SIM S2

Assignment 1

All files left on Moodle in a state "Draft(not submitted)" will not be evaluated. Please refer to the submission dropbox on Moodle for the submission due date and time.

This assignment contributes to 10% of the total evaluation in the subject. This assignment consists of 3 tasks. Specification of each task starts from a new page.

It is a requirement that all Laboratory and Assignment tasks in this subject must be solved individually without any cooperation with the other students. If you have any doubts, questions, etc. please consult your lecturer or tutor during lab classes or office hours. Plagiarism will result in a FAIL grade being recorded for that assessment task.

Deliverables

You must submit three PDF files for three tasks (one for each task) and a Java source file for Task 3.

Each PDF file must be no more than 5 pages. The answers must be presented in a readable manner that facilitates evaluation. Poor presentation will result in a reduction of marks.

The Java source code must be correct and compilable. The environment for Task 1 and Task 3 is the BigDataVM virtual machine in ISIT312.

Task 1. PutCombine Application (3 marks)

This task is based on the source code PutCombine.java and the data in FilesToBeMerged.zip (available on the Moodle site).

The PutCombine application extends Hadoop’s own functionality. The motivation for this application comes when we want to analyse fragmented files such as logs from web servers. We can copy each file into HDFS, but in general, Hadoop works more effectively with a single large file rather than a number of smaller ones. Besides, for analytics purposes we think of all the data as one big file, even though it spreads over multiple files as an incidental result of the physical infrastructure that creates the data.

One solution is to merge all the files first and then upload the combined file into HDFS. Unfortunately, merging the files first requires a lot of disk space on the local machine. It would be much easier if we could merge all the files on the fly as we upload them into HDFS.

What we need is, therefore, a “put-and-combine” type of operation. Hadoop’s command line utilities include a getmerge command for merging a number of HDFS files before copying them onto the local machine. What we’re looking for is the exact opposite, which is not available in Hadoop’s file utilities. The attached source code of PutCombine is a Java application that fulfils this purpose.
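To make the "merge on the fly" idea concrete, here is a minimal plain-Java sketch. It is not the PutCombine.java supplied on Moodle, and the class and method names are hypothetical. It copies several input streams, one after another, into a single output stream, so no combined file is ever written to the local disk. In a Hadoop setting the output stream would presumably be one opened on HDFS (for example, the stream returned by FileSystem.create), but that part is left out here so the example runs without Hadoop on the classpath.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.List;

// Hypothetical illustration only; not the PutCombine.java from Moodle.
class MergeOnTheFlySketch {

    // Copies every source stream, in list order, into the single output
    // stream and returns the total number of bytes written.
    // The caller is responsible for opening and closing the streams.
    static long mergeInto(List<? extends InputStream> sources, OutputStream out) {
        long total = 0;
        byte[] buffer = new byte[4096];
        try {
            for (InputStream in : sources) {
                int n;
                // read() returns -1 once the current source is exhausted
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                    total += n;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }
}
```

Only one buffer's worth of data is held in memory at a time, which is the property that removes the need for extra local disk space.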

Compare PutCombine with the FileSystemPut and FileSystemPutAlt applications in the lecture notes and describe the differences. You must link each difference to the source code. (1.5 marks)
Compile PutCombine.java and create a jar file, then use it to upload and merge the files in FilesToBeMerged.zip (unzip this file first). Produce a report as a PDF file that includes all the commands that you enter into a Zeppelin notebook or the Terminal, as well as brief explanations of those commands (e.g., the purpose of each command). (1.5 marks)

Task 2. MapReduce Model (3 marks)

(a) (1.5 marks) Suppose two sets of English letters “a” to “d” are stored in two nodes (A1 and A2):

D1 = {“a”, “b”, “c”, “b”, “c”, “d”, “a”} in Node A1,
D2 = {“a”, “a”, “a”, “d”, “d”, “c”} in Node A2.

There is a MapReduce job that processes the sets D1 and D2. This job is a “word-count” application, namely, it counts the total number of occurrences of each English letter across both nodes.

In this job, two Map tasks run in Node A1 and Node A2, and one Reduce task runs in another node, say, Node A3.

Describe the key-value data transferred from Node A1 and from Node A2 to Node A3, both with and without a Combiner.

Explain how a Combiner can improve a MapReduce job in this example.
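The effect of a Combiner on the data above can be simulated in plain Java. The sketch below is an illustration under stated assumptions, not Hadoop code: the class and method names are hypothetical, the Map stage emits one (letter, 1) pair per letter, and the Combiner applies the same per-key sum as the Reducer to the output of a single node before anything is sent.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation of word count with and without a Combiner.
class CombinerSketch {

    // Map stage: emit one (letter, 1) pair per input letter.
    static List<Map.Entry<String, Integer>> map(List<String> letters) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String letter : letters) {
            pairs.add(new SimpleEntry<>(letter, 1));
        }
        return pairs;
    }

    // Per-key sum; used here both as the Combiner and as the Reducer.
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<String> d1 = Arrays.asList("a", "b", "c", "b", "c", "d", "a");
        List<String> d2 = Arrays.asList("a", "a", "a", "d", "d", "c");

        // Without a Combiner, every mapped pair crosses the network.
        System.out.println("A1 pairs, no combiner: " + map(d1).size());
        System.out.println("A2 pairs, no combiner: " + map(d2).size());

        // With a Combiner, each node sends one pair per distinct letter.
        System.out.println("A1 after combiner: " + sumByKey(map(d1)));
        System.out.println("A2 after combiner: " + sumByKey(map(d2)));
    }
}
```

Comparing the sizes printed for each node shows how many key-value pairs reach Node A3 in each case.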

(b) (1.5 marks) The following table (named X) has a column of keys and a column of values:

key	value
k1	1
k1	2
k2	3
k2	4

Explain how to implement the following SQL-like query in a MapReduce model:

“SELECT key, SUM(value) FROM X GROUP BY key”

You need to specify the key-value data in the input and output of the Map and Reduce stages.
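One way to reason about the stages is to simulate them in plain Java. The sketch below is only an illustration with hypothetical names, not Hadoop code. It assumes the Map stage emits each row of X unchanged as a (key, value) pair, the shuffle groups the values by key, and the Reduce stage sums each group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation of SELECT key, SUM(value) ... GROUP BY key.
class GroupBySumSketch {

    // Map + shuffle: each row is emitted as (key, value), then the
    // values are grouped by key, as the framework does between stages.
    static Map<String, List<Integer>> shuffle(String[] keys, int[] values) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            grouped.computeIfAbsent(keys[i], k -> new ArrayList<>()).add(values[i]);
        }
        return grouped;
    }

    // Reduce: for each key, sum the list of values received.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            int sum = 0;
            for (int v : group.getValue()) {
                sum += v;
            }
            sums.put(group.getKey(), sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        String[] keys = {"k1", "k1", "k2", "k2"};
        int[] values = {1, 2, 3, 4};
        Map<String, List<Integer>> grouped = shuffle(keys, values);
        System.out.println("Reduce input:  " + grouped);
        System.out.println("Reduce output: " + reduce(grouped));
    }
}
```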

Produce a report as a PDF file that documents your solutions to both questions above.

Task 3. Average Patent Claim Applications (4 marks)

This task is to process the dataset apat63_99.txt (in the dataset folder on the Desktop of the VM). This dataset contains information about almost 3 million U.S. patents granted between January 1963 and December 1999. See http://www.nber.org/patents/ for more information.

The following table describes (some) meta-information about this data set.

Attribute Name	Description
PATENT	Patent number
GYEAR	Grant year
GDATE	Grant date, given as the number of days elapsed since January 1, 1960
APPYEAR	Application year (available only for patents granted since 1967)
COUNTRY	Country of first inventor
POSTATE	State of first inventor (if country is U.S.)
ASSIGNEE	Numeric identifier for assignee (i.e., patent owner)
ASSCODE	One-digit (1-9) assignee type. (The assignee type includes U.S. individual, U.S. government, U.S. organization, non-U.S. individual, etc.)
CLAIMS	Number of claims (available only for patents granted since 1975)
NCLASS	3-digit main patent class

Develop a MapReduce application, AvgClaimsByYear.java, in Java which

computes the average number of claims per patent by year;
includes ToolRunner and a Partitioner;
sorts the outcome into four groups according to the year, i.e., (1) before 1970, (2) 1970-1979, (3) 1980-1989, and (4) 1990-1999.
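The grouping and averaging logic behind these requirements can be sketched in plain Java. This is not the required AvgClaimsByYear.java: the class and method names are hypothetical and nothing here depends on Hadoop. In a real job, logic like groupFor would presumably sit inside the getPartition method of a Partitioner subclass, with the job configured to use four reduce tasks, and the averaging would happen in the Reducer.

```java
import java.util.List;

// Hypothetical sketch; class and method names are illustrative only.
class YearGroupSketch {

    // Maps a grant year to one of the four groups named in the task:
    // 0: before 1970, 1: 1970-1979, 2: 1980-1989, 3: 1990-1999.
    static int groupFor(int year) {
        if (year < 1970) {
            return 0;
        }
        if (year < 1980) {
            return 1;
        }
        if (year < 1990) {
            return 2;
        }
        return 3;
    }

    // Average of the claim counts collected for one year.
    static double average(List<Integer> claims) {
        if (claims.isEmpty()) {
            throw new IllegalArgumentException("no claim counts for this year");
        }
        long sum = 0;
        for (int c : claims) {
            sum += c;
        }
        return (double) sum / claims.size();
    }
}
```

Because CLAIMS is available only for patents granted since 1975, records with an empty CLAIMS field would need to be skipped before averaging. How to do that is left to the application.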

You should transfer the apat63_99.txt file to HDFS and run the application to process it.

Produce a report as a PDF file that includes the commands that you use, a brief explanation of each command and the output of your application.

Also submit the source code of your application.
