FROM lineorder, dwdate
WHERE lo_orderdate = d_datekey
AND d_yearmonth = 'Jan1994'
AND lo_discount BETWEEN 5 AND 7;
GROUP BY d_month
ORDER BY COUNT(DISTINCT d_sellingseason)
You can ignore the network transfer costs as well as the possibility of node failure.
3 a) Suppose you have a 6-node cluster with replication factor of 3. Describe what MapReduce has to do after it determines that a node has crashed while a job was being processed. For simplicity, assume that the failed node is not replaced and your cluster is reduced to 5 nodes. Specifically:
b) What does HDFS (the storage layer/NameNode) have to do in response to node failure in this case?
c) What does MapReduce execution engine have to do to respond to the node failure? Assume that there was a job in progress because otherwise MapReduce does not need to do anything to address a failure.
Where does the Mapper store output key-value pairs before they are sent to Reducers?
Why can’t Reducers begin processing before Mapper phase is complete?
4. Using the SSBM schema (http://rasinsrv07.cstcis.cti.depaul.edu/CSC555/SSBM1/SSBM_schema_hive.sql) load the Part table into Hive (data available at http://rasinsrv07.cstcis.cti.depaul.edu/CSC555/SSBM1/part.tbl)
NOTE: The schema above is made for Hive, but by default Hive assumes ‘\t’ separated content. You will need to modify your CREATE TABLE statement to account for ‘|’ delimiter in the data or this won’t work.
Use Hive user defined function (i.e., SELECT TRANSFORM with weekday mapper is available here: http://rasinsrv07.cstcis.cti.depaul.edu/CSC555/weekday_mapper.py) to perform the following transformation on Part table (creating a new transformed table): for 2nd and 7th columns, split it into individual columns. That is, a value in 2nd column, ‘blush maroon’ now becomes 2nd and 3rd column with ‘blush’ and ‘maroon’ respectively. Similarly, the 7th column will be transformed from ‘STANDARD BURNISHED NICKEL’ into three columns with values ‘STANDARD’, ‘BURNISHED’, and ‘NICKEL‘.
Please be sure that you create the entire new table, not just the transformed column. You will add a total of 3 new columns to the original part table as a result.
Remember that your transform python code (split/join) should always use tab (‘\t’) between fields even if the source data is |-separated.
cd wget http://rasinsrv07.cstcis.cti.depaul.edu/CSC555/pig-0.15.0.tar.gz gunzip pig-0.15.0.tar.gz tar xvf pig-0.15.0.tar
set the environment variables (this can also be placed in ~/.bashrc to make it permanent)
export PIG_HOME=/home/ec2-user/pig-0.15.0 export PATH=$PATH:$PIG_HOME/bin
Use the same vehicles file. Copy the vehicles.csv file to the HDFS if it is not already there.
Now run pig (and use the pig home variable we set earlier):
cd $PIG_HOME bin/pig
Create the same table as what we used in Hive, assuming that vehicles.csv is in the home directory on HDFS:
VehicleData = LOAD '/user/ec2-user/vehicles.csv' USING PigStorage(',') AS (barrels08:FLOAT, barrelsA08:FLOAT, charge120:FLOAT, charge240:FLOAT, city08:FLOAT);
You can see the table description by
Verify that your data has loaded by running:
VehicleG = GROUP VehicleData ALL; Count = FOREACH VehicleG GENERATE COUNT(VehicleData); DUMP Count;
How many rows did you get? (if you get an error here, it is likely because vehicles.csv is not in HDFS)
Create the same ThreeColExtract file that you have in the previous assignment, by placing barrels08, city08 and charge120 into a new file using PigStorage .You want the STORE command to record output in HDFS. (discussed in p457, Pig Chapter, “Data Processing Operator section)
NOTE: You can use this to get one column:
OneCol = FOREACH VehicleData GENERATE barrels08;
Verify that the new file has been created and report the size of the newly created file.
(you can use quit to exit the grunt shell)