DS 715 - Assignment 4 Performing OLAP with SQL
Objectives:
- Demonstrate use of SQL to perform OLAP operations;
Estimated time to complete: 40 min reading +45-50 min. solve queries Points: 25/100.
Tasks:
In this assignment we are going to use the SQL ROLLUP and CUBE operators to generate and store a three dimensional cube with hierarchies based on the tables in the database given by the description below:
Tb_Supplier(Supp_ID, Name, City, State)
Tb_Consumer(Con_ID, Name, City, State)
Tb_Product(Prod_ID, Name, Product_Category, Product_Line, Product_Packaging)
Tb_Offers(Supp_ID, Prod_ID, Quantity, Price)
Tb_Requests(Con_ID, Prod_ID, Quantity, Price)
Tb_Transactions(Tran_ID, Supp_ID, Con_ID, Prod_ID, Quantity, Price) and represented by the diagram:
The stored cube will then be used to efficiently solve SQL OLAP queries.
The ROLLUP Operator
ROLLUP and CUBE are SQL extension operators one would place in the GROUP BY clause of an SQL SELECT statement in order to compute and add aggregates to the result set returned by the SELECT statement. The ROLLUP operator is applied to a set of columns that represent a concept hierarchy and performs a rollup on one or more levels of the hierarchy.
For example run the following query against your database:
SELECT State, COUNT(Supp_ID) "#Suppliers" FROM Tb_Supplier
GROUP BY ROLLUP(State)
and compare the result with what is returned by the following simple GROUP BY query:
SELECT State, COUNT(Supp_ID) "#Suppliers" FROM Tb_Supplier GROUP BY State
Carefully comparing the results of the two queries reveals that the ROLLUP query adds an extra row to the result, and that row represents the rollup of the numbers of suppliers per state over all states in the database. This is an example of rolling up by climbing a hierarchy.
Note that the extra row has a NULL value on the column representing the state, i.e. the rolled-up column.
As another, very similar example, compare:
SELECT City, COUNT(Supp_ID) "#Suppliers" FROM Tb_Supplier
GROUP BY ROLLUP(City)
with
SELECT City, COUNT(Supp_ID) "#Suppliers" FROM Tb_Supplier
GROUP BY City
For an example that shows the effect of climbing two levels in a hierarchy compare the result of this query:
SELECT State, City, COUNT(Supp_ID) "#Suppliers" FROM Tb_Supplier
GROUP BY ROLLUP(State, City)
with
SELECT State, City, COUNT(Supp_ID) "#Suppliers" FROM Tb_Supplier
GROUP BY State, City
One can notice that the ROLLUP adds two types of rows to the result of the simple GROUP BY query:
- Rows with City=NULL and State a state name where the last column represents the number of suppliers in each state – rolling-up cities by state;
- One row with City=NULL and State=NULL where the last column represents the total number of suppliers – rolling-up states.
Note that in the ROLLUP parameters list the columns are listed in the descending order of the concept hierarchy (State > City). Actually, running the query below will return a meaningless result:
SELECT State, City, COUNT(Supp_ID) "#Suppliers" FROM Tb_Supplier
GROUP BY ROLLUP(City, State)
The CUBE Operator
The CUBE operator is applied on a number of attributes called the cube dimensions and calculates aggregates on all possible subsets of its dimensions. This is an example of rollups by dimension reductions. For example a cube on the supplier, consumer and product dimensions is the equivalent of the union of the results from 8 different GROUP BY queries.
Run the following query and carefully observe the result it returns:
SELECT DISTINCT S.Name "Supplier Name",
C.Name "Consumer Name",
P.Name "Product Name",
SUM(Quantity) "Total Transactions Quantity",
COUNT(Tran_ID) "Number of Transactions"
FROM Tb_Supplier S, Tb_Consumer C, Tb_Product P, Tb_Transactions T
WHERE S.Supp_ID=T.Supp_ID AND
C.Con_ID=T.Con_ID AND
P.Prod_ID=T.Prod_ID
GROUP BY CUBE(S.Name, C.Name, P.Name)
On your sample database this cube returns a total of 396 rows.
There are 8 kinds of rows in this result, and each kind represents a cuboid:
- One row where "Supplier Name"=NULL, "Consumer Name"=NULL, "Product Name"=NULL, and the last two columns contains the grand totals of their corresponding aggregates – this is a rollup by reducing dimensions: supplier, consumer and product and represents the apex node in the lattice of cuboids;
- Rows where "Supplier Name"=NULL, "Consumer Name"=NULL, while "Product Name" has the name of a product and the last two columns represent aggregates over each product – these rows represent rollup by reducing dimensions: supplier and consumer; - . . .
- Rows where "Supplier Name", "Consumer Name", and "Product Name" have all non-null values – this is similar to a GROUP BY query on the 3 dimension columns, and represent the base node in the lattice of cuboids, i.e. zero dimensions are reduced.
If we want a group of columns to be handled as one single column by the CUBE operator we would place those columns within another set of parenthesis.
In the query example below the group of S.State, S.City, S.Name columns, and group of C.State, C.City, C.Name columns are handled as one single column each.
The query produces a result with 433 rows of the same 8 types as the first CUBE query.
Run the following query and carefully observe the result it returns:
SELECT DISTINCT S.Name "Supplier Name",
S.City "Supplier City",
S.State "Supplier State",
C.Name "Consumer Name",
C.City "Consumer City",
C.State "Consumer State",
P.Name "Product Name",
SUM(Quantity) "Total Transactions Quantity",
COUNT(Tran_ID) "Number of Transactions"
FROM Tb_Supplier S, Tb_Consumer C, Tb_Product P, Tb_Transactions T
WHERE S.Supp_ID=T.Supp_ID AND
C.Con_ID=T.Con_ID AND
P.Prod_ID=T.Prod_ID
GROUP BY CUBE((S.State, S.City, S.Name),
(C.State, C.City, C.Name),
P.Name)
Notice that in all rows where "Supplier Name” or "Supplier City" or "Supplier State" is NULL the other two will also be NULL.
Same is true for "Consumer Name", "Consumer City" and "Consumer State".
After all this is still a 3 dimensional cube.
Indeed this cube contains rollups on supplier state and consumer state, but not on supplier or consumer name or city.
Cubes with Hierarchies
The columns "Supplier Name", "Supplier City" and "Supplier State", and "Consumer Name" with "Consumer City" and "Consumer State", represent two concept hierarchies in the supplier and consumer dimensions. Adding the missing rollups to the cube above is straightforward and can be done by adding ROLLUP operator for the two concept hierarchies to the cube calculation as below (notice the parameter attributes are listed in the descending order of the concept hierarchy – higher level concepts first):
SELECT DISTINCT S.Name "Supplier Name",
S.City "Supplier City",
S.State "Supplier State",
C.Name "Consumer Name",
C.City "Consumer City",
C.State "Consumer State",
P.Name "Product Name",
SUM(Quantity) "Total Transactions Quantity",
COUNT(Tran_ID) "Number of Transactions"
FROM Tb_Supplier S, Tb_Consumer C, Tb_Product P, Tb_Transactions T
WHERE S.Supp_ID=T.Supp_ID AND
C.Con_ID=T.Con_ID AND
P.Prod_ID=T.Prod_ID
GROUP BY CUBE((S.State, S.City, S.Name),
(C.State, C.City, C.Name),
P.Name),
ROLLUP(S.State, S.City, S.Name),
ROLLUP(C.State, C.City, C.Name)
The result is a 3 dimensional cube with two 3-level hierarchies producing on your database a total of 1414 rows which can be placed in 32 cuboids (4*4*2=32).
Notice we will have rows where:
"Supplier Name" "Supplier City" "Supplier State"
NULL - - --rollup supplier name
NULL NULL - --rollup supplier name
and city
and
"Consumer Name" "Consumer City" "Consumer State"
NULL - - --rollup consumer name NULL NULL - --rollup consumer name
and city
and all their combinations, but never any of the patterns below:
"Supp/Cons Name" "Supp/Cons City" "Supp/Cons State"
NULL - NULL - NULL NULL - - NULL
Notice that calculating a cube on all the attributes, as in:
GROUP BY cube(S.State, S.City, S.Name, C.State, C.City, C.Name, P.Name)
would generate a result with 9108 rows on our sample database.
This 7 dimensional cube would contain many meaningless rows and would waste a lot of space while slowing down queries on the cube.
Cube Queries
Most OLAP queries are retrieving the rows of one single cuboid in the multi-dimensional cube or a sub-set of rows from one cuboid if slice and dice operations are involved.
Given a pre-calculated cube the rows of each cuboid are easily identified by a specific combination of NULL values within their columns.
Therefore, one can easily select the rows of a cuboid or any subset of it by a simple SELECT-FROM –WHERE statement, which is much more efficient than a multi-join GROUP BY query needed to extract the same rows directly from the data warehouse tables.
To calculate and store the sample cube with hierarchies from the previous section simply use the statement:
SELECT DISTINCT S.Name "Supplier Name",
S.City "Supplier City",
S.State "Supplier State",
C.Name "Consumer Name",
C.City "Consumer City", C.State "Consumer State",
P.Name "Product Name",
SUM(Quantity) "Total Transactions Quantity",
COUNT(Tran_ID) "Number of Transactions"
INTO Tb_Transactions_Cube
FROM Tb_Supplier S, Tb_Consumer C, Tb_Product P, Tb_Transactions T
WHERE S.Supp_ID=T.Supp_ID AND
C.Con_ID=T.Con_ID AND
P.Prod_ID=T.Prod_ID
GROUP BY CUBE((S.State, S.City, S.Name),
(C.State, C.City, C.Name),
P.Name),
ROLLUP(S.State, S.City, S.Name),
ROLLUP(C.State, C.City, C.Name)
which creates the Tb_Transactions_Cube table and populates with the rows of the cube.
It is now easy to write queries against this cube.
The grand total aggregates, i.e. the apex cuboid, is in the one row where all columns except the aggregate ones are NULL:
--apex cuboid
SELECT *
FROM Tb_Transactions_Cube
WHERE "Supplier Name" IS NULL AND "Supplier City" IS NULL
AND "Supplier State" IS NULL
AND "Consumer Name" IS NULL
AND "Consumer City" IS NULL AND "Consumer State" IS NULL
AND "Product Name" IS NULL
while the base cuboid, which details aggregates by all combinations of consumers, suppliers and products, is returned by:
--base
SELECT *
FROM Tb_Transactions_Cube
WHERE "Supplier Name" IS NOT NULL AND "Supplier City" IS NOT NULL
AND "Supplier State" IS NOT NULL
AND "Consumer Name" IS NOT NULL AND "Consumer City" IS NOT NULL
AND "Consumer State" IS NOT NULL AND "Product Name" IS NOT NULL The aggregates by products are given by the rows where product name is not NULL:
--aggregates by products - rollup supplier and consumer dimensions,
--hierarchies included
SELECT *
FROM Tb_Transactions_Cube
WHERE "Supplier Name" IS NULL
AND "Supplier City" IS NULL
AND "Supplier State" IS NULL
AND "Consumer Name" IS NULL AND "Consumer City" IS NULL
AND "Consumer State" IS NULL
AND "Product Name" IS NOT NULL
The following query will return nothing, because the configuration below violates the hierarchies defined in the Tb_Transactions_Cube (e.g. state cannot be NULL where name is not NULL):
--impossible or non-sense
SELECT *
FROM Tb_Transactions_Cube
WHERE "Supplier Name" IS NOT NULL
AND "Supplier City" IS NOT NULL
AND "Supplier State" IS NULL
AND "Consumer Name" IS NOT NULL AND "Consumer City" IS NOT NULL
AND "Consumer State" IS NULL
AND "Product Name" IS NOT NULL
Other sample queries:
--aggregates by products sold to consumers in Wisconsin?
SELECT "Product Name", [Total Transactions Quantity],
[Number of Transactions]
FROM Tb_Transactions_Cube
WHERE "Supplier Name" IS NULL
AND "Supplier City" IS NULL
AND "Supplier State" IS NULL
AND "Consumer Name" IS NULL AND "Consumer City" IS NULL
AND "Consumer State" = 'Wisconsin' AND "Product Name" IS NOT NULL
--Quantity of milk sold by each supplier from Wisconsin? --careful [Supplier City] has to be NOT NULL
SELECT [Supplier Name], [Total Transactions Quantity]
FROM Tb_Transactions_Cube WHERE "Supplier Name" IS NOT NULL AND "Supplier City" IS NOT NULL AND "Supplier State"='Wisconsin' AND "Consumer Name" IS NULL AND "Consumer City" IS NULL
AND "Consumer State" IS NULL
AND "Product Name" = 'Milk'
--Quantity of milk sold by each supplier from Wisconsin in the state of --Ilinois?
--careful [Supplier City] has to be NOT NULL
SELECT [Supplier Name], [Total Transactions Quantity] FROM Tb_Transactions_Cube
WHERE "Supplier Name" IS NOT NULL AND "Supplier City" IS NOT NULL
AND "Supplier State" = 'Wisconsin' AND "Consumer Name" IS NULL AND "Consumer City" IS NULL
AND "Consumer State" = 'Illinois'
AND "Product Name" = 'Milk'
Given the Tb_Transactions_Cube stored above solve the following queries:
- Aggregates by combinations of supplier name and product name?
- Aggregates by supplier states?
- Number of transactions between supplier-city-consumer-city pairs?
- Name of each product sold in Wisconsin and quantity of sales for the product?
- Quantity of sales aggregated by product and supplier state?
- Quantity of computer sales aggregated by suppliers in Wisconsin?
- Quantity of auto sales by each supplier from Wisconsin to consumers in Illinois?
- Quantity of auto sales by each supplier in Madison to consumers in Illinois?
- Quantity of each product sold by supplier Bernstein to consumers in Chicago?
- Quantity of milk sold by supplier Bernstein to consumers in Chicago?
- (Extra Credit) For each product list quantity sold by suppliers in
Madison to consumers in Chicago versus quantity sold by suppliers in Chicago to consumers in Madison (result columns will be: product name, quantity Madison_Chicago, quantity Chicago_Madison?
Return:
Please post your solutions to the queries above in properly formatted SQL syntax (each SQL clause on its own line and with capitalized keyword and operators) in the D2L dropbox created for this assignment.