• +1-617-874-1011 (US)
  • +44-117-230-1145 (UK)
Live Chat
Follow Us:

CS 411: Database Systems Homework 5

Section 1. Query Optimization [30 pts]

Part 1. [14 points]

Given the following Bank database schema:

Customer = (cid, cName, cCity)

Account = ( aNumber, balance, interest_rate, cid, bid) Branch = (bid, bCity, bName)

1) For each equivalence shown below, state whether it is true or false in general. If your answer is “true”, justify the equivalency by explaining which RA rule/rules can be used to transform one query expression into the other. If your answer is “false”, give a sample instance of the database where the two expressions are not equal. (7 points)

CS 411: Database Systems Homework 5 Image 1

2) Rewrite the following queries to make the execution most efficient (that is, pushing selections and projections as close to the base relation as possible, limiting the projected columns to only the ones required, and using appropriate joins). (7 points)

a) Find the names of all customers who live in “Champaign” with an account at some branch in “Champaign” or “Urbana”.

CS 411: Database Systems Homework 5 Image 2

b) Find the names of all customer with an account at some branch located in “Champaign”, whose account interest  rate is higher than 2% and the account balance is larger than 2000$.

CS 411: Database Systems Homework 5 Image 3

Part 2. [6 points]

Consider the following selection operation on the relation R(A, B, C). If the relation R contains 1000 tuples, where attribute A has 10 different values, attribute B has 20 different values, and attribute C has 30 different values, what is the maximum possible size of the selection for each of the following queries?

  1. σ((A=a1)AND(B=b2))OR(C=c3)(R)
  2. σ(A=a1)AND(B=b2)AND(C=c3)(R)

Part 3. Dynamic Programming [15 points]

Consider relations A, B, C and D with the following information.

A(x,y)

B(z,y,w)

C(v,x)

D(w,x,z)

T(A) = 5000

V(A, x) = 100

V(A, y) = 50

T(B) = 6000

V(B, z) = 200

V(B, y) = 150

V(B, w) = 50

T(C) = 2000

V(C, v) = 100

V(C, x) = 250

T(D) = 2500

V(D, w) = 100

V(D, x) = 200

V(D, z) = 50

Note that T(R) is number of tuples in relation R and V(R, a) is number of distinct values of attribute a in relation R. We want to compute A⋈B⋈C⋈D as efficiently as possible. Determine the most efficient way to do the join. Clearly state any assumptions you have made. Show your work by completing the following table (each step in the dynamic programming algorithm should be one row):

Subquery

Size

Cost

Plan

...

...

...

...

Section 2. MongoDB [35 points]

Install MongoDB on your machine. The installation instructions of MongoDB can be found here

Download the database provided herewhich has been acquired from the YELP database

The database has two CSV files: review.csv and business.csv .

  • csv: This file contains reviews of YELP users on a set of business where each review is associated with the user id of the users who have submitted the review as well as the business id of the business that the review is about. The original YELP dataset contains 5200000 reviews. We randomly sampled 500000 reviews to generate the reviews.csv file.
  • csv : Each line of the file contains a business id along with some attributes associated with the target business.

Load the .csv files into your database and write and execute the following queries.

NOTE

For each query, please include your query and also top­10 documents ordered (in descending order) by the corresponding id attribute in the pdf (You do not need to include the entire results in your submission).

  • List all businesses with more than 200 reviews. (The output documents should contain a single field: business_id

)

  • List all users with at least one review for a restaurant in Arizona (AZ). Use the “categories” attribute to find the businesses belonging to the category “Restaurants”. (The output documents should contain a single field: user_id)
  • Display the total number of users who have at least one review with rating score 5.
  • Display the total number of users with at least two reviews for a business located in California (CA).
  • List all users who have at least five reviews on businesses with a delivery option. (The output documents should contain a single field: user_id)
  • Compute and display the average star rating (computed from reviews) of businesses in Illinois grouped by business id. (The output documents should contain two fields: business_id, average_star_rating) 7) Display the total number of users who have reviewed businesses in more than two distinct states.

Bonus (7 points)

8) For each business b, display the total number of reviews from users who have at least five reviews outside the state of business b. (The output documents should contain two fields: business_id, total_number_of_reviews)

Section 3. Neo4j [35 pts]

Install Neo4j on your machine. The installation instructions of Neo4j can be found here

You will use the same database that used in Section 2. Use Neo4j’s browser (http://localhost:7474/browser ) to load the data and write your queries. The graph database should have three types of nodes: user, business, and reviews.

Once you created the graph database, use the Neo4j browser to write and execute the same queries from Section

Query #8 is counted as extra credit for this section too (7 points).

NOTE

For each query, please provide your query and also top­10 documents ordered (in descending order) by the corresponding id attribute in the pdf.