$29.99
B561 Assignment 7
Testing Effectiveness of Query Optimization;
Object-Relational Database Programming
Key-Value Databases and Graph Databases
(Draft)
This assignment focuses on problems related to Lecture 9 and Lectures 18
through 22.
• Lecture 18: Algorithms for RA operations
• Lecture 19: Query processing and query plans
• Lecture 20: Object-relational database programming
• Lecture 20: Key-value stores. NoSQL in MapReduce style
• Lecture 21: Key-value stores; NoSQL in Spark style
• Lecture 22: Graph databases
Other lectures that a relevant for this assignment are Lectures 8, 13, and 14:
• Lecture 8: Translating Pure SQL queries into RA expressions
• Lecture 9: Query optimization
• Lecture 13: Object-Relational databases and queries
• Lecture 14: Nested Relational, Semi-structured Databases, Document
Databases
This assignment has problems that are required to be solved. Others, identified as such, are practice problems that you should attempt since the serve as
preparation for the final exam.
Turn in a single assignment7.sql file that contains the PostgreSQL code of
the solutions for the problem that require such code. (Do not include solutions
for the practice problems.) Also turn in a assignment7.txt file that contains
all the output associated with the problems in this assignment. For all the other
problems, submit a single assignment7.pdf file with your solutions.
1
1 Analysis of Queries Using Query Plans
Consider Lecture 19 on Query Processing: Query Planning in (Object) Relational Systems. Consider the analysis, using query plans, for the SOME quantifier.
1. Assume the relation schemas P(x), Q(x), R(x, y) and S(x, z).
Consider the NOT ALL generalized query
{(p.x, q.x)|P(p) ∧ Q(p) ∧ R(p.x) 6⊃ S(p.x)}
where
R(p.x) = {r.y | R(r) ∧ r.x = p.x}
S(q.x) = {s.z | S(s) ∧ s.x = q.x}
Consider Lecture 19 on Query Processing: Query Planning in (Object)
Relational Systems and in particular the analysis, using query plans, for
the SOME generalized quantifier.
Now to the problem. In analogy with the analysis for the SOME generalized
quantifier, do an analysis for the NOT ALL generalized quantifier.
2 Experiments to Test the Effectiveness of Query
Optimization
In the following problems, you will conduct experiments in PostgreSQL to gain
insight into whether or not query optimization can be effective. In other words,
can it be determined experimentally if optimizing an SQL or an RA expression
improves the time (and space) complexity of query evaluation? Additionally,
can it be determined if the PostgreSQL query optimizer attains the same (i.e.,
better or worse) optimization as optimization by hand. Recall that in SQL you
can specify each RA expression as an RA SQL query. This implies that each of
the optimization rules for RA can be applied directly to queries formulated in
RA SQL.
In the following problems you will need to generate artificial data of increasing size and measure the time of evaluating non-optimized and optimized
queries. The size of this data can be in the ten or hundreds of thousands of
tuples. This is necessary because on very small data is it is not possible to gain
sufficient insights into the quality (or lack of quality) of optimization. You can
use the data generation functions that were developed in Assignment 6. Additionally, you are advised to examine the query plans generated by PostgreSQL.
For the problems in this assignments, we will use three relations:1
P(a int)
R(a int, b int)
S(b int)
1A typical case could be where P is Person, R is Knows, and S is the set of persons with the
Databases skill. Another case could where P is the set of persons who work for Amazon, R is
personSkill and S is the set of skills of persons who live in Bloomington. Etc.
2
To generate P or S, you should use the function SetOfIntegers which generate a set of up to n randomly selected integers in the range [l, u]:
create or replace function SetOfIntegers(n int, l int, u int)
returns table (x int) as
$$
select floor(random() * (u-l+1) + l)::int as x
from generate_series(1,n)
group by (x) order by 1;
$$ language sql;
To generate R, you should use the function BinaryRelationOverIntegers
which generates up to n randomly selected pairs with first components in the
range [l1, u1] and second components in the range [l2, u2]:
create or replace function BinaryRelationOverIntegers(n int, l_1 int, u_1 int, l_2 int, u_2 int)
returns table (x int, y int) as
$$
select floor(random() * (u_1-l_1+1) + l_1)::int as x,
floor(random() * (u_2-l_2+1) + l_2)::int as y
from generate_series(1,n)
group by (x,y) order by 1,2;
$$ language sql;
Example 1 Consider the query Q1
select distinct r1.a
from R r1, R r2
where r1.b = r2.a;
This query can be translated and optimized to the query Q2
select distinct r1.a
from R r1 natural join (select distinct r2.a as b from R r2) r2;
Image that you have generated a relation R. Then when you execute
explain analyze
select distinct r1.a
from R r1, R r2
where r1.b = r2.a;
the system will return its query plan as well as the execution time to evaluate
Q1 measured in ms. And, when you execute
explain analyze
select distinct r1.a
from R r1 natural join (select distinct r2.a as b from R r2) r2;
the system will return its query plan as well as the execution time to evaluate
Q2 measured in ms. This permits us to compare the non-optimized query Q1
3
with the optimized query Q2 for various differently-sized relations R. Here are
some of these comparisons for various differently-sized random relations R. In
this table, R was generated with lower and upper bounds l1 = l2 = 1000 and
u1 = u2 = 1000.
2
R Q1 (in ms) Q2 (in ms)
104 27.03 7.80
105 3176.53 58.36
106 69251.58 400.54
Notice the significant difference between the execution times of the non-optimized
query Q1 and the optimized query Q2. So clearly, optimization works on query
Q1.
Incidentally, below are the query plans for Q1 and Q2. Examining these
query plans should reveal why Q1 runs much slower than Q2. (Why?)
QUERY PLAN for Q1
------------------------------------
HashAggregate
Group Key: r1.a
-> Hash Join
Hash Cond: (r1.b = r2.a)
-> Seq Scan on r r1
-> Hash
-> Seq Scan on r r2
QUERY PLAN for query Q2
------------------------------------------
HashAggregate
Group Key: r1.a
-> Hash Join
Hash Cond: (r1.b = r2.a)
-> Seq Scan on r r1
-> Hash
-> HashAggregate
Group Key: r2.a
-> Seq Scan on r r2
2All the experiments where done on a MacMini.
4
We now turn to the problems for this section.
2. Consider query Q3
select distinct p.a
from P p, R r1, R r2, R r3, S s
where p.a = r1.a and r1.b = r2.a and r2.b = r3.a and r.b = S.b;
Intuitively, if we view R as a graph, and P and S as node types (properties),
then Q3 determines each P-node in the graph from which there emanates
a path of length 3 that ends at a S-node.3
I.e., a P-node n0 is in the
answer if there exists sequence of nodes (n0, n1, n2, n3) such that (n0, n1),
(n1, n2), and (n2, n3) are edges in R and n3 is a S-node.
(a) Translate and optimize this query and call it Q4. Then write Q4 as
an RA SQL query just as was done for query Q2 in Example 1.
(b) Compare queries Q3 and Q4 in a similar way as we did for Q1 and
Q2 in Example 1.
You should experiment with different sizes for R. Incidentally, these
relations do not need to use the same parameters as those shown in
the above table for Q1 and Q2 in Example 1.
(c) What conclusions do you draw from the results of these experiments
regarding the effectiveness of query optimization in PostgreSQL and/or
by hand?
3Such a query is typical in Graph Databases.
5
3. Consider the Pure SQL Q5 which is an formulation of a variation of the
not subset (not only) set semijoin query
{p.a | P(p) ∧ R(p.a) 6⊆ S}
where
R(p.a) = {r.b | R(r) ∧ r.a = p.a}.
select p.a
from P p
where exists (select 1
from R r
where r.a = p.a and
not exists (select 1 from S s where r.b = s.b));
(a) Translate and optimize this query and call it Q6. Then write Q6 as
an RA SQL query just as was done for Q2 in Example 1.
(b) An alternative way to write a query equivalent with Q5 is as the
object-relational query
with nestedR as (select P.a, array_agg(R.b) as bs
from P natural join R
group by (P.a)),
Ss as (select array(select b from S) as bs)
select a
from nestedR
where not (bs <@ (select bs from Ss));
Call this query Q7.
Compare queries Q5, Q6, and Q7 in a similar way as we did in Example 1. However, now you should experiment with different sizes
for P, R and S as well as consider how P and S interact with R.
(c) What conclusions do you draw from the results of these experiments?
6
4. Consider the Pure SQL Q8 which is an formulation of a variation of the
not superset, (not all) set semijoin query
{p.a | |P(p) ∧ R(p.a) 6⊇ S}
where
R(p.a) = {r.b | R(r) ∧ r.a = p.a}.
select p.a
from P p
where exists (select 1
from S s
where not exists (select 1 from R where p.a = r.a and r.b = s.b));
(a) Translate and optimize this query and call it Q9. Then write Q9 as
an RA SQL query just as was done for Q2 in Example 1.
(b) An alternative way to write a query equivalent with Q8 is as the
object-relational query
with nestedR as (select P.a, array_agg(R.b) as bs
from P natural join R
group by (P.a)),
Ss as (select array(select b from S) as bs)
select a
from P
where a not in (select a from nestedR) and
not((select bs from Ss) <@ ’{}’)
union
select a
from nestedR
where not((select bs from Ss) <@ bs);
Call this query Q10.
Compare queries Q8, Q9, and Q10 in a similar way as we did In
Example 1. However, now you should experiment with different sizes
for P, R and S as well as consider how P and S interact with R. For
example, it could be that
(c) What conclusions do you draw from the results of these experiments?
5. Give a brief comparison of your results for Problem 3 and Problem 4. In
particular, where the results show significant differences, explain why you
think that is the case. And, where the results show similarities, explain
why you think that is the case.
7
3 Object Relational Programming
The following problems require you to write object relational programs. Many
of these require program written in Postgres’ plpgsql database programming
language.
6. Practice Problem–not graded Consider the relation schema V(node
int) and E(source int, target int) representing the schema for storing a directed graph G with nodes in V and edges in E.
Now let G be a directed graph that is acyclic, i.e., a graph without cycles.
4
A topological sort of an acyclic graph G is a list of all nodes (n1, n1, . . . , nk)
in V such that for each edge (m, n) in E, node m occurs before node n in
this list. Note that a path can be stored in an array.
Write a PostgreSQL program topologicalSort() that returns a topological sort of G.
7. Consider a parent-child relation PC(parent, child). (You can assume
that PC is a rooted tree and the domain of the attributes parent and
child is int.) An edge (p, c) in PC indicates that node p is a parent of
node c. Now consider a pair of nodes (m, n) in PC (m and n maybe the
same nodes.) We say that m and n are in the same generation when the
distance from m to the root of PC is the same as the distance from n to
the root of PC.
Consider the following recursive query that computes the sameGeneration
relation:
WITH RECURSIVE sameGeneration(m, n) AS
((SELECT parent, parent FROM PC) UNION (select child, child from PC)
UNION
SELECT t1.child, t2.child
FROM sameGeneration pair, PC t1, pc t2
WHERE pair.m = t1.parent and pair.n = t2.parent)
select distinct pair.m, pair.n from sameGeneration pair order by m, n;
Write a non-recursive function sameGeneration() in the language plpgsql
that computes the sameGeneration relation.
4A cycle is a path (n0, . . . , nl) where n0 = nl
.
8
8. Consider the following relational schemas. (You can assume that the domain of each of the attributes in these relations is int.)
partSubpart(pid,sid,quantity)
basicPart(pid,weight)
A tuple (p, s, q) is in partSubPart if part s occurs q times as a direct
subpart of part p. For example, think of a car c that has 4 wheels w and
1 radio r. Then (c, w, 4) and (c, r, 1) would be in partSubpart. Furthermore, then think of a wheel w that has 5 bolts b. Then (w, b, 5) would be
in partSubpart.
A tuple (p, w) is in basicPart if basic part p has weight w. A basic part
is defined as a part that does not have subparts. In other words, the pid
of a basic part does not occur in the pid column of partSubpart.
(In the above example, a bolt and a radio would be basic parts, but car
and wheel would not be basic parts.)
We define the aggregated weight of a part inductively as follows:
(a) If p is a basic part then its aggregated weight is its weight as given
in the basicPart relation
(b) If p is not a basic part, then its aggregated weight is the sum of the
aggregated weights of its subparts, each multiplied by the quantity
with which these subparts occur in the partSubpart relation.
9
Example tables: The following example is based on a desk lamp with
pid 1. Suppose a desk lamp consists of 4 bulbs (with pid 2) and a frame
(with pid 3), and a frame consists of a post (with pid 4) and 2 switches
(with pid 5). Furthermore, we will assume that the weight of a bulb is 5,
that of a post is 50, and that of a switch is 3.
Then the partSubpart and basicPart relation would be as follows:
partSubPart
pid sid quantity
1 2 4
1 3 1
3 4 1
3 5 2
basicPart
pid weight
2 5
4 50
5 3
Then the aggregated weight of a lamp is 4 × 5 + 1 × (1 × 50 + 2 × 3) = 76.
(a) Write a recursive function recursiveAggregatedWeight(p integer)
that returns the aggregated weight of a part p. Test your function.
(b) Write a non-recursive function nonRecursiveAggregatedWeight(p
integer) that returns the aggregated weight of a part p. Test your
function.
10
9. Practice problem–not graded. Consider the heap data structure. For
a description, consult
https://en.wikipedia.org/wiki/Binary heap.
(a) Implement this data structure in PostgreSQL. This implies that you
need to implement the insert and extract heap operations.
In this problem, you are not allowed to use arrays to implement this
data structure! Rather you must you relations.
(b) Then, using the heap data structure developed in question 9a, write
a PostgreSQL program heapSort() that implement the Heapsort
algorithm. For a description of this algorithm, see
https://en.wikipedia.org/wiki/Heapsort
You are not allowed to use arrays to implement this the Heapsort
algorithm!
The input format is a list of integers stored in a binary relation Data(index,value).
For example, Data could contain the following data.
Data
index value
1 3
2 1
3 2
4 0
5 7
The output of heapSort() should be stored in a relation sortedData(index,value).
On the Data relation above, this should be the following relation:
11
10. Practice problem–not graded. Suppose you have a weighted (directed)
graph G stored in a ternary table with schema
Graph(source int, target int, weight int)
A triple (s, t, w) in Graph indicates that Graph has an edge (s, t) whose
edge weight is w. (In this problem, we will assume that each edge weight
is a positive integer.)
Below is an example of a graph G.
Graph G
source target weight
0 1 2
1 0 2
0 4 10
4 0 10
1 3 3
3 1 3
1 4 7
4 1 7
2 3 4
3 2 4
3 4 5
4 3 5
4 2 6
Implement Dijkstra’s Algorithm as a PostgreSQL function Dijkstra(s
integer) to compute the shortest path lengths (i.e., the distances) from
some input vertex s in G to all other vertices in G. Dijkstra(s integer)
should accept an argument s, the source vertex, and outputs a table which
represents the pairs (t, d) where d is the shortest distance from s to t in
graph G. To test your procedure, you can use the graph shown above.
When you apply Dijkstra(0), you should obtain the following table:
target shortestDistance
0 0
1 2
2 9
3 5
4 9
12
11. Consider the relation schema document(doc int, words text[]) representing a relation of pairs (d, W) where d is a unique id denoting a
document and W denotes the set of words that occur in d.
Let W denote the set of all words that occur in the documents and let t
be a positive integer denoting a threshold. Let X ⊆ W. We say that X is
t-frequent if
count({d|(d, W) ∈ document and X ⊆ W}) ≥ t
In other words, X is t-frequent if there are at least t documents that
contain all the words in X.
Write a PostgreSQL program frequentSets(t int) that returns the set
of all t-frequent set.
In a good solution for this problem, you should use the following rule: if X
is not t-frequent then any set Y such that X ⊆ Y is not t-frequent either.
In the literature, this is called the Apriori rule of the frequent itemset
mining problem. This rule can be used as a pruning rule. In other words,
if you have determined that a set X in not t-frequent then you no longer
have to consider any of X’s supersets.
To learn more about this problem you can visit the site
https://en.wikipedia.org/wiki/Apriori algorithm.
Test your function frequentSets for thresholds 0 through 5.
13
12. Consider a directed graph G stored in a relation Graph(source int,
target int). We say that G is Hamiltonian if G has a cycle (n1, . . . nk)
such that each node n in G occurs once, but only once, as a node ni
in
this cycle.
(a) Write a recursive function recursiveHamiltonian() that returns
true if the graph stored in Graph is Hamiltonian, and false otherwise. Test your function.
(b) Write a non-recursive function nonRecursiveHamiltonian that returns true if the graph stored in Graph is Hamiltonian, and false
otherwise. Test your function.
14
4 Key-value Stores (MapReduce and Spark)
Consider the document “MapReduce and the New Software Stack” available
in the module on MapReduce.5
In that document, you can, in Sections 2.3.3-
2.3.7, find descriptions of algorithms to implement relational algebra operations
in MapReduce. (In particular, look at the mapper and reducer functions for
various RA operators.)
Remark 1 Even though MapReduce as a top-level programming language is
only rarely used, it still serves as an underlying programming environment to
which other languages compile. Additionally, the programming techniques of
applying maps to key-value stores and reducing (accumulating, aggregating) intermediate and final results is an important feature of parallel and distributed
data processing. Additionally, the MapReduce framework forces one to reason
about modeling data towards key-value stores. Finally, the fact that the MapReduce programming model can be entirely simulated in the PostgreSQL objectrelational system underscores again the versatility of this system for a broad
range of database programming and application problems.
In the following problems, you are asked to write MapReduce programs that
implement some RA operations and queries with aggregation in PostgreSQL. In
addition, you need to add the code which permits the PostgreSQL simulations
for these MapReduce programs.
Discussion A crucial aspect of solving these problems is to develop an appropriate data representation for the input to these problems. Recall that in
MapReduce the input is a single binary relation of (key, value) pairs.
We will now discuss a general method for representing (encoding) a relational database in a single key-value store. Crucial in this representation is the
utilization of json objects.6
Consider a relation R(a,b,c). For simplicity, we will assume that the domain
of the attributes of R is integer.7
create table R (a int, b int, c int);
insert into R values (1,2,3), (4,5,6), (1,2,4);
table R;
a | b | c
-----+---+---
1 | 2 | 3
4 | 5 | 6
1 | 2 | 4
5This is Chapter 2 in Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman,
and Jeffrey D. Ullman.
6
Incidentally, this modeling technique is independent of MapReduce and can also be used
to map relational data to other systems and programming languages that center around json
objects.
7However, this approach can be generalized for other domains such as string, booleans,
etc.
15
Starting from this relation R we can, using jsonb8
functions and operations on
jsonb objects, come up with an encoding of R as a key-value store. Consider
the tuple
(1, 2, 3)
in R. We will represent (encode) this tuple as the key-value pair
(‘R’,{"a":1, "b":2, "c":3}).
So the key of this pair is the relation name ‘R’ and the jsonb object {"a":
1, "b":2, "c": 1} represents the tuple (1, 2, 3). Based on this idea of representing tuples of R, we can generate the entire key-value store for R using an
object-relational SQL query.9 To that end, we can use the jsonb build object
PostgreSQL function.
create table encodingofR (key text, value jsonb);
insert into encodingofR
select ’R’ as key, jsonb_build_object(’a’, r.a, ’b’, r.b, ’c’, r.c) as value
from R r;
This gives the following encoding for R.
table encodingofR;
key | value
-----+-----------------------------
R | {"a" : 1, "b" : 2, "c" : 3}
R | {"a" : 4, "b" : 5, "c" : 6}
R | {"a" : 1, "b" : 2, "c" : 4}
Note that we can also “decode” the encodingofR key-value store to recover R
by using the following object-relational SQL query. To that end, we can use the
jsonb selector function ->.
select p.value->’a’ as a, p.value->’b’ as b, p.value->’c’ as c
from encodingofR p;
a | b | c
-----+---+---
1 | 2 | 3
4 | 5 | 6
1 | 2 | 4
An important aspect of this encoding strategy is that it is possible to put
multiple relations, possible with different schemas and arities, into the same
key-value store. Besides R, let us also consider a binary relation S(a,d).
create table S (a int, d int);
insert into S values (1,2), (5,6), (2,1), (2,3);
table S;
8PostgreSQL support both json and jsonb objects. For this assignment, you should use the
jsonb object type since it comes with more functionality and offers more efficient computation.
9Notice that this strategy works in general for any relation, independent of the number of
attributes of the relation.
16
a | d
-----+
1 | 2
5 | 6
2 | 1
2 | 3
(4 rows)
We can now encode both R and S into a single key-value store encodingofRandS
as follows:
create table encodingofRandS(key text, value jsonb);
insert into encodingofRandS
select ’R’ as key, jsonb_build_object(’a’, r.a, ’b’, r.b, ’c’, r.c) as value
from R r
union
select ’S’ as key, jsonb_build_object(’a’, s.a, ’d’, s.d) as value
from S s
order by 1, 2;
table encodingofRandS;
key | value
-----+--------------------------
R | {"a": 1, "b": 2, "c": 3}
R | {"a": 1, "b": 2, "c": 4}
R | {"a": 4, "b": 5, "c": 6}
S | {"a": 1, "d": 2}
S | {"a": 2, "d": 1}
S | {"a": 2, "d": 3}
S | {"a": 5, "d": 6}
(7 rows)
Furthermore, we can decode this key-value store using 2 object-relational SQL
queries and recover R and S.
select p.value->’a’ as a, p.value->’b’ as b, p.value->’c’ as c
from encodingofRandS p
where p.key = ’R’;
a | b | c
-----+---+---
1 | 2 | 3
4 | 5 | 6
1 | 2 | 4
(3 rows)
select p.value->’a’ as a, p.value->’d’ as d
from encodingofRandS p
where p.key = ’S’;
a | d
-----+
1 | 2
5 | 6
2 | 1
2 | 3
(4 rows)
17
Example 2 Consider the following problem. Write, in PostgreSQL, a basic
MapReduce program, i.e., a mapper function and a reducer function, as well as
a 3-phases simulation that implements the set intersection of two unary relations
R(a) and S(a), i.e., the relation R ∩ S. You can assume that the domain of the
attribute ‘a’ is integer.
-- EncodingOfRandS;
drop table R; drop table S;
create table R(a int);
insert into R values (1),(2),(3),(4);
create table S(a int);
insert into S values (2),(4),(5);
drop table EncodingOfRandS;
create table EncodingOfRandS(key text, value jsonb);
insert into EncodingOfRandS
select ’R’ as key, jsonb_build_object(’a’, r.a) as value
from R r
union
select ’S’ as key, jsonb_build_object(’a’, s.a) as value
from S s
order by 1;
table EncodingOfRandS;
key | value
-----+----------
R | {"a": 1}
R | {"a": 4}
R | {"a": 2}
R | {"a": 3}
S | {"a": 4}
S | {"a": 5}
S | {"a": 2}
(7 rows)
-- mapper function
CREATE OR REPLACE FUNCTION mapper(key text, value jsonb)
RETURNS TABLE(key jsonb, value text) AS
$$
SELECT value, key;
$$ LANGUAGE SQL;
-- reducer function
CREATE OR REPLACE FUNCTION reducer(key jsonb, valuesArray text[])
RETURNS TABLE(key text, value jsonb) AS
$$
SELECT ’R intersect S’::text, key
WHERE ARRAY[’R’,’S’] <@ valuesArray;
$$ LANGUAGE SQL;
-- 3-phases simulation of MapReduce Program followed by a decoding step
WITH
Map_Phase AS (
SELECT m.key, m.value
FROM encodingOfRandS, LATERAL(SELECT key, value FROM mapper(key, value)) m
18
),
Group_Phase AS (
SELECT key, array_agg(value) as value
FROM Map_Phase
GROUP BY (key)
),
Reduce_Phase AS (
SELECT r.key, r.value
FROM Group_Phase, LATERAL(SELECT key, value FROM reducer(key, value)) r
)
SELECT p.value->’a’ as a FROM Reduce_Phase p
order by 1;
a
---
2
4
(2 rows)
We now turn to the problems for this section.
13. Practice problem–not graded. Write, in PostgreSQL, a basic MapReduce program, i.e., a mapper function and a reducer function, as well
as a 3-phases simulation that implements the symmetric difference of two
unary relations R(a) and S(a), i.e., the relation (R−S)∪(S−R). You can
assume that the domain of the attribute ‘a’ is integer.
14. Write, in PostgreSQL, a basic MapReduce program, i.e., a mapper function and a reducer function, as well as a 3-phases simulation that implements the semijoin of two relations R(A,B) and S(A,B,C), i.e., the relation
R n S. You can assume that the domains of A, B, and C are integer.
Use the encoding and decoding methods described above.
15. Practice problem–not graded. Write, in PostgreSQL, a basic MapReduce program, i.e., a mapper function and a reducer function, as well as a
3-phases simulation that implements the natural join R ./ S of two relations R(A, B) and S(B,C). You can assume that the domains of A, B, and
C are integer. Use the encoding and decoding methods described above.
16. Write, in PostgreSQL, a basic MapReduce program, i.e., a mapper function and a reducer function, as well as a 3-phases simulation that implements the SQL query
SELECT r.A, array_agg(r.B), sum(r.B)
FROM R r
GROUP BY (r.A)
HAVING COUNT(r.B) < 3;
Here R is a relation with schema (A, B). You can assume that the domains
of A and B are integers. Use the encoding and decoding methods described
above.
19
We now turn to some problems that relate to query processing in Spark.
Note that in Spark it is possible to operate on multiple key-value stores.
17. Let R(K, V ) and S(K, W) be two binary key-value pair relations. You
can assume that the domains of K, V , and W are integers. Consider the
cogroup transformation R.cogroup(S) introduced in the lecture on Spark.
(a) Define a PostgreSQL view coGroup that computes a complex-object
relation that represent the co-group transformation R.cogroup(S).
Show that this view works.
(b) Write a PostgreSQL query that use this coGroup view to compute
the semi join R n S, in other words compute the relation R ./ πK(S).
(c) Write a PostgreSQL query that uses this coGroup view to implement
the SQL query
SELECT distinct r.K as rK, s.K as sK
FROM R r, S s
WHERE NOT ARRAY(SELECT r1.V
FROM R r1
WHERE r1.K = r.K) && ARRAY(SELECT s1.W
FROM S s1
WHERE s1.K = s.K);
18. Practice problem–not graded. Let A(x) and B(x) be the schemas to
represent two set of integers A and B. Consider the cogroup transformation introduced in the lecture on Spark. Using an approach analogous to
the one in Problem 17 solve the following problems:10
(a) Write a PostgreSQL query that uses the cogroup transformation to
compute A ∩ B.
(b) Write a PostgreSQL query that uses the cogroup operator to compute
the symmetric difference of A and B, i.e., the expression
(A − B) ∪ (B − A).
10An important aspect of this problem is to represent A and B as a key-value stores.
20
5 Graph query languages
Each of the following problems is a practice problem.
19. Practice Problem–not graded. Consider the database schema Person,
Company, companyLocation, Knows, jobSkill, and personSkill.
(a) Specify an Entity-Relationship Diagram that models this database
schema.
(b) Specify the node and relationship types of a Property Graph for
this database schema. In addition, specify the properties, if any,
associated with each such type.
20. Practice Problem–not graded. Using the Property Graph model in
Problem 19b, formulate the following queries in the Cypher query language:
(a) Find the types of the relationships associated with Person nodes.
(b) Find each person (node) whose name is ‘John’ and has a salary that
is at least 50000.
(c) Find each jobSkill (node) that is the job skill of a person who knows
a person who works for ’Amazon’ and who has a salary that is at
least 50000.
(d) Find each person (node) who knows directly or indirectly (i.e., recursively) another person who works for Amazon.
(e) Find for each company node, that node along with the number of
persons who work for that company and who have both the Databases
and Networks job skills.
21