$29.99
B561 Assignment 6
Data Generation
Sorting
Indexing
(Draft: Version 1)
For this assignment, you will need the material covered in the lectures
• Lecture 15: External Merge Sorting
• Lecture 16: Indexing
• Lecture 17: B+ trees and Hashing
In addition, you should read the following sections in Chapter 14 and 15 in the
textbook Database Systems The Complete Book by Garcia-Molina, Ullman, and
Widom:
• Section 14.1: Index-structure Basics
• Section 14.2: B-Trees
• Section 14.3: Hash Tables
• Section 14.7: Bitmap Indexes
• Section 15.8.1: Multipass Sort-Based Algorithms
In the file performingExperiments.sql supplied with this assignment, we
have include several PostgreSQL functions that should be useful for running
experiments. Of course, you may wish to write your own functions and/or
adopt the functions in this .sql to suite the required experiments for the various
problems in this assignment.
The problems that need to be included in the assignment6.sql are marked
with a blue bullet •. The problems that need to be included in the assignment6.pdf
are marked with a red bullet •. (You should aim to use Latex to construct your
.pdf file.) In addition, submit a file assignment6Code.sql that contains all the
sql code that you developed for this assignment.
Practice problems should not be submitted.
1
1 Data generation
PostgreSQL functions and clauses In this assignment there will be a need
to do simulations with various-size relations. Many of these relations will have
randomized data. PostgreSQL provides useful functions and clauses that make
such relations:
random() returns a random real number between 0 and 1 using the uniform distribution
floor(random() ∗ (u − l + 1) + l) :: int returns a random integer in the range [l, u] using the uniform distribution
generate series(m, n) generates the set of integers in the range [m, n]
generate series(m, n, k) generates the set of integers in the range [m, n]) separated by step-size k
order by random() randomly orders the tuples that are the result of a query
limit(n) returns the first n tuples from the result of a query
offset(n) returns all but the first n tuples from the result of a query
limit(m) offset(n) returns the first m tuples from all but the first n tuples from the result of a query
row number() is a window function that assigns a sequential integer to each row in a result set
vacuum is a garbage collection function to clean and reclaim secondary memory space
For more detail, consult the manual pages
https://www.postgresql.org/docs/13/functions-math.html
https://www.postgresql.org/docs/current/functions-srf.html
https://www.postgresql.org/docs/current/queries-limit.html
https://www.postgresql.org/docs/8.4/functions-window.html
https://www.postgresql.org/docs/9.5/routine-vacuuming.html
In the file performingExperiments.sql supplied with this assignment, we
have include several PostgreSQL functions that should be useful for running
experiments. Of course, you may wish to write your own functions and/or
adopt the functions in this .sql to suite the required experiments for the various
problems in this assignment.
Generating sets To generate a set, i.e., a unary relation, of n randomly
selected integers in the range [l, u], you can use the following function:1
create or replace function SetOfIntegers(n int, l int, u int)
returns table (x int) as
$$
select floor(random() * (u-l+1) + l)::int from generate_series(1,n);
$$ language sql;
Example 1 To generate a unary relation with 3 randomly selected integers in
the range 5 to 10, do the following:
select x from SetofIntegers(3,5,10);
Of course, running this query multiple times, result in different sets.
1Typically the function SetOfIntegers returns a bag (multiset) but this is fine for this
assignment. In case we want a set, we can always eliminate duplicates.
2
Generating binary relations The idea behind generating a set can be generalized to that for the generation of a binary relation.2 To generate a binary
relation of n randomly selected pairs of integers (x, y) such x ∈ [l1, u1] and
y ∈ [l2, u2], you can use the following function:
create or replace function
BinaryRelationOverIntegers(n int, l_1 int, u_1 int, l_2 int, u_2 int)
returns table (x int, y int) as
$$
select floor(random() * (u_1-l_1+1) + l_1)::int as x,
floor(random() * (u_2-l_2+1) + l_2)::int as y
from generate_series(1,n);
$$ language sql;
Example 2 To generate a binary relation with 20 randomly selected pairs with
first components in the range [3, 8] and second components in the range [2, 11],
do the following:
select x, y from BinaryRelationOverIntegers(20,3,8,2,11);
Generating functions A relation generated by BinaryRelationOverIntegers
is in general not a function since it is possible that the relation has pairs (x, y1)
and (x, y2) with y1 6= y2. To create a (partial) function f : [l1, u1] → [l2, u2] of
n randomly selected function pairs, we can use the following function:
create or replace function
FunctionOverIntegers(n int, l_1 int, u_1 int, l_2 int, u_2 int)
returns table (x int, y int) as
$$
select x, floor(random() * (u_2-l_2+1) + l_2)::int as y
from generate_series(l_1,u_1) x order by random() limit(n);
$$ language sql;
Example 3 To generate a partial function [1, 20] → [3, 8] of 15 randomly selection function pairs, do the following:3
select x, y from FunctionOverIntegers(15,1,20,3,8);
2Clearly, all of this can be generalized to higher-arity relations.
3When using this function, it is customary to use n such that n ∈ [0, u1 − l1 + 1].
3
Generating relations with categorical (non-numeric) data Thus far,
the sets, binary relations, and functions have all just involved integer ranges.
It is possible to include ranges that have different types including categorical
data such as text strings. The technique to accomplish this is to first associate
with a categorical range an integer range that associate with each element in
the categorical range a unique value of the integer range. The next example
illustrates this.
Example 4 Consider the jobSkill relation and assume that it contents is
skill
AI
Databases
Networks
OperatingSystems
Programming
Suppose that we want to generate a personSkill(pid, skill) relation. Let
us assume that the pid’s are integers in the range [1, m].
There are 5 skills in the jobSkill and it is therefore natural to associate
with each skill a separate integer (index value) in the range [1, 5]. This can be
done with a query involving the row number() window function:
select skill, row number() over (order by skill) as index
from Skill;
The result is the following relation:
skill index
AI 1
Databases 2
Networks 3
OperatingSystems 4
Programming 5
Using this technique, we can write a PostgreSQL function that generates a
personSkill relation with n randomly selected (pid, skill) tuples, with pid’s
in the range [l, u]:
create or replace function GeneratepersonSkillRelation(n int, l int, u int)
returns table (pid int, skill text) as
$$
with skillNumber(skill, index) as (select skill, row_number() over (order by skill)
from Skill),
pS as (select x, y
from BinaryRelationOverIntegers(n,l,u,1, (select count(1) from Skill)::int))
select x as pid, skill
from pS join skillNumber on y = index
group by (x, skill) order by 1,2;
$$ language sql;
4
In this function, the skillNumber view associates with each job skill an
integer index in the range [1, |Skill|]. The pS view is a randomly generated
binary relation with n tuples, with pid’s in the range [l, u], and skill numbers
in the range [1, |Skill|]. The join operation associates the numeric range with
the Skill range. The ‘group by (x, skill) order by 1,2’ clause eliminates
duplicate tuples and orders the result.
The query
select * from GeneratepersonSkillRelation(10,1,15);
may make the personSkill relation:
pid skill
1 AI
2 Programming
3 Databases
4 Databases
6 Networks
6 OperatingSystems
6 Programming
9 Databases
14 Databases
14 Networks
Problems We now turn to the problems in this section.
1. • Given a discrete probability mass function P, as specified in a relation
P(outcome: int, probability: float), over a range of possible outcomes [u2, l2], design a PostgreSQL function
RelationOverProbabilityFunction(n, l1, u1, l2, u2)
that generates a relation of up to n pairs (x, y) such that
• x is uniformly selected in the range [l1, u1]; and
• y is selected in accordance with the probability mass function P in
the range [l2, u2].
An example of a possible P as stored in relation P is as follows:4
P
outcome probability
1 0.25
2 0.10
3 0.40
4 0.10
5 0.15
4Notice that the sum of the probabilities in the probability column in P sum to 1.
5
Note that when P is the uniform probability mass function, then
RelationOverProbabilityFunction
and
BinaryRelationOverIntegers
are the same binary relation producing functions.
Hint: For insight into this problem, consult the method of Inverse Transform Sampling for discrete probability mass functions.
Test your function for the cases listed in the Assignment-Script-2021-Fall-assignment6.sql
file.
2. Practice Problem-not graded.
Use the technique in Problem 1 and the method for generating categorical
data discussed above to write a PostgreSQL function that generates a
personSkill relation, given a probability mass function over the Skill
relation.
Your function should work for any Skill relation and any probability
distribution defined over it.
Test your function for the cases listed in the Assignment-Script-2021-Fall-assignment6.sql
file.
6
2 Sorting
We have learned about external sorting. The problems in this section are designed to look into this sorting method as it implemented in PostgreSQL.
3. • Create successively larger sets of n randomly selected integers in the
range [1, n]. You can do this using the following function and with the
following experiment.5
create or replace function makeS (n integer)
returns void as
$$
begin
drop table if exists S;
create table S (x int);
insert into S select * from SetOfIntegers(n,1,n);
end;
$$ language plpgsql;
This function generates a bag S of size n, with randomly select integers
in the range [1, n]. Now consider the following SQL statements:
select makeS(10);
explain analyze select x from S;
explain analyze select x from S order by 1;
• The ‘select makeS(10)’ statement makes a bag S with 10 elements;
• The ‘explain analyze select x from S’ statement provides the
query plan and execution time in milliseconds (ms) for a simple scan
of S;
• The ‘explain analyze select x from S order by 1’ statement provides the query plan and execution time in milliseconds (ms) for sorting S.
6
QUERY PLAN
------------------------------------------------------------------------------------------------------
Sort (cost=179.78..186.16 rows=2550 width=4) (actual time=0.025..0.026 rows=10 loops=1)
Sort Key: x
Sort Method: quicksort Memory: 25kB
-> Seq Scan on s (cost=0.00..35.50 rows=2550 width=4) (actual time=0.004..0.005 rows=10 loops=1)
Planning Time: 0.069 ms
Execution Time: 0.034 ms
Now construct the following timing table:7
5You should make it a habit to use the PostgreSQL vacuum function to perform garbage
collection between experiments.
6Recall that 1ms is 1
1000 second.
7
It is possible that you may not be able to run the experiments for the largest S. If that is
the case, just report the results for the smaller sizes.
7
size n of relation S avg execution time to scan S (in ms) avg execution time to sort S (in ms)
101
102
103
104
105
106
107
108
(a) What are your observations about the query plans for the scanning
and sorting of such differently sized bags S?
(b) What do you observe about the execution time to sort S as a function
of n?
(c) Does this conform with the formal time complexity of (external) sorting? Explain.
(d) It is possible to set the working memory of PostgreSQL using the set
work mem command. For example, set work mem = ’16MB’ sets the
working memory to 16MB.8 The smallest possible working memory
in postgreSQL is 64kB and the largest depends on you computer
memory size. But you can try for example 1GB.
Repeat question 3a for memory sizes 64kB and 1GB and report your
observations.
(e) Now create a relation indexedS(x integer) and create a Btree index on indexedS and insert into indexedS the sorted relation S.
9
create table indexedS (x integer);
create index on indexedS using btree (x);
insert into indexedS select x from S order by 1;
Then run the range query
select x from indexedS where x between 1 and n;
where n denotes the size of S.
Then construct the following table which contains (a) the average
execution time to build the btree index and (2) the average time to
run the range query.
8The default working memory size is 4MB.
9For information about indexes in PostgreSQL consult the manual page
https://www.postgresql.org/docs/13/indexes.html.
8
size n of relation S avg execution time to create index indexedS avg execution time for range query (in ms)
101
102
103
104
105
106
107
108
What are your observations about the query plans and execution
times to create indexedS and the execution times for sorting the
differently sized bags indexedS? Compare your answer with those
for the above sorting problems.
4. Practice problem-not graded.
Typically, the makeS function returns a bag instead of a set. In the problems in this section, you are to conduct experiments to measure the execution times to eliminate duplicates.
(a) Write a SQL query that uses the DISTINCT clause to eliminate duplicates in S and report you results in a table such as that in Problem 3a.
(b) Write a SQL query that uses the GROUP BY clause to eliminate duplicates in S and report you results in a table such as that in Problem 3a.
(c) Compare and contrast the results you have obtained in problems 4a
and 4b. Again, consider using explain analyze to look at query
plans.
9
3 Indexes and Indexing
Indexes on data (1) permit faster lookup on data items and (2) may speed
up query processing on such data. Speedups can be substantial. The purpose
of the problems in this section are to explore this. Some other problems in
this section are designed to understand the workings of the B
+-tree and the
extensible hashing data structures.
Discussion PostgreSQL permit the creation of a variety of indexes on tables.
We will review such index creation and examine their impact on data lookup
and query processing. For more details, see the PostgreSQL manual:
https://www.postgresql.org/docs/13/indexes.html
Example 5 The following SQL statements create indexes on columns or combinations of columns of the personSkill relation.10 Notice that there are
2
arity(personSkill) − 1 = 22 − 1 = 3
such possible indexes.
create index pid_index on personSkill (pid); -- index on pid attribute
create index skill_index on personSkill (skill); -- index on skill attribute
create index pid_skill_index on personSkill (pid,skill); -- index (pid, skill)
Example 6 It is possible to declare the type of index: btree or hash. When
no index type is specified, the default is btree. If instead of a Btree, a hash
index is desired, then it is necessary to specify a using hash qualifier:
create index pid_hash on personSkill using hash (pid); -- hash index on pid attribute
Example 7 It is possible to create an index on a relation based on a scalar
expression or a function defined over the attributes of that relation. Consider
the following (immutable) function which computes the number of skills of a
person:
create or replace function numberOfSkills(p integer) returns integer as
$$
select count(1)::int
from personSkill
where pid = p;
$$ language SQL immutable;
10Incidentally, when a primary key is declared when a relation is created, PostgreSQL will
create a btree index on this key for the relation.
10
Then the following is an index defined on the numberOfSkills values of
persons:
create index numberOfSkills_index on personSkill (numberOfSkills(pid));
Such an index is useful for queries that use this function such as
select pid, skill from personSkill where numberOfSkills(pid) > 2;
We now turn to the problems in this section.
5. • Consider a relation Student(sid text, sname text, major, year).
A tuple (s, n, m, y) is in Student when s is the sid of a student and n, m,
and y are that student’s name, major, and birth year. Further, consider
a relation Enroll(sid text, cno text, grade text). A triple (s, c, g)
is in Enroll when the student with sid s was enrolled in the course with
cname c and obtained a letter grade g in that course.
We are interested in answering queries of the form
select sid, sname, major, byear
from Student
where sid in (select sid
from Enroll sid
where cno = c [and|or|and not] grade = g);
Here c denotes a course name and g denotes a letter grade.
Read Section 14.1.7 ‘Indirection in Secondary Indexes’ in your textbook
Database Systems The Complete Book by Garcia-Molina, Ullman, and
Widom. Of particular interest are (a) the concept of buckets (Figure 14.7)
which are sets of tids and (b) the technique of performing set operations
(like intersections) on relevant buckets (Figure 14.8) to answer queries of
the form as shown above.
The goal of this problem is to use object-relational SQL to simulate these
concepts. To make things more concrete, consider the following Student
and Enroll relations:
Student:
sid | sname | major | byear
------+--------+---------+-------
s100 | Eric | CS | 1988
s101 | Nick | Math | 1991
s102 | Chris | Biology | 1977
s103 | Dinska | CS | 1978
s104 | Zanna | Math | 2001
s105 | Vince | CS | 2001
Enroll:
sid | cno | grade
------+------+-------
s100 | c200 | A
s100 | c201 | B
11
s100 | c202 | A
s101 | c200 | B
s101 | c201 | A
s101 | c202 | A
s101 | c301 | C
s101 | c302 | A
s102 | c200 | B
s102 | c202 | A
s102 | c301 | B
s102 | c302 | A
s103 | c201 | A
s104 | c201 | D
Now consider associating a tuple id (tid) with each tuple in Enroll:
tid | sid | cno | grade
-----+------+------+-------
1 | s100 | c200 | A
2 | s100 | c201 | B
3 | s100 | c202 | A
4 | s101 | c200 | B
5 | s101 | c201 | A
6 | s101 | c202 | A
7 | s101 | c301 | C
8 | s101 | c302 | A
9 | s102 | c200 | B
10 | s102 | c202 | A
11 | s102 | c301 | B
12 | s102 | c302 | A
13 | s103 | c201 | A
14 | s104 | c201 | D
Use object-relational SQL to construct two secondary indexes indexOnCno
and indexOnGrade on the Enroll relation. These indexes should be stored
in two separate relations which you can conveniently call indexOnCno and
indexOnGrade, respectively. two object-relational views in a manner that
simulates the situation illustrated in Figure 14.8. In particular, do not
use the ‘create index’ mechanism of SQL to construct these indexes.
Then, using the indexOnCno and indexOnGrade views and the technique of
intersecting buckets, write a function FindStudents(booleanOperation
text, cno text, grade text) that can be used to answer queries of the
form as shown above. (Here the booleanOperation is a string which can
be ’and’, ’or’, or ’and not’.)
For example, the query
select * from FindStudents(’and’, ’c202’, ’A’);
should return the same result as that of the query
select sid, sname, major, byear
from Student
where sid in (select sid
from Enroll sid
where cno = ’c202’ and grade = ’A’);
12
Test your solution for the following cases on the Student and Enroll
relation given for this problem:
(a) select * from FindStudents(’and’, ’c202’, ’A’);
(b) select * from FindStudents(’or’, ’c202’, ’A’);
(c) select * from FindStudents(’and not’, ’c202’, ’A’);
6. Practice problem–not graded. Read Section 14.7 ‘Bitmap Indexes’ in
your textbook Database Systems The Complete Book by Garcia-Molina,
Ullman, and Widom. In particular, look at Example 14.39 for an example
of a bitmap index for a secondary index.
Next, revisit Problem 5. There, we considered two secondary indexes
indexOnCno and indexOnGrade. We will now consider the corresponding
bitmap indexes bitmapIndexOnCno and bitmapIndexOnGrade:
bitmapIndexOnCno
cno | bit-vector
------+----------------
c200 | 10010000100000
c201 | 01001000000011
c202 | 00100100010000
c301 | 00000010001000
c302 | 00000001000100
and
bitmapIndexOnGrade
grade | bit-vector
-------+----------------
A | 10101101010110
B | 01010000101000
C | 00000010000000
D | 00000000000001
Use object-relational SQL to construct two secondary indexes bitmapIndexOnCno
and bitmapIndexOnGrade as two object-relational relations in a manner
that simulates the situation just illustrated above.
Then, using the bitmapIndexOnCno and bitmapIndexOnGrade relations
and the technique of forming the bitmap-AND, bitmap-OR, and bitmap-AND
NOT of two bit-vectors, write a function FindStudents(booleanOperation
text, cno text, grade text) that can be used to answer queries of the
form as shown in Problem 5.
For example, the query
select * from FindStudents(’and’, ’c202’, ’A’);
should return the same result as that of the query
select sid, sname, major, byear
from Student
where sid in (select sid
from Enroll sid
where cno = ’c202’ and grade = ’A’);
13
Test your solution for the following cases on the Student and Enroll
relation given for this problem:
(a) select * from FindStudents(’and’, ’c202’, ’A’);
(b) select * from FindStudents(’or’, ’c202’, ’A’);
(c) select * from FindStudents(’and not’, ’c202’, ’A’);
7. • Consider the following parameters:
block size = 4096 bytes
block-address size = 9 bytes
block access time (I/O operation) = 10 ms (micro seconds)
record size = 150 bytes
record primary key size = 8 bytes
Assume that there is a B+-tree, adhering to these parameters, that indexes
1 billion (109
) records on their primary key values.
Provide answers to the following problems and show the intermediate computations leading to your answers.
(a) Specify (in ms) the minimum time to determine if there is a record
with key k in the B+-tree. (In this problem, you can not assume that
a key value that appears in an non-leaf node has a corresponding
record with that key value.)
(b) Specify (in ms) the maximum time to insert a record with key k in
the B+-tree assuming that this record was not already in the data
file.
(c) How large must main memory be to hold the first two levels of the
B
+-tree? How about the first three levels?
14
8. • Consider the following B+-tree of order 2 that holds records with keys
2, 8, and 11.11
(a) Show the contents of your B+-tree after inserting records with keys
4, 10, 12, 1, 7, and 5 in that order.
Strategy for splitting leaf nodes: when a leaf node n needs to be
split into two nodes n1 and n2 (where n1 will point to n2), then use
the rule that an even number of keys in n is moved into n1 and an
odd number of keys is moved into n2. So if n becomes conceptually
k1|k2|k3 then n1 should be k1|k2 and n2 should be k3| and
n1 → n2.
(b) Starting from your answer in question 8a, show the contents of your
B+-tree after deleting records with keys 12, 2, and 11, in that order.
(c) Starting from your answer in question 8b, show the contents of your
B+-tree after deleting records with keys 5, 1, and 4, in that order.
11Recall that (a) an internal node of a B+-tree of order 2 can have either 1 or 2 keys values,
and 2 or 3 sub-trees, and (b) a leaf node can have either 1 or 2 key values.
15
9. • Consider an extensible hashing data structure wherein (1) the initial
global depth is set at 1 and (2) all directory pointers point to the same
empty bucket which has local depth 0. So the hashing structure looks
like this: Assume that a bucket can hold at most two records.
(a) Show the state of the hash data structure after each of the following
insert sequences:12
i. records with keys 6 and 7.
ii. records with keys 1 and 2.
iii. records with keys 9 and 4.
iv. records with keys 8 and 3.
(b) Starting from the answer you obtained for Question 9a, show the
state of the hash data structure after each of the following delete
sequences:
i. records with keys 9 and 6.
ii. records with keys 4 and 1.
iii. records with keys 2 and 8.
As in the text book, the bit strings are interpreted starting from their
left-most bit and continuing to the right subsequent bits.
12You should interpret the key values as bit strings of length 4. So for example, key value
7 is represented as the bit string 0111 and key value 2 is represented as the bit string 0010.
16
The goal in the next problems is to study the behavior of indexes for various
different sized instances13 of the Person, worksFor, and Knows relations and for
various queries:
10. • Create an appropriate index on the worksFor relation that speedups the
lookup query
select pid from worksFor where cname = c;
Here c is a company name.
Illustrate this speedup by finding the execution times for this query for
various sizes of the worksFor relation.
11. • Create an appropriate index on the worksFor relation that speedups the
range query
select pid, cname
from worksFor
where s1 <= salary and salary <= s2;
Here s1 and s2 are two salaries with s1 < s2.
Illustrate this speedup by finding the execution times for this query for
various sizes of the worksFor relation.
12. • Create indexes on the worksFor relation that speedup the multiple conditions query
select pid
from worksFor
where salary = s and cname = c;
Here s is a salary and c is a company name.
Illustrate this speedup by finding the execution times for this query for
various sizes of the worksFor relation.
13. • Create indexes on the appropriate relations that speedup the semi-join
[anti semi-join] query
select pid, pname
from Person
where pid [not] in (select pid from worksFor where cname = c);
13(Three different sizes, small, medium, large suffice for your experiment.
17
Here c is a company name.
Illustrate this speedup by finding the execution times for this query for
various sizes of the Person and worksFor relations.
14. Practice problem–not graded.
Create indexes that speedup the path-discovery query
select distinct k1.pid1, k3.pid2
from knows k1, knows k2, knows k3
where k1.pid2 = k2.pid1 and k2.pid2 = k3.pid1;
Illustrate this speedup by finding the execution times for this query for
various sizes of the Knows relation.
18