Starting from:

$29

Homework 4 Continuation of bitmap indexes

Page 1 of 4

Objectives: Additional exercises per contemporary storage management
Relational databases and B+trees are well covered in the text and past homeworks, thus:
The Assignment:
1. Continuation of bitmap indexes, exercise 14.7.3, as they pertain to column stores.
Correction: This problem concerns compressed bitmaps. The objective is to compare
the storage needed by a column store to the storage needed for a conventional RDBMS.
Thus, the question is a continuation of 14.7.3b, which was not assigned. (Makes it tough
for this to be a continuation.). Use the following solution to 14.7.3b. This solution makes
a simplifying assumption, to get it to you quickly.
Sol’n 14.7.3b
The problem states that a value appears, in round robin fashion every m rows. Ignoring
the first run, (the first 1 in the bitmap), there will be a run of m-1 zero’s and a 1repeated
many times. Such a run will require 2 élog2(m-1)ù. Given there are m different values,
and 1,000,000 rows, each value will appear ~1,000,000/m times, meaning the run will
Homework 4 V2.0 CS386D Database Systems
Created on 2/24/20
Version: 2.0
Page 2 of 4
repeat that many times. Thus the total number of bits is, approximately
m * 2 élog2(m-1)ù * (1,000,000/m) = 2,000,000 élog2(m-1)ù
This solution ignores the easy to identify collection of m different runs that precede the
regularly sized (round robin) runs, (1, 01, 001, …, m-1 zeros 1). Related to that I have not
properly assessed if there is a fencepost problem. But these are second order effects such
that the above is sufficient for you to do the homework and achieve its learning objective.
_____________________________________________________
Consider two tables, S and T, of 1,000,000 rows and 100,000,000 rows respectively.
As may prove convenient, n is the number of rows in a table. Further properties of S
and T are identical starting with, S and T have 101 columns, including a column of
type integer that serves as the primary key. Let the columns be named ci,
i = 0 .. 100. Similarly, let mi be the number of different values in column ci. Assume
all the following questions are in the context of storing either S or T, in their entirety
in a column store database. Recall from lecture, that means data values are not stored
directly, but rather each data value, (per column), is represented by a pair, (datavalue, compressed-bitmap). E.g. (“Austin”, 1011110100….). Let et c0 be the primary
key. Clearly, m0 = n
a) Develop an expression that, given all mi, will determine the number of bytes
needed for all the bitmap indexes to needed to store T (just the bitmaps).
b) Does your answer to part a need to include a bitmap index for c0. Hint: review the
textbook per the implementation of bitmap indexes.
c) Suppose, for all columns i, i ¹ 0, the number of values in half of the columns is
mi = n/1000 and the other half, mj = 10,000. How many bytes will be required to
the bitmaps for i) S, ii) T?
d) Suppose data, c0, requires 4 bytes, the columns where mi = n/1000 required an
average of 25 bytes, and the columns st. mj = 10,000 require and average of 20
bytes. In a conventional RDBMS (row store) how many bytes is required to store
just the data table i) S, ii) T. If data pages contain 4kbytes, how many data pages
are required to store iii) S, iv) T
2. Loading and filtering a key, value store.
Consider loading a large file into a cloud-native key, value store. Figure from course
notes repeated below.
Page 3 of 4
The storage system will hash each key to an integer. Servers are logically numbered, and
the hash result determines which server is responsible for storing the key value pair1
. At
the data level, the contents of each server is commonly referred to simply as a shard. It is
not clear if shard and partition mean different things. That said, in most common usage,
within each shard there is still a notion of disk block, a unit of transfer between the server
and its disks. But compared to the relational literature where page and block are used
interchangably, in these contemporary distributed databases a unit of storage is most
commonly called a block.
Sharding assures that all keys, in all blocks in a server will contain values whose key
hashed to the same server. Blocks are large. In Cloudera’s standard release of HDFS the
administrator may choose 64Mbytes or 128Mbytes as the block size. Also, recall blocks
are compressed before they are stored on disk. Compression ratios of 10 to 100 can be
expected.
Suppose it is your job to (optimally) configure a Bloom filter-based index for the blocks
of a Cloud database so that individual blocks are fetched only if the key appears in the
Bloom filter.
Assume,
• The database is 8Tbytes (8 terabytes = 8 x 240).
• The logical partitioning of the database is to 128 shards. i.e. there are 128 servers.
For this problem we will ignore that shards are replicated on different servers to
support fault tolerance, (durability).
• Suppose, on average, a key value pair requires 1024 bytes of storage. (keys must
be stored with the values. Why?)
1 Actually, since storage is replicated for fault tolerance (an implementation of durability), the
hash value determines a set of servers, not a single server, that will store redundant copies of the
data. For this question, assume there is no redundancy, and thus no fault tolerance. It won’t be
important.
Page 4 of 4
• Let blocks be 64Mbytes.
a) Why must keys be stored with the values?
b) Have the Bloom filter use 10 bits per element expected to be stored.
i. How many bits will be required for each filter?
ii. How much memory, on average, will be required, per server, to represent
the Bloom filters?
iii. What is the optimal number of hash functions?
iv. What is the probability of a false positive?

More products