Problem Set 1: Distributed Atomic Commit, Gossip-based information dissemination, Failure Detection, and Group Membership [Note: About 1 sentence per point should be sufficient to adequately answer most of the questions. For example, you should be able to answer a 5-point question in about 5 sentences.] 1. [5 pts] Why is a 1-phase commit protocol “(or naïve protocol), where a coordinator simply asks participants to either commit or abort a transaction, not sufficient for distributed atomic commit? 2. [5 pts] How does the 2-phase commit protocol improve upon the 1-phase commit (or naïve protocol)? 3. [5 pts] A participant P in 2-phase commit protocol “wakes-up” into “READY” state after crashing. To decide whether to abort or commit the transaction P queries another participant Q about its state. If Q responds back with “INIT” what should P do and why? 4. [5 pts] What is the problem with 2-phase commit protocol that 3-phase commit protocol is designed to avoid? Please explain. 5. [5 pts] What are the 4 desirable properties of distributed failure detectors? Is it possible to provide all 4 properties in lossy networks? 6. [5 pts] In a ring-based heart-beating protocol (lecture 4, slide 19) how many simultaneous failures can be detected and how does that depend on parameters of the protocol? 7. [10 pts] In a modified ring-based heart-beating protocol, every process pi picks k other process at random in the beginning and asks them to send their heartbeats to it (pi) directly. a. Does this protocol provide completeness? If Yes, provide an informal proof. If No, provide a counter example. b. Does this protocol provide Accuracy? If Yes, provide an informal proof. If No, provide a counter example and state what level of Accuracy does this protocol provide. 8. [10 pts] In gossip-based failure detection why do we need to maintain two time thresholds – TFail and TCleanup? How should one set the value of TCleanup and why? (Hint: see the GOSSIP paper in the reading list) 9. [5 pts] In gossip-based failure detection what happens to Pmistake (false positive rate) as Tfail ,Tcleanup is increased? Explain your answer. 10. [10 pts] Show that worst-case load L* (messages per member) for failure detection is where T is desired time of detection, PM(T) is probability of mistake (false positive probability) and Pml is probability of message loss. (Hint: you can find this in the SWIM Analysis paper) 11. [5 pts] What is the insight in SWIM that makes it more efficient? 12. [10 pts] There are N processes and it is guaranteed that there will be no more than N/3 failures. Design a heart-beat protocol that guarantees completeness with minimum memory usage at the processes. 13. [10 pts] In a Gossip protocol with N processes, each process only maintains a membership list of size k. The k members at each process are selected uniformly at random across the entire group. Each message is gossiped to m randomly selected neighbors (from the membership list), where m < k, and m = O(log(N)), with the latter needed to ensure spread of gossip. It is argued that the overall “behavior” of this protocol (in terms of dissemination time of gossips, etc.) is the same as in the case where all processes might have had full membership lists (known everyone in the group), and each gossip was sent to m other processes. Is this argument right? If yes, then give a proof. If no, show why. 14. [10 pts] Show that the load at each node in SWIM protocol is constant. 15. [10 pts] Compute an expression for the number of infected nodes after each infection round when using i) push based gossip and ii) pull based gossip (similar to the expression on Slide 22, Lecture 2). Compare the infection rates and comment on their speed of infection – that is which process is faster and when? T PM T pml 1 . log( ) log( ( )) L* =