Message Atomicity over Scalable Reliable Multicast
Message Atomicity over Scalable Reliable Multicast
Multicast, while an efficient way to deliver messages to a large number of hosts, has several drawbacks in that the size and scale of the number of participating machines often inhibits the possibility of message semantics. In basic multicast, there are no message semantics. Computers that receive multicast messages typically do not ACK the sender’s message -- this avoids the ensuing ’ACK implosion’ when there are many machines present and many messages being sent. As the network grows crowded, packet delivery becomes more unreliable and it useful to have some guarantee or some sort of semantics from which to glean information about packet receipt. Other systems in the past have, controversially, attempted to accomplish the same thing [1,2,3]. We attempted to merely add an atomicity guarantee and we choose to implement this over an existing base layer of network functionality. This is where SRM and our layer comes in.
SRM  stands for scalable reliable multicast. SRM is a library developed to provide a basic layer of multicast send and receive functionality. Built as a low-level entity lying over TCP/IP, it contains no message semantics. It does, however, provide the guarantee that all hosts which desire the message will eventually receive the message -- that is, the message will eventually be delivered. Our layer over SRM, capitalizes on this guarantee to provide atomic message delivery for multicast messages.
The SRM library achieves scalable reliable multicast through a combination of several factors. Based closely on the IP multicast protocol’s group delivery model, SRM is scalable through a complete lack of group state. Any machine may broadcast a message to the entire group by simply messaging the multicast address -- there is no prior group membership information necessary to send a message. Consequently, sending is independent of group size. Receiving is also independent, as machines simply join, leave or crash as they desire, with no effect on the send and receive capabilities of other machines.
Reliability is accomplished through a combination of a unique sequence number and a heartbeat scheme. Each multicast message is marked with a unique sequence number and machines subscribing to multicast address periodically send out a heartbeat message containing the latest sequence number message it has received and a timestamp to determine the network size. By comparing a machine’s internally kept sequence number with the sequence numbers on the heartbeats of other machines, a machine determines whether it has missed a message. If so, it multicasts a repair request. The nearest machine (as calculated by comparing timestamp values) multicasts the missing data (to prevent a flood of repair requests). It is entirely possible for the repair requests to be dropped as well and it is the requesters responsibility to ensure that it receives a repair. Consequently, the repair and request chain may continue for some time, but should eventually succeed. Thus, at the cost of some bandwidth (for the heartbeats), SRM has made multicast reliable without sacrificing scalability.
Conventional atomicity is a notion taken from database work, where atomicity (the A in ACID) defined as “the property of transaction control that means either all or none of the statements in a transaction are executed ”. In the context of multicast networking, the corresponding notion can be defined as the property such that either all machines listening receive the message or none of the machines receive the message. SRM guarantees that messages will be eventually delivered to all listening machines; given this guarantee, our definition of atomicity is rather meaningless without additional constraints. However, the SRM message delivery guarantee is independent of time -- messages can take asymptotically long to reach the intended destination (particularly in light of the SRM message recovery mechanism). What we introduce here, is the concept of atomicity over time: either all listening machines will receive a given messages within a certain period of time, or no machines will receive the message.
This concept of atomicity over time requires two other important definitions: a clarification of the concept of “all listening machines” and a metric with which to measure “time”. Since a basic part of the atomicity we have defined is the idea of sending to “all machines” or no machines -- we need to create and maintain some sort of group state. Native SRM contains no concept of group state -- it makes no attempt to keep track of which machines will receive the messages broadcast to the multicast address. Any machine joins the “group” by subscribing the multicast address. Machines leaving and joining have no effect on other group members. We, however, must maintain group state to track which machines in the group received the atomic message and which did not. We therefore defined the relevant group to be the group state held by the sender at the time of sending.
Getting a concrete metric with which to measure time is somewhat difficult in networking context as well. Native SRM calculates network size based a rough concept of network time derived from the packet timestamps with adjustments for differences in machine time. We, on the other hand, make use the heartbeat mechanism in our layer and use the notion of “heartbeat time”. That is, we define a period of time with the number of heartbeats which pass during that time. So if we sent out a heartbeat every second, a period of five heartbeats would be equivalent to a period of five seconds. It would be a fairly straightforward conversion from a standard time period set by the user to heartbeat time, although the conversion would not be strictly accurate (and it would be lower bounded by the minimum amount of time between heartbeats).
SRM and our algorithm for atomic message delivery rest upon several assumptions about the state of the group to which messages are being sent. Many of these assumptions are assumptions made by SRM, which we inherit.
At least one computer is in non-failure mode at all times.
Lacking a stable and reliable source of recovery information, the basic SRM recovery scheme (where machines query each other for the missing information) will not work. This is an assumption that comes from the SRM layer.
Message will be heard by at least one stable computer.
SRM relies on the fact that at least one stable computer will hear a broadcasts message. This assumption is tied in to the first assumption. It is conceivable that a sender could join the group, send a message and then crash. If all other machines in the group drop the message, there are no machines on the network which are aware of that message’s existence and the message will be lost. Under normal circumstances, the sender’s heartbeat would evidence the lost message and the sender would restore the lost information to the other machines. Again, this is an assumption which is inherent to the SRM layer.
Crashed computers will eventually recover
Our atomicity algorithm guarantees that all machines will receive the message with time t and that sometime after, upon receipt of the confirm message, we will pass that message up to the application layer. We implement this with a two-phase commit. If a computer crashes after the first phase but before committing the message and passing it up to the data, it needs to eventually recover in order to keep the atomicity intact. This is predicated on the idea that the crashed computer will eventually recover and rejoin the session. (alternatively, we could have changed our definition of atomicity to be atomic for all hosts that eventually recover)
5. Group State
To track group state, each machine subscribing to or unsubscribing from a multicast address through our layer multicasts a join or leave message. This is really an optimization, since group state can be detected from the heartbeats after a certain period of time (with stable group membership). As a further optimization, we have the closest computer (computed by the same timestamp scheme as SRM) send its copy of the group state to the joining computer. Every computer leaving the group also sends out a leave message. Upon receipt of the leave message, the other machines remove the corresponding machine entry from the group state. This mechanism serves to distinguish voluntarily exiting machines from crashing machines, which are detected by the non-receipt of heartbeats for a certain period of time (current default is five). The machine which notices the lapse in heartbeats will then broadcast a boot message to the rest of the group. This is to avoid accidentally booting. If the booted machine is still on the network, it should receive the boot message as well, and issue the correcting join message. Given that slow network conditions could easily start a cycle where computer A and computer B have a message latency of this ”boot period” and computer A repeatedly boots computer B or vice versa, we can have the machines dynamically adjust this “boot period” based on average network roundtrip time. Roundtrip time is determined from the message timestamps in the same manner that the underlying SRM calculates network distances.
6. Atomic Message Delivery
The atomic message delivery itself is handled through a standard two-phase commit algorithm (again, taken from a database context). The general message send and receive algorithm is outlined as follows:
When a host sends message data to the multicast group it takes a “snapshot” of its recorded group state. These are the computers to which the message must be delivered to hold up the atomicity. Since atomicity requires some method of tracking message receipt, we use a sequence of acknowledgement messages upon receipt. Messages may have two values YES or NO. A machine which sends a YES is making a commitment to deliver the message to the application layer upon receipt of a CONFIRM message. In order to be able to carry out this commitment, the message data is written to a file before the YES is sent so that it can be recovered (more on this later). A host will send a NO when it is unable to write the message to the file. This situation is somewhat unlikely, but could occur if, say, the machine was out of disk space or could detect some sort of instability. Upon receipt of YES messages from all the computers in the group snapshot within the atomic period of time, the sender will multicast a CONFIRM message. A NO message from any machine in the group will result in an ABORT message being sent. The delivery of the CONFIRM and ABORT message is guaranteed under SRM. (What this really implies is that our layer must cache all CONFIRM and ABORT messages for the session in order to pass on the information in response to repair requests.)
When a host rceives an ABORT, the atomicity layer quietly drops the message data without sending it to the application.
As a part of the CONFIRM, the sender issues a list of computers in the sender’s group snapshot for this message. When a host receives a CONFIRM for a message, it checks to see if it received the message in question. If so, it passes the message data up to the application layer. If not, it quietly drops the message data without sending it to the application. This was an algorithmic decision to provide a tighter form of atomicity. Since the behavior of non-group members is largely irrelevant to the atomicity of the group, we could also ease this limitation and allow non-members to receive and pass on the message.
It is possible that a machine in the group snapshot will crash without responding to a message (no YES or NO). The sender uses heartbeats and the boot messages to see if any machines in the group snapshot have dropped off the network or crashed. If the sender detects a silent machine, it will send an ABORT for the current message.
It is possible that a host may crash after sending a YES but before receiving the corresponding CONFIRM. If this happens, the sender only knows that the message was received and needs to assume that atomicity is still intact. This is where writing the message data to a file comes in. When this host eventually recovers (one of the assumptions above), it loads all messages pending in the file and then recovers the missing status message (CONFIRM or ABORT) from another machine. It will then pass the data to the application or drop the data as appropriate.
We implemented this functionality though a series of callback functions hooking in to the SRM API. SRM provides application callbacks for message recovery, message receipt, and message restoration. Our layer exposes a similar but simpler API (we remove some of the SRM specific concepts from the interface) containing allowing the user to set several callback functions -- of primary interest is the message receipt callback. We then hand our callback functions to SRM and then call the user callbacks from our callbacks when appropriate. Frustratingly enough, SRM does not allow a clientdata hook into its callback functions -- forcing us to save state globally (this results in a single source per session limitation which the original SRM lacked -- this appears to be a simple oversight in their alpha release). Heartbeats and message timeouts are handled via the SRM timer class, which allows us to set out periodic events. Another frustration we had with the SRM API was the inaccessibility of the native SRM heartbeat, which already carries all the information that we need from our heartbeat. Thus, we were forced to implement our own heartbeat over SRM.
We ended up with a layer which translated the base layer of scalable reliable multicast into an atomic multicast layer of questionable reliability. Upon reflection we conclude that losing scalability is inevitable once the goal of atomicity is adopted. Since this form of atomicity relies on a clear definition the group of message recipients and some form of tracking of message receipt, each sender must in some way keep track group state. As the size of the group grows, the size of the data maintained by each host grows. Keeping track of group state individually would not withstand an order of magnitude growth of participating machines on the multicast address. An additional limitation on the scalability of our scheme is the mechanism where each CONFIRM message carries with it the current group state (so that no newly joining computers accidentally receive and act upon an errant message). Again, easing up on this condition of our atomicity would maintain the same all or nothing guarantee while increasing the scalability of the atomicity to a certain degree. However, even without the group snapshot being bandied about in multicast CONFIRM messages, the individual group state maintenance is one concrete obstacle to true scalability. An additional obstacle is the sheer bandwidth taken up by the ACKs required to track message receipt. As the group grows larger, we must wait on a greater number of YES and NO messages and therefore must realistically adjust the wait time for an atomic message to reflect realistic network conditions. While the basic SRM recovery scheme will decrease the number of lost messages in a single multicast and even with adjusting the atomic time as we go -- a large number of hosts will grow the number of message aborts.
As a direct result of the greater number of message aborts, we have a layer with questionable reliability. We will maintain reliable atomicity regardless of network scale, but actual effective message delivery will suffer at the higher ends. If we define reliability in terms of the number of messages received by the application layer versus the number of messages sent by the application layer, our scheme is not terribly reliable.
Yet another source of inefficiency is the clunky SRM interface (to be forgiving, we did work on an alpha release) which reduced our layer’s capabilities (one source per session) and forced us to implement a second heartbeat on top of the native SRM heartbeat, thus doubling the network traffic caused by heartbeats in any given moment. Hiding the heartbeat in the lower level can be considered a direct transgression to the End-to-End argument  since useful information is being implemented at such a low a level as to be useless to the higher level application.
Based on our work, we concluded that implementing atomicity over an existing network layer such as SRM reduced the efficiency of the atomic layer. Our implementation of atomicity would have been better integrated into a lower level layer. However, what we really discovered that atomicity over multicast is probably a bad idea in itself, since the group state and message receipt tracking reduces scalability -- scalability being a crucial feature to applications expecting to use the full potential of multicast. We also conclude that the SRM API could use some redesign so be more useful to applications intending to build on top.
 K. Birnman, “A Response to Cheriton and Skeen’s Criticism of Causal and Totally Ordered Communication”, Operating Systems Review, 28(1):11-21 , January 1994. http://cs-tr.cs.cornell.edu/Dienst/UI/2.0/Contents/ncstrl.cornell/TR93-1390
 K.Birman, A. Schiper, and P. Stephenson, “Lightweight Causal and Atomic Group Multicast”, ACM Transactions on Computer Systems, Vol.9, No.3, pp. 272-314, Aug.1991
 D. Cheriton and D. Skeen, “Understanding the Limitations of Causally and Totally Ordered Communication”, Proceedings of the 14th Symposium on Operating System Principles, ACM, December 1993.
 S. Floyd, V. Jacobson, C.Liu, S. McCanne, and L. Zhang. “A Reliable Multicast Framework for Lightweight Sessions and Application Level Framing” IEEE/ACM Transactions on Networking, Vol.5 No. 6 pp.784-803, Dec 1997
 J. Saltzer, D. Reed, and D. Clark. “End-toEnd Arguments in System Design” ACM Transactions on Computer Systems. Vol. 2. No 4. pp. 277-288, Nov 1984.