Chat with us, powered by LiveChat Problem statement: what kind of problem is presented by the authors and why this problem is important? Approach & Design: briefly describe the approach designed by the a - Writeedu

Problem statement: what kind of problem is presented by the authors and why this problem is important? Approach & Design: briefly describe the approach designed by the a

  1. Problem statement: what kind of problem is presented by the authors and why this problem is important?
  2. Approach & Design: briefly describe the approach designed by the authors
  3. Strengths and Weaknesses: list the strengths and weaknesses, in your opinion
  4. Evaluation: how did the authors evaluate the performance of the proposed scheme? What kind of workload was designed and used?
  5. Conclusion: by your own judgement.

Optimistic Total Order in Wide Area Networks! António SOUSA , José PEREIRA, Francisco MOURA, Rui OLIVEIRA

Universidade do Minho, Portugal {als,jop,fsm,rco}


Total order multicast greatly simplifies the implementa- tion of fault-tolerant services using the replicated state ma- chine approach. The additional latency of total ordering can be masked by taking advantage of spontaneous order- ing observed in LANs: A tentative delivery allows the ap- plication to proceed in parallel with the ordering protocol. The effectiveness of the technique rests on the optimistic as- sumption that a large share of correctly ordered tentative deliveries offsets the cost of undoing the effect of mistakes. This paper proposes a simple technique which enables

the usage of optimistic delivery also in WANs with much larger transmission delays where the optimistic assumption does not normally hold. Our proposal exploits local clocks and the stability of network delays to reduce the mistakes in the ordering of tentative deliveries. An experimental evalu- ation of a modified sequencer-based protocol is presented, illustrating the usefulness of the approach in fault-tolerant database management.

1. Introduction

Total order multicast greatly simplifies the implementa- tion of fault-tolerant services using the replicated state ma- chine approach [25]. By ensuring that deterministic replicas handle the very same sequence of requests from clients, it is ensured that the state is kept consistent and the interaction with clients is serializable [12]. A particularly interesting application is the database state machine [21] which allows high performance replication of transactional databases. Implementation of total order multicast is however more

costly than other forms of multicast due to the unavoidable additional latency. For instance, in a sequencer based pro- tocol [5, 16] all processes (except the sequencer itself) have to wait for the message to reach the sequencer and for the sequence number to travel back before the message can be delivered. On the other hand, protocols based on causal history [18,

23, 10] can provide latency proportional to the interarrival delay of each sender and thus lower latency than sequencer based protocols. However, when each sender has a large in-

!Research supported by FCT, ESCADA proj (POSI/33792/CHS/2000).

terarrival time and low latency is desired, this requires the introduction of additional control messages. This is espe- cially unfortunate in large groups and in wide area networks with limited bandwidth links. In some protocols, such as those based on consensus [7,

4] or on a sequencer [5, 16], the total order decided is the spontaneous ordering of messages as observed by some pro- cess. In addition, in local area networks (LANs) it can be observed that the spontaneous order of messages is often the same in all processes. The latency of total order pro- tocols can therefore be masked (not reduced) by tentatively delivering messages based on spontaneous ordering, thus allowing the application to proceed the computation in par- allel with the ordering protocol [17]. Later, when the total order is established and if it confirms the optimistic order- ing, the application can immediately use the results of the optimistic computation. If not, it must undo the effects of the computation and restart it using the correct ordering. The effectiveness of the technique rests on the assumption

that a large share of correctly ordered tentative deliveries offsets the cost of undoing the effects of mistakes. This is unfortunate as this makes optimistic delivery useful only in LANs where the latency is much less of a problem than in wide area networks (WANs). This paper proposes a simple protocol which enables op-

timistic total order to be used in WANs with much larger transmission delays where the optimistic assumption does not normally hold. Our proposal exploits local clocks and the stability of network delays to reduce the mistakes in the ordering of tentative deliveries by compensating the vari- ability of transmission delays. This allows protocols which are based on spontaneous ordering to fulfill the optimistic assumption and thus mask the latency. An experimental evaluation of the technique is presented

using a sequencer-based protocol, illustrating the useful- ness of the approach in fault-tolerant database management. When applied to the sequencer based protocol, our tech- nique does not introduce additional messages. The only overhead is that of an additional integer piggybacked on data messages. This compares favorably with both plain sequencer based and causal history based protocols. The paper is structured as follows. The next section re-

calls the problems of total order and optimistic total order multicasts, as well as the reasons preventing spontaneous


total order in wide are networks. Section 3 introduces the intuition underlying our proposal and presents a protocol providing optimistic delivery of messages based on a fixed- sequencer total order multicast protocol. In Section 4 we evaluate the performance gains of our approach. In Sec- tion 5 we discuss the paper contribution in general settings as well as applied to a specific application. Section 6 con- cludes the paper.

2. Background

2.1. Totally ordered multicast

Informally, totally orderedmulticast (or atomicmulticast) ensures that no pair of messages is delivered to distinct des- tination processes in different order. Totally ordered multi- cast greatly simplifies the implementation of fault-tolerant services using the replicated state machine (or active repli- cation) approach [25, 12]: By delivering exactly the same messages in the same order to a set of deterministic repli- cas, their internal state is kept consistent. More formally, we consider an asynchronous message

passing system composed of a finite set of sequential pro- cesses communicating over a fully connected reliable point- to-point network [7]. Processes do not have access to shared memory or to a global clock. A process may only fail by crashing and once a process crashes it does not recover. A process that never crashes is said correct. Totally ordered multicast is defined by primitives to-multicast(m) and to- deliver(m), and satisfies the following properties [14]: Validity. If a correct process to-multicasts a message m,

then it eventually to-deliversm. Agreement. If a correct process to-delivers a messagem,

then every correct process eventually to-deliversm. Integrity. For every messagem, every process to-delivers

m at most once, and only if m was previously to- multicast.

Total Order. If two correct processes to-deliver two mes- sagesm andm!, then they do so in the same order.

Total order multicast has been shown to be equivalent to the generic agreement problem of consensus [7]. Therefore we must assume that in our system the consensus problem is solvable [11, 7], requiring that a majority of processes is correct and that failure detection is of class "S [6]. In some protocols, consensus is explicitly invoked to decide the message sequence [7, 4]. In others, consensus is implicit in a group membership service which supports the actual ordering protocol [5, 15, 10]. There is a plethora of total order protocols for asyn-

chronous message passing systems which can be classi- fied according to several criteria [9]. Namely, some or- der the message while disseminating it [3, 2, 15]. Others take advantage of an existing unordered multicast proto- col [5, 16, 7] and work in two stages: First, messages are

p1 • seq(m1) !!

seq(m1) ""


p3 • DELIV(m1) ##

p2 MCAST(m1)



&&!!!! • DELIV(m1) ##

Figure 1. Sequencer based total order protocol.

disseminated using a reliable multicast protocol Then, an ordering protocol is run to decide which is the correct de- livery sequence of buffered messages. This results in addi- tional latency, when compared to reliable multicast. An example of a protocol often used in group commu-

nication toolkits is the sequencer [5, 16], which uses con- sensus implicitly in the view-synchronous reliable multi- cast protocol used to disseminate messages previously to ordering them. As depicted in Figure 1, a data message is disseminated using unordered reliable multicast. Upon re- ception (depicted as a solid dot), the message is buffered until a sequence number for it is obtained. A single pro- cess (p1 in the example) is designated as the sequencer: it increments a counter and multicasts its value along with the original message’s identification to all receivers as a control message. Data messages are then delivered according to the sequence numbers. A group membership protocol is used to ensure that for any given data message there is exactly one active sequencer. Besides being a very simple protocol, it offers several ad-

vantages, especially in networks with limited bandwidth or in large groups with large and variable message interarrival times: it requires at most a single additional control mes- sage for each data message and any message can always be delivered after two successive message transmission de- lays. The basic protocol is also easily modified to cope with higher message throughput by batching sequence numbers for several messages in a single one [5], reducing the num- ber of control messages at the expense of higher latency.

2.2. Optimistic total order

A reliable multicast protocol can deliver a message af- ter a single transmission delay from the originator to the receiver. This contrasts with the latency of totally ordered multicast1 which is either twice as large, when using a se- quencer based protocol, or proportional to message interar- rival delay in protocols using causal history. However: • Some protocols, such as the sequencer, produce an or- dering which is the spontaneous ordering observed by some process.

• In local area networks, it can be observed that the spon- taneous ordering of message reception of all processes

1Except in the degenerate situation where a single process is multicas- ting and it can assume the sequencer role.


is often very similar, therefore, similar to the final or- dering decided by the sequencer.

Nevertheless, delivery incurs always in the additional la- tency. The optimistic atomic broadcast protocol [22] takes this in consideration to improve average delivery latency of a consensus based total order protocol. Further latency improvements can be obtained if the ap-

plication itself can take advantage of a tentatively ordered delivery. This is called optimistic delivery [17, 26] as it rests on the optimistic assumption that reliable multicast sponta- neously orders messages. It also implies that eventually an authoritative total order is determined, leading to a confir- mation or correction of previously used delivery order. To the interval between the optimistic delivery and the author- itative delivery we call optimistic window. It is during this interval that the application can optimistically do some pro- cessing in advance. To define optimistic total order multicast we use two

different delivery primitives an optimistic opt-deliver(m) that delivers messages in a tentative order and a final fnl- deliver(m) that delivers the messages in their final, or au- thoritative, order. Optimistic total order multicast satisfies the following properties [26]: Validity. If a correct process to-multicasts a message m,

then it eventually fnl-deliversm. Agreement. If a correct process fnl-delivers a messagem,

then every correct process eventually fnl-deliversm. Integrity. For every message m, every process opt-

delivers m only if m was previously multicast; and every process fnl-deliversm only once, and only if m was previously multicast.

Local Order. No process opt-delivers a message m after having fnl-deliveredm.

Total Order. If two processes fnl-deliver two messagesm andm!, then they do so in the same order.

An example of such an application is the database state machine [21] which allows high performance replication of transactional databases and works as follows: transactions are executed optimistically by any of the replicas without locking. The resulting read and write sets are then multicast to all replicas which perform a deterministic certification to ensure that the transaction does not conflict with concur- rent transactions already committed. Total order multicast is used to ensure that the result of the certification process is identical in all replicas, thus ensuring consistency. If the or- der of messages is known in advance by optimistic delivery, this can be used to speed up the certification [17]. Notice that if the optimistic ordering turns out to be

wrong, the application has to undo the effect of any pro- cessing it might have done. Therefore, the net advantage of optimistic delivery depends on the balance between the cost of a mistake and the ratio of correctly ordered optimistic de- liveries. In the database state machine, being able to undo

the effects of optimistic delivery just means than the trans- action cannot be effectively committed until authoritative delivery. When the optimistic delivery is wrong, there is a performance penalty: The processing resources used have been wasted. The tradeoff is thus similar to the one involved in the de-

sign of cache memories. However, the protocol designer has no possibility to reduce the cost of a mistake, as this depends solely on the application. The only option is thus to try to maximize the amount of messages which are deliv- ered early but correctly ordered.

2.3. Obstacles to spontaneous total order

A high ratio of spontaneously totally ordered messages which results in good performance of optimistic applica- tions is not trivially achieved, especially in wide area net- works. One reason for this is loopback optimization in the operating system’s network stack. Noticing that the out- going packet is also to be delivered locally, the operating system may use loopback at higher layers of the protocol stack and immediately queue the message for delivery. This allows it to be delivered in advance of packets from other senders which have reached the network first. Another reason for out of order delivery lies in the net-

work itself. Although not frequent, there is a possibility that packets are lost by some but not all destinations. A reliable multicast protocol detects the occurrence and issues a re- transmission. However, the delay introduced opens up the possibility of other packets being successfully transmitted while retransmission is being performed. An additional issue is the complexity of the network

topology. Different packets can be routed by different paths, being therefore subject to different queuing delays or even to being dropped by congested routers. This is especially noteworthy when there are multiple senders. Receivers which are nearer, in terms of hops, to one of them will re- ceive its messages first. Receivers which are nearer of an- other will possibly receive messages in the opposite order. Notice however that bad spontaneous order in wide area

networks is not attributable to large delays themselves, but to the fact that the delays to different destinations are likely to be different, often by two orders of magnitude. Consider Figure 2(a). Messages m1 and m2 are multicast to three different processes, including the senders themselves. The time taken to transmit each message varies with the recipi- ent, for instance, transmission to the sender itself (typically hundreds of microseconds by loopback) takes less time than transmission to other processes (typically up to tens of mil- liseconds over a long distance link). The result is that pro- cess p1 spontaneously orders messagem1 first while p2 and p3 deliverm2 first. Figure 2(b) shows a similar example where message

transmission delays are longer but where is it more likely


'' d1 ## ##•m1 ((



• • ##• • ##•

m2 $$



• • ''



(a) Variable transmission delays lead to overlapping deliveries and no spontaneous total order.

'' d1 ## ##•m1 ))



• • ##• • ##•

m2 ..



• • '' d2


(b) Comparable transmission delays reduces the probability of over- lapping deliveries.

Figure 2. Transmission delays and spontaneous total order.

that messages are delivered by all processes in the same or- der. What matters is the difference between transmission delays to different processes depicted as d1 and d2. Larger values for d1 and d2 mean that there is a higher probabil- ity of overlapping and thus of different delivery order even with higher delays than the previous example.

3. Delay compensation

3.1. Intuition

A network exhibiting identical transmission delays with low variance among any pair of processes would enable spontaneous total ordering of messages. This observation leads to the intuition underlying our proposal: given the magnitude of the latency introduced by total order protocols it should be possible, by judiciously scheduling the delivery of messages, to reduce the differences among transmission delays and produce an optimistic order which is likely to match the authoritative total order. As an example, notice that Figure 2(a) can be transformed in Figure 2(b) simple by delaying some of the deliveries. What remains to be established is how to determine the

correct delays to introduce to each message such that the likelihood of matching the authoritative total order is im- proved. The challenge is to do this with minimal overhead, both in terms of messages exchanged as well as computa- tional effort. In addition, by introducing delays our tech- nique increases the average latency of optimistic delivery. This must therefore be minimized and compensated by the higher share of correctly ordered optimistic deliveries. Notice that in a WAN this cannot ever replace a total or-

der algorithm: Transmission delays cannot be precisely es- timated, some uncertainty exists and thus it is likely that some messages are delivered out of order [20]. On the other hand, if the only modification to the original sequencer al- gorithm is the introduction of finite delays, its correctness in an asynchronous system model is unaffected. Therefore by reusing an algorithm known to be correct in the asyn- chronous system model we ensure the robustness of the so- lution [19]. Timing assumptions, namely on the stability of transmission delays as measured by a process’s local clock are then used only to improve the performance.

3.2. Relatively Equidistant Receivers

As the basis for our protocol, we consider a fixed- sequencer total order multicast algorithm as described in Section 2.1. We assume that the total order of messages is based on the spontaneous ordering of messages as seen by the sequencer. The different orders seen by a process p between the

messages it delivers optimistically and those that it delivers authoritatively reflects the relative differences between the communication delays from the senders to p and to the se- quencer. We like to think of these communication delays as “distances” between processes (more precisely, as directed distances as the distance from p to q can be different from that of q to p). If, through the introduction of artificial de- lays, we manage to get each process p and the sequencer as relatively equidistant receivers with respect to all other processes, then the order in which p delivers messages op- timistically will be that of the sequencer and therefore will match the authoritative order. The way to increase the distance between q and p is to

delay the optimistic delivery of messages from q at p. This means that when p needs to get q closer either p reduces the delay it might be imposing to the messages from q, or p has to stand back from all other processes by delaying the optimistic delivery of messages from these processes. This is the basic mechanism of our algorithm. It is simple and independently managed at each process, i.e., the adjustment of the distance between p and q is independent from that between q and p. Two particular cases however require special attention.

One is the fact that any process is usually closer to itself than from the sequencer and thus it will have to distance from itself. This case is simple, each process will delay the optimistic delivery of its own messages such that “it dis- tances from itself” as it distances from the sequencer. The other case regards the sequencer itself. While, as any other process, it is closer to itself than the others the distance to the sequencer does not apply here and the order of opti- mistic delivery trivially matches that of the authoritative’s. However, it is required, as happens with the other processes, that the sequencer “distances from itself” by delaying the optimistic delivery of its own messages. The reason for this is that unless the sequencer delays the optimistic de-



00 s •

seq(m1)11&&&&&&&& ts 2. •

seq(m2)11''''''' ##

p • 32 tp

• • tsp

2. • ##




Figure 3. Delays used in adjusting.

livery of its own messages, the optimistic and authoritative delivery of its messages will always occur almost simulta- neously. This is true at the sequencer process itself as well as in any other process and, as exemplified in the next sec- tion, it would eventually force the same phenomenon in the messages of the other processes. The problem of delaying the optimistic delivery of the sequencer’s messages is that it also delays their authoritative delivery.

3.3. Distance calculation

Consider the scenario depicted in Figure 3. Message m1

is multicast by a process p1. Message m2 is multicast by another process p2. Both the sequencer s and a second pro- cess p receive m1 and m2 as shown. Upon reception they are ordered by s, which assigns them sequence numbers and delivers them immediately. The authoritative order of the messages becomesm1, m2 as this was the spontaneous or- der seen by s. In contrast, process p can only make an op- timistic guess about the final relative order of m1 and m2, and in this situation it would have mistakenly predicted the delivery of m2 before m1. The final order is known only upon reception of the sequence numbers from s. As soon as it receives the sequence numbers for both mes-

sages, p becomes aware that its relative distances to p1 and p2 are different from those of s, because it has received the same messages in the inverse order. If it had delayed the optimistic delivery of m2 until after the reception of m1, it would have compensated its relative distance from the senders with respect to that of the sequencer and matched the authoritative order. Although any sufficiently large delay imposed on m2 by

p would correctly order it relatively tom1, a correct predic- tion of the final order by p requires an evaluation of relative distances to senders to s and to p, enabling an optimal de- lay to be introduced. Notice that the delay should not be so large that it causesm2 to be misordered with a messagem3

that arrives to all processes after bothm1 andm2. Explicit estimation of distances among all processes is

not required. A better approach is to directly determine op- timal delays to be introduced prior to optimistic delivery by observing that: • If the relative distance of p and s is the same with re- spect to senders of m1 and m2 and each message is

multicast simultaneously to all destinations, then in- terarrival times ts and tp will be identical.

• If transmission delays of seq(m1) and seq(m2) from s to p are the same, then p can use the value of tsp to locally determine ts. This avoids assumptions on the drift rate of clocks.

Process p can easily calculate the delay it should have intro- duced to the optimistic delivery of m2 to match its relative distance from p1 and p2 to that of the sequencer. Specifi- cally, it should have delayedm2 by tsp # tp.2 To cope with spurious variations on transmission delays, adjustments are made taking into account an inertia pondering factor. In the next section we will see in detail how the delays are

calculated and which process’s messages are delayed. Right now, the reader should keep in mind that delays to a pro- cess’s messages are only introduced when it is not possible to achieve the same result by reducing the delays inflicted to the other. The way the sequencer calculates its own messages de-

lays is different. Should it use the samemethod as the others and it, obviously, would not delay its own messages. To un- derstand the method followed by the sequencer let us first exemplify the consequences of not introducing delays on the sequencer’s own messages. Consider three processes, p, q and s. Process s is the sequencer. Process q, for simplic- ity, is ! equidistant of s and p. The distance from s to p is dsp and the distance of s to itself is dss. Having p and s relatively equidistant from q means that

(! # dss) = (! # dsp). To achieve this, since we cannot reduce dsp, we can have 1) p to distance from q, or 2) s to distance from itself, or both. Now suppose that s does not delay its own messages. In this case, p will have to stand back ! = dsp # dss from q. Since dss (the loopback de- lay) is usually negligible we can admit that ! $ dsp. This means that when a message multicast by q is optimistically delivered at p it is almost simultaneously delivered authori- tatively at p too. Therefore, unless the sequencer delays the optimistic delivery of its own messages the size of the opti- mistic window at the other processes becomes uninteresting or even vanishes. Below we will show how the sequencer computes the de-

lay for its own messages. This, contrary to other processes adjustments, is not independent and requires their cooper- ation. The idea is that the sequencer will stand back from itself what it distances from the farthest process.

3.4. Algorithm

We present in this section the algorithm executed by each process (Figure 4). The algorithm consists of a procedure TO-multicast(m) invoked by the client application to multi- cast a message and a set of four upon-do statements, exe-

2Notice that tp is negative in Figure 3, indicating that the relative order ofm1, m2 is reversed.


1: g " 0 {Global sequence number} 2: l" 0 {Local sequence number} 3: R" # {Messages received} 4: S " # {Sequence numbers} 5: O " # {Messages opt-delivered} 6: F " # {Messages fnl-delivered} 7: delay[1..n]" 0 8: r delay[1..n]" 0 {Delays requested to the sequencer}

9: procedure TO multicast(m) do 10: R multicast(DATA(m, max(delay[])$ delay[seq]))

11: upon R deliver(DATA(m, d)) do 12: R " R % {(m, d, now + delay[m.sender])}

13: upon &(m, d, t) ' R : now ( t )m *' O )m *' F do 14: opt delivery(m) 15: O " O % {m} 16: if p = seq then 17: g " g + 1 18: R multicast(SEQ(m, g)) 19: r delay[m.sender]" d 20: delay[p]" max(r delay[])

21: upon R deliver(SEQ(m, s)) do 22: S " S % {(m, s, now)}

23: upon &(m, d, o) ' R : (m, l + 1, t) ' S )m *' F do 24: fnl delivery(m) 25: if &(m!, d!, o!) ' R : (m!, l, t!) ' S then 26: !" (t$ t!) $ (o$ o!) 27: if! > 0 then 28: adjust(m!.sender, m.sender, !) 29: else 30: adjust(m.sender, m!.sender, |!|) 31: l" l + 1 32: F " F % {m}

33: procedure adjust(i, j, d) do 34: v " (delay[i] + !) + (delay[i]$ d)+ (1$ !) 35: if v ( 0 then 36: delay[i]" v 37: else 38: delay[i]" 0 39: delay[j] " delay[j] + |v|

Figure 4. Delay compensation algorithm for process p

cuted atomically, that deal with the optimistic and authori- tative delivery of the messages. The actual delivery of the messages to the client application is done through two up- calls opt-deliver(m) and fnl-deliver(m). Procedure adjust is an auxiliary procedure local to the algorithm. Each process manages four queuesR, O, F and S where

it keeps track of the messages received, optimistically and authoritatively delivered to the application, and those for which it has already received a sequence number, respec- tively. Every message m has a special attribute (m.sender) identifying its sender. At each process a variable seq iden- tifies the sequencer process. To multicast a totally ordered message, the client applica-

tion invokes procedure TO-multicast(m) (lines 9-10). This, in turn, invokes an underlying primitive providing reliable multicast with a pair (m, max(delay[])# delay[seq]). The value computed by max(delay[]) # delay[seq], as will be discussed below, corresponds to the delay the process sug- gests the sequencer to inflict to its own messages. The reception and delivery of messages is done by the

four upon-do statements. The first two handle the optimistic delivery while the others handle the authoritative delivery. When a pr

Our website has a team of professional writers who can help you write any of your homework. They will write your papers from scratch. We also have a team of editors just to make sure all papers are of HIGH QUALITY & PLAGIARISM FREE. To make an Order you only need to click Ask A Question and we will direct you to our Order Page at WriteEdu. Then fill Our Order Form with all your assignment instructions. Select your deadline and pay for your paper. You will get it few hours before your set deadline.

Fill in all the assignment paper details that are required in the order form with the standard information being the page count, deadline, academic level and type of paper. It is advisable to have this information at hand so that you can quickly fill in the necessary information needed in the form for the essay writer to be immediately assigned to your writing project. Make payment for the custom essay order to enable us to assign a suitable writer to your order. Payments are made through Paypal on a secured billing page. Finally, sit back and relax.

Do you need an answer to this or any other questions?

Do you need help with this question?

Get assignment help from Paper Writing Website and forget about your problems.

WriteEdu provides custom & cheap essay writing 100% original, plagiarism free essays, assignments & dissertations.

With an exceptional team of professional academic experts in a wide range of subjects, we can guarantee you an unrivaled quality of custom-written papers.

Chat with us today! We are always waiting to answer all your questions.

Click here to Place your Order Now