Amadeus Cache Server - Architecture Paper

Table of contents
1 Executive Summary
1.1 Background
1.2 Proposal & Benefits
1.3 Technical Summary
2 Introduction
2.1 Problem Context
2.1.1 Antecedents
2.1.2 Paradigm shift
2.1.3 Sequel
2.2 Problem Statement
2.3 Out of Scope
2.3.1 Communication Resource Manager
2.3.2 Failure detection
3 Requirements Summary
3.1 Functional Requirements
3.1.1 Request processing
3.2 Non-Functional Requirements
3.2.1 Scalability requirements
3.2.2 Load balancing requirements
3.2.3 Continuous Availability requirements
3.2.4 Operational requirements
3.2.5 Performance requirements
3.2.6 Capacity requirements
3.2.7 Portability requirements
4 Functional Decomposition
4.1 Components Overview
4.2 Examples
4.2.1 The example framework instance
4.2.2 Normal request processing
4.2.3 Node failure and recovery
4.2.4 Software load
4.3 Component Descriptions
4.3.1 Entry
4.3.2 Bucket
4.3.3 Bucket Manager
4.3.4 Router
4.3.5 Bucket Index
4.3.6 Proxy
4.3.7 Proxy Manager
4.4 Component Interactions
4.4.1 Operational overview
4.4.2 Request Processing
4.4.3 Migration of a Bucket
4.4.4 Index synchronisation
4.4.5 Exemption of an Index copy
4.4.6 Exemption of a Bucket Manager
4.4.7 Adding a new Index copy
4.4.8 Recreation of a lost Bucket
4.4.9 Re-partitioning
4.5 Component Details
4.5.1 Component states
4.5.2 Interface primitives
5 Function Placement
5.1 Deployment Units
5.1.1 Mapping the functional components
5.1.2 Placement rules
5.2 Network Diagram
6 Conclusion

1 Executive Summary

1.1 Background

In 2000, the Amadeus Development Company developed the Amadeus Cache Server (ACS) framework. ACS supports the caching of independent networks of C++ objects and the communication between external entities and the cached networks. Unfortunately, continuous availability of the stored networks and horizontal scalability were not taken into consideration as requirements during the development of the framework. ACS supports only the caching of object-networks, as opposed to reliably storing them. (By “caching only” we mean that an entry placed in ACS is not guaranteed to remain there until it is deliberately removed; ACS may discard it in the meantime.) The lack of continuous availability and horizontal scalability limits the re-usability of the framework. Although a High Availability Proposal has been prepared, it still does not offer a solution for horizontal scalability.

1.2 Proposal & benefits
The current paper suggests modifying the framework so that it meets both the continuous availability and horizontal scalability requirements and becomes capable of storing object-networks reliably. This will considerably widen the range of applications able to rely on the framework. The possible range includes all applications with highly partitionable state, among them conversation-oriented applications (or conversation-oriented parts of applications). The proposed modifications can, for example, solve the problem of context servicing. The proposed new architecture conforms to the existing one to the maximum extent possible in order to minimise implementation effort.

1.3 Technical summary
The problem with the current architecture is that the hash table is tightly coupled with the buckets. This prevents scaling horizontally and achieving continuous availability, because every bucket must be located on the same physical node as the hash table. Therefore, the primary objective of the new proposal is to de-couple the functionality of searching for the location of a bucket from the functionality of storing buckets.

2 Introduction

2.1 Problem Context

2.1.1 Antecedents
In 2000, the Amadeus Development Company developed the Amadeus Cache Server (ACS) framework. ACS supports the caching of independent networks of C++ objects and the communication between external entities and the cached networks. The framework currently appears set to become the basis for various applications in the Amadeus System. The following projects already rely on it:
- Value Pricer Cache Server (also called ICS)
- Rio Grande
It is also intended to utilise the framework in the Fare Quote System.
Unfortunately, continuous availability of the stored networks and horizontal scalability were not taken into consideration as requirements during the development of the framework. Continuous availability (CA) was not considered because it was assumed that, in case of an outage, the ACS could be restarted empty (cold) and would be populated gradually (warmed up). Horizontal scalability (HS) was not considered because it was assumed that if the ACS ran out of storage, some of the existing object-networks (e.g. the least recently or least frequently used ones) could be erased, thereby making room for new networks. Also, applications requiring more CPU performance than a single node can provide were not envisioned.

2.1.2 Paradigm shift
As stated above, and as is apparent from the lack of CA and HS features, the framework supports only the caching of object-networks, as opposed to reliably storing them. Because of this cache-instead-of-store principle, the re-use of the framework is limited. For example, in the absence of continuous availability, the warm-up procedure for the ICS application can last as long as two weeks, during which ICS can provide only a degraded service. The lack of horizontal scalability can, under certain circumstances, limit the hit ratio of caching applications, while other potential applications would not tolerate the loss of object-networks at all.

To address these problems, a High Availability Proposal has been prepared by the developers, relying on a hot but idle standby instance of the application on a standby node (hot backup). However, the Proposal still does not offer a solution for horizontal scalability. Also, the idle standby is a poor solution from the operational point of view.

If ACS could provide reliable and scalable storage, a much wider range of applications would be able to rely on the framework. The possible range would include essentially all applications whose state can be partitioned (while observing the locality-of-reference principle) into highly granular pieces, such as conversation and context servicing. Therefore, we suggest shifting from the caching paradigm to storing, and we propose modifying the framework so that it meets both the continuous availability and horizontal scalability requirements.

2.1.3 Sequel
The ADP Systems Architecture Department has recognised the need for a General Purpose Continuous Availability and Horizontal Scalability Framework (GP-CA-HS Framework) for open systems. We expect that this framework will be an extension of the Services Integrator CA-HS architecture. (The design process of the extensions has not started yet.) After the design of the GP-CA-HS framework, this paper will be revisited to exploit synergies. Although major changes to the currently proposed architecture are unlikely, we expect that some of the terminology will change. A few sections describing how this architecture fits into the framework are also likely to be added.

2.2 Problem Statement
The current ACS is based on the marriage of the eGate TCP multiplexer and the UP_CACHE component (see the figure below). The developers took the design of eGate and replaced the so-called Outbound Entities with Cache Entities based upon the UP_CACHE component. The figure below shows the internals of UP_CACHE: UP_CACHE stores the entries in groups called buckets. The entries are assigned to buckets based on digests hashed from the keys of the entries. The hash table maps the digests (hash values) to the buckets.

The problem with this architecture is that the hash table is tightly coupled with the buckets. This prevents scaling horizontally and achieving continuous availability, because every bucket must be located on the same physical node as the hash table. Therefore, the primary objective of the new proposal is to de-couple the functionality of searching for the location of a bucket from the functionality of storing buckets.

2.3 Out of scope

2.3.1 Communication Resource Manager
This paper does not address the internal communication mechanism among the distributed components of the proposed architecture. Instead, we apply the abstraction of a Communication Resource Manager (CRM). This abstraction can be concretised to any of the communication frameworks existing in the aMaDEUS environment, provided the functional requirements below are met. The chosen Communication Resource Manager should provide message delivery and request-reply correlation services through a communication application-programming interface (C-API). The Communication Resource Manager should allow for location-transparent intra- and inter-node communication between processes defining Communication End-Points (CEPs). The CRM should also be capable of detecting the loss of a communication channel or a CEP.
Communication Resource Managers are considered a standard infrastructure piece, and that is why they are excluded from the scope of this paper.

2.3.2 Failure detection
The Communication Resource Manager is responsible for detecting the loss of a communication channel or a Communication End-Point. No further failure detection mechanism is provided. The already mentioned GP-CA-HS Framework might provide solutions for component failure detection.

3 Requirements summary

3.1 Functional requirements
The framework should provide in-memory object storage and communication services for applications with highly partitionable state, including conversation-oriented applications requiring the maintenance of many independent local contexts. The “highly partitionable state” is embodied by serialisable networks of C++ objects. Each such network is called an entry and is identified by a key. Messages can be addressed to the network using the key. The phrase “partitionable” means that a stored network cannot directly communicate with another. None of the object-networks can hold pointers to objects outside the network. Communication with the “outside world” can be carried out only with the mediation of the framework.
The framework does not offer guaranteed delivery of messages, be they requests addressed to the entries or replies sent by the entries. This means that the applications relying on the framework can exchange only messages obeying idempotent update semantics. (That is, the messages must be repeatable; for example, incrementing a counter is not idempotent, while inserting an item into a set is.) The framework must be capable of sending the reply from the node that received the request; this is needed by connection-oriented transport protocols. Below is a description of how the framework mediates between the “outside world” and an application based on the framework.
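
The following minimal C++ sketch illustrates the distinction; the function names and types are illustrative only and not part of the framework.

    #include <set>
    #include <string>

    // Non-idempotent update: applying the same message twice leaves the entry
    // in a different state than applying it once, so a repeated message is harmful.
    void incrementCounter(int& counter)
    {
        ++counter;
    }

    // Idempotent update: applying the same message twice leaves the entry in the
    // same state as applying it once, so a repeated message is harmless.
    void recordBooking(std::set<std::string>& bookings, const std::string& bookingId)
    {
        bookings.insert(bookingId);   // inserting an already present item changes nothing
    }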

3.1.1 Request processing
The “outside world” is embodied by the clients of the cache, i.e. software entities sending and/or receiving messages to/from the framework. Since, together with the application, clients are logically outside the boundaries of the framework, they are the actors of the system.

3.1.1.1 Summary
Actors:
- Client: a software entity capable of sending requests to and receiving replies from the framework.
- Application Plug-in: a software entity implementing the features of the framework's behaviour that are specific to a given project using the framework.
Description: the requestor client sends a request to the framework, which processes it with the help of the application and sends replies to the recipient client(s).

3.1.1.2 Typical course of events
1. This use case starts when the requestor client sends a request to the framework.
The request contains:
- optionally, a key identifying an entry
- any additional application-specific information with which the requestor client wants to manipulate the entry
The requestor client might send further requests to the framework before a reply for the first request arrives. If the requestor client is sending multiple requests to the same entry without waiting for the replies, the framework must take care of the access-serialisation of the requests. However, the framework does not have to guarantee that the entry will receive the requests in the order they were sent.

2. The framework asks the Application to classify the request sent by the requestor client.

3. The Application classifies the request, telling the framework
a) whether the request makes sense to the application at all (if not, the rest of the processing can be skipped),
b) optionally, the key of the entry referenced in the request, and
c) whether the request might potentially modify an entry (the framework might benefit from this information by serving “read-only” requests in a different way).
If the request does not contain a key identifying an entry, then the Application can invent a key for the potential new entry. (A sketch of a possible classification result is given after this use case.)

4. The framework searches for the entry identified by the key.
If the Application did not specify a key, the framework will invent a key and create a blank entry.
Also, if the framework does not find the entry, it will create a blank one.

5. The Application manipulates the entry.
When the Application finishes processing the request, it notifies the framework.
Before the notification, it must perform the following actions:
- If the entry has been newly created or modified during the processing of the request, and it has to be persistently stored, the Application must (object-) serialise it and pass it to the framework in its serialised form.
- If the Application wants the entry to be deleted after the processing of the request, it must notify the framework of this.
- If the Application wants to send reply message(s) to various recipient clients, it must create the messages and pass them to the framework. (The requestor client might be one of the recipient clients, but does not have to be.)

6. If the Application has requested the deletion of the entry, then the framework deletes it. The framework forwards the messages to the recipient clients.
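
The classification result returned in step 3 could, for example, be represented as follows. This is a minimal C++ sketch; the type and member names are illustrative and not prescribed by the framework.

    #include <string>

    // Illustrative outcome of step 3, produced by the Request Classifier part
    // of the application plug-in.
    struct ClassificationResult
    {
        bool        relevant;   // (a) does the request make sense to the application at all?
        std::string key;        // (b) key of the referenced entry, possibly invented; empty if none
        bool        readOnly;   // (c) true if the request cannot modify an entry
    };

    // Illustrative interface of the Request Classifier plug-in part.
    class RequestClassifier
    {
    public:
        virtual ~RequestClassifier() {}
        virtual ClassificationResult classify(const std::string& request) = 0;
    };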

3.2 Non-functional requirements

3.2.1 Scalability requirements
The framework should provide horizontal scalability (HS) in storage and processing throughput. That is, by distributing an instance of the framework over several nodes, the number of entries stored and the number of requests served should grow approximately linearly with the number of nodes.
It is acceptable to base the HS solution on the partitioning of the data. The affinity of requests to particular nodes is acceptable, but the framework should automatically manage this.

3.2.2 Load balancing requirements
The framework should provide static load balancing based on the partitioning of the stored data and the affinity of messages to the partitions. To balance the load, partitions have to be moved from one location to another. Static load balancing means that the framework does not support automatic load levelling; that is, the migration of partitions has to be initiated by some external entity (i.e. the administrator). In short, workload balancing is based on the placement of data partitions.

3.2.3 Continuous Availability requirements
Continuous Availability (CA) has two aspects. One is High Availability, aiming at the minimisation of unexpected service outages and degradations. The other is Continuous Operations, meaning the minimisation of scheduled service outages and degradations.

3.2.3.1 High Availability
The failure of a node must not cause the loss of any entry or the loss of a modification to an entry resulting from a request that has already been replied to. Requests or replies might be lost during a topology transition (i.e. while a node is dropping out of a cluster running the framework and the backup is being activated), but such a transition should not last longer than 30 seconds.

3.2.3.2 Continuous Operations
The framework and the applications written based on the framework should obey the “transform while perform” principle, i.e. the framework should support on-line software maintenance, including upgrades and fallbacks. The only allowed off-line operation is re-partitioning. A yearly planned outage of three hours is allowed.

3.2.4 Operational requirements
The framework should support the off-line re-partitioning of its data. That is, it should be able to export and import its whole contents while off-line, and the data format of the export and/or import operations must be independent of how the framework partitions the data.
The framework should provide traffic statistics
- for the overall framework instance,
- for the running nodes and
- for the major components.
The loss of a communication channel between two components or the loss of a CEP must be detectable within ten seconds. The number of physical nodes needed to run the instance should be minimal.

3.2.5 Performance requirements
The framework should support serving up to 1000 non-intensive requests per second per node. Degradation of this throughput due to the intensity of the requests is acceptable. Message pass-through time should be within 10 ms for read-only messages and within 15 ms for read-write messages.

3.2.6 Capacity requirements
The framework should be capable of storing, based on its horizontal scalability, a practically unlimited number of entries.

3.2.7 Portability requirements
The framework should run on Unix (HP, AIX, Solaris) and NT.

4 Functional Decomposition
The entries will be assigned to storage units called buckets. The basis for the assignment is a bucket ID determined from the key of the entry, e.g. a digest computed by a hashing algorithm. To achieve high availability, each bucket will be stored in two copies: a master and a slave. Modifications to an entry are always carried out in the master bucket, and are then propagated to the slave. However, the slave bucket might serve “read-only” requests. When a master bucket is lost, the corresponding slave is promoted to be master. An important rule is that the master and the slave of the same bucket must reside on different physical nodes that are unlikely to fail simultaneously. In order to ensure scalability, the framework will be capable of distributing both the master and the slave buckets of the same framework instance on different physical nodes. Applications can be implemented by creating custom application plug-ins providing application-specific behaviours.
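
As a minimal C++ sketch of the key-to-bucket assignment (the hash function used here is arbitrary and shown only as an example; the framework does not prescribe a particular digest):

    #include <cstdint>
    #include <string>

    // Derive a bucket ID from the key of an entry: hash the key into a digest
    // and reduce it modulo the (fixed) number of bucket pairs of the instance.
    std::uint32_t bucketIdForKey(const std::string& key, std::uint32_t bucketPairs)
    {
        std::uint64_t digest = 1469598103934665603ULL;      // FNV-1a, example only
        for (unsigned char c : key) {
            digest ^= c;
            digest *= 1099511628211ULL;
        }
        return static_cast<std::uint32_t>(digest % bucketPairs);
    }

Because the number of bucket pairs is fixed for the lifetime of an instance (changing it requires the off-line re-partitioning described later), the key-to-bucket mapping stays stable; only the location of a bucket, recorded in the Bucket Index, changes when the bucket is migrated.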

The components of the framework rely on a Communication Resource Manager (CRM) providing message delivery and request-reply correlation services through a communication application programming interface (C-API). The Communication Resource Manager allows for intra- and inter-node communication between processes defining Communication End-Points (CEPs). The CRM is also responsible for detecting the loss of a communication channel or a CEP within the required time frame. It is not responsible for guaranteeing the delivery of messages. Communication Resource Managers are considered a standard infrastructure piece and are therefore excluded from the scope of this paper. We continue with an overview of the components making up the framework, then introduce how the framework operates through a few examples. Finally, the components and their interactions are described in detail.

4.1 Components overview

The components of the framework are organised into three layers,
- the routing layer
- the master layer
- and the slave layer.
The routing layer is composed of Routers and a distributed, synchronised collection of Bucket Indices forming the index sub-layer. Using one of the Bucket Indices, any of the Routers can route any request sent to the framework to the affected Bucket. Communication with the outside world is handled by Proxies and Proxy Managers. The responsibilities of the Proxies are
- to make it possible to send the reply from the entity (node) that received the request, which is needed by connection-oriented protocols, and
- in general, to de-couple the management of the external communication.
Bucket Managers contain the buckets of the framework instance. The master and slave buckets form the master and slave layers of the framework respectively. The Bucket Managers do not differentiate between master and slave buckets; only the Indices of the framework instance know whether a bucket is a master or a slave.

The application plug-in consists of two parts. The Entry Behaviour Provider part implements the Entry interface and plugs into Bucket Managers. The Request Classifier part plugs into Routers. Developing an application based on the framework is therefore a twofold task. The first task is to create an Entry Behaviour Provider customising the Manipulate primitive of Entries. The customisation has to support the application-specific interpretation of messages sent to the Entries and the manipulation of the Entries. The second task is to customise the Request Classifier as needed.

4.2 Examples
In this section, some features of the proposed architecture are highlighted through concrete examples.
The first section introduces the instance of the framework on which the examples are based. Then three usage scenarios are shown:
- the processing of an update request from a client,
- a node failure and recovery, and
- a software load.

4.2.1 The example framework instance
The example framework instance (see the figure below) is distributed across three nodes. (The number of nodes should be at least 2.) The instance has 9 bucket-pairs. (In reality, the number of bucket pairs is expected to be considerably higher.) Neglecting the effect of the components other than Bucket Managers, we assume that most of the processing occurs in the master buckets. Thus, the CPU consumption is roughly proportional to the number of master buckets on the node, while the memory consumption is approximately proportional to the total number of buckets on the node. The nodes are sized so that each is able to support 6 buckets at most, out of which at most 5 can be masters. Each node stores 3 master and 3 slave buckets. Under normal conditions, the CPU utilisation is therefore 60% (3 of the 5 master buckets each node could support) and the memory utilisation is 100% (6 of 6 buckets).
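
The utilisation figures follow directly from the sizing assumptions above, as the following small C++ calculation shows (the numbers are those of the example, not framework constants):

    #include <iostream>

    int main()
    {
        const int masterCapacity = 5;       // at most 5 master buckets per node (CPU bound)
        const int bucketCapacity = 6;       // at most 6 buckets per node in total (memory bound)
        const int masters = 3, slaves = 3;  // placement under normal conditions

        std::cout << "CPU utilisation:    " << 100 * masters / masterCapacity << "%\n"             // 60%
                  << "Memory utilisation: " << 100 * (masters + slaves) / bucketCapacity << "%\n"; // 100%
        return 0;
    }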

4.2.2 Normal request processing
The incoming requests are distributed randomly among the three nodes by a network load balancer (not shown in the figure). We are going to track an update request affecting Bucket 4. The network load balancer has distributed this request to, let us say, Node 1. The node intercepts the external requests with a Proxy Manager – in this case Proxy Mgr. 2 – which converts the external protocol to the one used within the framework (if such a conversion is necessary at all). Then, with the help of the Request Classifier, the Router decodes the key and the Bucket ID from the message and looks at its copy of the Index to determine where the message has to go. In our case, the master bucket of bucket-pair 4 is on Node 2, so the request is forwarded there.
On Node 2, the entry is located within Master 4. The Entry Behaviour Provider interprets the message and modifies the entry according to the logic of the application. It also specifies the reply. When this is finished, the modified entry is backed up to the slave bucket. The slave of bucket-pair 4 is on Node 3, so the processing continues there. When the update is ready, the reply is sent back via Proxy Mgr. 2.

Had the request been read-only, the processing path would have skipped either the slave or the master of the bucket pair, ensuring a shorter response time and saving resources. If the application can tolerate the asynchronous backup of its data, then the reply can be sent before the backup is finished. In this way the response time for write messages can also be shortened.

4.2.3 Node failure and recovery
In this example, we will see what happens when, let us say, Node 2 fails.
The effect of the failure is that the masters of bucket pairs 4, 5 and 6 and the slaves of pairs 1, 3 and 8 are lost. The slaves of the lost masters have to be promoted to master. The masters of the lost slaves remain temporarily without backup.

In the picture below, slaves 4, 5 and 6 have already been promoted to master. Pairs 1, 3 and 8 are without backup. In this way, Nodes 1 and 3 have taken over the role of the failed Node 2. Be aware that at this point the CPU utilisations of Nodes 1 and 3 are 80% and 100% respectively; there is therefore not enough spare capacity to start the recovery procedure immediately.

The precondition of the recovery is that Node 2 is again operational. When this happens, we can re-create the missing slaves and migrate masters 4, 5 and 6 to Node 2. This way we have gracefully recovered to the original situation.

4.2.4 Software load
In this example we are upgrading the Bucket Managers of the framework instance from version 7.3 to 8.1. The upgrade procedure is similar on each node, so it is shown only for Node 1. All of the steps below can be done on-line. As the first step, the new version of the Bucket Manager is installed and started on the node. At this stage, all the buckets of Node 1 are still in the old version of the Bucket Manager, while the new one is running empty, as shown below.

The second step is to migrate a few buckets to the new version. During the migration, the new Entry Behaviour Provider converts the data structure of the entries to the new version (if necessary). If problems are experienced at this point, it is possible to fall back to the old version by migrating the buckets back to the old Bucket Manager. The optional downward conversion of the data structures is again the responsibility of the new Entry Behaviour Provider.

Eventually, if everything goes fine, all the buckets are migrated to the new version, as shown below. When this happens, the old version can be stopped and un-installed.

To complete the upgrade, this procedure has to be repeated for each of the nodes. The upgrade of the operating system is also possible by a procedure similar to node failure and recovery.

4.3 Component descriptions
This section briefly describes the components of the framework.
We start with a diagram showing the UML representation of the framework components and their relationships. Each static structure symbol represents a component of the framework.

4.3.1 Entry
Entries are defined as an interface that the Entry Behaviour Provider part of the application plug-in has to implement. Each Entry has a reference to a network of objects stored by the framework.

4.3.2 Bucket
The Buckets contain the entries of the framework instance. They also serve as the unit of load management and recovery, i.e. load can be balanced between nodes by migrating buckets from one Bucket Manager to another. Buckets are also the primary interface to the application. That is, when a request for an entry is processed, the application interacts in various ways with the Bucket storing that entry.

4.3.3 Bucket Manager
Bucket Managers contain and group the Buckets. Through this grouping, they provide a means of reducing the granularity of Buckets, making the framework easier to manage.

4.3.4 Router
The primary task of a Router is to locate the Bucket affected by each request and to route the request to the Bucket Manager storing that Bucket. In other words, Routers de-multiplex the requests to the Buckets. They also co-ordinate the migration of Buckets.

4.3.5 Bucket Index
The Bucket Index maps the Bucket IDs to the Communication End-Points of the Bucket Managers storing the buckets. A framework instance can have several copies of its Index, which interact to stay synchronised.
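
A minimal C++ sketch of one Index copy, following the data structures listed in section 4.5.1 (the CEP is represented as a plain string purely for illustration):

    #include <cstdint>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    enum class BucketRole { Master, Slave };

    // One copy of the Bucket Index of a framework instance.
    struct BucketIndexCopy
    {
        // {bucket ID, role} -> CEP of the Bucket Manager storing that copy of the bucket
        std::map<std::pair<std::uint32_t, BucketRole>, std::string> locations;

        // CEPs of all Index copies of the instance, used to propagate updates
        std::vector<std::string> indexCopies;

        // whole-index lock used while an update is propagated (see section 4.4.4)
        bool locked = false;
    };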

4.3.6 Proxy
Together with Proxy Managers, Proxies are introduced to de-couple the communication external to the framework. Proxies and Proxy Managers provide a local representation of the clients (be they requestors or recipients). Local means local to the framework instance, not local to a node.
The primary purpose of Proxies is to map the communication channels to Proxy IDs and vice versa, while the communication is actually carried out by the Proxy Managers. Applications using the framework can address messages to a particular client using the ID of the Proxy representing the client. Although Proxies are stateless, they can represent stateful clients that take care of persistently storing their states themselves (e.g. in an RDBMS).
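
A minimal C++ sketch of this mapping, corresponding to the GetProxyID and GetClientChannel primitives listed in section 4.5.2.1 (channels and IDs appear as plain strings only for illustration):

    #include <map>
    #include <string>

    // Bidirectional mapping between communication channels and Proxy IDs.
    class ProxyTable
    {
    public:
        void createProxy(const std::string& proxyId, const std::string& channel)
        {
            idToChannel_[proxyId] = channel;
            channelToId_[channel] = proxyId;
        }

        // GetProxyID: channel -> Proxy ID
        std::string getProxyId(const std::string& channel) const { return channelToId_.at(channel); }

        // GetClientChannel: Proxy ID -> channel
        std::string getClientChannel(const std::string& proxyId) const { return idToChannel_.at(proxyId); }

    private:
        std::map<std::string, std::string> idToChannel_;
        std::map<std::string, std::string> channelToId_;
    };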

4.3.7 Proxy Manager
This section does not try to fully specify Proxy Managers; it should rather be considered the definition of an interface. This is because many types of Proxy Managers are possible, differing in the way they implement communication with the clients. In fact, special types of Proxy Managers can be implemented by tightly integrating them into the clients they represent (e.g. into RDBMS-based applications running under the control of a Transaction Co-ordinator). The only constraint is that they should provide entities addressable by Proxy IDs and capable of sending and receiving messages to and from the framework instance (and optionally, correlating them). Proxy Managers are responsible for the creation and destruction of Proxies. They can create Proxies dynamically, i.e. on the request of a client, or statically, i.e. on the request of the administrator. In case of a failure, it is the responsibility of the Proxy Manager to automatically recreate its statically defined Proxies, while clients are responsible for having their Proxies recreated (and also destroyed) as needed. Proxy Managers also serve as containers of Proxies.

4.4 Component interactions
The sections below explain how the components of the framework interact in various cases. Throughout the diagrams, a full arrow means a transfer of control with an implicit return (a subroutine call or a request blocking until the reply), while a half arrow means a transfer of control without return (a “goto” or an asynchronous notification). The first collaboration, “Request Processing”, addresses the functional requirements of the framework. The rest of the collaborations describe operational scenarios; therefore, as an introduction to them, we first provide an operational overview.

4.4.1 Operational overview

4.4.1.1 Creating an instance of the framework
The first step in creating an instance of the framework is determining the number of Bucket pairs the instance will use. Since changing the number of Buckets is the only operation that requires taking the system off-line, their number should be chosen carefully. As the second step, the master and slave sets of the Buckets are created in two Bucket Managers on separate nodes. Third, the first copy of the Index is set up and initialised with the starting locations of the Bucket pairs. Later on, at least one further copy of the Index has to be created on a separate node. Finally, by creating Routers and telling them which Index copy to use, and by creating Proxy Managers and specifying the Routers they should direct requests to, the instance is made ready for processing.

4.4.1.2 Load balancing
The horizontal scalability of the framework is achieved by allowing the buckets to be distributed across several Bucket Managers, potentially each on a separate physical machine. In this way, load balancing can be achieved by migrating buckets from one Bucket Manager to another. The collaboration in section “Migration of a Bucket” presents the migration mechanism. Bucket migration involves updating the Index. Since, for reasons of high availability and performance, the instance has multiple copies of its Index, these copies have to be synchronised. The section “Index synchronisation” explains how this is done.

4.4.1.3 Software load
Migrating buckets to an up- or downgraded version of the Bucket Managers allows on-line software loads and fallbacks, including the patching and the up- or downgrade of the operating system.

4.4.1.4 Take-over and recovery
The failure detection of the framework components is limited to the failure-detecting services of the CRM, i.e. to detecting the loss of a CEP or an internal communication channel. When the failure of one or more components of the framework is detected, the failed components are automatically exempted from further traffic and some of the backup buckets are activated. This is described by the collaborations in sections “Exemption of an Index copy” and “Exemption of a Bucket Manager”. After exempting a failed Bucket Manager, some of the buckets might remain without backup. The recovery process has to be initiated manually, because the capacity for the recovery might not be immediately available. When sufficient capacity is available, the missing backups are recreated as described in the collaboration “Recreation of a lost Bucket”.

4.4.1.5 Adding and removing components
The components of the framework can be added or removed while the framework is executing, provided the remaining capacity is satisfactory. The addition or removal of the stateless components (Proxy Managers and Routers) is simple. Therefore, we deal only with the addition and removal of stateful components, i.e. Bucket Managers and Indices. To add or remove a Bucket Manager, it simply has to be populated with buckets or its buckets have to be evacuated, both by means of Bucket migration. Removing an Index copy is identical to exempting it. The only non-trivial task is adding an Index copy, which is described in section “Adding a new Index copy”.

4.4.1.6 Re-partitioning
A rare but feasible operational scenario is when the administrator wants to increase or decrease the number of buckets in the framework instance. This is described in collaboration “Re-partitioning”.

4.4.2 Request Processing
The request enters the framework at the requestor proxy and is forwarded to a Router (steps 1-2-3). The application is asked to classify the request, decode the key of the Entry from it (4), and determine the bucket ID belonging to the key (5). Then the Router gets the locations of the master and slave buckets from the Bucket Index (6) and forwards the request to the master (7, 9). The master bucket has to be locked while it is processing the request. This prevents both the master and the slave of the bucket pair from being migrated while the request is processed. The lock of the Bucket is stored in the Bucket. As a consequence, if the request finds MasterBucket locked in step 8, then, after waiting for a random amount of time, all steps starting with step 6 have to be repeated, since the bucket might have been migrated while it was locked.

The repetition of steps 6-7-8 could be eliminated by storing the lock in the Index, first checking the lock and then getting the location of the Bucket. However, placing the lock in the Index would imply frequent updates of the Index and the repeated broadcast of synchronisation messages. (Be aware that the Index potentially exists in many copies.) In summary, placing the lock in the Bucket has a performance advantage when lock contention is low, because the synchronisation messages are avoided, and a performance penalty when lock contention is high, because of the frequent repetition of steps 6-7-8. We expect low lock contention (only a few clients sharing a Bucket), and that is why the lock has been placed in the Bucket and not in the Index.
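
The retry rule of steps 6-9 can be sketched as follows in C++; the helper functions are placeholders standing in for the Index look-up and the forwarding to the Bucket Manager, and only the control flow follows the text above.

    #include <chrono>
    #include <cstdint>
    #include <random>
    #include <string>
    #include <thread>

    struct Request {};   // placeholder

    // Placeholder for step 6: look up the CEP of the master bucket in the Index copy.
    std::string lookUpMasterLocation(std::uint32_t /*bucketId*/) { return "bm-cep"; }

    // Placeholder for steps 7-8: returns false if the master bucket was found locked.
    bool tryForwardToMaster(const std::string& /*cep*/, const Request& /*req*/) { return true; }

    void routeToMaster(std::uint32_t bucketId, const Request& req)
    {
        std::mt19937 rng(std::random_device{}());
        std::uniform_int_distribution<int> backOffMs(1, 50);

        while (true) {
            const std::string cep = lookUpMasterLocation(bucketId);   // step 6
            if (tryForwardToMaster(cep, req))                         // steps 7-8
                return;                                               // step 9 and onwards follow
            // The master was locked: it might have been migrated meanwhile, so wait
            // a random amount of time and repeat from step 6.
            std::this_thread::sleep_for(std::chrono::milliseconds(backOffMs(rng)));
        }
    }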

The master bucket and the application can interact in an application-specific way. Before the interaction starts, the framework creates a new entry (step 10) for the application if the request references a so far non-existent one. Then the control of execution is passed to the application (step 11), which can now initiate various interactions. The diagram above shows only the possible primitives (12, 13, 14, 15) of this interaction and does not intend to prescribe their sequence. During an interaction, the application is free to use all or some of them in the order it wishes. In steps 17-18, the content of the possibly modified master bucket is saved in the slave. If the application did not serialise the entry during the interactions (omitted step 12), then steps 17 and 18 are skipped.

In steps 19-20-21, the framework sends the reply (or replies) to the recipient(s). The requestor can also be one of the recipients, but the application can specify any other entity (or entities) represented by a proxy. In this latter case, the FindProxy primitive (step 14) can prove useful for finding a proxy matching criteria specified by the application. In short, the application can specify in step 15 several different replies that can be sent to several different recipients. If the application did not specify any replies (omitted step 15), then steps 19-20-21 are also omitted.

4.4.3 Migration of a Bucket
The collaborations below are performed when a master or a slave bucket is migrated, respectively. The administrator of the instance initiates the migration (step 1), and the Router finds a Bucket to migrate (2). Regardless of whether a master or a slave bucket is migrated, the master bucket of the pair is locked (5). This ensures that only a bucket of an idle pair can be migrated. If the master bucket is found locked in step 5, then the Router should attempt to migrate another bucket by repeating all steps starting with step 2. Next, the master bucket serialises itself (6). If the migration of the bucket is part of a fallback procedure, then the entries in the bucket are serialised to a format recognisable by the version being fallen back to. The serialised copy is passed to the destination Bucket Manager, where it is unserialised (7). If the migration of the bucket is part of an upgrade procedure, then during the un-serialisation the data structures of the entries are upgraded as well. From this point, the migration of slave and master buckets differs.

4.4.3.1 Migration of a slave Bucket
The migration of slave buckets continues with setting the new location of the slave bucket in the Index (step 8) and unlocking the master (step 9).
Finally, the slave is deleted from the original location (step 10).

4.4.3.2 Migration of a master Bucket
In case of a master bucket, the new location of the master bucket is set in the Index (step 8). Finally, the master is deleted from the original location.

4.4.4 Index synchronisation
The update of the Index can be initiated at any of the copies (step 1). Write access to the index is serialised by locking the whole of the index (2). (Locking the whole of the index is affordable because Index synchronisation is infrequent.) The copy at which the update was initiated (the master copy for that update) must obtain a lock on every registered index copy before proceeding. If the lock on any of the copies fails, then, to prevent deadlocks, all obtained locks must be released and the whole locking procedure re-attempted after waiting for a random amount of time. In step 3, the master copy propagates the modification by the broadcast of a synchronisation message. Eventually the locks are released (step 4).
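
The locking discipline can be sketched as follows in C++ (IndexCopy is a placeholder; only the lock-all-copies-or-release-and-retry rule follows the text above):

    #include <chrono>
    #include <cstddef>
    #include <random>
    #include <thread>
    #include <vector>

    struct IndexCopy            // placeholder for a remote Index copy
    {
        bool locked = false;
        bool tryLock() { if (locked) return false; locked = true; return true; }
        void unlock()  { locked = false; }
    };

    // Returns only when every registered Index copy has been locked by this caller.
    void lockAllCopies(std::vector<IndexCopy*>& copies)
    {
        std::mt19937 rng(std::random_device{}());
        std::uniform_int_distribution<int> backOffMs(1, 50);

        while (true) {
            std::size_t got = 0;
            while (got < copies.size() && copies[got]->tryLock())
                ++got;
            if (got == copies.size())
                return;                               // all copies locked: the update may proceed
            // One copy was already locked: release everything obtained so far and
            // retry after a random wait, so two concurrent updates cannot deadlock.
            for (std::size_t i = 0; i < got; ++i)
                copies[i]->unlock();
            std::this_thread::sleep_for(std::chrono::milliseconds(backOffMs(rng)));
        }
    }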

4.4.5 Exemption of an Index copy
When a copy of the index fails, it has to be exempted from further traffic. This is shown below. After the de-registration, no synchronisation messages are sent to the failed copy.

4.4.6 Exemption of a Bucket Manager
When a Bucket Manager fails, it has to be exempted from further service. The exemption of a failed Bucket Manager consists of two actions. On one hand, the lost slave buckets have to be disabled. On the other hand, the slaves of the lost masters have to be promoted to master.

4.4.6.1 Disabling the lost slave Buckets

4.4.6.2 Promoting the slaves of the lost master Buckets

4.4.7 Adding a new Index copy

Before creating the new Index copy (NewIndex), the administrator has to select an existing copy (MasterIndex) that will initialise the new one. When NewIndex is created, a reference to MasterIndex is passed to it. In step 2, NewIndex registers at MasterIndex, which, in turn, notifies the other copies (step 4). Finally, MasterIndex initialises NewIndex. The locking in step 3 blocks the SetBucketLocation messages sent to any of the Index copies. This ensures that NewIndex loses no synchronisation message while it is being initialised.

4.4.8 Recreation of a lost Bucket
When a master bucket is lost, its slave is immediately promoted to master; therefore a master bucket never has to be recreated. The recreated bucket is always a slave. The administrator of the instance initiates the recreation at a Router (step 1). The Router finds a bucket pair that has a missing slave (steps 2-3). The master of the bucket pair is locked (5). This ensures that the slave being recreated will be synchronised with the master of the bucket pair. If the master bucket is found locked in step 5, then the Router should attempt to recreate another slave bucket by repeating all steps starting with step 2. Next, the master bucket serialises itself (6). The serialised copy is passed to the destination Bucket Manager, where it is unserialised (7). The recreation of the slave bucket finishes with setting the new location of the slave bucket in the Index (step 8) and unlocking the master (step 9).

4.4.9 Re-partitioning
A rare but feasible operational scenario is when the administrator wants to increase (or decrease) the number of buckets in the framework instance. This is called re-partitioning. Re-partitioning is not only a matter of creating new Buckets (or deleting existing ones) in some of the Bucket Managers and adding their references to (or deleting them from) the Index. Depending on the application, the Request Classifier might also have to be made aware of the new (or deleted) Buckets.

However, the most complex part of re-partitioning is the population of the added Buckets with already existing contents (or the evacuation of the contents of the buckets to be deleted). Be aware that this is different from Bucket migration (described in 4.4.3 Migration of a Bucket), because here the Entries of a Bucket have to be migrated to one or more other Buckets, rather than a whole Bucket from one Bucket Manager to another. Since only the Request Classifier is able to assign the storing Bucket to an Entry, it is not possible to migrate Entries from one Bucket to another without the support of the Request Classifier. The complexity of migrating Entries results from the fact that during the migration process two partitioning schemes have to be maintained, namely the old and the new one. Handling two partitioning schemes simultaneously would be too complicated a task for one Request Classifier; therefore, the work is shared between two Request Classifiers, each belonging to a separate instance of the framework. That is, instead of migrating Entries from one Bucket of the framework instance to another Bucket of the same instance, we migrate the Entries from one instance to another.

The Entry-migration process is off-line, i.e. no requests can be processed during re-partitioning. The diagram below shows the migration. The components belonging to the old and new instances of the framework are prefixed with “old” and “new” respectively. The administrator sends the re-partitioning request to every Bucket Manager of the old instance (step 1). Each of these Bucket Managers then checks which of its buckets are masters (2). Each master Bucket is sent an export request (step 3), which induces the serialisation of every Entry in the Bucket (4) and their shipment to the Router of the new instance (5). The new Router interacts with the new Request Classifier to find out the ID of the new Bucket (6). Then the locations of the master and the slave buckets are determined (7) and both are requested to un-serialise the Entry (8-9-10-11). Finally, the Entry is deleted from its original locations (12-13-14).

4.5 Component details
The table below summarises the possible states and the data structures of each component. The interfaces of the components are covered in the subsections following the table. Some explanations of the table follow.

4.5.1 Component states
States are different if the component behaves differently in them. Stateless means that the behaviour of the component does not depend on its past. It does not mean it has no data structure. Locked means that certain actions are allowed only for the holder of the lock. Only one entity can hold the lock at a time. If a second entity is trying to obtain the lock, it will fail, and to prevent deadlock, it has to release the locks it might have obtained previously, and restart the whole locking procedure some time later. Unlocked means that an entity trying to obtain the lock will succeed.

Proxy
  States: stateless
  Data structures: The ID of the proxy; Static / Dynamic; Communication channel parameters (network address, protocol used, etc.)

Proxy Manager
  States: stateless
  Data structures: -

Router
  States: stateless
  Data structures: Its data structure is the Bucket Index.

Bucket Manager
  States: see its parts
  Data structures: An indexed list of its buckets

Bucket
  States: locked / unlocked
  Data structures: The ID of the bucket; A list of its entries

Bucket Index
  States: locked / unlocked
  Data structures: A vector of {Bucket ID, Bucket type, CEP} tuples indexed by Bucket ID and Bucket type, where Bucket type is ‘master’ or ‘slave’, and CEP is the Communication End-Point of the Bucket Manager storing the bucket; A list of the CEPs of the Indices in the framework instance.

Entry
  States: application-specific
  Data structures: The key of the entry; Application-specific structures

4.5.2 Interface primitives
The subsections below briefly describe the interface primitives of each component.

4.5.2.1 Proxy
GetProxyID
  Input parameters: Communication channel
  Description: Returns the ID of the Proxy associated with the communication channel.

GetClientChannel
  Input parameters: Proxy ID
  Description: Returns the communication channel associated with the Proxy.

4.5.2.2 Proxy Manager
CreateProxy
  Input parameters: Communication channel
  Description: Creates a Proxy.

DeleteProxy
  Input parameters: Proxy ID or communication channel
  Description: Deletes a Proxy.

4.5.2.3 Router
RouteARequest
  Input parameters: Request message; ID of sending Proxy; Communication End-Point (CEP) of sending Proxy Manager
  Description: Routes the request to the appropriate Bucket.

MigrateMasterBucket
  Input parameters: CEPs of the Bucket Managers to migrate from and to
  Description: Selects and migrates a master bucket from one Bucket Manager to another.

MigrateSlaveBucket
  Input parameters: CEPs of the Bucket Managers to migrate from and to
  Description: As above, but for a slave bucket.

UseIndex
  Input parameters: CEP of the Bucket Index to use
  Description: Directs the Router to use the given Bucket Index.

RecreateBucket
  Input parameters: CEP of the Bucket Manager at which to recreate the Bucket
  Description: Finds a disabled slave bucket and recreates it at the given Bucket Manager.

4.5.2.4 Bucket Manager
DeleteBucket
  Input parameters: BucketID
  Description: Deletes the bucket.

UnserialiseBucket
  Input parameters: A serialised bucket
  Description: Creates a bucket in the Bucket Manager by un-serialising the given bucket.

LockBucket
  Input parameters: BucketID
  Description: Locks the given bucket.

UnLockBucket
  Input parameters: BucketID
  Description: Unlocks the given bucket.

4.5.2.5 Bucket
The sections below group the primitives of the Bucket by the components that are expected to use them. In fact, the groups can be considered separate interfaces.

4.5.2.5.1 PRIMITIVES FOR INTERACTIONS WITH BUCKET MANAGERS
ForwardToEntry
  Input parameters: A key; a request; the ID of the proxy representing the sender
  Description: Instructs the bucket to forward a request coming for an entry.

Serialise
  Input parameters: -
  Description: Instructs a bucket to serialise all its entries.

UnserialiseEntry
  Input parameters: A key; a serialised entry
  Description: Instructs the Bucket to un-serialise an entry. If an entry existed with the same key, it is overwritten.

4.5.2.5.2 PRIMITIVES FOR INTERACTIONS WITH THE APPLICATION PLUG-IN
SaveSerialisedEntry
  Input parameters: A serialised (piece of an) entry
  Description: Saves the given serialised (piece of an) entry. The application plug-in is expected to save its new state whenever it modifies it in response to a request; this primitive can be used by the plug-in for the save operation.

DeleteEntry
  Input parameters: An entry
  Description: Specifies that the entry should be deleted after the processing of the current request.

SpecifyReplies
  Input parameters: A list of {recipient proxy ID, message} tuples
  Description: Each message will be sent to the client represented by the corresponding proxy.

FindProxy
  Input parameters: (To be specified)
  Description: Returns the ID of a proxy matching the input search criteria.

4.5.2.5.3 INTERNAL PRIMITIVES
CreateEntry
  Input parameters: A key
  Description: Creates a blank entry with the given key.

4.5.2.6 Bucket Index
The sections below group the primitives of the Bucket Index by the components that are expected to use them. The groups can be considered separate interfaces.

4.5.2.6.1 PRIMITIVES FOR INTERACTIONS WITH BUCKET MANAGERS
SetBucketLocation
  Input parameters: A bucket ID; role of the bucket (master or slave); CEP
  Description: Sets the CEP of the Bucket Manager hosting the master or the slave copy of the bucket-pair and propagates the change to the other Index copies of the framework instance.

4.5.2.6.2 PRIMITIVES FOR INTERACTIONS WITH ROUTERS
GetBucketLocation
  Input parameters: A bucket ID; role of the bucket (master or slave)
  Description: Returns the CEP of the Bucket Manager hosting the master or the slave copy of the bucket-pair.

FindABucket
  Input parameters: CEP of the Bucket Manager storing the bucket; role of the bucket (master or slave)
  Description: Returns the ID of a bucket that is stored by the given Bucket Manager and currently has a master or slave role.

FindRecreatableBucket
  Input parameters: CEP of the Bucket Manager storing the master copy
  Description: Returns the ID of a bucket-pair that has its master bucket on the specified Bucket Manager.

4.5.2.6.3 PRIMITIVES FOR INTERACTIONS WITH FAILURE DETECTORS
PromoteSlaves
  Input parameters: CEP of a failed Bucket Manager
  Description: Promotes the slaves of the bucket-pairs that have their masters on the failed Bucket Manager. The original master buckets are deleted from the Index. The promoted slaves become masters and will have no slaves.

DisableSlaves
  Input parameters: CEP of a failed Bucket Manager
  Description: Disables the slaves of the bucket-pairs that have their slaves on the failed Bucket Manager.

4.5.2.6.4 PRIMITIVES FOR INTERACTIONS WITH ADMINISTRATORS
CreateIndex
  Input parameters: The CEP of one of the indices in the framework instance
  Description: The index is created, registered and initialised from the copy with the given CEP.

Exempt
  Input parameters: The CEP of a failed index copy
  Description: The given copy is de-registered from the other copies of the framework instance.

4.5.2.6.5 PRIMITIVES FOR INTERACTIONS WITH OTHER INDICES
Lock
  Input parameters: -
  Description: Locks the index copy.

Unlock
  Input parameters: -
  Description: Unlocks the index copy.

AdoptBucketLocation
  Input parameters: A bucket ID; role of the bucket (master or slave); CEP
  Description: Sets the CEP of the Bucket Manager hosting the master or the slave copy of the bucket-pair. This primitive is used to propagate a change to the Index.

Register
  Input parameters: The CEP of the index copy to register
  Description: Registers the new index copy with the copies already existing in the framework instance by propagating the CEP of the new copy.

AddToList
  Input parameters: The CEP of the index copy to register
  Description: This primitive is used to propagate a registration to the copies already existing in the framework instance.

DeRegister
  Input parameters: The CEP of a failed index copy
  Description: De-registers the failed Index copy from another copy. This primitive is used to propagate the exemption of a failed copy to the other copies existing in the framework instance.

Initialize
  Input parameters: A snapshot of an index
  Description: Initialises the index to the given snapshot.

4.5.2.7 Entry
Manipulate
  Input parameters: A request; the ID of the proxy representing the requestor
  Description: Instructs the Entry to process a request and manipulate its state accordingly.

Serialise
  Input parameters: Version of the serialisation format
  Description: Instructs the Entry to serialise itself using the SaveSerialisedEntry primitive of the bucket. To facilitate software downgrades (fallbacks), this primitive must also be capable of serialising the entry to format(s) recognisable by earlier version(s) of the Application Plug-in.

Unserialise
  Input parameters: A serialised entry
  Description: Instructs the Entry to adopt the state represented by the input serialised entry. To facilitate software upgrades, this primitive must also be capable of un-serialising an entry that has been serialised by an earlier version of the Application Plug-in.
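
The Entry primitives above can be pictured as the following C++ abstract class, which an Entry Behaviour Provider would implement. This is a sketch only; the parameter types and exact signatures are not prescribed by this paper.

    #include <string>

    class Entry
    {
    public:
        virtual ~Entry() {}

        // Process a request and manipulate the stored object network accordingly.
        virtual void Manipulate(const std::string& request,
                                const std::string& requestorProxyId) = 0;

        // Serialise the entry (via the SaveSerialisedEntry primitive of its bucket),
        // possibly to a format recognisable by an earlier plug-in version.
        virtual void Serialise(int serialisationFormatVersion) = 0;

        // Adopt the state represented by a serialised entry, possibly produced by
        // an earlier version of the Application Plug-in.
        virtual void Unserialise(const std::string& serialisedEntry) = 0;
    };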

5 Function Placement

5.1 Deployment units

5.1.1 Mapping the functional components
The tables below show how the deployment units are derived from the functional components through the runtime components and runtime component instances (processes).
Runtime components and the functional components they comprise:
- RC_Proxy: Proxy, Proxy Manager
- RC_Bucket: Bucket Manager, Bucket, Entry
- RC_Router: Router
- RC_Index: Bucket Index

Runtime component instances (processes) and their runtime components:
- PROC_Proxy: RC_Proxy
- PROC_Bucket: RC_Bucket
- PROC_Router: RC_Router
- PROC_Index: RC_Index

Runtime deployment units and their runtime component instances (processes):
- DU_Proxy: PROC_Proxy
- DU_Bucket: PROC_Bucket
- DU_Router: PROC_Router, PROC_Index

5.1.2 Placement rules
Each runtime deployment unit consists of one corresponding process, except DU_Router, which contains at least one PROC_Router process and exactly one PROC_Index process.
An instance of an application using the framework must consist of the following runtime deployment units:
- at least one DU_Proxy
- at least 2 DU_Buckets running on separate nodes
- at least one DU_Router per application instance and at most one per node
- one instance of the Communication Resource Manager per node

5.2 Network diagram
The network diagram has to be prepared specifically for each application written using the framework, observing the above placement rules for the deployment units.

6 Conclusion
This paper suggests modifying the ACS framework so that it meets both the continuous availability and horizontal scalability requirements and becomes capable of storing object-networks reliably. The proposed modifications considerably widen the range of applications able to rely on the framework. The possible range includes all applications with highly partitionable state, among them conversation-oriented applications (or conversation-oriented parts of applications). The proposed modifications can, for example, solve the problem of context servicing. The proposed new architecture conforms to the existing one to the maximum extent possible in order to minimise implementation effort.