Storage 101: High Availability, Part 2


Service Groups
An important concept in application failover is the resource group, or service group. A service group is a logical collection of resources required for a service or application to be available or online. An application service is typically composed of multiple resources, some hardware-based and some software-based, all cooperating to produce a single service. For example, a database service may be composed of one or more logical network (IP) addresses, the database management system software, underlying file systems, logical volumes, and a set of physical disks managed by the volume manager.

If a service group needs to be migrated to another node for recovery purposes, all of its resources must migrate together to recreate the service on the new node, without affecting other service groups.
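
To make the idea concrete, a service group can be modeled as an ordered list of resources that are brought online together on one node, taken offline in reverse order, and moved as a single unit during a planned migration. The Python sketch below is purely illustrative; the class names, resource names, and methods are assumptions, not any cluster product's API.

    class Resource:
        """One managed resource: an IP address, volume, file system, DBMS, and so on."""
        def __init__(self, name):
            self.name = name

        def online(self, node):
            print(f"  bringing {self.name} online on {node}")

        def offline(self, node):
            print(f"  taking {self.name} offline on {node}")

    class ServiceGroup:
        """Everything one application service needs, brought up and moved as a unit."""
        def __init__(self, name, resources):
            self.name = name
            self.resources = resources       # ordered: disks -> volumes -> file systems -> DBMS -> IP
            self.node = None

        def online(self, node):
            for r in self.resources:         # start dependencies first
                r.online(node)
            self.node = node

        def offline(self):
            for r in reversed(self.resources):   # stop in reverse order
                r.offline(self.node)
            self.node = None

        def migrate(self, target_node):
            """Move every resource in the group together to another node."""
            self.offline()
            self.online(target_node)

    db_group = ServiceGroup("database_svc", [
        Resource("physical_disks"), Resource("logical_volumes"),
        Resource("file_systems"), Resource("dbms"), Resource("virtual_ip"),
    ])
    db_group.online("node1")
    db_group.migrate("node2")    # all five resources move as one unit

In practice the ordering matters: the virtual IP address comes up last so clients never reach a node where the underlying storage and database are not yet ready.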

A single large server may host any number of service groups, each providing a discrete service to networked clients. If multiple service groups are running on a single node, they must be monitored and managed independently. Independent management allows a service group to be automatically recovered or manually idled (e.g., for administrative or maintenance reasons) without necessarily impacting any of the other service groups running on the node. Of course, if the entire server crashes (as opposed to just a software failure or hang), then all the service groups on that node must be recovered elsewhere.
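
Independent management can be pictured as each group having its own monitor and its own recovery decision. The hypothetical sketch below shows the two failure cases from the paragraph above: a fault in one service moves only that group, while a node crash triggers recovery for every group that was hosted there. The node and group names are illustrative.

    # Which node each service group is currently running on (illustrative data).
    groups = {"database_svc": "node1", "web_svc": "node1", "reports_svc": "node2"}

    def recover_group(group, spare_node="node3"):
        print(f"recovering {group} on {spare_node}")
        groups[group] = spare_node

    def handle_group_fault(group):
        # Software failure or hang in one service: only that group is recovered.
        recover_group(group)

    def handle_node_crash(node):
        # Hardware failure: every group hosted on the crashed node is recovered elsewhere.
        for group, host in list(groups.items()):
            if host == node:
                recover_group(group)

    handle_group_fault("web_svc")    # database_svc keeps running on node1 undisturbed
    handle_node_crash("node1")       # whatever is still on node1 moves to the spare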

The next level of high availability allows for failover from a local cluster to a remote location. Today's data availability needs require that availability be guaranteed globally, allowing uninterrupted access to information even in the event of a site failure. Geographically dispersed takeover sites add yet another dimension to the enterprise's availability infrastructure.

The first step in extending availability beyond a single data center is ensuring that a duplicate copy of the data is available at a remote site. Replication accomplishes this.

Host-Based Replication
Replication is an automated, rules-based method for the geographical distribution of identical data. This second data set is necessary in the event the data at the original location cannot be accessed. Unlike a backup strategy, replication transfers data in real time over a wide area network (WAN) to the secondary location, providing an up-to-date copy of the data at a remote site should it be required.

There are two main types of replication: synchronous and asynchronous. Both have their advantages and should be available options for the IT administrator. The performance and effectiveness of both ultimately depend on business requirements, such as how soon updates must be reflected at the target location. Performance is strongly determined by the available bandwidth, network latency, the number of participating servers, the amount of data to be replicated, and the geographical distance between the hosts. When choosing a replication solution, consider which type of replication the given application environment requires. Host-based replication products such as VERITAS Volume Replicator give the user the choice of either synchronous or asynchronous replication.

Synchronous Replication
Synchronous replication ensures that an application write has been received by the secondary location and acknowledged back to the primary before the write operation completes. This way, data recovered from the secondary server is completely up to date, because every server holds an exact copy of the same data. Synchronous replication provides full data currency but may impact application performance in high-latency or limited-bandwidth situations. It is most effective in application environments with low update rates, but it has also been deployed effectively in write-intensive environments where high-bandwidth, low-latency network connections are available.
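
The sequence can be sketched as follows: the write returns to the application only after the secondary acknowledges the update, so every write pays one network round trip. The send_to_secondary helper and the 20 ms latency figure below are assumptions for illustration only, not any product's interface.

    import time

    NETWORK_ROUND_TRIP = 0.02            # assumed 20 ms WAN round trip, for illustration

    def send_to_secondary(update):
        """Ship one update to the remote site and wait for its acknowledgement."""
        time.sleep(NETWORK_ROUND_TRIP)   # stand-in for WAN transfer plus acknowledgement
        return True

    def synchronous_write(primary_volume, update):
        primary_volume.append(update)            # write at the primary
        acked = send_to_secondary(update)        # block until the secondary acknowledges
        if not acked:
            raise IOError("secondary did not acknowledge update")
        # Only now does the application see the write as complete.

    primary = []
    synchronous_write(primary, "txn-1")

Because the round trip sits inside the write path, latency between the sites translates directly into application response time, which is why synchronous mode favors short distances or fast links.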

Asynchronous Replication
During asynchronous replication, application updates are written at the primary, and queued in order in a log file, for forwarding to each secondary host as network bandwidth allows. When the writing application experiences temporary surges in updates, this queue may grow. Unlike synchronous replication, the writing application does not suffer from the response time degradation caused by each update incurring the cost of a network round trip. During periods when the update rate is less than the available network bandwidth, this queue drains faster than it grows, allowing the secondary data state to catch up rapidly with that of the primary. With appropriately sized network bandwidth, an asynchronous secondary should (on average) lag the primary by only a few milliseconds.
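
A minimal sketch of the asynchronous path, assuming a simple in-memory queue stands in for the replication log: the application write completes as soon as the update is written locally and queued, and a background sender drains the queue toward the secondary as bandwidth allows. The delays and names are illustrative.

    import queue, threading, time

    replication_log = queue.Queue()              # ordered log of updates awaiting transfer

    def asynchronous_write(primary_volume, update):
        primary_volume.append(update)            # write at the primary
        replication_log.put(update)              # queue, in order, for later forwarding
        # The write completes here: no network round trip in the application's path.

    def replication_sender():
        """Drain the log toward the secondary in the background, as bandwidth allows."""
        while True:
            update = replication_log.get()
            time.sleep(0.02)                     # stand-in for WAN transfer time
            print(f"secondary applied {update}")
            replication_log.task_done()

    threading.Thread(target=replication_sender, daemon=True).start()

    primary = []
    for i in range(5):                           # a burst of updates makes the queue grow...
        asynchronous_write(primary, f"txn-{i}")
    replication_log.join()                       # ...and it drains once the burst subsides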

It is important to ensure with either asynchronous or synchronous replication that the secondary site faithfully tracks the order of the application writes at the primary. This is called "write order fidelity." Without write order fidelity there is no guarantee that a secondary site will have consistent, recoverable data. For example, in a database environment, updates are made to both the log and data spaces of a database management system in a fixed sequence. The log and data space are usually in different volumes, and the data itself can be spread over several additional volumes. A well-designed replication solution needs to consistently safeguard write order fidelity for the entire application environment. This may be accomplished by a logical grouping of data volumes so the order of updates in that group is preserved within and among all secondary copies of these volumes. Products such as VERITAS Volume Replicator will guarantee write order fidelity.
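
One way to picture write order fidelity is a single, group-wide sequence number: every update to any volume in the group is stamped at the primary, and the secondary applies updates strictly in stamp order regardless of which volume they touch. The sketch below illustrates the concept only; it is not a description of how Volume Replicator or any other product is implemented.

    import itertools

    group_sequence = itertools.count(1)      # one counter shared by every volume in the group

    def primary_write(volume, data):
        """Stamp each update with the group-wide sequence number, in write order."""
        return {"seq": next(group_sequence), "volume": volume, "data": data}

    def secondary_apply(updates, secondary_volumes):
        """Apply updates strictly by sequence number, across all volumes in the group."""
        for u in sorted(updates, key=lambda u: u["seq"]):
            secondary_volumes.setdefault(u["volume"], []).append(u["data"])

    # A database writes its log record before the matching data-space update.
    shipped = [
        primary_write("log_vol",  "begin txn 42"),
        primary_write("data_vol", "row update for txn 42"),
        primary_write("log_vol",  "commit txn 42"),
    ]

    secondary = {}
    secondary_apply(shipped, secondary)      # the commit never lands before its log record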
