Your Data Center is Down. What’s Your Plan?


In today's business environment, reliance on information processing systems grows almost daily. The data is no longer just part of the business; it is the business. As companies become more dependent on information systems and corporate data, the potential disruption caused by a loss of data grows as well. An enterprise must examine the requirements of the business and determine how much data loss and how much downtime can be tolerated without affecting the business itself.

As Gartner stated, "Two out of five enterprises that experience a disaster go out of business within five years. Business continuity plans and disaster recovery services ensure continuing viability." (Roberta Witty and Donna Scott, "Disaster Recovery Plans and Systems Are Essential," Gartner, 12 September 2001)

Because a disaster can put a company out of business, more organizations are implementing procedures and technologies to proactively manage recovery and availability. Many businesses can no longer wait the traditional two, three, or more days to recover critical applications and data after a disaster or site outage. As a result, organizations are not only maintaining a solid backup plan and local availability clusters, but also replicating data center information to remote sites, so that a complete disaster recovery solution can survive multiple potential outages.

Based on those requirements, the following technologies must be evaluated. For example, if an organization can afford to lose several days of data and take several days to recover it after a disaster, then a tape backup and restore strategy will suffice. As the tolerances shrink from days to minutes and seconds, replication and global clustering technologies must be incorporated into the environment.
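
As a rough illustration, the decision can be framed in terms of a recovery point objective (RPO, how much data loss is tolerable) and a recovery time objective (RTO, how much downtime is tolerable). The Python sketch below maps hypothetical RPO and RTO thresholds to the technology tiers discussed in this article; the cutoff values are invented for illustration and would come from a real business impact analysis.

    from datetime import timedelta

    def recommend_dr_technologies(rpo: timedelta, rto: timedelta) -> list:
        """Map tolerated data loss (RPO) and downtime (RTO) to DR tiers.

        The thresholds below are illustrative assumptions, not prescriptions.
        """
        plan = ["tape backup with offsite storage"]  # baseline safety net
        if rpo < timedelta(days=1):
            # Losing less than a day of data calls for remote replication.
            plan.append("data replication (synchronous or asynchronous)")
        if rto < timedelta(hours=4):
            # Resuming service in minutes calls for automated site failover.
            plan.append("global clustering with automated site migration")
        return plan

    # Example: a system that can lose 30 seconds of data and be down for
    # at most 5 minutes needs all three tiers.
    print(recommend_dr_technologies(timedelta(seconds=30), timedelta(minutes=5)))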

As the requirements of the data are examined, an organization may decide that a tape backup solution provides enough protection. Although the restored data may be older, recovery after a disaster could also take several days. If the data loss inherent in a backup strategy is acceptable, the organization can still reduce the recovery time by replicating the backup catalog server to the offsite tape storage facility. With the catalog replicated, recovery from tape can begin immediately rather than waiting for the catalog to be rebuilt, cutting recovery time from days to hours.

Traditionally, enterprises have used tape-based technologies to recover data, but data recovered from tape tends to be old, typically by days, depending on when the last backup was taken. Achieving minimal or no data loss requires that enterprises build data replication architectures into their disaster recovery plans. Replication is an automated, rules-based method for the geographical distribution of identical data; the second data set is necessary in the event the data at the original location cannot be accessed. Whereas a backup strategy provides the needed safety net for an enterprise's data, replication transfers data in real time over a wide area network (WAN) to the secondary location, keeping an up-to-date copy of the data at a remote site should it be required.

There are two main types of replication: synchronous and asynchronous. Both have their advantages, and both should be available options for the IT administrator. They use different processes to arrive at the same goal and deal somewhat differently with network conditions. The effectiveness of each depends ultimately on business requirements, such as how soon updates must be reflected at the target location. Performance is strongly determined by the available bandwidth, network latency, the number of participating servers, the amount of data to be replicated, and the geographical distance between the sites.

Synchronous Replication
Synchronous replication ensures that a write update has been received by the secondary node and acknowledged back to the primary application before the write operation completes. This way, in the event of a disaster at the primary location, data recovered from any surviving secondary server is completely up to date because all servers share the exact same data set. Synchronous replication produces an exact copy of the data at a secondary site, but may impact application performance in high latency or limited bandwidth situations. Synchronous replication is most effective in application environments with low update rates, but has also been effectively deployed in write-intensive environments where high bandwidth, low latency network connections are available.
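
In rough pseudocode terms, the synchronous write path looks like the Python sketch below. The volume and link objects are hypothetical stand-ins, not any product's API; the essential point is that step 3 blocks, so the application sees the write complete only after the secondary has acknowledged it.

    def synchronous_write(local_volume, secondary_link, block: bytes) -> None:
        """Sketch of a synchronously replicated write (illustrative only).

        local_volume and secondary_link are assumed objects: a writable
        local device and a reliable network channel to the secondary site.
        """
        local_volume.write(block)           # 1. apply the update locally
        secondary_link.send(block)          # 2. ship the same update to the secondary
        ack = secondary_link.receive_ack()  # 3. block until the secondary acknowledges
        if not ack:
            raise IOError("secondary did not acknowledge the write")
        # 4. only now does the write return to the application, so both
        #    sites are guaranteed to hold the same data set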

Asynchronous Replication
During asynchronous replication, application updates are written at the primary and persistently queued for forwarding to each secondary host as network bandwidth allows. When the writing application experiences temporary surges in update rate, this queue may grow. Unlike synchronous replication, the writing application does not suffer the response time degradation caused by each update incurring the cost of a network round trip. During periods when the update rate is less than the available network bandwidth, this queue drains faster than it grows, allowing the secondary data state to catch up rapidly with that of the primary. With appropriately sized network bandwidth, an asynchronous secondary should (on average) lag the primary by only a few milliseconds. During periods of heavy updates or network outage, the secondary may fall further behind.
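
The asynchronous path can be sketched with a queue and a background sender thread, as below. The in-memory queue and the send_to_secondary function are simplifying assumptions; a production replicator would persist the queue so that pending updates survive a crash or restart.

    import queue
    import threading

    class AsyncReplicator:
        """Sketch of asynchronous replication (illustrative only)."""

        def __init__(self, local_volume, send_to_secondary):
            self.local = local_volume        # assumed writable local device
            self.send = send_to_secondary    # assumed network send function
            self.pending = queue.Queue()     # grows during update surges
            threading.Thread(target=self._drain, daemon=True).start()

        def write(self, block: bytes) -> None:
            # The application's write completes as soon as the local update
            # lands; there is no network round trip in the write path.
            self.local.write(block)
            self.pending.put(block)

        def _drain(self) -> None:
            # Forward queued updates as bandwidth allows. When the update
            # rate drops below the link speed, the queue drains and the
            # secondary catches up with the primary.
            while True:
                self.send(self.pending.get())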

It is worth noting that in both asynchronous and synchronous replication, the managed volumes at each secondary must faithfully track those at the primary, with updates applied in the same order in which they occurred. This property is called "write order fidelity." Without write order fidelity there is no guarantee that a secondary will have consistent, recoverable data. A well-designed replication solution must consistently safeguard write order fidelity for the entire application environment. This may be accomplished by logically grouping data volumes so that the order of updates in that group is preserved within and among all secondary copies of those volumes.
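
One way such a grouping might work is sketched below: a single sequence counter spans every volume in the group, and the secondary applies updates strictly in sequence order, holding back any that arrive early. The class and method names are invented for illustration and do not describe any particular product.

    import heapq
    from itertools import count

    class ConsistencyGroup:
        """Tags each update across a group of volumes with one total order."""

        def __init__(self):
            self._seq = count(1)

        def tag(self, volume_id: str, block: bytes) -> tuple:
            # One counter spans all volumes in the group, so ordering is
            # preserved across the group, not merely within each volume.
            return (next(self._seq), volume_id, block)

    class SecondaryApplier:
        """Applies updates strictly in sequence order (illustrative only)."""

        def __init__(self, volumes: dict):
            self.volumes = volumes  # maps volume_id to a writable device
            self.next_seq = 1
            self.held = []          # updates that arrived out of order

        def receive(self, update: tuple) -> None:
            heapq.heappush(self.held, update)
            # Apply only in-sequence updates and hold any that skipped
            # ahead, so the secondary always reflects a consistent,
            # recoverable point in time.
            while self.held and self.held[0][0] == self.next_seq:
                _, volume_id, block = heapq.heappop(self.held)
                self.volumes[volume_id].write(block)
                self.next_seq += 1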

The key element in replication is ensuring an uncompromised data mirror at a disaster-safe location. Depending on business requirements, this could be across campus, across town, or across continents. In addition to placing data at a disaster-safe location, the ideal replication solution ensures that the replica volumes are current (fully up to date), complete (free from errors), and consistent (recoverable).

Global Availability
Once you have a replica copy of your data at a remote site, you need an automated method of starting your applications at the remote location and redirecting user traffic to the secondary site. Global clustering extends the concept of local clustering to the wide area by providing automated application migration and management of replication jobs from a primary data center to a geographically dispersed location. This keeps critical applications and data available even in the event of a disaster.

Should a disaster occur, a single mouse click or command-line invocation can initiate a complete site migration. This includes promoting the secondary site to primary status and reversing replication roles to prepare the original site for eventual failback once the site issues are resolved.
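
The steps behind that single action might look like the sketch below. Every object and method name here is an invented placeholder; real global clustering products hide these steps behind a console click or one command.

    def migrate_site(old_primary, new_primary, replication_link):
        """Sketch of an automated wide-area site migration (illustrative only)."""
        try:
            old_primary.stop_applications()  # best effort; the site may be unreachable
        except ConnectionError:
            pass                             # in a true disaster, proceed anyway
        new_primary.promote()                # secondary now holds the primary copy
        replication_link.reverse_roles()     # old primary becomes the replication
                                             # target, ready for eventual failback
        new_primary.start_applications()     # bring critical applications back up
        new_primary.redirect_clients()       # point user traffic at the new site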

Summary
In today's environments, an organization's tolerance for downtime and data loss must be examined so that the organization is prepared when a disaster strikes. If it can afford several days of data loss and days of downtime, a traditional backup strategy can meet its needs. As tolerable data loss shrinks from days to seconds or milliseconds, replication technologies need to be implemented; as tolerable downtime approaches zero, global availability technologies must be incorporated as well. To completely protect the enterprise while allowing minimal data loss and downtime, backup, replication, and global clustering technologies all need to be incorporated for a complete disaster recovery solution.