
Resiliency vs Redundancy: Using VPLEX for SQL HA


A little history on my philosophy around high-availability

Around the year 2000, when I was working in network operations for a large wireless telco, a very senior network architect explained to me the company’s philosophy on building high-availability solutions into the network.  The phrase I remember from that conversation was “we don’t build redundant networks, we build resilient networks.”  The difference is that while redundant networks fail over to secondary paths to resume traffic, resilient networks don’t go down at all.  This concept has stuck with me ever since, and I tend to tackle high-availability problems of all kinds with this idea in mind.  Frankly, it has been very difficult to build solutions that are resilient across the entire stack, mostly because infrastructure technology hasn’t quite gotten there yet.

Things may have changed…

I recently had a meeting with a customer to discuss local high availability for SQL.  This customer has a very large multi-node clustered SQL environment (hundreds of TBs of data, hundreds of databases, hundreds of instances, many clusters, many nodes per cluster) and has been testing SQL database mirroring as an alternative to traditional Windows Failover Clustering.  The meeting wound up focused primarily on leveraging VPLEX as an alternative to SQL mirroring, and the reasons for that decision suddenly reminded me of the Resiliency vs Redundancy discussion I had years ago.  A VPLEX solution potentially solves the same problem as DB mirroring, and does so with less complexity and less risk.

VPLEX Local as a Resilient HA solution

One of the many features of VPLEX is its ability to mirror data across multiple storage arrays and present that mirror as a single LUN to the host.  For customers already running large multi-node MSCS clusters, the LUN appears just like any normal storage LUN and Windows/SQL treat the LUN normally.  There are several reasons VPLEX should be considered as an alternative to database mirroring.  (Much of this applies to Exchange CCR as well.)

VPLEX hardware is inherently resilient.  A VPLEX cluster is an N+1 cluster of loosely coupled nodes, cooperating with each other, but not depending on each other.  Hosts can access any of the hosted data, through any of the ports, on any of the cluster nodes.  If a node fails for any reason, the remaining nodes continue serving IO for any data.  Except for a dead path on the host side (managed by PowerPath or MPIO), there is no failover process and no cache mirroring to worry about.  The potential performance impact of a failure is equal to 1 divided by the quantity of that component in the cluster.  For example, a large VPLEX Local cluster has 128 x 8 Gb/s Fibre Channel ports across 8 director nodes, so losing one port costs 1/128 of the port capacity and losing one director costs 1/8 of the director capacity.
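
To put rough numbers on that "1 divided by the quantity" rule, here is a small Python sketch.  The component counts come from the figures above; the function name is my own, and the calculation assumes load is spread evenly across identical components, which is an idealization.

```python
from fractions import Fraction


def capacity_lost(component_count: int, failed: int = 1) -> Fraction:
    """Fraction of a component type's aggregate capacity lost when `failed` units fail.

    Assumes load is spread evenly across identical components (an idealization).
    """
    if component_count <= 0 or not 0 <= failed <= component_count:
        raise ValueError("need component_count > 0 and 0 <= failed <= component_count")
    return Fraction(failed, component_count)


# Counts quoted for a large VPLEX Local cluster.
PORTS = 128
DIRECTORS = 8

port_loss = capacity_lost(PORTS)
director_loss = capacity_lost(DIRECTORS)

print(f"one port failed:     {float(port_loss):.2%} of port capacity lost")
print(f"one director failed: {float(director_loss):.2%} of director capacity lost")
```

The more units of a component the cluster has, the smaller the slice any single failure takes out.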

In addition, because VPLEX utilizes a write-through cache, there is never any dirty cache data (data in cache that has not been committed to disk) in a VPLEX system.  A power outage or VPLEX hardware failure does not put data at risk.
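
The write-through idea can be illustrated with a toy model.  This is only a sketch of the general caching technique in Python, not VPLEX code; the class, the dictionary standing in for the storage arrays, and the method names are all made up for illustration.

```python
class WriteThroughCache:
    """Toy write-through cache: a write reaches backing storage before it is cached."""

    def __init__(self, backing: dict):
        self.backing = backing  # stands in for the storage arrays
        self.cache = {}         # volatile read cache

    def write(self, block: int, data: bytes) -> None:
        self.backing[block] = data  # commit to storage first
        self.cache[block] = data    # then keep a clean copy for later reads

    def read(self, block: int) -> bytes:
        if block not in self.cache:
            self.cache[block] = self.backing[block]  # cache miss: fetch from storage
        return self.cache[block]

    def lose_power(self) -> None:
        self.cache.clear()  # cache contents vanish, but none of them were dirty


disk = {}
cache = WriteThroughCache(disk)
cache.write(7, b"committed")
cache.lose_power()
survived = cache.read(7)
print(survived)
```

Because every write lands on the backing store before the cache keeps a copy, wiping the cache loses nothing that was acknowledged.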

Other Advantages of using VPLEX over SQL Database Mirroring

Improved Performance:

  • Compared with SQL Database mirroring, VPLEX mirroring has significantly less impact on transaction performance for writes and can improve transaction performance in some cases due to the large read cache in the VPLEX directors. (Note: I am comparing to DB Mirroring in Full-Safety mode since the customer’s requirement was a zero-data-loss solution.)

Non-Disruptive Storage Failover:

  • In the event of a storage failure, SQL Mirroring must perform a cluster node failover, which takes a few seconds at best and can disrupt applications.  VPLEX provides completely non-disruptive failover when a storage failure occurs.  (A server hardware failure still triggers a node failover as it would in any other failover clustering scenario.)

Less Management Overhead:

  • From a management perspective, using VPLEX instead of SQL Database mirroring gives the SQL DBAs fewer SQL instances and fewer moving parts to manage on a daily basis.  The storage team just presents a mirrored LUN from VPLEX to the cluster and it’s business as usual for the DBAs.
  • VPLEX also allows the storage team to non-disruptively migrate data between storage arrays behind VPLEX to balance load, perform hardware refreshes, or resolve capacity problems.  VPLEX performs the migration at the direction of the storage admins.

Reduced Risk:

  • Reducing management complexity also reduces risk.  With a high number of database instances and db mirrors involved in a large environment like this one, the chance of one of those mirrors having a problem, or being configured incorrectly, is increased.  DBAs can rely on VPLEX mirroring all of the data, 24x7x365, even when host maintenance is being performed.

Reduced Cost:

  • When compared with the SQL Database Mirroring solution, the VPLEX solution reduced the number of physical servers needed in this environment, cutting cost enough to more than offset the cost of VPLEX itself.  Combined with reductions in soft costs, like DBA management overhead, VPLEX will actually save the customer quite a bit of money.  In this case, increased uptime during storage refresh and maintenance will increase revenues as well.

A Distributed Future:

  • Next year, when a second datacenter is online nearby, the first VPLEX Local cluster can be connected to another VPLEX cluster in the new datacenter.  Then the SQL cluster nodes and data can be distributed across both datacenters, providing protection from entire datacenter outages, or solving space constraints with no changes to the application or servers, and no downtime.

I wonder how many other customers would like to build more resilient infrastructures?

If you combine a VPLEX solution with a true cluster file system and an active-active database engine (e.g., Oracle RAC), you can eliminate the disruption caused by server hardware failures.  It’s just a matter of time now until the entire stack can be designed for true resiliency with very little management overhead.  I can’t wait to see what happens.

The following EMC White Paper has a lot of good information about using VPLEX in this same context:

Workload Resiliency with EMC VPLEX

Do you have a recovery plan? You should!


In my new role at EMC, I am one of the first people to learn of major problems that my customers experience.  In general, customers seem to call their sales team before technical support when a big problem happens.  In the past week, I’ve been involved in recovery efforts with two different customers, both resulting from complete power outages in their production datacenters.

Both of these customers process millions of dollars through their global customer-facing websites.  The smaller of the two does not have a disaster recovery site of any kind, while the larger customer does have a recovery site, but it is not designed for 100% operation and is hundreds of miles away.

What became clear through both of these incidents is that having a very clear, very well known recovery plan is critical to the business.  Interestingly, these experiences drove home the point that even if you don’t have a recovery site, aren’t using replication, and otherwise don’t have any way to recover the data offsite, you still need a plan that encompasses what you CAN do.  More often than not, major outages are short lived and you will be recovering in your primary datacenter anyway, so you need to have a pre-determined plan to prevent major issues and shorten the time to recover.

Here are some things to think about when creating a recovery plan:

  • Get the application owners together and build a list of all the applications running in your environment.  Document the purpose of each application and map dependencies that each application has on other applications.
  • Next, involve the server/systems admins and document the server names, database names, IP addresses, and DNS names for each application on the list.
  • Finally, involve the infrastructure teams (storage, network, datacenter) and document the network dependencies (subnets, routers, VPN connections, load balancers, etc).  Document any SAN storage used by the servers/applications.  Also document how each infrastructure component affects others (e.g., the SAN switches must be operational before servers can connect to storage arrays).
  • Work with business leaders to prioritize the applications.  The idea is to understand how much impact each application has to the business both from a productivity perspective as well as direct financial impact.  There may be legal requirements or service level agreements with customers to consider as well.
  • If possible, identify the maximum amount of time each application can be down after a catastrophic event (RTO – Recovery Time Objective) and how much data can be lost without significant impact to the business (RPO – Recovery Point Objective).  These metrics are usually measured in minutes, hours, or days.
  • Document the backup method for each server and application.  How often are backups run?  What is the retention period?  How long does it take to complete backups?   What is the expected time to restore the data?  How long does it take to recall tapes from offsite storage?
  • At this point you have a prioritized list of applications.  Now build a step-by-step recovery plan that lists the exact order in which you must recover systems.  The list should include server names as well as validation points to ensure certain systems are working before moving to the next step.  For example:
    • Step 1: bring up the network switches and routers
    • Step 2: bring up the DNS/DHCP servers
    • Step 3: bring up Active Directory servers
    • Step 4: bring up SAN fabric switches
    • Step 5: bring up SAN storage arrays, verify health of arrays with help from vendor
    • Step 6: …
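
One way to derive a recovery order from the documented dependencies is a topological sort.  The sketch below uses Python's standard `graphlib` module (Python 3.9+).  The component names are taken from the example steps above, but the dependency map itself is my assumption for illustration; your own inventory will have its own dependencies and far more entries.

```python
from graphlib import TopologicalSorter

# Each key lists the components it depends on.  This map is an assumed,
# illustrative inventory; build yours from the dependency documentation.
depends_on = {
    "network switches and routers": set(),
    "DNS/DHCP servers": {"network switches and routers"},
    "Active Directory servers": {"DNS/DHCP servers"},
    "SAN fabric switches": set(),
    "SAN storage arrays": {"SAN fabric switches"},
}

# static_order() yields every component after all of its dependencies.
recovery_order = list(TopologicalSorter(depends_on).static_order())

for step, component in enumerate(recovery_order, start=1):
    print(f"Step {step}: bring up {component}")
```

If the documented dependencies contain a loop, `static_order()` raises `graphlib.CycleError`, which is a useful signal that the inventory needs another look before an outage forces the question.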

I recommend that one of the first steps before starting recovery is to contact your key vendors (storage array vendors at least) to notify them of your outage so they can get support resources ready to troubleshoot any hardware issues you may run into during the recovery.

  • Identify key players needed in a recovery, at least primary and secondary contacts for every application and vendor contacts for hardware/software, facilities, UPS/Generator support teams, etc.
  • Establish a standard communication plan to include at least the following…
    • A method to notify employees of an outage and give instructions
    • A method to notify key players for recovery
    • A mechanism for key players to communicate with each other during the recovery
    • Personal (not corporate/business) contact information for all of the key players

The key thing to remember here is that you cannot rely on any communication tools that are part of your infrastructure.  You must assume your PBX/VoIP system will be down, email will be down, corporate instant messenger will be down, SharePoint will be unavailable, etc.

  • If you have a remote recovery site, with or without replication technology, and intend to use the remote site to recover production applications in the event of a large failure, be sure to document the triggers for moving to the recovery site.  As an example, you may want to attempt recovery in the primary site first, and then move to the recovery site if recovery at the primary site will take too long.  Be sure to document that time limit and get executive sign-off.  You should not hear “how long do we wait until we move to the DR site?” during an active recovery operation.  That decision needs to be made during the planning exercise.
  • Document the entire plan and store the digital copies in a readily accessible place (file shares, SharePoint site, etc).  Keep additional copies on USB sticks or CDs stored in a safe place.  Keep even MORE copies in another location outside the primary datacenter facility (e.g., safe deposit box, remote office safe, etc).  Print copies as well and store the printed copies in similar safe places.  Assume that a building may not be accessible due to fire or flood.  I know one customer who issues fingerprint-secured USB sticks to every manager.  Each manager must sync their USB stick to a server at least monthly or upper management is notified.
  • Make sure that everyone is aware of the recovery plan, who has access to the plan, where the copies are stored, and what role each of the key players is expected to play during a recovery.
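
The recovery-site trigger mentioned in the list above is easy to write down as an explicit rule.  Here is a minimal Python sketch; the function name and the 4-hour threshold are hypothetical, and the real number has to come out of your own planning exercise.

```python
from datetime import timedelta


def should_move_to_recovery_site(estimated_primary_recovery: timedelta,
                                 agreed_threshold: timedelta) -> bool:
    """Return True when the pre-approved trigger for moving to the DR site is met."""
    return estimated_primary_recovery > agreed_threshold


# Hypothetical threshold; the real value is decided during planning
# and signed off by executives, not during the outage.
THRESHOLD = timedelta(hours=4)

decision = should_move_to_recovery_site(timedelta(hours=6), THRESHOLD)
print("move to recovery site" if decision else "keep recovering at primary site")
```

The code is trivial on purpose.  The value lies in having the threshold agreed and written down before anyone needs it.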

There is far more to think about but hopefully you can get a good start with what I’ve listed above.  If you have a recovery plan already, you should review it regularly and think about anything that needs to be added or modified in the plan.

If you are having trouble getting executive approval for a remote recovery site and replication technology, going through this exercise and defining application priority with RPO/RTO for each application could give you the ammo you need.  Traditional backup architectures aren’t designed for RPOs under 24 hours, while storage array based replication can get RPOs down into the minutes.  Restoring from tape also takes far longer than restoring from replicated data.
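
To make that argument concrete, you can compare each application's RPO/RTO against what a given protection method can realistically deliver.  The Python sketch below shows the idea; every name and number in it is made up for illustration, so plug in the figures from your own backup documentation.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class App:
    name: str
    rpo: timedelta  # maximum tolerable data loss
    rto: timedelta  # maximum tolerable downtime


@dataclass
class Protection:
    name: str
    worst_case_data_loss: timedelta   # e.g. time between backups, or replication lag
    expected_restore_time: timedelta


def gaps(app: App, method: Protection) -> list[str]:
    """List the objectives that `method` fails to meet for `app`."""
    problems = []
    if method.worst_case_data_loss > app.rpo:
        problems.append("RPO")
    if method.expected_restore_time > app.rto:
        problems.append("RTO")
    return problems


# All figures below are hypothetical.
storefront = App("storefront", rpo=timedelta(minutes=15), rto=timedelta(hours=4))
nightly_tape = Protection("nightly tape backup", timedelta(hours=24), timedelta(hours=36))
replication = Protection("array replication", timedelta(minutes=5), timedelta(hours=2))

for method in (nightly_tape, replication):
    missed = gaps(storefront, method)
    verdict = "misses " + " and ".join(missed) if missed else "meets both objectives"
    print(f"{method.name}: {verdict}")
```

A table of applications whose objectives the current backups cannot meet is usually a more persuasive exhibit than a general appeal for better DR.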

Last but not least, keep the plan updated as your environment changes.  Add new application and server details to the plan as part of the implementation process for new applications, or as part of change control procedures for significant changes to the infrastructure.