Tag Archives: planning

Time flies when you’re having fun!

I can’t believe it’s already been a year and a half since my last post…  I’m sorry for the lack of content here.  Things have been so busy at EMC as well as at home and so much of what I’ve been working on is customer proprietary that I’ve had trouble thinking of ways to write about it.  In the meantime I’ve taken on a new role at EMC in the last month which will likely change what I’m thinking about as well as how I look at the storage industry and customer challenges.

In the past couple of years I’ve been involved in projects ranging from data lifecycle and business process optimization, storage array performance analysis, and scale out image and video repositories, to Enterprise deployments of OpenStack on EMC storage, Hadoop storage rationalization, and tools rationalization for capacity planning.  It is these last three items that have, in part, driven me into taking on a new role.

For the first three and a half years I’ve spent at EMC I’ve been an Enterprise Account Systems Engineer in the Pacific Northwest.  Technically, I was first hired into the TME (Telco/Media/Entertainment) division focused on a small set (12 at first) of accounts near Seattle.  After about a year of that, the TME division was merged into the Enterprise West division covering pretty much all large accounts in the area, but the specific customers I focused on stayed the same.  For the past year or so I’ve spent about 80% of my time working with a very large and old (compared to other original dot-coms) online travel company.  The rest of my time was spent with a handful of media companies.  I’ve learned A TON from my coworkers at EMC as well as my customers.  It’s amazing how much talent is lurking in the hallways of anonymous black glass buildings around Seattle, and EMC stands out as having the highest percentage of type-A geniuses (does that exist?) of any place I’ve worked.

One of the projects I’ve been working on for a customer of mine is related to capacity planning.  As you may know, EMC has several software products (some old, some new, some mired in history) dedicated to the task of reporting on a customer’s storage environment.  These software products all now fall under the management of a dedicated division within EMC called ASD (Advanced Software Division).  Over the past 13 years, EMC has acquired and integrated dozens of software companies and for a long time these software products were all point solutions that, when viewed as a set, covered pretty much every infrastructure management need imaginable.  But they were separate products.  In the past couple years alone massive progress has been made towards integrating them into a cohesive package that is much better aligned and easier to consume and use.

In just the past 12 months, one acquisition in particular has greatly contributed to EMC’s recent, and I’ll say future, success in the management tools sector: Watch4Net.  More accurately, the product was APG (Advanced Performance Grapher) from a company called Watch4Net, but it is now the flagship component of EMC’s Storage Resource Management (SRM) Suite.

I’ve been spending a lot of time with SRM Suite lately at several customer sites and I’m really quite impressed.  SRM Suite is NOT ECC (for those of you who know and love AND hate ECC), and it’s not ProSphere, or even what ProSphere promised; it’s better, it’s easier to deploy, it’s easier to navigate, it’s MUCH faster to navigate, it’s easier to customize (even without Professional Services), it’s massively extensible, and it works today!  The Watch4Net software component is really a framework for collection, data storage, and presentation of data, and it includes dozens of Solution Packs (combinations of collector plug-ins and canned reports for specific products).  And more Solution Packs are coming out all the time, and you can even make your own if you want to.

What I really like about SRM Suite is the UI that came from Watch4Net.  It’s browser based (yes it supports IE, Chrome, Firefox, Mac, PC, etc) and you can easily create your own custom views from the canned reports.  You can even combine individual components (ie: graphs or tables) from within different canned reports into a single custom view.  And any view you can create, you can schedule as an emailed, FTP’d, or stored report with 2 clicks.  Have an extremely complex report that takes a while to generate?  Schedule it to be pre-generated at specific times during the day for use within the GUI, again with 2 clicks.

As slick as the GUI is, the magic of SRM Suite comes from the collectors and reports that are included for the various parts of your infrastructure.  There are SolutionPacks for EMC and non-EMC storage arrays, multiple vendors’ Fibre Channel switches, Cisco, HP, and IBM servers, IP network switches and routers, VMware, Hyper-V, Oracle, SQL, MySQL, Frame-Relay, MPLS, Cisco WiFi networks, and many more.  This single tool provides drill-down metrics on individual ports of a SAN switch for a Storage Engineer, capacity forecasting for management, and rollup health dashboards for your company’s executives.  And those same execs can get their reports on their iPhones and iPads with the Watch4Net APG iOS app anywhere they happen to be.

(From vTexan’s post about SRM)

It’s hard to paint the picture in words or even a few screenshots, so you should ask your local EMC SE for a demo!

The second Big Deal coming from EMC’s ASD division is EMC ViPR.  ViPR is EMC’s Software Defined Storage solution.  ViPR abstracts and virtualizes your SAN, NAS, Object, and Commodity storage into Virtual Pools and automates the provisioning process from LUN/FileSystem creation to masking, zoning, and host attach, all with Service Level definitions, Business Unit and Project role-based access, and built-in chargeback/showback reporting.  A full web portal for self-service is included, as well as a CLI, but the real power is the fully capable REST API, which allows your existing automation tools to issue requests to ViPR to handle end-to-end provisioning of your entire environment.  Best of all, ViPR has open APIs and supports heterogeneous storage (i.e., EMC and non-EMC), allowing you to extend the single ViPR REST API to all of your disparate storage solutions.

Looking at the future of the storage industry, as well as EMC as a company, I see ViPR, in combination with SRM Suite, as the place to be for the next few years at least.  And so that’s what I’m doing.  Right now I’m in the process of transitioning from my Account SE role into being one of just a handful of ASD Software Specialist SEs (sometimes also referred to as SDSpecialists).  In my new role I’ll be the local Specialist for SRM Suite, ViPR, Service Assurance Suite (aka EMC Smarts), and several other EMC products you probably never thought of as software, or probably never heard of.  There are many enhancements to all of the products on the near-term roadmap which will further solidify the ASD software portfolio as market leading, but I can’t talk too much about that here.  So ask your local EMC SE to set up a roadmap discussion at the same time as the demo you already asked for.

I do plan to get to writing more often again, and I believe that my new role in the ASD organization will provide good content for that.

More soon!

Do you have a recovery plan? You should!

In my new role at EMC, I am one of the first people to learn of major problems that my customers experience.  In general, customers seem to call their sales team before technical support when a big problem happens.  In the past week, I’ve been involved in recovery efforts with two different customers, both resulting from complete power outages in their production datacenters.

Both of these customers process millions of dollars through their global customer facing websites.  The smaller customer of the two does not have a disaster recovery site of any kind, while the other (larger) customer does have a recovery site, but it is not designed for 100% operation and is hundreds of miles away.

What became clear through both of these incidents is that having a very clear, very well known recovery plan is critical to the business.  Interestingly, these experiences drove home the point that even if you don’t have a recovery site, aren’t using replication, and otherwise don’t have any way to recover the data offsite, you still need a plan that encompasses what you CAN do.  More often than not, major outages are short lived and you will be recovering in your primary datacenter anyway, so you need to have a pre-determined plan to prevent major issues and shorten the time to recover.

Here are some things to think about when creating a recovery plan:

  • Get the application owners together and build a list of all the applications running in your environment.  Document the purpose of each application and map dependencies that each application has on other applications.
  • Next, involve the server/systems admins and document the server names, database names, IP addresses, and DNS names for each application on the list.
  • Finally, involve the infrastructure teams (storage, network, datacenter) and document the network dependencies (subnets, routers, VPN connections, load balancers, etc).  Document any SAN storage used by the servers/applications.  Also document how each infrastructure component affects others (ie: the SAN switches are required to be operational before servers can connect to storage arrays.)
  • Work with business leaders to prioritize the applications.  The idea is to understand how much impact each application has to the business both from a productivity perspective as well as direct financial impact.  There may be legal requirements or service level agreements with customers to consider as well.
  • If possible, identify the maximum amount of time each application can be down in the event of a catastrophic event (RTO – Recovery Time Objective) and how much data can be lost without significant impact to the business (RPO – Recovery Point Objective).  These metrics are usually measured in minutes, hours, and days.
  • Document the backup method for each server and application.  How often are backups run?  What is the retention period?  How long does it take to complete backups?   What is the expected time to restore the data?  How long does it take to recall tapes from offsite storage?
  • At this point you have a prioritized list of applications.  Now build a step-by-step recovery plan that lists the exact order in which you must recover systems.  The list should include server names as well as validation points to ensure certain systems are working before moving to the next step.  For example:
    • Step 1: bring up the network switches and routers
    • Step 2: bring up the DNS/DHCP servers
    • Step 3: bring up Active Directory servers
    • Step 4: bring up SAN fabric switches
    • Step 5: bring up SAN storage arrays, verify health of arrays with help from vendor
    • Step 6: …
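
If you captured the dependencies during the documentation steps above, you can let a script work out a valid recovery order for you.  Here is a minimal sketch in Python using the standard library's `graphlib.TopologicalSorter` (Python 3.9+).  The component names and the dependency map are made up for illustration; your own inventory will look different, and a real plan still needs the validation points and server names written in by hand.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each component lists what must be
# operational before it can be brought up.
DEPENDS_ON = {
    "network": [],
    "dns_dhcp": ["network"],
    "active_directory": ["dns_dhcp"],
    "san_fabric": [],
    "san_arrays": ["san_fabric"],
    "database": ["san_arrays", "active_directory"],
    "web_app": ["database"],
}


def recovery_order(depends_on):
    """Return the components in an order where every dependency comes first.

    Raises graphlib.CycleError if the map contains a circular dependency,
    which is worth finding out during planning rather than during an outage.
    """
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for step, name in enumerate(recovery_order(DEPENDS_ON), start=1):
        print(f"Step {step}: bring up {name}")
```

The sorter only guarantees that dependencies come before the things that need them.  Where two components are independent of each other, the business priority you agreed on earlier should decide which goes first.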

I recommend that one of the first steps before starting recovery is to contact your key vendors (storage array vendors at least) to notify them of your outage so they can get support resources ready to troubleshoot any hardware issues you may run into during the recovery.

  • Identify key players needed in a recovery, at least primary and secondary contacts for every application and vendor contacts for hardware/software, facilities, UPS/Generator support teams, etc.
  • Establish a standard communication plan to include at least the following…
    • A method to notify employees of an outage and give instructions
    • A method to notify key players for recovery
    • A mechanism for key players to communicate with each other during the recovery
    • Personal (not corporate/business) contact information for all of the key players

The key thing to remember here is that you cannot rely on any communication tools that are part of your infrastructure.   You must assume your PBX/VOIP system will be down, Email will be down, corporate instant messenger will be down, Sharepoint will be unavailable, etc.

  • If you have a remote recovery site, with or without replication technology, and intend to use the remote site to recover production applications in the event of a large failure, be sure to document the triggers for moving to the recovery site.  As an example, you may want to attempt recovery in the primary site, and then move to the recovery site if recovery at the primary site will take too long.  Be sure to document that time limit and get executive buy-off.  You should not hear “how long do we wait until we move to the DR site?” during an active recovery operation.  That decision needs to be made during the planning exercise.
  • Document the entire plan and store the digital copies in a readily accessible place (file shares, Sharepoint site, etc).  Keep additional copies on USB sticks or CDs stored in a safe place.  Keep even MORE copies in another location outside the primary datacenter facility (ie: safe deposit box, remote office safe, etc).  Print copies as well and store the printed copy in similar safe places.  Assume that a building may not be accessible due to fire or flood.  I know one customer who issues fingerprint secured USB sticks to every manager.  Each manager must sync their USB stick to a server at least monthly or upper management is notified.
  • Make sure that everyone is aware of the recovery plan, who has access to the plan, where the copies are stored, and what role each of the key players is expected to play during a recovery.
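
To make the "when do we move to the DR site?" trigger concrete, it helps to write it down as a rule that nobody has to debate during an outage.  The sketch below is one hypothetical way to express it in Python; the function name, the parameters, and the four-hour limit are all assumptions for illustration, not a recommendation for your environment.

```python
from datetime import timedelta


def should_fail_over(elapsed, estimated_remaining, max_primary_recovery):
    """Return True when recovery at the primary site is expected to take
    longer than the limit agreed on (and signed off) during planning.

    elapsed               -- time since the outage began
    estimated_remaining   -- current best estimate to finish primary recovery
    max_primary_recovery  -- pre-approved limit before moving to the DR site
    """
    return elapsed + estimated_remaining > max_primary_recovery


if __name__ == "__main__":
    limit = timedelta(hours=4)  # hypothetical, executive-approved limit
    if should_fail_over(timedelta(hours=2), timedelta(hours=3), limit):
        print("Trigger met: move to the recovery site")
    else:
        print("Keep recovering at the primary site")
```

The point is not the code itself but that the limit is a number decided in advance, so the people in the room only have to supply the current estimate.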

There is far more to think about but hopefully you can get a good start with what I’ve listed above.  If you have a recovery plan already, you should review it regularly and think about anything that needs to be added or modified in the plan.

If you are trying to get approval for a remote recovery site and replication technology and are having trouble getting executive approval, going through this exercise and defining application priority with RPO/RTO for each could give you the ammo you need.  Traditional backup architectures aren’t designed for RPOs under 24 hours, while storage array based replication can get RPOs down into the minutes.  Restoring from tape also takes far longer than restoring from replicated data.
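
One way to turn that exercise into ammo is to line up each application's RPO/RTO against what your current backups can actually deliver.  The following is a rough sketch in Python with invented applications and numbers.  It makes a simplifying assumption that the worst-case data loss equals the time between backups (in practice the backup duration adds to that) and that the restore time already includes recalling tapes from offsite storage.

```python
from dataclasses import dataclass


@dataclass
class AppProtection:
    name: str
    rpo_hours: float              # maximum tolerable data loss
    rto_hours: float              # maximum tolerable downtime
    backup_interval_hours: float  # time between backups
    restore_hours: float          # expected restore time, incl. tape recall


def gaps(app):
    """Return the objectives ("RPO", "RTO") the current backups cannot meet."""
    problems = []
    # Worst case: the failure happens just before the next backup runs,
    # so everything since the previous backup is lost.
    if app.backup_interval_hours > app.rpo_hours:
        problems.append("RPO")
    if app.restore_hours > app.rto_hours:
        problems.append("RTO")
    return problems


if __name__ == "__main__":
    # Hypothetical applications and figures, for illustration only.
    apps = [
        AppProtection("booking-site", rpo_hours=1, rto_hours=4,
                      backup_interval_hours=24, restore_hours=12),
        AppProtection("intranet-wiki", rpo_hours=24, rto_hours=48,
                      backup_interval_hours=24, restore_hours=12),
    ]
    for app in apps:
        missed = gaps(app)
        status = "misses " + " and ".join(missed) if missed else "OK"
        print(f"{app.name}: {status}")
```

Any application that shows up with a gap is a candidate for replication, and the list of gaps sorted by business priority is the kind of evidence executives tend to respond to.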

Last but not least, keep the plan updated as your environment changes.  Add new application and server details to the plan as part of the implementation process for new applications, or as part of change control procedures for significant changes to the infrastructure.