Why Should I Care about Recovery Point Objective (RPO) Assurance?

Recovery Point Objectives (RPOs) are incredibly important in your Disaster Recovery planning. Below I present ways you can assure that this critical Service Level Agreement (SLA) is being met today, tomorrow, and on that fateful day when some unexpected event comes calling. My favorite definition of RPO comes from the IT Infrastructure Library (ITIL):

The maximum amount of data that may be lost when service is restored after an interruption. The RPO is expressed as a length of time before the failure. For example, an RPO of one day may be supported by daily backups, and up to 24 hours of data may be lost.

The RPO SLA matters most for backup and replication solutions. The actual recovery point achievable in the event of an incident varies over time. For example, replication streams may not keep up during periods of heavy workload activity or may be held up by network congestion, and backup jobs might fail due to media errors. In some cases these issues can result in your actual recovery point exceeding the SLA agreed upon with your business stakeholders, and if disaster strikes you will not be able to recover as much data as promised.
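To make the exposure concrete: the achievable recovery point at any instant is simply the age of your newest recoverable copy, and you are in breach whenever that age exceeds the agreed RPO. Here is a minimal Python sketch of that arithmetic; the timestamps and the 15-minute SLA are made up for illustration and are not how any particular product computes it.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical figure: the RPO SLA agreed with the business is 15 minutes.
RPO_SLA = timedelta(minutes=15)

def recovery_point_exposure(last_consistent_copy: datetime, now: datetime) -> timedelta:
    """Age of the newest recoverable copy - the data you would lose if the
    primary system failed right now."""
    return now - last_consistent_copy

now = datetime(2014, 6, 1, 12, 0, tzinfo=timezone.utc)
last_copy = datetime(2014, 6, 1, 11, 38, tzinfo=timezone.utc)  # replication fell behind

exposure = recovery_point_exposure(last_copy, now)
print(f"Potential data loss right now: {exposure}")   # 0:22:00
print(f"Within RPO SLA: {exposure <= RPO_SLA}")       # False -> SLA breached
```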

So how can you tell whether you are exposed to this risk, how often it is happening and which business areas are affected?  Here at Neverfail, we have created a solution that can help. IT Continuity Architect can monitor the achievable recovery points across your infrastructure to detect any drift towards SLA breach and alert you in advance of the risk becoming an issue.  Additionally, it will relate these risks back to your individual business services. Phew! Problem solved. Let’s dig into the details.


Architect can monitor the achievable recovery point, or Recovery Point Estimate (RPE), for both vSphere replication and our own Failover Engine replication. VMware made commodity replication generally available for all virtual machines with vSphere 5.1 and improved it again in the 5.5 release. With support for the Volume Shadow Copy Service (VSS), this replication mechanism can be used to create application-consistent replicas of production workloads such as Exchange, SQL Server and SharePoint. If you throw in the orchestration capabilities of Site Recovery Manager (SRM), you have a pretty powerful Disaster Recovery solution. Unlike our Failover Engine's replication, which is continuous and provides near-zero RPO capabilities, vSphere replication can only support a minimum RPO of 15 minutes.
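Because vSphere replication cannot be configured below a 15-minute RPO, one practical consequence is that any tier promising a tighter SLA needs a different replication technology. The sketch below captures that rule of thumb; the function and the sample SLAs are purely illustrative and not part of any vSphere or Architect API.

```python
# vSphere replication cannot be configured below a 15-minute RPO, so a tier that
# promises a tighter SLA needs continuous replication instead. Illustrative only.
VSPHERE_REPLICATION_MIN_RPO_SECONDS = 15 * 60

def replication_choice(required_rpo_seconds: int) -> str:
    """Suggest which replication mechanism can honour a given RPO SLA."""
    if required_rpo_seconds >= VSPHERE_REPLICATION_MIN_RPO_SECONDS:
        return "vSphere replication (configure RPO >= 15 minutes)"
    return "continuous replication (e.g. Failover Engine) for near-zero RPO"

for sla in (24 * 3600, 900, 60):   # one day, 15 minutes, 1 minute
    print(f"RPO SLA {sla:>6}s -> {replication_choice(sla)}")
```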

Additionally, our Failover Engine provides replication for both physical and virtual machines, which can also be orchestrated from SRM (but that is another story). If you do choose to use vSphere replication to protect production workloads, and you want to mitigate the risks highlighted above, you really ought to be monitoring actual recovery points using Architect's RPO monitoring and SLA management capabilities. Let's see how that works.


Architect automatically discovers all of your infrastructure, applications and their dependencies (both upstream and downstream), and then helps you arrange these into discrete aggregations that support individual business services. Not all business services are equal – some are more critical than others – so you will naturally want to protect them with a spectrum of SLAs. In Architect you can assign a range of “protection tiers” to business services which encode, amongst other things, the RPO SLA that you agreed with the business stakeholders. Architect can then automatically detect the presence of replication activity, continuously check the Recovery Point Estimate (RPE) against the SLA, and advise you of any salient events. Let’s review a few examples to illustrate.
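To illustrate the idea of protection tiers (using made-up tier names, services and numbers rather than Architect's actual data model), the check boils down to comparing each service's current RPE with the RPO SLA encoded in its tier:

```python
# Illustrative-only model of "protection tiers": each tier encodes an RPO SLA and
# each business service is assigned one tier. Names and numbers are invented.
PROTECTION_TIERS = {
    "gold":   {"rpo_sla_seconds": 900},        # 15 minutes
    "silver": {"rpo_sla_seconds": 4 * 3600},
    "bronze": {"rpo_sla_seconds": 24 * 3600},
}

BUSINESS_SERVICES = {
    "order-processing": "gold",
    "intranet-portal":  "bronze",
}

def check_service(service: str, current_rpe_seconds: int) -> str:
    tier = BUSINESS_SERVICES[service]
    sla = PROTECTION_TIERS[tier]["rpo_sla_seconds"]
    status = "OK" if current_rpe_seconds <= sla else "SLA BREACH"
    return f"{service} ({tier}): RPE {current_rpe_seconds}s vs SLA {sla}s -> {status}"

print(check_service("order-processing", 1100))  # breaches the 900s gold SLA
print(check_service("intranet-portal", 1100))   # comfortably within 24 hours
```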

In the graph below, Architect is plotting the movement in RPE on a virtual machine (VM) that has vSphere replication enabled. In this scenario vSphere replication has been configured to support an RPO of 15 minutes (or 900 seconds), which is as low as it can go. You can see how the RPE oscillates over the course of 48 hours as the hypervisor tries to deal with fluctuations in workload and network capacity. Unfortunately, in some cases it can’t cope and the RPE exceeds 15 minutes, which violates the SLA and exposes your business to the risk of data loss.

[Figure: RPE over 48 hours for a VM protected by vSphere replication, plotted against the 15-minute RPO]
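If you sampled that RPE curve yourself, spotting the SLA breaches is straightforward. Here is a small sketch with invented sample values standing in for the 48-hour window shown above:

```python
# Hypothetical RPE samples (seconds) over a 48-hour window, mimicking the kind of
# oscillation shown in the graph above. The threshold is the configured 900-second RPO.
rpe_samples = [310, 450, 620, 880, 940, 720, 510, 400, 860, 910, 1020, 760, 430]
RPO_SECONDS = 900

violations = [(i, rpe) for i, rpe in enumerate(rpe_samples) if rpe > RPO_SECONDS]
worst = max(rpe_samples)

print(f"Samples exceeding the {RPO_SECONDS}s RPO: {len(violations)} of {len(rpe_samples)}")
print(f"Worst observed RPE: {worst}s "
      f"({worst - RPO_SECONDS}s of extra data at risk beyond the SLA)")
```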

Fortunately, the VM has been placed in an Architect tier that also has an RPO SLA of 15 minutes, and the RPE movement is continuously compared to this SLA. Architect will raise an alert if the RPE comes within a configurable tolerance of the SLA. In the portlet below you can see that over the period of inspection the virtual machine’s RPE came within 80% of the SLA on three occasions and within 50% on another. These warning alerts are designed to allow administrators to react – to check the network health or other potential root causes that might lie behind the replication stream’s difficulty. Because this allows proactive mitigation before the SLA is breached, you have assurance that you will not expose your business to the risk of data loss. For the purposes of illustration I did not intervene in this scenario and allowed the replication stream’s recovery point estimate to degrade beyond the RPO. As you can see below, Architect reacts with a critical alert to advise you of this dangerous situation. At this point, if the primary system is compromised for any reason, data loss is inevitable and a disappointing conversation with your business stakeholders will be necessary.

[Figure: RPO SLA portlet showing the warning alerts and the critical alert raised once the RPE exceeded the SLA]
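The alerting logic itself can be thought of as a simple classification against the SLA and its warning tolerances. The sketch below mirrors the 80% and 50% thresholds from the portlet above, but the function and values are illustrative rather than Architect's implementation:

```python
# Warn when the RPE drifts within a configurable tolerance of the SLA; escalate to
# critical once the SLA itself is breached. Thresholds and samples are illustrative.
def classify_rpe(rpe_seconds: float, sla_seconds: float,
                 warning_tolerances=(0.5, 0.8)) -> str:
    if rpe_seconds > sla_seconds:
        return "CRITICAL: RPO SLA breached - exposure exceeds the agreed RPO"
    ratio = rpe_seconds / sla_seconds
    for tolerance in sorted(warning_tolerances, reverse=True):
        if ratio >= tolerance:
            return f"WARNING: RPE at {ratio:.0%} of SLA (threshold {tolerance:.0%})"
    return "OK"

for rpe in (300, 480, 750, 950):           # sample RPE readings in seconds
    print(f"RPE {rpe:>4}s -> {classify_rpe(rpe, 900)}")
```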

Architect makes all of its functionality available within VMware’s vSphere web client. RPO monitoring pulls together a number of views, shown below, including:

  • A snapshot of current infrastructure which is within SLA, at risk of SLA breach or actually in breach of SLA.
  • An historical view of how the infrastructure fared over a period of time in terms of SLA health.
  • A timeline of movements in RPE for individual infrastructure elements.
  • A summary of recent or most important alerts relating to RPO SLAs.
  • The ability to change the window of inspection or focus in on specific business services or infrastructure elements.

[Figure: RPO monitoring views in the vSphere web client]
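As a rough idea of what might sit behind the snapshot view in the first bullet (with invented element names and an assumed 80% at-risk threshold), each monitored element can be bucketed by comparing its latest RPE to its SLA:

```python
# Illustrative aggregation for a "snapshot" view: bucket each monitored element into
# within-SLA, at-risk, or in-breach based on its latest RPE. Names and the 80%
# at-risk threshold are assumptions for the example.
from collections import Counter

elements = {                       # element -> (latest RPE seconds, RPO SLA seconds)
    "vm-exchange-01":   (420, 900),
    "vm-sql-02":        (780, 900),
    "vm-sharepoint-03": (1150, 900),
    "vm-fileserver-04": (300, 3600),
}
AT_RISK_RATIO = 0.8

def status(rpe: int, sla: int) -> str:
    if rpe > sla:
        return "in breach"
    return "at risk" if rpe / sla >= AT_RISK_RATIO else "within SLA"

snapshot = Counter(status(rpe, sla) for rpe, sla in elements.values())
print(dict(snapshot))   # e.g. {'within SLA': 2, 'at risk': 1, 'in breach': 1}
```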

In summary, monitoring and managing your RPO SLAs is a hugely important aspect of DR planning. It is particularly relevant to replication technologies, where the achievable recovery point fluctuates with other activity on your IT estate. vSphere replication offers a means to protect your virtual production workloads, but it needs to be monitored and managed to avoid unseen exposure to the risk of data loss. IT Continuity Architect, as a vSphere web client plug-in, offers a powerful means of assurance that your DR plans based on vSphere replication will succeed. You can see for yourself with a trial download of IT Continuity Architect, which you can get right here.


More Stories By Josh Mazgelis

Josh Mazgelis is senior product marketing manager at Neverfail. He has been working in the storage and disaster recovery industries for close to two decades and brings a wide array of knowledge and insight to any technology conversation.

Prior to joining Neverfail, Josh worked as a product manager and senior support engineer at Computer Associates. Before working at CA, he was a senior systems engineer at technology companies such as XOsoft, Netflix, and Quantum Corporation. Josh graduated from Plymouth State University with a bachelor’s degree in applied computer science and enjoys working with virtualization and disaster recovery.
