
Why Should I Care about Recovery Point Objective (RPO) Assurance?

Recovery Point Objectives (RPOs) are incredibly important in your Disaster Recovery planning. Below I present ways you can assure that this critical Service Level Agreement (SLA) is being met today, tomorrow, and on that fateful day when some unexpected event comes calling. My favorite definition of RPO comes from the IT Infrastructure Library (ITIL):

The maximum amount of data that may be lost when service is restored after an interruption. The RPO is expressed as a length of time before the failure. For example, an RPO of one day may be supported by daily backups, and up to 24 hours of data may be lost.
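To make the definition's arithmetic concrete, here is a minimal sketch (my own illustration, not Architect code or any ITIL artifact) of why a daily backup schedule implies up to 24 hours of data loss:

```python
from datetime import datetime, timedelta

def max_data_loss(last_good_backup: datetime, failure_time: datetime) -> timedelta:
    """Worst case: everything written since the last good backup is lost."""
    return failure_time - last_good_backup

# Daily backups at midnight; a failure just before the next run
# loses almost a full day of data, the worst case that an RPO of
# "one day" permits.
loss = max_data_loss(datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 23, 59))
assert loss <= timedelta(hours=24)
```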

The RPO SLA is most important in relation to backup and replication solutions. The actual recovery point achievable in the event of an incident varies over time. For example, replication streams may not keep up during periods of heavy workload activity or may be held up by network congestion, and backup jobs might fail due to media errors. In some cases these issues can result in your actual recovery point exceeding the SLA agreed upon with your business stakeholders, and if disaster strikes you will not be able to recover as much data as promised.
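As a rough illustration of how an achievable recovery point can be estimated and compared against an SLA, consider this hypothetical sketch; the function names are mine and are not part of any vSphere or Neverfail API:

```python
from datetime import datetime, timezone

def recovery_point_estimate(last_successful_sync: datetime, now: datetime) -> float:
    """Seconds of data that would be lost if the primary failed right now."""
    return (now - last_successful_sync).total_seconds()

def breaches_sla(rpe_seconds: float, rpo_sla_seconds: float) -> bool:
    """True when the achievable recovery point has drifted past the SLA."""
    return rpe_seconds > rpo_sla_seconds

# A replication stream that last synced 20 minutes ago cannot meet
# a 15-minute (900-second) RPO SLA.
last_sync = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
now = datetime(2024, 1, 1, 12, 20, tzinfo=timezone.utc)
rpe = recovery_point_estimate(last_sync, now)
```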

So how can you tell whether you are exposed to this risk, how often it is happening and which business areas are affected?  Here at Neverfail, we have created a solution that can help. IT Continuity Architect can monitor the achievable recovery points across your infrastructure to detect any drift towards SLA breach and alert you in advance of the risk becoming an issue.  Additionally, it will relate these risks back to your individual business services. Phew! Problem solved. Let’s dig into the details.


Architect can monitor the achievable recovery point, or Recovery Point Estimate (RPE), for both vSphere replication and our own Failover Engine replication. VMware made commodity replication generally available for all virtual machines beginning with vSphere 5.1 and improved it again in the 5.5 release. With support for the Volume Shadow Copy Service (VSS), this replication mechanism can be used to create application-consistent replicas of production workloads such as Exchange, SQL Server, and SharePoint. If you throw in the orchestration capabilities of Site Recovery Manager (SRM), you have a pretty powerful Disaster Recovery solution. Unlike the replication of our Failover Engine, which is continuous and offers near-zero RPO capabilities, vSphere replication can only support a minimum RPO of 15 minutes.

Additionally, our Failover Engine provides replication for both physical and virtual machines, which can also be orchestrated from SRM (but that is another story). If you do choose to use vSphere replication to protect production workloads, and you want to mitigate the risks highlighted above, you really ought to be monitoring actual recovery points using Architect's powerful RPO monitoring and SLA management capabilities. Let's see how that works.


Architect automatically discovers all of your infrastructure, applications, and their dependencies (both upstream and downstream), and then helps you arrange these into discrete aggregations that support individual business services. Not all business services are equal; some are more critical than others, so you will naturally want to protect them with a spectrum of SLAs. In Architect you can assign a range of "protection tiers" to business services, which encode, amongst other things, the RPO SLA you agreed with the business stakeholders. Architect then automatically detects replication activity, continuously checks the Recovery Point Estimate (RPE) against the SLA, and advises you of any salient events. Let's review a few examples to illustrate.
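The idea behind protection tiers can be sketched as a simple mapping; the tier names and RPO values here are illustrative choices of mine, not Architect's actual configuration:

```python
# Hypothetical protection tiers; each encodes the RPO SLA (in seconds)
# agreed with the business for services placed in that tier.
PROTECTION_TIERS = {
    "platinum": 900,    # 15 minutes: vSphere replication's minimum RPO
    "gold": 3600,       # 1 hour
    "silver": 86400,    # 24 hours, e.g. protected by a daily backup
}

def rpo_sla_for(tier: str) -> int:
    """Look up the RPO SLA a business service inherits from its tier."""
    return PROTECTION_TIERS[tier]
```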

In the graph below, Architect plots the movement in RPE for a virtual machine (VM) with vSphere replication enabled. In this scenario, vSphere replication has been configured to support an RPO of 15 minutes (900 seconds), which is as low as it can go. You can see how the RPE oscillates over the course of 48 hours as the hypervisor tries to deal with fluctuations in workload and network capacity. Unfortunately, in some cases it can't cope, and the RPE exceeds 15 minutes, violating the SLA and exposing your business to the risk of data loss.


Fortunately, the VM has been placed in an Architect tier that also has an RPO SLA of 15 minutes, and the RPE movement is continuously compared to this SLA. Architect will raise an alert if the RPE comes within a configurable tolerance of the SLA. In the portlet below you can see that, over the period of inspection, the virtual machine's RPE reached 80% of the SLA on three occasions and 50% on another. These warning alerts are designed to let administrators react: to check network health or other potential root causes behind the struggling replication stream. Because this allows proactive mitigation before the SLA is breached, you have assurance that you will not expose your business to the risk of data loss.

For the purposes of illustration, I did not intervene in this scenario and allowed the replication stream's recovery point estimate to degrade beyond the RPO. As you can see below, Architect reacts with a critical alert to advise you of this dangerous situation. At this point, if the primary system is compromised for any reason, data loss is inevitable, and a disappointing conversation with your business stakeholders will be necessary.
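The escalation described above could be modeled along these lines; the 80% and 50% tolerances mirror the example, but the function itself is a sketch of mine, not Architect's implementation:

```python
def alert_level(rpe_seconds: float, sla_seconds: float) -> str:
    """Classify RPE drift against the RPO SLA using illustrative tolerances."""
    ratio = rpe_seconds / sla_seconds
    if ratio >= 1.0:
        return "critical"    # SLA breached: achievable recovery point exceeds the RPO
    if ratio >= 0.8:
        return "warning-80"  # within 80% of the SLA; time to investigate
    if ratio >= 0.5:
        return "warning-50"  # early drift towards the SLA
    return "ok"

# A 15-minute (900-second) SLA: an RPE of 750 seconds trips the 80% warning.
assert alert_level(750, 900) == "warning-80"
```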


Architect makes all of its functionality available within VMware’s vSphere web client. RPO monitoring pulls together a number of views as shown below to include:

  • A snapshot of current infrastructure which is within SLA, at risk of SLA breach or actually in breach of SLA.
  • An historical view of how the infrastructure fared over a period of time in terms of SLA health.
  • A timeline of movements in RPE for individual infrastructure elements.
  • A summary of recent or most important alerts relating to RPO SLAs.
  • The ability to change the window of inspection or focus in on specific business services or infrastructure elements.


In summary, monitoring and management of your RPO SLAs is a hugely important aspect of DR planning. It is particularly relevant to replication technologies, where the achievable recovery point fluctuates with events elsewhere on your IT estate. vSphere replication offers a means to protect your virtual production workloads, but it needs to be monitored and managed to avoid unseen exposure to the risk of data loss. IT Continuity Architect, as a vSphere web client plug-in, offers a powerful means of assurance that your DR plans based on vSphere replication will succeed. You can see for yourself with a trial download of IT Continuity Architect, which you can get right here.


More Stories By Josh Mazgelis

Josh Mazgelis is senior product marketing manager at Neverfail. He has been working in the storage and disaster recovery industries for close to two decades and brings a wide array of knowledge and insight to any technology conversation.

Prior to joining Neverfail, Josh worked as a product manager and senior support engineer at Computer Associates. Before working at CA, he was a senior systems engineer at technology companies such as XOsoft, Netflix, and Quantum Corporation. Josh graduated from Plymouth State University with a bachelor’s degree in applied computer science and enjoys working with virtualization and disaster recovery.
