v10a – Supercharging Recovery Speeds from Object Storage
We always talk about starting with backup… and it is true that you can’t recover unless you have a solid backup. However, more and more, the recovery of workloads, and the speed at which they are recovered, is becoming crucial. One of the use cases I’ve been pushing since we released Cloud Tier back in 9.5 Update 4 is the ability to leverage our Scale-Out Backup Repository (SOBR) Object Storage Capacity Tier extents for migration and On-Demand Recovery. It’s also something that our Cloud and Service Provider partners can leverage for the same purposes. While we have all the goodness of being able to recover in multiple ways directly streaming off an Object Storage repository, the restore speeds we saw were limiting some use cases.
I am glad to say that with v10a, we included a previous HotFix that has resulted in a 10-15x increase in restore speeds. We have also enhanced restore speeds for Restore to Azure and to EC2.
Since late 2019 I’ve been testing restore scenarios for On-Demand Recovery from Object Storage specific to v10. I’ve also been involved in, and been made aware of, partner opportunities where the SOBR Object Storage Capacity Tier Extent has been used for migration and recovery. When these use cases are in play, recovery speed is important.
Factors That Can Impact Performance:
As with a lot of things in IT, your mileage may vary depending on a lot of factors. Though I have seen fairly consistent results across a number of platforms, the following should be considered as control points:
- Backup Server Compute and RAM configuration
- Backup Repository Compute, RAM, and Storage
- Speed of Egress/Ingress
- Object Storage Location and Placement
- Encryption Settings
- Code Efficiencies
Even within that list, there can be variances in performance. As an example, not every AWS Region is created equal, and Amazon S3 throughput can vary between regions, which could also impact restore speeds. To prove that, just fire up a copy of the AWS CLI and perform Bucket/Folder copies from one region to another.
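A minimal sketch of that cross-region comparison with the AWS CLI is below. The bucket names, folder, and regions are placeholders (assumptions, not from my lab), the copy itself needs AWS credentials, and your numbers will depend on your account, regions, and object sizes:

```shell
# Sketch of a cross-region copy timing test. Bucket names and regions are
# placeholder assumptions; the aws commands need credentials to actually run.

# Convert bytes transferred and elapsed seconds into MB/s (1 MB = 1048576 bytes)
throughput_mbs() {
  awk -v b="$1" -v s="$2" 'BEGIN { printf "%.1f", b / s / 1048576 }'
}

# Time the same recursive copy against two destination regions, e.g.:
#   start=$(date +%s)
#   aws s3 cp s3://source-bucket/test-folder/ s3://dest-bucket-us-east-2/test-folder/ \
#     --recursive --source-region us-east-1 --region us-east-2
#   throughput_mbs <bytes_copied> $(( $(date +%s) - start ))
# then repeat with a bucket in a different destination region and compare.

throughput_mbs 1073741824 30   # 1 GiB in 30 seconds -> 34.1
```

Running the same copy against two or three regions and comparing the MB/s figures makes the per-region variance obvious without any special tooling.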
v10 Recovery Speed Improvements Results:
As mentioned, I have been doing recoveries from Object Storage in varying scenarios for a while now. I even performed an Instant VM Recovery from 40,000 feet off an Amazon S3 backed Capacity Tier Extent, which proved that our innovative approach to the design of this feature has led to use cases driven by the efficiency with which data is transferred. The one issue that I found and reported internally was that restore speeds for Instant Recoveries and general VM Restores were never fully utilising the maximum speed of the network connection.
I’m just going to let the numbers speak for themselves from here on in, but suffice to say I was blown away by the improvement. More than ever, this has opened up more doors for Object Storage backed SOBR Extents to be used for recovery and migration.
My control has been a grouping of five VMs that total 100GB. I’ve also used a nested ESXi cluster with supporting VMs at about 200GB. The Backup & Replication servers are configured with 2-4 vCPUs and 12-16GB of RAM, and have the SOBR configured on the same server with an Amazon S3 Capacity Tier Extent configured with Copy and Move policies as well as Immutability, but no encryption. In each case there are 2-4 dedicated Windows proxies with 4 vCPUs and 8GB of RAM.
I have two labs that I am testing from: one is our main PS Lab in Columbus, Ohio, and the other is a VMware Cloud on AWS instance out of US-East-2, and as you can see below, the download/upload speeds are no problem. However, it’s important to note that in my testing I was leveraging an S3 Endpoint, which goes through an Amazon ENI (Elastic Network Interface); theoretical rates on that are in the Gbps.
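If you want to verify your own VPC takes the endpoint path rather than the public internet, a quick hedged check (assuming the AWS CLI, credentials, and us-east-2 as the region, all placeholders) is to count the S3 endpoints attached to the VPC:

```shell
# Count S3 endpoints in the JSON output of `aws ec2 describe-vpc-endpoints`.
# Parsing via a function keeps this testable without live AWS access.
s3_endpoint_count() {
  grep -c '"ServiceName": *"com\.amazonaws\.[a-z0-9-]*\.s3"'
}

# With credentials configured (region is an assumed placeholder):
#   aws ec2 describe-vpc-endpoints --region us-east-2 | s3_endpoint_count
# A count of 0 suggests S3 traffic is leaving via the public internet path.

echo '"ServiceName": "com.amazonaws.us-east-2.s3"' | s3_endpoint_count   # -> 1
```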
As a side note, adding more proxies/slots didn’t do much to increase the performance… the existing bottleneck was 100% in the upload process of the VeeamAgent.exe.
Restore from Amazon S3:
These restores were done off a Mounted Object Storage Repository with Veeam Backup & Replication Community Edition in a VMware Cloud on AWS SDDC. The direct comparison below is for a SQL Server VM.
- Best performance I could get out of testing prior to v10a: ~30MB/s, or about 2GB per minute.
- After upgrading to v10a I got ~240MB/s, an 8x increase in restore speed. This extrapolates to 16-20GB per minute.
The process monitor shows the connection to Amazon S3 and the data movers doing their thing on each VM disk.
Looking at the CPU Load of that SQL Restore above, CPU was clearly a limiting factor. So for the next test I increased vCPU from 2 to 4.
- After increasing the vCPUs and performing the same test, we got an almost 2x increase on the previous test to ~420MB/s, which is 14x the first result, meaning 28-30GB per minute.
Note that increasing the CPU count further didn’t result in more throughput… i.e., the CPU was no longer the limiting factor at this point.
Wrap Up and Conclusion:
In a nutshell, the restoration of that SQL VM, which was taking about 27 to 30 minutes, is now being done in under 2 minutes with v10a. This is a huge increase in performance and one that I’ve been able to replicate across all testing. Again, this is not only restricted to restores from Amazon S3 Object Storage; v10a also introduced performance boosts for Azure Blob and for the restore of workloads to Azure and AWS (Michael Cade has written about that here). Without getting into the technical weeds, through continued efficiencies and enhancements in our software… specifically in how the VeeamAgent.exe communicates with Object Storage platforms, we have been able to dramatically improve restore speeds, which opens up more use cases for the technology.