Introduce new native backup provider (KNIB) #12758

JoaoJandre wants to merge 8 commits into apache:main
Conversation
Codecov Report

❌ Patch coverage is 5.59%.

@@ Coverage Diff @@
## main #12758 +/- ##
============================================
- Coverage 18.02% 17.90% -0.12%
- Complexity 16450 16501 +51
============================================
Files 5968 6032 +64
Lines 537086 542119 +5033
Branches 65961 66422 +461
============================================
+ Hits 96819 97078 +259
- Misses 429347 434090 +4743
- Partials 10920 10951 +31
@JoaoJandre just a heads up - my colleagues have been working on an incremental backup feature for NAS B&R (using nbd/qemu bitmap tracking & checkpoints). We're also working on a new Veeam-KVM integration for CloudStack whose PR may be out soon. My colleagues can further help review and advise on this. /cc @weizhouapache @abh1sar @shwstppr @sureshanaparti @DaanHoogland @harikrishna-patnala

Just my 2 cents on the design & your comments - NAS is more than just NFS, but any (mountable) shared storage such as CephFS, cifs/samba etc. Enterprise users usually don't want to mix secondary storage with backup repositories, which is why NAS B&R introduced a backup-provider-agnostic concept of backup repositories, which can be explored by other backup providers.
At the time of writing that part, I believe it was only NFS that was supported. I'll update the relevant part.
The secondary storage selector feature (introduced in 2023 by #7659) allows you to specialize secondary storages. This PR extended the feature so that you may also create selectors for backups.
Hi Joao,

This looks promising. Incremental backups, quick restore and file restore features have been missing from CloudStack KVM. I am having trouble understanding some of the design choices though:
Hello, @abh1sar
I don't see why we should force the coupling of backup offerings with backup repositories, what is the benefit?
The secondary storage also has both features. Although the capacity is not reported to the users currently.
The secondary storage selectors feature (introduced in 2023 through #7659) allows you to specialize secondary storages. Quoting from the PR description: Furthermore, my colleagues are working on a feature to allow using alternative secondary storage solutions, such as CephFS, iSCSI and S3, while preserving compatibility with features destined to NFS storages. This feature may be extended in the future to allow essentially any type of secondary storage. Thus, the flexibility for secondary storages will soon grow.
Using any other type of backup-level compression will be worse than using qemu-img compression. This is because when restoring the backup, we must have access to the whole backing chain. If we use other types of compression, we will have to decompress the whole chain before restoring. Using qemu-img, the backing files are still valid and do not need to be decompressed; in fact, we never have to decompress at all. This is the great benefit of using qemu-img. In any case, here is a brief comparison of using qemu-img with the zstd library and 8 threads and using the
Compression using qemu-img was significantly faster, with a slightly lower compression ratio. Furthermore, we have to consider that the qemu-img compressed image can be used as-is, while the other images must be decompressed, further adding to the processing time of backing up/restoring a backup.
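For illustration, here is a minimal sketch (a hypothetical helper, not code from this PR) of how the kind of qemu-img compression invocation described above could be assembled; the flag set assumes a qcow2 target and a QEMU recent enough to support `compression_type=zstd`:

```python
# Hypothetical sketch: assemble a qemu-img convert command that writes a
# zstd-compressed qcow2, along the lines of the discussion above. The
# function name and defaults are assumptions, not the PR's actual code.
def build_compress_command(src: str, dst: str,
                           library: str = "zstd",
                           coroutines: int = 8) -> list[str]:
    return [
        "qemu-img", "convert",
        "-O", "qcow2",
        "-c",                                  # write compressed clusters
        "-m", str(coroutines),                 # number of parallel coroutines
        "-o", f"compression_type={library}",   # zlib or zstd
        src, dst,
    ]

cmd = build_compress_command("backup-full.qcow2", "backup-full.zstd.qcow2")
print(" ".join(cmd))
```

Because qcow2 compression works per cluster, the resulting image is still a valid qcow2 and can be read (or used as a backing file) directly, which matches the point above about never needing a separate decompression pass.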
The compression feature is optional, if you are using storage-level compression, you probably will not use backup-level compression. However, many environments do not have storage-level compression, thus having the possibility of backup-level compression is still very interesting.
The compression does not add any interaction with the SSVM.
I did not want to add dozens of parameters to the import backup offering API which are only really going to be used for one provider. This way, the original design of the API is preserved. Furthermore, you may note that the APIs are intentionally not called
There are two main issues with using bitmaps:
At the end of the day, this PR adds a new backup provider option for users. They will be free to choose the provider that best fits their needs. This is one of the reasons why it was done as a new backup provider; KNIB and other backup providers do not have to cancel each other out.
Hi @JoaoJandre
It has a big benefit for use cases where someone wants multiple tiers of backup storage with different cost and recovery characteristics. For example, long-term archival backups might go to cheap cold storage like tape, while backups that require faster recovery (better RTO) may use more expensive SSD-backed storage. You can have multiple backup offerings for different use cases and VMs can be attached to the required offerings.
Capacity is visible just to admins for backup repositories as well, so that is not the issue.
It does have its benefits, but do keep in mind that the decompression cost will be paid on reads. Also, if someone is using a restored VM, they might see the physical size increase disproportionately to the actual data written, due to decompression. It's still a useful feature, and it's good that it is optional.
If there is a possibility that these parameters can be added for other providers, they should be added to the existing API.
Hello, @abh1sar
You can have exactly that using this implementation. Essentially, backup repositories and secondary storages are the same, with different names. Using selectors, you can have offerings that are tied to specific secondary storages, or you can have offerings that go into multiple storages; it was made to be flexible.
Again, you can have dedicated secondary storages using selectors.
While I was not the one who introduced the backup framework, looking at its design, it was clearly intended to have as little configuration as possible on the ACS side while leaving these details to the backup provider. If we add these parameters to the import backup offering API, I'm sure a lot of users will be confused when these parameters do nothing for Veeam and Dell Networker. I did not intend to warp the original design of configuring the offerings on the provider side and only importing them to ACS. This is why I created the concept of native offerings. As, with KNIB (and NAS), the provider is ACS, the configurations can still be made on the "backup provider", and the import API will follow the same logic it always had.
A provider is native if the backup implementation is made by ACS. Thus, the current native providers would be NAS and KNIB. Veeam and Networker are external providers.
They are used to configure the details that would be configured in the backup provider if it was an external provider, such as Veeam for example.
I have yet to add it to the GUI, if you have any suggestions for the GUI design, you are free to give your opinion. The GUI for the native backup offerings will be added in the future.
Again, you absolutely can do this with KNIB. But you may also choose not to do it; it's flexible.
@JoaoJandre Would it be possible to change the name Native to something clearer? For example, "Secondary Storage Backup" might be more descriptive and easier for end users to understand. The term "Native" is not commonly used in CloudStack.
This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.
I don't see any issue with "Native", and I'm not claiming that KNIB is the only "Native" provider. As I explained before:
This PR does not remove or substitute anything; the provider I am proposing is a new option for users.
Also, the validation was added in the last commits. All the features that were meant to be included in this PR are already here.
We do not use the term "Native" in CloudStack. Plugin names are typically based on the name of the external device or backup repository they integrate with. Please follow the same convention.
Hello, @weizhouapache, all
Yes, we currently do not have any APIs using the term
Yes, that is correct. In the case of the new backup plugin, it is fully implemented by ACS, including backup creation, chain management, compression, validation and other related processes. Therefore, its name is fully coherent with its functionality. In other words, KNIB (KVM Native Incremental Backup) stands for:
Sorry, but is this an established convention? I am not aware of any guideline stating that the term cannot be used.

Again, if there is any semantic inconsistency with the plugin name KNIB (KVM Native Incremental Backup), we can address it. However, the plugin supports incremental backups, for KVM, natively in ACS; thus, the name accurately reflects what it delivers. It does not promote any concept or capability that is not actually provided.

@JoaoJandre, btw, thanks for all the effort you put into designing and implementing this backup plugin. It will certainly be a stable and efficient solution, with strong adoption across cloud providers and users.
I don’t want to continue debating this topic, so I’ll just summarize my view:
@JoaoJandre @bernardodemarco , I have not looked at the code yet, but am trusting that it will be of the usual quality we have grown to expect from you guys.
I would definitely not want two kinds of backup offering displayed separately in the UI. The way this could be approached is to have a unified view of two offerings with a
But NAS doesn't use native backup offerings, so this is still confusing.
I don't think we are gaining much from defining a new native backup offering API. I highly recommend having a single backup offering API and database tables.
Hello, @abh1sar, @DaanHoogland and @weizhouapache
I have already explained why the use of secondary storage is a good approach as a target for backups. But I can give you even more points as to why it is a good design:
Regarding the naming, KNIB is a native process, meaning ACS does all of the heavy lifting regarding the backup processing. This differs from Veeam and other providers, where the heavy lifting is done by the provider itself, and ACS is only a facade to enable users to assign/remove backup offerings (jobs in the backend of Veeam, for instance).
Using the term "Native" does not go against any guideline in the official coding conventions. And, if by convention you mean what is already in the code, then if we only follow what already exists, nothing new will ever be introduced.
This was done from the beginning, but if you review and have a specific part of the code that you think could be more isolated, please share.
This actually goes against what @weizhouapache advised in the previous point. If I merge the two APIs, the KNIB implementation will be substantially less isolated. One of the points I have made from the beginning was to keep the implementation isolated, to avoid issues with the current implementation in ACS, for which we want an alternative solution. Furthermore, the
I am now introducing the concept of backup offerings to handle the different options we are providing in KNIB. NAS can absolutely implement compression, validation, file extraction, and so on, and start using the native backup offerings. It is up to the guys who developed and are maintaining it. We decided to create something from scratch, as it felt better to work to our quality standard from the beginning.
Thank you :)
Regarding the provider name, KVM Native Incremental Backup, I see no reason at all to change it; it is impossible that it will ever confuse users. Regarding the
I don't understand this suggestion.
@JoaoJandre , why is asking to change it not a good enough reason? I am tired after a day-long conflict resolution, so forgive me for not being patient with you. I find it strange to call that "no reason at all”. I read this as a sarcastic way of saying, “ok, let’s change it”. You are right that "the term "Native" does not go against any guideline in the official coding conventions.” So does that mean that it is a good metaphor for the software? I really object to the name and would like it changed. I do not agree with Wei that using SecStor for backups is not a good idea. It may not be in high-demanding production environments, but in smaller organisations it might well be a good solution. That does not mean that the name is not confusing. So to put it more firmly: please change the name of the provider. In addition to that, I see you are basically refusing all suggestions, which makes me suspicious. I didn’t involve myself in all of them, but they can’t all be non-sensible. You clearly are invested in this PR, so I think it is in your interest to be more forthcoming to input. I personally see no reason to continue with this PR without changing it. If you are willing to make compromises, I will be happy to explain my point about backup repository usage and a point about framework adherence.
I am not being sarcastic: asking for a change cannot be a reason for asking for a change. We have explained (see #12758 (comment)) why the name makes sense and is not confusing from our perspective. Yet I have not received a technical reason to change the name; I am only receiving "change it because I want it". Anyways, I have an alternate name suggestion: KVM Incremental Backup Engine; is this better than KVM Native Incremental Backup? If so, why? Regarding the API names, since "Native" is out of the question for API names (according to you), I will change the APIs to
I will repeat this as many times as necessary: secondary storage and the backup repositories of the NAS backup plugin are functionally the same; the same appliance that exports an NFS path for a backup repository can export an NFS path for secondary storage, and this secondary storage can be exclusive to backups if the operator of the cloud configures it to be. Sure, currently backup repositories support more protocols, but this is temporary; also, this limitation is not a problem for this PR. Therefore, secondary storage can be as good as backup repositories for any type of production environment. Thus, I will not change the implementation to use backup repositories as the NAS backup plugin does. Moreover, if users prefer the other plugin, they are free to use it instead of the implementation that I am proposing; I guess it is fine to have different plugins with different capabilities.
Anyone can make any suggestion, but this does not mean that it is a good suggestion. All suggestions have been answered with in-depth explanations that are the basis for refusing them. I am open to suggestions based on good technical reasons, and I am willing to discuss any possible changes that may help refine the feature, but I will not accept a suggestion until it is proven to be better than the current implementation. As a final note, this PR proposes a new backup plugin. At the end of the day, if users do not want to use it (or if users do not trust it), they are free to adopt any other alternative. This solution is isolated on its own, and therefore there should not be much resistance to it getting merged. We will not affect any existing workflow or feature with this PR. Besides that, I am willing to change the naming if that is important to you guys. What do you think if we call the plugin KVM Incremental Backup Engine or KVM Incremental Backup Service, and the APIs to
ok, another attempt, though I think I have explained: native is not a good metaphor for the software. It is a secondary storage backup solution. Note also that just saying it is a good name does not make it a good name. I have other points of feedback, but if you are not willing to take this as an argument, there is no sense in talking about the details of the framework and how it fits or could fit better.
Hello, @DaanHoogland

What about the alternate names I proposed based on your feedback and suggestions?
I'd like to gently remind everyone — myself included — that we are all working toward the same goal here: making CloudStack better. This conversation has gotten a bit heated, and I think part of that is because a large PR with significant architectural decisions landed without earlier community discussion — something I've been guilty of myself in the past. For features of this scale, early alignment with the community helps avoid exactly this kind of friction. When that doesn't happen, it becomes even more important for the author to be open to the feedback that comes during review. The Apache Ethos is a good reference for how we want to collaborate as a community. Disagreement on technical matters is healthy; let's just make sure we keep it respectful and constructive.

I have tried to go through the spec, the PR conversation, the existing backup framework docs, and the code at a high level. I might have missed some things given the size of this PR, so please correct me if I'm wrong on any point. I can see that a lot of effort has gone into both the spec and the implementation. That said, I have some concerns that I think need to be addressed before this is production-safe.

Testing

The patch coverage is 5.59%. The classes that handle the most critical operations are almost completely untested.

Unit test coverage details
These code paths run as root on hypervisor hosts and manipulate live VM disk chains. A bug here can cause data loss for tenants. This is not something we can rely on manual testing alone for — the next person who touches the VM snapshot strategy or volume migration code has no safety net to know they've broken something.

This is a 12k+ line PR with no Marvin tests. The manual test matrices in the PR description are good to have, but without automated integration tests, there is no way to catch regressions. Future changes to the VM snapshot code, volume lifecycle, or backup framework could silently break KNIB (or the other way around) with no automated detection.

Architecture

Using Secondary Storage for Backups

Three concerns with this approach:
Description
This PR adds a new native incremental backup provider for KVM. The design document, which goes into the details of the implementation, can be found at https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406622120.
~~The validation process which is detailed in the design document will be added to this PR soon.~~ The validation process is already in this PR. The file extraction process will be added in a later PR.
This PR adds a few new APIs:
The `createNativeBackupOffering` API has the following parameters:

- `name`
- `compress` (default: `false`)
- `validate` (default: `false`)
- `allowQuickRestore` (default: `false`)
- `allowExtractFile` (default: `false`)
- `backupchainsize`
- `compressionlibrary` (default: `zstd`)

The `deleteNativeBackupOffering` API has the following parameter:

- `id`

A native backup offering can only be removed if it is not currently imported.

The `listNativeBackupOfferings` API has the following parameters:

- `id`
- `compress`
- `validate`
- `allowQuickRestore`
- `allowExtractFile`
- `showRemoved` (default: `false`)

The `listBackupServiceJobs` API has the following parameters:

- `id`
- `backupid`
- `hostid` (the `executing` parameter is implicit)
- `zoneid`
- `type` (one of `StartCompression`, `FinalizeCompression` and `BackupValidation`)
- `executing`
- `scheduled`

By default, it lists all jobs that have not been removed.
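To make the parameter list concrete, here is a hedged illustration of what a `createNativeBackupOffering` call might carry. The parameter names come from the list above; the values, the offering name, and the idea of serializing them as a query map are assumptions for illustration only:

```python
# Illustrative only: parameter names are from the PR description; the
# values and the query-string serialization are assumed, not PR code.
params = {
    "command": "createNativeBackupOffering",
    "name": "gold-compressed",            # hypothetical offering name
    "compress": "true",
    "validate": "true",
    "allowQuickRestore": "false",
    "allowExtractFile": "false",
    "backupchainsize": "3",
    "compressionlibrary": "zstd",
}
print("&".join(f"{k}={v}" for k, v in sorted(params.items())))
```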
The API `downloadValidationScreenshot` was added to allow downloading the images generated during the screenshot step. The API has the following parameter:

- `backupId`

The PR also adds parameters to the following APIs:

- The `isolated` parameter was added to the `createBackup` and `createBackupSchedule` APIs
- The `quickRestore` parameter was added to the `restoreBackup`, `restoreVolumeFromBackupAndAttachToVM` and `createVMFromBackup` APIs
- The `hostId` parameter was added to the `restoreBackup` and `restoreVolumeFromBackupAndAttachToVM` APIs; it can only be used by root admins and only when quick restore is true

New settings were also added:
| Setting | Default |
|---|---|
| `backup.chain.size` | |
| `knib.timeout` | |
| `backup.compression.task.enabled` | `true` |
| `backup.compression.max.concurrent.operations.per.host` | `5` |
| `backup.compression.max.concurrent.operations` | |
| `backup.compression.max.job.retries` | `2` |
| `backup.compression.retry.interval` | `60` |
| `backup.compression.timeout` | `28800` |
| `backup.compression.minimum.free.storage` | `1` |
| `backup.compression.coroutines` | `1` |
| `backup.compression.rate.limit` | `0` |
| `backup.validation.task.enabled` | `true` |
| `backup.validation.interval` | `24` |
| `backup.validation.max.concurrent.operations` | `0` |
| `backup.validation.max.concurrent.operations.per.host` | `1` |
| `backup.validation.boot.default.timeout` | `240` |
| `backup.validation.script.default.timeout` | `60` |
| `backup.validation.screenshot.default.wait` | `60` |
| `backup.validation.end.chain.on.fail` | `true` |
| `enforce.resource.limit.on.backup.validation.vm` | `false` |

Two new host-level configurations, configured in `agent.properties`, were added: `backup.validation.max.concurrent.operations.per.host` and `backup.compression.max.concurrent.operations.per.host`. When configured, they override the values set in the global configurations with the same names.

Six new VM configurations were added:

- `backupValidationCommand`
- `backupValidationCommandArguments`
- `backupValidationCommandExpectedResult` (default: `0`)
- `backupValidationCommandTimeout`
- `backupValidationScreenshotWait`
- `backupValidationBootTimeout`

By default, the VM configurations `backupValidationCommandTimeout`, `backupValidationScreenshotWait`, and `backupValidationBootTimeout` are read-only for regular users. This behavior can be changed through the global configuration `user.vm.readonly.details`.
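The host-level `agent.properties` values override the global settings of the same name when configured; a tiny sketch of that resolution rule (a hypothetical helper, not the PR's actual code):

```python
# Hypothetical sketch of the override rule described above: a host-level
# value from agent.properties, when present, wins over the global setting.
def effective_limit(host_value, global_value):
    return host_value if host_value is not None else global_value

# e.g. backup.compression.max.concurrent.operations.per.host
print(effective_limit(None, 5))  # no host override -> global value 5
print(effective_limit(2, 5))     # host override -> 2
```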
How Has This Been Tested?
Tests related to disk-only VM snapshots
Basic tests with backup
Using `backup.chain.size=3`

Interactions with other functionalities
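Assuming `backup.chain.size=3` caps the number of backups per chain before a new full backup starts a fresh chain (my reading of the setting, not confirmed against the PR code), the rotation decision could be sketched as:

```python
# Hypothetical sketch: decide whether the next backup must start a new
# chain. Assumes backup.chain.size caps the backups kept per chain.
def should_start_new_chain(backups_in_chain: int, chain_size: int) -> bool:
    return backups_in_chain >= chain_size

# With backup.chain.size=3: the 4th backup would open a new chain
# beginning with a fresh full backup.
for n in range(5):
    print(n, should_start_new_chain(n, 3))
```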
I created a new VM with a root disk and a data disk for the tests below.
Configuration Tests
Compression Tests
Tests performed with an offering that provides compressed backup support
Validation tests
The following validation commands were tested:

- `echo`
- `echo` with argument `ababa`
- `echo`, argument `ababa`, and expected output `absdui`
- `echo`, argument `ababa`, and expected output `YWJhYmEK`
- `echo`, argument `ababa`, and expected output `YWJhYmEK`

Tests with `restoreVolumeFromBackupAndAttachToVM`

Tests with `restoreBackup`
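The expected-output values in the validation tests above look like base64-encoded command output: `YWJhYmEK` decodes to `ababa` plus a trailing newline, which matches what `echo ababa` writes to stdout. A quick check of that reading (my interpretation of the test matrix, not documented in the PR):

```python
import base64

# "ababa\n" is what `echo ababa` writes to stdout; base64-encoding it
# yields the expected-output token seen in the validation test matrix.
encoded = base64.b64encode(b"ababa\n").decode()
print(encoded)  # YWJhYmEK
assert base64.b64decode("YWJhYmEK") == b"ababa\n"
```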