As you might know, we are using NetApp Filers as shared NFS storage for our ESX servers. Today I’ve been trying to set up NetApp “SnapManager for Virtual Infrastructure”.

The idea is that SnapManager makes VMware snapshots of the running machines, then takes a NetApp Snapshot of the storage volume, and finally deletes the VMware snapshots. This is done to make sure the virtual hard drives are in a consistent state when the NetApp Snapshot is made. If the applications running in the virtual machines support VSS (Volume Shadow Copy), the backup will even be application consistent!

After the NetApp Snapshot is created, it is SnapMirrored (copied) to a Filer in another physical location for disaster recovery purposes.
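To make the order of the steps concrete, here is a rough Python sketch of the sequence as I understand it. This is not SnapManager's actual code or API: every function name and parameter below is made up for illustration, and the sketch snapshots the VMs one after another rather than all at once.

```python
from typing import Callable, Iterable, List


def backup_datastore(
    vms: Iterable[str],
    create_vm_snapshot: Callable[[str], None],
    delete_vm_snapshot: Callable[[str], None],
    create_storage_snapshot: Callable[[], None],
    mirror_storage_snapshot: Callable[[], None],
) -> None:
    """Run the backup steps in order (all callables are hypothetical stand-ins)."""
    snapshotted: List[str] = []
    try:
        # 1. VMware snapshot of every running VM so the virtual disks are consistent.
        for vm in vms:
            create_vm_snapshot(vm)
            snapshotted.append(vm)
        # 2. NetApp Snapshot of the volume while the VMware snapshots exist.
        create_storage_snapshot()
    finally:
        # 3. Always remove the VMware snapshots again, even if a step failed.
        for vm in snapshotted:
            delete_vm_snapshot(vm)
    # 4. Copy the NetApp Snapshot to the remote Filer (only reached on success).
    mirror_storage_snapshot()
```

The `finally` block is a design choice of the sketch: leftover VMware snapshots keep growing, so they get cleaned up whether or not the storage snapshot succeeded.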

It sounds great, and it even works great when I do a backup of a single VM. But, yes, as the title suggests there is a “but”: when I try to back up my entire datastore, SnapManager tells Virtual Center to make snapshots of all the running virtual machines at once. It might be that our NetApp Filer is too slow, or that there is some other bottleneck in our setup, but when VC tries to take 26 simultaneous snapshots, a few of them (between 2 and 7) randomly suffer a “timeout”.

The failed snapshots make SnapManager abort the backup.

I’ve looked high and low for a setting to either:

  • Make SnapManager only do a few snapshots at a time,
  • Extend the snapshot timeout value in VC (it seems to be 20 seconds) or
  • Make VC queue the snapshot commands so that only a few are executed in parallel.

But I haven’t been able to do any of these so far…
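Just to illustrate what I mean by "only a few at a time": if one were scripting the snapshots oneself, throttling could look roughly like the Python sketch below. This is not a SnapManager or Virtual Center setting. `take_snapshot` is a hypothetical function standing in for whatever actually triggers a VMware snapshot, and the default of 4 parallel snapshots is an arbitrary number I picked.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


def snapshot_in_batches(
    vms: Iterable[str],
    take_snapshot: Callable[[str], None],  # hypothetical: triggers one VM snapshot
    max_parallel: int = 4,                 # arbitrary illustrative limit
) -> Dict[str, Optional[BaseException]]:
    """Snapshot all VMs, but never more than max_parallel at the same time."""
    results: Dict[str, Optional[BaseException]] = {}
    # The pool size caps concurrency; the remaining VMs wait in the executor's queue.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {vm: pool.submit(take_snapshot, vm) for vm in vms}
        for vm, future in futures.items():
            # exception() blocks until the task is done; None means it succeeded.
            results[vm] = future.exception()
    return results
```

The returned dictionary maps each VM name to `None` on success or to the exception it raised, so a timeout on one VM does not stop the others from being attempted.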

I also experienced two cases where the .vmx file of a powered-off VM mysteriously got deleted somewhere in the process… Bad karma!

So, until I get things sorted out, we’re back to plain ol’ NetApp Snapshots for the backup of our virtual machines (of course combined with client-based backup of important data inside the VMs).