A common question / scenario I see from customers is:
- I’ve got a brand new storage array (let’s say we’re running a new NetApp AFF)
- I’m running VMware vSphere and I’ve got VAAI enabled on all my hosts
When I migrate a 40GB test VM with VAAI enabled, the copy offload operation isn’t any faster than when VAAI is disabled! In fact, it might be slower! What gives? I thought offloading copies to the storage array was supposed to make migrations faster!
VAAI History
To explain this common misconception, let’s review the history of VAAI.
VMware VAAI (which stands for vStorage APIs for Array Integration) was developed as a joint effort between VMware, NetApp, EMC and EqualLogic. The main goals were to:
- Offload functions to storage that the storage array can handle more efficiently
- Reduce network and compute resource consumption
- Improve integration between vSphere and storage
- Increase storage efficiency
Efficiency vs Speed?
Notice that I didn’t mention “make operations faster”.
One of the features VAAI offers is the ability to offload copy operations to storage. Customers often assume that when copies are offloaded to storage, that makes the copy operation faster. While offloading copy operations to the storage array can make them faster (in some situations, much faster), this isn’t the overall goal. The goal is efficiency.
To illustrate this, let’s go back to our example of the 40GB VM:
- The 40GB VM has 10GB of actual data
- The remaining 30GB are empty blocks
When the VM is migrated, and the copy operation is offloaded to storage, there is a bit of work the storage array has to do:
- The storage array is evaluating the blocks it’s about to copy. Do these blocks have data? Are they empty?
- From an efficiency standpoint, why copy a bunch of empty blocks when we don’t have to?
- So the storage array “hole punches” the empty blocks and copies only the blocks that contain actual data.
So our offloaded copy operation was an efficient operation. But there is some overhead in that work that the storage array is doing.
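To make that evaluate-then-copy idea concrete, here is a minimal toy sketch in Python. It is not how any particular storage array implements hole punching; the 4KB block size, the in-memory "disk" and the `sparse_copy` helper are all assumptions made up for illustration. The 40GB VM is scaled down to 40 blocks, 10 with data and 30 empty.

```python
BLOCK_SIZE = 4096  # assumed block size in bytes; real arrays vary


def sparse_copy(source: bytes, block_size: int = BLOCK_SIZE):
    """Copy only blocks that contain data; record empty blocks as holes.

    Returns (copied, holes): a dict mapping block index -> block bytes,
    and a list of the block indexes that were skipped as empty.
    """
    zero_block = bytes(block_size)
    copied = {}
    holes = []
    for offset in range(0, len(source), block_size):
        block = source[offset:offset + block_size]
        index = offset // block_size
        # The evaluation step: does this block hold data, or is it empty?
        if block == zero_block[:len(block)]:
            holes.append(index)    # "hole punch": nothing is copied
        else:
            copied[index] = block  # only real data is copied
    return copied, holes


# Scaled-down stand-in for the 40GB VM: 10 data blocks, 30 empty blocks.
data_block = b"\x01" * BLOCK_SIZE
empty_block = bytes(BLOCK_SIZE)
vm_disk = data_block * 10 + empty_block * 30

copied, holes = sparse_copy(vm_disk)
print(f"blocks evaluated: {len(copied) + len(holes)}")
print(f"blocks copied:    {len(copied)}")
print(f"blocks skipped:   {len(holes)}")
```

Every block gets evaluated, but only a quarter of them get copied. That evaluation pass is the overhead; the skipped blocks are the efficiency.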
Let’s say we disable VAAI. We want the ESXi host to handle the copy operation. So what happens here is:
- ESXi isn’t evaluating blocks. It’s grabbing all 40GB of blocks and firing them across the network as fast as it can
- Let’s say you have a 20Gb network with zero congestion. You’re going to see a very fast, but inefficient copy
So in the above scenario, the copy with VAAI disabled might have been inefficient…but it was faster.
I want speed!
So maybe Jeremy Clarkson is my customer. He’s heard “efficiency…mumble…mumble”. We don’t care about efficiency, we want speed!
So we disable VAAI and our host-based copy operations are flying along with our 40GB VMs.
What happens when we provision a few 5TB VMs?
- ESXi is now pulling all 5TB and firing it across the network
- Let’s say we have 200GB of actual data. Well, 4.8TB of empty blocks are being sent across the network as well.
The copy is going to be slow.
If VAAI were enabled, the storage array would have figured out that it only has to copy 200GB of actual data. So not only is the copy offload more efficient, it’s almost certainly going to be faster (although efficiency is the primary goal).
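Here is a back-of-the-envelope sketch of the two scenarios in Python. The model and every rate in it are assumptions picked purely for illustration (a link of about 2.5 GB/s, which is roughly 20 gigabits per second, plus hypothetical array rates for evaluating and copying blocks). They are not measurements from any real array, so treat the output as the shape of the argument rather than a benchmark.

```python
def host_copy_seconds(total_gb: float, network_gb_per_s: float) -> float:
    """Host-based copy: every block, empty or not, crosses the network."""
    return total_gb / network_gb_per_s


def offload_copy_seconds(total_gb: float, data_gb: float,
                         evaluate_gb_per_s: float, copy_gb_per_s: float) -> float:
    """Offloaded copy: evaluate every block, then copy only the data blocks."""
    return total_gb / evaluate_gb_per_s + data_gb / copy_gb_per_s


# All three rates below are ASSUMED for illustration, not measured.
NETWORK_GB_PER_S = 2.5    # roughly a 20 gigabit link with zero congestion
EVALUATE_GB_PER_S = 20.0  # hypothetical rate at which the array checks blocks
COPY_GB_PER_S = 0.5       # hypothetical rate at which the array copies data

# (name, total size in GB, actual data in GB), taken from the examples above
for name, total_gb, data_gb in [("40GB VM", 40, 10), ("5TB VM", 5000, 200)]:
    host = host_copy_seconds(total_gb, NETWORK_GB_PER_S)
    offload = offload_copy_seconds(total_gb, data_gb,
                                   EVALUATE_GB_PER_S, COPY_GB_PER_S)
    print(f"{name}: host copy {host:.0f}s, offloaded copy {offload:.0f}s")
```

With these made-up rates, the host-based copy wins on the small VM (16s vs 22s) and loses badly on the large, mostly empty one (2000s vs 650s). Change the rates and the crossover point moves, but the pattern stays the same: the more empty blocks there are, the more the offloaded copy pulls ahead.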
Key Takeaways
Hopefully at this point you realize that the goal of VAAI is to improve efficiency and to improve integration between ESXi and storage. We’ve focused on copy offload (the Full File Clone primitive for NFS and the Full Copy, or XCOPY, primitive for SAN).
In a lot of situations, offloaded copy operations will be faster than non-offloaded copy operations. This is true when there is resource contention affecting the host or network, or when the file you’re copying has a lot of empty blocks.
I still run into a lot of customers who feel that “anything offloaded to the storage array means it should be faster”. And this can be a difficult misconception to shake. For one particular customer, I ended up disabling hole punching on the storage array (a bad idea for anything other than a test environment) to try to convince him of the above.
You’d be surprised at the amount of time spent by engineers conducting benchmark tests, combing through performance data, spending hours on the phone with customers and account teams discussing slow copy offloads. Hopefully the above helps you with your argument should you ever find yourself in such a situation.