Sparse texture binding is painfully slow

Sparse texture binding and unbinding can be painfully slow: binding 1000 pages in a 1024^3 texture can take multiple seconds (yes, seconds, not milliseconds). My setup is an A4000 with dual Xeon 4210R CPUs, running Ubuntu 22.04 with the 535 proprietary NVIDIA driver. This makes sparse textures pretty much useless for applications that need to change page bindings in real time.

From my understanding, this is a driver issue: not much happens on the GPU when binding pages (the memory is already allocated), but I can see a single CPU thread running at 100%. The binding time can also spike randomly: in some tests, one bind costs 100ms, the next one 1000ms, and the one after that 100ms again. To me, this looks like the driver is trying to do some sort of defragmentation. Perhaps for the simple case of binding fixed-size pages, this is more of a burden than an optimization?

From my testing, the binding time is proportional to the number of binds requested in the bind call (expected), and also proportional to the number of pages already bound across all sparse textures (less expected!).
Here are my results: Sparse image binding cost - Google Sheets

And here is the testing code: binding-cost.cpp (7.3 KB)
You can compile it with: "g++ binding-cost.cpp -Wno-narrowing -l vulkan -o binding-cost"

Also, the vkQueueBindSparse call is synchronous, even though it takes an optional fence as parameter, but this is a different issue.
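One way to check the synchronous behaviour is to time the submit call and the fence wait separately. The following is only a minimal sketch and is not taken from binding-cost.cpp: `time_bind` and `BindTiming` are hypothetical names, and the two callables are assumed to wrap `vkQueueBindSparse` and `vkWaitForFences` respectively.

```cpp
#include <cassert>
#include <chrono>
#include <utility>

// Milliseconds spent inside the submit call vs. waiting for completion.
struct BindTiming {
    double submit_ms;
    double wait_ms;
};

// Hypothetical helper: times two callables back to back.
// Assumed usage with Vulkan (handles created elsewhere):
//   submit = [&] { vkQueueBindSparse(queue, 1, &bindInfo, fence); };
//   wait   = [&] { vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX); };
template <typename Submit, typename Wait>
BindTiming time_bind(Submit&& submit, Wait&& wait) {
    using clock = std::chrono::steady_clock;
    using ms = std::chrono::duration<double, std::milli>;
    const auto t0 = clock::now();
    std::forward<Submit>(submit)();  // host time spent in the bind call
    const auto t1 = clock::now();
    std::forward<Wait>(wait)();      // host time spent waiting on the fence
    const auto t2 = clock::now();
    return {ms(t1 - t0).count(), ms(t2 - t1).count()};
}
```

If the call really is synchronous, nearly all of the time should show up in `submit_ms` and almost none in `wait_ms`.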

Hello @antoinerichermoz,

Thank you for bringing this to our attention and for such a detailed analysis. Our Vulkan engineers have seen this and are discussing the data internally.

If and when I hear back from them I will update here.

Hello again,

I tested sparse texture binding in CUDA, and it is much faster. The cost of the first bind is the same as in Vulkan, but in CUDA it stays constant for subsequent binds.
Here is the code: binding-cost-cuda.cpp (3.8 KB)
So could Vulkan sparse binding be as fast as CUDA sparse binding?

Hello! Any update on this? I believe I also just ran into this issue with Vulkan on Windows. From threads elsewhere (e.g. Reddit - Dive into anything) it seems to have been a common problem across several platforms and hardware in past years, so I'm not really sure of the scope of the issue. But the performance levels I'm observing are baffling. Don't modern games use this feature?
This makes me wonder whether I (and others) are using it incorrectly, or are hitting a common set of configurations that exhibits some worst-case behavior. Either way, I think it warrants investigation.

I can provide more information on my configuration, but the rough numbers I’m seeing with vkQueueBindSparse are:

  • 0.25 ms to bind 1 sparse page
  • 30 ms to bind 730 sparse pages (41us per page average)
  • 3600 ms to bind 9200 sparse pages (390us per page average)

These numbers don’t add up to me.
From a theoretical perspective binding 1 sparse page should be equivalent to updating 1 page table entry and flushing a few TLB caches, etc.
Even with additional driver book-keeping, how can that amortized operation of updating a single pointer in GPU memory take 0.25ms?
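To make the per-page arithmetic explicit, here is a small sketch (the helper name `per_page_us` is made up for illustration) that converts a total bind time in milliseconds into an average cost per page in microseconds.

```cpp
#include <cassert>
#include <cmath>

// Average cost per page in microseconds, given the total bind time in
// milliseconds and the number of pages bound in that call.
double per_page_us(double total_ms, int pages) {
    return total_ms * 1000.0 / pages;
}
```

With the figures above, `per_page_us(0.25, 1)` is 250, `per_page_us(30.0, 730)` is about 41, and `per_page_us(3600.0, 9200)` is about 391. Going from 730 to 9200 pages, the per-page cost grows by roughly 9.5 times instead of staying flat, which is what does not add up.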

Any information would be helpful! I would love to use this feature, but it’s not really possible at the moment. Again happy to provide more config information or minimal repro code.

Hello @benharris42 and welcome to the NVIDIA developer forums.

Thank you for your additional insights.

I checked with engineering and we do have an internal issue tracking this, so investigation is ongoing.

But that is all the information I can share right now.


Just a quick update to clarify the info I posted - I realized the timings I provided were with the VK_LAYER_KHRONOS_validation layer enabled, and this accounts for some of the low performance I was seeing. With validation disabled I see numbers closer to the OP, which are still unusably slow unfortunately - around 250us for 1 page and tens of milliseconds for 1K-10K pages.

For reference, implementing a "manual sparse texture" with the classic 2-level texture scheme takes <10us to "bind" 10K pages in a compute shader on my setup. The trade-off, of course, is that all "sparse accesses" go through two texture reads (the classic problem before the HW sparse feature existed). However, for my use case I need binding to be fast and predictable, so I am sticking with the software solution for now.
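For readers unfamiliar with the 2-level scheme, here is a rough CPU-side analogue. It is not the compute shader mentioned above: the names (`SoftwareSparseTexture`, `kUnbound`, `write_slot`) and the 1D layout are assumptions made purely for illustration. The point is that a "bind" is a single page-table write, while every read goes through the table first and the pool second.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Marker for a virtual page with no physical slot (illustrative choice).
constexpr std::uint32_t kUnbound = 0xFFFFFFFFu;

// Level 1: page table mapping each virtual page to a slot in a pool.
// Level 2: the pool holding the texels of resident pages (1D for brevity).
struct SoftwareSparseTexture {
    std::uint32_t page_size;                // texels per page
    std::vector<std::uint32_t> page_table;  // virtual page -> pool slot
    std::vector<float> pool;                // pool_slots * page_size texels

    SoftwareSparseTexture(std::uint32_t virtual_pages, std::uint32_t pool_slots,
                          std::uint32_t texels_per_page)
        : page_size(texels_per_page),
          page_table(virtual_pages, kUnbound),
          pool(static_cast<std::size_t>(pool_slots) * texels_per_page, 0.0f) {}

    // "Binding" is one table write; no driver call is involved.
    void bind(std::uint32_t virtual_page, std::uint32_t slot) { page_table[virtual_page] = slot; }
    void unbind(std::uint32_t virtual_page) { page_table[virtual_page] = kUnbound; }

    // Fill physical storage directly, independent of any binding.
    void write_slot(std::uint32_t slot, std::uint32_t offset, float value) {
        pool[static_cast<std::size_t>(slot) * page_size + offset] = value;
    }

    // Each access costs two reads: the table entry, then the texel.
    float read(std::uint32_t texel, float fallback = 0.0f) const {
        const std::uint32_t slot = page_table[texel / page_size];
        if (slot == kUnbound) return fallback;  // page not resident
        return pool[static_cast<std::size_t>(slot) * page_size + texel % page_size];
    }
};
```

On the GPU the page table and the pool would both be textures or buffers, and binding 10K pages amounts to 10K independent table writes, which parallelizes trivially in a compute shader.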

On my side, disabling the validation layers makes no performance difference.