Sparse texture binding is painfully slow

antoinerichermoz · July 7, 2023, 1:23pm

Sparse texture binding / unbinding can be painfully slow: it can take multiple seconds (yes, seconds, not milliseconds) to bind 1000 pages in a 1024^3 texture. (My setup is an A4000 with dual Xeon 4210R CPUs, running Ubuntu 22.04 with the 535 proprietary NVIDIA driver.) This makes sparse textures pretty much useless for applications that need to change page bindings in real time.

From my understanding, this is a driver issue: not much happens on the GPU when binding pages (the memory is already allocated), but I can see a single CPU thread running at 100%. The binding time can also spike randomly: in some tests, one bind costs 100 ms, the next 1000 ms, and the one after that 100 ms again. To me, this looks like the driver is trying to do some sort of defragmentation. Perhaps for the simple case of binding pages of a fixed size, this is more of a burden than an optimization?

From my testing, the binding time is proportional to the number of binds requested in the bind call (expected), and also proportional to the number of pages already bound across all sparse textures (less expected!).
Here are my results: Sparse image binding cost - Google Sheets

And here is the testing code: binding-cost.cpp (7.3 KB)
You can compile it with: g++ binding-cost.cpp -Wno-narrowing -l vulkan -o binding-cost

Also, the vkQueueBindSparse call is synchronous, even though it takes an optional fence as a parameter, but this is a different issue.
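
One way to check whether a call blocks, rather than completing asynchronously behind its fence, is to time the submit call and the completion wait separately. The following is only a minimal sketch of that measurement: `SubmitTiming` and `timeSubmitAndWait` are hypothetical names, not taken from binding-cost.cpp, and the two callables are assumed to wrap something like vkQueueBindSparse and vkWaitForFences respectively.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical result type: where the wall-clock time was spent.
struct SubmitTiming {
    double submitMs;  // time spent inside the submit call itself
    double waitMs;    // time spent afterwards waiting for completion
};

// Times a submit call and the subsequent completion wait separately.
// `submit` and `waitForCompletion` are stand-ins supplied by the caller,
// e.g. a wrapper around vkQueueBindSparse and one around vkWaitForFences.
inline SubmitTiming timeSubmitAndWait(
        const std::function<void()>& submit,
        const std::function<void()>& waitForCompletion) {
    assert(submit && waitForCompletion);  // both callables must be set

    using Clock = std::chrono::steady_clock;
    using Ms = std::chrono::duration<double, std::milli>;

    const auto t0 = Clock::now();
    submit();
    const auto t1 = Clock::now();
    waitForCompletion();
    const auto t2 = Clock::now();

    return {Ms(t1 - t0).count(), Ms(t2 - t1).count()};
}
```

If the submission were asynchronous, one would expect `submitMs` to stay small and `waitMs` to carry most of the cost. The behaviour described in this post corresponds to the opposite split.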

MarkusHoHo · July 11, 2023, 9:30am

Hello @antoinerichermoz,

thank you for bringing this to our attention and making such a detailed analysis. Our Vulkan engineers have seen this and are discussing the data internally.

If and when I hear back from them I will update here.

antoinerichermoz · July 17, 2023, 8:47am

Hello again,

I tested sparse texture binding in CUDA, and it is much faster. The cost of the first bind is the same as in Vulkan, but it stays constant in CUDA.
Here is the code: binding-cost-cuda.cpp (3.8 KB)
So could Vulkan sparse binding be as fast as CUDA sparse binding?

benharris42 · September 9, 2023, 7:59am

Hello! Any update on this? I believe I also just ran into this issue with Vulkan on Windows. From threads elsewhere (e.g. Reddit - Dive into anything) it seems to have been a common problem across several platforms/hardware in past years, so I’m not really sure of the scope of the issue. But the performance levels I’m observing are baffling. Don’t modern games use this feature?
This makes me wonder whether I (and others) are using it incorrectly, or whether we share a common set of configurations that exhibits some worst-case behavior, but I think it warrants investigation.

I can provide more information on my configuration, but the rough numbers I’m seeing with vkQueueBindSparse are:

0.25 ms to bind 1 sparse page
30 ms to bind 730 sparse pages (41us per page average)
3600 ms to bind 9200 sparse pages (390us per page average)

These numbers don’t add up to me.
From a theoretical perspective binding 1 sparse page should be equivalent to updating 1 page table entry and flushing a few TLB caches, etc.
Even with additional driver book-keeping, how can that amortized operation of updating a single pointer in GPU memory take 0.25ms?
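
The per-page averages quoted above can be reproduced with a few lines of arithmetic. This is just a sketch over the three measurements from this post; the function names `perPageMicros` and `printPerPageCosts` are made up for illustration.

```cpp
#include <cassert>
#include <cstdio>

// Average cost per page in microseconds, given the total bind time in
// milliseconds and the number of pages bound in that call.
double perPageMicros(double totalMs, int pages) {
    assert(pages > 0);
    return totalMs * 1000.0 / pages;
}

// Prints the three measurements quoted in the post with their
// per-page averages.
void printPerPageCosts() {
    const struct { double totalMs; int pages; } samples[] = {
        {0.25, 1},
        {30.0, 730},
        {3600.0, 9200},
    };
    for (const auto& s : samples) {
        std::printf("%5d pages: %8.2f ms total, %6.1f us/page\n",
                    s.pages, s.totalMs,
                    perPageMicros(s.totalMs, s.pages));
    }
}
```

The point of the comparison is that the per-page cost is not constant: it is roughly nine to ten times higher for the 9200-page bind than for the 730-page bind, which a fixed per-page-table-entry cost would not explain.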

Any information would be helpful! I would love to use this feature, but it’s not really possible at the moment. Again happy to provide more config information or minimal repro code.

MarkusHoHo · September 11, 2023, 10:48am

Hello @benharris42 and welcome to the NVIDIA developer forums.

Thank you for your additional insights.

I checked with engineering and we do have an internal issue tracking this, so investigation is ongoing.

But that is all the information I can share right now.

Thanks!

benharris42 · September 24, 2023, 4:11pm

Just a quick update to clarify the info I posted - I realized the timings I provided were with the VK_LAYER_KHRONOS_validation layer enabled, and this accounts for some of the low performance I was seeing. With validation disabled I see numbers closer to the OP, which are still unusably slow unfortunately - around 250us for 1 page and tens of milliseconds for 1K-10K pages.

For reference, implementing “manual sparse texture” with the classic 2-level texture scheme takes <10us to “bind” 10K pages in a compute shader on my setup. The trade-off, of course, is that all “sparse accesses” go through two texture reads (the classic problem before the HW sparse feature). However, for my use case I need binding to be fast and predictable, so I am sticking with the software solution for now.
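
For readers unfamiliar with the 2-level scheme, here is a minimal CPU-side sketch of the idea, not the compute-shader implementation mentioned above. The class `SoftwareSparseTexture`, its methods, and the float texel format are all assumptions made for illustration: a page table maps each virtual page to a slot in a physical pool, so a "bind" is a single table write and every read costs two lookups.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using u32 = std::uint32_t;
constexpr u32 kUnbound = 0xFFFFFFFFu;  // marks a non-resident page

// Hypothetical model of a 3D texture with software-managed residency.
class SoftwareSparseTexture {
public:
    SoftwareSparseTexture(u32 pagesPerAxis, u32 pageSize, u32 poolSlots)
        : pagesPerAxis_(pagesPerAxis), pageSize_(pageSize),
          poolSlots_(poolSlots),
          pageTable_(std::size_t(pagesPerAxis) * pagesPerAxis * pagesPerAxis,
                     kUnbound),
          pool_(std::size_t(poolSlots) * texelsPerPage(), 0.0f) {}

    // "Binding" is one page-table write; no driver call is involved.
    void bind(u32 px, u32 py, u32 pz, u32 slot) {
        assert(slot < poolSlots_);
        pageTable_[pageIndex(px, py, pz)] = slot;
    }
    void unbind(u32 px, u32 py, u32 pz) {
        pageTable_[pageIndex(px, py, pz)] = kUnbound;
    }

    // Writes a texel; returns false if its page is not resident.
    bool write(u32 x, u32 y, u32 z, float value) {
        const u32 slot = slotFor(x, y, z);
        if (slot == kUnbound) return false;
        pool_[texelIndex(slot, x, y, z)] = value;
        return true;
    }

    // Reads a texel through two lookups: page table, then pool.
    float read(u32 x, u32 y, u32 z, float fallback) const {
        const u32 slot = slotFor(x, y, z);       // first read
        if (slot == kUnbound) return fallback;
        return pool_[texelIndex(slot, x, y, z)];  // second read
    }

private:
    std::size_t texelsPerPage() const {
        return std::size_t(pageSize_) * pageSize_ * pageSize_;
    }
    std::size_t pageIndex(u32 px, u32 py, u32 pz) const {
        assert(px < pagesPerAxis_ && py < pagesPerAxis_ && pz < pagesPerAxis_);
        return (std::size_t(pz) * pagesPerAxis_ + py) * pagesPerAxis_ + px;
    }
    u32 slotFor(u32 x, u32 y, u32 z) const {
        return pageTable_[pageIndex(x / pageSize_, y / pageSize_, z / pageSize_)];
    }
    std::size_t texelIndex(u32 slot, u32 x, u32 y, u32 z) const {
        const u32 lx = x % pageSize_, ly = y % pageSize_, lz = z % pageSize_;
        return slot * texelsPerPage()
             + (std::size_t(lz) * pageSize_ + ly) * pageSize_ + lx;
    }

    u32 pagesPerAxis_;   // virtual size in pages along each axis
    u32 pageSize_;       // page edge length in texels
    u32 poolSlots_;      // number of physical page slots
    std::vector<u32> pageTable_;  // level 1: virtual page -> pool slot
    std::vector<float> pool_;     // level 2: texels of resident pages
};
```

On a GPU the page table and the pool would typically be two textures, which is where the extra texture read per access comes from.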

antoinerichermoz · September 27, 2023, 7:57am

No performance difference when deactivating validation layers on my side

sascha.scandella · April 10, 2024, 10:00am

@MarkusHoHo We would need this feature for large medical scans. Is there any news regarding the slow binding times? It is currently a blocker for our development.

MarkusHoHo · April 10, 2024, 1:45pm

Hello @sascha.scandella and welcome to the NVIDIA developer forums.

I am sorry, I can’t give you a specific release date yet. But I do know that work on the internal bug tracking this is well under way, with changes addressing it planned for a future driver release.

Thanks.

alexandr.benbaccar · February 7, 2025, 7:47pm

Are there any updates to this? Sparse binding updates are still extremely slow on 572 and Windows 11 24H2

frode.oijord · February 14, 2025, 12:34pm

I also really need this to be more performant. The real issue I’m observing is that vkQueueBindSparse blocks all other threads doing sparse binding, so there’s no escaping this. For small images, it’s fast enough, but for large images the slowness is detrimental. Once a large enough area of a sparse image is bound, fps is low even if no further binds are done.

frode.oijord · February 20, 2025, 8:05am

Hi,
I ran your code for binding cost on Vulkan and CUDA, and I’m not seeing any performance difference between the two. My bind times stabilize at around 500 ms after a few hundred iterations for both CUDA and Vulkan.

frode.oijord · February 20, 2025, 8:25am

NVIDIA RTX A6000, Driver version: 571.96

frode.oijord · April 10, 2025, 5:58am

Hi Markus, it’s been a year since your last update on this. Any new information would be very welcome. It’s ok if you don’t have a specific release date, I just need to know if this issue is still worked on and if a fix can be expected at all.

MarkusHoHo · April 10, 2025, 8:51am

I hear you. In case you have a short- to mid-term workaround yourself, I recommend implementing that.

Priorities shift all the time. Sorry.

frode.oijord · April 22, 2025, 11:59am

Hi, thank you for replying.

Unfortunately, this can’t be worked around in client code, as it appears to be caused by a global lock in the driver that even blocks between separate processes(!).

Any kind of global blocking like this, triggered by using an async queue submission command, could rightly be considered a violation of the Vulkan spec. Hopefully, priorities may shift to reflect this.

Anyway, I re-ran the test with the latest driver, as the one I used initially was a little old. The new driver produces significantly worse results for this little test.

This isn’t proof that real-life performance when used in a real application is worse, but it’s not a good sign in my book. Nevertheless, I’ll take the optimistic approach and believe that a fix has been attempted, and work is ongoing…

mclachyd · April 27, 2025, 2:50am

Thanks for investigating this @frode.oijord. I observe the same problem: binding times increase dramatically as more tiles are bound. As you’ve noted, the problem is entirely in vkQueueBindSparse, and this is something that must be addressed by NVIDIA engineers; there is no workaround.

Sparse binding is fine on Windows for me and has been for years, but Linux is unusable. For example, binding a 3D image takes 103 seconds on Linux with the proprietary NVIDIA drivers vs 2 seconds on Windows. That is a small image for my use cases. Switching to NVK 25.0 instead of the proprietary driver yields the expected performance on Linux.

RTX A6000, Driver version 575.51 and 570.144

frode.oijord · April 28, 2025, 8:48am

I have some interesting findings. I also have a support request open with NVIDIA on this, and they were unable to reproduce the increasing bind times using my test code. Turns out they were testing with Windows 11 24H2, and I was on 23H2. After I upgraded to 24H2, my performance issues were more or less resolved. No increasing bind times, just low, steady bind times.

Also got the same result for AMD. Windows 11 24H2 resolved the issue for me. I don’t know if this is purely a Windows issue, or if it’s a combination of NVIDIA driver and Windows. Will try to get more information from NVIDIA. I also need this to perform on Linux so this is important to know.

I will update the SparseTexture repo with this new information later today.

frode.oijord · April 29, 2025, 7:22am

Uploaded another run to the repo. Ubuntu 24.04, RTX A5000 Laptop GPU. Performance is good! (scroll down to the bottom)
