Compute Shader Problem in 5xx Drivers, but not in 47x Drivers

My client has a fairly complex compute shader that performs deduplication operations on very large datasets. The shader uses subgroups.

The shader works fine on 471, 472, and 473 series drivers (at least those we’ve tested). The 496.76 driver fails. “Fails” means that the returned results are incorrect. There are no crashes or validation errors. All the 5xx series drivers that we’ve tested fail as well. The difference in behavior doesn’t seem to depend on the driver being Game-Ready or Studio. I’ve also tested with NDA drivers and see the same results (works ok on 47x, but not 5xx).

The same shader code works fine on AMD GPUs. The incorrect results appear only when the shader runs on very large datasets, not on smaller ones.

Anyway, the primary observation here is that “it works on 47x and not on 5xx”. Since it works on other hardware and seems to be broken in a specific driver series, I tend to think that the shader doesn’t have a problem. It is possible that it does, but we can’t seem to find it.

I did see a mention of a fix in the release notes for an NDA driver that is in the same area as the functionality that is in question. Since I don’t think I can discuss NDA driver information here, I’d be happy to point out the exact release note item to someone via email.

I understand that a reproduction test case would be ideal here, but it would take me a lot of effort to create one. Instead, I’d like to first explore the possibility that a fix was applied to the 4xx series that hasn’t been applied to 5xx yet.

It is also odd that it fails on 496.76. I don’t understand NVIDIA driver numbering schemes, so if someone can explain why 496.76 is what it is and not something like 47x, that would be useful. The same goes for explaining the differences between 4xx and 5xx.


The Vulkan driver team has started tracking this issue as ID 3740049.

The problem turned out to be in one of our shaders. The 5xx driver series introduces some optimizations that increase parallelism, which exposed the need for additional barriers that we were missing. I’ve closed issue 3740049. Thanks to the team for helping us narrow it down.
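For anyone who lands here with a similar symptom, here is a toy Python model of how a missing barrier can stay hidden until scheduling changes. It is not our shader and it says nothing about what the driver actually does. The function `run_workgroup`, the `lockstep_groups` parameter, and the neighbor-read pattern are all made up for illustration. Each simulated invocation writes one slot of shared memory and then reads its neighbor's slot; invocations in the same lockstep group are assumed to finish their writes before any of them reads.

```python
def run_workgroup(data, lockstep_groups, use_barrier):
    """Toy model of one workgroup (hypothetical, for illustration only).

    lockstep_groups: lists of invocation ids; each list is assumed to run
    in lockstep, and the lists execute one after another.
    """
    n = len(data)
    shared = [None] * n  # stands in for shared memory; None means "not written yet"
    out = [None] * n

    def write_phase(ids):
        for i in ids:
            shared[i] = data[i]

    def read_phase(ids):
        for i in ids:
            out[i] = shared[(i + 1) % n]  # read the neighbor's slot

    if use_barrier:
        # Barrier: every group finishes writing before any group reads.
        for group in lockstep_groups:
            write_phase(group)
        for group in lockstep_groups:
            read_phase(group)
    else:
        # No barrier: each group writes and then reads immediately,
        # possibly before a later group has written its slots.
        for group in lockstep_groups:
            write_phase(group)
            read_phase(group)
    return out


data = [10, 20, 30, 40]
expected = [20, 30, 40, 10]

one_group = [[0, 1, 2, 3]]     # whole workgroup happens to run in lockstep
two_groups = [[0, 1], [2, 3]]  # same work split into two independent groups

print(run_workgroup(data, one_group, use_barrier=False) == expected)
print(run_workgroup(data, two_groups, use_barrier=False) == expected)
print(run_workgroup(data, two_groups, use_barrier=True) == expected)
```

In this model only the middle case gives a wrong answer: without the barrier, invocation 1 reads slot 2 before the second group has written it. The code was wrong in the first case too; the schedule just happened to hide it.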