Compute shader gets much slower when switching texture sampling to Gather (textureGather/Gather/GatherRed etc.)

Sorry if this is the wrong forum. Seemed like the most applicable one.

While optimizing a compute shader, I wanted to swap some texture sampling functions (previously textureGrad/SampleLevel) to gather4, since I didn’t need any actual filtering and had read recommendations to use it a couple of times. At worst I expected it to be a no-op, but in reality, for a 4K (3840 x 2160) compute shader, it adds almost 200ms to a previously ~1.2ms shader runtime.

I tried both a version that actually uses all the samples and averages them and one that just hard-picks one of the samples; it makes no difference.
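For readers unfamiliar with gather4, the two variants can be pictured with a small CPU-side emulation. This is only a sketch of the documented gather semantics (2x2 bilinear footprint, components ordered x = (i0, j1), y = (i1, j1), z = (i1, j0), w = (i0, j0)), not the actual shader; the function name `gather4`, the clamp addressing, and the toy 4x4 texture are assumptions made for illustration.

```python
import math

def gather4(tex, u, v):
    """Emulate a single-channel gather4 with clamp addressing.

    tex is a list of rows; (u, v) are normalized coordinates.
    Returns (x, y, z, w) in the D3D/GLSL component order:
    x = (i0, j1), y = (i1, j1), z = (i1, j0), w = (i0, j0).
    """
    height, width = len(tex), len(tex[0])
    # Top-left texel of the 2x2 footprint that bilinear filtering would use.
    i0 = math.floor(u * width - 0.5)
    j0 = math.floor(v * height - 0.5)

    def texel(i, j):
        # Clamp addressing (an assumption; other address modes exist).
        i = min(max(i, 0), width - 1)
        j = min(max(j, 0), height - 1)
        return tex[j][i]

    return (texel(i0, j0 + 1), texel(i0 + 1, j0 + 1),
            texel(i0 + 1, j0), texel(i0, j0))

# Toy texture where each texel stores its own linear index.
tex = [[j * 4 + i for i in range(4)] for j in range(4)]

g = gather4(tex, 0.5, 0.5)
averaged = sum(g) / 4.0   # variant 1: use all four samples
picked = g[3]             # variant 2: hard-pick one sample (the w component)
```

Either way the same four texels are requested from the texture unit, which would be consistent with both variants costing the same.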

I can’t share code, at least publicly, but here’s what the shader profiler sample sections say:

without gather:

Imgur

and with:

Imgur

Also some metrics from the event itself:

(It appears I can only put three links in a post, so here’s the album: Imgur. The very last image is without gather; the second one from the start is with gather.)

So there is a small shift toward more long scoreboard stalls and a small improvement in other metrics, but it makes a big difference in absolute terms.

Perhaps I should add, the texture accesses in question use noise for texture coordinates for a filter kernel, so they aren’t particularly coherent, but from the metrics it seems that cache hit rates are already pretty good (perhaps the filter sizes are small enough).

Other parts of the code fetch texture memory, too, but most of those are highly coherent and about 90% of the stalls come from that small part of the code with noise-based texture access. They are also dependent texture reads, their coordinates rely on results from an earlier texture fetch, if that makes any difference.

What could be going on here? I thought gather4 was mostly free if you don’t need the actual filtering from the texture sample functions. Could it cause fetches across cache lines that were previously untouched?
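To make the cache-line question concrete, here is a rough sketch that counts how many memory tiles a batch of noisy fetches touches with a single-texel footprint versus a 2x2 gather footprint. Everything in it is an assumption for illustration: the 8x4-texel tile size is hypothetical and not a real hardware figure, and `tiles_touched` and `noisy_coords` are made-up helper names. It only applies where the original call was a single-texel fetch; a bilinear sample would presumably touch the same 2x2 footprint already.

```python
import random

# Hypothetical tile size in texels; NOT an actual hardware cache-line layout.
TILE_W, TILE_H = 8, 4

def tiles_touched(coords, footprint):
    """Return the set of tiles covered by the given integer texel coordinates.

    footprint is a list of (dx, dy) offsets fetched for each coordinate.
    """
    tiles = set()
    for x, y in coords:
        for dx, dy in footprint:
            tiles.add(((x + dx) // TILE_W, (y + dy) // TILE_H))
    return tiles

def noisy_coords(n, radius, seed=0, center=(1024, 1024)):
    """Generate n texel coordinates jittered around a center, like a noisy kernel."""
    rng = random.Random(seed)
    return [(center[0] + rng.randint(-radius, radius),
             center[1] + rng.randint(-radius, radius)) for _ in range(n)]

POINT = [(0, 0)]                            # single-texel fetch
GATHER = [(0, 0), (1, 0), (0, 1), (1, 1)]   # 2x2 gather footprint

coords = noisy_coords(n=32, radius=6)
point_tiles = tiles_touched(coords, POINT)
gather_tiles = tiles_touched(coords, GATHER)
print(len(point_tiles), len(gather_tiles))
```

In this toy model the gather footprint can only touch the same tiles or more, and the worst case is a texel in a tile corner, where one fetch becomes four tiles. Whether that accounts for anything close to the slowdown seen here is a separate question that the model cannot answer.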

edit: This is on Windows 10, DX12 (shader compiled with DXC), on an RTX 3070 with driver version 511.65.

The Shader Profiler gives you a detailed picture of what is going on from within the running shader itself. It doesn’t really provide you with a larger view of what is going on (notably in the memory subsystem). I would suggest also using the GPU Trace activity to get a broader picture of how the other GPU resources are consumed between your 2 cases.

I will do that, but in the meantime, I’m trying to understand what, in principle, might be going on here at all.

As mentioned I was under the impression that gather versions of fetches, at least for single-channel textures, are strictly ‘upgrades’ that at worst should perform the same as the equivalent code with regular sampling or fetching functions.

Am I wrong with that assumption? Is there some kind of slow path for gathers where they are implemented as actual 4 separate fetches or something?

I’m not sure what’s generating the differences here, and it’s hard to tell from the data you have, which is why I suggested capturing additional data (if nothing else, it would help rule out various possible theories). That said, having looked at the linked images in more depth, the SM Warp Stall barrier is the most suspicious.

They are also dependent texture reads, their coordinates rely on results from an earlier texture fetch, if that makes any difference.

Are the results from the gather also used for further reads?

Yes, because the gathers just replace the same old fetches. For the record, all I did was go through the code and replace all calls of SampleLevel/Load/whatever (in HLSL) or texture/textureGrad/texelFetch (etc., in GLSL) with the respective gather versions, with literally no other changes. This results in the performance degradation.

I was posting in this forum specifically in hopes that someone who works on the internal implementation of the gathers could maybe have some information about it.

The barrier stall is indeed interesting. I use shared memory barriers with group sync somewhere else in the code, but of course they are there in both versions (gather/nongather). In the original they are not causing any stalls at least.