Compute shader gets much slower when switching texture sampling to Gather (textureGather/Gather/GatherRed etc.)

g_tg · February 2, 2022, 9:50am

Sorry if this is the wrong forum. Seemed like the most applicable one.

While optimizing a compute shader, I wanted to swap some texture sampling functions (previously textureGrad/SampleLevel) to gather4, since I didn’t need any actual filtering and had read recommendations to use it a couple of times. At worst I expected it to be a no-op, but in reality, for a 4K (3840 x 2160) compute shader, it adds almost 200ms to a previously ~1.2ms shader runtime.

I tried both a version that actually uses all the samples and averages them and one that just hard-picks one of the samples; it makes no difference.
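For readers unfamiliar with the two variants described above, here is a minimal CPU-side sketch in Python of what a single-channel gather returns. It is not the poster's shader: the function names are hypothetical, clamp-to-edge addressing is an assumption, and the component order follows the documented D3D/GLSL convention (x = lower-left, then counter-clockwise, so w is the texel at (i0, j0)).

```python
import math

def gather4(tex, u, v):
    """Emulate a single-channel gather on `tex`, a list of rows of texel values.

    Assumptions: normalized coordinates and clamp-to-edge addressing.
    Returns the 2x2 bilinear footprint as (x, y, z, w) where
    x=(i0, j0+1), y=(i0+1, j0+1), z=(i0+1, j0), w=(i0, j0).
    """
    height, width = len(tex), len(tex[0])
    # Same footprint a bilinear sample would use: shift by half a texel, then floor.
    i0 = math.floor(u * width - 0.5)
    j0 = math.floor(v * height - 0.5)

    def texel(i, j):
        # Clamp-to-edge addressing (an assumption for this sketch).
        i = min(max(i, 0), width - 1)
        j = min(max(j, 0), height - 1)
        return tex[j][i]

    return (texel(i0, j0 + 1), texel(i0 + 1, j0 + 1),
            texel(i0 + 1, j0), texel(i0, j0))

def average_all(tex, u, v):
    """Variant 1: use all four gathered texels and average them."""
    return sum(gather4(tex, u, v)) / 4.0

def pick_one(tex, u, v):
    """Variant 2: hard-pick a single component (here w, the (i0, j0) texel)."""
    return gather4(tex, u, v)[3]
```

Both variants issue the same gather and differ only in the arithmetic applied afterwards, which is consistent with them performing the same.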

I can’t share code, at least publicly, but here’s what the shader profiler sample sections say:

without gather:

and with:

Also some metrics from the event itself:

(It appears I can only put three links in a post, so here’s the album: Imgur: The magic of the Internet. The very last image is without gather; the second one from the start is with gather.)

So there is a small shift toward more long scoreboard stalls and a small improvement in other metrics, but it makes a big difference in absolute terms.

Perhaps I should add, the texture accesses in question use noise for texture coordinates for a filter kernel, so they aren’t particularly coherent, but from the metrics it seems that cache hit rates are already pretty good (perhaps the filter sizes are small enough).

Other parts of the code fetch texture memory, too, but most of those are highly coherent and about 90% of the stalls come from that small part of the code with noise-based texture access. They are also dependent texture reads, their coordinates rely on results from an earlier texture fetch, if that makes any difference.

What could be going on here? I thought gather4 was mostly free if you don’t need the actual sampling from texture sample functions. Could it cause fetches across cache lines that were previously untouched?
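The cache-line question can at least be framed with a toy model. The sketch below is in Python and assumes a hypothetical layout in which the texture is stored in square tiles of 8x8 texels, each standing in for one cache line. Real GPU texture layouts are not documented in this thread, so the tile size and all names here are illustrative assumptions. It counts how often a 2x2 gather footprint touches more than one tile, where a single-texel fetch always touches exactly one.

```python
TILE = 8  # hypothetical tile edge in texels; not a documented hardware value

def tiles_touched_point(x, y):
    """A single-texel fetch touches exactly one tile."""
    return {(x // TILE, y // TILE)}

def tiles_touched_gather(x0, y0):
    """A gather reads texels (x0..x0+1, y0..y0+1) and may span up to four tiles."""
    return {((x0 + dx) // TILE, (y0 + dy) // TILE)
            for dx in (0, 1) for dy in (0, 1)}

def straddle_count():
    """Count footprint origins within one tile period that cross a tile edge."""
    return sum(1
               for y0 in range(TILE)
               for x0 in range(TILE)
               if len(tiles_touched_gather(x0, y0)) > 1)
```

Under this assumed layout, 15 of the 64 possible footprint origins per tile cross a tile edge, so a gather at uniformly random coordinates would touch an extra tile a bit under a quarter of the time. That is a modest increase in traffic and, on its own, does not account for a slowdown of this size.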

edit: This is on Windows 10, DX12 (shader compiled with DXC), an RTX 3070, and driver version 511.65.

AMAMODE · February 4, 2022, 8:47am

The Shader Profiler gives you a detailed picture of what is going on from within the running shader itself. It doesn’t really provide you with a larger view of what is going on (notably in the memory subsystem). I would suggest also using the GPU Trace activity to get a broader picture of how the other GPU resources are consumed between your 2 cases.

g_tg · February 4, 2022, 10:39am

I will do that, but in the meantime, I’m trying to get an understanding of what, in principle, might be going on here at all.

As mentioned I was under the impression that gather versions of fetches, at least for single-channel textures, are strictly ‘upgrades’ that at worst should perform the same as the equivalent code with regular sampling or fetching functions.

Am I wrong in that assumption? Is there some kind of slow path for gathers where they are implemented as four actual separate fetches or something?

AMAMODE · February 4, 2022, 11:43am

I’m not sure what’s generating the differences here, and it’s hard to tell from the data you have, which is why I suggested capturing additional data (if nothing else, it would help discard various possible theories). That said, having looked at the linked images in more depth, the SM Warp Stall barrier is the most suspicious.

Quote (g_tg): “They are also dependent texture reads, their coordinates rely on results from an earlier texture fetch, if that makes any difference.”

Are the results from the gather also used for further reads?

g_tg · February 4, 2022, 4:02pm

Yes, because the gathers just replace the same old fetches. For the record, all I did was go through the code and replace all calls to SampleLevel/Load/whatever (in HLSL) or texture/textureGrad/texelFetch (etc., in GLSL) with the respective gather versions, with literally no other changes. This results in the performance degradation.
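One detail of such a swap is that Load/texelFetch take integer texel coordinates while gathers take normalized coordinates, so the coordinate has to be converted. A minimal Python sketch of one common convention follows: aim the gather at the corner shared by the 2x2 block, so that the w component lands on the texel that Load(x, y) would have returned. The function names are hypothetical, and whether the poster's code does this is not stated in the thread.

```python
import math

def gather_uv_for_texel(x, y, width, height):
    """Normalized coordinate at the corner shared by texels (x..x+1, y..y+1).

    Targeting the shared corner keeps the footprint half a texel away from
    any rounding boundary, which makes the selection robust.
    """
    return ((x + 1) / width, (y + 1) / height)

def gather_footprint(u, v, width, height):
    """Integer texel coordinates a gather at (u, v) would read, per component."""
    i0 = math.floor(u * width - 0.5)
    j0 = math.floor(v * height - 0.5)
    return {
        "x": (i0, j0 + 1),
        "y": (i0 + 1, j0 + 1),
        "z": (i0 + 1, j0),
        "w": (i0, j0),
    }

# Usage: which texels does a gather aimed at texel (100, 200) of a 4K target read?
u, v = gather_uv_for_texel(100, 200, 3840, 2160)
footprint = gather_footprint(u, v, 3840, 2160)
```

With this convention, `footprint["w"]` is the texel the original integer fetch addressed, and the other three components are its right, lower, and lower-right neighbours.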

I was posting in this forum specifically in hopes that someone who works on the internal implementation of the gathers could maybe have some information about it.

The barrier stall is indeed interesting. I use shared memory barriers with group sync somewhere else in the code, but of course they are there in both versions (gather/non-gather). In the original, at least, they are not causing any stalls.
