Very low TEX hit rate when profiling OpenGL compute shader


I have a compute shader that implements a simple path tracer: each invocation (one per pixel) computes a color value from the radiance bouncing through the scene. For a larger 3D model (8850 triangles), the shader performs poorly, taking around 33 ms.

After profiling the compute shader with NVIDIA Nsight Graphics 2021.1.1.0 on 64-bit Windows 10 with an NVIDIA GTX 1050 GPU, I got the following results:

I have been working on this compute shader for some time, and over the last few days I managed to reduce the register count to 32, which yielded good performance gains. As I understand from this optimization guide, though, I am currently limited by TEX reads/writes: the TEX Hit Rate is low, “SM Throughput For Active Cycles” is within [60,80], and “SM Warp Long Scoreboard” is the highest metric in the stalls section.

This confuses me, though, as I only do a single imageLoad at the start of the shader and a single imageStore at the end. When I remove all of my other calculations so that those two commands are the only ones in the shader, I get a better hit rate. The pixel coordinates that I load from and write to are gl_GlobalInvocationID.xy.

I tried loading a smaller 3D model (<= 1000 triangles) and got a TEX Hit Rate of about 75%. This could be explained by the fact that I keep my triangle points and BVH data in separate SSBOs that are only ever read from and never written to. Since a ray can bounce in a random direction and hit any other triangle in the array, I assume that these uncoalesced reads are the cause of the low TEX Hit Rate.
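To make the reasoning above concrete, here is a minimal C++ sketch of a toy cache model. It is not a model of the actual GTX 1050 TEX/L1 hierarchy: the fully associative LRU policy, the cache capacity, and the "one triangle per cache line" layout are all assumptions chosen for illustration, and the names `LruCache` and `randomAccessHitRate` are hypothetical. It only shows why uniformly random fetches into a data set larger than the cache tend to produce a low hit rate, while the same access pattern over a small data set does not.

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <random>
#include <unordered_map>

// Toy fully associative LRU cache that tracks which lines are resident.
// Capacity and line granularity are illustrative, not GTX 1050 values.
class LruCache {
public:
    explicit LruCache(std::size_t capacityLines) : capacity_(capacityLines) {}

    // Returns true on a hit. On a miss the line is loaded, evicting the
    // least recently used line if the cache is full.
    bool access(std::uint32_t line) {
        if (capacity_ == 0) return false;  // degenerate cache: always a miss
        auto it = where_.find(line);
        if (it != where_.end()) {
            // Move the line to the front to mark it as most recently used.
            order_.splice(order_.begin(), order_, it->second);
            return true;
        }
        if (order_.size() == capacity_) {
            where_.erase(order_.back());
            order_.pop_back();
        }
        order_.push_front(line);
        where_[line] = order_.begin();
        return false;
    }

private:
    std::size_t capacity_;
    std::list<std::uint32_t> order_;  // front = most recently used
    std::unordered_map<std::uint32_t, std::list<std::uint32_t>::iterator> where_;
};

// Hit rate when `accesses` uniformly random triangle fetches go through the
// cache. Assumes triangleCount > 0 and that each triangle occupies exactly
// one cache line (a simplification of a real SSBO layout).
double randomAccessHitRate(std::uint32_t triangleCount, std::size_t cacheLines,
                           std::size_t accesses, std::uint32_t seed = 1234) {
    LruCache cache(cacheLines);
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::uint32_t> pick(0, triangleCount - 1);
    std::size_t hits = 0;
    for (std::size_t i = 0; i < accesses; ++i) {
        if (cache.access(pick(rng))) ++hits;
    }
    return static_cast<double>(hits) / static_cast<double>(accesses);
}
```

Calling `randomAccessHitRate(1000, 1024, 200000)` and `randomAccessHitRate(8850, 1024, 200000)` compares a data set that fits in the hypothetical 1024-line cache with one that does not. In this toy model the steady-state hit rate for the larger set is roughly `cacheLines / triangleCount`, so it drops as the model grows. Real hit rates also depend on BVH traversal order, line size, and how many warps share the cache.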

Is there any way I could optimize this? I have tried pretty much everything at this point, and I cannot store the triangle data in a UBO or in shared memory because it is too big. I also tried a GLSL extension that enables 16-bit scalar and vector types and replaced all of my calculations and texture types with 16-bit types, but nothing changed.

I will attach a C++ frame capture and the shader source code. (13.2 MB)

framebuffer.comp (29.7 KB)

Hi there,

Thank you for this detailed description! While I am not able to help you with this myself, I will try to bring it to the attention of some of our OpenGL experts.
