Very low TEX hit rate when profiling OpenGL compute shader

sivanovski.dev · July 14, 2022, 1:11pm

Hello.

I have a compute shader that implements a simple path tracer: every invocation (one per pixel) calculates a color value based on the radiance bouncing through the scene. For a bigger 3D model (8850 triangles), the performance of this compute shader is poor (around 33 ms).

After profiling the compute shader with NVIDIA Nsight Graphics 2021.1.1.0 on a 64-bit Windows 10 install with an NVIDIA GTX 1050 GPU, I got the following results:

I have been working on this compute shader for some time now, and over the last few days I’ve managed to reduce the register count to 32, which yielded some good performance gains. As I understand from this optimization guide, however, I am currently limited by TEX reads / writes: the TEX Hit Rate is low, “SM Throughput For Active Cycles” is within [60, 80], and “SM Warp Long Scoreboard” is the highest metric in the stalls section.

This is confusing to me, though, as I only do a single imageLoad at the start of the shader and a single imageStore at the end. When those two commands are the only ones in the shader, without my further calculations, I get a better hit rate. The pixel coordinates that I load from and write to are gl_GlobalInvocationID.xy.

I tried loading a smaller 3D model (at most 1000 triangles) and got a TEX Hit Rate of about 75%. This could be explained by the fact that I keep my triangle points and BVH data in separate SSBOs that are only ever read from and never written to. Since a ray can bounce in a random direction and hit any other triangle in the triangle array, I am assuming that these uncoalesced reads are the cause of the low TEX Hit Rate.
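
As a rough illustration of this hypothesis, here is a toy CPU-side sketch in C++ that compares the hit rate of random triangle lookups for a small and a large model. It is not a model of the GTX 1050's actual TEX/L1 cache: the direct-mapped layout, the 64 KiB capacity, the 128-byte line size, and the 48-byte triangle size are all assumptions chosen for illustration, and the names ToyCache and randomTriangleHitRate are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Toy direct-mapped cache. All sizes are hypothetical; this does NOT
// describe the real cache hierarchy of any NVIDIA GPU.
struct ToyCache {
    std::size_t lineBytes;
    std::vector<std::int64_t> tags;  // one tag per slot, -1 means empty
    std::size_t hits = 0;
    std::size_t accesses = 0;

    ToyCache(std::size_t numLines, std::size_t bytesPerLine)
        : lineBytes(bytesPerLine), tags(numLines, -1) {}

    // Record one read of the cache line that contains byteAddress.
    void access(std::size_t byteAddress) {
        const std::size_t line = byteAddress / lineBytes;
        const std::size_t slot = line % tags.size();
        ++accesses;
        if (tags[slot] == static_cast<std::int64_t>(line)) {
            ++hits;
        } else {
            tags[slot] = static_cast<std::int64_t>(line);  // evict old line
        }
    }

    double hitRate() const {
        return accesses ? static_cast<double>(hits) / accesses : 0.0;
    }
};

// Hit rate when every access picks a uniformly random triangle, which
// loosely mimics rays bouncing in random directions through the scene.
// Only the first byte of each triangle is touched, for simplicity.
double randomTriangleHitRate(std::size_t triangleCount,
                             std::size_t accessCount,
                             unsigned seed = 1234) {
    assert(triangleCount > 0);
    const std::size_t triangleBytes = 48;  // assumption: 3 vertices * 16 bytes
    ToyCache cache(512, 128);              // assumption: 64 KiB, 128-byte lines
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, triangleCount - 1);
    for (std::size_t i = 0; i < accessCount; ++i) {
        cache.access(pick(rng) * triangleBytes);
    }
    return cache.hitRate();
}
```

Calling randomTriangleHitRate(1000, 200000) and randomTriangleHitRate(8850, 200000) shows the trend: under these assumptions the smaller model fits entirely in the toy cache, while the larger one does not, so random lookups keep evicting each other's lines. The absolute numbers are not comparable to what Nsight reports.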

Is there any way I could optimize this? I have tried pretty much everything at this point, and I cannot store the triangle data in a UBO or in shared memory as it is too big. I also tried a GLSL extension that allows 16-bit scalar and vector types and replaced all of my calculations and texture types with 16-bit types, but nothing changed.
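
To put a rough number on "too big", the following C++ sketch estimates the triangle buffer size and compares it against the minimum limits that the OpenGL specification guarantees. The per-vertex size of 16 bytes (one vec4 per vertex) is an assumption, since the actual SSBO layout is only visible in the attached shader, and the helper names are hypothetical. The limits of a particular GPU can be larger and should be queried with glGetIntegerv using GL_MAX_UNIFORM_BLOCK_SIZE and GL_MAX_COMPUTE_SHARED_MEMORY_SIZE.

```cpp
#include <cstddef>

// Assumption: each vertex is stored as one vec4 (16 bytes).
constexpr std::size_t kBytesPerVertex = 16;
constexpr std::size_t kVerticesPerTriangle = 3;

// Minimum values guaranteed by the OpenGL specification; a real
// driver may report more.
constexpr std::size_t kMinUniformBlockBytes = 16384;  // GL_MAX_UNIFORM_BLOCK_SIZE
constexpr std::size_t kMinSharedBytes = 32768;        // GL_MAX_COMPUTE_SHARED_MEMORY_SIZE

// Size in bytes of a tightly packed triangle array under the assumption above.
constexpr std::size_t triangleBufferBytes(std::size_t triangleCount) {
    return triangleCount * kVerticesPerTriangle * kBytesPerVertex;
}

constexpr bool fitsIn(std::size_t bytes, std::size_t limitBytes) {
    return bytes <= limitBytes;
}

// 8850 triangles * 3 vertices * 16 bytes = 424800 bytes, far above both minimums.
static_assert(!fitsIn(triangleBufferBytes(8850), kMinUniformBlockBytes),
              "triangle data exceeds the guaranteed UBO block size");
static_assert(!fitsIn(triangleBufferBytes(8850), kMinSharedBytes),
              "triangle data exceeds the guaranteed shared memory size");
```

Under this assumed layout, even halving the vertex size with 16-bit types leaves the buffer at roughly 212 KB, which is still well above both guaranteed minimums. The BVH data would come on top of that.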

I will attach a C++ frame capture and the shader source code.

pathtracer__2022_07_14__14_50_52.zip (13.2 MB)

framebuffer.comp (29.7 KB)

MarkusHoHo · July 18, 2022, 11:45am

Hi there @sivanovski.dev,

Thank you for this detailed description! While I am not able to help you with this myself, I will try to bring it to the attention of some of our OpenGL experts.

Thanks!
