Compute Shader Performance

Raxfale · May 8, 2016, 3:35am

I’ve been having a ball playing around with Vulkan. However, I have now hit some issues that bewilder and confuse (me at least).

I’m doing a deferred render path: a gbuffer renderpass, lighting via a compute shader, then a second renderpass for overlays.

So a compute pipeline in between two renderpasses. This is all within the same queue, submitted as a single command buffer. I don’t know, maybe that isn’t allowed, but in my experimentation so far it does seem to work (ish).

The problem is the compute shader is slow. Really slow. Some 50 times slower than the same GLSL in OpenGL.

I’m undoubtedly abusing the API in some manner.

I’ve tried a myriad of flags and variations, to no avail. In the shader, things seem to slow down when doing dynamic indexing of an array, sometimes, maybe. Also, going from 4 bound textures to 5 bound textures seems to hit a performance threshold (even when the extra texture is not touched). But, really, I just have no idea what is going on.

I do wonder if there is a driver issue. But I don’t really have much of a clue how to determine where to look for the culprit.

Anyone have sage advice?

Mathias_Schott · May 19, 2016, 3:37pm

Have you tried our latest 365.19 GameReady driver which contains some fixes for unexpectedly low compute shader performance?

tomaka · May 30, 2016, 8:54am

I’ve been encountering a similar issue.

I’ve been porting an OpenGL 3 application to Vulkan.
The OpenGL app would store data in several textures and run draw commands. The whole process would last about 4 milliseconds per frame.

When porting to Vulkan I rewrote the exact same logic as a compute shader. The only difference is that I’d use storage buffers with Morton order to store data, instead of textures. Now the process lasts about 47-48 milliseconds per frame, so 12 times slower.

One thing I noticed is that if I removed in my code the calls to “exp” and to “log” (taking care not to change the memory accesses), the time taken would drop from 47ms to about 27ms. The OpenGL code (the one that runs in 4ms) was calling “exp” and “log” as well, and in the exact same fashion.

I can’t debug much more because of the lack of tooling. I really don’t know what’s happening here, but there is definitely something wrong.

EDIT: I’m using driver 368.22, released on May 23rd.

Mathias_Schott · May 30, 2016, 12:18pm

This is expected since texturing hardware is optimized for memory accesses that exploit 2d locality whereas SSBOs are optimized for divergent memory access. And the texturing unit does the address computation (Morton order) optimally in hardware whereas your approach executes that in the shader code using generic ALU operations.

TLDR: Manually implementing “textures” on top of SSBOs is not going to be as fast as using the actual texturing hardware.
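To make the ALU cost concrete, here is a minimal sketch of a 2D Morton (Z-order) index computation, written in C++ rather than GLSL so it can be compiled on its own. The actual indexing code from the post above is not shown in this thread, so the function names `part1By1` and `mortonIndex2D` and the 16-bit coordinate limit are assumptions for illustration only. A shader doing this by hand would pay for roughly these shifts and masks on every access.

```cpp
#include <cassert>
#include <cstdint>

// Spread the low 16 bits of v so that bit i ends up at bit 2*i
// (hypothetical helper name, not taken from the thread).
inline std::uint32_t part1By1(std::uint32_t v) {
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// Interleave x and y into a single Z-order index:
// x occupies the even bits, y the odd bits.
inline std::uint32_t mortonIndex2D(std::uint32_t x, std::uint32_t y) {
    assert(x < 65536u && y < 65536u);  // assumption: 16-bit coordinates
    return part1By1(x) | (part1By1(y) << 1);
}
```

Each lookup costs around ten shift/or/and steps per coordinate before the load is even issued, which is the kind of per-access overhead the post attributes to doing the address computation in shader code.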

Not knowing the specifics of your algorithm, here are a few things you can try:

Implement the drawcall + texture based approach in Vulkan (should have similar performance to GL)
Implement the compute shader approach in Vulkan but use textures instead of SSBOs (performance should be improved compared to SSBOs)
Experiment with the workgroup sizes of the compute shader dispatch
Implement the compute shader + SSBO approach in GL (performance should be similarly slow as your Vulkan implementation)

For further information, please check out:

https://developer.nvidia.com/content/understanding-structured-buffer-performance
http://on-demand.gputechconf.com/gtc/2016/presentation/s6138-christoph-kubisch-pierre-boudier-gpu-driven-rendering.pdf, slide 38

Regards,

Mathias Schott

tomaka · May 30, 2016, 1:20pm

Mathias_Schott:

This is expected since texturing hardware is optimized for memory accesses that exploit 2d locality whereas SSBOs are optimized for divergent memory access. And the texturing unit does the address computation (Morton order) optimally in hardware whereas your approach executes that in the shader code using generic ALU operations.

TLDR: Manually implementing “textures” on top of SSBOs is not going to be as fast as using the actual texturing hardware.

Not knowing the specifics of your algorithm, here are a few things you can try:

Implement the drawcall + texture based approach in Vulkan (should have similar performance to GL)

Implement the compute shader approach in Vulkan but use textures instead of SSBOs (performance should be improved compared to SSBOs)

Experiment with the workgroup sizes of the compute shader dispatch

Implement the compute shader + SSBO approach in GL (performance should be similarly slow as your Vulkan implementation)

For further information, please check out:

https://developer.nvidia.com/content/understanding-structured-buffer-performance
http://on-demand.gputechconf.com/gtc/2016/presentation/s6138-christoph-kubisch-pierre-boudier-gpu-driven-rendering.pdf, slide 38

Regards,

Mathias Schott

Thanks for the links.

Even though I’m not an expert, I’m more or less aware that using textures is better for locality. But in this specific case it doesn’t really make sense to use textures. The OpenGL code was using them for a reason I don’t recall, but it’s basically a hack. In fact the content of the textures was only ever accessed indirectly through texelFetch (and the texel coordinates fetched from another texture), and often with texels that are far away from each other.

I only switched to Morton order while trying to find out why it is so slow, and it didn’t help.

Even if we assume that the problem comes from the fact that I’m using buffers instead of textures, it doesn’t explain why it is twelve times slower to do exactly the same thing (the first link mentions a 20% performance decrease for example), and why removing exp/log reduces the time taken by 20 milliseconds.

Mathias_Schott · May 30, 2016, 2:34pm

Out of curiosity, are your SSBOs persistently mapped and HOST_VISIBLE by any chance? If so, can you make sure that they are DEVICE_LOCAL and not HOST_VISIBLE?

And which GPU/OS do you have?
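As a rough illustration of the DEVICE_LOCAL check, here is a minimal C++ sketch of picking a memory type that is device-local and not host-visible. To stay compilable without the Vulkan SDK it uses local stand-in constants that mirror the values of `VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT` and `VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT`. The helper name `findMemoryType` and its vector-of-flags input are assumptions, not anything from the thread; in a real application the flags would come from `vkGetPhysicalDeviceMemoryProperties` and the type mask from `VkMemoryRequirements::memoryTypeBits`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-ins mirroring Vulkan's VkMemoryPropertyFlagBits values.
constexpr std::uint32_t kDeviceLocal = 0x1;  // VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
constexpr std::uint32_t kHostVisible = 0x2;  // VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT

// Hypothetical helper: returns the index of the first memory type that is
// permitted by typeBits, has every flag in `required`, and has no flag in
// `excluded`. Returns -1 if no memory type qualifies.
inline int findMemoryType(const std::vector<std::uint32_t>& typeFlags,
                          std::uint32_t typeBits,
                          std::uint32_t required,
                          std::uint32_t excluded) {
    for (std::size_t i = 0; i < typeFlags.size() && i < 32; ++i) {
        const bool allowed = ((typeBits >> i) & 1u) != 0;
        const bool hasRequired = (typeFlags[i] & required) == required;
        const bool hasExcluded = (typeFlags[i] & excluded) != 0;
        if (allowed && hasRequired && !hasExcluded) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

Calling `findMemoryType(flags, typeBits, kDeviceLocal, kHostVisible)` selects memory the CPU cannot map, so the buffer contents would typically be filled through a staging buffer and a copy command.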

Mathias_Schott · May 30, 2016, 2:43pm

And if you are using layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in; (from the other thread) then it’s also no surprise that your performance is low, since you are essentially underutilizing the GPU: our hardware executes threads in groups of 32 in lock-step. So using just 1 thread per workgroup reduces the performance by a factor of 32.
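The arithmetic behind that factor of 32 can be sketched as follows. This is a simplified model in C++, assuming only what the post states (threads issued in lock-step groups of 32); the names `kWarpSize`, `warpsPerGroup`, `laneUtilization` and `groupCount` are hypothetical, and real occupancy depends on more than this.

```cpp
#include <cassert>
#include <cstdint>

// From the post: the hardware executes threads in lock-step groups of 32.
constexpr std::uint32_t kWarpSize = 32;

// Number of 32-wide groups needed to run one workgroup of `localSize` threads.
constexpr std::uint32_t warpsPerGroup(std::uint32_t localSize) {
    return (localSize + kWarpSize - 1) / kWarpSize;
}

// Fraction of issued lanes that do useful work for a given workgroup size.
inline double laneUtilization(std::uint32_t localSize) {
    assert(localSize > 0);
    return static_cast<double>(localSize) /
           static_cast<double>(warpsPerGroup(localSize) * kWarpSize);
}

// Workgroups to dispatch so that at least `totalItems` invocations run
// (the shader must bounds-check the surplus invocations in the last group).
inline std::uint32_t groupCount(std::uint32_t totalItems, std::uint32_t localSize) {
    assert(localSize > 0);
    return (totalItems + localSize - 1) / localSize;
}
```

With local_size_x = 1 the model gives a utilization of 1/32, while 32 or 64 gives full utilization. For 10000 items and local_size_x = 64, `groupCount` yields 157, which is the value that would be passed as groupCountX to vkCmdDispatch.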

tomaka · May 30, 2016, 2:51pm

I made that exact mistake at first and the execution was taking several hundred milliseconds, so now I’m sure that part is good.

GPU is GTX 970 and OS is Windows 7.

The other thread is actually about a different compute shader, but I was not aware that the hardware wasn’t capable of running multiple groups concurrently. I’ll try increasing the local group size.

tomaka · May 31, 2016, 8:50am

That did the trick. Thanks for the help!
Sorry for thinking that it was a problem in the driver.

Raxfale · June 6, 2016, 2:38pm

Hi Mathias,

I can confirm that the newer drivers have made a world of difference. Back into the ballpark of the OpenGL behaviour (although I would say still slower, maybe up to twice as slow).

Still having a ball playing with it all.

Cheers

vljn · June 7, 2016, 5:13pm

I also have a performance issue with compute shaders.
I have 4 compute shaders running in sequential order (they apply different transforms to the same buffer).
The 4 run in 3 ms on an HD 7750 (low end, from 2012) but take 18 ms on a 980 Ti with the 368.22 drivers.

With similar GL/DX11 code it ran in less than a millisecond.
I’m using storage texel buffers to access data (and uniform texel buffers to access read-only buffer data).
I thought texture buffers were optimised for linear access, but do they perform worse than untyped buffers in my case (i.e. 10000 float4 values)?

Regards Vincent

vljn · June 8, 2016, 5:03pm

Driver 368.39 fixed compute shader performance for me; it went from 18 ms to 5 ms. It’s still high though, and I wonder if switching to storage buffers instead of storage texel buffers may help performance.
