Compute Shader Performance

I’ve been having a ball playing around with Vulkan. However, I have now hit some issues that bewilder and confuse (me at least).

I’m doing a deferred render path: a G-buffer render pass, lighting via a compute shader, then a second render pass for overlays.

So that is a compute pipeline in between two render passes. This is all within the same queue, submitted as a single command buffer. I don’t know, maybe that isn’t allowed, but in my experimentation so far it does seem to work (ish).

The problem is that the compute shader is slow. Really slow. Some 50 times slower than the same GLSL in OpenGL.

I’m undoubtedly abusing the API in some manner.

I’ve tried a myriad of flags and variations, to no avail. In the shader, things sometimes seem to slow down when doing dynamic indexing of an array. Also, going from 4 bound textures to 5 bound textures seems to hit a performance threshold (even when not touched). But, really, I just have no idea what is going on.

I do wonder if there is a driver issue, but I don’t really have much of a clue where to look for the culprit.

Anyone have sage advice?

Have you tried our latest 365.19 GameReady driver which contains some fixes for unexpectedly low compute shader performance?

I’ve been encountering a similar issue.

I’ve been porting an OpenGL 3 application to Vulkan.
The OpenGL app would store data in several textures and run draw commands. The whole process would last about 4 milliseconds per frame.

When porting to Vulkan I rewrote the exact same logic as a compute shader. The only difference is that I’d use storage buffers with Morton order to store data, instead of textures. Now the process lasts about 47-48 milliseconds per frame, so 12 times slower.

One thing I noticed is that if I removed in my code the calls to “exp” and to “log” (taking care not to change the memory accesses), the time taken would drop from 47ms to about 27ms. The OpenGL code (the one that runs in 4ms) was calling “exp” and “log” as well, and in the exact same fashion.

I can’t debug much more because of the lack of tooling. I really don’t know what’s happening here, but there is definitely something wrong.

EDIT: I’m using driver 368.22, released on May 23rd.

This is expected, since texturing hardware is optimized for memory accesses that exploit 2D locality, whereas SSBOs are optimized for divergent memory access. The texturing unit also does the address computation (Morton order) optimally in hardware, whereas your approach executes it in the shader code using generic ALU operations.

TLDR: Manually implementing “textures” on top of SSBOs is not going to be as fast as using the actual texturing hardware.
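To give a rough sense of the extra ALU work involved, here is a minimal C sketch of a 2D Morton (Z-order) index computed in software. This is the kind of arithmetic a shader has to repeat on every access when it lays a buffer out this way by hand. The names `part1by1` and `morton2d` are illustrative and not taken from the poster's code, and the 16-bit coordinate range is an assumption.

```c
#include <stdint.h>

/* Spread the low 16 bits of v so that they land on the even bit
   positions of the result (bit i moves to bit 2*i). */
static uint32_t part1by1(uint32_t v)
{
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

/* Interleave x (even bits) and y (odd bits) into one linear index.
   Assumes both coordinates fit in 16 bits. */
uint32_t morton2d(uint32_t x, uint32_t y)
{
    return part1by1(x) | (part1by1(y) << 1);
}
```

Each lookup costs roughly a dozen shifts, ORs and ANDs before the load is even issued, which the texture unit would otherwise handle in fixed-function hardware.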

Not knowing the specifics of your algorithm, here are a few things you can try:

  1. Implement the draw call + texture based approach in Vulkan (should have similar performance to GL)
  2. Implement the compute shader approach in Vulkan but use textures instead of SSBOs (performance should be improved compared to SSBOs)
  3. Experiment with the workgroup sizes of the compute shader dispatch
  4. Implement the compute shader + SSBO approach in GL (performance should be similarly slow as your Vulkan implementation)

For further information, please check out:

https://developer.nvidia.com/content/understanding-structured-buffer-performance
http://on-demand.gputechconf.com/gtc/2016/presentation/s6138-christoph-kubisch-pierre-boudier-gpu-driven-rendering.pdf, slide 38

Regards,

Mathias Schott

Thanks for the links.

Even though I’m not an expert, I’m more or less aware that using textures is better for locality. But in this specific case it doesn’t really make sense to use textures. The OpenGL code was using them for a reason I don’t recall, but it’s basically a hack. In fact the content of the textures was only ever accessed indirectly through texelFetch (and the texel coordinates fetched from another texture), and often with texels that are far away from each other.

The fact that I use the Morton order was just an attempt at trying to find the reason why it is so slow, without success.

Even if we assume that the problem comes from the fact that I’m using buffers instead of textures, it doesn’t explain why it is twelve times slower to do exactly the same thing (the first link mentions a 20% performance decrease for example), and why removing exp/log reduces the time taken by 20 milliseconds.

Out of curiosity, are your SSBOs persistently mapped and HOST_VISIBLE by any chance? If so, can you make sure that they are DEVICE_LOCAL and not HOST_VISIBLE?

And which GPU/OS do you have?
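The DEVICE_LOCAL versus HOST_VISIBLE check above could be sketched roughly as follows. To stay self-contained, this does not include the Vulkan headers: the plain `type_flags` array is a hypothetical stand-in for the `propertyFlags` of each entry in `VkPhysicalDeviceMemoryProperties::memoryTypes`, `allowed_bits` stands in for `VkMemoryRequirements::memoryTypeBits`, and the function name is made up for illustration. The two constants mirror the values of `VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT` and `VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT`.

```c
#include <stdint.h>

/* Same values as the corresponding VK_MEMORY_PROPERTY_* bits. */
#define MEM_DEVICE_LOCAL 0x1u
#define MEM_HOST_VISIBLE 0x2u

/* Return the index of a memory type that is DEVICE_LOCAL, preferring one
   that is not HOST_VISIBLE. Falls back to a DEVICE_LOCAL + HOST_VISIBLE
   type if that is all the resource allows, and returns -1 if no
   DEVICE_LOCAL type is allowed at all. Assumes type_count <= 32. */
int pick_device_local_type(const uint32_t *type_flags,
                           uint32_t type_count,
                           uint32_t allowed_bits)
{
    int fallback = -1;
    for (uint32_t i = 0; i < type_count; ++i) {
        if (!(allowed_bits & (1u << i)))
            continue;                 /* resource cannot use this type */
        if (!(type_flags[i] & MEM_DEVICE_LOCAL))
            continue;                 /* lives in host memory, skip */
        if (!(type_flags[i] & MEM_HOST_VISIBLE))
            return (int)i;            /* pure device-local: best case */
        if (fallback < 0)
            fallback = (int)i;        /* device-local but also mappable */
    }
    return fallback;
}
```

In a real application the chosen index would go into the memory allocation, and data would be uploaded through a separate HOST_VISIBLE staging buffer.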

And if you are using layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in; (from the other thread) then it’s also no surprise that your performance is low, since you are essentially underutilizing the GPU: our hardware executes threads in groups of 32 in lock-step. So using just 1 thread per workgroup reduces the performance by a factor of 32.

I made that exact mistake at first and the execution was taking several hundred milliseconds, so now I’m sure that it’s good.
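As a back-of-the-envelope illustration of the factor of 32 mentioned above, the following C sketch computes how many workgroups a dispatch needs and what share of the reserved lanes does useful work. It is a simplified model that assumes 32 lanes per hardware thread group, as stated in the post, and a local size of at least 1; the function names are made up for illustration, and real occupancy depends on more than lane count.

```c
#include <stdint.h>

/* Assumption: the hardware runs threads in lock-step groups of 32. */
#define WARP_SIZE 32u

/* Number of workgroups needed so that at least `items` invocations
   run with the given local size (rounds up). */
uint32_t group_count(uint32_t items, uint32_t local_size)
{
    return (items + local_size - 1u) / local_size;
}

/* Lanes occupied by one workgroup, rounded up to whole groups of 32. */
uint32_t lanes_reserved(uint32_t local_size)
{
    return group_count(local_size, WARP_SIZE) * WARP_SIZE;
}

/* Integer percentage of the reserved lanes that run an invocation. */
uint32_t lane_utilization_percent(uint32_t local_size)
{
    return (100u * local_size) / lanes_reserved(local_size);
}
```

With a local size of 1, only 1 lane in 32 is active (about 3%), while local sizes that are multiples of 32 fill every lane. For example, 10000 elements with a local size of 64 need 157 workgroups, with the shader guarding against the few out-of-range invocations in the last group.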

GPU is GTX 970 and OS is Windows 7.

The other thread is actually a different compute shader, but I was not aware that the hardware wasn’t capable of running multiple groups concurrently. I’ll try increasing the local group size.

That did the trick. Thanks for the help!
Sorry for thinking that it was a problem in the driver.

Hi Mathias,

I can confirm that the newer drivers have made a world of difference. I’m back in the ballpark of the OpenGL behaviour (although I would say still slower, maybe up to twice as slow).

Still having a ball playing with it all.

Cheers

I also have a performance issue with compute shaders.
I have 4 compute shaders running in sequential order (they apply different transforms to the same buffer).
The 4 run in 3 ms on an HD 7750 (low end from 2012) but take 18 ms on a 980 Ti with the 368.22 drivers.

With similar GL/DX11 code it ran in less than 1 ms.
I’m using storage texel buffers to access data (and uniform texel buffers to access read-only buffer data).
I was thinking that texture buffers were optimised for linear access, but do they perform worse than untyped buffers in my case (i.e. 10000 float4 elements)?

Regards Vincent

Driver 368.39 fixed compute shader performance for me: it went from 18 ms to 5 ms. It’s still high though, and I wonder if switching to storage buffers instead of storage texel buffers may help performance.