I’m typically a graphics/GLSL programmer, but after attending GTC 2013 I decided to try CUDA again and see what the differences were. I’ve been looking at order-independent transparency, particularly the sorting stage, since it’s quite slow for complex scenes. One peculiarity I found in GLSL is that just the act of allocating a large local array (which would normally be used for sorting) impacts performance significantly, even if nothing is done with the array. See here:
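To illustrate the pattern (a minimal sketch, not my actual sorting shader — the array size and names are just placeholders):

```glsl
#version 430

out vec4 fragColor;

void main()
{
    // Deliberately never read or written: merely declaring a large
    // local array like this is enough to slow the shader down.
    float depths[512];

    fragColor = vec4(1.0);
}
```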
When testing the same code ported to CUDA using local arrays (stored in global memory but cached very well), the effect was simply not observed. This leads me to ask: where does GLSL put local arrays, and what makes them so different from CUDA’s local arrays?
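For comparison, the CUDA version I tested was roughly this shape (simplified sketch; `N_MAX`, the insertion sort, and the buffer layout are stand-ins for the real OIT code):

```cuda
#define N_MAX 512

// Each thread sorts its own run of fragment depths.
__global__ void sortFragments(const float* in, float* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Per-thread "local" array: CUDA places this in local memory,
    // which physically lives in global memory but caches well.
    float depths[N_MAX];

    for (int i = 0; i < n; ++i)
        depths[i] = in[tid * n + i];

    // Simple per-thread insertion sort (stand-in for the real sort).
    for (int i = 1; i < n; ++i) {
        float key = depths[i];
        int j = i - 1;
        while (j >= 0 && depths[j] > key) {
            depths[j + 1] = depths[j];
            --j;
        }
        depths[j + 1] = key;
    }

    for (int i = 0; i < n; ++i)
        out[tid * n + i] = depths[i];
}
```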
Another quite strange behaviour is that the effect changes after running a compute shader (it only needs to run once during the application’s lifetime), also one with a large array. See here:
Does anyone here know what state the compute shader triggers in OpenGL to give this result?
I’ve also tried using shared memory for the local array, but despite the many forum posts about how fast shared memory is, it was many times slower than local memory and at least 2x slower than the GLSL version. However, interestingly, it exhibited the same slowdown with increasing memory usage. Does GLSL use shared memory for local arrays (like OpenCL)? I’m not doing any alignment with shared memory, which may contribute to the 2x hit.
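The shared-memory variant I tried looked roughly like this (again a sketch; the load/sort/store body is the same as the local-array kernel and is elided):

```cuda
__global__ void sortFragmentsShared(const float* in, float* out, int n)
{
    // Dynamically sized shared memory, supplied as the third launch
    // parameter: sortFragmentsShared<<<grid, block,
    //     block.x * n * sizeof(float)>>>(in, out, n);
    extern __shared__ float smem[];

    // Each thread takes a contiguous, unpadded slice. Note that when n
    // is a multiple of 32, every thread in a warp hits the same bank,
    // giving worst-case bank conflicts -- possibly part of the 2x hit.
    float* depths = &smem[threadIdx.x * n];

    // ...same load / insertion sort / store as the local-array kernel,
    // operating on depths[] instead of a private array...
}
```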
Why does allocating more shared memory hurt performance? As far as I know this doesn’t reduce the fixed pool of L1 cache. Can multiple blocks execute simultaneously on a single multiprocessor?
If shared memory is the cause of the bad performance with large arrays, is it likely that a driver update will soon give GLSL the ability to use local memory instead (either explicitly or implicitly) and suddenly make large arrays a lot faster?
Thanks in advance for any thoughts or answers.