I’m typically a graphics/GLSL programmer, but after attending GTC 2013 I decided to try CUDA again and see what the differences were. I’ve been looking at order-independent transparency, particularly the sorting stage, since it’s quite slow for complex scenes. One peculiarity I found in GLSL is that just the act of allocating a large local array (which would normally be used for sorting) impacts performance significantly, even if nothing is done with the array. See here:
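To illustrate the pattern (a minimal sketch, not my actual sorting shader — the array size and names are just placeholders):

```glsl
#version 430

out vec4 fragColor;

void main()
{
    // Deliberately never read or written: merely declaring a large
    // local array like this is enough to slow the shader down.
    float depths[512];

    fragColor = vec4(1.0);
}
```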
When testing the same code ported to CUDA using local arrays (stored in global memory but cached very well), the effect was simply not observed. This leads me to ask: where does GLSL put local arrays, and what makes them so different from CUDA’s local arrays?
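For comparison, the CUDA version I tested was roughly this shape (simplified sketch; `N_MAX`, the insertion sort, and the buffer layout are stand-ins for the real OIT code):

```cuda
#define N_MAX 512

// Each thread sorts its own run of fragment depths.
__global__ void sortFragments(const float* in, float* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Per-thread "local" array: CUDA places this in local memory,
    // which physically lives in global memory but caches well.
    float depths[N_MAX];

    for (int i = 0; i < n; ++i)
        depths[i] = in[tid * n + i];

    // Simple per-thread insertion sort (stand-in for the real sort).
    for (int i = 1; i < n; ++i) {
        float key = depths[i];
        int j = i - 1;
        while (j >= 0 && depths[j] > key) {
            depths[j + 1] = depths[j];
            --j;
        }
        depths[j + 1] = key;
    }

    for (int i = 0; i < n; ++i)
        out[tid * n + i] = depths[i];
}
```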
Another quite strange behaviour is that the effect changes after running a compute shader (it only needs to run once during the application’s lifetime), also one with a large array. See here:
Does anyone here know what state the compute shader triggers in OpenGL to give this result?
I’ve also tried using shared memory for the local array, but despite the many forum posts about how fast shared memory is, it was many times slower than local memory and at least 2x slower than the GLSL version. However, interestingly, it exhibited the same slowdown with increasing memory usage. Does GLSL use shared memory for local arrays (like OpenCL)? I’m not doing any alignment with shared memory, which may contribute to the 2x hit.
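The shared-memory variant I tried looked roughly like this (again a sketch; the load/sort/store body is the same as the local-array kernel and is elided):

```cuda
__global__ void sortFragmentsShared(const float* in, float* out, int n)
{
    // Dynamically sized shared memory, supplied as the third launch
    // parameter: sortFragmentsShared<<<grid, block,
    //     block.x * n * sizeof(float)>>>(in, out, n);
    extern __shared__ float smem[];

    // Each thread takes a contiguous, unpadded slice. Note that when n
    // is a multiple of 32, every thread in a warp hits the same bank,
    // giving worst-case bank conflicts -- possibly part of the 2x hit.
    float* depths = &smem[threadIdx.x * n];

    // ...same load / insertion sort / store as the local-array kernel,
    // operating on depths[] instead of a private array...
}
```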
Why does allocating more shared memory hurt performance? As far as I know this doesn’t reduce the fixed pool of L1 cache. Can multiple blocks execute simultaneously on a single multiprocessor?
If shared memory is the cause of the bad performance with large arrays, is it likely that a driver update will soon give GLSL the ability to use local memory instead (either explicitly or implicitly) and suddenly make large arrays a lot faster?
Thanks in advance for any thoughts or answers.