I have a kernel where each thread does nearly identical computation (using data stored in smem) on 8 or 16 floats.
These are stored in an array of floats that is looped over. Given my occupancy and block size, I have plenty of registers to spare to keep the entire array in registers at all times. However, even with --maxrregcount set appropriately, the compiler insists on using local memory for this array, which more than halves the performance of the kernel.
The slowdown is easy to verify: if I access only a single element but still perform all the computations, the kernel runs more than twice as fast.
The profiler also indicates that a large number of local loads/stores takes place when the array is used.
I do see a speed benefit from storing the data in smem instead, but that reduces my block size and/or occupancy.
Very interesting. I currently use a home-made template function to do the exact same thing. Could you point me to the part of the Preprocessor library you use, and perhaps give a small code example? :)
Thanks a lot. We were about to commit to the home-brewed template solution, but this brings many advantages. Now I just have to come up with a clever way of creating nested iterations (for (i = …) for (j = …)) to operate on matrices.
I guess there is no way to create the variable names at runtime, is there? For a few minutes I thought this would be a perfect solution for one of the limitations in my code where I currently use shared mem, since the shared mem is too small for the final project. Unfortunately, I then remembered that I have to access random points in the array based on results generated at runtime, and that doesn't work well with registers.
Still, it's good to have a simple solution for keeping linearly accessed arrays in registers; maybe I'll be able to use it in a later project.