I allocate an array i_data of size R*R*R and pass it as a parameter to a kernel. In the kernel, I do some transposing, similar to the transpose sample project in the CUDA SDK, and write the result to an array o_data of the same size.
Unfortunately, this works only for R=16. When I choose R=32, the array o_data contains uninitialized data. Even if I use a dummy kernel that sets o_data[0] to some value, the assignment has no effect.
I know that there is a limit of 16KB of shared memory per multiprocessor. Is there also a limit on how much global memory a kernel can access?
Thanks for your help
Sacha