Limited amount of global memory for a kernel?

I allocate an array i_data of size R*R*R and then pass it as a parameter to a kernel. In the kernel, I do some transposing similar to the transpose sample project of the CUDA SDK and write the result to an array o_data of the same size.

Unfortunately, this works only for R=16. When I choose R=32, the array o_data contains uninitialized data. Even if I use a dummy kernel that only sets o_data[0] to some value, the assignment has no effect.

I know that there is a limit of 16 KB of shared memory per multiprocessor. Is there also a limit on how much global memory a kernel can access?

Thanks for your help

Sacha

The actual kernel parameters are passed in shared memory, which is 16 KB (i.e. less than 32x32x32xsizeof(z)). Things may work if you instead pass a pointer (i.e. 4 bytes) to your kernel.

Thanks for the help! But this does not solve my problem.
I do pass a pointer to an array I allocated before (using cudaMalloc).

I now found out what the problem was:

I created R*R threads. As there is a limit of 512 threads per block, it is clear that only R = 16 (R*R = 256 < 512) works, while with R = 32 (R*R = 1024) the number of threads per block is exceeded.
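
For anyone who runs into the same thing, here is a minimal sketch of how the failure could be detected and avoided. It is not my actual transpose code: the kernel fillIndex, the helper blocksNeeded and the choice of 256 threads per block are assumptions made for illustration. The idea is to split the R*R threads over several blocks and to check cudaGetLastError() after the launch, since a launch with too many threads per block fails without running the kernel at all, which would explain why even the dummy kernel had no effect.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical dummy kernel: each thread writes its global index to o_data.
__global__ void fillIndex(int *o_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard threads past the end of the array
        o_data[i] = i;
}

// Number of blocks needed to cover `total` elements, rounded up.
int blocksNeeded(int total, int perBlock)
{
    return (total + perBlock - 1) / perBlock;
}

int main()
{
    const int R = 32;
    const int n = R * R;            // 1024 threads, the case that failed

    int *d_out = NULL;
    cudaError_t err = cudaMalloc((void **)&d_out, n * sizeof(int));
    if (err != cudaSuccess) {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // The per-block limit depends on the device, so query it.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("maxThreadsPerBlock = %d\n", prop.maxThreadsPerBlock);

    // Assumed block size of 256, which stays under the 512 limit.
    const int threadsPerBlock = 256;
    fillIndex<<<blocksNeeded(n, threadsPerBlock), threadsPerBlock>>>(d_out, n);

    // An invalid launch configuration is reported here, not by the launch itself.
    err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("kernel launch failed: %s\n", cudaGetErrorString(err));

    // Copy back the last element to confirm the kernel actually ran.
    int h_last = -1;
    cudaMemcpy(&h_last, d_out + (n - 1), sizeof(int), cudaMemcpyDeviceToHost);
    printf("last element = %d\n", h_last);

    cudaFree(d_out);
    return 0;
}
```

Launching the same kernel as fillIndex<<<1, n>>>(d_out, n) with n = 1024 on a device limited to 512 threads per block should make the cudaGetLastError() check report an error instead of silently leaving o_data untouched.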