Limited amount of global memory for a kernel?

I allocate an array i_data of size R*R*R and then pass it as a parameter to a kernel. In the kernel, I do some transposing similar to the transpose sample project of the CUDA SDK and write the result to an array o_data of the same size.

Unfortunately, this works only for R = 16. When I choose R = 32, the array o_data contains uninitialized data. Even if I use a dummy kernel which sets o_data[0] to some value, this assignment has no effect.

I know that there is a limit of 16 KB of shared memory per multiprocessor. Is there also a limit on how much global memory a kernel can access?

The actual kernel parameters are passed through shared memory, which is only 16 KB (i.e. less than 32x32x32xsizeof(z)). Things may work if you instead pass a pointer (i.e. 4 bytes) to your kernel.

Thanks for the help! But this does not solve my problem.
I do pass a pointer to an array I allocated before (using cudaMalloc).

I now found out what the problem was:

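I created R*R threads. As there is a limit of 512 threads per block, it is clear that only R = 16 (R*R = 256 < 512) works, while with R = 32 (R*R = 1024) the limit is exceeded. In that case the launch fails and the kernel never runs, which is why even the dummy assignment had no effect.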
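
For anyone hitting the same symptom, here is a minimal sketch of how the failure could be detected and avoided. It is not the original transpose code: copy_kernel, ceil_div and the block size of 256 are hypothetical names and values chosen for illustration. The idea is to keep the block size under the device limit, spread the R*R*R elements over several blocks, and check the launch result, since an invalid configuration is only reported through the error status.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: each thread copies one element.
__global__ void copy_kernel(const float* i_data, float* o_data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {               // guard threads past the end of the array
        o_data[idx] = i_data[idx];
    }
}

// Number of blocks of size `block` needed to cover n elements (rounds up).
static int ceil_div(int n, int block) { return (n + block - 1) / block; }

int main()
{
    const int R = 32;
    const int n = R * R * R;     // 32768 elements

    // Query the actual per-block thread limit instead of assuming 512.
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("maxThreadsPerBlock = %d\n", prop.maxThreadsPerBlock);

    float* i_data = NULL;
    float* o_data = NULL;
    cudaMalloc((void**)&i_data, n * sizeof(float));
    cudaMalloc((void**)&o_data, n * sizeof(float));
    cudaMemset(i_data, 0, n * sizeof(float));

    // Assumed block size; 256 is below the 512-thread limit mentioned above.
    const int threads = 256;
    const int blocks  = ceil_div(n, threads);   // 128 blocks for R = 32
    copy_kernel<<<blocks, threads>>>(i_data, o_data, n);

    // A bad launch configuration shows up here, not as a crash.
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("kernel launch failed: %s\n", cudaGetErrorString(err));
    }
    // Errors raised while the kernel runs are reported on synchronization.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("kernel execution failed: %s\n", cudaGetErrorString(err));
    }

    cudaFree(i_data);
    cudaFree(o_data);
    return 0;
}
```

Calling cudaGetLastError() right after the launch would have reported the problem in the R = 32 case directly, instead of leaving o_data silently untouched.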