x32 vs x64: kernel launch error, different variables in kernel call?

Hi there,

Until now I have compiled my CUDA code without any platform-specific flags, so on my x64 Ubuntu Linux system everything was built with 64-bit pointers.
However, I ran into a problem with a simplified memory stream copy, which does only the copy in one kernel and the write in another.
It serves no purpose other than to see what the platform can do.

Although the real copy kernel performs well from 64 to 32768 blocks with 64 to 512 threads per block, the simplified version produced kernel launch errors at the high block counts, from 16384 on.
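
For reference, this is roughly how such a launch can be checked to see whether the error comes from the launch itself or from the kernel while it runs. It is only a minimal sketch under assumptions: `copyKernel`, the element count, and the 256 threads per block are hypothetical stand-ins, not my real code. The runtime calls (`cudaGetLastError`, `cudaDeviceSynchronize`, `cudaGetErrorString`) are the standard ones.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the simplified copy kernel (not the original code).
// The grid-stride loop keeps it in bounds for any grid/block combination.
__global__ void copyKernel(float *dst, const float *src, size_t n)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        dst[i] = src[i];
}

int main()
{
    const size_t n = 1 << 24;   // assumed element count
    const int blocks = 16384;   // first block count that failed for me
    const int threads = 256;    // assumed; I tested 64 to 512

    float *src = NULL, *dst = NULL;
    if (cudaMalloc(&src, n * sizeof(float)) != cudaSuccess ||
        cudaMalloc(&dst, n * sizeof(float)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(src, 0, n * sizeof(float));

    // Differs between a -m32 and a -m64 build.
    printf("host pointer size: %u bytes\n", (unsigned)sizeof(void *));

    copyKernel<<<blocks, threads>>>(dst, src, n);

    // Errors in the launch itself (bad configuration, bad arguments).
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch error: %s\n", cudaGetErrorString(err));

    // Errors raised while the kernel runs (e.g. an out-of-bounds access).
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("execution error: %s\n", cudaGetErrorString(err));

    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```

Building the same file once with -m32 and once with -m64 and comparing the two error strings should show whether the failure is tied to the pointer width or to the block count.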

If I compile with the -m32 flag and the 32-bit libraries, these kernels work fine and produce reasonable results.
So my question is: how are the grid and block configurations (no shared memory here) passed to the kernel?
Can they affect the other calling parameters, or vice versa?
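
As far as I understand the current runtime API, the launch configuration and the kernel arguments are handed over separately: `cudaLaunchKernel` takes the grid and block dimensions as their own parameters and the kernel arguments as an array of pointers. Older toolkits may do this differently internally, so treat the following only as a sketch of that separation. `writeKernel` and the sizes are hypothetical, chosen for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical write-only kernel (not the original code).
__global__ void writeKernel(float *dst, float value, unsigned int n)
{
    unsigned int stride = gridDim.x * blockDim.x;
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        dst[i] = value;
}

int main()
{
    unsigned int n = 1u << 22;  // assumed element count
    float value = 1.0f;
    float *dst = NULL;
    if (cudaMalloc(&dst, n * sizeof(float)) != cudaSuccess)
        return 1;

    // Launch configuration: passed as separate parameters of the launch call.
    dim3 grid(16384), block(256);

    // Kernel arguments: one pointer per argument, in declaration order.
    // Only dst changes size between a 32-bit and a 64-bit build.
    void *args[] = { &dst, &value, &n };

    // Same effect as writeKernel<<<grid, block>>>(dst, value, n);
    cudaError_t err = cudaLaunchKernel((const void *)writeKernel,
                                       grid, block, args, 0, 0);
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();

    printf("%s (host pointer size %u bytes)\n",
           cudaGetErrorString(err), (unsigned)sizeof(void *));

    cudaFree(dst);
    return err == cudaSuccess ? 0 : 1;
}
```

If the configuration really is kept apart from the arguments like this, a wider pointer argument should not be able to overwrite the grid or block dimensions, which is exactly what I would like to have confirmed.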

My guess is that the 64-bit variables overwrote part of my stack and thus the launch configuration.

Perhaps someone from NVIDIA can comment?