I call my kernel with 2 blocks of 64 threads each. Each of those 64 threads needs read access to the data of the other 63 threads in its block, so each block is independent of the others.
I thought the right way is to copy the data from global memory to shared memory, do the computation, and copy the result back?
All this because shared memory is the fastest, isn’t it?
As long as I use 1 block, everything works well, but when I use 2 or more, I get more or less random data.
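For reference, here is a minimal sketch of the load/compute/store pattern I am describing (kernel name and the summing computation are just placeholders for my real code):

```cuda
// Hypothetical kernel: each block copies its 64-element slice of
// global memory into shared memory, computes with it, writes back.
__global__ void process(float *data)
{
    __shared__ float tile[64];

    // Global index must include the block offset; using only
    // threadIdx.x would make every block read block 0's slice.
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = data[gid];
    __syncthreads();        // wait until the whole tile is loaded

    // Placeholder computation: sum the other 63 threads' values.
    float sum = 0.0f;
    for (int i = 0; i < blockDim.x; ++i)
        if (i != threadIdx.x)
            sum += tile[i];

    data[gid] = sum;
}
```

Two things this pattern needs to get right with more than one block are the `__syncthreads()` barrier after the load and the `blockIdx.x * blockDim.x` offset in the global index; missing either one produces exactly the kind of garbage-with-multiple-blocks symptom described above.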
In the docs I read that the third launch-configuration argument (“mem_size”) is the size of shared memory. But in some samples the third argument is omitted even though shared memory is used in the kernel.
The third argument allocates dynamic shared memory, which is bound to the shared variables declared with the extern specifier. If you put the shared array size directly into the code, you don’t need this argument. Example: this does not need the argument:
__shared__ int A[100];
while this does:
extern __shared__ int A[];
The advantage of the latter is that you can set the size at runtime. Read the manual for how to compute the correct addresses in the kernel when several arrays share that allocation.
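A minimal sketch of the dynamic case, assuming one int array sized at launch time (kernel name and contents are illustrative only):

```cuda
#include <cuda_runtime.h>

// No size in the declaration; the third launch argument supplies
// the number of bytes per block at runtime.
extern __shared__ int A[];

__global__ void fill(int n)
{
    int tid = threadIdx.x;
    if (tid < n)
        A[tid] = tid;   // each thread writes one element
}

int main()
{
    int n = 64;
    // <<<blocks, threads, dynamic shared memory in BYTES>>>
    fill<<<2, 64, n * sizeof(int)>>>(n);
    cudaDeviceSynchronize();
    return 0;
}
```

Note the argument is in bytes, not elements, so you multiply by `sizeof`. If you need more than one dynamically sized array, they all alias the single extern allocation and you carve it up with manual offsets, as described in the programming guide.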