I have a kernel in which each thread writes to a different element of an array in shared memory. The problem is that these writes are extremely slow. Since every thread writes to a different element of the array, I don’t think bank conflicts are the cause.
Does anyone know what could be the problem here? My GPU has compute capability 1.3. I’m launching a 65536 x 65536 grid of blocks with 500 threads per block. The array size is well below the per-block shared memory limit.
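To make the access pattern concrete, here is a minimal sketch of the kind of kernel I mean. It is not my actual code: the names `writeShared`, `buf` and `out`, the `float` element type, the indexing by `threadIdx.x`, and the small 4 x 4 grid are all placeholders chosen for illustration. Only the 500 threads per block and the "each thread writes its own element" pattern come from the description above.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// 500 threads per block, as in the launch described above.
#define THREADS_PER_BLOCK 500

// Hypothetical kernel: each thread writes one distinct element of a shared
// array, then copies it to global memory so the write is not optimized away.
__global__ void writeShared(float *out)
{
    __shared__ float buf[THREADS_PER_BLOCK];

    int tid = threadIdx.x;
    buf[tid] = (float)tid;  // assumed pattern: thread i writes element i

    // Linear block index for a 2D grid.
    int block = blockIdx.y * gridDim.x + blockIdx.x;
    out[block * THREADS_PER_BLOCK + tid] = buf[tid];
}

int main()
{
    const dim3 grid(4, 4);  // small illustrative grid, not the real launch size
    const int n = grid.x * grid.y * THREADS_PER_BLOCK;
    const size_t bytes = n * sizeof(float);

    float *d_out = NULL;
    if (cudaMalloc((void **)&d_out, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    writeShared<<<grid, THREADS_PER_BLOCK>>>(d_out);

    // Check both the launch itself and the kernel execution for errors.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_out);
        return 1;
    }

    float *h_out = (float *)malloc(bytes);
    if (h_out == NULL) {
        cudaFree(d_out);
        return 1;
    }
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    // Every element should hold the index of the thread that wrote it.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (h_out[i] != (float)(i % THREADS_PER_BLOCK)) ++mismatches;
    }
    printf("mismatches: %d\n", mismatches);

    free(h_out);
    cudaFree(d_out);
    return 0;
}
```

The sketch checks the launch with `cudaGetLastError` and `cudaDeviceSynchronize` so that a rejected launch configuration would show up as an error message rather than as a kernel that silently never ran.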
I would really appreciate your suggestions.