I have 64 threads per block; more threads per block does not make sense for my problem. I use __syncthreads heavily. My kernel uses about 16 registers per thread. Each block can use either 2000 bytes of shared memory or the full 16 KB (which gives faster performance).
Can somebody tell me the maximum number of blocks I can launch? Is it limited by the total number of registers? That would mean 8192 / 16 = 512?