But it is still not very clear to me. I found this in the programming guide: "each block is split into SIMD groups of threads called warps; each of these warps contains the same number of threads, called the warp size, and is executed by the multiprocessor in a SIMD fashion; a thread scheduler periodically switches from one warp to another to maximize the use of the multiprocessor's computational resources." On G80 hardware there are 8 ALUs per multiprocessor. So how many processors are there per block, or per SIMD group?
My understanding is that the threads of the blocks being processed by one multiprocessor run in parallel, and the multiprocessors are scheduled. So if I want to increase performance, I need to maximize the number of blocks that can be processed by one multiprocessor (up to 8). Is this correct?
If that is correct, 3 blocks of 256 threads each should perform better than 1 block of 512 threads, right? I have two versions of my application: one uses shared memory, the other doesn't. For the latter, I changed the block size from 512 threads to 256, and the 256-thread version performed worse than the 512-thread one. In contrast, the version that uses shared memory performed better with 256 threads per block than with 512.
Can anyone help? Thank you!!