I know that one streaming multiprocessor (SM) has 8 streaming processors, and that one SM can have up to 768 threads resident at a time.
I have some questions:
Why is running 256 threads per block slower than running 64 threads per block? (256 threads * 3 blocks vs. 64 threads * 8 blocks)
In the SDK example “matrixMul”, when I change the block size from 16 to 8, why is block size 8 more efficient than 16?
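For reference, here is a rough sketch of the arithmetic behind the two configurations in the first question. It assumes the 768-threads-per-SM limit mentioned above and a limit of 8 resident blocks per SM (my assumption, which I believe matches compute capability 1.0/1.1 hardware); it ignores register and shared-memory limits, which can lower the real numbers. The names `residentBlocks` and `residentThreads` are only illustrative.

```cpp
// Rough per-SM occupancy arithmetic. This is a sketch, not a measurement.
// Assumed limits:
//   768 resident threads per SM (the figure quoted in the post)
//   8 resident blocks per SM    (assumption, believed true for compute capability 1.0/1.1)
// Register and shared-memory usage are ignored here.
constexpr int kMaxThreadsPerSM = 768;
constexpr int kMaxBlocksPerSM  = 8;

// Number of blocks that fit on one SM, limited by both the thread cap
// and the block cap. threadsPerBlock must be greater than zero.
constexpr int residentBlocks(int threadsPerBlock) {
    return (kMaxThreadsPerSM / threadsPerBlock) < kMaxBlocksPerSM
               ? (kMaxThreadsPerSM / threadsPerBlock)
               : kMaxBlocksPerSM;
}

// Total threads resident on one SM for a given block size.
constexpr int residentThreads(int threadsPerBlock) {
    return residentBlocks(threadsPerBlock) * threadsPerBlock;
}
```

Under these assumptions, `residentBlocks(256)` is 3 (768 resident threads) and `residentBlocks(64)` is 8 (only 512 resident threads, because the block cap is reached first). For matrixMul, a block size of 16 corresponds to 16 * 16 = 256 threads per block and a block size of 8 to 8 * 8 = 64, so the same two cases apply.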
Thank you for your reply. :) :)