I have a quick question and would really appreciate your answers.
From the NVIDIA Programming Guide, I cannot figure out which combination of number of blocks and number of warps per block gives the best performance per multiprocessor or per processor.
For example, the scanLargeArray code uses 256 threads per block. Why 256 rather than some other number? What is the relationship between the number of threads per block, the number of multiprocessors, and the number of processors per multiprocessor?
Suppose I want to compute a prefix sum over 10 million integers. What configuration would you expect to give the best performance?
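To make the numbers concrete, here is a rough sketch of the grid-size arithmetic I have in mind. It assumes one element per thread, which may not match how scanLargeArray actually maps elements to threads, and `blocksNeeded` is just a name I made up for illustration, not something from the SDK.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the SDK sample): number of blocks needed
// to cover n elements, ASSUMING each thread handles exactly one element.
std::size_t blocksNeeded(std::size_t n, std::size_t threadsPerBlock) {
    assert(threadsPerBlock > 0);
    // Ceiling division, so a partially filled last block is still counted.
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}
```

Under that assumption, 10,000,000 elements at 256 threads per block come to 39,063 blocks, and at 512 threads per block to 19,532 blocks. What I don't know is which of these choices the hardware handles better.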
Thank you very much! You are my god! Anyone is very welcome to answer the questions above. Thank you all!