My kernel is launched with ~1024 threads per block and ~1000 blocks.
However, a section of the code is only processed by the first 5 blocks, i.e. I have a conditional like:
if (blockIdx.x < 5) { do x; } else { do nothing; }
My question is how “idle” will the streaming multiprocessors be?
I really don’t want to launch another kernel with ~1024 threads per block and only 5 blocks, because I want to reduce the number of times I load from global into shared memory (shared memory isn’t persistent between kernel calls).
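To make the setup concrete, here is a minimal sketch of the pattern I mean. It is not my real kernel: the names `process`, `run_sketch`, `kThreads`, `kBlocks` and `kSpecialBlocks` are made up for illustration, and the multiply and add are placeholders for the common work and for "do x". Every block stages its tile in shared memory once, and only the first 5 blocks run the extra section.

```cuda
#include <cstdio>
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Illustrative sizes taken from the question; all names here are hypothetical.
constexpr unsigned int kThreads = 1024;      // threads per block
constexpr unsigned int kBlocks = 1000;       // blocks in the grid
constexpr unsigned int kSpecialBlocks = 5;   // blocks that run the extra section

// Every block loads its tile from global into shared memory once,
// then only the first kSpecialBlocks blocks run the extra section.
__global__ void process(const float* in, float* out)
{
    __shared__ float tile[kThreads];
    const unsigned int gid = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[gid];
    __syncthreads();

    float v = tile[threadIdx.x] * 2.0f;  // placeholder for the common work

    // blockIdx.x has the same value for every thread of a block,
    // so all threads of a given block take the same branch here.
    if (blockIdx.x < kSpecialBlocks) {
        v += 1.0f;                       // placeholder for "do x"
    }
    out[gid] = v;
}

// Launches the kernel and returns the number of wrong outputs (-1 on a CUDA error).
int run_sketch()
{
    const size_t n = static_cast<size_t>(kBlocks) * kThreads;
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_in(n, 1.0f), h_out(n, 0.0f);

    float* d_in = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1; }

    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    process<<<kBlocks, kThreads>>>(d_in, d_out);
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess) {
        err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (size_t i = 0; i < n; ++i) {
        // Elements owned by the first kSpecialBlocks blocks get the extra +1.
        const float expected = (i / kThreads < kSpecialBlocks) ? 3.0f : 2.0f;
        if (h_out[i] != expected) ++mismatches;
    }
    return mismatches;
}

int main()
{
    const int bad = run_sketch();
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

In this sketch the blocks with `blockIdx.x >= 5` skip the extra section and go straight to the store, which is the situation I am asking about.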
Yutong