In the same scenario, I keep everything else unchanged (for example, always launching B blocks, where B equals the number of multiprocessors) and only double the number of threads per block each time, from 32 to 1024.
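To make the setup concrete, here is a minimal sketch of the kind of sweep I mean. It is not my actual code: the kernel `work`, the iteration count `iters`, and the definition of throughput as "threads completed per second" are placeholders assumed for illustration, and error checking is left out for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real workload (an assumption, not the actual
// kernel): every thread performs the same fixed amount of arithmetic.
__global__ void work(float *out, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = static_cast<float>(i);
    for (int k = 0; k < iters; ++k)
        x = x * 1.0001f + 0.5f;
    out[i] = x;  // store the result so the loop is not optimized away
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    const int blocks = prop.multiProcessorCount;  // B = number of multiprocessors
    const int iters = 1 << 16;                    // assumed per-thread work

    // Size the buffer for the largest launch (B blocks of 1024 threads).
    float *out = nullptr;
    cudaMalloc(&out, sizeof(float) * blocks * 1024);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Double the threads per block each step; everything else stays fixed.
    for (int threads = 32; threads <= 1024; threads *= 2) {
        work<<<blocks, threads>>>(out, iters);  // warm-up launch, not timed

        cudaEventRecord(start);
        work<<<blocks, threads>>>(out, iters);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);

        // Throughput here means threads completed per second (an assumption).
        double throughput = static_cast<double>(blocks) * threads / (ms * 1e-3);
        printf("threads/block = %4d  time = %8.3f ms  throughput = %.3e threads/s\n",
               threads, ms, throughput);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(out);
    return 0;
}
```

This should build with something like `nvcc sweep.cu -o sweep` (the file name is arbitrary). The only quantity that changes between launches is the second launch parameter.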
At first, the throughput doubles with each step, up to 256 threads. Going from 256 to 512 threads, the throughput still increases, but it does not double; going from 512 to 1024 threads, it increases only slightly and is nearly flat.
So my question is: why does doubling the number of threads not guarantee that the throughput doubles? Is there some factor that determines this relationship? Are there other parameters that matter?