I am writing a kernel that loads data from global memory into shared memory and then performs the computation. However, when I shrink the work-group size from 256 to 128 or 64 (half or a quarter) and have each thread load its data in a loop, the performance differs a lot. I profiled the kernel with the Visual Profiler and found that the number of global requests differs to some extent, but not in proportion to the work-group size. Any ideas to explain this?
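To make the loading scheme concrete, here is a minimal host-side sketch in Python of what I mean by "load the data in a loop". It is only a model of the indexing, not the actual kernel: the function name `load_tile`, the tile size of 256 elements, and the strided access pattern are assumptions made for illustration.

```python
def load_tile(global_mem, tile_start, tile_size, group_size):
    """Emulate one work group copying a tile from global to shared memory.

    Hypothetical model: work items run one after another here, whereas on
    the GPU they would run in parallel.
    """
    shared = [None] * tile_size
    loads_per_thread = [0] * group_size
    for lid in range(group_size):  # lid = local ID of the work item
        # Strided loop: in each pass, neighbouring work items read
        # neighbouring global addresses.
        for i in range(lid, tile_size, group_size):
            shared[i] = global_mem[tile_start + i]
            loads_per_thread[lid] += 1
    return shared, loads_per_thread


if __name__ == "__main__":
    data = list(range(1024))
    for group_size in (256, 128, 64):
        shared, loads = load_tile(data, 256, 256, group_size)
        print(group_size, max(loads), shared == data[256:512])
```

In this model the tile is copied completely for every group size, and the number of loads per work item grows from 1 to 2 to 4 as the group shrinks from 256 to 128 to 64, so the total number of element loads per tile stays the same.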
Thanks in advance!
Just a note: these slides are very interesting and may help with the problem: http://nvidia.fullviewmedia.com/gtc2010/0922-a5-2238.html