Influence of blocks and threads

This program computes the histogram of a picture (1024*768 pixels). I run it on a GTS 320 MB.

I tried changing the thread and block counts to observe the influence of the number of blocks. I got these results:

I thought the best time would be with 96 blocks (1 block per processor), but it seems that is not the case. Can somebody explain whether I am right or not? If not, how do blocks and threads work, and which values are the most effective in this case?


There are many variables that determine how execution time depends on block size. Just to name a few: occupancy, read-after-write dependencies, memory access patterns, … It is impossible in any real case to understand completely how they all intertwine to give the best performance, at least without a full-fledged device simulator to run on. Of course, you have the best device simulator sitting in your computer: the device itself.

So what you are doing now is exactly the best method to find the fastest block size.
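
For reference, here is a minimal sketch of that kind of sweep, timed with CUDA events. Your kernel isn't shown in the thread, so `histogramKernel`, `timeLaunchMs`, the synthetic image, and the lists of block and thread counts are all made up for illustration. It also assumes a device that supports global-memory `atomicAdd` (compute capability 1.1 or later), which may not be true for the card in the question.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the poster's kernel. The grid-stride loop lets
// any (blocks, threads) pair cover the whole image.
__global__ void histogramKernel(const unsigned char* pixels, int n, unsigned int* bins)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&bins[pixels[i]], 1u);  // atomic increment keeps the counts correct
}

// Average time in milliseconds of one launch, or -1 if a CUDA error occurred.
float timeLaunchMs(const unsigned char* dPixels, int n, unsigned int* dBins,
                   int blocks, int threads, int repeats)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time startup costs are not measured.
    histogramKernel<<<blocks, threads>>>(dPixels, n, dBins);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int r = 0; r < repeats; ++r)
        histogramKernel<<<blocks, threads>>>(dPixels, n, dBins);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (cudaGetLastError() == cudaSuccess) ? ms / repeats : -1.0f;
}

int main()
{
    const int n = 1024 * 768;
    std::vector<unsigned char> hPixels(n);
    for (int i = 0; i < n; ++i)
        hPixels[i] = (unsigned char)(i % 256);  // synthetic image data

    unsigned char* dPixels = 0;
    unsigned int* dBins = 0;
    cudaMalloc((void**)&dPixels, n);
    cudaMalloc((void**)&dBins, 256 * sizeof(unsigned int));
    cudaMemcpy(dPixels, &hPixels[0], n, cudaMemcpyHostToDevice);
    cudaMemset(dBins, 0, 256 * sizeof(unsigned int));

    // Assumed candidate values; adjust them to the device being tested.
    const int blockCounts[]  = {12, 24, 48, 96, 192, 384};
    const int threadCounts[] = {32, 64, 128, 256};

    for (int b = 0; b < 6; ++b)
        for (int t = 0; t < 4; ++t) {
            float ms = timeLaunchMs(dPixels, n, dBins, blockCounts[b], threadCounts[t], 20);
            printf("blocks=%4d threads=%4d  %.4f ms\n", blockCounts[b], threadCounts[t], ms);
        }

    cudaFree(dPixels);
    cudaFree(dBins);
    return 0;
}
```

The bins are not reset between launches because only the timing matters here. Averaging over several launches after a warm-up run reduces noise from a single measurement.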

P.S. You do know that your kernel will not calculate the correct histogram due to race conditions and global memory latency, right?
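
To illustrate the race-condition part, here is a small sketch that compares a plain increment with an atomic one. Both kernels (`histogramRacy`, `histogramAtomic`), the helper `totalCount`, and the synthetic image are hypothetical and not taken from your code. The atomic version assumes hardware with global-memory `atomicAdd` (compute capability 1.1 or later); on older devices a different approach, such as per-thread or per-block partial histograms, would be needed.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Racy version: "+= 1" is a separate read, add, and write, so two threads
// hitting the same bin at the same time can lose increments.
__global__ void histogramRacy(const unsigned char* pixels, int n, unsigned int* bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        bins[pixels[i]] += 1;
}

// Atomic version: the hardware serializes updates to the same bin.
__global__ void histogramAtomic(const unsigned char* pixels, int n, unsigned int* bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[pixels[i]], 1u);
}

// Copies the 256 bins back to the host and returns the sum of all counts.
unsigned long long totalCount(const unsigned int* dBins)
{
    unsigned int hBins[256];
    cudaMemcpy(hBins, dBins, sizeof(hBins), cudaMemcpyDeviceToHost);
    unsigned long long total = 0;
    for (int b = 0; b < 256; ++b)
        total += hBins[b];
    return total;
}

int main()
{
    const int n = 1024 * 768;
    std::vector<unsigned char> hPixels(n);
    for (int i = 0; i < n; ++i)
        hPixels[i] = (unsigned char)((i / 64) % 256);  // synthetic data with many repeated values

    unsigned char* dPixels = 0;
    unsigned int* dBins = 0;
    cudaMalloc((void**)&dPixels, n);
    cudaMalloc((void**)&dBins, 256 * sizeof(unsigned int));
    cudaMemcpy(dPixels, &hPixels[0], n, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;  // one thread per pixel

    cudaMemset(dBins, 0, 256 * sizeof(unsigned int));
    histogramRacy<<<blocks, threads>>>(dPixels, n, dBins);
    cudaDeviceSynchronize();
    printf("racy   total: %llu of %d pixels\n", totalCount(dBins), n);

    cudaMemset(dBins, 0, 256 * sizeof(unsigned int));
    histogramAtomic<<<blocks, threads>>>(dPixels, n, dBins);
    cudaDeviceSynchronize();
    printf("atomic total: %llu of %d pixels\n", totalCount(dBins), n);

    cudaFree(dPixels);
    cudaFree(dBins);
    return 0;
}
```

A correct histogram must count every pixel exactly once, so the bin totals should add up to the pixel count. The atomic version satisfies that; the racy version usually comes up short, and by a different amount from run to run.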