I have a program that launches a large number of thread blocks. Each thread block's x and y IDs are used in the kernel to "parallelize a loop" and compute something.
For example, if a serial program has for (x = 0; x < 128; x++), then to parallelize it I set gridDim.x = 128, and in the kernel each block runs for (x = blockIdx.x * (128 / gridDim.x); x < (blockIdx.x + 1) * (128 / gridDim.x); x++), i.e. each block runs the loop once. If gridDim.x = 64, each block runs the loop twice. I do the same with gridDim.y.
(In the same way, there are many threads per block in two dimensions, and each thread's x and y IDs are used to parallelize further loops inside the block.) Sometimes a result is written to global memory, sometimes it is not.
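A minimal sketch of the indexing scheme described above, assuming N = 128, a 16x16 block, and a trivial placeholder computation; the kernel name chunked2D and the output array are made up for illustration, not the actual program:

```cuda
#include <cuda_runtime.h>

// Each block covers a contiguous chunk of N/gridDim.x iterations in x and
// N/gridDim.y iterations in y: with gridDim.x == N each block does one x
// iteration, with gridDim.x == N/2 it does two, and so on.
__global__ void chunked2D(float *out, int N)
{
    int chunkX = N / gridDim.x;   // x iterations handled by this block
    int chunkY = N / gridDim.y;   // y iterations handled by this block

    for (int x = blockIdx.x * chunkX; x < (blockIdx.x + 1) * chunkX; ++x) {
        for (int y = blockIdx.y * chunkY; y < (blockIdx.y + 1) * chunkY; ++y) {
            // further loops inside the block would be split across
            // threadIdx.x / threadIdx.y in the same chunked fashion
            if (threadIdx.x == 0 && threadIdx.y == 0)
                out[y * N + x] = (float)(x * y);   // placeholder computation
        }
    }
}

int main()
{
    const int N = 128;
    float *d_out;
    cudaMalloc(&d_out, N * N * sizeof(float));

    dim3 block(16, 16);
    dim3 grid(128, 128);          // compare against dim3 grid(32, 32)
    chunked2D<<<grid, block>>>(d_out, N);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    return 0;
}
```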
Now, while collecting performance statistics, I noticed that the program's total execution time was higher with a grid size of 128 (following the example above) but lowest with a grid size of 32. This is strange, because with a grid size of 32 each block is actually doing more work. What is the reason for this?
Reducing the block size gave the expected results, but reducing the grid size gave counterintuitive ones.