Optimum Number of blocks

We have written a matrix addition program, and we want to measure the time taken to add two matrices.

The way we do it is:

matrix A [128*16, 256*128]
matrix B [128*16, 256*128]
matrix C [128*16, 256*128]

Here C will contain the final result.
Total blocks we use is 128*16.
Threads/block = 128

So effectively each thread is reading A[id] and B[id], adding them, and writing C[id], in a loop 256 times.
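For concreteness, here is a minimal sketch of what such a kernel might look like. The kernel name, the grid-stride indexing pattern, and the pointer names are assumptions for illustration, not the original poster's code:

```cuda
// Sketch of the addition kernel described above (names are illustrative).
// Each thread handles `perThread` elements of A, B and C.
__global__ void addKernel(const float *A, const float *B, float *C, int perThread)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    int stride = gridDim.x * blockDim.x;                // total threads in grid

    // perThread = 256 in the first configuration
    for (int i = 0; i < perThread; ++i) {
        int id = tid + i * stride;
        C[id] = A[id] + B[id];
    }
}

// First configuration: 128*16 blocks, 128 threads/block, 256 elements/thread
// addKernel<<<128 * 16, 128>>>(dA, dB, dC, 256);
```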

Our observations:

-> As we increase/decrease the number of thread blocks while keeping the total number of elements constant, i.e.

matrix A [256*16, 128*128]
matrix B [256*16, 128*128]
matrix C [256*16, 128*128]

Here C will contain the final result.
Total blocks we use is 256*16.
Threads/block = 128

the time taken by the kernel changes. The change is dramatic, in the sense that it is almost 1.5-1.7x.
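For what it's worth, a common way to time just the kernel (so the measured difference is not an artifact of host-side timing) is CUDA events. A sketch, where `addKernel` and the device pointers `dA`/`dB`/`dC` are placeholders for the poster's actual code:

```cuda
// Timing a kernel launch with CUDA events (sketch).
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
addKernel<<<128 * 16, 128>>>(dA, dB, dC, 256);  // first configuration
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);      // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
```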

One naive explanation could be that the system works better with a larger number of thread blocks. But is this documented anywhere?

Does anyone else find/notice something like this?

You get better performance when you avoid pipeline stalls. More blocks can do that in some instances.

Can you please elaborate on your point? It would be nice if you could share something you have observed.

See chapter 5.2 (page 62) of CUDA Programming Guide Version 1.1.