We have written a matrix addition program and want to measure the time taken to add two matrices.

The way we do it is:

matrix A [128*16,256*128]

matrix B [128*16,256*128]

matrix C [128*16,256*128]

here C will contain the final result…

Total Blocks we use is 128*16.

Threads/Block= 128

So effectively each thread in a block reads A[id] and B[id], adds them, and writes C[id], in a loop 256 times.
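The kernel looks roughly like this (a sketch; the names, the macro values, and the exact indexing are illustrative, not our actual code):

```cuda
// Sketch of the addition kernel described above. Each block owns a
// contiguous chunk of THREADS_PER_BLOCK * ITERS elements; each thread
// strides through that chunk ITERS times.
#define THREADS_PER_BLOCK 128
#define ITERS 256   // per-thread loop count

__global__ void matAdd(const float *A, const float *B, float *C)
{
    // starting offset of this block's chunk
    int base = blockIdx.x * THREADS_PER_BLOCK * ITERS;
    for (int i = 0; i < ITERS; ++i) {
        // consecutive threads touch consecutive addresses -> coalesced
        int id = base + i * THREADS_PER_BLOCK + threadIdx.x;
        C[id] = A[id] + B[id];
    }
}

// launch: grid of 128*16 blocks, 128 threads per block
// matAdd<<<128 * 16, THREADS_PER_BLOCK>>>(dA, dB, dC);
```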

**Our Observations**

–> As we increase/decrease the number of thread blocks while keeping the total number of elements constant, i.e.

matrix A [256*16,128*128]

matrix B [256*16,128*128]

matrix C [256*16,128*128]

here C will contain the final result…

Total Blocks we use is 256*16.

Threads/Block= 128

the time taken by the kernel changes. The change is dramatic: almost 1.5–1.7 times.
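For reference, we time each launch with CUDA events, roughly like this (a sketch assuming the `matAdd` kernel sketched earlier; error checking omitted):

```cuda
// Sketch: timing one kernel launch with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
// configuration 1: 128*16 blocks, each thread looping 256 times
matAdd<<<128 * 16, 128>>>(dA, dB, dC);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);

// Repeat with 256*16 blocks (and the per-thread loop reduced to 128)
// to compare the two configurations on the same total element count.
cudaEventDestroy(start);
cudaEventDestroy(stop);
```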

A naive explanation could be that the system works better with a larger number of thread blocks. But is this mentioned anywhere?

Does anyone find/notice something like this?