We have written a matrix addition program, and we want to measure the time taken to add 2 matrices.
The way we do it is:
matrix A [128*16, 256*128]
matrix B [128*16, 256*128]
matrix C [128*16, 256*128]
here C will contain the final result…
Total Blocks we use is 128*16.
Threads/Block= 128
So effectively each thread is reading A[id] and B[id], adding them, and writing C[id],
in a loop 256 times.
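For reference, here is a simplified sketch of the kind of kernel described above; the grid-stride indexing and the name matAdd are assumptions for illustration, not necessarily our exact code:

```cuda
// Simplified sketch: 128*16 blocks of 128 threads, each thread adds
// elemsPerThread (here 256) elements of the flattened matrices.
// The stride pattern below is an assumption about the indexing.
__global__ void matAdd(const float *A, const float *B, float *C,
                       int elemsPerThread)
{
    int tid      = blockIdx.x * blockDim.x + threadIdx.x; // global thread id
    int nThreads = gridDim.x * blockDim.x;                // total threads launched

    for (int i = 0; i < elemsPerThread; ++i) {
        int id = tid + i * nThreads;   // each thread touches elemsPerThread elements
        C[id] = A[id] + B[id];
    }
}

// Launch for the first configuration:
// matAdd<<<128 * 16, 128>>>(dA, dB, dC, 256);
```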
Our Observations:
→ As we increase/decrease the number of thread blocks, keeping the number of
elements constant. I mean:
matrix A [256*16, 128*128]
matrix B [256*16, 128*128]
matrix C [256*16, 128*128]
here C will contain the final result…
Total Blocks we use is 256*16.
Threads/Block= 128
(so each thread now loops only 128 times), the time taken by the kernel changes. The change is dramatic, almost a factor of 1.5-1.7.
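For concreteness, this is roughly how the two configurations can be timed with CUDA events; dA, dB, dC are assumed to be already allocated and initialised on the device, and this is an illustrative sketch rather than our exact timing code:

```cuda
// Illustrative timing of the two launch configurations with CUDA events.
// dA, dB, dC are device pointers assumed to be allocated/initialised already.
const int N = 128 * 16 * 256 * 128;   // 67,108,864 elements, constant in both runs

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Config 1: 128*16 blocks of 128 threads -> each thread loops 256 times.
cudaEventRecord(start);
matAdd<<<128 * 16, 128>>>(dA, dB, dC, N / (128 * 16 * 128));
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms1;
cudaEventElapsedTime(&ms1, start, stop);

// Config 2: 256*16 blocks of 128 threads -> each thread loops only 128 times.
cudaEventRecord(start);
matAdd<<<256 * 16, 128>>>(dA, dB, dC, N / (256 * 16 * 128));
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms2;
cudaEventElapsedTime(&ms2, start, stop);

printf("config 1: %.3f ms, config 2: %.3f ms\n", ms1, ms2);
```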
One naive explanation could be that the system simply works better with a larger number of thread blocks. But is this mentioned anywhere?
Does anyone else find/notice something like this?