Question on number of Blocks possible

We have written a matrix addition program. We want to measure the time taken to add two matrices.

The way we do it is:
matrix A [30*16, 31*16]
matrix B [30*16, 31*16]
matrix C [30*16, 31*16]

Here C will contain the final result.
Total blocks we use is 30*30.
Threads/block = 16*16

So effectively each thread is reading A[id] and B[id], adding them, and writing C[id].
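A minimal sketch of what such a kernel might look like (the kernel name, parameter names, and index math here are assumptions for illustration, not the original code):

```cuda
// Hypothetical element-wise matrix addition: one thread per element.
__global__ void matAdd(const float *A, const float *B, float *C, int n)
{
    // Flatten the 2D grid of 16x16 blocks into one linear element index.
    int x  = blockIdx.x * blockDim.x + threadIdx.x;
    int y  = blockIdx.y * blockDim.y + threadIdx.y;
    int id = y * gridDim.x * blockDim.x + x;

    if (id < n)                  // guard threads that fall past the end
        C[id] = A[id] + B[id];   // one read of A, one of B, one write of C
}
```

With a 30*30 grid of 16*16 blocks this would be launched as something like `matAdd<<<dim3(30, 30), dim3(16, 16)>>>(dA, dB, dC, n);`.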

We notice that the results are good for 30*30 blocks, but when we increase the number of blocks, e.g. to 50*50, we get weird results.

By weird I mean: since the matrix size has increased significantly, the kernel time should also increase, as this is a memory-bound computation.
But we don't see this. It takes almost the same time.

Has anyone seen this behavior before?

50*50 = 2500 blocks. This is not a lot for CUDA, so you are probably seeing kernel-launch overhead dominate. 900 blocks over 16 multiprocessors = 56 blocks per MP. With 3 blocks running per MP, that is only about 19 sets of blocks to process one after another.
If you check the time difference between 300*300 and 500*500 blocks, you should see a bigger difference.

Thanks for your response. I don't understand when you say:

"With 3 blocks running per MP, that is only about 19 sets of blocks to process one after another."

How are you sure there are 3 blocks running per MP?

I am not sure there are 3 blocks running per MP; it is the maximum for 16x16-thread blocks. But given your algorithm, my guess would be that you will have 3 blocks per MP running at the same time. You can find out by filling your values into the occupancy calculator.