Different number of blocks results in varying running times

My kernel can fit multiple blocks on one SM at the same time, so why does the running time increase as I run more blocks per SM? By the way, the achieved occupancy reported by ncu is close to the theoretical value. Aren't the blocks supposed to run simultaneously?

An SM has finite throughput: it can perform at most a certain, fixed, number of operations per unit time. If you increase the amount of work pushed to a finite resource, the work will take a longer time to complete. Achieving near maximum occupancy suggests that the work is handled at a rate that is close to the finite throughput limit of the SM.
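One way to see this on your own GPU is a small timing experiment. The sketch below is only an illustration, not your kernel: `busyKernel`, the block size of 256, and the iteration count are made-up placeholders. It launches 1, 2, 3, ... blocks per SM (up to the limit reported by `cudaOccupancyMaxActiveBlocksPerMultiprocessor`) and times each launch with CUDA events. It assumes the block scheduler spreads a grid of `k * numSMs` blocks roughly evenly across the SMs, which is typical but not guaranteed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical compute-bound kernel: each thread runs a dependent FMA chain.
__global__ void busyKernel(float *out, int iters)
{
    float x = threadIdx.x * 1e-3f;
    for (int i = 0; i < iters; ++i)
        x = fmaf(x, 1.000001f, 0.5f);
    // Store the result so the compiler cannot remove the loop.
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

int main()
{
    const int blockSize = 256;   // assumed block size, adjust to match your kernel
    const int iters = 1 << 16;   // assumed amount of work per thread

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("no CUDA device found\n");
        return 1;
    }
    const int numSMs = prop.multiProcessorCount;

    // How many blocks of this kernel can be resident on one SM at once.
    int maxBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, busyKernel,
                                                  blockSize, 0);
    if (maxBlocksPerSM < 1) {
        printf("kernel cannot be launched with this block size\n");
        return 1;
    }

    float *out = nullptr;
    cudaMalloc(&out, sizeof(float) * blockSize * numSMs * maxBlocksPerSM);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time startup costs are not included in the timings.
    busyKernel<<<numSMs, blockSize>>>(out, iters);
    cudaDeviceSynchronize();

    for (int k = 1; k <= maxBlocksPerSM; ++k) {
        const int grid = k * numSMs;   // aims for about k resident blocks per SM
        cudaEventRecord(start);
        busyKernel<<<grid, blockSize>>>(out, iters);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%d block(s) per SM, grid = %d: %.3f ms\n", k, grid, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(out);
    return 0;
}
```

If the time stays roughly flat for the first few values of `k`, a single block was not yet saturating the SM and the extra blocks were using spare throughput. Once the time starts to grow with `k`, the SM is at its limit and the additional blocks are simply more work for the same finite resource, even though they are all resident at the same time.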


Also, the running time could grow by an even larger factor than the number of concurrent blocks per SM, because the caches can become less efficient when more blocks are processed concurrently.