it will be very slow with a blocksize < 16, since the memory accesses can't be coalesced,
will get faster as the blocksize grows, up to the point where the maximum number of threads per multiprocessor (e.g. 768 on a 9800GT) is a multiple of the blocksize,
and may get slower again when a blocksize that is too large leads to a lower occupancy of the multiprocessor.
on a 9800GT this means:
each MP can hold at most 8 blocks at once (it also has 8 “cuda cores”), and can hold at most 768 threads at once.
with blocksize 8, 8 * 8 = 64 threads are executed at once, which leads to an occupancy of 8.3% -> BAD!
with blocksize 64, 8 * 64 = 512 threads are executed at once, which leads to an occupancy of 66.7% -> actually ok!
with blocksize 96, 8 * 96 = 768 threads are executed at once, which leads to an occupancy of 100% -> perfect
with blocksize 256, 3 * 256 = 768 threads are executed at once, which leads to an occupancy of 100% -> still perfect
with blocksize 512, 1 * 512 = 512 threads are executed at once, which leads to an occupancy of 66.7% -> actually ok, but slower than 256!
note: the number of cuda cores per MP and the max. number of threads per MP depend on the chip architecture.
the max. threads per MP can be read from the cudaDeviceProp structure (field maxThreadsPerMultiProcessor); the cuda cores per MP count depends on the compute capability.
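
the occupancy numbers above can be reproduced with a few lines of host code. this is only a rough sketch: the function names resident_threads and occupancy_percent are made up for illustration, and it only looks at the two limits discussed here (max. blocks per MP and max. threads per MP). registers and shared memory, which can also lower occupancy, are assumed not to be the bottleneck.

```cpp
#include <algorithm>

// Hypothetical helper (not part of any CUDA API): number of threads that are
// resident on one multiprocessor (MP) for a given blocksize.
// Assumption: only the per-MP block limit and thread limit matter; register
// and shared memory usage are ignored.
int resident_threads(int block_size, int max_threads_per_mp, int max_blocks_per_mp) {
    // Only whole blocks can be placed on an MP.
    int blocks_by_threads = max_threads_per_mp / block_size;
    // The MP cannot hold more blocks than its block limit.
    int blocks = std::min(blocks_by_threads, max_blocks_per_mp);
    return blocks * block_size;
}

// Occupancy as a percentage of the MP's thread limit.
double occupancy_percent(int block_size, int max_threads_per_mp, int max_blocks_per_mp) {
    return 100.0 * resident_threads(block_size, max_threads_per_mp, max_blocks_per_mp)
                 / max_threads_per_mp;
}

// Example with the 9800GT values from the text (768 threads, 8 blocks per MP):
//   resident_threads(256, 768, 8)  -> 3 blocks of 256 threads
//   occupancy_percent(512, 768, 8) -> only one block of 512 fits
// On a real device the thread limit could be taken from
// cudaDeviceProp::maxThreadsPerMultiProcessor instead of being hard-coded.
```

calling occupancy_percent with blocksizes 8, 64, 96, 256 and 512 and the limits 768 and 8 gives the same percentages as in the list above. for other chips only the two limit arguments have to be changed.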