Why does performance on GV100 increase by almost 50% when doubling the block size?

When running the kernel below on GV100, performance improves significantly when changing the block size from 64 threads to 128 threads (from 28us to 15us).

    __global__ void SimpleCopy(float* __restrict__ dest, const float* __restrict__ src)
    {
        int index = blockDim.x * blockIdx.x + threadIdx.x;
        dest[index] = src[index];
    }

Yes, I know that 64 seems like a small block size; however, even when putting all the numbers together, I can't explain the speedup.

For the first configuration, the kernel is launched with 64 threads/block and 13440 blocks.
For the second configuration, the kernel is launched with 128 threads/block and 6720 blocks.

GV100 is C.C. 7.0, so max #blocks/SM is 32, max #warps/SM is 64 and max #threads/SM is 2048.

No shared memory used and register count is reported by Nsight to be 16, so none of these will be the limiting factor when computing occupancy.

Doing the math, we should have:

  • For the first configuration, where grid size is 13440 blocks and block size is 64 threads
    32 blocks/SM <=> 2048 threads/SM <=> 64 warps/SM
  • For the second configuration, where grid size is 6720 blocks and block size is 128 threads
    16 blocks/SM <=> 2048 threads/SM <=> 64 warps/SM

So, the same number of warps per SM. Moreover, my GV100 reports 80 SMs, and both grid sizes are a multiple of 80, so there is no tail effect.

But then why the performance difference? Where else does the block size play a role?
In Nsight Compute I see that Memory Throughput, Mem Busy and Mem Pipes Busy are almost double for the second configuration, but I don't understand why. I would have expected the same values, since we have the same number of warps per SM in both scenarios.

I would assume that launching blocks may have some overhead.

Have you measured kernel runtimes with a no-op kernel that does nothing?
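A minimal sketch of such an experiment might look like the following (the grid sizes are taken from the question; error checking is omitted for brevity, and the timing itself would come from a profiler or host timers):

```cpp
#include <cuda_runtime.h>

__global__ void EmptyKernel() {}

int main() {
    // Same total thread count, two different block sizes,
    // matching the two configurations in the question.
    EmptyKernel<<<13440, 64>>>();
    cudaDeviceSynchronize();

    EmptyKernel<<<6720, 128>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

If the empty kernel shows the same timing gap as SimpleCopy, the difference is not coming from the memory accesses at all.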


Wow, thanks, I can’t believe it. Indeed, when launching an empty kernel I get the same performance numbers: ~28us for 64 threads/block and ~15us for 128 threads/block.

Does this mean that for GV100 memory is so fast that launching a higher number of blocks becomes the overhead?
Does Nsight Compute provide any metric for this overhead?

And thanks again for your feedback!

Memory access is usually not that fast, though.

There is probably something else going on, such as an error during memory access. Have you checked the return code of cudaGetLastError() after calling cudaDeviceSynchronize() ?
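As a sketch, such a check might look like this (the launch parameters are the hypothetical ones from the question):

```cpp
SimpleCopy<<<13440, 64>>>(dest, src);

// cudaDeviceSynchronize returns errors from kernel execution;
// cudaGetLastError catches launch-configuration errors.
cudaError_t err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    printf("sync error: %s\n", cudaGetErrorString(err));

err = cudaGetLastError();
if (err != cudaSuccess)
    printf("launch error: %s\n", cudaGetErrorString(err));
```

A kernel that fails to launch returns almost immediately, which would also produce suspiciously fast "runtimes".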

Running your test program in cuda-memcheck should also reveal any memory related errors in kernels, as well as CUDA API errors.

I checked with cuda-memcheck and there were no errors.
I also check the return code of every CUDA call, and nothing has been reported.

Then maybe the method you use to time the kernel duration is not correct.

Kernel launches are asynchronous, so you have to include the cudaDeviceSynchronize() in the timing loop.

I have a hunch that you are only timing how long it takes to launch the kernel, and not how long it takes to execute.
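One way to time the kernel execution itself, rather than just the launch, is with CUDA events; a sketch (launch parameters again taken from the question):

```cpp
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Events are recorded into the same stream as the kernel, so
// they bracket its actual execution on the device.
cudaEventRecord(start);
SimpleCopy<<<13440, 64>>>(dest, src);
cudaEventRecord(stop);

// Block the host until the kernel (and stop event) complete.
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %f ms\n", ms);
```

A host-side timer would work too, but only if cudaDeviceSynchronize() is called before reading the end time.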

The numbers I mentioned before are from Nsight Systems. The same values are reported by Nsight Compute.

The code also calls cudaDeviceSynchronize() before exiting.