It would appear to be block-scheduling overhead, which would be consistent with the occupancy observation.
I don’t have a V100 at the moment, but here is a test case running on my L4 GPU. First we process the entire data set using your kernel, with the grid sized to match the data set, for varying block sizes. I see a similar pattern: the very small block size (64) has noticeably worse execution time than the other 3. Then I switch to a grid-stride loop methodology, where only one wave of blocks is deposited on the GPU (all blocks in the grid fit on the GPU simultaneously) and does all the work, thereby reducing the impact of block scheduling. We still use the same 4 block sizes, but the grid is sized to match the GPU capacity, not the data set. In this case, the execution times of all 4 variants are within 5% of each other:
# cat t275.cu
// one thread per element; grid must cover the whole data set
__global__ void offset(float* a)
{
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  a[i] = a[i] + 1;
}
// grid-stride loop: each thread strides through multiple elements,
// so the grid can be sized to the GPU rather than the data set
__global__ void offset_i(float* a, const int ds)
{
  for (int i = blockDim.x * blockIdx.x + threadIdx.x; i < ds; i += gridDim.x * blockDim.x)
    a[i] = a[i] + 1;
}
int main(){
  const int ds = 1024*1024;
  float *a;
  cudaMalloc(&a, sizeof(a[0])*ds);
  cudaMemset(a, 0, sizeof(a[0])*ds);
  offset<<<ds/512, 512>>>(a);       // warm-up
  offset_i<<<ds/512, 512>>>(a, ds); // warm-up
  cudaDeviceSynchronize();
  // case 1: grid sized to the data set, block sizes 64, 128, 256, 512
  for (int i = 64; i < 1024; i*=2){
    offset<<<ds/i, i>>>(a);
    cudaDeviceSynchronize();}
  // case 2: grid-stride loop, one full wave of blocks on the 58 SMs
  // of the L4, same 4 block sizes (512, 256, 128, 64)
  int bs = 512;
  for (int i = 3; i < 48; i*=2){
    offset_i<<<58*i, bs>>>(a, ds);
    bs /= 2;
    cudaDeviceSynchronize();}
}
# nvcc -o t275 t275.cu -arch=sm_89
# nsys nvprof --print-gpu-trace ./t275
WARNING: t275 and any of its children processes will be profiled.
Generating '/tmp/nsys-report-4635.qdstrm'
[1/3] [========================100%] report8.nsys-rep
[2/3] [========================100%] report8.sqlite
[3/3] Executing 'cuda_gpu_trace' stats report
Start (ns) Duration (ns) CorrId GrdX GrdY GrdZ BlkX BlkY BlkZ Reg/Trd StcSMem (MB) DymSMem (MB) Bytes (MB) Throughput (MBps) SrcMemKd DstMemKd Device Ctx Strm Name
----------- ------------- ------ ------ ---- ---- ---- ---- ---- ------- ------------ ------------ ---------- ----------------- -------- -------- ------------- --- ---- ----------------------
655,959,365 4,928 119 4.194 851,116.556 Device NVIDIA L4 (0) 1 7 [CUDA memset]
656,105,957 6,720 120 2,048 1 1 512 1 1 16 0.000 0.000 NVIDIA L4 (0) 1 7 offset(float *)
656,138,661 6,464 121 2,048 1 1 512 1 1 16 0.000 0.000 NVIDIA L4 (0) 1 7 offset_i(float *, int)
656,157,125 10,976 123 16,384 1 1 64 1 1 16 0.000 0.000 NVIDIA L4 (0) 1 7 offset(float *)
656,177,957 6,368 125 8,192 1 1 128 1 1 16 0.000 0.000 NVIDIA L4 (0) 1 7 offset(float *)
656,194,309 6,112 127 4,096 1 1 256 1 1 16 0.000 0.000 NVIDIA L4 (0) 1 7 offset(float *)
656,209,317 6,144 129 2,048 1 1 512 1 1 16 0.000 0.000 NVIDIA L4 (0) 1 7 offset(float *)
656,224,677 5,120 131 174 1 1 512 1 1 16 0.000 0.000 NVIDIA L4 (0) 1 7 offset_i(float *, int)
656,238,821 5,024 133 348 1 1 256 1 1 16 0.000 0.000 NVIDIA L4 (0) 1 7 offset_i(float *, int)
656,253,029 5,024 135 696 1 1 128 1 1 16 0.000 0.000 NVIDIA L4 (0) 1 7 offset_i(float *, int)
656,267,461 5,216 137 1,392 1 1 64 1 1 16 0.000 0.000 NVIDIA L4 (0) 1 7 offset_i(float *, int)
Generated:
/root/bobc/report8.nsys-rep
/root/bobc/report8.sqlite
#
You won’t be able to reproduce this result with that exact code on your V100; the grid sizes for the second kernel would need to be adjusted to match your GPU’s capacity. For a V100 with 80 SMs it would be something like this:
for (int i = 4; i < 64; i*=2){
offset_i<<<80*i, bs>>>(a, ds);
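Rather than hardcoding the SM count (58 for the L4, 80 for a V100), you can size the wave at runtime. Here is a sketch, using the standard CUDA runtime attribute query and occupancy API with the same offset_i kernel; the block size of 256 is just an example:

```cuda
#include <cstdio>

__global__ void offset_i(float* a, const int ds)
{
  for (int i = blockDim.x * blockIdx.x + threadIdx.x; i < ds; i += gridDim.x * blockDim.x)
    a[i] = a[i] + 1;
}

int main(){
  const int ds = 1024*1024;
  const int bs = 256;   // example block size
  float *a;
  cudaMalloc(&a, sizeof(a[0])*ds);
  cudaMemset(a, 0, sizeof(a[0])*ds);
  int dev = 0, num_sms = 0, blocks_per_sm = 0;
  cudaGetDevice(&dev);
  // number of SMs on this device
  cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, dev);
  // how many blocks of this kernel, at this block size, can be resident per SM
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, offset_i, bs, 0);
  int grid = num_sms * blocks_per_sm;  // one full wave of blocks
  offset_i<<<grid, bs>>>(a, ds);
  cudaDeviceSynchronize();
  printf("SMs: %d, blocks/SM: %d, grid: %d\n", num_sms, blocks_per_sm, grid);
  cudaFree(a);
  return 0;
}
```

This keeps the grid matched to whatever GPU the code runs on, so the single-wave behavior carries over without per-device edits.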