In the other thread, SPWorley already gave you a hint about how to structure this: use a grid-stride loop.

Here’s a fully worked example:

```
$ cat t352.cu
#include <stdio.h>
const int ds = 1024*1024*32;

// The template parameter sz is unused in the kernel body; it exists only to
// create a distinct instantiation per launch so each shows up as a separate
// line in the profiler output.
template <int sz>
__global__ void vadd(float *c, const float *a, const float *b, const int dsize){
  // grid-stride loop: each thread handles multiple elements,
  // stepping by the total number of threads in the grid
  int idx = threadIdx.x + blockDim.x*blockIdx.x;
  while (idx < dsize){
    c[idx] = a[idx] + b[idx];
    idx += gridDim.x*blockDim.x;
  }
}

int main(){
  float *a, *b, *c;
  // device allocations only; data is left uninitialized since we only care about timing
  cudaMalloc(&c, ds*sizeof(float));
  cudaMalloc(&a, ds*sizeof(float));
  cudaMalloc(&b, ds*sizeof(float));
  vadd<0><<<ds/1024,1024>>>(c, a, b, ds); // warm-up
  vadd<1><<<1,1024>>>(c, a, b, ds);
  vadd<2><<<2,1024>>>(c, a, b, ds);
  vadd<4><<<4,1024>>>(c, a, b, ds);
  vadd<8><<<8,1024>>>(c, a, b, ds);
  vadd<16><<<16,1024>>>(c, a, b, ds);
  cudaDeviceSynchronize();
}
$ nvcc -arch=sm_61 -o t352 t352.cu
$ nvprof ./t352
==26500== NVPROF is profiling process 26500, command: ./t352
==26500== Profiling application: ./t352
==26500== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 45.67%  12.013ms         1  12.013ms  12.013ms  12.013ms  void vadd<int=1>(float*, float const *, float const *, int)
 24.23%  6.3734ms         1  6.3734ms  6.3734ms  6.3734ms  void vadd<int=2>(float*, float const *, float const *, int)
 13.40%  3.5241ms         1  3.5241ms  3.5241ms  3.5241ms  void vadd<int=4>(float*, float const *, float const *, int)
  7.88%  2.0726ms         1  2.0726ms  2.0726ms  2.0726ms  void vadd<int=8>(float*, float const *, float const *, int)
  4.86%  1.2795ms         1  1.2795ms  1.2795ms  1.2795ms  void vadd<int=16>(float*, float const *, float const *, int)
  3.95%  1.0385ms         1  1.0385ms  1.0385ms  1.0385ms  void vadd<int=0>(float*, float const *, float const *, int)
==26500== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 90.48%  323.45ms         3  107.82ms  557.40us  322.33ms  cudaMalloc
  7.36%  26.294ms         1  26.294ms  26.294ms  26.294ms  cudaDeviceSynchronize
  1.16%  4.1646ms       364  11.441us     298ns  509.09us  cuDeviceGetAttribute
  0.89%  3.1969ms         4  799.23us  771.42us  828.77us  cuDeviceTotalMem
  0.08%  299.38us         4  74.844us  68.688us  89.177us  cuDeviceGetName
  0.02%  56.046us         6  9.3410us  4.8850us  27.819us  cudaLaunch
  0.00%  6.4090us        24     267ns     138ns  2.5990us  cudaSetupArgument
  0.00%  6.0590us        12     504ns     306ns  1.4260us  cuDeviceGet
  0.00%  3.6390us         3  1.2130us     397ns  2.4440us  cuDeviceGetCount
  0.00%  2.3460us         6     391ns     207ns  1.1340us  cudaConfigureCall
$
```

In each of these kernel launches, the amount of work (the size of the vectors being added) is identical. Yet the launch with a single threadblock takes the longest, and the duration drops by roughly half each time we double the number of threadblocks. We can safely assume that the block distributor will generally assign blocks to "empty" SMs first, so as we go from 1 to 2 to 4 blocks in the grid, we are engaging roughly 1, 2, and 4 SMs, respectively.
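In real code you wouldn't hard-code block counts as the demo does; you would size the grid to the machine you are running on. Here is a minimal sketch of that idea, assuming a non-templated version of the vadd kernel above; the blocks_per_sm multiplier is an illustrative choice, not a tuned value:

```
#include <cstdio>
const int ds = 1024*1024*32;

__global__ void vadd(float *c, const float *a, const float *b, int dsize){
  // same grid-stride loop as in the example above
  for (int idx = threadIdx.x + blockDim.x*blockIdx.x; idx < dsize; idx += gridDim.x*blockDim.x)
    c[idx] = a[idx] + b[idx];
}

int main(){
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);   // query device 0
  const int blocks_per_sm = 2;         // illustrative multiplier, not a tuned value
  int blocks = prop.multiProcessorCount * blocks_per_sm;
  printf("%d SMs -> launching %d blocks\n", prop.multiProcessorCount, blocks);
  float *a, *b, *c;
  cudaMalloc(&a, ds*sizeof(float));
  cudaMalloc(&b, ds*sizeof(float));
  cudaMalloc(&c, ds*sizeof(float));
  vadd<<<blocks, 1024>>>(c, a, b, ds);
  cudaDeviceSynchronize();
}
```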

The GPU you run on will definitely matter here. I am running this on a Pascal Titan X (28 SMs). If you run on a very small GPU with only 1 or 2 SMs, the performance pattern may be different.
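If you want the grid sizing to adapt to whatever GPU the code lands on, the occupancy API can report how many blocks of a given size can be resident on one SM at once; multiplying by the SM count gives a grid that just fills the machine. A sketch, under the same non-templated vadd assumption as above:

```
#include <cstdio>

__global__ void vadd(float *c, const float *a, const float *b, int dsize){
  for (int idx = threadIdx.x + blockDim.x*blockIdx.x; idx < dsize; idx += gridDim.x*blockDim.x)
    c[idx] = a[idx] + b[idx];
}

int main(){
  int num_sms = 0, blocks_per_sm = 0;
  cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);
  // how many resident 1024-thread blocks of this kernel fit on one SM
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, vadd, 1024, 0);
  printf("%d SMs x %d blocks/SM -> %d blocks fill the machine\n",
         num_sms, blocks_per_sm, num_sms * blocks_per_sm);
}
```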