The presence of cudaDeviceSynchronize() in the work-issuance loop pretty much guarantees that the pending depth of the command queue will never be large: each launch is fully drained before the next one is issued.
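For contrast, here is a hypothetical variant of the timing loop from the test case below (reusing its names), with the per-launch synchronize removed so the pending queue depth can actually grow:

    // hypothetical variant: launches are queued asynchronously, so the
    // command queue fills instead of being drained after every launch
    for (size_t i = 0; i < depth; i++)
        test<<<nBLK,nTPB>>>(frameRawData, 92);   // returns immediately
    cudaDeviceSynchronize();                     // drain once, at the end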
However, I think the suggestion of trying longer kernels will tend to remove the variation, at least as a percentage of kernel duration. By this I mean that I have observed variability in the launch overhead, which is most evident when timing very short kernels. I suspect the amount of variability depends on the platform (for example, I would expect a display GPU to show significantly more of it).
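As a rough illustration (hypothetical numbers, not measurements): a fixed ~10 us of launch overhead is ~50% of a 20 us kernel but only ~0.5% of a 2 ms kernel. In the test case below, "longer work" would simply mean raising the loop trip counts in the kernel:

    // hypothetical change to the kernel in t1911.cu below: ~100x more work
    for (int i = 0; i < 100; i++)
        for (int j = 0; j < 10000; j++)   // was j < 100
            sum += i*j - num;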
If you feel this is a problem you could file a bug. I personally don't see an issue with a properly constructed test case running on a non-display GPU:
$ nvidia-smi
Mon Nov  1 09:45:05 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K20Xm         On   | 00000000:04:00.0 Off |                    0 |
| N/A   34C    P8    30W / 235W |      0MiB /  5700MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:05:00.0 Off |                    0 |
| N/A   39C    P0    25W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K20Xm         On   | 00000000:83:00.0 Off |                    0 |
| N/A   33C    P8    18W / 235W |      0MiB /  5700MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K20Xm         On   | 00000000:84:00.0 Off |                    0 |
| N/A   32C    P8    19W / 235W |      0MiB /  5700MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ cat t1911.cu
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>
#include <cuda_profiler_api.h>

// busy-work kernel: each thread spins on arithmetic, then writes one byte
__global__ void test(uint8_t* data, const unsigned int num)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int sum = 0;
    for (int i = 0; i < 100; i++)
    {
        for (int j = 0; j < 100; j++)
        {
            sum += i*j - num;
        }
    }
    data[index] = sum % 256;
}

size_t nTPB = 256;    // threads per block
size_t nBLK = 1;      // blocks per grid
size_t depth = 1000;  // number of timed launches

int main(){
    uint8_t *frameRawData;
    cudaMalloc(&frameRawData, nTPB*nBLK*sizeof(uint8_t));
    test<<<nBLK,nTPB>>>(frameRawData, 92);  // warm-up launch
    cudaDeviceSynchronize();
    std::vector<size_t> times;
    cudaProfilerStart();
    for (size_t i = 0; i < depth; i++)
    {
        // time one launch plus its completion from the host side
        auto begin = std::chrono::high_resolution_clock::now();
        test<<<nBLK,nTPB>>>(frameRawData, 92);
        cudaDeviceSynchronize();
        auto end = std::chrono::high_resolution_clock::now();
        times.push_back(std::chrono::duration_cast<std::chrono::microseconds>(end-begin).count());
    }
    cudaProfilerStop();
    size_t avg = 0;
    size_t max = 0;
    for (size_t i = 0; i < depth; i++){
        max = std::max(max, times[i]);
        avg += times[i];
    }
    avg /= depth;
    std::cout << "avg: " << avg << "us max: " << max << "us" << std::endl;
}
$ nvcc -o t1911 t1911.cu -std=c++14 -lineinfo
$ ./t1911
avg: 9us max: 23us
$ ./t1911
avg: 9us max: 20us
$ ./t1911
avg: 10us max: 20us
$
(CUDA 11.4)
In this case, if we simply declare that the launch overhead may be as much as ~25 microseconds, then the problem seems to “disappear”. I’m not aware of any guaranteed specification on launch overhead; it is demonstrable that launch overhead may vary depending on the exact launch pattern (e.g. the configuration of kernel arguments).
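For example (a hypothetical kernel signature, not something I have benchmarked here), a launch that passes a heavier argument list could be dropped into the same timing loop to probe that effect:

    // hypothetical: same busy-work idea as test(), but with a larger
    // parameter block, to see whether launch timing shifts
    __global__ void test_many_args(uint8_t* data, float4 a, float4 b,
                                   double c, const unsigned int num)
    {
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int sum = num + (unsigned int)(a.x + b.y + c);
        data[index] = sum % 256;
    }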
At some point I stop worrying about the noise.