The presence of cudaDeviceSynchronize() in the work-issuance loop pretty much guarantees that the pending depth of the command queue will never be large: each launch is fully drained before the next one is issued.
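For contrast, here is a hypothetical variant of the timing loop from the test case below (reusing its names), with the per-launch synchronize removed so the pending queue depth can actually grow:

    // hypothetical variant: launches are queued asynchronously, so the
    // command queue fills instead of being drained after every launch
    for (size_t i = 0; i < depth; i++)
        test<<<nBLK,nTPB>>>(frameRawData, 92);   // returns immediately
    cudaDeviceSynchronize();                     // drain once, at the end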
However, I think the suggestion of trying longer kernels will tend to remove the variation, at least as a percentage of kernel duration. By this I mean that I have observed variability in the launch overhead, which is most evident when timing very short kernels. I suspect the amount of variability depends on the platform (for example, I would expect a display GPU to show significantly more of it).
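As a rough illustration (hypothetical numbers, not measurements): a fixed ~10 us of launch overhead is ~50% of a 20 us kernel but only ~0.5% of a 2 ms kernel. In the test case below, "longer work" would simply mean raising the loop trip counts in the kernel:

    // hypothetical change to the kernel in t1911.cu below: ~100x more work
    for (int i = 0; i < 100; i++)
        for (int j = 0; j < 10000; j++)   // was j < 100
            sum += i*j - num;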
If you feel this is a problem you could file a bug. I personally don't see an issue with a properly constructed test case running on a non-display GPU:
$ nvidia-smi
Mon Nov  1 09:45:05 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K20Xm         On   | 00000000:04:00.0 Off |                    0 |
| N/A   34C    P8    30W / 235W |      0MiB /  5700MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:05:00.0 Off |                    0 |
| N/A   39C    P0    25W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K20Xm         On   | 00000000:83:00.0 Off |                    0 |
| N/A   33C    P8    18W / 235W |      0MiB /  5700MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K20Xm         On   | 00000000:84:00.0 Off |                    0 |
| N/A   32C    P8    19W / 235W |      0MiB /  5700MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ cat t1911.cu
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>
#include <cuda_profiler_api.h>

// busy-work kernel: each thread spins on arithmetic, then writes one byte
__global__ void test(uint8_t* data, const unsigned int num)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int sum = 0;
    for (int i = 0; i < 100; i++)
    {
        for (int j = 0; j < 100; j++)
        {
            sum += i*j - num;
        }
    }
    data[index] = sum % 256;
}

size_t nTPB = 256;    // threads per block
size_t nBLK = 1;      // blocks per grid
size_t depth = 1000;  // number of timed launches

int main(){
    uint8_t *frameRawData;
    cudaMalloc(&frameRawData, nTPB*nBLK*sizeof(uint8_t));
    test<<<nBLK,nTPB>>>(frameRawData, 92);  // warm-up launch
    cudaDeviceSynchronize();
    std::vector<size_t> times;
    cudaProfilerStart();
    for (size_t i = 0; i < depth; i++)
    {
        // time one launch plus its completion from the host side
        auto begin = std::chrono::high_resolution_clock::now();
        test<<<nBLK,nTPB>>>(frameRawData, 92);
        cudaDeviceSynchronize();
        auto end = std::chrono::high_resolution_clock::now();
        times.push_back(std::chrono::duration_cast<std::chrono::microseconds>(end-begin).count());
    }
    cudaProfilerStop();
    size_t avg = 0;
    size_t max = 0;
    for (size_t i = 0; i < depth; i++){
        max = std::max(max, times[i]);
        avg += times[i];
    }
    avg /= depth;
    std::cout << "avg: " << avg << "us max: " << max << "us" << std::endl;
}
$ nvcc -o t1911 t1911.cu -std=c++14 -lineinfo
$ ./t1911
avg: 9us max: 23us
$ ./t1911
avg: 9us max: 20us
$ ./t1911
avg: 10us max: 20us
$
(CUDA 11.4)
In this case, if we simply declare that the launch overhead may be as much as ~25 microseconds, then the problem seems to “disappear”. I’m not aware of any guaranteed specification on launch overhead; it is demonstrable that launch overhead may vary depending on the exact launch pattern (e.g. the configuration of kernel arguments).
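For example (a hypothetical kernel signature, not something I have benchmarked here), a launch that passes a heavier argument list could be dropped into the same timing loop to probe that effect:

    // hypothetical: same busy-work idea as test(), but with a larger
    // parameter block, to see whether launch timing shifts
    __global__ void test_many_args(uint8_t* data, float4 a, float4 b,
                                   double c, const unsigned int num)
    {
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int sum = num + (unsigned int)(a.x + b.y + c);
        data[index] = sum % 256;
    }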
At some point I stop worrying about the noise.