Decreased performance when CUDA kernels are not run continuously

Hello.

I have trouble understanding why my program performs so poorly when two kernel executions are separated by a few milliseconds, while it seems to run optimally when there is no “pause” between them.

I know that the GPU has to “warm up” before reaching its full potential, and that power consumption should be reduced while idling, but I am surprised that such a slowdown occurs so quickly.

Please take a look at the code below (compiled with “nvcc test.cu -o test -std=c++11”):

#include <chrono>
#include <cmath>
#include <thread>

#define N (2048 * 2048)
#define THREADS_PER_BLOCK 1024
#define NB_BLOCKS ((N + (THREADS_PER_BLOCK - 1)) / THREADS_PER_BLOCK)
#define ITERS 1000
#define SLEEP_MS 5

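// Toy kernel: arbitrary per-element arithmetic, just enough work to make each launch measurable.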
__global__ void kernel(unsigned char *bgr, float *dest, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n) {
        int i = index * 3;
        float out = index % (bgr[i] * bgr[i + 1] * bgr[i + 2]);
        for (int j = 0; j < 5; j++) {
            out += std::sqrt(out * out / (j * j + 1));
        }
        dest[index] = out;
    }
}

int main(int argc, char **argv) {
    unsigned char *d_bgr;
    float *d_dest;

    int sizeInput = N * 3 * sizeof(unsigned char);
    int sizeOutput = N * sizeof(float);

    unsigned char *bgr = (unsigned char *)malloc(sizeInput);
    for (int i = 0; i < N * 3; i++) bgr[i] = i % 256;

    cudaMalloc((void **)&d_bgr, sizeInput);
    cudaMalloc((void **)&d_dest, sizeOutput);
    cudaMemcpy(d_bgr, bgr, sizeInput, cudaMemcpyHostToDevice);

    bool sleeping = argc > 1;

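    // Launch the kernel ITERS times; any command-line argument enables the SLEEP_MS pause between launches.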
    for (int i = 0; i < ITERS; i++) {
        if (sleeping) {
            std::this_thread::sleep_for(std::chrono::milliseconds(SLEEP_MS));
        }
        kernel<<<NB_BLOCKS, THREADS_PER_BLOCK>>>(d_bgr, d_dest, N);
        cudaDeviceSynchronize();
    }

    cudaFree(d_bgr);
    cudaFree(d_dest);
    free(bgr);

    return 0;
}

Output of “nvprof ./test”:

==9588== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.87%  4.25512s      1000  4.2551ms  4.1526ms  36.373ms  kernel(unsigned char*, float*, int)
                    0.13%  5.6628ms         1  5.6628ms  5.6628ms  5.6628ms  [CUDA memcpy HtoD]
      API calls:   92.30%  4.35146s      1000  4.3515ms  4.2089ms  36.692ms  cudaDeviceSynchronize
                    4.37%  206.03ms         2  103.01ms  812.42us  205.22ms  cudaMalloc
                    2.90%  136.72ms      1000  136.72us  47.296us  263.23us  cudaLaunch
                    0.15%  7.1691ms      3000  2.3890us     320ns  47.648us  cudaSetupArgument
                    0.15%  7.0965ms         1  7.0965ms  7.0965ms  7.0965ms  cudaMemcpy
                    0.08%  3.6089ms      1000  3.6080us  1.1840us  7.9040us  cudaConfigureCall
                    0.05%  2.4172ms         2  1.2086ms  1.1004ms  1.3169ms  cudaFree
                    0.00%  92.352us        94     982ns     544ns  23.488us  cuDeviceGetAttribute
                    0.00%  7.4560us         1  7.4560us  7.4560us  7.4560us  cuDeviceTotalMem
                    0.00%  5.4720us         3  1.8240us  1.0560us  3.0400us  cuDeviceGetCount
                    0.00%  2.5920us         2  1.2960us  1.0240us  1.5680us  cuDeviceGet
                    0.00%  1.7920us         1  1.7920us  1.7920us  1.7920us  cuDeviceGetName

Output of “nvprof ./test sleep”:

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.96%  13.1430s      1000  13.143ms  6.9320ms  40.253ms  kernel(unsigned char*, float*, int)
                    0.04%  5.3037ms         1  5.3037ms  5.3037ms  5.3037ms  [CUDA memcpy HtoD]
      API calls:   97.18%  13.2558s      1000  13.256ms  7.0201ms  40.683ms  cudaDeviceSynchronize
                    1.52%  207.47ms         2  103.74ms  781.18us  206.69ms  cudaMalloc
                    1.13%  154.77ms      1000  154.77us  82.656us  317.73us  cudaLaunch
                    0.06%  7.7239ms      3000  2.5740us     320ns  37.280us  cudaSetupArgument
                    0.05%  6.9059ms         1  6.9059ms  6.9059ms  6.9059ms  cudaMemcpy
                    0.04%  5.6524ms      1000  5.6520us  2.5920us  141.92us  cudaConfigureCall
                    0.02%  2.5090ms         2  1.2545ms  1.1288ms  1.3801ms  cudaFree
                    0.00%  174.98us        94  1.8610us  1.0560us  40.896us  cuDeviceGetAttribute
                    0.00%  14.912us         1  14.912us  14.912us  14.912us  cuDeviceTotalMem
                    0.00%  12.288us         3  4.0960us  1.9200us  6.6560us  cuDeviceGetCount
                    0.00%  5.5680us         2  2.7840us  2.0160us  3.5520us  cuDeviceGet
                    0.00%  2.2720us         1  2.2720us  2.2720us  2.2720us  cuDeviceGetName

As you can see, the kernel executions are much slower when the program merely sleeps 5 ms between calls: the average kernel time goes from about 4.3 ms to about 13.1 ms.
This is running on a Jetson TX2 with CUDA 9.
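
(In case it is useful: the same slowdown can also be checked without nvprof by timing each launch with CUDA events. Below is only a sketch of how the launch loop could be instrumented, assuming the same kernel, buffers and macros as above plus an extra “#include <cstdio>” for printf; it is not the code that produced the numbers above.)

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int i = 0; i < ITERS; i++) {
        if (sleeping) {
            std::this_thread::sleep_for(std::chrono::milliseconds(SLEEP_MS));
        }
        cudaEventRecord(start);
        kernel<<<NB_BLOCKS, THREADS_PER_BLOCK>>>(d_bgr, d_dest, N);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);  // GPU time of this launch, in milliseconds
        printf("iteration %d: %.3f ms\n", i, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);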

So, do you know of any documentation I could read to better understand why CUDA/the GPU behaves like this? Do you see anything wrong in my code sample?
Also, is there a way I could prevent the device from dropping to a lower performance state?

Thanks in advance for your help.

I have not tried your code, but a possible explanation might be related to caching. After a pause, the cached data may have been evicted, so the reads miss and have to go out to external RAM, which is slower.
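
One way you could test that (I have not run this): add a tiny kernel that only reads the input buffer and launch it right after the sleep, before the measured kernel, so the data is pulled back into cache. If the measured kernel then stays fast, caching is the likely explanation. Rough sketch; “d_sink” is a hypothetical one-element unsigned int buffer (allocated with cudaMalloc) whose only purpose is to keep the compiler from optimizing the reads away:

__global__ void touch_input(const unsigned char *bgr, int n, unsigned int *sink) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n) {
        // Read the three bytes of this element; the comparison is (almost) never true,
        // but the compiler cannot prove it, so the loads are kept.
        unsigned int v = bgr[3 * index] + bgr[3 * index + 1] + bgr[3 * index + 2];
        if (v == 765u) atomicAdd(sink, 1u);
    }
}

// In the launch loop, right after the sleep and before the measured kernel:
//     touch_input<<<NB_BLOCKS, THREADS_PER_BLOCK>>>(d_bgr, N, d_sink);
//     cudaDeviceSynchronize();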