Hello.
I have trouble understanding why my program performs so poorly when two kernel executions are separated by a few milliseconds, while it seems to run optimally when there is no “pause”.
I know that the GPU has to “warm up” before reaching its full potential and that it reduces power consumption while idling, but I am surprised that such a slowdown occurs so quickly.
Please take a look at the code below (compiled with “nvcc test.cu -o test -std=c++11”):
#include <chrono>
#include <cmath>
#include <cstdlib>
#include <thread>

#define N (2048 * 2048)
#define THREADS_PER_BLOCK 1024
#define NB_BLOCKS ((N + (THREADS_PER_BLOCK - 1)) / THREADS_PER_BLOCK)
#define ITERS 1000
#define SLEEP_MS 5

// Reads 3 bytes per element and does a bit of arithmetic, just to keep the GPU busy.
__global__ void kernel(unsigned char *bgr, float *dest, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n) {
        int i = index * 3;
        float out = index % (bgr[i] * bgr[i + 1] * bgr[i + 2]);
        for (int j = 0; j < 5; j++) {
            out += std::sqrt(out * out / (j * j + 1));
        }
        dest[index] = out;
    }
}

int main(int argc, char **argv) {
    unsigned char *d_bgr;
    float *d_dest;
    int sizeInput = N * 3 * sizeof(unsigned char);
    int sizeOutput = N * sizeof(float);

    // Fill the host buffer with dummy data and copy it to the device once.
    unsigned char *bgr = (unsigned char *)malloc(sizeInput);
    for (int i = 0; i < N * 3; i++) bgr[i] = i % 256;
    cudaMalloc((void **)&d_bgr, sizeInput);
    cudaMalloc((void **)&d_dest, sizeOutput);
    cudaMemcpy(d_bgr, bgr, sizeInput, cudaMemcpyHostToDevice);

    // Any command-line argument enables the 5 ms sleep between kernel launches.
    bool sleeping = argc > 1;
    for (int i = 0; i < ITERS; i++) {
        if (sleeping) {
            std::this_thread::sleep_for(std::chrono::milliseconds(SLEEP_MS));
        }
        kernel<<<NB_BLOCKS, THREADS_PER_BLOCK>>>(d_bgr, d_dest, N);
        cudaDeviceSynchronize();
    }

    cudaFree(d_bgr);
    cudaFree(d_dest);
    free(bgr);
    return 0;
}
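For what it is worth, to see exactly which iterations are slow I am thinking of also logging the wall time of each iteration on the host, roughly like the snippet below around the launch (just a sketch, it would additionally need <cstdio>; it is not what produced the profiler output that follows):

    // Sketch: time each launch + synchronize from the host side.
    auto t0 = std::chrono::steady_clock::now();
    kernel<<<NB_BLOCKS, THREADS_PER_BLOCK>>>(d_bgr, d_dest, N);
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    printf("iter %d: %.3f ms\n", i,
           std::chrono::duration<double, std::milli>(t1 - t0).count());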
Output of “nvprof ./test”:
==9588== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 99.87% 4.25512s 1000 4.2551ms 4.1526ms 36.373ms kernel(unsigned char*, float*, int)
0.13% 5.6628ms 1 5.6628ms 5.6628ms 5.6628ms [CUDA memcpy HtoD]
API calls: 92.30% 4.35146s 1000 4.3515ms 4.2089ms 36.692ms cudaDeviceSynchronize
4.37% 206.03ms 2 103.01ms 812.42us 205.22ms cudaMalloc
2.90% 136.72ms 1000 136.72us 47.296us 263.23us cudaLaunch
0.15% 7.1691ms 3000 2.3890us 320ns 47.648us cudaSetupArgument
0.15% 7.0965ms 1 7.0965ms 7.0965ms 7.0965ms cudaMemcpy
0.08% 3.6089ms 1000 3.6080us 1.1840us 7.9040us cudaConfigureCall
0.05% 2.4172ms 2 1.2086ms 1.1004ms 1.3169ms cudaFree
0.00% 92.352us 94 982ns 544ns 23.488us cuDeviceGetAttribute
0.00% 7.4560us 1 7.4560us 7.4560us 7.4560us cuDeviceTotalMem
0.00% 5.4720us 3 1.8240us 1.0560us 3.0400us cuDeviceGetCount
0.00% 2.5920us 2 1.2960us 1.0240us 1.5680us cuDeviceGet
0.00% 1.7920us 1 1.7920us 1.7920us 1.7920us cuDeviceGetName
Output of “nvprof ./test sleep”:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 99.96% 13.1430s 1000 13.143ms 6.9320ms 40.253ms kernel(unsigned char*, float*, int)
0.04% 5.3037ms 1 5.3037ms 5.3037ms 5.3037ms [CUDA memcpy HtoD]
API calls: 97.18% 13.2558s 1000 13.256ms 7.0201ms 40.683ms cudaDeviceSynchronize
1.52% 207.47ms 2 103.74ms 781.18us 206.69ms cudaMalloc
1.13% 154.77ms 1000 154.77us 82.656us 317.73us cudaLaunch
0.06% 7.7239ms 3000 2.5740us 320ns 37.280us cudaSetupArgument
0.05% 6.9059ms 1 6.9059ms 6.9059ms 6.9059ms cudaMemcpy
0.04% 5.6524ms 1000 5.6520us 2.5920us 141.92us cudaConfigureCall
0.02% 2.5090ms 2 1.2545ms 1.1288ms 1.3801ms cudaFree
0.00% 174.98us 94 1.8610us 1.0560us 40.896us cuDeviceGetAttribute
0.00% 14.912us 1 14.912us 14.912us 14.912us cuDeviceTotalMem
0.00% 12.288us 3 4.0960us 1.9200us 6.6560us cuDeviceGetCount
0.00% 5.5680us 2 2.7840us 2.0160us 3.5520us cuDeviceGet
0.00% 2.2720us 1 2.2720us 2.2720us 2.2720us cuDeviceGetName
As you can see, the kernel executions are much slower when the program merely sleeps 5 ms between calls.
This is running on a Jetson TX2 with CUDA 9.
So, do you know of any documentation I could look at to better understand why CUDA/the GPU behaves like this? Do you see anything wrong in my code sample?
Also, is there a way I could prevent the device from dropping to a lower performance state?
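The only workaround I can think of so far is to keep the GPU busy during the pause with a tiny dummy kernel launched in its own stream, roughly like the sketch below (untested, and the 1 ms launch interval is just a guess), but that feels wasteful:

    __global__ void keepAlive() {}  // empty kernel, only there to keep the GPU from idling

    // In main(), created once before the loop:
    cudaStream_t busyStream;
    cudaStreamCreate(&busyStream);

    // Instead of one 5 ms sleep, sleep in 1 ms slices and poke the GPU each time:
    for (int s = 0; s < SLEEP_MS; s++) {
        keepAlive<<<1, 1, 0, busyStream>>>();
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }

I have also seen the nvpmodel and jetson_clocks tools mentioned for locking the clocks on Jetson boards, but I am not sure whether that is the intended solution here.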
Thanks in advance for your help.