Hello.
I have trouble understanding why my program performs so poorly when two kernel executions are separated by a few milliseconds, while it seems to run optimally when there is no “pause”.
I know that the GPU has to “warm up” before reaching its full potential and that it reduces power consumption while idling, but I am surprised that such a slowdown occurs so quickly.
Please take a look at the code below (compiled with “nvcc test.cu -o test -std=c++11”):
#include <chrono>
#include <cmath>
#include <cstdlib>
#include <thread>

#define N (2048 * 2048)
#define THREADS_PER_BLOCK 1024
#define NB_BLOCKS ((N + (THREADS_PER_BLOCK - 1)) / THREADS_PER_BLOCK)
#define ITERS 1000
#define SLEEP_MS 5

// Reads 3 bytes per element and does a bit of arithmetic, just to keep the GPU busy.
__global__ void kernel(unsigned char *bgr, float *dest, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n) {
        int i = index * 3;
        float out = index % (bgr[i] * bgr[i + 1] * bgr[i + 2]);
        for (int j = 0; j < 5; j++) {
            out += std::sqrt(out * out / (j * j + 1));
        }
        dest[index] = out;
    }
}

int main(int argc, char **argv) {
    unsigned char *d_bgr;
    float *d_dest;
    int sizeInput = N * 3 * sizeof(unsigned char);
    int sizeOutput = N * sizeof(float);

    // Fill the host buffer with dummy data and copy it to the device once.
    unsigned char *bgr = (unsigned char *)malloc(sizeInput);
    for (int i = 0; i < N * 3; i++) bgr[i] = i % 256;
    cudaMalloc((void **)&d_bgr, sizeInput);
    cudaMalloc((void **)&d_dest, sizeOutput);
    cudaMemcpy(d_bgr, bgr, sizeInput, cudaMemcpyHostToDevice);

    // Any command-line argument enables the 5 ms sleep between kernel launches.
    bool sleeping = argc > 1;
    for (int i = 0; i < ITERS; i++) {
        if (sleeping) {
            std::this_thread::sleep_for(std::chrono::milliseconds(SLEEP_MS));
        }
        kernel<<<NB_BLOCKS, THREADS_PER_BLOCK>>>(d_bgr, d_dest, N);
        cudaDeviceSynchronize();
    }

    cudaFree(d_bgr);
    cudaFree(d_dest);
    free(bgr);
    return 0;
}
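For what it is worth, to see exactly which iterations are slow I am thinking of also logging the wall time of each iteration on the host, roughly like the snippet below around the launch (just a sketch, it would additionally need <cstdio>; it is not what produced the profiler output that follows):

    // Sketch: time each launch + synchronize from the host side.
    auto t0 = std::chrono::steady_clock::now();
    kernel<<<NB_BLOCKS, THREADS_PER_BLOCK>>>(d_bgr, d_dest, N);
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    printf("iter %d: %.3f ms\n", i,
           std::chrono::duration<double, std::milli>(t1 - t0).count());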
Output of “nvprof ./test”:
==9588== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 99.87% 4.25512s 1000 4.2551ms 4.1526ms 36.373ms kernel(unsigned char*, float*, int)
0.13% 5.6628ms 1 5.6628ms 5.6628ms 5.6628ms [CUDA memcpy HtoD]
API calls: 92.30% 4.35146s 1000 4.3515ms 4.2089ms 36.692ms cudaDeviceSynchronize
4.37% 206.03ms 2 103.01ms 812.42us 205.22ms cudaMalloc
2.90% 136.72ms 1000 136.72us 47.296us 263.23us cudaLaunch
0.15% 7.1691ms 3000 2.3890us 320ns 47.648us cudaSetupArgument
0.15% 7.0965ms 1 7.0965ms 7.0965ms 7.0965ms cudaMemcpy
0.08% 3.6089ms 1000 3.6080us 1.1840us 7.9040us cudaConfigureCall
0.05% 2.4172ms 2 1.2086ms 1.1004ms 1.3169ms cudaFree
0.00% 92.352us 94 982ns 544ns 23.488us cuDeviceGetAttribute
0.00% 7.4560us 1 7.4560us 7.4560us 7.4560us cuDeviceTotalMem
0.00% 5.4720us 3 1.8240us 1.0560us 3.0400us cuDeviceGetCount
0.00% 2.5920us 2 1.2960us 1.0240us 1.5680us cuDeviceGet
0.00% 1.7920us 1 1.7920us 1.7920us 1.7920us cuDeviceGetName
Output of “nvprof ./test sleep”:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 99.96% 13.1430s 1000 13.143ms 6.9320ms 40.253ms kernel(unsigned char*, float*, int)
0.04% 5.3037ms 1 5.3037ms 5.3037ms 5.3037ms [CUDA memcpy HtoD]
API calls: 97.18% 13.2558s 1000 13.256ms 7.0201ms 40.683ms cudaDeviceSynchronize
1.52% 207.47ms 2 103.74ms 781.18us 206.69ms cudaMalloc
1.13% 154.77ms 1000 154.77us 82.656us 317.73us cudaLaunch
0.06% 7.7239ms 3000 2.5740us 320ns 37.280us cudaSetupArgument
0.05% 6.9059ms 1 6.9059ms 6.9059ms 6.9059ms cudaMemcpy
0.04% 5.6524ms 1000 5.6520us 2.5920us 141.92us cudaConfigureCall
0.02% 2.5090ms 2 1.2545ms 1.1288ms 1.3801ms cudaFree
0.00% 174.98us 94 1.8610us 1.0560us 40.896us cuDeviceGetAttribute
0.00% 14.912us 1 14.912us 14.912us 14.912us cuDeviceTotalMem
0.00% 12.288us 3 4.0960us 1.9200us 6.6560us cuDeviceGetCount
0.00% 5.5680us 2 2.7840us 2.0160us 3.5520us cuDeviceGet
0.00% 2.2720us 1 2.2720us 2.2720us 2.2720us cuDeviceGetName
As you can see, the kernel executions are much slower when the program merely sleeps 5 ms between calls.
This is running on a Jetson TX2 with CUDA 9.
So, do you know of any documentation I could look at to better understand why CUDA/the GPU behaves like this? Do you see anything wrong in my code sample?
Also, is there a way I could prevent the device from dropping to a lower performance state?
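The only workaround I can think of so far is to keep the GPU busy during the pause with a tiny dummy kernel launched in its own stream, roughly like the sketch below (untested, and the 1 ms launch interval is just a guess), but that feels wasteful:

    __global__ void keepAlive() {}  // empty kernel, only there to keep the GPU from idling

    // In main(), created once before the loop:
    cudaStream_t busyStream;
    cudaStreamCreate(&busyStream);

    // Instead of one 5 ms sleep, sleep in 1 ms slices and poke the GPU each time:
    for (int s = 0; s < SLEEP_MS; s++) {
        keepAlive<<<1, 1, 0, busyStream>>>();
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }

I have also seen the nvpmodel and jetson_clocks tools mentioned for locking the clocks on Jetson boards, but I am not sure whether that is the intended solution here.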
Thanks in advance for your help.