I know the first CUDA function call takes longer than normal, so I was investigating whether having a global device variable would cause this delay to occur as soon as the program starts, rather than at the first function call. The timing results were a little unusual, and I'm not sure how to explain them.
#include <stdio.h>
#include <unistd.h>

__device__ int c;

int main() {
    float time;
    int *a, *b;

    cudaFree(0);
    sleep(1);
    cudaMalloc((void **) &a, sizeof(int));
    cudaMalloc((void **) &b, sizeof(int));
    cudaFree(a);
    cudaFree(b);
    return 0;
}
Each of the CUDA function calls was timed like this (start and stop are cudaEvent_t handles created once with cudaEventCreate):

cudaEventRecord(start, 0);
cudaMalloc((void **) &a, sizeof(int));
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
printf("Malloc 1: %f\n", time);
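For completeness, a self-contained sketch of the timing harness (error checking omitted for brevity, and only the first Malloc shown; the other calls are timed the same way):

```cuda
#include <stdio.h>

__device__ int c;

int main() {
    float time;
    int *a;
    cudaEvent_t start, stop;

    // Events must be created before they can be recorded.
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMalloc((void **) &a, sizeof(int));
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   // wait until the stop event has actually happened
    cudaEventElapsedTime(&time, start, stop);
    printf("Malloc 1: %f\n", time);

    cudaFree(a);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```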
Without the global device variable, sleep() statement, or cudaFree(0), the timings are:
Malloc 1: 0.091808
Malloc 2: 0.003712
Free 1: 0.013632
Free 2: 0.010560
These are as expected, since the first cudaMalloc call is the one that initialises the device.
With sleep(1), but no device variable or cudaFree(0):
Malloc 1: 0.012352
Malloc 2: 0.003584
Free 1: 0.015872
Free 2: 0.012288
The first Malloc is still the first CUDA function called, but now takes only about an eighth of the time above. It still takes roughly 3.5 times as long as the second Malloc, though.
With device, no sleep(), and no cudaFree(0):
Malloc 1: 0.091808
Malloc 2: 0.003520
Free 1: 0.013760
Free 2: 0.010560
It appears that having a global device variable doesn't cause the GPU to be initialised at the start of the program; the initialisation cost still falls on the first cudaMalloc.
With no device, no sleep(), but with cudaFree(0):
Free 0: 0.091840
Malloc 1: 0.003552
Malloc 2: 0.003552
Free 1: 0.012640
Free 2: 0.012288
This is exactly as expected.
No device, but with sleep(1), and with cudaFree(0):
Free 0: 0.091808
Malloc 1: 0.014528
Malloc 2: 0.003680
Free 1: 0.013600
Free 2: 0.012192
This is the first of the unusual timing results. Putting the sleep(1) between cudaFree(0) and the first cudaMalloc makes that Malloc take about four times as long as it does without the sleep.
No device variable, with sleep(1) called before cudaFree(0) instead of after it:
Free 0: 0.003552
Malloc 1: 0.013440
Malloc 2: 0.003616
Free 1: 0.012640
Free 2: 0.012480
Another unusual timing result. Now the first CUDA function call (cudaFree(0)) is fast, but the second CUDA call (the first cudaMalloc) is slower than if there had been no sleep().
The only thing I can think of is that the sleep() call is somehow affecting the timer. Another thought was that the compiler is rearranging the code so that the call following the sleep() statement starts before it, but I don't see how that would explain why the Malloc took longer.
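One way I could try to rule the timer in or out would be to time the same call with a host-side clock as well as with CUDA events, and see whether the two measurements agree (a sketch; gettimeofday granularity should be adequate at these scales):

```cuda
#include <stdio.h>
#include <sys/time.h>

// Returns the host wall-clock time in milliseconds.
static double now_ms(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

int main() {
    int *a;

    // cudaMalloc is synchronous, so the host clock brackets the whole call.
    // If the cudaEvent numbers were being skewed by sleep(), the host-clock
    // measurement taken under the same conditions would disagree with them.
    double t0 = now_ms();
    cudaMalloc((void **) &a, sizeof(int));
    double t1 = now_ms();
    printf("Malloc 1 (host clock): %f ms\n", t1 - t0);

    cudaFree(a);
    return 0;
}
```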