Unusual timing results

I know the first CUDA function call takes longer than normal, so I was investigating whether having global device variables would cause this delay to occur when the program starts, rather than at the first function call. The timing results were a little unusual, and I am not sure how to explain them.

#include <stdio.h>
#include <unistd.h>

__device__ int c;

int main(){
	float time;
	int *a, *b;

	cudaFree(0);
	sleep(1);
	cudaMalloc((void **) &a, sizeof(int));
	cudaMalloc((void **) &b, sizeof(int));
	cudaFree(a);
	cudaFree(b);
	return 0;
}

Each of the CUDA function calls was timed using

	cudaEventRecord(start, 0);
	cudaMalloc((void **) &a, sizeof(int));
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);
	cudaEventElapsedTime(&time, start, stop);
	printf("Malloc 1: %f\n", time);

Without the global device variable, sleep() statement, or cudaFree(0), the timings are:

Malloc 1: 0.091808
Malloc 2: 0.003712
Free 1: 0.013632
Free 2: 0.010560

These are as expected, since the first Malloc call is the one that initialises the device.

With sleep(1), but no device variable or cudaFree(0):

Malloc 1: 0.012352
Malloc 2: 0.003584
Free 1: 0.015872
Free 2: 0.012288

The first Malloc is still the first CUDA function called, but now takes only about 1/8th of the time above. It still takes about four times longer than the second Malloc, though.

With device, no sleep(), and no cudaFree(0):

Malloc 1: 0.091808
Malloc 2: 0.003520
Free 1: 0.013760
Free 2: 0.010560

It appears that having a global device variable doesn’t cause the GPU to be initialised at the start of the program.

With no device, no sleep(), but with cudaFree(0):

Free 0: 0.091840
Malloc 1: 0.003552
Malloc 2: 0.003552
Free 1: 0.012640
Free 2: 0.012288

This is exactly as expected.

No device, but with sleep(1), and with cudaFree(0):

Free 0: 0.091808
Malloc 1: 0.014528
Malloc 2: 0.003680
Free 1: 0.013600
Free 2: 0.012192

This is one of the unusual timing results. Having the sleep(1) statement in between the cudaFree(0) and the first cudaMalloc causes the Malloc to take four times as long as without it.

No device, with cudaFree(0), and with sleep(1) called before the cudaFree(0) (instead of after it, as before):

Free 0: 0.003552
Malloc 1: 0.013440
Malloc 2: 0.003616
Free 1: 0.012640
Free 2: 0.012480

Another unusual timing result. Now the first CUDA function call (cudaFree(0)) is fast, but the second CUDA function (the first cudaMalloc call) is slower than if there had been no sleep().

The only thing I can think of is that the sleep() call is affecting the timer. Another thought I had was that the compiler is rearranging the code so that the call that follows the sleep() statement actually starts before the sleep(). I don’t think this could explain why the Malloc took longer, though.

I hope you repeated each experiment many times and averaged. Sometimes, transient things can affect timings.

Also, the first cudaEventRecord still does not know which is the default device. However, the corresponding cudaEventRecord will know what the context is…

Also, it is a good idea to tie the thread to a CPU core. Use the “SetThreadAffinityMask” Win API to nail it to one CPU core. Usually it does not matter, but on some systems it may, depending on the BIOS version, OS patch level, etc…

Try tying the thread to a core and repeat the experiments, if you would like.
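For example, something like this (Windows only, pins the calling thread to core 0; just a sketch):

#include <windows.h>
#include <stdio.h>

int main(void){
	/* mask bit 0 set = allow CPU core 0 only; returns the previous mask, or 0 on failure */
	if (SetThreadAffinityMask(GetCurrentThread(), 1) == 0)
		printf("SetThreadAffinityMask failed\n");

	/* ... run the CUDA timing code here ... */
	return 0;
}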

Actually, it is more than likely that the first call to cudaEventCreate is where the context is actually initialized. As far as I know, the only cuda* functions in CUDA 2.2 that don’t initialize a context are cudaSetDevice, cudaSetDeviceFlags, and cudaGetErrorString.

Good observation. But that still does not explain why the “first cudaMalloc” was taking so much time in some cases, right?

Maybe, to clear the confusion, he should be doing the timing using the “QueryPerformanceCounter” API directly. And, yeah, after tying the thread to a core, just to be sure. (I have seen negative timings when not tying the thread to a CPU core, especially when the time being measured is very small.)
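Something along these lines (a sketch; QueryPerformanceCounter counts ticks, so divide by the frequency to get seconds):

#include <windows.h>
#include <stdio.h>

int main(void){
	LARGE_INTEGER freq, t0, t1;
	QueryPerformanceFrequency(&freq);     /* ticks per second */

	QueryPerformanceCounter(&t0);
	/* ... the call being timed, e.g. cudaMalloc ... */
	QueryPerformanceCounter(&t1);

	printf("elapsed: %f ms\n",
	       (double)(t1.QuadPart - t0.QuadPart) * 1000.0 / (double)freq.QuadPart);
	return 0;
}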

I thought cudaEvent* functions only involved the host and so wouldn’t cause an initialisation of the GPU.

If cudaEventCreate causes initialisation, then that might explain why putting a sleep() before the first non-cudaEvent* function makes it seem to take a lot less time: cudaEventCreate might asynchronously start the initialisation, and when the sleep() isn’t there, completion of the first non-cudaEvent* function gets delayed until the initialisation is complete… (Probably a long shot.)

I tried changing the timing to MPI_Wtime, with the following results (I increased the number of variables to 4, still all ints):

No device, no sleep(), but with cudaFree(0):
Free 0: 5.369152
Malloc 1: 0.000009
Malloc 2: 0.000001
Malloc 3: 0.000001
Malloc 4: 0.000001
Free 1: 0.000013
Free 2: 0.000001
Free 3: 0.000002
Free 4: 0.000001

I wondered whether the Mallocs and Frees were being grouped, so that only the first call of each actually did anything. I tried moving one of the Frees up into the middle of the Mallocs, with the result:

Free 0: 5.352751
Malloc 1: 0.000009
Malloc 2: 0.000001
Malloc 3: 0.000001
Free 2: 0.000012
Malloc 4: 0.000004
Free 1: 0.000002
Free 3: 0.000001
Free 4: 0.000001

It didn’t show anything that would prove my theory, but it doesn’t disprove it either, as the compiler might still see that it could swap Free 2 and Malloc 4 (although I wouldn’t have thought that it would).

The compute nodes that the GPUs are connected to run Linux so I cannot use SetThreadAffinityMask or QueryPerformanceCounter. I have never used the Linux sched_setaffinity function, but I will look into it.
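From the man page, it seems to be used roughly like this (an untested sketch, pinning to core 0):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void){
	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(0, &set);    /* allow core 0 only */

	/* pid 0 means the calling thread/process */
	if (sched_setaffinity(0, sizeof(set), &set) != 0)
		perror("sched_setaffinity");

	/* ... timing code ... */
	return 0;
}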

5.365 seconds?? Strange… Maybe it returned milliseconds rather than seconds… Can you check? Did your program really stall for 5 secs???

I suggest you not use “cudaFree()” with 0 as the argument. We don’t know what it does to the runtime or the driver, so it is better to avoid it.

And more than affinity, there must be a way to nail it to the core. I think the scheduler will only try to schedule threads to the affined cores; I don’t think it guarantees it… but then, I am not very sure. Check it out.

FYI :)

N.

Events access the internal timers on the GPU as well as provide a way for you to synchronize with events after they have occurred on the GPU. At least to me, it only makes sense for them to require a context. An event is inserted into the CUDA command stream, so how could it exist without a context?

Anyways, regarding what you are trying to test: this is what the programming guide has to say about it:

I don’t think declaring a device variable is going to save you the initialization time.

Hi Nico,
It doesn’t do anything, but from this post I believe it will incur the start-up cost rather than it falling on the first Malloc.

It takes 6 seconds for this job to execute. The timings given by MPI_Wtime (which are in seconds) seem to match this. The timings given by the cudaEvent timer would suggest it should only take a few milliseconds, which is incorrect.

Frustrating…

Oh yes, I gave up on that idea once I saw the timing results from it. I am now trying to understand the other features of the results I got.

Switching over to MPI_Wtime is causing this problem. Move it out of the equation…

Use either QueryPerformanceCounter or “rdtsc” instruction to profile time.

And tie that thread to a core, because a wakeup after a sleep() can potentially schedule it onto some other core. I somehow doubt that it does, though.
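(rdtsc can be read with a bit of inline assembly; a gcc-style sketch, and note it returns CPU cycles, not seconds:)

#include <stdio.h>

/* read the CPU time-stamp counter */
static unsigned long long rdtsc(void){
	unsigned int lo, hi;
	__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
	return ((unsigned long long)hi << 32) | lo;
}

int main(void){
	unsigned long long t0 = rdtsc();
	/* ... the call being timed ... */
	unsigned long long t1 = rdtsc();
	printf("elapsed: %llu cycles\n", t1 - t0);
	return 0;
}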

Or just use gettimeofday. It has a 10 microsecond resolution and is a real time clock not synced to a specific CPU core. It really makes life easier.
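Something like this (a sketch):

#include <stdio.h>
#include <sys/time.h>

int main(void){
	struct timeval t0, t1;

	gettimeofday(&t0, NULL);
	/* ... the call being timed ... */
	gettimeofday(&t1, NULL);

	double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
	printf("elapsed: %f us\n", us);
	return 0;
}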

Hi MisterAnderson42,

As you mentioned, gettimeofday has a resolution of only 10 microseconds, which is too coarse for most of the timings I was doing in this experiment.

Today I did some new timings using a high resolution timer. It appears that you were correct when you stated that creating the cudaEvents would be where the initialisation would take place.

I am using a different system for these timings, which only has a single core, so I don’t have to worry about the multi-core issues.
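(I haven’t included the timer code itself; as a stand-in, something like clock_gettime(CLOCK_MONOTONIC) on Linux produces nanosecond readings of this kind, wrapped around each call the same way as the events were:)

#include <stdio.h>
#include <time.h>

int main(void){
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	/* ... the call being timed ... */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	printf("elapsed: %e ns\n", ns);   /* nanoseconds */
	return 0;
}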

Timings from the cudaEvent timer first:
Free 0: 0.004512
Malloc 1: 0.018144
Malloc 2: 0.004448
Malloc 3: 0.008544
Free 2: 0.022560
Malloc 4: 0.010656
Free 1: 0.007616
Free 3: 0.007968
Free 4: 0.008000

When I use the high resolution timer I am also able to time the creation of the cudaEvents:

Creating first cudaEvent: 6.133693e+07
Creating second cudaEvent: 5.396000e+03
Free 0: 6.088000e+03
Malloc 1: 1.828800e+04
Malloc 2: 4.392000e+03
Malloc 3: 8.340000e+03
Free 2: 2.394000e+04
Malloc 4: 1.064400e+04
Free 1: 4.980000e+03
Free 3: 4.672000e+03
Free 4: 4.784000e+03

If I remove the cudaEvent stuff, so now the Free(0) is the first cuda function call, then the result from the high resolution timer is:

Free 0: 6.201280e+07
Malloc 1: 2.321200e+04
Malloc 2: 4.496000e+03
Malloc 3: 8.556000e+03
Free 2: 2.524800e+04
Malloc 4: 1.053200e+04
Free 1: 5.368000e+03
Free 3: 5.024000e+03
Free 4: 5.124000e+03

This timer appears to be returning nanoseconds, so the initialisation takes approximately 60ms, which seems to be about what other people reported.

Note that the first Malloc still takes about four times longer than the second.

With the sleep() before the cudaFree(0) and the cudaEvent stuff removed:
(test repeated several times, with consistent results)
Free 0: 6.186682e+07
Malloc 1: 1.901200e+04
Malloc 2: 4.244000e+03
Malloc 3: 8.500000e+03
Free 2: 2.534400e+04
Malloc 4: 1.088000e+04
Free 1: 5.352000e+03
Free 3: 4.896000e+03
Free 4: 4.860000e+03

These are similar to without the sleep() being there.

With the sleep() between the cudaFree(0) and the first cudaMalloc:
(again, repeated with consistent results)
Free 0: 5.769475e+07
Malloc 1: 2.444400e+04
Malloc 2: 1.856000e+03
Malloc 3: 5.596000e+03
Free 2: 2.129200e+04
Malloc 4: 6.956000e+03
Free 1: 1.896000e+03
Free 3: 1.548000e+03
Free 4: 1.572000e+03

Now the second Malloc and the last three Frees take about half of the time compared to when the sleep() isn’t there.

It would be interesting to know what is causing this, but a few microseconds isn’t terribly important, so I’m not going to worry about it too much.

It is hard to read what those e+07, e+04 values actually mean in milliseconds.

btw, it looks like you have inferred there are still some unexplained delays.

I think they could be anything… Could be related to driver code, the Linux kernel, bottom-half scheduling and whatnot… I don’t think there is value in pursuing it…

but it is your call, your chase… Good Luck!

Yes, but I didn’t want to alter the output that I received. The timings appear to be in nanoseconds, so the e+03, etc. are with respect to 1e-9 seconds; therefore 1e+03 would be 1e-6 seconds, which is 0.001 milliseconds, I think.