Time measurement, callbacks, and IPC

Hello,

I’ve started playing with CUDA recently for a project for Oracle and I’ve come up with a bunch of questions. Some are about CUDA directly, and one is about CUDA on Linux. Instead of making a bunch of posts in different places, I figured I would ask them all here:

First off, is there any way to prevent the big (~700 ms on an 8800GTX) time delay when starting a CUDA program? We would be running the same program many times on different data, and the startup time is a killer. Is there some way of telling the video card hardware not to unload the program when it is done, so we can simply upload one new variable to the card and restart the program from the beginning?

In the same line of thought, let’s say there is a program that deals with large amounts of memory. This hypothetical program could be running on the card alongside another program (like an OS kernel) that would take care of swapping memory in and out. The idea here would be to let the main program keep running while the secondary one swaps out memory/results/new input. Is there some way to do safe IPC between CUDA programs running on the GPU? I was thinking of running two processes on the CPU and having their CUDA programs share some state variable so they can communicate. I know we would need compute capability 1.1 hardware to implement locks, but from what I’ve read each CUDA program has its own memory space, so sharing variables between programs is not possible at this point. Is this correct? Can this be fixed?

For my next question I’ll need to set up some background:

I’ve got a few machines set up with CUDA and I’ve noticed some driver oddness. On one machine (Red Hat Enterprise Linux 5, 8800GTS (320 MB), 16x PCI-E, dual-core AMD Opteron) we are able to measure system time from CUDA programs using getrusage, and wall-clock time using gettimeofday. On another system (SUSE 10.2, 8800GTX (768 MB), 8x PCI-E, Intel Pentium D) we are not able to measure the system time of a running CUDA program using getrusage (it always returns 0). gettimeofday does work on that system, and returns the same numbers as the user time from getrusage.
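
For concreteness, the measurement looks roughly like this (a simplified sketch rather than our exact code; the empty kernel name, the single launch, and the <<<1, 1>>> configuration are just stand-ins):

    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    __global__ void empty_kernel(void) { }   /* stand-in for the real kernel */

    int main(void)
    {
        struct rusage  ru_start, ru_end;
        struct timeval wc_start, wc_end;

        getrusage(RUSAGE_SELF, &ru_start);
        gettimeofday(&wc_start, NULL);

        empty_kernel<<<1, 1>>>();   /* launch being measured; no synchronize, matching how we time it */

        gettimeofday(&wc_end, NULL);
        getrusage(RUSAGE_SELF, &ru_end);

        /* user time, system time, and wall-clock time around the launch */
        printf("user:   %ld us\n",
               (ru_end.ru_utime.tv_sec  - ru_start.ru_utime.tv_sec) * 1000000L +
               (ru_end.ru_utime.tv_usec - ru_start.ru_utime.tv_usec));
        printf("system: %ld us\n",
               (ru_end.ru_stime.tv_sec  - ru_start.ru_stime.tv_sec) * 1000000L +
               (ru_end.ru_stime.tv_usec - ru_start.ru_stime.tv_usec));
        printf("wall:   %ld us\n",
               (wc_end.tv_sec  - wc_start.tv_sec) * 1000000L +
               (wc_end.tv_usec - wc_start.tv_usec));
        return 0;
    }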

The system with the slower PCI-E bus takes about half as long to start an empty CUDA program as the system with the faster bus, on both the wall-clock and the user-time measurements. I tried swapping the GPUs between the systems but still got the same results. Both systems are running the 1.0 release.

Now for the questions:

Does anyone have any idea why we can measure the system time on one machine and not on the other? Also, assuming that the times are being measured correctly, why is the startup time so much faster on the machine with the slower PCI-E bus?

Sorry for the really long post, and thanks in advance for any replies.

-Jeff

The first invocation of a kernel takes longer as the driver optimizes the code for the particular hardware. Subsequent invocations should be much faster. You’ll notice that a number of the SDK samples do a “warm up” run before timing kernels.
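
Something along these lines, for instance (just a sketch, not SDK code; my_kernel, the 256-thread launch, and the gettimeofday-based timing are placeholder choices):

    #include <stdio.h>
    #include <sys/time.h>

    __global__ void my_kernel(float *out)
    {
        out[threadIdx.x] = (float)threadIdx.x;   /* write to global memory */
    }

    static long elapsed_us(struct timeval *a, struct timeval *b)
    {
        return (b->tv_sec - a->tv_sec) * 1000000L + (b->tv_usec - a->tv_usec);
    }

    int main(void)
    {
        float *d_out;
        struct timeval t0, t1;

        cudaMalloc((void **)&d_out, 256 * sizeof(float));

        /* warm-up: the first launch includes one-time driver work, so don't time it */
        my_kernel<<<1, 256>>>(d_out);
        cudaThreadSynchronize();

        /* timed run */
        gettimeofday(&t0, NULL);
        my_kernel<<<1, 256>>>(d_out);
        cudaThreadSynchronize();
        gettimeofday(&t1, NULL);

        printf("timed launch: %ld us\n", elapsed_us(&t0, &t1));

        cudaFree(d_out);
        return 0;
    }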

Regarding your second issue, please note that SUSE is not a supported Linux distribution for CUDA. I know a number of people have gotten it to work, but there may be issues.

Paulius

P.S. Please post your questions separately. That makes it easier for other readers to find answers/related issues later on when they browse forums.

The time was the same for basically all runs. I did forget to mention that we were measuring 10k starts at once (i.e., a loop that runs 10k times, with a start of an empty CUDA program inside it, measuring the time for the whole loop).

The page at http://developer.nvidia.com/object/cuda.html has downloads for OpenSUSE 10.2, which is what I am using. Is that not a supported OS?

My mistake on SUSE then. Not really sure what could be causing the problem. Does the getrusage function work as expected with non-CUDA programs on that machine?

700 ms sounds way too long for a startup. I’ll have to check whether something strange happens if the kernel is empty. In the meantime, I’d say try timing some really simple kernels (just make sure that you write some output to global memory, otherwise the compiler might optimize your kernel to be empty again). For some small kernels of my own I was getting times under 100 microseconds per call.
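
For example, something as small as this should be enough to keep the compiler from throwing the body away (names made up):

    /* trivial, but not empty: the store to global memory keeps the
       compiler from optimizing the kernel body away */
    __global__ void tiny_kernel(float *out)
    {
        out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;
    }

    /* launched with e.g. tiny_kernel<<<1, 64>>>(d_out); where d_out is a
       cudaMalloc'd buffer of at least 64 floats */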

Are you timing cudaThreadSynchronize() for each of the 10K calls? If you are, do it only once - start timer, make 10K calls, call cudaThreadSynchronize, stop timer. It may be possible that cudaThreadSynchronize is imposing overhead. Calling it just once will still get you what you’re looking for (even more accurately), since cudaThreadSynchronize won’t return until all previously invoked kernels have completed execution.
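
In other words, roughly like this (a sketch; substitute your own kernel and launch configuration for the placeholder ones here):

    #include <stdio.h>
    #include <sys/time.h>

    #define NUM_LAUNCHES 10000

    __global__ void my_kernel(float *out)
    {
        out[threadIdx.x] = (float)threadIdx.x;
    }

    int main(void)
    {
        float *d_out;
        struct timeval t0, t1;
        long us;
        int i;

        cudaMalloc((void **)&d_out, 256 * sizeof(float));

        /* warm-up launch, not timed */
        my_kernel<<<1, 256>>>(d_out);
        cudaThreadSynchronize();

        gettimeofday(&t0, NULL);
        for (i = 0; i < NUM_LAUNCHES; i++)
            my_kernel<<<1, 256>>>(d_out);   /* launches are queued asynchronously */
        cudaThreadSynchronize();            /* one synchronize after all launches */
        gettimeofday(&t1, NULL);

        us = (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec);
        printf("%d launches: %ld us total, %.1f us per launch\n",
               NUM_LAUNCHES, us, (double)us / NUM_LAUNCHES);

        cudaFree(d_out);
        return 0;
    }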

Paulius

The startup time is 700 milliseconds for 10k runs; that is the total for all runs. This works out to about 70 microseconds of startup per run, which is quite a lot considering that you can execute roughly 140,000 CPU instructions in that time (at 2 GHz, single core; double that for dual core…).

The code I’m timing is just a completely empty __global__ function (with 4 pointer inputs that I will use later), inside a loop that does nothing else. I’m not calling synchronize anywhere. I’ll add some trivial code to my kernel and run it again to see if it comes out any better…

Other than in CUDA programs, the getrusage call works as intended on the machine where I see no system time. I believe this is a bug in the NVIDIA driver at this point. It makes it impossible on that machine to separate the time spent in the video card and driver from the time spent in the user-land wrapper.

I’m going to move my other questions to another post so that they get more visibility.

-Jeff

Must be a broken installation on your part. I am also using openSUSE 10.2 (3 GHz Pentium D, 8800 GTX) and getrusage works as expected in CUDA release 1.0 for SuSE 10.2 (32-bit), driver 100.14.11. Since the CUDA calls run in the same thread as the calling code, be sure to use RUSAGE_SELF.

Peter

Hi,

just to make sure I got that warm-up thing right:

When you warm up, some driver optimization is done on the first call of a particular kernel (e.g., deciding where to assign the thread blocks on the GPU, etc.). This means that you cannot just use any generic kernel (a very short one) to “warm up the GPU” and then launch your own “real” kernel.
Is that correct?

Thanks

Yes, that’s correct.

Paulius