I’ve started playing with CUDA recently for a project for Oracle and I’ve come up with a bunch of questions. Some are about CUDA directly, and one is about CUDA on Linux. Instead of making a bunch of posts in different places I figured I would ask them all here:
First off, is there any way to prevent the big (~700 ms on an 8800 GTX) delay when starting a CUDA program? We would be running the same program many times on different data, and the startup time is a killer. Is there some way of telling the video card not to unload the program when it is done, so we can simply upload one new variable to the card and restart the program from the beginning?
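To make that concrete, the workaround I keep sketching is one long-lived host process that creates the context once and then loops over data sets, re-uploading only the new input each time. (The `process` kernel and the sizes below are just placeholders, not our real code.)

```
// Sketch: pay the context-creation cost once, then reuse the same device
// allocation for every batch of input instead of starting a fresh program.
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void process(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;   // stand-in for the real computation
}

int main()
{
    const int n = 1 << 20;
    float *h_data = (float *)malloc(n * sizeof(float));
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));   // context comes up here, once

    for (int run = 0; run < 100; ++run)                // many data sets, one process
    {
        // ... fill h_data with the next data set ...
        cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
        process<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
        // ... consume results ...
    }

    cudaFree(d_data);
    free(h_data);
    return 0;
}
```

That only helps if everything can live in one process, though, which is why I'm asking whether the driver can keep the program loaded between separate runs.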
Along the same line of thought, suppose there is a program that deals with large amounts of memory. This hypothetical program could be running on the card alongside a second program (like an OS kernel) that takes care of swapping memory in and out. The idea is to let the main program keep running while the secondary one swaps out memory/results/new input. Is there some way to do safe IPC between CUDA programs running on the GPU? I was thinking of running two processes on the CPU and having their CUDA programs share a state variable so they can communicate. I know we would need compute capability 1.1 hardware to implement locks, but from what I’ve read each CUDA program has its own memory space, so sharing variables between programs is not possible at this point. Is this correct? Can this be fixed?
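Just to illustrate the kind of lock I had in mind, here is a toy kernel using the compute 1.1 global-memory atomics. As far as I can tell this only works within a single program's memory space, which is exactly the limitation I'm asking about; it's launched with a single thread to keep the example simple.

```
// Toy spin-lock on a global-memory flag: 0 = free, 1 = held.
// The host would initialize *lock to 0 before the first launch.
__global__ void update_state(int *lock, volatile int *shared_state, int new_value)
{
    // acquire: spin until we atomically swap 0 (free) for 1 (held)
    while (atomicCAS(lock, 0, 1) != 0)
        ;
    *shared_state = new_value;      // critical section: publish new state
    __threadfence();                // flush the write before releasing
    atomicExch(lock, 0);            // release
}
```

If two separate CUDA programs could see the same `lock` and `shared_state`, this is roughly how I would coordinate them.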
For my next question I’ll need to set up some background:
I’ve got a few machines set up with CUDA and I’ve noticed some driver oddness. On one machine (Red Hat Enterprise Linux 5, 8800 GTS (320 MB), 16x PCI-E, dual-core AMD Opteron) we are able to measure system time from CUDA programs using getrusage, and wall-clock time using gettimeofday. On another system (SUSE 10.2, 8800 GTX (768 MB), 8x PCI-E, Intel Pentium D) we are not able to measure the system time of a running CUDA program using getrusage (it always returns 0). gettimeofday does work on that system, and returns the same numbers as the user time from getrusage.
Curiously, the system with the slower PCI-E bus takes about half as long to start an empty CUDA program, by both the wall-clock and the user-time measurements, compared with the system with the faster bus. I tried swapping the GPUs between the two systems but still got the same results. Both systems are running the CUDA 1.0 release.
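For reference, this is roughly how we are taking the measurements. It is simplified (the real code times the real program, not just a dummy allocation), but the getrusage/gettimeofday wrapping is the same:

```
// Wall-clock via gettimeofday, user/system time via getrusage, wrapped
// around the first CUDA call so the measurement includes context creation.
#include <cstdio>
#include <sys/time.h>
#include <sys/resource.h>
#include <cuda_runtime.h>

static double tv_to_sec(struct timeval tv)
{
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main()
{
    struct timeval wall0, wall1;
    struct rusage  ru0, ru1;

    gettimeofday(&wall0, NULL);
    getrusage(RUSAGE_SELF, &ru0);

    // "Empty" CUDA program: just force the context to come up.
    int *d_dummy;
    cudaMalloc((void **)&d_dummy, sizeof(int));
    cudaFree(d_dummy);

    getrusage(RUSAGE_SELF, &ru1);
    gettimeofday(&wall1, NULL);

    printf("wall   %.3f s\n", tv_to_sec(wall1) - tv_to_sec(wall0));
    printf("user   %.3f s\n", tv_to_sec(ru1.ru_utime) - tv_to_sec(ru0.ru_utime));
    printf("system %.3f s\n", tv_to_sec(ru1.ru_stime) - tv_to_sec(ru0.ru_stime));
    return 0;
}
```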
Now for the questions:
Does anyone have any idea why we can measure the system time on one machine and not on the other? Also, assuming that the times are being measured correctly, why is the startup time so much faster on the machine with the slower PCI-E bus?
Sorry for the really long post, and thanks in advance for any replies.