I am considering the following scenario: an interrupt service routine, locked to one core of the host CPU, transfers data from a PCIe DAQ card at 60 kHz into a ring buffer in the device memory of a single Tesla (either by copying into host RAM first and then doing a hostToDevice transfer or, preferably, through some DMA magic). Two application threads run on the host, each locked to one CPU core, and each working on a different aspect of the same global ring buffer on that single Tesla. Is it possible for each thread to launch kernel functions in parallel, targeted of course at different GPUs on the Tesla, e.g. thread1 uses GPUs 0-63, thread2 uses GPUs 64-127, etc., i.e. can the threads share a Tesla, each using only a portion of it?
Hints and suggestions are highly appreciated.
Are you talking about a single Tesla card, like the C1060? As far as CUDA is concerned, that is one monolithic device, and only one kernel can run at a time on it. You cannot partition the stream processors into subsets and run different kernels on each subset.
If you are talking about the S1070, that is actually four C1060 cards in a rackmount case, and appears to the driver as four separate devices. Each of those devices can be used independently.
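To check how many devices the driver actually exposes on a given machine, something like the following minimal sketch should work. It is written against the current CUDA runtime API (not necessarily the version discussed in this thread) and only prints what the runtime reports; the output format is my own choice.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Ask the runtime how many CUDA devices are visible to this process.
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    // Print the name and multiprocessor count of each device.
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
            continue;  // skip devices whose properties cannot be queried
        }
        std::printf("device %d: %s, %d multiprocessors\n",
                    dev, prop.name, prop.multiProcessorCount);
    }
    return 0;
}
```

On an S1070 this would be expected to list four entries, one per C1060-class device, and a single C1060 card would show up as one entry.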
Thanks for the clarification. Yes, implicitly I was thinking about a C1060 card. So that means I would need one Tesla device for every host thread that wants to run kernels concurrently with the others.
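A rough sketch of that one-device-per-host-thread arrangement, assuming the current CUDA runtime API and C++11 threads. The kernel `scale`, the function `worker`, and the buffer size are hypothetical placeholders, not anything from the actual application:

```cuda
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel: scales a buffer in place.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Work done by one host thread on its own device (hypothetical helper).
void worker(int dev, int n) {
    // Bind this host thread to device `dev`; later calls target that device.
    if (cudaSetDevice(dev) != cudaSuccess) {
        std::fprintf(stderr, "thread %d: cudaSetDevice failed\n", dev);
        return;
    }

    float *d_buf = nullptr;
    if (cudaMalloc(&d_buf, n * sizeof(float)) != cudaSuccess) {
        std::fprintf(stderr, "thread %d: cudaMalloc failed\n", dev);
        return;
    }
    cudaMemset(d_buf, 0, n * sizeof(float));

    // Launch the kernel on this thread's device and wait for it to finish.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_buf, n, 2.0f);
    cudaError_t err = cudaDeviceSynchronize();
    std::printf("device %d: %s\n", dev, cudaGetErrorString(err));

    cudaFree(d_buf);
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::fprintf(stderr, "no CUDA device available\n");
        return 1;
    }

    // One host thread per device, each running kernels independently.
    std::vector<std::thread> pool;
    for (int dev = 0; dev < count; ++dev) {
        pool.emplace_back(worker, dev, 1 << 20);
    }
    for (auto &t : pool) t.join();
    return 0;
}
```

Note that each device has its own memory, so in this arrangement the ring buffer data would have to be copied to every device that needs it; the threads would no longer share a single buffer in one device's memory.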
It doesn’t sound like this applies to your situation, but you can have two threads (or two different processes) use the same CUDA device. The kernel calls will be time-sliced, though, which will increase latency and might have no benefit if your usage of the GPU is near 100% of the time with a single process.
I bring this up because I recently noticed that the efficiency cost of having a CUDA device constantly switching between two processes is much lower than I remember. (Hadn’t checked this since pre-1.0.) I have a program with a GPU duty cycle of about 30%, and was pleasantly surprised to see that two processes could share the GPU with negligible slow-down compared to each process running alone.