CPU Cores Per GPU

Hi Everyone,

I’m looking to create a GPU server for people to use. If I install 4 GPUs on a system, but have 8 CPU cores, could 8 simultaneous programs (threads) access the 4 GPUs concurrently? Thanks!

Only if the card supports Hyper-Q. At the moment only compute capability 3.5 devices (Tesla K20X and GTX Titan) support it.
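If you are not sure which compute capability a card reports, a quick way to check is to query the device properties with the runtime API. A minimal sketch (the device names and count obviously depend on your machine):

```cpp
// List every device's name and compute capability so you can see
// which cards are cc 3.5 (and therefore support Hyper-Q).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("GPU %d: %s, compute capability %d.%d\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```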

It is important to note that the driver can time slice a GPU between multiple CUDA programs at the same time. The entire GPU is used to execute a kernel to completion from one program, then it switches and executes a kernel from the next program, and so on. For programs that have significant CPU portions, this mode of operation can be useful. If your programs are basically 100% CUDA, then time slicing will not improve throughput.
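As a rough illustration of that last point, here is a sketch of the kind of program that time slicing helps: each iteration runs a short kernel, waits for it, and then does CPU-side work, so the GPU sits idle for part of every iteration and the driver can slot in kernels from another process. The kernel, sizes, and iteration count are made up for the example:

```cpp
// Alternate a short kernel with CPU work. While this process is busy on
// the CPU, the GPU is free and the driver can run kernels from another
// process on it (interleaved in time, not simultaneously).
#include <cuda_runtime.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void cpuWork(float *host, int n) {
    for (int i = 0; i < n; ++i) host[i] += 1.0f;   // stand-in for real CPU work
}

int main() {
    const int n = 1 << 20;
    float *d = 0;
    float *h = new float[n]();
    cudaMalloc(&d, n * sizeof(float));

    for (int iter = 0; iter < 100; ++iter) {
        scale<<<(n + 255) / 256, 256>>>(d, n);
        cudaDeviceSynchronize();   // GPU is now idle for this process
        cpuWork(h, n);             // ...while the CPU part runs
    }

    cudaFree(d);
    delete[] h;
    return 0;
}
```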

Seibert, are you describing how Hyper-Q would work? Or would time slicing work with any older driver? I ask because with nvidia-smi there are two modes: default and compute exclusive. If using the default mode, would multiple programs be able to access a GPU? I understand there would be a performance hit, but it seems easier to administer the server if that were the case.

Without Hyper-Q the kernels are run one after another. If the programs have enough work to do on the CPU, then the cards would be free to execute kernels from different programs.

Yes, I’ve tried using multiple processes to increase occupancy of the GPU (it would save me from having to remove all the global variables in the crufty code), but found that kernels from different processes (CUDA contexts) can’t run concurrently (at least on Fermi). I later found this stated in the manual:

Does anyone know if this is also true for Kepler?

Compute capability 3.5 supports Hyper-Q. The cc 3.5 cards are the Tesla K20 and the Titan.

Just to clarify my original point, the time slicing in default mode is for the entire GPU. Kernels in different contexts do not run simultaneously, but will be interleaved in time. There are use cases where this is good enough, so I wanted to point it out.

When I run 8 CUDA programs at once, does the CUDA driver automatically direct each one to the first available GPU? If not, how would I be able to distribute the programs evenly across all GPUs?

No. This will not happen automatically.

There are two things here. The first is how many programs can use a GPU at the same time, concurrently; that is Hyper-Q.

The second is that you need a queuing system (a resource manager) if you want automatic distribution of jobs. This is more difficult to organize. Such a queuing system would distribute 4 programs, one to each card, and then the next 4 in the same way, until you have 2 programs running per GPU.

Hyper-Q is not the same thing as a queuing system, which is a general way of handling jobs on clusters.
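Since nothing does the distribution automatically, each program (or whatever launches it) has to pick its device explicitly. One simple convention, sketched below, is to pass a job index on the command line and map it onto the available GPUs round-robin; the command-line argument is just an assumption for this example.

```cpp
// Pick a GPU by mapping a job index round-robin onto the devices present.
// Jobs 0-3 land on GPUs 0-3, job 4 wraps back to GPU 0, and so on.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    int jobIndex = (argc > 1) ? atoi(argv[1]) : 0;   // hypothetical job index argument

    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    int device = jobIndex % deviceCount;
    cudaSetDevice(device);
    printf("Job %d running on GPU %d of %d\n", jobIndex, device, deviceCount);

    // ... the rest of the CUDA work for this program goes here ...
    return 0;
}
```

A real queuing system (e.g. Torque or SLURM) would do this bookkeeping for you and also keep track of which GPUs are already busy.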

If I have 8 random programs (1 program for each core), is there a way in CUDA (programmatically) to select the least used GPU on the system? Or would I have to rely solely on a scheduling engine?

For a simple static assignment you could just set CUDA_VISIBLE_DEVICES to the number of a GPU at login time.
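If you do want to do it programmatically instead (rather than with something like `CUDA_VISIBLE_DEVICES=2 ./my_program` in the launch script), the runtime API has no direct "least used GPU" query that I'm aware of. A common workaround is a heuristic such as picking the device with the most free memory, roughly like this (no error checking, and note that a busy GPU can still have plenty of free memory):

```cpp
// Heuristic: choose the GPU with the most free memory as a stand-in for
// "least used". This is only an approximation of actual load.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    int best = 0;
    size_t bestFree = 0;
    for (int i = 0; i < deviceCount; ++i) {
        cudaSetDevice(i);                  // cudaMemGetInfo reports on the current device
        size_t freeMem = 0, totalMem = 0;
        cudaMemGetInfo(&freeMem, &totalMem);
        if (freeMem > bestFree) {
            bestFree = freeMem;
            best = i;
        }
    }

    cudaSetDevice(best);
    printf("Selected GPU %d (%zu MB free)\n", best, bestFree >> 20);
    return 0;
}
```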