I am porting a portion of a bioinformatics application to GPUs. The application runs on a large-scale supercomputer, so I am dealing with multiple ranks (processes) and multiple GPUs per node. In my scenario, several ranks launch kernels on the same GPU. This works fine as long as the total global memory used by all ranks combined stays below the GPU's capacity, but once that limit is exceeded the application crashes with a GPU out-of-memory error. My question is: what is the best way to handle this? Is there something that allows queuing of memory allocation calls and kernel launches? I understand that NVIDIA's MPS handles kernel launches by queuing them until resources become available, but is it possible to do the same for cudaMalloc calls?