How to share GPU memory between different host threads?

I need to interface my GPU code with the main LabView program. The main difficulty is that LabView can call my code from any host thread (I don’t want to use the “run in UI thread” option), which creates multiple CUDA runtime contexts, and those contexts can’t share GPU memory with each other.

Is there a way to avoid creating multiple CUDA contexts and use a single context from all host threads simultaneously, so that GPU memory allocated in one host thread can be used in another?
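
Roughly, the pattern that breaks looks like this (a minimal standalone sketch, with std::thread standing in for LabView’s thread pool; names are illustrative):

#include <cuda_runtime.h>
#include <thread>
#include <cstdio>

float* d_buf = 0;   // device pointer I want to share between host threads

int main()
{
    // Thread A: allocate device memory (implicitly creates context A).
    std::thread a([] { cudaMalloc((void**)&d_buf, 1024 * sizeof(float)); });
    a.join();

    // Thread B: a different host thread gets a different runtime context (B),
    // so the pointer from context A is rejected instead of copied.
    std::thread b([] {
        float h_buf[1024];
        cudaError_t err = cudaMemcpy(h_buf, d_buf, sizeof(h_buf),
                                     cudaMemcpyDeviceToHost);
        std::printf("memcpy from other thread: %s\n", cudaGetErrorString(err));
    });
    b.join();
    return 0;
}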

The Ocelot GPU backend supports this natively; there is a chance that you could just link your program against Ocelot and it would just work. That said, Ocelot’s Fermi support is a bit flaky, and you may run into problems with compute capability 2.x devices until we finish adding the Fermi-specific updates in about a month or so.

Alternatively, you could take a look at the Ocelot source code and see how we do this with the driver API. Other approaches involve GPU worker threads that process commands generated by other user threads.

Ah, I see. You create all CUDA runtime contexts to use the same device 0:

cuda::HostThreadContext::HostThreadContext(): selectedDevice(0)

and bind the selected device with cuCtxPushCurrent():

cuda::HostThreadContext& cuda::CudaRuntime::_bind() {
    ...
    device.select();    // calls cuCtxPushCurrent()
    return thread;
}

That’s how I thought it was done, but I was afraid a device context could only be bound to one thread at a time, as suggested by the comment “CUDA Contexts have a one-to-one correspondence with host threads” in threadmigration.cpp from the SDK.

Did NVIDIA add this flexibility recently or could a device context always be used from multiple host threads?

This has worked since version 2.1 of the toolkit as far as I know. It may have always been in the driver, but I’ve never tried it with anything older than that.

Apparently, CUDA driver contexts can’t be bound to more than one thread at a time, as I found when I tried it. The manual for cuCtxPushCurrent() says “The context must be ‘floating,’ i.e. not attached to any thread.”

I see Ocelot shares a single context by exclusively acquiring it at the beginning of each API function with lock() and releasing it at the end.
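
If I’m reading it right, the pattern boils down to something like this (my own sketch, not Ocelot’s actual code): one driver context shared by every host thread, guarded by a mutex and pushed/popped around each API call.

#include <cuda.h>
#include <mutex>

static CUcontext g_ctx;          // single context, created once at startup
static std::mutex g_ctxLock;     // serializes access from all host threads

struct ScopedBind {
    ScopedBind()  { g_ctxLock.lock(); cuCtxPushCurrent(g_ctx); }
    ~ScopedBind() { CUcontext c; cuCtxPopCurrent(&c); g_ctxLock.unlock(); }
};

CUresult sharedMemAlloc(CUdeviceptr* ptr, size_t bytes)
{
    ScopedBind bind;                // acquire the lock and push the context
    return cuMemAlloc(ptr, bytes);  // allocation is visible to every thread
}                                   // pop and release on scope exit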

But I’m worried about hurting parallelism. I need to be able to call cudaMemcpy and execute kernels concurrently, so if I lock the context in one thread to launch kernels, will I be able to release it in time for a second thread to launch a cudaMemcpy? I suppose for a few kernel calls the overlap will be good, but what about more than 16? Will I have to litter my code with preemption points (release the context, reacquire it) to work around this?

NVIDIA, why not allow GPU memory to be shared between contexts? If I remember correctly, OpenGL works similarly to CUDA: a rendering context can only be active in one thread at a time, but you can have multiple contexts that share textures.

The only thing that requires the context to be held by the current thread is launching work. If you launch 20 kernels, pop the context, push the context from another thread, and launch 20 memcpys, that works fine and you’ll get concurrency.
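
In driver API terms, the pattern is roughly this (a sketch only; kernel setup such as block shape and parameters is assumed to happen elsewhere):

#include <cuda.h>

void threadA(CUcontext ctx, CUfunction kernel)
{
    cuCtxPushCurrent(ctx);             // attach the context to this thread
    for (int i = 0; i < 20; ++i)
        cuLaunchGrid(kernel, 256, 1);  // asynchronous: returns immediately
    CUcontext popped;
    cuCtxPopCurrent(&popped);          // context is now "floating" again
}

void threadB(CUcontext ctx, CUdeviceptr dst, const void* src, size_t bytes)
{
    cuCtxPushCurrent(ctx);             // pick up the floating context here
    for (int i = 0; i < 20; ++i)
        cuMemcpyHtoD(dst, src, bytes); // queue the copies from this thread
    CUcontext popped;
    cuCtxPopCurrent(&popped);
}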

Ideally, driver API calls that cause interaction with the device (launching kernels, memcpy) should always be asynchronous, and everything else should be fast enough not to matter. This is the case right now for kernel launches (to a limited degree), but not for memcpy, due to the semantics of the programming model. Memcpys will hold the lock significantly longer than kernel launches, because they need to copy into a host buffer and the semantics of CUDA treat them like atomic operations. These could be optimized by doing overlapping range detection, but Ocelot doesn’t do this, and I seriously doubt that the CUDA runtime does it either.
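
For comparison, the driver API does expose an asynchronous copy, but it requires page-locked host memory, which is part of why an ordinary pageable-memory memcpy cannot simply be made asynchronous. A minimal sketch (names are mine; the context is assumed current and 'pinned' is assumed to come from cuMemAllocHost):

#include <cuda.h>

// Queue the copy and return; a lock guarding the shared context would only
// need to be held for the duration of this call.
void queueCopy(CUdeviceptr dst, const void* pinned, size_t bytes, CUstream stream)
{
    cuMemcpyHtoDAsync(dst, pinned, bytes, stream);
}

// Wait only at the point where the transfer must have completed.
void waitForCopies(CUstream stream)
{
    cuStreamSynchronize(stream);
}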

Ocelot’s implementation is a compromise between implementation complexity and performance. It pays for a bind/unbind per call and relies on the driver to actually perform kernel launches asynchronously (which it should according to tmurray).

A better implementation would be to bind a single worker thread to each device with a single context, convert all CUDA calls into IR objects, and implement a split-phase protocol between application threads and the worker threads, so that making a CUDA API call would push an IR object representing the call onto the queue of the worker thread that actually performs it. The IR objects could be optimized to include only the bare minimum needed to perform the call. The locking protocol on the queue could be tuned to optimize for call throughput or latency. This would allow as many operations to be queued asynchronously as you wanted, enable sharing of device allocations between application threads, and reduce lock contention in the case where multiple application threads were using multiple devices.

I was time constrained though, so I didn’t do this.
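
For what it’s worth, the worker I had in mind would look roughly like the following (an untested sketch; all names are hypothetical). Application threads enqueue small command objects and return immediately; one worker thread per device owns the context and replays the commands in order.

#include <cuda.h>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <queue>
#include <functional>

class DeviceWorker {
public:
    explicit DeviceWorker(CUcontext ctx)
        : _ctx(ctx), _done(false), _thread(&DeviceWorker::run, this) {}

    ~DeviceWorker() {
        { std::lock_guard<std::mutex> lock(_mutex); _done = true; }
        _wake.notify_one();
        _thread.join();
    }

    // Called from any application thread: queue a command and return.
    void enqueue(std::function<void()> command) {
        { std::lock_guard<std::mutex> lock(_mutex); _queue.push(std::move(command)); }
        _wake.notify_one();
    }

private:
    void run() {
        cuCtxPushCurrent(_ctx);            // the worker thread owns the context
        std::unique_lock<std::mutex> lock(_mutex);
        while (!_done || !_queue.empty()) {
            if (_queue.empty()) { _wake.wait(lock); continue; }
            std::function<void()> cmd = std::move(_queue.front());
            _queue.pop();
            lock.unlock();
            cmd();                         // perform the real driver API call
            lock.lock();
        }
        CUcontext popped;
        cuCtxPopCurrent(&popped);
    }

    CUcontext _ctx;
    bool _done;
    std::mutex _mutex;
    std::condition_variable _wake;
    std::queue<std::function<void()>> _queue;
    std::thread _thread;
};

// Usage from any application thread, e.g.:
//   worker.enqueue([=] { cuMemcpyHtoD(devPtr, hostPtr, bytes); });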

I would guess that this is because contexts are used to implement memory protection between different applications, and that switching contexts requires programming the MMU.

My attitude towards this is that having a capability is more important than having high performance when doing it. I would start by doing something simple like Ocelot’s implementation and then benchmarking it. If it is too slow, or has too little overlap, go back and try something more complicated.