I need to interface my GPU code with the main LabView program. The main difficulty is that LabView can call my code from any host thread (I don’t want to use the “run in UI thread” option), which results in multiple CUDA runtime contexts, and those contexts can’t share GPU memory with each other.
Is there a way to avoid creating multiple CUDA contexts and instead use a single context from all host threads, so that GPU memory allocated in one host thread can be used in another?
The Ocelot GPU backend supports this natively; there is a chance that you could just link your program against Ocelot and it would just work. That being said, Ocelot’s Fermi support is a bit flaky, and you may run into problems with compute capability 2.x devices until we finish adding Fermi-specific updates in about a month or so.
Alternatively, you could take a look at the Ocelot source code and see how we do this with the driver API. Other approaches involve GPU worker threads that process commands generated from other user threads.
That’s how I thought it was done, but I was afraid a device context could only be bound to 1 thread at a time, as was suggested by the comment “CUDA Contexts have a one-to-one correspondence with host threads” in threadMigration.cpp from the SDK.
Did NVIDIA add this flexibility recently or could a device context always be used from multiple host threads?
This has worked since version 2.1 of the toolkit as far as I know. It may have always been in the driver, but I’ve never tried it with anything older than that.
Apparently, CUDA driver contexts can’t be bound to more than 1 thread at a time; I tried it. The manual for cuCtxPushCurrent() says “The context must be “floating,” i.e. not attached to any thread”.
I see Ocelot shares a single context by exclusively acquiring it at the beginning of each API function with lock() and releasing it at the end.
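As far as I can tell, the pattern boils down to something like this with the driver API (my own rough sketch of the idea, not Ocelot’s actual code; the names are made up):

```
#include <cuda.h>
#include <mutex>

// Sketch of a lock-per-call shared context. Assumes a single context was
// created once at startup and then popped with cuCtxPopCurrent so that it
// "floats". Illustrative only, not Ocelot's actual implementation.
static CUcontext g_ctx;
static std::mutex g_ctxMutex;

class ScopedContext {
public:
    ScopedContext() {
        g_ctxMutex.lock();
        cuCtxPushCurrent(g_ctx);     // bind the floating context to this thread
    }
    ~ScopedContext() {
        CUcontext popped;
        cuCtxPopCurrent(&popped);    // detach so another thread can bind it
        g_ctxMutex.unlock();
    }
};

// Every wrapped API call holds the context only for its own duration, so any
// host thread can operate on memory allocated by any other thread.
CUresult sharedMemcpyHtoD(CUdeviceptr dst, const void* src, size_t bytes) {
    ScopedContext guard;
    return cuMemcpyHtoD(dst, src, bytes);
}
```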
But I’m worried about hurting parallelism. I need to be able to call cudaMemcpy and execute kernels concurrently, so if I lock a context in 1 thread to launch kernels, will I be able to release the context in time for a 2nd thread to launch a cudaMemcpy? I suppose for a few kernel calls the overlap will be good, but what about > 16? Will I have to litter my code with preemption points (release context, reacquire) to work around this?
NVIDIA, why not allow GPU memory to be shared between contexts? If I remember correctly, OpenGL works similarly to CUDA - a rendering context can only be active in 1 thread, but you can have multiple contexts that share textures.
The only thing that requires the context to be held by the current thread is launching work. If you launch 20 kernels, pop the context, push the context from another thread, and launch 20 memcpys, that works fine and you’ll get concurrency.
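In other words, something along these lines should work (just a rough sketch with the driver API; error checking, module loading, and the signaling that hands the context from one thread to the other are omitted, and the copies assume page-locked memory and a separate stream so they can actually overlap):

```
#include <cuda.h>

// Rough illustration of the sequence described above. Both streams are
// assumed to have been created in this context earlier.

// Thread A: binds the context, queues 20 kernels, then lets go of it.
void threadA(CUcontext ctx, CUfunction kernel, CUstream kernelStream) {
    cuCtxPushCurrent(ctx);
    for (int i = 0; i < 20; ++i)
        cuLaunchKernel(kernel, 1, 1, 1, 256, 1, 1, 0, kernelStream,
                       nullptr, nullptr);
    CUcontext popped;
    cuCtxPopCurrent(&popped);            // context is floating again
}

// Thread B: picks the context up afterwards and queues 20 async copies from
// page-locked memory, which can overlap with the kernels thread A queued.
void threadB(CUcontext ctx, CUdeviceptr dst, const void* pinnedSrc,
             size_t bytes, CUstream copyStream) {
    cuCtxPushCurrent(ctx);
    for (int i = 0; i < 20; ++i)
        cuMemcpyHtoDAsync(dst, pinnedSrc, bytes, copyStream);
    CUcontext popped;
    cuCtxPopCurrent(&popped);
}
```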
Ideally, driver API calls that cause interaction with the device (launching kernels, memcpy) should always be asynchronous, and everything else should be fast enough not to matter. This is the case right now for kernel launches (to a limited degree), but not for memcpy, due to the semantics of the programming model. Memcpys will hold the lock significantly longer than kernel launches, because they need to copy into a host buffer, and the semantics of CUDA treat them like atomic operations. These could be optimized by doing overlapping range detection, but Ocelot doesn’t do this and I seriously doubt that the CUDA runtime does it either.
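To illustrate (a sketch of my own, not Ocelot code): whatever lock guards the shared context is held for the duration of the call, so the call’s blocking behavior is what determines how long it holds the lock.

```
#include <cuda.h>

// Illustrative only: a shared-context lock would be held for the duration of
// each call below, so the call's own blocking behavior is what matters.
void copyExamples(CUdeviceptr dst, const void* pageableSrc,
                  const void* pinnedSrc, size_t bytes, CUstream stream) {
    // Pageable source: the driver stages the data through a host buffer and
    // the call blocks until the whole transfer is done, so a lock around it
    // would be held for the entire copy.
    cuMemcpyHtoD(dst, pageableSrc, bytes);

    // Page-locked source: the call only enqueues the transfer and returns,
    // so a lock around it would be released almost immediately.
    cuMemcpyHtoDAsync(dst, pinnedSrc, bytes, stream);
}
```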
Ocelot’s implementation is a compromise between implementation complexity and performance. It pays for a bind/unbind per call and relies on the driver to actually perform kernel launches asynchronously (which it should according to tmurray).
A better implementation would be to bind a single worker thread to each device with a single context, convert all CUDA calls into IR objects, and implement a split-phase protocol between application threads and the worker threads, so that a CUDA API call from an application thread pushes an IR object representing the call onto the queue of the worker thread that actually performs it. The IR objects could be optimized to include only the bare minimum needed to perform the call. The locking protocol on the queue could be tuned to optimize for call throughput or latency. This would let you queue as many operations asynchronously as you wanted, enable sharing of device allocations between application threads, and reduce lock contention when multiple application threads are using multiple devices.
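A skeleton of that design might look something like this (just a sketch: a plain std::function stands in for a real IR object, and getting results back to the callers via futures or similar is left out):

```
#include <cuda.h>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Skeleton of the worker-thread design described above. Each queued call is
// reduced to a std::function rather than a real IR object.
class DeviceWorker {
public:
    explicit DeviceWorker(int ordinal) : stop_(false) {
        worker_ = std::thread([this, ordinal] {
            // The worker owns the only context for this device; application
            // threads never touch the context directly.
            cuInit(0);
            CUdevice dev;
            cuDeviceGet(&dev, ordinal);
            cuCtxCreate(&ctx_, 0, dev);
            run();
            cuCtxDestroy(ctx_);
        });
    }

    ~DeviceWorker() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stop_ = true;
        }
        cv_.notify_one();
        worker_.join();
    }

    // Application threads enqueue a call object; the worker executes it with
    // the device context current.
    void submit(std::function<void()> call) {
        {
            std::lock_guard<std::mutex> lock(m_);
            queue_.push(std::move(call));
        }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> call;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return stop_ || !queue_.empty(); });
                if (stop_ && queue_.empty())
                    return;
                call = std::move(queue_.front());
                queue_.pop();
            }
            call();
        }
    }

    CUcontext ctx_;
    std::thread worker_;
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> queue_;
    bool stop_;
};
```

An application thread would then wrap each CUDA call as a submit(), e.g. worker.submit([=]{ cuMemcpyHtoD(dst, src, bytes); }), and every thread ends up sharing the worker’s single context and its allocations.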
I was time constrained though, so I didn’t do this.
I would guess that this is because contexts are used to implement memory protection between different applications, and that switching contexts requires programming the MMU.
My attitude towards this is that having a capability is more important than having high performance when doing it. I would start by doing something simple like Ocelot’s implementation and then benchmarking it. If it is too slow, or has too little overlap, go back and try something more complicated.