Multithreading with the Driver API

I searched the forum and the reference manual, but couldn’t find anything on this…

I’m working on a managed .NET numerical library, and I’m adding CUDA support wherever possible using the driver API (so that the DLL can be “xcopy-deployed”). Calls to the driver are all handled through a singleton class that is the “glue” between .NET and CUDA (via P/Invoke). However, I’m not quite sure when to call cuInit(0). Can I just call it whenever the singleton is first initialized, and the driver will be initialized from that point on as long as the object exists? Or do I need to call it in each method call (when a method in myObject calls a method in mySingletonObject, cuInit is called in the mySingletonObject method)?

The singleton is designed to be thread-safe, so it may be called from various other threads simultaneously.

I’m not sure that the singleton pattern is the best way to do this if you’re going to have multiple calling threads, considering how contexts are handled. Each thread will need to call cuInit(0), and then you can do some context migration black magic to move contexts around in a thread-safe manner, I guess, but obviously this doesn’t scale at all to multiple GPUs.

Well, here’s what I’m trying to do:

The singleton class (I’ve called it CudaManager) has asynchronous methods that may be called by one or more threads at any time. These methods correspond to various kernels that I will be including with the library. When a thread calls one of the CudaManager functions, the function is not executed immediately, but is added to an execution queue. When a GPU with enough memory to handle the request becomes available in the system, the CudaManager class loads the appropriate parameters into memory and launches the kernel. When the kernel is finished executing, it reads back whatever results are necessary to host memory and notifies the calling thread via an asynchronous callback, at which point the calling thread may “view” the results of the computation.

This methodology should allow me to easily scale my library to any number of GPUs in the system (besides bandwidth constraints, etc.) so that they may easily be acted upon by any number of threads (since no contexts are involved, a thread will generally receive its results in the order that it was added to the queue).

Now, I don’t know a whole lot about using the contexts and streams in the driver API (my CUDA work to date hasn’t been quite that advanced), but I think this way might be better for me since it keeps all of the functions necessary for interacting with CUDA in one class, which could be extended in the future to ATI/AMD, Larrabee, etc.

So to confirm, each calling thread needs to call cuInit(0) even when calling a singleton class? What about if I keep the singleton in its own separate/constant thread, so that even when called by other outside threads, the driver API functions are always executed in the single thread belonging to the singleton class?

EDIT: So after a bit more reading, I’m basically providing a second layer of asynchronous-ity between the singleton class (which will use async functions with the driver api) and the calling threads (which will do asynchronous method calls/callbacks with the singleton class).

If you keep the singleton in its own thread, you won’t need to call cuInit(0) everywhere, but it will limit the number of GPUs you can access simultaneously to one. After that, you’ll have to do all sorts of pushing and popping of contexts and it will become very messy.

Every thread that touches a GPU must have a context; in the runtime API, it’s implicitly created, but in the driver API you must create it yourself. I guess I need to actually sit down and write the “how contexts work in multithreaded applications” white paper, don’t I…

Yes, that would be quite helpful to get this library working properly ^_^

Is there anything in the programming guide or reference that has in-depth information on multi-GPU programming? I haven’t seen much other than the SDK example…

I don’t know much about .NET, but I’ve done some multi-gpu programming with Win32 and Driver API.

If I understand correctly, then implementing what you describe can be done in two ways:

  1. popping and pushing contexts, which I think is very messy and have personally never used;

  2. maintaining a pool of threads (one for each GPU); this is the “traditional” way of doing multi-GPU.

Again, if I understand correctly, your singleton class is a proxy between calling threads and devices. If so, CudaManager must maintain a pool of threads, one for each device, and cuInit() must be called in each of these threads once, before doing actual work. Caller threads do not need to know anything about the underlying hardware and they don’t have to do any cuInit()'s.
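As a rough illustration of option 2, here is a minimal C# sketch of one dedicated worker thread per device, all draining a shared queue. DeviceWorkerPool, initDevice and Enqueue are hypothetical names, not part of any existing library, and the driver calls are deliberately left out: initDevice stands in for whatever per-thread setup is needed (cuInit(0) and context creation through P/Invoke). It also assumes .NET 4 or later, since it relies on BlockingCollection.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

// Hypothetical sketch: one dedicated worker thread per device,
// all of them draining a single shared work queue.
public sealed class DeviceWorkerPool : IDisposable
{
    private readonly BlockingCollection<Action<int>> _work =
        new BlockingCollection<Action<int>>();
    private readonly List<Thread> _threads = new List<Thread>();

    // initDevice is an assumed hook: a real library would perform its
    // per-thread driver setup here (cuInit, context creation) via P/Invoke.
    public DeviceWorkerPool(int deviceCount, Action<int> initDevice)
    {
        for (int i = 0; i < deviceCount; i++)
        {
            int device = i; // fresh copy for the closure below
            var thread = new Thread(() =>
            {
                initDevice(device); // runs once, on the thread that owns this device
                foreach (var job in _work.GetConsumingEnumerable())
                    job(device);    // all work for this device stays on this thread
            });
            thread.IsBackground = true;
            thread.Start();
            _threads.Add(thread);
        }
    }

    // Callable from any thread; callers never touch the driver themselves.
    public void Enqueue(Action<int> job)
    {
        _work.Add(job);
    }

    // Stop accepting work, let the workers drain the queue, then join them.
    public void Dispose()
    {
        _work.CompleteAdding();
        foreach (var thread in _threads)
            thread.Join();
        _work.Dispose();
    }
}
```

A caller would construct the pool with the number of devices found at startup and call Enqueue from any thread; the int passed to each job is the index of the device whose thread picked it up.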

I guess this is the “second layer of asynchronous-ity” you talked about earlier.

Excellent, AndreiB! Option 2 sounds like a good way to go, and that actually should be fairly simple in .NET. I’ll go ahead and implement that and see how it works (someone else will have to be generous enough to test it for me, since I only have a single GPU in my dev system).

@profquail, are you trying to use C# threads, or are you explicitly creating native OS threads and posting messages to them? I thought finalizers typically ran on different threads than the rest of the application in C#? Also, C# threads are specifically not guaranteed to be OS threads, and there are several hosting environments that already take advantage of this (mostly servers, although that may change in the future). Seems like both could cause issues with CUDA.

I was planning to use C# threads, which I assumed would be the same thing as system threads (sorry, I’m a math guy, not a computer scientist). I looked at the mono code for threading (though I’m working in VS2008), and it looked like it just wraps the system threads routines. I don’t understand how it could be any other way in C# with the official .NET framework…can you elaborate?

As for the finalizers, the singleton class will implement the IDisposable interface; when Dispose() is called on the object, it will check to see if there are any queued or currently running GPU threads and then either wait until they are done, or throw an exception (I’ll have to wait until I get the class implemented before I can do some tests and see what works best). That should make sure that the finalizer/garbage collection process doesn’t kill the object while there are GPU calculations in progress.

Try MSDN:…ing.thread.aspx

This can and actually does happen today. Let’s say I write a scientific image processing library and you have a lot of astronomy related images in a database (stored as binary large objects). You want to call some functions in my library in a stored procedure you’re writing. Maybe this makes sense because many clients from many vendors are accessing this database and you don’t have access to the source for all of them. As of today you really can’t do it. Your threads will run sequentially but they will switch between OS threads, so the problem is real. To be fair if you’re writing a typical standalone application chances are you’ll be ok, but technically you’re relying on undocumented behavior that’s subject to change in the future.

Keeping your objects from going out of scope shouldn’t be a problem, but what happens when the programmer forgets to call Dispose()? What if an exception causes the call to Dispose to be skipped over? Even if you use try/catch/finally on every call you make there will still be situations where this can happen. If you implement a proper finalizer you don’t have to worry about any of these situations because the finalizer will prevent any resources from leaking. But finalizers can run at any time and typically run on separate threads from the rest of your app. So if you try to call e.g. cuCtxDestroy() from a finalizer thread CUDA will throw a fit (unless the context happens to be floating).
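The point about finalizer threads can be checked with a small probe. The following is only a sketch: FinalizerProbe and its members are hypothetical names, and no CUDA call is made. The finalizer simply records which managed thread it ran on, at the place where a real wrapper would be tempted to call cuCtxDestroy().

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Threading;

// Hypothetical probe: records which managed thread its finalizer ran on.
public sealed class FinalizerProbe
{
    public static int FinalizerThreadId = -1;

    ~FinalizerProbe()
    {
        // A real wrapper might try to destroy a CUDA context here,
        // on whatever thread the runtime uses for finalization.
        FinalizerThreadId = Thread.CurrentThread.ManagedThreadId;
    }

    // Allocate in a separate, non-inlined method so that no live reference
    // to the probe remains on the caller's stack when the GC runs.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void AllocateAndDrop()
    {
        new FinalizerProbe();
    }

    // Returns the ID of the thread that ran the finalizer,
    // or -1 if the finalizer has not run.
    public static int Run()
    {
        AllocateAndDrop();
        GC.Collect();
        GC.WaitForPendingFinalizers();
        return FinalizerThreadId;
    }
}
```

Comparing the value returned by FinalizerProbe.Run() with Thread.CurrentThread.ManagedThreadId on the calling thread shows whether cleanup code would have run on a thread other than the one that created the object.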

It may be easier than you think to do what you want so don’t let any of this discourage you…

By the way how long do your kernels typically run? Do you really only have a single class? It seems like it might be a lot easier to wrap each CUDA object in its own class (with a pointer to the parent object so the parent never goes out of scope as long as the child is in use). You can still give each object a dispose method and implement reference counting. That way objects are destroyed automatically whenever necessary and you don’t have to keep track of arrays of handles. What pattern would you use for queueing asynchronous calls? Maybe something like the standard IAsyncResult pattern? Even though option 1 above might look more messy I think you could make the argument that it’s actually more clean, depending on what your needs are…

Thanks for the info on the threads. The way I’m planning to build the CudaManager class, I think I’ll be ok 99% of the time, unless someone invokes the library from the CLR Hosting API, and then does some kind of tricky thread-scheduling. I don’t see that happening too often, so (at least for now) I’m going to work on the premise that creating a C# thread will also create a system thread (I hope that assumption doesn’t come back to bite me!)

I was planning to implement the full Dispose/Finalizer pattern, which should handle any situation (it is the pattern recommended by MS for writing classes that deal with unmanaged resources, like this one will be).

The idea behind this class was to make a single, system-wide object that would handle any CUDA kernel launches for the managed library (and do marshaling for the data and parameters). I also thought that this object could make use of any number of GPUs in the system by doing the request/queueing/callback stuff I mentioned before. Some of the kernels that will be included in this library will be fairly simple reduction-type stuff, but I’m also planning to include matrix multiplication, eigenvalue algorithms, and so on. I don’t know how long they will take to run since I haven’t written them yet (I was planning to implement this CudaManager class and some more CPU-based functions in the library first). I would also like this code to be easily extensible, so that it is simply a matter of embedding another .cubin in the DLL resources and adding a simple method to CudaManager whenever I want to add a new kernel to the library.

As for queueing the asynchronous calls, I would like to make it so that when the host code calls a CUDA-enabled library function (either synchronously, or asynchronously using delegates/callbacks), some sort of intermediate object will be constructed (let’s call it CudaKernelRequest) that contains either the PTX or cubin for the kernel, kernel parameters, and any data that needs to be uploaded to the device for processing. This object will be added to a standard generic Queue<> object that will be a private property of the CudaManager singleton object. Upon certain events being fired (completion of a currently executing kernel, adding a new CudaKernelRequest object to the queue, etc.) the CudaManager class should attempt to find an available GPU device, and if found, should execute the next kernel in the queue (by invoking the appropriate CUDA driver API calls on a thread that has been set aside specifically for that GPU). When the kernel is done executing, CudaManager retrieves any necessary data from the device, marshals it back to the appropriate .NET type, and fires the callback (or just returns, if it was a synchronous call) to allow the host code to “view” its results.
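The request/queue/callback flow described above could look roughly like the sketch below. Only the name CudaKernelRequest comes from the post; its fields, the KernelQueue class and the execute delegate are assumptions made for illustration. The delegate stands in for "upload, launch, read back", so no driver call appears here, and a single dispatch thread is used for brevity where the post envisions one thread per GPU. Because Queue<T> is not thread-safe on its own, access is guarded with a lock and Monitor.Wait/Pulse.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical request object: what to run, its input, and how to report back.
public sealed class CudaKernelRequest
{
    public string KernelName;          // which embedded cubin/PTX entry point to run
    public float[] Input;              // data to upload to the device
    public Action<float[]> Completed;  // callback invoked with the results
}

// Hypothetical queue with one dispatch thread standing in for a GPU thread.
public sealed class KernelQueue
{
    private readonly Queue<CudaKernelRequest> _queue = new Queue<CudaKernelRequest>();
    private readonly object _lock = new object();
    private readonly Func<CudaKernelRequest, float[]> _execute;

    // execute is an assumed stand-in for the real upload/launch/read-back step.
    public KernelQueue(Func<CudaKernelRequest, float[]> execute)
    {
        _execute = execute;
        var worker = new Thread(DispatchLoop);
        worker.IsBackground = true;
        worker.Start();
    }

    // Callable from any thread: enqueue the request and wake the dispatcher.
    public void Submit(CudaKernelRequest request)
    {
        lock (_lock)
        {
            _queue.Enqueue(request);
            Monitor.Pulse(_lock);
        }
    }

    private void DispatchLoop()
    {
        while (true)
        {
            CudaKernelRequest next;
            lock (_lock)
            {
                // Sleep until a request is available.
                while (_queue.Count == 0)
                    Monitor.Wait(_lock);
                next = _queue.Dequeue();
            }
            // Run outside the lock so callers can keep submitting.
            float[] result = _execute(next);
            next.Completed(result);
        }
    }
}
```

Requests are processed in submission order, and the callback runs on the dispatch thread, so a real implementation would need to decide whether to marshal it back to the caller's thread.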

I should note that I’m planning to release the entirety of this library under the BSD license, so if I can get this little CudaManager class working properly, it will be available to any .NET developers that want to make use of it. Along with some other little classes I’m writing (such as embedding/reading .cubin and PTX files from embedded resources in a DLL/EXE), it should make it much easier for .NET programmers to CUDA-accelerate their applications.

Also, now that I think about it, what is the overhead of calling cuInit(0)? If it’s not very large, I don’t think I’d need to maintain a thread pool at all, instead just launching new threads as needed and keeping track of which GPUs are currently executing a kernel. This might also avoid the whole C# thread != system thread problem as mentioned above…

I’m just thinking… if you’re going to spawn threads, why not just spawn unmanaged threads instead? Then use a Windows message queue or, if you need it to be cross platform, GLib has a wrapper around Win32 threads and pthreads with a fully thread safe asynchronous queue. Either way you wouldn’t need to create a new thread for every kernel call (cuInit() might be fast but I would think creating a new thread is going to be significantly more expensive than a normal 5 or 10 cycle function call, in fact I think that’s one of the reasons managed threads != system threads). Then you just P/Invoke SendMessage, PostMessage or g_async_queue_push with a pointer to a structure for each request.

Thanks for the ideas quvhiwxvvmy. Hopefully I’ll get some kernels together in the next week or so and I can test out our different ideas to find out what works best. I’ll post back when I get some results…

Thanks profquail. I’ll be interested to see what you find out. I guess in my mind there are really two outstanding questions. Here is kind of how I’m thinking I would structure this program (although maybe I don’t really understand your specific application so this isn’t useful…)

First you need to import all the CUDA data structures and DLL calls, e.g.

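using System.Runtime.ConstrainedExecution;
using System.Runtime.InteropServices;

namespace CudaInterop
{
	// Define enumerations and other data types used by the driver API
	public enum CUresult : int
	{
		// ...
	}

	// Fields match CUDA_ARRAY_DESCRIPTOR in cuda.h
	[StructLayout(LayoutKind.Sequential)]
	public struct CUDA_ARRAY_DESCRIPTOR
	{
		public uint Width;
		public uint Height;
		public CUarray_format Format;
		public uint NumChannels;
	}

	// ...

	// Import DLL calls from the driver library
	public static class CudaInterop
	{
		// "nvcuda.dll" is the driver library name on Windows
		[DllImport("nvcuda.dll"), ReliabilityContract(Consistency.WillNotCorruptState, Cer.Success)]
		public static extern CUresult cuModuleUnload(CudaModule mod);
	}
}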

This part should be extremely easy. It’s almost just copying and pasting from cuda.h.

The second part is you need to encapsulate each handle that needs to be destroyed (especially CUcontext, but not something like CUfunction) in an object with a critical finalizer.

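namespace CudaDriver
{
	public sealed class CudaContext : SafeHandle
	{
		// ...

		public CudaModule LoadModule(String fileName)
		{
			CudaModule cudaModule = null;
			bool success = false;

			// Do the AddRef and the load inside a constrained execution region
			RuntimeHelpers.PrepareConstrainedRegions();
			try { }
			finally
			{
				// Keep this context alive for as long as the module exists
				this.DangerousAddRef(ref success);
				if (success)
					CudaInterop.cuModuleLoad(out cudaModule, fileName);
			}
			return cudaModule;
		}
	}

	public sealed class CudaModule : SafeHandle
	{
		// ...

		// Bodies omitted
		public CudaTexRef GetTexRef(String texRefName);
		public CudaFunction GetFunction(String funcName);

		// Inherit SafeHandle's finalizer and IDisposable interface

		// Always called within a CER by SafeHandle, guaranteed to only be called once
		protected override bool ReleaseHandle()
		{
			bool success;
			success = (CudaInterop.cuModuleUnload(this) == CUresult.CUDA_SUCCESS);

			// parent is the owning CudaContext (field declared in the elided part above);
			// this drops the reference taken in LoadModule
			parent.DangerousRelease();

			// Now parent's ReleaseHandle can be called by SafeHandle's Dispose(),
			//   Close() or finalizer
			return success;
		}
	}
}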

This part is a little more difficult but once you figure out a pattern to use (e.g. for one pair of objects like CUcontext and CUmodule) the rest of the work should become very easy (basically just copy and paste).

Once that’s done you should be able to use CUDA in even the most unusual hosted environments across present and future platforms without leaking resources and so the third part also becomes almost trivial:

namespace CudaQueue
{
	// ...
}

You can implement round-robin queues, first-available queues, or queues that cache CUmodules and function pointers to improve performance, etc. You wouldn’t need to implement finalizers or anything in these classes, so further development with CUDA should be very easy and quick. I especially like your idea about classes to embed cubins. It sounds almost like a C# runtime API.