When attempting to allocate and access device memory in two different threads by passing a pointer, I discovered that it wouldn’t work due to the threads having different contexts as they relate to CUDA. I’m actually surprised my program made it as far as it did in development before running into this problem… which makes me want to take a guess at the answer to the following question, but I need to be sure.
If thread1 allocates memory on the host using cudaMallocHost(), is thread2 able to access that data via a pointer passed to it? I’m doing so in my application, and things appear to be working correctly, but I’d like some confirmation that this will indeed work. I don’t want it causing problems down the road.
I THINK (and I am going to check this) that it depends on your definition of “access.” Are you going to copy the data to the GPU? That will fail. Going to use it on the CPU? I think that works, but I have to check to make sure it is officially working.
If thread1 is calling cudaMallocHost() and then thread2 is passing the allocated space to cudaMemcpy(), that will not work unless they’re using the same context (which they’re not if you’re using the runtime API). So, you either have to use the driver API and the thread migration API to move the context around or you have to use the cudaMallocHost’d buffer within the context where it was created.
Hmm… interesting. Currently thread1 allocates the host memory, and thread2 copies it to the device, yet things appear to be working. I’d hate for it to cause problems down the road, so I’m going to change some things to ensure that it will work properly. Just so I can be sure I understand things, would the following work? Obviously psuedo-code:
One other question I have to double-check my understanding. It is completely valid for a thread to access the memory allocated by cudaMallocHost without first applying the context in which that memory was allocated, correct? So, in my case, thread1 can fill the buffer even after popping the context.
Well, the above doesn’t work. It turns out I didn’t understand the difference between the runtime API and the driver API, just as I didn’t understand the whole context thing. So, from here, I suppose my only two options are, as you mentioned, use the driver API and pass the context around, or use the runtime API and just change the structure of my code.
I think i’m going to go with option #2, but that leaves a very important question, which you alluded to in your first reply to me. If I use cudaMallocHost(buffer) in thread2, am I allowed to access that buffer in thread1? And by “access” i mean store data in that array, NOT copying anything to the GPU (the copying can be done in thread2, where the buffer was created).
This seems to imply that you can, in fact, allocate memory using cudaMallocHost() in one thread and copy it to the device in another thread. The only consequence is that you won’t see a gain in bandwidth that you would otherwise see using pinned memory. Am I interpreting this correctly?
I’ve got a situation in my code where I allocate memory with cudaMallocHost in thread 1, then BOTH thread 1 and 2 copy it to their respective devices. The thread1 copy is fast because it is pinned and the thread 2 copy is slow (because pinning is per context), but it has always worked for me (in linux, I’ve never tried this in windows).
Maybe tmurray can correct me on this, but I don’t see any reason why it should not work. The memory is still just memory in RAM and CPU threads can access any memory within their host process, right? cudaMemcpy is just reading the host memory, so it should “just work”.
Sorry, I should have been more clear. By “not work” I meant doing the context popping and pushing. The part that doesn’t work is using both the runtime API and the driver API in the same program.
Your application seems to be able to do things just fine, with the exception of the slower transfer time in the non-allocating thread. My hope was to take advantage of the faster, pinned memory in the non-allocating thread by popping and pushing the contexts, but that proved impossible due to the mixing of APIs.
So there is no way to manually manage the contexts with the runtime API as you can with the driver API, right? I know it’s difficult to comment on things that are in development, but are there plans on applying this functionality to the runtime API as well? It would be useful to be able to manage the contexts there as well.
I didn’t think it would fall back to the slow path–I guess that’s wrong! There are further improvements coming to pinned memory, so a lot of this confusion will go away in the future.
And no, right now there are no plans for a thread migration API for the runtime. I know, it makes me sad too (I specifically asked for it and they said no :( ). I think people really should start considering the runtime API as the useful for prototyping path and then switch to driver for a real application, as it offers far more fine-grained control in many respects than the runtime. It’s really not that bad to use, either.
Hmmm, wonder why there are no plans to do so. I’m fairly new to CUDA, but it seems that having multiple threads on the CPU side in conjunction with CUDA would be a rather commonly desired functionality.
Thanks for the advice with the driver API as well. I’m pretty much in the prototyping/exploring stage right now anyway, so it seems that I am where I should be.
One (for now) other question regarding contexts / pthreads.
I think we’re going to do a little restructuring of our code so that we can take advantage of the pinned memory and other things. Before doing so, however, I’d like to figure out an error I’ve been seeing.
To reiterate, one thread buffers data and the other thread processes it when the buffer gets full. I realize this isn’t the best option (which is part of the restructuring), but currently the “worker” thread is created and ends repeatedly, so that basically each “worker” thread processes one buffer before dieing. There are only ever two threads that are “alive” at any given time, but the number of threads being created and killed increases indefinitely. All cuda-related stuff is done in one and only one thread, so that the contexts don’t cause any problems. After several iterations of this (about 283), cudaMalloc fails with an “unknown error”. My guess is I’d get a different error if it was running out of memory. So I’m not really sure what the problem is. Is there a limit to the number of contexts that can be created for CUDA? I’m thinking that each thread creates a context, so maybe these contexts aren’t being destroyed correctly and they’re adding up, eventually causing an error. I’ve tried to see how much memory is free before the exit, but of course the cuMemGetInfo() function is in the driver API, and therefore not accessible to me.
Maybe I shouldn’t worry about it unless it comes up after the restructuring, but I’d like to get this error under control while it’s “visible”. I’m just worried that it’s something I’m doing that may cause inconsistent behavior later, when it’s not so obvious what’s going on.
Are you running 177.73? There is a bug in cudaMalloc in drivers before that, and I think that would cause the kinds of errors you’re seeing.
As for a limit on the number of contexts… there is a limit of 16 per card in Windows right now, and in Linux there’s no limit. Don’t try to make a ton of contexts, because it is a bad idea and your performance will suffer. I’m actually pushing for a limit of 16 contexts on Linux for 2.1, so don’t assume you’ll always be able to create a huge number of contexts on Linux anyway.
I’m running 177.67 in LINUX, so looks like the driver is in need of an update. When I get back over to that machine I’ll upgrade and see if that fixes the issue.
For the sake of explanation and understanding, I wasn’t trying to make a ton of contexts… but I thought that maybe that was a consequence of the way the program was structured. So if, while using the real-time API, a thread exits, is the context automatically destroyed as well? If that is the case, then I’m never having more than two contexts at a time anyway.
I think it would be a mistake to limit contexts to 16 on any platform. You need at least one host thread per device (i.e., one associated CUcontext per CPU thread).
We are not very far off from having CPUs with more than 16 hardware threads. At that point, the CUDA context limit would come into play and would force a change in the context management. It might be better to make context creation and destruction more light weight if that is possible.