I am trying to use CUDA from multiple CPU threads.
At first, I implemented it with the runtime API, but it does not work.
From previous topics, I learned that only the thread that allocates device memory can use it, and that sharing the memory among threads requires context migration.
So I will try the driver API.
This is my plan:
- Before starting the main part of the application, one thread creates n GPU contexts and allocates memory in each context.
- When the CPU threads are started, each one pushes one of the created contexts and uses it to launch kernels and to copy memory between host and device.
- The contexts are reused every frame and may be used by any thread.
- After all frames are done, the memory is released by a CPU thread.
I want to know whether this is possible before I start writing my code.
Please give me your opinion or advice, and let me know about your experience if you have tried something similar.
Thanks for your consideration.
Should be fine, but don’t create more than one context per GPU.
Do you mean a GPU cannot have more than one context?
If my understanding is right, the threadMigration example in the CUDA SDK creates two contexts per GPU, and each CPU thread (two CPU threads are used) uses one context.
Am I wrong?
Thanks for your kind reply.
It can, but there’s significant overhead to having multiple contexts per GPU on current hardware. You really shouldn’t do it.
I have one more question!
Can a context be shared by multiple CPU threads?
If so, is it possible for multiple CPU threads to launch kernels simultaneously using the same context?
A context is bound to a single thread at a time. This is why the context migration APIs push a context onto the context stack to declare ownership of it and pop it to relinquish ownership.
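The ownership rule above can be modeled on the host side without any CUDA calls. The following is only a rough sketch: `PooledContext`, `ContextPool`, and `run_frames` are hypothetical names, and `PooledContext` is a stand-in for a real `CUcontext` handle. The comments mark where `cuCtxPushCurrent` and `cuCtxPopCurrent` would presumably go in a real program.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a CUcontext handle created by cuCtxCreate.
struct PooledContext {
    int id;
    bool in_use;  // true while exactly one thread owns this context
    int uses;     // number of times the context has been handed out
};

class ContextPool {
public:
    explicit ContextPool(int n) {
        for (int i = 0; i < n; ++i) contexts_.push_back(PooledContext{i, false, 0});
    }

    // Block until some context is free, then hand it to the caller.
    // A real program would call cuCtxPushCurrent on the handle right after this.
    PooledContext* acquire() {
        std::unique_lock<std::mutex> lock(mutex_);
        PooledContext* free_ctx = nullptr;
        cv_.wait(lock, [&] {
            for (auto& c : contexts_) {
                if (!c.in_use) { free_ctx = &c; return true; }
            }
            return false;
        });
        free_ctx->in_use = true;
        ++free_ctx->uses;
        return free_ctx;
    }

    // Give the context back so another thread can take it.
    // A real program would call cuCtxPopCurrent just before this.
    void release(PooledContext* ctx) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(ctx->in_use);
            ctx->in_use = false;
        }
        cv_.notify_one();
    }

    // Sum of hand-outs over all contexts.
    int total_uses() {
        std::lock_guard<std::mutex> lock(mutex_);
        int total = 0;
        for (const auto& c : contexts_) total += c.uses;
        return total;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::vector<PooledContext> contexts_;  // fixed size after construction
};

// Each thread takes any free context once per frame, uses it, and returns it.
int run_frames(int n_contexts, int n_threads, int frames) {
    ContextPool pool(n_contexts);
    std::vector<std::thread> threads;
    for (int t = 0; t < n_threads; ++t) {
        threads.emplace_back([&pool, frames] {
            for (int f = 0; f < frames; ++f) {
                PooledContext* ctx = pool.acquire();
                // ... kernel launches and host/device copies would go here ...
                pool.release(ctx);
            }
        });
    }
    for (auto& th : threads) th.join();
    return pool.total_uses();
}
```

Calling `run_frames(2, 4, 50)` hands out contexts 200 times in total, and no context is ever held by two threads at once. With one context per GPU, the pool reduces to a plain lock around that context.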
I have the same challenge as you and I’m (mis)using MrAnderson’s GPUWorker (see http://forums.nvidia.com/index.php?showtop…98&hl=queue ) to forward CUDA Runtime API calls from multiple threads to a single thread of execution. You have to be careful with synchronization though.
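For readers who have not seen the pattern, here is a minimal sketch of forwarding calls to one worker thread. It is not GPUWorker's actual code or API: `SingleThreadWorker`, `call`, and `all_calls_ran_on_worker` are hypothetical names, and the forwarded lambdas stand in for whatever runtime API calls you would issue.

```cpp
#include <atomic>
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Owns one thread; every job passed to call() runs on that thread.
class SingleThreadWorker {
public:
    SingleThreadWorker() : done_(false), thread_([this] { loop(); }) {}

    ~SingleThreadWorker() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        cv_.notify_one();
        thread_.join();
    }

    // Queue fn for the worker thread and block until it has finished.
    template <class F>
    auto call(F fn) -> decltype(fn()) {
        typedef decltype(fn()) R;
        auto task = std::make_shared<std::packaged_task<R()>>(std::move(fn));
        std::future<R> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push([task] { (*task)(); });
        }
        cv_.notify_one();
        return result.get();
    }

private:
    void loop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return done_ || !queue_.empty(); });
                if (queue_.empty()) return;  // shutting down and nothing left to run
                job = std::move(queue_.front());
                queue_.pop();
            }
            job();  // run outside the lock so callers can keep queueing
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> queue_;
    bool done_;
    std::thread thread_;  // declared last so the other members exist before it starts
};

// Several caller threads forward work; report whether it all ran on the worker.
bool all_calls_ran_on_worker(int n_callers, int calls_each) {
    SingleThreadWorker worker;
    const std::thread::id worker_id =
        worker.call([] { return std::this_thread::get_id(); });
    std::atomic<int> mismatches(0);
    std::vector<std::thread> callers;
    for (int c = 0; c < n_callers; ++c) {
        callers.emplace_back([&] {
            for (int i = 0; i < calls_each; ++i) {
                if (worker.call([] { return std::this_thread::get_id(); }) != worker_id) {
                    ++mismatches;
                }
            }
        });
    }
    for (auto& th : callers) th.join();
    return mismatches.load() == 0;
}
```

Because `call` blocks until the job completes, each caller sees its own calls in order, but calls from different threads interleave in whatever order they reach the queue. That interleaving is the synchronization you still have to manage yourself.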
Wow! Wonderful work!
Your work is very helpful for me.
Depending on your duty cycle for using the GPU, sharing one GPU among several contexts works much better than I expected. I use CUDA in one of my programs to accelerate just one part of a longer analysis chain, so the GPU is underutilized by a single process. (It’s about a 20% duty cycle.) I can run four copies of this program on a Phenom with a GTX 280, and the context switch overhead is barely visible.
And now that I have a Core i7 to play with, I tried out 8 copies on one half of a GTX 295, and it was still pretty good, though visibly slower now that the CUDA device was oversubscribed. (That said, the hyperthreading+GPU actually gave me a 50% boost in efficiency with 8 processes over 4. Looks like Intel got HT right this time.)
I remember the switching overhead being a disaster in CUDA 0.8 when I last checked, but it is much better now. I still wouldn’t suggest it for programs that use the GPU 100% of the time, which is probably what tmurray is worried about.