How to start 2 kernels on 2 devices

I have a GeForce 9800 GX2 and started 2 threads with pthread functions, each of which runs the kernel on one GPU of the card. But this seems to have more overhead than using just 1 kernel. How do you make use of this type of card with its 2 GPUs?

Did you do a “cudaSetDevice” in each of these threads separately, so that you know they really work on 2 different GPUs?

Yes, I set the device to 0 and 1 for the 2 threads.

Actually, in the 2 thread version, each thread processes half of the data, and in the 1 thread version, the main thread processes all the data.
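For reference, the structure is roughly this (a simplified, untested sketch; myKernel, N, and the data split are placeholders for my real code):

#include <cuda_runtime.h>
#include <pthread.h>

#define N (1 << 20)                       /* placeholder data size */

__global__ void myKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;              /* stand-in for the real work */
}

struct GpuArgs {
    int    device;   /* 0 or 1 */
    float *h_data;   /* this thread's half of the host data */
    int    n;        /* number of elements in this half */
};

static void *gpuWorker(void *p)
{
    struct GpuArgs *a = (struct GpuArgs *)p;

    /* First CUDA call in this thread, so the context is created
       on the intended GPU. */
    cudaSetDevice(a->device);

    float *d_data;
    cudaMalloc((void **)&d_data, a->n * sizeof(float));
    cudaMemcpy(d_data, a->h_data, a->n * sizeof(float), cudaMemcpyHostToDevice);

    myKernel<<<(a->n + 255) / 256, 256>>>(d_data, a->n);

    cudaMemcpy(a->h_data, d_data, a->n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return NULL;
}

int main(void)
{
    static float h_data[N];
    pthread_t t[2];
    struct GpuArgs args[2] = {
        { 0, h_data,         N / 2 },     /* first half  -> GPU 0 */
        { 1, h_data + N / 2, N / 2 }      /* second half -> GPU 1 */
    };

    pthread_create(&t[0], NULL, gpuWorker, &args[0]);
    pthread_create(&t[1], NULL, gpuWorker, &args[1]);
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);
    return 0;
}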

I am working on something similar – using two threads (pthreads) to control two GPUs. All my program does is read data from a file, download it to the GPUs, do some processing, and upload it back to the host, over and over again. I found that the single-threaded version of my program was much quicker. I think this is because I was re-creating my threads each time I looped (and therefore re-allocating device memory, etc.). I changed my approach to keep the threads running constantly, and this helped. What also helped was making sure that each thread/GPU does a lot of work between updates; otherwise the threads just tend to run serially. Hope that makes sense…
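Roughly what I ended up with, in sketch form (untested as shown; process() just stands in for the CUFFT step, Worker/workerLoop/CHUNK are made-up names, and the reader thread that fills h_chunk and initializes the barriers isn't shown):

#include <cuda_runtime.h>
#include <pthread.h>

#define CHUNK (256 * 1024)                 /* e.g. 256k complex floats */

/* Stand-in for the real processing (CUFFT calls in my case). */
__global__ void process(float2 *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i].x = -d[i].x;
}

struct Worker {
    int                device;     /* which GPU this thread owns       */
    float2            *h_chunk;    /* filled by the reader thread      */
    pthread_barrier_t *start;      /* reader signals: new chunk ready  */
    pthread_barrier_t *done;       /* worker signals: chunk processed  */
    volatile int       quit;       /* reader sets this before the last
                                      'start' barrier to end the loop  */
};

static void *workerLoop(void *p)
{
    struct Worker *w = (struct Worker *)p;

    cudaSetDevice(w->device);                              /* once per thread    */
    float2 *d_chunk;
    cudaMalloc((void **)&d_chunk, CHUNK * sizeof(float2)); /* once, not per loop */

    for (;;) {
        pthread_barrier_wait(w->start);                    /* wait for new data   */
        if (w->quit) break;

        cudaMemcpy(d_chunk, w->h_chunk, CHUNK * sizeof(float2),
                   cudaMemcpyHostToDevice);
        process<<<(CHUNK + 255) / 256, 256>>>(d_chunk, CHUNK);
        cudaMemcpy(w->h_chunk, d_chunk, CHUNK * sizeof(float2),
                   cudaMemcpyDeviceToHost);

        pthread_barrier_wait(w->done);                     /* hand the chunk back */
    }

    cudaFree(d_chunk);
    return NULL;
}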

Unsurprisingly, thread and context pools are almost always a good idea.

Thanks, Charley. Yeah… actually, I was trying to avoid unnecessary launches as much as possible. Maybe there's still too little computation? How long does your program run?

Keep in mind that my program is very much a work in progress. However, my multi-threaded output appears to match my single threaded output so…

As I mentioned, I read in the data a bit at a time (256k complex floats), process it, then upload it from the GPU and output it, over and over again. I am using CUFFT as my processing step right now. If I do just 1 FFT per input data set, my single-threaded program takes ~0.41 s to get through my data file and my double-threaded program takes ~0.58 s. However, if I do 30 FFTs on each data set (just to increase the processing time), my single-threaded program takes ~1.4 s and my double-threaded program takes ~1.0 s. I assume this will improve as I get more clever with it, but I thought it was encouraging.

All of you should look at Mr. Anderson's multi-GPU master-slave idea. It is cool!

I have not worked with contexts, so I'm not sure how cool it is…

By the way, are you calling any CUDA functions in the main thread before forking the 2 threads? I am not sure how CUDA would behave in that case. The manual says that once cudaSetDevice has been called, explicitly or implicitly, subsequent cudaSetDevice calls have no effect. So, if the main thread has already set the device, it is possible that the subsequently spawned threads have the main-thread hangover and will use the same device in spite of the explicit cudaSetDevice().

I am not sure what the correct behaviour is.

tmurray, can you clarify this part?

No CUDA functions are being called in the main thread; cudaSetDevice is called by the thread functions. I've looked at Mr. Anderson's code and it is impressive, but more than I need right now.

Contexts are bound per-thread, so the first thread will have a context and the new thread won’t.

Where can I find Mr. Anderson’s multi-GPU master-slave idea?

I believe this thread has that information:

http://forums.nvidia.com/index.php?showtopic=66598

OK, thank you~~

Thanks for the reply. But what if the first thread does NOT actually use a context? It simply does a cudaMalloc(), which goes to the default device, and then forks 2 threads. Now each thread tries to do a cudaSetDevice() to a different device…

How does it work in such a situation?
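To make the scenario concrete, here is the kind of thing I mean (just a sketch; names and sizes are arbitrary):

#include <cuda_runtime.h>
#include <pthread.h>

static void *worker(void *arg)
{
    int dev = (int)(long)arg;

    /* Is this still honoured, given that the main thread below has
       already done a cudaMalloc() before forking us? */
    cudaSetDevice(dev);

    float *d;
    cudaMalloc((void **)&d, 1024);     /* which device does this land on? */
    cudaFree(d);
    return NULL;
}

int main(void)
{
    /* No explicit cudaSetDevice() here -- just a plain cudaMalloc(),
       which goes to the default device. */
    float *d_main;
    cudaMalloc((void **)&d_main, 1024);

    pthread_t t0, t1;
    pthread_create(&t0, NULL, worker, (void *)0L);
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);

    cudaFree(d_main);
    return 0;
}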

cudaMalloc already creates a context, no?

Then it has a context? Just because contexts are implicit in CUDART doesn’t mean you’re not using a context.

EDR and Tim,

Thanks for your replies. I understand it better now.

Best Regards,

Sarnath