How to start 2 kernels on 2 devices

I have a GeForce 9800 GX2 and started 2 threads with pthread functions, each of which runs the kernel on one GPU of the card. But this seems to have more overhead than using just 1 kernel. How do you make use of this type of card with its 2 GPUs?

Did you do a “cudaSetDevice” in each of these threads separately, so that you know they really work on 2 different GPUs?

Yes, I set the device to 0 and 1 for the 2 threads.

Actually, in the 2 thread version, each thread processes half of the data, and in the 1 thread version, the main thread processes all the data.
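For reference, the structure is roughly this (a simplified, untested sketch; myKernel, N, and the data split are placeholders for my real code):

#include <cuda_runtime.h>
#include <pthread.h>

#define N (1 << 20)                       /* placeholder data size */

__global__ void myKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;              /* stand-in for the real work */
}

struct GpuArgs {
    int    device;   /* 0 or 1 */
    float *h_data;   /* this thread's half of the host data */
    int    n;        /* number of elements in this half */
};

static void *gpuWorker(void *p)
{
    struct GpuArgs *a = (struct GpuArgs *)p;

    /* First CUDA call in this thread, so the context is created
       on the intended GPU. */
    cudaSetDevice(a->device);

    float *d_data;
    cudaMalloc((void **)&d_data, a->n * sizeof(float));
    cudaMemcpy(d_data, a->h_data, a->n * sizeof(float), cudaMemcpyHostToDevice);

    myKernel<<<(a->n + 255) / 256, 256>>>(d_data, a->n);

    cudaMemcpy(a->h_data, d_data, a->n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return NULL;
}

int main(void)
{
    static float h_data[N];
    pthread_t t[2];
    struct GpuArgs args[2] = {
        { 0, h_data,         N / 2 },     /* first half  -> GPU 0 */
        { 1, h_data + N / 2, N / 2 }      /* second half -> GPU 1 */
    };

    pthread_create(&t[0], NULL, gpuWorker, &args[0]);
    pthread_create(&t[1], NULL, gpuWorker, &args[1]);
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);
    return 0;
}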

I am working on something similar – using two threads (pthreads) to control two GPUs. All my program does is read data from a file, download it to the GPUs, do some processing, and upload it back to the host, over and over again. I found that the single-threaded version of my program was much quicker. I think this is because I was re-creating my threads each time I looped (and therefore re-allocating device memory, etc.). I changed my approach to keep the threads running constantly, and this helped. What also helped was making sure that each thread/GPU does a lot of work between updates; otherwise the threads just tend to run serially. Hope that makes sense…
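Roughly what I ended up with, in sketch form (untested as shown; process() just stands in for the CUFFT step, Worker/workerLoop/CHUNK are made-up names, and the reader thread that fills h_chunk and initializes the barriers isn't shown):

#include <cuda_runtime.h>
#include <pthread.h>

#define CHUNK (256 * 1024)                 /* e.g. 256k complex floats */

/* Stand-in for the real processing (CUFFT calls in my case). */
__global__ void process(float2 *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i].x = -d[i].x;
}

struct Worker {
    int                device;     /* which GPU this thread owns       */
    float2            *h_chunk;    /* filled by the reader thread      */
    pthread_barrier_t *start;      /* reader signals: new chunk ready  */
    pthread_barrier_t *done;       /* worker signals: chunk processed  */
    volatile int       quit;       /* reader sets this before the last
                                      'start' barrier to end the loop  */
};

static void *workerLoop(void *p)
{
    struct Worker *w = (struct Worker *)p;

    cudaSetDevice(w->device);                              /* once per thread    */
    float2 *d_chunk;
    cudaMalloc((void **)&d_chunk, CHUNK * sizeof(float2)); /* once, not per loop */

    for (;;) {
        pthread_barrier_wait(w->start);                    /* wait for new data   */
        if (w->quit) break;

        cudaMemcpy(d_chunk, w->h_chunk, CHUNK * sizeof(float2),
                   cudaMemcpyHostToDevice);
        process<<<(CHUNK + 255) / 256, 256>>>(d_chunk, CHUNK);
        cudaMemcpy(w->h_chunk, d_chunk, CHUNK * sizeof(float2),
                   cudaMemcpyDeviceToHost);

        pthread_barrier_wait(w->done);                     /* hand the chunk back */
    }

    cudaFree(d_chunk);
    return NULL;
}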

Unsurprisingly, thread and context pools are almost always a good idea.

Thanks, Charley. Yeah… actually, I was trying to avoid unnecessary launches as much as possible. Maybe there's still too little computation? How long does your program run?

Keep in mind that my program is very much a work in progress. However, my multi-threaded output appears to match my single threaded output so…

As I mentioned, I read in the data a bit at a time (256k complex floats), process it, then upload it from the GPU and output it, over and over again. I am using CUFFT as my processing step right now. If I do just 1 FFT per input data set, my single-threaded program takes ~0.41 s to get through my data file and my double-threaded program takes ~0.58 s. However, if I do 30 FFTs on each data set (just to increase the processing time), my single-threaded program takes ~1.4 s and my double-threaded program takes ~1.0 s. I assume this will improve as I get more clever with it, but I thought it was encouraging.

All of you should look at Mr. Anderson's multi-GPU master-slave idea. It is cool!

I have not worked with contexts, so I'm not sure how cool it is…

By the way, are you calling any CUDA functions in the main thread before forking the 2 threads? I am not sure how CUDA would behave in that case. The manual says that once cudaSetDevice has been called, explicitly or implicitly, subsequent cudaSetDevice calls have no effect. So, if the main thread has already set the device, it is possible that the subsequently spawned threads have the main-thread hangover and will use the same device in spite of the explicit cudaSetDevice().

I am not sure what the correct behaviour is.

tmurray, can you clarify this part?

No CUDA functions are being called in the main thread; cudaSetDevice is called by the thread functions. I've looked at Mr. Anderson's code and it is impressive, but more than I need right now.

Contexts are bound per-thread, so the first thread will have a context and the new thread won’t.

Where can I find Mr. Anderson’s multi-GPU master-slave idea?

I believe this thread has that information:

http://forums.nvidia.com/index.php?showtopic=66598

OK, thank you~~

Thanks for the reply. But what if the first thread does NOT actually use a context? It simply does a cudaMalloc(), which goes to the default device, and then forks 2 threads. Now each thread tries to do a cudaSetDevice() to a different device…

How does it work in such a situation?
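To make the scenario concrete, here is the kind of thing I mean (just a sketch; names and sizes are arbitrary):

#include <cuda_runtime.h>
#include <pthread.h>

static void *worker(void *arg)
{
    int dev = (int)(long)arg;

    /* Is this still honoured, given that the main thread below has
       already done a cudaMalloc() before forking us? */
    cudaSetDevice(dev);

    float *d;
    cudaMalloc((void **)&d, 1024);     /* which device does this land on? */
    cudaFree(d);
    return NULL;
}

int main(void)
{
    /* No explicit cudaSetDevice() here -- just a plain cudaMalloc(),
       which goes to the default device. */
    float *d_main;
    cudaMalloc((void **)&d_main, 1024);

    pthread_t t0, t1;
    pthread_create(&t0, NULL, worker, (void *)0L);
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);

    cudaFree(d_main);
    return 0;
}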

cudaMalloc already creates a context, no?

Then it has a context? Just because contexts are implicit in CUDART doesn’t mean you’re not using a context.

EDR and Tim,

Thanks for your replies. I understand it better now.

Best Regards,

Sarnath