CUDA with multiple cards

I have a question about a really large-scale CUDA project. I believe what I want to do is going to require multiple CUDA-enabled cards, but I want to run the SAME kernel on all of them. Is there any easy way to do this? From what I understood from the programming guide, I will have to launch multiple host threads, each of which launches the same kernel on a different device. Do most people just use pthreads or something similar to create the host threads, which in turn launch the device kernel? And is the only way to share data between the cards to copy it BACK to host memory? Thanks!

I’m not very familiar with OpenMP but I think this could be a very simple way to establish multithreading.

OpenMP would work, pthreads would work, MPI would work… I could go on! Basically, pick a threading library that you know how to use and everything will be very straightforward.

Right now, yes, the only way to share memory is with a copy back to the host.
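
For anyone landing here later, here is a minimal sketch of the one-host-thread-per-device approach described above. It uses std::thread purely as a stand-in for pthreads or OpenMP, and the `scale` kernel, `run_on_device`, and the slice size are made-up names and values for illustration only. Each thread calls cudaSetDevice, launches the same kernel on its own slice, and copies the result back to host memory, which is the only place the slices come together again.

```cuda
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: every device runs this same code on its own slice.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Runs in its own host thread and drives exactly one device.
// Writes 1 to *ok on success, 0 on any CUDA error.
void run_on_device(int device, float *host_slice, int n, int *ok)
{
    *ok = 0;
    float *d_data = nullptr;
    size_t bytes = n * sizeof(float);

    // cudaSetDevice applies to the calling host thread only.
    if (cudaSetDevice(device) != cudaSuccess) return;
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) return;

    bool good = cudaMemcpy(d_data, host_slice, bytes,
                           cudaMemcpyHostToDevice) == cudaSuccess;
    if (good) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        scale<<<blocks, threads>>>(d_data, n, 2.0f);
        good = cudaGetLastError() == cudaSuccess;
    }
    // Results are shared by copying back to the host; this blocking
    // cudaMemcpy also waits for the kernel to finish.
    if (good)
        good = cudaMemcpy(host_slice, d_data, bytes,
                          cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_data);
    *ok = good ? 1 : 0;
}

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("no CUDA devices found\n");
        return 1;
    }

    const int n_per_device = 1 << 20;  // arbitrary slice size
    std::vector<float> host(static_cast<size_t>(count) * n_per_device, 1.0f);
    std::vector<int> ok(count, 0);

    // One host thread per device, each working on a disjoint slice.
    std::vector<std::thread> threads;
    for (int dev = 0; dev < count; ++dev)
        threads.emplace_back(run_on_device, dev,
                             host.data() + static_cast<size_t>(dev) * n_per_device,
                             n_per_device, &ok[dev]);
    for (auto &t : threads) t.join();

    for (int dev = 0; dev < count; ++dev)
        std::printf("device %d: %s\n", dev, ok[dev] ? "ok" : "failed");
    return 0;
}
```

Something like `nvcc -std=c++11 multi.cu -o multi` should build it (the file name is arbitrary). The threads never touch each other's slices, so no locking is needed; anything that has to be exchanged between cards goes through the `host` vector after the join.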

Heh, it’s time for my twice-weekly advertisement for GPUWorker:

(Maybe I can get MisterAnderson42 to start paying me a commission. :) )

On a grad student’s salary, no way!

After I defend my thesis next week, I should finally have some time to create a web page for GPUWorker so that it is more easily accessible as a separate entity from HOOMD. Lately, there seems to be a lot of interest in it. It has even come up at the workshop I’m currently at: there was a whole 30 min talk on a similar tool (in C) by John Stone which will also be released as a standalone package soon. So for those of you who can’t/don’t want to go the boost/C++ route, there will be another option.

For those who don’t know (and haven’t followed the link), GPUWorker is a powerful C++ based worker thread. You instantiate one per GPU and dispatch cuda* function calls for it to execute within the thread that is attached to that GPU. That gives you multi-GPU programs in a very simple master/slave setup: a simple multi-GPU program is quick and easy to write with it, yet it is also powerful enough to serve the needs of an entire large application like HOOMD.
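
To make the worker-thread idea concrete, here is a rough sketch of the pattern. This is NOT GPUWorker's actual code or API: `DeviceWorker`, `call`, and `init_status` are hypothetical names, and it uses the C++11 standard library where GPUWorker uses boost. It only shows the core idea of one thread bound to one GPU that executes queued cuda* calls on behalf of a master thread.

```cuda
#include <condition_variable>
#include <cstdio>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <cuda_runtime.h>

// Hypothetical worker: owns one host thread that stays bound to one device.
class DeviceWorker
{
public:
    explicit DeviceWorker(int device) : thread_([this] { loop(); })
    {
        // First queued call binds the worker thread to its device.
        init_status_ = call([device] { return cudaSetDevice(device); });
    }

    ~DeviceWorker()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        ready_.notify_one();
        thread_.join();
    }

    cudaError_t init_status() const { return init_status_; }

    // Queue one call and block until the worker thread has executed it.
    cudaError_t call(std::function<cudaError_t()> fn)
    {
        std::packaged_task<cudaError_t()> task(std::move(fn));
        std::future<cudaError_t> result = task.get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        ready_.notify_one();
        return result.get();
    }

private:
    void loop()
    {
        for (;;) {
            std::packaged_task<cudaError_t()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                ready_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stop requested, queue drained
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // runs in the worker thread, outside the lock
        }
    }

    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::packaged_task<cudaError_t()>> tasks_;
    bool stop_ = false;
    cudaError_t init_status_ = cudaSuccess;
    std::thread thread_;  // declared last so the members above exist first
};

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("no CUDA devices found\n");
        return 1;
    }

    DeviceWorker worker(0);
    cudaError_t err = worker.init_status();
    float *d_buf = nullptr;
    const size_t bytes = 1024 * sizeof(float);

    // Every cuda* call below executes inside the worker's thread.
    if (err == cudaSuccess)
        err = worker.call([&] { return cudaMalloc(&d_buf, bytes); });
    if (err == cudaSuccess) {
        err = worker.call([&] { return cudaMemset(d_buf, 0, bytes); });
        worker.call([&] { return cudaFree(d_buf); });
    }
    std::printf("worker on device 0: %s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```

In a multi-GPU program you would create one such worker per device and have the master thread hand each of them its share of the calls. This sketch blocks on every call for simplicity; a real tool would presumably also offer an asynchronous path so the GPUs can work concurrently.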

Good luck on your defense. You certainly have done a great job of keeping GPUWorker simple yet powerful, and with it thoroughly documented in Doxygen there’s really no excuse for not using it. Even if you aren’t using multiple GPUs, it really is the sort of thing you need for running asynchronous, streamed CUDA applications where you have to allocate memory in the proper host threads. It greatly simplified my collision checking system.

IMO, for anyone looking to structure their large-scale CUDA application, HOOMD should be their reference.

Good luck! But you should be ok I think :)