Multi-GPU dot product

Hi everyone, I’m a newbie in multi-GPU programming and I have some questions regarding the classic dot-product implementation. I’m running a CPU thread that creates 2 large arrays, A[dimension_N] and B[dimension_N]. Due to the size of these arrays, I need to split the computation of their dot product across 2 GPUs, both Tesla M2050 (compute capability 2.0). The problem is that I need to compute these dot products several times inside a do-loop controlled by my CPU thread. Each dot product requires the result of the previous one. I’ve read about creating 2 different threads that control the 2 different GPUs, but I have no clue how to synchronize and exchange data between them. Is there an alternative? I’d really appreciate any kind of help or example.

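You do not need to create 2 threads. You can run 2 streams, one per GPU, and switch contexts between the devices from a single host thread. The sync is done with a stream-synchronize call (cudaStreamSynchronize() in the runtime API).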
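A minimal sketch of that single-thread approach, not the poster's actual code: one host thread switches devices with cudaSetDevice(), queues work on one stream per GPU, and waits with cudaStreamSynchronize(). The names dotKernel, BLOCKS, THREADS and the array size are hypothetical, and the sketch assumes CUDA Toolkit 4.0 or later, two visible GPUs of compute capability 2.0 or higher (for atomicAdd on float), and a length that divides evenly by 2.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) { \
    fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(e)); \
    exit(1); } } while (0)

const int THREADS = 256;  // hypothetical launch configuration
const int BLOCKS  = 64;

// Each block reduces its share of a[i]*b[i] and adds the block sum to *out.
__global__ void dotKernel(const float* a, const float* b, float* out, int n) {
    __shared__ float cache[THREADS];
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        sum += a[i] * b[i];
    cache[threadIdx.x] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(out, cache[0]);  // float atomicAdd: sm_20+
}

int main() {
    const int N = 1 << 20, nGpu = 2, half = N / nGpu;
    int count = 0;
    CHECK(cudaGetDeviceCount(&count));
    if (count < nGpu) { printf("need %d GPUs, found %d\n", nGpu, count); return 1; }

    std::vector<float> A(N, 1.0f), B(N, 2.0f);
    float *dA[nGpu], *dB[nGpu], *dOut[nGpu], *partial;
    cudaStream_t st[nGpu];
    // Pinned, portable host buffer so the async copies below do not block.
    CHECK(cudaHostAlloc((void**)&partial, nGpu * sizeof(float), cudaHostAllocPortable));

    for (int g = 0; g < nGpu; ++g) {  // one-time setup: each GPU gets half of A and B
        CHECK(cudaSetDevice(g));
        CHECK(cudaStreamCreate(&st[g]));
        CHECK(cudaMalloc((void**)&dA[g], half * sizeof(float)));
        CHECK(cudaMalloc((void**)&dB[g], half * sizeof(float)));
        CHECK(cudaMalloc((void**)&dOut[g], sizeof(float)));
        CHECK(cudaMemcpy(dA[g], A.data() + g * half, half * sizeof(float), cudaMemcpyHostToDevice));
        CHECK(cudaMemcpy(dB[g], B.data() + g * half, half * sizeof(float), cudaMemcpyHostToDevice));
    }

    for (int iter = 0; iter < 3; ++iter) {  // the host-controlled do-loop
        for (int g = 0; g < nGpu; ++g) {    // queue work on both GPUs without waiting
            CHECK(cudaSetDevice(g));
            CHECK(cudaMemsetAsync(dOut[g], 0, sizeof(float), st[g]));
            dotKernel<<<BLOCKS, THREADS, 0, st[g]>>>(dA[g], dB[g], dOut[g], half);
            CHECK(cudaMemcpyAsync(&partial[g], dOut[g], sizeof(float),
                                  cudaMemcpyDeviceToHost, st[g]));
        }
        float dot = 0.0f;
        for (int g = 0; g < nGpu; ++g) {    // wait for each GPU, then combine on the host
            CHECK(cudaSetDevice(g));
            CHECK(cudaStreamSynchronize(st[g]));
            dot += partial[g];
        }
        printf("iteration %d: dot = %f\n", iter, dot);
        // The host now holds the full result and can use it to set up the next iteration.
    }

    for (int g = 0; g < nGpu; ++g) {
        CHECK(cudaSetDevice(g));
        CHECK(cudaFree(dA[g])); CHECK(cudaFree(dB[g])); CHECK(cudaFree(dOut[g]));
        CHECK(cudaStreamDestroy(st[g]));
    }
    CHECK(cudaFreeHost(partial));
    return 0;
}
```

Because only one float per GPU travels back each iteration, combining the partial sums on the host is cheap here; the dependency between iterations is handled simply by not starting the next round until both streams have been synchronized.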

What do you achieve by posting the same question 2 times?

Thanks for the reply. I think the first post was made in the wrong section, so I have already asked a moderator to delete it, since I can’t.

So an algorithm for that would be:

Is the above correct? And regarding the data exchange between the 2 devices, what’s the best practice to follow? My GPUs are located in the same node, so it is possible to use P2P data exchange, right?
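For the P2P question, a hedged sketch of how peer access can be probed and used with the runtime API. It assumes CUDA Toolkit 4.0 or later and two visible GPUs; whether direct peer access is actually available depends on the platform (for example PCIe topology and OS/driver mode), which is why the code asks cudaDeviceCanAccessPeer() first. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    if (count < 2) { printf("need 2 GPUs, found %d\n", count); return 1; }

    // Ask the runtime whether each device can address the other's memory.
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    printf("peer access 0->1: %d, 1->0: %d\n", can01, can10);

    float *d0 = 0, *d1 = 0;
    cudaSetDevice(0); cudaMalloc((void**)&d0, sizeof(float));
    cudaSetDevice(1); cudaMalloc((void**)&d1, sizeof(float));

    if (can01 && can10) {
        // Enabling is per direction and applies to the current device.
        cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0);
        cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0);
    }

    // Put a value on device 0, copy it device-to-device, read it back from device 1.
    float v = 3.5f, back = 0.0f;
    cudaSetDevice(0);
    cudaMemcpy(d0, &v, sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpyPeer(d1, 1, d0, 0, sizeof(float));  // (dst, dstDevice, src, srcDevice, bytes)
    cudaSetDevice(1);
    cudaMemcpy(&back, d1, sizeof(float), cudaMemcpyDeviceToHost);
    printf("value read back from device 1: %f\n", back);

    cudaFree(d1);
    cudaSetDevice(0);
    cudaFree(d0);
    return 0;
}
```

For a dot product, the only data that has to move between the GPUs is one partial sum per iteration, so summing the two partial results on the host is likely just as practical as a peer copy.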

Yes, it seems OK.

Actually, it doesn’t work. I can’t use the cudaSetDevice() function more than once in my host code. I got the same issues described here: NVIDIA forums… So how can you use multiple streams on multiple devices if you can’t use cudaSetDevice() to change your current device?

cudaSetDevice() can be called more than once in a program only on specific devices. I thought that the M2050 cards could do that with the latest CUDA Toolkit.

Edit: I was at a course, and one of the examples was about multi-GPU with streams. If the example does not work, there is a problem with the driver or with the architecture flags used at compile time. I attached the example to this post.

If you decide to go with one thread per GPU, it can work, but all the communication between the threads will have to be done through the MPI/OpenMP library you use. Maybe these calls will help: cudaThreadSynchronize() or cudaEventRecord().
ex5.tar.gz (1.97 KB)
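The one-thread-per-GPU alternative could look roughly like the sketch below. This is not the content of the attached ex5.tar.gz; it is an illustration that assumes OpenMP (built with something like nvcc -Xcompiler -fopenmp), two visible GPUs of compute capability 2.0 or higher, and that the OpenMP runtime really provides the 2 requested threads. The kernel name partialDot and the sizes are hypothetical. Note that cudaThreadSynchronize() has been deprecated since CUDA 4.0 in favor of cudaDeviceSynchronize(); here the blocking cudaMemcpy() already waits for the kernel. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <omp.h>
#include <cuda_runtime.h>

const int THREADS = 256;  // hypothetical launch configuration

// Each block reduces its share of a[i]*b[i] and adds the block sum to *out.
__global__ void partialDot(const float* a, const float* b, float* out, int n) {
    __shared__ float cache[THREADS];
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        sum += a[i] * b[i];
    cache[threadIdx.x] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(out, cache[0]);
}

int main() {
    const int N = 1 << 20, nGpu = 2, chunk = N / nGpu;
    int count = 0;
    cudaGetDeviceCount(&count);
    if (count < nGpu) { printf("need %d GPUs, found %d\n", nGpu, count); return 1; }

    std::vector<float> A(N, 1.0f), B(N, 2.0f);
    float dot = 0.0f;

    // One host thread per GPU; OpenMP's reduction combines the partial sums.
    #pragma omp parallel num_threads(nGpu) reduction(+ : dot)
    {
        int g = omp_get_thread_num();
        cudaSetDevice(g);  // each thread binds to its own device

        float *dA, *dB, *dOut, partial = 0.0f;
        cudaMalloc((void**)&dA, chunk * sizeof(float));
        cudaMalloc((void**)&dB, chunk * sizeof(float));
        cudaMalloc((void**)&dOut, sizeof(float));
        cudaMemcpy(dA, A.data() + g * chunk, chunk * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dB, B.data() + g * chunk, chunk * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemset(dOut, 0, sizeof(float));

        partialDot<<<64, THREADS>>>(dA, dB, dOut, chunk);
        // Blocking copy: returns only after the kernel on this device has finished.
        cudaMemcpy(&partial, dOut, sizeof(float), cudaMemcpyDeviceToHost);
        dot += partial;

        cudaFree(dA); cudaFree(dB); cudaFree(dOut);
    }

    printf("dot = %f\n", dot);
    return 0;
}
```

Here the exchange between the two threads happens entirely on the host through the OpenMP reduction, which matches the point above that inter-thread communication is the threading library's job rather than CUDA's.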

Thank you very much! I’ll check it out tomorrow. First I need to update my CUDA Toolkit, though, because I’m currently on 3.2. Do you know if a driver update is also required?


You should always have the latest possible NVIDIA driver.
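One way to see whether the installed driver is recent enough for a given toolkit is to compare the two version numbers the runtime reports. A small sketch, assuming only the standard runtime calls cudaDriverGetVersion() and cudaRuntimeGetVersion(), which encode versions as 1000*major + 10*minor:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driver = 0, runtime = 0;
    cudaDriverGetVersion(&driver);    // highest CUDA version the installed driver supports
    cudaRuntimeGetVersion(&runtime);  // version of the CUDA runtime this program was built with

    printf("driver supports CUDA %d.%d\n", driver / 1000, (driver % 100) / 10);
    printf("runtime is CUDA %d.%d\n", runtime / 1000, (runtime % 100) / 10);

    if (driver < runtime)
        printf("the driver is older than the runtime; a driver update is needed\n");
    return 0;
}
```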