I wrote a simple multi-device CUDA program using OpenMP. I observed that the execution times of the
kernel and of cudaMemcpy are much worse than with a single device, even though the kernel and the
cudaMemcpy calls are exactly the same in both cases. I call cudaSetDevice and launch the kernel in
different threads. I also checked with cudaGetDevice that cudaSetDevice works correctly.
Do you know what is happening? I suspect something is being executed sequentially (it looks like time sharing on a CPU).
I am working on a machine with three GTX 480s and CUDA 3.1; the OS is Debian sid with kernel 2.6.32-5-686-bigmem.
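
A minimal sketch of the pattern described above, to make the setup concrete. This is not the actual program: the `scale` kernel, the buffer size, and the event-based timing are placeholders assumed for illustration. Each OpenMP thread binds to one device with cudaSetDevice, confirms the binding with cudaGetDevice, and times its own copy, kernel, and copy back.

```cuda
#include <cstdio>
#include <vector>
#include <omp.h>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel: doubles each element.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount == 0) {
        printf("no CUDA device found\n");
        return 1;
    }

    const int n = 1 << 20;  // placeholder problem size per device

    // One host thread per GPU.
    #pragma omp parallel num_threads(deviceCount)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);  // bind this host thread to its device

        int current = -1;
        cudaGetDevice(&current);  // confirm the binding

        std::vector<float> host(n, 1.0f);
        float *d = NULL;
        cudaMalloc((void **)&d, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // Time copy-in, kernel, and copy-out on this device.
        cudaEventRecord(start, 0);
        cudaMemcpy(d, &host[0], n * sizeof(float), cudaMemcpyHostToDevice);
        scale<<<(n + 255) / 256, 256>>>(d, n);
        cudaMemcpy(&host[0], d, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("thread %d on device %d: %.3f ms\n", dev, current, ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d);
    }
    return 0;
}
```

With a GCC host compiler, a file like this would typically be built with `nvcc -Xcompiler -fopenmp`. Running it once with all devices, and once with the thread count forced to 1, gives per-device timings that can be compared directly.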