Performance drop using multiple cuda devices with pthread

Dear CUDA developers:
I have a Tesla S2050, which contains four identical CUDA devices, and I intended to run a program on all of them. My program uses many subroutines from CUBLAS and CUSPARSE, and I got it running on all four devices by using pthreads to create four CPU threads, each of which controls one CUDA device. It works, but the performance went down:
with one CUDA device, the program takes 28 seconds;
with two devices, the times were 32 sec and 33 sec;
with four devices, the times were over 50 sec.
I'm wondering what the cause is. Could creating four CPU threads be what drags the overall performance down? My program runs almost entirely on the GPU, apart from transferring data between the CPU and the GPU.
Thanks a lot!
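For reference, my threading setup looks roughly like this (a simplified sketch; the actual CUBLAS/CUSPARSE work is omitted and `worker` is just a placeholder name):

```cpp
#include <cuda_runtime.h>
#include <pthread.h>

// One worker per GPU; each CPU thread binds itself to one device.
static void *worker(void *arg) {
    int dev = (int)(size_t)arg;
    cudaSetDevice(dev);            // bind this CPU thread to device `dev`
    // ... create CUBLAS/CUSPARSE handles, run the computation,
    //     cudaMemcpy results back to the host, write them to a file ...
    return NULL;
}

int main(void) {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);     // 4 on the S2050
    pthread_t tid[8];
    for (int i = 0; i < ndev; ++i)
        pthread_create(&tid[i], NULL, worker, (void *)(size_t)i);
    for (int i = 0; i < ndev; ++i)
        pthread_join(tid[i], NULL);
    return 0;
}
```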

Although I am not very skilled in this topic, the first thing that comes to my mind is inter-device communication. If a lot of data has to move between devices (between kernel launches), the throughput of the PCI-E bus can become a bottleneck for your application. This obviously cannot happen in the single-device case, where the communication is serviced by global memory. I doubt the slowdown is caused by the CPU threads.

Did you try running your application as root, or setting your process/thread priority and affinity?

You could also take a look at this page: http://docs.nvidia.com/cuda/cuda-runtime-api/index.html#group__CUDART__DEVICE_1g18074e885b4d89f5a0fe1beab589e0c8

Maybe, if your host CPU threads are often waiting at synchronisation points, you could tell the driver to actively spin at these blocking points by calling cudaSetDeviceFlags with cudaDeviceScheduleSpin.
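Something like this, called once in each host thread before any work is submitted to that thread's device (just a sketch; spinning burns CPU cycles but can cut the latency of blocking synchronisation calls):

```cpp
#include <cuda_runtime.h>

// Call early in each worker thread, before the CUDA context on
// that device is created by the first real runtime call.
static void bind_device_spin(int dev) {
    cudaSetDevice(dev);
    // Host threads will spin-wait instead of yielding/sleeping
    // when blocked on the GPU (e.g. in cudaMemcpy or stream syncs).
    cudaSetDeviceFlags(cudaDeviceScheduleSpin);
}
```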

Try running your application under the NVIDIA Visual Profiler to find out where the bottlenecks are (on the host-side API?).

Thanks a lot for the replies!
To Dalibor_CZ:
My program has no inter-device data exchange, so that's not the case here, but your point could well be an issue in other cases;
To Tobbey:
The synchronisation waiting may make sense. I use cudaMemcpy to copy results back to the CPU and write them to files, then move on to the next task. When the four CPU threads finish their calculations at almost the same time, that may cause contention. I will try not to use cudaMemcpy.
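One alternative to a plain cudaMemcpy (which blocks the calling thread on the default stream) might be pinned host memory plus cudaMemcpyAsync on a per-thread stream, so each thread only waits on its own transfer. A hedged sketch, with `fetch_results` and its parameters being hypothetical names:

```cpp
#include <cuda_runtime.h>

// Copy `n` floats of results back in this thread's own stream.
// Pinned (page-locked) host memory is required for the copy to
// actually overlap with other threads' work.
static void fetch_results(const float *d_result, size_t n,
                          cudaStream_t stream, float **h_out) {
    cudaHostAlloc((void **)h_out, n * sizeof(float),
                  cudaHostAllocDefault);            // pinned buffer
    cudaMemcpyAsync(*h_out, d_result, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);  // wait only on this stream,
                                    // then write the file on the host
}
```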