cudaMemcpyToSymbol badly hurts multi-GPU performance

Hi all:

I have an algorithm which essentially requires 3 convolutions on an image. The workflow is like this:

1: use cudaMemcpyToSymbol to set up the horizontal convolution coefficient kernel;
2: convolutionRow, using the standard shared memory convolution method;
3: use cudaMemcpyToSymbol to set up the vertical convolution coefficient kernel;
4: convolutionColumn.
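
For reference, the workflow above can be sketched roughly as follows. This is a simplified illustration, not our production code: the symbol name c_Kernel, the naive clamped-border kernels (stand-ins for the shared memory versions), the block size, and the zero-filled host coefficient arrays are all assumptions made for the example.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

#define KERNEL_LENGTH 512   // coefficients copied per cudaMemcpyToSymbol call

// Constant-memory buffer holding the coefficients of the current pass
// (hypothetical name).
__constant__ float c_Kernel[KERNEL_LENGTH];

// Naive stand-in for the shared memory row convolution; borders are clamped.
__global__ void convolutionRow(float *dst, const float *src, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float sum = 0.0f;
    for (int k = 0; k < KERNEL_LENGTH; ++k) {
        int sx = min(max(x + k - KERNEL_LENGTH / 2, 0), w - 1);
        sum += c_Kernel[k] * src[y * w + sx];
    }
    dst[y * w + x] = sum;
}

// Naive stand-in for the shared memory column convolution.
__global__ void convolutionColumn(float *dst, const float *src, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float sum = 0.0f;
    for (int k = 0; k < KERNEL_LENGTH; ++k) {
        int sy = min(max(y + k - KERNEL_LENGTH / 2, 0), h - 1);
        sum += c_Kernel[k] * src[sy * w + x];
    }
    dst[y * w + x] = sum;
}

int main()
{
    const int w = 2048, h = 1556;
    const size_t bytes = (size_t)w * h * sizeof(float);

    // Placeholder coefficients: 3 row sets and 3 column sets, all zeros here.
    static float h_row[3][KERNEL_LENGTH];
    static float h_col[3][KERNEL_LENGTH];

    float *d_a = 0, *d_b = 0;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMemset(d_a, 0, bytes);

    dim3 block(16, 16);
    dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);

    // Steps 1-4, repeated 3 times: 6 cudaMemcpyToSymbol calls in total.
    for (int pass = 0; pass < 3; ++pass) {
        cudaMemcpyToSymbol(c_Kernel, h_row[pass], KERNEL_LENGTH * sizeof(float));
        convolutionRow<<<grid, block>>>(d_b, d_a, w, h);
        cudaMemcpyToSymbol(c_Kernel, h_col[pass], KERNEL_LENGTH * sizeof(float));
        convolutionColumn<<<grid, block>>>(d_a, d_b, w, h);
    }

    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```

In the multi-GPU setup, each GPU would run this same loop on its own image after a cudaSetDevice call from its own host thread.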

The above steps repeat 3 times, so there are 6 cudaMemcpyToSymbol calls in total. The code was originally developed on Linux, where performance scales almost linearly as we increase the number of GPUs in the system.

Recently we ported the code to Windows 7. Unfortunately, performance no longer scales as we add GPUs: with 4 GPUs we get almost the same throughput as with 2. After some investigation, we found that the problem is related to the cudaMemcpyToSymbol calls. When we remove those calls, performance scales linearly again (of course the result is then incorrect because the coefficients are not set up properly, but the computational complexity is the same). It looks as if cudaMemcpyToSymbol introduces heavy contention that slows down the entire application. Note that bandwidth is not the issue: each call copies only 512 float values (2 KB), while the image size is 2048x1556.
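
One way to isolate the effect would be to time the small symbol copy from one host thread per GPU and compare the average per-call latency as the GPU count grows. The sketch below is only an illustration of that idea, not the measurement we actually ran: the symbol name c_Kernel, the iteration count, and the use of std::thread and std::chrono are assumptions made for the example.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

#define KERNEL_LENGTH 512   // 512 floats = 2 KB per copy

// Each device gets its own instance of this constant buffer (hypothetical name).
__constant__ float c_Kernel[KERNEL_LENGTH];

// Runs on one host thread per GPU and records the average copy time in microseconds.
static void timeSymbolCopies(int dev, int iterations, double *avgMicros)
{
    float h_coeffs[KERNEL_LENGTH] = {0};

    cudaSetDevice(dev);
    cudaFree(0);  // force context creation before timing
    cudaMemcpyToSymbol(c_Kernel, h_coeffs, sizeof(h_coeffs));  // warm-up copy

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        cudaMemcpyToSymbol(c_Kernel, h_coeffs, sizeof(h_coeffs));
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();

    *avgMicros =
        std::chrono::duration<double, std::micro>(t1 - t0).count() / iterations;
}

int main(int argc, char **argv)
{
    int available = 0;
    cudaGetDeviceCount(&available);

    // Optional argument: how many GPUs to exercise concurrently.
    int numGpus = (argc > 1) ? atoi(argv[1]) : available;
    if (numGpus > available) numGpus = available;
    if (numGpus < 1) { printf("no CUDA device selected\n"); return 1; }

    std::vector<double> avgMicros(numGpus, 0.0);
    std::vector<std::thread> threads;
    for (int dev = 0; dev < numGpus; ++dev)
        threads.emplace_back(timeSymbolCopies, dev, 1000, &avgMicros[dev]);
    for (size_t i = 0; i < threads.size(); ++i)
        threads[i].join();

    for (int dev = 0; dev < numGpus; ++dev)
        printf("GPU %d: %.2f us per cudaMemcpyToSymbol\n", dev, avgMicros[dev]);
    return 0;
}
```

Running it with 1, 2 and 4 GPUs on both operating systems would show whether the per-call cost grows with the number of active GPUs, which would point at contention rather than transfer size.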

The hardware configuration:
4 GTX 580s connected via Cubix.

Software and drivers:
Driver version: 280.26
CUDA Toolkit: 4.0.17

Has anyone seen a similar issue? Is it a driver problem?

Thanks in advance.