cudaMemcpyToArray issue: a bug?

I am trying to check whether cudaMemcpyToArray works correctly, using this simple code:

    // Sum of squares divided by nElems, before the copy.
    sum = cplvSum2_32f_C1(d_temp3D, nElems, d_tempBuffer, 64) / nElems;
    fprintf(stderr, " Debug value 01: %d %f \n", id, sum);

    // Copy the device buffer into the cudaArray, then zero the buffer.
    cudaMemcpyToArray(d_array3D_temp, 0, 0, d_temp3D, sizeMem, cudaMemcpyDeviceToDevice);
    cudaMemset(d_temp3D, 0, sizeMem);

    // The buffer is now all zeros, so this should print 0.
    sum = cplvSum2_32f_C1(d_temp3D, nElems, d_tempBuffer, 64) / nElems;
    fprintf(stderr, " Debug value 02: %d %f \n", id, sum);

    // Copy the cudaArray back into the device buffer.
    cudaMemcpyFromArray(d_temp3D, d_array3D_temp, 0, 0, sizeMem, cudaMemcpyDeviceToDevice);

    // This should match Debug value 01.
    sum = cplvSum2_32f_C1(d_temp3D, nElems, d_tempBuffer, 64) / nElems;
    fprintf(stderr, " Debug value 03: %d %f \n", id, sum);

In this code, I copy the contents of a device memory block to a cudaArray, zero the memory block, and then read the data back from the cudaArray into the memory block. The sum2 function computes the sum of the squared values in a memory block.

If everything runs correctly, Debug values 01 and 03 will be the same, while Debug value 02 will be 0.
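The snippet above does not show whether any of the CUDA calls returned an error, which is usually the first thing worth ruling out. A minimal sketch of the same round trip with every return value checked might look like the following. The CHECK_CUDA macro, the function name roundTripThroughArray, the 1D array shape, and the test data are all assumptions made for illustration and are not part of the original code. The comparison is done on the host instead of with cplvSum2_32f_C1.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: report a failed CUDA call and make the caller return false.
#define CHECK_CUDA(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "%s:%d: %s failed: %s\n", __FILE__, __LINE__, \
                    #call, cudaGetErrorString(err_));                     \
            return false;                                                 \
        }                                                                 \
    } while (0)

// Round-trips n floats through a 1D cudaArray on the given device.
// Returns true only if every call succeeds and the data comes back intact.
// Early returns leak the allocations, which is tolerable for a diagnostic.
bool roundTripThroughArray(int device, size_t n)
{
    CHECK_CUDA(cudaSetDevice(device));

    const size_t sizeMem = n * sizeof(float);
    std::vector<float> h_in(n), h_out(n, -1.0f);
    for (size_t i = 0; i < n; ++i) h_in[i] = 0.5f + (float)(i % 7);

    float *d_buf = NULL;
    cudaArray *d_array = NULL;
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();

    CHECK_CUDA(cudaMalloc((void **)&d_buf, sizeMem));
    CHECK_CUDA(cudaMallocArray(&d_array, &desc, n, 0));  // height 0: 1D array
    CHECK_CUDA(cudaMemcpy(d_buf, h_in.data(), sizeMem, cudaMemcpyHostToDevice));

    // Same three steps as in the post: copy in, zero the buffer, copy back.
    CHECK_CUDA(cudaMemcpyToArray(d_array, 0, 0, d_buf, sizeMem,
                                 cudaMemcpyDeviceToDevice));
    CHECK_CUDA(cudaMemset(d_buf, 0, sizeMem));
    CHECK_CUDA(cudaMemcpyFromArray(d_buf, d_array, 0, 0, sizeMem,
                                   cudaMemcpyDeviceToDevice));

    CHECK_CUDA(cudaMemcpy(h_out.data(), d_buf, sizeMem, cudaMemcpyDeviceToHost));
    CHECK_CUDA(cudaFreeArray(d_array));
    CHECK_CUDA(cudaFree(d_buf));

    // Element-wise comparison on the host.
    for (size_t i = 0; i < n; ++i) {
        if (h_out[i] != h_in[i]) {
            fprintf(stderr, "device %d: mismatch at element %zu\n", device, i);
            return false;
        }
    }
    return true;
}
```

Calling roundTripThroughArray(id, 1024) from each thread would show whether a copy fails outright on one device or completes without error but returns zeros.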

With 1 GPU, the result is correct.

With 2 GPUs running in parallel, something strange happens. On one GPU the result is still correct. On the other:

    Debug value 01: 0 0.140667
    Debug value 02: 0 0.000000
    Debug value 03: 0 0.000000

Something seems to be wrong with cudaMemcpyFromArray/cudaMemcpyToArray. Is this a bug?

The code I posted here is exactly the code inside my threads. Each thread runs on one device, selected with cudaSetDevice, and all arrays and memory blocks are allocated per thread (inside the thread function).
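For reference, the per-thread setup described here can be sketched roughly as follows. This is not the original thread function: the names worker and runOnAllDevices, the use of C++11 std::thread, and the 1D array shape are assumptions for illustration, since the post does not show how the threads are created. The sketch only checks that each thread ends up on the device it asked for and that its allocations succeed.

```cuda
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical worker: binds the calling thread to `device`, then allocates
// its own buffer and cudaArray. `ok` receives 1 on success, 0 on failure.
void worker(int device, size_t n, int *ok)
{
    *ok = 0;
    float *d_buf = NULL;
    cudaArray *d_array = NULL;
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();

    // Select the device before this thread allocates anything; otherwise the
    // allocations land on the thread's current (default) device.
    cudaError_t err = cudaSetDevice(device);
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_buf, n * sizeof(float));
    if (err == cudaSuccess) err = cudaMallocArray(&d_array, &desc, n, 0);

    // Confirm the thread really is on the device it asked for.
    int current = -1;
    if (err == cudaSuccess) err = cudaGetDevice(&current);

    if (err != cudaSuccess)
        fprintf(stderr, "device %d: %s\n", device, cudaGetErrorString(err));
    else
        *ok = (current == device) ? 1 : 0;

    if (d_array) cudaFreeArray(d_array);
    if (d_buf) cudaFree(d_buf);
}

// Launches one host thread per visible device; returns how many succeeded.
int runOnAllDevices(size_t n)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) return 0;

    std::vector<int> ok(count, 0);
    std::vector<std::thread> threads;
    for (int d = 0; d < count; ++d)
        threads.emplace_back(worker, d, n, &ok[d]);
    for (std::thread &t : threads) t.join();

    int good = 0;
    for (int d = 0; d < count; ++d) good += ok[d];
    return good;
}
```

If runOnAllDevices reports fewer successes than there are devices, the problem is in device selection or allocation rather than in the array copies themselves.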

Are you sure your multi-GPU code is correct? Have you looked at the multiGPU samples in the SDK?

It’s hard to tell what the problem is without the complete code.