cudaMemcpyDeviceToHost with multiple GPUs

I have written some multi-GPU code intended to eventually handle multiple millions of particles. With the number of GPUs = 1 the code works fine, and with a small number of particles and the number of GPUs = 2 it also works fine.

But with the number of GPUs = 2 and a large number of particles the code does not work: it appears to hang on a cudaMemcpyDeviceToHost from the second device. The call is

[codebox]

int threadstride = gputhread * nthreads;  /* offset into rho for this device */

CUDA_SAFE_CALL( cudaMemcpy(&rho[threadstride], d_rho, GPUBLOCK2 * sizeof(float), cudaMemcpyDeviceToHost) );

[/codebox]

where rho is on the host, threadstride = 0 for the first device and threadstride = 8192 for the second device, and GPUBLOCK2 = nthreads (the number of threads each device handles).

So what I am attempting to do is copy directly from each device into the rho array on the host, with a stride dependent on the device id, so that the first device writes to the first half of rho and the second device writes to the second half.

Why would this hang on the cudaMemcpy?