Any trick to avoid kernel/memcopys serialization?

Hello,

I’ve got this implemented correctly in OpenCL and it scales perfectly (~2x speedup). But in CUDA I can’t manage to use more than one GPU (better said, I use them, but serialized…).

I have a loop with this structure (the real code has three nested fors, but anyway I’m simplifying):

array in[2], in2[2], inOut[2];   // one set of device buffers per GPU
stream stream[2];                // one stream per device
for (e = 0; e < sizeY; ++e) {
      for (i = 0; i < sizeX; ++i) {

            n_device = 1 - n_device; // switch device
            cudaSetDevice(n_device);
            cudaDeviceSynchronize(); // Real code uses something smarter, but the same happens when I try this
            cudaMemcpyAsync(in[n_device], dataIn[i], ..., stream[n_device]);      // data to device
            cudaMemcpyAsync(in2[n_device], dataIn2[i], ..., stream[n_device]);    // data to device
            cudaMemcpyAsync(inOut[n_device], dataInOut[e], ..., stream[n_device]);// data to device
            kernel<<<..., stream[n_device]>>>(...);
            cudaMemcpyAsync(dataInOut[e], inOut[n_device], ..., stream[n_device]);// data back to host
      }
}
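For concreteness, this is roughly what one inner-loop iteration issues, with the full `cudaMemcpyAsync(dst, src, bytes, kind, stream)` signature spelled out. This is only a sketch: the buffer names (`d_in`, `d_inOut`, `h_in`, `h_inOut`), `bytes`, `grid` and `block` are my assumptions, since the snippet above is simplified pseudocode:

```cuda
// Sketch of one inner-loop iteration. All names here are hypothetical;
// the original post only shows simplified pseudocode.
n_device = 1 - n_device;            // alternate between the two GPUs
cudaSetDevice(n_device);
cudaDeviceSynchronize();            // wait for this device's previous iteration

// Host -> device inputs, asynchronous on this device's stream
cudaMemcpyAsync(d_in[n_device],    h_in[i],    bytes, cudaMemcpyHostToDevice, stream[n_device]);
cudaMemcpyAsync(d_inOut[n_device], h_inOut[e], bytes, cudaMemcpyHostToDevice, stream[n_device]);

// Kernel on the same stream, so it is ordered after the copies above
kernel<<<grid, block, 0, stream[n_device]>>>(d_in[n_device], d_inOut[n_device]);

// Device -> host result, still on the same stream
cudaMemcpyAsync(h_inOut[e], d_inOut[n_device], bytes, cudaMemcpyDeviceToHost, stream[n_device]);
```

One thing worth checking in the real code: `cudaMemcpyAsync` only overlaps with other work if the host buffers are pinned (allocated with `cudaMallocHost`/`cudaHostAlloc`); with ordinary pageable memory the copies behave synchronously.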

From my OpenCL understanding, this should run in parallel (i.e. each device can execute one iteration of i, then block only until the previous iteration on THAT DEVICE finishes).

I’m using CUDA 4.1 (can’t change version, so I can’t use Hyper-Q… :(, maybe downgrade to 4.0 only) on Fermi Tesla GPUs.

The problem looks like: memCpyInOut[0] (i.e. n_device = 0) runs, kernel[0] launches, memCpyInOut[0] waits, then memCpyIn[1] waits… It shouldn’t wait, since there is no dependency between them, but the driver serializes the memcpys, so everything stops.

Is there any trick I could use to force the driver not to serialize things? (and maybe to skip any correctness checks apart from the streams). I would prefer to keep the same structure/blocking in the loops and data, since I want to do a comparison with OpenCL and other things.

Well, I managed to “fix” it by myself.

In case someone hits this problem again with a similar approach (after all, it’s blocked matrix multiply :) ): just split the second loop in two parts, one that does the copies in and launches the kernels, and another one that gathers the results back to the host ._.

int max = nDIM + nDIM % N_DEVICES; // So we never do fewer loops than the matrix size
for (p = 0; p < max; p = p + N_DEVICES) {
    int x;

    for (x = 0; x < N_DEVICES; x++) {
        e = p + x;
        if (e < nDIM) { // If we are past the matrix size, ignore
            // Copy out matrix (inout)
            for (i = 0; i < lDIM; i++) {
                // Copy inputs and launch kernels
            }
        }
    }

    for (x = 0; x < N_DEVICES; x++) {
        // Copy inout back to host
    }
}

Or use CUDA 5, which should get rid of this :).