Hello,
I’ve got this implemented correctly in OpenCL and it scales perfectly (~2x speedup). But in CUDA I can’t manage to use more than one GPU (better said, I use them, but serialized…).
I have a loop with this structure (it’s really 3 nested fors, but anyway I’m simplifying):
array in[2], in2[2], inOut[2];
array out[2];
stream streams[2]; // one stream per device
for (e = 0; e < sizeY; ++e){
for (i = 0; i < sizeX; ++i){
n_device = 1 - n_device; // switch device
cudaSetDevice(n_device);
cudaDeviceSynchronize(); // real code uses something smarter, but the same happens when I try this
cudaMemcpyAsync(in[n_device], dataIn[i], ..., streams[n_device]); // data to device
cudaMemcpyAsync(in2[n_device], dataIn2[i], ..., streams[n_device]); // data to device
cudaMemcpyAsync(inOut[n_device], dataInOut[e], ..., streams[n_device]); // data to device
kernel<<<..., streams[n_device]>>>(...);
cudaMemcpyAsync(dataInOut[e], inOut[n_device], ..., streams[n_device]); // data back to host
}
}
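To make the structure concrete, here is a minimal compilable sketch of the pattern I’m aiming for. The buffer names, sizes, and the dummy kernel are placeholders, not my real code; the point is the per-device stream and the per-device synchronization:

```cuda
#include <cuda_runtime.h>

// Placeholder kernel: just touches the buffer so there is work to overlap.
__global__ void kernel(float *buf, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) buf[idx] += 1.0f;
}

int main(void) {
    const int N = 1 << 20, sizeX = 8;
    float *d_buf[2], *h_buf[2];
    cudaStream_t streams[2];

    for (int d = 0; d < 2; ++d) {
        cudaSetDevice(d);
        cudaMalloc(&d_buf[d], N * sizeof(float));
        // Pinned host memory is required for cudaMemcpyAsync to actually
        // overlap; a pageable buffer silently falls back to synchronous copies.
        cudaMallocHost(&h_buf[d], N * sizeof(float));
        cudaStreamCreate(&streams[d]);  // stream is bound to the current device
    }

    int n_device = 0;
    for (int i = 0; i < sizeX; ++i) {
        n_device = 1 - n_device;  // alternate devices each iteration
        cudaSetDevice(n_device);
        // Wait only for the previous iteration on THIS device, not globally.
        cudaStreamSynchronize(streams[n_device]);
        cudaMemcpyAsync(d_buf[n_device], h_buf[n_device], N * sizeof(float),
                        cudaMemcpyHostToDevice, streams[n_device]);
        kernel<<<(N + 255) / 256, 256, 0, streams[n_device]>>>(d_buf[n_device], N);
        cudaMemcpyAsync(h_buf[n_device], d_buf[n_device], N * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[n_device]);
    }

    for (int d = 0; d < 2; ++d) {
        cudaSetDevice(d);
        cudaDeviceSynchronize();
        cudaFree(d_buf[d]);
        cudaFreeHost(h_buf[d]);
        cudaStreamDestroy(streams[d]);
    }
    return 0;
}
```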
From my OpenCL understanding, this should run in parallel (i.e. each device can execute one iteration of i, then block until the previous iteration on THAT DEVICE finishes).
I’m using CUDA 4.1 (I can’t change the version, so no Hyper-Q… :( maybe downgrade to 4.0 at most) on a Fermi Tesla GPU.
The problem looks like this: memcpyInOut[0] (i.e. n_device 0) works — kernel[0] launches — memcpyOut[0] waits — memcpyIn[1] waits… That last one shouldn’t wait, since there is no dependency between them, but the driver serializes the memcpys, so everything stalls.
Is there any trick to force the driver not to serialize things (even if that means skipping any correctness checks beyond the streams themselves)? I’d prefer to keep the same structure/blocking in the loops and data, since I want to do a fair comparison with OCL and other things.