Is concurrent copy and execute possible with CUFFT?

I’ve been trying to use the stream support in CUFFT 3.0 to get concurrent copy and execute working. Judging by the graphs Nexus produces, it is not happening.
(Note: I believe this is the CUFFT 3.0 beta, as it’s the version that came with the January release of Nexus.)

The code copies one line to the GPU asynchronously on stream 1 and applies a 1D FFT to it on stream 1.
It then copies the next line to the GPU asynchronously on stream 2 and applies a 1D FFT to it on stream 2.
All of this happens in a loop.

The card is a Tesla C1060 running as a second card alongside a GeForce 9400 under Windows 7.
I can also replace the 9400 with a GeForce GTX 285 if that is the problem (I’m trying to save a bit on the electric bill when I don’t need the power ;-)

Any idea why concurrent copy and execute is not happening?
Thanks

The code I’m running:

cufftHandle planx, planx2;
cufftHandle plany, plany2;

cufftComplex *host, *device;
cudaMallocHost(&host, sizeX*sizeY*sizeof(cufftComplex));
cudaMalloc(&device, sizeX*sizeY*sizeof(cufftComplex));

cufftPlan1d(&planx, sizeX, CUFFT_C2C, 1);
cufftPlan1d(&planx2, sizeX, CUFFT_C2C, 1);

cufftPlan1d(&plany, sizeY, CUFFT_C2C, 1);
cufftPlan1d(&plany2, sizeY, CUFFT_C2C, 1);

cudaStream_t stream1, stream2;
CUDA_CHK_ERR(cudaStreamCreate(&stream1));
CUDA_CHK_ERR(cudaStreamCreate(&stream2));

cufftSetStream(planx, stream1);
cufftSetStream(planx2, stream2);
cufftSetStream(plany, stream1);
cufftSetStream(plany2, stream2);

for (int i = 0 ; i < sizeY ; i += 2)
{
	cudaMemcpyAsync(device + i*sizeX, host + i*sizeX, sizeX*sizeof(cufftComplex), cudaMemcpyHostToDevice, stream1);
	cufftExecC2C(planx, device + i*sizeX, device + i*sizeX, CUFFT_FORWARD);

	cudaMemcpyAsync(device + i*sizeX, host + i*sizeX, sizeX*sizeof(cufftComplex), cudaMemcpyHostToDevice, stream2);
	cufftExecC2C(planx2, device + (i + 1)*sizeX, device + (i + 1)*sizeX, CUFFT_FORWARD);
}

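Both of your memcpys operate on the same buffer: each one copies host + i*sizeX to device + i*sizeX, while the second FFT runs on device + (i + 1)*sizeX. This may be your problem.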

I see about a 20% speedup when I use two streamed plans, each with half the batch size, versus one plan with a bigger batch.

  1. memcpy async buf1

  2. exec plan 1 on buf1

  3. memcpy async buf2

  4. exec plan 2 on buf2

Steps 2 and 3 can run in parallel, since they are issued to different streams and touch different buffers.
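
For reference, here is a minimal sketch of those four steps. It is an illustration, not the code I measured: the sizes, the names plan1/plan2 and buf1/buf2, and the choice to split the rows into two halves are all assumptions, and error checking is left out for brevity. Whether the copy and the FFT actually overlap still depends on the device and driver.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

int main()
{
    // Assumed sizes for illustration; sizeY must be even so it splits in two.
    const int sizeX = 1024;                       // length of each 1D FFT
    const int sizeY = 1024;                       // number of lines
    const int half  = sizeY / 2;                  // lines handled per stream
    const size_t halfElems = (size_t)half * sizeX;
    const size_t halfBytes = halfElems * sizeof(cufftComplex);

    cufftComplex *host = NULL, *device = NULL;
    cudaMallocHost((void**)&host, 2 * halfBytes); // pinned host memory, needed for async copies
    cudaMalloc((void**)&device, 2 * halfBytes);
    for (size_t k = 0; k < 2 * halfElems; ++k) { host[k].x = 1.0f; host[k].y = 0.0f; }

    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    // Two batched plans, each covering half of the lines, one per stream.
    cufftHandle plan1, plan2;
    cufftPlan1d(&plan1, sizeX, CUFFT_C2C, half);
    cufftPlan1d(&plan2, sizeX, CUFFT_C2C, half);
    cufftSetStream(plan1, stream1);
    cufftSetStream(plan2, stream2);

    // Two distinct device regions, so the streams never touch the same data.
    cufftComplex *buf1 = device;
    cufftComplex *buf2 = device + halfElems;

    cudaMemcpyAsync(buf1, host, halfBytes, cudaMemcpyHostToDevice, stream1);             // step 1
    cufftExecC2C(plan1, buf1, buf1, CUFFT_FORWARD);                                      // step 2
    cudaMemcpyAsync(buf2, host + halfElems, halfBytes, cudaMemcpyHostToDevice, stream2); // step 3
    cufftExecC2C(plan2, buf2, buf2, CUFFT_FORWARD);                                      // step 4

    // Wait for both streams before using the results or tearing down.
    cudaStreamSynchronize(stream1);
    cudaStreamSynchronize(stream2);
    printf("last error: %s\n", cudaGetErrorString(cudaGetLastError()));

    cufftDestroy(plan1);
    cufftDestroy(plan2);
    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);
    cudaFree(device);
    cudaFreeHost(host);
    return 0;
}
```

The same idea applies to the line-by-line loop in the question: each stream needs its own source and destination offsets, so the second copy should go to the buffer that the second plan transforms.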