I’ve been trying to use the stream support in CUFFT 3.0 to get concurrent copy and execute working. Judging by the timeline graphs from Nexus, this is not happening.
(Note: I believe this is the CUFFT 3.0 beta, as it’s the version that came with the January release of Nexus.)
The code copies one line to the GPU asynchronously on stream 1 and applies a 1D FFT to it on stream 1.
It then copies the next line to the GPU asynchronously on stream 2 and applies a 1D FFT to it on stream 2.
All of this happens in a loop over the lines.
The card is a Tesla C1060 running as a second card alongside a GeForce 9400 under Windows 7.
I can also replace the 9400 with a GeForce GTX 285 if that is the problem (I’m trying to save a bit on the electric bill when I don’t need the power ;-)
Any idea why concurrent copy and execute is not happening?
Thanks
The code I’m running:
cufftHandle planx, planx2;
cufftHandle plany, plany2;
cufftComplex *host = NULL, *device = NULL; // pinned host buffer and device buffer
cudaMallocHost(&host, sizeX*sizeY*sizeof(cufftComplex));
cudaMalloc(&device, sizeX*sizeY*sizeof(cufftComplex));
cufftPlan1d(&planx, sizeX, CUFFT_C2C, 1);
cufftPlan1d(&planx2, sizeX, CUFFT_C2C, 1);
cufftPlan1d(&plany, sizeY, CUFFT_C2C, 1);
cufftPlan1d(&plany2, sizeY, CUFFT_C2C, 1);
cudaStream_t stream1, stream2;
CUDA_CHK_ERR(cudaStreamCreate(&stream1));
CUDA_CHK_ERR(cudaStreamCreate(&stream2));
cufftSetStream(planx, stream1);
cufftSetStream(planx2, stream2);
cufftSetStream(plany, stream1);
cufftSetStream(plany2, stream2);
for (int i = 0 ; i < sizeY ; i += 2)
{
// Line i: copy and transform on stream1 (planx is bound to stream1)
cudaMemcpyAsync(device + i*sizeX, host + i*sizeX, sizeX*sizeof(cufftComplex), cudaMemcpyHostToDevice, stream1);
cufftExecC2C(planx, device + i*sizeX, device + i*sizeX, CUFFT_FORWARD);
// Line i + 1: copy and transform on stream2 (planx2 is bound to stream2)
cudaMemcpyAsync(device + (i + 1)*sizeX, host + (i + 1)*sizeX, sizeX*sizeof(cufftComplex), cudaMemcpyHostToDevice, stream2);
cufftExecC2C(planx2, device + (i + 1)*sizeX, device + (i + 1)*sizeX, CUFFT_FORWARD);
}
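One thing I can check separately is whether each card reports support for overlapping copies with kernel execution at all. Below is a minimal sketch of such a check, not part of the code above. It assumes a CUDA runtime recent enough to expose the `asyncEngineCount` field of `cudaDeviceProp`; on older toolkits such as 3.0, I believe the equivalent field is `deviceOverlap`. It lists every device because I am not sure which index the Tesla gets.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            std::printf("device %d: cudaGetDeviceProperties failed: %s\n",
                        dev, cudaGetErrorString(err));
            continue;
        }
        // asyncEngineCount > 0 means the device can copy memory while a kernel runs
        // (assumption: a toolkit that has this field; older ones use deviceOverlap).
        std::printf("device %d (%s): asyncEngineCount = %d\n",
                    dev, prop.name, prop.asyncEngineCount);
    }
    return 0;
}
```

If the Tesla reports 0 here, no arrangement of streams would give overlap. If it reports 1 or more, the problem is more likely in how the copies and FFTs are issued.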