cuFFT + streams

dobislaw · October 19, 2014, 11:07am

I am running FFTs across multiple streams. What is wrong with my code?
It generates the wrong output, although FFT plan creation succeeds
and h_Data is populated.

When I run this code, the display driver recovers, which, I guess, means something is taking a very long time to compute.

static cufftComplex h_Data[sig*tds];
	cufftComplex *d_Data;

	const unsigned int bytesComplex = sizeof(cufftComplex) * sig * tds;

// asynchronous version
	// stream stuff
	const int blockSize = tds, nStreams = 4;
	const int nn = 320 * blockSize * nStreams;
	const int streamSize = nn / nStreams;
	const int streamBytes = streamSize * sizeof(cufftComplex);
	// create cuda streams
	cudaStream_t streams[nStreams];
	for (int i = 0; i < nStreams; i++)
	{
		check(cudaStreamCreate(&streams[i]));
	}
	// create cufft plans and set them in streams
	cufftHandle* fftPlans = (cufftHandle*)malloc(sizeof(cufftHandle)*nStreams);
	for (int i = 0; i < nStreams; i++)
	{
		status = cufftPlanMany(&fftPlans[i],1,n,inembed,istride,idist,onembed,ostride,odist,CUFFT_C2C,batch/nStreams);
		if (status != CUFFT_SUCCESS)  
		{
			cout << "Cufft FFT plan error: " << status << endl;
		}
		cufftSetStream(fftPlans[i],streams[i]);
	}
	// GPU start time measurement
	check(cudaEventRecord(start));
	// CPU pinned memory allocation
	cufftComplex *h_DataPtr;
	h_DataPtr = h_Data;
	check(cudaMallocHost((void**)&h_DataPtr,bytesComplex));
	// GPU memory allocation
	check(cudaMalloc((void**)&d_Data,bytesComplex));
	// copy input data to GPU for processing
	for (int i = 0; i < nStreams; ++i)
	{
		int offset = i * streamSize;
		cudaMemcpyAsync(&d_Data[offset], &h_Data[offset], streamBytes, cudaMemcpyHostToDevice,streams[i]);
	}
	// FFT of a signal
	for (int i = 0; i < nStreams; ++i)
	{
		int offset = i * streamSize;
		status = cufftExecC2C(fftPlans[i],&d_Data[offset],&d_Data[offset],CUFFT_FORWARD);
		if (status != CUFFT_SUCCESS)
		{
			cout << "Cufft FFT work error: " << status << endl;
		}
	}
	for (int i = 0; i < nStreams; ++i)
	{
		int offset = i * streamSize;
		cudaMemcpyAsync(&h_Data[offset], &d_Data[offset], streamBytes, cudaMemcpyDeviceToHost, streams[i]);
	}
	// GPU stop time measurement
	cudaEventRecord(stop);
	cudaEventSynchronize(stop);
	cudaEventElapsedTime(&milliseconds, start, stop);
	std::cout<<"GPU timing by GPU (async): "<< milliseconds << " ms" <<endl;

Robert_Crovella · October 19, 2014, 2:15pm

Here is a fully worked example:

"Asynchronous executions of CUDA memory copies and cuFFT" (Stack Overflow)

dobislaw · October 23, 2014, 8:17am

OK, guys. Since no article helped me solve my problem, I figured it out myself.
In my case I am running a streamed cufftExecC2C on a batch of 256 signals with 1280 samples each.
The loops iterate over the streams and call:

cudaMemcpyAsync
cufftPlanMany, cufftSetStream
cufftExecC2C

// Creates cuFFT plans and sets them in streams
	cufftHandle* fftPlans = (cufftHandle*)malloc(sizeof(cufftHandle)*nStreams);
	for (int i = 0; i < nStreams; i++)
	{
		status = cufftPlanMany(&fftPlans[i],1,n,inembed,istride,idist,onembed,ostride,odist,CUFFT_C2C,batch/nStreams);
		if (status != CUFFT_SUCCESS)  
		{
			cout << "Cufft FFT plan error: " << status << endl;
		}
		cufftSetStream(fftPlans[i],streams[i]);
	}
	// Async memcpys and computations
	//check(cudaMemcpy(d_Data,h_Data,bytesComplex,cudaMemcpyHostToDevice));
	for (int i = 0; i < nStreams; ++i)
	{
		int offset = i * streamSize;
		cudaMemcpyAsync(&d_Data[offset], &h_Data[offset], streamBytes, cudaMemcpyHostToDevice, streams[i]);
	}
	// FFT of a signal
	for (int i = 0; i < nStreams; ++i)
	{
		int offset = i * streamSize;
		status = cufftExecC2C(fftPlans[i],(cufftComplex*)&d_Data[offset],(cufftComplex*)&d_Data[offset],CUFFT_FORWARD);
		if (status != CUFFT_SUCCESS)
		{
			cout << "Cufft FFT work error: " << status << endl;
		}
	}

The fix was to divide the batch count by the number of streams in the cufftPlanMany call, i.e. 256/4 in my example, so that each plan transforms only its own stream's share of the signals.

I am leaving these thoughts for future generations.
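
For readers who want to check the arithmetic behind that split, here is a minimal sketch in plain C++ (no GPU needed). The struct StreamChunk and the function chunkForStream are hypothetical names, not part of cuFFT or of the code above, and the sketch assumes the batch divides evenly by the number of streams and that the signals are stored contiguously.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of cuFFT: describes the slice of a
// contiguous batched 1D FFT workload that a single stream handles.
struct StreamChunk {
    int batch;           // signals per plan (the batch argument of cufftPlanMany)
    std::size_t offset;  // element offset of this stream's slice in the buffer
    std::size_t elems;   // number of complex elements in the slice
    std::size_t bytes;   // size of the slice for cudaMemcpyAsync
};

// elemSize is the size of one complex sample; sizeof(cufftComplex) is 8
// (two floats). Assumes totalBatch is a multiple of nStreams.
inline StreamChunk chunkForStream(int stream, int nStreams, int totalBatch,
                                  int samplesPerSignal, std::size_t elemSize)
{
    assert(nStreams > 0 && totalBatch % nStreams == 0);
    assert(stream >= 0 && stream < nStreams);

    StreamChunk c;
    c.batch  = totalBatch / nStreams;  // e.g. 256 / 4 = 64
    c.elems  = static_cast<std::size_t>(c.batch) * samplesPerSignal;
    c.offset = static_cast<std::size_t>(stream) * c.elems;
    c.bytes  = c.elems * elemSize;
    return c;
}
```

With the numbers from this thread, chunkForStream(3, 4, 256, 1280, 8) describes the last stream's slice: 64 signals, 81920 elements, starting at element 245760. If a plan is created with the full batch of 256 but executed on each slice, every execution would run past the end of its slice.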

yutong.zhang · May 17, 2018, 12:08am

Is it possible to assign memcpy (host to device) to stream1, cufftplan to stream2 and memcpy (device to host) to stream 3 so that all three of those things can overlap?

Robert_Crovella · May 17, 2018, 2:13am

Not the cuFFT plan, but cuFFT execution, yes, it should be possible. cuFFT has the ability to set streams. The example code linked in comment 2 above demonstrates this.

yutong.zhang · May 17, 2018, 3:34pm

The example code only shows the case where you want to run 3 separate FFTs. He uses streams to overlap the copy, FFT, copy sequence of three separate transforms, right?

I guess I am curious whether I can overlap the memory copy with the actual FFT computation. I am currently not getting correct values when I do so.

yutong.zhang · May 17, 2018, 3:36pm

Also, how do I associate a stream with cufftExec? I only see cufftSetStream, which associates it with a specific plan.

Robert_Crovella · May 18, 2018, 2:15pm

You don’t associate a stream with cufftexec. You associate a stream with the plan (that you pass to cufftexec). If you’re not getting correct cufft results, you might be attempting to reuse a plan with different settings. That can be done, but may require you to manage plan-associated memory yourself. CUFFT provides mechanisms to do this.
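
As a rough illustration of the plan-to-stream association described above, here is a minimal sketch. It is not the code from the linked example. It assumes one plan per stream, the sizes mentioned earlier in the thread (256 signals of 1280 samples, 4 streams), contiguous signals, and pinned host memory. The function name runStreamedFft is hypothetical. Work issued into the same stream runs in issue order, so each chunk's FFT starts only after that chunk's host-to-device copy has finished, while different streams are free to overlap.

```cuda
#include <cmath>
#include <cstddef>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical sketch: returns true if every call succeeded and the
// results look right. Sizes mirror the thread; adjust as needed.
bool runStreamedFft()
{
    const int nx = 1280, batch = 256, nStreams = 4;
    const int batchPerStream = batch / nStreams;  // 64 signals per plan
    const size_t chunkElems = static_cast<size_t>(nx) * batchPerStream;
    const size_t chunkBytes = chunkElems * sizeof(cufftComplex);

    // Pinned host memory is needed for copies to be truly asynchronous.
    cufftComplex *h_data = nullptr, *d_data = nullptr;
    if (cudaMallocHost((void**)&h_data, chunkBytes * nStreams) != cudaSuccess)
        return false;
    if (cudaMalloc((void**)&d_data, chunkBytes * nStreams) != cudaSuccess) {
        cudaFreeHost(h_data);
        return false;
    }
    // Test input: every sample is 1 + 0i, so bin 0 of each FFT equals nx.
    for (size_t k = 0; k < chunkElems * nStreams; ++k) {
        h_data[k].x = 1.0f;
        h_data[k].y = 0.0f;
    }

    bool ok = true;
    cudaStream_t streams[nStreams];
    cufftHandle plans[nStreams];
    for (int i = 0; i < nStreams; ++i) {
        ok &= (cudaStreamCreate(&streams[i]) == cudaSuccess);
        // Each plan covers only its own share of the batch.
        ok &= (cufftPlan1d(&plans[i], nx, CUFFT_C2C, batchPerStream) == CUFFT_SUCCESS);
        // The stream belongs to the plan, not to the exec call.
        ok &= (cufftSetStream(plans[i], streams[i]) == CUFFT_SUCCESS);
    }

    for (int i = 0; i < nStreams; ++i) {
        cufftComplex* h = h_data + i * chunkElems;
        cufftComplex* d = d_data + i * chunkElems;
        // Copy in, transform, copy out: all three are ordered within streams[i].
        ok &= (cudaMemcpyAsync(d, h, chunkBytes, cudaMemcpyHostToDevice, streams[i]) == cudaSuccess);
        ok &= (cufftExecC2C(plans[i], d, d, CUFFT_FORWARD) == CUFFT_SUCCESS);
        ok &= (cudaMemcpyAsync(h, d, chunkBytes, cudaMemcpyDeviceToHost, streams[i]) == cudaSuccess);
    }
    // The host must wait before reading the results.
    ok &= (cudaDeviceSynchronize() == cudaSuccess);

    // Sanity check on the first bin of the first signal in each chunk.
    for (int i = 0; ok && i < nStreams; ++i)
        ok &= (std::fabs(h_data[i * chunkElems].x - nx) < 0.5f);

    for (int i = 0; i < nStreams; ++i) {
        cufftDestroy(plans[i]);
        cudaStreamDestroy(streams[i]);
    }
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return ok;
}
```

Calling runStreamedFft() from main and checking the return value is enough to try it. One plan per stream is chosen here to sidestep the plan-reuse issue mentioned above; sharing a single plan across streams would need the extra work-area management that post refers to.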

yutong.zhang · May 18, 2018, 3:28pm

Okay, that’s what I meant in my first question, my bad for not being clear.

So it is possible to overlap the reading of data with the execution of cufft?

I am worried that it performs the FFT before all the needed data has arrived. How should synchronization work when the copy and the cuFFT execution overlap on the same data?
