Poor CUFFT Performance? Am I doing something wrong?

Hi all. I’m interested in doing 1D FFTs as quickly as possible, so I wrote some CUDA code to measure performance (in complex, 8-byte samples per second).

I’m not seeing the performance I’d like to be seeing – either my expectations are unrealistic, or I’m doing something wrong.

Pseudo code:

pts_per_fft = 256

num_batches = 128

num_iterations = 10000


for i in range(num_iterations):

	transfer pts_per_fft*num_batches complex samples from host mem to device mem

	perform in-place fft

	transfer pts_per_fft*num_batches complex samples from device mem to host mem


I have 2 test machines. One is a Macbook Pro (GeForce 8600M GT, PCI-e x16, 512MB), the other is an HP xw8600 containing a GTX280.

I have tried many different combinations of pts_per_fft and num_batches, and can’t seem to squeeze out more than 50MSamp/sec on the Macbook, or 115MSamp/sec on the GTX280. Is this about what I should be getting?

Here is my code:

#include <stdlib.h>

#include <stdio.h>

#include <math.h>

#include <time.h>

// includes, project

#include <cufft.h>

#include <cutil.h>

typedef float2 Complex;

#define PTS_PER_FFT 256

#define NUM_BATCHES 1024

#define NUM_ITER 2500

#define SIG_SIZE_PTS (PTS_PER_FFT*NUM_BATCHES)

#define SIG_SIZE_BYTES (sizeof(Complex)*SIG_SIZE_PTS)

// Uniform random float in [0, 1)
float ranf() {

	return ((float)rand() / ((float)(RAND_MAX)+(float)(1)));
}

// Box-Muller (polar form): generate a pair of standard Gaussian samples
void getRandomGaussianPair(float &r1, float &r2) {

	float x1, x2, w;

	do {

		x1 = 2.0 * ranf() - 1.0;

		x2 = 2.0 * ranf() - 1.0;

		w = x1 * x1 + x2 * x2;

	} while ( w >= 1.0 );

	w = sqrt( (-2.0 * log( w ) ) / w );

	r1 = x1 * w;

	r2 = x2 * w;
}

// Complex sinusoid at fWave (sampled at fSamp) plus Gaussian noise
void genSignal(Complex* signal, int numPts, double fSamp, double fWave) {

	double dPhase = 2*M_PI*fWave/fSamp;

	double phase = 0.0;

	float nx, ny;

	int i;

	for(i=0; i<numPts; i++) {

		phase = i*dPhase;

		getRandomGaussianPair(nx, ny);

		signal[i].x = sin(phase) + 0.1*nx;

		signal[i].y = sin(phase + M_PI_2) + 0.1*ny;
	}
}

int main(int argc, char** argv)
{

	Complex* h_signal = (Complex*)malloc(SIG_SIZE_BYTES);

	int i, j, ind;

	cufftResult result;

	cufftHandle plan;

	double t1, t2;

	//Create the FFT plan

	result = cufftPlan1d(&plan, PTS_PER_FFT, CUFFT_C2C, NUM_BATCHES);

	if(result != CUFFT_SUCCESS) {

		printf("Problem encountered during FFT Plan creation\n");

		return 1;
	}

	genSignal(h_signal, SIG_SIZE_PTS, 1.2e9, 1.2e9/16);

	// Allocate device memory for signal

	Complex* d_signal;

	cudaMalloc((void**)&d_signal, SIG_SIZE_BYTES);

	t1 = clock();

	for(i=0; i<NUM_ITER; i++) {

		// Copy host memory to device

		cudaMemcpy(d_signal, h_signal, SIG_SIZE_BYTES, cudaMemcpyHostToDevice);

		//Perform FFT

		result = cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);

		if(result != CUFFT_SUCCESS) {

			printf("Problem encountered during FFT Exec\n");
		}

		// Copy device memory back to host

		cudaMemcpy(h_signal, d_signal, SIG_SIZE_BYTES, cudaMemcpyDeviceToHost);
	}

	t2 = clock();

	printf("Elapsed time (secs): %f\n", (t2-t1)/CLOCKS_PER_SEC);

	//Print freq domain data

	/*printf("Abs Freq domain data:\n");

	for(i=0; i<NUM_BATCHES; i++) {

		for(j=0; j<PTS_PER_FFT; j++) {

			ind = i*PTS_PER_FFT + j;

			printf("%i\t%i\t%f\n", i, j, sqrt(h_signal[ind].x*h_signal[ind].x + h_signal[ind].y*h_signal[ind].y));
		}
	}*/

	//Clean up

	cufftDestroy(plan);

	cudaFree(d_signal);

	free(h_signal);

	return 0;
}

I discovered that the majority of my time is being spent on the memory transfer. For example, with pts_per_fft=256, num_blocks=65536, num_iter=100, the full host→device, FFT, device→host loop took 13.2 secs, whereas omitting the FFT it still takes 12.5 secs. Switching to pinned memory gave me some improvement, but I’m still not where I’d like to be.

When evaluating whether or not your memory transfer rates are decent, you need to look at it in terms of bytes per second. Correct me if I’m wrong, but you appear to be transferring:
65536 * 256 * 100 * 8 = 12.5 GiB ??? That can’t be right. OK, I must not fully understand what you mean by num_blocks=65536. Anyway, you should be able to get 3-4 GiB/s (up to 6 with the right mainboard) in transfer rates with pinned memory.

On the GTX280 machine, bandwidthTest was reading ~5.4 GB/s with pinned memory.

Each transfer in this test program (with the above-mentioned settings) is of size 256 * 65536 * 8 bytes, or 128 MB. Does this seem like a reasonable work unit (the GTX280 has 1GB of RAM)? So 128 MB is being transferred twice each iteration, for 100 iterations. With pinned memory, this takes 5.4 secs (or 4.7 secs with the FFT turned off, showing transfer is still the dominant factor). 256 MB * 100 / 4.7 secs = … ~5.4 GB/s. I guess I’m operating pretty close to max capacity, and this was just a case of unrealistic expectations. 256 * 65536 * 100 / 5.4 secs = ~310 M samples per sec. Hmmm…

I like talking to myself, it seems. :ph34r:

What are my options if I decide that I really only need 2 bytes per complex sample (8 bits real, 8 bits imaginary)? This would reduce my bandwidth by a factor of 4, which would be huge. It doesn’t look like CUFFT supports anything other than 4-byte real / 8-byte complex… could I transfer 2-byte complex data, then inflate to 8-byte float once I’m on the device? Do the FFT, then deflate to 2 bytes again, and transfer back?

Try it out…I don’t see why it wouldn’t work. If you get something working, post your code up and show us some performance numbers (before and after). Since it seems that few people are reaching the computational limits (usually hitting bandwidth limits instead, such as in your case), I think that adding some extra cycles to your code to inflate and deflate the results could gain you some extra performance.

Also, if you’re not using the entire memory of the card, you may be able to do some asynchronous memory transfers, so that one set of memory is transferring while another is computing. There was a thread in the General CUDA or Programming forums recently about it, and another about overlapping contexts…check them out, you might be able to squeeze out some extra performance this way since you won’t have to wait for the transfers to complete each time.

Except that I don’t think cuFFT supports streams yet (correct me if I’m wrong), so everything is dumped in stream 0. Otherwise this would be the ideal solution. With a LOT of samples to process ideally you could get the pipeline going pretty long and one could pretty much hide the cost of all the Host <-> device memory transfers.

A simple kernel could be written to do the conversion on the GPU easily. With fully coalesced loads/stores, it would likely operate at around ~110 GiB/s (counting both reads and writes), so the overhead will be pretty minimal.
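Following up on that: here is a minimal sketch of what such a conversion pair could look like, assuming 8-bit signed fixed-point samples (`char2`) where ±127 maps to ±1.0. The kernel names, the scale factor, and the clamping policy are all made up for illustration — in particular, FFT output can exceed [-1, 1), so a real design needs a normalization step before deflating. Each thread touches consecutive elements, so loads and stores coalesce naturally.

```cuda
// Hypothetical sketch: 8-bit fixed-point <-> float conversion kernels.
// Packing assumed: char2 {real, imag}, full scale = 127.

__global__ void inflate(const char2* in, float2* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i].x = in[i].x * (1.0f / 127.0f);  // real
        out[i].y = in[i].y * (1.0f / 127.0f);  // imag
    }
}

__global__ void deflate(const float2* in, char2* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Clamp to the representable range; a real application would
        // apply its own normalization/AGC before this step.
        float re = fminf(fmaxf(in[i].x * 127.0f, -127.0f), 127.0f);
        float im = fminf(fmaxf(in[i].y * 127.0f, -127.0f), 127.0f);
        out[i] = make_char2((signed char)re, (signed char)im);
    }
}
```

With that, you transfer char2 over PCIe, inflate on the device, run cufftExecC2C on the float2 buffer, deflate, and copy the char2 result back — a quarter of the bus traffic in each direction.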

Aren’t the CUFFT sources available? Since CUFFT doesn’t support streams yet, someone could just ‘rip’ the kernel they needed and run it manually using streams to get the dual-use-of-time effect (or so I would think).

Maybe a new feature to have in the next release of CUFFT (in addition to vvolkov’s improvements in the computation speed)?

Some cufft source was posted here a while ago: http://forums.nvidia.com/index.php?showtopic=59101 However, unless I am being a total fool, I don’t think it’s complete, since important functions like cufftExecC2R don’t seem to be in there. Without a bit more digging it’s unclear how the kernels are configured and launched, and hence how to control which stream they operate on. I think this would be a pretty useful feature.

Just curious if nvidia has released cufft with stream support yet or has anyone got the source modified to work with streams?

Hi to all,

Does anyone know if streams are already availeble in CUFFT?

Because I don’t know how to add the stream parameter in call:

cufftExecZ2Z(plan, (Complex *)d_signal, (Complex *)d_signal, CUFFT_FORWARD);

Thank you, very much

You can set the stream you are going to use with a particular plan using cufftSetStream:
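Something like this — the stream creation around the call is my own sketch; the plan is assumed to exist already:

```cuda
cudaStream_t stream;
cudaStreamCreate(&stream);

// All subsequent cufftExec* calls on this plan now run in 'stream'
cufftSetStream(plan, stream);

cufftExecZ2Z(plan, d_signal, d_signal, CUFFT_FORWARD);
```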



Does it work with CUDA 2.3, or only with CUDA 3.0?

Should I uninstall cuda 2.3 and install 3.0?

Thank you

Streamed CUFFT is available on both 2.3 and 3.0.

I see about a 20% improvement when I have two plans each doing half the batch size.


memcpyAsync(buffer half 1) on stream1

exec cufft on stream1

memcpyAsync(buffer half 2) on stream2

exec cufft on stream2

This way the first exec and second memcpyasync can execute in parallel.
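For reference, that ping-pong could be sketched roughly as below. It reuses the defines from the test program earlier in the thread, and assumes h_signal was allocated with cudaMallocHost (pinned) — otherwise cudaMemcpyAsync won’t actually overlap with the FFT.

```cuda
// Sketch: split the batch across two plans/streams so the copy for
// half 2 overlaps the FFT on half 1.
cufftHandle  plan2[2];
cudaStream_t stream[2];
int    halfBatch = NUM_BATCHES / 2;
size_t halfBytes = SIG_SIZE_BYTES / 2;

for (int s = 0; s < 2; s++) {
    cudaStreamCreate(&stream[s]);
    cufftPlan1d(&plan2[s], PTS_PER_FFT, CUFFT_C2C, halfBatch);
    cufftSetStream(plan2[s], stream[s]);   // bind each plan to its stream
}

for (int s = 0; s < 2; s++) {
    size_t off = (size_t)s * halfBatch * PTS_PER_FFT;   // offset in elements
    cudaMemcpyAsync(d_signal + off, h_signal + off, halfBytes,
                    cudaMemcpyHostToDevice, stream[s]);
    cufftExecC2C(plan2[s], d_signal + off, d_signal + off, CUFFT_FORWARD);
    cudaMemcpyAsync(h_signal + off, d_signal + off, halfBytes,
                    cudaMemcpyDeviceToHost, stream[s]);
}
cudaThreadSynchronize();   // wait for both streams before using results
```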

Thank you very much.

still one question…

So, you create different CUFFT plans in order to get “concurrent copy and execution” — but that’s not necessary in CUDA 3.0, is it?

Which is better: one plan used with different streams via cufftSetStream, or separate plans?

Thank you,

Jose Antonio

I’ve only ever tried it with different plans. My guess is that there are internal state structures that are not threadsafe.