Cufft performance C1060 vs C2050

We have been using CUFFT on the Tesla C1060. When we ran the same test program on the Tesla C2050 we expected better performance, but instead we found it to be almost half the speed. We are running a large number of small FFTs, i.e. 1,000,000 32x32 CUFFT transforms.

This is the message I am getting on C1060 (Red Hat 5.2, CUDA 2.2):
running fft on 1000000 chips of size=32x32… OK (18506 msec)

This is the message I am getting on C2050 (Red Hat 5.4, CUDA 3.1):
running fft on 1000000 chips of size=32x32… OK (32505 msec)

I have included example code that demonstrates the problem. Are we doing something wrong? Is there something in the Makefile w.r.t. compiler flags?

Thanks in advance.

Sample Code:

#include <cufft.h>
#include <cuda_runtime.h>
#include <sys/time.h>
#include <cstdio>

#define FERMI 1
#define OK 0
#define ERROR -1

int main(void)
{
    size_t i;
    cufftHandle fft;
    float* src;
    cufftComplex* dst;
    const size_t dim = 32;
    const size_t size = dim * dim;
    const size_t max = 1000000;
    timeval timer[2];

    // create fft plan
    if (cufftPlan2d(&fft, dim, dim, CUFFT_R2C) != CUFFT_SUCCESS) {
        fprintf(stderr, "unable to create fft plan\n");
        return ERROR;
    }

#if FERMI
    if (cufftSetCompatibilityMode(fft, CUFFT_COMPATIBILITY_NATIVE) != CUFFT_SUCCESS) {
        fprintf(stderr, "unable to set fft plan to native mode\n");
        return ERROR;
    }
#endif

    // allocate input chip
    if (cudaMalloc(reinterpret_cast<void**>(&src), size * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "unable to allocate input chip\n");
        return ERROR;
    }

    // allocate output chip
    if (cudaMalloc(reinterpret_cast<void**>(&dst), size * sizeof(cufftComplex)) != cudaSuccess) {
        fprintf(stderr, "unable to allocate output chip\n");
        return ERROR;
    }

    fprintf(stderr, "running fft on %zu chips of size=%zux%zu...", max, dim, dim);

    // start timer
    gettimeofday(&timer[0], NULL);

    // execute real->complex fft plan
    for (i = 0; i < max; i++) {
        if (cufftExecR2C(fft, src, dst) != CUFFT_SUCCESS) {
            fprintf(stderr, " FAIL\nunable to execute real->complex fft plan\n");
            return ERROR;
        }
    }

    // synchronize cuda threads
    if (cudaThreadSynchronize() != cudaSuccess) {
        fprintf(stderr, " FAIL\nunable to synchronize cuda threads\n");
        return ERROR;
    }

    // stop timer, borrowing a second if the microseconds underflow
    gettimeofday(&timer[1], NULL);
    if (timer[1].tv_usec < timer[0].tv_usec) {
        timer[1].tv_sec--;
        timer[1].tv_usec += 1000000;
    }
    timer[1].tv_sec -= timer[0].tv_sec;
    timer[1].tv_usec -= timer[0].tv_usec;

    fprintf(stderr, " OK (%ld msec)\n", timer[1].tv_sec * 1000 + timer[1].tv_usec / 1000);

    // deallocate input chip
    if (cudaFree(src) != cudaSuccess) {
        fprintf(stderr, "unable to deallocate input chip\n");
        return ERROR;
    }

    // deallocate output chip
    if (cudaFree(dst) != cudaSuccess) {
        fprintf(stderr, "unable to deallocate output chip\n");
        return ERROR;
    }

    // destroy fft plan
    if (cufftDestroy(fft) != CUFFT_SUCCESS) {
        fprintf(stderr, "unable to destroy fft plan\n");
        return ERROR;
    }

    fprintf(stderr, "OK\n");

    return OK;
}

And the Makefile:

fft-test: fft-test.cpp
	g++ -o $@ -O2 -fpic -fPIC -pipe -DNDEBUG -DNO_BLAS -I/usr/local/cuda/include -L/usr/local/cuda/lib -L/usr/local/cuda/lib64 -lcufft -lcudart $<

clean:
	rm -f fft-test

It seems you don't compile for CC 2.0 but are using 1.3 code for both devices. Did you try with -DCUDA_ARCH=20 for the C2050?

Cheers

Ceearem

Yes, we did. We used -arch=sm_20 as well as -gencode=arch=compute_20,code=sm_20. Neither had any effect. We think this may be because our gcc link pulls in a precompiled CUFFT library without ever invoking nvcc, so it seems we cannot target the CUFFT library for the C2050 specifically. Does this make any sense?
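If it helps to confirm that theory, you can check which libcufft the binary actually resolves at run time (a sketch; `fft-test` is the binary name from the Makefile above, and the fallbacks just keep the one-liner from failing silently on machines without the library):

```shell
# Show which shared libcufft the binary resolves; fall back to the
# system linker cache, then to a plain message if neither is available.
ldd ./fft-test 2>/dev/null | grep -i cufft \
  || ldconfig -p 2>/dev/null | grep -i cufft \
  || echo "libcufft not found (check LD_LIBRARY_PATH)"
```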

From what I have read, improved CUFFT performance on Fermi is coming in 3.2.

FYI, I just ran your example on a GTX 580.

running fft on 1000000 chips of size=32x32... OK (21329 msec)

OK

CUDA version is 3.2 beta. Incidentally it still doesn’t outperform the C1060…
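One more thought on the workload itself: the loop above launches 1,000,000 tiny kernels, and at 32x32 the per-launch overhead can easily dominate the transform time. If your CUFFT version provides cufftPlanMany (the batched / advanced-data-layout API), packing many chips into one call should amortize that overhead. A minimal sketch, not benchmarked, with the batch size chosen arbitrarily:

```cpp
// Sketch: batch many 32x32 R2C transforms per cufftExecR2C call, so
// launch overhead is paid once per batch instead of once per chip.
// Assumes a CUFFT version that provides cufftPlanMany.
#include <cufft.h>
#include <cuda_runtime.h>
#include <cstdio>

int main(void)
{
    cufftHandle fft;
    int n[2] = { 32, 32 };                          // dimensions of one chip
    const int batch = 1000;                         // chips per launch (arbitrary)
    const size_t in_per_chip  = 32 * 32;            // real input samples per chip
    const size_t out_per_chip = 32 * (32 / 2 + 1);  // complex outputs per chip (R2C)

    cufftReal* src;
    cufftComplex* dst;
    if (cudaMalloc(reinterpret_cast<void**>(&src), batch * in_per_chip  * sizeof(cufftReal))    != cudaSuccess ||
        cudaMalloc(reinterpret_cast<void**>(&dst), batch * out_per_chip * sizeof(cufftComplex)) != cudaSuccess) {
        fprintf(stderr, "unable to allocate batched chips\n");
        return -1;
    }

    // NULL inembed/onembed selects the basic packed layout; the stride
    // and distance arguments are then ignored.
    if (cufftPlanMany(&fft, 2, n, NULL, 1, 0, NULL, 1, 0, CUFFT_R2C, batch) != CUFFT_SUCCESS) {
        fprintf(stderr, "unable to create batched fft plan\n");
        return -1;
    }

    // 1,000,000 chips = 1000 launches of 1000 transforms each.
    for (int i = 0; i < 1000000 / batch; i++) {
        if (cufftExecR2C(fft, src, dst) != CUFFT_SUCCESS) {
            fprintf(stderr, "unable to execute batched fft plan\n");
            return -1;
        }
    }
    cudaThreadSynchronize();

    cufftDestroy(fft);
    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```

The same 1,000,000 chips then take 1000 launches instead of 1,000,000; whether that recovers the C1060-vs-C2050 gap would still need measuring.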