seb,
Thanks for testing the Ultra. It looks like you did the “pinned memory” mod correctly.
From the CUDA Programming Guide 1.0:
cclark
I’ve got the results below. I just ran the code posted by mfatica, with and without #define PINNED at the beginning of the code.
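For reference, the toggle amounts to roughly the following; this is a sketch of the idea, not mfatica’s exact code, and hostAlloc is a made-up helper name:

#include <cuda_runtime.h>
#include <cufft.h>
#include <stdlib.h>

/* Hypothetical helper: with PINNED defined, host buffers are page-locked via
   cudaMallocHost instead of malloc, which lets cudaMemcpy use DMA and usually
   raises PCI-Express throughput. */
static cufftComplex *hostAlloc(size_t bytes)
{
#ifdef PINNED
    void *p = 0;
    cudaMallocHost(&p, bytes);             /* page-locked ("pinned") memory */
    return (cufftComplex *)p;
#else
    return (cufftComplex *)malloc(bytes);  /* ordinary pageable memory */
#endif
}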
My configuration is:
Processor: Intel® Pentium® 4 CPU 3.00GHz
Memory: 1GB ram.
Video Card: GeForce 8800 GTX
OS: CentOS release 4.4 (Final)
kernel: Linux 2.6.9-42.ELsmp
For the “unpinned” version:
Device 0: "GeForce 8800 GTX"
Major revision number: 1
Minor revision number: 0
Total amount of global memory: 804585472 bytes
Clock rate: 1350000 kilohertz
Unpinned version
--------+-----------------------------------
1D Complex-to-Complex In-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 7936| 19607| 30710| 33023
2048| 5813| 12084| 15920| 15690
4096| 4098| 7168| 7812| 9015
8192| 2732| 3879| 3903| 4806
16384| 1736| 1784| 2001| 2197
32768| 954| 939| 1135| 1198
65536| 479| 542| 601| 618
131072| 237| 291| 308| 312
262144| 131| 144| 148| 150
--------+--------+--------+--------+--------
--------+-----------------------------------
1D Complex-to-Complex Out-of-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 8196| 19801| 31007| 33092
2048| 5917| 12195| 15984| 15928
4096| 4219| 7259| 7835| 9016
8192| 2732| 3883| 3900| 4807
16384| 1742| 1793| 2002| 2197
32768| 948| 941| 1137| 1200
65536| 480| 545| 605| 622
131072| 237| 291| 308| 313
262144| 131| 145| 149| 150
--------+--------+--------+--------+--------
For the “pinned” version:
Device 0: "GeForce 8800 GTX"
Major revision number: 1
Minor revision number: 0
Total amount of global memory: 804585472 bytes
Clock rate: 1350000 kilohertz
Pinned version
--------+-----------------------------------
1D Complex-to-Complex In-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 9523| 24390| 39702| 45649
2048| 6944| 15094| 20539| 22176
4096| 4878| 9090| 10631| 11149
8192| 3246| 4968| 5398| 5571
16384| 2183| 2371| 2418| 2430
32768| 1213| 1284| 1303| 1310
65536| 645| 666| 671| 672
131072| 331| 335| 337| 337
262144| 159| 161| 161| 161
--------+--------+--------+--------+--------
--------+-----------------------------------
1D Complex-to-Complex Out-of-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 9900| 24390| 40100| 45746
2048| 6993| 15151| 20618| 22183
4096| 5050| 9216| 10652| 11155
8192| 3322| 4975| 5403| 5571
16384| 2183| 2368| 2417| 2429
32768| 1215| 1286| 1305| 1312
65536| 648| 669| 674| 677
131072| 331| 336| 337| 338
262144| 160| 161| 161| 161
--------+--------+--------+--------+--------
Can everyone please post the motherboard and chipset used during your tests? Because I/O is included in the timing, I suspect the chipset plays a major role in the results.
Their FFT implementation on the Cell isn’t particularly efficient. Mercury hits 21-22 GFLOPS per SPU (1/8 of the Cell) for 1k- to 8k-point complex FFTs (SPU-resident, similar to benchFFT’s cache-resident results), or 170 GFLOPS per chip for 1k-point complex FFTs. An 8-SPU implementation of a 64k-point complex FFT hits 90.8 GFLOPS (memory to memory); see slides 16 and 11, respectively.
Board: Asus P5W64 WS Professional | Chipset: Intel 975X
Unpinned Results:
Device 0: "GeForce 8600 GTS"
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 268107776 bytes
Clock rate: 1566000 kilohertz
--------+-----------------------------------
1D Complex-to-Complex In-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 12955| 33020| 47085| 52833
2048| 9264| 17385| 21094| 22551
4096| 6758| 9403| 10534| 11275
8192| 3969| 4885| 5249| 5604
16384| 2145| 2293| 2373| 2442
32768| 1120| 1149| 1200| 1225
65536| 590| 608| 630| 637
131072| 290| 305| 310| 312
262144| 121| 124| 125| 125
--------+--------+--------+--------+--------
--------+-----------------------------------
1D Complex-to-Complex Out-of-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 13108| 33208| 47257| 52780
2048| 9251| 17405| 20983| 22468
4096| 6746| 9436| 10571| 11305
8192| 3985| 4931| 5265| 5613
16384| 2154| 2297| 2365| 2449
32768| 1126| 1157| 1206| 1232
65536| 594| 611| 630| 642
131072| 292| 304| 311| 313
262144| 120| 124| 125
Pinned Results:
Device 0: "GeForce 8600 GTS"
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 268107776 bytes
Clock rate: 1566000 kilohertz
--------+-----------------------------------
1D Complex-to-Complex In-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 14408| 42964| 68392| 80859
2048| 10691| 22583| 29068| 32111
4096| 8260| 12460| 14583| 15610
8192| 5027| 6535| 7317| 7573
16384| 2778| 3029| 3090| 3097
32768| 1471| 1525| 1539| 1543
65536| 786| 801| 804| 805
131072| 389| 391| 391| 391
262144| 150| 150| 150| 150
--------+--------+--------+--------+--------
--------+-----------------------------------
1D Complex-to-Complex Out-of-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 14589| 43000| 68271| 80802
2048| 10667| 22629| 29383| 32234
4096| 8279| 12491| 14610| 15639
8192| 5024| 6560| 7329| 7566
16384| 2770| 3025| 3090| 3100
32768| 1472| 1539| 1554| 1555
65536| 792| 811| 810| 811
131072| 390| 393| 393| 393
262144| 150| 150| 150
On both the unpinned and pinned runs, the final test gave: Cuda error in file ‘fft_bench.cu’ in line 108 : out of memory.
103 size_t arraySize = sizeof(cufftComplex) * nx * batch;
104 cufftComplex *deviceArrayA;
105 cufftComplex *deviceArrayB;
106 CUDA_SAFE_CALL(cudaMalloc((void**)&deviceArrayA, arraySize));
107 if (out_of_place)
108 CUDA_SAFE_CALL(cudaMalloc((void**)&deviceArrayB, arraySize));
This is the only GPU in this system and is being used for my display, running Windows XP.
[overly skeptical and rude comments showing disbelief, edited away by me now because I can]
[pointed to this figure for the source of my doubt, paper figures for the FFT implementation]
http://66.102.1.104/scholar?hl=en&lr=&safe…x+2005+cell+fft
[compared the figures to these more recent but lower figures]
Unpinned Results:
Device 0: "GeForce 8800 GTX"
Major revision number: 1
Minor revision number: 0
Total amount of global memory: 805044224 bytes
Clock rate: 1350000 kilohertz
--------+-----------------------------------
1D Complex-to-Complex In-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 9833| 23983| 37580| 35969
2048| 7173| 14655| 18233| 17454
4096| 5128| 8730| 8501| 9399
8192| 3441| 4415| 4309| 4819
16384| 2060| 1956| 2127| 2210
32768| 946| 1000| 1108| 1140
65536| 514| 560| 586| 593
131072| 258| 287| 294| 293
262144| 133| 139| 140| 138
--------+--------+--------+--------+--------
--------+-----------------------------------
1D Complex-to-Complex Out-of-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 9943| 24023| 37531| 35773
2048| 7202| 14685| 18181| 17475
4096| 5172| 8730| 8489| 9360
8192| 3386| 4402| 4287| 4796
16384| 2054| 1957| 2123| 2221
32768| 1058| 1007| 1115| 1139
65536| 517| 561| 590| 593
131072| 259| 287| 295| 294
262144| 131| 136| 141| 141
--------+--------+--------+--------+--------
Pinned Results:
Device 0: "GeForce 8800 GTX"
Major revision number: 1
Minor revision number: 0
Total amount of global memory: 805044224 bytes
Clock rate: 1350000 kilohertz
--------+-----------------------------------
1D Complex-to-Complex In-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 12052| 34501| 66518| 80524
2048| 8688| 22583| 34254| 38296
4096| 6937| 14625| 17965| 19275
8192| 4978| 8121| 9168| 9606
16384| 3332| 3813| 3958| 3994
32768| 1881| 2022| 2061| 2067
65536| 1053| 1096| 1106| 1109
131072| 537| 548| 549| 554
262144| 248| 250| 250| 250
--------+--------+--------+--------+--------
--------+-----------------------------------
1D Complex-to-Complex Out-of-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 12175| 34501| 66673| 80354
2048| 9131| 22583| 34336| 37810
4096| 5240| 14625| 17936| 19254
8192| 5013| 8116| 9163| 9608
16384| 3342| 3812| 3962| 3993
32768| 1883| 2035| 2072| 2079
65536| 1058| 1105| 1114| 1116
131072| 538| 548| 552| 553
262144| 247| 248| 248| 249
--------+--------+--------+--------+--------
Hi all,
I’m new to everything here and have not quite set up my development environment yet. I’m hoping that someone will be kind enough to run the FFT benchmarks with some larger FFT sizes. I am interested in seeing the benchmark results for the following FFT sizes:
524288
1048576
2097152
4194304
8388608
Thanks,
-Simon
Not too good to be true.
Mercury’s SAL (Scientific Application Library), which is available today, hits these performance numbers for single precision complex FFTs (src & dst in SPU memory) on a 3.2 GHz Cell using a single SPU (1 of 8):
256 pt → 19.29 GFLOPS
1024 pt → 21.34 GFLOPS
2048 pt → 22.21 GFLOPS
4096 pt → 21.61 GFLOPS
8192 pt → 22.15 GFLOPS
A 64k point complex FFT (src & dst in off-chip memory) hits 90.8 GFLOPS and is memory bandwidth bound.
These are not theoretical numbers, but actual, measured performance.
Hmm… Think you’re right. Mea culpa :-)
This is with a Zotac 8800GT ‘Amp’: core is 700 MHz, memory is 2 GHz.
2.6.23-gentoo-r3 #4 SMP Thu Dec 27 23:49:40 PST 2007 i686 Intel® Core™2 CPU 6300 @ 1.86GHz GenuineIntel GNU/Linux
00:01.0 PCI bridge: Intel Corporation 82P965/G965 PCI Express Root Port (rev 02)
not pinned
Device 0: "GeForce 8800 GT"
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 536150016 bytes
Clock rate: 1674000 kilohertz
--------+-----------------------------------
1D Complex-to-Complex In-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 13698| 36036| 58181| 69945
2048| 9900| 22099| 29906| 31746
4096| 7246| 12578| 15655| 17400
8192| 5025| 7042| 7575| 9310
16384| 2941| 3202| 3448| 3758
32768| 1620| 1631| 1912| 2017
65536| 889| 959| 1064| 1093
131072| 372| 448| 471| 474
262144| 195| 215| 219| 220
--------+--------+--------+--------+--------
--------+-----------------------------------
1D Complex-to-Complex Out-of-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 13698| 36036| 58181| 69868
2048| 9900| 22099| 29906| 31746
4096| 7246| 12578| 15670| 17753
8192| 5025| 7029| 7561| 9334
16384| 2941| 3212| 3452| 3777
32768| 1618| 1624| 1914| 2016
65536| 889| 961| 1066| 1096
131072| 391| 454| 479| 484
262144| 196| 216| 220| 222
--------+--------+--------+--------+--------
pinned results
Device 0: "GeForce 8800 GT"
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 536150016 bytes
Clock rate: 1674000 kilohertz
--------+-----------------------------------
1D Complex-to-Complex In-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 14705| 43956| 82901| 108108
2048| 11111| 28776| 43010| 50753
4096| 8620| 17021| 23021| 25236
8192| 6369| 9828| 11782| 12405
16384| 3861| 4352| 4476| 4496
32768| 2202| 2350| 2390| 2397
65536| 1251| 1295| 1307| 1308
131072| 545| 551| 553| 553
262144| 252| 253| 253| 252
--------+--------+--------+--------+--------
--------+-----------------------------------
1D Complex-to-Complex Out-of-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 14705| 43956| 82901| 108108
2048| 11111| 28776| 43126| 50753
4096| 8547| 16949| 22955| 25266
8192| 6329| 9828| 11790| 12403
16384| 3875| 4381| 4493| 4525
32768| 2202| 2357| 2390| 2401
65536| 1259| 1302| 1314| 1315
131072| 555| 564| 565| 565
262144| 253| 254| 254| 254
--------+--------+--------+--------+--------
Just started exploring CUDA, but wanted to add a test point here for a card that I haven’t seen benched yet. The numbers seem low to me, so any advice would be greatly appreciated. I think perhaps it’s just the tradeoff you make for a laptop. I also get a bunch of errors at the end of the out-of-place FFTs.
Lenovo Thinkpad T61p
CentOS 5 2.6.18-53.el5PAE #1 SMP
4GB RAM
Unpinned:
Device 0: "Quadro FX 570M"
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 267714560 bytes
Clock rate: 550000 kilohertz
--------+-----------------------------------
1D Complex-to-Complex In-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 6172| 19230| 31189| 36529
2048| 4347| 9756| 12810| 13778
4096| 3546| 5442| 6309| 6828
8192| 2123| 2824| 3032| 3294
16384| 1060| 1141| 1156| 1185
32768| 547| 559| 581| 588
65536| 304| 312| 321| 323
131072| 144| 151| 152| 153
262144| 53| 54| 55| 55
--------+--------+--------+--------+--------
--------+-----------------------------------
1D Complex-to-Complex Out-of-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 6172| 19230| 31434| 36488
2048| 4424| 9876| 12841| 13742
4096| 3521| 5434| 6306| 6828
8192| 2123| 2808| 3021| 3240
16384| 1043| 1140| 1160| 1189
32768| 548| 557| 583| 591
65536| 306| 314| 322| 325
131072| 144| 151| 153| 153
cufft: ERROR: /root/cuda-stuff/sw/gpgpu_rel1.1/cufft/src/execute.cu, line 1038
cufft: ERROR: CUFFT_EXEC_FAILED
cufft: ERROR: /root/cuda-stuff/sw/gpgpu_rel1.1/cufft/src/execute.cu, line 1038
cufft: ERROR: CUFFT_EXEC_FAILED
cufft: ERROR: /root/cuda-stuff/sw/gpgpu_rel1.1/cufft/src/execute.cu, line 1038
cufft: ERROR: CUFFT_EXEC_FAILED
cufft: ERROR: /root/cuda-stuff/sw/gpgpu_rel1.1/cufft/src/cufft.cu, line 119
cufft: ERROR: CUFFT_EXEC_FAILED
…a bunch of the same errors repeated, followed by an Abort, and then:
262144| 53| 54| 54| 1254901
Pinned:
--------+-----------------------------------
1D Complex-to-Complex In-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 7092| 23668| 39702| 47583
2048| 5208| 11627| 15252| 17094
4096| 4081| 6319| 7511| 7984
8192| 2450| 3259| 3595| 3696
16384| 1194| 1277| 1290| 1289
32768| 615| 636| 637| 638
65536| 348| 351| 352| 352
131072| 166| 165| 165| 166
262144| 58| 58| 58| 58
--------+--------+--------+--------+--------
--------+-----------------------------------
1D Complex-to-Complex Out-of-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 7092| 23668| 39800| 47583
2048| 5235| 11661| 15267| 16740
4096| 4081| 6349| 7497| 7966
8192| 2457| 3262| 3590| 3724
16384| 1197| 1286| 1288| 1297
32768| 624| 632| 638| 640
65536| 352| 353| 354| 354
131072| 164| 166| 166| 166
Same errors, Abort and then
262144| 57| 57| 57| 1254901
You may not have enough GPU memory to run the large out-of-place test. I don’t have a laptop GPU, so that’s just a guess.
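A quick size check supports that guess. A minimal sketch of the arithmetic for the largest out-of-place case (nx = 262144, batch = 64):

#include <stdio.h>

int main(void)
{
    /* A cufftComplex is two floats = 8 bytes. */
    unsigned long arraySize = 8UL * 262144UL * 64UL;  /* one device buffer */
    unsigned long needed    = 2UL * arraySize;        /* buffers A and B   */
    printf("one buffer  : %lu bytes (128 MB)\n", arraySize);
    printf("both buffers: %lu bytes (256 MB)\n", needed);
    /* Both the 8600 GTS (268,107,776 bytes) and the Quadro FX 570M
       (267,714,560 bytes) report less total global memory than the
       268,435,456 bytes needed, even before the display and CUFFT's
       work area are counted. */
    return 0;
}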
Working the best numbers from above (cclark, post #4):
FFTs/sec for NVidia GeForce 8800 GTX, Out-of-place FFT, pinned memory
Batch size
FFT size 1 4 16 64
1024 2189 13674 43605 88677
2048 3323 11752 30075 47368
4096 3066 9961 19332 25785
8192 2780 7158 11216 13378
16384 2262 4174 5141 5482
32768 1676 2531 2862 2959
65536 1140 1459 1564 1592
131072 650 750 784 790
262144 311 331 336 338
I get the following GFLOPS values:
GFLOPs/sec for NVidia GeForce 8800 GTX, Out-of-place FFT, pinned memory
Batch size
FFT size 1 4 16 64
1024 0.11 0.70 2.23 4.54
2048 0.37 1.32 3.39 5.34
4096 0.75 2.45 4.75 6.34
8192 1.48 3.81 5.97 7.12
16384 2.59 4.79 5.90 6.29
32768 4.12 6.22 7.03 7.27
65536 5.98 7.65 8.20 8.35
131072 7.24 8.36 8.73 8.80
262144 7.34 7.81 7.93 7.97
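These values follow from the customary 5*N*log2(N) flop count for a length-N complex FFT, which reproduces the table above; a minimal sketch of the conversion:

#include <math.h>
#include <stdio.h>

/* FFTs/sec -> GFLOPS using the customary 5*N*log2(N) operation count
   for a length-N complex FFT. */
double fft_gflops(int n, double ffts_per_sec)
{
    return 5.0 * n * (log((double)n) / log(2.0)) * ffts_per_sec / 1e9;
}

int main(void)
{
    printf("%.2f\n", fft_gflops(131072, 650.0));  /* 7.24, as in the table */
    printf("%.2f\n", fft_gflops(1024, 2189.0));   /* 0.11 */
    return 0;
}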
Comparable Cell numbers for a single SPU are (BonsaiScott, post #29):
256 pt → 19.29 GFLOPS
1024 pt → 21.34 GFLOPS
2048 pt → 22.21 GFLOPS
4096 pt → 21.61 GFLOPS
8192 pt → 22.15 GFLOPS
The cell numbers above are constrained to a single SPU. With the full 8 SPUs operating, the Cell processor provides on the order of 90 GFLOPS.
Not knowing a lot of GeForce 8800 GTX architectural details, I am trying to understand whether the CUDA FFT benchmark operates the GPU at its full FFT-processing capacity, or whether it is analogous to operating the Cell with only one SPU. Is there other parallelism not exploited by this benchmark, or are the 4-8 GFLOPS numbers above the whole story? What is the maxed-out FFT capacity that would compare to the 90 GFLOPS for the Cell with 8 SPUs, including all I/O?
KenH
The GFLOPS numbers are lower because PCI-Express transfer time is included in the measurement. But note that:
- The story is quite similar for any PCI-Express coprocessing card (be it GPU/Cell/FPGA/whatever): you have to give the coprocessor enough work to avoid being bound by PCI-Express bandwidth; otherwise the coprocessor is underutilized.
- SM 1.1 (and above) NVIDIA GPUs are capable of asynchronous transfers, enabling computation and host<->device transfers to overlap (see the sketch below).
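Here is a minimal sketch of that kind of overlap with pinned memory, streams, and cudaMemcpyAsync; the kernel "process" and the sizes are placeholders, not part of the benchmark:

#include <cuda_runtime.h>

__global__ void process(float *d, int n) { /* stand-in for the real work */ }

int main(void)
{
    const int nChunks = 4, chunk = 1 << 20;              /* 1M floats/chunk */
    float *h, *d;
    cudaMallocHost((void**)&h, nChunks * chunk * sizeof(float)); /* pinned */
    cudaMalloc((void**)&d, nChunks * chunk * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int i = 0; i < nChunks; ++i) {
        cudaStream_t st = s[i & 1];
        /* Upload chunk i while the previous chunk may still be processing. */
        cudaMemcpyAsync(d + i * chunk, h + i * chunk,
                        chunk * sizeof(float), cudaMemcpyHostToDevice, st);
        process<<<chunk / 256, 256, 0, st>>>(d + i * chunk, chunk);
        cudaMemcpyAsync(h + i * chunk, d + i * chunk,
                        chunk * sizeof(float), cudaMemcpyDeviceToHost, st);
    }
    cudaThreadSynchronize();    /* wait for all streams (CUDA 1.x-era call) */

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}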
My understanding is that the Cell benchmarks above apply to FFTs performed out of and into the XDR memory. Cells also have a PCI express interface for communication outside of a blade/PCI card, but that is not being exercised in the numbers above.
Therefore, the most representative FFT comparison to the ~90 GFLOPS number for the Cell with 8 SPUs is the 52 GFLOPS number quoted above (posts #14 and #16) for on-card memory-to-memory work with no external I/O over PCIe. Is this a fair statement? I assume the 52 GFLOPS number uses the maximum degree of parallelism available and large batches.
The most realistic benchmark would apply I/O over the PCIe in the background while doing the FFTs in the foreground. There would be 3 cases of interest:
The results would show the range of capability depending on the I/O assumptions. Clearly the I/O can’t be ignored; it is increasingly the limiting factor in achievable performance. There are many input data types (float/int16/int8) and real/complex permutations, but complex float (as in post #1) would be a useful starting point.
Are any 8800 users interested in trying this?
KenH
I’ve done some benchmarking on an 8600 GTS using a slightly modified version of the code (for Linux). I was interested in seeing whether I can really count on the bandwidths that are nominal for pinned memory when using the CUFFT routines.
What I find is that while the Host to Device bandwidth reaches a steady-state level around the nominal 2.5-2.6 GBps fairly quickly for my card, the Device to Host bandwidth performs poorly (it should be around 1.8 GBps). It does approach the nominal bandwidth, but much more slowly than the Host to Device rate. What is it about Device to Host transfers that has so much overhead, or is it something about the way CUFFT writes data out that makes it non-optimal for transfer?
I’d like to run the FFT routine in the foreground while relegating data transfers to the background, but as you can see from the results, the Device to Host transfer time becomes the bottleneck at times.
Thanks for your help
SKB.
Here’s an example:
1D Complex-to-Complex Out-of-Place FFTs
FFT Size : 131072
Batch 1 (the Device to Host transfer is the bottleneck here and the data rate is not optimal)
Host To Dev(ms) 0.411987
Host to Dev(MBps) 2545
FFT (ms) 0.055075
Dev to Host (ms) 2.132893
Dev to Host(MBps) 491
Batch 64
Host To Dev(ms) 0.394408
Host to Dev(MBps) 2658
FFT (ms) 1.507673
Dev to Host (ms) 0.661843
Dev to Host(MBps) 1584
Here are the main code changes I made:
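// Time the H->D copy, the FFT, and the D->H copy separately with rtimer2,
// and the whole round trip with rtimer1, for both code paths.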
if (out_of_place)
{
rtimer1.tic();
rtimer2.tic();
CUDA_SAFE_CALL(cudaMemcpy(deviceArrayA, hostArrayA, arraySize, cudaMemcpyHostToDevice));
HtoD_xfer_time = rtimer2.tocR();
rtimer2.tic();
CUFFT_SAFE_CALL(cufftExecC2C(plan, deviceArrayA, deviceArrayB, CUFFT_FORWARD));
FFT_time = rtimer2.tocR();
rtimer2.tic();
CUDA_SAFE_CALL(cudaMemcpy(hostArrayB, deviceArrayB, arraySize, cudaMemcpyDeviceToHost));
DtoH_xfer_time = rtimer2.tocR();
fft_time = rtimer1.tocR();
}
else
{
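// In-place path: the FFT overwrites deviceArrayA, which is then copied back to the host.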
rtimer1.tic();
rtimer2.tic();
CUDA_SAFE_CALL(cudaMemcpy(deviceArrayA, hostArrayA, arraySize, cudaMemcpyHostToDevice));
HtoD_xfer_time = rtimer2.tocR();
rtimer2.tic();
CUFFT_SAFE_CALL(cufftExecC2C(plan, deviceArrayA, deviceArrayA, CUFFT_FORWARD));
FFT_time = rtimer2.tocR();
rtimer2.tic();
CUDA_SAFE_CALL(cudaMemcpy(hostArrayB, deviceArrayA, arraySize, cudaMemcpyDeviceToHost));
DtoH_xfer_time = rtimer2.tocR();
fft_time = rtimer1.tocR();
}
int fftsPerSec = (int)( batch / fft_time );
if (fastestRateFound<fftsPerSec)
{
fastestRateFound = fftsPerSec;
HtoD_xfer_s = HtoD_xfer_time/batch;
FFT_s = FFT_time/batch;
DtoH_xfer_s = DtoH_xfer_time/batch;
}
// provide some entertainment
printf("%c\b", spinner[spindex++]);
if (spindex==4)
spindex = 0;
}
printf("|%8i", fastestRateFound);
// Host to Device transfer time in ms per FFT
printf("|%f",(HtoD_xfer_s*1000));
// Host to Device data rate in MBps
printf("|%8i",(int)floor(arraySize/(HtoD_xfer_s*batch*1000000)));
printf("|%f",(FFT_s*1000)); // FFT processing time in ms per FFT
// Device to Host transfer time in ms per FFT
printf("|%f",(DtoH_xfer_s*1000));
// Device to Host data rate in MBps
printf("|%8i",(int)floor(arraySize/(DtoH_xfer_s*batch*1000000)));
It looks like the setup for Device to Host transfers has a large overhead. I discovered this by setting a very small memory transfer (8 bytes) between device and host, then performing a very large transfer between the same two addresses. If I measured the time for the large transfer to complete and calculate the bandwidth that way, I approach the nominal bandwidth for my card.
This also explains why the bandwidth test in the SDK calculates such high rates: the transfers are done in a loop (10 times) and the average taken, so the first transfer absorbs the overhead and the remaining 9 dilute it.
It also seems that this only works if the small transfer is done right before the large transfer, i.e. I cannot prime the DMA well beforehand.
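Here is a minimal sketch of the warm-up measurement described above (sizes are illustrative, not the exact code I used):

#include <cuda_runtime.h>
#include <sys/time.h>
#include <stdio.h>

static double seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    const size_t big = 64 << 20;                   /* 64 MB payload         */
    char *h, *d;
    cudaMallocHost((void**)&h, big);               /* pinned host buffer    */
    cudaMalloc((void**)&d, big);

    cudaMemcpy(h, d, 8, cudaMemcpyDeviceToHost);   /* tiny priming transfer */

    double t0 = seconds();                         /* cudaMemcpy blocks, so */
    cudaMemcpy(h, d, big, cudaMemcpyDeviceToHost); /* wall time == transfer */
    double t1 = seconds();

    printf("D->H after priming: %.0f MBps\n", (big / 1e6) / (t1 - t0));

    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}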
Has anyone seen this or have an explanation for it? I don’t see it in Host to Device DMAs.
Thanks
skb
Has anyone had a chance to experiment with larger FFT sizes?
524288
1048576
2097152
4194304
8388608
This should be interesting now that the limit has been removed. I’m looking to get results from the 5600s and 8800s.
Thanks!
Hello. I’m new to CUDA and I’m asking for some help. I ran your benchmark on my 8600 GT (it stops with errors at the end, perhaps because there is too little memory). Then I made some changes to the source to get the time needed for a batch of FFTs. A part of the result:
Time for processing array of 16 packages of 16384 numbers: 9054 (microsec)
Time for processing array of 4 packages of 65536 numbers: 8492 (microsec)
I was calculating it the following way: 1 * batch / fastestRateFound.
So I don’t understand how processing the more finely divided array (16 batches of 16384) can take longer than the less finely divided one (4 batches of 65536). It’s not only for these particular numbers but for some others too.
I suppose the FFT algorithm for 4 batches of 65536 points somehow includes a step of calculating 16 batches of 16384 points. Am I wrong?