seb,
Thanks for testing the Ultra. It looks like you did the “pinned memory” mod correctly.
From the CUDA Programming Guide 1.0:
cclark
I’ve got the results below. I just ran the code posted by mfatica, with and without #define PINNED at the beginning of the code.
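For reference, the toggle amounts to roughly the following; this is a sketch of the idea, not mfatica’s exact code, and hostAlloc is a made-up helper name:

#include <cuda_runtime.h>
#include <cufft.h>
#include <stdlib.h>

/* Hypothetical helper: with PINNED defined, host buffers are page-locked via
   cudaMallocHost instead of malloc, which lets cudaMemcpy use DMA and usually
   raises PCI-Express throughput. */
static cufftComplex *hostAlloc(size_t bytes)
{
#ifdef PINNED
    void *p = 0;
    cudaMallocHost(&p, bytes);             /* page-locked ("pinned") memory */
    return (cufftComplex *)p;
#else
    return (cufftComplex *)malloc(bytes);  /* ordinary pageable memory */
#endif
}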
My configuration is:
Processor: Intel® Pentium® 4 CPU 3.00GHz
Memory: 1GB ram.
Video Card: GeForce 8800 GTX
OS: CentOS release 4.4 (Final)
kernel: Linux 2.6.9-42.ELsmp
For the “unpinned” version:
Device 0: "GeForce 8800 GTX"
Major revision number: 1
Minor revision number: 0
Total amount of global memory: 804585472 bytes
Clock rate: 1350000 kilohertz
Unpinned version
--------+-----------------------------------
1D Complex-to-Complex In-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 7936| 19607| 30710| 33023
2048| 5813| 12084| 15920| 15690
4096| 4098| 7168| 7812| 9015
8192| 2732| 3879| 3903| 4806
16384| 1736| 1784| 2001| 2197
32768| 954| 939| 1135| 1198
65536| 479| 542| 601| 618
131072| 237| 291| 308| 312
262144| 131| 144| 148| 150
--------+--------+--------+--------+--------
--------+-----------------------------------
1D Complex-to-Complex Out-of-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 8196| 19801| 31007| 33092
2048| 5917| 12195| 15984| 15928
4096| 4219| 7259| 7835| 9016
8192| 2732| 3883| 3900| 4807
16384| 1742| 1793| 2002| 2197
32768| 948| 941| 1137| 1200
65536| 480| 545| 605| 622
131072| 237| 291| 308| 313
262144| 131| 145| 149| 150
--------+--------+--------+--------+--------
For the “pinned” version:
Device 0: "GeForce 8800 GTX"
Major revision number: 1
Minor revision number: 0
Total amount of global memory: 804585472 bytes
Clock rate: 1350000 kilohertz
Pinned version
--------+-----------------------------------
1D Complex-to-Complex In-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 9523| 24390| 39702| 45649
2048| 6944| 15094| 20539| 22176
4096| 4878| 9090| 10631| 11149
8192| 3246| 4968| 5398| 5571
16384| 2183| 2371| 2418| 2430
32768| 1213| 1284| 1303| 1310
65536| 645| 666| 671| 672
131072| 331| 335| 337| 337
262144| 159| 161| 161| 161
--------+--------+--------+--------+--------
--------+-----------------------------------
1D Complex-to-Complex Out-of-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 9900| 24390| 40100| 45746
2048| 6993| 15151| 20618| 22183
4096| 5050| 9216| 10652| 11155
8192| 3322| 4975| 5403| 5571
16384| 2183| 2368| 2417| 2429
32768| 1215| 1286| 1305| 1312
65536| 648| 669| 674| 677
131072| 331| 336| 337| 338
262144| 160| 161| 161| 161
--------+--------+--------+--------+--------
Can everyone please post the motherboard and chipset used during your tests? Because I/O is included in the timing, I suspect the chipset plays a major role in the results.
Their FFT implementation on the Cell isn’t particularly efficient. Mercury hits 21-22 GFLOPS per SPU (1/8 of the Cell) for 1k- to 8k-point complex FFTs (SPU-resident, similar to benchFFT’s cache-resident results), or 170 GFLOPS per chip for 1k-point complex FFTs. An 8-SPU implementation of a 64k-point complex FFT hits 90.8 GFLOPS (memory to memory); see slides 16 and 11, respectively.
Board: Asus P5W64 WS Professional | Chipset: Intel 975X
Unpinned Results:
Device 0: "GeForce 8600 GTS"
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 268107776 bytes
Clock rate: 1566000 kilohertz
--------+-----------------------------------
1D Complex-to-Complex In-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 12955| 33020| 47085| 52833
2048| 9264| 17385| 21094| 22551
4096| 6758| 9403| 10534| 11275
8192| 3969| 4885| 5249| 5604
16384| 2145| 2293| 2373| 2442
32768| 1120| 1149| 1200| 1225
65536| 590| 608| 630| 637
131072| 290| 305| 310| 312
262144| 121| 124| 125| 125
--------+--------+--------+--------+--------
--------+-----------------------------------
1D Complex-to-Complex Out-of-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 13108| 33208| 47257| 52780
2048| 9251| 17405| 20983| 22468
4096| 6746| 9436| 10571| 11305
8192| 3985| 4931| 5265| 5613
16384| 2154| 2297| 2365| 2449
32768| 1126| 1157| 1206| 1232
65536| 594| 611| 630| 642
131072| 292| 304| 311| 313
262144| 120| 124| 125
Pinned Results:
Device 0: "GeForce 8600 GTS"
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 268107776 bytes
Clock rate: 1566000 kilohertz
--------+-----------------------------------
1D Complex-to-Complex In-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 14408| 42964| 68392| 80859
2048| 10691| 22583| 29068| 32111
4096| 8260| 12460| 14583| 15610
8192| 5027| 6535| 7317| 7573
16384| 2778| 3029| 3090| 3097
32768| 1471| 1525| 1539| 1543
65536| 786| 801| 804| 805
131072| 389| 391| 391| 391
262144| 150| 150| 150| 150
--------+--------+--------+--------+--------
--------+-----------------------------------
1D Complex-to-Complex Out-of-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 14589| 43000| 68271| 80802
2048| 10667| 22629| 29383| 32234
4096| 8279| 12491| 14610| 15639
8192| 5024| 6560| 7329| 7566
16384| 2770| 3025| 3090| 3100
32768| 1472| 1539| 1554| 1555
65536| 792| 811| 810| 811
131072| 390| 393| 393| 393
262144| 150| 150| 150
On both the unpinned and pinned runs, the final test gave: Cuda error in file ‘fft_bench.cu’ in line 108 : out of memory.
103 size_t arraySize = sizeof(cufftComplex) * nx * batch;
104 cufftComplex *deviceArrayA;
105 cufftComplex *deviceArrayB;
106 CUDA_SAFE_CALL(cudaMalloc((void**)&deviceArrayA, arraySize));
107 if (out_of_place)
108 CUDA_SAFE_CALL(cudaMalloc((void**)&deviceArrayB, arraySize));
This is the only GPU in this system and is being used for my display, running Windows XP.
[overly skeptical and rude comments showing disbelief, edited away by me now because I can]
[pointed to this figure for the source of my doubt, paper figures for the FFT implementation]
http://66.102.1.104/scholar?hl=en&lr=&safe…x+2005+cell+fft
[compared the figures to these more recent but lower figures]
Unpinned Results:
Device 0: "GeForce 8800 GTX"
Major revision number: 1
Minor revision number: 0
Total amount of global memory: 805044224 bytes
Clock rate: 1350000 kilohertz
--------+-----------------------------------
1D Complex-to-Complex In-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 9833| 23983| 37580| 35969
2048| 7173| 14655| 18233| 17454
4096| 5128| 8730| 8501| 9399
8192| 3441| 4415| 4309| 4819
16384| 2060| 1956| 2127| 2210
32768| 946| 1000| 1108| 1140
65536| 514| 560| 586| 593
131072| 258| 287| 294| 293
262144| 133| 139| 140| 138
--------+--------+--------+--------+--------
--------+-----------------------------------
1D Complex-to-Complex Out-of-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 9943| 24023| 37531| 35773
2048| 7202| 14685| 18181| 17475
4096| 5172| 8730| 8489| 9360
8192| 3386| 4402| 4287| 4796
16384| 2054| 1957| 2123| 2221
32768| 1058| 1007| 1115| 1139
65536| 517| 561| 590| 593
131072| 259| 287| 295| 294
262144| 131| 136| 141| 141
--------+--------+--------+--------+--------
Pinned Results:
Device 0: "GeForce 8800 GTX"
Major revision number: 1
Minor revision number: 0
Total amount of global memory: 805044224 bytes
Clock rate: 1350000 kilohertz
--------+-----------------------------------
1D Complex-to-Complex In-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 12052| 34501| 66518| 80524
2048| 8688| 22583| 34254| 38296
4096| 6937| 14625| 17965| 19275
8192| 4978| 8121| 9168| 9606
16384| 3332| 3813| 3958| 3994
32768| 1881| 2022| 2061| 2067
65536| 1053| 1096| 1106| 1109
131072| 537| 548| 549| 554
262144| 248| 250| 250| 250
--------+--------+--------+--------+--------
--------+-----------------------------------
1D Complex-to-Complex Out-of-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 12175| 34501| 66673| 80354
2048| 9131| 22583| 34336| 37810
4096| 5240| 14625| 17936| 19254
8192| 5013| 8116| 9163| 9608
16384| 3342| 3812| 3962| 3993
32768| 1883| 2035| 2072| 2079
65536| 1058| 1105| 1114| 1116
131072| 538| 548| 552| 553
262144| 247| 248| 248| 249
--------+--------+--------+--------+--------
Hi all,
I’m new to everything here and have not quite set up my development environment yet. I’m hoping that someone will be kind enough to run the FFT benchmarks with some larger FFT sizes. I am interested in seeing the benchmark results for the following FFT sizes:
524288
1048576
2097152
4194304
8388608
Thanks,
-Simon
Not too good to be true.
Mercury’s SAL (Scientific Application Library), which is available today, hits these performance numbers for single precision complex FFTs (src & dst in SPU memory) on a 3.2 GHz Cell using a single SPU (1 of 8):
256 pt → 19.29 GFLOPS
1024 pt → 21.34 GFLOPS
2048 pt → 22.21 GFLOPS
4096 pt → 21.61 GFLOPS
8192 pt → 22.15 GFLOPS
A 64k point complex FFT (src & dst in off-chip memory) hits 90.8 GFLOPS and is memory bandwidth bound.
These are not theoretical numbers, but actual, measured performance.
Hmm… Think you’re right. Mea culpa :-)
This is with a Zotac 8800GT ‘Amp’: core is 700 MHz, memory is 2 GHz.
2.6.23-gentoo-r3 #4 SMP Thu Dec 27 23:49:40 PST 2007 i686 Intel® Core™2 CPU 6300 @ 1.86GHz GenuineIntel GNU/Linux
00:01.0 PCI bridge: Intel Corporation 82P965/G965 PCI Express Root Port (rev 02)
not pinned
Device 0: "GeForce 8800 GT"
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 536150016 bytes
Clock rate: 1674000 kilohertz
--------+-----------------------------------
1D Complex-to-Complex In-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 13698| 36036| 58181| 69945
2048| 9900| 22099| 29906| 31746
4096| 7246| 12578| 15655| 17400
8192| 5025| 7042| 7575| 9310
16384| 2941| 3202| 3448| 3758
32768| 1620| 1631| 1912| 2017
65536| 889| 959| 1064| 1093
131072| 372| 448| 471| 474
262144| 195| 215| 219| 220
--------+--------+--------+--------+--------
--------+-----------------------------------
1D Complex-to-Complex Out-of-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 13698| 36036| 58181| 69868
2048| 9900| 22099| 29906| 31746
4096| 7246| 12578| 15670| 17753
8192| 5025| 7029| 7561| 9334
16384| 2941| 3212| 3452| 3777
32768| 1618| 1624| 1914| 2016
65536| 889| 961| 1066| 1096
131072| 391| 454| 479| 484
262144| 196| 216| 220| 222
--------+--------+--------+--------+--------
pinned results
Device 0: "GeForce 8800 GT"
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 536150016 bytes
Clock rate: 1674000 kilohertz
--------+-----------------------------------
1D Complex-to-Complex In-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 14705| 43956| 82901| 108108
2048| 11111| 28776| 43010| 50753
4096| 8620| 17021| 23021| 25236
8192| 6369| 9828| 11782| 12405
16384| 3861| 4352| 4476| 4496
32768| 2202| 2350| 2390| 2397
65536| 1251| 1295| 1307| 1308
131072| 545| 551| 553| 553
262144| 252| 253| 253| 252
--------+--------+--------+--------+--------
--------+-----------------------------------
1D Complex-to-Complex Out-of-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 14705| 43956| 82901| 108108
2048| 11111| 28776| 43126| 50753
4096| 8547| 16949| 22955| 25266
8192| 6329| 9828| 11790| 12403
16384| 3875| 4381| 4493| 4525
32768| 2202| 2357| 2390| 2401
65536| 1259| 1302| 1314| 1315
131072| 555| 564| 565| 565
262144| 253| 254| 254| 254
--------+--------+--------+--------+--------
Just started exploring CUDA, but wanted to add a test point here for a card that I haven’t seen benched yet. The numbers seem low to me, so any advice would be greatly appreciated. I think perhaps it’s just the tradeoff you make for a laptop. I also get a bunch of errors at the end of the out-of-place FFTs.
Lenovo Thinkpad T61p
CentOS 5 2.6.18-53.el5PAE #1 SMP
4GB RAM
Unpinned:
Device 0: "Quadro FX 570M"
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 267714560 bytes
Clock rate: 550000 kilohertz
--------+-----------------------------------
1D Complex-to-Complex In-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 6172| 19230| 31189| 36529
2048| 4347| 9756| 12810| 13778
4096| 3546| 5442| 6309| 6828
8192| 2123| 2824| 3032| 3294
16384| 1060| 1141| 1156| 1185
32768| 547| 559| 581| 588
65536| 304| 312| 321| 323
131072| 144| 151| 152| 153
262144| 53| 54| 55| 55
--------+--------+--------+--------+--------
--------+-----------------------------------
1D Complex-to-Complex Out-of-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 6172| 19230| 31434| 36488
2048| 4424| 9876| 12841| 13742
4096| 3521| 5434| 6306| 6828
8192| 2123| 2808| 3021| 3240
16384| 1043| 1140| 1160| 1189
32768| 548| 557| 583| 591
65536| 306| 314| 322| 325
131072| 144| 151| 153| 153
cufft: ERROR: /root/cuda-stuff/sw/gpgpu_rel1.1/cufft/src/execute.cu, line 1038
cufft: ERROR: CUFFT_EXEC_FAILED
cufft: ERROR: /root/cuda-stuff/sw/gpgpu_rel1.1/cufft/src/execute.cu, line 1038
cufft: ERROR: CUFFT_EXEC_FAILED
cufft: ERROR: /root/cuda-stuff/sw/gpgpu_rel1.1/cufft/src/execute.cu, line 1038
cufft: ERROR: CUFFT_EXEC_FAILED
cufft: ERROR: /root/cuda-stuff/sw/gpgpu_rel1.1/cufft/src/cufft.cu, line 119
cufft: ERROR: CUFFT_EXEC_FAILED
…a bunch of the same errors repeated, followed by an Abort, and then:
262144| 53| 54| 54| 1254901
Pinned:
--------+-----------------------------------
1D Complex-to-Complex In-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 7092| 23668| 39702| 47583
2048| 5208| 11627| 15252| 17094
4096| 4081| 6319| 7511| 7984
8192| 2450| 3259| 3595| 3696
16384| 1194| 1277| 1290| 1289
32768| 615| 636| 637| 638
65536| 348| 351| 352| 352
131072| 166| 165| 165| 166
262144| 58| 58| 58| 58
--------+--------+--------+--------+--------
--------+-----------------------------------
1D Complex-to-Complex Out-of-Place FFTs
--------+-----------------------------------
nx | batch
+--------+--------+--------+--------
| 1| 4| 16| 64
--------+--------+--------+--------+--------
1024| 7092| 23668| 39800| 47583
2048| 5235| 11661| 15267| 16740
4096| 4081| 6349| 7497| 7966
8192| 2457| 3262| 3590| 3724
16384| 1197| 1286| 1288| 1297
32768| 624| 632| 638| 640
65536| 352| 353| 354| 354
131072| 164| 166| 166| 166
Same errors, Abort and then
262144| 57| 57| 57| 1254901
You may not have enough GPU memory to run the large out-of-place test. I don’t have a laptop GPU, so that’s just a guess.
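A quick size check supports that guess. A minimal sketch of the arithmetic for the largest out-of-place case (nx = 262144, batch = 64):

#include <stdio.h>

int main(void)
{
    /* A cufftComplex is two floats = 8 bytes. */
    unsigned long arraySize = 8UL * 262144UL * 64UL;  /* one device buffer */
    unsigned long needed    = 2UL * arraySize;        /* buffers A and B   */
    printf("one buffer  : %lu bytes (128 MB)\n", arraySize);
    printf("both buffers: %lu bytes (256 MB)\n", needed);
    /* Both the 8600 GTS (268,107,776 bytes) and the Quadro FX 570M
       (267,714,560 bytes) report less total global memory than the
       268,435,456 bytes needed, even before the display and CUFFT's
       work area are counted. */
    return 0;
}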
Working the best numbers from above (cclark, post #4):
FFTs/sec for NVidia GeForce 8800 GTX, Out-of-place FFT, pinned memory
Batch size
FFT size 1 4 16 64
1024 2189 13674 43605 88677
2048 3323 11752 30075 47368
4096 3066 9961 19332 25785
8192 2780 7158 11216 13378
16384 2262 4174 5141 5482
32768 1676 2531 2862 2959
65536 1140 1459 1564 1592
131072 650 750 784 790
262144 311 331 336 338
I get the following GFLOPS values:
GFLOPs/sec for NVidia GeForce 8800 GTX, Out-of-place FFT, pinned memory
Batch size
FFT size 1 4 16 64
1024 0.11 0.70 2.23 4.54
2048 0.37 1.32 3.39 5.34
4096 0.75 2.45 4.75 6.34
8192 1.48 3.81 5.97 7.12
16384 2.59 4.79 5.90 6.29
32768 4.12 6.22 7.03 7.27
65536 5.98 7.65 8.20 8.35
131072 7.24 8.36 8.73 8.80
262144 7.34 7.81 7.93 7.97
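These values follow from the customary 5*N*log2(N) flop count for a length-N complex FFT, which reproduces the table above; a minimal sketch of the conversion:

#include <math.h>
#include <stdio.h>

/* FFTs/sec -> GFLOPS using the customary 5*N*log2(N) operation count
   for a length-N complex FFT. */
double fft_gflops(int n, double ffts_per_sec)
{
    return 5.0 * n * (log((double)n) / log(2.0)) * ffts_per_sec / 1e9;
}

int main(void)
{
    printf("%.2f\n", fft_gflops(131072, 650.0));  /* 7.24, as in the table */
    printf("%.2f\n", fft_gflops(1024, 2189.0));   /* 0.11 */
    return 0;
}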
Comparable Cell numbers for a single SPU are (BonsaiScott, post #29):
256 pt → 19.29 GFLOPS
1024 pt → 21.34 GFLOPS
2048 pt → 22.21 GFLOPS
4096 pt → 21.61 GFLOPS
8192 pt → 22.15 GFLOPS
The cell numbers above are constrained to a single SPU. With the full 8 SPUs operating, the Cell processor provides on the order of 90 GFLOPS.
Not knowing a lot of GeForce 8800 GTX architectural details, I am trying to understand whether the CUDA FFT benchmark operates the GPU at its full FFT-processing capacity, or whether it is analogous to operating the Cell with only one SPU. Is there other parallelism not exploited by this benchmark, or are the 4-8 GFLOPS numbers above the whole story? What is the maxed-out FFT capacity that would compare to the 90 GFLOPS for the Cell with 8 SPUs, including all I/O?
KenH
The GFLOPS numbers are lower because PCI-Express transfer time is included in the measurement. But note that:
- The story is quite similar for any PCI-Express coprocessing card (be it GPU/Cell/FPGA/whatever): you have to give the coprocessor enough work to avoid being bound by PCI-Express bandwidth; otherwise the coprocessor is underutilized.
- SM 1.1 (and above) NVIDIA GPUs are capable of asynchronous transfers, enabling computation and host<->device transfers to overlap (see the sketch below).
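Here is a minimal sketch of that kind of overlap with pinned memory, streams, and cudaMemcpyAsync; the kernel "process" and the sizes are placeholders, not part of the benchmark:

#include <cuda_runtime.h>

__global__ void process(float *d, int n) { /* stand-in for the real work */ }

int main(void)
{
    const int nChunks = 4, chunk = 1 << 20;              /* 1M floats/chunk */
    float *h, *d;
    cudaMallocHost((void**)&h, nChunks * chunk * sizeof(float)); /* pinned */
    cudaMalloc((void**)&d, nChunks * chunk * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int i = 0; i < nChunks; ++i) {
        cudaStream_t st = s[i & 1];
        /* Upload chunk i while the previous chunk may still be processing. */
        cudaMemcpyAsync(d + i * chunk, h + i * chunk,
                        chunk * sizeof(float), cudaMemcpyHostToDevice, st);
        process<<<chunk / 256, 256, 0, st>>>(d + i * chunk, chunk);
        cudaMemcpyAsync(h + i * chunk, d + i * chunk,
                        chunk * sizeof(float), cudaMemcpyDeviceToHost, st);
    }
    cudaThreadSynchronize();    /* wait for all streams (CUDA 1.x-era call) */

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}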
My understanding is that the Cell benchmarks above apply to FFTs performed out of and into the XDR memory. Cells also have a PCI express interface for communication outside of a blade/PCI card, but that is not being exercised in the numbers above.
Therefore, the most representative FFT comparison to the ~90 GFLOPS number for the Cell with 8 SPUs is the 52 GFLOPS number quoted above (posts #14 and #16) for on-card memory-to-memory work with no external I/O over PCIe. Is this a fair statement? I assume the 52 GFLOPS number uses the maximum degree of parallelism available and large batches.
The most realistic benchmark would apply I/O over the PCIe in the background while doing the FFTs in the foreground. There would be 3 cases of interest:
The results would show the range of capability depending on the I/O assumptions. Clearly the I/O can’t be ignored; it is increasingly the limiting factor in achievable performance. There are many input data types (float/int16/int8) and real/complex permutations, but complex float (as in post #1) would be a useful starting point.
Are any 8800 users interested in trying this?
KenH
I’ve done some benchmarking on an 8600 GTS using a slightly modified version of the code (for Linux). I was interested in seeing whether I can really count on the bandwidths that are nominal for pinned memory when using the CUFFT routines.
What I find is that while the Host to Device bandwidth reaches a steady-state level around the nominal 2.5-2.6 GBps fairly quickly for my card, the Device to Host bandwidth performs poorly (it should be around 1.8 GBps). It does approach the nominal bandwidth, but much more slowly than the Host to Device rate. What is it about Device to Host transfers that has so much overhead, or is it something about the way CUFFT writes data out that makes it non-optimal for transfer?
I’d like to run the FFT routine in the foreground while relegating data transfers to the background, but as you can see from the results, the Device to Host transfer time becomes the bottleneck at times.
Thanks for your help
SKB.
Here’s an example:
1D Complex-to-Complex Out-of-Place FFTs
FFT Size : 131072
Batch 1 (the Device to Host transfer is the bottleneck here and the data rate is not optimal)
Host To Dev(ms) 0.411987
Host to Dev(MBps) 2545
FFT (ms) 0.055075
Dev to Host (ms) 2.132893
Dev to Host(MBps) 491
Batch 64
Host To Dev(ms) 0.394408
Host to Dev(MBps) 2658
FFT (ms) 1.507673
Dev to Host (ms) 0.661843
Dev to Host(MBps) 1584
Here are the main code changes I made:
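// Time the H->D copy, the FFT, and the D->H copy separately with rtimer2,
// and the whole round trip with rtimer1, for both code paths.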
if (out_of_place)
{
rtimer1.tic();
rtimer2.tic();
CUDA_SAFE_CALL(cudaMemcpy(deviceArrayA, hostArrayA, arraySize, cudaMemcpyHostToDevice));
HtoD_xfer_time = rtimer2.tocR();
rtimer2.tic();
CUFFT_SAFE_CALL(cufftExecC2C(plan, deviceArrayA, deviceArrayB, CUFFT_FORWARD));
FFT_time = rtimer2.tocR();
rtimer2.tic();
CUDA_SAFE_CALL(cudaMemcpy(hostArrayB, deviceArrayB, arraySize, cudaMemcpyDeviceToHost));
DtoH_xfer_time = rtimer2.tocR();
fft_time = rtimer1.tocR();
}
else
{
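// In-place path: the FFT overwrites deviceArrayA, which is then copied back to the host.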
rtimer1.tic();
rtimer2.tic();
CUDA_SAFE_CALL(cudaMemcpy(deviceArrayA, hostArrayA, arraySize, cudaMemcpyHostToDevice));
HtoD_xfer_time = rtimer2.tocR();
rtimer2.tic();
CUFFT_SAFE_CALL(cufftExecC2C(plan, deviceArrayA, deviceArrayA, CUFFT_FORWARD));
FFT_time = rtimer2.tocR();
rtimer2.tic();
CUDA_SAFE_CALL(cudaMemcpy(hostArrayB, deviceArrayA, arraySize, cudaMemcpyDeviceToHost));
DtoH_xfer_time = rtimer2.tocR();
fft_time = rtimer1.tocR();
}
int fftsPerSec = (int)( batch / fft_time );
if (fastestRateFound<fftsPerSec)
{
fastestRateFound = fftsPerSec;
HtoD_xfer_s = HtoD_xfer_time/batch;
FFT_s = FFT_time/batch;
DtoH_xfer_s = DtoH_xfer_time/batch;
}
// provide some entertainment
printf("%c\b", spinner[spindex++]);
if (spindex==4)
spindex = 0;
}
printf("|%8i", fastestRateFound);
// Host to Device transfer time in ms per FFT
printf("|%f",(HtoD_xfer_s*1000));
// Host to Device data rate in MBps
printf("|%8i",(int)floor(arraySize/(HtoD_xfer_s*batch*1000000)));
printf("|%f",(FFT_s*1000)); // FFT processing time in ms per FFT
// Device to Host transfer time in ms per FFT
printf("|%f",(DtoH_xfer_s*1000));
// Device to Host data rate in MBps
printf("|%8i",(int)floor(arraySize/(DtoH_xfer_s*batch*1000000)));
It looks like the setup for Device to Host transfers has a large overhead. I discovered this by setting a very small memory transfer (8 bytes) between device and host, then performing a very large transfer between the same two addresses. If I measured the time for the large transfer to complete and calculate the bandwidth that way, I approach the nominal bandwidth for my card.
This also explains why the bandwidth test in the SDK calculates such high rates: the transfers are done in a loop (10 times) and the average taken, so the first transfer absorbs the overhead and the remaining 9 dilute it.
It also seems that this only works if the small transfer is done right before the large transfer, i.e. I cannot prime the DMA well beforehand.
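Here is a minimal sketch of the warm-up measurement described above (sizes are illustrative, not the exact code I used):

#include <cuda_runtime.h>
#include <sys/time.h>
#include <stdio.h>

static double seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    const size_t big = 64 << 20;                   /* 64 MB payload         */
    char *h, *d;
    cudaMallocHost((void**)&h, big);               /* pinned host buffer    */
    cudaMalloc((void**)&d, big);

    cudaMemcpy(h, d, 8, cudaMemcpyDeviceToHost);   /* tiny priming transfer */

    double t0 = seconds();                         /* cudaMemcpy blocks, so */
    cudaMemcpy(h, d, big, cudaMemcpyDeviceToHost); /* wall time == transfer */
    double t1 = seconds();

    printf("D->H after priming: %.0f MBps\n", (big / 1e6) / (t1 - t0));

    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}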
Has anyone seen this or have an explanation for it? I don’t see it in Host to Device DMAs.
Thanks
skb
Has anyone had a chance to experiment with larger FFT sizes?
524288
1048576
2097152
4194304
8388608
This should be interesting now that the limit has been removed. I’m looking to get results from the 5600s and 8800s.
Thanks!
Hello. I’m new to CUDA and I’m asking for some help. I ran your benchmark on my 8600 GT (it stops with errors at the end, perhaps because there is too little memory). Then I made some changes to the source to get the time needed for a batch of FFTs. A part of the result:
Time for processing array of 16 packages of 16384 numbers: 9054 (microsec)
Time for processing array of 4 packages of 65536 numbers: 8492 (microsec)
I was calculating it the following way: 1 * batch / fastestRateFound.
So I don’t understand how processing the more finely divided array (16 batches of 16384) can take longer than the less finely divided one (4 batches of 65536). It’s not only for these particular numbers but for some others too.
I suppose the FFT algorithm for 4 batches of 65536 points somehow includes a step of calculating 16 batches of 16384 points. Am I wrong?