Do 1/2/3D FFTs work correctly on K80 if batch = 1?

Hi,

I am evaluating FFTs with CUDA 7.5 on a K80.
I managed to get decent performance in batched 1D/2D/3D FFTs using both GPUs of the K80.

How can I compute a single 1D/2D/3D FFT using both GPUs of the K80?

I get NaNs in the output and very low performance when I run the examples involving cufftMakePlan1d, cufftMakePlan2d, and cufftMakePlan3d.

How can I use both GPUs with cufftMakePlan1d, cufftMakePlan2d, and cufftMakePlan3d?

espesp

There are CUDA sample codes that demonstrate the use of multi-GPU cuFFT, including at least one showing how to do a 2D transform:

http://docs.nvidia.com/cuda/cuda-samples/index.html#simplecufft_2d_mgpu

That particular sample code uses cufftMakePlan2d for a single transform.

You must also use the cuFFT Xt API if you intend to run a single transform call on multiple GPUs:

http://docs.nvidia.com/cuda/cufft/index.html#multiple-GPU-cufft-transforms
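In outline, the multi-GPU path looks like the following sketch (illustrative only, with error checking omitted; NX, NY, h_in, and h_out are placeholder names, not from the sample):

```c
#include <cufft.h>
#include <cufftXt.h>

/* Sketch: one 2D C2C FFT spread across both GPUs of a K80.
 * Error checking omitted; every call returns a cufftResult/cudaError_t
 * that real code should check. */
void fft2d_two_gpus(cufftComplex *h_in, cufftComplex *h_out, int NX, int NY)
{
    cufftHandle plan;
    cufftCreate(&plan);                 /* cufftCreate, not cufftPlan2d:
                                           GPUs must be set before planning */

    int gpus[2] = {0, 1};               /* the two devices of the K80 */
    cufftXtSetGPUs(plan, 2, gpus);      /* must precede cufftMakePlan2d */

    size_t worksize[2];                 /* one work-area size per GPU */
    cufftMakePlan2d(plan, NX, NY, CUFFT_C2C, worksize);

    cudaLibXtDesc *d_data;              /* descriptor for data split across GPUs */
    cufftXtMalloc(plan, &d_data, CUFFT_XT_FORMAT_INPLACE);
    cufftXtMemcpy(plan, d_data, h_in, CUFFT_COPY_HOST_TO_DEVICE);

    cufftXtExecDescriptorC2C(plan, d_data, d_data, CUFFT_FORWARD);

    /* Copying back through cufftXtMemcpy restores natural ordering;
       reading the per-GPU buffers directly would give permuted data. */
    cufftXtMemcpy(plan, h_out, d_data, CUFFT_COPY_DEVICE_TO_HOST);

    cufftXtFree(d_data);
    cufftDestroy(plan);
}
```

The key difference from the single-GPU path is that the plan is created with cufftCreate and cufftXtSetGPUs before cufftMakePlan2d is called, and data lives in a cudaLibXtDesc rather than a plain device pointer.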

Hi,

I tried to evaluate the performance of cuFFT on a K80 using the provided sample.
I slightly modified the sample to be able to measure forward transforms. The sample showed me 450 ms on average, which was not the number I expected.
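The timing change is essentially the following sketch (illustrative only; plan and d_data are assumed to be set up as in the MGPU sample, and both devices are synchronized before reading the host clock because cufftXtExecDescriptorC2C may return before both GPUs finish):

```c
#include <cuda_runtime.h>
#include <cufft.h>
#include <cufftXt.h>
#include <time.h>

/* Sketch: wall-clock timing (in ms) of one forward multi-GPU transform.
 * 'plan' and 'd_data' are assumed to be set up as in the MGPU sample. */
double time_forward_fft(cufftHandle plan, cudaLibXtDesc *d_data)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    cufftXtExecDescriptorC2C(plan, d_data, d_data, CUFFT_FORWARD);

    /* The exec call may return before both GPUs are done; sync each device. */
    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}
```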

Using the same API as the sample, I wrote my own test in which the test cases use batch = 2 or more (for example, the canonical problem cif512x512*2).
The test showed 3-5 ms for the different test cases, which is far more acceptable than the results for batch = 1.

I also used my test to re-evaluate the batch = 1 cases, but I got the same performance problem plus one more issue with incorrect results: the actual and expected results differed in half of the points (it seemed that the second GPU didn’t send its data back).

Could you help me understand why the batch = 1 case is so slow and incorrect for all dimensions?

Thanks.

Hi txbob,

Thank you for the example!
Could you please point us to performance charts for single 1D/2D/3D FFTs on the K80?
Or just give us an idea of the performance improvement expected for such FFTs on the K80 vs. the K40?

espesp

Do a Google search for “CUDA Performance Report”.

You will find recent performance reports for CUDA 6.5 and CUDA 7 on the first page of hits, including some cuFFT data.

http://developer.download.nvidia.com/compute/cuda/compute-docs/cuda-performance-report.pdf

http://developer.download.nvidia.com/compute/cuda/6_5/rel/docs/CUDA_6.5_Performance_Report.pdf

I’m not sure a perf report for CUDA 7.5 is currently available. If the CUFFT data you’re looking for is not in the CUDA 6.5 or CUDA 7.0 perf reports, then it may not be available, or at least I’m not sure where to look for it.

Generally speaking, on GPUs the performance of FFTs is limited by memory bandwidth. Relevant data is provided in the specifications of K40 and K80:

http://www.nvidia.com/content/PDF/kepler/Tesla-K40-PCIe-Passive-Board-Spec-BD-06902-001_v05.pdf
Memory bandwidth: 288 GB/sec

http://images.nvidia.com/content/pdf/kepler/Tesla-K80-BoardSpec-07317-001-v05.pdf
Memory bandwidth: 480 GB/sec (cumulative)

Note that the memory throughput of the K80 is stated as a cumulative number. The K80 comprises two GPUs, each with its own attached memory delivering 240 GB/sec. Therefore, one would expect each half of a K80 to provide about 83% of the FFT throughput of a K40. I am not aware of any publicly posted CUFFT performance data comparing the K40 and the K80.

Generally speaking, to fully utilize a K80, multi-GPU programming techniques have to be used. While it is a single card, the two GPUs on the board appear to CUDA as two separate devices.

Njuffa, txbob,

Thank you for the information :)

espesp