CUFFT performance poor on GTX 1080 (Linux, CUDA 8.0, nvidia-367)

Hi, I have a GTX 1080 installed under Ubuntu 16.04 64-bit, using the current nvidia-367 driver release. I measured the performance of a batched (cufftPlanMany()) real-to-complex transform executed with cufftExecR2C(), and the results were quite unexpected:

Central code:
  cudaEventRecord(tstart, 0);
  cufftExecR2C(fftplan_r2c, (cufftReal*)d_idata, d_odata);
  cudaEventRecord(tstop, 0);
  cudaEventSynchronize(tstop);
  cudaEventElapsedTime( &dt_msec, tstart, tstop);
  printf("%10zu %10zu %9.3f %8.1f ", fftlen, batch, dt_msec, (1e-6*fftlen*batch)/dt_msec);

Result:
                    GeForce GTX 1080      GeForce GTX TITAN X   GeForce GTX TITAN X
                    CC 6.1, CUDA 8.0      CC 5.2, CUDA 7.5      CC 5.2, CUDA 8.0
                    1024 threads/blk      1024 threads/blk      1024 threads/blk
  Nreals  x  Lbatch T_r2c[ms]  R[Gsps]    T_r2c[ms]  R[Gsps]    T_r2c[ms]  R[Gsps]
      64    4194304    19.029     14.1       15.460     17.4       16.482     16.3 
     128    2097152    18.585     14.4       17.327     15.5       20.737     12.9
     256    1048576    18.896     14.2       15.431     17.4       17.741     15.1 
     512     524288    18.774     14.3       15.408     17.4       16.586     16.2 
    1024     262144    18.460     14.5       15.332     17.5       16.338     16.4 
    2048     131072    18.387     14.6       15.346     17.5       16.358     16.4
    4096      65536    18.402     14.6       15.425     17.4       16.465     16.3
    8192      32768    18.462     14.5       15.414     17.4       16.437     16.3
   16384      16384    18.409     14.6       15.382     17.5       16.400     16.4
   32768       8192    29.397      9.1       26.343     10.2       27.894      9.6 
   65536       4096    29.143      9.2       26.327     10.2       27.814      9.7
  131072       2048    29.642      9.1       28.178      9.5       29.537      9.1
  262144       1024    29.961      9.0       29.536      9.1       30.968      8.7 
  524288        512    38.618      7.0       34.745      7.7       36.718      7.3
 1048576        256    39.146      6.9       36.879      7.3       39.240      6.8

The T_r2c column shows the execution time on the GPU; it excludes all data transfers. The R[Gsps] column shows the throughput in gigasamples per second. The GTX 1080 card with CUDA 8.0 turned out to be 10-20% slower than the GTX TITAN X card with CUDA 7.5. Might this be due to the nvidia-367 driver…?
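As a sanity check on the R[Gsps] column, the first row works out to 64 × 4194304 = 268435456 samples processed in 19.029 ms, i.e. roughly 14.1 × 10^9 samples per second, matching the 14.1 Gsps reported.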

Has anyone else tested CUFFT on GTX 1080 yet? What kind of throughput do you get?

Edit: I installed CUDA 8.0 on the machine with the GTX TITAN X and re-ran the test. The results have been added as the last two columns above. The GTX 1080 remains notably slower than the GTX TITAN X. Even more surprisingly, upgrading from CUDA 7.5 to CUDA 8.0 increased the time for cufftExecR2C() on the GTX TITAN X by 1-3 milliseconds! Is this a regression?

And for completeness here is the test code:

// Usage: ./benchfft [<deviceNr>]

#include <stdio.h>
#include <cuda.h>
#include <cufft.h>
#include <ctype.h>
#include <helper_cuda.h>

#define MIN_LFFT      64ULL
#define MAX_LFFT      (1ULL*1024*1024)
#define MIN_BATCH     128ULL
#define MAX_SAMPLES   (MAX_LFFT*MIN_BATCH)
#define MAX_BATCH     (MAX_SAMPLES/MIN_LFFT)
#define MAX_CPLX_OUT  ((MIN_LFFT*MAX_BATCH)/2 + MAX_BATCH)

int main(int argc, char **argv)
{
    cufftHandle fftplan_r2c;
    cudaDeviceProp dp;
    cudaEvent_t tstart, tstop;
    void *d_in, *h_in, *d_out;
    int device = 0;

    // Select device and do some preparations
    if ((argc == 2) && isdigit(argv[1][0])) {
        device = argv[1][0] - '0';
    }

    checkCudaErrors( cudaGetDeviceProperties(&dp, device) );
    printf("CUDA Device #%d : %s, Compute Capability %d.%d, %d threads/block, warpsize %d\n",
        device, dp.name, dp.major, dp.minor, dp.maxThreadsPerBlock, dp.warpSize
    );

    checkCudaErrors( cudaSetDevice(device) );
    checkCudaErrors( cudaDeviceReset() );
    checkCudaErrors( cudaEventCreate(&tstart) );
    checkCudaErrors( cudaEventCreate(&tstop) );

    checkCudaErrors( cudaMalloc( (void **)&d_in, sizeof(cufftReal)*MAX_SAMPLES ) );
    checkCudaErrors( cudaMalloc( (void **)&d_out, sizeof(cufftComplex)*MAX_CPLX_OUT ) );
    checkCudaErrors( cudaHostAlloc( (void **)&h_in, sizeof(cufftReal)*MAX_SAMPLES, cudaHostAllocDefault ) );
    for (size_t n=0; n<MAX_SAMPLES; n++) {
        ((cufftReal*)h_in)[n] = cufftReal(n % 1234);
    }

    printf("    Nreals     Lbatch T_r2c[ms]  R[Gs/s]\n");

    for (size_t fftlen = MIN_LFFT; fftlen <= MAX_LFFT; fftlen *= 2) {

        size_t batch = MAX_SAMPLES / fftlen;

        // CuFFT R2C plan : N reals = N/2 complex
        int dimn[1] = {(int)fftlen};    // DFT size (cast avoids narrowing from size_t)
        int inembed[1] = {0};           // ignored for 1D xform
        int onembed[1] = {0};           // ignored for 1D xform
        int istride = 1, ostride = 1;   // step between in(out) samples
        int idist = fftlen;             // in step between FFTs (R2C input = real)
        int odist = fftlen/2 + 1;       // out step between FFTs (x2C output = complex); use N/2+0 to discard Nyquist
        checkCudaErrors( cufftPlanMany(&fftplan_r2c, 1, dimn,
            inembed, istride, idist,
            onembed, ostride, odist,
            CUFFT_R2C, batch)
        );

        #if defined(CUDA_VERSION) && (CUDA_VERSION < 8000)
        checkCudaErrors( cufftSetCompatibilityMode(fftplan_r2c, CUFFT_COMPATIBILITY_NATIVE) );
        #endif

        // Execute FFT once to force instantiation of the plan (if delayed instantiation)
        checkCudaErrors( cufftExecR2C(fftplan_r2c, (cufftReal*)d_in, (cufftComplex*)d_out) );


        // Restore FFT input and output areas to a known state
        checkCudaErrors( cudaMemcpy( d_in, h_in, idist*sizeof(cufftReal)*batch, cudaMemcpyHostToDevice) );
        checkCudaErrors( cudaMemset( d_out, 0x00, odist*sizeof(cufftComplex)*batch ) );

        // Time the execution of cufftExecR2C()
        checkCudaErrors( cudaEventRecord(tstart) );
        checkCudaErrors( cufftExecR2C(fftplan_r2c, (cufftReal*)d_in, (cufftComplex*)d_out) );
        checkCudaErrors( cudaEventRecord(tstop) );
        checkCudaErrors( cudaEventSynchronize(tstop) );

        float dt_msec = 0.0f;
        checkCudaErrors( cudaEventElapsedTime( &dt_msec, tstart, tstop ) );
        printf("%10zu %10zu %9.3f %8.1f\n", fftlen, batch, dt_msec, (1e-6*fftlen*batch)/dt_msec);

        // Cleanup
        checkCudaErrors( cufftDestroy(fftplan_r2c) );
    }

    return 0;
}
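For reference, helper_cuda.h comes from the CUDA samples, so a build line along these lines should work (the samples include path may differ on your system):

  nvcc -O2 -I/usr/local/cuda/samples/common/inc -o benchfft benchfft.cu -lcufft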

I think there are two issues here which you would want to separate, one related to hardware (GeForce GTX 1080 vs GeForce Titan X), the other related to software (the performance difference between CUDA 7.5 and CUDA 8.0). I don’t have either piece of hardware available to me, so I cannot confirm or refute your data.

Large FFTs are memory bound, and the GTX 1080 has slightly lower memory bandwidth (320 GB/sec; http://www.geforce.com/hardware/10series/geforce-gtx-1080) than the Maxwell-based Titan X (336.5 GB/sec; http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-titan-x/specifications).
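For what it is worth, the spec-sheet numbers alone (320 / 336.5 ≈ 0.95) would only explain roughly a 5% deficit, so the following point probably matters as well.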

There have also been reports elsewhere in these forums that it is difficult to achieve anywhere near the full memory throughput on the GTX 1080, and speculation is that this is related to the use of newfangled GDDR5X memory on this card (the Maxwell Titan X uses “classical” GDDR5 memory).

I haven’t seen any previous reports of CUFFT performance regression when moving from CUDA 7.5 to CUDA 8.0 on Titan X. Since the difference appears to be more than 5% here, and you state you are using the latest software, it seems reasonable to me to report this as a bug to NVIDIA.

It will likely come down to memory bandwidth, as @njuffa indicated.

Run bandwidthTest in all the cases. If the difference in bandwidthTest device-to-device memory bandwidth is consistent with the difference in CUFFT performance, that is probably your explanation for the GTX 1080 / Titan X difference.
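In case you don't have the CUDA samples handy, here is a minimal sketch of my own (not the poster's code, not the actual bandwidthTest source) that times repeated device-to-device cudaMemcpy calls with events, which is the same basic idea; the buffer size and iteration count are arbitrary choices.

// Minimal device-to-device bandwidth check (sketch).
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t nbytes = 256ULL * 1024 * 1024;   // 256 MiB per buffer
    const int iters = 50;
    void *d_src, *d_dst;
    cudaEvent_t t0, t1;

    cudaMalloc(&d_src, nbytes);
    cudaMalloc(&d_dst, nbytes);
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    // Warm-up copy so the timed loop is not skewed by first-use overhead
    cudaMemcpy(d_dst, d_src, nbytes, cudaMemcpyDeviceToDevice);

    cudaEventRecord(t0);
    for (int i = 0; i < iters; i++)
        cudaMemcpy(d_dst, d_src, nbytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    // Each copy reads and writes nbytes, hence the factor of 2
    printf("D2D bandwidth: %.1f GB/s\n", 2.0 * nbytes * iters / (ms * 1e-3) / 1e9);

    cudaFree(d_src);
    cudaFree(d_dst);
    return 0;
}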

I haven’t looked at your test case carefully, but I certainly agree that the perf degradation on Titan X going from CUDA 7.5 to CUDA 8 appears to be something that should be logged as a bug.