what's causing huge variations in run time?

I’m trying to overlap cuFFT transforms with cudaMemcpyAsync(), but I’m getting large fluctuations in throughput.
I’ve measured the raw time for 100 4k x 4k FFTs at a consistent 400 ms. But once I add a cudaMemcpyAsync(), the time ranges anywhere from 453 ms to 578 ms. I’m expecting some increase because two bandwidth-hungry operations are competing, but I don’t think the variability should be this high (the Visual Profiler shows the cudaMemcpyAsync() bandwidth varying from 11.6 GiB/s on most calls down to 1.5 GiB/s on a bad call). I’m using a GTX 690 in a 1U server, running SUSE 11.

Can someone suggest what could be causing the variations and what can be done about them? Also, could someone run my code, preferably on a Tesla Kepler (i.e. a K10 or K20), and see if the variability is as large there?

Thanks for helping

#include <iostream>
#include <cstdio>
#include <cufft.h>
#include <cuda_runtime.h>
#include <vector>
#ifdef _WIN32
#include <windows.h>   // QueryPerformanceCounter / QueryPerformanceFrequency
#else
#include <ctime>       // clock_gettime
#endif
using namespace std;

// Works for both cudaError_t and cufftResult, since cudaSuccess == CUFFT_SUCCESS == 0
#define cudaCheckErrors(expr) { int e = (expr); if (e != cudaSuccess) { printf("error %d\n", e); throw 0; } }

double Clock()
{
#ifdef _WIN32
  LARGE_INTEGER t;
  static double tick_interval = 0;
  if (tick_interval == 0)
  {
    QueryPerformanceFrequency(&t);
      tick_interval = 1.0 / t.QuadPart;
  }
  QueryPerformanceCounter(&t);
  return t.QuadPart * tick_interval;
#else
  timespec t;
  clock_gettime(CLOCK_REALTIME, &t);
  return t.tv_sec + t.tv_nsec * 0.000000001; 
#endif
}

int main()
{
  const int N = 4096;
  cudaStream_t streams[3];
  for (int i = 0; i < 3; ++i)
    cudaCheckErrors(cudaStreamCreate(&streams[i]));

  cufftHandle plan;
  float *data, *outData[3];
  float *copy[3];

  cudaCheckErrors(cudaMalloc(&data, N * N * sizeof(float)));
  for (int i = 0; i < 3; ++i)
  {
    cudaCheckErrors(cudaMalloc(&outData[i], N * (N + 2) * sizeof(float)));
    cudaCheckErrors(cudaHostAlloc(&copy[i], N * N * sizeof(float), cudaHostAllocMapped));
  }
  cudaCheckErrors(cufftPlan2d(&plan, N, N, CUFFT_R2C));

  int queueIndex = 0;
  for (int warmup = 1; warmup >= 0; --warmup)
  {
    cudaCheckErrors(cudaDeviceSynchronize());  // cudaThreadSynchronize() is deprecated
    double t0 = Clock();
    for (int i = -2; i < 100; ++i)
    {
       // 3 stage pipeline: iteration i issues i + 2 FFT, does nothing for i + 1 FFT, and waits on ith FFT
       int nextIndex = (queueIndex + 2) % 3;
       if (i + 2 < 100)
       {
         //cout << "issue " << streams[nextIndex] << endl;
         cudaCheckErrors(cufftSetStream(plan, streams[nextIndex]));
         cudaCheckErrors(cufftExecR2C(plan, data, (cufftComplex *)outData[nextIndex]));
         cudaCheckErrors(cudaMemcpyAsync(copy[nextIndex], outData[nextIndex], N * N * 2, cudaMemcpyDeviceToHost, streams[nextIndex]));
       }
       if (i >= 0)
       {
         //cout << "waiting on " << streams[queueIndex] << endl;
         //cudaCheckErrors(cudaStreamSynchronize(streams[queueIndex]));
       }
       queueIndex = (queueIndex + 1) % 3;
    }
    cudaCheckErrors(cudaDeviceSynchronize());
    if (!warmup)
      cout << Clock() - t0 << endl;
  }
  cudaDeviceReset();
}

I’m seeing a consistent “1.020 s” on a K20c on Win7/x64 with Tesla driver 320.00.

The card is in a PCIe 2.0 x8 slot.

Compiled with:

nvcc -m 32 -arch sm_35 -Xptxas=-v uj.cu cufft.lib -o uj

Thanks allanmac, you’re my hero.

If it’s not too much trouble, could you try decreasing the cudaMemcpy to N * N / 3 bytes and measure again? Your K20 is only PCIe 2.0 x8 (versus my GTX 690’s PCIe 3.0 x16) and has almost 1.5x as many cores, so the copy size should be reduced so that the cudaMemcpy doesn’t become the bottleneck.

Also, does anyone know if there are any GPU cloud services that have K10 or K20s for short term use? I briefly looked at http://www.nvidia.com/object/gpu-cloud-computing-services.html, but it seems they only have Fermis.

Your original code runs pretty stably on a K20X, with a runtime of 0.548 s.
This is on Linux with gen2 x16.
You may be seeing the effect of boost clocks on GeForce Kepler cards.

With the bytes reduced to N*N/3 the results are:

0.327858
0.328336
0.327574
0.328461
0.328037
0.32853
0.328009
0.327684
0.32775
0.327737

(ran it 10 times)

NVidia has free online trials of K20 clusters… they may work for quick tests too.

Great, I now conclude that the large fluctuations are a hardware issue. I will sign up for the K20 trial and experiment further. Thanks everyone for the data and suggestions.

FWIW,

On a GTX 680 (also in a PCIe 2.0 x8 slot) with one monitor connected the timings are:

N*N/3: ~0.39
N*N*2: ~1.01

It’s very consistent from run to run.

The K20c is using the TCC driver and the 680 is using WDDM. Same machine. Both are using Friday’s 320.00 WHQL.