what's causing huge variations in run time?

I’m trying to overlap cuFFT transforms with cudaMemcpyAsync(), but I’m getting large fluctuations in throughput.
I’ve measured the raw time for 100 4k x 4k FFTs at a consistent 400 ms. But once I add a cudaMemcpyAsync(), the time ranges anywhere from 453 ms to 578 ms. I’m expecting some increase because two bandwidth-hungry operations are competing, but I don’t think the variability should be this high (the Visual Profiler shows the cudaMemcpyAsync() bandwidth varying from 11.6 GiB/s on most calls down to 1.5 GiB/s on a bad call). I’m using a GTX 690 in a 1U server, running SUSE 11.

Can someone suggest what could be causing the variations and what can be done about them? Also, could someone run my code, preferably on a Tesla Kepler (i.e. a K10 or K20), and see if the variability is as large there?

Thanks for helping

#include <iostream>
#include <cstdio>
#include <cufft.h>
#include <cuda_runtime.h>
#include <vector>
#ifdef _WIN32
#include <windows.h>   // QueryPerformanceCounter / QueryPerformanceFrequency
#else
#include <ctime>       // clock_gettime
#endif
using namespace std;

// Works for both cudaError_t and cufftResult, since cudaSuccess == CUFFT_SUCCESS == 0
#define cudaCheckErrors(expr) { int e = (expr); if (e != cudaSuccess) { printf("error %d\n", e); throw 0; } }

double Clock()
{
#ifdef _WIN32
  LARGE_INTEGER t;
  static double tick_interval = 0;
  if (tick_interval == 0)
  {
    QueryPerformanceFrequency(&t);
      tick_interval = 1.0 / t.QuadPart;
  }
  QueryPerformanceCounter(&t);
  return t.QuadPart * tick_interval;
#else
  timespec t;
  clock_gettime(CLOCK_REALTIME, &t);
  return t.tv_sec + t.tv_nsec * 0.000000001; 
#endif
}

int main()
{
  const int N = 4096;
  cudaStream_t streams[3];
  for (int i = 0; i < 3; ++i)
    cudaCheckErrors(cudaStreamCreate(&streams[i]));

  cufftHandle plan;
  float *data, *outData[3];
  float *copy[3];

  cudaCheckErrors(cudaMalloc(&data, N * N * sizeof(float)));
  for (int i = 0; i < 3; ++i)
  {
    cudaCheckErrors(cudaMalloc(&outData[i], N * (N + 2) * sizeof(float)));
    cudaCheckErrors(cudaHostAlloc(&copy[i], N * N * sizeof(float), cudaHostAllocMapped));
  }
  cudaCheckErrors(cufftPlan2d(&plan, N, N, CUFFT_R2C));

  int queueIndex = 0;
  for (int warmup = 1; warmup >= 0; --warmup)
  {
    cudaCheckErrors(cudaDeviceSynchronize());  // cudaThreadSynchronize() is deprecated
    double t0 = Clock();
    for (int i = -2; i < 100; ++i)
    {
       // 3 stage pipeline: iteration i issues i + 2 FFT, does nothing for i + 1 FFT, and waits on ith FFT
       int nextIndex = (queueIndex + 2) % 3;
       if (i + 2 < 100)
       {
         //cout << "issue " << streams[nextIndex] << endl;
         cudaCheckErrors(cufftSetStream(plan, streams[nextIndex]));
         cudaCheckErrors(cufftExecR2C(plan, data, (cufftComplex *)outData[nextIndex]));
         cudaCheckErrors(cudaMemcpyAsync(copy[nextIndex], outData[nextIndex], N * N * 2, cudaMemcpyDeviceToHost, streams[nextIndex]));
       }
       if (i >= 0)
       {
         //cout << "waiting on " << streams[queueIndex] << endl;
         //cudaCheckErrors(cudaStreamSynchronize(streams[queueIndex]));
       }
       queueIndex = (queueIndex + 1) % 3;
    }
    cudaCheckErrors(cudaDeviceSynchronize());
    if (!warmup)
      cout << Clock() - t0 << endl;
  }
  cudaDeviceReset();
}

I’m seeing a consistent “1.020 s” on a K20c on Win7/x64 with Tesla driver 320.00.

The card is in a PCIe 2.0 x8 slot.

Compiled with:

nvcc -m 32 -arch sm_35 -Xptxas=-v uj.cu cufft.lib -o uj

Thanks allanmac, you’re my hero.

If it’s not too much trouble, could you try decreasing the cudaMemcpy to N * N / 3 bytes and measure again? Your K20 is only PCIe 2.0 x8 (versus my GTX 690’s PCIe 3.0 x16) and has almost 1.5x as many cores, so the copy size should be reduced so that the cudaMemcpy doesn’t become the bottleneck.

Also, does anyone know if there are any GPU cloud services that have K10 or K20s for short term use? I briefly looked at http://www.nvidia.com/object/gpu-cloud-computing-services.html, but it seems they only have Fermis.

Your original code runs pretty stably on a K20X, with a runtime of 0.548 s.
This is on Linux with gen2 x16.
You may be seeing the effect of boost clocks on GeForce Kepler cards.

With the bytes reduced to N*N/3 the results are:

0.327858
0.328336
0.327574
0.328461
0.328037
0.32853
0.328009
0.327684
0.32775
0.327737

(ran it 10 times)

NVidia has free online trials of K20 clusters… they may work for quick tests too.

Great, I now conclude that the large fluctuations are a hardware issue. I will sign up for the K20 trial and experiment further. Thanks everyone for the data and suggestions.

FWIW,

On a GTX 680 (also in a PCIe 2.0 x8 slot) with one monitor connected the timings are:

N*N/3: ~0.39
N*N*2: ~1.01

It’s very consistent from run to run.

The K20c is using the TCC driver and the 680 is using WDDM. Same machine. Both are using Friday’s 320.00 WHQL.