Run-time variability on Kepler K20

Hello,

I have been doing some performance testing with a K20c and have noticed surprisingly high run-time variability across repeated executions of the same kernel. Take as an example the following code:

#include <cuda_runtime.h>
#include <cassert>
#include <cstdio>
#include <time.h>

#define NITER 24

__global__ void spin(float *x, float y)
{
    float arg = y;
    for (int k = 0; k < 10000; ++k) { arg = sinf(arg); }
    *x = arg;
}

double ElapsedTimeMs(struct timespec initTime)
{
    struct timespec t;
    assert(clock_gettime(CLOCK_REALTIME, &t) == 0);
    double diff_ms = (1000.0 * t.tv_sec + 1.0e-6 * t.tv_nsec) -
        (1000.0 * initTime.tv_sec + 1.0e-6 * initTime.tv_nsec);
    return diff_ms;
}

int main(int argc, char **argv)
{
    struct timespec initTime;
    assert(clock_gettime(CLOCK_REALTIME, &initTime) == 0);

    float *dev_x;
    assert(cudaMalloc((void **) &dev_x, sizeof(float)) == cudaSuccess);

    for (int k = 0; k < NITER; ++k)
    {
        double start = ElapsedTimeMs(initTime);
        spin<<<1,1>>>(dev_x,3.14159/3.0);
        assert(cudaDeviceSynchronize() == cudaSuccess);
        double elapsed = ElapsedTimeMs(initTime) - start;
        printf("Iter %02d: %.2f ms
", k, elapsed);
    }
}

So a single thread spins on a fixed-length computation NITER times. On a C2050, I get the following run-times:

[C2050]$ nvcc -arch=sm_20 variability.cu -lrt
[C2050]$ ./a.out
Iter 00: 3.58 ms
Iter 01: 3.55 ms
Iter 02: 3.55 ms
Iter 03: 3.55 ms
Iter 04: 3.55 ms
Iter 05: 3.55 ms
Iter 06: 3.55 ms
Iter 07: 3.55 ms
Iter 08: 3.55 ms
Iter 09: 3.55 ms
Iter 10: 3.55 ms
Iter 11: 3.55 ms
Iter 12: 3.55 ms
Iter 13: 3.55 ms
Iter 14: 3.55 ms
Iter 15: 3.55 ms
Iter 16: 3.55 ms
Iter 17: 3.55 ms
Iter 18: 3.55 ms
Iter 19: 3.55 ms
Iter 20: 3.55 ms
Iter 21: 3.55 ms
Iter 22: 3.55 ms
Iter 23: 3.55 ms

The results are highly repeatable. I get the following on a GTX 680:

[GTX680]$ nvcc -arch=sm_30 variability.cu -lrt
[GTX680]$ ./a.out
Iter 00: 2.86 ms
Iter 01: 2.78 ms
Iter 02: 2.76 ms
Iter 03: 2.76 ms
Iter 04: 2.76 ms
Iter 05: 2.76 ms
Iter 06: 2.76 ms
Iter 07: 2.76 ms
Iter 08: 2.76 ms
Iter 09: 2.76 ms
Iter 10: 2.76 ms
Iter 11: 2.76 ms
Iter 12: 2.76 ms
Iter 13: 2.76 ms
Iter 14: 2.76 ms
Iter 15: 2.76 ms
Iter 16: 2.76 ms
Iter 17: 2.76 ms
Iter 18: 2.76 ms
Iter 19: 2.76 ms
Iter 20: 2.76 ms
Iter 21: 2.76 ms
Iter 22: 2.76 ms
Iter 23: 2.76 ms

However, on the K20c, I see the following:

[K20c]$ nvcc -arch=sm_35 variability.cu -lrt
[K20c]$ ./a.out
Iter 00: 7.33 ms
Iter 01: 6.95 ms
Iter 02: 6.70 ms
Iter 03: 6.95 ms
Iter 04: 6.95 ms
Iter 05: 6.95 ms
Iter 06: 6.96 ms
Iter 07: 9.42 ms
Iter 08: 6.70 ms
Iter 09: 6.95 ms
Iter 10: 6.95 ms
Iter 11: 6.96 ms
Iter 12: 6.95 ms
Iter 13: 6.62 ms
Iter 14: 6.71 ms
Iter 15: 9.42 ms
Iter 16: 6.95 ms
Iter 17: 6.97 ms
Iter 18: 4.77 ms
Iter 19: 3.20 ms
Iter 20: 3.10 ms
Iter 21: 3.09 ms
Iter 22: 3.21 ms
Iter 23: 3.22 ms

For one thing, the card seems to take a long time to reach its highest performance state. And even once it is presumably in a high performance state, the run-times still vary quite a bit compared with the other hardware. No other applications are running on the K20c, so this is not an interference issue.
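
In case the host timer is contributing to the spread, the loop could also be timed on the device with CUDA events instead of clock_gettime. Below is a minimal sketch of just the timing loop, reusing the spin kernel, dev_x, and NITER from the listing above (the cudaEvent* calls are standard runtime API):

cudaEvent_t start, stop;
assert(cudaEventCreate(&start) == cudaSuccess);
assert(cudaEventCreate(&stop) == cudaSuccess);

for (int k = 0; k < NITER; ++k)
{
    // Record events around the launch and read back the GPU-side elapsed time.
    assert(cudaEventRecord(start, 0) == cudaSuccess);
    spin<<<1,1>>>(dev_x, 3.14159f/3.0f);
    assert(cudaEventRecord(stop, 0) == cudaSuccess);
    assert(cudaEventSynchronize(stop) == cudaSuccess);

    float ms = 0.0f;
    assert(cudaEventElapsedTime(&ms, start, stop) == cudaSuccess);
    printf("Iter %02d: %.2f ms\n", k, ms);
}

assert(cudaEventDestroy(start) == cudaSuccess);
assert(cudaEventDestroy(stop) == cudaSuccess);

Event timing measures only the kernel itself on the GPU, so if the spread shows up there as well, it points at the device (e.g. clocks) rather than at the host timer or launch overhead.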

I assume this is not expected behavior, although perhaps the clock rates are being adjusted extremely aggressively. Any ideas on what could be causing this?
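
One way to check the clock theory would be to poll the SM clock while the loop runs, either with nvidia-smi -q -d CLOCK or with a small NVML program along the lines of the sketch below (device index 0 and the file name clockwatch.cu are just assumptions; build with nvcc clockwatch.cu -lnvidia-ml):

// clockwatch.cu -- print the current SM clock once; run it repeatedly
// alongside a.out to see whether the card is changing performance states.
#include <cstdio>
#include <nvml.h>

int main()
{
    if (nvmlInit() != NVML_SUCCESS) { fprintf(stderr, "nvmlInit failed\n"); return 1; }

    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex(0, &dev) != NVML_SUCCESS) { nvmlShutdown(); return 1; }

    unsigned int smMHz = 0;
    if (nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &smMHz) == NVML_SUCCESS)
        printf("SM clock: %u MHz\n", smMHz);

    nvmlShutdown();
    return 0;
}

Running that in a shell loop (e.g. watch -n 0.1 ./clockwatch) while a.out executes should show whether the SM clock is ramping during the slow first ~18 iterations.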

Regards,
Thomas

We have seen the bug you filed; let's discuss it in that bug report.

sjiagc, could you provide a bug tracking number or a link with additional information?