Run-time variability on Kepler K20

Hello,

I have been doing some performance testing with a K20c and have noticed surprisingly high run-time variability across repeated executions of the same kernel. Take as an example the following code:

#include <cuda_runtime.h>
#include <cassert>
#include <cstdio>
#include <time.h>

#define NITER 24

__global__ void spin(float *x, float y)
{
    float arg = y;
    for (int k = 0; k < 10000; ++k) { arg = sinf(arg); }
    *x = arg;
}

double ElapsedTimeMs(struct timespec initTime)
{
    struct timespec t;
    assert(clock_gettime(CLOCK_REALTIME, &t) == 0);
    double diff_ms = (1000.0 * t.tv_sec + 1.0e-6 * t.tv_nsec) -
        (1000.0 * initTime.tv_sec + 1.0e-6 * initTime.tv_nsec);
    return diff_ms;
}

int main(int argc, char **argv)
{
    struct timespec initTime;
    assert(clock_gettime(CLOCK_REALTIME, &initTime) == 0);

    float *dev_x;
    assert(cudaMalloc((void **) &dev_x, sizeof(float)) == cudaSuccess);

    for (int k = 0; k < NITER; ++k)
    {
        double start = ElapsedTimeMs(initTime);
        spin<<<1,1>>>(dev_x,3.14159/3.0);
        assert(cudaDeviceSynchronize() == cudaSuccess);
        double elapsed = ElapsedTimeMs(initTime) - start;
        printf("Iter %02d: %.2f ms
", k, elapsed);
    }
}

So a single thread spins on a fixed-length computation NITER times. On a C2050, I get the following run-times:

[C2050]$ nvcc -arch=sm_20 variability.cu -lrt
[C2050]$ ./a.out
Iter 00: 3.58 ms
Iter 01: 3.55 ms
Iter 02: 3.55 ms
Iter 03: 3.55 ms
Iter 04: 3.55 ms
Iter 05: 3.55 ms
Iter 06: 3.55 ms
Iter 07: 3.55 ms
Iter 08: 3.55 ms
Iter 09: 3.55 ms
Iter 10: 3.55 ms
Iter 11: 3.55 ms
Iter 12: 3.55 ms
Iter 13: 3.55 ms
Iter 14: 3.55 ms
Iter 15: 3.55 ms
Iter 16: 3.55 ms
Iter 17: 3.55 ms
Iter 18: 3.55 ms
Iter 19: 3.55 ms
Iter 20: 3.55 ms
Iter 21: 3.55 ms
Iter 22: 3.55 ms
Iter 23: 3.55 ms

The results are highly repeatable. I get the following on a GTX 680:

[GTX680]$ nvcc -arch=sm_30 variability.cu -lrt
[GTX680]$ ./a.out
Iter 00: 2.86 ms
Iter 01: 2.78 ms
Iter 02: 2.76 ms
Iter 03: 2.76 ms
Iter 04: 2.76 ms
Iter 05: 2.76 ms
Iter 06: 2.76 ms
Iter 07: 2.76 ms
Iter 08: 2.76 ms
Iter 09: 2.76 ms
Iter 10: 2.76 ms
Iter 11: 2.76 ms
Iter 12: 2.76 ms
Iter 13: 2.76 ms
Iter 14: 2.76 ms
Iter 15: 2.76 ms
Iter 16: 2.76 ms
Iter 17: 2.76 ms
Iter 18: 2.76 ms
Iter 19: 2.76 ms
Iter 20: 2.76 ms
Iter 21: 2.76 ms
Iter 22: 2.76 ms
Iter 23: 2.76 ms

However, on the K20c, I see the following:

[K20c]$ nvcc -arch=sm_35 variability.cu -lrt
[K20c]$ ./a.out
Iter 00: 7.33 ms
Iter 01: 6.95 ms
Iter 02: 6.70 ms
Iter 03: 6.95 ms
Iter 04: 6.95 ms
Iter 05: 6.95 ms
Iter 06: 6.96 ms
Iter 07: 9.42 ms
Iter 08: 6.70 ms
Iter 09: 6.95 ms
Iter 10: 6.95 ms
Iter 11: 6.96 ms
Iter 12: 6.95 ms
Iter 13: 6.62 ms
Iter 14: 6.71 ms
Iter 15: 9.42 ms
Iter 16: 6.95 ms
Iter 17: 6.97 ms
Iter 18: 4.77 ms
Iter 19: 3.20 ms
Iter 20: 3.10 ms
Iter 21: 3.09 ms
Iter 22: 3.21 ms
Iter 23: 3.22 ms

For one thing, the card seems to take a long time to reach its highest performance state. And even once it is presumably in a high performance state, the run-times still vary quite a bit compared with the other hardware. No other applications are running on the K20c, so this is not an interference issue.
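
In case the host timer is contributing to the spread, the loop could also be timed on the device with CUDA events instead of clock_gettime. Below is a minimal sketch of just the timing loop, reusing the spin kernel, dev_x, and NITER from the listing above (the cudaEvent* calls are standard runtime API):

cudaEvent_t start, stop;
assert(cudaEventCreate(&start) == cudaSuccess);
assert(cudaEventCreate(&stop) == cudaSuccess);

for (int k = 0; k < NITER; ++k)
{
    // Record events around the launch and read back the GPU-side elapsed time.
    assert(cudaEventRecord(start, 0) == cudaSuccess);
    spin<<<1,1>>>(dev_x, 3.14159f/3.0f);
    assert(cudaEventRecord(stop, 0) == cudaSuccess);
    assert(cudaEventSynchronize(stop) == cudaSuccess);

    float ms = 0.0f;
    assert(cudaEventElapsedTime(&ms, start, stop) == cudaSuccess);
    printf("Iter %02d: %.2f ms\n", k, ms);
}

assert(cudaEventDestroy(start) == cudaSuccess);
assert(cudaEventDestroy(stop) == cudaSuccess);

Event timing measures only the kernel itself on the GPU, so if the spread shows up there as well, it points at the device (e.g. clocks) rather than at the host timer or launch overhead.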

I assume this is not expected behavior, although perhaps the clock rates are being adjusted extremely aggressively. Any ideas on what could be causing this?
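
One way to check the clock theory would be to poll the SM clock while the loop runs, either with nvidia-smi -q -d CLOCK or with a small NVML program along the lines of the sketch below (device index 0 and the file name clockwatch.cu are just assumptions; build with nvcc clockwatch.cu -lnvidia-ml):

// clockwatch.cu -- print the current SM clock once; run it repeatedly
// alongside a.out to see whether the card is changing performance states.
#include <cstdio>
#include <nvml.h>

int main()
{
    if (nvmlInit() != NVML_SUCCESS) { fprintf(stderr, "nvmlInit failed\n"); return 1; }

    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex(0, &dev) != NVML_SUCCESS) { nvmlShutdown(); return 1; }

    unsigned int smMHz = 0;
    if (nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &smMHz) == NVML_SUCCESS)
        printf("SM clock: %u MHz\n", smMHz);

    nvmlShutdown();
    return 0;
}

Running that in a shell loop (e.g. watch -n 0.1 ./clockwatch) while a.out executes should show whether the SM clock is ramping during the slow first ~18 iterations.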

Regards,
Thomas

We have seen the bug you filed; let's discuss it in that bug report.

sjiagc, could you provide a bug tracking number or a link with additional information?