Hello,
I have been doing some performance testing with a K20c and have noticed surprisingly high run-time variability across repeated launches of the same kernel. Take the following code as an example:
#include <cuda.h>
#include <cassert>
#include <cstdio>
#include <time.h>

#define NITER 24

__global__ void spin(float *x, float y)
{
    float arg = y;
    for (int k = 0; k < 10000; ++k) { arg = sinf(arg); }
    *x = arg;
}

double ElapsedTimeMs(struct timespec initTime)
{
    struct timespec t;
    assert(clock_gettime(CLOCK_REALTIME, &t) == 0);
    double diff_ms = (1000.0 * t.tv_sec + 1.0e-6 * t.tv_nsec) -
                     (1000.0 * initTime.tv_sec + 1.0e-6 * initTime.tv_nsec);
    return diff_ms;
}

int main(int argc, char **argv)
{
    struct timespec initTime;
    assert(clock_gettime(CLOCK_REALTIME, &initTime) == 0);

    float *dev_x;
    assert(cudaMalloc((void **) &dev_x, sizeof(float)) == cudaSuccess);

    for (int k = 0; k < NITER; ++k)
    {
        double start = ElapsedTimeMs(initTime);
        spin<<<1,1>>>(dev_x, 3.14159f / 3.0f);
        assert(cudaDeviceSynchronize() == cudaSuccess);
        double elapsed = ElapsedTimeMs(initTime) - start;
        printf("Iter %02d: %.2f ms\n", k, elapsed);
    }

    assert(cudaFree(dev_x) == cudaSuccess);
    return 0;
}
So a single thread spins on a fixed-length computation NITER times. On a C2050, I get the following run-times:
[C2050]$ nvcc -arch=sm_20 variability.cu -lrt
[C2050]$ ./a.out
Iter 00: 3.58 ms
Iter 01: 3.55 ms
Iter 02: 3.55 ms
Iter 03: 3.55 ms
Iter 04: 3.55 ms
Iter 05: 3.55 ms
Iter 06: 3.55 ms
Iter 07: 3.55 ms
Iter 08: 3.55 ms
Iter 09: 3.55 ms
Iter 10: 3.55 ms
Iter 11: 3.55 ms
Iter 12: 3.55 ms
Iter 13: 3.55 ms
Iter 14: 3.55 ms
Iter 15: 3.55 ms
Iter 16: 3.55 ms
Iter 17: 3.55 ms
Iter 18: 3.55 ms
Iter 19: 3.55 ms
Iter 20: 3.55 ms
Iter 21: 3.55 ms
Iter 22: 3.55 ms
Iter 23: 3.55 ms
The results are highly repeatable. I get the following on a GTX 680:
[GTX680]$ nvcc -arch=sm_30 variability.cu -lrt
[GTX680]$ ./a.out
Iter 00: 2.86 ms
Iter 01: 2.78 ms
Iter 02: 2.76 ms
Iter 03: 2.76 ms
Iter 04: 2.76 ms
Iter 05: 2.76 ms
Iter 06: 2.76 ms
Iter 07: 2.76 ms
Iter 08: 2.76 ms
Iter 09: 2.76 ms
Iter 10: 2.76 ms
Iter 11: 2.76 ms
Iter 12: 2.76 ms
Iter 13: 2.76 ms
Iter 14: 2.76 ms
Iter 15: 2.76 ms
Iter 16: 2.76 ms
Iter 17: 2.76 ms
Iter 18: 2.76 ms
Iter 19: 2.76 ms
Iter 20: 2.76 ms
Iter 21: 2.76 ms
Iter 22: 2.76 ms
Iter 23: 2.76 ms
However, on the K20c, I see the following:
[K20c]$ nvcc -arch=sm_35 variability.cu -lrt
[K20c]$ ./a.out
Iter 00: 7.33 ms
Iter 01: 6.95 ms
Iter 02: 6.70 ms
Iter 03: 6.95 ms
Iter 04: 6.95 ms
Iter 05: 6.95 ms
Iter 06: 6.96 ms
Iter 07: 9.42 ms
Iter 08: 6.70 ms
Iter 09: 6.95 ms
Iter 10: 6.95 ms
Iter 11: 6.96 ms
Iter 12: 6.95 ms
Iter 13: 6.62 ms
Iter 14: 6.71 ms
Iter 15: 9.42 ms
Iter 16: 6.95 ms
Iter 17: 6.97 ms
Iter 18: 4.77 ms
Iter 19: 3.20 ms
Iter 20: 3.10 ms
Iter 21: 3.09 ms
Iter 22: 3.21 ms
Iter 23: 3.22 ms
For one thing, the card seems to take a long time (roughly 18 iterations here) to reach its highest performance state. Even once it is presumably in that state, the run-times still vary noticeably compared with the other hardware. No other applications are running on the K20c, so this is not an interference issue.
I assume this is not expected behavior, although perhaps the clock rates are being adjusted extremely aggressively. Any ideas on what could be causing this?
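If it is clock management, the driver's view of the clocks should show it. Here is a minimal sketch of how one might inspect (and, where supported, pin) the clocks with nvidia-smi while the test loop runs in another terminal; this assumes nvidia-smi is on the PATH, and the -ac values below are only an example pair, not something taken from this card:

```shell
# Show current SM/memory clocks, performance state, and throttle reasons.
nvidia-smi -q -d CLOCK,PERFORMANCE

# List the clock pairs this board actually supports.
nvidia-smi -q -d SUPPORTED_CLOCKS

# Optionally pin application clocks (requires root; arguments are
# <memory,graphics> in MHz -- use a pair from the list above, the
# values here are only a placeholder example).
# nvidia-smi -ac 2600,705
```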
Regards,
Thomas