CUDA kernel slows down after some iterations

I run the same CUDA code on two machines. Machine A has a GTX 670; Machine B has a GTX 780 Ti. The CUDA kernel does not allocate new memory or copy memory from the host, and GPU memory consumption is constant after the first iteration.

On paper, Machine B has the better GPU. However, the CUDA kernel slows down after some iterations on Machine B, while it does not slow down on Machine A. Why is that? (The total iteration count was set to 1, 10, 100, 1000, 2000, 3000, 4000, 5000, and 6000.)

Machine A running times:

input: iters=1    Time to generate: 0.000 s
input: iters=10   Time to generate: 0.001 s
input: iters=100  Time to generate: 0.833 s
input: iters=1000 Time to generate: 21.743 s
input: iters=2000 Time to generate: 45.393 s
input: iters=3000 Time to generate: 69.413 s
input: iters=4000 Time to generate: 92.660 s
input: iters=5000 Time to generate: 116.709 s
input: iters=6000 Time to generate: 140.562 s

Machine B running times:

input: iters=1    Time to generate: 0.000 s
input: iters=10   Time to generate: 0.003 s
input: iters=100  Time to generate: 0.194 s
input: iters=1000 Time to generate: 10.604 s
input: iters=2000 Time to generate: 28.489 s
input: iters=3000 Time to generate: 61.841 s
input: iters=4000 Time to generate: 127.515 s
input: iters=5000 Time to generate: 218.961 s
input: iters=6000 Time to generate: 323.068 s

For small iteration counts, Machine B takes less time; for large iteration counts, Machine A takes less time. Why is that? Machine A has CUDA toolkit 6.5 and driver 340.29; Machine B has CUDA toolkit 6.0 and driver 331.49. Could the NVIDIA software be the problem?
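
(The post does not show how "Time to generate" was measured. Below is a minimal sketch of one plausible host-side measurement; run_one_iteration() is a placeholder for the actual per-iteration work. The final cudaDeviceSynchronize() matters: kernel launches are asynchronous, so without it the timer would mostly measure launch overhead.)

    #include <cstdio>
    #include <chrono>
    #include <cuda_runtime.h>

    static void run_one_iteration() {
        // placeholder: launch the kernel(s) under test here
    }

    void time_iterations(int iters) {
        cudaDeviceSynchronize();                 // drain any prior work first
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < iters; ++i)
            run_one_iteration();
        cudaDeviceSynchronize();                 // wait for all queued kernels to finish
        std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        printf("input: iters=%d Time to generate: %.3f s\n", iters, elapsed.count());
    }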

“The CUDA kernel does not allocate new memory or copy memory from the host.”

What does the kernel do, then? Where does it get the data to do this, and what does it do with the result?

What are the grid/block dimensions of the kernel?

I just want to measure the running time of some number of iterations of the kernel. The output of each iteration feeds back as the input to the next, so there is no memory copy from the host inside the loop. I am using cublasSgemm() from cuBLAS, so I didn't specify the kernel's grid/block dimensions myself.
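
(A minimal sketch of the kind of loop described, not the poster's actual code: each GEMM multiplies the previous result by a fixed matrix, so the output feeds back as the next input and no host/device copies are needed inside the loop. W, bufA, bufB, and n are illustrative names.)

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // out = W * in on each step; cuBLAS launches its own kernels internally,
    // so the caller never specifies grid/block dimensions.
    void iterate(cublasHandle_t handle, int n, int iters,
                 const float* W, float* bufA, float* bufB) {
        const float alpha = 1.0f, beta = 0.0f;
        float* in  = bufA;   // current state (n x n, column-major)
        float* out = bufB;   // next state
        for (int i = 0; i < iters; ++i) {
            cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                        n, n, n, &alpha, W, n, in, n, &beta, out, n);
            float* tmp = in; in = out; out = tmp;   // feed output back as input
        }
        cudaDeviceSynchronize();   // drain the queued GEMMs
    }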

I fixed the problem by copying the GPU data to the host and doing a GPU device reset after every 1000 iterations, then moving the data back to the GPU and continuing with iteration 1001. Maybe the GPU gets “tired” after many iterations and needs a reset. My method involves extra copies, but it is still faster than running without the reset.
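
(A rough sketch of that workaround, assuming cudaDeviceReset() is what "GPU device reset" refers to. After the reset the CUDA context is gone, so the cuBLAS handle and all device buffers must be recreated. Names are illustrative and error checking is omitted.)

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    void iterate_with_resets(int n, int total_iters, float* h_state /* n*n floats */) {
        const size_t bytes = (size_t)n * (size_t)n * sizeof(float);
        for (int done = 0; done < total_iters; done += 1000) {
            cublasHandle_t handle;
            cublasCreate(&handle);     // rebuild cuBLAS state in the fresh context
            float* d_state;
            cudaMalloc(&d_state, bytes);
            cudaMemcpy(d_state, h_state, bytes, cudaMemcpyHostToDevice);

            int chunk = (total_iters - done < 1000) ? (total_iters - done) : 1000;
            for (int i = 0; i < chunk; ++i) {
                // ... run one iteration on d_state (e.g. the GEMM loop above) ...
            }

            cudaMemcpy(h_state, d_state, bytes, cudaMemcpyDeviceToHost);
            cublasDestroy(handle);
            cudaDeviceReset();         // destroys the context and all device allocations
        }
    }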

I would be extremely worried if this were really the case, i.e., if GPU performance degraded over time or with the number of kernel launches. Maybe it is a thermal issue and the card is clocking down after a large amount of work?
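
(One way to test the thermal hypothesis, as a sketch: sample the SM clock and GPU temperature via NVML while the long run is in progress, or simply watch nvidia-smi in a second terminal. If the clock drops as the temperature rises, the slowdown is throttling rather than the software stack.)

    #include <cstdio>
    #include <nvml.h>   // link with -lnvidia-ml

    // Call periodically (e.g. from a monitoring thread) during the long run.
    void sample_gpu_state(unsigned int gpu_index) {
        nvmlInit();
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(gpu_index, &dev);
        unsigned int sm_mhz = 0, temp_c = 0;
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm_mhz);
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp_c);
        printf("SM clock: %u MHz, temperature: %u C\n", sm_mhz, temp_c);
        nvmlShutdown();
    }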