I run the same CUDA code on two machines. Machine A has a GTX 670; machine B has a GTX 780 Ti. The CUDA kernel does not allocate new memory or copy memory from the host, and GPU memory consumption is constant after the first iteration.
In theory, machine B has the better GPU. However, the kernel slows down after some number of iterations on machine B, but not on machine A. Why is that? (The total number of iterations was set to 1, 10, 100, 1000, 2000, 3000, 4000, 5000, and 6000.)
Machine A running times:
    input: iters=1     Time to generate: 0.000 s
    input: iters=10    Time to generate: 0.001 s
    input: iters=100   Time to generate: 0.833 s
    input: iters=1000  Time to generate: 21.743 s
    input: iters=2000  Time to generate: 45.393 s
    input: iters=3000  Time to generate: 69.413 s
    input: iters=4000  Time to generate: 92.660 s
    input: iters=5000  Time to generate: 116.709 s
    input: iters=6000  Time to generate: 140.562 s
Machine B running times:
    input: iters=1     Time to generate: 0.000 s
    input: iters=10    Time to generate: 0.003 s
    input: iters=100   Time to generate: 0.194 s
    input: iters=1000  Time to generate: 10.604 s
    input: iters=2000  Time to generate: 28.489 s
    input: iters=3000  Time to generate: 61.841 s
    input: iters=4000  Time to generate: 127.515 s
    input: iters=5000  Time to generate: 218.961 s
    input: iters=6000  Time to generate: 323.068 s
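To make the trend easier to see, here is a small sketch that divides each reported total time by its iteration count. The numbers are copied from the runs above; the names `machine_a`, `machine_b`, and `per_iteration_ms` are only illustrative, and the result is an average over the whole run, not a measurement of any single iteration. The runs with 1 and 10 iterations are left out because their times are too close to the timer resolution.

```python
# Reported total times in seconds, keyed by iteration count
# (copied from the runs above; iters=1 and iters=10 omitted).
machine_a = {
    100: 0.833, 1000: 21.743, 2000: 45.393, 3000: 69.413,
    4000: 92.660, 5000: 116.709, 6000: 140.562,
}
machine_b = {
    100: 0.194, 1000: 10.604, 2000: 28.489, 3000: 61.841,
    4000: 127.515, 5000: 218.961, 6000: 323.068,
}


def per_iteration_ms(times):
    """Return the average time per iteration in milliseconds for each run."""
    return {iters: 1000.0 * seconds / iters for iters, seconds in times.items()}


a_ms = per_iteration_ms(machine_a)
b_ms = per_iteration_ms(machine_b)

for iters in sorted(machine_a):
    print(f"iters={iters:5d}  A: {a_ms[iters]:7.3f} ms/iter  B: {b_ms[iters]:7.3f} ms/iter")
```

With these numbers, machine A stays at roughly 22 to 23 ms per iteration from 1000 iterations onward, while machine B's average rises from about 10.6 ms at 1000 iterations to about 53.8 ms at 6000.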
For a small number of iterations, machine B takes less time; for a large number of iterations, machine A takes less time. Why is that? Machine A has CUDA toolkit 6.5 and driver 340.29; machine B has CUDA toolkit 6.0 and driver 331.49. Could the NVIDIA software be the problem?