CUDA kernel slows down after some iterations

I run the same CUDA code on two machines. Machine A has a GTX 670; Machine B has a GTX 780 Ti. The CUDA kernel does not allocate new memory or copy memory from the host, and GPU memory consumption is constant after the first iteration.

On paper, Machine B has the better GPU. However, the CUDA kernel slows down after some iterations on Machine B, while it does not slow down on Machine A. Why is that? (The total iteration count was set to 1, 10, 100, 1000, 2000, 3000, 4000, 5000, and 6000.)

Machine A running times:

input: iters=1    Time to generate: 0.000 s
input: iters=10   Time to generate: 0.001 s
input: iters=100  Time to generate: 0.833 s
input: iters=1000 Time to generate: 21.743 s
input: iters=2000 Time to generate: 45.393 s
input: iters=3000 Time to generate: 69.413 s
input: iters=4000 Time to generate: 92.660 s
input: iters=5000 Time to generate: 116.709 s
input: iters=6000 Time to generate: 140.562 s

Machine B running times:

input: iters=1    Time to generate: 0.000 s
input: iters=10   Time to generate: 0.003 s
input: iters=100  Time to generate: 0.194 s
input: iters=1000 Time to generate: 10.604 s
input: iters=2000 Time to generate: 28.489 s
input: iters=3000 Time to generate: 61.841 s
input: iters=4000 Time to generate: 127.515 s
input: iters=5000 Time to generate: 218.961 s
input: iters=6000 Time to generate: 323.068 s

For small iteration counts, Machine B takes less time; for large iteration counts, Machine A takes less time. Why is that? Machine A has CUDA toolkit 6.5 and driver 340.29; Machine B has CUDA toolkit 6.0 and driver 331.49. Could the NVIDIA software be the problem?
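
(The post does not show how "Time to generate" was measured. Below is a minimal sketch of one plausible host-side measurement; run_one_iteration() is a placeholder for the actual per-iteration work. The final cudaDeviceSynchronize() matters: kernel launches are asynchronous, so without it the timer would mostly measure launch overhead.)

    #include <cstdio>
    #include <chrono>
    #include <cuda_runtime.h>

    static void run_one_iteration() {
        // placeholder: launch the kernel(s) under test here
    }

    void time_iterations(int iters) {
        cudaDeviceSynchronize();                 // drain any prior work first
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < iters; ++i)
            run_one_iteration();
        cudaDeviceSynchronize();                 // wait for all queued kernels to finish
        std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        printf("input: iters=%d Time to generate: %.3f s\n", iters, elapsed.count());
    }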

“The CUDA kernel does not allocate new memory or copy memory from the host.”

What does the kernel do, then? Where does it get the data to do this, and what does it do with the result?

What are the grid/block dimensions of the kernel?

I just want to measure the running time of some number of iterations of the kernel. The output of each iteration feeds back as the input to the next, so there is no memory copy from the host inside the loop. I am using cublasSgemm() from cuBLAS, so I didn't specify the kernel's grid/block dimensions myself.
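
(A minimal sketch of the kind of loop described, not the poster's actual code: each GEMM multiplies the previous result by a fixed matrix, so the output feeds back as the next input and no host/device copies are needed inside the loop. W, bufA, bufB, and n are illustrative names.)

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // out = W * in on each step; cuBLAS launches its own kernels internally,
    // so the caller never specifies grid/block dimensions.
    void iterate(cublasHandle_t handle, int n, int iters,
                 const float* W, float* bufA, float* bufB) {
        const float alpha = 1.0f, beta = 0.0f;
        float* in  = bufA;   // current state (n x n, column-major)
        float* out = bufB;   // next state
        for (int i = 0; i < iters; ++i) {
            cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                        n, n, n, &alpha, W, n, in, n, &beta, out, n);
            float* tmp = in; in = out; out = tmp;   // feed output back as input
        }
        cudaDeviceSynchronize();   // drain the queued GEMMs
    }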

I fixed the problem by copying the GPU data to the host and doing a GPU device reset after every 1000 iterations, then moving the data back to the GPU and continuing with iteration 1001. Maybe the GPU gets “tired” after many iterations and needs a reset. My method involves extra copies, but it is still faster than running without the reset.
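
(A rough sketch of that workaround, assuming cudaDeviceReset() is what "GPU device reset" refers to. After the reset the CUDA context is gone, so the cuBLAS handle and all device buffers must be recreated. Names are illustrative and error checking is omitted.)

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    void iterate_with_resets(int n, int total_iters, float* h_state /* n*n floats */) {
        const size_t bytes = (size_t)n * (size_t)n * sizeof(float);
        for (int done = 0; done < total_iters; done += 1000) {
            cublasHandle_t handle;
            cublasCreate(&handle);     // rebuild cuBLAS state in the fresh context
            float* d_state;
            cudaMalloc(&d_state, bytes);
            cudaMemcpy(d_state, h_state, bytes, cudaMemcpyHostToDevice);

            int chunk = (total_iters - done < 1000) ? (total_iters - done) : 1000;
            for (int i = 0; i < chunk; ++i) {
                // ... run one iteration on d_state (e.g. the GEMM loop above) ...
            }

            cudaMemcpy(h_state, d_state, bytes, cudaMemcpyDeviceToHost);
            cublasDestroy(handle);
            cudaDeviceReset();         // destroys the context and all device allocations
        }
    }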

I would be extremely worried if this were really the case, i.e., if GPU performance degraded over time or with the number of kernel launches. Maybe it is a thermal issue and the card is clocking down after a large amount of work?
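
(One way to test the thermal hypothesis, as a sketch: sample the SM clock and GPU temperature via NVML while the long run is in progress, or simply watch nvidia-smi in a second terminal. If the clock drops as the temperature rises, the slowdown is throttling rather than the software stack.)

    #include <cstdio>
    #include <nvml.h>   // link with -lnvidia-ml

    // Call periodically (e.g. from a monitoring thread) during the long run.
    void sample_gpu_state(unsigned int gpu_index) {
        nvmlInit();
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(gpu_index, &dev);
        unsigned int sm_mhz = 0, temp_c = 0;
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm_mhz);
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp_c);
        printf("SM clock: %u MHz, temperature: %u C\n", sm_mhz, temp_c);
        nvmlShutdown();
    }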