Unusual delays: does anyone recognize this pattern?

I am integrating CUDA capabilities into an existing Java application of mine, in which I perform matrix multiplication/reduction on large arrays of numbers. The Java code calls a C++ DLL I wrote, which performs many iterations of multiplication/reduction before returning.

I have the code working properly, but while evaluating the program’s performance I noticed that after a number of iterations, huge delays start showing up.

Here is some timing data I collected. The columns are: 1) iteration number, 2) last iteration time, 3) total time so far. Times are in nanoseconds (so ~70,000 is about 70 µs).

 1        72,355           72,355
 2        86,534          158,889
 3        69,911          228,800
 4        69,422          298,222
 5        69,422          367,644
 6       154,489          522,133
 7        70,889          593,022
 8        69,422          662,444
 9        69,912          732,356
10        69,911          802,267
11        70,400          872,667
12        98,266          970,933
13        70,889        1,041,822
14        75,778        1,117,600
15        69,911        1,187,511
16        69,911        1,257,422

After some period of time, though…

760       187,244       71,368,009
761        73,333       71,441,342
762        70,400       71,511,742
763        64,045       71,575,787
764        65,022       71,640,809
765        68,444       71,709,253
766    34,750,227      106,459,480
767        70,400      106,529,880
768        64,533      106,594,413
769        64,534      106,658,947
770        64,044      106,722,991
771        64,045      106,787,036
772   123,533,926      230,320,962
773        73,334      230,394,296
774        77,733      230,472,029
775        69,911      230,541,940
776        67,467      230,609,407
777        69,422      230,678,829
778        69,911      230,748,740
779   107,514,503      338,263,243
780        69,422      338,332,665
781        65,511      338,398,176
782        67,467      338,465,643
783        73,822      338,539,465
784        68,933      338,608,398
785   122,981,483      461,589,881
786        69,422      461,659,303
787        69,911      461,729,214
788        69,422      461,798,636
789        84,089      461,882,725
790        65,511      461,948,236
791   107,980,414      569,928,650

So after every 4-6 iterations, the next iteration takes ~100 ms instead of ~70 µs. Originally I thought it was a Java problem, but I saw no change when I moved more of the code into the C++ DLL, so that Java isn’t making a call for each iteration but only a single call telling the card to start performing a whole series of iterations (~1000). Given the distinctive signature, I am hoping someone will recognize the problem.
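For reference, here is roughly how each iteration is being timed. This is a simplified sketch of the C++ side; runIteration() below is just a placeholder for the actual sequence of CUBLAS calls:

```cpp
// Per-iteration timing loop (sketch). runIteration() is a placeholder for
// the real work: roughly 8 CUBLAS calls per step.
#include <cstdio>
#include <windows.h>
#include <cuda_runtime.h>

static void runIteration()
{
    // placeholder: the actual CUBLAS multiply/reduce sequence goes here
}

void timeIterations(int n)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    long long total = 0;
    for (int i = 1; i <= n; ++i) {
        QueryPerformanceCounter(&t0);
        runIteration();
        cudaThreadSynchronize();  // make sure the GPU is done before reading the clock
        QueryPerformanceCounter(&t1);

        long long ns = (t1.QuadPart - t0.QuadPart) * 1000000000LL / freq.QuadPart;
        total += ns;
        printf("%d\t%lld\t%lld\n", i, ns, total);
    }
}
```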

And a related question: how do I make sure that Windows is not trying to use the card (a GTX 275) for screen display? Could screen refresh requests be causing these delays? Since I have another video adapter, I want the 275 free for HPC purposes only.
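For what it’s worth, this is how I’m thinking of pinning the work to the 275 from inside the DLL. A minimal sketch, assuming the card can be identified by name; I realize cudaSetDevice() only directs the CUDA work there and doesn’t by itself stop Windows from also drawing the desktop on that card, which is what I’m asking about:

```cpp
// Pick the GTX 275 by name so the CUDA work never lands on the other adapter.
// Note: this only selects where CUDA runs; Windows display assignment is separate.
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

int selectComputeDevice()
{
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        if (strstr(prop.name, "GTX 275") != NULL) {
            cudaSetDevice(dev);  // must happen before any other CUDA calls in this thread
            printf("Using device %d: %s\n", dev, prop.name);
            return dev;
        }
    }
    return -1;  // not found
}
```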

Thanks,

Tim

What driver are you using, and do you feel like posting source?

CUDA 2.1

GeForce driver says 182.42

This is all on Windows Vista 64-bit.

I will clean up and comment my source code and then post it. It is essentially a series of calls to CUBLAS functions, with a few additional functions I wrote.

Thanks

Tim

This kind of odd delay isn’t exactly uncommon…

Every product I’ve developed to date, and all of the sample apps in the CUDA SDK, suffer from this same problem - eventually.

I’ve posted threads on this issue multiple times, and some of the issues have been fixed - but ultimately you still end up seeing inconsistent kernel timings: the older your context gets, and the higher the frequency of kernel launches, the worse it becomes.

Sadly this is my primary bottleneck right now - and has been (in conjunction with OpenGL interop) for a few months.

I have kernels going from ~150 µs to 6-10 ms+ at times, reliably every other iteration - after a certain period of time.

At first I thought I might’ve been pushing the card to its limits and the driver was having a hard time managing all the resources flying around, but it happens on multiple cards, multiple drivers, and on both Windows and Linux… sighs

Smokey, have you posted a repro? All of the clocking bugs that I know of have been fixed, so I don’t know what’s causing your issues.

Are there any workarounds to combat the problem? Unless my eyes (code) deceive me, it looks like the efficiency drops to about 0.1% within a few milliseconds of starting.

Thanks,

Tim

Hmm, once we kick this release out the door (in about a month’s time), I might be able to spend some time pinpointing the specific kernels we use that display this behavior when run on their own at high frequency.

Currently our only major app that uses CUDA extensively runs ~40-50 kernels per frame/iteration, so it’s no small task making a small repro case that exhibits the behavior we’re seeing.

I think my most recent post is: http://forums.nvidia.com/index.php?showtop…=0&p=530658 - but I understand it’s quite a lot harder, if not impossible, to fix a problem you can’t reliably reproduce in a test case.

I probably should’ve clarified in that post that in my real application the timings do start off very similar to the profiler timings, but they gradually creep a little above the profiler timings, then start spiking reliably. I’m still under the impression it’s related to the driver scheduler, resource management, or the fact that this card is also running as a display device, AND we do HW-accelerated OpenGL on the same card…

After doing some more investigating, it seems like (for me at least) the problem is improved if I add a random 1 ms Sleep() to about 1% of the iterations. Since I’m rarely sending or receiving large amounts of data from the GPU, only rapid-fire calling a series of CUBLAS commands, it would make sense if the card were being slowed down by a backlog of requests. While a fix is a fix is a fix, I think there must be a more efficient way to go about this. Any suggestions?
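Concretely, the workaround looks something like this; stepOnce() is a placeholder for the actual CUBLAS sequence:

```cpp
// Workaround sketch: sleep ~1 ms on roughly 1% of iterations to give the
// driver a chance to drain its backlog of queued launches.
#include <cstdlib>
#include <windows.h>

static void stepOnce()
{
    // placeholder: the real ~8 CUBLAS calls per step go here
}

void runSteps(int n)
{
    for (int i = 0; i < n; ++i) {
        stepOnce();
        if (rand() % 100 == 0)  // ~1% of iterations
            Sleep(1);           // back off for ~1 ms
    }
}
```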

Thanks

Are you calling cudaThreadSynchronize() before making timing measurements?

I run in a very different situation: typically headless servers running Linux without even a windowing system installed. My app runs ~10 kernels per iteration, and I have run single jobs for more than a week; the performance at the end is exactly the same as the performance at the beginning. Come to think of it, I have run week+ jobs on a couple of systems that are running X, but the screen is asleep and not doing anything. Maybe the active OpenGL is the difference?

Oh, and are you guys running the latest drivers? I seem to recall a similar issue a while back that turned out to be a driver problem that was subsequently fixed.

I am calling cudaThreadSynchronize() before the timing measurements, and __syncthreads() at several points within each iteration.

The long and short of what I am doing is simulated annealing for neural network training. The interface is in Java, which calls the C++ DLL wrapper I wrote to interface with CUDA. What I am seeing is an extreme non-linear effect with the amount of work I throw at the DLL: if each iteration is 100 steps (each step being ~8 CUBLAS calls), then it takes ~7 ms for the DLL to return, 1,000 steps take ~1 sec, and 10,000 steps is currently running at 818 seconds!
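To make the structure concrete, the entry point looks roughly like this (the JNI class and method names here are illustrative, not the real ones, and the per-step CUBLAS calls are elided):

```cpp
// Sketch of the DLL entry point that Java calls once per batch of steps.
// Looping inside the DLL avoids per-step Java/JNI overhead.
#include <jni.h>
#include <cuda_runtime.h>
#include <cublas.h>

extern "C" JNIEXPORT void JNICALL
Java_AnnealerNative_runSteps(JNIEnv* env, jobject obj, jint nSteps)
{
    for (int step = 0; step < nSteps; ++step) {
        // ~8 CUBLAS calls per step, e.g. a multiply followed by reductions:
        // cublasSgemm('n', 'n', m, n, k, 1.0f, dA, m, dB, k, 0.0f, dC, m);
        // ...
    }
    cudaThreadSynchronize();  // drain all queued work before returning to Java
}
```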

I definitely want to move everything over to a Linux box, but that of course would take time I could/should/need to be spending elsewhere on the code.

I do have the latest drivers (CUDA 2.1).

Thanks