Unusual delays: does anyone recognize this pattern?

I am integrating CUDA capabilities into an existing Java application of mine, in which I perform matrix multiplication/reduction on large arrays of numbers. The Java code calls a C++ DLL I wrote, which performs many iterations of multiplication/reduction before returning.

I have the code working properly, but while evaluating the program's performance I noticed that after a number of iterations, huge delays start showing up.

Here is some timing data I collected; the columns are (1) iteration number, (2) last iteration time in nanoseconds, (3) total time so far.

1	72,355	72,355
2	86,534	158,889
3	69,911	228,800
4	69,422	298,222
5	69,422	367,644
6	154,489	522,133
7	70,889	593,022
8	69,422	662,444
9	69,912	732,356
10	69,911	802,267
11	70,400	872,667
12	98,266	970,933
13	70,889	1,041,822
14	75,778	1,117,600
15	69,911	1,187,511
16	69,911	1,257,422

after some period of time though…

760	187,244	71,368,009
761	73,333	71,441,342
762	70,400	71,511,742
763	64,045	71,575,787
764	65,022	71,640,809
765	68,444	71,709,253
766	34,750,227	106,459,480
767	70,400	106,529,880
768	64,533	106,594,413
769	64,534	106,658,947
770	64,044	106,722,991
771	64,045	106,787,036
772	123,533,926	230,320,962
773	73,334	230,394,296
774	77,733	230,472,029
775	69,911	230,541,940
776	67,467	230,609,407
777	69,422	230,678,829
778	69,911	230,748,740
779	107,514,503	338,263,243
780	69,422	338,332,665
781	65,511	338,398,176
782	67,467	338,465,643
783	73,822	338,539,465
784	68,933	338,608,398
785	122,981,483	461,589,881
786	69,422	461,659,303
787	69,911	461,729,214
788	69,422	461,798,636
789	84,089	461,882,725
790	65,511	461,948,236
791	107,980,414	569,928,650

So I start to see that after every 5-7 iterations, the next iteration takes on the order of 100 ms (ranging from ~35 ms to ~125 ms in the data above). Originally I thought it was a Java problem, but I saw no change when I moved more of the code into the C++ DLL, so that Java isn't calling for each iteration individually, but only telling the card to start performing a series of iterations (~1000). Given the unique signature, I am hoping someone will recognize the problem.

And a related question: how do I make sure that Windows is not trying to use the card (a GTX 275) for screen display purposes? Could screen-refresh requests be causing these delays? Since I have another video adapter, I want the 275 free for HPC purposes only.



What driver are you using, and do you feel like posting source?

CUDA 2.1

GeForce driver says 182.42

This is all on Windows Vista 64-bit.

I will clean up and comment my source code to post. It is essentially a series of calls to CUBLAS functions with a few additional functions I wrote.



This kind of odd delay isn’t exactly uncommon…

Every product I’ve developed to date, and all of the sample apps in the CUDA SDK, suffer from this same problem - eventually.

I’ve posted various threads on this issue multiple times, and some of the issues have been fixed - but ultimately you still end up seeing inconsistent kernel timings: the older your context gets, and the higher the frequency of kernel launches, the worse it is.

Sadly this is my primary bottleneck right now - and has been (in conjunction with OpenGL interop) for a few months.

I have kernels going from ~150 µs to 6-10 ms+ at times, reliably every other iteration - after a certain period of time.

At first I thought I might’ve been pushing the card to its limits and the driver was having a hard time managing all the resources flying around, but it happens on multiple cards, drivers, and both Windows and Linux… sighs

Smokey, have you posted a repro? All of the clocking bugs that I know of have been fixed, so I don’t know what’s causing your issues.

Are there any workarounds to combat the problem? Unless my eyes (code) deceive me, it looks like the efficiency drops to about 0.1% within a few milliseconds of starting.



Hmm, once we kick this release out the door (in about a month’s time), I might be able to spend some time pinpointing specific kernels we use which display this behavior when run on their own at high frequency.

Currently our only major app that uses CUDA extensively runs ~40-50 kernels per frame/iteration, so it’s no small task making a small repro case that exhibits the behavior we’re seeing.

I think my most recent post is: http://forums.nvidia.com/index.php?showtop…=0&p=530658 - but I understand it’s quite a lot harder, if not impossible to fix a problem you can’t reliably produce in a test case.

I probably should’ve clarified in that post that in my real application the timings do start off very similar to the profiler timings, but gradually increase a little above them, then start spiking reliably. I’m still under the impression it’s related to the driver scheduler, resource management, or the fact that this card is running as a display device as well, AND we do HW-accelerated OpenGL on the same card…

After doing some more investigating, it seems like (for me at least) the problem is improved if I add a random 1 ms Sleep() to 1% of the iterations. Since I’m rarely sending or receiving large amounts of data from the GPU, but only rapid-fire calling a series of CUBLAS commands, it would make sense if the card was being slowed down by a backlog of requests. While a fix is a fix is a fix, I think there must be a more efficient way to go about this. Any suggestions?


Are you calling cudaThreadSynchronize() before making timing measurements?
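The reason this matters: kernel launches (and many CUBLAS calls) are asynchronous, so without a synchronize, a host-side timer mostly measures launch overhead, and the real GPU cost surfaces as an occasional spike when the command queue finally fills and blocks. A rough sketch of the CUDA 2.x-era event-timing pattern (the kernel launch is commented out and purely hypothetical; this is a sketch, not a drop-in implementation):

```cuda
// Host-side timing sketch (CUDA 2.x era). A kernel launch returns
// immediately, so the timer must not stop until the GPU has finished.
#include <cuda_runtime.h>
#include <cstdio>

void timedLaunch(float* d_data, int n) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    // myKernel<<<blocks, threads>>>(d_data, n);  // hypothetical kernel
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   // block until the GPU reaches 'stop'

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```

The blunter alternative is to call cudaThreadSynchronize() before stopping a CPU-side timer; either way, without the synchronize the measured times are not the kernel times.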

I run in a very different situation: typically headless servers running Linux without even a windowing system installed. My app runs ~10 kernels per iteration, and I have run single jobs for more than a week, and the performance at the end is exactly the same as the performance at the beginning. Come to think of it, I have run week+ jobs on a couple systems that are running X, but the screen is asleep and not doing anything. Maybe the active OpenGL is the difference?

Oh, and are you guys running the latest drivers? I seem to recall a similar issue a while back that turned out to be a driver problem that was subsequently fixed.

I am calling cudaThreadSynchronize() before timing, and __syncthreads() at several points throughout each iteration.

The long and short of what I am doing is simulated annealing for neural network training. The interface is in Java, which calls the C++ DLL wrapper I wrote to interface with CUDA. What I am seeing is an extreme non-linear effect with the amount of work I throw at the DLL. So if each iteration is 100 steps (each step being ~8 CUBLAS calls), then it takes ~7 ms for the DLL to return, 1,000 steps take ~1 sec, and 10,000 steps is currently running at 818 seconds!

I definitely want to move everything over to a Linux box, but that of course would take time I could/should/need to be spending elsewhere on the code.

I do have the latest drivers (CUDA 2.1).