Stress testing CUDA app - possible exit from library?

bodysurfinyon · March 6, 2015, 10:49pm

I am stress testing an application that uses CUDA. It is a single-threaded rendering server process. In this stress test, one client does a 3D render (~30 ms kernel time per frame) and a second process does a variety of renders, so there is a fair amount of state thrashing. Without the state thrashing, it runs for a long, long time. In the large majority of the failing cases, control does not come back from cudaMemcpy2D(). This claim is based on statements from a log file that is flushed after each write.

I’m running CUDA GeForce GT 740, Compute 3.0 - 4096 MB, Windows 7, Dell T5400. The device driver is 7.0 and the runtime is 5.50.

DLOG("before cudaMemcpy2D id %08x host width %d align %d image w,h %d %d\n", dst->id,
     rp->host_image_buffer_width, rp->pix_sz,
     rp->img_width, rp->img_buffer_height);

cudaError cudaErr = cudaMemcpy2D(dst->data,
                                 rp->host_image_buffer_width * rp->pix_sz,
                                 rp->img_buffer, rp->img_width * rp->pix_sz, rp->img_width * rp->pix_sz,
                                 rp->img_buffer_height, cudaMemcpyDeviceToHost);

DLOG("back from cudaMemcpy2D err %d\n", cudaErr);
if (cudaErr != cudaSuccess)
{
    …
}
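
For reference, cudaMemcpy2D() takes both pitches and the copy width in bytes, and the width must not exceed either pitch. The following is a minimal host-side sketch of a sanity check that could run before the call. The helper names copy2d_args_valid and copy2d_payload_bytes are hypothetical and only mirror the arguments used in the call above; rejecting an empty copy is a choice made for this sketch, not a CUDA rule.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical helper (not from the original post): checks the byte-based
// arguments that would be passed to
//   cudaMemcpy2D(dst, dpitch, src, spitch, width, height, kind).
// Pitches and width are in bytes; height is a row count.
bool copy2d_args_valid(std::size_t dpitch_bytes, std::size_t spitch_bytes,
                       std::size_t width_bytes, std::size_t height_rows) {
    if (width_bytes == 0 || height_rows == 0) {
        return false;  // sketch-only choice: flag an empty copy as suspicious
    }
    // Each copied row must fit inside both the destination and source pitch.
    return width_bytes <= dpitch_bytes && width_bytes <= spitch_bytes;
}

// Number of payload bytes such a copy moves (width * height).
std::size_t copy2d_payload_bytes(std::size_t width_bytes,
                                 std::size_t height_rows) {
    // Guard against overflow of the multiplication.
    assert(height_rows == 0 ||
           width_bytes <= std::numeric_limits<std::size_t>::max() / height_rows);
    return width_bytes * height_rows;
}
```

With the names from the post, the check would be called as copy2d_args_valid(rp->host_image_buffer_width * rp->pix_sz, rp->img_width * rp->pix_sz, rp->img_width * rp->pix_sz, rp->img_buffer_height), and logging the result next to the existing DLOG line would rule out a bad pitch as the cause.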

On one occasion, this 1/2 MB copy took 15 seconds, which I have also occasionally seen on copies into GPU memory.

There is no measurable memory leak. I have atexit() set with logging and a deliberate crash, but it only gets called when the process exits normally.
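
Handlers registered with atexit() run only when the process terminates through exit() or a return from main(); they are skipped on abort(), on a fatal signal, or when the process is killed from outside. A minimal sketch of extra hooks that make some of those paths visible is below. The function names and exit codes are hypothetical, and nothing in this thread establishes which path the server actually takes. On Windows, SetUnhandledExceptionFilter() is the platform-specific counterpart for structured exceptions and is not shown here.

```cpp
#include <cassert>
#include <csignal>
#include <cstdio>
#include <cstdlib>
#include <exception>

// Hypothetical exit codes chosen for this sketch; any distinct values work.
constexpr int kExitOnAbort = 70;
constexpr int kExitOnSegv = 71;
constexpr int kExitOnTerminate = 72;

// A signal handler should only do async-signal-safe work, so this one just
// leaves with a distinct exit code via std::_Exit (atexit handlers are skipped).
extern "C" void on_fatal_signal(int sig) {
    std::_Exit(sig == SIGABRT ? kExitOnAbort : kExitOnSegv);
}

// Runs when std::terminate is called, e.g. for an uncaught C++ exception.
// This is ordinary code, so logging is fine. It must not return.
void on_terminate() {
    std::fputs("terminate: uncaught exception\n", stderr);
    std::fflush(stderr);
    std::_Exit(kExitOnTerminate);
}

// Runs only on the normal exit path (exit() or return from main()).
void on_normal_exit() {
    std::fputs("atexit: normal exit path\n", stderr);
    std::fflush(stderr);
}

// Installs all hooks; returns false if any registration failed.
bool install_exit_diagnostics() {
    bool ok = std::signal(SIGABRT, on_fatal_signal) != SIG_ERR;
    ok = (std::signal(SIGSEGV, on_fatal_signal) != SIG_ERR) && ok;
    std::set_terminate(on_terminate);
    assert(std::get_terminate() == on_terminate);
    ok = (std::atexit(on_normal_exit) == 0) && ok;
    return ok;
}
```

Calling install_exit_diagnostics() once at startup is enough. If the process then vanishes with none of these exit codes and no log line, that points toward termination from outside the process or a path these hooks do not cover.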

It is possible some other black magic is going on, but, I thought I would throw() it out here.

Thanks for any input.

njuffa · March 6, 2015, 11:12pm

Your description is very vague. Does the software use CUDA/OpenGL interop? What exactly do you mean by “a fair amount of state thrashing”?

Does the code provide status checks on 100% of all CUDA and OpenGL API calls and all CUDA kernel launches? I think it is entirely possible that the root cause of the failing cudaMemcpy2D() call is far up-stream from the copy operation. A plausible hypothesis is that state is getting corrupted, but based on the information provided it is anybody’s guess whether this happens in your application code, the CUDA driver, or the OpenGL driver.

I am not sure what you mean by “the device driver is 7.0”. This is not a recognizable driver version on Windows. I have a reasonably up-to-date driver installed on my Windows 7 system and the version number of that is 347.52. “runtime 5.50” presumably refers to CUDA version 5.5? In any event the issue will be easier to reproduce for others if you use the latest CUDA version with the latest drivers and cut down your application code to the minimum needed to trigger the issue.

Robert_Crovella · March 6, 2015, 11:14pm

Update to the latest CUDA toolkit and latest driver for your GPU. CUDA 5.5 is pretty old.
Develop a short, repeatable test case, and post it or report an issue. You’re likely to get good help that way.
Try other GPUs besides the GT 740 that you are using, and try other systems.

Robert_Crovella · March 6, 2015, 11:17pm

The output from the deviceQuery sample app lists the CUDA runtime version and CUDA driver version. In listing the CUDA driver version, it uses a numbering scheme similar to the CUDA runtime version, i.e. 5.5/6.0/7.0 etc. This means that the GPU driver generally falls within that compatibility range.

For example, a GPU driver 340.29 would show up as CUDA driver version 6.5 in deviceQuery.
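
As a small illustration of where such a number comes from: cudaDriverGetVersion() and cudaRuntimeGetVersion() report the version as a single integer encoded as 1000 * major + 10 * minor, and deviceQuery prints it in major.minor form. A minimal sketch of that decoding is below; the function name format_cuda_version is hypothetical, and the integer is passed in so the sketch runs without a GPU.

```cpp
#include <cassert>
#include <string>

// Formats the integer reported by cudaDriverGetVersion() or
// cudaRuntimeGetVersion() (encoded as 1000 * major + 10 * minor)
// as a "major.minor" string. The function name is hypothetical.
std::string format_cuda_version(int encoded) {
    assert(encoded >= 0);
    const int major = encoded / 1000;
    const int minor = (encoded % 1000) / 10;
    return std::to_string(major) + "." + std::to_string(minor);
}
```

Note that this number identifies the CUDA version the driver supports, not the full GPU driver version (such as 340.29), which has to be read from the driver itself.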

njuffa · March 6, 2015, 11:27pm

Thanks for the explanation regarding the mysterious version number “7.0”. However, knowing this number is of little help in practical terms, as any professional repro would require the knowledge of the actual full driver version number, in particular when dealing with problems that involve CUDA/graphics interop, since both CUDA and the graphics driver could be implicated in case of driver bugs.

bodysurfinyon · March 7, 2015, 3:42am

Thank you so much for your replies! No, there is no OpenGL interop. We get bitmaps back and send them back to the client. The state thrashing means that there are different 3D textures and different rendering parameters used. If the same data is repeatedly rendered with just view changes, it does not happen. I believe that all calls to CUDA are error checked. A simple test case may be difficult. Trying to upgrade the SDK is much easier.

It also has the same symptoms on a Tesla C1060 with the very latest driver. So, I think that means it is not a CUDA vs. display issue.

Does anyone have any idea of what could just make a process disappear?

njuffa · March 7, 2015, 4:17am

I re-read your original post and could not find any mention of a “disappearing process”. I understood the symptom to be that cudaMemcpy2D() calls hang randomly after the app has been running for a while.

While upgrading to the latest version of CUDA helps with reproducibility (most people do not keep old CUDA versions around), upgrading may just jiggle things around enough to reduce the likelihood of the issue occurring. Apps hanging on API calls and processes disappearing seem like issues serious enough to warrant root cause determination. Issues that have not been root caused but magically go away after a software upgrade have the nasty habit of re-appearing after yet another future upgrade.

With basically no information given about the code, it is not even clear that whatever is going wrong is being caused by CUDA at all. CUDA API calls could be failing because of corruption elsewhere in the system. If the problem persists, I would suggest some good old-fashioned debugging work. cuda-memcheck can help find issues in CUDA code. I am not sure what you can use for host code. On Linux there is valgrind to detect memory corruption. Is there any equivalent tool for Windows you can use?

bodysurfinyon · March 9, 2015, 4:09pm

I totally agree that it could be anything. We used to use Purify on Windows, but that stopped being useful. I will check out cuda-memcheck. Thanks. I’ll report back when I have more info.

bodysurfinyon · March 10, 2015, 2:03pm

Using the 6.5 CUDA SDK, I get an upstream error. I suppose I should have tried that before posting…

Thank you!

bodysurfinyon · March 17, 2015, 2:43pm

Final note: The upstream error was not the issue. It seems that CUDA toolkit 6.5 checks component type strictly. Bind was expecting a array, but it was created as . Once that was fixed, I went back to getting the same intermittent error, which is actually a silent exit of the process.

I used cudaDeviceSynchronize() liberally to narrow the error down, mostly to the actual rendering kernel. The error never occurred in cudaMemcpy2D().

The exit would also seem to happen occasionally in send() - using sockets. AND it seems to be exposed by running multiple test programs from Python. It sounds like I am on drugs, but… after spending way too much time on this, I am hoping to sweep it under the rug, then use a steam roller to make sure it stays there.
