Stress testing CUDA app - possible exit from library?

I am stress testing an application that uses CUDA. It is a single-threaded rendering server process. In this stress test, one client does a 3D render (~30 ms of kernel time per frame) while a second process does a variety of renders, so there is a fair amount of state thrashing. Without the state thrashing, it runs for a long, long time. In the large majority of failures, control does not come back from cudaMemcpy2D(). This claim is based on statements written to a log file that is flushed after each write.

I’m running CUDA on a GeForce GT 740 (Compute 3.0, 4096 MB) under Windows 7 on a Dell T5400. The device driver is 7.0 and the runtime is 5.50.

DLOG("before cudaMemcpy2D id %08x host width %d align %d image w,h %d %d\n",
     dst->id, rp->host_image_buffer_width, rp->pix_sz,
     rp->img_width, rp->img_buffer_height);

cudaError cudaErr = cudaMemcpy2D(dst->data,
                                 rp->host_image_buffer_width * rp->pix_sz,
                                 rp->img_buffer,
                                 rp->img_width * rp->pix_sz,
                                 rp->img_width * rp->pix_sz,
                                 rp->img_buffer_height,
                                 cudaMemcpyDeviceToHost);

DLOG("back from cudaMemcpy2D err %d\n", cudaErr);
if (cudaErr != cudaSuccess)
    DLOG("cudaMemcpy2D failed: %s\n", cudaGetErrorString(cudaErr));

On one occasion, this 0.5 MB copy took 15 seconds, which I have occasionally seen on copies into GPU memory, too.

There is no measurable memory leak. I have an atexit() handler set up with logging and a deliberate crash, but it only gets called when the process exits normally.

It is possible some other black magic is going on, but, I thought I would throw() it out here.

Thanks for any input.

Your description is very vague. Does the software use CUDA/OpenGL interop? What exactly do you mean by “a fair amount of state thrashing”?

Does the code provide status checks on 100% of all CUDA and OpenGL API calls and all CUDA kernel launches? I think it is entirely possible that the root cause of the failing cudaMemcpy2D() call is far up-stream from the copy operation. A plausible hypothesis is that state is getting corrupted, but based on the information provided it is anybody’s guess whether this happens in your application code, the CUDA driver, or the OpenGL driver.
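To make "status checks on 100% of calls" concrete, here is a minimal error-checking pattern. This is a sketch, not the poster's actual code; the macro name and the exit-on-error policy are assumptions:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call so that an upstream failure is reported
// at its own call site instead of surfacing later in an unrelated
// cudaMemcpy2D().
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %d (%s) at %s:%d\n",          \
                    (int)err_, cudaGetErrorString(err_),              \
                    __FILE__, __LINE__);                              \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Kernel launches do not return an error directly, so both the launch
// and the asynchronous execution need checking:
//   myKernel<<<grid, block>>>(args);
//   CUDA_CHECK(cudaGetLastError());       // launch-configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // errors during execution
```

Note that without the cudaGetLastError()/cudaDeviceSynchronize() pair after each launch, a failing kernel reports its error through whichever API call happens to synchronize next.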

I am not sure what you mean by “the device driver is 7.0”. This is not a recognizable driver version on Windows. I have a reasonably up-to-date driver installed on my Windows 7 system and the version number of that is 347.52. “runtime 5.50” presumably refers to CUDA version 5.5? In any event the issue will be easier to reproduce for others if you use the latest CUDA version with the latest drivers and cut down your application code to the minimum needed to trigger the issue.

  1. Update to the latest CUDA toolkit and latest driver for your GPU. CUDA 5.5 is pretty old.

  2. Develop a short, repeatable test case, and post it or report an issue. You’re likely to get good help that way.

  3. Try other GPUs besides the GT 740 that you are using, and try other systems.

The output from the deviceQuery sample app lists the CUDA runtime version and the CUDA driver version. For the CUDA driver version it uses the same numbering scheme as the runtime version (5.5, 6.0, 7.0, etc.), which indicates that the installed GPU driver falls within that CUDA compatibility range.

For example, GPU driver 340.29 would show up as CUDA driver version 6.5 in deviceQuery.

Thanks for the explanation regarding the mysterious version number “7.0”. However, knowing this number is of little help in practical terms, as any professional repro would require the knowledge of the actual full driver version number, in particular when dealing with problems that involve CUDA/graphics interop, since both CUDA and the graphics driver could be implicated in case of driver bugs.

Thank you so much for your replies! No, there is no OpenGL interop. We get bitmaps back and send them back to the client. The state thrashing means that there are different 3D textures and different rendering parameters used. If the same data is repeatedly rendered with just view changes, it does not happen. I believe that all calls to CUDA are error checked. A simple test case may be difficult. Trying to upgrade the SDK is much easier.

It also shows the same symptoms on a Tesla C1060 with the very latest driver, so I think that rules out a CUDA-vs.-display-driver issue.

Does anyone have any idea of what could just make a process disappear?

I re-read your original post and could not find any mention of a “disappearing process”. I understood that symptoms were that cudaMemcpy2D() calls hang randomly after the app has been running for a while.

While upgrading to the latest version of CUDA helps with reproducibility (most people do not keep old CUDA versions around), upgrading may just jiggle things around enough to reduce the likelihood of the issue occurring. Apps hanging on API calls and processes disappearing seem issues serious enough that they warrant root cause determination. Issues that have not been root caused but magically go away after a software upgrade have the nasty habit of re-appearing after yet another future upgrade.

With basically no information given about the code, it is not even clear that whatever is going wrong is being caused by CUDA at all. CUDA API calls could be failing because of corruption elsewhere in the system. If the problem persists, I would suggest some good old-fashioned debugging work. cuda-memcheck can help find issues in CUDA code. Not sure what you can use for host code. On Linux there is valgrind to detect memory corruption, is there any equivalent tool for Windows you can use?

I totally agree that it could be anything. We used to use Purify on Windows, but, that stopped being useful. I will checkout cuda-memcheck. Thanks. I’ll report back when I have more info.

Using the 6.5 CUDA SDK, I get an upstream error. I suppose I should have tried that before posting…

Thank you!

Final note: The upstream error was not the issue. It seems that CUDA toolkit 6.5 checks the component type strictly: the bind expected an array with one channel format, but the array had been created with a different one. Once that was fixed, I went back to getting the same intermittent error: actually a silent exit of the process.

I used cudaDeviceSynchronize() liberally and localized the error mostly to the actual rendering kernel. The error never occurred in cudaMemcpy2D().
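The localization technique above can be sketched as follows. The kernel and variable names are hypothetical, not taken from the poster's code; the idea is that a synchronize after each stage forces asynchronous kernel errors to surface at that stage rather than at some later copy:

```cuda
// Stage under suspicion: the rendering kernel.
renderKernel<<<grid, block>>>(volumeTex, params, devImage);

// Force any asynchronous kernel error to surface here, not later.
cudaError_t err = cudaDeviceSynchronize();
if (err != cudaSuccess) {
    DLOG("renderKernel failed: %s\n", cudaGetErrorString(err));
}

// Only once the kernel is known to have completed cleanly does the
// copy run, so a failure here is genuinely the copy's fault:
//   err = cudaMemcpy2D(dst, dpitch, src, spitch,
//                      widthBytes, height, cudaMemcpyDeviceToHost);
```

The extra synchronization costs performance, so it is a debugging aid to be removed (or compiled out) once the failing stage is identified.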

The exit also seems to happen occasionally in send(), using sockets. AND it seems to be exposed by running multiple test programs from Python. It sounds like I am on drugs, but… after spending way too much time on this, I am hoping to sweep it under the rug, then use a steamroller to make sure it stays there.