Crash when profiling with "Kernel Launches and Memory Operations"

Hi All,

I have an app which runs fine and can be profiled, so long as “Kernel Launches and Memory Operations” is not checked. When it is, the app crashes after working ok for a few seconds (including kernel launches), with the following:

Problem Event Name: APPCRASH
Application Name: rmsi.exe
Application Version: 0.0.0.0
Application Timestamp: 54aea308
Fault Module Name: Nvda.Cuda.Injection.dll
Fault Module Version: 4.2.0.14314
Fault Module Timestamp: 54616179
Exception Code: c0000005
Exception Offset: 000000000027a1ab
OS Version: 6.1.7601.2.1.0.256.48
Locale ID: 6153
Additional Information 1: a282
Additional Information 2: a282d7fcfaec81da43b5940b95b62440
Additional Information 3: a181
Additional Information 4: a181d847302d43c85dd6d726f2c9cfec

and I get no kernel launch info (though all other info is available)

If I attempt to debug, I get the following exception at a call to cudaMemcpy3D (inside a loop which by then has already run successfully a number of times):

Unhandled exception at 0x000007FECD6FA1AB (Nvda.Cuda.Injection.dll) in rmsi.exe: 0xC0000005: Access violation writing location 0x0000000000000020.

With this stack:

Nvda.Cuda.Injection.dll!000007fecd6fa1ab()	Unknown
Nvda.Cuda.Injection.dll!000007fecd4f874c()	Unknown
Nvda.Cuda.Injection.dll!000007fecd582b40()	Unknown
Nvda.Cuda.Injection.dll!000007fecd587056()	Unknown
Nvda.Cuda.Injection.dll!000007fecd59c406()	Unknown
Nvda.Cuda.Injection.dll!000007fecd70a985()	Unknown
nvcuda.dll!000007fec78e3fb0()	Unknown
nvcuda.dll!000007fec79bc8ad()	Unknown
nvcuda.dll!000007fec79d48f5()	Unknown
nvcuda.dll!000007fec79d5b3b()	Unknown
nvcuda.dll!000007fec796b916()	Unknown
nvcuda.dll!000007fec793c226()	Unknown
nvcuda.dll!000007fec793d9db()	Unknown
nvcuda.dll!000007fec78eba10()	Unknown
nvcuda.dll!000007fec78f3c45()	Unknown
nvcuda.dll!000007fec78d9204()	Unknown
cudart64_60.dll!000007fef9657044()	Unknown
cudart64_60.dll!000007fef9645e9c()	Unknown
cudart64_60.dll!000007fef965e42f()	Unknown

rmsi.exe!frameProcessor_setInputTexture(void * data, unsigned int cam, unsigned int phs) Line 763 C++

cuda-memcheck finds no problems, and the application runs fine when not being profiled.

Windows 7 64-bit
VS 2012 11.0.61030.00 Update 4
CUDA 6.0
Nsight 4.2.0.14314

The profiler used to run fine until we started using layered textures.

I’d be grateful for any ideas…

Hi,

I guess by “profiling” you mean Trace Application rather than Profile CUDA Application in Nsight VSE, right? I have tried the layered texture and cudaMemcpy3D samples on my side, but had no luck reproducing the crash. Would you mind providing more information about your GPU and NVIDIA driver version? If possible, attaching your exe file would help us reproduce and investigate the problem much faster.

Hi,

Thanks for looking into it. Yes, I do mean “Trace Application”, sorry about that. Profile CUDA Application does not crash. I also tried starting collection manually: the app ran fine until I started collecting, then crashed a few seconds after I did.

Here are the details of three machines where I can reproduce the problem:

Desktop:

  • Windows 7 64-bit
  • VS 2012 11.0.61030.00 Update 4
  • CUDA 6.5
  • Nsight 4.1.0.14204
  • GeForce GTX 780 Ti (GK110B)
  • driver 344.11

Laptop 1:

  • Windows 7 64-bit
  • VS 2012 11.0.61030.00 Update 4
  • CUDA 6.0
  • Nsight 4.2.0.14314
  • GeForce GT 730M (GK208)
  • driver 347.09

Laptop 2:

  • Windows 7 64-bit
  • VS 2012 11.0.61030.00 Update 4
  • Nsight 4.1.0.14204
  • CUDA 6.5
  • Quadro K2100M (GK106)
  • driver 340.62

Unfortunately, I can’t share the exe file. We’re trying to build a minimal example, but it’s difficult to isolate the problem.

Hi All,

We managed to create a minimal example out of the simpleLayeredTexture sample, attached (replace the existing simpleLayeredTexture.cu).

Basically, we call cudaMemcpy3D in a loop, apparently too fast for Nsight. Adding a call to cudaDeviceSynchronize on every loop iteration fixes the crash.

for (unsigned int iter = 0; iter < 1000; iter++) {
    // here come kernel calls modifying the buffer

    cudaMemcpy3D(&cpyToDemoParams);
    //cudaDeviceSynchronize();
}

Is this expected or are we doing something wrong?

simpleLayeredTexture.cu (4.14 KB)

Hi nightflight,

Thanks for providing the sample code; it describes the problem much more clearly. As you observed, cudaDeviceSynchronize() is required for Nsight Analysis: it causes all profile data to be flushed before the application exits. An equivalent alternative is cudaDeviceReset(). Without one of them, Nsight Analysis cannot work correctly.
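To make this concrete, here is a minimal sketch of the pattern (kernel and names are invented for illustration): before main returns, the program synchronizes or resets the device so that outstanding work completes and the profiler can flush its data. This needs a CUDA-capable GPU and nvcc to run.

```cuda
#include <cuda_runtime.h>

__global__ void touchBuffer(float *buf)
{
    buf[threadIdx.x] += 1.0f;
}

int main()
{
    float *d_buf;
    cudaMalloc(&d_buf, 256 * sizeof(float));
    cudaMemset(d_buf, 0, 256 * sizeof(float));

    touchBuffer<<<1, 256>>>(d_buf);  // asynchronous launch; returns immediately

    cudaFree(d_buf);

    // Wait for all outstanding GPU work and flush profiler data
    // before exiting. cudaDeviceSynchronize() would also work here.
    cudaDeviceReset();
    return 0;
}
```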

Hi nightflight,

Although there is a small flaw in your program (the missing synchronization), we are sorry for the inconvenience caused by our tool. We have reproduced your problem with Nsight 4.2 and verified that it is gone in Nsight 4.5. Could you please try the newer version from https://developer.nvidia.com/nsight-visual-studio-edition-early-access? Note that it is an early-access edition, not a final release. Thanks.

Hi qzhang,

Thank you so much for looking into it. Nsight 4.5 no longer crashes.

But, just to clarify: is the synchronization only needed for Nsight, or will my program eventually stop working correctly, or become unstable, without it?

Glad to hear 4.5 is working for you.

You can’t use the results of the GPU work if you don’t wait for it to complete. Many CUDA functions launch CUDA kernels or initiate memory transfers asynchronously, so the CPU function may (and probably will) return before the GPU work is complete. Take a look at the CUDA programming guide to see which functions behave like this. For those functions, the data you are expecting to be written by the GPU will not be in a determinate state until it finishes, so you must call one of the *Synchronize or *Query API functions to wait for the GPU work to finish before accessing the output. It makes no sense to have a CUDA program that launches work and exits without synchronizing, because nothing could safely be done with the results of the unfinished GPU work.
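As a rough illustration of that asynchrony (the kernel and variable names are invented, and the exact blocking behavior of cudaMemcpyAsync with pageable host memory can vary): the launch below returns immediately, so the host must synchronize before it can safely read the results.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void square(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= data[i];
}

int main()
{
    const int n = 4;
    int h[n] = {1, 2, 3, 4};
    int *d;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);

    square<<<1, n>>>(d, n);  // queued on the GPU; the CPU does not wait
    cudaMemcpyAsync(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Without this, h may still hold the old values when read below.
    cudaDeviceSynchronize();

    printf("%d %d %d %d\n", h[0], h[1], h[2], h[3]);

    cudaFree(d);
    cudaDeviceReset();  // also flushes profiler data on exit
    return 0;
}
```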

Also, the current NVIDIA tools can only guarantee complete and accurate results if the program waits for outstanding work to finish with the cudaDeviceReset or cudaDeviceSynchronize functions. See this post on the Parallel For-All Blog: