cudaErrorLaunchFailure when using nvProf only

I have a rather complex application (more details in my other posts) that executes a set of kernels hundreds or thousands of times. After every CUDA API call that returns an error code, I check that the result is cudaSuccess (in Debug builds).
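For reference, the kind of per-call check described above usually looks like the sketch below (the macro name CUDA_CHECK is illustrative, not the poster's actual code; it requires the CUDA toolkit headers and will not run without a GPU driver):

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Wrap every CUDA runtime call and abort on anything other than cudaSuccess.
#define CUDA_CHECK(call)                                                 \
    do {                                                                 \
        cudaError_t err__ = (call);                                      \
        if (err__ != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                 \
                    cudaGetErrorString(err__), __FILE__, __LINE__);      \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    } while (0)

// Usage:
// CUDA_CHECK(cudaMemcpyAsync(dst, src, bytes, cudaMemcpyHostToDevice, stream));
```

Note that a cudaErrorLaunchFailure from a kernel is typically reported by a *later* runtime call, so a check like this flags the first API call after the faulting launch, not necessarily the launch itself.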

The problem is that if I run the code stand-alone, in the CPU debugger, or under CUDA debugging, it works perfectly and completes as expected. If I run it through "Start Performance Analysis" in Visual Studio, it trips my error check almost immediately with a cudaErrorLaunchFailure. I should note that the code reporting the error is executed in all other modes, and it is not changed in any way in this case versus the others.

So the question is: what could cause CUDA to report a launch failure only in the mode that profiles its execution?

Maybe this is a redundant question, but just to make sure: did you set the correct working directory on the Analysis activity page, and are you synchronizing sufficient data if you are using a remote target? Thanks.


Some additional information: the Release version also works correctly IF I use the default stream only; in other words, if I don't use multiple streams.

The working directory is correct (otherwise the application wouldn't run). I am not sure about "sync sufficient data"; the "Directories to synchronize" section is empty.

The settings for the working directory and the directories to synchronize don't change between the compiled versions; only the executable differs. However, the .cu and .cuh files have additional debugging code, and different compilation options are set.

Since the kernel code causes an exception immediately on the first invocation, I assume some setup state is being missed. Something about creating or initializing the CUDA streams? Submitting commands to the non-default streams? Telling the system to expect work on multiple streams?

The basic architecture is:

  1. Data IO is submitted from the CPU on stream0. (The input data is shared as input for each of streams 1..N.)
  2. An event (IOevent0) is recorded to flag completion of the data IO on stream0.
  3. Streams 1..N each do a cudaStreamWaitEvent on IOevent0.
  4. Work is submitted to streams 1..N for kernels 1..N (respectively).
  5. Streams 1..N each return their results to the CPU via their own stream.

This fails as described. If I change all of streams 1..N to stream0, it works as expected.
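The steps above can be sketched roughly as follows (names, sizes, and launch dimensions are illustrative; the streams and events would be created up front as described, and error checking is omitted for brevity):

```cpp
#include <cuda_runtime.h>

__global__ void workerKernel(const float* in, float* out, int n) { /* ... */ }

// Hypothetical driver for steps 1..5: stream0 carries the shared input,
// streams[0..N-1] each run a kernel against it and copy back their result.
void submitWork(int N, int n,
                const float* h_in, float** h_out,
                float* d_in, float** d_out,
                cudaStream_t stream0, cudaStream_t* streams)
{
    cudaEvent_t ioEvent0;
    cudaEventCreateWithFlags(&ioEvent0, cudaEventDisableTiming);

    // 1. Upload the shared input on stream0.
    cudaMemcpyAsync(d_in, h_in, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream0);
    // 2. Record completion of the IO on stream0.
    cudaEventRecord(ioEvent0, stream0);

    dim3 block(256), grid((n + 255) / 256);
    for (int i = 0; i < N; ++i) {
        // 3. Make each worker stream wait for the input to arrive.
        cudaStreamWaitEvent(streams[i], ioEvent0, 0);
        // 4. Submit the per-stream kernel.
        workerKernel<<<grid, block, 0, streams[i]>>>(d_in, d_out[i], n);
        // 5. Copy each result back on its own stream.
        cudaMemcpyAsync(h_out[i], d_out[i], n * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[i]);
    }
    // Safe even with pending waits: destruction is deferred until
    // the event's captured work has completed.
    cudaEventDestroy(ioEvent0);
}
```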

(Note: streams 1..N are created well in advance; there are N streams to match the CPU threads that will submit to them, one per CPU core, and the streams are reused after previous work has completed.)

So to be absolutely clear, the code works perfectly in all modes EXCEPT in Release, using multiple streams, while being profiled or debugged.

Running the (almost) same code in Debug, both profiling and debugging work perfectly.
Running with only one stream, profiling and debugging work perfectly.
Running in any mode without profiling or debugging works perfectly.


I'm sorry for your experience. Is it possible for us to get your application and try it on our side? My email address is If it's too large to attach, I can send you an FTP site. Thanks.