I have two CUDA kernels in my program. When I run it through the CUDA Visual Profiler, it reports results only for the last executed kernel. Since my kernels are independent, I was able to switch the order of invocation; again, only the last executed kernel is reported.
When I run the sample DCT, which has multiple kernels, all of them are reported.
What might I be missing?
Edit: It looks like cudaThreadExit(), which was called after each kernel invocation, is causing this behavior. Could anyone confirm this, please?
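
To make the call pattern concrete, here is a minimal sketch of a program that calls cudaThreadExit() after each kernel invocation. The kernel names (kernelA, kernelB), the helper runAndExit, and the buffer size are placeholders assumed for illustration, not the actual code. Whether the profiler really drops the earlier kernel because of this is the suspicion being asked about, not something this sketch proves.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernels standing in for the two independent kernels.
__global__ void kernelA(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2.0f * i;
}

__global__ void kernelB(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i + 1.0f;
}

// Hypothetical helper: allocate, launch one kernel, copy back, then tear down.
template <typename Kernel>
void runAndExit(Kernel k, const char *name) {
    const int n = 1024;
    static float host[1024];
    float *dev = nullptr;

    cudaMalloc(&dev, n * sizeof(float));
    k<<<(n + 255) / 256, 256>>>(dev, n);
    // The blocking copy also waits for the kernel to finish.
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    printf("%s: host[1] = %f\n", name, host[1]);

    // Suspected cause: this destroys the context after every kernel.
    // (cudaThreadExit() is deprecated in newer toolkits in favor of
    // cudaDeviceReset().)
    cudaThreadExit();
}

int main() {
    runAndExit(kernelA, "kernelA");
    runAndExit(kernelB, "kernelB");
    return 0;
}
```

If the suspicion is right, moving the single cudaThreadExit() call to the very end of main() should make both kernels show up in the profiler output.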