Why does CUDA code run so much faster in NVIDIA Visual Profiler?

A piece of code that takes over 1 minute when run from the command line finishes in a matter of seconds in NVIDIA Visual Profiler (running the same .exe). So the natural question is: why? Is there something wrong with the command line, or does Visual Profiler do something different and not really execute everything as it does on the command line?

  • GPU-shark indicated that my performance state was unchanged at P0 when I switched from command line to Visual Profiler.
  • However, GPU usage was reported at 0.0% when run with Visual Profiler, but went as high as 98% when run from the command line.
  • Moreover, far less memory is used with Visual Profiler. When run from the command line, Task Manager indicates usage of 650-700MB of memory (it spikes at the first cudaFree(0) call). In Visual Profiler that figure goes down to ~100MB.
  • My first question would be: does your program generate any output that you can compare between the two runs? If so, was the output correct in both cases? I’d guess that under Visual Profiler it does “not really execute everything as on the command line”.
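One way to do the comparison suggested above is to have each run write its results to a file and then compare the two files byte for byte. The snippet below is only a minimal sketch: it assumes the program can dump its results to disk, and the file names in the usage note are hypothetical, not something from the original post.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def outputs_match(path_a, path_b):
    """Return True only if both files exist, are non-empty, and are identical.

    A missing or empty file is treated as a mismatch, because it suggests
    that the run did not actually produce any results.
    """
    a, b = Path(path_a), Path(path_b)
    for p in (a, b):
        if not p.is_file() or p.stat().st_size == 0:
            return False
    return file_digest(a) == file_digest(b)
```

For example, after saving the command-line results as `run_cli.out` and the Visual Profiler results as `run_profiler.out` (both names are placeholders), `outputs_match("run_cli.out", "run_profiler.out")` returning False would support the guess that the profiled run is not doing the same work. A byte-for-byte check can be too strict if the kernels are not deterministic (for instance, floating-point atomics), in which case a tolerance-based comparison of the values would be more appropriate.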

    One of the NVIDIA engineers has responded on another thread, which sounds like it's the same issue.

    Please check this response: "Performance is much better when profling with NSight than when running production code" (CUDA Programming and Performance, NVIDIA Developer Forums), and reply to let us know whether the solution works for you.
    Thanks,