Profiling a computationally bound kernel

Hey guys

Since the CUDA profiler lacks detailed profiling information for each function, I was wondering how accurate it is to:

  1. Switch to EMUDEBUG
  2. Run the program with the Visual Studio 2008 team system performance wizard
  3. Analyze the results to find where most of the time is spent and which routines are the greediest

Since it's not memory bound (no thread synchronisation or shared memory whatsoever), surely the results should be fairly accurate? Maybe not 100% accurate, but at least I could tell which parts of the code need the most optimisation…



I doubt it would help. What I do, and it rarely fails, is this (whether you're compute bound or bandwidth bound):

- start from a nearly empty kernel (only the setup code, index calculations, etc.) and measure the time.

- gradually re-enable lines of code until you find where you spend most of the time

- do it until you're happy with the results (or can't get any better - and ask again in the forums :) )
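
The steps above could be sketched roughly like this. It's only an illustration, not anyone's real kernel: `STAGE`, `staged_kernel` and `run_staged_kernel` are made-up names, and the arithmetic is placeholder work. The idea is to rebuild with `-DSTAGE=0`, `-DSTAGE=1`, ... and time each build. It uses `cudaDeviceSynchronize()`, which replaces the older `cudaThreadSynchronize()` in current toolkits.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// STAGE is a hypothetical compile-time switch: build with -DSTAGE=0, 1, 2
// and time each build to see how much each added stage costs.
#ifndef STAGE
#define STAGE 2
#endif

__global__ void staged_kernel(const float *in, float *out, int n)
{
    // Stage 0: setup and index calculation only.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = in[i];

#if STAGE >= 1
    // Stage 1: first chunk of arithmetic (placeholder work).
    for (int k = 0; k < 64; ++k)
        acc = acc * 1.0001f + 0.5f;
#endif

#if STAGE >= 2
    // Stage 2: second chunk of arithmetic (placeholder work).
    acc = sqrtf(fabsf(acc)) + sinf(acc);
#endif

    // Always store the result: if nothing observable depends on acc,
    // the compiler is free to remove the computation entirely.
    out[i] = acc;
}

// Hypothetical host helper: launches the kernel, copies back the first
// output element, and returns true only if every step succeeded.
bool run_staged_kernel(int n, float *first_result)
{
    float *d_in = nullptr, *d_out = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemset(d_in, 0, bytes);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    staged_kernel<<<blocks, threads>>>(d_in, d_out, n);

    // Check the launch first, then wait for the kernel to finish.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(first_result, d_out, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess;
}
```

Call `run_staged_kernel(n, &value)` from your `main` for each `STAGE` build. The unconditional store to `out[i]` is what keeps the partially enabled kernel from being compiled down to nothing.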

The most important thing is to make sure your kernel doesn't get optimized out by the compiler and that you measure your kernel time correctly (i.e. call cudaThreadSynchronize() after the kernel invocation, and check that the kernel was not optimized out and ran fine). For the timing itself, use any of: a stopwatch, clock(), CUDA events or the cutil timers (sorry tmurray - they are just the easiest ;) )
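
For the CUDA events option, a minimal sketch might look like the following. `busy_kernel` and `time_busy_kernel_ms` are hypothetical names and the loop is placeholder work, so substitute your own kernel. Since kernel launches are asynchronous, the code waits on the stop event before reading the elapsed time.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel (hypothetical); substitute the kernel being measured.
__global__ void busy_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < 256; ++k)
            x = x * 1.0001f + 0.5f;
        data[i] = x;  // store the result so the loop is not optimized away
    }
}

// Returns the kernel time in milliseconds, or a negative value on error.
float time_busy_kernel_ms(int n)
{
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // Record events on the same stream (0) immediately around the launch.
    cudaEventRecord(start, 0);
    busy_kernel<<<blocks, threads>>>(d_data, n);
    cudaEventRecord(stop, 0);

    // Wait until the stop event (and therefore the kernel) has completed,
    // then confirm that the launch itself did not fail.
    cudaError_t err = cudaEventSynchronize(stop);
    if (err == cudaSuccess) err = cudaGetLastError();

    float ms = -1.0f;
    if (err == cudaSuccess) cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return ms;
}
```

Something like `printf("%f ms\n", time_busy_kernel_ms(1 << 20));` in `main` would print the measurement. A negative value means the kernel did not run properly, so the number should not be trusted.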

hope that helps