Profiling at code-line resolution

Hi,

Is there a way to profile CUDA code per line of code (as opposed to per kernel), i.e. to get timing and coalescing information: how long it took to execute the line, and how many threads executed it in parallel?

Hmm…
Guys, I'd like to remind you that the ONLY reason we use CUDA is optimization. Would you agree that a basic aid for optimization is a line-based profiler?
If not, I would be happy to hear what other methods you use to measure your code's performance and to detect bottlenecks or poorly coalesced memory accesses.

Ping?

You are probably not receiving any replies because there is no such tool. Profiling code without changing its behavior is a tricky thing, particularly on a GPU with its sophisticated execution model and the different resources to consider.

If you can't make out where in your code the resource usage stems from, try commenting out parts of the code and running it through the profiler again. Be aware, though, of the aggressive compiler optimization going on: it will eliminate any calculation whose results are unused.
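Something along these lines (a minimal sketch; the kernel and its "stages" are made up purely to illustrate the comment-out approach and the dead-code pitfall):

```cpp
// Invented example kernel, split into stages you can comment in and out
// between profiler runs.
__global__ void stage_test(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = in[i];              // stage 1: global load
    // v = sinf(v) * v + 1.0f;    // stage 2: the math under test
    out[i] = v;                   // keep a real side effect, otherwise the
                                  // compiler removes whatever stages remain
}
```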

Thanks for replying; for a moment there I thought I was talking to myself…

So you confirm that there is no line-based profiling. I am surprised that you accept this fact. I can't see why the GPU's execution model is more sophisticated than the CPU's; actually, I always thought the CPU was much more capable. But if I think for a moment about the profiling process, I imagine it is quite similar to debugging: if you can add a breakpoint to a line (I assume Nsight can?), you could also measure along the way how long the code took to execute. Also, I can't see how a profiler can give you general information about coalescing without checking it at line resolution.

Thanks for the tip about manual profiling by eliminating code blocks and measuring the rest, but I don't find this method practical.

To tell the truth, I am quite amazed that Nsight, with all its requirements for a top-of-the-line OS and a farm of GPUs, can't do this. So what can you do with Nsight that can't be done with NVIDIA's Visual Profiler? (I never tried Nsight, since as a student I can't afford its requirements, so I also debug with printf like in the Middle Ages.)

Profiling using the debug interface would almost certainly be too slow to be practical. The overhead of processing a breakpoint after every line would be enormous, at which point it would be better to profile your code using a GPU simulator. (An active research area, I believe.) Real profiling tools use a variety of techniques, including statistical sampling of the program counter from an interrupt handler or sometimes instrumenting the code to update counters as it executes. These methods could potentially be applied to CUDA, although getting the implementation right without being a huge burden would be tricky. (Sounds like a fun CS research project!)
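For what it's worth, the instrumentation idea can be tried by hand today using the GPU's own cycle counter. A rough sketch with an invented kernel (on older hardware you may have to use clock() instead of clock64()):

```cpp
// Hand-instrumented timing of a region inside a kernel. The cycle counts
// land in a per-block array that the host can copy back and inspect.
__global__ void timed_kernel(const float *in, float *out,
                             long long *cycles, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    long long t0 = clock64();

    float v = in[i];
    v = v * v + 1.0f;             // the region being measured
    out[i] = v;

    long long t1 = clock64();

    // One sample per block is usually enough. Note that warps overlap,
    // so these per-region cycle counts do not simply add up to the
    // kernel's total run time.
    if (threadIdx.x == 0)
        cycles[blockIdx.x] = t1 - t0;
}
```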

Another thing to keep in mind is that you don’t want the profiling process to mess with the cache contents if possible, otherwise that could change the timing significantly for some programs. Even worse, the overlapping nature of warp execution means that if you add up the amount of time it took to execute each instruction in the program (as computed by comparing the clock before and after) you will get a number that is easily 10x larger than the actual run time of the entire kernel! Figuring out how to assign times to each of the warps is either nontrivial for the profiler, or misleading to the programmer.

For example:

You have a program where each thread reads a float from memory, computes a complex expression that requires 10 floating point math operations, and writes the result back out to memory. A straightforward profiling of your kernel would conclude that 90% (rough guess here, depends on your hardware) of the time, the CUDA cores were busy executing your complex math expression. After looking at the expression, you realize it could be simplified, and so you reduce it to 5 floating point operations. However, you find that the kernel takes the same amount of time to run! Due to latency hiding in CUDA, while some threads in the block were waiting for their float from memory to come in, other threads were executing their floating point instructions. If you reduce the number of instructions, the CUDA cores just go idle because the amount of data being read hasn’t changed, and there is still the fixed latency for that read to finish. In effect, the math is “free” in this case, but your profiler didn’t tell you that. (This is a general feature of CUDA. There are many cases where arithmetic costs nothing because you are memory bandwidth or latency limited.)
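You can even see this effect without a profiler by timing the two variants at kernel granularity with CUDA events. A minimal sketch with made-up kernels and sizes; on bandwidth- or latency-limited hardware the two times come out nearly the same:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Same memory traffic, different amounts of dependent math per element.
__global__ void heavy_math(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = in[i];
    for (int k = 0; k < 10; ++k) v = v * 1.0001f + 0.5f;  // ~10-op chain
    out[i] = v;
}

__global__ void light_math(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = in[i];
    for (int k = 0; k < 5; ++k) v = v * 1.0001f + 0.5f;   // half the math
    out[i] = v;
}

int main()
{
    const int n = 1 << 22;
    float *in, *out;
    cudaMalloc((void **)&in,  n * sizeof(float));   // contents don't matter
    cudaMalloc((void **)&out, n * sizeof(float));   // for the timing here

    dim3 block(256), grid((n + block.x - 1) / block.x);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    cudaEventRecord(start);
    heavy_math<<<grid, block>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("heavy math: %.3f ms\n", ms);

    cudaEventRecord(start);
    light_math<<<grid, block>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("light math: %.3f ms\n", ms);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

In a real measurement you would add a warm-up launch and average several runs, but the qualitative point survives without that.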

Now, the point of this example isn’t to say that profiling is useless, or can’t work for CUDA. However, it’s clear that the important information to give the developer is somewhat multidimensional, and actually could use some research to figure out what needs to be communicated. (That’s three CS research projects in one forum post!)

Computer architectures are complicated these days, so you often need to think for more than a moment to solve a problem. :)

There are hardware counters in the GPU that keep track of these things at the instruction level; lacking a generic line-level profiling solution, NVIDIA has exposed counters for many performance metrics that appear to be built into the hardware.

I wouldn't mind if the profiler were slow; slow is better than nothing (manual profiling is impractical). I use AQtime for the CPU, and I'm quite happy with it. Concerning CS projects, aren't we talking about f***ing NVIDIA here? I find it hard to believe that they lack the resources. Any practical alternative, such as a GPU simulator, would also be good enough for me, but give me something. I don't need to use the profiler every minute, but once in a while I need to see where to put my effort. When I'm programming for the CPU, a profiler saves me from wasting a lot of time optimizing code that takes hardly 1% of the running time.

I think I understood your example, but it made me realize that this is exactly the sort of thing I need the profiler to tell me: that at this point some cores are standing idle and I didn't use them, or that there's no point in further optimizing the math instructions here. If this is a hardware matter, such as adding an instruction-based counter or something, I think someone should point that out to NVIDIA. Look, I admit I'm not an engineer, and frankly I'm quite far from the subject, but I find it hard to believe that this can't be done (i.e. that it is too hard or impractical).

NVIDIA can’t solve all the research projects by themselves because they have a dozen other research projects going on, like the parts that bring you the CUDA hardware and software we have already. :)

I want profiling too, and I really hope they are working on this now. I’m just saying that I don’t think it is an easy problem to solve, and NVIDIA has a vast pile of problems to solve.

Sure, if there is a hardware solution to this problem, then that would be fantastic. The kernel-level counters in the Visual Profiler are pretty nice, and I keep wondering if there is something clever that could be done with the custom profiling counters. (You should take a look at those as well.)
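On the homemade side, a crude counter is also easy to rig up yourself with an atomic, if all you want to know is how often some condition is hit inside a kernel. A sketch with an invented kernel and condition (needs a GPU with 64-bit global atomics, and the atomic adds overhead of its own, so treat it as a diagnostic build rather than something to ship):

```cpp
// A homemade "custom counter": count how many threads take the slow path.
__device__ unsigned long long slow_path_count = 0;

__global__ void counted_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = in[i];
    if (v < 0.0f) {                         // hypothetical "slow" branch
        atomicAdd(&slow_path_count, 1ULL);
        v = -v;
    }
    out[i] = v;
}

// After the kernel, the host can read the counter back with something like:
//   unsigned long long count = 0;
//   cudaMemcpyFromSymbol(&count, slow_path_count, sizeof(count));
```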