Could a FLOPS counter be just a simple script?

Hi. Is there any program that scans PTX code and reports the FLOP count for each kernel?
I think it is not very hard to do that, am I right? And could we estimate the performance of a kernel with such a program, or not?

If no one has such a program, I’ll make it myself. Please tell me if there is no need to do that.

I do not know of such a program. I also think it is not trivial, as you have to think of what to do in case of branches like if and switch statements.
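For illustration, here is a minimal sketch of what a purely static counter could look like. It is only a sketch under stated assumptions: the opcode table is an illustrative subset rather than the full PTX instruction set, the FLOP weights (2 for `fma`/`mad`, 1 for the rest) are a convention chosen here, and `SAMPLE_PTX` is a hand-written fragment, not real compiler output. It counts every instruction once, so it has exactly the branch and loop problem described above.

```python
import re
from collections import Counter

# Illustrative subset of floating-point opcodes and assumed FLOP weights.
FLOP_WEIGHTS = {"add": 1, "sub": 1, "mul": 1, "div": 1, "fma": 2, "mad": 2}

# Matches e.g. "fma.rn.f32 %f3, ..." with an optional predicate such as "@%p1".
INSTR_RE = re.compile(r"^\s*(?:@!?%\w+\s+)?(\w+)((?:\.\w+)*)\s")
ENTRY_RE = re.compile(r"\.entry\s+(\w+)")


def count_static_flops(ptx_text):
    """Return {kernel_name: static FLOP count}; each instruction is counted once."""
    counts = Counter()
    kernel = None
    for line in ptx_text.splitlines():
        entry = ENTRY_RE.search(line)
        if entry:
            kernel = entry.group(1)
            counts[kernel] += 0  # make sure the kernel is listed even with 0 FLOPs
            continue
        match = INSTR_RE.match(line)
        if kernel is None or not match:
            continue
        opcode = match.group(1)
        type_suffix = match.group(2).split(".")[-1]
        # Only count arithmetic on floating-point types.
        if opcode in FLOP_WEIGHTS and type_suffix in ("f32", "f64"):
            counts[kernel] += FLOP_WEIGHTS[opcode]
    return dict(counts)


# Hand-written fragment for demonstration only.
SAMPLE_PTX = """
.visible .entry saxpy(.param .f32 a, .param .u64 x, .param .u64 y)
{
    ld.global.f32 %f1, [%rd1];
    ld.global.f32 %f2, [%rd2];
    fma.rn.f32 %f3, %f0, %f1, %f2;
    st.global.f32 [%rd2], %f3;
    ret;
}
"""

print(count_static_flops(SAMPLE_PTX))  # {'saxpy': 2}
```

The result is a per-thread, per-pass count of the instruction text; it says nothing about how often each instruction actually executes.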

Not in the slightest

To elaborate on that, a straightforward counting of FLOPS can’t estimate performance because it doesn’t measure the interaction with the memory subsystem. Memory bandwidth and memory latency are often significant (if not the primary) limiting factors in many CUDA kernels.

Sorry, I should have said operations counter.

Interaction with the memory subsystem is exactly what I want to measure.

I did not state the main idea, sorry.

  1. launch kernel and measure time

  2. scan the PTX code and count operations

  3. performance = operations/time

So we know the graphics chip’s peak FLOPS and our achieved FLOPS. If our FLOPS << the chip’s peak FLOPS, then we have memory problems.
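The three steps and the comparison can be sketched as a small calculation. This is only an illustration: the function names are made up here, and the operation count, timing, and peak figure are hypothetical numbers, not measurements from any real card.

```python
def achieved_gflops(op_count, seconds):
    """Operations per second, expressed in GFLOPS."""
    return op_count / seconds / 1e9


def compute_utilization(op_count, seconds, peak_gflops):
    """Fraction of the chip's peak arithmetic rate that the kernel reached."""
    return achieved_gflops(op_count, seconds) / peak_gflops


# Hypothetical example: 1,000,000 threads, 2 FLOPs each, measured at 0.1 ms,
# on a chip with an assumed peak of 500 GFLOPS.
ops = 2 * 1_000_000
elapsed = 1e-4
peak = 500.0

print(f"achieved: {achieved_gflops(ops, elapsed):.1f} GFLOPS")
print(f"utilization: {compute_utilization(ops, elapsed, peak):.1%}")
```

A low utilization only says the arithmetic units were mostly idle; by itself it does not prove that memory is the cause.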

All I want to know is: “is there any memory-limited code in my program or not?”

Can a PTX scanner be useful in that way or not?

As for branches… yes, it’s a problem, a real problem. I did not think about that.

I think a solution exists: run the program on an emulator and measure how often each branch is actually executed.
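To make the idea concrete: if the emulator could report how many times each basic block ran, the dynamic operation count would be the static count of each block weighted by its execution count. The sketch below assumes such per-block data is available; the block names and numbers are hypothetical.

```python
def dynamic_op_count(block_ops, block_exec_counts):
    """Sum of (operations in block) * (times the block was executed).

    block_ops:         {block_name: static operation count of that block}
    block_exec_counts: {block_name: executions reported by the emulator}
    Blocks that never ran are treated as executed 0 times.
    """
    return sum(ops * block_exec_counts.get(name, 0)
               for name, ops in block_ops.items())


# Hypothetical kernel with one if/else, run by 1024 threads.
block_ops = {"entry": 4, "then_branch": 10, "else_branch": 2}
block_exec_counts = {"entry": 1024, "then_branch": 256, "else_branch": 768}

static_total = sum(block_ops.values())
dynamic_total = dynamic_op_count(block_ops, block_exec_counts)
print(static_total, dynamic_total)  # 16 8192
```

The hard part remains what the sketch takes for granted: splitting the PTX into basic blocks and getting execution counts out of the emulator.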

But it’s hard, because we need to parse the PTX code to count the branches separately. E.D. Riedijk is right; my idea is not really good.

And parsing PTX code is exactly what I don’t want to do :) Maybe only if the grammar rules for PTX are available somewhere in one place.


Since you are from MSU, you can simply pick up the PhD thesis of Alex Egorov, who developed such a tool for Fortran and C programs, or search for Alex’s contacts on odnoklassniki or Google :) He should definitely have his old software, or at least can give you some tips on how to do it carefully.

IMHO: usually it is not difficult to construct an example where I know the total number of FLOPs for sure and run it to measure performance, or to append a small formula to the algorithm that accumulates the total number of FLOPs in one long register and prints it at the end of the calculation.
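Both variants can be shown on one small example. The sketch below uses a plain-Python matrix multiply purely as a stand-in for a kernel (the function name is made up here): the counter plays the role of the "long register", and the known closed-form total for this algorithm, 2·n·m·p, is the check.

```python
def matmul_with_flop_count(a, b):
    """Naive matrix multiply that also counts its floating-point operations."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    flops = 0  # the accumulated counter ("one long register")
    for i in range(n):
        for j in range(p):
            acc = 0.0
            for k in range(m):
                acc += a[i][k] * b[k][j]
                flops += 2  # one multiply and one add per inner iteration
            c[i][j] = acc
    return c, flops


n = 4
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
matrix = [[float(i * n + j) for j in range(n)] for i in range(n)]

result, counted = matmul_with_flop_count(identity, matrix)
print(counted, 2 * n * n * n)  # 128 128
```

In a real kernel the counter would presumably live per thread and be summed at the end, and the extra increments would themselves perturb the timing, so counting and timing are better done in separate runs.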



Actually, such a tool sounds very useful, if you can run it and it automatically highlights problem code. A guru might not need it, but a beginner (i.e., most CUDA users) would find it invaluable.

But then again, if you’re going to run the PTX through an emulator, you could actually just analyze the instructions and predict problems. I.e., you could see and count where there are non-coalesced accesses, bank conflicts, or just a lot of memory reads. This would provide a very accurate line-by-line analysis.

OK, thanks all. I think it is quite difficult to make a complete, fully automatic operations counter, because we need to integrate it with an emulator.
I see that I can’t make it in 1-2 weeks.

If that is all you want to know, your idea is needlessly complex. Just run the kernel a few times - with reduced core clock, with reduced memory clock, with both reduced - and compare the timing values.

If reducing the core clock makes (almost) no difference you are not compute bound. If reducing the memory clock makes (almost) no difference you are not memory bandwidth-bound.
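These two rules are simple enough to write down as a small classifier. This is a sketch, not a tested tool: the function name, the 5% tolerance for "(almost) no difference", and the timings in the example are all assumptions made for illustration.

```python
def classify_bottleneck(t_base, t_core_reduced, t_mem_reduced, tolerance=0.05):
    """Apply the clock-scaling rules to three timings of the same kernel.

    t_base:         time at default clocks
    t_core_reduced: time with only the core clock reduced
    t_mem_reduced:  time with only the memory clock reduced
    tolerance:      relative slowdown below which a change counts as
                    "(almost) no difference" (assumed value)
    """
    core_sensitive = (t_core_reduced / t_base - 1.0) > tolerance
    mem_sensitive = (t_mem_reduced / t_base - 1.0) > tolerance
    if core_sensitive and mem_sensitive:
        return "sensitive to both clocks"
    if core_sensitive:
        return "not memory-bandwidth-bound, likely compute-bound"
    if mem_sensitive:
        return "not compute-bound, likely memory-bandwidth-bound"
    return "neither; possibly latency-bound or dominated by overhead"


# Hypothetical timings in milliseconds.
print(classify_bottleneck(10.0, 10.1, 12.5))
```

With noisy timings it would make sense to average several runs per clock setting before classifying.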

Other problems like memory latency might be harder to detect, but with enough samples at different clock frequencies it probably is possible too.

And with a tool like nvclock it should not be that hard to get the data in an automated way. Analyzing it automatically beyond the simple rules I gave (e.g. answering questions like: “If I optimize only for memory bandwidth, what speedup can I get at most?”) will be more effort though.