Hi. Is there any program that scans PTX code and reports a FLOP count for each kernel?
I don't think it would be very hard to do, am I right? And can we estimate kernel performance with such a program, or not?
If no one has such a program, I'll make it myself. Please tell me if there is no need for it.
Thanks.
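For what it's worth, a first pass at such a scanner can be quite small. Here is a minimal Python sketch that counts static floating-point instructions in PTX text; the opcode list and modifier handling are my own assumptions based on common PTX mnemonics (`add.f32`, `fma.rn.f32`, ...) and are far from exhaustive, and a static count ignores loop trip counts and predication:

```python
import re
from collections import Counter

# Illustrative (not exhaustive) set of FP arithmetic mnemonics,
# with an optional rounding/approximation modifier in between.
FP_OPS = re.compile(
    r'\b(add|sub|mul|div|fma|mad|rcp|sqrt|rsqrt)'
    r'\.(?:rn\.|rz\.|approx\.)?(f32|f64)\b'
)

def count_flops(ptx_text):
    """Return a Counter of static FP instruction counts per opcode.

    fma/mad count as 2 FLOPs (multiply + add). This is only a
    rough per-iteration estimate: loops and predication are ignored.
    """
    counts = Counter()
    for op, ftype in FP_OPS.findall(ptx_text):
        counts[f'{op}.{ftype}'] += 2 if op in ('fma', 'mad') else 1
    return counts

sample = """
    add.f32 %f3, %f1, %f2;
    fma.rn.f32 %f4, %f3, %f1, %f2;
    ld.global.f32 %f5, [%rd1];
"""
print(count_flops(sample))  # fma counts double; loads are not FLOPs
```

Feeding it the output of `nvcc --ptx` would give a rough static FLOP figure per kernel, nothing more.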
To elaborate on that: a straightforward FLOP count can't estimate performance because it doesn't capture the interaction with the memory subsystem. Memory bandwidth and memory latency are often significant (if not the primary) limiting factors in many CUDA kernels.
Since you are from MSU, you can simply pick up the PhD thesis of Alex Egorov, who developed such a tool for Fortran and C programs, or look up Alex's contacts on Odnoklassniki or Google :) He should still have his old software, or at least be able to give you some tips on how to do it properly.
IMHO: it is usually not difficult either to construct an example where I know the total number of FLOPs for sure and run it for performance measurement, or to add a small counter to the algorithm that accumulates the total FLOP count in one long register and prints it at the end of the calculation.
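The first half of that suggestion is easy to sketch. For a kernel whose arithmetic is trivially countable (e.g. SAXPY, `y = a*x + y`, which does one multiply and one add per element, i.e. 2*N FLOPs), the throughput is just the known count divided by the measured runtime. The numbers below are made-up placeholders:

```python
# Known-FLOP-count benchmarking: divide an analytically known
# FLOP total by the measured kernel time.
def gflops(flop_count, elapsed_seconds):
    return flop_count / elapsed_seconds / 1e9

n = 1 << 24          # 16M elements
flops = 2 * n        # SAXPY: one mul + one add per element
elapsed = 0.00135    # hypothetical measured kernel time, in seconds
print(f"{gflops(flops, elapsed):.1f} GFLOP/s")
```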
Actually, such a tool sounds very useful if you can run it and it automatically highlights problem code. A guru might not need it, but a beginner (i.e., most CUDA users) would find it invaluable.
But then again, if you’re going to run the PTX through an emulator, you could just analyze the instructions and predict problems directly. That is, you could detect and count non-coalesced accesses, bank conflicts, or simply an excessive number of memory reads. This would provide a very accurate line-by-line analysis.
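As an illustration of the kind of static check such an analyzer could perform, here is a small Python sketch that measures shared-memory bank conflicts for one half-warp's worth of addresses. It assumes 16 four-byte banks per half-warp (compute capability 1.x style) and treats identical addresses as a broadcast; both are simplifications of the real hardware:

```python
from collections import Counter

def bank_conflict_degree(byte_addresses, num_banks=16):
    """Worst-case conflict degree: the max number of distinct
    addresses hitting the same bank. 1 means conflict-free.
    Duplicate addresses are dropped, approximating the hardware's
    broadcast mechanism."""
    banks = Counter((addr // 4) % num_banks
                    for addr in set(byte_addresses))
    return max(banks.values())

# Stride-2 float access by a half-warp of 16 threads: thread t
# reads s_data[2*t], so the bank pattern repeats after 8 threads.
addrs = [4 * 2 * t for t in range(16)]
print(bank_conflict_degree(addrs))  # 2 (a 2-way conflict)
```

An emulator already knows every address each thread touches, so attaching checks like this is mostly bookkeeping.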
OK, thanks all. I think it is quite difficult to make a complete, fully automatic operation counter, because we would need to integrate it with an emulator.
I can see that I can't make it in 1-2 weeks.
If that is all you want to know, your idea is needlessly complex. Just run the kernel a few times (with reduced core clock, with reduced memory clock, and with both reduced) and compare the timings.
If reducing the core clock makes (almost) no difference you are not compute bound. If reducing the memory clock makes (almost) no difference you are not memory bandwidth-bound.
Other problems like memory latency might be harder to detect, but with enough samples at different clock frequencies it probably is possible too.
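Those two rules can be written down as a tiny classifier. This is just a sketch: the timings would come from your own measurements at clocks set with something like nvclock, and the 5% threshold is an arbitrary choice:

```python
# Classify a kernel from three timings of the same run:
# at normal clocks, with reduced core clock, with reduced memory clock.
def bound_by(t_normal, t_low_core, t_low_mem, threshold=0.05):
    verdict = []
    if (t_low_core - t_normal) / t_normal > threshold:
        verdict.append("compute")
    if (t_low_mem - t_normal) / t_normal > threshold:
        verdict.append("memory bandwidth")
    # No slowdown from either clock suggests latency or launch overhead.
    return verdict or ["neither (latency? overhead?)"]

# Slows down only when the memory clock is reduced:
print(bound_by(10.0, 10.2, 14.5))  # ['memory bandwidth']
```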
And with a tool like nvclock it should not be that hard to get the data in an automated way. Analyzing it automatically beyond the simple rules I gave (e.g., answering questions like "If I optimize only for memory bandwidth, what speedup can I get at most?") will take more effort, though.