Hi. Is there any program that scans PTX code and reports a FLOP count for each kernel?
I don't think it would be very hard to do, am I right? And can we estimate kernel performance with such a program, or not?
If no one has such a program, I'll make it myself. Please tell me if there is no need for it.
Thanks.
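For what it's worth, a first pass at such a scanner can be quite small. Here is a minimal Python sketch that counts static floating-point instructions in PTX text; the opcode list and modifier handling are my own assumptions based on common PTX mnemonics (`add.f32`, `fma.rn.f32`, ...) and are far from exhaustive, and a static count ignores loop trip counts and predication:

```python
import re
from collections import Counter

# Illustrative (not exhaustive) set of FP arithmetic mnemonics,
# with an optional rounding/approximation modifier in between.
FP_OPS = re.compile(
    r'\b(add|sub|mul|div|fma|mad|rcp|sqrt|rsqrt)'
    r'\.(?:rn\.|rz\.|approx\.)?(f32|f64)\b'
)

def count_flops(ptx_text):
    """Return a Counter of static FP instruction counts per opcode.

    fma/mad count as 2 FLOPs (multiply + add). This is only a
    rough per-iteration estimate: loops and predication are ignored.
    """
    counts = Counter()
    for op, ftype in FP_OPS.findall(ptx_text):
        counts[f'{op}.{ftype}'] += 2 if op in ('fma', 'mad') else 1
    return counts

sample = """
    add.f32 %f3, %f1, %f2;
    fma.rn.f32 %f4, %f3, %f1, %f2;
    ld.global.f32 %f5, [%rd1];
"""
print(count_flops(sample))  # fma counts double; loads are not FLOPs
```

Feeding it the output of `nvcc --ptx` would give a rough static FLOP figure per kernel, nothing more.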
To elaborate on that: a straightforward FLOP count can't estimate performance because it doesn't capture the interaction with the memory subsystem. Memory bandwidth and memory latency are often significant (if not the primary) limiting factors in many CUDA kernels.
Since you are from MSU, you can simply pick up the PhD thesis of Alex Egorov, who developed such a tool for Fortran and C programs, or look up Alex's contacts on Odnoklassniki or Google :) He should still have his old software, or at least be able to give you some tips on how to do it properly.
IMHO: it is usually not difficult either to construct an example where I know the total number of FLOPs for sure and run it for performance measurement, or to add a small counter to the algorithm that accumulates the total FLOP count in one long register and prints it at the end of the calculation.
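The first half of that suggestion is easy to sketch. For a kernel whose arithmetic is trivially countable (e.g. SAXPY, `y = a*x + y`, which does one multiply and one add per element, i.e. 2*N FLOPs), the throughput is just the known count divided by the measured runtime. The numbers below are made-up placeholders:

```python
# Known-FLOP-count benchmarking: divide an analytically known
# FLOP total by the measured kernel time.
def gflops(flop_count, elapsed_seconds):
    return flop_count / elapsed_seconds / 1e9

n = 1 << 24          # 16M elements
flops = 2 * n        # SAXPY: one mul + one add per element
elapsed = 0.00135    # hypothetical measured kernel time, in seconds
print(f"{gflops(flops, elapsed):.1f} GFLOP/s")
```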
Actually, such a tool sounds very useful if you can run it and it automatically highlights problem code. A guru might not need it, but a beginner (i.e., most CUDA users) would find it invaluable.
But then again, if you’re going to run the PTX through an emulator, you could just analyze the instructions and predict problems directly. That is, you could detect and count non-coalesced accesses, bank conflicts, or simply an excessive number of memory reads. This would provide a very accurate line-by-line analysis.
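As an illustration of the kind of static check such an analyzer could perform, here is a small Python sketch that measures shared-memory bank conflicts for one half-warp's worth of addresses. It assumes 16 four-byte banks per half-warp (compute capability 1.x style) and treats identical addresses as a broadcast; both are simplifications of the real hardware:

```python
from collections import Counter

def bank_conflict_degree(byte_addresses, num_banks=16):
    """Worst-case conflict degree: the max number of distinct
    addresses hitting the same bank. 1 means conflict-free.
    Duplicate addresses are dropped, approximating the hardware's
    broadcast mechanism."""
    banks = Counter((addr // 4) % num_banks
                    for addr in set(byte_addresses))
    return max(banks.values())

# Stride-2 float access by a half-warp of 16 threads: thread t
# reads s_data[2*t], so the bank pattern repeats after 8 threads.
addrs = [4 * 2 * t for t in range(16)]
print(bank_conflict_degree(addrs))  # 2 (a 2-way conflict)
```

An emulator already knows every address each thread touches, so attaching checks like this is mostly bookkeeping.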
OK, thanks all. I think it is quite difficult to make a complete, fully automatic operation counter, because we would need to integrate it with an emulator.
I can see that I can't make it in 1-2 weeks.
If that is all you want to know, your idea is needlessly complex. Just run the kernel a few times (with reduced core clock, with reduced memory clock, and with both reduced) and compare the timings.
If reducing the core clock makes (almost) no difference you are not compute bound. If reducing the memory clock makes (almost) no difference you are not memory bandwidth-bound.
Other problems like memory latency might be harder to detect, but with enough samples at different clock frequencies it probably is possible too.
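Those two rules can be written down as a tiny classifier. This is just a sketch: the timings would come from your own measurements at clocks set with something like nvclock, and the 5% threshold is an arbitrary choice:

```python
# Classify a kernel from three timings of the same run:
# at normal clocks, with reduced core clock, with reduced memory clock.
def bound_by(t_normal, t_low_core, t_low_mem, threshold=0.05):
    verdict = []
    if (t_low_core - t_normal) / t_normal > threshold:
        verdict.append("compute")
    if (t_low_mem - t_normal) / t_normal > threshold:
        verdict.append("memory bandwidth")
    # No slowdown from either clock suggests latency or launch overhead.
    return verdict or ["neither (latency? overhead?)"]

# Slows down only when the memory clock is reduced:
print(bound_by(10.0, 10.2, 14.5))  # ['memory bandwidth']
```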
And with a tool like nvclock it should not be that hard to get the data in an automated way. Analyzing it automatically beyond the simple rules I gave (e.g., answering questions like "If I optimize only for memory bandwidth, what speedup can I get at most?") will take more effort, though.