evaluate the FLOPS

Hi everyone:

I’d like to ask whether there is an approach to measuring the FLOPS of a particular CUDA application in real time, or at least one that gives me the maximum FLOPS after execution. I think it would be useful for seeing whether my application fully utilizes the GPU’s computational resources. Thanks a lot.


I do this sort of thing manually, by simply counting the number of floating-point operations inside my kernel, both in the PTX output and through an analytical count for the algorithm itself. (The compiler usually saves a few operations through fused multiply-adds, so both counts are relevant.) Then I multiply this number by the number of grains I can process per second, and I have something to please the viewers.
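As a back-of-the-envelope sketch of that arithmetic (all numbers below are invented for illustration, not measured from any real kernel):

```shell
# Rough FLOPS estimate from a hand count; the values are made up.
OPS_PER_ELEMENT=48            # float ops per element, counted in the PTX
ELEMENTS_PER_SEC=2000000000   # elements processed per second, from timing
# Note: tally a fused multiply-add as 2 FLOPs if you count adds and muls.
GFLOPS=$(( OPS_PER_ELEMENT * ELEMENTS_PER_SEC / 1000000000 ))
echo "estimated GFLOPS: $GFLOPS"   # prints: estimated GFLOPS: 96
```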

For getting the number of PTX instructions, the following shell script is quite handy:



[codebox]#!/bin/sh
PTXFILE=$1

if [ "$PTXFILE" = "" ]; then
    echo "usage: ./ptxAnalyze.sh file.ptx"
    exit 1
fi

# "+ 0" forces a numeric 0 when nothing matches, so the SUM arithmetic works.
ADD=$(awk '/add.f32/ {count += 1} END {print count + 0}' $PTXFILE)
SUB=$(awk '/sub.f32/ {count += 1} END {print count + 0}' $PTXFILE)
MUL=$(awk '/mul.f32/ {count += 1} END {print count + 0}' $PTXFILE)
DIV=$(awk '/div.f32/ {count += 1} END {print count + 0}' $PTXFILE)
SUM=$(($ADD + $SUB + $MUL + $DIV))
MAD=$(awk '/mad.f32/ {count += 1} END {print count + 0}' $PTXFILE)
BRA=$(awk '/^\t@[!\$][\$p].*bra/ {count += 1} END {print count + 0}' $PTXFILE)

echo "------------------------------------------------------------"
echo "add.f32:" $ADD
echo "sub.f32:" $SUB
echo "mul.f32:" $MUL
echo "div.f32:" $DIV
echo "------------------------------------------------------------"
echo "SUM: " $SUM
echo "------------------------------------------------------------"
echo "mad.f32:" $MAD
echo "bra:" $BRA
echo "------------------------------------------------------------"[/codebox]

The profiler can provide you with the number of native instructions executed, but I haven’t looked into that very much.

FLOPS doesn’t really measure that. You may be maximizing the DRAM bandwidth. Why doesn’t that count? The DRAM bandwidth on a GPU, at an incredible 140GB/s, is just as pride-worthy as its FLOPS rate. Or what about maximizing on-die memory? GPUs’ on-die SRAMs are often the most responsible for “100x” speedups.

Anyway, the Visual Profiler will give some of the data you need, although in a raw form. It reports the total count of executed instructions and also DRAM accesses, but the figures have to be post-processed a bit to get actual MIPS and GB/s. (You have to divide by time, multiply by fetch size, etc.) It’d be cool if someone wrote a script that did that.
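As a rough sketch of that post-processing (the counter values below are placeholders, not real profiler output; substitute whatever counts and timings your profiler version actually reports for your kernel):

```shell
# Sketch only: plug in the real instruction count and GPU time
# from the profiler output for your kernel.
INSTRUCTIONS=5000000   # executed-instruction count (placeholder)
TIME_US=1000           # GPU time in microseconds (placeholder)
# instructions per microsecond equals millions of instructions per second
MIPS=$(( INSTRUCTIONS / TIME_US ))
echo "MIPS: $MIPS"     # prints: MIPS: 5000
```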

(Btw, fugl, counting up the number of instructions in your kernel doesn’t say anything. What about loops?)

You’re quite right on the loops, I should have mentioned that too. I can do it in my case (collision detection of triangles and oriented bounding boxes) since I don’t have any loops or branches in my algorithms.

Thanks for your nice suggestions. I can see that the Visual Profiler provides the number of memory accesses for a particular kernel, so if I divide by the time I can get the average bandwidth.

There are nuances. E.g., you can’t know exactly what the request size was, but if you only use float1s I think you can assume 64 bytes per coalesced access and 4 bytes per uncoalesced access. Experiment a little.
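To make that concrete, here is a tiny sketch with made-up access counts, using the assumed sizes above (64 bytes per coalesced access, 4 bytes per uncoalesced one); pull the real counts from the profiler:

```shell
# Illustrative counts only; take yours from the profiler output.
COALESCED=100000     # coalesced accesses, assumed 64 bytes each
UNCOALESCED=5000     # uncoalesced accesses, assumed 4 bytes each
TIME_US=500          # kernel time in microseconds
BYTES=$(( COALESCED * 64 + UNCOALESCED * 4 ))
MBPS=$(( BYTES / TIME_US ))   # bytes per microsecond == MB/s
echo "approx bandwidth: $MBPS MB/s"
```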