I do this sort of thing manually by counting the floating-point operations in my kernel, both in the PTX output and by analyzing the algorithm itself
(the compiler usually saves a few operations through fused multiply-add, so both counts are relevant). I then multiply this number by the number of
grains I can process per second, and I have something to please the viewers.
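As a rough sketch of that arithmetic: the function name `estimate_gflops` and the numbers below are made up for illustration, and counting each mad as two floating-point operations is an assumption, not something the PTX count gives you directly.

```shell
#!/bin/bash
# Hypothetical helper: estimate GFLOP/s from per-grain operation counts.
#   $1 = plain ops per grain (add + sub + mul + div)
#   $2 = mad ops per grain (each counted as 2 operations; an assumption)
#   $3 = grains processed per second
estimate_gflops() {
    awk -v ops="$1" -v mad="$2" -v rate="$3" \
        'BEGIN { printf "%.3f\n", (ops + 2 * mad) * rate / 1e9 }'
}

# Made-up example: 40 plain ops, 12 mads, 5 million grains per second
estimate_gflops 40 12 5000000   # prints 0.320
```

awk does the floating-point math here because bash arithmetic is integer-only.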
For getting the number of PTX instructions, the following shell script is quite handy:
[codebox]#!/bin/bash
# Count floating-point and branch instructions in a PTX file.
PTXFILE=$1
if [ "$PTXFILE" = "" ]; then
    echo "usage: ./ptxAnalyze.sh file.ptx"
    exit 1
fi
# "count + 0" makes awk print 0 instead of an empty string when nothing matches,
# which would otherwise break the arithmetic for SUM below.
ADD=$(awk '/add\.f32/ {count += 1} END {print count + 0}' "$PTXFILE")
SUB=$(awk '/sub\.f32/ {count += 1} END {print count + 0}' "$PTXFILE")
MUL=$(awk '/mul\.f32/ {count += 1} END {print count + 0}' "$PTXFILE")
DIV=$(awk '/div\.f32/ {count += 1} END {print count + 0}' "$PTXFILE")
SUM=$((ADD + SUB + MUL + DIV))
MAD=$(awk '/mad\.f32/ {count += 1} END {print count + 0}' "$PTXFILE")
# Predicated branches: a tab, then @$p... or @!$p..., followed by bra
BRA=$(awk '/^\t@[!\$][\$p].*bra/ {count += 1} END {print count + 0}' "$PTXFILE")
echo "------------------------------------------------------------"
echo "add.f32: $ADD"
echo "sub.f32: $SUB"
echo "mul.f32: $MUL"
echo "div.f32: $DIV"
echo "------------------------------------------------------------"
echo "SUM: $SUM"
echo "------------------------------------------------------------"
echo "mad.f32: $MAD"
echo "bra: $BRA"
echo "------------------------------------------------------------"[/codebox]
The profiler can provide you with the number of native instructions executed, but I haven’t looked into that very much.