evaluate the FLOPS

mianlu · November 24, 2008, 10:01am

Hi everyone:

I’d like to ask whether there is a way to measure the FLOPS of a particular CUDA application in real time, or at least to report the maximum FLOPS after it has run. I think this would be useful for seeing whether my application fully utilizes the GPU’s computation resources. Thanks a lot.

Mian

Fugl · November 24, 2008, 11:16am

I do this sort of thing manually by simply counting the number of floating-point operations inside my kernel, both in the PTX output and through analysis of the algorithm itself (the compiler usually saves a few operations through fused multiply-add, so both counts are relevant). Then I simply multiply this number by the number of grains I can process per second, and I then have something to please the viewers.
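The arithmetic described above can be sketched as follows: operations per grain, times grains processed, divided by elapsed time. The function name `estimate_gflops` and all the numbers are made up for illustration; they are not taken from any real kernel.

```shell
#!/bin/bash
# Hypothetical helper: estimate GFLOPS from a hand-counted operation total.
#   $1 = floating-point operations per grain (counted by hand or from PTX)
#   $2 = number of grains processed
#   $3 = elapsed time in seconds
estimate_gflops() {
    awk -v ops="$1" -v grains="$2" -v secs="$3" \
        'BEGIN { printf "%.2f\n", ops * grains / secs / 1e9 }'
}

# Made-up example: 120 operations per grain, 50 million grains in 0.25 s.
estimate_gflops 120 50000000 0.25
```

awk is used for the division because bash arithmetic is integer-only.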

For getting the number of PTX instructions, the following shell script is quite handy:

[codebox]#!/bin/bash
# Count selected floating-point and branch instructions in a PTX file.

PTXFILE=$1

if [ -z "$PTXFILE" ]; then
    echo "usage: ./ptxAnalyze.sh file.ptx"
    exit 1
fi

# "count + 0" makes awk print 0 instead of an empty string when nothing matches.
ADD=$(awk '/add\.f32/ {count += 1} END {print count + 0}' "$PTXFILE")
SUB=$(awk '/sub\.f32/ {count += 1} END {print count + 0}' "$PTXFILE")
MUL=$(awk '/mul\.f32/ {count += 1} END {print count + 0}' "$PTXFILE")
DIV=$(awk '/div\.f32/ {count += 1} END {print count + 0}' "$PTXFILE")
SUM=$((ADD + SUB + MUL + DIV))
MAD=$(awk '/mad\.f32/ {count += 1} END {print count + 0}' "$PTXFILE")
# Predicated branches only, e.g. "@$p1 bra" or "@!$p1 bra".
BRA=$(awk '/^\t@[!\$][\$p].*bra/ {count += 1} END {print count + 0}' "$PTXFILE")

echo "------------------------------------------------------------"
echo "add.f32:" $ADD
echo "sub.f32:" $SUB
echo "mul.f32:" $MUL
echo "div.f32:" $DIV
echo "------------------------------------------------------------"
echo "SUM: " $SUM
echo "------------------------------------------------------------"
echo "mad.f32:" $MAD
echo "bra:" $BRA
echo "------------------------------------------------------------"[/codebox]

The profiler can provide you with the number of native instructions executed, but I haven’t looked into that very much.

alex_dubinsky · November 25, 2008, 6:39am

FLOPS doesn’t really measure that. You may be maximizing the DRAM bandwidth. Why doesn’t that count? The DRAM bandwidth on a GPU, at an incredible 140 GB/s, is just as pride-worthy as its FLOPS rate. Or what about maximizing on-die memory? GPUs’ on-die SRAMs are often what is most responsible for “100x” speedups.

Anyway, the Visual Profiler will give some of the data you need, although in a raw form. It reports the total count of executed instructions and also DRAM accesses, but the figures have to be post-processed a bit to get actual MIPS and GB/s. (You have to divide by time, multiply by fetch size, etc.) It’d be cool if someone wrote a script that did that.
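A minimal sketch of that post-processing, assuming you have already read the instruction count, the memory-access count, and the kernel time off the profiler by hand. The function names and sample numbers are hypothetical, and the bytes-per-access value is an input you have to choose yourself. How the raw counters map to whole-kernel totals depends on the profiler and GPU, so that step is left out here.

```shell
#!/bin/bash
# Hypothetical helpers; every input is a value taken from the profiler by hand.

# $1 = executed instructions, $2 = kernel time in seconds -> MIPS
to_mips() {
    awk -v n="$1" -v t="$2" 'BEGIN { printf "%.1f\n", n / t / 1e6 }'
}

# $1 = memory accesses, $2 = assumed bytes per access, $3 = seconds -> GB/s
to_gbps() {
    awk -v n="$1" -v b="$2" -v t="$3" \
        'BEGIN { printf "%.2f\n", n * b / t / 1e9 }'
}

# Made-up example: 8 million instructions and 1 million 64-byte accesses in 2 ms.
to_mips 8000000 0.002
to_gbps 1000000 64 0.002
```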

(Btw, fugl, counting up the number of instructions in your kernel doesn’t say anything. What about loops?)

Fugl · November 25, 2008, 8:17am

You’re quite right on the loops, I should have mentioned that too. I can do it in my case (collision detection of triangles and oriented bounding boxes) since I don’t have any loops or branches in my algorithms.

mianlu · November 25, 2008, 8:42am

Thanks for your nice suggestions. I can see that the Visual Profiler provides the number of memory accesses for a particular kernel, so if I divide by the kernel time I can get the average bandwidth.

alex_dubinsky · November 25, 2008, 5:48pm

There are nuances. E.g., you can’t know exactly what the request size was, but if you only use float1s I think you can assume 64 bytes for each coalesced access and 4 bytes for each uncoalesced access. Experiment a little.
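Under that assumption (64 bytes per coalesced access, 4 bytes per uncoalesced access, float1 data only), a rough bandwidth estimate might look like the sketch below. The function name and the counts are invented for illustration, and the byte sizes are the guess from this post, not measured values.

```shell
#!/bin/bash
# Hypothetical helper: bandwidth estimate from separate access counts.
#   $1 = coalesced accesses   (assumed 64 bytes each)
#   $2 = uncoalesced accesses (assumed 4 bytes each)
#   $3 = kernel time in seconds
estimate_bandwidth_gbps() {
    awk -v c="$1" -v u="$2" -v t="$3" \
        'BEGIN { printf "%.2f\n", (c * 64 + u * 4) / t / 1e9 }'
}

# Made-up example: 1 million coalesced and 500 thousand uncoalesced accesses in 1 ms.
estimate_bandwidth_gbps 1000000 500000 0.001
```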
