Calculating Gflops, memory bandwidth, and Visual Profiler counters (performance calculation)

Hi, I wrote a simple program to play with, and I would like to calculate its performance (Gflops, memory bandwidth, anything else…).

I would also like to find out why some of the numbers in the Visual Profiler are what they are.

The program is really simple: it just clears an array.

Every thread has its own part of the array that the kernel clears. Accesses are coalesced.

In this version I use 256 threads, 32 blocks, and each thread uses an array of 2000 elements,

so this is 256 × 32 × 2000 = 16,384,000 elements, or 16,384,000 × 4 = 65,536,000 bytes in total.

First question: the Visual Profiler column ‘gst_coalesced’ = ‘4096000’.

Why are there 4 times fewer stores than elements in the array? Could this be due to storing 4 floats at once (because of the RGBA nature of the device)?

The second question is also about the Visual Profiler: column ‘divergent_branch’ = ‘16’.

Why are there 16 divergent branches if there is not a single if statement in the kernel? I understand that the for loop also uses a comparison,

but in this case the loop limit is a constant that is broadcast to every thread in the same warp, so none of the threads should finish earlier than the others.

All right, now a bit about calculating performance for this kernel.

Since I am using an 8400 GS, this is my theoretical maximum memory bandwidth:

	Shader Clock = 900 MHz

	Memory Clock = 400 MHz

	Memory Interface = 64-bit

	DDR = 2

	Memory Bandwidth (GB/sec) = (400 x 10^6 x (64/8) x 2) / 10^9 = 6.4 GB/sec

and flops/second for the G86 core: 16 × 900 × 2 = 28.8 Gflops.
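A minimal C sketch of these two theoretical calculations, using the 8400 GS / G86 numbers above (400 MHz memory clock, 64-bit bus, DDR factor 2, 16 shader processors at 900 MHz, 2 flops per clock for a MAD):

#include <stdio.h>

int main(void) {
	/* Theoretical memory bandwidth = memory clock x bus width in bytes x DDR factor */
	double mem_clock_hz = 400e6;      /* 400 MHz memory clock */
	double bus_bytes    = 64.0 / 8.0; /* 64-bit memory interface */
	double ddr_factor   = 2.0;        /* DDR: two transfers per clock */
	double bw_gb = mem_clock_hz * bus_bytes * ddr_factor / 1e9;

	/* Theoretical flops = shader processors x shader clock x flops per clock */
	double sp_count        = 16.0;
	double shader_clock_hz = 900e6;
	double flops_per_clock = 2.0;     /* one MAD = add + mul */
	double gflops = sp_count * shader_clock_hz * flops_per_clock / 1e9;

	printf("theoretical bandwidth: %.1f GB/s\n", bw_gb);   /* 6.4  */
	printf("theoretical flops: %.1f Gflop/s\n", gflops);   /* 28.8 */
	return 0;
}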

Next I wanted to calculate how many Gflops my kernel achieves and how fast it accesses global memory.

I am not sure if these calculations are correct, or if I can even calculate these things from my code (well, that is why I am asking here =) ).

Flops: number of (add+mul) instructions / second

10 × 100 × (32 × 256) × 2000 / 1.459 s = 11.2 Gflop/s, which is about 39% of the theoretical flop/s

Memory bandwidth: number of bytes (read+write) / second

100 × (32 × 256 × 2000) × 4 / 1.459 s = 4.49 GB/s, which is about 70% of the theoretical memory bandwidth
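And a minimal C sketch of the two achieved-performance calculations; I am assuming here, as in the formulas above, 100 kernel launches, 10 floating-point operations counted per element, 32 × 256 threads in total, 2000 elements per thread, 4 bytes per float, and the measured 1.459 s of GPU time:

#include <stdio.h>

int main(void) {
	/* Factors from the formulas above; I read the 100 as the number of kernel
	   launches and the 10 as the operations counted per element. */
	double launches      = 100;
	double ops_per_elem  = 10;
	double threads       = 32 * 256;   /* blocks x threads per block */
	double elems_per_thr = 2000;
	double bytes_per_el  = 4;          /* sizeof(float) */
	double gpu_time_s    = 1.459;      /* measured total GPU time */

	double elements = threads * elems_per_thr;   /* 16,384,000 */

	double gflops = ops_per_elem * launches * elements / gpu_time_s / 1e9;
	double gb_s   = launches * elements * bytes_per_el / gpu_time_s / 1e9;

	printf("achieved: %.1f Gflop/s (%.1f%% of 28.8)\n", gflops, 100.0 * gflops / 28.8);
	printf("achieved: %.2f GB/s (%.1f%% of 6.4)\n",     gb_s,   100.0 * gb_s   / 6.4);
	return 0;
}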

What else could I also calculate?

Thanks in advance for any suggestions, corrections, advice…

Lightenix

Source code and Visual Profiler saved files:

code.tar (30 KB)

Program output:

arrLenght: 65536000 bytes

GPU execution time: 1459.520996 ms

CUDA error: no error

flops: 11.2 Gflop/s [39.0%]

theoretical flops: 28.8 Gflop/s

bandwidth used: 4.5 GB/s [70.2%] 

theoretical bandwidth: 6.4 GB/s

cleaned OK

PS: hopefully you understood what I meant (since my English is not extremely good).

Please check the profiler documentation on how to interpret the gst_* counters (they count bus transactions, not individual instructions). One of my posts in this thread has a more detailed description of how the profiler measures memory throughput:
[topic=“92716”]Visual Profiler ver 2.2[/topic]

Thank you for this link, I will have to read it again later.

I still do not understand why Visual Profiler 2.3 says that divergent branching occurs in this code, while Visual Profiler 2.2 does not:

extern __shared__ int k[];	// one loop counter per thread, kept in shared memory

// Clears a per-thread slice of 'array'. 'length' is assumed to be defined elsewhere
// as a compile-time constant (2000 elements per thread in this test).
__global__ void clearArrays(float* array) {
	// First element this thread touches; consecutive threads store to consecutive
	// addresses, so the writes are coalesced.
	int startDataIdx = threadIdx.x + (blockDim.x * length) * blockIdx.x;

	for (k[threadIdx.x] = 0; k[threadIdx.x] < length; k[threadIdx.x]++) {
		array[startDataIdx + blockDim.x * k[threadIdx.x]] = 0.0f;
	}
}
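For completeness, a minimal host-side sketch of how the launch configuration described above might look (hypothetical variable names; it assumes the clearArrays kernel above is in the same file, with length defined as 2000 before it):

int main(void) {
	const int blocks = 32, threadsPerBlock = 256, elemsPerThread = 2000;
	size_t bytes = (size_t)blocks * threadsPerBlock * elemsPerThread * sizeof(float);  // 65,536,000

	float* d_array = 0;
	cudaMalloc((void**)&d_array, bytes);

	// Third launch parameter: dynamic shared memory per block, backing k[].
	clearArrays<<<blocks, threadsPerBlock, threadsPerBlock * sizeof(int)>>>(d_array);
	cudaThreadSynchronize();   // CUDA 2.x-era API

	cudaFree(d_array);
	return 0;
}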

Screenshots from Visual Profiler 2.2 and 2.3:

vp2.3.10.PNG

The program was compiled with CUDA Toolkit 2.2.

Thanks again

Lightenix