Calculating Gflops and memory bandwidth, plus some Visual Profiler questions

Hi, I wrote a simple program to play with, and I would like to calculate its performance (Gflops, memory bandwidth, anything else…).

I would also like to find out why some of the numbers in the Visual Profiler are what they are.

The program is really simple: it just clears an array.

Every thread has its own part of the array which the kernel clears. Accesses are coalesced.

In this version I use 256 threads, 32 blocks, and each thread works on an array of 2000 elements,

so in total this is 256 × 32 × 2000 = 16,384,000 elements, or 16,384,000 × 4 = 65,536,000 bytes.

First question: the Visual Profiler column ‘gst_coalesced’ shows 4096000.

Why are there 4 times fewer stores than elements in the array? Could this be due to storing 4 floats at once (because of the RGBA nature of the device)?

My second question is also about the Visual Profiler: the column ‘divergent_branch’ shows 16.

Why are there 16 divergent branches if there is not a single if-statement in the kernel? I understand that a for-loop also uses a comparison internally,

but in this case the loop bound is a constant broadcast to every thread in the same warp, so none of the threads should finish earlier than the others.

All right, now a bit about calculating performance for this kernel.

Since I am using an 8400 GS, this is my theoretical maximum memory bandwidth:

	Shader Clock = 900 Mhz

	Memory Clock = 400 Mhz

	Memory Interface = 64-bit

	DDR = 2

	Memory Bandwidth (GB/sec) = (400 x 10^6 x (64/8) x 2) / 10^9 = 6.4 GB/sec

and for flops per second on the G86 core: 16 SPs × 900 MHz × 2 ops per cycle (mul + add) = 28.8 Gflops.

Next I wanted to calculate how many Gflops my kernel achieves and how fast it accesses global memory.

I am not sure whether these calculations are correct, or whether I can even derive these things from my code (well, that is why I am asking here =) ).

Flops: number of (add + mul) instructions per second:

10 × 100 × (32 × 256) × 2000 / 1.459 s = 11.2 Gflop/s, which is about 39% of the theoretical flops/s.

Memory bandwidth: number of bytes (read + write) per second:

100 × (32 × 256 × 2000) × 4 / 1.459 s = 4.49 GB/s, which is about 70% of the theoretical memory bandwidth.

What else could I also calculate?

Thanks in advance for any suggestions, corrections, advice…


Source code and visual profiler saved files:

code.tar (30 KB)

Program output (typos are from the program itself):

arrLenght: 65536000 bytes

GPU execution time: 1459.520996 ms

CUDA error: no error

flops: 11.2 Gflop/s [39.0%]

theoretical flops: 28.8 Gflop/s

bandwidth used: 4.5 GB/s [70.2%] 

theoretical bandwidth: 6.4 GB/s

cleaned OK

PS: hopefully you understood what I meant (since my English is not extremely good).

Please check the profiler documentation on how to interpret gst_* counters (it counts bus transactions, not individual instructions). One of my posts in this thread has a more detailed description of how the profiler measures memory throughput:
[topic=“92716”]Visual Profiler ver 2.2[/topic]

Thank you for this link, I will have to read it again later.

I still do not understand why Visual Profiler 2.3 says that divergent branching occurs in this code (while Visual Profiler 2.2 does not):

extern __shared__ int k[];

/* length is a loop bound defined elsewhere in the source;
   the loop counter k[threadIdx.x] lives in shared memory */
__global__ void clearArrays(float* array) {
	int startDataIdx = threadIdx.x + (blockDim.x * length) * blockIdx.x;
	for (k[threadIdx.x] = 0; k[threadIdx.x] < length; k[threadIdx.x]++) {
		array[startDataIdx + blockDim.x * k[threadIdx.x]] = 0.0f;
	}
}

Screenshots from visual profiler 2.2 and 2.3:


The program was compiled with CUDA Toolkit 2.2.

Thanks again