Hi, I wrote a simple program to experiment with, and I would like to calculate its performance (Gflop/s, memory bandwidth, anything else…).
I would also like to find out why some of the numbers in the Visual Profiler are what they are.
Program is really simple: it just clears some array.
Each thread has its own array, which the kernel clears. The accesses are coalesced.
In this version I use 256 threads per block and 32 blocks, and each thread uses an array of 2000 elements,
so this is 256 * 32 * 2000 = 16384000 elements, or 16384000 * 4 = 65536000 bytes in total.
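The size arithmetic above can be double-checked with a short script. This is only a sketch: the variable names are made up for illustration, and the 4 bytes per element is an assumption taken from the "* 4" in the post (for example, a float array).

```python
# Launch configuration as stated in the post.
threads_per_block = 256
blocks = 32
elements_per_thread = 2000
bytes_per_element = 4  # assumption: 4-byte elements, as implied by "* 4"

# Total number of array elements touched by the whole grid.
total_elements = threads_per_block * blocks * elements_per_thread
# Total size of the cleared data in bytes.
total_bytes = total_elements * bytes_per_element

print(total_elements, "elements")
print(total_bytes, "bytes")
```

The second figure matches the size that the program itself reports in its output.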
First question: the Visual Profiler column ‘gst_coalesced’ shows ‘4096000’.
Why are there 4 times fewer stores than elements in the array? Could this be due to storing 4 floats at once (because of the RGB+A nature of the device)?
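To make the question concrete, the ratio between the array size and the profiler counter can be computed directly. This sketch only restates the numbers from the post; it does not explain what the counter actually measures, and the variable names are illustrative.

```python
# Numbers taken from the post.
total_elements = 16_384_000        # 256 * 32 * 2000
total_bytes = total_elements * 4   # assumption: 4 bytes per element
gst_coalesced = 4_096_000          # value shown by the Visual Profiler

# How many elements, and how many bytes, correspond to one counted store.
elements_per_counted_store = total_elements / gst_coalesced
bytes_per_counted_store = total_bytes / gst_coalesced

print(elements_per_counted_store)
print(bytes_per_counted_store)
```

The ratio is exactly 4 elements (16 bytes) per counted store, with no remainder.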
The second question is also about the Visual Profiler: the column ‘divergent_branch’ shows ‘16’.
Why are there 16 divergent branches if there is not a single if-statement in the kernel? I understand that a for-loop also uses an ‘if’ to compare its counter,
but in this case the loop bound is a constant that is broadcast to every thread in the same warp, so none of the threads would finish earlier than the others.
All right, now a bit about calculating performance for this kernel.
Since I am using an 8400 GS, this is my theoretical maximum memory bandwidth:
Shader Clock = 900 MHz
Memory Clock = 400 MHz
Memory Interface = 64-bit
DDR = 2
Memory Bandwidth (GB/sec) = (400 x 10^6 x (64/8) x 2) / 10^9 = 6.4 GB/sec
and the theoretical flops/second for the G86 core: (16 x 900 x 10^6 x 2) / 10^9 = 28.8 Gflop/s
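The two theoretical peaks above can be reproduced with a small helper. This is a sketch under stated assumptions: the function and parameter names are hypothetical, and reading the "16" as the number of shader cores and the "2" as flops per core per cycle is an interpretation of the formula in the post, not something the post states.

```python
def peak_bandwidth_gb_s(memory_clock_mhz, bus_width_bits, ddr_factor):
    """Peak memory bandwidth in GB/s (1 GB = 10^9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return memory_clock_mhz * 1e6 * bytes_per_transfer * ddr_factor / 1e9


def peak_gflops(cores, shader_clock_mhz, flops_per_core_per_cycle):
    """Peak arithmetic throughput in Gflop/s."""
    return cores * shader_clock_mhz * 1e6 * flops_per_core_per_cycle / 1e9


# Values from the post; the meaning of 16 and 2 is an assumption (see above).
bandwidth = peak_bandwidth_gb_s(memory_clock_mhz=400, bus_width_bits=64, ddr_factor=2)
gflops = peak_gflops(cores=16, shader_clock_mhz=900, flops_per_core_per_cycle=2)

print(f"theoretical bandwidth: {bandwidth:.1f} GB/s")
print(f"theoretical flops: {gflops:.1f} Gflop/s")
```

Keeping the clocks in MHz and converting inside the functions makes the unit handling explicit, which is where these hand calculations most often go wrong.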
Next, I wanted to calculate how many Gflop/s my kernel achieves and how much global memory bandwidth it uses.
I am not sure whether these calculations are correct, or whether I can even calculate these things from my code (which is why I am asking here =) ).
Flops: number of (add+mul) instructions / second
10 * 100 * (32*256) * 2000 / 1.459s = 11.2 Gflop/s, which is about 39% of the theoretical flop/s
Memory bandwidth: number of bytes (read+write) / second
100 * (32*256*2000) * 4 / 1.459s = 4.49 GB/s which is about 70% of theoretical memory bandwidth
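The two achieved figures can be recomputed from the measured time as follows. This is a sketch that takes the factors 10 and 100 from the formulas above exactly as given, without assuming what they stand for; the variable names are illustrative only.

```python
# Measured and configured values from the post.
elapsed_s = 1.459                 # GPU execution time in seconds
total_elements = 32 * 256 * 2000  # blocks * threads per block * elements per thread
bytes_per_element = 4             # assumption: 4-byte elements

# Factors copied verbatim from the formulas in the post.
flop_factor = 10 * 100
byte_factor = 100

achieved_gflops = flop_factor * total_elements / elapsed_s / 1e9
achieved_gb_s = byte_factor * total_elements * bytes_per_element / elapsed_s / 1e9

# Theoretical peaks as computed in the post.
peak_gflops = 28.8
peak_gb_s = 6.4

flops_pct = 100 * achieved_gflops / peak_gflops
bandwidth_pct = 100 * achieved_gb_s / peak_gb_s

print(f"flops: {achieved_gflops:.1f} Gflop/s [{flops_pct:.1f}%]")
print(f"bandwidth used: {achieved_gb_s:.1f} GB/s [{bandwidth_pct:.1f}%]")
```

Both percentages agree with the program output listed further down, so the arithmetic is at least self-consistent; whether the operation and byte counts are the right ones for this kernel is a separate question.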
What else could I also calculate?
Thanks in advance for any suggestions, corrections, advice…
Lightenix
Source code and visual profiler saved files:
code.tar (30 KB)
Program output:
arrLenght: 65536000 bytes
GPU execution time: 1459.520996 ms
CUDA error: no error
flops: 11.2 Gflop/s [39.0%]
theoretical flops: 28.8 Gflop/s
bandwidth used: 4.5 GB/s [70.2%]
theoretical bandwidth: 6.4 GB/s
cleaned OK
PS: hopefully you understood what I meant (since my English is not extremely good).