Is there a way to measure the bandwidth used by a kernel? To be more precise: I have a fairly complicated kernel with many interleaved memory and compute instructions, and I need to know the memory bandwidth I am actually using. I know the profiler can calculate that, but I am using a GTX 480 and, as far as I know, the profiler doesn't give any bandwidth information for Fermi. Does anybody have a solution for measuring it from inside the kernel?

You could count all accesses to global memory in bytes (or GB) inside the kernel, multiply by the number of threads, and divide by the kernel's total runtime. You can do this by hand, and it gives a good estimate of how close you are to the theoretical maximum bandwidth, as long as your kernel really is bandwidth-limited.
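A minimal host-side sketch of that calculation in C++, assuming you already have the kernel's elapsed time in milliseconds (for example from CUDA events, since `cudaEventElapsedTime` reports milliseconds). The function names `bytes_moved` and `effective_bandwidth_gbs` are made up for illustration, and the sketch assumes every thread touches the same number of elements:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: bytes moved when each of num_threads threads
// accesses elems_per_thread elements of bytes_per_elem bytes each.
std::uint64_t bytes_moved(std::uint64_t num_threads,
                          std::uint64_t elems_per_thread,
                          std::uint64_t bytes_per_elem) {
    return num_threads * elems_per_thread * bytes_per_elem;
}

// Effective bandwidth in GB/s (1 GB = 1e9 bytes):
//   ((bytes read + bytes written) / 1e9) / elapsed seconds
// elapsed_ms is the measured kernel time in milliseconds.
double effective_bandwidth_gbs(std::uint64_t bytes_read,
                               std::uint64_t bytes_written,
                               double elapsed_ms) {
    assert(elapsed_ms > 0.0);  // a zero or negative time is a measurement error
    const double total_bytes = static_cast<double>(bytes_read) +
                               static_cast<double>(bytes_written);
    const double seconds = elapsed_ms / 1000.0;
    return total_bytes / 1e9 / seconds;
}
```

For instance, with 1048576 threads that each read two floats and write one, `bytes_moved(1048576, 2, 4)` and `bytes_moved(1048576, 1, 4)` give the read and write totals to pass in. Keep in mind this counts only the bytes the kernel asks for, so uncoalesced accesses or cache hits will make the real memory traffic differ from the estimate.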

I'm not sure that would be a good measure, because there is a big gap in time between reading the input and writing it back, so I kind of need to know the bandwidth of each part on its own. If I compute it over the whole kernel (reads, calculations, and writes), I am including a lot of non-memory operations in the time, so the calculated bandwidth would be much lower than what the memory accesses actually achieve. Am I right? Please correct me if my idea is wrong; all help is appreciated.

If you really want to measure each part separately and leave out the calculations, you could split it into two kernels. Otherwise, if you are trying to find out whether your optimizations get your kernel close to peak bandwidth, my suggestion works fine. It's also explained in the Best Practices Guide. But if this gap is really big, then maybe the kernel is limited by computation rather than bandwidth.
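To judge "close to peak" you need the theoretical number to compare against. Here is a rough C++ sketch of that comparison; the function names are hypothetical, and the GTX 480 figures in the usage note (1848 MHz memory clock, 384-bit bus, double data rate) are the published specs as I remember them, so please check them against your own board:

```cpp
#include <cassert>

// Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes):
//   memory clock (Hz) * bus width (bytes) * transfers per clock / 1e9
double peak_bandwidth_gbs(double mem_clock_mhz,
                          int bus_width_bits,
                          int transfers_per_clock) {
    assert(mem_clock_mhz > 0.0 && bus_width_bits > 0 && transfers_per_clock > 0);
    const double bytes_per_transfer = bus_width_bits / 8.0;
    return mem_clock_mhz * 1e6 * bytes_per_transfer * transfers_per_clock / 1e9;
}

// Fraction of the theoretical peak that a measured bandwidth reaches
// (1.0 would mean the kernel runs at peak).
double fraction_of_peak(double effective_gbs, double peak_gbs) {
    assert(peak_gbs > 0.0);
    return effective_gbs / peak_gbs;
}
```

With the assumed specs, `peak_bandwidth_gbs(1848.0, 384, 2)` comes out at about 177.4 GB/s. If you split the work into a read kernel and a write kernel, you can time each one on its own and call `fraction_of_peak` once per kernel. A low fraction for the combined kernel together with a high fraction for the split ones would point to the computation, not the memory, as the limit.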
