I have tested the bandwidthTest program on my GPU and it works fine. However, is there an implementation of the bandwidthTest benchmark that uses a kernel function (as opposed to cudaMemcpy) available anywhere? I would like to see one so I can understand coalesced memory accesses better.
I remember once seeing a program like that, which measured texture, constant, shared, and global memory. A search may turn it up. If the forum's search button doesn't find it, try Google, which sometimes finds more.
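In case it helps while you search, here is a minimal sketch of what a kernel-based copy test could look like. It is not the program mentioned above and not part of the SDK's bandwidthTest; the names copyKernel, n, and stride, the buffer size, and the block size are all assumptions picked for illustration. Each thread copies one float, and the stride parameter controls how far apart the addresses touched by neighbouring threads are, so stride 1 should be the coalesced case.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread copies exactly one float.
// stride == 1 -> neighbouring threads touch neighbouring addresses (coalesced).
// stride  > 1 -> neighbouring threads touch addresses `stride` floats apart.
// Assumes stride divides n, so the index mapping is a permutation of 0..n-1.
__global__ void copyKernel(const float *in, float *out, size_t n, size_t stride)
{
    size_t tid = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (tid < n) {
        size_t cols = n / stride;
        size_t i = (tid % cols) * stride + tid / cols;
        out[i] = in[i];
    }
}

int main()
{
    const size_t n = 1 << 24;                 // 16M floats, 64 MiB per buffer (arbitrary)
    const size_t bytes = n * sizeof(float);

    float *in = nullptr, *out = nullptr;
    if (cudaMalloc(&in, bytes) != cudaSuccess ||
        cudaMalloc(&out, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(in, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 512;                  // assumed block size
    const int blocks = (int)((n + threads - 1) / threads);

    for (size_t stride = 1; stride <= 32; stride *= 2) {
        // Warm-up launch so one-time startup cost is not timed.
        copyKernel<<<blocks, threads>>>(in, out, n, stride);

        cudaEventRecord(start);
        copyKernel<<<blocks, threads>>>(in, out, n, stride);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) {
            fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
            return 1;
        }

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);

        // One read plus one write per element; ms * 1e6 converts to GB/s.
        double gbps = 2.0 * (double)bytes / (ms * 1.0e6);
        printf("stride %2zu: %8.3f ms  %7.1f GB/s\n", stride, ms, gbps);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Compiling with something like nvcc -o kernel_bw kernel_bw.cu (the file name is made up) and comparing the stride 1 line against the larger strides should show how much the access pattern matters on your card. The figures are effective bandwidth for this one kernel only, so treat them as a rough comparison rather than a replacement for bandwidthTest.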