I’m getting a very low global load/store memory efficiency number from Visual Profiler (18 %).
I’m running the following kernel, and my understanding is that it should report 100 %, since the data
transfer for a half-warp fits into a single 4*16 = 64-byte memory transaction and wastes no memory bandwidth.
__global__ void kernelTest(unsigned int *dsrc, unsigned int *ddst)
{
    // Each thread copies one 32-bit word; consecutive threads touch
    // consecutive addresses, so a half-warp reads one 64-byte segment.
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int k = dsrc[x];
    ddst[x] = k;
}
I’m running on a laptop with a Quadro FX 1800M (compute capability 1.2).
So my question is: why do I get such a low efficiency from my kernel?
If you want a simple copy kernel that fully utilizes device memory bandwidth, you should either scale up the number of threads or scale up the number of reads in flight per warp.
Volkov’s GTC 2010 paper describes these two approaches starting on pg. 27.
A couple of weeks ago I wrote some simple copy kernels to evaluate sm_35 LDG opcode generation. The test kernels are here. There is no main() routine because I was only looking at the PTX/SASS, but they’re a good example of how to issue many loads.
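To make the second approach concrete, here is a minimal sketch of what “more reads in flight” looks like. The kernel name and the ILP factor of 4 are just illustrative, not taken from my linked code:

__global__ void copyILP4(const unsigned int *dsrc, unsigned int *ddst)
{
    // Each thread handles 4 words, spaced blockDim.x apart, so every
    // individual load/store stays coalesced across the half-warp.
    unsigned int i = blockIdx.x * blockDim.x * 4 + threadIdx.x;

    // Issue all four loads before any store so they overlap in flight.
    unsigned int a = dsrc[i];
    unsigned int b = dsrc[i +     blockDim.x];
    unsigned int c = dsrc[i + 2 * blockDim.x];
    unsigned int d = dsrc[i + 3 * blockDim.x];

    ddst[i]                  = a;
    ddst[i +     blockDim.x] = b;
    ddst[i + 2 * blockDim.x] = c;
    ddst[i + 3 * blockDim.x] = d;
}

The point is that the four loads are independent, so the hardware can keep all of them in flight before the first store has to wait on anything.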
Thanks for the reply. I think you are talking about memory throughput and not the efficiency measurement.
I’m running 256*1000 threads (blocksize = 256).
I thought that the global load/store memory efficiency did not really have anything to do with the number of threads you were running, but was more a measurement of the memory access pattern.
As far as I can see, if I access the data according to my kernel, each half-warp fits its data into one 64-byte transfer instruction and the efficiency should be 100 %… but no :)
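For reference, the host side is basically just this (a minimal sketch matching the launch numbers above, with error checking omitted for brevity):

#include <cuda_runtime.h>

int main()
{
    const unsigned int N = 256 * 1000;           // total threads, as above
    unsigned int *dsrc, *ddst;
    cudaMalloc(&dsrc, N * sizeof(unsigned int));
    cudaMalloc(&ddst, N * sizeof(unsigned int));

    kernelTest<<<1000, 256>>>(dsrc, ddst);       // 1000 blocks of 256
    cudaDeviceSynchronize();

    cudaFree(dsrc);
    cudaFree(ddst);
    return 0;
}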
I should mention that I wrote this test kernel because I got the same result for another kernel I created for graph optimization, and I just wanted to validate the Visual Profiler result.
Update: I dug a little deeper into the global ld/st efficiency numbers for sm_12 devices and was just as confounded as you. If you drill into Visual Profiler and collect Metrics & Events, you can capture the gld/gst 128/64/32-byte events as well as total requests and coalesced transaction counts. None of these counters points to low efficiency.
Update 2: I am pretty sure that for sm_12 you should interpret gld/gst_efficiency as a strict device-specific ratio and not as a percentage. The target number you should strive for is “2 * #SM”. For a GT 240 that is 24, and for your FX 1800M (9 SMs) it is 18 — exactly the “18 %” you are seeing, so your kernel is in fact perfectly coalesced. Anything less implies the gld/gst requests were “fragmented”. I assume sm_13 mirrors sm_12.
How did I come to this conclusion? Force some uncoalesced loads or stores in your microbenchmark (see the sketch below) and inspect the 128/64/32-byte event counters. Then plug them into the documented formula; they match the reported “efficiency” ratio.
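Something like this is enough to fragment the requests. This is a minimal sketch, not my actual microbenchmark; the stride of 16 words is just an illustrative choice, and the buffers must be sized 16x larger to match:

__global__ void kernelUncoalesced(unsigned int *dsrc, unsigned int *ddst)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    // A 16-word (64-byte) stride per thread: on cc 1.2 every thread of
    // a half-warp now lands in a different 64-byte segment, so one
    // request fans out into many transactions.
    unsigned int i = x * 16;
    ddst[i] = dsrc[i];
}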
So is this actually a bug in Visual Profiler? Yeah, I think so. The formula in the docs mirrors what VP is reporting, but if the result is supposed to be a percentage then the “(2 * #SM)” factor should simply be removed.
For sm_11 the efficiency number’s range appears to be 0.0 to 1.0, so not a percentage either (also a minor bug). A value of 1 means all transactions were coalesced; otherwise it is the ratio coalesced/(coalesced+uncoalesced). This matches the formula in the VP documentation and is at least close to being a percentage. Note that Visual Profiler doesn’t reveal any transaction-size counters for sm_11 devices.
I’ll now let the old headless GT 240 and 9400 GT cards go back to sleep. :)
Thank you, Allanmac, for the explanation =)
According to your description above I can understand why I only get 18 % (I turned on the metrics counters myself). I also agree with you that this is probably a bug in Visual Profiler, since the percentage and the “developer tips” that Visual Profiler gives out are based on a somewhat misguided calculation. The only graphics card where 100 % would actually mean 100 % is one with 50 SMs (on sm_12… or is this the same for sm_2X…sm_3X?)