I’m not able to tell whether the kernels I write produce coalesced or serialized memory accesses. In an attempt to understand this, I’ve written a very simple kernel and used cudaprof:
Kernel Code:
#define TYPE float
#define RESULT 13.37
#define N (1<<22) // numThreads/numElements ~= 4M

__global__ void simple(TYPE* d){
    int tid = blockDim.x*blockIdx.x + threadIdx.x; // global thread index
    TYPE R = RESULT;
    if (tid < N) d[tid] = R;                       // each thread writes one element
}
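For context, a minimal host-side harness to launch this kernel might look like the following (a sketch only; the block size of 256 and the use of cudaThreadSynchronize() are illustrative choices, not details from the actual profiler run):

#include <cuda_runtime.h>

int main(void){
    TYPE* d;
    cudaMalloc((void**)&d, N*sizeof(TYPE));    // one float per thread/element
    int threads = 256;                         // illustrative block size
    int blocks  = (N + threads - 1)/threads;   // enough blocks to cover N elements
    simple<<<blocks, threads>>>(d);            // every thread writes one element
    cudaThreadSynchronize();                   // CUDA 2.x/3.0-era synchronization
    cudaFree(d);
    return 0;
}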
This, to me, implies that the lower the gst_efficiency the better! However, that feels like it’d be a misnomer so I have my doubts.
So what’s the deal? Is this how one determines how coalesced one’s reads/writes are - with gst_efficiency? And what is the optimal gst_efficiency? Please feel free to refer me to documentation… but I’ve been unable to find the answer anywhere - very frustrating. Where do people get their information from?
Note: I’m using a TESLA C1060, CC1.3 on Ubuntu 9.04.
I wrote a new kernel with a similar data access pattern, but with a varying stride so that I can compare its effect on gst_efficiency and instruction_throughput (which, I assume, directly reflect how coalesced the data accesses are).
Here is a snippet of the code along with some results:
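The exact snippet isn’t shown here, but a strided-write kernel along these lines would have the access pattern I’m describing - a hypothetical sketch only, with the (tid+1)*W indexing inferred from the index sequences mentioned in the replies below:

#define W 2   // stride/width under test: 1, 2, 4, ...
__global__ void strided(TYPE* d){
    int tid = blockDim.x*blockIdx.x + threadIdx.x;
    TYPE R = RESULT;
    if (tid < N) d[(tid+1)*W] = R;  // consecutive threads write W elements apart
    // note: d must be allocated with at least (N+1)*W elements for this indexing
}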
Now, according to the Programming Guide (p. 91), coalescing on CC 1.2+ will occur if all threads of a half-warp access a segment of size 128 bytes (for 4- and 8-byte words).
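Working that through for float (4-byte) stores, as I understand it: for W=1 a half-warp touches 16 × 4 = 64 contiguous bytes, which fits inside a single 128-byte segment; for W=2 the same 16 stores are spread over roughly 16 × 2 × 4 = 128 bytes, so (assuming suitable alignment) they could still land inside one 128-byte segment; for W ≥ 4 they necessarily span more than 128 bytes and have to split into multiple transactions.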
So, if I’m correct, for W=1 and W=2 the writes should coalesce. With that in mind, let’s examine 2 of the results I previously obtained:
According to my previous calculations I’d have expected the gst_efficiency to remain the same but I was NOT expecting the throughput to drop - especially not to ~1/3 throughput!
The only case that can coalesce is width=1. Coalescing rules require the words loaded or stored by a half-warp of threads to be contained within a contiguous, half-warp-sized piece of memory, aligned to the half-warp size times the word size - you are using floats, so 16 threads have to write into the same 64-byte piece of memory, and that piece of memory has to be aligned to a 64-byte boundary. Anything else will require additional memory transactions to service the load/store request. How many depends on the compute capability of the card and on how the load/store request deviates from the coalescing rules.
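To make the rule concrete, here is a toy illustration (not code from this thread) of the two extremes for a half-warp of float stores:

__global__ void patterns(float* d){
    int tid = blockDim.x*blockIdx.x + threadIdx.x;

    // Coalesced: thread k of a half-warp writes float k of a 64-byte-aligned,
    // 64-byte chunk, so the 16 stores are serviced by a single transaction
    // (cudaMalloc returns suitably aligned pointers).
    d[tid] = 1.0f;

    // Not coalesced: a stride of 2 spreads the same 16 stores over 128 bytes,
    // so extra transactions are needed (how many depends on compute capability).
    // Assumes d was allocated large enough for the strided indexing.
    d[2*tid] = 2.0f;
}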
Well, for W=1 your writes will go to indices 1, 2, 3, …, 16, i.e. coalesced but not perfect, because the first address is not a multiple of 16. For W=2 you’re getting 2, 4, 6, …, 32. That is strided access and not coalesced.
It tells you the ratio of requests to transactions. Perfectly coalesced writes should have an efficiency of 100%. Fully serialized writes should have an efficiency of 1/16.
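Put in numbers (just restating that ratio): a half-warp store request serviced by a single transaction gives 1/1 = 100% efficiency, while the same request broken into 16 separate transactions gives 1/16 ≈ 6.25%, the fully serialized floor.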
The short answer is you don’t. It will be slower, but how much can only be determined by benchmarking.
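One generic way to do that benchmarking is to time the kernel with CUDA events; a sketch (the kernel launch and the names blocks, threads, d are placeholders for whatever you want to measure, and printf assumes <cstdio> is included):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
simple<<<blocks, threads>>>(d);            // the kernel/launch being measured
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %f ms, effective bandwidth: %f GB/s\n",
       ms, (N*sizeof(TYPE)) / (ms*1.0e6)); // bytes written / elapsed time

cudaEventDestroy(start);
cudaEventDestroy(stop);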
The emphasis is on the word “can”. WIDTH=1 is the only possible case where the writes have a stride of one, and hence the only case where coalescing could be possible.
Well, I’m glad that the profiler can conclusively report the coalescing percentage; however, I’m mystified as to why my particular example, with width=1, is not 100% coalesced. My reasoning:
I’m STILL stuck on this… I’m really eager to get to the bottom of this whole coalescing business but am unable to… I’ve read all available documentation (and re-read it, several times) and have scoured the forums and internet in search of a definitive answer or a clear explanation. I know some of you have already tried to help me with this but it seems we never got to the bottom of it.
I’ve now upgraded to CUDA 3.0, and here are the results of one of my applications under the profiler (profiler output attached):
It appears there’s no longer a gst_efficiency? Is this coalesced or not? Why?