I have a problem with CUDA memory coalescing. I have already read the various threads related to memory coalescing on this forum, as well as the CUDA SDK PDF, but I cannot understand how to implement it. Any help would be much appreciated.
I am trying to implement memory coalescing but could not get any improvement in execution time. My code is:
__global__ void kernel_foo(unsigned char* Array_1, unsigned char* Array_2)
{
    //-- current thread
    long k = blockIdx.x * blockDim.x + threadIdx.x;
    int tid = threadIdx.x;
    __shared__ unsigned char RA_Array_1[25632];
    __shared__ unsigned char RA_Array[2563];
Coalescing for character access is NOT supported on compute 1.0 and 1.1 devices; only 32-bit, 64-bit and 128-bit accesses coalesce there.
Compute 1.2 and higher devices have more advanced coalescing capabilities and do support 8-bit coalescing. In any case, character-level access essentially means you are under-utilizing the bus.
char accesses are coalesced only on hardware with compute capability 1.2 or higher. If you are using 1.0 or 1.1 hardware, your accesses are NOT coalesced.
Even if you are using compute 1.2+ hardware, you are still under-utilizing the memory bandwidth by accessing 1 byte at a time.
You need to access the data through “int” pointers and not “char” pointers. Fetch the data as integers, unpack it into a character array, perform the character arithmetic on it, and then store it back through integer pointers.
I haven’t gone through your code fully, but you should consider staging the data in shared memory, performing the computation there, and then storing it back to global memory.
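A minimal sketch of what the two suggestions above could look like combined: each thread fetches 4 bytes through an int pointer, stages them in shared memory, does the per-byte work there, and stores the result back through an int pointer. THREADS_PER_BLOCK, the placeholder “+1” operation, and the kernel name are illustrative assumptions, not the original code; the casts assume the arrays are 4-byte aligned (cudaMalloc guarantees this).

#define THREADS_PER_BLOCK 256
#define BYTES_PER_THREAD  4   // one int carries 4 chars

__global__ void kernel_foo_int(unsigned char* Array_1, unsigned char* Array_2)
{
    __shared__ unsigned char tile[THREADS_PER_BLOCK * BYTES_PER_THREAD];

    int tid   = threadIdx.x;
    int idx   = blockIdx.x * blockDim.x + tid;
    // Reinterpret the byte arrays as int arrays for the global accesses.
    int* in   = (int*)Array_1;
    int* out  = (int*)Array_2;
    int* smem = (int*)tile;

    smem[tid] = in[idx];          // one coalesced 4-byte load per thread
    __syncthreads();

    // Character arithmetic happens byte-by-byte in shared memory.
    for (int b = 0; b < BYTES_PER_THREAD; ++b)
        tile[tid * BYTES_PER_THREAD + b] += 1;   // placeholder per-byte operation
    __syncthreads();

    out[idx] = smem[tid];         // one coalesced 4-byte store per thread
}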
Thanks Sarnath and dlmeetei for your valuable suggestions. I implemented it as Sarnath suggested, using “int” instead of “char” for fetching and writing, and got an improvement I did not expect.
Thanks again.
That’s actually quite bad for performance. Consider that each thread of a half-warp will be executing line (2) simultaneously. The access pattern ends up looking like this:
t0: gmem[0]
t1: gmem[2]
…
t15: gmem[30]
And then for line (3) you have:
t0: gmem[1]
t1: gmem[3]
…
t15: gmem[31]
For more information on why this is bad, look at the explanation of strided memory accesses and the matrix transpose example.
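Reconstructing lines (2) and (3) from the access pattern above (this is an inference, not the original poster’s exact code), the strided version would look something like:

__global__ void strided_copy(unsigned char* gmem)
{
    __shared__ unsigned char block[2 * 256];   // 256 = threads per block, assumed

    block[2 * threadIdx.x]     = gmem[2 * threadIdx.x];     // (2): t0->gmem[0], t1->gmem[2], ...
    block[2 * threadIdx.x + 1] = gmem[2 * threadIdx.x + 1]; // (3): t0->gmem[1], t1->gmem[3], ...
    // ... computation on 'block' ...
}

Each half-warp touches 32 consecutive bytes in total, but with a stride of 2 within each instruction, so neither load is a contiguous, coalescable access.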
A better way to organize your memory is this (if possible):
block[threadIdx.x] = gmem[threadIdx.x]; //(2)
block[threadIdx.x+THREADS_PER_BLOCK] = gmem[threadIdx.x+THREADS_PER_BLOCK]; //(3)
This way, when the threads execute line (2), the access pattern is this:
t0: gmem[0]
t1: gmem[1]
…
t15: gmem[15]
And then for line (3), you have:
t0: gmem[16]
t1: gmem[17]
…
t15: gmem[31]
This is much preferable.
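Putting the preferred pattern into a complete (if simplified) kernel. THREADS_PER_BLOCK, the block-base arithmetic, and the output array are assumptions added to make the fragment self-contained:

#define THREADS_PER_BLOCK 256

__global__ void coalesced_copy(unsigned char* out, const unsigned char* gmem)
{
    __shared__ unsigned char block[2 * THREADS_PER_BLOCK];
    unsigned int base = blockIdx.x * 2 * THREADS_PER_BLOCK;

    // Consecutive threads touch consecutive addresses: fully coalesced.
    block[threadIdx.x]                     = gmem[base + threadIdx.x];                      // (2)
    block[threadIdx.x + THREADS_PER_BLOCK] = gmem[base + threadIdx.x + THREADS_PER_BLOCK];  // (3)
    __syncthreads();

    // ... per-byte computation on 'block' would go here ...

    out[base + threadIdx.x]                     = block[threadIdx.x];
    out[base + threadIdx.x + THREADS_PER_BLOCK] = block[threadIdx.x + THREADS_PER_BLOCK];
}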