How do I resolve this coalescing problem?

Hi Everybody,

I have a problem with CUDA memory coalescing. I have already read the various threads related to memory coalescing in this forum as well as the CUDA SDK PDF, but I cannot understand how to implement it. If anybody can help me, it would be much appreciated.

I am trying to implement memory coalescing but could not get any improvement in execution time. My code is:

__global__ void kernel_foo( unsigned char* Array_1, unsigned char* Array_2 )
{
    //-- current thread
    long k = blockIdx.x * blockDim.x + threadIdx.x;
    int tid = threadIdx.x;
    __shared__ unsigned char RA_Array_1[256 * 32];

    // limits (a per-thread local is enough; writing a __shared__ variable
    // from every thread at once would be a race)
    long limit = 1000 * 1500;

    int index = k;
    RA_Array_1[32*tid+0] = Array_1[index];
    RA_Array_1[32*tid+1] = Array_1[index+1];
    RA_Array_1[32*tid+2] = Array_1[index+2];
    RA_Array_1[32*tid+3] = Array_1[index+3];

    index += 1000;
    RA_Array_1[32*tid+4] = Array_1[index];
    RA_Array_1[32*tid+5] = Array_1[index+1];
    RA_Array_1[32*tid+6] = Array_1[index+2];
    RA_Array_1[32*tid+7] = Array_1[index+3];

    index += 1000;
    RA_Array_1[32*tid+8]  = Array_1[index];
    RA_Array_1[32*tid+9]  = Array_1[index+1];
    RA_Array_1[32*tid+10] = Array_1[index+2];
    RA_Array_1[32*tid+11] = Array_1[index+3];

    index += 1000;
    RA_Array_1[32*tid+12] = Array_1[index];
    RA_Array_1[32*tid+13] = Array_1[index+1];
    RA_Array_1[32*tid+14] = Array_1[index+2];
    RA_Array_1[32*tid+15] = Array_1[index+3];

    index += 1000;
    RA_Array_1[32*tid+16] = Array_1[index];
    RA_Array_1[32*tid+17] = Array_1[index+1];
    RA_Array_1[32*tid+18] = Array_1[index+2];
    RA_Array_1[32*tid+19] = Array_1[index+3];

    int temp = index << 2;   // unused below

    long New_index = 0;

    // sum the 20 staged bytes for this thread
    for (long i = 32*tid; i < 32*tid + 20; ++i)
        New_index += RA_Array_1[i];

    Array_2[New_index] = Array_1[New_index];
}


With this code I am not getting any improvement.
So, how do I resolve this memory coalescing problem?

How many threads and blocks are you spawning?

Coalescing char accesses does NOT make much sense. You are under-utilizing the bus by at least 4x (compared to coalescing 32-bit data).

I doubt it satisfies the first requirement of the coalescing conditions. See the programming guide.

See if the attached doc is helpful to you
index.pdf (255 KB)

The kernel launch is:

kernel_foo<<<(1000*1500)/256, 256>>>( Array_1, Array_2);
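A side note on that launch configuration: 1000*1500 = 1,500,000 is not a multiple of 256, so the integer division in (1000*1500)/256 silently drops the remainder and the last 96 elements are never processed. The usual fix (a sketch; grid_size is my own helper name) is ceiling division plus a bounds check in the kernel:

```c
#include <assert.h>

/* Ceiling division: smallest block count such that
   blocks * threads_per_block >= n. The kernel then needs a guard
   like "if (k < n) ..." so the surplus threads do nothing. */
long grid_size(long n, int threads_per_block)
{
    return (n + threads_per_block - 1) / threads_per_block;
}
```

With n = 1,500,000 and 256 threads per block, truncating division gives 5859 blocks (1,499,904 threads, so 96 elements are missed), while grid_size returns 5860, which covers every element.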

Dlmeetei is right. (Dlm, Good to see you back)

Coalescing for character access is NOT supported on compute 1.0 and 1.1 devices; only 32-bit, 64-bit and 128-bit accesses coalesce there…

Compute 1.2 and higher have more advanced coalescing capabilities and support 8-bit coalescing. In any case, character coalescing essentially means you are under-utilizing the bus.

Yup, after a long time in Fortran, I'm back in CUDA again.

Does this mean that in my given code the char arrays cannot be coalesced???

If I make Array_1 and Array_2 int, can I make my kernel kernel_foo(Array_1, Array_2) use coalesced memory accesses???

If your answer is yes, then HOW?

It can be, as Sarnath pointed out, if you have a device of compute 1.2 or above. But it still seems to under-utilize the capability.

char accesses are coalesced only on hardware with compute capability 1.2 or higher. If you are using 1.0 or 1.1 hardware, your accesses are NOT coalesced.

Even if you are using compute 1.2+ hardware, you are still under-utilizing the memory bandwidth by accessing 1 byte at a time.

You need to access data through “integer” pointers rather than “char” pointers. Fetch the data as integers, load it into a character array, perform the character arithmetic on it, and then store it back through integer pointers.
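The idea above can be sketched host-side (illustration only; add_one_wordwise is a made-up example name, and in the kernel you would instead cast the global pointer to unsigned int* or uchar4* so each thread issues one 4-byte transaction): fetch a 32-bit word, unpack it into four chars, do the character arithmetic, repack, and store one word.

```c
#include <string.h>

/* Process 4 chars per 32-bit load/store instead of 4 separate
   byte accesses. Assumes n is a multiple of 4; memcpy sidesteps
   alignment concerns in this host-side sketch. */
void add_one_wordwise(unsigned char *buf, int n)
{
    for (int i = 0; i < n; i += 4) {
        unsigned int word;
        memcpy(&word, buf + i, 4);            /* one 32-bit fetch     */

        unsigned char b[4];
        memcpy(b, &word, 4);                  /* unpack into bytes    */
        for (int j = 0; j < 4; ++j)
            b[j] = (unsigned char)(b[j] + 1); /* character arithmetic */

        memcpy(&word, b, 4);                  /* repack               */
        memcpy(buf + i, &word, 4);            /* one 32-bit store     */
    }
}
```

On the device, the wide fetch is what lets a half-warp's accesses collapse into full-width memory transactions instead of sixteen 1-byte ones.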

I haven’t gone through your code fully, but you should consider staging data in shared memory, performing the computation, and then storing the result back to global memory.

Thanks Sarnath and dlmeetei for your valuable suggestions. I implemented it as Sarnath suggested, using “integer” instead of “char” for fetching and writing, and I got an unexpectedly large improvement.
Thanks again.

I’ll point out another thing you need to be careful with in general.

If you find yourself wanting to do something like this in a kernel:

__shared__ int block[2*THREADS_PER_BLOCK]; //(1)
block[2*threadIdx.x] = gmem[2*threadIdx.x]; //(2)
block[2*threadIdx.x+1] = gmem[2*threadIdx.x+1]; //(3)

That’s actually quite bad for performance. Consider that each thread of a half-warp will be executing line (2) simultaneously. The access pattern ends up looking like this:
t0: gmem[0]
t1: gmem[2]

t15: gmem[30]
And then for line (3) you have:
t0: gmem[1]
t1: gmem[3]

t15: gmem[31]
For more information on why this is bad, look at the explanation of strided memory accesses and the matrix transpose example.

A better way to organize your memory is this (if possible):
block[threadIdx.x] = gmem[threadIdx.x]; //(2)
block[threadIdx.x+THREADS_PER_BLOCK] = gmem[threadIdx.x+THREADS_PER_BLOCK]; //(3)

This way, when the threads execute line (2), the access pattern is this:
t0: gmem[0]
t1: gmem[1]

t15: gmem[15]
And then for line (3), you have:
t0: gmem[16]
t1: gmem[17]

t15: gmem[31]
This is much more preferable.
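To see why the second layout wins, you can count how many 64-byte segments a half-warp of 16 threads touches per line (a sketch based on the coalescing rules for 4-byte words; segments_touched is my own helper name):

```c
#define HALF_WARP 16
#define SEG_SIZE  64  /* bytes per memory segment for 4-byte accesses */

/* Count the distinct 64-byte segments touched when thread t of a
   half-warp reads the 4-byte word at element index t*stride. The
   addresses increase with t, so comparing against the previous
   segment is enough to count distinct segments. */
int segments_touched(int stride)
{
    int count = 0;
    long last_seg = -1;
    for (int t = 0; t < HALF_WARP; ++t) {
        long addr = (long)t * stride * 4;  /* byte address of the access */
        long seg  = addr / SEG_SIZE;
        if (seg != last_seg) {
            ++count;
            last_seg = seg;
        }
    }
    return count;
}
```

segments_touched(1) is 1 (the whole half-warp is served by a single transaction), while segments_touched(2), the strided pattern from the first example, is 2: twice the memory traffic for the same 16 words.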

Thanks for pointing out this access pattern. Now I will try to avoid strided memory accesses.