Coalesced load isn't that much faster than uncoalesced load??

I have two kernels here: one does coalesced loads, the other uncoalesced loads. I’m working on a K20, and for a small vector (~1 MB) I’m only getting a ~1.5-2x performance improvement, or ~4x for a ~100 MB vector. I was expecting a much bigger improvement from coalesced access.

I’m quite new to CUDA programming, so is it a problem with my code? Any explanation is very much appreciated.

Thanks!

__global__
void coalesc(float *d_A1, float *d_B1)
{
    __shared__ float sdata[BLK];

    int i = blockDim.x * blockIdx.x + threadIdx.x;
    sdata[threadIdx.x] = d_A1[i];
    __syncthreads();
    d_B1[i] = sdata[threadIdx.x];
    __syncthreads();
}

__global__
void uncoal(float *d_A2, float *d_B2)
{
    __shared__ float sdata[BLK];

    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = blockIdx.x + ITR * threadIdx.x;

    sdata[threadIdx.x] = d_A2[j];
    __syncthreads();
    d_B2[i] = sdata[threadIdx.x];
    __syncthreads();
}
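
In case the measurement itself matters: the host side I’m using is essentially a plain event-timed copy, roughly like this sketch (not my exact code; error checking omitted, and the #defines have to come before the kernel definitions above):

#include <cstdio>
#include <cuda_runtime.h>

#define N   31457280              /* ~120 MB of floats            */
#define BLK 512                   /* threads per block            */
#define ITR (N / BLK)             /* stride in the uncoal kernel  */

int main()
{
    float *d_A, *d_B;
    cudaMalloc(&d_A, N * sizeof(float));
    cudaMalloc(&d_B, N * sizeof(float));
    cudaMemset(d_A, 0, N * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    dim3 grid(N / BLK), block(BLK);

    /* warm-up launches so neither timed run pays one-time overheads */
    coalesc<<<grid, block>>>(d_A, d_B);
    uncoal<<<grid, block>>>(d_A, d_B);
    cudaDeviceSynchronize();

    /* time the coalesced copy */
    cudaEventRecord(start);
    coalesc<<<grid, block>>>(d_A, d_B);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float msCoal = 0.0f;
    cudaEventElapsedTime(&msCoal, start, stop);

    /* time the uncoalesced copy */
    cudaEventRecord(start);
    uncoal<<<grid, block>>>(d_A, d_B);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float msUncoal = 0.0f;
    cudaEventElapsedTime(&msUncoal, start, stop);

    printf("coalesced: %.3f ms  uncoalesced: %.3f ms  ratio: %.2fx\n",
           msCoal, msUncoal, msUncoal / msCoal);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_A);
    cudaFree(d_B);
    return 0;
}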

I’m not sure the test is a good one.
What is the value of ITR?
I’d try to increase the size of the data you read.
Also, Kepler’s shared memory banks have changed to 64 bits, so in both kernels
you’re suffering from 2-way bank conflicts, which might hide the non-coalescing performance
hit to some degree.
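
If you want to rule that out, I believe you can pick the bank width from the host on Kepler, something along these lines (untested):

/* Untested: select the shared memory bank width on Kepler (call this in
 * main() before launching the kernels). cudaSharedMemBankSizeFourByte
 * gives 32-bit banks, cudaSharedMemBankSizeEightByte gives 64-bit banks. */
cudaSharedMemConfig cfg;
cudaDeviceGetSharedMemConfig(&cfg);   /* check the current setting */
cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeFourByte);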

eyal

I see two reasons for the “small” speedup: (i) half of the data traffic (the stores) in the “non-coalesced” case is still coalesced; (ii) in the non-coalesced case you get a lot of cache reuse.

Also, if ITR is small, there still aren’t many transactions generated per load instruction. Try ITR = 32 or higher.
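
Regarding (i): one way to see how much the coalesced stores dilute the difference would be to write back with the scattered index as well, something like this (untested sketch, uncoal_both is just an illustrative name):

__global__
void uncoal_both(float *d_A2, float *d_B2)
{
    __shared__ float sdata[BLK];

    int j = blockIdx.x + ITR * threadIdx.x;

    sdata[threadIdx.x] = d_A2[j];   /* strided (uncoalesced) load  */
    __syncthreads();
    d_B2[j] = sdata[threadIdx.x];   /* strided (uncoalesced) store */
}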

Thanks for the replies!!

Reply to the previous comments:
i) I thought on a K20 a single cache-line load is 128 bytes.
ii) In my test case, ITR is quite large. Since I’m using a ~120 MB vector, my array size is 31457280 (floats, so 4 bytes per element). BLK is the block size, and
ITR = 31457280 / BLK, which is quite large. The idea is that you load the first element of every block first, then the second, then the third (instead of loading sequentially). So I would assume this is a highly uncoalesced access pattern… but I only get a ~4x speedup.
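
If I’m doing the arithmetic right: with BLK = 512, ITR = 31457280 / 512 = 61440, so the 32 threads of a warp read elements that are 61440 floats (245760 bytes) apart. Every thread of the warp then touches a different 128-byte line, so one warp load should turn into 32 separate transactions instead of 1.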

So if my block size = 512, I have 31457280/512 = 61440, so this would require 61440 memory transactions.

So a coalesced read would require 31457280/32 individual transactions, though I’m not sure how many threads can perform memory transactions in parallel?
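
Sanity-checking the numbers: 31457280 / 32 = 983040 coalesced 128-byte transactions is exactly one pass over the ~120 MB. If every scattered load really needed its own 128-byte line, the read traffic would grow by up to 32x, and a bandwidth-bound copy should then slow down by far more than the ~4x I measure, so presumably something (the cache?) is absorbing part of it.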

Still, thanks for the replies

In your second kernel, consecutive blocks will access adjacent memory locations. You are probably lucky that one cache line fetched can serve multiple blocks.
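
For example, assuming ITR = 61440 and 128-byte lines: thread k of block 0 reads element 61440*k, thread k of block 1 reads element 61440*k + 1, and so on, so thread k of blocks 0 through 31 reads from the same 128-byte line. If those blocks are resident at roughly the same time, one fetch serves 32 of the scattered loads.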