About coalesced memory access

Hi all,

I'm writing a kernel, and I'm not sure whether its global memory accesses are coalesced in my use case.

Simply put, here's my kernel:

__global__ void sumK(int *input, int k, int n, int *output){
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while(tid < n){
        if(n - tid < k){
            // handle the trailing elements specially
        } else {
            // sum the k-element window starting at tid
            int i, sum = 0;
            for(i = 0; i < k; i++)
                sum += input[tid + i];
            output[tid] = sum;
        }

        tid += blockDim.x * gridDim.x; // grid-stride loop
    }
}

Let's say the constant k=3; then each thread tid will access input elements [tid, tid+1, tid+2].
I know that for i=0, the threads access input[0, 1, 2, …, n-1], so global memory is accessed in a coalesced manner. But what about i=1? The threads then access input[1, 2, 3, …]. Is that still considered coalesced, or should I consider using shared memory instead?

I'd really appreciate any advice or opinions on this.

Thanks.

Coalescing is likely not a big issue here thanks to the caches. You are, however, re-reading the same data: the elements thread 0 loads at i=1 and i=2 are the ones thread 1 and thread 2 load at i=0, and so on for every thread.

It would likely be more efficient, as you suggested, to stage this data in shared memory:

__shared__ int smem[DIM_X + 2]; // DIM_X = blockDim.x, plus k-1 = 2 halo elements for k=3

__syncthreads(); // you need this at the top if it sits inside the while loop

smem[threadIdx.x] = input[tid];

if(threadIdx.x < 2) // the first k-1 threads also load the halo past the block
    smem[threadIdx.x + DIM_X] = input[tid + DIM_X]; // (bounds check against n omitted)

__syncthreads();
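Putting the pieces together, a rough sketch of the full shared-memory kernel might look like this (untested; it assumes k and the block size are compile-time constants K and DIM_X, and it strides by whole blocks so that every thread reaches the __syncthreads() together):

#define DIM_X 256   // assumed block size
#define K 3         // assumed window size

__global__ void sumK_smem(const int *input, int n, int *output){
    __shared__ int smem[DIM_X + K - 1];

    // Loop on the block's base index so all threads in the block stay in
    // the loop together -- otherwise __syncthreads() would be unsafe.
    for(int base = blockIdx.x * blockDim.x; base < n;
            base += blockDim.x * gridDim.x){
        int tid = base + threadIdx.x;

        __syncthreads(); // don't overwrite smem while neighbors still read it

        if(tid < n)
            smem[threadIdx.x] = input[tid];
        if(threadIdx.x < K - 1 && tid + DIM_X < n)
            smem[threadIdx.x + DIM_X] = input[tid + DIM_X]; // halo elements

        __syncthreads();

        if(tid < n && n - tid >= K){
            // full windows only; trailing windows (n - tid < K) still need
            // the special handling from your original kernel
            int sum = 0;
            for(int i = 0; i < K; i++)
                sum += smem[threadIdx.x + i];
            output[tid] = sum;
        }
    }
}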

Using the warp shuffle is also an interesting option but wouldn’t give you much help with the trailing elements.
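For what it's worth, a shuffle-based sketch for k=3 could look like the fragment below (untested; it assumes CUDA 9+ for the _sync intrinsics and a fully active warp). Each lane loads one element and pulls its two neighbors' values down the warp; the last two lanes of each warp can't see the next warp's data, which is exactly the trailing-element problem:

// inside the kernel, k = 3 assumed, all 32 lanes of the warp active
int v0 = (tid < n) ? input[tid] : 0;
int v1 = __shfl_down_sync(0xffffffff, v0, 1); // value from lane+1
int v2 = __shfl_down_sync(0xffffffff, v0, 2); // value from lane+2
int lane = threadIdx.x & 31;
if(lane < 30 && n - tid >= 3)   // lanes 30/31 got no valid neighbor data;
    output[tid] = v0 + v1 + v2; // those windows need a fallback path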

You're right.

I tested the kernel with k=1 and k=4, and the running times are very close thanks to the cache.
I'm new to CUDA, so I need more time to learn the shared memory concept, and the warp shuffle, which seems intriguing.
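For reference, a minimal host-side timing sketch with CUDA events for this kind of comparison (the launch configuration and buffer names here are just placeholders):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
sumK<<<numBlocks, DIM_X>>>(d_input, k, n, d_output); // placeholder launch
cudaEventRecord(stop);
cudaEventSynchronize(stop); // wait for the kernel to finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("k=%d: %.3f ms\n", k, ms);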

Thanks for the input :)