Hi all,

I'm trying to write a kernel, but I'm not sure whether its global memory accesses are coalesced in my use case.

Simply put, here's my kernel:

```
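__global__ void sumK(const int *input, int k, int n, int *output) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    // Grid-stride loop: each thread handles tid, tid + stride, tid + 2*stride, ...
    while (tid < n) {
        if (n - tid < k) {
            // handle specially (the window would run past the end of input)
        } else {
            // Sum the k consecutive elements starting at tid.
            int sum = 0;
            for (int i = 0; i < k; i++)
                sum += input[tid + i];
            output[tid] = sum;
        }
        tid += blockDim.x * gridDim.x;
    }
}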
```

Let's say the constant k=3; then each thread tid will access the elements input[tid], input[tid+1], and input[tid+2].

I know that for `i=0`, thread tid reads input[tid], so together the threads access input[0,1,2…n-1] and global memory is accessed in a coalesced manner.

But what about `i=1`? Thread tid reads input[tid+1], so together the threads access input[1,2,3…]. Is that still considered coalesced, or should I consider using shared memory?
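To make the question concrete, here is a small host-side sketch (plain C++, no GPU needed) that counts how many aligned memory segments one warp would touch at each loop step. The helper `segmentsTouched` is hypothetical and not part of my kernel. The 32-thread warp, 4-byte `int`, 128-byte segment size, and a segment-aligned `input` pointer are assumptions for illustration; the real transaction granularity depends on the GPU architecture.

```cpp
#include <cstddef>
#include <set>

// Hypothetical helper: counts the distinct aligned segments one warp touches
// when thread t of the warp reads input[firstTid + t + i].
// Assumes input itself starts on a segment boundary.
int segmentsTouched(int firstTid, int i, int warpSize = 32,
                    std::size_t elemBytes = 4,       // assumed size of int
                    std::size_t segmentBytes = 128)  // assumed segment size
{
    std::set<std::size_t> segments;
    for (int t = 0; t < warpSize; ++t) {
        // Byte offset of the element this thread reads at loop step i.
        std::size_t byteOffset =
            static_cast<std::size_t>(firstTid + t + i) * elemBytes;
        segments.insert(byteOffset / segmentBytes);
    }
    return static_cast<int>(segments.size());
}
```

Under these assumptions, `segmentsTouched(0, 0)` is 1, while `segmentsTouched(0, 1)` and `segmentsTouched(0, 2)` are both 2: the shifted reads are still contiguous across the warp, but they straddle one extra segment.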

I'd really appreciate any advice and opinions on this.

Thanks.