Problem with proper coalesced indexing

I have a one-dimensional CUDA kernel in which I am trying to achieve memory coalescing.

Below are some constants needed so that threads do not access the same data more than once.

Index correction constants (precalculated)

indexCorr = ceil(pixelNumberPerSlice / threadnum) // so that we do not go beyond the range intended for a given thread block
loopNumb = indexCorr - 1
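
One detail worth checking in the constants above: if pixelNumberPerSlice and threadnum are integers in C/C++ code, the division truncates before ceil is applied, so ceil has no effect. Below is a minimal sketch of an integer ceiling division, assuming both values are positive ints. The helper name ceilDiv and the example values are hypothetical, chosen only for illustration.

```cuda
#include <cassert>

// Hypothetical helper (not from the original code):
// ceiling division for positive integers, without floating point.
__host__ __device__ inline int ceilDiv(int a, int b) {
    return (a + b - 1) / b;
}

int main() {
    // Example values, assumed only for illustration.
    const int pixelNumberPerSlice = 1000;
    const int threadnum = 256;

    const int indexCorr = ceilDiv(pixelNumberPerSlice, threadnum); // 1000 / 256 rounded up
    const int loopNumb = indexCorr - 1;

    assert(indexCorr == 4);
    assert(loopNumb == 3);
    return 0;
}
```

Plain integer division would give 3 here instead of 4, which would leave the tail of each slice unassigned.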

Kernel that works (gives correct results), but whose threads do not access the data in a coalesced way:

int corrIndex = threadIdx.x * indexCorr;
int i = corrIndex + pixelNumberPerSlice * blockIdx.x;

for (int k = 0; k < loopNumb; k++) {
    // (if-check from the full code omitted here)
    add(data[i + k]); // accessing data at index i + k
} // end for

Then comes a warp-level reduction and a shared-memory reduction … (not important for this discussion).

Kernel that I am trying to implement in order to achieve memory coalescing, but which does not give correct results (the rest of the code is the same):

// threadIdx.x is not multiplied by indexCorr here
int i = threadIdx.x + pixelNumberPerSlice * blockIdx.x;

for (int k = 0; k < loopNumb; k++) {
    // (if-check from the full code omitted here)
    // k is multiplied by indexCorr instead
    add(data[i + k * indexCorr]); // accessing data at index i + k * indexCorr
} // end for

The idea in the working implementation is that indexCorr is used so that each thread (lane) streams through its own contiguous chunk of the data and ends its loop just before the element where the next thread starts its loop.

In the coalesced version that I am trying to implement, the threads of a block read consecutive elements of the data array in each iteration, and then all of them jump ahead together, so that in the next iteration the first thread reads the element right after the one that the last thread read in the previous iteration.
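
The access pattern described in the last paragraph can be sketched as follows. This is only a sketch under stated assumptions, not the actual kernel. For thread 0 to continue right after the last thread of the previous iteration, the stride between iterations has to equal the number of threads per block, so the sketch uses blockDim.x rather than indexCorr. A plain atomicAdd per slice stands in for add(...) and for the warp-level and shared-memory reductions. The names sumSlices and sliceSums, and the sizes in main, are hypothetical.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One block per slice. In every iteration, neighbouring threads read
// neighbouring elements, which is the coalesced pattern described above.
__global__ void sumSlices(const float* data, float* sliceSums, int pixelNumberPerSlice) {
    const int sliceStart = pixelNumberPerSlice * blockIdx.x;
    float local = 0.0f;

    // Assumption: the stride is blockDim.x, so thread 0 continues right after
    // the element read by the last thread in the previous iteration.
    // The loop condition also keeps every thread inside its own slice.
    for (int j = threadIdx.x; j < pixelNumberPerSlice; j += blockDim.x) {
        local += data[sliceStart + j];
    }

    // Stand-in for the warp-level and shared-memory reductions.
    atomicAdd(&sliceSums[blockIdx.x], local);
}

int main() {
    // Example sizes, assumed only for illustration.
    const int pixelNumberPerSlice = 1000, sliceCount = 4, threadnum = 256;

    std::vector<float> host(pixelNumberPerSlice * sliceCount, 1.0f);
    float* data = nullptr;
    float* sliceSums = nullptr;
    cudaMalloc(&data, host.size() * sizeof(float));
    cudaMalloc(&sliceSums, sliceCount * sizeof(float));
    cudaMemcpy(data, host.data(), host.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(sliceSums, 0, sliceCount * sizeof(float));

    sumSlices<<<sliceCount, threadnum>>>(data, sliceSums, pixelNumberPerSlice);

    std::vector<float> result(sliceCount);
    cudaMemcpy(result.data(), sliceSums, sliceCount * sizeof(float), cudaMemcpyDeviceToHost);
    for (int s = 0; s < sliceCount; ++s) {
        std::printf("slice %d: %.0f\n", s, result[s]);
    }

    cudaFree(data);
    cudaFree(sliceSums);
    return 0;
}
```

With the all-ones input, each slice sum should equal pixelNumberPerSlice if every element is visited exactly once, which makes this a quick check for skipped or duplicated indices.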