In my CUDA kernel I have the following for loop, in which each thread in a warp begins at a different index; the index then wraps back around to the first index once it goes past 32 (WARP_SIZE). Will looping like this cause warp divergence?
Note that I only launch one warp per block, so the thread index is the same as the lane index within the warp.
tid = threadIdx().x                      # 1-based lane index (one warp per block)
for j in tid:(tid + WARP_SIZE - 1)       # each lane starts at its own index
    wrapped_j_idx = ((j - 1) & (WARP_SIZE - 1)) + 1  # wrap back into 1:WARP_SIZE (modulo)
    val = foo()
    # No race conditions, as threads in the warp execute in lockstep
    forces[tid + offset, :] += val
    forces[wrapped_j_idx, :] -= val
end
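
In case the stripped-down pseudocode above is ambiguous, here is a minimal self-contained sketch of the pattern I mean, written against CUDA.jl. Everything here (pair_kernel!, compute_force, offset, the Float32 values, and the array sizes) is a placeholder standing in for my real foo() and data, not the actual code:

using CUDA

const WARP_SIZE = 32

# Placeholder for foo(): returns a 3-component force as a tuple.
@inline compute_force(i, j) = (0.1f0, 0.2f0, 0.3f0)

function pair_kernel!(forces, offset)
    tid = threadIdx().x                                  # lane index, 1:32 (one warp per block)
    for j in tid:(tid + WARP_SIZE - 1)
        wrapped_j_idx = ((j - 1) & (WARP_SIZE - 1)) + 1  # wrap back into 1:WARP_SIZE
        val = compute_force(tid, wrapped_j_idx)
        for d in 1:3                                     # explicit component loop instead of slicing with :
            forces[tid + offset, d] += val[d]
            forces[wrapped_j_idx, d] -= val[d]
        end
    end
    return nothing
end

# Example launch: a single block of one warp, with forces sized so tid + offset stays in bounds.
forces = CUDA.zeros(Float32, 2 * WARP_SIZE, 3)
@cuda threads=WARP_SIZE blocks=1 pair_kernel!(forces, WARP_SIZE)

The part I care about is the control flow: every lane runs the same fixed number of iterations, and only the starting value of j and the wrapped index differ per lane. Is that per-lane indexing alone enough to make the warp diverge?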