Can lock-step execution of threads within a warp be assumed in the following code?

The following kernel computes a homogeneous transform of the Nx4 gpuMat1f src and stores the result in the Nx4 gpuMat1f dst. The 4x4 transform matrix is given by the gpuMat1f tran.

Every thread in this kernel computes one element of the output matrix.

__global__ void transformMat4X4Kernel(const cv::cuda::PtrStepSz<float> tran, const cv::cuda::PtrStepSz<float> src, cv::cuda::PtrStepSz<float> dst)
{
    int ty = blockIdx.y * blockDim.y + threadIdx.y;
    int tx = threadIdx.x;
    // abort thread if outside grid
    if (ty >= src.rows)
        return;

    // row ty of src times row tx of tran
    float sum  = src(ty, 0) * tran(tx, 0);
          sum += src(ty, 1) * tran(tx, 1);
          sum += src(ty, 2) * tran(tx, 2);
          sum += src(ty, 3) * tran(tx, 3);

    // each thread writes exactly one output element
    dst(ty, tx) = sum;
}
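
Written out as a formula, and as far as I read my own kernel, the thread with indices (ty, tx) computes the dot product of row ty of src with row tx of tran, which amounts to multiplying src by the transpose of tran:

```latex
\mathrm{dst}_{ty,\,tx} \;=\; \sum_{k=0}^{3} \mathrm{src}_{ty,\,k}\,\mathrm{tran}_{tx,\,k}
\qquad\Longleftrightarrow\qquad
\mathrm{dst} \;=\; \mathrm{src}\,\mathrm{tran}^{\mathsf{T}}
```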

The kernel is launched this way:

dim3 blk(4, 128);
dim3 grd(1, cv::cudev::divUp(src.rows, blk.y));
transformMat4X4Kernel<<<grd, blk, 0, stream>>>(tran, src, dst);

The question is whether this kernel could also be used for in-place computation, like this:

transformMat4X4Kernel<<<grd, blk, 0, stream>>>(tran, src, src);

When used like that, the 4 threads that compute the 4 elements of one row of dst write in place. If these writes are not synchronized with the reads, a value already written by one thread could be read as input by a neighboring thread.
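
To make the concern concrete, here is a minimal sketch of how I imagine an in-place variant could avoid relying on lock-step execution. It is not the kernel above. The names `transformMat4X4InPlaceSketch` and `runInPlaceCheck` are hypothetical, and raw row-major pointers stand in for `PtrStepSz<float>` so that it builds without OpenCV. The assumption is that every thread finishes reading its row into a register, then a `__syncwarp()` barrier separates those reads from the writes.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical in-place variant. Raw row-major N x 4 pointers stand in for
// PtrStepSz<float> so the sketch compiles without OpenCV.
__global__ void transformMat4X4InPlaceSketch(const float* tran, float* mat, int rows)
{
    int ty = blockIdx.y * blockDim.y + threadIdx.y;
    int tx = threadIdx.x;        // 0..3: column of the output element
    bool active = ty < rows;     // no early return, so every lane reaches the barrier

    float sum = 0.0f;
    if (active) {
        // All reads of row ty happen before the barrier.
        for (int k = 0; k < 4; ++k)
            sum += mat[ty * 4 + k] * tran[tx * 4 + k];
    }

    // With blockDim.x == 4 the four threads of a row are consecutive lanes of
    // one warp, so a warp-level barrier orders their reads before their writes.
    __syncwarp();

    if (active)
        mat[ty * 4 + tx] = sum;
}

// Runs the sketch on test data and returns the number of mismatches against a
// CPU reference (-1 on a CUDA error).
int runInPlaceCheck()
{
    const int rows = 1000;
    const std::vector<float> hTran = { 1, 0, 0, 1,   0, 2, 0, 2,   0, 0, 3, 3,   0, 0, 0, 1 };
    std::vector<float> hMat(rows * 4), expected(rows * 4), result(rows * 4);

    // Small integer values keep every product and sum exact in float.
    for (int i = 0; i < rows; ++i)
        for (int k = 0; k < 4; ++k)
            hMat[i * 4 + k] = (k == 3) ? 1.0f : float(i + k);

    // CPU reference: expected = hMat * hTran^T
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < 4; ++j) {
            float s = 0.0f;
            for (int k = 0; k < 4; ++k)
                s += hMat[i * 4 + k] * hTran[j * 4 + k];
            expected[i * 4 + j] = s;
        }

    float* dTran = nullptr;
    float* dMat = nullptr;
    if (cudaMalloc(&dTran, 16 * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&dMat, rows * 4 * sizeof(float)) != cudaSuccess) { cudaFree(dTran); return -1; }
    cudaMemcpy(dTran, hTran.data(), 16 * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dMat, hMat.data(), rows * 4 * sizeof(float), cudaMemcpyHostToDevice);

    dim3 blk(4, 128);
    dim3 grd(1, (rows + blk.y - 1) / blk.y);
    transformMat4X4InPlaceSketch<<<grd, blk>>>(dTran, dMat, rows);

    cudaError_t err = cudaMemcpy(result.data(), dMat, rows * 4 * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dTran);
    cudaFree(dMat);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (size_t i = 0; i < result.size(); ++i)
        if (result[i] != expected[i]) ++mismatches;
    return mismatches;
}

int main()
{
    int mismatches = runInPlaceCheck();
    std::printf("mismatches: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

The sketch relies on blockDim.x being 4, so that the threads sharing a row sit in the same warp. With a different block shape a row could span warps, and a warp-level barrier would no longer be enough. My understanding is that since Volta, independent thread scheduling means lock-step execution within a warp is not guaranteed, which is why I would not want to depend on it.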

Please format your code properly. A simple way to do that is to edit your post (click the pencil icon), select all the code, press the </> button at the top of the edit pane, and save your changes.

My apologies. I could not find how to edit my original post.