The following kernel applies a homogeneous transform to the Nx4 gpuMat1f `src` and writes the result to the Nx4 gpuMat1f `dst`. The 4x4 transform matrix is given by the gpuMat1f `tran`.

Every thread in this kernel computes one element of the output matrix.

```
__global__ void transformMat4X4Kernel(const cv::cuda::PtrStepSz<float> tran,
                                      const cv::cuda::PtrStepSz<float> src,
                                      cv::cuda::PtrStepSz<float> dst)
{
    // One thread per output element: ty selects the row, tx the column (0..3).
    int ty = blockIdx.y * blockDim.y + threadIdx.y;
    int tx = threadIdx.x;

    // Abort the thread if it falls outside the matrix.
    if (ty >= src.rows)
        return;

    // Dot product of row ty of src with row tx of tran (dst = src * tran^T).
    float sum = src(ty, 0) * tran(tx, 0);
    sum += src(ty, 1) * tran(tx, 1);
    sum += src(ty, 2) * tran(tx, 2);
    sum += src(ty, 3) * tran(tx, 3);

    dst(ty, tx) = sum;
}
```

The kernel is launched this way:

```
dim3 blk(4, 128);
dim3 grd(1, cv::cudev::divUp(src.rows, blk.y));
transformMat4X4Kernel<<<grd, blk, 0, stream>>>(tran, src, dst);
```

The question is whether this kernel can also be used for in-place computation.

That is, launched this way:

```
transformMat4X4Kernel<<<grd, blk, 0, stream>>>(tran, src, src);
```

When launched like that, the 4 threads that compute the 4 elements of one output row write into the same row they read their input from. If those writes are not synchronized with the reads, a neighboring thread could read an already overwritten value as its input.
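
To make the concern concrete, here is a minimal host-side C++ sketch (not CUDA, and not a statement about what the GPU actually does). It emulates one row with the same formula as the kernel, `dst(ty, tx) = sum over k of src(ty, k) * tran(tx, k)`, and runs the four "threads" one after another, which is just one possible ordering. The names `Row4`, `Mat4`, `transformRowNaive` and `transformRowBuffered` are made up for this illustration. The second function shows one possible way around the hazard, assuming a single worker handles a whole row: copy the row into locals before writing anything.

```cpp
#include <array>

// Hypothetical helper types for this illustration only.
using Row4 = std::array<float, 4>;
using Mat4 = std::array<Row4, 4>;

// Same per-element formula as the kernel. Input and output are separate
// references, so passing the same row for both emulates the in-place launch
// with the four per-column computations executed strictly in sequence.
void transformRowNaive(const Mat4& tran, const Row4& src, Row4& dst)
{
    for (int c = 0; c < 4; ++c) {
        // If dst aliases src, columns written earlier are read back here.
        dst[c] = src[0] * tran[c][0] + src[1] * tran[c][1]
               + src[2] * tran[c][2] + src[3] * tran[c][3];
    }
}

// Assumed workaround: read the complete input row first, then write.
// After the copy, no write to dst can change the values being read.
void transformRowBuffered(const Mat4& tran, const Row4& src, Row4& dst)
{
    const Row4 in = src;  // snapshot of the input row
    for (int c = 0; c < 4; ++c) {
        dst[c] = in[0] * tran[c][0] + in[1] * tran[c][1]
               + in[2] * tran[c][2] + in[3] * tran[c][3];
    }
}
```

With a `tran` that simply swaps the first two coordinates and the row `{1, 2, 3, 1}`, the out-of-place result is `{2, 1, 3, 1}`. Calling `transformRowNaive` with the same row as input and output yields `{2, 2, 3, 1}` instead, because the second element is computed from the already overwritten first one. `transformRowBuffered` gives the out-of-place result even when input and output alias.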