I’m wondering whether I can speed up my processing by copying to the device only those matrix elements that my kernel actually operates on.
Currently I’m uploading a matrix (float, 1920x1080) plus a corresponding mask (unsigned char, 1920x1080) to the device every cycle, and each kernel thread checks whether the calculation has to be applied to its matrix element or not. This is wasteful, since almost 50% of the elements are not affected by my kernel.
There is no interaction between the elements of my matrices, so it doesn’t matter whether the kernel is launched over a linear, 2D or 3D layout.
Of course I can apply the mask on the host, but I would like to avoid an unnecessary copy of the buffer if possible.
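For reference, the host-side variant I have in mind would look roughly like the sketch below. It is only a minimal illustration in plain C++, not my actual code: the names `Compacted`, `compact` and `scatter` are made up for this post, and it assumes a non-zero mask byte means "process this element".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for the compacted data; not part of any library.
struct Compacted {
    std::vector<float> values;          // only the elements the kernel must process
    std::vector<std::uint32_t> indices; // original linear index of each value
};

// Gather every element whose mask byte is non-zero into a dense linear array.
// Assumption: non-zero mask byte == "apply the calculation here".
Compacted compact(const std::vector<float>& data,
                  const std::vector<unsigned char>& mask) {
    assert(data.size() == mask.size());
    Compacted out;
    out.values.reserve(data.size());
    out.indices.reserve(data.size());
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (mask[i] != 0) {
            out.values.push_back(data[i]);
            out.indices.push_back(static_cast<std::uint32_t>(i));
        }
    }
    return out;
}

// Scatter processed values back to their original positions in the full matrix.
void scatter(const Compacted& c, std::vector<float>& data) {
    for (std::size_t k = 0; k < c.values.size(); ++k) {
        data[c.indices[k]] = c.values[k];
    }
}
```

With this, only `values` (and `indices`) would go to the device, and the kernel could run over a dense 1D range without any mask check. The catch is the extra pass over the buffer on the host, and with a 4-byte index per active element the upload only shrinks noticeably if the index list can be reused across cycles, which assumes the mask does not change every cycle.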
Thank you in advance.