Copying data from the host to the device by applying a mask


I’m wondering whether I can speed up my processing by copying to the device only those elements of my matrix that my kernel is actually applied to.
Currently I’m uploading a matrix (float, 1920x1080) plus the corresponding mask (unsigned char, 1920x1080) to the device every cycle, and every kernel thread checks whether the calculation has to be applied to its matrix element. This is wasteful, especially since almost 50% of the elements are not affected by my kernel.
There is no interaction between the elements of my matrices, so it doesn’t matter whether the kernel is launched on a linear, 2D or 3D layout.
Of course I can apply the mask on the host, but I would like to avoid an unnecessary copy of the buffer if possible.

Any suggestions?
Thank you in advance.

cheers greg

Are you using pinned memory for the host->device copy? If not, the driver already copies the buffer once on the host side anyway (pageable memory is staged through a pinned buffer before the transfer), so you incur no additional overhead by allocating a pinned buffer yourself and filtering the data while moving it from its original location into that pinned buffer.
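
A minimal host-side sketch of that filtering step, in plain C++. The function name `compact_masked` and the parameters `packed` and `indices` are made up for illustration; in a real pipeline they would point into pinned staging buffers. The index list is an assumption on my part: it is only needed if the results have to be written back to their original positions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Copies only the elements whose mask byte is non-zero from `src` into `packed`
// and stores each kept element's original linear index in `indices`.
// `packed` and `indices` (hypothetical names) must have room for `count` entries.
// Returns the number of elements kept.
std::size_t compact_masked(const float* src, const unsigned char* mask,
                           std::size_t count,
                           float* packed, std::uint32_t* indices)
{
    assert(src && mask && packed && indices);
    std::size_t kept = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if (mask[i] != 0) {
            packed[kept]  = src[i];                        // dense value array
            indices[kept] = static_cast<std::uint32_t>(i); // where it came from
            ++kept;
        }
    }
    return kept;
}
```

The kernel can then be launched over `kept` elements only, and reads from the dense `packed` array are contiguous. Note that the index list costs 4 bytes per kept element, so with about 50% of the elements kept the saving in transfer size is modest; the main gain is that no threads are launched for skipped elements.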

Hi tera, thank you for the fast reply.

I see.
To answer your question, I’m writing the images from my cameras directly to a pinned buffer. The mask itself changes with every cycle (which implies moving data as well).

So there seems to be no other way than copying the data on the host while applying the mask.


If your host buffer is pinned already, you can easily map it to device memory space and then access only those elements whose mask bytes are set.
Your kernel would however need to tolerate the large latency incurred by a PCIe transfer. Some prefetching might help there.

Another obvious way to reduce the amount of data transferred would of course be to use bits instead of bytes for the mask (depending on how the mask is generated).

Hi tera,

Could you tell me whether there is a way to skip elements, i.e. to avoid launching threads for elements that are not significant? But even in that case my accesses wouldn’t be coalesced, right?

I’m sorry for asking these questions, but I’m a newbie in CUDA programming.

Furthermore, could you please tell me what you mean by bitwise masking? I can’t see the forest for the trees… :-( Could you give me an example?
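
To illustrate the "bits instead of bytes" suggestion above: each mask element needs only one bit, so 32 elements fit into one 32-bit word, and the 1920x1080 mask shrinks from 2,073,600 bytes to 259,200 bytes. The following is only a sketch in plain C++; the names `pack_mask` and `mask_bit` are invented for this example, and the bit layout (element i in bit i % 32 of word i / 32) is just one possible convention.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Packs a byte-per-element mask into 32-bit words:
// bit (i % 32) of word (i / 32) is set when mask[i] is non-zero.
std::vector<std::uint32_t> pack_mask(const unsigned char* mask, std::size_t count)
{
    assert(mask != nullptr || count == 0);
    std::vector<std::uint32_t> bits((count + 31) / 32, 0u);
    for (std::size_t i = 0; i < count; ++i) {
        if (mask[i] != 0) {
            bits[i / 32] |= (std::uint32_t{1} << (i % 32));
        }
    }
    return bits;
}

// Tests the bit for element i. The same expression can be used in a kernel,
// where each thread checks the bit belonging to its own element.
inline bool mask_bit(const std::uint32_t* bits, std::size_t i)
{
    return ((bits[i / 32] >> (i % 32)) & 1u) != 0;
}
```

A `std::vector` is used here only to keep the sketch short; for the actual transfer the packed words would be written straight into a pinned buffer. Whether packing pays off depends on how the mask is generated: if it can be produced as bits in the first place, the extra packing pass on the host disappears.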