Sub-sampling a 2-D image in the 'x' dimension only: what is the best approach?

I have a 2-D image (one channel only), where each pixel takes up either 16 or 32 bits.
I want to implement a “sub-sample-in-x” kernel that sub-samples in the x dimension only, by a factor of 2^p. That is, for every row it picks every (2^p)-th element and writes it into the sub-sampled image.

E.g. for p=3, I get a sub-sampled image with the same height as the original image and a width that is 1/8 of the original image width.

Possible values for p are 1 (sub-sample factor 2), 2 (sub-sample factor 4), 3 (sub-sample factor 8) and 4 (sub-sample factor 16).
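
To make the question concrete, here is a minimal sketch of the straightforward version I have in mind, with one thread per output pixel. All names (`subsampleX`, `subsampleXOnDevice`, the pitch parameters) are my own placeholders, and I am assuming tightly packed rows and an output width of `width >> p`. Note that consecutive threads read source elements that are 2^p apart, so most of each memory transaction on the read side is thrown away.

```cuda
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call failed.
static void check(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Baseline kernel (a sketch, not tuned): one thread per OUTPUT pixel.
// Output pixel (x, y) is a copy of input pixel (x * 2^p, y).
// Pitches are given in elements, not bytes (an assumption of this sketch).
template <typename T>
__global__ void subsampleX(const T* src, size_t srcPitchElems,
                           T* dst, size_t dstPitchElems,
                           int dstWidth, int height, int p)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < dstWidth && y < height) {
        // Writes are contiguous along x; reads have a stride of 2^p elements.
        dst[(size_t)y * dstPitchElems + x] =
            src[(size_t)y * srcPitchElems + ((size_t)x << p)];
    }
}

// Host wrapper for experiments: uploads a tightly packed image, runs the
// kernel, and downloads the result. T would be uint16_t or uint32_t.
template <typename T>
std::vector<T> subsampleXOnDevice(const std::vector<T>& src,
                                  int width, int height, int p)
{
    assert(p >= 1 && p <= 4);          // the range of p given in the post
    const int dstWidth = width >> p;   // assumption: floor(width / 2^p)
    assert(dstWidth > 0 && height > 0);
    assert(src.size() == (size_t)width * height);

    std::vector<T> out((size_t)dstWidth * height);
    T* dSrc = nullptr;
    T* dDst = nullptr;
    check(cudaMalloc(&dSrc, src.size() * sizeof(T)), "cudaMalloc src");
    check(cudaMalloc(&dDst, out.size() * sizeof(T)), "cudaMalloc dst");
    check(cudaMemcpy(dSrc, src.data(), src.size() * sizeof(T),
                     cudaMemcpyHostToDevice), "copy to device");

    // 32 threads along x so that a warp writes one contiguous row segment.
    dim3 block(32, 8);
    dim3 grid((dstWidth + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    subsampleX<T><<<grid, block>>>(dSrc, (size_t)width, dDst,
                                   (size_t)dstWidth, dstWidth, height, p);
    check(cudaGetLastError(), "kernel launch");

    check(cudaMemcpy(out.data(), dDst, out.size() * sizeof(T),
                     cudaMemcpyDeviceToHost), "copy to host");
    cudaFree(dSrc);
    cudaFree(dDst);
    return out;
}
```

For a 16-bit image I would call it as `subsampleXOnDevice<uint16_t>(image, width, height, 3)` to get the 1/8-width result.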

This kernel is likely bandwidth-bound because it performs very few arithmetic operations. Does anyone have tips or advice on how to implement the kernel so that it is less bandwidth-intensive?
One idea of mine is to transpose the whole image (e.g. with a function from the NPP library), then run a ‘sub-sample-in-y’ kernel (which should be much better in terms of memory accesses), and then transpose the sub-sampled image back.

The kernel will target the Kepler architecture or newer.