Axes repacking for multi-dimensional image arrays

Good morning, gentlemen!

I need to pass image data (located in CUDA device memory) from one GPU library to another.

The first library outputs images in NHWC format:

[n_pics][pic_height][pic_width][n_channels]

The other library expects images in NCHW format:

[n_pics][n_channels][pic_height][pic_width]
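
To pin down exactly what I mean by the two layouts, here is the index arithmetic written as plain C++ host code. This is only a minimal out-of-place reference sketch under my own assumptions: the function names (nhwc_offset, nchw_offset, nhwc_to_nchw) are made up for illustration and do not come from either library, and it is not the in-place GPU conversion I am looking for.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Linear offset of element (n, h, w, c) in an NHWC array
// laid out as [N][H][W][C]. Hypothetical helper, for illustration only.
inline std::size_t nhwc_offset(std::size_t n, std::size_t h, std::size_t w,
                               std::size_t c, std::size_t H, std::size_t W,
                               std::size_t C) {
    return ((n * H + h) * W + w) * C + c;
}

// Linear offset of the same element in an NCHW array
// laid out as [N][C][H][W]. Hypothetical helper, for illustration only.
inline std::size_t nchw_offset(std::size_t n, std::size_t c, std::size_t h,
                               std::size_t w, std::size_t C, std::size_t H,
                               std::size_t W) {
    return ((n * C + c) * H + h) * W + w;
}

// Out-of-place host reference: copies every element from its NHWC
// position to its NCHW position. It uses a second full-sized buffer,
// which is exactly what I would like to avoid on the GPU.
template <typename T>
std::vector<T> nhwc_to_nchw(const std::vector<T>& src, std::size_t N,
                            std::size_t H, std::size_t W, std::size_t C) {
    assert(src.size() == N * H * W * C);
    std::vector<T> dst(src.size());
    for (std::size_t n = 0; n < N; ++n)
        for (std::size_t h = 0; h < H; ++h)
            for (std::size_t w = 0; w < W; ++w)
                for (std::size_t c = 0; c < C; ++c)
                    dst[nchw_offset(n, c, h, w, C, H, W)] =
                        src[nhwc_offset(n, h, w, c, H, W, C)];
    return dst;
}
```

The reverse direction (NCHW to NHWC) would be the same loop with the source and destination offsets swapped.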

I want to convert image data from NHWC to NCHW, and back from NCHW to NHWC. Could you advise me how this can be done?

P.S. Both libraries run on the GPU, and I don’t want to copy the data to the host. I also don’t want to allocate a second full-sized buffer for the images (I would like to convert them in place, if possible).