Library or implementation for fast matrix transpose of any size?

I’m looking for a fast transpose for matrices of any size M*N.

The data format is very simple:
__global__ void transpose(byte* output, byte* input, int w, int h)
which turns an HxW matrix into a WxH matrix; input and output are arrays of size w*h.
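
To make the layout concrete, here is a plain CPU reference of the behaviour I mean. It is only a sketch and not meant to be fast. It assumes row-major storage and that byte is an 8-bit unsigned value; the name transpose_ref is just for illustration. A fast version should produce exactly the same output.

```cpp
#include <cstddef>

using byte = unsigned char;  // assumption: "byte" is an 8-bit unsigned value

// Slow but straightforward reference transpose (assumes row-major storage).
// input:  h rows x w columns, element (y, x) stored at input[y * w + x]
// output: w rows x h columns, element (x, y) stored at output[x * h + y]
void transpose_ref(byte* output, const byte* input, int w, int h) {
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            output[static_cast<std::size_t>(x) * h + y] =
                input[static_cast<std::size_t>(y) * w + x];
        }
    }
}
```

For example, with w = 3 and h = 2, the input {1, 2, 3, 4, 5, 6} (rows 1 2 3 and 4 5 6) becomes {1, 4, 2, 5, 3, 6} (rows 1 4, 2 5 and 3 6).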

Which library should I use? Speed is my only concern.
Thank you.