You can issue the transform of several arrays in one call, and you can transfer more than one line at a time using a 2D memcpy (cudaMemcpy2D).
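As a rough sketch of both points, assuming the transform in question is a complex-to-complex FFT done with cuFFT (the post does not say which library is used): cufftPlanMany builds one plan that covers a whole batch of arrays, and cudaMemcpy2D copies many rows in one call while skipping any padding between them. The sizes n, batch and hostStride and the helper name run_batched_fft are made up for illustration.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical helper: returns the number of rows whose DC bin is wrong,
// or -1 if a CUDA or cuFFT call fails.
int run_batched_fft() {
    const int n = 256;          // length of each 1D transform (assumed)
    const int batch = 64;       // number of arrays handled in one call (assumed)
    const int hostStride = 300; // elements per host row, padding included (assumed)

    // Host buffer with padded rows, filled with ones.
    std::vector<cufftComplex> h(static_cast<size_t>(batch) * hostStride);
    for (auto &v : h) { v.x = 1.0f; v.y = 0.0f; }

    cufftComplex *d = nullptr;
    if (cudaMalloc(&d, sizeof(cufftComplex) * n * batch) != cudaSuccess) return -1;

    // One call moves all `batch` rows; the source pitch skips the host padding.
    if (cudaMemcpy2D(d, n * sizeof(cufftComplex),
                     h.data(), hostStride * sizeof(cufftComplex),
                     n * sizeof(cufftComplex), batch,
                     cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(d);
        return -1;
    }

    // One plan, one exec call for all `batch` transforms.
    // Passing nullptr for inembed/onembed selects the contiguous layout.
    cufftHandle plan;
    int dims[1] = { n };
    if (cufftPlanMany(&plan, 1, dims, nullptr, 1, n, nullptr, 1, n,
                      CUFFT_C2C, batch) != CUFFT_SUCCESS) {
        cudaFree(d);
        return -1;
    }
    if (cufftExecC2C(plan, d, d, CUFFT_FORWARD) != CUFFT_SUCCESS) {
        cufftDestroy(plan);
        cudaFree(d);
        return -1;
    }

    // Copy back only to check the result; cudaMemcpy waits for the FFT to finish.
    std::vector<cufftComplex> out(static_cast<size_t>(n) * batch);
    cudaError_t err = cudaMemcpy(out.data(), d, sizeof(cufftComplex) * n * batch,
                                 cudaMemcpyDeviceToHost);
    cufftDestroy(plan);
    cudaFree(d);
    if (err != cudaSuccess) return -1;

    // For an all-ones input, bin 0 of every row should equal n.
    int bad = 0;
    for (int b = 0; b < batch; ++b)
        if (std::fabs(out[static_cast<size_t>(b) * n].x - n) > 1e-3f) ++bad;
    return bad;
}

int main() { return run_batched_fft() == 0 ? 0 : 1; }
```

This should build with something like nvcc file.cu -lcufft. If the rows are not padded on either side, a plain cudaMemcpy of the whole block does the same job.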
The transposing part is not easy. It would help if you could do the transposing without the intermediate steps. That way you would have only GPU-to-GPU transfers.
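One way to keep the transpose on the device is a small kernel that stages tiles through shared memory. The following is only a sketch under assumptions: a row-major float matrix, a tile size of 32, and made-up names (transposeTiled, run_transpose_check). It is not taken from the original setup, and the copy back to the host is there only to verify the result.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define TILE 32  // tile edge, also the block edge (assumed)

// out (cols x rows) = transpose of in (rows x cols), both row-major.
__global__ void transposeTiled(float *out, const float *in, int rows, int cols) {
    __shared__ float tile[TILE][TILE + 1];  // +1 column avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;  // input column
    int y = blockIdx.y * TILE + threadIdx.y;  // input row
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;  // output column (an input row index)
    y = blockIdx.x * TILE + threadIdx.y;  // output row (an input column index)
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];
}

// Hypothetical helper: returns the number of mismatched elements,
// or -1 if a CUDA call fails.
int run_transpose_check() {
    const int rows = 100, cols = 70;  // deliberately not multiples of TILE
    std::vector<float> h(rows * cols);
    for (int i = 0; i < rows * cols; ++i) h[i] = static_cast<float>(i);

    float *dIn = nullptr, *dOut = nullptr;
    size_t bytes = sizeof(float) * rows * cols;
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return -1; }
    cudaMemcpy(dIn, h.data(), bytes, cudaMemcpyHostToDevice);

    // The transpose itself never leaves the GPU.
    dim3 block(TILE, TILE);
    dim3 grid((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE);
    transposeTiled<<<grid, block>>>(dOut, dIn, rows, cols);

    // Copy back for verification only.
    std::vector<float> t(rows * cols);
    cudaError_t err = cudaMemcpy(t.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            if (t[c * rows + r] != h[r * cols + c]) ++bad;
    return bad;
}

int main() { return run_transpose_check() == 0 ? 0 : 1; }
```

In a real pipeline dOut would be passed straight to the next device-side step, so no host round trip is needed between the transform and the transpose.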