Naive question on cudaMemcpy

Hi,

Does cudaMemcpy execute in parallel? That is, if I need to transpose a matrix and copy it to a new device memory pointer, do I get a speed-up by writing a kernel that does the transpose and the copy, as opposed to calling cudaMemcpy2D with a pitch #rows times? If memcpy takes linear time, I really don’t see the point of writing that kernel…

Kind of a stupid question, I know, but I'd just like someone to confirm it…
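For reference, the kernel alternative could look roughly like the sketch below. It assumes a row-major matrix of floats stored without padding; the names transposeNaive, rows and cols, the 8 x 4 test size and the 16 x 16 block size are all made up for illustration. This is the simplest possible version, not a tuned one.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Naive out-of-place transpose (hypothetical name).
// `in` is rows x cols, row-major; `out` is cols x rows, row-major.
__global__ void transposeNaive(float* out, const float* in, int rows, int cols) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;  // column index in `in`
    int r = blockIdx.y * blockDim.y + threadIdx.y;  // row index in `in`
    if (r < rows && c < cols)
        out[c * rows + r] = in[r * cols + c];
}

int main() {
    const int rows = 8, cols = 4;  // arbitrary sizes for the example
    std::vector<float> h_in(rows * cols), h_out(rows * cols);
    for (int i = 0; i < rows * cols; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc((void**)&d_in, rows * cols * sizeof(float));
    cudaMalloc((void**)&d_out, rows * cols * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), rows * cols * sizeof(float), cudaMemcpyHostToDevice);

    // One thread per element; the grid is rounded up to cover the whole matrix.
    dim3 block(16, 16);
    dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
    transposeNaive<<<grid, block>>>(d_out, d_in, rows, cols);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, rows * cols * sizeof(float), cudaMemcpyDeviceToHost);

    // Check the result against the definition of a transpose.
    bool ok = true;
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            if (h_out[c * rows + r] != h_in[r * cols + c]) ok = false;
    printf("transpose %s\n", ok ? "ok" : "FAILED");

    cudaFree(d_in);
    cudaFree(d_out);
    return ok ? 0 : 1;
}
```

In this version the reads are contiguous across a warp but the writes are strided, so a tiled variant that stages each tile in shared memory would likely be faster on large matrices.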

Forgive my laziness, I ran some tests and:

copied 32 floats in .0039 milliseconds

copied 512 floats in .0034 milliseconds

copied 512 floats in 32-float chunks in .0556 milliseconds

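So I guess the question is answered: the time is dominated by per-call overhead rather than by the amount of data, so one large copy beats many small ones.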
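A test like the one above could be sketched as follows. The original test code is not shown, so everything here is an assumption: the copies are taken to be device-to-device, timing uses CUDA events averaged over an arbitrary number of repetitions, and timeCopy is a made-up helper name. Absolute numbers will differ by GPU and driver.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: copies `total` floats in pieces of `chunk` floats,
// repeats that `reps` times, and returns the average milliseconds per repetition.
static float timeCopy(float* dst, const float* src, int total, int chunk, int reps) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        for (int off = 0; off < total; off += chunk)
            cudaMemcpy(dst + off, src + off, chunk * sizeof(float),
                       cudaMemcpyDeviceToDevice);  // assumed copy direction
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;
}

int main() {
    const int n = 512;     // largest transfer in the test
    const int reps = 100;  // arbitrary repetition count for averaging

    float *src = nullptr, *dst = nullptr;
    cudaMalloc((void**)&src, n * sizeof(float));
    cudaMalloc((void**)&dst, n * sizeof(float));
    cudaMemset(src, 0, n * sizeof(float));

    timeCopy(dst, src, n, n, 10);  // warm-up, result discarded

    printf("32 floats, one call:         %f ms\n", timeCopy(dst, src, 32, 32, reps));
    printf("512 floats, one call:        %f ms\n", timeCopy(dst, src, n, n, reps));
    printf("512 floats, 32-float chunks: %f ms\n", timeCopy(dst, src, n, 32, reps));

    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```

The third measurement issues 16 separate cudaMemcpy calls, which is the pattern that copying a matrix one row at a time would produce.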