Does cudamemcpy executes parallely? That is, if I need to transpose a matrix and memcpy it to a new device memory pointer, do I get a speed-up by writing a kernel to do the transpose and memcpy it as opposed to call memcpy2D with a pitch #rows times? If memcpy takes linear time I really don’t see a point writing that kernel…
Kinda stupid question I know, but I just wish someone to confirm it…