I have a question regarding the cudaMemcpy2DArrayToArray function and cudaArrays in general.
I ran the following two tests, copying 18 MB of data in each:
T1. cudaMemcpy2DArrayToArray, device-to-device copy: this takes about 4.9 ms.
T2. cudaMemcpy (normal linear memory), device-to-device copy: this takes about 1.7 ms, which is consistent with the bandwidth (2000 MB) I have on my graphics card.
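To make the gap easier to compare, here is a small sketch that converts the two timings into throughput. It assumes 1 MB = 10^6 bytes and counts only the 18 MB payload once; a device-to-device copy both reads and writes the data, so the actual memory traffic is about twice that. The function and variable names are just for illustration.

```python
def throughput_gb_per_s(megabytes: float, milliseconds: float) -> float:
    """Payload throughput in GB/s, assuming 1 MB = 10**6 bytes and 1 GB = 10**9 bytes."""
    bytes_copied = megabytes * 1e6
    seconds = milliseconds * 1e-3
    return bytes_copied / seconds / 1e9


SIZE_MB = 18.0       # payload size used in both tests
T1_ARRAY_MS = 4.9    # cudaMemcpy2DArrayToArray, device to device
T2_LINEAR_MS = 1.7   # cudaMemcpy on linear memory, device to device

array_gbps = throughput_gb_per_s(SIZE_MB, T1_ARRAY_MS)
linear_gbps = throughput_gb_per_s(SIZE_MB, T2_LINEAR_MS)
slowdown = T1_ARRAY_MS / T2_LINEAR_MS

print(f"T1 (cudaArray): {array_gbps:.2f} GB/s")
print(f"T2 (linear):    {linear_gbps:.2f} GB/s")
print(f"T1 is {slowdown:.2f}x slower than T2")
```

So the array-to-array copy moves the same payload at roughly a third of the rate of the linear copy.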
My question is:
Why is the cudaArray-to-cudaArray copy so much slower than the linear one? Is this due to the Z-curve layout of the memory, or am I doing something wrong when copying the data?