cudaMemcpy2DArrayToArray slower than linear memory copy

Hi,

I have a question regarding the cudaMemcpy2DArrayToArray function and cudaArrays in general.

I ran the following two tests, each copying 18 MB of data:

T1. cudaMemcpy2DArrayToArray, device-to-device copy: this takes about 4.9 ms.

T2. cudaMemcpy (normal linear memory), device-to-device copy: this takes about 1.7 ms, which is consistent with the bandwidth (2000 MB) of my graphics card.
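For reference, here is a minimal sketch of how two such copies can be timed with CUDA events. It is not the exact code behind the numbers above: the 2048 x 2304 float dimensions (18 MiB in total), the repetition count, and the helper names time_linear_copy and time_array_copy are assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Average time in ms of one device-to-device copy between linear buffers.
float time_linear_copy(size_t bytes, int reps) {
    void *src = nullptr, *dst = nullptr;
    CHECK(cudaMalloc(&src, bytes));
    CHECK(cudaMalloc(&dst, bytes));
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Warm-up copy so one-time setup cost is not measured.
    CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice));

    CHECK(cudaEventRecord(start));
    for (int i = 0; i < reps; ++i)
        CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice));
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(src));
    CHECK(cudaFree(dst));
    return ms / reps;
}

// Average time in ms of one cudaArray-to-cudaArray copy (float elements).
float time_array_copy(size_t width, size_t height, int reps) {
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t src = nullptr, dst = nullptr;
    CHECK(cudaMallocArray(&src, &desc, width, height));  // width in elements
    CHECK(cudaMallocArray(&dst, &desc, width, height));
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    size_t rowBytes = width * sizeof(float);  // copy width is given in bytes

    // Warm-up copy so one-time setup cost is not measured.
    CHECK(cudaMemcpy2DArrayToArray(dst, 0, 0, src, 0, 0, rowBytes, height,
                                   cudaMemcpyDeviceToDevice));

    CHECK(cudaEventRecord(start));
    for (int i = 0; i < reps; ++i)
        CHECK(cudaMemcpy2DArrayToArray(dst, 0, 0, src, 0, 0, rowBytes, height,
                                       cudaMemcpyDeviceToDevice));
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFreeArray(src));
    CHECK(cudaFreeArray(dst));
    return ms / reps;
}

int main() {
    const size_t width = 2048, height = 2304;            // assumed dimensions
    const size_t bytes = width * height * sizeof(float); // 18 MiB
    const int reps = 20;                                 // assumed repeat count

    float msArray  = time_array_copy(width, height, reps);
    float msLinear = time_linear_copy(bytes, reps);

    // Payload size divided by copy time, reported in MB/s.
    double mb = bytes / (1024.0 * 1024.0);
    printf("array  copy: %.3f ms (%.0f MB/s)\n", msArray,  mb / (msArray  * 1e-3));
    printf("linear copy: %.3f ms (%.0f MB/s)\n", msLinear, mb / (msLinear * 1e-3));
    return 0;
}
```

Averaging over several repetitions after a warm-up copy keeps one-time setup cost out of the measurement, so the two timings are compared on equal terms.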

My question is:

Why is the cudaArray-to-cudaArray copy so much slower than the linear case? Is this due to the Z-curve layout of the array memory, or am I doing something wrong when copying the data?