"cudaMemcpy3DAsync" behaves differently with driver 260.19.21

Hi all,

We developed our application on CUDA 3.1 with driver 256.35, using a GTX480.

After upgrading to driver 260.19.21, I found that “cudaMemcpy3DAsync” now returns the error code “invalid argument”. The behavior is the same with both CUDA 3.1 and 3.2.
We use “cudaMemcpy3DAsync” to copy a 3D block of data device-to-device, with all three dimensions greater than 10. The destination is a 3D array. The source is a device pointer allocated with “cudaMalloc”.

I then found that if the device pointer (the source pointer for the 3D transfer) is allocated with “cudaMalloc3D”, the problem disappears. However, the application then runs about 5% slower.

Apparently, the behavior of “cudaMemcpy3DAsync” changed between 256.35 and 260.19.21. It now seems to be sensitive to whether the data was allocated with some particular pitch.
The Reference Manual “recommends” using “cudaMalloc3D”, but it does not say what happens if “cudaMalloc” is used instead, or how “cudaMalloc” should be used in this case.

Does anybody know a workaround that avoids the 5% performance drop? Is it still possible to use “cudaMalloc” for the source pointer passed to “cudaMemcpy3DAsync”?

NVIDIA, can you please address this? Can the next driver support the old behavior?
This matters because driver 260.19.21 is required to run the GTX580, so this issue costs us about 5% of the new card's performance.

Thanks
