Avoiding cudaMemcpy2D() because of 65536 pitch limit

Hi,

I am having trouble with the 65536 pitch limit of the cudaMemcpy2D() function. I allocate a matrix that is very wide (>1000000) but not very high (<10) with cudaMallocPitch(). The allocation itself reports no errors. Then, after the kernels finish, I would like to copy only the first row of the matrix back to host memory. Is there any trick to do this without cudaMemcpy2D()?

Kind regards,

Daniel Dekkers

… continued …

It seems I can simply use cudaMemcpy() to copy the first row (without the padding bytes) from the device back to the host. This works for arbitrary matrix row widths, including widths above 65536. Why does this 65536 float pitch boundary in cudaMemcpy2D() exist anyway?
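
For reference, here is a minimal sketch of the workaround described above. It is only an illustration under a few assumptions: the element type is float, the matrix is 1048576 x 8, and the names fill, copyRowToHost and CHECK are hypothetical helpers made up for this example. It relies on the fact that row r of a cudaMallocPitch() allocation starts at base + r * pitch bytes, so a single row is one contiguous range that a plain cudaMemcpy() can copy.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical error-check helper: abort on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical kernel: write (row + 1) into every element of a pitched matrix.
__global__ void fill(float* data, size_t pitch, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y;
    if (col < width && row < height) {
        // Rows are pitch bytes apart, so step through the buffer in bytes.
        float* rowPtr = (float*)((char*)data + (size_t)row * pitch);
        rowPtr[col] = (float)(row + 1);
    }
}

// Hypothetical helper: copy one row (without padding bytes) to the host
// using plain cudaMemcpy() instead of cudaMemcpy2D().
cudaError_t copyRowToHost(float* hostDst, const float* devBase,
                          size_t pitch, size_t widthBytes, int row)
{
    const char* src = (const char*)devBase + (size_t)row * pitch;
    return cudaMemcpy(hostDst, src, widthBytes, cudaMemcpyDeviceToHost);
}

int main()
{
    const int width  = 1 << 20;  // assumed: 1048576 floats per row
    const int height = 8;        // assumed: fewer than 10 rows
    const size_t widthBytes = (size_t)width * sizeof(float);

    float* dev = nullptr;
    size_t pitch = 0;            // row stride in bytes, chosen by the runtime
    CHECK(cudaMallocPitch((void**)&dev, &pitch, widthBytes, height));

    dim3 block(256);
    dim3 grid((width + block.x - 1) / block.x, height);
    fill<<<grid, block>>>(dev, pitch, width, height);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    // Copy only the first row back to the host.
    std::vector<float> host(width);
    CHECK(copyRowToHost(host.data(), dev, pitch, widthBytes, 0));

    // Every element of row 0 was written as 1.0f by the kernel.
    int mismatches = 0;
    for (int i = 0; i < width; ++i)
        if (host[i] != 1.0f) ++mismatches;
    printf("pitch = %zu bytes, mismatches in first row = %d\n",
           pitch, mismatches);

    CHECK(cudaFree(dev));
    return mismatches == 0 ? 0 : 1;
}
```

Passing a different row index to copyRowToHost copies any other single row the same way. Copying several rows at once without their padding would still need one cudaMemcpy() call per row, or cudaMemcpy2D().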

Kind regards,
Daniel Dekkers