My kernel is called many times (in the thousands), and before every iteration I have to copy the data updated in the previous iteration into a CUDA array that is bound to a 3D texture (a device-to-device copy). It turns out this copy takes more than 50% of the execution time. In the 2D scenario (with a comparable system size), the equivalent copy takes only about 7% of the execution time. Why is this copy from a 3D pitched allocation to a 3D CUDA array so slow?
Here’s the code for the 3D case:
texture<unsigned char, 3, cudaReadModeElementType> spinTexRef3D;
...
cudaChannelFormatDesc spinChannelDesc = cudaCreateChannelDesc<unsigned char>();
cudaExtent spinExtent = make_cudaExtent(64, 64, 64); // or any other multiple of 8
cudaExtent spinUpdateExtent = make_cudaExtent(64*sizeof(unsigned char), 64, 64);
cudaArray* d_spin;
cudaPitchedPtr d_spinUpdatePtr;
cudaMalloc3DArray(&d_spin, &spinChannelDesc, spinExtent);
cudaMalloc3D(&d_spinUpdatePtr, spinUpdateExtent);
// copy initial data from the host to d_spinUpdatePtr
...
cudaMemcpy3DParms spinCopyParams = {0};
spinCopyParams.srcPtr = d_spinUpdatePtr;
spinCopyParams.dstArray = d_spin;
spinCopyParams.extent = spinExtent;
spinCopyParams.kind = cudaMemcpyDeviceToDevice;
...
for (...) {
    cudaMemcpy3D(&spinCopyParams);
    cudaBindTextureToArray(spinTexRef3D, d_spin, spinChannelDesc);
// call kernel
...
}
I have the same scenario, and I am also wondering whether the 3D copy is inherently this slow. I likewise update data in one kernel that is then read by another kernel through a 3D texture, in order to use the hardware interpolation the texture offers, and this process repeats many times. The copy is now the bottleneck. Are there suggestions for handling it differently, such as using 2D textures and doing the extra interpolation myself?