My kernel is called many times (in the thousands), and before every iteration I have to copy the data updated in the previous iteration into a CUDA array that is bound to a 3D texture (a device-to-device copy). It turns out this copy takes more than 50% of the execution time. In the 2D scenario (with a comparable system size), the equivalent copy takes only about 7% of the execution time. Why is this copy from a 3D pitched allocation to a 3D CUDA array so slow?
Here’s the code for the 3D case:
texture<unsigned char, 3, cudaReadModeElementType> spinTexRef3D;
...
cudaChannelFormatDesc spinChannelDesc = cudaCreateChannelDesc<unsigned char>();
cudaExtent spinExtent = make_cudaExtent(64, 64, 64); // or any other multiple of 8
cudaExtent spinUpdateExtent = make_cudaExtent(64*sizeof(unsigned char), 64, 64);
cudaArray* d_spin;
cudaPitchedPtr d_spinUpdatePtr;
cudaMalloc3DArray(&d_spin, &spinChannelDesc, spinExtent);
cudaMalloc3D(&d_spinUpdatePtr, spinUpdateExtent);
// copy initial data from the host to d_spinUpdatePtr
...
cudaMemcpy3DParms spinCopyParams = {0};
spinCopyParams.srcPtr = d_spinUpdatePtr;
spinCopyParams.dstArray = d_spin;
spinCopyParams.extent = spinExtent;
spinCopyParams.kind = cudaMemcpyDeviceToDevice;
...
for (...) {
    cudaMemcpy3D(&spinCopyParams);
    cudaBindTextureToArray(spinTexRef3D, d_spin, spinChannelDesc);
// call kernel
...
}
I have the same scenario, and I am also wondering whether the 3D copy is inherently this slow. I likewise update data in one kernel that is then read by another kernel through a 3D texture, in order to use the hardware interpolation the texture offers, and this process repeats many times. The copy is now the bottleneck. Are there suggestions for handling it differently, such as using 2D textures and doing the extra interpolation myself?