Hi,
I want to use an asynchronous version of cudaMemcpy2DArrayToArray for a device-to-device memory copy. cudaMemcpy3DAsync should be able to work on 2D CUDA arrays.
My code looks like the following:
cudaMemcpy3DParms params = {0};
params.srcArray = input_array_;
params.dstArray = output_array_;
params.extent = make_cudaExtent(width_, height_, 0);
params.kind = cudaMemcpyDeviceToDevice;
cudaMemcpy3DAsync(&params, cuda_stream_);
Input and output CUDA arrays have 2 channels and are created using cudaMalloc3DArray with the same extent:
cudaChannelFormatDesc channel_descriptor = cudaCreateChannelDesc(
8, 8, 0, 0, cudaChannelFormatKindUnsigned);
cudaExtent extent = {0};
extent.width = width;
extent.height = height;
extent.depth = 0;
cudaMalloc3DArray(&cuda_array_, &channel_descriptor, extent);
I tried a depth extent of 0, and it doesn’t seem to copy anything; a depth of 1 fails with the error “invalid argument” at cudaMemcpy3DAsync.
How should I set the depth of the extent so that it copies the 2D CUDA arrays correctly? According to the cudaMalloc3DArray description, a 2D array is allocated when only the depth extent is zero.
Thanks.
However, cudaMemcpy3DAsync is going to require a transfer depth extent of at least 1. From the documentation: “cudaMemcpy3DAsync() copies data between two 3D objects.”
I don’t seem to have any trouble with what you have shown when using a depth of 1:
$ cat t1989.cu
int main(){
  cudaArray_t cuda_array1, cuda_array2;
  cudaChannelFormatDesc channel_descriptor = cudaCreateChannelDesc(
      8, 8, 0, 0, cudaChannelFormatKindUnsigned);
  int width,height;
  width = height = 1024;
  cudaExtent extent = {0};
  extent.width = width;
  extent.height = height;
  extent.depth = 1;
  cudaMalloc3DArray(&cuda_array1, &channel_descriptor, extent);
  cudaMalloc3DArray(&cuda_array2, &channel_descriptor, extent);
  cudaMemcpy3DParms params = {0};
  params.srcArray = cuda_array1;
  params.dstArray = cuda_array2;
  params.extent = make_cudaExtent(width, height, 1);
  params.kind = cudaMemcpyDeviceToDevice;
  cudaMemcpy3DAsync(&params);
}
$ nvcc -o t1989 t1989.cu
$ cuda-memcheck ./t1989
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$
If you are still having issues, please provide a short complete example, just as I have done, rather than snippets. If you add an appropriately created stream argument to the above call, it also runs without error.
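For the stream variant mentioned above, here is a minimal sketch. It assumes the question's setup (two 2D arrays of 2-channel 8-bit elements, allocated with depth 0); the function name roundTripOnStream is hypothetical and the host fill pattern is arbitrary test data. It creates its own stream, fills the source array from the host, performs the array-to-array copy with cudaMemcpy3DAsync using a transfer depth of 1, synchronizes, and reads the destination back to check it.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper (not from the thread): fills a 2D, 2-channel 8-bit CUDA
// array from the host, copies it to a second array with cudaMemcpy3DAsync on a
// user-created stream, reads the result back, and compares it with the input.
bool roundTripOnStream(size_t width, size_t height)
{
    // Two 8-bit unsigned channels: one array element is one uchar2.
    cudaChannelFormatDesc desc =
        cudaCreateChannelDesc(8, 8, 0, 0, cudaChannelFormatKindUnsigned);
    // Allocation extent: depth 0 requests a 2D CUDA array.
    cudaExtent allocExtent = make_cudaExtent(width, height, 0);

    cudaArray_t src = nullptr, dst = nullptr;
    cudaStream_t stream = nullptr;
    std::vector<uchar2> hostIn(width * height), hostOut(width * height);
    for (size_t i = 0; i < hostIn.size(); ++i)
        hostIn[i] = make_uchar2((unsigned char)(i % 256),
                                (unsigned char)(i / 256 % 256));

    bool ok = false;
    do {
        if (cudaMalloc3DArray(&src, &desc, allocExtent) != cudaSuccess) break;
        if (cudaMalloc3DArray(&dst, &desc, allocExtent) != cudaSuccess) break;
        if (cudaStreamCreate(&stream) != cudaSuccess) break;

        // Host -> source array. Pitch and width here are in BYTES.
        size_t rowBytes = width * sizeof(uchar2);
        if (cudaMemcpy2DToArray(src, 0, 0, hostIn.data(), rowBytes, rowBytes,
                                height, cudaMemcpyHostToDevice) != cudaSuccess) break;

        // Array -> array on the stream. The transfer extent is in ELEMENTS,
        // and its depth must be 1 even though the arrays are 2D.
        cudaMemcpy3DParms params = {0};
        params.srcArray = src;
        params.dstArray = dst;
        params.extent = make_cudaExtent(width, height, 1);
        params.kind = cudaMemcpyDeviceToDevice;
        if (cudaMemcpy3DAsync(&params, stream) != cudaSuccess) break;
        if (cudaStreamSynchronize(stream) != cudaSuccess) break;

        // Destination array -> host (width in bytes again), then compare.
        if (cudaMemcpy2DFromArray(hostOut.data(), rowBytes, dst, 0, 0, rowBytes,
                                  height, cudaMemcpyDeviceToHost) != cudaSuccess) break;
        ok = true;
        for (size_t i = 0; i < hostIn.size() && ok; ++i)
            ok = hostIn[i].x == hostOut[i].x && hostIn[i].y == hostOut[i].y;
    } while (false);

    if (stream) cudaStreamDestroy(stream);
    if (src) cudaFreeArray(src);
    if (dst) cudaFreeArray(dst);
    return ok;
}
```

Note the two different width units in this sketch: cudaMemcpy2DToArray and cudaMemcpy2DFromArray take widths in bytes (width * sizeof(uchar2)), while the cudaMemcpy3DAsync extent for an array-to-array copy is counted in array elements.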
That worked, thanks. I had wrongly set the cudaMemcpy3DAsync width extent to double the width extent of the allocated CUDA array, trying to account for the fact that there are 2 channels in the array.
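One way to avoid that mistake is to take the copy extent from the source array itself instead of recomputing it. The following is a minimal sketch, assuming dst and src share the same channel format and extent; copyWholeArrayAsync is a hypothetical name. It reads the shape with cudaArrayGetInfo, so the width stays in array elements (one 2-channel element per column), and promotes any zero dimension to 1 for the transfer.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (not from the thread): asynchronously copies the whole of
// `src` into `dst` (assumed to have the same format and extent) on `stream`.
// The transfer extent is read back from `src`, so its width is already in
// elements of the array's channel format, not in bytes or channels.
cudaError_t copyWholeArrayAsync(cudaArray_t dst, cudaArray_t src, cudaStream_t stream)
{
    cudaChannelFormatDesc desc;
    cudaExtent ext;
    unsigned int flags = 0;
    cudaError_t err = cudaArrayGetInfo(&desc, &ext, &flags, src);
    if (err != cudaSuccess) return err;

    cudaMemcpy3DParms params = {0};
    params.srcArray = src;
    params.dstArray = dst;
    // A 2-channel 8-bit element counts as ONE element, so the width is not
    // multiplied by the channel count. Zero (unused) dimensions are promoted
    // to 1, since a transfer depth of 0 copies nothing.
    params.extent = make_cudaExtent(ext.width,
                                    ext.height ? ext.height : 1,
                                    ext.depth ? ext.depth : 1);
    params.kind = cudaMemcpyDeviceToDevice;
    return cudaMemcpy3DAsync(&params, stream);
}
```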