Thanks @Robert_Crovella! You are absolutely right that the cudaMemcpy3D performance of 300 MB/s that I had reported needs to be compared to a kernel copy where both the source and the destination memory accesses are strided. It is nice to see that we then get results that agree (about 300 MB/s in both cases).
Now, I came to realize that I didn't call cudaMemcpy3DAsync with the right parameters for my needs. My aim is to use cudaMemcpy3DAsync to read strided data from the device and write it to a contiguous buffer on the host. So I am actually interested in the case that we have not yet looked at, i.e. the case marked with a question mark in the following performance table (approximate values; stride = 2 KB; data = 512x512x4 bytes):
                           cudaMemcpy3DAsync    kernel copy
contiguous-to-contiguous   ~13 GB/s             ~13 GB/s
strided-to-contiguous      ?                    ~11 GB/s
strided-to-strided         ~300 MB/s            ~300 MB/s
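For context, the kernel-copy column refers to simple copy kernels. For the strided-to-contiguous row, such a gather kernel could look roughly like the sketch below (kernel name, launch configuration and the choice of destination buffer are illustrative, not necessarily identical to the code behind the ~11 GB/s figure):
// Sketch: reads the YZ-plane at x == ix of an nx x ny x nz array (a stride of
// nx elements between consecutive reads) and writes it to a contiguous buffer.
__global__ void copy_yz_plane(const float *src, float *dst,
                              size_t nx, size_t ny, size_t nz, size_t ix){
    size_t i = (size_t)blockIdx.x*blockDim.x + threadIdx.x;  // index within the plane
    if (i < ny*nz) dst[i] = src[ix + i*nx];                  // strided read, contiguous write
}
// Illustrative launch for the x == 0 plane (the destination may be device memory
// or mapped pinned host memory, depending on the benchmark setup):
// copy_yz_plane<<<(ny*nz + 255)/256, 256>>>(d_src, d_dst, nx, ny, nz, 0);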
I have therefore done the following:
- I first modified your example to reproduce with cudaMemcpy3DAsync the ~13 GB/s obtained for copying contiguous data from the device to the host:
$> cat t1442_float512x512_cudaMemcpy3DAsync_contiguous_to_contiguous.cu
#include <iostream>
#include <cstring>   // for memset

int main(){
    const size_t nx = 512;
    const size_t ny = 512;
    const size_t nz = 512;
    float *d_src, *h_dst;
    cudaMalloc(&d_src, nx*ny*nz*sizeof(d_src[0]));
    cudaHostAlloc(&h_dst, ny*nz*sizeof(d_src[0]), cudaHostAllocDefault);
    cudaMemset(d_src, 0, nx*ny*nz*sizeof(d_src[0]));
    memset(h_dst, 1, ny*nz*sizeof(d_src[0]));
    cudaMemcpy3DParms cpy_params = {0};
    // source: rows of nx*4 B read back to back -> contiguous read
    cpy_params.srcPtr = make_cudaPitchedPtr(d_src, nx*sizeof(d_src[0]), nx, ny);
    // destination: pitch equals the copied row width (nx == ny) -> contiguous write
    cpy_params.dstPtr = make_cudaPitchedPtr(h_dst, ny*sizeof(d_src[0]), ny, nz);
    cpy_params.srcPos = make_cudaPos((size_t) 0, (size_t) 0, (size_t) 0);
    cpy_params.dstPos = make_cudaPos((size_t) 0, (size_t) 0, (size_t) 0);
    // extent: nx*4 B wide, ny rows, depth 1 -> nx*ny contiguous floats in total
    cpy_params.extent = make_cudaExtent(nx*sizeof(d_src[0]), ny, 1);
    cpy_params.kind = cudaMemcpyDeviceToHost;
    for (int i=0; i<100; i++){
        cudaMemcpy3DAsync(&cpy_params);
    }
    cudaDeviceSynchronize();
}
$> nvcc -arch=sm_60 t1442_float512x512_cudaMemcpy3DAsync_contiguous_to_contiguous.cu
$> nvprof ./a.out
==23132== NVPROF is profiling process 23132, command: ./a.out
==23132== Profiling application: ./a.out
==23132== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 99.99% 8.0308ms 100 80.307us 80.159us 88.895us [CUDA memcpy DtoH]
0.01% 992ns 1 992ns 992ns 992ns [CUDA memset]
API calls: 97.31% 453.86ms 1 453.86ms 453.86ms 453.86ms cudaMalloc
1.89% 8.7984ms 1 8.7984ms 8.7984ms 8.7984ms cudaDeviceSynchronize
0.32% 1.4823ms 1 1.4823ms 1.4823ms 1.4823ms cudaHostAlloc
0.25% 1.1824ms 1 1.1824ms 1.1824ms 1.1824ms cuDeviceTotalMem
0.14% 636.49us 94 6.7710us 120ns 252.92us cuDeviceGetAttribute
0.07% 334.62us 100 3.3460us 3.0270us 14.167us cudaMemcpy3DAsync
...
We can see that we get a performance of about 13 GB/s (512*512*4 B / (80/1e6 s) / 1e9 ≈ 13 GB/s; see [CUDA memcpy DtoH]).
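As a side note, since both the source and the destination are contiguous in this case, I believe the same 512*512*4 B transfer could equivalently be issued as a plain 1-D copy (a sketch reusing the buffers from the listing above):
cudaMemcpyAsync(h_dst, d_src, nx*ny*sizeof(d_src[0]), cudaMemcpyDeviceToHost);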
- Then, I tried to modify the copy parameters to copy instead from a strided to a contiguous buffer (i.e. to copy a YZ-plane of a 3-D device array to a 1-D host array):
$> cat t1442_float512x512_cudaMemcpy3DAsync_strided_to_contiguous.cu
#include <iostream>
#include <cstring>   // for memset

int main(){
    const size_t nx = 512;
    const size_t ny = 512;
    const size_t nz = 512;
    float *d_src, *h_dst;
    cudaMalloc(&d_src, nx*ny*nz*sizeof(d_src[0]));
    cudaHostAlloc(&h_dst, ny*nz*sizeof(d_src[0]), cudaHostAllocDefault);
    cudaMemset(d_src, 0, nx*ny*nz*sizeof(d_src[0]));
    memset(h_dst, 1, ny*nz*sizeof(d_src[0]));
    cudaMemcpy3DParms cpy_params = {0};
    // source: 4-B elements separated by a pitch of nx*4 B -> strided read
    cpy_params.srcPtr = make_cudaPitchedPtr(d_src, nx*sizeof(d_src[0]), nx, ny);
    cpy_params.dstPtr = make_cudaPitchedPtr(h_dst, sizeof(d_src[0]), 1, ny);    // MODIFIED: pitch of one element -> contiguous write
    cpy_params.srcPos = make_cudaPos((size_t) 0, (size_t) 0, (size_t) 0);
    cpy_params.dstPos = make_cudaPos((size_t) 0, (size_t) 0, (size_t) 0);
    cpy_params.extent = make_cudaExtent(sizeof(d_src[0]), ny, nz);              // MODIFIED: 4 B x ny x nz = the YZ-plane at x=0
    cpy_params.kind = cudaMemcpyDeviceToHost;
    for (int i=0; i<100; i++){
        cudaMemcpy3DAsync(&cpy_params);
    }
    cudaDeviceSynchronize();
}
$> nvcc -arch=sm_60 t1442_float512x512_cudaMemcpy3DAsync_strided_to_contiguous.cu
$> nvprof ./a.out
==23234== NVPROF is profiling process 23234, command: ./a.out
==23234== Profiling application: ./a.out
==23234== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 115.31ms 100 1.1531ms 1.1522ms 1.1545ms [CUDA memcpy DtoH]
0.00% 960ns 1 960ns 960ns 960ns [CUDA memset]
API calls: 68.99% 266.83ms 1 266.83ms 266.83ms 266.83ms cudaMalloc
29.91% 115.69ms 1 115.69ms 115.69ms 115.69ms cudaDeviceSynchronize
0.37% 1.4407ms 1 1.4407ms 1.4407ms 1.4407ms cudaHostAlloc
0.32% 1.2477ms 1 1.2477ms 1.2477ms 1.2477ms cuDeviceTotalMem
0.21% 799.60us 100 7.9950us 7.3030us 20.194us cudaMemcpy3DAsync
...
The observed performance is only about 0.9 GB/s (512*512*4 B / (1.153/1e3 s) / 1e9 ≈ 0.9 GB/s; see [CUDA memcpy DtoH]). However, I am not sure whether I set the parameters right, as I don't have any experience with cudaMemcpy3DAsync. If they are right, then cudaMemcpy3DAsync is more than 10 times slower here than the corresponding kernel copy (~11 GB/s, see the table above).
Did I set the parameters right?
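Independent of the performance question, I suppose I can at least sanity-check that the parameters select the intended elements by filling the device array with a known pattern and comparing the host buffer after a single copy. A sketch, not part of the timed runs; the pattern is only illustrative and the snippet additionally needs #include <vector>:
std::vector<float> h_src(nx*ny*nz);
for (size_t i = 0; i < nx*ny*nz; i++) h_src[i] = (float)i;              // known pattern
cudaMemcpy(d_src, h_src.data(), nx*ny*nz*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy3D(&cpy_params);                 // one strided-to-contiguous copy as set up above
cudaDeviceSynchronize();
bool ok = true;
for (size_t z = 0; z < nz && ok; z++)
    for (size_t y = 0; y < ny && ok; y++)
        ok = (h_dst[y + z*ny] == h_src[0 + y*nx + z*nx*ny]);            // YZ-plane at x = 0
std::cout << (ok ? "plane copy OK" : "plane copy WRONG") << std::endl;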
That sounds exactly like what I need. Do you mean by that what I tried to do above, or something else?