So far we have only looked at device-to-host transfers. To complete this thread, I now tested the inverse copy operations (host-to-device transfers), which are of interest for the halo update of a 3-D device array.
Here is the experiment for the YZ-plane (1-D contiguous host array to YZ-plane of 3-D device array):
$> cat t1442_float512x512_cudaMemcpy3DAsync_contiguous_to_strided_h2d.cu
#include <cstring>
#include <iostream>

int main(){
    const size_t nx = 512;
    const size_t ny = 512;
    const size_t nz = 512;
    float *d_dst, *h_src;
    // Full 3-D array on the device; pinned contiguous host buffer holding one YZ-plane (ny*nz elements).
    cudaMalloc(&d_dst, nx*ny*nz*sizeof(d_dst[0]));
    cudaHostAlloc(&h_src, ny*nz*sizeof(d_dst[0]), cudaHostAllocDefault);
    cudaMemset(d_dst, 0, nx*ny*nz*sizeof(d_dst[0]));
    memset(h_src, 1, ny*nz*sizeof(d_dst[0]));
    cudaMemcpy3DParms cpy_params = {0};
    // Destination: the 3-D device array, row pitch nx*sizeof(float).
    cpy_params.dstPtr = make_cudaPitchedPtr(d_dst, nx*sizeof(d_dst[0]), nx, ny);
    // Source: the contiguous host buffer, one element per row (pitch = sizeof(float)).
    cpy_params.srcPtr = make_cudaPitchedPtr(h_src, sizeof(d_dst[0]), 1, ny);
    cpy_params.dstPos = make_cudaPos((size_t) 0, (size_t) 0, (size_t) 0);
    cpy_params.srcPos = make_cudaPos((size_t) 0, (size_t) 0, (size_t) 0);
    // Extent: 1 element in x, ny rows, nz slices -> the YZ-plane at x = 0.
    cpy_params.extent = make_cudaExtent(sizeof(d_dst[0]), ny, nz);
    cpy_params.kind = cudaMemcpyHostToDevice;
    for (int i=0; i<100; i++){
        cudaMemcpy3DAsync(&cpy_params);
    }
    cudaDeviceSynchronize();
}
$> nvcc -arch=sm_60 t1442_float512x512_cudaMemcpy3DAsync_contiguous_to_strided_h2d.cu
$> nvprof ./a.out
==8015== NVPROF is profiling process 8015, command: ./a.out
==8015== Profiling application: ./a.out
==8015== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  184.77ms       100  1.8477ms  1.8443ms  1.8611ms  [CUDA memcpy HtoD]
                    0.00%     960ns         1     960ns     960ns     960ns  [CUDA memset]
      API calls:   56.63%  247.11ms         1  247.11ms  247.11ms  247.11ms  cudaMalloc
                   42.42%  185.11ms         1  185.11ms  185.11ms  185.11ms  cudaDeviceSynchronize
                    0.34%  1.4770ms         1  1.4770ms  1.4770ms  1.4770ms  cudaHostAlloc
                    0.27%  1.1857ms         1  1.1857ms  1.1857ms  1.1857ms  cuDeviceTotalMem
                    0.18%  775.66us       100  7.7560us  7.3710us  19.080us  cudaMemcpy3DAsync
...
The observed performance is only about 0.57 GB/s (512*512*4 B / (1.847/1e3 s) / 1e9) (see [CUDA memcpy HtoD]; NVVP reports the same throughput).
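As a quick sanity check of that number, here is a minimal sketch of the bandwidth calculation, using the average [CUDA memcpy HtoD] time reported by nvprof above (plain host-side arithmetic, nothing CUDA-specific):

// Minimal sketch: effective bandwidth of the YZ-plane copy, computed from the
// average [CUDA memcpy HtoD] time reported by nvprof above.
#include <cstdio>

int main(){
    const double plane_bytes = 512.0 * 512.0 * sizeof(float); // one YZ-plane (ny*nz floats)
    const double copy_time_s = 1.8477/1e3;                    // 1.8477 ms average copy time
    std::printf("%.2f GB/s\n", plane_bytes / copy_time_s / 1e9); // prints ~0.57 GB/s
    return 0;
}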
Here is the experiment for the XZ-plane (1-D contiguous host array to XZ-plane of 3-D device array):
$ cat t1442_float512x512_cudaMemcpy3DAsync_contiguous_to_strided_xzplane_h2d.cu
#include <cstring>
#include <iostream>

int main(){
    const size_t nx = 512;
    const size_t ny = 512;
    const size_t nz = 512;
    float *d_dst, *h_src;
    // Full 3-D array on the device; pinned contiguous host buffer holding one XZ-plane (nx*nz elements).
    cudaMalloc(&d_dst, nx*ny*nz*sizeof(d_dst[0]));
    cudaHostAlloc(&h_src, nx*nz*sizeof(d_dst[0]), cudaHostAllocDefault);
    cudaMemset(d_dst, 0, nx*ny*nz*sizeof(d_dst[0]));
    memset(h_src, 1, nx*nz*sizeof(d_dst[0]));
    cudaMemcpy3DParms cpy_params = {0};
    // Destination: the 3-D device array, row pitch nx*sizeof(float).
    cpy_params.dstPtr = make_cudaPitchedPtr(d_dst, nx*sizeof(d_dst[0]), nx, ny);
    // Source: the contiguous host buffer, one row of nx elements per slice.
    cpy_params.srcPtr = make_cudaPitchedPtr(h_src, nx*sizeof(d_dst[0]), nx, 1);
    cpy_params.dstPos = make_cudaPos((size_t) 0, (size_t) 0, (size_t) 0);
    cpy_params.srcPos = make_cudaPos((size_t) 0, (size_t) 0, (size_t) 0);
    // Extent: nx elements in x, 1 row in y, nz slices -> the XZ-plane at y = 0.
    cpy_params.extent = make_cudaExtent(nx*sizeof(d_dst[0]), 1, nz);
    cpy_params.kind = cudaMemcpyHostToDevice;
    for (int i=0; i<100; i++){
        cudaMemcpy3DAsync(&cpy_params);
    }
    cudaDeviceSynchronize();
}
$ nvcc -arch=sm_60 t1442_float512x512_cudaMemcpy3DAsync_contiguous_to_strided_xzplane_h2d.cu
$ nvprof ./a.out
==9413== NVPROF is profiling process 9413, command: ./a.out
==9413== Profiling application: ./a.out
==9413== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.99%  8.7544ms       100  87.543us  87.167us  98.207us  [CUDA memcpy HtoD]
                    0.01%     960ns         1     960ns     960ns     960ns  [CUDA memset]
      API calls:   95.91%  308.96ms         1  308.96ms  308.96ms  308.96ms  cudaMalloc
                    2.81%  9.0473ms         1  9.0473ms  9.0473ms  9.0473ms  cudaDeviceSynchronize
                    0.45%  1.4380ms         1  1.4380ms  1.4380ms  1.4380ms  cudaHostAlloc
                    0.36%  1.1639ms         1  1.1639ms  1.1639ms  1.1639ms  cuDeviceTotalMem
                    0.24%  782.17us       100  7.8210us  7.4170us  19.605us  cudaMemcpy3DAsync
...
Here we observe a performance of about 12 GB/s (512*512*4 B / (87/1e6 s) / 1e9) (see [CUDA memcpy HtoD]; NVVP reports the same throughput).
I omit the experiment for the XY-plane here: being a copy from a contiguous host buffer to a contiguous region of the device array, it will certainly give a performance at least as good as for the XZ-plane (12 GB/s, see above).
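For illustration only, here is a sketch of how such an XY-plane experiment could look, following the same pattern as above (this listing is untested and was not part of the measured experiments):

#include <cstring>
#include <iostream>

// Untested sketch: contiguous host buffer of nx*ny floats copied to the
// XY-plane at z = 0 of the 3-D device array (contiguous to contiguous).
int main(){
    const size_t nx = 512;
    const size_t ny = 512;
    const size_t nz = 512;
    float *d_dst, *h_src;
    cudaMalloc(&d_dst, nx*ny*nz*sizeof(d_dst[0]));
    cudaHostAlloc(&h_src, nx*ny*sizeof(d_dst[0]), cudaHostAllocDefault);
    cudaMemset(d_dst, 0, nx*ny*nz*sizeof(d_dst[0]));
    memset(h_src, 1, nx*ny*sizeof(d_dst[0]));
    cudaMemcpy3DParms cpy_params = {0};
    cpy_params.dstPtr = make_cudaPitchedPtr(d_dst, nx*sizeof(d_dst[0]), nx, ny);
    cpy_params.srcPtr = make_cudaPitchedPtr(h_src, nx*sizeof(d_dst[0]), nx, ny);
    cpy_params.dstPos = make_cudaPos((size_t) 0, (size_t) 0, (size_t) 0);
    cpy_params.srcPos = make_cudaPos((size_t) 0, (size_t) 0, (size_t) 0);
    // Extent: nx elements in x, ny rows in y, depth 1 -> the XY-plane at z = 0.
    cpy_params.extent = make_cudaExtent(nx*sizeof(d_dst[0]), ny, 1);
    cpy_params.kind = cudaMemcpyHostToDevice;
    for (int i=0; i<100; i++){
        cudaMemcpy3DAsync(&cpy_params);
    }
    cudaDeviceSynchronize();
}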
Conclusion
The results are no surprise; they are completely in agreement with the device-to-host transfers. Again, the copy of an XZ-plane (and certainly also of an XY-plane) achieves very good performance; the copy of the YZ-plane, however, is very slow. A bug report should therefore be opened for the YZ-plane case in both transfer directions (device to host and host to device).