CUDA 10.2, compute capability 6.1.
I allocate two chunks of pinned memory, intending to use them as cudaMemcpyAsync destinations:
//host mem
cudaMallocHost((void **) &Iv.t, 53 * sizeof(uint32_t));
cudaMallocHost((void **) &Args.can_h, 67000 * sizeof(uint64_t));
When I run the code and inspect the results in Nsight Systems, the copy into Iv.t behaves synchronously, and the tooltip for that transfer reports a pageable destination:
Begins: 1.57065s
Ends: 1.57065s (+1.248 μs)
DtoH memcpy 212 bytes
Source memory kind: Device
Destination memory kind: Pageable
Throughput: 162.002 MiB/s
Launched from thread: 4200
Latency: ←46.367 μs
Correlation ID: 1654
Stream: Stream 14
The copy into Args.can_h behaves as expected and reports a pinned destination:
Begins: 1.57049s
Ends: 1.57065s (+156.326 μs)
DtoH memcpy 523,480 bytes
Source memory kind: Device
Destination memory kind: Pinned
Throughput: 3.11867 GiB/s
Launched from thread: 4200
Latency: ←10.176 μs
Correlation ID: 1650
Stream: Stream 15
Why is Iv.t not being pinned?
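To narrow this down, a check along the following lines could show whether the runtime still sees a given pointer as pinned at the point of the copy. This is only a sketch: `is_pinned_host` and `pinned_alloc_checked` are hypothetical helper names, not part of my code or of the CUDA API, and it assumes a runtime of 10.0 or later so that `cudaPointerAttributes::type` is available.

```cuda
#include <cstdio>
#include <cstdint>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: returns true if the CUDA runtime reports `p` as
// page-locked (pinned) host memory.
static bool is_pinned_host(const void *p)
{
    cudaPointerAttributes attr;
    cudaError_t err = cudaPointerGetAttributes(&attr, p);
    if (err != cudaSuccess) {
        // Some runtime versions return an error for ordinary pageable
        // pointers; clear it so it does not leak into later error checks.
        (void)cudaGetLastError();
        return false;
    }
    return attr.type == cudaMemoryTypeHost;
}

// Hypothetical helper: allocate pinned host memory and report any failure
// instead of silently ignoring the return code.
static void *pinned_alloc_checked(size_t bytes, const char *name)
{
    void *p = nullptr;
    cudaError_t err = cudaMallocHost(&p, bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMallocHost(%s) failed: %s\n",
                name, cudaGetErrorString(err));
        return nullptr;
    }
    return p;
}
```

Calling `is_pinned_host(Iv.t)` once right after the allocation and again immediately before the cudaMemcpyAsync would show whether the address handed to the copy is still the one returned by cudaMallocHost. If the two results differ, one possibility worth ruling out is that the pointer was reassigned or that the copy receives a different address than the allocation produced.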