Software Version DRIVE OS 6.0.8.1
Target Operating System Linux
Host Machine Version native Ubuntu Linux 20.04 Host installed with DRIVE OS Docker Containers
Issue Description
In the NvStreams CPU-to-CPU C2C PCIe setup, memcpy of data out of the NvStreams packet on the consumer side is much slower than memcpy over regular memory.
The former achieves only 1.15 GB/s, while the latter achieves 8.3 GB/s on DRIVE Orin (32 GB, 3200 MHz, 128-bit memory bus width).
The issue does not occur when memcpy'ing data into the DMA buffer on the producer side.
The issue persists regardless of whether CPU caching is enabled or disabled, and regardless of whether the consumer is the PCIe root port or the endpoint.
Reproduce
In
drive-linux/samples/nvsci/nvscistream/perf_tests/perfconsumer.cpp
add a memcpy call at line 277 of the file, right after perfConsumer finishes waiting on the prefence (i.e., the packet is ready to read):
#include <iostream>
#include <chrono>
#include <cstring>  // memcpy
...
const size_t xfer_size = static_cast<size_t>(testArg.bufSize * 1048576);
char* dstBuf = new char[xfer_size];
// Touch one byte per 4 KiB page to pre-fault the destination buffer,
// so demand paging does not skew the memcpy timing below.
for (size_t index = 0; index < xfer_size; index += 4096) {
    dstBuf[index] = static_cast<char>(index % 127);
}
// Alternatively, mlock() serves the same purpose; there is no performance
// difference either way.
const auto t_start = std::chrono::high_resolution_clock::now();
memcpy(dstBuf, packet->constCpuPtr[0], xfer_size);
const auto t_end = std::chrono::high_resolution_clock::now();
const auto t_diff = std::chrono::duration_cast<std::chrono::microseconds>(t_end - t_start).count();
std::cout << "Transfer size " << testArg.bufSize << " MB, time "
          << t_diff / 1000. << " ms, speed " << testArg.bufSize / (t_diff / 1000000.)
          << " MB/s\n";
delete[] dstBuf;
Then run the producer and consumer:
./test_nvscistream_perf -P 0 nvscic2c_pcie_s0_c5_1 -l -b 12.5 -f 10000
./test_nvscistream_perf -C 0 nvscic2c_pcie_s0_c6_1 -l -b 12.5 -f 10000
Observations
Reading the relevant PMU counters and comparing against the producer side, the consumer side shows a significantly larger STALL_BACKEND_MEM count, with a smaller or similar L1D/L2D/LLC data-cache miss count.
Since the DMA buffer is pinned, madvise-based prefetching should have no effect. Anecdotally, MMIO over PCIe achieves ~700 MB/s, which is not much worse than DMA given this memcpy bottleneck.
Logs - test_nvscistream_perf
Transfer size 12.5 MB, time 10.793 ms, speed 1158.16 MB/s
Logs - bandwidthTest (1 device, 32000000-byte transfers)
Direction          PAGEABLE (GB/s)   PINNED (GB/s)
Host to Device     9.1               35.9
Device to Host     8.3               35.9
Device to Device   175.8             177.7
