7x slower memcpy out from NvStreams C2C PCIe Consumer packet

Software Version DRIVE OS 6.0.8.1
Target Operating System Linux
Host Machine Version native Ubuntu Linux 20.04 Host installed with DRIVE OS Docker Containers

Issue Description
In an NvStreams CPU-to-CPU C2C PCIe setup, memcpy'ing data out of the NvStreams packet on the Consumer side is much slower than memcpy over regular memory.

The former achieves only 1.15 GB/s, while the latter achieves 8.3 GB/s on DRIVE Orin (32 GB, 3200 MHz, 128-bit memory bus width).

This issue does not occur when memcpy'ing data into the DMA buffer on the Producer side.

The issue is independent of whether CPU caching is enabled or disabled, and of whether the Consumer is the PCIe Root Port or Endpoint.

Reproduce

In

drive-linux/samples/nvsci/nvscistream/perf_tests/perfconsumer.cpp

Add a memcpy call at L277, right after PerfConsumer finishes waiting for the prefence (i.e., the packet is ready to read).

#include <iostream>
#include <chrono>
#include <cstring>  // for memcpy

...

const size_t xfer_size = static_cast<size_t>(testArg.bufSize * 1048576);

char* dstBuf = new char[xfer_size];
// Touch one byte per 4 KiB page so the destination buffer is faulted in
// before timing; otherwise page-fault cost would be counted in the memcpy time.
size_t index = 0;
while (index < xfer_size) {
    dstBuf[index] = index % 127;
    index += 4096;
}
// Alternatively, mlock() serves the same purpose; there is no perf difference.

const auto t_start = std::chrono::high_resolution_clock::now();
memcpy(dstBuf, packet->constCpuPtr[0], xfer_size);
const auto t_end = std::chrono::high_resolution_clock::now();
const auto t_diff = std::chrono::duration_cast<std::chrono::microseconds>(t_end - t_start).count();

std::cout << "Transfer size " << testArg.bufSize << " MB, time "
    << t_diff / 1000. << " ms, speed " << testArg.bufSize / (t_diff / 1000000.)
    << " MB/s\n";
delete[] dstBuf;

And run the sample code with

    ./test_nvscistream_perf -P 0 nvscic2c_pcie_s0_c5_1 -l -b 12.5 -f 10000
    ./test_nvscistream_perf -C 0 nvscic2c_pcie_s0_c6_1 -l -b 12.5 -f 10000

Observations

Measuring the relevant PMU counters shows that, compared to the producer side, the consumer side has a significantly larger STALL_BACKEND_MEM count but a smaller or similar L1D/L2D/LLC data-cache miss count.

Since the DMA buffer is pinned, I don’t think madvise prefetch hints have any effect. Anecdotally, with MMIO over PCIe we can achieve ~700 MB/s, which is not significantly worse than DMA given this memcpy bottleneck.

Logs - test_nvscistream_perf

Transfer size 12.5 MB, time 10.793 ms, speed 1158.16 MB/s

Logs - bandwidthTest
Host to Device Bandwidth, 1 Device(s)
PAGEABLE Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 9.1

Device to Host Bandwidth, 1 Device(s)
PAGEABLE Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 8.3

Device to Device Bandwidth, 1 Device(s)
PAGEABLE Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 175.8

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 35.9

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 35.9

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 177.7

Is it tested with two Orin devkits connected using miniSAS?

Is the above code snippet to be added at L277 in /drive/drive-linux/samples/nvsci/nvscistream/perf_tests/perfconsumer.cpp?

277           getTimeInUs(streamingStartTime - setupStartTime));
278     printf("Consumer streaming phase:                    %8.5f us\n", duration);
279
280     if (testArg.isC2c) {
281         // Convert unit of buffer size to (GB): bufSize(MB) / 1024
282         // Convert unit of duration to (s): duration(us) / 10^6
283         double bandwidth =
284             (testArg.bufSize * numPayloads / 1024.0) *
285             (1000000.0 / duration);
286         printf("\nBuffer size per packet: %8.5f MB\n", testArg.bufSize);
287         printf("PCIe bandwidth (buffer size received by consumer): "
288                "%8.5f GBps\n\n", bandwidth);
289     }
290
291     streamingDone = true;
292 }

Could you share the verification steps/code you followed?

Is it tested with two Orin devkits connected using miniSAS?

Yes, using the default C2C channels.

Is the above code snippet to be added at L277 in /drive/drive-linux/samples/nvsci/nvscistream/perf_tests/perfconsumer.cpp?

Sorry for the typo, I really meant L227.

Could you share the verification steps/code you followed?

The general idea is to leverage these PMU counters:

// STALL_BACKEND_MEM, L2D_CACHE_REFILL, L3D_CACHE_REFILL
{ .type = 0x7, .config = 0x4005 }, { .type = 0x7, .config = 0x17 }, { .type = 0x7, .config = 0x2a }

Put them into Linux perf_event_attr structures, then repeatedly call perf_event_open right before the added memcpy and read/close the counter fds right after it.

I can share the perf-observed PMU counter numbers for comparison (memcpy into the producer packet vs. out of the consumer packet) a bit later.

Thank you for sharing info. Let me check and get back to you.

Dear @jcui-nuro,
May I know if you used a single Devkit or two Devkits to test the sample?

Hi Siva, I used 2 DevKits. I don’t think NvSciIpc could initialize with the c5/c6 transport names if it were a single DevKit.

Dear @jcui-nuro ,
Could you quickly test if below changes in perfconsumer.cpp helps to fix the issue

void PerfConsumer::setEndpointBufAttr(NvSciBufAttrList attrList)
{
    NvSciBufType bufType{ NvSciBufType_RawBuffer };
    NvSciBufAttrValAccessPerm perm{ NvSciBufAccessPerm_Readonly };
    // Disable cpu access for vidmem
    bool cpuaccess_flag{ testArg.vidmem ? false : true };

    bool enableCpuCache{ true };

    NvSciBufAttrKeyValuePair bufAttrs[] = {
        { NvSciBufGeneralAttrKey_Types, &bufType, sizeof(bufType) },
        { NvSciBufGeneralAttrKey_RequiredPerm, &perm, sizeof(perm) },
        { NvSciBufGeneralAttrKey_NeedCpuAccess, &cpuaccess_flag,
            sizeof(cpuaccess_flag) },
        { NvSciBufGeneralAttrKey_EnableCpuCache, &enableCpuCache,
            sizeof(enableCpuCache) },
    };

Dear @jcui-nuro,
Did you get chance to test above suggestion. Any update can be provided?

Hello Siva,

As I understand it, the above code limits the Consumer to read-only access and enables CPU caching. I tested it on DRIVE OS 6.0.10 and the poor performance numbers persist.

Does that mean you don’t see any improvement?

Correct

Is it tested with DRIVE OS 6.0.10 or 6.0.8.1?

Hi Siva - it’s tested on 6.0.10 dual Orins.