Software Version: DRIVE OS 6.0.8.1
Target Operating System: Linux
Hardware Platform: DRIVE AGX Orin Developer Kit (not sure of the exact model number)
Host Machine Version: native Ubuntu Linux 20.04
Host installed with: DRIVE OS Docker Containers
Issue Description
Use Case 1 of the nvsci/nvscistream/event sample doesn't work for C2C CUDA/CUDA.
The commands below, taken from the README, do not work.
./nvscistream_event_sample -P 0 nvscic2c_pcie_s0_c5_1 -Q 0 f
# Run below command on another OS running on peer SOC.
./nvscistream_event_sample -C 0 nvscic2c_pcie_s0_c6_1 -F 0 3
There are multiple obvious bugs in the sample code.
createPool() always assumes the non-C2C case: isC2cPool is hardcoded to false.
handleC2cPoolBufferSetup does not call NvSciStreamBlockSetupStatusSet(NvSciStreamSetup_ElementExport) on the producer side. This results in a later call to NvSciStreamPoolPacketCreate failing with NvSciError_NotYetAvailable: completion of element export has not yet been signaled on the pool.
With the element export status set, a later call to NvSciBufObjAlloc then fails with NvSciError_BadParameter, suggesting further problems.
It seems the attribute list used to allocate the NvSciBufObj is invalid, since NvSciBufAttrListDebugDump also returns BadParameter for that list.
Questions
Could you please share the latest sample code that demonstrates the C2C CUDA/CUDA over PCIe capability? Thanks.
I wrote code that sets up NvSciStream the same way the sample does, minus the obvious bugs noted above. It ends up behaving the same or similarly:
Producer and consumer run on different SoCs, with GpuId set to the cuUuid of GPU 0 on each Orin.
During handling of the NvSciStreamEventType_Elements event on the PacketAllocator block on the producer side, the buffer attribute list reconciliation (one list with the local GPU ID, another with the remote GPU ID, plus the attributes inserted by cudaDeviceGetNvSciSyncAttributes) aborts the program with a panic.
Allocating the NvSciBufObj without reconciliation, using only the local GPU ID (which is what handleC2cPoolBufferSetup in the sample does), leads to a different crash stack on the consumer side.
Thanks for the follow-up! I am pretty sure the PCIe CPU/CPU path works; it is the CUDA/CUDA path that does not work correctly.
I don't think the 6.0.9 and 6.0.10 release notes mention any NvSci PCIe-related fixes. I would also like to bring this issue to your attention in case it persists in the upcoming DRIVE OS 7.
Could you share the code and command?
The default sample with the ./nvscistream_event_sample -P 0 nvscic2c_pcie_s0_c5_1 -Q 0 f and ./nvscistream_event_sample -C 0 nvscic2c_pcie_s0_c6_1 -F 0 3 commands worked on our side with the 6.0.10 release.
I recompiled the program in the DRIVE OS 6.0.10 container and ran it on a pair of Orins, both running 6.0.10. The problem persists: the producer crashes during reconciliation.
With a debug build, please see the stack trace on the producer side:
(gdb) r
Starting program: /data/jcui/nvscistream_event_sample -P 0 nvscic2c_pcie_s0_c5_1 -Q 0 f
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
[New Thread 0xfffff503b900 (LWP 4388)]
[New Thread 0xfffff4cb7900 (LWP 4389)]
[New Thread 0xffffe9ffc900 (LWP 4390)]
Thread 1 "nvscistream_eve" received signal SIGABRT, Aborted.
0x0000fffff5a62d78 in raise () from /lib/aarch64-linux-gnu/libc.so.6
(gdb) bt
#0 0x0000fffff5a62d78 in raise () from /lib/aarch64-linux-gnu/libc.so.6
#1 0x0000fffff5a4faac in abort () from /lib/aarch64-linux-gnu/libc.so.6
#2 0x0000fffff7b7c948 in NvSciCommonPanic () from /usr/lib/libnvscicommon.so.1
#3 0x0000fffff7ece17c in ?? () from /usr/lib/libnvscibuf.so.1
#4 0x0000fffff7eb1748 in ?? () from /usr/lib/libnvscibuf.so.1
#5 0x0000fffff7eb21d0 in ?? () from /usr/lib/libnvscibuf.so.1
#6 0x0000fffff7ebac58 in ?? () from /usr/lib/libnvscibuf.so.1
#7 0x0000fffff7ebb8b0 in NvSciBufAttrListReconcile () from /usr/lib/libnvscibuf.so.1
#8 0x000000000040e6b0 in handlePoolBufferSetup (poolData=0x4cf180) at block_pool.c:159
#9 0x000000000040f40c in handlePool (data=0x4cf180, wait=0) at block_pool.c:649
#10 0x000000000040740c in eventServiceLoop () at event_loop_service.c:294
#11 0x0000000000406d24 in main (argc=7, argv=0xfffffffff608) at main.c:1227