NvStreams C2C CUDA/CUDA doesn't work

Software Version: DRIVE OS 6.0.8.1
Target Operating System: Linux
Hardware Platform: DRIVE AGX Orin Developer Kit (not sure of its exact model number)
Host Machine Version: native Ubuntu Linux 20.04 host installed with DRIVE OS Docker Containers

Issue Description
Use Case 1 of the nvsci/nvscistream/event/ sample doesn't work for C2C CUDA/CUDA.

The following commands from the README do not work:

    ./nvscistream_event_sample -P 0 nvscic2c_pcie_s0_c5_1 -Q 0 f
    # Run below command on another OS running on peer SOC.
    ./nvscistream_event_sample -C 0 nvscic2c_pcie_s0_c6_1 -F 0 3

There are multiple obvious bugs in the sample code.

  • createPool() always assumes a non-C2C pool; isC2cPool is hardcoded to false.
  • handleC2cPoolBufferSetup does not call NvSciStreamBlockSetupStatusSet(NvSciStreamSetup_ElementExport) on the producer side. This results in the later call to NvSciStreamPoolPacketCreate failing with NvSciError_NotYetAvailable: completion of element export has not yet been signaled on the pool (see the sketch after this list).
  • With the element export status set, the later call to NvSciBufObjAlloc fails with NvSciError_BadParameter, suggesting other problems.
  • It seems that the attribute list used to allocate the NvSciBufObj is incorrect, as NvSciBufAttrListDebugDump returns BadParameter on that list.
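For reference, a minimal sketch (not from the shipped sample; poolData->block and the packet variable are illustrative) of the producer-side step that appears to be missing, i.e. signaling completion of element export on the pool before creating packets:

    /* After the pool's element attributes have been sent (e.g. via
     * NvSciStreamBlockElementAttrSet), signal that element export is complete.
     * Without this, NvSciStreamPoolPacketCreate returns
     * NvSciError_NotYetAvailable. */
    NvSciError err = NvSciStreamBlockSetupStatusSet(poolData->block,
                                                    NvSciStreamSetup_ElementExport,
                                                    true);
    if (NvSciError_Success != err) {
        printf("Pool failed to complete element export (%x)\n", err);
        return 0;
    }

    /* Packet creation should now be able to proceed. */
    NvSciStreamPacket packet;
    err = NvSciStreamPoolPacketCreate(poolData->block,
                                      (NvSciStreamCookie)1U,
                                      &packet);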

Questions

Could you please share the latest sample code that demonstrates the C2C CUDA/CUDA over PCIe capability? Thanks.


Dear @jcui-nuro,
Could you test with DRIVE OS 6.0.10 (our last release)?


I pulled the 6.0.10 container and unfortunately the aforementioned issues/bugs exist there as well.

I wrote some code that, as far as I can tell, sets up NvStreams the same way the code sample does, but without the aforementioned obvious bugs. It ends up behaving the same or similarly, in that:

  • Producer and Consumer run on different SoCs, with GpuId set to the cuUuid of GPU 0 on each Orin.

  • During handling of the NvSciStreamEventType_Elements event on the PacketAllocator block on the Producer side, the buffer attribute list reconciliation (i.e. one list with the local GPU ID, the other with the remote GPU ID, plus other attributes inserted by cudaDeviceGetNvSciSyncAttributes) aborts the program with a panic (see the reconciliation sketch after the stack traces below).

    Producer GPU ID: 7Ah4{�U䄪�E�H�
    Consumer GPU ID: �KT��T���cY$
    ...
    @     0xffffabcbdaac abort
    @     0xffffabba8948 NvSciCommonPanic
    @     0xffffac1eb78c (/usr/lib/libnvscibuf.so.1+0x2978b)
    @     0xffffac1ce5b8 (/usr/lib/libnvscibuf.so.1+0xc5b7)
    @     0xffffac1cf040 (/usr/lib/libnvscibuf.so.1+0xd03f)
    @     0xffffac1d7fa0 (/usr/lib/libnvscibuf.so.1+0x15f9f)
    @     0xffffac1d8bc0 NvSciBufAttrListReconcile
    @     0xffffb2fc8080 StreamPacketAllocatorBlock::HandleBufferSetup()
  • If the NvSciBufObj is allocated without reconciliation and only with the local GPU ID (i.e. what handleC2cPoolBufferSetup in the code sample does), it leads to a different crash stack on the Consumer side:
    @     0xffff96039aac abort
    @     0xffff95f24948 NvSciCommonPanic
    @     0xffff95f25b5c (/usr/lib/libnvscicommon.so.1+0x2b5b)
    @     0xffff95f25f88 NvSciCommonObjRetrieve
    @     0xffff9627a214 (/usr/lib/libnvscisync.so.1+0x7213)
    @     0xffff9627d824 (/usr/lib/libnvscisync.so.1+0xa823)
    @     0xffff9627af68 NvSciSyncAttrListSetInternalAttrs
    @     0xffff94dfa394 (/usr/lib/libnvcucompat.so+0x6393)
    @     0xffff9b97dab0 (/usr/lib/libcuda.so.1+0x41baaf)
    @     0xffff9b836e28 (/usr/lib/libcuda.so.1+0x2d4e27)
    @     0xffff9b4c3f8c (libcudart.so.12+0x29f8b)
    @     0xffff9b4f0ea4 cudaDeviceGetNvSciSyncAttributes
    @     0xffff9d3a1868 SetGpuSyncAttrList()
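
A simplified sketch of the reconciliation step described above (the helper name, localList/remoteList, and the trimmed error handling are illustrative, not copied from my code):

    /* Set the local GPU's CUDA UUID on the producer-side buffer attribute
     * list, then reconcile it against the list received from the remote
     * (consumer) side. The reconcile call is where the panic occurs. */
    #include <cuda_runtime.h>
    #include <nvscibuf.h>
    #include <string.h>

    static NvSciError reconcileC2cBufAttrs(NvSciBufAttrList localList,
                                           NvSciBufAttrList remoteList,
                                           NvSciBufAttrList *reconciled)
    {
        cudaUUID_t uuid;
        if (cudaSuccess != cudaDeviceGetUuid(&uuid, 0)) {
            return NvSciError_ResourceError;
        }

        NvSciRmGpuId gpuId;
        memcpy(gpuId.bytes, uuid.bytes, sizeof(gpuId.bytes));

        NvSciBufAttrKeyValuePair kv = {
            NvSciBufGeneralAttrKey_GpuId, &gpuId, sizeof(gpuId)
        };
        NvSciError err = NvSciBufAttrListSetAttrs(localList, &kv, 1);
        if (NvSciError_Success != err) {
            return err;
        }

        /* Reconcile producer (local) and consumer (remote) lists. */
        NvSciBufAttrList lists[2] = { localList, remoteList };
        NvSciBufAttrList conflicts = NULL;
        err = NvSciBufAttrListReconcile(lists, 2, reconciled, &conflicts);
        if (NULL != conflicts) {
            NvSciBufAttrListFree(conflicts);
        }
        return err;
    }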

Dear @jcui-nuro,
I don't have 2 devkits available currently to test with.
But the sample seemed to be working earlier, per

Are the nvscistream_event_sample commands not running on 6.0.10?

Hi Siva,

Thanks for the follow-up! I am pretty sure PCIe CPU-CPU works. It is the CUDA-CUDA path that does not work correctly.

I don't think the 6.0.9 and 6.0.10 release notes mention any NvSci PCIe-related fixes. I would also like to bring this issue to your attention in case it persists in the upcoming DRIVE OS 7.

Thanks,
Jason

According to the release notes, the nvscistream sample should work in DRIVE OS 6.0.8.1 and later, both intra-machine and inter-machine (with PCIe C2C).

  1. Could you share the hardware setup steps and software setup steps used before the nvscipcie test, so that we can confirm your two-Orin setup?
  2. Could you share the nvscipcie test command where CPU-CPU works, so that everyone can repeat your test?

There are earlier discussions in the forum that you may take as reference:
PCIe Hot-Plug not working - DRIVE AGX Orin / DRIVE AGX Orin General - NVIDIA Developer Forums
[BUG] Official Documentation inconsistency/conflicts of NvSciC2cPcie - DRIVE AGX Orin / DRIVE AGX Orin General - NVIDIA Developer Forums

The official doc is also useful; you can even use the 6.0.10 doc for your 6.0.8.1 version:
Chip to Chip Communication | NVIDIA Docs

If you are trying to share code, I encourage you to use this repo to share your issue; I used this repo to share the CGF issue: nv_driveworks/README_en.md at main · ZhenshengLee/nv_driveworks


Could you share the code and command?
The default sample with the commands

    ./nvscistream_event_sample -P 0 nvscic2c_pcie_s0_c5_1 -Q 0 f
    ./nvscistream_event_sample -C 0 nvscic2c_pcie_s0_c6_1 -F 0 3

worked on our side with the 6.0.10 release.

Dear @jcui-nuro ,
Could you please share the code changes to repro the issue?

worked on our side with the 6.0.10 release.

Thank you for sharing the findings on 6.0.10. I will need to flash my DRIVE Orin devkit to 6.0.10 and confirm. Please stay tuned!


Hello Siva,

I recompiled the program from the DRIVE OS 6.0.10 container and ran it on a pair of Orins, both running 6.0.10. The problem persists: the Producer crashes during reconciliation.

With a debug build, please see the stack trace on the producer side:

    (gdb) r
    Starting program: /data/jcui/nvscistream_event_sample -P 0 nvscic2c_pcie_s0_c5_1 -Q 0 f
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
    [New Thread 0xfffff503b900 (LWP 4388)]
    [New Thread 0xfffff4cb7900 (LWP 4389)]
    [New Thread 0xffffe9ffc900 (LWP 4390)]

    Thread 1 "nvscistream_eve" received signal SIGABRT, Aborted.
    0x0000fffff5a62d78 in raise () from /lib/aarch64-linux-gnu/libc.so.6
    (gdb) bt
    #0  0x0000fffff5a62d78 in raise () from /lib/aarch64-linux-gnu/libc.so.6
    #1  0x0000fffff5a4faac in abort () from /lib/aarch64-linux-gnu/libc.so.6
    #2  0x0000fffff7b7c948 in NvSciCommonPanic () from /usr/lib/libnvscicommon.so.1
    #3  0x0000fffff7ece17c in ?? () from /usr/lib/libnvscibuf.so.1
    #4  0x0000fffff7eb1748 in ?? () from /usr/lib/libnvscibuf.so.1
    #5  0x0000fffff7eb21d0 in ?? () from /usr/lib/libnvscibuf.so.1
    #6  0x0000fffff7ebac58 in ?? () from /usr/lib/libnvscibuf.so.1
    #7  0x0000fffff7ebb8b0 in NvSciBufAttrListReconcile () from /usr/lib/libnvscibuf.so.1
    #8  0x000000000040e6b0 in handlePoolBufferSetup (poolData=0x4cf180) at block_pool.c:159
    #9  0x000000000040f40c in handlePool (data=0x4cf180, wait=0) at block_pool.c:649
    #10 0x000000000040740c in eventServiceLoop () at event_loop_service.c:294
    #11 0x0000000000406d24 in main (argc=7, argv=0xfffffffff608) at main.c:1227
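
As a debugging aid, here is a sketch of a check that could be added just before the NvSciBufAttrListReconcile call at block_pool.c:159, printing the GpuId attribute of each unreconciled list in hex so the producer/consumer UUIDs are readable (the dumpGpuId helper and list names are illustrative, not part of the sample):

    #include <stdio.h>
    #include <nvscibuf.h>

    /* Read back NvSciBufGeneralAttrKey_GpuId from an unreconciled attribute
     * list and print it as hex. */
    static void dumpGpuId(const char *label, NvSciBufAttrList list)
    {
        NvSciBufAttrKeyValuePair kv = { NvSciBufGeneralAttrKey_GpuId, NULL, 0 };
        if (NvSciError_Success != NvSciBufAttrListGetAttrs(list, &kv, 1) ||
            NULL == kv.value || sizeof(NvSciRmGpuId) != kv.len) {
            printf("%s: GpuId attribute missing or malformed\n", label);
            return;
        }
        const NvSciRmGpuId *id = (const NvSciRmGpuId *)kv.value;
        printf("%s GPU ID: ", label);
        for (size_t i = 0U; i < sizeof(id->bytes); ++i) {
            printf("%02x", (unsigned)id->bytes[i]);
        }
        printf("\n");
    }

    /* Usage just before the reconcile that aborts, e.g.:
     *     dumpGpuId("Producer", producerList);
     *     dumpGpuId("Consumer", consumerList);
     */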