SIPL<->CUDA synchronization barrier

Software Version
DRIVE OS Linux 5.2.0
DRIVE OS Linux 5.2.0 and DriveWorks 3.5
NVIDIA DRIVE™ Software 10.0 (Linux)
NVIDIA DRIVE™ Software 9.0 (Linux)
other DRIVE OS version
other

Target Operating System
Linux
QNX
other

Hardware Platform
NVIDIA DRIVE™ AGX Xavier DevKit (E3550)
NVIDIA DRIVE™ AGX Pegasus DevKit (E3550)
other

SDK Manager Version
1.5.0.7774
other

Host Machine Version
native Ubuntu 18.04
other

When SIPL processing is real-time (30 FPS), the buffers I get from SIPL are chopped off horizontally when I access them from CUDA. This doesn't occur when the processing is non-real-time (i.e., when buffers are being dropped). I get the buffers from the SIPL completion queues with these calls:

INvSIPLClient::INvSIPLBuffer* siplBuf = nullptr;
pipeQueues.captureCompletionQueue->Get(siplBuf, siplQueueTimeout);
auto* siplMBuf = static_cast<INvSIPLClient::INvSIPLNvMBuffer*>(siplBuf);
NvMediaImage* img = siplMBuf->GetImage();

The buffers have been pre-mapped to CUDA with these calls:

// Bind the buffer to CUDA.
cudaExternalMemoryHandleDesc cudaDesc = {};
cudaDesc.type = cudaExternalMemoryHandleTypeNvSciBuf;
cudaDesc.handle.nvSciBufObject = buf.sciObj;
cudaDesc.size = correct_size_here;
CheckFail(cudaImportExternalMemory(&buf.cudaExtMem, &cudaDesc), "cudaImportExternalMemory");

// Map the buffer to CUDA. For some reason, the mapped buffer is
// updated without explicit synchronization.
cudaExternalMemoryBufferDesc cudaDesc2 = {};
cudaDesc2.size = cudaDesc.size;
CheckFail(cudaExternalMemoryGetMappedBuffer(&buf.cudaMappedBuf, buf.cudaExtMem, &cudaDesc2),
          "cudaExternalMemoryGetMappedBuffer");

I originally thought that cudaExternalMemoryGetMappedBuffer() was needed for synchronization between SIPL and CUDA, but the call is extremely slow and evidently allocates memory on each invocation.

What’s the correct way to wait for the CUDA buffer to be up-to-date with the SIPL buffer?

Thanks

Hi @shayan.manoochehri ,

Please refer to Post() in ~/nvidia/nvidia_sdk/DRIVE_OS_5.2.0_SDK_Linux_OS_DDPX/DRIVEOS/drive-t186ref-linux/samples/nvmedia/nvsipl/test/camera/CNvSIPLMasterNvSci.hpp (source of nvsipl_camera sample application).

I see the Post() method, but I don't see a synchronization point there.

The code does this:

pBuffer->GetEOFNvSciSyncFence(&fence);

and then calls:

auto sciErr = NvSciStreamProducerPacketPresent(pStream->producer,
                                               pStream->bufferInfo[cookie - 1U].packet,
                                               &fence);

That doesn’t tell me what to do with the fence.

Can you post complete working code to do the wait? Thanks.
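For reference, the usual NvSciSync-CUDA interop pattern for consuming such an EOF fence is to import the NvSciSyncObj as a CUDA external semaphore and make the stream wait on the fence before touching the mapped buffer. The sketch below is not verified on DRIVE OS 5.2: it assumes a hypothetical `syncObj` (an NvSciSyncObj created with CUDA-compatible waiter attributes via cudaDeviceGetNvSciSyncAttributes and registered with SIPL as the EOF sync object), and reuses `siplMBuf`, `CheckFail`, and `stream` in the spirit of the snippets above.

```cpp
// One-time setup: import the NvSciSync object as a CUDA external semaphore.
cudaExternalSemaphore_t extSem;
cudaExternalSemaphoreHandleDesc semDesc = {};
semDesc.type = cudaExternalSemaphoreHandleTypeNvSciSync;
semDesc.handle.nvSciSyncObj = syncObj;  // assumed: NvSciSyncObj with CUDA waiter perms
CheckFail(cudaImportExternalSemaphore(&extSem, &semDesc),
          "cudaImportExternalSemaphore");

// Per frame: fetch the buffer's end-of-frame fence and queue a wait on it
// before any kernel on `stream` reads the mapped buffer.
NvSciSyncFence fence = NvSciSyncFenceInitializer;
siplMBuf->GetEOFNvSciSyncFence(&fence);

cudaExternalSemaphoreWaitParams waitParams = {};
waitParams.params.nvSciSync.fence = &fence;
CheckFail(cudaWaitExternalSemaphoresAsync(&extSem, &waitParams, 1, stream),
          "cudaWaitExternalSemaphoresAsync");

// Kernels launched on `stream` after this point should see the complete frame.
NvSciSyncFenceClear(&fence);
```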

Please check the source below to understand the consumer side.
/home/vyu/nvidia/nvidia_sdk/DRIVE_OS_5.2.0_SDK_Linux_OS_DDPX/DRIVEOS/drive-t186ref-linux/samples/nvmedia/nvsipl/test/camera/CCompositeNvSci.cpp

Or the CUDA consumer of another application:
/home/vyu/nvidia/nvidia_sdk/DRIVE_OS_5.2.0_SDK_Linux_OS_DDPX/DRIVEOS/drive-t186ref-linux/samples/nvsci/nvscistream/cuda_consumer.cpp

That doesn't help me. I asked you for the code to implement the wait. None of the samples you're referring me to combine NvSIPL with CUDA. I have zero examples of the correct code to write, and the documentation doesn't mention anything. I have already read the code of those samples; if it helped, I wouldn't be asking here.

Could you elaborate on the problem you saw? Are the buffers you get from captureCompletionQueue incomplete? How did you make it real-time or non-real-time? Is there any way we can reproduce it with a sample application?

The SIPL and CUDA buffers I get from the capture completion queue are a mix of an older frame and a newer frame, chopped off horizontally. It is real-time when I don't write the buffers to disk, and non-real-time when I do (buffers are dropped in that case). You can't reproduce this with a sample application because you don't have any application that processes SIPL buffers with CUDA. You could add that support to nvmedia/nvsipl/test/sample/main.cpp, which currently does nothing but print timestamps.

My best guess is that the SIPL completion queue delivers a buffer as soon as it is captured, not once it has been transferred to the GPU. Hence I see partial buffers.

As a workaround, I queue the SIPL buffers in a temporary one-slot queue per camera, and I process a buffer from the temporary queue only when a newer buffer arrives, to give the CUDA copy time to finish. It works, but it increases latency, it is inherently racy, and it's an ugly kludge.
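The one-slot holding queue described above can be sketched as a small latch: pushing a fresh buffer displaces the previously held one, and only the displaced (older) buffer is released for processing. Names here (`OneSlotLatch`, `push`) are hypothetical, not from the SIPL API.

```cpp
#include <optional>
#include <utility>

// Hypothetical sketch of the per-camera one-slot holding queue: the
// newest buffer is parked in the slot, and the previously parked buffer
// (which has had one frame interval for its CUDA-side transfer to settle)
// is handed back to the caller for processing.
template <typename Buf>
class OneSlotLatch {
public:
    // Park the freshly captured buffer; returns the previously parked
    // buffer, if any, which is now considered safe to process.
    std::optional<Buf> push(Buf fresh) {
        std::optional<Buf> ready = std::move(slot_);
        slot_ = std::move(fresh);
        return ready;
    }

private:
    std::optional<Buf> slot_;
};
```

The first push returns an empty optional; every subsequent push returns the buffer parked by the previous call, which is what introduces the one-frame latency (and the race) the post mentions.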

Which function do you use for the CUDA copy? Is any CPU access involved?

It's all on the GPU; the CUDA buffer is read normally, like any other CUDA buffer.

It's still not clear to me how you checked the issue without CPU access in the real-time and non-real-time cases.

Please refer to processPayload() in ~/nvidia/nvidia_sdk/DRIVE_OS_5.2.0_SDK_Linux_OS_DDPX/DRIVEOS/drive-t186ref-linux/samples/nvsci/nvscistream/cuda_consumer.cpp for CUDA receiving images from nvmedia and copying back to host.

I’m going to give this a pass and use the workaround that I found. Thanks.

If you still want to check whether it's a cache-maintenance issue, you can try NvMediaImageLock()/NvMediaImageUnlock(). Thanks.
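A minimal sketch of that check, reusing the NvMediaImage* `img` from the earlier snippet (signatures per the NvMedia Image API; not verified against this exact DRIVE OS release):

```cpp
// Hypothetical cache-maintenance probe: locking the image for CPU read
// should perform the needed cache maintenance; unlock releases it.
// Error handling elided for brevity.
NvMediaImageSurfaceMap surfaceMap;
if (NvMediaImageLock(img, NVMEDIA_IMAGE_ACCESS_READ, &surfaceMap) ==
    NVMEDIA_STATUS_OK) {
    // The surface planes are CPU-visible via surfaceMap at this point.
    NvMediaImageUnlock(img);
}
```

If the chopped-off frames disappear with the lock/unlock pair in place, that would point at cache maintenance rather than a missing fence wait.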