I’m working on reducing memory transfers of images from the DriveWorks image acquisition pipeline to TensorRT on my DRIVE PX2.
My rough pipeline is as follows:
// on the image grabbing thread
dwSensorCamera_readFrame(…);
dwSensorCamera_getImage(&image, DW_CAMERA_OUTPUT_CUDA_RGBA_UINT8, handle);
dwImage_copyConvertAsync(buffer_handle, image, stream, context);
dwImage_getCUDA(&cuda_image, buffer_handle);
cudaEventRecord(event, stream);
cudaEventSynchronize(event);
// push cuda_image to another thread
// on the tensorrt thread
// preprocess cuda_image into the tensorrt input buffer
tensorrt_execution_context->enqueue(input_buffer, output_buffer, …)
With the above pipeline I get a segmentation fault inside the enqueue call. If I do not invoke enqueue, all of the preprocessing runs correctly on the GPU and prepares the TensorRT input buffers.
However, if I split the work so that image acquisition runs in one process and TensorRT runs in a separate process, I have no issues (i.e. using GPU → host → IPC → host → GPU in place of pushing the cuda_image to a separate thread).
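For reference, the in-process handoff (the "push cuda_image to another thread" step) can be sketched as a blocking queue between the two threads. This is a minimal host-only sketch, not the actual code: `FrameHandle`, `FrameQueue`, and `run_handoff_demo` are hypothetical names, and the handle carries plain integers in place of the real dwImageCUDA so that it runs without a GPU.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical stand-in for the frame description passed between threads.
// In the real pipeline this would carry the device pointer and pitch taken
// from the dwImageCUDA; here it is plain data so the sketch runs on a host.
struct FrameHandle {
    std::uintptr_t device_ptr;  // address of the pitched RGBA buffer
    std::size_t pitch_bytes;    // bytes per row, including padding
    int width;
    int height;
};

class FrameQueue {
public:
    // Called by the image grabbing thread.
    void push(const FrameHandle& frame) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            frames_.push(frame);
        }
        ready_.notify_one();
    }

    // Called by the inference thread. Blocks until a frame is available,
    // then removes and returns it.
    FrameHandle pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !frames_.empty(); });
        FrameHandle frame = frames_.front();
        frames_.pop();
        return frame;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<FrameHandle> frames_;
};

// Runs one producer and one consumer thread and returns the frame widths in
// the order the consumer received them.
std::vector<int> run_handoff_demo(int frame_count) {
    assert(frame_count >= 0);
    FrameQueue queue;
    std::vector<int> received;
    std::thread consumer([&] {
        for (int i = 0; i < frame_count; ++i) {
            received.push_back(queue.pop().width);
        }
    });
    std::thread producer([&] {
        for (int i = 0; i < frame_count; ++i) {
            queue.push(FrameHandle{0, 0, 100 + i, 0});
        }
    });
    producer.join();
    consumer.join();
    return received;
}
```

The queue only moves the handle. It says nothing about who owns the underlying GPU buffer or when it may be reused, which the real pipeline has to handle separately.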
With DriveWorks, do we need to use the DriveWorks TensorRT API, or can we just use TensorRT by itself?
The TensorRT thread is unaware of DriveWorks. It only receives images from the DriveWorks thread in the form of RGBA DW_IMAGE_MEMORY_TYPE_PITCH images, which is the same pitched layout that OpenCV’s GpuMat uses.
The preprocessing of the data works correctly, so I know the images sent between threads are correct. The preprocessing does not occur in place, so the buffers allocated for TensorRT’s input are unchanged.
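To make "not in place" concrete, here is a CPU reference of the kind of indexing involved. It is only a sketch under assumptions: the real preprocessing runs on the GPU, and the planar R/G/B float layout, the dropped alpha channel, the `scale` parameter, and the function name `rgba_pitch_to_planar_rgb` are illustrative choices, not the network's confirmed input format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU reference: converts a pitched, interleaved RGBA8 image into a separate
// planar float buffer (R plane, then G plane, then B plane), dropping alpha.
// The source buffer is only read, never written.
std::vector<float> rgba_pitch_to_planar_rgb(const std::uint8_t* src,
                                            std::size_t pitch_bytes,
                                            int width, int height,
                                            float scale) {
    // A pitched row must be at least as long as the visible pixels.
    assert(pitch_bytes >= static_cast<std::size_t>(width) * 4);
    const std::size_t plane = static_cast<std::size_t>(width) * height;
    std::vector<float> dst(3 * plane);
    for (int y = 0; y < height; ++y) {
        // Rows start every pitch_bytes, not every width * 4 bytes.
        const std::uint8_t* row = src + static_cast<std::size_t>(y) * pitch_bytes;
        for (int x = 0; x < width; ++x) {
            const std::size_t pixel = static_cast<std::size_t>(y) * width + x;
            for (int c = 0; c < 3; ++c) {
                dst[c * plane + pixel] = scale * row[4 * x + c];
            }
        }
    }
    return dst;
}
```

Because the output is a freshly allocated buffer, the source image and the destination never alias.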
If I allow enqueue to be called, I get a segfault with the following stack trace:
[0x7f203d7340] + 0x375340
[0x7f7ae3e6c0]
[0x7f5c3a0984] + 0x2ac984
[0x7f5c3a0d48] + 0x2acd48
[0x7f5c2ad458] + 0x1b9458
[0x7f5c316298] + 0x222298
[0x7f5c2b367c] + 0x1bf67c
[0x7f5c2b4234] + 0x1c0234
[0x7f5c2b8388] + 0x1c4388
[0x7f5c295bf8] + 0x1a1bf8
[0x7f5c2963bc] + 0x1a23bc
[0x7f5c30f75c] cuEGLStreamConsumerAcquireFrame + 0x78
[0x7f5d15e080] + 0x32c080
[0x7f5cf6bd7c] + 0x139d7c
[0x7f5cf6c828] + 0x13a828
[0x7f5cf68850] + 0x136850
[0x7f5cf6934c] + 0x13734c
[0x7f5cf73674] + 0x141674
[0x7f5cf73ab4] dwSensorCamera_getImage + 0x40
or the following:
[31-1-2020 13:29:3] Driveworks exception thrown: DW_CUDA_ERROR: Call failed cuGraphicsResourceGetMappedEglFrame : unspecified launch failure
[ERROR] [1580506143.699987260]: Error occurred getting an image from the sensor frame: DW_CUDA_ERROR
[31-1-2020 13:29:3] Driveworks exception thrown: DW_BAD_CAST: dwImage_copyFormatConvert: images must be of the same type
[ERROR] [1580506143.700175778]: Unable to convert the captured image: DW_BAD_CAST
[31-1-2020 13:29:3] Driveworks exception thrown: DW_CUDA_ERROR: Call failed cuEGLStreamConsumerReleaseFrame : unspecified launch failure
what(): (cudaEventRecord(event.get(), stream)==cudaErrorLaunchFailure)
or I get the following from the tensorrt thread:
Cuda error in file src/implicit_gemm.cu at line 1159: unspecified launch failure
Error at line 289: unspecified launch failure
customWinogradConvActLayer.cpp:237: virtual void nvinfer1::cudnn::WinogradConvActLayer::allocateResources(const nvinfer1::cudnn::CommonContext&): Assertion `convolutions.back().get()' failed.
or this:
[0x7f38281ab4] + 0xa8ab4
[0x7f8d8246c0]
[0x7f3f45b984] + 0x2ac984
[0x7f3f45bd48] + 0x2acd48
[0x7f3f368458] + 0x1b9458
[0x7f3f478fe8] + 0x2c9fe8
[0x7f3f2967b0] + 0xe77b0
[0x7f3f296908] + 0xe7908
[0x7f3f29694c] + 0xe794c
[0x7f3f3c92a0] cuLaunchKernel + 0xb0
[0x7f8801bc0c] + 0xcc0c
I’ve tried the following variations with no observable difference:
GPU wrapping dwImageCUDA → tensorrt
GPU wrapping dwImageCUDA → user managed GPU → tensorrt
GPU wrapping dwImageCUDA → pinned CPU → tensorrt thread → GPU → tensorrt
CPU wrapping dwImageCPU → tensorrt thread → GPU → tensorrt
CPU wrapping dwImageNvMedia → tensorrt thread → GPU → tensorrt
Software versions:
driveworks: v.1.2.400-drive-linux-5.0.10.3
cudnn: 7_7.1.2.23-1+cuda9.2
nvinfer: 4.1.1-1+cuda9.2
Edit
The original code that I’m replacing did the following:
dwSensorCamera_readFrame
dwSensorCamera_getImage(&image, DW_CAMERA_OUTPUT_NATIVE_PROCESSED, …)
dwImage_copyConvert(buffer_handle, image, m_context);
// push onto queue for a conversion thread
// on conversion thread
dwImage_getNvMedia
NvMediaImageLock
// wrap surface and copy into pageable host memory
NvMediaImageUnlock
// push host memory image to tensorrt thread
// upload to GPU, preprocess, and infer.
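The "wrap surface and copy into pageable host memory" step amounts to copying a pitched surface row by row into a tightly packed buffer. A minimal sketch, assuming the locked surface is exposed as a base pointer plus a row pitch; `copy_pitched_to_packed` and its parameters are hypothetical names, and no NvMedia calls are shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Copies `height` rows of `row_bytes` each from a pitched surface into a
// tightly packed host buffer, skipping the padding at the end of each row.
std::vector<std::uint8_t> copy_pitched_to_packed(const std::uint8_t* surface,
                                                 std::size_t pitch_bytes,
                                                 std::size_t row_bytes,
                                                 int height) {
    assert(pitch_bytes >= row_bytes);
    assert(height >= 0);
    std::vector<std::uint8_t> packed(row_bytes * static_cast<std::size_t>(height));
    for (int y = 0; y < height; ++y) {
        const std::size_t row = static_cast<std::size_t>(y);
        std::copy_n(surface + row * pitch_bytes, row_bytes,
                    packed.begin() + static_cast<std::ptrdiff_t>(row * row_bytes));
    }
    return packed;
}
```

The returned vector is ordinary pageable host memory, which matches the original path above but is also why that path pays for an extra host copy per frame.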