VPI CUDA interop with managed memory

tobias.fischer1 · December 8, 2020, 7:15pm

Hello,

VPI can wrap foreign memory, e.g. via vpiArrayCreateCudaMemWrapper() and vpiArrayCreateHostMemWrapper(). The former is meant to wrap memory allocated via cudaMalloc(), the latter memory allocated via regular malloc().

What about managed memory, allocated via cudaMallocManaged() and attached to host or device via cudaStreamAttachMemAsync(..., cudaMemAttachHost) or cudaStreamAttachMemAsync(..., cudaMemAttachGlobal), respectively? Do I have to create both a host and a CUDA wrapper VPI array? Should I consider anything else when using managed memory with VPI?

The VPI docs do not mention managed memory as far as I could tell.

An example use case is filling a keypoint array with the PVA and using the keypoints both from the CPU and the GPU without copy.

An alternative could be to allocate the array with VPI directly and specify the corresponding backends. Can I then use the memory from both the CPU and the GPU? There are no usage hints (streamAttach) in that case, so I would assume that caching is disabled.

Best
Tobias

AastaLLL · December 9, 2020, 2:43am

Hi,

For unified memory, the same buffer pointer is shared between the CPU and the GPU.
So you can use either vpiArrayCreateCudaMemWrapper or vpiArrayCreateHostMemWrapper.

With vpiArrayCreateHostMemWrapper, VPI treats the buffer as a general CPU buffer and assumes it is not accessible from the GPU.
So it will perform a memory copy to enable CUDA access.

As a result, we recommend using vpiArrayCreateCudaMemWrapper instead.
The process is similar to the buffer created by cudaMalloc.
But please remember to synchronize before the CPU task, since Jetson doesn’t support concurrent access.

Thanks.

tobias.fischer1 · December 9, 2020, 10:34am

Thank you for the information.

Could you clarify the last point about synchronization and concurrent access? I think the other topic becomes relevant as well now.

Let’s look at the following code, which serves as a better basis for discussion. It does the following:

1. Allocate CUDA managed memory
2. Wrap that memory into a VPI array
3. Create a CUDA stream and wrap it into a VPI stream
4. Launch some VPI kernels on the VPI stream
5. Launch some CUDA kernels on the wrapped CUDA stream

I also left out the definition and construction of some objects for brevity. They are surrounded by square brackets [ ].

// allocate some managed memory
float* data_managed = nullptr;
cudaMallocManaged(&data_managed, 8192 * 2 * sizeof(float));

// specify vpi array data struct
VPIArrayData vpi_data;
std::memset(&vpi_data, 0, sizeof(vpi_data));
vpi_data.capacity = 8192;
vpi_data.type = VPI_ARRAY_TYPE_KEYPOINT;
vpi_data.strideBytes = 8;
vpi_data.data = data_managed;

// create the wrapper
VPIArray vpi_array;
vpiArrayCreateCudaMemWrapper(&vpi_data, 0, &vpi_array);

// create CUDA stream and wrap it in a VPI stream
cudaStream_t cuda_stream;
cudaStreamCreate(&cuda_stream);
VPIStream vpi_stream;
vpiStreamCreateCudaStreamWrapper(cuda_stream, 0, &vpi_stream);

// Before using the vpi_array in a VPI kernel, I can attach the memory either to CPU or GPU.
// Assume that the Kernel operates on VPI_BACKEND_PVA
// Does the StreamAttach have any effect on the PVA?
cudaStreamAttachMemAsync(cuda_stream, data_managed, 0, cudaMemAttachHost);

// launch a VPI "kernel" on PVA
vpiSubmitHarrisCornerDetector(vpi_stream, [payload], [image], vpi_array, [scores], [params]);

// launch CUDA kernel, it should wait for the VPI kernel to finish first
// The data should be used by the GPU, so it is attached to global
// Do I need any synchronization before this step?
cudaStreamAttachMemAsync(cuda_stream, data_managed, 0, cudaMemAttachGlobal);
k<<<..., ..., cuda_stream>>>(data_managed);

So in essence my two questions are:

1. How does cudaStreamAttachMemAsync() interact with the PVA/VIC backends?
2. Can I queue PVA/VIC/CPU tasks on the vpi_stream, and then queue more subsequent work on the underlying cuda_stream? If not, how can I provide synchronization between the operations without blocking the calling thread?

I hope the example helps. If something looks wrong, please correct me; I just wrote it off the top of my head.
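As a quick sanity check on the sizes used in the snippet above, here is a minimal sketch. It assumes a keypoint is two packed floats (x, y), which is what strideBytes = 8 implies. `Keypoint2f`, `keypointBufferBytes` and `fitsAllocation` are hypothetical helper names for illustration, not VPI names.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a 2D keypoint: two packed floats (x, y).
// Assumption: this mirrors the 8-byte stride used in the snippet above.
struct Keypoint2f {
    float x;
    float y;
};

static_assert(sizeof(Keypoint2f) == 8, "a stride of 8 bytes assumes two 4-byte floats");

// Bytes needed for a keypoint array with the given capacity.
constexpr std::size_t keypointBufferBytes(std::size_t capacity) {
    return capacity * sizeof(Keypoint2f);
}

// Checks that `capacity` elements of `strideBytes` each fit into an
// allocation of `allocBytes` bytes.
inline bool fitsAllocation(std::size_t capacity, std::size_t strideBytes, std::size_t allocBytes) {
    assert(strideBytes > 0);
    return capacity <= allocBytes / strideBytes;
}
```

With a capacity of 8192 this gives 65536 bytes, which matches the 8192 * 2 * sizeof(float) passed to cudaMallocManaged().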

Best
Tobias

AastaLLL · December 10, 2020, 4:55am

Hi,

1. This is more related to the memory type you wrap.
For example, since you wrap data_managed with vpiArrayCreateCudaMemWrapper, you will need to enable GPU access to it when using it with the VPI API.
The reason is that VPI will treat the buffer as a CUDA buffer.

2. Yes. A cudaStream_t and the VPIStream that wraps it are roughly the same.
But the VPIStream includes some additional data used by the VPI SDK.

Thanks.

tobias.fischer1 · December 10, 2020, 8:34am

I see, thank you for the information :)

About the second point, just to be sure: The VPI documentation for vpiStreamCreateCudaStreamWrapper() states:

CUDA kernels can only be submitted directly to cudaStream_t if it’s guaranteed that all tasks submitted to VPIStream are finished.

Which is why I thought queuing regular cuda kernels would not be supported if asynchronous VPI calls are still “in flight”. I think I misinterpreted that statement. Could you outline a case where the behavior required in the documentation is violated?

AastaLLL · December 11, 2020, 4:02am

Hi,

In general, GPU will execute all tasks attached to the same CUDA stream in sequence.

But the document seems to imply some limitation.
Let us check this with our internal team and share more information with you.

Thanks.

AastaLLL · December 14, 2020, 5:33am

Hi,

Thanks for your patience.

You can attach tasks to the CUDA stream.
But please use host functions to ensure that the kernels are queued and executed at the right time.
https://docs.nvidia.com/vpi/group__VPI__HostFunction.html

Thanks.

wuzhenlin456 · December 14, 2020, 9:14am

Hello, AastaLLL
I am trying to run the vpi-0.4 rescale demo, but the performance I get is much slower than claimed in your official documents. Do those results include the time spent in vpiStreamSync()?
How should I use vpiSubmitRescale(stream, VPI_BACKEND_CUDA, imageNV12, outputNV12, VPI_INTERP_LINEAR, VPI_BOUNDARY_COND_CLAMP) in order to make it faster?
I measure API performance like this.

Am I right? Are there any parallelization methods that I need to use?

https://docs.nvidia.com/vpi/algo_rescale.html#algo_rescale_perf

Were all the results measured with sample code 04-rescale?
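On whether the measurement has to include the sync: below is a minimal CPU-only sketch of why it matters. It does not use VPI. `ToyStream` is a hypothetical stand-in for an asynchronous stream, and as a simplification its queued work only runs inside sync(), whereas a real stream runs it in the background. The point is only that stopping the clock right after the submit loop measures the enqueue cost, not the work.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;
using Ms    = std::chrono::duration<double, std::milli>;

// Toy stand-in for an asynchronous stream (NOT the VPI API): submit() only
// enqueues the task, the work itself runs when sync() is called.
class ToyStream {
public:
    void submit(std::function<void()> task) { pending_.push_back(std::move(task)); }
    void sync() {
        for (auto& task : pending_) task();
        pending_.clear();
    }
private:
    std::vector<std::function<void()>> pending_;
};

struct Timing {
    double submitMsPerIter;  // clock stopped right after the submit loop
    double totalMsPerIter;   // clock stopped after sync()
};

// Queues `iterations` tasks that each take about `workMs` milliseconds and
// reports both ways of stopping the clock.
inline Timing measure(int iterations, int workMs) {
    assert(iterations > 0);
    ToyStream stream;

    const auto t0 = Clock::now();
    for (int i = 0; i < iterations; ++i) {
        stream.submit([workMs] {
            std::this_thread::sleep_for(std::chrono::milliseconds(workMs));
        });
    }
    const auto t1 = Clock::now();  // only the enqueue cost has been paid here
    stream.sync();                 // the queued work is finished after this
    const auto t2 = Clock::now();

    return Timing{Ms(t1 - t0).count() / iterations, Ms(t2 - t0).count() / iterations};
}
```

Calling measure(5, 10) reports a per-iteration submit time far below the per-iteration total, so a number taken without the sync says little about the algorithm itself.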

tobias.fischer1 · December 14, 2020, 9:52am

Thank you for digging into this :) So in essence, the CUDA kernels that should be enqueued on the stream need to be wrapped into another function that satisfies the VPI host function interface (only taking a pointer for inbound data)?

I am not super familiar with function pointers. Is there any reference or a simple code example that shows the principle? Apart from that, I understand that VPI is a developer preview for now. Do you think it would be beneficial to wait for the 1.0 release that is scheduled for this month with JetPack 4.5?
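For the function-pointer part, here is a minimal sketch of the general C-style callback pattern, not of the VPI API itself. `HostFn`, `LaunchArgs`, `scaleCallback` and `ToyQueue` are hypothetical names; the only assumption taken from the discussion above is that the host function receives a single pointer. Everything the callback needs is packed into a struct whose address is passed as that pointer.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical callback type: a plain function taking one opaque pointer.
using HostFn = void (*)(void* userData);

// All arguments the callback needs, packed into one struct.
struct LaunchArgs {
    const float* input;
    float*       output;
    int          count;
    float        scale;
};

// The callback casts the opaque pointer back and unpacks the arguments.
// In the real use case the CUDA kernel launch would go here; this sketch
// uses a CPU loop instead.
void scaleCallback(void* userData) {
    auto* args = static_cast<LaunchArgs*>(userData);
    assert(args != nullptr);
    for (int i = 0; i < args->count; ++i) {
        args->output[i] = args->input[i] * args->scale;
    }
}

// Toy in-order queue standing in for a stream: callbacks run in submission
// order when sync() is called.
class ToyQueue {
public:
    void submit(HostFn fn, void* userData) { pending_.emplace_back(fn, userData); }
    void sync() {
        for (auto& entry : pending_) entry.first(entry.second);
        pending_.clear();
    }
private:
    std::vector<std::pair<HostFn, void*>> pending_;
};
```

Usage would be to fill a LaunchArgs, call submit(&scaleCallback, &args), and later sync(). The struct has to stay alive until the callback has run, because only its address is stored.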

AastaLLL · December 15, 2020, 3:51am

Hi,

Please maximize the device performance with the script shared below:
https://docs.nvidia.com/vpi/algo_performance.html#maxout_clocks

This script also maxes out the VIC clocks, which nvpmodel and jetson_clocks do not cover.

Thanks.

wuzhenlin456 · December 15, 2020, 3:55am

I am using VPI-CUDA instead of VPI-VIC, does this also work for it?

wuzhenlin456 · December 15, 2020, 9:54am

I used that script, but I still cannot get the claimed performance. Could you show me how to test it in a couple of lines of sample code?

wuzhenlin456 · December 16, 2020, 11:57am

Hello,
I want to ask a question: in vpi-0.4, does vpiSubmitRescale() support only the pitch-linear format when an NvBuffer is wrapped into a VPIImage object?

AastaLLL · December 17, 2020, 7:02am

Hi,

Please check the topic "VPI-0.4 rescale prefermence Test On CUDA" for the rescale performance issue.
And yes, VPI-0.4 requires a pitch-linear format when wrapping data into a VPI image.

Thanks

wuzhenlin456 · December 17, 2020, 7:12am

Thanks. How can I get performance similar to sample code 04-rescale if I need to wrap an NvBuffer fd into a VPIImage and pass it to vpiSubmitRescale()? Below is my test code. Thanks.

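#include <cstring>
#include <iostream>

#include <nvbuf_utils.h>
#include <vpi/Event.h>
#include <vpi/Image.h>
#include <vpi/Stream.h>
#include <vpi/algo/Rescale.h>

int main()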
{
int in_dmabuf_fd;
int out_dmabuf_fd;

NvBufferCreateParams input_params = {0};  // zero-initialize fields that are not set below
input_params.width = 1920;
input_params.height = 1080;
input_params.layout = NvBufferLayout_Pitch;
input_params.payloadType = NvBufferPayload_SurfArray;
input_params.colorFormat = NvBufferColorFormat_NV12;
input_params.nvbuf_tag = NvBufferTag_VIDEO_CONVERT;
NvBufferCreateEx(&in_dmabuf_fd, &input_params);

NvBufferCreateParams output_params = {0};  // zero-initialize fields that are not set below
output_params.width = 640;
output_params.height = 480;
output_params.layout = NvBufferLayout_Pitch;
output_params.payloadType = NvBufferPayload_SurfArray;
output_params.colorFormat = NvBufferColorFormat_NV12;
output_params.nvbuf_tag = NvBufferTag_VIDEO_CONVERT;
NvBufferCreateEx(&out_dmabuf_fd, &output_params);

NvBufferParams param1;
NvBufferGetParams(in_dmabuf_fd, &param1);
void *inputY  = nullptr;
NvBufferMemMap(in_dmabuf_fd, 0, NvBufferMem_Write, &inputY);
void *inputUV = nullptr;
NvBufferMemMap(in_dmabuf_fd, 1, NvBufferMem_Write, &inputUV);
NvBufferMemSyncForCpu(in_dmabuf_fd, 0, &inputY);
NvBufferMemSyncForCpu(in_dmabuf_fd, 1, &inputUV);
VPIImageData inImgData;
memset(&inImgData, 0, sizeof(inImgData));
inImgData.type                 = VPI_IMAGE_FORMAT_NV12;
inImgData.numPlanes            = 2;
inImgData.planes[0].width      = param1.width[0];
inImgData.planes[0].height     = param1.height[0];
inImgData.planes[0].pitchBytes = param1.pitch[0];
inImgData.planes[0].data       = inputY;
inImgData.planes[1].width      = param1.width[1];
inImgData.planes[1].height     = param1.height[1];
inImgData.planes[1].pitchBytes = param1.pitch[1];
inImgData.planes[1].data       = inputUV;

NvBufferParams param2;
NvBufferGetParams(out_dmabuf_fd, &param2);
void *outputY  = nullptr;
NvBufferMemMap(out_dmabuf_fd, 0, NvBufferMem_Write, &outputY);
void *outputUV = nullptr;
NvBufferMemMap(out_dmabuf_fd, 1, NvBufferMem_Write, &outputUV);
NvBufferMemSyncForCpu(out_dmabuf_fd, 0, &outputY);
NvBufferMemSyncForCpu(out_dmabuf_fd, 1, &outputUV);
VPIImageData outImgData;
memset(&outImgData, 0, sizeof(outImgData));
outImgData.type                 = VPI_IMAGE_FORMAT_NV12;
outImgData.numPlanes            = 2;
outImgData.planes[0].width      = param2.width[0];
outImgData.planes[0].height     = param2.height[0];
outImgData.planes[0].pitchBytes = param2.pitch[0];
outImgData.planes[0].data       = outputY;
outImgData.planes[1].width      = param2.width[1];
outImgData.planes[1].height     = param2.height[1];
outImgData.planes[1].pitchBytes = param2.pitch[1];
outImgData.planes[1].data       = outputUV;

VPIEvent evStop  = nullptr;
VPIEvent evStart = nullptr;
float elapsedMS;
VPIStream stream;
VPIBackend backendType = VPI_BACKEND_CUDA;
vpiStreamCreate(backendType, &stream);
VPIImage input   = nullptr;
VPIImage output  = nullptr;
vpiImageCreateHostMemWrapper(&inImgData, 0, &input);
vpiImageCreateHostMemWrapper(&outImgData, 0, &output);

vpiEventCreate(0, &evStart);
vpiEventCreate(0, &evStop);
vpiEventRecord(evStart, stream);
for (int j = 0; j < 30000; j++) {
    vpiSubmitRescale(stream, VPI_BACKEND_CUDA, input, output, VPI_INTERP_LINEAR, VPI_BOUNDARY_COND_ZERO);
}
vpiEventRecord(evStop, stream);
vpiEventSync(evStop);
vpiEventElapsedTime(evStart, evStop, &elapsedMS);
std::cout << elapsedMS / 30000 << "ms per frame." << std::endl;

// Clean up
NvBufferDestroy(in_dmabuf_fd);
NvBufferDestroy(out_dmabuf_fd);
vpiImageDestroy(input);
vpiImageDestroy(output);
vpiStreamDestroy(stream);
vpiEventDestroy(evStop);
vpiEventDestroy(evStart);
return 0;

}

AastaLLL · December 18, 2020, 9:09am

Hi,

Have you maximized the device performance?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Thanks.

There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks
