VPI CUDA interop with managed memory

Hello,

VPI can wrap foreign memory, e.g. via vpiArrayCreateCudaMemWrapper() and vpiArrayCreateHostMemWrapper(). The former is meant to wrap memory allocated via cudaMalloc() and the latter for regular malloc().

What about managed memory, allocated via cudaMallocManaged() and attached to host/device via cudaStreamAttachMemAsync(..., cudaMemAttachHost) resp. cudaStreamAttachMemAsync(..., cudaMemAttachGlobal)? Do I have to create both a host and a cuda wrapper VPI array? Should I consider anything else when using managed memory with VPI?

The VPI docs do not mention managed memory as far as I could tell.

An example use case is filling a keypoint array with the PVA and using the keypoints both from the CPU and the GPU without copy.

An alternative could be to allocate the array with VPI directly, and specifying the correponding backends. Can I then use the memory from the CPU/GPU? There are no usage hints (streamAttach) then, so I would assume that caching would be disabled in that case.

Best
Tobias

Hi,

For unified memory, since the buffer pointer is shared via CPU buffer and GPU buffer.
You can use either vpiArrayCreateCudaMemWrapper or vpiArrayCreateHostMemWrapper.

In vpiArrayCreateHostMemWrapper, VPI will treat the buffer as a general CPU buffer and assume it is GPU non-accessible.
So it will do some memory copy to enable CUDA access.

As a result, we recommend using vpiArrayCreateCudaMemWrapper instead.
The process is similar to the buffer created by cudaMalloc.
But please remember to call synchronization before the CPU task since Jetson doesn’t support concurrent access.

Thanks.

Thank you for the information.

Could you clarify the last point about synchronization and concurrent access? I think the other topic becomes relevant as well now.

Let’s look at the following code, this serves as a better basis of discussion. It has the following purpose:

  1. Allocate cuda managed memory
  2. Wrap that memory into a VPI array
  3. Create a cuda stream and wrap it into a VPI stream
  4. Launch some VPI kernels on the vpi stream
  5. Launch some CUDA kernels on the wrapped cuda stream

I also left out the definition and construction of some objects for brevity. They are surrounded by braces [ ].

// allocate some managed memory
int* data_managed;
cudaMallocManaged(&data_managed, 8192 * 2 * sizeof(float));

// specify vpi array data struct
VPIArrayData vpi_data;
std::memset(&vpi_data, 0, sizeof(vpi_data));
vpi_data.capacity = 8192;
vpi_data.type = VPI_ARRAY_TYPE_KEYPOINT;
vpi_data.strideBytes = 8;
vpi_data.data = data_managed;

// create the wrapper
vpiArray vpi_array;
vpiArrayCreateCudaMemWrapper(vpi_data, 0, &vpi_array);

// create CUDA stream and wrap it in a VPI stream
cudaStream_t cuda_stream;
cudaStreamCreate(&cuda_stream);
vpiStream vpi_stream;
vpiStreamWrapCuda(cuda_stream, &vpi_stream);

// Before using the vpi_array in a VPI kernel, I can attach the memory either to CPU or GPU.
// Assume that the Kernel operates on VPI_BACKEND_PVA
// Does the StreamAttach have any effect on the PVA?
cudaStreamAttachMemAsync(cuda_stream, data_managed, 0, cudaMemAttachHost)

// launch a VPI "kernel" on PVA
vpiSubmitHarrisCornerDetector(vpi_stream, [payload], [image], vpi_array, [scores], [params]);

// launch CUDA kernel, it should wait for the VPI kernel to finish first
// The data should be used by the GPU, so it is attached to global
// Do I need any synchronization before this step?
cudaStreamAttachMemAsync(cuda_stream, data_managed, 0, cudaMemAttachGlobal)
k<<<..., ..., cuda_stream>>>(data_managed);

So in essence my two questions are:

  1. How does cudaStreamAttachMemAsync() interact with the PVA/VIC backends?
  2. Can I queue PVA/VIC/CPU tasks on the vpi_stream, and then queue more subsequent work on the underlying cuda_stream? If not, how can I provide synchronization between the operations without blocking the calling thread?

I hope the example helps, if something looks wrong please correct me, I just wrote it from the top of my head.

Best
Tobias

Hi,

1. This is more related to the memory type you wrap.
For example, since you wrap data_managed with vpiArrayCreateCudaMemWrapper, you will need to enable the GPU access when using it with VPI API.
The reason is VPI will treat the buffer as a CUDA buffer.

2. Yes. cudaStream_t and the wrapped vpiStream is roughly the same.
But vpiStream includes some data used by the VPI SDK.

Thanks.

I see, thank you for the information :)

About the second point, just to be sure: The VPI documentation for vpiStreamCreateCudaStreamWrapper() states:

CUDA kernels can only be submitted directly to cudaStream_t if it’s guaranteed that all tasks submitted to VPIStream are finished.

Which is why I thought queuing regular cuda kernels would not be supported if asynchronous VPI calls are still “in flight”. I think I misinterpreted that statement. Could you outline a case where the behavior required in the documentation is violated?

Hi,

In general, GPU will execute all tasks attached to the same CUDA stream in sequence.

But the document seems to imply some limitation.
Let us check this with our internal team and share more information with you.

Thanks.

Hi,

Thanks for your patience.

You can attach tasks to the CUDA stream.
But please submit the host functions to ensure the kernels be queued and executed at the right time.

Thanks.

Hello, AastaLLL
I am trying to run the vpi-0.4 rescale demo, but I get the performance results much slower than claimed in your official documents? Did these results include the time consumption of vpiStreamSync()?
How should I use this vpiSubmitRescale(stream, VPI_BACKEND_CUDA, imageNV12, outputNV12, VPI_INTERP_LINEAR, VPI_BOUNDARY_COND_CLAMP)) in order to make it faster?
I measure API performance like this.


Am I right? Is there any parallelizable methods that I need to use?


Are all the results tested in sample code 04-rescale?

Thank you for digging into this :) So in essence, the CUDA kernels that should be enqueued on the stream need to be wrapped into another function that satisfies the VPI host function interface (only taking a pointer for inbound data)?

I am not super familiar with function pointers, is there any reference or any simple code example that shows the principle? Apart from that I understand that VPI is a develper preview for now. Do you think it would be beneficial to wait for the 1.0 release that is scheduled for this month with JetPack 4.5?

Hi,

Please maximize the device performance with the script shared below:

This script can max out the VIC clocks, which is not included in the nvpmodel and jetson_clocks.

Thanks.

I am using VPI-CUDA instead of VPI-VIC, does this also work for it?

I used that script,but still cannot get the claimed performance. Could you show me how to test it in couple of lines of sample code?

Hello,
I want to ask a question, is the API vpiSubmitRescale() in vpi-0.4, only pitch-linear format is supported with Nvbuffer wrapper to VPIImage object?

Hi,

Please check VPI-0.4 rescale prefermence Test On CUDA for the rescale performance issue.
And yes, VPI-0.4 requires a pitch linear format when wrapping data into a VPI image.

Thanks

Thanks. How could I get the similar performance with sample-code 04-rescale if I need to wrapper nvbuffer fd to VPIImage, and pass it into vpiSubmitRescale()? Below is my test code. Thanks.

int main()
{
int in_dmabuf_fd;
int out_dmabuf_fd;

NvBufferCreateParams input_params;
input_params.width = 1920;
input_params.height = 1080;
input_params.layout = NvBufferLayout_Pitch;
input_params.payloadType = NvBufferPayload_SurfArray;
input_params.colorFormat = NvBufferColorFormat_NV12;
input_params.nvbuf_tag = NvBufferTag_VIDEO_CONVERT;
NvBufferCreateEx(&in_dmabuf_fd, &input_params);

NvBufferCreateParams output_params;
output_params.width = 640;
output_params.height = 480;
output_params.layout = NvBufferLayout_Pitch;
output_params.payloadType = NvBufferPayload_SurfArray;
output_params.colorFormat = NvBufferColorFormat_NV12;
output_params.nvbuf_tag = NvBufferTag_VIDEO_CONVERT;
NvBufferCreateEx(&out_dmabuf_fd, &output_params);

NvBufferParams param1;
NvBufferGetParams(in_dmabuf_fd, &param1);
void *inputY  = nullptr;
NvBufferMemMap(in_dmabuf_fd, 0, NvBufferMem_Write, &inputY);
void *inputUV = nullptr;
NvBufferMemMap(in_dmabuf_fd, 1, NvBufferMem_Write, &inputUV);
NvBufferMemSyncForCpu(in_dmabuf_fd, 0, &inputY);
NvBufferMemSyncForCpu(in_dmabuf_fd, 1, &inputUV);
VPIImageData inImgData;
memset(&inImgData, 0, sizeof(inImgData));
inImgData.type                 = VPI_IMAGE_FORMAT_NV12;
inImgData.numPlanes            = 2;
inImgData.planes[0].width      = param1.width[0];
inImgData.planes[0].height     = param1.height[0];
inImgData.planes[0].pitchBytes = param1.pitch[0];
inImgData.planes[0].data       = inputY;
inImgData.planes[1].width      = param1.width[1];
inImgData.planes[1].height     = param1.height[1];
inImgData.planes[1].pitchBytes = param1.pitch[1];
inImgData.planes[1].data       = inputUV;

NvBufferParams param2;
NvBufferGetParams(out_dmabuf_fd, &param2);
void *outputY  = nullptr;
NvBufferMemMap(out_dmabuf_fd, 0, NvBufferMem_Write, &outputY);
void *outputUV = nullptr;
NvBufferMemMap(out_dmabuf_fd, 1, NvBufferMem_Write, &outputUV);
NvBufferMemSyncForCpu(out_dmabuf_fd, 0, &outputY);
NvBufferMemSyncForCpu(out_dmabuf_fd, 1, &outputUV);
VPIImageData outImgData;
memset(&outImgData, 0, sizeof(outImgData));
outImgData.type                 = VPI_IMAGE_FORMAT_NV12;
outImgData.numPlanes            = 2;
outImgData.planes[0].width      = param2.width[0];
outImgData.planes[0].height     = param2.height[0];
outImgData.planes[0].pitchBytes = param2.pitch[0];
outImgData.planes[0].data       = outputY;
outImgData.planes[1].width      = param2.width[1];
outImgData.planes[1].height     = param2.height[1];
outImgData.planes[1].pitchBytes = param2.pitch[1];
outImgData.planes[1].data       = outputUV;

VPIEvent evStop  = nullptr;
VPIEvent evStart = nullptr;
float elapsedMS;
VPIStream stream;
VPIBackend backendType = VPI_BACKEND_CUDA;
vpiStreamCreate(backendType, &stream);
VPIImage input   = nullptr;
VPIImage output  = nullptr;
vpiImageCreateHostMemWrapper(&inImgData, 0, &input);
vpiImageCreateHostMemWrapper(&outImgData, 0, &output);

vpiEventCreate(0, &evStart);
vpiEventCreate(0, &evStop);
vpiEventRecord(evStart, stream);
for (int j = 0; j < 30000; j++) {
    vpiSubmitRescale(stream, VPI_BACKEND_CUDA, input, output, VPI_INTERP_LINEAR, VPI_BOUNDARY_COND_ZERO);
}
vpiEventRecord(evStop, stream);
vpiEventSync(evStop);
vpiEventElapsedTime(evStart, evStop, &elapsedMS);
std::cout << elapsedMS / 30000 << "ms per frame." << std::endl;

// Clean up
NvBufferDestroy(in_dmabuf_fd);
NvBufferDestroy(out_dmabuf_fd);
vpiImageDestroy(input);
vpiImageDestroy(output);
vpiStreamDestroy(stream);
vpiEventDestroy(evStop);
vpiEventDestroy(evStart);
return 0;

}

There is no update from you for a period, assuming this is not an issue any more.
Hence we are closing this topic. If need further support, please open a new one.
Thanks

Hi,

Have you maximized the device performance?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Thanks.