nvjpegEncodeImage() does not work asynchronously

According to the Visual Profiler, nvjpegEncodeImage() appears to block on a cudaMemcpyAsync Device-to-Pageable copy (of only 4 B!) near the end of the function call.

The attached image is a Visual Profiler screen capture, reproduced on a GT710 using CUDA Samples\v11.2\7_CUDALibraries\nvJPEG_encoder.

This is a serious limitation when processing large-resolution images.

Is nvjpegEncodeImage() not designed to be asynchronous?
I suspect this is a bug in nvJPEG: why is the 4 B buffer not allocated as pinned memory, so that the memcpy can actually be asynchronous?
Or is there any workaround to make nvjpegEncodeImage() asynchronous?

[Image: nvjpegencodeimage_is_blocked_by_memcpyasync_devicetopageable]

Looking at the documentation, I note that many decode functions are explicitly declared to be asynchronous with respect to the host (“This function is asynchronous with respect to the host.”), but no such description is given for the encode functions, so this seems intentional. You might wish to file a bug; the instructions are given in the sticky post at the top of this forum.
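
If the encode call really is blocking with respect to the host, one generic host-side workaround is to issue it from a worker thread so the main thread can keep submitting other work. The following is only a minimal sketch of that pattern using std::async. Note that blocking_encode() and encode_in_background() are hypothetical names, and blocking_encode() is a stand-in that merely sleeps; it is not an nvJPEG API. Whether a given nvJPEG handle, encoder state, or stream may safely be used from another thread is an assumption that should be checked against the nvJPEG documentation.

```cpp
#include <chrono>
#include <future>
#include <thread>

// Hypothetical stand-in for a call that blocks the calling host thread
// (the role nvjpegEncodeImage() plays in the question). It only sleeps
// and returns a made-up byte count; it does not touch nvJPEG or CUDA.
int blocking_encode(int width, int height) {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    return width * height * 3;
}

// Run the blocking call on a separate thread. The caller gets a future
// immediately and can do other work before collecting the result.
std::future<int> encode_in_background(int width, int height) {
    return std::async(std::launch::async, blocking_encode, width, height);
}
```

The caller would hold on to the returned future, continue with other host or GPU work, and call get() on it only when the encoded result is actually needed. This hides the blocking from the main thread but does not remove the synchronization inside the call itself.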

Thank you for the comment.
I have posted a bug report following those instructions.