What is Convert_PL2BL?

I am facing some performance bottlenecks which I suspect are due to the Convert_PL2BL kernel that NVENC calls on the default stream.

Is it possible to somehow make Nvenc avoid the call to the aforementioned kernel?

Also see:

https://devtalk.nvidia.com/default/topic/1049053/gpu-accelerated-libraries/nvenc-fastest-convert-no-stream-0-/

Hi! We are having the same issue here. NVIDIA Nsight Systems shows that kernel overlap breaks down badly whenever this "Convert_PL2BL" kernel is called. Our software runs up to 40% slower because of this kernel.

Very probably this is because the kernel is enqueued on the default stream, and we cannot find a way to tell the NVENC API which CUDA stream to use. Is there any way to do so?

We tried to work around the issue with the compiler flag --default-stream per-thread, but it causes what seem to be deadlocks. In fact, one of the GPUs got into an error state and its fan started spinning like crazy. The flag would be a good enough solution if our code had been written with it in mind, but it was not, so we run into synchronization issues: calls that were implicitly synchronous through the default stream now become asynchronous (at least, I believe, when they are issued from a thread other than the main thread).
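To illustrate what I mean (a minimal sketch, not our actual code; the function and parameter names are made up): with the legacy default stream, work issued to stream 0 is implicitly ordered against other blocking streams, while with --default-stream per-thread each host thread gets its own default stream, so ordering that code silently relied on has to be re-established explicitly, for example with events:

```cpp
#include <cuda_runtime.h>

// Restore producer/consumer ordering across host threads once the legacy
// default stream's implicit synchronization is gone.
void orderAcrossThreads(cudaStream_t producerStream,
                        cudaStream_t consumerStream,
                        cudaEvent_t producedEvent)
{
    // Record a point after the producer's last launch on its stream...
    cudaEventRecord(producedEvent, producerStream);
    // ...and make the consumer's stream wait for it before running its own work.
    cudaStreamWaitEvent(consumerStream, producedEvent, 0);
}
```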

Please, someone correct me if I’m wrong in some way.

As a next step, we will look for a way to tell NVENC to use the CUDA stream we want instead of the default one. Please let us know if this is not a possibility. If that does not work, I guess the only way to solve this performance issue is to refactor our whole code base with --default-stream per-thread in mind, which will take some development days.

System and OS:
Windows 10 Enterprise 2016 LTSB Version 1607 OS Build 14393.3686
3x Quadro RTX 4000, driver version 419.67 (WDDM on all three GPUs)
CUDA 10.1
CPU AMD EPYC 7401P in a Supermicro H11SSL-i
64 GB of DDR4-2666 RAM using all 8 memory channels (configured for memory interleaving per die, so Windows sees a single NUMA node)

@dreqeu and @gmaxi17, or anyone facing performance issues with NVENCODE due to its internal pre-processing or post-processing CUDA kernels.

In case you still haven’t found a solution to this problem, let me summarize:

Some time ago, NVENCODE started to use pre-processing and/or post-processing CUDA kernels inside the API calls. These kernels were using the default stream, which can cause big performance issues.

A solution would be the ability to specify your own CUDA streams for pre- and post-processing.

I found that starting with Video Codec SDK version 9.1, you can do exactly that.

NEW to 9.1 - Encode: CUStream support in NVENC for enhanced parallelism between CUDA pre-processing and NVENC encoding

In the Video Codec SDK samples, you can find how to use this new feature in the file AppEncode/AppEncCuda.cpp.

You need to use NvEncoderCuda::SetIOCudaStreams(NV_ENC_CUSTREAM_PTR inputStream, NV_ENC_CUSTREAM_PTR outputStream) to set the pre- and post-processing CUDA streams. You can use the same stream for both or two different ones. Note that NvEncoderOutputInVidMemCuda inherits from NvEncoderCuda, so this public method is also available on an NvEncoderOutputInVidMemCuda instance. I did not check whether there are other classes inheriting from NvEncoderCuda.

The method expects CUstreams (the CUDA Driver API type), but as far as I know it is perfectly fine to pass a cudaStream_t created via the CUDA Runtime API with a cast.
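For illustration, here is a minimal sketch (not the SDK sample itself; error checking is omitted and `enc` is just a placeholder for your encoder instance), modeled on the NvCUStream helper class in AppEncCuda.cpp. Note that SetIOCudaStreams receives pointers to the stream handles, so the handles themselves must stay alive while the encoder is in use:

```cpp
#include <cuda.h>
#include "NvEncoderCuda.h"   // Video Codec SDK sample class

// 'enc' is an already-constructed NvEncoderCuda (or NvEncoderOutputInVidMemCuda).
void AttachEncoderStreams(NvEncoderCuda &enc, CUcontext cuContext)
{
    // The CUstream handles must outlive the encoder's use of them,
    // so in real code keep them as class members rather than locals.
    static CUstream inputStream  = nullptr;
    static CUstream outputStream = nullptr;

    cuCtxPushCurrent(cuContext);
    cuStreamCreate(&inputStream,  CU_STREAM_DEFAULT);   // stream for pre-processing / input
    cuStreamCreate(&outputStream, CU_STREAM_DEFAULT);   // stream for post-processing / output
    cuCtxPopCurrent(nullptr);

    // SetIOCudaStreams takes NV_ENC_CUSTREAM_PTR arguments,
    // i.e. pointers to the CUstream handles.
    enc.SetIOCudaStreams((NV_ENC_CUSTREAM_PTR)&inputStream,
                         (NV_ENC_CUSTREAM_PTR)&outputStream);
}
```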

Hope it helps!


@oamoros0ealf many thanks for sharing your findings with us! This is valuable information; I'll definitely give it a try ASAP.
