What is Convert_PL2BL?

I am facing some performance bottlenecks, which I suspect are due to the Convert_PL2BL kernel that NVENC launches on the default stream.

Is it possible to somehow make Nvenc avoid the call to the aforementioned kernel?

Also see:

https://devtalk.nvidia.com/default/topic/1049053/gpu-accelerated-libraries/nvenc-fastest-convert-no-stream-0-/

Hi! We are having the same issue here. NVIDIA Nsight Systems shows a severe break in kernel overlap whenever the “Convert_PL2BL” kernel is called. Our software runs up to 40% slower because of this kernel.

Very probably this is because the kernel is enqueued on the default stream, and we cannot find a way to tell the NVENC API which CUDA stream to use. Is there any way to do so?

We tried to work around the issue with the compiler flag --default-stream per-thread, but it seems to cause deadlocks. In fact, one of the GPUs went into an error state and its fan spun up like crazy. The flag would be a good enough solution if our code had been designed with it in mind, but it was not, so we now have synchronization issues: calls that were implicitly synchronized through the default stream now become asynchronous (I guess, at least when they are issued from a thread other than the main thread).

Please, someone correct me if I’m wrong in some way.

As a next step, we will look for a way to tell NVENC to use the CUDA stream we want instead of the default one. Please let us know if this is not possible. If it does not work, I guess the only way to solve this performance issue is to refactor our whole code base to take --default-stream per-thread into account, which will take some development days.

System and OS:
Windows 10 Enterprise 2016 LTSB Version 1607 OS Build 14393.3686
3 Quadro RTX 4000s, driver version 419.67 (WDDM on all three GPUs)
CUDA 10.1
CPU AMD EPYC 7401P in a Supermicro H11SSL-i
64 GB of DDR4 RAM at 2666 MHz using all 8 memory channels (configured for memory interleaving per die, so Windows sees a single NUMA node)

@dreqeu and @gmaxi17, or anyone facing performance issues with NVENCODE due to internal pre-processing or post-processing CUDA kernels:

In case you still haven’t found a solution to this problem, let me summarize:

Some time ago, NVENCODE started to use pre-processing and/or post-processing CUDA kernels inside the API calls. These kernels were launched on the default stream, which can cause big performance issues.

A solution would be the ability to specify your own CUDA streams for pre- and post-processing.

I found that starting with Video Codec SDK version 9.1, you can do exactly that.

NEW to 9.1 - Encode: CUStream support in NVENC for enhanced parallelism between CUDA pre-processing and NVENC encoding

In the Video Codec SDK examples, you can find how to use this new feature in the file AppEncode/AppEncCuda.cpp.

You need to use NvEncoderCuda::SetIOCudaStreams(NV_ENC_CUSTREAM_PTR inputStream, NV_ENC_CUSTREAM_PTR outputStream) to set the pre- and post-processing CUDA streams. You can use the same stream for both or different ones. Notice that NvEncoderOutputInVidMemCuda inherits from NvEncoderCuda, so this public method is also available from an NvEncoderOutputInVidMemCuda instance. I did not check whether other classes inherit from NvEncoderCuda.

They expect CUstream (the CUDA Driver API type), but as far as I know it’s perfectly fine to cast a cudaStream_t created via the CUDA runtime API.

Hope it helps!

@oamoros0ealf many thanks for sharing your findings with us! This is valuable information, I’ll definitely give it a try asap.

I am not entirely sure, but perhaps the PL2BL kernel is a conversion from pitch-linear to block-linear layout:

According to NvMedia Surface (NVIDIA DRIVE OS SDK Development Guide),

NVM_SURF_ATTR_LAYOUT_PL: Indicates pitch-linear layout, using pitch-linear mapping, in which pixels are assigned incrementing addresses across each successive row of the image. A surface’s pixel addresses are calculated by:
address = offset + pitch * py + pixel * px
Where:
• address is the byte address of a pixel in the surface
• offset is the address of the surface
• pitch is the number of bytes occupied by the pixels in a line
• py is the pixel’s Y coordinate
• pixel is the number of bytes occupied by a pixel
• px is the pixel’s X coordinate
Although pitch addressing is standard across most hardware and software platforms, it limits performance because it has poor memory locality, due to the fact that each line in a surface image generally maps to a different DRAM page.

NVM_SURF_ATTR_LAYOUT_BL: Indicates block-linear layout. This layout is similar to pitch-linear layout, but maps blocks of pixels using pitch-linear addressing, rather than individual pixels. This layout provides spatial locality for addressing, and improves memory access performance. Pixel data layout within a block is an internal detail of the NVIDIA SoC design, and varies across NVIDIA SoC architectures.
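
To make the difference concrete, here is a small C++ sketch of both addressing schemes. pitchLinearAddress follows the formula quoted above. tiledAddress is only my assumption of what a generic tiled layout could look like (row-major blocks, row-major pixels inside each block, image width padded to whole blocks); it is NOT NVIDIA's actual block-linear format, which the documentation says is an internal detail. All function and parameter names are made up for illustration.

```cpp
#include <cstddef>

// Pitch-linear layout, using the formula quoted above:
//   address = offset + pitch * py + pixel * px
constexpr std::size_t pitchLinearAddress(std::size_t offset, std::size_t pitch,
                                         std::size_t bytesPerPixel,
                                         std::size_t px, std::size_t py) {
    return offset + pitch * py + bytesPerPixel * px;
}

// HYPOTHETICAL helper: index of the block that contains pixel (px, py),
// with blocks stored in row-major order across the image.
constexpr std::size_t blockIndex(std::size_t widthPx,
                                 std::size_t blockW, std::size_t blockH,
                                 std::size_t px, std::size_t py) {
    return (py / blockH) * ((widthPx + blockW - 1) / blockW) + (px / blockW);
}

// HYPOTHETICAL generic tiled layout (not the real block-linear format):
// all pixels of one blockW x blockH block are stored contiguously,
// row by row inside the block.
constexpr std::size_t tiledAddress(std::size_t offset, std::size_t widthPx,
                                   std::size_t bytesPerPixel,
                                   std::size_t blockW, std::size_t blockH,
                                   std::size_t px, std::size_t py) {
    return offset
         + (blockIndex(widthPx, blockW, blockH, px, py) * blockW * blockH
            + (py % blockH) * blockW + (px % blockW))
           * bytesPerPixel;
}
```

For example, on a 1920-pixel-wide surface with 1 byte per pixel and an assumed pitch of 2048 bytes, pixels (0, 0) and (0, 1) are 2048 bytes apart in pitch-linear layout, while in this tiled sketch with 16x16 blocks they are only 16 bytes apart because they fall in the same block. That is the kind of spatial locality the quoted text refers to.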

Maybe someone from NVIDIA can confirm this?