Scale NV12 images using NPP or other GPU methods

I have built an encoder based on the example code ‘NvEncoderCuda’. Before encoding with NVENC, the NV12 image is copied from system memory to CUDA memory using the method ‘CopyToDeviceFrame’.

I would like to re-scale the image while it is in GPU memory, before passing it to the encoder. I have not found any examples of how this could be done.

Is there any example code showing how to use the GPU for resizing before encoding with NVENC?

Thanks for any help.

/Anders.

If you wish to use NPP, you could use a YUV420-to-RGB conversion, followed by a resize, followed by an RGB-to-YUV420 conversion. All three operations are covered in the NPP documentation (NVIDIA 2D Image and Signal Processing Performance Primitives).

There may be additional steps. For example, NV12 normally stores UV in an interleaved format rather than planar, while the operations above expect planar storage. I’m not suggesting this is a complete recipe, and I don’t have tested sample code, but a rough sketch follows below.

[url]https://docs.microsoft.com/en-us/windows-hardware/drivers/display/4-2-0-video-pixel-formats[/url]
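To make that concrete, here is an untested sketch of what the call sequence could look like. It uses nppiNV12ToRGB_8u_P2C3R, which accepts the interleaved UV plane directly, so deinterleaving is only needed on the way back; the wrapper name and buffer handling are just for illustration, and the resize call uses the newer NPP signature, so check the documentation for your CUDA version:

#include <nppi.h>

// Untested sketch: scale an NV12 frame by converting to packed RGB,
// resizing, and converting back to planar YUV420. All buffers are
// caller-allocated device memory; error checking trimmed to early returns.
NppStatus ResizeNv12ViaRgb(const Npp8u* srcY, const Npp8u* srcUV, int srcPitch,
                           int srcW, int srcH,
                           Npp8u* rgbSrc, int rgbSrcPitch,       // srcW x srcH, 3 channels
                           Npp8u* rgbDst, int rgbDstPitch,       // dstW x dstH, 3 channels
                           Npp8u* dstYuv[3], int dstYuvPitch[3], // planar 4:2:0 output
                           int dstW, int dstH)
{
    const Npp8u* pSrc[2] = { srcY, srcUV };
    NppiSize srcSize = { srcW, srcH };
    NppiSize dstSize = { dstW, dstH };
    NppiRect srcRect = { 0, 0, srcW, srcH };
    NppiRect dstRect = { 0, 0, dstW, dstH };

    // 1. NV12 (interleaved UV) -> packed RGB at the source resolution.
    NppStatus st = nppiNV12ToRGB_8u_P2C3R(pSrc, srcPitch, rgbSrc, rgbSrcPitch, srcSize);
    if (st != NPP_SUCCESS) return st;

    // 2. Resize the packed RGB image.
    st = nppiResize_8u_C3R(rgbSrc, rgbSrcPitch, srcSize, srcRect,
                           rgbDst, rgbDstPitch, dstSize, dstRect,
                           NPPI_INTER_LINEAR);
    if (st != NPP_SUCCESS) return st;

    // 3. Packed RGB -> planar YUV420. To feed NVENC as NV12 you still need
    //    to interleave the U and V planes afterwards (a small kernel or
    //    strided 2D copies will do it).
    return nppiRGBToYUV420_8u_C3P3R(rgbDst, rgbDstPitch, dstYuv, dstYuvPitch, dstSize);
}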

Note that you may already have planar data available to you:

“Why does NVENC sample use both cuMemcpyHtoD and cuMemcpy2D to copy YUV data?” (Stack Overflow)

Thank you for your quick and detailed response.

So in my case I retrieve the live video from an SDI card. Professional SDI video is formatted as packed UYVY 4:2:2, and it seems NVENC cannot accept 4:2:2 UYVY data, so I need to re-format the data before handing it over. My thinking was: since I have to move the data from the CPU to the GPU anyway, I will have all the pixels ‘in my hand’ on the CPU, and tightly written CPU code for this is memory-bound, so I would be waiting on memory accesses regardless and should see little extra penalty from doing the processing on the CPU. And to get moving I needed to use what I master today, so I implemented an AVX2 scaler and a packed UYVY 4:2:2 to NV12 4:2:0 semi-planar converter.
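For reference, a plain scalar version of that repack (1:1 size; the names are just for illustration, and width and height are assumed even) looks something like this:

#include <cstdint>
#include <cstddef>

// Illustrative scalar reference: repack packed UYVY 4:2:2 into NV12 4:2:0.
// Chroma is vertically subsampled by averaging each pair of source rows.
void UyvyToNv12(const uint8_t* uyvy, size_t uyvyPitch,
                uint8_t* dstY, size_t yPitch,
                uint8_t* dstUV, size_t uvPitch,
                int width, int height)
{
    for (int row = 0; row < height; ++row) {
        const uint8_t* src = uyvy + row * uyvyPitch;
        uint8_t* y = dstY + row * yPitch;
        // UYVY stores U0 Y0 V0 Y1 for each pair of pixels, so luma sits
        // at every odd byte.
        for (int x = 0; x < width; ++x)
            y[x] = src[2 * x + 1];
    }
    for (int row = 0; row < height; row += 2) {
        const uint8_t* src0 = uyvy + row * uyvyPitch;       // even source row
        const uint8_t* src1 = uyvy + (row + 1) * uyvyPitch; // odd source row
        uint8_t* uv = dstUV + (row / 2) * uvPitch;
        for (int x = 0; x < width; x += 2) {
            // Average the chroma of two consecutive rows (4:2:2 -> 4:2:0);
            // NV12 keeps U and V interleaved in a single half-height plane.
            uv[x]     = (uint8_t)((src0[2 * x]     + src1[2 * x]     + 1) / 2); // U
            uv[x + 1] = (uint8_t)((src0[2 * x + 2] + src1[2 * x + 2] + 1) / 2); // V
        }
    }
}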

Basically it shuffles the data around, scaling to 1:1, 1:2 and 1:4 sizes (to keep it simple for now) … like this for Y, and the same for UV →

// Y samples
// Vertical 2:1 average: rows 1+2 and rows 3+4.
ySamplesLine1 = _mm256_avg_epu8(ySamplesLine1, ySamplesLine2);
ySamplesLine2 = _mm256_avg_epu8(ySamplesLine3, ySamplesLine4);
// Shift each 128-bit lane right one byte so every pixel lines up with its
// right-hand neighbour (_mm256_srli_si256 shifts within lanes, not across
// the full 256-bit register).
ySamplesLine3 = _mm256_srli_si256(ySamplesLine1, 1);
ySamplesLine4 = _mm256_srli_si256(ySamplesLine2, 1);
// Horizontal 2:1 average; every other byte now holds a downscaled pixel.
ySamplesLine1 = _mm256_avg_epu8(ySamplesLine1, ySamplesLine3);
ySamplesLine2 = _mm256_avg_epu8(ySamplesLine2, ySamplesLine4);
// Compact the surviving even bytes within each lane. It’s not UV data; the
// mask just happens to look the same as the one used when filtering UV.
ySamplesLine1HiHalf = _mm256_shuffle_epi8(ySamplesLine1, shuffleMaskUV);
// Gather qwords 0 and 2 (the packed bytes of each lane) into the low 128 bits.
ySamplesLine1HiHalf = _mm256_permute4x64_epi64(ySamplesLine1HiHalf, 0b10001000);
ySamplesLine2HiHalf = _mm256_shuffle_epi8(ySamplesLine2, shuffleMaskUV);
ySamplesLine2HiHalf = _mm256_permute4x64_epi64(ySamplesLine2HiHalf, 0b10001000);
// Combine with the low halves computed earlier (not shown) to fill a whole
// output register.
ySamplesLine1 = _mm256_permute2x128_si256(ySamplesLine1LowHalf, ySamplesLine1HiHalf, 0b00100000);
ySamplesLine2 = _mm256_permute2x128_si256(ySamplesLine2LowHalf, ySamplesLine2HiHalf, 0b00100000);

However, the GPU is the way to go in the long run, to get better scaling and possibly other image filters. Any pointers on where I could learn how to intercept the image inside the GPU using CUDA before handing it over to NVENC would be really appreciated.
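To show what I am after, here is roughly how I picture the hook in the NvEncoderCuda flow (member names are taken from the SDK samples and may differ between versions; ScaleNv12OnGpu is a placeholder for whatever GPU scaler ends up there):

#include <cuda.h>
#include <cstdint>
#include <cstddef>
#include <vector>
#include "NvEncoderCuda.h"  // from the Video Codec SDK samples

// Placeholder for the actual GPU scaler (an NPP sequence like the one
// suggested above, or a custom kernel).
void ScaleNv12OnGpu(CUdeviceptr src, size_t srcPitch, int srcW, int srcH,
                    CUdeviceptr dst, size_t dstPitch, int dstW, int dstH);

// Rough per-frame flow, sketched from the SDK sample classes; error
// checking omitted. fullResFrame is a staging buffer allocated once,
// e.g. with cuMemAllocPitch.
void UploadScaleEncode(NvEncoderCuda* pEnc, const uint8_t* pHostNv12,
                       int srcW, int srcH,
                       CUdeviceptr fullResFrame, size_t fullResPitch,
                       int dstW, int dstH,
                       std::vector<std::vector<uint8_t>>& vPacket)
{
    // 1. Copy the full-resolution NV12 frame host -> device into our own
    //    staging buffer instead of the encoder's input frame.
    CUDA_MEMCPY2D m = {};
    m.srcMemoryType = CU_MEMORYTYPE_HOST;
    m.srcHost       = pHostNv12;
    m.srcPitch      = (size_t)srcW;
    m.dstMemoryType = CU_MEMORYTYPE_DEVICE;
    m.dstDevice     = fullResFrame;
    m.dstPitch      = fullResPitch;
    m.WidthInBytes  = (size_t)srcW;
    m.Height        = (size_t)srcH * 3 / 2;  // Y plane plus interleaved UV plane
    cuMemcpy2D(&m);

    // 2. Scale on the GPU, writing straight into the encoder's input frame,
    //    which already lives in device memory.
    const NvEncInputFrame* encIn = pEnc->GetNextInputFrame();
    ScaleNv12OnGpu(fullResFrame, fullResPitch, srcW, srcH,
                   (CUdeviceptr)encIn->inputPtr, encIn->pitch, dstW, dstH);

    // 3. Encode as usual.
    pEnc->EncodeFrame(vPacket);
}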

Once again thanks for helping.

/Anders.