I have built an encoder based on the example code ‘NvEncoderCuda’. Before encoding the image using NVENC, the NV12 image is copied from system memory to CUDA memory using the method ‘CopyToDeviceFrame’.
I would like to re-scale the image while it is in GPU memory, before passing it to the encoder. I have not found any examples of how this could be done.
Is there any example code showing how to use the GPU for resizing before encoding with NVENC?
Thanks for any help.
/Anders.
Thank you for your quick and detailed response.
So in my case I get the live video from an SDI card. Professional SDI video is formatted as packed UYVY 4:2:2, and it seems NVENC cannot accept 4:2:2 UYVY data, so I need to re-format the data before handing it over. Since I have to move the data from the CPU to the GPU anyway, I will have all the pixels ‘in my hand’ on the CPU. That should mean that tight CPU code would be waiting on memory accesses anyway, so I would see no performance penalty even when processing on the CPU. And to get moving I need to use what I master today, so I implemented an AVX2 scaler and a packed UYVY 4:2:2 to NV12 4:2:0 semi-planar converter.
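For readers unfamiliar with the two layouts, the repacking can be sketched in plain scalar C++. This is only a reference sketch, not the AVX2 implementation: the names `Nv12Image` and `uyvyToNv12` are hypothetical, and it assumes even dimensions, tightly packed rows (no stride padding), and chroma averaged over each pair of lines to get from 4:2:2 to 4:2:0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for an NV12 frame (names are illustrative only).
struct Nv12Image {
    std::vector<std::uint8_t> y;   // width * height luma samples
    std::vector<std::uint8_t> uv;  // width * (height / 2) bytes, interleaved U,V
};

// Converts packed UYVY 4:2:2 (byte order U0 Y0 V0 Y1) to NV12 4:2:0.
// Assumption: rows are tightly packed, width and height are even.
Nv12Image uyvyToNv12(const std::uint8_t* src, std::size_t width, std::size_t height) {
    assert(width % 2 == 0 && height % 2 == 0);
    Nv12Image out;
    out.y.resize(width * height);
    out.uv.resize(width * (height / 2));
    const std::size_t srcStride = width * 2;  // UYVY uses 2 bytes per pixel

    // Luma: every odd byte of the packed stream.
    for (std::size_t row = 0; row < height; ++row) {
        const std::uint8_t* line = src + row * srcStride;
        for (std::size_t x = 0; x < width; ++x) {
            out.y[row * width + x] = line[2 * x + 1];
        }
    }

    // Chroma: average each pair of lines, keep U and V interleaved.
    for (std::size_t row = 0; row < height; row += 2) {
        const std::uint8_t* top = src + row * srcStride;
        const std::uint8_t* bottom = top + srcStride;
        std::uint8_t* uvLine = out.uv.data() + (row / 2) * width;
        for (std::size_t x = 0; x < width; x += 2) {
            uvLine[x] = static_cast<std::uint8_t>(
                (top[2 * x] + bottom[2 * x] + 1) >> 1);          // U
            uvLine[x + 1] = static_cast<std::uint8_t>(
                (top[2 * x + 2] + bottom[2 * x + 2] + 1) >> 1);  // V
        }
    }
    return out;
}
```

A real capture buffer may have row padding or a different chroma siting, in which case the stride and the averaging step would need adjusting.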
The code basically shuffles the data around and scales to 1:1, 1:2 and 1:4 sizes (to keep it simple for now), like this for Y and the same for UV →
// Y samples: average four input lines down to two (vertical 2:1).
ySamplesLine1 = _mm256_avg_epu8(ySamplesLine1, ySamplesLine2);
ySamplesLine2 = _mm256_avg_epu8(ySamplesLine3, ySamplesLine4);
// Shift right by one byte within each 128-bit lane, then average neighbours (horizontal 2:1).
ySamplesLine3 = _mm256_srli_si256(ySamplesLine1, 1);
ySamplesLine4 = _mm256_srli_si256(ySamplesLine2, 1);
ySamplesLine1 = _mm256_avg_epu8(ySamplesLine1, ySamplesLine3);
ySamplesLine2 = _mm256_avg_epu8(ySamplesLine2, ySamplesLine4);
// Gather the averaged samples (shuffleMaskUV is defined elsewhere).
// It's not UV data; the mask just happens to look the same as when filtering.
ySamplesLine1HiHalf = _mm256_shuffle_epi8(ySamplesLine1, shuffleMaskUV);
ySamplesLine1HiHalf = _mm256_permute4x64_epi64(ySamplesLine1HiHalf, 0b10001000);
ySamplesLine2HiHalf = _mm256_shuffle_epi8(ySamplesLine2, shuffleMaskUV);
ySamplesLine2HiHalf = _mm256_permute4x64_epi64(ySamplesLine2HiHalf, 0b10001000);
// Combine the low 128 bits of each half (the ...LowHalf values are not shown in this excerpt).
ySamplesLine1 = _mm256_permute2x128_si256(ySamplesLine1LowHalf, ySamplesLine1HiHalf, 0b00100000);
ySamplesLine2 = _mm256_permute2x128_si256(ySamplesLine2LowHalf, ySamplesLine2HiHalf, 0b00100000);
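A scalar reference can make the arithmetic of the snippet above easier to check. The following is a minimal sketch, assuming the intent is a 2x2 box average per output sample: it averages vertically first and then horizontally, using the same round-up rule as `_mm256_avg_epu8`, which computes `(a + b + 1) >> 1`. The function names `avgRoundUp` and `downscale2x` are hypothetical, and the plane is assumed to be tightly packed with even dimensions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Same rounding as one lane of _mm256_avg_epu8: (a + b + 1) >> 1.
inline std::uint8_t avgRoundUp(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>((a + b + 1) >> 1);
}

// Hypothetical scalar reference: halves a single 8-bit plane in both
// directions by averaging vertically, then horizontally.
std::vector<std::uint8_t> downscale2x(const std::vector<std::uint8_t>& src,
                                      std::size_t width, std::size_t height) {
    assert(width % 2 == 0 && height % 2 == 0);
    assert(src.size() == width * height);
    const std::size_t outWidth = width / 2;
    const std::size_t outHeight = height / 2;
    std::vector<std::uint8_t> dst(outWidth * outHeight);

    for (std::size_t y = 0; y < outHeight; ++y) {
        for (std::size_t x = 0; x < outWidth; ++x) {
            const std::uint8_t* top = &src[(2 * y) * width + 2 * x];
            const std::uint8_t* bottom = top + width;
            // Vertical pass: one average per source column.
            const std::uint8_t left = avgRoundUp(top[0], bottom[0]);
            const std::uint8_t right = avgRoundUp(top[1], bottom[1]);
            // Horizontal pass: average the two column results.
            dst[y * outWidth + x] = avgRoundUp(left, right);
        }
    }
    return dst;
}
```

Because the rounding happens twice, the result can be one higher than the exact mean `(a + b + c + d + 2) >> 2`; for example, the block `0 1 / 0 0` yields 1 here but 0 with the exact formula. A sketch like this is mainly useful as a test oracle for the SIMD path.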
However, the GPU is the way to go in the long run, to perform better scaling and possibly other image filters. Any pointers on where I could learn to intercept the image on the GPU using CUDA before handing it over to NVENC would be really appreciated.
Once again thanks for helping.
/Anders.