Video Codec SDK: CopyToDeviceFrame data rate only 1.8 GB/s


I compiled the samples from Video_Codec_SDK_11.1.5 and ran the AppEncCuda example on 3280x1460 NV12 image frames, generated using:

ffmpeg -f lavfi -i testsrc2=s=3280x1460,format=nv12 -vframes 1000 sample.yuv

I timed the CopyToDeviceFrame call in the function EncodeCuda and calculated the data rate of copying a frame from the CPU to the GPU. On a 3090 GPU the data rate is only 1.8 GB/s (it took about 4000 us to copy a frame of size 6.8 MB), whereas EncodeFrame, which encodes and copies the data back, took only about 50 us.

I am surprised at how slow copying data to the GPU is. Is there a way to speed up this operation? We are trying to encode 4K camera frames at 210 frames/s, and right now the bottleneck seems to be the data transfer from the CPU to the GPU.

It turns out I have been timing things wrong. It actually only takes 250 us to transfer the data from the CPU to the GPU, which corresponds to a bandwidth of about 21 GB/s. The bulk of the time is actually spent in the encoding itself, which makes more sense. So I think this problem is resolved.