Processing camera samples from RTSP to Redis with ffmpeg

We are currently working on capturing images from cameras using an RTSP stream and storing them in a Redis database.
With up to 6 4MP cameras at 10 fps each, everything runs smoothly and we get 60 fps into our Redis database. But when we add cameras to reach 9, the total does not scale: it stays at 60 fps when we expected 90 fps (10 fps per camera).
Our network settings are OK: we can do the same task with CPU decoding (cv2.VideoCapture) for up to 15 cameras and it works fine, but it consumes a lot of CPU, and we think we could manage our resources better by doing this on the GPU.
Our camera settings are:
h.265 codec
2688x1520
10 fps
6144 bit rate
We run this on 1 RTX 3090.
We use compiled FFmpeg and hevc_cuvid. Here’s the code we are running:

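import subprocess as sp

import numpy

# Excerpt from the capture method; config, bufsize, image_x and image_y are
# defined elsewhere. bufsize is assumed to equal image_x * image_y * 3,
# i.e. exactly one bgr24 frame.
pipe = sp.Popen(
    [
        "ffmpeg",
        "-y",
        "-loglevel",
        "error",
        "-vsync",
        "0",
        "-c:v",
        "hevc_cuvid",
        "-rtsp_transport",
        "tcp",
        "-i",
        config["device_url"],
        "-preset",
        "superfast",
        "-pix_fmt",
        "bgr24",
        "-f",
        "rawvideo",
        "-",
    ],
    stdout=sp.PIPE,
    bufsize=bufsize,
)
while True:
    # Blocks until one full frame has arrived; returns fewer bytes only at EOF.
    pipe_content = pipe.stdout.read(bufsize)
    if len(pipe_content) != bufsize:
        break  # ffmpeg exited or the stream ended mid-frame
    self.store_frame(
        str(config["id"]),
        numpy.frombuffer(pipe_content, dtype="uint8").reshape(
            (image_y, image_x, 3)
        ),
    )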

Hi Franco,
There are some reasons you might be seeing lower performance than expected. I’ll address the three that are most probable.

  1. Number of NVDEC chips: a 3090, being a consumer-class card, has only 1 NVDEC chip capable of 1 decode session, as opposed to something like an A16, which can host 8 concurrent sessions using 4 chips capable of 2 sessions each.

  2. Performance penalty of copying data between GPU memory and system memory via the PCIe interface: if you add the options -hwaccel cuda -hwaccel_output_format cuda to your ffmpeg pipeline, e.g.:
    ffmpeg -hwaccel cuda -hwaccel_output_format cuda -c:v hevc_cuvid -i output.mp4 -pix_fmt bgr24 -benchmark -f null -
    the decoded frames are kept in GPU memory. Without those options, the decoded raw frames are copied back to system memory via the PCIe bus. Since you are streaming data in, you could be saturating the PCIe bandwidth by copying the stream to the GPU, decoding it, and then sending the frames back out to be read via numpy.

  3. NumPy: since NumPy arrays live in system memory, using something like CuPy, which can access the data in GPU memory directly, would reduce the copying from GPU memory to system memory.
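
To put point 2 in perspective, here is a rough sketch of how much raw frame data your setup produces once the stream is decoded to bgr24. It uses only the resolution and frame rate from your post; the function names are made up for illustration, and the result is an estimate of decoded-frame traffic only (it ignores the compressed input stream and any protocol overhead).

```python
# Estimate of decoded bgr24 traffic for the cameras described in the question.
WIDTH, HEIGHT = 2688, 1520   # camera resolution from the question
BYTES_PER_PIXEL = 3          # bgr24: one byte each for B, G and R
FPS = 10                     # per-camera frame rate from the question


def frame_bytes(width=WIDTH, height=HEIGHT, bpp=BYTES_PER_PIXEL):
    """Size in bytes of one decoded frame."""
    return width * height * bpp


def raw_bytes_per_second(cameras, fps=FPS):
    """Total decoded bytes per second across all cameras."""
    return cameras * fps * frame_bytes()


if __name__ == "__main__":
    for cameras in (6, 9, 15):
        mb_per_s = raw_bytes_per_second(cameras) / 1e6
        print(f"{cameras} cameras: {mb_per_s:.0f} MB/s of bgr24 frames")
```

Each frame is about 12.3 MB, so 9 cameras come to roughly 1.1 GB/s of decoded data. The same volume also has to pass through the ffmpeg stdout pipes and into Redis, so it is worth comparing this figure against the measured throughput of each of those stages as well as the PCIe link.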

For future reference, our technical blog on GPU-accelerated transcoding and the technical documentation are great resources:
https://developer.nvidia.com/blog/nvidia-ffmpeg-transcoding-guide/
https://docs.nvidia.com/video-technologies/video-codec-sdk/ffmpeg-with-nvidia-gpu/