cuvidCreateDecoder returns error CUDA_ERROR_OUT_OF_MEMORY

Hi everybody.
I am using hardware decoding on A100 and H100 GPUs and trying to use the full capacity of the NVDEC engines.
The H100 has 7 NVDEC engines, and each of them decodes FullHD H.264 at a rate of at least 903 FPS.

(This figure is for the Ada architecture; for some reason I could not find one for Hopper. Where is this information published?)

Thus, the total decoding performance is 903 * 7 = 6321 FPS. To decode video streams at 25 FPS, I need to create 6321 FPS / 25 FPS = 252 decoders (rounded down). But I can’t do this: I only manage to create 245-247 decoders, and the next call to cuvidCreateDecoder fails with the error CUDA_ERROR_OUT_OF_MEMORY.

How can I create even more decoders if my video streams are at 15 FPS? In that case I need to create 6321 FPS / 15 FPS = 421 decoders (rounded down). Does this mean that close to half of the NVDEC capacity will sit idle?
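
The arithmetic above can be written as a small helper. This is only a sketch of the calculation: `decodersNeeded` is a hypothetical name, not part of the Video Codec SDK, and the 903 FPS figure is the Ada number quoted above, assumed to hold per engine on the H100 as well.

```cpp
// Hypothetical helper (not an SDK function): how many streams running at
// streamFps fit into the combined throughput of all NVDEC engines.
// Integer division rounds down, since a fractional decoder is of no use.
constexpr int decodersNeeded(int fpsPerEngine, int engines, int streamFps)
{
    return (fpsPerEngine * engines) / streamFps;
}

// Assumed inputs: 903 FPS per engine (Ada figure), 7 engines on the H100.
static_assert(903 * 7 == 6321, "aggregate throughput in FPS");
static_assert(decodersNeeded(903, 7, 25) == 252, "25 FPS streams");
static_assert(decodersNeeded(903, 7, 15) == 421, "15 FPS streams");
```

The `static_assert` lines check the numbers at compile time, so the block builds only if the figures in the post are consistent.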

To solve my problem, I read a lot of topics:

but they are all without a solution.

One answer says that the number of decoders that can be created is limited by system resources. Please tell me, which resources are these? It is also claimed that NVML can provide information that tells me how many decoders I can create on a particular GPU. What information is this?

I’ve tried using contexts in different ways: one context for all decoders, one context per decoder, and 4 contexts with 70 decoders distributed among them. None of this solved my problem.

The only way I was able to increase the number of decoders was to create them from 4 separate processes. In total, I managed to create 287-290 decoders that way.

This problem actually exists for any GPU.

To create this many decoders, I used NvDecoder.cpp/NvDecoder.h from the Video Codec SDK release examples.
I modified them a bit to reach this number: all decoders must share a global mutex and hold it when calling cuvidCreateVideoParser and cuvidParseVideoData. Without this fix, even fewer decoders can be created, about 160.
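
A global lock of this kind could be sketched as follows. This is an illustration of the idea, not the exact change made to NvDecoder.cpp: `parserMutex` and `withParserLock` are hypothetical names, and the cuvid calls appear only in comments so that the block compiles without the SDK headers.

```cpp
#include <mutex>
#include <utility>

// One mutex shared by every decoder instance in the process.
inline std::mutex &parserMutex()
{
    static std::mutex m; // function-local static, initialized once (thread-safe since C++11)
    return m;
}

// Hypothetical helper: runs fn while holding the global mutex and
// forwards whatever fn returns (including void).
template <typename Fn>
auto withParserLock(Fn &&fn) -> decltype(fn())
{
    std::lock_guard<std::mutex> lock(parserMutex());
    return std::forward<Fn>(fn)();
}

// Assumed usage inside the decoder class (variable names may differ
// between SDK versions):
//   withParserLock([&] { return cuvidCreateVideoParser(&m_hParser, &videoParserParameters); });
//   withParserLock([&] { return cuvidParseVideoData(m_hParser, &packet); });
```

The lock serializes parser creation and parsing across all decoders, so it trades some parallelism on the CPU side for the higher decoder count described above.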

And this is my code for creating the context, threads, and decoders:

void
process(CUcontext ctx, NvDecoder *dec, size_t num)
{
    ck(cuCtxSetCurrent(ctx));
    FFmpegDemuxer demuxer(cfg.file.c_str());

    int nVideoBytes = 0;
    int nFrameReturned = 0;
    int nFrame = 0;
    uint8_t *pVideo = NULL;
    uint8_t *pFrame = NULL;
    do {
        demuxer.Demux(&pVideo, &nVideoBytes);
        nFrameReturned = dec->Decode(pVideo, nVideoBytes);
        if (!nFrame && nFrameReturned)
            LOG(INFO) << dec->GetVideoInfo();
        for (int i = 0; i < nFrameReturned; i++) {
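            pFrame = dec->GetLockedFrame();
            // Host frames in the SDK sample are allocated with new[] (assumption
            // based on NvDecoder.cpp), so they must be released with delete[].
            delete[] pFrame;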
        }
        nFrame += nFrameReturned;
    } while (nVideoBytes);
    while (true)
        std::this_thread::sleep_for(std::chrono::seconds(600));
}

int
main(int argc, char **argv)
{
    ck(cuInit(0));

    try {
        cfg = scfg::init(argc, argv);
    } catch (std::exception &ex) {
        log_a("{}", ex.what());
        return 1;
    }

    CUcontext ctx = NULL;
    createCudaContext(&ctx, cfg.device_number, 0);

    std::vector<std::thread> threads;
    for (size_t i = 0; i < cfg.number; i++) {
        auto *dec = new NvDecoder(ctx, false, cudaVideoCodec_H264);
        threads.emplace_back([&ctx, dec, i] {
            try {
                process(ctx, dec, i);
            } catch (const std::exception &ex) {
                log_e("{}", ex.what());
            }
        });
    }

    for (auto &t : threads)
        t.join();

    return 0;
}