Cache misses in CUDA code that runs in multiple processes

I have a program that transcodes a 1280x1024 video using FFmpeg with NVENC (H.264 -> H.264). The video comes from a live MPEG-TS stream over UDP.

Before I send the decoded frames to the encoder I want to do some custom image processing on each frame using CUDA.
I currently have CUDA code that runs fast enough, i.e. the latency is close enough to real time that it is not visible to the naked eye. The code modifies every byte of the frame, changing the values of the pixels in the image, before sending the frames to be encoded by NVENC.

The latency achieved by the CUDA code is acceptable when only one instance of the program is running. Our product transcodes multiple video streams simultaneously, so there are multiple encoding processes running at once. Once I increase the number of processes (meaning the number of videos being worked on), the latency gets much worse.
For example, one process has a run time of around 10ms per frame, while 30 processes have a run time of around 150ms per frame each. This causes latency and buffering of frames, which eventually leads to data loss.

I did some profiling using Nsight and saw that warps are being stalled for around 100 cycles waiting for a scoreboard dependency on an L1TEX operation. I figured that because I have a lot of processes running at the same time, those cache misses are inevitable.

Before I try to further optimize my code I wanted to ask two questions:

  1. Is there a way to control the order in which warps execute, so that warps that use the same data run sequentially on the same SM?

  2. And the more important question - is it reasonable to say that the GPU I’m using just isn’t capable of running that many CUDA processes that handle this amount of data with acceptable latency? If so, what would be the upper limit on the number of processes? Of course I understand that this is difficult to answer without the implementation, but I would like to know some rough numbers assuming we have an optimized implementation.

Also worth mentioning:
I’m using a Tesla T4 GPU.
I have tried to implement this both with the Thrust library and with native CUDA code, and got similar results.

For the project I’m working on it is fundamental to write my own code, so I’m looking for a way without optimized image processing libraries (like OpenCV).

Thanks a lot in advance!

You have some control over this as a CUDA programmer. You can be guaranteed that warps belonging to the same threadblock will run on the same SM. So without any other considerations, this entails making sure that the warps you’d like to behave this way belong to the same threadblock.
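
As a rough illustration of that point, here is a minimal sketch, not your actual kernel. It assumes a single 8-bit plane of 1280x1024 and uses a made-up operation (`verticalBox3`, a vertical 3-tap average) purely because it makes neighbouring rows share data. The tile sizes `TILE_W` and `TILE_H` and the helper `runTileDemo` are hypothetical names chosen for the example. Each threadblock is 32x8 threads, so every tile row is one warp, and the eight warps that read overlapping rows live in the same block, run on the same SM, and exchange data through shared memory.

```cuda
#include <algorithm>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE_W = 32;  // assumption: one warp covers one tile row
constexpr int TILE_H = 8;   // assumption: eight warps per threadblock

// Hypothetical operation: vertical 3-tap average on an 8-bit plane.
// Warps of the same block share rows through the shared-memory tile.
__global__ void verticalBox3(const unsigned char* in, unsigned char* out,
                             int width, int height)
{
    __shared__ unsigned char tile[TILE_H + 2][TILE_W];  // +2 halo rows

    const int x  = blockIdx.x * TILE_W + threadIdx.x;
    const int y  = blockIdx.y * TILE_H + threadIdx.y;
    const int xc = min(x, width - 1);  // clamp so every thread loads safely

    // Each thread loads its own pixel (row index clamped to the image).
    tile[threadIdx.y + 1][threadIdx.x] = in[min(y, height - 1) * width + xc];
    // The first and last warps of the block also load the halo rows.
    if (threadIdx.y == 0)
        tile[0][threadIdx.x] = in[max(y - 1, 0) * width + xc];
    if (threadIdx.y == TILE_H - 1)
        tile[TILE_H + 1][threadIdx.x] = in[min(y + 1, height - 1) * width + xc];
    __syncthreads();  // all warps of the block see the full tile after this

    if (x < width && y < height) {
        const int sum = tile[threadIdx.y][threadIdx.x]
                      + tile[threadIdx.y + 1][threadIdx.x]
                      + tile[threadIdx.y + 2][threadIdx.x];
        out[y * width + x] = static_cast<unsigned char>(sum / 3);
    }
}

// Runs the kernel on a synthetic frame and compares against a CPU reference.
bool runTileDemo()
{
    const int width = 1280, height = 1024;  // frame size from the question
    const size_t n = static_cast<size_t>(width) * height;
    std::vector<unsigned char> hIn(n), hOut(n);
    for (size_t i = 0; i < n; ++i)
        hIn[i] = static_cast<unsigned char>((i * 31u) & 0xFFu);

    unsigned char *dIn = nullptr, *dOut = nullptr;
    if (cudaMalloc(&dIn, n) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, n) != cudaSuccess) { cudaFree(dIn); return false; }
    cudaMemcpy(dIn, hIn.data(), n, cudaMemcpyHostToDevice);

    const dim3 block(TILE_W, TILE_H);
    const dim3 grid((width + TILE_W - 1) / TILE_W,
                    (height + TILE_H - 1) / TILE_H);
    verticalBox3<<<grid, block>>>(dIn, dOut, width, height);
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(hOut.data(), dOut, n, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return false;

    for (int y = 0; y < height; ++y) {
        const int up = std::max(y - 1, 0), down = std::min(y + 1, height - 1);
        for (int x = 0; x < width; ++x) {
            const int sum = hIn[up * width + x] + hIn[y * width + x]
                          + hIn[down * width + x];
            if (hOut[y * width + x] != static_cast<unsigned char>(sum / 3))
                return false;
        }
    }
    return true;
}

int main()
{
    std::printf("%s\n", runTileDemo() ? "GPU result matches CPU reference"
                                      : "mismatch or CUDA error");
    return 0;
}
```

The only design choice that matters here is the mapping of data to blocks: pixels that are read together are assigned to threads of one block. Which block lands on which SM, and in what order blocks are scheduled, is still decided by the hardware.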

There is not enough information here for me to make any comments on your second question.