Transcoding multiple input videos through a single fixed-size memory buffer

Hello. I’m interested in server-side video transcoding for conferences. It requires mixing video from multiple sources (H.264, VP8, and VP9), with resizing and rescaling, into one video whose layout can change during the mixing timeline (for example, a grid with fixed-size cells, or a grid that maximizes the area given to video).

As I understand it, it’s possible to run multiple video decoders (H.264/VP8) to YUV and do the resizing and rescaling on the GPU. But to mix them into one picture, I need to copy each GPU buffer to a CPU memory buffer (by pointer), which is an expensive operation, do the mixing in CPU memory, and then copy the mixed YUV image back to the GPU and encode it at a fixed size/scale to H.264, VP8, or VP9.

Is it possible to keep the decoded frames from N encoded streams in GPU memory, then resize, rescale, and position each video in a single fixed-size GPU mixing buffer that is encoded into a single-scene output video file? Videos can start and finish unpredictably.
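
To make the positioning part concrete, here is a rough Python sketch of the layout math I have in mind. It only computes where each video would land in the fixed-size output frame and does not touch the GPU. The names `grid_layout` and `fit_in_cell` are hypothetical, and rounding to even coordinates is my assumption for 4:2:0 YUV frames; none of this comes from any SDK.

```python
import math

def grid_layout(n, out_w, out_h):
    """Return n cells (x, y, w, h) of a near-square grid inside an out_w x out_h frame."""
    if n <= 0:
        return []
    cols = math.isqrt(n - 1) + 1      # ceil(sqrt(n)) without float rounding
    rows = math.ceil(n / cols)
    # Assumption: 4:2:0 chroma subsampling, so keep sizes even.
    cell_w = (out_w // cols) & ~1
    cell_h = (out_h // rows) & ~1
    return [((i % cols) * cell_w, (i // cols) * cell_h, cell_w, cell_h)
            for i in range(n)]

def fit_in_cell(src_w, src_h, cell):
    """Scale a source to fit inside a cell, keeping its aspect ratio, centered."""
    x, y, w, h = cell
    scale = min(w / src_w, h / src_h)
    dst_w = max(2, int(src_w * scale) & ~1)
    dst_h = max(2, int(src_h * scale) & ~1)
    off_x = ((w - dst_w) // 2) & ~1
    off_y = ((h - dst_h) // 2) & ~1
    return (x + off_x, y + off_y, dst_w, dst_h)

# Example: four participants mixed into a 1280x720 output frame.
cells = grid_layout(4, 1280, 720)
placed = [fit_in_cell(640, 480, c) for c in cells]
```

When a video starts or finishes, I would call `grid_layout` again with the new participant count and use the new rectangles from the next output frame onward. The part I am asking about is doing the actual resize and copy into those rectangles without leaving GPU memory.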