pipe: frame -> encoder -> bitstream -> decoder -> frame
Are you asking about optimizing how the bitstream is passed? The bitstream is usually very small (roughly 100 bytes to 100 kB per step), so keeping it in GPU memory (and adding a new API for that) seems pointless without a valid use case. The only minor use case I can see is SNR/heatmap testing, i.e. comparing encoder/decoder quality across different parameters. See http://on-demand-gtc.gputechconf.com/gtc-quicklink/bscTMOl (S8761/GTC2018).
pipe: bitstream -> decoder -> frame -> encoder -> bitstream
This pipe "frame" (~10M Bytes per step) acceleration is more useful with use case like transcoding (size change, clipping, filters, embedding subtitles, protocol change ...). Frame can be CUDA buffer and transformation can be CUDA programs and this is supported. See "doc/Using_FFmpeg_with_NVIDIA_GPU_Hardware_Acceleration.pdf" and http://on-demand-gtc.gputechconf.com/gtc-quicklink/g1Zlw0A (S8601/GTC2018).
I'll give you more details about the idea: I want to build a datamoshing application for Windows that can glitch video streams on the fly. I need a video buffer A that is encoded on the GPU; then, for each I-frame, I'd like the possibility to replace that frame with an image taken from another video buffer B, and then continue decoding the stream back into an output video buffer. This produces the glitched video effect.
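Roughly, the step I mean looks like this CPU-side sketch using FFmpeg's libavformat ("input.mp4" is just a placeholder): it only walks the demuxed packets and finds the I-frames, which are the splice points where a frame from buffer B would be substituted before decoding resumes. The real version would do this with GPU surfaces instead:

```c
#include <stdio.h>
#include <libavformat/avformat.h>

int main(void) {
    AVFormatContext *fmt = NULL;
    if (avformat_open_input(&fmt, "input.mp4", NULL, NULL) < 0) {
        fprintf(stderr, "could not open input\n");
        return 1;
    }
    if (avformat_find_stream_info(fmt, NULL) < 0) {
        fprintf(stderr, "could not read stream info\n");
        return 1;
    }
    int video = av_find_best_stream(fmt, AVMEDIA_TYPE_VIDEO, -1, -1, NULL, 0);

    AVPacket *pkt = av_packet_alloc();
    long keyframes = 0, total = 0;
    while (av_read_frame(fmt, pkt) >= 0) {
        if (pkt->stream_index == video) {
            total++;
            if (pkt->flags & AV_PKT_FLAG_KEY)
                keyframes++;   /* splice point: here the datamosh would swap in a frame from buffer B */
        }
        av_packet_unref(pkt);
    }
    printf("%ld video packets, %ld keyframes (splice points)\n", total, keyframes);

    av_packet_free(&pkt);
    avformat_close_input(&fmt);
    return 0;
}
```

That only tells me where the glitch points are; the replacement itself would need the encoder/decoder APIs you pointed to.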
Macs have had hardware H.264 encoding support for years, so a Mac solution already exists (kriss.cx). It is very performant, taking only about 30% load on a modern CPU to do this for a 30 fps full-HD buffer. I imagine the whole encoding / frame-replacing / decoding process taking place on GPU video buffers.
I have not started building my prototype yet; I just wanted to confirm that this project is feasible. I'll look into the APIs you suggested; they might be what I need.