Can I start parsing a video at a specific frame (not the beginning) with the codec library?

For example, if there are 1000 frames in a video, the codec will parse the video from frame No. 0 to frame No. 999. What should I do to start parsing the video from frame No. 500 and skip frames No. 0 to No. 499?

I would also like to know this.

You have to locate the reference frame prior to the first frame you want and start injecting data from there (being careful to inject the needed SPS/PPSs), discarding the unneeded frames on the display side. Typically, you would have created an index of the stream, as done for example by the well-known source filters, including my own (see signature). Frame-accurate random access is not a trivial thing, unfortunately.

To elaborate a little on the complexity… It’s not even sufficient just to locate the previous reference frame, because it may be in an open GOP. In that case, if the requested frame is a leading B frame of that GOP, you have to back off to the previous GOP and start injecting from there. To simplify things you can just always back off by the extra GOP, because NVDec decoding is so fast that the overhead is insignificant for reasonably sized GOPs. You’ll still have to discard the unwanted frames on the display side, of course.
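As a rough illustration of the "always back off by an extra GOP" rule, here is a minimal sketch. It assumes you have already built an index as a sorted list of (frame number, byte offset) pairs, one per GOP start. The index layout, the function name and the sample numbers are all hypothetical; none of this is part of the NVDec API.

```python
from bisect import bisect_right

def injection_start(index, target_frame, extra_gop=True):
    """Pick the index entry from which to start feeding the decoder.

    index: sorted list of (frame_number, byte_offset) pairs, one per GOP
           start. This layout is an assumption made for the sketch.
    Returns the (frame_number, byte_offset) entry to start injecting from.
    """
    frames = [frame for frame, _ in index]
    # Last GOP start at or before the requested frame.
    i = bisect_right(frames, target_frame) - 1
    if i < 0:
        raise ValueError("target precedes the first indexed GOP")
    # Back off one more GOP so leading B frames of an open GOP decode correctly.
    if extra_gop and i > 0:
        i -= 1
    return index[i]

# Hypothetical index with irregular GOP lengths.
index = [(0, 0), (250, 1_200_000), (480, 2_300_000), (730, 3_500_000)]
start_frame, start_offset = injection_start(index, 500)
# Frames from start_frame up to 499 are decoded and then discarded on the display side.
frames_to_discard = 500 - start_frame
```

With this index, a request for frame 500 starts injection at the GOP beginning at frame 250 rather than the one at frame 480, at the cost of decoding one extra GOP.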

Thanks for your reply.
But NVDec decoding is not fast enough when the video is 2K or higher, or when decoding many videos at the same time.
Also, what is the difference between the cudaVideoCreate_PreferCUVID and cudaVideoCreate_PreferCUDA flags?
Does cudaVideoCreate_PreferCUDA mean using the codec?

You’re welcome. I’m a bit baffled by your comment about NVDec decoding speed in this context. In my application I am able to decode full 4K UHD video at 186 frames/second, and that includes transferring the decoded frame back to the CPU and then copying it into the Avisynth buffer while converting NV12 to YV12. Now if we assume a reasonable GOP size of 25 frames then we would have a latency on a seek of 25/186 = 0.13 seconds. This small latency is incurred once on a seek, after which decoding proceeds at its normal rate. I cannot see how this small seek latency could be an issue, or how it could be avoided.
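The latency estimate above is just frames-to-decode divided by decode rate. A small sketch of that arithmetic follows, using the 25-frame GOP and 186 frames/second quoted above. The two-GOP case is my own extrapolation for the situation where you always back off by an extra GOP and the target sits at the end of its GOP; the function name is made up for the example.

```python
def seek_latency_seconds(frames_to_decode, decode_fps):
    """Time spent decoding frames that are thrown away during a seek."""
    return frames_to_decode / decode_fps

gop_size = 25      # frames per GOP, as assumed in the post above
decode_fps = 186   # measured 4K UHD decode rate quoted in the post above

one_gop = seek_latency_seconds(gop_size, decode_fps)       # about 0.134 s
two_gops = seek_latency_seconds(2 * gop_size, decode_fps)  # about 0.269 s
print(f"one GOP: {one_gop:.3f} s, two GOPs: {two_gops:.3f} s")
```

Plugging in your own GOP size and measured decode rate gives a quick upper bound on what a seek will cost.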

Regarding cudaVideoCreate_PreferCUVID versus cudaVideoCreate_PreferCUDA, some parts of the decoding process can be executed either in the video processor engine (CUVID) or in CUDA. I have not experienced significant differences between the two. nVidia has not thoroughly documented this feature, as far as I am aware. Of course, you can benchmark it both ways to decide which mode is appropriate for your application.

Do you have some requirements for seeking that you can share with us? There is always going to be some latency on a seek. You will want to have a reasonable GOP size to minimize it.

You should also be aware of the difference between frame-accurate seeking and approximate seeking. For example, if you did not require exactly frame 500 but rather something close to frame 500 then there are heuristic ways to seek a bit faster. Also, approximate seeking does not require an index. Here is a simple way: say you want something around frame 500 out of 1000. Assuming the frame sizes are approximately homogeneous you can go to the halfway point of the file (500/1000) and then parse forward to the next reference frame. Performance-wise the difference is not great, but the elimination of indexing may be important. Of course, knowing the number of frames in the file can be a problem, but it is usually derivable from the container metadata, or for transport streams you can calculate it from the first and last timestamps in the file.

My application has two parts, the indexer and the frame-accurate frame server. In the indexer the video is displayed with a timeline. To approximately seek, you just click somewhere on the timeline. The code determines the ratio of the clicked timeline position to its full length, then goes to the ratio’s position in the file (it’s easy to obtain the file size) and starts parsing forward for a reference frame, at which point it starts decoding. This method eliminates both indexing and needing to know the number of frames in the file. The server component uses frame-accurate seeking, as frame accuracy is sometimes very important for Avisynth/Vapoursynth scripting.
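To make the ratio-based approach concrete, here is a minimal sketch, not the code from the application described above. It assumes a raw H.264 elementary stream in Annex B byte-stream format held in memory, and it assumes the encoder repeats the SPS ahead of every IDR frame, so scanning forward for an SPS NAL unit (type 7) finds a safe point to start injecting. Containers, transport streams and other codecs need their own parsing. The function names are invented for the example.

```python
START_CODE = b"\x00\x00\x01"
NAL_SPS = 7  # H.264 nal_unit_type for a sequence parameter set

def find_next_nal(data, start, nal_types=(NAL_SPS,)):
    """Return the offset of the next start code whose NAL type is wanted, or -1."""
    pos = data.find(START_CODE, start)
    while pos != -1 and pos + 3 < len(data):
        # The NAL unit type is the low 5 bits of the byte after the start code.
        if data[pos + 3] & 0x1F in nal_types:
            # Include the extra zero byte of a 4-byte start code if present.
            return pos - 1 if pos > 0 and data[pos - 1] == 0 else pos
        pos = data.find(START_CODE, pos + 3)
    return -1

def approximate_seek_offset(data, ratio):
    """Jump to ratio * size, then parse forward to the next SPS."""
    return find_next_nal(data, int(len(data) * ratio))

# Tiny synthetic stream: two GOPs of SPS, PPS, IDR slice, two P slices.
sps = b"\x00\x00\x00\x01\x67" + b"\xaa" * 4
pps = b"\x00\x00\x00\x01\x68\xbb"
idr = b"\x00\x00\x01\x65" + b"\xcc" * 10
p_slice = b"\x00\x00\x01\x41" + b"\xdd" * 10
gop = sps + pps + idr + p_slice + p_slice
stream = gop * 2

offset = approximate_seek_offset(stream, 0.25)
```

For a real file you would seek to the computed position and read forward in chunks instead of loading everything into memory. A return value of -1 means no further SPS was found after the requested position, in which case you would back up and search from an earlier point.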

Media players typically also use similar methods for approximate seeking.

I hope it is helpful.

OK… Assume that I have a 16K x 16K video and I split it into 256 1K x 1K videos. First, I load all of these videos at the same time on different PCs. Suppose the app on one PC crashes after all the videos have played 1000 frames. If the crashed PC then reloads its video from frame No. 0, that is very bad for display synchronization.

If I use OpenGL or DX to display the video frames, there is no need to transfer the decoded frame to the CPU. I just copy the frame data from CUDA memory to texture memory, which means GPU to GPU.

Lastly, can the codec find the nearest key frame? Right now I brute-force discard all the decoded frames that I do not need… For a video with over 100000 frames, should I really discard over 50000 frames?

NVDec tells you whether a decoded frame is a reference frame (pPicParams->intra_pic_flag). So you can just keep decoding until you see that the frame is a reference frame.

That’s the worst thing you can do! Did you not read and understand the solutions I described for seeking?

No… I know what you mean.
For example, the ffmpeg lib has a function like av_seek() that can find the nearest key frame to your target frame according to some timestamp calculation.

What I mean is that the codec can decode video to the GPU directly, and I do not want to transfer the frame data to the CPU. So, if I want random access when using the codec, what should I do?

It has nothing to do with transferring the frame back to the CPU or not. Here is the process for approximate seeking:

  • Use a file system seek to the desired point of your input file.

  • Start injecting data into the decoder. You will start getting handle decode callbacks.
    The callback does not transfer the frame or anything like that.

  • Check the pPicParams->intra_pic_flag in the handle decode callback. If it is not set,
    discard the picture. If it is set, go into normal operation.
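The third step can be sketched as a small gate that sits in the decode callback. This is a plain Python stand-in and not the NVDec API: PicParams only mimics the intra_pic_flag field mentioned above, and SeekGate and handle_decode are hypothetical names. In a real application the same logic would live in the C callback that receives pPicParams.

```python
from dataclasses import dataclass

@dataclass
class PicParams:
    """Stand-in for the picture parameters handed to the decode callback."""
    intra_pic_flag: int

class SeekGate:
    """Discards pictures after a seek until the first intra picture arrives."""

    def __init__(self):
        self.synced = False

    def handle_decode(self, pic):
        """Return True if the picture should be passed on to the decoder."""
        if not self.synced:
            if not pic.intra_pic_flag:
                return False  # still waiting for an intra picture: discard
            self.synced = True  # first intra picture: resume normal operation
        return True

# Simulated callback sequence after a file seek landed mid-GOP.
gate = SeekGate()
flags = [0, 0, 1, 0, 0]
decisions = [gate.handle_decode(PicParams(f)) for f in flags]
# decisions is [False, False, True, True, True]
```

A new SeekGate (or a reset of its synced flag) is needed for every seek, since the gate stays open once the first intra picture has been seen.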

Yep…

But in the situation below:

Now I split a video into 100 small videos, and the display window covers just 20 of them. If I move the window, I need to load new videos starting from a different frame ID for real-time display…

I am not sure your idea fits this case.

I don’t understand what you are trying to do. Can you start at a high level and tell us what your application is and what it is supposed to do?

Splitting a video into 100 small videos? You mean like frames 0-19, frames 20-39, etc.? I assume you are not trying to break up the frame itself.

If you are talking about the first case, i.e., splitting by frame ranges, then I don’t see any problem for the solution I offered. Given your desired frame number you determine the desired subvideo file and offset within it and then do as I described.

No. If the picture uploaded correctly, I guess you will understand what I mean.

All videos have the same frames. I just split the video according to the video display size.

Why are you breaking it up? So you can decode in parallel? How do you actually split it? Is this just some concept you are imagining? Why do you have such giant frame sizes? What application is generating them? You simply have not explained what you are trying to do at a high level. Video wall?

Are you aware that you cannot have an arbitrary window like that for decoding? Maybe for display if you decode all the videos taking part in the window. And even that assumes that the frame has been partitioned in a legal way, such as not breaking intra-picture prediction.

This goes way beyond doing a seek in a video. So maybe you should open a new thread about it.

Yeah, like a video wall… so I am worried about frame synchronization.

OK, now I get you. :-) Thanks.

You’ll have to implement frame-accurate seeking in each subvideo to guarantee proper frame assembly. In theory, there is no obstacle to passing the desired frame number to each subvideo decoder and then assembling the composite frame. As I mentioned earlier though, frame-accuracy is non-trivial and NVDec does not provide any API for “decode and return frame number X”. You have to roll your own. It’s definitely possible as I do it in my application.
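For the assembly side, one possible shape is sketched below, purely as an illustration under stated assumptions. It assumes every subvideo decoder tags each decoded tile with its frame number, which is exactly the frame-accurate bookkeeping you have to build yourself, and that tiles are numbered 0 to tile_count - 1. FrameAssembler and its methods are hypothetical names, not part of any library.

```python
from collections import defaultdict

class FrameAssembler:
    """Collects tiles per frame number and releases a frame once it is complete."""

    def __init__(self, tile_count):
        self.tile_count = tile_count
        self.pending = defaultdict(dict)  # frame_number -> {tile_id: pixels}

    def submit(self, frame_number, tile_id, pixels):
        """Add one decoded tile; return all tiles in tile order when complete."""
        tiles = self.pending[frame_number]
        tiles[tile_id] = pixels
        if len(tiles) == self.tile_count:
            del self.pending[frame_number]
            return [tiles[i] for i in range(self.tile_count)]
        return None  # still waiting for other subvideos to deliver this frame

# Two tiles arriving out of step: nothing is shown until both have frame 500.
assembler = FrameAssembler(tile_count=2)
first = assembler.submit(500, 0, "tile0@500")
early = assembler.submit(501, 1, "tile1@501")
complete = assembler.submit(500, 1, "tile1@500")
```

A decoder that restarts after a crash would seek to the current frame number, and the assembler simply holds the other tiles until the restarted decoder catches up. A production version would also need a policy for dropping frames that never complete.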

OK.
Thanks for your suggestion.

You’re welcome and thank you for the interesting discussion. Good luck with your projects!