Re: cropping… Why don’t you want to process the entire frame? If your goal is a secondary inference, the nvinfer element can run in secondary mode on the metadata (e.g. the bounding boxes) produced by the primary stage. So for example the primary network can detect cars and the secondary can classify each car’s make and model. IIRC DeepStream test 2 does exactly this.
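A rough sketch of that cascade, in the spirit of deepstream-test2 (the input file and the two nvinfer config paths here are placeholders you’d replace with your own):

```python
#!/usr/bin/env python3
# Sketch: DeepStream cascade with a primary detector and a secondary
# classifier. pgie_config.txt / sgie_config.txt are hypothetical config
# files; the secondary's config would set process-mode to secondary so it
# operates on the primary's bounding-box metadata.
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

pipeline = Gst.parse_launch(
    "filesrc location=sample_720p.h264 ! h264parse ! nvv4l2decoder ! "
    "mux.sink_0 nvstreammux name=mux batch-size=1 width=1280 height=720 ! "
    # primary inference: detects objects and attaches bbox metadata
    "nvinfer config-file-path=pgie_config.txt ! "
    # secondary inference: classifies each detected object (e.g. make/model)
    "nvinfer config-file-path=sgie_config.txt ! "
    "nvvideoconvert ! nvdsosd ! nveglglessink"
)
pipeline.set_state(Gst.State.PLAYING)
bus = pipeline.get_bus()
bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE,
                       Gst.MessageType.EOS | Gst.MessageType.ERROR)
pipeline.set_state(Gst.State.NULL)
```

The point is that no cropping happens in your code: nvinfer in secondary mode handles extracting each detected region for you.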
Re: JPEG… It isn’t a very efficient codec for storing video. It’s a spatial (intra-frame) codec, good for photos, so every frame gets compressed from scratch, whereas H.264/H.265 also exploit the redundancy between consecutive frames. You’ll spend a fortune on storage compared to H.264/5. You can of course do it, but I suspect you’ll run into unexpected trouble this way.
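For comparison, here’s a minimal sketch of recording straight to H.264 in an MP4. It uses the software x264enc so it runs anywhere; on a Jetson you’d typically swap in the hardware encoder (nvv4l2h264enc), and videotestsrc stands in for your real camera source:

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)
# videotestsrc emits 300 frames then EOS, so qtmux can finalize the file.
pipeline = Gst.parse_launch(
    "videotestsrc num-buffers=300 ! video/x-raw,width=1280,height=720 ! "
    "x264enc ! h264parse ! qtmux ! filesink location=recording.mp4"
)
pipeline.set_state(Gst.State.PLAYING)
bus = pipeline.get_bus()
bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE,
                       Gst.MessageType.EOS | Gst.MessageType.ERROR)
pipeline.set_state(Gst.State.NULL)
```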
If you want to view the video from the web, that’s possible, and so is displaying the metadata live on top, either by sending it alongside the video to the browser or by drawing the boxes before encoding. Think YouTube and subtitles. You can even use a subtitle track to store your metadata if you want; qtmux and matroskamux support this, among others. DJI uses subtitles to store its drone metadata. GoPro has an open metadata format (GPMF) as well, but you would have to write a new GStreamer element if you go that route.
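A sketch of the subtitle-track idea with matroskamux. Here metadata.srt is a hypothetical SRT file whose cues carry your per-frame metadata (JSON strings work fine); any player or web front end that understands subtitles can then show it on top of the video:

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)
# One video branch and one subtitle branch, both feeding the same muxer.
# subparse turns the SRT cues into timed text that matroskamux stores as
# a subtitle track alongside the video.
pipeline = Gst.parse_launch(
    "videotestsrc num-buffers=300 ! x264enc ! h264parse ! "
    "matroskamux name=mux ! filesink location=annotated.mkv "
    "filesrc location=metadata.srt ! subparse ! mux."
)
pipeline.set_state(Gst.State.PLAYING)
bus = pipeline.get_bus()
bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE,
                       Gst.MessageType.EOS | Gst.MessageType.ERROR)
pipeline.set_state(Gst.State.NULL)
```

In a live pipeline you’d generate the cues from your inference metadata instead of reading a file, but the muxing side stays the same.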
You can of course roll your own thing your own way, but Nvidia provides various out-of-the-box solutions for what it sounds like you want to do. It’s worth studying how their sample apps are designed and seeing whether any fit your use case.