Vision Transformer or other Temporal Vision Models

Please provide complete information as applicable to your setup.

• Hardware platform: Orin AGX 64GB
• DeepStream version: 7 or 6.2

Hi!

What is the best way to run spatio-temporal models inside DeepStream, such as RNNs or vision transformers that aggregate information over a longer time span? I want to move away from single-frame object detectors. I know of this sample, but it seems outdated and I’m not sure how easy it is to adapt: DeepStream 3D Action Recognition App (DeepStream 6.2 Release documentation, nvidia.com)

Is there a reference implementation or sample app for this?

Thanks!

The DeepStream 3D Action Recognition sample is available in the latest DeepStream 7.0 release.
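For context, temporal models generally consume a short clip of consecutive frames rather than a single image, so the core step is buffering frames into overlapping windows before inference. Below is a minimal Python sketch of that sliding-window idea. It is independent of DeepStream: `ClipBuffer`, `clip_len`, and `stride` are hypothetical names used for illustration, not part of any DeepStream API, and the frames can be any per-frame objects (for example NumPy arrays).

```python
from collections import deque


class ClipBuffer:
    """Hypothetical helper: groups a frame stream into fixed-length clips."""

    def __init__(self, clip_len=8, stride=4):
        self.clip_len = clip_len            # frames per clip
        self.stride = stride                # new frames required between clips
        self.frames = deque(maxlen=clip_len)  # oldest frame drops automatically
        self.since_last = 0                 # frames pushed since the last clip

    def push(self, frame):
        """Add one frame; return a time-ordered list of frames when a clip is ready, else None."""
        self.frames.append(frame)
        self.since_last += 1
        if len(self.frames) == self.clip_len and self.since_last >= self.stride:
            self.since_last = 0
            return list(self.frames)
        return None


if __name__ == "__main__":
    buf = ClipBuffer(clip_len=4, stride=2)
    for i in range(8):                      # integers stand in for video frames
        clip = buf.push(i)
        if clip is not None:
            print("clip ready:", clip)      # a temporal model would run on this clip
```

With array frames, each returned clip could be stacked along a new time axis before being passed to the model. A larger `stride` reduces how often the model runs, at the cost of reacting later to changes in the scene.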

Thank you. Does that mean this is the only example available for RNNs and other temporal models?