L4T 28.2: V4L2 userptr unhandled context fault, mmap and dmabuf too slow

In L4T 28.2, setting v4l2_buffer.memory = V4L2_MEMORY_USERPTR leads to screen flickering/corruption, frames spliced together or spliced with zeros, and MMU unhandled context faults in the kernel log:
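For anyone trying to reproduce this, here is a minimal sketch of how a user-pointer buffer might be allocated and queued. The helper names alloc_page_aligned and queue_userptr are hypothetical, and rounding the buffer to whole pages is only something worth ruling out, not a confirmed cause of the fault. A 1920x1080 frame at 2 bytes per pixel is 4147200 bytes, which is not a multiple of 4096.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

/* Hypothetical helper: allocate a zeroed buffer whose start address and
 * length are both multiples of the page size. Returns NULL on failure. */
static void *alloc_page_aligned(size_t len, size_t *rounded_len)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return NULL;

    size_t pg = (size_t)page;
    size_t rounded = (len + pg - 1) / pg * pg;   /* round up to whole pages */

    void *ptr = NULL;
    if (posix_memalign(&ptr, pg, rounded) != 0)
        return NULL;

    memset(ptr, 0, rounded);   /* touch every page before handing it to the driver */
    if (rounded_len)
        *rounded_len = rounded;
    return ptr;
}

/* Hypothetical helper: queue one user-pointer capture buffer.
 * Returns the ioctl result: 0 on success, -1 with errno set on failure. */
static int queue_userptr(int fd, unsigned int index, void *ptr, size_t len)
{
    struct v4l2_buffer buf;

    memset(&buf, 0, sizeof(buf));
    buf.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    buf.memory = V4L2_MEMORY_USERPTR;
    buf.index = index;
    buf.m.userptr = (unsigned long)ptr;
    buf.length = (__u32)len;

    return ioctl(fd, VIDIOC_QBUF, &buf);
}
```

In a real capture loop the length would normally come from fmt.fmt.pix.sizeimage after VIDIOC_S_FMT, and the buffers would be requested with VIDIOC_REQBUFS using memory = V4L2_MEMORY_USERPTR before the first VIDIOC_QBUF.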

sensor_test-4734 [000] d.h. 5986.101485: arm_smmu_context_fault: Unhandled context fault: iova=0x63d20000, fsynr=0x13, cb=19, sid=4(0x4 - VI), pgd=26bdce003 pud=26bdce003, pmd=22451f003, pte=0

Using V4L2_MEMORY_MMAP continues to be too slow for our application: ~70ms to memcpy the mmap’ed buffer to a userspace pointer.
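To make the memcpy numbers comparable, here is a rough sketch of a timing helper (timed_memcpy_ms is a hypothetical name, not part of any library). Running the same copy size from an ordinary malloc'ed buffer and from the mmap'ed capture buffer would show whether the cost comes from the source memory or from memcpy itself. If the heap-to-heap copy is much faster, that would point at how the capture buffer is mapped, but that is an assumption to verify on the target.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: copy len bytes from src to dst and return the elapsed
 * monotonic time in milliseconds, or a negative value if the clock read fails. */
static double timed_memcpy_ms(void *dst, const void *src, size_t len)
{
    struct timespec t0, t1;

    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0)
        return -1.0;

    memcpy(dst, src, len);

    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0)
        return -1.0;

    return (double)(t1.tv_sec - t0.tv_sec) * 1e3 +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e6;
}
```

Averaging over several frames helps, since a single measurement can be skewed by scheduling or cold caches.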

Using V4L2_MEMORY_DMABUF together with NvBufferCreate and NvBufferMemMap is better but still too slow: ~30ms to memcpy to a userspace pointer. Introducing usleep delays to simulate CPU scheduling or user processing does not decrease the memcpy time, so there is not enough margin for us to be confident we won’t drop frames.

Has anyone successfully or unsuccessfully used V4L2_MEMORY_USERPTR in L4T 28.2?
Could the recent Spectre mitigation changes to the kernel (speculation_barrier calls, etc.) be related?
Does anyone have insight into what is limiting the performance of MMAP and DMABUF?

I tried the commands below without any problem (-u selects user pointer mode).

v4l2-ctl -d /dev/video0 --set-ctrl bypass_mode=0
./yavta /dev/video0 -u -c10 -n5 -s1920x1080 -s1920x1080 -fSRGGB10