Are threads in a warp still in lock-step?

One of the staples of CUDA-enabled GPU computing was the lockstep fashion in which 32 threads in a warp execute instructions.

Is this still the case in more recent versions of CUDA? If not, could you please share some good links where I can read up on this?

I tried googling this non-lockstep behavior, but I keep getting pages that explain how a warp works in lockstep, which is not what I'm after.

Thank you for your time.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#independent-thread-scheduling-7-x
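
To make the linked section a bit more concrete: since Volta (compute capability 7.x), threads in a warp can diverge and reconverge independently, so code that exchanges data between lanes should synchronize explicitly instead of assuming lock-step. Below is a minimal sketch of a one-warp shared-memory reduction that uses `__syncwarp()` (available since CUDA 9). The names `warp_sum_kernel` and `count_mismatches` are made up for illustration, and the kernel assumes it is launched with exactly 32 threads per block.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each block is assumed to be exactly one warp (32 threads).
__global__ void warp_sum_kernel(const int *in, int *out)
{
    __shared__ int s[32];
    const unsigned lane = threadIdx.x;

    s[lane] = in[blockIdx.x * 32 + lane];
    __syncwarp();  // every lane's store must be visible before any lane reads

    for (unsigned offset = 16; offset > 0; offset /= 2) {
        if (lane < offset)
            s[lane] += s[lane + offset];
        __syncwarp();  // explicit sync between rounds; no implicit lock-step assumed
    }
    if (lane == 0)
        out[blockIdx.x] = s[0];
}

// Hypothetical host helper: returns the number of warps with a wrong sum,
// or -1 if a CUDA call fails.
int count_mismatches()
{
    constexpr int warps = 4;
    constexpr int n = warps * 32;
    int h_in[n], h_out[warps];
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, warps * sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        return -1;
    }
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    warp_sum_kernel<<<warps, 32>>>(d_in, d_out);

    // The blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_out, d_out, warps * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int w = 0; w < warps; ++w) {
        int expected = 0;
        for (int l = 0; l < 32; ++l) expected += h_in[w * 32 + l];
        if (h_out[w] != expected) ++bad;
    }
    return bad;
}
```

Calling `count_mismatches()` from `main` should return 0 on a working GPU. The older "warp-synchronous" version of this loop, without the `__syncwarp()` calls, is the kind of code the linked section warns is no longer safe.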

Thanks so much tera and arts036n - very much appreciated…

https://devblogs.nvidia.com/cooperative-groups/
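
For what it's worth, here is a rough sketch of the style that blog post describes: the warp is represented as an explicit 32-thread tile, and the shuffle goes through the tile object rather than relying on implicit lock-step. The names `tile_sum_kernel` and `count_tile_mismatches` are hypothetical, and the sketch assumes the block size is a multiple of 32.

```cuda
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Hypothetical kernel: each 32-thread tile sums its own 32 inputs.
__global__ void tile_sum_kernel(const int *in, int *out)
{
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    const unsigned gid = blockIdx.x * blockDim.x + threadIdx.x;
    int v = in[gid];

    // Tree reduction inside the tile; the tile's shuffle synchronizes its members.
    for (unsigned offset = tile.size() / 2; offset > 0; offset /= 2)
        v += tile.shfl_down(v, offset);

    if (tile.thread_rank() == 0)
        out[gid / 32] = v;  // one result per tile
}

// Hypothetical host helper: returns the number of tiles with a wrong sum,
// or -1 if a CUDA call fails.
int count_tile_mismatches()
{
    constexpr int blocks = 2, threads = 64;  // two tiles per block
    constexpr int n = blocks * threads;
    constexpr int tiles = n / 32;
    int h_in[n], h_out[tiles];
    for (int i = 0; i < n; ++i) h_in[i] = 2 * i + 1;

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, tiles * sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        return -1;
    }
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    tile_sum_kernel<<<blocks, threads>>>(d_in, d_out);

    // The blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_out, d_out, tiles * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int t = 0; t < tiles; ++t) {
        int expected = 0;
        for (int l = 0; l < 32; ++l) expected += h_in[t * 32 + l];
        if (h_out[t] != expected) ++bad;
    }
    return bad;
}
```

`count_tile_mismatches()` should return 0 on a working GPU. Naming the group explicitly keeps the set of participating threads visible in the code, which is the main point of the cooperative groups approach.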