Hi! One thing I wasn’t able to test out recently: is everything inside a for loop in a CUDA kernel executed sequentially? If not, is there any way to serialize the operations inside a for loop (unless, of course, with dynamic parallelism another kernel is launched inside, which would subsequently launch a new set of child threads to continue the operation)?
With respect to a single thread, everything is sequential (not counting CDP, CUDA Dynamic Parallelism), just as you would expect for a C or C++ style thread of execution.
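For example (a minimal sketch; the kernel name and sizes are just illustrative), each thread below runs its own loop, and iteration i always completes before iteration i + 1 for that thread:

```cuda
#include <cstdio>

// Hypothetical example: each thread runs its own for loop.
// Within a single thread, the iterations execute in order,
// just as they would in ordinary C/C++ code.
__global__ void perThreadLoop(int *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int acc = 0;
    // Sequential for this thread: each iteration depends on the
    // previous iteration's value of acc.
    for (int i = 0; i < n; ++i) {
        acc += i;
    }
    out[tid] = acc;
}

int main()
{
    const int threads = 4;
    int *d_out;
    cudaMalloc(&d_out, threads * sizeof(int));
    perThreadLoop<<<1, threads>>>(d_out, 10);

    int h_out[threads];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    for (int i = 0; i < threads; ++i)
        printf("thread %d: %d\n", i, h_out[i]);  // each thread prints 45
    cudaFree(d_out);
    return 0;
}
```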
There is no implied order of operations in the CUDA programming model when considering operations in separate threads, other than what you as the programmer explicitly impose via synchronization functions, CUDA cooperative groups, etc.
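To illustrate ordering across threads (again just a sketch, with an assumed block size of up to 256 threads), a barrier such as __syncthreads() is what guarantees that another thread’s write is visible before you read it:

```cuda
// Hypothetical sketch: ordering *across* threads must be imposed explicitly.
// Without the __syncthreads() barrier, thread 0 has no guarantee that its
// neighbor has already written its slot of shared memory.
__global__ void neighborSum(int *out)
{
    __shared__ int buf[256];
    int tid = threadIdx.x;

    buf[tid] = tid;          // each thread writes its own slot
    __syncthreads();         // block-wide barrier: all writes are visible now

    int next = (tid + 1) % blockDim.x;
    out[tid] = buf[tid] + buf[next];  // safe to read the neighbor only after the barrier
}
```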