Thread groups out of the active thread blocks

I would like to achieve synchronisation among the active thread blocks scheduled on a GPU.
Is it possible to do this with the current cooperative thread grouping and grid synchronization concept?

My requirement is that the currently scheduled thread blocks cooperatively load a memory segment into shared memory, then compute on it, and then synchronize until both phases are complete…

This is not a modality currently supported by CUDA cooperative groups.

You might be able to arrange for something like that yourself, with a fair amount of additional code complexity. I suspect, however, that there would be no performance benefit as compared to just using a grid group with the usual methodology provided by cooperative groups.
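For reference, the usual grid-group methodology looks roughly like the sketch below: each block stages data into shared memory, computes, and then the whole grid synchronizes at `grid.sync()` before any phase that depends on all blocks' results. The kernel name, sizes, and the trivial "multiply by 2" computation are illustrative assumptions, not anything from your use case. Note that a cooperative kernel must be launched with `cudaLaunchCooperativeKernel`, and the entire grid must be able to be resident on the device at once.

```cuda
#include <cooperative_groups.h>
#include <cstdio>

namespace cg = cooperative_groups;

// Hypothetical kernel: each block loads a tile into shared memory,
// computes on it, then the whole grid synchronizes before later phases.
__global__ void tiled_kernel(const float *in, float *out, int n)
{
    cg::grid_group grid = cg::this_grid();
    extern __shared__ float tile[];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Phase 1: cooperative load into shared memory.
    if (idx < n) tile[threadIdx.x] = in[idx];
    __syncthreads();                 // block-level barrier

    // Phase 2: compute on the staged data (placeholder computation).
    if (idx < n) out[idx] = tile[threadIdx.x] * 2.0f;

    // Grid-wide barrier: every resident block waits here.
    grid.sync();

    // ...subsequent phases that depend on the whole grid's results...
}

int main()
{
    int n = 1024, block = 256;
    int gridDim = (n + block - 1) / block;

    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    // Cooperative launch: required for grid.sync() to be valid.
    void *args[] = { &in, &out, &n };
    cudaLaunchCooperativeKernel(tiled_kernel,
                                dim3(gridDim), dim3(block),
                                args,
                                block * sizeof(float));  // dynamic shared mem
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Before launching, you would normally check `cudaDevAttrCooperativeLaunch` and size the grid with `cudaOccupancyMaxActiveBlocksPerMultiprocessor` so all blocks are guaranteed to be co-resident.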