cudaLaunchCooperativeKernel and __syncthreads()

Hello,
I am trying to write a program that needs both grid-level and block-level synchronization. However, the intrinsic __syncthreads() does not seem to work properly inside a kernel launched with cudaLaunchCooperativeKernel(). Is this a known problem? I am using CUDA 12.4. I also tried cooperative groups for the block sync, but that did not work either. My kernel launch uses 8 blocks on an RTX 3080.
Thank you

Without a reproducer, I highly doubt that __syncthreads() is not working correctly. Your problem is most likely caused by something else.
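
For reference, a minimal reproducer for this kind of question could look like the sketch below. It is not the original poster's code: the kernel name syncTest, the helper runSyncTest, the two buffers, and the 256 threads per block are all assumptions made for illustration. It uses __syncthreads() for the block barrier and grid.sync() from cooperative groups for the grid barrier, and checks the final sum on the host.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Hypothetical test kernel: every thread adds 1 to a per-block counter
// (needs block-level sync), then block 0 adds up all per-block counters
// (needs grid-level sync).
__global__ void syncTest(int *blockSums, int *total)
{
    cg::grid_group grid = cg::this_grid();
    __shared__ int partial;

    if (threadIdx.x == 0) partial = 0;
    __syncthreads();              // block barrier: partial is initialised

    atomicAdd(&partial, 1);       // each thread contributes 1
    __syncthreads();              // block barrier: all contributions are in

    if (threadIdx.x == 0) blockSums[blockIdx.x] = partial;
    grid.sync();                  // grid barrier: every blockSums entry is written

    if (blockIdx.x == 0 && threadIdx.x == 0) {
        int sum = 0;
        for (unsigned b = 0; b < gridDim.x; ++b) sum += blockSums[b];
        *total = sum;
    }
}

// Hypothetical host helper: returns the computed total, or -1 on any error.
int runSyncTest(int blocks, int threads)
{
    int supportsCoop = 0;
    cudaDeviceGetAttribute(&supportsCoop, cudaDevAttrCooperativeLaunch, 0);
    if (!supportsCoop) return -1;

    int *blockSums = nullptr, *total = nullptr;
    if (cudaMalloc(&blockSums, blocks * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&total, sizeof(int)) != cudaSuccess) { cudaFree(blockSums); return -1; }

    // Kernel arguments are passed as an array of pointers to each argument.
    void *args[] = { &blockSums, &total };
    cudaError_t err = cudaLaunchCooperativeKernel((void *)syncTest,
                                                  dim3(blocks), dim3(threads),
                                                  args, 0, 0);
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    int result = -1;
    if (err == cudaSuccess)
        err = cudaMemcpy(&result, total, sizeof(int), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        result = -1;
    }

    cudaFree(blockSums);
    cudaFree(total);
    return result;
}

int main()
{
    const int blocks = 8, threads = 256;   // 8 blocks as in the question; 256 is an assumption
    int result = runSyncTest(blocks, threads);
    printf("total = %d (expected %d)\n", result, blocks * threads);
    return result == blocks * threads ? 0 : 1;
}
```

If a test like this produces the expected total, both barriers are behaving and the issue is probably elsewhere in the real kernel. One thing worth checking is whether every thread actually reaches each barrier: calling __syncthreads() or grid.sync() from code that only some threads execute (for example after an early return) leads to undefined behavior or a hang.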