cudaDeviceSynchronize from device code is deprecated

I have this function:

__global__ void ndCudaInitBodyArray(ndCudaSceneInfo& info)
{
	if (info.m_frameIsValid)
	{
		const unsigned bodyCount = info.m_bodyArray.m_size - 1;
		const unsigned blocksCount = (bodyCount + D_THREADS_PER_BLOCK - 1) / D_THREADS_PER_BLOCK;
		ndCudaInitBodyArrayInternal<<<blocksCount, D_THREADS_PER_BLOCK, 0>>>(info);
		ndCudaMergeAabbInternal<<<1, D_THREADS_PER_BLOCK, 0>>>(info);

		// m_cellsCount is produced by ndCudaMergeAabbInternal, so the child
		// kernels above must have finished before it is read here.
		info.m_histogram.m_size = info.m_cellsCount;
		if (info.m_cellsCount > info.m_histogram.m_capacity)
		{
			cuInvalidateFrame(info, __FUNCTION__, __LINE__);
		}
		const unsigned blocksCells = (info.m_cellsCount + D_THREADS_PER_BLOCK - 1) / D_THREADS_PER_BLOCK;
		ndCudaCountAabbInternal<<<blocksCells, D_THREADS_PER_BLOCK, 0>>>(info);
	}
}

and I get this warning:

builds\newton-4.00\sdk\dNewton\dExtensions\dCuda>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\bin\nvcc.exe" -gencode=arch=compute_52,code="sm_52,compute_52" --use-local-env -ccbin "C:\Program Files\Microsoft Visual Studio\2022\Professional\VC\Tools\MSVC\14.32.31326\bin\HostX64\x64" -x cu -rdc=true -I"C:\Development\newton-dynamics\newton-4.00\sdk\dCore"
-4.00\sdk\dNewton\dExtensions\dCuda\dContext\ndCudaSceneInfo.h(130): warning #1444-D: function "cudaDeviceSynchronize"
1>C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\include\cuda_device_runtime_api.h(142): here was declared deprecated ("Use of cudaDeviceSynchronize from device code is deprecated and will not be supported in a future release.

The docs say this:

  • Use of cudaDeviceSynchronize in device code was deprecated in CUDA 11.6 and will not be supported in a future release. Note that this is different from host-side cudaDeviceSynchronize, which is still supported.

I have read all the docs I could find on dynamic parallelism, most of them by Mark Harris. All of them say that if you read data produced by a child kernel launched from a parent kernel, the parent must call cudaDeviceSynchronize first.

But the docs now say that this is no longer correct, so what is the correct way of doing it?

If I do not call it, the code simply does not work, because for some reason the child kernels are still running when the parent reads their results.
I spent hours and hours debugging this behavior, and it seems to be the only way that works.

Before I used dynamic parallelism, this worked. But since the number of items generated by the kernel ndCudaMergeAabbInternal
determines the number of blocks, I had to launch a ridiculously large number of blocks
and check in each kernel whether the block index was smaller than the number of valid blocks,
which made the child kernels more and more complex.
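
For readers following along, the host-launched approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual Newton code: the kernel names, MAX_BLOCKS, and the buffers are hypothetical. The grid is sized for the worst case on the host, and each block compares its index against a count that is only known on the device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 256
#define MAX_BLOCKS 1024 // hypothetical worst-case grid size chosen on the host

// Hypothetical producer: writes the item count that later steps depend on.
__global__ void produceCount(unsigned* count, unsigned produced)
{
	if ((blockIdx.x == 0) && (threadIdx.x == 0))
	{
		*count = produced;
	}
}

// Hypothetical consumer launched with MAX_BLOCKS blocks; surplus blocks exit early.
__global__ void guardedStep(const unsigned* count, unsigned* out)
{
	const unsigned validBlocks = (*count + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
	if (blockIdx.x >= validBlocks)
	{
		return; // this block is beyond the range of valid work
	}
	const unsigned index = blockIdx.x * blockDim.x + threadIdx.x;
	if (index < *count)
	{
		out[index] = index;
	}
}

int main()
{
	unsigned* count = nullptr;
	unsigned* out = nullptr;
	cudaMalloc(&count, sizeof(unsigned));
	cudaMalloc(&out, MAX_BLOCKS * THREADS_PER_BLOCK * sizeof(unsigned));

	// Host launches into the same stream run in order, so guardedStep sees the count.
	produceCount<<<1, 1>>>(count, 1000);
	guardedStep<<<MAX_BLOCKS, THREADS_PER_BLOCK>>>(count, out);
	cudaDeviceSynchronize(); // host-side synchronization

	unsigned last = 0;
	cudaMemcpy(&last, out + 999, sizeof(unsigned), cudaMemcpyDeviceToHost);
	printf("out[999] = %u\n", last);

	cudaFree(out);
	cudaFree(count);
	return 0;
}
```

The cost is that every dependent kernel carries the same guard and most of the launched blocks do nothing.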

So my question is: what is the correct way to synchronize with the child kernels after SDK 11.6?

That methodology is deprecated.

So you should not synchronize on child kernel completion in a parent kernel. If you need to consume results from the child kernel in device code, launch a new kernel.

Yes, this will require refactoring your code. There is no zero-impact workaround that I am aware of.
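
A minimal sketch of what "launch a new kernel" could look like, with hypothetical names (SceneInfo, mergeStep, continuationStep, countStep) standing in for the real Newton types and kernels. It assumes that kernels launched by one thread into the same device stream run in order, which is what the original code already relies on when it chains ndCudaInitBodyArrayInternal and ndCudaMergeAabbInternal. The parent no longer reads the child's result; a single-thread continuation kernel, queued behind the producer, reads it and sizes the next launch. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 256

// Hypothetical, simplified stand-in for a scene-info structure.
struct SceneInfo
{
	unsigned cellsCount; // produced on the device by mergeStep
	unsigned capacity;   // number of elements in the cells buffer
	unsigned* cells;     // device buffer filled by countStep
};

// Producer child: computes a count that the next launch depends on.
__global__ void mergeStep(SceneInfo* info, unsigned produced)
{
	if ((blockIdx.x == 0) && (threadIdx.x == 0))
	{
		info->cellsCount = produced;
	}
}

// Consumer child: its grid size depends on the produced count.
__global__ void countStep(SceneInfo* info)
{
	const unsigned index = blockIdx.x * blockDim.x + threadIdx.x;
	if (index < info->cellsCount)
	{
		info->cells[index] = index;
	}
}

// Continuation: queued after mergeStep in the same stream, so the count is
// ready when it starts. It takes over the work the parent used to do after
// the device-side cudaDeviceSynchronize().
__global__ void continuationStep(SceneInfo* info)
{
	const unsigned count = info->cellsCount;
	if ((count == 0) || (count > info->capacity))
	{
		return; // nothing to do, or the buffer is too small
	}
	const unsigned blocks = (count + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
	countStep<<<blocks, THREADS_PER_BLOCK>>>(info);
}

// Parent: only enqueues work and never reads results of its children.
__global__ void parentStep(SceneInfo* info, unsigned produced)
{
	mergeStep<<<1, THREADS_PER_BLOCK>>>(info, produced);
	continuationStep<<<1, 1>>>(info);
}

int main()
{
	const unsigned capacity = 4096;
	const unsigned produced = 1000;

	unsigned* cells = nullptr;
	cudaMalloc(&cells, capacity * sizeof(unsigned));
	cudaMemset(cells, 0xff, capacity * sizeof(unsigned));

	SceneInfo hostInfo = { 0, capacity, cells };
	SceneInfo* info = nullptr;
	cudaMalloc(&info, sizeof(SceneInfo));
	cudaMemcpy(info, &hostInfo, sizeof(SceneInfo), cudaMemcpyHostToDevice);

	parentStep<<<1, 1>>>(info, produced);
	cudaDeviceSynchronize(); // host-side call, still supported

	unsigned last = 0;
	cudaMemcpy(&last, cells + (produced - 1), sizeof(unsigned), cudaMemcpyDeviceToHost);
	printf("cells[%u] = %u\n", produced - 1, last);

	cudaFree(info);
	cudaFree(cells);
	return 0;
}
```

Like the original build, this needs relocatable device code (-rdc=true) and linking against the device runtime (cudadevrt). CUDA 12 and later also provide a named device stream, cudaStreamTailLaunch, for this kind of continuation; the sketch does not use it because the question targets CUDA 11.7.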