So I have this function:
__global__ void ndCudaInitBodyArray(ndCudaSceneInfo& info)
{
    if (info.m_frameIsValid)
    {
        const unsigned bodyCount = info.m_bodyArray.m_size - 1;
        const unsigned blocksCount = (bodyCount + D_THREADS_PER_BLOCK - 1) / D_THREADS_PER_BLOCK;
        ndCudaInitBodyArrayInternal<<<blocksCount, D_THREADS_PER_BLOCK, 0>>>(info);
        ndCudaMergeAabbInternal<<<1, D_THREADS_PER_BLOCK, 0>>>(info);

        // Device-side sync (the call the warning is about): wait for the
        // child kernels before reading m_cellsCount.
        cudaDeviceSynchronize();

        info.m_histogram.m_size = info.m_cellsCount;
        if (info.m_cellsCount > info.m_histogram.m_capacity)
        {
            cuInvalidateFrame(info, __FUNCTION__, __LINE__);
            return;
        }

        // The grid size of the next child kernel depends on the value read above.
        unsigned blocksCells = (info.m_cellsCount + D_THREADS_PER_BLOCK - 1) / D_THREADS_PER_BLOCK;
        ndCudaCountAabbInternal<<<blocksCells, D_THREADS_PER_BLOCK, 0>>>(info);
    }
}
and I get this warning:
builds\newton-4.00\sdk\dNewton\dExtensions\dCuda>“C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\bin\nvcc.exe” -gencode=arch=compute_52,code="sm_52,compute_52" --use-local-env -ccbin “C:\Program Files\Microsoft Visual Studio\2022\Professional\VC\Tools\MSVC\14.32.31326\bin\HostX64\x64” -x cu -rdc=true -I"C:\Development\newton-dynamics\newton-4.00\sdk\dCore"
-4.00\sdk\dNewton\dExtensions\dCuda\dContext\ndCudaSceneInfo.h(130): warning #1444-D: function “cudaDeviceSynchronize”
1>C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\include\cuda_device_runtime_api.h(142): here was declared deprecated ("Use of cudaDeviceSynchronize from device code is deprecated and will not be supported in a future release.
The docs say this:
- Use of cudaDeviceSynchronize in device code was deprecated in CUDA 11.6 and will not be supported in a future release. Note that this is different from host-side cudaDeviceSynchronize, which is still supported.
I read all the docs regarding dynamic parallelism; most of them come from Mark Harris. All of them say that if you read data produced by a child kernel launched from a parent kernel, the app must call cudaDeviceSynchronize,
but the docs now say that this is not correct, so what is the correct way of doing it?
If I do not do it that way, it simply does not work, because for some reason the child kernels are still running when the parent kernel continues.
I spent hours and hours debugging this behavior, and it seems to be the only way that can work.
Before, when I was not using dynamic parallelism, this worked, but since the number of items generated by the kernel ndCudaMergeAabbInternal
determines the number of blocks, I had to pass a ridiculous number of blocks
and check in the kernel whether the block number was smaller than the number of valid blocks,
which made the child kernels more and more complex.
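To make the workaround described above concrete, here is a minimal sketch of the oversized-grid approach without dynamic parallelism. It is not the actual Newton code: produceCount, countCellsOversized, runOversizedLaunch, blocksFor, D_MAX_BLOCKS and the value used for D_THREADS_PER_BLOCK are all made up for illustration. The only synchronization is the host-side cudaDeviceSynchronize, which is still supported.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define D_THREADS_PER_BLOCK 256  // assumed value, for illustration only
#define D_MAX_BLOCKS 64          // hypothetical worst-case block count

// Blocks needed to cover 'count' items.
__host__ __device__ inline unsigned blocksFor(unsigned count)
{
    return (count + D_THREADS_PER_BLOCK - 1) / D_THREADS_PER_BLOCK;
}

// Hypothetical stand-in for the merge kernel: the item count is produced on the device.
__global__ void produceCount(unsigned* cellsCount, unsigned value)
{
    *cellsCount = value;
}

// Hypothetical stand-in for the count kernel, launched with the worst-case grid.
// Blocks beyond the valid range exit immediately.
__global__ void countCellsOversized(const unsigned* cellsCount, unsigned* histogram)
{
    const unsigned count = *cellsCount;
    if (blockIdx.x >= blocksFor(count)) return;
    const unsigned index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < count) histogram[index] = 1;
}

// Host driver: returns the number of histogram entries written, or -1 on error.
int runOversizedLaunch(unsigned count)
{
    const unsigned capacity = D_MAX_BLOCKS * D_THREADS_PER_BLOCK;
    if (count > capacity) return -1;

    unsigned* cellsCount = nullptr;
    unsigned* histogram = nullptr;
    if (cudaMalloc(&cellsCount, sizeof(unsigned)) != cudaSuccess) return -1;
    if (cudaMalloc(&histogram, capacity * sizeof(unsigned)) != cudaSuccess)
    {
        cudaFree(cellsCount);
        return -1;
    }
    cudaMemset(histogram, 0, capacity * sizeof(unsigned));

    // Both kernels go to the default stream, so the second one only starts
    // after the first has finished and the count is visible.
    produceCount<<<1, 1>>>(cellsCount, count);
    countCellsOversized<<<D_MAX_BLOCKS, D_THREADS_PER_BLOCK>>>(cellsCount, histogram);
    const cudaError_t status = cudaDeviceSynchronize();  // host-side sync

    std::vector<unsigned> hostHistogram(capacity);
    cudaMemcpy(hostHistogram.data(), histogram, capacity * sizeof(unsigned), cudaMemcpyDeviceToHost);
    cudaFree(histogram);
    cudaFree(cellsCount);
    if (status != cudaSuccess) return -1;

    int written = 0;
    for (unsigned value : hostHistogram) written += int(value);
    return written;
}
```

Calling runOversizedLaunch(1000) from host code launches all 64 blocks, but only the first 4 do any work. That is the cost I was trying to avoid: every kernel that depends on a device-side count needs the extra early-exit check, and the grid has to be sized for the worst case.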
So my question is: what is the correct way of syncing the child threads after SDK 11.6?