Hi,
I was wondering if there are any plans to make cooperative groups compatible with nvc++ anytime soon. I have been playing around with the two myself (with CUDA 12.2 bundled with HPC SDK 23.7) by editing the cooperative-groups header files, and came up with a patch that partially achieves this by allowing the reduction sample to compile and run with both nvc++ and nvcc. I strove to keep things platform- and SM-level-independent, though I was only able to test on my SM 7.5 Linux machine. I'm attaching the patch at the end of this post in case anyone finds it useful or interesting.
Can anyone from NVIDIA say whether there are plans to move in this direction soon? And are there any plans to move cooperative groups, which are largely (entirely?) header-only anyway, to an open-source repository? Perhaps libcu++ or the upcoming CCCL?
Thanks for any information. The rest of this post describes the issues I had to solve, which are probably representative of what would need solving for fuller nvc++/cooperative-groups interoperability.
The first issue: whenever any overload of this_thread_block is invoked, nvc++ insists, regardless of the SM level being compiled for, on using the reserved shared multi-warp scratch area returned by cooperative_groups/details/memory.h:reserved_shared_ptr(), even though that area only exists for SM 8.0 and above. And even where that logic is correct, nvc++ still reports an error on the asm block inside reserved_shared_ptr(). Changing the asm block's output parameter to an unsigned integer and reinterpret_cast-ing it to a pointer afterwards solved the second problem (see the sketch below for the shape of the change), but the first one is more involved. It can be quickly circumvented by passing -D_CG_USER_PROVIDED_SHARED_MEMORY for CC < 8.0, but that again requires separate compilations for < 8.0 and >= 8.0, which is not the nvc++ way.
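To make the second fix concrete, here is a rough sketch of the shape of the change. The PTX and the dummy shared variable below are stand-ins (a simple shared/generic address round-trip), not the actual sequence from memory.h; the point is only that the asm output operand becomes an unsigned integer and the cast back to a pointer moves into C++.

// Shape that nvc++ complained about: a pointer-typed asm output operand.
//   void *ptr;
//   asm("..." : "=l"(ptr) : ...);

// Shape after the change: unsigned integer output, reinterpret_cast afterwards.
__device__ void *generic_ptr_of_shared_sketch() {
    __shared__ int scratch;                 // stand-in shared object
    unsigned long long generic_addr;        // integer output instead of void*
    asm("{\n\t"
        ".reg .u64 s;\n\t"
        "cvta.to.shared.u64 s, %1;\n\t"     // generic -> shared-window address
        "cvta.shared.u64 %0, s;\n\t"        // shared-window -> generic address
        "}"
        : "=l"(generic_addr)
        : "l"(reinterpret_cast<unsigned long long>(&scratch)));
    return reinterpret_cast<void *>(generic_addr);  // cast outside the asm block
}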
The most complex aspect to get right was maintaining zero overhead for snippets like
__shared__ cg::block_tile_memory<256> shared;
cg::thread_block thb = cg::this_thread_block(shared);
for CC >= 8.0. This pattern is needed when creating tiles larger than 32 threads, and with nvcc, cg::block_tile_memory takes up no space for CC >= 8.0. If nvc++'s if target mechanism allowed target-dependent type definitions, things would be simple enough, but that does not seem possible. So I think zero overhead for CC >= 8.0 simply isn't achievable with nvc++ for this pattern, and the CUDA programming guide may need to stop recommending it at some point.

It is, however, possible to rewrite the zero-parameter this_thread_block() overload to provide a __shared__ cg::block_tile_memory instance behind the scenes automatically, but only when tiles larger than 32 threads are created and only for CC < 8.0. To do this, this_thread_block() must become a template taking the MaxBlockSize previously passed to block_tile_memory. MaxBlockSize has a default value, so the function can still be called as plain this_thread_block(), which keeps the existing semantics. See the attached patch for an implementation and the sketch just below for the general shape. In my favourite version of reality, it is this function that is recommended from now on. The cg::block_tile_memory-taking overload of this_thread_block continues to work and remains zero-overhead with nvcc, but the extra cg::block_tile_memory always takes up space with nvc++.
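For reference, a rough sketch of the idea using NV_IF_ELSE_TARGET from <nv/target>. The function name this_thread_block_auto and the default MaxBlockSize of 1024 are illustrative only; the actual patch templates this_thread_block() itself and handles the larger-than-32-threads condition more carefully.

#include <cooperative_groups.h>
#include <nv/target>

namespace cg = cooperative_groups;

template <unsigned int MaxBlockSize = 1024>
__device__ cg::thread_block this_thread_block_auto() {
    NV_IF_ELSE_TARGET(NV_PROVIDES_SM_80,
        (
            // SM 8.0+: the reserved shared scratch area is used,
            // so no block_tile_memory instance is needed.
            return cg::this_thread_block();
        ),
        (
            // Pre-SM 8.0: provision the scratch that the caller would
            // otherwise have had to declare by hand.
            __shared__ cg::block_tile_memory<MaxBlockSize> scratch;
            return cg::this_thread_block(scratch);
        )
    );
}

With something along these lines, a kernel launched with 256-thread blocks could call this_thread_block_auto<256>() and drop the explicit __shared__ cg::block_tile_memory<256> declaration entirely.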
The other main ingredient of the implementation is carefully replacing the naive macro checks with NV_IF_TARGET, extended so that the _CG_USER_PROVIDED_SHARED_MEMORY override is also honoured when specified, roughly along the lines sketched below.
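To make "extended" a bit more concrete, one possible shape is a thin wrapper macro like the one below. The macro names are hypothetical and the actual patch spells this differently; the idea is just that the override short-circuits the target dispatch.

#include <nv/target>

// Strips the parentheses that NV_IF_ELSE_TARGET-style branch arguments carry.
#define _CG_UNPAREN(...) __VA_ARGS__

#if defined(_CG_USER_PROVIDED_SHARED_MEMORY)
// Override given: always take the user-provided-shared-memory path,
// whatever the target SM level.
#define _CG_IF_ELSE_RESERVED_SHMEM(if_reserved, if_user) { _CG_UNPAREN if_user }
#else
// Default: decide per target, so a single nvc++ compilation covers all SM levels.
#define _CG_IF_ELSE_RESERVED_SHMEM(if_reserved, if_user) \
    NV_IF_ELSE_TARGET(NV_PROVIDES_SM_80, if_reserved, if_user)
#endif

// Usage shape, with hypothetical helpers:
//   _CG_IF_ELSE_RESERVED_SHMEM(
//       ( use_reserved_scratch(); ),
//       ( use_user_provided_scratch(); ));

This keeps a single nvc++ compilation valid for all SM levels while still letting -D_CG_USER_PROVIDED_SHARED_MEMORY force the pre-8.0 path.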
The patch can be applied as
cd <cuda-path>/include
patch -p2 -i <path-to-patchfile>
and can be reverted by also passing -R to patch.
Cheers,
Mat
nvcxx_cg.patch (9.9 KB)