I was wondering if there were any plans to make cooperative groups compatible with nvc++ anytime soon. I was playing around with the two for myself (with CUDA 12.2 bundled with HPC SDK 23.7) by editing the cooperative-groups header files and came up with a patch that partially achieves that by allowing the reduction sample to compile and run with both nvc++ and nvcc. I strived to keep things platform- and SM-level-independent, though I was only able to test on my SM-75 Linux machine. I’m attaching the patch at the end of this post if anyone finds it useful or interesting.
Can anyone from NVIDIA say whether there are plans to move in this direction soon? And are there plans to move cooperative groups, which are largely (entirely?) header-only anyway, into an open-source repository, perhaps libcu++ or the upcoming cccl?
Thanks for any information. The rest of this post describes the issues I had to solve, which may be representative of what needs solving for further nvc++/cooperative-groups interoperability.
The first issue is that nvc++ insists, regardless of the SM level being compiled, on using the reserved shared multi-warp scratch area returned by `cooperative_groups/details/memory.h:reserved_shared_ptr()`, which is only available for SM level >= 8.0, whenever any overload of `this_thread_block` is invoked. Even when that is the correct logic, it still reports an error in the `asm` block in `reserved_shared_ptr()`. Changing the output parameter of the `asm` block to an unsigned integer and `reinterpret_cast`-ing it to a pointer solved the second issue, but the first one is more involved. It can be quickly circumvented by passing `-D_CG_USER_PROVIDED_SHARED_MEMORY` for CC < 8.0, but that again requires separate compilations for < 8.0 and >= 8.0, which is not the nvc++ way.
The most complex aspect to get right was maintaining the zero overhead of snippets like

```cpp
__shared__ cg::block_tile_memory<256> shared;
cg::thread_block thb = cg::this_thread_block(shared);
```

for CC >= 8.0. This pattern is needed when creating tiles larger than 32 threads. With nvcc, `cg::block_tile_memory` takes up no space for CC >= 8.0. If nvc++'s `if target` mechanism allowed target-dependent type definitions, things would be simple enough, but that doesn't seem possible. So I think zero overhead for CC >= 8.0 just isn't possible with nvc++ with this pattern, and the CUDA programming guide may need to stop recommending it at some point. It is, however, possible to rewrite the zero-parameter `this_thread_block()` overload to automatically feature a `__shared__ cg::block_tile_memory` instance behind the scenes, but only when creating tiles larger than 32 threads and only when CC < 8.0. To achieve this, `this_thread_block()` must become a template taking the `MaxBlockSize` previously passed to `cg::block_tile_memory`. `MaxBlockSize` has a default value so the function can still be called as just `this_thread_block()`, which keeps the existing semantics. See the attached patch for an implementation. In my favourite version of reality, it is this function that is recommended from now on. The `cg::block_tile_memory`-taking overload of `this_thread_block` continues to work and is zero-overhead with nvcc, but the extra `cg::block_tile_memory` always takes up space with nvc++.
The other main ingredient of the implementation is carefully replacing the naive macro checking with `NV_IF_TARGET` (extended so that the `_CG_USER_PROVIDED_SHARED_MEMORY` override is also taken into account if specified).
The patch can be applied with

```
cd <cuda-path>/include
patch -p2 -i <path-to-patchfile>
```

and can be reverted by also passing `-R` to `patch`.
nvcxx_cg.patch (9.9 KB)