nvc++ and cooperative groups (with fun little patch)

Hi,

I was wondering if there were any plans to make cooperative groups compatible with nvc++ anytime soon. I was playing around with the two myself (with CUDA 12.2 bundled with HPC SDK 23.7) by editing the cooperative-groups header files, and came up with a patch that partially achieves this by allowing the reduction sample to compile and run with both nvc++ and nvcc. I strove to keep things platform- and SM-level-independent, though I was only able to test on my SM-75 Linux machine. I’m attaching the patch at the end of this post in case anyone finds it useful or interesting.

Can anyone from NVIDIA say whether there are plans to move in this direction soon? And are there any plans to move the cooperative groups, which are largely (entirely?) header-only anyway, to an open-source repository? Perhaps libcu++ or the upcoming cccl?

Thanks for any information. The rest of this post describes the issues I had to solve, which may be representative of what needs solving for further nvc++/cooperative-groups interoperability.

The first issue is that nvc++, regardless of the SM level being compiled, insists on using the reserved shared multi-warp scratch area returned by cooperative_groups/details/memory.h:reserved_shared_ptr() whenever any overload of this_thread_block is invoked, even though that area is only available for SM levels >= 8.0. And even when that logic is correct, nvc++ still reports an error in the asm block inside reserved_shared_ptr(). Changing the asm block’s output parameter to an unsigned integer and reinterpret_cast-ing it to a pointer solved the second issue, but the first one is more involved. It can be quickly circumvented by passing -D_CG_USER_PROVIDED_SHARED_MEMORY for CC < 8.0, but that again requires separate compilations for < 8.0 and >= 8.0, which is not the nvc++ way.
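For reference, the shape of that asm fix looks roughly like the sketch below. This is not the actual header code: the PTX string is a placeholder and the base-pointer handling is illustrative only; the real details are in the patch.

```cuda
// Sketch only -- placeholder PTX, not the contents of
// cooperative_groups/details/memory.h.
__device__ void *reserved_shared_ptr_sketch() {
    // Before: the asm output was bound directly to a pointer, e.g.
    //   void *ptr; asm("..." : "=l"(ptr));
    // which nvc++ rejected. Binding an unsigned int and casting works:
    unsigned int offset;
    asm("mov.u32 %0, 0; /* placeholder for the real offset PTX */"
        : "=r"(offset));
    extern __shared__ char smem_base[];  // illustrative base, not the real one
    return reinterpret_cast<void *>(smem_base + offset);
}
```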

The most complex aspect to get right was maintaining zero overhead of snippets like

    __shared__ cg::block_tile_memory<256> shared;
    cg::thread_block thb = cg::this_thread_block(shared);

for CC >= 8.0. This pattern is needed when creating tiles larger than 32 threads. With nvcc, cg::block_tile_memory takes up no space for CC >= 8.0. If nvc++'s if target mechanism allowed target-dependent type definitions, things would be simple enough, but that doesn’t seem possible. So I think zero overhead for CC >= 8.0 just isn’t achievable with nvc++ and this pattern, and the CUDA programming guide may need to stop recommending it at some point. It is, however, possible to rewrite the zero-parameter this_thread_block() overload to provide a __shared__ cg::block_tile_memory instance behind the scenes automatically, but only when creating tiles larger than 32 threads and only when CC < 8.0. this_thread_block() must become a template taking the MaxBlockSize previously passed to block_tile_memory. MaxBlockSize has a default value, so the function can still be called as plain this_thread_block(), which keeps the existing semantics. See the attached patch for an implementation. In my favourite version of reality, it is this function that is recommended from now on. The cg::block_tile_memory-taking overload of this_thread_block continues to work and is zero-overhead with nvcc, but the extra cg::block_tile_memory always takes up space with nvc++.
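In outline, the templated overload looks something like the sketch below. The default value of 32, the if-constexpr guard, and the detail_make_thread_block() helper are my illustrative choices here, not official CG API; the real implementation is in the attached patch.

```cuda
// Sketch of the patched zero-parameter this_thread_block();
// NV_IF_ELSE_TARGET comes from <nv/target>.
namespace cooperative_groups {

template <unsigned int MaxBlockSize = 32>  // default keeps this_thread_block() valid
__device__ thread_block this_thread_block() {
    NV_IF_ELSE_TARGET(NV_PROVIDES_SM_80,
        // CC >= 8.0: reserved scratch area, zero overhead.
        ( return detail_make_thread_block(); ),
        (
            if constexpr (MaxBlockSize > 32) {
                // Big tiles on CC < 8.0 need backing shared memory.
                __shared__ block_tile_memory<MaxBlockSize> scratch;
                return this_thread_block(scratch);
            } else {
                return detail_make_thread_block();
            }
        ));
}

} // namespace cooperative_groups
```

A kernel wanting 256-thread tiles would then call this_thread_block<256>() instead of declaring the __shared__ block_tile_memory itself.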

The other main ingredient of the implementation is carefully replacing naive macro checking with NV_IF_TARGET (extended so that the _CG_USER_PROVIDED_SHARED_MEMORY override is also taken into account when specified).
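The dispatch pattern is roughly as below. NV_IF_ELSE_TARGET is real and ships in the <nv/target> header; the wrapper macro name is made up for illustration, and the patch’s actual extension may differ.

```cuda
#include <nv/target>

// Hypothetical wrapper: defining _CG_USER_PROVIDED_SHARED_MEMORY forces the
// fallback branch even on SM >= 8.0; otherwise dispatch on the target.
#if defined(_CG_USER_PROVIDED_SHARED_MEMORY)
#  define _CG_IF_RESERVED_SMEM(t, f) NV_IF_ELSE_TARGET(NV_PROVIDES_SM_80, f, f)
#else
#  define _CG_IF_RESERVED_SMEM(t, f) NV_IF_ELSE_TARGET(NV_PROVIDES_SM_80, t, f)
#endif

// Usage, with branches in NV_IF_ELSE_TARGET's parenthesized form:
//   _CG_IF_RESERVED_SMEM(( use_reserved_scratch(); ), ( use_user_memory(); ))
```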

The patch can be applied as

    cd <cuda-path>/include
    patch -p2 -i <path-to-patchfile>

and can be reverted by also passing -R to patch.

Cheers,
Mat

nvcxx_cg.patch (9.9 KB)

While nvc++ does support some CUDA, mostly just what’s needed to support our C++17 standard language parallelism, it’s not fully supported. Hopefully at some point our team can add further CUDA features such as cooperative groups, but right now it’s not our focus (which is STDPAR), so it won’t be anytime soon.

Though I’ll pass along your patch to engineering.

Thanks,
Mat

Thanks for the info and forwarding the patch. Stdpar is an understandable priority and pretty exciting in itself. And it sounds like porting further CUDA features to nvc++ is at least being considered as an option, which is great too. Fingers crossed they get their turn somewhere down the line.

Thanks,
Matjaž (full name works too, don’t want to steal your thunder)

No thunder stolen. I always welcome more single-“t” Mats.

Hopefully we can get back to extending the CUDA support in nvc++. It will never be a drop-in replacement for nvcc and may lag behind in some features, but it can offer some distinct advantages, such as mixing programming models: CUDA, STDPAR, OpenACC, and OpenMP.

While Bryce is a bit aspirational about the timeline, he gave a good presentation about this which you can watch at: https://www.youtube.com/watch?v=KhZvrF_w1ak

The whole talk is interesting, but the CUDA C++ material starts around the 13:15 mark.

No thunder stolen

Superb, I’ll go back to saving three characters (and four UTF-8 bytes) in my signature line then!
Thanks for the link, it’s a really fun little video, and one that sounds super optimistic about nvc++ CUDA support. I was also not aware that “if target” support was intended for nvcc as well, though it doesn’t seem to be a thing just yet. Any idea on the ETA?
But the other question I wanted to repeat was, how about making cooperative groups open source? As far as I can see, everything is header-only, apart from a handful of functions starting with cudaCG in cuda_device_runtime_api.h. Those and cudaCGScope from driver_types.h, the only other identifier starting with cudaCG in the CUDA headers, probably have to stay part of CUDA core, but it seems to me that cooperative_groups.h and the cooperative_groups/ folder could be maintained in the /include/ folder of libcu++ as siblings of the cuda and nv folders.
Then CG would have a chance of benefiting from good old GitHubby volunteer labour, even if it’s not a high priority internally.
I’m also continuing to use the patched headers for personal projects, which makes me feel even more that CG is already open source for all practical purposes, just not maintained as such.
I’m happy to ask about this again on a different subforum or in a libcu++ issue if this isn’t the best place.
Nevertheless, thanks for any info.

Best,
Mat

Hi Mat,

Sorry, but I don’t have any insight here into either “if target” support in nvcc or the possibility of open-sourcing the cooperative groups headers. Not sure libcu++ would be the right spot for it, but maybe the core libraries? GitHub - NVIDIA/cccl: CUDA C++ Core Libraries

-Mat

Hi, yeah, I agree: the libcu++ repo does one thing, and the cooperative_groups* headers would stand out much more than the nv folder does, whereas cccl seems like a great opportunity to smoothly sneak in more stuff. I guess I’ll go bother the good folks at cccl then. Thanks again for all the info :)

Best,
Mat