Is it possible to instantiate and initialize cuda::barriers in global memory?

I’m trying to run code that replaces grid.sync() calls with a cuda::barrier that is set up to block on one thread per block. I use cudaLaunchCooperativeKernel to launch all thread blocks at the same time, but when I instantiate the barrier in global memory and try to initialize it, the program segfaults.
Is this intended behavior, or am I doing something wrong? The code that causes the error is below:

    cuda::barrier<cuda::thread_scope_device> * bar_ptr;
    gpuErrchk(cudaMalloc(&bar_ptr, sizeof(cuda::barrier<cuda::thread_scope_device>)));
    init(bar_ptr, 128);

What does init() do, exactly? I assume that is where you are getting the segfault?

When you create an allocation using cudaMalloc, you cannot “initialize” that allocation directly from host code, because the returned pointer is a device address that host code cannot dereference. If you want to initialize it from host code, it’s generally necessary to use some kind of operation like cudaMemcpy. This is a basic CUDA 101 principle, not unique or specific to barriers.
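
To illustrate the general principle with something simpler than a barrier, here is a minimal sketch that initializes a single int in a cudaMalloc allocation from the host. The function name init_device_int_demo and the value 128 are just placeholders for illustration, not anything from the original code.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: returns 0 if the value survives a host -> device -> host
// round trip, 1 otherwise.
int init_device_int_demo() {
    int *d_value = nullptr;
    if (cudaMalloc(&d_value, sizeof(int)) != cudaSuccess) return 1;

    // *d_value = 128;  // invalid: d_value is a device address, so host code
    //                  // must not dereference it

    // Valid: stage the value on the host, then copy it into the allocation.
    int h_value = 128;
    cudaError_t err = cudaMemcpy(d_value, &h_value, sizeof(int),
                                 cudaMemcpyHostToDevice);

    // Copy it back to confirm that the device allocation was initialized.
    int h_check = 0;
    if (err == cudaSuccess) {
        err = cudaMemcpy(&h_check, d_value, sizeof(int),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(d_value);
    return (err == cudaSuccess && h_check == 128) ? 0 : 1;
}
```

The other common option is to launch a small kernel that writes to the allocation, which is what you would need for an object that has to be constructed on the device.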

I’m leaving aside the whole aspect of correctness. You don’t seem to be asking about that, and there is definitely not enough info here to ascertain that.

Thank you so much for the response! Sometimes you forget the most obvious things…
The init function I am calling is from here:
https://nvidia.github.io/cccl/libcudacxx/extended_api/synchronization_primitives/barrier/init.html#libcudacxx-extended-api-synchronization-barrier-barrier-init
I called it from device code and it worked. Now I will figure out how well global barriers work vs grid syncing.
Will work through correctness / throughput testing. Thanks again!
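
For anyone who finds this later, here is a rough sketch of the device-side initialization described above. It is not the original poster's code: the kernel names (init_barrier, use_barrier), the helper run_barrier_demo, and the grid size of 4 blocks are assumptions chosen for illustration. It assumes a GPU and toolkit that support cuda::barrier at device scope and cooperative launches, and it only works if all blocks are resident at the same time, which is why the grid is kept small.

```cuda
#include <cuda/barrier>
#include <cuda_runtime.h>

using device_barrier = cuda::barrier<cuda::thread_scope_device>;

// A single device thread initializes the barrier in global memory.
__global__ void init_barrier(device_barrier *bar, int expected) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        init(bar, expected);  // friend function of cuda::barrier, found via ADL
    }
}

// One thread per block arrives at the barrier, standing in for grid.sync().
__global__ void use_barrier(device_barrier *bar, int *counter, int *errors) {
    if (threadIdx.x == 0) {
        atomicAdd(counter, 1);    // work done before the barrier
        bar->arrive_and_wait();   // wait for one arrival from every block
        // After the barrier, every block's increment should be visible.
        if (atomicAdd(counter, 0) != (int)gridDim.x) atomicAdd(errors, 1);
    }
    __syncthreads();  // hold the rest of the block until thread 0 is through
}

// Hypothetical host driver: returns 0 on success, 1 on any failure.
int run_barrier_demo() {
    const int num_blocks = 4, threads = 32;  // small so all blocks are co-resident
    device_barrier *bar = nullptr;
    int *state = nullptr;                    // state[0] = counter, state[1] = errors
    if (cudaMalloc(&bar, sizeof(device_barrier)) != cudaSuccess) return 1;
    if (cudaMalloc(&state, 2 * sizeof(int)) != cudaSuccess) {
        cudaFree(bar);
        return 1;
    }
    cudaMemset(state, 0, 2 * sizeof(int));

    // Expected arrival count is one per block, initialized from device code.
    init_barrier<<<1, 1>>>(bar, num_blocks);

    int *counter = state, *errors = state + 1;
    void *args[] = { &bar, &counter, &errors };
    cudaError_t err = cudaLaunchCooperativeKernel(
        (void *)use_barrier, dim3(num_blocks), dim3(threads), args, 0, nullptr);
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    int h_state[2] = {0, 1};
    if (err == cudaSuccess) {
        err = cudaMemcpy(h_state, state, sizeof(h_state), cudaMemcpyDeviceToHost);
    }
    cudaFree(state);
    cudaFree(bar);
    return (err == cudaSuccess && h_state[0] == num_blocks && h_state[1] == 0) ? 0 : 1;
}
```

Both launches go to the default stream, so the barrier is initialized before use_barrier starts. The sketch says nothing about how this compares with grid.sync() in throughput; that still needs to be measured.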