cuda::pipeline_shared_state in device memory

I am using cuda::pipeline in my kernel, and my algorithm requires all of the shared memory available in the SM.
So I don't have enough shared memory left to allocate the cuda::pipeline_shared_state in shared memory.
I tried to allocate the cuda::pipeline_shared_state in device memory instead, as shown in the libcu++ documentation:

// Allocate a 2 stage block scoped shared state in device memory.
  auto* pss1 = new cuda::pipeline_shared_state<cuda::thread_scope_block, 2>;

The code compiled fine, but running the application results in an unreasonably long run time, as if it were stuck in an infinite loop (no error is reported).
Can you help me understand how to use cuda::pipeline without consuming additional shared memory?

You need one pipeline_shared_state per thread scope, i.e. per thread block in your case. With the code shown, each thread constructs its own pipeline_shared_state, so the state is not actually shared between the threads of a block.

You can simply allocate the memory with cudaMalloc and construct the state with placement new, using one thread per thread block. Something like the following pseudocode:

using state = cuda::pipeline_shared_state<cuda::thread_scope_block, 2>;

__global__ void kernel(state* sharedStatePerBlock){
    // One thread per block constructs that block's state in device memory.
    if(threadIdx.x == 0){
        new (sharedStatePerBlock + blockIdx.x) state();
    }
    __syncthreads(); // all threads wait until the state is constructed
    // ... create the pipeline and use it ...
}
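To make the whole flow concrete, here is a minimal sketch of the host-side allocation plus the in-kernel pipeline creation. The kernel name, block/grid sizes, and the use of cooperative groups are illustrative assumptions; the libcu++ calls (cuda::make_pipeline taking a group and a pointer to the shared state) follow the documented API.

```
#include <cuda/pipeline>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

using state = cuda::pipeline_shared_state<cuda::thread_scope_block, 2>;

__global__ void kernel(state* states){
    auto block = cg::this_thread_block();
    if (block.thread_rank() == 0) {
        // Placement-new: construct this block's state in device memory.
        new (states + blockIdx.x) state();
    }
    block.sync();
    // Every thread of the block joins the same pipeline object.
    auto pipe = cuda::make_pipeline(block, states + blockIdx.x);
    // ... producer_acquire / consumer_wait stages using pipe ...
}

int main(){
    const int numBlocks = 128, threadsPerBlock = 256;
    state* states = nullptr;
    // One state per thread block, backed by global (device) memory.
    cudaMalloc(&states, numBlocks * sizeof(state));
    kernel<<<numBlocks, threadsPerBlock>>>(states);
    cudaDeviceSynchronize();
    cudaFree(states);
    return 0;
}
```

Note that a pipeline backed by device memory will be slower than one in shared memory, but it frees the shared memory for your algorithm's data.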

Thank you. It worked!