I am using cuda::pipeline in my kernel, and my algorithm requires all of the shared memory available on the SM.
So I don't have enough shared memory left to allocate the cuda::pipeline_shared_state there.
Instead, I tried to allocate the cuda::pipeline_shared_state in device memory (on the device heap, with new), as shown in the libcu++ documentation:
// Allocate a 2 stage block scoped shared state in device memory.
auto* pss1 = new cuda::pipeline_shared_state<cuda::thread_scope_block, 2>;
Compilation works fine, but the application then runs for an unreasonably long time, as if it were stuck in an infinite loop, and no error is reported.
Can you help me understand how to use cuda::pipeline without consuming any additional shared memory?
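One direction that may be worth checking, given here only as a sketch under assumptions: if every thread in the block executes the `new` line, each thread gets its own separate state object, which could explain the hang when the pipeline is meant to be shared by the whole block. A thread-scoped pipeline, created with `cuda::make_pipeline()` and no arguments, needs no `cuda::pipeline_shared_state` at all. The kernel name `staged_copy`, the buffer names, and the sizes below are hypothetical and not taken from my real code.

```cuda
#include <cstdio>
#include <vector>
#include <cuda/pipeline>

// Hypothetical kernel: each thread stages one int through shared memory.
__global__ void staged_copy(const int* in, int* out, int n) {
  // Dynamic shared memory; none of it is spent on pipeline state.
  extern __shared__ int tile[];

  // Thread-scoped pipeline: no cuda::pipeline_shared_state object is required.
  cuda::pipeline<cuda::thread_scope_thread> pipe = cuda::make_pipeline();

  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    pipe.producer_acquire();
    cuda::memcpy_async(&tile[threadIdx.x], &in[i], sizeof(int), pipe);
    pipe.producer_commit();

    pipe.consumer_wait();          // waits only for this thread's own copy
    out[i] = tile[threadIdx.x];
    pipe.consumer_release();
  }
}

int main() {
  const int n = 1024, threads = 256;   // assumed sizes, for illustration only
  std::vector<int> h_in(n), h_out(n, -1);
  for (int i = 0; i < n; ++i) h_in[i] = i;

  int *d_in = nullptr, *d_out = nullptr;
  cudaMalloc(&d_in, n * sizeof(int));
  cudaMalloc(&d_out, n * sizeof(int));
  cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

  // Third launch parameter: bytes of dynamic shared memory per block.
  staged_copy<<<n / threads, threads, threads * sizeof(int)>>>(d_in, d_out, n);
  cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

  int mismatches = 0;
  for (int i = 0; i < n; ++i) mismatches += (h_out[i] != h_in[i]);
  std::printf("mismatches: %d\n", mismatches);

  cudaFree(d_in);
  cudaFree(d_out);
  return mismatches != 0;
}
```

In this sketch each thread reads back only the element it copied itself. If threads need to read data staged by other threads, a `__syncthreads()` after `consumer_wait()` would presumably still be required, since a thread-scoped pipeline only tracks the calling thread's own copies.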