Using Thrust Functions Within Device Code (cudaLaunchCooperativeKernel)

I am currently trying to get the thrust::merge() function to work within device code. The kernel is launched with cudaLaunchCooperativeKernel (because I will need grid-wide synchronization later on; a rough sketch of that pattern follows this paragraph). From what I've read, kernels launched cooperatively are not allowed to use CUDA dynamic parallelism, and I was wondering whether that is why the thrust::merge() call below gives me the "operation not permitted" error. If I launch the kernel the traditional way instead, with the triple angle brackets, it works as intended, which leads me to believe that Thrust and cooperative kernels simply don't play well together.

If that is the reason, are there any plans for Thrust to support cooperative kernels in the future? And if not, what are some ways I could work around this issue while keeping similar performance? I looked at the Thrust GitHub repository but couldn't find the implementation of thrust::merge, so I would appreciate it if someone could point me in the right direction. If there is something obviously wrong in my code, I would appreciate advice on that as well.

I am using an RTX 3080 Ti (compute capability 8.6). Attached below are a sample program and the Makefile I'm using. Thank you for any advice you can give me!
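For context, the grid-wide synchronization I plan to use later looks roughly like the sketch below. The kernel name and argument are placeholders and this is separate from the reproducer that follows; it is only here to show why I need a cooperative launch in the first place.

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Placeholder kernel: two phases separated by a grid-wide barrier.
// grid.sync() is only valid when the kernel is launched with
// cudaLaunchCooperativeKernel, which is why I can't just fall back
// to a regular <<<...>>> launch.
__global__ void phased_kernel(int *data) {
    cg::grid_group grid = cg::this_grid();

    // phase 1: every thread in the entire grid works on `data`

    grid.sync();   // wait for all blocks in the grid

    // phase 2: safe to read values written by other blocks
}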

#include <stdio.h>
#include <assert.h>
#include <cooperative_groups.h>

#include <thrust/execution_policy.h>
#include <thrust/merge.h>

using namespace cooperative_groups;

#define gpuErrcheck(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess)
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}

__global__ void bulk_sssp_bucketHeap(int *A, int *B, int *C) {
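    // Note: every thread of the <<<1, 10>>> launch executes the writes and the merge below.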
    A[0] = 1, A[1] = 3, B[0] = 2, B[1] = 4;
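    // This is the call that returns "operation not permitted" when the kernel
    // is launched with cudaLaunchCooperativeKernel.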
    thrust::merge(thrust::device, A, A + 2, B, B + 2, C);

    if (threadIdx.x + (blockIdx.x * blockDim.x) == 0) {
        for (int i = 0; i < 4; i++) {
            printf("%d ", C[i]);
        }
        printf("\n");
    }
}

int main(){
    int *d_A, *d_B, *d_C;
    gpuErrcheck(cudaMalloc(&d_A, sizeof(int) * 2));
    gpuErrcheck(cudaMalloc(&d_B, sizeof(int) * 2));
    gpuErrcheck(cudaMalloc(&d_C, sizeof(int) * 4));
    //bulk_sssp_bucketHeap<<<1, 10>>>(d_A, d_B, d_C);   // this traditional launch works as intended
    void *arguments[] = {(void *)&d_A, (void *)&d_B, (void *)&d_C};
    gpuErrcheck(cudaLaunchCooperativeKernel((void*) bulk_sssp_bucketHeap, 1, 10, arguments));
    gpuErrcheck(cudaPeekAtLastError());
    gpuErrcheck(cudaDeviceSynchronize());
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return 0;
}
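One workaround I'm considering is switching the execution policy to thrust::seq, which as far as I can tell runs the algorithm sequentially inside the calling thread and therefore shouldn't need dynamic parallelism at all. A sketch of that variant of the kernel is below, but since it is sequential per thread it isn't the kind of performance I'm after.

#include <stdio.h>
#include <thrust/execution_policy.h>
#include <thrust/merge.h>

// Variant of bulk_sssp_bucketHeap that uses thrust::seq instead of
// thrust::device. thrust::seq performs the merge sequentially in the
// calling thread, so no child kernel (dynamic parallelism) is involved.
__global__ void bulk_sssp_bucketHeap_seq(int *A, int *B, int *C) {
    if (threadIdx.x + (blockIdx.x * blockDim.x) == 0) {
        A[0] = 1, A[1] = 3, B[0] = 2, B[1] = 4;
        thrust::merge(thrust::seq, A, A + 2, B, B + 2, C);

        for (int i = 0; i < 4; i++) {
            printf("%d ", C[i]);
        }
        printf("\n");
    }
}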

Makefile

all: execute

execute: test.o
    nvcc -std=c++14 -O3  -lineinfo -use_fast_math --expt-extended-lambda -Xptxas -dlcm=cg -lcudart -D_FORCE_INLINES -gencode arch=compute_86,code=sm_86 test.o -lcudadevrt -o execute
test.o: test.cu
    nvcc -std=c++14 -O3  -lineinfo -use_fast_math --expt-extended-lambda -Xptxas -dlcm=cg -lcudart -D_FORCE_INLINES -gencode arch=compute_86,code=sm_86 -dc test.cu -o test.o
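For completeness, here is a small standalone check (the device index is arbitrary) that should confirm whether the device reports support for cooperative launches before cudaLaunchCooperativeKernel is called:

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    int device = 0;
    int supportsCoopLaunch = 0;
    // cudaDevAttrCooperativeLaunch is 1 if the device can run kernels
    // launched via cudaLaunchCooperativeKernel.
    cudaDeviceGetAttribute(&supportsCoopLaunch, cudaDevAttrCooperativeLaunch, device);
    printf("cudaDevAttrCooperativeLaunch = %d\n", supportsCoopLaunch);
    return supportsCoopLaunch ? 0 : 1;
}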