Thrust::copy_if fails when run with -rdc=true

I'm trying to use the thrust::copy_if function, but I'm getting a runtime error. A reproducible example is given below:

#include <iostream>
#include <thrust/copy.h>
#include <thrust/execution_policy.h>

#define CHECK_CUDA(call)                                                       \
  {                                                                            \
    cudaError_t err = call;                                                    \
    if (cudaSuccess != err) {                                                  \
      fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n", __FILE__,  \
              __LINE__, cudaGetErrorString(err));                              \
      exit(EXIT_FAILURE);                                                      \
    }                                                                          \
  }

struct IsTrue {
  __host__ __device__ bool operator()(uint8_t v) { return v == 1; }
};

template <uint32_t block_size>
__global__ void test(uint8_t *oldData, uint8_t *data, uint32_t size,
                     uint16_t *output) {
  __shared__ uint8_t d[block_size];
  __shared__ uint8_t isChanged[block_size];
  __shared__ uint16_t tPos[block_size];
  __shared__ uint16_t cd[block_size];

  tPos[threadIdx.x] = threadIdx.x;
  d[threadIdx.x] = data[threadIdx.x];
  isChanged[threadIdx.x] = data[threadIdx.x] != oldData[threadIdx.x];

  __syncthreads();

  uint16_t *pos = thrust::copy_if(thrust::device, tPos, tPos + size, isChanged,
                                  cd, IsTrue());

  uint32_t compactSize = pos - cd;
  if (threadIdx.x < compactSize) {
    output[threadIdx.x] = cd[threadIdx.x];
  }
}

int main() {
  uint8_t *oldData, *data;
  uint16_t *output;
  CHECK_CUDA(cudaMallocManaged(&oldData, 256));
  CHECK_CUDA(cudaMallocManaged(&data, 256));
  CHECK_CUDA(cudaMallocManaged(&output, sizeof(uint16_t) * 256));
  for (int i = 0; i < 256; ++i) {
    oldData[i] = rand() % 3;
    data[i] = rand() % 3;
  }

  test<256><<<1, 256>>>(oldData, data, 128, output);
  CHECK_CUDA(cudaDeviceSynchronize());

  for (int i = 0; i < 256; ++i) {
    if (output[i])
      std::cout << i << "," << output[i] << std::endl;
  }
}

When built with CUDA 11.5 using the following command,

nvcc -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -rdc=true ./thrust_single_copy_if.cu

it throws the following error:

Cuda error in file './thrust_single_copy_if.cu' in line 55 : operation not supported on global/shared address space.

I've tried this on both V100 and A100 GPUs and was able to reproduce the issue.

But if I remove the -rdc=true flag, it works without any issue. Removing this flag is not a feasible solution, since in the actual use case I'm compiling multiple .cu source files.

Is there a limitation to using the thrust library across multiple compilation units? If so, is there an alternative solution that can be used to filter values from an array using a stencil, similar to thrust::copy_if?

The problem is that using thrust in device code with the thrust::device execution policy while compiling with -rdc=true triggers thrust to perform the operation using CUDA Dynamic Parallelism (CDP). In the CDP case you get that error: a shared memory address in the parent kernel (the one you launched) has no meaning in the child kernel (launched by thrust at the point of the copy_if call).

You can work around this by using the thrust::seq execution policy instead of thrust::device. That should restore the behavior you witness in the non-rdc case, in which thrust::device in device code effectively implies thrust::seq.
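Concretely, only the execution policy in the kernel's copy_if call changes:

```cuda
// Inside the kernel: thrust::seq runs the algorithm sequentially in the
// calling thread, so no child kernel is launched and the shared-memory
// pointers (tPos, isChanged, cd) remain valid.
uint16_t *pos = thrust::copy_if(thrust::seq, tPos, tPos + size, isChanged,
                                cd, IsTrue());
```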

I would also point out that whether you use thrust::seq or thrust::device, your copy_if call is executed by every thread. In that context, your code looks strange to me. If you are imagining that thrust somehow calls this once per block, gathering the available threads into some sort of collective operation, that is not what is happening.

Thanks for the workaround and for pointing out the behavior of copy_if.
Yes, my assumption was that copy_if with the thrust::device policy would make all the threads in the block work together to compact the array I provided, using the stencil and the predicate.

In that case, is there a library or a way to do a copy_if using the threads within the block?

The basic method to implement copy_if that I am aware of is to first mark each item to be copied using a flag array, then perform a prefix sum on the flag array, and finally use the prefix-sum result as the destination index for an indexed copy. Broken down this way, each of the three steps can be solved at the block level.

If you wish to use a library, I'm not aware of one that provides a select or stream-compaction operation at the block level, but CUB can certainly help with the three operations I mentioned. If it were me, the only step I would bother to use a library implementation for is the prefix sum (step 2), for example cub::BlockScan. The other steps are fairly simple.
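A minimal sketch of how the three steps might look as a device function, using cub::BlockScan for step 2 (the buffer names follow the original kernel; this is an illustration under the assumption of one element per thread, not drop-in code):

```cuda
#include <cub/block/block_scan.cuh>

// Block-level stream compaction: writes the indices of flagged elements
// into cd and returns the compacted size. Call with a full block.
template <uint32_t BLOCK_SIZE>
__device__ uint16_t block_copy_if(const uint8_t *isChanged, uint16_t *cd) {
  using BlockScan = cub::BlockScan<uint16_t, BLOCK_SIZE>;
  __shared__ typename BlockScan::TempStorage temp;

  // Step 1: each thread flags its own element.
  uint16_t flag = isChanged[threadIdx.x] ? 1 : 0;

  // Step 2: exclusive prefix sum across the block gives each flagged
  // element its destination index; the aggregate is the compacted size.
  uint16_t dest, total;
  BlockScan(temp).ExclusiveSum(flag, dest, total);
  __syncthreads();

  // Step 3: indexed copy into the compacted output.
  if (flag) cd[dest] = threadIdx.x;
  __syncthreads();

  return total;
}
```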

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.