Thrust::copy_if fails when run with -rdc=true

BAdhi · May 30, 2022, 9:24am

I’m trying to use thrust::copy_if function, but getting a runtime error. A reproducible example is given below

#include <iostream>
#include <thrust/copy.h>
#include <thrust/execution_policy.h>

#define CHECK_CUDA(call)                                                       \
  {                                                                            \
    cudaError_t err = call;                                                    \
    if (cudaSuccess != err) {                                                  \
      fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n", __FILE__,  \
              __LINE__, cudaGetErrorString(err));                              \
      exit(EXIT_FAILURE);                                                      \
    }                                                                          \
  }

struct IsTrue {
  __host__ __device__ bool operator()(uint8_t v) { return v == 1; }
};

template <uint32_t block_size>
__global__ void test(uint8_t *oldData, uint8_t *data, uint32_t size,
                     uint16_t *output) {
  __shared__ uint8_t d[block_size];
  __shared__ uint8_t isChanged[block_size];
  __shared__ uint16_t tPos[block_size];
  __shared__ uint16_t cd[block_size];

  tPos[threadIdx.x] = threadIdx.x;
  d[threadIdx.x] = data[threadIdx.x];
  isChanged[threadIdx.x] = data[threadIdx.x] != oldData[threadIdx.x];

  __syncthreads();

  uint16_t *pos = thrust::copy_if(thrust::device, tPos, tPos + size, isChanged,
                                  cd, IsTrue());

  uint32_t compactSize = pos - cd;
  if (threadIdx.x < compactSize) {
    output[threadIdx.x] = cd[threadIdx.x];
  }
}

int main() {
  uint8_t *oldData, *data;
  uint16_t *output;
  CHECK_CUDA(cudaMallocManaged(&oldData, 256));
  CHECK_CUDA(cudaMallocManaged(&data, 256));
  CHECK_CUDA(cudaMallocManaged(&output, sizeof(uint16_t) * 256));
  for (int i = 0; i < 256; ++i) {
    oldData[i] = rand() % 3;
    data[i] = rand() % 3;
  }

  test<256><<<1, 256>>>(oldData, data, 128, output);
  CHECK_CUDA(cudaDeviceSynchronize());

  for (int i = 0; i < 256; ++i) {
    if (output[i])
      std::cout << i << "," << output[i] << std::endl;
  }
}

When built with cuda-11.5, using the following command,

nvcc -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -rdc=true ./thrust_single_copy_if.cu

It throws out the following error

Cuda error in file './thrust_single_copy_if.cu' in line 55 : operation not supported on global/shared address space.

I’ve tried on both v100 and a100 GPUs and was able to recreate the issue

But if I remove rdc=true flag, It works without any issue. Removing this flag is not a feasible solution since in the actual use case, I’m compiling multiple cu source files.

Is there a limitation in using the thrust library in multiple compile units? If so is there an alternative solution that can be used to filter values from an array using a stencil similar to thrust::copy_if

Robert_Crovella · May 30, 2022, 6:45pm

The problem is that using thrust in device code with the thrust::device execution policy and specifying -rdc=true is a trigger for thrust to perform the operation using CUDA dynamic parallelism. In the CDP case you get that error. A shared memory address in the parent kernel (the one you launched) has no meaning in the child kernel (launched via thrust at the point of the copy_if call.

You can work around this by using thrust::seq instead of thrust::device execution policy. That should restore the behavior you witness in the non-rdc case, in which case thrust::device implies thrust::seq.

I would also point out that whether using thrust::seq or thrust::device, your copy_if call is being called by every thread. In that context, your code looks strange to me. If you are imagining that thrust somehow calls this once per block, and gathers up the “available” threads into some sort of collective, that is not what is happening.

BAdhi · May 31, 2022, 3:04am

Thanks for the work around and pointing out the behavior of copy_if.
Yes, my assumption was that the copy_if with thrust::device policy would make all the threads on the block would work together to do the compaction of the array I’ve provided using the stencil and the predicate.

In that case, is there a library or a way to do a copy_if using the threads within the block?

Robert_Crovella · May 31, 2022, 5:34am

The basic method to do copy_if that I am aware of is to first mark each item to be copied, using a flag array. You then perform a prefix sum on the flag array. You then use the prefix sum as the index for an indexed copy. Breaking it down this way, each of the 3 steps can be solved at the block level.

If you wish to use a library, I’m not aware of one that has a select or stream compaction operation at the block level, but cub can certainly help with the above 3 operations I mentioned. If it were me, the only one I would bother to use a library implementation for is the prefix sum (step 2) (for example, cub block scan). The other steps are fairly simple.

system · June 14, 2022, 5:34am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
thrust copy_if function is slow on gpu data CUDA Programming and Performance	2	1866	June 10, 2019
Is thrust::copy synchrous or asynchronus? GPU-Accelerated Libraries	2	3919	August 11, 2015
Thrust::copy_if on multiple range CUDA Programming and Performance cuda	3	1442	July 31, 2021
thrust::copy_if unconditionally dereferences whole input range CUDA Programming and Performance	2	491	October 19, 2019
How to use thrust::async::for_each with cuda streams? CUDA Programming and Performance cuda	13	3531	May 12, 2021
Checking for type of thrust::system_error CUDA Programming and Performance	6	25	November 22, 2024
Thrust error with thrust::copy during linking with CUDA 10.0 GPU-Accelerated Libraries	1	1365	October 26, 2018
an illegal memory access was encountered ? CUDA Programming and Performance	11	18200	October 25, 2016
Thrust and CUDA 2.3a problem CUDA Programming and Performance	6	16770	November 28, 2009
[Thrust] using thrust::unique causes LNK2001: unresolved external symbol __fatbinwrap ... _cuda_devi... GPU-Accelerated Libraries	11	3172	July 7, 2017

Thrust::copy_if fails when run with -rdc=true

Related topics